diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000000000000000000000000000000000000..10be5421fda98877b22feb1e7332979fb4057d9e --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,157 @@ +**简体中文**🀄 | [English🌎](.github/CONTRIBUTING_en.md) + +# Contributing to PaddleNLP + +我们非常欢迎并希望您对`PaddleNLP`做出开源贡献。在您开始提交您的贡献之前,请先行签署[PaddlePaddle 贡献者许可协议](https://cla-assistant.io/PaddlePaddle/PaddleNLP)。 +本文接下来将介绍我们的开发与贡献流程: + +## 贡献方式 + +我们欢迎不同的向`PaddleNLP`做出贡献的方式,例如: + +- 修复已知的Issue +- 提交新的Issue,例如提出功能需求或者bug报告 +- 实现新的模型结构 + +如果您不知道从哪里开始,请查看Issues板块中的`Good First Issue`标签。它为您提供一个对初学者友好的已知Issue列表,可以降低贡献的门槛,帮助您开始为开源做出贡献。您只需在您想处理的Issue中告知我们您想负责此Issue即可。 + +## 开发流程 + +PaddleNLP 使用 [Git 分支模型](http://nvie.com/posts/a-successful-git-branching-model/)。对于常见的开源贡献,我们有以下的贡献流程: + +#### 1. Fork + + 因为PaddleNLP的开发社区一直在发展,如果每位贡献者都直接向官方Repo提交commit将会难以管理。因此,请从您的分支中提交 Pull Requests。建议您通过GitHub的[“Fork”按钮](https://help.github.com/articles/fork-a-repo/)来创建您的Fork分支。 + +#### 2. Clone + + 请运行一下命令将您的分支clone到本地 + + ```bash + git clone https://github.com//PaddleNLP + cd PaddleNLP + ``` + +#### 3. 创建本地开发分支 + + 对于添加新功能或修复错误等日常工作,请在开发前创建您的本地开发分支: + + ```bash + git checkout -b my-cool-feature + ``` + +#### 4. 配置开发环境 + 在开始编码之前,您需要设置开发环境。我们强烈建议您在虚拟环境中进行所有开发,例如[venv](https://docs.python.org/3/library/venv.html)或[conda](https://docs.conda.io/en/latest/)。 + 请您设置并激活虚拟环境后,运行以下命令: + + ```bash + make install + ``` + + 这将设置 `PaddleNLP` 的所有依赖以及 [`pre-commit`](http://pre-commit.com/) 工具。 + + 如果您需要开发 `examples` 或 `applications` 模块并加载 `PaddleNLP`,请确保以可编辑模式(`-e`)安装 `PaddleNLP`。 + 如果在虚拟环境中已经安装 `PaddleNLP` ,请使用 `pip uninstall paddlenlp` 将其删除,然后以可编辑模式重新安装它 + `pip install -e .` + + +#### 5. 开发 + + 当您开发时,请确保您新增的代码会被单元测试所覆盖。我们所有的单元测试都可以在 `tests` 目录下找到。 + 您可以修改现有单元测试以覆盖新功能,也可以从头开始创建新测试。 + 当您完成代码时,您应该确保相关的单元测试可以通过。您可以像这样运行受更改影响的测试: + + ```bash + pytest tests/.py + ``` + +#### 6. 
Commit + + 我们使用 [`pre-commit`](http://pre-commit.com/)工具(包括[black](https://black.readthedocs.io/en/stable/)、[isort](https:/ /pycqa.github.io/isort/) 和 + [flake8](https://flake8.pycqa.org/en/latest/))来检查每次提交中的代码和文档的风格。当你运行 `git commit` 时,你会看到 + 类似于以下内容: + + ``` + ➜ (my-virtual-env) git commit -m "commiting my cool feature" + black....................................................................Passed + isort....................................................................Passed + flake8...................................................................Passed + check for merge conflicts................................................Passed + check for broken symlinks............................(no files to check)Skipped + detect private key.......................................................Passed + fix end of files.....................................(no files to check)Skipped + trim trailing whitespace.............................(no files to check)Skipped + CRLF end-lines checker...............................(no files to check)Skipped + CRLF end-lines remover...............................(no files to check)Skipped + No-tabs checker......................................(no files to check)Skipped + Tabs remover.........................................(no files to check)Skipped + copyright_checker........................................................Passed + ``` + + 但大多数时候事情并没有那么顺利。当您的代码或文档不符合标准时,`pre-commit` 检查将失败。 + ``` + ➜ (my-virtual-env) git commit -m "commiting my cool feature" + black....................................................................Passed + isort....................................................................Failed + - hook id: isort + - files were modified by this hook + + Fixing examples/information_extraction/waybill_ie/run_ernie_crf.py + + flake8...................................................................Passed + check for merge conflicts................................................Passed + check for broken symlinks............................(no files to check)Skipped + detect private key.......................................................Passed + fix end of files.....................................(no files to check)Skipped + trim trailing whitespace.............................(no files to check)Skipped + CRLF end-lines checker...............................(no files to check)Skipped + CRLF end-lines remover...............................(no files to check)Skipped + No-tabs checker......................................(no files to check)Skipped + Tabs remover.........................................(no files to check)Skipped + copyright_checker........................................................Passed + ``` + + 我们的工具将自动修复大部分样式错误,但是有些错误需要手动解决。幸运的是,错误信息一般通俗易懂,很容易修复。 + 解决错误后,您可以再次运行 git add 和 git commit ,这将再次触发 pre-commit 。 + 一旦 pre-commit 检查通过,您就可以推送代码了。 + + [Google][http://google.com/] 或 [StackOverflow](https://stackoverflow.com/) 是帮助您了解代码风格错误的好工具。 + 如果您仍然无法弄清楚,请不要担心。您可以使用 `git commit -m "style error" --no-verify` 提交,我们很乐意在您创建 Pull Request 后帮助您。 + +#### 7. git pull与代码冲突 + + 有经验的 Git 用户经常从官方Repo中git pull。因为这样子他们会及早注意到与其他人的代码冲突,并且让代码冲突更容易解决 + + ```bash + git remote add upstream https://github.com/PaddlePaddle/PaddleNLP + git pull upstream develop + ``` + +#### 8. 
git push与提交Pull Request + + 您可以将您的本地开发分支中的工作 push 到您的fork的分支中: + + ```bash + git push origin my-cool-stuff + ``` + + git push之后,您可以提交Pull Request,请求[官方repo](https://github.com/PaddlePaddle/PaddleNLP) 采纳您的开发工作。请您依照[这些步骤](https://help.github.com/articles/creating-a-pull-request/)创建Pull Request。 + +#### 9. 删除已经合入的本地和远程分支 + + 为了保持您本地的工作区和fork分支的干净整洁,建议您在Pull Request合入之后删除本地的残余分支: + + ```bash + git push origin my-cool-stuff + git checkout develop + git pull upstream develop + git branch -d my-cool-stuff + ``` + +## 代码Review + +- 在您的Pull Request能够顺利通过本地测试以及CI的情况下,您可以在Pull Request中 @ 相关的Reviewer,提醒他们尽快对您的Pull Request进行Review。 + +- 请处理Reviewer的每一条评论。如果您已按照评论修改,请回复“完成”;否则,可以在评论下展开讨论。 + +- 如果您不希望您的Reviewer被电子邮件通知淹没,您可以[批量回复](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/)。 diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..744bf2ba7ca0e0fc6d2e30faa4e9bafd7b949e63 --- /dev/null +++ b/LICENSE @@ -0,0 +1,203 @@ +Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/Makefile b/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..5b58a1e38ec783775a113a1a5f2e837634c8da16 --- /dev/null +++ b/Makefile @@ -0,0 +1,77 @@ +# Makefile for PaddleNLP +# +# GitHb: https://github.com/PaddlePaddle/PaddleNLP +# Author: Paddle Team https://github.com/PaddlePaddle +# + +.PHONY: all +all : lint test +check_dirs := applications examples model_zoo paddlenlp pipelines ppdiffusers scripts tests +# # # # # # # # # # # # # # # Format Block # # # # # # # # # # # # # # # + +format: + pre-commit run isort + pre-commit run black + +# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # + +# # # # # # # # # # # # # # # Lint Block # # # # # # # # # # # # # # # + +.PHONY: lint +lint: + $(eval modified_py_files := $(shell python scripts/get_modified_files.py $(check_dirs))) + @if test -n "$(modified_py_files)"; then \ + echo ${modified_py_files}; \ + pre-commit run --files ${modified_py_files}; \ + else \ + echo "No library .py files were modified"; \ + fi + +# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # + +# # # # # # # # # # # # # # # Test Block # # # # # # # # # # # # # # # + +.PHONY: test +test: unit-test + +unit-test: + PYTHONPATH=$(shell pwd) pytest -v \ + -n auto \ + --durations 20 \ + --cov paddlenlp \ + --cov-report xml:coverage.xml + +# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # + +.PHONY: install +install: + pip install -r requirements-dev.txt + pip install -r requirements.txt + pip install -r paddlenlp/experimental/autonlp/requirements.txt + pre-commit install + + +.PHONY: deploy-ppdiffusers +deploy-ppdiffusers: + cd ppdiffusers && make install && make + +.PHONY: deploy-paddle-pipelines +deploy-paddle-pipelines: + cd pipelines && make install && make + +.PHONY: deploy-paddlenlp +deploy-paddlenlp: + # install related package + make install + # build + python3 setup.py sdist bdist_wheel + # upload + twine upload --skip-existing dist/* + +.PHONY: regression-all +release: + bash ./scripts/regression/run_release.sh 0 0,1 all + +.PHONY: regression-key +key: + bash ./scripts/regression/run_release.sh 0 0,1 p0 diff --git a/README.md b/README.md index 84ab048bd88560a9cd9ff89dada6d1742654612a..d714a3de5daf34e45f27874af4ee326a2e366069 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,225 @@ -# LLAMA_paddle +# LLAMA -llama-13b pretrain example for paddle \ No newline at end of file +## 论文 + +`LLaMA: Open and Efficient Foundation Language Models` + +- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971) + +## 模型结构 + +LLaMA,这是一个基础语言模型的集合,参数范围从7B到65B。在数万亿的tokens上训练出的模型,并表明可以专门使用公开可用的数据集来训练最先进的模型,而不依赖于专有的和不可访问的数据集。特别是,llama 13B在大多数基准测试中优于GPT-3 (175B), LLaMA 65B与最好的模型Chinchilla-70B和PaLM-540B具有竞争力。LLAMA网络基于 Transformer 架构。提出了各种改进,并用于不同的模型,例如 PaLM。 + +llama模型结构.png + +以下是llama-13B的主要网络参数配置: + +``` +{ + "architectures": [ + "LlamaForCausalLM" + ], + "bos_token_id": 1, + "eos_token_id": 2, + "hidden_size": 5120, + "initializer_range": 0.02, + "intermediate_size": 13824, + "max_position_embeddings": 2048, + "model_type": "llama", + "num_attention_heads": 40, + "num_hidden_layers": 40, + "pad_token_id": 0, + "paddlenlp_version": null, + "rms_norm_eps": 1e-06, + "use_recompute": false, + "vocab_size": 32000 +} +``` + +## 算法原理 + +llama算法原理.png + +以下是与原始 Transformer 架构的主要区别: + +**预归一化**。为了提高训练稳定性,对每个transformer 子层的输入进行归一化,而不是对输出进行归一化。使用 RMSNorm 归一化函数。 + +**SwiGLU 激活函数 [PaLM]**。使用 SwiGLU 激活函数替换 ReLU 非线性以提高性能。使用 2 /3 4d 的维度而不是 PaLM 中的 4d。 + +**旋转嵌入**。移除了绝对位置嵌入,而是添加了旋转位置嵌入 
(RoPE),在网络的每一层。 + +## 数据集 + +数据详细制作流程可参考[此处](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md),例:OpenWebText2预训练数据制作参考[此处](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: + + cd ./llm/llama/ + mkkdir data && cd data + wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy + wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz + cd .. && tree data + data + ├── llama_openwebtext_100k_ids.npy + └── llama_openwebtext_100k_idx.npz + +## 环境配置 + +### Docker + +推荐使用docker方式运行,提供拉取的docker镜像,关于本项目所需新版本 DTK 等均可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装,docker中默认使用dtk-23.04.1: + +``` +docker pull registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73 + +docker run -it --network=host --name=paddle_llama --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v `pwd`:/home registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73 /bin/bash + +# 替换DTK-23.10 + +pip install paddlenlp==2.6.1 -i http://mirrors.aliyun.com/pypi/simple/ +wget http://10.6.10.68:8000/customized/paddle/llama/paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl +pip3 install paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl +pip3 install tool_helpers visualdl==2.5.3 -i http://mirrors.aliyun.com/pypi/simple/ +``` + +## 训练 + +权重链接 + +13B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b) + +7B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b) + +该训练脚本需要1节点,每节点8张DCU-Z100L-32G。 + +并行配置采用TP 8,PP 1,使用fp16精度微调,配置如下: + +``` +--max_seq_length 2048 \ +--per_device_train_batch_size 1 \ +--gradient_accumulation_steps 2 \ +--per_device_eval_batch_size 2 \ +--use_flash_attention 0 \ +--use_fused_rms_norm 0 \ +--fp16 \ +--fp16_opt_level "O2" \ +--scale_loss 512 \ +--tensor_parallel_degree 8 \ +--learning_rate 0.00001 \ +--min_learning_rate 0.000001 \ +--max_steps 10000 \ +--save_steps 5000 \ +--weight_decay 0.01 \ +--warmup_ratio 0.01 \ +--max_grad_norm 1.0 \ +--logging_steps 10 \ +--dataloader_num_workers 1 \ +--eval_steps 1000 \ +--report_to "visualdl" \ +--sharding "stage1" \ +--disable_tqdm true \ +--continue_training 1 \ +--recompute 1 \ +--recompute_granularity full \ +--do_train \ +--do_eval \ +--device "gpu" \ +--distributed_dataloader 1 +``` + +微调命令: + +``` +cd ./llm/llama/ +bash run_trainer_tp8.sh +``` + +## result + +### 精度 + +训练数据:[https://bj.bcebos.com/paddlenlp/models/transformers/llama/data](https://bj.bcebos.com/paddlenlp/models/transformers/llama/data) + +使用的GPGPU:8张DCU-Z100L-32G。 + +模型精度(max_sequence_length: 2048): +| 卡数 | 分布式工具 | 收敛性 | +| :------: | :------: |:------: | +| 8 | Paddle | | +### input + +```plaintext +>>>冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 +``` + +### output + +```plaintext +>>>回答:避寒,当然是去海南呀!海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾,没有雾霾! 
+``` + +## benchmark + +### 训练benchmark + +数据集使用[tatsu-lab/alpaca · Datasets at Hugging Face](https://huggingface.co/datasets/tatsu-lab/alpaca),将数据集放置在./examples/benchmark/peft/paddle下: + +``` +$tree tatsu-lab +tatsu-lab/ +└── alpaca + └── data + └── train-00000-of-00001-a09b74b3ef9c3b56.parquet +``` + +训练benchmark测试命令: + +``` +cd ./examples/benchmark/peft/paddle + +RCCL_NCHANNELS=8 HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" benchmark.py --model_name_or_path facebook/llama-13b --english --train_data_size 1000 --intokens --intokens_length 1024 --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy no --fp16 --fp16_opt_level O2 --recompute --tensor_parallel_degree 8 --logging_steps 50 --output_dir outputs +``` + +### 推理benchmark + +``` +cd ./examples/benchmark/peft/paddle +python3 inference_benchmark.py --model_name_or_path facebook/llama-13b --dtype float16 --do_forward --do_generate +``` + +### LAMBADA推理评估 + +``` +cd ./examples/benchmark/lambada +wget https://paddlenlp.bj.bcebos.com/data/benchmark/lambada_test.jsonl +``` + +验证LAMBADA数据集,运行以下脚本: + +``` +python3 eval.py \ +--model_name_or_path facebook/llama-13b \ +--batch_size 4 \ +--eval_path lambada_test.jsonl \ +--tensor_parallel_degree 1 \ +--cloze_eval +``` + +## 应用场景 + +### 算法类别 + +`自然语言处理` + +### 热点应用行业 + +`医疗,教育,科研,金融` + +## 源码仓库及问题反馈 + +- [https://developer.hpccube.com/codes/modelzoo/llama_paddle](https://developer.hpccube.com/codes/modelzoo/llama_paddle) + +## 参考 + +* https://huggingface.co/decapoda-research/llama-13b-hf +* [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) \ No newline at end of file diff --git a/README_en.md b/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..35b5a2dfb94ce42ce31f2e05cdda25c7ec68c14a --- /dev/null +++ b/README_en.md @@ -0,0 +1,356 @@ + +[简体中文🀄](./README.md) | **English🌎** + +

+ +------------------------------------------------------------------------------------------ + +

+ + + + + + + + + +

+ +

Features | Installation | Quick Start | API Reference | Community + +**PaddleNLP** is a NLP library that is both **easy to use** and **powerful**. It aggregates high-quality pretrained models in the industry and provides a **plug-and-play** development experience, covering a model library for various NLP scenarios. With practical examples from industry practices, PaddleNLP can meet the needs of developers who require **flexible customization**. + +## News 📢 + +* **2023.6.12: [Release of PaddleNLP v2.6rc](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.6.0rc)** + * 🔨 LLM Tools:Introduces comprehensive examples of open-source LLM training and inference, including [Bloom](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bloom), [ChatGLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/chatglm), [GLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/glm), [Llama](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/llama) and [OPT](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/opt). Added Tensor Parallel capability to [Trainer API](./docs/trainer.md) for distributed LLM trainin. Also released [Parameter-Efficient Finetuning](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/peft),which enables training LLMs on consumer hardware. + +* **2023.1.12: [Release of PaddleNLP v2.5]()** + + * 🔨 NLP Tools: [PPDiffusers](./ppdiffusers), our cross-modal diffusion model toolbox based on PaddlePaddle, has been released! It provides a complete training process for diffusion models, and supports FastDeploy inference acceleration and multi-hardware deployment (supports Ascend chips and Kunlun core deployment). + * 💎 Industrial Applications: Information extraction, text classification, sentiment analysis, and intelligent question answering have all been newly upgraded. New releases include document information extraction [UIE-X](./applications/information_extraction/document), unified text classification [UTC](./applications/zero_shot_text_classification), unified sentiment analysis [UIE-Senta](./applications/sentiment_analysis/unified_sentiment_extraction) , and [unsupervised QA application](./applications/question_answering/unsupervised_qa). At the same time, the [ERNIE 3.0 Tiny v2](./model_zoo/ernie-tiny) series of pretrained small models have been released, which are more effective with low-resource and foreign data. They provide open-source end-to-end deployment solutions such as model pruning, model quantization, FastDeploy inference acceleration, and edge-side deployment to reduce the difficulty of pretrained model deployment. + * 💪 Framework Upgrade: Pretrained model [parameter configuration unification](./paddlenlp/transformers/configuration_utils.py), saving and loading custom parameter configurations no longer requires additional development; [Trainer API](./docs/trainer.md) has added BF16 training, recompute recalculations, sharding, and other distributed capabilities. Large-scale pre-training model training can easily be accomplished through simple configuration. [Model Compression API](./docs/compression.md) supports quantization training, vocabulary compression, and other functions. The compressed model has smaller accuracy loss, and the memory consumption of model deployment is greatly reduced. 
[Data Augmentation API](./docs/dataaug.md) has been comprehensively upgraded to support three granularities of data augmentation strategy: character, word, and sentence, making it easy to customize data augmentation strategies. + * 🤝 Community: 🤗Huggingface hub officially supports PaddleNLP pretrained models, supporting PaddleNLP Model and Tokenizer downloads and uploads directly from the 🤗Huggingface hub. Everyone is welcome to try out PaddleNLP pretrained models on the 🤗Huggingface hub [here](https://huggingface.co/PaddlePaddle). + +* **September 6, 2022: [Release of PaddleNLP v2.4]()** + + * 🔨 NLP Tools: [NLP Pipeline System Pipelines](./pipelines) has been released, supporting the rapid construction of search engines and question-answering systems, and can be extended to support various NLP systems, making it easy, flexible, and efficient to solve NLP tasks like building blocks! + * 💎 Industrial Applications: A new [text classification full-process application solution](./applications/text_classification) has been added, covering various scenarios such as multi-classification, multi-label, and hierarchical classification, supporting small-sample learning and TrustAI trustworthy computing model training and tuning. + * 🍭 AIGC: The SOTA model [CodeGen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/code_generation/codegen) for code generation in various programming languages has been added. + * 💪 Framework Upgrade: [Automatic Model Compression API](./docs/compression.md) has been released, which automatically cuts and quantizes models, greatly reducing the threshold for using model compression technology. [Few-shot Prompt](./applications/text_classification/multi_class/few-shot) capability has been released, integrating classic algorithms such as PET, P-Tuning, and RGL. + + + + + + +## Features + +#### 📦 Out-of-Box NLP Toolset + +#### 🤗 Awesome Chinese Model Zoo + +#### 🎛️ Industrial End-to-end System + +#### 🚀 High Performance Distributed Training and Inference + + +### Out-of-Box NLP Toolset + +Taskflow aims to provide off-the-shelf NLP pre-built task covering NLU and NLG technique, in the meanwhile with extremely fast inference satisfying industrial scenario. + +![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif) + +For more usage please refer to [Taskflow Docs](./docs/model_zoo/taskflow.md). + +### Awesome Chinese Model Zoo + +#### 🀄 Comprehensive Chinese Transformer Models + +We provide **45+** network architectures and over **500+** pretrained models. Not only includes all the SOTA model like ERNIE, PLATO and SKEP released by Baidu, but also integrates most of the high-quality Chinese pretrained model developed by other organizations. Use `AutoModel` API to **⚡SUPER FAST⚡** download pretrained models of different architecture. We welcome all developers to contribute your Transformer models to PaddleNLP! + +```python +from paddlenlp.transformers import * + +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +bert = AutoModel.from_pretrained('bert-wwm-chinese') +albert = AutoModel.from_pretrained('albert-chinese-tiny') +roberta = AutoModel.from_pretrained('roberta-wwm-ext') +electra = AutoModel.from_pretrained('chinese-electra-small') +gpt = AutoModelForPretraining.from_pretrained('gpt-cpm-large-cn') +``` + +Due to the computation limitation, you can use the ERNIE-Tiny light models to accelerate the deployment of pretrained models. 
+```python +# 6L768H +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +# 6L384H +ernie = AutoModel.from_pretrained('ernie-3.0-mini-zh') +# 4L384H +ernie = AutoModel.from_pretrained('ernie-3.0-micro-zh') +# 4L312H +ernie = AutoModel.from_pretrained('ernie-3.0-nano-zh') +``` +Unified API experience for NLP task like semantic representation, text classification, sentence matching, sequence labeling, question answering, etc. + +```python +import paddle +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +text = tokenizer('natural language processing') + +# Semantic Representation +model = AutoModel.from_pretrained('ernie-3.0-medium-zh') +sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']])) +# Text Classificaiton and Matching +model = AutoModelForSequenceClassification.from_pretrained('ernie-3.0-medium-zh') +# Sequence Labeling +model = AutoModelForTokenClassification.from_pretrained('ernie-3.0-medium-zh') +# Question Answering +model = AutoModelForQuestionAnswering.from_pretrained('ernie-3.0-medium-zh') +``` + +#### Wide-range NLP Task Support + +PaddleNLP provides rich examples covering mainstream NLP task to help developers accelerate problem solving. You can find our powerful transformer [Model Zoo](./model_zoo), and wide-range NLP application [examples](./examples) with detailed instructions. + +Also you can run our interactive [Notebook tutorial](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995) on AI Studio, a powerful platform with **FREE** computing resource. + +
PaddleNLP Transformer model summary (click to show details)
+ +| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice | +| :----------------- | ----------------------- | -------------------- | ------------------ | --------------- | --------------- | +| ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BART | ✅ | ✅ | ✅ | ✅ | ❌ | +| BERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ | +| BlenderBot | ❌ | ❌ | ❌ | ✅ | ❌ | +| ChineseBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ConvBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| CTRL | ✅ | ❌ | ❌ | ❌ | ❌ | +| DistilBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ELECTRA | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE-CTM | ❌ | ✅ | ❌ | ❌ | ❌ | +| ERNIE-Doc | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ | +| ERNIE-Gram | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-M | ✅ | ✅ | ✅ | ❌ | ❌ | +| FNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| Funnel-Transformer | ✅ | ✅ | ✅ | ❌ | ❌ | +| GPT | ✅ | ✅ | ❌ | ✅ | ❌ | +| LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ | +| LayoutLMv2 | ❌ | ✅ | ❌ | ❌ | ❌ | +| LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ | +| LUKE | ❌ | ✅ | ✅ | ❌ | ❌ | +| mBART | ✅ | ❌ | ✅ | ❌ | ✅ | +| MegatronBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| MobileBERT | ✅ | ❌ | ✅ | ❌ | ❌ | +| MPNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| NEZHA | ✅ | ✅ | ✅ | ❌ | ✅ | +| PP-MiniLM | ✅ | ❌ | ❌ | ❌ | ❌ | +| ProphetNet | ❌ | ❌ | ❌ | ✅ | ❌ | +| Reformer | ✅ | ❌ | ✅ | ❌ | ❌ | +| RemBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoBERTa | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ | +| SKEP | ✅ | ✅ | ❌ | ❌ | ❌ | +| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| T5 | ❌ | ❌ | ❌ | ✅ | ❌ | +| TinyBERT | ✅ | ❌ | ❌ | ❌ | ❌ | +| UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ | +| XLNet | ✅ | ✅ | ✅ | ❌ | ✅ | + +
+ +For more pretrained model usage, please refer to [Transformer API Docs](./docs/model_zoo/index.rst). + +### Industrial End-to-end System + +We provide high-value industrial scenarios including information extraction, semantic retrieval, and question answering. + +For more details on industrial cases, please refer to [Applications](./applications). + + +#### 🔍 Neural Search System + +
+ +
+ + +For more details please refer to [Neural Search](./applications/neural_search). + +#### ❓ Question Answering System + +We provide question answering pipeline which can support FAQ system, Document-level Visual Question answering system based on [🚀RocketQA](https://github.com/PaddlePaddle/RocketQA). + +
+ +
+ + +For more details please refer to [Question Answering](./applications/question_answering) and [Document VQA](./applications/document_intelligence/doc_vqa). + + +#### 💌 Opinion Extraction and Sentiment Analysis + +We build an opinion extraction system for product review and fine-grained sentiment analysis based on [SKEP](https://arxiv.org/abs/2005.05635) Model. + +
+ +
+ + +For more details please refer to [Sentiment Analysis](./applications/sentiment_analysis). + +#### 🎙️ Speech Command Analysis + +Integrated ASR Model, Information Extraction, we provide a speech command analysis pipeline that show how to use PaddleNLP and [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) to solve Speech + NLP real scenarios. + +
+ +
+ + +For more details please refer to [Speech Command Analysis](./applications/speech_cmd_analysis). + +### High Performance Distributed Training and Inference + +#### ⚡ FastTokenizer: High Performance Text Preprocessing Library + +
+ +
+ +```python +AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) +``` + +Set `use_fast=True` to use C++ Tokenizer kernel to achieve 100x faster on text pre-processing. For more usage please refer to [FastTokenizer](./fast_tokenizer). + +#### ⚡ FastGeneration: High Performance Generation Library + +
+ +
+ +```python +model = GPTLMHeadModel.from_pretrained('gpt-cpm-large-cn') +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', + use_fast=True) +``` + +Set `use_fast=True` to achieve 5x speedup for Transformer, GPT, BART, PLATO, UniLM text generation. For more usage please refer to [FastGeneration](./fast_generation). + +#### 🚀 Fleet: 4D Hybrid Distributed Training + +
+ +
+ + +For more super large-scale model pre-training details please refer to [GPT-3](./examples/language_model/gpt-3). + + +## Installation + +### Prerequisites + +* python >= 3.7 +* paddlepaddle >= 2.3 + +More information about PaddlePaddle installation please refer to [PaddlePaddle's Website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html). + +### Python pip Installation + +``` +pip install --upgrade paddlenlp +``` + +or you can install the latest develop branch code with the following command: + +```shell +pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html +``` + +## Quick Start + +**Taskflow** aims to provide off-the-shelf NLP pre-built task covering NLU and NLG scenario, in the meanwhile with extremely fast inference satisfying industrial applications. + +```python +from paddlenlp import Taskflow + +# Chinese Word Segmentation +seg = Taskflow("word_segmentation") +seg("第十四届全运会在西安举办") +>>> ['第十四届', '全运会', '在', '西安', '举办'] + +# POS Tagging +tag = Taskflow("pos_tagging") +tag("第十四届全运会在西安举办") +>>> [('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')] + +# Named Entity Recognition +ner = Taskflow("ner") +ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +>>> [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')] + +# Dependency Parsing +ddp = Taskflow("dependency_parsing") +ddp("9月9日上午纳达尔在亚瑟·阿什球场击败俄罗斯球员梅德韦杰夫") +>>> [{'word': ['9月9日', '上午', '纳达尔', '在', '亚瑟·阿什球场', '击败', '俄罗斯', '球员', '梅德韦杰夫'], 'head': [2, 6, 6, 5, 6, 0, 8, 9, 6], 'deprel': ['ATT', 'ADV', 'SBV', 'MT', 'ADV', 'HED', 'ATT', 'ATT', 'VOB']}] + +# Sentiment Analysis +senta = Taskflow("sentiment_analysis") +senta("这个产品用起来真的很流畅,我非常喜欢") +>>> [{'text': '这个产品用起来真的很流畅,我非常喜欢', 'label': 'positive', 'score': 0.9938690066337585}] +``` + +## API Reference + +- Support [LUGE](https://www.luge.ai/) dataset loading and compatible with Hugging Face [Datasets](https://huggingface.co/datasets). For more details please refer to [Dataset API](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_list.html). +- Using Hugging Face style API to load 500+ selected transformer models and download with fast speed. For more information please refer to [Transformers API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html). +- One-line of code to load pre-trained word embedding. For more usage please refer to [Embedding API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/embeddings.html). + +Please find all PaddleNLP API Reference from our [readthedocs](https://paddlenlp.readthedocs.io/). + +## Community + +### Slack + +To connect with other users and contributors, welcome to join our [Slack channel](https://paddlenlp.slack.com/). + +### WeChat + +Scan the QR code below with your Wechat⬇️. You can access to official technical exchange group. Look forward to your participation. + +
+ +
+ + + +## Citation + +If you find PaddleNLP useful in your research, please consider cite +``` +@misc{=paddlenlp, + title={PaddleNLP: An Easy-to-use and High Performance NLP Library}, + author={PaddleNLP Contributors}, + howpublished = {\url{https://github.com/PaddlePaddle/PaddleNLP}}, + year={2021} +} +``` + +## Acknowledge + +We have borrowed from Hugging Face's [Transformers](https://github.com/huggingface/transformers)🤗 excellent design on pretrained models usage, and we would like to express our gratitude to the authors of Hugging Face and its open source community. + +## License + +PaddleNLP is provided under the [Apache-2.0 License](./LICENSE). diff --git a/README_origin.md b/README_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..aefd03450188d7594e6443066015caa57977088d --- /dev/null +++ b/README_origin.md @@ -0,0 +1,345 @@ +**简体中文**🀄 | [English🌎](./README_en.md) + +

+ +

+ +------------------------------------------------------------------------------------------ + +

+ + + + + + + + + +

+ + +

+ 安装 | + 快速开始 | + 特性 | + 社区交流 +

+ +**PaddleNLP**是一款**简单易用**且**功能强大**的自然语言处理和大语言模型(LLM)开发库。聚合业界**优质预训练模型**并提供**开箱即用**的开发体验,覆盖NLP多场景的模型库搭配**产业实践范例**可满足开发者**灵活定制**的需求。 + +## News 📢 + +* **2023.8.15 [PaddleNLP v2.6](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.6.0)**: 发布[全流程大模型工具链](./llm),涵盖预训练,精调,压缩,推理以及部署等各个环节,为用户提供端到端的大模型方案和一站式的开发体验;内置[4D并行分布式Trainer](./docs/trainer.md),[高效微调算法LoRA/Prefix Tuning](./llm#33-lora), [自研INT8/INT4量化算法](./llm#6-量化)等等;全面支持[LLaMA 1/2](./llm/llama), [BLOOM](.llm/bloom), [ChatGLM 1/2](./llm/chatglm), [GLM](./llm/glm), [OPT](./llm/opt)等主流大模型 + + +## 安装 + +### 环境依赖 + +- python >= 3.7 +- paddlepaddle >= 2.5.1 +- 如需大模型功能,请使用 paddlepaddle-gpu >= 2.5.1 + +### pip安装 + +```shell +pip install --upgrade paddlenlp +``` + +或者可通过以下命令安装最新 develop 分支代码: + +```shell +pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html +``` + +更多关于PaddlePaddle和PaddleNLP安装的详细教程请查看[Installation](./docs/get_started/installation.rst)。 + +## 快速开始 + + +### 大模型文本生成 + +PaddleNLP提供了方便易用的Auto API,能够快速的加载模型和Tokenizer。这里以使用 `linly-ai/chinese-llama-2-7b` 大模型做文本生成为例: + +```python +>>> from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM +>>> tokenizer = AutoTokenizer.from_pretrained("linly-ai/chinese-llama-2-7b") +>>> model = AutoModelForCausalLM.from_pretrained("linly-ai/chinese-llama-2-7b", dtype="float16") +>>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd") +>>> outputs = model.generate(**input_features, max_length=128) +>>> tokenizer.batch_decode(outputs[0]) +['\n你好!我是一个AI语言模型,可以回答你的问题和提供帮助。'] +``` + +### 一键UIE预测 + +PaddleNLP提供[一键预测功能](./docs/model_zoo/taskflow.md),无需训练,直接输入数据即可开放域抽取结果。这里以信息抽取-命名实体识别任务,UIE模型为例: + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction +>>> ie = Taskflow('information_extraction', schema=schema) +>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) +[{'时间': [{'end': 6, + 'probability': 0.9857378532924486, + 'start': 0, + 'text': '2月8日上午'}], + '赛事名称': [{'end': 23, + 'probability': 0.8503089953268272, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'end': 31, + 'probability': 0.8981548639781138, + 'start': 28, + 'text': '谷爱凌'}]}] +``` + +更多PaddleNLP内容可参考: +- [大模型全流程工具链](./llm),包含主流中文大模型的全流程方案。 +- [精选模型库](./model_zoo),包含优质预训练模型的端到端全流程使用。 +- [多场景示例](./examples),了解如何使用PaddleNLP解决NLP多种技术问题,包含基础技术、系统应用与拓展应用。 +- [交互式教程](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995),在🆓免费算力平台AI Studio上快速学习PaddleNLP。 + + +## 特性 + +#### 📦 开箱即用的NLP工具集 + +#### 🤗 丰富完备的中文模型库 + +#### 🎛️ 产业级端到端系统范例 + +#### 🚀 高性能分布式训练与推理 + + +### 开箱即用的NLP工具集 + +Taskflow提供丰富的**📦开箱即用**的产业级NLP预置模型,覆盖自然语言理解与生成两大场景,提供**💪产业级的效果**与**⚡️极致的推理性能**。 + +![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif) + +更多使用方法可参考[Taskflow文档](./docs/model_zoo/taskflow.md)。 +### 丰富完备的中文模型库 + +#### 🀄 业界最全的中文预训练模型 + +精选 45+ 个网络结构和 500+ 个预训练模型参数,涵盖业界最全的中文预训练模型:既包括文心NLP大模型的ERNIE、PLATO等,也覆盖BERT、GPT、RoBERTa、T5等主流结构。通过`AutoModel` API一键⚡**高速下载**⚡。 + +```python +from paddlenlp.transformers import * + +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +bert = AutoModel.from_pretrained('bert-wwm-chinese') +albert = AutoModel.from_pretrained('albert-chinese-tiny') +roberta = AutoModel.from_pretrained('roberta-wwm-ext') +electra = AutoModel.from_pretrained('chinese-electra-small') +gpt = AutoModelForPretraining.from_pretrained('gpt-cpm-large-cn') +``` + 
+针对预训练模型计算瓶颈,可以使用API一键使用文心ERNIE-Tiny全系列轻量化模型,降低预训练模型部署难度。 + +```python +# 6L768H +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +# 6L384H +ernie = AutoModel.from_pretrained('ernie-3.0-mini-zh') +# 4L384H +ernie = AutoModel.from_pretrained('ernie-3.0-micro-zh') +# 4L312H +ernie = AutoModel.from_pretrained('ernie-3.0-nano-zh') +``` + +对预训练模型应用范式如语义表示、文本分类、句对匹配、序列标注、问答等,提供统一的API体验。 + +```python +import paddle +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +text = tokenizer('自然语言处理') + +# 语义表示 +model = AutoModel.from_pretrained('ernie-3.0-medium-zh') +sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']])) +# 文本分类 & 句对匹配 +model = AutoModelForSequenceClassification.from_pretrained('ernie-3.0-medium-zh') +# 序列标注 +model = AutoModelForTokenClassification.from_pretrained('ernie-3.0-medium-zh') +# 问答 +model = AutoModelForQuestionAnswering.from_pretrained('ernie-3.0-medium-zh') +``` + +#### 💯 全场景覆盖的应用示例 + +覆盖从学术到产业的NLP应用示例,涵盖NLP基础技术、NLP系统应用以及拓展应用。全面基于飞桨核心框架2.0全新API体系开发,为开发者提供飞桨文本领域的最佳实践。 + +精选预训练模型示例可参考[Model Zoo](./model_zoo),更多场景示例文档可参考[examples目录](./examples)。更有免费算力支持的[AI Studio](https://aistudio.baidu.com)平台的[Notbook交互式教程](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995)提供实践。 + +
PaddleNLP预训练模型适用任务汇总(点击展开详情
+ +| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice | +| :----------------- | ----------------------- | -------------------- | ------------------ | --------------- | --------------- | +| ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BART | ✅ | ✅ | ✅ | ✅ | ❌ | +| BERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ | +| BlenderBot | ❌ | ❌ | ❌ | ✅ | ❌ | +| ChineseBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ConvBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| CTRL | ✅ | ❌ | ❌ | ❌ | ❌ | +| DistilBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ELECTRA | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE-CTM | ❌ | ✅ | ❌ | ❌ | ❌ | +| ERNIE-Doc | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ | +| ERNIE-Gram | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-M | ✅ | ✅ | ✅ | ❌ | ❌ | +| FNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| Funnel-Transformer | ✅ | ✅ | ✅ | ❌ | ❌ | +| GPT | ✅ | ✅ | ❌ | ✅ | ❌ | +| LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ | +| LayoutLMv2 | ❌ | ✅ | ❌ | ❌ | ❌ | +| LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ | +| LUKE | ❌ | ✅ | ✅ | ❌ | ❌ | +| mBART | ✅ | ❌ | ✅ | ❌ | ✅ | +| MegatronBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| MobileBERT | ✅ | ❌ | ✅ | ❌ | ❌ | +| MPNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| NEZHA | ✅ | ✅ | ✅ | ❌ | ✅ | +| PP-MiniLM | ✅ | ❌ | ❌ | ❌ | ❌ | +| ProphetNet | ❌ | ❌ | ❌ | ✅ | ❌ | +| Reformer | ✅ | ❌ | ✅ | ❌ | ❌ | +| RemBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoBERTa | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ | +| SKEP | ✅ | ✅ | ❌ | ❌ | ❌ | +| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| T5 | ❌ | ❌ | ❌ | ✅ | ❌ | +| TinyBERT | ✅ | ❌ | ❌ | ❌ | ❌ | +| UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ | +| XLNet | ✅ | ✅ | ✅ | ❌ | ✅ | + +
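在上表选定任务对应的 `AutoModelFor*` 任务头之后,推理调用方式是一致的。下面给出一个文本分类方向的最小推理示意(模型、类别数与输入文本均为演示假设,未在具体任务上微调前,输出的概率不具业务含义):

```python
import paddle
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
# num_classes 为假设的类别数,实际使用前需在自己的分类数据上微调
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=2)
model.eval()

inputs = tokenizer("这家店的服务很贴心", return_tensors="pd")
with paddle.no_grad():
    logits = model(**inputs)  # 形状为 [batch_size, num_classes]
probs = paddle.nn.functional.softmax(logits, axis=-1)
print(probs.numpy())
```

序列标注、问答等其他任务头的输出形状不同,但加载与前向调用方式与上面一致。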
+ +可参考[Transformer 文档](/docs/model_zoo/index.rst) 查看目前支持的预训练模型结构、参数和详细用法。 + +### 产业级端到端系统范例 + +PaddleNLP针对信息抽取、语义检索、智能问答、情感分析等高频NLP场景,提供了端到端系统范例,打通*数据标注*-*模型训练*-*模型调优*-*预测部署*全流程,持续降低NLP技术产业落地门槛。更多详细的系统级产业范例使用说明请参考[Applications](./applications)。 + +#### 🔍 语义检索系统 + +针对无监督数据、有监督数据等多种数据情况,结合SimCSE、In-batch Negatives、ERNIE-Gram单塔模型等,推出前沿的语义检索方案,包含召回、排序环节,打通训练、调优、高效向量检索引擎建库和查询全流程。 + +
+ +
+ + +更多使用说明请参考[语义检索系统](./applications/neural_search)。 + +#### ❓ 智能问答系统 + +基于[🚀RocketQA](https://github.com/PaddlePaddle/RocketQA)技术的检索式问答系统,支持FAQ问答、说明书问答等多种业务场景。 + +
+ +
+ + +更多使用说明请参考[智能问答系统](./applications/question_answering)与[文档智能问答](./applications/document_intelligence/doc_vqa) + +#### 💌 评论观点抽取与情感分析 + +基于情感知识增强预训练模型SKEP,针对产品评论进行评价维度和观点抽取,以及细粒度的情感分析。 + +
+ +
+ +更多使用说明请参考[情感分析](./applications/sentiment_analysis)。 + +#### 🎙️ 智能语音指令解析 + +集成了[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)和[百度开放平台](https://ai.baidu.com/)的语音识别和[UIE](./model_zoo/uie)通用信息抽取等技术,打造智能一体化的语音指令解析系统范例,该方案可应用于智能语音填单、智能语音交互、智能语音检索等场景,提高人机交互效率。 + +
+ +
+ +更多使用说明请参考[智能语音指令解析](./applications/speech_cmd_analysis)。 + +### 高性能分布式训练与推理 + +#### ⚡ FastTokenizer:高性能文本处理库 + +
+ +
+ +```python +AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) +``` + +为了实现更极致的模型部署性能,安装FastTokenizers后只需在`AutoTokenizer` API上打开 `use_fast=True`选项,即可调用C++实现的高性能分词算子,轻松获得超Python百余倍的文本处理加速,更多使用说明可参考[FastTokenizer文档](./fast_tokenizer)。 + +#### ⚡️ FastGeneration:高性能生成加速库 + +
+ +
+ +```python +model = GPTLMHeadModel.from_pretrained('gpt-cpm-large-cn') +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', + use_fast=True) +``` + +简单地在`generate()`API上打开`use_fast=True`选项,轻松在Transformer、GPT、BART、PLATO、UniLM等生成式预训练模型上获得5倍以上GPU加速,更多使用说明可参考[FastGeneration文档](./fast_generation)。 + +#### 🚀 Fleet:飞桨4D混合并行分布式训练技术 + +
+ +
+ + +更多关于千亿级AI模型的分布式训练使用说明可参考[GPT-3](./examples/language_model/gpt-3)。 + +## 社区交流 + +- 微信扫描二维码并填写问卷,回复小助手关键词(NLP)之后,即可加入交流群领取福利 + + - 与众多社区开发者以及官方团队深度交流。 + - 10G重磅NLP学习大礼包! + +
+ +
+ +## Citation + +如果PaddleNLP对您的研究有帮助,欢迎引用 + +``` +@misc{=paddlenlp, + title={PaddleNLP: An Easy-to-use and High Performance NLP Library}, + author={PaddleNLP Contributors}, + howpublished = {\url{https://github.com/PaddlePaddle/PaddleNLP}}, + year={2021} +} +``` + +## Acknowledge + +我们借鉴了Hugging Face的[Transformers](https://github.com/huggingface/transformers)🤗关于预训练模型使用的优秀设计,在此对Hugging Face作者及其开源社区表示感谢。 + +## License + +PaddleNLP遵循[Apache-2.0开源协议](./LICENSE)。 diff --git a/applications/README.md b/applications/README.md new file mode 100644 index 0000000000000000000000000000000000000000..650edeec997df86e868603283933fbdf5f50e24c --- /dev/null +++ b/applications/README.md @@ -0,0 +1,130 @@ +# 产业级端到端系统范例 + +## 1、简介 + +PaddleNLP 从预训练模型库出发,提供了经典预训练模型在主流 NLP 任务上丰富的[应用示例](../examples),满足了大量开发者的学习科研与基础应用需求。 + +针对更广泛的产业落地需求、更复杂的 NLP 场景任务,PaddleNLP 推出**产业级端到端系统范例库**(下文简称产业范例),提供单个模型之上的产业解决方案。 + +- 最强模型与实践———产业范例针对具体业务场景,提供最佳模型(组合),兼顾模型精度与性能,降低开发者模型选型成本; +- 全流程———打通数据标注-模型训练-模型调优-模型压缩—预测部署全流程,帮助开发者更低成本得完成产业落地。 + +## 2、基于 Pipelines 构建产业范例,加速落地 + +在面向不同场景任务建设一系列产业方案的过程中,不难发现,从技术基础设施角度看: + +(1)NLP系统都可以抽象为由多个基础组件串接而成的流水线系统; +(2)多个NLP流水线系统可共享使用相同的基础组件。 + +因此,PaddleNLP 逐渐孵化出了一套 NLP 流水线系统 [Pipelines](../pipelines),将各个 NLP 复杂系统的通用模块抽象封装为标准组件,支持开发者通过配置文件对标准组件进行组合,仅需几分钟即可定制化构建智能系统,让解决NLP任务像搭积木一样便捷、灵活、高效。同时,Pipelines 中预置了前沿的预训练模型和算法,在研发效率、模型效果和性能方面提供多重保障。因此,Pipelines 能够大幅加快开发者使用飞桨落地的效率。 + + +
+ +
+ +
+ +**PaddleNLP 提供了多个版本的产业范例:** + +- 如果你希望快速体验、直接应用、从零搭建一套完整系统,推荐使用 **Pipelines 版本**。这里集成了训练好的模型,无需关心模型训练细节;提供 Docker 环境,可快速一键部署端到端系统;打通前端 Demo 界面,便于直观展示、分析、调试效果。 +- 如果你希望使用自己的业务数据进行二次开发,推荐使用`./applications`目录下的**可定制版本**,训练好的模型可以直接集成进 Pipelines 中进行使用。 +- 也可以使用 [AI Studio](https://aistudio.baidu.com/aistudio/index) 在线 Jupyter Notebook 快速体验,有 GPU 算力哦。 + +| 场景任务 | Pipelines版本地址 | 可定制版本地址 | Notebook | +| :--------------- | ------- | ------- | ------- | +| **检索**| [字面+语义检索](../pipelines/examples/semantic-search) | [语义检索](./neural_search) | [基于Pipelines搭建检索系统](https://aistudio.baidu.com/aistudio/projectdetail/4442670)
[二次开发语义检索](https://aistudio.baidu.com/aistudio/projectdetail/3351784) | +| **问答** | [FAQ问答](../pipelines/examples/FAQ/)
[无监督检索式问答](../pipelines/examples/unsupervised-question-answering)
[有监督检索式问答](../pipelines/examples/question-answering) | [FAQ问答](./question_answering/supervised_qa)
[无监督检索式问答](./question_answering/unsupervised_qa) | [基于Pipelines搭建FAQ问答系统](https://aistudio.baidu.com/aistudio/projectdetail/4465498)
[基于Pipelines搭建抽取式问答系统](https://aistudio.baidu.com/aistudio/projectdetail/4442857)
[FAQ政务问答](https://aistudio.baidu.com/aistudio/projectdetail/3678873)
[FAQ保险问答](https://aistudio.baidu.com/aistudio/projectdetail/3882519) | +| **文本分类**| 暂无 | [文本分类](./text_classification) | [对话意图识别](https://aistudio.baidu.com/aistudio/projectdetail/2017202)
[法律文本多标签分类](https://aistudio.baidu.com/aistudio/projectdetail/3996601)
[层次分类](https://aistudio.baidu.com/aistudio/projectdetail/4568985) | +| **通用文本分类** | 暂无 | [通用文本分类](./zero_shot_text_classification) | | +| **通用信息抽取** | 暂无 | [通用信息抽取](./information_extraction) | [UIE快速体验](https://aistudio.baidu.com/aistudio/projectdetail/3914778)
[UIE微调实体抽取](https://aistudio.baidu.com/aistudio/projectdetail/4038499)
[UIE微调关系抽取](https://aistudio.baidu.com/aistudio/projectdetail/4371345)
[UIE-X快速体验](https://aistudio.baidu.com/aistudio/projectdetail/5017442)
[UIE-X微调](https://aistudio.baidu.com/aistudio/projectdetail/5261592) | +| **情感分析** | [情感分析](../pipelines/examples/sentiment_analysis) | [情感分析](./sentiment_analysis) | [情感分析](https://aistudio.baidu.com/aistudio/projectdetail/5318177)| +| **文档智能** | [文档抽取问答](../pipelines/examples/document-intelligence) | [跨模态文档问答](./document_intelligence/doc_vqa)| [文档抽取问答](https://aistudio.baidu.com/aistudio/projectdetail/4881278)
[汽车说明书问答](https://aistudio.baidu.com/aistudio/projectdetail/4049663) | +| **文生图** | [文生图系统](../pipelines/examples/text_to_image) | 可参考[PPDiffusers](../ppdiffusers) | | +| **语音指令解析** | 暂无 | [语音指令解析](./speech_cmd_analysis) | [语音指令解析](https://aistudio.baidu.com/aistudio/projectdetail/4399703) | +| **文本摘要** | 暂无 | [文本摘要](./text_summarization) | [文本摘要](https://aistudio.baidu.com/aistudio/projectdetail/4903667) | + +## 3、典型范例介绍 + +#### 📄 通用信息抽取系统 + +- 首个产业级通用信息抽取方案 UIE,面向纯文本,实现多任务统一建模,提供强大的零样本抽取和少样本快速迁移能力; +- 首个兼具文本及文档抽取能力、多语言、开放域的信息抽取方案 UIE-X,基于 [ERNIE-Layout](../model_zoo/ernie-layout) 跨模态布局增强预训练模型,集成 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) 的 PP-OCR、PP-Structure 版面分析能力,小样本文档信息抽取效果领先。 + +
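下面给出一个纯文本 UIE 零样本抽取的最小示例(基于 `paddlenlp.Taskflow`,schema 与输入文本仅作演示):

```python
from pprint import pprint
from paddlenlp import Taskflow

# 通过 schema 声明要抽取的目标字段,无需任何训练即可零样本抽取
schema = ["时间", "选手", "赛事名称"]
ie = Taskflow("information_extraction", schema=schema)
pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```

文档场景(UIE-X)的输入与调用方式请以[通用信息抽取系统](./information_extraction)中的说明为准。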
+ +
+ + +详细使用说明请参考[通用信息抽取系统](./information_extraction),更多:[UIE 解读](https://mp.weixin.qq.com/s/-hHz8knHIKKqKCBTke7i5A)、[UIE-X 解读](https://zhuanlan.zhihu.com/p/592422623)。 + +#### 🔍 语义检索系统 + +- 前沿算法———基于 SimCSE、In-batch Negatives、ERNIE Pairwise、RocketQA Pointwise 等提供针对无监督、有监督等多种数据情况的多样化方案; +- 全流程———覆盖召回、排序环节,集成主流 ANN 引擎,同时兼容 ElasticSearch 字面检索模式,提供多路召回方案。打通训练、调优、高效向量检索引擎建库和查询全流程。 + +
+ +
+ +详细使用说明请参考[语义检索系统](./neural_search)。 + +#### ❓ 智能问答系统 + +- 端到端问答技术 [🚀RocketQA](https://github.com/PaddlePaddle/RocketQA),首个中文端到端问答模型,基于知识增强的预训练模型ERNIE和百万量级的人工标注数据集DuReader训练得到,效果优异; +- 覆盖有监督(如 FAQ 问答)、无监督(自动生成 QA 对,生成的问答对语料可以通过无监督的方式构建检索式问答系统)等多种情况,适用各类业务场景。 + +
+ +
 + +详细使用说明请参考[智能问答系统](./question_answering)与[文档智能问答](./document_intelligence/doc_vqa)。 + +#### 📚 通用文本分类 + +- 基于“任务架构统一、通用能力共享”的通用文本分类技术 UTC,实现了良好的零/少样本迁移能力,能够以统一模型覆盖诸多任务的开放域分类,可支持情感分析、意图识别、语义匹配、蕴含推理等各种可转换为分类问题的 NLU 任务(调用示例见下文)。 + +
+ +
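一个零样本分类的最小调用示意如下(基于 `paddlenlp.Taskflow`;任务名与候选标签均为示例性假设,实际接口请以[通用文本分类](./zero_shot_text_classification)目录中的文档为准):

```python
from paddlenlp import Taskflow

# 候选标签即为 schema,无需针对该任务训练即可零样本分类(任务名与标签为假设示例)
cls = Taskflow("zero_shot_text_classification", schema=["负向", "正向"])
print(cls("房间干净明亮,服务也很周到"))
```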
+ +
+ +详细使用说明请参考[通用文本分类](./zero_shot_text_classification),更多:[文章解读](https://mp.weixin.qq.com/s/VV-nYv4y1r7oipJnURRL5w)。 + + +#### 🗂 文本分类 + +- 场景方案全覆盖––––开源预训练模型-微调、提示学习、基于语义索引等多种分类技术方案,满足不同场景需求,涵盖多分类(multi-class)、多标签(multi-label)、层次分类(hierarchical)三类任务; +- 模型高效调优––––强强结合数据增强能力与可信增强技术,解决脏数据、标注数据欠缺、数据不平衡等问题,大幅提升模型效果。 + +
+ +
+ +
+ +详细使用说明请参考[文本分类](./text_classification),更多:[文章解读](https://mp.weixin.qq.com/s/tas7yM8vapxwtlJt-MRZdg)。 + +#### 💌 评论观点抽取与情感分析 + +- 经典方案:基于情感知识增强预训练模型SKEP,两阶段式抽取和分类,首先通过序列标注的方式定位属性词和观点词,然后进行属性级情感分类; +- 前沿方案:基于UIE的情感分析方案采用 Prompt Learning 的方式进行情感信息抽取,精度更高。支持语句级和属性级情感分析,解决同义属性聚合、隐性观点抽取难点,并提供可视化分析能力(语句级调用示例见下文)。 + +
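语句级情感分析可以直接用 `paddlenlp.Taskflow` 体验(以下调用方式来自 Taskflow 预置能力,输出仅作演示;属性级方案请参考下方链接的详细文档):

```python
from paddlenlp import Taskflow

# 语句级情感分析,开箱即用
senta = Taskflow("sentiment_analysis")
print(senta("这个产品用起来真的很流畅,我非常喜欢"))
# 输出形如:[{'text': '...', 'label': 'positive', 'score': 0.99}]
```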
+ +
+
+ +详细使用说明请参考[情感分析](./sentiment_analysis),更多:[文章解读](https://mp.weixin.qq.com/s/QAHjIRG9zxpYfM6YPRQ-9w)。 + +#### 🎙️ 智能语音指令解析 + +- 集成了[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)和[百度开放平台](https://ai.baidu.com/)的语音识别能力,以及[UIE](../model_zoo/uie)通用信息抽取等技术,打造智能一体化的语音指令解析系统范例,该方案可应用于智能语音填单、智能语音交互、智能语音检索等场景,提高人机交互效率。 + +
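+
+该范例的核心环节是对语音识别得到的文本做结构化信息抽取。下面给出一个跳过语音识别、直接对假设的识别结果文本用 UIE 抽取关键信息的最小示意(schema 与文本均为假设示例,完整的“语音识别+信息抽取”流程请参考本应用目录):
+
+```python
+>>> from pprint import pprint
+>>> from paddlenlp import Taskflow
+
+>>> # 假设语音识别环节已输出如下文本,这里只演示后续的信息抽取
+>>> asr_text = "帮我查一下明天上午十点从北京到上海的高铁"
+>>> schema = ["时间", "出发地", "目的地"]
+>>> ie = Taskflow("information_extraction", schema=schema)
+>>> pprint(ie(asr_text))
+```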
+ +
+ +详细使用说明请参考[智能语音指令解析](./applications/speech_cmd_analysis)。 diff --git a/applications/document_intelligence/README.md b/applications/document_intelligence/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bb6464aafd71bc0eef04f60a638846b6383dcdaf --- /dev/null +++ b/applications/document_intelligence/README.md @@ -0,0 +1,188 @@ +# 文档智能应用 + +**目录** +- [1. 文档智能应用简介](#文档智能应用简介) +- [2. 技术特色介绍](#技术特色介绍) + - [2.1 多语言跨模态训练基座](#多语言跨模态训练基座) + - [2.2 多场景覆盖](#多场景覆盖) +- [3. 快速开始](#快速开始) + - [3.1 开箱即用](#开箱即用) + - [3.2 产业级流程方案](#产业级流程方案) + +## 1. 文档智能应用简介 + +文档智能(DI, Document Intelligence)主要指**对于网页、数字文档或扫描文档所包含的文本以及丰富的排版格式等信息,通过人工智能技术进行理解、分类、提取以及信息归纳**的过程。文档智能技术广泛应用于金融、保险、能源、物流、医疗等行业,常见的应用场景包括财务报销单、招聘简历、企业财报、合同文书、动产登记证、法律判决书、物流单据等多模态文档的关键信息抽取、文档解析、文档比对等。 + +在实际应用中,需要解决文档格式繁杂、布局多样、信息模态多样、需求开放、业务数据少等多重难题。针对文档智能领域的痛点和难点,PaddleNLP将持续开源一系列产业实践范例,解决开发者们实际应用难题。 + +
+ 文档智能技术一般流程 +
+ + + +## 2. 技术特色介绍 + + + +### 2.1 多语言跨模态训练基座 + +近期,百度文心文档智能,基于多语言跨模态布局增强的文档智能大模型[ERNIE-Layout](http://arxiv.org/abs/2210.06155),刷新了五类11项文档智能任务效果。依托文心ERNIE大模型,基于布局知识增强技术,融合文本、图像、布局等信息进行联合建模,能够对多模态文档(如文档图片、PDF 文件、扫描件等)进行深度理解与分析,为各类上层应用提供SOTA模型底座。 + +
+ +
+ + + +### 2.2 多场景覆盖 + +以下是文档智能技术的一些应用场景展示: + +- 发票抽取问答 + +
+ +
+ +- 海报抽取问答 + +
+ +
+ +- 网页抽取问答 + +
+ +
+ + +- 表格抽取问答 + +
+ +
+ + +- 试卷抽取问答 + +
+ +
+ + +- 英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答 + +
+ +
+ +- 中文票据多语种(中简、中繁、英、日、法语)抽取问答 + +
+ +
+ +- Demo图片可在此[下载](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip) + + + +## 3. 快速开始 + + + +### 3.1 开箱即用 + +开源DocPrompt开放文档抽取问答模型,以ERNIE-Layout为底座,可精准理解图文信息,推理学习附加知识,准确捕捉图片、PDF等多模态文档中的每个细节。 + +🧾 通过[Huggingface网页](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout)体验DocPrompt功能: + +
+ +
+ +#### Taskflow + +通过``paddlenlp.Taskflow``三行代码调用DocPrompt功能,具备多语言文档抽取问答能力,部分应用场景展示如下: + +- 输入格式 + +``` +[ + {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +默认使用PaddleOCR进行OCR识别,同时支持用户通过``word_boxes``传入自己的OCR结果,格式为``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +- 支持单条、批量预测 + + - 支持本地图片路径输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) + [{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, + {'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, + {'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] + ``` + + - http图片链接输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])) + [{'prompt': '发票号码是多少?', + 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]}, + {'prompt': '校验码是多少?', + 'result': [{'end': 233, + 'prob': 1.0, + 'start': 231, + 'value': '01107 555427109891646'}]}] + ``` + +- 可配置参数说明 + * `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + * `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 + * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。 + + + +### 3.2 产业级流程方案 + +针对文档智能领域的痛点和难点,PaddleNLP将持续开源一系列文档智能产业实践范例,解决开发者们实际应用难题。 + +- 👉 [汽车说明书跨模态智能问答](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/document_intelligence/doc_vqa#readme) + +更多:百度TextMind智能文档分析平台可提供包括文档信息抽取、文本内容审查、企业文档管理、文档格式解析、文档内容比对等全方位一站式的文档智能服务,已形成一套完整的企业文档场景化解决方案,满足银行、券商、法律、能源、传媒、通信、物流等不同行业和场景的文档处理需求,以AI助力企业的办公智能化升级和数字化转型。欢迎深度交流与商业合作,了解详情:https://ai.baidu.com/tech/nlp/Textanalysis + +## References + +- [文档智能:数据集、模型和应用](http://jcip.cipsc.org.cn/CN/abstract/abstract3331.shtml) + +- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155) diff --git a/applications/document_intelligence/doc_vqa/.gitignore b/applications/document_intelligence/doc_vqa/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..cc8c76c471fb1e2ed206c2bbaacebd04f4f29068 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/.gitignore @@ -0,0 +1,17 @@ +OCR_process/*.json +*.png +*.json +answers/* +checkpoints/* +__pycache__/* +OCR_process/demo_pics/* +Rerank/log/* +Rerank/checkpoints/* +Rerank/data/* +Rerank/output/* +Rerank/__pycache__/* +Extraction/log/* +Extraction/checkpoints/* +Extraction/data/* +Extraction/output/* +Extraction/__pycache__/* diff --git a/applications/document_intelligence/doc_vqa/Extraction/change_to_mrc.py b/applications/document_intelligence/doc_vqa/Extraction/change_to_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..bb388b16605562c8adbca96292c008e3010342df --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/change_to_mrc.py @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import sys +import json +import numpy as np + + +def get_top1_from_ranker(path): + with open(path, "r", encoding="utf-8") as f: + scores = [float(line.strip()) for line in f.readlines()] + top_id = np.argmax(scores) + + return top_id + + +def get_ocr_result_by_id(path, top_id): + with open(path, "r", encoding="utf-8") as f: + reses = f.readlines() + res = reses[top_id] + return json.loads(res) + + +def write_to_file(doc, path): + with open(path, "w", encoding="utf-8") as f: + json.dump(doc, f, ensure_ascii=False) + f.write("\n") + + +if __name__ == "__main__": + question = sys.argv[1] + ranker_result_path = "../Rerank/data/demo.score" + ocr_result_path = "../OCR_process/demo_ocr_res.json" + save_path = "data/demo_test.json" + top_id = get_top1_from_ranker(ranker_result_path) + doc = get_ocr_result_by_id(ocr_result_path, top_id) + doc["question"] = question + doc["img_id"] = str(top_id + 1) + + write_to_file(doc, save_path) diff --git a/applications/document_intelligence/doc_vqa/Extraction/docvqa.py b/applications/document_intelligence/doc_vqa/Extraction/docvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..0d3c576bc296790a570c9493e1571b1c825d26c4 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/docvqa.py @@ -0,0 +1,359 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import json +import sys + +import numpy as np +import paddle +from paddle.io import Dataset +from tqdm import tqdm + +sys.path.insert(0, "../") + + +class DocVQAExample(object): + def __init__(self, question, doc_tokens, doc_boxes=[], answer=None, labels=None, image=None): + self.question = question + self.doc_tokens = doc_tokens + self.doc_boxes = doc_boxes + self.image = image + self.answer = answer + self.labels = labels + + +class DocVQAFeatures(object): + """A single set of features of data.""" + + def __init__(self, example_index, input_ids, input_mask, segment_ids, boxes=None, label=None): + self.example_index = example_index + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.boxes = boxes + self.label = label + + +class DocVQA(Dataset): + def __init__( + self, args, tokenizer, label2id_map, max_seq_len=512, max_query_length=20, max_doc_length=512, max_span_num=1 + ): + super(DocVQA, self).__init__() + self.tokenizer = tokenizer + self.label2id_map = label2id_map + self.max_seq_len = max_seq_len + self.max_query_length = max_query_length + self.max_doc_length = max_doc_length + self.max_span_num = max_span_num + self.sample_list = None + self.args = args + + self.docvqa_inputs = self.docvqa_input() + + def check_is_max_context(self, doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a single + # token can appear in multiple documents. E.g. 
+ # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left context + # and 0 right context. + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + def convert_examples_to_features( + self, examples, tokenizer, label_map, max_seq_length, max_span_num, max_doc_length, max_query_length + ): + + if "[CLS]" in self.tokenizer.get_vocab(): + start_token = "[CLS]" + end_token = "[SEP]" + else: + start_token = "" + end_token = "" + + features = [] + for (example_index, example) in enumerate(examples): + query_tokens = tokenizer.tokenize(example.question) + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + all_doc_tokens = example.doc_tokens + all_doc_boxes_tokens = example.doc_boxes + + cls_token_box = [0, 0, 0, 0] + sep_token_box = [1000, 1000, 1000, 1000] + pad_token_box = [0, 0, 0, 0] + ques_token_box = [0, 0, 0, 0] + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. + # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. 
+ _DocSpan = collections.namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += length + + spans_input_ids = [] + spans_input_mask = [] + spans_segment_ids = [] + spans_boxes_tokens = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + if doc_span_index == max_span_num: + break + tokens = [] + boxes_tokens = [] + token_is_max_context = {} + segment_ids = [] + tokens.append(start_token) + boxes_tokens.append(cls_token_box) + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + boxes_tokens.append(ques_token_box) + segment_ids.append(0) + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + for i in range(doc_span.length): + split_token_index = doc_span.start + i + is_max_context = self.check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + boxes_tokens.append(all_doc_boxes_tokens[split_token_index]) + segment_ids.append(0) + + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + # Zero-pad up to the sequence length. + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + boxes_tokens.append(pad_token_box) + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(boxes_tokens) == max_seq_length + + spans_input_ids.append(input_ids) + spans_input_mask.append(input_mask) + spans_segment_ids.append(segment_ids) + spans_boxes_tokens.append(boxes_tokens) + + # Padding + # padding spans + # max_span_num: max_seg_num + # spans_input_ids: the tokens in each segment + if len(spans_input_ids) > max_span_num: + spans_input_ids = spans_input_ids[0:max_span_num] + spans_input_mask = spans_input_mask[0:max_span_num] + spans_segment_ids = spans_segment_ids[0:max_span_num] + spans_boxes_tokens = spans_boxes_tokens[0:max_span_num] + while len(spans_input_ids) < max_span_num: + tokens = [] + boxes_tokens = [] + segment_ids = [] + tokens.append(start_token) + boxes_tokens.append(cls_token_box) + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + boxes_tokens.append(ques_token_box) + segment_ids.append(0) + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + input_mask = [1] * len(input_ids) + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + boxes_tokens.append(pad_token_box) + spans_input_ids.append(input_ids) + spans_input_mask.append(input_mask) + spans_segment_ids.append(segment_ids) + spans_boxes_tokens.append(boxes_tokens) + + # padding labels + labels = example.labels + sep_id = tokenizer.convert_tokens_to_ids(end_token) + labels = ["O"] * (spans_input_ids[0].index(sep_id) + 1) + labels + if 
len(labels) > 512: + labels = labels[:512] + + if len(labels) < 512: + labels += ["O"] * (512 - len(labels)) + assert len(spans_input_ids[0]) == len(labels) + + label_ids = [] + for lid, l in enumerate(labels): + if l not in label_map: + label_ids.append(0) + else: + label_ids.append(label_map[l]) + + feature = DocVQAFeatures( + example_index=example_index, + input_ids=spans_input_ids, + input_mask=spans_input_mask, + segment_ids=spans_segment_ids, + boxes=spans_boxes_tokens, + label=label_ids, + ) + features.append(feature) + return features + + def create_examples(self, data, is_test=False): + """Creates examples for the training and dev sets.""" + examples = [] + for sample in tqdm(data, total=len(data)): + question = sample["question"] + doc_tokens = sample["document"] + doc_boxes = sample["document_bbox"] + labels = sample["labels"] if not is_test else [] + + x_min, y_min = min(doc_boxes, key=lambda x: x[0])[0], min(doc_boxes, key=lambda x: x[2])[2] + x_max, y_max = max(doc_boxes, key=lambda x: x[1])[1], max(doc_boxes, key=lambda x: x[3])[3] + width = x_max - x_min + height = y_max - y_min + + if max(width, height) < 1000: + scale_x = 1 + scale_y = 1 + else: + scale_x = 1000 / max(width, height) + scale_y = 1000 / max(width, height) + + scaled_doc_boxes = [ + [ + round((b[0] - x_min) * scale_x), + round((b[2] - y_min) * scale_y), + round((b[1] - x_min) * scale_x), + round((b[3] - y_min) * scale_y), + ] + for b in doc_boxes + ] + + for box, oribox in zip(scaled_doc_boxes, doc_boxes): + if box[0] < 0: + print(box, oribox) + if box[2] - box[0] < 0: + print(box, oribox) + if box[3] - box[1] < 0: + print(box, oribox) + for pos in box: + if pos > 1000: + print(width, height, box, oribox) + + example = DocVQAExample( + question=question, doc_tokens=doc_tokens, doc_boxes=scaled_doc_boxes, labels=labels + ) + examples.append(example) + return examples + + def docvqa_input(self): + data = [] + if self.args.do_train: + dataset = self.args.train_file + elif self.args.do_test: + dataset = self.args.test_file + with open(dataset, "r", encoding="utf8") as f: + for index, line in enumerate(f): + data.append(json.loads(line.strip())) + + # read the examples from train/test xlm files + examples = self.create_examples(data, is_test=self.args.do_test) + + features = self.convert_examples_to_features( + examples, + self.tokenizer, + self.label2id_map, + max_seq_length=self.max_seq_len, + max_doc_length=self.max_doc_length, + max_span_num=self.max_span_num, + max_query_length=self.max_query_length, + ) + + all_input_ids = paddle.to_tensor([f.input_ids for f in features], dtype="int64") + all_input_mask = paddle.to_tensor([f.input_mask for f in features], dtype="int64") + all_segment_ids = paddle.to_tensor([f.segment_ids for f in features], dtype="int64") + all_bboxes = paddle.to_tensor([f.boxes for f in features], dtype="int64") + all_labels = paddle.to_tensor([f.label for f in features], dtype="int64") + self.sample_list = [ + np.array(all_input_ids), + np.array(all_input_mask), + np.array(all_segment_ids), + np.array(all_bboxes), + np.array(all_labels), + ] + + def __getitem__(self, idx): + return ( + self.sample_list[0][idx], + self.sample_list[1][idx], + self.sample_list[2][idx], + self.sample_list[3][idx], + self.sample_list[4][idx], + ) + + def __len__( + self, + ): + return self.sample_list[0].shape[0] diff --git a/applications/document_intelligence/doc_vqa/Extraction/model.py b/applications/document_intelligence/doc_vqa/Extraction/model.py new file mode 100644 index 
0000000000000000000000000000000000000000..71647d96bf2424ddba1479e66266da3bfe3a806f --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/model.py @@ -0,0 +1,206 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.fluid as fluid +import paddle.nn as nn + +from paddlenlp.transformers import LayoutXLMPretrainedModel + + +class Crf_decoding(paddle.fluid.dygraph.Layer): + def __init__(self, param_attr, size=None, is_test=True, dtype="float32"): + super(Crf_decoding, self).__init__() + + self._dtype = dtype + self._size = size + self._is_test = is_test + self._param_attr = param_attr + self._transition = self.create_parameter( + attr=self._param_attr, shape=[self._size + 2, self._size], dtype=self._dtype + ) + + @property + def weight(self): + return self._transition + + @weight.setter + def weight(self, value): + self._transition = value + + def forward(self, input, label=None, length=None): + + viterbi_path = self._helper.create_variable_for_type_inference(dtype=self._dtype) + this_inputs = {"Emission": [input], "Transition": self._transition, "Label": label} + if length is not None: + this_inputs["Length"] = [length] + self._helper.append_op( + type="crf_decoding", + inputs=this_inputs, + outputs={"ViterbiPath": [viterbi_path]}, + attrs={ + "is_test": self._is_test, + }, + ) + return viterbi_path + + +class Chunk_eval(paddle.fluid.dygraph.Layer): + def __init__(self, num_chunk_types, chunk_scheme, excluded_chunk_types=None): + super(Chunk_eval, self).__init__() + self.num_chunk_types = num_chunk_types + self.chunk_scheme = chunk_scheme + self.excluded_chunk_types = excluded_chunk_types + + def forward(self, input, label, seq_length=None): + + precision = self._helper.create_variable_for_type_inference(dtype="float32") + recall = self._helper.create_variable_for_type_inference(dtype="float32") + f1_score = self._helper.create_variable_for_type_inference(dtype="float32") + num_infer_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + num_label_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + num_correct_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + + this_input = {"Inference": [input], "Label": [label]} + if seq_length is not None: + this_input["SeqLength"] = [seq_length] + + self._helper.append_op( + type="chunk_eval", + inputs=this_input, + outputs={ + "Precision": [precision], + "Recall": [recall], + "F1-Score": [f1_score], + "NumInferChunks": [num_infer_chunks], + "NumLabelChunks": [num_label_chunks], + "NumCorrectChunks": [num_correct_chunks], + }, + attrs={ + "num_chunk_types": self.num_chunk_types, + "chunk_scheme": self.chunk_scheme, + "excluded_chunk_types": self.excluded_chunk_types or [], + }, + ) + return (precision, recall, f1_score, num_infer_chunks, num_label_chunks, num_correct_chunks) + + +class Linear_chain_crf(paddle.fluid.dygraph.Layer): + def __init__(self, param_attr, size=None, 
is_test=False, dtype="float32"): + super(Linear_chain_crf, self).__init__() + + self._param_attr = param_attr + self._dtype = dtype + self._size = size + self._is_test = is_test + self._transition = self.create_parameter( + attr=self._param_attr, shape=[self._size + 2, self._size], dtype=self._dtype + ) + + @property + def weight(self): + return self._transition + + @weight.setter + def weight(self, value): + self._transition = value + + def forward(self, input, label, length=None): + + alpha = self._helper.create_variable_for_type_inference(dtype=self._dtype) + emission_exps = self._helper.create_variable_for_type_inference(dtype=self._dtype) + transition_exps = self._helper.create_variable_for_type_inference(dtype=self._dtype) + log_likelihood = self._helper.create_variable_for_type_inference(dtype=self._dtype) + this_inputs = {"Emission": [input], "Transition": self._transition, "Label": [label]} + if length is not None: + this_inputs["Length"] = [length] + self._helper.append_op( + type="linear_chain_crf", + inputs=this_inputs, + outputs={ + "Alpha": [alpha], + "EmissionExps": [emission_exps], + "TransitionExps": transition_exps, + "LogLikelihood": log_likelihood, + }, + attrs={ + "is_test": self._is_test, + }, + ) + return log_likelihood + + +class LayoutXLMForTokenClassification_with_CRF(LayoutXLMPretrainedModel): + def __init__(self, layoutxlm, num_classes, dropout=None): + super(LayoutXLMForTokenClassification_with_CRF, self).__init__() + self.num_classes = num_classes + self.layoutxlm = layoutxlm + self.dropout = nn.Dropout(dropout if dropout is not None else self.layoutxlm.config["hidden_dropout_prob"]) + self.emission_classifier = nn.Linear(self.layoutxlm.config["hidden_size"], self.num_classes) + self.emission_classifier.apply(self.init_weights) + self.linear_chain_crf = Linear_chain_crf( + size=self.num_classes, param_attr=paddle.fluid.ParamAttr(name="liner_chain_crfw") + ) + self.crf_decoding = Crf_decoding(param_attr=paddle.fluid.ParamAttr(name="crfw_decode"), size=self.num_classes) + self.crf_decoding.weight = self.linear_chain_crf.weight + self.crfw = fluid.layers.create_parameter( + shape=[self.num_classes + 2, self.num_classes], dtype="float32", name="crfw" + ) + self.mask_crfw = fluid.layers.create_parameter( + shape=[self.num_classes + 2, self.num_classes], dtype="float32", name="mask_matrix" + ) + + def get_input_embeddings(self): + return self.layoutxlm.embeddings.word_embeddings + + def forward( + self, + input_ids=None, + bbox=None, + attention_mask=None, + token_type_ids=None, + labels=None, + image=None, + position_ids=None, + head_mask=None, + is_train=False, + ): + + input_ids = input_ids.squeeze(axis=1) + bbox = bbox.squeeze(axis=1) + attention_mask = attention_mask.squeeze(axis=1) + token_type_ids = token_type_ids.squeeze(axis=1) + outputs = self.layoutxlm( + input_ids=input_ids, + bbox=bbox, + image=image, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + ) + seq_length = input_ids.shape[1] + # sequence out and image out + sequence_logits, _ = outputs[0][:, :seq_length], outputs[0][:, seq_length:] + emission = self.emission_classifier(sequence_logits) + length = paddle.sum(attention_mask, axis=1) + labels = labels.reshape([-1, seq_length, 1]) + + # standard crf loss + crf_cost = self.linear_chain_crf(input=emission, label=labels, length=length) + crf_decode = self.crf_decoding(input=emission, length=length) + if is_train: + return [crf_cost] + else: + return [crf_cost, crf_decode] diff 
--git a/applications/document_intelligence/doc_vqa/Extraction/run_docvqa.py b/applications/document_intelligence/doc_vqa/Extraction/run_docvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..a0c1e5670fc4cbbf4b8270439d793f799c17fc07 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/run_docvqa.py @@ -0,0 +1,457 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import logging +import os +import random +import warnings +from collections import Counter + +import numpy as np +import paddle +from docvqa import DocVQA +from model import LayoutXLMForTokenClassification_with_CRF + +from paddlenlp.transformers import LayoutXLMModel, LayoutXLMTokenizer + +warnings.filterwarnings("ignore") +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + # yapf: disable + parser.add_argument("--model_name_or_path", default=None, type=str, required=True) + parser.add_argument("--do_train", default=False, type=bool, required=False) + parser.add_argument("--do_test", default=False, type=bool, required=False) + parser.add_argument("--test_file", default=None, type=str, required=False) + parser.add_argument("--train_file", default=None, type=str, required=False) + parser.add_argument("--output_dir", default=None, type=str, required=True) + parser.add_argument("--max_seq_len", default=512, type=int) + parser.add_argument("--max_query_length", default=20, type=int) + parser.add_argument("--max_doc_length", default=512, type=int) + parser.add_argument("--max_span_num", default=1, type=int) + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.") + parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument("--init_checkpoint", type=str, default=None, help="the initialized checkpoint") + parser.add_argument("--save_path", type=str, default=None, help="the initialized checkpoint") + # yapf: enable + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + 
np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_label_maps(): + labels = ["O", "I-ans", "B-ans", "E-ans"] + label2id_map = {label: idx for idx, label in enumerate(labels)} + id2label_map = {idx: label for idx, label in enumerate(labels)} + return label2id_map, id2label_map + + +def main(args): + os.makedirs(args.output_dir, exist_ok=True) + logging.basicConfig( + filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, + ) + + ch = logging.StreamHandler() + ch.setLevel(logging.DEBUG) + logger.addHandler(ch) + + label2id_map, id2label_map = get_label_maps() + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + # dist mode + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) + + if args.do_test: + model = LayoutXLMForTokenClassification_with_CRF.from_pretrained(args.init_checkpoint) + evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, global_step=0) + exit(0) + + if args.init_checkpoint: + logger.info("Init checkpoint from {}".format(args.init_checkpoint)) + model = LayoutXLMForTokenClassification_with_CRF.from_pretrained(args.init_checkpoint) + else: + base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) + model = LayoutXLMForTokenClassification_with_CRF(base_model, num_classes=len(label2id_map), dropout=None) + + # dist mode + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_dataset = DocVQA( + args, + tokenizer, + label2id_map, + max_seq_len=args.max_seq_len, + max_query_length=args.max_query_length, + max_doc_length=args.max_doc_length, + max_span_num=args.max_span_num, + ) + + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=False + ) + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + + train_dataloader = paddle.io.DataLoader( + train_dataset, batch_sampler=train_sampler, num_workers=0, use_shared_memory=True, collate_fn=None + ) + + t_total = len(train_dataloader) * args.num_train_epochs + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, args.warmup_steps, start_lr=0, end_lr=args.learning_rate + ) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + ) + + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info( + " Total train batch size (w. 
parallel, distributed) = %d", + args.train_batch_size * paddle.distributed.get_world_size(), + ) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss = 0.0 + set_seed(args) + for epoch_id in range(args.num_train_epochs): + print("epoch id:{}".format(epoch_id)) + for step, batch in enumerate(train_dataloader): + model.train() + input_ids, input_mask, segment_ids, bboxes, labels = batch + if input_ids.shape[0] != args.per_gpu_train_batch_size: + continue + outputs = model( + input_ids=input_ids, + bbox=bboxes, + attention_mask=input_mask, + token_type_ids=segment_ids, + labels=labels, + is_train=True, + ) + # model outputs are always tuple in paddlenlp (see doc) + loss = outputs[0] + loss = loss.mean() + if global_step % 50 == 0: + logger.info( + "[epoch {}/{}][iter: {}/{}] lr: {:.5f}, train loss: {:.5f}, ".format( + epoch_id, + args.num_train_epochs, + step, + len(train_dataloader), + lr_scheduler.get_lr(), + float(loss), + ) + ) + + loss.backward() + tr_loss += loss.item() + optimizer.step() + lr_scheduler.step() # Update learning rate schedule + optimizer.clear_grad() + global_step += 1 + + if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + +def _tokenize_chinese_chars(text): + """ + :param text: input text, unicode string + :return: + tokenized text, list + """ + + def _is_chinese_char(cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
+ if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + output = [] + buff = "" + for char in text: + cp = ord(char) + if _is_chinese_char(cp) or char == "=": + if buff != "": + output.append(buff) + buff = "" + output.append(char) + else: + buff += char + + if buff != "": + output.append(buff) + + return output + + +def fast_f1(text1, text2): + common_char = Counter(text1) & Counter(text2) + len_seq1 = len(text1) + len_seq2 = len(text2) + len_common = sum(common_char.values()) + if len_common == 0: + return 0.0 + precision = 1.0 * len_common / len_seq2 + recall = 1.0 * len_common / len_seq1 + return (2.0 * precision * recall) / (precision + recall) + + +def _normalize(in_str): + """ + normalize the input unicode string + """ + in_str = in_str.lower() + sp_char = [ + ":", + "_", + "`", + ",", + "。", + ":", + "?", + "!", + "(", + ")", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + ",", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + "|", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + +def calc_f1_score(answer, prediction): + ans_segs = _tokenize_chinese_chars(_normalize(answer)) + prediction_segs = _tokenize_chinese_chars(_normalize(prediction)) + f1 = fast_f1(prediction_segs, ans_segs) + return f1 + + +def decode(tokenizer, res): + sep_id = tokenizer._convert_token_to_id("") + text_res = [] + all_f1 = [] + save_f1 = [] + for i in range(len(res)): + input_ids, label_ids, predict_ids, bbox = res[i] + remove_pos = ( + len(" ".join([str(x) for x in input_ids]).split("2 6 ")[0].strip(" ").split(" ")) + 2 + ) # remove the question bbox and sep bbox + start_pos = input_ids.index(sep_id) + query_text = [] + for idx in range(1, start_pos): + input_id = input_ids[idx] + query_text.append(tokenizer._convert_id_to_token(int(input_id))) + + # label texts and predict texts + text_label, text_predict = [], [] + label_bbox_index, predict_bbox_index = [], [] + for idx in range(start_pos + 1, len(input_ids)): + input_id, label_id, predict_id = input_ids[idx], label_ids[idx], predict_ids[idx] + + if label_id in [1, 2, 3]: + text_label.append(tokenizer._convert_id_to_token(int(input_id))) + label_bbox_index.append(idx - remove_pos + 1) + if predict_id in [1, 2, 3]: + text_predict.append(tokenizer._convert_id_to_token(int(input_id))) + predict_bbox_index.append(idx - remove_pos + 1) + text_res.append( + ["".join(query_text), "".join(text_label), "".join(text_predict), label_bbox_index, predict_bbox_index] + ) + + f1 = calc_f1_score("".join(text_label), "".join(text_predict)) + save_f1.append(f1) + + if len("".join(text_label)) > 10: + all_f1.append(f1) + if len(all_f1) > 0: + print("F1: ", sum(all_f1) / len(all_f1)) + + assert len(text_res) == len(save_f1) + return text_res + + +def evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, prefix="", global_step=0): + + eval_dataset = DocVQA( + args, tokenizer, label2id_map, max_seq_len=512, max_query_length=20, max_doc_length=512, max_span_num=1 + ) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) + + eval_dataloader = paddle.io.DataLoader( + eval_dataset, 
batch_size=args.eval_batch_size, num_workers=0, use_shared_memory=True, collate_fn=None + ) + + # Eval! + logger.info("***** Running evaluation %s *****", prefix) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + model.eval() + res = [] + for idx, batch in enumerate(eval_dataloader): + with paddle.no_grad(): + input_ids, input_mask, segment_ids, bboxes, labels = batch + + if input_ids.shape[0] != args.eval_batch_size: + continue + outputs = model( + input_ids=input_ids, + bbox=bboxes, + attention_mask=input_mask, + token_type_ids=segment_ids, + labels=labels, + is_train=False, + ) + labels = labels.numpy() + crf_decode = outputs[1].numpy() + bboxes = bboxes.squeeze().numpy() + input_ids = input_ids.squeeze(axis=1).numpy() + + for index in range(input_ids.shape[0]): + res.append([list(input_ids[index]), list(labels[index]), list(crf_decode[index]), bboxes[index]]) + + origin_inputs = [] + with open(args.test_file, "r", encoding="utf8") as f: + for line in f: + line = json.loads(line.strip()) + origin_inputs.append( + { + "img_name": line["img_name"], + "question": line["question"], + "bboxes": line["document_bbox"], + "img_id": line["img_id"], + } + ) + + text_res = decode(tokenizer, res) + + with open(args.save_path, "w", encoding="utf8") as f: + for line_res, line_text, line_label in zip(res, text_res, origin_inputs): + line_json = {} + line_json["img_name"] = line_label["img_name"] + line_json["img_id"] = line_label["img_id"] + line_json["question"] = line_label["question"] + line_json["label_answer"] = line_text[1] + line_json["predict_answer"] = line_text[2] + label_bbox_index, predict_bbox_index = line_text[3], line_text[4] + label_bboxes, predict_bboxes = [], [] + for i in range(len(line_label["bboxes"])): + if i in label_bbox_index: + label_bboxes.append(line_label["bboxes"][i]) + if i in predict_bbox_index: + predict_bboxes.append(line_label["bboxes"][i]) + line_json["label_bboxes"] = label_bboxes + line_json["predict_bboxes"] = predict_bboxes + json.dump(line_json, f, ensure_ascii=False) + f.write("\n") + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + main(args) diff --git a/applications/document_intelligence/doc_vqa/Extraction/run_test.sh b/applications/document_intelligence/doc_vqa/Extraction/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..f2e0df79e5f74638f721e07d6bdad540b1f2edd8 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/run_test.sh @@ -0,0 +1,38 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +export CUDA_VISIBLE_DEVICES=0 + +QUESTION=$1 + +python3 change_to_mrc.py ${QUESTION} + +python3 ./run_docvqa.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_len 512 \ + --do_test true \ + --test_file "data/demo_test.json" \ + --num_train_epochs 100 \ + --eval_steps 6000 \ + --save_steps 6000 \ + --output_dir "output/" \ + --save_path "data/decode_res.json" \ + --init_checkpoint "./checkpoints/layoutxlm/" \ + --learning_rate 3e-5 \ + --warmup_steps 12000 \ + --per_gpu_train_batch_size 4 \ + --per_gpu_eval_batch_size 1 \ + --seed 2048 + +python3 view.py diff --git a/applications/document_intelligence/doc_vqa/Extraction/run_train.sh b/applications/document_intelligence/doc_vqa/Extraction/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..1a5370643aad91a357062f558fa1da1a3c48a1f4 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/run_train.sh @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python3 ./run_docvqa.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_len 512 \ + --train_file "data/train.json" \ + --init_checkpoint "checkpoints/base_model" \ + --do_train true \ + --num_train_epochs 50 \ + --eval_steps 24000 \ + --save_steps 40 \ + --output_dir "output" \ + --save_path "data/decode_res.json" \ + --learning_rate 3e-5 \ + --warmup_steps 40 \ + --per_gpu_train_batch_size 4 \ + --per_gpu_eval_batch_size 4 \ + --seed 2048 diff --git a/applications/document_intelligence/doc_vqa/Extraction/view.py b/applications/document_intelligence/doc_vqa/Extraction/view.py new file mode 100644 index 0000000000000000000000000000000000000000..8a3f35a068db2357a957990ac7e3ba0c903d62d8 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/view.py @@ -0,0 +1,67 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import cv2 +import json +import numpy as np + + +def view_ocr_result(img_path, bboxes, opath): + image = cv2.imread(img_path) + for char_bbox in bboxes: + x_min, x_max, y_min, y_max = char_bbox + cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 0, 255), 1) + cv2.imwrite(opath, image) + + +def _highlight_bbox(img, bbox): + x = bbox[0] + w = bbox[1] - x + y = bbox[2] + h = bbox[3] - y + sub_img = img[y : y + h, x : x + w] + colored_rect = np.zeros(sub_img.shape, dtype=np.uint8) + colored_rect[:, :, 2] = 255 + colored_rect[:, :, 1] = 255 + res = cv2.addWeighted(sub_img, 0.5, colored_rect, 0.5, 1.0) + img[y : y + h, x : x + w] = res + + +def highlight_ans(source_img_path, output_img_path, ans_bbox): + image = cv2.imread(source_img_path) + for bbox in ans_bbox: + _highlight_bbox(image, bbox) + cv2.imwrite(output_img_path, image) + + +def highlight_img(source_img_path, output_img_path): + image = cv2.imread(source_img_path) + height = image.shape[0] + width = image.shape[1] + bbox = [0, width - 1, 0, height - 1] + _highlight_bbox(image, bbox) + cv2.imwrite(output_img_path, image) + + +if __name__ == "__main__": + res_path = "./data/decode_res.json" + result = {} + with open(res_path, "r", encoding="utf-8") as f: + line = f.readline() + result = json.loads(line.strip()) + + img_path = "../OCR_process/demo_pics/demo_{}.png".format(result["img_id"]) + img_save_path = "../answer.png" + highlight_ans(img_path, img_save_path, result["predict_bboxes"]) + print("extraction result has been saved to answer.png") diff --git a/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py b/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py new file mode 100644 index 0000000000000000000000000000000000000000..0444e559d580c422effd408f0a1632146235c599 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py @@ -0,0 +1,272 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os +import re + +from paddleocr import PaddleOCR + +from paddlenlp.transformers import LayoutXLMTokenizer + +tokenizer = LayoutXLMTokenizer.from_pretrained("layoutxlm-base-uncased") + + +def get_all_chars(tokenizer): + all_chr = [] + for i in range(30000): + tok_chr = tokenizer.tokenize(chr(i)) + tok_chr = [tc.replace("▁", "") for tc in tok_chr] + while "" in tok_chr: + tok_chr.remove("") + tok_chr = "".join(tok_chr) + if len(tok_chr) != 1: + all_chr.append(i) + return all_chr + + +def merge_bbox(tok_bboxes): + min_gx = min([box[0] for box in tok_bboxes]) + max_gx = max([box[1] for box in tok_bboxes]) + min_gy = min([box[2] for box in tok_bboxes]) + max_gy = max([box[3] for box in tok_bboxes]) + height_g = max_gy - min_gy + width_g = max_gx - min_gx + height_m = 0 + width_m = 0 + for box in tok_bboxes: + x_min, x_max, y_min, y_max = box + height_m += y_max - y_min + width_m += x_max - x_min + height_m = height_m / len(tok_bboxes) + if (height_g - height_m) < 0.5 * height_m and width_g - width_m < 0.1 * width_m: + return False, [min_gx, max_gx, min_gy, max_gy] + else: + return True, tok_bboxes[0] + + +def xlm_parse(ocr_res, tokenizer): + doc_bboxes = [] + all_chr = get_all_chars(tokenizer) + + try: + new_tokens, new_token_boxes = [], [] + for item in ocr_res: + new_tokens.extend(item["tokens"]) + new_token_boxes.extend(item["token_box"]) + + # get layoutxlm tokenizer results and get the final results + temp_span_text = "".join(new_tokens) + temp_span_bbox = new_token_boxes + span_text = "" + span_bbox = [] + # drop blank space + for text, bbox in zip(temp_span_text, temp_span_bbox): + if text == " ": + continue + else: + span_text += text + span_bbox += [bbox] + + # span_tokens starts with "_" + span_tokens = tokenizer.tokenize(span_text) + span_tokens[0] = span_tokens[0].replace("▁", "") + while "" in span_tokens: + span_tokens.remove("") + + doc_bboxes = [] + i = 0 + for tid, tok in enumerate(span_tokens): + tok = tok.replace("▁", "") + if tok == "": + doc_bboxes.append(span_bbox[i]) + continue + if tok == "": + if tid + 1 == len(span_tokens): + tok_len = 1 + else: + if span_tokens[tid + 1] == "": + tok_len = 1 + else: + for j in range(i, len(span_text)): + if span_text[j].lower() == span_tokens[tid + 1][0]: + break + tok_len = j - i + elif ord(span_text[i]) in all_chr: + if tid + 1 == len(span_tokens): + tok_len = 1 + elif "°" in tok and "C" in span_tokens[tid + 1]: + tok_len = len(tok) - 1 + if tok_len == 0: + doc_bboxes.append(span_bbox[i]) + continue + elif span_text[i] == "ⅱ": + if tok == "ii": + if span_text[i + 1] != "i": + tok_len = len(tok) - 1 + else: + tok_len = len(tok) + elif tok == "i": + tok_len = len(tok) - 1 + if tok_len == 0: + doc_bboxes.append(span_bbox[i]) + continue + elif "m" in tok and "2" == span_tokens[tid + 1][0]: + tok_len = len(tok) - 1 + if tok_len == 0: + doc_bboxes.append(span_bbox[i]) + continue + elif ord(span_text[i + 1]) in all_chr: + tok_len = 1 + else: + for j in range(i, len(span_text)): + if span_text[j].lower() == span_tokens[tid + 1][0]: + break + if span_text[j].lower() == "," and span_tokens[tid + 1][0] == ",": + break + if span_text[j].lower() == ";" and span_tokens[tid + 1][0] == ";": + break + if span_text[j].lower() == ")" and span_tokens[tid + 1][0] == ")": + break + if span_text[j].lower() == "(" and span_tokens[tid + 1][0] == "(": + break + if span_text[j].lower() == "¥" and span_tokens[tid + 1][0] == "¥": + break + + tok_len = j - i + + else: + if "�" == span_text[i]: + tok_len = len(tok) + 1 + elif tok == "......" 
and "…" in span_text[i : i + 6]: + tok_len = len(tok) - 2 + elif "ⅱ" in span_text[i + len(tok) - 1]: + if tok == "i": + tok_len = 1 + else: + tok_len = len(tok) - 1 + elif "°" in tok and "C" in span_tokens[tid + 1]: + tok_len = len(tok) - 1 + else: + tok_len = len(tok) + + assert i + tok_len <= len(span_bbox) + tok_bboxes = span_bbox[i : i + tok_len] + _, merged_bbox = merge_bbox(tok_bboxes) + + doc_bboxes.append(merged_bbox) + i = i + tok_len + except Exception: + print("Error") + span_tokens = ["▁"] * 512 + doc_bboxes = [[0, 0, 0, 0]] * 512 + + return span_tokens, doc_bboxes + + +def tokenize_ocr_res(ocr_reses): + """ + input: + ocr_res: the ocr result of the image + return: + new_reses: { + pid: { + "text": all text in each ocr_res, + "bounding_box": the bounding box of the ocr_res, + "tokens": all chars in ocr_res, + "token_box: bounding box of each chars in ocr_res + } + } + """ + new_reses = [] + for img_name, ocr_res in ocr_reses: + new_res = [] + for para in ocr_res: + text = para["text"] + text_box = para["bbox"] + x_min, y_min = [int(min(idx)) for idx in zip(*text_box)] + x_max, y_max = [int(max(idx)) for idx in zip(*text_box)] + text_chars = list(text.lower()) + char_num = 0 + for char in text_chars: + if re.match("[^\x00-\xff]", char): + char_num += 2 + else: + char_num += 1 + width = x_max - x_min + shift = x_min + new_token_boxes, new_tokens = [], [] + for char in text_chars: + if re.match("[^\x00-\xff]", char): + tok_x_max = shift + width / char_num * 2 + else: + tok_x_max = shift + width / char_num * 1 + tok_x_min = shift + tok_y_min = y_min + tok_y_max = y_max + + shift = tok_x_max + new_token_boxes.append([round(tok_x_min), round(tok_x_max), tok_y_min, tok_y_max]) + new_tokens.append(char) + new_res.append( + { + "text": para["text"], + "bounding_box": para["bbox"], + "tokens": new_tokens, + "token_box": new_token_boxes, + } + ) + new_reses.append((img_name, new_res)) + return new_reses + + +def process_input(ocr_reses, tokenizer, save_ocr_path): + ocr_reses = tokenize_ocr_res(ocr_reses) + + examples = [] + for img_name, ocr_res in ocr_reses: + doc_tokens, doc_bboxes = xlm_parse(ocr_res, tokenizer) + doc_tokens.insert(0, "▁") + doc_bboxes.insert(0, doc_bboxes[0]) + example = {"img_name": img_name, "document": doc_tokens, "document_bbox": doc_bboxes} + examples.append(example) + + with open(save_ocr_path, "w", encoding="utf8") as f: + for example in examples: + json.dump(example, f, ensure_ascii=False) + f.write("\n") + + print(f"ocr parsing results has been save to: {save_ocr_path}") + + +def ocr_preprocess(img_dir): + ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=True) + ocr_reses = [] + img_names = sorted(os.listdir(img_dir), key=lambda x: int(x.split("_")[1].split(".")[0])) + for img_name in img_names: + img_path = os.path.join(img_dir, img_name) + parsing_res = ocr.ocr(img_path, cls=True)[0] + ocr_res = [] + for para in parsing_res: + ocr_res.append({"text": para[1][0], "bbox": para[0]}) + ocr_reses.append((img_name, ocr_res)) + + return ocr_reses + + +if __name__ == "__main__": + img_dir = "./demo_pics" + save_path = "./demo_ocr_res.json" + ocr_results = ocr_preprocess(img_dir) + process_input(ocr_results, tokenizer, save_path) diff --git a/applications/document_intelligence/doc_vqa/README.md b/applications/document_intelligence/doc_vqa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a2ebdfc42f3018b3a0aaf5da1e04537ab22662cb --- /dev/null +++ b/applications/document_intelligence/doc_vqa/README.md @@ -0,0 +1,131 @@ +# 
汽车说明书跨模态智能问答 + +## 1. 项目说明 + +**跨模态文档问答** 是跨模态的文档抽取任务,要求文档智能模型在文档中抽取能够回答文档相关问题的答案,需要模型在抽取和理解文档中文本信息的同时,还能充分利用文档的布局、字体、颜色等视觉信息,这比单一模态的信息抽取任务更具挑战性。 + +这种基于跨模态文档阅读理解技术的智能问答能力,可以深度解析非结构化文档中排版复杂的图文/图表内容,直接定位问题答案。 + +本项目将基于跨模态文档问答技术实现**汽车说明书问答系统**,该系统能够对用户提出的问题,自动从汽车说明书中寻找答案并进行回答。 + +如下图所示, 用户提出问题:"如何更换前风窗玻璃的刮水片",跨模态文档问答引擎将从库中寻找相关的文档,然后通过跨模态阅读理解模型抽取出相应的答案,并进行了高亮展示。 + +
image
+ +通过使用汽车说明书问答系统,能够极大地缓解传统汽车售后的压力: +- 用户:用户没有耐心查阅说明书,打客服电话需要等待 +- 售后客服:需要配置大量客服人员,且客服专业知识培训周期长 +- 构建问题库:需要投入大量人力整理常见问题库,并且固定的问题库难以覆盖灵活多变的提问 + +对于用户来说,汽车说明书问答系统能够支持通过车机助手/APP/小程序为用户提供即问即答的功能。对于常见问题,用户不再需要查阅说明书,也无需打客服电话,从而缓解了人工客服的压力。 + +对于客服来讲,汽车说明书问答系统帮助客服人员快速定位答案,高效查阅文档,提高客服的专业水平,同时也能够缩短客服的培训周期。 + +## 2. 安装说明 + +#### 环境要求 + +- paddlepaddle == 2.3.2 +- paddlenlp == 2.5.2 +- paddleocr == 2.6.1.3 + +安装相关问题可参考[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)和[PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/get_started/installation.html)文档。 + + +## 3. 整体流程 + +汽车说明书问答系统针对用户提出的汽车使用相关问题,智能化地在汽车说明书中找出对应答案,并返回给用户。本项目提供的汽车说明书问答系统的使用流程如下图所示,主要包括 3 个模块:OCR处理模块、排序模块和跨模态阅读理解模块。 + +在使用汽车说明书问答模型回答问题之前,需要先使用PaddleOCR对离线提供的汽车说明书文档进行解析,并将解析结果保存下来,以备后续排序模块使用。 + +用户提出的问题首先会被传入排序模块,排序模块会针对该问题对解析的文档进行排序打分,其结果将会被传入跨模态阅读理解模块。阅读理解模块将从分数最高的说明书文档中,抽取用户问题的答案,并返回给用户。 + +
image
+ +下面将具体介绍各个模块的功能。 + +## 4. OCR处理模块 + +本项目提供了包含10张图片的汽车说明书,为方便后续处理,首先需要通过 PaddleOCR 对汽车说明书进行识别,记录汽车说明书上的文字和文字布局信息, 以方便后续使用计算机视觉和自然语言处理方面的技术进行问答任务。 + +本项目提供的汽车说明书图片可点击[这里](https://paddlenlp.bj.bcebos.com/images/applications/automobile.tar.gz)进行下载,下载后解压放至 `./OCR_process/demo_pics` 目录下,然后通过如下命令,使用 PaddleOCR 对图片进行解析。 + +```shell +cd OCR_process/ +python3 ocr_process.py +cd .. +``` + +解析后的结果存放至 `./OCR_process/demo_ocr_res.json` 中。 + +## 5. 排序模块 +对于用户提出的问题,如果从所有的汽车说明书图片中去寻找答案会比较耗时且耗费资源。因此这里使用了一个基于[RocketQA](https://arxiv.org/pdf/2010.08191.pdf)的排序模块,该模块将根据用户提出的问题对汽车说明书的不同图片进行打分排序,这样便可以获取和问题最相关的图片,并使用跨模态阅读理解模块在该问题上进行抽取答案。 + +本项目提供了140条汽车说明书相关的训练样本,用于排序模型的训练, 同时也提供了一个基于RocketQA的预先训练好的基线模型 base_model。 本模块可以使用 base_model 在汽车说明书训练样本上进一步微调。 + +其中,汽车说明书的训练集可点击[这里](https://paddlenlp.bj.bcebos.com/data/automobile_rerank_train.tsv) 进行下载,下载后将其重命名为 `train.tsv` ,存放至 `./Rerank/data/` 目录下。 + +同时,base_model 是 [Dureader retrieval](https://arxiv.org/abs/2203.10232) 数据集训练的排序模型, 可点击[这里](https://paddlenlp.bj.bcebos.com/models/base_ranker.tar.gz) 进行下载,解压后可获得包含模型的目录 `base_model`,将其放至 `./Rerank/checkpoints` 目录下。 + +可使用如下代码进行训练: + +```shell +cd Rerank +bash run_train.sh ./data/train.tsv ./checkpoints/base_model 50 1 +cd .. +``` +其中,参数依次为训练数据地址,base_model 地址,训练轮次,节点数。 + +在模型训练完成后,可将模型重命名为 `ranker` 存放至 `./checkpoints/` 目录下,接下来便可以使用如下命令,根据给定的汽车说明书相关问题,对汽车说明书的图片进行打分。代码如下: + +```shell +cd Rerank +bash run_test.sh 后备箱怎么开 +cd .. +``` + +其中,后一项为用户问题,命令执行完成后,分数文件将会保存至 `./Rerank/data/demo.score` 中。 + + +## 6. 跨模态阅读理解模块 +本项目首先获取排序模块输出的结果中评分最高的图片,然后将会使用跨模态的语言模型 LayoutXLM 从该图片中去抽取用户提问的答案。在获取答案后,将会对答案在该图片中进行高亮显示并返回用户。 + +本项目提供了28条汽车说明书相关的训练样本,用于跨模态阅读理解模型的训练, 同时也提供了一个预先训练好的基线模型 base_model。 本模块可以使用 base_model 在汽车说明书训练样本上进一步微调,增强模型对汽车说明书领域的理解。 + +其中,汽车说明书的阅读理解训练集可点击[这里](https://paddlenlp.bj.bcebos.com/data/automobile_mrc_train.json) 进行下载,下载后将其重命名为 `train.json`,存放至 `./Extraction/data/` 目录下。 + +同时,base_model 是 [Dureader VIS](https://aclanthology.org/2022.findings-acl.105.pdf) 数据集训练的跨模态阅读理解模型, 可点击[这里](https://paddlenlp.bj.bcebos.com/models/base_mrc.tar.gz) 进行下载,解压后可获得包含模型的目录 `base_model`,将其放至 `./Extraction/checkpoints` 目录下。 + +可使用如下代码进行训练: + +```shell +cd Extraction +bash run_train.sh +cd .. +``` + +在模型训练完成后,可将模型重命名为 `layoutxlm` 存放至 `./checkpoints/` 目录下,接下来便可以使用如下命令,根据给定的汽车说明书相关问题,从得分最高的汽车说明书图片中抽取答案。代码如下: + +```shell +cd Extraction +bash run_test.sh 后备箱怎么开 +cd .. +``` + +其中,后一项为用户问题,命令执行完成后,最终结果将会保存至 `./answer.png` 中。 + +## 7. 全流程预测 +本项目提供了全流程预测的功能,可通过如下命令进行一键式预测: + +```shell +bash run_test.sh 后备箱怎么开 +``` + +其中,后一项参数为用户问题,最终结果将会保存至 `./answer.png` 中。 + +**备注**:在运行命令前,请确保已使用第4节介绍的命令对原始汽车说明书图片完成了文档解析。 + + +下图展示了用户提问的三个问题:"后备箱怎么开","钥匙怎么充电" 和 "NFC解锁注意事项", 可以看到,本项目的汽车说明书问答系统能够精准地找到答案并进行高亮显示。 + +
diff --git a/applications/document_intelligence/doc_vqa/Rerank/change_to_rerank.py b/applications/document_intelligence/doc_vqa/Rerank/change_to_rerank.py new file mode 100644 index 0000000000000000000000000000000000000000..7136d15327244e4f7f1984d55c6f999e3a5373be --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/change_to_rerank.py @@ -0,0 +1,33 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import json + +question = sys.argv[1] + +with open("../OCR_process/demo_ocr_res.json", "r", encoding="utf8") as f: + paras = [] + for line in f: + line = json.loads(line.strip()) + document = line["document"] + para = [] + for token in document: + token = token.replace("▁", "") + para.append(token) + paras.append("".join(para)) + +with open("./data/demo.tsv", "w", encoding="utf8") as f: + for para in paras: + f.write("{}\t\t{}\t0\n".format(question, para)) diff --git a/applications/document_intelligence/doc_vqa/Rerank/config/ernie_base_1.0_CN/vocab.txt b/applications/document_intelligence/doc_vqa/Rerank/config/ernie_base_1.0_CN/vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..5db20b3b96fb86ef2aec3b783e12e17041a02d45 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/config/ernie_base_1.0_CN/vocab.txt @@ -0,0 +1,17964 @@ +[PAD] +[CLS] +[SEP] +[MASK] +, +的 +、 +一 +人 +有 +是 +在 +中 +为 +和 +了 +不 +年 +学 +大 +国 +生 +以 +“ +” +作 +业 +个 +上 +用 +, +地 +会 +成 +发 +工 +时 +于 +理 +出 +行 +要 +. 
+等 +他 +到 +之 +这 +可 +后 +家 +对 +能 +公 +与 +》 +《 +主 +方 +分 +经 +来 +全 +其 +部 +多 +产 +自 +文 +高 +动 +进 +法 +化 +: +我 +面 +) +( +实 +教 +建 +体 +而 +长 +子 +下 +现 +开 +本 +力 +定 +性 +过 +设 +合 +小 +同 +机 +市 +品 +水 +新 +内 +事 +也 +种 +及 +制 +入 +所 +心 +务 +就 +管 +们 +得 +展 +重 +民 +加 +区 +物 +者 +通 +天 +政 +三 +电 +关 +度 +第 +名 +术 +最 +系 +月 +外 +资 +日 +代 +员 +如 +间 +位 +并 +书 +科 +村 +应 +量 +道 +前 +当 +无 +里 +相 +平 +从 +计 +提 +保 +任 +程 +技 +都 +研 +十 +基 +特 +好 +被 +或 +目 +将 +使 +山 +二 +说 +数 +点 +明 +情 +元 +着 +收 +组 +然 +美 +各 +由 +场 +金 +形 +农 +期 +因 +表 +此 +色 +起 +还 +立 +世 +安 +活 +专 +质 +1 +规 +社 +万 +信 +西 +统 +结 +路 +利 +次 +南 +式 +意 +级 +常 +师 +校 +你 +育 +果 +究 +司 +服 +门 +海 +导 +流 +项 +她 +总 +处 +两 +传 +东 +正 +省 +院 +户 +手 +具 +2 +原 +强 +北 +向 +先 +但 +米 +城 +企 +件 +风 +军 +身 +更 +知 +已 +气 +战 +至 +单 +口 +集 +创 +解 +四 +标 +交 +比 +商 +论 +界 +题 +变 +花 +3 +改 +类 +运 +指 +型 +调 +女 +神 +接 +造 +受 +广 +只 +委 +去 +共 +治 +达 +持 +条 +网 +头 +构 +县 +些 +该 +又 +那 +想 +样 +办 +济 +5 +格 +责 +车 +很 +施 +求 +己 +光 +精 +林 +完 +爱 +线 +参 +少 +积 +清 +看 +优 +报 +王 +直 +没 +每 +据 +游 +效 +感 +五 +影 +别 +获 +领 +称 +选 +供 +乐 +老 +么 +台 +问 +划 +带 +器 +源 +织 +放 +深 +备 +视 +白 +功 +取 +装 +营 +见 +记 +环 +队 +节 +准 +石 +它 +回 +历 +负 +真 +增 +医 +联 +做 +职 +容 +士 +包 +义 +观 +团 +病 +4 +府 +息 +则 +考 +料 +华 +州 +语 +证 +整 +让 +江 +史 +空 +验 +需 +支 +命 +给 +离 +认 +艺 +较 +土 +古 +养 +才 +境 +推 +把 +均 +图 +际 +斯 +近 +片 +局 +修 +字 +德 +权 +步 +始 +复 +转 +协 +即 +打 +画 +投 +决 +何 +约 +反 +quot +费 +议 +护 +极 +河 +房 +查 +布 +思 +干 +价 +儿 +非 +马 +党 +奖 +模 +故 +编 +音 +范 +识 +率 +存 +引 +客 +属 +评 +采 +尔 +配 +镇 +室 +再 +案 +监 +习 +注 +根 +克 +演 +食 +族 +示 +球 +状 +青 +号 +张 +百 +素 +首 +易 +热 +阳 +今 +园 +防 +版 +太 +乡 +英 +6 +材 +列 +便 +写 +住 +置 +层 +助 +确 +试 +难 +承 +象 +居 +10 +黄 +快 +断 +维 +却 +红 +速 +连 +众 +0 +细 +态 +话 +周 +言 +药 +培 +血 +亩 +龙 +越 +值 +几 +边 +读 +未 +曾 +测 +算 +京 +景 +余 +站 +低 +温 +消 +必 +切 +依 +随 +且 +志 +卫 +域 +照 +许 +限 +著 +销 +落 +足 +适 +争 +策 +8 +控 +武 +按 +7 +初 +角 +核 +死 +检 +富 +满 +显 +审 +除 +致 +亲 +占 +失 +星 +章 +善 +续 +千 +叶 +火 +副 +告 +段 +什 +声 +终 +况 +走 +木 +益 +戏 +独 +纪 +植 +财 +群 +六 +赛 +远 +拉 +亚 +密 +排 +超 +像 +课 +围 +往 +响 +击 +疗 +念 +八 +云 +险 +律 +请 +革 +诗 +批 +底 +压 +双 +男 +训 +例 +汉 +升 +拥 +势 +酒 +眼 +官 +牌 +油 +曲 +友 +望 +黑 +歌 +筑 +础 +香 +仅 +担 +括 +湖 +严 +秀 +剧 +九 +举 +执 +充 +兴 +督 +博 +草 +般 +李 +健 +喜 +授 +普 +预 +灵 +突 +良 +款 +罗 +9 +微 +七 +录 +朝 +飞 +宝 +令 +轻 +劳 +距 +异 +简 +兵 +树 +序 +候 +含 +福 +尽 +留 +20 +丰 +旅 +征 +临 +破 +移 +篇 +抗 +典 +端 +苏 +奇 +止 +康 +店 +毛 +觉 +春 +售 +络 +降 +板 +坚 +母 +讲 +早 +印 +略 +孩 +夫 +藏 +铁 +害 +互 +帝 +田 +融 +皮 +宗 +岁 +载 +析 +斗 +须 +伤 +12 +介 +另 +00 +半 +班 +馆 +味 +楼 +卡 +射 +述 +杀 +波 +绿 +免 +兰 +绝 +刻 +短 +察 +输 +择 +综 +杂 +份 +纳 +父 +词 +银 +送 +座 +左 +继 +固 +宣 +厂 +肉 +换 +补 +税 +派 +套 +欢 +播 +吸 +圆 +攻 +阿 +购 +听 +右 +减 +激 +巴 +背 +够 +遇 +智 +玉 +找 +宽 +陈 +练 +追 +毕 +彩 +软 +帮 +股 +荣 +托 +予 +佛 +堂 +障 +皇 +若 +守 +似 +届 +待 +货 +散 +额 +30 +尚 +穿 +丽 +骨 +享 +差 +针 +索 +稳 +宁 +贵 +酸 +液 +唐 +操 +探 +玩 +促 +笔 +库 +救 +虽 +久 +闻 +顶 +床 +港 +鱼 +亿 +登 +11 +永 +毒 +桥 +冷 +魔 +秘 +陆 +您 +童 +归 +侧 +沙 +染 +封 +紧 +松 +川 +刘 +15 +雄 +希 +毫 +卷 +某 +季 +菜 +庭 +附 +逐 +夜 +宫 +洲 +退 +顾 +尼 +胜 +剂 +纯 +舞 +遗 +苦 +梦 +挥 +航 +愿 +街 +招 +矿 +夏 +盖 +献 +怎 +茶 +申 +39 +吧 +脑 +亦 +吃 +频 +宋 +央 +威 +厚 +块 +冲 +叫 +熟 +礼 +厅 +否 +渐 +笑 +钱 +钟 +甚 +牛 +丝 +靠 +岛 +绍 +盘 +缘 +聚 +静 +雨 +氏 +圣 +顺 +唱 +刊 +阶 +困 +急 +饰 +弹 +庄 +既 +野 +阴 +混 +饮 +损 +齐 +末 +错 +轮 +宜 +鲜 +兼 +敌 +粉 +祖 +延 +100 +钢 +辑 +欧 +硬 +甲 +诉 +册 +痛 +订 +缺 +晚 +衣 +佳 +脉 +gt +盛 +乎 +拟 +贸 +扩 +船 +仪 +谁 +警 +50 +停 +席 +竞 +释 +庆 +汽 +仍 +掌 +诸 +仙 +弟 +吉 +洋 +奥 +票 +危 +架 +买 +径 +塔 +休 +付 +恶 +雷 +怀 +秋 +借 +巨 +透 +誉 +厘 +句 +跟 +胞 +婚 +幼 +烈 +峰 +寻 +君 +汇 +趣 +纸 +假 +肥 +患 +杨 +雅 +罪 +谓 +亮 +脱 +寺 +烟 +判 +绩 +乱 +刚 +摄 +洞 +践 +码 +启 +励 +呈 +曰 +呢 +符 +哥 +媒 +疾 +坐 +雪 +孔 +倒 +旧 +菌 +岩 +鼓 +亡 +访 +症 +暗 +湾 +幸 +池 +讨 +努 +露 +吗 +繁 +途 +殖 +败 +蛋 +握 +刺 +耕 +洗 +沉 +概 +哈 +泛 +凡 +残 +隐 +虫 +朋 +虚 +餐 +殊 +慢 +询 +蒙 +孙 +谈 +鲁 +裂 +贴 +污 +漫 +谷 +违 +泉 +拿 +森 +横 +扬 +键 +膜 +迁 +尤 +涉 +净 +诚 +折 +冰 +械 +拍 +梁 +沿 +避 +吴 +惊 +犯 +灭 +湿 +迷 +姓 +阅 +灯 +妇 +触 +冠 +答 +俗 +档 +尊 +谢 +措 +筹 +竟 +韩 +签 +剑 +鉴 +灾 +贯 +迹 +洛 +沟 +束 +翻 +巧 +坏 +弱 +零 +壁 +枝 +映 +恩 +抓 +屋 +呼 +脚 +绘 +40 +淡 +辖 +2010 +伊 +粒 +欲 +震 +伯 +私 
+蓝 +甘 +储 +胡 +卖 +梅 +16 +耳 +疑 +润 +伴 +泽 +牧 +烧 +尾 +累 +糖 +怪 +唯 +莫 +粮 +柱 +18 +竹 +灰 +岸 +缩 +井 +伦 +柔 +盟 +珠 +丹 +amp +皆 +哪 +迎 +颜 +衡 +啊 +塑 +寒 +13 +紫 +镜 +25 +氧 +误 +伍 +彻 +刀 +览 +炎 +津 +耐 +秦 +尖 +潮 +描 +浓 +召 +禁 +阻 +胶 +译 +腹 +泰 +乃 +盐 +潜 +鸡 +诺 +遍 +2000 +纹 +冬 +牙 +麻 +辅 +猪 +弃 +楚 +羊 +晋 +14 +鸟 +赵 +洁 +谋 +隆 +滑 +60 +2008 +籍 +臣 +朱 +泥 +墨 +辆 +墙 +浪 +姐 +赏 +纵 +2006 +拔 +倍 +纷 +摩 +壮 +苗 +偏 +塞 +贡 +仁 +宇 +卵 +瓦 +枪 +覆 +殿 +刑 +贫 +妈 +幅 +幕 +忆 +丁 +估 +废 +萨 +舍 +详 +旗 +岗 +洪 +80 +贝 +2009 +迅 +凭 +勇 +雕 +奏 +旋 +杰 +煤 +阵 +乘 +溪 +奉 +畜 +挑 +昌 +硕 +庙 +惠 +薄 +逃 +爆 +哲 +浙 +珍 +炼 +栏 +暴 +币 +隔 +吨 +倾 +嘉 +址 +陶 +绕 +诊 +遭 +桃 +魂 +兽 +豆 +闲 +箱 +拓 +燃 +裁 +晶 +掉 +脂 +溶 +顿 +肤 +虑 +鬼 +2007 +灌 +徐 +龄 +陵 +恋 +侵 +坡 +寿 +勤 +磨 +妹 +瑞 +缓 +轴 +麦 +羽 +咨 +凝 +默 +驻 +敢 +债 +17 +浮 +幻 +株 +浅 +敬 +敏 +陷 +凤 +坛 +虎 +乌 +铜 +御 +乳 +讯 +循 +圈 +肌 +妙 +奋 +忘 +闭 +墓 +21 +汤 +忠 +2005 +跨 +怕 +振 +宾 +跑 +屏 +坦 +粗 +租 +悲 +伟 +拜 +24 +妻 +赞 +兄 +宿 +碑 +貌 +勒 +罚 +夺 +偶 +截 +纤 +2011 +齿 +郑 +聘 +偿 +扶 +豪 +慧 +跳 +the +疏 +莱 +腐 +插 +恐 +郎 +辞 +挂 +娘 +肿 +徒 +伏 +磁 +杯 +丛 +旨 +琴 +19 +炮 +醒 +砖 +替 +辛 +暖 +锁 +杜 +肠 +孤 +饭 +脸 +邮 +贷 +lt +俄 +毁 +荷 +谐 +荒 +肝 +链 +2004 +2012 +尺 +尘 +援 +a +疫 +崇 +恢 +扎 +伸 +幽 +抵 +胸 +谱 +舒 +迫 +200 +畅 +泡 +岭 +喷 +70 +窗 +捷 +宏 +肯 +90 +狂 +铺 +骑 +抽 +券 +俱 +徽 +胆 +碎 +邀 +褐 +斤 +涂 +赋 +署 +颗 +2003 +渠 +仿 +迪 +炉 +辉 +涵 +耗 +22 +返 +邻 +斑 +董 +魏 +午 +娱 +浴 +尿 +曼 +锅 +柳 +舰 +搭 +旁 +宅 +趋 +of +凉 +赢 +伙 +爷 +廷 +戴 +壤 +奶 +页 +玄 +驾 +阔 +轨 +朗 +捕 +肾 +稿 +惯 +侯 +乙 +渡 +稍 +恨 +脏 +2002 +姆 +腔 +抱 +杆 +垂 +赴 +赶 +莲 +辽 +荐 +旦 +妖 +2013 +稀 +驱 +沈 +役 +晓 +亭 +仲 +澳 +500 +炸 +绪 +28 +陕 +and +23 +恒 +堡 +纠 +仇 +懂 +焦 +搜 +s +忍 +贤 +添 +i +艾 +赤 +犹 +尝 +锦 +稻 +撰 +填 +衰 +栽 +邪 +粘 +跃 +桌 +胃 +悬 +c +翼 +彼 +睡 +曹 +刷 +摆 +悉 +锋 +26 +摇 +抢 +乏 +廉 +鼠 +盾 +瓷 +抑 +埃 +邦 +遂 +寸 +渔 +祥 +胎 +牵 +壳 +甜 +卓 +瓜 +袭 +遵 +巡 +逆 +玛 +韵 +2001 +桑 +酷 +赖 +桂 +郡 +肃 +仓 +寄 +塘 +瘤 +300 +碳 +搞 +燕 +蒸 +允 +忽 +斜 +穷 +郁 +囊 +奔 +昆 +盆 +愈 +递 +1000 +黎 +祭 +怒 +辈 +腺 +滚 +暂 +郭 +璃 +踪 +芳 +碍 +肺 +狱 +冒 +阁 +砂 +35 +苍 +揭 +踏 +颇 +柄 +闪 +孝 +葡 +腾 +茎 +鸣 +撤 +仰 +伐 +丘 +於 +泪 +荡 +扰 +纲 +拼 +欣 +纽 +癌 +堆 +27 +菲 +b +披 +挖 +寓 +履 +捐 +悟 +乾 +嘴 +钻 +拳 +吹 +柏 +遥 +抚 +忧 +赠 +霸 +艰 +淋 +猫 +帅 +奈 +寨 +滴 +鼻 +掘 +狗 +驶 +朴 +拆 +惜 +玻 +扣 +萄 +蔬 +宠 +2014 +缴 +赫 +凯 +滨 +乔 +腰 +葬 +孟 +吾 +枚 +圳 +忙 +扫 +杭 +凌 +1998 +梯 +丈 +隶 +1999 +剪 +盗 +擅 +疆 +弯 +携 +拒 +秒 +颁 +醇 +割 +浆 +姑 +爸 +螺 +穗 +缝 +慈 +喝 +瓶 +漏 +悠 +猎 +番 +孕 +伪 +漂 +腿 +吐 +坝 +滤 +函 +匀 +偷 +浩 +矛 +僧 +辨 +俊 +棉 +铸 +29 +诞 +丧 +夹 +to +姿 +睛 +淮 +阀 +姜 +45 +尸 +猛 +1997 +芽 +账 +旱 +醉 +弄 +坊 +烤 +萧 +矣 +雾 +倡 +榜 +弗 +氨 +朵 +锡 +袋 +拨 +湘 +岳 +烦 +肩 +熙 +炭 +婆 +棋 +禅 +穴 +宙 +汗 +艳 +儒 +叙 +晨 +颈 +峡 +拖 +烂 +茂 +戒 +飘 +氛 +蒂 +撞 +瓣 +箭 +叛 +1996 +31 +鞋 +劲 +祝 +娜 +饲 +侍 +诱 +叹 +卢 +弥 +32 +鼎 +厦 +屈 +慕 +魅 +m +厨 +嫁 +绵 +逼 +扮 +叔 +酶 +燥 +狼 +滋 +汁 +辐 +怨 +翅 +佩 +坑 +旬 +沃 +剩 +蛇 +颖 +篮 +锐 +侠 +匹 +唤 +熊 +漠 +迟 +敦 +雌 +谨 +婴 +浸 +磷 +筒 +2015 +滩 +埋 +框 +弘 +吕 +碰 +纺 +硫 +堪 +契 +蜜 +蓄 +1995 +阐 +apos +傲 +碱 +晰 +狭 +撑 +叉 +卧 +劫 +闹 +赐 +邓 +奴 +溉 +浦 +蹈 +辣 +遣 +耀 +耶 +翠 +t +叠 +迈 +霍 +碧 +恰 +脊 +昭 +摸 +饱 +赔 +泄 +哭 +讼 +逝 +逻 +廊 +擦 +渗 +彰 +you +卿 +旺 +宪 +36 +顷 +妆 +陪 +葛 +仔 +淀 +翰 +悦 +穆 +煮 +辩 +弦 +in +串 +押 +蚀 +逢 +贺 +焊 +煌 +缔 +惑 +鹿 +袁 +糊 +逸 +舟 +勃 +侦 +涯 +蔡 +辟 +涌 +枯 +痕 +疼 +莉 +柴 +1993 +眉 +1992 +罢 +催 +衔 +秉 +妃 +鸿 +傅 +400 +辰 +聪 +咸 +1994 +扇 +盈 +勘 +佐 +泊 +抛 +搬 +牢 +宴 +牲 +贾 +摘 +姻 +慎 +帕 +忌 +卒 +夕 +卜 +惟 +挺 +崖 +炒 +爵 +冻 +椒 +鳞 +祸 +潭 +腊 +蒋 +缠 +寂 +眠 +冯 +芯 +槽 +吊 +33 +150 +聊 +梗 +嫩 +凶 +铭 +爽 +筋 +韦 +脾 +铝 +肢 +栋 +勾 +萌 +渊 +掩 +狮 +撒 +漆 +骗 +禽 +38 +蕴 +坪 +洒 +冶 +兹 +椭 +喻 +泵 +哀 +翔 +1990 +棒 +芝 +x +扑 +3000 +毅 +衍 +惨 +疯 +欺 +贼 +肖 +轰 +巢 +臂 +轩 +扁 +淘 +犬 +宰 +祠 +挡 +厌 +帐 +蜂 +狐 +垃 +昂 +圾 +秩 +芬 +瞬 +枢 +舌 +唇 +棕 +1984 +霞 +霜 +艇 +侨 +鹤 +硅 +靖 +哦 +削 +泌 +奠 +d +吏 +夷 +咖 +彭 +窑 +胁 +肪 +120 +贞 +劝 +钙 +柜 +鸭 +75 +庞 +兔 +荆 +丙 +纱 +34 +戈 +藤 +矩 +泳 +惧 +铃 +渴 +胀 +袖 +丸 +狠 +豫 +茫 +1985 +浇 +菩 +氯 +啡 +1988 +葱 +37 +梨 +霉 +脆 +氢 +巷 +丑 +娃 +锻 +愤 +贪 +蝶 +1991 +厉 +闽 +浑 +斩 +栖 +l +茅 +昏 +龟 +碗 +棚 +滞 +慰 +600 +2016 +斋 +虹 +屯 +萝 +饼 +窄 +潘 +绣 +丢 +芦 +鳍 +42 +裕 +誓 +腻 +48 +95 +锈 +吞 +蜀 +啦 +扭 +5000 +巩 +髓 +1987 +劣 +拌 +谊 +涛 +勋 +郊 
+莎 +痴 +窝 +驰 +1986 +跌 +笼 +挤 +溢 +1989 +隙 +55 +鹰 +诏 +帽 +65 +芒 +爬 +凸 +牺 +熔 +吻 +竭 +瘦 +冥 +800 +搏 +屡 +昔 +萼 +愁 +捉 +翁 +怖 +汪 +烯 +疲 +缸 +溃 +85 +泼 +剖 +涨 +橡 +谜 +悔 +嫌 +盒 +苯 +凹 +绳 +畏 +罐 +虾 +柯 +邑 +馨 +兆 +帖 +陌 +禄 +垫 +壶 +逊 +骤 +祀 +晴 +蓬 +e +苞 +煎 +菊 +堤 +甫 +拱 +氮 +罕 +舶 +伞 +姚 +弓 +嵌 +1983 +1982 +馈 +琼 +噪 +雀 +呵 +汝 +焉 +陀 +胺 +惩 +沼 +枣 +桐 +酱 +遮 +孢 +钝 +呀 +锥 +妥 +酿 +巫 +闯 +沧 +崩 +蕊 +酬 +匠 +躲 +43 +喊 +98 +琳 +46 +绎 +喉 +凰 +抬 +93 +膨 +盲 +剥 +喂 +庸 +奸 +n +钩 +冈 +募 +苑 +杏 +杉 +辱 +隋 +薪 +绒 +1980 +99 +欠 +尉 +r +攀 +抹 +巾 +1958 +渣 +苹 +猴 +悄 +屠 +41 +颂 +湛 +魄 +颠 +1949 +呆 +粤 +岂 +娇 +暑 +44 +56 +52 +鹅 +筛 +膏 +樱 +p +缆 +襄 +瑟 +恭 +泻 +匪 +兮 +恼 +吟 +仕 +蔽 +骄 +蚕 +斥 +椅 +姬 +谦 +for +椎 +搅 +卸 +沫 +怜 +坎 +瑰 +1978 +钦 +h +拾 +厕 +後 +逾 +薯 +衬 +钾 +崔 +稽 +蛮 +殷 +晒 +47 +菇 +臭 +弧 +擎 +粹 +纬 +1500 +焰 +玲 +竣 +咒 +歇 +糕 +诵 +茨 +妮 +酯 +麟 +卑 +浏 +咽 +罩 +舱 +酵 +晕 +顽 +赁 +咬 +枫 +冀 +贮 +艘 +亏 +薛 +瀑 +篆 +膀 +沸 +雍 +咳 +尹 +愉 +烹 +坠 +勿 +钠 +64 +坤 +甸 +墅 +闸 +藻 +韧 +鄂 +58 +51 +91 +j +瑶 +舆 +夸 +54 +蕾 +栗 +咏 +丞 +抄 +鹏 +弊 +檐 +骂 +仆 +峻 +爪 +赚 +帆 +娶 +嘛 +钓 +澄 +猜 +1979 +裔 +抒 +铅 +卉 +彦 +f +删 +衷 +禹 +寡 +蒲 +砌 +on +棱 +72 +拘 +堵 +雁 +仄 +荫 +53 +k +1981 +祈 +49 +奢 +赌 +寇 +3d +隧 +摊 +雇 +卦 +婉 +敲 +挣 +皱 +虞 +亨 +懈 +挽 +珊 +饶 +滥 +锯 +闷 +it +酮 +虐 +兑 +僵 +傻 +62 +沦 +巅 +鞭 +梳 +赣 +锌 +庐 +薇 +庵 +57 +96 +慨 +肚 +妄 +g +仗 +绑 +2017 +枕 +牡 +000 +胖 +沪 +垒 +捞 +捧 +竖 +蜡 +桩 +厢 +孵 +黏 +拯 +63 +谭 +68 +诈 +灿 +釉 +1956 +裹 +钮 +俩 +o +灶 +彝 +蟹 +涩 +醋 +110 +匙 +歧 +刹 +玫 +棘 +橙 +凑 +桶 +刃 +伽 +4000 +硝 +怡 +籽 +敞 +淳 +矮 +镶 +戚 +幢 +涡 +66 +尧 +膝 +is +哉 +肆 +畔 +溯 +97 +媚 +烘 +01 +67 +窃 +焚 +澜 +愚 +棵 +乞 +86 +78 +佑 +76 +iphone +暨 +敷 +饥 +俯 +蔓 +v +05 +88 +暮 +砍 +邵 +仑 +毗 +剿 +馀 +180 +锤 +刮 +1950 +梭 +摧 +250 +掠 +躯 +诡 +匈 +侣 +胚 +疮 +59 +裙 +windows +裸 +08 +塌 +吓 +俘 +糙 +藩 +楷 +羞 +with +鲍 +帘 +裤 +宛 +憾 +桓 +痰 +寞 +骚 +惹 +笋 +萃 +92 +栓 +61 +挫 +矢 +垦 +09 +垄 +绸 +凄 +your +镀 +熏 +钉 +1945 +led +粪 +缅 +洽 +鞘 +蔗 +82 +迄 +沐 +凿 +勉 +昨 +喘 +700 +爹 +屑 +耻 +沥 +庶 +涅 +腕 +袍 +懒 +阜 +嗜 +朔 +1200 +蒜 +沛 +坟 +轿 +喀 +笛 +狄 +饿 +蓉 +泣 +窟 +130 +豹 +屿 +73 +崛 +迦 +诠 +贬 +腥 +83 +钥 +嗣 +瑜 +07 +倦 +萎 +拦 +冤 +讽 +潇 +谣 +趁 +1960 +妨 +84 +贩 +74 +萍 +窦 +纂 +缀 +矫 +淑 +墩 +梵 +沾 +淫 +乖 +汰 +莞 +81 +旷 +浊 +挚 +撼 +69 +87 +氟 +焕 +06 +庚 +掀 +诀 +kg +盼 +71 +疹 +窖 +匆 +厥 +轧 +89 +淹 +94 +160 +亥 +鸦 +棍 +谅 +歼 +汕 +挪 +蚁 +敛 +魁 +畴 +炫 +丫 +奎 +菱 +沂 +撕 +阎 +詹 +03 +蛛 +77 +靡 +瞻 +咱 +愧 +烷 +畸 +灸 +眸 +that +觅 +芜 +1955 +廓 +斌 +躁 +麓 +摔 +1970 +烛 +睹 +孜 +缚 +堕 +昼 +睿 +琪 +琉 +贱 +6000 +渝 +跋 +1959 +茄 +1957 +舜 +1976 +诛 +1952 +捣 +芙 +04 +1961 +倚 +1938 +酰 +澈 +慌 +帜 +颤 +陇 +1962 +02 +颌 +昧 +佣 +眷 +徙 +禾 +逮 +1948 +79 +莹 +碟 +梢 +朽 +粥 +喇 +1964 +榆 +驳 +楔 +1965 +啸 +肋 +dna +踢 +1975 +1937 +u +傍 +桔 +肴 +呕 +旭 +埠 +贿 +曝 +杖 +俭 +栩 +1953 +斧 +镁 +匾 +踩 +橘 +颅 +1963 +囚 +蛙 +1946 +膳 +坞 +琐 +荧 +瘟 +涤 +胰 +衫 +噬 +皖 +邱 +埔 +汀 +羡 +睐 +葵 +耿 +糟 +厄 +秧 +黔 +蹄 +140 +漳 +鞍 +谏 +腋 +簇 +梧 +戎 +1977 +榴 +诣 +宦 +苔 +揽 +簧 +狸 +阙 +扯 +耍 +棠 +脓 +烫 +翘 +芭 +躺 +羁 +藉 +拐 +1966 +陡 +1954 +漓 +棺 +钧 +琅 +扔 +寝 +绚 +熬 +驿 +邹 +杠 +1972 +w +绥 +窥 +晃 +渭 +1947 +樊 +鑫 +祁 +陋 +哺 +堰 +祛 +y +梓 +崎 +1968 +孽 +蝴 +蔚 +抖 +苟 +肇 +溜 +绅 +妾 +1940 +跪 +沁 +q +1973 +莽 +虏 +be +瞄 +砸 +稚 +僚 +崭 +迭 +皂 +彬 +雏 +ip +羲 +缕 +绞 +俞 +簿 +耸 +廖 +嘲 +can +1969 +翌 +榄 +裴 +槐 +1939 +洼 +睁 +1951 +灼 +啤 +臀 +啥 +濒 +醛 +峨 +葫 +悍 +笨 +嘱 +1935 +稠 +360 +韶 +1941 +陛 +峭 +1974 +酚 +翩 +舅 +8000 +寅 +1936 +蕉 +阮 +垣 +戮 +me +趾 +犀 +巍 +re +霄 +1942 +1930 +饪 +sci +秆 +朕 +驼 +肛 +揉 +ipad +楠 +岚 +疡 +帧 +柑 +iso9001 +赎 +逍 +滇 +璋 +礁 +黛 +钞 +邢 +涧 +劈 +瞳 +砚 +驴 +1944 +锣 +恳 +栅 +吵 +牟 +沌 +瞩 +咪 +毯 +炳 +淤 +盯 +芋 +粟 +350 +栈 +戊 +盏 +峪 +拂 +暇 +酥 +汛 +900 +pc +嚣 +2500 +轼 +妒 +匿 +1934 +鸽 +蝉 +cd +痒 +宵 +瘫 +1927 +1943 +璧 +汲 +1971 +冢 +碌 +琢 +磅 +卤 +105 +剔 +谎 +圩 +酌 +捏 +渺 +媳 +1933 +穹 +谥 +骏 +哨 +骆 +乒 +10000 +摹 +兜 +柿 +喧 +呜 +捡 +橄 +逗 +瑚 +呐 +檀 +辜 +妊 +祯 +1931 +苷 +don +衙 +笃 +芸 +霖 +荔 +闺 +羌 +芹 +dvd +哼 +糯 +吼 +蕃 +嵩 +矶 +绽 +坯 +娠 +1928 +祷 +锰 +qq +by +瘀 +108 +岐 +1932 +茵 +筝 +斐 +肽 +歉 +1929 +嗽 +恤 +汶 +聂 +樟 +擒 +鹃 +拙 +鲤 +絮 +鄙 +彪 +ipod +z +嗓 +墟 +骼 +渤 +僻 +豁 +谕 +荟 +姨 +婷 +挠 +哇 
+炙 +220 +诅 +娥 +哑 +阱 +嫉 +圭 +乓 +橱 +歪 +禧 +甩 +坷 +晏 +驯 +讳 +泗 +煞 +my +淄 +倪 +妓 +窍 +竿 +襟 +匡 +钛 +侈 +ll +侄 +铲 +哮 +厩 +1967 +亢 +101 +辕 +瘾 +辊 +狩 +掷 +潍 +240 +伺 +嘿 +弈 +嘎 +陨 +娅 +1800 +昊 +犁 +屁 +蜘 +170 +寥 +滕 +毙 +as +涝 +谛 +all +郝 +痹 +溺 +汾 +脐 +馅 +蠢 +珀 +腌 +扼 +敕 +莓 +峦 +铬 +谍 +炬 +龚 +麒 +睦 +磺 +吁 +掺 +烁 +靶 +or +圃 +饵 +褶 +娟 +滔 +挨 +android +褒 +胱 +cpu +晖 +脖 +垢 +抉 +冉 +茧 +from +渲 +癫 +125 +de +悼 +嫂 +瞒 +纶 +肘 +炖 +瀚 +皋 +姊 +颐 +1600 +俏 +颊 +gps +讶 +札 +奕 +磊 +镖 +遐 +眺 +腑 +boss +琦 +蚊 +窜 +渍 +嗯 +102 +1926 +touch +夯 +1300 +笙 +蘑 +翡 +碘 +卯 +啼 +靓 +辍 +莺 +躬 +猿 +杞 +眩 +虔 +凋 +遁 +泾 +岔 +羟 +弛 +娄 +茸 +皓 +峙 +逅 +邂 +苇 +楹 +蹲 +拢 +甄 +鳃 +104 +邯 +捆 +勺 +450 +酉 +荚 +唑 +臻 +辗 +绰 +徊 +榨 +苛 +赦 +盔 +壬 +恍 +缉 +2020 +熨 +7000 +澡 +桨 +匣 +兢 +106 +驭 +x1 +镍 +孰 +绮 +馏 +蝇 +佼 +鲸 +128 +哎 +裳 +蜕 +嚼 +嘻 +web +庇 +绢 +倩 +钵 +ii +恪 +帷 +莆 +柠 +藕 +砾 +115 +绊 +喙 +坂 +徘 +荀 +瞧 +蛾 +1925 +晦 +ph +mm +铎 +107 +紊 +锚 +酪 +稷 +聋 +闵 +熹 +冕 +诫 +珑 +曦 +篷 +320 +迥 +蘖 +胤 +103 +檬 +瑾 +钳 +遏 +辄 +嬉 +隅 +ps +秃 +112 +帛 +聆 +芥 +诬 +1100 +挟 +宕 +2018 +鹊 +琶 +膛 +mv +兀 +gb +懿 +碾 +叮 +863 +蠕 +譬 +缮 +烽 +妍 +榕 +260 +1920 +邃 +焙 +倘 +210 +戌 +茹 +豚 +晾 +浒 +玺 +醚 +祐 +炽 +this +缪 +凛 +噩 +溅 +毋 +槛 +ei +are +嫡 +蝠 +娴 +稣 +禀 +壑 +殆 +敖 +cm +ios +倭 +挛 +侃 +蚌 +咀 +盎 +殉 +岑 +浚 +谬 +狡 +1924 +癸 +280 +逛 +耽 +俺 +璨 +巳 +茜 +郸 +蒴 +琵 +we +230 +叩 +泸 +塾 +one +稼 +reg +侮 +锂 +曙 +3500 +up +薰 +婿 +惶 +拭 +篱 +恬 +淌 +烙 +袜 +徵 +慷 +夭 +噶 +莘 +135 +鸳 +殡 +蚂 +1900 +憎 +喃 +佚 +龛 +潢 +烃 +at +岱 +潺 +109 +衢 +璀 +5cm +1400 +鹭 +揣 +痢 +know +厮 +氓 +怠 +no +nbsp +痘 +硒 +镌 +乍 +咯 +惬 +not +桦 +骇 +枉 +蜗 +睾 +淇 +耘 +娓 +弼 +鳌 +嗅 +gdp +狙 +箫 +朦 +椰 +胥 +丐 +陂 +唾 +鳄 +柚 +谒 +journal +戍 +1912 +刁 +鸾 +缭 +骸 +铣 +酋 +蝎 +掏 +耦 +怯 +娲 +拇 +汹 +胧 +疤 +118 +硼 +恕 +哗 +眶 +痫 +凳 +鲨 +擢 +歹 +樵 +瘠 +app +茗 +翟 +黯 +蜒 +壹 +殇 +伶 +辙 +an +瑕 +町 +孚 +痉 +铵 +搁 +漾 +戟 +镰 +鸯 +猩 +190 +蔷 +缤 +叭 +垩 +113 +曳 +usb +奚 +毓 +ibm +颓 +汐 +靴 +china +傣 +尬 +濮 +赂 +媛 +懦 +扦 +111 +韬 +like +戳 +java +雯 +114 +蜿 +116 +1923 +笺 +裘 +尴 +侗 +mba +3g +钨 +1919 +苓 +1922 +寰 +蛊 +扳 +搓 +涟 +睫 +淬 +5mm +123 +ve +121 +赈 +恺 +瞎 +蝙 +1921 +枸 +萱 +颚 +憩 +秽 +秸 +拷 +阑 +貂 +粱 +煲 +隘 +暧 +惕 +沽 +time +菠 +1911 +趟 +磋 +偕 +涕 +邸 +so +踞 +惫 +122 +阪 +鞠 +饺 +汞 +颍 +氰 +屹 +蛟 +跻 +哟 +have +126 +臼 +熄 +绛 +弩 +褪 +117 +渎 +亟 +匮 +撇 +internet +霆 +攒 +舵 +扛 +彤 +nba +蛤 +婢 +偃 +胫 +姥 +睑 +love +iso +pk +诙 +what +诲 +锭 +悚 +扒 +洱 +劾 +惰 +篡 +瓯 +徇 +铀 +骋 +flash +1918 +out +筷 +渚 +踵 +俨 +ceo +榻 +糜 +捻 +釜 +哩 +萤 +270 +蛹 +隽 +垮 +鸠 +鸥 +漕 +瑙 +礴 +憧 +殴 +潼 +悯 +砺 +拽 +钗 +ct +酣 +镂 +mp3 +膺 +楞 +竺 +迂 +嫣 +忱 +cad +哄 +疣 +鹦 +1700 +枭 +憬 +疱 +will +婪 +沮 +1914 +怅 +119 +筱 +扉 +瞰 +linux +旌 +蔑 +铠 +瀛 +vip +琥 +750 +127 +懵 +谴 +捍 +蟾 +漩 +1913 +拣 +汴 +university +刨 +叱 +曜 +妞 +澎 +镑 +翎 +瞪 +sh +倔 +芍 +璞 +瓮 +驹 +芷 +寐 +擂 +丕 +蟠 +诃 +悸 +亘 +溴 +宸 +廿 +恃 +棣 +1917 +荼 +筠 +羚 +慑 +唉 +纣 +麼 +蹦 +锄 +145 +international +124 +淆 +甙 +132 +蚜 +椿 +禺 +绯 +冗 +168 +葩 +厝 +媲 +蒿 +痪 +650 +菁 +炊 +wifi +俑 +new +讥 +min +桀 +祺 +129 +吡 +迩 +do +john +箔 +皿 +缎 +萦 +剃 +霓 +酝 +mg +诰 +茉 +just +get +飙 +湍 +蜥 +箕 +蘸 +550 +4500 +柬 +韭 +溥 +but +熠 +鹉 +咐 +剌 +138 +悖 +瞿 +槟 +娩 +闾 +pvc +遴 +咫 +20000 +孺 +彷 +茬 +211 +蓟 +li +if +憨 +袅 +佬 +炯 +erp +1910 +啶 +昙 +蚩 +136 +痔 +蕨 +瓢 +夔 +毡 +赃 +鳖 +沅 +wang +go +饷 +165 +臧 +掖 +褚 +羹 +ic +勐 +tv +谚 +畦 +眨 +贻 +攸 +涎 +弑 +咎 +铂 +瑛 +1905 +矗 +虱 +more +133 +秤 +谟 +漱 +俸 +夙 +1915 +br +game +雉 +螨 +恣 +斛 +175 +谙 +隍 +131 +奄 +480 +yy +1916 +壕 +髻 +155 +鄱 +嘶 +磕 +濡 +赘 +荞 +讹 +猕 +痞 +鬓 +铮 +腱 +幡 +榭 +爻 +5m +涓 +晤 +咕 +惭 +钼 +匕 +ok +撮 +庾 +笠 +窘 +癖 +365 +垛 +窒 +畲 +甬 +彗 +缨 +湮 +寮 +et +衅 +谪 +156 +绫 +9000 +152 +兖 +疽 +磐 +380 +菏 +沱 +骁 +嫔 +盂 +娆 +钊 +蟒 +忏 +谤 +148 +137 +server +2200 +晟 +ng +15000 +google +痈 +耆 +谧 +簪 +134 +ml +疟 +扈 +脍 +琛 +咋 +胄 +142 +144 +葆 +轶 +桢 +973 +攘 +was +邕 +拧 +茯 +205 +摒 +1908 +intel +傀 +祚 +嘟 +帼 +1906 +wto +筵 +when +馒 +疚 +璇 +砧 +merge +槃 +microsoft +犷 +exe +腓 +煜 +弋 +疸 +濑 +310 +201 +麝 +嗟 +忻 +愣 +facebook +斓 +吝 +咧 +矾 +愫 +151 +158 +漪 +珂 +rna +逞 +146 +206 +糠 +璐 +藓 +昕 
+妩 +屌 +疵 +excel +嘘 +he +plc +袂 +2400 +139 +稃 +剁 +侏 +掐 +猾 +匍 +2800 +坳 +黜 +邺 +闫 +猥 +湃 +斟 +癣 +1904 +185 +匐 +粳 +sql +330 +141 +cp +1909 +叟 +俾 +儡 +莒 +12000 +骥 +跤 +耙 +矜 +翱 +zhang +ms +赡 +1907 +浣 +栾 +拈 +science +420 +螟 +aaa +桧 +坍 +睢 +趴 +id +伎 +2100 +婺 +霹 +痊 +膊 +眯 +豌 +202 +驮 +骈 +850 +iii +嶂 +淞 +143 +腮 +髅 +炀 +啄 +亳 +麾 +147 +筐 +叨 +徨 +跷 +ac +楂 +郴 +绶 +hp +羔 +xp +ieee +咤 +now +there +靳 +they +屎 +雳 +瘘 +蹬 +2300 +惮 +acid +涪 +阖 +煽 +蹊 +225 +栉 +153 +俟 +涸 +辫 +锢 +佟 +176 +皎 +cctv +啮 +钰 +螂 +dc +啪 +绷 +204 +闰 +畿 +2d +覃 +2600 +惘 +贰 +154 +碉 +卞 +酐 +枷 +葺 +芪 +207 +蕙 +192 +咚 +籁 +pro +钴 +162 +冽 +玮 +骷 +啃 +焖 +猝 +榈 +滁 +拮 +跗 +讷 +蝗 +208 +蠡 +world +烨 +been +hd +gmp +256 +脯 +歙 +泠 +刍 +掳 +pe +his +僳 +340 +1902 +螯 +胳 +髦 +粽 +戾 +祜 +178 +186 +岷 +懋 +馥 +昵 +踊 +湄 +郢 +斡 +迢 +ce +photoshop +嗪 +about +裨 +1903 +羧 +膈 +翊 +lcd +鲫 +163 +螃 +沓 +疝 +笈 +ktv +榔 +157 +诘 +autocad +195 +颉 +蛀 +鸢 +焯 +囧 +make +梆 +npc +潞 +戛 +see +system +149 +佗 +艮 +chinese +let +霾 +鬟 +215 +net +玖 +1898 +腭 +喔 +172 +罔 +佥 +粑 +visual +舷 +泯 +m2 +198 +has +203 +sd +泓 +炜 +谗 +烬 +跆 +rpg +傩 +飓 +浔 +钤 +惚 +胭 +踝 +镯 +ep +221 +臆 +196 +蜚 +揪 +觞 +皈 +dj +183 +api +迸 +匝 +筏 +167 +醴 +黍 +洮 +滦 +侬 +甾 +290 +way +3200 +188 +diy +2cm +com +澧 +阈 +袱 +迤 +衮 +166 +濂 +娑 +砥 +砷 +铨 +缜 +箴 +30000 +逵 +猖 +159 +蛰 +箍 +侥 +2mm +搂 +纨 +裱 +枋 +嫦 +敝 +挝 +贲 +潦 +235 +撩 +惺 +铰 +f1 +忒 +咆 +哆 +莅 +164 +炕 +抨 +涿 +龈 +猷 +got +b1 +182 +2m +212 +遒 +缥 +vs +捂 +俐 +la +瘙 +搐 +牍 +isbn +馍 +our +痿 +袤 +峥 +184 +栎 +罹 +燎 +喵 +209 +1901 +璜 +飒 +蔼 +珞 +澹 +奘 +岖 +芡 +簸 +杵 +甥 +骊 +216 +悴 +173 +惆 +5mg +殃 +1895 +呃 +161 +5g +祗 +3600 +髋 +169 +liu +who +幔 +down +榛 +犊 +霁 +芮 +520 +牒 +佰 +her +狈 +薨 +co +吩 +鳝 +嵘 +濠 +呤 +纫 +3mm +檄 +214 +浜 +370 +189 +缙 +缢 +煦 +蓦 +揖 +拴 +缈 +218 +褥 +铿 +312 +燮 +life +锵 +174 +荥 +187 +忿 +4s +僖 +婶 +171 +chen +芾 +镐 +痣 +research +眈 +460 +祇 +邈 +翳 +碣 +遨 +鳗 +诂 +never +岫 +焘 +3cm +co2 +茱 +tcp +only +255 +gsm +say +洵 +晁 +right +噢 +she +over +偈 +旖 +david +181 +232 +蚓 +柘 +珐 +遽 +岌 +桅 +213 +唔 +222 +鄞 +雹 +michael +驸 +苻 +恻 +鬃 +玑 +磬 +崂 +304 +祉 +荤 +淼 +560 +264 +肱 +呗 +pp +b2 +骡 +囱 +10cm +佞 +back +1890 +226 +耒 +伫 +嚷 +粼 +aa +歆 +佃 +旎 +惋 +殁 +杳 +their +阡 +red +畈 +蔺 +os +177 +map +巽 +cbd +昱 +啰 +吠 +179 +199 +嗔 +涮 +238 +奂 +1896 +撷 +301 +袒 +720 +爰 +捶 +赭 +蜓 +姗 +蔻 +垠 +193 +gis +噻 +ab +峒 +皙 +want +245 +憔 +帚 +office +xx +杷 +蟆 +iso14001 +觐 +钒 +岙 +2700 +1899 +栀 +幄 +啧 +癜 +擀 +轲 +铆 +them +讴 +樽 +霏 +mtv +肮 +枳 +骞 +诧 +瘢 +虬 +拗 +play +219 +蕲 +316 +茁 +唆 +technology +word +沭 +毂 +蛎 +芊 +銮 +瞥 +呱 +223 +羿 +吒 +傥 +髯 +濯 +蜻 +皴 +802 +430 +邳 +燧 +1860 +獭 +垭 +祟 +217 +虢 +how +枇 +abs +鹫 +194 +颞 +1894 +333 +皑 +脲 +197 +舔 +魇 +霭 +org +坨 +郧 +baby +椽 +舫 +228 +oh +305 +荠 +琊 +溟 +1897 +煨 +265 +谯 +粲 +罂 +gonna +屉 +佯 +郦 +亵 +诽 +芩 +嵇 +蚤 +哒 +315 +啬 +ain +嚎 +玥 +twitter +191 +隼 +唢 +铛 +cause +壅 +藜 +won +吱 +rom +楣 +璟 +锆 +憋 +罡 +al +咙 +1850 +腈 +oslash +job +233 +廪 +堑 +into +诩 +b2c +溧 +鹑 +讫 +哌 +铢 +蜴 +1ml +稹 +噜 +镉 +224 +愕 +桁 +晔 +琰 +陲 +疙 +667 +崮 +need +540 +8mm +html +颛 +through +asp +桡 +钜 +580 +take +谑 +仞 +咦 +珪 +揍 +鱿 +阉 +3800 +瘩 +410 +槌 +滓 +茴 +tft +泮 +涣 +atm +pci +柞 +渥 +飨 +孪 +沔 +谲 +桉 +vcd +慵 +318 +oem +other +俚 +paul +跖 +纭 +恙 +which +fi +佘 +236 +荃 +咄 +鞅 +叁 +james +恽 +m3 +253 +炔 +萘 +钺 +6500 +1880 +ccd +楫 +塬 +钡 +琮 +苄 +950 +325 +275 +1g +day +o2o +960 +music +骰 +偎 +粕 +amd +咔 +鹄 +瓒 +阆 +捅 +嬴 +adobe +箨 +name +390 +680 +640 +氦 +倜 +b2b +觊 +xml +婕 +229 +jar +锑 +撬 +chem +掰 +嗷 +5500 +1cm +饯 +蓓 +234 +good +鼬 +spa +佤 +5a +ss +蚯 +挞 +臾 +where +atp +227 +嶙 +幂 +饬 +闱 +live +high +煅 +嘧 +1mm +蹭 +sun +abc +瞭 +顼 +箐 +here +徉 +231 +骜 +302 +嗨 +邛 +庑 +柩 +饕 +俎 +4mm +15g +嘌 +50000 +颏 +cssci +椁 +崧 +锉 +籼 +1870 +狞 +弁 +6mm +羯 +踹 +糅 +248 +1840 +砼 +263 +嫖 +tmp +252 +mac +285 +豉 +啉 +榷 +嘈 +en +俪 +痂 +308 +inf +630 +儋 +4a +芎 +ai +man +繇 +1889 +bt +239 +meta +蹇 +242 
+530 +诋 +bbc +煸 +峋 +淙 +324 +management +1885 +泱 +徜 +crm +4cm +free +汩 +纥 +246 +蝼 +囿 +uv +暹 +谆 +蹂 +鞣 +3c +mr +螳 +cs +馗 +幺 +鞑 +贽 +268 +istp +243 +漯 +237 +牦 +淖 +engineering +dr +囤 +than +gprs +sp +440 +晗 +1888 +258 +忡 +懊 +呋 +埂 +pcb +307 +first +321 +robert +鲈 +sup2 +阕 +3m +幌 +cg +303 +鳅 +勰 +find +8cm +萸 +剽 +蚝 +wi +绔 +pdf +1250 +262 +php +辇 +10mg +use +ie +麋 +1884 +陟 +宥 +oracle +锺 +喽 +620 +1892 +1893 +淅 +熵 +荨 +247 +忤 +american +266 +seo +轭 +嗦 +荪 +also +骠 +鹘 +p2p +4g +聿 +绾 +诶 +985 +怆 +244 +喋 +恸 +湟 +睨 +翦 +fe +蜈 +1875 +褂 +娼 +1886 +羸 +觎 +470 +瘁 +306 +蚣 +呻 +241 +1882 +昶 +谶 +猬 +荻 +school +286 +酗 +unit +肄 +躏 +膑 +288 +2g +嗡 +273 +iv +cam +510 +庠 +崽 +254 +搪 +pcr +胯 +309 +铉 +峤 +郯 +藐 +舂 +come +蓼 +some +薏 +窿 +羣 +氽 +徕 +冼 +rs +阂 +欤 +殒 +窈 +脘 +780 +篝 +yang +1861 +3300 +iso9000 +麸 +砭 +max +砰 +骶 +豺 +lg +窠 +獒 +think +腴 +苕 +any +its +缇 +骅 +劭 +college +卅 +ups +揆 +垅 +na +6cm +琏 +镗 +苜 +胛 +1881 +black +珏 +吮 +抠 +搔 +276 +rock +251 +槎 +4200 +323 +掣 +pet +1887 +ap +琨 +餮 +375 +舛 +give +si +痤 +us +311 +278 +埭 +english +peter +1891 +820 +胪 +喹 +妲 +婀 +帙 +10g +oa +7500 +箩 +灏 +霎 +logo +袄 +dsp +bl +镭 +蓿 +power +long +墉 +too +嵊 +1862 +girl +堇 +king +蟋 +610 +叽 +249 +钎 +30cm +fm +録 +group +1883 +郓 +瘴 +vol +丶 +呦 +邬 +頫 +272 +馁 +hiv +鄢 +257 +1876 +ordm +蛭 +322 +愍 +锲 +槿 +珈 +best +4800 +mri +1080 +fda +10mm +261 +nt +660 +super +1m +center +ui +335 +蜃 +298 +拎 +鎏 +裟 +沏 +np +螭 +7mm +觑 +墒 +捺 +轸 +micro +榫 +based +319 +怔 +ram +618 +昀 +even +泷 +1864 +ca +凫 +唠 +狰 +鲛 +氐 +呛 +绀 +碛 +茏 +盅 +蟀 +洙 +off +訇 +蠹 +auml +dos +20cm +267 +棂 +18000 +蚴 +篾 +two +靛 +暄 +show +1868 +泞 +cdma +mark +vc +洄 +赓 +麽 +25000 +篓 +孑 +860 +烩 +980 +design +颢 +钣 +var +髂 +蹴 +wanna +筮 +蝌 +醮 +home +菖 +fun +cmos +獗 +friends +business +岘 +570 +鼐 +1865 +姣 +national +1874 +蟑 +袈 +葶 +掬 +most +vga +emba +躇 +30g +鹌 +city +踌 +282 +钹 +蚪 +颧 +001 +13000 +鹳 +274 +km +345 +1050 +stop +328 +then +鲲 +驷 +潴 +295 +386 +焱 +稔 +悌 +mpeg +st +suv +vista +a1 +vi +283 +help +basic +唏 +11000 +苒 +蹙 +house +heart +ouml +281 +氩 +bug +mobile +宓 +service +dll +綦 +苎 +application +疃 +methyl +攫 +rfid +100g +287 +掾 +1871 +徭 +490 +舀 +逶 +嗤 +760 +0m +ge +1872 +people +hr +蜷 +茔 +512 +疳 +迳 +罄 +瓠 +100mg +讪 +psp +av +傈 +ppp +杲 +灞 +氲 +鬲 +獠 +柒 +骧 +1848 +away +william +326 +搀 +珩 +绦 +1879 +嚏 +710 +镛 +喱 +倏 +馋 +茭 +擘 +斫 +284 +1mg +怂 +hdmi +唧 +犍 +谩 +赊 +317 +271 +wu +鬻 +禛 +15cm +259 +840 +feel +485 +圻 +10m +蹶 +5kg +1877 +1873 +缄 +瘿 +黠 +甑 +矸 +嘀 +il +蹼 +jack +lee +269 +叼 +di +313 +旻 +auc +502 +1350 +鹜 +289 +fc +稗 +336 +999 +association +many +293 +雒 +george +td +赉 +style +馔 +颦 +ul +ld50 +1867 +颔 +掇 +1863 +each +赅 +桎 +inc +痧 +dv +谄 +孛 +笆 +鲶 +铳 +3100 +mc +tell +4m +blue +327 +299 +bios +龋 +385 +盱 +笏 +2030 +窕 +苴 +314 +big +1866 +296 +萋 +355 +辘 +琬 +cu +梏 +much +蚧 +3400 +1280 +镳 +24h +own +670 +studio +瞅 +keep +6g +ppt +conference +around +information +睬 +1878 +class +偌 +鲵 +惦 +1830 +蜍 +mp4 +why +靼 +1851 +332 +阗 +菟 +黝 +1650 +control +挈 +嵴 +剡 +358 +楸 +dha +氤 +m1 +vr +呎 +珲 +5ml +馄 +滂 +338 +蹉 +蓑 +锷 +297 +279 +啜 +1644 +sm +婵 +well +鬣 +7cm +钿 +bbs +晌 +蛆 +隗 +酞 +枞 +352 +work +always +9g +戬 +獾 +镕 +star +easy +饨 +娣 +缰 +邾 +334 +8m +ni +鹗 +277 +425 +end +had +嗒 +苋 +薮 +棹 +type +richard +880 +6m +拄 +air +埕 +勖 +鹞 +殚 +鲢 +pop +a4 +1750 +ftp +16000 +啖 +ad +沣 +501 +靥 +葭 +诿 +htc +鸪 +007 +饴 +t1 +疖 +抟 +睽 +770 +access +tcl +稞 +吋 +谀 +澍 +杈 +妤 +sata +part +峄 +systems +漉 +40000 +ever +気 +368 +咲 +qs +ta +璘 +ltd +mol +media +萜 +僭 +朐 +742 +1855 +cc +圜 +癞 +藿 +555 +珉 +isp +set +1450 +陉 +him +僮 +292 +膻 +1853 +薹 +810 +汊 +still +锗 +昉 +pvp +猗 +http +1859 +3700 +strong +3a +锶 +real +跛 +art +1869 +331 +1368 +嘹 +337 +瓤 +402 +衄 +1856 +1820 +1150 +matlab +豕 +吆 +腆 +thomas +a2 +294 
+le +366 +using +356 +bb +喆 +smith +different +莴 +401 +谌 +ci +珙 +疥 +kw +鲑 +405 +玷 +蛔 +砀 +361 +zh +nasa +materials +329 +nature +1h +谔 +睥 +ch +20mg +2mg +du +mail +data +every +蹑 +诒 +逋 +372 +while +姝 +刈 +婧 +going +喳 +镞 +铌 +291 +712 +辎 +鹧 +檩 +740 +扪 +10ml +霰 +ar +裆 +ol +嬷 +0mm +ufo +charles +20mm +tvb +apple +刎 +iec +project +sbs +嵋 +342 +690 +悱 +920 +嘤 +jean +篁 +荸 +瞑 +殓 +搽 +50mg +343 +橇 +include +eva +雎 +弭 +獐 +haccp +恿 +video +cf +vpn +society +眦 +730 +铐 +song +尕 +捎 +诟 +institute +痨 +cn +369 +笞 +756 +version +des +sns +趺 +590 +award +唬 +苣 +css +lte +xu +fbi +啾 +瘪 +垸 +357 +橹 +after +濛 +曷 +level +樾 +very +汨 +仟 +姒 +1858 +again +怦 +荏 +tom +诤 +苡 +吭 +830 +dm +before +406 +崆 +氡 +young +脩 +lan +胝 +钏 +3ds +cr +arm +pos +night +屐 +395 +忐 +彧 +拚 +鏖 +344 +100ml +525 +孳 +1024 +yu +忑 +384 +邝 +穰 +403 +摈 +庖 +351 +鸵 +398 +hello +矽 +354 +鲟 +said +381 +768 +発 +762 +sap +1854 +msn +菅 +book +353 +true +339 +javascript +348 +2900 +圪 +蹋 +衾 +簋 +璎 +367 +噎 +911 +嬗 +346 +肼 +362 +359 +跎 +滟 +little +4300 +701 +戦 +嵬 +look +仝 +phys +club +惇 +纾 +times +14000 +炁 +382 +xyz +number +ak +mind +huang +闳 +骐 +秣 +眙 +谘 +碓 +iso9002 +疔 +412 +恂 +am +top +master +鳕 +green +鸱 +int +爨 +镊 +404 +were +4600 +em +better +钯 +圮 +楽 +堀 +1852 +408 +sat +1857 +378 +422 +膘 +705 +噗 +347 +start +486 +锹 +505 +杼 +酊 +same +376 +white +挎 +箸 +郗 +垌 +sa +溏 +martin +蔫 +偻 +364 +妫 +飚 +625 +601 +辔 +濬 +666 +ds +瑄 +621 +觚 +5600 +nhk +415 +express +铍 +bit +跚 +9mm +翕 +煊 +these +50mm +gpu +b6 +hip +耄 +铋 +篦 +zhou +阇 +骛 +nvidia +莪 +吲 +youtube +唁 +870 +箧 +503 +tm +8500 +really +珅 +潋 +迨 +哽 +without +砦 +model +缗 +hey +謇 +呸 +mrna +垓 +糍 +park +wap +璠 +妣 +狎 +攥 +396 +闇 +york +蛉 +瑁 +joe +腼 +蹒 +great +review +200mg +chris +www +嶷 +online +莠 +沤 +哚 +475 +遑 +v1 +such +跺 +膦 +蹿 +unix +hard +40cm +50cm +nothing +郫 +zhao +玳 +ma +boy +埚 +url +432 +network +aaaa +衿 +371 +try +醪 +full +挹 +raid +bg +绡 +汜 +digital +mb +c1 +坩 +ccc +旃 +5200 +607 +itunes +powerpoint +鸨 +between +407 +翈 +1842 +1844 +435 +838 +抡 +chemistry +team +party +die +晞 +place +care +盥 +藁 +蓖 +383 +cv +臊 +made +state +465 +羰 +388 +1620 +sas +楝 +噱 +ji +饽 +苌 +soho +褓 +佶 +mp +581 +years +1260 +1680 +hop +稜 +瞠 +仡 +25mm +605 +423 +341 +363 +374 +627 +text +development +518 +伉 +襁 +ug +change +713 +涞 +1849 +蜇 +抿 +瑗 +pda +418 +un +line +958 +孱 +懑 +416 +von +373 +淦 +赝 +core +dns +747 +427 +387 +would +ipo +醌 +551 +缫 +蠲 +alt +嚓 +鲷 +湫 +捋 +1845 +咩 +裏 +avi +犒 +2050 +墀 +yeah +god +445 +lesson +硐 +蔸 +399 +758 +pu +computer +456 +钽 +1847 +麂 +brown +store +蒡 +鼹 +绻 +1821 +錾 +仃 +515 +篙 +蕤 +589 +applied +737 +930 +c3 +1841 +铤 +billboard +apec +槁 +牖 +螈 +mary +俦 +family +笄 +color +啻 +対 +jsp +郤 +next +iq +645 +506 +hbv +闼 +a3 +349 +value +413 +igg +411 +426 +醺 +赍 +檗 +usa +裾 +head +噫 +掸 +mike +箓 +usb2 +things +5800 +5v +o2 +妪 +乂 +蝈 +砻 +胍 +220v +392 +cba +397 +535 +idc +analysis +25mg +蜱 +ti +2h +聃 +雠 +碚 +椤 +缯 +昴 +890 +缱 +祎 +der +缬 +ex +508 +铙 +cnc +pentium +孀 +533 +advanced +mpa +yl +笳 +蘇 +愆 +685 +榉 +old +氙 +call +alex +燹 +撂 +菽 +583 +箬 +蛄 +瘸 +嬛 +495 +橐 +could +60000 +something +纡 +刽 +辂 +hong +377 +law +蒯 +邨 +1846 +1550 +r2 +1837 +赀 +player +414 +跸 +phone +邙 +hold +rgb +421 +henry +2025 +黟 +409 +磴 +1815 +mode +1843 +闿 +504 +letters +1780 +428 +垟 +389 +t2 +london +528 +jpeg +嵯 +钚 +steve +跄 +30min +527 +潸 +h2 +35000 +崴 +eric +379 +run +three +rf +left +455 +恁 +open +楮 +556 +bc +476 +腧 +458 +plus +1812 +1839 +胨 +b12 +4d +芫 +america +est +dream +碴 +隰 +杓 +md +ya +global +436 +15mm +2ml +貉 +欹 +sup3 +侑 +ea +鳜 +910 +ben +铄 +椴 +昇 +醍 +1020 +798 +midi +肓 +features +lc +brian +akb48 +缂 +1835 +test +铡 +light +978 +s1 +1799 +key +sim +1795 +simple +energy +蹠 
+徂 +west +725 +body +豢 +424 +face +蒽 +lin +805 +1120 +479 +菡 +bill +433 +衲 +阚 +believe +brt +pa +last +芗 +hu +sam +wei +adsl +602 +mk +痍 +玠 +1832 +523 +晷 +604 +jj +468 +淝 +1560 +鄯 +ck +473 +糗 +耨 +榧 +394 +940 +eq +498 +used +sc +胴 +c2 +蕈 +screen +镬 +635 +鼾 +431 +education +wwe +摭 +鸮 +cl +5400 +fpga +恚 +419 +実 +asia +534 +552 +砝 +100mm +pid +741 +珣 +under +603 +寤 +埙 +mbc +tc +xxx +didn +478 +mn +p1 +锏 +simon +ansi +438 +hi +615 +喟 +蘅 +骺 +cell +捭 +study +586 +393 +莜 +should +xi +缶 +f2 +games +0g +1760 +mini +johnson +jones +yes +锟 +1825 +叵 +cm3 +炷 +1580 +stay +675 +another +6800 +鲧 +1736 +ps2 +胼 +517 +査 +岬 +2019 +1640 +rose +鹂 +牯 +珥 +entertainment +448 +und +496 +莼 +software +970 +邠 +5300 +h1n1 +488 +da +眇 +卟 +変 +20m +may +417 +lady +galaxy +4100 +惴 +1789 +846 +801 +渑 +907 +put +蚱 +gone +606 +t3 +company +632 +454 +516 +998 +548 +391 +4700 +瞌 +ide +瘰 +7200 +佝 +together +street +旸 +626 +衽 +郅 +奁 +731 +30mg +mvp +1370 +60cm +12cm +魑 +1828 +628 +everything +612 +san +937 +缛 +2gb +lu +angel +20ml +576 +颙 +sony +790 +press +镫 +hall +簌 +beautiful +豇 +711 +453 +pm +姹 +thing +442 +邋 +alpha +leave +暝 +441 +30mm +chapter +507 +100000 +526 +directx +511 +9cm +words +釐 +619 +洹 +444 +frank +咿 +eyes +483 +俳 +522 +蜊 +醐 +541 +water +499 +聩 +non +bob +坻 +532 +757 +545 +毽 +oo +喾 +alone +scott +744 +辋 +river +zhu +倌 +媪 +蛳 +滹 +哙 +nc +20g +阊 +gs +queen +趸 +1130 +1645 +祢 +4mg +1814 +girls +544 +e1 +籀 +1210 +1573 +徼 +ipv6 +訾 +髁 +1a +jackson +砜 +1836 +les +4gb +撸 +瓘 +1790 +缁 +镓 +sars +eps +519 +sod +bp +1810 +year +縻 +sound +617 +菀 +1125 +598 +酢 +桠 +466 +emc +撵 +怏 +429 +1838 +ready +渌 +546 +taylor +452 +news +1180 +568 +2a +af +538 +list +hot +1380 +etc +1796 +摞 +mo +槲 +levels +ht +浠 +诜 +魉 +韫 +daniel +亓 +盤 +pv +瑭 +魍 +1831 +emi +襞 +social +dreamweaver +爿 +kbs +565 +613 +990 +浃 +樯 +jb +讵 +揩 +physics +耋 +帏 +lng +崃 +bs +457 +enough +shy +521 +596 +ec +451 +鸩 +遢 +turn +臃 +available +4400 +585 +粿 +1010 +禳 +hand +439 +536 +桫 +link +side +earth +mx +髹 +7m +482 +诳 +472 +1140 +707 +622 +wcdma +513 +must +492 +462 +踉 +40mg +948 +cmax +郃 +1320 +v2 +542 +email +493 +嗖 +sup +讧 +cnn +446 +碁 +17000 +湎 +30m +529 +653 +531 +575 +阏 +sr +united +pm2 +mt +媾 +443 +様 +aac +806 +哔 +舸 +vb +611 +曩 +821 +gre +gl +cisco +忝 +峁 +掂 +464 +葳 +487 +437 +including +715 +鄄 +558 +both +谵 +463 +jim +608 +m4 +5100 +彊 +锴 +war +郜 +money +481 +葖 +1824 +tnt +蓇 +瓴 +鳟 +橼 +5s +louis +434 +鲇 +邗 +el +犄 +秭 +3900 +records +view +chemical +1001 +1mol +dance +668 +dl +槭 +缵 +que +624 +rt +1823 +1805 +005 +1826 +巯 +sgs +user +龊 +qc +狍 +island +language +space +擞 +saint +2n +pt +share +瞽 +hotel +christian +557 +栲 +撅 +2b +1801 +447 +1822 +瑀 +smt +hk +1834 +戢 +825 +50ml +朓 +逖 +general +椹 +nm +洺 +cae +484 +艏 +wma +zn +苁 +single +599 +c4 +滘 +777 +铧 +侪 +ocirc +1kg +684 +豳 +skf +12mm +489 +hla +竦 +貔 +ld +being +562 +圄 +van +gm +688 +655 +special +呷 +edition +1s +jiang +131108 +514 +1792 +ncaa +1833 +旄 +遛 +jr +program +656 +467 +ing +901 +755 +509 +芈 +kong +rp +砣 +桷 +audio +icp +happy +龌 +done +疬 +japan +ts +mit +p2 +524 +looking +miss +缟 +582 +洌 +35mm +494 +grand +跏 +those +joseph +ctrl +547 +1040 +686 +蝮 +lp +cod +菰 +sio2 +txt +1770 +1060 +帑 +767 +north +fcc +怙 +ester +718 +story +edi +634 +1360 +豸 +1660 +lh +雩 +1230 +magic +誊 +549 +臬 +4k +op +1662 +651 +镣 +箇 +616 +title +sciences +25cm +踱 +s2 +t4 +钍 +648 +100m +543 +588 +苫 +554 +蝽 +r1 +3mg +amino +1776 +浯 +609 +772 +ca2 +vlan +469 +500mg +単 +road +亶 +636 +metal +device +40mm +囹 +穑 +1730 +佻 +1818 +绌 +12g +537 +诔 +pve +autodesk +477 +v8 +ray +gp +span +gc +size +716 +鹬 +ssl +crt +1670 +925 +髌 +pn +1127 +702 +658 +services 
+support +1802 +蒌 +coming +experience +nbc +鳏 +631 +638 +ace +0cm +ems +9001 +殄 +yen +soc +ethyl +怛 +tf +筌 +刳 +studies +theory +1030 +578 +radio +翮 +卍 +畹 +471 +704 +because +1610 +箜 +save +燔 +赳 +553 +1809 +篌 +窨 +翥 +785 +炅 +钕 +lett +803 +1827 +academy +ed +629 +sf +pr +hill +explorer +future +food +莳 +662 +567 +dcs +忖 +戡 +1086 +1190 +1829 +bad +es +15m +order +spring +沢 +south +497 +025 +move +狒 +1630 +圉 +abb +449 +learn +l0 +d2 +5d +wav +琯 +邰 +cis +quality +odm +926 +acta +root +smart +1661 +苾 +cm2 +photos +l2 +via +sk +犸 +623 +邡 +feeling +572 +郏 +襦 +python +bmw +888 +guo +epa +williams +沆 +813 +bot +read +function +wilson +1723 +enterprise +玟 +50hz +s26 +fire +engineer +tony +1819 +濉 +rh +洎 +莨 +氘 +pb +咛 +1720 +佺 +1460 +815 +cbs +腩 +beta +鳔 +1735 +yan +1gb +x2 +剜 +秕 +牝 +芨 +din +関 +del +sms +649 +pal +1369 +far +maya +654 +拊 +812 +595 +竑 +50m +圹 +close +eos +颡 +1420 +6300 +1816 +wrong +break +573 +765 +file +friend +002 +摺 +683 +nx +沩 +蜉 +please +1170 +ro +6400 +筚 +nick +acm +愔 +ati +point +肟 +766 +俶 +fast +ata +d1 +678 +geforce +1710 +yahoo +堃 +绉 +mysql +1793 +奭 +gap +iso14000 +uk +astm +h2o +n2 +film +method +1804 +罅 +so2 +嗳 +665 +adam +uc +蜢 +1806 +1775 +photo +疠 +474 +image +200mm +sure +561 +帔 +髡 +643 +黥 +1813 +proceedings +褛 +柰 +beyond +royal +else +eda +808 +ddr +gif +鏊 +l1 +痼 +571 +waiting +堞 +code +652 +rss +learning +嗝 +461 +beijing +娉 +566 +577 +708 +1520 +689 +kevin +human +661 +539 +875 +1811 +ssci +6600 +戕 +587 +735 +3s +铱 +耜 +觥 +867 +镒 +584 +呓 +1522 +904 +case +1101 +491 +1080p +history +蒹 +栱 +im +564 +f4 +卮 +琚 +salt +jason +rohs +12v +hydroxy +逦 +modem +font +酩 +蓍 +cry +65536 +health +虺 +1798 +tonight +small +谠 +1570 +1220 +jane +against +597 +751 +459 +bd +鼋 +焗 +udp +process +1070 +1807 +children +8g +eb +62mm +22000 +add +1440 +褴 +rm +25g +ccedil +706 +714 +5l +砒 +赧 +蛏 +709 +蚬 +1530 +瘕 +5h +559 +jay +iga +020 +fall +scsi +顗 +isdn +death +563 +today +愠 +dvi +勣 +wait +1642 +飕 +徳 +滢 +琇 +鳙 +db +瞟 +尻 +force +400mg +澶 +荽 +舐 +arts +ha +east +lost +effects +1628 +album +harry +633 +dark +public +2250 +soul +826 +659 +exo +侂 +733 +se +黼 +icu +4h +market +潟 +7800 +绂 +瘗 +ngc +1794 +crazy +蓥 +竽 +濞 +igm +scdma +6200 +cb +835 +699 +骖 +偁 +bmp +809 +1270 +oled +応 +1160 +1621 +锜 +g3 +ova +cheng +614 +匏 +thinkpad +赑 +fps +create +kim +讦 +1480 +诨 +1540 +rev +1v1 +罘 +fans +巖 +1740 +ag +嫘 +1649 +ps3 +908 +颀 +g1 +703 +岿 +v3 +虻 +936 +fl +c2c +罴 +environmental +paris +594 +hear +囗 +jump +communications +溆 +talk +噤 +824 +骝 +003 +咂 +695 +728 +e2 +nec +iptv +1797 +kelly +500ml +锛 +721 +rc +1808 +ldl +1240 +槊 +radeon +676 +啕 +tang +plant +50g +驽 +professional +凇 +698 +s36 +lord +search +alan +籴 +pd +1403 +硖 +1791 +816 +1636 +3h +gsp +811 +sky +1632 +铯 +christmas +怿 +笥 +matter +574 +噙 +倨 +effect +647 +779 +1803 +657 +sorry +awards +igbt +pwm +坭 +醅 +sos +976 +592 +滏 +10min +682 +cs3 +悻 +did +mater +579 +聒 +1724 +feng +low +mhz +836 +722 +枥 +726 +昺 +bank +memory +rap +975 +663 +ips +酆 +2kg +787 +簟 +睇 +轫 +溱 +骢 +榘 +642 +珺 +跹 +677 +series +nlp +raquo +蚶 +stone +1672 +1817 +1646 +827 +驺 +ko +security +perfect +alexander +746 +tt +check +804 +饧 +15mg +sir +moon +doesn +591 +inside +tim +672 +641 +噼 +儆 +1w +氚 +646 +哧 +1783 +旒 +鸬 +1648 +夥 +ev +1688 +score +standard +玦 +723 +貅 +揄 +戗 +fx +938 +璩 +fu +1654 +剐 +010 +cpi +垴 +蘼 +hz +1521 +1067 +727 +ah +lv +916 +裒 +639 +han +躅 +1715 +唳 +form +second +嗑 +荦 +674 +霈 +jin +缦 +啭 +pi +1788 +rx +隈 +gao +sdk +zheng +悫 +745 +href +593 +ngo +multi +d3 +彀 +637 +1276 +悭 +found +jis +5700 +焓 +1234 +80cm +磔 +aim +1778 +蓊 +act +569 +xiao +郾 +717 +786 +return +5min +1582 +etf +1590 
+action +1625 +sarah +yourself +枧 +鹚 +10kg +80000 +検 +775 +818 +stephen +gui +屃 +644 +9500 +v6 +馑 +wlan +hs +2048 +area +1616 +andrew +8226 +6mg +1567 +1763 +1470 +嗲 +pps +铟 +rca +pierre +687 +null +manager +738 +sdh +828 +薤 +60g +300mg +jun +1685 +favorite +making +playing +summer +754 +692 +涔 +樗 +664 +忾 +収 +绺 +945 +h2s +bis +self +300mm +烊 +opengl +912 +acute +螫 +黩 +996 +magazine +edward +su +elisa +hdl +cyp3a4 +鞫 +foundation +alice +ddr3 +915 +923 +tbs +andy +field +date +transactions +limited +during +1126 +鲠 +1057 +fan +嘭 +缣 +845 +681 +rw +mean +1566 +become +economic +852 +johnny +蒺 +unique +黒 +tu +boys +1330 +885 +getting +cj +1072 +nh +ne +band +cool +724 +771 +骘 +氖 +content +842 +镝 +俅 +谮 +te +9600 +drive +phenyl +1275 +屦 +cao +menu +823 +摁 +氪 +蘧 +active +sb +appl +988 +1622 +伝 +1725 +zero +1008 +3kg +腠 +叡 +hit +鲂 +mi +0kg +748 +lite +enjoy +local +789 +続 +1506 +seen +s3 +1765 +european +讣 +gold +1279 +736 +965 +pl +button +耷 +1430 +986 +763 +toefl +燊 +鸷 +jimmy +dota +955 +861 +猊 +732 +xbox +days +dan +673 +833 +囡 +崤 +4c +economics +23000 +agent +html5 +points +ryan +shi +砬 +湜 +reading +918 +mine +adc +917 +1592 +1781 +翚 +峯 +909 +once +exchange +choose +current +symbian +ts16949 +dave +machine +鲎 +qos +蕖 +1785 +9m +cia +until +cs4 +759 +f3 +903 +24000 +968 +8mg +lewis +鹈 +凼 +snh48 +866 +泫 +荑 +黻 +牂 +1722 +鄣 +篑 +ho +1110 +1784 +髭 +陬 +寔 +dt +shanghai +疴 +邽 +987 +45000 +1042 +喏 +彖 +sl +saas +814 +28000 +a5 +彘 +赟 +819 +foxpro +shit +822 +盹 +诮 +鸫 +per +does +150mm +products +camp +select +capital +茕 +corporation +26000 +铖 +954 +dd +闩 +string +page +ba +671 +読 +782 +鄜 +漈 +盍 +dlp +729 +甭 +愎 +outlook +wii +ue +1787 +festival +communication +channel +gary +1755 +1774 +8600 +copy +150mg +魃 +dragon +1056 +c5 +炆 +track +hdpe +liang +鍊 +1800mhz +1619 +蛐 +995 +21000 +薜 +win +1394 +1786 +rain +楯 +table +鲀 +逡 +itu +applications +mmorpg +嘞 +s7 +696 +侔 +1069 +觇 +lbs +0mg +car +wave +糸 +踮 +狷 +1552 +1627 +latest +step +886 +761 +菘 +783 +寳 +esp +扃 +865 +jazz +k1 +fine +child +kind +anna +60mg +997 +maria +nk +792 +raw +late +soa +905 +cai +ttl +delphi +prince +1340 +禊 +synthesis +喑 +rmb +miller +patrick +933 +running +50kg +1398 +ast +752 +location +dead +塍 +chateau +allows +forget +tg +921 +栝 +5w +kiss +1690 +691 +arthur +瓿 +index +csa +rmvb +msc +廨 +cas +known +h1 +tj +j2ee +asian +841 +1227 +g20 +cross +cos +ntilde +719 +貘 +dnf +california +france +modern +pacific +769 +1066 +turbo +753 +795 +669 +1764 +868 +馕 +僰 +union +1772 +2150 +1063 +哏 +double +fight +858 +math +bo +瑷 +men +sea +6700 +sem +697 +疎 +882 +note +qi +uml +902 +1637 +tp +1290 +1085 +776 +蝣 +怵 +阃 +dps +1687 +弢 +镲 +hcl +al2o3 +js +auto +螅 +1683 +v5 +culture +935 +吖 +edge +碲 +voice +1007 +bridge +855 +008 +夼 +茌 +battle +嗬 +靺 +dp +ae +1090 +895 +1012 +1162 +bi +778 +髀 +1575 +pcm +15min +1598 +铊 +secret +739 +200m +6h +matt +谡 +card +mic +癔 +ecu +16mm +984 +镠 +5km +dhcp +1753 +巻 +秾 +living +gn +1643 +framework +菪 +679 +赜 +1782 +four +铈 +1777 +british +shell +santa +yuan +20ma +fly +927 +qu +nds +qaq +bar +髙 +arp +1667 +1773 +693 +main +鲳 +1510 +1002 +2022 +cdna +box +珰 +100km +004 +畋 +bring +泅 +959 +hpv +makes +cmv +鲅 +tmd +1762 +854 +泚 +ghost +short +mcu +1768 +cat +963 +1757 +1206 +1207 +puzzle +793 +central +859 +飏 +walter +60hz +anderson +1727 +thought +屍 +仨 +864 +molecular +856 +dong +financial +1728 +surface +g2 +mf +葚 +叻 +solidworks +res +speed +1195 +咻 +ascii +1404 +784 +jeff +衩 +1371 +land +biology +1655 +郄 +otc +sio +1310 +1605 +蹩 +mems +1618 +m16 +complete +industrial +acs +1603 +kids +tour +u2 +allen +1756 +743 +嬖 +踽 +davis +柽 
+鞨 +65279 +7600 +30ml +957 +0l +734 +p450 +956 +ir +麴 +500mm +casio +1038 +roger +library +015 +1652 +薙 +within +hands +874 +ntsc +钇 +whole +jq +氵 +垆 +post +sweet +wall +898 +cs5 +feo +9800 +cms +1390 +since +medical +犟 +1492 +罍 +stand +justin +lake +i5 +1729 +bell +ruby +important +bout +images +lab +962 +1759 +rj +cache +nb +production +経 +807 +1771 +doing +粜 +tnf +ws +guide +bim +events +1626 +1016 +焜 +performance +ra +zl +牀 +1568 +1647 +埝 +洧 +1615 +shift +788 +shen +1588 +60mm +覧 +tuv +1673 +electronic +mos +蓣 +8kg +862 +echo +1572 +section +981 +甯 +sg +1664 +understand +hsk +delta +x86 +eap +block +1578 +er +xl +蒐 +馐 +nox +畑 +ib +trying +ann +1635 +apache +naoh +12345 +缑 +礽 +1624 +694 +瞋 +1601 +浍 +983 +773 +1000m +someone +15kg +25m +847 +袢 +桕 +1037 +jerry +843 +picture +919 +e3 +printf +3gs +marie +853 +rj45 +侩 +913 +896 +lose +unicode +100cm +1711 +charlie +詈 +戸 +1689 +room +烝 +beat +堌 +伋 +hplc +9300 +110kv +nfc +倬 +764 +iis +圯 +solo +碇 +ef +round +chang +1366 +781 +1585 +982 +socket +df +892 +1536 +831 +ren +6kg +4900 +纰 +object +forever +832 +951 +qr +1023 +8800 +4kg +磾 +泔 +1131 +纮 +蓁 +971 +building +1021 +铗 +939 +弇 +挲 +crystal +艉 +smtp +鱬 +cims +fang +1265 +trans +pan +1745 +1604 +泺 +橛 +817 +796 +袴 +cosplay +1154 +1189 +749 +794 +1068 +881 +hc +hope +1410 +couldn +1638 +992 +along +age +250mg +clear +aps +1631 +1011 +provides +1123 +1701 +36000 +csf +韪 +n1 +works +籓 +967 +ptc +贶 +1111 +1651 +棰 +1726 +sar +1666 +qvga +hf +coreldraw +possible +趵 +1629 +943 +marc +luo +樨 +848 +county +944 +tb +dts +junior +vba +lot +傕 +玕 +毎 +direct +839 +繸 +2350 +774 +劵 +fsh +wmv +镧 +秫 +1094 +osi +1602 +邶 +猞 +dior +1766 +1623 +廛 +栌 +钲 +镦 +1607 +psa +spss +xy +1769 +cells +1465 +1577 +gon +send +vision +thinking +imf +嘏 +carl +蝰 +32000 +bay +928 +is09001 +镏 +20kg +淠 +imax +novel +qt +1684 +荇 +逄 +au +author +mod +80mm +1748 +849 +1612 +yet +嘅 +929 +6l +karl +6100 +students +gmat +myself +kate +jpg +979 +1752 +829 +2450 +914 +876 +祕 +瑠 +48h +mpv +1734 +mis +1565 +walk +941 +1075 +1235 +natural +k2 +977 +炝 +杪 +4050 +1669 +p3 +1004 +fn +埴 +1555 +vmware +chloride +942 +steven +1078 +獬 +966 +1135 +country +947 +柢 +捱 +跣 +887 +涑 +75mm +1278 +1583 +western +watch +撃 +伢 +堠 +1045 +12m +museum +1215 +document +marketing +952 +卽 +猁 +usb3 +906 +厣 +physical +辏 +1668 +旆 +agp +茆 +1488 +pg +乜 +deep +1082 +961 +踯 +1526 +# +[ +yam +lofter +##s +##0 +##a +##2 +##1 +##3 +##e +##8 +##5 +##6 +##4 +##9 +##7 +##t +##o +##d +##i +##n +##m +##c +##l +##y +##r +##g +##p +##f +pixnet +cookies +tripadvisor +##er +##k +##h +##b +##x +##u +##w +##ing +ctrip +##on +##v +llc +##an +##z +blogthis +##le +##in +##mm +##00 +ig +##ng +##us +##te +##ed +ncc +blog +##10 +##al +##ic +##ia +##q +##ce +##en +##is +##ra +##es +##j +##cm +tw +##ne +##re +##tion +pony +##2017 +##ch +##or +##na +cafe +pinterest +pixstyleme3c +##ta +##2016 +##ll +##20 +##ie +##ma +##17 +##ion +##th +##st +##se +##et +##ck +##ly +web885 +##ge +xd +##ry +##11 +0fork +##12 +##ter +##ar +##la +##os +##30 +##el +##50 +##ml +tue +posted +##at +##man +##15 +ago +##it +##me +##de +##nt +##mb +##16 +##ve +##da +##ps +##to +https +momo +##son +##ke +##80 +ebd +apk +##88 +##um +wiki +brake +mon +po +june +##ss +fb +##as +leonardo +safari +##60 +wed +win7 +kiehl +##co +##go +vfm +kanye +##90 +##2015 +##id +##ey +##sa +##ro +##am +##no +thu +fri +##sh +##ki +comments +##pe +##ine +uber +##mi +##ton +wordpress +##ment +win10 +##ld +##li +gmail +##rs +##ri +##rd +##21 +##io +##99 +paypal +policy +##40 +##ty +##18 +##01 +##ba +taiwan +##ga +privacy +agoda +##13 +##ny +##24 +##22 
+##by +##ur +##hz +##ang +cookie +netscape +##ka +##ad +nike +survey +##016 +wikia +##32 +##017 +cbc +##tor +##kg +##rt +##14 +campaign +##ct +##ts +##ns +##ao +##nd +##70 +##ya +##il +##25 +0020 +897 +##23 +hotels +##ian +6606 +##ers +##26 +##day +##ay +##line +##be +talk2yam +yamservice +coco +##dy +##ies +##ha +instagram +##ot +##va +##mo +##land +ltxsw +##ation +##pa +##ol +tag +##ue +##31 +oppo +##ca +##om +chrome +##ure +lol +##19 +##bo +##100 +##way +##ko +##do +##un +##ni +herme +##28 +##up +##06 +##ds +admin +##48 +##015 +##35 +##ee +tpp +##ive +##cc +##ble +##ity +##ex +##ler +##ap +##book +##ice +##km +##mg +##ms +ebay +##29 +ubuntu +##cy +##view +##lo +##oo +##02 +step1 +july +##net +##ls +##ii +##05 +##33 +step2 +ios9 +##box +##ley +samsung +pokemon +##ent +##les +s8 +atom +##said +##55 +##2014 +##66 +adidas +amazon +##ber +##ner +visa +##77 +##der +connectivity +##hi +firefox +skip +##27 +##ir +##61 +##ai +##ver +cafe2017 +##ron +##ster +##sk +##ft +longchamp +ssd +##ti +reply +##my +apr +##ker +source +##one +##2013 +##ow +goods +##lin +##ip +##ics +##45 +##03 +##ff +##47 +ganji +##nce +##per +faq +comment +##ock +##bs +##ah +##lv +##mp +##000 +melody +17life +##au +##71 +##04 +##95 +##age +tips +##68 +##ting +##ung +wonderland +##ction +mar +article +##db +##07 +##ore +##op +##78 +##38 +##ong +##73 +##08 +##ica +##36 +##wa +##64 +homemesh +##85 +##tv +##di +macbook +##ier +##si +##75 +##ok +goris +lock +##ut +carol +##vi +##ac +anti +jan +tags +##98 +##51 +august +##86 +##fs +##sion +jordan +##tt +##lt +##42 +##bc +vivi +##rry +##ted +##rn +usd +##t00 +##58 +##09 +##34 +goo +##ui +##ary +item +##pm +##41 +##za +##2012 +blogabstract +##ger +##62 +##44 +gr2 +asus +cindy +##hd +esc +##od +booking +##53 +fed +##81 +##ina +chan +distribution +steam +pk10 +##ix +##65 +##91 +dec +##ana +icecat +00z +##46 +##ji +##ard +oct +##ain +jp +##ze +##bi +cio +##56 +h5 +##39 +##port +curve +##nm +##dia +utc +12345678910 +##52 +chanel +##and +##im +##63 +vera +vivo +##ei +2756 +##69 +msci +##po +##89 +##bit +##out +##zz +##97 +##67 +opec +##96 +##tes +##ast +##ling +##ory +##ical +kitty +##43 +step3 +##cn +win8 +iphone7 +beauty +##87 +dollars +##ys +##oc +pay +##2011 +##lly +##ks +download +sep +##board +##37 +##lan +winrar +##que +##ua +##com +ettoday +##54 +##ren +##via +##72 +##79 +##tch +##49 +##ial +##nn +step4 +2765 +gov +##xx +mandy +##ser +copyright +fashion +##ist +##art +##lm +##ek +##ning +##if +##ite +iot +##84 +##2010 +##ku +october +##ux +trump +##hs +##ide +##ins +april +##ight +##83 +protected +##fe +##ho +ofo +gomaji +march +##lla +##pp +##ec +6s +720p +##rm +##ham +##92 +fandom +##ell +info +##82 +sina +4066 +##able +##ctor +rights +jul +##76 +mall +##59 +donald +sodu +##light +reserved +htm +##han +##57 +##ise +##tions +##shi +doc +055 +##ram +shopping +aug +##pi +##well +wam +##hu +##gb +##93 +mix +##ef +##uan +bwl +##plus +##res +##ess +tea +hktvmall +##ate +##ese +feb +inn +nov +##ci +pass +##bet +##nk +coffee +airbnb +##ute +woshipm +skype +##fc +##www +##94 +##ght +##gs +##ile +##wood +##uo +icon +##em +says +##king +##tive +blogger +##74 +##ox +##zy +##red +##ium +##lf +nokia +claire +##ding +november +lohas +##500 +##tic +##cs +##che +##ire +##gy +##ult +january +ptt +##fa +##mer +pchome +udn +##time +##tte +garden +eleven +309b +bat +##123 +##tra +kindle +##ern +xperia +ces +travel +##ous +##int +edu +cho +##car +##our +##ant +rends +##jo +mastercard +##2000 +kb +##min +##ino +##ris +##ud +##set +##her +##ou +taipei +##fi +##ill +aphojoy +december +meiki +##ick 
+tweet +##av +iphone6 +##dd +views +##mark +##ash +##ome +koreanmall +##ak +q2 +##200 +mlb +##lle +##watch +##und +##tal +##less +4399 +##rl +update +shop +##mhz +##house +##key +##001 +##hy +##web +##2009 +##gg +##wan +##val +2021 +##ons +doi +trivago +overdope +##ance +573032185 +wx17house +##so +audi +##he +##rp +##ake +beach +cfa +ps4 +##800 +##link +##hp +ferragamo +##eng +##style +##gi +i7 +##ray +##max +##pc +september +##ace +vps +february +pantos +wp +lisa +jquery +offer +##berg +##news +fks +##all +##rus +##888 +##works +blogtitle +loftpermalink +ling +##ja +outlet +##ea +##top +##ness +salvatore +##lu +swift +##ul +week +##ean +##300 +##gle +##back +powered +##tan +##nes +canon +##zi +##las +##oe +##sd +##bot +##world +##zo +top100 +pmi +##vr +ball +vogue +ofweek +##list +##ort +##lon +##tc +##of +##bus +##gen +nas +##lie +##ria +##coin +##bt +nata +vive +cup +##ook +##sy +msg +3ce +##word +ebooks +r8 +nice +months +rewards +##ther +0800 +##xi +##sc +gg +blogfp +daily +##bb +##tar +##ky +anthony +##yo +##ara +##aa +##rc +##tz +##ston +gear +##eo +##ade +##win +##ura +##den +##ita +##sm +png +rakuten +whatsapp +##use +pad +gucci +##ode +##fo +chicago +##hone +io +sogo +be2 +##ology +cloud +##con +##ford +##joy +##kb +##rade +##ach +docker +##ful +##ase +ford +##star +edited +##are +##mc +siri +##ella +bloomberg +##read +pizza +##ison +##vm +node +18k +##play +##cer +##yu +##ings +asr +##lia +step5 +##cd +pixstyleme +##600 +##tus +tokyo +##rial +##life +##ae +tcs +##rk +##wang +##sp +##ving +premium +netflix +##lton +##ple +##cal +021 +##sen +##ville +nexus +##ius +##mah +tila +##tin +resort +##ws +p10 +report +##360 +##ru +bus +vans +##est +links +rebecca +##dm +azure +##365 +##mon +moto +##eam +blogspot +##ments +##ik +##kw +##bin +##ata +##vin +##tu +##ula +station +##ature +files +zara +hdr +top10 +s6 +marriott +avira +tab +##ran +##home +oculus +##ral +rosie +##force +##ini +ice +##bert +##nder +##mber +plurk +##sis +00kg +##ence +##nc +##name +log +ikea +malaysia +##ncy +##nie +##ye +##oid +##chi +xuehai +##1000 +##orm +##rf +##ware +##pro +##era +##ub +##2008 +8891 +scp +##zen +qvod +jcb +##hr +weibo +##row +##ish +github +mate +##lot +##ane +##tina +ed2k +##vel +##900 +final +ns +bytes +##ene +##cker +##2007 +##px +topapp +helpapp +14k +g4g +ldquo +##fork +##gan +##zon +##qq +##google +##ism +##zer +toyota +category +##labels +restaurant +##md +posts +##ico +angelababy +123456 +sports +candy +##new +##here +swissinfo +dram +##ual +##vice +##wer +sport +q1 +ios10 +##mll +wan +##uk +x3 +0t +##ming +e5 +##3d +h7n9 +worldcat +##vo +##led +##580 +##ax +##ert +polo +##lr +##hing +##chat +##ule +hotmail +##pad +bbq +##ring +wali +2k +costco +switch +##city +philips +##mann +panasonic +##cl +##vd +##ping +##rge +##lk +css3 +##ney +##ular +##400 +##tter +lz +##tm +##yan +##let +coach +##pt +a8 +follow +##berry +##ew +##wn +##og +##code +##rid +villa +git +r11 +##cket +error +##anonymoussaid +##ag +##ame +##gc +qa +##lis +##gin +vmalife +##cher +wedding +##tis +demo +bye +##rant +orz +acer +##ats +##ven +macd +yougou +##dn +##ano +##urt +##rent +continue +script +##wen +##ect +paper +##chel +##cat +x5 +fox +##blog +loading +##yn +##tp +kuso +799 +vdc +forest +prime +ultra +##rmb +square +##field +##reen +##ors +##ju +##air +##map +cdn +##wo +m8 +##get +opera +##base +##ood +vsa +##aw +##ail +count +##een +##gp +vsc +tree +##eg +##ose +##ories +##shop +alphago +v4 +fluke62max +zip +##sta +bas +##yer +hadoop +##ube +##wi +0755 +hola +##low +centre +##fer +##750 +##media +##san +##bank 
+q3 +##nge +##mail +##lp +client +event +vincent +##nse +sui +adchoice +##stry +##zone +ga +apps +##ab +##rner +kymco +##care +##pu +##yi +minkoff +annie +collection +kpi +playstation +bh +##bar +armani +##xy +iherb +##ery +##share +##ob +volvo +##ball +##hk +##cp +##rie +##ona +##sl +gtx +rdquo +jayz +##lex +##rum +namespace +##ale +##atic +##erson +##ql +##ves +##type +enter +##168 +##mix +##bian +a9 +ky +##lc +movie +##hc +tower +##ration +##mit +##nch +ua +tel +prefix +##o2 +##point +ott +##http +##ury +baidu +##ink +member +##logy +bigbang +nownews +##js +##shot +##tb +eba +##tics +##lus +spark +##ama +##ions +##lls +##down +##ress +burberry +day2 +##kv +related +edit +##ark +cx +32gb +g9 +##ans +##tty +s5 +##bee +thread +xr +buy +spotify +##ari +##verse +7headlines +nego +sunny +dom +positioning +fit +##tton +alexa +##ties +##llow +amy +##du +##rth +##lar +2345 +##des +sidebar +site +##cky +##kit +##ime +##009 +season +##fun +gogoro +a7 +lily +twd600 +##vis +##cture +friday +yi +##tta +##tel +##lock +economy +tinker +8gb +##app +oops +##right +edm +##cent +supreme +##its +##asia +dropbox +##tti +books +##tle +##ller +##ken +##more +##boy +sex +##dom +##ider +##unch +##put +##gh +ka +amoled +div +##tr +##n1 +port +howard +##tags +ken +##nus +adsense +buff +thunder +##town +##ique +##body +pin +##erry +tee +##the +##013 +udnbkk +16gb +##mic +miui +##tro +##alk +##nity +s4 +##oa +docomo +##tf +##ack +fc2 +##ded +##sco +##014 +##rite +linkedin +##ada +##now +##ndy +ucbug +sputniknews +legalminer +##ika +##xp +##bu +q10 +##rman +cheese +ming +maker +##gm +nikon +##fig +ppi +jchere +ted +fgo +tech +##tto +##gl +##len +hair +img +##pper +##a1 +acca +##ition +##ference +suite +##ig +##mond +##cation +##pr +101vip +##999 +64gb +airport +##over +##ith +##su +town +piece +##llo +no1 +##qi +focus +reader +##admin +##ora +false +##log +##ces +##ume +motel +##oper +flickr +netcomponents +##af +pose +##ound +##cg +##site +##iko +con +##ath +##hip +##rey +cream +##cks +012 +##dp +facebooktwitterpinterestgoogle +sso +shtml +swiss +##mw +lumia +xdd +tiffany +insee +russell +dell +##ations +camera +##vs +##flow +##late +classic +##nter +##ever +##lab +##nger +qe +##cing +editor +##nap +sunday +##ens +##700 +##bra +acg +sofascore +mkv +##ign +jonathan +build +labels +##oto +tesla +moba +gohappy +ajax +##test +##urs +wps +fedora +##ich +mozilla +##480 +##dr +urn +##lina +grace +##die +##try +##ader +elle +##chen +price +##ten +uhz +##ough +##hen +states +push +session +balance +wow +##cus +##py +##ward +##ep +34e +wong +prada +##cle +##ree +q4 +##ctive +##ool +##ira +##163 +rq +buffet +e6 +##ez +##card +##cha +day3 +eye +##end +adi +tvbs +##ala +nova +##tail +##ries +##ved +base +##ways +hero +hgih +profile +fish +mu +ssh +##wd +click +cake +##ond +pre +##tom +kic +pixel +##ov +##fl +product +6a +##pd +dear +##gate +yumi +##sky +bin +##ture +##ape +isis +nand +##101 +##load +##ream +a6 +##post +##we +zenfone +##ike +gd +forum +jessica +##ould +##ious +lohasthree +##gar +##ggle +##ric +##own +eclipse +##side +061 +##other +##tech +##ator +engine +##ged +plaza +##fit +westbrook +reuters +##ily +contextlink +##hn +##cil +##cel +cambridge +##ize +##aid +##data +frm +##head +butler +##sun +##mar +puma +pmid +kitchen +##lic +day1 +##text +##page +##rris +pm1 +##ket +trackback +##hai +display +##hl +idea +##sent +airmail +##ug +##men +028 +##lution +schemas +asics +wikipedia +##tional +##vy +##dget +##ein +contact +pepper +##uel +##ument +##hang +q5 +##sue +##ndi +swatch +##cept +popular +##ste +##tag +trc 
+##west +##live +honda +ping +messenger +##rap +v9 +unity +appqq +leo +##tone +##ass +uniqlo +##010 +moneydj +##tical +12306 +##m2 +coc +miacare +##mn +tmt +##core +vim +kk +##may +target +##2c +##ope +omega +pinkoi +##rain +##ement +p9 +rd +##tier +##vic +zone +isofix +cpa +kimi +##lay +lulu +##uck +050 +weeks +##hop +##ear +eia +##fly +korea +boost +##ship +eur +valley +##iel +##ude +rn +##ena +feed +5757 +qqmei +##thing +aws +pink +##ters +##kin +board +##vertisement +wine +##ien +##dge +##tant +##twitter +##3c +cool1 +##012 +##150 +##fu +##iner +googlemsn +pixnetfacebookyahoo +x7 +##uce +sao +##ev +##file +9678 +xddd +shirt +##rio +##hat +givenchy +bang +##lio +monday +##abc +ubuntuforumwikilinuxpastechat +##vc +##rity +7866 +##ost +imsean +tiger +##fet +dji +##come +##beth +##aft +##don +3p +emma +##khz +x6 +##face +pptv +x4 +##mate +sophie +##jing +fifa +##mand +sale +inwedding +##gn +##mmy +##pmlast +nana +##wu +note7 +##340 +##bel +window +##dio +##ht +##ivity +domain +neo +##isa +##lter +5k +f5 +##cts +ft +zol +##act +mwc +nbapop +eds +##room +previous +tomtom +##ets +5t +chi +##hg +fairmont +gay +1b +##raph +##ils +i3 +avenue +##host +##bon +##tsu +message +navigation +fintech +h6 +##ject +##vas +##firm +credit +##wf +xxxx +##nor +##space +huawei +plan +json +sbl +##dc +wish +##120 +##sol +windows7 +washington +##nsis +lo +##sio +##ym +##bor +planet +##wt +gpa +##tw +##oka +connect +##rss +##work +##atus +chicken +##times +fa +##ather +##cord +009 +##eep +hitachi +##pan +disney +##press +wind +frigidaire +##tl +hsu +##ull +expedia +archives +##wei +cut +ins +6gb +brand +cf1 +##rip +##nis +128gb +3t +##oon +quick +15058 +wing +##bug +##cms +##dar +##oh +zoom +trip +##nba +rcep +aspx +080 +gnu +##count +##url +##ging +8591 +am09 +shadow +##cia +emily +##tation +host +ff +techorz +##mini +##mporary +##ering +##next +cma +##mbps +##gas +##ift +##dot +amana +##ros +##eet +##ible +##aka +##lor +maggie +##011 +##iu +##gt +1tb +articles +##burg +##iki +database +fantasy +##rex +##cam +dlc +dean +##you +path +gaming +victoria +maps +##lee +##itor +overchicstoretvhome +##xt +##nan +x9 +install +##ann +##ph +##rcle +##nic +##nar +metro +chocolate +##rian +##table +skin +##sn +mountain +##0mm +inparadise +7x24 +##jia +eeworld +creative +g5 +parker +ecfa +village +sylvia +hbl +##ques +##onsored +##x2 +##v4 +##tein +ie6 +##stack +ver +##ads +##baby +bbe +##110 +##lone +##uid +ads +022 +gundam +006 +scrum +match +##ave +##470 +##oy +##talk +glass +lamigo +##eme +##a5 +wade +kde +##lace +ocean +tvg +##covery +##r3 +##ners +##rea +##aine +cover +##ision +##sia +##bow +msi +##love +soft +z2 +##pl +mobil +##uy +nginx +##oi +##rr +6221 +##mple +##sson +##nts +91tv +comhd +crv3000 +##uard +gallery +##bia +rate +spf +redis +traction +icloud +011 +jose +##tory +sohu +899 +kicstart2 +##hia +##sit +##walk +##xure +500g +##pact +xa +carlo +##250 +##walker +##can +cto +gigi +pen +##hoo +ob +##yy +13913459 +##iti +mango +##bbs +sense +oxford +walker +jennifer +##ola +course +##bre +##pus +##rder +lucky +075 +ivy +##nia +sotheby +##ugh +joy +##orage +##ush +##bat +##dt +r9 +##2d +##gio +wear +##lax +##moon +seven +lonzo +8k +evolution +##kk +kd +arduino +##lux +arpg +##rdon +cook +##x5 +five +##als +##ida +sign +##nda +##posted +fresh +##mine +##skip +##form +##ssion +##tee +dyson +stage +##jie +##night +epson +pack +##ppy +wd +##eh +##rence +##lvin +golden +discovery +##trix +##n2 +loft +##uch +##dra +##sse +1mdb +welcome +##urn +gaga +##lmer +teddy +##160 +##f2016 +##sha +rar +holiday +074 +##vg +##nos 
+##rail +gartner +gi +6p +##dium +kit +b3 +eco +sean +##stone +nu +##np +f16 +write +029 +m5 +##ias +##dk +fsm +52kb +##xxx +##cake +lim +ru +1v +##ification +published +angela +16g +analytics +##nel +gmt +##icon +##bby +ios11 +waze +9985 +##ust +##007 +delete +52sykb +wwdc +027 +##fw +1389 +##xon +brandt +##ses +##dragon +vetements +anne +monte +official +##ere +##nne +##oud +etnews +##a2 +##graphy +##rtex +##gma +mount +archive +morning +tan +ddos +e7 +day4 +factory +bruce +##ito +guest +##lling +n3 +mega +women +dac +church +##jun +singapore +##facebook +6991 +starbucks +##tos +##stin +##shine +zen +##mu +tina +request +##gence +q7 +##zzi +diary +##tore +##ead +cst +##osa +canada +va +##jiang +##lam +##nix +##sday +g6 +##master +bing +##zl +nb40 +thai +ln284ct +##itz +##2f +bonnie +##food +##lent +originals +##stro +##lts +##bscribe +ntd +yesstyle +hmv +##tment +d5 +##pn +topios9 +lifestyle +virtual +##ague +xz +##deo +muji +024 +unt +##nnis +faq1 +##ette +curry +##pop +release +##cast +073 +##ews +5c +##stle +ios7 +##ima +dog +lenovo +##r4 +013 +vornado +##desk +##ald +9595 +##van +oil +common +##jy +##lines +g7 +twice +ella +nano +belle +##mes +##self +##note +benz +##ova +##wing +kai +##hua +##rect +rainer +##unge +##0m +guestname +##uma +##kins +##zu +tokichoi +##price +##med +##mus +rmk +address +vm +openload +##group +##hin +##iginal +amg +urban +##oz +jobs +##public +##sch +##dden +##bell +hostel +##drive +##rmin +boot +##370 +##fx +##nome +##ctionary +##oman +##lish +##cr +##hm +##how +francis +c919 +b5 +evernote +##uc +##3000 +coupe +##urg +##cca +##uality +019 +##ett +##ani +##tax +##rma +leonnhurt +##jin +ict +bird +notes +##dical +##lli +result +iu +ee +smap +gopro +##last +yin +pure +32g +##dan +##rame +mama +##oot +bean +##hur +2l +bella +sync +xuite +##ground +discuz +##getrelax +##ince +##bay +##5s +apt +##pass +jing +##rix +rich +niusnews +##ello +bag +##eting +##mobile +##ience +details +universal +silver +dit +private +ddd +u11 +kanshu +##ified +fung +##nny +dx +##520 +tai +023 +##fr +##lean +##pin +##rin +ly +rick +##bility +banner +##baru +##gion +vdf +qualcomm +bear +oldid +ian +jo +##tors +population +##ernel +##mv +##bike +ww +##ager +exhibition +##del +##pods +fpx +structure +##free +##tings +kl +##rley +##copyright +##mma +orange +yoga +4l +canmake +honey +##anda +nikkie +dhl +publishing +##mall +##gnet +e88 +##dog +fishbase +### +##[ +。 +! +? +! +? +; +: +; +##, +##的 +##、 +##一 +##人 +##有 +##是 +##在 +##中 +##为 +##和 +##了 +##不 +##年 +##学 +##大 +##国 +##生 +##以 +##“ +##” +##作 +##业 +##个 +##上 +##用 +##, +##地 +##会 +##成 +##发 +##工 +##时 +##于 +##理 +##出 +##行 +##要 +##. 
+##等 +##他 +##到 +##之 +##这 +##可 +##后 +##家 +##对 +##能 +##公 +##与 +##》 +##《 +##主 +##方 +##分 +##经 +##来 +##全 +##其 +##部 +##多 +##产 +##自 +##文 +##高 +##动 +##进 +##法 +##化 +##: +##我 +##面 +##) +##( +##实 +##教 +##建 +##体 +##而 +##长 +##子 +##下 +##现 +##开 +##本 +##力 +##定 +##性 +##过 +##设 +##合 +##小 +##同 +##机 +##市 +##品 +##水 +##新 +##内 +##事 +##也 +##种 +##及 +##制 +##入 +##所 +##心 +##务 +##就 +##管 +##们 +##得 +##展 +##重 +##民 +##加 +##区 +##物 +##者 +##通 +##天 +##政 +##三 +##电 +##关 +##度 +##第 +##名 +##术 +##最 +##系 +##月 +##外 +##资 +##日 +##代 +##员 +##如 +##间 +##位 +##并 +##书 +##科 +##村 +##应 +##量 +##道 +##前 +##当 +##无 +##里 +##相 +##平 +##从 +##计 +##提 +##保 +##任 +##程 +##技 +##都 +##研 +##十 +##基 +##特 +##好 +##被 +##或 +##目 +##将 +##使 +##山 +##二 +##说 +##数 +##点 +##明 +##情 +##元 +##着 +##收 +##组 +##然 +##美 +##各 +##由 +##场 +##金 +##形 +##农 +##期 +##因 +##表 +##此 +##色 +##起 +##还 +##立 +##世 +##安 +##活 +##专 +##质 +##规 +##社 +##万 +##信 +##西 +##统 +##结 +##路 +##利 +##次 +##南 +##式 +##意 +##级 +##常 +##师 +##校 +##你 +##育 +##果 +##究 +##司 +##服 +##门 +##海 +##导 +##流 +##项 +##她 +##总 +##处 +##两 +##传 +##东 +##正 +##省 +##院 +##户 +##手 +##具 +##原 +##强 +##北 +##向 +##先 +##但 +##米 +##城 +##企 +##件 +##风 +##军 +##身 +##更 +##知 +##已 +##气 +##战 +##至 +##单 +##口 +##集 +##创 +##解 +##四 +##标 +##交 +##比 +##商 +##论 +##界 +##题 +##变 +##花 +##改 +##类 +##运 +##指 +##型 +##调 +##女 +##神 +##接 +##造 +##受 +##广 +##只 +##委 +##去 +##共 +##治 +##达 +##持 +##条 +##网 +##头 +##构 +##县 +##些 +##该 +##又 +##那 +##想 +##样 +##办 +##济 +##格 +##责 +##车 +##很 +##施 +##求 +##己 +##光 +##精 +##林 +##完 +##爱 +##线 +##参 +##少 +##积 +##清 +##看 +##优 +##报 +##王 +##直 +##没 +##每 +##据 +##游 +##效 +##感 +##五 +##影 +##别 +##获 +##领 +##称 +##选 +##供 +##乐 +##老 +##么 +##台 +##问 +##划 +##带 +##器 +##源 +##织 +##放 +##深 +##备 +##视 +##白 +##功 +##取 +##装 +##营 +##见 +##记 +##环 +##队 +##节 +##准 +##石 +##它 +##回 +##历 +##负 +##真 +##增 +##医 +##联 +##做 +##职 +##容 +##士 +##包 +##义 +##观 +##团 +##病 +##府 +##息 +##则 +##考 +##料 +##华 +##州 +##语 +##证 +##整 +##让 +##江 +##史 +##空 +##验 +##需 +##支 +##命 +##给 +##离 +##认 +##艺 +##较 +##土 +##古 +##养 +##才 +##境 +##推 +##把 +##均 +##图 +##际 +##斯 +##近 +##片 +##局 +##修 +##字 +##德 +##权 +##步 +##始 +##复 +##转 +##协 +##即 +##打 +##画 +##投 +##决 +##何 +##约 +##反 +##费 +##议 +##护 +##极 +##河 +##房 +##查 +##布 +##思 +##干 +##价 +##儿 +##非 +##马 +##党 +##奖 +##模 +##故 +##编 +##音 +##范 +##识 +##率 +##存 +##引 +##客 +##属 +##评 +##采 +##尔 +##配 +##镇 +##室 +##再 +##案 +##监 +##习 +##注 +##根 +##克 +##演 +##食 +##族 +##示 +##球 +##状 +##青 +##号 +##张 +##百 +##素 +##首 +##易 +##热 +##阳 +##今 +##园 +##防 +##版 +##太 +##乡 +##英 +##材 +##列 +##便 +##写 +##住 +##置 +##层 +##助 +##确 +##试 +##难 +##承 +##象 +##居 +##黄 +##快 +##断 +##维 +##却 +##红 +##速 +##连 +##众 +##细 +##态 +##话 +##周 +##言 +##药 +##培 +##血 +##亩 +##龙 +##越 +##值 +##几 +##边 +##读 +##未 +##曾 +##测 +##算 +##京 +##景 +##余 +##站 +##低 +##温 +##消 +##必 +##切 +##依 +##随 +##且 +##志 +##卫 +##域 +##照 +##许 +##限 +##著 +##销 +##落 +##足 +##适 +##争 +##策 +##控 +##武 +##按 +##初 +##角 +##核 +##死 +##检 +##富 +##满 +##显 +##审 +##除 +##致 +##亲 +##占 +##失 +##星 +##章 +##善 +##续 +##千 +##叶 +##火 +##副 +##告 +##段 +##什 +##声 +##终 +##况 +##走 +##木 +##益 +##戏 +##独 +##纪 +##植 +##财 +##群 +##六 +##赛 +##远 +##拉 +##亚 +##密 +##排 +##超 +##像 +##课 +##围 +##往 +##响 +##击 +##疗 +##念 +##八 +##云 +##险 +##律 +##请 +##革 +##诗 +##批 +##底 +##压 +##双 +##男 +##训 +##例 +##汉 +##升 +##拥 +##势 +##酒 +##眼 +##官 +##牌 +##油 +##曲 +##友 +##望 +##黑 +##歌 +##筑 +##础 +##香 +##仅 +##担 +##括 +##湖 +##严 +##秀 +##剧 +##九 +##举 +##执 +##充 +##兴 +##督 +##博 +##草 +##般 +##李 +##健 +##喜 +##授 +##普 +##预 +##灵 +##突 +##良 +##款 +##罗 +##微 +##七 +##录 +##朝 +##飞 +##宝 +##令 +##轻 +##劳 +##距 +##异 +##简 +##兵 +##树 +##序 +##候 +##含 +##福 +##尽 +##留 +##丰 +##旅 +##征 +##临 +##破 +##移 +##篇 +##抗 +##典 +##端 +##苏 +##奇 +##止 +##康 +##店 +##毛 +##觉 +##春 +##售 +##络 +##降 +##板 +##坚 +##母 +##讲 +##早 +##印 +##略 +##孩 +##夫 +##藏 +##铁 +##害 +##互 +##帝 +##田 +##融 +##皮 +##宗 +##岁 +##载 +##析 +##斗 +##须 
+##伤 +##介 +##另 +##半 +##班 +##馆 +##味 +##楼 +##卡 +##射 +##述 +##杀 +##波 +##绿 +##免 +##兰 +##绝 +##刻 +##短 +##察 +##输 +##择 +##综 +##杂 +##份 +##纳 +##父 +##词 +##银 +##送 +##座 +##左 +##继 +##固 +##宣 +##厂 +##肉 +##换 +##补 +##税 +##派 +##套 +##欢 +##播 +##吸 +##圆 +##攻 +##阿 +##购 +##听 +##右 +##减 +##激 +##巴 +##背 +##够 +##遇 +##智 +##玉 +##找 +##宽 +##陈 +##练 +##追 +##毕 +##彩 +##软 +##帮 +##股 +##荣 +##托 +##予 +##佛 +##堂 +##障 +##皇 +##若 +##守 +##似 +##届 +##待 +##货 +##散 +##额 +##尚 +##穿 +##丽 +##骨 +##享 +##差 +##针 +##索 +##稳 +##宁 +##贵 +##酸 +##液 +##唐 +##操 +##探 +##玩 +##促 +##笔 +##库 +##救 +##虽 +##久 +##闻 +##顶 +##床 +##港 +##鱼 +##亿 +##登 +##永 +##毒 +##桥 +##冷 +##魔 +##秘 +##陆 +##您 +##童 +##归 +##侧 +##沙 +##染 +##封 +##紧 +##松 +##川 +##刘 +##雄 +##希 +##毫 +##卷 +##某 +##季 +##菜 +##庭 +##附 +##逐 +##夜 +##宫 +##洲 +##退 +##顾 +##尼 +##胜 +##剂 +##纯 +##舞 +##遗 +##苦 +##梦 +##挥 +##航 +##愿 +##街 +##招 +##矿 +##夏 +##盖 +##献 +##怎 +##茶 +##申 +##吧 +##脑 +##亦 +##吃 +##频 +##宋 +##央 +##威 +##厚 +##块 +##冲 +##叫 +##熟 +##礼 +##厅 +##否 +##渐 +##笑 +##钱 +##钟 +##甚 +##牛 +##丝 +##靠 +##岛 +##绍 +##盘 +##缘 +##聚 +##静 +##雨 +##氏 +##圣 +##顺 +##唱 +##刊 +##阶 +##困 +##急 +##饰 +##弹 +##庄 +##既 +##野 +##阴 +##混 +##饮 +##损 +##齐 +##末 +##错 +##轮 +##宜 +##鲜 +##兼 +##敌 +##粉 +##祖 +##延 +##钢 +##辑 +##欧 +##硬 +##甲 +##诉 +##册 +##痛 +##订 +##缺 +##晚 +##衣 +##佳 +##脉 +##盛 +##乎 +##拟 +##贸 +##扩 +##船 +##仪 +##谁 +##警 +##停 +##席 +##竞 +##释 +##庆 +##汽 +##仍 +##掌 +##诸 +##仙 +##弟 +##吉 +##洋 +##奥 +##票 +##危 +##架 +##买 +##径 +##塔 +##休 +##付 +##恶 +##雷 +##怀 +##秋 +##借 +##巨 +##透 +##誉 +##厘 +##句 +##跟 +##胞 +##婚 +##幼 +##烈 +##峰 +##寻 +##君 +##汇 +##趣 +##纸 +##假 +##肥 +##患 +##杨 +##雅 +##罪 +##谓 +##亮 +##脱 +##寺 +##烟 +##判 +##绩 +##乱 +##刚 +##摄 +##洞 +##践 +##码 +##启 +##励 +##呈 +##曰 +##呢 +##符 +##哥 +##媒 +##疾 +##坐 +##雪 +##孔 +##倒 +##旧 +##菌 +##岩 +##鼓 +##亡 +##访 +##症 +##暗 +##湾 +##幸 +##池 +##讨 +##努 +##露 +##吗 +##繁 +##途 +##殖 +##败 +##蛋 +##握 +##刺 +##耕 +##洗 +##沉 +##概 +##哈 +##泛 +##凡 +##残 +##隐 +##虫 +##朋 +##虚 +##餐 +##殊 +##慢 +##询 +##蒙 +##孙 +##谈 +##鲁 +##裂 +##贴 +##污 +##漫 +##谷 +##违 +##泉 +##拿 +##森 +##横 +##扬 +##键 +##膜 +##迁 +##尤 +##涉 +##净 +##诚 +##折 +##冰 +##械 +##拍 +##梁 +##沿 +##避 +##吴 +##惊 +##犯 +##灭 +##湿 +##迷 +##姓 +##阅 +##灯 +##妇 +##触 +##冠 +##答 +##俗 +##档 +##尊 +##谢 +##措 +##筹 +##竟 +##韩 +##签 +##剑 +##鉴 +##灾 +##贯 +##迹 +##洛 +##沟 +##束 +##翻 +##巧 +##坏 +##弱 +##零 +##壁 +##枝 +##映 +##恩 +##抓 +##屋 +##呼 +##脚 +##绘 +##淡 +##辖 +##伊 +##粒 +##欲 +##震 +##伯 +##私 +##蓝 +##甘 +##储 +##胡 +##卖 +##梅 +##耳 +##疑 +##润 +##伴 +##泽 +##牧 +##烧 +##尾 +##累 +##糖 +##怪 +##唯 +##莫 +##粮 +##柱 +##竹 +##灰 +##岸 +##缩 +##井 +##伦 +##柔 +##盟 +##珠 +##丹 +##皆 +##哪 +##迎 +##颜 +##衡 +##啊 +##塑 +##寒 +##紫 +##镜 +##氧 +##误 +##伍 +##彻 +##刀 +##览 +##炎 +##津 +##耐 +##秦 +##尖 +##潮 +##描 +##浓 +##召 +##禁 +##阻 +##胶 +##译 +##腹 +##泰 +##乃 +##盐 +##潜 +##鸡 +##诺 +##遍 +##纹 +##冬 +##牙 +##麻 +##辅 +##猪 +##弃 +##楚 +##羊 +##晋 +##鸟 +##赵 +##洁 +##谋 +##隆 +##滑 +##籍 +##臣 +##朱 +##泥 +##墨 +##辆 +##墙 +##浪 +##姐 +##赏 +##纵 +##拔 +##倍 +##纷 +##摩 +##壮 +##苗 +##偏 +##塞 +##贡 +##仁 +##宇 +##卵 +##瓦 +##枪 +##覆 +##殿 +##刑 +##贫 +##妈 +##幅 +##幕 +##忆 +##丁 +##估 +##废 +##萨 +##舍 +##详 +##旗 +##岗 +##洪 +##贝 +##迅 +##凭 +##勇 +##雕 +##奏 +##旋 +##杰 +##煤 +##阵 +##乘 +##溪 +##奉 +##畜 +##挑 +##昌 +##硕 +##庙 +##惠 +##薄 +##逃 +##爆 +##哲 +##浙 +##珍 +##炼 +##栏 +##暴 +##币 +##隔 +##吨 +##倾 +##嘉 +##址 +##陶 +##绕 +##诊 +##遭 +##桃 +##魂 +##兽 +##豆 +##闲 +##箱 +##拓 +##燃 +##裁 +##晶 +##掉 +##脂 +##溶 +##顿 +##肤 +##虑 +##鬼 +##灌 +##徐 +##龄 +##陵 +##恋 +##侵 +##坡 +##寿 +##勤 +##磨 +##妹 +##瑞 +##缓 +##轴 +##麦 +##羽 +##咨 +##凝 +##默 +##驻 +##敢 +##债 +##浮 +##幻 +##株 +##浅 +##敬 +##敏 +##陷 +##凤 +##坛 +##虎 +##乌 +##铜 +##御 +##乳 +##讯 +##循 +##圈 +##肌 +##妙 +##奋 +##忘 +##闭 +##墓 +##汤 +##忠 +##跨 +##怕 +##振 +##宾 +##跑 +##屏 +##坦 +##粗 +##租 +##悲 +##伟 +##拜 +##妻 +##赞 +##兄 +##宿 +##碑 +##貌 +##勒 +##罚 +##夺 +##偶 +##截 +##纤 +##齿 +##郑 +##聘 +##偿 +##扶 +##豪 +##慧 +##跳 +##疏 +##莱 +##腐 +##插 +##恐 +##郎 +##辞 +##挂 
+##娘 +##肿 +##徒 +##伏 +##磁 +##杯 +##丛 +##旨 +##琴 +##炮 +##醒 +##砖 +##替 +##辛 +##暖 +##锁 +##杜 +##肠 +##孤 +##饭 +##脸 +##邮 +##贷 +##俄 +##毁 +##荷 +##谐 +##荒 +##肝 +##链 +##尺 +##尘 +##援 +##疫 +##崇 +##恢 +##扎 +##伸 +##幽 +##抵 +##胸 +##谱 +##舒 +##迫 +##畅 +##泡 +##岭 +##喷 +##窗 +##捷 +##宏 +##肯 +##狂 +##铺 +##骑 +##抽 +##券 +##俱 +##徽 +##胆 +##碎 +##邀 +##褐 +##斤 +##涂 +##赋 +##署 +##颗 +##渠 +##仿 +##迪 +##炉 +##辉 +##涵 +##耗 +##返 +##邻 +##斑 +##董 +##魏 +##午 +##娱 +##浴 +##尿 +##曼 +##锅 +##柳 +##舰 +##搭 +##旁 +##宅 +##趋 +##凉 +##赢 +##伙 +##爷 +##廷 +##戴 +##壤 +##奶 +##页 +##玄 +##驾 +##阔 +##轨 +##朗 +##捕 +##肾 +##稿 +##惯 +##侯 +##乙 +##渡 +##稍 +##恨 +##脏 +##姆 +##腔 +##抱 +##杆 +##垂 +##赴 +##赶 +##莲 +##辽 +##荐 +##旦 +##妖 +##稀 +##驱 +##沈 +##役 +##晓 +##亭 +##仲 +##澳 +##炸 +##绪 +##陕 +##恒 +##堡 +##纠 +##仇 +##懂 +##焦 +##搜 +##忍 +##贤 +##添 +##艾 +##赤 +##犹 +##尝 +##锦 +##稻 +##撰 +##填 +##衰 +##栽 +##邪 +##粘 +##跃 +##桌 +##胃 +##悬 +##翼 +##彼 +##睡 +##曹 +##刷 +##摆 +##悉 +##锋 +##摇 +##抢 +##乏 +##廉 +##鼠 +##盾 +##瓷 +##抑 +##埃 +##邦 +##遂 +##寸 +##渔 +##祥 +##胎 +##牵 +##壳 +##甜 +##卓 +##瓜 +##袭 +##遵 +##巡 +##逆 +##玛 +##韵 +##桑 +##酷 +##赖 +##桂 +##郡 +##肃 +##仓 +##寄 +##塘 +##瘤 +##碳 +##搞 +##燕 +##蒸 +##允 +##忽 +##斜 +##穷 +##郁 +##囊 +##奔 +##昆 +##盆 +##愈 +##递 +##黎 +##祭 +##怒 +##辈 +##腺 +##滚 +##暂 +##郭 +##璃 +##踪 +##芳 +##碍 +##肺 +##狱 +##冒 +##阁 +##砂 +##苍 +##揭 +##踏 +##颇 +##柄 +##闪 +##孝 +##葡 +##腾 +##茎 +##鸣 +##撤 +##仰 +##伐 +##丘 +##於 +##泪 +##荡 +##扰 +##纲 +##拼 +##欣 +##纽 +##癌 +##堆 +##菲 +##披 +##挖 +##寓 +##履 +##捐 +##悟 +##乾 +##嘴 +##钻 +##拳 +##吹 +##柏 +##遥 +##抚 +##忧 +##赠 +##霸 +##艰 +##淋 +##猫 +##帅 +##奈 +##寨 +##滴 +##鼻 +##掘 +##狗 +##驶 +##朴 +##拆 +##惜 +##玻 +##扣 +##萄 +##蔬 +##宠 +##缴 +##赫 +##凯 +##滨 +##乔 +##腰 +##葬 +##孟 +##吾 +##枚 +##圳 +##忙 +##扫 +##杭 +##凌 +##梯 +##丈 +##隶 +##剪 +##盗 +##擅 +##疆 +##弯 +##携 +##拒 +##秒 +##颁 +##醇 +##割 +##浆 +##姑 +##爸 +##螺 +##穗 +##缝 +##慈 +##喝 +##瓶 +##漏 +##悠 +##猎 +##番 +##孕 +##伪 +##漂 +##腿 +##吐 +##坝 +##滤 +##函 +##匀 +##偷 +##浩 +##矛 +##僧 +##辨 +##俊 +##棉 +##铸 +##诞 +##丧 +##夹 +##姿 +##睛 +##淮 +##阀 +##姜 +##尸 +##猛 +##芽 +##账 +##旱 +##醉 +##弄 +##坊 +##烤 +##萧 +##矣 +##雾 +##倡 +##榜 +##弗 +##氨 +##朵 +##锡 +##袋 +##拨 +##湘 +##岳 +##烦 +##肩 +##熙 +##炭 +##婆 +##棋 +##禅 +##穴 +##宙 +##汗 +##艳 +##儒 +##叙 +##晨 +##颈 +##峡 +##拖 +##烂 +##茂 +##戒 +##飘 +##氛 +##蒂 +##撞 +##瓣 +##箭 +##叛 +##鞋 +##劲 +##祝 +##娜 +##饲 +##侍 +##诱 +##叹 +##卢 +##弥 +##鼎 +##厦 +##屈 +##慕 +##魅 +##厨 +##嫁 +##绵 +##逼 +##扮 +##叔 +##酶 +##燥 +##狼 +##滋 +##汁 +##辐 +##怨 +##翅 +##佩 +##坑 +##旬 +##沃 +##剩 +##蛇 +##颖 +##篮 +##锐 +##侠 +##匹 +##唤 +##熊 +##漠 +##迟 +##敦 +##雌 +##谨 +##婴 +##浸 +##磷 +##筒 +##滩 +##埋 +##框 +##弘 +##吕 +##碰 +##纺 +##硫 +##堪 +##契 +##蜜 +##蓄 +##阐 +##傲 +##碱 +##晰 +##狭 +##撑 +##叉 +##卧 +##劫 +##闹 +##赐 +##邓 +##奴 +##溉 +##浦 +##蹈 +##辣 +##遣 +##耀 +##耶 +##翠 +##叠 +##迈 +##霍 +##碧 +##恰 +##脊 +##昭 +##摸 +##饱 +##赔 +##泄 +##哭 +##讼 +##逝 +##逻 +##廊 +##擦 +##渗 +##彰 +##卿 +##旺 +##宪 +##顷 +##妆 +##陪 +##葛 +##仔 +##淀 +##翰 +##悦 +##穆 +##煮 +##辩 +##弦 +##串 +##押 +##蚀 +##逢 +##贺 +##焊 +##煌 +##缔 +##惑 +##鹿 +##袁 +##糊 +##逸 +##舟 +##勃 +##侦 +##涯 +##蔡 +##辟 +##涌 +##枯 +##痕 +##疼 +##莉 +##柴 +##眉 +##罢 +##催 +##衔 +##秉 +##妃 +##鸿 +##傅 +##辰 +##聪 +##咸 +##扇 +##盈 +##勘 +##佐 +##泊 +##抛 +##搬 +##牢 +##宴 +##牲 +##贾 +##摘 +##姻 +##慎 +##帕 +##忌 +##卒 +##夕 +##卜 +##惟 +##挺 +##崖 +##炒 +##爵 +##冻 +##椒 +##鳞 +##祸 +##潭 +##腊 +##蒋 +##缠 +##寂 +##眠 +##冯 +##芯 +##槽 +##吊 +##聊 +##梗 +##嫩 +##凶 +##铭 +##爽 +##筋 +##韦 +##脾 +##铝 +##肢 +##栋 +##勾 +##萌 +##渊 +##掩 +##狮 +##撒 +##漆 +##骗 +##禽 +##蕴 +##坪 +##洒 +##冶 +##兹 +##椭 +##喻 +##泵 +##哀 +##翔 +##棒 +##芝 +##扑 +##毅 +##衍 +##惨 +##疯 +##欺 +##贼 +##肖 +##轰 +##巢 +##臂 +##轩 +##扁 +##淘 +##犬 +##宰 +##祠 +##挡 +##厌 +##帐 +##蜂 +##狐 +##垃 +##昂 +##圾 +##秩 +##芬 +##瞬 +##枢 +##舌 +##唇 +##棕 +##霞 +##霜 +##艇 +##侨 +##鹤 +##硅 +##靖 +##哦 +##削 +##泌 +##奠 +##吏 +##夷 +##咖 +##彭 +##窑 +##胁 +##肪 +##贞 +##劝 +##钙 +##柜 +##鸭 +##庞 +##兔 +##荆 +##丙 +##纱 +##戈 +##藤 +##矩 +##泳 +##惧 +##铃 +##渴 
+##胀 +##袖 +##丸 +##狠 +##豫 +##茫 +##浇 +##菩 +##氯 +##啡 +##葱 +##梨 +##霉 +##脆 +##氢 +##巷 +##丑 +##娃 +##锻 +##愤 +##贪 +##蝶 +##厉 +##闽 +##浑 +##斩 +##栖 +##茅 +##昏 +##龟 +##碗 +##棚 +##滞 +##慰 +##斋 +##虹 +##屯 +##萝 +##饼 +##窄 +##潘 +##绣 +##丢 +##芦 +##鳍 +##裕 +##誓 +##腻 +##锈 +##吞 +##蜀 +##啦 +##扭 +##巩 +##髓 +##劣 +##拌 +##谊 +##涛 +##勋 +##郊 +##莎 +##痴 +##窝 +##驰 +##跌 +##笼 +##挤 +##溢 +##隙 +##鹰 +##诏 +##帽 +##芒 +##爬 +##凸 +##牺 +##熔 +##吻 +##竭 +##瘦 +##冥 +##搏 +##屡 +##昔 +##萼 +##愁 +##捉 +##翁 +##怖 +##汪 +##烯 +##疲 +##缸 +##溃 +##泼 +##剖 +##涨 +##橡 +##谜 +##悔 +##嫌 +##盒 +##苯 +##凹 +##绳 +##畏 +##罐 +##虾 +##柯 +##邑 +##馨 +##兆 +##帖 +##陌 +##禄 +##垫 +##壶 +##逊 +##骤 +##祀 +##晴 +##蓬 +##苞 +##煎 +##菊 +##堤 +##甫 +##拱 +##氮 +##罕 +##舶 +##伞 +##姚 +##弓 +##嵌 +##馈 +##琼 +##噪 +##雀 +##呵 +##汝 +##焉 +##陀 +##胺 +##惩 +##沼 +##枣 +##桐 +##酱 +##遮 +##孢 +##钝 +##呀 +##锥 +##妥 +##酿 +##巫 +##闯 +##沧 +##崩 +##蕊 +##酬 +##匠 +##躲 +##喊 +##琳 +##绎 +##喉 +##凰 +##抬 +##膨 +##盲 +##剥 +##喂 +##庸 +##奸 +##钩 +##冈 +##募 +##苑 +##杏 +##杉 +##辱 +##隋 +##薪 +##绒 +##欠 +##尉 +##攀 +##抹 +##巾 +##渣 +##苹 +##猴 +##悄 +##屠 +##颂 +##湛 +##魄 +##颠 +##呆 +##粤 +##岂 +##娇 +##暑 +##鹅 +##筛 +##膏 +##樱 +##缆 +##襄 +##瑟 +##恭 +##泻 +##匪 +##兮 +##恼 +##吟 +##仕 +##蔽 +##骄 +##蚕 +##斥 +##椅 +##姬 +##谦 +##椎 +##搅 +##卸 +##沫 +##怜 +##坎 +##瑰 +##钦 +##拾 +##厕 +##後 +##逾 +##薯 +##衬 +##钾 +##崔 +##稽 +##蛮 +##殷 +##晒 +##菇 +##臭 +##弧 +##擎 +##粹 +##纬 +##焰 +##玲 +##竣 +##咒 +##歇 +##糕 +##诵 +##茨 +##妮 +##酯 +##麟 +##卑 +##浏 +##咽 +##罩 +##舱 +##酵 +##晕 +##顽 +##赁 +##咬 +##枫 +##冀 +##贮 +##艘 +##亏 +##薛 +##瀑 +##篆 +##膀 +##沸 +##雍 +##咳 +##尹 +##愉 +##烹 +##坠 +##勿 +##钠 +##坤 +##甸 +##墅 +##闸 +##藻 +##韧 +##鄂 +##瑶 +##舆 +##夸 +##蕾 +##栗 +##咏 +##丞 +##抄 +##鹏 +##弊 +##檐 +##骂 +##仆 +##峻 +##爪 +##赚 +##帆 +##娶 +##嘛 +##钓 +##澄 +##猜 +##裔 +##抒 +##铅 +##卉 +##彦 +##删 +##衷 +##禹 +##寡 +##蒲 +##砌 +##棱 +##拘 +##堵 +##雁 +##仄 +##荫 +##祈 +##奢 +##赌 +##寇 +##隧 +##摊 +##雇 +##卦 +##婉 +##敲 +##挣 +##皱 +##虞 +##亨 +##懈 +##挽 +##珊 +##饶 +##滥 +##锯 +##闷 +##酮 +##虐 +##兑 +##僵 +##傻 +##沦 +##巅 +##鞭 +##梳 +##赣 +##锌 +##庐 +##薇 +##庵 +##慨 +##肚 +##妄 +##仗 +##绑 +##枕 +##牡 +##胖 +##沪 +##垒 +##捞 +##捧 +##竖 +##蜡 +##桩 +##厢 +##孵 +##黏 +##拯 +##谭 +##诈 +##灿 +##釉 +##裹 +##钮 +##俩 +##灶 +##彝 +##蟹 +##涩 +##醋 +##匙 +##歧 +##刹 +##玫 +##棘 +##橙 +##凑 +##桶 +##刃 +##伽 +##硝 +##怡 +##籽 +##敞 +##淳 +##矮 +##镶 +##戚 +##幢 +##涡 +##尧 +##膝 +##哉 +##肆 +##畔 +##溯 +##媚 +##烘 +##窃 +##焚 +##澜 +##愚 +##棵 +##乞 +##佑 +##暨 +##敷 +##饥 +##俯 +##蔓 +##暮 +##砍 +##邵 +##仑 +##毗 +##剿 +##馀 +##锤 +##刮 +##梭 +##摧 +##掠 +##躯 +##诡 +##匈 +##侣 +##胚 +##疮 +##裙 +##裸 +##塌 +##吓 +##俘 +##糙 +##藩 +##楷 +##羞 +##鲍 +##帘 +##裤 +##宛 +##憾 +##桓 +##痰 +##寞 +##骚 +##惹 +##笋 +##萃 +##栓 +##挫 +##矢 +##垦 +##垄 +##绸 +##凄 +##镀 +##熏 +##钉 +##粪 +##缅 +##洽 +##鞘 +##蔗 +##迄 +##沐 +##凿 +##勉 +##昨 +##喘 +##爹 +##屑 +##耻 +##沥 +##庶 +##涅 +##腕 +##袍 +##懒 +##阜 +##嗜 +##朔 +##蒜 +##沛 +##坟 +##轿 +##喀 +##笛 +##狄 +##饿 +##蓉 +##泣 +##窟 +##豹 +##屿 +##崛 +##迦 +##诠 +##贬 +##腥 +##钥 +##嗣 +##瑜 +##倦 +##萎 +##拦 +##冤 +##讽 +##潇 +##谣 +##趁 +##妨 +##贩 +##萍 +##窦 +##纂 +##缀 +##矫 +##淑 +##墩 +##梵 +##沾 +##淫 +##乖 +##汰 +##莞 +##旷 +##浊 +##挚 +##撼 +##氟 +##焕 +##庚 +##掀 +##诀 +##盼 +##疹 +##窖 +##匆 +##厥 +##轧 +##淹 +##亥 +##鸦 +##棍 +##谅 +##歼 +##汕 +##挪 +##蚁 +##敛 +##魁 +##畴 +##炫 +##丫 +##奎 +##菱 +##沂 +##撕 +##阎 +##詹 +##蛛 +##靡 +##瞻 +##咱 +##愧 +##烷 +##畸 +##灸 +##眸 +##觅 +##芜 +##廓 +##斌 +##躁 +##麓 +##摔 +##烛 +##睹 +##孜 +##缚 +##堕 +##昼 +##睿 +##琪 +##琉 +##贱 +##渝 +##跋 +##茄 +##舜 +##诛 +##捣 +##芙 +##倚 +##酰 +##澈 +##慌 +##帜 +##颤 +##陇 +##颌 +##昧 +##佣 +##眷 +##徙 +##禾 +##逮 +##莹 +##碟 +##梢 +##朽 +##粥 +##喇 +##榆 +##驳 +##楔 +##啸 +##肋 +##踢 +##傍 +##桔 +##肴 +##呕 +##旭 +##埠 +##贿 +##曝 +##杖 +##俭 +##栩 +##斧 +##镁 +##匾 +##踩 +##橘 +##颅 +##囚 +##蛙 +##膳 +##坞 +##琐 +##荧 +##瘟 +##涤 +##胰 +##衫 +##噬 +##皖 +##邱 +##埔 +##汀 +##羡 +##睐 +##葵 +##耿 +##糟 +##厄 +##秧 +##黔 +##蹄 +##漳 +##鞍 +##谏 +##腋 +##簇 +##梧 +##戎 +##榴 +##诣 +##宦 +##苔 +##揽 +##簧 +##狸 +##阙 +##扯 
+##耍 +##棠 +##脓 +##烫 +##翘 +##芭 +##躺 +##羁 +##藉 +##拐 +##陡 +##漓 +##棺 +##钧 +##琅 +##扔 +##寝 +##绚 +##熬 +##驿 +##邹 +##杠 +##绥 +##窥 +##晃 +##渭 +##樊 +##鑫 +##祁 +##陋 +##哺 +##堰 +##祛 +##梓 +##崎 +##孽 +##蝴 +##蔚 +##抖 +##苟 +##肇 +##溜 +##绅 +##妾 +##跪 +##沁 +##莽 +##虏 +##瞄 +##砸 +##稚 +##僚 +##崭 +##迭 +##皂 +##彬 +##雏 +##羲 +##缕 +##绞 +##俞 +##簿 +##耸 +##廖 +##嘲 +##翌 +##榄 +##裴 +##槐 +##洼 +##睁 +##灼 +##啤 +##臀 +##啥 +##濒 +##醛 +##峨 +##葫 +##悍 +##笨 +##嘱 +##稠 +##韶 +##陛 +##峭 +##酚 +##翩 +##舅 +##寅 +##蕉 +##阮 +##垣 +##戮 +##趾 +##犀 +##巍 +##霄 +##饪 +##秆 +##朕 +##驼 +##肛 +##揉 +##楠 +##岚 +##疡 +##帧 +##柑 +##赎 +##逍 +##滇 +##璋 +##礁 +##黛 +##钞 +##邢 +##涧 +##劈 +##瞳 +##砚 +##驴 +##锣 +##恳 +##栅 +##吵 +##牟 +##沌 +##瞩 +##咪 +##毯 +##炳 +##淤 +##盯 +##芋 +##粟 +##栈 +##戊 +##盏 +##峪 +##拂 +##暇 +##酥 +##汛 +##嚣 +##轼 +##妒 +##匿 +##鸽 +##蝉 +##痒 +##宵 +##瘫 +##璧 +##汲 +##冢 +##碌 +##琢 +##磅 +##卤 +##剔 +##谎 +##圩 +##酌 +##捏 +##渺 +##媳 +##穹 +##谥 +##骏 +##哨 +##骆 +##乒 +##摹 +##兜 +##柿 +##喧 +##呜 +##捡 +##橄 +##逗 +##瑚 +##呐 +##檀 +##辜 +##妊 +##祯 +##苷 +##衙 +##笃 +##芸 +##霖 +##荔 +##闺 +##羌 +##芹 +##哼 +##糯 +##吼 +##蕃 +##嵩 +##矶 +##绽 +##坯 +##娠 +##祷 +##锰 +##瘀 +##岐 +##茵 +##筝 +##斐 +##肽 +##歉 +##嗽 +##恤 +##汶 +##聂 +##樟 +##擒 +##鹃 +##拙 +##鲤 +##絮 +##鄙 +##彪 +##嗓 +##墟 +##骼 +##渤 +##僻 +##豁 +##谕 +##荟 +##姨 +##婷 +##挠 +##哇 +##炙 +##诅 +##娥 +##哑 +##阱 +##嫉 +##圭 +##乓 +##橱 +##歪 +##禧 +##甩 +##坷 +##晏 +##驯 +##讳 +##泗 +##煞 +##淄 +##倪 +##妓 +##窍 +##竿 +##襟 +##匡 +##钛 +##侈 +##侄 +##铲 +##哮 +##厩 +##亢 +##辕 +##瘾 +##辊 +##狩 +##掷 +##潍 +##伺 +##嘿 +##弈 +##嘎 +##陨 +##娅 +##昊 +##犁 +##屁 +##蜘 +##寥 +##滕 +##毙 +##涝 +##谛 +##郝 +##痹 +##溺 +##汾 +##脐 +##馅 +##蠢 +##珀 +##腌 +##扼 +##敕 +##莓 +##峦 +##铬 +##谍 +##炬 +##龚 +##麒 +##睦 +##磺 +##吁 +##掺 +##烁 +##靶 +##圃 +##饵 +##褶 +##娟 +##滔 +##挨 +##褒 +##胱 +##晖 +##脖 +##垢 +##抉 +##冉 +##茧 +##渲 +##癫 +##悼 +##嫂 +##瞒 +##纶 +##肘 +##炖 +##瀚 +##皋 +##姊 +##颐 +##俏 +##颊 +##讶 +##札 +##奕 +##磊 +##镖 +##遐 +##眺 +##腑 +##琦 +##蚊 +##窜 +##渍 +##嗯 +##夯 +##笙 +##蘑 +##翡 +##碘 +##卯 +##啼 +##靓 +##辍 +##莺 +##躬 +##猿 +##杞 +##眩 +##虔 +##凋 +##遁 +##泾 +##岔 +##羟 +##弛 +##娄 +##茸 +##皓 +##峙 +##逅 +##邂 +##苇 +##楹 +##蹲 +##拢 +##甄 +##鳃 +##邯 +##捆 +##勺 +##酉 +##荚 +##唑 +##臻 +##辗 +##绰 +##徊 +##榨 +##苛 +##赦 +##盔 +##壬 +##恍 +##缉 +##熨 +##澡 +##桨 +##匣 +##兢 +##驭 +##镍 +##孰 +##绮 +##馏 +##蝇 +##佼 +##鲸 +##哎 +##裳 +##蜕 +##嚼 +##嘻 +##庇 +##绢 +##倩 +##钵 +##恪 +##帷 +##莆 +##柠 +##藕 +##砾 +##绊 +##喙 +##坂 +##徘 +##荀 +##瞧 +##蛾 +##晦 +##铎 +##紊 +##锚 +##酪 +##稷 +##聋 +##闵 +##熹 +##冕 +##诫 +##珑 +##曦 +##篷 +##迥 +##蘖 +##胤 +##檬 +##瑾 +##钳 +##遏 +##辄 +##嬉 +##隅 +##秃 +##帛 +##聆 +##芥 +##诬 +##挟 +##宕 +##鹊 +##琶 +##膛 +##兀 +##懿 +##碾 +##叮 +##蠕 +##譬 +##缮 +##烽 +##妍 +##榕 +##邃 +##焙 +##倘 +##戌 +##茹 +##豚 +##晾 +##浒 +##玺 +##醚 +##祐 +##炽 +##缪 +##凛 +##噩 +##溅 +##毋 +##槛 +##嫡 +##蝠 +##娴 +##稣 +##禀 +##壑 +##殆 +##敖 +##倭 +##挛 +##侃 +##蚌 +##咀 +##盎 +##殉 +##岑 +##浚 +##谬 +##狡 +##癸 +##逛 +##耽 +##俺 +##璨 +##巳 +##茜 +##郸 +##蒴 +##琵 +##叩 +##泸 +##塾 +##稼 +##侮 +##锂 +##曙 +##薰 +##婿 +##惶 +##拭 +##篱 +##恬 +##淌 +##烙 +##袜 +##徵 +##慷 +##夭 +##噶 +##莘 +##鸳 +##殡 +##蚂 +##憎 +##喃 +##佚 +##龛 +##潢 +##烃 +##岱 +##潺 +##衢 +##璀 +##鹭 +##揣 +##痢 +##厮 +##氓 +##怠 +##痘 +##硒 +##镌 +##乍 +##咯 +##惬 +##桦 +##骇 +##枉 +##蜗 +##睾 +##淇 +##耘 +##娓 +##弼 +##鳌 +##嗅 +##狙 +##箫 +##朦 +##椰 +##胥 +##丐 +##陂 +##唾 +##鳄 +##柚 +##谒 +##戍 +##刁 +##鸾 +##缭 +##骸 +##铣 +##酋 +##蝎 +##掏 +##耦 +##怯 +##娲 +##拇 +##汹 +##胧 +##疤 +##硼 +##恕 +##哗 +##眶 +##痫 +##凳 +##鲨 +##擢 +##歹 +##樵 +##瘠 +##茗 +##翟 +##黯 +##蜒 +##壹 +##殇 +##伶 +##辙 +##瑕 +##町 +##孚 +##痉 +##铵 +##搁 +##漾 +##戟 +##镰 +##鸯 +##猩 +##蔷 +##缤 +##叭 +##垩 +##曳 +##奚 +##毓 +##颓 +##汐 +##靴 +##傣 +##尬 +##濮 +##赂 +##媛 +##懦 +##扦 +##韬 +##戳 +##雯 +##蜿 +##笺 +##裘 +##尴 +##侗 +##钨 +##苓 +##寰 +##蛊 +##扳 +##搓 +##涟 +##睫 +##淬 +##赈 +##恺 +##瞎 +##蝙 +##枸 +##萱 +##颚 +##憩 +##秽 +##秸 +##拷 +##阑 +##貂 +##粱 +##煲 +##隘 +##暧 +##惕 +##沽 +##菠 +##趟 +##磋 +##偕 +##涕 +##邸 +##踞 +##惫 +##阪 +##鞠 +##饺 +##汞 
+##颍 +##氰 +##屹 +##蛟 +##跻 +##哟 +##臼 +##熄 +##绛 +##弩 +##褪 +##渎 +##亟 +##匮 +##撇 +##霆 +##攒 +##舵 +##扛 +##彤 +##蛤 +##婢 +##偃 +##胫 +##姥 +##睑 +##诙 +##诲 +##锭 +##悚 +##扒 +##洱 +##劾 +##惰 +##篡 +##瓯 +##徇 +##铀 +##骋 +##筷 +##渚 +##踵 +##俨 +##榻 +##糜 +##捻 +##釜 +##哩 +##萤 +##蛹 +##隽 +##垮 +##鸠 +##鸥 +##漕 +##瑙 +##礴 +##憧 +##殴 +##潼 +##悯 +##砺 +##拽 +##钗 +##酣 +##镂 +##膺 +##楞 +##竺 +##迂 +##嫣 +##忱 +##哄 +##疣 +##鹦 +##枭 +##憬 +##疱 +##婪 +##沮 +##怅 +##筱 +##扉 +##瞰 +##旌 +##蔑 +##铠 +##瀛 +##琥 +##懵 +##谴 +##捍 +##蟾 +##漩 +##拣 +##汴 +##刨 +##叱 +##曜 +##妞 +##澎 +##镑 +##翎 +##瞪 +##倔 +##芍 +##璞 +##瓮 +##驹 +##芷 +##寐 +##擂 +##丕 +##蟠 +##诃 +##悸 +##亘 +##溴 +##宸 +##廿 +##恃 +##棣 +##荼 +##筠 +##羚 +##慑 +##唉 +##纣 +##麼 +##蹦 +##锄 +##淆 +##甙 +##蚜 +##椿 +##禺 +##绯 +##冗 +##葩 +##厝 +##媲 +##蒿 +##痪 +##菁 +##炊 +##俑 +##讥 +##桀 +##祺 +##吡 +##迩 +##箔 +##皿 +##缎 +##萦 +##剃 +##霓 +##酝 +##诰 +##茉 +##飙 +##湍 +##蜥 +##箕 +##蘸 +##柬 +##韭 +##溥 +##熠 +##鹉 +##咐 +##剌 +##悖 +##瞿 +##槟 +##娩 +##闾 +##遴 +##咫 +##孺 +##彷 +##茬 +##蓟 +##憨 +##袅 +##佬 +##炯 +##啶 +##昙 +##蚩 +##痔 +##蕨 +##瓢 +##夔 +##毡 +##赃 +##鳖 +##沅 +##饷 +##臧 +##掖 +##褚 +##羹 +##勐 +##谚 +##畦 +##眨 +##贻 +##攸 +##涎 +##弑 +##咎 +##铂 +##瑛 +##矗 +##虱 +##秤 +##谟 +##漱 +##俸 +##夙 +##雉 +##螨 +##恣 +##斛 +##谙 +##隍 +##奄 +##壕 +##髻 +##鄱 +##嘶 +##磕 +##濡 +##赘 +##荞 +##讹 +##猕 +##痞 +##鬓 +##铮 +##腱 +##幡 +##榭 +##爻 +##涓 +##晤 +##咕 +##惭 +##钼 +##匕 +##撮 +##庾 +##笠 +##窘 +##癖 +##垛 +##窒 +##畲 +##甬 +##彗 +##缨 +##湮 +##寮 +##衅 +##谪 +##绫 +##兖 +##疽 +##磐 +##菏 +##沱 +##骁 +##嫔 +##盂 +##娆 +##钊 +##蟒 +##忏 +##谤 +##晟 +##痈 +##耆 +##谧 +##簪 +##疟 +##扈 +##脍 +##琛 +##咋 +##胄 +##葆 +##轶 +##桢 +##攘 +##邕 +##拧 +##茯 +##摒 +##傀 +##祚 +##嘟 +##帼 +##筵 +##馒 +##疚 +##璇 +##砧 +##槃 +##犷 +##腓 +##煜 +##弋 +##疸 +##濑 +##麝 +##嗟 +##忻 +##愣 +##斓 +##吝 +##咧 +##矾 +##愫 +##漪 +##珂 +##逞 +##糠 +##璐 +##藓 +##昕 +##妩 +##屌 +##疵 +##嘘 +##袂 +##稃 +##剁 +##侏 +##掐 +##猾 +##匍 +##坳 +##黜 +##邺 +##闫 +##猥 +##湃 +##斟 +##癣 +##匐 +##粳 +##叟 +##俾 +##儡 +##莒 +##骥 +##跤 +##耙 +##矜 +##翱 +##赡 +##浣 +##栾 +##拈 +##螟 +##桧 +##坍 +##睢 +##趴 +##伎 +##婺 +##霹 +##痊 +##膊 +##眯 +##豌 +##驮 +##骈 +##嶂 +##淞 +##腮 +##髅 +##炀 +##啄 +##亳 +##麾 +##筐 +##叨 +##徨 +##跷 +##楂 +##郴 +##绶 +##羔 +##咤 +##靳 +##屎 +##雳 +##瘘 +##蹬 +##惮 +##涪 +##阖 +##煽 +##蹊 +##栉 +##俟 +##涸 +##辫 +##锢 +##佟 +##皎 +##啮 +##钰 +##螂 +##啪 +##绷 +##闰 +##畿 +##覃 +##惘 +##贰 +##碉 +##卞 +##酐 +##枷 +##葺 +##芪 +##蕙 +##咚 +##籁 +##钴 +##冽 +##玮 +##骷 +##啃 +##焖 +##猝 +##榈 +##滁 +##拮 +##跗 +##讷 +##蝗 +##蠡 +##烨 +##脯 +##歙 +##泠 +##刍 +##掳 +##僳 +##螯 +##胳 +##髦 +##粽 +##戾 +##祜 +##岷 +##懋 +##馥 +##昵 +##踊 +##湄 +##郢 +##斡 +##迢 +##嗪 +##裨 +##羧 +##膈 +##翊 +##鲫 +##螃 +##沓 +##疝 +##笈 +##榔 +##诘 +##颉 +##蛀 +##鸢 +##焯 +##囧 +##梆 +##潞 +##戛 +##佗 +##艮 +##霾 +##鬟 +##玖 +##腭 +##喔 +##罔 +##佥 +##粑 +##舷 +##泯 +##泓 +##炜 +##谗 +##烬 +##跆 +##傩 +##飓 +##浔 +##钤 +##惚 +##胭 +##踝 +##镯 +##臆 +##蜚 +##揪 +##觞 +##皈 +##迸 +##匝 +##筏 +##醴 +##黍 +##洮 +##滦 +##侬 +##甾 +##澧 +##阈 +##袱 +##迤 +##衮 +##濂 +##娑 +##砥 +##砷 +##铨 +##缜 +##箴 +##逵 +##猖 +##蛰 +##箍 +##侥 +##搂 +##纨 +##裱 +##枋 +##嫦 +##敝 +##挝 +##贲 +##潦 +##撩 +##惺 +##铰 +##忒 +##咆 +##哆 +##莅 +##炕 +##抨 +##涿 +##龈 +##猷 +##遒 +##缥 +##捂 +##俐 +##瘙 +##搐 +##牍 +##馍 +##痿 +##袤 +##峥 +##栎 +##罹 +##燎 +##喵 +##璜 +##飒 +##蔼 +##珞 +##澹 +##奘 +##岖 +##芡 +##簸 +##杵 +##甥 +##骊 +##悴 +##惆 +##殃 +##呃 +##祗 +##髋 +##幔 +##榛 +##犊 +##霁 +##芮 +##牒 +##佰 +##狈 +##薨 +##吩 +##鳝 +##嵘 +##濠 +##呤 +##纫 +##檄 +##浜 +##缙 +##缢 +##煦 +##蓦 +##揖 +##拴 +##缈 +##褥 +##铿 +##燮 +##锵 +##荥 +##忿 +##僖 +##婶 +##芾 +##镐 +##痣 +##眈 +##祇 +##邈 +##翳 +##碣 +##遨 +##鳗 +##诂 +##岫 +##焘 +##茱 +##洵 +##晁 +##噢 +##偈 +##旖 +##蚓 +##柘 +##珐 +##遽 +##岌 +##桅 +##唔 +##鄞 +##雹 +##驸 +##苻 +##恻 +##鬃 +##玑 +##磬 +##崂 +##祉 +##荤 +##淼 +##肱 +##呗 +##骡 +##囱 +##佞 +##耒 +##伫 +##嚷 +##粼 +##歆 +##佃 +##旎 +##惋 +##殁 +##杳 +##阡 +##畈 +##蔺 +##巽 +##昱 +##啰 +##吠 +##嗔 +##涮 +##奂 +##撷 +##袒 +##爰 +##捶 +##赭 +##蜓 +##姗 +##蔻 +##垠 +##噻 +##峒 +##皙 +##憔 +##帚 +##杷 +##蟆 +##觐 +##钒 
+##岙 +##栀 +##幄 +##啧 +##癜 +##擀 +##轲 +##铆 +##讴 +##樽 +##霏 +##肮 +##枳 +##骞 +##诧 +##瘢 +##虬 +##拗 +##蕲 +##茁 +##唆 +##沭 +##毂 +##蛎 +##芊 +##銮 +##瞥 +##呱 +##羿 +##吒 +##傥 +##髯 +##濯 +##蜻 +##皴 +##邳 +##燧 +##獭 +##垭 +##祟 +##虢 +##枇 +##鹫 +##颞 +##皑 +##脲 +##舔 +##魇 +##霭 +##坨 +##郧 +##椽 +##舫 +##荠 +##琊 +##溟 +##煨 +##谯 +##粲 +##罂 +##屉 +##佯 +##郦 +##亵 +##诽 +##芩 +##嵇 +##蚤 +##哒 +##啬 +##嚎 +##玥 +##隼 +##唢 +##铛 +##壅 +##藜 +##吱 +##楣 +##璟 +##锆 +##憋 +##罡 +##咙 +##腈 +##廪 +##堑 +##诩 +##溧 +##鹑 +##讫 +##哌 +##铢 +##蜴 +##稹 +##噜 +##镉 +##愕 +##桁 +##晔 +##琰 +##陲 +##疙 +##崮 +##颛 +##桡 +##钜 +##谑 +##仞 +##咦 +##珪 +##揍 +##鱿 +##阉 +##瘩 +##槌 +##滓 +##茴 +##泮 +##涣 +##柞 +##渥 +##飨 +##孪 +##沔 +##谲 +##桉 +##慵 +##俚 +##跖 +##纭 +##恙 +##佘 +##荃 +##咄 +##鞅 +##叁 +##恽 +##炔 +##萘 +##钺 +##楫 +##塬 +##钡 +##琮 +##苄 +##骰 +##偎 +##粕 +##咔 +##鹄 +##瓒 +##阆 +##捅 +##嬴 +##箨 +##氦 +##倜 +##觊 +##婕 +##锑 +##撬 +##掰 +##嗷 +##饯 +##蓓 +##鼬 +##佤 +##蚯 +##挞 +##臾 +##嶙 +##幂 +##饬 +##闱 +##煅 +##嘧 +##蹭 +##瞭 +##顼 +##箐 +##徉 +##骜 +##嗨 +##邛 +##庑 +##柩 +##饕 +##俎 +##嘌 +##颏 +##椁 +##崧 +##锉 +##籼 +##狞 +##弁 +##羯 +##踹 +##糅 +##砼 +##嫖 +##豉 +##啉 +##榷 +##嘈 +##俪 +##痂 +##儋 +##芎 +##繇 +##蹇 +##诋 +##煸 +##峋 +##淙 +##泱 +##徜 +##汩 +##纥 +##蝼 +##囿 +##暹 +##谆 +##蹂 +##鞣 +##螳 +##馗 +##幺 +##鞑 +##贽 +##漯 +##牦 +##淖 +##囤 +##晗 +##忡 +##懊 +##呋 +##埂 +##鲈 +##阕 +##幌 +##鳅 +##勰 +##萸 +##剽 +##蚝 +##绔 +##辇 +##麋 +##陟 +##宥 +##锺 +##喽 +##淅 +##熵 +##荨 +##忤 +##轭 +##嗦 +##荪 +##骠 +##鹘 +##聿 +##绾 +##诶 +##怆 +##喋 +##恸 +##湟 +##睨 +##翦 +##蜈 +##褂 +##娼 +##羸 +##觎 +##瘁 +##蚣 +##呻 +##昶 +##谶 +##猬 +##荻 +##酗 +##肄 +##躏 +##膑 +##嗡 +##庠 +##崽 +##搪 +##胯 +##铉 +##峤 +##郯 +##藐 +##舂 +##蓼 +##薏 +##窿 +##羣 +##氽 +##徕 +##冼 +##阂 +##欤 +##殒 +##窈 +##脘 +##篝 +##麸 +##砭 +##砰 +##骶 +##豺 +##窠 +##獒 +##腴 +##苕 +##缇 +##骅 +##劭 +##卅 +##揆 +##垅 +##琏 +##镗 +##苜 +##胛 +##珏 +##吮 +##抠 +##搔 +##槎 +##掣 +##琨 +##餮 +##舛 +##痤 +##埭 +##胪 +##喹 +##妲 +##婀 +##帙 +##箩 +##灏 +##霎 +##袄 +##镭 +##蓿 +##墉 +##嵊 +##堇 +##蟋 +##叽 +##钎 +##録 +##郓 +##瘴 +##丶 +##呦 +##邬 +##頫 +##馁 +##鄢 +##蛭 +##愍 +##锲 +##槿 +##珈 +##蜃 +##拎 +##鎏 +##裟 +##沏 +##螭 +##觑 +##墒 +##捺 +##轸 +##榫 +##怔 +##昀 +##泷 +##凫 +##唠 +##狰 +##鲛 +##氐 +##呛 +##绀 +##碛 +##茏 +##盅 +##蟀 +##洙 +##訇 +##蠹 +##棂 +##蚴 +##篾 +##靛 +##暄 +##泞 +##洄 +##赓 +##麽 +##篓 +##孑 +##烩 +##颢 +##钣 +##髂 +##蹴 +##筮 +##蝌 +##醮 +##菖 +##獗 +##岘 +##鼐 +##姣 +##蟑 +##袈 +##葶 +##掬 +##躇 +##鹌 +##踌 +##钹 +##蚪 +##颧 +##鹳 +##鲲 +##驷 +##潴 +##焱 +##稔 +##悌 +##唏 +##苒 +##蹙 +##氩 +##宓 +##綦 +##苎 +##疃 +##攫 +##掾 +##徭 +##舀 +##逶 +##嗤 +##蜷 +##茔 +##疳 +##迳 +##罄 +##瓠 +##讪 +##傈 +##杲 +##灞 +##氲 +##鬲 +##獠 +##柒 +##骧 +##搀 +##珩 +##绦 +##嚏 +##镛 +##喱 +##倏 +##馋 +##茭 +##擘 +##斫 +##怂 +##唧 +##犍 +##谩 +##赊 +##鬻 +##禛 +##圻 +##蹶 +##缄 +##瘿 +##黠 +##甑 +##矸 +##嘀 +##蹼 +##叼 +##旻 +##鹜 +##稗 +##雒 +##赉 +##馔 +##颦 +##颔 +##掇 +##赅 +##桎 +##痧 +##谄 +##孛 +##笆 +##鲶 +##铳 +##龋 +##盱 +##笏 +##窕 +##苴 +##萋 +##辘 +##琬 +##梏 +##蚧 +##镳 +##瞅 +##睬 +##偌 +##鲵 +##惦 +##蜍 +##靼 +##阗 +##菟 +##黝 +##挈 +##嵴 +##剡 +##楸 +##氤 +##呎 +##珲 +##馄 +##滂 +##蹉 +##蓑 +##锷 +##啜 +##婵 +##鬣 +##钿 +##晌 +##蛆 +##隗 +##酞 +##枞 +##戬 +##獾 +##镕 +##饨 +##娣 +##缰 +##邾 +##鹗 +##嗒 +##苋 +##薮 +##棹 +##拄 +##埕 +##勖 +##鹞 +##殚 +##鲢 +##啖 +##沣 +##靥 +##葭 +##诿 +##鸪 +##饴 +##疖 +##抟 +##睽 +##稞 +##吋 +##谀 +##澍 +##杈 +##妤 +##峄 +##漉 +##気 +##咲 +##璘 +##萜 +##僭 +##朐 +##圜 +##癞 +##藿 +##珉 +##陉 +##僮 +##膻 +##薹 +##汊 +##锗 +##昉 +##猗 +##锶 +##跛 +##嘹 +##瓤 +##衄 +##豕 +##吆 +##腆 +##喆 +##莴 +##谌 +##珙 +##疥 +##鲑 +##玷 +##蛔 +##砀 +##谔 +##睥 +##蹑 +##诒 +##逋 +##姝 +##刈 +##婧 +##喳 +##镞 +##铌 +##辎 +##鹧 +##檩 +##扪 +##霰 +##裆 +##嬷 +##刎 +##嵋 +##悱 +##嘤 +##篁 +##荸 +##瞑 +##殓 +##搽 +##橇 +##雎 +##弭 +##獐 +##恿 +##眦 +##铐 +##尕 +##捎 +##诟 +##痨 +##笞 +##趺 +##唬 +##苣 +##啾 +##瘪 +##垸 +##橹 +##濛 +##曷 +##樾 +##汨 +##仟 +##姒 +##怦 +##荏 +##诤 +##苡 +##吭 +##崆 +##氡 +##脩 +##胝 +##钏 +##屐 +##忐 +##彧 +##拚 +##鏖 +##孳 +##忑 +##邝 +##穰 +##摈 +##庖 +##鸵 +##矽 +##鲟 +##発 +##菅 +##圪 +##蹋 +##衾 +##簋 
+##璎 +##噎 +##嬗 +##肼 +##跎 +##滟 +##戦 +##嵬 +##仝 +##惇 +##纾 +##炁 +##闳 +##骐 +##秣 +##眙 +##谘 +##碓 +##疔 +##恂 +##鳕 +##鸱 +##爨 +##镊 +##钯 +##圮 +##楽 +##堀 +##膘 +##噗 +##锹 +##杼 +##酊 +##挎 +##箸 +##郗 +##垌 +##溏 +##蔫 +##偻 +##妫 +##飚 +##辔 +##濬 +##瑄 +##觚 +##铍 +##跚 +##翕 +##煊 +##耄 +##铋 +##篦 +##阇 +##骛 +##莪 +##吲 +##唁 +##箧 +##珅 +##潋 +##迨 +##哽 +##砦 +##缗 +##謇 +##呸 +##垓 +##糍 +##璠 +##妣 +##狎 +##攥 +##闇 +##蛉 +##瑁 +##腼 +##蹒 +##嶷 +##莠 +##沤 +##哚 +##遑 +##跺 +##膦 +##蹿 +##郫 +##玳 +##埚 +##衿 +##醪 +##挹 +##绡 +##汜 +##坩 +##旃 +##鸨 +##翈 +##抡 +##晞 +##盥 +##藁 +##蓖 +##臊 +##羰 +##楝 +##噱 +##饽 +##苌 +##褓 +##佶 +##稜 +##瞠 +##仡 +##伉 +##襁 +##涞 +##蜇 +##抿 +##瑗 +##孱 +##懑 +##淦 +##赝 +##醌 +##缫 +##蠲 +##嚓 +##鲷 +##湫 +##捋 +##咩 +##裏 +##犒 +##墀 +##硐 +##蔸 +##钽 +##麂 +##蒡 +##鼹 +##绻 +##錾 +##仃 +##篙 +##蕤 +##铤 +##槁 +##牖 +##螈 +##俦 +##笄 +##啻 +##対 +##郤 +##闼 +##醺 +##赍 +##檗 +##裾 +##噫 +##掸 +##箓 +##妪 +##乂 +##蝈 +##砻 +##胍 +##蜱 +##聃 +##雠 +##碚 +##椤 +##缯 +##昴 +##缱 +##祎 +##缬 +##铙 +##孀 +##笳 +##蘇 +##愆 +##榉 +##氙 +##燹 +##撂 +##菽 +##箬 +##蛄 +##瘸 +##嬛 +##橐 +##纡 +##刽 +##辂 +##蒯 +##邨 +##赀 +##跸 +##邙 +##黟 +##磴 +##闿 +##垟 +##嵯 +##钚 +##跄 +##潸 +##崴 +##恁 +##楮 +##腧 +##胨 +##芫 +##碴 +##隰 +##杓 +##貉 +##欹 +##侑 +##鳜 +##铄 +##椴 +##昇 +##醍 +##肓 +##缂 +##铡 +##蹠 +##徂 +##豢 +##蒽 +##菡 +##衲 +##阚 +##芗 +##痍 +##玠 +##晷 +##淝 +##鄯 +##糗 +##耨 +##榧 +##胴 +##蕈 +##镬 +##鼾 +##摭 +##鸮 +##恚 +##実 +##砝 +##珣 +##寤 +##埙 +##锏 +##喟 +##蘅 +##骺 +##捭 +##莜 +##缶 +##锟 +##叵 +##炷 +##鲧 +##胼 +##査 +##岬 +##鹂 +##牯 +##珥 +##莼 +##邠 +##眇 +##卟 +##変 +##惴 +##渑 +##蚱 +##瞌 +##瘰 +##佝 +##旸 +##衽 +##郅 +##奁 +##魑 +##缛 +##颙 +##镫 +##簌 +##豇 +##姹 +##邋 +##暝 +##釐 +##洹 +##咿 +##俳 +##蜊 +##醐 +##聩 +##坻 +##毽 +##喾 +##辋 +##倌 +##媪 +##蛳 +##滹 +##哙 +##阊 +##趸 +##祢 +##籀 +##徼 +##訾 +##髁 +##砜 +##撸 +##瓘 +##缁 +##镓 +##縻 +##菀 +##酢 +##桠 +##撵 +##怏 +##渌 +##摞 +##槲 +##浠 +##诜 +##魉 +##韫 +##亓 +##盤 +##瑭 +##魍 +##襞 +##爿 +##浃 +##樯 +##讵 +##揩 +##耋 +##帏 +##崃 +##鸩 +##遢 +##臃 +##粿 +##禳 +##桫 +##髹 +##诳 +##踉 +##郃 +##嗖 +##讧 +##碁 +##湎 +##阏 +##媾 +##様 +##哔 +##舸 +##曩 +##忝 +##峁 +##掂 +##葳 +##鄄 +##谵 +##彊 +##锴 +##郜 +##葖 +##蓇 +##瓴 +##鳟 +##橼 +##鲇 +##邗 +##犄 +##秭 +##槭 +##缵 +##巯 +##龊 +##狍 +##擞 +##瞽 +##栲 +##撅 +##瑀 +##戢 +##朓 +##逖 +##椹 +##洺 +##艏 +##苁 +##滘 +##铧 +##侪 +##豳 +##竦 +##貔 +##圄 +##呷 +##旄 +##遛 +##芈 +##砣 +##桷 +##龌 +##疬 +##缟 +##洌 +##跏 +##蝮 +##菰 +##帑 +##怙 +##豸 +##雩 +##誊 +##臬 +##镣 +##箇 +##踱 +##钍 +##苫 +##蝽 +##浯 +##単 +##亶 +##囹 +##穑 +##佻 +##绌 +##诔 +##鹬 +##髌 +##蒌 +##鳏 +##殄 +##怛 +##筌 +##刳 +##翮 +##卍 +##畹 +##箜 +##燔 +##赳 +##篌 +##窨 +##翥 +##炅 +##钕 +##莳 +##忖 +##戡 +##沢 +##狒 +##圉 +##琯 +##邰 +##苾 +##犸 +##邡 +##郏 +##襦 +##沆 +##玟 +##濉 +##洎 +##莨 +##氘 +##咛 +##佺 +##腩 +##鳔 +##剜 +##秕 +##牝 +##芨 +##関 +##拊 +##竑 +##圹 +##颡 +##摺 +##沩 +##蜉 +##筚 +##愔 +##肟 +##俶 +##堃 +##绉 +##奭 +##罅 +##嗳 +##蜢 +##疠 +##帔 +##髡 +##黥 +##褛 +##柰 +##鏊 +##痼 +##堞 +##嗝 +##娉 +##戕 +##铱 +##耜 +##觥 +##镒 +##呓 +##蒹 +##栱 +##卮 +##琚 +##逦 +##酩 +##蓍 +##虺 +##谠 +##鼋 +##焗 +##褴 +##砒 +##赧 +##蛏 +##蚬 +##瘕 +##顗 +##愠 +##勣 +##飕 +##徳 +##滢 +##琇 +##鳙 +##瞟 +##尻 +##澶 +##荽 +##舐 +##侂 +##黼 +##潟 +##绂 +##瘗 +##蓥 +##竽 +##濞 +##骖 +##偁 +##応 +##锜 +##匏 +##赑 +##讦 +##诨 +##罘 +##巖 +##嫘 +##颀 +##岿 +##虻 +##罴 +##囗 +##溆 +##噤 +##骝 +##咂 +##锛 +##槊 +##啕 +##驽 +##凇 +##籴 +##硖 +##铯 +##怿 +##笥 +##噙 +##倨 +##坭 +##醅 +##滏 +##悻 +##聒 +##枥 +##昺 +##酆 +##簟 +##睇 +##轫 +##溱 +##骢 +##榘 +##珺 +##跹 +##蚶 +##驺 +##饧 +##噼 +##儆 +##氚 +##哧 +##旒 +##鸬 +##夥 +##玦 +##貅 +##揄 +##戗 +##璩 +##剐 +##垴 +##蘼 +##裒 +##躅 +##唳 +##嗑 +##荦 +##霈 +##缦 +##啭 +##隈 +##悫 +##彀 +##悭 +##焓 +##磔 +##蓊 +##郾 +##枧 +##鹚 +##検 +##屃 +##馑 +##嗲 +##铟 +##薤 +##涔 +##樗 +##忾 +##収 +##绺 +##烊 +##螫 +##黩 +##鞫 +##鲠 +##嘭 +##缣 +##蒺 +##黒 +##骘 +##氖 +##镝 +##俅 +##谮 +##屦 +##摁 +##氪 +##蘧 +##伝 +##腠 +##叡 +##鲂 +##続 +##讣 +##耷 +##燊 +##鸷 +##猊 +##囡 +##崤 +##砬 +##湜 +##翚 +##峯 +##鲎 +##蕖 +##鹈 +##凼 +##泫 +##荑 +##黻 +##牂 +##鄣 +##篑 +##髭 +##陬 +##寔 +##疴 +##邽 +##喏 
+##彖 +##彘 +##赟 +##盹 +##诮 +##鸫 +##茕 +##铖 +##闩 +##読 +##鄜 +##漈 +##盍 +##甭 +##愎 +##魃 +##炆 +##鍊 +##蛐 +##薜 +##楯 +##鲀 +##逡 +##嘞 +##侔 +##觇 +##糸 +##踮 +##狷 +##菘 +##寳 +##扃 +##禊 +##喑 +##塍 +##栝 +##瓿 +##廨 +##貘 +##馕 +##僰 +##哏 +##瑷 +##疎 +##蝣 +##怵 +##阃 +##弢 +##镲 +##螅 +##吖 +##碲 +##夼 +##茌 +##嗬 +##靺 +##髀 +##铊 +##谡 +##癔 +##镠 +##巻 +##秾 +##菪 +##赜 +##铈 +##髙 +##鲳 +##珰 +##畋 +##泅 +##鲅 +##泚 +##飏 +##屍 +##仨 +##葚 +##叻 +##咻 +##衩 +##郄 +##蹩 +##嬖 +##踽 +##柽 +##鞨 +##麴 +##薙 +##钇 +##氵 +##垆 +##犟 +##罍 +##経 +##粜 +##焜 +##牀 +##埝 +##洧 +##覧 +##蓣 +##甯 +##蒐 +##馐 +##畑 +##缑 +##礽 +##瞋 +##浍 +##袢 +##桕 +##侩 +##詈 +##戸 +##烝 +##堌 +##伋 +##倬 +##圯 +##碇 +##纰 +##磾 +##泔 +##纮 +##蓁 +##铗 +##弇 +##挲 +##艉 +##鱬 +##泺 +##橛 +##袴 +##韪 +##籓 +##贶 +##棰 +##趵 +##樨 +##傕 +##玕 +##毎 +##繸 +##劵 +##镧 +##秫 +##邶 +##猞 +##廛 +##栌 +##钲 +##镦 +##嘏 +##蝰 +##镏 +##淠 +##荇 +##逄 +##嘅 +##祕 +##瑠 +##炝 +##杪 +##埴 +##獬 +##柢 +##捱 +##跣 +##涑 +##撃 +##伢 +##堠 +##卽 +##猁 +##厣 +##辏 +##旆 +##茆 +##乜 +##踯 +##。 +##? +##! +##? +##; +[UNK] diff --git a/applications/document_intelligence/doc_vqa/Rerank/run_test.sh b/applications/document_intelligence/doc_vqa/Rerank/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..48b03fe4dc5ad2358d97ffca5a02dca4f00cddca --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/run_test.sh @@ -0,0 +1,45 @@ +#!/bin/bash + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +QUESTION=$1 + +if [ ! -d output ]; then + mkdir output +fi +if [ ! -d log ]; then + mkdir log +fi + +python3 change_to_rerank.py ${QUESTION} + +python3 -u ./src/train_ce.py \ + --use_cuda true \ + --verbose true \ + --do_train false \ + --do_val false \ + --do_test true \ + --batch_size 128 \ + --init_checkpoint "./checkpoints/ranker" \ + --test_set "./data/demo.tsv" \ + --test_save "data/demo.score" \ + --max_seq_len 384 \ + --for_cn true \ + --vocab_path "config/ernie_base_1.0_CN/vocab.txt" \ + --ernie_config_path "config/ernie_base_1.0_CN/ernie_config.json" + 1>>log/train.log 2>&1 + diff --git a/applications/document_intelligence/doc_vqa/Rerank/run_train.sh b/applications/document_intelligence/doc_vqa/Rerank/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..18cffa7761eeb335f9997df3903b703bca5c4803 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/run_train.sh @@ -0,0 +1,72 @@ +#!/bin/bash + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+export CUDA_VISIBLE_DEVICES=0
+
+if [ $# != 4 ];then
+    echo "USAGE: sh run_train.sh \$TRAIN_SET \$MODEL_PATH \$epoch \$nodes_count"
+    exit 1
+fi
+
+TRAIN_SET=$1
+MODEL_PATH=$2
+epoch=$3
+node=$4
+
+CHECKPOINT_PATH=output
+if [ ! -d output ]; then
+    mkdir output
+fi
+if [ ! -d log ]; then
+    mkdir log
+fi
+
+lr=1e-5
+batch_size=32
+train_exampls=`cat $TRAIN_SET | wc -l`
+save_steps=$[$train_exampls/$batch_size/$node]
+data_size=$[$save_steps*$batch_size*$node]
+new_save_steps=$[$save_steps*$epoch/2]
+
+python3 -m paddle.distributed.launch \
+    --log_dir log \
+    ./src/train_ce.py \
+    --use_cuda true \
+    --verbose true \
+    --do_train true \
+    --do_val false \
+    --do_test false \
+    --use_mix_precision false \
+    --train_data_size ${data_size} \
+    --batch_size ${batch_size} \
+    --init_pretraining_params ${MODEL_PATH} \
+    --train_set ${TRAIN_SET} \
+    --save_steps ${new_save_steps} \
+    --validation_steps ${new_save_steps} \
+    --checkpoints ${CHECKPOINT_PATH} \
+    --weight_decay 0.01 \
+    --warmup_proportion 0.0 \
+    --epoch $epoch \
+    --max_seq_len 384 \
+    --for_cn true \
+    --vocab_path config/ernie_base_1.0_CN/vocab.txt \
+    --ernie_config_path config/ernie_base_1.0_CN/ernie_config.json \
+    --learning_rate ${lr} \
+    --skip_steps 10 \
+    --num_iteration_per_drop_scope 1 \
+    --num_labels 2 \
+    --random_seed 1
+
diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/batching.py b/applications/document_intelligence/doc_vqa/Rerank/src/batching.py
new file mode 100644
index 0000000000000000000000000000000000000000..10901b81f1c36b25b8640b0944e333647828c944
--- /dev/null
+++ b/applications/document_intelligence/doc_vqa/Rerank/src/batching.py
@@ -0,0 +1,69 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Mask, padding and batching."""
+
+import numpy as np
+
+
+def pad_batch_data(
+    insts,
+    pad_idx=0,
+    return_pos=False,
+    return_input_mask=False,
+    return_max_len=False,
+    return_num_token=False,
+    return_seq_lens=False,
+):
+    """
+    Pad the instances to the max sequence length in batch, and generate the
+    corresponding position data and attention bias.
+    """
+    return_list = []
+    max_len = max(len(inst) for inst in insts)
+    # Any token included in dict can be used to pad, since the paddings' loss
+    # will be masked out by weights and make no effect on parameter gradients.
+
+    inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
+    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+
+    # position data
+    if return_pos:
+        inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts])
+
+        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+
+    if return_input_mask:
+        # This is used to avoid attention on paddings.
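+        # Build the mask from the unpadded lengths: 1 for real tokens and 0 for
+        # padding, with a trailing axis so the final shape is
+        # [batch_size, max_len, 1] in float32.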
+ input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape([-1, 1])] + + return return_list if len(return_list) > 1 else return_list[0] + + +if __name__ == "__main__": + pass diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/cross_encoder.py b/applications/document_intelligence/doc_vqa/Rerank/src/cross_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..700b70407ce09832a4b7709d8e83a49cefb525a7 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/cross_encoder.py @@ -0,0 +1,333 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Model for classifier.""" + +import logging +import time + +import numpy as np +import paddle.fluid as fluid +from model.ernie import ErnieModel +from scipy.stats import pearsonr, spearmanr + +log = logging.getLogger(__name__) + + +def create_model(args, pyreader_name, ernie_config, is_prediction=False, task_name=""): + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[ + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, 1], + [-1, 1], + ], + dtypes=["int64", "int64", "int64", "int64", "float32", "int64", "int64"], + lod_levels=[0, 0, 0, 0, 0, 0, 0], + name=task_name + "_" + pyreader_name, + use_double_buffer=True, + ) + + (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, qids) = fluid.layers.read_file(pyreader) + + def _model(is_noise=False): + ernie = ErnieModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + task_ids=task_ids, + input_mask=input_mask, + config=ernie_config, + is_noise=is_noise, + ) + + cls_feats = ernie.get_pooled_output() + if not is_noise: + cls_feats = fluid.layers.dropout(x=cls_feats, dropout_prob=0.1, dropout_implementation="upscale_in_train") + logits = fluid.layers.fc( + input=cls_feats, + size=args.num_labels, + param_attr=fluid.ParamAttr( + name=task_name + "_cls_out_w", initializer=fluid.initializer.TruncatedNormal(scale=0.02) + ), + bias_attr=fluid.ParamAttr(name=task_name + "_cls_out_b", initializer=fluid.initializer.Constant(0.0)), + ) + """ + if is_prediction: + probs = fluid.layers.softmax(logits) + feed_targets_name = [ + src_ids.name, sent_ids.name, pos_ids.name, input_mask.name + ] + if ernie_version == "2.0": + feed_targets_name += [task_ids.name] + return pyreader, probs, feed_targets_name + """ + + num_seqs = fluid.layers.create_tensor(dtype="int64") + # add focal loss + ce_loss, probs = 
fluid.layers.softmax_with_cross_entropy(logits=logits, label=labels, return_softmax=True) + loss = fluid.layers.mean(x=ce_loss) + accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs) + graph_vars = { + "loss": loss, + "probs": probs, + "accuracy": accuracy, + "labels": labels, + "num_seqs": num_seqs, + "qids": qids, + } + return graph_vars + + if not is_prediction: + graph_vars = _model(is_noise=True) + old_loss = graph_vars["loss"] + token_emb = fluid.default_main_program().global_block().var("word_embedding") + token_emb.stop_gradient = False + token_gradient = fluid.gradients(old_loss, token_emb)[0] + token_gradient.stop_gradient = False + epsilon = 1e-8 + norm = fluid.layers.sqrt(fluid.layers.reduce_sum(fluid.layers.square(token_gradient)) + epsilon) + gp = (0.01 * token_gradient) / norm + gp.stop_gradient = True + fluid.layers.assign(token_emb + gp, token_emb) + graph_vars = _model() + fluid.layers.assign(token_emb - gp, token_emb) + else: + graph_vars = _model() + + return pyreader, graph_vars + + +def evaluate_mrr(preds): + last_qid = None + total_mrr = 0.0 + qnum = 0.0 + rank = 0.0 + correct = False + for qid, score, label in preds: + if qid != last_qid: + rank = 0.0 + qnum += 1 + correct = False + last_qid = qid + + rank += 1 + if not correct and label != 0: + total_mrr += 1.0 / rank + correct = True + + return total_mrr / qnum + + +def evaluate( + exe, test_program, test_pyreader, graph_vars, eval_phase, use_multi_gpu_test=False, metric="simple_accuracy" +): + train_fetch_list = [graph_vars["loss"].name, graph_vars["accuracy"].name, graph_vars["num_seqs"].name] + + if eval_phase == "train": + if "learning_rate" in graph_vars: + train_fetch_list.append(graph_vars["learning_rate"].name) + outputs = exe.run(fetch_list=train_fetch_list, program=test_program) + ret = {"loss": np.mean(outputs[0]), "accuracy": np.mean(outputs[1])} + if "learning_rate" in graph_vars: + ret["learning_rate"] = float(outputs[3][0]) + return ret + + test_pyreader.start() + total_cost = 0.0 + total_acc = 0.0 + total_num_seqs = 0.0 + total_label_pos_num = 0.0 + total_pred_pos_num = 0.0 + total_correct_num = 0.0 + qids, labels, scores, preds = [], [], [], [] + time_begin = time.time() + + fetch_list = [ + graph_vars["loss"].name, + graph_vars["accuracy"].name, + graph_vars["probs"].name, + graph_vars["labels"].name, + graph_vars["num_seqs"].name, + graph_vars["qids"].name, + ] + while True: + try: + if use_multi_gpu_test: + np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(fetch_list=fetch_list) + else: + np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run( + program=test_program, fetch_list=fetch_list + ) + total_cost += np.sum(np_loss * np_num_seqs) + total_acc += np.sum(np_acc * np_num_seqs) + total_num_seqs += np.sum(np_num_seqs) + labels.extend(np_labels.reshape((-1)).tolist()) + if np_qids is None: + np_qids = np.array([]) + qids.extend(np_qids.reshape(-1).tolist()) + scores.extend(np_probs[:, 1].reshape(-1).tolist()) + np_preds = np.argmax(np_probs, axis=1).astype(np.float32) + preds.extend(np_preds) + total_label_pos_num += np.sum(np_labels) + total_pred_pos_num += np.sum(np_preds) + total_correct_num += np.sum(np.dot(np_preds, np_labels)) + except fluid.core.EOFException: + test_pyreader.reset() + break + time_end = time.time() + cost = total_cost / total_num_seqs + elapsed_time = time_end - time_begin + + evaluate_info = "" + if metric == "acc_and_f1": + ret = acc_and_f1(preds, labels) + evaluate_info = "[%s evaluation] ave loss: %f, ave_acc: 
%f, f1: %f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret["acc"], + ret["f1"], + total_num_seqs, + elapsed_time, + ) + elif metric == "matthews_corrcoef": + ret = matthews_corrcoef(preds, labels) + evaluate_info = "[%s evaluation] ave loss: %f, matthews_corrcoef: %f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret, + total_num_seqs, + elapsed_time, + ) + elif metric == "pearson_and_spearman": + ret = pearson_and_spearman(scores, labels) + evaluate_info = ( + "[%s evaluation] ave loss: %f, pearson:%f, spearman:%f, corr:%f, data_num: %d, elapsed time: %f s" + % (eval_phase, cost, ret["pearson"], ret["spearman"], ret["corr"], total_num_seqs, elapsed_time) + ) + elif metric == "simple_accuracy": + ret = simple_accuracy(preds, labels) + evaluate_info = "[%s evaluation] ave loss: %f, acc:%f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret, + total_num_seqs, + elapsed_time, + ) + elif metric == "acc_and_f1_and_mrr": + ret_a = acc_and_f1(preds, labels) + preds = sorted(zip(qids, scores, labels), key=lambda elem: (elem[0], -elem[1])) + ret_b = evaluate_mrr(preds) + evaluate_info = "[%s evaluation] ave loss: %f, acc: %f, f1: %f, mrr: %f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret_a["acc"], + ret_a["f1"], + ret_b, + total_num_seqs, + elapsed_time, + ) + else: + raise ValueError("unsupported metric {}".format(metric)) + return evaluate_info + + +def matthews_corrcoef(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + tp = np.sum((labels == 1) & (preds == 1)) + tn = np.sum((labels == 0) & (preds == 0)) + fp = np.sum((labels == 0) & (preds == 1)) + fn = np.sum((labels == 1) & (preds == 0)) + + mcc = ((tp * tn) - (fp * fn)) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + return mcc + + +def f1_score(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + + tp = np.sum((labels == 1) & (preds == 1)) + fp = np.sum((labels == 0) & (preds == 1)) + fn = np.sum((labels == 1) & (preds == 0)) + p = tp / (tp + fp) + r = tp / (tp + fn) + f1 = (2 * p * r) / (p + r + 1e-8) + return f1 + + +def pearson_and_spearman(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + + pearson_corr = pearsonr(preds, labels)[0] + spearman_corr = spearmanr(preds, labels)[0] + return { + "pearson": pearson_corr, + "spearmanr": spearman_corr, + "corr": (pearson_corr + spearman_corr) / 2, + } + + +def acc_and_f1(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + + acc = simple_accuracy(preds, labels) + f1 = f1_score(preds, labels) + return { + "acc": acc, + "f1": f1, + "acc_and_f1": (acc + f1) / 2, + } + + +def simple_accuracy(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + return (preds == labels).mean() + + +def predict(exe, test_program, test_pyreader, graph_vars, dev_count=1): + test_pyreader.start() + qids, probs = [], [] + preds = [] + + fetch_list = [graph_vars["probs"].name, graph_vars["qids"].name] + + while True: + try: + if dev_count == 1: + np_probs, np_qids = exe.run(program=test_program, fetch_list=fetch_list) + else: + np_probs, np_qids = exe.run(fetch_list=fetch_list) + + if np_qids is None: + np_qids = np.array([]) + qids.extend(np_qids.reshape(-1).tolist()) + np_preds = np.argmax(np_probs, axis=1).astype(np.float32) + preds.extend(np_preds) + probs.append(np_probs) + + except fluid.core.EOFException: + test_pyreader.reset() + break + + probs = np.concatenate(probs, axis=0).reshape([len(preds), -1]) + + return qids, preds, probs diff 
--git a/applications/document_intelligence/doc_vqa/Rerank/src/finetune_args.py b/applications/document_intelligence/doc_vqa/Rerank/src/finetune_args.py
new file mode 100644
index 0000000000000000000000000000000000000000..da3193f3167138ff50611d16115fa20d214b2dd3
--- /dev/null
+++ b/applications/document_intelligence/doc_vqa/Rerank/src/finetune_args.py
@@ -0,0 +1,95 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+from src.utils.args import ArgumentGroup
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
+model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
+model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
+model_g.add_arg("init_pretraining_params", str, None, "Init pre-training params from which fine-tuning is performed. If the arg 'init_checkpoint' has been set, this argument is not valid.")
+model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
+
+model_g.add_arg("is_classify", bool, True, "is_classify")
+model_g.add_arg("is_regression", bool, False, "is_regression")
+model_g.add_arg("task_id", int, 0, "task id")
+
+train_g = ArgumentGroup(parser, "training", "training options.")
+train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.")
+train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
+train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
+train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
+train_g.add_arg("warmup_proportion", float, 0.1, "Proportion of training steps to perform linear learning rate warmup for.")
+train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
+train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
+train_g.add_arg("use_recompute", bool, False, "Whether to use recompute optimizer for training.")
+train_g.add_arg("use_mix_precision", bool, False, "Whether to use mixed-precision optimizer for training.")
+train_g.add_arg("use_cross_batch", bool, False, "Whether to use cross-batch for training.")
+train_g.add_arg("use_lamb", bool, False, "Whether to use LambOptimizer for training.")
+train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
+
+train_g.add_arg("test_save", str, "./checkpoints/test_result", "test_save")
+train_g.add_arg("metric", str, "simple_accuracy", "metric")
+train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive steps.")
+train_g.add_arg("decr_every_n_nan_or_inf", int, 2, "Decreases loss scaling every n accumulated steps with nan or inf gradients.")
+train_g.add_arg("incr_ratio", float, 2.0, "The multiplier to use when increasing the loss scaling.")
+train_g.add_arg("decr_ratio", float, 0.8, "The less-than-one multiplier to use when decreasing the loss scaling.")
+
+log_g = ArgumentGroup(parser, "logging", "logging related.")
+log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
+log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
+
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("tokenizer", str, "FullTokenizer", "ATTENTION: the INPUT must be split into words separated by blanks when using SentencepieceTokenizer or WordsegTokenizer")
+data_g.add_arg("train_set", str, None, "Path to training data.")
+data_g.add_arg("test_set", str, None, "Path to test data.")
+data_g.add_arg("dev_set", str, None, "Path to validation data.")
+data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
+data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest sequence.")
+data_g.add_arg("q_max_seq_len", int, 32, "Number of words of the longest sequence.")
+data_g.add_arg("p_max_seq_len", int, 256, "Number of words of the longest sequence.")
+data_g.add_arg("train_data_size", int, 0, "Number of training data's total examples. Set for distributed training.")
+data_g.add_arg("batch_size", int, 32, "Total number of examples in a batch for training. See also --in_tokens.")
+data_g.add_arg("predict_batch_size", int, None, "Total number of examples in a batch for prediction. See also --in_tokens.")
+data_g.add_arg("in_tokens", bool, False, "If set, the batch size will be the maximum number of tokens in one batch. Otherwise, it will be the maximum number of examples in one batch.")
+data_g.add_arg("do_lower_case", bool, True, "Whether to lower case the input text. Should be True for uncased models and False for cased models.")
+data_g.add_arg("random_seed", int, None, "Random seed.")
+data_g.add_arg("label_map_config", str, None, "label_map_path.")
+data_g.add_arg("num_labels", int, 2, "label number")
+data_g.add_arg("diagnostic", str, None, "GLUE Diagnostic Dataset")
+data_g.add_arg("diagnostic_save", str, None, "GLUE Diagnostic save f")
+data_g.add_arg("max_query_length", int, 64, "Max query length.")
+data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
+data_g.add_arg("doc_stride", int, 128, "When splitting up a long document into chunks, how much stride to take between chunks.")
+data_g.add_arg("n_best_size", int, 20, "The total number of n-best predictions to generate in the nbest_predictions.json output file.")
+data_g.add_arg("chunk_scheme", type=str, default="IOB", choices=["IO", "IOB", "IOE", "IOBES"], help="chunk scheme")
+
+run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
+run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
+run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
+run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
+run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
+run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("output_item", int, 3, "Test output format.")
+run_type_g.add_arg("output_file_name", str, None, "Test output file name")
+run_type_g.add_arg("test_data_cnt", int, 1110000, "total cnt of testset")
+run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards") +run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("shuffle", bool, True, "") +run_type_g.add_arg("for_cn", bool, False, "model train for cn or for other langs.") diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/index_search.py b/applications/document_intelligence/doc_vqa/Rerank/src/index_search.py new file mode 100644 index 0000000000000000000000000000000000000000..d20cdaed497206e6ac3989db049426163aa79a3d --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/index_search.py @@ -0,0 +1,83 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import faiss +import numpy as np + + +def read_embed(file_name, dim=768, bs=3000): + if file_name.endswith("npy"): + i = 0 + emb_np = np.load(file_name) + while i < len(emb_np): + vec_list = emb_np[i : i + bs] + i += bs + yield vec_list + else: + vec_list = [] + with open(file_name) as inp: + for line in inp: + data = line.strip() + vector = [float(item) for item in data.split(" ")] + assert len(vector) == dim + vec_list.append(vector) + if len(vec_list) == bs: + yield vec_list + vec_list = [] + if vec_list: + yield vec_list + + +def load_qid(file_name): + qid_list = [] + with open(file_name) as inp: + for line in inp: + line = line.strip() + qid = line.split("\t")[0] + qid_list.append(qid) + return qid_list + + +def search(index, emb_file, qid_list, outfile, top_k): + q_idx = 0 + with open(outfile, "w") as out: + for batch_vec in read_embed(emb_file): + q_emb_matrix = np.array(batch_vec) + res_dist, res_p_id = index.search(q_emb_matrix.astype("float32"), top_k) + for i in range(len(q_emb_matrix)): + qid = qid_list[q_idx] + for j in range(top_k): + pid = res_p_id[i][j] + score = res_dist[i][j] + out.write("%s\t%s\t%s\t%s\n" % (qid, pid, j + 1, score)) + q_idx += 1 + + +def main(): + part = sys.argv[1] + topk = int(sys.argv[2]) + q_text_file = sys.argv[3] + outfile = "output/res.top%s-part%s" % (topk, part) + + qid_list = load_qid(q_text_file) + + engine = faiss.read_index("output/para.index.part%s" % part) + emb_file = "output/query.emb.npy" + search(engine, emb_file, qid_list, outfile, topk) + + +if __name__ == "__main__": + main() diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/merge.py b/applications/document_intelligence/doc_vqa/Rerank/src/merge.py new file mode 100644 index 0000000000000000000000000000000000000000..ae060b4a01edbe944f8c76ce2067c5155a7a7867 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/merge.py @@ -0,0 +1,61 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +shift = int(sys.argv[1]) +top = int(sys.argv[2]) +total_part = int(sys.argv[3]) + +f_list = [] +for part in range(total_part): + f0 = open("output/res.top%s-part%s" % (top, part)) + f_list.append(f0) + +line_list = [] +for part in range(total_part): + line = f_list[part].readline() + line_list.append(line) + +out = open("output/dev.res.top%s" % top, "w") +last_q = "" +ans_list = {} +while line_list[-1]: + cur_list = [] + for line in line_list: + sub = line.strip().split("\t") + cur_list.append(sub) + + if last_q == "": + last_q = cur_list[0][0] + if cur_list[0][0] != last_q: + rank = sorted(ans_list.items(), key=lambda a: a[1], reverse=True) + for i in range(top): + out.write("%s\t%s\t%s\t%s\n" % (last_q, rank[i][0], i + 1, rank[i][1])) + ans_list = {} + for i, sub in enumerate(cur_list): + ans_list[int(sub[1]) + shift * i] = float(sub[-1]) + last_q = cur_list[0][0] + + line_list = [] + for f0 in f_list: + line = f0.readline() + line_list.append(line) + +rank = sorted(ans_list.items(), key=lambda a: a[1], reverse=True) +for i in range(top): + out.write("%s\t%s\t%s\t%s\n" % (last_q, rank[i][0], i + 1, rank[i][1])) +out.close() + +print("output/dev.res.top%s" % top) diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/model/ernie.py b/applications/document_intelligence/doc_vqa/Rerank/src/model/ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..2dafd972ca822ac8906e9f878559b9b16e3f9752 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/model/ernie.py @@ -0,0 +1,259 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Ernie model.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +from io import open + +import paddle +import paddle.fluid as fluid +import six +from model.transformer_encoder import encoder, pre_process_layer + +log = logging.getLogger(__name__) + + +class ErnieConfig(object): + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path, "r", encoding="utf8") as json_file: + config_dict = json.load(json_file) + except Exception: + raise IOError("Error in parsing Ernie model config file '%s'" % config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict.get(key, None) + + def print_config(self): + for arg, value in sorted(six.iteritems(self._config_dict)): + log.info("%s: %s" % (arg, value)) + log.info("------------------------------------------------") + + +class ErnieModel(object): + def __init__( + self, + src_ids, + position_ids, + sentence_ids, + task_ids, + input_mask, + config, + weight_sharing=True, + model_name="", + is_noise=False, + ): + + self._emb_size = config["hidden_size"] + self._n_layer = config["num_hidden_layers"] + self._n_head = config["num_attention_heads"] + self._voc_size = config["vocab_size"] + self._max_position_seq_len = config["max_position_embeddings"] + if config["sent_type_vocab_size"]: + self._sent_types = config["sent_type_vocab_size"] + else: + self._sent_types = config["type_vocab_size"] + + self._use_task_id = config["use_task_id"] + if self._use_task_id: + self._task_types = config["task_type_vocab_size"] + self._hidden_act = config["hidden_act"] + self._prepostprocess_dropout = config["hidden_dropout_prob"] + self._attention_dropout = config["attention_probs_dropout_prob"] + if is_noise: + self._prepostprocess_dropout = 0 + self._attention_dropout = 0 + self._weight_sharing = weight_sharing + self.checkpoints = [] + + self._word_emb_name = "word_embedding" + self._pos_emb_name = "pos_embedding" + self._sent_emb_name = "sent_embedding" + self._task_emb_name = "task_embedding" + self._emb_dtype = "float32" + + # Initialize all weights by truncated normal initializer, and all biases + # will be initialized by constant zero by default. 
+ self._param_initializer = fluid.initializer.TruncatedNormal(scale=config["initializer_range"]) + + self._build_model(model_name, src_ids, position_ids, sentence_ids, task_ids, input_mask) + + def _build_model(self, model_name, src_ids, position_ids, sentence_ids, task_ids, input_mask): + # padding id in vocabulary must be set to 0 + emb_out = fluid.layers.embedding( + input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._word_emb_name, initializer=self._param_initializer), + is_sparse=False, + ) + + position_emb_out = fluid.layers.embedding( + input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._pos_emb_name, initializer=self._param_initializer), + ) + + sent_emb_out = fluid.layers.embedding( + sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._sent_emb_name, initializer=self._param_initializer), + ) + + emb_out = emb_out + position_emb_out + emb_out = emb_out + sent_emb_out + + if self._use_task_id: + task_emb_out = fluid.layers.embedding( + task_ids, + size=[self._task_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._task_emb_name, initializer=self._param_initializer), + ) + + emb_out = emb_out + task_emb_out + + emb_out = pre_process_layer(emb_out, "nd", self._prepostprocess_dropout, name=model_name + "pre_encoder") + + self_attn_mask = paddle.matmul(x=input_mask, y=input_mask, transpose_y=True) + + self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + self._enc_out, self.checkpoints = encoder( + enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + model_name=model_name, + name=model_name + "encoder", + ) + + def get_sequence_output(self): + return self._enc_out + + def get_cls_output(self): + """Get the first feature of each sequence for classification""" + cls_output = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) + cls_output = fluid.layers.squeeze(cls_output, axes=[1]) + return cls_output + + def get_pooled_output(self): + """Get the first feature of each sequence for classification""" + next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) + next_sent_feat = fluid.layers.fc( + input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc.b_0", + ) + return next_sent_feat + + def get_lm_output(self, mask_label, mask_pos): + """Get the loss & accuracy for pretraining""" + + mask_pos = fluid.layers.cast(x=mask_pos, dtype="int32") + + # extract the first token feature in each sentence + self.next_sent_feat = self.get_pooled_output() + reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, 
shape=[-1, self._emb_size]) + # extract masked tokens' feature + mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) + + # transform: fc + mask_trans_feat = fluid.layers.fc( + input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name="mask_lm_trans_fc.w_0", initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name="mask_lm_trans_fc.b_0"), + ) + + # transform: layer norm + mask_trans_feat = fluid.layers.layer_norm( + mask_trans_feat, + begin_norm_axis=len(mask_trans_feat.shape) - 1, + param_attr=fluid.ParamAttr( + name="mask_lm_trans_layer_norm_scale", initializer=fluid.initializer.Constant(1.0) + ), + bias_attr=fluid.ParamAttr( + name="mask_lm_trans_layer_norm_bias", initializer=fluid.initializer.Constant(1.0) + ), + ) + # transform: layer norm + # mask_trans_feat = pre_process_layer( + # mask_trans_feat, 'n', name='mask_lm_trans') + + mask_lm_out_bias_attr = fluid.ParamAttr( + name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0) + ) + if self._weight_sharing: + fc_out = paddle.matmul( + x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True, + ) + fc_out += fluid.layers.create_parameter( + shape=[self._voc_size], dtype=self._emb_dtype, attr=mask_lm_out_bias_attr, is_bias=True + ) + + else: + fc_out = fluid.layers.fc( + input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr, + ) + + mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) + mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) + + return mean_mask_lm_loss + + def get_task_output(self, task, task_labels): + task_fc_out = fluid.layers.fc( + input=self.next_sent_feat, + size=task["num_labels"], + param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", initializer=self._param_initializer), + bias_attr=task["task_name"] + "_fc.b_0", + ) + task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy( + logits=task_fc_out, label=task_labels, return_softmax=True + ) + task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels) + mean_task_loss = fluid.layers.mean(task_loss) + return mean_task_loss, task_acc diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/model/transformer_encoder.py b/applications/document_intelligence/doc_vqa/Rerank/src/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..5fcef783bcded4f861934f49b188700747367647 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/model/transformer_encoder.py @@ -0,0 +1,318 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
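+#
+# Static-graph implementation of the transformer encoder stack used by ErnieModel:
+# multi_head_attention and positionwise_feed_forward are wrapped with
+# pre/post-process layers (residual connection, layer normalization, dropout).
+# encoder() stacks n_layer encoder_layer blocks and returns (enc_output,
+# checkpoints), where checkpoints collects the per-layer feed-forward outputs.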
+"""Transformer encoder.""" + +from __future__ import absolute_import, division, print_function + +from functools import partial + +import paddle +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention( + queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0.0, + cache=None, + param_initializer=None, + name="multi_head_att", +): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc( + input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_query_fc.w_0", initializer=param_initializer), + bias_attr=name + "_query_fc.b_0", + ) + k = layers.fc( + input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_key_fc.w_0", initializer=param_initializer), + bias_attr=name + "_key_fc.b_0", + ) + v = layers.fc( + input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_value_fc.w_0", initializer=param_initializer), + bias_attr=name + "_value_fc.b_0", + ) + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of input tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of input tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: + return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. 
+ return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = paddle.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False + ) + out = paddle.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc( + input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_output_fc.w_0", initializer=param_initializer), + bias_attr=name + "_output_fc.b_0", + ) + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name="ffn"): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc( + input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + "_fc_0.w_0", initializer=param_initializer), + bias_attr=name + "_fc_0.b_0", + ) + if dropout_rate: + hidden = layers.dropout( + hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False + ) + out = layers.fc( + input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_fc_1.w_0", initializer=param_initializer), + bias_attr=name + "_fc_1.b_0", + ) + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.0, name=""): + """ + Add residual connection, layer normalization and dropout to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. 
+ """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr( + name=name + "_layer_norm_scale", initializer=fluid.initializer.Constant(1.0) + ), + bias_attr=fluid.ParamAttr(name=name + "_layer_norm_bias", initializer=fluid.initializer.Constant(0.0)), + ) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False + ) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name="", +): + """The encoder layers that can be stacked to form a deep encoder. + This module consists of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and dropout. + """ + attn_output = multi_head_attention( + pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + "_pre_att"), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + "_multi_head_att", + ) + attn_output = post_process_layer( + enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + "_post_att" + ) + ffd_output = positionwise_feed_forward( + pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + "_pre_ffn"), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + "_ffn", + ) + return ( + post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + "_post_ffn"), + ffd_output, + ) + + +def encoder( + enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + model_name="", + name="", +): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. 
+ """ + checkpoints = [] + for i in range(n_layer): + enc_output, cp = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + "_layer_" + str(i), + ) + checkpoints.append(cp) + enc_input = enc_output + enc_output = pre_process_layer( + enc_output, preprocess_cmd, prepostprocess_dropout, name=model_name + "post_encoder" + ) + + return enc_output, checkpoints diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/optimization.py b/applications/document_intelligence/doc_vqa/Rerank/src/optimization.py new file mode 100644 index 0000000000000000000000000000000000000000..e724f1028f19e469e4759d399f637763c9007db0 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/optimization.py @@ -0,0 +1,121 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Optimization and learning rate scheduling.""" + +import paddle.fluid as fluid +from paddle.fluid.incubate.fleet.collective import fleet + + +def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps): + """Applies linear warmup of learning rate from 0 and decay to 0.""" + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], value=0.0, dtype="float32", persistable=True, name="scheduled_learning_rate" + ) + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter() + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + warmup_lr = learning_rate * (global_step / warmup_steps) + fluid.layers.tensor.assign(warmup_lr, lr) + with switch.default(): + decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay( + learning_rate=learning_rate, + decay_steps=num_train_steps, + end_learning_rate=0.0, + power=1.0, + cycle=False, + ) + fluid.layers.tensor.assign(decayed_lr, lr) + + return lr + + +def optimization( + loss, + warmup_steps, + num_train_steps, + learning_rate, + train_program, + startup_prog, + weight_decay, + scheduler="linear_warmup_decay", + use_dynamic_loss_scaling=False, + incr_every_n_steps=1000, + decr_every_n_nan_or_inf=2, + incr_ratio=2.0, + decr_ratio=0.8, + dist_strategy=None, + use_lamb=False, +): + if warmup_steps > 0: + if scheduler == "noam_decay": + scheduled_lr = fluid.layers.learning_rate_scheduler.noam_decay( + 1 / (warmup_steps * (learning_rate**2)), warmup_steps + ) + elif scheduler == "linear_warmup_decay": + scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps, num_train_steps) + else: + raise ValueError("Unknown learning rate scheduler, should be " "'noam_decay' or 'linear_warmup_decay'") + if use_lamb: + optimizer = fluid.optimizer.LambOptimizer(learning_rate=scheduled_lr) + else: + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + else: + scheduled_lr = 
fluid.layers.create_global_var( + name=fluid.unique_name.generate("learning_rate"), + shape=[1], + value=learning_rate, + dtype="float32", + persistable=True, + ) + if use_lamb: + optimizer = fluid.optimizer.LambOptimizer(learning_rate=scheduled_lr) + else: + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + optimizer._learning_rate_map[fluid.default_main_program()] = scheduled_lr + + fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)) + + def exclude_from_weight_decay(name): + if name.find("layer_norm") > -1: + return True + bias_suffix = ["_bias", "_b", ".b_0"] + for suffix in bias_suffix: + if name.endswith(suffix): + return True + return False + + param_list = dict() + + for param in train_program.global_block().all_parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + if dist_strategy is not None: + # use fleet api + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + + _, param_grads = optimizer.minimize(loss) + + if weight_decay > 0: + for param, grad in param_grads: + if exclude_from_weight_decay(param.name): + continue + with param.block.program._optimized_guard([param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + return scheduled_lr diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/reader_ce.py b/applications/document_intelligence/doc_vqa/Rerank/src/reader_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..8c08ade51b892044ed9a977f6ffdb2bb17fcc60d --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/reader_ce.py @@ -0,0 +1,314 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
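+#
+# Data reader for the cross-encoder: ClassifyReader parses tab-separated
+# query \t title \t para \t label examples, tokenizes them with the BERT-style
+# FullTokenizer, and yields padded batches of token/type/position/task ids plus
+# the attention input_mask (and labels/qids when not in inference mode).
+#
+# Minimal usage sketch (illustrative only; file paths and batch settings are
+# placeholders):
+#
+#     reader = ClassifyReader(vocab_path="vocab.txt", max_seq_len=384, for_cn=True)
+#     gen = reader.data_generator("train.tsv", batch_size=32, epoch=1, shuffle=False)
+#     for batch in gen():
+#         token_ids, type_ids, pos_ids, task_ids, input_mask, labels, qids = batch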
+ +import json +import logging +import sys +from collections import namedtuple +from io import open + +import numpy as np +import six +import tokenization +from batching import pad_batch_data + +log = logging.getLogger(__name__) + +if six.PY3: + import io + + sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8") + sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8") + + +def csv_reader(fd, delimiter="\t", trainer_id=0, trainer_num=1): + def gen(): + for i, line in enumerate(fd): + if i % trainer_num == trainer_id: + slots = line.rstrip("\n").split(delimiter) + if len(slots) == 1: + yield slots, + else: + yield slots + + return gen() + + +class BaseReader(object): + def __init__( + self, + vocab_path, + label_map_config=None, + max_seq_len=512, + total_num=0, + do_lower_case=True, + in_tokens=False, + is_inference=False, + random_seed=None, + tokenizer="FullTokenizer", + for_cn=True, + task_id=0, + ): + self.max_seq_len = max_seq_len + self.tokenizer = tokenization.FullTokenizer(vocab_file=vocab_path, do_lower_case=do_lower_case) + self.vocab = self.tokenizer.vocab + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.in_tokens = in_tokens + self.is_inference = is_inference + self.for_cn = for_cn + self.task_id = task_id + + np.random.seed(random_seed) + + self.current_example = 0 + self.current_epoch = 0 + self.num_examples = 0 + self.total_num = total_num + + if label_map_config: + with open(label_map_config, encoding="utf8") as f: + self.label_map = json.load(f) + else: + self.label_map = None + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_example, self.current_epoch + + def _read_tsv(self, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf8") as f: + reader = csv_reader(f) + headers = next(reader) + Example = namedtuple("Example", headers) + + examples = [] + for line in reader: + example = Example(*line) + examples.append(example) + return examples + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + def _convert_example_to_record(self, example, max_seq_length, tokenizer): + """Converts a single `Example` into a single `Record`.""" + + query = tokenization.convert_to_unicode(example.query) + tokens_a = tokenizer.tokenize(query) + tokens_b = None + + title = tokenization.convert_to_unicode(example.title) + tokens_b = tokenizer.tokenize(title) + + para = tokenization.convert_to_unicode(example.para) + tokens_para = tokenizer.tokenize(para) + + tokens_b.extend(tokens_para) + + self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + + # The convention in BERT/ERNIE is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . 
[SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = [] + text_type_ids = [] + tokens.append("[CLS]") + text_type_ids.append(0) + for token in tokens_a: + tokens.append(token) + text_type_ids.append(0) + tokens.append("[SEP]") + text_type_ids.append(0) + + if tokens_b: + for token in tokens_b: + tokens.append(token) + text_type_ids.append(1) + tokens.append("[SEP]") + text_type_ids.append(1) + + token_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(token_ids))) + + if self.is_inference: + Record = namedtuple("Record", ["token_ids", "text_type_ids", "position_ids"]) + record = Record(token_ids=token_ids, text_type_ids=text_type_ids, position_ids=position_ids) + else: + if self.label_map: + label_id = self.label_map[example.label] + else: + label_id = example.label + + Record = namedtuple("Record", ["token_ids", "text_type_ids", "position_ids", "label_id", "qid"]) + + qid = None + if "qid" in example._fields: + qid = example.qid + + record = Record( + token_ids=token_ids, text_type_ids=text_type_ids, position_ids=position_ids, label_id=label_id, qid=qid + ) + return record + + def _prepare_batch_data(self, examples, batch_size, phase=None): + """generate batch records""" + batch_records, max_len = [], 0 + for index, example in enumerate(examples): + if phase == "train": + self.current_example = index + record = self._convert_example_to_record(example, self.max_seq_len, self.tokenizer) + max_len = max(max_len, len(record.token_ids)) + if self.in_tokens: + to_append = (len(batch_records) + 1) * max_len <= batch_size + else: + to_append = len(batch_records) < batch_size + if to_append: + batch_records.append(record) + else: + yield self._pad_batch_records(batch_records) + batch_records, max_len = [record], len(record.token_ids) + + if batch_records: + yield self._pad_batch_records(batch_records) + + def get_num_examples(self, input_file): + # examples = self._read_tsv(input_file) + # return len(examples) + return self.num_examples + + def data_generator( + self, input_file, batch_size, epoch, dev_count=1, trainer_id=0, trainer_num=1, shuffle=True, phase=None + ): + + if phase == "train": + # examples = examples[trainer_id: (len(examples) //trainer_num) * trainer_num : trainer_num] + self.num_examples_per_node = self.total_num // trainer_num + self.num_examples = self.num_examples_per_node * trainer_num + examples = self._read_tsv( + input_file, trainer_id=trainer_id, trainer_num=trainer_num, num_examples=self.num_examples_per_node + ) + log.info("apply sharding %d/%d" % (trainer_id, trainer_num)) + else: + examples = self._read_tsv(input_file) + + def wrapper(): + all_dev_batches = [] + for epoch_index in range(epoch): + if phase == "train": + self.current_example = 0 + self.current_epoch = epoch_index + if shuffle: + np.random.shuffle(examples) + + for batch_data in self._prepare_batch_data(examples, batch_size, phase=phase): + if len(all_dev_batches) 
< dev_count: + all_dev_batches.append(batch_data) + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + def f(): + try: + for i in wrapper(): + yield i + except Exception: + import traceback + + traceback.print_exc() + + return f + + +class ClassifyReader(BaseReader): + def _read_tsv(self, input_file, quotechar=None, trainer_id=0, trainer_num=1, num_examples=0): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf8") as f: + reader = csv_reader(f, trainer_id=trainer_id, trainer_num=trainer_num) + # headers = next(reader) + headers = "query\ttitle\tpara\tlabel".split("\t") + text_indices = [index for index, h in enumerate(headers) if h != "label"] + Example = namedtuple("Example", headers) + + examples = [] + for cnt, line in enumerate(reader): + if num_examples != 0 and cnt == num_examples: + break + for index, text in enumerate(line): + if index in text_indices: + if self.for_cn: + line[index] = text.replace(" ", "") + else: + line[index] = text + example = Example(*line) + examples.append(example) + return examples + + def _pad_batch_records(self, batch_records): + batch_token_ids = [record.token_ids for record in batch_records] + batch_text_type_ids = [record.text_type_ids for record in batch_records] + batch_position_ids = [record.position_ids for record in batch_records] + + if not self.is_inference: + batch_labels = [record.label_id for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + + if batch_records[0].qid: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + # padding + padded_token_ids, input_mask = pad_batch_data(batch_token_ids, pad_idx=self.pad_id, return_input_mask=True) + padded_text_type_ids = pad_batch_data(batch_text_type_ids, pad_idx=self.pad_id) + padded_position_ids = pad_batch_data(batch_position_ids, pad_idx=self.pad_id) + padded_task_ids = np.ones_like(padded_token_ids, dtype="int64") * self.task_id + + return_list = [padded_token_ids, padded_text_type_ids, padded_position_ids, padded_task_ids, input_mask] + if not self.is_inference: + return_list += [batch_labels, batch_qids] + + return return_list diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/tokenization.py b/applications/document_intelligence/doc_vqa/Rerank/src/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..549b0126966b8a4992e7eb2a2ea319ad3ed946aa --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/tokenization.py @@ -0,0 +1,400 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
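+#
+# BERT-style tokenization utilities used by the reader: BasicTokenizer handles
+# lower-casing, accent stripping, punctuation and CJK character splitting, and
+# WordpieceTokenizer applies greedy longest-match-first subword splitting.
+#
+# Minimal usage sketch (illustrative only; the vocab path is a placeholder):
+#
+#     tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
+#     tokens = tokenizer.tokenize("NFC咋开门")
+#     ids = tokenizer.convert_tokens_to_ids(tokens)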
+"""Tokenization classes.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import collections +import unicodedata +from io import open + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + + +def printable_text(text): + """Returns text encoded in a way suitable for print or `tf.logging`.""" + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, encoding="utf8") as fin: + for num, line in enumerate(fin): + items = convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +def convert_tokens_to_ids(vocab, tokens): + return convert_by_vocab(vocab, tokens) + + +def convert_ids_to_tokens(inv_vocab, ids): + return convert_by_vocab(inv_vocab, ids) + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class FullTokenizer(object): + """Runs end-to-end tokenization.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class CharTokenizer(object): + """Runs end-to-end tokenization.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in text.lower().split(" "): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, do_lower_case=True): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. 
+ """ + self.do_lower_case = do_lower_case + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = convert_to_unicode(text) + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + text = self._tokenize_chinese_chars(text) + + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
+ if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xFFFD or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer. + + Returns: + A list of wordpiece tokens. + """ + + text = convert_to_unicode(text) + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically control characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. 
+ if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False + + +def tokenize_chinese_chars(text): + """Adds whitespace around any CJK character.""" + + def _is_chinese_char(cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + output = [] + buff = "" + for char in text: + cp = ord(char) + if _is_chinese_char(cp) or _is_whitespace(char): + if buff != "": + output.append(buff) + buff = "" + output.append(char) + else: + buff += char + + if buff != "": + output.append(buff) + + return output diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/train_ce.py b/applications/document_intelligence/doc_vqa/Rerank/src/train_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..4620782083dd3b907fe8e1c84a52a0a03f29854e --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/train_ce.py @@ -0,0 +1,354 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Finetuning on classification tasks.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging +import multiprocessing +import os +import time +import warnings + +# NOTE(paddle-dev): All of these flags should be +# set before `import paddle`. Otherwise, it would +# not take any effect. 
+os.environ["FLAGS_eager_delete_tensor_gb"] = "0" # enable gc + +import paddle # noqa: E402 +import paddle.fluid as fluid # noqa: E402 + +if hasattr(paddle, "enable_static"): + paddle.enable_static() +import paddle.fluid.incubate.fleet.base.role_maker as role_maker # noqa: E402 +import reader_ce as reader_ce # noqa: E402 +from cross_encoder import create_model, evaluate, predict # noqa: E402 +from finetune_args import parser # noqa: E402 +from model.ernie import ErnieConfig # noqa: E402 +from optimization import optimization # noqa: E402 +from paddle.fluid.incubate.fleet.collective import ( # noqa: E402 + DistributedStrategy, + fleet, +) +from src.utils.args import check_cuda, prepare_logger, print_arguments # noqa: E402 +from src.utils.init import init_checkpoint, init_pretraining_params # noqa: E402 + +warnings.filterwarnings("ignore") +args = parser.parse_args() +log = logging.getLogger() + + +def main(args): + ernie_config = ErnieConfig(args.ernie_config_path) + ernie_config.print_config() + + if args.use_cuda: + dev_list = fluid.cuda_places() + place = dev_list[0] + dev_count = len(dev_list) + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get("CPU_NUM", multiprocessing.cpu_count())) + exe = fluid.Executor(place) + + reader = reader_ce.ClassifyReader( + vocab_path=args.vocab_path, + label_map_config=args.label_map_config, + max_seq_len=args.max_seq_len, + total_num=args.train_data_size, + do_lower_case=args.do_lower_case, + in_tokens=args.in_tokens, + random_seed=args.random_seed, + tokenizer=args.tokenizer, + for_cn=args.for_cn, + task_id=args.task_id, + ) + + if not (args.do_train or args.do_val or args.do_test): + raise ValueError("For args `do_train`, `do_val` and `do_test`, at " "least one of them must be True.") + + if args.do_test: + assert args.test_save is not None + startup_prog = fluid.Program() + if args.random_seed is not None: + startup_prog.random_seed = args.random_seed + + if args.predict_batch_size is None: + args.predict_batch_size = args.batch_size + + if args.do_train: + role = role_maker.PaddleCloudRoleMaker(is_collective=True) + fleet.init(role) + dev_count = fleet.worker_num() + + train_data_generator = reader.data_generator( + input_file=args.train_set, + batch_size=args.batch_size, + epoch=args.epoch, + dev_count=1, + trainer_id=fleet.worker_index(), + trainer_num=fleet.worker_num(), + shuffle=True, + phase="train", + ) + + num_train_examples = reader.get_num_examples(args.train_set) + + if args.in_tokens: + max_train_steps = args.epoch * num_train_examples // (args.batch_size // args.max_seq_len) // dev_count + else: + max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * args.warmup_proportion) + log.info("Device count: %d" % dev_count) + log.info("Num train examples: %d" % num_train_examples) + log.info("Max train steps: %d" % max_train_steps) + log.info("Num warmup steps: %d" % warmup_steps) + + train_program = fluid.Program() + + # use fleet api + exec_strategy = fluid.ExecutionStrategy() + if args.use_fast_executor: + exec_strategy.use_experimental_executor = True + exec_strategy.num_threads = dev_count + if args.is_distributed: + exec_strategy.num_threads = 3 + + exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope + + dist_strategy = DistributedStrategy() + dist_strategy.exec_strategy = exec_strategy + dist_strategy.nccl_comm_num = 1 + if args.is_distributed: + dist_strategy.nccl_comm_num = 2 + dist_strategy.use_hierarchical_allreduce = True + 
+ if args.use_mix_precision: + dist_strategy.use_amp = True + + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, graph_vars = create_model( + args, pyreader_name="train_reader", ernie_config=ernie_config + ) + scheduled_lr = optimization( + loss=graph_vars["loss"], + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + use_dynamic_loss_scaling=args.use_dynamic_loss_scaling, + incr_every_n_steps=args.incr_every_n_steps, + decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf, + incr_ratio=args.incr_ratio, + decr_ratio=args.decr_ratio, + dist_strategy=dist_strategy, + ) + + if args.verbose: + if args.in_tokens: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size // args.max_seq_len + ) + else: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size + ) + log.info("Theoretical memory usage in training: %.3f - %.3f %s" % (lower_mem, upper_mem, unit)) + + if args.do_val or args.do_test: + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, graph_vars = create_model( + args, pyreader_name="test_reader", ernie_config=ernie_config, is_prediction=True + ) + + test_prog = test_prog.clone(for_test=True) + + train_program = fleet.main_program + + exe = fluid.Executor(place) + exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_pretraining_params: + log.warning( + "WARNING: args 'init_checkpoint' and 'init_pretraining_params' " + "both are set! Only arg 'init_checkpoint' is made valid." 
+ ) + if args.init_checkpoint: + init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog) + elif args.init_pretraining_params: + init_pretraining_params(exe, args.init_pretraining_params, main_program=startup_prog) + elif args.do_val or args.do_test: + if not args.init_checkpoint: + raise ValueError("args 'init_checkpoint' should be set if" "only doing validation or testing!") + init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog) + + if args.do_train: + train_exe = exe + train_pyreader.decorate_tensor_provider(train_data_generator) + else: + train_exe = None + + test_exe = exe + + current_epoch = 0 + steps = 0 + if args.do_train: + train_pyreader.start() + if warmup_steps > 0: + graph_vars["learning_rate"] = scheduled_lr + + ce_info = [] + time_begin = time.time() + last_epoch = 0 + while True: + try: + steps += 1 + + if fleet.worker_index() != 0: + train_exe.run(fetch_list=[], program=train_program) + continue + + if steps % args.skip_steps != 0: + train_exe.run(fetch_list=[], program=train_program) + + else: + outputs = evaluate( + train_exe, train_program, train_pyreader, graph_vars, "train", metric=args.metric + ) + + if args.verbose: + verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size() + verbose += "learning rate: %f" % ( + outputs["learning_rate"] if warmup_steps > 0 else args.learning_rate + ) + log.info(verbose) + + current_example, current_epoch = reader.get_train_progress() + time_end = time.time() + used_time = time_end - time_begin + + log.info( + "epoch: %d, progress: %d/%d, step: %d, ave loss: %f, " + "ave acc: %f, speed: %f steps/s" + % ( + current_epoch, + current_example * dev_count, + num_train_examples, + steps, + outputs["loss"], + outputs["accuracy"], + args.skip_steps / used_time, + ) + ) + ce_info.append([outputs["loss"], outputs["accuracy"], used_time]) + + time_begin = time.time() + + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, fleet._origin_program) + + if steps % args.validation_steps == 0: + # evaluate dev set + if args.do_val: + evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps) + + if args.do_test: + predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps) + + if last_epoch != current_epoch: + last_epoch = current_epoch + + except fluid.core.EOFException: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, fleet._origin_program) + train_pyreader.reset() + break + + # final eval on dev set + if args.do_val: + evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps) + + # final eval on test set + if args.do_test: + predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars) + + # final eval on diagnostic, hack for glue-ax + if args.diagnostic: + test_pyreader.decorate_tensor_provider( + reader.data_generator(args.diagnostic, batch_size=args.batch_size, epoch=1, dev_count=1, shuffle=False) + ) + + log.info("Final diagnostic") + qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars) + assert len(qids) == len(preds), "{} v.s. 
{}".format(len(qids), len(preds)) + with open(args.diagnostic_save, "w") as f: + for id, s, p in zip(qids, preds, probs): + f.write("{}\t{}\t{}\n".format(id, s, p)) + + log.info("Done final diagnostic, saving to {}".format(args.diagnostic_save)) + + +def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, epoch, steps): + # evaluate dev set + for ds in args.dev_set.split(","): + test_pyreader.decorate_tensor_provider( + reader.data_generator(ds, batch_size=args.predict_batch_size, epoch=1, dev_count=1, shuffle=False) + ) + log.info("validation result of dataset {}:".format(ds)) + evaluate_info = evaluate(exe, test_prog, test_pyreader, graph_vars, "dev", metric=args.metric) + log.info(evaluate_info + ", file: {}, epoch: {}, steps: {}".format(ds, epoch, steps)) + + +def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, epoch=None, steps=None): + test_sets = args.test_set.split(",") + save_dirs = args.test_save.split(",") + assert len(test_sets) == len(save_dirs) + + for test_f, save_f in zip(test_sets, save_dirs): + test_pyreader.decorate_tensor_provider( + reader.data_generator(test_f, batch_size=args.predict_batch_size, epoch=1, dev_count=1, shuffle=False) + ) + + if epoch is not None or steps is not None: + save_path = save_f + "." + str(epoch) + "." + str(steps) + else: + save_path = save_f + log.info("testing {}, save to {}".format(test_f, save_path)) + qids, preds, probs = predict(exe, test_prog, test_pyreader, graph_vars) + + save_dir = os.path.dirname(save_path) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + else: + log.warning("save dir exists: %s, will skip saving" % save_dir) + + with open(save_path, "w") as f: + for p in probs: + f.write("{}\n".format(p[1])) + + +if __name__ == "__main__": + prepare_logger(log) + print_arguments(args) + check_cuda(args.use_cuda) + main(args) diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/utils/args.py b/applications/document_intelligence/doc_vqa/Rerank/src/utils/args.py new file mode 100644 index 0000000000000000000000000000000000000000..a00d3d867a2549a4253af785ec14b7422894eaa4 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/utils/args.py @@ -0,0 +1,71 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Arguments for configuration.""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging +import os +import sys + +import paddle.fluid as fluid +import six + +from paddlenlp.trainer.argparser import strtobool + +log = logging.getLogger(__name__) + + +def prepare_logger(logger, debug=False, save_to_file=None): + formatter = logging.Formatter(fmt="[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s") + console_hdl = logging.StreamHandler() + console_hdl.setFormatter(formatter) + logger.addHandler(console_hdl) + if save_to_file is not None and not os.path.exists(save_to_file): + file_hdl = logging.FileHandler(save_to_file) + file_hdl.setFormatter(formatter) + logger.addHandler(file_hdl) + logger.setLevel(logging.DEBUG) + logger.propagate = False + + +class ArgumentGroup(object): + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, positional_arg=False, **kwargs): + prefix = "" if positional_arg else "--" + type = strtobool if type == bool else type + self._group.add_argument( + prefix + name, default=default, type=type, help=help + " Default: %(default)s.", **kwargs + ) + + +def print_arguments(args): + log.info("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + log.info("%s: %s" % (arg, value)) + log.info("------------------------------------------------") + + +def check_cuda( + use_cuda, + err="\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \ + Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n", +): + try: + if use_cuda is True and fluid.is_compiled_with_cuda() is False: + log.error(err) + sys.exit(1) + except Exception: + pass diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/utils/init.py b/applications/document_intelligence/doc_vqa/Rerank/src/utils/init.py new file mode 100644 index 0000000000000000000000000000000000000000..8ba377fd83597cdaadde05698000f60737b56cfe --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/utils/init.py @@ -0,0 +1,52 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os + +import paddle.fluid as fluid + +log = logging.getLogger(__name__) + + +def init_checkpoint(exe, init_checkpoint_path, main_program): + assert os.path.exists(init_checkpoint_path), "[%s] cann't be found." 
% init_checkpoint_path + + def existed_persitables(var): + if not fluid.io.is_persistable(var): + return False + if not os.path.exists(os.path.join(init_checkpoint_path, var.name)): + print("Var not exists: [%s]\t%s" % (var.name, os.path.join(init_checkpoint_path, var.name))) + # else: + # print ("Var exists: [%s]" % (var.name)) + return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + + fluid.io.load_vars(exe, init_checkpoint_path, main_program=main_program, predicate=existed_persitables) + log.info("Load model from {}".format(init_checkpoint_path)) + + +def init_pretraining_params(exe, pretraining_params_path, main_program): + assert os.path.exists(pretraining_params_path), "[%s] cann't be found." % pretraining_params_path + + def existed_params(var): + if not isinstance(var, fluid.framework.Parameter): + return False + if not os.path.exists(os.path.join(pretraining_params_path, var.name)): + print("Var not exists: [%s]\t%s" % (var.name, os.path.join(pretraining_params_path, var.name))) + # else: + # print ("Var exists: [%s]" % (var.name)) + return os.path.exists(os.path.join(pretraining_params_path, var.name)) + + fluid.io.load_vars(exe, pretraining_params_path, main_program=main_program, predicate=existed_params) + log.info("Load pretraining parameters from {}.".format(pretraining_params_path)) diff --git a/applications/document_intelligence/doc_vqa/run_test.sh b/applications/document_intelligence/doc_vqa/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..2e742a0fa8cbb1a01f266c7bd522e3a67fe1660e --- /dev/null +++ b/applications/document_intelligence/doc_vqa/run_test.sh @@ -0,0 +1,34 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +QUESTION=$1 + +# Question: NFC咋开门 + +if [ $# != 1 ];then + echo "USAGE: sh script/run_cross_encoder_test.sh \$QUESTION" + exit 1 +fi + +# compute scores for QUESTION and OCR parsing results with Rerank module +cd Rerank +bash run_test.sh ${QUESTION} +cd .. + +# extraction answer for QUESTION from the top1 of rank +cd Extraction +bash run_test.sh ${QUESTION} +cd .. diff --git a/applications/document_intelligence/docprompt/README.md b/applications/document_intelligence/docprompt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b323943aca9e7c08cf378027a105b6d2fbb81150 --- /dev/null +++ b/applications/document_intelligence/docprompt/README.md @@ -0,0 +1 @@ +[ERNIE-Layout](../../../model_zoo/ernie-layout) diff --git a/applications/information_extraction/README.md b/applications/information_extraction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c85a0912cae48cf688d248449ba1235ee01370fc --- /dev/null +++ b/applications/information_extraction/README.md @@ -0,0 +1,171 @@ +简体中文 | [English](README_en.md) + +# 信息抽取应用 + +**目录** +- [1. 信息抽取应用简介](#1) +- [2. 技术特色](#2) + - [2.1 信息抽取方案全覆盖](#21) + - [2.2 强大的训练基座](#22) + - [2.3 产业级全流程方案](#23) + - [2.4 效果展示](#24) +- [3. 
快速开始](#快速开始) + - [3.1 Taskflow开箱即用](#31) + - [3.2 文本信息抽取](#32) + - [3.3 文档信息抽取](#33) + + + +## 1. 信息抽取应用简介 + +信息抽取应用针对信息抽取一系列高频场景开源了产业级解决方案,**具备多领域、多任务、跨模态的能力**,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现信息抽取产品落地。 + +信息抽取通俗地说就是从给定的文本/图片等输入数据中抽取出结构化信息的过程。在信息抽取的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对信息抽取领域的难点和痛点,PaddleNLP信息抽取应用**基于UIE统一建模的思想**,提供了信息抽取产业级应用方案,**除支持纯文本场景实体、关系、事件、观点等不同任务抽取外,还支持文档/图片/表格的端到端信息抽取**。该应用**不限定行业领域和抽取目标**,可实现从产品原型研发、业务POC阶段到业务落地、迭代阶段的无缝衔接,助力开发者实现特定领域抽取场景的快速适配与落地。 + +**信息抽取应用亮点:** + +- **覆盖场景全面🎓:** 覆盖信息抽取各类主流任务,面向纯文本和文档场景,支持多语言,满足开发者多样信息抽取落地需求。 +- **效果领先🏃:** 以在纯文本、多模态上均有突出效果的UIE系列模型作为训练基座,提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **简单易用⚡:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启信息抽取训练,轻松完成部署上线,降低信息抽取技术落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 技术特色 + + + +### 2.1 信息抽取方案全覆盖 + +多模型选择,满足精度、速度,适配不同信息抽取使用场景。 + +| 模型名称 | 使用场景 | 支持任务 | +| :----------------------------------------------------------: | :--------------------------------------------------------- | :--------------------------------------------------- | +| `uie-base`
`uie-medium`
`uie-mini`
`uie-micro`
`uie-nano` | 面向**纯文本**场景的**抽取式**模型,支持**中文** | 具备实体、关系、事件、评论观点等通用信息抽取能力 | +| `uie-base-en` | 面向**纯文本**场景的**抽取式**模型,支持**英文** | 具备实体、关系、事件、评论观点等通用信息抽取能力 | +| `uie-m-base`
`uie-m-large` | 面向**纯文本**场景的**抽取式**模型,支持**中英** | 具备实体、关系、事件、评论观点等通用信息抽取能力 | +| `uie-x-base` | 面向**纯文本**和**文档**场景的**抽取式**模型,支持**中英** | 支持纯文本场景的全部功能,还支持文档/图片/表格的端到端信息抽取 | + + + +### 2.2 强大的训练基座 + +信息抽取应用使用ERNIE 3.0轻量级模型作为预训练模型,同时在大量信息抽取数据上进行了二次预训练,从而让模型适配固定prompt。 + +- 中文文本数据集实验效果 + +我们在互联网、医疗、金融三大垂类文本自建测试集上进行了实验: + + +
+| 模型 | 金融 0-shot | 金融 5-shot | 医疗 0-shot | 医疗 5-shot | 互联网 0-shot | 互联网 5-shot |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |
+| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |
+| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |
+| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |
+| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |
+| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |
+| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |
+| 🧾 🎓uie-x-base (12L768H) | 48.84 | 73.87 | 65.60 | 88.81 | 79.36 | 81.65 |
+ +0-shot表示无训练数据直接通过```paddlenlp.Taskflow```进行预测,5-shot表示每个类别包含5条标注数据进行模型微调。**实验表明UIE在垂类场景可以通过少量数据(few-shot)进一步提升效果**。 + +- 多模态数据集实验效果 + +我们在通用、金融、医疗三大场景自建多模态测试集上对UIE-X的零样本效果进行了实验: + + +
+| 模型 | 通用 | 金融 | 医疗 |
+| :---: | :---: | :---: | :---: |
+| 🧾 🎓uie-x-base (12L768H) | 65.03 | 73.51 | 84.24 |
+ +通用测试集包含了不同领域的复杂样本,抽取难度最大。 + + + +### 2.3 产业级全流程方案 + +**调研阶段** + +- 该阶段目标需求开放且缺少数据积累。我们提供Taskflow三行代码极简调用的方式,无需标注数据即可在业务场景上快速验证效果。 + - [文本抽取 Taskflow使用指南](./taskflow_text.md) + - [文档抽取 Taskflow使用指南](./taskflow_doc.md) + +**数据准备阶段** + +- 我们推荐在实际的业务场景中定制自己的信息抽取模型。我们提供了不同抽取场景的Label Studio标注解决方案,可基于该方案实现从数据标注到训练数据构造的无缝衔接,大大降低了数据标注、模型定制的时间成本。 + - [文本抽取标注指南](./label_studio_text.md) + - [文档抽取标注指南](./label_studio_doc.md)。 + +**模型微调及封闭域蒸馏** + +- 基于UIE优秀的小样本微调能力,实现低成本模型定制适配。同时提供封闭域蒸馏的加速方案,解决抽取速度慢的问题。 + - [文本信息抽取全流程示例](./text/README.md) + - [文档信息抽取全流程示例](./document/README.md) + +**模型部署** + +- 提供HTTP部署方案,快速实现定制模型的部署上线。 + - [文本抽取HTTP部署指南](./text/deploy/simple_serving/README.md) + - [文档抽取HTTP部署指南](./document/deploy/simple_serving/README.md) + + + +### 2.4 效果展示 + +- 🧾 通过[Huggingface网页](https://huggingface.co/spaces/PaddlePaddle/UIE-X)体验UIE-X功能: + +
+ +
+ +- UIE-X端到端文档抽取产业应用示例 + + - 报关单 + +
+ +
+ + - Delivery Note(需微调) + +
+ +
+ + - 增值税发票(需微调) + +
+ +
+ + - 表单(需微调) + +
+ +
+ + + +## 3. 快速开始 + + + +### 3.1 Taskflow开箱即用 + +- 通过Taskflow实现开箱即用 + 👉 [文本抽取 Taskflow使用指南](./taskflow_text.md) + 👉 [文档抽取 Taskflow使用指南](./taskflow_doc.md) + + + +### 3.2 文本信息抽取 + +- 快速开启文本信息抽取 👉 [文本信息抽取指南](./text/README.md) + + + +### 3.3 文档信息抽取 + +- 快速开启文档信息抽取 👉 [文档信息抽取指南](./document/README.md) diff --git a/applications/information_extraction/README_en.md b/applications/information_extraction/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..b26a57aa1dc04cc2a850c77d64f046ac277ebb80 --- /dev/null +++ b/applications/information_extraction/README_en.md @@ -0,0 +1,170 @@ +# Information Extraction Application + +**Table of contents** +- [1. Introduction](#1) +- [2. Features](#2) + - [2.1 Available Models](#21) + - [2.2 Performance](#22) + - [2.3 Full Development Lifecycle](#23) + - [2.4 Demo](#24) +- [3. Quick Start](#3) + - [3.1 Taskflow](#31) + - [3.2 Text Information Extraction](#32) + - [3.3 Document Information Extraction](#33) + + + +## 1. Introduction + +This Information Extraction (IE) guide introduces our open-source industry-grade solution that covers the most widely-used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models. + +Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction] (https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapated models specialized for different industry sectors. + +**Highlights:** + +- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages +- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs +- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment +- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Features + + + +### 2.1 Available Models + +Multiple model selection, satisfying accuracy and speed, and adapting to different information extraction scenarios. + +| Model Name | Usage Scenarios | Supporting Tasks | +| :----------------------------------------------------------: | :--------------------------------------------------------- | :--------------------------------------------------- | +| `uie-base`
`uie-medium`
`uie-mini`
`uie-micro`
`uie-nano` | For **plain text** The **extractive** model of the scene supports **Chinese** | Supports entity, relation, event, opinion extraction | +| `uie-base-en` | An **extractive** model for **plain text** scenarios, supports **English** | Supports entity, relation, event, opinion extraction | +| `uie-m-base`
`uie-m-large` | An **extractive** model for **plain text** scenarios, supporting **Chinese and English** | Supports entity, relation, event, opinion extraction | +| `uie-x-base` | An **extractive** model for **plain text** and **document** scenarios, supports **Chinese and English** | Supports entity, relation, event, opinion extraction on both plain text and documents/pictures/tables | + + + + +### 2.2 Performance + +The UIE model series uses the ERNIE 3.0 lightweight models as the pre-trained language models and was finetuned on a large amount of information extraction data so that the model can be adapted to a fixed prompt. + +- Experimental results on Chinese dataset + +We conducted experiments on the in-house test sets of the three different domains of Internet, medical care, and finance: + + +
+| Model | Finance 0-shot | Finance 5-shot | Healthcare 0-shot | Healthcare 5-shot | Internet 0-shot | Internet 5-shot |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |
+| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |
+| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |
+| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |
+| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |
+| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |
+| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |
+| 🧾🎓uie-x-base (12L768H) | 48.84 | 73.87 | 65.60 | 88.81 | 79.36 | 81.65 |
+ +0-shot means that no training data is directly used for prediction through ```paddlenlp.Taskflow```, and 5-shot means that each category contains 5 pieces of labeled data for model fine-tuning. **Experiments show that UIE can further improve the performance with a small amount of data (few-shot)**. + +- Experimental results on multimodal datasets + +We experimented on the zero-shot performance of UIE-X on the in-house multi-modal test sets in three different domains of general, financial, and medical: + + +
+| Model | General | Financial | Medical |
+| :---: | :---: | :---: | :---: |
+| 🧾🎓uie-x-base (12L768H) | 65.03 | 73.51 | 84.24 |
+ +The general test set contains complex samples from different fields and is the most difficult task. + + + +### 2.3 Full Development Lifecycle + +**Research stage** + +- At this stage, the target requirements are open and there is no labeled data. We provide a simple way of using Taskflow out-of-the-box with three lines of code, which allows you to build POC without any labeled data. + - [Text Extraction Taskflow User Guide](./taskflow_text_en.md) + - [Document Extraction Taskflow User Guide](./taskflow_doc_en.md) + +**Data preparation stage** + +- We recommend finetuning your own information extraction model for your use case. We provide Label Studio labeling solutions for different extraction scenarios. Based on this solution, the seamless connection from data labeling to training data construction can be realized, which greatly reduces the time cost of data labeling and model customization. + - [Text Extraction Labeling Guide](./label_studio_text_en.md) + - [Document Extraction and Labeling Guide](./label_studio_doc_en.md). + +**Model fine-tuning and closed domain distillation** + +- Based on UIE's few-shot capabilities, it realizes low-cost model customization and adaptation. At the same time, it provides an acceleration solution for closed domain distillation to solve the problem of slow extraction speed. + - [Example of the whole process of text information extraction](./text/README_en.md) + - [Example of document information extraction process](./document/README_en.md) + +**Model Deployment** + +- Provide an HTTP deployment solution to quickly implement the deployment and launch of customized models. + - [Text Extract HTTP Deployment Guide](./text/deploy/simple_serving/README_en.md) + - [Document Extract HTTP Deployment Guide](./document/deploy/simple_serving/README_en.md) + + + +### 2.4 Demo + +- 🧾Try our UIE-X demo on [🤗 HuggingFace Space](https://huggingface.co/spaces/PaddlePaddle/UIE-X): + +
+ +
+ +- UIE-X end-to-end document extraction industry application example + + - Customs declaration + +
+ +
+ + - Delivery Note (Need fine-tuning) + +
+ +
+ + - VAT invoice (need fine-tuning) + +
+ +
+ + - Form (need fine-tuning) + +
+ +
+ + + +## 3. Quick Start + + + +### 3.1 Taskflow + +- Out of the box with Taskflow + 👉 [Text Extraction Taskflow User Guide](./taskflow_text_en.md) + 👉 [Document Extraction Taskflow User Guide](./taskflow_doc_en.md) + + + +### 3.2 Text Information Extraction + +- Quickly start text information extraction 👉 [Text Information Extraction Guide](./text/README_en.md) + + + +### 3.3 Document Information Extraction + +- Quickly open document information extraction 👉 [Document Information Extraction Guide](./document/README_en.md) diff --git a/applications/information_extraction/document/README.md b/applications/information_extraction/document/README.md new file mode 100644 index 0000000000000000000000000000000000000000..30f4670e954ff390b18898264663b646e9140762 --- /dev/null +++ b/applications/information_extraction/document/README.md @@ -0,0 +1,301 @@ +简体中文 | [English](README_en.md) + +# 文档信息抽取 + +**目录** +- [1. 文档信息抽取应用](#1) +- [2. 快速开始](#2) + - [2.1 代码结构](#代码结构) + - [2.2 数据标注](#数据标注) + - [2.3 模型微调](#模型微调) + - [2.4 模型评估](#模型评估) + - [2.5 定制模型一键预测](#定制模型一键预测) + - [2.6 实验指标](#实验指标) + + + +## 1. 文档信息抽取应用 + +本项目提供基于UIE微调的文档抽取端到端应用方案,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现文档信息抽取产品落地。 + +信息抽取通俗地说就是从给定的文本/图片等输入数据中抽取出结构化信息的过程。在信息抽取的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对信息抽取领域的难点和痛点,PaddleNLP信息抽取应用UIE统一建模的思想,提供了文档信息抽取产业级应用方案,支持**文档/图片/表格和纯文本场景下实体、关系、事件、观点等不同任务信息抽取**。该应用**不限定行业领域和抽取目标**,可实现从产品原型研发、业务POC阶段到业务落地、迭代阶段的无缝衔接,助力开发者实现特定领域抽取场景的快速适配与落地。 + +**文档信息抽取应用亮点:** + +- **覆盖场景全面🎓:** 覆盖文档信息抽取各类主流任务,支持多语言,满足开发者多样信息抽取落地需求。 +- **效果领先🏃:** 以在多模态信息抽取上有突出效果的模型UIE-X作为训练基座,具有广泛成熟的实践应用性。 +- **简单易用⚡:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启信息抽取训练,轻松完成部署上线,降低信息抽取技术落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 快速开始 + +对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用定制功能(标注少量数据进行模型微调)以进一步提升效果。 + + + +### 2.1 代码结构 + +```shell +. 
+├── deploy # 部署目录 +│ └── simple_serving # 基于PaddleNLP SimpleServing 服务化部署 +├── utils.py # 数据处理工具 +├── finetune.py # 模型微调、压缩脚本 +├── evaluate.py # 模型评估脚本 +└── README.md +``` + + + +### 2.2 数据标注 +我们推荐使用 [Label Studio](https://labelstud.io/) 进行文档信息抽取数据标注,本项目打通了从数据标注到训练的通道,也即Label Studio导出数据可以通过 [label_studio.py](../label_studio.py) 脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考 [Label Studio数据标注指南](../label_studio_doc.md)。 + +这里我们提供预先标注好的`增值税发票数据集`的文件,可以运行下面的命令行下载数据集,我们将展示如何使用数据转化脚本生成训练/验证/测试集文件,并使用UIE-X模型进行微调。 + +下载增值税发票数据集: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/tax.tar.gz +tar -zxvf tax.tar.gz +mv tax data +rm tax.tar.gz +``` + +生成训练/验证集文件: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0 \ + --task_type ext +``` + +生成训练/验证集文件,可以使用PP-Structure的布局分析优化OCR结果的排序: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0\ + --task_type ext \ + --layout_analysis True +``` + +更多不同类型任务(含实体抽取、关系抽取、文档分类等)的标注规则及参数说明,请参考[Label Studio数据标注指南](../label_studio_doc.md)。 + + + +### 2.3 模型微调 + +推荐使用 [Trainer API ](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md) 对模型进行微调。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `uie-x-base` 作为预训练模型进行模型微调,将微调后的模型保存至`./checkpoint/model_best`: + +单卡启动: + +```shell +python finetune.py \ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +如果在GPU环境中使用,可以指定gpus参数进行多卡训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0" finetune.py \ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +该示例代码中由于设置了参数 `--do_eval`,因此在训练完会自动进行评估。 + +可配置参数说明: +* `device`: 训练设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 训练。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认10。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `eval_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `seed`:全局随机种子,默认为 42。 +* `model_name_or_path`:进行 few shot 训练使用的预训练模型。默认为 "uie-x-base"。 +* `output_dir`:必须,模型训练或压缩后保存的模型目录;默认为 `None` 。 +* `train_path`:训练集路径;默认为 `None` 。 +* `dev_path`:开发集路径;默认为 `None` 。 +* `max_seq_len`:文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* `per_device_train_batch_size`:用于训练的每个 GPU 核心/NPU 核心/CPU 的batch大小,默认为8。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/NPU 核心/CPU 的batch大小,默认为8。 +* 
`num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `learning_rate`:训练最大学习率,UIE-X 推荐设置为 1e-5;默认值为3e-5。 +* `label_names`:训练数据标签label的名称,UIE-X 设置为'start_positions' 'end_positions';默认值为None。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练,默认不设置。 +* `do_eval`:是否进行评估,设置该参数表示进行评估,默认不设置。 +* `do_export`:是否进行导出,设置该参数表示进行静态图导出,默认不设置。 +* `export_model_dir`:静态图导出地址,默认为None。 +* `overwrite_output_dir`: 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点目录,则使用它继续训练。 +* `disable_tqdm`: 是否使用tqdm进度条。 +* `metric_for_best_model`:最优模型指标,UIE-X 推荐设置为 `eval_f1`,默认为None。 +* `load_best_model_at_end`:训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +* `save_total_limit`:如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints `输出目录`,默认为None。 + + + +### 2.4 模型评估 + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions'\ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 +``` +评估方式说明:采用单阶段评价的方式,即关系抽取、事件抽取等需要分阶段预测的任务对每一阶段的预测结果进行分别评价。验证/测试集默认会利用同一层级的所有标签来构造出全部负例。 + +可开启`debug`模式对每个正例类别分别进行评估,该模式仅用于模型调试: + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions'\ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 \ + --debug True +``` + +输出结果: +```text +[2022-11-14 09:41:18,424] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:18,424] [ INFO] - Num examples = 160 +[2022-11-14 09:41:18,424] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total prediction steps = 40 +[2022-11-14 09:41:26,451] [ INFO] - -----Evaluate model------- +[2022-11-14 09:41:26,451] [ INFO] - Class Name: ALL CLASSES +[2022-11-14 09:41:26,451] [ INFO] - Evaluation Precision: 0.94521 | Recall: 0.88462 | F1: 0.91391 +[2022-11-14 09:41:26,451] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,452] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,452] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,452] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,692] [ INFO] - Class Name: 开票日期 +[2022-11-14 09:41:26,692] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +[2022-11-14 09:41:26,692] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,693] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,693] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,693] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,952] [ INFO] - Class Name: 名称 +[2022-11-14 09:41:26,952] [ INFO] - Evaluation Precision: 0.87500 | Recall: 0.87500 | F1: 0.87500 +[2022-11-14 09:41:26,952] [ INFO] - ----------------------------- +... 
+``` + +可配置参数: +* `device`: 评估设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 评估。 +* `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +* `test_path`: 进行评估的测试集文件。 +* `label_names`:训练数据标签label的名称,UIE-X 设置为'start_positions' 'end_positions';默认值为None。 +* `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +* `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/NPU 核心/CPU 的batch大小,默认为8。 +* `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +* `schema_lang`: 选择schema的语言,可选有`ch`和`en`。默认为`ch`,英文数据集请选择`en`。 + + + +### 2.5 定制模型一键预测 + +`paddlenlp.Taskflow`装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件`model_state.pdparams`。 + +```python +from pprint import pprint +from paddlenlp import Taskflow +from paddlenlp.utils.doc_parser import DocParser + +schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额'] +my_ie = Taskflow("information_extraction", model="uie-x-base", schema=schema, task_path='./checkpoint/model_best', precision='fp16') +``` + +我们可以根据设置的`schema`,对指定的`doc_path`文档进行信息抽取并进行可视化: + +```python +doc_path = "./data/images/b199.jpg" +results = my_ie({"doc": doc_path}) +pprint(results) + +# 结果可视化 +DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="./image_show.png") +``` + +
+ +
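+
+如需在预测过程中更换抽取目标,不必重新构建 Taskflow,可调用 `set_schema` 重设 schema 后再次预测。下面是一个示意用法(沿用上文示例中的 `my_ie` 和 `doc_path`,schema 仅保留部分字段作为示例):
+
+```python
+# 仅抽取部分字段(示例 schema,可按需替换)
+my_ie.set_schema(["开票日期", "价税合计"])
+pprint(my_ie({"doc": doc_path}))
+```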
+ + + +### 2.6 实验指标 + +我们在自标注的增值税数据集上进行实验: + + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot| 0.44898 | 0.56410 | 0.50000 | + | 5-shot| 0.9000 | 0.9231 | 0.9114 | + | 10-shot| 0.9125 | 0.93590 | 0.9241 | + | 20-shot| 0.9737 | 0.9487 | 0.9610 | + | 30-shot| 0.9744 | 0.9744 | 0.9744 | + | 30-shot+PP-Structure| 1.0 | 0.9625 | 0.9809 | + + +n-shot表示训练集包含n张标注图片数据进行模型微调,实验表明UIE-X可以通过少量数据(few-shot)和PP-Structure的布局分析进一步提升结果。 diff --git a/applications/information_extraction/document/README_en.md b/applications/information_extraction/document/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..acef107e5e07f7134adc4cf32a823a3688340891 --- /dev/null +++ b/applications/information_extraction/document/README_en.md @@ -0,0 +1,297 @@ +# document information extraction + +**Table of contents** +- [1. Introduction](#1) +- [2. Quick Start](#2) + - [2.1 Code Structure](#21) + - [2.2 Data Annotation](#22) + - [2.3 Finetuning](#23) + - [2.4 Evaluation](#24) + - [2.5 Inference](#25) + - [2.6 Experiments](#26) + + + +## 1. Introduction + +This Information Extraction (IE) guide introduces our open-source industry-grade solution that covers the most widely-used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models. + +Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapted models specialized for different industry sectors. + +**Highlights:** + +- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages +- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs +- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment +- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Quick Start + +For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot capability. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance. + + + +### 2.1 Code Structure + +```shell +. 
+├── utils.py # data processing tools +├── finetune.py # model fine-tuning, compression script +├── evaluate.py # model evaluation script +└── README.md +``` + + + +### 2.2 Data Annotation + +We recommend using [Label Studio](https://labelstud.io/) for data labeling. We provide an end-to-end pipeline for the labeling -> training process. You can export the labeled data in Label Studio through [label_studio.py](../label_studio.py) script to export and convert the data into the required input form for the model. For a detailed introduction to labeling methods, please refer to [Label Studio Data Labeling Guide](../label_studio_doc_en.md). + +Here we provide the pre-labeled example dataset `VAT invoice dataset`, which you can download by running the following command. We will demonstrate how to use the data conversion script to generate training/validation/test set files for finetuning. + +Download the VAT invoice dataset: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/tax.tar.gz +tar -zxvf tax.tar.gz +mv tax data +rm tax.tar.gz +``` + +Generate training/validation data files: + +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0 \ + --task_type ext +``` + +Generate training/validation set files, you can use PP-Structure's layout analysis to optimize the sorting of OCR results: + +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0\ + --task_type ext\ + --layout_analysis True +``` + +For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_doc_en.md). + + + +### 2.3 Finetuning + +Use the following command to fine-tune the model using `uie-x-base` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best`: + +Single GPU: + +```shell +python finetune.py\ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best\ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best\ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Multiple GPUs: + +```shell +python -u -m paddle.distributed.launch --gpus "0" finetune.py \ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best\ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best\ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Since the parameter `--do_eval` is set in the sample code, it will be automatically evaluated after training. 
+ +Parameters: + +* `device`: Training device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU training. +* `logging_steps`: The interval steps of log printing during training, the default is 10. +* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `eval_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `seed`: global random seed, default is 42. +* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "uie-x-base". +* `output_dir`: required, the model directory saved after model training or compression; the default is `None`. +* `train_path`: training set path; defaults to `None`. +* `dev_path`: Development set path; defaults to `None`. +* `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_train_batch_size`: The batch size of each GPU core/NPU core/CPU used for training, the default is 8. +* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation, default is 8. +* `num_train_epochs`: Training rounds, 100 can be selected when using early stopping method; the default is 10. +* `learning_rate`: The maximum learning rate for training, UIE-X recommends setting it to 1e-5; the default value is 3e-5. +* `label_names`: the name of the training data label label, UIE-X is set to 'start_positions' 'end_positions'; the default value is None. +* `do_train`: Whether to perform fine-tuning training, setting this parameter means to perform fine-tuning training, and it is not set by default. +* `do_eval`: Whether to evaluate, setting this parameter means to evaluate, the default is not set. +* `do_export`: Whether to export, setting this parameter means to export static images, and it is not set by default. +* `export_model_dir`: Static map export address, the default is None. +* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training. +* `disable_tqdm`: Whether to use tqdm progress bar. +* `metric_for_best_model`: Optimal model metric, UIE-X recommends setting it to `eval_f1`, the default is None. +* `load_best_model_at_end`: Whether to load the best model after training, usually used in conjunction with `metric_for_best_model`, the default is False. +* `save_total_limit`: If this parameter is set, the total number of checkpoints will be limited. Remove old checkpoints `output directory`, defaults to None. + + + +### 2.4 Evaluation + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions'\ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 +``` +We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples. +The `debug` mode can be turned on to evaluate each positive category separately. 
This mode is only used for model debugging: + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions' \ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 \ + --debug True +``` + +Output result: + +```text +[2022-11-14 09:41:18,424] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:18,424] [ INFO] - Num examples = 160 +[2022-11-14 09:41:18,424] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total prediction steps = 40 +[2022-11-14 09:41:26,451] [ INFO] - -----Evaluate model------- +[2022-11-14 09:41:26,451] [ INFO] - Class Name: ALL CLASSES +[2022-11-14 09:41:26,451] [ INFO] - Evaluation Precision: 0.94521 | Recall: 0.88462 | F1: 0.91391 +[2022-11-14 09:41:26,451] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,452] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,452] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,452] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,692] [ INFO] - Class Name: 开票日期 +[2022-11-14 09:41:26,692] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +[2022-11-14 09:41:26,692] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,693] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,693] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,693] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,952] [ INFO] - Class Name: 名称 +[2022-11-14 09:41:26,952] [ INFO] - Evaluation Precision: 0.87500 | Recall: 0.87500 | F1: 0.87500 +[2022-11-14 09:41:26,952] [ INFO] - ----------------------------- +... +``` + +Parameters: + +* `device`: Evaluation device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU evaluation. +* `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`. +* `test_path`: The test set file for evaluation. +* `label_names`: the name of the training data label, UIE-X is set to 'start_positions' 'end_positions'; the default value is None. +* `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +* `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation, default is 8. +* `debug`: Whether to enable the debug mode to evaluate each positive category separately. This mode is only used for model debugging and is disabled by default. +* `schema_lang`: Select the language of the schema, optional `ch` and `en`. The default is `ch`, please select `en` for the English dataset. 
+ + + +### 2.5 Inference + +Same with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path` + +```python +from pprint import pprint +from paddlenlp import Taskflow +from paddlenlp.utils.doc_parser import DocParser + +schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额'] +my_ie = Taskflow("information_extraction", model="uie-x-base", schema=schema, task_path='./checkpoint/model_best', precision='fp16') +``` + +We specify the extraction targets by setting `schema` and visualize the information of the specified `doc_path` document: + +```python +doc_path = "./data/images/b199.jpg" +results = my_ie({"doc": doc_path}) +pprint(results) + +# Result visualization +DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="./image_show.png") +``` + +
+ +
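+
+If you need to change the extraction targets later, there is no need to rebuild the Taskflow instance: you can reset the schema with `set_schema` and run prediction again. Below is a minimal sketch that reuses `my_ie` and `doc_path` from the example above and keeps only part of the schema for illustration:
+
+```python
+# Extract only a subset of the fields (illustrative schema, replace as needed)
+my_ie.set_schema(["开票日期", "价税合计"])
+pprint(my_ie({"doc": doc_path}))
+```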
+ + + +### 2.6 Experiments + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot| 0.44898 | 0.56410 | 0.50000 | + | 5-shot| 0.9000 | 0.9231 | 0.9114 | + | 10-shot| 0.9125 | 0.93590 | 0.9241 | + | 20-shot| 0.9737 | 0.9487 | 0.9610 | + | 30-shot| 0.9744 | 0.9744 | 0.9744 | + | 30-shot+PP-Structure| 1.0 | 0.9625 | 0.9809 | + + +n-shot means that the training set contains n labeled image data for model fine-tuning. Experiments show that UIE-X can further improve the results through a small amount of data (few-shot) and PP-Structure layout analysis. diff --git a/applications/information_extraction/document/deploy/simple_serving/README.md b/applications/information_extraction/document/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e98fc150228a7f787c044b1f9acb8255aaccb844 --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/README.md @@ -0,0 +1,57 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server服务启动](#Server服务启动) +- [Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 +#### schema替换 +```python +# Default schema +schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额'] +``` + +#### 设置模型路径 +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service.register_taskflow('uie', [uie1, uie2]) +``` + +### Client 自定义参数 + +```python +# Changed to image paths you wanted +image_paths = ['../../data/images/b1.jpg'] +``` diff --git a/applications/information_extraction/document/deploy/simple_serving/README_en.md b/applications/information_extraction/document/deploy/simple_serving/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..71c54c406dde07d43801e5d96acff8d5a4264b7a --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/README_en.md @@ -0,0 +1,65 @@ +# Service deployment based on PaddleNLP SimpleServing + +## Table of contents +- [Environment Preparation](#1) +- [Server](#2) +- [Client](#3) +- [Service Custom Parameters](#4) + + + +## Environment Preparation +Use the PaddleNLP version with SimpleServing function (or the latest develop version) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + + +## Server + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + + + +## Client + +```bash +python client.py +``` + + + +## Service custom parameters + +### Server Custom Parameters + +#### schema replacement +```python +# Default schema +schema = ['Billing Date', 'Name', 'Taxpayer Identification Number', 'Account Bank and Account Number', 'Amount', 'Total Price and Tax', 'No', 'Tax Rate', 'Address, Phone', 'tax'] +``` + +#### Set model path +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', 
schema=schema) +``` + +#### Doka Service Prediction +PaddleNLP SimpleServing supports multi-card load balancing prediction, mainly during service registration, just register two Taskflow tasks, the following is the sample code +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service. register_taskflow('uie', [uie1, uie2]) +``` + +### Client Custom Parameters + +```python +# Changed to image paths you wanted +image_paths = ['../../data/images/b1.jpg'] +``` diff --git a/applications/information_extraction/document/deploy/simple_serving/client.py b/applications/information_extraction/document/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..dcd2c67f94108b47fe0c870b7f7c71e29bf3544f --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/client.py @@ -0,0 +1,42 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +from paddlenlp.utils.doc_parser import DocParser + +# Define the document parser +doc_parser = DocParser() + +image_paths = ["../../data/images/b1.jpg"] +image_base64_docs = [] + +# Get the image base64 to post +for image_path in image_paths: + req_dict = {} + doc = doc_parser.parse({"doc": image_path}, do_ocr=False) + base64 = doc["image"] + req_dict["doc"] = base64 + image_base64_docs.append(req_dict) + +url = "http://0.0.0.0:8189/taskflow/uie" +headers = {"Content-Type": "application/json"} +data = {"data": {"text": image_base64_docs}} + +# Post the requests +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/information_extraction/document/deploy/simple_serving/server.py b/applications/information_extraction/document/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..2a6ab1fb7d2d1d2e4190665150e5c0ba07730c49 --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/server.py @@ -0,0 +1,27 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = ["开票日期", "名称", "纳税人识别号", "开户行及账号", "金额", "价税合计", "No", "税率", "地址、电话", "税额"] +# The task path changed to your best model path +uie = Taskflow( + "information_extraction", + schema=schema, + task_path="../../checkpoint/model_best", +) +# If you want to define the finetuned uie service +app = SimpleServer() +app.register_taskflow("taskflow/uie", uie) diff --git a/applications/information_extraction/document/evaluate.py b/applications/information_extraction/document/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ca3cdeda1da3192f0f5ab0e50936eab965d3f1f1 --- /dev/null +++ b/applications/information_extraction/document/evaluate.py @@ -0,0 +1,149 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import paddle +from utils import convert_example, reader + +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments +from paddlenlp.transformers import UIEX, AutoTokenizer +from paddlenlp.utils.ie_utils import ( + compute_metrics, + get_relation_type_dict, + uie_loss_func, + unify_prompt_name, +) +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for evaluation. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + test_path: str = field(default=None, metadata={"help": "The path of test set."}) + + schema_lang: str = field( + default="ch", metadata={"help": "Select the language type for schema, such as 'ch', 'en'"} + ) + + max_seq_len: Optional[int] = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + debug: bool = field( + default=False, + metadata={"help": "Whether choose debug mode."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+ """ + + model_path: Optional[str] = field( + default=None, metadata={"help": "The path of saved model that you want to load."} + ) + + +def do_eval(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + tokenizer = AutoTokenizer.from_pretrained(model_args.model_path) + model = UIEX.from_pretrained(model_args.model_path) + + test_ds = load_dataset(reader, data_path=data_args.test_path, max_seq_len=data_args.max_seq_len, lazy=False) + trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=data_args.max_seq_len) + if data_args.debug: + class_dict = {} + relation_data = [] + + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if len(data["result_list"]) != 0: + p = "的" if data_args.schema_lang == "ch" else " of " + if p not in data["prompt"]: + class_dict.setdefault(class_name, []).append(data) + else: + relation_data.append((data["prompt"], data)) + + relation_type_dict = get_relation_type_dict(relation_data, schema_lang=data_args.schema_lang) + test_ds = test_ds.map(trans_fn) + + trainer = Trainer( + model=model, + criterion=uie_loss_func, + args=training_args, + eval_dataset=test_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + eval_metrics = trainer.evaluate() + logger.info("-----Evaluate model-------") + logger.info("Class Name: ALL CLASSES") + logger.info( + "Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" + % (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"]) + ) + logger.info("-----------------------------") + if data_args.debug: + for key in class_dict.keys(): + test_ds = MapDataset(class_dict[key]) + test_ds = test_ds.map(trans_fn) + eval_metrics = trainer.evaluate(eval_dataset=test_ds) + + logger.info("Class Name: %s" % key) + logger.info( + "Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" + % (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"]) + ) + logger.info("-----------------------------") + for key in relation_type_dict.keys(): + test_ds = MapDataset(relation_type_dict[key]) + test_ds = test_ds.map(trans_fn) + eval_metrics = trainer.evaluate(eval_dataset=test_ds) + logger.info("-----------------------------") + if data_args.schema_lang == "ch": + logger.info("Class Name: X的%s" % key) + else: + logger.info("Class Name: %s of X" % key) + logger.info( + "Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" + % (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"]) + ) + + +if __name__ == "__main__": + do_eval() diff --git a/applications/information_extraction/document/finetune.py b/applications/information_extraction/document/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..822fd4e2a788adafe38c08fef5ca0f90fc3b046a --- /dev/null +++ b/applications/information_extraction/document/finetune.py @@ -0,0 +1,177 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from dataclasses import dataclass, field +from functools import partial +from typing import List, Optional + +import paddle +from utils import convert_example, reader + +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + CompressionArguments, + PdArgumentParser, + Trainer, + get_last_checkpoint, +) +from paddlenlp.transformers import UIEX, AutoTokenizer, export_model +from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + train_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + dev_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + max_seq_len: Optional[int] = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + dynamic_max_length: Optional[List[int]] = field( + default=None, + metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: Optional[str] = field(default="uie-x-base", metadata={"help": "Path to pretrained model"}) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.label_names = ["start_positions", "end_positions"] + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Define model and tokenizer + model = UIEX.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Load and preprocess dataset + train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_len, lazy=False) + dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_len, lazy=False) + trans_fn = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=data_args.max_seq_len, + dynamic_max_length=data_args.dynamic_max_length, + ) + train_ds = train_ds.map(trans_fn) + dev_ds = dev_ds.map(trans_fn) + + trainer = Trainer( + model=model, + criterion=uie_loss_func, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + trainer.optimizer = paddle.optimizer.AdamW( + learning_rate=training_args.learning_rate, parameters=model.parameters() + ) + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="position_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="attention_mask"), + paddle.static.InputSpec(shape=[None, None, 4], dtype="int64", name="bbox"), + paddle.static.InputSpec(shape=[None, 3, 224, 224], dtype="int64", name="image"), + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/applications/information_extraction/document/utils.py b/applications/information_extraction/document/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..69cfbeefdd69d08830a7a3312adebf75c8ae733f --- /dev/null +++ b/applications/information_extraction/document/utils.py @@ -0,0 +1,374 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import base64 +import json +from typing import List, Optional + +import numpy as np + +from paddlenlp.utils.ie_utils import map_offset, pad_image_data +from paddlenlp.utils.log import logger + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + boxes = json_line.get("bbox", None) + image = json_line.get("image", None) + # Model Input is aslike: [CLS] prompt [SEP] [SEP] text [SEP] for UIE-X + if boxes is not None and image is not None: + summary_token_num = 4 + else: + summary_token_num = 3 + if max_seq_len <= len(prompt) + summary_token_num: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - summary_token_num + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + for result in result_list: + if result["end"] - result["start"] > max_content_len: + logger.warning( + "result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned" + ) + if ( + result["start"] + 1 <= max_content_len < result["end"] + and result["end"] - result["start"] <= max_content_len + ): + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + if boxes is not None and image is not None: + cur_boxes = boxes[:max_content_len] + res_boxes = boxes[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + if boxes is not None and image is not None: + json_line = { + "content": cur_content, + "result_list": cur_result_list, + "prompt": prompt, + "bbox": cur_boxes, + "image": image, + } + else: + json_line = { + "content": cur_content, + "result_list": cur_result_list, + "prompt": prompt, + } + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - summary_token_num + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + if boxes is not None and image is not None: + json_line = { + "content": res_content, + "result_list": result_list, + "prompt": prompt, + "bbox": res_boxes, + "image": image, + } + else: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + + json_lines.append(json_line) + break + else: + content = res_content + boxes = res_boxes + + for json_line in json_lines: + yield json_line + + +def get_dynamic_max_len(examples, default_max_len: int, dynamic_max_length: List[int]) -> int: + """get max_length 
by examples which you can change it by examples in batch""" + cur_length = len(examples[0]["input_ids"]) + max_length = default_max_len + for max_length_option in sorted(dynamic_max_length): + if cur_length <= max_length_option: + max_length = max_length_option + break + return max_length + + +def convert_example( + example, + tokenizer, + max_seq_len, + pad_id=1, + c_sep_id=2, + summary_token_num=4, + dynamic_max_length: Optional[List[int]] = None, +): + + content = example["content"] + prompt = example["prompt"] + bbox_lines = example.get("bbox", None) + image_buff_string = example.get("image", None) + # Text + if bbox_lines is None or image_buff_string is None: + if dynamic_max_length is not None: + temp_encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + max_length = get_dynamic_max_len( + examples=temp_encoded_inputs, default_max_len=max_seq_len, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_length, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + max_seq_len = max_length + else: + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_offsets_mapping=True, + return_dict=False, + ) + + encoded_inputs = encoded_inputs[0] + + inputs_ids = encoded_inputs["input_ids"] + position_ids = encoded_inputs["position_ids"] + attention_mask = encoded_inputs["attention_mask"] + + q_sep_index = inputs_ids.index(2, 1) + c_sep_index = attention_mask.index(0) + + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + + bias = 0 + for index in range(len(offset_mapping)): + if index == 0: + continue + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + # bias = index + bias = offset_mapping[index - 1][-1] + 1 + + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + + offset_bias = bias + + bbox_list = [[0, 0, 0, 0] for x in range(len(inputs_ids))] + token_type_ids = [ + 1 if token_index <= q_sep_index or token_index > c_sep_index else 0 for token_index in range(max_seq_len) + ] + padded_image = np.zeros([3, 224, 224]) + + # Doc + else: + inputs_ids = [] + prev_bbox = [-1, -1, -1, -1] + this_text_line = "" + q_sep_index = -1 + offset_mapping = [] + last_offset = 0 + for char_index, (char, bbox) in enumerate(zip(content, bbox_lines)): + if char_index == 0: + prev_bbox = bbox + this_text_line = char + continue + + if all([bbox[x] == prev_bbox[x] for x in range(4)]): + this_text_line += char + else: + offset_mapping, last_offset, q_sep_index, inputs_ids = _encode_doc( + tokenizer, + offset_mapping, + last_offset, + prompt, + this_text_line, + inputs_ids, + q_sep_index, + max_seq_len, + ) + this_text_line = char + prev_bbox = bbox + + if len(this_text_line) > 0: + offset_mapping, last_offset, q_sep_index, inputs_ids = _encode_doc( + tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len + ) + + if len(inputs_ids) > max_seq_len: + 
inputs_ids = inputs_ids[: (max_seq_len - 1)] + [c_sep_id] + offset_mapping = offset_mapping[: (max_seq_len - 1)] + [[0, 0]] + else: + inputs_ids += [c_sep_id] + offset_mapping += [[0, 0]] + + offset_bias = offset_mapping[q_sep_index - 1][-1] + 1 + + seq_len = len(inputs_ids) + inputs_ids += [pad_id] * (max_seq_len - seq_len) + token_type_ids = [1] * (q_sep_index + 1) + [0] * (seq_len - q_sep_index - 1) + token_type_ids += [pad_id] * (max_seq_len - seq_len) + + bbox_list = _process_bbox(inputs_ids, bbox_lines, offset_mapping, offset_bias) + + offset_mapping += [[0, 0]] * (max_seq_len - seq_len) + + position_ids = list(range(seq_len)) + + position_ids = position_ids + [0] * (max_seq_len - seq_len) + attention_mask = [1] * seq_len + [0] * (max_seq_len - seq_len) + + image_data = base64.b64decode(image_buff_string.encode("utf8")) + padded_image = pad_image_data(image_data) + + start_ids = np.array([0.0 for x in range(max_seq_len)], dtype="int64") + end_ids = np.array([0.0 for x in range(max_seq_len)], dtype="int64") + + for item in example["result_list"]: + start = map_offset(item["start"] + offset_bias, offset_mapping) + end = map_offset(item["end"] - 1 + offset_bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + assert len(inputs_ids) == max_seq_len + assert len(token_type_ids) == max_seq_len + assert len(position_ids) == max_seq_len + assert len(attention_mask) == max_seq_len + assert len(bbox_list) == max_seq_len + tokenized_output = { + "input_ids": inputs_ids, + "token_type_ids": token_type_ids, + "position_ids": position_ids, + "attention_mask": attention_mask, + "bbox": bbox_list, + "image": padded_image, + "start_positions": start_ids, + "end_positions": end_ids, + } + return tokenized_output + + +def _process_bbox(tokens, bbox_lines, offset_mapping, offset_bias): + bbox_list = [[0, 0, 0, 0] for x in range(len(tokens))] + + for index, bbox in enumerate(bbox_lines): + index_token = map_offset(index + offset_bias, offset_mapping) + if 0 <= index_token < len(bbox_list): + bbox_list[index_token] = bbox + return bbox_list + + +def _encode_doc(tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len): + if len(offset_mapping) == 0: + content_encoded_inputs = tokenizer( + text=[prompt], + text_pair=[this_text_line], + max_seq_len=max_seq_len, + return_dict=False, + return_offsets_mapping=True, + ) + content_encoded_inputs = content_encoded_inputs[0] + inputs_ids = content_encoded_inputs["input_ids"][:-1] + sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]] + q_sep_index = content_encoded_inputs["input_ids"].index(2, 1) + + bias = 0 + for i in range(len(sub_offset_mapping)): + if i == 0: + continue + mapping = sub_offset_mapping[i] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = sub_offset_mapping[i - 1][-1] + 1 + if mapping[0] == 0 and mapping[1] == 0: + continue + if mapping == sub_offset_mapping[i - 1]: + continue + sub_offset_mapping[i][0] += bias + sub_offset_mapping[i][1] += bias + + offset_mapping = sub_offset_mapping[:-1] + last_offset = offset_mapping[-1][-1] + else: + content_encoded_inputs = tokenizer( + text=this_text_line, max_seq_len=max_seq_len, return_dict=False, return_offsets_mapping=True + ) + inputs_ids += content_encoded_inputs["input_ids"][1:-1] + sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]] + + for i, sub_list in enumerate(sub_offset_mapping[1:-1]): + if i == 0: + org_offset = sub_list[1] + else: + if sub_list[0] != 
org_offset and sub_offset_mapping[1:-1][i - 1] != sub_list: + last_offset += 1 + org_offset = sub_list[1] + offset_mapping += [[last_offset, sub_list[1] - sub_list[0] + last_offset]] + last_offset = offset_mapping[-1][-1] + return offset_mapping, last_offset, q_sep_index, inputs_ids diff --git a/applications/information_extraction/label_studio.py b/applications/information_extraction/label_studio.py new file mode 100644 index 0000000000000000000000000000000000000000..0f0d815f7774d388d6d700f183280f800a2d654b --- /dev/null +++ b/applications/information_extraction/label_studio.py @@ -0,0 +1,139 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import DataConverter + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.label_studio_file): + raise ValueError("Please input the correct path of label studio file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.label_studio_file, "r", encoding="utf-8") as f: + raw_examples = json.loads(f.read()) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + index_list = indexes.tolist() + raw_examples = [raw_examples[i] for i in indexes] + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_ids = index_list[:p1] + dev_ids = index_list[p1:p2] + test_ids = index_list[p2:] + + with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp: + maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids} + fp.write(json.dumps(maps)) + + if raw_examples[0]["data"].get("image"): + anno_type = "image" + else: + anno_type = "text" + + data_converter = DataConverter( + args.label_studio_file, + negative_ratio=args.negative_ratio, + prompt_prefix=args.prompt_prefix, + options=args.options, + separator=args.separator, + layout_analysis=args.layout_analysis, + schema_lang=args.schema_lang, + ocr_lang=args.ocr_lang, + anno_type=anno_type, + ) + + if args.task_type == "ext": + train_examples = data_converter.convert_ext_examples(raw_examples[:p1]) + dev_examples = data_converter.convert_ext_examples(raw_examples[p1:p2], is_train=False) + test_examples = 
data_converter.convert_ext_examples(raw_examples[p2:], is_train=False) + else: + train_examples = data_converter.convert_cls_examples(raw_examples[:p1]) + dev_examples = data_converter.convert_cls_examples(raw_examples[p1:p2]) + test_examples = data_converter.convert_cls_examples(raw_examples[p2:]) + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + _save_examples(args.save_dir, "train.txt", train_examples) + _save_examples(args.save_dir, "dev.txt", dev_examples) + _save_examples(args.save_dir, "test.txt", test_examples) + + logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task, the ratio of positive and negative samples, number of negtive samples = negative_ratio * number of positive samples") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Select task type, ext for the extraction task and cls for the classification task, defaults to ext.") + parser.add_argument("--options", default=["正向", "负向"], type=str, nargs="+", help="Used only for the classification task, the options for classification") + parser.add_argument("--prompt_prefix", default="情感倾向", type=str, help="Used only for the classification task, the prompt prefix for classification") + parser.add_argument("--is_shuffle", default="True", type=strtobool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--layout_analysis", default=False, type=bool, help="Enable layout analysis to optimize the order of OCR result.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + parser.add_argument("--separator", type=str, default='##', help="Used only for entity/aspect-level classification task, separator for entity label and classification label") + parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.") + parser.add_argument("--ocr_lang", choices=["ch", "en"], default="ch", help="Select the language type for OCR.") + + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/applications/information_extraction/label_studio_doc.md b/applications/information_extraction/label_studio_doc.md new file mode 100644 index 0000000000000000000000000000000000000000..57be7dc327b25abfe492a71c5723863bafbeb578 --- /dev/null +++ b/applications/information_extraction/label_studio_doc.md @@ -0,0 +1,272 @@ +# 文档抽取任务Label Studio使用指南 + + **目录** + +- [1. 安装](#1) +- [2. 
文档抽取任务标注](#2) + - [2.1 项目创建](#21) + - [2.2 数据上传](#22) + - [2.3 标签构建](#23) + - [2.4 任务标注](#24) + - [2.5 数据导出](#25) + - [2.6 数据转换](#26) + - [2.7 更多配置](#27) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + + +## 2. 文档抽取任务标注 + + + +#### 2.1 项目创建 + +点击创建(Create)开始创建一个新的项目,填写项目名称、描述,然后选择``Object Detection with Bounding Boxes``。 + +- 填写项目名称、描述 + +
+ +
+
+- **命名实体识别、关系抽取、事件抽取、实体/评价维度分类**任务选择``Object Detection with Bounding Boxes``
+
+
+ +
+
+- **文档分类**任务选择``Image Classification``
+
+
+ +
+ +- 添加标签(也可跳过后续在Setting/Labeling Interface中添加) + +
+ +
+ +图中展示了Span实体类型标签的构建,其他类型标签的构建可参考[2.3标签构建](#23) + + + +#### 2.2 数据上传 + +先从本地或HTTP链接上传图片,然后选择导入本项目。 + +
+ +
+ + + +#### 2.3 标签构建 + +- Span实体类型标签 + +
+ +
+ + +- Relation关系类型标签 + +
+ +
+ +Relation XML模板: + +```xml + + + + + +``` + + +- 分类类别标签 + +
+ +
+ + + +#### 2.4 任务标注 + +- 实体抽取 + + - 标注示例: + +
+ +
+ + - 该标注示例对应的schema为: + + ```text + schema = ['开票日期', '名称', '纳税人识别号', '地址、电话', '开户行及账号', '金额', '税额', '价税合计', 'No', '税率'] + ``` + +- 关系抽取 + + - Step 1. 标注主体(Subject)及客体(Object) + +
+ +
+ + - Step 2. 关系连线,箭头方向由主体(Subject)指向客体(Object) + +
+ +
+ +
+ +
+ + - Step 3. 添加对应关系类型标签 + +
+ +
+ +
+ +
+ + - Step 4. 完成标注 + +
+ +
+ + + - 该标注示例对应的schema为: + + ```text + schema = { + '名称及规格': [ + '金额', + '单位', + '数量' + ] + } + ``` + +- 文档分类 + + - 标注示例 + +
+ +
+ + - 该标注示例对应的schema为: + + ```text + schema = '文档类别[发票,报关单]' + ``` + + + + +#### 2.5 数据导出 + +勾选已标注图片ID,选择导出的文件类型为``JSON``,导出数据: + +
+ +
+ + + +#### 2.6 数据转换 + +将导出的文件重命名为``label_studio.json``后,放入``./document/data``目录下,并将对应的标注图片放入``./document/data/images``目录下(图片的文件名需与上传到label studio时的命名一致)。通过[label_studio.py](./label_studio.py)脚本可转为UIE的数据格式。 + +- 路径示例 + +```shell +./document/data/ +├── images # 图片目录 +│ ├── b0.jpg # 原始图片(文件名需与上传到label studio时的命名一致) +│ └── b1.jpg +└── label_studio.json # 从label studio导出的标注文件 +``` + +- 抽取式任务 + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1\ + --task_type ext +``` + +- 文档分类任务 + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1 \ + --task_type cls \ + --prompt_prefix "文档类别" \ + --options "发票" "报关单" +``` + + + +#### 2.7 更多配置 + +- ``label_studio_file``: 从label studio导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``negative_ratio``: 最大负例比例,该参数只对抽取类型任务有效,适当构造负例可提升模型效果。负例数量和实际的标签数量有关,最大负例数量 = negative_ratio * 正例数量。该参数只对训练集有效,默认为5。为了保证评估指标的准确性,验证集和测试集默认构造全负例。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务。 +- ``options``: 指定分类任务的类别标签,该参数只对分类类型任务有效。默认为["正向", "负向"]。 +- ``prompt_prefix``: 声明分类任务的prompt前缀信息,该参数只对分类类型任务有效。默认为"情感倾向"。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. +- ``separator``: 实体类别/评价维度与分类标签的分隔符,该参数只对实体/评价维度分类任务有效。默认为"##"。 +- ``schema_lang``:选择schema的语言,将会应该训练数据prompt的构造方式,可选有`ch`和`en`。默认为`ch`。 +- ``ocr_lang``:选择OCR的语言,可选有`ch`和`en`。默认为`ch`。 +- ``layout_analysis``:是否使用PPStructure对文档进行布局分析,该参数只对文档类型标注任务有效。默认为False。 + +备注: +- 默认情况下 [label_studio.py](./label_studio.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [label_studio.py](./label_studio.py) 脚本,将会覆盖已有的同名数据文件 +- 在模型训练阶段我们推荐构造一些负例以提升模型效果,在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例;负样本数量 = negative_ratio * 正样本数量。 +- 对于从label_studio导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/label_studio_doc_en.md b/applications/information_extraction/label_studio_doc_en.md new file mode 100644 index 0000000000000000000000000000000000000000..e7925d54509c3a43b8ce4ae36abe8154d6fb739f --- /dev/null +++ b/applications/information_extraction/label_studio_doc_en.md @@ -0,0 +1,273 @@ +# Label Studio User Guide - Document Information Extraction + + **Table of contents** + +- [1. Installation](#1) +- [2. Document Extraction Task Annotation](#2) + - [2.1 Project Creation](#21) + - [2.2 Data Upload](#22) + - [2.3 Label Construction](#23) + - [2.4 Task Annotation](#24) + - [2.5 Data Export](#25) + - [2.6 Data Conversion](#26) + - [2.7 More Configuration](#27) + + + +## 1. Installation + +**Environmental configuration used in the following annotation examples:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +Use pip to install label-studio in the terminal: + +```shell +pip install label-studio==1.6.0 +``` + +Once the installation is complete, run the following command line: + +```shell +label-studio start +``` + +Open [http://localhost:8080/](http://127.0.0.1:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling. + + + +## 2. Document Extraction Task Annotation + + + +#### 2.1 Project Creation + +Click Create to start creating a new project, fill in the project name, description, and select ``Object Detection with Bounding Boxes``. + +- Fill in the project name, description + +
+ +
+
+- For **Named Entity Recognition, Relation Extraction** tasks please select ``Object Detection with Bounding Boxes``
+
+
+ +
+
+- For **Document Classification** task please select ``Image Classification``
+
+
+ +
+ +- Define labels + +
+ +
+ +The figure shows the construction of Span entity type tags. For the construction of other types of tags, please refer to [2.3 Label Construction](#23) + + + +#### 2.2 Data upload + +First upload the picture from a local or HTTP link, and then choose to import this project. + +
+ +
+ + + +#### 2.3 Label Construction + +- Entity Label + +
+ +
+ + +- Relation label + +
+ +
+ +Relation XML template: + +```xml + + + + + +``` + +- Classification label + +
+ +
+ + + +#### 2.4 Task Annotation + +- Entity extraction + + - Callout example: + +
+ +
+ + - The schema corresponding to this annotation example is: + + ```text + schema = ['开票日期', '名称', '纳税人识别号', '地址、电话', '开户行及账号', '金额', '税额', '价税合计', 'No', '税率'] + ``` + +- Relation extraction + + - Step 1. Label the subject and object + +
+ +
+ + - Step 2. Relation line, the direction of the arrow is from the subject to the object + +
+ +
+ +
+ +
+ + - Step 3. Add corresponding relation label + +
+ +
+ +
+ +
+ + - Step 4. Finish labeling + +
+ +
+ + + - The schema corresponding to this annotation example is: + + ```text + schema = { + '名称及规格': [ + '金额', + '单位', + '数量' + ] + } + ``` + +- Document classification + + - Callout example + +
+ +
+ + - The schema corresponding to this annotation example is: + + ```text + schema = '文档类别[发票,报关单]' + ``` + + + + +#### 2.5 Data Export + +Check the marked image ID, select the exported file type as ``JSON``, and export the data: + +
+ +
+ + + + +#### 2.6 Data Conversion + +After renaming the exported file to ``label_studio.json``, put it into the ``./document/data`` directory, and put the corresponding label image into the ``./document/data/images`` directory (The file name of the picture must be the same as the one uploaded to label studio). Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UIE. + +- Path example + +```shell +./document/data/ +├── images # image directory +│ ├── b0.jpg # Original picture (the file name must be the same as the one uploaded to label studio) +│ └── b1.jpg +└── label_studio.json # Annotation file exported from label studio +``` + +- Extraction task + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1 \ + --task_type ext +``` + +- Document classification tasks + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1 \ + --task_type cls \ + --prompt_prefix "document category" \ + --options "invoice" "customs declaration" +``` + + + +#### 2.7 More Configuration + +- ``label_studio_file``: Data labeling file exported from label studio. +- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default. +- ``negative_ratio``: The maximum negative ratio. This parameter is only valid for extraction tasks. Properly constructing negative examples can improve the model effect. The number of negative examples is related to the actual number of labels, the maximum number of negative examples = negative_ratio * number of positive examples. This parameter is only valid for the training set, and the default is 5. In order to ensure the accuracy of the evaluation indicators, the verification set and test set are constructed with all negative examples by default. +- ``splits``: The proportion of training set and validation set when dividing the data set. The default is [0.8, 0.1, 0.1], which means that the data is divided into training set, verification set and test set according to the ratio of ``8:1:1``. +- ``task_type``: Select the task type, there are two types of tasks: extraction and classification. +- ``options``: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"]. +- ``prompt_prefix``: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency". +- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True. +- ``seed``: random seed, default is 1000. +- ``separator``: The separator between entity category/evaluation dimension and classification label. This parameter is only valid for entity/evaluation dimension classification tasks. The default is"##". +- ``schema_lang``: Select the language of the schema, which will be the construction method of the training data prompt, optional `ch` and `en`. Defaults to `ch`. +- ``ocr_lang``: Select the language for OCR, optional `ch` and `en`. Defaults to `ch`. +- ``layout_analysis``: Whether to use PPStructure to analyze the layout of the document. This parameter is only valid for document type labeling tasks. The default is False. 
+ +Note: +- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets +- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten +- In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by `negative_ratio`; the number of negative samples = negative_ratio * the number of positive samples. +- For files exported from label_studio, each piece of data in the default file is correctly labeled manually. + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/label_studio_text.md b/applications/information_extraction/label_studio_text.md new file mode 100644 index 0000000000000000000000000000000000000000..6596940a24e9fad3a33f4b699abfe36e87172d73 --- /dev/null +++ b/applications/information_extraction/label_studio_text.md @@ -0,0 +1,287 @@ +# 文本抽取任务Label Studio使用指南 + + **目录** + +- [1. 安装](#1) +- [2. 文本抽取任务标注](#2) + - [2.1 项目创建](#21) + - [2.2 数据上传](#22) + - [2.3 标签构建](#23) + - [2.4 任务标注](#24) + - [2.5 数据导出](#25) + - [2.6 数据转换](#26) + - [2.7 更多配置](#27) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + +## 2. 文本抽取任务标注 + + + +#### 2.1 项目创建 + +点击创建(Create)开始创建一个新的项目,填写项目名称、描述,然后选择``Object Detection with Bounding Boxes``。 + +- 填写项目名称、描述 + +
+ +
+
+- **命名实体识别、关系抽取、事件抽取、实体/评价维度分类**任务选择``Relation Extraction``。
+
+
+ +
+ +- **文本分类、句子级情感倾向分类**任务选择``Text Classification``。 + +
+ +
+ +- 添加标签(也可跳过后续在Setting/Labeling Interface中配置) + +
+ +
+ +图中展示了实体类型标签的构建,其他类型标签的构建可参考[2.3标签构建](#23) + + + +#### 2.2 数据上传 + +先从本地上传txt格式文件,选择``List of tasks``,然后选择导入本项目。 + +
+ +
+ + + +#### 2.3 标签构建 + +- Span类型标签 + +
+ +
+ +- Relation类型标签 + +
+ +
+ +Relation XML模板: + +```xml + + + + + +``` + +- 分类类别标签 + +
+ +
+ + + + +#### 2.4 任务标注 + +- 实体抽取 + +标注示例: + +
+ +
+ +该标注示例对应的schema为: + +```text +schema = [ + '时间', + '选手', + '赛事名称', + '得分' +] +``` + +- 关系抽取 + +
+ +
+ +对于关系抽取,其P的类型设置十分重要,需要遵循以下原则 + +“{S}的{P}为{O}”需要能够构成语义合理的短语。比如对于三元组(S, 父子, O),关系类别为父子是没有问题的。但按照UIE当前关系类型prompt的构造方式,“S的父子为O”这个表达不是很通顺,因此P改成孩子更好,即“S的孩子为O”。**合理的P类型设置,将显著提升零样本效果**。 + +该标注示例对应的schema为: + +```text +schema = { + '作品名': [ + '歌手', + '发行时间', + '所属专辑' + ] +} +``` + +- 事件抽取 + +
+ +
+ +该标注示例对应的schema为: + +```text +schema = { + '地震触发词': [ + '时间', + '震级' + ] +} +``` + +- 句子级分类 + +
+ +
+ + +该标注示例对应的schema为: + +```text +schema = '情感倾向[正向,负向]' +``` + +- 实体/评价维度分类 + +
+ +
+ +该标注示例对应的schema为: + +```text +schema = { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] +} +``` + + + +#### 2.5 数据导出 + +勾选已标注文本ID,选择导出的文件类型为``JSON``,导出数据: + +
+ +
+ + + +#### 2.6 数据转换 + +将导出的文件重命名为``label_studio.json``后,放入``./data``目录下。通过[label_studio.py](./label_studio.py)脚本可转为UIE的数据格式。 + +- 抽取式任务 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type ext +``` + +- 句子级分类任务 + +在数据转换阶段,我们会自动构造用于模型训练的prompt信息。例如句子级情感分类中,prompt为``情感倾向[正向,负向]``,可以通过`prompt_prefix`和`options`参数进行配置。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type cls \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "情感倾向" \ + --options "正向" "负向" +``` + +- 实体/评价维度分类任务 + +在数据转换阶段,我们会自动构造用于模型训练的prompt信息。例如评价维度情感分类中,prompt为``XXX的情感倾向[正向,负向]``,可以通过`prompt_prefix`和`options`参数进行声明。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "情感倾向" \ + --options "正向" "负向" \ + --separator "##" +``` + + + +#### 2.7 更多配置 + +- ``label_studio_file``: 从label studio导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``negative_ratio``: 最大负例比例,该参数只对抽取类型任务有效,适当构造负例可提升模型效果。负例数量和实际的标签数量有关,最大负例数量 = negative_ratio * 正例数量。该参数只对训练集有效,默认为5。为了保证评估指标的准确性,验证集和测试集默认构造全负例。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务。 +- ``options``: 指定分类任务的类别标签,该参数只对分类类型任务有效。默认为["正向", "负向"]。 +- ``prompt_prefix``: 声明分类任务的prompt前缀信息,该参数只对分类类型任务有效。默认为"情感倾向"。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. +- ``schema_lang``:选择schema的语言,将会应该训练数据prompt的构造方式,可选有`ch`和`en`。默认为`ch`。 +- ``separator``: 实体类别/评价维度与分类标签的分隔符,该参数只对实体/评价维度分类任务有效。默认为"##"。 + +备注: +- 默认情况下 [label_studio.py](./label_studio.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [label_studio.py](./label_studio.py) 脚本,将会覆盖已有的同名数据文件 +- 在模型训练阶段我们推荐构造一些负例以提升模型效果,在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例;负样本数量 = negative_ratio * 正样本数量。 +- 对于从label_studio导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/label_studio_text_en.md b/applications/information_extraction/label_studio_text_en.md new file mode 100644 index 0000000000000000000000000000000000000000..8f13d48079c4e8f694f3827d2e5a6f2a59c2c331 --- /dev/null +++ b/applications/information_extraction/label_studio_text_en.md @@ -0,0 +1,288 @@ +# Label Studio User Guide - Text Information Extraction + +**Table of contents** + +- [1. Installation](#1) +- [2. Text Extraction Task Annotation](#2) + - [2.1 Project Creation](#21) + - [2.2 Data Upload](#22) + - [2.3 Label Construction](#23) + - [2.4 Task Annotation](#24) + - [2.5 Data Export](#25) + - [2.6 Data Conversion](#26) + - [2.7 More Configuration](#27) + + + +## 1. Installation + +**Environmental configuration used in the following annotation examples:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +Use pip to install label-studio in the terminal: + +```shell +pip install label-studio==1.6.0 +``` + +Once the installation is complete, run the following command line: +```shell +label-studio start +``` + +Open [http://localhost:8080/](http://127.0.0.1:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling. + + + +## 2. Text extraction task annotation + + + +#### 2.1 Project Creation + +Click Create to start creating a new project, fill in the project name, description, and select ``Object Detection with Bounding Boxes``. 
+ +- Fill in the project name, description + +
+ +
+
+- For **Named Entity Recognition, Relation Extraction, Event Extraction, Opinion Extraction** tasks please select ``Relation Extraction``.
+
+
+ +
+ +- For **Text classification, Sentence-level sentiment classification** tasks please select ``Text Classification``. + +
+ +
+ +- Define labels + +
+ +
+ +The figure shows the construction of entity type tags, and the construction of other types of tags can refer to [2.3 Label Construction](#23) + + + +#### 2.2 Data upload + +First upload the txt format file locally, select ``List of tasks``, and then choose to import this project. + +
+ +
+ + + +#### 2.3 Label construction + +- Entity label + +
+ +
+ +- Relation label + +
+ +
+ +Relation XML template: + +```xml + + + + + +``` + +- Classification label + +
+ +
+ + + + +#### 2.4 Task annotation + +- Entity extraction + +Callout example: + +
+ +
+ +The schema corresponding to this annotation example is: + +```text +schema = [ + '时间', + '选手', + '赛事名称', + '得分' +] +``` + +- Relation extraction + +
+ +
+ +For relation extraction, the type setting of P is very important, and the following principles need to be followed + +"{P} of {S} is {O}" needs to be able to form a semantically reasonable phrase. For example, for a triple (S, father and son, O), there is no problem with the relation category being father and son. However, according to the current structure of the UIE relation type prompt, the expression "the father and son of S is O" is not very smooth, so it is better to change P to child, that is, "child of S is O". **A reasonable P type setting will significantly improve the zero-shot performance**. + +The schema corresponding to this annotation example is: + +```text +schema = { + '作品名': [ + '歌手', + '发行时间', + '所属专辑' + ] +} +``` + +- Event extraction + +
+ +
+ +The schema corresponding to this annotation example is: + +```text +schema = { + '地震触发词': [ + '时间', + '震级' + ] +} +``` + +- Sentence level classification + +
+ +
+ + +The schema corresponding to this annotation example is: + +```text +schema = '情感倾向[正向,负向]' +``` + +- Opinion Extraction + +
+ +
+ +The schema corresponding to this annotation example is: + +```text +schema = { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] +} +``` + + + +#### 2.5 Data Export + +Check the marked text ID, select the exported file type as ``JSON``, and export the data: + +
+ +
+ + + +#### 2.6 Data conversion + +Rename the exported file to ``label_studio.json`` and put it in the ``./data`` directory. Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UIE. + +- Extraction task + +```shell +python label_studio.py\ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type ext +``` + +- Sentence-level classification tasks + +In the data conversion stage, we will automatically construct prompt information for model training. For example, in sentence-level sentiment classification, the prompt is ``Sentiment Classification [positive, negative]``, which can be configured through `prompt_prefix` and `options` parameters. + +```shell +python label_studio.py\ + --label_studio_file ./data/label_studio.json \ + --task_type cls \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "Sentiment Classification" \ + --options "positive" "negative" +``` + +- Opinion Extraction + +In the data conversion stage, we will automatically construct prompt information for model training. For example, in the emotional classification of the evaluation dimension, the prompt is ``Sentiment Classification of xxx [positive, negative]``, which can be declared through the `prompt_prefix` and `options` parameters. + +```shell +python label_studio.py\ + --label_studio_file ./data/label_studio.json \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "Sentiment Classification" \ + --options "positive" "negative" \ + --separator "##" +``` + + + +#### 2.7 More Configuration + +- ``label_studio_file``: Data labeling file exported from label studio. +- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default. +- ``negative_ratio``: The maximum negative ratio. This parameter is only valid for extraction tasks. Properly constructing negative examples can improve the model effect. The number of negative examples is related to the actual number of labels, the maximum number of negative examples = negative_ratio * number of positive examples. This parameter is only valid for the training set, and the default is 5. In order to ensure the accuracy of the evaluation indicators, the verification set and test set are constructed with all negative examples by default. +- ``splits``: The proportion of training set and validation set when dividing the data set. The default is [0.8, 0.1, 0.1], which means that the data is divided into training set, verification set and test set according to the ratio of ``8:1:1``. +- ``task_type``: Select the task type, there are two types of tasks: extraction and classification. +- ``options``: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"]. +- ``prompt_prefix``: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency". +- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True. +- ``seed``: random seed, default is 1000. +- ``schema_lang``: Select the language of the schema, which will be the construction method of the training data prompt, optional `ch` and `en`. Defaults to `ch`. +- ``separator``: The separator between entity category/evaluation dimension and classification label. 
This parameter is only valid for entity/evaluation dimension classification tasks. The default is"##". + +Note: +- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets +- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten +- In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by `negative_ratio`; the number of negative samples = negative_ratio * the number of positive samples. +- For files exported from label_studio, each piece of data in the default file is correctly labeled manually. + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/taskflow_doc.md b/applications/information_extraction/taskflow_doc.md new file mode 100644 index 0000000000000000000000000000000000000000..538a86e12b21aab09fb3843c70a129ecfe9eaa33 --- /dev/null +++ b/applications/information_extraction/taskflow_doc.md @@ -0,0 +1,310 @@ +# UIE Taskflow使用指南 + +**目录** +- [1. 功能简介](#1) +- [2. 文档信息抽取](#2) + - [2.1 实体抽取](#21) + - [2.2 关系抽取](#22) + - [2.3 跨任务使用](#23) + - [2.4 输入说明](#24) + - [2.5 使用技巧](#25) + - [2.6 结果可视化](#26) + - [2.7 更多配置](#27) + + + +## 1. 功能简介 + +```paddlenlp.Taskflow```提供文本及文档的通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本或文档中的对应信息。**实现开箱即用,并满足各类信息抽取需求** + + + +## 2. 文档信息抽取 + +本章节主要介绍Taskflow的文档抽取功能,以下示例图片[下载链接](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/cases.zip)。 + + + +#### 2.1 实体抽取 + +实体抽取,又称命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + +- 报关单 + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["收发货人", "进口口岸", "进口日期", "运输方式", "征免性质", "境内目的地", "运输工具名称", "包装种类", "件数", "合同协议号"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/custom.jpeg"})) +[{'件数': [{'bbox': [[826, 1062, 926, 1121]], + 'end': 312, + 'probability': 0.9832498761402597, + 'start': 308, + 'text': '1142'}], +'包装种类': [{'bbox': [[1214, 1066, 1310, 1121]], + 'end': 314, + 'probability': 0.9995648138860567, + 'start': 312, + 'text': '纸箱'}], +'合同协议号': [{'bbox': [[151, 1077, 258, 1117]], + 'end': 319, + 'probability': 0.9984179437542124, + 'start': 314, + 'text': '33035'}], +'境内目的地': [{'bbox': [[1966, 872, 2095, 923]], + 'end': 275, + 'probability': 0.9975541483111243, + 'start': 272, + 'text': '上海市'}], +'征免性质': [{'bbox': [[1583, 770, 1756, 821]], + 'end': 242, + 'probability': 0.9950633161231508, + 'start': 238, + 'text': '一般征税'}], +'收发货人': [{'bbox': [[321, 533, 841, 580]], + 'end': 95, + 'probability': 0.4772132061042136, + 'start': 82, + 'text': '上海新尚实国际贸易有限公司'}, + {'bbox': [[306, 584, 516, 624]], + 'end': 150, + 'probability': 0.33807074572195006, + 'start': 140, + 'text': '31222609K9'}], +'运输工具名称': [{'bbox': [[1306, 672, 1516, 712], [1549, 668, 1645, 712]], + 'end': 190, + 'probability': 0.6692050414718089, + 'start': 174, + 'text': 'E. R. TIANAN004E'}], +'运输方式': [{'bbox': [[1070, 664, 1240, 715]], + 'end': 174, + 'probability': 0.9994416347044179, + 'start': 170, + 'text': '永路运输'}], +'进口口岸': [{'bbox': [[1070, 566, 1346, 617]], + 'end': 120, + 'probability': 0.9945697196994345, + 'start': 111, + 'text': '洋山港区-2248'}], +'进口日期': [{'bbox': [[1726, 569, 1933, 610]], + 'end': 130, + 'probability': 0.9804819494073627, + 'start': 120, + 'text': '2017-02-24'}]}] +``` + +- 证件 + + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["Name", "Date of birth", "Issue date"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en") +>>> pprint(ie({"doc": "./cases/license.jpeg"})) +``` + + + +#### 2.2 关系抽取 + +关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + +- 表格 + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = {"姓名": ["招聘单位", "报考岗位"]} +>>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/table.png"})) +``` + + + +#### 2.3 跨任务使用 + +- 实体、关系多任务抽取 + +对文档进行实体+关系抽取,schema构造如下: + +```text +schema = [ + "Total GBP", + "No.", + "Date", + "Customer No.", + "Subtotal without VAT", + { + "Description": [ + "Quantity", + "Amount" + ] + } +] +``` + +
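+除了重新创建 Taskflow 实例,也可以在已有实例上通过 `set_schema` 切换抽取目标(该接口在后文 2.5 节的示例中也有使用)。以下仅为示意写法,假设 `ie` 为前文已创建、且 `schema_lang` 等设置与新 schema 匹配的实例:
+
+```python
+>>> # 示意:在同一个 ie 实例上切换为“实体+关系”混合抽取目标
+>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}]
+>>> ie.set_schema(schema)
+>>> pprint(ie({"doc": "./cases/delivery_note.png"}))
+```
+
+完整的从零构建实例的示例见下图之后的代码。
+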
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en") +>>> pprint(ie({"doc": "./cases/delivery_note.png"})) +``` + + + +#### 2.4 输入说明 + +- 输入格式 + +文档抽取UIE-X支持图片路径、http图片链接、base64的输入形式,支持图片和PDF两种文档格式。文本抽取可以通过`text`指定输入文本。 + +```python +[ + {'text': '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'}, + {'doc': './cases/custom.jpg'}, + {'doc': 'https://user-images.githubusercontent.com/40840292/203457719-84a70241-607e-4bb1-ab4c-3d9beee9e254.jpeg'} +] +``` + +**NOTE**: 多页PDF输入目前只抽取第一页的结果,UIE-X比较适合单证文档(如票据、单据等)的信息提取,目前还不适合过长或多页的文档。 + +- 使用自己的layout / OCR作为输入 + +```python +layout = [ + ([68.0, 12.0, 167.0, 70.0], '名次'), + ([464.0, 13.0, 559.0, 67.0], '球员'), + ([833.0, 15.0, 1054.0, 64.0], '总出场时间'), + ...... +] +ie({"doc": doc_path, 'layout': layout}) +``` + + + +#### 2.5 使用技巧 + +- 使用PP-Structure版面分析功能 + +OCR中识别出来的文字会按照左上到右下进行排序,对于分栏、表格内有多行文本等情况我们推荐使用版面分析功能``layout_analysis=True``以优化文字排序并增强抽取效果。以下例子仅举例版面分析功能的使用场景,实际场景一般需要标注微调。 + +
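+在决定是否开启版面分析前,可以先用 `DocParser`(与后文 2.6 节相同的接口)将 OCR 识别结果及其默认排序可视化检查。以下为示意写法,文件路径仅为占位:
+
+```python
+>>> from paddlenlp.utils.doc_parser import DocParser
+
+>>> # 示意:可视化 OCR 识别结果,检查默认的文字排序是否符合阅读顺序
+>>> doc_parser = DocParser(ocr_lang="ch")
+>>> parsed_doc = doc_parser.parse({"doc": "./cases/bid.png"})
+>>> doc_parser.write_image_with_results(
+        "./cases/bid.png",
+        layout=parsed_doc["layout"],
+        save_path="ocr_order.png")
+```
+
+若可视化出的排序与阅读顺序不符,再按下面的示例开启 ``layout_analysis=True``。
+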
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = "中标候选人名称" +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", layout_analysis=True) +>>> pprint(ie({"doc": "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.xuyiwater.com%2Fwp-content%2Fuploads%2F2021%2F06%2F1-4.jpg&refer=http%3A%2F%2Fwww.xuyiwater.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1672994926&t=2a4a3fedf6999a34ccde190f97bcfa47"})) +``` + +
+ +
+ +```python +>>> schema = "抗血小板药物的用药指征" +>>> ie.set_schema(schema) +>>> pprint(ie({"doc": "./cases/drug.webp"})) +``` + + + +#### 2.6 结果可视化 + +- OCR识别结果可视化: + +```python +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_parser = DocParser(ocr_lang="en") +>>> doc_path = "./cases/business_card.png" +>>> parsed_doc = doc_parser.parse({"doc": doc_path}) +>>> doc_parser.write_image_with_results( + doc_path, + layout=parsed_doc['layout'], + save_path="ocr_result.png") +``` + +
+ +
+ +- 抽取结果可视化: + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_path = "./cases/business_card.png" +>>> schema = ["人名", "职位", "号码", "邮箱地址", "网址", "地址", "邮编"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en") + +>>> results = ie({"doc": doc_path}) + +>>> DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="image_show.png") +``` + +
+ +
+ + + +#### 2.7 更多配置 + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + ocr_lang="ch", + batch_size=16, + model='uie-x-base', + layout_analysis=False, + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置schema的语言,默认为`ch`, 可选有`ch`和`en`。因为中英schema的构造有所不同,因此需要指定schema的语言。 +* `ocr_lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为16。 +* `model`:选择任务使用的模型,默认为`uie-base`,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`和`uie-medical-base`, `uie-base-en`,`uie-x-base`。 +* `layout_analysis`:是否使用PP-Structure对文档进行布局分析以优化布局信息的排序,默认为False。 +* `position_prob`:模型对于span的起始位置/终止位置的结果概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快,支持GPU和NPU硬件环境。如果选择`fp16`,在GPU硬件环境下,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 +* `use_fast`: 使用C++实现的高性能分词算子FastTokenizer进行文本预处理加速。需要通过`pip install fast-tokenizer-python`安装FastTokenizer库后方可使用。默认为`False`。更多使用说明可参考[FastTokenizer文档](../../fast_tokenizer)。 + +## References +- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** +- **[PP-Structure](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure)** diff --git a/applications/information_extraction/taskflow_doc_en.md b/applications/information_extraction/taskflow_doc_en.md new file mode 100644 index 0000000000000000000000000000000000000000..09fdc7193073b349b7cca5affb206a2911be6f76 --- /dev/null +++ b/applications/information_extraction/taskflow_doc_en.md @@ -0,0 +1,305 @@ +# UIE Taskflow User Guide + +**Table of contents** +- [1. Introduction](#1) +- [2. Document Information Extraction](#2) + - [2.1 Entity Extraction](#21) + - [2.2 Relation Extraction](#22) + - [2.3 Multi-Task Extraction](#23) + - [2.4 Input Format](#24) + - [2.5 Tips](#25) + - [2.6 Visualization](#26) + - [2.7 More Configuration](#27) + + + +## 1. Introduction + +```paddlenlp.Taskflow``` provides general information extraction of text and documents, evaluation opinion extraction and other capabilities, and can extract various types of information, including but not limited to named entities (such as person name, place name, organization name, etc.), relations (such as the director of the movie, the release time of the song, etc.), events (such as a car accident at a certain intersection, an earthquake in a certain place, etc.), and information such as product reviews, opinions, and sentiments. Users can use natural language to customize the extraction target, and can uniformly extract the corresponding information in the input text or document without training. + + + +## 2. Document Information Extraction + +This section introduces the document extraction capability of Taskflow with the following example picture [download link](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/cases.zip). + + + +#### 2.1 Entity Extraction + +Entity extraction, also known as Named Entity Recognition (NER for short), refers to identifying entities with specific meanings in text. 
UIE adopts the open-domain approach where the entity categories are not fixed and users can define them through natural language.
+
+- Example: Customs Declaration Form
+
+
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["收发货人", "进口口岸", "进口日期", "运输方式", "征免性质", "境内目的地", "运输工具名称", "包装种类", "件数", "合同协议号"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/custom.jpeg"})) +[{'件数': [{'bbox': [[826, 1062, 926, 1121]], + 'end': 312, + 'probability': 0.9832498761402597, + 'start': 308, + 'text': '1142'}], +'包装种类': [{'bbox': [[1214, 1066, 1310, 1121]], + 'end': 314, + 'probability': 0.9995648138860567, + 'start': 312, + 'text': '纸箱'}], +'合同协议号': [{'bbox': [[151, 1077, 258, 1117]], + 'end': 319, + 'probability': 0.9984179437542124, + 'start': 314, + 'text': '33035'}], +'境内目的地': [{'bbox': [[1966, 872, 2095, 923]], + 'end': 275, + 'probability': 0.9975541483111243, + 'start': 272, + 'text': '上海市'}], +'征免性质': [{'bbox': [[1583, 770, 1756, 821]], + 'end': 242, + 'probability': 0.9950633161231508, + 'start': 238, + 'text': '一般征税'}], +'收发货人': [{'bbox': [[321, 533, 841, 580]], + 'end': 95, + 'probability': 0.4772132061042136, + 'start': 82, + 'text': '上海新尚实国际贸易有限公司'}, + {'bbox': [[306, 584, 516, 624]], + 'end': 150, + 'probability': 0.33807074572195006, + 'start': 140, + 'text': '31222609K9'}], +'运输工具名称': [{'bbox': [[1306, 672, 1516, 712], [1549, 668, 1645, 712]], + 'end': 190, + 'probability': 0.6692050414718089, + 'start': 174, + 'text': 'E. R. TIANAN004E'}], +'运输方式': [{'bbox': [[1070, 664, 1240, 715]], + 'end': 174, + 'probability': 0.9994416347044179, + 'start': 170, + 'text': '永路运输'}], +'进口口岸': [{'bbox': [[1070, 566, 1346, 617]], + 'end': 120, + 'probability': 0.9945697196994345, + 'start': 111, + 'text': '洋山港区-2248'}], +'进口日期': [{'bbox': [[1726, 569, 1933, 610]], + 'end': 130, + 'probability': 0.9804819494073627, + 'start': 120, + 'text': '2017-02-24'}]}] +``` + +- Example: Driver's License + +
+ +
+
+```python
+>>> from pprint import pprint
+>>> from paddlenlp import Taskflow
+>>> schema = ["Name", "Date of birth", "Issue date"]
+>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en")
+>>> pprint(ie({"doc": "./cases/license.jpeg"}))
+```
+
+
+
+#### 2.2 Relation Extraction
+
+Relation Extraction refers to identifying entities from text and extracting the semantic relationships between them, obtaining triple information in the form <subject, predicate, object>.
+
+- Example: Extracting relations from a table
+
+
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = {"姓名": ["招聘单位", "报考岗位"]} +>>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/table.png"})) +``` + + + +#### 2.3 Multi-Task Extraction + +To extract entities and relation from documents simultaneously, you may set the schema structure as following: + +```text +schema = [ + "Total GBP", + "No.", + "Date", + "Customer No.", + "Subtotal without VAT", + { + "Description": [ + "Quantity", + "Amount" + ] + } +] +``` + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en") +>>> pprint(ie({"doc": "./cases/delivery_note.png"})) +``` + + + +#### 2.4 Input Format + +For document information extraction, UIE-X supports image paths, http image links, base64 input form, and image and PDF document formats. In the input dict, `text` indicates text input and `doc` refer to the document input. + +```python +[ + {'text': '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'}, + {'doc': './cases/custom.jpg'}, + {'doc': 'https://user-images.githubusercontent.com/40840292/203457719-84a70241-607e-4bb1-ab4c-3d9beee9e254.jpeg'} +] +``` + +**NOTE**: Multi-page PDF input currently only extracts the results of the first page. UIE-X is more suitable for information extraction of document documents (such as bills, receipts, etc.), but it is not suitable for documents that are too long or multi-page. + +- Using custom OCR input + +```python +layout = [ + ([68.0, 12.0, 167.0, 70.0], '名次'), + ([464.0, 13.0, 559.0, 67.0], '球员'), + ([833.0, 15.0, 1054.0, 64.0], '总出场时间'), + ...... +] +ie({"doc": doc_path, 'layout': layout}) +``` + + + +#### 2.5 Tips + +- Using PP-Structure layout analysis function + +The text recognized in OCR will be sorted from top left to bottom right. For cases such as column division and multiple lines of text in the table, we recommend using the layout analysis function ``layout_analysis=True`` to optimize text sorting and enhance the extraction effect. The following example is only an example of the usage scenario of the layout analysis function, and the actual scenario generally needs to be marked and fine-tuned. + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = "中标候选人名称" +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", layout_analysis=True) +>>> pprint(ie({"doc": "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.xuyiwater.com%2Fwp-content%2Fuploads%2F2021%2F06%2F1-4.jpg&refer=http%3A%2F%2Fwww.xuyiwater.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1672994926&t=2a4a3fedf6999a34ccde190f97bcfa47"})) +``` + +
+ +
+ +```python +>>> schema = "抗血小板药物的用药指征" +>>> ie.set_schema(schema) +>>> pprint(ie({"doc": "./cases/drug.webp"})) +``` + + + +#### 2.6 Visualization + +- Visualization of OCR recognition results: + +```python +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_parser = DocParser(ocr_lang="en") +>>> doc_path = "./cases/business_card.png" +>>> parsed_doc = doc_parser.parse({"doc": doc_path}) +>>> doc_parser.write_image_with_results( + doc_path, + layout=parsed_doc['layout'], + save_path="ocr_result.png") +``` + +
+ +
+ +- Visualization of extraction results: + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_path = "./cases/business_card.png" +>>> schema = ["人名", "职位", "号码", "邮箱地址", "网址", "地址", "邮编"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en") + +>>> results = ie({"doc": doc_path}) + +>>> DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="image_show.png") +``` + +
+ +
+ + + +#### 2.7 More Configuration + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + ocr_lang="ch", + batch_size=16, + model='uie-x-base', + layout_analysis=False, + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`: Define the task extraction target, which can be configured by referring to the calling examples of different tasks in the out-of-the-box. +* `schema_lang`: Set the language of the schema, the default is `ch`, optional `ch` and `en`. Because the structure of the Chinese and English schemas is different, the language of the schema needs to be specified. +* `ocr_lang`: Select the language of PaddleOCR, `ch` can be used in mixed Chinese and English images, `en` works better on English images, the default is `ch`. +* `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +* `model`: select the model used by the task, the default is `uie-base`, optional `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano` ` and `uie-medical-base`, `uie-base-en`, `uie-x-base`. +* `layout_analysis`: Whether to use PP-Structure to analyze the layout of the document to optimize the sorting of layout information, the default is False. +* `position_prob`: The result probability of the model for the start position/end position of the span is between 0 and 1, and the returned result removes the results less than this threshold, the default is 0.5, and the final probability output of the span is the start position probability and end position The product of the position probabilities. +* `precision`: select the model precision, the default is `fp32`, optional `fp16` and `fp32`. `fp16` inference is faster, support GPU and NPU hardware. If you choose `fp16` and GPU hardware, please ensure that the machine is correctly installed with NVIDIA-related drivers and basic software. **Ensure that CUDA>=11.2, cuDNN>=8.1.1**. For the first time use, you need to follow the prompts to install the relevant dependencies. Secondly, it is necessary to ensure that the CUDA Compute Capability of the GPU device is greater than 7.0. Typical devices include V100, T4, A10, A100, GTX 20 series and 30 series graphics cards, etc. For more information about CUDA Compute Capability and precision support, please refer to NVIDIA documentation: [GPU Hardware and Supported Precision Comparison Table](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix). +* `use_fast`: Use the high-performance word segmentation operator FastTokenizer implemented in C++ to accelerate text preprocessing. The FastTokenizer library needs to be installed through `pip install fast-tokenizer-python` before it can be used. Defaults to `False`. For more usage instructions, please refer to [FastTokenizer Documentation](../../fast_tokenizer). + +## References +- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** +- **[PP-Structure](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure)** diff --git a/applications/information_extraction/taskflow_text.md b/applications/information_extraction/taskflow_text.md new file mode 100644 index 0000000000000000000000000000000000000000..a8fe491775beb1ab8edd50819eba488c253e4cc2 --- /dev/null +++ b/applications/information_extraction/taskflow_text.md @@ -0,0 +1,502 @@ +# 文本抽取任务UIE Taskflow使用指南 + +**目录** +- [1. 功能简介](#1) +- [2. 应用示例](#2) +- [3. 
文本信息抽取](#3) + - [3.1 实体抽取](#31) + - [3.2 关系抽取](#32) + - [3.3 事件抽取](#33) + - [3.4 评论观点抽取](#34) + - [3.5 情感分类](#35) + - [3.6 跨任务抽取](#36) + - [3.7 模型选择](#37) + - [3.8 更多配置](#38) + + + +## 1. 功能简介 + +```paddlenlp.Taskflow```提供纯文本的通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用,并满足各类信息抽取需求** + + + +## 2. 应用示例 + +UIE不限定行业领域和抽取目标,以下是一些通过Taskflow实现开箱即用的行业示例: + +- 医疗场景-专病结构化 + +![image](https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png) + +- 法律场景-判决书抽取 + +![image](https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png) + +- 金融场景-收入证明、招股书抽取 + +![image](https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png) + +- 公安场景-事故报告抽取 + +![image](https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png) + +- 旅游场景-宣传册、手册抽取 + +![image](https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png) + + + +## 3. 文本信息抽取 + + + +#### 3.1 实体抽取 + + 实体抽取,又称命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + + - 例如抽取的目标实体类型是"时间"、"选手"和"赛事名称", schema构造如下: + + ```text + ['时间', '选手', '赛事名称'] + ``` + + 调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction + >>> ie = Taskflow('information_extraction', schema=schema) + >>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint + [{'时间': [{'end': 6, + 'probability': 0.9857378532924486, + 'start': 0, + 'text': '2月8日上午'}], + '赛事名称': [{'end': 23, + 'probability': 0.8503089953268272, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'end': 31, + 'probability': 0.8981548639781138, + 'start': 28, + 'text': '谷爱凌'}]}] + ``` + + - 例如抽取的目标实体类型是"肿瘤的大小"、"肿瘤的个数"、"肝癌级别"和"脉管内癌栓分级", schema构造如下: + + ```text + ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + ``` + + 在上例中我们已经实例化了一个`Taskflow`对象,这里可以通过`set_schema`方法重置抽取目标。 + + 调用示例: + + ```python + >>> schema = ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + >>> ie.set_schema(schema) + >>> pprint(ie("(右肝肿瘤)肝细胞性肝癌(II-III级,梁索型和假腺管型),肿瘤包膜不完整,紧邻肝被膜,侵及周围肝组织,未见脉管内癌栓(MVI分级:M0级)及卫星子灶形成。(肿物1个,大小4.2×4.0×2.8cm)。")) + [{'肝癌级别': [{'end': 20, + 'probability': 0.9243267447402701, + 'start': 13, + 'text': 'II-III级'}], + '肿瘤的个数': [{'end': 84, + 'probability': 0.7538413804059623, + 'start': 82, + 'text': '1个'}], + '肿瘤的大小': [{'end': 100, + 'probability': 0.8341128043459491, + 'start': 87, + 'text': '4.2×4.0×2.8cm'}], + '脉管内癌栓分级': [{'end': 70, + 'probability': 0.9083292325934664, + 'start': 67, + 'text': 'M0级'}]}] + ``` + + - 例如抽取的目标实体类型是"person"和"organization",schema构造如下: + + ```text + ['person', 'organization'] + ``` + + 英文模型调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + >>> schema = ['Person', 'Organization'] + >>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en') + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Organization': [{'end': 53, + 'probability': 0.9985840259877357, + 'start': 48, + 'text': 'Apple'}], + 'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.2 关系抽取 + + 关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + + - 
例如以"竞赛名称"作为抽取主体,抽取关系类型为"主办方"、"承办方"和"已举办次数", schema构造如下: + + ```text + { + '竞赛名称': [ + '主办方', + '承办方', + '已举办次数' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。')) + [{'竞赛名称': [{'end': 13, + 'probability': 0.7825402622754041, + 'relations': {'主办方': [{'end': 22, + 'probability': 0.8421710521379353, + 'start': 14, + 'text': '中国中文信息学会'}, + {'end': 30, + 'probability': 0.7580801847701935, + 'start': 23, + 'text': '中国计算机学会'}], + '已举办次数': [{'end': 82, + 'probability': 0.4671295049136148, + 'start': 80, + 'text': '4届'}], + '承办方': [{'end': 39, + 'probability': 0.8292706618236352, + 'start': 35, + 'text': '百度公司'}, + {'end': 72, + 'probability': 0.6193477885474685, + 'start': 56, + 'text': '中国计算机学会自然语言处理专委会'}, + {'end': 55, + 'probability': 0.7000497331473241, + 'start': 40, + 'text': '中国中文信息学会评测工作委员会'}]}, + 'start': 0, + 'text': '2022语言与智能技术竞赛'}]}] + ``` + + - 例如以"person"作为抽取主体,抽取关系类型为"Company"和"Position", schema构造如下: + + ```text + { + 'Person': [ + 'Company', + 'Position' + ] + } + ``` + + 英文模型调用示例: + + ```python + >>> schema = [{'Person': ['Company', 'Position']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'relations': {'Company': [{'end': 53, + 'probability': 0.9960158209451642, + 'start': 48, + 'text': 'Apple'}], + 'Position': [{'end': 44, + 'probability': 0.8871063806420736, + 'start': 41, + 'text': 'CEO'}]}, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.3 事件抽取 + + 事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词(Trigger)和事件论元(Argument),组合为相应的事件结构化信息。 + + - 例如抽取的目标是"地震"事件的"地震强度"、"时间"、"震中位置"和"震源深度"这些信息,schema构造如下: + + ```text + { + '地震触发词': [ + '地震强度', + '时间', + '震中位置', + '震源深度' + ] + } + ``` + + 触发词的格式统一为`触发词`或``XX触发词`,`XX`表示具体事件类型,上例中的事件类型是`地震`,则对应触发词为`地震触发词`。 + + 调用示例: + + ```python + >>> schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction + >>> ie.set_schema(schema) # Reset schema + >>> ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。') + [{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}] + ``` + + - 英文模型**暂不支持事件抽取**,如有需要可使用英文事件数据集进行定制。 + + + +#### 3.4 评论观点抽取 + + 评论观点抽取,是指抽取文本中包含的评价维度、观点词。 + + - 例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,schema构造如下: + + ```text + { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie("店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队")) # Better print results using pprint + [{'评价维度': [{'end': 20, + 'probability': 0.9817040258681473, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9966142505350533, + 'text': '正向'}], + '观点词': [{'end': 22, + 'probability': 0.957396472711558, + 'start': 21, + 'text': '高'}]}, + 'start': 17, + 'text': '性价比'}, + {'end': 2, + 
'probability': 0.9696849569741168, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9982153274927796, + 'text': '正向'}], + '观点词': [{'end': 4, + 'probability': 0.9945318044652538, + 'start': 2, + 'text': '干净'}]}, + 'start': 0, + 'text': '店面'}]}] + ``` + + - 英文模型schema构造如下: + + ```text + { + 'Aspect': [ + 'Opinion', + 'Sentiment classification [negative, positive]' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en("The teacher is very nice.")) + [{'Aspect': [{'end': 11, + 'probability': 0.4301476415932193, + 'relations': {'Opinion': [{'end': 24, + 'probability': 0.9072940447883724, + 'start': 15, + 'text': 'very nice'}], + 'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685, + 'text': 'positive'}]}, + 'start': 4, + 'text': 'teacher'}]}] + ``` + + + +#### 3.5 情感分类 + + - 句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,schema构造如下: + + ```text + '情感倾向[正向,负向]' + ``` + + 调用示例: + + ```python + >>> schema = '情感倾向[正向,负向]' # Define the schema for sentence-level sentiment classification + >>> ie.set_schema(schema) # Reset schema + >>> ie('这个产品用起来真的很流畅,我非常喜欢') + [{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.9988661643929895}]}] + ``` + + 英文模型schema构造如下: + + ```text + 'Sentiment classification [negative, positive]' + ``` + + 英文模型调用示例: + + ```python + >>> schema = 'Sentiment classification [negative, positive]' + >>> ie_en.set_schema(schema) + >>> ie_en('I am sorry but this is the worst film I have ever seen in my life.') + [{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}] + ``` + + + +#### 3.6 跨任务抽取 + + - 例如在法律场景同时对文本进行实体抽取和关系抽取,schema可按照如下方式进行构造: + + ```text + [ + "法院", + { + "原告": "委托代理人" + }, + { + "被告": "委托代理人" + } + ] + ``` + + 调用示例: + + ```python + >>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}] + >>> ie.set_schema(schema) + >>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint + [{'原告': [{'end': 37, + 'probability': 0.9949814024296764, + 'relations': {'委托代理人': [{'end': 46, + 'probability': 0.7956844697990384, + 'start': 44, + 'text': '李四'}]}, + 'start': 35, + 'text': '张三'}], + '法院': [{'end': 10, + 'probability': 0.9221074192336651, + 'start': 0, + 'text': '北京市海淀区人民法院'}], + '被告': [{'end': 67, + 'probability': 0.8437349536631089, + 'relations': {'委托代理人': [{'end': 92, + 'probability': 0.7267121388225029, + 'start': 90, + 'text': '赵六'}]}, + 'start': 64, + 'text': 'B公司'}]}] + ``` + + + +#### 3.7 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 | + | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 | + | `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 | + | `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 | + | `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 | + | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 | + | `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | + + +- `uie-nano`调用示例: + + ```python + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano") + >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!") + [{'时间': [{'text': '2月8日上午', 'start': 0, 
'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}] + ``` + +- `uie-m-base`和`uie-m-large`支持中英文混合抽取,调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['Time', 'Player', 'Competition', 'Score'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en") + >>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"])) + [{'Competition': [{'end': 23, + 'probability': 0.9373889907291257, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + 'Player': [{'end': 31, + 'probability': 0.6981119555336441, + 'start': 28, + 'text': '谷爱凌'}], + 'Score': [{'end': 39, + 'probability': 0.9888507878270296, + 'start': 32, + 'text': '188.25分'}], + 'Time': [{'end': 6, + 'probability': 0.9784080036931151, + 'start': 0, + 'text': '2月8日上午'}]}, + {'Competition': [{'end': 35, + 'probability': 0.9851549932171295, + 'start': 18, + 'text': 'French Open Final'}], + 'Player': [{'end': 12, + 'probability': 0.9379371275888104, + 'start': 0, + 'text': 'Rafael Nadal'}]}] + ``` + + + +#### 3.8 更多配置 + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + batch_size=16, + model='uie-base', + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置schema的语言,默认为`ch`, 可选有`ch`和`en`。因为中英schema的构造有所不同,因此需要指定schema的语言。该参数只对`uie-x-base`,`uie-m-base`和`uie-m-large`模型有效。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为16。 +* `model`:选择任务使用的模型,默认为`uie-base`,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`和`uie-medical-base`, `uie-base-en`,`uie-x-base`。 +* `position_prob`:模型对于span的起始位置/终止位置的结果概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快,支持GPU和NPU硬件环境。如果选择`fp16`,在GPU硬件环境下,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 +* `use_fast`: 使用C++实现的高性能分词算子FastTokenizer进行文本预处理加速。需要通过`pip install fast-tokenizer-python`安装FastTokenizer库后方可使用。默认为`False`。更多使用说明可参考[FastTokenizer文档](../../fast_tokenizer)。 diff --git a/applications/information_extraction/taskflow_text_en.md b/applications/information_extraction/taskflow_text_en.md new file mode 100644 index 0000000000000000000000000000000000000000..d488799313f947bd9e2ae7807aaa283512b64bd4 --- /dev/null +++ b/applications/information_extraction/taskflow_text_en.md @@ -0,0 +1,312 @@ +# UIE Taskflow User Guide - Text Information Extraction + +**Table of contents** +- [1. Introduction](#1) +- [2. Examples](#2) +- [3. Text Information Extraction](#3) + - [3.1 Entity Extraction](#31) + - [3.2 Relation Extraction](#32) + - [3.3 Event Extraction](#33) + - [3.4 Opinion Extraction](#34) + - [3.5 Sentiment Classification](#35) + - [3.6 Multi-task Extraction](#36) + - [3.7 Available Models](#37) + - [3.8 More Configuration](#38) + + + +## 1. 
Introduction +```paddlenlp.Taskflow``` provides general information extraction of text and documents, evaluation opinion extraction and other capabilities, and can extract various types of information, including but not limited to named entities (such as person name, place name, organization name, etc.), relations (such as the director of the movie, the release time of the song, etc.), events (such as a car accident at a certain intersection, an earthquake in a certain place, etc.), and information such as product reviews, opinions, and sentiments. Users can use natural language to customize the extraction target, and can uniformly extract the corresponding information in the input text or document without training. + + + +## 2. Examples + +UIE does not limit industry fields and extraction targets. The following are some industry examples implemented out of the box by Taskflow: + +- Medical scenarios - specialized disease structure + +![image](https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png) + +- Legal scene - Judgment extraction + +![image](https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png) + +- Financial scenarios - proof of income, extraction of prospectus + +![image](https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png) + +- Public security scene - accident report extraction + +![image](https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png) + +- Tourism scene - brochure, manual extraction + +![image](https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png) + + + +## 3. Text information extraction + + + +#### 3.1 Entity Extraction + + Entity extraction, also known as Named Entity Recognition (NER for short), refers to identifying entities with specific meanings in text. In the open domain information extraction, the extracted categories are not limited, and users can define them by themselves. + + - For example, the extracted target entity types are "person" and "organization", and the schema defined as follows: + + ```text + ['person', 'organization'] + ``` + + Example: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + >>> schema = ['Person', 'Organization'] + >>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en') + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Organization': [{'end': 53, + 'probability': 0.9985840259877357, + 'start': 48, + 'text': 'Apple'}], + 'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.2 Relationship Extraction + + Relation Extraction refers to identifying entities from text and extracting the semantic relationship between entities, and then obtaining triple information, namely . 
+ + - For example, if "person" is used as the extraction subject, and the extraction relationship types are "Company" and "Position", the schema structure is as follows: + + ```text + { + 'Person': [ + 'Company', + 'Position' + ] + } + ``` + + Example: + + ```python + >>> schema = [{'Person': ['Company', 'Position']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'relations': {'Company': [{'end': 53, + 'probability': 0.9960158209451642, + 'start': 48, + 'text': 'Apple'}], + 'Position': [{'end': 44, + 'probability': 0.8871063806420736, + 'start': 41, + 'text': 'CEO'}]}, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.3 Event extraction + + Event Extraction refers to extracting predefined event trigger words (Trigger) and event arguments (Argument) from natural language texts, and combining them into corresponding event structured information. + + - The English model** does not support event extraction**, if necessary, it can be customized using the English event dataset. + + + +#### 3.4 Opinion Extraction + + Opinion extraction refers to the extraction of evaluation dimensions and opinion words contained in the text. + + - For example, the target of extraction is the evaluation dimension contained in the text and its corresponding opinion words and emotional tendencies. The schema structure is as follows: + + ```text + { + 'Aspect': [ + 'Opinion', + 'Sentiment classification [negative, positive]' + ] + } + ``` + + Example: + + ```python + >>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en("The teacher is very nice.")) + [{'Aspect': [{'end': 11, + 'probability': 0.4301476415932193, + 'relations': {'Opinion': [{'end': 24, + 'probability': 0.9072940447883724, + 'start': 15, + 'text': 'very nice'}], + 'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685, + 'text': 'positive'}]}, + 'start': 4, + 'text': 'teacher'}]}] + ``` + + + +#### 3.5 Sentiment Classification + + - Sentence-level sentiment classification, that is, to judge whether the emotional orientation of a sentence is "positive" or "negative". 
The schema structure is as follows: + + ```text + 'Sentiment classification [negative, positive]' + ``` + + Example: + + ```python + >>> schema = 'Sentiment classification [negative, positive]' + >>> ie_en.set_schema(schema) + >>> ie_en('I am sorry but this is the worst film I have ever seen in my life.') + [{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}] + ``` + +#### 3.6 Multi-Task Extraction + + - For example, in the legal scene, entity extraction and relation extraction are performed on the text at the same time, and the schema can be constructed as follows: + + ```text + [ + "法院", + { + "原告": "委托代理人" + }, + { + "被告": "委托代理人" + } + ] + ``` + + Example: + + ```python + >>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}] + >>> ie.set_schema(schema) + >>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint + [{'原告': [{'end': 37, + 'probability': 0.9949814024296764, + 'relations': {'委托代理人': [{'end': 46, + 'probability': 0.7956844697990384, + 'start': 44, + 'text': '李四'}]}, + 'start': 35, + 'text': '张三'}], + '法院': [{'end': 10, + 'probability': 0.9221074192336651, + 'start': 0, + 'text': '北京市海淀区人民法院'}], + '被告': [{'end': 67, + 'probability': 0.8437349536631089, + 'relations': {'委托代理人': [{'end': 92, + 'probability': 0.7267121388225029, + 'start': 90, + 'text': '赵六'}]}, + 'start': 64, + 'text': 'B公司'}]}] + ``` + + + +#### 3.7 Available Model + +- A variety of models to different accuracy and speed requirements + + | Model | Structure | Language | + | :---: | :--------: | :--------: | + | `uie-base` (default)| 12-layers, 768-hidden, 12-heads | Chinese | + | `uie-base-en` | 12-layers, 768-hidden, 12-heads | English | + | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | Chinese | + | `uie-medium`| 6-layers, 768-hidden, 12-heads | Chinese | + | `uie-mini`| 6-layers, 384-hidden, 12-heads | Chinese | + | `uie-micro`| 4-layers, 384-hidden, 12-heads | Chinese | + | `uie-nano`| 4-layers, 312-hidden, 12-heads | Chinese | + | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | Chinese and English | + | `uie-m-base`| 12-layers, 768-hidden, 12-heads | Chinese and English | + + +- `uie-nano` call example: + + ```python + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano") + >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!") + [{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}] + ``` + +- `uie-m-base` and `uie-m-large` support extraction of both Chinese and English, call example: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['Time', 'Player', 'Competition', 'Score'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en") + >>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"])) + [{'Competition': [{'end': 23, + 'probability': 0.9373889907291257, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + 'Player': [{'end': 31, + 'probability': 0.6981119555336441, + 'start': 28, + 'text': '谷爱凌'}], + 'Score': [{'end': 39, + 'probability': 0.9888507878270296, + 'start': 32, + 'text': 
'188.25分'}], + 'Time': [{'end': 6, + 'probability': 0.9784080036931151, + 'start': 0, + 'text': '2月8日上午'}]}, + {'Competition': [{'end': 35, + 'probability': 0.9851549932171295, + 'start': 18, + 'text': 'French Open Final'}], + 'Player': [{'end': 12, + 'probability': 0.9379371275888104, + 'start': 0, + 'text': 'Rafael Nadal'}]}] + ``` + + + +#### 3.8 More Configuration + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + batch_size=16, + model='uie-base', + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`: Define the task extraction target, which can be configured by referring to the calling examples of different tasks in the out-of-the-box. +* `schema_lang`: Set the language of the schema, the default is `ch`, optional `ch` and `en`. Because the structure of the Chinese and English schemas is different, the language of the schema needs to be specified. This parameter is only valid for `uie-x-base`, `uie-m-base` and `uie-m-large` models. +* `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +* `model`: select the model used by the task, the default is `uie-base`, optional `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano` and `uie-medical-base`, `uie-base-en`, `uie-x-base`. +* `position_prob`: The result probability of the model for the start position/end position of the span is between 0 and 1, and the returned result removes the results less than this threshold, the default is 0.5, and the final probability output of the span is the start position probability and end position The product of the position probabilities. +* `precision`: select the model precision, the default is `fp32`, optional `fp16` and `fp32`. `fp16` inference is faster, support GPU and NPU hardware. If you choose `fp16` and GPU hardware, please ensure that the machine is correctly installed with NVIDIA-related drivers and basic software. **Ensure that CUDA>=11.2, cuDNN>=8.1.1**. For the first time use, you need to follow the prompts to install the relevant dependencies. Secondly, it is necessary to ensure that the CUDA Compute Capability of the GPU device is greater than 7.0. Typical devices include V100, T4, A10, A100, GTX 20 series and 30 series graphics cards, etc. For more information about CUDA Compute Capability and precision support, please refer to NVIDIA documentation: [GPU Hardware and Supported Precision Comparison Table](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix). +* `use_fast`: Use the high-performance word segmentation operator FastTokenizer implemented in C++ to accelerate text preprocessing. The FastTokenizer library needs to be installed through `pip install fast-tokenizer-python` before it can be used. Defaults to `False`. For more usage instructions, please refer to [FastTokenizer Documentation](../../fast_tokenizer). diff --git a/applications/information_extraction/text/README.md b/applications/information_extraction/text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..84cd77fd288cdc5f965289d33db3f93f083db72a --- /dev/null +++ b/applications/information_extraction/text/README.md @@ -0,0 +1,289 @@ +简体中文 | [English](README_en.md) + +# 文本信息抽取 + +**目录** +- [1. 文本信息抽取应用](#1) +- [2. 
快速开始](#2) + - [2.1 代码结构](#代码结构) + - [2.2 数据标注](#数据标注) + - [2.3 模型微调](#模型微调) + - [2.4 模型评估](#模型评估) + - [2.5 定制模型一键预测](#定制模型一键预测) + - [2.6 实验指标](#实验指标) + - [2.7 封闭域蒸馏](#封闭域蒸馏) + + + +## 1. 文本信息抽取应用 + +本项目提供基于UIE微调的纯文本抽取端到端应用方案,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现文档信息抽取产品落地。 + +信息抽取通俗地说就是从给定的文本/图片等输入数据中抽取出结构化信息的过程。在信息抽取的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对信息抽取领域的难点和痛点,PaddleNLP信息抽取应用UIE统一建模的思想,提供了文档信息抽取产业级应用方案,支持**文档/图片/表格和纯文本场景下实体、关系、事件、观点等不同任务信息抽取**。该应用**不限定行业领域和抽取目标**,可实现从产品原型研发、业务POC阶段到业务落地、迭代阶段的无缝衔接,助力开发者实现特定领域抽取场景的快速适配与落地。 + +**文本信息抽取应用亮点:** + +- **覆盖场景全面🎓:** 覆盖文本信息抽取各类主流任务,支持多语言,满足开发者多样信息抽取落地需求。 +- **效果领先🏃:** 以在纯文本具有突出效果的UIE系列模型作为训练基座,提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **简单易用⚡:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启信息抽取训练,轻松完成部署上线,降低信息抽取技术落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 快速开始 + +对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用定制功能(标注少量数据进行模型微调)以进一步提升效果。 + + + +### 2.1 代码结构 + +```shell +. +├── utils.py # 数据处理工具 +├── finetune.py # 模型微调、压缩脚本 +├── evaluate.py # 模型评估脚本 +└── README.md +``` + + + +### 2.2 数据标注 + +我们推荐使用 [Label Studio](https://labelstud.io/) 进行文本信息抽取数据标注,本项目打通了从数据标注到训练的通道,也即Label Studio导出数据可以通过 [label_studio.py](../label_studio.py) 脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考 [Label Studio数据标注指南](../label_studio_text.md)。 + +这里我们提供预先标注好的`军事关系抽取数据集`的文件,可以运行下面的命令行下载数据集,我们将展示如何使用数据转化脚本生成训练/验证/测试集文件,并使用UIE模型进行微调。 + +下载军事关系抽取数据集: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/military.tar.gz +tar -xvf military.tar.gz +mv military data +rm military.tar.gz +``` + +生成训练/验证集文件: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.76 0.24 0 \ + --negative_ratio 3 \ + --task_type ext +``` + +更多不同类型任务(含实体抽取、关系抽取、文档分类等)的标注规则及参数说明,请参考[Label Studio数据标注指南](../label_studio_text.md)。 + + + + +### 2.3 模型微调 + +推荐使用 [Trainer API ](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md) 对模型进行微调。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `uie-base` 作为预训练模型进行模型微调,将微调后的模型保存至`$finetuned_model`: + +单卡启动: + +```shell +python finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +如果在GPU环境中使用,可以指定gpus参数进行多卡训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 
1 +``` + +该示例代码中由于设置了参数 `--do_eval`,因此在训练完会自动进行评估。 + +可配置参数说明: +* `device`: 训练设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 训练。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认10。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `eval_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `seed`:全局随机种子,默认为 42。 +* `model_name_or_path`:进行 few shot 训练使用的预训练模型。默认为 "uie-x-base"。 +* `output_dir`:必须,模型训练或压缩后保存的模型目录;默认为 `None` 。 +* `train_path`:训练集路径;默认为 `None` 。 +* `dev_path`:开发集路径;默认为 `None` 。 +* `max_seq_len`:文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* `per_device_train_batch_size`:用于训练的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `learning_rate`:训练最大学习率,UIE-X 推荐设置为 1e-5;默认值为3e-5。 +* `label_names`:训练数据标签label的名称,UIE-X 设置为'start_positions' 'end_positions';默认值为None。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练,默认不设置。 +* `do_eval`:是否进行评估,设置该参数表示进行评估,默认不设置。 +* `do_export`:是否进行导出,设置该参数表示进行静态图导出,默认不设置。 +* `export_model_dir`:静态图导出地址,默认为None。 +* `overwrite_output_dir`: 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点目录,则使用它继续训练。 +* `disable_tqdm`: 是否使用tqdm进度条。 +* `metric_for_best_model`:最优模型指标,UIE-X 推荐设置为 `eval_f1`,默认为None。 +* `load_best_model_at_end`:训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +* `save_total_limit`:如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints `输出目录`,默认为None。 + + + +### 2.4 模型评估 + +通过运行以下命令进行模型评估: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --device gpu \ + --batch_size 16 \ + --max_seq_len 512 +``` + +通过运行以下命令对 UIE-M 进行模型评估: + +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --batch_size 16 \ + --device gpu \ + --max_seq_len 512 \ + --multilingual +``` + +评估方式说明:采用单阶段评价的方式,即关系抽取、事件抽取等需要分阶段预测的任务对每一阶段的预测结果进行分别评价。验证/测试集默认会利用同一层级的所有标签来构造出全部负例。 + +可开启`debug`模式对每个正例类别分别进行评估,该模式仅用于模型调试: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --debug +``` + +输出打印示例: + +```text +[2022-11-21 12:48:41,794] [ INFO] - ----------------------------- +[2022-11-21 12:48:41,795] [ INFO] - Class Name: 武器名称 +[2022-11-21 12:48:41,795] [ INFO] - Evaluation Precision: 0.96667 | Recall: 0.96667 | F1: 0.96667 +[2022-11-21 12:48:44,093] [ INFO] - ----------------------------- +[2022-11-21 12:48:44,094] [ INFO] - Class Name: X的产国 +[2022-11-21 12:48:44,094] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.99275 | F1: 0.99636 +[2022-11-21 12:48:46,474] [ INFO] - ----------------------------- +[2022-11-21 12:48:46,475] [ INFO] - Class Name: X的研发单位 +[2022-11-21 12:48:46,475] [ INFO] - Evaluation Precision: 0.77519 | Recall: 0.64935 | F1: 0.70671 +[2022-11-21 12:48:48,800] [ INFO] - ----------------------------- +[2022-11-21 12:48:48,801] [ INFO] - Class Name: X的类型 +[2022-11-21 12:48:48,801] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +``` + +可配置参数说明: + +- `device`: 评估设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 评估。 +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +- `multilingual`: 是否是跨语言模型,默认关闭。 +- `schema_lang`: 选择schema的语言,可选有`ch`和`en`。默认为`ch`,英文数据集请选择`en`。 + + + +### 2.5 定制模型一键预测 + +`paddlenlp.Taskflow`装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件`model_state.pdparams`。 + +```python +>>> from pprint import pprint 
+>>> from paddlenlp import Taskflow + +>>> schema = {"武器名称": ["产国", "类型", "研发单位"]} +# 设定抽取目标和定制化模型权重路径 +>>> my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best') +>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。")) +[{'武器名称': [{'end': 14, + 'probability': 0.9998632702221926, + 'relations': {'产国': [{'end': 18, + 'probability': 0.9998815094394331, + 'start': 16, + 'text': '瑞典'}], + '研发单位': [{'end': 25, + 'probability': 0.9995875123178521, + 'start': 18, + 'text': 'FFV军械公司'}], + '类型': [{'end': 14, + 'probability': 0.999877336059086, + 'start': 12, + 'text': '炸弹'}]}, + 'start': 0, + 'text': '威尔哥(Virgo)减速炸弹'}]}] +``` + + + +### 2.6 实验指标 + +军事关系抽取数据集实验指标: + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot | 0.64634| 0.53535 | 0.58564 | + | 5-shot | 0.89474 | 0.85000 | 0.87179 | + | 10-shot | 0.92793 | 0.85833 | 0.89177 | + | full-set | 0.93103 | 0.90000 | 0.91525 | + + + + +### 2.7 封闭域蒸馏 + +在一些工业应用场景中对性能的要求较高,模型若不能有效压缩则无法实际应用。因此,我们基于数据蒸馏技术构建了UIE Slim数据蒸馏系统。其原理是通过数据作为桥梁,将UIE模型的知识迁移到封闭域信息抽取小模型,以达到精度损失较小的情况下却能达到大幅度预测速度提升的效果。详细介绍请参考[UIE Slim 数据蒸馏](./data_distill/README.md) diff --git a/applications/information_extraction/text/README_en.md b/applications/information_extraction/text/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..4cc36bbcee15fa47b49ef511b97878474a16e41e --- /dev/null +++ b/applications/information_extraction/text/README_en.md @@ -0,0 +1,278 @@ +# Text information extraction + +**Table of contents** +- [1. Text Information Extraction Application](#1) +- [2. Quick Start](#2) + - [2.1 Code Structure](#21) + - [2.2 Data Annotation](#22) + - [2.3 Finetuning](#23) + - [2.4 Evaluation](#24) + - [2.5 Inference](#25) + - [2.6 Experiments](#26) + - [2.7 Closed Domain Distillation](#27) + + + +## 1. Text Information Extraction Application + +This project provides an end-to-end application solution for plain text extraction based on UIE fine-tuning and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models.a + +Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapated models specialized for different industry sectors. + +**Highlights:** +- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages +- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. 
We also provide pretrained models of various sizes to meet different needs +- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment +- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Quick start + +For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot performance. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance. + + + +### 2.1 Code structure + +```shell +. +├── utils.py # data processing tools +├── finetune.py # model fine-tuning, compression script +├── evaluate.py # model evaluation script +└── README.md +``` + + + +### 2.2 Data labeling + +We recommend using [Label Studio](https://labelstud.io/) for data labeling. We provide an end-to-end pipeline for the labeling -> training process. You can export the labeled data in Label Studio through [label_studio.py](../label_studio.py) script to export and convert the data into the required input form for the model. For a detailed introduction to labeling methods, please refer to [Label Studio Data Labeling Guide](../label_studio_text_en.md). + +Here we provide a pre-labeled example dataset `Military Relationship Extraction Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning . + +Download the military relationship extraction dataset: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/military.tar.gz +tar -xvf military.tar.gz +mv military data +rm military.tar.gz +``` + +Generate training/validation set files: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.76 0.24 0 \ + --negative_ratio 3 \ + --task_type ext +``` + +For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_text_en.md). 
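Before starting fine-tuning, it can help to sanity-check the converted files. The sketch below assumes the conversion script wrote `train.txt` and `dev.txt` under `./data` with one JSON record per line (these paths match the fine-tuning commands in the next section); the exact record fields depend on your annotations and the task type:

```python
import json

# Quick sanity check of the files produced by the label_studio.py conversion above;
# the paths follow the --save_dir ./data setting used in the conversion command.
for split in ("train", "dev"):
    path = f"./data/{split}.txt"
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} examples")
    # Show the first record of each split; the exact fields depend on the task type (ext here).
    if records:
        print(json.dumps(records[0], ensure_ascii=False)[:300])
```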
+ + + + +### 2.3 Finetuning + +Use the following command to fine-tune the model using `uie-base` as the pre-trained model, and save the fine-tuned model to `$finetuned_model`: + +Single GPU: + +```shell +python finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Multiple GPUs: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Parameters: + +* `device`: Training device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU training. +* `logging_steps`: The interval steps of log printing during training, the default is 10. +* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `eval_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `seed`: global random seed, default is 42. +* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "uie-x-base". +* `output_dir`: required, the model directory saved after model training or compression; the default is `None`. +* `train_path`: training set path; defaults to `None`. +* `dev_path`: Development set path; defaults to `None`. +* `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_train_batch_size`: The batch size of each GPU core//NPU core/CPU used for training, the default is 8. +* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation, default is 8. +* `num_train_epochs`: Training rounds, 100 can be selected when using early stopping method; the default is 10. +* `learning_rate`: The maximum learning rate for training, UIE-X recommends setting it to 1e-5; the default value is 3e-5. +* `label_names`: the name of the training data label, UIE-X is set to 'start_positions' 'end_positions'; the default value is None. +* `do_train`: Whether to perform fine-tuning training, setting this parameter means to perform fine-tuning training, and it is not set by default. +* `do_eval`: Whether to evaluate, setting this parameter means to evaluate, the default is not set. +* `do_export`: Whether to export, setting this parameter means to export static images, and it is not set by default. 
+* `export_model_dir`: Static map export address, the default is None. +* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training. +* `disable_tqdm`: Whether to use tqdm progress bar. +* `metric_for_best_model`: Optimal model metric, UIE-X recommends setting it to `eval_f1`, the default is None. +* `load_best_model_at_end`: Whether to load the best model after training, usually used in conjunction with `metric_for_best_model`, the default is False. +* `save_total_limit`: If this parameter is set, the total number of checkpoints will be limited. Remove old checkpoints `output directory`, defaults to None. + + + +### 2.4 Evaluation + +Model evaluation: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --batch_size 16 \ + --max_seq_len 512 +``` + +Model evaluation for UIE-M: + +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --batch_size 16 \ + --max_seq_len 512 \ + --multilingual +``` + +We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples. + +The `debug` mode can be turned on to evaluate each positive category separately. This mode is only used for model debugging: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --debug +``` + +Output print example: + +```text +[2022-11-21 12:48:41,794] [ INFO] - ----------------------------- +[2022-11-21 12:48:41,795] [ INFO] - Class Name: 武器名称 +[2022-11-21 12:48:41,795] [ INFO] - Evaluation Precision: 0.96667 | Recall: 0.96667 | F1: 0.96667 +[2022-11-21 12:48:44,093] [ INFO] - ----------------------------- +[2022-11-21 12:48:44,094] [ INFO] - Class Name: X的产国 +[2022-11-21 12:48:44,094] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.99275 | F1: 0.99636 +[2022-11-21 12:48:46,474] [ INFO] - ----------------------------- +[2022-11-21 12:48:46,475] [ INFO] - Class Name: X的研发单位 +[2022-11-21 12:48:46,475] [ INFO] - Evaluation Precision: 0.77519 | Recall: 0.64935 | F1: 0.70671 +[2022-11-21 12:48:48,800] [ INFO] - ----------------------------- +[2022-11-21 12:48:48,801] [ INFO] - Class Name: X的类型 +[2022-11-21 12:48:48,801] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +``` + +Parameters: + +- `device`: Evaluation device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU evaluation. +- `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`. +- `test_path`: The test set file for evaluation. +- `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +- `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +- `debug`: Whether to enable the debug mode to evaluate each positive category separately. This mode is only used for model debugging and is disabled by default. +- `multilingual`: Whether it is a multilingual model, it is turned off by default. +- `schema_lang`: select the language of the schema, optional `ch` and `en`. 
The default is `ch`, please select `en` for the English dataset. + + + +### 2.5 Inference +Same with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path` + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = {"武器名称": ["产国", "类型", "研发单位"]} +# Set the extraction target and the fine-tuned model path +>>> my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best') +>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。")) +[{'武器名称': [{'end': 14, + 'probability': 0.9998632702221926, + 'relations': {'产国': [{'end': 18, + 'probability': 0.9998815094394331, + 'start': 16, + 'text': '瑞典'}], + '研发单位': [{'end': 25, + 'probability': 0.9995875123178521, + 'start': 18, + 'text': 'FFV军械公司'}], + '类型': [{'end': 14, + 'probability': 0.999877336059086, + 'start': 12, + 'text': '炸弹'}]}, + 'start': 0, + 'text': '威尔哥(Virgo)减速炸弹'}]}] +``` + + + +### 2.6 Experiments + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot | 0.64634| 0.53535 | 0.58564 | + | 5-shot | 0.89474 | 0.85000 | 0.87179 | + | 10-shot | 0.92793 | 0.85833 | 0.89177 | + | full-set | 0.93103 | 0.90000 | 0.91525 | + + + + +### 2.7 Closed Domain Distillation + +Some industrial application scenarios have high inference performance requirements and the model cannot go into production without being effectively compressed. We built the [UIE Slim Data Distillation](./data_distill/README_en.md) with knowledge distillation techniques. The principle is to use the data as a bridge to transfer the knowledge of the UIE model to the smaller closed-domain information extraction model in order to achieve speedup inference significantly with minimal loss to accuracy. 
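As a preview of the deployment story, the distilled closed-domain student model can also be loaded through `Taskflow`, mirroring the deployment example in the data distillation guide; the `task_path` below is an illustrative student-checkpoint location, not a fixed path:

```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

# The schema is fixed when the student model is trained, so no schema argument is needed here;
# task_path is an illustrative path to the distilled student checkpoint.
>>> student_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="./data_distill/checkpoint/model_best/")
>>> pprint(student_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制。"))
```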
diff --git a/applications/information_extraction/text/data_distill/README.md b/applications/information_extraction/text/data_distill/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6886f591e10d398f511f5acc321ed4d94929c109 --- /dev/null +++ b/applications/information_extraction/text/data_distill/README.md @@ -0,0 +1,156 @@ +# UIE Slim 数据蒸馏 + +在UIE强大的抽取能力背后,同样需要较大的算力支持计算。在一些工业应用场景中对性能的要求较高,若不能有效压缩则无法实际应用。因此,我们基于数据蒸馏技术构建了UIE Slim数据蒸馏系统。其原理是通过数据作为桥梁,将UIE模型的知识迁移到封闭域信息抽取小模型,以达到精度损失较小的情况下却能达到大幅度预测速度提升的效果。 + +#### UIE数据蒸馏三步 + +- **Step 1**: 使用UIE模型对标注数据进行finetune,得到Teacher Model。 + +- **Step 2**: 用户提供大规模无标注数据,需与标注数据同源。使用Taskflow UIE对无监督数据进行预测。 + +- **Step 3**: 使用标注数据以及步骤2得到的合成数据训练出封闭域Student Model。 + +## UIE Finetune + +参考[UIE关系抽取微调](../README.md)完成模型微调,得到``../checkpoint/model_best``。 + +## 离线蒸馏 + +#### 通过训练好的UIE定制模型预测无监督数据的标签 + +```shell +python data_distill.py \ + --data_path ../data \ + --save_dir student_data \ + --task_type relation_extraction \ + --synthetic_ratio 10 \ + --model_path ../checkpoint/model_best +``` + +**NOTE**:schema需要根据标注数据在`data_distill.py`中进行配置,且schema需要包含标注数据中的所有标签类型。 + +可配置参数说明: + +- `data_path`: 标注数据(`doccano_ext.json`)及无监督文本(`unlabeled_data.txt`)路径。 +- `model_path`: 训练好的UIE定制模型路径。 +- `save_dir`: 学生模型训练数据保存路径。 +- `synthetic_ratio`: 控制合成数据的比例。最大合成数据数量=synthetic_ratio*标注数据数量。 +- `platform`: 标注数据的所使用的标注平台,可选有`doccano`,`label_studio`,默认为`label_studio`。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域抽取,不同任务的后处理逻辑不同,因此需指定任务类型。 +- `seed`: 随机种子,默认为1000。 + +#### 老师模型评估 + +UIE微调阶段针对UIE训练格式数据评估模型效果(该评估方式非端到端评估,非关系抽取或事件抽取的标准评估方式),可通过以下评估脚本进行端到端评估。 + +```shell +python evaluate_teacher.py \ + --task_type relation_extraction \ + --test_path ./student_data/dev_data.json \ + --label_maps_path ./student_data/label_maps.json \ + --model_path ../checkpoint/model_best +``` + +可配置参数说明: + +- `model_path`: 训练好的UIE定制模型路径。 +- `test_path`: 测试数据集路径。 +- `label_maps_path`: 学生模型标签字典。 +- `batch_size`: 批处理大小,默认为8。 +- `max_seq_len`: 最大文本长度,默认为256。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域信息抽取的评估,需指定任务类型。 + + +#### 学生模型训练 + +```shell +python train.py \ + --task_type relation_extraction \ + --train_path student_data/train_data.json \ + --dev_path student_data/dev_data.json \ + --label_maps_path student_data/label_maps.json \ + --num_epochs 50 \ + --encoder ernie-3.0-mini-zh +``` + +可配置参数说明: + +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `batch_size`: 批处理大小,默认为16。 +- `learning_rate`: 学习率,默认为3e-5。 +- `save_dir`: 模型存储路径,默认为`./checkpoint`。 +- `max_seq_len`: 最大文本长度,默认为256。 +- `weight_decay`: 表示AdamW优化器中使用的 weight_decay 的系数。 +- `warmup_proportion`: 学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +- `num_epochs`: 训练轮数,默认为100。 +- `seed`: 随机种子,默认为1000。 +- `encoder`: 选择学生模型的模型底座,默认为`ernie-3.0-mini-zh`。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域信息抽取,需指定任务类型。 +- `logging_steps`: 日志打印的间隔steps数,默认10。 +- `eval_steps`: evaluate的间隔steps数,默认200。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。 +- `init_from_ckpt`: 可选,模型参数路径,热启动模型训练;默认为None。 + +#### 学生模型评估 + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path student_data/dev_data.json \ + --task_type relation_extraction \ + --label_maps_path student_data/label_maps.json \ + --encoder ernie-3.0-mini-zh +``` + +可配置参数说明: + +- `model_path`: 训练好的UIE定制模型路径。 +- `test_path`: 测试数据集路径。 
+- `label_maps_path`: 学生模型标签字典。 +- `batch_size`: 批处理大小,默认为8。 +- `max_seq_len`: 最大文本长度,默认为256。 +- `encoder`: 选择学生模型的模型底座,默认为`ernie-3.0-mini-zh`。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域信息抽取的评估,需指定任务类型。 + +## Taskflow部署学生模型 + +- 通过Taskflow一键部署封闭域信息抽取模型,`task_path`为学生模型路径。 + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction +>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。")) +[{'武器名称': [{'end': 14, + 'probability': 0.9976037, + 'relations': {'产国': [{'end': 18, + 'probability': 0.9988706, + 'relations': {}, + 'start': 16, + 'text': '瑞典'}], + '研发单位': [{'end': 25, + 'probability': 0.9978277, + 'relations': {}, + 'start': 18, + 'text': 'FFV军械公司'}], + '类型': [{'end': 14, + 'probability': 0.99837446, + 'relations': {}, + 'start': 12, + 'text': '炸弹'}]}, + 'start': 0, + 'text': '威尔哥(Virgo)减速炸弹'}]}] +``` + + +# References + +- **[GlobalPointer](https://kexue.fm/search/globalpointer/)** + +- **[GPLinker](https://kexue.fm/archives/8888)** + +- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch)** + +- **[CBLUE](https://github.com/CBLUEbenchmark/CBLUE)** diff --git a/applications/information_extraction/text/data_distill/README_en.md b/applications/information_extraction/text/data_distill/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..7b1ca61122a86ba7862c3a67471f4d4c95088234 --- /dev/null +++ b/applications/information_extraction/text/data_distill/README_en.md @@ -0,0 +1,154 @@ +# UIE Slim data distillation + +While UIE has powerful zero-shot extraction capabilities, its prompting structure requires significant compute to serve in real time. Some industrial application scenarios have high inference performance requirements and the model cannot go into production without being effectively compressed. We built the UIE Slim Data Distillation with knowledge distillation techniques. The principle is to use the data as a bridge to transfer the knowledge of the UIE model to the smaller closed-domain information extraction model in order to achieve speedup inference significantly with minimal loss to accuracy. + +#### Three steps of UIE data distillation + +- **Step 1**: Finetune the UIE model on the labeled data to get the Teacher Model. + +- **Step 2**: Process the user-provided unlabeled data and run inference with Taskflow UIE. + +- **Step 3**: Use the labeled data and the inference results obtained in step 2 to train a closed-domain Student Model. + +## UIE Finetune + +Refer to [UIE relationship extraction fine-tuning](../README.md) to complete the model fine-tuning and get ``../checkpoint/model_best``. + +## Offline Distillation + +#### Predict the label of unsupervised data through the trained UIE custom model + +```shell +python data_distill.py \ + --data_path ../data \ + --save_dir student_data \ + --task_type relation_extraction \ + --synthetic_ratio 10 \ + --model_path ../checkpoint/model_best +``` + +**NOTE**: The schema needs to be configured in `data_distill.py` according to the label data, and the schema needs to contain all label types in the label data. 
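+
+For example, the schema defined near the bottom of `data_distill.py` in this directory covers the weapon-extraction demo used throughout this README; replace it with a schema that covers all label types in your own annotated data:
+
+```python
+# Define your schema here
+schema = {"武器名称": ["产国", "类型", "研发单位"]}
+```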
+ +Description of configurable parameters: + +- `data_path`: Path to labeled data (`doccano_ext.json`) and unsupervised text (`unlabeled_data.txt`). +- `model_path`: The path of the trained UIE custom model. +- `save_dir`: The path to save the training data of the student model. +- `synthetic_ratio`: Controls the ratio of synthetic data. The maximum number of synthetic data=synthetic_ratio*number of labeled data. +- `platform`: The labeling platform used to label data, optional are `doccano`, `label_studio`, the default is `label_studio`. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is a closed-domain extraction, the post-processing logic of different tasks is different, so the task type needs to be specified. +- `seed`: random seed, default is 1000. + +#### Teacher model evaluation + +In the UIE fine-tuning stage, the model performance is evaluated on UIE training format data, which is not a standard end-to-end evaluation method for relation extraction or event extraction. The end-to-end evaluation can be performed through the following evaluation script. + +```shell +python evaluate_teacher.py \ + --task_type relation_extraction \ + --test_path ./student_data/dev_data.json\ + --label_maps_path ./student_data/label_maps.json \ + --model_path ../checkpoint/model_best +``` + +Description of configurable parameters: + +- `model_path`: The path of the trained UIE custom model. +- `test_path`: test dataset path. +- `label_maps_path`: dictionary of student model labels. +- `batch_size`: batch size, default is 8. +- `max_seq_len`: Maximum text length, default is 256. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is an evaluation of closed-domain information extraction, the task type needs to be specified. + + +#### Student model training + +```shell +python train.py\ + --task_type relation_extraction \ + --train_path student_data/train_data.json \ + --dev_path student_data/dev_data.json \ + --label_maps_path student_data/label_maps.json \ + --num_epochs 50 \ + --encoder ernie-3.0-mini-zh +``` + +Description of configurable parameters: + +- `train_path`: training set file path. +- `dev_path`: Validation set file path. +- `batch_size`: batch size, default is 16. +- `learning_rate`: Learning rate, default is 3e-5. +- `save_dir`: model storage path, the default is `./checkpoint`. +- `max_seq_len`: Maximum text length, default is 256. +- `weight_decay`: Indicates the coefficient of weight_decay used in the AdamW optimizer. +- `warmup_proportion`: The proportion of the learning rate warmup strategy. If it is 0.1, the learning rate will slowly increase from 0 to learning_rate during the first 10% training step, and then slowly decay. The default is 0.0. +- `num_epochs`: The number of training epochs, the default is 100. +- `seed`: random seed, default is 1000. +- `encoder`: select the model base of the student model, the default is `ernie-3.0-mini-zh`. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is closed-domain information extraction, the task type needs to be specified. +- `logging_steps`: The interval steps of log printing, the default is 10. +- `eval_steps`: The interval steps of evaluate, the default is 200. +- `device`: What device to choose for training, optional cpu or gpu. 
+- `init_from_ckpt`: optional, model parameter path, hot start model training; default is None. + +#### Student model evaluation + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path student_data/dev_data.json \ + --task_type relation_extraction \ + --label_maps_path student_data/label_maps.json \ + --encoder ernie-3.0-mini-zh +``` + +Description of configurable parameters: + +- `model_path`: The path of the trained UIE custom model. +- `test_path`: test dataset path. +- `label_maps_path`: dictionary of student model labels. +- `batch_size`: batch size, default is 8. +- `max_seq_len`: Maximum text length, default is 256. +- `encoder`: select the model base of the student model, the default is `ernie-3.0-mini-zh`. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is an evaluation of closed-domain information extraction, the task type needs to be specified. + +## Student model deployment + +- Fast deployment of the closed-domain information extraction model through Taskflow, `task_path` is the path of the student model. + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction +>>> pprint(my_ie("Virgo deceleration bomb was developed by the Swedish FFV Ordnance Company specially for the attack aircraft of the Swedish Royal Air Force to carry out low-altitude and high-speed bombing. It was developed in 1956 and entered service in 1963. It is equipped on the A32 "Contradiction", A35 "Dragon", and AJ134 "Thunder" attack aircraft are mainly used to attack landing craft, parked aircraft, anti-aircraft artillery, field artillery, light armored vehicles and active forces.")) +[{'weapon name': [{'end': 14, + 'probability': 0.9976037, + 'relations': {'country of origin': [{'end': 18, + 'probability': 0.9988706, + 'relations': {}, + 'start': 16, + 'text': 'Sweden'}], + 'R&D unit': [{'end': 25, + 'probability': 0.9978277, + 'relations': {}, + 'start': 18, + 'text': 'FFV Ordnance Company'}], + 'type': [{'end': 14, + 'probability': 0.99837446, + 'relations': {}, + 'start': 12, + 'text': 'bomb'}]}, + 'start': 0, + 'text': 'Virgo slowing bomb'}]}] +``` + + +# References + +- **[GlobalPointer](https://kexue.fm/search/globalpointer/)** + +- **[GPLinker](https://kexue.fm/archives/8888)** + +- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch** diff --git a/applications/information_extraction/text/data_distill/criterion.py b/applications/information_extraction/text/data_distill/criterion.py new file mode 100644 index 0000000000000000000000000000000000000000..e5e6c2f9c3c07a800605c57240098b5243ff8634 --- /dev/null +++ b/applications/information_extraction/text/data_distill/criterion.py @@ -0,0 +1,56 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle +import paddle.nn as nn + + +class Criterion(nn.Layer): + """Criterion for GPNet""" + + def __init__(self, mask_zero=True): + self.mask_zero = mask_zero + + def _sparse_multilabel_categorical_crossentropy(self, y_true, y_pred, mask_zero=False): + """Sparse multi-label categorical cross entropy + reference to "https://kexue.fm/archives/7359". + """ + zeros = paddle.zeros_like(y_pred[..., :1]) + y_pred = paddle.concat([y_pred, zeros], axis=-1) + if mask_zero: + infs = zeros + 1e12 + y_pred = paddle.concat([infs, y_pred[..., 1:]], axis=-1) + y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1) + y_pos_1 = paddle.concat([y_pos_2, zeros], axis=-1) + if mask_zero: + y_pred = paddle.concat([-infs, y_pred[..., 1:]], axis=-1) + y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1) + + pos_loss = (-y_pos_1).exp().sum(axis=-1).log() + all_loss = y_pred.exp().sum(axis=-1).log() + aux_loss = y_pos_2.exp().sum(axis=-1).log() - all_loss + aux_loss = paddle.clip(1 - paddle.exp(aux_loss), min=0.1, max=1) + neg_loss = all_loss + paddle.log(aux_loss) + return pos_loss + neg_loss + + def __call__(self, y_pred, y_true): + shape = y_pred.shape + y_true = y_true[..., 0] * shape[2] + y_true[..., 1] + # bs, nclass, seqlen * seqlen + y_pred = paddle.reshape(y_pred, shape=[shape[0], -1, np.prod(shape[2:])]) + + loss = self._sparse_multilabel_categorical_crossentropy(y_true, y_pred, self.mask_zero) + return loss.sum(axis=1).mean() diff --git a/applications/information_extraction/text/data_distill/data_collator.py b/applications/information_extraction/text/data_distill/data_collator.py new file mode 100644 index 0000000000000000000000000000000000000000..2c12b98186ab1516c11af8b0a03576625958ddd9 --- /dev/null +++ b/applications/information_extraction/text/data_distill/data_collator.py @@ -0,0 +1,86 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
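+
+# DataCollator pads a batch of tokenized features and converts the sparse
+# entity / relation span annotations into the dense label tensors expected by
+# the GlobalPointer (entity) and GPLinker (relation / opinion) heads.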
+ +from dataclasses import dataclass +from typing import Dict, List, Optional, Union + +import paddle + +from paddlenlp.transformers.tokenizer_utils_base import ( + PaddingStrategy, + PretrainedTokenizerBase, +) + +ignore_list = ["offset_mapping", "text"] + + +@dataclass +class DataCollator: + tokenizer: PretrainedTokenizerBase + padding: Union[bool, str, PaddingStrategy] = True + max_length: Optional[int] = None + label_maps: Optional[dict] = None + task_type: Optional[str] = None + + def __call__(self, features: List[Dict[str, Union[List[int], paddle.Tensor]]]) -> Dict[str, paddle.Tensor]: + labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None + new_features = [{k: v for k, v in f.items() if k not in ["labels"] + ignore_list} for f in features] + + batch = self.tokenizer.pad( + new_features, + padding=self.padding, + ) + + batch = [paddle.to_tensor(batch[k]) for k in batch.keys()] + + if labels is None: # for test + if "offset_mapping" in features[0].keys(): + batch.append([feature["offset_mapping"] for feature in features]) + if "text" in features[0].keys(): + batch.append([feature["text"] for feature in features]) + return batch + + bs = batch[0].shape[0] + if self.task_type == "entity_extraction": + # Ensure the dimension is greater or equal to 1 + max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1) + num_ents = len(self.label_maps["entity2id"]) + batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64") + for i, lb in enumerate(labels): + for eidx, (l, eh, et) in enumerate(lb["ent_labels"]): + batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et]) + + batch.append([batch_entity_labels]) + else: + # Ensure the dimension is greater or equal to 1 + max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1) + max_spo_num = max(max([len(lb["rel_labels"]) for lb in labels]), 1) + num_ents = len(self.label_maps["entity2id"]) + if "relation2id" in self.label_maps.keys(): + num_rels = len(self.label_maps["relation2id"]) + else: + num_rels = len(self.label_maps["sentiment2id"]) + batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64") + batch_head_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64") + batch_tail_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64") + + for i, lb in enumerate(labels): + for eidx, (l, eh, et) in enumerate(lb["ent_labels"]): + batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et]) + for spidx, (sh, st, p, oh, ot) in enumerate(lb["rel_labels"]): + batch_head_labels[i, p, spidx, :] = paddle.to_tensor([sh, oh]) + batch_tail_labels[i, p, spidx, :] = paddle.to_tensor([st, ot]) + batch.append([batch_entity_labels, batch_head_labels, batch_tail_labels]) + return batch diff --git a/applications/information_extraction/text/data_distill/data_distill.py b/applications/information_extraction/text/data_distill/data_distill.py new file mode 100644 index 0000000000000000000000000000000000000000..b3564311cd97d30712cbce2a3f416504cbfc352e --- /dev/null +++ b/applications/information_extraction/text/data_distill/data_distill.py @@ -0,0 +1,128 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os +import random + +from tqdm import tqdm +from utils import anno2distill, schema2label_maps, set_seed, synthetic2distill + +from paddlenlp import Taskflow +from paddlenlp.utils.log import logger + + +def do_data_distill(): + set_seed(args.seed) + + # Generate closed-domain label maps + if not os.path.exists(args.save_dir): + os.mkdir(args.save_dir) + label_maps = schema2label_maps(args.task_type, schema=args.schema) + label_maps_path = os.path.join(args.save_dir, "label_maps.json") + + # Save closed-domain label maps file + with open(label_maps_path, "w", encoding="utf-8") as fp: + fp.write(json.dumps(label_maps, ensure_ascii=False)) + + # Load doccano file and convert to distill format + sample_index = json.loads( + open(os.path.join(args.data_path, "sample_index.json"), "r", encoding="utf-8").readline() + ) + + train_ids = sample_index["train_ids"] + dev_ids = sample_index["dev_ids"] + test_ids = sample_index["test_ids"] + + if args.platform == "label_studio": + with open(os.path.join(args.data_path, "label_studio.json"), "r", encoding="utf-8") as fp: + json_lines = json.loads(fp.read()) + elif args.platform == "doccano": + json_lines = [] + with open(os.path.join(args.data_path, "doccano_ext.json"), "r", encoding="utf-8") as fp: + for line in fp: + json_lines.append(json.loads(line)) + else: + raise ValueError("Unsupported annotation platform!") + + train_lines = [json_lines[i] for i in train_ids] + train_lines = anno2distill(train_lines, args.task_type, label_maps, args.platform) + + dev_lines = [json_lines[i] for i in dev_ids] + dev_lines = anno2distill(dev_lines, args.task_type, label_maps, args.platform) + + test_lines = [json_lines[i] for i in test_ids] + test_lines = anno2distill(test_lines, args.task_type, label_maps, args.platform) + + # Load trained UIE model + uie = Taskflow("information_extraction", schema=args.schema, task_path=args.model_path) + + if args.synthetic_ratio > 0: + # Generate synthetic data + texts = open(os.path.join(args.data_path, "unlabeled_data.txt"), "r", encoding="utf-8").readlines() + + actual_ratio = math.ceil(len(texts) / len(train_lines)) + if actual_ratio <= args.synthetic_ratio or args.synthetic_ratio == -1: + infer_texts = texts + else: + idxs = random.sample(range(0, len(texts)), args.synthetic_ratio * len(train_lines)) + infer_texts = [texts[i] for i in idxs] + + infer_results = [] + for text in tqdm(infer_texts, desc="Predicting: ", leave=False): + infer_results.extend(uie(text)) + + train_synthetic_lines = synthetic2distill(infer_texts, infer_results, args.task_type) + + # Concat origin and synthetic data + train_lines.extend(train_synthetic_lines) + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." 
% (count, save_path)) + + _save_examples(args.save_dir, "train_data.json", train_lines) + _save_examples(args.save_dir, "dev_data.json", dev_lines) + _save_examples(args.save_dir, "test_data.json", test_lines) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--data_path", default="../data", type=str, help="The directory for labeled data with doccano format and the large scale unlabeled data.") + parser.add_argument("--model_path", type=str, default="../checkpoint/model_best", help="The path of saved model that you want to load.") + parser.add_argument("--save_dir", default="./distill_task", type=str, help="The path of data that you wanna save.") + parser.add_argument("--synthetic_ratio", default=10, type=int, help="The ratio of labeled and synthetic samples.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + parser.add_argument("--platform", choices=['doccano', 'label_studio'], type=str, default="label_studio", help="Select the annotation platform.") + + args = parser.parse_args() + # yapf: enable + + # Define your schema here + schema = {"武器名称": ["产国", "类型", "研发单位"]} + + args.schema = schema + + do_data_distill() diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/README.md b/applications/information_extraction/text/data_distill/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..22b4783242750951949a7abaf3072d42a20c74b9 --- /dev/null +++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/README.md @@ -0,0 +1,58 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server服务启动](#Server服务启动) +- [Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 +#### schema替换 +```python +# Default schema +schema = {"武器名称": ["产国", "类型", "研发单位"]} +``` + +#### 设置模型路径 +``` +# Default task_path +uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 +``` +uie1 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service.register_taskflow('uie', [uie1, uie2]) +``` + +### Client 自定义参数 + +```python +# Changed to input texts you wanted +texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。'] + +``` diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/README_en.md b/applications/information_extraction/text/data_distill/deploy/simple_serving/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..8337a2fbc18f8dfdcc10aa1be16f96950e162d7b --- 
/dev/null +++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/README_en.md @@ -0,0 +1,64 @@
+# Service deployment based on PaddleNLP SimpleServing
+
+- [Environment Preparation](#1)
+- [Server](#2)
+- [Client](#3)
+- [Service Custom Parameters](#4)
+
+
+
+## Environment Preparation
+Use a PaddleNLP version that includes the SimpleServing feature (or the latest develop version)
+
+```shell
+pip install paddlenlp >= 2.4.4
+```
+
+
+
+## Server
+
+```bash
+paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
+```
+
+
+
+## Client
+
+```bash
+python client.py
+```
+
+
+
+## Service Custom Parameters
+
+### Server Custom Parameters
+
+#### Schema replacement
+```python
+# Default schema
+schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]}
+```
+
+#### Set model path
+```
+# Default task_path
+uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema)
+```
+
+#### Multi-card service prediction
+PaddleNLP SimpleServing supports multi-card load-balanced prediction: simply register two Taskflow tasks when registering the service, as in the sample code below.
+```
+uie1 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
+uie2 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
+service.register_taskflow('uie', [uie1, uie2])
+```
+
+### Client Custom Parameters
+
+```python
+# Change to the input texts you want
+texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
+```
diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/client.py b/applications/information_extraction/text/data_distill/deploy/simple_serving/client.py
new file mode 100644
index 0000000000000000000000000000000000000000..cd2914e22b2b8b1c46db97facc09bdc6b5ac3957
--- /dev/null
+++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/client.py
@@ -0,0 +1,29 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
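+
+# Minimal SimpleServing client: posts the example texts to the Taskflow UIE
+# endpoint exposed by server.py and prints the JSON extraction results.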
+
+import json
+
+import requests
+
+url = "http://0.0.0.0:8189/taskflow/uie"
+
+headers = {"Content-Type": "application/json"}
+texts = [
+    "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"
+]
+
+data = {"data": {"text": texts}}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+datas = json.loads(r.text)
+print(datas)
diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/server.py b/applications/information_extraction/text/data_distill/deploy/simple_serving/server.py
new file mode 100644
index 0000000000000000000000000000000000000000..dadb51a6dc04822869bd141a4a62c76c70012692
--- /dev/null
+++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/server.py
@@ -0,0 +1,25 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp import SimpleServer, Taskflow
+
+# Change the schema to the one defined for your task
+schema = {"武器名称": ["产国", "类型", "研发单位"]}
+# Change task_path to the path of your best model checkpoint
+uie = Taskflow(
+    "information_extraction", model="uie-data-distill-gp", schema=schema, task_path="../../checkpoint/model_best/"
+)
+# Register the fine-tuned closed-domain model as a SimpleServing service
+app = SimpleServer()
+app.register_taskflow("taskflow/uie", uie)
diff --git a/applications/information_extraction/text/data_distill/evaluate.py b/applications/information_extraction/text/data_distill/evaluate.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6cee13f165d2591967c8ee6b97fa7bc7152a153
--- /dev/null
+++ b/applications/information_extraction/text/data_distill/evaluate.py
@@ -0,0 +1,96 @@
+# coding=utf-8
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
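+
+# End-to-end evaluation of the closed-domain student model: loads a trained
+# GlobalPointer / GPLinker checkpoint, runs it over the test set and reports
+# precision / recall / F1 via metric.get_eval.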
+ +import argparse +import os + +import paddle +from metric import get_eval +from tqdm import tqdm +from utils import create_dataloader, get_label_maps, postprocess, reader + +from paddlenlp.datasets import load_dataset +from paddlenlp.layers import ( + GlobalPointerForEntityExtraction, + GPLinkerForRelationExtraction, +) +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, dataloader, label_maps, task_type="relation_extraction"): + model.eval() + all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else [] + for batch in tqdm(dataloader, desc="Evaluating: ", leave=False): + input_ids, attention_masks, offset_mappings, texts = batch + logits = model(input_ids, attention_masks) + batch_outputs = postprocess(logits, offset_mappings, texts, label_maps, task_type) + if isinstance(batch_outputs, tuple): + all_preds[0].extend(batch_outputs[0]) # Entity output + all_preds[1].extend(batch_outputs[1]) # Relation output + else: + all_preds.extend(batch_outputs) + eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type) + model.train() + return eval_results + + +def do_eval(): + label_maps = get_label_maps(args.task_type, args.label_maps_path) + + tokenizer = AutoTokenizer.from_pretrained(args.encoder) + encoder = AutoModel.from_pretrained(args.encoder) + if args.task_type == "entity_extraction": + model = GlobalPointerForEntityExtraction(encoder, label_maps) + else: + model = GPLinkerForRelationExtraction(encoder, label_maps) + + if args.model_path: + state_dict = paddle.load(os.path.join(args.model_path, "model_state.pdparams")) + model.set_dict(state_dict) + + test_ds = load_dataset(reader, data_path=args.test_path, lazy=False) + + test_dataloader = create_dataloader( + test_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="test", + task_type=args.task_type, + ) + + eval_result = evaluate(model, test_dataloader, label_maps, task_type=args.task_type) + logger.info("Evaluation precision: " + str(eval_result)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.") + parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=128, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/information_extraction/text/data_distill/evaluate_teacher.py b/applications/information_extraction/text/data_distill/evaluate_teacher.py new file mode 100644 index 0000000000000000000000000000000000000000..318c1f9b0d8122f53a17881548fdae566663e651 --- /dev/null +++ 
b/applications/information_extraction/text/data_distill/evaluate_teacher.py @@ -0,0 +1,95 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from metric import get_eval +from tqdm import tqdm +from utils import create_dataloader, get_label_maps, reader, synthetic2distill + +from paddlenlp import Taskflow +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(uie, dataloader, task_type="relation_extraction"): + all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else [] + + infer_results = [] + all_texts = [] + for batch in tqdm(dataloader, desc="Evaluating: ", leave=False): + _, _, _, texts = batch + all_texts.extend(texts) + infer_results.extend(uie(texts)) + + infer_results = synthetic2distill(all_texts, infer_results, task_type) + + for res in infer_results: + if task_type == "entity_extraction": + all_preds.append(res["entity_list"]) + else: + all_preds[0].append(res["entity_list"]) + all_preds[1].append(res["spo_list"]) + + eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type) + return eval_results + + +def do_eval(): + # Load trained UIE model + uie = Taskflow("information_extraction", schema=args.schema, batch_size=args.batch_size, task_path=args.model_path) + + label_maps = get_label_maps(args.task_type, args.label_maps_path) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") + + test_ds = load_dataset(reader, data_path=args.test_path, lazy=False) + + test_dataloader = create_dataloader( + test_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="test", + task_type=args.task_type, + ) + + eval_result = evaluate(uie, test_dataloader, task_type=args.task_type) + logger.info("Evaluation precision: " + str(eval_result)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.") + parser.add_argument("--batch_size", type=int, default=8, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=256, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + + args = parser.parse_args() + # yapf: enable + + schema = {"武器名称": ["产国", "类型", "研发单位"]} + + args.schema = schema + + do_eval() 
diff --git a/applications/information_extraction/text/data_distill/metric.py b/applications/information_extraction/text/data_distill/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..0e140642de28350bb3f567159d485d2f0d49fb62 --- /dev/null +++ b/applications/information_extraction/text/data_distill/metric.py @@ -0,0 +1,72 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def get_eval(all_preds, raw_data, task_type): + if task_type == "entity_extraction": + ex, ey, ez = 1e-10, 1e-10, 1e-10 + for ent_preds, data in zip(all_preds, raw_data): + pred_ent_set = set([tuple(p.values()) for p in ent_preds]) + gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]]) + ex += len(pred_ent_set & gold_ent_set) + ey += len(pred_ent_set) + ez += len(gold_ent_set) + ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0 + ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0 + ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0 + + return { + "entity_f1": ent_f1, + "entity_precision": ent_precision, + "entity_recall": ent_recall, + } + else: + all_ent_preds, all_rel_preds = all_preds + + ex, ey, ez = 1e-10, 1e-10, 1e-10 + for ent_preds, data in zip(all_ent_preds, raw_data): + pred_ent_set = set([tuple(p.values()) for p in ent_preds]) + gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]]) + ex += len(pred_ent_set & gold_ent_set) + ey += len(pred_ent_set) + ez += len(gold_ent_set) + ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0 + ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0 + ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0 + + rx, ry, rz = 1e-10, 1e-10, 1e-10 + + for rel_preds, raw_data in zip(all_rel_preds, raw_data): + pred_rel_set = set([tuple(p.values()) for p in rel_preds]) + if task_type == "opinion_extraction": + gold_rel_set = set([tuple(g.values()) for g in raw_data["aso_list"]]) + else: + gold_rel_set = set([tuple(g.values()) for g in raw_data["spo_list"]]) + rx += len(pred_rel_set & gold_rel_set) + ry += len(pred_rel_set) + rz += len(gold_rel_set) + + rel_f1 = round(2 * rx / (ry + rz), 5) if rx != 1e-10 else 0.0 + rel_precision = round(rx / ry, 5) if ry != 1e-10 else 0.0 + rel_recall = round(rx / rz, 5) if rz != 1e-10 else 0.0 + + return { + "entity_f1": ent_f1, + "entity_precision": ent_precision, + "entity_recall": ent_recall, + "relation_f1": rel_f1, + "relation_precision": rel_precision, + "relation_recall": rel_recall, + } diff --git a/applications/information_extraction/text/data_distill/train.py b/applications/information_extraction/text/data_distill/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e52cb84602697b2abd43ba172e0ba240884c2224 --- /dev/null +++ b/applications/information_extraction/text/data_distill/train.py @@ -0,0 +1,190 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import paddle +from criterion import Criterion +from evaluate import evaluate +from utils import ( + create_dataloader, + criteria_map, + get_label_maps, + reader, + save_model_config, + set_seed, +) + +from paddlenlp.datasets import load_dataset +from paddlenlp.layers import ( + GlobalPointerForEntityExtraction, + GPLinkerForRelationExtraction, +) +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup +from paddlenlp.utils.log import logger + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + + label_maps = get_label_maps(args.task_type, args.label_maps_path) + + train_ds = load_dataset(reader, data_path=args.train_path, lazy=False) + dev_ds = load_dataset(reader, data_path=args.dev_path, lazy=False) + tokenizer = AutoTokenizer.from_pretrained(args.encoder) + + train_dataloader = create_dataloader( + train_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="train", + task_type=args.task_type, + ) + + dev_dataloader = create_dataloader( + dev_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="dev", + task_type=args.task_type, + ) + + encoder = AutoModel.from_pretrained(args.encoder) + if args.task_type == "entity_extraction": + model = GlobalPointerForEntityExtraction(encoder, label_maps) + else: + model = GPLinkerForRelationExtraction(encoder, label_maps) + + model_config = {"task_type": args.task_type, "label_maps": label_maps, "encoder": args.encoder} + + num_training_steps = len(train_dataloader) * args.num_epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
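+    # The exclusion below is a substring match on the parameter name ("bias" / "norm").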
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + criterion = Criterion() + + global_step, best_f1 = 1, 0.0 + tr_loss, logging_loss = 0.0, 0.0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_dataloader: + input_ids, attention_masks, labels = batch + + logits = model(input_ids, attention_masks) + + loss = sum([criterion(o, l) for o, l in zip(logits, labels)]) / 3 + + loss.backward() + + tr_loss += loss.item() + + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = (tr_loss - logging_loss) / args.logging_steps + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + logging_loss = tr_loss + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + save_model_config(save_dir, model_config) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + + eval_result = evaluate(model, dev_dataloader, label_maps, task_type=args.task_type) + logger.info("Evaluation precision: " + str(eval_result)) + + f1 = eval_result[criteria_map[args.task_type]] + if f1 > best_f1: + logger.info(f"best F1 performance has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + save_model_config(save_dir, model_config) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + tic_train = time.time() + + global_step += 1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum input sequence length.") + parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--warmup_proportion", default=0.0, 
type=float, help="Linear warmup proportion over the training process.") + parser.add_argument("--num_epochs", default=100, type=int, help="Number of epoches for training.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--eval_steps", default=200, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/information_extraction/text/data_distill/utils.py b/applications/information_extraction/text/data_distill/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..1532ee86185a6688c38d4186f3d89f6d788b1447 --- /dev/null +++ b/applications/information_extraction/text/data_distill/utils.py @@ -0,0 +1,554 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
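+
+# Shared helpers for UIE data distillation: schema / label-map construction,
+# conversion of doccano, label_studio and Taskflow prediction outputs into the
+# closed-domain training format, dataloader creation, and decoding of
+# GlobalPointer / GPLinker outputs back into entity / relation dicts.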
+ +import copy +import json +import os +import random + +import numpy as np +import paddle +from data_collator import DataCollator + +from paddlenlp.taskflow.utils import SchemaTree +from paddlenlp.utils.log import logger + +criteria_map = { + "entity_extraction": "entity_f1", + "opinion_extraction": "relation_f1", # (Aspect, Sentiment, Opinion) + "relation_extraction": "relation_f1", # (Subject, Predicate, Object) + "event_extraction": "relation_f1", # (Trigger, Role, Argument) +} + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def reader(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + yield json_line + + +def save_model_config(save_dir, model_config): + model_config_file = os.path.join(save_dir, "model_config.json") + with open(model_config_file, "w", encoding="utf-8") as fp: + fp.write(json.dumps(model_config, ensure_ascii=False, indent=2)) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def get_label_maps(task_type="relation_extraction", label_maps_path=None): + with open(label_maps_path, "r", encoding="utf-8") as fp: + label_maps = json.load(fp) + if task_type == "entity_extraction": + entity2id = label_maps["entity2id"] + id2entity = {idx: t for t, idx in entity2id.items()} + label_maps["id2entity"] = id2entity + else: + entity2id = label_maps["entity2id"] + relation2id = ( + label_maps["relation2id"] + if task_type in ["relation_extraction", "event_extraction"] + else label_maps["sentiment2id"] + ) + id2entity = {idx: t for t, idx in entity2id.items()} + id2relation = {idx: t for t, idx in relation2id.items()} + label_maps["id2entity"] = id2entity + label_maps["id2relation"] = id2relation + return label_maps + + +def create_dataloader( + dataset, tokenizer, max_seq_len=128, batch_size=1, label_maps=None, mode="train", task_type="relation_extraction" +): + def tokenize_and_align_train_labels(example): + tokenized_inputs = tokenizer( + example["text"], + max_length=max_seq_len, + padding=False, + truncation=True, + return_attention_mask=True, + return_token_type_ids=False, + return_offsets_mapping=True, + ) + offset_mapping = tokenized_inputs["offset_mapping"] + + ent_labels = [] + for e in example["entity_list"]: + _start, _end = e["start_index"], e["start_index"] + len(e["text"]) - 1 + start = map_offset(_start, offset_mapping) + end = map_offset(_end, offset_mapping) + if start == -1 or end == -1: + continue + label = label_maps["entity2id"][e["type"]] + ent_labels.append([label, start, end]) + + outputs = { + "input_ids": tokenized_inputs["input_ids"], + "attention_mask": tokenized_inputs["attention_mask"], + "labels": {"ent_labels": ent_labels, "rel_labels": []}, + } + + if task_type in ["relation_extraction", "event_extraction"]: + rel_labels = [] + for r in example["spo_list"]: + _sh, _oh = r["subject_start_index"], r["object_start_index"] + _st, _ot = _sh + len(r["subject"]) - 1, _oh + len(r["object"]) - 1 + sh = map_offset(_sh, offset_mapping) + st = map_offset(_st, offset_mapping) + oh = map_offset(_oh, offset_mapping) + ot = map_offset(_ot, offset_mapping) + if sh == -1 or st == -1 or oh == -1 or ot == -1: + continue + p = label_maps["relation2id"][r["predicate"]] + rel_labels.append([sh, st, p, oh, ot]) + outputs["labels"]["rel_labels"] = rel_labels + elif task_type == "opinion_extraction": + rel_labels 
= [] + for r in example["aso_list"]: + _ah, _oh = r["aspect_start_index"], r["opinion_start_index"] + _at, _ot = _ah + len(r["aspect"]) - 1, _oh + len(r["opinion"]) - 1 + ah = map_offset(_ah, offset_mapping) + at = map_offset(_at, offset_mapping) + oh = map_offset(_oh, offset_mapping) + ot = map_offset(_ot, offset_mapping) + if ah == -1 or at == -1 or oh == -1 or ot == -1: + continue + + s = label_maps["sentiment2id"][r["sentiment"]] + rel_labels.append([ah, at, s, oh, ot]) + outputs["labels"]["rel_labels"] = rel_labels + return outputs + + def tokenize(example): + tokenized_inputs = tokenizer( + example["text"], + max_length=max_seq_len, + padding=False, + truncation=True, + return_attention_mask=True, + return_offsets_mapping=True, + return_token_type_ids=False, + ) + tokenized_inputs["text"] = example["text"] + return tokenized_inputs + + if mode == "train": + dataset = dataset.map(tokenize_and_align_train_labels) + else: + dataset_copy = copy.deepcopy(dataset) + dataset = dataset.map(tokenize) + + data_collator = DataCollator(tokenizer, label_maps=label_maps, task_type=task_type) + + shuffle = True if mode == "train" else False + batch_sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=data_collator, num_workers=0, return_list=True + ) + if mode != "train": + dataloader.dataset.raw_data = dataset_copy + return dataloader + + +def postprocess(batch_outputs, offset_mappings, texts, label_maps, task_type="relation_extraction"): + if task_type == "entity_extraction": + batch_ent_results = [] + for entity_output, offset_mapping, text in zip(batch_outputs[0].numpy(), offset_mappings, texts): + entity_output[:, [0, -1]] -= np.inf + entity_output[:, :, [0, -1]] -= np.inf + ent_list = [] + for l, start, end in zip(*np.where(entity_output > 0.0)): + start, end = (offset_mapping[start][0], offset_mapping[end][-1]) + ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start} + ent_list.append(ent) + batch_ent_results.append(ent_list) + return batch_ent_results + else: + batch_ent_results = [] + batch_rel_results = [] + for entity_output, head_output, tail_output, offset_mapping, text in zip( + batch_outputs[0].numpy(), + batch_outputs[1].numpy(), + batch_outputs[2].numpy(), + offset_mappings, + texts, + ): + entity_output[:, [0, -1]] -= np.inf + entity_output[:, :, [0, -1]] -= np.inf + ents = set() + ent_list = [] + for l, start, end in zip(*np.where(entity_output > 0.0)): + ents.add((start, end)) + start, end = (offset_mapping[start][0], offset_mapping[end][-1]) + ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start} + ent_list.append(ent) + batch_ent_results.append(ent_list) + + rel_list = [] + for sh, st in ents: + for oh, ot in ents: + p1s = np.where(head_output[:, sh, oh] > 0.0)[0] + p2s = np.where(tail_output[:, st, ot] > 0.0)[0] + ps = set(p1s) & set(p2s) + for p in ps: + if task_type in ["relation_extraction", "event_extraction"]: + rel = { + "subject": text[offset_mapping[sh][0] : offset_mapping[st][1]], + "predicate": label_maps["id2relation"][p], + "object": text[offset_mapping[oh][0] : offset_mapping[ot][1]], + "subject_start_index": offset_mapping[sh][0], + "object_start_index": offset_mapping[oh][0], + } + else: + rel = { + "aspect": text[offset_mapping[sh][0] : offset_mapping[st][1]], + "sentiment": label_maps["id2relation"][p], + "opinion": text[offset_mapping[oh][0] : 
offset_mapping[ot][1]], + "aspect_start_index": offset_mapping[sh][0], + "opinion_start_index": offset_mapping[oh][0], + } + rel_list.append(rel) + batch_rel_results.append(rel_list) + return (batch_ent_results, batch_rel_results) + + +def build_tree(schema, name="root"): + """ + Build the schema tree. + """ + schema_tree = SchemaTree(name) + for s in schema: + if isinstance(s, str): + schema_tree.add_child(SchemaTree(s)) + elif isinstance(s, dict): + for k, v in s.items(): + if isinstance(v, str): + child = [v] + elif isinstance(v, list): + child = v + else: + raise TypeError( + "Invalid schema, value for each key:value pairs should be list or string" + "but {} received".format(type(v)) + ) + schema_tree.add_child(build_tree(child, name=k)) + else: + raise TypeError("Invalid schema, element should be string or dict, " "but {} received".format(type(s))) + return schema_tree + + +def schema2label_maps(task_type, schema=None): + if schema and isinstance(schema, dict): + schema = [schema] + + label_maps = {} + if task_type == "entity_extraction": + entity2id = {} + for s in schema: + entity2id[s] = len(entity2id) + + label_maps["entity2id"] = entity2id + elif task_type == "opinion_extraction": + schema = ["观点词", {"评价维度": ["观点词", "情感倾向[正向,负向]"]}] + logger.info("Opinion extraction does not support custom schema, the schema is default to %s." % schema) + label_maps["entity2id"] = {"评价维度": 0, "观点词": 1} + label_maps["sentiment2id"] = {"正向": 0, "负向": 1} + else: + entity2id = {} + relation2id = {} + schema_tree = build_tree(schema) + schema_list = schema_tree.children[:] + while len(schema_list) > 0: + node = schema_list.pop(0) + + if node.name not in entity2id.keys() and len(node.children) != 0: + entity2id[node.name] = len(entity2id) + + for child in node.children: + if child.name not in relation2id.keys(): + relation2id[child.name] = len(relation2id) + schema_list.append(child) + + entity2id["object"] = len(entity2id) + label_maps["entity2id"] = entity2id + label_maps["relation2id"] = relation2id + + label_maps["schema"] = schema + return label_maps + + +def anno2distill(json_lines, task_type, label_maps=None, platform="label_studio"): + if platform == "label_studio": + return label_studio2distill(json_lines, task_type, label_maps) + else: + return doccano2distill(json_lines, task_type, label_maps) + + +def label_studio2distill(json_lines, task_type, label_maps=None): + """Convert label-studio to distill format""" + if task_type == "opinion_extraction": + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["data"]["text"] + output = {"text": text} + entity_list = [] + aso_list = [] + annos = json_line["annotations"][0]["result"] + for anno in annos: + if anno["type"] == "labels": + ent_text = text[anno["value"]["start"] : anno["value"]["end"]] + ent_type_gather = anno["value"]["labels"][0].split("##") + if len(ent_type_gather) == 2: + ent_type, ent_senti = ent_type_gather + else: + ent_type = ent_type_gather[0] + ent_senti = None + ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]} + id2ent[anno["id"]] = ent + id2ent[anno["id"]]["sentiment"] = ent_senti + entity_list.append(ent) + else: + _aspect = id2ent[anno["from_id"]] + if _aspect["sentiment"]: + _opinion = id2ent[anno["to_id"]] + rel = { + "aspect": _aspect["text"], + "sentiment": _aspect["sentiment"], + "opinion": _opinion["text"], + "aspect_start_index": _aspect["start_index"], + "opinion_start_index": _opinion["start_index"], + } + aso_list.append(rel) + output["aso_list"] = aso_list 
+ output["entity_list"] = entity_list + output["aso_list"] = aso_list + outputs.append(output) + else: + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["data"]["text"] + output = {"text": text} + entity_list = [] + spo_list = [] + annos = json_line["annotations"][0]["result"] + for anno in annos: + if anno["type"] == "labels": + ent_text = text[anno["value"]["start"] : anno["value"]["end"]] + ent_label = anno["value"]["labels"][0] + ent_type = "object" if ent_label not in label_maps["entity2id"].keys() else ent_label + ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]} + id2ent[anno["id"]] = ent + entity_list.append(ent) + else: + _subject = id2ent[anno["from_id"]] + _object = id2ent[anno["to_id"]] + rel = { + "subject": _subject["text"], + "predicate": anno["labels"][0], + "object": _object["text"], + "subject_start_index": _subject["start_index"], + "object_start_index": _object["start_index"], + } + spo_list.append(rel) + output["entity_list"] = entity_list + output["spo_list"] = spo_list + outputs.append(output) + return outputs + + +def doccano2distill(json_lines, task_type, label_maps=None): + """Convert doccano to distill format""" + if task_type == "opinion_extraction": + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["text"] + output = {"text": text} + entity_list = [] + entities = json_line["entities"] + for entity in entities: + ent_text = text[entity["start_offset"] : entity["end_offset"]] + ent_type_gather = entity["label"].split("##") + if len(ent_type_gather) == 2: + ent_type, ent_senti = ent_type_gather + else: + ent_type = ent_type_gather[0] + ent_senti = None + ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]} + id2ent[entity["id"]] = ent + id2ent[entity["id"]]["sentiment"] = ent_senti + entity_list.append(ent) + output["entity_list"] = entity_list + aso_list = [] + relations = json_line["relations"] + for relation in relations: + _aspect = id2ent[relation["from_id"]] + if _aspect["sentiment"]: + _opinion = id2ent[relation["to_id"]] + rel = { + "aspect": _aspect["text"], + "sentiment": _aspect["sentiment"], + "opinion": _opinion["text"], + "aspect_start_index": _aspect["start_index"], + "opinion_start_index": _opinion["start_index"], + } + aso_list.append(rel) + output["aso_list"] = aso_list + outputs.append(output) + else: + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["text"] + output = {"text": text} + entity_list = [] + entities = json_line["entities"] + for entity in entities: + ent_text = text[entity["start_offset"] : entity["end_offset"]] + if entity["label"] not in label_maps["entity2id"].keys(): + if task_type == "entity_extraction": + logger.warning( + "Found undefined label type. The setting of schema should contain all the label types in annotation file export from annotation platform." 
+ ) + continue + else: + ent_type = "object" + else: + ent_type = entity["label"] + ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]} + id2ent[entity["id"]] = ent + entity_list.append(ent) + output["entity_list"] = entity_list + spo_list = [] + relations = json_line["relations"] + for relation in relations: + _subject = id2ent[relation["from_id"]] + _object = id2ent[relation["to_id"]] + rel = { + "subject": _subject["text"], + "predicate": relation["type"], + "object": _object["text"], + "subject_start_index": _subject["start_index"], + "object_start_index": _object["start_index"], + } + spo_list.append(rel) + output["spo_list"] = spo_list + outputs.append(output) + return outputs + + +def synthetic2distill(texts, infer_results, task_type, label_maps=None): + """Convert synthetic data to distill format""" + if task_type == "opinion_extraction": + outputs = [] + for i, line in enumerate(infer_results): + pred = line + output = {"text": texts[i]} + + entity_list = [] + aso_list = [] + for key1 in pred.keys(): + for s in pred[key1]: + ent = {"text": s["text"], "type": key1, "start_index": s["start"]} + entity_list.append(ent) + + if ( + "relations" in s.keys() + and "观点词" in s["relations"].keys() + and "情感倾向[正向,负向]" in s["relations"].keys() + ): + for o in s["relations"]["观点词"]: + rel = { + "aspect": s["text"], + "sentiment": s["relations"]["情感倾向[正向,负向]"][0]["text"], + "opinion": o["text"], + "aspect_start_index": s["start"], + "opinion_start_index": o["start"], + } + aso_list.append(rel) + + ent = {"text": o["text"], "type": "观点词", "start_index": o["start"]} + entity_list.append(ent) + output["entity_list"] = entity_list + output["aso_list"] = aso_list + outputs.append(output) + else: + outputs = [] + for i, line in enumerate(infer_results): + pred = line + output = {"text": texts[i]} + + entity_list = [] + spo_list = [] + for key1 in pred.keys(): + for s in pred[key1]: + ent = {"text": s["text"], "type": key1, "start_index": s["start"]} + entity_list.append(ent) + if "relations" in s.keys(): + for key2 in s["relations"].keys(): + for o1 in s["relations"][key2]: + if "start" in o1.keys(): + rel = { + "subject": s["text"], + "predicate": key2, + "object": o1["text"], + "subject_start_index": s["start"], + "object_start_index": o1["start"], + } + spo_list.append(rel) + + if "relations" not in o1.keys(): + ent = {"text": o1["text"], "type": "object", "start_index": o1["start"]} + entity_list.append(ent) + else: + ent = {"text": o1["text"], "type": key2, "start_index": o1["start"]} + entity_list.append(ent) + for key3 in o1["relations"].keys(): + for o2 in o1["relations"][key3]: + ent = { + "text": o2["text"], + "type": "object", + "start_index": o2["start"], + } + entity_list.append(ent) + + rel = { + "subject": o1["text"], + "predicate": key3, + "object": o2["text"], + "subject_start_index": o1["start"], + "object_start_index": o2["start"], + } + spo_list.append(rel) + output["entity_list"] = entity_list + output["spo_list"] = spo_list + outputs.append(output) + return outputs diff --git a/applications/information_extraction/text/deploy/simple_serving/README.md b/applications/information_extraction/text/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0624a674e9472083bf9180c488e3bfa127da2aef --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/README.md @@ -0,0 +1,58 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server服务启动](#Server服务启动) +- 
[Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 +#### schema替换 +```python +# Default schema +schema = {"武器名称": ["产国", "类型", "研发单位"]} +``` + +#### 设置模型路径 +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service.register_taskflow('uie', [uie1, uie2]) +``` + +### Client 自定义参数 + +```python +# Changed to input texts you wanted +texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆>艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。'] + +``` diff --git a/applications/information_extraction/text/deploy/simple_serving/README_en.md b/applications/information_extraction/text/deploy/simple_serving/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..4736bd34e5bf271454d0222f80ba3c51832b59c1 --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/README_en.md @@ -0,0 +1,64 @@ +# Service deployment based on PaddleNLP SimpleServing + +- [Environment Preparation](#1) +- [Server](#2) +- [Client](#3) +- [Service Custom Parameters](#4) + + + +## Environment Preparation +Use the PaddleNLP version with SimpleServing function (or the latest develop version) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + + +## Server + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + + + +## Client + +```bash +python client.py +``` + + + +## Service Custom Parameters + +### Server Custom Parameters + +#### schema replacement +```python +# Default schema +schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]} +``` + +#### Set model path +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### Doka Service Prediction +PaddleNLP SimpleServing supports multi-card load balancing prediction, mainly during service registration, just register two Taskflow tasks, the following is the sample code +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service. 
register_taskflow('uie', [uie1, uie2]) +``` + +### Client Custom Parameters + +```python +# Changed to input texts you wanted +texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。'] +``` diff --git a/applications/information_extraction/text/deploy/simple_serving/client.py b/applications/information_extraction/text/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..cd2914e22b2b8b1c46db97facc09bdc6b5ac3957 --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/client.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/uie" + +headers = {"Content-Type": "application/json"} +texts = [ + "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆>艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。" +] + +data = {"data": {"text": texts}} +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/information_extraction/text/deploy/simple_serving/server.py b/applications/information_extraction/text/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..3bb193e6311a9546cae019262ae5286bc7fc6a5a --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/server.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = {"武器名称": ["产国", "类型", "研发单位"]} +# The task path changed to your best model path +uie = Taskflow("information_extraction", schema=schema, task_path="../../checkpoint/model_best/") +# If you want to define the finetuned uie service +app = SimpleServer() +app.register_taskflow("taskflow/uie", uie) diff --git a/applications/information_extraction/text/evaluate.py b/applications/information_extraction/text/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..4e9607583fc6d578f8bc0ce1ea409a579d56b576 --- /dev/null +++ b/applications/information_extraction/text/evaluate.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from utils import convert_example, create_data_loader, reader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, UIEM, AutoTokenizer +from paddlenlp.utils.ie_utils import get_relation_type_dict, unify_prompt_name +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, multilingual=False): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + multilingual(bool): Whether is the multilingual model. + """ + model.eval() + metric.reset() + for batch in data_loader: + if multilingual: + start_prob, end_prob = model(batch["input_ids"], batch["position_ids"]) + else: + start_prob, end_prob = model( + batch["input_ids"], batch["token_type_ids"], batch["position_ids"], batch["attention_mask"] + ) + + start_ids = paddle.cast(batch["start_positions"], "float32") + end_ids = paddle.cast(batch["end_positions"], "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def do_eval(): + paddle.set_device(args.device) + + if args.model_path in ["uie-m-base", "uie-m-large"]: + args.multilingual = True + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + if args.multilingual: + model = UIEM.from_pretrained(args.model_path) + else: + model = UIE.from_pretrained(args.model_path) + + test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False) + class_dict = {} + relation_data = [] + if args.debug: + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if len(data["result_list"]) != 0: + p = "的" if args.schema_lang == "ch" else " of " + if p not in data["prompt"]: + class_dict.setdefault(class_name, []).append(data) + else: + relation_data.append((data["prompt"], data)) + + relation_type_dict = get_relation_type_dict(relation_data, schema_lang=args.schema_lang) + else: + class_dict["all_classes"] = test_ds + + trans_fn = partial( + convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, multilingual=args.multilingual + ) + + for key in class_dict.keys(): + if args.debug: + test_ds = MapDataset(class_dict[key]) + else: + test_ds = class_dict[key] + test_ds = test_ds.map(trans_fn) + + data_collator = DataCollatorWithPadding(tokenizer) + + test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator) + + metric = 
SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader, args.multilingual) + logger.info("-----------------------------") + logger.info("Class Name: %s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + if args.debug and len(relation_type_dict.keys()) != 0: + for key in relation_type_dict.keys(): + test_ds = MapDataset(relation_type_dict[key]) + test_ds = test_ds.map(trans_fn) + test_data_loader = create_data_loader( + test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator + ) + + metric = SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + if args.schema_lang == "ch": + logger.info("Class Name: X的%s" % key) + else: + logger.info("Class Name: %s of X" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--device", type=str, default="gpu", choices=["gpu", "cpu", "npu"], help="Device selected for evaluate.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.") + parser.add_argument("--multilingual", action='store_true', help="Whether is the multilingual model.") + parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/information_extraction/text/finetune.py b/applications/information_extraction/text/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..342ec6284574be7c949e2c74b5b0437fbd89b7e5 --- /dev/null +++ b/applications/information_extraction/text/finetune.py @@ -0,0 +1,243 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os +from dataclasses import dataclass, field +from functools import partial +from typing import List, Optional + +import paddle +from utils import convert_example, reader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.trainer import ( + CompressionArguments, + PdArgumentParser, + Trainer, + get_last_checkpoint, +) +from paddlenlp.transformers import UIE, UIEM, AutoTokenizer, export_model +from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + train_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + dev_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + dynamic_max_length: Optional[List[int]] = field( + default=None, + metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: Optional[str] = field( + default="uie-base", + metadata={ + "help": "Path to pretrained model, such as 'uie-base', 'uie-tiny', " + "'uie-medium', 'uie-mini', 'uie-micro', 'uie-nano', 'uie-base-en', " + "'uie-m-base', 'uie-m-large', or finetuned model path." + }, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + multilingual: bool = field(default=False, metadata={"help": "Whether the model is a multilingual model."}) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.label_names = ["start_positions", "end_positions"] + + if model_args.model_name_or_path in ["uie-m-base", "uie-m-large"]: + model_args.multilingual = True + elif os.path.exists(os.path.join(model_args.model_name_or_path, "model_config.json")): + with open(os.path.join(model_args.model_name_or_path, "model_config.json")) as f: + init_class = json.load(f)["init_class"] + if init_class == "UIEM": + model_args.multilingual = True + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + if model_args.multilingual: + model = UIEM.from_pretrained(model_args.model_name_or_path) + else: + model = UIE.from_pretrained(model_args.model_name_or_path) + + train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_length, lazy=False) + dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_length, lazy=False) + + trans_fn = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=data_args.max_seq_length, + multilingual=model_args.multilingual, + dynamic_max_length=data_args.dynamic_max_length, + ) + + train_ds = train_ds.map(trans_fn) + dev_ds = dev_ds.map(trans_fn) + + if training_args.device == "npu": + data_collator = DataCollatorWithPadding(tokenizer, padding="longest") + else: + data_collator = DataCollatorWithPadding(tokenizer) + + trainer = Trainer( + model=model, + criterion=uie_loss_func, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train or training_args.do_compress else None, + eval_dataset=dev_ds if training_args.do_eval or training_args.do_compress else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + trainer.optimizer = paddle.optimizer.AdamW( + learning_rate=training_args.learning_rate, parameters=model.parameters() + ) + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + if training_args.device == "npu": + # npu will transform int64 to int32 for internal calculation. + # To reduce useless transformation, we feed int32 inputs. 
+ input_spec_dtype = "int32" + else: + input_spec_dtype = "int64" + if model_args.multilingual: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"), + ] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="token_type_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="attention_mask"), + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + if training_args.do_compress: + + @paddle.no_grad() + def custom_evaluate(self, model, data_loader): + metric = SpanEvaluator() + model.eval() + metric.reset() + for batch in data_loader: + if model_args.multilingual: + logits = model(input_ids=batch["input_ids"], position_ids=batch["position_ids"]) + else: + logits = model( + input_ids=batch["input_ids"], + token_type_ids=batch["token_type_ids"], + position_ids=batch["position_ids"], + attention_mask=batch["attention_mask"], + ) + start_prob, end_prob = logits + start_ids, end_ids = batch["start_positions"], batch["end_positions"] + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + logger.info("f1: %s, precision: %s, recall: %s" % (f1, precision, f1)) + model.train() + return f1 + + trainer.compress(custom_evaluate=custom_evaluate) + + +if __name__ == "__main__": + main() diff --git a/applications/information_extraction/text/utils.py b/applications/information_extraction/text/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8e1a86f79c520e3951ccb8abbb32097390bde83c --- /dev/null +++ b/applications/information_extraction/text/utils.py @@ -0,0 +1,234 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import random +from typing import List, Optional + +import numpy as np +import paddle + +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None): + """ + Create dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. 
+ Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True) + return dataloader + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. + if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + for result in result_list: + if result["end"] - result["start"] > max_content_len: + logger.warning( + "result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned" + ) + if ( + result["start"] + 1 <= max_content_len < result["end"] + and result["end"] - result["start"] <= max_content_len + ): + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length: List[int]) -> int: + """get max_length by examples which you can change it by examples in batch""" + cur_length = len(examples[0]["input_ids"]) + max_length = default_max_length + for max_length_option in sorted(dynamic_max_length): + if cur_length <= max_length_option: + max_length = max_length_option + break + return max_length + + +def convert_example( + example, tokenizer, max_seq_len, multilingual=False, dynamic_max_length: Optional[List[int]] = None +): + """ + example: { + title + prompt + content + result_list + } + """ + if dynamic_max_length is not None: + temp_encoded_inputs = tokenizer( + text=[example["prompt"]], + 
text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + max_length = get_dynamic_max_length( + examples=temp_encoded_inputs, default_max_length=max_seq_len, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_length, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + start_ids = [0.0 for x in range(max_length)] + end_ids = [0.0 for x in range(max_length)] + else: + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + start_ids = [0.0 for x in range(max_seq_len)] + end_ids = [0.0 for x in range(max_seq_len)] + + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(1, len(offset_mapping)): + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + for item in example["result_list"]: + start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + if multilingual: + tokenized_output = { + "input_ids": encoded_inputs["input_ids"], + "position_ids": encoded_inputs["position_ids"], + "start_positions": start_ids, + "end_positions": end_ids, + } + else: + tokenized_output = { + "input_ids": encoded_inputs["input_ids"], + "token_type_ids": encoded_inputs["token_type_ids"], + "position_ids": encoded_inputs["position_ids"], + "attention_mask": encoded_inputs["attention_mask"], + "start_positions": start_ids, + "end_positions": end_ids, + } + return tokenized_output diff --git a/applications/neural_search/README.md b/applications/neural_search/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8ef1ba05960912e59a44fd6db97b2553a379ef6d --- /dev/null +++ b/applications/neural_search/README.md @@ -0,0 +1,501 @@ +# 手把手搭建一个语义检索系统 + +## 1. 场景概述 + +检索系统存在于我们日常使用的很多产品中,比如商品搜索系统、学术文献检索系等等,本方案提供了检索系统完整实现。限定场景是用户通过输入检索词 Query,快速在海量数据中查找相似文档。 +
+ +
+
+所谓语义检索(也称基于向量的检索,如上图所示),是指检索系统不再拘泥于用户 Query 字面本身,而是能精准捕捉到用户 Query 背后的真正意图并以此来搜索,从而更准确地向用户返回最符合需求的结果。通过使用最先进的语义索引模型找到文本的向量表示,在高维向量空间中对它们进行索引,并度量查询向量与索引文档向量的相似程度,从而解决关键词索引带来的缺陷。
+
+例如下面两组文本 Pair,如果基于关键词去计算相似度,两组的相似度是相同的;而从实际语义上看,第一组的相似度高于第二组。
+
+```
+车头如何放置车牌 前牌照怎么装
+车头如何放置车牌 后牌照怎么装
+```
+
+语义检索系统的关键就在于,采用语义而非关键词的方式进行召回,达到更精准、更广泛地召回相似结果的目的。想快速体验搜索的效果,请参考[Pipelines的语义检索实现](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/semantic-search)。
+
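+想直观对比语义相似度与关键词匹配的差别,也可以先用 PaddleNLP 的 `Taskflow` 做一个小实验。以下只是一个极简示意(需要较新版本的 PaddleNLP),加载的是通用文本相似度模型,并非本方案训练得到的召回/排序模型,具体分数仅供参考:
+
+```python
+from paddlenlp import Taskflow
+
+# 通用文本相似度任务,仅用于直观体验,不代表本方案的最终效果
+similarity = Taskflow("text_similarity")
+results = similarity([
+    ["车头如何放置车牌", "前牌照怎么装"],
+    ["车头如何放置车牌", "后牌照怎么装"],
+])
+for r in results:
+    print(r["text1"], r["text2"], r["similarity"])
+# 预期第一组的相似度得分高于第二组,而纯关键词匹配无法区分两者
+```
+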
+ +
+ + +## 2. 产品功能介绍 + +通常检索业务的数据都比较庞大,都会分为召回(索引)、排序两个环节。召回阶段主要是从至少千万级别的候选集合里面,筛选出相关的文档,这样候选集合的数目就会大大降低,在之后的排序阶段就可以使用一些复杂的模型做精细化或者个性化的排序。一般采用多路召回策略(例如关键词召回、热点召回、语义召回结合等),多路召回结果聚合后,经过统一的打分以后选出最优的 TopK 的结果。 + +### 2.1 系统特色 + ++ 低门槛 + + 手把手搭建起检索系统 + + 无需标注数据也能构建检索系统 + + 提供 训练、预测、ANN 引擎一站式能力 + + Pipelines 快速实现语义检索系统 + ++ 效果好 + + 针对多种数据场景的专业方案 + + 仅有无监督数据: SimCSE + + 仅有有监督数据: InBatchNegative + + 兼具无监督数据 和 有监督数据:融合模型 + + 进一步优化方案: 面向领域的预训练 Domain-adaptive Pretraining ++ 性能快 + + Paddle Inference 快速抽取向量 + + Milvus 快速查询和高性能建库 + + Paddle Serving服务化部署 + +### 2.2 功能架构 + +索引环节有两类方法:基于字面的关键词索引;语义索引。语义索引能够较好地表征语义信息,解决字面不相似但语义相似的情形。本系统给出的是语义索引方案,实际业务中可融合其他方案使用。下面就详细介绍整个方案的架构和功能。 + +#### 2.2.1 整体介绍 + + +
+ +
+ +以上是nerual_search的系统流程图,其中左侧为召回环节,核心是语义向量抽取模块;右侧是排序环节,核心是排序模型。召回环节需要用户通过自己的语料构建向量索引库,用户发起query了之后,就可以检索出相似度最高的向量,然后找出该向量对应的文本;排序环节主要是对召回的文本进行重新排序。下面我们分别介绍召回中的语义向量抽取模块,以及排序模型。 + + +#### 2.2.2 召回模块 + +召回模块需要从千万量级数据中快速召回候选数据。首先需要抽取语料库中文本的 Embedding,然后借助向量搜索引擎实现高效 ANN,从而实现候选集召回。 + +我们针对不同的数据情况推出三种语义索引方案,如下图所示,您可以参照此方案,快速建立语义索引: + +| ⭐️ 无监督数据 | ⭐️ 有监督数据 | **召回方案** | +| ------------ | ------------ | ------------ | +| 多 | 无 | SimCSE | +| 无 | 多 | In-batch Negatives| +| 有 | 有 | SimCSE+ In-batch Negatives | + +最基本的情况是只有无监督数据,我们推荐您使用 SimCSE 进行无监督训练;另一种方案是只有有监督数据,我们推荐您使用 In-batch Negatives 的方法进行有监督训练。 + +如果想进一步提升模型效果:还可以使用大规模业务数据,对预训练模型进行 Domain-adaptive Pretraining,训练完以后得到预训练模型,再进行无监督的 SimCSE。 + +此外,如果您同时拥有监督数据和无监督数据,我们推荐将两种方案结合使用,这样能训练出更加强大的语义索引模型。 + +#### 2.2.3 排序模块 + +召回模型负责从海量(千万级)候选文本中快速(毫秒级)筛选出与 Query 相关性较高的 TopK Doc,排序模型会在召回模型筛选出的 TopK Doc 结果基础之上针对每一个 (Query, Doc) Pair 对进行两两匹配计算相关性,排序效果更精准。 + +排序模块有2种选择,第一种基于前沿的预训练模型 ERNIE,训练 Pair-wise 语义匹配模型;第二种是基于RocketQA模型训练的Cross Encoder模形。第一种是Pair-wise的排序算法,基本思路是对样本构建偏序文档对,两两比较,从比较中学习顺序,第二种是Poinet-Wise的算法,只考虑当前Query和每个文档的绝对相关度,并没有考虑其他文档与Query的相关度,但是建模方式比较简单。第一种Pair-wise模型可以说是第二种point-wise模型的改进版本,但对于噪声数据更为敏感,即一个错误的标注会导致多个pair对的错误,用户可以先使用基于Point-wise的Cross Encoder构建一个基础模型,需要进一步优化可以使用Pair-wise的方法优化。 + +## 3. 文献检索实践 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:由于我们既有无监督数据,又有有监督数据,所以结合 SimCSE 和 In-batch Negatives 方案,并采取 Domain-adaptive Pretraining 优化模型效果。 + +首先是利用 ERNIE模型进行 Domain-adaptive Pretraining,在得到的预训练模型基础上,进行无监督的 SimCSE 训练,最后利用 In-batch Negatives 方法进行微调,得到最终的语义索引模型,把建库的文本放入模型中抽取特征向量,然后把抽取后的向量放到语义索引引擎 milvus 中,利用 milvus 就可以很方便得实现召回了。 + +**排序**:使用 ERNIE-Gram 的单塔结构/RocketQA的Cross Encoder对召回后的数据精排序。 + +#### 3.1.2 评估指标 + +**模型效果指标** +* 在语义索引召回阶段使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + +* 在排序阶段使用的指标为AUC,AUC反映的是分类器对样本的排序能力,如果完全随机得对样本分类,那么AUC应该接近0.5。分类器越可能把真正的正样本排在前面,AUC越大,分类性能越好。 + +**性能指标** +* 基于 Paddle Inference 快速抽取向量 + +* 建库性能和 ANN 查询性能快 + +### 3.2 预置数据说明 + +数据集来源于某文献检索系统,既有大量无监督数据,又有有监督数据。 + +(1)采用文献的 query, title,keywords,abstract 四个字段内容,构建无标签数据集进行 Domain-adaptive Pretraining; + +(2)采用文献的 query,title,keywords 三个字段内容,构造无标签数据集,进行无监督召回训练SimCSE; + +(3)使用文献的query, title, keywords,构造带正标签的数据集,不包含负标签样本,基于 In-batch Negatives 策略进行训练; + +(4)在排序阶段,使用点击(作为正样本)和展现未点击(作为负样本)数据构造排序阶段的训练集,进行精排训练。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 |测试集 | +| ------------ | ------------ |------------ | ------------ | ------------ | ------------ | +| 召回 | Domain-adaptive Pretraining | 2kw | - | - | - | +| 召回 | 无监督预训练 - SimCSE | 798w | 20000 | 300000| 1000 | +| 召回 | 有监督训练 - In-batch Negatives | 3998 | 20000 |300000 | 1000 | +| 排序 | 有监督训练 - ERNIE-Gram单塔 Pairwise/RocketQA Cross Encoder| 1973538 | 57811 | - | 1000 | + +我们将除 Domain-adaptive Pretraining 之外的其他数据集全部开源,下载地址: + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) +- [literature_search_rank](https://paddlenlp.bj.bcebos.com/applications/literature_search_rank.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. 
# 构建召回库的数据(模拟实际业务线上的语料库,实际语料库远大于这里的规模),用于直观演示相关文献召回效果 +├── recall # 召回阶段数据集 + ├── train_unsupervised.csv # 无监督训练集,用于训练 SimCSE + ├── train.csv # 有监督训练集,用于训练 In-batch Negative + ├── dev.csv # 召回阶段验证集,用于评估召回模型的效果,SimCSE 和 In-batch Negative 共用 + ├── corpus.csv # 构建召回库的数据(模拟实际业务线上的语料库,实际语料库远大于这里的规模),用于评估召回阶段模型效果,SimCSE 和 In-batch Negative 共用 + ├── test.csv # 召回阶段测试数据,预测文本之间的相似度,SimCSE 和 In-batch Negative 共用 +├── data # RocketQA排序数据集 + ├── test.csv # 测试集 + ├── dev_pairwise.csv # 验证集 + └── train.csv # 训练集 +├── sort # 排序阶段数据集 + ├── train_pairwise.csv # 排序训练集 + ├── dev_pairwise.csv # 排序验证集 + └── test_pairwise.csv # 排序测试集 +``` + + +### 3.3 数据格式 + +1. 对于无监督SimCSE的训练方法,格式参考`train_unsupervised.csv`,即一行条文本即可,无需任何标注。对于召回模型训练需要规定格式的本地数据集,需要准备训练集文件`train.csv`,验证集`dev.csv`,召回集文件`corpus.csv`。 + + +训练数据集`train.csv`的格式如下: + +``` +query1 \t 用户点击的title1 +query2 \t 用户点击的title2 +``` +训练集合`train.csv`的文件样例: +``` +从《唐律疏义》看唐代封爵贵族的法律特权 从《唐律疏义》看唐代封爵贵族的法律特权《唐律疏义》,封爵贵族,法律特权 +宁夏社区图书馆服务体系布局现状分析 宁夏社区图书馆服务体系布局现状分析社区图书馆,社区图书馆服务,社区图书馆服务体系 +人口老龄化对京津冀经济 京津冀人口老龄化对区域经济增长的影响京津冀,人口老龄化,区域经济增长,固定效应模型 +英语广告中的模糊语 模糊语在英语广告中的应用及其功能模糊语,英语广告,表现形式,语用功能 +甘氨酸二肽的合成 甘氨酸二肽合成中缩合剂的选择甘氨酸,缩合剂,二肽 +...... +``` + +验证集`dev.csv`的格式如下: + +``` +query1 \t 用户点击的title1 +query2 \t 用户点击的title2 +``` + +验证集合`train.csv`的文件样例: +``` +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +外语阅读焦虑与英语成绩及性别的关系 外语阅读焦虑与英语成绩及性别的关系外语阅读焦虑,外语课堂焦虑,英语成绩,性别 +加油站风险分级管控 加油站工作危害风险分级研究加油站,工作危害分析(JHA),风险分级管控 +``` +召回集合`corpus.csv`主要作用是检验测试集合的句子对能否被正确召回,它的构造主要是提取验证集的第二列的句子,然后加入很多无关的句子,用来检验模型能够正确的从这些文本中找出测试集合对应的第二列的句子,格式如下: + +``` +2002-2017年我国法定传染病发病率和死亡率时间变化趋势传染病,发病率,死亡率,病死率 +陕西省贫困地区城乡青春期少女生长发育调查青春期,生长发育,贫困地区 +五丈岩水库溢洪道加固工程中的新材料应用碳纤维布,粘钢加固技术,超细水泥,灌浆技术 +...... +``` + +2. 对于排序模型的训练,排序模型目前提供了2种,第一种是Pairwise训练的方式,第二种是RocketQA的排序模型,对于第一种排序模型,需要准备训练集`train_pairwise.csv`,验证集`dev_pairwise.csv`两个文件,除此之外还可以准备测试集文件`test.csv`或者`test_pairwise.csv`。 + +训练数据集`train_pairwise.csv`的格式如下: + +``` +query1 \t 用户点击的title1 \t 用户未点击的title2 +query2 \t 用户点击的title3 \t 用户未点击的title4 +``` + +训练数据集`train_pairwise.csv`的示例如下: + +``` +英语委婉语引起的跨文化交际障碍 英语委婉语引起的跨文化交际障碍及其翻译策略研究英语委婉语,跨文化交际障碍,翻译策略 委婉语在英语和汉语中的文化差异委婉语,文化,跨文化交际 +范迪慧 嘉兴市中医院 滋阴疏肝汤联合八穴隔姜灸治疗肾虚肝郁型卵巢功能低下的临床疗效滋阴疏肝汤,八穴隔姜灸,肾虚肝郁型卵巢功能低下,性脉甾类激素,妊娠 温针灸、中药薰蒸在半月板损伤术后康复中的疗效分析膝损伤,半月板,胫骨,中医康复,温针疗法,薰洗 +...... +``` + +验证数据集`dev_pairwise.csv`的格式如下: + +``` +query1 \t title1 \t label +query2 \t title2 \t label +``` +验证数据集`dev_pairwise.csv`的示例如下: + +``` +作者单位:南州中学 浅谈初中教学管理如何体现人文关怀初中教育,教学管理,人文关怀 1 +作者单位:南州中学 高中美术课堂教学中藏区本土民间艺术的融入路径藏区,传统民间艺术,美术课堂 0 +作者单位:南州中学 列宁关于资产阶级民主革命向 社会主义革命过渡的理论列宁,直接过渡,间接过渡,资产阶级民主革命,社会主义革命 0 +DAA髋关节置换 DAA前侧入路和后外侧入路髋关节置换疗效对比髋关节置换术;直接前侧入路;后外侧入路;髋关节功能;疼痛;并发症 1 +DAA髋关节置换 DAA全髋关节置换术治疗髋关节病变对患者髋关节运动功能的影响直接前侧入路全髋关节置换术,髋关节病变,髋关节运动功能 0 +DAA髋关节置换 护患沟通技巧在急诊输液护理中的应用分析急诊科,输液护理,护理沟通技巧,应用 0 +....... +``` +训练数据集`test_pairwise.csv`的格式如下,其中这个score得分是召回算出来的相似度或者距离,仅供参考,可以忽略: + +``` +query1 \t title1 \t score +query2 \t title2 \t score +``` +训练数据集`test_pairwise.csv`的示例如下: + +``` +中西方语言与文化的差异 中西方文化差异以及语言体现中西方文化,差异,语言体现 0.43203747272491455 +中西方语言与文化的差异 论中西方文化差异在非言语交际中的体现中西方文化,差异,非言语交际 0.4644506871700287 +中西方语言与文化的差异 中西方体态语文化差异跨文化,体态语,非语言交际,差异 0.4917311668395996 +中西方语言与文化的差异 由此便可以发现两种语言以及两种文化的差异。 0.5039259195327759 +....... 
+``` + +对于第二种基于RocketQA的排序模型。 + +训练数据集`train.csv`,验证集`dev_pairwise.csv`的格式如下: + +``` +query1 \t title1 \t label +query2 \t title2 \t label +``` +训练数据集`train.csv`,验证集`dev_pairwise.csv`的示例如下: + +``` +(小学数学教材比较) 关键词:新加坡 新加坡与中国数学教材的特色比较数学教材,教材比较,问题解决 0 +徐慧新疆肿瘤医院 头颈部非霍奇金淋巴瘤扩散加权成像ADC值与Ki-67表达相关性分析淋巴瘤,非霍奇金,头颈部肿瘤,磁共振成像 1 +抗生素关性腹泻 鼠李糖乳杆菌GG防治消化系统疾病的研究进展鼠李糖乳杆菌,腹泻,功能性胃肠病,肝脏疾病,幽门螺杆菌 0 +德州市图书馆 图书馆智慧化建设与融合创新服务研究图书馆;智慧化;阅读服务;融合创新 1 +维生素c 综述 维生素C防治2型糖尿病研究进展维生素C;2型糖尿病;氧化应激;自由基;抗氧化剂 0 +....... +``` + +训练数据集`test.csv`的格式如下,其中这个score得分是召回算出来的相似度或者距离,仅供参考,可以忽略: + +``` +query1 \t title1 \t score +query2 \t title2 \t score +``` +训练数据集`test.csv`的示例如下: + +``` +加强科研项目管理有效促进医学科研工作 科研项目管理策略科研项目,项目管理,实施,必要性,策略 0.32163668 +加强科研项目管理有效促进医学科研工作 关于推进我院科研发展进程的相关问题研究医院科研,主体,环境,信息化 0.32922596 +加强科研项目管理有效促进医学科研工作 深圳科技计划对高校科研项目资助现状分析与思考基础研究,高校,科技计划,科技创新 0.36869502 +加强科研项目管理有效促进医学科研工作 普通高校科研管理模式的优化与创新普通高校,科研,科研管理 0.3688045 +....... +``` + + +### 3.4 运行环境和安装说明 + + +(1)运行环境 + +本实验采用了以下的运行环境进行,详细说明如下,用户也可以在自己 GPU 硬件环境进行: + +a. 软件环境: + + +- python >= 3.6 +- paddlenlp >= 2.2.1 +- paddlepaddle-gpu >=2.2 +- CUDA Version: 10.2 +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) + + +b. 硬件环境: + + +- NVIDIA Tesla V100 16GB x4卡 +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +c. 依赖安装: + +``` +pip install -r requirements.txt +``` + +## 4. Neural Search 快速体验实践 + +PaddleNLP已经基于ERNIE 1.0训练了一个基线模型,如果想快速搭建Neural Search的完整系统,有两种方法,第一种是请参考下面的实现,包含了服务化的完整流程,另一种是使用Pipelines加载,Pipelines已经支持Neural Search训练的模型的载入,可以使用Pipelines的快速的基于Neural Search模型实现检索系统,详情请参考文档[Pipelines-Neural-Search](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/pipelines/examples/semantic-search/Neural_Search.md)。 + +### 4.1. 召回 + +- 召回向量抽取服务的搭建请参考:[In-batch Negatives](./recall/in_batch_negative/), 只需要下载基于ERNIE 1.0的预训练模型,导出成Paddle Serving的格式,然后启动Pipeline Server服务即可 + +- 召回向量检索服务的搭建请参考:[Milvus](./recall/milvus/), 需要搭建Milvus并且插入检索数据的向量 + +【注意】如果使用Neural Search训练好的模型,由于该模型是基于ERNIE 1.0训练的,所以需要把 `model_name_or_path`指定为`ernie 1.0`,向量抽取结果才能正常。 + + +### 4.2. 排序 + +排序服务的搭建请参考 [ernie_matching](./ranking/ernie_matching/),只需要下载基于ERNIE Gram的预训练模型,导出成Paddle Serving的格式,最后需要启动 Pipeline Serving服务 + +【注意】如果使用Neural Search训练好的模型,由于该模型是基于ERNIE Gram训练的,所以需要把 `model_name_or_path`指定为`ernie-gram-zh`,向量抽取结果才能正常。 + +### 4.3. 系统运行 + +以上召回和排序模型都经过Paddle Serving服务化以后,就可以直接使用下面的命令运行体验: + +``` +python3 run_system.py +``` +输出的结果为: + +``` +PipelineClient::predict pack_data time:1656991375.5521955 +PipelineClient::predict before time:1656991375.5529568 +Extract feature time to cost :0.0161135196685791 seconds +Search milvus time cost is 0.8139839172363281 seconds +PipelineClient::predict pack_data time:1656991376.3981335 +PipelineClient::predict before time:1656991376.3983877 +time to cost :0.05616641044616699 seconds +``` +会输出2个文件 `recall_result.csv` 是召回检索的结果,`rank_result.csv` 是排序的结果。csv的示例输出下。 + +召回的结果: + +``` +中西方语言与文化的差异,港台文化对内地中小学生的负面影响,0.055068351328372955 +中西方语言与文化的差异,外来文化在越南的传播与融合,0.05621318891644478 +中西方语言与文化的差异,临终关怀中的“仪式”,0.05705389380455017 +中西方语言与文化的差异,历史的真实与艺术加工,0.05745899677276611 +...... +``` + +排序的结果: + +``` +中西方语言与文化的差异,论中西方教育差异,0.870943009853363 +中西方语言与文化的差异,浅析中西方问候语的差异,0.8468159437179565 +中西方语言与文化的差异,文化认同及其根源,0.8288694620132446 +中西方语言与文化的差异,从历史文化角度分析中西方学校教育的差异,0.8209370970726013 +中西方语言与文化的差异,中西医思维方式的差异,0.8150948882102966 +中西方语言与文化的差异,浅析中韩餐桌文化差异,0.7751647233963013 +...... +``` + + + +## 5. 
从头开始搭建自己的检索系统 + +这里展示了能够从头至尾跑通的完整代码,您使用自己的业务数据,照着跑,能搭建出一个给定 Query,返回 topK 相关文档的小型检索系统。您可以参照我们给出的效果和性能数据来检查自己的运行过程是否正确。 + +### 5.1 召回阶段 + +**召回模型训练** + +我们进行了多组实践,用来对比说明召回阶段各方案的效果: + +| 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| 有监督训练 Baseline | 30.077| 43.513| 48.633 | 53.448 |59.632| 标准 pair-wise 训练范式,通过随机采样产生负样本| +| 有监督训练 In-batch Negatives | 51.301 | 65.309| 69.878| 73.996|78.881| In-batch Negatives 有监督训练| +| 无监督训练 SimCSE | 42.374 | 57.505| 62.641| 67.09|72.331| SimCSE 无监督训练| +| 无监督 + 有监督训练 SimCSE + In-batch Negatives | 55.976 | 71.849| 76.363| 80.49|84.809| SimCSE无监督训练,In-batch Negatives 有监督训练| +| Domain-adaptive Pretraining + SimCSE | 51.031 | 66.648| 71.338 | 75.676 |80.144| ERNIE 预训练,SimCSE 无监督训练| +| Domain-adaptive Pretraining + SimCSE + In-batch Negatives| **58.248** | **75.099**| **79.813**| **83.801**|**87.733**| ERNIE 预训练,SimCSE 无监督训训练,In-batch Negatives 有监督训练| + +从上述表格可以看出,首先利用 ERNIE 3.0 做 Domain-adaptive Pretraining ,然后把训练好的模型加载到 SimCSE 上进行无监督训练,最后利用 In-batch Negatives 在有监督数据上进行训练能够获得最佳的性能。[模型下载](https://paddlenlp.bj.bcebos.com/models/inbatch_model_best.zip),模型的使用方式参考[In-batch Negatives](./recall/in_batch_negative/) 。 + + +这里采用 Domain-adaptive Pretraining + SimCSE + In-batch Negatives 方案: + +第一步:无监督训练 Domain-adaptive Pretraining + +训练用时 16hour55min,可参考:[ERNIE 1.0](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0) + +第二步:无监督训练 SimCSE + +训练用时 16hour53min,可参考:[SimCSE](./recall/simcse/) + +第三步:有监督训练 + +几分钟内训练完成,可参考 [In-batch Negatives](./recall/in_batch_negative/) + + +**召回系统搭建** + +召回系统使用索引引擎 Milvus,可参考 [milvus_system](./recall/milvus/)。 +我们展示一下系统的效果,输入的文本如下: + +``` +中西方语言与文化的差异 + +``` +下面是召回的部分结果,第一个是召回的title,第二个数字是计算的相似度距离 + +``` +跨文化中的文化习俗对翻译的影响翻译,跨文化,文化习俗 0.615584135055542 +试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比 0.6155391931533813 +中英文化差异及习语翻译习语,文化差异,翻译 0.6153547763824463 +英语中的中国文化元素英语,中国文化,语言交流 0.6151996850967407 +跨文化交际中的文化误读研究文化误读,影响,中华文化,西方文明 0.6137217283248901 +在语言学习中了解中法文化差异文化差异,对话交际,语言 0.6134252548217773 +从翻译视角看文化差异影响下的中式英语的应对策略文化差异;中式英语现;汉英翻译;动态对等理论 0.6127341389656067 +归化与异化在跨文化传播中的动态平衡归化,异化,翻译策略,跨文化传播,文化外译 0.6127211451530457 +浅谈中西言语交际行为中的文化差异交际用语,文化差异,中国,西方 0.6125463843345642 +翻译中的文化因素--异化与归化文化翻译,文化因素,异化与归化 0.6111845970153809 +历史与文化差异对翻译影响的分析研究历史与文化差异,法汉翻译,翻译方法 0.6107486486434937 +从中、韩、美看跨文化交际中的东西方文化差异跨文化交际,东西方,文化差异 0.6091923713684082 +试论文化差异对翻译工作的影响文化差异,翻译工作,影响 0.6084284782409668 +从归化与异化看翻译中的文化冲突现象翻译,文化冲突,归化与异化,跨文化交际 0.6063553690910339 +中西方问候语的文化差异问候语,文化差异,文化背景 0.6054259538650513 +中英思维方式的差异对翻译的影响中英文化的差异,中英思维方式的差异,翻译 0.6026732921600342 +略论中西方语言文字的特性与差异语言,会意,确意,特性,差异 0.6009351015090942 +...... 
+ +``` + + +### 5.2 排序阶段 + +排序阶段有2种方案,第一种是[ernie_matching](./ranking/ernie_matching/)使用的模型是 ERNIE-3.0-Medium-zh,用时 20h;第二种是基于RocketQA的排序模型[cross_encoder](./ranking/cross_encoder/),训练用时也是20h左右。 + + +排序阶段的效果评估: + +| 模型 | AUC | +| ------------ | ------------ | +| Baseline: In-batch Negatives | 0.582 | +| pairwise ERNIE-Gram |0.801 | +| CrossEncoder:rocketqa-base-cross-encoder |**0.835** | + + +同样输入文本: + +``` +中西方语言与文化的差异 +``` +排序阶段的结果展示如下,第一个是 Title ,第二个数字是计算的概率,显然经排序阶段筛选的文档与 Query 更相关: + +``` +中西方文化差异以及语言体现中西方文化,差异,语言体现 0.999848484992981 +论中西方语言与文化差异的历史渊源中西方语言,中西方文化,差异,历史渊源 0.9998375177383423 +从日常生活比较中西方语言与文化的差异中西方,语言,文化,比较 0.9985846281051636 +试论中西方语言文化教育的差异比较与融合中西方,语言文化教育,差异 0.9972485899925232 +中西方文化差异对英语学习的影响中西方文化,差异,英语,学习 0.9831035137176514 +跨文化视域下的中西文化差异研究跨文化,中西,文化差异 0.9781349897384644 +中西方文化差异对跨文化交际的影响分析文化差异,跨文化交际,影响 0.9735479354858398 +探析跨文化交际中的中西方语言差异跨文化交际,中西方,语言差异 0.9668175578117371 +中西方文化差异解读中英文差异表达中西文化,差异表达,跨文化交际 0.9629314541816711 +中西方文化差异对英语翻译的影响中西方文化差异,英语翻译,翻译策略,影响 0.9538986086845398 +论跨文化交际中的中西方文化冲突跨文化交际,冲突,文化差异,交际策略,全球化 0.9493677616119385 +中西方文化差异对英汉翻译的影响中西方文化,文化差异,英汉翻译,影响 0.9430705904960632 +中西方文化差异与翻译中西方,文化差异,翻译影响,策略方法,译者素质 0.9401137828826904 +外语教学中的中西文化差异外语教学,文化,差异 0.9397934675216675 +浅析西语国家和中国的文化差异-以西班牙为例跨文化交际,西语国家,文化差异 0.9373322129249573 +中英文化差异在语言应用中的体现中英文化,汉语言,语言应用,语言差异 0.9359155297279358 +.... +``` + + +## Reference + +[1] Tianyu Gao, Xingcheng Yao, Danqi Chen: [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821). EMNLP (1) 2021: 6894-6910 + +[2] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906). Preprint 2020. + +[3] Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang: +[ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding](https://arxiv.org/abs/2010.12148). NAACL-HLT 2021: 1702-1715 + +[4] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu: +[ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223). CoRR abs/1904.09223 (2019) diff --git a/applications/neural_search/img/attu.png b/applications/neural_search/img/attu.png new file mode 100644 index 0000000000000000000000000000000000000000..9d152a6efba719f540d7a7ac343c43ca0b8eb4fc Binary files /dev/null and b/applications/neural_search/img/attu.png differ diff --git a/applications/neural_search/img/system_pipeline.png b/applications/neural_search/img/system_pipeline.png new file mode 100644 index 0000000000000000000000000000000000000000..b7ef7972df94c4d6daa9b64e02c0a894b5efeb19 Binary files /dev/null and b/applications/neural_search/img/system_pipeline.png differ diff --git a/applications/neural_search/ranking/cross_encoder/README.md b/applications/neural_search/ranking/cross_encoder/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b5dd873b4313c2f06b7bf5e6ee79f74f23bf5fc1 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/README.md @@ -0,0 +1,399 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [CrossEncoder](#CrossEncoder) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +基于RocketQA的CrossEncoder训练的单塔模型,该模型用于搜索的排序阶段,对召回的结果进行重新排序的作用。 + + + + +# CrossEncoder + + + +## 1. 
技术方案和评估指标 + +### 技术方案 + +加载基于ERNIE 3.0训练过的RocketQA单塔CrossEncoder模型。 + + +### 评估指标 + +(1)采用 AUC 指标来评估排序模型的排序效果。 + +**效果评估** + +| 训练方式 | 模型 | AUC | +| ------------ | ------------ |------------ | +| pairwise| ERNIE-Gram |0.801 | +| CrossEncoder | rocketqa-base-cross-encoder |**0.835** | + + + +## 2. 环境依赖和安装说明 + +**环境依赖** + +* python >= 3.7 +* paddlepaddle >= 2.3.7 +* paddlenlp >= 2.3 +* pandas >= 0.25.1 +* scipy >= 1.3.1 + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 + ├── cpp + ├── rpc_client.py # RPC 客户端的bash脚本 + ├── http_client.py # http 客户端的bash文件 + └── start_server.sh # 启动C++服务的脚本 + └── python + ├── deploy.sh # 预测部署bash脚本 + ├── config_nlp.yml # Pipeline 的配置文件 + ├── web_service.py # Pipeline 服务端的脚本 + ├── rpc_client.py # Pipeline RPC客户端的脚本 + └── predict.py # python 预测部署示例 +|—— scripts + ├── export_model.sh # 动态图参数导出静态图参数的bash文件 + ├── export_to_serving.sh # 导出 Paddle Serving 模型格式的bash文件 + ├── train_ce.sh # 匹配模型训练的bash文件 + ├── evaluate_ce.sh # 评估验证文件bash脚本 + ├── predict_ce.sh # 匹配模型预测脚本的bash文件 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_to_serving.py # 导出 Paddle Serving 模型格式的脚本 +├── data.py # 训练样本的转换逻辑 +├── train_ce.py # 模型训练脚本 +├── evaluate.py # 评估验证文件 +├── predict.py # Pair-wise 模型预测脚本,输出文本对是相似度 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +样例数据如下: +``` +(小学数学教材比较) 关键词:新加坡 新加坡与中国数学教材的特色比较数学教材,教材比较,问题解决 0 +徐慧新疆肿瘤医院 头颈部非霍奇金淋巴瘤扩散加权成像ADC值与Ki-67表达相关性分析淋巴瘤,非霍奇金,头颈部肿瘤,磁共振成像 1 +抗生素关性腹泻 鼠李糖乳杆菌GG防治消化系统疾病的研究进展鼠李糖乳杆菌,腹泻,功能性胃肠病,肝脏疾病,幽门螺杆菌 0 +德州市图书馆 图书馆智慧化建设与融合创新服务研究图书馆;智慧化;阅读服务;融合创新 1 +维生素c 综述 维生素C防治2型糖尿病研究进展维生素C;2型糖尿病;氧化应激;自由基;抗氧化剂 0 +(白藜芦醇) 关键词:2型糖尿病 2型糖尿病大鼠心肌缺血再灌注损伤转录因子E2相关因子2/血红素氧合酶1信号通路的表达及白藜芦醇的干预研究糖尿病,2型,心肌缺血,再灌注损伤,白藜芦醇 1 +融资偏好 创新型企业产业风险、融资偏好与融资选择融资偏好;产业风险;融资选择 1 +星载激光雷达 星载激光雷达望远镜主镜超轻量化结构设计超轻量化;拓扑优化;集成优化;RMS;有限元仿真 1 +``` + + +### 数据集下载 + + +- [literature_search_rank](https://paddlenlp.bj.bcebos.com/applications/literature_search_rank.zip) + +``` +├── data # 排序数据集 + ├── test.csv # 测试集 + ├── dev_pairwise.csv # 验证集 + └── train.csv # 训练集 +``` + + + +## 5. 模型训练 + +**排序模型下载链接:** + + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[ERNIE-Gram-Sort](https://bj.bcebos.com/v1/paddlenlp/models/ernie_gram_sort.zip)|
epoch:3 lr:5E-5 bs:64 max_len:64<br>|<br>4卡 v100-16g<br>
|d24ece68b7c3626ce6a24baa58dd297d| + + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可 + +训练的命令如下: + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir="logs" train_ce.py \ + --device gpu \ + --train_set data/train.csv \ + --test_file data/dev_pairwise.csv \ + --save_dir ./checkpoints \ + --model_name_or_path rocketqa-base-cross-encoder \ + --batch_size 32 \ + --save_steps 10000 \ + --max_seq_len 384 \ + --learning_rate 1E-5 \ + --weight_decay 0.01 \ + --warmup_proportion 0.0 \ + --logging_steps 10 \ + --seed 1 \ + --epochs 3 \ + --eval_step 1000 +``` +也可以运行bash脚本: + +``` +sh scripts/train_ce.sh +``` + + + +## 6. 评估 + + +``` +python evaluate.py --model_name_or_path rocketqa-base-cross-encoder \ + --init_from_ckpt checkpoints/model_80000/model_state.pdparams \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/evaluate_ce.sh +``` + + +成功运行后会输出下面的指标: + +``` +eval_dev auc:0.829 +``` + + + +## 7. 预测 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,和文本pair的语义索引相似度,(该相似度由召回模型算出,仅供参考),部分示例如下: + +``` +中西方语言与文化的差异 第二语言习得的一大障碍就是文化差异。 0.5160342454910278 +中西方语言与文化的差异 跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译 0.5145505666732788 +中西方语言与文化的差异 从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译 0.5141439437866211 +中西方语言与文化的差异 中英文化差异对翻译的影响中英文化,差异,翻译的影响 0.5138794183731079 +中西方语言与文化的差异 浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际 0.5131710171699524 +``` + + + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的rocketqa模型开始计算文本 Pair 的语义相似度: + +```shell +unset CUDA_VISIBLE_DEVICES +python predict.py \ + --device 'gpu' \ + --params_path checkpoints/model_80000/model_state.pdparams \ + --model_name_or_path rocketqa-base-cross-encoder \ + --test_set data/test.csv \ + --topk 10 \ + --batch_size 128 \ + --max_seq_length 384 +``` +也可以直接执行下面的命令: + +``` +sh scripts/predict_ce.sh +``` +得到下面的输出,分别是query,title和对应的预测概率: + +``` +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '高校\\十四五\\规划中学科建设要处理好五对关系\\十四五\\规划,学科建设,科技创新,人才培养', 'pred_prob': 0.7076062} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '校企科研合作项目管理模式创新校企科研合作项目,管理模式,问题,创新', 'pred_prob': 0.64633846} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '科研项目管理策略科研项目,项目管理,实施,必要性,策略', 'pred_prob': 0.63166416} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '高校科研项目经费管理流程优化研究——以z大学为例高校,科研项目经费\\全流程\\管理,流程优化', 'pred_prob': 0.60351866} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '关于推进我院科研发展进程的相关问题研究医院科研,主体,环境,信息化', 'pred_prob': 0.5688347} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '医学临床科研选题原则和方法医学临床,科学研究,选题', 'pred_prob': 0.55190295} +``` + + + +## 8. 
部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py \ + --params_path checkpoints/model_80000/model_state.pdparams \ + --model_name_or_path rocketqa-base-cross-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference + +使用PaddleInference + +``` +python deploy/python/predict.py --model_dir ./output \ + --input_file data/test.csv \ + --model_name_or_path rocketqa-base-cross-encoder +``` +也可以运行下面的bash脚本: + +``` +sh deploy/python/deploy.sh +``` +得到下面的输出,输出的是样本的query,title以及对应的概率: + +``` +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '科研项目管理策略科研项目,项目管理,实施,必要性,策略'} prob: 0.5479063987731934 +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '关于推进我院科研发展进程的相关问题研究医院科研,主体,环境,信息化'} prob: 0.5151925086975098 +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '深圳科技计划对高校科研项目资助现状分析与思考基础研究,高校,科技计划,科技创新'} prob: 0.42983829975128174 +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '普通高校科研管理模式的优化与创新普通高校,科研,科研管理'} prob: 0.465454638004303 +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.pdmodel" \ + --params_filename "inference.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" + +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +修改对应预训练模型的`Tokenizer`: + +``` +self.tokenizer = AutoTokenizer.from_pretrained('rocketqa-base-cross-encoder') +``` + +启动 Pipeline Server: + +``` +python web_service.py +``` + +启动客户端调用 Server。 + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{"query":"加强科研项目管理有效促进医学科研工作","title":"科研项目管理策略科研项目,项目管理,实施,必要性,策略"}]` +``` +然后运行: +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1662354188.422532 +PipelineClient::predict before time:1662354188.423034 +time to cost :0.016808509826660156 seconds +(1,) +[0.5479064] +``` +可以看到客户端发送了1条文本,这条文本的相似的概率值。 + +#### C++的方式 + +启动C++的Serving: + +``` +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True +``` +也可以使用脚本: + +``` +sh deploy/cpp/start_server.sh +``` +Client 可以使用 http 或者 rpc 两种方式,rpc 的方式为: + +``` +python deploy/cpp/rpc_client.py +``` +运行的输出为: + +``` +I0905 05:38:28.876770 28507 general_model.cpp:490] [client]logid=0,client_cost=158.124ms,server_cost=156.385ms. 
+time to cost :0.15848731994628906 seconds +[0.54790646] +``` +可以看到服务端返回了相似度结果 + +或者使用 http 的客户端访问模式: + +``` +python deploy/cpp/http_client.py +``` +运行的输出为: +``` +time to cost :0.13054680824279785 seconds +0.5479064707850817 +``` +可以看到服务端返回了相似度结果 + + +## Reference + +[1] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. + +[2] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, Haifeng Wang: +RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. NAACL-HLT 2021: 5835-5847 diff --git a/applications/neural_search/ranking/cross_encoder/data.py b/applications/neural_search/ranking/cross_encoder/data.py new file mode 100644 index 0000000000000000000000000000000000000000..c03776f4a41c34f8e95fa7525ea1edf9979bcb47 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/data.py @@ -0,0 +1,99 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np +import paddle + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False, is_pair=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + It returns the first portion of the mask (0's). + + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): List of sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. 
+ """ + + if is_pair: + text = example["text_a"] + text_pair = example["text_b"] + else: + text = example["text"] + text_pair = None + encoded_inputs = tokenizer(text=text, text_pair=text_pair, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return input_ids, token_type_ids + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_data(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8", errors="ignore") as f: + for i, line in enumerate(f): + # Skip column name + if i == 0: + continue + data = line.rstrip("\n").split("\t") + if len(data) != 3: + print(data) + continue + query = data[0] + title = data[1] + label = data[-1] + # breakpoint() + yield {"text_a": query, "text_b": title, "label": int(label)} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/cpp/http_client.py b/applications/neural_search/ranking/cross_encoder/deploy/cpp/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..40bc903f065bc949dc199bf0b5aadb1a7ea010d2 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/cpp/http_client.py @@ -0,0 +1,68 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
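+
+# HTTP client for the C++ Paddle Serving deployment: it tokenizes a (query, title) pair
+# with the rocketqa-base-cross-encoder tokenizer, sends input_ids/token_type_ids to the
+# server started by start_server.sh, and maps the returned logits to a similarity score
+# with expit (sigmoid).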
+ +import time + +import numpy as np +from paddle_serving_client.httpclient import HttpClient +from scipy.special import expit + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = HttpClient() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-base-cross-encoder") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "加强科研项目管理有效促进医学科研工作", "title": "科研项目管理策略科研项目,项目管理,实施,必要性,策略"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print(result) +print("time to cost :{} seconds".format(b_end - b_start)) +score = result.outputs[0].tensor[0].float_data +sim_score = expit(np.array(score))[1] +print(sim_score) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/cpp/rpc_client.py b/applications/neural_search/ranking/cross_encoder/deploy/cpp/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..13d988aa5d0a61b392848de19791ae68a86b2af8 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/cpp/rpc_client.py @@ -0,0 +1,68 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
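+
+# RPC client for the same C++ Paddle Serving deployment. The flow mirrors http_client.py:
+# tokenize the (query, title) pair, send input_ids/token_type_ids through client.predict,
+# and convert the "predict" logits to positive-class probabilities with expit (sigmoid).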
+ +import time + +import numpy as np +from paddle_serving_client import Client +from scipy.special import expit + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = Client() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-base-cross-encoder") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "加强科研项目管理有效促进医学科研工作", "title": "科研项目管理策略科研项目,项目管理,实施,必要性,策略"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +# breakpoint() +sim_score = expit(result["predict"])[:, 1] +b_end = time.time() +print("time to cost :{} seconds".format(b_end - b_start)) +print(sim_score) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/cpp/start_server.sh b/applications/neural_search/ranking/cross_encoder/deploy/cpp/start_server.sh new file mode 100644 index 0000000000000000000000000000000000000000..0197b9a6c223204db1facbecd4b3384a079b95af --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/cpp/start_server.sh @@ -0,0 +1 @@ +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/config_nlp.yml b/applications/neural_search/ranking/cross_encoder/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..027ac44eafb7d3ebcd113a6783a3535d0e665b38 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8088 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8089 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" 
+ # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['predict'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/deploy.sh b/applications/neural_search/ranking/cross_encoder/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..c20a4d057ab1078d08dfaf266c8a5487e149b81f --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/deploy.sh @@ -0,0 +1,3 @@ +python deploy/python/predict.py --model_dir ./output \ + --input_file data/test.csv \ + --model_name_or_path rocketqa-base-cross-encoder \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/predict.py b/applications/neural_search/ranking/cross_encoder/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..b092fc004ab3132ccdc4c159deddcbb1a994a758 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/predict.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +from paddle import inference +from scipy.special import softmax + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--input_file", type=str, required=True, help="The test set file.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + 
config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name=args.model_name_or_path, + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=self.max_seq_length, is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + sim_score = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + sim_score = softmax(sim_score)[:, 1] + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return sim_score + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + test_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + data = [{"query": d["query"], "title": d["title"]} for d in test_ds] + + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer)) + for idx, text in enumerate(data): + print("Data: {} \t prob: {}".format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/rpc_client.py b/applications/neural_search/ranking/cross_encoder/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..27c3d6e0697600055caed53f72cc933bb2d86d8c --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/rpc_client.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8089"]) + +list_data = [{"query": "加强科研项目管理有效促进医学科研工作", "title": "科研项目管理策略科研项目,项目管理,实施,必要性,策略"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(result.shape) +print(result) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/web_service.py b/applications/neural_search/ranking/cross_encoder/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..bfcbf7fc117084ba5ac6e7fdbed872fc8d16ea4f --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/web_service.py @@ -0,0 +1,74 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
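+
+# Pipeline web service for the cross-encoder ranker. ErnieOp tokenizes the incoming
+# (query, title) pairs in preprocess(), and in postprocess() converts the "predict"
+# logits to positive-class probabilities with softmax. ErnieService wires the op into
+# a pipeline configured by config_nlp.yml.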
+ +import json + +from paddle_serving_server.web_service import Op, WebService +from scipy.special import softmax + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-base-cross-encoder") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = json.loads(input_dict[str(i)].replace("'", '"')) + input_ids, segment_ids = convert_example(example, self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + sim_score = softmax(fetch_dict["predict"])[:, 1] + new_dict["predict"] = str(sim_score) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/neural_search/ranking/cross_encoder/evaluate.py b/applications/neural_search/ranking/cross_encoder/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..9e7fbe1dd0b3090a82c83b22722fe454000fb113 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/evaluate.py @@ -0,0 +1,110 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
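+
+# Standalone evaluation script: loads a fine-tuned cross-encoder checkpoint
+# (--init_from_ckpt), scores the dev set given by --test_file with softmax probabilities,
+# and reports AUC computed by paddle.metric.Auc.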
+ +import argparse +import os +import random +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_data + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + + sim_score = F.softmax(pos_probs) + metric.update(preds=sim_score.numpy(), labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +def main(): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + dev_ds = load_dataset(read_data, data_path=args.test_file, lazy=False) + + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_pair=True) + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + metric = paddle.metric.Auc() + evaluate(model, metric, dev_data_loader, "dev") + + +if __name__ == "__main__": + main() diff --git a/applications/neural_search/ranking/cross_encoder/export_model.py b/applications/neural_search/ranking/cross_encoder/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..a580726b0c07df2094c637c899be79ffac2deca8 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/export_model.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
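+
+# Dynamic-to-static export script: loads the fine-tuned checkpoint from --params_path,
+# traces the model with paddle.jit.to_static using (input_ids, token_type_ids) InputSpec,
+# and saves the static-graph inference model under --output_path.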
+ +import argparse +import os + +import paddle + +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/ranking/cross_encoder/export_to_serving.py b/applications/neural_search/ranking/cross_encoder/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..e10cd5616a4e2b924fde1f7fa99a1722ca30a068 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for fetch vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/neural_search/ranking/cross_encoder/predict.py b/applications/neural_search/ranking/cross_encoder/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5272713a3c416c55f9a5a662c8faf22c56bfc67b --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/predict.py @@ -0,0 +1,89 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default="checkpoints/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", type=int, default=128, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") +parser.add_argument("--test_set", type=str, required=True, help="The full path of test_set.") +parser.add_argument("--topk", type=int, default=10, help="The Topk texts.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def predict(model, data_loader): + results = [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + input_ids, token_type_ids = batch + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits) + probs = probs.numpy() + results.extend(probs[:, 1]) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + test_ds = load_dataset(read_text_pair, data_path=args.test_set, lazy=False) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True, is_pair=True + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): [data for data in fn(samples)] + + test_data_loader = create_dataloader( + test_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + results = predict(model, test_data_loader) + test_ds = load_dataset(read_text_pair, data_path=args.test_set, lazy=False) + text_pairs = [] + for idx, prob in enumerate(results): + text_pair = test_ds[idx] + text_pair["pred_prob"] = prob + text_pairs.append(text_pair) + text_pairs = sorted(text_pairs, key=lambda x: x["pred_prob"], reverse=True)[: args.topk] + for item in text_pairs: + print(item) diff --git a/applications/neural_search/ranking/cross_encoder/scripts/evaluate_ce.sh b/applications/neural_search/ranking/cross_encoder/scripts/evaluate_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..f42491cda38d1fa20dccc8054a50d27195df1a95 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/evaluate_ce.sh @@ -0,0 +1,3 @@ +python evaluate.py --model_name_or_path rocketqa-base-cross-encoder \ + --init_from_ckpt checkpoints/model_80000/model_state.pdparams \ + --test_file data/dev_pairwise.csv \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/scripts/export_model.sh b/applications/neural_search/ranking/cross_encoder/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..a6c54ae878d983b24cd5fa77f92840755e3873d3 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/export_model.sh @@ -0,0 +1,4 @@ +python export_model.py \ + --params_path checkpoints/model_80000/model_state.pdparams \ + 
--model_name_or_path rocketqa-base-cross-encoder \ + --output_path=./output \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/scripts/export_to_serving.sh b/applications/neural_search/ranking/cross_encoder/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..4a5fe6bfe576184201647b65e69f181a0a25a224 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.pdmodel" \ + --params_filename "inference.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" diff --git a/applications/neural_search/ranking/cross_encoder/scripts/predict_ce.sh b/applications/neural_search/ranking/cross_encoder/scripts/predict_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..46f6fc50d099fae712626591b4536eec3cad1f2f --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/predict_ce.sh @@ -0,0 +1,11 @@ +#!/bin/bash +unset CUDA_VISIBLE_DEVICES +export CUDA_VISIBLE_DEVICES=0 +python predict.py \ + --device 'gpu' \ + --params_path checkpoints/model_80000/model_state.pdparams \ + --model_name_or_path rocketqa-base-cross-encoder \ + --test_set data/test.csv \ + --topk 10 \ + --batch_size 128 \ + --max_seq_length 384 \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/scripts/train_ce.sh b/applications/neural_search/ranking/cross_encoder/scripts/train_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..570f528902c6c396a40cfee3097c7bf142d4a5bc --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/train_ce.sh @@ -0,0 +1,18 @@ +#!/bin/bash +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir="logs" train_ce.py \ + --device gpu \ + --train_set data/train.csv \ + --test_file data/dev_pairwise.csv \ + --save_dir ./checkpoints \ + --model_name_or_path rocketqa-base-cross-encoder \ + --batch_size 32 \ + --save_steps 10000 \ + --max_seq_len 384 \ + --learning_rate 1E-5 \ + --weight_decay 0.01 \ + --warmup_proportion 0.0 \ + --logging_steps 10 \ + --seed 1 \ + --epochs 3 \ + --eval_step 1000 diff --git a/applications/neural_search/ranking/cross_encoder/train_ce.py b/applications/neural_search/ranking/cross_encoder/train_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..bae21c4992e9eeb17b596f17a7c4a816bbac6a38 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/train_ce.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
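+
+# Fine-tuning script for the cross-encoder ranking model: reads (query, title, label)
+# triples, trains AutoModelForSequenceClassification (2 classes) with cross-entropy loss
+# and AdamW (bias/LayerNorm parameters excluded from weight decay), evaluates AUC on the
+# dev set every --eval_step steps, and saves checkpoints every --save_steps steps.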
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_data + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--train_set", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") + +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkppoints.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +parser.add_argument("--eval_step", default=200, type=int, help="Step interval for evaluation.") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + + sim_score = F.softmax(pos_probs) + + metric.update(preds=sim_score.numpy(), labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + dev_count = paddle.distributed.get_world_size() + set_seed(args.seed) + + train_ds = load_dataset(read_data, data_path=args.train_set, lazy=False) + dev_ds = load_dataset(read_data, data_path=args.test_file, lazy=False) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_pair=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_examples = len(train_ds) + # 4卡 gpu + max_train_steps = args.epochs * num_training_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * args.warmup_proportion) + + print("Device count: %d" % dev_count) + print("Num train examples: %d" % num_training_examples) + print("Max train steps: %d" % max_train_steps) + print("Num warmup steps: %d" % warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(1.0), + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Auc() + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + acc = paddle.metric.accuracy(input=probs, label=labels) + loss.backward() + + optimizer.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accuracy: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, args.logging_steps / time_diff) + ) + tic_train = time.time() + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + tic_train = time.time() + + # save final checkpoint + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/ranking/ernie_matching/README.md b/applications/neural_search/ranking/ernie_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6f912ad680a193c029d5293e8288155b1f6c634d --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/README.md @@ -0,0 +1,409 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [ERNIE-Gram](#ERNIE-Gram) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +基于ERNIE-Gram训练Pair-wise模型。Pair-wise 匹配模型适合将文本对相似度作为特征之一输入到上层排序模块进行排序的应用场景。 + + + + +# ERNIE-Gram + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,使用ERNIE-Gram预训练模型,使用margin_ranking_loss训练模型。 + + +### 评估指标 + +(1)采用 AUC 指标来评估排序模型的排序效果。 + +**效果评估** + +| 模型 | AUC | +| ------------ | ------------ | +| ERNIE-Gram | 0.801 | + + + +## 2. 环境依赖和安装说明 + +**环境依赖** + +* python >= 3.x +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* pandas >= 0.25.1 +* scipy >= 1.3.1 + + + +## 3. 
代码结构 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 + ├── cpp + ├── rpc_client.py # RPC 客户端的bash脚本 + ├── http_client.py # http 客户端的bash文件 + └── start_server.sh # 启动C++服务的脚本 + └── python + ├── deploy.sh # 预测部署bash脚本 + ├── config_nlp.yml # Pipeline 的配置文件 + ├── web_service.py # Pipeline 服务端的脚本 + ├── rpc_client.py # Pipeline RPC客户端的脚本 + └── predict.py # python 预测部署示例 +|—— scripts + ├── export_model.sh # 动态图参数导出静态图参数的bash文件 + ├── export_to_serving.sh # 导出 Paddle Serving 模型格式的bash文件 + ├── train_pairwise.sh # Pair-wise 单塔匹配模型训练的bash文件 + ├── evaluate.sh # 评估验证文件bash脚本 + ├── predict_pairwise.sh # Pair-wise 单塔匹配模型预测脚本的bash文件 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_to_serving.py # 导出 Paddle Serving 模型格式的脚本 +├── model.py # Pair-wise 匹配模型组网 +├── data.py # Pair-wise 训练样本的转换逻辑 、Pair-wise 生成随机负例的逻辑 +├── train_pairwise.py # Pair-wise 单塔匹配模型训练脚本 +├── evaluate.py # 评估验证文件 +├── predict_pairwise.py # Pair-wise 单塔匹配模型预测脚本,输出文本对是相似度 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +样例数据如下: +``` +个人所得税税务筹划 基于新个税视角下的个人所得税纳税筹划分析新个税;个人所得税;纳税筹划 个人所得税工资薪金税务筹划研究个人所得税,工资薪金,税务筹划 +液压支架底座受力分析 ZY4000/09/19D型液压支架的有限元分析液压支架,有限元分析,两端加载,偏载,扭转 基于ANSYS的液压支架多工况受力分析液压支架,四种工况,仿真分析,ANSYS,应力集中,优化 +迟发性血管痉挛 西洛他唑治疗动脉瘤性蛛网膜下腔出血后脑血管痉挛的Meta分析西洛他唑,蛛网膜下腔出血,脑血管痉挛,Meta分析 西洛他唑治疗动脉瘤性蛛网膜下腔出血后脑血管痉挛的Meta分析西洛他唑,蛛网膜下腔出血,脑血管痉挛,Meta分析 +氧化亚硅 复合溶胶-凝胶一锅法制备锂离子电池氧化亚硅/碳复合负极材料氧化亚硅,溶胶-凝胶法,纳米颗粒,负极,锂离子电池 负载型聚酰亚胺-二氧化硅-银杂化膜的制备和表征聚酰亚胺,二氧化硅,银,杂化膜,促进传输 +``` + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**排序模型下载链接:** + + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[ERNIE-Gram-Sort](https://bj.bcebos.com/v1/paddlenlp/models/ernie_gram_sort.zip)|
epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|d24ece68b7c3626ce6a24baa58dd297d| + + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于ERNIE-Gram训练模型,数据量比较大,需要20小时10分钟左右。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可 + +训练的命令如下: + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" train_pairwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --margin 0.1 \ + --eval_step 100 \ + --train_file data/train_pairwise.csv \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/train_pairwise.sh +``` + + + +## 6. 评估 + + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" evaluate.py \ + --device gpu \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/evaluate.sh +``` + + +成功运行后会输出下面的指标: + +``` +eval_dev auc:0.796 +``` + + + +## 7. 预测 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,和文本pair的语义索引相似度,部分示例如下: + +``` +中西方语言与文化的差异 第二语言习得的一大障碍就是文化差异。 0.5160342454910278 +中西方语言与文化的差异 跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译 0.5145505666732788 +中西方语言与文化的差异 从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译 0.5141439437866211 +中西方语言与文化的差异 中英文化差异对翻译的影响中英文化,差异,翻译的影响 0.5138794183731079 +中西方语言与文化的差异 浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际 0.5131710171699524 +``` + + + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 ERNIE-Gram模型开始计算文本 Pair 的语义相似度: + +```shell +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pairwise.py \ + --device gpu \ + --params_path "./checkpoints/model_30000/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'sort/test_pairwise.csv' +``` +也可以直接执行下面的命令: + +``` +sh scripts/predict_pairwise.sh +``` +得到下面的输出,分别是query,title和对应的预测概率: + +``` +{'query': '中西方语言与文化的差异', 'title': '第二语言习得的一大障碍就是文化差异。', 'pred_prob': 0.85112214} +{'query': '中西方语言与文化的差异', 'title': '跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译', 'pred_prob': 0.78629625} +{'query': '中西方语言与文化的差异', 'title': '从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译', 'pred_prob': 0.91767526} +{'query': '中西方语言与文化的差异', 'title': '中英文化差异对翻译的影响中英文化,差异,翻译的影响', 'pred_prob': 0.8601749} +{'query': '中西方语言与文化的差异', 'title': '浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际', 'pred_prob': 0.8944413} +``` + + + +## 8. 
部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_30000/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path ernie-3.0-medium-zh +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference + +使用PaddleInference: + +``` +python deploy/python/predict.py --model_dir ./output \ + --input_file sort/test_pairwise.csv \ + --model_name_or_path ernie-3.0-medium-zh +``` +也可以运行下面的bash脚本: + +``` +sh deploy/python/deploy.sh +``` +得到下面的输出,输出的是样本的query,title以及对应的概率: + +``` +Data: {'query': '中西方语言与文化的差异', 'title': '第二语言习得的一大障碍就是文化差异。'} prob: [0.8511221] +Data: {'query': '中西方语言与文化的差异', 'title': '跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译'} prob: [0.7862964] +Data: {'query': '中西方语言与文化的差异', 'title': '从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译'} prob: [0.91767514] +Data: {'query': '中西方语言与文化的差异', 'title': '中英文化差异对翻译的影响中英文化,差异,翻译的影响'} prob: [0.8601747] +Data: {'query': '中西方语言与文化的差异', 'title': '浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际'} prob: [0.8944413] +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.predict.pdmodel" \ + --params_filename "inference.predict.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" + +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +修改`Tokenizer` + +``` +self.tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +``` + +启动 Pipeline Server: + +``` +python web_service.py +``` + +启动客户端调用 Server。 + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{"query":"中西方语言与文化的差异","title":"第二语言习得的一大障碍就是文化差异。"}]` +``` +然后运行: +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1656912047.5986433 +PipelineClient::predict before time:1656912047.599081 +time to cost :0.012039899826049805 seconds +(1, 1) +[[0.85112208]] +``` +可以看到客户端发送了1条文本,这条文本的相似的概率值。 + +#### C++的方式 + +启动C++的Serving: + +``` +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True +``` +也可以使用脚本: + +``` +sh deploy/cpp/start_server.sh +``` +Client 可以使用 http 或者 rpc 两种方式,rpc 的方式为: + +``` +python deploy/cpp/rpc_client.py +``` +运行的输出为: + +``` +I0704 05:19:00.443437 1987 general_model.cpp:490] [client]logid=0,client_cost=8.477ms,server_cost=6.458ms. 
+time to cost :0.008707761764526367 seconds +{'predict': array([[0.8511221]], dtype=float32)} +``` +可以看到服务端返回了相似度结果 + +或者使用 http 的客户端访问模式: + +``` +python deploy/cpp/http_client.py +``` +运行的输出为: +``` +time to cost :0.006819009780883789 seconds +[0.8511220812797546] +``` +可以看到服务端返回了相似度结果 + +也可以使用curl方式发送Http请求: + +``` +curl -XPOST http://0.0.0.0:8600/GeneralModelService/inference -d ' {"tensor":[{"int64_data":[ 1, 12, 213, 58, 405, 545, 54, 68, 73, + 5, 859, 712, 2, 131, 177, 405, 545, 489, + 116, 5, 7, 19, 843, 1767, 113, 10, 68, + 73, 859, 712, 12043, 2],"elem_type":0,"name":"input_ids","alias_name":"input_ids","shape":[1,32]}, + {"int64_data":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],"elem_type":0,"name":"token_type_ids","alias_name":"token_type_ids","shape":[1,32]} + ], +"fetch_var_names":["sigmoid_2.tmp_0"], +"log_id":0 +}' +``` + + +## Reference + +[1] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. diff --git a/applications/neural_search/ranking/ernie_matching/data.py b/applications/neural_search/ranking/ernie_matching/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d7fdd67cc36f5cc9691d62f29a5eec431f8f8415 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/data.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
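+
+# Shared data utilities for the pair-wise ranking scripts:
+#   - create_dataloader: wraps a dataset with a DistributedBatchSampler for
+#     training (shuffled) and a plain BatchSampler otherwise.
+#   - read_text_pair: yields {"query", "title"} dicts from a tab-separated file.
+#   - convert_pointwise_example / convert_pairwise_example: tokenize the text
+#     pair(s) into input_ids and token_type_ids (plus the label when available).
+#   - gen_pair: builds {"query", "title", "neg_title"} triplets by sampling
+#     random negative titles from a shuffled pool of positive examples.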
+ +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] + + pos_inputs = tokenizer(text=query, text_pair=pos_title, max_seq_len=max_seq_length) + neg_inputs = tokenizer(text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. 
+ Each example is composed of 3 texts: example["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_examples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_examples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_examples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_examples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/cpp/http_client.py b/applications/neural_search/ranking/ernie_matching/deploy/cpp/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..d649943dcf9f46d6ac46eae9bbdcddf2dfd0e09f --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/cpp/http_client.py @@ -0,0 +1,65 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client.httpclient import HttpClient + +import paddlenlp as ppnlp + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = HttpClient() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-gram-zh") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "中西方语言与文化的差异", "title": "第二语言习得的一大障碍就是文化差异。"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print(result) +print("time to cost :{} seconds".format(b_end - b_start)) +print(result.outputs[0].tensor[0].float_data) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/cpp/rpc_client.py b/applications/neural_search/ranking/ernie_matching/deploy/cpp/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..a3474c2045a7ffbc21117ef9133db8dbc73449fd --- /dev/null +++ 
b/applications/neural_search/ranking/ernie_matching/deploy/cpp/rpc_client.py @@ -0,0 +1,65 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client import Client + +import paddlenlp as ppnlp + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = Client() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-gram-zh") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "中西方语言与文化的差异", "title": "第二语言习得的一大障碍就是文化差异。"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print("time to cost :{} seconds".format(b_end - b_start)) +print(result) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/cpp/start_server.sh b/applications/neural_search/ranking/ernie_matching/deploy/cpp/start_server.sh new file mode 100644 index 0000000000000000000000000000000000000000..0197b9a6c223204db1facbecd4b3384a079b95af --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/cpp/start_server.sh @@ -0,0 +1 @@ +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/config_nlp.yml b/applications/neural_search/ranking/ernie_matching/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..027ac44eafb7d3ebcd113a6783a3535d0e665b38 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, 
rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8088 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8089 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['predict'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/deploy.sh b/applications/neural_search/ranking/ernie_matching/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..2eeeb719b51438e9e811e63cb77644524481cb36 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/deploy.sh @@ -0,0 +1,3 @@ +python deploy/python/predict.py --model_dir ./output \ + --input_file sort/test_pairwise.csv \ + --model_name_or_path ernie-3.0-medium-zh \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/predict.py b/applications/neural_search/ranking/ernie_matching/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..a18e07f0d9db8aa92a1ed1cdadef0781e553fe7f --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/predict.py @@ -0,0 +1,219 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--input_file", type=str, required=True, help="The test set file.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.predict.pdmodel" + params_file = model_dir + "/inference.predict.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + 
elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-tiny", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=self.max_seq_length, is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + sim_score = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return sim_score + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + test_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + data = [{"query": d["query"], "title": d["title"]} for d in test_ds] + + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer)) + for idx, text in enumerate(data): + print("Data: {} \t prob: {}".format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/rpc_client.py b/applications/neural_search/ranking/ernie_matching/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..613fe9b9aa3c42eb0210a5ec3e302767ae56c3ae --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/rpc_client.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8089"]) + +list_data = [{"query": "中西方语言与文化的差异", "title": "第二语言习得的一大障碍就是文化差异。"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(result.shape) +print(result) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/web_service.py b/applications/neural_search/ranking/ernie_matching/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..64bb3f1ef8970c7283077514453a5b22b8816ab0 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
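+
+# Paddle Serving pipeline service for the pair-wise ranking model. ErnieOp
+# tokenizes every {"query", "title"} pair with the ernie-3.0-medium-zh
+# tokenizer, pads the batch into input_ids / token_type_ids, and converts the
+# "predict" fetch result to a string in postprocess. ErnieService exposes the
+# op as a pipeline configured by config_nlp.yml; start it with
+# `python web_service.py` and query it with deploy/python/rpc_client.py.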
+ +import json + +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = json.loads(input_dict[str(i)].replace("'", '"')) + input_ids, segment_ids = convert_example(example, self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["predict"] = str(fetch_dict["predict"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/neural_search/ranking/ernie_matching/evaluate.py b/applications/neural_search/ranking/ernie_matching/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..0aaf5caca1efd2617cfe531e9045379e9cff351e --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/evaluate.py @@ -0,0 +1,136 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
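+
+# Offline evaluation for the pair-wise ranking model: reads a tab-separated
+# file with "query", "title" and "label" columns, scores every pair with
+# PairwiseMatching.predict and reports AUC on the dev set, e.g.
+#     python -u -m paddle.distributed.launch --gpus "0" evaluate.py \
+#         --device gpu \
+#         --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \
+#         --test_file data/dev_pairwise.csv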
+ +import argparse +import os +import random +from functools import partial + +import numpy as np +import paddle +import pandas as pd +from data import convert_pairwise_example as convert_example +from data import create_dataloader +from model import PairwiseMatching +from tqdm import tqdm + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.1, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
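+        phase(obj:`str`, optional): The split name ("dev" by default); it only
+            affects the "eval_{phase} auc" log line that gets printed.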
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict(input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +# 构建读取函数,读取原始数据 +def read(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + neg_title = row["neg_title"] + yield {"query": query, "title": title, "neg_title": neg_title} + + +def read_test(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + label = row["label"] + yield {"query": query, "title": title, "label": label} + + +def main(): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + dev_ds = load_dataset(read_test, src_path=args.test_file, lazy=False) + print(dev_ds[0]) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="eval") + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + metric = paddle.metric.Auc() + evaluate(model, metric, dev_data_loader, "dev") + + +if __name__ == "__main__": + main() diff --git a/applications/neural_search/ranking/ernie_matching/export_model.py b/applications/neural_search/ranking/ernie_matching/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..0d329bb326ffbc91b3b4d4403d6f2cb32a323950 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/export_model.py @@ -0,0 +1,54 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
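+
+# Export script: loads a trained PairwiseMatching checkpoint, switches it to
+# eval mode, converts it to a static graph with paddle.jit.to_static using two
+# [None, None] int64 InputSpecs (input_ids, token_type_ids), and saves it under
+# <output_path>/inference. The exported inference.predict.* files are the ones
+# that deploy/python/predict.py and scripts/export_to_serving.sh expect, e.g.
+#     python export_model.py --params_path checkpoints/model_30000/model_state.pdparams \
+#         --output_path=./output --model_name_or_path ernie-3.0-medium-zh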
+ +import argparse +import os + +import paddle +from model import PairwiseMatching + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/ranking/ernie_matching/export_to_serving.py b/applications/neural_search/ranking/ernie_matching/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/neural_search/ranking/ernie_matching/model.py b/applications/neural_search/ranking/ernie_matching/model.py new file mode 100644 index 0000000000000000000000000000000000000000..205148f00524d1cdcb78adefc2f920a7ef8ffd59 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/model.py @@ -0,0 +1,75 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
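+
+# PairwiseMatching: the pair-wise matching network used for ranking. It feeds
+# the concatenated query/title pair through a pretrained encoder, maps the
+# pooled output through Linear(hidden_size, 1) + sigmoid to a similarity score,
+# and trains with margin_ranking_loss between the positive-pair and
+# negative-pair scores (roughly max(0, margin - (pos_sim - neg_sim))).
+# predict() is decorated with paddle.jit.to_static so it can be exported for
+# inference.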
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class PairwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.1): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.margin = margin + + # hidden_size -> 1, calculate similarity + self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + cls_embedding = self.dropout(cls_embedding) + sim_score = self.similarity(cls_embedding) + sim_score = F.sigmoid(sim_score) + + return sim_score + + def forward( + self, + pos_input_ids, + neg_input_ids, + pos_token_type_ids=None, + neg_token_type_ids=None, + pos_position_ids=None, + neg_position_ids=None, + pos_attention_mask=None, + neg_attention_mask=None, + ): + + _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask) + + _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask) + + pos_embedding = self.dropout(pos_cls_embedding) + neg_embedding = self.dropout(neg_cls_embedding) + + pos_sim = self.similarity(pos_embedding) + neg_sim = self.similarity(neg_embedding) + + pos_sim = F.sigmoid(pos_sim) + neg_sim = F.sigmoid(neg_sim) + + labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32") + + loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin) + + return loss diff --git a/applications/neural_search/ranking/ernie_matching/predict_pairwise.py b/applications/neural_search/ranking/ernie_matching/predict_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..3abbfeb35589c9ab09419612462f8ae1b3840bcb --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/predict_pairwise.py @@ -0,0 +1,105 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
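+
+# Batch prediction for the pair-wise ranking model: reads tab-separated
+# query/title pairs with read_text_pair, scores each pair via
+# PairwiseMatching.predict and prints every input dict together with its
+# "pred_prob", e.g.
+#     python -u -m paddle.distributed.launch --gpus "0" predict_pairwise.py \
+#         --device gpu \
+#         --params_path "./checkpoints/model_30000/model_state.pdparams" \
+#         --batch_size 128 --max_seq_length 64 --input_file 'sort/test_pairwise.csv'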
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pairwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PairwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + batch_prob = model.predict(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + if len(batch_prob) == 1: + batch_probs = np.array(batch_probs) + else: + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_probs): + text_pair = valid_ds[idx] + text_pair["pred_prob"] = prob[0] + print(text_pair) diff --git 
a/applications/neural_search/ranking/ernie_matching/scripts/evaluate.sh b/applications/neural_search/ranking/ernie_matching/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..bfb8c120a4cf852af840acb9a42d7594ac42977a --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/evaluate.sh @@ -0,0 +1,16 @@ +unset CUDA_VISIBLE_DEVICES +# gpu +python -u -m paddle.distributed.launch --gpus "0" evaluate.py \ + --device gpu \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ + --test_file sort/dev_pairwise.csv + +# cpu +# python evaluate.py \ +# --device cpu \ +# --batch_size 32 \ +# --learning_rate 2E-5 \ +# --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ +# --test_file sort/dev_pairwise.csv \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/scripts/export_model.sh b/applications/neural_search/ranking/ernie_matching/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..f6849a95eb80777aed443c4288828cb894ac57d8 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/export_model.sh @@ -0,0 +1,3 @@ +python export_model.py --params_path checkpoints/model_30000/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path ernie-3.0-medium-zh \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/scripts/export_to_serving.sh b/applications/neural_search/ranking/ernie_matching/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..c252f811e29c1f741cbe9ba64f368ae5c914900d --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.predict.pdmodel" \ + --params_filename "inference.predict.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" diff --git a/applications/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh b/applications/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe0767e14bfabb6441aafb3829c650b0be42d42d --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh @@ -0,0 +1,15 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pairwise.py \ + --device gpu \ + --params_path "./checkpoints/model_30000/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'sort/test_pairwise.csv' +# cpu +# python predict_pairwise.py \ +# --device gpu \ +# --params_path "./checkpoints/model_30000/model_state.pdparams"\ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --input_file 'sort/test_pairwise.csv' \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh b/applications/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh new file mode 100644 index 0000000000000000000000000000000000000000..a95169e5b155386bda9202285c7f5ee53fa3fddb --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh @@ -0,0 +1,21 @@ +# gpu +python -u -m paddle.distributed.launch --gpus="0,1,2,3" train_pairwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --margin 0.1 \ 
+ --eval_step 100 \ + --train_file sort/train_pairwise.csv \ + --test_file sort/dev_pairwise.csv + +# cpu +# python train_pairwise.py \ +# --device cpu \ +# --save_dir ./checkpoints \ +# --batch_size 32 \ +# --learning_rate 2E-5 \ +# --margin 0.1 \ +# --eval_step 100 \ +# --train_file sort/train_pairwise.csv \ +# --test_file sort/dev_pairwise.csv \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/train_pairwise.py b/applications/neural_search/ranking/ernie_matching/train_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..8d51cd55b5ec2d7ae984e1ce5b8911f270cd471b --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/train_pairwise.py @@ -0,0 +1,209 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import pandas as pd +from data import convert_pairwise_example as convert_example +from data import create_dataloader +from model import PairwiseMatching +from tqdm import tqdm + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.2, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--train_file", type=str, required=True, help="The full path of train file") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=200, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
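+        phase(obj:`str`, optional): Name of the dataset split being evaluated,
+            used only in the printed "eval_{phase} auc" log line.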
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict(input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +# 构建读取函数,读取原始数据 +def read(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + neg_title = row["neg_title"] + yield {"query": query, "title": title, "neg_title": neg_title} + + +def read_test(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + label = row["label"] + yield {"query": query, "title": title, "label": label} + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read, src_path=args.train_file, lazy=False) + dev_ds = load_dataset(read_test, src_path=args.test_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func_train = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="eval") + + batchify_fn_train = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pos_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pos_pair_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # neg_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # neg_pair_segment + ): [data for data in fn(samples)] + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn_train, trans_fn=trans_func_train + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
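+    # `decay_params` keeps the names of all parameters except biases and
+    # LayerNorm weights; `apply_decay_param_fun` below then tells AdamW to
+    # apply weight decay only to parameters whose names appear in this list.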
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + metric = paddle.metric.Auc() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids = batch + + loss = model( + pos_input_ids=pos_input_ids, + neg_input_ids=neg_input_ids, + pos_token_type_ids=pos_token_type_ids, + neg_token_type_ids=neg_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/recall/in_batch_negative/README.md b/applications/neural_search/recall/in_batch_negative/README.md new file mode 100644 index 0000000000000000000000000000000000000000..aa07d6908480146ddc401a92e75b4386f864d89e --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/README.md @@ -0,0 +1,635 @@ +# In-batch Negatives + + **目录** + +* [背景介绍](#背景介绍) +* [In-batch Negatives](#In-batchNegatives) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +在召回阶段,最常见的方式是通过双塔模型,学习Document(简写为Doc)的向量表示,对Doc端建立索引,用ANN召回。我们在这种方式的基础上,引入语义索引策略 [In-batch Negatives](https://arxiv.org/abs/2004.04906),以如下Batch size=4的训练数据为例: + + +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` + +In-batch Negatives 策略的训练数据为语义相似的 Pair 对,策略核心是在 1 个 Batch 内同时基于 N 个负例进行梯度更新,将Batch 内除自身之外其它所有 Source Text 的相似文本 Target Text 作为负例,例如: 上例中“我手机丢了,我想换个手机” 有 1 个正例(”我想买个新手机,求推荐“),3 个负例(1.求秋色之空全集漫画,2.手机学日语的软件,3.侠盗飞车罪恶都市怎么改车)。 + + + + +# In-batch Negatives + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,在召回训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,进行召回测试。 + + +### 评估指标 + +采用 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。 + +Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序的召回列表中返回前k个结果)结果中检索出的相关结果数和库中所有的相关结果数的比率,衡量的是检索系统的查全率。 + +**效果评估** + +| 策略 | 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 | +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| In-batch Negatives | ernie 1.0 | 51.301 | 65.309| 69.878| 73.996|78.881| +| In-batch Negatives | rocketqa-zh-base-query-encoder | **59.622** | **75.089**| **79.668**| **83.404**|**87.773**| + + + + +## 2. 
环境依赖 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.2.3 +* paddlenlp >= 2.2 +* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2 +* visualdl >= 2.2.2 + + + +## 3. 代码结构 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train_batch_neg.py # In-batch Negatives 策略的训练主脚本 +|—— batch_negative + |—— model.py # In-batch Negatives 策略核心网络结构 +|—— ann_util.py # Ann 建索引库相关函数 + + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train_batch_neg.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 +|—— inference.py # 动态图抽取向量 +|—— export_to_serving.py # 静态图转 Serving + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +我们基于某文献检索平台数据,构造面向语义索引的训练集、测试集、召回库。 + +**训练集** 和 **验证集** 格式一致,训练集4k条,测试集2w条,每行由一对语义相似的文本Pair构成,以tab符分割,第一列是检索query,第二列由相关文献标题(+关键词)构成。样例数据如下: + +``` +宁夏社区图书馆服务体系布局现状分析 宁夏社区图书馆服务体系布局现状分析社区图书馆,社区图书馆服务,社区图书馆服务体系 +人口老龄化对京津冀经济 京津冀人口老龄化对区域经济增长的影响京津冀,人口老龄化,区域经济增长,固定效应模型 +英语广告中的模糊语 模糊语在英语广告中的应用及其功能模糊语,英语广告,表现形式,语用功能 +甘氨酸二肽的合成 甘氨酸二肽合成中缩合剂的选择甘氨酸,缩合剂,二肽 +``` + +**召回库** 用于模拟业务线上的全量语料库,评估模型的召回效果,计算相应的Recall指标。召回库总共30万条样本,每行由一列构成,文献标题(+关键词),样例数据如下: +``` +陕西省贫困地区城乡青春期少女生长发育调查青春期,生长发育,贫困地区 +五丈岩水库溢洪道加固工程中的新材料应用碳纤维布,粘钢加固技术,超细水泥,灌浆技术 +木塑复合材料在儿童卫浴家具中的应用探索木塑复合材料,儿童,卫浴家具 +泡沫铝准静态轴向压缩有限元仿真泡沫铝,准静态,轴向压缩,力学特性 +``` + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + + +## 5. 模型训练 + +**语义索引训练模型下载链接:** + +以下模型结构参数为: `TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|
ernie 1.0 margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|f3e5c7d7b0b718c2530c5e1b136b2d74| + + +### 训练环境说明 + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于 In-batch Negatives 策略训练模型,数据量比较小,几分钟就可以完成。如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。 + +如果使用CPU进行训练,则需要吧`--gpus`参数去除,然后吧`device`设置成cpu即可,详细请参考train_batch_neg.sh文件的训练设置 + +然后运行下面的命令使用GPU训练,得到语义索引模型: + +``` +root_path=inbatch +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_steps 10 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file recall/train.csv \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --recall_num 50 \ + --similar_text_pair_file "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 +* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair_file`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file +* `use_recompute`: 使用Recompute策略,用于节省显存,是一种以时间换空间的技术 +* `use_gradient_cache`: 使用Gradient Cache策略,用于节省显存,是一种以时间换空间的技术 +* `chunk_numbers`: 使用Gradient Cache策略的参数,表示的是同一个批次的样本分几次执行 + +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + + +## 6. 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取Query的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 
评估 + +基于评估集 `dev.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10,20,50。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +成功运行结束后,会在 `./recall_result_dir/` 目录下产出 `recall_result.txt` 文件 + +``` +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理对尼龙6及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响尼龙6,聚酰胺嵌段共聚物,芳香聚酰胺,热处理 0.9831992387771606 +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理方法对高强高模聚乙烯醇纤维性能的影响聚乙烯醇纤维,热处理,性能,热拉伸,热定型 0.8438636660575867 +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 制备工艺对PVC/ABS合金力学性能和维卡软化温度的影响PVC,ABS,正交试验,力学性能,维卡软化温度 0.8130228519439697 +..... +``` + + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10, Recall@20 和 Recall@50 指标: +``` +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标: + +``` +recall@1=51.261 +recall@5=65.279 +recall@10=69.848 +recall@20=73.971 +recall@50=78.84 +``` + + + +## 7. 预测 + +我们可以基于语义索引模型预测文本的语义向量或者计算文本 Pair 的语义相似度。 + +### 7.1 功能一:抽取文本的语义向量 + +修改 inference.py 文件里面输入文本 id2corpus 和模型路径 params_path : + +``` +params_path='checkpoints/inbatch/model_40/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` +然后运行: +``` +python inference.py +``` +预测结果为256维的向量: + +``` +[1, 256] +[[ 0.07766181 -0.13780491 0.03388524 -0.14910668 -0.0334941 0.06780092 + 0.0104043 0.03168401 0.02605671 0.02088691 0.05520441 -0.0852212 + ..... 
+``` + +### 7.2 功能二:计算文本 Pair 的语义相似度 + + +### 准备预测数据 + +待预测数据为 tab 分隔的 csv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +外语阅读焦虑与英语成绩及性别的关系 外语阅读焦虑与英语成绩及性别的关系外语阅读焦虑,外语课堂焦虑,英语成绩,性别 +数字图书馆 智能化图书馆 +网络健康可信性研究 网络成瘾少年 +``` + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 [In-batch Negatives](https://arxiv.org/abs/2004.04906) 策略语义索引模型开始计算文本 Pair 的语义相似度: +``` +root_dir="checkpoints/inbatch" + +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` +predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本 + +产出如下结果 +``` +0.9717282652854919 +0.9371012449264526 +0.7968897223472595 +0.30377304553985596 +``` + + + +## 8. 部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改id2corpus的样本: + +``` +# 抽取向量 +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +# 计算相似度 +corpus_list=[['中西方语言与文化的差异','中西方文化差异以及语言体现中西方文化,差异,语言体现'], + ['中西方语言与文化的差异','飞桨致力于让深度学习技术的创新与应用更简单']] + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率: + +``` +(1, 256) +[[-0.0394925 -0.04474756 -0.065534 0.00939134 0.04359895 0.14659195 + -0.0091779 -0.07303623 0.09413272 -0.01255222 -0.08685658 0.02762237 + 0.10138468 0.00962821 0.10888419 0.04553023 0.05898942 0.00694253 + .... 
+ +[0.959269642829895, 0.04725276678800583] +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" + +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +修改模型需要用到的`Tokenizer` + +``` +self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") +``` + +然后启动 Pipeline Server: + +``` +cd deploy/python +python web_service.py +``` + +启动客户端调用 Server。 + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [ + "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据", + "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比" +] +``` +然后运行: + +``` +python deploy/python/rpc_client.py +``` +模型的输出为: + +``` +{'0': '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据', '1': '试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比'} +PipelineClient::predict pack_data time:1641450851.3752182 +PipelineClient::predict before time:1641450851.375738 +['output_embedding'] +(2, 256) +[[ 0.07830612 -0.14036864 0.03433796 -0.14967982 -0.03386067 0.06630666 + 0.01357943 0.03531194 0.02411093 0.02000859 0.05724002 -0.08119463 + ...... +``` + +可以看到客户端发送了2条文本,返回了2个 embedding 向量 + +#### C++的方式 + +启动C++的Serving: + +``` +python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 2 --thread 5 --ir_optim True --use_trt --precision FP16 +``` +也可以使用脚本: + +``` +sh deploy/cpp/start_server.sh +``` +Client 可以使用 http 或者 rpc 两种方式,rpc 的方式为: + +``` +python deploy/cpp/rpc_client.py +``` +运行的输出为: +``` +I0209 20:40:07.978225 20896 general_model.cpp:490] [client]logid=0,client_cost=395.695ms,server_cost=392.559ms. +time to cost :0.3960278034210205 seconds +{'output_embedding': array([[ 9.01343748e-02, -1.21870913e-01, 1.32834800e-02, + -1.57673359e-01, -2.60387752e-02, 6.98455423e-02, + 1.58108603e-02, 3.89952064e-02, 3.22783105e-02, + 3.49135026e-02, 7.66086206e-02, -9.12970975e-02, + 6.25643134e-02, 7.21886680e-02, 7.03565404e-02, + 5.44054210e-02, 3.25332815e-03, 5.01751155e-02, +...... +``` +可以看到服务端返回了向量 + +或者使用 http 的客户端访问模式: + +``` +python deploy/cpp/http_client.py +``` +运行的输出为: + +``` +(2, 64) +(2, 64) +outputs { + tensor { + float_data: 0.09013437479734421 + float_data: -0.12187091261148453 + float_data: 0.01328347995877266 + float_data: -0.15767335891723633 +...... +``` +可以看到服务端返回了向量 + +## FAQ + +#### 如何基于无监督SimCSE训练出的模型参数作为参数初始化继续做有监督 In-Batch Negative 训练? 
+ ++ 使用 `--init_from_ckpt` 参数加载即可,下面是使用示例: + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/simcse_inbatch_negative \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 10 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file recall/train.csv \ + --init_from_ckpt simcse/model_20000/model_state.pdparams +``` + + + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/neural_search/recall/in_batch_negative/ann_util.py b/applications/neural_search/recall/in_batch_negative/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..a76b916a7e300355660aebb3580ae19ff442955a --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/ann_util.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size if args.output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/neural_search/recall/in_batch_negative/base_model.py b/applications/neural_search/recall/in_batch_negative/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..99466292bccb7cbc99d10547cb5b06eb18782b35 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/base_model.py @@ -0,0 +1,161 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
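+
+# SemanticIndexBase builds a sentence embedding from the pooled [CLS] output
+# of the pretrained model, optionally projects it to `output_emb_size`
+# dimensions, and L2-normalizes it, so the inner product used by the ANN
+# index is equivalent to cosine similarity.
+# SemanticIndexBaseStatic is the paddle.jit.to_static variant of the same
+# model, used when exporting a static graph for inference.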
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + 
with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/neural_search/recall/in_batch_negative/batch_negative/model.py b/applications/neural_search/recall/in_batch_negative/batch_negative/model.py new file mode 100644 index 0000000000000000000000000000000000000000..bf9da27df57a79cdd98ca11fa483648aa31bf569 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/batch_negative/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
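+
+# The in-batch negatives loss implemented below works on a batch of
+# (query, title) pairs: the normalized query and title embeddings are
+# multiplied into a [batch_size, batch_size] similarity matrix, the margin
+# is subtracted from the diagonal (the positive pairs), the matrix is scaled
+# by `scale`, and cross entropy with labels arange(batch_size) is applied,
+# so every other title in the same batch acts as a negative for each query.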
+ + +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # Substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # Scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss + + +class SemanticIndexCacheNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # Substract margin from all positive samples cosine_sim() + margin_diag = paddle.full(shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=cosine_sim.dtype) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # Scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + return [cosine_sim, labels, query_cls_embedding, title_cls_embedding] diff --git a/applications/neural_search/recall/in_batch_negative/data.py b/applications/neural_search/recall/in_batch_negative/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d35f0985be17ddc518e6473327438b69ff361c9f --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/data.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import paddle +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text": data[0], "pos_sample": data[1], "neg_sample": data[2]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpoint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succeed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using latest ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py b/applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py new file mode 100644 index 
0000000000000000000000000000000000000000..4115859f993814836d0fc7850e1936e4ba185f05 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py @@ -0,0 +1,73 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client import HttpClient + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=True): + list_input_ids = [] + list_token_type_ids = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + list_input_ids.append(input_ids) + list_token_type_ids.append(token_type_ids) + return list_input_ids, list_token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:9393"] +client = HttpClient() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ +print(feed_names) +print(fetch_names) + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") +max_seq_len = 64 + +# 数据预处理 + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据.", "面向生态系统服务的生态系统分类方案研发与应用"] +# for i in range(5): +# list_data.extend(list_data) +# print(len(list_data)) +examples = convert_example(list_data, tokenizer, max_seq_length=max_seq_len) +print(examples) + +feed_dict = {} +feed_dict["input_ids"] = np.array(examples[0]) +feed_dict["token_type_ids"] = np.array(examples[1]) + +print(feed_dict["input_ids"].shape) +print(feed_dict["token_type_ids"].shape) + +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print(result) +print("time to cost :{} seconds".format(b_end - b_start)) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py b/applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..8938c8ce32c735e13f8790d48a81f21413a55ea1 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py @@ -0,0 +1,71 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client import Client + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=True): + list_input_ids = [] + list_token_type_ids = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + list_input_ids.append(input_ids) + list_token_type_ids.append(token_type_ids) + return list_input_ids, list_token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:9393"] +client = Client() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ +print(feed_names) +print(fetch_names) + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") +max_seq_len = 64 + +# 数据预处理 + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据.", "面向生态系统服务的生态系统分类方案研发与应用"] +# for i in range(5): +# list_data.extend(list_data) +# print(len(list_data)) +examples = convert_example(list_data, tokenizer, max_seq_length=max_seq_len) +print(examples) + +feed_dict = {} +feed_dict["input_ids"] = np.array(examples[0]) +feed_dict["token_type_ids"] = np.array(examples[1]) + +print(feed_dict["input_ids"].shape) +print(feed_dict["token_type_ids"].shape) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print("time to cost :{} seconds".format(b_end - b_start)) +print(result) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/cpp/start_server.sh b/applications/neural_search/recall/in_batch_negative/deploy/cpp/start_server.sh new file mode 100644 index 0000000000000000000000000000000000000000..55d380d6f87396887675a008c54bb8544ce2a793 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/cpp/start_server.sh @@ -0,0 +1 @@ +python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 2 --thread 5 --ir_optim True --use_trt --precision FP16 \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml b/applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..1af6298427f4c90c02bfb9c3dc0142002fd58800 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 18082 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, 
grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/deploy.sh b/applications/neural_search/recall/in_batch_negative/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py b/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..7d493b91a8c18ab64d9cd7c2b29edad688266866 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py @@ -0,0 +1,265 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name=args.model_name_or_path, + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids = convert_example({idx: text[0]}, tokenizer) + title_ids, title_segment_ids = convert_example({idx: text[1]}, tokenizer) + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. 
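+    # The demo below first extracts a 256-dim embedding for a single corpus
+    # entry via extract_embedding, then scores two text pairs with predict,
+    # which returns the cosine similarity of the two inference outputs.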
+ output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + res = predictor.extract_embedding(corpus_list, tokenizer) + print(res.shape) + print(res) + corpus_list = [["中西方语言与文化的差异", "中西方文化差异以及语言体现中西方文化,差异,语言体现"], ["中西方语言与文化的差异", "飞桨致力于让深度学习技术的创新与应用更简单"]] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py b/applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..d46979c75e151522917d66b843f02a0f60a39c75 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据", "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) + +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py b/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..a4730d721110cfc47531769fabc4d5733e83a131 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
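+
+# ErnieOp tokenizes the texts received in the request dict in preprocess(),
+# feeds input_ids/token_type_ids to the inference model configured in
+# config_nlp.yml, and returns the pooled embeddings as the string field
+# "output_embedding" in postprocess().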
+ +import logging + +from paddle_serving_server.web_service import Op, WebService + +_LOGGER = logging.getLogger() + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([input_dict[str(i)]], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/neural_search/recall/in_batch_negative/evaluate.py b/applications/neural_search/recall/in_batch_negative/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..4a4236220d3bc5cfff6f6dd478a5121d5788f980 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/evaluate.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
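+
+# Standalone Recall@N evaluation for the recall results produced by recall.py.
+# Typical invocation (see scripts/evaluate.sh; the paths are examples):
+#
+#     python -u evaluate.py \
+#         --similar_text_pair "recall/dev.csv" \
+#         --recall_result_file "./recall_result_dir/recall_result.txt" \
+#         --recall_num 50
+#
+# Recall@1/5/10/20/50 are printed and appended as a timestamped row to result.tsv.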
+
+import argparse
+import time
+
+import numpy as np
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--similar_text_pair", type=str,
+                    default='', help="The full path of the similar text pair file")
+parser.add_argument("--recall_result_file", type=str,
+                    default='', help="The full path of the recall result file")
+parser.add_argument("--recall_num", type=int, default=10,
+                    help="Number of most similar docs recalled from the corpus per query")
+
+
+args = parser.parse_args()
+
+
+def recall(rs, N=10):
+    """
+    Ratio of queries whose ground truth appears in the top-N recalled docs.
+
+    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+    >>> recall(rs, N=1)
+    0.333333
+    >>> recall(rs, N=2)
+    0.6666667
+    >>> recall(rs, N=3)
+    1.0
+
+    Args:
+        rs: Iterable of per-query relevance flags (1 if the recalled doc is the ground truth, else 0).
+    Returns:
+        Recall@N
+    """
+
+    recall_flags = [np.sum(r[0:N]) for r in rs]
+    return np.mean(recall_flags)
+
+
+if __name__ == "__main__":
+    text2similar = {}
+    with open(args.similar_text_pair, "r", encoding="utf-8") as f:
+        for line in f:
+            text, similar_text = line.rstrip().split("\t")
+            text2similar[text] = similar_text
+
+    rs = []
+
+    with open(args.recall_result_file, "r", encoding="utf-8") as f:
+        relevance_labels = []
+        for index, line in enumerate(f):
+
+            if index % args.recall_num == 0 and index != 0:
+                rs.append(relevance_labels)
+                relevance_labels = []
+
+            text, recalled_text, cosine_sim = line.rstrip().split("\t")
+            if text2similar[text] == recalled_text:
+                relevance_labels.append(1)
+            else:
+                relevance_labels.append(0)
+        # Keep the labels collected for the last query as well.
+        if relevance_labels:
+            rs.append(relevance_labels)
+
+    recall_N = []
+    recall_num = [1, 5, 10, 20, 50]
+    for topN in recall_num:
+        R = round(100 * recall(rs, N=topN), 3)
+        recall_N.append(str(R))
+
+    res = []
+    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
+    res.append(timestamp)
+    for key, val in zip(recall_num, recall_N):
+        print("recall@{}={}".format(key, val))
+        res.append(str(val))
+    # Append one timestamped row of Recall@N results and close the file properly.
+    with open("result.tsv", "a", encoding="utf-8") as result:
+        result.write("\t".join(res) + "\n")
diff --git a/applications/neural_search/recall/in_batch_negative/export_model.py b/applications/neural_search/recall/in_batch_negative/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..648ccbc9672b5a02d93e9667861645c0c43e34f1
--- /dev/null
+++ b/applications/neural_search/recall/in_batch_negative/export_model.py
@@ -0,0 +1,56 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
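+
+# Exports a fine-tuned dynamic-graph checkpoint to a static-graph inference
+# model with paddle.jit.to_static / paddle.jit.save. Example invocation
+# (see scripts/export_model.sh; the checkpoint path is an example):
+#
+#     python export_model.py \
+#         --params_path checkpoints/inbatch/model_40/model_state.pdparams \
+#         --model_name_or_path rocketqa-zh-base-query-encoder \
+#         --output_path ./output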
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, + default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select model to train, defaults to rocketqa-zh-base-query-encoder.") +parser.add_argument("--output_path", type=str, default='./output', + help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + output_emb_size = 256 + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/recall/in_batch_negative/export_to_serving.py b/applications/neural_search/recall/in_batch_negative/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/neural_search/recall/in_batch_negative/inference.py b/applications/neural_search/recall/in_batch_negative/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..a49d261a3be97650ecb9d34f949b3e7bb8ed4426 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/inference.py @@ -0,0 +1,75 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
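+
+# Minimal dynamic-graph sanity check: loads a fine-tuned SemanticIndexBaseStatic
+# checkpoint and prints the pooled embedding of a single sample corpus entry.
+# params_path, model_name_or_path and the sample text are hard-coded below;
+# edit them to match your own checkpoint before running `python inference.py`.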
+ +from functools import partial +import os + +import paddle +from paddlenlp.data import Tuple, Pad +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +from base_model import SemanticIndexBaseStatic +from data import convert_example, create_dataloader + +if __name__ == "__main__": + device = "gpu" + max_seq_length = 64 + output_emb_size = 256 + batch_size = 1 + params_path = "checkpoints/inbatch/model_40/model_state.pdparams" + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + model_name_or_path = "rocketqa-zh-base-query-encoder" + paddle.set_device(device) + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + # convert_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + text_embedding = all_embeddings[0] + print(text_embedding.shape) + print(text_embedding.numpy()) diff --git a/applications/neural_search/recall/in_batch_negative/predict.py b/applications/neural_search/recall/in_batch_negative/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..7390c13851b17c5c47d1bcb5cad40d5258de9fd8 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/predict.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
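+
+# Computes the cosine similarity for every tab-separated text pair in
+# --text_pair_file with a fine-tuned SemanticIndexBase model. Example
+# invocation (see scripts/predict.sh; the paths are examples):
+#
+#     python -u -m paddle.distributed.launch --gpus "0" predict.py \
+#         --device gpu \
+#         --params_path "checkpoints/inbatch/model_40/model_state.pdparams" \
+#         --model_name_or_path rocketqa-zh-base-query-encoder \
+#         --output_emb_size 256 \
+#         --batch_size 128 \
+#         --max_seq_length 64 \
+#         --text_pair_file "recall/test.csv"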
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, + required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, + help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, + type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", + help="Whether to pad to max seq length.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/applications/neural_search/recall/in_batch_negative/recall.py b/applications/neural_search/recall/in_batch_negative/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..d81c2a73540d78b5d97a20f6f60766e76202dc7e --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/recall.py @@ -0,0 +1,134 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
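+
+# Builds an approximate nearest neighbour index (HNSW, via ann_util.build_index)
+# over the corpus embeddings and writes the top --recall_num candidates for each
+# query in --similar_text_pair_file to the recall result file. Example invocation
+# (see scripts/run_build_index.sh; the paths are examples):
+#
+#     python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \
+#         recall.py \
+#         --device gpu \
+#         --recall_result_dir "recall_result_dir" \
+#         --recall_result_file "recall_result.txt" \
+#         --params_path "checkpoints/inbatch/model_40/model_state.pdparams" \
+#         --model_name_or_path rocketqa-zh-base-query-encoder \
+#         --hnsw_m 100 --hnsw_ef 100 \
+#         --batch_size 64 --output_emb_size 256 --max_seq_length 64 \
+#         --recall_num 50 \
+#         --similar_text_pair "recall/dev.csv" \
+#         --corpus_file "recall/corpus.csv"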
+ +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, gen_id2corpus, gen_text_file + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, + help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, + required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', + help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, + default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, + help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, + type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--hnsw_m", default=100, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, + type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in 
id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/neural_search/recall/in_batch_negative/scripts/evaluate.sh b/applications/neural_search/recall/in_batch_negative/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..84d6f162b80ea2bf2d41b1947c61a77503c00264 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/evaluate.sh @@ -0,0 +1,4 @@ +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/scripts/export_model.sh b/applications/neural_search/recall/in_batch_negative/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..99d01c7b5aae4173fd1508f2d74e6f5f7696a7fa --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/export_model.sh @@ -0,0 +1,3 @@ +python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/scripts/export_to_serving.sh b/applications/neural_search/recall/in_batch_negative/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..b0d7a422551fd09eb1a28cfacdf47237a8efc795 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/neural_search/recall/in_batch_negative/scripts/predict.sh b/applications/neural_search/recall/in_batch_negative/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..3967bb2c9b5dce94568d1dca6436e87005ab9aef --- /dev/null +++ 
b/applications/neural_search/recall/in_batch_negative/scripts/predict.sh @@ -0,0 +1,22 @@ +# gpu version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" + + +# cpu +# root_dir="checkpoints/inbatch" +# python predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_40/model_state.pdparams" \ +# --output_emb_size 256 \ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --text_pair_file "recall/test.csv" diff --git a/applications/neural_search/recall/in_batch_negative/scripts/run_build_index.sh b/applications/neural_search/recall/in_batch_negative/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..9920a045b9dcaeaeeb7a4d2d0155bb3cd607bb1f --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/run_build_index.sh @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" + +# CPU version +# python recall.py \ +# --device cpu \ +# --recall_result_dir "recall_result_dir" \ +# --recall_result_file "recall_result.txt" \ +# --params_path "${root_dir}/model_40/model_state.pdparams" \ +# --hnsw_m 100 \ +# --hnsw_ef 100 \ +# --batch_size 64 \ +# --output_emb_size 256\ +# --max_seq_length 60 \ +# --recall_num 50 \ +# --similar_text_pair "recall/dev.csv" \ +# --corpus_file "recall/corpus.csv" \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/train_batch_neg.py b/applications/neural_search/recall/in_batch_negative/train_batch_neg.py new file mode 100644 index 0000000000000000000000000000000000000000..2f156f5a39d3efd6abc134fb24c35fff572a0ffd --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/train_batch_neg.py @@ -0,0 +1,434 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from ann_util import build_index +from batch_negative.model import SemanticIndexBatchNeg, SemanticIndexCacheNeg +from data import ( + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + read_text_pair, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--output_emb_size", default=256, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=5E-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Interval steps to save checkpoint") +parser.add_argument('--log_steps', type=int, default=10, help="Interval steps to print log") +parser.add_argument("--train_set_file", type=str, default='./recall/train.csv', help="The full path of train_set_file.") +parser.add_argument("--dev_set_file", type=str, default='./recall/dev.csv', help="The full path of dev_set_file.") +parser.add_argument("--margin", default=0.2, type=float, help="Margin between pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--corpus_file", type=str, default='./recall/corpus.csv', help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, default='./recall/dev.csv', help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', help="The full path of recall result 
file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_init.txt', help="The file name of recall result") +parser.add_argument("--recall_num", default=50, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', help="evaluate_result") +parser.add_argument('--evaluate', action='store_true', help='whether evaluate while training') +parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip") +parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") +parser.add_argument("--amp_loss_scale", default=32768, type=float, help="The value of scale_loss for fp16. This is only used for AMP training.") +parser.add_argument("--use_recompute", action='store_true', help="Using the recompute to scale up the batch size and save the memory.") +parser.add_argument("--use_gradient_cache", action='store_true', help="Using the gradient cache to scale up the batch size and save the memory.") +parser.add_argument("--chunk_numbers", type=int, default=50, help="The number of the chunks for model") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def recall(rs, N=10): + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus): + # Load pretrained semantic model + inner_model = model._layers + final_index = build_index(args, corpus_data_loader, inner_model) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + rs = [] + with open(recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text == recalled_text: + continue + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20, 50] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + evaluate_result_file = os.path.join(args.recall_result_dir, args.evaluate_result) + result = 
open(evaluate_result_file, "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + return float(recall_N[1]) + + +def train( + train_data_loader, + model, + optimizer, + lr_scheduler, + rank, + corpus_data_loader, + query_data_loader, + recall_result_file, + text_list, + id2corpus, + tokenizer, +): + global_step = 0 + best_recall = 0.0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.log_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if not args.evaluate: + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + if args.evaluate and rank == 0: + print("evaluating") + recall_5 = evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus) + if recall_5 > best_recall: + best_recall = recall_5 + + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp: + fp.write("epoch=%d, global_step: %d, recall: %s\n" % (epoch, global_step, recall_5)) + + +def gradient_cache_train(train_data_loader, model, optimizer, lr_scheduler, rank, tokenizer): + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.amp_loss_scale) + + if args.batch_size % args.chunk_numbers == 0: + chunk_numbers = args.chunk_numbers + else: + raise Exception( + f" Batch_size {args.batch_size} must divides chunk_numbers {args.chunk_numbers} without producing a remainder " + ) + + def split(inputs, chunk_numbers, axis=0): + if inputs.shape[0] % chunk_numbers == 0: + return paddle.split(inputs, chunk_numbers, axis=0) + else: + return paddle.split(inputs, inputs.shape[0], axis=0) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + # Separate large batches into several sub batches + chunked_x = [split(t, chunk_numbers, axis=0) for t in batch] + sub_batchs = [list(s) for s in zip(*chunked_x)] + + all_grads = [] + all_CUDA_rnd_state = [] + all_query = [] + all_title = [] + + for sub_batch in sub_batchs: + all_reps = [] + all_labels = [] + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + with 
paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state.append(sub_CUDA_rnd_state) + sub_cosine_sim, sub_label, query_embedding, title_embedding = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, + ) + all_reps.append(sub_cosine_sim) + all_labels.append(sub_label) + all_title.append(title_embedding) + all_query.append(query_embedding) + + model_reps = paddle.concat(all_reps, axis=0) + model_title = paddle.concat(all_title) + model_query = paddle.concat(all_query) + + model_title = model_title.detach() + model_query = model_query.detach() + + model_query.stop_gradient = False + model_title.stop_gradient = False + model_reps.stop_gradient = False + + model_label = paddle.concat(all_labels, axis=0) + loss = F.cross_entropy(input=model_reps, label=model_label) + loss.backward() + # Store gradients + all_grads.append(model_reps.grad) + + for sub_batch, CUDA_state, grad in zip(sub_batchs, all_CUDA_rnd_state, all_grads): + + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + paddle.framework.random.set_cuda_rng_state(CUDA_state) + # Recompute the forward propagation + sub_cosine_sim, sub_label, query_embedding, title_embedding = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, + ) + # Chain rule + surrogate = paddle.dot(sub_cosine_sim, grad) + # Backward propagation + if args.use_amp: + scaled = scaler.scale(surrogate) + scaled.backward() + else: + surrogate.backward() + # Update model parameters + if args.use_amp: + scaler.minimize(optimizer, scaled) + else: + optimizer.step() + + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.log_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path, enable_recompute=args.use_recompute) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, 
dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + if args.use_gradient_cache: + model = SemanticIndexCacheNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + else: + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + batchify_fn_dev = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + id2corpus = gen_id2corpus(args.corpus_file) + + # convert_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=trans_func + ) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=trans_func + ) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + if args.use_gradient_cache: + gradient_cache_train(train_data_loader, model, optimizer, lr_scheduler, rank, tokenizer) + else: + train( + train_data_loader, + model, + optimizer, + lr_scheduler, + rank, + corpus_data_loader, + query_data_loader, + recall_result_file, + text_list, + id2corpus, + tokenizer, + ) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/recall/milvus/README.md b/applications/neural_search/recall/milvus/README.md new file mode 100644 index 0000000000000000000000000000000000000000..de3f1666b960c547c5a03697cb9f28cd52f9f87c --- /dev/null +++ b/applications/neural_search/recall/milvus/README.md @@ -0,0 +1,220 @@ + **目录** + +* [背景介绍](#背景介绍) +* [Milvus召回](#Milvus召回) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 向量检索](#向量检索) + + + + +# 背景介绍 + +基于某检索平台开源的数据集构造生成了面向语义索引的召回库。 + + + +# Milvus召回 + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +使用 Milvus 搭建召回系统,然后使用训练好的语义索引模型,抽取向量,插入到 Milvus 中,然后进行检索。 + + + +## 2. 
环境依赖和安装说明 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.2 +* paddlenlp >= 2.2 +* milvus >= 2.1.0 +* pymilvus >= 2.1.0 + + + +## 3. 代码结构 + +## 代码结构: + +``` +|—— scripts + |—— feature_extract.sh 提取特征向量的bash脚本 + |—— search.sh 插入向量和向量检索bash脚本 +├── base_model.py # 语义索引模型基类 +├── config.py # milvus配置文件 +├── data.py # 数据处理函数 +├── milvus_ann_search.py # 向量插入和检索的脚本 +├── inference.py # 动态图模型向量抽取脚本 +├── feature_extract.py # 批量抽取向量脚本 +├── milvus_util.py # milvus的工具类 +└── README.md +``` + + +## 4. 数据准备 + +数据集的样例如下,有两种,第一种是 title+keywords 进行拼接;第二种是一句话。 + +``` +煤矸石-污泥基活性炭介导强化污水厌氧消化煤矸石,污泥,复合基活性炭,厌氧消化,直接种间电子传递 +睡眠障碍与常见神经系统疾病的关系睡眠觉醒障碍,神经系统疾病,睡眠,快速眼运动,细胞增殖,阿尔茨海默病 +城市道路交通流中观仿真研究智能运输系统;城市交通管理;计算机仿真;城市道路;交通流;路径选择 +.... +``` + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 向量检索 + +### 5.1 基于Milvus的向量检索系统搭建 + +数据准备结束以后,我们开始搭建 Milvus 的语义检索引擎,用于语义向量的快速检索,我们使用[Milvus](https://milvus.io/)开源工具进行召回,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/docs/v2.1.x/install_standalone-docker.md)本案例使用的是 Milvus 的2.1版本,建议使用官方的 Docker 安装方式,简单快捷。 + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成256维度的向量,使用的是32GB的V100的卡进行的提取: + +``` +CUDA_VISIBLE_DEVICES=0 python feature_extract.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --corpus_file "data/milvus_data.csv" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +| 数据量 | 时间 | +| ------------ | ------------ | +|1000万条|3hour40min39s| + +运行结束后会生成 corpus_embedding.npy + +生成了向量后,需要把数据插入到 Milvus 库中,首先修改配置: + +修改 config.py 的配置 ip 和端口,本项目使用的是8530端口,而 Milvus 默认的是19530,需要根据情况进行修改: + +``` +MILVUS_HOST='your milvus ip' +MILVUS_PORT = 8530 +``` + +然后运行下面的命令把向量插入到Milvus库中: + +``` +python milvus_ann_search.py --data_path milvus/milvus_data.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --insert +``` +参数含义说明 + +* `data_path`: 数据的路径 +* `embedding_path`: 数据对应向量的路径 +* `index`: 选择检索向量的索引,用于向量检索 +* `insert`: 是否插入向量 +* `search`: 是否检索向量 +* `batch_size`: 表示的是一次性插入的向量的数量 + + +| 数据量 | 时间 | +| ------------ | ------------ | +|1000万条|21min12s| + +另外,Milvus提供了可视化的管理界面,可以很方便的查看数据,安装地址为[Attu](https://github.com/zilliztech/attu). + +![](../../img/attu.png) + + +运行召回脚本: + +``` +python milvus_ann_search.py --data_path milvus/milvus_data.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --index 18 \ + --search +``` + +运行以后的结果的输出为: + +``` +hit: (distance: 0.0, id: 18), text field: 吉林铁合金集团资产管理现状分析及对策资产管理;资金控制;应收帐款风险;造价控制;集中化财务控制 +hit: (distance: 0.45325806736946106, id: 7611689), text field: 哈药集团应收账款分析应收账款,流动资产,财务报告 +hit: (distance: 0.5440893769264221, id: 4297885), text field: 宝钢集团负债经营风险控制策略研究钢铁行业;负债经营;风险控制 +hit: (distance: 0.5455711483955383, id: 5661135), text field: 浅谈电网企业固定资产风险管理大数据,固定资产,风险管理 +... 
+``` +返回的是向量的距离,向量的id,以及对应的文本。 + +也可以一键执行上述的过程: + +``` +sh scripts/search.sh +``` + +### 5.2 文本检索 + +首先修改代码的模型路径和样本: + +``` +params_path='checkpoints/model_40/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` + +运行命令 + +``` +python3 inference.py + +``` +运行的输出为,分别是抽取的向量和召回的结果: + +``` +[1, 256] +Tensor(shape=[1, 256], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.07830613, -0.14036864, 0.03433795, -0.14967985, -0.03386058, + 0.06630671, 0.01357946, 0.03531205, 0.02411086, 0.02000865, + 0.05724005, -0.08119474, 0.06286906, 0.06509133, 0.07193415, + .... +hit: (distance: 0.40141725540161133, id: 2742485), text field: 完善国有企业技术创新投入机制的探讨--基于经济责任审计实践国有企业,技术创新,投 +入机制 +hit: (distance: 0.40258315205574036, id: 1472893), text field: 企业技术创新与组织冗余--基于国有企业与非国有企业的情境研究 +hit: (distance: 0.4121206998825073, id: 51831), text field: 企业创新影响对外直接投资决策—基于中国制造业上市公司的研究企业创新;对外直接投资; +制造业;上市公司 +hit: (distance: 0.42234909534454346, id: 8682312), text field: 政治关联对企业创新绩效的影响——国有企业与民营企业的对比政治关联,创新绩效,国有 +企业,民营企业,双重差分 +hit: (distance: 0.46187296509742737, id: 9324797), text field: 财务杠杆、股权激励与企业创新——基于中国A股制造业经验数据制造业;上市公司;股权激 +励;财务杠杆;企业创新 +.... +``` +## FAQ + +#### 抽取文本语义向量后,利用 Milvus 进行 ANN 检索查询到了完全相同的文本,但是计算出的距离为什么不是 0? + +使用的是近似索引,详情请参考Milvus官方文档,[索引创建机制](https://milvus.io/cn/docs/v2.0.x/index.md) diff --git a/applications/neural_search/recall/milvus/base_model.py b/applications/neural_search/recall/milvus/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..aa9459d843b33f51776947ca83b36b47d7c327f7 --- /dev/null +++ b/applications/neural_search/recall/milvus/base_model.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
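+
+# Shared semantic-index encoders for the Milvus recall pipeline.
+# SemanticIndexBase is the abstract dynamic-graph base class (forward() is left
+# to subclasses), while SemanticIndexBaseStatic also implements forward() so the
+# encoder can be exported with paddle.jit.save for inference. Both take the
+# pooled [CLS] representation, optionally project it down to output_emb_size
+# dimensions and L2-normalize it.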
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if 
self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/neural_search/recall/milvus/config.py b/applications/neural_search/recall/milvus/config.py new file mode 100644 index 0000000000000000000000000000000000000000..7eada53bf5a5625979d7ed851a3b05b8be08d473 --- /dev/null +++ b/applications/neural_search/recall/milvus/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MILVUS_HOST = "10.21.226.175" +MILVUS_PORT = 8530 +data_dim = 256 +top_k = 100 +collection_name = "literature_search" +partition_tag = "partition_2" +embedding_name = "embeddings" + +index_config = { + "index_type": "IVF_FLAT", + "metric_type": "L2", + "params": {"nlist": 1000}, +} + +search_params = { + "metric_type": "L2", + "params": {"nprobe": top_k}, +} diff --git a/applications/neural_search/recall/milvus/data.py b/applications/neural_search/recall/milvus/data.py new file mode 100644 index 0000000000000000000000000000000000000000..2acbc81b6af7261fe4dcf7bb7b001ba1f5404a60 --- /dev/null +++ b/applications/neural_search/recall/milvus/data.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text": data[0], "pos_sample": data[1], "neg_sample": data[2]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpoint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succeed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using latest ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/neural_search/recall/milvus/feature_extract.py b/applications/neural_search/recall/milvus/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..a2a850449ca2ceb955113de9d29dac740e8860df --- 
/dev/null +++ b/applications/neural_search/recall/milvus/feature_extract.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +from data import convert_example # noqa E402 + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# yapf: enable + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, 
min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in enumerate(tqdm(data)): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) > self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("corpus_embedding", all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, data in enumerate(file.readlines()): + id2corpus[idx] = data.strip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + predictor.predict(corpus_list, tokenizer) diff --git a/applications/neural_search/recall/milvus/inference.py b/applications/neural_search/recall/milvus/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..3b8c2e0c2743bd4ec3ae5c204c05a1f6393e7f20 --- /dev/null +++ b/applications/neural_search/recall/milvus/inference.py @@ -0,0 +1,84 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from functools import partial + +import paddle +from base_model import SemanticIndexBaseStatic +from config import collection_name, embedding_name, partition_tag +from data import convert_example, create_dataloader +from milvus_util import RecallByMilvus + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + + +def search_in_milvus(text_embedding): + recall_client = RecallByMilvus() + result = recall_client.search( + text_embedding.numpy(), + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "text"], + ) + for hits in result: + for hit in hits: + print(f"hit: {hit}, text field: {hit.entity.get('text')}") + + +if __name__ == "__main__": + device = "gpu" + max_seq_length = 64 + output_emb_size = 256 + batch_size = 1 + params_path = "checkpoints/model_40/model_state.pdparams" + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + model_name_or_path = "rocketqa-zh-base-query-encoder" + paddle.set_device(device) + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=output_emb_size) + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + # convert_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + text_embedding = all_embeddings[0] + print(text_embedding.shape) + print(text_embedding) + search_in_milvus(text_embedding) diff --git a/applications/neural_search/recall/milvus/milvus_ann_search.py b/applications/neural_search/recall/milvus/milvus_ann_search.py new file mode 100644 index 0000000000000000000000000000000000000000..23a5500350035a970d9d275f8e34182f1d73bfb8 --- /dev/null +++ b/applications/neural_search/recall/milvus/milvus_ann_search.py @@ -0,0 +1,91 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import numpy as np +from config import collection_name, embedding_name, partition_tag +from milvus_util import RecallByMilvus, VecToMilvus, text_max_len +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument( + "--data_path", default="milvus/milvus_data.csv", type=str, required=True, help="The data for vector extraction." +) +parser.add_argument( + "--embedding_path", default="corpus_embedding.npy", type=str, required=True, help="The vector path for data." +) +parser.add_argument("--index", default=0, type=int, help="index of the vector for search") +parser.add_argument("--insert", action="store_true", help="whether to insert data") +parser.add_argument("--search", action="store_true", help="whether to search data") +parser.add_argument("--batch_size", default=100000, type=int, help="number of examples to insert each time") +args = parser.parse_args() + + +def read_text(file_path): + file = open(file_path) + id2corpus = [] + for idx, data in enumerate(file.readlines()): + id2corpus.append(data.strip()) + return id2corpus + + +def milvus_data_insert(data_path, embedding_path, batch_size): + corpus_list = read_text(data_path) + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + client = VecToMilvus() + client.drop_collection(collection_name) + data_size = len(embedding_ids) + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + entities = [ + [j for j in range(i, cur_end, 1)], + [corpus_list[j][: text_max_len - 1] for j in range(i, cur_end, 1)], + batch_emb, # field embeddings, supports numpy.ndarray and list + ] + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + + +def milvus_data_recall(embedding_path, index): + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + recall_client = RecallByMilvus() + if index > len(embedding_ids): + print("Index should not be larger than embedding size") + return + embeddings = embeddings[np.arange(index, index + 1)] + time_start = time.time() + result = recall_client.search( + embeddings, embedding_name, collection_name, partition_names=[partition_tag], output_fields=["pk", "text"] + ) + time_end = time.time() + sum_t = time_end - time_start + print("time cost", sum_t, "s") + for hits in result: + for hit in hits: + print(f"hit: {hit}, text field: {hit.entity.get('text')}") + + +if __name__ == "__main__": + if args.insert: + milvus_data_insert(args.data_path, args.embedding_path, args.batch_size) + if args.search: + milvus_data_recall(args.embedding_path, args.index) diff --git a/applications/neural_search/recall/milvus/milvus_util.py b/applications/neural_search/recall/milvus/milvus_util.py new file mode 100644 index 
0000000000000000000000000000000000000000..d11bccaf2f2570a28b1dc994ed79ba92c3e384bd --- /dev/null +++ b/applications/neural_search/recall/milvus/milvus_util.py @@ -0,0 +1,169 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np +from config import ( + MILVUS_HOST, + MILVUS_PORT, + data_dim, + index_config, + search_params, + top_k, +) +from pymilvus import ( + Collection, + CollectionSchema, + DataType, + FieldSchema, + connections, + utility, +) + +fmt = "\n=== {:30} ===\n" +text_max_len = 1000 +fields = [ + FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False, max_length=100), + FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=text_max_len), + FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=data_dim), +] +schema = CollectionSchema(fields, "Neural Search Index") + + +class VecToMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def has_collection(self, collection_name): + try: + has = utility.has_collection(collection_name) + print(f"Does collection {collection_name} exist in Milvus: {has}") + return has + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + print(fmt.format("Create collection {}".format(collection_name))) + self.collection = Collection(collection_name, schema, consistency_level="Strong") + except Exception as e: + print("Milvus create collection error:", e) + + def drop_collection(self, collection_name): + try: + utility.drop_collection(collection_name) + except Exception as e: + print("Milvus delete collection error:", e) + + def create_index(self, index_name): + try: + print(fmt.format("Start Creating index")) + self.collection.create_index(index_name, index_config) + print(fmt.format("Start loading")) + self.collection.load() + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, partition_tag): + try: + result = self.collection.has_partition(partition_tag) + return result + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, partition_tag): + try: + self.collection.create_partition(partition_tag) + print("create partition {} successfully".format(partition_tag)) + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, entities, collection_name, index_name, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(index_name) + else: + self.collection = Collection(collection_name) + if (partition_tag is not None) and (not self.has_partition(partition_tag)): + self.create_partition(partition_tag) + + self.collection.insert(entities, partition_name=partition_tag) + print(f"Number of entities in Milvus: {self.collection.num_entities}") # check the num_entites 
+ except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def get_collection(self, collection_name): + try: + print(fmt.format("Connect collection {}".format(collection_name))) + self.collection = Collection(collection_name) + except Exception as e: + print("Milvus create collection error:", e) + + def search(self, vectors, embedding_name, collection_name, partition_names=[], output_fields=[]): + try: + self.get_collection(collection_name) + result = self.collection.search( + vectors, + embedding_name, + search_params, + limit=top_k, + partition_names=partition_names, + output_fields=output_fields, + ) + return result + except Exception as e: + print("Milvus recall error: ", e) + + +if __name__ == "__main__": + print(fmt.format("Start inserting entities")) + rng = np.random.default_rng(seed=19530) + num_entities = 3000 + entities = [ + # provide the pk field because `auto_id` is set to False + [i for i in range(num_entities)], + ["第{}个样本".format(i) for i in range(num_entities)], # field text, only supports list + rng.random((num_entities, data_dim)), # field embeddings, supports numpy.ndarray and list + ] + print(entities[-1].shape) + collection_name = "test1" + partition_tag = "partition_1" + embedding_name = "embeddings" + client = VecToMilvus() + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + print(fmt.format("Start searching entities")) + vectors_to_search = entities[-1][-2:] + recall_client = RecallByMilvus() + result = recall_client.search( + vectors_to_search, + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "text"], + ) + for hits in result: + for hit in hits: + print(f"hit: {hit}, random field: {hit.entity.get('text')}") diff --git a/applications/neural_search/recall/milvus/scripts/feature_extract.sh b/applications/neural_search/recall/milvus/scripts/feature_extract.sh new file mode 100644 index 0000000000000000000000000000000000000000..7f996ac0600a67cafaa09995d007381b7c519380 --- /dev/null +++ b/applications/neural_search/recall/milvus/scripts/feature_extract.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=2 python feature_extract.py \ + --model_dir ./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --batch_size 512 \ + --corpus_file "milvus/milvus_data.csv" + diff --git a/applications/neural_search/recall/milvus/scripts/search.sh b/applications/neural_search/recall/milvus/scripts/search.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c4cdeea3536dcf19328ca0c895112daa4670a1d --- /dev/null +++ b/applications/neural_search/recall/milvus/scripts/search.sh @@ -0,0 +1,6 @@ +python milvus_ann_search.py --data_path milvus/milvus_data.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --index 18 \ + --insert \ + --search \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/README.md b/applications/neural_search/recall/simcse/README.md new file mode 100644 index 0000000000000000000000000000000000000000..033afd18008f28c0190ac577a5fe26b4c1dfa2a5 --- /dev/null +++ b/applications/neural_search/recall/simcse/README.md @@ -0,0 +1,448 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [SimCSE](#SimCSE) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 
模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +在召回阶段,最常见的方式是通过双塔模型,学习Document(简写为Doc)的向量表示,对Doc端建立索引,用ANN召回。我们在这种方式的基础上,引入无监督预训练策略,以如下训练数据为例: + + +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` + +SimCSE 模型适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + + + + +# SimCSE + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,采用ERNIE1.0热启,在召回阶段引入 SimCSE 策略。 + + +### 评估指标 + +(1)采用 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。 + +**效果评估** + +| 策略 | 模型| Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 | +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| SimCSE | ernie 1.0 |42.374 | 57.505| 62.641| 67.09|72.331| +| SimCSE | rocketqa-zh-base-query-encoder |**50.108** | **64.005**| **68.288**| **72.306**|**77.306**| + + + +## 2. 环境依赖和安装说明 + +**环境依赖** +* python >= 3.6 +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2 +* visualdl >= 2.2.2 + + + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +simcse/ +├── model.py # SimCSE 模型组网代码 +|—— deploy + |—— python + |—— predict.py # PaddleInference + ├── deploy.sh # Paddle Inference的bash脚本 +|—— scripts + ├── export_model.sh # 动态图转静态图bash脚本 + ├── predict.sh # 预测的bash脚本 + ├── evaluate.sh # 召回评估bash脚本 + ├── run_build_index.sh # 索引的构建脚本 + ├── train.sh # 训练的bash脚本 +|—— ann_util.py # Ann 建索引库相关函数 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── export_model.py # 动态图转静态图 +├── predict.py # 基于训练好的无监督语义匹配模型计算文本 Pair 相似度 +├── evaluate.py # 根据召回结果和评估集计算评估指标 +|—— inference.py # 动态图抽取向量 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +└── train.py # SimCSE 模型训练、评估逻辑 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +我们基于开源的语义匹配数据集构造生成了面向语义索引的训练集、评估集、召回库。 + +样例数据如下: +``` +睡眠障碍与常见神经系统疾病的关系睡眠觉醒障碍,神经系统疾病,睡眠,快速眼运动,细胞增殖,阿尔茨海默病 +城市道路交通流中观仿真研究 +城市道路交通流中观仿真研究智能运输系统;城市交通管理;计算机仿真;城市道路;交通流;路径选择 +网络健康可信性研究 +网络健康可信性研究网络健康信息;可信性;评估模式 +脑瘫患儿家庭复原力的影响因素及干预模式雏形 研究 +脑瘫患儿家庭复原力的影响因素及干预模式雏形研究脑瘫患儿;家庭功能;干预模式 +地西他滨与HA方案治疗骨髓增生异常综合征转化的急性髓系白血病患者近期疗效比较 +地西他滨与HA方案治疗骨髓增生异常综合征转化的急性髓系白血病患者近期疗效比较 +个案工作 社会化 +个案社会工作介入社区矫正再社会化研究——以东莞市清溪镇为例社会工作者;社区矫正人员;再社会化;角色定位 +圆周运动加速度角速度 +圆周运动向心加速度物理意义的理论分析匀速圆周运动,向心加速度,物理意义,角速度,物理量,线速度,周期 +``` + +召回集,验证集,测试集与inbatch-negative实验的数据保持一致 + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**语义索引预训练模型下载链接:** + +以下模型结构参数为: `TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[SimCSE](https://bj.bcebos.com/v1/paddlenlp/models/simcse_model.zip)|
ernie 1.0 epoch:3 lr:5E-5 bs:64 max_len:64|4卡 v100-16g
|7c46d9b15a214292e3897c0eb70d0c9f| + +### 训练环境说明 + ++ NVIDIA Driver Version: 440.64.00 ++ Ubuntu 16.04.6 LTS (Docker) ++ Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于SimCSE训练模型,无监督的数据量比较大,4卡的训练的时长在16个小时左右。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可。 + +训练的命令如下: + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --infer_with_fc_pooler \ + --dropout 0.2 \ + --output_emb_size 256 \ + --train_set_file "./recall/train_unsupervised.csv" \ + --test_set_file "./recall/dev.csv" \ + --model_name_or_path "rocketqa-zh-base-query-encoder" +``` +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + +可支持配置的参数: + +* `infer_with_fc_pooler`:可选,在预测阶段计算文本 embedding 表示的时候网络前向是否会过训练阶段最后一层的 fc; 建议打开模型效果最好。 +* `scale`:可选,在计算 cross_entropy loss 之前对 cosine 相似度进行缩放的因子;默认为 20。 +* `dropout`:可选,SimCSE 网络前向使用的 dropout 取值;默认 0.1。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为1。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + + + +## 6. 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量, + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取Query的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件 + +d. 评估 + +基于评估集 `dev.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10,20,50. 
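+
+其中 Recall@N 的计算方式为:对评估集中的每条 *Source Text*,若其标注的相似文本出现在召回的前 N 条结果中记为 1,否则记为 0,再对全部评估样本取平均。下面给出一个最小的计算示意(逻辑与本目录 `evaluate.py` 中的 `recall` 函数一致,仅用于帮助理解指标含义):
+
+```python
+import numpy as np
+
+def recall_at_n(relevance_lists, n=10):
+    # relevance_lists: 每条 Query 对应一个 0/1 命中标记列表,
+    # 第 i 位表示召回的第 i 条结果是否为该 Query 标注的相似文本
+    return float(np.mean([np.sum(flags[:n]) for flags in relevance_lists]))
+
+# 示例:3 条 Query 的标注相似文本分别出现在召回结果的第 3、2、1 位
+rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+print(recall_at_n(rs, n=1))  # 0.333...
+print(recall_at_n(rs, n=3))  # 1.0
+```
+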
+ +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "6" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_12000/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10, Recall@20 和 Recall@50 指标: +``` +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` +也可以使用下面的bash脚本: + +``` +bash scripts/evaluate.sh +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标: + +``` +recall@1=45.183 +recall@5=60.444 +recall@10=65.224 +recall@20=69.562 +recall@50=74.848 +``` + + + + +## 7. 预测 + +我们可以基于语义索引模型预测文本的语义向量或者计算文本 Pair 的语义相似度。 + +### 7.1 功能一:抽取文本的语义向量 + +修改 inference.py 文件里面输入文本 id2corpus 和模型路径 params_path: + +``` +params_path='checkpoints/model_12000/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` +然后运行 +``` +python inference.py +``` +预测结果位256维的向量: + +``` +[1, 256] +[[-6.70653954e-02 -6.46878220e-03 -6.78317016e-03 1.66617986e-02 + 7.20006675e-02 -9.79134627e-03 -1.38441555e-03 4.37440760e-02 + 4.78116237e-02 1.33881181e-01 1.82927232e-02 3.23656350e-02 + ... +``` + +### 7.2 功能二:计算文本 Pair 的语义相似度 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理对尼龙6及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响尼龙6,聚酰胺嵌段共聚物,芳香聚酰胺,热处理 +面向生态系统服务的生态系统分类方案研发与应用. 面向生态系统服务的生态系统分类方案研发与应用 +huntington舞蹈病的动物模型 Huntington舞蹈病的动物模型 +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +``` + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 SimCSE无监督语义索引模型开始计算文本 Pair 的语义相似度: +``` +root_dir="checkpoints" + +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_12000/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` + +产出如下结果 +``` +0.6477588415145874 +0.9698382019996643 +1.0 +0.1787596344947815 +``` + + + +## 8. 
部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_12000/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改id2corpus的样本: + +``` +# 抽取向量 +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +# 计算相似度 +corpus_list=[['中西方语言与文化的差异','中西方文化差异以及语言体现中西方文化,差异,语言体现'], + ['中西方语言与文化的差异','飞桨致力于让深度学习技术的创新与应用更简单']] + +``` +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率 + +``` +(1, 256) +[[-6.70653731e-02 -6.46873191e-03 -6.78317575e-03 1.66618153e-02 + 7.20006898e-02 -9.79136024e-03 -1.38439541e-03 4.37440872e-02 + 4.78115827e-02 1.33881137e-01 1.82927139e-02 3.23656537e-02 + ....... + +[0.5649663209915161, 0.03284594044089317] +``` +## FAQ + +#### SimCSE模型怎么部署? + ++ SimCSE使用的模型跟 In-batch Negatives 训练出来的模型网络结构是一样的,使用 In-batch Negatives 的部署流程即可,参考[In-batch Negatives](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative/deploy/python) + +## Reference +[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” ArXiv:2104.08821 [Cs], April 18, 2021. http://arxiv.org/abs/2104.08821. diff --git a/applications/neural_search/recall/simcse/ann_util.py b/applications/neural_search/recall/simcse/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..55c608d3e58c37c0d9baf884b270178d3ac5da7f --- /dev/null +++ b/applications/neural_search/recall/simcse/ann_util.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/neural_search/recall/simcse/data.py b/applications/neural_search/recall/simcse/data.py new file mode 100644 index 0000000000000000000000000000000000000000..5e1fc4bf5bf1be42746648df499b3df5a1dfbfda --- /dev/null +++ b/applications/neural_search/recall/simcse/data.py @@ -0,0 +1,146 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example_test(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} diff --git a/applications/neural_search/recall/simcse/deploy/python/deploy.sh b/applications/neural_search/recall/simcse/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/neural_search/recall/simcse/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/deploy/python/predict.py b/applications/neural_search/recall/simcse/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..d3a29726d1e64b51becb62f53eeaf9c276b4ba77 --- /dev/null +++ 
b/applications/neural_search/recall/simcse/deploy/python/predict.py @@ -0,0 +1,268 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name=args.model_name_or_path, + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the prediction probs. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids = convert_example({idx: text[0]}, tokenizer) + title_ids, title_segment_ids = convert_example({idx: text[1]}, tokenizer) + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. 
+ output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + res = predictor.extract_embedding(corpus_list, tokenizer) + print(res.shape) + print(res) + corpus_list = [["中西方语言与文化的差异", "中西方文化差异以及语言体现中西方文化,差异,语言体现"], ["中西方语言与文化的差异", "飞桨致力于让深度学习技术的创新与应用更简单"]] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/neural_search/recall/simcse/evaluate.py b/applications/neural_search/recall/simcse/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..63d3b2fe16340e2818ec6bd0690387069931a82e --- /dev/null +++ b/applications/neural_search/recall/simcse/evaluate.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20, 50] + result = open("result.tsv", "a") + res = [] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") diff --git a/applications/neural_search/recall/simcse/export_model.py b/applications/neural_search/recall/simcse/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..a9242c24dac0b0d5679a82a128527f42f074fe1b --- /dev/null +++ 
b/applications/neural_search/recall/simcse/export_model.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from model import SimCSE + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + output_emb_size = 256 + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SimCSE(pretrained_model, output_emb_size=output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/recall/simcse/inference.py b/applications/neural_search/recall/simcse/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..bb2345fa88c3a24a5317fc0f7a666dbcb13be8df --- /dev/null +++ b/applications/neural_search/recall/simcse/inference.py @@ -0,0 +1,108 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +from functools import partial + +import paddle +from data import create_dataloader +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +if __name__ == "__main__": + device = "gpu" + max_seq_length = 64 + output_emb_size = 256 + batch_size = 1 + params_path = "checkpoints/model_20000/model_state.pdparams" + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + model_name_or_path = "rocketqa-zh-base-query-encoder" + paddle.set_device(device) + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + + model = SimCSE(pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + text_embedding = all_embeddings[0] + print(text_embedding.shape) + print(text_embedding.numpy()) diff --git a/applications/neural_search/recall/simcse/model.py b/applications/neural_search/recall/simcse/model.py new file mode 100644 index 0000000000000000000000000000000000000000..0e3613c1e7c73bfd534fc2e7da82f40005ebf318 --- /dev/null +++ 
b/applications/neural_search/recall/simcse/model.py @@ -0,0 +1,140 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + 
query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/neural_search/recall/simcse/predict.py b/applications/neural_search/recall/simcse/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3d3800ad3495d6565c37e6c82cbdc4dbaeee65a6 --- /dev/null +++ b/applications/neural_search/recall/simcse/predict.py @@ -0,0 +1,110 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SimCSE`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False, is_test=True) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SimCSE(pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/applications/neural_search/recall/simcse/recall.py b/applications/neural_search/recall/simcse/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..604cc0a5988daa1475e7b73683910767bda41ee2 --- /dev/null +++ b/applications/neural_search/recall/simcse/recall.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from data import convert_example_test, create_dataloader, gen_id2corpus, gen_text_file +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example_test, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SimCSE(pretrained_model, 
output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/neural_search/recall/simcse/scripts/evaluate.sh b/applications/neural_search/recall/simcse/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..a95782c94d3e3bc593a8b41c655d28947bef78ba --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/evaluate.sh @@ -0,0 +1,4 @@ + python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/export_model.sh b/applications/neural_search/recall/simcse/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..629440b9b079920e74e916f6b899b27d18aec559 --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/export_model.sh @@ -0,0 +1,3 @@ +python export_model.py --params_path checkpoints/model_12000/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/predict.sh b/applications/neural_search/recall/simcse/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..758e3ecf16967eae0d4daf6a950704cde60b4138 --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/predict.sh @@ -0,0 +1,21 @@ +# gpu +root_dir="checkpoints" +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path 
"${root_dir}/model_12000/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --text_pair_file "recall/test.csv" + +# cpu +# root_dir="checkpoints" +# python predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_20000/model_state.pdparams" \ +# --output_emb_size 256 \ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --text_pair_file "recall/test.csv" \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/run_build_index.sh b/applications/neural_search/recall/simcse/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..eee1ad3593598279f4ea6568e60dd269a6e12f3d --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/run_build_index.sh @@ -0,0 +1,31 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_12000/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" + +# cpu +# python recall.py \ +# --device cpu \ +# --recall_result_dir "recall_result_dir" \ +# --recall_result_file "recall_result.txt" \ +# --params_path "checkpoints/model_20000/model_state.pdparams" \ +# --hnsw_m 100 \ +# --hnsw_ef 100 \ +# --batch_size 64 \ +# --output_emb_size 256\ +# --max_seq_length 60 \ +# --recall_num 50 \ +# --similar_text_pair "recall/dev.csv" \ +# --corpus_file "recall/corpus.csv" \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/train.sh b/applications/neural_search/recall/simcse/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..60817e0ff7b50b705c7075f94f3b78386d4da708 --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/train.sh @@ -0,0 +1,55 @@ +# simcse gpu +python -u -m paddle.distributed.launch --gpus '1,2,3,4' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --infer_with_fc_pooler \ + --dropout 0.2 \ + --output_emb_size 256 \ + --train_set_file "./recall/train_unsupervised.csv" \ + --test_set_file "./recall/dev.csv" \ + --model_name_or_path "rocketqa-zh-base-query-encoder" + +# simcse cpu +# python train.py \ +# --device cpu \ +# --save_dir ./checkpoints/ \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --save_steps 2000 \ +# --eval_steps 100 \ +# --max_seq_length 64 \ +# --infer_with_fc_pooler \ +# --dropout 0.2 \ +# --output_emb_size 256 \ +# --train_set_file "./recall/train_unsupervised.csv" \ +# --test_set_file "./recall/dev.csv" +# --model_name_or_path "ernie-3.0-medium-zh" + +# post training + simcse +# python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ +# train.py \ +# --device gpu \ +# --save_dir ./checkpoints/ \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --save_steps 2000 \ +# --eval_steps 100 \ +# --max_seq_length 64 \ +# --infer_with_fc_pooler \ +# --dropout 0.2 \ +# --output_emb_size 256 \ +# --train_set_file "./recall/train_unsupervised.csv" \ +# --test_set_file "./recall/dev.csv" +# 
--model_name_or_path "post_ernie" + + + diff --git a/applications/neural_search/recall/simcse/train.py b/applications/neural_search/recall/simcse/train.py new file mode 100644 index 0000000000000000000000000000000000000000..050e79bbcd937f74381156373b617398940e222f --- /dev/null +++ b/applications/neural_search/recall/simcse/train.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_simcse_text +from model import SimCSE +from visualdl import LogWriter + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_set_file", type=str, required=True, help="The full path of test_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained 
model encoder.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + writer = LogWriter(logdir="./log/scalar_test/train") + + train_ds = load_dataset(read_simcse_text, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained( + args.model_name_or_path, hidden_dropout_prob=args.dropout, attention_probs_dropout_prob=args.dropout + ) + print("loading model from {}".format(args.model_name_or_path)) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SimCSE(pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
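+    # decay_params collects the *names* of the parameters that should be decayed;
+    # AdamW calls apply_decay_param_fun with each parameter name and applies
+    # weight_decay only when it returns True, so bias and LayerNorm weights are skipped.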
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + time_start = time.time() + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + writer.add_scalar(tag="loss", step=global_step, value=loss) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % (global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + time_end = time.time() + print("totally cost", time_end - time_start) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/requirements.txt b/applications/neural_search/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..3a500dbca18df48b4e063fbd28972ff28360b350 --- /dev/null +++ b/applications/neural_search/requirements.txt @@ -0,0 +1,11 @@ +pymilvus>=2.1.0 +pandas +paddlenlp>=2.1.1 +paddlepaddle-gpu>=2.2.3 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +paddle-serving-app>=0.7.0 +paddle-serving-client>=0.7.0 +paddle-serving-server-gpu>=0.7.0.post102 +pybind11 \ No newline at end of file diff --git a/applications/neural_search/run_system.py b/applications/neural_search/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..1c8af3f096c73392a51fc9dbbf2ae52d2adf2667 --- /dev/null +++ b/applications/neural_search/run_system.py @@ -0,0 +1,86 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
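+
+# Overall flow of this demo script (the recall service on 127.0.0.1:8080 and the
+# ranking service on 127.0.0.1:8089 are assumed to have been started separately
+# with Paddle Serving):
+#   1. recall_result():    send the raw query to the recall service and get its embedding
+#   2. search_in_milvus(): run an ANN search over the Milvus collection with that embedding
+#   3. rerank():           score every (query, recalled text) pair with the ranking service
+#                          and sort the candidates by the returned score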
+ +import sys +import time + +import numpy as np +import pandas as pd +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("./recall/milvus") # noqa: E402 +from config import collection_name, embedding_name, partition_tag # noqa: E402 +from milvus_util import RecallByMilvus # noqa: E402 + + +def recall_result(list_data): + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = item + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + return result + + +def search_in_milvus(embeddings, query_text): + recall_client = RecallByMilvus() + start_time = time.time() + results = recall_client.search( + embeddings, embedding_name, collection_name, partition_names=[partition_tag], output_fields=["pk", "text"] + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + list_data = [] + for line in results: + for item in line: + # idx = item.id + distance = item.distance + text = item.entity.get("text") + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "text", "distance"]) + df.to_csv("recall_result.csv", index=False) + return df + + +def rerank(df): + client = PipelineClient() + client.connect(["127.0.0.1:8089"]) + list_data = [] + for index, row in df.iterrows(): + example = {"query": row["query_text"], "title": row["text"]} + list_data.append(example) + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + df["distance"] = result + df = df.sort_values(by=["distance"], ascending=False) + df.to_csv("rank_result.csv", index=False) + + +if __name__ == "__main__": + list_data = ["中西方语言与文化的差异"] + result = recall_result(list_data) + df = search_in_milvus(result, list_data[0]) + rerank(df) diff --git a/applications/question_answering/README.md b/applications/question_answering/README.md new file mode 100644 index 0000000000000000000000000000000000000000..673a716d53de981c1cc01f8480c98921225861c6 --- /dev/null +++ b/applications/question_answering/README.md @@ -0,0 +1,21 @@ +# 问答系统 + +问答系统(Question Answering System, QA)是信息检索系统的一种高级形式,它能用准确、简洁的自然语言回答用户用自然语言提出的问题。问答系统的应用空间十分包括,包括搜索引擎,小度音响等智能硬件,聊天机器人,以及政府、金融、银行、电信、电商领域的智能客服等。 + +在问答系统中,检索式问答系统是最容易落地的一种,它具有速度快、可控性好、容易拓展等特点。 +检索式问答系统是一种基于问题答案对进行检索匹配的系统,根据是否需要FAQ(Frequently asked questions)可以进一步分为有监督检索式问答系统和无监督检索式问答系统,前者需要用户提供FAQ语料,后者不需要预备问答语料,可通过问题答案对生成的方式自动生成语料。 + +PaddleNLP提供了[有监督检索式问答系统](./supervised_qa)和[无监督检索式问答系统](./unsupervised_qa),开发者可根据实际情况进行选择。 + +关于问答场景应用案例请查阅飞桨新产品[RocketQA](https://github.com/PaddlePaddle/RocketQA)。 + +**有监督检索式问答系统效果展示**: +
+ +
+ + +**无监督检索式问答系统效果展示**: +
+ +
diff --git a/applications/question_answering/supervised_qa/faq_finance/README.md b/applications/question_answering/supervised_qa/faq_finance/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6f6714974de3784e22d913b1265b92be0754ec82 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/README.md @@ -0,0 +1,555 @@ +# 保险智能问答 + + **目录** + +* [1. 项目介绍](#项目介绍) +* [2. 系统特色](#系统特色) +* [3. 保险智能问答系统方案](#保险问答系统方案) +* [4. 动手实践——搭建自己的端到端检索式问答系统](#动手实践——搭建自己的端到端检索式问答系统) +* [5. 模型优化](#模型优化) +* [6. 参考文献](#参考文献) + + + +## 1. 项目介绍 + +智能问答是获取信息和知识的更直接、更高效的方式之一,传统的信息检索方法智能找到相关的文档,而智能问答能够直接找到精准的答案,极大的节省了人们查询信息的时间。问答按照技术分为基于阅读理解的问答和检索式的问答,阅读理解的问答是在正文中找到对应的答案片段,检索式问答则是匹配高频的问题,然后把答案返回给用户。本项目属于检索式的问答,问答的领域用途很广,比如搜索引擎,小度音响等智能硬件,政府,金融,银行,电信,电商领域的智能客服,聊天机器人等。 + +- 本方案是场景的定制化的方案,用户可以使用自己的数据训练一个特定场景的方案。另外,想快速体验FAQ智能问答系统请参考Pipelines的实现[FAQ智能问答](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/FAQ) + +- 本项目的详细教程请参考(包括数据和代码实现)[aistudio教程](https://aistudio.baidu.com/aistudio/projectdetail/3882519) + + + +## 2. 系统特色 + ++ 低门槛 + + 手把手搭建检索式保险智能问答 + + 无需相似 Query-Query Pair 标注数据也能构建保险智能问答 ++ 效果好 + + 业界领先的检索预训练模型: RocketQA Dual Encoder + + 针对无标注数据场景的领先解决方案: 检索预训练模型 + 增强的无监督语义索引微调 + ++ 性能快 + + 基于 Paddle Inference 快速抽取向量 + + 基于 Milvus 快速查询和高性能建库 + + 基于 Paddle Serving 高性能部署 + + + +## 3. 保险智能问答系统方案 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:针对保险等金融领域的问答只有问答对的场景,我们提供了一个在SimCSE的基础上融合WR (word reptition)策略,同义词策略,R-Drop策略的无监督的解决方案。 + +#### 3.1.2 评估指标 + +* 该保险智能问答系统使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + +### 3.2 数据说明 + +#### 3.2.1 预置数据介绍 + +数据集来源于Github开源的保险的问答数据,包括源用户的问题和相应的回复。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 | +| ------------ | ------------ |------------ | ------------ | ------------ | +| 召回 | SimCSE | 3030 | 758 | 3788 | + +其中训练集的问题-问题对的构造使用了同义词替换的方法,详情请参考[nlpcda](https://github.com/425776024/nlpcda) + +评估集的问题对的构造使用了中英文回译的方法,数据使用的是百度翻译的API,详情请参考[百度翻译](https://fanyi-api.baidu.com/?fr=simultaneous) + +【注意】:数据集是基于Github开源数据进行了处理得到的,如果有任何侵权问题,请及时联系,我们会第一时间进行删除。 + +``` +├── data # 数据集 + ├── train.csv # 无监督训练集 + ├── train_aug.csv # 同义词替换后构造的训练集 + ├── test_pair.csv # 测试集,用于评估模型的效果 + ├── corpus.csv # 构建召回的数据,用于评估模型的召回效果 + ├── qa_pair.csv # 问答对,问题对应的答案 +``` +数据集的下载链接为: [faq_finance](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/baoxianzhidao/intro.ipynb) + +#### 3.2.2 数据格式 + +训练需要规定格式的本地数据集,需要准备训练集文件`train.csv`或者`train_aug.csv`,测试集`test_pair.csv`,召回集文件`corpus.csv`,问答对 `qa_pair.csv`。 + +用于无监督训练的训练集的格式如下: + +``` +文本1 +文本2 +... +``` +训练集合`train.csv`的文件样例: + +``` +家里有社保,还有必要买重疾险吗? +工地买了建工险,出了事故多长时间上报保险公司有效 +请问下哆啦a保值不值得买呢?不晓得保障多不多 +自由职业办理养老保险是否划算 +工伤七级如果公司不干了,怎么赔我 +普通意外险的保障范围都有哪些? +...... +``` +除此之外,也可以使用数据增强的格式,训练方式是类似有监督的构造句子对。数据增强的文件格式如下: + +``` +文本1 \t 增强文本1 +文本2 \t 增强文本2 +``` +增强数据集`train_aug.csv`的格式如下: + +``` +工伤七级如果公司不干了,怎么赔我 工伤七级如果企业不干了,怎生赔我 +普通意外险的保障范围都有哪些? 一般性意外险的保障范围都有哪些? +重疾险赔付三次和赔付一次的区别 重疾险赔偿三次和赔偿一次的区别 +。。。。。 +``` + +测试集合`test_pair.csv`是问句对,具体格式如下: + +``` +句子1 \t 句子2 +句子3 \t 句子4 +``` +其中句子1和句子2是相似的句子,只是表达方式不同,或者进行了一定程度的变形,但实际表达的语义是一样的。 + +测试集的文件样例: + +``` +车险如何计算 如何计算汽车保险 +农民买养老保险怎么买 农民如何购买养老保险 +车险必买哪几项 你必须购买哪些汽车保险 +... +``` +召回集合`corpus.csv`主要作用是检验测试集合的句子对能否被正确召回,它的构造主要是提取测试集的第二列的句子,然后加入很多无关的句子,用来检验模型能够正确的从这些文本中找出测试集合对应的第二列的句子,格式如下: + +``` +如何办理企业养老保险 +如何为西班牙购买签证保险? +康慧宝需要买多少? +如果另一方对车辆事故负有全部责任,并且拒绝提前支付维修费,该怎么办 +准备清明节去新兴坡旅游,什么样的旅游保险好? +你能从国外账户购买互助基金吗? +什么是海上保险?有哪些海上保险? +.... +``` + +问答对集合`qa_pair.csv`包含的是整个项目的问题和对应的答案,,具体格式如下: + +``` +问题1 \t 答案1 +问题2 \t 答案2 +...... 
+``` +问答对集合示例: + +``` +既然强制运输保险有浮动费率制度,有商业保险吗? 商业车险也有的。关于汽车商业险的费率在全国每个省都是不一样的,在同一地区,费率也会变化。一般1年、2-4年、4-6年、费率都不同。新车第一年的费率会比较高,2-4是相对比较优惠,4-6会再上涨,不同类型的汽车费率也不同。商业车险保费浮动比例与其他公司相比都是差不多的,一般销售保费浮动比例是这样的:上年赔款1次,保费打7折;上年赔款2次,保费打8折;上年赔款3次,保费上浮15%;上年赔款4次,保费上浮51%;上年赔款5次以上,保费上浮69%。该公司的有关人士表示,如果上年赔款次数超过了7次,续保时可能会遭拒。目前的研究意见规定中加大了车险保费与赔款记录相关系数的浮动区间,并与交通违章情况挂钩,若车主少违章少出险则保费最多可打5折,反之则保费最高可上浮至现行标准的4.5倍。 +汇鑫安儿童保险的保费是否也与性别有关 有关系,女宝宝会比男宝宝要多一点。如0岁男宝宝趸交是130.4元,3年期交是43.7元,5年期交是27元;而0岁女宝宝趸交是131.6元,3年期交是44.1元,5年期交是27.2元。 +在中国,哪个品牌的餐饮照明比较好? 一般来说美尔家比较可靠吧,有保障 +...... +``` + + +### 3.3 代码说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— model.py # SimCSE模型 +|—— train.py # SimCSE训练主脚本 +|—— ann_util.py # Ann 建索引库相关函数 +|—— config.py # Milvus 配置文件 +|—— evaluate.py # 召回评估文件 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— feature_extract.py # 批量提取文本的特征向量 +|—— milvus_util.py # Milvus的插入和召回类 +|—— milvus_ann_search.py # 向 Milvus 引擎插入向量的函数 +|—— run_system.py # Client Server 模式客户端,向 server 发送文本,得到向量后,利用milvus引擎进行检索 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— feature_extract.sh # 向量抽取 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 +|—— deploy + |—— python + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 +``` + +### 3.3 效果评估 + +以下实验结果使用的是模型是`rocketqa-zh-dureader-query-encoder`: + +| 模型 | Recall@1 |Recall@5 |Recall@10 | +| ------------ | ------------ |--------- |--------- | +| RocketQA + SimCSE | 82.827 | 93.791| 96.169| +| RocketQA + SimCSE + WR | 82.695 | 93.791| 96.301| +| RocketQA + SimCSE + WR + 同义词 | 85.205 | 93.923| 95.509| +| RocketQA + SimCSE + 同义词 + RDrop | **85.469** | **94.716**| **96.433**| + + + +## 4. 动手实践——搭建自己的端到端检索式问答系统 + +### 4.1 环境安装 + +在运行下面的代码之前,安装相关的依赖,运行下面的命令: + +``` +pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +### 4.2 模型训练 + +SimCSE可以使用2种方式进行训练,即有监督训练和无监督训练,区别在于无监督训练不需要标注数据集,有监督训练需要标注好问句对,下面是无监督的执行方式。 + +#### 无监督训练 + +无监督训练执行下面的方式,可以选择`train.csv`,纯无监督文本,或者数据增强的数据`train_aug.csv`,然后执行下面的命令: + +``` +python -u -m paddle.distributed.launch --gpus='0' \ + train.py \ + --device gpu \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.1 \ + --rdrop_coef 0.1 \ + --train_set_file "./data/train_aug.csv" +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `model_name_or_path`: 预训练语言模型名,用于模型的初始化 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `is_unsupervised`:是否使用无监督的训练方式 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `dropout`: SimCSE的dropout参数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `dup_rate` : SimCSE的 Word reptition 策略的重复率 +* `train_set_file`: 训练集文件 +* `rdrop_coef`: R-Drop的系数 + +也可以使用下面的bash脚本: + +``` +sh scripts/train.sh +``` + +### 4.3 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 
获取question的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top10 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 评估 + +基于评估集 `test.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_100/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `model_name_or_path`: 预训练语言模型名,用于模型的初始化 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10 指标: +``` +python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` +输出如下的结果: + +``` +recall@1=84.941 +recall@5=94.452 +recall@10=96.433 +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +### 4.4 模型部署 + +模型部署模块首先要把动态图转换成静态图,然后转换成serving的格式。 + +#### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_100/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +#### 问答检索引擎 + +模型准备结束以后,开始搭建 Milvus 的语义检索引擎,用于语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/docs/v2.1.x/install_standalone-docker.md)本案例使用的是 Milvus 的2.1 版本,建议使用官方的 Docker-Compose 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成256维度的向量: + +``` +python feature_extract.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --corpus_file "data/corpus.csv" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +也可以运行下面的bash脚本: + +``` +sh scripts/feature_extract.sh +``` + +然后向搭建好的 Milvus 系统插入向量: + +``` +python milvus_ann_search.py --data_path data/qa_pair.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --insert +``` + +另外,Milvus提供了可视化的管理界面,可以很方便的查看数据,安装地址为[Attu](https://github.com/zilliztech/attu). 
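+
+向量插入完成后,可以先用 pymilvus 直接查询一次,确认向量检索已经生效。下面是一个最小示例(仅作示意):它复用 `config.py` 中的集合名、字段名与检索参数,并假设上一步 `feature_extract.py` 生成的 `corpus_embedding.npy` 仍在当前目录;`text` 输出字段名以 `milvus_ann_search.py` 实际建表的 schema 为准:
+
+```python
+import numpy as np
+from pymilvus import Collection, connections
+
+from config import MILVUS_HOST, MILVUS_PORT, collection_name, embedding_name, search_params, top_k
+
+# Connect to the standalone Milvus instance configured in config.py
+connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
+collection = Collection(collection_name)
+collection.load()
+
+# Reuse one of the extracted corpus vectors as a query, just for a quick sanity check
+query_embedding = np.load("corpus_embedding.npy")[0]
+
+results = collection.search(
+    data=[query_embedding.tolist()],
+    anns_field=embedding_name,
+    param=search_params,
+    limit=top_k,
+    output_fields=["text"],  # assumes the schema created by milvus_ann_search.py
+)
+for hits in results:
+    for hit in hits:
+        # With the L2 metric, a smaller distance means a more similar text
+        print(hit.distance, hit.entity.get("text"))
+```
+
+完整的端到端检索流程(客户端发送文本、服务端抽取向量、Milvus 召回、排序)请参考下文的 Paddle Serving 部署以及 `run_system.py`。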
+ + +#### Paddle Serving 部署 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖,用pip安装Paddle Serving的依赖如下: + +``` +pip install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple +pip install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是CPU部署,只需要安装CPU Server +pip install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 10.2的包 +# CUDA10.2 + Cudnn7 + TensorRT6(推荐) +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +更详细的安装信息请参考[链接](https://github.com/PaddlePaddle/Serving/blob/v0.9.0/doc/Install_Linux_Env_CN.md),安装完依赖后就可以执行下面的步骤。首先把生成的静态图模型导出为 Paddle Serving的格式,命令如下: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py --model_name_or_path rocketqa-zh-base-query-encoder +``` + +启动客户端调用 Server, 使用 POST的方式: + +向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["买了社保,是不是就不用买商业保险了?"]}' +``` + +也可以使用 rpc的方式: + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [ + "买了社保,是不是就不用买商业保险了?", +] +``` +然后运行: + +``` +python rpc_client.py +``` + +对于Windows用户,启动下面的Pipeline Server: + +``` +python web_service_windows.py --model_name_or_path rocketqa-zh-base-query-encoder +``` + +启动客户端调用 Server, 使用 POST的方式(Windows不支持RPC的调用方式),首先修改http_client.py中需要预测的样本: + +``` +data = {"feed": ["买了社保,是不是就不用买商业保险了?"], "fetch": ["output_embedding"]} +``` +然后运行: +``` +python http_client.py +``` + +### 4.5 问答系统整个流程 + +问答系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = ["买了社保,是不是就不用买商业保险了?"] +``` + +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1663127450.1656108 +PipelineClient::predict before time:1663127450.166227 +Extract feature time to cost :0.017495155334472656 seconds + +=== start connecting to Milvus === +=== Connect collection faq_finance === +Search milvus time cost is 0.18691015243530273 seconds +如果你买社会保险,你不需要买商业保险吗? 
社保是基础的,就是我们通常说的“五险”包括:基本养老保险、基本医疗保险、失业保险、工伤保险和生育保险。而商业保险则是保障。 0.32494643330574036 +已有社会保险还需要买商业保险吗 社保是社会保险的简称社会保险是指国家为了预防和分担年老失业疾病以及死亡等社会风险实现社会安全而强制社会多数成员参加的具有所得重分配功能的非营利性的社会安全制度主要包括基本医疗保险基本养老保险工伤保险失业保险生育保险五大类险种,商业保险是社保的一个补充,如果有足够的经济条件可以进行购买。1、社保覆盖面广,不存在拒保问题,但是保障较低,只能满足基本的保障需求。社保中的医疗保险,住院一般可报70%。而且这70%的医疗费,限于扣除起付线标准后。而且,在社保规定用药和规定项目内。许多检查费、专家诊疗、高新尖诊疗技术,社保都是不报的。这就需配合必要的商业保险了。2、另外,社保医疗是出院后报的,商业医保中的重疾险是确诊后就可以给钱,可以弥补很多家庭没钱治的困境;3、商业保险可以选择购买更高的保额,社保则很有限;社保医疗只是补偿医药费,而没有住院期间的收入损失补偿,商业医疗就有住院补贴。总之,建议在有了社保后,再购买适合自己的寿险,加上意外险、住院医疗、重疾医疗保险,就是非常的完善的保障了。 0.38041722774505615 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来的问答对。 + + + + +## 5. 模型优化 + +### 5.1 有监督训练[优化步骤,可选] + +无监督的方式对模型的提升有限,如果需要继续提升模型,则需要标注数据。构造类似`train_aug.csv`中的句子对,只需要构造相似句子对即可,不需要构造不相似的句子对。 + +``` +python -u -m paddle.distributed.launch --gpus='0' \ + train.py \ + --device gpu \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.1 \ + --rdrop_coef 0.1 \ + --train_set_file "./data/train_aug.csv" +``` + +其他步骤同上,只是使用的数据集是有监督数据。 + + +## 6.参考文献 + +[1] Tianyu Gao, Xingcheng Yao, Danqi Chen: [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821). EMNLP (1) 2021: 6894-6910 diff --git a/applications/question_answering/supervised_qa/faq_finance/ann_util.py b/applications/question_answering/supervised_qa/faq_finance/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..4e4983bfc4d6f581e14180804591dda2d0897465 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/ann_util.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import hnswlib +import numpy as np + +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size if args.output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/question_answering/supervised_qa/faq_finance/config.py b/applications/question_answering/supervised_qa/faq_finance/config.py new file mode 100644 index 0000000000000000000000000000000000000000..365da31198aa0da6c08265a094f51f2df61bb426 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/config.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +search_param = {"nprobe": 20} +collection_name = "faq_finance" +partition_tag = "partition_1" + +MILVUS_HOST = "10.21.226.175" +MILVUS_PORT = 8530 +data_dim = 256 +top_k = 10 +embedding_name = "embeddings" + +index_config = { + "index_type": "IVF_FLAT", + "metric_type": "L2", + "params": {"nlist": 1000}, +} + +search_params = { + "metric_type": "L2", + "params": {"nprobe": top_k}, +} diff --git a/applications/question_answering/supervised_qa/faq_finance/data.py b/applications/question_answering/supervised_qa/faq_finance/data.py new file mode 100644 index 0000000000000000000000000000000000000000..c0011cb431f8bc2e5df62b94f6b8c6c0981e6654 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/data.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import random + +import numpy as np +import paddle + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def convert_example_test(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is True: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + + yield {"text_a": data[0], "text_b": data[1]} + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def word_repetition(input_ids, token_type_ids, dup_rate=0.32): + """Word Repetition strategy.""" + input_ids = input_ids.numpy().tolist() + token_type_ids = token_type_ids.numpy().tolist() + + batch_size, seq_len = len(input_ids), len(input_ids[0]) + repetitied_input_ids = [] + repetitied_token_type_ids = [] + rep_seq_len = seq_len + for batch_id in range(batch_size): + cur_input_id = input_ids[batch_id] + actual_len = np.count_nonzero(cur_input_id) + dup_word_index = [] + # If sequence length is less than 5, skip it + if actual_len > 5: + dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len))) + # Skip cls and sep position + dup_word_index = random.sample(list(range(1, actual_len - 1)), k=dup_len) + + r_input_id = [] + r_token_type_id = [] + for idx, word_id in enumerate(cur_input_id): + # Insert duplicate word + if idx in dup_word_index: + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + after_dup_len = len(r_input_id) + repetitied_input_ids.append(r_input_id) + repetitied_token_type_ids.append(r_token_type_id) + + if after_dup_len > rep_seq_len: + rep_seq_len = after_dup_len + # Padding the data to the same length + for batch_id in range(batch_size): + after_dup_len = len(repetitied_input_ids[batch_id]) + pad_len = rep_seq_len - after_dup_len + repetitied_input_ids[batch_id] += [0] * pad_len + repetitied_token_type_ids[batch_id] += [0] * pad_len + + return paddle.to_tensor(repetitied_input_ids, dtype="int64"), paddle.to_tensor( + repetitied_token_type_ids, dtype="int64" + ) diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/config_nlp.yml b/applications/question_answering/supervised_qa/faq_finance/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..229f66090ebf328b03aaa6b41120277176e97d97 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 
当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/http_client.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..65ecca248ab44e3de890b9543ca9e426b17af494 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/http_client.py @@ -0,0 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import json + +import numpy as np +import requests + +headers = {"Content-type": "application/json"} +url = "http://10.21.226.175:8080/ernie/prediction" # XXX取决于服务端YourService的初始化name参数 + +data = {"feed": ["买了社保,是不是就不用买商业保险了?"], "fetch": ["output_embedding"]} +data = json.dumps(data) +print(data) +r = requests.post(url=url, headers=headers, data=data) +print(r.json()) +json_data = r.json() +data = np.array(json_data["result"]["output_embedding"]) +print(data.shape) diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/rpc_client.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..877b6190408adaf0693d90620e21a1087b1bc959 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/rpc_client.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = ["买了社保,是不是就不用买商业保险了?"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) + +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..fca4f023c7c9a01e1de3b15eccd75fcb97e220dd --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
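+
+# This service wraps the exported SimCSE encoder as a Paddle Serving pipeline.
+# ErnieOp tokenizes incoming queries with AutoTokenizer, batches them with
+# Pad/Tuple, feeds input_ids and token_type_ids to the predictor and returns
+# the "output_embedding" field. Ports, device and model path come from
+# config_nlp.yml. The matching clients live in this directory: rpc_client.py
+# calls PipelineClient.predict, and http_client.py posts a JSON body such as
+#   {"feed": ["买了社保,是不是就不用买商业保险了?"], "fetch": ["output_embedding"]}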
+ +import argparse + +from paddle_serving_server.web_service import Op, WebService + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select tokenizer name to for model") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([input_dict[str(i)]], self.tokenizer) + examples.append((input_ids, segment_ids)) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +if __name__ == "__main__": + ernie_service = ErnieService(name="ernie") + ernie_service.prepare_pipeline_config("config_nlp.yml") + ernie_service.run_service() diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service_windows.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service_windows.py new file mode 100644 index 0000000000000000000000000000000000000000..538fe3f58f8dd11058c350f39b18bfbc22bbf41a --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service_windows.py @@ -0,0 +1,80 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
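+
+# A variant of web_service.py intended for Windows: it uses the plain
+# WebService (debugger service) API instead of the pipeline Op API, loads the
+# serving_server config directly, applies the same tokenize-and-pad
+# preprocessing, and returns the fetched embeddings as plain Python lists.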
+ +import argparse + +from paddle_serving_server.web_service import WebService + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select tokenizer name to for model") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieService(WebService): + def init_service(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + def preprocess(self, feed=[], fetch=[]): + from paddlenlp.data import Pad, Tuple + + print("input dict", feed) + batch_size = len(feed) + is_batch = True + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([feed[i]], self.tokenizer) + examples.append((input_ids, segment_ids)) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, fetch, is_batch + + def postprocess(self, feed=[], fetch=[], fetch_map=None): + for key in fetch_map: + fetch_map[key] = fetch_map[key].tolist() + return fetch_map + + +if __name__ == "__main__": + ernie_service = ErnieService(name="ernie") + ernie_service.load_model_config("../../serving_server") + ernie_service.prepare_server(workdir="workdir", port=8080) + ernie_service.init_service() + ernie_service.run_debugger_service() + ernie_service.run_web_service() diff --git a/applications/question_answering/supervised_qa/faq_finance/evaluate.py b/applications/question_answering/supervised_qa/faq_finance/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..aabeadf5b197e24177e22a6331f8aa9fe4ef2c1b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
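+
+# Scores the recall results written by recall.py: for each query it checks
+# whether the labelled similar text appears among the top-N recalled
+# candidates and reports Recall@1/5/10 as the mean of these hit flags.
+# For example, for three queries with hit flags [0, 0, 1], [0, 1, 0] and
+# [1, 0, 0], Recall@1 = 1/3 and Recall@3 = 1.0.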
+ +import argparse + +import numpy as np + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") + + +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + recall_N = [] + recall_num = [1, 5, 10] + result = open("result.tsv", "a") + res = [] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + # print("\t".join(recall_N)) diff --git a/applications/question_answering/supervised_qa/faq_finance/export_model.py b/applications/question_answering/supervised_qa/faq_finance/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..e4cf9e3faaea501629d42e632fc1826c7a00050e --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/export_model.py @@ -0,0 +1,58 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
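+
+# Exports a fine-tuned SimCSE checkpoint (--params_path) to a static inference
+# graph: the model's get_pooled_embedding computation is traced with
+# paddle.jit.to_static using two [None, None] int64 InputSpecs (input_ids and
+# token_type_ids) and saved under --output_path, where feature_extract.py and
+# the Serving export script pick it up.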
+ +import argparse +import os + +import paddle +from model import SimCSE + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, + default='./checkpoint/model_50/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', + help="The path of model parameter in static graph to be saved.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--output_emb_size", default=256, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SimCSE(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/question_answering/supervised_qa/faq_finance/export_to_serving.py b/applications/question_answering/supervised_qa/faq_finance/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..6cc932da11173e54460642c16fd4226411ba3cfb --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/export_to_serving.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. 
It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/question_answering/supervised_qa/faq_finance/feature_extract.py b/applications/question_answering/supervised_qa/faq_finance/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..3e9a419405f5f79bb7664e85cfe18f93d5ff9d6b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/feature_extract.py @@ -0,0 +1,203 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in enumerate(tqdm(data)): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("corpus_embedding", all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, data in enumerate(file.readlines()): + id2corpus[idx] = data.strip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + predictor.predict(corpus_list, tokenizer) diff --git a/applications/question_answering/supervised_qa/faq_finance/milvus_ann_search.py b/applications/question_answering/supervised_qa/faq_finance/milvus_ann_search.py new file mode 100644 index 0000000000000000000000000000000000000000..36192fc0bec7c7465ec6481665b376cc2867d2dc --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/milvus_ann_search.py @@ -0,0 +1,93 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import numpy as np +from config import collection_name, embedding_name, partition_tag +from milvus_util import RecallByMilvus, VecToMilvus, text_max_len +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument( + "--data_path", default="data/corpus.csv", type=str, required=True, help="The data for vector extraction." +) +parser.add_argument( + "--embedding_path", default="corpus_embedding.npy", type=str, required=True, help="The vector path for data." 
+) +parser.add_argument("--index", default=0, type=int, help="index of the vector for search") +parser.add_argument("--insert", action="store_true", help="whether to insert data") +parser.add_argument("--search", action="store_true", help="whether to search data") +parser.add_argument("--batch_size", default=100000, type=int, help="number of examples to insert each time") +args = parser.parse_args() + + +def read_text(file_path): + file = open(file_path) + id2corpus = [] + for idx, data in enumerate(file.readlines()): + question, answer = data.strip().split("\t") + id2corpus.append({"question": question, "answer": answer}) + return id2corpus + + +def milvus_data_insert(data_path, embedding_path, batch_size): + corpus_list = read_text(data_path) + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + client = VecToMilvus() + client.drop_collection(collection_name) + data_size = len(embedding_ids) + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + entities = [ + [j for j in range(i, cur_end, 1)], + [corpus_list[j]["question"][: text_max_len - 1] for j in range(i, cur_end, 1)], + [corpus_list[j]["answer"][: text_max_len - 1] for j in range(i, cur_end, 1)], + batch_emb, # field embeddings, supports numpy.ndarray and list + ] + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + + +def milvus_data_recall(embedding_path, index): + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + recall_client = RecallByMilvus() + if index > len(embedding_ids): + print("Index should not be larger than embedding size") + return + embeddings = embeddings[np.arange(index, index + 1)] + time_start = time.time() + result = recall_client.search( + embeddings, embedding_name, collection_name, partition_names=[partition_tag], output_fields=["pk", "text"] + ) + time_end = time.time() + sum_t = time_end - time_start + print("time cost", sum_t, "s") + for hits in result: + for hit in hits: + print(f"hit: {hit}, text field: {hit.entity.get('text')}") + + +if __name__ == "__main__": + if args.insert: + milvus_data_insert(args.data_path, args.embedding_path, args.batch_size) + if args.search: + milvus_data_recall(args.embedding_path, args.index) diff --git a/applications/question_answering/supervised_qa/faq_finance/milvus_util.py b/applications/question_answering/supervised_qa/faq_finance/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..92d55ccf132b1cc9cadf78b9401f020c5c198f59 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/milvus_util.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
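+
+# Milvus helpers for this application. The collection schema holds an int64
+# primary key, VARCHAR "question" and "answer" fields, and a FLOAT_VECTOR
+# "embeddings" field of dimension data_dim. VecToMilvus creates the
+# collection, index and partition and inserts entities in batches;
+# RecallByMilvus runs the ANN search with the search_params defined in config.py.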
+ + +import numpy as np +from config import ( + MILVUS_HOST, + MILVUS_PORT, + data_dim, + index_config, + search_params, + top_k, +) +from pymilvus import ( + Collection, + CollectionSchema, + DataType, + FieldSchema, + connections, + utility, +) + +fmt = "\n=== {:30} ===\n" +text_max_len = 1000 +fields = [ + FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False, max_length=100), + FieldSchema(name="question", dtype=DataType.VARCHAR, max_length=text_max_len), + FieldSchema(name="answer", dtype=DataType.VARCHAR, max_length=text_max_len), + FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=data_dim), +] +schema = CollectionSchema(fields, "Neural Search Index") + + +class VecToMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def has_collection(self, collection_name): + try: + has = utility.has_collection(collection_name) + print(f"Does collection {collection_name} exist in Milvus: {has}") + return has + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + print(fmt.format("Create collection {}".format(collection_name))) + self.collection = Collection(collection_name, schema, consistency_level="Strong") + except Exception as e: + print("Milvus create collection error:", e) + + def drop_collection(self, collection_name): + try: + utility.drop_collection(collection_name) + except Exception as e: + print("Milvus delete collection error:", e) + + def create_index(self, index_name): + try: + print(fmt.format("Start Creating index")) + self.collection.create_index(index_name, index_config) + print(fmt.format("Start loading")) + self.collection.load() + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, partition_tag): + try: + result = self.collection.has_partition(partition_tag) + return result + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, partition_tag): + try: + self.collection.create_partition(partition_tag) + print("create partition {} successfully".format(partition_tag)) + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, entities, collection_name, index_name, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(index_name) + else: + self.collection = Collection(collection_name) + if (partition_tag is not None) and (not self.has_partition(partition_tag)): + self.create_partition(partition_tag) + + self.collection.insert(entities, partition_name=partition_tag) + print(f"Number of entities in Milvus: {self.collection.num_entities}") # check the num_entites + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def get_collection(self, collection_name): + try: + print(fmt.format("Connect collection {}".format(collection_name))) + self.collection = Collection(collection_name) + except Exception as e: + print("Milvus create collection error:", e) + + def search(self, vectors, embedding_name, collection_name, partition_names=[], output_fields=[]): + try: + self.get_collection(collection_name) + result = self.collection.search( + vectors, + embedding_name, + 
search_params, + limit=top_k, + partition_names=partition_names, + output_fields=output_fields, + ) + return result + except Exception as e: + print("Milvus recall error: ", e) + + +if __name__ == "__main__": + print(fmt.format("Start inserting entities")) + rng = np.random.default_rng(seed=19530) + num_entities = 3000 + entities = [ + # provide the pk field because `auto_id` is set to False + [i for i in range(num_entities)], + ["第{}个样本".format(i) for i in range(num_entities)], # field text, only supports list + rng.random((num_entities, data_dim)), # field embeddings, supports numpy.ndarray and list + ] + print(entities[-1].shape) + collection_name = "test1" + partition_tag = "partition_1" + embedding_name = "embeddings" + client = VecToMilvus() + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + print(fmt.format("Start searching entities")) + vectors_to_search = entities[-1][-2:] + recall_client = RecallByMilvus() + result = recall_client.search( + vectors_to_search, + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "text"], + ) + for hits in result: + for hit in hits: + print(f"hit: {hit}, random field: {hit.entity.get('text')}") diff --git a/applications/question_answering/supervised_qa/faq_finance/model.py b/applications/question_answering/supervised_qa/faq_finance/model.py new file mode 100644 index 0000000000000000000000000000000000000000..1850f185a2d1808a7d2ab48f55941ead82ac53a7 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/model.py @@ -0,0 +1,143 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
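+
+# SimCSE semantic-matching model: a pretrained encoder plus an optional linear
+# layer that projects the pooled [CLS] embedding down to output_emb_size
+# (L2-normalized). get_pooled_embedding is exported as a static graph for
+# inference. forward() computes an in-batch-negative cross-entropy loss over
+# the scaled query-title cosine-similarity matrix (with `margin` subtracted
+# from the diagonal) and also returns an R-Drop KL loss, which the training
+# script is expected to weight via --rdrop_coef.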
+ + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import paddlenlp + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + self.classifier = nn.Linear(output_emb_size, 2) + self.rdrop_loss = paddlenlp.losses.RDropLoss() + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + logits1 = self.classifier(query_cls_embedding) + logits2 = self.classifier(title_cls_embedding) + kl_loss = self.rdrop_loss(logits1, logits2) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + 
margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss, kl_loss diff --git a/applications/question_answering/supervised_qa/faq_finance/recall.py b/applications/question_answering/supervised_qa/faq_finance/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..9bfb71d054175a641e07b758ccb8455ca9fe96d9 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/recall.py @@ -0,0 +1,122 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from data import convert_example_test, create_dataloader, gen_id2corpus, gen_text_file +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example_test, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ), + ): + return [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SimCSE(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + 
row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/question_answering/supervised_qa/faq_finance/requirements.txt b/applications/question_answering/supervised_qa/faq_finance/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..2dfbec02607b44cb34cc42a8f5a0e3e0cab7c743 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/requirements.txt @@ -0,0 +1,8 @@ +pymilvus>=2.1.0 +pandas==0.25.1 +paddlenlp>=2.3.7 +paddlepaddle-gpu>=2.2.3 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +pybind11 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/run_system.py b/applications/question_answering/supervised_qa/faq_finance/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..d095a4d59ce384f397cf441983560638a4db1def --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/run_system.py @@ -0,0 +1,67 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import time + +import numpy as np +import pandas as pd +from config import collection_name, embedding_name, partition_tag +from milvus_util import RecallByMilvus +from paddle_serving_server.pipeline import PipelineClient + + +def recall_result(list_data): + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = item + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + return result + + +def search_in_milvus(embeddings, query_text): + recall_client = RecallByMilvus() + start_time = time.time() + results = recall_client.search( + embeddings, + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "question", "answer"], + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + list_data = [] + for line in results: + for item in line: + + distance = item.distance + question = item.entity.get("question") + answer = item.entity.get("answer") + print(question, answer, distance) + list_data.append([query_text, question, answer, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "question", "answer", "distance"]) + df.to_csv("faq_result.csv", index=False) + + +if __name__ == "__main__": + list_data = ["买了社保,是不是就不用买商业保险了?"] + result = recall_result(list_data) + df = search_in_milvus(result, list_data[0]) diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/evaluate.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/evaluate.sh new file mode 100644 index 
0000000000000000000000000000000000000000..7b0a901f9e7b6b77c2c832b849847395f675145f --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/export_model.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..7cd26597635a5c7006aa5d53041a0b58d8057346 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/export_model.sh @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py --params_path checkpoints/model_100/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/export_to_serving.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/feature_extract.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/feature_extract.sh new file mode 100644 index 0000000000000000000000000000000000000000..25862539311dfff26ccd1f7563743dedc3db86fc --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/feature_extract.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python feature_extract.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/run_build_index.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..f235047e3ad3bde4ee580ec9c06a7ef61c9c1e5f --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/run_build_index.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# gpu +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --params_path "checkpoints/model_100/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/train.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..f1da0dd71e827edd322d768e96a7fee599da013d --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/train.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u -m paddle.distributed.launch --gpus='1' \ + train.py \ + --device gpu \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.1 \ + --rdrop_coef 0.1 \ + --train_set_file "./data/train_aug.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/train.py b/applications/question_answering/supervised_qa/faq_finance/train.py new file mode 100644 index 0000000000000000000000000000000000000000..52fbb51b566f970d2be8f467ee396420cb4b8a62 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/train.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import ( + convert_example, + create_dataloader, + read_simcse_text, + read_text_pair, + word_repetition, +) +from model import SimCSE +from scipy import stats + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override ecpochs.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--is_unsupervised", action='store_true', help="Whether to use unsupervised training") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--dup_rate", default=0.32, type=float, help="duplicate rate for word repetition.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--rdrop_coef", default=0.0, type=float, help="The coefficient of KL-Divergence loss in R-Drop paper, for more detail please refer to https://arxiv.org/abs/2106.14448), if rdrop_coef > 0 then R-Drop works") +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, 
sims).correlation + model.train() + return spearman_corr, total_num + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + if args.is_unsupervised: + train_ds = load_dataset(read_simcse_text, data_path=args.train_set_file, is_test=False, lazy=False) + else: + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, is_test=False, lazy=False) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path, hidden_dropout_prob=args.dropout, attention_probs_dropout_prob=args.dropout) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ), + ): + return [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + if random.random() < 0.2: + title_input_ids, title_token_type_ids = query_input_ids, query_token_type_ids + query_input_ids, query_token_type_ids = word_repetition(query_input_ids, query_token_type_ids, args.dup_rate) + title_input_ids, title_token_type_ids = word_repetition(title_input_ids, title_token_type_ids, args.dup_rate) + + loss, kl_loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + loss = loss + kl_loss * args.rdrop_coef + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return + + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/applications/question_answering/supervised_qa/faq_system/README.md b/applications/question_answering/supervised_qa/faq_system/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7992677f102e5a1e47139f77a484fc2ab473c561 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/README.md @@ -0,0 +1,366 @@ +# 政务问答检索式 FAQ System + + **目录** + +* [1. 场景概述](#场景概述) +* [2. 系统特色](#系统特色) +* [3. 政务问答系统方案](#政务问答系统方案) +* [4. 动手实践——搭建自己的端到端检索式问答系统](#动手实践——搭建自己的端到端检索式问答系统) + + + + +## 1. 场景概述 + +政府工作人员往往要做很多政策解读等工作,费时费力还耗费大量的人力,在政府内部,工作人员往往积累了很多问答对,但是不知道怎么构建一个问答系统来辅助工作人员提升日常工作效率,简化工作流程。 + + + +## 2. 系统特色 + ++ 低门槛 + + 手把手搭建检索式 FAQ System + + 无需相似 Query-Query Pair 标注数据也能构建 FAQ System ++ 效果好 + + 业界领先的检索预训练模型: RocketQA Dual Encoder + + 针对无标注数据场景的领先解决方案: 检索预训练模型 + 增强的无监督语义索引微调 + ++ 性能快 + + 基于 Paddle Inference 快速抽取向量 + + 基于 Milvus 快速查询和高性能建库 + + 基于 Paddle Serving 高性能部署 + + + +## 3. 
政务问答系统方案 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:针对政务问答只有问答对的场景,我们提供了一个 融合SimCSE 和 WR (word reptition)策略的无监督的解决方案。 + +#### 3.1.2 评估指标 + +* 该政务问答系统使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + + +### 3.2 数据说明 + +数据集来源于疫情政务问答比赛数据,包括源文本,问题和答案。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 | +| ------------ | ------------ |------------ | ------------ | ------------ | +| 召回 | SimCSE | 4000 | 1000 | 5000 | + +其中评估集的问题对的构造使用了中英文回译的方法,总共有1000条评估集,其中500条数据使用的是百度翻译的API,详情请参考[百度翻译](https://fanyi-api.baidu.com/?fr=simultaneous),另外500条数据使用了SimBERT模型生成的同义句。 + + +``` +├── data # 数据集 + ├── train.csv # 无监督训练集 + ├── test_pair.csv # 测试集,用于评估模型的效果 + ├── corpus.csv # 构建召回的数据,用于评估模型的召回效果 + ├── qa_pair.csv # 问答对,问题对应的答案 +``` +数据集的下载链接为: [faq_data](https://paddlenlp.bj.bcebos.com/applications/faq_data.zip) + +### 3.3 代码说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— model.py # SimCSE模型 +|—— train.py # SimCSE训练主脚本 +|—— ann_util.py # Ann 建索引库相关函数 +|—— config.py # Milvus 配置文件 +|—— evaluate.py # 召回评估文件 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— feature_extract.py # 批量提取文本的特征向量 +|—— milvus_util.py # Milvus的插入和召回类 +|—— vector_insert.py # 向 Milvus 引擎插入向量的函数 +|—— run_system.py # Client Server 模式客户端,向 server 发送文本,得到向量后,利用milvus引擎进行检索 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— feature_extract.sh # 向量抽取 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 +|—— deploy + |—— python + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 +``` + +### 3.3 效果评估 + +| 模型 | Recall@1 |Recall@10 | +| ------------ | ------------ |--------- | +| ERNIE1.0 + SimCSE | 68.068 | 85.686| +| RocketQA | 81.381 | 96.997| +| RocketQA + SimCSE | 83.283 | 97.297| +| RocketQA + SimCSE + WR | **83.584** | **97.497**| + + + +## 4. 动手实践——搭建自己的端到端检索式问答系统 + +### 4.1 环境安装 + +在运行下面的代码之前,安装相关的依赖,运行下面的命令: + +``` +pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +### 4.2 无监督训练 + +``` +python -u -m paddle.distributed.launch --gpus '0' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.3 \ + --train_set_file "./data/train.csv" +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `dropout`: SimCSE的dropout参数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `dup_rate` : SimCSE的 Word reptition 策略的重复率 +* `train_set_file`: 训练集文件 + +也可以使用下面的bash脚本: + +``` +sh scripts/train.sh +``` + +### 4.3 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取question的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top10 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 
评估 + +基于评估集 `test.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_150/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10 指标: +``` +python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` +输出如下的结果: + +``` +recall@1=83.784 +recall@5=94.995 +recall@10=96.997 +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +## 4.4 模型部署 + +模型部署模块首先要把动态图转换成静态图,然后转换成serving的格式。 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_150/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### 问答检索引擎 + +模型准备结束以后,开始搭建 Milvus 的语义检索引擎,用于语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成256维度的向量: + +``` +python feature_extract.py \ + --model_dir=./output \ + --corpus_file "data/corpus.csv" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +然后向搭建好的 Milvus 系统插入向量: + +``` +python vector_insert.py +``` + +### Paddle Serving 部署 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖,安装完依赖后就可以执行下面的步骤。 + + +首先把生成的静态图模型导出为 Paddle Serving的格式,命令如下: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py +``` + +启动客户端调用 Server, 使用 POST的方式: + 
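除了下文的 curl 命令,也可以用 Python 的 `requests` 库发送同样的 POST 请求。下面是一段示意代码(仅供参考,假设服务按上文配置监听本机 8090 端口,请求体格式与下文 curl 示例一致):

```python
import requests

# 假设:服务地址与下文 curl 示例相同
url = "http://localhost:8090/ernie/prediction"
data = {"key": ["0"], "value": ["宁夏针对哪些人员开通工伤保障绿色通道?"]}

resp = requests.post(url, json=data)
print(resp.json())
```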
+向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["宁夏针对哪些人员开通工伤保障绿色通道?"]}' +``` + +也可以使用 rpc的方式: + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [ + "湖北省为什么鼓励缴费人通过线上缴费渠道缴费?", + "佛山市救助站有多少个救助床位" +] +``` +然后运行: + +``` +python rpc_client.py +``` + +## 4.5 问答系统整个流程 + +问答系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = ["嘉定区南翔镇实行双门长制“门长”要求落实好哪些工作?"] +``` + +会输出如下的结果: + +``` +...... +Extract feature time to cost :0.01161503791809082 seconds +Search milvus time cost is 0.004535675048828125 seconds +嘉定区南翔镇实行双门长制“门长”要求落实好哪些工作? 拦、查、问、测、记 1.2107588152551751e-12 +上海市黄浦区老西门街道建立的党建责任区包干机制内容是什么? 街道工作人员担任楼宇联络员,分片区对接商务楼宇所属的物业公司,引导楼宇企业共同落实严防严控任务 0.4956303834915161 +上海市街道执行“四个统一”具体指什么? 统一由居委会干部在统一时间(每周三、五下午),递交至统一地点(社区事务受理服务中心专设窗口),街道统一收集至後台 0.6684658527374268 +怀柔区城管委在加强监督检查方面是如何落实的? 严格落实四方责任,保证每周2~3次深入环卫、电、气、热、公共自行车、垃圾处置等单位进行巡查,督促企业做好防疫工作,协调复工复产中存在的问题,确保安全复工复产有效落实。 0.7147952318191528 +华新镇“亮牌分批复工”工作方案具体内容是什么? 所有店铺一律先贴“红牌”禁止经营,经相关部门审批後,再换贴“蓝牌”准许复工。 0.7162970900535583 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来的问答对。 diff --git a/applications/question_answering/supervised_qa/faq_system/ann_util.py b/applications/question_answering/supervised_qa/faq_system/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..4e4983bfc4d6f581e14180804591dda2d0897465 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/ann_util.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import hnswlib +import numpy as np + +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size if args.output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/question_answering/supervised_qa/faq_system/config.py b/applications/question_answering/supervised_qa/faq_system/config.py new file mode 100644 index 0000000000000000000000000000000000000000..44bf3a260a31d97107e22f8fec09b141c5b7fe79 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/config.py @@ -0,0 +1,27 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from milvus import IndexType, MetricType + +MILVUS_HOST = "10.21.226.173" +MILVUS_PORT = 8530 + +collection_param = {"dimension": 256, "index_file_size": 256, "metric_type": MetricType.L2} + +index_type = IndexType.IVF_FLAT +index_param = {"nlist": 1000} + +top_k = 100 +search_param = {"nprobe": 20} diff --git a/applications/question_answering/supervised_qa/faq_system/data.py b/applications/question_answering/supervised_qa/faq_system/data.py new file mode 100644 index 0000000000000000000000000000000000000000..bda3a5de2e5dc3ca773aae96be874b5a9dc39abc --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/data.py @@ -0,0 +1,194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
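For reference, a minimal self-contained sketch of the hnswlib flow used in `build_index` above (random normalized vectors stand in for real model embeddings; the parameter values mirror the defaults used in the scripts):

```python
import hnswlib
import numpy as np

dim = 256  # matches output_emb_size in the scripts
corpus_emb = np.random.rand(1000, dim).astype("float32")
# Normalize so that inner product behaves like cosine similarity
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=1000, ef_construction=100, M=100)
index.add_items(corpus_emb)
index.set_ef(100)  # higher ef -> better recall, slower queries

query_emb = corpus_emb[:2]  # pretend the first two docs are queries
recalled_idx, distances = index.knn_query(query_emb, k=10)
# For the "ip" space hnswlib returns 1 - inner_product,
# hence the `1.0 - cosine_sims` conversion in recall.py
print(recalled_idx.shape, 1.0 - distances[:, 0])
```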
+ +import random + +import numpy as np +import paddle + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def convert_example_test(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def word_repetition(input_ids, token_type_ids, dup_rate=0.32): + """Word Repetition strategy.""" + input_ids = input_ids.numpy().tolist() + token_type_ids = token_type_ids.numpy().tolist() + + batch_size, seq_len = len(input_ids), len(input_ids[0]) + repetitied_input_ids = [] + repetitied_token_type_ids = [] + rep_seq_len = seq_len + for batch_id in range(batch_size): + cur_input_id = input_ids[batch_id] + actual_len = np.count_nonzero(cur_input_id) + dup_word_index = [] + # If sequence length is less than 5, skip it + if actual_len > 5: + dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len))) + # Skip cls and sep position + dup_word_index = random.sample(list(range(1, actual_len - 1)), k=dup_len) + + r_input_id = [] + r_token_type_id = [] + for idx, word_id in enumerate(cur_input_id): + # Insert duplicate word + if idx in dup_word_index: + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + after_dup_len = len(r_input_id) + repetitied_input_ids.append(r_input_id) + repetitied_token_type_ids.append(r_token_type_id) + + if after_dup_len > rep_seq_len: + rep_seq_len = after_dup_len + # Padding the data to the same length + for batch_id in range(batch_size): + after_dup_len = len(repetitied_input_ids[batch_id]) + pad_len = rep_seq_len - after_dup_len + repetitied_input_ids[batch_id] += [0] * pad_len + repetitied_token_type_ids[batch_id] += [0] * pad_len + + return paddle.to_tensor(repetitied_input_ids, dtype="int64"), paddle.to_tensor( + repetitied_token_type_ids, dtype="int64" + ) diff --git a/applications/question_answering/supervised_qa/faq_system/deploy/python/config_nlp.yml b/applications/question_answering/supervised_qa/faq_system/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..229f66090ebf328b03aaa6b41120277176e97d97 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 
当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/question_answering/supervised_qa/faq_system/deploy/python/rpc_client.py b/applications/question_answering/supervised_qa/faq_system/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..44e8c3a0d744ee73130f2a728365f0a98a37df2d --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/deploy/python/rpc_client.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = ["湖北省为什么鼓励缴费人通过线上缴费渠道缴费?", "佛山市救助站有多少个救助床位"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) + +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/question_answering/supervised_qa/faq_system/deploy/python/web_service.py b/applications/question_answering/supervised_qa/faq_system/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..aa2185782b17a68419d5d73b04e8241d35a90f29 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/deploy/python/web_service.py @@ -0,0 +1,75 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([input_dict[str(i)]], self.tokenizer) + examples.append((input_ids, segment_ids)) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/question_answering/supervised_qa/faq_system/evaluate.py b/applications/question_answering/supervised_qa/faq_system/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..aabeadf5b197e24177e22a6331f8aa9fe4ef2c1b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
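A standalone sketch of the tokenize-then-batch pattern used in the `preprocess` step above (assuming `paddlenlp` is installed; the example queries are reused from rpc_client.py):

```python
from paddlenlp.data import Pad, Tuple
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
texts = ["湖北省为什么鼓励缴费人通过线上缴费渠道缴费?", "佛山市救助站有多少个救助床位"]

examples = []
for text in texts:
    encoded = tokenizer(text=text, max_seq_len=64)
    examples.append((encoded["input_ids"], encoded["token_type_ids"]))

batchify_fn = Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # token_type_ids, padded with 0 as in web_service.py
)
input_ids, token_type_ids = batchify_fn(examples)
print(input_ids.shape, token_type_ids.shape)
```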
+ +import argparse + +import numpy as np + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") + + +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + recall_N = [] + recall_num = [1, 5, 10] + result = open("result.tsv", "a") + res = [] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + # print("\t".join(recall_N)) diff --git a/applications/question_answering/supervised_qa/faq_system/export_model.py b/applications/question_answering/supervised_qa/faq_system/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fba6e24c2af66042ea29c6a92e04dfac7d524e14 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/export_model.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
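To make the metric above concrete, here is the docstring example of `recall` worked out as a short runnable snippet (same logic as evaluate.py, restated purely for illustration; each query is assumed to have exactly one relevant document):

```python
import numpy as np

def recall(rs, N=10):
    # rs holds one row of 0/1 relevance flags per query, ordered by retrieval rank
    return np.mean([np.sum(r[0:N]) for r in rs])

rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(recall(rs, N=1))  # 0.333..., only the third query hits at rank 1
print(recall(rs, N=2))  # 0.666...
print(recall(rs, N=3))  # 1.0, every query recovers its ground truth within the top 3
```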
+ +import argparse +import os + +import paddle +from model import SimCSE + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, + default='./checkpoint/model_50/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', + help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + output_emb_size = 256 + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + model = SimCSE(pretrained_model, output_emb_size=output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/question_answering/supervised_qa/faq_system/export_to_serving.py b/applications/question_answering/supervised_qa/faq_system/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..6cc932da11173e54460642c16fd4226411ba3cfb --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/export_to_serving.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/question_answering/supervised_qa/faq_system/feature_extract.py b/applications/question_answering/supervised_qa/faq_system/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..6d2a292c42de1f2202e14a331eea1e9ad790cbca --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/feature_extract.py @@ -0,0 +1,202 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in enumerate(tqdm(data)): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("corpus_embedding", all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, data in enumerate(file.readlines()): + id2corpus[idx] = data.strip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + id2corpus = read_text(args.corpus_file) + + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + predictor.predict(corpus_list, tokenizer) diff --git a/applications/question_answering/supervised_qa/faq_system/milvus_util.py b/applications/question_answering/supervised_qa/faq_system/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..2eee1a88cf1ce7897981212d0710f3fad28e3351 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/milvus_util.py @@ -0,0 +1,109 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
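+# This module wraps the Milvus 1.x client (pymilvus==1.1.1, see requirements.txt):
+# VecToMilvus creates collections/partitions/indexes and inserts embedding vectors,
+# while RecallByMilvus performs approximate nearest neighbor search over them.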
+ +from config import ( + MILVUS_HOST, + MILVUS_PORT, + collection_param, + index_param, + index_type, + search_param, + top_k, +) + +# from milvus import * +from milvus import Milvus + + +class VecToMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + collection_param["collection_name"] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print("create partition {} successfully".format(partition_tag)) + return status + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(collection_name) + print("collection info: {}".format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert( + collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag + ) + self.client.flush([collection_name]) + print( + "Insert {} entities, there are {} entities after insert data.".format( + len(ids), self.client.count_entities(collection_name)[1] + ) + ) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search( + collection_name=collection_name, + query_records=vectors, + top_k=top_k, + params=search_param, + partition_tag=partition_tag, + ) + return status, results + except Exception as e: + print("Milvus recall error: ", e) diff --git a/applications/question_answering/supervised_qa/faq_system/model.py b/applications/question_answering/supervised_qa/faq_system/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d22cfe8c57979874c4839628577dfd1496e7d40e --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/model.py @@ -0,0 +1,135 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + 
cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/question_answering/supervised_qa/faq_system/recall.py b/applications/question_answering/supervised_qa/faq_system/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..21b2db729ba2d2fd57a310ca8275745ea337a647 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/recall.py @@ -0,0 +1,124 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from data import convert_example_test, create_dataloader, gen_id2corpus, gen_text_file +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + trans_func = partial(convert_example_test, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ), + ): + return [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + + model = SimCSE(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + # print(text_list[:5]) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in 
enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/question_answering/supervised_qa/faq_system/requirements.txt b/applications/question_answering/supervised_qa/faq_system/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..4a2271d321bf25ab1c5bc4e24af6c96a7a1f3a18 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/requirements.txt @@ -0,0 +1,5 @@ +pymilvus==1.1.1 +pandas==0.25.1 +paddlenlp>=2.3.7 +hnswlib>=0.5.2 +pybind11 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/run_system.py b/applications/question_answering/supervised_qa/faq_system/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..a8390575b19ffcf5cabca0693075448af817762b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/run_system.py @@ -0,0 +1,63 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from milvus_util import RecallByMilvus +from paddle_serving_server.pipeline import PipelineClient + + +def search_in_milvus(text_embedding, query_text): + collection_name = "faq_system" + partition_tag = "partition_1" + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + + corpus_file = "data/qa_pair.csv" + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + print(text, distance) + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "text", "distance"]) + df = df.sort_values(by="distance", ascending=True) + df.to_csv("data/recall_predict.csv", columns=["text", "distance"], sep="\t", header=None, index=False) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + list_data = ["嘉定区南翔镇实行双门长制“门长”要求落实好哪些工作?"] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = item + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, list_data[0]) diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/evaluate.sh b/applications/question_answering/supervised_qa/faq_system/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..7b0a901f9e7b6b77c2c832b849847395f675145f --- /dev/null +++ 
b/applications/question_answering/supervised_qa/faq_system/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/export_model.sh b/applications/question_answering/supervised_qa/faq_system/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..cdc97faa49473203422439958c11abc375ab599f --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/export_model.sh @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py --params_path checkpoints/model_150/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/export_to_serving.sh b/applications/question_answering/supervised_qa/faq_system/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
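+# Convert the exported static-graph model in ./output (inference.get_pooled_embedding.*)
+# into Paddle Serving server/client configs under ./serving_server and ./serving_client.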
+ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/feature_extract.sh b/applications/question_answering/supervised_qa/faq_system/scripts/feature_extract.sh new file mode 100644 index 0000000000000000000000000000000000000000..f75e929756718f8adb8906e8bc3cdf4c9f66d343 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/feature_extract.sh @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python feature_extract.py \ + --model_dir=./output \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/run_build_index.sh b/applications/question_answering/supervised_qa/faq_system/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..ead0a04f57bc1a7fc2b32a067ae08af88d8b60a0 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/run_build_index.sh @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# gpu +python -u -m paddle.distributed.launch --gpus "4" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_150/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/train.sh b/applications/question_answering/supervised_qa/faq_system/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..0de39022eb15b8af23292fe1817b5c64571385e4 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/train.sh @@ -0,0 +1,28 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u -m paddle.distributed.launch --gpus '4' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.3 \ + --train_set_file "./data/train.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/train.py b/applications/question_answering/supervised_qa/faq_system/train.py new file mode 100644 index 0000000000000000000000000000000000000000..efe0306e5578bb2f8efde9c9b2028a5c02aa7d00 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/train.py @@ -0,0 +1,209 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_simcse_text, word_repetition +from model import SimCSE +from scipy import stats + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override ecpochs.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--dup_rate", default=0.32, type=float, help="duplicate rate for word repetition.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") + +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, sims).correlation + model.train() + return spearman_corr, total_num + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + train_ds = load_dataset( + read_simcse_text, data_path=args.train_set_file, lazy=False) + model_name_or_path = 'rocketqa-zh-dureader-query-encoder' + pretrained_model = 
AutoModel.from_pretrained(model_name_or_path, hidden_dropout_prob=args.dropout, attention_probs_dropout_prob=args.dropout) + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ) + ): + return [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + if args.dup_rate > 0.0: + query_input_ids, query_token_type_ids = word_repetition(query_input_ids, query_token_type_ids, args.dup_rate) + title_input_ids, title_token_type_ids = word_repetition(title_input_ids, title_token_type_ids, args.dup_rate) + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return + + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == 
"__main__": + do_train() diff --git a/applications/question_answering/supervised_qa/faq_system/vector_insert.py b/applications/question_answering/supervised_qa/faq_system/vector_insert.py new file mode 100644 index 0000000000000000000000000000000000000000..5a32a083a3709e3da3f4eadfe970dd1638d3c610 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/vector_insert.py @@ -0,0 +1,46 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np +from milvus_util import VecToMilvus +from tqdm import tqdm + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + collection_name = "faq_system" + partition_tag = "partition_1" + data_size = len(embedding_ids) + batch_size = 100000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + file_path = "corpus_embedding.npy" + vector_insert(file_path) diff --git a/applications/question_answering/unsupervised_qa/README.md b/applications/question_answering/unsupervised_qa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..719d8f640f56d40d490e40770d93760264c9d1b4 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/README.md @@ -0,0 +1,480 @@ +# 无监督检索式问答系统 + +**目录** +- [无监督检索式问答系统](#无监督检索式问答系统) + - [简介](#简介) + - [项目优势](#项目优势) + - [方案介绍](#方案介绍) + - [流程图](#流程图) + - [技术方案](#技术方案) + - [代码结构说明](#代码结构说明) + - [快速体验](#快速体验) + - [运行环境和安装说明](#运行环境和安装说明) + - [数据说明](#数据说明) + - [快速体验无监督检索式问答系统](#快速体验无监督检索式问答系统) + - [可视化无监督检索式问答系统](#可视化无监督检索式问答系统) + - [离线问答对语料构建](#离线问答对语料构建) + - [基于Pipelines构建问答系统](#基于Pipelines构建问答系统) + - [自定义模型](#自定义模型) + - [数据准备](#数据准备) + - [模型微调](#模型微调) + - [答案抽取](#答案抽取) + - [问题生成](#问题生成) + - [过滤模型](#过滤模型) + - [语义索引和召回模型](#语义索引和召回模型) + - [排序模型](#排序模型) + - [References](#References) + +## 简介 +问答(QA)系统中最关键的挑战之一是标记数据的稀缺性,这是因为对目标领域获取问答对或常见问答对(FAQ)的成本很高,需要消耗大量的人力和时间。由于上述制约,这导致检索式问答系统落地困难,解决此问题的一种方法是依据问题上下文或大量非结构化文本自动生成的QA问答对。 + +在此背景下,无监督检索式问答系统(即问答对自动生成智能检索式问答),基于PaddleNLP[问题生成](../../../examples/question_generation/README.md)、[UIE](../../../model_zoo/uie/README.md)、[检索式问答](../supervised_qa/faq_finance/README.md),支持以非结构化文本形式为上下文自动生成QA问答对,生成的问答对语料可以通过无监督的方式构建检索式问答系统。 + +若开发者已有FAQ语料,请参考[supervised_qa](../supervised_qa)。 + +### 项目优势 +具体来说,本项目具有以下优势: + ++ 低成本 + + 可通过自动生成的方式快速大量合成QA语料,大大降低人力成本 + + 可控性好,合成语料和语义检索问答解耦合,可以人工筛查和删除合成的问答对,也可以添加人工标注的问答对 + ++ 低门槛 + + 手把手搭建无监督检索式问答系统 + + 无需相似Query-Query Pair标注数据也能构建问答系统 + ++ 效果好 + + 可通过自动问答对生成提升问答对语料覆盖度,缓解中长尾问题覆盖较少的问题 + + 业界领先的检索预训练模型: RocketQA Dual Encoder + + 针对无标注数据场景的领先解决方案: 检索预训练模型 + 增强的无监督语义索引微调 + ++ 端到端 + + 
提供包括问答语料生成、索引库构建、模型服务部署、WebUI可视化一整套端到端智能问答系统能力 + + 支持对Txt、Word、PDF、Image多源数据上传,同时支持离线、在线QA语料生成和ANN数据库更新 + +## 方案介绍 + +### 流程图 +本项目的流程图如下,对于给定的非结构化文本,我们首先通过答案抽取、问题生成、以及往返过滤模块,得到大量语料相关的问答对。针对这些得到的问答对,用户可以通过可以人工筛查和删除的方式来调整生成的问答对,也可以进一步添加人工标注的问答对。随后开发者就可以通过语义索引模块,来构建向量索引库。在构造完索引库之后,我们就可以通过召回模块和排序模块对问答对进行查询,得到最终的查询结果。 + +
+ （此处为无监督检索式问答系统流程图 image） +
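+下面用一段极简的示意代码帮助理解上图中各模块的衔接方式。注意这只是一个流程草图：其中 extract_answers、generate_questions、roundtrip_filter、encode、rank 均为假设的占位函数，并非本项目的真实接口：
+
+```python
+import numpy as np
+
+
+def build_qa_corpus(contexts, extract_answers, generate_questions, roundtrip_filter):
+    """离线问答对构建：答案抽取 -> 问题生成 -> 往返过滤。"""
+    qa_pairs = []
+    for context in contexts:
+        for answer in extract_answers(context):                   # 答案抽取（UIE）
+            for question in generate_questions(context, answer):  # 问题生成（UNIMO-Text）
+                if roundtrip_filter(context, question, answer):   # 往返过滤
+                    qa_pairs.append({"context": context, "question": question, "answer": answer})
+    return qa_pairs
+
+
+def answer_query(query, qa_pairs, encode, rank, recall_num=10):
+    """在线查询：语义召回 + 精排。"""
+    corpus_vecs = np.stack([encode(p["question"]) for p in qa_pairs])   # 对问答对建立向量索引
+    sims = corpus_vecs @ encode(query)                                   # 召回：向量相似度
+    candidates = [qa_pairs[i] for i in np.argsort(-sims)[:recall_num]]  # 取 top-k 召回结果
+    return sorted(candidates, key=lambda p: rank(query, p["question"]), reverse=True)  # 精排
+```
+
+实际系统中，召回与排序分别由 RocketQA 的 query-encoder 与 cross-encoder 完成，向量索引由 ANN 服务承载，具体见下文“技术方案”。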
+ +### 技术方案 +由于涉及较多的模块,本项目将基于PaddleNLP Pipelines进行模块的组合和项目的构建。PaddleNLP Pipelines是一个端到端NLP流水线系统框架,它可以通过插拔式组件产线化设计来构建一个完整的无监督问答系统。具体来说,我们的技术方案包含以下方面: + +**答案抽取**:我们基于UIE训练了一个答案抽取模型,该答案抽取模型接收“答案”作为提示词,该模型可以用来对潜在的答案信息进行挖掘抽取,我们同时提供了训练好的模型权重`uie-base-answer-extractor`。 + +**问题生成**:我们基于中文预训练语言模型UNIMO-Text、模版策略和大规模多领域问题生成数据集训练了一个通用点问题生成预训练模型`unimo-text-1.0-question-generation`。 + +**往返过滤**:我们采用过生成(overgenerate)的策略生成大量的潜在答案和问题,并通过往返过滤的方式针对生成的过量问答对进行过滤得到最终的问答对。我们的往返过滤模块需要训练一个有条件抽取式问答模型3。 + +**语义索引**:针对给定问答对语料,我们基于RocketQA(即`rocketqa-zh-base-query-encoder`)对问答对进行语义向量化,并通过ElasticSearch的ANN服务构建索引库。 + +**召回排序**:给定用户查询,我们基于RocketQA的query-encoder和cross-encoder分别进行召回和排序操作,得到目标的问答对,从而返回给用户查询结果。 + +**Pipelines**:由于本项目设计的模块较多,我们使用PaddleNLP Pipelines进行模块的组合和项目的构建。大体来说,我们的Pipelines包含两个具体的pipeline和三个服务。两个pipeline分别是qa_generation_pipeline和dense_faq_pipeline;三个服务分别是基于ElasticSearch的ANN在线索引库服务,基于RestAPI的模型后端服务以及基于Streamlit的前端WebUI服务。 + + +## 快速体验 +### 运行环境和安装说明 +基于Pipelines构建问答系统需要安装paddle-pipelines依赖,使用pip安装命令如下: +```bash +# pip一键安装 +pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +或者进入pipelines目录下,针对源码进行安装: +```bash +# 源码进行安装 +cd PaddleNLP/pipelines/ +pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple +python setup.py install +``` + +### 数据说明 +我们以提供的纯文本文件[source_file.txt](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/source_file.txt)为例,系统将每一条都视为一个上下文并基于此生成多个问答对,并基于此构建索引库,该文件可直接下载放入`data`,开发者也可以使用自己的文件。 + +### 快速体验无监督检索式问答系统 +开发者可以通过如下命令快速体验无监督智能检索问答系统的效果,系统将自动根据提供的纯文本文件构建问答对语料库,并基于生成的问答对语料库构造索引库。 +我们建议在GPU环境下运行本示例,运行速度较快,运行命令如下: +```bash +# GPU环境下运行示例 +# 设置1个空闲的GPU卡,此处假设0卡为空闲GPU +export CUDA_VISIBLE_DEVICES=0 +python run_pipelines_example.py --device gpu --source_file data/source_file.txt --doc_dir data/my_data --index_name faiss_index --retriever_batch_size 16 +``` +关键参数释义如下: +- `device`: 使用的设备,默认为'gpu',可选择['cpu', 'gpu']。 +- `source_file`: 源文件路径,指定该路径将自动为其生成问答对至`doc_dir`。 +- `doc_dir`: 生成的问答对语料保存的位置,系统将根据该位置自动构建检索数据库,默认为'data/my_data'。 +- `index_name`: FAISS的ANN索引名称,默认为'faiss_index'。 +- `retriever_batch_size`: 构建ANN索引时的批量大小,默认为16。 + +如果只有CPU机器,可以通过--device参数指定cpu即可, 运行耗时较长,运行命令如下: +```bash +# CPU环境下运行示例 +unset CUDA_VISIBLE_DEVICES +python run_pipelines_example.py --device cpu --source_file data/source_file.txt --doc_dir data/my_data --index_name faiss_index --retriever_batch_size 16 +``` + + + + +## 可视化无监督检索式问答系统 +开发者可以基于Pipelines进一步构建Web可视化的无监督检索式问答系统,其效果如下, +
+ +
+ + + +### 离线问答对语料构建 +这一部分介绍如何离线构建问答对语料,同时我们我们也在Pipeline中集成了在线问答对语料。 +#### 数据说明 +我们以提供的纯文本文件[source_file.txt](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/source_file.txt)为例,系统将每一条都视为一个上下文并基于此生成多个问答对,随后系统将根据这些问答对构建索引库,该文件可直接下载放入`data`,开发者也可以使用自己的文件。 + +#### 问答对生成 +对于标准场景的问答对可以直接使用提供的预训练模型实现零样本(zero-shot)问答对生成。对于细分场景开发者可以根据个人需求训练[自定义模型](#自定义模型),加载自定义模型进行问答对生成,以进一步提升效果。 + +生成问答对语料的命令如下: +```shell +export CUDA_VISIBLE_DEVICES=0 +python -u run_qa_pairs_generation.py \ + --source_file_path=data/source_file.txt \ + --target_file_path=data/target_file.json \ + --answer_generation_model_path=uie-base-answer-extractor-v1 \ + --question_generation_model_path=unimo-text-1.0-question-generation \ + --filtration_model_path=uie-base-qa-filter-v1 \ + --batch_size=8 \ + --a_max_answer_candidates=10 \ + --a_prompt='答案' \ + --a_position_prob=0.01 \ + --q_num_return_sequences=3 \ + --q_max_question_length=50 \ + --q_decode_strategy=sampling \ + --q_top_k=5 \ + --q_top_p=1 \ + --do_filtration \ + --f_filtration_position_prob=0.01 \ + --do_debug +``` +关键参数释义如下: +- `source_file_path` 源文件路径,源文件中每一行代表一条待生成问答对的上下文文本。 +- `target_file_path` 目标文件路径,生成的目标文件为json格式。 +- `answer_generation_model_path` 要加载的答案抽取模型的路径,可以是PaddleNLP提供的预训练模型,或者是本地模型checkpoint路径。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | uie-base-answer-extractor-v1 | + +- `question_generation_model_path` 要加载的问题生成模型的路径,可以是PaddleNLP提供的预训练模型,或者是本地模型checkpoint路径。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | unimo-text-1.0-question-generation | + | unimo-text-1.0-dureader_qg | + | unimo-text-1.0-question-generation-dureader_qg | + +- `filtration_model_path` 要加载的过滤模型的路径,可以是PaddleNLP提供的预训练模型,或者是本地模型checkpoint路径。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | uie-base-qa-filter-v1 | + +- `batch_size` 使用taskflow时的批处理大小,请结合机器情况进行调整,默认为8。 +- `a_max_answer_candidates` 答案抽取阶段,每个输入的最大返回答案候选数,默认为5。 +- `a_prompt` 答案抽取阶段,使用的提示词,以","分隔,默认为"答案"。 +- `a_position_prob` 答案抽取阶段,置信度阈值,默认为0.01。 +- `q_num_return_sequences` 问题生成阶段,返回问题候选数,在使用"beam_search"解码策略时它应该小于`q_num_beams`,默认为3。 +- `q_max_question_length` 问题生成阶段,最大解码长度,默认为50。 +- `q_decode_strategy` 问题生成阶段,解码策略,默认为"sampling"。 +- `q_top_k` 问题生成阶段,使用"sampling"解码策略时的top k值,默认为5。 +- `q_top_p` 问题生成阶段,使用"sampling"解码策略时的top p值,默认为0。 +- `q_num_beams` 问题生成阶段,使用"beam_search"解码策略时的beam大小,默认为6。 +- `do_filtration` 是否进行过滤。 +- `f_filtration_position_prob` 过滤阶段,过滤置信度阈值,默认为0.1。 +- `do_debug` 是否进入调试状态,调试状态下将输出过滤掉的生成问答对。 + +#### 语料转换 +执行以下脚本对生成的问答对进行转换,得到语义索引所需要的语料train.csv、dev.csv、q_corpus.csv、qa_pair.csv: +```shell +python -u run_corpus_preparation.py \ + --source_file_path data/target_file.json \ + --target_dir_path data/my_corpus +``` +关键参数释义如下: +- `source_file_path` 指示了要转换的训练数据集文件或测试数据集文件,文件格式要求见从本地文件创建数据集部分。指示了要转换的问答对json文件路径,生成的目标文件为json格式 +- `target_dir_path` 输出数据的目标文件夹,默认为"data/my_corpus"。 +- `test_sample_num` 构建检索系统时保留的测试样本数目,默认为0。 +- `train_sample_num` 构建检索系统时保留的有监督训练样本数目,默认为0。 +- `all_sample_num` 构建检索系统时保留的总样本数目,默认为None,表示保留除了前`test_sample_num`+`train_sample_num`个样本外的所有样本。 + + + + + +### 基于Pipelines构建问答系统 +本项目提供了基于Pipelines的低成本构建问答对自动生成智能检索问答系统的能力。开发者只需要提供非结构化的纯文本,就可以使用本项目预制的问答对生成模块生成大量的问答对,并基于此快速搭建一个针对自己业务的检索问答系统,并可以提供Web可视化产品服务。Web可视化产品服务支持问答检索、在线问答对生成,在线文件上传和解析,在线索引库更新等功能,用户也可根据需要自行调整。具体的构建流程请参考[Pipelines-无监督智能检索问答系统](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/unsupervised-question-answering)。 + + + +## 自定义模型 
+除了使用预置模型外,用户也可以训练并接入自己训练的模型,我们提供了从答案抽取、问题生成、往返过滤的过滤模型,到语义索引、召回、排序各个阶段的定制化训练方案。 +### 数据准备 +这一部分介绍如何准备和预处理答案抽取、问题生成、过滤模块微调所需的数据。关于如何准备通过无监督方式训练自定义语义索引模型所需的问答对数据,见[离线问答对语料构建](#离线问答对语料构建)。 +#### 自定义数据 +在许多情况下,我们需要使用本地数据集来微调模型从而得到定制化的能力,让生成的问答对更接近于理想分布,本项目支持使用固定格式本地数据集文件进行微调。 + +这里我们提供预先标注好的文件样例[train.json](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/train.json)和[dev.json](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/dev.json),开发者可直接下载放入`data`目录,此外也可自行构建本地数据集,具体来说,本地数据集主要包含以下文件: +```text +data +├── train.json # 训练数据集文件 +├── dev.json # 开发数据集文件 +└── test.json # 可选,待预测数据文件 +``` +本地数据集文件格式如下: +```text +# train.json/dev.json/test.json文件格式: +{ + "context": , + "answer": , + "question": , +} +... +``` +本地数据集文件具体样例如下: +```text +train.json/dev.json/test.json文件样例: +{ + "context": "欠条是永久有效的,未约定还款期限的借款合同纠纷,诉讼时效自债权人主张债权之日起计算,时效为2年。 根据《中华人民共和国民法通则》第一百三十五条:向人民法院请求保护民事权利的诉讼时效期间为二年,法律另有规定的除外。 第一百三十七条:诉讼时效期间从知道或者应当知道权利被侵害时起计算。但是,从权利被侵害之日起超过二十年的,人民法院不予保护。有特殊情况的,人民法院可以延长诉讼时效期间。 第六十二条第(四)项:履行期限不明确的,债务人可以随时履行,债权人也可以随时要求履行,但应当给对方必要的准备时间。", + "answer": "永久有效", + "question": "欠条的有效期是多久" +} +... +``` + +#### 数据预处理 +执行以下脚本对数据集进行数据预处理,得到接下来答案抽取、问题生成、过滤模块模型微调所需要的数据,注意这里答案抽取、问题生成、过滤模块的微调数据来源于相同的数据集。 +```shell +python -u run_data_preprocess.py \ + --source_file_path data/train.json \ + --target_dir data/finetune \ + --do_answer_prompt + +python -u run_data_preprocess.py \ + --source_file_path data/dev.json \ + --target_dir data/finetune \ + --do_answer_prompt +``` +关键参数释义如下: +- `source_file_path` 指示了要转换的训练数据集文件或测试数据集文件,文件格式要求见[自定义数据](#自定义数据)部分。 +- `target_dir` 输出数据的目标文件夹,默认为"data/finetune"。 +- `do_answer_prompt` 表示在构造答案抽取数据时是否添加"答案"提示词。 +- `do_len_prompt` 表示在构造答案抽取数据时是否添加长度提示词。 +- `do_domain_prompt` 表示在构造答案抽取数据时是否添加领域提示词。 +- `domain` 表示添加的领域提示词,在`do_domain_prompt`时有效。 + +**NOTE:** 预处理后的微调用数据将分别位于`target_dir`下的answer_extraction、question_generation、filtration三个子文件夹中。 + +### 模型微调 +#### 答案抽取 +运行如下命令即可在样例训练集上微调答案抽取模型,用户可以选择基于`uie-base-answer-extractor`进行微调,或者基于`uie-base`等从头开始微调。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "1,2" --log_dir log/answer_extraction finetune/answer_extraction_and_roundtrip_filtration/finetune.py \ + --train_path=data/finetune/answer_extraction/train.json \ + --dev_path=data/finetune/answer_extraction/dev.json \ + --save_dir=log/answer_extraction/checkpoints \ + --learning_rate=1e-5 \ + --batch_size=16 \ + --max_seq_len=512 \ + --num_epochs=30 \ + --model=uie-base \ + --seed=1000 \ + --logging_steps=100 \ + --valid_steps=100 \ + --device=gpu +``` +关键参数释义如下: +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `save_dir`: 模型存储路径,默认为`log/answer_extration/checkpoints`。 +- `learning_rate`: 学习率,默认为1e-5。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `num_epochs`: 训练轮数,默认为30。 +- `model`: 选择模型,程序会基于选择的模型进行模型微调,可选有`uie-base-answer-extractor`,`uie-base`,`uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `init_from_ckpt`: 用于初始化的模型参数的路径。 +- `seed`: 随机种子,默认为1000. 
+- `logging_steps`: 日志打印的间隔steps数,默认10。 +- `valid_steps`: evaluate的间隔steps数,默认100。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。 + + +通过运行以下命令在样例验证集上进行模型评估: + +```shell +python finetune/answer_extraction_and_roundtrip_filtration/evaluate.py \ + --model_path=log/answer_extraction/checkpoints/model_best \ + --test_path=data/finetune/answer_extraction/dev.json \ + --batch_size=16 \ + --max_seq_len=512 \ + --limit=0.01 +``` + +关键参数释义如下: +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `model`: 选择所使用的模型,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +- `limit`: SpanEvaluator测评指标的`limit`,当概率数组中的最后一个维度大于该值时将返回相应的文本片段;当limit设置为0.01时表示关注模型的召回率,也即答案的覆盖率。 + +#### 问题生成 +运行如下命令即可在样例训练集上微调问题生成模型,并在样例验证集上进行验证。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "1,2" --log_dir log/question_generation finetune/question_generation/train.py \ + --train_file=data/finetune/question_generation/train.json \ + --predict_file=data/finetune/question_generation/dev.json \ + --save_dir=log/question_generation/checkpoints \ + --output_path=log/question_generation/predict.txt \ + --dataset_name=dureader_qg \ + --model_name_or_path="unimo-text-1.0" \ + --logging_steps=100 \ + --save_steps=500 \ + --epochs=20 \ + --batch_size=16 \ + --learning_rate=1e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_train \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --num_return_sequences=1 \ + --template=1 \ + --device=gpu +``` + + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU,使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。 +- `dataset_name` 数据集名称,用来指定数据集格式,默认为`dureader_qg`。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `predict_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一,默认为`unimo-text-1.0`。 + | 可选预训练模型 | + |---------------------------------| + | unimo-text-1.0 | + | unimo-text-1.0-large | + | unimo-text-1.0-question-generation | + +- `save_dir` 表示模型的保存路径。 +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_predict` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 +- `template` 表示使用的模版,从[0, 1, 2, 3, 4]中选择,0表示不选择模版,1表示使用默认模版。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 + +**【注意】** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +#### 过滤模型 +运行如下命令即可在样例训练集上微调答案抽取模型,用户可以选择基于`uie-base-qa-filter`进行微调,或者基于`uie-base`等从头开始微调。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "1,2" --log_dir log/filtration 
finetune/answer_extraction_and_roundtrip_filtration/finetune.py \ + --train_path=data/finetune/filtration/train.json \ + --dev_path=data/finetune/filtration/dev.json \ + --save_dir=log/filtration/checkpoints \ + --learning_rate=1e-5 \ + --batch_size=16 \ + --max_seq_len=512 \ + --num_epochs=30 \ + --model=uie-base \ + --seed=1000 \ + --logging_steps=100 \ + --valid_steps=100 \ + --device=gpu +``` +关键参数释义如下: +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `save_dir`: 模型存储路径,默认为`log/filtration/checkpoints`。 +- `learning_rate`: 学习率,默认为1e-5。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `num_epochs`: 训练轮数,默认为30。 +- `model`: 选择模型,程序会基于选择的模型进行模型微调,可选有`uie-base-qa-filter`,`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `init_from_ckpt`: 用于初始化的模型参数的路径。 +- `seed`: 随机种子,默认为1000. +- `logging_steps`: 日志打印的间隔steps数,默认10。 +- `valid_steps`: evaluate的间隔steps数,默认100。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。 + + +通过运行以下命令在样例验证集上进行模型评估: + +```shell +python finetune/answer_extraction_and_roundtrip_filtration/evaluate.py \ + --model_path=log/filtration/checkpoints/model_best \ + --test_path=data/finetune/filtration/dev.json \ + --batch_size=16 \ + --max_seq_len=512 \ + --limit=0.5 +``` + +关键参数释义如下: +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `model`: 选择所使用的模型,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +- `limit`: SpanEvaluator测评指标的`limit`,当概率数组中的最后一个维度大于该值时将返回相应的文本片段。 + +#### 语义索引和召回模型 +我们的语义索引和召回模型是基于RocketQA的QueryEncoder训练的双塔模型,该模型用于语义索引和召回阶段,分别进行语义向量抽取和相似度召回。除使用预置模型外,如果用户想训练并接入自己的模型,模型训练可以参考[FAQ Finance](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/supervised_qa/faq_finance)。 + +#### 排序模型 +我们的排序模型是基于RocketQA的CrossEncoder训练的单塔模型,该模型用于搜索的排序阶段,对召回的结果进行重新排序的作用。关于排序的定制训练,可以参考[CrossEncoder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/ranking/cross_encoder)。 + +## References +[1] Zheng, Chujie, and Minlie Huang. "Exploring prompt-based few-shot learning for grounded dialog generation." arXiv preprint arXiv:2109.06513 (2021). + +[2] Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). + +[3] Puri, Raul, et al. "Training question answering models from synthetic data." arXiv preprint arXiv:2002.09599 (2020). + +[4] Lewis, Patrick, et al. "Paq: 65 million probably-asked questions and what you can do with them." Transactions of the Association for Computational Linguistics 9 (2021): 1098-1115. + +[5] Alberti, Chris, et al. "Synthetic QA corpora generation with roundtrip consistency." arXiv preprint arXiv:1906.05416 (2019). diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/evaluate.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..4705384bf37cdeb3d478e51235836ef8674c914a --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/evaluate.py @@ -0,0 +1,94 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from utils import convert_example, reader, unify_prompt_name + +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + for batch in data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def do_eval(): + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + model = UIE.from_pretrained(args.model_path) + + test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False) + class_dict = {} + if args.debug: + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if len(data["result_list"]) != 0: + class_dict.setdefault(class_name, []).append(data) + else: + class_dict["all_classes"] = test_ds + for key in class_dict.keys(): + if args.debug: + test_ds = MapDataset(class_dict[key]) + else: + test_ds = class_dict[key] + test_ds = test_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)) + test_batch_sampler = paddle.io.BatchSampler(dataset=test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = paddle.io.DataLoader(dataset=test_ds, batch_sampler=test_batch_sampler, return_list=True) + + metric = SpanEvaluator(args.limit) + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + logger.info("Class Name: %s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after 
tokenization.") + parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.") + parser.add_argument("--limit", type=float, default=0.5, help="The limit when using SpanEvaluator, when the last dimension in probability arrays is greater than the limit, the corresponding span will be returned.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/finetune.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..5e6747075667f67b5a53233628dc0d86e6d0cc62 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/finetune.py @@ -0,0 +1,141 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from evaluate import evaluate +from utils import convert_example, reader, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + tokenizer = AutoTokenizer.from_pretrained(args.model) + model = UIE.from_pretrained(args.model) + + train_ds = load_dataset(reader, data_path=args.train_path, max_seq_len=args.max_seq_len, lazy=False) + print("train data loaded successfully.") + dev_ds = load_dataset(reader, data_path=args.dev_path, max_seq_len=args.max_seq_len, lazy=False) + print("dev data loaded successfully.") + + train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)) + dev_ds = dev_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)) + + train_batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, return_list=True) + + dev_batch_sampler = paddle.io.BatchSampler(dataset=dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, return_list=True) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + optimizer = paddle.optimizer.AdamW(learning_rate=args.learning_rate, parameters=model.parameters()) + + criterion = paddle.nn.BCELoss() + metric = 
SpanEvaluator() + + loss_list = [] + global_step = 0 + + best_f1 = 0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + loss_start = criterion(start_prob, start_ids) + loss_end = criterion(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(float(loss)) + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = sum(loss_list) / len(loss_list) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + tic_train = time.time() + + if global_step % args.valid_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + + precision, recall, f1 = evaluate(model, metric, dev_data_loader) + logger.info("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" % (precision, recall, f1)) + if f1 > best_f1: + logger.info(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + tic_train = time.time() + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--save_dir", default='.log/filtration/checkpoints', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. 
Sequences longer than this will be split automatically.") + parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--model", choices=["uie-base", "uie-tiny", "uie-medium", "uie-mini", "uie-micro", "uie-nano"], default="uie-base", type=str, help="Select the pretrained model for few-shot learning.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/span.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/span.py new file mode 100644 index 0000000000000000000000000000000000000000..a34c113ed4f0274b656d302ed8eb08a827084041 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/span.py @@ -0,0 +1,103 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddle.metric import Metric + +from paddlenlp.utils.tools import get_bool_ids_greater_than, get_span + + +class SpanEvaluator(Metric): + """ + SpanEvaluator computes the precision, recall and F1-score for span detection. + """ + + def __init__(self, limit=0.5): + super(SpanEvaluator, self).__init__() + self.num_infer_spans = 0 + self.num_label_spans = 0 + self.num_correct_spans = 0 + self.limit = limit + + def compute(self, start_probs, end_probs, gold_start_ids, gold_end_ids): + """ + Computes the precision, recall and F1-score for span detection. 
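+        Positions whose probability (or gold label) is greater than ``self.limit`` are
+        treated as predicted/gold span boundaries; the method returns the number of
+        correct, inferred and labeled spans for the given batch.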
+ """ + pred_start_ids = get_bool_ids_greater_than(start_probs, self.limit) + pred_end_ids = get_bool_ids_greater_than(end_probs, self.limit) + gold_start_ids = get_bool_ids_greater_than(gold_start_ids.tolist(), self.limit) + gold_end_ids = get_bool_ids_greater_than(gold_end_ids.tolist(), self.limit) + num_correct_spans = 0 + num_infer_spans = 0 + num_label_spans = 0 + for predict_start_ids, predict_end_ids, label_start_ids, label_end_ids in zip( + pred_start_ids, pred_end_ids, gold_start_ids, gold_end_ids + ): + [_correct, _infer, _label] = self.eval_span( + predict_start_ids, predict_end_ids, label_start_ids, label_end_ids + ) + num_correct_spans += _correct + num_infer_spans += _infer + num_label_spans += _label + return num_correct_spans, num_infer_spans, num_label_spans + + def update(self, num_correct_spans, num_infer_spans, num_label_spans): + """ + This function takes (num_infer_spans, num_label_spans, num_correct_spans) as input, + to accumulate and update the corresponding status of the SpanEvaluator object. + """ + self.num_infer_spans += num_infer_spans + self.num_label_spans += num_label_spans + self.num_correct_spans += num_correct_spans + + def eval_span(self, predict_start_ids, predict_end_ids, label_start_ids, label_end_ids): + """ + evaluate position extraction (start, end) + return num_correct, num_infer, num_label + input: [1, 2, 10] [4, 12] [2, 10] [4, 11] + output: (1, 2, 2) + """ + pred_set = get_span(predict_start_ids, predict_end_ids) + label_set = get_span(label_start_ids, label_end_ids) + num_correct = len(pred_set & label_set) + num_infer = len(pred_set) + # For the case of overlapping in the same category, + # length of label_start_ids and label_end_ids is not equal + num_label = max(len(label_start_ids), len(label_end_ids)) + return (num_correct, num_infer, num_label) + + def accumulate(self): + """ + This function returns the mean precision, recall and f1 score for all accumulated minibatches. + + Returns: + tuple: Returns tuple (`precision, recall, f1 score`). + """ + precision = float(self.num_correct_spans / self.num_infer_spans) if self.num_infer_spans else 0.0 + recall = float(self.num_correct_spans / self.num_label_spans) if self.num_label_spans else 0.0 + f1_score = float(2 * precision * recall / (precision + recall)) if self.num_correct_spans else 0.0 + return precision, recall, f1_score + + def reset(self): + """ + Reset function empties the evaluation memory for previous mini-batches. + """ + self.num_infer_spans = 0 + self.num_label_spans = 0 + self.num_correct_spans = 0 + + def name(self): + """ + Return name of metric instance. + """ + return "precision", "recall", "f1" diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/utils.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..95b0fb6d0b65bca840c26a643cd7267da2ce701f --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/utils.py @@ -0,0 +1,454 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import random +import re + +import numpy as np +import paddle +from tqdm import tqdm + +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def convert_example(example, tokenizer, max_seq_len): + """ + example: { + title + prompt + content + result_list + } + """ + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(1, len(offset_mapping)): + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + start_ids = [0 for x in range(max_seq_len)] + end_ids = [0 for x in range(max_seq_len)] + for item in example["result_list"]: + start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + tokenized_output = [ + encoded_inputs["input_ids"], + encoded_inputs["token_type_ids"], + encoded_inputs["position_ids"], + encoded_inputs["attention_mask"], + start_ids, + end_ids, + ] + tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output] + return tuple(tokenized_output) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + i = 0 + j = 0 + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. 
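+            # Contents longer than max_content_len are split into several windows below,
+            # and the start/end offsets in result_list are shifted so that each split
+            # example stays aligned with its own window.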
+ if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + i += 1 + yield json_line + else: + j += 1 + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + + for result in result_list: + if result["start"] + 1 <= max_content_len < result["end"]: + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def unify_prompt_name(prompt): + # The classification labels are shuffled during finetuning, so they need + # to be unified during evaluation. + if re.search(r"\[.*?\]$", prompt): + prompt_prefix = prompt[: prompt.find("[", 1)] + cls_options = re.search(r"\[.*?\]$", prompt).group()[1:-1].split(",") + cls_options = sorted(list(set(cls_options))) + cls_options = ",".join(cls_options) + prompt = prompt_prefix + "[" + cls_options + "]" + return prompt + return prompt + + +def add_negative_example(examples, texts, prompts, label_set, negative_ratio): + negative_examples = [] + positive_examples = [] + with tqdm(total=len(prompts)) as pbar: + for i, prompt in enumerate(prompts): + + redundants_list = list(set(label_set) ^ set(prompt)) + redundants_list.sort() + + num_positive = len(examples[i]) + if num_positive != 0: + actual_ratio = math.ceil(len(redundants_list) / num_positive) + else: + # Set num_positive to 1 for text without positive example + num_positive, actual_ratio = 1, 0 + + if actual_ratio <= negative_ratio or negative_ratio == -1: + idxs = [k for k in range(len(redundants_list))] + else: + idxs = random.sample(range(0, len(redundants_list)), negative_ratio * num_positive) + + for idx in idxs: + negative_result = {"content": texts[i], "result_list": [], "prompt": redundants_list[idx]} + negative_examples.append(negative_result) + positive_examples.extend(examples[i]) + pbar.update(1) + return positive_examples, negative_examples + + +def add_full_negative_example(examples, texts, relation_prompts, predicate_set, subject_goldens): + with tqdm(total=len(relation_prompts)) as pbar: + for i, relation_prompt in enumerate(relation_prompts): + negative_sample = [] + for subject in subject_goldens[i]: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate + prompt = subject + "的" + predicate + if prompt not in relation_prompt: + negative_result = {"content": texts[i], "result_list": [], "prompt": prompt} + negative_sample.append(negative_result) + 
examples[i].extend(negative_sample) + pbar.update(1) + return examples + + +def construct_relation_prompt_set(entity_name_set, predicate_set): + relation_prompt_set = set() + for entity_name in entity_name_set: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate + relation_prompt = entity_name + "的" + predicate + relation_prompt_set.add(relation_prompt) + return sorted(list(relation_prompt_set)) + + +def generate_cls_example(text, labels, prompt_prefix, options): + random.shuffle(options) + cls_options = ",".join(options) + prompt = prompt_prefix + "[" + cls_options + "]" + + result_list = [] + example = {"content": text, "result_list": result_list, "prompt": prompt} + for label in labels: + start = prompt.rfind(label[0]) - len(prompt) - 1 + end = start + len(label) + result = {"text": label, "start": start, "end": end} + example["result_list"].append(result) + return example + + +def convert_cls_examples(raw_examples, prompt_prefix="情感倾向", options=["正向", "负向"]): + """ + Convert labeled data export from doccano for classification task. + """ + examples = [] + logger.info("Converting doccano data...") + with tqdm(total=len(raw_examples)): + for line in raw_examples: + items = json.loads(line) + # Compatible with doccano >= 1.6.2 + if "data" in items.keys(): + text, labels = items["data"], items["label"] + else: + text, labels = items["text"], items["label"] + example = generate_cls_example(text, labels, prompt_prefix, options) + examples.append(example) + return examples + + +def convert_ext_examples( + raw_examples, negative_ratio, prompt_prefix="情感倾向", options=["正向", "负向"], separator="##", is_train=True +): + """ + Convert labeled data export from doccano for extraction and aspect-level classification task. + """ + + def _sep_cls_label(label, separator): + label_list = label.split(separator) + if len(label_list) == 1: + return label_list[0], None + return label_list[0], label_list[1:] + + def _concat_examples(positive_examples, negative_examples, negative_ratio): + examples = [] + if math.ceil(len(negative_examples) / len(positive_examples)) <= negative_ratio: + examples = positive_examples + negative_examples + else: + # Random sampling the negative examples to ensure overall negative ratio unchanged. + idxs = random.sample(range(0, len(negative_examples)), negative_ratio * len(positive_examples)) + negative_examples_sampled = [] + for idx in idxs: + negative_examples_sampled.append(negative_examples[idx]) + examples = positive_examples + negative_examples_sampled + return examples + + texts = [] + entity_examples = [] + relation_examples = [] + entity_cls_examples = [] + entity_prompts = [] + relation_prompts = [] + entity_label_set = [] + entity_name_set = [] + predicate_set = [] + subject_goldens = [] + + logger.info("Converting doccano data...") + with tqdm(total=len(raw_examples)) as pbar: + for line in raw_examples: + items = json.loads(line) + entity_id = 0 + if "data" in items.keys(): + relation_mode = False + if isinstance(items["label"], dict) and "entities" in items["label"].keys(): + relation_mode = True + text = items["data"] + entities = [] + relations = [] + if not relation_mode: + # Export file in JSONL format which doccano < 1.7.0 + # e.g. {"data": "", "label": [ [0, 2, "ORG"], ... 
]} + for item in items["label"]: + entity = {"id": entity_id, "start_offset": item[0], "end_offset": item[1], "label": item[2]} + entities.append(entity) + entity_id += 1 + else: + # Export file in JSONL format for relation labeling task which doccano < 1.7.0 + # e.g. {"data": "", "label": {"relations": [ {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}, ... ], "entities": [ {"id": 0, "from_id": 0, "to_id": 1, "type": "foundedAt"}, ... ]}} + entities.extend([entity for entity in items["label"]["entities"]]) + if "relations" in items["label"].keys(): + relations.extend([relation for relation in items["label"]["relations"]]) + else: + # Export file in JSONL format which doccano >= 1.7.0 + # e.g. {"text": "", "label": [ [0, 2, "ORG"], ... ]} + if "label" in items.keys(): + text = items["text"] + entities = [] + for item in items["label"]: + entity = {"id": entity_id, "start_offset": item[0], "end_offset": item[1], "label": item[2]} + entities.append(entity) + entity_id += 1 + relations = [] + else: + # Export file in JSONL (relation) format + # e.g. {"text": "", "relations": [ {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}, ... ], "entities": [ {"id": 0, "from_id": 0, "to_id": 1, "type": "foundedAt"}, ... ]} + text, relations, entities = items["text"], items["relations"], items["entities"] + texts.append(text) + + entity_example = [] + entity_prompt = [] + entity_example_map = {} + entity_map = {} # id to entity name + for entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + entity_label, entity_cls_label = _sep_cls_label(entity["label"], separator) + + # Define the prompt prefix for entity-level classification + entity_cls_prompt_prefix = entity_name + "的" + prompt_prefix + if entity_cls_label is not None: + entity_cls_example = generate_cls_example( + text, entity_cls_label, entity_cls_prompt_prefix, options + ) + + entity_cls_examples.append(entity_cls_example) + + result = {"text": entity_name, "start": entity["start_offset"], "end": entity["end_offset"]} + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = { + "content": text, + "result_list": [result], + "prompt": entity_label, + } + else: + entity_example_map[entity_label]["result_list"].append(result) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + if entity_name not in entity_name_set: + entity_name_set.append(entity_name) + entity_prompt.append(entity_label) + + for v in entity_example_map.values(): + entity_example.append(v) + + entity_examples.append(entity_example) + entity_prompts.append(entity_prompt) + + subject_golden = [] # Golden entity inputs + relation_example = [] + relation_prompt = [] + relation_example_map = {} + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + # The relation prompt is constructed as follows: + # subject + "的" + predicate + prompt = entity_map[subject_id]["name"] + "的" + predicate + if entity_map[subject_id]["name"] not in subject_golden: + subject_golden.append(entity_map[subject_id]["name"]) + result = { + "text": entity_map[object_id]["name"], + "start": entity_map[object_id]["start"], + "end": entity_map[object_id]["end"], + } + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"content": text, "result_list": [result], "prompt": 
prompt} + else: + relation_example_map[prompt]["result_list"].append(result) + + if predicate not in predicate_set: + predicate_set.append(predicate) + relation_prompt.append(prompt) + + for v in relation_example_map.values(): + relation_example.append(v) + + relation_examples.append(relation_example) + relation_prompts.append(relation_prompt) + subject_goldens.append(subject_golden) + pbar.update(1) + + logger.info("Adding negative samples for first stage prompt...") + positive_examples, negative_examples = add_negative_example( + entity_examples, texts, entity_prompts, entity_label_set, negative_ratio + ) + if len(positive_examples) == 0: + all_entity_examples = [] + elif is_train: + all_entity_examples = _concat_examples(positive_examples, negative_examples, negative_ratio) + else: + all_entity_examples = positive_examples + negative_examples + + all_relation_examples = [] + if len(predicate_set) != 0: + if is_train: + logger.info("Adding negative samples for second stage prompt...") + relation_prompt_set = construct_relation_prompt_set(entity_name_set, predicate_set) + positive_examples, negative_examples = add_negative_example( + relation_examples, texts, relation_prompts, relation_prompt_set, negative_ratio + ) + all_relation_examples = _concat_examples(positive_examples, negative_examples, negative_ratio) + else: + logger.info("Adding negative samples for second stage prompt...") + relation_examples = add_full_negative_example( + relation_examples, texts, relation_prompts, predicate_set, subject_goldens + ) + all_relation_examples = [r for relation_example in relation_examples for r in relation_example] + return all_entity_examples, all_relation_examples, entity_cls_examples diff --git a/applications/question_answering/unsupervised_qa/finetune/question_generation/gen_utils.py b/applications/question_answering/unsupervised_qa/finetune/question_generation/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..10591bf8058fcb2d186363b2db0f00a6324a2a5f --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/question_generation/gen_utils.py @@ -0,0 +1,316 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler + +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
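+    # Offsetting by the process rank gives every worker a distinct op-level seed,
+    # while the data seeds above stay identical across workers.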
+ paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="train", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + title = None + if "source" in example and "title" in example: + source = example["source"] + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." + if "target" in example.keys(): + target = example["target"] + elif "question" in example.keys(): + target = example["question"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 4: + prompt_common = example["prompt_common"] + prompt_domain = example["prompt_domain"] + source = ( + prompt_common + + " " + + tokenizer.sep_token + + " " + + "".join( + [" " + tokenizer.cls_token + " " + one + " " + tokenizer.sep_token + " " for one in prompt_domain] + ) + + " " + + tokenizer.cls_token + + " " + + "答案:" + + title + + " " + + tokenizer.sep_token + + " " + + tokenizer.cls_token + + "上下文:" + + source + ) + + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example["input_ids"]) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + # If template==4, count must be equal to 7, otherwise count must be equal to 2 + assert count == 7 or count == 2, ( + str(count) + " is not in [2, 7], temp_tokens: " + " ".join(temp_tokens) + "source: " + source + ) + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + if template == 4: + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + target_start = index_list[-1] + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + if template == 4: + tokenized_example["token_type_ids"] + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = 
tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + ) + + if template == 4: + # temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example['input_ids']) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + assert count == 7, str(count) + " is not in [7]" + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + assert ("target" in example and example["target"]) or ("question" in example and example["question"]), example + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + elif "question" in example and example["question"]: + tokenized_example["target"] = example["question"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + max_title_len=args.max_title_len, + mode=mode, + template=args.template, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, 
pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def remove_template(instr): + """Remove template prefix of decoded sequence.""" + outstr = instr.strip("问题:") + outstr = outstr.strip("在已知答案的前提下,问题:") + return outstr + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + target = remove_template(target) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + response = remove_template(response) + + # TODO: Support return scores in FT. + tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/applications/question_answering/unsupervised_qa/finetune/question_generation/predict.py b/applications/question_answering/unsupervised_qa/finetune/question_generation/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..4ec590f45b08016a2642fbb37253aae48e33ee09 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/question_generation/predict.py @@ -0,0 +1,141 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
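+
+# Generates questions with a finetuned UNIMO-text model: examples are read from
+# --predict_file (or the dev split of --dataset_name) and, when --do_predict is set,
+# the decoded results are written to --output_path.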
+ +import argparse +import json +import time + +import paddle +import paddle.distributed as dist +from gen_utils import create_data_loader, print_args, select_sum, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = 
UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + prediction(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def prediction(model, data_loader, args, tokenizer): + print("\nPred begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/applications/question_answering/unsupervised_qa/finetune/question_generation/train.py b/applications/question_answering/unsupervised_qa/finetune/question_generation/train.py new file mode 100644 index 0000000000000000000000000000000000000000..73e2c1544328e42275af2d0886e44e66a61df7cf --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/question_generation/train.py @@ -0,0 +1,281 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
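+
+# Finetunes UNIMO-text for question generation with label-smoothed cross entropy
+# (--do_train); with --do_predict the dev set is decoded and scored with BLEU-4,
+# and during training the best-scoring checkpoint is also saved as model_best.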
+ +import argparse +import json +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from gen_utils import create_data_loader, print_args, select_sum, set_seed +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import ( + BasicTokenizer, + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--learning_rate', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_proportion', type=float, default=0.02, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The max value of grad norm.') + parser.add_argument('--beta1', type=float, default=0.9, help='beta1') + parser.add_argument('--beta2', type=float, default=0.98, help='beta2') + parser.add_argument('--epsilon', type=float, default=1e-6, help='epsilon') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, 
default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_train", action='store_true', help="Whether to train the model.") + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_n(preds, targets, n_size=4): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu = BLEU(n_size=n_size) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu.add_inst(pred_tokens, [target_token]) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-" + str(n_size) + ":", bleu.score()) + return bleu.score() + + +def calc_bleu(preds, targets): + calc_bleu_n(preds, targets, 1) + calc_bleu_n(preds, targets, 2) + calc_bleu_n(preds, targets, 3) + bleu4_score = calc_bleu_n(preds, targets, 4) + return bleu4_score + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.train_file: + train_ds = load_dataset(read_file, file=args.train_file, lazy=False) + else: + train_ds = load_dataset(args.dataset_name, splits="train", data_files=args.train_file) + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_train: + num_training_steps = args.epochs * len(train_data_loader) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
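+        # Parameters whose names contain "bias" or "norm" are filtered out of
+        # decay_params and therefore skipped by apply_decay_param_fun below.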
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + best_bleu4 = 0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step >= num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + bleu4 = evaluation(model_eval, dev_data_loader, args, tokenizer) + if bleu4 > best_bleu4: + print("best BLEU-4 performence has been updated: %.5f --> %.5f" % (best_bleu4, bleu4)) + best_bleu4 = bleu4 + save_ckpt(model, tokenizer, args.save_dir, "best") + + batch_start_time = time.time() + + print("\nTraining completed.") + elif args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + with open(args.output_path + ".reference.txt", "w", encoding="utf-8") as fout: + targets = [example["target"] for example in data_loader.dataset] + for target in targets: + 
fout.write(target + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + targets = [example["target"] for example in data_loader.dataset] + bleu4_score = calc_bleu(pred_ref, targets) + + model.train() + return bleu4_score + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/applications/question_answering/unsupervised_qa/run_corpus_preparation.py b/applications/question_answering/unsupervised_qa/run_corpus_preparation.py new file mode 100644 index 0000000000000000000000000000000000000000..56b1f912c020a14745af24376db0332ac50c4c66 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_corpus_preparation.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--source_file_path', type=str, default=None, help='the source json file path') + parser.add_argument('--target_dir_path', type=str, default=None, help='the target dir path') + parser.add_argument('--test_sample_num', type=int, default=0, help='the test sample number when preparing qa system data') + parser.add_argument('--train_sample_num', type=int, default=0, help='the test sample number when preparing qa system data') + parser.add_argument('--all_sample_num', type=int, default=None, help='the all sample number when preparing qa system data') + args = parser.parse_args() + return args +# yapf: enable + + +def convert_json_to_data(json_file, out_dir, test_sample_num, train_sample_num, all_sample_num=None): + with open(json_file, "r", encoding="utf-8") as rf, open( + os.path.join(out_dir, "qa_pair.csv"), "w", encoding="utf-8" + ) as qa_pair_wf, open(os.path.join(out_dir, "qac_triple.csv"), "w", encoding="utf-8") as qac_triple_wf, open( + os.path.join(out_dir, "train.csv"), "w", encoding="utf-8" + ) as train_wf, open( + os.path.join(out_dir, "q_corpus.csv"), "w", encoding="utf-8" + ) as q_corpus_wf, open( + os.path.join(out_dir, "dev.csv"), "w", encoding="utf-8" + ) as test_wf: + for i, json_line in enumerate(rf.readlines()): + line_dict = json.loads(json_line) + context = line_dict["context"] + if "answer" in line_dict and "question" in line_dict: + answer = line_dict["answer"] + question = line_dict["question"] + elif "synthetic_answer" in line_dict and "synthetic_question" in line_dict: + answer = line_dict["synthetic_answer"] + question = line_dict["synthetic_question"] + + if isinstance(question, list): + question = question[0] + else: + question = question + + if i < test_sample_num: + test_wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + elif test_sample_num <= i < test_sample_num + train_sample_num: + train_wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + + if not all_sample_num or i < all_sample_num: + qa_pair_wf.write( + 
question.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + answer.replace("\n", " ").replace("\t", " ").strip() + + "\n" + ) + qac_triple_wf.write( + question.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + answer.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + context + + "\n" + ) + q_corpus_wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + + +if __name__ == "__main__": + args = parse_args() + convert_json_to_data( + args.source_file_path, args.target_dir_path, args.test_sample_num, args.train_sample_num, args.all_sample_num + ) diff --git a/applications/question_answering/unsupervised_qa/run_data_preprocess.py b/applications/question_answering/unsupervised_qa/run_data_preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..8da9535318359593693a36807fa3899b7fa1fcc7 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_data_preprocess.py @@ -0,0 +1,161 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--source_file_path', type=str, default=None, help='the source json file path') + parser.add_argument('--target_dir', type=str, default='data', help='the target file path') + parser.add_argument('--do_answer_prompt', action="store_true", help="is use answer prompt") + parser.add_argument('--do_len_prompt', action="store_true", help="is use length prompt") + parser.add_argument('--do_domain_prompt', action="store_true", help="is use domain prompt") + parser.add_argument('--domain', type=str, default=None, help='the domain of the dataset when using domain prompt') + args = parser.parse_args() + return args +# yapf: enable + + +def convert_from_json_to_answer_extraction_format( + json_file, output_path, domain=None, do_answer_prompt=True, do_len_prompt=False, do_domain_prompt=False +): + with open(json_file, "r", encoding="utf-8") as rf, open(output_path, "w", encoding="utf-8") as wf: + for line in rf: + json_line = json.loads(line) + context = json_line["context"] + + answer = json_line["answer"] + # Cut the abnormally long sample + if len(answer) > 300: + answer = answer[:300] + + begin_id = context.find(answer) + assert begin_id != -1, "'" + answer + "' is not found in " + context + end_id = begin_id + len(answer) + result = {"text": answer, "start": begin_id, "end": end_id} + if do_answer_prompt: + outdict = { + "content": context, + "result_list": [result], + "prompt": "答案", + } + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + if do_len_prompt: + if len(answer) < 10: + len_prompat = "短答案" + elif len(answer) < 20: + len_prompat = "中短答案" + elif len(answer) < 30: + len_prompat = "中长答案" + else: + len_prompat = "长答案" + + len_outdict = { + "content": context, + "result_list": [result], + "prompt": len_prompat, + } + wf.write(json.dumps(len_outdict, ensure_ascii=False) + "\n") + if 
do_domain_prompt and domain: + domain_outdict = { + "content": context, + "result_list": [result], + "prompt": domain, + } + wf.write(json.dumps(domain_outdict, ensure_ascii=False) + "\n") + + +def convert_from_json_to_question_generation_format(json_file, output_path, tokenizer=None): + with open(json_file, "r", encoding="utf-8") as rf, open(output_path, "w", encoding="utf-8") as wf: + for line in rf: + json_line = json.loads(line) + context = json_line["context"] + + answer = json_line["answer"] + # Cut the abnormally long sample + if len(answer) > 300: + answer = answer[:300] + question = json_line["question"] + + outdict = { + "question": question, + "answer": answer, + "context": context, + } + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + + +def convert_from_json_to_filtration_format(json_file, output_path, tokenizer=None): + with open(json_file, "r", encoding="utf-8") as rf, open(output_path, "w", encoding="utf-8") as wf: + for line in rf: + json_line = json.loads(line) + context = json_line["context"] + + answer = json_line["answer"] + # Cut the abnormally long sample + if len(answer) > 300: + answer = answer[:300] + question = json_line["question"] + + prefix = "问题:" + question + "上下文:" + content = prefix + context + + begin_id = context.find(answer) + assert begin_id != -1, "'" + answer + "' is not found in " + context + end_id = begin_id + len(answer) + begin_id += len(prefix) + end_id += len(prefix) + + result = {"text": answer, "start": begin_id, "end": end_id} + outdict = { + "content": content, + "result_list": [result], + "prompt": "答案", + } + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + args = parse_args() + answer_extraction_target_file_path = os.path.join( + args.target_dir, "answer_extraction", os.path.basename(args.source_file_path) + ) + if not os.path.exists(os.path.dirname(answer_extraction_target_file_path)): + os.makedirs(os.path.dirname(answer_extraction_target_file_path)) + convert_from_json_to_answer_extraction_format( + json_file=args.source_file_path, + output_path=answer_extraction_target_file_path, + domain=args.domain, + do_answer_prompt=args.do_answer_prompt, + do_len_prompt=args.do_len_prompt, + do_domain_prompt=args.do_domain_prompt, + ) + + question_generation_target_file_path = os.path.join( + args.target_dir, "question_generation", os.path.basename(args.source_file_path) + ) + if not os.path.exists(os.path.dirname(question_generation_target_file_path)): + os.makedirs(os.path.dirname(question_generation_target_file_path)) + convert_from_json_to_question_generation_format( + json_file=args.source_file_path, output_path=question_generation_target_file_path + ) + + filtration_target_file_path = os.path.join(args.target_dir, "filtration", os.path.basename(args.source_file_path)) + if not os.path.exists(os.path.dirname(filtration_target_file_path)): + os.makedirs(os.path.dirname(filtration_target_file_path)) + convert_from_json_to_filtration_format(json_file=args.source_file_path, output_path=filtration_target_file_path) diff --git a/applications/question_answering/unsupervised_qa/run_pipelines_example.py b/applications/question_answering/unsupervised_qa/run_pipelines_example.py new file mode 100644 index 0000000000000000000000000000000000000000..c0b714f799af0e5d48c9a192b737a9ae86ed5ebf --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_pipelines_example.py @@ -0,0 +1,155 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +from pipelines.document_stores import FAISSDocumentStore +from pipelines.nodes import ( + AnswerExtractor, + DensePassageRetriever, + ErnieRanker, + QAFilter, + QuestionGenerator, +) +from pipelines.pipelines import QAGenerationPipeline, SemanticSearchPipeline +from pipelines.utils import convert_files_to_dicts, print_documents + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to run dense_qa system, defaults to gpu.") +parser.add_argument("--index_name", default='faiss_index', type=str, help="The ann index name of FAISS.") +parser.add_argument("--max_seq_len_query", default=64, type=int, help="The maximum total length of query after tokenization.") +parser.add_argument("--max_seq_len_passage", default=256, type=int, help="The maximum total length of passage after tokenization.") +parser.add_argument("--retriever_batch_size", default=16, type=int, help="The batch size of retriever to extract passage embedding for building ANN index.") +parser.add_argument("--doc_dir", default="data/my_data", type=str, help="The question-answer pairs file to be loaded when building ANN index.") +parser.add_argument("--source_file", default=None, type=str, help="The source raw texts file to be loaded when creating question-answer pairs.") + +args = parser.parse_args() +# yapf: enable + + +def dense_faq_pipeline(): + use_gpu = True if args.device == "gpu" else False + faiss_document_store = "faiss_document_store.db" + if os.path.exists(args.index_name) and os.path.exists(faiss_document_store): + # connect to existed FAISS Index + document_store = FAISSDocumentStore.load(args.index_name) + retriever = DensePassageRetriever( + document_store=document_store, + query_embedding_model="rocketqa-zh-dureader-query-encoder", + passage_embedding_model="rocketqa-zh-dureader-query-encoder", + max_seq_len_query=args.max_seq_len_query, + max_seq_len_passage=args.max_seq_len_passage, + batch_size=args.retriever_batch_size, + use_gpu=use_gpu, + embed_title=False, + ) + else: + dicts = convert_files_to_dicts( + dir_path=args.doc_dir, split_paragraphs=True, split_answers=True, encoding="utf-8" + ) + + if os.path.exists(args.index_name): + os.remove(args.index_name) + if os.path.exists(faiss_document_store): + os.remove(faiss_document_store) + + document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat") + document_store.write_documents(dicts) + + retriever = DensePassageRetriever( + document_store=document_store, + query_embedding_model="rocketqa-zh-dureader-query-encoder", + passage_embedding_model="rocketqa-zh-dureader-query-encoder", + max_seq_len_query=args.max_seq_len_query, + max_seq_len_passage=args.max_seq_len_passage, + batch_size=args.retriever_batch_size, + use_gpu=use_gpu, + embed_title=False, + ) + + # update Embedding + document_store.update_embeddings(retriever) + + # save index + 
document_store.save(args.index_name) + + # Ranker + ranker = ErnieRanker(model_name_or_path="rocketqa-zh-dureader-cross-encoder", use_gpu=use_gpu) + + pipe = SemanticSearchPipeline(retriever, ranker) + + pipeline_params = {"Retriever": {"top_k": 50}, "Ranker": {"top_k": 1}} + prediction = pipe.run(query="世界上最早的地雷发明者是谁?", params=pipeline_params) + + print_documents(prediction, print_name=False, print_meta=True) + + +def qa_generation_pipeline(): + answer_extractor = AnswerExtractor( + model="uie-base-answer-extractor", + device=args.device, + schema=["答案"], + max_answer_candidates=3, + position_prob=0.01, + batch_size=1, + ) + + question_generator = QuestionGenerator( + model="unimo-text-1.0-question-generation", + device=args.device, + num_return_sequences=2, + ) + + qa_filter = QAFilter( + model="uie-base-qa-filter", + device=args.device, + schema=["答案"], + position_prob=0.1, + ) + + pipe = QAGenerationPipeline( + answer_extractor=answer_extractor, question_generator=question_generator, qa_filter=qa_filter + ) + pipeline_params = {"QAFilter": {"is_filter": True}} + + # list example + meta = [ + "世界上最早的电影院是美国洛杉矶的“电气剧场”,建于1902年。", + "以脸书为例,2020年时,54%的成年人表示,他们从该平台获取新闻。而现在,这个数字下降到了44%。与此同时,YouTube在过去几年里一直保持平稳,约有三分之一的用户在该平台上获取新闻。", + ] + prediction = pipe.run(meta=meta, params=pipeline_params) + prediction = prediction["filtered_cqa_triples"] + pprint(prediction) + + # file example + if args.source_file: + meta = [] + with open(args.source_file, "r", encoding="utf-8") as rf: + for line in rf: + meta.append(line.strip()) + prediction = pipe.run(meta=meta, params=pipeline_params) + prediction = prediction["filtered_cqa_triples"] + if not os.path.exists(args.doc_dir): + os.makedirs(args.doc_dir) + with open(os.path.join(args.doc_dir, "generated_qa_pairs.txt"), "w", encoding="utf-8") as wf: + for pair in prediction: + wf.write(pair["synthetic_question"].strip() + "\t" + pair["synthetic_answer"].strip() + "\n") + + +if __name__ == "__main__": + qa_generation_pipeline() + dense_faq_pipeline() diff --git a/applications/question_answering/unsupervised_qa/run_qa_pairs_generation.py b/applications/question_answering/unsupervised_qa/run_qa_pairs_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..8a8e57b94c01878456f652ba0c4cdbec4019c60b --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_qa_pairs_generation.py @@ -0,0 +1,334 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json +import os + +from tqdm import tqdm + +from paddlenlp import Taskflow + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--answer_generation_model_path', type=str, default=None, help='the model path to be loaded for answer extraction') + parser.add_argument('--question_generation_model_path', type=str, default=None, help='the model path to be loaded for question generation') + parser.add_argument('--filtration_model_path', type=str, default=None, help='the model path to be loaded for filtration') + parser.add_argument('--source_file_path', type=str, default=None, help='the source file path') + parser.add_argument('--target_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--batch_size', type=int, default=1, help='the batch size when using taskflow') + parser.add_argument("--do_debug", action='store_true', help="Whether to do debug") + parser.add_argument('--a_prompt', type=str, default='答案', help='the prompt when using taskflow, separate by ,') + parser.add_argument('--a_position_prob', type=float, default=0.01, help='confidence threshold for answer extraction') + parser.add_argument('--a_max_answer_candidates', type=int, default=5, help='the max number of return answer candidate for each input') + parser.add_argument('--q_num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--q_max_question_length', type=int, default=50, help='the max decoding length') + parser.add_argument('--q_decode_strategy', type=str, default='sampling', help='the decode strategy') + parser.add_argument('--q_num_beams', type=int, default=6, help='the number of beams when using beam search') + parser.add_argument('--q_num_beam_groups', type=int, default=1, help='the number of beam groups when using diverse beam search') + parser.add_argument('--q_diversity_rate', type=float, default=0.0, help='the diversity_rate when using diverse beam search') + parser.add_argument('--q_top_k', type=float, default=5, help='the top_k when using sampling decoding strategy') + parser.add_argument('--q_top_p', type=float, default=1.0, help='the top_p when using sampling decoding strategy') + parser.add_argument('--q_temperature', type=float, default=1.0, help='the temperature when using sampling decoding strategy') + parser.add_argument("--do_filtration", action='store_true', help="Whether to do filtration") + parser.add_argument('--f_filtration_position_prob', type=float, default=0.1, help='confidence threshold for filtration') + args = parser.parse_args() + return args +# yapf: enable + + +def answer_generation_from_paragraphs( + paragraphs, batch_size=16, model=None, max_answer_candidates=5, schema=None, wf=None +): + """Generate answer from given paragraphs.""" + result = [] + buffer = [] + i = 0 + len_paragraphs = len(paragraphs) + for paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size or (i + 1) == len_paragraphs: + predicts = model(buffer) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + answers = [] + probabilitys = [] + for prompt in schema: + if prompt in predict_dict: + answer_dicts = predict_dict[prompt] + answers += [answer_dict["text"] for answer_dict in answer_dicts] + probabilitys += [answer_dict["probability"] for answer_dict in answer_dicts] + else: + answers += [] + probabilitys 
+= [] + candidates = sorted(list(set([(a, p) for a, p in zip(answers, probabilitys)])), key=lambda x: -x[1]) + if len(candidates) > max_answer_candidates: + candidates = candidates[:max_answer_candidates] + outdict = { + "context": paragraph, + "answer_candidates": candidates, + } + if wf: + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + result.append(outdict) + i += 1 + return result + + +def create_fake_question( + json_file_or_pair_list, out_json=None, num_return_sequences=1, all_sample_num=None, batch_size=8 +): + if out_json: + wf = open(out_json, "w", encoding="utf-8") + if isinstance(json_file_or_pair_list, list): + all_lines = json_file_or_pair_list + else: + rf = open(json_file_or_pair_list, "r", encoding="utf-8") + all_lines = [] + for json_line in rf: + line_dict = json.loads(json_line) + all_lines.append(line_dict) + rf.close() + num_all_lines = len(all_lines) + output = [] + context_buffer = [] + answer_buffer = [] + answer_probability_buffer = [] + true_question_buffer = [] + i = 0 + for index, line_dict in enumerate(tqdm(all_lines)): + if "question" in line_dict: + q = line_dict["question"] + else: + q = "" + c = line_dict["context"] + assert "answer_candidates" in line_dict + answers = line_dict["answer_candidates"] + if not answers: + continue + for j, pair in enumerate(answers): + a, p = pair + context_buffer += [c] + answer_buffer += [a] + answer_probability_buffer += [p] + true_question_buffer += [q] + if ( + (i + 1) % batch_size == 0 + or (all_sample_num and (i + 1) == all_sample_num) + or ((index + 1) == num_all_lines and j == len(answers) - 1) + ): + result_buffer = question_generation( + [{"context": context, "answer": answer} for context, answer in zip(context_buffer, answer_buffer)] + ) + context_buffer_temp, answer_buffer_temp, answer_probability_buffer_temp, true_question_buffer_temp = ( + [], + [], + [], + [], + ) + for context, answer, answer_probability, true_question in zip( + context_buffer, answer_buffer, answer_probability_buffer, true_question_buffer + ): + context_buffer_temp += [context] * num_return_sequences + answer_buffer_temp += [answer] * num_return_sequences + answer_probability_buffer_temp += [answer_probability] * num_return_sequences + true_question_buffer_temp += [true_question] * num_return_sequences + result_one_two_buffer = [(one, two) for one, two in zip(result_buffer[0], result_buffer[1])] + for context, answer, answer_probability, true_question, result in zip( + context_buffer_temp, + answer_buffer_temp, + answer_probability_buffer_temp, + true_question_buffer_temp, + result_one_two_buffer, + ): + fake_questions_tokens = [result[0]] + fake_questions_scores = [result[1]] + for fake_questions_token, fake_questions_score in zip( + fake_questions_tokens, fake_questions_scores + ): + out_dict = { + "context": context, + "synthetic_answer": answer, + "synthetic_answer_probability": answer_probability, + "synthetic_question": fake_questions_token, + "synthetic_question_probability": fake_questions_score, + "true_question": true_question, + } + if out_json: + wf.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + output.append(out_dict) + context_buffer = [] + answer_buffer = [] + true_question_buffer = [] + if all_sample_num and (i + 1) >= all_sample_num: + break + i += 1 + if out_json: + wf.close() + return output + + +def filtration(paragraphs, batch_size=16, model=None, schema=None, wf=None, wf_debug=None): + result = [] + buffer = [] + valid_num, invalid_num = 0, 0 + i = 0 + len_paragraphs = len(paragraphs) + for 
paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size or (i + 1) == len_paragraphs: + model_inputs = [] + for d in buffer: + context = d["context"] + synthetic_question = d["synthetic_question"] + prefix = "问题:" + synthetic_question + "上下文:" + content = prefix + context + model_inputs.append(content) + predicts = model(model_inputs) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + context = paragraph["context"] + synthetic_question = paragraph["synthetic_question"] + synthetic_question_probability = paragraph["synthetic_question_probability"] + synthetic_answer = paragraph["synthetic_answer"] + synthetic_answer_probability = paragraph["synthetic_answer_probability"] + + answers = [] + probabilitys = [] + for prompt in schema: + if prompt in predict_dict: + answer_dicts = predict_dict[prompt] + answers += [answer_dict["text"] for answer_dict in answer_dicts] + probabilitys += [answer_dict["probability"] for answer_dict in answer_dicts] + else: + answers += [] + probabilitys += [] + candidates = [ + an for an, pro in sorted([(a, p) for a, p in zip(answers, probabilitys)], key=lambda x: -x[1]) + ] + out_dict = { + "context": context, + "synthetic_answer": synthetic_answer, + "synthetic_answer_probability": synthetic_answer_probability, + "synthetic_question": synthetic_question, + "synthetic_question_probability": synthetic_question_probability, + } + if synthetic_answer in candidates: + if wf: + wf.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + result.append(out_dict) + valid_num += 1 + else: + if wf_debug: + wf_debug.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + invalid_num += 1 + i += 1 + print("valid synthetic question-answer pairs number:", valid_num) + print("invalid synthetic question-answer pairs number:", invalid_num) + return result + + +if __name__ == "__main__": + args = parse_args() + assert args.a_prompt + schema = args.a_prompt.strip().split(",") + answer_generator = Taskflow( + "information_extraction", + schema=schema, + task_path=args.answer_generation_model_path, + batch_size=args.batch_size, + position_prob=args.a_position_prob, + ) + assert args.source_file_path + paragraphs = [] + if args.source_file_path.endswith(".json"): + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for json_line in rf: + line_dict = json.loads(json_line) + assert "context" in line_dict or "content" in line_dict + if "context" in line_dict: + paragraphs.append(line_dict["context"].strip()) + elif "content" in line_dict: + paragraphs.append(line_dict["content"].strip()) + else: + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for line in rf: + paragraphs.append(line.strip()) + + synthetic_context_answer_pairs = answer_generation_from_paragraphs( + paragraphs, + batch_size=args.batch_size, + model=answer_generator, + max_answer_candidates=args.a_max_answer_candidates, + schema=schema, + wf=None, + ) + print("create synthetic answers successfully!") + + question_generation = Taskflow( + "question_generation", + task_path=args.question_generation_model_path, + output_scores=True, + max_length=args.q_max_question_length, + is_select_from_num_return_sequences=False, + num_return_sequences=args.q_num_return_sequences, + batch_size=args.batch_size, + decode_strategy=args.q_decode_strategy, + num_beams=args.q_num_beams, + num_beam_groups=args.q_num_beam_groups, + diversity_rate=args.q_diversity_rate, + top_k=args.q_top_k, + top_p=args.q_top_p, + 
temperature=args.q_temperature, + ) + synthetic_answer_question_pairs = create_fake_question( + synthetic_context_answer_pairs, + None if args.do_filtration else args.target_file_path, + args.q_num_return_sequences, + None, + args.batch_size, + ) + print("create synthetic question-answer pairs successfully!") + + wf = None + wf_debug = None + if args.target_file_path: + if not os.path.exists(os.path.dirname(args.target_file_path)): + os.makedirs(os.path.dirname(args.target_file_path)) + wf = open(args.target_file_path, "w", encoding="utf-8") + if args.do_debug: + wf_debug = open(args.target_file_path + ".debug.json", "w", encoding="utf-8") + if args.do_filtration: + filtration_model = Taskflow( + "information_extraction", + schema=["答案"], + task_path=args.filtration_model_path, + batch_size=args.batch_size, + position_prob=args.f_filtration_position_prob, + ) + filtration( + synthetic_answer_question_pairs, + batch_size=16, + model=filtration_model, + schema=["答案"], + wf=wf, + wf_debug=wf_debug, + ) + print("filter synthetic question-answer pairs successfully!") + rf.close() + wf.close() diff --git a/applications/question_answering/unsupervised_qa/tools/create_synthetic_answer.py b/applications/question_answering/unsupervised_qa/tools/create_synthetic_answer.py new file mode 100644 index 0000000000000000000000000000000000000000..d5408bb48d2b3035cbd741fb1c1b4b41b0863234 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/create_synthetic_answer.py @@ -0,0 +1,105 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +from tqdm import tqdm + +from paddlenlp import Taskflow + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_path', type=str, default=None, help='the model path to be loaded for question_generation taskflow') + parser.add_argument('--source_file_path', type=str, default=None, help='the source file path') + parser.add_argument('--target_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--all_sample_num', type=int, default=None, help='the test sample number when convert_json_to_data') + parser.add_argument('--num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--batch_size', type=int, default=1, help='the batch size when using taskflow') + parser.add_argument('--position_prob', type=float, default=0.01, help='the batch size when using taskflow') + parser.add_argument('--decode_strategy', type=str, default=None, help='the decode strategy') + parser.add_argument('--num_beams', type=int, default=6, help='the number of beams when using beam search') + parser.add_argument('--num_beam_groups', type=int, default=1, help='the number of beam groups when using diverse beam search') + parser.add_argument('--diversity_rate', type=float, default=0.0, help='the diversity_rate when using diverse beam search') + parser.add_argument('--top_k', type=float, default=0, help='the top_k when using sampling decoding strategy') + parser.add_argument('--top_p', type=float, default=1.0, help='the top_p when using sampling decoding strategy') + parser.add_argument('--temperature', type=float, default=1.0, help='the temperature when using sampling decoding strategy') + args = parser.parse_args() + return args +# yapf: enable + + +def answer_generation_from_paragraphs(paragraphs, batch_size=16, model=None, wf=None): + """Generate answer from given paragraphs.""" + result = [] + buffer = [] + for paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size: + predicts = model(buffer) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + if "答案" in predict_dict: + answer_dicts = predict_dict["答案"] + answers = [answer_dict["text"] for answer_dict in answer_dicts] + probabilitys = [answer_dict["probability"] for answer_dict in answer_dicts] + else: + answers = [] + probabilitys = [] + + outdict = { + "context": paragraph, + "answer_candidates": sorted([(a, p) for a, p in zip(answers, probabilitys)], key=lambda x: -x[1]), + } + if wf: + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + result.append(outdict) + return result + + +if __name__ == "__main__": + args = parse_args() + schema = ["答案"] + answer_generator = Taskflow( + "information_extraction", + schema=schema, + task_path=args.model_path, + batch_size=args.batch_size, + position_prob=args.position_prob, + ) + assert args.source_file_path + paragraphs = [] + if args.source_file_path.endswith(".json"): + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for json_line in rf: + line_dict = json.loads(json_line) + assert "context" in line_dict or "content" in line_dict + if "context" in line_dict: + paragraphs.append(line_dict["context"].strip()) + elif "content" in line_dict: + paragraphs.append(line_dict["content"].strip()) + else: + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for line in 
rf: + paragraphs.append(line.strip()) + wf = None + if args.target_file_path: + wf = open(args.target_file_path, "w", encoding="utf-8") + + answer_generation_from_paragraphs(paragraphs, batch_size=args.batch_size, model=answer_generator, wf=wf) + rf.close() + wf.close() diff --git a/applications/question_answering/unsupervised_qa/tools/create_synthetic_question.py b/applications/question_answering/unsupervised_qa/tools/create_synthetic_question.py new file mode 100644 index 0000000000000000000000000000000000000000..9d8e1a1b7ed598d65eed21b5c2cf8bb7b358abfc --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/create_synthetic_question.py @@ -0,0 +1,119 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +from tqdm import tqdm + +from paddlenlp import Taskflow + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_path', type=str, default=None, help='the model path to be loaded for question_generation taskflow') + parser.add_argument('--max_length', type=int, default=50, help='the max decoding length') + parser.add_argument('--num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--source_file_path', type=str, default=None, help='the souce json file path') + parser.add_argument('--target_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--all_sample_num', type=int, default=None, help='the test sample number when convert_json_to_data') + parser.add_argument('--batch_size', type=int, default=1, help='the batch size when using taskflow') + parser.add_argument('--decode_strategy', type=str, default=None, help='the decode strategy') + parser.add_argument('--num_beams', type=int, default=6, help='the number of beams when using beam search') + parser.add_argument('--num_beam_groups', type=int, default=1, help='the number of beam groups when using diverse beam search') + parser.add_argument('--diversity_rate', type=float, default=0.0, help='the diversity_rate when using diverse beam search') + parser.add_argument('--top_k', type=float, default=0, help='the top_k when using sampling decoding strategy') + parser.add_argument('--top_p', type=float, default=1.0, help='the top_p when using sampling decoding strategy') + parser.add_argument('--temperature', type=float, default=1.0, help='the temperature when using sampling decoding strategy') + args = parser.parse_args() + return args +# yapf: enable + + +def create_fake_question(json_file, out_json, num_return_sequences, all_sample_num=None, batch_size=8): + with open(json_file, "r", encoding="utf-8") as rf, open(out_json, "w", encoding="utf-8") as wf: + all_lines = rf.readlines() + num_all_lines = len(all_lines) + context_buffer = [] + answer_buffer = [] + true_question_buffer = [] + for i, json_line in enumerate(tqdm(all_lines)): + 
line_dict = json.loads(json_line) + q = line_dict["question"] + a = line_dict["answer"] + c = line_dict["context"] + + context_buffer += [c] + answer_buffer += [a] + true_question_buffer += [q] + if ( + (i + 1) % batch_size == 0 + or (all_sample_num and (i + 1) == all_sample_num or (i + 1)) + or (i + 1) == num_all_lines + ): + result_buffer = question_generation( + [{"context": context, "answer": answer} for context, answer in zip(context_buffer, answer_buffer)] + ) + context_buffer_temp, answer_buffer_temp, true_question_buffer_temp = [], [], [] + for context, answer, true_question in zip(context_buffer, answer_buffer, true_question_buffer): + context_buffer_temp += [context] * num_return_sequences + answer_buffer_temp += [answer] * num_return_sequences + true_question_buffer_temp += [true_question] * num_return_sequences + result_one_two_buffer = [(one, two) for one, two in zip(result_buffer[0], result_buffer[1])] + for context, answer, true_question, result in zip( + context_buffer_temp, answer_buffer_temp, true_question_buffer_temp, result_one_two_buffer + ): + fake_quesitons_tokens = [result[0]] + fake_quesitons_scores = [result[1]] + for fake_quesitons_token, fake_quesitons_score in zip( + fake_quesitons_tokens, fake_quesitons_scores + ): + out_dict = { + "context": context, + "answer": answer, + "question": fake_quesitons_token, + "true_question": true_question, + "score": fake_quesitons_score, + } + wf.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + context_buffer = [] + answer_buffer = [] + true_question_buffer = [] + + if all_sample_num and (i + 1) >= all_sample_num: + break + + +if __name__ == "__main__": + args = parse_args() + question_generation = Taskflow( + "question_generation", + task_path=args.model_path, + output_scores=True, + max_length=args.max_length, + is_select_from_num_return_sequences=False, + num_return_sequences=args.num_return_sequences, + batch_size=args.batch_size, + decode_strategy=args.decode_strategy, + num_beams=args.num_beams, + num_beam_groups=args.num_beam_groups, + diversity_rate=args.diversity_rate, + top_k=args.top_k, + top_p=args.top_p, + temperature=args.temperature, + ) + create_fake_question( + args.source_file_path, args.target_file_path, args.num_return_sequences, args.all_sample_num, args.batch_size + ) diff --git a/applications/question_answering/unsupervised_qa/tools/dev_qq_pair_creation.py b/applications/question_answering/unsupervised_qa/tools/dev_qq_pair_creation.py new file mode 100644 index 0000000000000000000000000000000000000000..9de127a63f7b979f0438f15f984ec81948d018cc --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/dev_qq_pair_creation.py @@ -0,0 +1,118 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json +import os + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--do_create_test_qq_pair", action='store_true', help="Whether to do create_test_qq_pair") + parser.add_argument('--qq_pair_source_ori_file_path', type=str, default=None, help='the original source file path for qq-pair creating') + parser.add_argument('--qq_pair_source_trans_file_path', type=str, default=None, help='the translated source file path for qq-pair creating') + parser.add_argument('--qq_pair_target_file_path', type=str, default=None, help='the target file path for qq-pair creating') + parser.add_argument('--trans_query_answer_path', type=str, default=None, help='the target query-answer file path for extract_trans_from_fake_question') + parser.add_argument('--dev_sample_num', type=int, default=None, help='the test sample number when convert_json_to_data, if None, treat all lines as dev samples') + args = parser.parse_args() + return args +# yapf: enable + + +def extract_q_from_json_file(json_file, out_file=None, test_sample_num=None, query_answer_path=None): + with open(json_file, "r", encoding="utf-8") as rf: + if out_file: + wf = open(os.path.join(out_file), "w", encoding="utf-8") + if query_answer_path: + qeury_answer_wf = open(query_answer_path, "w", encoding="utf-8") + q_list = [] + for i, json_line in enumerate(rf.readlines()): + line_dict = json.loads(json_line) + if isinstance(line_dict["question"], list): + question = line_dict["question"][0] + else: + question = line_dict["question"] + answer = line_dict["answer"] + if not test_sample_num or i < test_sample_num: + if query_answer_path: + qeury_answer_wf.write( + question.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + answer.replace("\n", " ").replace("\t", " ").strip() + + "\n" + ) + if out_file: + wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + q_list.append(question.strip()) + else: + break + if query_answer_path: + qeury_answer_wf.close() + if out_file: + wf.colse() + return q_list + + +def create_test_qq_pair( + ori_path=None, trans_path=None, write_path=None, trans_query_answer_path=None, test_sample_num=None +): + assert trans_path + trans_rf = open(trans_path, "r", encoding="utf-8") + wf = open(write_path, "w", encoding="utf-8") + if trans_path.endswith(".json"): + trans_q_list = extract_q_from_json_file(trans_path, None, test_sample_num, trans_query_answer_path) + else: + trans_q_list = [ + line.strip() for i, line in enumerate(trans_rf.readlines()) if not test_sample_num or i < test_sample_num + ] + + if not ori_path or ori_path in ["NONE", "None", "none"]: + origin_q_list = ["-" for _ in range(len(trans_q_list))] + else: + origin_rf = open(ori_path, "r", encoding="utf-8") + if ori_path.endswith(".json"): + origin_q_list = extract_q_from_json_file(ori_path, None, test_sample_num) + else: + origin_q_list = [ + line.strip() + for i, line in enumerate(origin_rf.readlines()) + if not test_sample_num or i < test_sample_num + ] + + for origin, trans in zip(origin_q_list, trans_q_list): + wf.write( + trans.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + origin.replace("\n", " ").replace("\t", " ").strip() + + "\n" + ) + if not ori_path or ori_path in ["NONE", "None", "none"]: + pass + else: + origin_rf.close() + trans_rf.close() + wf.close() + + +if __name__ == "__main__": + args = parse_args() + if args.do_create_test_qq_pair: + create_test_qq_pair( + ori_path=args.qq_pair_source_ori_file_path, + 
trans_path=args.qq_pair_source_trans_file_path, + write_path=args.qq_pair_target_file_path, + trans_query_answer_path=args.trans_query_answer_path, + test_sample_num=args.dev_sample_num, + ) diff --git a/applications/question_answering/unsupervised_qa/tools/json_format_indent.py b/applications/question_answering/unsupervised_qa/tools/json_format_indent.py new file mode 100644 index 0000000000000000000000000000000000000000..84731b2130130fd7f960a546e6d5b6888f711849 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/json_format_indent.py @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + + +def json_format_indent(json_file, output_json): + with open(output_json, "w", encoding="utf-8") as wf: + with open(json_file, "r", encoding="utf-8") as rf: + all_lines = [] + for json_line in rf: + line_dict = json.loads(json_line) + all_lines.append(line_dict) + output_dataset = {"data": all_lines} + json.dump(output_dataset, wf, ensure_ascii=False, indent="\t") + + +if __name__ == "__main__": + json_format_indent("", "") diff --git a/applications/question_answering/unsupervised_qa/tools/question_coverage.py b/applications/question_answering/unsupervised_qa/tools/question_coverage.py new file mode 100644 index 0000000000000000000000000000000000000000..42c74d1eb759ea148ecf251c78830031cf6b84dc --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/question_coverage.py @@ -0,0 +1,233 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json +import multiprocessing +import os +import time + +from tqdm import tqdm +from tqdm.contrib import tzip + +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import BasicTokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--true_file_path', type=str, default=None, help='the source json file path') + parser.add_argument('--generate_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--all_sample_num', type=int, default=None, help='the number of valid sample') + parser.add_argument('--bleu_n_size', type=int, default=4, help='the bleu n size') + parser.add_argument('--bleu_threshold', type=float, default=0.3, help='the bleu threshold') + parser.add_argument("--do_log_file", action="store_true", help="is log analysis file") + parser.add_argument('--log_dir', type=str, default=None, help='the log dir') + parser.add_argument("--do_multiprocessing", action="store_true", help="is do multiprocessing") + parser.add_argument("--do_map_async", action="store_true", help="is use map_async or apply_async when do multiprocessing") + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_n(preds, targets, n_size=4): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu = BLEU(n_size=n_size) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu.add_inst(pred_tokens, [target_token]) + return bleu.score() + + +def worker_apply_async(true_question, generate_question_group, bleu_n_size, bleu_threshold, i): + first_positive_pair = None + for generate_question in generate_question_group: + bleu_score = calc_bleu_n([generate_question], [true_question], bleu_n_size) + if bleu_score > bleu_threshold: + first_positive_pair = (generate_question, true_question, i) + if first_positive_pair: + return (True, first_positive_pair) + else: + return (False, (generate_question_group[0], true_question)) + + +def worker_map_async(args): + true_question, generate_question_group, bleu_n_size, bleu_threshold, i = args + first_positive_pair = None + for generate_question in generate_question_group: + bleu_score = calc_bleu_n([generate_question], [true_question], bleu_n_size) + if bleu_score > bleu_threshold: + first_positive_pair = (generate_question, true_question, i) + if first_positive_pair: + return (True, first_positive_pair) + else: + return (False, (generate_question_group[0], true_question)) + + +def coverage_rate( + true_file_path, + generate_file_path, + bleu_n_size, + bleu_threshold, + num_return_sequences, + all_sample_num=None, + is_log_file=False, + log_dir=None, + is_multiprocessing=True, + is_map_async=True, +): + true_questions = [] + with open(true_file_path, "r", encoding="utf-8") as rf: + for i, json_line in enumerate(tqdm(rf.readlines())): + if i >= all_sample_num: + break + line_dict = json.loads(json_line) + true_questions.append( + line_dict["question"][0] if isinstance(line_dict["question"], list) else line_dict["question"] + ) + + generate_question_groups = [] + with open(generate_file_path, "r", encoding="utf-8") as rf: + 
group = [] + for i, json_line in enumerate(tqdm(rf.readlines())): + if i >= all_sample_num * num_return_sequences: + break + line_dict = json.loads(json_line) + group.append( + line_dict["question"][0] if isinstance(line_dict["question"], list) else line_dict["question"] + ) + if (i + 1) % num_return_sequences == 0: + generate_question_groups.append(group) + group = [] + print("true_questions", len(true_questions)) + print("generate_question_groups", len(generate_question_groups)) + positive = [] + negative = [] + if is_multiprocessing: + pool = multiprocessing.Pool(processes=30) + pool_results = [] + if is_map_async: + map_async_inputs = [] + i = 0 + bleu_cal_time_start = time.time() + generate_question_groups = [ + [ + generate_question if generate_question.strip() != "" else "none" + for generate_question in generate_question_group + ] + for generate_question_group in generate_question_groups + ] + for true_question, generate_question_group in tzip(true_questions, generate_question_groups): + if is_multiprocessing: + if is_map_async: + map_async_inputs.append((true_question, generate_question_group, bleu_n_size, bleu_threshold, i)) + else: + pool_results.append( + pool.apply_async( + worker_apply_async, + args=(true_question, generate_question_group, bleu_n_size, bleu_threshold, i), + ) + ) + + else: + first_positive_pair = None + best_pair, best_score = None, 0 + for generate_question in generate_question_group: + try: + bleu_score = calc_bleu_n([generate_question], [true_question], bleu_n_size) + except BaseException: + print("generate_question", generate_question) + print("true_question", true_question) + if bleu_score > best_score: + best_pair = (generate_question, true_question) + if bleu_score > bleu_threshold: + first_positive_pair = (generate_question, true_question) + if first_positive_pair: + positive.append((best_pair[0], best_pair[1], best_score)) + else: + negative.append((best_pair[0], best_pair[1], best_score)) + i += 1 + if is_multiprocessing: + if is_map_async: + pool_results = pool.map_async(worker_map_async, map_async_inputs) + pool.close() + pool.join() + for result in pool_results.get(): + is_positive, pair = result + if is_positive: + positive.append(pair) + else: + negative.append(pair) + else: + pool.close() + pool.join() + for result in pool_results: + is_positive, pair = result.get() + if is_positive: + positive.append(pair) + else: + negative.append(pair) + + bleu_cal_time_end = time.time() + print("bleu_cal_time_spend:", bleu_cal_time_end - bleu_cal_time_start) + if is_log_file and log_dir: + with open(os.path.join(log_dir, "positive_pair.txt"), "w", encoding="utf-8") as wf: + for pair in positive: + wf.write( + pair[0] + "\t" + pair[1] + "\n" + if len(pair) == 2 + else pair[0] + "\t" + pair[1] + str(pair[2]) + "\n" + ) + with open(os.path.join(log_dir, "negative_pair.txt"), "w", encoding="utf-8") as wf: + for pair in negative: + wf.write( + pair[0] + "\t" + pair[1] + "\n" + if len(pair) == 2 + else pair[0] + "\t" + pair[1] + str(pair[2]) + "\n" + ) + assert len(positive) + len(negative) == all_sample_num, ( + "the number of positive pairs " + + str(len(positive)) + + " plus the number of negative pairs " + + str(len(negative)) + + " should be equal to all_sample_num" + + str(all_sample_num) + ) + return len(positive) / (len(positive) + len(negative)) + + +if __name__ == "__main__": + args = parse_args() + rate = coverage_rate( + true_file_path=args.true_file_path, + generate_file_path=args.generate_file_path, + bleu_n_size=args.bleu_n_size, + 
bleu_threshold=args.bleu_threshold, + num_return_sequences=args.num_return_sequences, + all_sample_num=args.all_sample_num, + is_log_file=args.do_log_file, + log_dir=args.log_dir, + is_multiprocessing=args.do_multiprocessing, + is_map_async=args.do_map_async, + ) + print("coverage rate is", rate) diff --git a/applications/sentiment_analysis/ASO_analysis/.gitignore b/applications/sentiment_analysis/ASO_analysis/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..7d7a840f05d7620097a1fc1633da9db221c58bf2 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/.gitignore @@ -0,0 +1,2 @@ +checkpoints/* +data/* diff --git a/applications/sentiment_analysis/ASO_analysis/README.md b/applications/sentiment_analysis/ASO_analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1bd17691bcdd4a995e08d002f8a6cd4be9fc58e4 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/README.md @@ -0,0 +1,200 @@ +# 评论观点抽取与情感倾向性分析 + +## 1. 场景概述 + +情感分析旨在对带有情感色彩的主观性文本进行分析、处理、归纳和推理,其广泛应用于消费决策、舆情分析、个性化推荐等领域,具有很高的商业价值。 + +依托百度领先的情感分析技术,食行生鲜自动生成菜品评论标签辅助用户购买,并指导运营采购部门调整选品和促销策略;房天下向购房者和开发商直观展示楼盘的用户口碑情况,并对好评楼盘置顶推荐;国美搭建服务智能化评分系统,客服运营成本减少40%,负面反馈处理率100%。 + +情感分析相关的任务有语句级情感分析、评论对象抽取、观点抽取等等。一般来讲,被人们所熟知的情感分析任务是语句级别的情感分析,该任务是在宏观上去分析整句话的感情色彩,其粒度可能相对比较粗。 + +因为在人们进行评论的时候,往往针对某一产品或服务进行多个属性的评论,对每个属性的评论可能也会褒贬不一,因此针对属性级别的情感分析在真实的场景中会更加实用,同时更能给到企业用户或商家更加具体的建议。例如这句关于薯片的评论。 + +> 这个薯片味道真的太好了,口感很脆,只是包装很一般。 + +可以看到,顾客在口感、包装和味道 三个属性上对薯片进行了评价,顾客在味道和口感两个方面给出了好评,但是在包装上给出了负面的评价。只有通过这种比较细粒度的分析,商家才能更有针对性的发现问题,进而改进自己的产品或服务。 + +基于这样的考虑,本项目提出了一种细粒度的情感分析能力,对于给定的文本,首先会抽取该文本中的评论观点,然后分析不同观点的情感极性。 + + +## 2. 产品功能介绍 + +### 2.1 系统特色 +为了降低技术门槛,方便开发者共享效果领先的情感分析技术,PaddleNLP本次开源的情感分析系统,具备三大亮点: + +- 覆盖任务全 + - 集成评论观点抽取、属性级情感分类等情感分析能力,并开源模型,且打通模型训练、评估、预测部署全流程。 +- 效果领先 + - 集成百度研发的基于情感知识增强的预训练模型SKEP,为各类情感分析任务提供统一且强大的情感语义表示能力。 +- 预测性能强 + - 针对预训练模型预测效率低的问题,开源小模型PP-MiniLM,量化优化策略,预测性能大幅提升。 + +### 2.2 架构&功能 + +本项目提出的情感分析解决方案如图1所示,整个情感分析的过程大致包含两个阶段,依次是评论观点抽取模型,属性级情感分类模型。对于给定的一段文本,首先基于前者抽取出文本语句中潜在的评论属性以及该属性相应的评论观点,然后将评论属性、观点以及原始文本进行拼接,传给属性级情感分类模型以识别出该评论属性的情感极性。 + +这里需要提到的是,由于目前市面上的大多数模型是基于通用语料训练出来的,这些模型可能并不会对情感信息那么敏感。基于这样的考量,本项目使用了百度自研的 SKEP 预训练模型,其在预训练阶段便设计了多种情感信息相关的预训练目标进行训练。作为一种情感专属的模型,其更适合用来做上边提到的评论观点抽取任务,以及属性级情感分类任务。 + +另外,本项目使用的是 Large 版的 SKEP 模型,考虑到企业用户在线上部署时会考虑到模型预测效率,所以本项目专门提供了一个通用版的小模型 [PP-MiniLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm) 以及一套量化策略,用户可以使用相应情感数据集对 PP-MiniLM 进行微调,然后进行量化,以达到更快的预测效率。 + +
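+
+结合上面的架构说明,下面用一小段 Python 伪代码示意两阶段流程中数据的流转方式(仅作说明;其中 `extract_aspect_opinions` 与 `classify_sentiment` 为示意性占位函数,并非本项目的实际接口,完整的预测流程请直接使用 `predict.py` 或 `demo.py`):
+
+```python
+def analyze(text, extract_aspect_opinions, classify_sentiment):
+    """示意:先抽取评论属性与观点,再逐一判断各属性的情感极性。"""
+    results = []
+    for aspect, opinions in extract_aspect_opinions(text):
+        # 将评论属性与观点词拼接,再与原文组成句对,交给属性级情感分类模型
+        aspect_opinion = aspect + "".join(opinions)
+        polarity = classify_sentiment(aspect_opinion, text)
+        results.append({"aspect": aspect, "opinions": opinions, "sentiment_polarity": polarity})
+    return results
+```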
+
+图1 情感分析系统图
+
+ + +## 3. 情感分析效果展示 +在图1中可以看到,本项目的核心模块为评论观点抽取和属性级情感分类模块,本项目中基于情感专属模型 SKEP 实现了两个模块,并且提供了两者训练和测试的脚本,分别放在 `extraction` 和 `classification` 目录下。 + +下表展示了我们训练的评论观点抽取模型在验证集 `dev` 和测试集 `test` 上的表现: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.87095|0.90056|0.88551| +|SKEP-Large|test|0.87125|0.89944|0.88512| + +下表展示了我们训练的属性级情感分类模型在验证集 `dev` 和测试集 `test` 上的表现: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.98758|0.99251|0.99004| +|SKEP-Large|test|0.98497|0.99139|0.98817| + +给定一段文本,使用我们提供的全流程预测脚本可以轻松获得情感分析结果,如下所示。 + +- input_text: 蛋糕味道不错,很好吃,店家很耐心,服务也很好,很棒 + - aspect: 蛋糕味道, opinions: ['不错', '好吃'], sentiment_polarity: 正向 + - aspect: 店家, opinions: ['耐心'], sentiment_polarity: 正向 + - aspect: 服务, opinions: ['好', '棒'], sentiment_polarity: 正向 + +如果你想了解更多评论观点抽取模型和属性级情感分类模型的实现细节,请分别点击 [extraction](extraction/README.md) 和 [classification](classification/README.md)。 + + +## 4. 情感分析实践 +以下是本项目运行的完整目录结构以及说明: + +``` +. +├── extraction # 评价观点抽取模型包 +├── classification # 细粒度情感分类模型包 +├── pp_minilm # PP-MiniLM特色小模型包 +├── deploy # 高性能预测部署包 +│ ├── predict.py # 高性能预测脚本 +│   ├── run_predict.py # 高性能预测命令 +├── imgs # 图片目录 +├── demo.py # demo脚本,方便体验预测效果 +├── predict.py # 全流程预测脚本 +├── export_model.py # 动转静模型导出脚本 +├── utils.py # 工具函数脚本 +├── run_demo.sh # 运行demo,快速体验情感分析效果 +├── run_predict.sh # 全流程预测命令 +├── run_export_model.sh # 动转静模型导出命令 +└── README.md +``` + +### 4.1 运行环境和依赖安装 +(1) 环境依赖 + +- python >= 3.6 +- paddlenlp >= 2.2.2 +- paddlepaddle-gpu >= 2.2.1 + +(2) 运行环境准备 +在运行之前,请在本目录下新建目录 `data` 和 `checkpoints`,分别用于存放数据和保存模型。 + +本项目需要训练两个阶段的模型:评论观点抽取模型,属性级情感分类模型。本次针对这抽取和分类模型,我们分别开源了 Demo 数据: [ext_data](https://bj.bcebos.com/v1/paddlenlp/data/ext_data.tar.gz)和[cls_data](https://bj.bcebos.com/v1/paddlenlp/data/cls_data.tar.gz)。 + +用户可分别点击下载,解压后将相应的数据文件依次放入 `./data/ext_data` 和 `./data/cls_data` 目录下即可。 + +### 4.2 使用说明 +本项目开源了训练后的评论观点模型 [ext_model](https://bj.bcebos.com/paddlenlp/models/best_ext.pdparams) 和 属性级情感分类模型 [cls_model](https://bj.bcebos.com/paddlenlp/models/best_cls.pdparams)。如有需要,可点击下载,下载后请将 `ext_model` 和 `cls_model` 重命名为 `best.pdparams`,分别放入 `./checkpoints/ext_checkpoints` 和 `./checkpoints/cls_checkpoints` 中。 + +另外,考虑到不同用户可能有不同的需求,本项目提供了如下的方式学习或使用本项目。 + +**(1)快速体验效果** +如果你想快速体验本项目提供的情感分析能力,可使用本项目提供的 `demo.sh` 脚本以交互式的方式进行体验。 +```shell +sh run_demo.sh +``` + +**备注**:体验之前,请确保下载以上提到的 `ext_model` 和 `cls_model`,重命名后放入相应的目录中。 + +**(2) 文本批量预测** +如果你有一批数据,不方便逐句输入,可使用本项目提供的正式预测脚本 `predict.py`, 以文件的形式进行输入,处理后该脚本会将结果文件保存到与输入文件相同的目录下,默认的结果文件名为 `sentiment_results.json`。 + +本功能在预测时需要传入测试集文件路径,可将测试集文件命名为`test.txt`, 然后放入 `./data` 目录下。需要注意的是,测试集文件每行均为一个待预测的语句,如下所示。 + +- 蛋糕味道不错,很好吃,店家很耐心,服务也很好,很棒 +- 酒店干净整洁,性价比很高 +- 酒店环境不错,非常安静,性价比还可以 +- 房间很大,环境不错 + +通过运行如下命令,便可进行批量文本情感分析预测: +```shell +sh run_predict.sh +``` + +**备注**:体验之前,请确保下载以上提到的 `ext_model` 和 `cls_model`,重命名后放入相应的目录中。 + +**(3)高性能预测** +如果你想将本项目部署到线上环境去运行,那么建议你使用本项目基于 Paddle Inference 实现的高性能推理脚本 `deploy/predict.py`。 + +在使用之前,首先需要将保存的动态图模型转为静态图,通过调用下面的命令,便可将评论观点抽取模型和属性级情感分类模型转为静态图模型: +```shell +sh run_export_model.sh extraction +sh run_export_model.sh classification +``` + +这里需要注意的是,要确保相应的动态图已经下载或者训练生成到 `model_path` 指定的目录中,静态图模型会自动生成到`save_path`指定的地址。 + +同上,高性能预测的默认输入和输出形式也为文件,可分别通过 `test_path` 和 `save_path` 进行指定,通过如下命令便可以基于Paddle Inference 进行高性能预测: +```shell +cd deploy +sh run_predict.sh +``` + +**(4)自定义模型训练** +如果你希望自己尝试进行评论观点抽取模型训练,可使用4.1节中提供的 `ext_data` Demo 数据,或自己业务的标注数据重新训练模型,本项目已将评论观点抽取模型的相关训练和测试代码放入 `extraction` 目录下, 
请到该目录下执行模型训练即可,更多的实现细节和使用方式,请参考[这里](extraction/README.md)。 + +如果你希望自己尝试进行属性级情感分类模型训练,可使用4.1节中提供的 `cls_data` Demo 数据,或自己业务的标注数据重新训练模型,本项目已将属性级情感分类模型的相关训练和测试代码放入 `classification` 目录下,请到该目录下执行模型训练即可,更多的实现细节和使用方式,请参考[这里](classification/README.md)。 + +在训练后,如果需要进行高性能预测,可参考(3)进行动转静,然后基于Paddle Inference 进行高性能预测。 + +### 4.3 数据标注说明 +如果你想标注自己的业务数据,并尝试利用标注的新数据重新训练本项目。本项目推荐使用 [doccano](https://github.com/doccano/doccano) 进行数据标注平台,同时本项目也打通了其从标注到训练的通道,即 doccano 导出的数据后可通过 [doccano.py](./doccano.py) 脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。 为达到这个目的,您需要按以下标注规则在 doccano 平台上标注数据: + +
+
+图2 数据标注样例图
+
+ +- 在doccano平台上,定义标签 Pos-Aspect、 Neg-Aspect 和 Opinion,其中 Pos-Aspect 表示 Aspect 的情感极性为正向;Neg-Aspect 表示 Aspect 的情感极性为负向;Opinion 表示相应的观点词。 +- 使用以上定义的标签开始标注数据,图2展示了一个标注样例。 +- 当标注完成后,在 doccano 平台上导出 `jsonl` 形式的文件,并将其重命名为 `doccano.json` 后,放入 `./data` 目录下。 +- 通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以开始进行相应模型训练。 + +```shell +python doccano.py \ + --doccano_file ./data/doccano.json \ + --save_ext_dir ./data/ext_data \ + --save_cls_dir ./data/cls_data +``` + +**备注:** +- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [doccano.py](./doccano.py) 脚本,将会覆盖已有的同名数据文件 + +## 5. 小模型优化策略 +以上实验中,无论是评论观点抽取模型,还是属性级情感分类模型,使用的均是 Large 版的 SKEP 模型,考虑到企业用户在线上部署时会考虑到模型预测效率,本项目提供了一套基于 [PP-MiniLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm) 中文特色小模型的解决方案。PP-MiniLM 提供了一套完整的小模型优化方案:首先使用 Task-agnostic 的方式进行模型蒸馏、然后依托于 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 进行模型裁剪、模型量化等模型压缩技术,有效减小了模型的规模,加快了模型运行速度。 + +本项目基于 PP-MiniLM 中文特色小模型进行 fine-tune 属性级情感分类模型,然后使用 PaddleSlim 对训练好的模型进行量化操作。 + +在实验进行后,我们将 SKEP-Large、PP-MiniLM、量化PP-MiniLM 三个模型在性能和效果方面进行了对比,如下表所示。可以看到,三者在本任务数据集上的评估指标几乎相等,但是 PP-MiniLM 小模型运行速度较 SKEP-Large 提高了4倍,量化后的 PP-MiniLM 运行速度较 SKEP-Large 提高了近8倍。更多的详细信息请参考[这里](./pp_minilm/README.md)。 + +|Model|运行时间(s)|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|1.00x|0.98497|0.99139|0.98817| +|PP-MiniLM|4.95x|0.98379|0.98859|0.98618| +|量化 PP-MiniLM|8.93x|0.98312|0.98953|0.98631| + +## 6. 引用 + +[1] H. Tian et al., “SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis,” arXiv:2005.05635 [cs], May 2020, Accessed: Nov. 11, 2021. diff --git a/applications/sentiment_analysis/ASO_analysis/classification/README.md b/applications/sentiment_analysis/ASO_analysis/classification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3d7bbc5c719cb448c2a9c8c78021d7c8012582eb --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/README.md @@ -0,0 +1,64 @@ +# 细粒度情感分类模型 + +## 1. 方案设计 + +本项目将进行属性级别的情感分类,对于给定的一段文本,我们在基于评论观点抽取模型抽取出不同属性对应的观点后,便可以有针对性地对各个属性判别情感极性。具体来讲,本项目将抽取出的评论属性和观点进行拼接,然后和原始语句进行拼接作为一条独立的训练语句。 + +如图1所示,首先将评论属性和观点词进行拼接为"味道好",然后将"味道好"和原文进行拼接,然后传入SKEP模型,并使用 "CLS" 位置的向量进行细粒度情感倾向。 + +
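+
+为便于理解,下面给出构造分类模型输入的一个简化示意:把"评论属性+观点词"与原文以句对形式编码,这与本目录 `data.py` 中 `convert_example_to_feature` 的做法一致。示例中的预训练权重名称 `skep_ernie_1.0_large_ch` 仅为假设,实际请以训练脚本中的配置为准:
+
+```python
+from paddlenlp.transformers import SkepTokenizer
+
+# 预训练权重名称仅为示意,应与训练时实际加载的 SKEP 模型保持一致
+tokenizer = SkepTokenizer.from_pretrained("skep_ernie_1.0_large_ch")
+
+aspect_text = "味道" + "好"  # 评论属性 + 观点词,拼接为"味道好"
+text = "这个薯片味道真的太好了,口感很脆,只是包装很一般。"
+
+# 以句对形式编码,分类模型使用 "CLS" 位置的向量预测该属性的情感极性
+encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=256)
+print(encoded_inputs["input_ids"][:10])
+```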
+ +

图1 细粒度情感分类模型

+

+ +## 2. 项目结构说明 + +以下是本项目运行的完整目录结构及说明: + +```shell +. +├── data.py # 数据处理脚本 +├── model.py # 模型组网脚本 +├── train.py # 模型训练脚本 +├── evaluate.py # 模型评估脚本 +├── run_train.sh # 模型训练命令 +├── run_evaluate.sh # 模型评估命令 +└── README.md +``` + +## 3. 数据说明 + +本实验中,相应的数据集需要包含3列数据:标签、评论观点和原文,下面给出了一些样本示例。 + +- 1 口味清淡 口味很清淡,价格也比较公道 +- 1 经济实惠 经济实惠,环境好,套餐划算 +- 0 设施一般 房间大,设施一般 + +可点击 [cls_data](https://bj.bcebos.com/v1/paddlenlp/data/cls_data.tar.gz) 进行 Demo 数据下载,将数据解压之后放入父目录的 `data/cls_data/` 文件夹下。 + +## 4. 模型效果展示 + +在分类模型训练过程中,总共训练了10轮,并选择了评估 F1 得分最高的 best 模型,下表展示了训练过程中使用的训练参数。我们同时开源了相应的模型,可点击下表的 `cls_model` 进行下载,下载后将模型重命名为 `best.pdparams`,然后放入父目录的 `checkpoints/cls_checkpoints` 中。 +|Model|训练参数配置|MD5| +| ------------ | ------------ |-----------| +|[cls_model](https://bj.bcebos.com/paddlenlp/models/best_cls.pdparams)|
learning_rate: 3e-5, batch_size: 16, max_seq_len:256, epochs:10
|3de6ddf581e665d9b1d035c29b49778a| + +我们基于训练过程中的 best 模型在验证集 `dev` 和测试集 `test` 上进行了评估测试,模型效果如下表所示: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.98758|0.99251|0.99004| +|SKEP-Large|test|0.98497|0.99139|0.98817| + +**备注**: 以上数据是基于全量数据训练和测试结果,并非 Demo 数据集。 + +## 5. 模型训练 +通过运行以下命令进行分类模型训练: +```shell +sh run_train.sh +``` + +## 6. 模型测试 +通过运行以下命令进行分类模型测试: +```shell +sh run_evaluate.sh +``` diff --git a/applications/sentiment_analysis/ASO_analysis/classification/data.py b/applications/sentiment_analysis/ASO_analysis/classification/data.py new file mode 100644 index 0000000000000000000000000000000000000000..fcbaf5110cbda5aac5ab4d22f6d46028595cdf5b --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/data.py @@ -0,0 +1,38 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + if not is_test: + label = int(example[0]) + aspect_text = example[1] + text = example[2] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + encoded_inputs["label"] = label + else: + aspect_text = example[0] + text = example[1] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs diff --git a/applications/sentiment_analysis/ASO_analysis/classification/evaluate.py b/applications/sentiment_analysis/ASO_analysis/classification/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..e12bca18901f957718b0a8cabce0a395c7821036 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/evaluate.py @@ -0,0 +1,77 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from tqdm import tqdm + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + + +def evaluate(model, data_loader, metric): + + model.eval() + metric.reset() + for batch_data in tqdm(data_loader): + input_ids, token_type_ids, labels = batch_data["input_ids"], batch_data["token_type_ids"], batch_data["labels"] + logits = model(input_ids, token_type_ids=token_type_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + + accuracy, precision, recall, f1, _ = metric.accumulate() + + return accuracy, precision, recall, f1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + args = parser.parse_args() + # yapf: enbale + + # load dev data + model_name = "skep_ernie_1.0_large_ch" + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer) + + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + # load model + loaded_state_dict = paddle.load(args.model_path) + model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(label2id)) + model.load_dict(loaded_state_dict) + + metric = AccuracyAndF1() + + # evaluate on dev data + accuracy, precision, recall, f1 = evaluate(model, test_loader, metric) + print(f'evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}') diff --git a/applications/sentiment_analysis/ASO_analysis/classification/run_evaluate.sh b/applications/sentiment_analysis/ASO_analysis/classification/run_evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..e9d6721c719d52c4fabffdd6bbb87673fbee2d7e --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/run_evaluate.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python evaluate.py \ + --model_path "../checkpoints/cls_checkpoints/best.pdparams" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 diff --git a/applications/sentiment_analysis/ASO_analysis/classification/run_train.sh b/applications/sentiment_analysis/ASO_analysis/classification/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..56a4a6755a429b1967f65bf3828aab008f855651 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/run_train.sh @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --train_path "../data/cls_data/train.txt" \ + --dev_path "../data/cls_data/dev.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 5 \ + --batch_size 16 \ + --max_seq_len 256 \ + --learning_rate 3e-5 \ + --weight_decay 0.01 \ + --max_grad_norm 1.0 \ + --warmup_proportion 0.1 \ + --log_steps 50 \ + --eval_steps 100 \ + --seed 1000 \ + --device "gpu" \ + --checkpoints "../checkpoints/cls_checkpoints" diff --git a/applications/sentiment_analysis/ASO_analysis/classification/train.py b/applications/sentiment_analysis/ASO_analysis/classification/train.py new file mode 100644 index 0000000000000000000000000000000000000000..7d87d93c9ce9bad2f9b271dbdabc372d6e81bfd1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/train.py @@ -0,0 +1,149 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from evaluate import evaluate + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + SkepForSequenceClassification, + SkepTokenizer, +) + +warnings.filterwarnings("ignore") + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def train(): + # set running envir + model_name = "skep_ernie_1.0_large_ch" + + paddle.set_device(args.device) + set_seed(args.seed) + + if not os.path.exists(args.checkpoints): + os.mkdir(args.checkpoints) + + # load and process data5 + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"train": args.train_path, "dev": args.dev_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + + train_ds = datasets["train"].map(trans_func, batched=False, remove_columns=["text"]) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_loader = paddle.io.DataLoader(train_ds, batch_sampler=train_batch_sampler, collate_fn=data_collator) + dev_loader = paddle.io.DataLoader(dev_ds, batch_sampler=dev_batch_sampler, collate_fn=data_collator) + + # configure model training + model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(label2id)) + + num_training_steps = len(train_loader) * args.num_epochs + lr_scheduler = LinearDecayWithWarmup( + learning_rate=args.learning_rate, total_steps=num_training_steps, warmup=args.warmup_proportion + ) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + metric = AccuracyAndF1() + + # start to train model + global_step, best_f1 = 1, 0.0 + model.train() + for epoch in range(1, args.num_epochs + 1): + for batch_data in train_loader(): + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + loss, logits = model(input_ids, token_type_ids=token_type_ids, labels=labels) + + loss.backward() + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step > 0 and global_step % args.log_steps == 0: + print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.item():.6f}") + if (global_step > 0 and global_step % args.eval_steps == 0) or global_step == num_training_steps: + accuracy, precision, recall, f1 = evaluate(model, dev_loader, metric) + model.train() + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + paddle.save(model.state_dict(), f"{args.checkpoints}/best.pdparams") + 
print( + f"evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}" + ) + + global_step += 1 + + paddle.save(model.state_dict(), f"{args.checkpoints}/final.pdparams") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--num_epochs", type=int, default=3, help="Number of epoches for training.") + parser.add_argument("--train_path", type=str, default=None, help="The path of train set.") + parser.add_argument("--dev_path", type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate for optimizer.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max grad norm to clip gradient.") + parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Linear warmup proportion over the training process.") + parser.add_argument("--log_steps", type=int, default=50, help="Frequency of printing log.") + parser.add_argument("--eval_steps", type=int, default=500, help="Frequency of performing evaluation.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--checkpoints", type=str, default=None, help="Directory to save checkpoint.") + + args = parser.parse_args() + # yapf: enable + + train() diff --git a/applications/sentiment_analysis/ASO_analysis/demo.py b/applications/sentiment_analysis/ASO_analysis/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..2b7be71295e941582ce534588d025c1b24ff0a40 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/demo.py @@ -0,0 +1,131 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import re + +import paddle +from utils import decoding, load_dict + +from paddlenlp.transformers import ( + SkepForSequenceClassification, + SkepForTokenClassification, + SkepTokenizer, +) + + +def is_aspect_first(text, aspect, opinion_word): + return text.find(aspect) <= text.find(opinion_word) + + +def concate_aspect_and_opinion(text, aspect, opinion_words): + aspect_text = "" + for opinion_word in opinion_words: + if is_aspect_first(text, aspect, opinion_word): + aspect_text += aspect + opinion_word + "," + else: + aspect_text += opinion_word + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +def format_print(results): + for result in results: + aspect, opinions, sentiment = result["aspect"], result["opinions"], result["sentiment_polarity"] + print(f"aspect: {aspect}, opinions: {opinions}, sentiment_polarity: {sentiment}") + print() + + +def predict(args, ext_model, cls_model, tokenizer, ext_id2label, cls_id2label): + + ext_model.eval() + cls_model.eval() + + while True: + input_text = input("input text: \n") + input_text = re.sub(" +", "", input_text.strip()) + if not input_text: + continue + if input_text == "quit" or input_text == "exit": + break + + input_text = input_text.strip().replace(" ", "") + # processing input text + encoded_inputs = tokenizer(list(input_text), is_split_into_words=True, max_seq_len=args.ext_max_seq_len) + input_ids = paddle.to_tensor([encoded_inputs["input_ids"]]) + token_type_ids = paddle.to_tensor([encoded_inputs["token_type_ids"]]) + + # extract aspect and opinion words + logits = ext_model(input_ids, token_type_ids=token_type_ids) + predictions = logits.argmax(axis=2).numpy()[0] + tag_seq = [ext_id2label[idx] for idx in predictions][1:-1] + + aps = decoding(input_text[: args.ext_max_seq_len - 2], tag_seq) + + # predict sentiment for aspect with cls_model + results = [] + for ap in aps: + aspect = ap[0] + opinion_words = list(set(ap[1:])) + aspect_text = concate_aspect_and_opinion(input_text, aspect, opinion_words) + + encoded_inputs = tokenizer( + aspect_text, text_pair=input_text, max_seq_len=args.cls_max_seq_len, return_length=True + ) + input_ids = paddle.to_tensor([encoded_inputs["input_ids"]]) + token_type_ids = paddle.to_tensor([encoded_inputs["token_type_ids"]]) + + logits = cls_model(input_ids, token_type_ids=token_type_ids) + prediction = int(logits.argmax(axis=1)) + + result = {"aspect": aspect, "opinions": opinion_words, "sentiment_polarity": cls_id2label[prediction]} + results.append(result) + + format_print(results) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--ext_model_path", type=str, default=None, help="The path of extraction model path that you want to load.") + parser.add_argument("--cls_model_path", type=str, default=None, help="The path of classification model path that you want to load.") + parser.add_argument("--ext_label_path", type=str, default=None, help="The path of extraction label dict.") + parser.add_argument("--cls_label_path", type=str, default=None, help="The path of classification label dict.") + parser.add_argument("--ext_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for extraction model.") + parser.add_argument("--cls_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for classification model.") + args = parser.parse_args() + # yapf: enbale + + # load dict + model_name = "skep_ernie_1.0_large_ch" + 
ext_label2id, ext_id2label = load_dict(args.ext_label_path) + cls_label2id, cls_id2label = load_dict(args.cls_label_path) + tokenizer = SkepTokenizer.from_pretrained(model_name) + print("label dict loaded.") + + # load ext model + ext_state_dict = paddle.load(args.ext_model_path) + ext_model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(ext_label2id)) + ext_model.load_dict(ext_state_dict) + print("extraction model loaded.") + + # load cls model + cls_state_dict = paddle.load(args.cls_model_path) + cls_model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(cls_label2id)) + cls_model.load_dict(cls_state_dict) + print("classification model loaded.") + + # do predict + predict(args, ext_model, cls_model, tokenizer, ext_id2label, cls_id2label) diff --git a/applications/sentiment_analysis/ASO_analysis/deploy/predict.py b/applications/sentiment_analysis/ASO_analysis/deploy/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..284092e71d998dd1be07fa26d0f3c77e78a48006 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/deploy/predict.py @@ -0,0 +1,361 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import json +import os +import re +from collections import defaultdict +from functools import partial + +import paddle +from datasets import Dataset, load_dataset +from paddle import inference +from seqeval.metrics.sequence_labeling import get_entities + +from paddlenlp.data import DataCollatorForTokenClassification, DataCollatorWithPadding +from paddlenlp.transformers import SkepTokenizer + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def read_test_file(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip().replace(" ", "") + yield {"text": line} + + +def decoding(text, tag_seq): + assert len(text) == len(tag_seq), f"text len: {len(text)}, tag_seq len: {len(tag_seq)}" + + puncs = list(",.?;!,。?;!") + splits = [idx for idx in range(len(text)) if text[idx] in puncs] + + prev = 0 + sub_texts, sub_tag_seqs = [], [] + for i, split in enumerate(splits): + sub_tag_seqs.append(tag_seq[prev:split]) + sub_texts.append(text[prev:split]) + prev = split + sub_tag_seqs.append(tag_seq[prev:]) + sub_texts.append((text[prev:])) + + ents_list = [] + for sub_text, sub_tag_seq in zip(sub_texts, sub_tag_seqs): + ents = get_entities(sub_tag_seq, suffix=False) + ents_list.append((sub_text, ents)) + + aps = [] + no_a_words = [] + for sub_tag_seq, ent_list in ents_list: + sub_aps = [] + sub_no_a_words = [] + for ent in ent_list: + ent_name, start, end = ent + if ent_name == "Aspect": + aspect = sub_tag_seq[start : end + 1] + sub_aps.append([aspect]) + if 
len(sub_no_a_words) > 0: + sub_aps[-1].extend(sub_no_a_words) + sub_no_a_words.clear() + else: + ent_name == "Opinion" + opinion = sub_tag_seq[start : end + 1] + if len(sub_aps) > 0: + sub_aps[-1].append(opinion) + else: + sub_no_a_words.append(opinion) + + if sub_aps: + aps.extend(sub_aps) + if len(no_a_words) > 0: + aps[-1].extend(no_a_words) + no_a_words.clear() + elif sub_no_a_words: + if len(aps) > 0: + aps[-1].extend(sub_no_a_words) + else: + no_a_words.extend(sub_no_a_words) + + if no_a_words: + no_a_words.insert(0, "None") + aps.append(no_a_words) + + return aps + + +def convert_example_to_feature_ext(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + text = list(example[0]) + if not is_test: + label = example[1].split(" ") + assert len(text) == len(label) + new_text = [] + new_label = [] + for text_ch, label_ch in zip(text, label): + if text_ch.strip(): + new_text.append(text_ch) + new_label.append(label_ch) + new_label = ( + [label2id["O"]] + [label2id[label_term] for label_term in new_label][: (max_seq_len - 2)] + [label2id["O"]] + ) + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + encoded_inputs["labels"] = new_label + assert len(encoded_inputs["input_ids"]) == len( + new_label + ), f"input_ids: {len(encoded_inputs['input_ids'])}, label: {len(new_label)}" + else: + new_text = [text_ch for text_ch in text if text_ch.strip()] + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs + + +def convert_example_to_feature_cls(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + if not is_test: + label = int(example[0]) + aspect_text = example[1] + text = example[2] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + encoded_inputs["label"] = label + else: + aspect_text = example[0] + text = example[1] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs + + +def remove_blanks(example): + example["text"] = re.sub(" +", "", example["text"]) + return example + + +class Predictor(object): + def __init__(self, args): + self.args = args + self.ext_predictor, self.ext_input_handles, self.ext_output_hanle = self.create_predictor(args.ext_model_path) + print(f"ext_model_path: {args.ext_model_path}, {self.ext_predictor}") + self.cls_predictor, self.cls_input_handles, self.cls_output_hanle = self.create_predictor(args.cls_model_path) + self.ext_label2id, self.ext_id2label = load_dict(args.ext_label_path) + self.cls_label2id, self.cls_id2label = load_dict(args.cls_label_path) + self.tokenizer = SkepTokenizer.from_pretrained(args.base_model_name) + + def create_predictor(self, model_path): + model_file = model_path + ".pdmodel" + params_file = model_path + ".pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if self.args.device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": 
inference.PrecisionType.Int8, + } + precision_mode = precision_map[args.precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=self.args.batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif self.args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif self.args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handle = predictor.get_output_handle(predictor.get_output_names()[0]) + + return predictor, input_handles, output_handle + + def predict_ext(self, args): + datasets = load_dataset("text", data_files={"test": args.test_path}) + datasets["test"] = datasets["test"].map(remove_blanks) + trans_func = partial( + convert_example_to_feature_ext, + tokenizer=self.tokenizer, + label2id=self.ext_label2id, + max_seq_len=args.ext_max_seq_len, + is_test=True, + ) + test_ds = copy.copy(datasets["test"]).map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorForTokenClassification(self.tokenizer, label_pad_token_id=self.ext_label2id["O"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + results = [] + for bid, batch_data in enumerate(test_loader): + input_ids, token_type_ids, seq_lens = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["seq_len"], + ) + self.ext_input_handles[0].copy_from_cpu(input_ids.numpy()) + self.ext_input_handles[1].copy_from_cpu(token_type_ids.numpy()) + self.ext_predictor.run() + logits = self.ext_output_hanle.copy_to_cpu() + + predictions = logits.argmax(axis=2) + for eid, (seq_len, prediction) in enumerate(zip(seq_lens, predictions)): + idx = bid * args.batch_size + eid + tag_seq = [self.ext_id2label[idx] for idx in prediction[:seq_len][1:-1]] + text = datasets["test"][idx]["text"] + aps = decoding(text[: args.ext_max_seq_len - 2], tag_seq) + for aid, ap in enumerate(aps): + aspect, opinions = ap[0], list(set(ap[1:])) + aspect_text = self._concate_aspect_and_opinion(text, aspect, opinions) + results.append( + { + "id": str(idx) + "_" + str(aid), + "aspect": aspect, + "opinions": opinions, + "text": text, + "aspect_text": aspect_text, + } + ) + + return results + + def predict_cls(self, args, ext_results): + text_list = [] + for result in ext_results: + example = result["aspect_text"] + "\t" + result["text"] + text_list.append(example) + ext_results = {"text": text_list} + + dataset = Dataset.from_dict(ext_results) + trans_func = partial( + convert_example_to_feature_cls, + tokenizer=self.tokenizer, + label2id=self.cls_label2id, + max_seq_len=args.cls_max_seq_len, + is_test=True, + ) + + test_ds = dataset.map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorWithPadding(self.tokenizer, padding=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, 
batch_sampler=test_batch_sampler, collate_fn=data_collator) + + results = [] + for batch_data in test_loader: + input_ids, token_type_ids = batch_data["input_ids"], batch_data["token_type_ids"] + self.cls_input_handles[0].copy_from_cpu(input_ids.numpy()) + self.cls_input_handles[1].copy_from_cpu(token_type_ids.numpy()) + self.cls_predictor.run() + logits = self.cls_output_hanle.copy_to_cpu() + + predictions = logits.argmax(axis=1).tolist() + results.extend(predictions) + + return results + + def post_process(self, args, ext_results, cls_results): + assert len(ext_results) == len(cls_results) + + collect_dict = defaultdict(list) + for ext_result, cls_result in zip(ext_results, cls_results): + ext_result["sentiment_polarity"] = self.cls_id2label[cls_result] + eid, _ = ext_result["id"].split("_") + collect_dict[eid].append(ext_result) + + sentiment_results = [] + for eid in collect_dict.keys(): + sentiment_result = {} + ap_list = [] + for idx, single_ap in enumerate(collect_dict[eid]): + if idx == 0: + sentiment_result["text"] = single_ap["text"] + ap_list.append( + { + "aspect": single_ap["aspect"], + "opinions": single_ap["opinions"], + "sentiment_polarity": single_ap["sentiment_polarity"], + } + ) + sentiment_result["ap_list"] = ap_list + sentiment_results.append(sentiment_result) + + with open(args.save_path, "w", encoding="utf-8") as f: + for sentiment_result in sentiment_results: + f.write(json.dumps(sentiment_result, ensure_ascii=False) + "\n") + print(f"sentiment analysis results has been saved to path: {args.save_path}") + + def predict(self, args): + ext_results = self.predict_ext(args) + cls_results = self.predict_cls(args, ext_results) + self.post_process(args, ext_results, cls_results) + + def _concate_aspect_and_opinion(self, text, aspect, opinion_words): + aspect_text = "" + for opinion_word in opinion_words: + if text.find(aspect) <= text.find(opinion_word): + aspect_text += aspect + opinion_word + "," + else: + aspect_text += opinion_word + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", default='skep_ernie_1.0_large_ch', type=str, help="Base model name, SKEP used by default", ) + parser.add_argument("--ext_model_path", type=str, default=None, help="The path of extraction model path that you want to load.") + parser.add_argument("--cls_model_path", type=str, default=None, help="The path of classification model path that you want to load.") + parser.add_argument("--ext_label_path", type=str, default=None, help="The path of extraction label dict.") + parser.add_argument("--cls_label_path", type=str, default=None, help="The path of classification label dict.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set that you want to predict.") + parser.add_argument('--save_path', type=str, required=True, default=None, help="The saving path of predict results.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--ext_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for extraction model.") + parser.add_argument("--cls_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for classification model.") + parser.add_argument("--use_tensorrt", action='store_true', help="Whether to use inference engin TensorRT.") + 
parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') + parser.add_argument("--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference.") + parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') + parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + args = parser.parse_args() + # yapf: enbale + + predictor = Predictor(args) + predictor.predict(args) diff --git a/applications/sentiment_analysis/ASO_analysis/deploy/run_predict.sh b/applications/sentiment_analysis/ASO_analysis/deploy/run_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..21feb414e4f008662e28fa5c6a811546813591d1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/deploy/run_predict.sh @@ -0,0 +1,27 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python predict.py \ + --base_model_name "skep_ernie_1.0_large_ch" \ + --ext_model_path "../checkpoints/ext_checkpoints/static/infer" \ + --cls_model_path "../checkpoints/cls_checkpoints/static/infer" \ + --ext_label_path "../data/ext_data/label.dict" \ + --cls_label_path "../data/cls_data/label.dict" \ + --test_path "../data/test.txt" \ + --save_path "../data/sentiment_results.json" \ + --batch_size 8 \ + --ext_max_seq_len 512 \ + --cls_max_seq_len 256 diff --git a/applications/sentiment_analysis/ASO_analysis/doccano.py b/applications/sentiment_analysis/ASO_analysis/doccano.py new file mode 100644 index 0000000000000000000000000000000000000000..9880fe0a0e118170eeae6ed81fa62fca0e1317b1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/doccano.py @@ -0,0 +1,157 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + +import numpy as np +from utils import concate_aspect_and_opinion, decoding, save_dict, save_examples + + +def doccano2SA(doccano_file, save_ext_dir, save_cls_dir, splits=[0.8, 0.9], is_shuffle=True): + """ + @Description: Consvert doccano file to data format which is suitable to input to this Application. + @Param doccano_file: The annotated file exported from doccano labeling platform. + @Param save_ext_dir: The directory of ext data that you wanna save. 
+ @Param save_cls_dir: The directory of cls data that you wanna save. + @Param splits: Whether to split doccano file into train/dev/test, note: Only []/ len(splits)==2 accepted. + @Param is_shuffle: Whether to shuffle data. + """ + if not os.path.exists(doccano_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(save_ext_dir): + os.makedirs(save_ext_dir) + + if not os.path.exists(save_cls_dir): + os.makedirs(save_cls_dir) + + if len(splits) != 0 and len(splits) != 2: + raise ValueError("Only []/ len(splits)==2 accepted for splits.") + + if splits and ( + splits[0] >= splits[1] or splits[0] >= 1.0 or splits[1] >= 1.0 or splits[0] <= 0.0 or splits[1] <= 0 + ): + raise ValueError("Please set correct splits, the element in it should be in (0,1), and splits[1]>splits[0].") + + def label_ext_with_label_term(ext_label, start, end, tag): + + if tag == "Opinion": + b_tag = "B-Opinion" + i_tag = "I-Opinion" + else: + b_tag = "B-Aspect" + i_tag = "I-Aspect" + + ext_label[start] = b_tag + for i in range(start + 1, end): + ext_label[i] = i_tag + + ext_examples, cls_examples = [], [] + with open(doccano_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + # start to label for ext and cls data + for line in raw_examples: + items = json.loads(line) + text, label_terms = items["data"], items["label"] + # label ext data with label_terms + ext_label = ["O"] * len(text) + aspect_mapper = {} + for label_term in label_terms: + start, end, tag = label_term + label_ext_with_label_term(ext_label, start, end, tag) + if tag == "Pos-Aspect": + aspect_mapper[text[start:end]] = "1" + elif tag == "Neg-Aspect": + aspect_mapper[text[start:end]] = "0" + ext_examples.append((text, " ".join(ext_label))) + # label cls data + aps = decoding(text, ext_label) + for ap in aps: + aspect, opinions = ap[0], list(set(ap[1:])) + if aspect not in aspect_mapper: + continue + aspect_text = concate_aspect_and_opinion(text, aspect, opinions) + cls_examples.append((aspect_mapper[aspect], aspect_text, text)) + + # index for saving data + ext_idx = np.arange(len(ext_examples)) + cls_idx = np.arange(len(cls_examples)) + + if is_shuffle: + ext_idx = np.random.permutation(ext_idx) + cls_idx = np.random.permutation(cls_idx) + + if len(splits) == 0: + # save ext data + save_ext_path = os.path.join(save_ext_dir, "doccano.txt") + save_examples(ext_examples, save_ext_path, ext_idx) + print(f"\next: save data to {save_ext_path}.") + # save cls data + save_cls_path = os.path.join(save_cls_dir, "doccano.txt") + save_examples(cls_examples, save_cls_path, cls_idx) + print(f"\ncls: save data to {save_cls_path}.") + + else: + # save ext data + eth1, eth2 = int(len(ext_examples) * splits[0]), int(len(ext_examples) * splits[1]) + save_ext_train_path = os.path.join(save_ext_dir, "train.txt") + save_ext_dev_path = os.path.join(save_ext_dir, "dev.txt") + save_ext_test_path = os.path.join(save_ext_dir, "test.txt") + save_examples(ext_examples, save_ext_train_path, ext_idx[:eth1]) + save_examples(ext_examples, save_ext_dev_path, ext_idx[eth1:eth2]) + save_examples(ext_examples, save_ext_test_path, ext_idx[eth2:]) + print(f"\next: save train data to {save_ext_train_path}.") + print(f"ext: save dev data to {save_ext_dev_path}.") + print(f"ext: save test data to {save_ext_test_path}.") + + # save cls data + cth1, cth2 = int(len(cls_examples) * splits[0]), int(len(cls_examples) * splits[1]) + save_cls_train_path = os.path.join(save_cls_dir, "train.txt") + save_cls_dev_path = os.path.join(save_cls_dir, 
"dev.txt") + save_cls_test_path = os.path.join(save_cls_dir, "test.txt") + save_examples(cls_examples, save_cls_train_path, cls_idx[:cth1]) + save_examples(cls_examples, save_cls_dev_path, cls_idx[cth1:cth2]) + save_examples(cls_examples, save_cls_test_path, cls_idx[cth2:]) + print(f"\ncls: save train data to {save_cls_train_path}.") + print(f"cls: save dev data to {save_cls_dev_path}.") + print(f"cls: save test data to {save_cls_test_path}.") + + # save ext dict + ext_dict_path = os.path.join(save_ext_dir, "label.dict") + cls_dict_path = os.path.join(save_cls_dir, "label.dict") + save_dict(ext_dict_path, "ext") + save_dict(cls_dict_path, "cls") + print(f"\next: save dict to {ext_dict_path}.") + print(f"cls: save dict to {cls_dict_path}.") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--doccano_file", + type=str, + default="./data/doccano.json", + help="The doccano file exported from doccano platform.", + ) + parser.add_argument( + "--save_ext_dir", type=str, default="./data/ext_data1", help="The path of ext data that you wanna save." + ) + parser.add_argument( + "--save_cls_dir", type=str, default="./data/cls_data1", help="The path of cls data that you wanna save." + ) + args = parser.parse_args() + + doccano2SA(args.doccano_file, args.save_ext_dir, args.save_cls_dir, is_shuffle=True) diff --git a/applications/sentiment_analysis/ASO_analysis/export_model.py b/applications/sentiment_analysis/ASO_analysis/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..532698d98c999fb8fcb5eeaa2dcc0219fc206ce7 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/export_model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse + +import paddle + +from paddlenlp.transformers import ( + PPMiniLMForSequenceClassification, + SkepForSequenceClassification, + SkepForTokenClassification, +) + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", type=str, default="extraction", choices=["extraction", "classification", "pp_minilm"], help="The model type that you wanna export.") + parser.add_argument("--base_model_name", type=str, default="skep_ernie_1.0_large_ch", help="The base model of experiment, skep or ppminilm") + parser.add_argument("--model_path", type=str, default=None, help="The path of model that you want to load.") + parser.add_argument("--save_path", type=str, default=None, help="The path of the exported static model.") + args = parser.parse_args() + # yapf: enable + + # load model with saved state_dict + if args.model_type == "extraction": + model = SkepForTokenClassification.from_pretrained(args.base_model_name, num_classes=5) + elif args.model_type == "classification": + model = SkepForSequenceClassification.from_pretrained(args.base_model_name, num_classes=2) + else: + model = PPMiniLMForSequenceClassification.from_pretrained(args.base_model_name, num_classes=2) + + loaded_state_dict = paddle.load(args.model_path) + model.load_dict(loaded_state_dict) + print(f"Loaded parameters from {args.model_path}") + + model.eval() + # convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec( + shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec( + shape=[None, None], dtype="int64") # token_type_ids + ]) + + # save to static model + paddle.jit.save(model, args.save_path) + print(f"static {args.model_type} model has been to {args.save_path}") diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/README.md b/applications/sentiment_analysis/ASO_analysis/extraction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2a471a6bd62e18d31e4cbef337f5e374f112052a --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/README.md @@ -0,0 +1,65 @@ +# 评论观点抽取模型 + +## 1. 方案设计 + +在本实验中,我们将采用序列标注的方式进行评论观点抽取,需要注意的是,这里会同时抽取评论的属性和观点,为此我们基于 BIO 的序列标注体系进行了标签的拓展:B-Aspect, I-Aspect, B-Opinion, I-Opinion, O,其中前两者用于标注评论属性,后两者用于标注评论观点。 + +如图1所示,首先将文本串传入 SKEP 模型中,利用 SKEP 模型对该文本串进行语义编码后,然后基于每个位置的输出去预测相应的标签。 + +
+ +

图1 评论观点抽取模型

+

+ +## 2. 项目结构说明 + +以下是本项目运行的完整目录结构及说明: + +```shell +. +├── data.py # 数据处理脚本 +├── model.py # 模型组网脚本 +├── train.py # 模型训练脚本 +├── evaluate.py # 模型评估脚本 +├── run_train.sh # 模型训练命令 +├── run_evaluate.sh # 模型评估命令 +└── README.md +``` + +## 3. 数据说明 + +如上所述,本项目将采用序列标注的方式进行抽取评论属性和观点,所以本项目训练集中需要包含两列数据:文本串和相应的序列标签数据,下面给出了一条样本。 + + +- 服务好,环境好,做出来效果也不错 B-Aspect I-Aspect B-Opinion O B-Aspect I-Aspect B-Opinion O O O O B-Aspect I-Aspect O B-Opinion I-Opinion +- 环境很好,交通便利 B-Aspect I-Aspect O B-Opinion O B-Aspect I-Aspect B-Opinion I-Opinion +- 空气清新,景色优美 B-Aspect I-Aspect B-Opinion I-Opinion O B-Aspect I-Aspect O B-Opinion + + +可点击 [ext_data](https://bj.bcebos.com/v1/paddlenlp/data/ext_data.tar.gz) 进行 Demo 数据下载,将数据解压之后放入父目录的 `data/ext_data/` 文件夹下。 + +## 4. 模型效果展示 +在抽取模型训练过程中,总共训练了10轮,并选择了评估F1得分最高的 best 模型,下表展示了训练过程中使用的训练参数。我们同时开源了相应的模型,可点击下表的 `ext_model` 进行下载,下载后将模型重命名为 `best.pdparams`,然后放入父目录的 `checkpoints/ext_checkpoints` 中。 +|Model|训练参数配置|MD5| +| ------------ | ------------ |-----------| +|[ext_model](https://bj.bcebos.com/paddlenlp/models/best_ext.pdparams)|
learning_rate: 5e-5, batch_size: 8, max_seq_len:512, epochs:10
|e3358632165aa0338225e175b57cb304| + +我们基于训练过程中的 best 模型在验证集 `dev` 和测试集 `test` 上进行了评估测试,模型效果如下表所示: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.87095|0.90056|0.88551| +|SKEP-Large|test|0.87125|0.89944|0.88512| + +**备注**:以上数据是基于全量数据训练和测试结果,并非 Demo 数据集。 + +## 5. 模型训练 +通过运行以下命令进行评论观点抽取模型训练: +```shell +sh run_train.sh +``` + +## 6. 模型测试 +通过运行以下命令进行评论观点抽取模型测试: +```shell +sh run_evaluate.sh +``` diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/data.py b/applications/sentiment_analysis/ASO_analysis/extraction/data.py new file mode 100644 index 0000000000000000000000000000000000000000..b0700bc72a108b843765ccc95a741773ce86e9e5 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/data.py @@ -0,0 +1,49 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + text = list(example[0]) + if not is_test: + label = example[1].split(" ") + assert len(text) == len(label) + new_text = [] + new_label = [] + for text_ch, label_ch in zip(text, label): + if text_ch.strip(): + new_text.append(text_ch) + new_label.append(label_ch) + new_label = ( + [label2id["O"]] + [label2id[label_term] for label_term in new_label][: (max_seq_len - 2)] + [label2id["O"]] + ) + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + encoded_inputs["labels"] = new_label + assert len(encoded_inputs["input_ids"]) == len( + new_label + ), f"input_ids: {len(encoded_inputs['input_ids'])}, label: {len(new_label)}" + else: + new_text = [text_ch for text_ch in text if text_ch.strip()] + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/evaluate.py b/applications/sentiment_analysis/ASO_analysis/extraction/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..0a0e5d62e14469b6280fb7c3159ac55c75ac5922 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/evaluate.py @@ -0,0 +1,84 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from tqdm import tqdm + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import SkepForTokenClassification, SkepTokenizer + + +def evaluate(model, data_loader, metric): + + model.eval() + metric.reset() + for batch_data in tqdm(data_loader): + input_ids, token_type_ids, seq_lens, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["seq_len"], + batch_data["labels"], + ) + logits = model(input_ids, token_type_ids=token_type_ids) + + # count metric + predictions = logits.argmax(axis=2) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(seq_lens, predictions, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + + precision, recall, f1 = metric.accumulate() + + return precision, recall, f1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + args = parser.parse_args() + # yapf: enbale + + # load dev data + model_name = "skep_ernie_1.0_large_ch" + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=label2id["O"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + # load model + loaded_state_dict = paddle.load(args.model_path) + model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(label2id)) + model.load_dict(loaded_state_dict) + + metric = ChunkEvaluator(label2id.keys()) + + # evaluate on dev data + precision, recall, f1 = evaluate(model, test_loader, metric) + print(f'evaluation result: precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}') diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/run_evaluate.sh b/applications/sentiment_analysis/ASO_analysis/extraction/run_evaluate.sh new file mode 100644 index 
0000000000000000000000000000000000000000..b57782b3b069a86ef96fcb41c3828cb81cd903f9 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/run_evaluate.sh @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python evaluate.py \ + --model_path "../checkpoints/ext_checkpoints/best.pdparams" \ + --test_path "../data/ext_data/test.txt" \ + --label_path "../data/ext_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 + diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/run_train.sh b/applications/sentiment_analysis/ASO_analysis/extraction/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..0680f1df08afd5b01af4c4e527bb90966686d745 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/run_train.sh @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --train_path "../data/ext_data/train.txt" \ + --dev_path "../data/ext_data/dev.txt" \ + --label_path "../data/ext_data/label.dict" \ + --num_epochs 10 \ + --batch_size 16 \ + --max_seq_len 256 \ + --learning_rate 5e-5 \ + --weight_decay 0.01 \ + --max_grad_norm 1.0 \ + --warmup_proportion 0.1 \ + --log_steps 50 \ + --eval_steps 250 \ + --seed 1000 \ + --device "gpu" \ + --checkpoints "../checkpoints/ext_checkpoints/" diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/train.py b/applications/sentiment_analysis/ASO_analysis/extraction/train.py new file mode 100644 index 0000000000000000000000000000000000000000..6e0a278237271068d1dd73f4631968f514a08950 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/train.py @@ -0,0 +1,147 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import argparse +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from evaluate import evaluate + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + SkepForTokenClassification, + SkepTokenizer, +) + +warnings.filterwarnings("ignore") + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def train(): + # set running envir + model_name = "skep_ernie_1.0_large_ch" + + paddle.set_device(args.device) + set_seed(args.seed) + + if not os.path.exists(args.checkpoints): + os.mkdir(args.checkpoints) + + # load and process data + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"dev": args.dev_path, "train": args.train_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + train_ds = datasets["train"].map(trans_func, batched=False, remove_columns=["text"]) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=label2id["O"]) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_loader = paddle.io.DataLoader(train_ds, batch_sampler=train_batch_sampler, collate_fn=data_collator) + dev_loader = paddle.io.DataLoader(dev_ds, batch_sampler=dev_batch_sampler, collate_fn=data_collator) + + # configure model training + model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(label2id)) + + num_training_steps = len(train_loader) * args.num_epochs + lr_scheduler = LinearDecayWithWarmup( + learning_rate=args.learning_rate, total_steps=num_training_steps, warmup=args.warmup_proportion + ) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + metric = ChunkEvaluator(label2id.keys()) + + # start to train model + global_step, best_f1 = 1, 0.0 + model.train() + for epoch in range(1, args.num_epochs + 1): + for batch_data in train_loader(): + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + loss, logits = model(input_ids, token_type_ids=token_type_ids, labels=labels) + + loss.backward() + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step > 0 and global_step % args.log_steps == 0: + print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.item():.6f}") + if (global_step > 0 and global_step % args.eval_steps == 0) or global_step == num_training_steps: + precision, recall, f1 = evaluate(model, dev_loader, metric) + model.train() + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + paddle.save(model.state_dict(), 
f"{args.checkpoints}/best.pdparams") + print(f"evaluation result: precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}") + + global_step += 1 + + paddle.save(model.state_dict(), f"{args.checkpoints}/final.pdparams") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--num_epochs", type=int, default=3, help="Number of epoches for training.") + parser.add_argument("--train_path", type=str, default=None, help="The path of train set.") + parser.add_argument("--dev_path", type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate for optimizer.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max grad norm to clip gradient.") + parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Linear warmup proportion over the training process.") + parser.add_argument("--log_steps", type=int, default=50, help="Frequency of printing log.") + parser.add_argument("--eval_steps", type=int, default=500, help="Frequency of performing evaluation.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--checkpoints", type=str, default=None, help="Directory to save checkpoint.") + + args = parser.parse_args() + # yapf: enable + + train() diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/design_cls_model.png b/applications/sentiment_analysis/ASO_analysis/imgs/design_cls_model.png new file mode 100644 index 0000000000000000000000000000000000000000..2aac511d38bc4aa6fcf82e05d3aa4c0725859744 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/design_cls_model.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/design_ext_model.png b/applications/sentiment_analysis/ASO_analysis/imgs/design_ext_model.png new file mode 100644 index 0000000000000000000000000000000000000000..14e4f72fb775658899b971724414c6f976731986 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/design_ext_model.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/labeling_example.png b/applications/sentiment_analysis/ASO_analysis/imgs/labeling_example.png new file mode 100644 index 0000000000000000000000000000000000000000..21a0a2e9c1f36c695da75f576d8d680fa7fd0c88 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/labeling_example.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/sentiment_system.png b/applications/sentiment_analysis/ASO_analysis/imgs/sentiment_system.png new file mode 100644 index 0000000000000000000000000000000000000000..bff450c7c60f2e84d84db1ac2084be1b8f918ce3 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/sentiment_system.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/README.md 
b/applications/sentiment_analysis/ASO_analysis/pp_minilm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f84515db981c8bbb40d6a74e0452d93e95fab9dc --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/README.md @@ -0,0 +1,136 @@ +# 基于 PP-MiniLM 的小模型优化策略 + +本项目中,无论是评论观点抽取模型,还是属性级情感分类模型,使用的均是 Large 版的 SKEP 模型,考虑到企业用户在线上部署时会考虑到模型预测效率,所以本项目提供了开源小模型 [PP-MiniLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm) 及量化加速方案,大幅提升预测性能。 + +在本项目中,我们基于 PP-MiniLM 中文特色小模型进行 fine-tune 属性级情感分类模型,然后使用 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 进行模型量化,减小模型规模,加快模型预测性能。 + +## 1. 基于 PP-MiniLM 训练属性级情感分类模型 + +本实验的方案设计和基于 SKEP 的细粒度情感分类一样,有需要的同学请移步[这里](./../classification/README.md),这里不再赘述。 + +### 1.1 项目结构说明 +以下是本项目运行的完整目录结构及说明: + +```shell +. +├── data.py # 数据处理脚本 +├── model.py # 模型组网脚本 +├── train.py # 模型训练脚本 +├── evaluate.py # 模型评估脚本 +├── quant_post.py # 模型量化脚本 +├── performance_test.py # 静态图预测脚本 +├── run_train.sh # 模型训练命令 +├── run_evaluate.sh # 模型评估命令 +├── run_quant.sh # 模型量化命令 +├── run_performance_test.sh # 静态图预测命令 +└── README.md +``` + +### 1.2 数据说明 + +本实验数据和基于SKEP的细粒度情感分类实验所用数据是同一份,如果已将数据下载,并放入父目录的`data/cls_data/`目录下,则无需重复下载操作。更多信息请参考[这里](../classification/README.md)。 + +### 1.3 模型效果展示 + +在分类模型训练过程中,总共训练了10轮,并选择了评估 F1 得分最高的 best 模型, 下表展示了训练过程中使用的训练参数。我们同时开源了相应的模型,可点击下表的 `PP-MiniLM_cls` 进行下载,下载后将模型重命名为 `best.pdparams`,然后放入父目录的 `checkpoints/pp_checkpoints` 中。 +|Model|训练参数配置|MD5| +| ------------ | ------------ |-----------| +|[PP-MiniLM_cls](https://bj.bcebos.com/paddlenlp/models/best_mini.pdparams)|
learning_rate: 3e-5, batch_size: 16, max_seq_len: 256, epochs: 10
|643d358620e84879921b42d326f97aae| + +我们基于训练过程中的 best 模型在 `cls_data` 验证集 `dev` 和测试集 `test` 上进行了评估测试,模型效果如下表所示: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|PP-MiniLM|dev_set|0.98668|0.99115|0.98891| +|PP-MiniLM|test_set|0.98263|0.98766|0.98514| + +**备注**:以上数据是基于全量数据训练和测试结果,并非 Demo 数据集。 + +### 1.4 模型训练 +通过运行以下命令进行分类小模型训练,模型训练后会默认保存到父目录的`checkpoints/pp_checkpoints/`文件夹下: +```shell +sh run_train.sh +``` + +### 1.5 模型测试 +通过运行以下命令进行分类小模型测试: +```shell +sh run_evaluate.sh +``` + +## 2. 对 PP-MiniLM 小模型进行量化 +本节将基于 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) ,对训练好的 PP-MiniLM 小模型进行量化。具体来讲,本节采用的是静态离线量化方法,即在训练好的模型基础上,使用少量校准数据计算量化因子,便可快速得到量化模型。量化过程中,默认使用 `avg` 的量化策略,对 `matmul/matmul_v2` 算子进行 `channel_wise_abs_max` 类型的量化。 + +首先,需要先将训练好的动态图模型,转为静态图模型,注意这里需要跳到父目录进行操作: +```shell +cd .. +sh run_export_model.sh pp_minilm +``` + +然后,使用如下命令进行量化生成的静态图模型: +```shell +sh run_quant.sh +``` +执行以上命令时,需要使用 `static_model_dir` 指定待量化的模型目录,量化后,模型将会被保存在 `quant_model_dir` 指定的目录中。 + +最后,对量化后的小模型可使用 `performance_test.py` 进行评估, 该脚本主要用于性能测试,如果需要做评估,需要设置 `--eval`,如下所示: +```shell +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 \ + --eval +``` + +## 3. 对量化后的小模型进行性能测试 + +### 3.1 环境要求 +本节需要使用安装有 Paddle Inference 预测库的 [PaddlePaddle 2.2.1](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html) 进行预测,请根据合适的机器环境进行下载安装。若想要得到明显的加速效果,推荐在 NVIDA Tensor Core GPU(如 T4、A10、A100) 上进行测试,若在 V 系列 GPU 卡上测试,由于其不支持 Int8 Tensor Core,将达不到预期的加速效果。 + +**备注**:本项目基于T4进行性能测试。 + +### 3.2 运行方式 +本项目使用了动态 shape 功能 (tuned_dynamic_shape),因此需要设置获取 shape 的范围。Paddle Inference 提供了相应的接口,即首先通过离线输入数据来统计出所有临时 tensor 的 shape 范围,TensorRT 子图的 tensor 输入 shape 范围可直接根据上一步 tune 出来的结果来设置,即可完成自动 shape 范围设置。统计完成后,只需设置统计结果路径,即可启用 tuned_dynamic_shape 功能。 + +在本案例中,进行性能测试的脚本为 `performance_test.py`,需要先设置 `--collect_shape` 参数,然后再取消传入这个参数,再次运行 `performance_test.py`。可通过设置 `--num_epochs` 计算多轮运行时间,然后取平均时间作为最终结果,具体使用方式如下: + +首先,设置 `--collect_shape` 参数,生成 shape range info 文件: +```shell +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 1 \ + --batch_size 16 \ + --max_seq_len 256 \ + --use_tensorrt \ + --int8 \ + --collect_shape +``` +然后,开始进行性能测试: +```shell +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 10 \ + --batch_size 16 \ + --max_seq_len 256 \ + --use_tensorrt \ + --int8 \ +``` + + +## 4. 
PP-MiniLM 模型效果展示 + +关于 SKEP-Large、PP-MiniLM、量化PP-MiniLM 三个模型在性能和效果方面的对比如下表所示。可以看到,三者在本任务数据集上的评估指标几乎相等,但是 PP-MiniLM 小模型运行速度较 SKEP-Large 提高了4倍,量化后的 PP-MiniLM 运行速度较 SKEP-Large 提高了近8倍。 + +|Model|运行时间(s)|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|1.00x|0.98497|0.99139|0.98817| +|PP-MiniLM|4.95x|0.98263|0.98766|0.98514| +|量化 PP-MiniLM|8.93x|0.97696|0.98720|0.98205| diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/data.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..fcbaf5110cbda5aac5ab4d22f6d46028595cdf5b --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/data.py @@ -0,0 +1,38 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + if not is_test: + label = int(example[0]) + aspect_text = example[1] + text = example[2] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + encoded_inputs["label"] = label + else: + aspect_text = example[0] + text = example[1] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/evaluate.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..cefe2bab0cd44ee745fe0959e3c8693765577845 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/evaluate.py @@ -0,0 +1,79 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
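+
+# PP-MiniLM 属性级情感分类模型评估脚本:加载微调后的 PPMiniLMForSequenceClassification 权重,
+# 使用 AccuracyAndF1 在测试集上计算 accuracy / precision / recall / F1。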
+ +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import PPMiniLMForSequenceClassification, PPMiniLMTokenizer + + +def evaluate(model, data_loader, metric): + + model.eval() + metric.reset() + for batch_data in data_loader: + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + logits = model(input_ids, token_type_ids=token_type_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + + accuracy, precision, recall, f1, _ = metric.accumulate() + + return accuracy, precision, recall, f1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", type=str, default=None, help="The name of base model.") + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + args = parser.parse_args() + # yapf: enbale + + # load dev data + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + # load model + loaded_state_dict = paddle.load(args.model_path) + model = PPMiniLMForSequenceClassification.from_pretrained(args.base_model_name, num_classes=len(label2id)) + model.load_dict(loaded_state_dict) + + metric = AccuracyAndF1() + + # evaluate on dev data + accuracy, precision, recall, f1 = evaluate(model, test_loader, metric) + print(f'evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}') diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/performance_test.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/performance_test.py new file mode 100644 index 0000000000000000000000000000000000000000..89e4e884aa81d13634783ae4eb61457c2ea09825 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/performance_test.py @@ -0,0 +1,173 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from paddle import inference + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics import AccuracyAndF1 +from paddlenlp.transformers import PPMiniLMTokenizer + + +class Predictor(object): + def __init__(self, args): + self.predictor, self.input_handles, self.output_handles = self.create_predictor(args) + + def create_predictor(self, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + paddle.set_device("gpu") + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + paddle.set_device("cpu") + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + if args.use_tensorrt: + if args.int8: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=inference.PrecisionType.Int8, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + else: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=inference.PrecisionType.Float32, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled())) + if args.collect_shape: + config.collect_shape_range_info( + os.path.join(os.path.dirname(args.model_path), "collect_shape_range_info.pbtxt") + ) + else: + config.enable_tuned_tensorrt_dynamic_shape( + os.path.join(os.path.dirname(args.model_path), "collect_shape_range_info.pbtxt"), True + ) + + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + return predictor, input_handles, output_handles + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + + return output + + def predict(self, data_loader, metric): + + outputs = [] + metric.reset() + for i, data in enumerate(data_loader): + output = self.predict_batch([data[0], data[1]]) + logits = paddle.to_tensor(output).squeeze(0) + correct = metric.compute(logits, paddle.to_tensor(data[3])) + metric.update(correct) + outputs.append(output) + + accuracy, precision, recall, F1, _ = metric.accumulate() + return outputs, accuracy, precision, recall, F1 + + def predict_perf(self, args, data_loader): + start_time = time.time() + for i, data in enumerate(data_loader): + if i < args.perf_warmup_steps: # skip warmup steps. 
+ continue + output = self.predict_batch([data["input_ids"], data["token_type_ids"]]) + paddle.to_tensor(output) + + used_time = time.time() - start_time + return used_time + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", type=str, default=None, help="The name of base model.") + parser.add_argument("--model_path", default='./checkpoints/quant/infer', type=str, required=True, help="The path prefix of inference model to be used.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--num_epochs", type=int, default=0, help="Number of epoches for training.") + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--perf_warmup_steps", default=1, type=int, help="Warmup steps for performance test.") + parser.add_argument("--use_tensorrt", action='store_true', help="Whether to use inference engin TensorRT.") + parser.add_argument("--eval", action='store_true', help="Whether to test performance.") + parser.add_argument("--collect_shape", action='store_true', help="Whether collect shape range info.") + parser.add_argument("--int8", action='store_true', help="Whether to use int8 inference.") + parser.add_argument("--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference.") + + args = parser.parse_args() + # yapf: enable + + # set running environment + paddle.seed(42) + + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len, is_test=False + ) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=batch_sampler, collate_fn=data_collator, num_workers=0, return_list=True + ) + + predictor = Predictor(args) + + if args.num_epochs > 0: + print("start to do performance task.") + times = [] + for epoch_id in range(1, args.num_epochs + 1): + used_time = predictor.predict_perf(args, data_loader) + times.append(used_time) + print(f"epoch {epoch_id}, used_time: {used_time}") + print(f"the avg time of {args.num_epochs} epochs is {np.mean(times)}") + + if args.eval: + print("start to do evaluate task.") + metric = AccuracyAndF1() + outputs, accuracy, precision, recall, F1 = predictor.predict(data_loader, metric) + print( + f"evalute results - accuracy: {accuracy: .5f}, precision: {precision: .5f}, recall: {recall: .5f}, F1: {F1: .5f}" + ) diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/quant_post.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/quant_post.py new file mode 100644 index 0000000000000000000000000000000000000000..57d7acea768074642cc47b8f42bbc9cf8538f5e5 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/quant_post.py @@ -0,0 +1,92 @@ +# Copyright (c) 2021 
PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset + +from paddlenlp.data import Pad +from paddlenlp.transformers import PPMiniLMTokenizer + +import paddleslim # isort: skip paddleslim needs to be imported last for some overrides to kick in + + +def quant_post(args): + place = paddle.set_device("gpu") + exe = paddle.static.Executor(place) + + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"dev": args.dev_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + def batch_generator_func(): + batch_data = [[], []] + for data in dev_ds: + batch_data[0].append(data["input_ids"]) + batch_data[1].append(data["token_type_ids"]) + if len(batch_data[0]) == args.batch_size: + input_ids = Pad(axis=0, pad_val=0, dtype="int64")(batch_data[0]) + segment_ids = Pad(axis=0, pad_val=0, dtype="int64")(batch_data[1]) + yield [input_ids, segment_ids] + batch_data = [[], []] + + paddleslim.quant.quant_post_static( + exe, + args.static_model_dir, + args.quant_model_dir, + save_model_filename=args.save_model_filename, + save_params_filename=args.save_params_filename, + algo=args.algorithm, + hist_percent=0.9999, + batch_generator=batch_generator_func, + model_filename=args.input_model_filename, + params_filename=args.input_param_filename, + quantizable_op_type=["matmul", "matmul_v2"], + weight_bits=8, + weight_quantize_type="channel_wise_abs_max", + batch_nums=1, + ) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", type=str, default="ppminilm-6l-768h", help="The path of ppminilm model.") + parser.add_argument("--static_model_dir", type=str, default="./checkpoints/static", help="Directory of static model that will be quantized.") + parser.add_argument("--quant_model_dir", type=str, default=None, help="Directory of the quantized model that will be written.") + parser.add_argument("--algorithm", type=str, default="avg", help="Quantize algorithm that you want to choice, such as abs_max, avg, mse, hist.") + parser.add_argument('--dev_path', type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--save_model_filename", type=str, default="infer.pdmodel", required=False, help="File name of quantified model.") + 
parser.add_argument("--save_params_filename", type=str, default="infer.pdiparams", required=False, help="File name of quantified model's parameters.") + parser.add_argument("--input_model_filename", type=str, default="infer.pdmodel", required=False, help="File name of float model.") + parser.add_argument("--input_param_filename", type=str, default="infer.pdiparams", required=False, help="File name of float model's parameters.") + + args = parser.parse_args() + # yapf: enable + + # start quantize model + paddle.enable_static() + quant_post(args) + print(f"quantize model done. the quantized model has been saved to {args.quant_model_dir}") diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_evaluate.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..5bf7ad5f777f171eeea392a1684e2cb524c7dc08 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_evaluate.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python evaluate.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/best.pdparams" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 + diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_performance_test.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_performance_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..4d720b86329283bedf18b659a64cb062c1459916 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_performance_test.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +export CUDA_VISIBLE_DEVICES=0 + +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 10 \ + --batch_size 16 \ + --max_seq_len 256 \ No newline at end of file diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_quant.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_quant.sh new file mode 100644 index 0000000000000000000000000000000000000000..e11dd1b1d6d2ecd0a05552146c9fd866ec97dbb6 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_quant.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python quant_post.py \ + --base_model_name "ppminilm-6l-768h" \ + --static_model_dir "../checkpoints/pp_checkpoints/static" \ + --quant_model_dir "../checkpoints/pp_checkpoints/quant" \ + --algorithm "avg" \ + --dev_path "../data/cls_data/dev.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 4 \ + --max_seq_len 256 \ + --save_model_filename "infer.pdmodel" \ + --save_params_filename "infer.pdiparams" \ + --input_model_filename "infer.pdmodel" \ + --input_param_filename "infer.pdiparams" + diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_train.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..44a1851a01e4803e01fa877bb4dbe2b6f20893a2 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_train.sh @@ -0,0 +1,33 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
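+
+# 基于 ppminilm-6l-768h 微调属性级情感分类模型,checkpoint 默认保存至 ../checkpoints/pp_checkpoints/。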
+ +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --base_model_name "ppminilm-6l-768h" \ + --train_path "../data/cls_data/train.txt" \ + --dev_path "../data/cls_data/dev.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 5 \ + --batch_size 16 \ + --max_seq_len 256 \ + --learning_rate 3e-5 \ + --weight_decay 0.01 \ + --max_grad_norm 1.0 \ + --warmup_proportion 0.1 \ + --log_steps 50 \ + --eval_steps 100 \ + --seed 1000 \ + --device "gpu" \ + --checkpoints "../checkpoints/pp_checkpoints/" diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/train.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..0fd724a5a2b2a0ff7bdef8f0232cce0d7cdcb362 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/train.py @@ -0,0 +1,149 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from evaluate import evaluate + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + PPMiniLMForSequenceClassification, + PPMiniLMTokenizer, +) + +warnings.filterwarnings("ignore") + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def train(): + # set running envir + paddle.set_device(args.device) + set_seed(args.seed) + + if not os.path.exists(args.checkpoints): + os.mkdir(args.checkpoints) + + # load and process data + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"train": args.train_path, "dev": args.dev_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + train_ds = datasets["train"].map(trans_func, batched=False, remove_columns=["text"]) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_loader = paddle.io.DataLoader(train_ds, batch_sampler=train_batch_sampler, collate_fn=data_collator) + dev_loader = paddle.io.DataLoader(dev_ds, batch_sampler=dev_batch_sampler, collate_fn=data_collator) + + # configure model training + model = PPMiniLMForSequenceClassification.from_pretrained(args.base_model_name, num_classes=len(label2id)) + + num_training_steps = len(train_loader) * args.num_epochs + lr_scheduler = 
LinearDecayWithWarmup( + learning_rate=args.learning_rate, total_steps=num_training_steps, warmup=args.warmup_proportion + ) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + metric = AccuracyAndF1() + + # start to train model + global_step, best_f1 = 1, 0.0 + model.train() + for epoch in range(1, args.num_epochs + 1): + for batch_data in train_loader(): + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + logits = model(input_ids, token_type_ids=token_type_ids) + loss = F.cross_entropy(logits, labels) + + loss.backward() + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step > 0 and global_step % args.log_steps == 0: + print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.item():.6f}") + if (global_step > 0 and global_step % args.eval_steps == 0) or global_step == num_training_steps: + accuracy, precision, recall, f1 = evaluate(model, dev_loader, metric) + model.train() + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + paddle.save(model.state_dict(), f"{args.checkpoints}/best.pdparams") + print( + f"evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}" + ) + + global_step += 1 + + paddle.save(model.state_dict(), f"{args.checkpoints}/final.pdparams") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--base_model_name", type=str, default=None, help="The name of base model.") + parser.add_argument("--train_path", type=str, default=None, help="The path of train set.") + parser.add_argument("--dev_path", type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--num_epochs", type=int, default=3, help="Number of epoches for fine-tuning.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate for optimizer.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max grad norm to clip gradient.") + parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy") + parser.add_argument("--log_steps", type=int, default=50, help="Frequency of printing log.") + parser.add_argument("--eval_steps", type=int, default=500, help="Frequency of performing evaluation.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--checkpoints", type=str, default=None, help="Directory to save 
checkpoint.") + + args = parser.parse_args() + # yapf: enable + + train() diff --git a/applications/sentiment_analysis/ASO_analysis/predict.py b/applications/sentiment_analysis/ASO_analysis/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..61c518ea49dd349c39dcb46fab72f7faaf994b9b --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/predict.py @@ -0,0 +1,216 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import json +import re +from collections import defaultdict +from functools import partial + +import paddle +from classification.data import ( + convert_example_to_feature as convert_example_to_feature_cls, +) +from datasets import Dataset, load_dataset +from extraction.data import convert_example_to_feature as convert_example_to_feature_ext +from utils import decoding, load_dict + +from paddlenlp.data import DataCollatorForTokenClassification, DataCollatorWithPadding +from paddlenlp.transformers import ( + SkepForSequenceClassification, + SkepForTokenClassification, + SkepTokenizer, +) + + +def concate_aspect_and_opinion(text, aspect, opinions): + aspect_text = "" + for opinion in opinions: + if text.find(aspect) <= text.find(opinion): + aspect_text += aspect + opinion + "," + else: + aspect_text += opinion + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +def remove_blanks(example): + example["text"] = re.sub(" +", "", example["text"]) + return example + + +def predict_ext(args): + # load dict and dataset + model_name = "skep_ernie_1.0_large_ch" + ext_label2id, ext_id2label = load_dict(args.ext_label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + datasets["test"] = datasets["test"].map(remove_blanks) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature_ext, + tokenizer=tokenizer, + label2id=ext_label2id, + max_seq_len=args.ext_max_seq_len, + is_test=True, + ) + test_ds = copy.copy(datasets["test"]).map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=ext_label2id["O"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + print("test data loaded.") + + # load ext model + ext_state_dict = paddle.load(args.ext_model_path) + ext_model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(ext_label2id)) + ext_model.load_dict(ext_state_dict) + print("extraction model loaded.") + + ext_model.eval() + results = [] + for bid, batch_data in enumerate(test_loader): + input_ids, token_type_ids, seq_lens = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["seq_len"], + ) + logits = ext_model(input_ids, token_type_ids=token_type_ids) + + predictions = 
logits.argmax(axis=2).numpy() + for eid, (seq_len, prediction) in enumerate(zip(seq_lens, predictions)): + idx = bid * args.batch_size + eid + tag_seq = [ext_id2label[idx] for idx in prediction[:seq_len][1:-1]] + text = datasets["test"][idx]["text"] + aps = decoding(text[: args.ext_max_seq_len - 2], tag_seq) + for aid, ap in enumerate(aps): + aspect, opinions = ap[0], list(set(ap[1:])) + aspect_text = concate_aspect_and_opinion(text, aspect, opinions) + results.append( + { + "id": str(idx) + "_" + str(aid), + "aspect": aspect, + "opinions": opinions, + "text": text, + "aspect_text": aspect_text, + } + ) + + return results + + +def predict_cls(args, ext_results): + # load dict + model_name = "skep_ernie_1.0_large_ch" + cls_label2id, cls_id2label = load_dict(args.cls_label_path) + text_list = [] + for result in ext_results: + example = result["aspect_text"] + "\t" + result["text"] + text_list.append(example) + ext_results = {"text": text_list} + dataset = Dataset.from_dict(ext_results) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature_cls, + tokenizer=tokenizer, + label2id=cls_label2id, + max_seq_len=args.cls_max_seq_len, + is_test=True, + ) + + test_ds = dataset.map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + print("test data loaded.") + + # load cls model + cls_state_dict = paddle.load(args.cls_model_path) + cls_model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(cls_label2id)) + cls_model.load_dict(cls_state_dict) + print("classification model loaded.") + + cls_model.eval() + + results = [] + for bid, batch_data in enumerate(test_loader): + input_ids, token_type_ids = batch_data["input_ids"], batch_data["token_type_ids"] + logits = cls_model(input_ids, token_type_ids=token_type_ids) + + predictions = logits.argmax(axis=1).numpy().tolist() + results.extend(predictions) + + results = [cls_id2label[pred_id] for pred_id in results] + return results + + +def post_process(ext_results, cls_results): + assert len(ext_results) == len(cls_results) + + collect_dict = defaultdict(list) + for ext_result, cls_result in zip(ext_results, cls_results): + ext_result["sentiment_polarity"] = cls_result + eid, _ = ext_result["id"].split("_") + collect_dict[eid].append(ext_result) + + sentiment_results = [] + for eid in collect_dict.keys(): + sentiment_result = {} + ap_list = [] + for idx, single_ap in enumerate(collect_dict[eid]): + if idx == 0: + sentiment_result["text"] = single_ap["text"] + ap_list.append( + { + "aspect": single_ap["aspect"], + "opinions": single_ap["opinions"], + "sentiment_polarity": single_ap["sentiment_polarity"], + } + ) + sentiment_result["ap_list"] = ap_list + sentiment_results.append(sentiment_result) + + with open(args.save_path, "w", encoding="utf-8") as f: + for sentiment_result in sentiment_results: + f.write(json.dumps(sentiment_result, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--ext_model_path", type=str, default=None, help="The path of extraction model path that you want to load.") + parser.add_argument("--cls_model_path", type=str, default=None, help="The path of classification model path that you want to load.") + 
parser.add_argument("--ext_label_path", type=str, default=None, help="The path of extraction label dict.") + parser.add_argument("--cls_label_path", type=str, default=None, help="The path of classification label dict.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set that you want to predict.") + parser.add_argument('--save_path', type=str, required=True, default=None, help="The saving path of predict results.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--ext_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for extraction model.") + parser.add_argument("--cls_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for classification model.") + args = parser.parse_args() + # yapf: enable + + # predict with ext model + ext_results = predict_ext(args) + print("predicting with extraction model done!") + + # predict with cls model + cls_results = predict_cls(args, ext_results) + print("predicting with classification model done!") + + # post_process prediction results + post_process(ext_results, cls_results) + print(f"sentiment analysis results has been saved to path: {args.save_path}") diff --git a/applications/sentiment_analysis/ASO_analysis/run_demo.sh b/applications/sentiment_analysis/ASO_analysis/run_demo.sh new file mode 100644 index 0000000000000000000000000000000000000000..ea87c8ffa49433c3a461ac31d1c57a121833e0d1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/run_demo.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python demo.py \ + --ext_model_path "./checkpoints/ext_checkpoints/best.pdparams" \ + --cls_model_path "./checkpoints/cls_checkpoints/best.pdparams" \ + --ext_label_path "./data/ext_data/label.dict" \ + --cls_label_path "./data/cls_data/label.dict" \ + --ext_max_seq_len 512 \ + --cls_max_seq_len 256 + diff --git a/applications/sentiment_analysis/ASO_analysis/run_export_model.sh b/applications/sentiment_analysis/ASO_analysis/run_export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..177ad4307ebdbe4cb9fd7ec9f9c1008158468751 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/run_export_model.sh @@ -0,0 +1,41 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +model_type=$1 + +if [ ! $model_type ]; then +echo "Please enter the correct export model type, for example: sh run_export extraction" +elif [ $model_type = extraction ]; then +python export_model.py \ + --model_type "extraction" \ + --model_path "./checkpoints/ext_checkpoints/best.pdparams" \ + --save_path "./checkpoints/ext_checkpoints/static/infer" + +elif [ $model_type = classification ]; then +python export_model.py \ + --model_type "classification" \ + --model_path "./checkpoints/cls_checkpoints/best.pdparams" \ + --save_path "./checkpoints/cls_checkpoints/static/infer" + +elif [ $model_type = pp_minilm ]; then +python export_model.py \ + --model_type "pp_minilm" \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "./checkpoints/pp_checkpoints/best.pdparams" \ + --save_path "./checkpoints/pp_checkpoints/static/infer" +else +echo "Three model_types are supported: [extraction, classification, pp_minilm]" +fi diff --git a/applications/sentiment_analysis/ASO_analysis/run_predict.sh b/applications/sentiment_analysis/ASO_analysis/run_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..8fbb7a624edc3e708a69e89b42c284467cc78ce5 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/run_predict.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python predict.py \ + --ext_model_path "./checkpoints/ext_checkpoints/best.pdparams" \ + --cls_model_path "./checkpoints/cls_checkpoints/best.pdparams" \ + --test_path "./data/test.txt" \ + --ext_label_path "./data/ext_data/label.dict" \ + --cls_label_path "./data/cls_data/label.dict" \ + --save_path "./data/sentiment_results.json" \ + --batch_size 8 \ + --ext_max_seq_len 512 \ + --cls_max_seq_len 256 diff --git a/applications/sentiment_analysis/ASO_analysis/utils.py b/applications/sentiment_analysis/ASO_analysis/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..88264c549ab041fdd1f3937f6b3b039e25a7bfec --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/utils.py @@ -0,0 +1,133 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
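+
+# 公共工具函数:随机种子设置、标签词典读写、测试数据读取,
+# 以及序列标注结果解码(decoding)与属性-观点词拼接(concate_aspect_and_opinion)等。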
+ +import random + +import numpy as np +import paddle +from seqeval.metrics.sequence_labeling import get_entities + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def read_test_file(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip().replace(" ", "") + yield {"text": line} + + +def decoding(text, tag_seq): + assert len(text) == len(tag_seq), f"text len: {len(text)}, tag_seq len: {len(tag_seq)}" + + puncs = list(",.?;!,。?;!") + splits = [idx for idx in range(len(text)) if text[idx] in puncs] + + prev = 0 + sub_texts, sub_tag_seqs = [], [] + for i, split in enumerate(splits): + sub_tag_seqs.append(tag_seq[prev:split]) + sub_texts.append(text[prev:split]) + prev = split + sub_tag_seqs.append(tag_seq[prev:]) + sub_texts.append((text[prev:])) + + ents_list = [] + for sub_text, sub_tag_seq in zip(sub_texts, sub_tag_seqs): + ents = get_entities(sub_tag_seq, suffix=False) + ents_list.append((sub_text, ents)) + + aps = [] + no_a_words = [] + for sub_tag_seq, ent_list in ents_list: + sub_aps = [] + sub_no_a_words = [] + for ent in ent_list: + ent_name, start, end = ent + if ent_name == "Aspect": + aspect = sub_tag_seq[start : end + 1] + sub_aps.append([aspect]) + if len(sub_no_a_words) > 0: + sub_aps[-1].extend(sub_no_a_words) + sub_no_a_words.clear() + else: + ent_name == "Opinion" + opinion = sub_tag_seq[start : end + 1] + if len(sub_aps) > 0: + sub_aps[-1].append(opinion) + else: + sub_no_a_words.append(opinion) + + if sub_aps: + aps.extend(sub_aps) + if len(no_a_words) > 0: + aps[-1].extend(no_a_words) + no_a_words.clear() + elif sub_no_a_words: + if len(aps) > 0: + aps[-1].extend(sub_no_a_words) + else: + no_a_words.extend(sub_no_a_words) + + if no_a_words: + no_a_words.insert(0, "None") + aps.append(no_a_words) + + return aps + + +def concate_aspect_and_opinion(text, aspect, opinions): + aspect_text = "" + for opinion in opinions: + if text.find(aspect) <= text.find(opinion): + aspect_text += aspect + opinion + "," + else: + aspect_text += opinion + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +def save_examples(examples, save_path, idxs): + with open(save_path, "w", encoding="utf-8") as f: + for idx in idxs: + line = "\t".join(examples[idx]) + "\n" + f.write(line) + + +def save_dict(dict_path, dict_type): + if dict_type not in ["ext", "cls"]: + raise ValueError("Only ext/cls should be accepted for dict_type.") + + with open(dict_path, "w", encoding="utf-8") as f: + if dict_type == "ext": + label_list = ["O", "B-Aspect", "I-Aspect", "B-Opinion", "I-Opinion"] + else: + label_list = ["负向", "正向"] + + for label in label_list: + f.write(label + "\n") diff --git a/applications/sentiment_analysis/README.md b/applications/sentiment_analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3cb8501e0d39dea82e7c84179a1c12147306b85f --- /dev/null +++ b/applications/sentiment_analysis/README.md @@ -0,0 +1,41 @@ +# 情感分析应用 + +## **1. 
情感分析简介** +情感分析(sentiment analysis)是近年来国内外研究的热点,旨在对带有情感色彩的主观性文本进行分析、处理、归纳和推理。情感分析具有广泛的应用场景,可以被应用于消费决策、舆情分析、个性化推荐等领域。 + +按照分析粒度可以大致分为三类:篇章级的情感分析(Document-Level Sentiment Classification)、语句级的情感分析(Sentence-Level Sentiment Classification)和属性级的情感分析(Aspect-Level Sentiment Classification)。其中属性级的情感分析又包含多项子任务,例如属性抽取(Aspect Term Extraction)、观点抽取(Opinion Term Extraction)、属性级情感分析(Aspect-Based Sentiment Classification)等。 + +
+ +
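+
+例如,对于评论"房间很大,但是隔音不好",属性级的情感分析可以抽取出(房间, 很大, 正向)和(隔音, 不好, 负向)两组"属性-观点-情感倾向"信息。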
+ + + +## **2. 情感分析项目介绍** + +PaddleNLP情感分析应用立足真实企业用户对情感分析方面的需求,同时针对情感分析领域的痛点和难点,基于前沿模型开源了细粒度的情感分析解决方案,助力开发者快速分析业务相关产品或服务的用户感受。针对情感分析应用,本项目不仅提供了基于Taskflow开箱即用的情感分析能力,还提供了从输入数据到情感分析结果可视化的能力,另外考虑到一些企业用户需要针对业务场景进行适配,本项目同时提供了完整的情感分析定制方案:数据标注 - 模型训练 - 模型测试 - 模型部署 - 情感分析可视化。 + +当前PaddleNLP情感分析应用更多聚焦于属性级的情感分析,支持文本评论中关于属性、观点词和情感倾向方面的分析。当前提供了两种情感分析方案:基于通用信息抽取模型UIE的情感分析方案和基于情感知识增强模型SKEP的情感分析方案。 + +基于UIE的情感分析方案采用 Prompt Learning 的方式进行情感信息抽取,该分析方式需要预先定义情感信息抽取的schema,然后通过该schema逐步分析和抽取情感信息。 相比基于SKEP的情感分析方案,UIE方案在测试中表现出了更好的效果。在测试中,通过精确匹配的方式对比抽取的 属性、情感倾向和观点词 三者信息,即当三者全部匹配才算抽取正确,下表展示了此次测试的评测指标: + +| 模型 | 权重 | Precision | Recall | F1 | +| :---: | :--------: | :--------: | :--------: | :--------: | +| `SKEP` | `skep_ernie_1.0_large_ch` | 0.76368 | 0.74710 | 0.75530 | +| `uie` | `uie-senta-base` | 0.89593 | 0.86125 | 0.87825 | + + +基于SKEP的情感分析方案主要采用两阶段式的情感分析抽取,首先通过序列标注的方式定位属性词和观点词,然后通过结合属性词和观点词两者信息进行属性情感极性分类。相比基于UIE的情感分析方案,基于SKEP的情感分析方案具有更快的预测速度。下表展示了在测试集上平均每分钟预测的样本数,可以看到SKEP方案的预测速度显著快于UIE方案。 + +| 模型 | 权重 | 预测样本数/m | +| :---: | :--------: | :--------: | +| `SKEP` | `skep_ernie_1.0_large_ch` | 3428 | +| `uie` | `uie-senta-base` | 1104 | + +备注: 当前只有基于UIE的方案支持情感分析结果可视化能力,基于SKEP的方案暂不支持。 + +## **3. 快速开始** + +- 👉 [基于UIE的情感分析方案](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/unified_sentiment_extraction) + +- 👉 [基于SKEP的情感分析方案](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/ASO_analysis) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/.gitignore b/applications/sentiment_analysis/unified_sentiment_extraction/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..e21e51f32d9c494cadaf1884d0bed4c8bf95e09b --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/.gitignore @@ -0,0 +1,10 @@ +checkpoint/* +data/* +export/* +images/* +outputs/* +log/* +uie-base/* +SimHei.ttf +*.sh +myhttp.py diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/README.md b/applications/sentiment_analysis/unified_sentiment_extraction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..37e52fc8957e2015f5b68ffc2b887543a107fc0a --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/README.md @@ -0,0 +1,884 @@ +# 通用情感信息抽取 + +## **目录** +## **目录** +- [1. 情感分析应用简介](#1) +- [2. 特色介绍](#2) +- [3. 运行环境](#3) +- [4. 整体功能介绍与Taskflow快速体验](#4) + - [4.1 开箱即用的情感分析能力](#4.1) + - [4.1.1 语句级情感分析](#4.1.1) + - [4.1.2 属性级情感分析](#4.1.2) + - [4.1.3 多版本模型选择](#4.1.3) + - [4.2 批量处理:从数据到情感分析可视化](#4.2) + - [4.2.1 数据描述](#4.2.1) + - [4.2.2 批量情感分析](#4.2.2) + - [4.2.3 情感分析可视化](#4.2.3) + - [4.2.3.1 一键生成情感分析结果](#4.2.3.1) + - [4.2.3.2 情感分析详细展示](#4.2.3.2) +- [5. 更进一步:结合业务分析经验,定制情感分析](#5) + - [5.1 打通数据标注到训练样本构建](#5.1) + - [5.1.1 样本构建:语句级情感分类任务](#5.1.1) + - [5.1.2 样本构建:属性抽取相关任务](#5.1.2) + - [5.1.3 样本构建升级1:加强属性聚合能力](#5.1.3) + - [5.1.4 样本构建升级2:加强隐性观点抽取能力](#5.1.4) + - [5.2 模型训练](#5.2) + - [5.3 模型测试](#5.3) + - [5.4 模型预测及效果展示](#5.4) + - [5.4.1 使用训练后的模型进行预测](#5.4.1) + - [5.4.2 属性聚合预测和分析](#5.4.2) + - [5.4.3 隐性观点词抽取预测和分析](#5.4.3) +- [6. 模型部署](#6) + - [6.1 基于SimpleServer进行服务化部署](#6.1) + - [6.2 基于Pipeline进行部署](#6.2) + + + +## **1. 情感分析应用简介** + +PaddleNLP情感分析应用立足真实企业用户对情感分析方面的需求,针对情感分析领域的痛点和难点,提供基于前沿模型的情感分析解决方案,助力开发者快速分析业务相关产品或服务的用户感受。 + +本项目以通用信息抽取模型UIE为训练底座,提供了语句级情感分析和属性级情感分析能力、覆盖情感分类、属性抽取、观点抽取等常用情感分析能力,如下图所示。同时提供了可视化能力,支持从输入数据到情感分析结果可视化,帮助用户快速分析业务数据。更进一步地,本项目同时支持基于业务数据进行定制训练,同时支持引入业务侧积累的经验和知识,包括同义属性和隐性观点词表,加强模型进行属性聚合和隐性观点抽取的能力,进一步提高模型对于业务场景数据的分析能力。 + +
+ +
+ + + +## **2. 特色介绍** + +- **功能丰富🎓**:提供情感分析训练模型作为底座,支持语句级情感分析和属性级情感分析,覆盖情感分类、属性与观点抽取、同义属性聚合、隐性观点抽取、可视化分析等常见情感分析任务。 +- **效果领先✊**: 以通用信息抽取模型UIE为训练底座,具有较强的零样本预测和小样本微调能力。 +- **开箱即用👶**:打通Taskflow使用流程,3行代码获取分析结果,同时提供了情感分析结果可视化能力。 +- **定制模型🏃**:支持针对特定业务场景进行全流程定制,包括数据标注、样本构建、模型训练和模型测试,同时通过融合业务相关的同义属性词和隐性观点词,可进一步提高模型针对业务场景的情感分析能力。 + + + + +## **3. 运行环境** + +**代码结构** +``` +unified_sentiment_extraction/ +├── batch_predict.py # 以文件的形式输入,进行批量预测的脚本 +├── evaluate.py # 模型评估脚本 +├── finetune.py # 模型微调脚本 +├── label_studio.py # 将label-studio导出数据转换为模型输入数据的脚本 +├── label_studio.md # 将label-studio标注说明 +├── utils.py # 工具函数脚本 +├── visual_analysis.py # 情感分析结果可视化脚本 +└── README.md # 使用说明 +``` + +**安装依赖** + +- python == 3.9.12 +- paddlepaddle == 2.3.2 +- paddlenlp == 2.4.5 +- wordcloud == 1.8.2.2 + +**安装PaddlePaddle**: + +环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 具体可以参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。如下命令可以安装linux系统,CUDA版本为10.2环境下的paddlepaddle,具体版本号为支持GPU的2.3.2版本。 + +```shell +conda install paddlepaddle-gpu==2.3.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +``` + +**安装PaddleNLP**: +安装PaddleNLP可以开启百度镜像源来加速下载,更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**安装wordcloud**: + +```shell +python3 -m pip install wordcloud==1.8.2.2 +``` + + + +## **4. 整体功能介绍与Taskflow快速体验** + +本项目以通用信息抽取模型UIE为训练底座,基于大量情感分析数据进一步训练,增强了模型对于情感知识的处理能力,支持语句级情感分类、属性抽取、观点词抽取、属性级情感分类等基础情感分析能力。下表展示了通用UIE `uie-base` 和情感知识增强的UIE `uie-senta-base` 在测试集上的效果对比。 + +| 模型 | Precision | Recall | F1 | +| :---: | :--------: | :--------: | :--------: | +| `uie-base` | 0.86759 | 0.83696 | 0.85200 | +| `uie-senta-base` | 0.93403 | 0.92795 | 0.93098 | + +另外,为方便用户体验和使用,本项目提供的情感分析能力已经集成到了 Taskflow,可以通过Taskflow开箱即用的能力快速体验情感分析的功能。 + + + +### **4.1 开箱即用的情感分析能力** + + + +#### **4.1.1 语句级情感分析** +整句情感分析功能当前支持二分类:正向和负向,调用示例如下: + +```python +>>> from paddlenlp import Taskflow + +>>> schema = ['情感倾向[正向,负向]'] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> print(senta('蛋糕味道不错,店家服务也很好')) + +[ + { + '情感倾向[正向,负向]': [ + { + 'text': '正向', + 'probability': 0.996646058824652 + } + ] + } +] +``` + + + +#### **4.1.2 属性级情感分析** + +除语句级情感分析之外,本项目同时支持属性级情感分析,包括属性抽取(Aspect Term Extraction)、观点抽取(Opinion Term Extraction)、属性级情感分析(Aspect Based Sentiment Classification)等等。可以通过设置相应的schema进行对应信息的抽取,其调用示例如下。 + +```python +>>> from paddlenlp import Taskflow + +>>> # Aspect Term Extraction +>>> # schema = ["评价维度"] +>>> # Aspect - Opinion Extraction +>>> # schema = [{"评价维度":["观点词"]}] +>>> # Aspect - Sentiment Extraction +>>> # schema = [{"评价维度":["情感倾向[正向,负向,未提及]"]}] +>>> # Aspect - Sentiment - Opinion Extraction +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] + +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> print(senta('蛋糕味道不错,店家服务也很热情')) + +[ + { + '评价维度': [ + { + 'text': '服务', + 'start': 9, + 'end': 11, + 'probability': 0.9709093024793489, + 'relations': { + '观点词': [ + { + 'text': '热情', + 'start': 13, + 'end': 15, + 'probability': 0.9897222206316556 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9999327669598301 + } + ] + } + }, + { + 'text': '味道', + 'start': 2, + 'end': 4, + 'probability': 0.9105472387838915, + 'relations': { + '观点词': [ + { + 'text': '不错', + 
'start': 4, + 'end': 6, + 'probability': 0.9946981266891619 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9998829392709467 + } + ] + } + } + ] + } +] +``` + +更进一步地,在某些业务场景中,特别是一些垂域场景,用户可能比较关注固定的某些属性。在这种情况下,可以预先提供相应的属性集合,则本项目将只会在该属性集上进行情感分析,分析和抽取该集合中各个属性的信息。 + +针对固定属性的情感分析示例如下,需要将属性集合传入参数 `aspects` 中,后续将只针对这些属性进行分析。可以看到在示例中,传入了属性 `房间`,`位置` 和 `价格`,针对 `房间` 和 `价格` 均分析到了观点词和情感倾向,但是`位置`由于在样本中并未提及,因此相应观点词为空,情感倾向为 `未提及`。 + +```python +>>> # define schema for pre-defined aspects, schema +>>> schema = ["观点词", "情感倾向[正向,负向,未提及]"] +>>> aspects = ["房间", "位置", "价格"] +>>> # set aspects for Taskflow +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, aspects=aspects) +>>> print(senta("这家店的房间很大,店家服务也很热情,就是价格有点贵")) + +[ + { + '评价维度': [ + { + 'text': '房间', + 'relations': { + '观点词': [ + { + 'text': '大', + 'start': 7, + 'end': 8, + 'probability': 0.9998772175681552 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9999312170965595 + } + ] + } + }, + { + 'text': '位置', + 'relations': { + '情感倾向[正向,负向,未提及]': [ + { + 'text': '未提及', + 'probability': 0.9999939203353847 + } + ] + } + }, + { + 'text': '价格', + 'relations': { + '观点词': [ + { + 'text': '贵', + 'start': 24, + 'end': 25, + 'probability': 0.998841669863026 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '负向', + 'probability': 0.9997340617174757 + } + ] + } + } + ] + } +] +``` + + + +#### **4.1.3 多版本模型选择** +为方便用户实际业务应用情况,本项目多个版本的模型,可以根据业务对于精度和速度方面的要求进行选择,下表展示了不同版本模型的结构以及在测试集上的指标。 + +| 模型 | 结构 | Precision | Recall | F1 | +| :---: | :--------: | :--------: | :--------: | :--------: | +| `uie-senta-base` (默认) | 12-layers, 768-hidden, 12-heads | 0.93403 | 0.92795 | 0.93098 | +| `uie-senta-medium` | 6-layers, 768-hidden, 12-heads | 0.93146 | 0.92137 | 0.92639 | +| `uie-senta-mini` | 6-layers, 384-hidden, 12-heads | 0.91799 | 0.92028 | 0.91913 | +| `uie-senta-micro` | 4-layers, 384-hidden, 12-heads | 0.91542 | 0.90957 | 0.91248 | +| `uie-senta-nano` | 4-layers, 312-hidden, 12-heads | 0.90817 | 0.90878 | 0.90847 | + +在Taskflow中,可以直接指定相应模型名称进行使用,使用`uie-senta-mini`版本的示例如下: + +```python +>>> from paddlenlp import Taskflow + +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-mini", schema=schema) +``` + + + +### **4.2 批量处理:从数据到情感分析可视化** + +为方便使用,本项目提供了批量处理的功能,支持以文件形式输入,批量进行情感分析。同时打通了从数据到情感分析结果可视化的流程,帮助用户可以更加快速获取情感分析结果,聚焦于业务分析方面。 + + + +#### **4.2.1 数据描述** +输入数据如下方式进行组织,每行表示一个文本评论。可以点击[这里](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/test_hotel.tar.gz)下载酒店场景的测试数据进行分析。 + +``` +非常好的酒店 不枉我们爬了近一个小时的山,另外 大厨手艺非常棒 竹筒饭 竹筒鸡推荐入住的客人必须要点, +房间隔音效果不好,楼下KTV好吵的 +酒店的房间很大,干净舒适,服务热情 +怎么说呢,早上办理入住的,一进房间闷热的一股怪味,很臭,不能开热风,好多了,虽然房间小,但是合理范围 +总台服务很差,房间一般 +``` + + + +#### **4.2.2 批量情感分析** + +通过脚本 `batch_predict.py` 批量进行情感分析,通过 `file_path` 指定要进行情感分析的文件路径,处理完后,结果将会保存在 `save_path` 指定的文件中,示例如下: + +```shell +python batch_predict.py \ + --file_path "./data/test_hotel.txt" \ + --save_path "./data/sentiment_analysis.json" \ + --model "uie-senta-base" \ + --schema "[{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}]" \ + --batch_size 4 \ + --max_seq_len 512 +``` + +参数说明: +- ``file_path``: 用于进行情感分析的文件路径。 +- ``save_path``: 情感分析结果的保存路径。 +- ``model``: 进行情感分析的模型名称,可以在这些模型中进行选择:['uie-senta-base', 'uie-senta-medium', 'uie-senta-mini', 'uie-senta-micro', 'uie-senta-nano']。 +- ``load_from_dir``: 指定需要加载的离线模型目录,比如训练后保存的模型,如果不进行指定,则默认根据 `model` 指定的模型名称自动下载相应模型。 +- ``schema``: 基于UIE模型进行信息抽取的Schema描述。 +- ``batch_size``: 预测过程中的批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。 +- 
``max_seq_len``: 模型支持处理的最大序列长度,默认为512。 +- ``aspects``: 预先给定的属性,如果设置,模型将只针对这些属性进行情感分析,比如分析这些属性的观点词。 + + + +#### **4.2.3 情感分析可视化** + +在情感分析处理之后,可以根据情感分析的保存结果进行可视化展示,帮助用户更友好地分析业务特点。默认情况下,可视化功能支持围绕属性、观点、属性+观点、属性+情感、指定属性+观点分析功能。在各项分析中,均支持词云和直方图两类图像展示。 + +下面将以酒店场景数据为例进行展示。 + + + +**4.2.3.1 一键生成情感分析结果** + +基于以上生成的情感分析结果,可以使用`visual_analysis.py`脚本对情感分析结果进行可视化,最终可视化结果将会被保存在 `save_dir` 指定的目录下。 使用时需要指定情感分析可视化的结果的任务类型,若是语句级的情感分类,则将task_type指定为``cls``,若是属性级的情感分析,则将task_type指定为``ext``,示例如下: + +``` +python visual_analysis.py \ + --file_path "./outputs/test_hotel.json" \ + --save_dir "./outputs/images" \ + --task_type "ext" +``` + +可配置参数说明: +- ``file_path``: 指定情感分析结果的保存路径。 +- ``save_dir``: 指定图片的保存目录。 +- ``task_type``: 指定任务类型,语句级情感分类请指定为``cls``,属性级情感分析请指定为``ext``,默认为``ext``。 +- ``font_path``: 指定字体文件的路径,用以在生成的wordcloud图片中辅助显示中文,如果为空,则会自动下载黑体字,用以展示中文字体。 + +**备注**:在`visual_analysis.py`脚本启动时,默认会删除当前已经存在的`save_dir`目录以及其中文件,然后在该目录下重新生成相应的可视化图片。 + +下图展示了对酒店场景数据分析后的部分图片: + +
+ +
+
+
+
+
+**4.2.3.2 情感分析详细展示**
+
+**(1) 属性分析**
+通过属性信息,可以查看客户对于产品/服务的重点关注方面。可以通过`plot_aspect_with_frequency`函数对属性进行可视化,当前可通过参数`image_type`指定`wordcloud`或`histogram`,分别以词云和直方图的形式进行可视化。
+
+```python
+# define SentimentResult to process the saved results of sentiment analysis
+sr = SentimentResult(args.file_path, sentiment_name=args.sentiment_name)
+# define VisualSentiment to help visualization
+vs = VisualSentiment(font_path=args.font_path)
+
+# visualization for aspect
+save_path = os.path.join(args.save_dir, "aspect_wc.png")
+vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="wordcloud")
+save_path = os.path.join(args.save_dir, "aspect_hist.png")
+vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="histogram")
+```
+
+
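+
+**备注**:下文 (2)~(5) 小节中的示例均复用此处创建的 `sr`(情感分析结果)与 `vs`(可视化工具)对象,`SentimentResult` 与 `VisualSentiment` 的具体定义请以 `visual_analysis.py` 脚本的实际实现为准。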
+ +
+
+ +**(2) 观点分析** +通过观点信息,可以查看客户对于产品/服务整体的直观印象。可以通过`plot_opinion_with_frequency`函数对观点进行可视化。 + +```python +# visualization for opinion +save_path = os.path.join(args.save_dir, "opinion_wc.png") +vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="wordcloud") +``` + +
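+
+如果还希望以直方图查看高频观点词,可以参考下面的示例写法。此处假设 `plot_opinion_with_frequency` 与 `plot_aspect_with_frequency` 一样支持 `image_type="histogram"`,请以 `visual_analysis.py` 中的实际实现为准:
+
+```python
+# 以直方图形式展示高频观点词(示例写法,假设该函数支持 histogram 类型)
+save_path = os.path.join(args.save_dir, "opinion_hist.png")
+vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="histogram")
+```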
+ +
+
+
+**(3) 属性+观点分析**
+结合属性和观点两者信息,可以更加具体地展现客户对于产品/服务的详细观点,分析某个属性的优劣,从而帮助商家更有针对性地改善或提高自己的产品/服务质量。可以通过`plot_aspect_with_opinion`函数对属性+观点进行可视化,并可通过设置参数`sentiment`按照情感倾向展示不同的分析结果:若设置为`all`,则展示正向和负向的所有属性;若为`positive`,则仅展示正向的属性;若为`negative`,则仅展示负向的属性。在绘制直方图时,可以通过设置参数`top_n`,展示频率最高的 top n 个属性。
+
+```python
+# visualization for aspect + opinion
+save_path = os.path.join(args.save_dir, "aspect_opinion_wc.png")
+vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="wordcloud", sentiment="all")
+save_path = os.path.join(args.save_dir, "aspect_opinion_hist.png")
+vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="histogram", sentiment="all", top_n=8)
+```
+
+
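+
+在排查差评时,也可以结合上文介绍的 `sentiment` 参数,仅查看负向属性及其观点词,示例如下(保存路径为示例写法,可自行指定):
+
+```python
+# 仅展示负向属性及其观点词
+save_path = os.path.join(args.save_dir, "aspect_opinion_negative_wc.png")
+vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="wordcloud", sentiment="negative")
+```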
+ +
+
+
+
+**(4) 属性+情感分析**
+挖掘客户对于产品/服务在各个属性上的情感极性,帮助商家直观地查看客户对产品/服务各属性的印象。可以通过`plot_aspect_with_sentiment`函数对属性+情感进行可视化。在绘制直方图时,可以通过设置参数`top_n`,展示频率最高的 top n 个属性。
+
+```python
+# visualization for aspect + sentiment
+save_path = os.path.join(args.save_dir, "aspect_sentiment_wc.png")
+vs.plot_aspect_with_sentiment(sr.aspect_sentiment, save_path, image_type="wordcloud")
+save_path = os.path.join(args.save_dir, "aspect_sentiment_hist.png")
+vs.plot_aspect_with_sentiment(sr.aspect_sentiment, save_path, image_type="histogram", top_n=15, descend_aspects=sr.descend_aspects)
+```
+
+
+ +
+ +**(5) 对给定属性进行观点分析** +通过指定属性,更加细致查看客户对于产品/服务某个属性的观点。可以帮助商家更加细粒度地分析客户对于产品/服务的某个属性的印象。下面图片示例中,展示了客户对于属性"房间"的观点。可以通过`plot_opinion_with_aspect`函数,对给定的属性进行观点分析。默认情况下,不会自动生成该类图像,需要开发者手动调用`plot_opinion_with_aspect`进行可视化分析。 + +```python +aspect = "房间" +save_path = os.path.join(args.save_dir, "opinions_for_aspect_wc.png") +vs.plot_opinion_with_aspect(aspect, sr.aspect_opinion, save_path, image_type="wordcloud") +save_path = os.path.join(args.save_dir, "opinions_for_aspect_hist.png") +vs.plot_opinion_with_aspect(aspect, sr.aspect_opinion, save_path, image_type="histogram") +``` + +
+ +
+ + + + +## **5. 更进一步:结合业务分析经验,定制情感分析** + +考虑到用户在对业务数据进行情感分析时,往往聚焦于某个特定场景或领域,为满足用户更高的情感分析要求,本项目支持从以下方面协助用户,结合业务经验,进一步定制情感分析能力,提高模型对业务数据的理解和分析能力。 + +- 数据层面:打通 label-studio 平台,定制了情感信息的标注规则,支持根据标注数据自动转换为模型输入样本。 +- 属性聚合:结合业务经验,支持传入同义的属性集合,可以增强模型对于数据聚合的能力。 +- 隐性观点抽取:结合业务经验,支持自定义隐性观点词表,可以增强模型对于隐性观点的抽取能力。 + +下面以酒店场景为例,讲解定制酒店垂域的情感分析能力。具体地,将从数据标注及样本构建 - 模型训练 - 模型测试 - 模型预测及效果展示等全流程展开介绍。 + + + +### **5.1 打通数据标注到训练样本构建** + +本项目建议用户使用 label-studio 平台标注数据,同时提供了一套用于情感信息标注的规则,可以参考[情感分析任务Label Studio使用指南](./label_studio.md)获取更多信息,这里不再赘述。同时本项目打通了从 label-studio 标注平台到转换为模型输入形式数据的流程, 即支持用户在基于 label_studio 标注业务侧数据后,通过label-studio 导出标注好的json数据, 然后利用本项目提供的 `label_studio.py` 脚本,可以将导出数据一键转换为模型训练数据。 + +在利用 `label_studio.py` 脚本进行数据转换时,需要考虑任务类型的不同,选择相应的样本构建方式,整体可以分为 `分类` 和 `抽取` 任务。 + +
+ +
+ +为方便用户使用,本项目提供了300+条酒店场景的标注数据,可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/label_studio.tar.gz)进行下载,请注意该数据仅适合用于 `抽取` 类型的任务。 + + + + +#### **5.1.1 样本构建:语句级情感分类任务** + +对于语句级情感分类任务,默认支持2分类:``正向`` 和 ``负向``,可以通过如下命令构造相关训练数据。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type cls \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" \ + --is_shuffle True \ + --seed 1000 +``` + +参数介绍: +- ``label_studio_file``: 从label studio导出的语句级情感分类的数据标注文件。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务,其中前者需要设置为``ext``,后者需要设置为``cls``。由于此处为语句级情感分类任务,因此需要设置为``cls``。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``options``: 情感极性分类任务的选项设置。对于语句级情感分类任务,默认支持2分类:``正向`` 和 ``负向``;对于属性级情感分析任务,默认支持3分类:``正向``, ``负向``和 ``未提及``,其中``未提及``表示要分析的属性在原文本评论中未提及,因此无法分析情感极性。如果业务需要其他情感极性选项,可以通过``options``字段进行设置,需要注意的是,如果定制了``options``,参数``label_studio_file``指定的文件需要包含针对新设置的选项的标注数据。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. + +**备注**:参数``options``可以不进行手动指定,如果这么做,则采用默认的设置。针对语句级情感分类任务,其默认将被设置为:``"正向" "负向"``;对于属性级情感分析任务,默认将被设置为:``"正向" "负向" "未提及"``。 + + + +#### **5.1.2 样本构建:属性抽取相关任务** + +针对抽取式的任务,比如属性-观点抽取、属性-情感极性-观点词抽取、属性分类任务等,可以使用如下命令将label-studio导出数据转换为模型训练数据。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" "未提及" \ + --negative_ratio 5 \ + --is_shuffle True \ + --seed 1000 +``` + +重点参数介绍: +- ``label_studio_file``: 从label studio导出的属性抽取相关的数据标注文件。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务,其中前者需要设置为``ext``,后者需要设置为``cls``。由于此处为属性抽取相关任务,因此需要设置为``ext``。 +- ``negative_ratio``表示对于一个样本,为每个子任务(属性级的观点抽取,属性级的情感分类)最多生成``negative_ratio``个负样本。如果额外提供了属性同义词标或隐性观点抽取词表,将结合两者信息生成更多的负样本,以增强属性聚合和隐性观点抽取能力。 +其他参数解释同上,这里不再赘述。 + + + +#### **5.1.3 样本构建升级1:加强属性聚合能力** + +在用户对产品或服务进行评论时,对某一些属性可能会有不同的说法,这会在后续对属性分析时可能会带来困扰。如以下示例中的"价格","价钱"和"费用"。 + +``` +蛋糕味道不错,外观很漂亮,而且价格比较便宜 +蛋糕味道不错,外观很漂亮,而且价钱比较便宜 +蛋糕味道不错,外观很漂亮,而且费用比较便宜 +``` + +针对这种情况,针对属性相关任务,本项目同时支持用户结合业务经验,通过设置同义的属性词表,加强模型的属性聚合能力。具体来讲,本项目期望通过以下两点,支持对属性聚合能力的建设。 + +- 支持针对用户给定的属性进行情感分析 +- 支持用户提供同义的属性词表,用以加强模型对用户领域属性同义词的理解能力 + +以下给出了酒店场景的示例,每行代表1类同义词,不同词之间以"空格"隔开。 + +``` +房间 屋子 房子 +位置 地理位置 +隔音 隔声 +价格 价钱 费用 +``` + +可以通过以下命令,使用synonym_file指定凝聚业务经验的同义属性集合,利用同义属性生成对应的数据样本: + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --synonym_file ./data/synonyms.txt \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" "未提及" \ + --negative_ratio 5 \ + --is_shuffle True \ + --seed 1000 +``` + + + +#### **5.1.4 样本构建升级2:加强隐性观点抽取能力** + +另外,本项目同时支持加强对隐性观点功能抽取的能力,这里需要说明一点,本项目中定义隐性观点是指没有对应属性的纯观点词,如以下示例中的"比较便宜"便是隐性观点。 + +``` +蛋糕味道不错,外观很漂亮,而且比较便宜 +``` + +本项目支持用户提供一个隐性观点映射文件,用户可以根据自己的业务场景定义隐性观点词,以下给出了酒店场景的示例。其格式为,第1个单词为隐性观点对应的属性,后续按照情感情感倾向对隐性观点词进行了归类,同一类的以"[ ]"方式放到一块。 + +``` +价格, 正向[实惠 便宜 超划算 划算 物超所值 物有所值 不贵], 负向[贵 不便宜 不划算] +卫生, 正向[干净], 负向[很脏 很臭 不干净] +隔音, 负向[好吵] +位置, 负向[不太好找] +``` + +可以通过参数"implicit_file"指定凝聚业务经验的隐性观点词表,生成对应的隐性观点数据样本: + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --implicit_file ./data/implicit_opinions.txt \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" "未提及" \ + --negative_ratio 5 \ + --is_shuffle True \ + --seed 1000 +``` + + + +### **5.2 模型训练** +在生成酒店场景的训练数据后,可以通过以下命令启动模型训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0" 
finetune.py \ + --train_path ./data/train.json \ + --dev_path ./data/dev.json \ + --save_dir ./checkpoint \ + --learning_rate 1e-5 \ + --batch_size 16 \ + --max_seq_len 512 \ + --num_epochs 3 \ + --model uie-senta-base \ + --seed 1000 \ + --logging_steps 10 \ + --valid_steps 100 \ + --device gpu +``` + +可配置参数说明: + +* ``train_path``:必须,训练集文件路径。 +* ``dev_path``:必须,验证集文件路径。 +* ``save_dir``:模型 checkpoints 的保存目录,默认为"./checkpoint"。 +* ``learning_rate``:训练最大学习率,UIE 推荐设置为 1e-5;默认值为1e-5。 +* ``batch_size``:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。 +* ``max_seq_len``:模型支持处理的最大序列长度,默认为512。 +* ``num_epochs``:模型训练的轮次,可以视任务情况进行调整,默认为10。 +* ``model``:训练使用的预训练模型。可选择的有`uie-senta-base`, `uie-senta-medium`, `uie-senta-mini`, `uie-senta-micro`, `uie-senta-nano`,默认为`uie-senta-base`。 +* ``logging_steps``: 训练过程中日志打印的间隔 steps 数,默认10。 +* ``valid_steps``: 训练过程中模型评估的间隔 steps 数,默认100。 +* ``seed``:全局随机种子,默认为 42。 +* ``device``: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 GPU 训练。 + + + +### **5.3 模型测试** +通过运行以下命令进行对酒店场景的测试集进行评估: + +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.json \ + --batch_size 16 \ + --max_seq_len 512 +``` + +可配置参数说明: + +* ``model_path``:必须,进行评估的模型文件夹路径,路径下需包含模型权重文件model_state.pdparams及配置文件model_config.json。 +* ``test_path``:必须,进行评估的测试集文件。 +* ``batch_size``:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。 +* ``max_seq_len``:文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* ``debug``: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 + +在构造样本过程中,如果设置了最大负例比例negative_ratio,会在样本中添加一定数量的负样本,模型测试默认会对正样本和负样本共同进行评估。特别地,当开启debug模式后,会对每个正例类别分别进行评估,该模式仅用于模型调试: +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.json \ + --batch_size 16 \ + --max_seq_len 512 \ + --debug +``` + +输出打印示例: +``` +[2022-12-12 05:20:06,152] [ INFO] - ----------------------------- +[2022-12-12 05:20:06,152] [ INFO] - Class Name: 评价维度 +[2022-12-12 05:20:06,152] [ INFO] - Evaluation Precision: 0.89655 | Recall: 0.89655 | F1: 0.89655 +[2022-12-12 05:20:06,553] [ INFO] - ----------------------------- +[2022-12-12 05:20:06,553] [ INFO] - Class Name: 观点词 +[2022-12-12 05:20:06,553] [ INFO] - Evaluation Precision: 0.81159 | Recall: 0.86154 | F1: 0.83582 +[2022-12-12 05:20:07,610] [ INFO] - ----------------------------- +[2022-12-12 05:20:07,611] [ INFO] - Class Name: X的观点词 +[2022-12-12 05:20:07,611] [ INFO] - Evaluation Precision: 0.92222 | Recall: 0.90217 | F1: 0.91209 +[2022-12-12 05:20:08,331] [ INFO] - ----------------------------- +[2022-12-12 05:20:08,331] [ INFO] - Class Name: X的情感倾向[未提及,正向,负向] +[2022-12-12 05:20:08,331] [ INFO] - Evaluation Precision: 0.81481 | Recall: 0.81481 | F1: 0.81481 +``` + + + +### **5.4 模型预测及效果展示** + + + +#### **5.4.1 使用训练后的模型进行预测** +paddlenlp.Taskflow装载定制模型,通过task_path指定模型权重文件的路径,路径下需要包含训练好的模型权重文件model_state.pdparams。 + +```python +>>> # define schema +>>> schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best") +>>> senta("这家点的房间很大,店家服务也很热情,就是房间隔音不好") +[ + { + '评价维度': [ + { + 'text': '服务', + 'start': 11, + 'end': 13, + 'probability': 0.9600759151746807, + 'relations': { + '观点词': [ + { + 'text': '热情', + 'start': 15, + 'end': 17, + 'probability': 0.9995151134519027 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9998306104766073 + } + ] + } + }, + { + 'text': '隔音', + 'start': 22, + 'end': 24, + 'probability': 0.9993525950520166, + 'relations': { + '观点词': [ + { + 'text': '不好', + 'start': 24, + 'end': 26, + 
'probability': 0.9992370362201655 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '负向', + 'probability': 0.9842680108546062 + } + ] + } + }, + { + 'text': '房间', + 'start': 4, + 'end': 6, + 'probability': 0.9991784415865368, + 'relations': { + '观点词': [ + { + 'text': '很大', + 'start': 6, + 'end': 8, + 'probability': 0.8359714693985723 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.997688853839179 + } + ] + } + } + ] + } +] +``` + + + +#### **5.4.2 属性聚合预测和分析** + +下面就 `隔音` 与 `价格` 两个属性进行分析,抽取样本中与这两个属性相关的情感信息,代码如下: + +```python +>>> schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}] +>>> aspects = ["隔音", "价格"] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best", aspects=aspects) +>>> senta("这家点的房间很大,店家服务也很热情,就是房间隔音不好") +``` + +下图展示了关于模型对于属性聚合能力支持的样本,在分析之前设定固定的属性集合`["隔音", "价格"]`,可以看到尽管语料中同时出现了`隔音`、`隔声`、`价格`、`价钱`和`费用`,但是经过定制后的情感分析模型依然能够准确识别出对于属性 `隔音` 和 `价格`的情感信息,从而起到属性聚合的效果。 + +| 样本 | 属性 | 观点词 | 情感倾向 | +| :----: |:----: |:----: |:----: | +|这家店的房间很大,隔音效果不错,而且价格很便宜|隔音|不错|正向| +|房间比较小,隔声也不太好,设施还是挺齐全的|隔音|不太好|负向| +|房间还不错,有免费的矿泉水,而且价格很实惠|价格|实惠|正向| +|房间很大,店家也挺热情,很棒,就是价钱有点贵|价格|贵|负向| +|酒店不错,房间面积大,住的也舒适,而且价格很划算|价格|划算|正向| +|房间好大呀,而且这边还挺安静的,不过整体还是很好的,很宽敞,而且价格很便宜|价格|便宜|正向| + + + +#### **5.4.3 隐性观点词抽取预测和分析** + +下面就`价格` 和 `卫生` 两个属性进行分析隐性观点,抽取样本中与这两个属性相关的情感信息,代码如下: + +对于"价格"的调用示例: +```python +>>> schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}] +>>> aspects = ["价格"] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best", aspects=aspects) +>>> senta("这家店的房间很大,店家服务也很热情,而且很便宜") +``` + +下图展示了关于模型对于隐性观点抽取的样本,可以看到,虽然以下这些样本中,并未出现`价格` 和 `卫生`,但模型依然正确识别除了这两个属性的情感信息。 + +| 样本 | 属性 | 观点词 | 情感倾向 | +| :----: |:----: |:----: |:----: | +|房间比较大,就是感觉贵了点,不太划算|价格|贵、不太划算|负向| +|这家店的房间很大,店家服务也很热情,而且很便宜|价格|便宜|正向| +|这次来荆州给我的房间小的无语了,所幸比较实惠|价格|实惠|正向| +|酒店不大,有点不干净|卫生|不干净|负向| +|老板人很好,房间虽然很大,但有点脏|卫生|脏|负向| +|房间不大,很温暖,也很干净|卫生|干净|正向| + + + + +## **6. 模型部署** + + + +### **6.1 基于SimpleServer进行服务化部署** +本项目支持基于PaddleNLP SimpleServing进行服务化部署,可以在`deploy`目录下执行以下命令启动服务和请求。 + +**启动服务** +``` +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` +**Client发送请求** + +服务启动后, 通过 `client.py` 脚本发送请求: +``` +python client.py +``` + +**多卡服务化预测** + +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,代码示例如下: + +```python +senta1 = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base", device_id=0) +senta2 = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base", device_id=1) + +app.register_taskflow('senta', [senta1, senta2]) +``` + + + +### **6.2 基于Pipeline进行部署** + +本项目支持基于Pipeline的方式进行部署,用户只需要上传测试文件,即可获取对应的情感分析可视化结果,更多信息请参考[情感分析Pipeline](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/sentiment_analysis)。 diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/batch_predict.py b/applications/sentiment_analysis/unified_sentiment_extraction/batch_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..87347668d58bad2ef71167a66e6ad83c18e145ea --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/batch_predict.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +from utils import load_txt, write_json_file + +from paddlenlp import Taskflow +from paddlenlp.utils.log import logger + + +def main(args): + """ + Predict based on Taskflow. + """ + start_time = time.time() + # read file + logger.info("Trying to load dataset: {}".format(args.file_path)) + if not os.path.exists(args.file_path): + raise ValueError("something with wrong for your file_path, it may not exist.") + examples = load_txt(args.file_path) + + # define Taskflow for sentiment analysis + schema = eval(args.schema) + if args.load_from_dir: + senta = Taskflow( + "sentiment_analysis", + model=args.model, + schema=schema, + aspects=args.aspects, + batch_size=args.batch_size, + max_seq_len=args.max_seq_len, + task_path=args.load_from_dir, + ) + else: + senta = Taskflow( + "sentiment_analysis", + model=args.model, + schema=schema, + aspects=args.aspects, + batch_size=args.batch_size, + max_seq_len=args.max_seq_len, + ) + + # predict with Taskflow + logger.info("Start to perform sentiment analysis for your dataset, this may take some time.") + results = senta(examples) + + # save results + save_path = args.save_path + if not save_path: + save_dir = os.path.dirname(args.file_path) + save_path = os.path.join(save_dir, "sentiment_results.json") + write_json_file(results, save_path) + logger.info("The results of sentiment analysis has been saved to: {}".format(save_path)) + logger.info("This run take {} seconds.".format(time.time() - start_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--file_path", type=str, default="./data/test_hotel.txt", help="The file path that you want to perform sentiment analysis on.") + parser.add_argument("--save_path", type=str, default="./data/sentiment_analysis.json", help="The saving path for the results of sentiment analysis.") + parser.add_argument("--model", choices=['uie-senta-base', 'uie-senta-medium', 'uie-senta-mini', 'uie-senta-micro', 'uie-senta-nano'], default="uie-senta-base", help="The model name that you wanna use for sentiment analysis.") + parser.add_argument("--load_from_dir", default=None, type=str, help="The directory path for the finetuned model to predict, if set None, it will download model according to model_name.") + parser.add_argument("--schema", default="[{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}]", type=str, help="The schema for UIE to extract infomation.") + parser.add_argument("--aspects", default=None, type=str, nargs="+", help="A list of pre-given aspects, that is to say, Pipeline only perform sentiment analysis on these pre-given aspects if you input it.") + parser.add_argument("--batch_size", type=int, default=4, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + + args = parser.parse_args() + # yapf: enable + + main(args) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/deploy/client.py b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/client.py 
new file mode 100644 index 0000000000000000000000000000000000000000..1024ff2ce88e8089f49bd590bd01335aad3e0ee8 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/client.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/senta" +headers = {"Content-Type": "application/json"} +texts = ["蛋糕味道不错,店家的服务也很热情"] +data = { + "data": { + "text": texts, + } +} +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/deploy/server.py b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/server.py new file mode 100644 index 0000000000000000000000000000000000000000..5bb78ea18f690e10da11efca1e91462527b7a0a5 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/server.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = [{"评价维度": ["观点词", "情感倾向[正向,负向,未提及]"]}] +# define taskflow to perform sentiment analysis +senta = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base") +# define your server +app = SimpleServer() +app.register_taskflow("taskflow/senta", senta) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/evaluate.py b/applications/sentiment_analysis/unified_sentiment_extraction/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..7382b7ddeea0fed863bfa93156414f97b95f10c8 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/evaluate.py @@ -0,0 +1,122 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
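+
+# Evaluation entry for the finetuned UIE sentiment model: it reads a test set in the
+# prompt-style format and reports span-level precision / recall / F1 with SpanEvaluator
+# (see Section 5.3 of the README for the command-line usage).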
+ +import argparse +import re +from functools import partial + +import paddle +from tqdm import tqdm +from utils import ( + convert_example, + create_data_loader, + get_relation_type_dict, + reader, + unify_prompt_name, +) + +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + for batch in tqdm(data_loader): + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def do_eval(): + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + model = UIE.from_pretrained(args.model_path) + + test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False) + class_dict = {} + relation_data = [] + if args.debug: + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if re.search(r"\[.*?\]$", data["prompt"]) and data["result_list"][0]["text"] == "未提及": + continue + if len(data["result_list"]) != 0: + if "的" not in data["prompt"]: + class_dict.setdefault(class_name, []).append(data) + else: + relation_data.append((data["prompt"], data)) + relation_type_dict = get_relation_type_dict(relation_data) + else: + class_dict["all_classes"] = test_ds + + trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + + for key in class_dict.keys(): + if args.debug: + test_ds = MapDataset(class_dict[key]) + else: + test_ds = class_dict[key] + + test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=trans_fn) + + metric = SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + logger.info("Class Name: %s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + if args.debug and len(relation_type_dict.keys()) != 0: + for key in relation_type_dict.keys(): + test_ds = MapDataset(relation_type_dict[key]) + + test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=trans_fn) + + metric = SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + logger.info("Class Name: X的%s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test 
set.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/finetune.py b/applications/sentiment_analysis/unified_sentiment_extraction/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..212a61de8a574f0889f455ac4e9fb52eb7805d5a --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/finetune.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from evaluate import evaluate +from utils import convert_example, create_data_loader, reader, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + tokenizer = AutoTokenizer.from_pretrained(args.model) + model = UIE.from_pretrained(args.model) + + train_ds = load_dataset(reader, data_path=args.train_path, max_seq_len=args.max_seq_len, lazy=False) + dev_ds = load_dataset(reader, data_path=args.dev_path, max_seq_len=args.max_seq_len, lazy=False) + + trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + + train_data_loader = create_data_loader(train_ds, mode="train", batch_size=args.batch_size, trans_fn=trans_fn) + dev_data_loader = create_data_loader(dev_ds, mode="dev", batch_size=args.batch_size, trans_fn=trans_fn) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + logger.info("load model from path: {}".format(args.init_from_ckpt)) + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + optimizer = paddle.optimizer.AdamW(learning_rate=args.learning_rate, parameters=model.parameters()) + + criterion = paddle.nn.BCELoss() + metric = SpanEvaluator() + + loss_list = [] + global_step = 0 + best_f1 = 0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + loss_start = 
criterion(start_prob, start_ids) + loss_end = criterion(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(float(loss)) + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = sum(loss_list) / len(loss_list) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + tic_train = time.time() + + if global_step % args.valid_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + + precision, recall, f1 = evaluate(model, metric, dev_data_loader) + logger.info("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" % (precision, recall, f1)) + if f1 > best_f1: + logger.info(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + tic_train = time.time() + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. 
Sequences longer than this will be split automatically.") + parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--model", choices=["uie-senta-base", "uie-senta-medium", "uie-senta-mini", "uie-senta-micro", "uie-senta-nano"], default="uie-senta-base", type=str, help="Select the pretrained model for few-shot learning.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.md b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.md new file mode 100644 index 0000000000000000000000000000000000000000..e3a52da0e1a72b057747f0802f9fa43f055d38b3 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.md @@ -0,0 +1,171 @@ +# 情感分析任务Label Studio使用指南 + + **目录** + +- [1. label-studio 安装](#1) +- [2. label-studio 项目创建](#2) +- [3. 情感分析任务标注](#3) + - [3.1 语句级情感分类任务](#3.1) + - [3.2 属性级情感分析任务](#3.2) + - [3.2.1 属性-情感极性-观点词抽取](#3.2.1) + - [3.2.2 属性-情感极性抽取](#3.2.2) + - [3.2.3 属性-观点词抽取](#3.2.3) + - [3.2.4 属性抽取](#3.2.4) + - [3.2.5 观点词抽取](#3.2.5) +- [4. 导出标注数据](#4) +- [5. References](#5) + + + +## **1. label-studio 安装** +本内容在以下环境进行测试安装: +- python == 3.9.12 +- label-studio == 1.6.0 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + +## **2. label-studio 项目创建** + +创建项目之前,需要先确定标注的任务类型以及需要标注哪些内容,然后点击创建(Create)开始创建一个新的项目,填写项目名称、描述。 + +
+ +
+ +如果数据已经准备好,可以在此进行导入数据。 + +
+ +
+ + +接下来,根据需要标注的任务类型,选择适合的任务。在本项目中,默认会包含两种类型的任务:语句级情感分类任务和属性级情感分析任务。由于这两者都属于自然语言处理(NLP)任务,因此可以点击 `Natural Language Processing` 选项,在该选项下面进行选择相应的子项任务。 + +- 如果标注语句级情感分类任务,请选择`Text Classification`。 + +
+ +
+ +- 如果标注属性级情感分析任务,比如属性-观点词-情感极性三元组的信息抽取,请选择`Relation Extraction`。 + +
+ +
+ +最后点击保存即可。 + + + +## **3. 情感分析任务标注** + + + +### **3.1 语句级情感分类任务** +这里对应的任务类型为`Text Classification`,在标注之前,需要设定`正向`和`负向`的标签,然后保存即可。 + +
+ +
+ +设定好标签后,即可开始进行标注,选择正向或负向,最后点击提交,便标注好一条数据。 +
+ +
+ + + +### **3.2 属性级情感分析任务** + +在本项目中,属性级的情感分析需要配置的标注任务类型为`Relation Extraction`,包括属性抽取、观点抽取、属性-观点抽取、属性-情感极性抽取、属性-情感极性-观点词三元组抽取等任务。其中属性-情感极-观点词(A-S-O)三元组抽取是最常见的任务之一,下面优先讲解该任务的标注规则。 + + + +#### **3.2.1 属性-情感极性-观点词抽取** +属性-情感极性-观点词(A-S-O)三元组抽取标注内容涉及两类标签:Span 类型标签和 Relation 类型标签。其中Span标签用于定位文本批评中属性、观点词和情感极性三类信息,Relation类型标签用于设置评价维度和观点词、情感倾向之间的关系。 + + +#### **(1)Span类型标签** +这里需要定位属性、情感极性、观点词三类信息,在标注时,需要将属性和情感极性进行组合,形成复合标签。具体来讲,设定`评价维度##正向`用于定位情感倾向为正向的属性,`评价维度##负向`用于定位情感倾向为负向的属性。另外,利用标注标签`观点词`定位语句中的观点词。 + +
+ +
+ +#### **(2)Relation类型标签** +这里只涉及到1中Relation类型标签,即`评价维度`到`观点词`的映射关系。这里可以设置一下两者关系的名称,即点击Code,然后配置关系名称(这里将两者关系设置为`观点词`),最后点击保存即可。 + +
+ +
+ +在设置好Span类型和Relation标签之后,便可以开始进行标注数据了。 + +
+ +
+ + + +#### **3.2.2 属性-情感极性抽取** +如3.2.1所述,本项目中针对属性-情感极性(A-S)抽取任务,采用`Span`的形式进行标注。设定`评价维度##正向`用于定位情感倾向为正向的属性,`评价维度##负向`用于定位情感倾向为负向的属性。下图展示了关于属性-情感极性抽取任务的标注示例。 + +
+ +
+ + + +#### **3.2.3 属性-观点词抽取** +针对属性-观点词(A-O)抽取任务,采用`Relation`的形式进行标注。这需要将属性对应标注标签设定为`评价维度`,观点词设定为`观点词`。下图展示了关于属性-观点词抽取任务的标注示例。 + +
+ +
+ + + +#### **3.2.4 属性抽取** +针对属性(A)抽取任务,采用`Span`的形式进行标注。 这需要将属性对应的标注标签设定为`评价维度`。下图展示了关于属性抽取任务的标注示例。 + +
+ +
+ + + +#### **3.2.4 观点词抽取** +针对观点词(O)抽取任务,采用`Span`的形式进行标注。 这需要将观点词对应的标注标签设定为`观点词`。下图展示了关于观点词抽取任务的标注示例。 + +
+ +
+ + + + +## **4. 导出标注数据** + +勾选已标注文本ID,点击Export按钮,选择导出的文件类型为`JSON`,导出数据: + +
+ +
+ + + +## **5. References** +- **[Label Studio 官网](https://labelstud.io/)** diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.py b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.py new file mode 100644 index 0000000000000000000000000000000000000000..d03703ef88ddf093c513b11a17fe536195ac4460 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.py @@ -0,0 +1,738 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle +from utils import load_txt + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +PROMPT_ITEMS = { + "aspect_prompt_prefix": "评价维度", + "opinion_prompt": "观点词", + "sentiment_prompt_prefix": "情感倾向", + "separator": "##", + "not_mentioned_option": "未提及", + "positive_option": "正向", + "negative_option": "负向", +} + + +class Convertor(object): + """Convertor to convert data export from annotation platform""" + + def __init__(self, negative_ratio=5): + """Init Data Convertor""" + self.negative_ratio = negative_ratio + self.aspect_prompt_prefix = PROMPT_ITEMS["aspect_prompt_prefix"] + self.opinion_prompt = PROMPT_ITEMS["opinion_prompt"] + self.sentiment_prompt_prefix = PROMPT_ITEMS["sentiment_prompt_prefix"] + self.separator = PROMPT_ITEMS["separator"] + self.not_mentioned_option = PROMPT_ITEMS["not_mentioned_option"] + self.options = PROMPT_ITEMS["options"] + + def process_text_tag(self, line, task_type="ext"): + items = {} + items["text"] = line["data"]["text"] + if task_type == "ext": + items["entities"] = [] + items["relations"] = [] + items["relation_ids"] = set() + result_list = line["annotations"][0]["result"] + for result in result_list: + if result["type"] == "labels": + items["entities"].append( + { + "id": result["id"], + "start_offset": result["value"]["start"], + "end_offset": result["value"]["end"], + "label": result["value"]["labels"][0], + } + ) + else: + items["relations"].append( + { + "id": result["from_id"] + "-" + result["to_id"], + "from_id": result["from_id"], + "to_id": result["to_id"], + "type": result["labels"][0] if result["labels"] else self.opinion_prompt, + } + ) + items["relation_ids"].add(result["from_id"]) + items["relation_ids"].add(result["to_id"]) + + elif task_type == "cls": + items["label"] = line["annotations"][0]["result"][0]["value"]["choices"] + return items + + def convert_cls_examples(self, raw_examples, data_flag="Data"): + """ + Convert labeled data for classification task. 
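+        Each annotated example is turned into the UIE prompt-style format, e.g.
+        {"content": text, "result_list": [...], "prompt": "情感倾向[正向,负向]"}.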
+ """ + examples = [] + logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]")) + for line in raw_examples: + items = self.process_text_tag(line, task_type="cls") + text, labels = items["text"], items["label"] + example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options) + examples.append(example) + logger.info("{0:7} End to convert annotation data.\n".format("")) + return examples + + def convert_ext_examples( + self, + raw_examples, + synonyms=None, + implicit_opinion_map=None, + sentiment_map=None, + with_negatives=True, + task_type="ext_aso", + data_flag="Data", + ): + """ + Convert labeled data for extraction task. + """ + + def _sep_cls_label(label, separator): + label_list = label.split(separator) + if len(label_list) == 1: + return label_list[0], None + return label_list[0], label_list[1:] + + texts = [] + # {"content": "", "result_list": [], "prompt": "X"} + entity_examples = [] + # {"content": "", "result_list": [], "prompt": "X的Y"} + relation_examples = [] + # {"content": "", "result_list": [], "prompt": "X的情感倾向[正向,负向]"} + entity_cls_examples = [] + + # entity label set: ["评价维度", "观点词", ... ] + entity_label_set = [] + # predicate set: ["观点词", ... ] + predicate_set = [] + # set of subject entity in relation: ["房间", "价格", ... ] + subject_name_set = [] + + # List[List[str]] + # List of entity prompt for each example + entity_prompt_list = [] + # Golden subject label for each example + subject_golden_list = [] + # List of inverse relation for each example + inverse_relation_list = [] + # List of predicate for each example + predicate_list = [] + + logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]")) + logger.info("{0:7} Trying to generate positive examples...".format("")) + for line in raw_examples: + items = self.process_text_tag(line, task_type="ext") + + text, relations, entities, relation_ids = ( + items["text"], + items["relations"], + items["entities"], + items["relation_ids"], + ) + texts.append(text) + + entity_example = [] + entity_prompt = [] + entity_example_map = {} + implict_example_map = {} + entity_map = {} + subject_golden = [] + for entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + entity_label, entity_cls_label = _sep_cls_label(entity["label"], self.separator) + + # generate examples for entity-level sentiment classification + if entity_cls_label is not None: + entity_cls_prompt_prefix = entity_name + "的" + self.sentiment_prompt_prefix + entity_cls_example = self.generate_cls_example( + text, entity_cls_label, entity_cls_prompt_prefix, self.options + ) + + entity_cls_examples.append(entity_cls_example) + + # generate examples for entity extraction + result = {"text": entity_name, "start": entity["start_offset"], "end": entity["end_offset"]} + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = { + "content": text, + "result_list": [result], + "prompt": entity_label, + } + else: + entity_example_map[entity_label]["result_list"].append(result) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + entity_prompt.append(entity_label) + + if implicit_opinion_map and entity["id"] not in relation_ids: + maped_entity = entity_map[entity["id"]] + if maped_entity["name"] not in implicit_opinion_map: + continue + + result = { + "text": 
maped_entity["name"], + "start": maped_entity["start"], + "end": maped_entity["end"], + } + aspect = implicit_opinion_map[maped_entity["name"]] + if aspect not in implict_example_map: + implict_example_map[aspect] = [result] + else: + implict_example_map[aspect].append(result) + + if entity_label.startswith(self.aspect_prompt_prefix): + if entity_name not in subject_golden: + if synonyms and entity_name in synonyms: + subject_synonyms = synonyms[entity_name] + subject_golden.extend(subject_synonyms) + else: + subject_golden.append(entity_name) + + if entity_name not in subject_name_set: + subject_name_set.append(entity_name) + + for v in entity_example_map.values(): + entity_example.append(v) + entity_examples.append(entity_example) + entity_prompt_list.append(entity_prompt) + + # generate examples for classification of implicit opinion + if task_type == "ext_as" or task_type == "ext_aso": + for entity_name in implict_example_map.keys(): + prompt = entity_name + "的" + self.sentiment_prompt_prefix + opinions = implict_example_map[entity_name] + sentiment = None + for opinion in opinions: + if opinion["text"] in sentiment_map: + sentiment = sentiment_map[opinion["text"]] + break + if sentiment is None: + continue + implicit_example = self.generate_cls_example(text, [sentiment], prompt, self.options) + entity_cls_examples.append(implicit_example) + + # generate examples for relation extraction + # Golden entity inputs, initializing with implicit subject and it's synonyms + for implicit_subject in implict_example_map.keys(): + subject_golden.append(implicit_subject) + if synonyms and implicit_subject in synonyms: + subject_golden.extend(synonyms[implicit_subject]) + relation_example = [] + relation_example_map = {} + inverse_relation = [] + predicates = [] + + # generate examples for extraction of implicit opinion + for entity_name in implict_example_map.keys(): + prompt = entity_name + "的" + self.opinion_prompt + implicit_example = { + "content": text, + "result_list": implict_example_map[entity_name], + "prompt": prompt, + } + relation_example.append(implicit_example) + + # generate examples for labeled relations + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + + prompt = entity_map[subject_id]["name"] + "的" + predicate + inverse_negative = entity_map[object_id]["name"] + "的" + predicate + + result = { + "text": entity_map[object_id]["name"], + "start": entity_map[object_id]["start"], + "end": entity_map[object_id]["end"], + } + + inverse_relation.append(inverse_negative) + predicates.append(predicate) + + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"content": text, "result_list": [result], "prompt": prompt} + else: + relation_example_map[prompt]["result_list"].append(result) + + if predicate not in predicate_set: + predicate_set.append(predicate) + + for v in relation_example_map.values(): + relation_example.append(v) + + relation_examples.append(relation_example) + subject_golden_list.append(subject_golden) + inverse_relation_list.append(inverse_relation) + predicate_list.append(predicates) + + # start to generate negative examples + if with_negatives and task_type in ["ext_as", "ext_ao", "ext_aso"]: + logger.info("{0:7} Trying to generate negative examples...".format("")) + + # generate negative examples according to entity + all_entity_examples = [] + if with_negatives: + positive_examples, negative_examples = self.add_entity_negative_example( + entity_examples, texts, 
entity_prompt_list, entity_label_set + ) + if len(positive_examples) != 0: + all_entity_examples = positive_examples + negative_examples + else: + for i in range(len(entity_examples)): + all_entity_examples.extend(entity_examples[i]) + + # generate negative examples according to relation + all_relation_examples = [] + if with_negatives: + if len(predicate_set) != 0: + positive_examples = [] + negative_examples = [] + per_n_ratio = self.negative_ratio // 3 + + for i, text in enumerate(texts): + negative_example = [] + collects = [] + + # 1. inverse_relation_list + redundants1 = inverse_relation_list[i] + + # 2. subject_name_set - subject_golden_list[i] + redundants2 = [] + if len(predicate_list[i]) != 0: + nonentity_list = list(set(subject_name_set) - set(subject_golden_list[i])) + nonentity_list.sort() + + redundants2 = [ + nonentity + "的" + predicate_list[i][random.randrange(len(predicate_list[i]))] + for nonentity in nonentity_list + ] + + # 3. entity_label_set - entity_prompt_list[i] + redundants3 = [] + if len(subject_golden_list[i]) != 0: + non_ent_label_list = list(set(entity_label_set) - set(entity_prompt_list[i])) + non_ent_label_list.sort() + + redundants3 = [ + subject_golden_list[i][random.randrange(len(subject_golden_list[i]))] + "的" + non_ent_label + for non_ent_label in non_ent_label_list + ] + + redundants_list = [redundants1, redundants2, redundants3] + + for redundants in redundants_list: + added, rest = self.add_relation_negative_example(redundants, texts[i], per_n_ratio) + negative_example.extend(added) + collects.extend(rest) + num_sup = self.negative_ratio - len(negative_example) + if num_sup > 0 and collects: + if num_sup > len(collects): + idxs = [k for k in range(len(collects))] + else: + idxs = random.sample(range(0, len(collects)), num_sup) + for idx in idxs: + negative_example.append(collects[idx]) + positive_examples.extend(relation_examples[i]) + negative_examples.extend(negative_example) + + all_relation_examples = positive_examples + negative_examples + else: + for i in range(len(relation_examples)): + all_relation_examples.extend(relation_examples[i]) + + # generate negative examples according to sentiment polarity + all_cls_examples = entity_cls_examples + if with_negatives: + if task_type == "ext_aso" or task_type == "ext_as" and self.not_mentioned_option in self.options: + cls_negatives_examples = self.add_cls_negative_example(texts, subject_name_set, subject_golden_list) + all_cls_examples += cls_negatives_examples + + # generate examples with synonyms to support aspect aggregation + if synonyms is not None: + synonym_map = {} + for k, vs in synonyms.items(): + for v in vs: + synonym_map[v] = k + relation_synonym_examples = self.change_aspect_with_synonyms(all_relation_examples, synonyms, synonym_map) + all_relation_examples += relation_synonym_examples + cls_synonym_examples = self.change_aspect_with_synonyms(all_cls_examples, synonyms, synonym_map) + all_cls_examples += cls_synonym_examples + + logger.info("{0:7} End to convert annotation data.\n".format("")) + return all_entity_examples + all_relation_examples + all_cls_examples + + def change_aspect_with_synonyms(self, examples, synonyms, synonym_map): + synonym_examples = [] + for example in examples: + prompt = example["prompt"] + aspect, suffix = prompt.split("的", maxsplit=1) + if aspect not in synonym_map.keys(): + continue + synonym_cluster = synonyms[synonym_map[aspect]] + for syn_aspect in synonym_cluster: + if syn_aspect == aspect: + continue + syn_prompt = syn_aspect + "的" + suffix + 
syn_example = copy.deepcopy(example) + syn_example["prompt"] = syn_prompt + synonym_examples.append(syn_example) + return synonym_examples + + def generate_cls_example(self, text, labels, prompt_prefix, options): + random.shuffle(self.options) + cls_options = ",".join(self.options) + prompt = prompt_prefix + "[" + cls_options + "]" + + result_list = [] + example = {"content": text, "result_list": result_list, "prompt": prompt} + + for label in labels: + start = prompt.rfind(label) - len(prompt) - 1 + end = start + len(label) + result = {"text": label, "start": start, "end": end} + example["result_list"].append(result) + return example + + def add_entity_negative_example(self, examples, texts, prompts, label_set): + negative_examples = [] + positive_examples = [] + for i, prompt in enumerate(prompts): + redundants = list(set(label_set) - set(prompt)) + redundants.sort() + + ratio = self.negative_ratio + if ratio > len(redundants): + ratio = len(redundants) + idxs = random.sample(range(0, len(redundants)), ratio) + + for idx in idxs: + negative_result = {"content": texts[i], "result_list": [], "prompt": redundants[idx]} + negative_examples.append(negative_result) + positive_examples.extend(examples[i]) + return positive_examples, negative_examples + + def add_relation_negative_example(self, redundants, text, ratio): + added_example = [] + rest_example = [] + + if ratio > len(redundants): + ratio = len(redundants) + + all_idxs = [k for k in range(len(redundants))] + idxs = random.sample(range(0, len(redundants)), ratio) + rest_idxs = list(set(all_idxs) - set(idxs)) + + for idx in idxs: + negative_result = {"content": text, "result_list": [], "prompt": redundants[idx]} + added_example.append(negative_result) + + for rest_idx in rest_idxs: + negative_result = {"content": text, "result_list": [], "prompt": redundants[rest_idx]} + rest_example.append(negative_result) + + return added_example, rest_example + + def add_cls_negative_example(self, texts, subject_name_set, subject_golden_list): + negative_examples = [] + for i, text in enumerate(texts): + redundants = list(set(subject_name_set) - set(subject_golden_list[i])) + redundants.sort() + + ratio = self.negative_ratio + if ratio > len(redundants): + ratio = len(redundants) + idxs = random.sample(range(0, len(redundants)), ratio) + + for idx in idxs: + subject_name = redundants[idx] + prompt_prefix = subject_name + "的" + self.sentiment_prompt_prefix + negative_example = self.generate_cls_example(text, ["未提及"], prompt_prefix, self.options) + negative_examples.append(negative_example) + return negative_examples + + +def load_synonym(synonym_path): + synonyms = {} + lines = load_txt(synonym_path) + for line in lines: + items = line.split() + synonyms[items[0]] = items + return synonyms + + +def load_implicit_opinion(implicit_opinion_path): + implicit_opinion_map = {} + sentiment_map = {} + lines = load_txt(implicit_opinion_path) + for line in lines: + items = line.split(",") + aspect = items[0].strip() + for item in items[1:]: + item = item.strip() + start = item.find("[") + end = item.find("]") + sentiment = item[0:start] + opinions = item[start + 1 : end].strip().split() + for opinion in opinions: + implicit_opinion_map[opinion] = aspect + sentiment_map[opinion] = sentiment + return implicit_opinion_map, sentiment_map + + +def parse_ext_task_type(raw_examples): + + task_type_dict = {"ext_a": False, "ext_o": False, "ext_ao": False, "ext_as": False, "ext_aso": False} + + def _parse_raw_example(raw_example): + entity_map = {} + relations = [] + 
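+        # Each record exported from Label Studio keeps its annotations under
+        # annotations[0]["result"]: items of type "labels" are entity spans
+        # (aspect / opinion, optionally carrying a "##"-separated sentiment
+        # suffix), and items of type "relation" link an aspect entity to an
+        # opinion entity via ("from_id", "to_id").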
result_list = raw_example["annotations"][0]["result"] + for result in result_list: + if result["type"] == "labels": + entity_id = result["id"] + entity_map[entity_id] = result["value"]["labels"][0] + elif result["type"] == "relation": + relation_pair = (result["from_id"], result["to_id"]) + relations.append(relation_pair) + else: + raise ValueError( + "Unknown entity type [{}], it indicates that your dataset maybe not a aspect-based extraction dataset, please check it.".format( + result["type"] + ) + ) + + for entity_label in entity_map.values(): + if ( + entity_label.startswith(PROMPT_ITEMS["aspect_prompt_prefix"]) + and PROMPT_ITEMS["separator"] in entity_label + ): + task_type_dict["ext_as"] = True + elif entity_label == PROMPT_ITEMS["aspect_prompt_prefix"]: + task_type_dict["ext_a"] = True + elif entity_label == PROMPT_ITEMS["opinion_prompt"]: + task_type_dict["ext_o"] = True + else: + raise ValueError("Unknown prompt: {}".format(entity_label)) + + # relations store the relation between aspect and opinion by default + if relations: + task_type_dict["ext_ao"] = True + + if task_type_dict["ext_ao"] and task_type_dict["ext_as"]: + task_type_dict["ext_aso"] = True + + for raw_example in raw_examples: + # analyze task type + _parse_raw_example(raw_example) + if task_type_dict["ext_aso"]: + return "ext_aso" + elif (not task_type_dict["ext_as"]) and task_type_dict["ext_ao"]: + return "ext_ao" + + if task_type_dict["ext_as"]: + return "ext_as" + elif task_type_dict["ext_o"]: + return "ext_o" + else: + return "ext_a" + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.label_studio_file): + raise ValueError("Please input the correct path of label studio file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.label_studio_file, "r", encoding="utf-8") as f: + raw_examples = json.loads(f.read()) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + raw_examples = [raw_examples[i] for i in indexes] + + # construct options according + if args.options: + PROMPT_ITEMS["options"] = args.options + else: + if args.task_type == "ext": + PROMPT_ITEMS["options"] = [ + PROMPT_ITEMS["positive_option"], + PROMPT_ITEMS["negative_option"], + PROMPT_ITEMS["not_mentioned_option"], + ] + else: + PROMPT_ITEMS["options"] = [PROMPT_ITEMS["positive_option"], PROMPT_ITEMS["negative_option"]] + + # analyze detailed ext task type: ext_a, ext_o, ext_as, ext_ao, ext_aso + if args.task_type == "ext": + args.task_type = parse_ext_task_type(raw_examples) + + logger.info("You are trying perform dataset construction operation for task {}.\n".format(args.task_type)) + + # load synonyms + synonyms = None + if args.synonym_file: + if args.task_type in ["cls", "ext_a", "ext_o"]: + logger.warning( + "The param synonym_file will not work for task, because the task {} that you wanna try does not support synonym_function.".format( + args.task_type + ) + ) + else: + if not os.path.isfile(args.synonym_file): + raise ValueError( + "The path you input is not a file, please input the correct path of synonym file: {}".format( 
+ args.synonym_file + ) + ) + synonyms = load_synonym(args.synonym_file) + + # load implicit opinions + implicit_opinion_map = None + sentiment_map = None + if args.implicit_file: + if args.task_type in ["cls", "ext_a", "ext_o", "ext_as"]: + logger.warning( + "The param implicit_file will not work for task, because the task {} that you wanna try does not support implicit opinion function.".format( + args.task_type + ) + ) + else: + if not os.path.isfile(args.implicit_file): + raise ValueError( + "The path you input is not a file, please input the correct path of implicit opinion file: {}".format( + args.implicit_file + ) + ) + implicit_opinion_map, sentiment_map = load_implicit_opinion(args.implicit_file) + + # split examples into train/dev/test examples + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + # define Convertor and convert raw examples to model examples + convertor = Convertor(negative_ratio=args.negative_ratio) + + if args.task_type.startswith("ext"): + train_examples = convertor.convert_ext_examples( + raw_examples[:p1], + synonyms=synonyms, + implicit_opinion_map=implicit_opinion_map, + sentiment_map=sentiment_map, + task_type=args.task_type, + data_flag="Train", + ) + dev_examples = convertor.convert_ext_examples( + raw_examples[p1:p2], + synonyms=synonyms, + implicit_opinion_map=implicit_opinion_map, + sentiment_map=sentiment_map, + task_type=args.task_type, + data_flag="Dev", + ) + test_examples = convertor.convert_ext_examples( + raw_examples[p2:], + synonyms=synonyms, + implicit_opinion_map=implicit_opinion_map, + sentiment_map=sentiment_map, + task_type=args.task_type, + data_flag="Test", + ) + else: + train_examples = convertor.convert_cls_examples(raw_examples[:p1], data_flag="Train") + dev_examples = convertor.convert_cls_examples(raw_examples[p1:p2], data_flag="Dev") + test_examples = convertor.convert_cls_examples(raw_examples[p2:], data_flag="Test") + + # save examples + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + _save_examples(args.save_dir, "train.json", train_examples) + _save_examples(args.save_dir, "dev.json", dev_examples) + _save_examples(args.save_dir, "test.json", test_examples) + + logger.info("Finished! 
It takes {:.2f} seconds".format(time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.") + parser.add_argument("--synonym_file", type=str, help="The synonmy file of aspect to support aspect aggregation.") + parser.add_argument("--implicit_file", type=str, help="The implicit opinion file whose aspect not be mentioned in text, to support extraction of implicit opinion.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Worked only for the extraction task, it means that for each task (aspect-based opinion extraction, aspect-based sentiment classicition) of an example, at least negative_ratio negative examples will be generated without considering synonym_file and implicit_file.") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Two task types [ext, cls] are supported, ext represents the aspect-based extraction task and cls represents the sentence-level classification task, defaults to ext.") + parser.add_argument("--options", type=str, nargs="+", help="Used only for the classification task, the options for classification") + parser.add_argument("--is_shuffle", type=strtobool, default="True", help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + + args = parser.parse_args() + # yapf: enablecl + logger.info("Parameter Description:\n{}\n".format(args.__dict__)) + + do_convert() diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/utils.py b/applications/sentiment_analysis/unified_sentiment_extraction/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3b52a525fb2a06a9d175232bd196559c0ca6f6ad --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/utils.py @@ -0,0 +1,265 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
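+
+# Shared helpers for the unified sentiment extraction scripts: seeding, txt/JSON
+# IO, dataloader construction, UIE-style example conversion (prompt + content ->
+# token-level start/end pointer labels) and evaluation-time prompt normalization.
+#
+# Illustrative shape of one training example consumed below (values are made up):
+#   {"content": "房间很干净", "prompt": "评价维度",
+#    "result_list": [{"text": "房间", "start": 0, "end": 2}]}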
+ +import argparse +import json +import random +import re + +import numpy as np +import paddle + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def load_txt(file_path): + texts = [] + with open(file_path, "r", encoding="utf-8") as f: + for line in f.readlines(): + texts.append(line.strip()) + return texts + + +def load_json_file(path): + exmaples = [] + with open(path, "r", encoding="utf-8") as f: + for line in f.readlines(): + example = json.loads(line) + exmaples.append(example) + return exmaples + + +def write_json_file(examples, save_path): + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + line = json.dumps(example, ensure_ascii=False) + f.write(line + "\n") + + +def str2bool(v): + """Support bool type for argparse.""" + if v.lower() in ("yes", "true", "t", "y", "1"): + return True + elif v.lower() in ("no", "false", "f", "n", "0"): + return False + else: + raise argparse.ArgumentTypeError("Unsupported value encountered.") + + +def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None): + """ + Create dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True) + return dataloader + + +def convert_example(example, tokenizer, max_seq_len): + """ + example: { + title + prompt + content + result_list + } + """ + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(1, len(offset_mapping)): + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + start_ids = [0 for x in range(max_seq_len)] + end_ids = [0 for x in range(max_seq_len)] + for item in example["result_list"]: + # Positioning at char granularity,offset_mapping indicates offset by char. 
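+        # The result offsets are character positions inside `content`; `bias`
+        # shifts them past the "[CLS] prompt [SEP]" prefix so that map_offset()
+        # can translate them into token indices. start_ids/end_ids then get 1.0
+        # at those token positions, i.e. the pointer-style start/end supervision
+        # used to train the extraction model.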
+ start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + tokenized_output = [ + encoded_inputs["input_ids"], + encoded_inputs["token_type_ids"], + encoded_inputs["position_ids"], + encoded_inputs["attention_mask"], + start_ids, + end_ids, + ] + tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output] + return tuple(tokenized_output) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. + if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + + for result in result_list: + if result["start"] + 1 <= max_content_len < result["end"]: + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def unify_prompt_name(prompt): + # The classification labels are shuffled during finetuning, so they need + # to be unified during evaluation. 
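+    # For example (illustrative): "情感倾向[负向,正向]" and "情感倾向[正向,负向]"
+    # normalize to the same prompt, because the options inside the brackets are
+    # de-duplicated and sorted before being re-joined.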
+ if re.search(r"\[.*?\]$", prompt): + prompt_prefix = prompt[: prompt.find("[", 1)] + cls_options = re.search(r"\[.*?\]$", prompt).group()[1:-1].split(",") + cls_options = sorted(list(set(cls_options))) + cls_options = ",".join(cls_options) + prompt = prompt_prefix + "[" + cls_options + "]" + return prompt + return prompt + + +def get_relation_type_dict(relation_data): + def compare(a, b): + a = a[::-1] + b = b[::-1] + res = "" + for i in range(min(len(a), len(b))): + if a[i] == b[i]: + res += a[i] + else: + break + if res == "": + return res + elif res[::-1][0] == "的": + return res[::-1][1:] + return "" + + relation_type_dict = {} + added_list = [] + for i in range(len(relation_data)): + added = False + if relation_data[i][0] not in added_list: + for j in range(i + 1, len(relation_data)): + match = compare(relation_data[i][0], relation_data[j][0]) + if match != "": + match = unify_prompt_name(match) + if relation_data[i][0] not in added_list: + added_list.append(relation_data[i][0]) + relation_type_dict.setdefault(match, []).append(relation_data[i][1]) + added_list.append(relation_data[j][0]) + relation_type_dict.setdefault(match, []).append(relation_data[j][1]) + added = True + if not added: + added_list.append(relation_data[i][0]) + suffix = relation_data[i][0].rsplit("的", 1)[1] + suffix = unify_prompt_name(suffix) + relation_type_dict[suffix] = relation_data[i][1] + return relation_type_dict diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/visual_analysis.py b/applications/sentiment_analysis/unified_sentiment_extraction/visual_analysis.py new file mode 100644 index 0000000000000000000000000000000000000000..bba221429f28381b80448ee73378c1ab1b90669a --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/visual_analysis.py @@ -0,0 +1,663 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import shutil +from collections import defaultdict +from operator import itemgetter + +import matplotlib.pyplot as plt +import wordcloud +from utils import load_json_file + +from paddlenlp.taskflow.utils import download_file +from paddlenlp.utils.log import logger + +# make sure that the font 'SimHei' is installed in system +plt.rcParams["font.sans-serif"] = ["SimHei"] +plt.rcParams["axes.unicode_minus"] = False + +URLS = { + "SimHei": [ + "https://paddlenlp.bj.bcebos.com/applications/sentiment_analysis/SimHei.ttf", + "c9c9de86d3fa7c4af0d3f1269bb2dff2", + ], +} + +PROMPT_ITEMS = { + "aspect_prompt": "评价维度", + "opinion_prompt": "观点词", + "sentiment_prompt_prefix": "情感倾向", + "separator": "##", + "not_mentioned_option": "未提及", + "positive_option": "正向", + "negative_option": "负向", +} + + +class VisualSentiment(object): + """ + A tool class for visualing sentiment analysis results. 
+ """ + + def __init__(self, font_path=None): + if font_path is not None: + if not os.path.isfile(font_path): + raise ValueError("The param font_path passed in may not be a file: {}".format(font_path)) + self.font_path = font_path + else: + default_name = "SimHei" + save_dir = os.path.dirname(__file__) + download_file(save_dir, default_name + ".ttf", URLS[default_name][0], URLS[default_name][1]) + self.font_path = os.path.join(save_dir, default_name + ".ttf") + + self.wc = wordcloud.WordCloud(font_path=self.font_path, background_color="white", width=800, height=400) + plt.figure(figsize=(8, 6)) + + def _plot_wordcloud(self, content_freq, save_path): + """ + plot wordcloud image. + + Args: + content_freq (dict): a content dict with frequency, the key is content and its value is frequency. + save_path (str): path that the image is saved to. + """ + + text_list = [] + for item in content_freq: + text_list.extend([item] * content_freq[item]) + random.shuffle(text_list) + text = " ".join(text_list) + + self.wc.generate(text) + self.wc.to_file(save_path) + + def _plot_histogram( + self, content_freq, save_path, with_line_chart="true", top_n=15, plt_title="", plt_xlabel="", plt_ylabel="" + ): + """ + generate histogram image. one aspect corresponds to one bar. + + Args: + content_freq (dict): a content dict with frequency, the key is content and its value is frequency. + save_path (str): path that the image is saved to. + with_line_chart (bool): Whether to plot line chart, only work when image_type is set be histogram. + top_n (int): show top_n of frequency of contents, only work when image_type is set be histogram. + plt_title (str): the title of image, only work when image_type is set be histogram. + plt_xlabel (str): the 'x' axis label of image, only work when image_type is set be histogram. + plt_ylabel (str): the 'y' axis label of image, only work when image_type is set be histogram. + """ + + content_freq_items = content_freq.items() + content_freq_items = sorted(content_freq_items, key=lambda x: x[1], reverse=True) + content_freq_items = content_freq_items[:top_n] + + x_data = [item[0] for item in content_freq_items] + y_data = [item[1] for item in content_freq_items] + + for i in range(len(x_data)): + plt.bar(x_data[i], y_data[i]) + + if with_line_chart: + plt.plot(x_data, y_data, "-") + plt.title(plt_title) + + plt.xlabel(plt_xlabel) + plt.ylabel(plt_ylabel) + plt.savefig(save_path) + plt.close() + + def _plot_content_with_frequency( + self, + content_freq, + save_path, + image_type="wordcloud", + with_line_chart="true", + top_n=15, + plt_title="", + plt_xlabel="", + plt_ylabel="", + ): + """ + generate image for specified content, such as aspect, opinion and so on. + two types of images are supported: wordcloud and histogram. + + Args: + content_freq (dict): a content dict with frequency, the key is content and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, only work when image_type is set be histogram. + top_n (int): show top_n of frequency of contents, only work when image_type is set be histogram. + plt_title (str): the title of image, only work when image_type is set be histogram. + plt_xlabel (str): the 'x' axis label of image, only work when image_type is set be histogram. + plt_ylabel (str): the 'y' axis label of image, only work when image_type is set be histogram. 
+ """ + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + if image_type == "wordcloud": + self._plot_wordcloud(content_freq, save_path) + else: + self._plot_histogram( + content_freq, + save_path, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=plt_title, + plt_xlabel=plt_xlabel, + plt_ylabel=plt_ylabel, + ) + + def plot_aspect_with_frequency( + self, aspect_freq, save_path, image_type="wordcloud", with_line_chart="true", top_n=15 + ): + """ + generate image for aspect, two types of images are supported: wordcloud and histogram. + this method can help analyze which aspects of the product/service are more important to customers. + + Args: + aspect_freq (dict): an aspect dict with frequency, the key is aspect and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of apsects, Only work when image_type is set be histogram. + """ + + if not aspect_freq: + raise ValueError("aspect_freq is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + if image_type == "wordcloud": + self._plot_content_with_frequency(aspect_freq, save_path, image_type=image_type) + else: + title = "The histogram of aspect/frequency" + xlabel = "aspect" + ylabel = "frequency" + + self._plot_content_with_frequency( + aspect_freq, + save_path, + image_type=image_type, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_opinion_with_frequency( + self, opinion_freq, save_path, image_type="wordcloud", with_line_chart="true", top_n=15 + ): + """ + generate image for opinion, two types of images are supported: wordcloud and histogram. + this method can help analyze the whole impression of the product/service. + + Args: + opinion_freq (dict): an opinion dict with frequency, the key is opinion and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. + """ + + if not opinion_freq: + raise ValueError("opinion_freq is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." 
+ ) + + if image_type == "wordcloud": + self._plot_content_with_frequency(opinion_freq, save_path, image_type=image_type) + else: + title = "The histogram of opinion/frequency" + xlabel = "opinion" + ylabel = "frequency" + + self._plot_content_with_frequency( + opinion_freq, + save_path, + image_type=image_type, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_aspect_with_opinion( + self, aspect_opinion, save_path, sentiment="all", image_type="wordcloud", with_line_chart="true", top_n=15 + ): + """ + generate image with aspect and opinion, that is, combining apsect with opinion to display the more specifical opinions of aspect. + this method can help you at two aspects: 1. mining custom's overall impression of products/services; 2. analyzing the quality of some aspect and improve it further. + + Args: + aspect_opinion (dict[dict] or dict): when sentiment set be "all", a expected dict containing aspect, opinion and its frequency, the key is aspect and its value is a dict containing the aspect's opinion and frequency. when sentiment set be "positive" or "netative", a expected dict containing aspect with opinion and frequency, the key is aspect with opinion and its value is frequency. + aspect_sentiment (dict[dict]): a dict containing aspect, sentiment and its frequency, the key is aspect and its value is a dict containing the aspect's sentiment and frequency. + save_path (str): path that the image is saved to. + sentiment (str): analyzing aspect with sentiment, Only "all", "positive" and "negative" are received. "positive" only analyzes positive aspects with opinions, "negative" only analyzes negative aspects with opinions, and "all" analyzes all apsects. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. + """ + + if not aspect_opinion: + raise ValueError("aspect_opinion is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + if sentiment not in ["all", "positive", "negative"]: + raise ValueError( + "Only 'all', 'positive' and 'negative' are received for sentiment, that is, you should set be in [all, positive, negative]." 
+ ) + + if sentiment == "all": + new_aspect_opinion = {} + + for aspect in aspect_opinion: + for opinion in aspect_opinion[aspect]: + key = aspect + opinion + new_aspect_opinion[key] = aspect_opinion[aspect][opinion] + aspect_opinion = new_aspect_opinion + + if image_type == "wordcloud": + self._plot_content_with_frequency(aspect_opinion, save_path, image_type=image_type) + else: + if sentiment == "all": + title = "The histogram of aspect with opinion/frequency" + else: + title = "The histogram of {} aspect with opinion/frequency".format(sentiment) + xlabel = "aspect with opinion" + ylabel = "frequency" + + self._plot_content_with_frequency( + aspect_opinion, + save_path, + image_type=image_type, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_aspect_with_sentiment( + self, aspect_sentiment, save_path, image_type="wordcloud", top_n=0, descend_aspects=None + ): + """ + generate image with aspect and sentiment, that is, combining apsect and sentiment to display the sentiment of aspect. + This method can help you more intuitively analyze customers' direct impressions of aspects of products/services. + + Args: + aspect_sentiment (dict[dict]): a dict containing aspect, sentiment and its frequency, the key is aspect and its value is a dict containing the aspect's sentiment and frequency. + descend_aspects (dict): an aspect list, sorted by frequency in reverse order. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. if top_n set be 0, it will plot all aspects in histogram. + """ + + if not aspect_sentiment: + raise ValueError("aspect_sentiment is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." 
+ ) + + if image_type == "wordcloud": + new_aspect_opinion = {} + for aspect in aspect_sentiment: + for sentiment in aspect_sentiment[aspect]: + key = aspect + sentiment + new_aspect_opinion[key] = aspect_sentiment[aspect][sentiment] + self._plot_wordcloud(new_aspect_opinion, save_path) + else: + if top_n != 0 and descend_aspects is None: + raise ValueError("You should input the param descend_aspects when top_n != 0.") + + if top_n != 0: + keep_aspects = set(descend_aspects[:top_n]) + + aspects = [] + positives = [] + negatives = [] + for aspect, sentiment in aspect_sentiment.items(): + if top_n != 0 and aspect not in keep_aspects: + continue + aspects.append(aspect) + if "正向" in sentiment: + positives.append(sentiment["正向"]) + else: + positives.append(0) + if "负向" in sentiment: + negatives.append(sentiment["负向"]) + else: + negatives.append(0) + + total_width, n = 0.8, 2 + width = total_width / n + x_pos = [item - (total_width - width) / 2 for item in range(len(aspects))] + x_neg = [item + width for item in x_pos] + + plt.bar(x_pos, positives, width=width, label="positive") + plt.bar(x_neg, negatives, width=width, label="negative") + plt.title("The histogram of aspect/sentiment") + plt.xlabel("aspect") + plt.ylabel("sentiment frequency") + plt.xticks(x_pos, aspects) + plt.legend() + plt.savefig(save_path) + plt.close() + + def plot_opinion_with_aspect( + self, aspect, aspect_opinion, save_path, image_type="wordcloud", with_line_chart=True, top_n=15 + ): + """ + generate opinion image for given aspect. This method can help you analyzing opinions for given aspects. + + Args: + aspect (str): The set of aspect to analyze. + aspect_opinion (dict[dict] or dict): when sentiment set be "all", a expected dict containing aspect, opinion and its frequency, the key is aspect and its value is a dict containing the aspect's opinion and frequency. when sentiment set be "positive" or "netative", a expected dict containing aspect with opinion and frequency, the key is aspect with opinion and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. + """ + + if not aspect_opinion: + raise ValueError("aspect_opinion is empty, please check it.") + + if aspect not in aspect: + raise ValueError("{} not in aspect_opinion, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + opinions = aspect_opinion[aspect] + opinion_items = sorted(opinions.items(), key=lambda x: x[1], reverse=True) + if top_n is not None: + opinion_items = opinion_items[:top_n] + + opinion_freq = {k: v for k, v in opinion_items} + + if image_type == "wordcloud": + self._plot_wordcloud(opinion_freq, save_path) + else: + title = "The opinion analysis for aspect [{}] ".format(aspect) + xlabel = "opinion" + ylabel = "frequency" + self._plot_histogram( + opinion_freq, + save_path, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_sentence_sentiment(self, sentence_sentiment, save_path): + """ + generate image for sentence sentiment, only histogram are supported. 
+ this method can help analyze the customers' whole impression for product/service. + + Args: + sentence_sentiment (dict): an sentiment dict with frequency, the key is sentiment polarity and its value is frequency. + save_path (str): path that the image is saved to. + """ + + if not sentence_sentiment: + raise ValueError("sentence_sentiment is empty, please check it.") + + title = "The histogram of sentence sentiment" + xlabel = "sentiment polarity" + ylabel = "frequency" + + self._plot_histogram( + sentence_sentiment, save_path, with_line_chart=False, plt_title=title, plt_xlabel=xlabel, plt_ylabel=ylabel + ) + + +class SentimentResult(object): + """ + load and analyze result of sentiment analysis. + """ + + def __init__(self, file_path): + self.file_path = file_path + self.sentiment_prompt = PROMPT_ITEMS["sentiment_prompt"] + self.sentiment_prompt_prefix = PROMPT_ITEMS["sentiment_prompt_prefix"] + self.options = PROMPT_ITEMS["options"] + self.opinion_prompt = PROMPT_ITEMS["opinion_prompt"] + self.aspect_prompt = PROMPT_ITEMS["aspect_prompt"] + self.not_mentioned_option = PROMPT_ITEMS["not_mentioned_option"] + self.positive_option = PROMPT_ITEMS["positive_option"] + self.negative_option = PROMPT_ITEMS["negative_option"] + self.prompts = set() + # load the result of sentiment analysis + self.results = self._load_sentiment_result(file_path) + # define the parsing middle result for sentiment analysis + self.aspect_frequency = defaultdict(int) + self.opinion_frequency = defaultdict(int) + self.aspect_sentiment = defaultdict(dict) + self.aspect_opinion = defaultdict(dict) + self.aspect_opinion_positives = defaultdict(int) + self.aspect_opinion_negatives = defaultdict(int) + self.descend_aspects = [] + self.sentence_sentiment = defaultdict(int) + # start to parse sentiment result + self.parse_sentiment_result(self.results) + + def _load_sentiment_result(self, file_path): + return load_json_file(file_path) + + def _parse_aspect(self, aspect): + aspect_name = aspect["text"] + self.aspect_frequency[aspect_name] += 1 + if "relations" not in aspect: + return + + sentiment_name = None + if self.sentiment_prompt in aspect["relations"].keys(): + sentiment = aspect["relations"][self.sentiment_prompt][0] + sentiment_name = sentiment["text"] + if sentiment_name == self.not_mentioned_option: + sentiment_name = None + return + if sentiment_name not in self.aspect_sentiment[aspect_name]: + self.aspect_sentiment[aspect_name][sentiment_name] = 1 + else: + self.aspect_sentiment[aspect_name][sentiment_name] += 1 + + if self.opinion_prompt in aspect["relations"].keys(): + opinions = aspect["relations"][self.opinion_prompt] + for opinion in opinions: + opinion_name = opinion["text"] + self.opinion_frequency[opinion_name] += 1 + if opinion_name not in self.aspect_opinion[aspect_name]: + self.aspect_opinion[aspect_name][opinion_name] = 1 + else: + self.aspect_opinion[aspect_name][opinion_name] += 1 + + if sentiment_name is not None: + aspect_opinion_name = aspect_name + opinion_name + if sentiment_name == self.positive_option: + self.aspect_opinion_positives[aspect_opinion_name] += 1 + else: + self.aspect_opinion_negatives[aspect_opinion_name] += 1 + + self.prompts.update(aspect["relations"].keys()) + + def _parse_opinion(self, opinion): + opinion_name = opinion["text"] + self.opinion_frequency[opinion_name] += 1 + + def _parse_sentiment_polarity(self, sentiment): + sentiment_name = sentiment["text"] + self.sentence_sentiment[sentiment_name] += 1 + + def parse_one_result(self, result): + for key in result.keys(): + 
if key == self.aspect_prompt: + for aspect in result[self.aspect_prompt]: + self._parse_aspect(aspect) + elif key == self.opinion_prompt: + for opinion in result[self.opinion_prompt]: + self._parse_opinion(opinion) + elif key == self.sentiment_prompt: + sentiment = result[self.sentiment_prompt][0] + self._parse_sentiment_polarity(sentiment) + else: + raise ValueError( + "Unknown key {} for sentiment analysis, you can check it as follows: 1. whether the parameter task_type is right; 2. whether the sentiment prompt {} created by the parameter options matches with the prompt {} in the file of sentiment analysis results; 3. whether the aspect_prompt, opinion_prompt or sentiment prompt are right.".format( + key, self.sentiment_prompt, key + ) + ) + self.prompts.add(key) + + def parse_sentiment_result(self, results): + for result in results: + if not result: + continue + self.parse_one_result(result) + # parse descend_aspects + descend_aspects_items = sorted(self.aspect_frequency.items(), key=itemgetter(1), reverse=True) + self.descend_aspects = [item[0] for item in descend_aspects_items] + # check whether sentiment prompt is parsed correctly + for prompt in self.prompts: + if prompt.startswith(self.sentiment_prompt_prefix) and prompt != self.sentiment_prompt: + logger.warning( + "The visual images related to sentiment ploarity cannot be generated. Because the sentiment prompt {} created by the opinions you input cannot be match with the one {} in the file of sentiment analysis result.".format( + self.sentiment_prompt, prompt + ) + ) + + +def default_visual_analysis(args): + # checking generating environment + if os.path.exists(args.save_dir): + shutil.rmtree(args.save_dir) + os.makedirs(args.save_dir) + # update sentiment prompt according to task type + if args.options: + PROMPT_ITEMS["options"] = args.options + else: + if args.task_type == "ext": + PROMPT_ITEMS["options"] = [ + PROMPT_ITEMS["positive_option"], + PROMPT_ITEMS["negative_option"], + PROMPT_ITEMS["not_mentioned_option"], + ] + else: + PROMPT_ITEMS["options"] = [PROMPT_ITEMS["positive_option"], PROMPT_ITEMS["negative_option"]] + PROMPT_ITEMS["sentiment_prompt"] = PROMPT_ITEMS["sentiment_prompt_prefix"] + "[{}]".format( + ",".join(PROMPT_ITEMS["options"]) + ) + + # define sr to process the result of sentiment analysis + logger.info("Trying to parse sentiment analysis result: {}".format(args.file_path)) + sr = SentimentResult(args.file_path) + # define vs to visualize sentiment result + vs = VisualSentiment(font_path=args.font_path) + logger.info("Start to generate visual images of sentiment analysis for you.") + # visualize aspect with frequency + if args.task_type == "ext" and sr.aspect_frequency: + save_path = os.path.join(args.save_dir, "aspect_wc.png") + vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="wordcloud") + save_path = os.path.join(args.save_dir, "aspect_hist.png") + vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="histogram") + # visualize opinion with frequency + if args.task_type == "ext" and sr.opinion_frequency: + save_path = os.path.join(args.save_dir, "opinion_wc.png") + vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="wordcloud") + save_path = os.path.join(args.save_dir, "opinion_hist.png") + vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="histogram") + # visualize aspect and opinion + if args.task_type == "ext" and sr.aspect_opinion: + save_path = os.path.join(args.save_dir, "aspect_opinion_wc.png") + 
vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="wordcloud", sentiment="all") + save_path = os.path.join(args.save_dir, "aspect_opinion_hist.png") + vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="histogram", sentiment="all", top_n=8) + # visualize positive aspect and opinion + if args.task_type == "ext" and sr.aspect_opinion_positives: + save_path = os.path.join(args.save_dir, "aspect_opinion_wc_pos.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_positives, save_path, image_type="wordcloud", sentiment="positive" + ) + save_path = os.path.join(args.save_dir, "aspect_opinion_hist_pos.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_positives, save_path, image_type="histogram", sentiment="positive", top_n=8 + ) + # visualize negative aspect and opinion + if args.task_type == "ext" and sr.aspect_opinion_negatives: + save_path = os.path.join(args.save_dir, "aspect_opinion_wc_neg.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_negatives, save_path, image_type="wordcloud", sentiment="negative" + ) + save_path = os.path.join(args.save_dir, "aspect_opinion_hist_neg.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_negatives, save_path, image_type="histogram", sentiment="negative", top_n=8 + ) + # visualize aspect and sentiment + if args.task_type == "ext" and sr.aspect_sentiment: + save_path = os.path.join(args.save_dir, "aspect_sentiment_wc.png") + vs.plot_aspect_with_sentiment(sr.aspect_sentiment, save_path, image_type="wordcloud") + save_path = os.path.join(args.save_dir, "aspect_sentiment_hist.png") + vs.plot_aspect_with_sentiment( + sr.aspect_sentiment, save_path, image_type="histogram", top_n=15, descend_aspects=sr.descend_aspects + ) + # visualize sentiment polarity for sentence + if args.task_type == "cls" and sr.sentence_sentiment: + save_path = os.path.join(args.save_dir, "sentence_sentiment.png") + vs.plot_sentence_sentiment(sr.sentence_sentiment, save_path) + + if not os.listdir(args.save_dir): + logger.info( + "Nothing generated for task {}, please check that you input the correct parameter task_type or the result of sentiment analysis.".format( + args.task_type + ) + ) + else: + logger.info("Visual images for sentiment analysis has been saved to: {}".format(args.save_dir)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--file_path", required=True, type=str, help="The result path of sentiment analysis.") + parser.add_argument("--save_dir", default="./images", type=str, help="The saving path of images.") + parser.add_argument("--font_path", default=None, type=str, help="The font Path for showing Chinese in wordcloud.") + parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Two task types [ext, cls] are supported, ext represents the aspect-based extraction task and cls represents the sentence-level classification task, defaults to ext.") + parser.add_argument("--options", type=str, nargs="+", help="Used only for the classification task, the options for classification") + + args = parser.parse_args() + # ypdf: enable + + default_visual_analysis(args) diff --git a/applications/speech_cmd_analysis/README.md b/applications/speech_cmd_analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9e56c6a4b30505ec12edae7d4e4ca02cdc52a5e4 --- /dev/null +++ b/applications/speech_cmd_analysis/README.md @@ -0,0 +1,216 @@ +# 智能语音指令解析 (Speech Command Analysis) + +## 1. 
项目说明 + +**智能语音指令解析**集成了业界领先的语音识别(Automatic Speech Recognition, ASR)、信息抽取(Information Extraction, IE)等技术,打造智能一体化的语音指令系统,广泛应用于智能语音填单、智能语音交互、智能语音检索、手机APP语音唤醒等场景,提高人机交互效率。 + +其中,**智能语音填单**允许用户通过**口述**的方式记录信息,利用**算法**解析口述内容中的关键信息,完成**自动信息录入**。 + +#### 场景痛点 + +- 电话分析:边询问边记录,关键信息遗漏。例如,社区疫情防控信息记录员需要边通电话边记录关键信息,重点信息不突出,人工二次审核成本高。 +- 工单生成:特定场景,无法完成文字录入。例如,电力路线巡检工作人员在高空巡检高压电线路,不便即时文字记录,滞后记录可能导致信息遗漏。 +- 信息登记:重复性的工作,效率低易出错。例如,某品牌汽车售后客服话务员每天接听约300通电话,重复性工作耗时长,易出错。 + +针对以上场景,应用Baidu大脑AI开放平台[短语音识别标准版](https://ai.baidu.com/tech/speech/asr)和[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)的信息抽取技术,可以自动识别和抽取语音中的关键信息,帮助相关人员简化记录流程,提高工作效率和质量。 +另外,通过构造小样本优化信息抽取模型,能够获得更加准确的场景定制化效果。 + +#### 方案选型 + +- **语音识别模型** + Baidu大脑AI开放平台[短语音识别标准版](https://ai.baidu.com/tech/speech/asr)采用领先国际的流式端到端语音语言一体化建模方法,融合百度自然语言处理技术,近场中文普通话识别准确率达98%。根据语音内容理解可以将数字序列、小数、时间、分数、基础运算符正确转换为数字格式,使得识别的数字结果更符合使用习惯,直观自然。 + +- **信息抽取模型** + [Universal Information Extraction, UIE](https://arxiv.org/pdf/2203.12277.pdf): Yaojie Lu等人在2022年提出了开放域信息抽取的统一框架,这一框架在实体抽取、关系抽取、事件抽取、情感分析等任务上都有着良好的泛化效果。本应用基于这篇工作的prompt设计思想,提供了以ERNIE为底座的阅读理解型信息抽取模型,用于关键信息抽取。同时,针对不同场景,支持通过构造小样本数据来优化模型效果,快速适配特定的关键信息配置。 + + +## 2. 安装说明 + +#### 环境要求 + +- paddlepaddle >= 2.2.0 +- paddlenlp >= 2.3.0 + +安装相关问题可参考[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)和[PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/get_started/installation.html)文档。 + +#### 可选依赖 + +- 若要使用音频文件格式转换脚本,则需安装依赖``ffmpeg``和``pydub``。 + +``` +git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg +cd ffmpeg +./configure +make +make install +pip install pydub +``` + +## 3. 数据准备 + +本应用来自于语音报销工单信息录入场景,即员工向公司报销部门提出交通费报销的口头申请,在传统场景下,报销审核人员需要人工将语音转换为文字信息,并从中抽取记录报销需要的``时间``、``出发地``、``目的地``和``费用``字段,而在本应用可以端到端的完成这一工作。相应的数据集为[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl),共50条标注数据,用于信息抽取模型在交通费报销场景下的优化,示例数据如下: + +```json +{"id": 39, "text": "10月16日高铁从杭州到上海南站车次d5414共48元", "relations": [], "entities": [{"id": 90, "start_offset": 0, "end_offset": 6, "label": "时间"}, {"id": 77, "start_offset": 9, "end_offset": 11, "label": "出发地"}, {"id": 91, "start_offset": 12, "end_offset": 16, "label": "目的地"}, {"id": 92, "start_offset": 24, "end_offset": 26, "label": "费用"}]} +``` + +其中抽取的目标(schema)表示为: + +```python +schema = ['出发地', '目的地', '费用', '时间'] +``` + +标注数据保存在同一个文本文件中,每条样例占一行且存储为``json``格式,其包含以下字段 +- ``id``: 样本在数据集中的唯一标识ID。 +- ``text``: 语音报销工单的原始文本数据。 +- ``entities``: 数据中包含的实体标签,每个实体标签包含四个字段: + - ``id``: 实体在数据集中的唯一标识ID,不同样本中的相同实体对应同一个ID。 + - ``start_offset``: 实体的起始token在文本中的下标。 + - ``end_offset``: 实体的结束token在文本中下标的下一个位置。 + - ``label``: 实体类型。 +- ``relations``: 数据中包含的关系标签(在语音报销工单应用中无关系标签),每个关系标签包含四个字段: + - ``id``: (关系主语,关系谓语,关系宾语)三元组在数据集中的唯一标识ID,不同样本中的相同三元组对应同一个ID。 + - ``from_id``: 关系主语实体对应的标识ID。 + - ``to_id``: 关系宾语实体对应的标识ID。 + - ``type``: 关系类型。 + +#### BaiduAI开放平台申请使用 + +- 注册账号。在[百度智能云](https://console.bce.baidu.com)注册账号并登陆。 +- 资源申请。平台提供了免费资源用于功能测试,打开[语音识别控制台](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/index),点击``领取免费资源``,勾选短语音识别后点击下方``0元领取``。 +- 创建应用。打开语音识别控制台,点击[创建应用](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/app/create),填写必选项后点击``立即创建``。 +- 获取API Key和Secret Key。打开语音识别控制台,点击[管理应用](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/app/list)即可查看应用对应的API Key和Secret Key。在运行本应用脚本时,设置这两个参数即可调用该平台的语音识别服务。 + +#### 音频格式转换 + +在语音报销工单信息录入的场景下,模型的输入为报销工单相关的音频文件。可以根据设备类型,选取合适的录音软件来录制音频文件,保存格式应为``.wav``数据格式。若音频文件格式不符,可以运行以下脚本进行转换: + +- 单个文件格式转换 + +``` +python audio_to_wav.py 
--audio_file sample.m4a --audio_format m4a --save_dir ./audios_wav/ +``` + +- 指定目录下所有文件格式转换 + +``` +python audio_to_wav.py --audio_file ./audios_raw/ --save_dir ./audios_wav/ +``` + +可配置参数包括 + +- ``audio_file``: 原始音频文件或者所在目录。若设置为目录,则对该目录下所有音频文件进行格式转换。 +- ``audio_format``: 原始音频文件格式(可选),支持``mp3``, ``m4a``。若未设置,则根据文件扩展名对支持的两种音频文件进行格式转换。 +- ``save_dir``: 转换后``.wav``格式文件的存储目录,文件名称与原始音频保持一致。 + +#### 自定义数据标注 + +对于不同的应用场景,关键信息的配置多种多样,直接应用通用信息抽取模型的效果可能不够理想。这时可以标注少量场景相关的数据,利用few-shot learning技术来改进特定场景下的信息抽取效果。在本应用场景中,标注数据为[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl)。针对其他场景,可使用[doccano](https://github.com/doccano/doccano)平台标注并导出自定义数据。 + + +## 4. 模型训练 + +针对特定场景下的关键信息配置,需要使用标注数据对通用信息抽取模型进行训练以优化抽取效果。 + +#### 代码结构 + +```shell +. +├── audio_to_wav.py # 音频文件格式转换脚本 +├── pipeline.py # 语音指令解析脚本 +├── preprocess.py # 数据预处理脚本 +├── finetune.py # 信息抽取模型 fine-tune 脚本 +├── model.py # 信息抽取模型(UIE)组网脚本 +└── utils.py # 辅助函数 +``` + +#### 数据预处理 + +下载[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl),存储在``./data/``目录下。执行以下脚本,按设置的比例划分数据集,同时构造负样本用于提升模型的学习效果。 + +```shell +python preprocess.py \ + --input_file ./data/audio-expense-account.jsonl \ + --save_dir ./data/ \ + --negative_ratio 5 \ + --splits 0.2 0.8 0.0 \ + --seed 1000 +``` + +可配置参数包括 + +- ``input_file``: 标注数据文件名。数据格式应与[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl)一致。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。若``splits``为空,则数据存储在``train.txt``文件,若``splits``为长度为3的列表,则数据存储在目录下的``train.txt``、``dev.txt``、``test.txt``文件。 +- ``negative_ratio``: 负样本与正样本的比例。使用负样本策略可提升模型效果,负样本数量 = negative_ratio * 正样本数量。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. + + +#### 定制化模型训练 + +运行以下命令,使用单卡训练自定义的UIE模型。 + +```shell +CUDA_VISIBLE_DEVICES=0 python finetune.py \ + --train_path ./data/train.txt \ + --dev_path ./data/dev.txt \ + --save_dir ./checkpoint \ + --model uie-base \ + --learning_rate 1e-5 \ + --batch_size 16 \ + --max_seq_len 512 \ + --num_epochs 50 \ + --seed 1000 \ + --logging_steps 10 \ + --valid_steps 10 \ + --device gpu +``` + +可配置参数包括 + +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `save_dir`: 模型存储路径,默认为`./checkpoint`。 +- ``init_from_ckpt``: 可选,模型参数路径,热启动模型训练。默认为None。 +- `learning_rate`: 学习率,默认为1e-5。 +- `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `num_epochs`: 训练轮数,默认为100。 +- `model`: 选择模型,程序会基于选择的模型进行模型微调,可选有`uie-base`和`uie-tiny`。 +- `seed`: 随机种子,默认为1000. +- `logging_steps`: 日志打印的间隔steps数,默认为10。 +- `valid_steps`: evaluate的间隔steps数,默认为100。 +- `device`: 模型训练使用的设备,可选cpu或gpu。 + + +## 5. 
模型预测 + +预测时使用的schema应与finetune阶段训练数据的schema保持一致以得到更好的效果。在语音报销工单信息录入场景下, +- 首先准备好``.wav``格式的音频文件,例如下载[sample.wav](https://bj.bcebos.com/paddlenlp/applications/speech-cmd-analysis/sample.wav)放在``./audios_wav/``目录下。 +- 然后在BaiduAI开放平台创建语音识别应用以获取API Key和Secret Key。 +- 最后加载用场景数据finetune后的模型参数,执行语音指令解析脚本即可抽取报销需要的``时间``、``出发地``、``目的地``和``费用``字段。具体命令如下 + +```shell +python pipeline.py \ + --api_key '4E1BG9lTnlSeIf1NQFlrxxxx' \ + --secret_key '544ca4657ba8002e3dea3ac2f5fxxxxx' \ + --audio_file ./audios_wav/sample.wav \ + --uie_model ./checkpoint/model_best/ \ + --schema '时间' '出发地' '目的地' '费用' +``` + +可配置参数包括 + +- ``api_key``: BaiduAI开放平台上创建应用的API Key。 +- ``secret_key``: BaiduAI开放平台上创建应用的Secret Key。 +- ``audio_file``: ``.wav``格式音频文件路径。 +- ``uie_model``: 预测使用的模型参数文件所在路径。默认为None,即使用通用的预训练UIE模型。 +- ``schema``: 关键实体信息配置。默认为语音报销工单场景下的四个关键字段。 + + +## 6. 模型部署 + +在应用中提供了基于Web的部署Demo方案,支持用户在网页录入语音进行预测。用户可根据实际情况参考实现。 + +![demo](https://user-images.githubusercontent.com/25607475/165510522-a7f5f131-cd3f-4855-8932-6d8b6a7bb913.png) diff --git a/applications/speech_cmd_analysis/audio_to_wav.py b/applications/speech_cmd_analysis/audio_to_wav.py new file mode 100644 index 0000000000000000000000000000000000000000..a0f4aa631976e78b9aefc8772897429602c79dc5 --- /dev/null +++ b/applications/speech_cmd_analysis/audio_to_wav.py @@ -0,0 +1,68 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse + +from pydub import AudioSegment + +if __name__ == "__main__": + parser = argparse.ArgumentParser("Convert other audio formats to wav.") + parser.add_argument("--audio_file", type=str, required=True, help="Path to source audio.") + parser.add_argument("--wav_file", type=str, default=None, help="Path to save .wav file.") + parser.add_argument("--audio_format", type=str, default=None, help="The file extension.") + args = parser.parse_args() + + supported = ["mp3", "m4a", "wav"] + if args.audio_format is not None: + if args.audio_format not in supported: + raise ValueError(".%s format file is not supported!" % args.audio_format) + supported = [args.audio_format] + print("All %s format files are converted to .wav format..." % ", ".join(supported)) + + if os.path.isfile(args.audio_file): + src_files = [args.audio_file] + if args.audio_format is not None: + if args.audio_format != args.audio_file.strip().split(".")[-1]: + raise ValueError( + "Ignore audio_format %s! It is not consistent with the format of audio_file %s." + % (args.audio_format, args.audio_file) + ) + elif os.path.isdir(args.audio_file): + src_files = [x for x in os.listdir(args.audio_file) if x.strip().split(".")[-1] in supported] + src_files = [os.path.join(args.audio_file, x) for x in src_files] + else: + raise Exception("%s is neither valid path nor file!" 
% args.audio_file) + + if args.wav_file is None: + wav_files = [os.path.basename(x)[:-3] + "wav" for x in src_files] + elif os.path.isfile(args.wav_file): + if len(src_files) == 1: + wav_files = [args.wav_file] + else: + raise Exception( + "All audios in %s will overwrite the same file %s! \ + Please check it." + % (args.audio_file, args.wav_file) + ) + else: + if not os.path.exists(args.wav_file): + os.makedirs(args.wav_file) + wav_files = [os.path.join(args.wav_file, os.path.basename(x)[:-3] + "wav") for x in src_files] + + for src_file, wav_file in zip(src_files, wav_files): + audio = AudioSegment.from_file(src_file, src_file[-3:]) + wav_audio = audio.export(wav_file, format="wav") + + print("%d files converted!" % len(src_files)) diff --git a/applications/speech_cmd_analysis/finetune.py b/applications/speech_cmd_analysis/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..692c0648a2f289ae6174a369523e381b95fe5c05 --- /dev/null +++ b/applications/speech_cmd_analysis/finetune.py @@ -0,0 +1,127 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from utils import convert_example, create_dataloader, evaluate, reader, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + tokenizer = AutoTokenizer.from_pretrained(args.model) + model = UIE.from_pretrained(args.model) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_ds = load_dataset(reader, data_path=args.train_path, max_seq_len=args.max_seq_len, lazy=False) + dev_ds = load_dataset(reader, data_path=args.dev_path, max_seq_len=args.max_seq_len, lazy=False) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + + train_data_loader = create_dataloader( + dataset=train_ds, mode="train", batch_size=args.batch_size, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader(dataset=dev_ds, mode="dev", batch_size=args.batch_size, trans_fn=trans_func) + + optimizer = paddle.optimizer.AdamW(learning_rate=args.learning_rate, parameters=model.parameters()) + + criterion = paddle.nn.BCELoss() + metric = SpanEvaluator() + + loss_list = [] + global_step = 0 + best_f1 = 0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = 
paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + loss_start = criterion(start_prob, start_ids) + loss_end = criterion(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(float(loss)) + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = sum(loss_list) / len(loss_list) + print( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + tic_train = time.time() + + if global_step % args.valid_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + model.save_pretrained(save_dir) + + precision, recall, f1 = evaluate(model, metric, dev_data_loader) + print("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" % (precision, recall, f1)) + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + model.save_pretrained(save_dir) + tic_train = time.time() + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. ") + parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--model", choices=["uie-base", "uie-tiny"], default="uie-base", type=str, help="Select the pretrained model for few-shot learning.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/speech_cmd_analysis/pipeline.py b/applications/speech_cmd_analysis/pipeline.py new file mode 100644 index 0000000000000000000000000000000000000000..7f6f49780b722cc92e9bdb467181d28163dd299f --- /dev/null +++ b/applications/speech_cmd_analysis/pipeline.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ## Task: Speech Command Analysis for Audio Expense Claim +# +# Structured information entry is a common application scenario of speech +# command analysis, where we can extract expected keywords from audios in +# an end-to-end way. This technique can economize on manpower and reduce +# error rates. + +import argparse +import json +import os +import pprint + +from tqdm import tqdm +from utils import mandarin_asr_api + +from paddlenlp import Taskflow + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--audio_file", type=str, required=True, help="The audio file name.") + parser.add_argument("--api_key", type=str, required=True, help="The app key applied on Baidu AI Platform.") + parser.add_argument( + "--secret_key", type=str, required=True, help="The app secret key generated on Baidu AI Platform." + ) + parser.add_argument("--uie_model", type=str, default=None, help="The path to uie model.") + parser.add_argument( + "--schema", + type=str, + nargs="+", + default=["时间", "出发地", "目的地", "费用"], + help="The type of entities expected to extract.", + ) + parser.add_argument( + "--save_file", type=str, default="./uie_results.txt", help="The path to save the recognised text and schemas." + ) + args = parser.parse_args() + + if os.path.isfile(args.audio_file): + audios = [args.audio_file] + elif os.path.isdir(args.audio_file): + audios = [x for x in os.listdir(args.audio_file)] + audios = [os.path.join(args.audio_file, x) for x in audios] + else: + raise Exception("%s is neither valid path nor file!" % args.audio_file) + + audios = [x for x in audios if x.endswith(".wav")] + if len(audios) == 0: + raise Exception("No valid .wav file! Please check %s." % args.audio_file) + + if args.uie_model is None: + parser = Taskflow("information_extraction", schema=args.schema) + else: + parser = Taskflow("information_extraction", schema=args.schema, task_path=args.uie_model) + + with open(args.save_file, "w") as fp: + for audio_file in tqdm(audios): + # automatic speech recognition + text = mandarin_asr_api(args.api_key, args.secret_key, audio_file) + # extract entities according to schema + result = parser(text) + fp.write(text + "\n") + fp.write(json.dumps(result, ensure_ascii=False) + "\n\n") + print(text) + pprint.pprint(result) diff --git a/applications/speech_cmd_analysis/preprocess.py b/applications/speech_cmd_analysis/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..ed46758acfc426aec3173652705c4a35655f0386 --- /dev/null +++ b/applications/speech_cmd_analysis/preprocess.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +import argparse +import json +import numpy as np + +from utils import set_seed, convert_ext_examples + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.input_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + if args.splits and sum(args.splits) != 1: + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.input_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + + def _create_ext_examples(examples, negative_ratio=0, shuffle=False): + entities, relations = convert_ext_examples(examples, negative_ratio) + examples = [e + r for e, r in zip(entities, relations)] + if shuffle: + indexes = np.random.permutation(len(examples)) + examples = [examples[i] for i in indexes] + return examples + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + for x in example: + f.write(json.dumps(x, ensure_ascii=False) + "\n") + count += 1 + print("\nSave %d examples to %s." % (count, save_path)) + + if len(args.splits) == 0: + examples = _create_ext_examples(raw_examples, args.negative_ratio, args.is_shuffle) + _save_examples(args.save_dir, "train.txt", examples) + else: + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + raw_examples = [raw_examples[i] for i in indexes] + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_examples = _create_ext_examples(raw_examples[:p1], args.negative_ratio, args.is_shuffle) + dev_examples = _create_ext_examples(raw_examples[p1:p2]) + test_examples = _create_ext_examples(raw_examples[p2:]) + + _save_examples(args.save_dir, "train.txt", train_examples) + _save_examples(args.save_dir, "dev.txt", dev_examples) + _save_examples(args.save_dir, "test.txt", test_examples) + + print("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--input_file", default="./data/data.json", type=str, help="The data file exported from doccano platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path to save processed data.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the classification task, the ratio of positive and negative samples, number of negtive samples = negative_ratio * number of positive samples") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. 
[0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") + + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/applications/speech_cmd_analysis/utils.py b/applications/speech_cmd_analysis/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..97f02bc98e977fafef05a9f290ca63872afc1303 --- /dev/null +++ b/applications/speech_cmd_analysis/utils.py @@ -0,0 +1,409 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import random +import time +from urllib.error import URLError +from urllib.parse import urlencode +from urllib.request import Request, urlopen + +import numpy as np +import paddle +from tqdm import tqdm + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +class ASRError(Exception): + pass + + +def mandarin_asr_api(api_key, secret_key, audio_file, audio_format="wav"): + """Mandarin ASR + + Args: + audio_file (str): + Audio file of Mandarin with sampling rate 16000. + audio_format (str): + The file extension of audio_file, 'wav' by default. + + Please refer to https://github.com/Baidu-AIP/speech-demo for more demos. + """ + # Configurations. + TOKEN_URL = "http://aip.baidubce.com/oauth/2.0/token" + ASR_URL = "http://vop.baidu.com/server_api" + SCOPE = "audio_voice_assistant_get" + API_KEY = api_key + SECRET_KEY = secret_key + + # Fetch tokens from TOKEN_URL. + post_data = urlencode( + {"grant_type": "client_credentials", "client_id": API_KEY, "client_secret": SECRET_KEY} + ).encode("utf-8") + + request = Request(TOKEN_URL, post_data) + try: + result_str = urlopen(request).read() + except URLError as error: + print("token http response http code : " + str(error.code)) + result_str = error.read() + result_str = result_str.decode() + + result = json.loads(result_str) + if "access_token" in result.keys() and "scope" in result.keys(): + if SCOPE and (SCOPE not in result["scope"].split(" ")): + raise ASRError("scope is not correct!") + token = result["access_token"] + else: + raise ASRError( + "MAYBE API_KEY or SECRET_KEY not correct: " + "access_token or scope not found in token response" + ) + + # Fetch results by ASR api. + with open(audio_file, "rb") as speech_file: + speech_data = speech_file.read() + length = len(speech_data) + if length == 0: + raise ASRError("file %s length read 0 bytes" % audio_file) + params_query = urlencode({"cuid": "ASR", "token": token, "dev_pid": 1537}) + headers = {"Content-Type": "audio/%s; rate=16000" % audio_format, "Content-Length": length} + + url = ASR_URL + "?" 
+ params_query + request = Request(url, speech_data, headers) + try: + begin = time.time() + result_str = urlopen(request).read() + print("Request time cost %f" % (time.time() - begin)) + except URLError as error: + print("asr http response http code : " + str(error.code)) + result_str = error.read() + result_str = str(result_str, "utf-8") + result = json.loads(result_str) + + return result["result"][0] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + for batch in data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def convert_example(example, tokenizer, max_seq_len): + """ + example: { + title + prompt + content + result_list + } + """ + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + stride=len(example["prompt"]), + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + ) + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(len(offset_mapping)): + if index == 0: + continue + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = index + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + start_ids = [0 for x in range(max_seq_len)] + end_ids = [0 for x in range(max_seq_len)] + for item in example["result_list"]: + start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + tokenized_output = [ + encoded_inputs["input_ids"], + encoded_inputs["token_type_ids"], + encoded_inputs["position_ids"], + encoded_inputs["attention_mask"], + start_ids, + end_ids, + ] + tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output] + return tuple(tokenized_output) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"] + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. 
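+            # If the content exceeds the remaining budget (max_seq_len - len(prompt) - 3),
+            # the example is split into several shorter chunks below: the split point is
+            # moved back so that no entity span in result_list is cut in half, and the
+            # start/end offsets of the remaining results are re-based for each new chunk.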
+ if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + + for result in result_list: + if result["start"] + 1 <= max_content_len < result["end"]: + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def add_negative_example(examples, texts, prompts, label_set, negative_ratio): + with tqdm(total=len(prompts)) as pbar: + for i, prompt in enumerate(prompts): + negtive_sample = [] + redundants_list = list(set(label_set) ^ set(prompt)) + redundants_list.sort() + + if len(examples[i]) == 0: + continue + else: + actual_ratio = math.ceil(len(redundants_list) / len(examples[i])) + + if actual_ratio <= negative_ratio: + idxs = [k for k in range(len(redundants_list))] + else: + idxs = random.sample(range(0, len(redundants_list)), negative_ratio * len(examples[i])) + + for idx in idxs: + negtive_result = {"content": texts[i], "result_list": [], "prompt": redundants_list[idx]} + negtive_sample.append(negtive_result) + examples[i].extend(negtive_sample) + pbar.update(1) + return examples + + +def construct_relation_prompt_set(entity_name_set, predicate_set): + relation_prompt_set = set() + for entity_name in entity_name_set: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate + relation_prompt = entity_name + "的" + predicate + relation_prompt_set.add(relation_prompt) + return sorted(list(relation_prompt_set)) + + +def convert_ext_examples(raw_examples, negative_ratio): + texts = [] + entity_examples = [] + relation_examples = [] + entity_prompts = [] + relation_prompts = [] + entity_label_set = [] + entity_name_set = [] + predicate_set = [] + + print("Converting doccano data...") + with tqdm(total=len(raw_examples)) as pbar: + for line in raw_examples: + items = json.loads(line) + entity_id = 0 + if "data" in items.keys(): + text = items["data"] + entities = [] + for item in items["label"]: + entity = {"id": entity_id, "start_offset": item[0], "end_offset": item[1], "label": item[2]} + entities.append(entity) + entity_id += 1 + relations = [] + else: + text, relations, entities = items["text"], items["relations"], items["entities"] + texts.append(text) + + entity_example = [] + entity_prompt = [] + entity_example_map = {} + entity_map = {} # id to entity name + for 
entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + entity_label = entity["label"] + result = {"text": entity_name, "start": entity["start_offset"], "end": entity["end_offset"]} + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = { + "content": text, + "result_list": [result], + "prompt": entity_label, + } + else: + entity_example_map[entity_label]["result_list"].append(result) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + if entity_name not in entity_name_set: + entity_name_set.append(entity_name) + entity_prompt.append(entity_label) + + for v in entity_example_map.values(): + entity_example.append(v) + + entity_examples.append(entity_example) + entity_prompts.append(entity_prompt) + + relation_example = [] + relation_prompt = [] + relation_example_map = {} + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + # The relation prompt is constructed as follows: + # subject + "的" + predicate + prompt = entity_map[subject_id]["name"] + "的" + predicate + result = { + "text": entity_map[object_id]["name"], + "start": entity_map[object_id]["start"], + "end": entity_map[object_id]["end"], + } + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"content": text, "result_list": [result], "prompt": prompt} + else: + relation_example_map[prompt]["result_list"].append(result) + + if predicate not in predicate_set: + predicate_set.append(predicate) + relation_prompt.append(prompt) + + for v in relation_example_map.values(): + relation_example.append(v) + + relation_examples.append(relation_example) + relation_prompts.append(relation_prompt) + pbar.update(1) + + print("Adding negative samples for first stage prompt...") + entity_examples = add_negative_example(entity_examples, texts, entity_prompts, entity_label_set, negative_ratio) + if len(predicate_set) != 0: + print("Constructing relation prompts...") + relation_prompt_set = construct_relation_prompt_set(entity_name_set, predicate_set) + + print("Adding negative samples for second stage prompt...") + relation_examples = add_negative_example( + relation_examples, texts, relation_prompts, relation_prompt_set, negative_ratio + ) + return entity_examples, relation_examples + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) diff --git a/applications/text_classification/README.md b/applications/text_classification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..361773805eddafb5c07288b67100801ebe2e3828 --- /dev/null +++ b/applications/text_classification/README.md @@ -0,0 +1,292 @@ +# 文本分类应用 + +**目录** +- [1. 文本分类应用简介](#文本分类应用简介) +- [2. 技术特色介绍](#技术特色介绍) + - [2.1 文本分类方案全覆盖](#文本分类方案全覆盖) + - [2.2 更懂中文的训练基座](#更懂中文的训练基座) + - [2.3 高效模型调优方案](#高效模型调优方案) + - [2.4 产业级全流程方案](#产业级全流程方案) +- [3. 快速开始](#快速开始) +- [4. 
常用中文分类数据集](#常用中文分类数据集) + + + +## 1. 文本分类应用简介 +文本分类应用针对**多分类、多标签、层次分类等高频场景开源了产业级分类应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,旨在解决细分场景应用的痛点和难点,快速实现文本分类产品落地。 + +文本分类简单来说就是对给定的一个句子或一段文本使用分类模型分类。虽然文本分类在金融、医疗、法律、工业等领域都有广泛的成功实践应用,但如何选择合适的方案和预训练模型、数据标注质量差、效果调优困难、AI入门成本高、如何高效训练部署等问题使部分开发者望而却步。针对文本分类领域的痛点和难点,PaddleNLP文本分类应用提出了多种前沿解决方案,助力开发者简单高效实现文本分类数据标注、训练、调优、上线,降低文本分类落地技术门槛。 + +
+ 文本分类落地难点 +
+ +**文本分类应用技术特色:** + +- **方案全面🎓:** 涵盖多分类、多标签、层次分类等高频分类场景,提供预训练模型微调、提示学习(小样本学习)、语义索引三种端到端全流程分类方案,满足开发者多样文本分类落地需求。 +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + + + +## 2. 技术特色介绍 + + + +### 2.1 文本分类方案全覆盖 + +
+ image +
+ +#### 2.1.1 分类场景齐全 + +文本分类应用涵盖多分类(multi class)、多标签(multi label)、层次分类(hierarchical)三种场景,接下来我们将以下图的新闻文本分类为例介绍三种分类场景的区别。 + +
+ image +
+ +- **多分类🚶:** 数据集的标签集含有两个或两个以上的类别,所有输入句子/文本有且只有一个标签。在文本多分类场景中,我们需要预测输入句子/文本最可能来自 `n` 个标签类别中的哪一个类别。以上图多分类中新闻文本为例,该新闻文本的标签为 `娱乐`。快速开启多分类任务参见 👉 [多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme) + +- **多标签👫 :** 数据集的标签集含有两个或两个以上的类别,输入句子/文本具有一个或多个标签。在文本多标签任务中,我们需要预测输入句子/文本可能来自 `n` 个标签类别中的哪几个类别。以上图多标签中新闻文本为例,该新闻文本具有 `相机` 和 `芯片` 两个标签。快速开启多标签任务参见 👉 [多标签指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label#readme) 。 + +- **层次分类👪 :** 数据集的标签集具有多级标签且标签之间具有层级结构关系,输入句子/文本具有一个或多个标签。在文本层次分类任务中,我们需要预测输入句子/文本可能来自于不同级标签类别中的某一个或几个类别。以上图层次分类中新闻文本为例(新闻为根节点),该新闻一级分类标签为 `体育`,二级分类标签为 `足球`。快速开启层次分类任务参见 👉 [层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical#readme) 。 + + +#### 2.1.2 多方案满足定制需求 + +#### 方案一:预训练模型微调 + +【方案选择】对于大多数任务,我们推荐使用**预训练模型微调作为首选的文本分类方案**,预训练模型微调提供了数据标注-模型训练-模型分析-模型压缩-预测部署全流程,有效减少开发时间,低成本迁移至实际应用场景。 + +【方案介绍】ERNIE 3.0 轻量级模型不能直接在文本分类任务上使用,预训练模型微调在预训练模型 `[CLS]` 输出向量后接入线性层作为文本分类器,用具体任务数据进行微调训练文本分类器,使预训练模型”更懂”这个任务。 + +【方案效果】下表展示在多标签任务CAIL2019—婚姻家庭要素提取数据集中ERNIE 3.0 系列轻量级模型效果评测。 + + +
+ +
+ + +【快速开始】 +- 快速开启多分类任务参见 👉 [预训练模型微调-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme) +- 快速开启多标签分类任务参见 👉 [预训练模型微调-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label#readme) +- 快速开启层次分类任务参见 👉 [预训练模型微调-层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical#readme) + +#### 方案二:提示学习 + +【方案选择】提示学习(Prompt Learning)适用于**标注成本高、标注样本较少的文本分类场景**。在小样本场景中,相比于预训练模型微调学习,提示学习能取得更好的效果。对于标注样本充足、标注成本较低的场景,我们仍旧推荐使用充足的标注样本进行文本分类[预训练模型微调](#预训练模型微调)。 + +【方案介绍】**提示学习的主要思想是将文本分类任务转换为构造提示中掩码 `[MASK]` 的分类预测任务**,也即在掩码 `[MASK]`向量后接入线性层分类器预测掩码位置可能的字或词。提示学习使用待预测字的预训练向量来初始化分类器参数(如果待预测的是词,则为词中所有字的预训练向量平均值),充分利用预训练语言模型学习到的特征和标签文本,从而降低样本需求。提示学习同时提供[ R-Drop](https://arxiv.org/abs/2106.14448) 和 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略,帮助提升模型效果。 + +我们以下图情感二分类任务为例来具体介绍提示学习流程,分类任务标签分为 `0:负向` 和 `1:正向` 。在文本加入构造提示 `我[MASK]喜欢。` ,将情感分类任务转化为预测掩码 `[MASK]` 的待预测字是 `不` 还是 `很`。具体实现方法是在掩码`[MASK]`的输出向量后接入线性分类器(二分类),然后用`不`和`很`的预训练向量来初始化分类器进行训练,分类器预测分类为 `0:不` 或 `1:很` 对应原始标签 `0:负向` 或 `1:正向`。而预训练模型微调则是在预训练模型`[CLS]`向量接入随机初始化线性分类器进行训练,分类器直接预测分类为 `0:负向` 或 `1:正向`。 + +
+ +
+ +【方案效果】我们比较预训练模型微调与提示学习在多分类、多标签、层次分类小样本场景的模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值),可以看到在样本较少的情况下,提示学习比预训练模型微调有明显优势。 + +
+ 文本分类落地难点 +
+ + + +【快速开始】 + +更多测评和使用细节详见各场景文档: +- 快速开启多分类任务参见 👉 [提示学习(小样本)-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/few-shot#readme) +- 快速开启多标签分类任务参见 👉 [提示学习(小样本)-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/few-shot#readme) +- 快速开启层次分类任务参见 👉 [提示学习(小样本)-层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/few-shot#readme) + +#### 方案三:语义索引 + +【方案选择】基于语义索引的文本分类方案**适用于标签类别不固定的场景**,对于新增标签类别或新的相关分类任务无需重新训练,模型仍然能获得较好预测效果,方案具有良好的拓展性。 + +【方案介绍】语义索引目标是从海量候选召回集中快速、准确地召回一批与输入文本语义相关的文本。基于语义索引的文本分类方法具体来说是将标签集作为召回目标集,召回与输入文本语义相似的标签作为文本的标签类别。 + +
+ +
+ +【快速开始】 +- 快速开启多分类任务参见 👉 [语义索引-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/retrieval_based#readme) +- 快速开启多标签分类任务参见 👉 [语义索引-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/retrieval_based#readme) +- 快速开启层次分类任务参见 👉 [语义索引-层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/retrieval_based#readme) + + + + + +### 2.2 更懂中文的训练基座 + +近年来,大量的研究表明在超大规模的语料采用无监督或者弱监督的方式训练模型,模型能够获得语言相关的知识。预训练模型学习到的文本语义表示能够避免从零开始训练模型,同时有利于下游自然语言处理(NLP)任务。预训练模型与具体的文本分类任务的关系可以直观地理解为,**预训练模型已经懂得了相关句法、语义的语言知识,用具体任务数据训练使得预训练模型”更懂”这个任务**,在预训练过程中学到的知识基础使学习文本分类任务事半功倍。 + +文本分类应用使用**ERNIE 3.0轻量级模型作为预训练模型**,ERNIE 3.0 轻量级模型是文心大模型ERNIE 3.0基础上通过在线蒸馏技术得到的轻量级模型。下面是ERNIE 3.0 效果-时延图,ERNIE 3.0 轻量级模型在精度和性能上的综合表现已全面领先于 UER-py、Huawei-Noah 以及 HFL 的中文模型,具体的测评细节可以见[ERNIE 3.0 效果和性能测评文档](../../model_zoo/ernie-3.0)。 + +
+ +
+ + + + + +### 2.3 高效模型调优方案 + +有这么一句话在业界广泛流传,"数据决定了机器学习的上限,而模型和算法只是逼近这个上限",可见数据质量的重要性。文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)开源了模型分析模块,针对标注数据质量不高、训练数据覆盖不足、样本数量少等文本分类常见数据痛点,提供稀疏数据筛选、脏数据清洗、数据增强三种数据优化方案,解决训练数据缺陷问题,用低成本方式获得大幅度的效果提升。 + + +- **稀疏数据筛选**基于特征相似度的实例级证据分析方法挖掘待预测数据中缺乏证据支持的数据(也即稀疏数据),并进行有选择的训练集数据增强或针对性筛选未标注数据进行标注来解决稀疏数据问题,有效提升模型表现。 +
+ 文本分类落地难点 +
+ +我们采用在多分类、多标签、层次分类场景中评测稀疏数据-数据增强策略和稀疏数据-数据标注策略,下图表明稀疏数据筛选方案在各场景能够有效提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 + +
+ 文本分类落地难点 +
+ + +- **脏数据清洗**基于表示点方法的实例级证据分析方法,计算训练数据对模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为脏数据(标注错误样本)。脏数据清洗方案通过高效识别训练集中脏数据(也即标注质量差的数据),有效降低人力检查成本。 + +
+ 文本分类落地难点 +
+ +我们采用在多分类、多标签、层次分类场景中评测脏数据清洗方案,实验表明方案能够高效筛选出训练集中脏数据,提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 + +
+ 文本分类落地难点 +
+ + +- **数据增强**在数据量较少的情况下能够通过增加数据集多样性,提升模型效果。PaddleNLP内置[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),支持词替换、词删除、词插入、词置换、基于上下文生成词(MLM预测)、TF-IDF等多种数据增强策略。数据增强方案提供一行命令,快速完成数据集增强。以CAIL2019—婚姻家庭要素提取数据子集(500条)为例,我们在数据集应用多种数据增强策略,策略效果如下表。 + +
+ 文本分类落地难点 +
+ + +【快速开始】 + +更多使用方法和测评细节详见各场景模型分析模块: + +- 体验模型分析模块 👉 [多分类-模型分析模块](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/analysis) +- 体验模型分析模块 👉 [多标签-模型分析模块](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/analysis) +- 体验模型分析模块 👉 [层次分类-模型分析模块](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/analysis) + + + +### 2.4 产业级全流程方案 + +文本分类应用提供了简单易用的数据标注-模型训练-模型调优-模型压缩-预测部署全流程方案,我们将以预训练模型微调方案为例介绍文本分类应用的全流程: + +
+ image +
+
+ + 文本分类应用全流程示意图 + +
+ + +**1.数据准备阶段** + +- 我们根据文本分类任务选择对应的场景目录: [多分类场景目录](./multi_class)、 + [多标签场景目录](./multi_label)、[层次分类场景目录](./hierarchical)。 + +- 如果没有已标注的数据集,我们推荐doccano数据标注工具进行标注,详见[文本分类标注指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)。如果已有标注好的本地数据集,我们需要根据不同任务要求将数据集整理为文档要求的格式,详见各分类场景文档。 + +**2.模型训练** + +- 数据准备完成后,开始进行预训练模型微调训练。可以根据实际数据调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型参数。 + +- 训练结束后,使用模型分析(analysis)模块对分析模型表现,同时模型分析(analysis)模块依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供稀疏数据筛选、脏数据清洗、数据增强三种优化方案帮助提升模型效果。 + +- 模型训练、调优完成后,可以通过预测脚本加载最佳模型参数,打印模型预测结果。 + +**3.模型部署** + +- 现实部署场景需要同时考虑模型的精度和性能表现,文本分类应用接入PaddleNLP 模型压缩 API 。采用了DynaBERT 中宽度自适应裁剪策略,对预训练模型多头注意力机制中的头(Head )进行重要性排序,保证更重要的头(Head )不容易被裁掉,然后用原模型作为蒸馏过程中的教师模型,宽度更小的模型作为学生模型,蒸馏得到的学生模型就是我们裁剪得到的模型。实验表明模型裁剪能够有效缩小模型体积、减少内存占用、提升推理速度。模型裁剪去掉了部分冗余参数的扰动,增加了模型的泛化能力,在部分任务中预测精度得到提高。 + +
+ image +
+ +- 模型部署需要将保存的最佳模型参数(动态图参数)导出成静态图参数,用于后续的推理部署。p.s.模型裁剪之后会默认导出静态图模型 + +- 文本分类应用提供了离线部署,并且支持在GPU设备使用FP16,在CPU设备使用动态量化的低精度加速推理;同时提供基于Paddle Serving的在线服务化部署,详见各分类场景文档中模型部署介绍。 + + + + +## 3. 快速开始 + +- 快速开启多分类 👉 [多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme) + +- 快速开启多标签分类 👉 [多标签指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label#readme) + +- 快速开启层次分类 👉 [层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical#readme) + + + +## 4. 常用中文分类数据集 + +**多分类数据集:** + +- [THUCNews新闻分类数据集](http://thuctc.thunlp.org/) + +- [百科问答分类数据集](https://github.com/brightmart/nlp_chinese_corpus#3%E7%99%BE%E7%A7%91%E7%B1%BB%E9%97%AE%E7%AD%94json%E7%89%88baike2018qa) + +- [头条新闻标题数据集TNEWS](https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset) + +- [复旦新闻文本数据集](https://www.heywhale.com/mw/dataset/5d3a9c86cf76a600360edd04) + +- [IFLYTEK app应用描述分类数据集](https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip) + +- [CAIL 2022事件检测](https://cloud.tsinghua.edu.cn/d/6e911ff1286d47db8016/) + +**情感分类数据集(多分类):** + +- [亚马逊商品评论情感数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon/intro.ipynb) + +- [财经新闻情感分类数据集](https://github.com/wwwxmu/Dataset-of-financial-news-sentiment-classification) + +- [ChnSentiCorp 酒店评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all) + +- [外卖评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb) + +- [weibo情感二分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb) + +- [weibo情感四分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods/intro.ipynb) + +- [商品评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/online_shopping_10_cats/intro.ipynb) + +- [电影评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2/intro.ipynb) + +- [大众点评分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb) + +**多标签数据集:** + +- [学生评语分类数据集](https://github.com/FBI1314/textClassification/tree/master/multilabel_text_classfication/data) + +- [CAIL2019婚姻要素识别](https://aistudio.baidu.com/aistudio/projectdetail/3996601) + +- [CAIL2018 刑期预测、法条预测、罪名预测](https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip) + +**层次分类数据集:** + +- [头条新闻标题分类-TNEWS的升级版](https://github.com/aceimnorstuvwxz/toutiao-multilevel-text-classfication-dataset) + +- [网页层次分类数据集](https://csri.scu.edu.cn/info/1012/2827.htm) + +- [医学意图数据集(CMID)](https://github.com/liutongyang/CMID) + +- [2020语言与智能技术竞赛事件分类](https://github.com/percent4/keras_bert_multi_label_cls/tree/master/data) diff --git a/applications/text_classification/doccano.md b/applications/text_classification/doccano.md new file mode 100644 index 0000000000000000000000000000000000000000..16c79db92436d3a911b461cffa5b9f76052ecde2 --- /dev/null +++ b/applications/text_classification/doccano.md @@ -0,0 +1,240 @@ +# 文本分类任务doccano使用指南 + + **目录** + +* [1. 安装](#安装) +* [2. 项目创建](#项目创建) +* [3. 数据上传](#数据上传) +* [4. 标签构建](#标签构建) +* [5. 任务标注](#任务标注) +* [6. 数据导出](#数据导出) +* [7. 数据转换](#数据转换) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- doccano 1.6.2 + +在终端(terminal)使用pip安装doccano: + +```shell +pip install doccano==1.6.2 +``` +安装完成后,运行以下命令行: +```shell +# Initialize database. +doccano init +# Create a super user. 
+doccano createuser --username admin --password pass +# Start a web server. +doccano webserver --port 8000 +``` +在新的终端(terminal)运行如下命令行: +```shell +# Start the task queue to handle file upload/download. +doccano task +``` +在浏览器打开[http://127.0.0.1:8000/](http://127.0.0.1:8000/),输入用户名和密码登录,开始使用doccano进行标注。doccano支持中文版本,可以点击右上角选择ZH(中文)。 + +
+ +
+ +doccano还支持PostgreSQL、Docker、Docker Compose等安装方式,详情请参考[doccano官方文档](https://github.com/doccano/doccano) 完成doccano的安装与初始配置。 + + + + +## 2. 项目创建 + +文本分类支持多分类、多标签、层次分类三种类型的文本分类任务。 + +点击创建(Create)开始创建一个新的项目,选择文本分类,然后填写项目名称、描述、Tags等项目信息。如果是多分类任务或者是单路径层次分类任务,勾选 `Allow single label` ,勾选后标签标注只允许选择一个标签进行标注。点击创建成功创建一个doccano项目。 +
+ +
+ + + + +## 3. 数据上传 + +点击数据集-操作-导入数据集,开始导入本地待标注数据集: +
+ +
+ +doccano支持`TextFile`、`TextLine`、`JSONL`和`CoNLL`四种数据上传格式,文本分类本地数据集定制训练中**统一使用TextLine**这一文件格式,即上传的文件需要为txt等格式,且在数据标注时,该文件的每一行待标注文本显示为一页内容。 +上传的文件为txt等格式,每一行为一条待标注文本,示例: + +```text +黑苦荞茶的功效与作用及食用方法 +交界痣会凸起吗 +检查是否能怀孕挂什么科 +鱼油怎么吃咬破吃还是直接咽下去 +幼儿挑食的生理原因是 +... +``` + +上传数据类型**选择TextLine**,选择待标注文本或拖拽文本导入doccano项目中,点击导入,导入待标注数据集。 + +
+ +
+ + + + +## 4. 标签构建 + +点击标签-操作-创建标签,开始添加分类类别标签: +
+ +
+填入分类类别标签,选择标签颜色,建议不同标签选择不同颜色,最后点击保存或保存并添加下一个,保存标签: +
+ +
+文本分类标签构建示例: +
+ +
+ +**NOTE:** +我们默认层次分类标签不同层的标签之间具有关联性,以下图为例一个样本具有标签美短虎斑,我们默认还包含美国短毛猫和猫两个标签。 + +
+ +
+ +对于层次分类任务的分类标签我们建议使用标签层次结构中**叶结点标签路径作为标签**,以上图的标签结构为例,我们建议使用`##`作为分隔符,分隔不同层之间的标签: + +
+ +
+ + +## 5. 任务标注 + +标注示例,选择对应的分类类别标签,输入回车(Enter)键确认: + +
+ +
+ + + + +## 6. 数据导出 + +选择数据集-操作-导出数据集,将标注好的数据导出,我们默认所有数据集已经标注完成且正确: +
+ +
+ +选择导出的文件类型为``JSONL``,导出数据: + +
+ +
+ +导出数据示例: +```text +{"id": 23, "data": "黑苦荞茶的功效与作用及食用方法", "label": ["功效作用"]} +{"id": 24, "data": "交界痣会凸起吗", "label": ["疾病表述"]} +{"id": 25, "data": "检查是否能怀孕挂什么科", "label": ["就医建议"]} +{"id": 26, "data": "鱼油怎么吃咬破吃还是直接咽下去", "label": ["其他"]} +{"id": 27, "data": "幼儿挑食的生理原因是", "label": ["病因分析"]} +``` + +标注数据保存在同一个文本文件中,每条样例占一行且存储为``jsonl``格式,其包含以下字段 +- ``id``: 样本在数据集中的唯一标识ID。 +- ``data``: 原始文本数据。 +- ``label``: 文本对应类别标签。 + + + +## 7.数据转换 + +该章节详细说明如何通过`doccano.py`脚本对doccano平台导出的标注数据进行转换,一键生成训练/验证/测试集。当标注完成后,在 doccano 平台上导出 `JSON` 形式的文件,并将其重命名为 `doccano.jsonl`。 + + +### 7.1 多分类任务 +通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以按照[多分类文本任务指南](multi_class/README.md)中固定格式进行相应模型训练。 + +数据标注转化运行: + +```shell +python doccano.py \ + --doccano_file doccano.jsonl \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type "multi_class" +``` + +稀疏数据识别出的有效标注请增加配置参数`--valid`,脏数据清洗的标注数据(文本中有脏数据标签)请增加配置参数`--dirty`,更多稀疏数据识别和脏数据清洗详见[多分类训练评估与模型优化指南](multi_class/analysis/README.md) + +### 7.2 多标签任务 +通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以按照[多标签文本分类任务指南](multi_label/README.md)中固定格式进行相应模型训练。 + +数据标注转化运行: + +```shell +python doccano.py \ + --doccano_file doccano.jsonl \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type "multi_label" +``` + +稀疏数据识别出的有效标注请增加配置参数`--valid`,脏数据清洗的标注数据(文本中有脏数据标签)请增加配置参数`--dirty`,更多稀疏数据识别和脏数据清洗详见[多标签训练评估与模型优化指南](multi_label/analysis/README.md) + +### 7.3 层次分类任务 + +通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以按照[层次文本分类任务指南](hierarchical/README.md)中固定格式进行相应模型训练。 + +数据标注转化运行: + +```shell +python doccano.py \ + --doccano_file doccano.jsonl \ + --save_dir ./data \ + --splits 0.8 0.2 \ + --task_type "hierarchical" +``` + +稀疏数据识别出的有效标注请增加配置参数`--valid`,脏数据清洗的标注数据(文本中有脏数据标签)请增加配置参数`--dirty`,更多稀疏数据识别和脏数据清洗详见[层次分类训练评估与模型优化指南](hierarchical/analysis/README.md) +可配置参数说明: + +- ``doccano_file``: 从doccano导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.2]表示按照``8:2``的比例将数据划分为训练集、验证集。 +- ``task_type``: 可选,选择任务类型,有多分类,多标签,层次分类三种类型的任务。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. +- ``separator``: 不同层标签之间的分隔符,该参数只对层次文本分类任务有效。默认为"##"。 +- ``valid``: 是否为稀疏数据筛选的有效标注数据,默认为False. +- ``dirty``: 是否为脏数据清洗策略标注数据,默认为False. + +转化后的doccano标注数据目录结构如下: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +├── test.txt # 测试训练集文件(可选,数据划分为 train/dev/test 数据集) +├── label.txt # 分类标签文件 +└── data.txt # 待预测数据文件 +``` + +备注: +- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分成train/dev 数据集,也可以划分为 train/dev/test 数据集。 +- 脚本会自动生成data.txt,如果数据划分为 train/dev/test 数据集,data.txt则为test数据集无标签数据;如果数据划分为 train/dev 数据集,data.txt为无标签数据。**如果有未标注数据,则用未标注数据文件替换data.txt** +- 每次执行 [doccano.py](./doccano.py) 脚本,将会覆盖已有的同名数据文件 +- 对于从doccano导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + +## References +- **[doccano](https://github.com/doccano/doccano)** diff --git a/applications/text_classification/doccano.py b/applications/text_classification/doccano.py new file mode 100644 index 0000000000000000000000000000000000000000..7c3d096556a084e4a9ebd955abb03a69ac99246f --- /dev/null +++ b/applications/text_classification/doccano.py @@ -0,0 +1,171 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle +from tqdm import tqdm + +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--doccano_file", default="doccano.jsonl", type=str, help="The doccano file exported from doccano platform.") +parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") +parser.add_argument("--splits", default=[0.8, 0.2], type=float, nargs="*", help="The ratio of samples in datasets. [0.8, 0.2] means 80% samples used for training, 20% for evaluation.") +parser.add_argument("--task_type", choices=['multi_class', 'multi_label', 'hierarchical'], default="multi_label", type=str, help="Select task type, multi_class for multi classification task, multi_label for multi label classification task and hierarchical for hierarchical classification, defaults to multi_label.") +parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.") +parser.add_argument("--seed", type=int, default=3, help="Random seed for initialization") +parser.add_argument("--separator", type=str, default="##", help="Separator for hierarchical classification") +parser.add_argument("--valid", action='store_true', help="Whether annotate valid data(extracted from sparse strategy)") +parser.add_argument("--dirty", action='store_true', help="Whether annotate dirty data(extracted from dirty data cleaning strategy)") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def do_convert(): + """ + Convert doccano jsonl to fixed format + """ + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.doccano_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 1 and len(args.splits) != 2 and len(args.splits) != 3: + raise ValueError("Only len(splits)==1 /len(splits)==2 / len(splits)==3 accepted for splits.") + + def _check_sum(splits): + if len(splits) == 2: + return Decimal(str(splits[0])) + Decimal(str(splits[1])) == Decimal("1") + if len(splits) == 3: + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.doccano_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + f.close() + + examples = [] + label_list = [] + with tqdm(total=len(raw_examples)): + for line in raw_examples: + items = json.loads(line) + # Compatible with doccano >= 1.6.2 + if "data" in items.keys(): + text, labels = items["data"], items["label"] + else: + text, labels = items["text"], items["label"] + labels = list(set(labels)) + for l in labels: + if "," in l: + raise ValueError("There exists comma ',' in 
{}".format(l)) + + if args.task_type == "multi_label" or args.task_type == "multi_class": + if args.dirty: + text = " ".join(text.strip().split("\t")[:-1]) + else: + text = " ".join(text.strip().split("\t")) + example = text + "\t" + ",".join(labels) + "\n" + for l in labels: + if l not in label_list: + label_list.append(l) + if args.task_type == "hierarchical": + label_dict = [] + for label in labels: + level_labels = label.split(args.separator) + for i in range(len(level_labels)): + l = args.separator.join(level_labels[: i + 1]) + if l not in label_dict: + label_dict.append(l) + if l not in label_list: + label_list.append(l) + if args.dirty: + text = " ".join(text.strip().split("\t")[:-1]) + else: + text = " ".join(text.strip().split("\t")) + example = text + "\t" + ",".join(label_dict) + "\n" + examples.append(example) + + if not args.dirty and not args.valid: + save_path = os.path.join(args.save_dir, "label.txt") + with open(save_path, "w", encoding="utf-8") as f: + label_list = sorted(label_list) + for l in label_list: + f.write(l + "\n") + + def _save_examples(save_dir, file_name, examples, is_data=False): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + if is_data: + f.write(example.split("\t")[0] + "\n") + else: + f.write(example) + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + raw_examples = [raw_examples[i] for i in indexes] + + if len(args.splits) == 1: + if args.valid: + _save_examples(args.save_dir, "valid.txt", examples) + elif args.dirty: + _save_examples(args.save_dir, "train_dirty.txt", examples) + else: + _save_examples(args.save_dir, "train.txt", examples) + _save_examples(args.save_dir, "data.txt", examples, True) + elif len(args.splits) == 2: + i1, _ = args.splits + p1 = int(len(raw_examples) * i1) + _save_examples(args.save_dir, "train.txt", examples[:p1]) + _save_examples(args.save_dir, "dev.txt", examples[p1:]) + _save_examples(args.save_dir, "data.txt", examples[p1:], True) + elif len(args.splits) == 3: + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + _save_examples(args.save_dir, "train.txt", examples[:p1]) + _save_examples(args.save_dir, "dev.txt", examples[p1:p2]) + _save_examples(args.save_dir, "test.txt", examples[p2:]) + _save_examples(args.save_dir, "data.txt", examples[p2:], True) + logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + do_convert() diff --git a/applications/text_classification/hierarchical/README.md b/applications/text_classification/hierarchical/README.md new file mode 100644 index 0000000000000000000000000000000000000000..98ae356572acc9d4ebe39d72edf838eb1a24e51d --- /dev/null +++ b/applications/text_classification/hierarchical/README.md @@ -0,0 +1,477 @@ +# 层次分类指南 + +**目录** +- [1. 层次分类简介](#层次分类简介) +- [2. 快速开始](#快速开始) + - [2.1 运行环境](#运行环境) + - [2.2 代码结构](#代码结构) + - [2.3 数据准备](#数据准备) + - [2.4 模型训练](#模型训练) + - [2.5 模型部署](#模型部署) + - [2.6 模型效果](#模型效果) + + + +## 1. 层次分类简介 + +本项目提供通用场景下**基于预训练模型微调的层次分类端到端应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,有效缩短开发周期,降低AI开发落地门槛。 + +层次文本分类任务的中数据样本具有多个标签且标签之间存在特定的层级结构,目标是**预测输入句子/文本可能来自于不同级标签类别中的某一个或几个类别**。以下图新闻文本分类为例,该新闻的一级标签为体育,二级标签为足球,体育与足球之间存在层级关系。在现实场景中,大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在层次化结构,需要利用算法为文本自动标注更细粒度和更准确的标签。 + +
+ +
+
+ +**方案亮点:** + +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + +**更多选择:** + +对于大多数层次分类任务,我们推荐使用预训练模型微调作为首选的文本分类方案,层次分类项目中还提供 提示学习(小样本)和语义索引的两种全流程文本分类方案满足不同开发者需求,更多技术细节请参见[文本分类技术特色介绍](../README.md)。 + +- 【标注成本高、标注样本较少的小样本场景】 👉 [提示学习层次分类方案](./few-shot#readme) + +- 【标签类别不固定场景、标签数量众多】 👉 [语义索引层次分类方案](./retrieval_based#readme) + + + +## 2. 快速开始 + +我们以[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取的多标签层次数据集为例,演示层次分类全流程方案使用。下载数据集: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/baidu_extract_2020.tar.gz +tar -zxvf baidu_extract_2020.tar.gz +mv baidu_extract_2020 data +rm baidu_extract_2020.tar.gz +``` + +
+ image +
+
+ + 层次分类数据标注-模型训练-模型分析-模型压缩-预测部署流程图 + +
+ + + +### 2.1 运行环境 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4.8 +- scikit-learn >= 1.0.2 + +**安装PaddlePaddle:** + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +**安装PaddleNLP:** + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + + +**安装sklearn:** +```shell +python3 -m pip install scikit-learn==1.0.2 +``` + + + +### 2.2 代码结构 + +```text +hierarchical/ +├── few-shot # 小样本学习方案 +├── retrieval_based # 语义索引方案 +├── analysis # 分析模块 +├── deploy # 部署 +│   └── predictor # 离线部署 +│ ├── paddle_serving # PaddleServing在线服务化部署 +│   └── triton_serving # Triton在线服务化部署 +├── train.py # 训练评估脚本 +├── predict.py # 预测脚本 +├── export_model.py # 静态图模型导出脚本 +├── utils.py # 工具函数脚本 +├── metric.py # metric脚本 +├── prune.py # 裁剪脚本 +└── README.md # 使用说明 +``` + + + +### 2.3 数据准备 + +训练需要准备指定格式的标注数据集,如果没有已标注的数据集,可以参考 [数据标注指南](../doccano.md) 进行文本分类数据标注。指定格式本地数据集目录结构: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +├── test.txt # 测试数据集文件(可选) +├── label.txt # 分类标签文件 +└── data.txt # 待预测数据文件(可选) +``` + +**训练、开发、测试数据集文件:** 文本与标签类别名用tab符`'\t'`分隔开,标签中多个标签之间用英文逗号`','`分隔开,文本中避免出现tab符`'\t'`。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` + +- train.txt/dev.txt/test.txt 文件样例: +```text +又要停产裁员6000!通用汽车罢工危机再升级股价大跌市值蒸发近300亿! 组织行为,组织行为##罢工,组织关系,组织关系##裁员 +上海一改建厂房坍塌已救出19人其中5人死亡 人生,人生##死亡,灾害/意外,灾害/意外##坍/垮塌 +车闻:广本召回9万余辆;领动上市,10.98万起;艾力绅混动 产品行为,产品行为##召回 +86岁老翁过马路遭挖掘机碾压身亡警方:正在侦办中 灾害/意外,灾害/意外##车祸,人生,人生##死亡 +... +``` + +**分类标签文件:** 包含数据集中所有标签,每个标签一行。 + +- label.txt 文件格式: + +```text +<一级标签> +<一级标签>'##'<二级标签> +<一级标签>'##'<二级标签>'##'<三级标签> +... +``` +- label.txt 文件样例: +```text +人生 +人生##死亡 +灾害/意外 +灾害/意外##坍/垮塌 +灾害/意外##车祸 +产品行为 +产品行为##召回 +... +``` +**待预测数据文件:** 包含需要预测标签的文本数据,每条数据一行。 +- data.txt 文件格式: +```text +<文本> +<文本> +... +``` +- data.txt 文件样例: +```text +金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件 +卡车超载致使跨桥侧翻,没那么简单 +消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了 +... 
+``` + + + +### 2.4 模型训练 + +#### 2.4.1 预训练模型微调 + +使用CPU/GPU训练,默认为GPU训练。使用CPU训练只需将设备参数配置改为`--device cpu`,可以使用`--device gpu:0`指定GPU卡号: +```shell +python train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` + +如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` + + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,选择cpu、gpu、xpu、npu。如使用gpu训练,可使用参数--gpus指定GPU卡号;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 +* `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据任务复杂度和硬件条件进行选择。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为3e-5。 +* `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `early_stop`:选择是否使用早停法(EarlyStopping),模型在开发集经过一定epoch后精度表现不再上升,训练终止;默认为False。 +* `early_stop_nums`:在设定的早停训练轮次内,模型在开发集上表现不再上升,训练终止;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认5。 +* `weight_decay`:控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `warmup`:是否使用学习率warmup策略,使用时应设置适当的训练轮次(epochs);默认为False。 +* `warmup_steps`:学习率warmup策略的比例数,如果设为1000,则学习率会在1000steps数从0慢慢增长到learning_rate, 而后再缓慢衰减;默认为0。 +* `init_from_ckpt`: 模型初始checkpoint参数地址,默认None。 +* `seed`:随机种子,默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存开发集上最佳模型在指定的 `save_dir` 中,保存模型文件结构如下所示: + +```text +checkpoint/ +├── config.json # 模型配置文件,paddlenlp 2.4.5以前为model_config.json +├── model_state.pdparams # 模型参数文件 +├── tokenizer_config.json # 分词器配置文件 +├── vocab.txt +└── ... +``` +**NOTE:** +* 如需恢复模型训练,则可以设置 `--init_from_ckpt checkpoint/model_state.pdparams` 。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 +#### 2.4.2 训练评估与模型优化 + +文本分类预测过程中常会遇到诸如"模型为什么会预测出错误的结果","如何提升模型的表现"等问题。[Analysis模块](./analysis) 提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +
+ +
+ +**模型评估:** 训练后的模型我们可以使用 [Analysis模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32 --bad_case_file "bad_case.txt" --dataset_dir "data" --params_path "./checkpoint" +``` + +输出打印示例: + +```text +[2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model------- +[2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498 +[2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19% +[2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22 +[2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26 +[2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93 +[2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72 +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44 +[2022-08-11 03:10:14,256] [ INFO] - ---------------------------- +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 组织关系,组织关系##加盟,组织关系##裁员 组织关系,组织关系##解雇 +冠军射手被裁掉,欲加入湖人队,但湖人却无意,冠军射手何去何从 组织关系,组织关系##裁员 组织关系,组织关系##解雇 +6月7日报道,IBM将裁员超过1000人。IBM周四确认,将裁减一千多人。据知情人士称,此次裁员将影响到约1700名员工,约占IBM全球逾34万员工中的0.5%。IBM股价今年累计上涨16%,但该公司4月发布的财报显示,一季度营收下降5%,低于市场预期。 组织关系,组织关系##裁员 组织关系,组织关系##裁员,财经/交易 +有多名魅族员工表示,从6月份开始,魅族开始了新一轮裁员,重点裁员区域是营销和线下。裁员占比超过30%,剩余员工将不过千余人,魅族的知名工程师,爱讲真话的洪汉生已经从钉钉里退出了,外界传言说他去了OPPO。 组织关系,组织关系##退出,组织关系##裁员 组织关系,组织关系##裁员 +... +``` + +**可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果,用于错误样本(bad case)分析,细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 单词级别可解释性分析,也即分析待预测样本中哪一些单词对模型预测结果起重要作用。以下图为例,用颜色深浅表示单词对预测结果的重要性。 +
+ +
+ +- 句子级别可解释性分析 ,也即分析对待预测样本的模型预测结果与训练集中中哪些样本有重要关系。下面的例子表明句子级别可解释性分析可以帮助理解待预测样本的预测结果与训练集中样本之间的关联。 +```text +text: 据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 +predict label: 组织关系,组织关系##解雇 +label: 组织关系,组织关系##加盟,组织关系##裁员 +examples with positive influence +support1 text: 尼克斯官方今日宣布,他们已经裁掉了前锋扎克-欧文,后者昨日才与尼克斯签约。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.99357 +support2 text: 活塞官方今日宣布,他们已经签下了克雷格-斯沃德,并且裁掉了托德-威瑟斯。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98344 +support3 text: 孟菲斯灰熊今年宣布,球队已经签下后卫达斯蒂-汉纳斯(DustyHannahs,版头图)并裁掉马特-穆尼。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98219 +... +``` + +**数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果,策略细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 稀疏数据筛选主要是解决数据不均衡、训练数据覆盖不足的问题,通过数据增强和数据标注两种方式解决这一问题。 +- 脏数据清洗可以帮助开发者筛选训练集中错误标注的数据,对这些数据重新进行人工标注,得到标注正确的数据再重新进行训练。 +- 数据增强策略提供多种数据增强方案,可以快速扩充数据,提高模型泛化性和鲁棒性。 + +#### 2.4.3 模型预测 +训练结束后,输入待预测数据(data.txt)和类别标签对照列表(label.txt),使用训练好的模型进行,默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_dir "data" +``` + +可支持配置的参数: + +* `device`: 选用什么设备进行预测,可选cpu、gpu、xpu、npu;默认为gpu。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含data.txt和label.txt文件;默认为None。 +* `params_path`:待预测模型的目录;默认为"./checkpoint/"。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练时最大序列长度一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `data_file`:本地数据集中未标注待预测数据文件名;默认为"data.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + + + + +### 2.5 模型部署 + +#### 2.5.1 静态图导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,静态图模型将用于**后续的推理部署工作**。具体代码见[静态图导出脚本](export_model.py),静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export +``` + +如果使用多语言模型 ERNIE M作为预训练模型,运行方式: +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` + +可支持配置的参数: +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 +* `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 +* `output_path`:静态图图保存的参数路径;默认为"./export"。 + +程序运行时将会自动导出模型到指定的 `output_path` 中,保存模型文件结构如下所示: + +```text +export/ +├── float32.pdiparams +├── float32.pdiparams.info +└── float32.pdmodel +``` + 导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 +#### 2.5.2 模型裁剪 + +如果有模型部署上线的需求,需要进一步压缩模型体积,可以使用 PaddleNLP 的 [压缩API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/compression.md), 一行命令即可启动模型裁剪。 + +使用裁剪功能需要安装 paddleslim: + +```shell +pip install paddleslim==2.4.1 +``` + +开始模型裁剪训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: +```shell +python prune.py \ + --device "gpu" \ + --dataset_dir "data" \ + --output_dir "prune" \ + --learning_rate 3e-5 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 10 \ + --max_seq_length 128 \ + --logging_steps 5 \ + --save_steps 100 \ + --width_mult_list '3/4' '2/3' '1/2' +``` + + +可支持配置的参数: +* `output_dir`:必须,保存模型输出和中间checkpoint的输出目录;默认为 `None` 。 +* `device`: 选用什么设备进行裁剪,选择cpu、gpu。如使用gpu训练,可使用参数--gpus指定GPU卡号。 +* `per_device_train_batch_size`:训练集裁剪训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `per_device_eval_batch_size`:开发集评测过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为5e-5。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认100。 
+* `save_steps`: 训练过程中保存模型checkpoint的间隔steps数,默认100。 +* `seed`:随机种子,默认为3。 +* `width_mult_list`:裁剪宽度(multi head)保留的比例列表,表示对self_attention中的 `q`、`k`、`v` 以及 `ffn` 权重宽度的保留比例,保留比例乘以宽度(multi haed数量)应为整数;默认是None。 +* `dataset_dir`:本地数据集路径,需包含train.txt,dev.txt,label.txt;默认为None。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练过程保持一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `params_dir`:待预测模型参数文件;默认为"./checkpoint/"。 + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存开发集上最佳模型在指定的 `output_dir` 中,保存模型文件结构如下所示: + +```text +prune/ +├── width_mult_0.75 +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel +│   ├── model_state.pdparams +│   └── model_config.json +└── ... +``` + +**NOTE:** + +1. 目前支持的裁剪策略需要训练,训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。 裁剪类似蒸馏过程,方便起见,可以直接使用微调时的超参。为了进一步提升精度,可以对 `per_device_train_batch_size`、`learning_rate`、`num_train_epochs`、`max_seq_length` 等超参进行网格搜索(grid search)。 + +2. 模型裁剪主要用于推理部署,因此裁剪后的模型都是静态图模型,只可用于推理部署,不能再通过 `from_pretrained` 导入继续训练。导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 + +3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 + +#### 2.5.3 部署方案 + +- 离线部署搭建请参考[离线部署](deploy/predictor/README.md)。 + +- 在线服务化部署搭建请参考 [PaddleNLP SimpleServing部署指南](deploy/simple_serving/README.md) 或 [Triton部署指南](deploy/triton_serving/README.md)。 + + + +### 2.6 模型效果 + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)的多标签层次数据集评测模型表现,测试配置如下: + +1. 数据集:2020语言与智能技术竞赛抽取的多标签层次数据集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.4 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 
精度评价指标:Micro F1分数、Macro F1分数 + +| | 模型结构 |Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------ | ------------ |------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|96.24|94.24 |5.59 | +|ERNIE 3.0 Xbase |20-layer, 1024-hidden, 16-heads|96.21|94.13| 5.51 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|95.68|93.39| 2.01 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|95.26|93.22| 1.01| +|ERNIE 3.0 Mini|6-layer, 384-hidden, 12-heads|94.72|93.03| 0.36| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|94.24|93.08| 0.24| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|93.98|91.25|0.19| +| ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 95.45|93.40| 0.81 | +| ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 95.23|93.27 | 0.74 | +| ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 94.92 | 92.70| 0.61 | diff --git a/applications/text_classification/hierarchical/analysis/README.md b/applications/text_classification/hierarchical/analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0f79bc349a50e970042c619655a7931cdbece3b2 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/README.md @@ -0,0 +1,427 @@ +# 训练评估与模型优化指南 + +**目录** + * [Analysis模块介绍](#Analysis模块介绍) + * [环境准备](#环境准备) + * [模型评估](#模型评估) + * [可解释性分析](#可解释性分析) + * [单词级别可解释性分析](#单词级别可解释性分析) + * [句子级别可解释性分析](#句子级别可解释性分析) + * [数据优化](#数据优化) + * [稀疏数据筛选方案](#稀疏数据筛选方案) + * [脏数据清洗方案](#脏数据清洗方案) + * [数据增强策略方案](#数据增强策略方案) + +## Analysis模块介绍 + +Analysis模块提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +- **模型评估:** 对整体分类情况和每个类别分别进行评估,并打印预测错误样本,帮助开发者分析模型表现找到训练和预测数据中存在的问题。 + +- **可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果。 + +- **数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果。 + +
+ +
+ +以下是本项目主要代码结构及说明: + +```text +analysis/ +├── evaluate.py # 评估脚本 +├── sent_interpret.py # 句子级别可解释性分析脚本 +├── word_interpret.py # 单词级别可解释性分析notebook +├── sparse.py # 稀疏数据筛选脚本 +├── dirty.py # 脏数据清洗脚本 +├── aug.py # 数据增强脚本 +└── README.md # 训练评估与模型优化指南 +``` + +## 环境准备 +需要可解释性分析和数据优化需要安装相关环境。 +- trustai >= 0.1.7 +- interpretdl >= 0.7.0 + +**安装TrustAI**(可选)如果使用可解释性分析和数据优化中稀疏数据筛选和脏数据清洗需要安装TrustAI。 +```shell +pip install trustai==0.1.7 +``` + +**安装InterpretDL**(可选)如果使用词级别可解释性分析GradShap方法,需要安装InterpretDL +```shell +pip install interpretdl==0.7.0 +``` + +## 模型评估 + +我们使用训练好的模型计算模型的在开发集的准确率,同时打印每个类别数据量及表现: + +```shell +python evaluate.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint" \ + --max_seq_length 128 \ + --batch_size 32 \ + --bad_case_file "bad_case.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `bad_case_path`:开发集中预测错误样本保存路径;默认为"/bad_case.txt"。 + + +输出打印示例: + +```text +[2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model------- + +[2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498 +[2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19% +[2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22 +[2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26 +[2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93 +[2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72 +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44 +[2022-08-11 03:10:14,256] [ INFO] - ---------------------------- +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 组织关系,组织关系##加盟,组织关系##裁员 组织关系,组织关系##解雇 +冠军射手被裁掉,欲加入湖人队,但湖人却无意,冠军射手何去何从 组织关系,组织关系##裁员 组织关系,组织关系##解雇 +6月7日报道,IBM将裁员超过1000人。IBM周四确认,将裁减一千多人。据知情人士称,此次裁员将影响到约1700名员工,约占IBM全球逾34万员工中的0.5%。IBM股价今年累计上涨16%,但该公司4月发布的财报显示,一季度营收下降5%,低于市场预期。 组织关系,组织关系##裁员 组织关系,组织关系##裁员,财经/交易 +有多名魅族员工表示,从6月份开始,魅族开始了新一轮裁员,重点裁员区域是营销和线下。裁员占比超过30%,剩余员工将不过千余人,魅族的知名工程师,爱讲真话的洪汉生已经从钉钉里退出了,外界传言说他去了OPPO。 组织关系,组织关系##退出,组织关系##裁员 组织关系,组织关系##裁员 +... 
+``` + +## 可解释性分析 +"模型为什么会预测出这个结果?"是文本分类任务开发者时常遇到的问题,如何分析错误样本(bad case)是文本分类任务落地中重要一环,本项目基于TrustAI开源了基于词级别和句子级别的模型可解释性分析方法,帮助开发者更好地理解文本分类模型与数据,有助于后续的模型优化与数据清洗标注。 + +### 单词级别可解释性分析 +本项目开源模型的词级别可解释性分析Notebook,提供LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用Jupyter Notebook进行数据分析。 + +运行 [word_interpret.ipynb](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/hierarchical/analysis/README.md) 代码,即可分析影响样本预测结果的关键词以及可视化所有词对预测结果的贡献情况,颜色越深代表这个词对预测结果影响越大: +
+ +
+ +### 句子级别可解释性分析 +本项目基于特征相似度([FeatureSimilarity](https://arxiv.org/abs/2104.04128))算法,计算对样本预测结果正影响的训练数据,帮助理解模型的预测结果与训练集数据的关系。 + +待分析数据文件`interpret_input_file`应为以下三种格式中的一种: +**格式一:包括文本、标签、预测结果** +```text +<文本>'\t'<标签>'\t'<预测结果> +... +``` + +**格式二:包括文本、标签** +```text +<文本>'\t'<标签> +... +``` + +**格式三:只包括文本** +```text +<文本> +准予原告胡某甲与被告韩某甲离婚。 +... +``` + +我们可以运行代码,得到支持样本模型预测结果的训练数据: +```shell +python sent_interpret.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint/" \ + --max_seq_length 128 \ + --batch_size 16 \ + --top_k 3 \ + --train_file "train.txt" \ + --interpret_input_file "bad_case.txt" \ + --interpret_result_file "sent_interpret.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `top_k`:筛选支持训练证据数量;默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `interpret_input_file`:本地数据集中待分析文件名;默认为"bad_case.txt"。 +* `interpret_result_file`:保存句子级别可解释性结果文件名;默认为"sent_interpret.txt"。 + +可解释性结果保存在 `interpret_result_file` 文件中: +```text +text: 据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 +predict label: 组织关系,组织关系##解雇 +label: 组织关系,组织关系##加盟,组织关系##裁员 +examples with positive influence +support1 text: 尼克斯官方今日宣布,他们已经裁掉了前锋扎克-欧文,后者昨日才与尼克斯签约。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.99357 +support2 text: 活塞官方今日宣布,他们已经签下了克雷格-斯沃德,并且裁掉了托德-威瑟斯。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98344 +support3 text: 孟菲斯灰熊今年宣布,球队已经签下后卫达斯蒂-汉纳斯(DustyHannahs,版头图)并裁掉马特-穆尼。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98219 +... 
+``` + +## 数据优化 + +### 稀疏数据筛选方案 + +稀疏数据筛选适用于文本分类中**数据不平衡或训练数据覆盖不足**的场景,简单来说,就是由于模型在训练过程中没有学习到足够与待预测样本相似的数据,模型难以正确预测样本所属类别的情况。稀疏数据筛选旨在开发集中挖掘缺乏训练证据支持的数据,通常可以采用**数据增强**或**少量数据标注**的两种低成本方式,提升模型在开发集的预测效果。 + +本项目中稀疏数据筛选基于TrustAI,利用基于特征相似度的实例级证据分析方法,抽取开发集中样本的支持训练证据,并计算支持证据平均分(通常为得分前三的支持训练证据均分)。分数较低的样本表明其训练证据不足,在训练集中较为稀疏,实验表明模型在这些样本上表现也相对较差。更多细节详见[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[实例级证据分析](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md)。 + + +#### 稀疏数据识别—数据增强 + +这里我们将介绍稀疏数据识别—数据增强流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据增强**:将稀疏数据集在训练集中的支持证据应用数据增强策略,这些数据增强后的训练数据记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别-数据增强,得到支持数据集: + +```shell +python sparse.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --aug_strategy "substitute" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `aug_strategy`:数据增强类型,可选"duplicate","substitute", "insert", "delete", "swap";默认为"substitute"。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +将得到增强支持数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_aug.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_aug.txt +``` + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:700)进行实验,筛选稀疏数据数量和筛选支持数据数量均设为100条,使用不同的数据增强方法进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集|90.41|79.16| +|训练集+支持增强集(duplicate) |**90.60**|80.55| +|训练集+支持增强集(substitute) |90.21|80.11| +|训练集+支持增强集(insert) |90.53|**80.61**| +|训练集+支持增强集(delete) |90.56| 80.26| +|训练集+支持增强集(swap) |90.18|80.05| + +#### 稀疏数据识别-数据标注 + +本方案能够有针对性进行数据标注,相比于随机标注数据更好提高模型预测效果。这里我们将介绍稀疏数据识别-数据标注流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据标注**:在未标注数据集中筛选稀疏数据集的支持证据,并进行数据标注,记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别--数据标注,得到待标注数据: + +```shell +python sparse.py \ + --annotate \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 \ + --unlabeled_file "data.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `annotate`:选择稀疏数据识别--数据标注模式;默认为False。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* 
`max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `unlabeled_file`:本地数据集中未标注数据文件名;默认为"data.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +我们将筛选出的支持数据`support.txt`进行标注,可以使用标注工具帮助更快标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已标注数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_annotate.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_annotate.txt +``` + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:700)进行实验,筛选稀疏数据数量设为100条,筛选待标注数据数量为50和100条。我们比较了使用稀疏数据方案的策略采样和随机采样的效果,下表结果表明使用稀疏数据方案的策略采样能够有效指导训练数据扩充,在标注更少的数据情况下获得更大提升的效果: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ | ------------ | +|训练集|90.41|79.16| +|训练集+策略采样集(50) |90.79|82.37| +|训练集+随机采样集(50) |90.10|79.27| +|训练集+策略采样集(100) |91.12|**84.13**| +|训练集+随机采样集(100) |**91.24**|81.66| + +### 脏数据清洗方案 + +脏数据清洗方案是基于已训练好的文本分类模型,筛选出训练数据集中标注错误的数据,再由人工检查重新标注,获得标注正确的数据集进行重新训练。我们将介绍脏数据清洗流程: + +- **脏数据筛选:** 基于TrustAI中表示点方法,计算训练数据对文本分类模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为标注错误样本,记为脏数据集(Dirty Dataset)。 + +- **数据清洗、训练:** 将筛选出的脏数据由人工重新检查,为数据打上正确的标签。将清洗后的训练数据重新放入文本分类模型进行训练。 + +现在我们进行脏数据识别,脏数据保存在`"train_dirty.txt"`,剩余训练数据保存在`"train_dirty_rest.txt"`: + +```shell +python dirty.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --dirty_num 100 \ + --dirty_file "train_dirty.txt" \ + --rest_file "train_dirty_rest.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt和label.txt文件;默认为None。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `dirty_file`:保存脏数据文件名,默认为"train_dirty.txt"。 +* `rest_file`:保存剩余数据(非脏数据)文件名,默认为"train_dirty_rest.txt"。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dirty_threshold`:筛选脏数据用于重新标注的阈值,只选择影响分数大于阈值作为支持数据,默认为0。 + + +我们将筛选出脏数据进行人工检查重新标注,可以将`train_dirty.txt`直接导入标注工具doccano帮助更快重新标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已重新标注的脏数据`train_dirty.txt`与剩余训练集数据`train_dirty_rest.txt`合并得到新的训练集`train_clean.txt`重新进行训练: + +```shell +cat ../data/train_dirty_rest.txt ../data/train_dirty.txt > ../data/train_clean.txt +``` + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:2000)进行实验,取200条数据进行脏数据处理,也即200条训练数据为标签错误数据,选择不同`dirty_num`应用脏数据清洗策略进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(2000)|92.54|86.04| +|训练集(2000,含200条脏数据) |89.11|73.33| +|训练集(2000,含200条脏数据) + 脏数据清洗(50)|90.00|77.67| +|训练集(2000,含200条脏数据) + 脏数据清洗(100)|92.48|**87.83**| +|训练集(2000,含200条脏数据) 
+ 脏数据清洗(150)|**92.55**|83.73| + +### 数据增强策略方案 + +在数据量较少或某些类别样本量较少时,也可以通过数据增强策略的方式,生成更多的训练数据,提升模型效果。 + +```shell +python aug.py \ + --create_n 2 \ + --aug_percent 0.1 \ + --train_path "../data/train.txt" \ + --aug_path "../data/aug.txt" +``` + +可支持配置的参数: + +* `train_path`:待增强训练数据集文件路径;默认为"../data/train.txt"。 +* `aug_path`:增强生成的训练数据集文件路径;默认为"../data/train_aug.txt"。 +* `aug_strategy`:数据增强策略,可选"mix", "substitute", "insert", "delete", "swap","mix"为多种数据策略混合使用;默认为"substitute"。 +* `aug_type`:词替换/词插入增强类型,可选"synonym", "homonym", "mlm",建议在GPU环境下使用mlm类型;默认为"synonym"。 +* `create_n`:生成的句子数量,默认为2。 +* `aug_percent`:生成词替换百分比,默认为0.1。 +* `device`: 选用什么设备进行增强,可选择cpu、gpu、xpu、npu,仅在使用mlm类型有影响;默认为"gpu"。 + +生成的增强数据保存在`"aug.txt"`文件中,与训练集数据`train.txt`合并得到新的训练集`train_aug.txt`重新进行训练: + +```shell +cat ../data/aug.txt ../data/train.txt > ../data/train_aug.txt +``` + +PaddleNLP内置多种数据增强策略,更多数据增强策略使用方法请参考[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)。 + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:2000)进行实验,采用不同数据增强策略进行两倍数据增强(每条样本生成两条增强样本): + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(2000)|92.54|86.04| +|训练集(2000)+数据增强(×2, mix) |93.23|89.69| +|训练集(2000)+支持增强集(×2, substitute) |93.07|89.49| +|训练集(2000)+支持增强集(×2, insert) |**93.63**|**89.69**| +|训练集(2000)+支持增强集(×2, delete) |91.53| 84.47| +|训练集(2000)+支持增强集(×2, swap) |93.24|89.02| diff --git a/applications/text_classification/hierarchical/analysis/aug.py b/applications/text_classification/hierarchical/analysis/aug.py new file mode 100644 index 0000000000000000000000000000000000000000..731454ba556065abce173e8c5394c861466f0dde --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/aug.py @@ -0,0 +1,82 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
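+
+# 数据增强脚本:按 --aug_strategy 选定的词级增强策略(同义词替换/插入/删除/交换,或 mix 混合),
+# 逐行读取 train_path(每行格式:文本\t标签),为每条样本生成 create_n 条增强文本并沿用原标签,
+# 写入 aug_path,供与原训练集合并后重新训练使用。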
+ +import argparse + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--train_path", type=str, default="../data/train.txt", help="Train dataset file name") +parser.add_argument("--aug_path", type=str, default="../data/aug.txt", help="Aug dataset file name") +parser.add_argument("--aug_strategy", choices=["mix", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--aug_type", choices=["synonym", "homonym", "mlm"], default='synonym', help="Select data augmentation type for substitute and insert") +parser.add_argument("--create_n", type=int, default=2, help="Number of augmented sequences.") +parser.add_argument("--aug_percent", type=float, default=0.1, help="Percentage of augmented words in sequences.") +parser.add_argument('--device', default="gpu", help="Select which device to do data augmentation strategy, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def aug(): + """Do data augmentation""" + if args.aug_strategy in ["mix", "substitute", "insert"] and args.aug_strategy == "mlm": + paddle.set_device(args.device) + + if args.aug_strategy in ["substitute", "insert", "delete", "swap"]: + if args.aug_strategy == "substitute": + aug = WordSubstitute(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=args.create_n, aug_percent=args.aug_percent) + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + augs = aug.augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + elif args.aug_strategy in ["mix"]: + aug = [ + WordSubstitute(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordInsert(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordDelete(create_n=1, aug_percent=args.aug_percent), + WordSwap(create_n=1, aug_percent=args.aug_percent), + ] + count = 0 + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + for i in range(args.create_n): + i = count % len(aug) + augs = aug[i].augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + count += 1 + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + + +if __name__ == "__main__": + aug() diff --git a/applications/text_classification/hierarchical/analysis/dirty.py b/applications/text_classification/hierarchical/analysis/dirty.py new file mode 100644 index 0000000000000000000000000000000000000000..394e597c7e280929f68ad38c6f188acdfa8ded50 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/dirty.py @@ -0,0 +1,153 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import RepresenterPointModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--dirty_num", type=int, default=100, help="Number of dirty data. default:50") +parser.add_argument("--dirty_file", type=str, default="train_dirty.txt", help="Path to save dirty data.") +parser.add_argument("--rest_file", type=str, default="train_dirty_rest.txt", help="The path of rest data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dirty_threshold", type=float, default="0", help="The threshold to select dirty data.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + sentence, label = line.strip().split("\t") + yield {"text": sentence, "label": label} + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +def get_dirty_data(weight_matrix, dirty_num, threshold=0): + """ + Get index of dirty data from train data + """ + scores = [] + for idx in range(weight_matrix.shape[0]): + weight_sum = 0 + count = 0 + for weight in weight_matrix[idx].numpy(): + if weight > threshold: + count += 1 + weight_sum += weight + scores.append((count, weight_sum)) + sorted_scores = sorted(scores)[::-1] + sorted_idxs = sorted(range(len(scores)), key=lambda idx: scores[idx])[::-1] + + ret_scores = sorted_scores[:dirty_num] + ret_idxs = sorted_idxs[:dirty_num] + + return ret_idxs, ret_scores + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = 
list(batch.values()) + return batch + + +def run(): + """ + Get dirty data + """ + set_seed(args.seed) + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + train_ds = train_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + rep_point = RepresenterPointModel(model, train_data_loader, classifier_layer_name="classifier") + weight_matrix = rep_point.weight_matrix + + # Save dirty data & rest data + dirty_indexs, _ = get_dirty_data(weight_matrix, args.dirty_num, args.dirty_threshold) + + dirty_path = os.path.join(args.dataset_dir, args.dirty_file) + rest_path = os.path.join(args.dataset_dir, args.rest_file) + + with open(dirty_path, "w") as f1, open(rest_path, "w") as f2: + for idx in range(len(train_ds)): + if idx in dirty_indexs: + f1.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + else: + f2.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + + f1.close(), f2.close() + + +if __name__ == "__main__": + run() diff --git a/applications/text_classification/hierarchical/analysis/evaluate.py b/applications/text_classification/hierarchical/analysis/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ef927e7df460b005c2d6aa92e9b34d66a4e68a48 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/evaluate.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
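+
+# 层次分类评估脚本:加载 params_path 下保存的微调模型,在 dev_file 上以 sigmoid + 0.5 阈值做多标签预测,
+# 输出整体与逐层级(标签按 "##" 切分)的 Accuracy、Micro/Macro F1 以及每个类别的指标,
+# 并将预测错误样本写入 bad_case_file,便于后续 bad case 分析。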
+ +import argparse +import functools +import os + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from sklearn.metrics import accuracy_score, classification_report, f1_score + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to evaluate model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include dev.txt and label.txt") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for evaluation.") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--bad_case_file", type=str, default="./bad_case.txt", help="Bad case saving file path") +args = parser.parse_args() +# yapf: enable + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + label = "" + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"text": sentence, "label": labels, "label_n": label} + + +@paddle.no_grad() +def evaluate(): + """ + Evaluate the model performance + """ + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # load and preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + label_map = {} + label_map_dict = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + label_map[i] = l + for ii, ll in enumerate(l.split("##")): + if ii not in label_map_dict: + label_map_dict[ii] = {} + if ll not in label_map_dict[ii]: + iii = len(label_map_dict[ii]) + label_map_dict[ii][ll] = iii + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + dev_ds = 
dev_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + model.eval() + probs = [] + labels = [] + for batch in dev_data_loader: + label = batch.pop("labels") + logits = model(**batch) + labels.extend(label.numpy()) + probs.extend(F.sigmoid(logits).numpy()) + probs = np.array(probs) + labels = np.array(labels) + preds = probs > 0.5 + report = classification_report(labels, preds, digits=4, output_dict=True) + accuracy = accuracy_score(labels, preds) + + labels_dict = {ii: [] for ii in range(len(label_map_dict))} + preds_dict = {ii: [] for ii in range(len(label_map_dict))} + for i in range(len(preds)): + for ii in range(len(label_map_dict)): + labels_dict[ii].append([0] * len(label_map_dict[ii])) + preds_dict[ii].append([0] * len(label_map_dict[ii])) + for l in dev_ds.data[i]["label_n"].split(","): + for ii, sub_l in enumerate(l.split("##")): + labels_dict[ii][-1][label_map_dict[ii][sub_l]] = 1 + + pred_n = [label_map[i] for i, pp in enumerate(preds[i]) if pp] + + for l in pred_n: + for ii, sub_l in enumerate(l.split("##")): + preds_dict[ii][-1][label_map_dict[ii][sub_l]] = 1 + + logger.info("-----Evaluate model-------") + logger.info("Dev dataset size: {}".format(len(dev_ds))) + logger.info("Accuracy in dev dataset: {:.2f}%".format(accuracy * 100)) + logger.info( + "Micro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["micro avg"]["precision"] * 100, + report["micro avg"]["recall"] * 100, + report["micro avg"]["f1-score"] * 100, + ) + ) + logger.info( + "Macro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["macro avg"]["precision"] * 100, + report["macro avg"]["recall"] * 100, + report["macro avg"]["f1-score"] * 100, + ) + ) + for ii in range(len(label_map_dict)): + macro_f1_score = f1_score(labels_dict[ii], preds_dict[ii], average="macro") + micro_f1_score = f1_score(labels_dict[ii], preds_dict[ii], average="micro") + accuracy = accuracy_score(labels_dict[ii], preds_dict[ii]) + logger.info( + "Level {} Label Performance: Macro F1 score: {:.2f} | Micro F1 score: {:.2f} | Accuracy: {:.2f}".format( + ii + 1, macro_f1_score * 100, micro_f1_score * 100, accuracy * 100 + ) + ) + + for i in label_map: + logger.info("Class name: {}".format(label_map[i])) + logger.info( + "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report[str(i)]["support"], + 100 * report[str(i)]["support"] / len(dev_ds), + report[str(i)]["precision"] * 100, + report[str(i)]["recall"] * 100, + report[str(i)]["f1-score"] * 100, + ) + ) + logger.info("----------------------------") + bad_case_path = os.path.join(args.dataset_dir, args.bad_case_file) + with open(bad_case_path, "w", encoding="utf-8") as f: + f.write("Text\tLabel\tPrediction\n") + for i in range(len(preds)): + for p, l in zip(preds[i], labels[i]): + if (p and l == 0) or (not p and l == 1): + pred_n = [label_map[i] for i, pp in enumerate(preds[i]) if pp] + f.write(dev_ds.data[i]["text"] + "\t" + dev_ds.data[i]["label_n"] + "\t" + ",".join(pred_n) + "\n") + break + + f.close() + logger.info("Bad case in dev dataset saved in {}".format(bad_case_path)) + + return + + +if __name__ == "__main__": + evaluate() diff --git 
a/applications/text_classification/hierarchical/analysis/sent_interpret.py b/applications/text_classification/hierarchical/analysis/sent_interpret.py new file mode 100644 index 0000000000000000000000000000000000000000..1f0e4a88c190a158f948930a6e6c74c266f9f5f8 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/sent_interpret.py @@ -0,0 +1,157 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--top_k", type=int, default=3, help="Top K important training data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--interpret_input_file", type=str, default="bad_case.txt", help="interpretation file name") +parser.add_argument("--interpret_result_file", type=str, default="sent_interpret.txt", help="interpreted file name") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if items[0] == "Text": + continue + if len(items) == 3: + yield {"text": items[0], "label": items[1], "predict": items[2]} + elif len(items) == 2: + yield {"text": items[0], "label": items[1], "predict": ""} + elif len(items) == 1: + yield {"text": items[0], "label": "", "predict": ""} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def find_positive_influence_data(): + + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + interpret_path = os.path.join(args.dataset_dir, args.interpret_input_file) + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + interpret_ds = interpret_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + interpret_batch_sampler = BatchSampler(interpret_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + interpret_data_loader = DataLoader( + dataset=interpret_ds, batch_sampler=interpret_batch_sampler, collate_fn=collate_fn + ) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & 
select sparse data + analysis_result = [] + for batch in interpret_data_loader: + analysis_result += feature_sim(batch, sample_num=args.top_k) + with open(os.path.join(args.dataset_dir, args.interpret_result_file), "w") as f: + for i in range(len(analysis_result)): + f.write("text: " + interpret_ds.data[i]["text"] + "\n") + if "predict" in interpret_ds.data[i]: + f.write("predict label: " + interpret_ds.data[i]["predict"] + "\n") + if "label" in interpret_ds.data[i]: + f.write("label: " + interpret_ds.data[i]["label"] + "\n") + f.write("examples with positive influence\n") + for i, (idx, score) in enumerate(zip(analysis_result[i].pos_indexes, analysis_result[i].pos_scores)): + f.write( + "support{} text: ".format(i + 1) + + train_ds.data[idx]["text"] + + "\t" + + "label: " + + train_ds.data[idx]["label"] + + "\t" + + "score: " + + "{:.5f}".format(score) + + "\n" + ) + f.close() + + +if __name__ == "__main__": + find_positive_influence_data() diff --git a/applications/text_classification/hierarchical/analysis/sparse.py b/applications/text_classification/hierarchical/analysis/sparse.py new file mode 100644 index 0000000000000000000000000000000000000000..80feb4f661348014bd44e28ec69b4d6bcff9d189 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/sparse.py @@ -0,0 +1,286 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--aug_strategy", choices=["duplicate", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--annotate", action='store_true', help="Select unlabeled data for annotation") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.") +parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.") +parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.") +parser.add_argument("--support_threshold", type=float, default="0.7", help="The threshold to select support data.") +parser.add_argument("--support_num", type=int, default=100, help="Number of support data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--unlabeled_file", type=str, default="data.txt", help="Unlabeled data filename") +parser.add_argument("--sparse_file", type=str, default="sparse.txt", help="Sparse data file name.") +parser.add_argument("--support_file", type=str, default="support.txt", help="support data file name.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 2: + yield {"text": items[0], "label": items[1]} + elif len(items) == 1: + yield {"text": items[0]} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def get_sparse_data(analysis_result, sparse_num): + """ + Get sparse data + """ + idx_scores = {} + preds = [] + for i in range(len(analysis_result)): + scores = analysis_result[i].pos_scores + idx_scores[i] = sum(scores) / len(scores) + preds.append(analysis_result[i].pred_label) + + idx_socre_list = list(sorted(idx_scores.items(), key=lambda x: x[1]))[:sparse_num] + ret_idxs, ret_scores = list(zip(*idx_socre_list)) + return ret_idxs, ret_scores, preds + + +def find_sparse_data(): + """ + Find sparse data (lack of supports in train dataset) in dev dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + train_path = os.path.join(args.dataset_dir, 
args.train_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in dev_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_sparse) + sparse_indexs, sparse_scores, preds = get_sparse_data(analysis_result, args.sparse_num) + + # Save the sparse data + with open(os.path.join(args.dataset_dir, args.sparse_file), "w") as f: + for idx in sparse_indexs: + data = dev_ds.data[idx] + f.write(data["text"] + "\t" + str(data["label"]) + "\n") + f.close() + logger.info("Sparse data saved in {}".format(os.path.join(args.dataset_dir, args.sparse_file))) + logger.info("Average score in sparse data: {:.4f}".format(sum(sparse_scores) / len(sparse_scores))) + return os.path.join(args.dataset_dir, args.sparse_file) + + +def get_support_data(analysis_result, support_num, support_threshold=0.7): + """ + get support data + """ + ret_idxs = [] + ret_scores = [] + rationale_idx = 0 + try: + while len(ret_idxs) < support_num: + for n in range(len(analysis_result)): + score = analysis_result[n].pos_scores[rationale_idx] + if score > support_threshold: + idx = analysis_result[n].pos_indexes[rationale_idx] + if idx not in ret_idxs: + ret_idxs.append(idx) + ret_scores.append(score) + if len(ret_idxs) >= support_num: + break + + rationale_idx += 1 + except IndexError: + logger.error( + f"The index is out of range, please reduce support_num or increase support_threshold. Got {len(ret_idxs)} now." 
+ ) + + return ret_idxs, ret_scores + + +def find_support_data(): + """ + Find support data (which supports sparse data) from candidate dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + if args.annotate: + candidate_path = os.path.join(args.dataset_dir, args.unlabeled_file) + else: + candidate_path = os.path.join(args.dataset_dir, args.train_file) + + sparse_path = os.path.join(args.dataset_dir, args.sparse_file) + support_path = os.path.join(args.dataset_dir, args.support_file) + candidate_ds = load_dataset(read_local_dataset, path=candidate_path, lazy=False) + sparse_ds = load_dataset(read_local_dataset, path=sparse_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + candidate_ds = candidate_ds.map(trans_func) + sparse_ds = sparse_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + candidate_batch_sampler = BatchSampler(candidate_ds, batch_size=args.batch_size, shuffle=False) + sparse_batch_sampler = BatchSampler(sparse_ds, batch_size=args.batch_size, shuffle=False) + candidate_data_loader = DataLoader( + dataset=candidate_ds, batch_sampler=candidate_batch_sampler, collate_fn=collate_fn + ) + sparse_data_loader = DataLoader(dataset=sparse_ds, batch_sampler=sparse_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, candidate_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis + analysis_result = [] + for batch in sparse_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_support) + + support_indexs, support_scores = get_support_data(analysis_result, args.support_num, args.support_threshold) + + # Save the support data + if args.annotate or args.aug_strategy == "duplicate": + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + if "label" in data: + f.write(data["text"] + "\t" + data["label"] + "\n") + else: + f.write(data["text"] + "\n") + f.close() + else: + create_n = 1 + aug_percent = 0.1 + if args.aug_strategy == "substitute": + aug = WordSubstitute("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=create_n, aug_percent=aug_percent) + + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + augs = aug.augment(data["text"]) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f.write(a + "\t" + data["label"] + "\n") + f.close() + logger.info("support data saved in {}".format(support_path)) + logger.info("support average scores: {:.4f}".format(float(sum(support_scores)) / len(support_scores))) + + +if __name__ == "__main__": + find_sparse_data() + find_support_data() diff --git a/applications/text_classification/hierarchical/analysis/word_interpret.ipynb 
b/applications/text_classification/hierarchical/analysis/word_interpret.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..70a0da4081f6facf69b33c262895d60dbf549b2d --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/word_interpret.ipynb @@ -0,0 +1,362 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 词级别可解释性分析\n", + "本项目提供模型的词级别可解释性分析,包括LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用此项目进行数据分析。\n", + "\n", + "![image](https://user-images.githubusercontent.com/63761690/195334753-78cc2dc8-a5ba-4460-9fde-3b1bb704c053.png)\n", + " \n", + "\n", + "## 1.导入Python模块与参数配置\n", + "首先我们导入必要的导入必要python模块和设置配置参数,词级别可解释性分析算法支持三种待分析的文本 `INTERPRETER_FILE` 数据文件格式:\n", + "\n", + "**格式一:包括文本、标签、预测结果**\n", + "```text\n", + "<文本>'\\t'<标签>'\\t'<预测结果>\n", + "...\n", + "```\n", + "\n", + "**格式二:包括文本、标签**\n", + "```text\n", + "<文本>'\\t'<标签>\n", + "...\n", + "```\n", + "\n", + "**格式三:只包括文本**\n", + "```text\n", + "<文本>\n", + "...\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import functools\n", + "import random\n", + "import os\n", + "import argparse\n", + "\n", + "import jieba\n", + "import numpy as np \n", + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# 预先定义配置参数\n", + "\n", + "# 运行环境,可选\"cpu\",\"gpu\",\"gpu:x\"(x为gpu编号)\n", + "DEVICE = \"gpu\"\n", + "# 数据路径\n", + "DATASET_DIR = \"../data\" \n", + "# 训练模型保存路径\n", + "PARAM_PATH = \"../checkpoint/\" \n", + "# tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数\n", + "MAX_LENGTH = 128 \n", + "# 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数\n", + "BATCH_SIZE = 1 \n", + "# 待分析解释的数据\n", + "INTERPRETER_FILE = \"bad_case.txt\"\n", + "# 可选 \"ig\",\"lime\",\"grad\" ,可以根据实际任务效果选择解释器\n", + "# \"grad\":GradShap方法依赖interpretdl\n", + "# !pip install interpretdl\n", + "INTERPRETER = \"ig\"\n", + "# 分析句子中TOP K关键词,K值\n", + "KEY_WORDS_NUM = 5" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def read_local_dataset(path):\n", + " \"\"\"\n", + " Read dataset file\n", + " \"\"\"\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f:\n", + " items = line.strip().split('\\t')\n", + " if items[0] == 'Text':\n", + " continue\n", + " items[0] = items[0][:MAX_LENGTH-2]\n", + " if len(items) == 3:\n", + " yield {'text': items[0], 'label': items[1], 'predict': items[2]}\n", + " elif len(items) == 2:\n", + " yield {'text': items[0], 
'label': items[1], 'predict': ''}\n", + " elif len(items) == 1:\n", + " yield {'text': items[0], 'label': '', 'predict': ''}\n", + " else:\n", + " raise ValueError(\"{} should be in fixed format.\".format(path))\n", + "\n", + "def preprocess_function(examples, tokenizer, max_seq_length):\n", + " \"\"\"\n", + " Preprocess dataset\n", + " \"\"\"\n", + " result = tokenizer(text=examples[\"text\"], max_seq_len=max_seq_length)\n", + " return result\n", + "\n", + "class LocalDataCollatorWithPadding(DataCollatorWithPadding):\n", + " \"\"\"\n", + " Convert the result of DataCollatorWithPadding from dict dictionary to a list\n", + " \"\"\"\n", + "\n", + " def __call__(self, features):\n", + " batch = super().__call__(features)\n", + " batch = list(batch.values())\n", + " return batch" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32m[2022-10-12 11:45:49,858] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/hierarchical/checkpoint/'.\u001b[0m\n", + "W1012 11:45:49.861358 26086 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2\n", + "W1012 11:45:49.865923 26086 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.\n", + "\u001b[32m[2022-10-12 11:45:52,912] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/hierarchical/checkpoint/'.\u001b[0m\n" + ] + } + ], + "source": [ + "paddle.set_device(DEVICE)\n", + "\n", + "# Define model & tokenizer\n", + "if os.path.exists(PARAM_PATH):\n", + " model = AutoModelForSequenceClassification.from_pretrained(PARAM_PATH)\n", + " tokenizer = AutoTokenizer.from_pretrained(PARAM_PATH)\n", + "else:\n", + " raise ValueError(\"The {} should exist.\".format(PARAM_PATH))\n", + "\n", + "\n", + "# Prepare & preprocess dataset\n", + "interpret_path = os.path.join(DATASET_DIR, INTERPRETER_FILE)\n", + "\n", + "\n", + "interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False)\n", + "trans_func = functools.partial(preprocess_function,\n", + " tokenizer=tokenizer,\n", + " max_seq_length=MAX_LENGTH)\n", + "\n", + "interpret_ds = interpret_ds.map(trans_func)\n", + "\n", + "# Batchify dataset\n", + "collate_fn = LocalDataCollatorWithPadding(tokenizer)\n", + "interpret_batch_sampler = BatchSampler(interpret_ds,\n", + " batch_size=BATCH_SIZE,\n", + " shuffle=False)\n", + "interpret_data_loader = DataLoader(dataset=interpret_ds,\n", + " batch_sampler=interpret_batch_sampler,\n", + " collate_fn=collate_fn)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start token level interpretion, it will take some time...\n", + "Building prefix dict from the default dictionary ...\n", + "Loading model from cache /tmp/jieba.cache\n", + "Loading model cost 0.746 seconds.\n", + "Prefix dict has been built successfully.\n", + "Start word level alignment, it will take some time...\n" + ] + } + ], + "source": [ + "# Init an interpreter\n", + "if INTERPRETER == 'ig':\n", + " from trustai.interpretation.token_level import IntGradInterpreter\n", + " interpreter = IntGradInterpreter(model)\n", + "elif INTERPRETER == 'lime':\n", + " from trustai.interpretation.token_level import LIMEInterpreter\n", + " interpreter = LIMEInterpreter(model, unk_id=tokenizer.convert_tokens_to_ids('[UNK]'), 
pad_id=tokenizer.convert_tokens_to_ids('[PAD]'))\n", + "else:\n", + " from trustai.interpretation.token_level import GradShapInterpreter\n", + " interpreter = GradShapInterpreter(model)\n", + "\n", + "# Use interpreter to get the importance scores for all data\n", + "print(\"Start token level interpretion, it will take some time...\")\n", + "analysis_result = []\n", + "for batch in interpret_data_loader:\n", + " analysis_result += interpreter(tuple(batch))\n", + "\n", + "# Add CLS and SEP tags to both original text and standard splited tokens\n", + "contexts = []\n", + "words = []\n", + "for i in range(len(interpret_ds)):\n", + " text = interpret_ds.data[i][\"text\"]\n", + " contexts.append(\"[CLS]\" + text + \"[SEP]\")\n", + " words.append([\"[CLS]\"] + list(jieba.cut(text)) + [\"[SEP]\"])\n", + "\n", + "# Get the offset map of tokenized tokens and standard splited tokens\n", + "print(\"Start word level alignment, it will take some time...\")\n", + "ori_offset_maps = []\n", + "word_offset_maps = []\n", + "for i in range(len(contexts)):\n", + " ori_offset_maps.append(tokenizer.get_offset_mapping(contexts[i]))\n", + " word_offset_maps.append(get_word_offset(contexts[i], words[i]))\n", + "\n", + "align_res = interpreter.alignment(analysis_result, contexts, words, word_offset_maps, ori_offset_maps, special_tokens=[\"[CLS]\", '[SEP]'],rationale_num=KEY_WORDS_NUM)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import display, HTML\n", + "class Visualization(VisualizationTextRecord):\n", + "\n", + " def __init__(self, interpret_res, true_label=None, pred_label=None, words=None):\n", + " if words is not None:\n", + " self.words = words\n", + " else:\n", + " self.words = interpret_res.words\n", + " self.pred_label = pred_label if pred_label is not None else ''\n", + " self.true_label = true_label if true_label is not None else ''\n", + " self.key_words = \" \".join(set(interpret_res.rationale_tokens))\n", + " word_attributions = interpret_res.word_attributions\n", + " _max = max(word_attributions)\n", + " _min = min(word_attributions)\n", + " self.word_attributions = [(word_imp - _min) / (_max - _min) for word_imp in word_attributions]\n", + "\n", + " def record_html(self):\n", + " \"\"\"change all informations to html\"\"\"\n", + " return \"\".join([\n", + " \"\",\n", + " self._format_class(self.true_label),\n", + " self._format_class(self.pred_label),\n", + " self._format_class(self.key_words),\n", + " self._format_word_attributions(),\n", + " \"\",\n", + " ])\n", + " def _format_class(self, label):\n", + " return '{label}'.format(label=label)\n", + "\n", + "def visualize_text(text_records):\n", + " \"\"\"visualize text\"\"\"\n", + " html = [\"\"]\n", + " rows = [\"\"\n", + " \"\"\n", + " \"\"\n", + " \"\"]\n", + " for record in text_records:\n", + " rows.append(record.record_html())\n", + " html.append(\"\".join(rows))\n", + " html.append(\"
LabelPredictionKey wordsImportant visualization
\")\n", + " html = HTML(\"\".join(html))\n", + " display(html)\n", + " return html.data\n", + "\n", + "\n", + "def visualize(interpret_res, ds):\n", + " records = []\n", + " for i in range(len(interpret_res)):\n", + " records.append(Visualization(interpret_res[i], true_label=ds.data[i][\"label\"], pred_label=ds.data[i][\"predict\"]))\n", + " html = visualize_text(records)\n", + " return html" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Label | Prediction | Key words | Important visualization
组织关系,组织关系##加盟,组织关系##裁员 | 组织关系,组织关系##解雇 | 。 特裁 签下 此前 掉 | [CLS] 猛龙 随队 记者 JoshLewenberg 报道 消息人士 透露 猛龙 前锋 加巴 - 科纳 特裁 此前 猛龙 签下 一份 Exhibit10 合同 裁掉 科纳 特下 赛季 概率 前往 猛龙 发展 联盟 球队 效力 [SEP]
组织关系,组织关系##裁员 | 组织关系,组织关系##解雇 | 加入 湖人队 裁掉 被 何去何从 | [CLS] 冠军 射手 裁掉 加入 湖人队 湖人 无意 冠军 射手 何去何从 [SEP]
组织关系,组织关系##裁员 | 组织关系,组织关系##裁员,财经/交易 | 裁员 超过 1000 将 裁减 | [CLS] 6 7 报道 IBM 裁员 超过 1000 IBM 周四 确认 裁减 一千多 知情 人士 此次 裁员 影响 1700 员工 IBM 全球 34 员工 0.5% IBM 股价 今年 累计 上涨 16% 公司 4 发布 财报 显示 一季度 营收 下降 5% 低于 市场 预期 [SEP]
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# process for vbisualize\n", + "html = visualize(align_res, interpret_ds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.7.13 64-bit", + "metadata": { + "interpreter": { + "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90" + } + }, + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13-final" + }, + "orig_nbformat": 2 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/README.md b/applications/text_classification/hierarchical/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b9caf349b1acf4ad0fb420152fd9d505533422c5 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/README.md @@ -0,0 +1,189 @@ +# 基于Paddle Serving的服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建层次分类在线服务部署。 + +## 目录 +- [环境准备](#环境准备) +- [模型转换](#模型转换) +- [部署模型](#部署模型) + +## 环境准备 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以删去` -i https://mirror.baidu.com/pypi/simple` ,更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` +### 安装Paddle Serving + +安装client和serving app,用于向服务发送请求: +```shell +pip install paddle_serving_app paddle_serving_client +``` +安装serving,用于启动服务,根据服务器设备选择安装CPU server或GPU server: + +- 安装CPU server +```shell +pip install paddle_serving_server +``` +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 默认开启国内清华镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://pypi.tuna.tsinghua.edu.cn/simple) +- 更多wheel包请参考[serving官网文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md) + +### 安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 + +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 + +```shell +python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams 
+``` + +可以通过命令查参数含义: +```shell +python -m paddle_serving_client.convert --help +``` + +转换成功后的目录如下: +``` +paddle_serving/ +├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +└──serving_client + ├── serving_client_conf.prototxt + └── serving_client_conf.stream.prototxt +``` + +## 部署模型 + +serving目录包含启动pipeline服务和发送预测请求的代码和模型,包括: + +``` +serving/ +├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +├──config.yml # 层次分类任务启动服务端的配置文件 +├──rpc_client.py # 层次分类任务发送pipeline预测请求的脚本 +└──service.py # 层次分类任务启动服务端的脚本 + +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。比如: +``` +# 修改模型目录为下载的模型目录或自己的模型目录: +model_config: serving_server => model_config: erine-3.0-tiny/serving_server + +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 + +# 修改使用GPU推理为使用CPU推理: +device_type: 1 => device_type: 0 + +#开启MKLDNN加速 +#use_mkldnn: False => use_mkldnn: True + +#Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 +fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] +``` + + +### 分类任务 +#### 启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" +``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + +输出打印如下: +``` +[DAG] Succ init +[PipelineServicer] succ init +...... +--- Running analysis [ir_graph_to_program_pass] +I0727 06:50:34.988327 43126 analysis_predictor.cc:1007] ======= optimize end ======= +I0727 06:50:34.992336 43126 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0727 06:50:34.992357 43126 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0727 06:50:34.993671 43126 naive_executor.cc:102] --- skip [linear_75.tmp_1], fetch -> fetch +[2022-07-27 06:50:35,954] [ WARNING] - Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. You can install fast_tokenizer by `pip install fast-tokenizer-python`. +[2022-07-27 06:50:35,954] [ INFO] - We are using to load 'ernie-3.0-medium-zh'. 
+[2022-07-27 06:50:35,954] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt +[OP Object] init success +``` + +#### 启动rpc client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python rpc_client.py +``` +输出打印如下: +``` +text: 消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了 +label: 组织关系,组织关系##裁员 +-------------------- +text: 卡车超载致使跨桥侧翻,没那么简单 +label: 灾害/意外,灾害/意外##坍/垮塌 +-------------------- +text: 金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件 +label: 产品行为,产品行为##召回 +-------------------- +``` +#### 启动http client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python http_client.py +``` +输出打印如下: +``` +text: 消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了 +label: 组织关系,组织关系##裁员 +-------------------- +text: 卡车超载致使跨桥侧翻,没那么简单 +label: 灾害/意外,灾害/意外##坍/垮塌 +-------------------- +text: 金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件 +label: 产品行为,产品行为##召回 +-------------------- +``` diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml b/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..62a1a3056b826619c7c640fcb9c426a2d96fc28f --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml @@ -0,0 +1,59 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18090 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9878 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 1 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + seq_cls: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: serving_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准 + fetch_list: ["linear_75.tmp_1"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #开启MKLDNN加速 + use_mkldnn: True + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: True + + #开启tensorrt后,进行优化的子图包含的最少节点数 + #min_subgraph_size: 10 \ No newline at end of file diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/http_client.py b/applications/text_classification/hierarchical/deploy/paddle_serving/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..083fe02600c7081b88441901cf6a32a31d549ea4 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/http_client.py @@ -0,0 +1,123 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json + +import numpy as np +import requests + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.server_url = server_url + + def Run(self, text, label_list): + sentence = np.array([t.encode("utf-8") for t in text], dtype=np.object_) + sentence = sentence.__repr__() + data = {"key": ["sentence"], "value": [sentence]} + data = json.dumps(data) + + ret = requests.post(url=self.server_url, data=data) + ret = ret.json() + for t, l in zip(text, eval(ret["value"][0])): + print("text: ", t) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "http://127.0.0.1:9878/seq_cls/prediction" + runner = Runner(server_url) + text = ["消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了?", "卡车超载致使跨桥侧翻,没那么简单", "金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件"] + label_list = [ + "交往", + "交往##会见", + "交往##感谢", + "交往##探班", + "交往##点赞", + "交往##道歉", + "产品行为", + "产品行为##上映", + "产品行为##下架", + "产品行为##发布", + "产品行为##召回", + "产品行为##获奖", + "人生", + "人生##产子/女", + "人生##出轨", + "人生##分手", + "人生##失联", + "人生##婚礼", + "人生##庆生", + "人生##怀孕", + "人生##死亡", + "人生##求婚", + "人生##离婚", + "人生##结婚", + "人生##订婚", + "司法行为", + "司法行为##举报", + "司法行为##入狱", + "司法行为##开庭", + "司法行为##拘捕", + "司法行为##立案", + "司法行为##约谈", + "司法行为##罚款", + "司法行为##起诉", + "灾害/意外", + "灾害/意外##地震", + "灾害/意外##坍/垮塌", + "灾害/意外##坠机", + "灾害/意外##洪灾", + "灾害/意外##爆炸", + "灾害/意外##袭击", + "灾害/意外##起火", + "灾害/意外##车祸", + "竞赛行为", + "竞赛行为##夺冠", + "竞赛行为##晋级", + "竞赛行为##禁赛", + "竞赛行为##胜负", + "竞赛行为##退役", + "竞赛行为##退赛", + "组织关系", + "组织关系##停职", + "组织关系##加盟", + "组织关系##裁员", + "组织关系##解散", + "组织关系##解约", + "组织关系##解雇", + "组织关系##辞/离职", + "组织关系##退出", + "组织行为", + "组织行为##开幕", + "组织行为##游行", + "组织行为##罢工", + "组织行为##闭幕", + "财经/交易", + "财经/交易##上市", + "财经/交易##出售/收购", + "财经/交易##加息", + "财经/交易##涨价", + "财经/交易##涨停", + "财经/交易##融资", + "财经/交易##跌停", + "财经/交易##降价", + "财经/交易##降息", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py b/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..f946f82078a0de296e8c7b5fb7d856bc5f343bb6 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py @@ -0,0 +1,120 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
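+
+# RPC pipeline client for the hierarchical classification service started by service.py.
+# It connects to the rpc_port configured in config.yml (18090 by default), feeds the raw
+# UTF-8 sentences under the "sentence" key, and maps the comma-separated label ids in the
+# response back to the hierarchical label names defined in label_list below.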
+import numpy as np +from paddle_serving_server.pipeline import PipelineClient + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data, label_list): + data = np.array([x.encode("utf-8") for x in data], dtype=np.object_) + ret = self.client.predict(feed_dict={"sentence": data}) + for ( + d, + l, + ) in zip(data, eval(ret.value[0])): + print("text: ", d) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "127.0.0.1:18090" + runner = Runner(server_url) + text = ["消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了?", "卡车超载致使跨桥侧翻,没那么简单", "金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件"] + label_list = [ + "交往", + "交往##会见", + "交往##感谢", + "交往##探班", + "交往##点赞", + "交往##道歉", + "产品行为", + "产品行为##上映", + "产品行为##下架", + "产品行为##发布", + "产品行为##召回", + "产品行为##获奖", + "人生", + "人生##产子/女", + "人生##出轨", + "人生##分手", + "人生##失联", + "人生##婚礼", + "人生##庆生", + "人生##怀孕", + "人生##死亡", + "人生##求婚", + "人生##离婚", + "人生##结婚", + "人生##订婚", + "司法行为", + "司法行为##举报", + "司法行为##入狱", + "司法行为##开庭", + "司法行为##拘捕", + "司法行为##立案", + "司法行为##约谈", + "司法行为##罚款", + "司法行为##起诉", + "灾害/意外", + "灾害/意外##地震", + "灾害/意外##坍/垮塌", + "灾害/意外##坠机", + "灾害/意外##洪灾", + "灾害/意外##爆炸", + "灾害/意外##袭击", + "灾害/意外##起火", + "灾害/意外##车祸", + "竞赛行为", + "竞赛行为##夺冠", + "竞赛行为##晋级", + "竞赛行为##禁赛", + "竞赛行为##胜负", + "竞赛行为##退役", + "竞赛行为##退赛", + "组织关系", + "组织关系##停职", + "组织关系##加盟", + "组织关系##裁员", + "组织关系##解散", + "组织关系##解约", + "组织关系##解雇", + "组织关系##辞/离职", + "组织关系##退出", + "组织行为", + "组织行为##开幕", + "组织行为##游行", + "组织行为##罢工", + "组织行为##闭幕", + "财经/交易", + "财经/交易##上市", + "财经/交易##出售/收购", + "财经/交易##加息", + "财经/交易##涨价", + "财经/交易##涨停", + "财经/交易##融资", + "财经/交易##跌停", + "财经/交易##降价", + "财经/交易##降息", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/service.py b/applications/text_classification/hierarchical/deploy/paddle_serving/service.py new file mode 100644 index 0000000000000000000000000000000000000000..608f0e1f7528942794300c30239caf04fac4b061 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/service.py @@ -0,0 +1,105 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
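+
+# Pipeline server for the hierarchical classification task: the Op below tokenizes and
+# pads the incoming sentences, the serving runtime runs the exported static graph model
+# configured in config.yml, and postprocess() applies a sigmoid with a 0.5 threshold to
+# the fetched logits, returning comma-separated label ids for the clients to decode.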
+ +import argparse +import logging + +import numpy as np +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.transformers import AutoTokenizer + +_LOGGER = logging.getLogger() + +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +args = parser.parse_args() +# fmt: on + + +class Op(Op): + def init_op(self): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True) + # Output nodes may differ from model to model + # You can see the output node name in the conf.prototxt file of serving_server + self.fetch_names = [ + FETCH_NAME_MAP[args.model_name], + ] + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["sentence"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + _LOGGER.error("input value {}is not supported.".format(data)) + data = [i.decode("utf-8") for i in data] + + # tokenizer + pad + data = self.tokenizer( + data, + max_length=args.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + ) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + return tokenized_data, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + + results = fetch_dict[self.fetch_names[0]] + results = np.array(results) + labels = [] + + for result in results: + label = [] + result = 1 / (1 + (np.exp(-result))) + for i, p in enumerate(result): + if p > 0.5: + label.append(str(i)) + labels.append(",".join(label)) + return {"label": labels}, None, "" + + +class Service(WebService): + def get_pipeline_response(self, read_op): + return Op(name="seq_cls", input_ops=[read_op]) + + +if __name__ == "__main__": + service = Service(name="seq_cls") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/applications/text_classification/hierarchical/deploy/predictor/README.md b/applications/text_classification/hierarchical/deploy/predictor/README.md new file mode 100644 index 0000000000000000000000000000000000000000..26f278d9a44c01257d4df521058bcebeff3d19b0 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/predictor/README.md @@ -0,0 +1,175 @@ +# 基于ONNXRuntime推理部署指南 + +**目录** + * [环境准备](#环境准备) + * [基于GPU部署推理样例](#基于GPU部署推理样例) + * [基于CPU部署推理样例](#基于CPU部署推理样例) + * [性能与精度测试](#性能与精度测试) +## 
环境准备 + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[静态图导出](../../README.md),模型使用裁剪API进行裁剪之后会自动生成静态图模型。 + +如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +python -m pip install onnxruntime-gpu onnx onnxconverter-common==1.9.0 psutil paddle2onnx==1.0.5 +``` + +如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +python -m pip install onnxruntime psutil +``` + +安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` + +## 基于GPU部署推理样例 +请使用如下命令进行部署 +``` +python infer.py \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` +多语言模型加上`--multilingual`,裁剪后的模型前缀为`--model_path_prefix ../../prune/width_mult_XXXX/pruned_model`。 +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_fp16`:选择是否开启FP16进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `device_id`: 选择GPU卡号;默认为0。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, test.txt/dev.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,可选'dev'、'test',选择在开发集或测试集评估模型;默认为"dev"。 +型);默认为False。 + +在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。可以使用如下命令开启ONNXRuntime的FP16进行推理加速: + +``` +python infer.py \ + --use_fp16 \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +## 基于CPU部署推理样例 + +请使用如下命令进行部署 +``` +python infer.py \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh",中文数据集推荐使用"ernie-3.0-medium-zh"。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_quantize`:选择是否开启INT8动态量化进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `num_threads`:cpu线程数;默认为cpu的物理核心数量。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, dev.txt/test.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,选择在开发集或测试集评估模型;默认为"dev"。 + +可以使用如下命令开启ONNXRuntime的INT8动态量化进行推理加速: + +``` +python infer.py \ + --use_quantize \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + 
--model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +**Note**:INT8动态量化与FP16相比精度损失较大,GPU部署建议使用FP16加速。 + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +## 性能与精度测试 + + +测试配置如下: + +1. [2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取的多标签数据集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 精度评价指标:Micro F1分数、Macro F1分数 + +| | Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------- |------------- | +| ERNIE 3.0 Medium+FP32+GPU | 95.26|93.22| 1.01| +| ERNIE 3.0 Medium+FP16+GPU | 95.26|93.22| 0.38| +| ERNIE 3.0 Medium+FP32+CPU | 95.26|93.22| 18.93 | +| ERNIE 3.0 Medium+INT8+CPU | 95.03 | 92.87| 12.14 | + + +经过FP16转化加速比达到3~4倍左右,精度变化较小,与FP16相比,INT8在线量化精度下降较大,加速比在1.5~2倍左右。 diff --git a/applications/text_classification/hierarchical/deploy/predictor/infer.py b/applications/text_classification/hierarchical/deploy/predictor/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..34a58d91fc99e0fa7dca573499da3fb02d5fbca0 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/predictor/infer.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import psutil +from predictor import Predictor + +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu, only takes effect when deploying on cpu.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--device_id', default=0, help="Select which gpu device to train model.") +parser.add_argument("--perf", action='store_true', help="Whether to compute the latency and f1 score of the test set.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="The dataset directory including data.txt, taxonomy.txt, test.txt(optional, if evaluate the performance).") +parser.add_argument("--perf_dataset", choices=['dev', 'test'], default='dev', type=str, help="evaluate the performance on dev dataset or test dataset") +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +args = parser.parse_args() +# yapf: enable + + +def read_local_dataset(path, label_list): + label_list_dict = {label_list[i]: i for i in range(len(label_list))} + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list_dict[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} + + +if __name__ == "__main__": + + label_list = [] + label_dir = os.path.join(args.dataset_dir, "label.txt") + with open(label_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + label_list.append(line.strip()) + f.close() + + predictor = Predictor(args, label_list) + + if args.perf: + eval_dir = os.path.join(args.dataset_dir, "{}.txt".format(args.perf_dataset)) + eval_ds = load_dataset(read_local_dataset, path=eval_dir, label_list=label_list, lazy=False) + texts, labels = predictor.get_text_and_label(eval_ds) + + # preprocess & evaluate & latency + preprocess_result = predictor.preprocess(texts) + predictor.evaluate(preprocess_result, labels) + predictor.performance(preprocess_result) + else: + data = [] + data_dir = os.path.join(args.dataset_dir, "data.txt") + with open(data_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + data.append(line.strip()) + f.close() + predictor.predict(data) diff --git a/applications/text_classification/hierarchical/deploy/predictor/predictor.py b/applications/text_classification/hierarchical/deploy/predictor/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..28b07ee7da2a7376f59417d6fa6ebc2152e482c6 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/predictor/predictor.py @@ -0,0 +1,233 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import numpy as np +import onnxruntime as ort +import paddle2onnx +from sklearn.metrics import f1_score + +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__( + self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, use_quantize=False, num_threads=10 + ): + logger.info(">>> [InferBackend] Creating Engine ...") + onnx_model = paddle2onnx.export( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + + if device == "gpu": + + logger.info(">>> [InferBackend] Use GPU to inference ...") + + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + if use_quantize: + logger.info( + ">>> [InferBackend] use_quantize only takes effect when deploying on cpu, use_fp16 for acceleration when deploying on gpu ..." + ) + sess_options = ort.SessionOptions() + self.predictor = ort.InferenceSession( + onnx_model, + sess_options=sess_options, + providers=["CUDAExecutionProvider"], + provider_options=[{"device_id": device_id}], + ) + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + if use_fp16: + logger.info( + ">>> [InferBackend] use_fp16 only takes effect when deploying on gpu, use_quantize for acceleration when deploying on cpu ..." 
+ ) + if use_quantize: + dynamic_quantize_model = os.path.join(infer_model_dir, "int8_model.onnx") + self.dynamic_quantize(float_onnx_file, dynamic_quantize_model) + onnx_model = dynamic_quantize_model + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=["CPUExecutionProvider"] + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def dynamic_quantize(self, input_float_model, dynamic_quantized_model): + from onnxruntime.quantization import quantize_dynamic + + quantize_dynamic(input_float_model, dynamic_quantized_model) + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +def sigmoid_(x): + """ + compute sigmoid + """ + return 1 / (1 + np.exp(-x)) + + +class Predictor(object): + def __init__(self, args, label_list): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + self.label_list = label_list + self.batch_size = args.batch_size + self.max_seq_length = args.max_seq_length + self.multilingual = args.multilingual + self.inference_backend = InferBackend( + args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.use_quantize, args.num_threads + ) + + def preprocess(self, input_data: list): + + # tokenizer + pad + data = self.tokenizer( + input_data, + max_length=self.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + return_token_type_ids=not self.multilingual, + ) + tokenized_data = {} + for tokenizer_key in data: + + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + return tokenized_data + + def postprocess(self, infer_data): + threshold = 0.5 + + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_data) + labels = [] + + for prob in probs: + label = [] + + for i, p in enumerate(prob): + if p > threshold: + label.append(self.label_list[i]) + + labels.append(label) + + return labels + + def infer(self, data): + infer_data = self.inference_backend.infer(data) + logits = np.array(infer_data[0]) + return logits + + def infer_batch(self, preprocess_result): + sample_num = len(preprocess_result["input_ids"]) + infer_result = None + for i in range(0, sample_num, self.batch_size): + batch_size = min(self.batch_size, sample_num - i) + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] for j in range(batch_size) + ] + + result = self.infer(preprocess_result_batch) + if infer_result is None: + infer_result = result + else: + infer_result = np.append(infer_result, result, axis=0) + return infer_result + + def printer(self, results, input_data): + for text, labels in zip(input_data, results): + hierarchical_labels = {} + logger.info("text: {}".format(text)) + logger.info("prediction result: {}".format(",".join(labels))) + for label in labels: + for i, l in enumerate(label.split("##")): + if i not in hierarchical_labels: + hierarchical_labels[i] = [] + if l not in hierarchical_labels[i]: + hierarchical_labels[i].append(l) + for d in range(len(hierarchical_labels)): + logger.info("level {} : {}".format(d + 1, ",".join(hierarchical_labels[d]))) + logger.info("--------------------") + + def predict(self, input_data: list): + preprocess_result = self.preprocess(input_data) + infer_result = self.infer_batch(preprocess_result) + result = 
self.postprocess(infer_result) + self.printer(result, input_data) + return + + def performance(self, preprocess_result): + nums = len(preprocess_result["input_ids"]) + + start = time.time() + self.infer_batch(preprocess_result) + total_time = time.time() - start + logger.info("sample nums: %s, time: %.2f, latency: %.2f ms" % (nums, total_time, 1000 * total_time / nums)) + return + + def evaluate(self, preprocess_result, labels): + + infer_result = self.infer_batch(preprocess_result) + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_result) + preds = probs > 0.5 + micro_f1_score = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1_score = f1_score(y_pred=preds, y_true=labels, average="macro") + logger.info("micro f1: %.2f, macro f1: %.2f" % (micro_f1_score * 100, macro_f1_score * 100)) + return + + def get_text_and_label(self, ds): + """ + Return text and label list + """ + all_texts = [] + all_labels = [] + for ii in range(len(ds)): + all_texts.append(ds[ii]["sentence"]) + labels = [float(1) if i in ds[ii]["label"] else float(0) for i in range(len(self.label_list))] + all_labels.append(labels) + return all_texts, all_labels diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/README.md b/applications/text_classification/hierarchical/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..13ec53a993bab48a08a8e6b0f4558cca09151fd1 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/README.md @@ -0,0 +1,42 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp --upgrade +``` +## Server服务启动 +### 分类任务启动 +#### 启动 分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` +如果是ERNIE-M模型则启动 +```bash +paddlenlp server ernie_m_server:app --host 0.0.0.0 --port 8189 +``` +#### 分类任务发送服务 +```bash +python client.py +``` + + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size`, `prob_limit` 参数 +```python + data = { + 'data': { + 'text': texts, + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'prob_limit': args.prob_limit + } + } +``` diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/client.py b/applications/text_classification/hierarchical/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..a1eb7fc8357a5d9226c90776c6ceccce33b4492c --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/client.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
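+
+# HTTP client for the SimpleServing deployment: it posts a batch of texts to the
+# registered endpoint (models/cls_hierarchical, served on port 8189 by server.py or
+# ernie_m_server.py) together with the max_seq_len, batch_size and prob_limit
+# parameters, and prints the response body.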
+ +import argparse +import requests +import json + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--prob_limit", default=0.5, type=float, help="The limitation of probability for the label.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/cls_hierarchical" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = [ + "请问木竭胶囊能同高血压药、氨糖同时服吗?", + "低压100*高压140*头涨,想吃点降压药。谢谢!", + "脑穿通畸形易发人群有哪些", + "幼儿乱吃丙硫氧嘧啶片怎么办,我也不知道她吃了几片", + "如果是可以降血糖的话,血糖值7点多的大概需要吃几个疗程?", + ] + data = { + "data": { + "text": texts, + }, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "prob_limit": args.prob_limit}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/ernie_m_server.py b/applications/text_classification/hierarchical/deploy/simple_serving/ernie_m_server.py new file mode 100644 index 0000000000000000000000000000000000000000..40a43f974d549ddcd581d4872c1f1580f8b3c499 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/ernie_m_server.py @@ -0,0 +1,25 @@ +# coding:utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from paddlenlp import SimpleServer +from paddlenlp.server import ERNIEMHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_hierarchical", + model_path="../../export", + tokenizer_name="ernie-m-base", + model_handler=ERNIEMHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/server.py b/applications/text_classification/hierarchical/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..2965552088fce7654bf350c272c6a2391c948942 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/server.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
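+
+# SimpleServing entry point: registers the exported model in ../../export under the name
+# models/cls_hierarchical, with CustomModelHandler for inference and
+# MultiLabelClassificationPostHandler for multi-label post-processing. Launch it with
+# `paddlenlp server server:app --host 0.0.0.0 --port 8189` as described in the README.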
+ +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_hierarchical", + model_path="../../export", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=CustomModelHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/README.md b/applications/text_classification/hierarchical/deploy/triton_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6e61ac5536c7195abf24c86192a26537bb35be90 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/README.md @@ -0,0 +1,186 @@ +# 基于Triton Inference Server的服务化部署指南 + +本文档将介绍如何使用[Triton Inference Server](https://github.com/triton-inference-server/server)工具部署基于ERNIE 2.0英文模型文本层次分类的pipeline在线服务。 + +## 目录 +- [服务端环境准备](#服务端环境准备) +- [模型获取和转换](#模型获取和转换) +- [部署模型](#部署模型) +- [客户端请求](#客户端请求) + +## 服务端环境准备 + +### 安装Triton Server +拉取Triton Server镜像: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3 +``` +启动容器: +```shell +docker run -it --gpus all --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash +``` + +**NOTE:** + +1. Triton版本号`21.10`可以根据自己的需求调整,各个Triton版本对应的Driver、CUDA、TRT和ONNX Runtime等后端版本可以参考[官网文档](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)。注意其中的`NVIDIA Driver`行,如果NVIDIA Driver低于文档中要求,在启动运行时会报错。 + +2. 可以使用`--gpus '"device=1"'`来指定GPU卡号,更多GPU指定方式请参见[Nvidia User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration) + + +### 进入容器并准备PaddleNLP环境 +整个服务的前后处理依赖PaddleNLP,需要在容器内安装相关python包 + +进入容器: +```shell +docker exec -it triton_server bash +``` +安装PaddlePaddle、PaddleNLP +```shell +python3 -m pip install paddlepaddle-gpu paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**NOTE:** + +1. 默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://mirror.baidu.com/pypi/simple) + +2. 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.2, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + +3. 
更多关于PaddleNLP安装的详细教程请查看[Installation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + + +### 安装FastTokenizers文本处理加速库(可选) + +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 + +在容器内安装 fast_tokenizer +```shell +python3 -m pip install fast-tokenizer-python +``` + + +## 模型获取和转换 + +使用Triton做服务化部署时,选择ONNX Runtime后端运行需要先将模型转换成ONNX格式。 + + +首先将保存的动态图参数导出成静态图参数,具体代码见[静态图导出脚本](../../export_model.py)。静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python ../../export_model.py --params_path=../../checkpoint/model_state.pdparams --output_path=./wos_infer_model +``` + +使用Paddle2ONNX将Paddle静态图模型转换为ONNX模型格式的命令如下,以下命令成功运行后,将会在当前目录下生成model.onnx模型文件。 + +```shell +paddle2onnx --model_dir infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True +``` + +创建空白目录/seqcls/1和seqcls_model/1,并将将转换好的ONNX模型移动到模型仓库目录 + +```shell +mkdir /models/seqcls/1 +mkdir /models/seqcls_model/1 +mv model.onnx /models/seqcls_model/1 +``` + +Paddle2ONNX的命令行参数说明请查阅:[Paddle2ONNX命令行参数说明](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) + +模型下载转换好之后,models目录结构如下: +``` +models +├── seqcls +│   ├── 1 +│   └── config.pbtxt +├── seqcls_model +│   ├── 1 +│   │   └── model.onnx +│   └── config.pbtxt +├── seqcls_postprocess +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── tokenizer + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +模型配置文件config.pbtxt配置细节请参见[Triton Server Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) + +## 部署模型 + +triton目录包含启动pipeline服务的配置和发送预测请求的代码,包括: + +``` +models # Triton启动需要的模型仓库,包含模型和服务配置文件 +seqcls_grpc_client.py # 层次分类任务发送pipeline预测请求的脚本 +``` + +### 启动服务端 + +在容器内执行下面命令启动服务,默认启动models下所有模型: +```shell +tritonserver --model-repository=/models +``` +也可以通过设定参数只启动单一任务服务: +```shell +tritonserver --model-repository=/models --model-control-mode=explicit --load-model=seqcls +``` +输出打印如下: + +``` +... +I0619 13:40:51.590901 5127 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime +I0619 13:40:51.590938 5127 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.590947 5127 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6 +I0619 13:40:51.623808 5127 openvino.cc:1193] TRITONBACKEND_Initialize: openvino +I0619 13:40:51.623862 5127 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.623868 5127 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6 +I0619 13:40:52.980990 5127 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d8000000' with size 268435456 +... +I0619 13:43:33.360018 5127 server.cc:592] ++--------------------+---------+--------+ +| Model | Version | Status | ++--------------------+---------+--------+ +| seqcls | 1 | READY | +| seqcls_model | 1 | READY | +| seqcls_postprocess | 1 | READY | +| tokenizer | 1 | READY | ++--------------------+---------+--------+ +... +I0619 13:43:33.365824 5127 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001 +I0619 13:43:33.366221 5127 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000 +I0619 13:43:33.409775 5127 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 +``` + +**NOTE:** + +启动服务时,Triton Server的每个python后端进程默认申请`64M`内存,默认启动的docker无法启动多个python后端节点。两个解决方案: + +1. 
启动容器时设置`shm-size`参数, 比如:`docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash` + +2. 启动服务时设置python后端的`shm-default-byte-size`参数, 设置python后端的默认内存为10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760` + +## 客户端请求 + +### 客户端环境准备 +客户端请求有两种方式,可以选择在本地执行脚本请求,或下载官方客户端镜像在容器中执行。 + +方式一:本地执行脚本,需要先安装依赖: +``` +pip install grpcio +pip install tritonclient==2.10.0 +``` + +方式二:拉取官网镜像并启动容器: +``` +docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk +docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash +``` + +### 启动客户端测试 +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +``` +python seqcls_grpc_client.py +``` diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82261157aefe68bac9a1865d888c0257d2e905e8 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls/config.pbtxt @@ -0,0 +1,75 @@ +name: "seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..0fb1417cba37d4b4497fb1c27aff3ab6e039bd1f --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -0,0 +1,36 @@ +platform: "onnxruntime_onnx" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 74 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] + +optimization { + graph: {level: -1} +} + +parameters { key: "intra_op_thread_count" value: { string_value: "0" } } +parameters { key: "execution_mode" value: { string_value: "0" } } +parameters { key: "inter_op_thread_count" value: { string_value: "0" } } diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/1/model.py 
b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5db7ef0c7746db295e9110817db6982704d6ac1b --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.txt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. 
The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = 1 / (1 + (np.exp((-data[0])))) + + probs = [] + labels = [] + for l, p in enumerate(data): + if p > 0.5: + labels.append(l) + probs.append(p) + + labels = np.array(labels, dtype=self.output_dtype[0]) + probs = np.array(probs, dtype=self.output_dtype[1]) + # print(labels, probs) + out_tensor1 = pb_utils.Tensor(self.output_names[0], labels) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..fbeda7129f9247823de9d5918af5ee435613e967 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 74 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..b4b0b0547ee5a33278e53bacb5d5649c1b3c562b --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py @@ -0,0 +1,105 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel(object): + """Your Python model must use the same class name. 
Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.pbtxt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. 
+ """ + print("Cleaning up...") diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d35d1f44968ba205b1890899a82568d33e90a999 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/seqcls_grpc_client.py b/applications/text_classification/hierarchical/deploy/triton_serving/seqcls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..caedf81b752ab16a1844e4b3e371e34a1d84126e --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/seqcls_grpc_client.py @@ -0,0 +1,108 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + texts = [["消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了"], ["卡车超载致使跨桥侧翻,没那么简单"], ["金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件"]] + + for text in texts: + # input format:[input1, input2 ... inputn], n = len(self._input_names) + result = runner.Run([text]) + print(result) diff --git a/applications/text_classification/hierarchical/export_model.py b/applications/text_classification/hierarchical/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..c57dc23372f9b934fbee6686092309cd5ef5b22a --- /dev/null +++ b/applications/text_classification/hierarchical/export_model.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
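# 补充说明:export_model.py 将 --params_path 下训练好的动态图权重导出为静态图参数,
# 保存为 <output_path>/float32.pdmodel 与 float32.pdiparams,供 ONNXRuntime / Triton / SimpleServing 等部署使用;
# 指定 --multilingual(如 ERNIE-M)时模型输入仅包含 input_ids,否则同时包含 input_ids 与 token_type_ids。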
+ +import argparse +import os + +import paddle +from paddlenlp.transformers import AutoModelForSequenceClassification + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +parser.add_argument("--params_path", type=str, default='./checkpoint/', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + model.eval() + if args.multilingual: + input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "float32") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/hierarchical/few-shot/README.md b/applications/text_classification/hierarchical/few-shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ff44c9d173a69b44457b7d0c07481fad01ab2788 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/README.md @@ -0,0 +1,375 @@ +# 小样本场景下的多标签层次分类任务指南 + +**零样本/小样本文本分类推荐使用 UTC 模型,详情见[目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification),本项目将会在2.5.2版本下线。** + +## 目录 + +- [1. 项目说明](#项目说明) +- [2. 效果展示](#效果展示) +- [3. 定制训练](#定制训练) + - [3.1 运行环境](#运行环境) + - [3.2 代码结构](#代码结构) + - [3.3 数据标注](#数据标注) + - [3.4 模型训练](#模型训练) + - [3.5 模型评估](#模型评估) + - [3.6 模型部署](#模型部署) +- [4. References](#References) + + +## 1. 项目说明 + +本项目提供了小样本场景下文本多标签层次分类的解决方案,在 ERNIE3.0 的基础上利用提示学习取得比微调更好的分类效果,充分利用标注信息。 + +**多标签层次分类任务** 指自然语言处理任务中,每个样本具有多个标签标记,并且标签集合中标签之间存在预定义的层次结构,多标签层次分类需要充分考虑标签集之间的层次结构关系来预测层次化预测结果。 +在现实场景中,大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在层次化结构,需要利用算法为文本自动标注更细粒度和更准确的标签。 +现有的主流解决方案是在预训练语言模型上进行微调,因为多标签分类任务与预训练阶段的掩码预测任务有着天然的差异,想要取得较好的分类效果往往需要大量数据标注。 + +**提示学习(Prompt Learning)** 的主要思想是将二/多分类任务转换为掩码预测任务,充分利用预训练语言模型学习到的特征,从而降低样本需求。以情感分类任务为例,标签分为`1-正向`,`0-负向`两类,如下图所示,通过提示`我[MASK]喜欢。`,原有`1-正向`,`0-负向`的标签被转化为了预测空格是`很`还是`不`。 + +
(此处原为提示学习示意图)
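下面用一小段 Python 伪代码示意这种"把分类改写成掩码预测"的思路(仅为概念演示,其中文本与映射词均为假设示例,并非 PaddleNLP 的实际接口):

```python
# 概念示意(假设示例,非 PaddleNLP 实际 API):将情感分类样本改写为掩码预测任务
text = "这家店的菜品很合口味"                  # 待分类文本(假设)
prompt = "我[MASK]喜欢。" + text               # 拼接提示模板与原文
verbalizer = {"很": "1-正向", "不": "0-负向"}  # 映射词 -> 原标签

# 预训练模型在 [MASK] 位置给出各候选字的概率,
# 取映射词中概率最高的字,再经 verbalizer 还原为分类标签
```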
+ +微调方法和提示方法的区别如图所示: + +【微调学习】需要学习的参数是以 `[CLS]` 向量为输入,以负向/正向为输出的随机初始化的分类器。 + +【提示学习】通过构造提示,将原有的分类任务转化为掩码预测,即掩盖原句中的某个字,用模型预测该字。此时的分类器不再是随机初始化,而是利用了待预测字的预训练向量来初始化,充分利用了预训练模型学习到的参数。 + +【方案选择】对于标注样本充足的场景可以直接使用[微调学习](../README.md)实现文本多分类,对于尚无标注或者标注样本较少的任务场景我们推荐使用提示学习,以取得更好的效果。 + +### 方案特点 + +- **标注成本低**:以往的微调方式需要大量的数据标注才能保证模型分类效果。提示学习可以降低数据标注依赖,在小样本(few-shot)的场景下取得比微调更好的分类效果。 +- **全流程打通**:提供了从训练到部署的完整解决方案,可以低成本迁移至实际应用场景。 + + + +## 2.效果展示 + +本项目中使用了 ERNIE3.0 模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们测评了 Base 模型在事件类型分类任务上的表现。测试配置如下: + +1. 数据集:2020语言与智能技术竞赛:[事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)小样本数据集测试集。 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.4rc + +4. PaddleNLP 版本:2.4.3 + +5. 评估设置 + +- 每个 epoch 评估一次,按照验证集上的评价指标,取分数最高的模型参数用于测试集的评估。表格中的最终结果为重复 10 次的均值。 +- 为了避免过拟合,这里使用了早停机制 (Early-stopping)。因为微调方式收敛较慢,且波动较大,我们将微调方式的早停步数增加为 20 步,但仍有一半结果未收敛,表格中的微调结果为 5 次的均值。 +- 测试脚本如下 + - 微调 + + ``` + cd ../ + python train.py --dataset_dir "./data/" --save_dir "./checkpoints" --max_seq_length 128 --model_name "ernie-3.0-base-zh" --batch_size 8 --learning_rate 3e-5 --epochs 100 --logging_steps 5 --early_stop --early_stop_num 20 + ``` + + - 提示学习 + + ``` + python train.py --data_dir ./data/ --output_dir ./checkpoints/ --prompt "这句话描述的事件是" --model_name_or_path ernie-3.0-base-zh --max_seq_length 128 --learning_rate 3e-5 --ppt_learning_rate 3e-4 --do_train --do_eval --num_train_epochs 100 --logging_steps 5 --per_device_eval_batch_size 32 --per_device_train_batch_size 8 --do_predict --metric_for_best_model macro_f1_score --load_best_model_at_end --evaluation_strategy epoch --save_strategy epoch + ``` + +6. 精度评价指标:Micro F1分数、Macro F1分数 + + | model_name | 训练方式 | Micro F1分数 | Macro F1分数 | + | ---------- | ------- | ----------- | ----------- | + | ernie-3.0-base-zh | 微调学习 | 0.7172 | 0.3821 | + | ernie-3.0-base-zh | 提示学习 | 0.8945 | 0.8516 | + + + +## 3.定制训练 + +下边通过事件抽取任务的例子展示如何使用小样本学习来进行文本分类。 + + +### 3.1 运行环境 + +- python >= 3.7 +- paddlepaddle >= 2.4rc +- paddlenlp >= 2.4.3 +- paddle2onnx >= 1.0.3 + + +### 3.2 代码结构 + +```text +. +├── train.py # 模型组网训练脚本 +├── utils.py # 数据处理工具 +├── infer.py # 模型部署脚本 +└── README.md +``` + + +### 3.3 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano)进行自定义数据标注,本项目也打通了从标注到训练的通道,即doccano导出数据后可通过[doccano.py](../../doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano数据标注指南](../../doccano.md)。 + +**示例数据** + +这里我们使用2020语言与智能技术竞赛:事件抽取任务数据集的子集作为示例数据集。该数据集中原始训练集包括 11958 条标注样本,我们按每条标签随机采样 2 条样本,得到 148 条样本数据作为训练集,剩余训练集数据作为测试集。可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/few-shot/events.tar.gz)下载解压并放入`./data/`文件夹,或者运行以下脚本 + +``` +wget https://paddlenlp.bj.bcebos.com/datasets/few-shot/events.tar.gz +tar zxvf events.tar.gz +mv events data +``` + +**数据格式** + +下边主要介绍多标签分类任务自定义数据集的格式要求,整体目录如下 + +```text +data/ +├── train.txt # 训练数据集 +├── dev.txt # 验证数据集 +├── test.txt # 测试数据集(可选) +├── data.txt # 待预测数据(可选) +└── label.txt # 分类标签集 +``` + +**训练/验证/测试数据** + +对于训练/验证/测试数据集文件,每行数据表示一条样本,包括文本和标签两部分,由tab符`\t`分隔,多个标签以英文逗号`,`分隔,同一标签内不同层级以`##`字符连接。格式如下 +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` +例如, +``` +紫光圣果副总经理李明雷辞职 组织关系,组织关系##辞/离职 +无理取闹辱骂扶贫干部织金一居民被行拘 司法行为,司法行为##拘捕 +... +``` + +**预测数据** + +对于待预测数据文件,每行包含一条待预测样本,无标签。格式如下 +```text +<文本> +<文本> +... 
+``` +例如, +``` +没白等!大众PoloPlus明日上市,配1.5L全铝发动机 +国家统计局17日发布消息称,国务院第四次全国经济普查领导小组办公室和国家统计局近期对四川省德阳市下辖广汉市第四次全国经济普查违法举报线索进行了立案调查。 +... +``` + +**标签数据** + +对于分类标签集文件,存储了数据集中所有的标签路径集合,每行是一个标签路径,高层的标签指向底层标签,不同层级的标签用'##'连接,本项目选择为标签层次结构中的每一个节点生成对应的标签路径,详见[层次分类任务介绍](../README.md#层次分类任务介绍),标签路径格式如下 + +```text +<一级标签> +<一级标签>'##'<二级标签> +<一级标签>'##'<二级标签>'##'<三级标签> +... +``` +如果需要自定义标签映射用于分类器初始化,则每行需要包括标签名和相应的映射词,由`==`分隔。格式如下 +```text +<一级标签>'=='<映射词> +<一级标签>'##'<二级标签>'=='<映射词> +<一级标签>'##'<二级标签>'##'<三级标签>'=='<映射词> +... +``` +例如,原标签路径`交往##会见`中包括特殊符号`##`,大概率不会在说话或者写作中使用,因此我们将其映射为`会见`或者`见面`。 +``` +交往==交往 +交往##会见==会见 +... +``` + +**Note**: 这里的标签映射词定义遵循的规则是,不同映射词尽可能长度一致,映射词和提示需要尽可能构成通顺的语句。越接近自然语句,小样本下模型训练效果越好。如果原标签名已经可以构成通顺语句,也可以不构造映射词,每行一个标签即可。 + + +### 3.4 模型训练 + +**单卡训练** + +``` +python train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话描述的事件是" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--eval_steps 100 +``` +**多卡训练** + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus 0,1,2,3 train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话描述的事件是" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--eval_steps 100 +``` + +可配置参数说明: +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 训练数据集路径,数据格式要求详见[数据标注](#数据标注)。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `prompt`: 提示模板。定义了如何将文本和提示拼接结合。 +- `soft_encoder`: 提示向量的编码器,`lstm`表示双向LSTM, `mlp`表示双层线性层, None表示直接使用提示向量。默认为`lstm`。 +- `use_rdrop`: 使用 [R-Drop](https://arxiv.org/abs/2106.14448) 策略。 +- `use_rgl`: 使用 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略。 +- `encoder_hidden_size`: 提示向量的维度。若为None,则使用预训练模型字向量维度。默认为200。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `ppt_learning_rate`: 提示相关参数的基础学习率大小,当预训练参数不固定时,与其共用learning rate scheduler。一般设为`learning_rate`的十倍。 +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `do_predict`: 是否进行预测。 +- `do_export`: 是否在运行结束时将模型导出为静态图,保存路径为`output_dir/export`。 +- `num_train_epochs`: 训练的最大轮数。 +- `max_steps`: 训练的最大步数。此设置将会覆盖`num_train_epochs`。 +- `save_total_limit`: 模型检查点保存数量。 +- `device`: 使用的设备,默认为`gpu`。 +- `eval_steps`: 评估模型的间隔步数。 +- `logging_steps`: 打印日志的间隔步数。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `load_best_model_at_end`: 是否在模型训练结束后加载评估指标最优的模型参数。 +- `evaluation_strategy`: 模型评估的间隔策略。若为`epoch`,则每轮训练结束后评估模型。 +- `save_strategy`: 模型保存的间隔策略。若为`epoch`,则每轮训练结束后保存当前模型参数。 + +更多参数介绍可参考[配置文件](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + + + +### 3.5 模型评估 + +在模型训练时开启`--do_predict`,训练结束后直接在测试集上`test.txt`进行评估,也可以在训练结束后,通过运行以下命令加载模型参数进行评估: +``` +python train.py --do_predict --data_dir ./data --output_dir ./predict_checkpoint --resume_from_checkpoint 
./checkpoints/ --max_seq_length 128 +``` + +可配置参数说明: + +- `data_dir`: 测试数据路径。测试数据应存放在该目录下`test.txt`文件中,每行一条待预测文本。 +- `output_dir`: 日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_predict`: 是否进行测试集评估。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 + + +### 3.6 模型部署 + +#### 模型导出 + +在训练结束后,需要将动态图模型导出为静态图参数用于部署推理。可以在模型训练时开启`--do_export`在训练结束后直接导出,也可以运行以下命令加载并导出训练后的模型参数,默认导出到在`output_dir`指定的目录下。 +``` +python train.py --do_export --data_dir ./data --output_dir ./export_checkpoint --resume_from_checkpoint ./checkpoints/ +``` + +可配置参数说明: + +- `data_dir`: 标签数据路径。 +- `output_dir`: 静态图模型参数和日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_export`: 是否将模型导出为静态图,保存路径为`output_dir/export`。 +- `export_type`: 模型导出的格式,默认为`paddle`,即导出静态图。 + +#### ONNXRuntime部署 + +**运行环境** + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +- 如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime-gpu onnx onnxconverter-common +``` + +- 如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime +``` + +**CPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device cpu +``` + +**GPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device gpu --device_id 0 +``` + +可配置参数说明: + +- `model_path_prefix`: 导出的静态图模型路径及文件前缀。 +- `model_name`: 内置预训练模型名,用于加载tokenizer。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 待推理数据所在路径,数据应存放在该目录下的`data.txt`文件。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `batch_size`: 每次预测的样本数量。 +- `device`: 选择推理设备,包括`cpu`和`gpu`。默认为`gpu`。 +- `device_id`: 指定GPU设备ID。 +- `use_fp16`: 是否使用半精度加速推理。仅在GPU设备上有效。 +- `num_threads`: 设置CPU使用的线程数。默认为机器上的物理内核数。 + +**Note**: 在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。 + + +## 4. References + +- Liu, Xiao, et al. "GPT understands, too." arXiv preprint arXiv:2103.10385 (2021). [[PDF]](https://arxiv.org/abs/2103.10385) +- Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May. "Warp: Word-level adversarial reprogramming." arXiv preprint arXiv:2101.00121 (2021). [[PDF]](https://arxiv.org/abs/2101.00121) +- Ding, Ning, et al. "Openprompt: An open-source framework for prompt-learning." arXiv preprint arXiv:2111.01998 (2021). [[PDF]](https://arxiv.org/abs/2111.01998) diff --git a/applications/text_classification/hierarchical/few-shot/infer.py b/applications/text_classification/hierarchical/few-shot/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..eeb30fc27c2d9d61fd060febad20628b59f2f8fa --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/infer.py @@ -0,0 +1,226 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + +import numpy as np +import onnxruntime as ort +import paddle2onnx +import psutil +import six + +from paddlenlp.prompt import AutoTemplate, PromptDataCollatorWithPadding +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument("--model_name", default="ernie-3.0-base-zh", type=str, help="The name of pretrained model.") +parser.add_argument("--data_dir", default=None, type=str, help="The path to the prediction data, including label.txt and data.txt.") +parser.add_argument("--max_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu.") +parser.add_argument("--device", choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") +args = parser.parse_args() +# yapf: enable + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10): + + if not isinstance(device, six.string_types): + logger.error( + ">>> [InferBackend] The type of device must be string, but the type you set is: ", type(device) + ) + exit(0) + if device not in ["cpu", "gpu"]: + logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to:", type(device)) + exit(0) + + logger.info(">>> [InferBackend] Creating Engine ...") + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb", encoding="utf-8") as f: + f.write(onnx_model) + + if device == "gpu": + logger.info(">>> [InferBackend] Use GPU to inference ...") + providers = ["CUDAExecutionProvider"] + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + providers = ["CPUExecutionProvider"] + if use_fp16: + logger.warning( + ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..." 
+ ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +class HierachicalPredictor(object): + def __init__(self, args): + self.args = args + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name) + self.model = AutoModelForMaskedLM.from_pretrained(args.model_name) + self.template, self.labels, self.input_handles = self.post_init() + self.collate_fn = PromptDataCollatorWithPadding( + self.tokenizer, padding=True, return_tensors="np", return_attention_mask=True + ) + + self.inference_backend = InferBackend( + self.args.model_path_prefix, + self.args.device, + self.args.device_id, + self.args.use_fp16, + self.args.num_threads, + ) + + def post_init(self): + export_path = os.path.dirname(self.args.model_path_prefix) + template_path = os.path.join(export_path, "template_config.json") + with open(template_path, "r", encoding="utf-8") as fp: + prompt = json.load(fp) + template = AutoTemplate.create_from(prompt, self.tokenizer, self.args.max_length, self.model) + keywords = template.extract_template_keywords(template.prompt) + inputs = ["input_ids", "token_type_ids", "position_ids", "attention_mask", "masked_positions"] + if "soft" in keywords: + inputs.append("soft_token_ids") + if "encoder" in keywords: + inputs.append("encoder_ids") + verbalizer_path = os.path.join(export_path, "verbalizer_config.json") + with open(verbalizer_path, "r", encoding="utf-8") as fp: + label_words = json.load(fp) + labels = sorted(list(label_words.keys())) + + return template, labels, inputs + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, inputs): + num_sample = len(inputs) + infer_data = None + num_infer_data = None + for index in range(0, num_sample, self.args.batch_size): + left, right = index, index + self.args.batch_size + batch_dict = self.collate_fn(inputs[left:right]) + input_dict = {} + for key in self.input_handles: + value = batch_dict[key] + if key == "attention_mask": + if value.ndim == 2: + value = (1 - value[:, np.newaxis, np.newaxis, :]) * -1e4 + elif value.ndim != 4: + raise ValueError("Expect attention mask with ndim=2 or 4, but get ndim={}".format(value.ndim)) + value = value.astype("float32") + else: + value = value.astype("int64") + input_dict[key] = value + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) + for i in range(num_infer_data): + 
infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def preprocess(self, input_data: list): + text = [{"text_a": x} for x in input_data] + inputs = [self.template(x) for x in text] + return inputs + + @staticmethod + def sigmoid(z): + return 1 / (1 + np.exp(-z)) + + def postprocess(self, infer_data): + threshold = 0.5 + probs = self.sigmoid(infer_data[0]) + label_ids = np.argwhere(probs > threshold) + labels = [[] for _ in range(probs.shape[0])] + for idx, label_id in label_ids: + labels[idx].append(self.labels[label_id]) + return {"label": labels} + + def printer(self, result, input_data): + label = result["label"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}".format(", ".join(label[i]))) + logger.info("-----------------------------") + + +if __name__ == "__main__": + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + predictor = HierachicalPredictor(args) + + text_dir = os.path.join(args.data_dir, "data.txt") + with open(text_dir, "r", encoding="utf-8") as f: + text_list = [x.strip() for x in f.readlines()] + + predictor.predict(text_list) diff --git a/applications/text_classification/hierarchical/few-shot/metric.py b/applications/text_classification/hierarchical/few-shot/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..e46d46ed093bce37012d524f3fc0483b7118119a --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from paddle.metric import Metric +from sklearn.metrics import classification_report, f1_score + +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for hierarchical text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/hierarchical/few-shot/requirements_cpu.txt b/applications/text_classification/hierarchical/few-shot/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbe76e363f00631d66e0733833813cad5991f009 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/requirements_cpu.txt @@ -0,0 +1,5 @@ +psutil +paddlepaddle>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime diff --git a/applications/text_classification/hierarchical/few-shot/requirements_gpu.txt b/applications/text_classification/hierarchical/few-shot/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..66454bd8b6b5fe08521215d4a5c2e7242225d869 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/requirements_gpu.txt @@ -0,0 +1,7 @@ +psutil +paddlepaddle-gpu>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime-gpu +onnx +onnxconverter-common diff --git a/applications/text_classification/hierarchical/few-shot/train.py b/applications/text_classification/hierarchical/few-shot/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ce0d574f8dc85bcfa2909ca591224e1a7f29be55 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/train.py @@ -0,0 +1,133 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
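# 补充说明:train.py 为提示学习训练入口,整体流程为:
# 解析 Model/Data/PromptTuning 三组参数 -> 由 --prompt 构造 AutoTemplate、
# 由 label.txt(支持 `标签==映射词` 写法)构造 SoftVerbalizer ->
# 使用 PromptTrainer 完成训练、评估、测试集预测与静态图导出(--do_export)。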
+ +import os +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +import paddle.nn.functional as F +from metric import MetricReport +from utils import load_local_dataset + +from paddlenlp.prompt import ( + AutoTemplate, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftVerbalizer, +) +from paddlenlp.trainer import EarlyStoppingCallback, PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + data_dir: str = field(default="./data", metadata={"help": "The dataset dictionary includes train.txt, dev.txt, test.txt, label.txt and data.txt (optional) files."}) + prompt: str = field(default=None, metadata={"help": "The input prompt for tuning."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-base-zh", metadata={"help": "The build-in pretrained model or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Load the pretrained language model. + model = AutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Define the template for preprocess and the verbalizer for postprocess. + template = AutoTemplate.create_from(data_args.prompt, tokenizer, training_args.max_seq_length, model=model) + logger.info("Using template: {}".format(template.prompt)) + + label_file = os.path.join(data_args.data_dir, "label.txt") + with open(label_file, "r", encoding="utf-8") as fp: + label_words = defaultdict(list) + for line in fp: + data = line.strip().split("==") + word = data[1] if len(data) > 1 else data[0].split("##")[-1] + label_words[data[0]].append(word) + verbalizer = SoftVerbalizer(label_words, tokenizer, model) + + # Load the few-shot datasets. + train_ds, dev_ds, test_ds = load_local_dataset( + data_path=data_args.data_dir, splits=["train", "dev", "test"], label_list=verbalizer.labels_to_ids + ) + + # Define the criterion. + criterion = paddle.nn.BCEWithLogitsLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds): + metric = MetricReport() + preds = F.sigmoid(paddle.to_tensor(eval_preds.predictions)) + metric.update(preds, paddle.to_tensor(eval_preds.label_ids)) + micro_f1_score, macro_f1_score = metric.accumulate() + return {"micro_f1_score": micro_f1_score, "macro_f1_score": macro_f1_score} + + # Deine the early-stopping callback. + callbacks = [EarlyStoppingCallback(early_stopping_patience=4, early_stopping_threshold=0.0)] + + # Initialize the trainer. 
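    # (补充注释)多标签场景下 criterion 使用 BCEWithLogitsLoss,callbacks 中早停 patience 为 4;
    # compute_metrics 返回 micro/macro F1,与 --metric_for_best_model macro_f1_score 配合选出最优模型。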
+ trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + compute_metrics=compute_metrics, + ) + + # Training. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Prediction. + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Export static model. + if training_args.do_export: + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/hierarchical/few-shot/utils.py b/applications/text_classification/hierarchical/few-shot/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2e1dc6f44756ce215cb6b2b638a3d865ac436c91 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/utils.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from paddlenlp.datasets import load_dataset + + +def load_local_dataset(data_path, splits, label_list): + """ + Load dataset for hierachical classification from files, where + there is one example per line. Text and label are separated + by '\t', and multiple labels are delimited by ','. + + Args: + data_path (str): + Path to the dataset directory, including label.txt, train.txt, + dev.txt (and data.txt). + splits (list): + Which file(s) to load, such as ['train', 'dev', 'test']. + label_list (dict): + The dictionary that maps labels to indeces. + """ + + def _reader(data_file, label_list): + with open(data_file, "r", encoding="utf-8") as fp: + for idx, line in enumerate(fp): + data = line.strip().split("\t") + if len(data) == 1: + yield {"text_a": data[0]} + else: + text, label = data + label = label.strip().split(",") + label = [float(1) if x in label else float(0) for x in label_list] + yield {"text_a": text, "labels": label} + + split_map = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"} + datasets = [] + for split in splits: + data_file = os.path.join(data_path, split_map[split]) + datasets.append(load_dataset(_reader, data_file=data_file, label_list=label_list, lazy=False)) + return datasets diff --git a/applications/text_classification/hierarchical/metric.py b/applications/text_classification/hierarchical/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..608e23f2cf733fb813c309a5b5516fd0e58a7a0f --- /dev/null +++ b/applications/text_classification/hierarchical/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from sklearn.metrics import f1_score, classification_report + +from paddle.metric import Metric +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for hierarchical text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/hierarchical/predict.py b/applications/text_classification/hierarchical/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..8b0c4388dd219ba52c888078ee65459a25956569 --- /dev/null +++ b/applications/text_classification/hierarchical/predict.py @@ -0,0 +1,107 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
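+
+# Predicts hierarchical multi-label categories for the unlabeled texts in
+# data.txt with a fine-tuned AutoModelForSequenceClassification checkpoint:
+# probabilities above 0.5 are kept and every level of the predicted label path
+# is logged. Example invocation (paths are illustrative; arguments are defined
+# in the argparse block below):
+#     python predict.py --dataset_dir ./data --params_path ./checkpoint/ \
+#         --max_seq_length 128 --batch_size 32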
+ +import argparse +import functools +import os + +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from utils import preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include data.txt and label.txt") +parser.add_argument("--params_path", default="./checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--data_file", type=str, default="data.txt", help="Unlabeled data file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def predict(): + """ + Predicts the data labels. + """ + paddle.set_device(args.device) + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + + label_list = [] + label_path = os.path.join(args.dataset_dir, args.label_file) + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + label_list.append(line.strip()) + + data_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.data_file), is_test=True, lazy=False + ) + + trans_func = functools.partial( + preprocess_function, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + label_nums=len(label_list), + is_test=True, + ) + + data_ds = data_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + data_batch_sampler = BatchSampler(data_ds, batch_size=args.batch_size, shuffle=False) + + data_data_loader = DataLoader(dataset=data_ds, batch_sampler=data_batch_sampler, collate_fn=collate_fn) + + results = [] + model.eval() + for batch in data_data_loader: + logits = model(**batch) + probs = F.sigmoid(logits).numpy() + for prob in probs: + labels = [] + for i, p in enumerate(prob): + if p > 0.5: + labels.append(label_list[i]) + results.append(labels) + + for t, labels in zip(data_ds.data, results): + hierarchical_labels = {} + logger.info("text: {}".format(t["sentence"])) + logger.info("prediction result: {}".format(",".join(labels))) + for label in labels: + for i, l in enumerate(label.split("##")): + if i not in hierarchical_labels: + hierarchical_labels[i] = [] + if l not in hierarchical_labels[i]: + hierarchical_labels[i].append(l) + for d in range(len(hierarchical_labels)): + logger.info("level {} : {}".format(d + 1, ",".join(hierarchical_labels[d]))) + logger.info("--------------------") + return + + +if __name__ == "__main__": + + predict() diff --git a/applications/text_classification/hierarchical/prune.py b/applications/text_classification/hierarchical/prune.py new file mode 100644 index 
0000000000000000000000000000000000000000..e4b9dff397e3f12165af1ebece97d7429dd1df04 --- /dev/null +++ b/applications/text_classification/hierarchical/prune.py @@ -0,0 +1,123 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import functools + +import paddle +import paddle.nn.functional as F +from paddleslim.nas.ofa import OFA +from paddlenlp.utils.log import logger +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import PdArgumentParser, Trainer, CompressionArguments +from paddlenlp.transformers import AutoTokenizer, AutoModelForSequenceClassification +from dataclasses import dataclass, field + +from utils import preprocess_function, read_local_dataset +from metric import MetricReport + + +# yapf: disable +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset_dir: str = field(default=None, metadata={"help": "Local dataset directory should include train.txt, dev.txt and label.txt."}) + max_seq_length: int = field(default=128, metadata={"help": "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + params_dir: str = field(default='./checkpoint/', metadata={"help": "The output directory where the model checkpoints are written."}) +# yapf: enable + + +@paddle.no_grad() +def custom_evaluate(self, model, data_loader): + metric = MetricReport() + model.eval() + metric.reset() + for batch in data_loader: + logits = model(batch["input_ids"], batch["token_type_ids"], attention_mask=[None, None]) + # Supports paddleslim.nas.ofa.OFA model and nn.layer model. 
+ if isinstance(model, OFA): + logits = logits[0] + probs = F.sigmoid(logits) + metric.update(probs, batch["labels"]) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info("micro f1 score: %.5f, macro f1 score: %.5f" % (micro_f1_score, macro_f1_score)) + model.train() + return macro_f1_score + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + paddle.set_device(compression_args.device) + compression_args.strategy = "dynabert" + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + label_list = {} + label_path = os.path.join(data_args.dataset_dir, "label.txt") + train_path = os.path.join(data_args.dataset_dir, "train.txt") + dev_path = os.path.join(data_args.dataset_dir, "dev.txt") + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, label_list=label_list, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + + model = AutoModelForSequenceClassification.from_pretrained(model_args.params_dir) + tokenizer = AutoTokenizer.from_pretrained(model_args.params_dir) + + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=data_args.max_seq_length, label_nums=len(label_list) + ) + train_dataset = train_ds.map(trans_func) + dev_dataset = dev_ds.map(trans_func) + + # Define data collector, criterion + data_collator = DataCollatorWithPadding(tokenizer) + criterion = paddle.nn.BCEWithLogitsLoss() + + trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=dev_dataset, + criterion=criterion, + ) # Strategy`dynabert` needs arguments `criterion` + + compression_args.print_config() + + trainer.compress(custom_evaluate=custom_evaluate) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/hierarchical/retrieval_based/README.md b/applications/text_classification/hierarchical/retrieval_based/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8dbbfc0a40f317db3e18645e6ad4124ebc384c09 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/README.md @@ -0,0 +1,466 @@ +# 基于检索的文本分类方法 + + **目录** + +* [1. 基于语义索引的分类任务介绍](#基于语义索引的分类任务介绍) +* [2. 代码结构说明](#代码结构说明) +* [3. 环境准备](#环境准备) +* [4. 数据准备](#数据准备) +* [5. 模型训练](#模型训练) +* [6. 模型预测](#模型预测) +* [7. 模型部署](#模型部署) +* [8. 
分类流程](#分类流程) + + + +# 1.基于语义索引的分类任务介绍 + +以前的分类任务中,标签信息作为无实际意义,独立存在的one-hot编码形式存在,这种做法会潜在的丢失标签的语义信息,本方案把文本分类任务中的标签信息转换成含有语义信息的语义向量,将文本分类任务转换成向量检索和匹配的任务。这样做的好处是对于一些类别标签不是很固定的场景,或者需要经常有一些新增类别的需求的情况非常合适。另外,对于一些新的相关的分类任务,这种方法也不需要模型重新学习或者设计一种新的模型结构来适应新的任务。总的来说,这种基于检索的文本分类方法能够有很好的拓展性,能够利用标签里面包含的语义信息,不需要重新进行学习。这种方法可以应用到相似标签推荐,文本标签标注,金融风险事件分类,政务信访分类等领域。 + +本方案是基于语义索引模型的分类,语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。基于语义索引的分类方法有两种,第一种方法是直接把标签变成召回库,即把输入文本和标签的文本进行匹配,第二种是利用召回的文本带有类别标签,把召回文本的类别标签作为给定输入文本的类别。本方案使用双塔模型,训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,并把标签作为召回库,进行召回测试。最后利用召回的结果使用 Accuracy 指标来评估语义索引模型的分类的效果。 + + +**效果评估** + +| 模型 | Accuracy | 策略简要说明| +| ------------ | ------------ |--------- | +| ernie-3.0-medium-zh | 50.580 | ernie-3.0-medium-zh多分类,5个epoch,对于新增类别需要重新训练| +| In-batch Negatives + RocketQA | 49.755 | Inbatch-negative有监督训练,标签当作召回集,对新增类别不需要重新训练| +| In-batch Negatives + RocketQA + 投票| **51.756** | Inbatch-negative有监督训练,训练集当作召回集,对新增类别,需要至少一条的数据放入召回库中| + + + +## 2. 代码结构说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train.py # In-batch Negatives 策略的训练主脚本 +|—— model.py # In-batch Negatives 策略核心网络结构 + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 + |—— run.sh # 构建Milvus向量的 bash 版本 +|—— utils + ├── config.py # Milvus 的配置文件 + ├── feature_extract.py # 向量抽取文件 + ├── milvus_util.py # Milvus 的配置文件 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 + +``` + + + +## 3. 环境准备 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.3.1 +* paddlenlp >= 2.3.4 +* hnswlib >= 0.5.2 +* visualdl >= 2.2.2 + +``` +pip install -r requirements.txt +``` + + + +## 4. 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)进行文本分类数据标注。 + +**指定格式本地数据集目录结构** + +``` +├── data # 数据集目录 + ├── label.txt # 标签集 + ├── dev.txt # 验证集 + ├── train.txt # 训练集 +``` + +**训练、开发、测试数据集** + +train.txt(训练数据集文件), dev.txt(开发数据集文件),test.txt(可选,测试数据集文件),文件中文本与标签类别名用tab符`'\t'`分隔开,层次标签之间用`'##'`号分隔开。训练集指用于训练模型的数据;开发集指用于评测模型表现的数据,可以根据模型在开发集上的精度调整训练参数和模型;测试集用于测试模型表现,没有测试集时可以使用开发集代替。 + +**注意文本中不能包含tab符`'\t'`**。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签>'##'<标签>'##'<标签> +<文本>'\t'<标签>'##'<标签> +... +... +``` + +- train.txt/dev.txt/test.txt 文件样例: +```text +请问深入骨髓地喜欢一个人怎么办我不能确定对方是不是喜欢我,我却想我不能确定对方是不是喜欢我,我却想分分秒秒跟他在一起,有谁能告诉我如何能想他少一点 烦恼##恋爱 +我登陆诛仙2时总说我账号密码错误,但是我打的是正确的,就算不对我? 游戏##完美游戏##诛仙 +斩魔仙者称号怎么得来的斩魔仙者称号怎么得来的 游戏##网络游戏 +有哪位好心人上传一份女衬衫的加拿大海关发票给我看一下塞多谢了多谢了 商业/理财##贸易 +... +``` +**分类标签** + +label.txt(层次分类标签文件)记录数据集中所有标签路径集合,层次标签之间用`'##'`连接即可,标签的行先后顺序对结果没有影响。 + +- label.txt 文件格式: + +```text +<一级标签1> +<一级标签1>'##'<二级标签1> +<一级标签1>'##'<二级标签1>'##'<三级标签1> +<一级标签1>'##'<二级标签2> +<一级标签2> +<一级标签2>'##'<二级标签3> +... +``` +- label.txt 文件样例: +```text +教育/科学 +教育/科学##院校信息 +教育/科学##外语学习##英语考试 +教育/科学##理工学科##生物学 +教育/科学##职业教育##会计资格考试 +... +``` + + + +## 5. 
模型训练 + +我们使用百科知识问答的数据来构建训练集,开发集。 + +**训练集(train.txt)** 和 **开发集(dev.txt)** 格式一致,训练集30k条,开发集10k条,每行由文本的标题,内容和类别标签组成,以tab符分割,第一列是问题的标题和问题的描述拼接,剩下的列问题的类别。 +**召回库(label.txt)** 召回库的构建有2种方式,第一种是把所有的类别标签当成召回库,第二种是把训练集当成召回集合,我们以第一种为例。 + +数据集选择的是百科问答数据集的一个子集,问答数据集详情请参考[nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus) + +- [baike_qa_category](https://paddlenlp.bj.bcebos.com/applications/baike_qa_category.zip) + +``` +wget https://paddlenlp.bj.bcebos.com/applications/baike_qa_category.zip +unzip baike_qa_category.zip +``` + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1 卡;如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。 + +如果使用CPU进行训练,则需要吧`--gpus`参数去除,然后吧`device`设置成cpu即可,详细请参考train.sh文件的训练设置 + +然后运行下面的命令使用GPU训练,得到语义索引模型: + +``` +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 +* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + +## 6. 模型预测 + +我们可以基于语义索引模型计算文本和标签的语义相似度。 + + +### 开始预测 + +加载训练的语义索引模型,然后计算文本和标签的语义相似度: + +``` +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/dev.txt" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` +predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本 + +产出如下结果 +``` +0.8841502070426941 +0.7834227681159973 +0.04591505229473114 +0.15116563439369202 +...... +``` + + + +## 7. 模型部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/inbatch/model_best/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改deploy/python/predict.py中的id2corpus和corpus_list的样本: + +``` +# 抽取向量 +id2corpus = { + 0: { + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?" 
+ } + } +# 计算文本和类别的相似度 +corpus_list = [{ + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", + 'label': '电脑/网络,硬件' + }, { + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", + 'label': '商业/理财,股票' + }] + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率: + +``` +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + .... + +[0.8100336194038391, -0.05148252472281456] +``` + +### 向量引擎 + +模型准备结束以后,开始搭建 Milvus 的向量检索引擎,用于文本语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成768维度的向量: + +``` +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +然后向搭建好的 Milvus 系统插入向量: + +``` +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy +``` +也可以直接运行: + +```bash +sh scripts/run.sh +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py +``` + +启动客户端调用 Server, 使用 POST的方式: + +向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["{\"sentence\": \"CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?\"}"]}' +``` + +也可以使用 rpc的方式: +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{ + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?" +}] +``` +然后运行: + +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1658988633.3673246 +PipelineClient::predict before time:1658988633.3678396 +time to cost :0.014188766479492188 seconds +['output_embedding'] +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + ...... +``` + +可以看到客户端发送了1条文本,返回这个 embedding 向量 + + + +## 8. 
分类流程 + +基于检索的分类系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = [{"sentence": "我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!"}] +``` +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1658988661.507715 +PipelineClient::predict before time:1658988661.5081818 +Extract feature time to cost :0.02322244644165039 seconds +Search milvus time cost is 0.06801486015319824 seconds +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,外语学习 0.17211778461933136 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,外语学习,英语翻译 0.5666656494140625 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,外语学习,法语 0.8530913591384888 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,出国/留学 1.1201119422912598 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 生活,购车养车,汽车养护 1.4068719148635864 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来文本和对应的标签,通过设定阈值等方式可以得到最终的标签。 + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/text_classification/hierarchical/retrieval_based/base_model.py b/applications/text_classification/hierarchical/retrieval_based/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..56aa3ba50e189281c35d41e8819014f56d8e53f4 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/base_model.py @@ -0,0 +1,153 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
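+
+# Base encoders for the retrieval-based classifier: SemanticIndexBase (dynamic
+# graph) and SemanticIndexBaseStatic (with a to_static pooled-embedding method)
+# wrap a pretrained transformer, optionally project the 768-d pooled output to
+# `output_emb_size` dimensions, L2-normalize it, and expose cosine similarity
+# between query and title embeddings. Minimal usage sketch (model name is
+# illustrative):
+#     pretrained = AutoModel.from_pretrained("rocketqa-zh-dureader-query-encoder")
+#     encoder = SemanticIndexBaseStatic(pretrained, output_emb_size=0)
+#     embedding = encoder.get_pooled_embedding(input_ids, token_type_ids)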
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + 
input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/text_classification/hierarchical/retrieval_based/data.py b/applications/text_classification/hierarchical/retrieval_based/data.py new file mode 100644 index 0000000000000000000000000000000000000000..9d440967a21822376a7c0a3c99e4b73933bf5cf8 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/data.py @@ -0,0 +1,224 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import hnswlib +import numpy as np +import paddle +from paddlenlp.utils.log import logger + + +def build_index(corpus_data_loader, model, output_emb_size, hnsw_max_elements, hnsw_ef, hnsw_m): + + index = hnswlib.Index(space="ip", dim=output_emb_size if output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + logger.info("start build index..........") + all_embeddings = [] + for text_embeddings in model.get_semantic_embedding(corpus_data_loader): + all_embeddings.append(text_embeddings.numpy()) + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + logger.info("Total index number:{}".format(index.get_current_count())) + return index + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_corpus_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. 
+ token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_label_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + yield {"sentence": data[0], "label": data[1].replace("##", ",")} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset 
ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip().replace("##", ",") + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + splited_line = line.rstrip().split("\t") + text, similar_text = splited_line[0], ",".join(splited_line[1:]) + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/config_nlp.yml b/applications/text_classification/hierarchical/retrieval_based/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..236c3802002e80075ab55ed14461b0bde9fd545c --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/deploy.sh b/applications/text_classification/hierarchical/retrieval_based/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/predict.py b/applications/text_classification/hierarchical/retrieval_based/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..d5f0c6203ec284b565eae187e4a8157404bc2ca9 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/predict.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_query_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. 
+ + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + encoded_inputs = tokenizer( + text=example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + truncation_strategy="longest_first", + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + examples = [] + for idx, text in data.items(): + print(text) + input_ids, segment_ids = convert_query_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids, title_ids, title_segment_ids = convert_example(text, tokenizer) + + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. 
+ predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + id2corpus = {0: {"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?"}} + res = predictor.extract_embedding(id2corpus, tokenizer) + print(res.shape) + print(res) + corpus_list = [ + {"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", "label": "电脑/网络,硬件"}, + {"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", "label": "商业/理财,股票"}, + ] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/rpc_client.py b/applications/text_classification/hierarchical/retrieval_based/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..c3d5e71e3d09c23a0e1d4f20c3f17c408ae9a4ba --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/rpc_client.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = [{"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/web_service.py b/applications/text_classification/hierarchical/retrieval_based/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..df054797d51ec195c6f23ad1c144aa4f6aed43d1 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
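+
+# Paddle Serving pipeline service: ErnieOp tokenizes the incoming sentences,
+# feeds input_ids/token_type_ids to the converted serving_server model
+# configured in config_nlp.yml (http port 8090 / rpc port 8080 by default) and
+# returns the pooled vector under the "output_embedding" key. Start the server
+# from this directory with:
+#     python web_service.py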
+ +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer( + text=text["sentence"], max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = eval(input_dict[str(i)]) + input_ids, segment_ids = convert_example([example], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/text_classification/hierarchical/retrieval_based/evaluate.py b/applications/text_classification/hierarchical/retrieval_based/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..c315b0d8b129b971f206c3404c79b712a2d57014 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
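+
+# Computes Recall@N (N = 1, 5, 10, 20, 50) for the recall results: every group
+# of --recall_num lines in --recall_result_file is compared against the
+# ground-truth label from --similar_text_pair, and the scores are printed and
+# appended to result.tsv. Example invocation (paths are illustrative):
+#     python evaluate.py --similar_text_pair data/dev.txt \
+#         --recall_result_file recall_result_dir/recall_result.txt --recall_num 50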
+ +import argparse +import time + +import numpy as np + +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default="", help="The full path of the similar text pair file") +parser.add_argument("--recall_result_file", type=str, default="", help="The full path of the recall result file") +parser.add_argument( + "--recall_num", type=int, default=10, help="The number of most similar docs recalled from the corpus for each query" +) +args = parser.parse_args() + + +def recall(rs, N=10): + """ + Ratio of recalled ground truth among the top-N recalled docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + 0.6666667 + >>> recall(rs, N=3) + 1.0 + Args: + rs: Iterable of relevance flag lists, one per query (1 if the doc recalled at that rank is the ground truth, else 0) + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().rsplit("\t", 1) + text2similar[text] = similar_text + + rs = [] + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text_arr = line.rstrip().split("\t") + text_title, text_para, recalled_title, recalled_para, label, cosine_sim = text_arr + if text2similar["\t".join([text_title, text_para])] == label: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20, 50] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + result = open("result.tsv", "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") diff --git a/applications/text_classification/hierarchical/retrieval_based/export_model.py b/applications/text_classification/hierarchical/retrieval_based/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..d212a633e6006bda1f252687779a009528ddacf8 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/export_model.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
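+ +# Exports a fine-tuned dynamic-graph checkpoint to a static-graph inference model under --output_path. +# Example invocation (mirrors scripts/export_model.sh): +# python export_model.py --params_path checkpoints/inbatch/model_best/model_state.pdparams --output_path=./output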
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--output_emb_size", default=0, type=int, help="output_embedding_size") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + # Load the pretrained model and tokenizer specified by --model_name_or_path + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save the static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/hierarchical/retrieval_based/export_to_serving.py b/applications/text_classification/hierarchical/retrieval_based/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. 
If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for fetch vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/text_classification/hierarchical/retrieval_based/model.py b/applications/text_classification/hierarchical/retrieval_based/model.py new file mode 100644 index 0000000000000000000000000000000000000000..fd87c6d8363efc4f54db6c6bd5d7b623ea68ab59 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/model.py @@ -0,0 +1,65 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
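+ +# SemanticIndexBatchNeg implements in-batch negatives training for the dual encoder: every other title +# in the batch serves as a negative for a query, the margin is subtracted from the diagonal (positive) +# similarities, the similarity matrix is scaled, and cross-entropy with the diagonal as the target is the loss.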
+ +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Scale cosine similarity to ease convergence + self.scale = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # Subtract the margin from the positive samples' cosine similarity (the diagonal) + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # Scale cosine similarity to ease training convergence + cosine_sim *= self.scale + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/text_classification/hierarchical/retrieval_based/predict.py b/applications/text_classification/hierarchical/retrieval_based/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..906fdf3519c3aff341c81bcb0bb47f4245b91daf --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/predict.py @@ -0,0 +1,100 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + model.eval() + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + cosine_sims = np.concatenate(cosine_sims, axis=0) + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) + if idx > 5: + break diff --git a/applications/text_classification/hierarchical/retrieval_based/recall.py b/applications/text_classification/hierarchical/retrieval_based/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..3c185bd8efaa0a20ea0289976e600a62e70f1dcc --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/recall.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from base_model import SemanticIndexBase +from data import convert_corpus_example, create_dataloader, gen_id2corpus, gen_text_file + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_corpus_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = 
len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/text_classification/hierarchical/retrieval_based/requirements.txt b/applications/text_classification/hierarchical/retrieval_based/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6657b02e7b0c9a430659394b3398f575cae4ea91 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/requirements.txt @@ -0,0 +1,11 @@ +pymilvus==1.1.2 +pandas==0.25.1 +paddlenlp>=2.3.4 +paddlepaddle-gpu>=2.3.0 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +paddle-serving-app>=0.7.0 +paddle-serving-client>=0.7.0 +paddle-serving-server-gpu>=0.7.0.post102 +pybind11 \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/run_system.py b/applications/text_classification/hierarchical/retrieval_based/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..464024d976702c8e57bd87eba9c6888dd78403d5 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/run_system.py @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import sys +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("utils") +from utils.milvus_util import RecallByMilvus # noqa: E402 + + +def search_in_milvus(text_embedding, corpus_file, query_text): + collection_name = "text" + partition_tag = "partition_2" + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "label", "distance"]) + df = df.sort_values(by="distance", ascending=True) + for index, row in df.iterrows(): + print(row["query_text"], row["label"], row["distance"]) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + corpus_file = "data/label.txt" + list_data = [{"sentence": "我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!"}] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, corpus_file, list_data[0]) diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/evaluate.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..900d7283f76e1af5390be84a448266c1719004f8 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/evaluate.sh @@ -0,0 +1,4 @@ +python -u evaluate.py \ + --similar_text_pair "data/dev.txt" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/export_model.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..bda31ba21ea81d3e431cf464314c89fdcfb04d09 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/export_model.sh @@ -0,0 +1 @@ +python export_model.py --params_path checkpoints/inbatch/model_best/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/export_to_serving.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..b0d7a422551fd09eb1a28cfacdf47237a8efc795 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git 
a/applications/text_classification/hierarchical/retrieval_based/scripts/predict.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..67c9eee02da0d8a8beb58791c74a34bd4b68fefb --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/predict.sh @@ -0,0 +1,22 @@ +# gpu version +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/dev.txt" + + +# cpu +# root_dir="checkpoints/inbatch/model_best" +# python -u -m paddle.distributed.launch \ +# predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_state.pdparams" \ +# --output_emb_size 0 \ +# --batch_size 128 \ +# --max_seq_length 384 \ +# --text_pair_file "data/train.txt" diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/run.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..8bcd987f155ed2b0fd4c2a13faa7eb599d8076ff --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/run.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" + +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/run_build_index.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..3fa6030207137bb58e25b97244c038e886e875b2 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/run_build_index.sh @@ -0,0 +1,16 @@ +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_best/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 0 \ + --max_seq_length 384 \ + --recall_num 50 \ + --similar_text_pair "data/dev.txt" \ + --corpus_file "data/train.txt" \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/train.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..ea88cfdd53a73078b5045c1c1c34f2dda1d321ea --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/train.sh @@ -0,0 +1,35 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU training +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True diff --git a/applications/text_classification/hierarchical/retrieval_based/train.py b/applications/text_classification/hierarchical/retrieval_based/train.py new file mode 100644 index 0000000000000000000000000000000000000000..a14114abe9c40d487dde2067c84b4e87af402f53 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/train.py @@ -0,0 +1,281 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + build_index, + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + read_text_pair, +) +from model import SemanticIndexBatchNeg + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, + help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=256, + type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=5E-5, type=float, + help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, + help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, + help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, + help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, + help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", + help="Select which device to train model, defaults to cpu.") +parser.add_argument('--save_steps', type=int, default=10000, + help="Interval steps to save checkpoint") +parser.add_argument('--log_steps', type=int, default=10, + help="Interval steps to print log") +parser.add_argument("--train_set_file", type=str, + default='./data/train.txt', + help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.2, type=float, + help="Margin between pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, + help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--corpus_file", type=str, default='./data/label.txt', + help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, + default='./data/dev.txt', + help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', + help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, + default='recall_result_init.txt', help="The file name of recall result") +parser.add_argument("--recall_num", default=50, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, + help="The max connections (M) parameter of the HNSW index.") +parser.add_argument("--hnsw_ef", default=100, type=int, + help="The ef parameter of the HNSW index.") +parser.add_argument("--hnsw_max_elements", default=1000000, + type=int, help="The max number of elements the HNSW index can hold.") +parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', + help="The file name of the evaluation result.") +parser.add_argument('--evaluate', default=True, type=eval, choices=[True, False], + help='whether to evaluate while training') +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def recall(rs, N=10): + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus): + # Get the inner model wrapped by DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + query_embedding = 
inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text_arr = line.rstrip().rsplit("\t") + text, similar_text = text_arr[0], text_arr[1].replace("##", ",") + text2similar[text] = similar_text + rs = [] + with open(recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text_arr = line.rstrip().rsplit("\t") + text, similar_text, cosine_sim = text_arr + if text2similar[text] == similar_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + evaluate_result_file = os.path.join(args.recall_result_dir, args.evaluate_result) + result = open(evaluate_result_file, "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + return float(recall_N[0]) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + pretrained_model = AutoModel.from_pretrained("rocketqa-zh-dureader-query-encoder") + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + batchify_fn_dev = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + if args.evaluate: + eval_func = 
partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=eval_func + ) + # convert_corpus_example + query_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + text_list, _ = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=query_func + ) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByNorm(clip_norm=1.0), + ) + global_step = 0 + best_recall = 0.0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if not args.evaluate and rank == 0: + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + if args.evaluate and rank == 0: + print("evaluating") + recall_5 = evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus) + if recall_5 > best_recall: + best_recall = recall_5 + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp: + fp.write("epoch=%d, global_step: %d, recall: %s\n" % (epoch, global_step, recall_5)) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/__init__.py 
b/applications/text_classification/hierarchical/retrieval_based/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/config.py b/applications/text_classification/hierarchical/retrieval_based/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..5c95294c79ccec9c81348052a9f9b96f3465e8c6 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import IndexType, MetricType + +MILVUS_HOST = "10.21.226.173" +MILVUS_PORT = 8530 + +output_emb_size = 0 + +collection_param = { + "dimension": output_emb_size if output_emb_size > 0 else 768, + "index_file_size": 256, + "metric_type": MetricType.L2, +} + +index_type = IndexType.FLAT +index_param = {"nlist": 1000} + +top_k = 20 +search_param = {"nprobe": 20} diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/feature_extract.py b/applications/text_classification/hierarchical/retrieval_based/utils/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..f9f67b19e1385961ff5a60876bdcf97c2d70f6d4 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/feature_extract.py @@ -0,0 +1,193 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
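+ +# Extracts embeddings for every line of --corpus_file with the exported static-graph model and saves +# them to <output_dir>/<data_name>_embedding.npy for later insertion into Milvus. +# Example invocation (mirrors scripts/run.sh): +# CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py --data_name label --model_dir ./output --output_dir data --corpus_file "./data/label.txt"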
+ +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--output_dir", type=str, required=True, help="The output path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--data_name", type=str, required=True, help="The dataset name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, data_name): + """ + Predicts the data labels. + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in tqdm(data.items()): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("./{}/{}_embedding".format(args.output_dir, data_name), all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, line in enumerate(file.readlines()): + id2corpus[idx] = line.rstrip().replace("##", ",") + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + data_name = args.data_name + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(model_name_or_path) + id2corpus = read_text(args.corpus_file) + predictor.predict(id2corpus, tokenizer, data_name) diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/milvus_util.py b/applications/text_classification/hierarchical/retrieval_based/utils/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b186c4fa480ab20b888c0cd1376624083da9b9 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/milvus_util.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from config import ( + MILVUS_HOST, + MILVUS_PORT, + collection_param, + index_param, + index_type, + search_param, + top_k, +) +from milvus import Milvus + +# Thin wrappers around the pymilvus 1.x client. Collection, index and search settings come from +# utils/config.py, and a running Milvus service reachable at MILVUS_HOST:MILVUS_PORT is assumed. + + +class VecToMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_collection error:", e) + + def create_collection(self, collection_name): + try: + collection_param["collection_name"] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def delete_partition(self, collection_name, partition_tag): + try: + # Drop only the given partition rather than the whole collection + status = self.client.drop_partition(collection_name, partition_tag) + return status + except Exception as e: + print("Milvus delete partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print("create partition {} successfully".format(partition_tag)) + return status + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.create_collection(collection_name) + self.create_index(collection_name) + print("collection info: {}".format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert( + collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag + ) + self.client.flush([collection_name]) + print( + "Insert {} entities, there are {} entities after insert data.".format( + len(ids), self.client.count_entities(collection_name)[1] + ) + ) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search( + collection_name=collection_name, + query_records=vectors, + top_k=top_k, + params=search_param, + partition_tag=partition_tag, + ) + return status, results + except Exception as e: + print("Milvus recall error: ", e) diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/vector_insert.py b/applications/text_classification/hierarchical/retrieval_based/utils/vector_insert.py new file mode 100644 index 0000000000000000000000000000000000000000..a58f330c3a0de4adbf496e9dfca7249f86233bcb --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/vector_insert.py @@ -0,0 +1,53 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +from milvus_util import VecToMilvus +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument("--vector_path", type=str, required=True, help="feature file path.") +args = parser.parse_args() + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + + collection_name = "text" + partition_tag = "partition_2" + if client.has_partition(collection_name, partition_tag): + client.delete_partition(collection_name, partition_tag) + data_size = len(embedding_ids) + batch_size = 50000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + vector_insert(args.vector_path) diff --git a/applications/text_classification/hierarchical/train.py b/applications/text_classification/hierarchical/train.py new file mode 100644 index 0000000000000000000000000000000000000000..5c8ee8872c29cfa44555485fbaf5c75d318f6130 --- /dev/null +++ b/applications/text_classification/hierarchical/train.py @@ -0,0 +1,200 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
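上面的 `milvus_util.py` 封装了 Milvus 的建库、写入(`VecToMilvus`)与召回(`RecallByMilvus`),`vector_insert.py` 则按批把语料向量写入名为 `text` 的 collection 的 `partition_2` 分区。下面给出一个检索结果解析的最小示意(假设在 utils 目录下执行;查询向量为随机占位,维度 256 仅为假设,实际应与 `config.py` 中 `collection_param` 的配置一致):

```python
# 示意:向量写入完成后,用 RecallByMilvus 做检索并解析命中结果(仅为占位示例)
import numpy as np

from milvus_util import RecallByMilvus

client = RecallByMilvus()
query_vectors = np.random.rand(2, 256).tolist()  # 维度需与 collection_param 一致,这里仅为假设
ret = client.search(vectors=query_vectors, collection_name="text", partition_tag="partition_2")
if ret is not None:
    status, results = ret
    for hits in results:
        for hit in hits:
            print(hit.id, hit.distance)  # 命中向量的 id 与距离
```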
+ +import argparse +import functools +import os +import random +import time + +import numpy as np +import paddle +from metric import MetricReport +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from utils import evaluate, preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include train.txt, dev.txt and label.txt") +parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') +parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') +parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument('--warmup', action='store_true', help="whether use warmup strategy") +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup steps over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """ + Sets random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def args_saving(): + argsDict = args.__dict__ + with open(os.path.join(args.save_dir, "setting.txt"), "w") as f: + f.writelines("------------------ start ------------------" + "\n") + for eachArg, value in argsDict.items(): + f.writelines(eachArg + " : " + str(value) + "\n") + f.writelines("------------------- end -------------------") + + 
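接下来的 `train()` 会从 `--dataset_dir` 读取 `train.txt`、`dev.txt` 与 `label.txt`,解析逻辑见同目录 `utils.py` 中的 `read_local_dataset`:每行为“文本`\t`标签”,多个标签以英文逗号分隔。下面用一条构造样本示意该格式(层次标签用 `##` 连接各级只是常见约定,具体标签体系请以实际数据集与 `label.txt` 为准):

```python
# 示意:hierarchical 任务单条训练样本的行格式(标签仅为假设的示例)
sample_line = "示例新闻文本\t体育,体育##足球"
*text_parts, label_field = sample_line.split("\t")
text = "".join(text_parts)        # read_local_dataset 同样会把前面的字段拼回文本
labels = label_field.split(",")   # ["体育", "体育##足球"],随后映射为 label.txt 中的行号
print(text, labels)
```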
+def train(): + """ + Training a hierarchical classification model + """ + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + args_saving() + set_seed(args.seed) + paddle.set_device(args.device) + + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # load and preprocess dataset + label_list = {} + with open(os.path.join(args.dataset_dir, args.label_file), "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + train_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.train_file), label_list=label_list, lazy=False + ) + dev_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.dev_file), label_list=label_list, lazy=False + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + if paddle.distributed.get_world_size() > 1: + train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + else: + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # define model + model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_classes=len(label_list)) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.BCEWithLogitsLoss() + metric = MetricReport() + + global_step = 0 + best_f1_score = 0 + early_stop_count = 0 + tic_train = time.time() + + for epoch in range(1, args.epochs + 1): + + if args.early_stop and early_stop_count >= args.early_stop_nums: + logger.info("Early stop!") + break + + for step, batch in enumerate(train_data_loader, start=1): + + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + + loss.backward() + optimizer.step() + if args.warmup: + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + early_stop_count += 1 + micro_f1_score, macro_f1_score = evaluate(model, criterion, metric, dev_data_loader) + + save_best_path = args.save_dir + if not os.path.exists(save_best_path): + os.makedirs(save_best_path) + + # save models + if macro_f1_score > best_f1_score: + early_stop_count = 0 + best_f1_score = macro_f1_score + model._layers.save_pretrained(save_best_path) + tokenizer.save_pretrained(save_best_path) + logger.info("Current best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Final best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Save best macro f1 text classification model in %s" % (args.save_dir)) + + +if __name__ == "__main__": + train() + print(args.train_file) diff --git a/applications/text_classification/hierarchical/utils.py b/applications/text_classification/hierarchical/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9915906c9a2ab9dd8680d95711a4336efd96ec6e --- /dev/null +++ b/applications/text_classification/hierarchical/utils.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn.functional as F +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evaluates model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. 
+ """ + + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + probs = F.sigmoid(logits) + losses.append(loss.numpy()) + metric.update(probs, labels) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info( + "eval loss: %.5f, micro f1 score: %.5f, macro f1 score: %.5f" + % (np.mean(losses), micro_f1_score, macro_f1_score) + ) + model.train() + metric.reset() + + return micro_f1_score, macro_f1_score + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + examples(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_length(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + label_nums(obj:`int`): The number of the labels. + Returns: + result(obj:`dict`): The preprocessed data including input_ids, token_type_ids, labels. + """ + result = tokenizer(text=examples["sentence"], max_seq_len=max_seq_length) + # One-Hot label + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list=None, is_test=False): + """ + Read dataset + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + if is_test: + items = line.strip().split("\t") + sentence = "".join(items) + yield {"sentence": sentence} + else: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} diff --git a/applications/text_classification/multi_class/README.md b/applications/text_classification/multi_class/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e91ff948a9dfbe4fb75b306a2980f63bbf1b4d7d --- /dev/null +++ b/applications/text_classification/multi_class/README.md @@ -0,0 +1,412 @@ +# 二分类/多分类任务指南 + +**目录** + +- [1. 二分类/多分类简介](#二分类/多分类简介) +- [2. 快速开始](#快速开始) + - [2.1 运行环境](#运行环境) + - [2.2 代码结构](#代码结构) + - [2.3 数据准备](#数据准备) + - [2.4 模型训练](#模型训练) + - [2.5 模型预测](#模型预测) + - [2.6 模型效果](#模型效果) + + + + +## 1. 二分类/多分类简介 + +本项目提供通用场景下**基于预训练模型微调的二分类/多分类端到端应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,有效缩短开发周期,降低AI开发落地门槛。 + +二分类/多分类数据集的标签集含有两个或两个以上的类别,所有输入句子/文本有且只有一个标签。在文本多分类场景中,我们需要预测**输入句子/文本最可能来自 `n` 个标签类别中的哪一个类别**。在本项目中二分类任务被视为多分类任务中标签集包含两个类别的情况,以下统一称为多分类任务。以下图为例,该新闻文本的最可能的标签为 `娱乐`。多分类任务在商品分类、网页标签、新闻分类、医疗文本分类等各种现实场景中具有广泛的适用性。 + +
+ +
+
+ +**方案亮点:** + +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + +**更多选择:** + +对于大多数多分类任务,我们推荐使用预训练模型微调作为首选的文本分类方案,多分类项目中还提供通用文本分类(UTC)和语义索引的两种方案满足不同开发者需求,更多技术细节请参见[文本分类技术特色介绍](../README.md)。 + +- 【零样本、小样本场景】 👉 [通用文本分类(UTC)方案](../../zero_shot_text_classification) + +- 【标签类别不固定、标签类别众多】 👉 [语义索引分类方案](./retrieval_based) + + + + +## 2. 快速开始 + +接下来我们将以CBLUE公开数据集KUAKE-QIC任务为示例,演示多分类全流程方案使用。下载数据集: + +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/KUAKE_QIC.tar.gz +tar -zxvf KUAKE_QIC.tar.gz +mv KUAKE_QIC data +rm -rf KUAKE_QIC.tar.gz +``` + +
+ image +
+
+ + 多分类数据标注-模型训练-模型分析-模型压缩-预测部署流程图 + +
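数据下载解压后,可以先确认 `data/train.txt` 为“文本`\t`标签”的单行格式(格式说明详见下文 2.3 数据准备)。下面是一个最小的查看示意,文件路径以实际解压结果为准:

```python
# 示意:查看 KUAKE-QIC 训练集前几行,确认“文本\t标签”格式
with open("data/train.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        text, label = line.rstrip("\n").split("\t")
        print(text, "->", label)
        if i >= 2:
            break
```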
+ + + +### 2.1 运行环境 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.5.1 +- scikit-learn >= 1.0.2 + +**安装PaddlePaddle:** + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +**安装PaddleNLP:** + +```shell +pip install --upgrade paddlenlp +``` + + +**安装sklearn:** +```shell +pip install scikit-learn +``` + + + +### 2.2 代码结构 + +```text +multi_class/ +├── few-shot # 小样本学习方案 +├── retrieval_based # 语义索引方案 +├── analysis # 分析模块 +├── deploy # 部署 +│ ├── simple_serving # SimpleServing服务化部署 +│   └── triton_serving # Triton服务化部署 +├── train.py # 训练、评估、裁剪脚本 +├── utils.py # 工具函数脚本 +└── README.md # 多分类使用说明 +``` + + + +### 2.3 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](../doccano.md)进行文本分类数据标注。指定格式本地数据集目录结构: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +└── label.txt # 分类标签文件 +``` + +**训练、开发、测试数据集** 文件中文本与标签类别名用tab符`'\t'`分隔开,文本中避免出现tab符`'\t'`。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签> +<文本>'\t'<标签> +... +``` + +- train.txt/dev.txt/test.txt 文件样例: +```text +25岁已经感觉脸部松弛了怎么办 治疗方案 +小孩的眉毛剪了会长吗? 其他 +172的身高还能长高吗? 其他 +冻疮用三金冻疮酊有效果么? 功效作用 +... +``` + +**分类标签** + +label.txt(分类标签文件)记录数据集中所有标签集合,每一行为一个标签名。 +- label.txt 文件格式: +```text +<标签> +<标签> +... +``` + +- label.txt 文件样例: +```text +病情诊断 +治疗方案 +病因分析 +指标解读 +就医建议 +... +``` + + + +### 2.4 模型训练 + +我们推荐使用 Trainer API 对模型进行微调。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +#### 2.4.1 预训练模型微调 + + +使用CPU/GPU训练,默认为GPU训练。使用CPU训练只需将设备参数配置改为`--device cpu`,可以使用`--device gpu:0`指定GPU卡号: +```shell +python train.py \ + --do_train \ + --do_eval \ + --do_export \ + --model_name_or_path ernie-3.0-tiny-medium-v2-zh \ + --output_dir checkpoint \ + --device gpu \ + --num_train_epochs 100 \ + --early_stopping True \ + --early_stopping_patience 5 \ + --learning_rate 3e-5 \ + --max_length 128 \ + --per_device_eval_batch_size 32 \ + --per_device_train_batch_size 32 \ + --metric_for_best_model accuracy \ + --load_best_model_at_end \ + --logging_steps 5 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --save_total_limit 1 +``` + +如果在GPU环境中使用,可以指定`gpus`参数进行多卡分布式训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus 0,1。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus 0,1 train.py \ + --do_train \ + --do_eval \ + --do_export \ + --model_name_or_path ernie-3.0-tiny-medium-v2-zh \ + --output_dir checkpoint \ + --device gpu \ + --num_train_epochs 100 \ + --early_stopping True \ + --early_stopping_patience 5 \ + --learning_rate 3e-5 \ + --max_length 128 \ + --per_device_eval_batch_size 32 \ + --per_device_train_batch_size 32 \ + --metric_for_best_model accuracy \ + --load_best_model_at_end \ + --logging_steps 5 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --save_total_limit 1 +``` + +主要的配置的参数为: +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `debug`: 与`do_eval`配合使用,是否开启debug模型,对每一个类别进行评估。 +- `do_export`: 训练结束后是否导出静态图。 +- `do_compress`: 训练结束后是否进行模型裁剪。 +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-tiny-medium-v2-zh`。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `device`: 使用的设备,默认为`gpu`。 +- `num_train_epochs`: 训练轮次,使用早停法时可以选择100。 +- `early_stopping`: 是否使用早停法,也即一定轮次后评估指标不再增长则停止训练。 +- `early_stopping_patience`: 在设定的早停训练轮次内,模型在开发集上表现不再上升,训练终止;默认为4。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate 
scheduler产生的值相乘作为当前学习率。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `train_path`: 训练集路径,默认为"./data/train.txt"。 +- `dev_path`: 开发集集路径,默认为"./data/dev.txt"。 +- `test_path`: 测试集路径,默认为"./data/dev.txt"。 +- `label_path`: 标签路径,默认为"./data/label.txt"。 +- `bad_case_path`: 错误样本保存路径,默认为"./data/bad_case.txt"。 +- `width_mult_list`:裁剪宽度(multi head)保留的比例列表,表示对self_attention中的 `q`、`k`、`v` 以及 `ffn` 权重宽度的保留比例,保留比例乘以宽度(multi haed数量)应为整数;默认是None。 +训练脚本支持所有`TrainingArguments`的参数,更多参数介绍可参考[TrainingArguments 参数介绍](https://paddlenlp.readthedocs.io/zh/latest/trainer.html#trainingarguments)。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存开发集上最佳模型在指定的 `output_dir` 中,保存模型文件结构如下所示: + +```text +checkpoint/ +├── export # 静态图模型 +├── config.json # 模型配置文件 +├── model_state.pdparams # 模型参数文件 +├── tokenizer_config.json # 分词器配置文件 +├── vocab.txt +└── special_tokens_map.json +``` + +**NOTE:** +* 中文训练任务(文本支持含部分英文)推荐使用"ernie-1.0-large-zh-cw"、"ernie-3.0-tiny-base-v2-zh"、"ernie-3.0-tiny-medium-v2-zh"、"ernie-3.0-tiny-micro-v2-zh"、"ernie-3.0-tiny-mini-v2-zh"、"ernie-3.0-tiny-nano-v2-zh"、"ernie-3.0-tiny-pico-v2-zh"。 +* 英文训练任务推荐使用"ernie-3.0-tiny-mini-v2-en"、 "ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 + +#### 2.4.2 训练评估 +训练后的模型我们可以开启`debug`模式,对每个类别分别进行评估,并打印错误预测样本保存在`bad_case.txt`。默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python train.py \ + --do_eval \ + --debug True \ + --device gpu \ + --model_name_or_path checkpoint \ + --output_dir checkpoint \ + --per_device_eval_batch_size 32 \ + --max_length 128 \ + --test_path './data/dev.txt' +``` + +输出打印示例: + +```text +[2023-02-14 12:35:03,470] [ INFO] - -----Evaluate model------- +[2023-02-14 12:35:03,471] [ INFO] - Dev dataset size: 1955 +[2023-02-14 12:35:03,471] [ INFO] - Accuracy in dev dataset: 81.74% +[2023-02-14 12:35:03,471] [ INFO] - Macro average | precision: 77.39 | recall: 79.89 | F1 score 78.32 +[2023-02-14 12:35:03,471] [ INFO] - Class name: 病情诊断 +[2023-02-14 12:35:03,471] [ INFO] - Evaluation examples in dev dataset: 288(14.7%) | precision: 85.22 | recall: 86.11 | F1 score 85.66 +[2023-02-14 12:35:03,471] [ INFO] - ---------------------------- +[2023-02-14 12:35:03,471] [ INFO] - Class name: 治疗方案 +[2023-02-14 12:35:03,471] [ INFO] - Evaluation examples in dev dataset: 676(34.6%) | precision: 91.72 | recall: 90.09 | F1 score 90.90 +... 
+``` + +文本分类预测过程中常会遇到诸如"模型为什么会预测出错误的结果","如何提升模型的表现"等问题。[Analysis模块](./analysis) 提供了**可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +#### 2.4.3 模型裁剪(可选) + +如果有模型部署上线的需求,需要进一步压缩模型体积,可以使用 PaddleNLP 的 [压缩API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/compression.md), 一行命令即可启动模型裁剪。 + +使用裁剪功能需要安装 paddleslim: + +```shell +pip install paddleslim == 2.4.1 +``` + +开始模型裁剪训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: +```shell +python train.py \ + --do_compress \ + --device gpu \ + --data_dir data \ + --model_name_or_path checkpoint \ + --output_dir checkpoint/prune \ + --learning_rate 3e-5 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 1 \ + --max_length 128 \ + --logging_steps 5 \ + --save_steps 100 \ + --width_mult_list '3/4' '2/3' '1/2' +``` +保存模型文件结构如下所示: + +```text +checkpoint/prune/ +├── width_mult_0.75 +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel +│   ├── model_state.pdparams +│   └── config.json +└── ... +``` +**NOTE:** + +1. 目前支持的裁剪策略需要训练,训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。 裁剪类似蒸馏过程,方便起见,可以直接使用微调时的超参。为了进一步提升精度,可以对 `per_device_train_batch_size`、`learning_rate`、`max_length` 等超参进行网格搜索(grid search)。 + +2. 模型裁剪主要用于推理部署,因此裁剪后的模型都是静态图模型,只可用于推理部署。 + +3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 + + + +### 2.5 模型预测 +我们推荐使用taskflow进行模型预测,请保证paddlenlp版本大于2.5.1。 +``` +from paddlenlp import Taskflow + +# 模型预测 +cls = Taskflow("text_classification", task_path='checkpoint/export', is_static_model=True) +cls(["黑苦荞茶的功效与作用及食用方法","幼儿挑食的生理原因是"]) +# [{'predictions': [{'label': '功效作用', 'score': 0.9683999621710758}], 'text': '黑苦荞茶的功效与作用及食用方法'}, {'predictions': [{'label': '病因分析', 'score': 0.5204789523701855}], 'text': '幼儿挑食的生理原因是'}] + +# 裁剪模型预测 +cls = Taskflow("text_classification", task_path='checkpoint/prune/width_mult_0.67', is_static_model=True) +cls(["黑苦荞茶的功效与作用及食用方法","幼儿挑食的生理原因是"]) +# [{'predictions': [{'label': '功效作用', 'score': 0.964693000149321}], 'text': '黑苦荞茶的功效与作用及食用方法'}, {'predictions': [{'label': '病因分析', 'score': 0.4915921440237312}], 'text': '幼儿挑食的生理原因是'}] +``` + +#### 可配置参数说明 +* `task_path`:自定义任务路径,默认为None。 +* `is_static_model`:task_path中是否为静态图模型参数,默认为False。 +* `max_length`:最长输入长度,包括所有标签的长度,默认为512。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `id2label`:标签映射字典,如果`task_path`中包含id2label.json或加载动态图参数无需定义。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + +在线服务化部署搭建请参考: +- 【简单易用】 👉 [Simple Serving部署指南](deploy/simple_serving) +- 【低时延】 👉 [Triton部署指南](deploy/triton_serving)。 + + + + +### 2.6 模型效果 + +PaddleNLP提供ERNIE 3.0 全系列轻量化模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们评测了不同预训练模型在KUAKE-QIC任务的表现,测试配置如下: + +1. 数据集:CBLUE数据集中医疗搜索检索词意图分类(KUAKE-QIC)任务开发集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 
精度评价指标:Accuracy + +| model_name | 模型结构 |Accuracy(%) | latency(ms) | +| -------------------------- | ------------ | ------------ | ------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|82.30| 5.62 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|82.25| 2.07 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|81.79| 1.07| +|ERNIE 3.0 Mini |6-layer, 384-hidden, 12-heads|79.80| 0.38| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|79.80| 0.26| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|78.57|0.22| +| ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 81.79| 0.83 | +| ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 81.07 | 0.79 | +| ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 81.07 | 0.64 | diff --git a/applications/text_classification/multi_class/analysis/README.md b/applications/text_classification/multi_class/analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cf70e0dc42645923cb52c8f9cbce0f4e049b5e65 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/README.md @@ -0,0 +1,321 @@ +# 训练评估与模型优化指南 + +**目录** + * [Analysis模块介绍](#Analysis模块介绍) + * [环境准备](#环境准备) + * [可解释性分析](#可解释性分析) + * [单词级别可解释性分析](#单词级别可解释性分析) + * [句子级别可解释性分析](#句子级别可解释性分析) + * [数据优化](#数据优化) + * [稀疏数据筛选方案](#稀疏数据筛选方案) + * [脏数据清洗方案](#脏数据清洗方案) + * [数据增强策略方案](#数据增强策略方案) + +## Analysis模块介绍 + +Analysis模块提供了**可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + + +- **可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果。 + +- **数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果。 + +
+ +
+ +以下是本项目主要代码结构及说明: + +```text +analysis/ +├── sent_interpret.ipynb # 句子级别可解释性分析脚本 +├── word_interpret.py # 单词级别可解释性分析notebook +├── sparse.py # 稀疏数据筛选脚本 +├── dirty.py # 脏数据清洗脚本 +├── aug.py # 数据增强脚本 +└── README.md # 训练评估与模型优化指南 +``` + +## 环境准备 +需要可解释性分析和数据优化需要安装相关环境。 +- trustai >= 0.1.12 +- interpretdl >= 0.7.0 + +**安装TrustAI**(可选)如果使用可解释性分析和数据优化中稀疏数据筛选和脏数据清洗需要安装TrustAI。 +```shell +pip install trustai==0.1.12 +``` + +**安装InterpretDL**(可选)如果使用词级别可解释性分析GradShap方法,需要安装InterpretDL +```shell +pip install interpretdl==0.7.0 +``` + +## 可解释性分析 +"模型为什么会预测出这个结果?"是文本分类任务开发者时常遇到的问题,如何分析错误样本(bad case)是文本分类任务落地中重要一环,本项目基于TrustAI开源了基于词级别和句子级别的模型可解释性分析方法,帮助开发者更好地理解文本分类模型与数据,有助于后续的模型优化与数据清洗标注。 + +### 单词级别可解释性分析 +本项目开源模型的词级别可解释性分析Notebook,提供LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用Jupyter Notebook进行数据分析。 + +运行 [word_interpret.ipynb](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/multi_class/analysis/README.md) 代码,即可分析影响样本预测结果的关键词以及可视化所有词对预测结果的贡献情况,颜色越深代表这个词对预测结果影响越大: +
+ +
+ +### 句子级别可解释性分析 +本项目基于特征相似度([FeatureSimilarity](https://arxiv.org/abs/2104.04128))算法,计算对样本预测结果正影响的训练数据,帮助理解模型的预测结果与训练集数据的关系。 + + +我们可以运行代码,得到支持样本模型预测结果的训练数据: +```shell +python sent_interpret.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint/" \ + --max_seq_length 128 \ + --batch_size 16 \ + --top_k 3 \ + --train_file "train.txt" \ + --interpret_input_file "bad_case.txt" \ + --interpret_result_file "sent_interpret.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `top_k`:筛选支持训练证据数量;默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `interpret_input_file`:本地数据集中待分析文件名;默认为"bad_case.txt"。 +* `interpret_result_file`:保存句子级别可解释性结果文件名;默认为"sent_interpret.txt"。 + +## 数据优化 + +### 稀疏数据筛选方案 + +稀疏数据筛选适用于文本分类中**数据不平衡或训练数据覆盖不足**的场景,简单来说,就是由于模型在训练过程中没有学习到足够与待预测样本相似的数据,模型难以正确预测样本所属类别的情况。稀疏数据筛选旨在开发集中挖掘缺乏训练证据支持的数据,通常可以采用**数据增强**或**少量数据标注**的两种低成本方式,提升模型在开发集的预测效果。 + +本项目中稀疏数据筛选基于TrustAI,利用基于特征相似度的实例级证据分析方法,抽取开发集中样本的支持训练证据,并计算支持证据平均分(通常为得分前三的支持训练证据均分)。分数较低的样本表明其训练证据不足,在训练集中较为稀疏,实验表明模型在这些样本上表现也相对较差。更多细节详见[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[实例级证据分析](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md)。 + + +#### 稀疏数据识别—数据增强 + +这里我们将介绍稀疏数据识别—数据增强流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据增强**:将稀疏数据集在训练集中的支持证据应用数据增强策略,这些数据增强后的训练数据记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别-数据增强,得到支持数据集: + +```shell +python sparse.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --aug_strategy "substitute" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `aug_strategy`:数据增强类型,可选"duplicate","substitute", "insert", "delete", "swap";默认为"substitute"。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +将得到增强支持数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_aug.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_aug.txt +``` + +**方案效果** + +我们在KUAKE-QIC数据集部分数据(训练集数据规模:500)进行实验,筛选稀疏数据数量和支持数据数量均设为100条,使用不同的数据增强方法进行评测: +| |Accuracy(%) | +| ---------| ------------ | +|训练集|73.50| +|训练集+支持增强集(duplicate) |73.61| +|训练集+支持增强集(substitute) |**74.32**| +|训练集+支持增强集(insert) |73.81| 
+|训练集+支持增强集(delete) |74.27| +|训练集+支持增强集(swap) |73.66| + +#### 稀疏数据识别-数据标注 + +本方案能够有针对性进行数据标注,相比于随机标注数据更好提高模型预测效果。这里我们将介绍稀疏数据识别-数据标注流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据标注**:在未标注数据集中筛选稀疏数据集的支持证据,并进行数据标注,记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别--数据标注,得到待标注数据: + +```shell +python sparse.py \ + --annotate \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 \ + --unlabeled_file "data.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `annotate`:选择稀疏数据识别--数据标注模式;默认为False。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `unlabeled_file`:本地数据集中未标注数据文件名;默认为"data.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +我们将筛选出的支持数据`support.txt`进行标注,可以使用标注工具帮助更快标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已标注数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_annotate.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_annotate.txt +``` + +**方案效果** + +我们在KUAKE-QIC数据集部分数据(训练集数据规模:500)进行实验,筛选稀疏数据数量设为100条,筛选待标注数据数量为50和100条。我们比较了使用稀疏数据方案的策略采样和随机采样的效果,下表结果表明使用稀疏数据方案的策略采样能够支持指导训练数据扩充,在标注更少的数据情况下获得更大提升的效果: + +| |Accuracy(%) | +| ---------| ------------ | +|训练集|73.50| +|训练集+策略采样集(50) |76.88| +|训练集+随机采样集(50) |74.32| +|训练集+策略采样集(100) |**77.64**| +|训练集+随机采样集(100) |76.37| + +### 脏数据清洗方案 + +脏数据清洗方案是基于已训练好的文本分类模型,筛选出训练数据集中标注错误的数据,再由人工检查重新标注,获得标注正确的数据集进行重新训练。我们将介绍脏数据清洗流程: + +- **脏数据筛选:** 基于TrustAI中表示点方法,计算训练数据对文本分类模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为标注错误样本,记为脏数据集(Dirty Dataset)。 + +- **数据清洗、训练:** 将筛选出的脏数据由人工重新检查,为数据打上正确的标签。将清洗后的训练数据重新放入文本分类模型进行训练。 + +现在我们进行脏数据识别,脏数据保存在`"train_dirty.txt"`,剩余训练数据保存在`"train_dirty_rest.txt"`: + +```shell +python dirty.py \ + --device "gpu:3" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 8 \ + --dirty_num 100 \ + --dirty_file "train_dirty.txt" \ + --rest_file "train_dirty_rest.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt和label.txt文件;默认为None。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `dirty_file`:保存脏数据文件名,默认为"train_dirty.txt"。 +* `rest_file`:保存剩余数据(非脏数据)文件名,默认为"train_dirty_rest.txt"。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dirty_threshold`:筛选脏数据用于重新标注的阈值,只选择影响分数大于阈值作为支持数据,默认为0。 + + 
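`dirty.py` 基于表示点方法得到每条训练样本对模型的影响权重后,会按“大于阈值的正影响权重个数、权重和”从大到小排序,取前 `dirty_num` 条作为疑似脏数据。其排序口径可以用下面几行示意(分数为构造值):

```python
# 示意:对每条样本统计 (大于阈值的影响权重个数, 权重和),按该二元组降序取前 dirty_num 条
scores = [(3, 1.20), (5, 0.90), (3, 0.75)]
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(order[:2])  # -> [1, 0],即影响最大的两条样本下标
```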
+我们将筛选出脏数据进行人工检查重新标注,可以将`train_dirty.txt`直接导入标注工具doccano帮助更快重新标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已重新标注的脏数据`train_dirty.txt`与剩余训练集数据`train_dirty_rest.txt`合并得到新的训练集`train_clean.txt`重新进行训练: + +```shell +cat ../data/train_dirty_rest.txt ../data/train_dirty.txt > ../data/train_clean.txt +``` + +**方案效果** + +我们在KUAKE-QIC数据集部分数据(训练集数据规模:500)进行实验,取100条数据进行脏数据处理,也即100条训练数据为标签错误数据,选择不同`dirty_num`应用脏数据清洗策略进行评测: + +| |Accuracy(%) | +| ---------| ------------ | +|训练集(500)|**73.50**| +|训练集(500,含100条脏数据) |65.58| +|训练集(500,含100条脏数据) + 脏数据清洗(50)|68.90| +|训练集(500,含100条脏数据) + 脏数据清洗(100)|69.36| +|训练集(500,含100条脏数据) + 脏数据清洗(150)|73.15| + +### 数据增强策略方案 + +在数据量较少或某些类别样本量较少时,也可以通过数据增强策略的方式,生成更多的训练数据,提升模型效果。 + +```shell +python aug.py \ + --create_n 2 \ + --aug_percent 0.1 \ + --train_path "../data/train.txt" \ + --aug_path "../data/aug.txt" +``` + +可支持配置的参数: + +* `train_path`:待增强训练数据集文件路径;默认为"../data/train.txt"。 +* `aug_path`:增强生成的训练数据集文件路径;默认为"../data/train_aug.txt"。 +* `aug_strategy`:数据增强策略,可选"mix", "substitute", "insert", "delete", "swap","mix"为多种数据策略混合使用;默认为"substitute"。 +* `aug_type`:词替换/词插入增强类型,可选"synonym", "homonym", "mlm",建议在GPU环境下使用mlm类型;默认为"synonym"。 +* `create_n`:生成的句子数量,默认为2。 +* `aug_percent`:生成词替换百分比,默认为0.1。 +* `device`: 选用什么设备进行增强,可选择cpu、gpu、xpu、npu,仅在使用mlm类型有影响;默认为"gpu"。 + +生成的增强数据保存在`"aug.txt"`文件中,与训练集数据`train.txt`合并得到新的训练集`train_aug.txt`重新进行训练: + +```shell +cat ../data/aug.txt ../data/train.txt > ../data/train_aug.txt +``` + +PaddleNLP内置多种数据增强策略,更多数据增强策略使用方法请参考[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)。 diff --git a/applications/text_classification/multi_class/analysis/aug.py b/applications/text_classification/multi_class/analysis/aug.py new file mode 100644 index 0000000000000000000000000000000000000000..dc0f87f2a216e07771061ec225eb6c7271e404c2 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/aug.py @@ -0,0 +1,82 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
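下面的 `aug.py` 基于 PaddleNLP 数据增强 API 对“文本`\t`标签”格式的训练集逐行增强,其核心调用大致如下(示意代码,参数与脚本默认值一致,示例句子取自上文 README 的预测示例;同义词词表等资源可能需要联网下载):

```python
# 示意:同义词替换增强,参数与 aug.py 默认值(synonym, create_n=2, aug_percent=0.1)一致
from paddlenlp.dataaug import WordSubstitute

aug = WordSubstitute("synonym", create_n=2, aug_percent=0.1)
augmented = aug.augment("黑苦荞茶的功效与作用及食用方法")
print(augmented)  # 增强后的句子列表;个别版本会返回嵌套列表,aug.py 中对此做了兼容处理
```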
+ +import argparse + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--train_path", type=str, default="../data/train.txt", help="Train dataset file name") +parser.add_argument("--aug_path", type=str, default="../data/aug.txt", help="Aug dataset file name") +parser.add_argument("--aug_strategy", choices=["mix", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--aug_type", choices=["synonym", "homonym", "mlm"], default='synonym', help="Select data augmentation type for substitute and insert") +parser.add_argument("--create_n", type=int, default=2, help="Number of augmented sequences.") +parser.add_argument("--aug_percent", type=float, default=0.1, help="Percentage of augmented words in sequences.") +parser.add_argument('--device', default="gpu", help="Select which device to do data augmentation strategy, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def aug(): + """Do data augmentation""" + if args.aug_strategy in ["mix", "substitute", "insert"] and args.aug_strategy == "mlm": + paddle.set_device(args.device) + + if args.aug_strategy in ["substitute", "insert", "delete", "swap"]: + if args.aug_strategy == "substitute": + aug = WordSubstitute(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=args.create_n, aug_percent=args.aug_percent) + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + augs = aug.augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + elif args.aug_strategy in ["mix"]: + aug = [ + WordSubstitute(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordInsert(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordDelete(create_n=1, aug_percent=args.aug_percent), + WordSwap(create_n=1, aug_percent=args.aug_percent), + ] + count = 0 + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + for i in range(args.create_n): + i = count % len(aug) + augs = aug[i].augment(s) + count += 1 + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + + +if __name__ == "__main__": + aug() diff --git a/applications/text_classification/multi_class/analysis/dirty.py b/applications/text_classification/multi_class/analysis/dirty.py new file mode 100644 index 0000000000000000000000000000000000000000..ee0ae746821a230acd374a4815f89dbbc812349f --- /dev/null +++ b/applications/text_classification/multi_class/analysis/dirty.py @@ -0,0 +1,152 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import RepresenterPointModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--dirty_num", type=int, default=100, help="Number of dirty data. default:50") +parser.add_argument("--dirty_file", type=str, default="train_dirty.txt", help="Path to save dirty data.") +parser.add_argument("--rest_file", type=str, default="train_dirty_rest.txt", help="The path of rest data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dirty_threshold", type=float, default="0", help="The threshold to select dirty data.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + sentence, label = line.strip().split("\t") + yield {"text": sentence, "label": label} + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +def get_dirty_data(weight_matrix, dirty_num, threshold=0): + """ + Get index of dirty data from train data + """ + scores = [] + for idx in range(weight_matrix.shape[0]): + weight_sum = 0 + count = 0 + for weight in weight_matrix[idx].numpy(): + if weight > threshold: + count += 1 + weight_sum += weight + scores.append((count, weight_sum)) + sorted_scores = sorted(scores)[::-1] + sorted_idxs = sorted(range(len(scores)), key=lambda idx: scores[idx])[::-1] + + ret_scores = sorted_scores[:dirty_num] + ret_idxs = sorted_idxs[:dirty_num] + + return ret_idxs, ret_scores + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = 
list(batch.values()) + return batch + + +def run(): + """ + Get dirty data + """ + set_seed(args.seed) + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + train_ds = train_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + rep_point = RepresenterPointModel(model, train_data_loader, classifier_layer_name="classifier") + weight_matrix = rep_point.weight_matrix + + # Save dirty data & rest data + dirty_indexs, _ = get_dirty_data(weight_matrix, args.dirty_num, args.dirty_threshold) + + dirty_path = os.path.join(args.dataset_dir, args.dirty_file) + rest_path = os.path.join(args.dataset_dir, args.rest_file) + + with open(dirty_path, "w") as f1, open(rest_path, "w") as f2: + for idx in range(len(train_ds)): + if idx in dirty_indexs: + f1.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + else: + f2.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + + f1.close(), f2.close() + + +if __name__ == "__main__": + run() diff --git a/applications/text_classification/multi_class/analysis/sent_interpret.py b/applications/text_classification/multi_class/analysis/sent_interpret.py new file mode 100644 index 0000000000000000000000000000000000000000..1f0e4a88c190a158f948930a6e6c74c266f9f5f8 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/sent_interpret.py @@ -0,0 +1,157 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
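下面的 `sent_interpret.py` 读取的待分析文件(默认 `bad_case.txt`,可由训练脚本开启 debug 评估导出)为制表符分隔的“文本、标注标签、预测标签”三列,若首列为表头 `Text` 会被跳过。以下为一条构造样本的解析示意(标签沿用上文 README 的示例标签):

```python
# 示意:bad_case.txt 单行的解析方式,与下方 read_local_dataset 的三字段分支一致
example_line = "黑苦荞茶的功效与作用及食用方法\t功效作用\t其他"
text, label, predict = example_line.strip().split("\t")
print({"text": text, "label": label, "predict": predict})
```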
+ +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--top_k", type=int, default=3, help="Top K important training data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--interpret_input_file", type=str, default="bad_case.txt", help="interpretation file name") +parser.add_argument("--interpret_result_file", type=str, default="sent_interpret.txt", help="interpreted file name") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if items[0] == "Text": + continue + if len(items) == 3: + yield {"text": items[0], "label": items[1], "predict": items[2]} + elif len(items) == 2: + yield {"text": items[0], "label": items[1], "predict": ""} + elif len(items) == 1: + yield {"text": items[0], "label": "", "predict": ""} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def find_positive_influence_data(): + + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + interpret_path = os.path.join(args.dataset_dir, args.interpret_input_file) + + train_ds = load_dataset(read_local_dataset, path=train_path, 
lazy=False) + interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + interpret_ds = interpret_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + interpret_batch_sampler = BatchSampler(interpret_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + interpret_data_loader = DataLoader( + dataset=interpret_ds, batch_sampler=interpret_batch_sampler, collate_fn=collate_fn + ) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in interpret_data_loader: + analysis_result += feature_sim(batch, sample_num=args.top_k) + with open(os.path.join(args.dataset_dir, args.interpret_result_file), "w") as f: + for i in range(len(analysis_result)): + f.write("text: " + interpret_ds.data[i]["text"] + "\n") + if "predict" in interpret_ds.data[i]: + f.write("predict label: " + interpret_ds.data[i]["predict"] + "\n") + if "label" in interpret_ds.data[i]: + f.write("label: " + interpret_ds.data[i]["label"] + "\n") + f.write("examples with positive influence\n") + for i, (idx, score) in enumerate(zip(analysis_result[i].pos_indexes, analysis_result[i].pos_scores)): + f.write( + "support{} text: ".format(i + 1) + + train_ds.data[idx]["text"] + + "\t" + + "label: " + + train_ds.data[idx]["label"] + + "\t" + + "score: " + + "{:.5f}".format(score) + + "\n" + ) + f.close() + + +if __name__ == "__main__": + find_positive_influence_data() diff --git a/applications/text_classification/multi_class/analysis/sparse.py b/applications/text_classification/multi_class/analysis/sparse.py new file mode 100644 index 0000000000000000000000000000000000000000..f3809b49a3463d75ecee54927a39f362b6c985d0 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/sparse.py @@ -0,0 +1,291 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
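下面的 `sparse.py` 用 TrustAI 的 `FeatureSimilarityModel` 为每条开发集样本取得分最高的 `rationale_num_sparse` 条训练证据,并以证据得分均值衡量训练支持度,均分越低的样本越“稀疏”。其打分口径可以用下面几行示意(分数为构造值):

```python
# 示意:单条样本的支持度 = top-K 支持证据得分的平均值,均分最低的 sparse_num 条会写入 sparse.txt
pos_scores = [0.62, 0.55, 0.48]  # 假设的 top-3 证据得分(rationale_num_sparse=3)
support_score = sum(pos_scores) / len(pos_scores)
print("support score: {:.4f}".format(support_score))
```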
+ +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--aug_strategy", choices=["duplicate", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--annotate", action='store_true', help="Select unlabeled data for annotation") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.") +parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.") +parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.") +parser.add_argument("--support_threshold", type=float, default="0.7", help="The threshold to select support data.") +parser.add_argument("--support_num", type=int, default=100, help="Number of support data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--unlabeled_file", type=str, default="data.txt", help="Unlabeled data filename") +parser.add_argument("--sparse_file", type=str, default="sparse.txt", help="Sparse data file name.") +parser.add_argument("--support_file", type=str, default="support.txt", help="support data file name.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 2: + yield {"text": items[0], "label": items[1]} + elif len(items) == 1: + yield {"text": items[0]} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + f.close() + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + 
return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def get_sparse_data(analysis_result, sparse_num): + """ + Get sparse data + """ + idx_scores = {} + preds = [] + for i in range(len(analysis_result)): + scores = analysis_result[i].pos_scores + idx_scores[i] = sum(scores) / len(scores) + preds.append(analysis_result[i].pred_label) + + idx_socre_list = list(sorted(idx_scores.items(), key=lambda x: x[1]))[:sparse_num] + ret_idxs, ret_scores = list(zip(*idx_socre_list)) + return ret_idxs, ret_scores, preds + + +def find_sparse_data(): + """ + Find sparse data (lack of supports in train dataset) in dev dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + train_path = os.path.join(args.dataset_dir, args.train_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + f.close() + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in dev_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_sparse) + sparse_indexs, sparse_scores, preds = get_sparse_data(analysis_result, args.sparse_num) + + # Save the sparse data + is_true = [] + with open(os.path.join(args.dataset_dir, args.sparse_file), "w") as f: + for idx in sparse_indexs: + data = dev_ds.data[idx] + f.write(data["text"] + "\t" + str(data["label"]) + "\n") + is_true.append(1 if str(preds[idx]) == str(label_list[data["label"]]) else 0) + f.close() + logger.info("Sparse data saved in {}".format(os.path.join(args.dataset_dir, args.sparse_file))) + logger.info("Accuracy in sparse data: {:.2f}%".format(100 * sum(is_true) / len(is_true))) + logger.info("Average score in sparse data: {:.4f}".format(sum(sparse_scores) / len(sparse_scores))) + return os.path.join(args.dataset_dir, args.sparse_file) + + +def get_support_data(analysis_result, support_num, 
support_threshold=0.7): + """ + get support data + """ + ret_idxs = [] + ret_scores = [] + rationale_idx = 0 + try: + while len(ret_idxs) < support_num: + for n in range(len(analysis_result)): + score = analysis_result[n].pos_scores[rationale_idx] + if score > support_threshold: + idx = analysis_result[n].pos_indexes[rationale_idx] + if idx not in ret_idxs: + ret_idxs.append(idx) + ret_scores.append(score) + if len(ret_idxs) >= support_num: + break + + rationale_idx += 1 + except IndexError: + logger.error( + f"The index is out of range, please reduce support_num or increase support_threshold. Got {len(ret_idxs)} now." + ) + + return ret_idxs, ret_scores + + +def find_support_data(): + """ + Find support data (which supports sparse data) from candidate dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + if args.annotate: + candidate_path = os.path.join(args.dataset_dir, args.unlabeled_file) + else: + candidate_path = os.path.join(args.dataset_dir, args.train_file) + + sparse_path = os.path.join(args.dataset_dir, args.sparse_file) + support_path = os.path.join(args.dataset_dir, args.support_file) + candidate_ds = load_dataset(read_local_dataset, path=candidate_path, lazy=False) + sparse_ds = load_dataset(read_local_dataset, path=sparse_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + candidate_ds = candidate_ds.map(trans_func) + sparse_ds = sparse_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + candidate_batch_sampler = BatchSampler(candidate_ds, batch_size=args.batch_size, shuffle=False) + sparse_batch_sampler = BatchSampler(sparse_ds, batch_size=args.batch_size, shuffle=False) + candidate_data_loader = DataLoader( + dataset=candidate_ds, batch_sampler=candidate_batch_sampler, collate_fn=collate_fn + ) + sparse_data_loader = DataLoader(dataset=sparse_ds, batch_sampler=sparse_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, candidate_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis + analysis_result = [] + for batch in sparse_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_support) + + support_indexs, support_scores = get_support_data(analysis_result, args.support_num, args.support_threshold) + + # Save the support data + if args.annotate or args.aug_strategy == "duplicate": + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + if "label" in data: + f.write(data["text"] + "\t" + data["label"] + "\n") + else: + f.write(data["text"] + "\n") + f.close() + else: + create_n = 1 + aug_percent = 0.1 + if args.aug_strategy == "substitute": + aug = WordSubstitute("embedding", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert("embedding", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=create_n, 
aug_percent=aug_percent) + + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + augs = aug.augment(data["text"]) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f.write(a + "\t" + data["label"] + "\n") + f.close() + logger.info("support data saved in {}".format(support_path)) + logger.info("support average scores: {:.4f}".format(float(sum(support_scores)) / len(support_scores))) + + +if __name__ == "__main__": + find_sparse_data() + find_support_data() diff --git a/applications/text_classification/multi_class/analysis/word_interpret.ipynb b/applications/text_classification/multi_class/analysis/word_interpret.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..05ee41790fcc82bfae7e20330139a25cfe4fcbae --- /dev/null +++ b/applications/text_classification/multi_class/analysis/word_interpret.ipynb @@ -0,0 +1,354 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 词级别可解释性分析\n", + "本项目提供模型的词级别可解释性分析,包括LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用此项目进行数据分析。\n", + "\n", + "![image](https://user-images.githubusercontent.com/63761690/195086276-6ee16e96-4ec3-4a0f-821f-37546d21746b.png)\n", + " \n", + "\n", + "## 1.导入Python模块与参数配置\n", + "首先我们导入必要的导入必要python模块和设置配置参数,词级别可解释性分析算法支持三种待分析的文本 `INTERPRETER_FILE` 数据文件格式:\n", + "\n", + "**格式一:包括文本、标签、预测结果**\n", + "```text\n", + "<文本>'\\t'<标签>'\\t'<预测结果>\n", + "...\n", + "```\n", + "\n", + "**格式二:包括文本、标签**\n", + "```text\n", + "<文本>'\\t'<标签>\n", + "...\n", + "```\n", + "\n", + "**格式三:只包括文本**\n", + "```text\n", + "<文本>\n", + "...\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import functools\n", + "import random\n", + "import os\n", + "import argparse\n", + "\n", + "import jieba\n", + "import numpy as np \n", + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# 预先定义配置参数\n", + "\n", + "# 运行环境,可选\"cpu\",\"gpu\",\"gpu:x\"(x为gpu编号)\n", + "DEVICE = \"gpu\"\n", + "# 数据路径\n", + "DATASET_DIR = \"../data\" \n", + "# 训练模型保存路径\n", + "PARAM_PATH = \"../checkpoint/\" \n", + "# tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数\n", + "MAX_LENGTH = 128 \n", + "# 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数\n", + "BATCH_SIZE = 1 \n", + "# 待分析解释的数据\n", + "INTERPRETER_FILE = \"bad_case.txt\"\n", + "# 可选 \"ig\",\"lime\",\"grad\" ,可以根据实际任务效果选择解释器\n", + "# \"grad\":GradShap方法依赖interpretdl\n", + "# !pip install interpretdl\n", + "INTERPRETER = \"ig\"\n", + "# 
分析句子中TOP K关键词,K值\n", + "KEY_WORDS_NUM = 5" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def read_local_dataset(path):\n", + " \"\"\"\n", + " Read dataset file\n", + " \"\"\"\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f:\n", + " items = line.strip().split('\\t')\n", + " if items[0] == 'Text':\n", + " continue\n", + " items[0] = items[0][:MAX_LENGTH-2]\n", + " if len(items) == 3:\n", + " yield {'text': items[0], 'label': items[1], 'predict': items[2]}\n", + " elif len(items) == 2:\n", + " yield {'text': items[0], 'label': items[1], 'predict': ''}\n", + " elif len(items) == 1:\n", + " yield {'text': items[0], 'label': '', 'predict': ''}\n", + " else:\n", + " raise ValueError(\"{} should be in fixed format.\".format(path))\n", + "\n", + "def preprocess_function(examples, tokenizer, max_seq_length):\n", + " \"\"\"\n", + " Preprocess dataset\n", + " \"\"\"\n", + " result = tokenizer(text=examples[\"text\"], max_seq_len=max_seq_length)\n", + " return result\n", + "\n", + "class LocalDataCollatorWithPadding(DataCollatorWithPadding):\n", + " \"\"\"\n", + " Convert the result of DataCollatorWithPadding from dict dictionary to a list\n", + " \"\"\"\n", + "\n", + " def __call__(self, features):\n", + " batch = super().__call__(features)\n", + " batch = list(batch.values())\n", + " return batch" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32m[2022-10-11 12:17:29,041] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/multi_class/checkpoint/'.\u001b[0m\n", + "W1011 12:17:29.044690 79080 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2\n", + "W1011 12:17:29.051118 79080 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.\n", + "\u001b[32m[2022-10-11 12:17:32,517] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/multi_class/checkpoint/'.\u001b[0m\n" + ] + } + ], + "source": [ + "paddle.set_device(DEVICE)\n", + "\n", + "# Define model & tokenizer\n", + "if os.path.exists(PARAM_PATH):\n", + " model = AutoModelForSequenceClassification.from_pretrained(PARAM_PATH)\n", + " tokenizer = AutoTokenizer.from_pretrained(PARAM_PATH)\n", + "else:\n", + " raise ValueError(\"The {} should exist.\".format(PARAM_PATH))\n", + "\n", + "# Prepare & preprocess dataset\n", + "interpret_path = os.path.join(DATASET_DIR, INTERPRETER_FILE)\n", + "\n", + "\n", + "interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False)\n", + "trans_func = functools.partial(preprocess_function,\n", + " tokenizer=tokenizer,\n", + " max_seq_length=MAX_LENGTH)\n", + "\n", + "interpret_ds = interpret_ds.map(trans_func)\n", + "\n", + "# Batchify dataset\n", + "collate_fn = LocalDataCollatorWithPadding(tokenizer)\n", + "interpret_batch_sampler = BatchSampler(interpret_ds,\n", + " batch_size=BATCH_SIZE,\n", + " shuffle=False)\n", + "interpret_data_loader = DataLoader(dataset=interpret_ds,\n", + " batch_sampler=interpret_batch_sampler,\n", + " collate_fn=collate_fn)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start token level interpretion, it will take some time...\n", + "Building prefix dict from the default dictionary ...\n", + "Loading 
model from cache /tmp/jieba.cache\n", + "Loading model cost 1.005 seconds.\n", + "Prefix dict has been built successfully.\n", + "Start word level alignment, it will take some time...\n" + ] + } + ], + "source": [ + "# Init an interpreter\n", + "if INTERPRETER == 'ig':\n", + " from trustai.interpretation.token_level import IntGradInterpreter\n", + " interpreter = IntGradInterpreter(model)\n", + "elif INTERPRETER == 'lime':\n", + " from trustai.interpretation.token_level import LIMEInterpreter\n", + " interpreter = LIMEInterpreter(model, unk_id=tokenizer.convert_tokens_to_ids('[UNK]'), pad_id=tokenizer.convert_tokens_to_ids('[PAD]'))\n", + "else:\n", + " from trustai.interpretation.token_level import GradShapInterpreter\n", + " interpreter = GradShapInterpreter(model)\n", + "\n", + "# Use interpreter to get the importance scores for all data\n", + "print(\"Start token level interpretion, it will take some time...\")\n", + "analysis_result = []\n", + "for batch in interpret_data_loader:\n", + " analysis_result += interpreter(tuple(batch))\n", + "\n", + "# Add CLS and SEP tags to both original text and standard splited tokens\n", + "contexts = []\n", + "words = []\n", + "for i in range(len(interpret_ds)):\n", + " text = interpret_ds.data[i][\"text\"]\n", + " contexts.append(\"[CLS]\" + text + \"[SEP]\")\n", + " words.append([\"[CLS]\"] + list(jieba.cut(text)) + [\"[SEP]\"])\n", + "\n", + "# Get the offset map of tokenized tokens and standard splited tokens\n", + "print(\"Start word level alignment, it will take some time...\")\n", + "ori_offset_maps = []\n", + "word_offset_maps = []\n", + "for i in range(len(contexts)):\n", + " ori_offset_maps.append(tokenizer.get_offset_mapping(contexts[i]))\n", + " word_offset_maps.append(get_word_offset(contexts[i], words[i]))\n", + "\n", + "align_res = interpreter.alignment(analysis_result, contexts, words, word_offset_maps, ori_offset_maps, special_tokens=[\"[CLS]\", '[SEP]'],rationale_num=KEY_WORDS_NUM)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import display, HTML\n", + "class Visualization(VisualizationTextRecord):\n", + "\n", + " def __init__(self, interpret_res, true_label=None, pred_label=None, words=None):\n", + " if words is not None:\n", + " self.words = words\n", + " else:\n", + " self.words = interpret_res.words\n", + " self.pred_label = pred_label if pred_label is not None else ''\n", + " self.true_label = true_label if true_label is not None else ''\n", + " self.key_words = \" \".join(set(interpret_res.rationale_tokens))\n", + " word_attributions = interpret_res.word_attributions\n", + " _max = max(word_attributions)\n", + " _min = min(word_attributions)\n", + " self.word_attributions = [(word_imp - _min) / (_max - _min) for word_imp in word_attributions]\n", + "\n", + " def record_html(self):\n", + " \"\"\"change all informations to html\"\"\"\n", + " return \"\".join([\n", + " \"\",\n", + " self._format_class(self.true_label),\n", + " self._format_class(self.pred_label),\n", + " self._format_class(self.key_words),\n", + " self._format_word_attributions(),\n", + " \"\",\n", + " ])\n", + " def _format_class(self, label):\n", + " return '{label}'.format(label=label)\n", + "\n", + "def visualize_text(text_records):\n", + " \"\"\"visualize text\"\"\"\n", + " html = [\"\"]\n", + " rows = [\"\"\n", + " \"\"\n", + " \"\"\n", + " \"\"]\n", + " for record in text_records:\n", + " rows.append(record.record_html())\n", + " html.append(\"\".join(rows))\n", 
+ " html.append(\"
LabelPredictionKey wordsImportant visualization
\")\n", + " html = HTML(\"\".join(html))\n", + " display(html)\n", + " return html.data\n", + "\n", + "\n", + "def visualize(interpret_res, ds):\n", + " records = []\n", + " for i in range(len(interpret_res)):\n", + " records.append(Visualization(interpret_res[i], true_label=ds.data[i][\"label\"], pred_label=ds.data[i][\"predict\"]))\n", + " html = visualize_text(records)\n", + " return html" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
(词级别可解释性分析的可视化结果,原输出为 HTML 表格)

| Label | Prediction | Key words | Important visualization |
| --- | --- | --- | --- |
| 其他 | 注意事项 | 月 服用 请问 的 可以 | [CLS] 您好 请问 一岁 三个 孩子 可以 服用 复方 颗粒 [SEP] |
| 其他 | 就医建议 | 输卵管 基本 检查 粘连 的 | [CLS] 输卵管 粘连 基本 检查 [SEP] |
| 其他 | 病情诊断 | 胎动 么 ? 是 会 | [CLS] 胎动 [SEP] |
| 其他 | 病情诊断 | 这是 经常 干呕 了 生病 | [CLS] 经常 干呕 恶心 这是 生病 [SEP] |
| 就医建议 | 治疗方案 | 治 治疗 菏泽 怎么 白癜风 | [CLS] 菏泽 哪个 医院 治疗 白癜风 比较 ? 怎么 [SEP] |
| 其他 | 后果表述 | 左旋 不良反应 吃 的 肉碱 | [CLS] 左旋 肉碱 不良反应 [SEP] |
| 注意事项 | 其他 | 上 出血 吗 做爱 环后 | [CLS] 环后 出血 可以 做爱 [SEP] |
| 病情诊断 | 病因分析 | 感冒 了 呀 怎么 会 | [CLS] 孩子 感冒 怎么 喘息 [SEP] |
| 其他 | 治疗方案 | 孕 周 21 | [CLS] 21 [SEP] |
| 其他 | 指标解读 | 谱 心肌 意义 酶 ? | [CLS] 心肌 五项 意义 [SEP] |
| 病情诊断 | 其他 | 家长 判断 吃 吃饱 怎么 | [CLS] 家长 怎么 判断 孩子 吃饱 怎么 不肯 就是 [SEP] |
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# process for vbisualize\n", + "html = visualize(align_res, interpret_ds)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.7.13 64-bit", + "metadata": { + "interpreter": { + "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90" + } + }, + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13-final" + }, + "orig_nbformat": 2 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/applications/text_classification/multi_class/deploy/simple_serving/README.md b/applications/text_classification/multi_class/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6ec4f03ad2277907e1ff4ebef4c63d9a9a95db73 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/simple_serving/README.md @@ -0,0 +1,23 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp >= 2.5.1 +``` +## Server服务启动 +### 分类任务启动 +#### 启动分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` + +#### 启动分类 Client 服务 +```bash +python client.py +``` diff --git a/applications/text_classification/multi_class/deploy/simple_serving/client.py b/applications/text_classification/multi_class/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..57389859efd92107080303d54d2b6e311fba96aa --- /dev/null +++ b/applications/text_classification/multi_class/deploy/simple_serving/client.py @@ -0,0 +1,27 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/cls" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["黑苦荞茶的功效与作用及食用方法", "交界痣会凸起吗", "检查是否能怀孕挂什么科", "鱼油怎么吃咬破吃还是直接咽下去", "幼儿挑食的生理原因是"] + data = {"data": {"text": texts}} + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + result_json = json.loads(r.text) + print(result_json["result"]) diff --git a/applications/text_classification/multi_class/deploy/simple_serving/server.py b/applications/text_classification/multi_class/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..75701c1c797694bd5abded708a16a5961523c1c9 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/simple_serving/server.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +cls = Taskflow("text_classification", task_path="../../checkpoint/export", is_static_model=True) +app = SimpleServer() +app.register_taskflow("taskflow/cls", cls) diff --git a/applications/text_classification/multi_class/deploy/triton_serving/README.md b/applications/text_classification/multi_class/deploy/triton_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7fbcb5c92013441f489e54c75412a3a8c3adf679 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/README.md @@ -0,0 +1,200 @@ +# 基于Triton Inference Server的服务化部署指南 + +本文档将介绍如何使用[Triton Inference Server](https://github.com/triton-inference-server/server)工具部署基于ERNIE 3.0中文模型文本多分类的pipeline在线服务。 + +## 目录 +- [服务端环境准备](#服务端环境准备) +- [模型获取和转换](#模型获取和转换) +- [部署模型](#部署模型) +- [客户端请求](#客户端请求) + +## 服务端环境准备 + +### 安装Triton Server +拉取Triton Server镜像: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3 +``` +启动容器: +```shell +docker run -it --gpus all --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash +``` + +**NOTE:** + +1. Triton版本号`21.10`可以根据自己的需求调整,各个Triton版本对应的Driver、CUDA、TRT和ONNX Runtime等后端版本可以参考[官网文档](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)。注意其中的`NVIDIA Driver`行,如果NVIDIA Driver低于文档中要求,在启动运行时会报错。 + +2. 可以使用`--gpus '"device=1"'`来指定GPU卡号,更多GPU指定方式请参见[Nvidia User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration) + + +### 进入容器并准备PaddleNLP环境 +整个服务的前后处理依赖PaddleNLP,需要在容器内安装相关python包 + +进入容器: +```shell +docker exec -it triton_server bash +``` +安装PaddlePaddle、PaddleNLP +```shell +python3 -m pip install paddlepaddle-gpu paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**NOTE:** + +1. 默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://mirror.baidu.com/pypi/simple) + +2. 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.2, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + +3. 
更多关于PaddleNLP安装的详细教程请查看[Installation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + + +### 安装FastTokenizer文本处理加速库(可选) + +部署环境是Linux,推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 + +在容器内安装 fast_tokenizer +```shell +python3 -m pip install fast-tokenizer-python +``` + + +## 模型获取和转换 + +使用Triton做服务化部署时,选择ONNX Runtime后端运行需要先将模型转换成ONNX格式。使用Paddle2ONNX将Paddle静态图模型转换为ONNX模型格式的命令如下,以下命令成功运行后,将会在当前目录下生成model.onnx模型文件。 +```shell +paddle2onnx --model_dir ../../checkpoint/export --model_filename model.pdmodel --params_filename model.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True +``` +创建空白目录/seqcls/1和seqcls_model/1,并将将转换好的ONNX模型移动到模型仓库目录 +```shell +mkdir /models/seqcls/1 +mkdir /models/seqcls_model/1 +mv model.onnx /models/seqcls_model/1 +``` + +Paddle2ONNX的命令行参数说明请查阅:[Paddle2ONNX命令行参数说明](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) + +模型下载转换好之后,models目录结构如下: +``` +models +├── seqcls +│   ├── 1 +│   └── config.pbtxt +├── seqcls_model +│   ├── 1 +│   │   └── model.onnx +│   └── config.pbtxt +├── seqcls_postprocess +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── tokenizer + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +模型配置文件config.pbtxt配置细节请参见[Triton Server Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) + +## 部署模型 + +triton目录包含启动pipeline服务的配置和发送预测请求的代码,包括: + +``` +models # Triton启动需要的模型仓库,包含模型和服务配置文件 +seqcls_grpc_client.py # 分类任务发送pipeline预测请求的脚本 +``` + +### 启动服务端 + +在容器内执行下面命令启动服务,默认启动models下所有模型: +```shell +tritonserver --model-repository=/models +``` +也可以通过设定参数只启动单一任务服务: +```shell +tritonserver --model-repository=/models --model-control-mode=explicit --load-model=seqcls +``` + +**NOTE:** + +启动服务时,Triton Server的每个python后端进程默认申请`64M`内存,默认启动的docker无法启动多个python后端节点。两个解决方案: + +1. 启动容器时设置`shm-size`参数, 比如:`docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash` + +2. 启动服务时设置python后端的`shm-default-byte-size`参数, 设置python后端的默认内存为10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760` + +输出打印如下: + +``` +... +I0619 13:40:51.590901 5127 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime +I0619 13:40:51.590938 5127 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.590947 5127 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6 +I0619 13:40:51.623808 5127 openvino.cc:1193] TRITONBACKEND_Initialize: openvino +I0619 13:40:51.623862 5127 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.623868 5127 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6 +I0619 13:40:52.980990 5127 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d8000000' with size 268435456 +... +I0619 13:43:33.360018 5127 server.cc:592] ++--------------------+---------+--------+ +| Model | Version | Status | ++--------------------+---------+--------+ +| seqcls | 1 | READY | +| seqcls_model | 1 | READY | +| seqcls_postprocess | 1 | READY | +| tokenizer | 1 | READY | ++--------------------+---------+--------+ +... 
+I0619 13:43:33.365824 5127 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001 +I0619 13:43:33.366221 5127 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000 +I0619 13:43:33.409775 5127 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 +``` + +## 客户端请求 + +### 客户端环境准备 +客户端请求有两种方式,可以选择在本地执行脚本请求,或下载官方客户端镜像在容器中执行。 + +方式一:本地执行脚本,需要先安装依赖: +```shell +pip install grpcio +pip install tritonclient==2.10.0 +``` + +方式二:拉取官网镜像并启动容器: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk +docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash +``` + +### 启动客户端测试 +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +```shell +python seqcls_grpc_client.py +``` + +输出打印如下: + +``` +text: 黑苦荞茶的功效与作用及食用方法 +label: 功效作用 +confidence: 0.984 +-------------------- +text: 交界痣会凸起吗 +label: 疾病表述 +confidence: 0.904 +-------------------- +text: 检查是否能怀孕挂什么科 +label: 就医建议 +confidence: 0.969 +-------------------- +text: 幼儿挑食的生理原因是 +label: 病因分析 +confidence: 0.495 +-------------------- +text: 鱼油怎么吃咬破吃还是直接咽下去 +label: 其他 +confidence: 0.850 +-------------------- +``` diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82261157aefe68bac9a1865d888c0257d2e905e8 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls/config.pbtxt @@ -0,0 +1,75 @@ +name: "seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82912ecbe05ee7b8688e5f5e68eafdc138999626 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -0,0 +1,36 @@ +platform: "onnxruntime_onnx" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 11 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] + +optimization { + graph: {level: -1} +} + +parameters { key: 
"intra_op_thread_count" value: { string_value: "0" } } +parameters { key: "execution_mode" value: { string_value: "0" } } +parameters { key: "inter_op_thread_count" value: { string_value: "0" } } diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/1/model.py b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ad18d669698eb30f7121f0a89b3e32a8a3d44bfd --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/1/model.py @@ -0,0 +1,101 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.txt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. 
Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + max_value = np.max(data, axis=1, keepdims=True) + exp_data = np.exp(data - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + probs = probs.max(axis=-1) + out_tensor1 = pb_utils.Tensor(self.output_names[0], data.argmax(axis=-1)) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..a5d90d3bcbb02b50409e9053cb730f3ea193b9be --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 11 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..97a96222075370ebcca531250dc6cc7c0e8fead0 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. 
It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.pbtxt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + # print("input_ids:", input_ids) + # print("token_type_ids:", token_type_ids) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. 
+ Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d35d1f44968ba205b1890899a82568d33e90a999 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_class/deploy/triton_serving/seqcls_grpc_client.py b/applications/text_classification/multi_class/deploy/triton_serving/seqcls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..4144f5484b214de0daf1ab468c88d5b078f9499d --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/seqcls_grpc_client.py @@ -0,0 +1,112 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + data = [["黑苦荞茶的功效与作用及食用方法", "交界痣会凸起吗", "检查是否能怀孕挂什么科"], ["幼儿挑食的生理原因是"], ["鱼油怎么吃咬破吃还是直接咽下去"]] + label_list = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] + for texts in data: + # input format:[input1, input2 ... inputn], n = len(self._input_names) + result = runner.Run([texts]) + for i, text in enumerate(texts): + print("text: ", text) + print("label: ", label_list[result["label"][i]]) + print("confidence: ", "{:.3f}".format(result["confidence"][i])) + print("--------------------") diff --git a/applications/text_classification/multi_class/few-shot/README.md b/applications/text_classification/multi_class/few-shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b0ec99bdde54dbca9df475a1ca676c5e5ea8b2eb --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/README.md @@ -0,0 +1,360 @@ +# 小样本场景下的二/多分类任务指南 + +**零样本/小样本文本分类推荐使用 UTC 模型,详情见[目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification),本项目将会在2.5.2版本下线。** + +## 目录 + +- [1. 项目说明](#项目说明) +- [2. 效果展示](#效果展示) +- [3. 定制训练](#定制训练) + - [3.1 运行环境](#运行环境) + - [3.2 代码结构](#代码结构) + - [3.3 数据标注](#数据标注) + - [3.4 模型训练](#模型训练) + - [3.5 模型评估](#模型评估) + - [3.6 模型部署](#模型部署) +- [4. References](#References) + + +## 1. 
项目说明 + +本项目提供了小样本场景下文本二/多分类的解决方案,在 ERNIE3.0 的基础上利用提示学习取得比微调更好的分类效果,充分利用标注信息。 + +### 模型介绍 + +**文本二/多分类** 用于预测样本属于标签候选集中的哪个类别,在商品分类、网页分类、新闻分类、医疗文本分类等现实场景中有着广泛应用。 +现有的主流解决方案是在预训练语言模型上进行微调,因为二/多分类任务与预训练阶段的掩码预测任务有着天然的差异,想要取得较好的分类效果往往需要大量数据标注。 + +**提示学习(Prompt Learning)** 的主要思想是将二/多分类任务转换为掩码预测任务,充分利用预训练语言模型学习到的特征,从而降低样本需求。以情感分类任务为例,标签分为`1-正向`,`0-负向`两类,如下图所示,通过提示`我[MASK]喜欢。`,原有`1-正向`,`0-负向`的标签被转化为了预测空格是`很`还是`不`。 + +
+ (此处为提示学习示意图:提示将分类任务转化为掩码预测)
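为便于理解提示与掩码位置如何与原文本拼接,下面给出一个极简的模板构造示意(仅作说明,并非本项目的训练脚本;`AutoTemplate.create_from` 的用法与本目录 infer.py 中一致,提示语沿用下文训练命令中的示例,实际返回结构以所装版本为准):

```python
from paddlenlp.prompt import AutoTemplate
from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh")

# 提示语会被拼接到原文本之后,并保留一个 [MASK] 位置,
# 由掩码语言模型在该位置预测标签映射词(如“汽车”“文化”)
template = AutoTemplate.create_from("这条新闻标题的主题是", tokenizer, 128, model)
print(template.prompt)  # 查看自动补全后的模板结构(text / prompt / mask 等片段)
```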
+ +微调方法和提示方法的区别如图所示: + +【微调学习】需要学习的参数是以 `[CLS]` 向量为输入,以负向/正向为输出的随机初始化的分类器。 + +【提示学习】通过构造提示,将原有的分类任务转化为掩码预测,即掩盖原句中的某个字,用模型预测该字。此时的分类器不再是随机初始化,而是利用了待预测字的预训练向量来初始化,充分利用了预训练模型学习到的参数。 + +【方案选择】对于标注样本充足的场景可以直接使用[微调学习](../README.md)实现文本多分类,对于尚无标注或者标注样本较少的任务场景我们推荐使用提示学习,以取得更好的效果。 + +### 方案特点 + +- **标注成本低**:以往的微调方式需要大量的数据标注才能保证模型分类效果。提示学习可以降低数据标注依赖,在少样本(few-shot)的场景下取得比微调更好的分类效果。 +- **全流程打通**:提供了从训练到部署的完整解决方案,可以低成本迁移至实际应用场景。 + + +## 2.效果展示 + +本项目中使用了 ERNIE3.0 模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们测评了 Base 模型在新闻分类任务上的表现。测试配置如下: + +1. 数据集:[FewCLUE](https://github.com/CLUEbenchmark/FewCLUE)中的新闻分类(tnews)任务测试集。 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.4rc + +4. PaddleNLP 版本:2.4.3 + +5. 评估设置 + + 每个 epoch 评估一次,按照验证集上的评价指标,取验证集上分数最高的模型参数用于测试集的测试。为了避免过拟合,这里使用了早停机制 (Early-stopping)。表格中的最终结果为重复 10 次的均值。 + +- 微调 + +``` +cd ../ +python train.py --dataset_dir "./data/" --save_dir "./checkpoints" --max_seq_length 128 --model_name "ernie-3.0-base-zh" --batch_size 8 --learning_rate 3e-5 --epochs 100 --logging_steps 5 --early_stop +``` + +- 提示学习 + +``` +python train.py --data_dir ./data/ --output_dir ./checkpoints/ --prompt "这条新闻写的是" --model_name_or_path ernie-3.0-base-zh --max_seq_length 128 --learning_rate 3e-5 --ppt_learning_rate 3e-4 --do_train --do_eval --num_train_epochs 100 --logging_steps 5 --per_device_eval_batch_size 32 --per_device_train_batch_size 8 --do_predict --metric_for_best_model accuracy --load_best_model_at_end --evaluation_strategy epoch --save_strategy epoch --save_total_limit 1 +``` + +6. 精度评价指标:Accuracy + + +| model_name | 训练方式 | Accuracy | +| ---------- | ------- | ---- | +| ernie-3.0-base-zh | 微调学习 | 0.5046 | +| ernie-3.0-base-zh | 提示学习 | 0.5521 | + + + +## 3.定制训练 + +下边通过**新闻分类**的例子展示如何使用小样本学习来进行文本分类。 + + +### 3.1 运行环境 + +- python >= 3.7 +- paddlepaddle >= 2.4rc +- paddlenlp >= 2.4.3 +- paddle2onnx >= 1.0.3 + + +### 3.2 代码结构 + +```text +. +├── train.py # 模型组网训练脚本 +├── utils.py # 数据处理工具 +├── infer.py # 模型部署脚本 +└── README.md +``` + + +### 3.3 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano)进行自定义数据标注,本项目也打通了从标注到训练的通道,即doccano导出数据后可通过[doccano.py](../../doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano数据标注指南](../../doccano.md)。 + +**示例数据** + +这里我们使用[FewCLUE](https://github.com/CLUEbenchmark/FewCLUE)中的新闻分类tnews数据集后缀为0的子集作为示例数据集,可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/few-shot/tnews.tar.gz)下载解压并放入`./data/`文件夹,或者运行以下脚本 + +``` +wget https://paddlenlp.bj.bcebos.com/datasets/few-shot/tnews.tar.gz +tar zxvf tnews.tar.gz +mv tnews data +``` + +**数据格式** + +下边主要介绍二/多分类任务自定义数据集的格式要求,整体目录如下 + +```text +data/ +├── train.txt # 训练数据集 +├── dev.txt # 验证数据集 +├── test.txt # 测试数据集(可选) +├── data.txt # 待预测数据(可选) +└── label.txt # 分类标签集 +``` + +**训练/验证/测试数据** + +对于训练/验证/测试数据集文件,每行数据表示一条样本,包括文本和标签两部分,由tab符`\t`分隔。格式如下 +```text +<文本>'\t'<标签> +<文本>'\t'<标签> +... +``` +例如,在新闻分类数据集中 +``` +文登区这些公路及危桥将进入封闭施工,请注意绕行! news_car +普洱茶要如何醒茶? news_culture +... +``` + +**预测数据** + +对于待预测数据文件,每行包含一条待预测样本,无标签。格式如下 +```text +<文本> +<文本> +... +``` +例如,在新闻分类数据集中 +``` +互联网时代如何保护个人信息 +清秋暮雨读柳词:忍把浮名,换了浅斟低唱丨周末读诗 +... +``` + +**标签数据** + +对于分类标签集文件,存储了数据集中所有的标签集合,每行为一个标签名。如果需要自定义标签映射用于分类器初始化,则每行需要包括标签名和相应的映射词,由`==`分隔。格式如下 +```text +<标签>'=='<映射词> +<标签>'=='<映射词> +... +``` +例如,对于新闻分类数据集,原标签`news_car`可被映射为中文`汽车`等等。 +``` +news_car==汽车 +news_culture==文化 +... 
+``` +**Note**: 这里的标签映射词定义遵循的规则是,不同映射词尽可能长度一致,映射词和提示需要尽可能构成通顺的语句。越接近自然语句,小样本下模型训练效果越好。如果原标签名已经可以构成通顺语句,也可以不构造映射词,每行一个标签即可。 + + +### 3.4 模型训练 + +**单卡训练** + +``` +python train.py \ +--device gpu \ +--data_dir ./data \ +--output_dir ./checkpoints/ \ +--prompt "这条新闻标题的主题是" \ +--max_seq_length 128 \ +--learning_rate 3e-6 \ +--ppt_learning_rate 3e-5 \ +--do_train \ +--do_eval \ +--use_rdrop \ +--max_steps 1000 \ +--eval_steps 10 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--load_best_model_at_end True \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--do_predict \ +--do_export +``` +**多卡训练** + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus 0,1,2,3 train.py \ +--data_dir ./data \ +--output_dir ./checkpoints/ \ +--prompt "这条新闻标题的主题是" \ +--max_seq_length 128 \ +--learning_rate 3e-6 \ +--ppt_learning_rate 3e-5 \ +--do_train \ +--do_eval \ +--use_rdrop \ +--do_eval \ +--max_steps 1000 \ +--eval_steps 10 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--load_best_model_at_end True \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--do_predict \ +--do_export +``` + +可配置参数说明: +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 训练数据集路径,数据格式要求详见[数据标注](#数据标注)。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `prompt`: 提示模板。定义了如何将文本和提示拼接结合。 +- `soft_encoder`: 提示向量的编码器,`lstm`表示双向LSTM, `mlp`表示双层线性层, None表示直接使用提示向量。默认为`lstm`。 +- `use_rdrop`: 使用 [R-Drop](https://arxiv.org/abs/2106.14448) 策略。 +- `use_rgl`: 使用 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略。 +- `encoder_hidden_size`: 提示向量的维度。若为None,则使用预训练模型字向量维度。默认为200。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `ppt_learning_rate`: 提示相关参数的基础学习率大小,当预训练参数不固定时,与其共用learning rate scheduler。一般设为`learning_rate`的十倍。 +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `do_predict`: 是否进行预测。 +- `do_export`: 是否在运行结束时将模型导出为静态图,保存路径为`output_dir/export`。 +- `max_steps`: 训练的最大步数。此设置将会覆盖`num_train_epochs`。 +- `save_total_limit`: 模型检查点保存数量。 +- `eval_steps`: 评估模型的间隔步数。 +- `device`: 使用的设备,默认为`gpu`。 +- `logging_steps`: 打印日志的间隔步数。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 + +更多参数介绍可参考[配置文件](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + + +### 3.5 模型评估 + +在模型训练时开启`--do_predict`,训练结束后直接在测试集上`test.txt`进行评估,也可以在训练结束后,通过运行以下命令加载模型参数进行评估: +``` +python train.py --do_predict --data_dir ./data --output_dir ./predict_checkpoint --resume_from_checkpoint ./checkpoints/ --max_seq_length 128 +``` + +可配置参数说明: + +- `data_dir`: 测试数据路径。测试数据应存放在该目录下`test.txt`文件中,每行一条待预测文本。 +- `output_dir`: 日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_predict`: 是否进行预测。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 + + +### 3.6 模型部署 + +#### 模型导出 + +在训练结束后,需要将动态图模型导出为静态图参数用于部署推理。可以在模型训练时开启`--do_export`在训练结束后直接导出,也可以运行以下命令加载并导出训练后的模型参数,默认导出到在`output_dir`指定的目录下。 +``` +python train.py --do_export --data_dir ./data --output_dir ./export_checkpoint --resume_from_checkpoint ./checkpoints/ +``` + +可配置参数说明: + +- `data_dir`: 标签数据路径。 +- `output_dir`: 静态图模型参数和日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_export`: 是否将模型导出为静态图,保存路径为`output_dir/export`。 +- `export_type`: 模型导出的格式,默认为`paddle`,即导出静态图。 + +#### ONNXRuntime部署 + +**运行环境** + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 
7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +- 如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime-gpu onnx onnxconverter-common +``` + +- 如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime +``` + +**CPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device cpu +``` + +**GPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device gpu --device_id 0 +``` + +可配置参数说明: + +- `model_path_prefix`: 导出的静态图模型路径及文件前缀。 +- `model_name`: 内置预训练模型名,用于加载tokenizer。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 待推理数据所在路径,数据应存放在该目录下的`data.txt`文件。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `batch_size`: 每次预测的样本数量。 +- `device`: 选择推理设备,包括`cpu`和`gpu`。默认为`gpu`。 +- `device_id`: 指定GPU设备ID。 +- `use_fp16`: 是否使用半精度加速推理。仅在GPU设备上有效。 +- `num_threads`: 设置CPU使用的线程数。默认为机器上的物理内核数。 + +**Note**: 在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。 + + +## 4. References + +- Liu, Xiao, et al. "GPT understands, too." arXiv preprint arXiv:2103.10385 (2021). [[PDF]](https://arxiv.org/abs/2103.10385) +- Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May. "Warp: Word-level adversarial reprogramming." arXiv preprint arXiv:2101.00121 (2021). [[PDF]](https://arxiv.org/abs/2101.00121) +- Ding, Ning, et al. "Openprompt: An open-source framework for prompt-learning." arXiv preprint arXiv:2111.01998 (2021). [[PDF]](https://arxiv.org/abs/2111.01998) diff --git a/applications/text_classification/multi_class/few-shot/infer.py b/applications/text_classification/multi_class/few-shot/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..6e5b351fdafb916dc1ebed70925158ff187fb106 --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/infer.py @@ -0,0 +1,221 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
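下方 infer.py 中的 `MultiClassPredictor` 封装了模板构造、分词、ONNXRuntime 推理与后处理,其典型调用方式大致如下(仅为示意:路径与参数均为占位,字段名对应脚本的命令行参数,需在 infer.py 同一环境中运行):

```python
from types import SimpleNamespace

# 示意:构造一个与 infer.py 的 argparse 字段对应的参数对象(取值为占位示例)
args = SimpleNamespace(
    model_path_prefix="checkpoints/export/model",  # 导出的静态图前缀,占位路径
    model_name="ernie-3.0-base-zh",
    max_length=128,
    batch_size=32,
    device="cpu",
    device_id=0,
    use_fp16=False,
    num_threads=4,
)

predictor = MultiClassPredictor(args)            # 加载模板、标签词与 ONNX 推理引擎
predictor.predict(["互联网时代如何保护个人信息"])   # 输入为文本列表,内部完成分批与后处理
```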
+
+import argparse
+import json
+import os
+
+import numpy as np
+import onnxruntime as ort
+import paddle2onnx
+import psutil
+import six
+
+from paddlenlp.prompt import AutoTemplate, PromptDataCollatorWithPadding
+from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer
+from paddlenlp.utils.log import logger
+
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.")
+parser.add_argument("--model_name", default="ernie-3.0-base-zh", type=str, help="The name of pretrained model.")
+parser.add_argument("--data_dir", default=None, type=str, help="The path to the prediction data, including label.txt and data.txt.")
+parser.add_argument("--max_length", default=128, type=int, help="The maximum total input sequence length after tokenization.")
+parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.")
+parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.")
+parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu.")
+parser.add_argument("--device", choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
+parser.add_argument("--device_id", default=0, type=int, help="Select which gpu device to train model.")
+args = parser.parse_args()
+# yapf: enable
+
+
+class InferBackend(object):
+    def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10):
+
+        if not isinstance(device, six.string_types):
+            logger.error(
+                ">>> [InferBackend] The type of device must be string, but the type you set is: {}".format(type(device))
+            )
+            exit(1)
+        if device not in ["cpu", "gpu"]:
+            logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to: {}".format(device))
+            exit(1)
+
+        logger.info(">>> [InferBackend] Creating Engine ...")
+
+        onnx_model = paddle2onnx.command.c_paddle_to_onnx(
+            model_file=model_path_prefix + ".pdmodel",
+            params_file=model_path_prefix + ".pdiparams",
+            opset_version=13,
+            enable_onnx_checker=True,
+        )
+        infer_model_dir = model_path_prefix.rsplit("/", 1)[0]
+        float_onnx_file = os.path.join(infer_model_dir, "model.onnx")
+        # The exported ONNX model is a byte string, so write it in binary mode (no encoding argument).
+        with open(float_onnx_file, "wb") as f:
+            f.write(onnx_model)
+
+        if device == "gpu":
+            logger.info(">>> [InferBackend] Use GPU to inference ...")
+            providers = ["CUDAExecutionProvider"]
+            if use_fp16:
+                logger.info(">>> [InferBackend] Use FP16 to inference ...")
+                import onnx
+                from onnxconverter_common import float16
+
+                fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx")
+                onnx_model = onnx.load_model(float_onnx_file)
+                trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True)
+                onnx.save_model(trans_model, fp16_model_file)
+                onnx_model = fp16_model_file
+        else:
+            logger.info(">>> [InferBackend] Use CPU to inference ...")
+            providers = ["CPUExecutionProvider"]
+            if use_fp16:
+                logger.warning(
+                    ">>> [InferBackend] Ignore use_fp16 as it only "
+                    "takes effect when deploying on gpu..."
+ ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +class MultiClassPredictor(object): + def __init__(self, args): + self.args = args + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name) + self.model = AutoModelForMaskedLM.from_pretrained(args.model_name) + self.template, self.labels, self.input_handles = self.post_init() + self.collate_fn = PromptDataCollatorWithPadding( + self.tokenizer, padding=True, return_tensors="np", return_attention_mask=True + ) + + self.inference_backend = InferBackend( + self.args.model_path_prefix, + self.args.device, + self.args.device_id, + self.args.use_fp16, + self.args.num_threads, + ) + + def post_init(self): + export_path = os.path.dirname(self.args.model_path_prefix) + template_path = os.path.join(export_path, "template_config.json") + with open(template_path, "r", encoding="utf-8") as fp: + prompt = json.load(fp) + template = AutoTemplate.create_from(prompt, self.tokenizer, self.args.max_length, self.model) + keywords = template.extract_template_keywords(template.prompt) + inputs = ["input_ids", "token_type_ids", "position_ids", "attention_mask"] + if "mask" in keywords: + inputs.append("masked_positions") + if "soft" in keywords: + inputs.append("soft_token_ids") + if "encoder" in keywords: + inputs.append("encoder_ids") + verbalizer_path = os.path.join(export_path, "verbalizer_config.json") + with open(verbalizer_path, "r", encoding="utf-8") as fp: + label_words = json.load(fp) + labels = sorted(list(label_words.keys())) + + return template, labels, inputs + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, inputs): + num_sample = len(inputs) + infer_data = None + num_infer_data = None + for index in range(0, num_sample, self.args.batch_size): + left, right = index, index + self.args.batch_size + batch_dict = self.collate_fn(inputs[left:right]) + input_dict = {} + for key in self.input_handles: + value = batch_dict[key] + if key == "attention_mask": + if value.ndim == 2: + value = (1 - value[:, np.newaxis, np.newaxis, :]) * -1e4 + elif value.ndim != 4: + raise ValueError("Expect attention mask with ndim=2 or 4, but get ndim={}".format(value.ndim)) + value = value.astype("float32") + else: + value = value.astype("int64") + input_dict[key] = value + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) 
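+        # Stitch the per-batch outputs back together so each output tensor covers all samples.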
+ for i in range(num_infer_data): + infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def preprocess(self, input_data: list): + text = [{"text_a": x} for x in input_data] + inputs = [self.template(x) for x in text] + return inputs + + def postprocess(self, infer_data): + preds = np.argmax(infer_data[0], axis=-1) + labels = [self.labels[x] for x in preds] + return {"label": labels} + + def printer(self, result, input_data): + label = result["label"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}".format(label[i])) + logger.info("-----------------------------") + + +if __name__ == "__main__": + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + predictor = MultiClassPredictor(args) + + text_dir = os.path.join(args.data_dir, "data.txt") + with open(text_dir, "r", encoding="utf-8") as f: + text_list = [x.strip() for x in f.readlines()] + + predictor.predict(text_list) diff --git a/applications/text_classification/multi_class/few-shot/requirements_cpu.txt b/applications/text_classification/multi_class/few-shot/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbe76e363f00631d66e0733833813cad5991f009 --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/requirements_cpu.txt @@ -0,0 +1,5 @@ +psutil +paddlepaddle>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime diff --git a/applications/text_classification/multi_class/few-shot/requirements_gpu.txt b/applications/text_classification/multi_class/few-shot/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..66454bd8b6b5fe08521215d4a5c2e7242225d869 --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/requirements_gpu.txt @@ -0,0 +1,7 @@ +psutil +paddlepaddle-gpu>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime-gpu +onnx +onnxconverter-common diff --git a/applications/text_classification/multi_class/few-shot/train.py b/applications/text_classification/multi_class/few-shot/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ac06587de017158f9388fb637657c12285cf94ce --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/train.py @@ -0,0 +1,132 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
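+
+# Example invocation (illustrative; it mirrors the single-GPU command documented in this
+# directory's README, which also lists the remaining prompt-tuning arguments):
+#   python train.py --device gpu --data_dir ./data --output_dir ./checkpoints/ \
+#       --prompt "这条新闻标题的主题是" --max_seq_length 128 \
+#       --learning_rate 3e-6 --ppt_learning_rate 3e-5 \
+#       --do_train --do_eval --do_predict --do_export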
+ +import os +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +from paddle.metric import Accuracy +from utils import load_local_dataset + +from paddlenlp.prompt import ( + AutoTemplate, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftVerbalizer, +) +from paddlenlp.trainer import EarlyStoppingCallback, PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + data_dir: str = field(default="./data/", metadata={"help": "Path to a dataset which includes train.txt, dev.txt, test.txt, label.txt and data.txt (optional)."}) + prompt: str = field(default=None, metadata={"help": "The input prompt for tuning."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-base-zh", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Load the pretrained language model. + model = AutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Define the template for preprocess and the verbalizer for postprocess. + template = AutoTemplate.create_from(data_args.prompt, tokenizer, training_args.max_seq_length, model=model) + logger.info("Using template: {}".format(template.prompt)) + + label_file = os.path.join(data_args.data_dir, "label.txt") + with open(label_file, "r", encoding="utf-8") as fp: + label_words = defaultdict(list) + for line in fp: + data = line.strip().split("==") + word = data[1] if len(data) > 1 else data[0].split("##")[-1] + label_words[data[0]].append(word) + verbalizer = SoftVerbalizer(label_words, tokenizer, model) + + # Load the few-shot datasets. + train_ds, dev_ds, test_ds = load_local_dataset( + data_path=data_args.data_dir, splits=["train", "dev", "test"], label_list=verbalizer.labels_to_ids + ) + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds): + metric = Accuracy() + correct = metric.compute(paddle.to_tensor(eval_preds.predictions), paddle.to_tensor(eval_preds.label_ids)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Deine the early-stopping callback. + callbacks = [EarlyStoppingCallback(early_stopping_patience=4, early_stopping_threshold=0.0)] + + # Initialize the trainer. + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + compute_metrics=compute_metrics, + ) + + # Traininig. 
+ if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Prediction. + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Export static model. + if training_args.do_export: + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_class/few-shot/utils.py b/applications/text_classification/multi_class/few-shot/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8a92a9a697eeeeac130b4d6354836471df7a76fd --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/utils.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from paddlenlp.datasets import load_dataset + + +def load_local_dataset(data_path, splits, label_list): + """ + Read datasets from files. + + Args: + data_path (str): + Path to the dataset directory, including label.txt, train.txt, + dev.txt, test.txt (and data.txt). + splits (list): + Which file(s) to load, such as ['train', 'dev', 'test']. + label_list(dict): + A dictionary to encode labels as ids, which should be compatible + with that of verbalizer. + """ + + def _reader(data_file, label_list): + with open(data_file, "r", encoding="utf-8") as fp: + for idx, line in enumerate(fp): + data = line.strip().split("\t") + if len(data) == 1: + yield {"text_a": data[0]} + else: + text, label = data + yield {"text_a": text, "labels": label_list[label]} + + assert isinstance(splits, list) and len(splits) > 0 + + split_map = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"} + + dataset = [] + for split in splits: + data_file = os.path.join(data_path, split_map[split]) + dataset.append(load_dataset(_reader, data_file=data_file, label_list=label_list, lazy=False)) + return dataset diff --git a/applications/text_classification/multi_class/retrieval_based/README.md b/applications/text_classification/multi_class/retrieval_based/README.md new file mode 100644 index 0000000000000000000000000000000000000000..10602a07b356e37c6c2d8e22929f06252b9ef2e6 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/README.md @@ -0,0 +1,457 @@ +# 基于检索的文本分类方法 + + **目录** + +* [1. 基于语义索引的分类任务介绍](#基于语义索引的分类任务介绍) +* [2. 代码结构说明](#代码结构说明) +* [3. 环境准备](#环境准备) +* [4. 数据准备](#数据准备) +* [5. 模型训练](#模型训练) +* [6. 模型预测](#模型预测) +* [7. 模型部署](#模型部署) +* [8. 
分类流程](#分类流程) + + + +# 1.基于语义索引的分类任务介绍 + +以前的分类任务中,标签信息作为无实际意义,独立存在的one-hot编码形式存在,这种做法会潜在的丢失标签的语义信息,本方案把文本分类任务中的标签信息转换成含有语义信息的语义向量,将文本分类任务转换成向量检索和匹配的任务。这样做的好处是对于一些类别标签不是很固定的场景,或者需要经常有一些新增类别的需求的情况非常合适。另外,对于一些新的相关的分类任务,这种方法也不需要模型重新学习或者设计一种新的模型结构来适应新的任务。总的来说,这种基于检索的文本分类方法能够有很好的拓展性,能够利用标签里面包含的语义信息,不需要重新进行学习。这种方法可以应用到相似标签推荐,文本标签标注,金融风险事件分类,政务信访分类等领域。 + +本方案是基于语义索引模型的分类,语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。基于语义索引的分类方法有两种,第一种方法是直接把标签变成召回库,即把输入文本和标签的文本进行匹配,第二种是利用召回的文本带有类别标签,把召回文本的类别标签作为给定输入文本的类别。本方案使用双塔模型,训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,并把标签作为召回库,进行召回测试。最后利用召回的结果使用 Accuracy 指标来评估语义索引模型的分类的效果。 + + + + +## 2. 代码结构说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train.py # In-batch Negatives 策略的训练主脚本 +|—— model.py # In-batch Negatives 策略核心网络结构 + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 + |—— run.sh # 构建Milvus向量的 bash 版本 +|—— utils + ├── config.py # Milvus 的配置文件 + ├── feature_extract.py # 向量抽取文件 + ├── milvus_util.py # Milvus 的配置文件 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 + +``` + + + +## 3. 环境准备 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.3.1 +* paddlenlp >= 2.3.4 +* hnswlib >= 0.5.2 +* visualdl >= 2.2.2 + +``` +pip install -r requirements.txt +``` + + + +## 4. 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)进行文本分类数据标注。 + +**指定格式本地数据集目录结构** + +``` +├── data # 数据集目录 + ├── label.txt # 标签集 + ├── dev.txt # 验证集 + ├── train.txt # 训练集 +``` + +**训练、开发、测试数据集** + +train.txt(训练数据集文件), dev.txt(开发数据集文件),test.txt(可选,测试数据集文件),文件中文本与标签类别名用tab符`'\t'`分隔开。训练集指用于训练模型的数据;开发集指用于评测模型表现的数据,可以根据模型在开发集上的精度调整训练参数和模型;测试集用于测试模型表现,没有测试集时可以使用开发集代替。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签> +<文本>'\t'<标签> +... +``` +- train.txt/dev.txt/test.txt 文件样例: +```text +青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的 青岛 +谈谈去西安旅游,哪些地方让你觉得不虚此行? 旅行 +上古卷轴5有哪些奇葩玩法? 单机游戏 +... +``` +**分类标签** + +label.txt(分类标签文件)记录数据集中所有标签集合,每一行为一个标签名。 +- label.txt 文件格式: +```text +<标签> +<标签> +... +``` +- label.txt 文件样例: +```text +上海 +恋爱 +动画电影(Animated film) +狮 +生活 +汽车 +... +``` + + + +## 5. 
模型训练
+
+我们使用百科知识问答的数据来构建训练集、开发集。
+
+**训练集(train.txt)** 和 **开发集(dev.txt)** 格式一致,训练集61510条,开发集6835条,每行由文本的标题、内容和类别标签组成,以tab符分割,第一列是问题的标题和描述拼接,剩下的列是问题的类别,另外,标签有5981个。
+**召回库(label.txt)** 召回库的构建有2种方式,第一种是把所有的类别标签当成召回库,第二种是把训练集当成召回集合,我们以第一种为例。
+
+数据集选择的是百科问答数据集的一个子集,问答数据集详情请参考[nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus)。
+
+- [webtext2019zh_topic](https://paddlenlp.bj.bcebos.com/applications/webtext2019zh_qa.zip)
+
+```
+wget https://paddlenlp.bj.bcebos.com/applications/webtext2019zh_qa.zip
+unzip webtext2019zh_qa.zip
+```
+
+### 单机单卡训练/单机多卡训练
+
+这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1 卡;如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。
+
+如果使用CPU进行训练,则需要把`--gpus`参数去除,然后把`device`设置成cpu即可,详细请参考train.sh文件的训练设置。
+
+然后运行下面的命令使用GPU训练,得到语义索引模型:
+
+```
+root_path=inbatch
+data_path=data
+python -u -m paddle.distributed.launch --gpus "0,1" \
+    train.py \
+    --device gpu \
+    --save_dir ./checkpoints/${root_path} \
+    --batch_size 24 \
+    --learning_rate 5E-5 \
+    --epochs 100 \
+    --output_emb_size 0 \
+    --save_steps 50 \
+    --max_seq_length 384 \
+    --warmup_proportion 0.0 \
+    --margin 0.2 \
+    --recall_result_dir "recall_result_dir" \
+    --recall_result_file "recall_result.txt" \
+    --train_set_file ${data_path}/train.txt \
+    --corpus_file ${data_path}/label.txt \
+    --similar_text_pair_file ${data_path}/dev.txt \
+    --evaluate True
+```
+
+参数含义说明
+
+* `device`: 使用 cpu/gpu 进行训练
+* `save_dir`: 模型存储路径
+* `batch_size`: 训练的batch size的大小
+* `learning_rate`: 训练的学习率的大小
+* `epochs`: 训练的epoch数
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数
+* `max_seq_length`: 输入序列的最大长度
+* `margin`: 正样本相似度与负样本之间的目标 Gap
+* `train_set_file`: 训练集文件
+* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启
+* `recall_result_dir`: 召回结果存储目录
+* `recall_result_file`: 召回结果的文件名
+* `hnsw_m`: hnsw 算法相关参数,保持默认即可
+* `hnsw_ef`: hnsw 算法相关参数,保持默认即可
+* `recall_num`: 对 1 个文本召回的相似文本数量
+* `similar_text_pair`: 由相似文本对构成的评估集
+* `corpus_file`: 召回库数据 corpus_file
+
+也可以使用bash脚本:
+
+```
+sh scripts/train.sh
+```
+
+
+
+## 6. 模型预测
+
+我们可以基于语义索引模型计算文本和标签的语义相似度。
+
+
+### 开始预测
+
+加载训练的语义索引模型,然后计算文本和标签的语义相似度:
+
+```
+root_dir="checkpoints/inbatch/model_best"
+python -u -m paddle.distributed.launch --gpus "0" \
+    predict.py \
+    --device gpu \
+    --params_path "${root_dir}/model_state.pdparams" \
+    --model_name_or_path rocketqa-zh-dureader-query-encoder \
+    --output_emb_size 0 \
+    --batch_size 128 \
+    --max_seq_length 384 \
+    --text_pair_file "data/dev.txt"
+```
+
+参数含义说明
+* `device`: 使用 cpu/gpu 进行训练
+* `params_path`: 预训练模型的参数文件名
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。
+* `text_pair_file`: 由文本 Pair 构成的待预测数据集
+
+也可以运行下面的bash脚本:
+
+```
+sh scripts/predict.sh
+```
+predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本。
+
+产出如下结果:
+```
+0.8841502070426941
+0.7834227681159973
+0.04591505229473114
+0.15116563439369202
+......
+```
+
+
+
+## 7. 模型部署
+
+### 动转静导出
+
+首先把动态图模型转换为静态图:
+
+```
+python export_model.py \
+    --params_path checkpoints/inbatch/model_best/model_state.pdparams \
+    --model_name_or_path rocketqa-zh-dureader-query-encoder \
+    --output_path=./output
+```
+也可以运行下面的bash脚本:
+
+```
+sh scripts/export_model.sh
+```
+
+### Paddle Inference预测
+
+预测既可以抽取向量,也可以计算两个文本的相似度。
+
+修改deploy/python/predict.py中的id2corpus和corpus_list的样本:
+
+```
+# 抽取向量
+id2corpus = {
+    0: {
+        "sentence":
+        "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"
+    }
+}
+# 计算文本和类别的相似度
+corpus_list = [{
+    "sentence":
+    "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的?",
+    'label': '青岛'
+}, {
+    "sentence":
+    "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的",
+    'label': '单机游戏'
+}]
+
+```
+
+然后使用 Paddle Inference 进行预测:
+
+```
+python deploy/python/predict.py --model_dir=./output
+```
+也可以运行下面的bash脚本:
+
+```
+sh deploy.sh
+```
+最终输出的是768维的特征向量和句子对的预测概率:
+
+```
+(1, 768)
+(1, 768)
+[[-0.02010613  0.01188739  0.02152571  0.055634   -0.05920463 -0.06134057
+   0.05532542 -0.01561351  0.02738068  0.05340409  0.06015684 -0.01133287
+  ....
+
+[0.6114088296890259, 0.08313259482383728]
+```
+
+### 向量引擎
+
+模型准备结束以后,开始搭建 Milvus 的向量检索引擎,用于文本语义向量的快速检索。本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)。本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。
+
+
+Milvus 系统搭建完以后就可以插入和检索向量了。首先生成 embedding 向量,每个样本生成768维度的向量:
+
+```
+CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \
+    --data_name label \
+    --model_dir ./output \
+    --output_dir data \
+    --corpus_file "./data/label.txt"
+```
+其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。
+
+然后向搭建好的 Milvus 系统插入向量:
+
+```
+python utils/vector_insert.py \
+    --vector_path ./data/label_embedding.npy
+```
+也可以直接运行:
+
+```bash
+sh scripts/run.sh
+```
+
+### Paddle Serving部署
+
+Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式:
+
+```
+python export_to_serving.py \
+    --dirname "output" \
+    --model_filename "inference.get_pooled_embedding.pdmodel" \
+    --params_filename "inference.get_pooled_embedding.pdiparams" \
+    --server_path "./serving_server" \
+    --client_path "./serving_client" \
+    --fetch_alias_names "output_embedding"
+```
+
+参数含义说明
+* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。
+* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名
+* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None
+* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server
+* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client
+* `fetch_alias_names`: 模型输出的别名设置,比如输出的 output_embedding 等,都可以重新指定成其他名字,默认不指定
+* `feed_alias_names`: 模型输入的别名设置,比如输入的 input_ids 等,都可以重新指定成其他名字,默认不指定
+
+也可以运行下面的 bash 脚本:
+```
+sh scripts/export_to_serving.sh
+```
+
+Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法:
+
+#### Pipeline方式
+
+启动 Pipeline Server:
+
+```
+cd deploy/python/
+python web_service.py
+```
+
+启动客户端调用 Server,可以使用 POST 或 rpc 的方式。
+
+向服务端发送 POST 请求示例:
+
+```
+curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["{\"sentence\": \"青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的\"}"]}'
+```
+
+也可以使用 rpc 的方式:
+首先修改rpc_client.py中需要预测的样本:
+
+```
+list_data = [{
+    "sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"
+}]
+```
+然后运行:
+
+```
+python rpc_client.py
+```
+模型的输出为:
+
+```
+PipelineClient::predict pack_data time:1658988633.3673246
+PipelineClient::predict before time:1658988633.3678396 +time to cost :0.014188766479492188 seconds +['output_embedding'] +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + ...... +``` + +可以看到客户端发送了1条文本,返回这个 embedding 向量 + + + +## 8. 分类流程 + +基于检索的分类系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = [{"sentence": "谈谈去西安旅游,哪些地方让你觉得不虚此行?"}] +``` +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1658988661.507715 +PipelineClient::predict before time:1658988661.5081818 +Extract feature time to cost :0.02322244644165039 seconds +Search milvus time cost is 0.06801486015319824 seconds +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 旅行 0.3969537019729614 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 西安 0.7750667333602905 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 陕西 0.8064634799957275 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 火车上 0.8384211659431458 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 山西 0.9251932501792908 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来文本和对应的标签,通过设定阈值等方式可以得到最终的标签。 + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/text_classification/multi_class/retrieval_based/base_model.py b/applications/text_classification/multi_class/retrieval_based/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..56aa3ba50e189281c35d41e8819014f56d8e53f4 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/base_model.py @@ -0,0 +1,153 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
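+
+# The classes below implement the shared dual-tower encoder used by the retrieval-based
+# classifier: pool the [CLS] vector, optionally project it to `output_emb_size`, L2-normalize
+# it, and score a query against a title/label text by cosine similarity.
+# Minimal usage sketch (illustrative only; the training and prediction scripts in this
+# directory build on these base classes):
+#   pretrained = AutoModel.from_pretrained("rocketqa-zh-dureader-query-encoder")
+#   encoder = SemanticIndexBaseStatic(pretrained, output_emb_size=0)  # 0 keeps the raw 768-dim vector
+#   embedding = encoder.get_pooled_embedding(input_ids, token_type_ids)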
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + 
input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/text_classification/multi_class/retrieval_based/data.py b/applications/text_classification/multi_class/retrieval_based/data.py new file mode 100644 index 0000000000000000000000000000000000000000..17081543edaa967725453ecec99e760bb9f94f90 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/data.py @@ -0,0 +1,224 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import hnswlib +import numpy as np +import paddle +from paddlenlp.utils.log import logger + + +def build_index(corpus_data_loader, model, output_emb_size, hnsw_max_elements, hnsw_ef, hnsw_m): + + index = hnswlib.Index(space="ip", dim=output_emb_size if output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + logger.info("start build index..........") + all_embeddings = [] + for text_embeddings in model.get_semantic_embedding(corpus_data_loader): + all_embeddings.append(text_embeddings.numpy()) + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + logger.info("Total index number:{}".format(index.get_current_count())) + return index + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_corpus_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. 
+ token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_label_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + yield {"sentence": data[0], "label": data[1]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succeed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset ann_data_file:{}".format(latest_ann_data_file)) + return 
latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + splited_line = line.rstrip().split("\t") + text, similar_text = splited_line[0], ",".join(splited_line[1:]) + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/config_nlp.yml b/applications/text_classification/multi_class/retrieval_based/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..236c3802002e80075ab55ed14461b0bde9fd545c --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/deploy.sh b/applications/text_classification/multi_class/retrieval_based/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/predict.py b/applications/text_classification/multi_class/retrieval_based/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..a94045fcb31ff04582406cb481389af358a06cb6 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/predict.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_query_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. 
+ tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + encoded_inputs = tokenizer( + text=example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + truncation_strategy="longest_first", + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + examples = [] + for idx, text in data.items(): + print(text) + input_ids, segment_ids = convert_query_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids, title_ids, title_segment_ids = convert_example(text, tokenizer) + + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + id2corpus = {0: {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"}} + res = predictor.extract_embedding(id2corpus, tokenizer) + print(res.shape) + print(res) + corpus_list = [ + {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的?", "label": "青岛"}, + {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的", "label": "单机游戏"}, + ] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/rpc_client.py b/applications/text_classification/multi_class/retrieval_based/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..afb13b803f65fb995b5c33a5ab5a069bee030717 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/rpc_client.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = [{"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/web_service.py b/applications/text_classification/multi_class/retrieval_based/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..df054797d51ec195c6f23ad1c144aa4f6aed43d1 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
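+
+# Starts the Paddle Serving pipeline service for embedding extraction (run from deploy/python/
+# after exporting the serving model). The HTTP/RPC ports and model path come from config_nlp.yml.
+# Illustrative request once the service is up (same example as in the README):
+#   curl -X POST -k http://localhost:8090/ernie/prediction \
+#       -d '{"key": ["0"], "value": ["{\"sentence\": \"青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的\"}"]}'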
+ +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer( + text=text["sentence"], max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = eval(input_dict[str(i)]) + input_ids, segment_ids = convert_example([example], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/text_classification/multi_class/retrieval_based/evaluate.py b/applications/text_classification/multi_class/retrieval_based/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..bff2cfb814aa5c25442b2bda9f13561f72036cae --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
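+
+# Computes Recall@N from the recall results. The recall result file is read in
+# blocks of --recall_num lines (one block per query); a recalled item counts as
+# a hit when its label equals the ground-truth label of that query taken from
+# --similar_text_pair. Recall@{1, 5, 10, 20, 50} is printed and appended, with a
+# timestamp, to result.tsv.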
+
+import argparse
+import time
+
+import numpy as np
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--similar_text_pair", type=str, default="", help="The full path of similar pair file")
+parser.add_argument("--recall_result_file", type=str, default="", help="The full path of recall result file")
+parser.add_argument(
+    "--recall_num", type=int, default=10, help="Most similar number of doc recalled from corpus per query"
+)
+args = parser.parse_args()
+
+
+def recall(rs, N=10):
+    """
+    Ratio of queries whose ground truth appears in the top-N recalled docs.
+    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+    >>> recall(rs, N=1)
+    0.333333
+    >>> recall(rs, N=2)
+    0.6666667
+    >>> recall(rs, N=3)
+    1.0
+    Args:
+        rs: Iterator of per-query relevance flags (1 if the ground truth is recalled at that rank, else 0)
+    Returns:
+        Recall@N
+    """
+
+    recall_flags = [np.sum(r[0:N]) for r in rs]
+    return np.mean(recall_flags)
+
+
+if __name__ == "__main__":
+    text2similar = {}
+    with open(args.similar_text_pair, "r", encoding="utf-8") as f:
+        for line in f:
+            text, similar_text = line.rstrip().rsplit("\t", 1)
+            text2similar[text] = similar_text
+
+    rs = []
+    with open(args.recall_result_file, "r", encoding="utf-8") as f:
+        relevance_labels = []
+        for index, line in enumerate(f):
+
+            if index % args.recall_num == 0 and index != 0:
+                rs.append(relevance_labels)
+                relevance_labels = []
+            text_arr = line.rstrip().split("\t")
+            text_title, text_para, recalled_title, recalled_para, label, cosine_sim = text_arr
+            if text2similar["\t".join([text_title, text_para])] == label:
+                relevance_labels.append(1)
+            else:
+                relevance_labels.append(0)
+
+    recall_N = []
+    recall_num = [1, 5, 10, 20, 50]
+    for topN in recall_num:
+        R = round(100 * recall(rs, N=topN), 3)
+        recall_N.append(str(R))
+    result = open("result.tsv", "a")
+    res = []
+    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
+    res.append(timestamp)
+    for key, val in zip(recall_num, recall_N):
+        print("recall@{}={}".format(key, val))
+        res.append(str(val))
+    result.write("\t".join(res) + "\n")
diff --git a/applications/text_classification/multi_class/retrieval_based/export_model.py b/applications/text_classification/multi_class/retrieval_based/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac06c79a8f971e5cdbeede11c99c9f16d6e59520
--- /dev/null
+++ b/applications/text_classification/multi_class/retrieval_based/export_model.py
@@ -0,0 +1,54 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
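+
+# Exports the trained dynamic-graph checkpoint (--params_path) to a static
+# inference graph via paddle.jit.to_static / paddle.jit.save so it can be used
+# by the deployment and feature-extraction scripts. See scripts/export_model.sh
+# for a reference command.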
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--output_emb_size", default=0, type=int, help="output_embedding_size") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/multi_class/retrieval_based/export_to_serving.py b/applications/text_classification/multi_class/retrieval_based/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. 
It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.")
+parser.add_argument("--server_path", type=str, default='./serving_server',
+                    help="The path of server parameter in static graph to be saved.")
+parser.add_argument("--client_path", type=str, default='./serving_client',
+                    help="The path of client parameter in static graph to be saved.")
+parser.add_argument("--feed_alias_names", type=str, default=None,
+                    help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars')
+parser.add_argument("--fetch_alias_names", type=str, default=None,
+                    help='set alias names for fetch vars, split by comma \',\', you should run --show_proto to check the number of fetch vars')
+parser.add_argument("--show_proto", type=bool, default=False,
+                    help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.')
+# yapf: enable
+
+if __name__ == "__main__":
+    args = parser.parse_args()
+    serving_io.inference_model_to_serving(
+        dirname=args.dirname,
+        serving_server=args.server_path,
+        serving_client=args.client_path,
+        model_filename=args.model_filename,
+        params_filename=args.params_filename,
+        show_proto=args.show_proto,
+        feed_alias_names=args.feed_alias_names,
+        fetch_alias_names=args.fetch_alias_names,
+    )
diff --git a/applications/text_classification/multi_class/retrieval_based/model.py b/applications/text_classification/multi_class/retrieval_based/model.py
new file mode 100644
index 0000000000000000000000000000000000000000..25570c8dfa1725213a9ff0b463bde4fcabbd3ab9
--- /dev/null
+++ b/applications/text_classification/multi_class/retrieval_based/model.py
@@ -0,0 +1,65 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
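+
+# In-batch negatives training objective: for a batch of (query, title) pairs the
+# model embeds both sides, builds the [batch_size, batch_size] cosine similarity
+# matrix with a matmul, subtracts `margin` from the diagonal (the positive
+# pairs), multiplies by `scale`, and minimizes a cross-entropy loss whose target
+# for row i is column i, so every other title in the batch serves as a negative
+# sample for query i.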
+ +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # subtract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/text_classification/multi_class/retrieval_based/predict.py b/applications/text_classification/multi_class/retrieval_based/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..906fdf3519c3aff341c81bcb0bb47f4245b91daf --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/predict.py @@ -0,0 +1,100 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + model.eval() + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + cosine_sims = np.concatenate(cosine_sims, axis=0) + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) + if idx > 5: + break diff --git a/applications/text_classification/multi_class/retrieval_based/recall.py b/applications/text_classification/multi_class/retrieval_based/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..1d6f49ae9f7c92156fdc5d8d0c338e2339221072 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/recall.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from base_model import SemanticIndexBase +from data import ( + build_index, + convert_corpus_example, + create_dataloader, + gen_id2corpus, + gen_text_file, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_corpus_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = 
len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/text_classification/multi_class/retrieval_based/requirements.txt b/applications/text_classification/multi_class/retrieval_based/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6657b02e7b0c9a430659394b3398f575cae4ea91 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/requirements.txt @@ -0,0 +1,11 @@ +pymilvus==1.1.2 +pandas==0.25.1 +paddlenlp>=2.3.4 +paddlepaddle-gpu>=2.3.0 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +paddle-serving-app>=0.7.0 +paddle-serving-client>=0.7.0 +paddle-serving-server-gpu>=0.7.0.post102 +pybind11 \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/run_system.py b/applications/text_classification/multi_class/retrieval_based/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..27e71c6ecc9865e3665af7a065f8e55076119e0b --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/run_system.py @@ -0,0 +1,63 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
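+
+# End-to-end retrieval demo: the query sentence is sent to the Paddle Serving
+# pipeline at 127.0.0.1:8080 to obtain its embedding, the embedding is used to
+# search the Milvus collection / partition configured in utils/config.py, and
+# the recalled labels are printed sorted by ascending distance.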
+ +import sys +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("utils") +from utils.config import collection_name, partition_tag # noqa: E402 +from utils.milvus_util import RecallByMilvus # noqa: E402 + + +def search_in_milvus(text_embedding, corpus_file, query_text): + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "label", "distance"]) + df = df.sort_values(by="distance", ascending=True) + for index, row in df.iterrows(): + print(row["query_text"], row["label"], row["distance"]) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + corpus_file = "data/label.txt" + list_data = [{"sentence": "谈谈去西安旅游,哪些地方让你觉得不虚此行?"}] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, corpus_file, list_data[0]) diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/evaluate.sh b/applications/text_classification/multi_class/retrieval_based/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..2da2c025cc7447965b676fb73ec2d548b8e342c2 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u evaluate.py \ + --similar_text_pair "data/dev.txt" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/export_model.sh b/applications/text_classification/multi_class/retrieval_based/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..188e3a9bdf383e40f36ba3c7c5bb015ad6cdcddd --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/export_model.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --params_path checkpoints/inbatch/model_best/model_state.pdparams \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_path=./output diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/export_to_serving.sh b/applications/text_classification/multi_class/retrieval_based/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/predict.sh b/applications/text_classification/multi_class/retrieval_based/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..b5a14d480ae64554125e86462f3632bdbf3a09bd --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/predict.sh @@ -0,0 +1,38 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
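+
+# Computes cosine similarities for the text pairs in data/dev.txt with the best
+# checkpoint (the script prints the first few scores). The GPU command is
+# active below; an equivalent CPU (gloo) variant is kept commented out.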
+ +# gpu version +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/dev.txt" + + +# cpu +# root_dir="checkpoints/inbatch/model_best" +# python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" \ +# predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_state.pdparams" \ +# --output_emb_size 0 \ +# --model_name_or_path rocketqa-zh-dureader-query-encoder \ +# --batch_size 128 \ +# --max_seq_length 384 \ +# --text_pair_file "data/dev.txt" diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/run.sh b/applications/text_classification/multi_class/retrieval_based/scripts/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..c4c990729c26e0c9fd00e4420ebe1810abd00984 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/run.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" + +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/run_build_index.sh b/applications/text_classification/multi_class/retrieval_based/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..7d75a8daad62a9f2ca482c354c7bad54788862c8 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/run_build_index.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
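+
+# Builds an HNSW (hnswlib) index over data/train.txt with the trained encoder
+# and recalls the top 50 candidates for every query in data/dev.txt; results
+# are written to recall_result_dir/recall_result.txt for scripts/evaluate.sh.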
+ +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "1" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_best/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 0 \ + --max_seq_length 384 \ + --recall_num 50 \ + --similar_text_pair "data/dev.txt" \ + --corpus_file "data/train.txt" \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/train.sh b/applications/text_classification/multi_class/retrieval_based/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..2cef4abcddac47f91a0d98ce701375d910879f27 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/train.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU training +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True \ + --model_name_or_path rocketqa-zh-dureader-query-encoder diff --git a/applications/text_classification/multi_class/retrieval_based/train.py b/applications/text_classification/multi_class/retrieval_based/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e0e1cded718c1df6ce738c785dfc86a3d5fde568 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/train.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
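+
+# Trains the semantic index model with the in-batch negatives strategy defined
+# in model.py. With --evaluate True, the dev queries are recalled against the
+# label corpus after every epoch and the checkpoint with the best Recall@1 is
+# kept in <save_dir>/model_best; otherwise a checkpoint is written every
+# --save_steps steps. See scripts/train.sh for a reference command.
+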
+import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + build_index, + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + read_text_pair, +) +from model import SemanticIndexBatchNeg + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=256, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=5E-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Interval steps to save checkpoint") +parser.add_argument('--log_steps', type=int, default=10, help="Interval steps to print log") +parser.add_argument("--train_set_file", type=str, default='./data/train.txt', help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.2, type=float, help="Margin between pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--corpus_file", type=str, default='./data/label.txt', help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, default='./data/dev.txt', help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_init.txt', help="The file name of recall result") +parser.add_argument("--recall_num", default=50, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', help="evaluate_result") +parser.add_argument('--evaluate', default=True, type=eval, 
choices=[True, False], help='whether evaluate while training') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def recall(rs, N=10): + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus): + # Load pretrained semantic model + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text_arr = line.rstrip().rsplit("\t") + text, similar_text = text_arr[0], text_arr[1] + text2similar[text] = similar_text + rs = [] + with open(recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text_arr = line.rstrip().rsplit("\t") + text, similar_text, cosine_sim = text_arr + if text2similar[text] == similar_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + evaluate_result_file = os.path.join(args.recall_result_dir, args.evaluate_result) + result = open(evaluate_result_file, "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + return float(recall_N[0]) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # 
title_segment + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + batchify_fn_dev = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + if args.evaluate: + eval_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=eval_func + ) + # convert_corpus_example + query_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + text_list, _ = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=query_func + ) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByNorm(clip_norm=1.0), + ) + global_step = 0 + best_recall = 0.0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if not args.evaluate and rank == 0: + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + if args.evaluate and rank == 0: + print("evaluating") + recall_5 = evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus) + if recall_5 > best_recall: + best_recall = recall_5 + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp: + fp.write("epoch=%d, global_step: %d, recall: %s\n" % (epoch, global_step, recall_5)) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_classification/multi_class/retrieval_based/utils/__init__.py b/applications/text_classification/multi_class/retrieval_based/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
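+
+# Marks utils/ as a package so config.py and milvus_util.py can be imported by
+# the retrieval scripts.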
diff --git a/applications/text_classification/multi_class/retrieval_based/utils/config.py b/applications/text_classification/multi_class/retrieval_based/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..0784ba7410aa69f265e511893ce08f74b97088a1 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/config.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import IndexType, MetricType + +MILVUS_HOST = "10.21.226.173" +MILVUS_PORT = 8530 + +output_emb_size = 0 + +collection_param = { + "dimension": output_emb_size if output_emb_size > 0 else 768, + "index_file_size": 256, + "metric_type": MetricType.L2, +} + +index_type = IndexType.FLAT +index_param = {"nlist": 1000} + +top_k = 20 +search_param = {"nprobe": 20} + +collection_name = "text" +partition_tag = "partition_2" diff --git a/applications/text_classification/multi_class/retrieval_based/utils/feature_extract.py b/applications/text_classification/multi_class/retrieval_based/utils/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..171253b0d1bc1882ec165e7af640e00c8ee1ff68 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/feature_extract.py @@ -0,0 +1,193 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--output_dir", type=str, required=True, help="The output path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
+parser.add_argument("--data_name", type=str, required=True, help="The dataset name.")
+parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.')
+parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.')
+parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training')
+parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.')
+parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.')
+args = parser.parse_args()
+# fmt: on
+
+
+def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False):
+    """
+    Builds model inputs from a sequence.
+
+    A BERT sequence has the following format:
+    - single sequence: ``[CLS] X [SEP]``
+    Args:
+        example(obj:`str`): The raw text to be converted to ids.
+        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
+            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
+        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
+            Sequences longer than this will be truncated, sequences shorter will be padded.
+        pad_to_max_seq_len(obj:`bool`, defaults to `False`): Whether to pad every sequence to `max_seq_length`.
+    Returns:
+        input_ids(obj:`list[int]`): The list of query token ids.
+        token_type_ids(obj:`list[int]`): The list of query token type (segment) ids.
+ """ + + result = [] + + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, data_name): + """ + Predicts the data labels. + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in tqdm(data.items()): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("./{}/{}_embedding".format(args.output_dir, data_name), all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, line in enumerate(file.readlines()): + id2corpus[idx] = line.rstrip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + data_name = args.data_name + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + predictor.predict(id2corpus, tokenizer, data_name) diff --git a/applications/text_classification/multi_class/retrieval_based/utils/milvus_util.py b/applications/text_classification/multi_class/retrieval_based/utils/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b186c4fa480ab20b888c0cd1376624083da9b9 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/milvus_util.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
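+
+# Illustrative usage only (assumes MILVUS_HOST/MILVUS_PORT and the collection/index/search
+# parameters in config.py are properly configured; names below are examples):
+#   client = VecToMilvus()
+#   client.insert(vectors, collection_name="text_corpus", partition_tag="partition_1")
+#   searcher = RecallByMilvus()
+#   status, results = searcher.search(vectors, collection_name="text_corpus", partition_tag="partition_1")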
+ +from config import ( + MILVUS_HOST, + MILVUS_PORT, + collection_param, + index_param, + index_type, + search_param, + top_k, +) +from milvus import Milvus + + +class VecToMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + collection_param["collection_name"] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def delete_partition(self, collection_name, partition_tag): + try: + status = self.client.drop_collection(collection_name) + return status + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print("create partition {} successfully".format(partition_tag)) + return status + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(collection_name) + print("collection info: {}".format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert( + collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag + ) + self.client.flush([collection_name]) + print( + "Insert {} entities, there are {} entities after insert data.".format( + len(ids), self.client.count_entities(collection_name)[1] + ) + ) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search( + collection_name=collection_name, + query_records=vectors, + top_k=top_k, + params=search_param, + partition_tag=partition_tag, + ) + return status, results + except Exception as e: + print("Milvus recall error: ", e) diff --git a/applications/text_classification/multi_class/retrieval_based/utils/vector_insert.py b/applications/text_classification/multi_class/retrieval_based/utils/vector_insert.py new file mode 100644 index 0000000000000000000000000000000000000000..19ad7628cfeeffd603e4792e5e789393665e8d5d --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/vector_insert.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +from config import collection_name, partition_tag +from milvus_util import VecToMilvus +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument("--vector_path", type=str, required=True, help="feature file path.") +args = parser.parse_args() + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + + if client.has_partition(collection_name, partition_tag): + client.delete_partition(collection_name, partition_tag) + data_size = len(embedding_ids) + batch_size = 50000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + vector_insert(args.vector_path) diff --git a/applications/text_classification/multi_class/train.py b/applications/text_classification/multi_class/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ca480b3a1e7b1932db63ae4b490c1351f2056b2d --- /dev/null +++ b/applications/text_classification/multi_class/train.py @@ -0,0 +1,230 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
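+
+# Illustrative launch command only; see the README in this directory for the full argument list.
+# The model name and data paths below are examples, not required values:
+#   python train.py --do_train --do_eval --do_export \
+#       --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
+#       --train_path ./data/train.txt --dev_path ./data/dev.txt --label_path ./data/label.txt \
+#       --output_dir ./checkpoint --device gpu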
+ +import functools +import json +import os +import shutil +from dataclasses import dataclass, field +from pathlib import Path +from typing import Optional + +import numpy as np +import paddle +from sklearn.metrics import ( + accuracy_score, + classification_report, + precision_recall_fscore_support, +) +from utils import log_metrics_debug, preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + CompressionArguments, + EarlyStoppingCallback, + PdArgumentParser, + Trainer, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + export_model, +) +from paddlenlp.utils.log import logger + +SUPPORTED_MODELS = [ + "ernie-1.0-large-zh-cw", + "ernie-1.0-base-zh-cw", + "ernie-3.0-xbase-zh", + "ernie-3.0-base-zh", + "ernie-3.0-medium-zh", + "ernie-3.0-micro-zh", + "ernie-3.0-mini-zh", + "ernie-3.0-nano-zh", + "ernie-3.0-tiny-base-v2-zh", + "ernie-3.0-tiny-medium-v2-zh", + "ernie-3.0-tiny-micro-v2-zh", + "ernie-3.0-tiny-mini-v2-zh", + "ernie-3.0-tiny-nano-v2-zh ", + "ernie-3.0-tiny-pico-v2-zh", + "ernie-2.0-large-en", + "ernie-2.0-base-en", + "ernie-3.0-tiny-mini-v2-en", + "ernie-m-base", + "ernie-m-large", +] + + +# yapf: disable +@dataclass +class DataArguments: + max_length: int = field(default=128, metadata={"help": "Maximum number of tokens for the model."}) + early_stopping: bool = field(default=False, metadata={"help": "Whether apply early stopping strategy."}) + early_stopping_patience: int = field(default=4, metadata={"help": "Stop training when the specified metric worsens for early_stopping_patience evaluation calls"}) + debug: bool = field(default=False, metadata={"help": "Whether choose debug mode."}) + train_path: str = field(default='./data/train.txt', metadata={"help": "Train dataset file path."}) + dev_path: str = field(default='./data/dev.txt', metadata={"help": "Dev dataset file path."}) + test_path: str = field(default='./data/dev.txt', metadata={"help": "Test dataset file path."}) + label_path: str = field(default='./data/label.txt', metadata={"help": "Label file path."}) + bad_case_path: str = field(default='./data/bad_case.txt', metadata={"help": "Bad case file path."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-tiny-medium-v2-zh", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_model_dir: Optional[str] = field(default=None, metadata={"help": "Path to directory to store the exported inference model."}) +# yapf: enable + + +def main(): + """ + Training a binary or multi classification model + """ + + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if training_args.do_compress: + training_args.strategy = "dynabert" + if training_args.do_train or training_args.do_compress: + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Define id2label + id2label = {} + label2id = {} + with open(data_args.label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + id2label[i] = l + label2id[l] = i + + # Define model & tokenizer + if os.path.isdir(model_args.model_name_or_path): + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, label2id=label2id, id2label=id2label + ) 
+ elif model_args.model_name_or_path in SUPPORTED_MODELS: + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, num_classes=len(label2id), label2id=label2id, id2label=id2label + ) + else: + raise ValueError( + f"{model_args.model_name_or_path} is not a supported model type. Either use a local model path or select a model from {SUPPORTED_MODELS}" + ) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # load and preprocess dataset + train_ds = load_dataset(read_local_dataset, path=data_args.train_path, label2id=label2id, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=data_args.dev_path, label2id=label2id, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_length=data_args.max_length) + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + if data_args.debug: + test_ds = load_dataset(read_local_dataset, path=data_args.test_path, label2id=label2id, lazy=False) + test_ds = test_ds.map(trans_func) + + # Define the metric function. + def compute_metrics(eval_preds): + pred_ids = np.argmax(eval_preds.predictions, axis=-1) + metrics = {} + metrics["accuracy"] = accuracy_score(y_true=eval_preds.label_ids, y_pred=pred_ids) + for average in ["micro", "macro"]: + precision, recall, f1, _ = precision_recall_fscore_support( + y_true=eval_preds.label_ids, y_pred=pred_ids, average=average + ) + metrics[f"{average}_precision"] = precision + metrics[f"{average}_recall"] = recall + metrics[f"{average}_f1"] = f1 + return metrics + + def compute_metrics_debug(eval_preds): + pred_ids = np.argmax(eval_preds.predictions, axis=-1) + metrics = classification_report(eval_preds.label_ids, pred_ids, output_dict=True) + return metrics + + # Define the early-stopping callback. 
+ if data_args.early_stopping: + callbacks = [EarlyStoppingCallback(early_stopping_patience=data_args.early_stopping_patience)] + else: + callbacks = None + + # Define Trainer + trainer = Trainer( + model=model, + tokenizer=tokenizer, + args=training_args, + criterion=paddle.nn.loss.CrossEntropyLoss(), + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + data_collator=DataCollatorWithPadding(tokenizer), + compute_metrics=compute_metrics_debug if data_args.debug else compute_metrics, + ) + + # Training + if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + for checkpoint_path in Path(training_args.output_dir).glob("checkpoint-*"): + shutil.rmtree(checkpoint_path) + + # Evaluate and tests model + if training_args.do_eval: + if data_args.debug: + output = trainer.predict(test_ds) + log_metrics_debug(output, id2label, test_ds, data_args.bad_case_path) + else: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + # export inference model + if training_args.do_export: + if model.init_config["init_class"] in ["ErnieMForSequenceClassification"]: + input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + tokenizer.save_pretrained(model_args.export_model_dir) + id2label_file = os.path.join(model_args.export_model_dir, "id2label.json") + with open(id2label_file, "w", encoding="utf-8") as f: + json.dump(id2label, f, ensure_ascii=False) + logger.info(f"id2label file saved in {id2label_file}") + + # compress + if training_args.do_compress: + trainer.compress() + for width_mult in training_args.width_mult_list: + pruned_infer_model_dir = os.path.join(training_args.output_dir, "width_mult_" + str(round(width_mult, 2))) + tokenizer.save_pretrained(pruned_infer_model_dir) + id2label_file = os.path.join(pruned_infer_model_dir, "id2label.json") + with open(id2label_file, "w", encoding="utf-8") as f: + json.dump(id2label, f, ensure_ascii=False) + logger.info(f"id2label file saved in {id2label_file}") + + for path in Path(training_args.output_dir).glob("runs"): + shutil.rmtree(path) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_class/utils.py b/applications/text_classification/multi_class/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9e7a8889f6a32050ce1e48fbabd280a551abce1a --- /dev/null +++ b/applications/text_classification/multi_class/utils.py @@ -0,0 +1,86 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +from paddlenlp.utils.log import logger + + +def preprocess_function(examples, tokenizer, max_length, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks + by concatenating and adding special tokens. + """ + result = tokenizer(examples["text"], max_length=max_length, truncation=True) + if not is_test: + result["labels"] = np.array([examples["label"]], dtype="int64") + return result + + +def read_local_dataset(path, label2id=None, is_test=False): + """ + Read dataset. + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + if is_test: + sentence = line.strip() + yield {"text": sentence} + else: + items = line.strip().split("\t") + yield {"text": items[0], "label": label2id[items[1]]} + + +def log_metrics_debug(output, id2label, dev_ds, bad_case_path): + """ + Log metrics in debug mode. + """ + predictions, label_ids, metrics = output + pred_ids = np.argmax(predictions, axis=-1) + logger.info("-----Evaluate model-------") + logger.info("Dev dataset size: {}".format(len(dev_ds))) + logger.info("Accuracy in dev dataset: {:.2f}%".format(metrics["test_accuracy"] * 100)) + logger.info( + "Macro average | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + metrics["test_macro avg"]["precision"] * 100, + metrics["test_macro avg"]["recall"] * 100, + metrics["test_macro avg"]["f1-score"] * 100, + ) + ) + for i in id2label: + l = id2label[i] + logger.info("Class name: {}".format(l)) + i = "test_" + str(i) + if i in metrics: + logger.info( + "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + metrics[i]["support"], + 100 * metrics[i]["support"] / len(dev_ds), + metrics[i]["precision"] * 100, + metrics[i]["recall"] * 100, + metrics[i]["f1-score"] * 100, + ) + ) + else: + logger.info("Evaluation examples in dev dataset: 0 (0%)") + logger.info("----------------------------") + + with open(bad_case_path, "w", encoding="utf-8") as f: + f.write("Text\tLabel\tPrediction\n") + for i, (p, l) in enumerate(zip(pred_ids, label_ids)): + p, l = int(p), int(l) + if p != l: + f.write(dev_ds.data[i]["text"] + "\t" + id2label[l] + "\t" + id2label[p] + "\n") + + logger.info("Bad case in dev dataset saved in {}".format(bad_case_path)) diff --git a/applications/text_classification/multi_label/README.md b/applications/text_classification/multi_label/README.md new file mode 100644 index 0000000000000000000000000000000000000000..94896cc8cc14f9ffbd184fc3aee212ec783f8ab0 --- /dev/null +++ b/applications/text_classification/multi_label/README.md @@ -0,0 +1,474 @@ +# 多标签分类指南 + +**目录** +- [1. 多标签分类简介](#多标签分类简介) +- [2. 快速开始](#快速开始) + - [2.1 运行环境](#运行环境) + - [2.2 代码结构](#代码结构) + - [2.3 数据准备](#数据准备) + - [2.4 模型训练](#模型训练) + - [2.5 模型部署](#模型部署) + - [2.6 模型效果](#模型效果) + + + +## 1. 多标签分类简介 + +本项目提供通用场景下**基于预训练模型微调的多标签分类端到端应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,有效缩短开发周期,降低AI开发落地门槛。 + +多标签数据集的标签集含有两个或两个以上的类别,输入句子/文本具有一个或多个标签,多标签任务的目标是预测**样本属于哪些标签类别,这些类别具有不相互排斥的属性**。文本多标签分类在各种现实场景中具有广泛的适用性,例如商品分类、网页标签、新闻标注、蛋白质功能分类、电影分类、语义场景分类等。以下图为例,该新闻文本具有 `相机` 和 `芯片` 两个标签。 + + +
+（图：示例新闻文本同时具有"相机"与"芯片"两个标签）
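+用一个极简的代码片段(仅为帮助理解的假设示例,标签集合与文本均为虚构)说明多标签任务的输入输出形式:一条文本对应一个与标签集合等长的多热(multi-hot)向量,预测时对每个标签独立判断是否命中。
+
+```python
+# 假设标签集合如下,一条样本可以同时命中多个标签
+labels = ["相机", "芯片", "屏幕"]
+sample = {"text": "XX手机发布:全新影像系统与自研芯片亮相", "labels": ["相机", "芯片"]}
+
+# 训练时通常将标签转换为多热向量,命中的标签位置为 1
+multi_hot = [1.0 if label in sample["labels"] else 0.0 for label in labels]
+print(multi_hot)  # [1.0, 1.0, 0.0]
+```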
+ +**方案亮点:** + +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + +**更多选择:** + +对于大多数多标签分类任务,我们推荐使用预训练模型微调作为首选的文本分类方案,多标签分类项目中还提供 提示学习(小样本)和语义索引的两种全流程文本分类方案满足不同开发者需求,更多技术细节请参见[文本分类技术特色介绍](../README.md)。 + +- 【标注成本高、标注样本较少的小样本场景】 👉 [提示学习多标签分类方案](./few-shot#readme) +- 【标签类别不固定场景、标签类别众多】 👉 [语义索引多分类方案](./retrieval_based#readme) + + +## 2. 快速开始 + +我们以公开数据集CAIL2019—婚姻家庭要素提取任务为示例,演示多标签分类全流程方案使用。下载数据集: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/divorce.tar.gz +tar -zxvf divorce.tar.gz +mv divorce data +rm divorce.tar.gz +``` + +
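+数据下载解压后,可以先用下面几行代码确认数据格式(仅为检查用的示意脚本,假设解压后的目录为 `data/`,每行格式为"文本<TAB>标签1,标签2"):
+
+```python
+# 预览训练集前几条样本,确认文本与标签的分隔格式
+with open("data/train.txt", encoding="utf-8") as f:
+    for _ in range(3):
+        text, labels = f.readline().rstrip("\n").split("\t")
+        print(text[:30], "->", labels.split(","))
+```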
+ + + +### 2.1 运行环境 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4.8 +- scikit-learn >= 1.0.2 + +**安装PaddlePaddle:** + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +**安装PaddleNLP:** + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + + +**安装sklearn:** +```shell +python3 -m pip install scikit-learn==1.0.2 +``` + + + +### 2.2 代码结构 + +```text +multi_label/ +├── few-shot # 小样本学习方案 +├── analysis # 分析模块 +├── deploy # 部署 +│   └── predictor # 离线部署 +│ ├── paddle_serving # PaddleServing在线服务化部署 +│   └── triton_serving # Triton在线服务化部署 +├── train.py # 训练评估脚本 +├── predict.py # 预测脚本 +├── export_model.py # 静态图模型导出脚本 +├── utils.py # 工具函数脚本 +├── metric.py # metric脚本 +├── prune.py # 裁剪脚本 +└── README.md # 使用说明 +``` + + +### 2.3 数据准备 + +训练需要准备指定格式的标注数据集,如果没有已标注的数据集,可以参考 [数据标注指南](../doccano.md) 进行文本分类数据标注。指定格式本地数据集目录结构: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +├── test.txt # 测试数据集文件(可选) +├── label.txt # 分类标签文件 +└── data.txt # 待预测数据文件(可选) +``` +**训练、开发、测试数据集**文件中文本与标签类别名用tab符`'\t'`分隔开,标签中多个标签之间用`','`逗号分隔开,文本中避免出现tab符`'\t'`。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` + +- train.txt/dev.txt/test.txt 文件样例: + +```text +现在原告已是第二次申请与被告离婚了。 二次起诉离婚 +双方均认可价值6万元。 不动产分割,有夫妻共同财产 +2004年4月,原、被告发生纠纷后,被告离家外出未归,直到现在,双方长期分居生活,十几年间互无联系,夫妻感情已经完全破裂。 婚后分居 +婚生子杨某甲由原告抚养,高中阶段之前的相关费用由原告承担,高中阶段之后的相关费用由双方协商,被告可以随时探望孩子; 婚后有子女,支付抚养费,限制行为能力子女抚养 +... +``` + +**分类标签**包含数据集中所有标签集合,每一行为一个标签名。 + +- label.txt 文件格式: + +```text +<标签> +<标签> +... +``` + +- label.txt 文件样例: +```text +婚后有子女 +限制行为能力子女抚养 +有夫妻共同财产 +支付抚养费 +... +``` + +**待预测数据文件:** 包含需要预测标签的文本数据,每条数据一行。 + +- data.txt 文件格式: + +```text +<文本> +<文本> +... +``` + +- data.txt 文件样例: +```text +原、被告另购置橱柜、碗架、电磁炉、电饭锅各一个归原告王某某所有。 +于是原告到儿子就读的幼儿园进行探望,被告碰见后对原告破口大骂,还不让儿子叫原告妈妈,而叫被告现在的妻子做妈妈。 +6、被告父亲给的房屋装修款2.3万元在原告处,要求依法分割; +由我全额出资购买的联想台式电脑,我均依次放弃。 +... 
+``` + + +### 2.4 模型训练 + +#### 2.4.1 预训练模型微调 + +使用CPU/GPU训练,默认为GPU训练。使用CPU训练只需将设备参数配置改为`--device cpu`,可以使用`--device gpu:0`指定GPU卡号: +```shell +python train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` + +如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` +可支持配置的参数: + +* `device`: 选用什么设备进行训练,选择cpu、gpu、xpu、npu。如使用gpu训练,可使用参数--gpus指定GPU卡号;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 +* `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为3e-5。 +* `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `early_stop`:选择是否使用早停法(EarlyStopping),模型在开发集经过一定epoch后精度表现不再上升,训练终止;默认为False。 +* `early_stop_nums`:在设定的早停训练轮次内,模型在开发集上表现不再上升,训练终止;默认为4。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认5。 +* `weight_decay`:控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `warmup`:是否使用学习率warmup策略,使用时应设置适当的训练轮次(epochs);默认为False。 +* `warmup_steps`:学习率warmup策略的比例数,如果设为1000,则学习率会在1000steps数从0慢慢增长到learning_rate, 而后再缓慢衰减;默认为0。 +* `init_from_ckpt`: 模型初始checkpoint参数地址,默认None。 +* `seed`:随机种子,默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存开发集上最佳模型在指定的 `save_dir` 中,保存模型文件结构如下所示: + +```text +checkpoint/ +├── config.json # 模型配置文件,paddlenlp 2.4.5以前为model_config.json +├── model_state.pdparams # 模型参数文件 +├── tokenizer_config.json # 分词器配置文件 +├── vocab.txt +└── ... +``` + +**NOTE:** +* 如需恢复模型训练,则可以设置 `init_from_ckpt` , 如 `init_from_ckpt=checkpoint/model_state.pdparams` 。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 + +#### 2.4.2 训练评估与模型优化 + +文本分类预测过程中常会遇到诸如"模型为什么会预测出错误的结果","如何提升模型的表现"等问题。[Analysis模块](./analysis) 提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +
+ +**模型评估:** 训练后的模型我们可以使用 [Analysis模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32 --bad_case_file "bad_case.txt" --dataset_dir "data" --params_path "./checkpoint" +``` + +输出打印示例: + +```text +[2022-08-12 02:24:48,193] [ INFO] - -----Evaluate model------- +[2022-08-12 02:24:48,194] [ INFO] - Dev dataset size: 1611 +[2022-08-12 02:24:48,194] [ INFO] - Accuracy in dev dataset: 74.24% +[2022-08-12 02:24:48,194] [ INFO] - Macro avg in dev dataset: precision: 82.96 | recall: 77.59 | F1 score 79.36 +[2022-08-12 02:24:48,194] [ INFO] - Micro avg in dev dataset: precision: 91.50 | recall: 89.66 | F1 score 90.57 +[2022-08-12 02:24:48,195] [ INFO] - Class name: 婚后有子女 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 784(48.7%) | precision: 97.07 | recall: 97.32 | F1 score 97.20 +[2022-08-12 02:24:48,195] [ INFO] - ---------------------------- +[2022-08-12 02:24:48,195] [ INFO] - Class name: 限制行为能力子女抚养 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 492(30.5%) | precision: 88.57 | recall: 88.21 | F1 score 88.39 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +2014年,王X以其与肖X协议离婚时未分割该套楼房的首付款为由,起诉至法院,要求分得楼房的首付款15万元。 不动产分割,有夫妻共同财产 不动产分割 +但原、被告对已建立起的夫妻感情不够珍惜,因琐事即发生吵闹并最终分居,对夫妻感情造成了严重的影响,现原、被告已分居六年有余,且经人民法院判决不准离婚后仍未和好,夫妻感情确已破裂,依法应准予原、被告离婚。 二次起诉离婚,准予离婚,婚后分居,法定离婚 婚后分居,准予离婚 +婚后生有一女,取名彭某乙,已11岁,现已由被告从铁炉白族乡中心小学转入走马镇李桥小学读书。 婚后有子女 婚后有子女,限制行为能力子女抚养 +... +``` +**可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果,用于错误样本(bad case)分析,细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 单词级别可解释性分析,也即分析待预测样本中哪一些单词对模型预测结果起重要作用。以下图为例,用颜色深浅表示单词对预测结果的重要性。 +
+（图：词级别可解释性示例,颜色越深代表该词对预测结果影响越大）
+ +- 句子级别可解释性分析 ,也即分析对待预测样本的模型预测结果与训练集中中哪些样本有重要关系。下面的例子表明句子级别可解释性分析可以帮助理解待预测样本的预测结果与训练集中样本之间的关联。 +```text +text: 2015年2月23日,被告将原告赶出家门,原告居住于娘家待产,双方分居至今。 +predict label: 婚后分居 +label: 不履行家庭义务,婚后分居 +examples with positive influence +support1 text: 2014年中秋节原告回了娘家,原、被告分居至今。 label: 婚后分居 score: 0.99942 +support2 text: 原告于2013年8月13日离开被告家,分居至今。 label: 婚后分居 score: 0.99916 +support3 text: 2014年4月,被告外出务工,双方分居至今。 label: 婚后分居 score: 0.99902 +... +``` + +**数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果,策略细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 稀疏数据筛选主要是解决数据不均衡、训练数据覆盖不足的问题,通过数据增强和数据标注两种方式解决这一问题。 +- 脏数据清洗可以帮助开发者筛选训练集中错误标注的数据,对这些数据重新进行人工标注,得到标注正确的数据再重新进行训练。 +- 数据增强策略提供多种数据增强方案,可以快速扩充数据,提高模型泛化性和鲁棒性。 + + +#### 2.4.3 模型预测 +训练结束后,输入待预测数据(data.txt)和类别标签对照列表(label.txt),使用训练好的模型进行,默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_dir "data" +``` + +可支持配置的参数: + +* `device`: 选用什么设备进行预测,可选cpu、gpu、xpu、npu;默认为gpu。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含data.txt和label.txt文件;默认为None。 +* `params_path`:待预测模型的目录;默认为"./checkpoint/"。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练时最大序列长度一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `data_file`:本地数据集中未标注待预测数据文件名;默认为"data.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + + + +### 2.5 模型部署 + +#### 2.5.1 静态图导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,静态图模型将用于**后续的推理部署工作**。具体代码见[静态图导出脚本](export_model.py),静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export +``` + +如果使用多语言模型 ERNIE M作为预训练模型,运行方式: + +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` + +可支持配置的参数: + +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 +* `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 +* `output_path`:静态图图保存的参数路径;默认为"./export"。 + +程序运行时将会自动导出模型到指定的 `output_path` 中,保存模型文件结构如下所示: + +```text +export/ +├── float32.pdiparams +├── float32.pdiparams.info +└── float32.pdmodel +``` + 导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 + +#### 2.5.2 模型裁剪 + +如果有模型部署上线的需求,需要进一步压缩模型体积,可以使用 PaddleNLP 的 [压缩API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/compression.md), 一行命令即可启动模型裁剪。 + +使用裁剪功能需要安装 paddleslim: + +```shell +pip install paddleslim==2.4.1 +``` + +开始模型裁剪训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: +```shell +python prune.py \ + --device "gpu" \ + --dataset_dir "data" \ + --output_dir "prune" \ + --learning_rate 3e-5 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 10 \ + --max_seq_length 128 \ + --logging_steps 5 \ + --save_steps 100 \ + --width_mult_list '3/4' '2/3' '1/2' +``` + + +可支持配置的参数: +* `output_dir`:必须,保存模型输出和中间checkpoint的输出目录;默认为 `None` 。 +* `device`: 选用什么设备进行裁剪,选择cpu、gpu。如使用gpu训练,可使用参数--gpus指定GPU卡号。 +* `per_device_train_batch_size`:训练集裁剪训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `per_device_eval_batch_size`:开发集评测过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为5e-5。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认100。 +* `save_steps`: 训练过程中保存模型checkpoint的间隔steps数,默认100。 +* `seed`:随机种子,默认为3。 +* `width_mult_list`:裁剪宽度(multi head)保留的比例列表,表示对self_attention中的 `q`、`k`、`v` 以及 `ffn` 
权重宽度的保留比例,保留比例乘以宽度(multi haed数量)应为整数;默认是None。 +* `dataset_dir`:本地数据集路径,需包含train.txt,dev.txt,label.txt;默认为None。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练过程保持一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `params_dir`:待预测模型参数文件;默认为"./checkpoint/"。 + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存开发集上最佳模型在指定的 `output_dir` 中,保存模型文件结构如下所示: + +```text +prune/ +├── width_mult_0.75 +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel +│   ├── model_state.pdparams +│   └── model_config.json +└── ... +``` + +**NOTE:** + +1. 目前支持的裁剪策略需要训练,训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。 裁剪类似蒸馏过程,方便起见,可以直接使用微调时的超参。为了进一步提升精度,可以对 `per_device_train_batch_size`、`learning_rate`、`num_train_epochs`、`max_seq_length` 等超参进行网格搜索(grid search)。 + +2. 模型裁剪主要用于推理部署,因此裁剪后的模型都是静态图模型,只可用于推理部署,不能再通过 `from_pretrained` 导入继续训练。导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 + +3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 + + +#### 2.5.3 部署方案 + +- 离线部署搭建请参考[离线部署](deploy/predictor/README.md)。 + +- 在线服务化部署搭建请参考 [PaddleNLP SimpleServing部署指南](deploy/simple_serving/README.md) 或 [Triton部署指南](deploy/triton_serving/README.md)。 + + + +### 2.6 模型效果 + +我们在CAIL2019—婚姻家庭要素提取任务数据集评测模型表现,测试配置如下: + +1. 数据集:CAIL2019—婚姻家庭要素提取任务数据集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 精度评价指标:Micro F1分数、Macro F1分数 + +| model_name | 模型结构 |Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------ | ------------ |------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|91.14|81.68 |5.66 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|90.38|80.14| 2.70 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|90.57|79.36| 1.46| +|ERNIE 3.0 Mini |6-layer, 384-hidden, 12-heads|89.27|76.78| 0.56| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|89.43|77.20| 0.34| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|85.39|75.07|0.32| +| ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 89.94|79.35| 0.81 | +| ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 89.99|79.37 | 0.75 | +| ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 89.19 | 76.35| 0.61 | diff --git a/applications/text_classification/multi_label/analysis/README.md b/applications/text_classification/multi_label/analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9aeba1759089b078d6a9ed69e10864c3d1c14b88 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/README.md @@ -0,0 +1,421 @@ +# 训练评估与模型优化指南 + +**目录** + * [Analysis模块介绍](#Analysis模块介绍) + * [环境准备](#环境准备) + * [模型评估](#模型评估) + * [可解释性分析](#可解释性分析) + * [单词级别可解释性分析](#单词级别可解释性分析) + * [句子级别可解释性分析](#句子级别可解释性分析) + * [数据优化](#数据优化) + * [稀疏数据筛选方案](#稀疏数据筛选方案) + * [脏数据清洗方案](#脏数据清洗方案) + * [数据增强策略方案](#数据增强策略方案) + +## Analysis模块介绍 + +Analysis模块提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +- **模型评估:** 对整体分类情况和每个类别分别进行评估,并打印预测错误样本,帮助开发者分析模型表现找到训练和预测数据中存在的问题。 + +- **可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果。 + +- **数据优化:** 
结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果。 + +
+ +以下是本项目主要代码结构及说明: + +```text +analysis/ +├── evaluate.py # 评估脚本 +├── sent_interpret.py # 句子级别可解释性分析脚本 +├── word_interpret.py # 单词级别可解释性分析notebook +├── sparse.py # 稀疏数据筛选脚本 +├── dirty.py # 脏数据清洗脚本 +├── aug.py # 数据增强脚本 +└── README.md # 训练评估与模型优化指南 +``` + +## 环境准备 +需要可解释性分析和数据优化需要安装相关环境。 +- trustai >= 0.1.7 +- interpretdl >= 0.7.0 + +**安装TrustAI**(可选)如果使用可解释性分析和数据优化中稀疏数据筛选和脏数据清洗需要安装TrustAI。 +```shell +pip install trustai==0.1.7 +``` + +**安装InterpretDL**(可选)如果使用词级别可解释性分析GradShap方法,需要安装InterpretDL +```shell +pip install interpretdl==0.7.0 +``` + +## 模型评估 + +我们使用训练好的模型计算模型的在开发集的准确率,同时打印每个类别数据量及表现: + +```shell +python evaluate.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint" \ + --max_seq_length 128 \ + --batch_size 32 \ + --bad_case_file "bad_case.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt、dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `train_file`:本地数据集中开发集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `bad_case_path`:开发集中预测错误样本保存路径;默认为"/bad_case.txt"。 + +输出打印示例: + +```text +[2022-08-12 02:24:48,193] [ INFO] - -----Evaluate model------- +[2022-08-12 02:24:48,194] [ INFO] - Dev dataset size: 1611 +[2022-08-12 02:24:48,194] [ INFO] - Accuracy in dev dataset: 74.24% +[2022-08-12 02:24:48,194] [ INFO] - Macro avg in dev dataset: precision: 82.96 | recall: 77.59 | F1 score 79.36 +[2022-08-12 02:24:48,194] [ INFO] - Micro avg in dev dataset: precision: 91.50 | recall: 89.66 | F1 score 90.57 +[2022-08-12 02:24:48,195] [ INFO] - Class name: 婚后有子女 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 784(48.7%) | precision: 97.07 | recall: 97.32 | F1 score 97.20 +[2022-08-12 02:24:48,195] [ INFO] - ---------------------------- +[2022-08-12 02:24:48,195] [ INFO] - Class name: 限制行为能力子女抚养 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 492(30.5%) | precision: 88.57 | recall: 88.21 | F1 score 88.39 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +2014年,王X以其与肖X协议离婚时未分割该套楼房的首付款为由,起诉至法院,要求分得楼房的首付款15万元。 不动产分割,有夫妻共同财产 不动产分割 +但原、被告对已建立起的夫妻感情不够珍惜,因琐事即发生吵闹并最终分居,对夫妻感情造成了严重的影响,现原、被告已分居六年有余,且经人民法院判决不准离婚后仍未和好,夫妻感情确已破裂,依法应准予原、被告离婚。 二次起诉离婚,准予离婚,婚后分居,法定离婚 婚后分居,准予离婚 +婚后生有一女,取名彭某乙,已11岁,现已由被告从铁炉白族乡中心小学转入走马镇李桥小学读书。 婚后有子女 婚后有子女,限制行为能力子女抚养 +... +``` +## 可解释性分析 +"模型为什么会预测出这个结果?"是文本分类任务开发者时常遇到的问题,如何分析错误样本(bad case)是文本分类任务落地中重要一环,本项目基于TrustAI开源了基于词级别和句子级别的模型可解释性分析方法,帮助开发者更好地理解文本分类模型与数据,有助于后续的模型优化与数据清洗标注。 + +### 单词级别可解释性分析 +本项目开源模型的词级别可解释性分析Notebook,提供LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用Jupyter Notebook进行数据分析。 + +运行 [word_interpret.ipynb](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/multi_label/analysis/README.md) 代码,即可分析影响样本预测结果的关键词以及可视化所有词对预测结果的贡献情况,颜色越深代表这个词对预测结果影响越大: +
+（图：词级别重要性可视化示例,颜色越深代表该词对预测结果影响越大）
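+如果暂时不方便运行 Notebook,也可以先用一个非常粗糙的"逐字遮盖"思路直观感受词级别重要性(以下仅为概念示意,并非 LIME/Integrated Gradient/GradShap 的实现;模型目录、标签序号与样本文本均为假设):
+
+```python
+import paddle
+import paddle.nn.functional as F
+
+from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+model = AutoModelForSequenceClassification.from_pretrained("../checkpoint")
+tokenizer = AutoTokenizer.from_pretrained("../checkpoint")
+model.eval()
+
+
+def label_prob(text, label_id=0):
+    # 返回指定标签的 sigmoid 概率
+    encoded = tokenizer(text, max_seq_len=128)
+    input_ids = paddle.to_tensor([encoded["input_ids"]])
+    token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])
+    logits = model(input_ids, token_type_ids)
+    return float(F.sigmoid(logits)[0][label_id])
+
+
+text = "2014年4月,被告外出务工,双方分居至今。"
+base = label_prob(text)
+# 逐个去掉一个字,概率下降越多,说明该字对该标签的预测越重要
+for i, ch in enumerate(text):
+    drop = base - label_prob(text[:i] + text[i + 1:])
+    if drop > 0.05:
+        print(ch, round(drop, 4))
+```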
+ +### 句子级别可解释性分析 +本项目基于特征相似度([FeatureSimilarity](https://arxiv.org/abs/2104.04128))算法,计算对样本预测结果正影响的训练数据,帮助理解模型的预测结果与训练集数据的关系。 + +待分析数据文件`interpret_input_file`应为以下三种格式中的一种: +**格式一:包括文本、标签、预测结果** +```text +<文本>'\t'<标签>'\t'<预测结果> +... +``` + +**格式二:包括文本、标签** +```text +<文本>'\t'<标签> +... +``` + +**格式三:只包括文本** +```text +<文本> +准予原告胡某甲与被告韩某甲离婚。 +... +``` + +我们可以运行代码,得到支持样本模型预测结果的训练数据: +```shell +python sent_interpret.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint/" \ + --max_seq_length 128 \ + --batch_size 16 \ + --top_k 3 \ + --train_file "train.txt" \ + --interpret_input_file "bad_case.txt" \ + --interpret_result_file "sent_interpret.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `top_k`:筛选支持训练证据数量;默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `interpret_input_file`:本地数据集中待分析文件名;默认为"bad_case.txt"。 +* `interpret_result_file`:保存句子级别可解释性结果文件名;默认为"sent_interpret.txt"。 + +可解释性结果保存在 `interpret_result_file` 文件中: +```text +text: 2015年2月23日,被告将原告赶出家门,原告居住于娘家待产,双方分居至今。 +predict label: 婚后分居 +label: 不履行家庭义务,婚后分居 +examples with positive influence +support1 text: 2014年中秋节原告回了娘家,原、被告分居至今。 label: 婚后分居 score: 0.99942 +support2 text: 原告于2013年8月13日离开被告家,分居至今。 label: 婚后分居 score: 0.99916 +support3 text: 2014年4月,被告外出务工,双方分居至今。 label: 婚后分居 score: 0.99902 +... +``` + + +## 数据优化 + +### 稀疏数据筛选方案 + +稀疏数据筛选适用于文本分类中**数据不平衡或训练数据覆盖不足**的场景,简单来说,就是由于模型在训练过程中没有学习到足够与待预测样本相似的数据,模型难以正确预测样本所属类别的情况。稀疏数据筛选旨在开发集中挖掘缺乏训练证据支持的数据,通常可以采用**数据增强**或**少量数据标注**的两种低成本方式,提升模型在开发集的预测效果。 + +本项目中稀疏数据筛选基于TrustAI,利用基于特征相似度的实例级证据分析方法,抽取开发集中样本的支持训练证据,并计算支持证据平均分(通常为得分前三的支持训练证据均分)。分数较低的样本表明其训练证据不足,在训练集中较为稀疏,实验表明模型在这些样本上表现也相对较差。更多细节详见[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[实例级证据分析](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md)。 + + +#### 稀疏数据识别—数据增强 + +这里我们将介绍稀疏数据识别—数据增强流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据增强**:将稀疏数据集在训练集中的支持证据应用数据增强策略,这些数据增强后的训练数据记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别-数据增强,得到支持数据集: + +```shell +python sparse.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --aug_strategy "substitute" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `aug_strategy`:数据增强类型,可选"duplicate","substitute", "insert", "delete", "swap";默认为"substitute"。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* 
`dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +将得到增强支持数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_aug.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_aug.txt +``` + +**方案效果** + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,筛选稀疏数据数量和筛选支持数据数量均设为100条,使用不同的数据增强方法进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集|84.43|50.01| +|训练集+支持增强集(duplicate) |84.80|**51.78**| +|训练集+支持增强集(substitute) |84.66|50.61| +|训练集+支持增强集(insert) |84.48|49.95| +|训练集+支持增强集(delete) |84.83| 51.04| +|训练集+支持增强集(swap) |**84.84**|51.06| + +#### 稀疏数据识别-数据标注 + +本方案能够有针对性进行数据标注,相比于随机标注数据更好提高模型预测效果。这里我们将介绍稀疏数据识别-数据标注流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据标注**:在未标注数据集中筛选稀疏数据集的支持证据,并进行数据标注,记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别--数据标注,得到待标注数据: + +```shell +python sparse.py \ + --annotate \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 \ + --unlabeled_file "data.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `annotate`:选择稀疏数据识别--数据标注模式;默认为False。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `unlabeled_file`:本地数据集中未标注数据文件名;默认为"data.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +我们将筛选出的支持数据`support.txt`进行标注,可以使用标注工具帮助更快标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已标注数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_annotate.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_annotate.txt +``` + +**方案效果** + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,筛选稀疏数据数量设为100条,筛选待标注数据数量为50和100条。我们比较了使用稀疏数据方案的策略采样和随机采样的效果,下表结果表明使用稀疏数据方案的策略采样能够有效指导训练数据扩充,在标注更少的数据情况下获得更大提升的效果: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ | ------------ | +|训练集|84.43|50.01| +|训练集+策略采样集(50) |85.77|**57.13**| +|训练集+随机采样集(50) |84.91|54.40| +|训练集+策略采样集(100) |**86.14**|56.93| +|训练集+随机采样集(100) |84.69|50.76| + +### 脏数据清洗方案 + +脏数据清洗方案是基于已训练好的文本分类模型,筛选出训练数据集中标注错误的数据,再由人工检查重新标注,获得标注正确的数据集进行重新训练。我们将介绍脏数据清洗流程: + +- **脏数据筛选:** 基于TrustAI中表示点方法,计算训练数据对文本分类模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为标注错误样本,记为脏数据集(Dirty Dataset)。 + +- **数据清洗、训练:** 将筛选出的脏数据由人工重新检查,为数据打上正确的标签。将清洗后的训练数据重新放入文本分类模型进行训练。 + +现在我们进行脏数据识别,脏数据保存在`"train_dirty.txt"`,剩余训练数据保存在`"train_dirty_rest.txt"`: + +```shell +python dirty.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --dirty_num 
100 \ + --dirty_file "train_dirty.txt" \ + --rest_file "train_dirty_rest.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt和label.txt文件;默认为None。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `dirty_file`:保存脏数据文件名,默认为"train_dirty.txt"。 +* `rest_file`:保存剩余数据(非脏数据)文件名,默认为"train_dirty_rest.txt"。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dirty_threshold`:筛选脏数据用于重新标注的阈值,只选择影响分数大于阈值作为支持数据,默认为0。 + + +我们将筛选出脏数据进行人工检查重新标注,可以将`train_dirty.txt`直接导入标注工具doccano帮助更快重新标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已重新标注的脏数据`train_dirty.txt`与剩余训练集数据`train_dirty_rest.txt`合并得到新的训练集`train_clean.txt`重新进行训练: + +```shell +cat ../data/train_dirty_rest.txt ../data/train_dirty.txt > ../data/train_clean.txt +``` + +**方案效果** + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,取50条数据进行脏数据处理,也即50条训练数据为标签错误数据。选择不同`dirty_num`应用脏数据清洗策略进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(500,含50条脏数据) |82.89|47.83| +|训练集(500,含50条脏数据) + 脏数据清洗(25)|82.42|49.57| +|训练集(500,含50条脏数据) + 脏数据清洗(50)|83.38|50.40| +|训练集(500,含50条脏数据) + 脏数据清洗(100)|84.50|51.28| + + +## 数据增强策略方案 + +在数据量较少或某些类别样本量较少时,也可以通过数据增强策略的方式,生成更多的训练数据,提升模型效果。 + +```shell +python aug.py \ + --create_n 2 \ + --aug_percent 0.1 \ + --train_path "../data/train.txt" \ + --aug_path "../data/aug.txt" +``` + +可支持配置的参数: + +* `train_path`:待增强训练数据集文件路径;默认为"../data/train.txt"。 +* `aug_path`:增强生成的训练数据集文件路径;默认为"../data/train_aug.txt"。 +* `aug_strategy`:数据增强策略,可选"mix", "substitute", "insert", "delete", "swap"为多种数据策略混合使用;默认为"substitute"。 +* `aug_type`:词替换/词插入增强类型,可选"synonym", "homonym", "mlm",建议在GPU环境下使用mlm类型;默认为"synonym"。 +* `create_n`:生成的句子数量,默认为2。 +* `aug_percent`:生成词替换百分比,默认为0.1。 +* `device`: 选用什么设备进行增强,可选择cpu、gpu、xpu、npu,仅在使用mlm类型有影响;默认为"gpu"。 + +生成的增强数据保存在`"aug.txt"`文件中,与训练集数据`train.txt`合并得到新的训练集`train_aug.txt`重新进行训练: + +```shell +cat ../data/aug.txt ../data/train.txt > ../data/train_aug.txt +``` + +PaddleNLP内置多种数据增强策略,更多数据增强策略使用方法请参考[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)。 + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,采用不同数据增强策略进行两倍数据增强(每条样本生成两条增强样本): + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(500)|84.43|50.01| +|训练集(500)+数据增强(×2, mix) |84.72|51.86| +|训练集(500)+支持增强集(×2, substitute) |84.50|53.23| +|训练集(500)+支持增强集(×2, insert) |**85.03**|53.54| +|训练集(500)+支持增强集(×2, delete) |84.74| **55.89**| +|训练集(500)+支持增强集(×2, swap) |84.44|52.50| diff --git a/applications/text_classification/multi_label/analysis/aug.py b/applications/text_classification/multi_label/analysis/aug.py new file mode 100644 index 0000000000000000000000000000000000000000..12f1e0fbcdbd33818b532b1443a985bbd9c2c15b --- /dev/null +++ b/applications/text_classification/multi_label/analysis/aug.py @@ -0,0 +1,80 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--train_path", type=str, default="../data/train.txt", help="Train dataset file name") +parser.add_argument("--aug_path", type=str, default="../data/aug.txt", help="Aug dataset file name") +parser.add_argument("--aug_strategy", choices=["mix", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--aug_type", choices=["synonym", "homonym", "mlm"], default='synonym', help="Select data augmentation type for substitute and insert") +parser.add_argument("--create_n", type=int, default=2, help="Number of augmented sequences.") +parser.add_argument("--aug_percent", type=float, default=0.1, help="Percentage of augmented words in sequences.") +parser.add_argument('--device', default="gpu", help="Select which device to do data augmentation strategy, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def aug(): + """Do data augmentation""" + if args.aug_strategy in ["mix", "substitute", "insert"] and args.aug_strategy == "mlm": + paddle.set_device(args.device) + + if args.aug_strategy in ["substitute", "insert", "delete", "swap"]: + if args.aug_strategy == "substitute": + aug = WordSubstitute(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=args.create_n, aug_percent=args.aug_percent) + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + augs = aug.augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + elif args.aug_strategy in ["mix"]: + aug = [ + WordSubstitute(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordInsert(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordDelete(create_n=1, aug_percent=args.aug_percent), + WordSwap(create_n=1, aug_percent=args.aug_percent), + ] + count = 0 + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + for i in range(args.create_n): + i = count % len(aug) + augs = aug[i].augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + count += 1 + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + + +if __name__ == "__main__": + aug() diff --git a/applications/text_classification/multi_label/analysis/dirty.py b/applications/text_classification/multi_label/analysis/dirty.py new file mode 100644 index 0000000000000000000000000000000000000000..394e597c7e280929f68ad38c6f188acdfa8ded50 --- /dev/null +++ 
b/applications/text_classification/multi_label/analysis/dirty.py @@ -0,0 +1,153 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import RepresenterPointModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--dirty_num", type=int, default=100, help="Number of dirty data. 
default:50") +parser.add_argument("--dirty_file", type=str, default="train_dirty.txt", help="Path to save dirty data.") +parser.add_argument("--rest_file", type=str, default="train_dirty_rest.txt", help="The path of rest data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dirty_threshold", type=float, default="0", help="The threshold to select dirty data.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + sentence, label = line.strip().split("\t") + yield {"text": sentence, "label": label} + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +def get_dirty_data(weight_matrix, dirty_num, threshold=0): + """ + Get index of dirty data from train data + """ + scores = [] + for idx in range(weight_matrix.shape[0]): + weight_sum = 0 + count = 0 + for weight in weight_matrix[idx].numpy(): + if weight > threshold: + count += 1 + weight_sum += weight + scores.append((count, weight_sum)) + sorted_scores = sorted(scores)[::-1] + sorted_idxs = sorted(range(len(scores)), key=lambda idx: scores[idx])[::-1] + + ret_scores = sorted_scores[:dirty_num] + ret_idxs = sorted_idxs[:dirty_num] + + return ret_idxs, ret_scores + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def run(): + """ + Get dirty data + """ + set_seed(args.seed) + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + train_ds = train_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + rep_point = RepresenterPointModel(model, train_data_loader, classifier_layer_name="classifier") + weight_matrix = rep_point.weight_matrix + + # Save dirty data & rest data + dirty_indexs, _ = get_dirty_data(weight_matrix, args.dirty_num, args.dirty_threshold) + + dirty_path = os.path.join(args.dataset_dir, args.dirty_file) + rest_path = os.path.join(args.dataset_dir, args.rest_file) + + with open(dirty_path, "w") as f1, open(rest_path, "w") as f2: + for idx in range(len(train_ds)): + if idx in dirty_indexs: + f1.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + 
"\n") + else: + f2.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + + f1.close(), f2.close() + + +if __name__ == "__main__": + run() diff --git a/applications/text_classification/multi_label/analysis/evaluate.py b/applications/text_classification/multi_label/analysis/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..fd57a57cdf64891fa9eb136a42940dcb283dc104 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/evaluate.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from sklearn.metrics import accuracy_score, classification_report + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to evaluate model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include dev.txt and label.txt") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for evaluation.") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--bad_case_file", type=str, default="bad_case.txt", help="Bad case saving file name") +args = parser.parse_args() +# yapf: enable + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + label = "" + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"text": sentence, "label": labels, "label_n": label} + + +@paddle.no_grad() +def evaluate(): + """ + Evaluate the model performance + """ + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # load and preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + label_map = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + label_map[i] = l + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + dev_ds = dev_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + model.eval() + probs = [] + labels = [] + for batch in dev_data_loader: + label = batch.pop("labels") + logits = model(**batch) + labels.extend(label.numpy()) + probs.extend(F.sigmoid(logits).numpy()) + probs = np.array(probs) + labels = np.array(labels) + preds = probs > 0.5 + report = classification_report(labels, preds, digits=4, output_dict=True) + accuracy = accuracy_score(labels, preds) + + logger.info("-----Evaluate model-------") + logger.info("Dev dataset size: {}".format(len(dev_ds))) + logger.info("Accuracy in dev dataset: {:.2f}%".format(accuracy * 100)) + logger.info( + "Micro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["micro avg"]["precision"] * 100, + report["micro avg"]["recall"] * 100, + report["micro avg"]["f1-score"] * 100, + ) + ) + logger.info( + "Macro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["macro avg"]["precision"] * 100, + 
report["macro avg"]["recall"] * 100, + report["macro avg"]["f1-score"] * 100, + ) + ) + + for i in label_map: + logger.info("Class name: {}".format(label_map[i])) + logger.info( + "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report[str(i)]["support"], + 100 * report[str(i)]["support"] / len(dev_ds), + report[str(i)]["precision"] * 100, + report[str(i)]["recall"] * 100, + report[str(i)]["f1-score"] * 100, + ) + ) + logger.info("----------------------------") + bad_case_path = os.path.join(args.dataset_dir, args.bad_case_file) + with open(bad_case_path, "w", encoding="utf-8") as f: + f.write("Text\tLabel\tPrediction\n") + for i in range(len(preds)): + for p, l in zip(preds[i], labels[i]): + if (p and l == 0) or (not p and l == 1): + pred_n = [label_map[i] for i, pp in enumerate(preds[i]) if pp] + f.write(dev_ds.data[i]["text"] + "\t" + dev_ds.data[i]["label_n"] + "\t" + ",".join(pred_n) + "\n") + break + + f.close() + logger.info("Bad case in dev dataset saved in {}".format(bad_case_path)) + + return + + +if __name__ == "__main__": + evaluate() diff --git a/applications/text_classification/multi_label/analysis/sent_interpret.py b/applications/text_classification/multi_label/analysis/sent_interpret.py new file mode 100644 index 0000000000000000000000000000000000000000..1f0e4a88c190a158f948930a6e6c74c266f9f5f8 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/sent_interpret.py @@ -0,0 +1,157 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--top_k", type=int, default=3, help="Top K important training data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--interpret_input_file", type=str, default="bad_case.txt", help="interpretation file name") +parser.add_argument("--interpret_result_file", type=str, default="sent_interpret.txt", help="interpreted file name") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if items[0] == "Text": + continue + if len(items) == 3: + yield {"text": items[0], "label": items[1], "predict": items[2]} + elif len(items) == 2: + yield {"text": items[0], "label": items[1], "predict": ""} + elif len(items) == 1: + yield {"text": items[0], "label": "", "predict": ""} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def find_positive_influence_data(): + + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + interpret_path = os.path.join(args.dataset_dir, args.interpret_input_file) + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + interpret_ds = interpret_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + interpret_batch_sampler = BatchSampler(interpret_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + interpret_data_loader = DataLoader( + dataset=interpret_ds, batch_sampler=interpret_batch_sampler, collate_fn=collate_fn + ) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & 
select sparse data + analysis_result = [] + for batch in interpret_data_loader: + analysis_result += feature_sim(batch, sample_num=args.top_k) + with open(os.path.join(args.dataset_dir, args.interpret_result_file), "w") as f: + for i in range(len(analysis_result)): + f.write("text: " + interpret_ds.data[i]["text"] + "\n") + if "predict" in interpret_ds.data[i]: + f.write("predict label: " + interpret_ds.data[i]["predict"] + "\n") + if "label" in interpret_ds.data[i]: + f.write("label: " + interpret_ds.data[i]["label"] + "\n") + f.write("examples with positive influence\n") + for i, (idx, score) in enumerate(zip(analysis_result[i].pos_indexes, analysis_result[i].pos_scores)): + f.write( + "support{} text: ".format(i + 1) + + train_ds.data[idx]["text"] + + "\t" + + "label: " + + train_ds.data[idx]["label"] + + "\t" + + "score: " + + "{:.5f}".format(score) + + "\n" + ) + f.close() + + +if __name__ == "__main__": + find_positive_influence_data() diff --git a/applications/text_classification/multi_label/analysis/sparse.py b/applications/text_classification/multi_label/analysis/sparse.py new file mode 100644 index 0000000000000000000000000000000000000000..80feb4f661348014bd44e28ec69b4d6bcff9d189 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/sparse.py @@ -0,0 +1,286 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--aug_strategy", choices=["duplicate", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--annotate", action='store_true', help="Select unlabeled data for annotation") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.") +parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.") +parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.") +parser.add_argument("--support_threshold", type=float, default="0.7", help="The threshold to select support data.") +parser.add_argument("--support_num", type=int, default=100, help="Number of support data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--unlabeled_file", type=str, default="data.txt", help="Unlabeled data filename") +parser.add_argument("--sparse_file", type=str, default="sparse.txt", help="Sparse data file name.") +parser.add_argument("--support_file", type=str, default="support.txt", help="support data file name.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 2: + yield {"text": items[0], "label": items[1]} + elif len(items) == 1: + yield {"text": items[0]} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def get_sparse_data(analysis_result, sparse_num): + """ + Get sparse data + """ + idx_scores = {} + preds = [] + for i in range(len(analysis_result)): + scores = analysis_result[i].pos_scores + idx_scores[i] = sum(scores) / len(scores) + preds.append(analysis_result[i].pred_label) + + idx_socre_list = list(sorted(idx_scores.items(), key=lambda x: x[1]))[:sparse_num] + ret_idxs, ret_scores = list(zip(*idx_socre_list)) + return ret_idxs, ret_scores, preds + + +def find_sparse_data(): + """ + Find sparse data (lack of supports in train dataset) in dev dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + train_path = os.path.join(args.dataset_dir, 
args.train_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in dev_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_sparse) + sparse_indexs, sparse_scores, preds = get_sparse_data(analysis_result, args.sparse_num) + + # Save the sparse data + with open(os.path.join(args.dataset_dir, args.sparse_file), "w") as f: + for idx in sparse_indexs: + data = dev_ds.data[idx] + f.write(data["text"] + "\t" + str(data["label"]) + "\n") + f.close() + logger.info("Sparse data saved in {}".format(os.path.join(args.dataset_dir, args.sparse_file))) + logger.info("Average score in sparse data: {:.4f}".format(sum(sparse_scores) / len(sparse_scores))) + return os.path.join(args.dataset_dir, args.sparse_file) + + +def get_support_data(analysis_result, support_num, support_threshold=0.7): + """ + get support data + """ + ret_idxs = [] + ret_scores = [] + rationale_idx = 0 + try: + while len(ret_idxs) < support_num: + for n in range(len(analysis_result)): + score = analysis_result[n].pos_scores[rationale_idx] + if score > support_threshold: + idx = analysis_result[n].pos_indexes[rationale_idx] + if idx not in ret_idxs: + ret_idxs.append(idx) + ret_scores.append(score) + if len(ret_idxs) >= support_num: + break + + rationale_idx += 1 + except IndexError: + logger.error( + f"The index is out of range, please reduce support_num or increase support_threshold. Got {len(ret_idxs)} now." 
+ ) + + return ret_idxs, ret_scores + + +def find_support_data(): + """ + Find support data (which supports sparse data) from candidate dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + if args.annotate: + candidate_path = os.path.join(args.dataset_dir, args.unlabeled_file) + else: + candidate_path = os.path.join(args.dataset_dir, args.train_file) + + sparse_path = os.path.join(args.dataset_dir, args.sparse_file) + support_path = os.path.join(args.dataset_dir, args.support_file) + candidate_ds = load_dataset(read_local_dataset, path=candidate_path, lazy=False) + sparse_ds = load_dataset(read_local_dataset, path=sparse_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + candidate_ds = candidate_ds.map(trans_func) + sparse_ds = sparse_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + candidate_batch_sampler = BatchSampler(candidate_ds, batch_size=args.batch_size, shuffle=False) + sparse_batch_sampler = BatchSampler(sparse_ds, batch_size=args.batch_size, shuffle=False) + candidate_data_loader = DataLoader( + dataset=candidate_ds, batch_sampler=candidate_batch_sampler, collate_fn=collate_fn + ) + sparse_data_loader = DataLoader(dataset=sparse_ds, batch_sampler=sparse_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, candidate_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis + analysis_result = [] + for batch in sparse_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_support) + + support_indexs, support_scores = get_support_data(analysis_result, args.support_num, args.support_threshold) + + # Save the support data + if args.annotate or args.aug_strategy == "duplicate": + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + if "label" in data: + f.write(data["text"] + "\t" + data["label"] + "\n") + else: + f.write(data["text"] + "\n") + f.close() + else: + create_n = 1 + aug_percent = 0.1 + if args.aug_strategy == "substitute": + aug = WordSubstitute("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=create_n, aug_percent=aug_percent) + + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + augs = aug.augment(data["text"]) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f.write(a + "\t" + data["label"] + "\n") + f.close() + logger.info("support data saved in {}".format(support_path)) + logger.info("support average scores: {:.4f}".format(float(sum(support_scores)) / len(support_scores))) + + +if __name__ == "__main__": + find_sparse_data() + find_support_data() diff --git a/applications/text_classification/multi_label/analysis/word_interpret.ipynb 
b/applications/text_classification/multi_label/analysis/word_interpret.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..936628db5fdcaad0808b26f41f713159b66b04b3 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/word_interpret.ipynb @@ -0,0 +1,380 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 词级别可解释性分析\n", + "本项目提供模型的词级别可解释性分析,包括LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用此项目进行数据分析。\n", + "\n", + "![image](https://user-images.githubusercontent.com/63761690/192739675-63145d59-23c6-416f-bf71-998fd4995254.png)\n", + "\n", + "## 1.导入Python模块与参数配置\n", + "首先我们导入必要的导入必要python模块和设置配置参数,词级别可解释性分析算法支持三种待分析的文本 `INTERPRETER_FILE` 数据文件格式:\n", + "\n", + "**格式一:包括文本、标签、预测结果**\n", + "```text\n", + "<文本>'\\t'<标签>'\\t'<预测结果>\n", + "...\n", + "```\n", + "\n", + "**格式二:包括文本、标签**\n", + "```text\n", + "<文本>'\\t'<标签>\n", + "...\n", + "```\n", + "\n", + "**格式三:只包括文本**\n", + "```text\n", + "<文本>\n", + "准予原告胡某甲与被告韩某甲离婚。\n", + "...\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "grep: warning: GREP_OPTIONS is deprecated; please use an alias or script\n", + "/usr/local/lib/python3.7/dist-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/image_utils.py:213: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.\n", + " resample=Image.BILINEAR,\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/image_utils.py:379: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.\n", + " resample=Image.NEAREST,\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/ernie_vil/feature_extraction.py:65: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.\n", + " resample=Image.BICUBIC,\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/clip/feature_extraction.py:64: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). 
Use Resampling.BICUBIC instead.\n", + " resample=Image.BICUBIC,\n" + ] + } + ], + "source": [ + "import functools\n", + "import random\n", + "import os\n", + "import argparse\n", + "\n", + "import jieba\n", + "import numpy as np\n", + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# 预先定义配置参数\n", + "\n", + "# 运行环境,可选\"cpu\",\"gpu\",\"gpu:x\"(x为gpu编号)\n", + "DEVICE = \"gpu\"\n", + "# 数据路径\n", + "DATASET_DIR = \"../data\" \n", + "# 训练模型保存路径\n", + "PARAM_PATH = \"../checkpoint/\" \n", + "# tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数\n", + "MAX_LENGTH = 128 \n", + "# 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数\n", + "BATCH_SIZE = 1 \n", + "# 待分析解释的数据\n", + "INTERPRETER_FILE = \"bad_case.txt\"\n", + "# 可选 \"ig\",\"lime\",\"grad\" ,可以根据实际任务效果选择解释器\n", + "# \"grad\":GradShap方法依赖interpretdl\n", + "# !pip install interpretdl\n", + "INTERPRETER = \"ig\"\n", + "# 分析句子中TOP K关键词,K值\n", + "KEY_WORDS_NUM = 5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2.读取待分析数据" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def read_local_dataset(path):\n", + " \"\"\"\n", + " Read dataset file\n", + " \"\"\"\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f:\n", + " items = line.strip().split('\\t')\n", + " if items[0] == 'Text':\n", + " continue\n", + " items[0] = items[0][:MAX_LENGTH-2]\n", + " if len(items) == 3:\n", + " yield {'text': items[0], 'label': items[1], 'predict': items[2]}\n", + " elif len(items) == 2:\n", + " yield {'text': items[0], 'label': items[1], 'predict': ''}\n", + " elif len(items) == 1:\n", + " yield {'text': items[0], 'label': '', 'predict': ''}\n", + " else:\n", + " raise ValueError(\"{} should be in fixed format.\".format(path))\n", + "\n", + "def preprocess_function(examples, tokenizer, max_seq_length):\n", + " \"\"\"\n", + " Preprocess dataset\n", + " \"\"\"\n", + " result = tokenizer(text=examples[\"text\"], max_seq_len=max_seq_length)\n", + " return result\n", + "\n", + "class LocalDataCollatorWithPadding(DataCollatorWithPadding):\n", + " \"\"\"\n", + " Convert the result of DataCollatorWithPadding from dict dictionary to a list\n", + " \"\"\"\n", + "\n", + " def __call__(self, features):\n", + " batch = super().__call__(features)\n", + " batch = list(batch.values())\n", + " return batch" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32m[2022-09-28 04:51:03,566] [ INFO]\u001b[0m - We are using to load '../checkpoint/'.\u001b[0m\n", + "W0928 04:51:03.570216 4827 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2\n", + "W0928 04:51:03.575362 4827 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.\n", + "\u001b[32m[2022-09-28 04:51:06,542] [ INFO]\u001b[0m - We are using to load '../checkpoint/'.\u001b[0m\n" + ] + } + ], + "source": [ + "paddle.set_device(DEVICE)\n", + "\n", + "# Define model & 
tokenizer\n", + "if os.path.exists(PARAM_PATH):\n", + " model = AutoModelForSequenceClassification.from_pretrained(PARAM_PATH)\n", + " tokenizer = AutoTokenizer.from_pretrained(PARAM_PATH)\n", + "else:\n", + " raise ValueError(\"The {} should exist.\".format(PARAM_PATH))\n", + "\n", + "# Prepare & preprocess dataset\n", + "interpret_path = os.path.join(DATASET_DIR, INTERPRETER_FILE)\n", + "\n", + "\n", + "interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False)\n", + "trans_func = functools.partial(preprocess_function,\n", + " tokenizer=tokenizer,\n", + " max_seq_length=MAX_LENGTH)\n", + "\n", + "interpret_ds = interpret_ds.map(trans_func)\n", + "\n", + "# Batchify dataset\n", + "collate_fn = LocalDataCollatorWithPadding(tokenizer)\n", + "interpret_batch_sampler = BatchSampler(interpret_ds,\n", + " batch_size=BATCH_SIZE,\n", + " shuffle=False)\n", + "interpret_data_loader = DataLoader(dataset=interpret_ds,\n", + " batch_sampler=interpret_batch_sampler,\n", + " collate_fn=collate_fn)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.开始数据可解释性分析\n", + "数据量较大时,数据分析时间较长,请耐心等待" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start token level interpretion, it will take some time...\n", + "Building prefix dict from the default dictionary ...\n", + "Loading model from cache /tmp/jieba.cache\n", + "Loading model cost 0.751 seconds.\n", + "Prefix dict has been built successfully.\n", + "Start word level alignment, it will take some time...\n" + ] + } + ], + "source": [ + "# Init an interpreter\n", + "if INTERPRETER == 'ig':\n", + " from trustai.interpretation.token_level import IntGradInterpreter\n", + " interpreter = IntGradInterpreter(model)\n", + "elif INTERPRETER == 'lime':\n", + " from trustai.interpretation.token_level import LIMEInterpreter\n", + " interpreter = LIMEInterpreter(model, unk_id=tokenizer.convert_tokens_to_ids('[UNK]'), pad_id=tokenizer.convert_tokens_to_ids('[PAD]'))\n", + "else:\n", + " from trustai.interpretation.token_level import GradShapInterpreter\n", + " interpreter = GradShapInterpreter(model)\n", + "\n", + "# Use interpreter to get the importance scores for all data\n", + "print(\"Start token level interpretion, it will take some time...\")\n", + "analysis_result = []\n", + "for batch in interpret_data_loader:\n", + " analysis_result += interpreter(tuple(batch))\n", + "\n", + "# Add CLS and SEP tags to both original text and standard splited tokens\n", + "contexts = []\n", + "words = []\n", + "for i in range(len(interpret_ds)):\n", + " text = interpret_ds.data[i][\"text\"]\n", + " contexts.append(\"[CLS]\" + text + \"[SEP]\")\n", + " words.append([\"[CLS]\"] + list(jieba.cut(text)) + [\"[SEP]\"])\n", + "\n", + "# Get the offset map of tokenized tokens and standard splited tokens\n", + "print(\"Start word level alignment, it will take some time...\")\n", + "ori_offset_maps = []\n", + "word_offset_maps = []\n", + "for i in range(len(contexts)):\n", + " ori_offset_maps.append(tokenizer.get_offset_mapping(contexts[i]))\n", + " word_offset_maps.append(get_word_offset(contexts[i], words[i]))\n", + "\n", + "align_res = interpreter.alignment(analysis_result, contexts, words, word_offset_maps, ori_offset_maps, special_tokens=[\"[CLS]\", '[SEP]'],rationale_num=KEY_WORDS_NUM)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.数据可解释性分析结果可视化\n", + "使用用颜色深浅可视化方式代表句子中词对预测结果的重要程度" + ] + }, + 
{ + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import display, HTML\n", + "class Visualization(VisualizationTextRecord):\n", + "\n", + " def __init__(self, interpret_res, true_label=None, pred_label=None, words=None):\n", + " if words is not None:\n", + " self.words = words\n", + " else:\n", + " self.words = interpret_res.words\n", + " self.pred_label = pred_label if pred_label is not None else ''\n", + " self.true_label = true_label if true_label is not None else ''\n", + " self.key_words = \" \".join(set(interpret_res.rationale_tokens))\n", + " word_attributions = interpret_res.word_attributions\n", + " _max = max(word_attributions)\n", + " _min = min(word_attributions)\n", + " self.word_attributions = [(word_imp - _min) / (_max - _min) for word_imp in word_attributions]\n", + "\n", + " def record_html(self):\n", + " \"\"\"change all informations to html\"\"\"\n", + " return \"\".join([\n", + " \"\",\n", + " self._format_class(self.true_label),\n", + " self._format_class(self.pred_label),\n", + " self._format_class(self.key_words),\n", + " self._format_word_attributions(),\n", + " \"\",\n", + " ])\n", + " def _format_class(self, label):\n", + " return '{label}'.format(label=label)\n", + "\n", + "def visualize_text(text_records):\n", + " \"\"\"visualize text\"\"\"\n", + " html = [\"\"]\n", + " rows = [\"\"\n", + " \"\"\n", + " \"\"\n", + " \"\"]\n", + " for record in text_records:\n", + " rows.append(record.record_html())\n", + " html.append(\"\".join(rows))\n", + " html.append(\"
LabelPredictionKey wordsImportant visualization
\")\n", + " html = HTML(\"\".join(html))\n", + " display(html)\n", + " return html.data\n", + "\n", + "\n", + "def visualize(interpret_res, ds):\n", + " records = []\n", + " for i in range(len(interpret_res)):\n", + " records.append(Visualization(interpret_res[i], true_label=ds.data[i][\"label\"], pred_label=ds.data[i][\"predict\"]))\n", + " html = visualize_text(records)\n", + " return html" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Label: 不履行家庭义务,婚后分居 | Prediction: 婚后分居 | Key words: 至今 双方 出 分居 。 | Important visualization: [CLS] 2015 2 23 被告 原告 家门 原告 居住 娘家 待产 双方 分居 至今 [SEP]
Label: 婚后有子女,限制行为能力子女抚养 | Prediction: 婚后有子女,限制行为能力子女抚养,不履行离婚协议 | Key words: 财产 符合 付清 欠条 抚养 | Important visualization: [CLS] 被告 孙某 辩称 离婚 协议 关于 财产 分割 给付 资金 符合 法律 规定 只有 离婚 子女 抚养 符合 法律 规定 没有 协议 代表 被告 真实 意思 表示 离婚 协议 没有 约定 付款 时间 而且 被告 原告 出具 欠条 5 年内 付清 原告 期满 起诉 驳回 [SEP]
Label: 存在非婚生子,支付抚养费,限制行为能力子女抚养 | Prediction: 限制行为能力子女抚养,存在非婚生子 | Key words: 赵某 并非 认可 之女 表示 | Important visualization: [CLS] 被告 董某 认可 赵某 并非 原告 之女 表示 愿意 自行 抚养 赵某 [SEP]
Label: 准予离婚 | Prediction: 准予离婚,法定离婚 | Key words: 原告 韩某 准予 离婚 。 | Important visualization: [CLS] 准予 原告 胡某 被告 韩某 离婚 [SEP]
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# process for vbisualize\n", + "html = visualize(align_res, interpret_ds)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.7.13 64-bit", + "metadata": { + "interpreter": { + "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90" + } + }, + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13-final" + }, + "orig_nbformat": 2 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/README.md b/applications/text_classification/multi_label/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2b7ef12d709fc579fa8dfe5a9725789671243e81 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/README.md @@ -0,0 +1,190 @@ +# 基于Paddle Serving的服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建多标签在线服务部署。 + +## 目录 +- [环境准备](#环境准备) +- [模型转换](#模型转换) +- [部署模型](#部署模型) + +## 环境准备 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以删去` -i https://mirror.baidu.com/pypi/simple` ,更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` +### 安装Paddle Serving +安装client和serving app,用于向服务发送请求: +``` +pip install paddle_serving_app paddle_serving_client +``` +安装serving,用于启动服务,根据服务器设备选择安装CPU server或GPU server: + +- 安装CPU server +```shell +pip install paddle_serving_server +``` +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 默认开启国内清华镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://pypi.tuna.tsinghua.edu.cn/simple) +- 更多wheel包请参考[serving官网文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md) + +### 安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 + +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 + +```shell +python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams +``` + +可以通过命令查参数含义: +```shell +python -m paddle_serving_client.convert --help +``` + +转换成功后的目录如下: +``` +paddle_serving/ 
+├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +└──serving_client + ├── serving_client_conf.prototxt + └── serving_client_conf.stream.prototxt +``` + +## 部署模型 + +serving目录包含启动pipeline服务和发送预测请求的代码和模型,包括: + +``` +serving/ +├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +├──config.yml # 分类任务启动服务端的配置文件 +├──rpc_client.py # 分类任务发送pipeline预测请求的脚本 +└──service.py # 分类任务启动服务端的脚本 +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。比如: +``` +# 修改模型目录为下载的模型目录或自己的模型目录: +model_config: serving_server => model_config: erine-3.0-tiny/serving_server + +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 + +# 修改使用GPU推理为使用CPU推理: +device_type: 1 => device_type: 0 + +#开启MKLDNN加速 +#use_mkldnn: False => use_mkldnn: True + +#Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 +fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] +``` + +### 分类任务 +#### 启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" +``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + +输出打印如下: +``` +[DAG] Succ init +[PipelineServicer] succ init +...... +--- Running analysis [ir_graph_to_program_pass] +I0625 16:44:36.563802 40218 analysis_predictor.cc:1007] ======= optimize end ======= +I0625 16:44:36.571702 40218 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0625 16:44:36.571728 40218 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0625 16:44:36.574352 40218 naive_executor.cc:102] --- skip [linear_147.tmp_1], fetch -> fetch +[2022-06-25 16:44:37,545] [ WARNING] - Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. You can install fast_tokenizer by `pip install fast-tokenizer-python`. +[2022-06-25 16:44:37,546] [ INFO] - We are using to load 'ernie-3.0-medium-zh'. +[2022-06-25 16:44:37,546] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_base_zh_vocab.txt +[OP Object] init success +W0625 16:45:40.312942 40218 gpu_context.cc:278] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.2 +W0625 16:45:40.316538 40218 gpu_context.cc:306] device: 3, cuDNN Version: 8.1. 
+``` + +#### 启动rpc client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python rpc_client.py +``` +输出打印如下: +``` +data: 五松新村房屋是被告婚前购买的; +label: 婚前个人财产 +-------------------- +data: 被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。 +label: 有夫妻共同财产,有夫妻共同债务 +-------------------- +data: 2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元; +label: 婚前个人财产 +-------------------- +data: 一、判决原告于某某与被告杨某某离婚; +label: 准予离婚,法定离婚 +``` +#### 启动http client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python http_client.py +``` +输出打印如下: +``` +data: 五松新村房屋是被告婚前购买的; +label: 婚前个人财产 +-------------------- +data: 被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。 +label: 有夫妻共同财产,有夫妻共同债务 +-------------------- +data: 2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元; +label: 婚前个人财产 +-------------------- +data: 一、判决原告于某某与被告杨某某离婚; +label: 准予离婚,法定离婚 +``` diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/config.yml b/applications/text_classification/multi_label/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..62a1a3056b826619c7c640fcb9c426a2d96fc28f --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/config.yml @@ -0,0 +1,59 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18090 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9878 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 1 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + seq_cls: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: serving_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准 + fetch_list: ["linear_75.tmp_1"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #开启MKLDNN加速 + use_mkldnn: True + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: True + + #开启tensorrt后,进行优化的子图包含的最少节点数 + #min_subgraph_size: 10 \ No newline at end of file diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/http_client.py b/applications/text_classification/multi_label/deploy/paddle_serving/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..df4d2d9313bcd1f73f50a62d6e564d5b7cfcb10f --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/http_client.py @@ -0,0 +1,74 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json + +import numpy as np +import requests + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.server_url = server_url + + def Run(self, text, label_list): + sentence = np.array([t.encode("utf-8") for t in text], dtype=np.object_) + sentence = sentence.__repr__() + data = {"key": ["sentence"], "value": [sentence]} + data = json.dumps(data) + + ret = requests.post(url=self.server_url, data=data) + ret = ret.json() + for t, l in zip(text, eval(ret["value"][0])): + print("text: ", t) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "http://127.0.0.1:9878/seq_cls/prediction" + runner = Runner(server_url) + text = [ + "五松新村房屋是被告婚前购买的;", + "被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。", + "2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元;", + "一、判决原告于某某与被告杨某某离婚;", + ] + label_list = [ + "婚后有子女", + "限制行为能力子女抚养", + "有夫妻共同财产", + "支付抚养费", + "不动产分割", + "婚后分居", + "二次起诉离婚", + "按月给付抚养费", + "准予离婚", + "有夫妻共同债务", + "婚前个人财产", + "法定离婚", + "不履行家庭义务", + "存在非婚生子", + "适当帮助", + "不履行离婚协议", + "损害赔偿", + "感情不和分居满二年", + "子女随非抚养权人生活", + "婚后个人财产", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/rpc_client.py b/applications/text_classification/multi_label/deploy/paddle_serving/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..4146d4b591470254b687c815a40989327a3d25cd --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/rpc_client.py @@ -0,0 +1,68 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
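+
+# Pipeline gRPC client for the multi-label classification service started by
+# service.py. It connects to the rpc_port configured in config.yml (18090 by
+# default), sends a batch of raw sentences to the "seq_cls" op, and maps the
+# returned comma-separated class ids back to readable label names via
+# label_list before printing each prediction.
+# Example usage (assuming the service is already running on this machine):
+#     python rpc_client.py
+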
+import numpy as np +from paddle_serving_server.pipeline import PipelineClient + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data, label_list): + sentence = np.array([x.encode("utf-8") for x in data], dtype=np.object_) + ret = self.client.predict(feed_dict={"sentence": sentence}) + for d, l in zip(data, eval(ret.value[0])): + print("data: ", d) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "127.0.0.1:18090" + runner = Runner(server_url) + text = [ + "五松新村房屋是被告婚前购买的;", + "被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。", + "2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元;", + "一、判决原告于某某与被告杨某某离婚;", + ] + label_list = [ + "婚后有子女", + "限制行为能力子女抚养", + "有夫妻共同财产", + "支付抚养费", + "不动产分割", + "婚后分居", + "二次起诉离婚", + "按月给付抚养费", + "准予离婚", + "有夫妻共同债务", + "婚前个人财产", + "法定离婚", + "不履行家庭义务", + "存在非婚生子", + "适当帮助", + "不履行离婚协议", + "损害赔偿", + "感情不和分居满二年", + "子女随非抚养权人生活", + "婚后个人财产", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/service.py b/applications/text_classification/multi_label/deploy/paddle_serving/service.py new file mode 100644 index 0000000000000000000000000000000000000000..cac5558a62865d1eb3b9795a0e36cfe96079ad1a --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/service.py @@ -0,0 +1,106 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging + +import numpy as np +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.transformers import AutoTokenizer + +_LOGGER = logging.getLogger() + +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +args = parser.parse_args() +# fmt: on + + +class Op(Op): + def init_op(self): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True) + # Output nodes may differ from model to model + # You can see the output node name in the conf.prototxt file of serving_server + self.fetch_names = [ + FETCH_NAME_MAP[args.model_name], + ] + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["sentence"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + _LOGGER.error("input value {}is not supported.".format(data)) + data = [i.decode("utf-8") for i in data] + + # tokenizer + pad + data = self.tokenizer( + data, + max_length=args.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + ) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + + return tokenized_data, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + + results = fetch_dict[self.fetch_names[0]] + results = np.array(results) + labels = [] + + for result in results: + label = [] + result = 1 / (1 + (np.exp(-result))) + for i, p in enumerate(result): + if p > 0.5: + label.append(str(i)) + labels.append(",".join(label)) + return {"label": labels}, None, "" + + +class Service(WebService): + def get_pipeline_response(self, read_op): + return Op(name="seq_cls", input_ops=[read_op]) + + +if __name__ == "__main__": + service = Service(name="seq_cls") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/applications/text_classification/multi_label/deploy/predictor/README.md b/applications/text_classification/multi_label/deploy/predictor/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7bc4e828c0df3d518e2e671bf7b3ddeca924d054 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/predictor/README.md @@ -0,0 +1,178 @@ +# 离线推理部署指南 + +**目录** + * [环境准备](#环境准备) + * [基于GPU部署推理样例](#基于GPU部署推理样例) + * [基于CPU部署推理样例](#基于CPU部署推理样例) + * [性能与精度测试](#性能与精度测试) + +## 环境准备 + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型使用裁剪API进行裁剪之后会自动生成静态图模型。 + +如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +python -m pip install onnxruntime-gpu onnx onnxconverter-common==1.9.0 paddle2onnx==1.0.5 +``` + +如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +python -m pip install onnxruntime +``` + +安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` +## 基于GPU部署推理样例 + +请使用如下命令进行部署 +``` +python infer.py \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 
32 \ + --dataset_dir "../../data" +``` +多语言模型加上`--multilingual`,裁剪后的模型前缀为`--model_path_prefix ../../prune/width_mult_XXXX/pruned_model`。 + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_fp16`:选择是否开启FP16进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `device_id`: 选择GPU卡号;默认为0。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, test.txt/dev.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,可选'dev'、'test',选择在开发集或测试集评估模型;默认为"dev"。 +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 + +在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。可以使用如下命令开启ONNXRuntime的FP16进行推理加速: + +``` +python infer.py \ + --use_fp16 \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +## 基于CPU部署推理样例 + +请使用如下命令进行部署 +``` +python infer.py \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh",中文数据集推荐使用"ernie-3.0-medium-zh"。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_quantize`:选择是否开启INT8动态量化进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `num_threads`:cpu线程数;默认为cpu的物理核心数量。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, dev.txt/test.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,选择在开发集或测试集评估模型;默认为"dev"。 + +可以使用如下命令开启ONNXRuntime的INT8动态量化进行推理加速: + +``` +python infer.py \ + --use_quantize \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +**Note**:INT8动态量化与FP16相比精度损失较大,GPU部署建议使用FP16加速。 + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + + +## 性能与精度测试 + + +测试配置如下: + +1. CAIL2019—婚姻家庭要素提取任务开发集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 
性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 精度评价指标:Micro F1分数、Macro F1分数 + +| | Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------- |------------- | +| ERNIE 3.0 Medium+FP32+GPU | 90.57|79.36| 1.46| +| ERNIE 3.0 Medium+FP16+GPU | 90.57| 79.36| 0.49| +| ERNIE 3.0 Medium+FP32+CPU | 90.57|79.36| 47.92 | +| ERNIE 3.0 Medium+INT8+CPU | 90.05 | 77.69| 34.24 | + + +经过FP16转化加速比达到3~4倍左右,精度变化较小,与FP16相比,INT8在线量化精度下降较大,加速比在1.5倍左右 diff --git a/applications/text_classification/multi_label/deploy/predictor/infer.py b/applications/text_classification/multi_label/deploy/predictor/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..fcf562cb6fb4a4b6c89c2fce783e2ccc06591892 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/predictor/infer.py @@ -0,0 +1,89 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import psutil +from predictor import Predictor + +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu, only takes effect when deploying on cpu.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--device_id', default=0, help="Select which gpu device to train model.") +parser.add_argument("--perf", action='store_true', help="Whether to compute the latency and f1 score of the test set.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="The dataset directory including data.txt, taxonomy.txt, test.txt(optional, if evaluate the performance).") +parser.add_argument("--perf_dataset", choices=['dev', 'test'], default='dev', type=str, help="evaluate the performance on dev dataset or test dataset") +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +args = parser.parse_args() +# yapf: enable + + +def read_local_dataset(path, label_list): + label_list_dict = {label_list[i]: i for i in range(len(label_list))} + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list_dict[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} + + +if __name__ == "__main__": + + label_list = [] + label_dir = os.path.join(args.dataset_dir, "label.txt") + with open(label_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + label_list.append(line.strip()) + f.close() + + predictor = Predictor(args, label_list) + + if args.perf: + eval_dir = os.path.join(args.dataset_dir, "{}.txt".format(args.perf_dataset)) + eval_ds = load_dataset(read_local_dataset, path=eval_dir, label_list=label_list, lazy=False) + texts, labels = predictor.get_text_and_label(eval_ds) + + # preprocess & evaluate & latency + preprocess_result = predictor.preprocess(texts) + predictor.evaluate(preprocess_result, labels) + predictor.performance(preprocess_result) + else: + data = [] + data_dir = os.path.join(args.dataset_dir, "data.txt") + with open(data_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + data.append(line.strip()) + f.close() + predictor.predict(data) diff --git a/applications/text_classification/multi_label/deploy/predictor/predictor.py b/applications/text_classification/multi_label/deploy/predictor/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..d1b6a8785d182a70caf0384ba5cd18c02e18a961 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/predictor/predictor.py @@ -0,0 +1,229 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import numpy as np +import onnxruntime as ort +import paddle2onnx +from sklearn.metrics import f1_score + +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__( + self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, use_quantize=False, num_threads=10 + ): + logger.info(">>> [InferBackend] Creating Engine ...") + onnx_model = paddle2onnx.export( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + + if device == "gpu": + + logger.info(">>> [InferBackend] Use GPU to inference ...") + + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + if use_quantize: + logger.info( + ">>> [InferBackend] use_quantize only takes effect when deploying on cpu, use_fp16 for acceleration when deploying on gpu ..." + ) + sess_options = ort.SessionOptions() + self.predictor = ort.InferenceSession( + onnx_model, + sess_options=sess_options, + providers=["CUDAExecutionProvider"], + provider_options=[{"device_id": device_id}], + ) + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + if use_fp16: + logger.info( + ">>> [InferBackend] use_fp16 only takes effect when deploying on gpu, use_quantize for acceleration when deploying on cpu ..." 
+ ) + if use_quantize: + dynamic_quantize_model = os.path.join(infer_model_dir, "int8_model.onnx") + self.dynamic_quantize(float_onnx_file, dynamic_quantize_model) + onnx_model = dynamic_quantize_model + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=["CPUExecutionProvider"] + ) + + logger.info(">>> [InferBackend] Engine Created ...") + + def dynamic_quantize(self, input_float_model, dynamic_quantized_model): + from onnxruntime.quantization import quantize_dynamic + + quantize_dynamic(input_float_model, dynamic_quantized_model) + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +def sigmoid_(x): + """ + compute sigmoid + """ + return 1 / (1 + np.exp(-x)) + + +class Predictor(object): + def __init__(self, args, label_list): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + self.label_list = label_list + self.batch_size = args.batch_size + self.max_seq_length = args.max_seq_length + self.multilingual = args.multilingual + self.inference_backend = InferBackend( + args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.use_quantize, args.num_threads + ) + + def preprocess(self, input_data: list): + + # tokenizer + pad + data = self.tokenizer( + input_data, + max_length=self.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + return_token_type_ids=not self.multilingual, + ) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + return tokenized_data + + def postprocess(self, infer_data): + threshold = 0.5 + + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_data) + labels = [] + + for prob in probs: + label = [] + + for i, p in enumerate(prob): + if p > threshold: + label.append(i) + + labels.append(label) + + return labels + + def infer(self, data): + infer_data = self.inference_backend.infer(data) + logits = np.array(infer_data[0]) + return logits + + def infer_batch(self, preprocess_result): + sample_num = len(preprocess_result["input_ids"]) + infer_result = None + for i in range(0, sample_num, self.batch_size): + batch_size = min(self.batch_size, sample_num - i) + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] for j in range(batch_size) + ] + + result = self.infer(preprocess_result_batch) + if infer_result is None: + infer_result = result + else: + infer_result = np.append(infer_result, result, axis=0) + return infer_result + + def printer(self, result, input_data): + + for idx, text in enumerate(input_data): + labels = [] + logger.info("input data: {}".format(text)) + for r in result[idx]: + labels.append(self.label_list[r]) + logger.info("labels: {}".format(",".join(labels))) + logger.info("----------------------------") + + def predict(self, input_data: list): + preprocess_result = self.preprocess(input_data) + infer_result = self.infer_batch(preprocess_result) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return + + def performance(self, preprocess_result): + nums = len(preprocess_result["input_ids"]) + + start = time.time() + self.infer_batch(preprocess_result) + total_time = time.time() - start + logger.info("sample nums: %s, time: %.2f, latency: %.2f ms" % 
(nums, total_time, 1000 * total_time / nums)) + return + + def evaluate(self, preprocess_result, labels): + + infer_result = self.infer_batch(preprocess_result) + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_result) + preds = probs > 0.5 + micro_f1_score = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1_score = f1_score(y_pred=preds, y_true=labels, average="macro") + logger.info("micro f1: %.2f, macro f1: %.2f" % (micro_f1_score * 100, macro_f1_score * 100)) + + return + + def get_text_and_label(self, ds): + """ + Return text and label list + """ + all_texts = [] + all_labels = [] + for ii in range(len(ds)): + all_texts.append(ds[ii]["sentence"]) + labels = [float(1) if i in ds[ii]["label"] else float(0) for i in range(len(self.label_list))] + all_labels.append(labels) + return all_texts, all_labels diff --git a/applications/text_classification/multi_label/deploy/simple_serving/README.md b/applications/text_classification/multi_label/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..583887019358cc2cb36abc8488f0b9bdc76a8249 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/README.md @@ -0,0 +1,41 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp --upgrade +``` +## Server服务启动 +### 分类任务启动 +#### 启动 分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` +如果是ERNIE-M模型则启动 +```bash +paddlenlp server ernie_m_server:app --host 0.0.0.0 --port 8189 +``` +#### 分类任务发送服务 +```bash +python client.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size`, `prob_limit` 参数 +```python + data = { + 'data': { + 'text': texts, + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'prob_limit': args.prob_limit + } + } +``` diff --git a/applications/text_classification/multi_label/deploy/simple_serving/client.py b/applications/text_classification/multi_label/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..de347a19654a31e5666bc5e75d693c7e426a8e9e --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/client.py @@ -0,0 +1,43 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
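+
+# Example SimpleServing client for the multi-label classification service: it posts a
+# batch of texts, together with the max_seq_len / batch_size / prob_limit parameters,
+# to http://0.0.0.0:8189/models/cls_multi_label and prints the JSON response.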
+ +import argparse +import requests +import json + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--prob_limit", default=0.5, type=float, help="The limitation of probability for the label.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/cls_multi_label" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = [ + "原、被告另购置橱柜、碗架、电磁炉、电饭锅各一个归原告王某某所有。", + "于是原告到儿子就读的幼儿园进行探望,被告碰见后对原告破口大骂,还不让儿子叫原告妈妈,而叫被告现在的妻子做妈妈。", + "由我全额出资购买的联想台式电脑,我均依次放弃。", + ] + data = { + "data": { + "text": texts, + }, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "prob_limit": args.prob_limit}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(json.loads(r.text)) diff --git a/applications/text_classification/multi_label/deploy/simple_serving/ernie_m_server.py b/applications/text_classification/multi_label/deploy/simple_serving/ernie_m_server.py new file mode 100644 index 0000000000000000000000000000000000000000..ba7d0cbaf23ceaa74bb25521f1d6905776e9c523 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/ernie_m_server.py @@ -0,0 +1,26 @@ +# coding:utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import ERNIEMHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_multi_label", + model_path="../../export", + tokenizer_name="ernie-m-base", + model_handler=ERNIEMHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/multi_label/deploy/simple_serving/server.py b/applications/text_classification/multi_label/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..caec715de52b9f75a1edd13c8840920f671af3e1 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/server.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
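+
+# SimpleServing server definition for the multi-label task: it registers the exported
+# static-graph model under "models/cls_multi_label" with the ernie-3.0-medium-zh
+# tokenizer and the multi-label post-processing handler. As described in the README,
+# it is started with: paddlenlp server server:app --host 0.0.0.0 --port 8189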
+ +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_multi_label", + model_path="../../export", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=CustomModelHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/multi_label/deploy/triton_serving/README.md b/applications/text_classification/multi_label/deploy/triton_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c8541c22ccf5d85ac856cc4c5477d9a06efbe163 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/README.md @@ -0,0 +1,188 @@ +# 基于Triton Inference Server的服务化部署指南 + +本文档将介绍如何使用[Triton Inference Server](https://github.com/triton-inference-server/server)工具部署基于ERNIE 3.0中文模型文本多标签分类的pipeline在线服务。 + +## 目录 +- [服务端环境准备](#服务端环境准备) +- [模型获取和转换](#模型获取和转换) +- [部署模型](#部署模型) +- [客户端请求](#客户端请求) + +## 服务端环境准备 + +### 安装Triton Server +拉取Triton Server镜像: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3 +``` +启动容器: +```shell +docker run -it --gpus all --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash +``` + +**NOTE:** + +1. Triton版本号`21.10`可以根据自己的需求调整,各个Triton版本对应的Driver、CUDA、TRT和ONNX Runtime等后端版本可以参考[官网文档](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)。注意其中的`NVIDIA Driver`行,如果NVIDIA Driver低于文档中要求,在启动运行时会报错。 + +2. 可以使用`--gpus '"device=1"'`来指定GPU卡号,更多GPU指定方式请参见[Nvidia User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration) + + +### 进入容器并准备PaddleNLP环境 +整个服务的前后处理依赖PaddleNLP,需要在容器内安装相关python包 + +进入容器: +```shell +docker exec -it triton_server bash +``` +安装PaddlePaddle、PaddleNLP +```shell +python3 -m pip install paddlepaddle-gpu paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**NOTE:** + +1. 默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://mirror.baidu.com/pypi/simple) + +2. 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.2, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + +3. 
更多关于PaddleNLP安装的详细教程请查看[Installation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + + +### 安装FastTokenizer文本处理加速库(可选) + +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 + +在容器内安装 fast_tokenizer +```shell +python3 -m pip install fast-tokenizer-python +``` + + +## 模型获取和转换 + +使用Triton做服务化部署时,选择ONNX Runtime后端运行需要先将模型转换成ONNX格式。 + + +首先将保存的动态图参数导出成静态图参数,具体代码见[静态图导出脚本](../../export_model.py),静态图参数保存在`output_path`指定路径中,裁剪API裁剪会自动保存静态图模型。运行方式: + +```shell +python ../../export_model.py --params_path=../../checkpoint/model_state.pdparams --output_path=./infer_model +``` + +使用Paddle2ONNX将Paddle静态图模型转换为ONNX模型格式的命令如下,以下命令成功运行后,将会在当前目录下生成model.onnx模型文件。 + +用Paddle2ONNX转换分类模型 +```shell +paddle2onnx --model_dir infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True +``` +创建空白目录/seqcls/1和seqcls_model/1,并将将转换好的ONNX模型移动到模型仓库目录 +```shell +mkdir /models/seqcls/1 +mkdir /models/seqcls_model/1 +mv model.onnx /models/seqcls_model/1 +``` + +Paddle2ONNX的命令行参数说明请查阅:[Paddle2ONNX命令行参数说明](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) + +模型下载转换好之后,models目录结构如下: +``` +models +├── seqcls +│   ├── 1 +│   └── config.pbtxt +├── seqcls_model +│   ├── 1 +│   │   └── model.onnx +│   └── config.pbtxt +├── seqcls_postprocess +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── tokenizer + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +模型配置文件config.pbtxt配置细节请参见[Triton Server Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) + +## 部署模型 + +triton目录包含启动pipeline服务的配置和发送预测请求的代码,包括: + +``` +models # Triton启动需要的模型仓库,包含模型和服务配置文件 +seqcls_grpc_client.py # 分类任务发送pipeline预测请求的脚本 +``` + +### 启动服务端 + + +在容器内执行下面命令启动服务,默认启动models下所有模型: +```shell +tritonserver --model-repository=/models +``` +也可以通过设定参数只启动单一任务服务: +```shell +tritonserver --model-repository=/models --model-control-mode=explicit --load-model=seqcls +``` + +**NOTE:** + +启动服务时,Triton Server的每个python后端进程默认申请`64M`内存,默认启动的docker无法启动多个python后端节点。两个解决方案: + +1. 启动容器时设置`shm-size`参数, 比如:`docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash` + +2. 启动服务时设置python后端的`shm-default-byte-size`参数, 设置python后端的默认内存为10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760` + +输出打印如下: + +``` +... +I0619 13:40:51.590901 5127 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime +I0619 13:40:51.590938 5127 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.590947 5127 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6 +I0619 13:40:51.623808 5127 openvino.cc:1193] TRITONBACKEND_Initialize: openvino +I0619 13:40:51.623862 5127 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.623868 5127 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6 +I0619 13:40:52.980990 5127 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d8000000' with size 268435456 +... +I0619 13:43:33.360018 5127 server.cc:592] ++--------------------+---------+--------+ +| Model | Version | Status | ++--------------------+---------+--------+ +| seqcls | 1 | READY | +| seqcls_model | 1 | READY | +| seqcls_postprocess | 1 | READY | +| tokenizer | 1 | READY | ++--------------------+---------+--------+ +... 
+I0619 13:43:33.365824 5127 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001 +I0619 13:43:33.366221 5127 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000 +I0619 13:43:33.409775 5127 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 +``` + +## 客户端请求 + +### 客户端环境准备 +客户端请求有两种方式,可以选择在本地执行脚本请求,或下载官方客户端镜像在容器中执行。 + +方式一:本地执行脚本,需要先安装依赖: +``` +pip install grpcio +pip install tritonclient==2.10.0 +``` + +方式二:拉取官网镜像并启动容器: + +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk +docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash +``` + +### 启动客户端测试 +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +``` +python seqcls_grpc_client.py +``` diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82261157aefe68bac9a1865d888c0257d2e905e8 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls/config.pbtxt @@ -0,0 +1,75 @@ +name: "seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d61a7fec6d5a2d266c1d5d43ff11638d445e2036 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -0,0 +1,36 @@ +platform: "onnxruntime_onnx" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 20 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] + +optimization { + graph: {level: -1} +} + +parameters { key: "intra_op_thread_count" value: { string_value: "0" } } +parameters { key: "execution_mode" value: { string_value: "0" } } +parameters { key: "inter_op_thread_count" value: { string_value: "0" } } diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/1/model.py 
b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5db7ef0c7746db295e9110817db6982704d6ac1b --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.txt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. 
The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = 1 / (1 + (np.exp((-data[0])))) + + probs = [] + labels = [] + for l, p in enumerate(data): + if p > 0.5: + labels.append(l) + probs.append(p) + + labels = np.array(labels, dtype=self.output_dtype[0]) + probs = np.array(probs, dtype=self.output_dtype[1]) + # print(labels, probs) + out_tensor1 = pb_utils.Tensor(self.output_names[0], labels) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..c625f28cc43537863fbeff341f94a7a3f9f60b31 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 20 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..97a96222075370ebcca531250dc6cc7c0e8fead0 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel(object): + """Your Python model must use the same class name. 
Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.pbtxt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + # print("input_ids:", input_ids) + # print("token_type_ids:", token_type_ids) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. 
+ """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d35d1f44968ba205b1890899a82568d33e90a999 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_label/deploy/triton_serving/seqcls_grpc_client.py b/applications/text_classification/multi_label/deploy/triton_serving/seqcls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..0f10e3a90d9f628114a94a858e996d6f593955c8 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/seqcls_grpc_client.py @@ -0,0 +1,114 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + texts = [ + ["五松新村房屋是被告婚前购买的;"], + ["被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。"], + ["一、判决原告于某某与被告杨某某离婚;"], + ] + for text in texts: + # input format:[input1, input2 ... inputn], n = len(self._input_names) + result = runner.Run([text]) + print("text: ", text) + print("label: ", ",".join([str(r) for r in result["label"]])) + print("confidence: ", ",".join([str("%.3f" % c) for c in result["confidence"]])) + print("--------------------") diff --git a/applications/text_classification/multi_label/export_model.py b/applications/text_classification/multi_label/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..c57dc23372f9b934fbee6686092309cd5ef5b22a --- /dev/null +++ b/applications/text_classification/multi_label/export_model.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
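+
+# Export a fine-tuned dynamic-graph checkpoint to a static-graph inference model
+# (saved as <output_path>/float32.pdmodel and float32.pdiparams). ERNIE-M style
+# models (--multilingual) only take input_ids; other models also take token_type_ids.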
+ +import argparse +import os + +import paddle +from paddlenlp.transformers import AutoModelForSequenceClassification + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +parser.add_argument("--params_path", type=str, default='./checkpoint/', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + model.eval() + if args.multilingual: + input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "float32") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/multi_label/few-shot/README.md b/applications/text_classification/multi_label/few-shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d291ac73308a4a36e809db7ca7702aacdc343216 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/README.md @@ -0,0 +1,377 @@ +# 小样本场景下的多标签分类任务指南 + +**零样本/小样本文本分类推荐使用 UTC 模型,详情见[目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification),本项目将会在2.5.2版本下线。** + +## 目录 + +- [1. 项目说明](#项目说明) +- [2. 效果展示](#效果展示) +- [3. 定制训练](#定制训练) + - [3.1 运行环境](#运行环境) + - [3.2 代码结构](#代码结构) + - [3.3 数据标注](#数据标注) + - [3.4 模型训练](#模型训练) + - [3.5 模型评估](#模型评估) + - [3.6 模型部署](#模型部署) +- [4. References](#References) + + +## 1. 项目说明 + +本项目提供了小样本场景下文本多标签分类的解决方案,在 ERNIE3.0 的基础上利用提示学习取得比微调更好的分类效果,充分利用标注信息。 + +近年来,大量包含了案件事实及其适用法律条文信息的裁判文书逐渐在互联网上公开,海量的数据使自然语言处理技术的应用成为可能。现实中的案情错综复杂,案情描述通常涉及多个重要事实,以CAIL2019数据集中婚姻家庭领域的案情要素抽取为例: + +```text +"2013年11月28日原、被告离婚时自愿达成协议,婚生子张某乙由被告李某某抚养,本院以(2013)宝渭法民初字第01848号民事调解书对该协议内容予以了确认,该协议具有法律效力,对原、被告双方均有约束力。" +``` +该案件中涉及`婚后有子女`、`限制行为能力子女抚养`两项要素。接下来我们将讲解在小样本场景下如何利用多标签模型,对输入文本中进行案情重要要素抽取。 + +**文本多标签分类** 用于预测样本属于哪些标签类别,这些类别具有不相互排斥的属性,在商品分类、网页标签、新闻标注、蛋白质功能分类、电影分类、语义场景分类等现实场景中有着广泛应用。 +现有的主流解决方案是在预训练语言模型上进行微调,因为多标签分类任务与预训练阶段的掩码预测任务有着天然的差异,想要取得较好的分类效果往往需要大量数据标注。 + +**提示学习(Prompt Learning)** 的主要思想是将二/多分类任务转换为掩码预测任务,充分利用预训练语言模型学习到的特征,从而降低样本需求。以情感分类任务为例,标签分为`1-正向`,`0-负向`两类,如下图所示,通过提示`我[MASK]喜欢。`,原有`1-正向`,`0-负向`的标签被转化为了预测空格是`很`还是`不`。 + +
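+To make the idea above concrete, the following is a minimal, library-free sketch of the toy sentiment example from the previous paragraph. The template string, label-word mapping and helper functions here are illustrative assumptions only; the actual template and label words for this project are configured later through the `--prompt` argument and `label.txt`.
+
+```python
+# Toy illustration of prompt learning (assumed example, not this project's real template):
+# classification is recast as predicting the character at the [MASK] position.
+
+template = "{text}我[MASK]喜欢。"                # prompt appended to the input text
+label_words = {"1-正向": "很", "0-负向": "不"}   # verbalizer: one character per label
+
+def build_masked_input(text: str) -> str:
+    """Wrap the raw text with the prompt; the MLM head then fills in [MASK]."""
+    return template.format(text=text)
+
+def word_to_label(predicted_char: str) -> str:
+    """Map the character predicted at [MASK] back to the original label."""
+    inverse = {word: label for label, word in label_words.items()}
+    return inverse.get(predicted_char, "未知")
+
+print(build_masked_input("这部电影太好看了"))  # -> 这部电影太好看了我[MASK]喜欢。
+print(word_to_label("很"))                     # -> 1-正向
+```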
+ +
+ +微调方法和提示方法的区别如图所示: + +【微调学习】需要学习的参数是以 `[CLS]` 向量为输入,以负向/正向为输出的随机初始化的分类器。 + +【提示学习】通过构造提示,将原有的分类任务转化为掩码预测,即掩盖原句中的某个字,用模型预测该字。此时的分类器不再是随机初始化,而是利用了待预测字的预训练向量来初始化,充分利用了预训练模型学习到的参数。 + +【方案选择】对于标注样本充足的场景可以直接使用[微调学习](../README.md)实现文本多分类,对于尚无标注或者标注样本较少的任务场景我们推荐使用提示学习,以取得更好的效果。 + +### 方案特点 + +- **标注成本低**:以往的微调方式需要大量的数据标注才能保证模型分类效果。提示学习可以降低数据标注依赖,在小样本(few-shot)的场景下取得比微调更好的分类效果。 +- **全流程打通**:提供了从训练到部署的完整解决方案,可以低成本迁移至实际应用场景。 + + +## 2.效果展示 + +本项目中使用了 ERNIE3.0 模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们测评了 Base 模型在婚姻家庭要素提取任务上的表现。测试配置如下: + +1. 数据集:CAIL2019—婚姻家庭要素提取任务小样本数据集测试集。 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.4rc + +4. PaddleNLP 版本:2.4.3 + +5. 评估设置 + +- 每个 epoch 评估一次,按照验证集上的评价指标,取分数最高的模型参数用于测试集的评估。表格中的最终结果为重复 10 次的均值。 +- 为了避免过拟合,这里使用了早停机制 (Early-stopping)。因为微调方式收敛较慢,且波动较大,我们将微调方式的早停步数增加为 10 步。 +- 测试脚本如下 + - 微调 + + ``` + cd ../ + python train.py --dataset_dir "./data/" --save_dir "./checkpoints" --max_seq_length 128 --model_name "ernie-3.0-base-zh" --batch_size 8 --learning_rate 3e-5 --epochs 100 --logging_steps 5 --early_stop --early_stop_num 10 + ``` + + - 提示学习 + + ``` + python train.py --data_dir ./data/ --output_dir ./checkpoints/ --prompt "这句话包含的要素有" --model_name_or_path ernie-3.0-base-zh --max_seq_length 128 --learning_rate 3e-5 --ppt_learning_rate 3e-4 --do_train --do_eval --num_train_epochs 100 --logging_steps 5 --per_device_eval_batch_size 32 --per_device_train_batch_size 8 --do_predict --metric_for_best_model macro_f1_score --load_best_model_at_end --eval_steps 100 --save_total_limit 1 + ``` + +6. 精度评价指标:Micro F1分数、Macro F1分数 + + | model_name | 训练方式 | Micro F1分数 | Macro F1分数 | + | ---------- | ------- | ----------- | ----------- | + | ernie-3.0-base-zh | 微调学习 | 0.7419 | 0.5105 | + | ernie-3.0-base-zh | 提示学习 | 0.7839 | 0.6003 | + + +## 3.定制训练 + +下边通过婚姻家庭要素提取的例子展示如何使用小样本学习来进行文本分类。 + + +### 3.1 运行环境 + +- python >= 3.7 +- paddlepaddle >= 2.4rc +- paddlenlp >= 2.4.3 +- paddle2onnx >= 1.0.3 + + +### 3.2 代码结构 + +```text +. +├── train.py # 模型组网训练脚本 +├── utils.py # 数据处理工具 +├── infer.py # 模型部署脚本 +└── README.md +``` + + +### 3.3 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano)进行自定义数据标注,本项目也打通了从标注到训练的通道,即doccano导出数据后可通过[doccano.py](../../doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano数据标注指南](../../doccano.md)。 + +**示例数据** + +这里我们使用CAIL2019—婚姻家庭要素提取任务数据集的子集作为示例数据集。该数据集中原始训练集包括 14377 条标注样本,我们按每条标签随机采样 4 条样本,得到 80 条样本数据作为训练集,剩余训练集数据作为测试集。可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/few-shot/elements.tar.gz)下载解压并放入`./data/`文件夹,或者运行以下脚本 + +``` +wget https://paddlenlp.bj.bcebos.com/datasets/few-shot/elements.tar.gz +tar zxvf elements.tar.gz +mv elements data +``` + +**数据格式** + +下边主要介绍多标签分类任务自定义数据集的格式要求,整体目录如下 + +```text +data/ +├── train.txt # 训练数据集 +├── dev.txt # 验证数据集 +├── test.txt # 测试数据集(可选) +├── data.txt # 待预测数据(可选) +└── label.txt # 分类标签集 +``` + +**训练/验证/测试数据** + +对于训练/验证/测试数据集文件,每行数据表示一条样本,包括文本和标签两部分,由tab符`\t`分隔,多个标签以英文逗号`,`分隔。格式如下 +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` +例如,在婚姻家庭要素提取数据集中 +``` +现在原告已是第二次申请与被告离婚了。 二次起诉离婚 +双方均认可价值6万元。 不动产分割,有夫妻共同财产 +2004年4月,原、被告发生纠纷后,被告离家外出未归,直到现在,双方长期分居生活,十几年间互无联系,夫妻感情已经完全破裂。 婚后分居 +婚生子杨某甲由原告抚养,高中阶段之前的相关费用由原告承担,高中阶段之后的相关费用由双方协商,被告可以随时探望孩子; 婚后有子女,支付抚养费,限制行为能力子女抚养 +... +``` + +**预测数据** + +对于待预测数据文件,每行包含一条待预测样本,无标签。格式如下 +```text +<文本> +<文本> +... 
+``` +例如,在婚姻家庭要素提取数据集中 +``` +五松新村房屋是被告婚前购买的; +2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元; +... +``` + +**标签数据** + +对于分类标签集文件,存储了数据集中所有的标签集合,每行为一个标签名。如果需要自定义标签映射用于分类器初始化,则每行需要包括标签名和相应的映射词,由`==`分隔。格式如下 +```text +<标签>'=='<映射词> +<标签>'=='<映射词> +... +``` +例如,对于婚姻家庭要素提取数据集,原标签字数较多,因此同一个标签依赖的输出也多。为了降低训练难度,我们可以将其映射为较短的短语 +``` +有夫妻共同债务==共同债务 +存在非婚生子==非婚生子 +... +``` +**Note**: 这里的标签映射词定义遵循的规则是,不同映射词尽可能长度一致,映射词和提示需要尽可能构成通顺的语句。越接近自然语句,小样本下模型训练效果越好。如果原标签名已经可以构成通顺语句,也可以不构造映射词,每行一个标签即可,即 +``` +有夫妻共同债务 +存在非婚生子 +... +``` + + +### 3.4 模型训练 + +**单卡训练** + +``` +python train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话包含的要素有" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--evaluation_strategy epoch \ +--save_strategy epoch +``` +**多卡训练** + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus 0,1,2,3 train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话包含的要素有" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--evaluation_strategy epoch \ +--save_strategy epoch +``` + +可配置参数说明: +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 训练数据集路径,数据格式要求详见[数据标注](#数据标注)。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `prompt`: 提示模板。定义了如何将文本和提示拼接结合。 +- `soft_encoder`: 提示向量的编码器,`lstm`表示双向LSTM, `mlp`表示双层线性层, None表示直接使用提示向量。默认为`lstm`。 +- `use_rdrop`: 使用 [R-Drop](https://arxiv.org/abs/2106.14448) 策略。 +- `use_rgl`: 使用 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略。 +- `encoder_hidden_size`: 提示向量的维度。若为None,则使用预训练模型字向量维度。默认为200。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `ppt_learning_rate`: 提示相关参数的基础学习率大小,当预训练参数不固定时,与其共用learning rate scheduler。一般设为`learning_rate`的十倍。 +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `do_predict`: 是否进行预测。 +- `do_export`: 是否在运行结束时将模型导出为静态图,保存路径为`output_dir/export`。 +- `num_train_epochs`: 训练的最大轮数。 +- `max_steps`: 训练的最大步数。此设置将会覆盖`num_train_epochs`。 +- `save_total_limit`: 模型检查点保存数量。 +- `device`: 使用的设备,默认为`gpu`。 +- `eval_steps`: 评估模型的间隔步数。 +- `logging_steps`: 打印日志的间隔步数。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `load_best_model_at_end`: 是否在模型训练结束后加载评估指标最优的模型参数。 +- `evaluation_strategy`: 模型评估的间隔策略。若为`epoch`,则每轮训练结束后评估模型。 +- `save_strategy`: 模型保存的间隔策略。若为`epoch`,则每轮训练结束后保存当前模型参数。 + +更多参数介绍可参考[配置文件](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + + +### 3.5 模型评估 + +在模型训练时开启`--do_predict`,训练结束后直接在测试集上`test.txt`进行评估,也可以在训练结束后,通过运行以下命令加载模型参数进行评估: +``` +python train.py --do_predict --data_dir ./data --output_dir ./predict_checkpoint --resume_from_checkpoint ./checkpoints/ --max_seq_length 128 +``` + +可配置参数说明: + +- `data_dir`: 测试数据路径。测试数据应存放在该目录下`test.txt`文件中,每行一条待预测文本。 +- `output_dir`: 日志的保存目录。 +- `resume_from_checkpoint`: 
训练时模型参数的保存目录,用于加载模型参数。 +- `do_predict`: 是否进行测试集评估。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 + + +### 3.6 模型部署 + +#### 模型导出 + +在训练结束后,需要将动态图模型导出为静态图参数用于部署推理。可以在模型训练时开启`--do_export`在训练结束后直接导出,也可以运行以下命令加载并导出训练后的模型参数,默认导出到在`output_dir`指定的目录下。 +``` +python train.py --do_export --data_dir ./data --output_dir ./export_checkpoint --resume_from_checkpoint ./checkpoints/ +``` + +可配置参数说明: + +- `data_dir`: 标签数据路径。 +- `output_dir`: 静态图模型参数和日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_export`: 是否将模型导出为静态图,保存路径为`output_dir/export`。 +- `export_type`: 模型导出的格式,默认为`paddle`,即导出静态图。 + +#### ONNXRuntime部署 + +**运行环境** + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +- 如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime-gpu onnx onnxconverter-common +``` + +- 如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime +``` + +**CPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device cpu +``` + +**GPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device gpu --device_id 0 +``` + +可配置参数说明: + +- `model_path_prefix`: 导出的静态图模型路径及文件前缀。 +- `model_name`: 内置预训练模型名,用于加载tokenizer。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 待推理数据所在路径,数据应存放在该目录下的`data.txt`文件。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `batch_size`: 每次预测的样本数量。 +- `device`: 选择推理设备,包括`cpu`和`gpu`。默认为`gpu`。 +- `device_id`: 指定GPU设备ID。 +- `use_fp16`: 是否使用半精度加速推理。仅在GPU设备上有效。 +- `num_threads`: 设置CPU使用的线程数。默认为机器上的物理内核数。 + +**Note**: 在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。 + + +## 4. References + +- Liu, Xiao, et al. "GPT understands, too." arXiv preprint arXiv:2103.10385 (2021). [[PDF]](https://arxiv.org/abs/2103.10385) +- Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May. "Warp: Word-level adversarial reprogramming." arXiv preprint arXiv:2101.00121 (2021). [[PDF]](https://arxiv.org/abs/2101.00121) +- Ding, Ning, et al. "Openprompt: An open-source framework for prompt-learning." arXiv preprint arXiv:2111.01998 (2021). [[PDF]](https://arxiv.org/abs/2111.01998) diff --git a/applications/text_classification/multi_label/few-shot/infer.py b/applications/text_classification/multi_label/few-shot/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..1de1407b9a2610ed51c00b40081f3bccc423fa63 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/infer.py @@ -0,0 +1,227 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
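+
+# Offline ONNXRuntime inference for the prompt-tuned multi-label model: the exported
+# static graph is converted to ONNX via paddle2onnx, the prompt template and verbalizer
+# are rebuilt from template_config.json / verbalizer_config.json in the export
+# directory, and predictions are produced for data.txt under --data_dir.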
+ +import argparse +import json +import os + +import numpy as np +import onnxruntime as ort +import paddle2onnx +import psutil +import six + +from paddlenlp.prompt import AutoTemplate, PromptDataCollatorWithPadding +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument("--model_name", default="ernie-3.0-base-zh", type=str, help="The name of pretrained model.") +parser.add_argument("--data_dir", default=None, type=str, help="The path to the prediction data, including label.txt and data.txt.") +parser.add_argument("--max_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu.") +parser.add_argument("--device", choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") +args = parser.parse_args() +# yapf: enable + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10): + + if not isinstance(device, six.string_types): + logger.error( + ">>> [InferBackend] The type of device must be string, but the type you set is: ", type(device) + ) + exit(0) + if device not in ["cpu", "gpu"]: + logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to:", type(device)) + exit(0) + + logger.info(">>> [InferBackend] Creating Engine ...") + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb", encoding="utf-8") as f: + f.write(onnx_model) + + if device == "gpu": + logger.info(">>> [InferBackend] Use GPU to inference ...") + providers = ["CUDAExecutionProvider"] + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + providers = ["CPUExecutionProvider"] + if use_fp16: + logger.warning( + ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..." 
+ ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +class MultiLabelPredictor(object): + def __init__(self, args): + self.args = args + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name) + self.model = AutoModelForMaskedLM.from_pretrained(args.model_name) + self.template, self.labels, self.input_handles = self.post_init() + self.collate_fn = PromptDataCollatorWithPadding( + self.tokenizer, padding=True, return_tensors="np", return_attention_mask=True + ) + + self.inference_backend = InferBackend( + self.args.model_path_prefix, + self.args.device, + self.args.device_id, + self.args.use_fp16, + self.args.num_threads, + ) + + def post_init(self): + export_path = os.path.dirname(self.args.model_path_prefix) + template_path = os.path.join(export_path, "template_config.json") + with open(template_path, "r", encoding="utf-8") as fp: + prompt = json.load(fp) + template = AutoTemplate.create_from(prompt, self.tokenizer, self.args.max_length, self.model) + keywords = template.extract_template_keywords(template.prompt) + inputs = ["input_ids", "token_type_ids", "position_ids", "attention_mask", "masked_positions"] + if "soft" in keywords: + inputs.append("soft_token_ids") + if "encoder" in keywords: + inputs.append("encoder_ids") + verbalizer_path = os.path.join(export_path, "verbalizer_config.json") + with open(verbalizer_path, "r", encoding="utf-8") as fp: + label_words = json.load(fp) + labels = sorted(list(label_words.keys())) + + return template, labels, inputs + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, inputs): + num_sample = len(inputs) + infer_data = None + num_infer_data = None + for index in range(0, num_sample, self.args.batch_size): + left, right = index, index + self.args.batch_size + batch_dict = self.collate_fn(inputs[left:right]) + input_dict = {} + for key in self.input_handles: + value = batch_dict[key] + if key == "attention_mask": + if value.ndim == 2: + value = (1 - value[:, np.newaxis, np.newaxis, :]) * -1e4 + elif value.ndim != 4: + raise ValueError("Expect attention mask with ndim=2 or 4, but get ndim={}".format(value.ndim)) + value = value.astype("float32") + else: + value = value.astype("int64") + input_dict[key] = value + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) + for i in range(num_infer_data): + 
infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def preprocess(self, input_data: list): + text = [{"text_a": x} for x in input_data] + inputs = [self.template(x) for x in text] + return inputs + + @staticmethod + def sigmoid(z): + return 1 / (1 + np.exp(-z)) + + def postprocess(self, infer_data): + threshold = 0.5 + probs = self.sigmoid(infer_data[0]) + label_ids = np.argwhere(probs > threshold) + labels = [[] for _ in range(probs.shape[0])] + for idx, label_id in label_ids: + labels[idx].append(self.labels[label_id]) + return {"label": labels} + + def printer(self, result, input_data): + label = result["label"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}".format(", ".join(label[i]))) + logger.info("-----------------------------") + + +if __name__ == "__main__": + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + predictor = MultiLabelPredictor(args) + + text_dir = os.path.join(args.data_dir, "data.txt") + with open(text_dir, "r", encoding="utf-8") as f: + text_list = [x.strip() for x in f.readlines()] + + predictor.predict(text_list) diff --git a/applications/text_classification/multi_label/few-shot/metric.py b/applications/text_classification/multi_label/few-shot/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..f41317ba76030105364c73843b6cf4fa9ad6d0bb --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from paddle.metric import Metric +from sklearn.metrics import classification_report, f1_score + +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for multi-label text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/multi_label/few-shot/requirements_cpu.txt b/applications/text_classification/multi_label/few-shot/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbe76e363f00631d66e0733833813cad5991f009 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/requirements_cpu.txt @@ -0,0 +1,5 @@ +psutil +paddlepaddle>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime diff --git a/applications/text_classification/multi_label/few-shot/requirements_gpu.txt b/applications/text_classification/multi_label/few-shot/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..66454bd8b6b5fe08521215d4a5c2e7242225d869 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/requirements_gpu.txt @@ -0,0 +1,7 @@ +psutil +paddlepaddle-gpu>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime-gpu +onnx +onnxconverter-common diff --git a/applications/text_classification/multi_label/few-shot/train.py b/applications/text_classification/multi_label/few-shot/train.py new file mode 100644 index 0000000000000000000000000000000000000000..345a5217158355bb95864f00227e114dc3f25fcf --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/train.py @@ -0,0 +1,133 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
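+
+# NOTE: prompt-tuning entry point for few-shot multi-label classification. The script
+# builds a template from `--prompt` and a SoftVerbalizer from `label.txt` (optionally
+# using the `标签==映射词` mapping), wraps the masked-LM backbone in
+# PromptModelForSequenceClassification, and trains it with PromptTrainer using
+# BCEWithLogitsLoss, micro/macro F1 metrics and early stopping; it can also evaluate on
+# the test set and export a static graph for deployment.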
+ +import os +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +import paddle.nn.functional as F +from metric import MetricReport +from utils import load_local_dataset + +from paddlenlp.prompt import ( + AutoTemplate, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftVerbalizer, +) +from paddlenlp.trainer import EarlyStoppingCallback, PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + data_dir: str = field(default="./data", metadata={"help": "The dataset dictionary includes train.txt, dev.txt and label.txt files."}) + prompt: str = field(default=None, metadata={"help": "The input prompt for tuning."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-base-zh", metadata={"help": "The build-in pretrained model or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Load the pretrained language model. + model = AutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Define the template for preprocess and the verbalizer for postprocess. + template = AutoTemplate.create_from(data_args.prompt, tokenizer, training_args.max_seq_length, model=model) + logger.info("Using template: {}".format(template.prompt)) + + label_file = os.path.join(data_args.data_dir, "label.txt") + with open(label_file, "r", encoding="utf-8") as fp: + label_words = defaultdict(list) + for line in fp: + data = line.strip().split("==") + word = data[1] if len(data) > 1 else data[0].split("##")[-1] + label_words[data[0]].append(word) + verbalizer = SoftVerbalizer(label_words, tokenizer, model) + + # Load the few-shot datasets. + train_ds, dev_ds, test_ds = load_local_dataset( + data_path=data_args.data_dir, splits=["train", "dev", "test"], label_list=verbalizer.labels_to_ids + ) + + # Define the criterion. + criterion = paddle.nn.BCEWithLogitsLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds): + metric = MetricReport() + preds = F.sigmoid(paddle.to_tensor(eval_preds.predictions)) + metric.update(preds, paddle.to_tensor(eval_preds.label_ids)) + micro_f1_score, macro_f1_score = metric.accumulate() + return {"micro_f1_score": micro_f1_score, "macro_f1_score": macro_f1_score} + + # Deine the early-stopping callback. + callbacks = [EarlyStoppingCallback(early_stopping_patience=4, early_stopping_threshold=0.0)] + + # Initialize the trainer. + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + compute_metrics=compute_metrics, + ) + + # Training. 
+ if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Prediction. + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Export static model. + if training_args.do_export: + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_label/few-shot/utils.py b/applications/text_classification/multi_label/few-shot/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3a9c2a5b83faff2b8d512d43bc9471e603c7f6fd --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/utils.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from paddlenlp.datasets import load_dataset + + +def load_local_dataset(data_path, splits, label_list): + """ + Load dataset for multi-label classification from files, where + there is one example per line. Text and label are separated + by '\t', and multiple labels are delimited by ','. + + Args: + data_path (str): + Path to the dataset directory, including label.txt, train.txt, + dev.txt (and data.txt). + splits (list): + Which file(s) to load, such as ['train', 'dev', 'test']. + label_list (dict): + The dictionary that maps labels to indeces. + """ + + def _reader(data_file, label_list): + with open(data_file, "r", encoding="utf-8") as fp: + for idx, line in enumerate(fp): + data = line.strip().split("\t") + if len(data) == 1: + yield {"text_a": data[0]} + else: + text, label = data + label = label.strip().split(",") + label = [float(1) if x in label else float(0) for x in label_list] + yield {"text_a": text, "labels": label} + + split_map = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"} + datasets = [] + for split in splits: + data_file = os.path.join(data_path, split_map[split]) + datasets.append(load_dataset(_reader, data_file=data_file, label_list=label_list, lazy=False)) + return datasets diff --git a/applications/text_classification/multi_label/metric.py b/applications/text_classification/multi_label/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..44ca1c37f12c80bcd2df0b8c8d0c6d50cde3a014 --- /dev/null +++ b/applications/text_classification/multi_label/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from sklearn.metrics import f1_score, classification_report + +from paddle.metric import Metric +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for multi-label text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/multi_label/predict.py b/applications/text_classification/multi_label/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..c0ad1f4a72bec354812872f81c116fa73c663656 --- /dev/null +++ b/applications/text_classification/multi_label/predict.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
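+
+# NOTE: offline prediction script for the fine-tuned multi-label classifier. It loads
+# the checkpoint from `--params_path`, reads unlabeled texts and the label set from
+# `--dataset_dir` (data.txt / label.txt by default), applies a per-label sigmoid with a
+# 0.5 threshold, and writes tab-separated text/label results to `--output_file`.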
+ +import argparse +import functools +import os + +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from utils import preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include data.txt and label.txt") +parser.add_argument("--output_file", default="output.txt", type=str, help="Save prediction result") +parser.add_argument("--params_path", default="./checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--data_file", type=str, default="data.txt", help="Unlabeled data file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def predict(): + """ + Predicts the data labels. + """ + paddle.set_device(args.device) + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + + label_list = [] + label_path = os.path.join(args.dataset_dir, args.label_file) + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + label_list.append(line.strip()) + + data_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.data_file), is_test=True, lazy=False + ) + + trans_func = functools.partial( + preprocess_function, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + label_nums=len(label_list), + is_test=True, + ) + + data_ds = data_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + data_batch_sampler = BatchSampler(data_ds, batch_size=args.batch_size, shuffle=False) + + data_data_loader = DataLoader(dataset=data_ds, batch_sampler=data_batch_sampler, collate_fn=collate_fn) + + results = [] + model.eval() + for batch in data_data_loader: + logits = model(**batch) + probs = F.sigmoid(logits).numpy() + for prob in probs: + labels = [] + for i, p in enumerate(prob): + if p > 0.5: + labels.append(i) + results.append(labels) + + with open(args.output_file, "w", encoding="utf-8") as f: + f.write("text" + "\t" + "label" + "\n") + for d, result in zip(data_ds.data, results): + label = [label_list[r] for r in result] + f.write(d["sentence"] + "\t" + ", ".join(label) + "\n") + logger.info("Prediction results save in {}.".format(args.output_file)) + + return + + +if __name__ == "__main__": + + predict() diff --git a/applications/text_classification/multi_label/prune.py b/applications/text_classification/multi_label/prune.py new file mode 100644 index 0000000000000000000000000000000000000000..16d8b1596f9bb31bcbbd901e0cb6df11dbd2e543 --- /dev/null +++ b/applications/text_classification/multi_label/prune.py @@ -0,0 +1,123 @@ +# Copyright (c) 
2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import functools +import os +from dataclasses import dataclass, field + +import paddle +import paddle.nn.functional as F +from metric import MetricReport +from paddleslim.nas.ofa import OFA +from utils import preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import CompressionArguments, PdArgumentParser, Trainer +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset_dir: str = field(default=None, metadata={"help": "Local dataset directory should include train.txt, dev.txt and label.txt."}) + max_seq_length: int = field(default=128, metadata={"help": "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + params_dir: str = field(default='./checkpoint/', metadata={"help": "The output directory where the model checkpoints are written."}) +# yapf: enable + + +@paddle.no_grad() +def custom_evaluate(self, model, data_loader): + metric = MetricReport() + model.eval() + metric.reset() + for batch in data_loader: + logits = model(batch["input_ids"], batch["token_type_ids"]) + # Supports paddleslim.nas.ofa.OFA model and nn.layer model. 
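+        # OFA-wrapped models return a list/tuple of outputs; the first element is the logits.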
+ if isinstance(model, OFA): + logits = logits[0] + probs = F.sigmoid(logits) + metric.update(probs, batch["labels"]) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info("micro f1 score: %.5f, macro f1 score: %.5f" % (micro_f1_score, macro_f1_score)) + model.train() + return macro_f1_score + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + paddle.set_device(compression_args.device) + compression_args.strategy = "dynabert" + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + label_list = {} + label_path = os.path.join(data_args.dataset_dir, "label.txt") + train_path = os.path.join(data_args.dataset_dir, "train.txt") + dev_path = os.path.join(data_args.dataset_dir, "dev.txt") + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, label_list=label_list, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + + model = AutoModelForSequenceClassification.from_pretrained(model_args.params_dir) + tokenizer = AutoTokenizer.from_pretrained(model_args.params_dir) + + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=data_args.max_seq_length, label_nums=len(label_list) + ) + train_dataset = train_ds.map(trans_func) + dev_dataset = dev_ds.map(trans_func) + + # Define data collector, criterion + data_collator = DataCollatorWithPadding(tokenizer) + criterion = paddle.nn.BCEWithLogitsLoss() + + trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=dev_dataset, + criterion=criterion, + ) # Strategy`dynabert` needs arguments `criterion` + + compression_args.print_config() + + trainer.compress(custom_evaluate=custom_evaluate) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_label/retrieval_based/README.md b/applications/text_classification/multi_label/retrieval_based/README.md new file mode 100644 index 0000000000000000000000000000000000000000..835336e446ab01cf6c06123d8c73bd2addf43346 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/README.md @@ -0,0 +1,512 @@ +# 基于检索的多标签文本分类方法 + + **目录** + +* [1. 基于语义索引的多标签分类任务介绍](#基于语义索引的多标签分类任务介绍) +* [2. 代码结构说明](#代码结构说明) +* [3. 环境准备](#环境准备) +* [4. 数据准备](#数据准备) +* [5. 模型训练](#模型训练) +* [6. 模型评估](#模型训练) +* [7. 模型预测](#模型预测) +* [8. 模型部署](#模型部署) +* [9. 分类流程](#分类流程) + + + +# 1.基于语义索引的多标签分类任务介绍 + +以前的分类任务中,标签信息作为无实际意义,独立存在的one-hot编码形式存在,这种做法会潜在的丢失标签的语义信息,本方案把文本分类任务中的标签信息转换成含有语义信息的语义向量,将文本分类任务转换成向量检索和匹配的任务。这样做的好处是对于一些类别标签不是很固定的场景,或者需要经常有一些新增类别的需求的情况非常合适。另外,对于一些新的相关的分类任务,这种方法也不需要模型重新学习或者设计一种新的模型结构来适应新的任务。总的来说,这种基于检索的文本分类方法能够有很好的拓展性,能够利用标签里面包含的语义信息,不需要重新进行学习。这种方法可以应用到相似标签推荐,文本标签标注,金融风险事件分类,政务信访分类等领域。 + +本方案是基于语义索引模型的分类,语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。基于语义索引的多标签分类方法有两种,第一种方法是直接把标签变成召回库,即把输入文本和标签的文本进行匹配,第二种是利用召回的文本带有类别标签,把召回文本的类别标签作为给定输入文本的类别。本方案使用双塔模型,训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,并把标签作为召回库,进行召回测试。最后利用召回的结果使用 Accuracy 指标来评估语义索引模型的分类的效果。 + +**注意** 基于语义索引的文本分类的标签在预测过程中会抽取成向量,所以标签需要文本的形式,不能是ID形式的标签。 + + + +## 2. 
代码结构说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train.py # In-batch Negatives 策略的训练主脚本 +|—— model.py # In-batch Negatives 策略核心网络结构 + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— metric.py # Macro F1和Micro F1评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 + |—— run.sh # 构建Milvus向量的 bash 版本 +|—— utils + ├── config.py # Milvus 的配置文件 + ├── feature_extract.py # 向量抽取文件 + ├── milvus_util.py # Milvus 的配置文件 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 + +``` + + + +## 3. 环境准备 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.7 +* paddlepaddle >= 2.3.1 +* paddlenlp >= 2.3.4 +* hnswlib >= 0.5.2 +* visualdl >= 2.2.2 + +``` +pip install -r requirements.txt +``` + + + +## 4. 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)进行文本分类数据标注。 + +**指定格式本地数据集目录结构** + +``` +├── data # 数据集目录 + ├── label.txt # 标签集 + ├── dev.txt # 验证集 + ├── test.txt # 测试集 + ├── train.txt # 训练集 +``` + +**训练、开发、测试数据集** + +train.txt(训练数据集文件), dev.txt(开发数据集文件),test.txt(可选,测试数据集文件),文件中文本与标签类别名用tab符`'\t'`分隔开,由于文本有多个标签,需要把文本拆分成多个单标签的形式,即多行文本标签对。训练集指用于训练模型的数据;开发集指用于评测模型表现的数据,可以根据模型在开发集上的精度调整训练参数和模型;测试集用于测试模型表现,没有测试集时可以使用开发集代替。 + +- train.txt/test.txt 文件格式: +```text +<文本>'\t'<标签1> +<文本>'\t'<标签2> +... +``` +- train.txt/test.txt 文件样例: +```text +茄子怎么做才好吃?怎么做才好吃? 生活 +茄子怎么做才好吃?怎么做才好吃? 美食/烹饪 +茄子怎么做才好吃?怎么做才好吃? 烹饪方法 +... +``` +- dev.txt 文件格式: +``` +<文本>'\t'<标签1>,<标签2>,<标签3> +<文本>'\t'<标签2>,<标签5> +``` +- dev.txt 文件样例: +```text +克隆人的思想和原体是一样的吗? 教育/科学,理工学科,生物学 +家用燃气灶哪个牌子好些?最好是高效节能的~最好是高效的~ 生活,购物 +咨询下,黛莱美面膜怎么样,那里有咨询下,黛莱美怎么样,那里有 生活,美容/塑身,护肤 +... +``` + +**分类标签** + +label.txt(分类标签文件)记录数据集中所有标签集合,每一行为一个标签名。 +- label.txt 文件格式: +```text +<标签> +<标签> +... +``` +- label.txt 文件样例: +```text +魔力宝贝 +完美游戏 +地球科学 +美食/烹饪 +肝胆外科 +动漫 +婚嫁 +历史话题 +新生儿 +... +``` + + + +## 5. 
模型训练
+
+我们使用百科知识问答的数据来构建训练集和开发集。
+
+**训练集(train.txt)** 和 **开发集(dev.txt)** 格式一致,训练集30k条,开发集10k条,每行由文本的标题、内容和类别标签组成,以tab符分隔,第一列是问题标题与问题描述的拼接,剩下的列是问题的类别。
+**召回库(label.txt)** 类别的数量是323类,召回标签库的构建有2种方式,第一种是把所有的类别标签当成召回库,第二种是把训练集当成召回集合,我们以第一种为例。
+
+数据集选择的是百科问答数据集的一个子集,问答数据集详情请参考[nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus)。
+
+- [baike_qa_category](https://paddlenlp.bj.bcebos.com/applications/baike_qa_multilabel.zip)
+
+```
+wget https://paddlenlp.bj.bcebos.com/applications/baike_qa_multilabel.zip
+unzip baike_qa_multilabel.zip
+```
+
+### 单机单卡训练/单机多卡训练
+
+这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1 卡;如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。
+
+如果使用CPU进行训练,则需要把`--gpus`参数去除,并把`device`设置成cpu即可,详细配置请参考train.sh文件中的训练设置。
+
+然后运行下面的命令使用GPU训练,得到语义索引模型:
+
+```
+root_path=inbatch
+data_path=data
+python -u -m paddle.distributed.launch --gpus "0,1" \
+    train.py \
+    --device gpu \
+    --save_dir ./checkpoints/${root_path} \
+    --batch_size 24 \
+    --learning_rate 5E-5 \
+    --epochs 100 \
+    --output_emb_size 0 \
+    --save_steps 50 \
+    --max_seq_length 384 \
+    --warmup_proportion 0.0 \
+    --margin 0.2 \
+    --recall_result_dir "recall_result_dir" \
+    --recall_result_file "recall_result.txt" \
+    --train_set_file ${data_path}/train.txt \
+    --corpus_file ${data_path}/label.txt \
+    --similar_text_pair_file ${data_path}/dev.txt \
+    --evaluate True
+```
+
+参数含义说明
+
+* `device`: 使用 cpu/gpu 进行训练
+* `save_dir`: 模型存储路径
+* `batch_size`: 训练的batch size的大小
+* `learning_rate`: 训练的学习率的大小
+* `epochs`: 训练的epoch数
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数
+* `max_seq_length`: 输入序列的最大长度
+* `margin`: 正样本相似度与负样本之间的目标 Gap
+* `train_set_file`: 训练集文件
+* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启
+* `recall_result_dir`: 召回结果存储目录
+* `recall_result_file`: 召回结果的文件名
+* `hnsw_m`: hnsw 算法相关参数,保持默认即可
+* `hnsw_ef`: hnsw 算法相关参数,保持默认即可
+* `recall_num`: 对 1 个文本召回的相似文本数量
+* `similar_text_pair`: 由相似文本对构成的评估集
+* `corpus_file`: 召回库数据 corpus_file
+
+也可以使用bash脚本:
+
+```
+sh scripts/train.sh
+```
+
+
+
+## 6. 模型评估
+
+评估脚本命令如下:
+```
+python -u evaluate.py \
+    --similar_text_pair "data/dev.txt" \
+    --recall_result_file "./recall_result_dir/recall_result.txt" \
+    --label_path data/label.txt
+```
+也可以使用bash脚本:
+
+```
+sh scripts/evaluate.sh
+```
+会得到如下的输出结果:
+
+```
+Micro f1 score: 99.30934877025769
+Macro f1 score: 78.20694877991563
+```
+
+
+
+## 7. 模型预测
+
+我们可以基于语义索引模型计算文本和标签的语义相似度。
+
+
+### 开始预测
+
+加载训练好的语义索引模型,然后计算文本和标签的语义相似度:
+
+```
+root_dir="checkpoints/inbatch/model_best"
+python -u -m paddle.distributed.launch --gpus "0" \
+    predict.py \
+    --device gpu \
+    --params_path "${root_dir}/model_state.pdparams" \
+    --model_name_or_path rocketqa-zh-dureader-query-encoder \
+    --output_emb_size 0 \
+    --batch_size 128 \
+    --max_seq_length 384 \
+    --text_pair_file "data/test.txt"
+```
+
+参数含义说明
+* `device`: 使用 cpu/gpu 进行训练
+* `params_path`: 训练好的语义索引模型的参数文件
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。
+* `text_pair_file`: 由文本 Pair 构成的待预测数据集
+
+也可以运行下面的bash脚本:
+
+```
+sh scripts/predict.sh
+```
+predict.sh文件包含了cpu和gpu运行的脚本,默认使用gpu运行。
+
+产出如下结果:
+```
+0.8841502070426941
+0.7834227681159973
+0.04591505229473114
+0.15116563439369202
+......
+```
+
+
+
+## 8. 
模型部署 + +模型部署分为:动转静导出,向量引擎,Paddle Inference推理, Paddle Serving服务化这几个部分。为了提升预测速度,通常需要把训练好的模型转换成静态图,然后就可以使用Paddle Inference静态图进行推理,向量引擎则是存放标签的向量的形式,方便快速检索,另外,Paddle Inference可以进一步对模型服务化,即使用Paddle Serving进行服务化,这样可以通过HTTP或者RPC的方式进行调用。Paddle Serving的服务化形式有Pipeline和C++两种形式,Pipeline灵活一点,方便进行修改,C++部署更麻烦一点,但C++的部署形式效率更高。 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py \ + --params_path checkpoints/inbatch/model_best/model_state.pdparams \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改deploy/python/predict.py中的id2corpus和corpus_list的样本: + +``` +# 抽取向量 +id2corpus = {0: {"sentence": "快要到期的“我的资料”怎么续日期?"}} +# 计算文本和类别的相似度 +corpus_list = [{ + "sentence": "快要到期的“我的资料”怎么续日期?", + 'label': '互联网' + }, { + "sentence": "快要到期的“我的资料”怎么续日期?", + 'label': '游戏' + }] + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率: + +``` +(1, 768) +[[ 0.01510728 -0.03822846 -0.0350773 -0.02304687 0.04219331 -0.04335611 + -0.03983097 0.04164692 0.04074539 -0.02351343 0.04246496 -0.02563381 + .... + +[0.8133385181427002, 0.1509452909231186] +``` + +### 向量引擎 + +模型准备结束以后,开始搭建 Milvus 的向量检索引擎,用于文本语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成768维度的向量: + +``` +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +修改 utils/config.py 的配置 ip 和端口,本项目使用的是8530端口,而 Milvus 默认的是19530,需要根据情况进行修改: +``` +MILVUS_HOST='your milvus ip' +MILVUS_PORT = 8530 +``` + +然后向搭建好的 Milvus 系统插入向量: + +``` +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy +``` +也可以直接运行: + +```bash +sh scripts/run.sh +``` + +### Paddle Serving部署 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖,用pip安装Paddle Serving的依赖如下: + +``` +pip install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple +pip install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是CPU部署,只需要安装CPU Server +pip install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 10.2的包 +# CUDA10.2 + Cudnn7 + TensorRT6(推荐) +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +更详细的安装信息请参考[链接](https://github.com/PaddlePaddle/Serving/blob/v0.9.0/doc/Install_Linux_Env_CN.md),安装完依赖后就可以执行下面的步骤。首先把生成的静态图模型导出为 Paddle Serving的格式,命令如下: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* 
`model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +Paddle Serving的部署采用Pipeline的方式,如果用户有对性能有更高的要求,可以采用C++的部署形式,请参考[Neural Search](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative#c%E7%9A%84%E6%96%B9%E5%BC%8F): + +#### Pipeline方式 + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py +``` + +启动客户端调用 Server, 使用 POST的方式: + +向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["{\"sentence\": \"中国农业大学怎么样?可以吗?\"}"]}' +``` + +也可以使用 rpc的方式: +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{ + "sentence": "中国农业大学怎么样?可以吗?" +}] +``` +然后运行: + +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1658988633.3673246 +PipelineClient::predict before time:1658988633.3678396 +time to cost :0.014188766479492188 seconds +['output_embedding'] +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + ...... +``` + +可以看到客户端发送了1条文本,返回这个 embedding 向量 + + + +## 9. 分类流程 + +为了演示基于检索的文本分类流程,我们使用下面的python脚本来完成整个流程,该分类系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问,得到分类的结果。 + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = [{"sentence": "中国农业大学怎么样?可以吗?"}] +``` +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1658988661.507715 +PipelineClient::predict before time:1658988661.5081818 +Extract feature time to cost :0.02322244644165039 seconds +Search milvus time cost is 0.06801486015319824 seconds +{'sentence': '中国农业大学怎么样?可以吗?'} 教育/科学 0.6138255596160889 +{'sentence': '中国农业大学怎么样?可以吗?'} 院校信息 0.9188833236694336 +``` +输出的结果包括特征提取和检索的时间,还包含检索出来文本和对应的标签,通过设定阈值等方式可以得到最终的标签。 + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/text_classification/multi_label/retrieval_based/base_model.py b/applications/text_classification/multi_label/retrieval_based/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..56aa3ba50e189281c35d41e8819014f56d8e53f4 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/base_model.py @@ -0,0 +1,153 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
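+
+# NOTE: base encoders for the retrieval-based classifier (dual-tower semantic indexing).
+# Both classes take the pooled [CLS] embedding from the pretrained model, optionally
+# project it to `output_emb_size`, apply dropout and L2 normalization, and score
+# query/title pairs by the inner product of the normalized vectors; the *Static variant
+# declares `paddle.jit.to_static` input specs so it can be exported for inference.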
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + 
input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/text_classification/multi_label/retrieval_based/data.py b/applications/text_classification/multi_label/retrieval_based/data.py new file mode 100644 index 0000000000000000000000000000000000000000..61b6fc701a8d3489ca3d7aa57990970b52781a08 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/data.py @@ -0,0 +1,236 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import hnswlib +import numpy as np +import paddle +from paddlenlp.utils.log import logger + + +def build_index(corpus_data_loader, model, output_emb_size, hnsw_max_elements, hnsw_ef, hnsw_m): + + index = hnswlib.Index(space="cosine", dim=output_emb_size if output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + logger.info("start build index..........") + all_embeddings = [] + for text_embeddings in model.get_semantic_embedding(corpus_data_loader): + all_embeddings.append(text_embeddings.numpy()) + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + logger.info("Total index number:{}".format(index.get_current_count())) + return index + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_corpus_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. 
+ token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_label_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + print(data) + continue + yield {"sentence": data[0], "label": data[1]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset 
ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def label2ids(label_path): + label2id = {} + with open(label_path) as f: + for idx, label in enumerate(f.readlines()): + label = label.strip() + label2id[label] = idx + return label2id + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + splited_line = line.rstrip().split("\t") + text, similar_text = splited_line[0], ",".join(splited_line[1:]) + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/config_nlp.yml b/applications/text_classification/multi_label/retrieval_based/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..2429b5a6d01c837116f66dca345f90253146325e --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '0' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/deploy.sh b/applications/text_classification/multi_label/retrieval_based/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..6351e89d8b7b80fd740746a8617ffa6f072b0e15 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/deploy.sh @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/predict.py b/applications/text_classification/multi_label/retrieval_based/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..ce442f3992e3e4860aca3ea22198ae3ea6fbded0 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/predict.py @@ -0,0 +1,250 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. 
+ Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_query_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + encoded_inputs = tokenizer( + text=example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + truncation_strategy="longest_first", + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in 
self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. + """ + examples = [] + for idx, text in data.items(): + print(text) + input_ids, segment_ids = convert_query_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids, title_ids, title_segment_ids = convert_example(text, tokenizer) + + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. 
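+    # For reference: `extract_embedding` takes a dict of {id: {"sentence": text}} and
+    # returns a (batch_size, embedding_dim) numpy array from the exported
+    # pooled-embedding model, while `predict` takes a list of
+    # {"sentence": ..., "label": ...} dicts and returns one cosine similarity per
+    # sentence/label pair.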
+ predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + id2corpus = {0: {"sentence": "快要到期的“我的资料”怎么续日期?"}} + res = predictor.extract_embedding(id2corpus, tokenizer) + print(res.shape) + print(res) + corpus_list = [{"sentence": "快要到期的“我的资料”怎么续日期?", "label": "互联网"}, {"sentence": "快要到期的“我的资料”怎么续日期?", "label": "游戏"}] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/rpc_client.py b/applications/text_classification/multi_label/retrieval_based/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..b20265cfb4e3eeeddfb1a79500d5680e957d6f4a --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/rpc_client.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = [{"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/web_service.py b/applications/text_classification/multi_label/retrieval_based/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..df054797d51ec195c6f23ad1c144aa4f6aed43d1 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
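The RPC client above sends a single sentence, but `ErnieOp.preprocess` (defined in `web_service.py` below) batches every key in the feed dict, so several queries can go out in one request. A minimal sketch, assuming the pipeline service is already running on 127.0.0.1:8080 as configured in `config_nlp.yml`; both sample sentences are reused from the scripts in this PR:

```python
import numpy as np
from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(["127.0.0.1:8080"])

list_data = [
    {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"},
    {"sentence": "中国农业大学怎么样?可以吗?"},
]
feed = {str(i): str(item) for i, item in enumerate(list_data)}

ret = client.predict(feed_dict=feed)
# The op returns one stringified matrix for the whole batch, so the recovered
# array should have shape (len(list_data), embedding_dim).
result = np.array(eval(ret.value[0]))
print(ret.key)
print(result.shape)
```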
+ +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer( + text=text["sentence"], max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = eval(input_dict[str(i)]) + input_ids, segment_ids = convert_example([example], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/text_classification/multi_label/retrieval_based/evaluate.py b/applications/text_classification/multi_label/retrieval_based/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..f3b5f937767bab04edb97a6864296646d0d364f9 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/evaluate.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
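For reference, the feed dict that `ErnieOp.preprocess` hands to the predictor can be reproduced offline with the same tokenizer and padding helpers. This is only a sketch of that preprocessing step, with the model name and sentence taken from the files above:

```python
from paddlenlp.data import Pad, Tuple
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder")

examples = []
for example in [{"sentence": "中国农业大学怎么样?可以吗?"}]:
    encoded = tokenizer(text=example["sentence"], max_seq_len=512)
    examples.append((encoded["input_ids"], encoded["token_type_ids"]))

batchify_fn = Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"),  # token_type_ids
)
input_ids, token_type_ids = batchify_fn(examples)

# These two arrays mirror the feed_dict that preprocess() builds for the model.
feed_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids}
print(input_ids.shape, token_type_ids.shape)
```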
+
+import argparse
+
+import numpy as np
+from data import label2ids
+from metric import MetricReport
+from tqdm import tqdm
+
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument("--label_path", type=str,
+                    default='data/label.txt', help="The full path of label file")
+parser.add_argument("--recall_result_file", type=str,
+                    default='./recall_result_dir/recall_result.txt', help="The full path of recall result file")
+parser.add_argument("--similar_text_pair", default='data/dev.txt',
+                    help="The full path of similar pair file")
+
+parser.add_argument("--threshold", default=0.5, type=float,
+                    help="The threshold for selecting the labels")
+
+args = parser.parse_args()
+# yapf: enable
+
+
+def evaluate(label2id):
+    metric = MetricReport()
+    text2similar = {}
+    # Encoding labels as one hot
+    with open(args.similar_text_pair, "r", encoding="utf-8") as f:
+        for line in f:
+            text, similar_text = line.rstrip().rsplit("\t", 1)
+            text2similar[text] = np.zeros(len(label2id))
+            # One hot Encoding
+            for label in similar_text.strip().split(","):
+                text2similar[text][label2id[label]] = 1
+    pred_labels = {}
+    # Convert predicted labels into one hot encoding
+    with open(args.recall_result_file, "r", encoding="utf-8") as f:
+        for index, line in enumerate(f):
+            text_arr = line.rstrip().split("\t")
+            text, labels, cosine_sim = text_arr
+            # One hot Encoding
+            if text not in pred_labels:
+                pred_labels[text] = np.zeros(len(label2id))
+            if float(cosine_sim) > args.threshold:
+                for label in labels.split(","):
+                    pred_labels[text][label2id[label]] = float(cosine_sim)
+
+    for text, probs in tqdm(pred_labels.items()):
+        metric.update(probs, text2similar[text])
+
+    micro_f1_score, macro_f1_score = metric.accumulate()
+    print("Micro f1 score: {}".format(micro_f1_score * 100))
+    print("Macro f1 score: {}".format(macro_f1_score * 100))
+
+
+if __name__ == "__main__":
+    label2id = label2ids(args.label_path)
+    evaluate(label2id)
diff --git a/applications/text_classification/multi_label/retrieval_based/export_model.py b/applications/text_classification/multi_label/retrieval_based/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac06c79a8f971e5cdbeede11c99c9f16d6e59520
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/export_model.py
@@ -0,0 +1,54 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
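A toy illustration of the thresholding and one-hot encoding that `evaluate()` performs; the label vocabulary and scores below are invented purely for the example:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical label vocabulary (label2ids would normally read data/label.txt).
label2id = {"互联网": 0, "游戏": 1, "体育": 2}

# Gold labels for two texts, one-hot encoded as in evaluate().
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])

# Recalled cosine similarities; only scores above --threshold (0.5) become predictions.
probs = np.array([[0.9, 0.1, 0.6],
                  [0.2, 0.8, 0.4]])
y_pred = probs > 0.5

print("Micro f1:", f1_score(y_true=y_true, y_pred=y_pred, average="micro"))
print("Macro f1:", f1_score(y_true=y_true, y_pred=y_pred, average="macro"))
```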
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--output_emb_size", default=0, type=int, help="output_embedding_size") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/multi_label/retrieval_based/export_to_serving.py b/applications/text_classification/multi_label/retrieval_based/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..6cc932da11173e54460642c16fd4226411ba3cfb --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/export_to_serving.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. 
It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/text_classification/multi_label/retrieval_based/metric.py b/applications/text_classification/multi_label/retrieval_based/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..d68451a538bbda1354cd632488b3363a18dfea77 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/metric.py @@ -0,0 +1,80 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from sklearn.metrics import f1_score, classification_report +from paddle.metric import Metric +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for multi-label text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs, axis=0) + else: + self.y_prob = probs + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels, axis=0) + else: + self.y_true = labels + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/multi_label/retrieval_based/model.py b/applications/text_classification/multi_label/retrieval_based/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d21569ab78c7436c024ebfe5f70e420620c63526 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/model.py @@ -0,0 +1,66 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
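A minimal usage sketch for `MetricReport`, assuming it runs from the retrieval_based directory and is fed probability/one-hot label arrays the way `evaluate.py` and `train.py` do:

```python
import numpy as np
from metric import MetricReport

metric = MetricReport()
# Each update takes label probabilities and the matching one-hot gold labels.
metric.update(probs=np.array([[0.9, 0.1, 0.7]]), labels=np.array([[1, 0, 1]]))
metric.update(probs=np.array([[0.2, 0.8, 0.1]]), labels=np.array([[0, 1, 0]]))

micro_f1, macro_f1 = metric.accumulate()
print(micro_f1, macro_f1)
metric.report()  # also logs a full classification report at the 0.5 threshold
```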
+
+
+import paddle
+import paddle.nn.functional as F
+from base_model import SemanticIndexBase
+
+
+class SemanticIndexBatchNeg(SemanticIndexBase):
+    def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None):
+        super().__init__(pretrained_model, dropout, output_emb_size)
+
+        self.margin = margin
+        # Scale cosine similarity to ease convergence
+        self.scale = scale
+
+    def forward(
+        self,
+        query_input_ids,
+        title_input_ids,
+        query_token_type_ids=None,
+        query_position_ids=None,
+        query_attention_mask=None,
+        title_token_type_ids=None,
+        title_position_ids=None,
+        title_attention_mask=None,
+    ):
+
+        query_cls_embedding = self.get_pooled_embedding(
+            query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask
+        )
+
+        title_cls_embedding = self.get_pooled_embedding(
+            title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask
+        )
+
+        cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)
+
+        # Subtract the margin from the positive samples' cosine similarity (the diagonal)
+        margin_diag = paddle.full(
+            shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
+        )
+
+        cosine_sim = cosine_sim - paddle.diag(margin_diag)
+
+        # Scale cosine similarity to ease training convergence
+        cosine_sim *= self.scale
+
+        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
+        labels = paddle.reshape(labels, shape=[-1, 1])
+
+        loss = F.cross_entropy(input=cosine_sim, label=labels)
+
+        return loss
diff --git a/applications/text_classification/multi_label/retrieval_based/predict.py b/applications/text_classification/multi_label/retrieval_based/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..906fdf3519c3aff341c81bcb0bb47f4245b91daf
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/predict.py
@@ -0,0 +1,100 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import os
+from functools import partial
+
+import numpy as np
+import paddle
+from base_model import SemanticIndexBase
+from data import convert_example, create_dataloader, read_text_pair
+
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.datasets import load_dataset
+from paddlenlp.transformers import AutoModel, AutoTokenizer
+
+# fmt: off
+parser = argparse.ArgumentParser()
+parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file")
+parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
+parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + model.eval() + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + cosine_sims = np.concatenate(cosine_sims, axis=0) + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) + if idx > 5: + break diff --git a/applications/text_classification/multi_label/retrieval_based/recall.py b/applications/text_classification/multi_label/retrieval_based/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..1d6f49ae9f7c92156fdc5d8d0c338e2339221072 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/recall.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from base_model import SemanticIndexBase +from data import ( + build_index, + convert_corpus_example, + create_dataloader, + gen_id2corpus, + gen_text_file, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_corpus_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = 
len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/text_classification/multi_label/retrieval_based/requirements.txt b/applications/text_classification/multi_label/retrieval_based/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..7033416510087320bbd232fdb3eeacaed736dee5 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/requirements.txt @@ -0,0 +1,5 @@ +pymilvus==1.1.2 +pandas==0.25.1 +paddlenlp>=2.3.7 +hnswlib>=0.5.2 +pybind11 \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/run_system.py b/applications/text_classification/multi_label/retrieval_based/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..46e24fe94318a9cefb55de931fff06b856489aaf --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/run_system.py @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
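The `1.0 - cosine_sims` conversion in `recall.py` above follows from how hnswlib reports distances. The self-contained sketch below illustrates that relationship; the inner-product space and the 256-dimension size are assumptions based on `build_index` and `output_emb_size`, not values fixed by this PR:

```python
import hnswlib
import numpy as np

dim = 256  # assumed embedding size
index = hnswlib.Index(space="ip", dim=dim)  # assumption: inner-product space over normalized vectors
index.init_index(max_elements=1000, ef_construction=100, M=100)
index.set_ef(100)

corpus = np.random.rand(100, dim).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
index.add_items(corpus)

# knn_query returns (ids, distances); with an "ip" space the distance is
# 1 - inner product, so 1.0 - distance recovers the similarity score that
# recall.py writes to the result file.
recalled_idx, distances = index.knn_query(corpus[:2], k=5)
print(recalled_idx.shape, (1.0 - distances)[:, 0])  # top-1 similarity should be close to 1.0
```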
+ +import sys +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("utils") +from utils.config import collection_name, partition_tag # noqa: E402 +from utils.milvus_util import RecallByMilvus # noqa: E402 + + +def search_in_milvus(text_embedding, corpus_file, query_text): + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "label", "innner_product"]) + df = df.sort_values(by="innner_product") + for index, row in df.iterrows(): + if row["innner_product"] > 0.5: + print(row["query_text"], row["label"], row["innner_product"]) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + corpus_file = "data/label.txt" + list_data = [{"sentence": "中国农业大学怎么样?可以吗?"}] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, corpus_file, list_data[0]) diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/evaluate.sh b/applications/text_classification/multi_label/retrieval_based/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..b7c908c13ca3e0e63d4ee7d28d79838f66fbada8 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u evaluate.py \ + --similar_text_pair "data/dev.txt" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --label_path data/label.txt \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/export_model.sh b/applications/text_classification/multi_label/retrieval_based/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..188e3a9bdf383e40f36ba3c7c5bb015ad6cdcddd --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/export_model.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --params_path checkpoints/inbatch/model_best/model_state.pdparams \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_path=./output diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/export_to_serving.sh b/applications/text_classification/multi_label/retrieval_based/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/predict.sh b/applications/text_classification/multi_label/retrieval_based/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..ff3fd8f76d1ce1442cdee0d53c5f2e73b2214fa7 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/predict.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
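Before fixing `--fetch_alias_names` as `export_to_serving.sh` does above, the exported program's feed/fetch variables can be previewed via `--show_proto`, as the help text in `export_to_serving.py` suggests. The sketch below is simply the same `inference_model_to_serving` call with `show_proto=True`, assuming `export_model.py` has already written `./output`:

```python
import paddle_serving_client.io as serving_io

# Preview the serving proto so the feed/fetch variable names (and how many there
# are) can be checked before settling on alias names such as "output_embedding".
serving_io.inference_model_to_serving(
    dirname="output",
    serving_server="serving_server",
    serving_client="serving_client",
    model_filename="inference.get_pooled_embedding.pdmodel",
    params_filename="inference.get_pooled_embedding.pdiparams",
    show_proto=True,
)
```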
+ +# gpu version +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/test.txt" + diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/run.sh b/applications/text_classification/multi_label/retrieval_based/scripts/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..c4c990729c26e0c9fd00e4420ebe1810abd00984 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/run.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" + +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/run_build_index.sh b/applications/text_classification/multi_label/retrieval_based/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..2224ad585b1b70886ab6ee167f5ff2626fa37604 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/run_build_index.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
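`predict.sh` above points `--text_pair_file` at `data/test.txt`, which `read_text_pair` in `data.py` parses as one tab-separated `text<TAB>label` pair per line. A throwaway sketch of that format, run from the retrieval_based directory; the file name and the second label are invented for illustration:

```python
from data import read_text_pair

# Write a tiny pair file in the expected "<text>\t<label>" layout.
with open("tiny_pairs.txt", "w", encoding="utf-8") as f:
    f.write("快要到期的“我的资料”怎么续日期?\t互联网\n")
    f.write("青岛有什么好一点的国际青旅推荐?\t旅游\n")

for example in read_text_pair("tiny_pairs.txt"):
    print(example)  # {"sentence": ..., "label": ...}
```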
+ +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_best/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 0 \ + --max_seq_length 384 \ + --recall_num 5 \ + --similar_text_pair "data/dev.txt" \ + --corpus_file "data/label.txt" \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/train.sh b/applications/text_classification/multi_label/retrieval_based/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..2cef4abcddac47f91a0d98ce701375d910879f27 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/train.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU training +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True \ + --model_name_or_path rocketqa-zh-dureader-query-encoder diff --git a/applications/text_classification/multi_label/retrieval_based/train.py b/applications/text_classification/multi_label/retrieval_based/train.py new file mode 100644 index 0000000000000000000000000000000000000000..c6304d24b877e00bcb874ef4c5fdcd9fdb70eaf9 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/train.py @@ -0,0 +1,245 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
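As a toy illustration of the in-batch negative objective that `train.sh` drives through `SemanticIndexBatchNeg`: within a batch, the i-th label text is the positive for the i-th query and every other row is a negative, so the target is simply the row index. The 3x3 similarity matrix is made up; margin 0.2 and scale 30 mirror the script's settings:

```python
import paddle
import paddle.nn.functional as F

# A made-up 3x3 query/label cosine-similarity matrix; the diagonal holds the positives.
cosine_sim = paddle.to_tensor([[0.9, 0.2, 0.1],
                               [0.3, 0.8, 0.2],
                               [0.1, 0.4, 0.7]])
margin, scale = 0.2, 30

cosine_sim = cosine_sim - paddle.diag(paddle.full([3], margin))  # subtract the margin from the positives
cosine_sim = cosine_sim * scale                                  # sharpen the softmax
labels = paddle.arange(0, 3, dtype="int64").reshape([-1, 1])     # positive index = row index

print(F.cross_entropy(input=cosine_sim, label=labels))
```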
+import argparse
+import os
+import random
+import time
+from functools import partial
+
+import numpy as np
+import paddle
+import paddle.nn as nn
+from data import (
+    build_index,
+    convert_example,
+    create_dataloader,
+    gen_id2corpus,
+    gen_text_file,
+    label2ids,
+    read_text_pair,
+)
+from metric import MetricReport
+from model import SemanticIndexBatchNeg
+
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.datasets import MapDataset, load_dataset
+from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup
+
+# fmt: off
+parser = argparse.ArgumentParser()
+parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
+parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument("--output_emb_size", default=256, type=int, help="output_embedding_size")
+parser.add_argument("--learning_rate", default=5E-5, type=float, help="The initial learning rate for Adam.")
+parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.")
+parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.")
+parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
+parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")
+parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train model, defaults to cpu.")
+parser.add_argument('--save_steps', type=int, default=10000, help="Interval steps to save checkpoint")
+parser.add_argument('--log_steps', type=int, default=10, help="Interval steps to print log")
+parser.add_argument("--train_set_file", type=str, default='./data/train.txt', help="The full path of train_set_file.")
+parser.add_argument("--margin", default=0.2, type=float, help="Margin between pos_sample and neg_samples")
+parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss")
+parser.add_argument("--corpus_file", type=str, default='./data/label.txt', help="The full path of input file")
+parser.add_argument("--similar_text_pair_file", type=str, default='./data/dev.txt', help="The full path of similar text pair file")
+parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', help="The full path of recall result file to save")
+parser.add_argument("--recall_result_file", type=str, default='recall_result_init.txt', help="The file name of recall result")
+parser.add_argument("--recall_num", default=50, type=int, help="Recall number for each query from Ann index.")
+parser.add_argument("--hnsw_m", default=100, type=int, help="The max connections (M) per node of the HNSW index.")
+parser.add_argument("--hnsw_ef", default=100, type=int, help="The ef parameter (size of the dynamic candidate list) of the HNSW index.")
+parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="The maximum number of elements the HNSW index can hold.")
+parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', help="evaluate_result")
+parser.add_argument('--evaluate', default=True, type=eval, choices=[True, False], help='whether evaluate while training') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +parser.add_argument("--threshold", default=0.5, type=float, help="The threshold for selection the labels") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus, label2id): + metric = MetricReport() + # Load pretrained semantic model + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text_arr = line.rstrip().rsplit("\t") + text, similar_text = text_arr[0], text_arr[1] + text2similar[text] = np.zeros(len(label2id)) + # One hot Encoding + for label in similar_text.strip().split(","): + text2similar[text][label2id[label]] = 1 + # Convert predicted labels into one hot encoding + pred_labels = {} + with open(recall_result_file, "r", encoding="utf-8") as f: + for index, line in enumerate(f): + text_arr = line.rstrip().split("\t") + text, labels, cosine_sim = text_arr + # One hot Encoding + if text not in pred_labels: + pred_labels[text] = np.zeros(len(label2id)) + if float(cosine_sim) > args.threshold: + for label in labels.split(","): + pred_labels[text][label2id[label]] = float(cosine_sim) + + for text, probs in pred_labels.items(): + metric.update(probs, text2similar[text]) + micro_f1_score, macro_f1_score = metric.accumulate() + return macro_f1_score + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, 
mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + batchify_fn_dev = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + if args.evaluate: + eval_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + id2corpus = gen_id2corpus(args.corpus_file) + label2id = label2ids(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=eval_func + ) + query_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + text_list, _ = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=query_func + ) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+    decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
+    optimizer = paddle.optimizer.AdamW(
+        learning_rate=lr_scheduler,
+        parameters=model.parameters(),
+        weight_decay=args.weight_decay,
+        apply_decay_param_fun=lambda x: x in decay_params,
+        grad_clip=nn.ClipGradByNorm(clip_norm=1.0),
+    )
+    global_step = 0
+    best_score = 0.0
+    tic_train = time.time()
+    for epoch in range(1, args.epochs + 1):
+        for step, batch in enumerate(train_data_loader, start=1):
+            query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch
+            loss = model(
+                query_input_ids=query_input_ids,
+                title_input_ids=title_input_ids,
+                query_token_type_ids=query_token_type_ids,
+                title_token_type_ids=title_token_type_ids,
+            )
+            global_step += 1
+            if global_step % args.log_steps == 0 and rank == 0:
+                print(
+                    "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
+                    % (global_step, epoch, step, loss, args.log_steps / (time.time() - tic_train))
+                )
+                tic_train = time.time()
+            loss.backward()
+            optimizer.step()
+            lr_scheduler.step()
+            optimizer.clear_grad()
+            if not args.evaluate and rank == 0:
+                if global_step % args.save_steps == 0:
+                    save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
+                    if not os.path.exists(save_dir):
+                        os.makedirs(save_dir)
+                    save_param_path = os.path.join(save_dir, "model_state.pdparams")
+                    paddle.save(model.state_dict(), save_param_path)
+                    tokenizer.save_pretrained(save_dir)
+        if args.evaluate and rank == 0:
+            print("evaluating")
+            macro_f1_score = evaluate(
+                model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus, label2id
+            )
+            # Keep only the checkpoint with the best macro F1 on the dev set.
+            if macro_f1_score > best_score:
+                best_score = macro_f1_score
+                save_dir = os.path.join(args.save_dir, "model_best")
+                if not os.path.exists(save_dir):
+                    os.makedirs(save_dir)
+                save_param_path = os.path.join(save_dir, "model_state.pdparams")
+                paddle.save(model.state_dict(), save_param_path)
+                tokenizer.save_pretrained(save_dir)
+                with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp:
+                    fp.write("epoch=%d, global_step: %d, Macro f1: %s\n" % (epoch, global_step, macro_f1_score))
+
+
+if __name__ == "__main__":
+    do_train()
diff --git a/applications/text_classification/multi_label/retrieval_based/utils/__init__.py b/applications/text_classification/multi_label/retrieval_based/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/utils/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
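补充示例(非本仓库代码,仅为原理示意):上面 `train.py` 的 `evaluate` 函数会把召回得到的 (文本, 标签, 相似度) 三元组按 `threshold` 转成 one-hot 预测,再与 dev 集的 one-hot 标签计算 micro/macro F1。下面用 numpy 给出这一阈值判定与 macro F1 计算逻辑的极简示意,其中标签集合、召回结果与阈值均为假设的示例数据:

```python
# A minimal, self-contained sketch with hypothetical data (not repository code):
# turn recalled (text, label, similarity) triples into one-hot predictions with a
# threshold, then compute macro F1 against one-hot ground truth.
import numpy as np

label2id = {"财经": 0, "体育": 1, "科技": 2}  # hypothetical label set
threshold = 0.5

# (text, recalled label, similarity) triples, e.g. parsed from recall_result.txt
recalled = [("文本A", "财经", 0.83), ("文本A", "体育", 0.31), ("文本B", "科技", 0.66)]
gold = {"文本A": ["财经"], "文本B": ["科技", "体育"]}

# Threshold the similarities into one-hot predictions per text.
preds = {text: np.zeros(len(label2id)) for text in gold}
for text, label, sim in recalled:
    if sim > threshold:
        preds[text][label2id[label]] = 1.0

# One-hot encode the ground-truth labels.
trues = {text: np.zeros(len(label2id)) for text in gold}
for text, labels in gold.items():
    for label in labels:
        trues[text][label2id[label]] = 1.0

y_pred = np.stack([preds[t] for t in gold])
y_true = np.stack([trues[t] for t in gold])

# Macro F1: average the per-label F1 scores.
f1_scores = []
for j in range(len(label2id)):
    tp = float(((y_pred[:, j] == 1) & (y_true[:, j] == 1)).sum())
    fp = float(((y_pred[:, j] == 1) & (y_true[:, j] == 0)).sum())
    fn = float(((y_pred[:, j] == 0) & (y_true[:, j] == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_scores.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)

print("macro F1:", sum(f1_scores) / len(f1_scores))  # 约 0.667(该玩具数据下)
```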
diff --git a/applications/text_classification/multi_label/retrieval_based/utils/config.py b/applications/text_classification/multi_label/retrieval_based/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..2da411d0c5e84aa359357823e037df96de69619c --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/utils/config.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import MetricType, IndexType + +MILVUS_HOST = "localhost" +MILVUS_PORT = 8530 + +output_emb_size = 0 + +collection_param = { + "dimension": output_emb_size if output_emb_size > 0 else 768, + "index_file_size": 256, + "metric_type": MetricType.IP, +} + +index_type = IndexType.FLAT +index_param = {"nlist": 1000} + +top_k = 20 +search_param = {"nprobe": 20} + +collection_name = "multi_label" +partition_tag = "partition_2" diff --git a/applications/text_classification/multi_label/retrieval_based/utils/feature_extract.py b/applications/text_classification/multi_label/retrieval_based/utils/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..966a801776ba89bb104a2ddcc16a400df93ba268 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/utils/feature_extract.py @@ -0,0 +1,194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--output_dir", type=str, required=True, help="The output path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--data_name", type=str, required=True, help="The dataset name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, data_name): + """ + Predicts the data labels. + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in tqdm(data.items()): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("./{}/{}_embedding".format(args.output_dir, data_name), all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, line in enumerate(file.readlines()): + id2corpus[idx] = line.rstrip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + data_name = args.data_name + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + predictor.predict(id2corpus, tokenizer, data_name) diff --git a/applications/text_classification/multi_label/retrieval_based/utils/milvus_util.py b/applications/text_classification/multi_label/retrieval_based/utils/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b186c4fa480ab20b888c0cd1376624083da9b9 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/utils/milvus_util.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
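+# This module wraps the Milvus 1.x Python SDK: VecToMilvus creates the collection,
+# index and partition configured in config.py and inserts label embeddings, while
+# RecallByMilvus searches the collection for the vectors closest to a query embedding.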
+
+from config import (
+    MILVUS_HOST,
+    MILVUS_PORT,
+    collection_param,
+    index_param,
+    index_type,
+    search_param,
+    top_k,
+)
+from milvus import Milvus
+
+
+class VecToMilvus:
+    def __init__(self):
+        self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT)
+
+    def has_collection(self, collection_name):
+        try:
+            status, ok = self.client.has_collection(collection_name)
+            return ok
+        except Exception as e:
+            print("Milvus has_collection error:", e)
+
+    def creat_collection(self, collection_name):
+        try:
+            collection_param["collection_name"] = collection_name
+            status = self.client.create_collection(collection_param)
+            print(status)
+            return status
+        except Exception as e:
+            print("Milvus create collection error:", e)
+
+    def create_index(self, collection_name):
+        try:
+            status = self.client.create_index(collection_name, index_type, index_param)
+            print(status)
+            return status
+        except Exception as e:
+            print("Milvus create index error:", e)
+
+    def has_partition(self, collection_name, partition_tag):
+        try:
+            status, ok = self.client.has_partition(collection_name, partition_tag)
+            return ok
+        except Exception as e:
+            print("Milvus has partition error: ", e)
+
+    def delete_partition(self, collection_name, partition_tag):
+        try:
+            # Drop only the target partition instead of the whole collection.
+            status = self.client.drop_partition(collection_name, partition_tag)
+            return status
+        except Exception as e:
+            print("Milvus delete partition error: ", e)
+
+    def create_partition(self, collection_name, partition_tag):
+        try:
+            status = self.client.create_partition(collection_name, partition_tag)
+            print("create partition {} successfully".format(partition_tag))
+            return status
+        except Exception as e:
+            print("Milvus create partition error: ", e)
+
+    def insert(self, vectors, collection_name, ids=None, partition_tag=None):
+        try:
+            if not self.has_collection(collection_name):
+                self.creat_collection(collection_name)
+                self.create_index(collection_name)
+                print("collection info: {}".format(self.client.get_collection_info(collection_name)[1]))
+            if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)):
+                self.create_partition(collection_name, partition_tag)
+            status, ids = self.client.insert(
+                collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag
+            )
+            self.client.flush([collection_name])
+            print(
+                "Insert {} entities, there are {} entities after insert data.".format(
+                    len(ids), self.client.count_entities(collection_name)[1]
+                )
+            )
+            return status, ids
+        except Exception as e:
+            print("Milvus insert error:", e)
+
+
+class RecallByMilvus:
+    def __init__(self):
+        self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT)
+
+    def search(self, vectors, collection_name, partition_tag=None):
+        try:
+            status, results = self.client.search(
+                collection_name=collection_name,
+                query_records=vectors,
+                top_k=top_k,
+                params=search_param,
+                partition_tag=partition_tag,
+            )
+            return status, results
+        except Exception as e:
+            print("Milvus recall error: ", e)
diff --git a/applications/text_classification/multi_label/retrieval_based/utils/vector_insert.py b/applications/text_classification/multi_label/retrieval_based/utils/vector_insert.py
new file mode 100644
index 0000000000000000000000000000000000000000..986ba6f0918b01df8eb79887a2a07cd0a9dd52ac
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/utils/vector_insert.py
@@ -0,0 +1,55 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +from config import collection_name, partition_tag +from milvus_util import VecToMilvus +from tqdm import tqdm + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--vector_path", type=str, required=True, help="feature file path.") + +args = parser.parse_args() +# fmt: on + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + + if client.has_partition(collection_name, partition_tag): + client.delete_partition(collection_name, partition_tag) + data_size = len(embedding_ids) + batch_size = 50000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + vector_insert(args.vector_path) diff --git a/applications/text_classification/multi_label/train.py b/applications/text_classification/multi_label/train.py new file mode 100644 index 0000000000000000000000000000000000000000..a1769d3019b3ac84a76f79dd2086c1a1ff87bf17 --- /dev/null +++ b/applications/text_classification/multi_label/train.py @@ -0,0 +1,198 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import functools +import os +import random +import time + +import numpy as np +import paddle +from metric import MetricReport +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from utils import evaluate, preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include train.txt, dev.txt and label.txt") +parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') +parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') +parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument('--warmup', action='store_true', help="whether use warmup strategy") +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup steps over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """ + Sets random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def args_saving(): + argsDict = args.__dict__ + with open(os.path.join(args.save_dir, "setting.txt"), "w") as f: + f.writelines("------------------ start ------------------" + "\n") + for eachArg, value in argsDict.items(): + f.writelines(eachArg + " : " + str(value) + "\n") + f.writelines("------------------- end -------------------") + + 
+def train(): + """ + Training a multi label classification model + """ + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + args_saving() + set_seed(args.seed) + paddle.set_device(args.device) + + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # load and preprocess dataset + label_list = {} + with open(os.path.join(args.dataset_dir, args.label_file), "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + train_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.train_file), label_list=label_list, lazy=False + ) + dev_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.dev_file), label_list=label_list, lazy=False + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + if paddle.distributed.get_world_size() > 1: + train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + else: + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # define model + model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_classes=len(label_list)) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.BCEWithLogitsLoss() + metric = MetricReport() + + global_step = 0 + best_f1_score = 0 + early_stop_count = 0 + tic_train = time.time() + + for epoch in range(1, args.epochs + 1): + + if args.early_stop and early_stop_count >= args.early_stop_nums: + logger.info("Early stop!") + break + + for step, batch in enumerate(train_data_loader, start=1): + + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + + loss.backward() + optimizer.step() + if args.warmup: + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + early_stop_count += 1 + micro_f1_score, macro_f1_score = evaluate(model, criterion, metric, dev_data_loader) + + save_best_path = args.save_dir + if not os.path.exists(save_best_path): + os.makedirs(save_best_path) + + # save models + if macro_f1_score > best_f1_score: + early_stop_count = 0 + best_f1_score = macro_f1_score + model._layers.save_pretrained(save_best_path) + tokenizer.save_pretrained(save_best_path) + logger.info("Current best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Final best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Save best macro f1 text classification model in %s" % (args.save_dir)) + + +if __name__ == "__main__": + train() diff --git a/applications/text_classification/multi_label/utils.py b/applications/text_classification/multi_label/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9915906c9a2ab9dd8680d95711a4336efd96ec6e --- /dev/null +++ b/applications/text_classification/multi_label/utils.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn.functional as F +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evaluates model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. 
+ """ + + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + probs = F.sigmoid(logits) + losses.append(loss.numpy()) + metric.update(probs, labels) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info( + "eval loss: %.5f, micro f1 score: %.5f, macro f1 score: %.5f" + % (np.mean(losses), micro_f1_score, macro_f1_score) + ) + model.train() + metric.reset() + + return micro_f1_score, macro_f1_score + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + examples(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_length(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + label_nums(obj:`int`): The number of the labels. + Returns: + result(obj:`dict`): The preprocessed data including input_ids, token_type_ids, labels. + """ + result = tokenizer(text=examples["sentence"], max_seq_len=max_seq_length) + # One-Hot label + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list=None, is_test=False): + """ + Read dataset + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + if is_test: + items = line.strip().split("\t") + sentence = "".join(items) + yield {"sentence": sentence} + else: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} diff --git a/applications/text_summarization/README.md b/applications/text_summarization/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3baa427f77f6a56ac4079640c88ca7bb6a3fea82 --- /dev/null +++ b/applications/text_summarization/README.md @@ -0,0 +1,30 @@ +# 文本摘要应用 + +## **1. 文本摘要简介** +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + + +## **2. 
文本摘要项目介绍** + +PaddleNLP文本摘要应用主要针对中文文本数据上的摘要需求,基于最前沿的文本摘要预训练模型,开源了文本摘要解决方案。针对文本摘要应用,本项目提供了基于Taskflow开箱即用的产业级文本摘要预置任务能力,无需训练,一键完成文本摘要预测。除此之外,本项目提供给用户定制化训练策略,可以结合用户自身的不同数据需求完成模型的训练、预测和推理部署工作。对于需要特殊能力的文本摘要预训练模型,本项目开源了摘要模型的预训练代码,用户可以使用大规模无标注数据定制在特定领域有摘要能力的预训练模型。 + +本项目使用的基础模型为[PEGASUS(Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models)](https://arxiv.org/pdf/1912.08777.pdf), 是由谷歌公司提出的文本摘要预训练模型。其预训练目标:Gap Sentences Generation (GSG),是根据文本摘要任务形式特殊设计的自监督上游任务。PEGASUS有两个不同的版本(base和large),其模型参数分别为: + + +| 参数 | base(238M) | large(523M) | +| :---: | :--------: | :--------: | +| encoder layers | 12 | 16| +| encoder_attention_heads | 12 | 16| +| encoder_ffn_dim | 3072 |4096 | +| decoder layers | 12 | 16| +| decoder_attention_heads | 12 | 16| +| decoder_ffn_dim | 3072 |4096 | +| max_encode_length | 512 | 1024| + + +## **3. 快速开始** + +- [预训练PEGASUS模型](./pretrain/) + +- [微调PEGASUS模型](./finetune/) diff --git a/applications/text_summarization/finetune/README.md b/applications/text_summarization/finetune/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f6a059944a39db39c6e336f792b3ac0489476796 --- /dev/null +++ b/applications/text_summarization/finetune/README.md @@ -0,0 +1,319 @@ +# 生成式文本摘要应用 +**目录** + +- [生成式文本摘要应用](#生成式文本摘要应用) + - [简介](#简介) + - [效果展示](#效果展示) + - [开箱即用](#开箱即用) + - [支持单条、批量预测](#支持单条批量预测) + - [可配置参数说明](#可配置参数说明) + - [训练定制](#训练定制) + - [文本摘要应用定制训练全流程介绍](#文本摘要应用定制训练全流程介绍) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型训练](#模型训练) + - [模型预测](#模型预测) + - [模型推理部署](#模型推理部署) + - [FastGeneration加速及模型静态图导出](#fastgeneration加速及模型静态图导出) + - [模型部署](#模型部署) + - [References](#references) + +## 简介 + +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + +本项目是基于预训练语言模型PEGASUS的中文文本摘要产业实践,具有以下优势: + +- 效果领先。在LCSTS上效果达到SOTA。 +- 开箱即用。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 +- 高性能推理。本项目基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/fast_generation) + 进行推理加速,能够提供更高性能的推理体验。 +- 训练推理全流程打通。本项目提供了全面的定制训练流程,从数据准备、模型训练预测,到模型推理部署,一应俱全。 + +## 效果展示 + +## 开箱即用 + +PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。 + +### 支持单条、批量预测 + +```python +>> > from paddlenlp import Taskflow +>> > summarizer = Taskflow("text_summarization") +# 单条输入 +>> > summarizer( + '2022年,中国房地产进入转型阵痛期,传统“高杠杆、快周转”的模式难以为继,万科甚至直接喊话,中国房地产进入“黑铁时代”') +# 输出:['万科喊话中国房地产进入“黑铁时代”'] + +# 多条输入 +>> > summarizer([ + '据悉,2022年教育部将围绕“巩固提高、深化落实、创新突破”三个关键词展开工作。要进一步强化学校教育主阵地作用,继续把落实“双减”作为学校工作的重中之重,重点从提高作业设计水平、提高课后服务水平、提高课堂教学水平、提高均衡发展水平四个方面持续巩固提高学校“双减”工作水平。', + '党参有降血脂,降血压的作用,可以彻底消除血液中的垃圾,从而对冠心病以及心血管疾病的患者都有一定的稳定预防工作作用,因此平时口服党参能远离三高的危害。另外党参除了益气养血,降低中枢神经作用,调整消化系统功能,健脾补肺的功能。' +]) +# 输出:['教育部:将从四个方面持续巩固提高学校“双减”工作水平', '党参能降低三高的危害'] +``` + +### 可配置参数说明 + +* `model`:可选模型,默认为`IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + +## 训练定制 + +### 文本摘要应用定制训练全流程介绍 + +接下来,我们将按数据准备、训练、预测、推理部署对文本摘要应用的全流程进行介绍。 + +1. **数据准备** + +- 如果没有已标注的数据集,我们推荐[doccano](https://github.com/doccano/doccano)数据标注工具。 + 如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集](#从本地文件创建数据集) + 。 + +2. **模型训练** + +- 数据准备完成后,可以开始使用我们的数据集对预训练模型进行微调训练。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。中文任务默认使用"IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"模型,还支持large模型: "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese"。 + + +3. **模型预测** + +- 训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型预测结果。 + +4. 
**模型推理部署** + +- 模型部署需要将保存的最佳模型参数(动态图)导出成静态图参数,用于后续的推理部署。 + +- 文本摘要应用提供了基于Paddle Inference的本地部署predictor,并且支持在GPU设备使用FastGeneration进行加速。 + +- 文本摘要应用提供了基于Simple Serving的服务端部署方案。 + +### 环境依赖 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +finetune/ +├── data # 数据 +│ ├── train.json # 训练数据集文件 +│ └── test.json # 可选,待预测数据文件 +├── deploy # 部署 +│ ├── paddle_inference # PaddleInference高性能推理部署 +│ │ ├── inference_pegasus.py # 推理部署脚本 +│ │ └── README.md # 说明文档 +│ └── simple_serving +│ ├── client.py # 客户端程序 +│ ├── server.py # 服务器程序 +│ └── README.md # 说明文档 +├── run_prepare.py # 小数据集获取脚本 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_model.sh # 动态图参数导出静态图参数shell脚本 +├── predict.py # 预测脚本 +├── train.py # 训练评估脚本 +├── utils.py # 工具函数脚本 +├── requirements.txt # 依赖包 +└── README.md # 说明文档 +``` + +### 数据准备 + +#### 数据加载 + +#### 从本地文件创建数据集 + +在许多情况,我们需要使用本地数据集来训练我们的文本摘要模型,本项目支持使用固定格式本地数据集文件进行训练。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +└── test.json # 可选,待预测数据文件 +``` + +本地数据集文件格式如下: + +- train.json/test.json 文件每行格式: + +```text +{ +"title": "任志强抨击政府把土地作为投机品地产业被人为破坏", +"content": "“北京的保障房市场就像一个巨大的赌场,每个人都在期待中奖。”面对中国目前现行的保障性住房政策,华远地产董事长任志强再次语出惊人。(分享自@第一财经-中国房地产金融)" +} +``` + +这里提供小数据集供测试,运行下面命令即可下载: + +```bash +python run_prepare.py +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#) +和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + +### 模型训练 + +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +python -m paddle.distributed.launch --gpus "2,3,4,5,6,7" train.py \ + --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --train_file data/train.json \ + --eval_file data/test.json \ + --output_dir pegasus_out \ + --max_source_length 128 \ + --max_target_length 64 \ + --num_train_epochs 20 \ + --logging_steps 1 \ + --save_steps 10000 \ + --per_device_train_batch_size 128 \ + --per_device_eval_batch_size 128 \ + --learning_rate 5e-5 \ + --warmup_ratio 0.02 \ + --weight_decay=0.01 \ + --do_train \ + --do_eval \ + --device=gpu +``` + +关键参数释义如下: + +- `gpus` 指示了训练所用的GPU卡号。 +- `train_file` 本地训练数据地址。 +- `eval_file` 本地测试数据地址。 +- `model_name_or_path` + 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: + ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese | + | IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese | + +- `output_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `num_train_epochs` 表示训练轮数。 +- `per_device_train_batch_size` 表示每次训练**每张卡**上的样本数目。 +- `per_device_eval_batch_size` 表示每次验证**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_ratio` + 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf) + 。 +- `max_source_length` 模型输入序列的最大长度。 +- `max_target_length` 模型训练时标签的最大长度。 +- `do_train` 是否进行训练。 +- `do_eval` 是否进行预测。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + + 除此之外,我们提供了一种可选的解码端输入增强策略。该策略在解码过程中,基于标准摘要和模型输出构造了新的解码输入数据,以此实现解码端的数据增强。具体详情可以参考[SSTIA论文](https://openreview.net/pdf?id=pz1euXohm4H)。如果想使用该策略,可以设置参数: +- `use_SSTIA` 为True表示使用该策略。以及, +- `mix_ratio` 表示构造输入和原始输入的权重。 + + 该策略在Pegasus-238M和Pegasus-523M模型上均有大幅度提升,具体效果见后文实验结果表格。 + + 
PaddleNLP提供了训练好的SSTIA模型,可以修改`model_name_or_path`直接使用: + + | PaddleNLP提供的SSTIA模型 | + |---------------------------------| + | PaddlePaddle/Randeng-Pegasus-238M-Summary-Chinese-SSTIA | + | PaddlePaddle/Randeng-Pegasus-523M-Summary-Chinese-SSTIA | + +更多参数详情和参数的默认值请参考`train.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。 +如: + +```text +./pegasus_out/ +├── model_config.json +├── model_state.pdparams +├── special_tokens_map.json +├── tokenizer_config.json +└── vocab.txt +``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +unset CUDA_VISIBLE_DEVICES + +python predict.py \ + --init_checkpoint_dir=pegasus_out \ + --prefict_file data/valid.json \ + --max_source_length 128 \ + --max_target_length 64 \ + --batch_size 128 \ + --device=gpu \ +``` + +程序运行结束后会将预测结果保存在`output_path`中。 + +Finetuned baseline的模型在[LCSTS](https://aclanthology.org/D15-1229/)测试集上有如下结果: +| model_name | Rouge-1 | Rouge-2 | Rouge-L | BLEU-4 | +| :-----------------------------: | :---: | :-----------: | :-------------------: |:-------------------: | +| finetuned IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese | 43.30 | 30.08 | 40.12 | 24.50 | +| IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese + SSTIA | 45.79 | 33.20 | 42.88 | 28.07 | +| finetuned IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese | 48.13 | 36.41 | 45.39 | 31.99 | +| IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese + SSTIA | 53.23 | 42.79 | 50.84 | 39.05 | + +### 模型推理部署 + +#### FastGeneration加速及模型静态图导出 + +使用动态图训练结束之后,可以通过[静态图导出脚本](export_model.py) +实现基于FastGeneration的高性能预测加速,并将动态图参数导出成静态图参数,静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py \ + --model_name_or_path IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --decoding_strategy beam_search \ + --export_output_dir ./inference_model \ + --max_out_len 30 \ +``` + +关键参数释义如下: + +* `model_name_or_path`:动态图训练保存的参数路径;默认为"IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"。 +* `export_output_dir`:静态图图保存的参数路径;默认为"./inference_model"。 +* `max_out_len`:最大输出长度。 + +执行命令后将会自动导出模型到指定的 `inference_model` 中,保存模型文件结构如下所示: + +```text +inference_model/ +├── pegasus.pdiparams +├── pegasus.pdiparams.info +└── pegasus.pdmodel +``` + +#### 模型部署 + +文本摘要应用已打通多种场景部署方案,点击链接获取具体的使用教程。 + +- [Paddle Inference 推理 (Python)](./deploy/paddle_inference/README.md) +- [Simple Serving 服务化部署(Python)](./deploy/simple_serving/README.md) + +## References + +- Zhang J, Zhao Y, Saleh M, et al. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization[C] + //International Conference on Machine Learning. PMLR, 2020: 11328-11339. +- Wang J, Zhang Y, Zhang L, et al. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence[J]. arXiv + preprint arXiv:2209.02970, 2022. +- Xie S, Lv A, Xia Y, et al. Target-side input augmentation for sequence to sequence generation[C] + //International Conference on Learning Representations. 2022. 
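
补充说明(概念示意,非本项目源码):上文介绍的 SSTIA 解码端输入增强策略,其核心思想是训练时不完全使用标准摘要作为解码器输入,而是按 `mix_ratio` 将基于模型自身输出构造的输入与原始输入加权混合。下面给出一个与具体框架无关的 numpy 示意,其中词表大小、序列长度、嵌入维度与 `mix_ratio` 均为假设值,实际实现请以 `train.py` 中 `use_SSTIA` 相关代码为准:

```python
# Conceptual sketch of target-side (decoder) input augmentation with
# hypothetical sizes; this is NOT the project's implementation.
import numpy as np

np.random.seed(0)
vocab_size, hidden_size, seq_len = 8, 4, 5  # toy sizes
mix_ratio = 0.3  # weight of the constructed (model-derived) input

embedding = np.random.randn(vocab_size, hidden_size)  # token embedding table
gold_ids = np.array([2, 5, 1, 7, 3])  # gold summary token ids (decoder input)
# Model output distribution over the vocabulary at each decoding step.
pred_probs = np.random.dirichlet(np.ones(vocab_size), size=seq_len)

gold_emb = embedding[gold_ids]  # (seq_len, hidden_size): teacher-forcing input
pred_emb = pred_probs @ embedding  # expected embedding under the model's prediction

# New decoder input: weighted mix of the original input and the constructed input.
mixed_emb = (1.0 - mix_ratio) * gold_emb + mix_ratio * pred_emb
print(mixed_emb.shape)  # (5, 4)
```

该示意只展示“按权重混合原始解码输入与构造输入”这一步,不涉及损失函数与训练循环。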
diff --git a/applications/text_summarization/finetune/deploy/paddle_inference/README.md b/applications/text_summarization/finetune/deploy/paddle_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4c3cf774dced1606cd6df4e00630954ab70babe7 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/paddle_inference/README.md @@ -0,0 +1,31 @@ +# Paddle Inference部署 +本文档将介绍如何使用[Paddle Inference](https://paddle-inference.readthedocs.io/en/latest/guides/introduction/index_intro.html#paddle-inference)工具进行自动文本摘要应用高性能推理推理部署。 + +**目录** + * [背景介绍](#背景介绍) + * [导出预测部署模型](#导出预测部署模型) + * [基于Python预测](#基于Python预测) + + +## 背景介绍 +Paddle inference和主框架的Model.predict均可实现推理预测,Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力,主框架的Model 对象是一个具备训练、测试、推理的神经网络。相比于Model.predict,inference可使用MKLDNN、CUDNN、TensorRT进行预测加速。Model.predict适用于训练好的模型直接进行预测,paddle inference适用于对推理性能、通用性有要求的用户,针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。由于 Paddle Inference 能力直接基于飞桨的训练算子,因此它支持飞桨训练出的所有模型的推理。 + + +Paddle Inference Python端预测部署主要包含两个步骤: +- 导出预测部署模型 +- 基于Python预测 + + +## 导出预测部署模型 +部署时需要使用预测格式的模型(即动态图转静态图操作)。预测格式模型相对训练格式模型而言,在拓扑上裁剪掉了预测不需要的算子,并且会做特定部署优化。具体操作详见[FastGeneration加速及模型静态图导出](../../README.md)。 + +## 基于Python预测 + + +在终端输入以下命令可在GPU上进行预测: +```shell +python inference_pegasus.py --inference_model_dir ../../inference_model +``` + +关键参数释义如下: +* `inference_model_dir`:用于高性能推理的静态图模型参数路径;默认为"../../inference_model"。 diff --git a/applications/text_summarization/finetune/deploy/paddle_inference/inference_pegasus.py b/applications/text_summarization/finetune/deploy/paddle_inference/inference_pegasus.py new file mode 100644 index 0000000000000000000000000000000000000000..b917302a9e0c63c2f3b5672190f5f6e0345577a2 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/paddle_inference/inference_pegasus.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import numpy as np +from paddle import inference + +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import PegasusChineseTokenizer + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--inference_model_dir", + default="../../inference_model/", + type=str, + help="Path to save inference model of Pegasus. ", + ) + args = parser.parse_args() + return args + + +def setup_predictor(args): + """Setup inference predictor.""" + # Load FastGeneration lib. 
+ load("FastGeneration", verbose=True) + model_file = os.path.join(args.inference_model_dir, "pegasus.pdmodel") + params_file = os.path.join(args.inference_model_dir, "pegasus.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = inference.Config(model_file, params_file) + config.enable_use_gpu(100, 0) + config.switch_ir_optim() + config.enable_memory_optim() + config.disable_glog_info() + + predictor = inference.create_predictor(config) + return predictor + + +def convert_example(example, tokenizer, max_seq_len=512): + """Convert all examples into necessary features.""" + tokenized_example = tokenizer( + example, max_length=max_seq_len, padding=True, truncation=True, return_attention_mask=False + ) + return tokenized_example + + +def infer(args, predictor): + """Use predictor to inference.""" + tokenizer = PegasusChineseTokenizer.from_pretrained("IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese") + + inputs = [ + "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!", + "据微信公众号“界面”报道,4日上午10点左右,中国发改委反垄断调查小组突击查访奔驰上海办事处,调取数据材料,并对多名奔驰高管进行了约谈。截止昨日晚9点,包括北京梅赛德斯-奔驰销售服务有限公司东区总经理在内的多名管理人员仍留在上海办公室内", + ] + + data = convert_example(inputs, tokenizer, max_seq_len=128) + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + input_handles[name].copy_from_cpu(np.array(data[name], dtype="int32")) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + for idx, sample in enumerate(output[0]): + for beam_idx, beam in enumerate(sample): + if beam_idx >= len(sample) // 2: + break + print( + f"Example {idx} beam {beam_idx}: ", + "".join(tokenizer.decode(beam, skip_special_tokens=True, clean_up_tokenization_spaces=False)), + ) + + +if __name__ == "__main__": + args = setup_args() + pprint(args) + predictor = setup_predictor(args) + infer(args, predictor) diff --git a/applications/text_summarization/finetune/deploy/simple_serving/README.md b/applications/text_summarization/finetune/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d46d99f9317367cd9b8f9cddf8ac72266d7d3e92 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/simple_serving/README.md @@ -0,0 +1,78 @@ +# SimpleServing服务化部署 + +本文档将介绍如何使用[PaddleNLP SimpleServing](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/server.md)工具部署自动文本摘要在线服务。 + +## 目录 +- [SimpleServing服务化部署](#SimpleServing服务化部署) + - [目录](#目录) + - [背景介绍](#背景介绍) + - [环境准备](#环境准备) + - [启动服务](#启动服务) + - [发送请求](#发送请求) + - [服务化自定义参数](#服务化自定义参数) + - [server参数](#server参数) + - [模型路径](#模型路径) + - [多卡服务化预测](#多卡服务化预测) + - [Taskflow加速](#Taskflow加速) + - [client参数](#client参数) + + + +## 背景介绍 +PaddleNLP SimpleServing 是基于 unicorn 封装的模型部署服务化工具,该服务化工具具备灵活、易用的特性,可以简易部署预训练模型和预训练模型工具Taskflow,PaddleNLP SimpleServing 具备以下两个特性: + +- 易用:一行代码即可部署预训练模型和预训练工具Taskflow +- 灵活:Handler机制可以快速定制化服务化部署方式 + +PaddleNLP SimpleServing Python端预测部署主要包含以下步骤: +- 环境准备 +- 启动服务 +- 发送请求 + +## 环境准备 +下载安装包含SimpleServing功能的PaddleNLP版本: +```shell +pip install 
paddlenlp +``` + +## 启动服务 +```shell +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## 发送请求 +```shell +python client.py +``` + +## 服务化自定义参数 + +### server参数 + +#### 模型路径 + +默认使用的模型为 `IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese` , 用户也可以通过修改`task_path`参数使用其他模型或自己的模型: + +```shell +ts = Taskflow("text_summarization", task_path='../../checkpoint/model_best/') +``` +可选模型有 `PaddlePaddle/Randeng-Pegasus-238M-Summary-Chinese-SSTIA`, `PaddlePaddle/Randeng-Pegasus-523M-Summary-Chinese-SSTIA`, `unimo-text-1.0-summary`, `IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese`, `IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码: + +```shell +ts1 = Taskflow('text_summarization', device_id=0) +ts2 = Taskflow('text_summarization', device_id=1) +service.register_taskflow("taskflow/text_summarization", [ts1, ts2]) +``` + +#### Taskflow加速 +PaddleNLP SimpleServing 支持在线服务加速,需要在注册Taskflow时设置参数`use_faster`: + +```shell +ts = Taskflow("text_summarization", use_faster=True) +``` + +### client参数 +用户修改`client.py`中的texts变量以对任意文本进行摘要。 diff --git a/applications/text_summarization/finetune/deploy/simple_serving/client.py b/applications/text_summarization/finetune/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..4a24deca524a63e90aba224bd67746f1be8d3e64 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/simple_serving/client.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/text_summarization" +headers = {"Content-Type": "application/json"} +texts = [ + "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!", + "据微信公众号“界面”报道,4日上午10点左右,中国发改委反垄断调查小组突击查访奔驰上海办事处,调取数据材料,并对多名奔驰高管进行了约谈。截止昨日晚9点,包括北京梅赛德斯-奔驰销售服务有限公司东区总经理在内的多名管理人员仍留在上海办公室内", +] +data = {"data": {"text": texts}} + +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/text_summarization/finetune/deploy/simple_serving/server.py b/applications/text_summarization/finetune/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..ea35203924063cfbd78a6bcd469c84bb26cb2bdb --- /dev/null +++ b/applications/text_summarization/finetune/deploy/simple_serving/server.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +ts = Taskflow("text_summarization") +app = SimpleServer() +app.register_taskflow("taskflow/text_summarization", ts) diff --git a/applications/text_summarization/finetune/export_model.py b/applications/text_summarization/finetune/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fad70e14ee88e5b81710f567301d820747b68711 --- /dev/null +++ b/applications/text_summarization/finetune/export_model.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle + +from paddlenlp.ops import FasterPegasus +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + type=str, + help="The model name to specify the Pegasus to use. ", + ) + parser.add_argument( + "--export_output_dir", default="./inference_model", type=str, help="Path to save inference model of Pegasus. " + ) + parser.add_argument("--topk", default=4, type=int, help="The number of candidate to procedure top_k sampling. ") + parser.add_argument( + "--topp", default=1.0, type=float, help="The probability threshold to procedure top_p sampling. " + ) + parser.add_argument("--max_out_len", default=64, type=int, help="Maximum output length. ") + parser.add_argument("--min_out_len", default=1, type=int, help="Minimum output length. ") + parser.add_argument("--num_return_sequence", default=1, type=int, help="The number of returned sequence. ") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set. ") + parser.add_argument("--num_return_sequences", default=1, type=int, help="The number of returned sequences. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + choices=["beam_search"], + type=str, + help="The main strategy to decode. ", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument( + "--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. 
" + ) + parser.add_argument( + "--length_penalty", + default=0.0, + type=float, + help="The exponential penalty to the sequence length in the beam_search strategy. ", + ) + + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + model_name_or_path = args.model_name_or_path + model = PegasusForConditionalGeneration.from_pretrained(model_name_or_path) + tokenizer = PegasusChineseTokenizer.from_pretrained(model_name_or_path) + + pegasus = FasterPegasus(model=model, use_fp16_decoding=args.use_fp16_decoding, trans_out=True) + + # Set evaluate mode + pegasus.eval() + + # Convert dygraph model to static graph model + pegasus = paddle.jit.to_static( + pegasus, + input_spec=[ + # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int32"), + # encoder_output + None, + # seq_len + None, + # min_length + args.min_out_len, + # max_length + args.max_out_len, + # num_beams. Used for beam_search. + args.num_beams, + # decoding_strategy + args.decoding_strategy, + # decoder_start_token_id + model.decoder_start_token_id, + # bos_token_id + tokenizer.bos_token_id, + # eos_token_id + tokenizer.eos_token_id, + # pad_token_id + tokenizer.pad_token_id, + # diversity rate. Used for beam search. + args.diversity_rate, + # length_penalty + args.length_penalty, + # topk + args.topk, + # topp + args.topp, + # temperature + args.temperature, + # num_return_sequences + args.num_return_sequences, + ], + ) + + # Save converted static graph model + paddle.jit.save(pegasus, os.path.join(args.export_output_dir, "pegasus")) + logger.info("PEGASUS has been saved to {}.".format(args.export_output_dir)) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + + do_predict(args) diff --git a/applications/text_summarization/finetune/export_model.sh b/applications/text_summarization/finetune/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..95bf1d5acb750b4860efee95a40d112c04c5dedd --- /dev/null +++ b/applications/text_summarization/finetune/export_model.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --model_name_or_path IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --decoding_strategy beam_search \ + --export_output_dir ./inference_model \ + --max_out_len 30 \ \ No newline at end of file diff --git a/applications/text_summarization/finetune/predict.py b/applications/text_summarization/finetune/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..b15c5008061e54479861070fc8de9c491d0ccdfb --- /dev/null +++ b/applications/text_summarization/finetune/predict.py @@ -0,0 +1,193 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import BatchSampler, DataLoader +from utils import compute_metrics, convert_example + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--init_checkpoint_dir", + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + type=str, + required=True, + help="Path to pre-trained model. ", + ) + parser.add_argument( + "--prefict_file", type=str, required=False, default="data/valid.json", help="Predict data path." + ) + parser.add_argument( + "--output_path", type=str, default="generate.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument( + "--max_source_length", + default=128, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=0, + type=int, + help="The minimum total sequence length for target text when generating. ", + ) + parser.add_argument( + "--max_target_length", + default=64, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument( + "--decode_strategy", default="greedy_search", type=str, help="The decode strategy in generation." + ) + parser.add_argument( + "--top_k", + default=2, + type=int, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument("--top_p", default=1.0, type=float, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", default=1, type=int, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + default=0.6, + type=float, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument( + "--early_stopping", + default=False, + type=eval, + help="Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.", + ) + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument( + "--faster", action="store_true", help="Whether to process inference using faster transformer. " + ) + parser.add_argument( + "--use_fp16_decoding", + action="store_true", + help="Whether to use fp16 when using faster transformer. Only works when using faster transformer. 
", + ) + parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + set_seed(args) + tokenizer = PegasusChineseTokenizer.from_pretrained(args.init_checkpoint_dir) + model = PegasusForConditionalGeneration.from_pretrained(args.init_checkpoint_dir) + dataset = load_dataset("json", data_files=args.prefict_file, split="train") + remove_columns = ["content", "title"] + trans_func = partial( + convert_example, + text_column="content", + summary_column="title", + tokenizer=tokenizer, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ) + dataset = dataset.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) + data_loader = DataLoader( + dataset=dataset, batch_sampler=batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + data_loader.pin_memory = False + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in enumerate(data_loader): + labels = batch.pop("labels").numpy() + preds, _ = model.generate( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"], + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + use_fast=args.faster, + ) + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + all_preds.extend( + tokenizer.batch_decode(preds.numpy(), skip_special_tokens=True, clean_up_tokenization_spaces=False) + ) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + start_time = time.time() + + compute_metrics(all_preds, all_labels) + with open(args.output_path, "w", encoding="utf-8") as fout: + for decoded_pred in all_preds: + fout.write(decoded_pred + "\n") + print("Save generated result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + generate(args) diff --git a/applications/text_summarization/finetune/requirements.txt b/applications/text_summarization/finetune/requirements.txt new file mode 100644 index 
0000000000000000000000000000000000000000..7cfd8eb0cbe9979f0943ae351e09c4c60aa3f13d --- /dev/null +++ b/applications/text_summarization/finetune/requirements.txt @@ -0,0 +1 @@ +rouge==1.0.1 \ No newline at end of file diff --git a/applications/text_summarization/finetune/run_prepare.py b/applications/text_summarization/finetune/run_prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..373d0b08d7c65ab4431386fbc5e64255483e3aa8 --- /dev/null +++ b/applications/text_summarization/finetune/run_prepare.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + + +def prepare(): + + bos_link_train = "https://paddlenlp.bj.bcebos.com/datasets/tiny_summary_dataset/train.json" + bos_link_valid = "https://paddlenlp.bj.bcebos.com/datasets/tiny_summary_dataset/valid.json" + bos_link_test = "https://paddlenlp.bj.bcebos.com/datasets/tiny_summary_dataset/test.json" + os.system("mkdir data") + os.system("cd data && wget %s " % (bos_link_train)) + os.system("cd data && wget %s " % (bos_link_valid)) + os.system("cd data && wget %s " % (bos_link_test)) + + +prepare() diff --git a/applications/text_summarization/finetune/train.py b/applications/text_summarization/finetune/train.py new file mode 100644 index 0000000000000000000000000000000000000000..52e36eda84d978cb24637549c253f8da4046cc74 --- /dev/null +++ b/applications/text_summarization/finetune/train.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from datasets import load_dataset +from utils import PegasusTrainer, compute_metrics, convert_example, main_process_first + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, set_seed +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + metadata={"help": ("Path to pre-trained model.")}, + ) + max_source_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded." 
+ ) + }, + ) + min_target_length: Optional[int] = field( + default=0, + metadata={"help": ("The minimum total sequence length for target text when generating. ")}, + ) + max_target_length: Optional[int] = field( + default=64, + metadata={ + "help": ( + "The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``." + ) + }, + ) + use_SSTIA: Optional[bool] = field( + default=False, + metadata={"help": ("Whether to use SSTIA.")}, + ) + mix_ratio: Optional[float] = field( + default=0, + metadata={"help": ("Mixture ratio for TSDASG synthetic input.")}, + ) + num_beams: Optional[int] = field( + default=1, + metadata={"help": ("The number of beams to use in beam search.")}, + ) + predict_with_generate: Optional[bool] = field( + default=True, + metadata={"help": ("Whether to generate in predcit.")}, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default="data/train.json", + metadata={"help": ("Train data path.")}, + ) + eval_file: Optional[str] = field( + default="data/test.json", + metadata={"help": ("Eval data path.")}, + ) + + +def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + labels = eval_preds.label_ids + preds = eval_preds.predictions + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + rougel = compute_metrics(all_preds, all_labels) + return {"RougeL": rougel} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + training_args.generation_max_length = model_args.max_target_length + training_args.generation_num_beams = model_args.num_beams + training_args.predict_with_generate = model_args.predict_with_generate + + tokenizer = PegasusChineseTokenizer.from_pretrained(model_args.model_name_or_path) + train_set = load_dataset("json", data_files=data_args.train_file, split="train") + dev_set = load_dataset("json", data_files=data_args.eval_file, split="train") + remove_columns = ["content", "title"] + trans_func = partial( + convert_example, + text_column="content", + summary_column="title", + tokenizer=tokenizer, + max_source_length=model_args.max_source_length, + max_target_length=model_args.max_target_length, + ) + with main_process_first(desc="train dataset map pre-processing"): + train_set = train_set.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + with main_process_first(desc="dev dataset map pre-processing"): + dev_set = dev_set.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + + model = PegasusForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + if model_args.use_SSTIA: + model.use_SSTIA = True + model.mix_ratio = model_args.mix_ratio + + batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = PegasusTrainer( + model=model, + 
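+        # PegasusTrainer (defined in utils.py) extends Seq2SeqTrainer with Pegasus-specific
+        # loss extraction and generation-based evaluation in prediction_step.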
args=training_args, + train_dataset=train_set if training_args.do_train else None, + eval_dataset=dev_set if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + compute_metrics=compute_metrics_func, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_summarization/finetune/utils.py b/applications/text_summarization/finetune/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..489bc4567f6690ba80587dd78c9588eeff749b36 --- /dev/null +++ b/applications/text_summarization/finetune/utils.py @@ -0,0 +1,224 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import contextlib +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +from paddle import nn +from rouge import Rouge + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import Seq2SeqTrainer +from paddlenlp.utils.log import logger + + +def convert_example(example, text_column, summary_column, tokenizer, max_source_length, max_target_length): + """ + Convert a example into necessary features. + """ + inputs = example[text_column] + targets = example[summary_column] + model_inputs = tokenizer( + inputs, max_length=max_source_length, padding=False, truncation=True, return_attention_mask=True + ) + labels = tokenizer(targets, max_length=max_target_length, padding=False, truncation=True) + model_inputs["labels"] = labels["input_ids"] + return model_inputs + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("rouge-1:", round(rouge1, 4)) + print("rouge-2:", round(rouge2, 4)) + print("rouge-L:", round(rougel, 4)) + print("BLEU-4:", round(bleu4.score(), 4)) + return rougel + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +class PegasusTrainer(Seq2SeqTrainer): + def compute_loss(self, model, inputs, return_outputs=False): + """ + How the loss is computed by Trainer. By default, all models return the loss in the first element. + Subclass and override for custom behavior. + """ + if self.criterion is not None: + if "labels" in inputs: + labels = inputs.pop("labels") + elif "start_positions" in inputs and "end_positions" in inputs: + labels = (inputs.pop("start_positions"), inputs.pop("end_positions")) + elif self.args.label_names is not None: + labels = [] + for label in self.label_names: + labels.append(inputs.pop(label)) + labels = tuple(labels) + elif "generator_labels" in inputs: + labels = inputs["generator_labels"] + else: + labels = None + + outputs = model(**inputs) + if self.criterion is not None: + loss = self.criterion(outputs, labels) + outputs = (loss, outputs) + + # Save past state if it exists + # TODO: this needs to be fixed and made cleaner later. + if self.args.past_index >= 0: + self._past = outputs[self.args.past_index] + + # We don't use .loss here since the model may return tuples instead of ModelOutput. + loss = outputs["loss"] if isinstance(outputs, dict) else outputs[2] + + return (loss, outputs) if return_outputs else loss + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + """ + Perform an evaluation step on `model` using `inputs`. + + Subclass and override to inject custom behavior. + + Args: + model (`nn.Layer`): + The model to evaluate. + inputs (`Dict[str, Union[paddle.Tensor, Any]]`): + The inputs and targets of the model. + + The dictionary will be unpacked before being fed to the model. Most models expect the targets under the + argument `labels`. Check your model's documentation for all accepted arguments. + prediction_loss_only (`bool`): + Whether or not to return the loss only. 
+ + Return: + Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: A tuple with the loss, logits and + labels (each being optional). + """ + + if not self.args.predict_with_generate or prediction_loss_only: + return super().prediction_step( + model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys + ) + + has_labels = "labels" in inputs + inputs = self._prepare_inputs(inputs) + + gen_kwargs = self._gen_kwargs.copy() + if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: + gen_kwargs["max_length"] = self.model.config.max_length + gen_kwargs["num_beams"] = ( + gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams + ) + + if "attention_mask" in inputs: + gen_kwargs["attention_mask"] = inputs.get("attention_mask", None) + if "global_attention_mask" in inputs: + gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None) + + # prepare generation inputs + # some encoder-decoder models can have varying encoder's and thus + # varying model input names + if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name: + generation_inputs = inputs[self.model.encoder.main_input_name] + else: + generation_inputs = inputs[self.model.main_input_name] + + generated_tokens = self.model.generate( + generation_inputs, + **gen_kwargs, + ) + # different from hf returns: tuple[Tensor]: It is a tuple contains two elements: ids and scores. + if isinstance(generated_tokens, tuple): + generated_tokens = generated_tokens[0] + # in case the batch is shorter than max length, the output should be padded + if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]: + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1) + + with paddle.no_grad(): + if has_labels: + with self.autocast_smart_context_manager(): + outputs = model(**inputs) + if self.label_smoother is not None: + loss = self.label_smoother(outputs, inputs["labels"]).mean().detach() + else: + # pegasus output is lm_logits, new_cache, masked_lm_loss + loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[2]).mean().detach() + else: + loss = None + + if self.args.prediction_loss_only: + return (loss, None, None) + + if has_labels: + labels = inputs["labels"] + if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]: + labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1)) + else: + labels = None + + return (loss, generated_tokens, labels) diff --git a/applications/text_summarization/pretrain/README.md b/applications/text_summarization/pretrain/README.md new file mode 100644 index 0000000000000000000000000000000000000000..76f86e23b57cad39f5639921dc0a66e310a586ca --- /dev/null +++ b/applications/text_summarization/pretrain/README.md @@ -0,0 +1,191 @@ +# 生成式文本摘要预训练 +**目录** + +- [生成式文本摘要预训练](#生成式文本摘要预训练) + - [简介](#简介) + - [预训练任务介绍](#预训练任务) + - [预训练定制](#预训练定制) + - [文本摘要预训练全流程介绍](#文本摘要预训练全流程介绍) + - [环境依赖](#环境依赖) + - 
[代码结构说明](#代码结构说明) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型预训练](#模型预训练) + - [模型微调](#模型微调) + - [References](#references) + +## 简介 + +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + +本项目预训练了一个专门为中文文本摘要任务设计的语言模型:PEGASUS。其预训练目标为间隙句子生成(Gap Sentences Generation, GSG),是专门为文本摘要任务设计的上游任务。 + +## 预训练任务 +Gap Sentences Generation(GSG)是一种专门为文本摘要提出的自监督预训练任务,其首先找出输入文本中较为核心的数个句子,然后将它们直接拼接到一起得到伪摘要输出,这些句子在输入中的位置则被替换成mask token,预训练的目标就是生成这些被mask掉的核心句子,即间隙句子。 + +对于GSG任务如何选择核心句子以及超参数的设置,请参考[原论文](https://arxiv.org/pdf/1912.08777.pdf) + +另外,原论文中也用到了Masked Language Model (MLM) 作为预训练任务,但实际效果增幅不大,所以不做使用。 + + +## 预训练定制 + +### 文本摘要预训练全流程介绍 + +接下来,我们将按数据准备、预训练、预测的全流程进行介绍。 + +1. **数据准备** + +- 如果没有已标注的数据集,我们推荐[doccano](https://github.com/doccano/doccano)数据标注工具。 + 如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集](#从本地文件创建数据集) + 。 +- 此外,还需要准备中文停用词表,存放到stopwords.txt中,建议参考[哈工大停用词表](https://github.com/goto456/stopwords) + +2. **模型预训练** + +- 数据准备完成后,可以开始使用我们的数据集完成模型的预训练任务。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。预训练的Tokenizer默认使用base版本"IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"的分词器,还支持large版本的分词器: "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese" + + +3. **模型预测** + +- 预训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型在文本摘要任务上的预测结果。 + + +### 环境依赖 + +rouge==1.0.1 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +pretrain/ +├── data # 数据 +│ ├── train.json # 预训练数据集文件 +│ └── test.json # 可选,待预测数据文件 +├── stopwords.txt # 停用词表 +├── train.py # 训练评估脚本 +├── utils.py # 工具函数脚本 +├── requirements.txt # 依赖包 +└── README.md # 说明文档 +``` + +### 数据准备 + +#### 数据加载 + +#### 从本地文件创建数据集 + +如果您想使用自己的数据来预训练PEGASUS模型,本项目支持使用固定格式本地数据集文件进行预训练。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +└── test.json # 可选,待预测数据文件 +``` + +本地数据集文件格式如下: + +- train.json/test.json 文件每行格式: + +```text +{ +"title": "任志强抨击政府把土地作为投机品地产业被人为破坏", +"content": "“北京的保障房市场就像一个巨大的赌场,每个人都在期待中奖。”面对中国目前现行的保障性住房政策,华远地产董事长任志强再次语出惊人。(分享自@第一财经-中国房地产金融)" +} +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#) +和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + +### 模型预训练 + +运行如下命令即可在样例训练集上开始pretrain,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +python -m paddle.distributed.launch --gpus "2,3,4,5,6,7" train.py \ + --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --train_file data/train.json \ + --eval_file data/test.json \ + --output_dir pegasus_out \ + --max_source_length 128 \ + --max_target_length 64 \ + --num_train_epochs 20 \ + --logging_steps 1 \ + --save_steps 10000 \ + --per_device_train_batch_size 128 \ + --per_device_eval_batch_size 128 \ + --learning_rate 1e-4 \ + --warmup_ratio 0.02 \ + --weight_decay=0.001 \ + --do_train \ + --do_eval \ + --device=gpu +``` + +关键参数释义如下: + +- `gpus` 指示了训练所用的GPU卡号。 +- `train_file` 本地训练数据地址。 +- `eval_file` 本地测试数据地址。 +- `model_name_or_path` + 指示了pretrain所使用的分词器,可以是PaddleNLP提供的分词器,或者是本地的分词器。如果使用本地的分词器,可以配置本地分词器的目录地址,例如: + ./checkpoints/model_xx/。如果使用PaddleNLP提供的分词器,可以选择下面其中之一。 + + | PaddleNLP提供的分词器 | + |---------------------------------| + | IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese | + | IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese | + +- `output_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `num_train_epochs` 表示训练轮数。 +- `per_device_train_batch_size` 表示每次训练**每张卡**上的样本数目。 
+- `per_device_eval_batch_size` 表示每次验证**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_ratio` + 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_source_length` 模型输入序列的最大长度。 +- `max_target_length` 模型训练时标签的最大长度。 +- `do_train` 是否进行训练。 +- `do_eval` 是否进行预测。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +更多参数详情和参数的默认值请参考`train.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。 +如: + +```text +./pegasus_out/ +├── model_config.json +├── model_state.pdparams +├── special_tokens_map.json +├── tokenizer_config.json +└── vocab.txt +``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +## 模型微调 +微调代码及效果请参考[PEGASUS微调](../finetune/) + + +## References + +- Zhang J, Zhao Y, Saleh M, et al. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization[C] + //International Conference on Machine Learning. PMLR, 2020: 11328-11339. +- Wang J, Zhang Y, Zhang L, et al. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence[J]. arXiv + preprint arXiv:2209.02970, 2022. diff --git a/applications/text_summarization/pretrain/requirements.txt b/applications/text_summarization/pretrain/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..7cfd8eb0cbe9979f0943ae351e09c4c60aa3f13d --- /dev/null +++ b/applications/text_summarization/pretrain/requirements.txt @@ -0,0 +1 @@ +rouge==1.0.1 \ No newline at end of file diff --git a/applications/text_summarization/pretrain/train.py b/applications/text_summarization/pretrain/train.py new file mode 100644 index 0000000000000000000000000000000000000000..14487efc3c1aca4103153d7c22d4664c2da0d349 --- /dev/null +++ b/applications/text_summarization/pretrain/train.py @@ -0,0 +1,167 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from datasets import load_dataset +from utils import PegasusTrainer, compute_metrics, convert_example, main_process_first + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, set_seed +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusConfig, + PegasusForConditionalGeneration, +) + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + metadata={"help": ("Path to pre-trained model.")}, + ) + max_source_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded." 
+ ) + }, + ) + min_target_length: Optional[int] = field( + default=0, + metadata={"help": ("The minimum total sequence length for target text when generating. ")}, + ) + max_target_length: Optional[int] = field( + default=64, + metadata={ + "help": ( + "The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``." + ) + }, + ) + num_beams: Optional[int] = field( + default=1, + metadata={"help": ("The number of beams to use in beam search.")}, + ) + predict_with_generate: Optional[bool] = field( + default=True, + metadata={"help": ("Whether to generate in predcit.")}, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default="data/train.json", + metadata={"help": ("Train data path.")}, + ) + eval_file: Optional[str] = field( + default="data/test.json", + metadata={"help": ("Eval data path.")}, + ) + stop_words: Optional[str] = field( + default="stopwords.txt", + metadata={"help": ("The stop words vocab.")}, + ) + + +def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + labels = eval_preds.label_ids + preds = eval_preds.predictions + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + rougel = compute_metrics(all_preds, all_labels) + return {"RougeL": rougel} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + training_args.generation_max_length = model_args.max_target_length + training_args.max_source_length = model_args.max_source_length + training_args.generation_num_beams = model_args.num_beams + training_args.predict_with_generate = model_args.predict_with_generate + training_args.stop_words = data_args.stop_words + + tokenizer = PegasusChineseTokenizer.from_pretrained(model_args.model_name_or_path) + train_set = load_dataset("json", data_files=data_args.train_file, split="train") + dev_set = load_dataset("json", data_files=data_args.eval_file, split="train") + + # train_set needn't map + remove_columns = ["title", "content"] + trans_func = partial( + convert_example, + text_column="content", + summary_column="title", + tokenizer=tokenizer, + max_source_length=model_args.max_source_length, + max_target_length=model_args.max_target_length, + ) + with main_process_first(desc="dev dataset map pre-processing"): + dev_set = dev_set.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + + config = PegasusConfig() + model = PegasusForConditionalGeneration(config=config) + + dev_batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = PegasusTrainer( + model=model, + args=training_args, + train_dataset=train_set if training_args.do_train else None, + eval_dataset=dev_set if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=dev_batchify_fn, + compute_metrics=compute_metrics_func, + ) + + if 
training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_summarization/pretrain/utils.py b/applications/text_summarization/pretrain/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d15f540f233651e3c27121678b43eff7153e5a01 --- /dev/null +++ b/applications/text_summarization/pretrain/utils.py @@ -0,0 +1,486 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import contextlib +import random +import re +import sys +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +from paddle import nn +from paddle.io import DataLoader +from rouge import Rouge + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import Seq2SeqTrainer +from paddlenlp.utils.log import logger + +rouge = Rouge() + + +class FakeAbstractCollator: + def __init__(self, tokenizer, stopwords_dict, max_enc_length): + self.tokenizer = tokenizer + self.max_seq_length = max_enc_length + self.stopwords_dict = stopwords_dict + + def __call__(self, samples): + labels = [] + attn_mask = [] + decoder_attn_mask = [] + source_inputs = [] + + for text in samples: + texts = text["content"] + text = text_segmentate(texts) + + if len(text) < 2: + continue + sentence_id_vec, source, target, source_idxs, target_idxs = pseudo_summary_f1( + text, self.stopwords_dict, self.tokenizer, self.max_seq_length, "rouge-l" + ) + source_idxs, target_idxs = get_input_mask(sentence_id_vec, target_idxs) + if len(source_idxs) > self.max_seq_length: + if 2 not in source_idxs[self.max_seq_length - 1 :]: + source_idxs = source_idxs[: self.max_seq_length] + source_idxs[-1] = self.tokenizer.eos_token_id + sys.stderr.write("Warning split long line: " + source + "\n") + else: + continue + + source_idxs, attention_mask = padding_to_maxlength( + source_idxs, self.max_seq_length, self.tokenizer.pad_token_id + ) + label, target_attention_mask = padding_to_maxlength( + target_idxs, self.max_seq_length, self.tokenizer.pad_token_id + ) + source_inputs.append(source_idxs) + attn_mask.append(attention_mask) + decoder_attn_mask.append(target_attention_mask) + labels.append(label) + labels = paddle.to_tensor(labels) + decode_input_idxs = shift_tokens_right(labels, self.tokenizer.pad_token_id, self.tokenizer.pad_token_id) + end_token_index = paddle.where(labels == self.tokenizer.eos_token_id)[1] + for idx, end_idx in enumerate(end_token_index): + labels[idx, end_idx + 1 :] = -100 + + return { + "input_ids": paddle.to_tensor(source_inputs), + "attention_mask": paddle.to_tensor(attn_mask), + "labels": labels, + "decoder_input_ids": 
decode_input_idxs, + "decoder_attention_mask": paddle.to_tensor(decoder_attn_mask), + } + + +def load_stopwords(stopwords_path): + stopwords_dict = {} + with open(stopwords_path, "r") as rf: + for line in rf: + line = line.strip() + if line not in stopwords_dict: + stopwords_dict[line] = 0 + else: + pass + return stopwords_dict + + +def text_segmentate(text): + en_seg_pattern = "((?:\\!|\\?|\\.|\\n)+(?:\\s)+)" + ch_seg_pattern = "((?:?|!|。|\\n)+)" + try: + text = re.sub(en_seg_pattern, r"\1[SEP]", text) + except Exception as e: + print("input: ", text) + raise e + text = re.sub(ch_seg_pattern, r"\1[SEP]", text) + text_list = text.split("[SEP]") + text_list = list(filter(lambda x: len(x) != 0, text_list)) + return text_list + + +def gather_join(texts, idxs): + return "".join([texts[i] for i in idxs]) + + +def gather_join_f1(texts_token, idsx): + join_texts = [] + for id in idsx: + join_texts.extend(texts_token[id]) + return join_texts + + +def compute_rouge(source, target): + source, target = " ".join(source), " ".join(target) + try: + scores = rouge.get_scores(hyps=source, refs=target) + return { + "rouge-1": scores[0]["rouge-1"]["f"], + "rouge-2": scores[0]["rouge-2"]["f"], + "rouge-l": scores[0]["rouge-l"]["f"], + } + except ValueError: + return { + "rouge-1": 0.0, + "rouge-2": 0.0, + "rouge-l": 0.0, + } + + +def remove_stopwords(texts, stopwords_dict): + for i, text in enumerate(texts): + texts[i] = list(filter(lambda x: x not in stopwords_dict, text)) + return texts + + +def pseudo_summary_f1(texts, stopwords, tokenizer, max_length, rouge_strategy="rouge-l"): + summary_rate = 0.25 + max_length = max_length - 1 + texts_tokens = [] + sentece_idxs_vec = [] + for text in texts: + if len(texts) == 0: + continue + try: + ids = tokenizer.encode(text.strip())["input_ids"][:-1] + except ValueError: + print("error, input : ", text) + raise ValueError + sentece_idxs_vec.append(ids) + tokens = [tokenizer._convert_id_to_token(token) for token in ids] + texts_tokens.append(tokens) + + texts_tokens_rm = remove_stopwords(texts_tokens, stopwords) + source_idxs, target_idxs = list(range(len(texts))), [] + + assert len(texts_tokens) == len(texts) + while True: + sims = [] + for i in source_idxs: + new_source_idxs = [j for j in source_idxs if j != i] + new_target_idxs = sorted(target_idxs + [i]) + new_source = gather_join_f1(texts_tokens_rm, new_source_idxs) + new_target = gather_join_f1(texts_tokens_rm, new_target_idxs) + sim = compute_rouge(new_source, new_target)[rouge_strategy] + sims.append(sim) + new_idx = source_idxs[np.argmax(sims)] + del sims + source_idxs.remove(new_idx) + target_idxs = sorted(target_idxs + [new_idx]) + source = gather_join(texts, source_idxs) + target = gather_join(texts, target_idxs) + try: + if len(source_idxs) == 1 or 1.0 * len(target) / len(source) > summary_rate: + break + except ZeroDivisionError: + print(texts) + print("source: ", source) + print("target: ", target) + + if len(source) < len(target): + source, target = target, source + source_idxs, target_idxs = target_idxs, source_idxs + + return sentece_idxs_vec, source, target, source_idxs, target_idxs + + +def get_input_mask(sentence_id_vec, indexs): + target_idxs = [] + input_idxs = [] + kMaskSentenceTokenId = 2 + kEosTokenId = 1 + mask_sentence_options_cumulative_prob = [0.9, 0.9, 1, 1] + for index in indexs: + target_idxs.extend(sentence_id_vec[index]) + choice = random.uniform(0, 1) + if choice < mask_sentence_options_cumulative_prob[0]: + sentence_id_vec[index] = [kMaskSentenceTokenId] + elif choice < 
mask_sentence_options_cumulative_prob[1]: + replace_id = random.randint(0, len(sentence_id_vec)) + sentence_id_vec[index] = sentence_id_vec[replace_id] + elif choice < mask_sentence_options_cumulative_prob[2]: + pass + else: + sentence_id_vec[index] = [] + + target_idxs.append(kEosTokenId) + for index, sentence_id in enumerate(sentence_id_vec): + if len(sentence_id) == 0: + continue + input_idxs.extend(sentence_id_vec[index]) + + input_idxs.append(kEosTokenId) + return input_idxs, target_idxs + + +def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id): + shifted_input_ids = paddle.zeros_like(input_ids) + shifted_input_ids[:, 1:] = paddle.clone(input_ids[:, :-1]) + shifted_input_ids[:, 0] = decoder_start_token_id + + if pad_token_id is None: + raise ValueError("self.model.config.pad_token_id has to be defined.") + shifted_input_ids = paddle.where(shifted_input_ids == -100, paddle.to_tensor(pad_token_id), shifted_input_ids) + + return shifted_input_ids + + +def padding_to_maxlength(ids, max_length, pad_id): + cur_len = len(ids) + len_diff = max_length - cur_len + return ids + [pad_id] * len_diff, [1] * cur_len + [0] * len_diff + + +def convert_example(example, text_column, summary_column, tokenizer, max_source_length, max_target_length): + """ + Convert a example into necessary features. + """ + inputs = example[text_column] + targets = example[summary_column] + model_inputs = tokenizer( + inputs, max_length=max_source_length, padding=False, truncation=True, return_attention_mask=True + ) + labels = tokenizer(targets, max_length=max_target_length, padding=False, truncation=True) + model_inputs["labels"] = labels["input_ids"] + return model_inputs + + +def compute_correct(logits, labels): + y_pred = paddle.argmax(logits, axis=-1) + y_pred = y_pred.reshape( + [ + -1, + ] + ) + y_true = labels.reshape( + [ + -1, + ] + ) + correct = paddle.sum(paddle.equal(y_pred, y_true).astype("float32")).item() + return correct + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("rouge-1:", round(rouge1, 4)) + print("rouge-2:", round(rouge2, 4)) + print("rouge-L:", round(rougel, 4)) + print("BLEU-4:", round(bleu4.score(), 4)) + return rougel + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +class PegasusTrainer(Seq2SeqTrainer): + def get_train_dataloader(self): + """ + Returns the training [`~paddle.io.DataLoader`]. + + Will use no sampler if `self.train_dataset` does not implement `__len__`, a random sampler (adapted to + distributed training if necessary) otherwise. + + Subclass and override this method if you want to inject some custom behavior. + """ + if self.train_dataset is None: + raise ValueError("Trainer: training requires a train_dataset.") + + train_dataset = self.train_dataset + train_sampler = self._get_train_sampler() + + stopwords_dict = load_stopwords(self.args.stop_words) + train_batchify_fn = FakeAbstractCollator(self.tokenizer, stopwords_dict, self.args.max_source_length) + + return DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=train_batchify_fn, + num_workers=self.args.dataloader_num_workers, + ) + + def compute_loss(self, model, inputs, return_outputs=False): + """ + How the loss is computed by Trainer. By default, all models return the loss in the first element. + Subclass and override for custom behavior. + """ + if self.criterion is not None: + if "labels" in inputs: + labels = inputs.pop("labels") + elif "start_positions" in inputs and "end_positions" in inputs: + labels = (inputs.pop("start_positions"), inputs.pop("end_positions")) + elif self.args.label_names is not None: + labels = [] + for label in self.label_names: + labels.append(inputs.pop(label)) + labels = tuple(labels) + elif "generator_labels" in inputs: + labels = inputs["generator_labels"] + else: + labels = None + + outputs = model(**inputs) + if self.criterion is not None: + loss = self.criterion(outputs, labels) + outputs = (loss, outputs) + + # Save past state if it exists + # TODO: this needs to be fixed and made cleaner later. + if self.args.past_index >= 0: + self._past = outputs[self.args.past_index] + + # We don't use .loss here since the model may return tuples instead of ModelOutput. 
+ # pegasus output is lm_logits, new_cache, masked_lm_loss + loss = outputs["loss"] if isinstance(outputs, dict) else outputs[2] + + return (loss, outputs) if return_outputs else loss + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + """ + Perform an evaluation step on `model` using `inputs`. + + Subclass and override to inject custom behavior. + + Args: + model (`nn.Layer`): + The model to evaluate. + inputs (`Dict[str, Union[paddle.Tensor, Any]]`): + The inputs and targets of the model. + + The dictionary will be unpacked before being fed to the model. Most models expect the targets under the + argument `labels`. Check your model's documentation for all accepted arguments. + prediction_loss_only (`bool`): + Whether or not to return the loss only. + + Return: + Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: A tuple with the loss, logits and + labels (each being optional). + """ + + if not self.args.predict_with_generate or prediction_loss_only: + return super().prediction_step( + model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys + ) + + has_labels = "labels" in inputs + inputs = self._prepare_inputs(inputs) + + gen_kwargs = self._gen_kwargs.copy() + if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: + gen_kwargs["max_length"] = self.model.config.max_length + gen_kwargs["num_beams"] = ( + gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams + ) + + if "attention_mask" in inputs: + gen_kwargs["attention_mask"] = inputs.get("attention_mask", None) + if "global_attention_mask" in inputs: + gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None) + + # prepare generation inputs + # some encoder-decoder models can have varying encoder's and thus + # varying model input names + if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name: + generation_inputs = inputs[self.model.encoder.main_input_name] + else: + generation_inputs = inputs[self.model.main_input_name] + + generated_tokens = self.model.generate( + generation_inputs, + **gen_kwargs, + ) + # different from hf returns: tuple[Tensor]: It is a tuple contains two elements: ids and scores. 
+ if isinstance(generated_tokens, tuple): + generated_tokens = generated_tokens[0] + # in case the batch is shorter than max length, the output should be padded + if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]: + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1) + + with paddle.no_grad(): + if has_labels: + with self.autocast_smart_context_manager(): + outputs = model(**inputs) + if self.label_smoother is not None: + loss = self.label_smoother(outputs, inputs["labels"]).mean().detach() + else: + # pegasus output is lm_logits, new_cache, masked_lm_loss + loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[2]).mean().detach() + else: + loss = None + + if self.args.prediction_loss_only: + return (loss, None, None) + + if has_labels: + labels = inputs["labels"] + if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]: + labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1)) + else: + labels = None + + return (loss, generated_tokens, labels) diff --git a/applications/zero_shot_text_classification/README.md b/applications/zero_shot_text_classification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4f9ed711246ca402e8785c416e7d3d626d717a77 --- /dev/null +++ b/applications/zero_shot_text_classification/README.md @@ -0,0 +1,286 @@ +简体中文 | [English](README_en.md) + +# 零样本文本分类 + +**目录** +- [1. 零样本文本分类应用](#1) +- [2. 快速开始](#2) + - [2.1 代码结构](#代码结构) + - [2.2 数据标注](#数据标注) + - [2.3 模型微调](#模型微调) + - [2.4 模型评估](#模型评估) + - [2.5 定制模型一键预测](#定制模型一键预测) + - [2.6 模型部署](#模型部署) + - [2.7 实验指标](#实验指标) + + + +## 1. 零样本文本分类应用 + +本项目提供基于通用文本分类 UTC(Universal Text Classification) 模型微调的文本分类端到端应用方案,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现文本分类产品落地。 + +
+ UTC模型结构图 +
+ +文本分类简单来说就是对给定的句子或文本使用分类模型分类。在文本分类的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对文本分类领域的痛点和难点,PaddleNLP 零样本文本分类应用 UTC 通过统一语义匹配方式 USM(Unified Semantic Matching)统一建模标签与文本的语义匹配能力,具备低资源迁移能力,支持通用分类、评论情感分析、语义相似度计算、蕴含推理、多项式阅读理解等众多“泛分类”任务,助力开发者简单高效实现多任务文本分类数据标注、训练、调优、上线,降低文本分类落地技术门槛。 + + +**零样本文本分类应用亮点:** + +- **覆盖场景全面🎓:** 覆盖文本分类各类主流任务,支持多任务训练,满足开发者多样文本分类落地需求。 +- **效果领先🏃:** 具有突出分类效果的UTC模型作为训练基座,提供良好的零样本和小样本学习能力。该模型在[ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html)和[FewCLUE](https://www.cluebenchmarks.com/fewclue.html)均取得榜首(截止2023年1月11日)。 +- **简单易用:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启文本分类,轻松完成部署上线,降低多任务文本分类落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 快速开始 + +对于简单的文本分类可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)分类,对于细分场景我们推荐使用定制功能(标注少量数据进行模型微调)以进一步提升效果。 + + + +### 2.1 代码结构 + +```shell +. +├── deploy/simple_serving/ # 模型部署脚本 +├── utils.py # 数据处理工具 +├── run_train.py # 模型微调脚本 +├── run_eval.py # 模型评估脚本 +├── label_studio.py # 数据格式转换脚本 +├── label_studio_text.md # 数据标注说明文档 +└── README.md +``` + + + +### 2.2 数据标注 + +我们推荐使用[Label Studio](https://labelstud.io/) 数据标注工具进行标注,如果已有标注好的本地数据集,我们需要将数据集整理为文档要求的格式,详见[Label Studio数据标注指南](./label_studio_text.md)。 + +这里我们提供预先标注好的`医疗意图分类数据集`的文件,可以运行下面的命令行下载数据集,我们将展示如何使用数据转化脚本生成训练/验证/测试集文件,并使用UTC模型进行微调。 + +下载医疗意图分类数据集: + + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/utc-medical.tar.gz +tar -xvf utc-medical.tar.gz +mv utc-medical data +rm utc-medical.tar.gz +``` + +生成训练/验证集文件: +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` +多任务训练场景可分别进行数据转换再进行混合。 + + + +### 2.3 模型微调 + +推荐使用 PromptTrainer API 对模型进行微调,该 API 封装了提示定义功能,且继承自 [Trainer API ](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md) 。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `utc-base` 作为预训练模型进行模型微调,将微调后的模型保存至`$finetuned_model`: + +单卡启动: + +```shell +python run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 \ + --save_plm +``` + +如果在GPU环境中使用,可以指定gpus参数进行多卡训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 \ + --save_plm +``` + +该示例代码中由于设置了参数 `--do_eval`,因此在训练完会自动进行评估。 + +可配置参数说明: +* `single_label`: 每条样本是否只预测一个标签。默认为`False`,表示多标签分类。 +* 
`device`: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 GPU 训练。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认10。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `eval_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `seed`:全局随机种子,默认为 42。 +* `model_name_or_path`:进行 few shot 训练使用的预训练模型。默认为 "utc-base", 可选"utc-xbase", "utc-base", "utc-medium", "utc-mini", "utc-micro", "utc-nano", "utc-pico"。 +* `output_dir`:必须,模型训练或压缩后保存的模型目录;默认为 `None` 。 +* `dataset_path`:数据集文件所在目录;默认为 `./data/` 。 +* `train_file`:训练集后缀;默认为 `train.txt` 。 +* `dev_file`:开发集后缀;默认为 `dev.txt` 。 +* `max_seq_len`:文本最大切分长度,包括标签的输入超过最大长度时会对输入文本进行自动切分,标签部分不可切分,默认为512。 +* `per_device_train_batch_size`:用于训练的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `learning_rate`:训练最大学习率,UTC 推荐设置为 1e-5;默认值为3e-5。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练,默认不设置。 +* `do_eval`:是否进行评估,设置该参数表示进行评估,默认不设置。 +* `do_export`:是否进行导出,设置该参数表示进行静态图导出,默认不设置。 +* `export_model_dir`:静态图导出地址,默认为None。 +* `overwrite_output_dir`: 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点目录,则使用它继续训练。 +* `disable_tqdm`: 是否使用tqdm进度条。 +* `metric_for_best_model`:最优模型指标, UTC 推荐设置为 `macro_f1`,默认为None。 +* `load_best_model_at_end`:训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +* `save_total_limit`:如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints `输出目录`,默认为None。 + + + +### 2.4 模型评估 + +通过运行以下命令进行模型评估预测: + +```shell +python run_eval.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.txt \ + --per_device_eval_batch_size 2 \ + --max_seq_len 512 \ + --output_dir ./checkpoint_test +``` + +可配置参数说明: + +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `per_device_eval_batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `single_label`: 每条样本是否只预测一个标签。默认为`False`,表示多标签分类。 + + + +### 2.5 定制模型一键预测 + +`paddlenlp.Taskflow`装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件`model_state.pdparams`。 + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +>>> my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path='./checkpoint/model_best/plm', precision="fp16") +>>> pprint(my_cls("中性粒细胞比率偏低")) +``` + + + +### 2.6 模型部署 + +目前 UTC 模型提供基于多种部署方式,包括基于 FastDeploy 的本地 Python 部署以及 PaddleNLP SimpleServing 的服务化部署。 + +#### Python 部署 + +以下示例展示如何基于 FastDeploy 库完成 UTC 模型完成通用文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型。模型目录为 `application/zero_shot_text_classification/checkpoint/model_best`(用户可按实际情况设置)。 + +```bash +# CPU 推理 +python deploy/python/infer.py --model_dir ./checkpoint/model_best --device cpu +# GPU 推理 +python deploy/python/infer.py --model_dir ./checkpoint/model_best --device gpu +``` + +运行完成后返回的结果如下: + +```bash +[2023-03-02 06:32:47,528] [ INFO] - We are using to load './checkpoint/model_best'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. 
+[2023-03-02 06:33:18,120] [ INFO] - Assigning ['[O-MASK]'] to the additional_special_tokens key of the tokenizer +[{'predictions': [{'label': '这是一条好评', 'score': 0.9073}], 'text_a': '房间干净明亮,非常不错'}] +``` + +更多细节请参考[UTC Python 部署方法](./deploy/python/README.md) + +#### 服务化部署 + +在 UTC 的服务化能力中我们提供基于PaddleNLP SimpleServing 来搭建服务化能力,通过几行代码即可搭建服务化部署能力。 + +``` +# Save at server.py +from paddlenlp import SimpleServer, Taskflow + +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"] +utc = Taskflow("zero_shot_text_classification", + model="utc-base", + schema=schema, + task_path="../../checkpoint/model_best/plm", + precision="fp32") +app = SimpleServer() +app.register_taskflow("taskflow/utc", utc) +``` + +``` +# Start the server +paddlenlp server server:app --host 0.0.0.0 --port 8990 +``` + +支持FP16半精度推理加速,详见[UTC SimpleServing 使用方法](./deploy/simple_serving/README.md) + + + +### 2.7 实验指标 + +医疗意图分类数据集 KUAKE-QIC 验证集 zero-shot 实验指标: + + | | Macro F1 | Micro F1 | + | :--------: | :--------: | :--------: | + | utc-xbase | 66.30 | 89.67 | + | utc-base | 64.13 | 89.06 | + | utc-medium | 69.62 | 89.15 | + | utc-micro | 60.31 | 79.14 | + | utc-mini | 65.82 | 89.82 | + | utc-nano | 62.03 | 80.92 | + | utc-pico | 53.63 | 83.57 | diff --git a/applications/zero_shot_text_classification/README_en.md b/applications/zero_shot_text_classification/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..9c8a4df1f3df4933f5f03b9956b83d933aec452d --- /dev/null +++ b/applications/zero_shot_text_classification/README_en.md @@ -0,0 +1,255 @@ +[简体中文](README.md) | English + +# Zero-shot Text Classification + +**Table of contents** +- [1. Zero-shot Text Classification Application](#1) +- [2. Quick Start](#2) + - [2.1 Code Structure](#21) + - [2.2 Data Annotation](#22) + - [2.3 Finetuning](#23) + - [2.4 Evaluation](#24) + - [2.5 Inference](#25) + - [2.6 Deployment](#26) + - [2.7 Experiments](#27) + + + +## 1. Zero-shot Text Classification + +This project provides an end-to-end application solution for universal text classification based on Universal Task Classification (UTC) finetuning and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Text Classification techniques with zero-shot ability in your own products or models. + +
+ UTC model architecture +
+ +Text Classification refers to assigning a set of categories to given input text. Despite the advantages of tuning, applying text classification techniques in practice remains a challenge due to domain adaption and lack of labeled data, etc. This PaddleNLP Zero-shot Text Classification Guide builds on our UTC from the Unified Semantic Matching (USM) model series and provides an industrial-level solution that supports universal text classification tasks, including but not limited to **semantic analysis, semantic matching, intention recognition and event detection**, allowing you accomplish multiple tasks with a single model. Besides, our method brings good generation performance through multi-task pretraining. + +**Highlights:** + +- **Comprehensive Coverage**🎓: Covers various mainstream tasks of text classification, including but not limited to semantic analysis, semantic matching, intention recognition and event detection. + +- **State-of-the-Art Performance**🏃: Strong performance from the UTC model, which ranks first on [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html)/[FewCLUE](https://www.cluebenchmarks.com/fewclue.html) as of 01/11/2023. + +- **Easy to use**⚡: Three lines of code to use our Taskflow for out-of-box Zero-shot Text Classification capability. One line of command to model training and model deployment. + +- **Efficient Tuning**✊: Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Quick start + +For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot performance. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance. + + + +### 2.1 Code structure + +```shell +. +├── deploy/simple_serving/ # model deployment script +├── utils.py # data processing tools +├── run_train.py # model fine-tuning script +├── run_eval.py # model evaluation script +├── label_studio.py # data format conversion script +├── label_studio_text.md # data annotation instruction +└── README.md +``` + + +### 2.2 Data labeling + +We recommend using [Label Studio](https://labelstud.io/) for data labeling. You can export labeled data in Label Studio and convert them into the required input format. Please refer to [Label Studio Data Labeling Guide](./label_studio_text_en.md) for more details. + +Here we provide a pre-labeled example dataset `Medical Question Intent Classification Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning. + +Download the medical question intent classification dataset: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/utc-medical.tar.gz +tar -xvf utc-medical.tar.gz +mv utc-medical data +rm utc-medical.tar.gz +``` + +Generate training/validation set files: + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` + +For multi-task training, you can convert data with script separately and move them to the same directory. 
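After conversion, each line of the generated `train.txt`/`dev.txt` is a JSON object with the fields produced by `label_studio.py` (`text_a`, `text_b`, `question`, `choices` and `labels`, where `labels` holds indices into `choices`). A converted sample might look like the following (the text and label index are illustrative only):

```json
{"text_a": "中性粒细胞比率偏低", "text_b": "", "question": "", "choices": ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"], "labels": [3]}
```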
+ + + +### 2.3 Finetuning + +Use the following command to fine-tune the model using `utc-base` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best/`: + +Single GPU: + +```shell +python run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Multiple GPUs: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Parameters: + +* `device`: Training device, one of 'cpu' and 'gpu' can be selected; the default is GPU training. +* `logging_steps`: The interval steps of log printing during training, the default is 10. +* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `eval_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `seed`: global random seed, default is 42. +* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "utc-base". Options: "utc-xbase", "utc-base", "utc-medium", "utc-mini", "utc-micro", "utc-nano", "utc-pico". +* `output_dir`: Required, the model directory saved after model training or compression; the default is `None`. +* `dataset_path`: The directory to dataset; defaults to `./data`. +* `train_file`: Training file name; defaults to `train.txt`. +* `dev_file`: Development file name; defaults to `dev.txt`. +* `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_train_batch_size`: The batch size of each GPU core/CPU used for training, the default is 8. +* `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation, default is 8. +* `num_train_epochs`: Training rounds, 100 can be selected when using early stopping method; the default is 10. +* `learning_rate`: The maximum learning rate for training, UTC recommends setting it to 1e-5; the default value is 3e-5. +* `do_train`: Whether to perform fine-tuning training, setting this parameter means to perform fine-tuning training, and it is not set by default. +* `do_eval`: Whether to evaluate, setting this parameter means to evaluate, the default is not set. +* `do_export`: Whether to export, setting this parameter means to export static graph, and it is not set by default. 
+* `export_model_dir`: Static map export address, the default is `./checkpoint/model_best`. +* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training. +* `disable_tqdm`: Whether to use tqdm progress bar. +* `metric_for_best_model`: Optimal model metric, UTC recommends setting it to `macro_f1`, the default is None. +* `load_best_model_at_end`: Whether to load the best model after training, usually used in conjunction with `metric_for_best_model`, the default is False. +* `save_total_limit`: If this parameter is set, the total number of checkpoints will be limited. Remove old checkpoints `output directory`, defaults to None. + + + +### 2.4 Evaluation + +Model evaluation: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.txt \ + --per_device_eval_batch_size 2 \ + --max_seq_len 512 \ + --output_dir ./checkpoint_test +``` + +Parameters: + +- `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`. +- `test_path`: The test set file for evaluation. +- `per_device_eval_batch_size`: Batch size, please adjust it according to the machine situation, the default is 8. +- `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. + + + +### 2.5 Inference + +You can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path`. + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +>>> my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path="./checkpoint/model_best", precision="fp16") +>>> pprint(my_cls("中性粒细胞比率偏低")) +``` + + + +### 2.6 Deployment + +We provide the deployment solution on the foundation of PaddleNLP SimpleServing, where you can easily build your own deployment service with three-line code. + +``` +# Save at server.py +from paddlenlp import SimpleServer, Taskflow + +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"] +utc = Taskflow("zero_shot_text_classification", + model="utc-base", + schema=schema, + task_path="../../checkpoint/model_best/", + precision="fp32") +app = SimpleServer() +app.register_taskflow("taskflow/utc", utc) +``` + +``` +# Start the server +paddlenlp server server:app --host 0.0.0.0 --port 8990 +``` + +It supports FP16 (half-precision) and multiple process for inference acceleration. + + + +### 2.7 Experiments + +The zero-shot results reported here are based on the development set of KUAKE-QIC. 
+ + | | Macro F1 | Micro F1 | + | :--------: | :--------: | :--------: | + | utc-xbase | 66.30 | 89.67 | + | utc-base | 64.13 | 89.06 | + | utc-medium | 69.62 | 89.15 | + | utc-micro | 60.31 | 79.14 | + | utc-mini | 65.82 | 89.82 | + | utc-nano | 62.03 | 80.92 | + | utc-pico | 53.63 | 83.57 | diff --git a/applications/zero_shot_text_classification/deploy/python/README.md b/applications/zero_shot_text_classification/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b1da371d8a7c0dd026e97bc83e798cfe347dc9b6 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/python/README.md @@ -0,0 +1,130 @@ +# FastDeploy UTC 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下提供 `infer.py` 快速完成在 CPU/GPU 的通用文本分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 UTC 模型进行文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [UTC 训练文档](../../README.md)导出得到的部署模型,其模型目录为 `application/zero_shot_text_classification/checkpoint/model_best`(用户可按实际情况设置)。 + + +```bash +# CPU 推理 +python infer.py --model_dir ../../checkpoint/model_best --device cpu +# GPU 推理 +python infer.py --model_dir ../../checkpoint/model_best --device gpu +``` + +运行完成后返回的结果如下: + +```bash +[2023-03-02 06:32:47,528] [ INFO] - We are using to load './checkpoint/model_best'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +[2023-03-02 06:33:18,120] [ INFO] - Assigning ['[O-MASK]'] to the additional_special_tokens key of the tokenizer +[{'predictions': [{'label': '这是一条好评', 'score': 0.9073}], 'text_a': '房间干净明亮,非常不错'}] +``` + +## 参数说明 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--num_omask_tokens | 最大标签数量,默认为64| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--device_id | 运行设备的id。默认为0。 | +|--cpu_threads | 当使用cpu推理时,指定推理的cpu线程数,默认为1。| +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | + +## FastDeploy 高阶用法 + +FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 UTC 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 UTC 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
硬件 硬件对应的接口 可用的推理引擎 推理引擎对应的接口 是否支持 Paddle 新格式量化模型 是否支持 FP16 模式
CPU use_cpu() Paddle Inference use_paddle_infer_backend() N/A
ONNX Runtime use_ort_backend() N/A
GPU use_gpu() Paddle Inference use_paddle_infer_backend() N/A
ONNX Runtime use_ort_backend()
Paddle TensorRT use_paddle_infer_backend() + paddle_infer_option.enable_trt = True
TensorRT use_trt_backend()
昆仑芯 XPU use_kunlunxin() Paddle Lite use_paddle_lite_backend() N/A
华为 昇腾 use_ascend() Paddle Lite use_paddle_lite_backend()
Graphcore IPU use_ipu() Paddle Inference use_paddle_infer_backend() N/A
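下面给出一个切换硬件与推理引擎后端的最小 Python 示例(仅为示意,接口用法与本目录 `infer.py` 保持一致,模型路径请按实际导出目录修改):

```python
import fastdeploy as fd

# 构建运行时选项(模型路径仅为示例)
option = fd.RuntimeOption()
option.set_model_path(
    "../../checkpoint/model_best/model.pdmodel",
    "../../checkpoint/model_best/model.pdiparams",
)

# 选择硬件:GPU 或 CPU
option.use_gpu(0)  # CPU 上可改用 option.use_cpu()

# 选择推理引擎后端,例如 Paddle Inference 或 ONNX Runtime
option.use_paddle_infer_backend()  # 或 option.use_ort_backend()

runtime = fd.Runtime(option)
```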
diff --git a/applications/zero_shot_text_classification/deploy/python/infer.py b/applications/zero_shot_text_classification/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..5dceeef418ed3a0bd26cc39a80c8d6a4e054f5c4 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/python/infer.py @@ -0,0 +1,248 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os +from typing import Any, Dict, List, Union + +import fastdeploy as fd +import numpy as np + +from paddlenlp.prompt import PromptDataCollatorWithPadding, UTCTemplate +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument( + "--pred_threshold", + default=0.5, + type=float, + help="Probability threshold for start/end index probabiliry.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--num_omask_tokens", type=int, default=64, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + return parser.parse_args() + + +class Predictor(object): + def __init__(self, args, schema: list = None): + self.set_schema(schema) + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + self.template = UTCTemplate(self.tokenizer, self.max_length) + self.collator = PromptDataCollatorWithPadding(self.tokenizer, return_tensors="np") + self.pred_threshold = args.pred_threshold + + def set_schema(self, schema): + if schema is None: + self._question = None + self._choices = None + elif isinstance(schema, list): + self._question = "" + self._choices = schema + elif isinstance(schema, dict) and len(schema) == 1: + 
for key, value in schema.items(): + self._question = key + self._choices = value + else: + raise ValueError(f"Invalid schema: {schema}.") + + def _check_input_text(self, inputs): + if isinstance(inputs, str) or isinstance(inputs, dict): + inputs = [inputs] + + if isinstance(inputs, list): + input_list = [] + for example in inputs: + data = {"text_a": "", "text_b": "", "choices": self._choices, "question": self._question} + if isinstance(example, dict): + for k, v in example.items(): + if k in data: + data[k] = example[k] + elif isinstance(example, str): + data["text_a"] = example + data["text_b"] = "" + elif isinstance(example, list): + for x in example: + if not isinstance(x, str): + raise ValueError("Invalid inputs, input text should be strings.") + data["text_a"] = example[0] + data["text_b"] = "".join(example[1:]) if len(example) > 1 else "" + else: + raise ValueError( + "Invalid inputs, the input should be {'text_a': a, 'text_b': b}, a text or a list of text." + ) + + if len(data["text_a"]) < 1 and len(data["text_b"]) < 1: + raise ValueError("Invalid inputs, input `text_a` and `text_b` are both missing or empty.") + if not isinstance(data["choices"], list) or len(data["choices"]) < 2: + raise ValueError("Invalid inputs, label candidates should be a list with length >= 2.") + if not isinstance(data["question"], str): + raise ValueError("Invalid inputs, prompt question should be a string.") + input_list.append(data) + else: + raise TypeError("Invalid input format!") + return input_list + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + option.use_gpu(args.device_id) + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "token_type_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "position_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "attention_mask", + [1, 1, 1, 1], + [args.batch_size, 1, args.max_length, args.max_length], + [args.batch_size, 1, args.max_length, args.max_length], + ) + option.trt_option.set_shape( + "omask_positions", + [1, 1], + [args.batch_size, args.num_omask_tokens], + [args.batch_size, args.num_omask_tokens], + ) + option.trt_option.set_shape("cls_positions", [1], [args.batch_size], [args.batch_size]) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + @staticmethod + def sigmoid(z): + return 1 / (1 + np.exp(-z)) + + def preprocess(self, inputs: Union[str, List[str]]) -> Dict[str, Any]: + """ + Transform the raw text to 
the model inputs, two steps involved: + 1) Transform the raw text to token ids. + 2) Generate the other model inputs from the raw text and token ids. + """ + inputs = self._check_input_text(inputs) + # Get the config from the kwargs + tokenized_inputs = [self.template(i) for i in inputs] + batches = [ + tokenized_inputs[idx : idx + self.batch_size] for idx in range(0, len(tokenized_inputs), self.batch_size) + ] + outputs = {} + outputs["text"] = inputs + outputs["batches"] = [self.collator(batch) for batch in batches] + + return outputs + + def infer(self, inputs: Dict[str, Any]) -> Dict[str, Any]: + outputs = {} + outputs["text"] = inputs["text"] + outputs["batch_logits"] = [] + dtype_list = ["int64", "int64", "int64", "float32", "int64", "int64"] + for batch in inputs["batches"]: + batch = dict(batch) + for i in range(self.runtime.num_inputs()): + input_name = self.runtime.get_input_info(i).name + batch[input_name] = batch[input_name].astype(dtype_list[i]) + del batch["soft_token_ids"] + logits = self.runtime.infer(batch)[0] + outputs["batch_logits"].append(logits) + return outputs + + def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, Any]: + outputs = [] + for logits in inputs["batch_logits"]: + scores = self.sigmoid(np.array(logits)) + output = {} + output["predictions"] = [] + for i, class_score in enumerate(scores[0]): + if class_score > self.pred_threshold: + output["predictions"].append({"label": i, "score": class_score}) + outputs.append(output) + + for i, output in enumerate(outputs): + if len(inputs["text"][i]["text_a"]) > 0: + output["text_a"] = inputs["text"][i]["text_a"] + if len(inputs["text"][i]["text_b"]) > 0: + output["text_b"] = inputs["text"][i]["text_b"] + for j, pred in enumerate(output["predictions"]): + output["predictions"][j] = { + "label": inputs["text"][i]["choices"][pred["label"]], + "score": pred["score"], + } + + return outputs + + def predict(self, texts): + inputs = self.preprocess(texts) + outputs = self.infer(inputs) + results = self.postprocess(outputs) + return results + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args, schema=["这是一条差评", "这是一条好评"]) + results = predictor.predict("房间干净明亮,非常不错") + print(results) diff --git a/applications/zero_shot_text_classification/deploy/simple_serving/README.md b/applications/zero_shot_text_classification/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..81179c100c293ffdd706b8fa3c15c49d47097207 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/simple_serving/README.md @@ -0,0 +1,82 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 + +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.5.0 +``` + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8190 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 + +#### schema替换 + +```python +# Default schema +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +``` + +#### 设置模型路径 + +```python +# Default task_path +utc = Taskflow("zero_shot_text_classification", model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 + +```python +utc1 = Taskflow("zero_shot_text_classification", 
model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema) +utc2 = Taskflow("zero_shot_text_classification", model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema) +service.register_taskflow("taskflow/utc", [utc1, utc2]) +``` + +#### 更多配置 + +```python +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +>>> utc = Taskflow("zero_shot_text_classification", + schema=schema, + model="utc-base", + max_seq_len=512, + batch_size=1, + pred_threshold=0.5, + precision="fp32") +``` + +* `schema`:定义任务标签候选集合。 +* `model`:选择任务使用的模型,默认为`utc-base`, 可选有`utc-xbase`, `utc-base`, `utc-medium`, `utc-micro`, `utc-mini`, `utc-nano`, `utc-pico`。 +* `max_seq_len`:最长输入长度,包括所有标签的长度,默认为512。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `pred_threshold`:模型对标签预测的概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + +### Client 自定义参数 + +```python +# Changed to input texts you wanted +texts = ["中性粒细胞比率偏低"] +``` diff --git a/applications/zero_shot_text_classification/deploy/simple_serving/client.py b/applications/zero_shot_text_classification/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..9c50a8c1fc98db17fe553f37b04117dde7dca025 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/simple_serving/client.py @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from pprint import pprint + +import requests + +if __name__ == "__main__": + url = "http://0.0.0.0:8190/taskflow/utc" + + headers = {"Content-Type": "application/json"} + + texts = ["中性粒细胞比率偏低", "男性小腹疼痛是什么原因?"] + + data = {"data": {"text": texts}} + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + datas = json.loads(r.text) + pprint(datas) diff --git a/applications/zero_shot_text_classification/deploy/simple_serving/server.py b/applications/zero_shot_text_classification/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..a1bb9ef1c278621bd070edc397b825a8381d21e3 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/simple_serving/server.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +# The task path changed to your best model path +utc = Taskflow( + "zero_shot_text_classification", model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema +) +# If you want to define the finetuned utc service +app = SimpleServer() +app.register_taskflow("taskflow/utc", utc) diff --git a/applications/zero_shot_text_classification/label_studio.py b/applications/zero_shot_text_classification/label_studio.py new file mode 100644 index 0000000000000000000000000000000000000000..8b54112935be631b9cf45a5a8d6d6eaab431145b --- /dev/null +++ b/applications/zero_shot_text_classification/label_studio.py @@ -0,0 +1,159 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle + +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +class LabelStudioDataConverter(object): + """ + DataConverter to convert data export from LabelStudio platform + """ + + def __init__(self, options, text_separator): + super().__init__() + if isinstance(options, list) and len(options) == 1 and os.path.isfile(options[0]): + with open(options[0], "r", encoding="utf-8") as fp: + self.options = [x.strip() for x in fp] + elif isinstance(options, list) and len(options) > 0: + self.options = options + else: + raise ValueError( + "Invalid options. Please use file with one label per line or set `options` with condidate labels." + ) + self.text_separator = text_separator + + def convert_utc_examples(self, raw_examples): + utc_examples = [] + for example in raw_examples: + raw_text = example["data"]["text"].split(self.text_separator) + if len(raw_text) < 1: + continue + elif len(raw_text) == 1: + raw_text.append("") + elif len(raw_text) > 2: + raw_text = ["".join(raw_text[:-1]), raw_text[-1]] + + label_list = [] + for raw_label in example["annotations"][0]["result"][0]["value"]["choices"]: + if raw_label not in self.options: + raise ValueError( + f"Label `{raw_label}` not found in label candidates `options`. Please recheck the data." 
+ ) + label_list.append(np.where(np.array(self.options) == raw_label)[0].tolist()[0]) + + utc_examples.append( + { + "text_a": raw_text[0], + "text_b": raw_text[1], + "question": "", + "choices": self.options, + "labels": label_list, + } + ) + return utc_examples + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.label_studio_file): + raise ValueError("Please input the correct path of label studio file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.label_studio_file, "r", encoding="utf-8") as f: + raw_examples = json.loads(f.read()) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + index_list = indexes.tolist() + raw_examples = [raw_examples[i] for i in indexes] + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_ids = index_list[:p1] + dev_ids = index_list[p1:p2] + test_ids = index_list[p2:] + + with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp: + maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids} + fp.write(json.dumps(maps)) + + data_converter = LabelStudioDataConverter(args.options, args.text_separator) + + train_examples = data_converter.convert_utc_examples(raw_examples[:p1]) + dev_examples = data_converter.convert_utc_examples(raw_examples[p1:p2]) + test_examples = data_converter.convert_utc_examples(raw_examples[p2:]) + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + _save_examples(args.save_dir, "train.txt", train_examples) + _save_examples(args.save_dir, "dev.txt", dev_examples) + _save_examples(args.save_dir, "test.txt", test_examples) + + logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. 
[0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--text_separator", type=str, default='\t', help="Separator for classification with two input texts.") + parser.add_argument("--options", default=None, type=str, nargs="+", help="The options for classification.") + parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/applications/zero_shot_text_classification/label_studio_text.md b/applications/zero_shot_text_classification/label_studio_text.md new file mode 100644 index 0000000000000000000000000000000000000000..50a419046c2b85fc517dbccfda95f6676a37eb2c --- /dev/null +++ b/applications/zero_shot_text_classification/label_studio_text.md @@ -0,0 +1,143 @@ +简体中文 | [English](label_studio_text_en.md) + +# 文本分类任务Label Studio使用指南 + + **目录** + +- [1. 安装](#1) +- [2. 文本分类任务标注](#2) + - [2.1 项目创建](#21) + - [2.2 数据上传](#22) + - [2.3 标签构建](#23) + - [2.4 任务标注](#24) + - [2.5 数据导出](#25) + - [2.6 数据转换](#26) + - [2.7 更多配置](#27) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- label-studio == 1.6.0 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + +2. 文本分类任务标注 + + + +#### 2.1 项目创建 + +点击创建(Create)开始创建一个新的项目,填写项目名称、描述,然后在``Labeling Setup``中选择``Text Classification``。 + +- 填写项目名称、描述 + +
+ +
+ +- 数据上传,从本地上传txt格式文件,选择``List of tasks``,然后选择导入本项目 + + + +
+ +
+ +- 设置任务,添加标签 + + + +
+ +
+ +
+ +
+ + + +#### 2.2 数据上传 + +项目创建后,可在Project/文本分类任务中点击``Import``继续导入数据,同样从本地上传txt格式文件,选择``List of tasks``,详见[项目创建](#data) 。 + + + +#### 2.3 标签构建 + +项目创建后,可在Setting/Labeling Interface中继续配置标签,详见[项目创建](#label) + +默认模式为单标签多分类数据标注。对于多标签多分类数据标注,需要将`choice`的值由`single`改为`multiple`。 + +
+ +
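作为参考,多标签场景下 Labeling Interface 中 Code 标签页的配置大致如下(标签取值仅为示例,请替换为实际的标签候选):

```xml
<View>
  <Text name="text" value="$text"/>
  <!-- choice="single" 表示单标签标注,改为 "multiple" 即支持多标签标注 -->
  <Choices name="label" toName="text" choice="multiple">
    <Choice value="病情诊断"/>
    <Choice value="治疗方案"/>
    <Choice value="其他"/>
  </Choices>
</View>
```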
+ + + +#### 2.4 任务标注 + +
+ +
+ + + +#### 2.5 数据导出 + +勾选已标注文本ID,选择导出的文件类型为``JSON``,导出数据: + +
+ +
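导出的 JSON 文件是一个标注记录数组,转换脚本 [label_studio.py](./label_studio.py) 只依赖其中的 `data.text` 与 `annotations[0].result[0].value.choices` 字段,单条记录的结构大致如下(内容仅为示例,实际导出还会包含其他字段):

```json
{
  "data": {"text": "中性粒细胞比率偏低"},
  "annotations": [
    {"result": [{"value": {"choices": ["指标解读"]}}]}
  ]
}
```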
+ + + +#### 2.6 数据转换 + +将导出的文件重命名为``label_studio.json``后,放入``./data``目录下。通过[label_studio.py](./label_studio.py)脚本可转为UTC的数据格式。 + +在数据转换阶段,还需要提供标签候选信息,放在`./data/label.txt`文件中,每个标签占一行。例如在医疗意图分类中,标签候选为``["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]``,也可通过``options``参数直接进行配置。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` + + + +#### 2.7 更多配置 + +- ``label_studio_file``: 从label studio导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``options``: 指定分类任务的类别标签。若输入类型为文件,则文件中每行一个标签。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. + +备注: +- 默认情况下 [label_studio.py](./label_studio.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [label_studio.py](./label_studio.py) 脚本,将会覆盖已有的同名数据文件 +- 对于从label_studio导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/zero_shot_text_classification/label_studio_text_en.md b/applications/zero_shot_text_classification/label_studio_text_en.md new file mode 100644 index 0000000000000000000000000000000000000000..e45c4146c116e74a11c153828f0b8f0d311a0283 --- /dev/null +++ b/applications/zero_shot_text_classification/label_studio_text_en.md @@ -0,0 +1,143 @@ +[简体中文](label_studio_text.md) | English + +# Label Studio User Guide - Text Classification + +**Table of contents** + +- [1. Installation](#1) +- [2. Text Classification Task Annotation](#2) + - [2.1 Project Creation](#21) + - [2.2 Data Upload](#22) + - [2.3 Label Construction](#23) + - [2.4 Task Annotation](#24) + - [2.5 Data Export](#25) + - [2.6 Data Conversion](#26) + - [2.7 More Configuration](#27) + + + +## 1. Installation + +** Dependencies used in the following annotation examples:** + +- Python 3.8+ +- label-studio == 1.6.0 + +Use pip to install label-studio in the terminal: + +```shell +pip install label-studio==1.6.0 +``` + +Once the installation is complete, run the following command line: +```shell +label-studio start +``` + +Open [http://localhost:8080/](http://127.0.0.1:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling. + + + +## 2. Text Classification Task Annotation + + + +#### 2.1 Project Creation + +Click Create to start creating a new project, fill in the project name, description, and select ``Text Classification`` in ``Labeling Setup``. + +- Fill in the project name, description + +
+ +
+ +- Upload the txt format file locally, select ``List of tasks``, and then choose to import this project. + + + +
+ +
+ +- Define labels + + + +
+ +
+ +
+ +
+ + + +#### 2.2 Data Upload + +You can continue to import local txt format data after project creation. See more details in [Project Creation](#data). + + + +#### 2.3 Label Construction + +After project creation, you can add/delete labels in Setting/Labeling Interface just as in [Project Creation](#label) + +LabelStudio supports single-label data annotation by default. Modify the value of `choice` as `multiple` in the `code` tab when multiple-label annotation is required. + +
+ +
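For reference, the Code configuration of the labeling interface for multi-label annotation roughly looks like the snippet below (the label values are placeholders and should be replaced with your own candidates):

```xml
<View>
  <Text name="text" value="$text"/>
  <!-- choice="single" enables single-label annotation; switch it to "multiple" for multi-label annotation -->
  <Choices name="label" toName="text" choice="multiple">
    <Choice value="病情诊断"/>
    <Choice value="治疗方案"/>
    <Choice value="其他"/>
  </Choices>
</View>
```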
+ + + +#### 2.4 Task annotation + +
+ +
+ + + +#### 2.5 Data Export + +Check the marked text ID, select the exported file type as ``JSON``, and export the data: + +
+ +
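The exported JSON file is an array of annotation records. The conversion script [label_studio.py](./label_studio.py) only relies on the `data.text` and `annotations[0].result[0].value.choices` fields, so a single record roughly looks like the example below (the content is illustrative; real exports contain additional fields):

```json
{
  "data": {"text": "中性粒细胞比率偏低"},
  "annotations": [
    {"result": [{"value": {"choices": ["指标解读"]}}]}
  ]
}
```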
+ + + +#### 2.6 Data conversion + +First, create a label file in the `./data` directory, with one label candidate per line. You can also directly set label condidates list by `options`. Rename the exported file to ``label_studio.json`` and put it in the ``./data`` directory. Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UTC. + + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` + + + +#### 2.7 More Configuration + +- ``label_studio_file``: Data labeling file exported from label studio. +- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default. +- ``splits``: The proportion of training set and validation set when dividing the data set. The default is [0.8, 0.1, 0.1], which means that the data is divided into training set, verification set and test set according to the ratio of ``8:1:1``. +- ``options``: Specify the label candidates set. For filename, there should be one label per line in the file. For list, the length should be longer than 1. +- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True. +- ``seed``: random seed, default is 1000. + +Note: +- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets +- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten +- For files exported from label_studio, each piece of data in the default file is correctly labeled manually. + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/zero_shot_text_classification/run_eval.py b/applications/zero_shot_text_classification/run_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..30686ff9fda33ca7075f274710daaf261e491348 --- /dev/null +++ b/applications/zero_shot_text_classification/run_eval.py @@ -0,0 +1,136 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os +from dataclasses import dataclass, field + +import paddle +from paddle.metric import Accuracy +from sklearn.metrics import f1_score +from utils import UTCLoss, read_local_dataset + +from paddlenlp.datasets import load_dataset +from paddlenlp.prompt import ( + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + UTCTemplate, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import UTC, AutoTokenizer + + +@dataclass +class DataArguments: + test_path: str = field(default="./data/test.txt", metadata={"help": "Test dataset file name."}) + threshold: float = field(default=0.5, metadata={"help": "The threshold to produce predictions."}) + single_label: str = field(default=False, metadata={"help": "Predict exactly one label per sample."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="utc-base", metadata={"help": "Build-in pretrained model."}) + model_path: str = field(default=None, metadata={"help": "Build-in pretrained model."}) + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = UTC.from_pretrained(model_args.model_name_or_path) + + # Define template for preprocess and verbalizer for postprocess. + template = UTCTemplate(tokenizer, training_args.max_seq_length) + + # Load and preprocess dataset. + if data_args.test_path is not None: + test_ds = load_dataset(read_local_dataset, data_path=data_args.test_path, lazy=False) + + # Initialize the prompt model. + prompt_model = PromptModelForSequenceClassification( + model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + if model_args.model_path is not None: + model_state = paddle.load(os.path.join(model_args.model_path, "model_state.pdparams")) + prompt_model.set_state_dict(model_state) + + # Define the metric function. 
+ def compute_metrics_single_label(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + print(preds, labels) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + + preds = paddle.nn.functional.sigmoid(preds) + preds = preds[labels != -100].numpy() + labels = labels[labels != -100].numpy() + preds = preds > data_args.threshold + micro_f1 = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1 = f1_score(y_pred=preds, y_true=labels, average="macro") + + return {"micro_f1": micro_f1, "macro_f1": macro_f1} + + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=UTCLoss(), + train_dataset=None, + eval_dataset=None, + callbacks=None, + compute_metrics=compute_metrics_single_label if data_args.single_label else compute_metrics, + ) + + if data_args.test_path is not None: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + with open(os.path.join(training_args.output_dir, "test_metric.json"), "w", encoding="utf-8") as fp: + json.dump(test_ret.metrics, fp) + + with open(os.path.join(training_args.output_dir, "test_predictions.json"), "w", encoding="utf-8") as fp: + if data_args.single_label: + preds = paddle.nn.functional.softmax(paddle.to_tensor(test_ret.predictions), axis=-1) + for index, pred in enumerate(preds): + result = {"id": index} + result["labels"] = paddle.argmax(pred).item() + result["probs"] = pred[result["labels"]].item() + fp.write(json.dumps(result, ensure_ascii=False) + "\n") + else: + preds = paddle.nn.functional.sigmoid(paddle.to_tensor(test_ret.predictions)) + for index, pred in enumerate(preds): + result = {"id": index} + result["labels"] = paddle.where(pred > data_args.threshold)[0].tolist() + result["probs"] = pred[pred > data_args.threshold].tolist() + fp.write(json.dumps(result, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + main() diff --git a/applications/zero_shot_text_classification/run_train.py b/applications/zero_shot_text_classification/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..6e7956d19bdd0a84bb70a9da9eb9c53c5574fac4 --- /dev/null +++ b/applications/zero_shot_text_classification/run_train.py @@ -0,0 +1,154 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from dataclasses import dataclass, field + +import paddle +from paddle.metric import Accuracy +from paddle.static import InputSpec +from sklearn.metrics import f1_score +from utils import UTCLoss, read_local_dataset + +from paddlenlp.datasets import load_dataset +from paddlenlp.prompt import ( + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + UTCTemplate, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import UTC, AutoTokenizer, export_model + + +@dataclass +class DataArguments: + dataset_path: str = field( + default="./data", + metadata={"help": "Local dataset directory including train.txt, dev.txt and label.txt (optional)."}, + ) + train_file: str = field(default="train.txt", metadata={"help": "Train dataset file name."}) + dev_file: str = field(default="dev.txt", metadata={"help": "Dev dataset file name."}) + threshold: float = field(default=0.5, metadata={"help": "The threshold to produce predictions."}) + single_label: str = field(default=False, metadata={"help": "Predict exactly one label per sample."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field( + default="utc-base", + metadata={ + "help": "The build-in pretrained UTC model name or path to its checkpoints, such as " + "`utc-xbase`, `utc-base`, `utc-medium`, `utc-mini`, `utc-micro`, `utc-nano` and `utc-pico`." + }, + ) + export_type: str = field(default="paddle", metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + export_model_dir: str = field(default="checkpoints/model_best", metadata={"help": "The export model path."}) + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = UTC.from_pretrained(model_args.model_name_or_path) + + # Define template for preprocess and verbalizer for postprocess. + template = UTCTemplate(tokenizer, training_args.max_seq_length) + + # Load and preprocess dataset. + train_ds = load_dataset( + read_local_dataset, + data_path=data_args.dataset_path, + data_file=data_args.train_file, + lazy=False, + ) + dev_ds = load_dataset( + read_local_dataset, + data_path=data_args.dataset_path, + data_file=data_args.dev_file, + lazy=False, + ) + + # Define the criterion. + criterion = UTCLoss() + + # Initialize the prompt model. + prompt_model = PromptModelForSequenceClassification( + model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. 
+ def compute_metrics_single_label(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.sigmoid(preds) + preds = preds[labels != -100].numpy() + labels = labels[labels != -100].numpy() + preds = preds > data_args.threshold + micro_f1 = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1 = f1_score(y_pred=preds, y_true=labels, average="macro") + + return {"micro_f1": micro_f1, "macro_f1": macro_f1} + + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=compute_metrics_single_label if data_args.single_label else compute_metrics, + ) + + # Training. + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Export. + if training_args.do_export: + input_spec = [ + InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + InputSpec(shape=[None, None], dtype="int64", name="position_ids"), + InputSpec(shape=[None, None, None, None], dtype="float32", name="attention_mask"), + InputSpec(shape=[None, None], dtype="int64", name="omask_positions"), + InputSpec(shape=[None], dtype="int64", name="cls_positions"), + ] + export_model(trainer.pretrained_model, input_spec, model_args.export_model_dir, model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/zero_shot_text_classification/utils.py b/applications/zero_shot_text_classification/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c983b75dc9d1d26cf3d90fe718fa2dd83e9dde05 --- /dev/null +++ b/applications/zero_shot_text_classification/utils.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
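+"""Data reading and loss utilities shared by run_train.py and run_eval.py.
+
+``read_local_dataset`` expects one JSON object per line, for example::
+
+    {"text_a": "...", "text_b": "", "question": "...", "choices": ["A", "B"], "labels": [0]}
+
+``UTCLoss`` is a multi-label loss: per sample it computes
+``logsumexp([s_neg, 0]) + logsumexp([-s_pos, 0])``, where ``s_pos`` / ``s_neg``
+are the logits of the positive / negative classes selected by the one-hot
+labels, and positions labeled -100 are masked out.
+"""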
+ + +import json +import os + +import numpy as np +import paddle + +from paddlenlp.utils.log import logger + + +def read_local_dataset(data_path, data_file=None, is_test=False): + """ + Load datasets with one example per line, formated as: + {"text_a": X, "text_b": X, "question": X, "choices": [A, B], "labels": [0, 1]} + """ + if data_file is not None: + file_paths = [os.path.join(data_path, fname) for fname in os.listdir(data_path) if fname.endswith(data_file)] + else: + file_paths = [data_path] + skip_count = 0 + for file_path in file_paths: + with open(file_path, "r", encoding="utf-8") as fp: + for example in fp: + example = json.loads(example.strip()) + if len(example["choices"]) < 2 or not isinstance(example["text_a"], str) or len(example["text_a"]) < 3: + skip_count += 1 + continue + if "text_b" not in example: + example["text_b"] = "" + if not is_test or "labels" in example: + if not isinstance(example["labels"], list): + example["labels"] = [example["labels"]] + one_hots = np.zeros(len(example["choices"]), dtype="float32") + for x in example["labels"]: + one_hots[x] = 1 + example["labels"] = one_hots.tolist() + + if is_test: + yield example + continue + std_keys = ["text_a", "text_b", "question", "choices", "labels"] + std_example = {k: example[k] for k in std_keys if k in example} + yield std_example + logger.warning(f"Skip {skip_count} examples.") + + +class UTCLoss(object): + def __call__(self, logit, label): + return self.forward(logit, label) + + def forward(self, logit, label): + logit = (1.0 - 2.0 * label) * logit + logit_neg = logit - label * 1e12 + logit_pos = logit - (1.0 - label) * 1e12 + zeros = paddle.zeros_like(logit[..., :1]) + logit_neg = paddle.concat([logit_neg, zeros], axis=-1) + logit_pos = paddle.concat([logit_pos, zeros], axis=-1) + label = paddle.concat([label, zeros], axis=-1) + logit_neg[label == -100] = -1e12 + logit_pos[label == -100] = -1e12 + neg_loss = paddle.logsumexp(logit_neg, axis=-1) + pos_loss = paddle.logsumexp(logit_pos, axis=-1) + loss = (neg_loss + pos_loss).mean() + return loss diff --git a/csrc/README.md b/csrc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..117d9c79c69cb0c9af39a0898bea157f21d89a96 --- /dev/null +++ b/csrc/README.md @@ -0,0 +1,15 @@ +# PaddleNLP 自定义 OP + +此文档介绍如何编译安装 PaddleNLP 自定义 OP。 + +## 安装 C++ 依赖 + +```shell +pip install -r requirements.txt +``` + +## 编译 Cuda 算子 + +```shell +python setup_cuda.py install +``` diff --git a/csrc/generation/encode_rotary_qk.cu b/csrc/generation/encode_rotary_qk.cu new file mode 100644 index 0000000000000000000000000000000000000000..a920f81a2d1160a9a4e9e9eeb84bdb98c2bc5b51 --- /dev/null +++ b/csrc/generation/encode_rotary_qk.cu @@ -0,0 +1,225 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
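+// encode_rotary_qk: applies rotary position embedding (RoPE) to q and kv in place.
+// Shapes assumed by LaunchRotaryQK below:
+//   q, kv      : [batch_size, head_num, seq_len, dim_head]
+//   rotary_emb : cos/sin table of shape [2, batch_size, 1, seq_len, dim_head]
+//                (cos in the first half, sin in the second), as produced by
+//                fused_get_rotary_embedding
+//   seq_lens   : [batch_size] int32; positions beyond the valid length are skipped
+// use_neox == true selects the GPT-NeoX rotate-halves variant, otherwise the
+// interleaved even/odd pairing is used.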
+ +#include "helper.h" + +template +__global__ void NeoXRotaryKernel(const T *input, + const float *cos_emb, + const float *sin_emb, + const int *sequence_lengths, + T *output, + const int rotary_emb_dims, + const int batch_size, + const int head_num, + const int seq_len, + const int last_dim) { + int bi = blockIdx.x; + int hi = blockIdx.y; + int si = blockIdx.z; + if (sequence_lengths && si >= sequence_lengths[bi] * rotary_emb_dims) return; + int half_lastdim = last_dim / 2; + for (int ti = threadIdx.x; ti < half_lastdim; ti += blockDim.x) { + int base_idx = bi * head_num * seq_len * last_dim + + hi * seq_len * last_dim + si * last_dim; + int left_idx = base_idx + ti; + const int right_idx = base_idx + ti + half_lastdim; + int emb_idx_left = bi * seq_len * last_dim + si * last_dim + ti; + int emb_idx_right = + bi * seq_len * last_dim + si * last_dim + ti + half_lastdim; + float input_left = static_cast(input[left_idx]); + float input_right = static_cast(input[right_idx]); + + float cos_tmp_left = cos_emb[emb_idx_left]; + float sin_tmp_left = sin_emb[emb_idx_left]; + float cos_tmp_right = cos_emb[emb_idx_right]; + float sin_tmp_right = sin_emb[emb_idx_right]; + + T res1 = + static_cast(input_left * cos_tmp_left - input_right * sin_tmp_left); + T res2 = static_cast(input_right * cos_tmp_right + + input_left * sin_tmp_right); + output[left_idx] = res1; + output[right_idx] = res2; + } +} + + +template +__global__ void RotaryKernel(const T *input, + const float *cos_emb, + const float *sin_emb, + const int *sequence_lengths, + T *output, + const int rotary_emb_dims, + const int batch_size, + const int head_num, + const int seq_len, + const int last_dim) { + int bi = blockIdx.x; + int hi = blockIdx.y; + int si = blockIdx.z; + if (sequence_lengths && si >= sequence_lengths[bi] * rotary_emb_dims) return; + int half_lastdim = last_dim / 2; + // Note(ZhenyuLi): Calculate the relevant data at one time, so that no + // additional space is required. 
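+  // For each interleaved pair (x_{2t}, x_{2t+1}) the rotation applied below is:
+  //   out_{2t}   = x_{2t}   * cos - x_{2t+1} * sin
+  //   out_{2t+1} = x_{2t+1} * cos + x_{2t}   * sin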
+ for (int ti = threadIdx.x; ti < half_lastdim; ti += blockDim.x) { + int base_idx = bi * head_num * seq_len * last_dim + + hi * seq_len * last_dim + si * last_dim; + int left_idx = base_idx + 2 * ti; + const int right_idx = base_idx + 2 * ti + 1; + int emb_idx = bi * seq_len * last_dim + si * last_dim + 2 * ti; + float input_left = static_cast(input[left_idx]); + float input_right = static_cast(input[right_idx]); + float cos_tmp = cos_emb[emb_idx]; + float sin_tmp = sin_emb[emb_idx]; + T res1 = static_cast(input_left * cos_tmp - input_right * sin_tmp); + T res2 = static_cast(input_right * cos_tmp + input_left * sin_tmp); + output[left_idx] = res1; + output[right_idx] = res2; + } +} + +template +void LaunchRotaryQK(const paddle::Tensor& q, + const paddle::Tensor& kv, + const paddle::Tensor& rotary_emb, + const paddle::Tensor& seq_lens, + const int32_t rotary_emb_dims, + bool use_neox) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + + const int32_t batch_size = q.shape()[0]; + const int32_t head_num = q.shape()[1]; + const int32_t seq_len = q.shape()[2]; + const int32_t dim_head = q.shape()[3]; + + auto cu_stream = q.stream(); + dim3 grid(batch_size, head_num, seq_len * rotary_emb_dims); + const int last_dim = dim_head / rotary_emb_dims; + auto getBlockSize = [](int dim) { + if (dim > 256) { + return 512; + } else if (dim > 128) { + return 256; + } else if (dim > 64) { + return 128; + } else if (dim > 32) { + return 64; + } else { + return 32; + } + }; + int BlockSize = getBlockSize(last_dim / 2); + const float *cos_emb = rotary_emb.data(); + const float *sin_emb = rotary_emb.data() + batch_size * seq_len * dim_head; + + const DataType_* q_data = reinterpret_cast(q.data()); + const DataType_* k_data = reinterpret_cast(kv.data()); + + DataType_* q_out_data = reinterpret_cast(const_cast(q.data())); + DataType_* k_out_data = reinterpret_cast(const_cast(kv.data())); + + + if (!use_neox) { + RotaryKernel<<>>( + q_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + q_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + RotaryKernel<<>>( + k_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + k_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + } else { + NeoXRotaryKernel<<>>( + q_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + q_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + NeoXRotaryKernel<<>>( + k_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + k_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + } +} + +void RotaryQK(const paddle::Tensor& q, + const paddle::Tensor& kv, + const paddle::Tensor& rotary_emb, + const paddle::Tensor& seq_lens, + const int32_t rotary_emb_dims, + bool use_neox) { + switch (q.type()) { + case paddle::DataType::BFLOAT16: { + return LaunchRotaryQK( + q, kv, rotary_emb, seq_lens, rotary_emb_dims, use_neox + ); + } + case paddle::DataType::FLOAT16: { + return LaunchRotaryQK( + q, kv, rotary_emb, seq_lens, rotary_emb_dims, use_neox + ); + } + case paddle::DataType::FLOAT32: { + return LaunchRotaryQK( + q, kv, rotary_emb, seq_lens, rotary_emb_dims, use_neox + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only bfloat16, float16 and float32 are supported. 
"); + break; + } + } +} + + + +PD_BUILD_OP(encode_rotary_qk) + .Inputs({"q", "kv", "rotary_emb", "seq_lens"}) + .Outputs({"rotary_q_out", "rotary_kv_out"}) + .SetInplaceMap({{"q", "rotary_q_out"}, {"kv", "rotary_kv_out"}}) + .Attrs({"rotary_emb_dims: int", "use_neox: bool"}) + .SetKernelFn(PD_KERNEL(RotaryQK)); \ No newline at end of file diff --git a/csrc/generation/fused_get_rope.cu b/csrc/generation/fused_get_rope.cu new file mode 100644 index 0000000000000000000000000000000000000000..0115652626433a55fbdeeceee5a4b836348c94e5 --- /dev/null +++ b/csrc/generation/fused_get_rope.cu @@ -0,0 +1,215 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +/* +Position_ids: bsz, max_seq_length +*/ + +template +struct GetPackType { + using type = typename std::aligned_storage::type; +}; + +template +using PackType = typename GetPackType::type; + +template +union Pack { + static_assert(sizeof(PackType) == sizeof(T) * N, ""); + __device__ Pack() { + // do nothing + } + PackType storage; + T elem[N]; +}; + +__global__ __launch_bounds__(kBlockSize) void fused_get_rotary_embedding_neox(const int64_t* position_ids, + const int32_t bsz, + const int32_t max_seq_length, + const int32_t max_position_seq_length, + const int32_t head_dim, + const int32_t prompt_num, + const float inv_head_dim, + const int32_t elem_cnt, + float* rope_embedding) { + /* + In Naive implementation, it will stacks [freqs, freqs] + And actually, each threads can process 1 values, and store continuous 2 same values. + So here We construct a Pack to store 2 values. + */ + constexpr int PackSize = 2; + // Pack SinStorePack{}; + // Pack CosStorePack{}; + + const int half_head_dim = head_dim / PackSize; + const int32_t global_thread_idx = blockIdx.x * blockDim.x + threadIdx.x; + for(int idx = global_thread_idx, step=blockDim.x * gridDim.x; idx < elem_cnt; idx += step){ + const int32_t bsz_seq_idx = idx / half_head_dim; + const int32_t bsz_idx = bsz_seq_idx / max_seq_length; + const int32_t seq_idx = bsz_seq_idx % max_seq_length; + const int64_t position_offset = bsz_idx * max_position_seq_length + seq_idx + prompt_num; + const int32_t half_head_idx = (idx % half_head_dim) * PackSize; + const float exponent_factor = -static_cast(half_head_idx) * inv_head_dim; // * inv_head_dim equals to / head_dim. 
+ const float inv_freq_val = powf(10000.0f, exponent_factor); + const float freqs_val = static_cast(position_ids[position_offset]) * inv_freq_val; + const float cos_embedding_val = cos(freqs_val); + const float sin_embedding_val = sin(freqs_val); + + const int32_t cos_offset = bsz_seq_idx * head_dim + half_head_idx / PackSize; + rope_embedding[cos_offset] = cos_embedding_val; + rope_embedding[cos_offset + half_head_dim] = cos_embedding_val; + const int32_t sin_offset = bsz * max_seq_length * head_dim + cos_offset; + rope_embedding[sin_offset] = sin_embedding_val; + rope_embedding[sin_offset + half_head_dim] = sin_embedding_val; + + // /* + // Since After stack, the continuous 2 elements value is same. + // So here each threads store 2 computed embedding value. + // */ + // #pragma unroll + // for(int unroll_idx = 0; unroll_idx < PackSize; unroll_idx++){ + // CosStorePack.elem[unroll_idx] = cos_embedding_val; + // SinStorePack.elem[unroll_idx] = sin_embedding_val; + // } + // + // const int32_t cos_offset = bsz_seq_idx * head_dim + half_head_idx; + // const int32_t sin_offset = bsz * max_seq_length * head_dim + cos_offset; + // *(reinterpret_cast*>(rope_embedding + cos_offset)) = CosStorePack.storage; + // *(reinterpret_cast*>(rope_embedding + sin_offset)) = SinStorePack.storage; + } +} + +__global__ __launch_bounds__(kBlockSize) void fused_get_rotary_embedding(const int64_t* position_ids, + const int32_t bsz, + const int32_t max_seq_length, + const int32_t max_position_seq_length, + const int32_t head_dim, + const int32_t prompt_num, + const float inv_head_dim, + const int32_t elem_cnt, + float* rope_embedding) { + /* + In Naive implementation, it will stacks [freqs, freqs] + And actually, each threads can process 1 values, and store continuous 2 same values. + So here We construct a Pack to store 2 values. + */ + constexpr int PackSize = 2; + Pack SinStorePack{}; + Pack CosStorePack{}; + + const int half_head_dim = head_dim / PackSize; + const int32_t global_thread_idx = blockIdx.x * blockDim.x + threadIdx.x; + for(int idx = global_thread_idx, step=blockDim.x * gridDim.x; idx < elem_cnt; idx += step){ + const int32_t bsz_seq_idx = idx / half_head_dim; + const int32_t bsz_idx = bsz_seq_idx / max_seq_length; + const int32_t seq_idx = bsz_seq_idx % max_seq_length; + const int64_t position_offset = bsz_idx * max_position_seq_length + seq_idx + prompt_num; + const int32_t half_head_idx = (idx % half_head_dim) * PackSize; + const float exponent_factor = -static_cast(half_head_idx) * inv_head_dim; // * inv_head_dim equals to / head_dim. + const float inv_freq_val = powf(10000.0f, exponent_factor); + const float freqs_val = static_cast(position_ids[position_offset]) * inv_freq_val; + const float cos_embedding_val = cos(freqs_val); + const float sin_embedding_val = sin(freqs_val); + + /* + Since After stack, the continuous 2 elements value is same. + So here each threads store 2 computed embedding value. 
+ */ + #pragma unroll + for(int unroll_idx = 0; unroll_idx < PackSize; unroll_idx++){ + CosStorePack.elem[unroll_idx] = cos_embedding_val; + SinStorePack.elem[unroll_idx] = sin_embedding_val; + } + + const int32_t cos_offset = bsz_seq_idx * head_dim + half_head_idx; + const int32_t sin_offset = bsz * max_seq_length * head_dim + cos_offset; + *(reinterpret_cast*>(rope_embedding + cos_offset)) = CosStorePack.storage; + *(reinterpret_cast*>(rope_embedding + sin_offset)) = SinStorePack.storage; + } +} + +std::vector GetRoPE(const paddle::Tensor& input_ids, + const paddle::Tensor& position_ids, + const paddle::Tensor& head_dim_shape_tensor, + int prompt_num, + bool use_neox) { + const int64_t batch_size = input_ids.shape()[0]; + const int64_t max_seq_length = input_ids.shape()[1]; + const int64_t max_position_seq_length = position_ids.shape()[1]; + const int64_t head_dim = head_dim_shape_tensor.shape()[0]; + const float inv_head_dim = 1.0f / static_cast(head_dim); + + auto cu_stream = position_ids.stream(); + + auto rotary_embedding = paddle::full({2, batch_size, 1, max_seq_length, head_dim}, -1, paddle::DataType::FLOAT32, position_ids.place()); + + assert(head_dim % 2 == 0); + const int32_t elem_cnt = batch_size * max_seq_length * head_dim / 2; + int32_t grid_size = 1; + GetNumBlocks(elem_cnt, &grid_size); + if (use_neox) { + fused_get_rotary_embedding_neox<<>> ( + position_ids.data(), + batch_size, + max_seq_length, + max_position_seq_length, + head_dim, + prompt_num, + inv_head_dim, + elem_cnt, + reinterpret_cast(rotary_embedding.data())); + } else { + fused_get_rotary_embedding<<>> ( + position_ids.data(), + batch_size, + max_seq_length, + max_position_seq_length, + head_dim, + prompt_num, + inv_head_dim, + elem_cnt, + reinterpret_cast(rotary_embedding.data())); + } + return {rotary_embedding}; +} + + + +std::vector> GetRoPEInferShape(const std::vector& input_ids_shape, + const std::vector& position_ids_shape, + const std::vector& head_dim_shape_tensor_shape) { + const int64_t batch_size = position_ids_shape[0]; + const int64_t max_seq_length = input_ids_shape[1]; + const int64_t head_dim = head_dim_shape_tensor_shape[0]; + std::vector out_shape = {2, batch_size, 1, max_seq_length, head_dim}; + return {out_shape}; +} + +std::vector GetRoPEInferDtype(const paddle::DataType& input_ids_dtype, + const paddle::DataType& position_ids_dtype, + const paddle::DataType& head_dim_shape_tensor_dtype) { + // RoPE output dtype is Float. + return {paddle::DataType::FLOAT32}; +} + +PD_BUILD_OP(fused_get_rotary_embedding) + .Inputs({"input_ids", "position_ids", "head_dim_shape_tensor"}) + .Outputs({"rotary_embedding"}) + .Attrs({"prompt_num: int", + "use_neox: bool"}) + .SetKernelFn(PD_KERNEL(GetRoPE)) + .SetInferShapeFn(PD_INFER_SHAPE(GetRoPEInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(GetRoPEInferDtype)); \ No newline at end of file diff --git a/csrc/generation/get_padding_offset.cu b/csrc/generation/get_padding_offset.cu new file mode 100644 index 0000000000000000000000000000000000000000..1d6327a7616a73cfce68e31b16b0a0c3d37a9e6f --- /dev/null +++ b/csrc/generation/get_padding_offset.cu @@ -0,0 +1,133 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +__global__ void RemovePadding(int64_t *output_data, + const int64_t *input_data, + const int *seq_lens, + const int *cum_offsets, + const int sequence_length) { + const int bi = blockIdx.x; + const int tid = threadIdx.x; + + for (int i = tid; i < seq_lens[bi]; i += blockDim.x) { + const int tgt_seq_id = bi * sequence_length - cum_offsets[bi] + i; + const int src_seq_id = bi * sequence_length + i; + output_data[tgt_seq_id] = input_data[src_seq_id]; + } +} + +__global__ void GetCumOffsetKernel(int *token_num, + int *enc_token_num, + int *dec_token_num, + int *cum_offsets, + const int *sequence_lengths, + const int *sequence_lengths_encoder, + const int *sequence_lengths_decoder, + const int batch_size, + const int max_seq_len) { + // get padding offset of each batch + int total_seq_len = 0; + int enc_total_seq_len = 0; + int dec_total_seq_len = 0; + int cum_offset = 0; + int index = 0; + + for (int i = 0; i < batch_size; i++) { + cum_offsets[i] = cum_offset; + int seq_len = sequence_lengths[i]; + int seq_len_enc = sequence_lengths_encoder[i]; + int seq_len_dec = sequence_lengths_decoder[i]; + + cum_offset += max_seq_len - seq_len; + + total_seq_len += seq_len; + enc_total_seq_len += seq_len_enc; + dec_total_seq_len += seq_len_dec; + } + token_num[0] = total_seq_len; + enc_token_num[0] = enc_total_seq_len; + dec_token_num[0] = dec_total_seq_len; +} + +__global__ void GetPaddingOffsetKernel(int *padding_offset, + int *cum_offsets_out, + const int *cum_offsets, + const int *seq_lens, + const int max_seq_len) { + // get padding offset of each batch + const int bi = blockIdx.x; + const int ti = threadIdx.x; + if (ti == 0) { + cum_offsets_out[bi] = bi == 0 ? 0 : cum_offsets[bi - 1]; + } + int cum_offset = bi == 0 ? 
0 : cum_offsets[bi - 1]; + for (int i = ti; i < seq_lens[bi]; i += blockDim.x) { + padding_offset[bi * max_seq_len - cum_offset + i] = cum_offset; + } +} + + +std::vector GetPaddingOffset(const paddle::Tensor& input_ids, + const paddle::Tensor& cum_offsets, + const paddle::Tensor& token_num, + const paddle::Tensor& seq_len) { + auto cu_stream = input_ids.stream(); + std::vector input_ids_shape = input_ids.shape(); + const int bsz = input_ids_shape[0]; + const int seq_length = input_ids_shape[1]; + auto cum_offsets_out = cum_offsets.copy_to(cum_offsets.place(), false); + auto cpu_token_num = token_num.copy_to(paddle::CPUPlace(), false); + const int token_num_data = cpu_token_num.data()[0]; + auto x_remove_padding = paddle::full({token_num_data}, 0, paddle::DataType::INT64, input_ids.place()); + auto padding_offset = paddle::full({token_num_data}, 0, paddle::DataType::INT32, input_ids.place()); + int blockSize = min((token_num_data + 32 - 1) / 32 * 32, 128); + GetPaddingOffsetKernel<<>>( + padding_offset.data(), + cum_offsets_out.data(), + cum_offsets.data(), + seq_len.data(), + seq_length); + RemovePadding<<>>( + x_remove_padding.data(), + input_ids.data(), + seq_len.data(), + cum_offsets_out.data(), + seq_length); + return {x_remove_padding, cum_offsets_out, padding_offset}; // , enc_token_num, dec_token_num}; +} + +std::vector> GetPaddingOffsetInferShape(const std::vector& input_ids_shape, + const std::vector& cum_offsets_shape, + const std::vector& token_num_shape, + const std::vector& seq_len_shape) { + int64_t bsz = input_ids_shape[0]; + int64_t seq_len = input_ids_shape[1]; + return {{-1}, {bsz}, {-1}}; +} + +std::vector GetPaddingOffsetInferDtype(const paddle::DataType& input_ids_dtype, + const paddle::DataType& cum_offsets_dtype, + const paddle::DataType& token_num_dtype, + const paddle::DataType& seq_len_dtype) { + return {input_ids_dtype, seq_len_dtype, seq_len_dtype}; +} + +PD_BUILD_OP(get_padding_offset) + .Inputs({"input_ids", "cum_offsets", "token_num", "seq_len"}) + .Outputs({"x_remove_padding", "cum_offsets_out", "padding_offset"}) + .SetKernelFn(PD_KERNEL(GetPaddingOffset)) + .SetInferShapeFn(PD_INFER_SHAPE(GetPaddingOffsetInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(GetPaddingOffsetInferDtype)); \ No newline at end of file diff --git a/csrc/generation/helper.h b/csrc/generation/helper.h new file mode 100644 index 0000000000000000000000000000000000000000..4a74709aecaea6c725a3c2f4e84a0eb6b31d974d --- /dev/null +++ b/csrc/generation/helper.h @@ -0,0 +1,103 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
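+// Shared helpers for the custom generation ops:
+//   - GetNumBlocks: picks a grid size that keeps roughly kNumWaves waves of
+//     kBlockSize-thread blocks resident on the device
+//   - PDTraits: maps a paddle::DataType to the CUDA compute type (DataType) and
+//     the corresponding paddle C++ type (data_t)
+//   - AlignedVector / Load / Store: 16-byte aligned vectorized global memory access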
+ +#pragma once + +#include "paddle/extension.h" +#include +#include + +constexpr int kBlockSize = 256; +constexpr int kNumWaves = 16; + +inline cudaError_t GetNumBlocks(int64_t n, int* num_blocks) { + int dev; + { + cudaError_t err = cudaGetDevice(&dev); + if (err != cudaSuccess) { return err; } + } + int sm_count; + { + cudaError_t err = cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev); + if (err != cudaSuccess) { return err; } + } + int tpm; + { + cudaError_t err = cudaDeviceGetAttribute(&tpm, cudaDevAttrMaxThreadsPerMultiProcessor, dev); + if (err != cudaSuccess) { return err; } + } + *num_blocks = std::max(1, std::min((n + kBlockSize - 1) / kBlockSize, + sm_count * tpm / kBlockSize * kNumWaves)); + return cudaSuccess; +} + +template +__device__ T max_func(const T a, const T b) { + return a > b ? a : b; +} + +template +struct MaxOp { + __device__ __forceinline__ T operator()(const T& a, const T& b) const { + return max_func(a, b); + } +}; + +template +class PDTraits; + +template <> +class PDTraits { +public: + typedef float DataType; + typedef float data_t; +}; + +template <> +class PDTraits { +public: + typedef half DataType; + typedef paddle::float16 data_t; +}; + +template <> +class PDTraits { +public: + typedef __nv_bfloat16 DataType; + typedef paddle::bfloat16 data_t; +}; + +template +struct alignas(sizeof(T) * Size) AlignedVector { + T val[Size]; + + HOSTDEVICE inline const T& operator[](int i) const { return val[i]; } + HOSTDEVICE inline T& operator[](int i) { return val[i]; } +}; + +template +HOSTDEVICE inline void Load(const T* addr, AlignedVector* vec) { + const AlignedVector* addr_vec = + reinterpret_cast*>(addr); + *vec = *addr_vec; +} + +template +HOSTDEVICE inline void Store(const AlignedVector& vec, T* addr) { + AlignedVector* addr_vec = + reinterpret_cast*>(addr); + *addr_vec = vec; +} + +constexpr int VEC_16B = 16; \ No newline at end of file diff --git a/csrc/generation/qkv_transpose_split.cu b/csrc/generation/qkv_transpose_split.cu new file mode 100644 index 0000000000000000000000000000000000000000..ba9ee1f8ceb75170281afdfffd3ae0b271a0fe8f --- /dev/null +++ b/csrc/generation/qkv_transpose_split.cu @@ -0,0 +1,193 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
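+// qkv_transpose_split: splits a fused, padding-free QKV projection of shape
+// [token_num, 3 * num_head * head_size] into three padded tensors q/k/v of shape
+// [bsz, num_head, max_seq_len, head_size], scattering every token back to its
+// original (batch, position) slot with the help of padding_offset.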
+ +#include "helper.h" + +template +__global__ void fusedQKV_transpose_split_kernel( + T *q_buf, + T *k_buf, + T *v_buf, + const T *qkv, + const int *padding_offset, + const int *seq_lens, + const int32_t elem_cnt, + const int batch_size, + const int max_len_this_time, + const int seq_len, + const int token_num, + const int head_num, + const int size_per_head) { + const int32_t offset = batch_size * max_len_this_time * head_num * size_per_head; + const int32_t hidden_size = head_num * size_per_head; + const int32_t fused_hidden_size = 3 * hidden_size; + int64_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x; + using LoadT = AlignedVector; + LoadT src_vec; + LoadT bias_vec; + + for (int32_t linear_index = global_thread_idx * VecSize, + step = gridDim.x * blockDim.x * VecSize; + linear_index < elem_cnt; + linear_index += step) { + Load(&qkv[linear_index], &src_vec); + int32_t bias_idx = linear_index % fused_hidden_size; + const int32_t token_idx = linear_index / fused_hidden_size; + const int32_t ori_token_idx = + token_idx + (padding_offset == nullptr ? 0 : padding_offset[token_idx]); + const int32_t target_batch_id = ori_token_idx / seq_len; + if (seq_lens[target_batch_id] == 0) continue; + const int32_t seq_id = ori_token_idx % seq_len; + + // equal to: + // const int qkv_id = (linear_index % fused_hidden_size) / hidden_size; + const int32_t qkv_id = bias_idx / hidden_size; + const int32_t head_id = (linear_index % hidden_size) / size_per_head; + const int32_t size_id = linear_index % size_per_head; + + if (qkv_id == 0) { + Store( + src_vec, + &q_buf[target_batch_id * head_num * max_len_this_time * size_per_head + + head_id * max_len_this_time * size_per_head + seq_id * size_per_head + + size_id]); + } else if (qkv_id == 1) { + Store( + src_vec, + &k_buf[target_batch_id * head_num * max_len_this_time * size_per_head + + head_id * max_len_this_time * size_per_head + seq_id * size_per_head + + size_id]); + } else { + Store( + src_vec, + &v_buf[target_batch_id * head_num * max_len_this_time * size_per_head + + head_id * max_len_this_time * size_per_head + seq_id * size_per_head + + size_id]); + } + } +} + +template +std::vector qkv_transpose_split(const paddle::Tensor& qkv, // [token_num, dim_embed] + const paddle::Tensor& padding_offset, // [bsz, 1] + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids, + int num_head, + int head_size) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto cu_stream = qkv.stream(); + std::vector qkv_shape = qkv.shape(); + const int token_num = qkv_shape[0]; + const int bsz = seq_lens.shape()[0]; + const int max_seq_len = input_ids.shape()[1]; //max_seq_len_tensor.copy_to(paddle::CPUPlace(), false).data()[0]; + auto q_out = paddle::full({bsz, num_head, max_seq_len, head_size}, 0, qkv.dtype(), qkv.place()); + auto k_out = paddle::full({bsz, num_head, max_seq_len, head_size}, 0, qkv.dtype(), qkv.place()); + auto v_out = paddle::full({bsz, num_head, max_seq_len, head_size}, 0, qkv.dtype(), qkv.place()); + constexpr int PackSize = VEC_16B / sizeof(DataType_); + const int elem_cnt = token_num * num_head * head_size * 3; + const int pack_num = elem_cnt / PackSize; + const int blocksize = 128; + const int grid_size = (pack_num + blocksize - 1) / blocksize; + fusedQKV_transpose_split_kernel + <<>>( + reinterpret_cast(q_out.data()), + reinterpret_cast(k_out.data()), + reinterpret_cast(v_out.data()), + reinterpret_cast(const_cast(qkv.data())), + padding_offset.data(), + 
seq_lens.data(), + elem_cnt, + bsz, + max_seq_len, + max_seq_len, + token_num, + num_head, + head_size); + return {q_out, k_out, v_out}; +} + +std::vector QKVTransposeSplit(const paddle::Tensor& qkv, + const paddle::Tensor& padding_offset, + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids, + int num_head, + int head_size) { + switch (qkv.type()) { + case paddle::DataType::BFLOAT16: { + return qkv_transpose_split( + qkv, + padding_offset, + seq_lens, + input_ids, + num_head, + head_size + ); + } + case paddle::DataType::FLOAT16: { + return qkv_transpose_split( + qkv, + padding_offset, + seq_lens, + input_ids, + num_head, + head_size + ); + } + case paddle::DataType::FLOAT32: { + return qkv_transpose_split( + qkv, + padding_offset, + seq_lens, + input_ids, + num_head, + head_size + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. "); + break; + } + } +} + +std::vector> QKVTransposeSplitInferShape(const std::vector& qkv_shape, + const std::vector& padding_offset_shape, + const std::vector& seq_lens_shape, + const std::vector& input_ids_shape, + int num_head, + int head_size) { + int64_t bsz = seq_lens_shape[0]; + return {{bsz, num_head, -1, head_size}, {bsz, num_head, -1, head_size}, {bsz, num_head, -1, head_size}}; +} + +std::vector QKVTransposeSplitInferDtype(const paddle::DataType& qkv_dtype, + const paddle::DataType& padding_offset_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& input_ids_dtype) { + return {qkv_dtype, qkv_dtype, qkv_dtype}; +} + +PD_BUILD_OP(qkv_transpose_split) + .Inputs({"qkv", "padding_offset", "seq_lens", "input_ids"}) + .Outputs({"q_out", "k_out", "v_out"}) + .Attrs({"num_head: int", + "head_size: int"}) + .SetKernelFn(PD_KERNEL(QKVTransposeSplit)) + .SetInferShapeFn(PD_INFER_SHAPE(QKVTransposeSplitInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(QKVTransposeSplitInferDtype)); \ No newline at end of file diff --git a/csrc/generation/rebuild_padding.cu b/csrc/generation/rebuild_padding.cu new file mode 100644 index 0000000000000000000000000000000000000000..3c8dcc9be47fcc208b12ec2e759e1eb50cac02ca --- /dev/null +++ b/csrc/generation/rebuild_padding.cu @@ -0,0 +1,166 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
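+// rebuild_padding: for every sequence in the batch, gathers the hidden state of
+// its last valid token from the packed [token_num, dim_embed] input and writes it
+// to a [bsz, dim_embed] output, locating that token via cum_offsets and seq_lens.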
+ +#include "helper.h" + +template +__global__ void RebuildPaddingKernel(T *output_data, + const T *input_data, + const int *cum_offsets, + const int *seq_lens, + const int max_seq_len, + const int dim_embed, + const int elem_nums) { + using LoadT = AlignedVector; + LoadT src_vec; + const int global_idx = blockDim.x * blockIdx.x + threadIdx.x; + for (int i = global_idx * VecSize; i < elem_nums; i += gridDim.x * blockDim.x * VecSize) { + const int bi = i / dim_embed; + const int bias_idx = i % dim_embed; + int seq_id = seq_lens[bi] - 1; + const int ori_token_idx = bi * max_seq_len - cum_offsets[bi] + seq_id; + const int src_offset = ori_token_idx * dim_embed + bias_idx; + Load(&input_data[src_offset], &src_vec); + Store(src_vec, &output_data[i]); + } +} + +template +__global__ void RebuildPaddingKernel(T *output_data, + const T *input_data, + const int *padding_offset, + const int dim_embed) { + const int tid = threadIdx.x; + const int bid = blockIdx.x; + const int dst_seq_id = bid + padding_offset[bid]; + const int src_seq_id = bid; + + for (int i = tid; i < dim_embed; i += blockDim.x) { + output_data[dst_seq_id * dim_embed + i] = + input_data[src_seq_id * dim_embed + i]; + } +} + +template +void InvokeRebuildPadding(T *output_data, + const T *input_data, + const int *padding_offset, + const int token_num, + const int dim_embed, + cudaStream_t stream) { + // src: [token_num, dim_embed] + // dst: [batch_size * max_seq_len, dim_embed] + RebuildPaddingKernel<<>>( + output_data, input_data, padding_offset, dim_embed); +} + +template +std::vector rebuild_padding(const paddle::Tensor& tmp_out, // [token_num, dim_embed] + const paddle::Tensor& padding_offset, // [bsz, 1] + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto cu_stream = tmp_out.stream(); + std::vector tmp_out_shape = tmp_out.shape(); + const int token_num = tmp_out_shape[0]; + const int dim_embed = tmp_out_shape[1]; + const int bsz = seq_lens.shape()[0]; + auto out = paddle::full({bsz, dim_embed}, 0, tmp_out.dtype(), tmp_out.place()); + constexpr int PackSize = VEC_16B / sizeof(DataType_); + int elem_nums = out.numel(); + int pack_num = elem_nums / PackSize; + const int blocksize = 128; + const int grid_size = (pack_num + blocksize - 1) / blocksize; + RebuildPaddingKernel<<>>( + reinterpret_cast(out.data()), + reinterpret_cast(const_cast(tmp_out.data())), + padding_offset.data(), + seq_lens.data(), + input_ids.shape()[1], + dim_embed, + elem_nums); + // InvokeRebuildPadding( + // reinterpret_cast(out.data()), + // reinterpret_cast(const_cast(tmp_out.data())), + // padding_offset.data(), + // token_num, + // dim_embed, + // tmp_out.stream() + // ); + return {out}; +} + +std::vector RebuildPadding(const paddle::Tensor& tmp_out, + const paddle::Tensor& padding_offset, + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids) { + switch (tmp_out.type()) { + case paddle::DataType::BFLOAT16: { + return rebuild_padding( + tmp_out, + padding_offset, + seq_lens, + input_ids + ); + } + case paddle::DataType::FLOAT16: { + return rebuild_padding( + tmp_out, + padding_offset, + seq_lens, + input_ids + ); + } + case paddle::DataType::FLOAT32: { + return rebuild_padding( + tmp_out, + padding_offset, + seq_lens, + input_ids + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> RebuildPaddingInferShape(const std::vector& tmp_out_shape, + const std::vector& padding_offset_shape, + const std::vector& seq_lens_shape, + const std::vector& input_ids_shape) { + int64_t bsz = seq_lens_shape[0]; + int64_t dim_embed = tmp_out_shape[1]; + return {{bsz, dim_embed}}; +} + +std::vector RebuildPaddingInferDtype(const paddle::DataType& tmp_out_dtype, + const paddle::DataType& padding_offset_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& input_ids_dtype) { + return {tmp_out_dtype}; +} + +PD_BUILD_OP(rebuild_padding) + .Inputs({"tmp_out", "padding_offset", "seq_lens", "input_ids"}) + .Outputs({"out"}) + .SetKernelFn(PD_KERNEL(RebuildPadding)) + .SetInferShapeFn(PD_INFER_SHAPE(RebuildPaddingInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(RebuildPaddingInferDtype)); \ No newline at end of file diff --git a/csrc/generation/save_with_output.cc b/csrc/generation/save_with_output.cc new file mode 100644 index 0000000000000000000000000000000000000000..8cf19d5c2d818921d3e7c6fc485878af26756ebe --- /dev/null +++ b/csrc/generation/save_with_output.cc @@ -0,0 +1,166 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include +#include +#include +#include +#include + +#include "stdlib.h" +#include +#include // dladdr +#include +#include +#include "paddle/extension.h" + +constexpr char kSEP = '/'; + +std::string DirName(const std::string &filepath) { + auto pos = filepath.rfind(kSEP); + if (pos == std::string::npos) { + return ""; + } + return filepath.substr(0, pos); +} + +bool FileExists(const std::string &filepath) { + struct stat buffer; + return (stat(filepath.c_str(), &buffer) == 0); +} + +void MkDir(const char *path) { + std::string path_error(path); + path_error += " mkdir failed!"; + if (mkdir(path, 0755)) { + if (errno != EEXIST) { + throw std::runtime_error(path_error); + } + } +} + +void MkDirRecursively(const char *fullpath) { + if (*fullpath == '\0') return; // empty string + if (FileExists(fullpath)) return; + MkDirRecursively(DirName(fullpath).c_str()); + MkDir(fullpath); +} + + +template +void saveToFile(std::ostream & os, const void* x_data, std::vector shape, int64_t x_numel, const char type_id) { + // 1.type + os.write(reinterpret_cast(&type_id),sizeof(type_id)); + // 2.data + uint64_t size = x_numel * sizeof(data_t); + os.write(static_cast(x_data),static_cast(size)); + +} + +template +void save_with_output_kernel(const paddle::Tensor& x, + const paddle::Tensor& batch_idx, + const paddle::Tensor& step_idx, + std::string file_path, + int64_t rank_id, + char type_id) { + std::vector x_shape = x.shape(); + + if(rank_id >= 0) { + file_path += "_rank_" + std::to_string(rank_id); + } + + int batch_idx_data = -1, step_idx_data = -1; + + if(batch_idx.is_gpu()) { + paddle::Tensor batch_idx_cpu = batch_idx.copy_to(paddle::CPUPlace()); + batch_idx_data = batch_idx_cpu.data()[0]; + } else { + batch_idx_data = batch_idx.data()[0]; + } + if(step_idx.is_gpu()) { + paddle::Tensor step_idx_cpu = step_idx.copy_to(paddle::CPUPlace()); + step_idx_data = step_idx_cpu.data()[0]; + } else { + step_idx_data = step_idx.data()[0]; + } + auto x_data = x.data(); + + if(batch_idx_data >= 0) { + file_path += "_batch_" + std::to_string(batch_idx_data); + } + if(step_idx_data >= 0) { + file_path += "_step_" + std::to_string(step_idx_data); + } + MkDirRecursively(DirName(file_path).c_str()); + std::ofstream fout(file_path, std::ios::binary); + fout.write("0",1); + saveToFile(fout, x_data, x_shape, x.numel(),type_id); + fout.seekp(std::ios::beg); + fout.write("1",1); + fout.close(); + +} + +void print_shape(const paddle::Tensor& tmp, char *tmp_str){ + std::vector shape = tmp.shape(); + printf("%s's shape: \n", tmp_str); + for(int i=0; i < shape.size(); i++) { + printf("%d ", (int)shape[i]); + } + printf("\n"); +} + +std::vector SaveWithOutputForward(const paddle::Tensor& x, + const paddle::Tensor& batch_idx, + const paddle::Tensor& step_idx, + std::string file_path, + int64_t rank_id) { + auto out = x.copy_to(paddle::CPUPlace(), false); + switch(x.type()) { + case paddle::DataType::FLOAT32: + save_with_output_kernel(out, batch_idx, step_idx, file_path, rank_id, '0'); + break; + case paddle::DataType::INT64: + save_with_output_kernel(out, batch_idx, step_idx, file_path, rank_id,'1'); + break; + case paddle::DataType::INT32: + save_with_output_kernel(out, batch_idx, step_idx, file_path, rank_id, '2'); + break; + default: + PD_THROW("function SaveWithOutputForward is not implemented for data type"); + } + return {out}; +} + +std::vector> SaveWithOutputInferShape(const std::vector& x_shape, + const std::vector& batch_idx_shape, + const std::vector& step_idx_shape) { + return {x_shape}; +} + +std::vector 
SaveWithOutputInferDtype(const paddle::DataType& x_dtype, + const paddle::DataType& batch_idx_dtype, + const paddle::DataType& step_idx_dtype) { + return {x_dtype}; +} + +PD_BUILD_OP(save_with_output) + .Inputs({"x", "batch_idx", "step_idx"}) + .Attrs({"file_path: std::string", + "rank_id: int64_t"}) + .Outputs({"out"}) + .SetKernelFn(PD_KERNEL(SaveWithOutputForward)) + .SetInferShapeFn(PD_INFER_SHAPE(SaveWithOutputInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SaveWithOutputInferDtype)); diff --git a/csrc/generation/set_alibi_mask_value.cu b/csrc/generation/set_alibi_mask_value.cu new file mode 100644 index 0000000000000000000000000000000000000000..8036f1096ebd536ac20596f20a7c4c3040d1c6b3 --- /dev/null +++ b/csrc/generation/set_alibi_mask_value.cu @@ -0,0 +1,136 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +template +__global__ void set_value_by_id(const int *seq_lens, + const bool *stop_flags, + const float *alibi_slopes, + const int64_t *tgt_pos, + T *output_data, + int *sequence_lengths, + int bs, + int length, + int num_head) { + int bs_id = blockIdx.x; + int hid = threadIdx.x; + if (bs_id < bs) { + T *output_data_now = output_data + bs_id * num_head * length + hid * length; + float tgt_pos_now = static_cast(tgt_pos[bs_id]); + output_data_now[seq_lens[bs_id]] = static_cast(tgt_pos_now * alibi_slopes[hid]); + if (stop_flags[bs_id]) { + sequence_lengths[bs_id] = 0; + } + } +} + +template +std::vector set_mask_value(const paddle::Tensor& input_data, + const paddle::Tensor& stop_flags, + const paddle::Tensor& seq_lens, + const paddle::Tensor& alibi_slopes, + const paddle::Tensor& tgt_pos + ) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + PD_CHECK(seq_lens.dtype() == paddle::DataType::INT32); + PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL); + auto cu_stream = input_data.stream(); + std::vector input_data_shape = input_data.shape(); + std::vector seq_lens_shape = seq_lens.shape(); + auto sequence_lengths = seq_lens.copy_to(seq_lens.place(), false); + + int input_bs = input_data_shape[0]; + int length = input_data_shape[3]; + int seq_bs = seq_lens_shape[0]; + int num_head = alibi_slopes.shape()[0]; + + int grid_size = input_bs; + int block_size = num_head; + set_value_by_id<<>>(seq_lens.data(), + stop_flags.data(), + alibi_slopes.data(), + tgt_pos.data(), + reinterpret_cast(const_cast(input_data.data())), + sequence_lengths.data(), seq_bs, length, num_head); + return {sequence_lengths}; +} + +std::vector SetMaskValue(const paddle::Tensor& input_data, + const paddle::Tensor& stop_flags, + const paddle::Tensor& seq_lens, + const paddle::Tensor& alibi_slopes, + const paddle::Tensor& tgt_pos) { + switch (input_data.type()) { + case paddle::DataType::BFLOAT16: { + return set_mask_value( + input_data, + stop_flags, + seq_lens, + alibi_slopes, + tgt_pos + ); + } + case paddle::DataType::FLOAT16: { + return 
set_mask_value( + input_data, + stop_flags, + seq_lens, + alibi_slopes, + tgt_pos + ); + } + case paddle::DataType::FLOAT32: { + return set_mask_value( + input_data, + stop_flags, + seq_lens, + alibi_slopes, + tgt_pos + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. "); + break; + } + } +} + +std::vector> SetMaskValueInferShape(const std::vector& input_data_shape, + const std::vector& stop_flags_shape, + const std::vector& seq_lens_shape, + const std::vector& alibi_slopes_shape, + const std::vector& tgt_pos) { + return {seq_lens_shape}; +} + +std::vector SetMaskValueInferDtype(const paddle::DataType& input_data_dtype, + const paddle::DataType& stop_flags_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& alibi_slopes_dtype, + const paddle::DataType& tgt_pos_dtype) { + return {seq_lens_dtype}; +} + +PD_BUILD_OP(set_alibi_mask_value) + .Inputs({"input_data", "stop_flags", "seq_lens", "alibi_slopes", "tgt_pos"}) + .Outputs({"sequence_lengths"}) + .SetKernelFn(PD_KERNEL(SetMaskValue)) + .SetInferShapeFn(PD_INFER_SHAPE(SetMaskValueInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SetMaskValueInferDtype)); \ No newline at end of file diff --git a/csrc/generation/set_mask_value.cu b/csrc/generation/set_mask_value.cu new file mode 100644 index 0000000000000000000000000000000000000000..bcd63a277de7b84463b105cf5ebae2bdb5128bd0 --- /dev/null +++ b/csrc/generation/set_mask_value.cu @@ -0,0 +1,123 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
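+// set_mask_value: for each sequence i, writes 1.0 at offset i * length +
+// seq_lens[i] of the attention-mask tensor (marking the newly generated position)
+// and returns a copy of seq_lens in which entries whose stop_flags are set have
+// been zeroed.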
+ +#include "paddle/extension.h" + +template +class PDTraits; + +template <> +class PDTraits { +public: + typedef float DataType; + typedef float data_t; +}; + +template <> +class PDTraits { +public: + typedef half DataType; + typedef paddle::float16 data_t; +}; + +template <> +class PDTraits { +public: + typedef __nv_bfloat16 DataType; + typedef paddle::bfloat16 data_t; +}; + +template +__global__ void set_value_by_id(const int *seq_lens, const bool *stop_flags, T *output_data, int *sequence_lengths, int bs, int length) { + int tid = threadIdx.x; + if (tid < bs) { + T *output_data_now = output_data + tid * length; + output_data_now[seq_lens[tid]] = 1.0; + if (stop_flags[tid]) { + sequence_lengths[tid] = 0; + } + } +} + +template +std::vector set_mask_value(const paddle::Tensor& input_data, const paddle::Tensor& stop_flags, const paddle::Tensor& seq_lens) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + PD_CHECK(seq_lens.dtype() == paddle::DataType::INT32); + PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL); + auto cu_stream = input_data.stream(); + std::vector input_data_shape = input_data.shape(); + std::vector seq_lens_shape = seq_lens.shape(); + auto sequence_lengths = seq_lens.copy_to(seq_lens.place(), false); + + int input_bs = input_data_shape[0]; + int length = input_data_shape[3]; + int seq_bs = seq_lens_shape[0]; + + int block_size = (input_bs + 32 - 1) / 32 * 32; + set_value_by_id<<<1, block_size, 0, cu_stream>>>(seq_lens.data(), + stop_flags.data(), + reinterpret_cast(const_cast(input_data.data())), + sequence_lengths.data(), seq_bs, length); + return {sequence_lengths}; +} + +std::vector SetMaskValue(const paddle::Tensor& input_data, const paddle::Tensor& stop_flags, const paddle::Tensor& seq_lens) { + switch (input_data.type()) { + case paddle::DataType::BFLOAT16: { + return set_mask_value( + input_data, + stop_flags, + seq_lens + ); + } + case paddle::DataType::FLOAT16: { + return set_mask_value( + input_data, + stop_flags, + seq_lens + ); + } + case paddle::DataType::FLOAT32: { + return set_mask_value( + input_data, + stop_flags, + seq_lens + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. "); + break; + } + } +} + +std::vector> SetMaskValueInferShape(const std::vector& input_data_shape, const std::vector& stop_flags_shape, const std::vector& seq_lens_shape) { + return {seq_lens_shape}; +} + +std::vector SetMaskValueInferDtype(const paddle::DataType& input_data_dtype, const paddle::DataType& stop_flags_dtype, const paddle::DataType& seq_lens_dtype) { + return {seq_lens_dtype}; +} + +PD_BUILD_OP(set_mask_value) + .Inputs({"input_data", "stop_flags", "seq_lens"}) + .Outputs({"sequence_lengths"}) + .SetKernelFn(PD_KERNEL(SetMaskValue)) + .SetInferShapeFn(PD_INFER_SHAPE(SetMaskValueInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SetMaskValueInferDtype)); diff --git a/csrc/generation/set_value_by_flags.cu b/csrc/generation/set_value_by_flags.cu new file mode 100644 index 0000000000000000000000000000000000000000..03900783972c20e85f9f76b92b6141e9b2a79391 --- /dev/null +++ b/csrc/generation/set_value_by_flags.cu @@ -0,0 +1,56 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/extension.h" + +__global__ void set_value_by_flag_and_id(const bool *stop_flags, int64_t *pre_ids_all, const int64_t *pre_ids, const int64_t *step_idx, int bs, int length) { + int tid = threadIdx.x; + if (tid < bs && !stop_flags[tid]) { + int64_t *pre_ids_all_now = pre_ids_all + tid * length; + if (step_idx[tid] >= 0) { + pre_ids_all_now[step_idx[tid]] = pre_ids[tid]; + } + } +} + +std::vector SetValueByFlagsAndIdx(const paddle::Tensor& pre_ids_all, const paddle::Tensor& pre_ids_now, const paddle::Tensor& step_idx, const paddle::Tensor& stop_flags) { + auto cu_stream = stop_flags.stream(); + std::vector pre_ids_all_shape = pre_ids_all.shape(); + auto stop_flags_out = stop_flags.copy_to(stop_flags.place(), false); // gpu -> gpu + + int bs = stop_flags.shape()[0]; + int length = pre_ids_all_shape[1]; + int block_size = (bs + 32 - 1) / 32 * 32; + set_value_by_flag_and_id<<<1, block_size, 0, cu_stream>>>(stop_flags.data(), const_cast(pre_ids_all.data()), pre_ids_now.data(), step_idx.data(), bs, length); + return {stop_flags_out}; +} + +std::vector> SetValueByFlagsAndIdxInferShape(const std::vector& pre_ids_all_shape, const std::vector& pre_ids_now_shape, + const std::vector& step_idx_shape, const std::vector& stop_flags_shape) { + return {stop_flags_shape}; +} + +std::vector SetValueByFlagsAndIdxInferDtype(const paddle::DataType& pre_ids_all_dtype, + const paddle::DataType& pre_ids_now_dtype, + const paddle::DataType& step_idx_dtype, + const paddle::DataType& stop_flags_dtype) { + return {stop_flags_dtype}; +} + +PD_BUILD_OP(set_value_by_flags_and_idx) + .Inputs({"pre_ids_all", "pre_ids_now", "step_idx", "stop_flags"}) + .Outputs({"stop_flags_out"}) + .SetKernelFn(PD_KERNEL(SetValueByFlagsAndIdx)) + .SetInferShapeFn(PD_INFER_SHAPE(SetValueByFlagsAndIdxInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SetValueByFlagsAndIdxInferDtype)); diff --git a/csrc/generation/stop_generation_multi_ends.cu b/csrc/generation/stop_generation_multi_ends.cu new file mode 100644 index 0000000000000000000000000000000000000000..7be2c6cf3cd1835cb8293d5bba2be22ceac6475b --- /dev/null +++ b/csrc/generation/stop_generation_multi_ends.cu @@ -0,0 +1,126 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
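+// stop_generation_multi_ends: post-processes sampled token ids. Sequences flagged
+// as stopped (and, in modes 0/1, sequences flagged in a memory-mapped early-stop
+// file) have their id replaced with end_ids[0]; any sequence whose resulting id is
+// in end_ids gets its stop flag set. mode 0 combines both flag sources, mode 1
+// uses only the file flags, mode 2 uses only the existing stop_flags.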
+ +#include "paddle/extension.h" +#include +#include +#include +#include +#include +#include +#include +#include + +void set_flags_multi_ends(char *str_flags, bool *res, int length) { + for (int i = 0; i < length; i++) { + if (str_flags[i] == '0') { + res[i] = false; + } else { + res[i] = true; + } + } +} + +__device__ bool is_in_end(const int64_t id, const int64_t *end_ids, int length) { + bool flag = false; + for (int i = 0; i < length; i++) { + if (id == end_ids[i]) { + return true; + } + } + return flag; +} + +__global__ void set_value_by_flags(const bool *stop_flags, const int64_t *end_ids, int64_t *topk_ids, bool *stop_flags_out, const int bs, int end_length) { + int tid = threadIdx.x; + if (tid < bs) { + topk_ids[tid] = stop_flags[tid] ? end_ids[0] : topk_ids[tid]; + __syncthreads(); + if (is_in_end(topk_ids[tid], end_ids, end_length)) { + stop_flags_out[tid] = true; + } + } +} + +__global__ void set_value_by_flags_both(const bool *flags, const bool *stop_flags, const int64_t *end_ids, int64_t *topk_ids, bool *stop_flags_out, const int bs, int end_length) { + int tid = threadIdx.x; + if (tid < bs) { + topk_ids[tid] = flags[tid] || stop_flags[tid] ? end_ids[0] : topk_ids[tid]; + __syncthreads(); + if (is_in_end(topk_ids[tid], end_ids, end_length)) { + stop_flags_out[tid] = true; + } + } +} + +std::vector GetStopFlagsMulti(const paddle::Tensor& topk_ids, const paddle::Tensor& stop_flags, const paddle::Tensor& end_ids, int64_t mode) { + // mode = 0, stop_generation and stop_flags + // mode = 1, just stop_generation + // mode = 2, just stop_flags + PD_CHECK(mode <= 2); + PD_CHECK(topk_ids.dtype() == paddle::DataType::INT64); + PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL); + + auto cu_stream = topk_ids.stream(); + std::vector shape = topk_ids.shape(); + int64_t bs_now = shape[0]; + int64_t end_length = end_ids.shape()[0]; + auto topk_ids_out = topk_ids.copy_to(topk_ids.place(), false); // gpu -> gpu + auto stop_flags_out = stop_flags.copy_to(stop_flags.place(), false); // gpu -> gpu + if (mode == 0 || mode == 1) { + constexpr char *path = "/root/paddlejob/workspace/env_run/lzy/ERNIE_ALL/early_stop/ERNIE3.0-fused-fp16/ops/test"; + auto flags = paddle::full({bs_now, 1}, 1, paddle::DataType::BOOL, paddle::CPUPlace()); + int fd = -1; + int ret = -1; + void *addr = nullptr; + fd = open(path, O_RDWR); + if(-1 == fd){ + perror("open error"); + } + addr = mmap(NULL, bs_now, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if(addr == MAP_FAILED){ + perror("mmap error"); + } + close(fd); + set_flags_multi_ends((char *)addr, flags.data(), bs_now); + munmap(addr, bs_now); + auto flags_gpu = flags.copy_to(topk_ids.place(), false); // cpu -> gpu + int block_size = (bs_now + 32 - 1) / 32 * 32; + if (mode == 0) { + set_value_by_flags_both<<<1, block_size, 0, cu_stream>>>(flags_gpu.data(), stop_flags.data(), end_ids.data(), topk_ids_out.data(), stop_flags_out.data(), bs_now, end_length); + } else { + set_value_by_flags<<<1, block_size, 0, cu_stream>>>(flags_gpu.data(), end_ids.data(), topk_ids_out.data(), stop_flags_out.data(), bs_now, end_length); + } + } else if (mode == 2) { + int block_size = (bs_now + 32 - 1) / 32 * 32; + set_value_by_flags<<<1, block_size, 0, cu_stream>>>(stop_flags.data(), end_ids.data(), topk_ids_out.data(), stop_flags_out.data(), bs_now, end_length); + } + return {topk_ids_out, stop_flags_out}; +} + +std::vector> GetStopFlagsMultiInferShape(const std::vector& topk_ids_shape, const std::vector& stop_flags_shape, const std::vector& end_ids_shape) { + return {topk_ids_shape, 
stop_flags_shape}; +} + +std::vector GetStopFlagsMultiInferDtype(const paddle::DataType& topk_ids_dtype, const paddle::DataType& stop_flags_dtype, const paddle::DataType& end_ids_dtype) { + return {topk_ids_dtype, stop_flags_dtype}; +} + +PD_BUILD_OP(set_stop_value_multi_ends) + .Inputs({"topk_ids", "stop_flags", "end_ids"}) + .Outputs({"topk_ids_out", "stop_flags_out"}) + .Attrs({"mode: int64_t"}) + .SetKernelFn(PD_KERNEL(GetStopFlagsMulti)) + .SetInferShapeFn(PD_INFER_SHAPE(GetStopFlagsMultiInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(GetStopFlagsMultiInferDtype)); \ No newline at end of file diff --git a/csrc/generation/token_penalty_multi_scores.cu b/csrc/generation/token_penalty_multi_scores.cu new file mode 100644 index 0000000000000000000000000000000000000000..3ef010501921ca7a5191a5a7cd84a57ec6d5aa4b --- /dev/null +++ b/csrc/generation/token_penalty_multi_scores.cu @@ -0,0 +1,231 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + + +template +__global__ inline void min_length_logits_process(T* logits, + const int64_t *cur_len, + const int64_t *min_len, + const int64_t *eos_token_id, + const int64_t bs, + const int64_t length, + const int64_t end_length) { + int bi = threadIdx.x; + if (bi >= bs) return; + if (cur_len[bi] < 0) { + return; + } + if (cur_len[bi] < min_len[bi]) { + for (int i=0; i < end_length; i++) { + logits[bi * length + eos_token_id[i]] = -1e10; + } + } +} + +template<> +__global__ inline void min_length_logits_process(half* logits, + const int64_t *cur_len, + const int64_t *min_len, + const int64_t *eos_token_id, + const int64_t bs, + const int64_t length, + const int64_t end_length) { + int bi = threadIdx.x; + if (bi >= bs) return; + if (cur_len[bi] < 0) { + return; + } + if (cur_len[bi] < min_len[bi]) { + for (int i=0; i < end_length; i++) { + logits[bi * length + eos_token_id[i]] = -1e4; + } + } +} + + +__global__ void update_repeat_times(const int64_t *pre_ids, + const int64_t *cur_len, + int *repeat_times, + const int64_t bs, + const int64_t length, + const int64_t length_id) { + int bi = blockIdx.x; + if (cur_len[bi] < 0) { + return; + } + int tid = threadIdx.x; + const int64_t *pre_ids_now = pre_ids + bi * length_id; + int *repeat_times_now = repeat_times + bi * length; + for (int i = tid; i < length_id; i += blockDim.x) { + int64_t id = pre_ids_now[i]; + if (id < 0) break; + atomicAdd(&repeat_times_now[id], 1); + } +} + +template +__global__ void update_value_by_repeat_times(const int *repeat_times, + const T *penalty_scores, + const T *frequency_score, + const T *presence_score, + T *logits, + const int64_t bs, + const int64_t length) { + int bi = blockIdx.x; + int tid = threadIdx.x; + T *logits_now = logits + bi * length; + const int *repeat_times_now = repeat_times + bi * length; + float alpha = static_cast(penalty_scores[bi]); + float beta = static_cast(frequency_score[bi]); + float gamma = static_cast(presence_score[bi]); + for (int i = tid; i < length; 
i += blockDim.x) { + int times = repeat_times_now[i]; + if (times == 0) continue; + float logit_now = static_cast(logits_now[i]); + logit_now = logit_now < 0 ? logit_now * alpha : logit_now / alpha; + logits_now[i] = static_cast(logit_now - times * beta - gamma); + } +} + +template +std::vector token_penalty_multi_scores_kernel(const paddle::Tensor& pre_ids, + const paddle::Tensor& logits, + const paddle::Tensor& penalty_scores, + const paddle::Tensor& frequency_score, + const paddle::Tensor& presence_score, + const paddle::Tensor& cur_len, + const paddle::Tensor& min_len, + const paddle::Tensor& eos_token_id) { + + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + auto cu_stream = logits.stream(); + std::vector shape = logits.shape(); + auto repeat_times = paddle::full(shape, 0, paddle::DataType::INT32, pre_ids.place()); + int64_t bs = shape[0]; + int64_t length = shape[1]; + int64_t length_id = pre_ids.shape()[1]; + auto logits_out = logits.copy_to(logits.place(), false); // gpu -> gpu + + int64_t end_length = eos_token_id.shape()[0]; + + const int block_size = (bs + 32 - 1) / 32 * 32; + min_length_logits_process<<<1, block_size, 0, cu_stream>>>( + reinterpret_cast(const_cast(logits_out.data())), + cur_len.data(), + min_len.data(), + eos_token_id.data(), + bs, length, end_length); + + int block_size_1 = (length_id + 32 - 1) / 32 * 32; + block_size_1 = min(block_size_1, 512); + update_repeat_times<<>>(pre_ids.data(), cur_len.data(), repeat_times.data(), bs, length, length_id); + int block_size_2 = (length + 32 - 1) / 32 * 32; + block_size_2 = min(block_size_2, 512); + update_value_by_repeat_times<<>>( + repeat_times.data(), + reinterpret_cast(const_cast(penalty_scores.data())), + reinterpret_cast(const_cast(frequency_score.data())), + reinterpret_cast(const_cast(presence_score.data())), + reinterpret_cast(const_cast(logits_out.data())), + bs, length); + return {logits_out}; +} + +std::vector TokenPenaltyMultiScores(const paddle::Tensor& pre_ids, + const paddle::Tensor& logits, + const paddle::Tensor& penalty_scores, + const paddle::Tensor& frequency_scores, + const paddle::Tensor& presence_scores, + const paddle::Tensor& cur_len, + const paddle::Tensor& min_len, + const paddle::Tensor& eos_token_id) { + + switch (logits.type()) { + case paddle::DataType::BFLOAT16: { + return token_penalty_multi_scores_kernel( + pre_ids, + logits, + penalty_scores, + frequency_scores, + presence_scores, + cur_len, + min_len, + eos_token_id + ); + } + case paddle::DataType::FLOAT16: { + return token_penalty_multi_scores_kernel( + pre_ids, + logits, + penalty_scores, + frequency_scores, + presence_scores, + cur_len, + min_len, + eos_token_id + ); + } + case paddle::DataType::FLOAT32: { + return token_penalty_multi_scores_kernel( + pre_ids, + logits, + penalty_scores, + frequency_scores, + presence_scores, + cur_len, + min_len, + eos_token_id + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> TokenPenaltyMultiScoresInferShape(const std::vector& pre_ids_shape, + const std::vector& logits_shape, + const std::vector& penalty_scores_shape, + const std::vector& frequency_scores_shape, + const std::vector& presence_scores_shape, + const std::vector& cur_len_shape, + const std::vector& min_len_shape, + const std::vector& eos_token_id_shape) { + return {logits_shape}; +} + +std::vector TokenPenaltyMultiScoresInferDtype(const paddle::DataType& pre_ids_dtype, + const paddle::DataType& logits_dtype, + const paddle::DataType& penalty_scores_dtype, + const paddle::DataType& frequency_scores_dtype, + const paddle::DataType& presence_scores_dtype, + const paddle::DataType& cur_len_dtype, + const paddle::DataType& min_len_dtype, + const paddle::DataType& eos_token_id_dtype) { + return {logits_dtype}; +} + +PD_BUILD_OP(get_token_penalty_multi_scores) + .Inputs({"pre_ids", "logits", "penalty_scores", "frequency_scores", "presence_scores", "cur_len", "min_len", "eos_token_id"}) + .Outputs({"logits_out"}) + .SetKernelFn(PD_KERNEL(TokenPenaltyMultiScores)) + .SetInferShapeFn(PD_INFER_SHAPE(TokenPenaltyMultiScoresInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(TokenPenaltyMultiScoresInferDtype)); diff --git a/csrc/generation/top_p_sampling.cu b/csrc/generation/top_p_sampling.cu new file mode 100644 index 0000000000000000000000000000000000000000..ae847d5febc31045a05f55361ed00f51dbd4737b --- /dev/null +++ b/csrc/generation/top_p_sampling.cu @@ -0,0 +1,678 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +#define CHECK_INPUT(x) PD_CHECK(x.is_gpu(), #x " must be a GPU Tensor.") + +#define FINAL_MASK 0xFFFFFFFF + +#define FIXED_BLOCK_DIM_BASE(dim, ...) \ + case (dim): { \ + constexpr auto kBlockDim = (dim); \ + __VA_ARGS__; \ + } break + + +#define FIXED_BLOCK_DIM(...) 
\ + FIXED_BLOCK_DIM_BASE(1024, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(512, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(256, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(128, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(64, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(32, ##__VA_ARGS__) + +struct SegmentOffsetIter { + explicit SegmentOffsetIter(int num_cols) : num_cols_(num_cols) {} + + __host__ __device__ __forceinline__ int operator()(int idx) const { + return idx * num_cols_; + } + + int num_cols_; +}; + +template +struct Pair { + __device__ __forceinline__ Pair() {} + __device__ __forceinline__ Pair(T value, int id) : v(value), id(id) {} + + __device__ __forceinline__ void set(T value, int id) { + v = value; + id = id; + } + + __device__ __forceinline__ void operator=(const Pair& in) { + v = in.v; + id = in.id; + } + + __device__ __forceinline__ bool operator<(const T value) const { + return ((float)v < (float)value); + } + + __device__ __forceinline__ bool operator>(const T value) const { + return ((float)v > (float)value); + } + __device__ __forceinline__ bool operator<(const Pair& in) const { + return ((float)v < (float)in.v) || (((float)v == (float)in.v) && (id > in.id)); + } + + __device__ __forceinline__ bool operator>(const Pair& in) const { + return ((float)v > (float)in.v) || (((float)v == (float)in.v) && (id < in.id)); + } + + T v; + int id; +}; + +inline int div_up(int a, int n) +{ + return (a + n - 1) / n; +} + +__global__ void setup_kernel(curandState_t *state, const uint64_t seed, const int bs) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + for (int i = idx; i < bs; i += gridDim.x * blockDim.x) { + curand_init(seed, 0, i, &state[i]); + } +} + +template +__device__ __forceinline__ void AddTo(Pair topk[], + const Pair& p, + int beam_size) { + for (int k = beam_size - 2; k >= 0; k--) { + if (topk[k] < p) { + topk[k + 1] = topk[k]; + } else { + topk[k + 1] = p; + return; + } + } + topk[0] = p; +} + +template +__device__ __forceinline__ void GetTopK(Pair topk[], + const T* src, + int idx, + int dim, + int beam_size) { + while (idx < dim) { + if (topk[beam_size - 1] < src[idx]) { + Pair tmp(src[idx], idx); + AddTo(topk, tmp, beam_size); + } + idx += BlockSize; + } +} + +template +__device__ __forceinline__ void GetTopK(Pair topk[], + const T* src, + int idx, + int dim, + const Pair& max, + int beam_size) { + while (idx < dim) { + if (topk[beam_size - 1] < src[idx]) { + Pair tmp(src[idx], idx); + if (tmp < max) { + AddTo(topk, tmp, beam_size); + } + } + idx += BlockSize; + } +} + +template +__device__ __forceinline__ void ThreadGetTopK(Pair topk[], + int* beam, + int beam_size, + const T* src, + bool* firstStep, + bool* is_empty, + Pair* max, + int dim, + const int tid) { + if (*beam > 0) { + int length = (*beam) < beam_size ? 
*beam : beam_size; + if (*firstStep) { + *firstStep = false; + GetTopK(topk, src, tid, dim, length); + } else { + for (int k = 0; k < MaxLength; k++) { + if (k < MaxLength - (*beam)) { + topk[k] = topk[k + *beam]; + } else { + topk[k].set(std::numeric_limits::min(), -1); + } + } + if (!(*is_empty)) { + GetTopK( + topk + MaxLength - *beam, src, tid, dim, *max, length); + } + } + + *max = topk[MaxLength - 1]; + if ((*max).id == -1) *is_empty = true; + *beam = 0; + } +} + +template +__forceinline__ __device__ Pair WarpReduce(Pair input) { +#pragma unroll + for (int offset = 16; offset > 0; offset >>= 1) { + T tmp_val = __shfl_down_sync(FINAL_MASK, input.v, static_cast(offset), 32); + int tmp_id = __shfl_down_sync(FINAL_MASK, input.id, static_cast(offset), 32); + if ((float)input.v < (float)tmp_val) { + input.v = tmp_val; + input.id = tmp_id; + } + } + return input; +} + +template +__device__ __forceinline__ void BlockReduce(Pair shared_max[], + Pair topk[], + Pair beam_max[], + int* beam, + int* k, + int *count, + const int tid, + const int wid, + const int lane) { + while (true) { + __syncthreads(); + Pair input_now = topk[0]; + input_now = WarpReduce(input_now); + + if (lane == 0) { + shared_max[wid] = input_now; + } + __syncthreads(); + input_now = (tid < BlockSize / 32) + ? shared_max[lane] + : Pair(std::numeric_limits::min(), -1); + if (wid == 0) { + input_now = WarpReduce(input_now); + if (lane == 0) shared_max[0] = input_now; + } + __syncthreads(); + if (tid == 0) { + beam_max[*count] = shared_max[0]; + (*count)++; + } + int tid_max = shared_max[0].id % BlockSize; + if (tid == tid_max) { + (*beam)++; + } + if (--(*k) == 0) break; + __syncthreads(); + + if (tid == tid_max) { + if (*beam < MaxLength) { + topk[0] = topk[*beam]; + } + } + + if (MaxLength < 5) { + if (*beam >= MaxLength) break; + } else { + unsigned mask = 0u; + mask = __ballot_sync(FINAL_MASK, true); + if (tid_max / 32 == wid) { + if (__shfl_down_sync(FINAL_MASK, *beam, tid_max % 32, 32) == + MaxLength) + break; + } + } + } +} + +template +__global__ void KeMatrixTopPBeamTopK(const T* src, + T* top_ps, + int64_t* out_id, // topk id + T* out_val, // topk val + int vocab_size, + curandState_t* state, + int* count_iter, + int* count_iter_begin) { + const int tid = threadIdx.x; + const int wid = tid / 32; + const int lane = tid % 32; + const int bid = blockIdx.x; + + int top_num = TopPBeamTopK; + float top_p_num = static_cast(top_ps[bid]); + + __shared__ Pair shared_max[BlockSize / 32]; + __shared__ Pair beam_max[TopPBeamTopK]; + + Pair topk[MaxLength]; + int beam = MaxLength; + Pair max; + bool is_empty = false; + bool firststep = true; + __shared__ int count; + + if (tid == 0) { + count = 0; + } + + for (int j = 0; j < MaxLength; j++) { + topk[j].set(std::numeric_limits::min(), -1); + } + + while (top_num) { + ThreadGetTopK(topk, + &beam, + TopPBeamTopK, + src + bid * vocab_size, + &firststep, + &is_empty, + &max, + vocab_size, + tid); + BlockReduce( + shared_max, topk, beam_max, &beam, &top_num, &count, tid, wid, lane); + } + if (tid == 0) { + count_iter_begin[bid] = count_iter[bid]; + float rand_top_p = curand_uniform(state + bid) * top_p_num; + top_ps[bid] = (T)rand_top_p; + float sum_prob = 0.0f; + + for(int i = 0; i < TopPBeamTopK; i++) { + float val = static_cast(beam_max[i].v); + sum_prob += val; +#ifdef DEBUG_TOPP + printf("bi: %d, top_p: %f, rand_top_p: %f, sum_prob: %f\n", bid, top_p_num, rand_top_p, sum_prob); +#endif + if(sum_prob >= rand_top_p) { + count_iter_begin[bid] += 1; + out_id[bid] = 
static_cast(beam_max[i].id); + out_val[bid] = beam_max[i].v; + break; + } + } + } +} + +__global__ void SetCountIter(int *count_iter, int num) { + int tid = threadIdx.x; + int bid = blockIdx.x; + int idx = bid * blockDim.x + tid; + for (int i = idx; i < num; i += gridDim.x * blockDim.x) { + count_iter[i] = i; + } +} + +template +__global__ void FillIndex(T* indices, T num_rows, T num_cols) { + int col_id = threadIdx.x; + int row_id = blockIdx.x; + + for (T j = row_id; j < num_rows; j += gridDim.x) { + for (T i = col_id; i < num_cols; i += blockDim.x) { + indices[j * num_cols + i] = i; + } + } +} + +struct BlockPrefixCallbackOp { + // Running prefix + float running_total; + // Constructor + __device__ BlockPrefixCallbackOp(float running_total): running_total(running_total) {} + // Callback operator to be entered by the first warp of threads in the block. + // Thread-0 is responsible for returning a value for seeding the block-wide scan. + __device__ float operator()(float block_aggregate) + { + float old_prefix = running_total; + running_total += block_aggregate; + return old_prefix; + } +}; + +template +__global__ void topp_sampling(T* sorted_probs, + int64_t* sorted_id, + T* out_val, + int64_t* out_id, + const T* top_ps, + const T* threshold, + const uint64_t seed, + const int p_num, + const int vocab_size, + int* count_iter, + int* count_iter_begin) { + __shared__ int stop_shared; + __shared__ float rand_p; + const int tid = threadIdx.x; + const int bid = blockIdx.x; + constexpr int NUM_WARPS = BLOCK_SIZE / 32; + constexpr int WARP_SIZE = 32; + const int lane_id = tid % 32; + const int warp_id = tid / 32; + const float p_t = static_cast(top_ps[bid]); + const float threshold_now = threshold ? static_cast(threshold[bid]) : 0.f; + if (tid == 0) { + stop_shared = 0; + rand_p = p_t; +#ifdef DEBUG_TOPP + printf("bi: %d, p: %f\n", bid, rand_p); +#endif + } + if (count_iter_begin[bid] == count_iter[bid + 1]) { + // topk + return; + } + + typedef cub::BlockScan BlockScan; + typedef cub::BlockReduce BlockReduce; + __shared__ typename BlockScan::TempStorage temp_storage; + __shared__ typename BlockReduce::TempStorage temp_storage_reduce; + __shared__ uint32_t selected_shared[NUM_WARPS]; + int threshold_id = 0; + + // Initialize running total + BlockPrefixCallbackOp prefix_op(0); + + if (lane_id == 0) { + selected_shared[warp_id] = 0; + } + __syncthreads(); + + int offset = bid * vocab_size; +#ifdef DEBUG_TOPP + if(tid == 0) { + printf("first_elem1_1: %f, first_elem1_2: %f, first_id1_1: %d, first_id1_2: %d\n", (float)sorted_probs[offset], (float)sorted_probs[offset+1], (int)sorted_id[offset], (int)sorted_id[offset+1]); + } +#endif + int end = ((vocab_size + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE; + int i_activate = 0; + float thread_offset = 0; + for (int i = tid; i < end; i += BLOCK_SIZE) { + float thread_count = + (i < vocab_size) ? 
static_cast(sorted_probs[offset + i]) : 0.f; + if (i < vocab_size && thread_count >= threshold_now) { + threshold_id = i; + } + BlockScan(temp_storage) + .InclusiveSum(thread_count, thread_offset, prefix_op); + + uint32_t activate_mask = __ballot_sync(FINAL_MASK, rand_p <= thread_offset); + + i_activate = i; + if (activate_mask != 0) { + if (lane_id == 0) { + atomicAdd(&stop_shared, 1); + selected_shared[warp_id] = activate_mask; + } + } + __syncthreads(); + if (stop_shared > 0) { + break; + } + } + __syncthreads(); + if (stop_shared == 0) { + if (tid == 0) { + out_id[bid] = sorted_id[offset]; + out_val[bid] = sorted_probs[offset]; +#ifdef DEBUG_TOPP + printf("stop_shared: %d, out_id: %d, out_val: %f\n", (int)stop_shared, (int)out_id[bid], (float)out_val[bid]); +#endif + } + return; + } +#ifdef DEBUG_TOPP + if(tid == 0) { + printf("first_elem2_1: %f, first_elem2_2: %f, first_id2_1: %d, first_id2_2: %d\n", (float)sorted_probs[offset], (float)sorted_probs[offset+1], (int)sorted_id[offset], (int)sorted_id[offset+1]); + } +#endif + bool skip = (selected_shared[warp_id] > 0) ? false : true; + for (int i = 0; i < warp_id; i++) { + if (selected_shared[i] != 0) { + // If the previous has stopped, skip the current warp + skip = true; + } + } + if (!skip) { + int active_lane_id = + WARP_SIZE - __popc(selected_shared[warp_id]); // first not 0 + if (lane_id == active_lane_id) { + float val = static_cast(sorted_probs[offset + i_activate]); +#ifdef DEBUG_TOPP + printf("active_lane_id: %d, i_activate: %d.\n", active_lane_id, i_activate); + for (int i=0; i < active_lane_id; i++) { + printf("p %d, value: %f\n", i, (float)(sorted_probs[offset + i])); + } +#endif + if (val < threshold_now) { + // don't sample low score token + int max_id = BlockReduce(temp_storage_reduce).Reduce(threshold_id, MaxOp()); + curandStatePhilox4_32_10_t rng; + curand_init(seed, tid, 0, &rng); + int random_id = curand(&rng) % (max_id + 1); + out_id[bid] = sorted_id[offset + random_id]; + out_val[bid] = sorted_probs[offset + random_id]; + } else { + out_id[bid] = sorted_id[offset + i_activate]; + out_val[bid] = sorted_probs[offset + i_activate]; + } + } + } +} + +int GetBlockSize(int vocab_size) { + if (vocab_size > 512) { + return 1024; + } else if (vocab_size > 256) { + return 512; + } else if (vocab_size > 128) { + return 256; + } else if (vocab_size > 64) { + return 128; + } else { + return 64; + } +} + +template +std::vector top_p_sampling_kernel(const paddle::Tensor& x, const paddle::Tensor& top_ps, int random_seed) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + std::vector shape = x.shape(); + auto cu_stream = x.stream(); + + int bs = shape[0]; + int p_num = top_ps.numel(); + PD_CHECK(bs == p_num, "PD_CHECK returns ", false, ", expected bs == p_num."); + int vocab_size = shape[1]; + auto topp_ids = paddle::full({bs, 1}, 1, paddle::DataType::INT64, x.place()); + auto topp_probs = paddle::full({bs, 1}, 1, x.dtype(), x.place()); + auto top_ps_tmp = top_ps.copy_to(top_ps.place(), false); + auto inds_input = paddle::full({bs, vocab_size}, 1, paddle::DataType::INT64, x.place()); + auto sorted_out = paddle::full({bs, vocab_size}, 1, x.dtype(), x.place()); + auto sorted_id = paddle::full({bs, vocab_size}, 1, paddle::DataType::INT64, x.place()); + + + int BlockSize = GetBlockSize(vocab_size); + switch (BlockSize) { + FIXED_BLOCK_DIM(FillIndex<<>>(inds_input.data(), bs, vocab_size)); + default: + PD_THROW("the input data shape has error in the FillIndex 
kernel."); + } + + + static int count = 0; + static curandState_t* dev_curand_states; + if (count == 0) { +#if CUDA_VERSION >= 11020 + cudaMallocAsync(&dev_curand_states, bs * sizeof(curandState_t), cu_stream); +#else + cudaMalloc(&dev_curand_states, bs * sizeof(curandState_t)); +#endif + } + uint64_t seed = 0; + if (random_seed == -1) { + srand((unsigned int)(time(NULL))); + seed = rand(); + } else { + seed = random_seed; + } + setup_kernel<<<1, 256, 0, cu_stream>>>(dev_curand_states, seed, bs); + + auto count_iter = paddle::empty({bs + 1}, paddle::DataType::INT32, x.place()); + auto count_iter_begin = paddle::empty({bs}, paddle::DataType::INT32, x.place()); + SetCountIter<<<1, 256, 0, cu_stream>>>(count_iter.data(), bs + 1); + + constexpr int TopKMaxLength = 2; + constexpr int TopPBeamTopK = 5; + switch (BlockSize) { + FIXED_BLOCK_DIM( + KeMatrixTopPBeamTopK<<>>( + reinterpret_cast(const_cast(x.data())), + reinterpret_cast(const_cast(top_ps_tmp.data())), + topp_ids.data(), + reinterpret_cast(topp_probs.data()), + vocab_size, + dev_curand_states, + count_iter.data(), + count_iter_begin.data())); + default: + PD_THROW("the input data shape has error in the topp_beam_topk kernel."); + } + count++; + + size_t temp_storage_bytes = 0; + + cub::TransformInputIterator + segment_offsets_t_begin(count_iter_begin.data(), + SegmentOffsetIter(vocab_size)); + + cub::TransformInputIterator + segment_offsets_t_end(count_iter.data(), + SegmentOffsetIter(vocab_size)); + + DataType_ *x_ptr = reinterpret_cast(const_cast(x.data())); + DataType_ *sorted_out_ptr = reinterpret_cast(const_cast(sorted_out.data())); + int64_t *in_id_ptr = inds_input.data(); + int64_t *out_id_ptr = sorted_id.data(); + + cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr, + temp_storage_bytes, + x_ptr, + sorted_out_ptr, + in_id_ptr, + out_id_ptr, + vocab_size * bs, + bs, + segment_offsets_t_begin, + segment_offsets_t_end + 1, + 0, + sizeof(data_t) * 8, + cu_stream); + + temp_storage_bytes = div_up(temp_storage_bytes, 256) * 256; + int64_t temp_size = temp_storage_bytes; + auto temp_storage = paddle::empty({temp_size}, paddle::DataType::UINT8, x.place()); + + cub::DeviceSegmentedRadixSort::SortPairsDescending( + temp_storage.data(), + temp_storage_bytes, + x_ptr, + sorted_out_ptr, + in_id_ptr, + out_id_ptr, + vocab_size * bs, + bs, + segment_offsets_t_begin, + segment_offsets_t_end + 1, + 0, + sizeof(data_t) * 8, + cu_stream); + + switch (BlockSize) { + FIXED_BLOCK_DIM( + topp_sampling<<>>( + sorted_out_ptr, + out_id_ptr, + reinterpret_cast(topp_probs.data()), + topp_ids.data(), + reinterpret_cast(const_cast(top_ps_tmp.data())), + nullptr, + seed, + p_num, + vocab_size, + count_iter.data(), + count_iter_begin.data())); + default: + PD_THROW("the input data shape has error in the topp_sampling kernel."); + } + return {topp_probs, topp_ids}; +} + + +std::vector TopPSampling(const paddle::Tensor& x, const paddle::Tensor& top_ps, int random_seed) { + switch (x.type()) { + case paddle::DataType::FLOAT16: { + return top_p_sampling_kernel( + x, + top_ps, + random_seed + ); + } + case paddle::DataType::FLOAT32: { + return top_p_sampling_kernel( + x, + top_ps, + random_seed + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> TopPSamplingInferShape(const std::vector& x_shape, + const std::vector& top_ps_shape) { + std::vector out_probs_shape = {x_shape[0], 1}; + std::vector out_ids_shape = {x_shape[0], 1}; + return {out_probs_shape, out_ids_shape}; +} + +std::vector TopPSamplingInferDtype(const paddle::DataType& x_dtype, + const paddle::DataType& top_ps_dtype) { + return {x_dtype, paddle::DataType::INT64}; +} + +PD_BUILD_OP(top_p_sampling) + .Inputs({"x", "top_ps"}) + .Outputs({"topp_probs", "topp_ids"}) + .Attrs({"random_seed: int"}) + .SetKernelFn(PD_KERNEL(TopPSampling)) + .SetInferShapeFn(PD_INFER_SHAPE(TopPSamplingInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(TopPSamplingInferDtype)); \ No newline at end of file diff --git a/csrc/generation/transpose_removing_padding.cu b/csrc/generation/transpose_removing_padding.cu new file mode 100644 index 0000000000000000000000000000000000000000..5b6b16a7faa2dbe5885a2d0fc8d0514a17dfc80e --- /dev/null +++ b/csrc/generation/transpose_removing_padding.cu @@ -0,0 +1,177 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +template +__global__ void TransposeRemovingPadding(const T* input_data, + const int* seq_lens, + T* output_data, + const int batch_size, + const int num_head, + const int max_len_this_time, + const int seq_len, + const int head_dim, + const int token_num, + const int elem_cnt, + const int* padding_offset) { + // transpose and remove padding + // [batch_size, num_head, max_len_this_time, head_dim] -> [token_num, num_head, + // head_dim] + int64_t idx = blockDim.x * blockIdx.x + threadIdx.x; + const int dim_embed = num_head * head_dim; + using LoadT = AlignedVector; + LoadT src_vec; + + for (int32_t linear_index = idx * VecSize, + step = gridDim.x * blockDim.x * VecSize; + linear_index < elem_cnt; + linear_index += step) { + const int token_idx = linear_index / dim_embed; + const int ori_token_idx = + token_idx + (padding_offset == nullptr ? 
0 : padding_offset[token_idx]); + const int ori_batch_id = ori_token_idx / seq_len; + if (seq_lens && seq_lens[ori_batch_id] == 0) continue; + const int ori_seq_id = ori_token_idx % seq_len; + const int ori_head_id = (linear_index % dim_embed) / head_dim; + const int ori_head_lane = (linear_index % dim_embed) % head_dim; + const int ori_idx = ori_batch_id * num_head * max_len_this_time * head_dim + + ori_head_id * max_len_this_time * head_dim + + ori_seq_id * head_dim + ori_head_lane; + Load(&input_data[ori_idx], &src_vec); + Store(src_vec, &output_data[linear_index]); + } +} + +template +void InvokeTransposeRemovePadding(const T* input_data, + const int* seq_lens, + T* output_data, + const int batch_size, + const int num_head, + const int max_len_this_time, + const int seq_len, + const int head_dim, + const int token_num, + const int* padding_offset, + cudaStream_t cu_stream) { + // [batch_size, num_head, max_len_this_time, head_dim] -> [token_num, num_head, + // head_dim] + constexpr int VEC_16B = 16; + const int elem_cnt = token_num * num_head * head_dim; + constexpr int PackSize = VEC_16B / sizeof(T); + const int32_t pack_num = elem_cnt / PackSize; + const int32_t block_size = 128; + int32_t grid_size = (pack_num + block_size - 1) / block_size; + TransposeRemovingPadding + <<>>(input_data, + seq_lens, + output_data, + batch_size, + num_head, + max_len_this_time, + seq_len, + head_dim, + token_num, + elem_cnt, + padding_offset); +} + +template +std::vector apply_transpose_remove_padding(const paddle::Tensor& input, + const paddle::Tensor& seq_lens, + const paddle::Tensor& padding_offset) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto cu_stream = input.stream(); + std::vector input_shape = input.shape(); + const int bsz = input_shape[0]; + const int num_head = input_shape[1]; + const int seq_len = input_shape[2]; + const int dim_head = input_shape[3]; + const int token_num = padding_offset.shape()[0]; + + auto out = paddle::full({token_num, num_head * dim_head}, 0, input.dtype(), input.place()); + InvokeTransposeRemovePadding( + reinterpret_cast(const_cast(input.data())), + seq_lens.data(), + reinterpret_cast(out.data()), + bsz, + num_head, + seq_len, + seq_len, + dim_head, + token_num, + padding_offset.data(), + cu_stream + ); + return {out}; +} + +std::vector ApplyTransposeRemovingPadding(const paddle::Tensor& input, + const paddle::Tensor& seq_lens, + const paddle::Tensor& padding_offset) { + switch (input.type()) { + case paddle::DataType::BFLOAT16: { + return apply_transpose_remove_padding( + input, + seq_lens, + padding_offset + ); + } + case paddle::DataType::FLOAT16: { + return apply_transpose_remove_padding( + input, + seq_lens, + padding_offset + ); + } + case paddle::DataType::FLOAT32: { + return apply_transpose_remove_padding( + input, + seq_lens, + padding_offset + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> ApplyTransposeRemovingPaddingInferShape( + const std::vector& input_shape, + const std::vector& seq_lens_shape, + const std::vector& padding_offset_shape) { + return {{padding_offset_shape[0], input_shape[1] * input_shape[3]}}; +} + +std::vector ApplyTransposeRemovingPaddingInferDtype( + const paddle::DataType& input_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& padding_offset_dtype) { + return {input_dtype}; +} + +PD_BUILD_OP(transpose_remove_padding) + .Inputs({"input", "seq_lens", "padding_offset"}) + .Outputs({"fmha_out"}) + .SetKernelFn(PD_KERNEL(ApplyTransposeRemovingPadding)) + .SetInferShapeFn(PD_INFER_SHAPE(ApplyTransposeRemovingPaddingInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(ApplyTransposeRemovingPaddingInferDtype)); \ No newline at end of file diff --git a/csrc/generation/write_cache_kv.cu b/csrc/generation/write_cache_kv.cu new file mode 100644 index 0000000000000000000000000000000000000000..62ebf854b0e058cc445775f2f98f7fcf776d2a17 --- /dev/null +++ b/csrc/generation/write_cache_kv.cu @@ -0,0 +1,185 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + + +template +inline __device__ __host__ T div_up(T m, T n) { + return (m + n - 1) / n; +} + +template +__global__ void write_cache_k_kernel(T *cache_k, + const T *k, + const int *seq_lens, + const int num_head, + const int dim_head, + const int seq_len, + const int max_seq_len) { + const int bi = blockIdx.y; + const int len = seq_lens ? seq_lens[bi] : seq_len; + if (len == 0) { + return; + } + + const int hi = blockIdx.z; + constexpr int X_ELEMS = VEC_16B / sizeof(T); + + // [bsz, num_head, seq_len, dim_head/x, x] + auto k_src = reinterpret_cast( + k + bi * num_head * seq_len * dim_head + hi * seq_len * dim_head); + // [bsz, num_head, dim_head/x, max_seq_len, x] + auto k_dst = reinterpret_cast( + cache_k + bi * num_head * max_seq_len * dim_head + + hi * max_seq_len * dim_head); + + const int out_idx = blockIdx.x * blockDim.x + threadIdx.x; + // vec size + int dim_head_div_x = dim_head / X_ELEMS; + + // FIXME(wangxi): num_head is not need? + // if (out_idx >= num_head * dim_head_div_x * max_seq_len) return; + if (out_idx >= dim_head_div_x * max_seq_len) return; + + int idx = out_idx; + const int k_seq_len_id = idx % max_seq_len; + // idx = (idx - k_seq_len_id) / max_seq_len; + idx = idx / max_seq_len; + const int k_vec_id = idx % dim_head_div_x; + + if (k_seq_len_id < len) { + k_dst[out_idx] = k_src[k_seq_len_id * dim_head_div_x + k_vec_id]; + } +} + +template +__global__ void write_cache_v_kernel(T *cache_v, + const T *v, + const int *seq_lens, + const int num_head, + const int dim_head, + const int seq_len, + const int max_seq_len) { + const int bi = blockIdx.y; + const int len = seq_lens ? 
seq_lens[bi] : seq_len; + if (len == 0) { + return; + } + + const int hi = blockIdx.z; + + // [bsz, num_head, seq_len, dim_head/x, x] + auto v_src = reinterpret_cast( + v + bi * num_head * seq_len * dim_head + hi * seq_len * dim_head); + // [bsz, num_head, max_seq_len, dim_head/x, x] + auto v_dst = reinterpret_cast( + cache_v + bi * num_head * max_seq_len * dim_head + + hi * max_seq_len * dim_head); + + const int idx = blockIdx.x * blockDim.x + threadIdx.x; + constexpr int X_ELEMS = VEC_16B / sizeof(T); + const int dim_head_div_x = dim_head / X_ELEMS; + + if (idx >= dim_head_div_x * len) return; + + v_dst[idx] = v_src[idx]; +} + +template +void LaunchWriteCacheKV(const paddle::Tensor& input_k, + const paddle::Tensor& input_v, + const paddle::Tensor& cache_kv, + const paddle::Tensor& sequence_lengths) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + const int64_t bsz = input_k.shape()[0]; + const int64_t seq_len = input_k.shape()[2]; + const int64_t cache_bsz = cache_kv.shape()[1]; + const int64_t num_head = cache_kv.shape()[2]; + const int64_t dim_head = cache_kv.shape()[4]; + // printf("bsz: %d, cache_bsz: %d, num_head: %d, seq_len: %d, dim_head: %d.\n", bsz, cache_bsz, num_head, seq_len, dim_head); + + auto cache_kv_out = paddle::full({1}, -1, cache_kv.dtype(), cache_kv.place()); + + const DataType_ *k_ptr = reinterpret_cast(input_k.data()); + const DataType_ *v_ptr = reinterpret_cast(input_v.data()); + + // [2, bsz, num_head, max_seq_len, head_dim] + int max_seq_len = cache_kv.shape()[3]; + DataType_ *cache_kv_data = reinterpret_cast(const_cast(cache_kv.data())); + + int64_t cache_k_size = cache_bsz * num_head * max_seq_len * dim_head; + + DataType_ *cache_k_ptr = cache_kv_data; + DataType_ *cache_v_ptr = cache_kv_data + cache_k_size; + + constexpr int block_sz = 128; + constexpr int x = VEC_16B / sizeof(DataType_); + + assert(dim_head % x == 0); + // PD_CHECK((dim_head % x) == 0, "PD_CHECK returns ", false, ", dim_head must be divisible by vec_size."); + + int max_size = max_seq_len * dim_head / x; + int size = seq_len * dim_head / x; + dim3 grid(div_up(max_size, block_sz), bsz, num_head); + dim3 grid_v(div_up(size, block_sz), bsz, num_head); + + // transpose [bsz, num_head, seq_len, dim_head/x, x]-> + // [bsz, num_head, dim_head/x, max_seq_len, x] + write_cache_k_kernel<<>>( + cache_k_ptr, k_ptr, sequence_lengths.data(), num_head, dim_head, seq_len, max_seq_len); + + // copy [bsz, num_head, seq_len, dim_head/x, x]-> + // [bsz, num_head, max_seq_len, dim_head/x, x] + write_cache_v_kernel<<>>( + cache_v_ptr, v_ptr, sequence_lengths.data(), num_head, dim_head, seq_len, max_seq_len); +} + +void WriteCacheKV(const paddle::Tensor& input_k, + const paddle::Tensor& input_v, + const paddle::Tensor& cache_kv, + const paddle::Tensor& sequence_lengths_shape) { + switch (cache_kv.type()) { + case paddle::DataType::BFLOAT16: { + return LaunchWriteCacheKV( + input_k, input_v, cache_kv, sequence_lengths_shape + ); + } + case paddle::DataType::FLOAT16: { + return LaunchWriteCacheKV( + input_k, input_v, cache_kv, sequence_lengths_shape + ); + } + case paddle::DataType::FLOAT32: { + return LaunchWriteCacheKV( + input_k, input_v, cache_kv, sequence_lengths_shape + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only bfloat16, float16 and float32 are supported. 
"); + break; + } + } +} + +PD_BUILD_OP(write_cache_kv) + .Inputs({"input_k", "input_v", "cache_kv", "sequence_lengths"}) + .Outputs({"cache_kv_out"}) + .SetInplaceMap({{"cache_kv", "cache_kv_out"}}) + .SetKernelFn(PD_KERNEL(WriteCacheKV)); \ No newline at end of file diff --git a/csrc/requirements.txt b/csrc/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0bf0625387b4da469c3e1866ea4849ca6d1183ec --- /dev/null +++ b/csrc/requirements.txt @@ -0,0 +1,2 @@ +cupy-cuda116 +pybind11 \ No newline at end of file diff --git a/csrc/setup_cuda.py b/csrc/setup_cuda.py new file mode 100644 index 0000000000000000000000000000000000000000..cf4ef65b45d5c5b417f3b3617ca9075e54a959d0 --- /dev/null +++ b/csrc/setup_cuda.py @@ -0,0 +1,37 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddle.utils.cpp_extension import CUDAExtension, setup + +setup( + name="paddlenlp_ops", + ext_modules=CUDAExtension( + sources=[ + "./generation/save_with_output.cc", + "./generation/set_mask_value.cu", + "./generation/set_value_by_flags.cu", + "./generation/token_penalty_multi_scores.cu", + "./generation/stop_generation_multi_ends.cu", + "./generation/fused_get_rope.cu", + "./generation/get_padding_offset.cu", + "./generation/qkv_transpose_split.cu", + "./generation/rebuild_padding.cu", + "./generation/transpose_removing_padding.cu", + "./generation/write_cache_kv.cu", + "./generation/encode_rotary_qk.cu", + "./generation/top_p_sampling.cu", + "./generation/set_alibi_mask_value.cu", + ] + ), +) diff --git a/docs/FAQ.md b/docs/FAQ.md new file mode 100644 index 0000000000000000000000000000000000000000..12853cf627fdd24ef0ae49cbab32ddaa8693c666 --- /dev/null +++ b/docs/FAQ.md @@ -0,0 +1,495 @@ +## PaddleNLP常见问题汇总(持续更新) + ++ [【精选】NLP精选5问](#NLP精选) + + + [Q1.1 如何加载自己的本地数据集,以便使用PaddleNLP的功能?](#1-1) + + [Q1.2 PaddleNLP会将内置的数据集、模型下载到默认路径,如何修改路径?](#1-2) + + [Q1.3 PaddleNLP中如何保存、加载训练好的模型?](#1-3) + + [Q1.4 当训练样本较少时,有什么推荐的方法能提升模型效果吗?](#1-4) + + [Q1.5 如何提升模型的性能,提升QPS?](#1-5) + ++ [【理论篇】NLP通用问题](#NLP通用问题 ) + + + [Q2.1 数据类别分布不均衡, 有哪些应对方法?](#2-2) + + [Q2.2 如果使用预训练模型,一般需要多少条样本?](#2-3) + ++ [【实战篇】PaddleNLP实战问题](#PaddleNLP实战问题) + + [数据集和数据处理](#数据问题) + + + [Q3.1 使用自己的数据集训练预训练模型时,如何引入额外的词表?](#3-1) + + [模型训练调优](#训练调优问题) + + + [Q3.2 如何加载自己的预训练模型,进而使用PaddleNLP的功能?](#4-1) + + [Q3.3 如果训练中断,需要继续热启动训练,如何保证学习率和优化器能从中断地方继续迭代?](#4-2) + + [Q3.4 如何冻结模型梯度?](#4-3) + + [Q3.5 如何在eval阶段打印评价指标,在各epoch保存模型参数?](#4-4) + + [Q3.6 训练过程中,训练程序意外退出或Hang住,应该如何排查?](#4-5) + + + [Q3.7 在模型验证和测试过程中,如何保证每一次的结果是相同的?](#4-6) + + [Q3.8 ERNIE模型如何返回中间层的输出?](#4-7) + + [预测部署](#部署问题) + + + [Q3.9 PaddleNLP训练好的模型如何部署到服务器 ?](#5-1) + + [Q3.10 静态图模型如何转换成动态图模型?](#5-2) + ++ [特定模型和应用场景咨询](#NLP应用场景) + + [Q4.1 【词法分析】LAC模型,如何自定义标签label,并继续训练?](#6-1) + + [Q4.2 信息抽取任务中,是否推荐使用预训练模型+CRF,怎么实现呢?](#6-2) + + [Q4.3 【阅读理解】`MapDatasets`的`map()`方法中对应的`batched=True`怎么理解,在阅读理解任务中为什么必须把参数`batched`设置为`True`?](#6-3) + + [Q4.4 【语义匹配】语义索引和语义匹配有什么区别?](#6-4) + + [Q4.5 【解语】wordtag模型如何自定义添加命名实体及对应词类?](#6-5) + ++ 
[其他使用咨询](#使用咨询问题) + + [Q5.1 在CUDA11使用PaddlNLP报错?](#7-1) + + [Q5.2 如何设置parameter?](#7-2) + + [Q5.3 GPU版的Paddle虽然能在CPU上运行,但是必须要有GPU设备吗?](#7-3) + + [Q5.4 如何指定用CPU还是GPU训练模型?](#7-4) + + [Q5.5 动态图模型和静态图模型的预测结果一致吗?](#7-5) + + [Q5.6 如何可视化acc、loss曲线图、模型网络结构图等?](#7-6) + + + +## ⭐️【精选】NLP精选5问 + + + +##### Q1.1 如何加载自己的本地数据集,以便使用PaddleNLP的功能? + +**A:** 通过使用PaddleNLP提供的 `load_dataset`, `MapDataset` 和 `IterDataset` ,可以方便的自定义属于自己的数据集哦,也欢迎您贡献数据集到PaddleNLP repo。 + +从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 `load_dataset()` 中创建数据集。 +以[waybill_ie](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie)快递单信息抽取任务中的数据为例: + +```python +from paddlenlp.datasets import load_dataset + +def read(data_path): + with open(data_path, 'r', encoding='utf-8') as f: + # 跳过列名 + next(f) + for line in f: + words, labels = line.strip('\n').split('\t') + words = words.split('\002') + labels = labels.split('\002') + yield {'tokens': words, 'labels': labels} + +# data_path为read()方法的参数 +map_ds = load_dataset(read, data_path='train.txt', lazy=False) +iter_ds = load_dataset(read, data_path='train.txt', lazy=True) +``` + +如果您习惯使用`paddle.io.Dataset/IterableDataset`来创建数据集也是支持的,您也可以从其他python对象如`List`对象创建数据集,详细内容可参照[官方文档-自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + + + +##### Q1.2 PaddleNLP会将内置的数据集、模型下载到默认路径,如何修改路径? + +**A:** 内置的数据集、模型默认会下载到`$HOME/.paddlenlp/`下,通过配置环境变量可下载到指定路径: + +(1)Linux下,设置 `export PPNLP_HOME="xxxx"`,注意不要设置带有中文字符的路径。 + +(2)Windows下,同样配置环境变量 PPNLP_HOME 到其他非中文字符路径,重启即可。 + + + +##### Q1.3 PaddleNLP中如何保存、加载训练好的模型? + +**A:**(1)PaddleNLP预训练模型 + +​ 保存: + +```python +model.save_pretrained("./checkpoint') +tokenizer.save_pretrained("./checkpoint') +``` + +​ 加载: + +```python +model.from_pretrained("./checkpoint') +tokenizer.from_pretrained("./checkpoint') +``` + +(2)常规模型 + 保存: + +```python +emb = paddle.nn.Embedding(10, 10) +layer_state_dict = emb.state_dict() +paddle.save(layer_state_dict, "emb.pdparams") #保存模型参数 +``` + +​ 加载: +```python +emb = paddle.nn.Embedding(10, 10) +load_layer_state_dict = paddle.load("emb.pdparams") # 读取模型参数 +emb.set_state_dict(load_layer_state_dict) # 加载模型参数 +``` + + + +##### Q1.4 当训练样本较少时,有什么推荐的方法能提升模型效果吗? + +**A:** 增加训练样本带来的效果是最直接的。此外,可以基于我们开源的[预训练模型](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers)进行热启,再用少量数据集fine-tune模型。此外,针对分类、匹配等场景,[小样本学习](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/few_shot)也能够带来不错的效果。 + + + +##### Q1.5 如何提升模型的性能,提升QPS? + +**A:** 从工程角度,对于服务器端部署可以使用[Paddle Inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/inference_cn.html)高性能预测引擎进行预测部署。对于Transformer类模型的GPU预测还可以使用PaddleNLP中提供的[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/ops)功能来进行快速预测,其集成了[NV FasterTransformer](https://github.com/NVIDIA/FasterTransformer)并进行了功能增强。 + +从模型策略角度,可以使用一些模型小型化技术来进行模型压缩,如模型蒸馏和裁剪,通过小模型来实现加速。PaddleNLP中集成了ERNIE-Tiny这样一些通用小模型供下游任务微调使用。另外PaddleNLP提供了[模型压缩示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression),实现了DynaBERT、TinyBERT、MiniLM等方法策略,可以参考对自己的模型进行蒸馏压缩。 + + + +## ⭐️【理论篇】NLP通用问题 + + + +##### Q2.1 数据类别分布不均衡, 有哪些应对方法? + +**A:** 可以采用以下几种方法优化类别分布不均衡问题: + +(1)欠采样:对样本量较多的类别进行欠采样,去除一些样本,使得各类别数目接近。 + +(2)过采样:对样本量较少的类别进行过采样,选择样本进行复制,使得各类别数目接近。 + +(3)修改分类阈值:直接使用类别分布不均衡的数据训练分类器,会使得模型在预测时更偏向于多数类,所以不再以0.5为分类阈值,而是针对少数类在模型仅有较小把握时就将样本归为少数类。 + +(4)代价敏感学习:比如LR算法中设置class_weight参数。 + + + +##### Q2.2 如果使用预训练模型,一般需要多少条样本? 
+ +**A:** 很难定义具体需要多少条样本,取决于具体的任务以及数据的质量。如果数据质量没问题的话,分类、文本匹配任务所需数据量级在百级别,翻译则需要百万级能够训练出一个比较鲁棒的模型。如果样本量较少,可以考虑数据增强,或小样本学习。 + + + + +## ⭐️【实战篇】PaddleNLP实战问题 + + + +### 数据集和数据处理 + + + +##### Q3.1 使用自己的数据集训练预训练模型时,如何引入额外的词表? + +**A:** 预训练模型通常会有配套的tokenzier和词典,对于大多数中文预训练模型,如ERNIE-3.0,使用的都是字粒度的输入,tokenzier会将句子转换为字粒度的形式,模型无法收到词粒度的输入。如果希望引入额外的词典,需要修改预训练模型的tokenizer和词典,可以参考这里[blog](https://kexue.fm/archives/7758/comment-page-1#Tokenizer ),另外注意embedding矩阵也要加上这些新增词的embedding表示。 + +另外还有一种方式可以使用这些字典信息,可以将数据中在词典信息中的词进行整体mask进行一个mask language model的二次预训练,这样经过二次训练的模型就包含了对额外字典的表征。可参考 [PaddleNLP 预训练数据流程](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0/)。 + + +此外还有些词粒度及字词混合粒度的预训练模型,在这些词粒度的模型下引入额外的词表也会容易些,我们也将持续丰富PaddleNLP中的预训练模型。 + + + +### 模型训练调优 + + + +##### Q3.2 如何加载自己的预训练模型,进而使用PaddleNLP的功能? + +**A:** 以bert为例,如果是使用PaddleNLP训练,通过`save_pretrained()`接口保存的模型,可通过`from_pretrained()`来加载: + +```python +tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +model = BertModel.from_pretrained("bert-base-uncased") +``` + +如果不是上述情况,可以使用如下方式加载模型,也欢迎您贡献模型到PaddleNLP repo中。 + +(1)加载`BertTokenizer`和`BertModel` + +```python +tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +model = BertModel.from_pretrained("bert-base-uncased") +``` + +(2)调用`save_pretrained()`生成 `model_config.json`、 ``tokenizer_config.json``、`model_state.pdparams`、 `vocab.txt `文件,保存到`./checkpoint`: + +```python +tokenizer.save_pretrained("./checkpoint") +model.save_pretrained("./checkpoint") +``` + +(3)修改`model_config.json`、 `tokenizer_config.json`这两个配置文件,指定为自己的模型,之后通过`from_pretrained()`加载模型。 + +```python +tokenizer = BertTokenizer.from_pretrained("./checkpoint") +model = BertModel.from_pretrained("./checkpoint") +``` + + + +##### Q3.3 如果训练中断,需要继续热启动训练,如何保证学习率和优化器能从中断地方继续迭代? + +**A:** + + (1)完全恢复训练状态,可以先将`lr`、` optimizer`、`model`的参数保存下来: + +```python +paddle.save(lr_scheduler.state_dict(), "xxx_lr") +paddle.save(optimizer.state_dict(), "xxx_opt") +paddle.save(model.state_dict(), "xxx_para") +``` + +(2)加载`lr`、` optimizer`、`model`参数即可恢复训练: + +```python +lr_scheduler.set_state_dict(paddle.load("xxxx_lr")) +optimizer.set_state_dict(paddle.load("xxx_opt")) +model.set_state_dict(paddle.load("xxx_para")) +``` + + + +##### Q3.4 如何冻结模型梯度? + +**A:** +有多种方法可以尝试: + +(1)可以直接修改 PaddleNLP 内部代码实现,在需要冻结梯度的地方用 `paddle.no_grad()` 包裹一下 + + `paddle.no_grad()` 的使用方式,以对 `forward()` 进行冻结为例: + +``` python + # Method 1 + class Model(nn.Layer): + def __init__(self, ...): + ... + + def forward(self, ...): + with paddle.no_grad(): + ... + + + # Method 2 + class Model(nn.Layer): + def __init__(self, ...): + ... + + @paddle.no_grad() + def forward(self, ...): + ... +``` + + `paddle.no_grad()` 的使用也不局限于模型内部实现里面,也可以包裹外部的方法,比如: + +``` python + @paddle.no_grad() + def evaluation(...): + ... + + model = Model(...) + model.eval() + + ... + +``` + +(2)第二种方法:以ERNIE为例,将模型输出的 tensor 设置 `stop_gradient` 为 True。可以使用 `register_forward_post_hook` 按照如下的方式尝试: + +``` python + def forward_post_hook(layer, input, output): + output.stop_gradient=True + + self.ernie.register_forward_post_hook(forward_post_hook) +``` + +(3)第三种方法:在 `optimizer` 上进行处理,`model.parameters` 是一个 `List`,可以通过 `name` 进行相应的过滤,更新/不更新某些参数,这种方法需要对网络结构的名字有整体了解,因为网络结构的实体名字决定了参数的名字,这个使用方法有一定的门槛: + +```python + [ p for p in model.parameters() if 'linear' not in p.name] # 这里就可以过滤一下linear层,具体过滤策略可以根据需要来设定 +``` + + + +##### Q3.5 如何在eval阶段打印评价指标,在各epoch保存模型参数? 
+ +**A:** 飞桨主框架提供了两种训练与预测的方法,一种是用 [paddle.Model()](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/Model_cn.html)对模型进行封装,通过高层API如`Model.fit()`、`Model.evaluate()`、`Model.predict()`等完成模型的训练与预测;另一种就是基于基础API常规的训练方式。 + +(1)对于第一种方法: + +- 我们可以设置 `paddle.Model.fit() ` API中的 *eval_data* 和 *eval_freq* 参数在训练过程中打印模型评价指标:*eval_data* 参数是一个可迭代的验证集数据源,*eval_freq* 参数是评估的频率;当*eval_data* 给定后,*eval_freq* 的默认值为1,即每一个epoch进行一次评估。注意:在训练前,我们需要在 `Model.prepare()` 接口传入metrics参数才能在eval时打印模型评价指标。 + +- 关于模型保存,我们可以设置 `paddle.Model.fit()` 中的 *save_freq* 参数控制模型保存的频率:*save_freq* 的默认值为1,即每一个epoch保存一次模型。 + +(2)对于第二种方法: + +- 我们在PaddleNLP的examples目录下提供了常见任务的训练与预测脚本:如[GLUE](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/glue) 和 [SQuAD](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_reading_comprehension/SQuAD)等 + +- 开发者可以参考上述脚本进行自定义训练与预测脚本的开发。 + + + +##### Q3.6 训练过程中,训练程序意外退出或Hang住,应该如何排查? + +**A:** 一般先考虑内存、显存(使用GPU训练的话)是否不足,可将训练和评估的batch size调小一些。 + +需要注意,batch size调小时,学习率learning rate也要调小,一般可按等比例调整。 + + + +##### Q3.7 在模型验证和测试过程中,如何保证每一次的结果是相同的? + +**A:** 在验证和测试过程中常常出现的结果不一致情况一般有以下几种解决方法: + +(1)确保设置了eval模式,并保证数据相关的seed设置保证数据一致性。 + +(2)如果是下游任务模型,查看是否所有模型参数都被导入了,直接使用bert-base这种预训练模型是不包含任务相关参数的,要确认导入的是微调后的模型,否则任务相关参数会随机初始化导致出现随机性。 + +(3)部分算子使用CUDNN后端产生的不一致性可以通过环境变量的设置来避免。如果模型中使用了CNN相关算子,可以设置`FLAGS_cudnn_deterministic=True`。如果模型中使用了RNN相关算子,可以设置`CUBLAS_WORKSPACE_CONFIG=:16:8`或`CUBLAS_WORKSPACE_CONFIG=:4096:2`(CUDNN 10.2以上版本可用,参考[CUDNN 8 release note](https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_8.html))。 + + + +##### Q3.8 ERNIE模型如何返回中间层的输出? + +**A:** 目前的API设计不保留中间层输出,当然在PaddleNLP里可以很方便地修改源码。 +此外,还可以在`ErnieModel`的`__init__`函数中通过`register_forward_post_hook()`为想要保留输出的Layer注册一个`forward_post_hook`函数,在`forward_post_hook`函数中把Layer的输出保存到一个全局的`List`里面。`forward_post_hook`函数将会在`forward`函数调用之后被调用,并保存Layer输出到全局的`List`。详情参考[`register_forward_post_hook()`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Layer_cn.html#register_forward_post_hook)。 + + + +### 预测部署 + + + +##### Q3.9 PaddleNLP训练好的模型如何部署到服务器 ? + +**A:** 我们推荐在动态图模式下开发,静态图模式部署。 + +(1)动转静 + + 动转静,即将动态图的模型转为可用于部署的静态图模型。 + 动态图接口更加易用,python 风格的交互式编程体验,对于模型开发更为友好,而静态图相比于动态图在性能方面有更绝对的优势。因此动转静提供了这样的桥梁,同时兼顾开发成本和性能。 + 可以参考官方文档 [动态图转静态图文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/04_dygraph_to_static/index_cn.html),使用 `paddle.jit.to_static` 完成动转静。 + 另外,在 PaddleNLP 我们也提供了导出静态图模型的例子,可以参考 [waybill_ie 模型导出](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie/#%E6%A8%A1%E5%9E%8B%E5%AF%BC%E5%87%BA)。 + +(2)借助Paddle Inference部署 + + 动转静之后保存下来的模型可以借助Paddle Inference完成高性能推理部署。Paddle Inference内置高性能的CPU/GPU Kernel,结合细粒度OP横向纵向融合等策略,并集成 TensorRT 实现模型推理的性能提升。具体可以参考文档 [Paddle Inference 简介](https://paddleinference.paddlepaddle.org.cn/master/product_introduction/inference_intro.html)。 + 为便于初次上手的用户更易理解 NLP 模型如何使用Paddle Inference,PaddleNLP 也提供了对应的例子以供参考,可以参考 [/PaddleNLP/examples](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/) 下的deploy目录,如[基于ERNIE的命名实体识别模型部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie/deploy/python)。 + + + +##### Q3.10 静态图模型如何转换成动态图模型? + +**A:** 首先,需要将静态图参数保存成`ndarray`数据,然后将静态图参数名和对应动态图参数名对应,最后保存成动态图参数即可。详情可参考[参数转换脚本](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie/static_to_dygraph_params)。 + + + +### ⭐️特定模型和应用场景咨询 + + + +##### Q4.1 【词法分析】LAC模型,如何自定义标签label,并继续训练? 
+ +**A:** 更新label文件`tag.dict`,添加 修改下CRF的标签数即可。 + +可参考[自定义标签示例](https://github.com/PaddlePaddle/PaddleNLP/issues/662),[增量训练自定义LABLE示例](https://github.com/PaddlePaddle/PaddleNLP/issues/657)。 + + + +##### Q4.2 信息抽取任务中,是否推荐使用预训练模型+CRF,怎么实现呢? + +**A:** 预训练模型+CRF是一个通用的序列标注的方法,目前预训练模型对序列信息的表达也是非常强的,也可以尝试直接使用预训练模型对序列标注任务建模。 + + + +##### Q4.3.【阅读理解】`MapDatasets`的`map()`方法中对应的`batched=True`怎么理解,在阅读理解任务中为什么必须把参数`batched`设置为`True`? + +**A:** `batched=True`就是对整个batch(这里不一定是训练中的batch,理解为一组数据就可以)的数据进行map,即map中的trans_func接受一组数据为输入,而非逐条进行map。在阅读理解任务中,根据使用的doc_stride不同,一条样本可能被转换成多条feature,对数据逐条map是行不通的,所以需要设置`batched=True`。 + + + +##### Q4.4 【语义匹配】语义索引和语义匹配有什么区别? + +**A:** 语义索引要解决的核心问题是如何从海量 Doc 中通过 ANN 索引的方式快速、准确地找出与 query 相关的文档,语义匹配要解决的核心问题是对 query和文档更精细的语义匹配信息建模。换个角度理解, [语义索引](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/semantic_indexing)是要解决搜索、推荐场景下的召回问题,而[语义匹配](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_matching)是要解决排序问题,两者要解决的问题不同,所采用的方案也会有很大不同,但两者间存在一些共通的技术点,可以互相借鉴。 + + + +##### Q4.5 【解语】wordtag模型如何自定义添加命名实体及对应词类? + +**A:** 其主要依赖于二次构造数据来进行finetune,同时要更新termtree信息。wordtag分为两个步骤: +(1)通过BIOES体系进行分词; +(2)将分词后的信息和TermTree进行匹配。 + 因此我们需要: +(1)分词正确,这里可能依赖于wordtag的finetune数据,来让分词正确; +(2)wordtag里面也需要把分词正确后term打上相应的知识信息。wordtag自定义TermTree的方式将在后续版本提供出来。 + +可参考[issue](https://github.com/PaddlePaddle/PaddleNLP/issues/822)。 + + + +### ⭐️其他使用咨询 + + + +##### Q5.1 在CUDA11使用PaddlNLP报错? + +**A:** 在CUDA11安装,可参考[issue](https://github.com/PaddlePaddle/PaddleNLP/issues/348),其他CUDA版本安装可参考 [官方文档](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html) + + + +##### Q5.2 如何设置parameter? + +**A:** 有多种方法: +(1)可以通过`set_value()`来设置parameter,`set_value()`的参数可以是`numpy`或者`tensor`。 + +```python + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") else + self.ernie.config["initializer_range"], + shape=layer.weight.shape)) +``` +(2)通过`create_parameter()`设置参数。 + +``` python + class MyLayer(paddle.nn.Layer): + def __init__(self): + super(MyLayer, self).__init__() + self._linear = paddle.nn.Linear(1, 1) + w_tmp = self.create_parameter([1,1]) + self.add_parameter("w_tmp", w_tmp) + + def forward(self, input): + return self._linear(input) + + mylayer = MyLayer() + for name, param in mylayer.named_parameters(): + print(name, param) +``` + + + +##### Q5.3 GPU版的Paddle虽然能在CPU上运行,但是必须要有GPU设备吗? + +**A:** 不支持 GPU 的设备只能安装 CPU 版本的 PaddlePaddle。 GPU 版本的 PaddlePaddle 如果想只在 CPU 上运行,可以通过 `export CUDA_VISIBLE_DEVICES=-1` 来设置。 + + + +##### Q5.4 如何指定用CPU还是GPU训练模型? + +**A:** 一般我们的训练脚本提供了 `--device` 选项,用户可以通过 `--device` 选择需要使用的设备。 + +具体而言,在Python文件中,我们可以通过·paddle.device.set_device()·,设置为gpu或者cpu,可参考 [set_device文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + + + +##### Q5.5 动态图模型和静态图模型的预测结果一致吗? + +**A:** 正常情况下,预测结果应当是一致的。如果遇到不一致的情况,可以及时反馈给 PaddleNLP 的开发人员,我们进行处理。 + + + +##### Q5.6 如何可视化acc、loss曲线图、模型网络结构图等? 
+ +**A:** 可使用[VisualDL](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/03_VisualDL/index_cn.html)进行可视化。其中acc、loss曲线图的可视化可参考[Scalar——折线图组件](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/03_VisualDL/visualdl_usage_cn.html#scalar)使用指南,模型网络结构的可视化可参考[Graph——网络结构组件](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/03_VisualDL/visualdl_usage_cn.html#graph)使用指南。 diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..ed88099027f775942fa65dce2314f1ae9675cb36 --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = . +BUILDDIR = build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/_static/custom.css b/docs/_static/custom.css new file mode 100644 index 0000000000000000000000000000000000000000..bbd2345b4ee40ccf407c3b4f8afb657b9b7b803c --- /dev/null +++ b/docs/_static/custom.css @@ -0,0 +1,3 @@ +.wy-nav-content { + max-width: 80%; +} diff --git a/docs/advanced_guide/distributed_training.rst b/docs/advanced_guide/distributed_training.rst new file mode 100644 index 0000000000000000000000000000000000000000..16bee02a8eb86c4ab4f3a603c5c0cafdc013b981 --- /dev/null +++ b/docs/advanced_guide/distributed_training.rst @@ -0,0 +1,5 @@ +============ +大规模分布式训练 +============ + +大规模分布式训练: diff --git a/docs/advanced_guide/fastgeneration/fastgeneration.rst b/docs/advanced_guide/fastgeneration/fastgeneration.rst new file mode 100644 index 0000000000000000000000000000000000000000..95fff8849aef78c8daa8ffdfd1225152c6dc6868 --- /dev/null +++ b/docs/advanced_guide/fastgeneration/fastgeneration.rst @@ -0,0 +1,189 @@ +======== +FastGeneration加速生成API +======== + +FastGeneration是PaddleNLP v2.2版本加入的一个高性能推理功能,可实现基于CUDA的序列解码。该功能可以用于多种生成类的预训练NLP模型,例如GPT、BART、UnifiedTransformer等,并且支持多种解码策略。因此该功能主要适用于机器翻译,文本续写,文本摘要,对话生成等任务。 + +功能底层依托于 `FasterTransformer `_ ,该库专门针对Transformer系列模型及各种解码策略进行了优化。功能顶层封装于 `model.generate` 函数。功能的开启和关闭通过传入 `use_fast` 参数进行控制(默认为开启状态)。该功能具有如下特性: + +- 全面支持生成式预训练模型。包括GPT、BART、mBART、UnifiedTransformer和UNIMO-text。 +- 支持大多数主流解码策略。包括Beam Search、Sampling、Greedy Search。以及Diverse Sibling Search、Length Penalty等子策略。 +- 解码速度快。最高可达非加速版generate函数的 **17倍**。HuggingFace generate函数的 **8倍**。**并支持FP16混合精度计算**。 详细性能试验数据请参见 `FastGeneration Performence `_ 。 +- 易用性强。功能的入口为 `model.generate` ,与非加速版生成api的使用方法相同,当满足加速条件时使用jit即时编译高性能算子并用于生成,不满足则自动切换回非加速版生成api。下图展示了FastGeneration的启动流程: + +.. image:: ../../imgs/fast_generation.png + +快速开始 +----------- + +为体现FastGeneration的易用性,我们在 `samples `_ 文件夹中内置了几个典型任务示例,下面以基于GPT模型的中文文本续写任务为例: + +.. code-block:: + + python samples/gpt_sample.py + + +如果是第一次执行,PaddleNLP会启动即时编译( `JIT Compile `_ )自动编译高性能解码算子。 + +.. code-block:: + + ... 
+ 2021-11-17 13:42:56,771 - INFO - execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build + INFO:utils.cpp_extension:execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build + grep: warning: GREP_OPTIONS is deprecated; please use an alias or script + running build + running build_ext + -- The C compiler identification is GNU 8.2.0 + -- The CXX compiler identification is GNU 8.2.0 + -- The CUDA compiler identification is NVIDIA 10.2.89 + -- Check for working C compiler: /usr/bin/cc + -- Check for working C compiler: /usr/bin/cc -- works + -- Detecting C compiler ABI info + -- Detecting C compiler ABI info - done + -- Detecting C compile features + -- Detecting C compile features - done + -- Check for working CXX compiler: /usr + ... + + +编译过程通常会花费几分钟的时间但是只会进行一次,之后再次使用高性能解码不需要重新编译了。编译完成后会继续运行,可以看到生成的结果如下: + +.. code-block:: + + Model input: 花间一壶酒,独酌无相亲。举杯邀明月, + Result: 对影成三人。 + +打开示例代码 `samples/gpt_sample.py` ,我们可以看到如下代码: + +.. code-block:: + + ... + model = GPTLMHeadModel.from_pretrained(model_name) + ... + outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search') + ... + +可以看到,FastGeneration的使用方法与 `model.generate()` 相同,只需传入输入tensor和解码相关参数即可,使用非常简便。如果要使用非加速版的 `model.generate()` 方法,只需传入 `use_fast=False` 即可,示例如下: + +.. code-block:: + + ... + outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', use_fast=False) + ... + +.. note:: + + 需要注意的是,如果传入 `model.generate()` 的参数不满足高性能版本的要求。程序会做出提示并自动切换为非加速版本,例如我们传入 `min_length=1` ,会得到如下提示: + + .. code-block:: + + [2021-11-17 14:21:06,132] [ WARNING] - 'min_length != 0' is not supported yet in the fast version + [2021-11-17 14:21:06,132] [ WARNING] - FastGeneration is not available, and the original version would be used instead. + + +关于该方法的更多参数可以参考API文档 `generate `_ 。 + +`samples `_ 文件夹中的其他示例的使用方法相同。 + +其他示例 +----------- + +除了以上简单示例之外,PaddleNLP的examples中所有使用了 `model.generate()` 的示例都可以通过调整到合适的参数使用高性能推理。具体如下: + +- `examples/dialogue/unified_transformer `_ +- `model_zoo/gpt/fast_gpt `_ +- `examples/text_generation/unimo-text `_ +- `examples/text_summarization/bart `_ + +根据提示修改对应参数即可使用FastGeneration加速生成。下面我们以基于 `Unified Transformer` 的任务型对话为例展示一下FastGeneration的加速效果: + +打开以上链接中Unified Transformer对应的example,找到README中对应预测的脚本。稍作修改如下: + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu + +由于这里只是展示性能,我们直接在 `model_name_or_path` 填入PaddleNLP预训练模型名称 `unified_transformer-12L-cn-luge` 。 + +可以看到,由于该任务为对话任务,我们为了防止模型生成过多安全回复(如:哈哈哈、不错等),保证生成结果具有更多的随机性,我们选择TopK-sampling作为解码策略,并让k=5。 + +打开 `infer.py` ,可以看到我们传入的脚本参数大多都提供给了 `model.generate()` 方法: + +.. 
code-block:: + + output = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + seq_len=seq_len, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fp16_decoding=args.use_fp16_decoding, + use_fast=args.faster) + +运行脚本,输出结果如下: + +.. code-block:: + + step 10 - 1.695s/step + step 20 - 1.432s/step + step 30 - 1.435s/step + +可以看到,非加速版 `generate()` 方法的预测速度为每个step耗时1.5秒左右。 + +下面我们在启动脚本中传入 `--faster` 参数,这会让 `generate()` 方法传入 `use_fast=True` ,启动加速模式。同时我们需要设置 `--min_dec_len=0` ,因为FastGeneration当前还不支持该参数。新的脚本启动参数如下: + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=0 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu \ + --faster + +再次运行脚本,输出结果如下(由于我们已经编译过高性能算子,所以这里不会重新编译): + +.. code-block:: + + [2021-11-23 13:38:09,200] [ DEBUG] - skipping 'FastTransformer' extension (up-to-date) build + step 10 - 0.511s/step + step 20 - 0.343s/step + step 30 - 0.419s/step + +可以看到,FastGeneration的预测速度为每个step耗时0.4秒左右,提速超过三倍。如果减少 `num_return_sequences` ,可以得到更高的加速比。 diff --git a/docs/advanced_guide/fastgeneration/fasttransformer.rst b/docs/advanced_guide/fastgeneration/fasttransformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..28d767560676c634ed872de32aaced47ac823fdc --- /dev/null +++ b/docs/advanced_guide/fastgeneration/fasttransformer.rst @@ -0,0 +1,241 @@ +============ +Transformer高性能加速 +============ + + +使用环境说明 +------------ + +* 本项目依赖于 PaddlePaddle 2.1.0 及以上版本或适当的 develop 版本 +* CMake >= 3.10 +* CUDA 10.1 或 10.2(需要 PaddlePaddle 框架一致) +* gcc 版本需要与编译 PaddlePaddle 版本一致,比如使用 gcc8.2 +* 推荐使用 Python3 +* `FasterTransformer `_ 使用必要的环境 +* 环境依赖 + + - attrdict + - pyyaml + + .. code-block:: + + pip install attrdict pyyaml + + +快速开始 +------------ + +我们实现了基于 FasterTransformer 的自定义 op 的接入,打造了 FastGeneration 的能力用于加速文本生成模型在 GPU 上的预测性能。接下来,我们将分别介绍基于 Python 动态图和预测库使用 FastGeneration 自定义 op 的方式,包括 op 的编译与使用。 + +Python 动态图使用自定义 op +------------ + +JIT 自动编译 +^^^^^^^^^^^^ + +目前当基于动态图使用 FastGeneration 预测加速自定义 op 时,PaddleNLP 提供了 Just In Time 的自动编译,在一些 API 上,用户无需关注编译流程,可以直接执行对应的 API,程序会自动编译需要的第三方库。 + +以 Transformer 为例,可以直接调用 `TransformerGenerator()` 这个 API,程序会自动编译。使用示例可以参考 `Transformer 预测加速使用示例-sample `_,`Transformer 预测加速使用示例-机器翻译 `_。 + +编译自定义OP +^^^^^^^^^^^^ + +除了自动编译外,如果需要自行编译,我们已经提供对应的 CMakeLists.txt,可以参考使用如下的方式完成编译。 + +PaddleNLP 准备 +"""""""""""" + +首先,如果需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 PaddleNLP,并重新编译: + +以下以从 github 上 clone 一个新版 PaddleNLP 为例: + +.. code-block:: + + git clone https://github.com/PaddlePaddle/PaddleNLP.git + +其次,配置环境变量,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作: + +.. code-block:: + + export PYTHONPATH=$PWD/PaddleNLP/:$PYTHONPATH + cd PaddleNLP/paddlenlp/ops/ + +编译 +"""""""""""" + +编译之前,请确保安装的 PaddlePaddle 的版本高于 2.1.0 或是基于最新的 develop 分支的代码编译,并且正常可用。 + +编译自定义 OP 可以参照一下步骤: + +.. code-block:: + + mkdir build + cd build/ + cmake .. 
-DCMAKE_BUILD_TYPE=Release -DPY_CMD=python3.x + make -j + cd ../ + +可以使用的编译选项包括: + +* `-DPY_CMD`: 指定当前装有 PaddlePaddle 版本的 python 环境,比如 `-DPY_CMD=python3.7`。若未指定 `-DPY_CMD` 将会默认使用系统命令 `python` 对应的 Python。 +* `-DSM`: 是指的所用 GPU 的 compute capability,建议不使用该选项设置,未设置时将自动检测。如要设置,需根据 [compute capability](https://developer.nvidia.com/zh-cn/cuda-gpus#compute) 进行设置,如 V100 时设置 `-DSM=70` 或 T4 时设置 `-DSM=75`。 +* `-DWITH_GPT`: 是否编译带有 GPT 相关的 lib。若使用 GPT-2 高性能推理,需要加上 `-DWITH_GPT=ON`。默认为 OFF。 +* `-DWITH_UNIFIED`: 是否编译带有 Unified Transformer 或是 UNIMOText 相关的 lib。若使用,需要加上 `-DWITH_UNIFIED=ON`。默认为 ON。 +* `-DWITH_BART`: 是否编译带有 BART 支持的相关 lib。若使用,需要加上 `-DWITH_BART=ON`。默认为 ON。 +* `-DWITH_DECODER`: 是否编译带有 decoder 优化的 lib。默认为 ON。 + +最终,编译会在 `./build/lib/` 路径下,产出 `libdecoding_op.so`,即需要的 FastGeneration decoding 执行的库。 + +使用 Transformer decoding 高性能推理 +^^^^^^^^^^^^ + +编写 python 脚本的时候,调用 `FasterTransformer API `_ 即可实现 Transformer 模型的高性能预测。 + +举例如下: + +.. code-block:: + + from paddlenlp.ops import FasterTransformer + + transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + n_layer=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + topk=args.topk, + topp=args.topp, + max_out_len=args.max_out_len, + decoding_lib=args.decoding_lib, + use_fp16_decoding=args.use_fp16_decoding) + +若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以如前文所述进行编译。编译好后,使用 `FasterTransformer(decoding_lib="/path/to/lib", ...)` 可以完成导入。 + +更详细的例子可以参考 `Transformer 预测加速使用示例-sample `_,`Transformer 预测加速使用示例-机器翻译 `_,我们提供了更详细用例。 + +Transformer decoding 示例代码 +"""""""""""" + +使用 PaddlePaddle 仅执行 decoding 测试(float32): + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + export FLAGS_fraction_of_gpu_memory_to_use=0.1 + # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 + ./build/third-party/build/fastertransformer/bin/decoding_gemm 32 4 8 64 30000 32 512 0 + python ./fast_transformer/sample/decoding_sample.py --config ./fast_transformer/sample/config/decoding.sample.yaml --decoding_lib ./build/lib/libdecoding_op.so + +使用 PaddlePaddle 仅执行 decoding 测试(float16): +执行 float16 的 decoding,需要在执行的时候,加上 `--use_fp16_decoding` 选项。 + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + export FLAGS_fraction_of_gpu_memory_to_use=0.1 + # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 + ./build/third-party/build/fastertransformer/bin/decoding_gemm 32 4 8 64 30000 32 512 1 + python ./fast_transformer/sample/decoding_sample.py --config ./fast_transformer/sample/config/decoding.sample.yaml --decoding_lib ./build/lib/libdecoding_op.so --use_fp16_decoding + +其中,`decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 `_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config 文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。 + +C++ 预测库使用自定义 op +------------ + +编译自定义OP +^^^^^^^^^^^^ + +在 C++ 预测库使用自定义 OP 需要将实现的 C++、CUDA 代码**以及 C++ 预测的 demo**编译成一个可执行文件。因预测库支持方式与 Python 不同,这个过程将不会产生自定义 op 的动态库,将直接得到可执行文件。我们已经提供对应的 CMakeLists.txt ,可以参考使用如下的方式完成编译。并获取执行 demo。 + +PaddleNLP 准备 +"""""""""""" + +首先,因为需要基于当前环境重新编译,当前的 paddlenlp 的 python 包里面并不包含 FastGeneration 相关 lib,需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 PaddleNLP,并重新编译: + +以下以从 github 上 clone 一个新版 PaddleNLP 为例: + +.. 
code-block:: + + git clone https://github.com/PaddlePaddle/PaddleNLP.git + +其次,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作: + +.. code-block:: + + cd PaddleNLP/paddlenlp/ops/ + +编译 +"""""""""""" + +编译之前,请确保安装的 PaddlePaddle 的版本高于 2.1.0 或是基于最新的 develop 分支的代码编译,并且正常可用。 + +编译自定义 OP 可以参照一下步骤: + +.. code-block:: + + mkdir build + cd build/ + cmake .. -DCMAKE_BUILD_TYPE=Release -DPADDLE_LIB=/path/to/paddle_inference_lib/ -DDEMO=./demo/transformer_e2e.cc -DON_INFER=ON -DWITH_MKL=ON + make -j + cd ../ + +可以使用的编译选项包括: + +* `-DPADDLE_LIB`: 需要指明使用的 PaddlePaddle 预测库的路径 `/path/to/paddle_inference_install_dir/`,需要使用的 PaddlePaddle 的 lib 可以选择自行编译或者直接从官网下载 `paddle_inference_linux_lib `_。需要注意的是,在该路径下,预测库的组织结构满足: + .. code-block:: + + . + ├── CMakeCache.txt + ├── paddle/ + ├── include/ + └── lib/ + ├── third_party/ + ├── cudaerror/ + ├── install/ + └── threadpool/ + └── version.txt + +* `-DDEMO`: 说明预测库使用 demo 的位置。比如指定 -DDEMO=./demo/transformer_e2e.cc 或是 -DDEMO=./demo/gpt.cc。最好使用绝对路径,若使用相对路径,需要是相对于 `PaddleNLP/paddlenlp/ops/fast_transformer/src/` 的相对路径。 +* `-DSM`: 是指的所用 GPU 的 compute capability,建议不使用该选项设置,未设置时将自动检测。如要设置,需根据 [compute capability](https://developer.nvidia.com/zh-cn/cuda-gpus#compute) 进行设置,如 V100 时设置 `-DSM=70` 或 T4 时设置 `-DSM=75`。 +* `-DWITH_GPT`: 是否编译带有 GPT 相关的 lib。若使用 GPT-2 高性能推理,需要加上 `-DWITH_GPT=ON`。默认为 OFF。 +* `-DWITH_UNIFIED`: 是否编译带有 Unified Transformer 或是 UNIMOText 相关的 lib。若使用,需要加上 `-DWITH_UNIFIED=ON`。默认为 ON。 +* `-DWITH_BART`: 是否编译带有 BART 支持的相关 lib。若使用,需要加上 `-DWITH_BART=ON`。默认为 ON。 +* `-DWITH_DECODER`: 是否编译带有 decoder 优化的 lib。默认为 ON。 +* `-DWITH_MKL`: 若当前是使用的 mkl 的 Paddle lib,那么需要打开 MKL 以引入 MKL 相关的依赖。 +* `-DON_INFER`: 是否编译 paddle inference 预测库。 +* **当使用预测库的自定义 op 的时候,请务必开启 `-DON_INFER=ON` 选项,否则,不会得到预测库的可执行文件。** + +执行 Transformer decoding on PaddlePaddle +"""""""""""" + +编译完成后,在 `build/bin/` 路径下将会看到 `transformer_e2e` 的一个可执行文件。通过设置对应的设置参数完成执行的过程。 + +.. code-block:: + + cd bin/ + ./transformer_e2e -batch_size -gpu_id -model_dir -vocab_file -data_file + +举例说明: + +.. code-block:: + + cd bin/ + # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 + ../third-party/build/fastertransformer/bin/decoding_gemm 8 5 8 64 38512 256 512 0 + ./transformer_e2e -batch_size 8 -gpu_id 0 -model_dir ./infer_model/ -vocab_file DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 -data_file DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en + +其中: + +* `decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 `_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config 文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。 +* `DATA_HOME` 则是 `paddlenlp.utils.env.DATA_HOME` 返回的路径。 + +预测所需要的模型文件,可以通过 `fast_transformer/README.md `_ 文档中所记述的方式导出。 + diff --git a/docs/advanced_guide/fastgeneration/index.rst b/docs/advanced_guide/fastgeneration/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..f99b8666359d59d5fff6303748a343163f057ed7 --- /dev/null +++ b/docs/advanced_guide/fastgeneration/index.rst @@ -0,0 +1,8 @@ +============ +文本生成高性能加速 +============ + +.. 
toctree:: + :maxdepth: 1 + + fasttransformer.rst diff --git a/docs/advanced_guide/model_compression/distill_lstm.rst b/docs/advanced_guide/model_compression/distill_lstm.rst new file mode 100644 index 0000000000000000000000000000000000000000..bb50237bdcc21b46cd24e0e380c24ec8bcfc9756 --- /dev/null +++ b/docs/advanced_guide/model_compression/distill_lstm.rst @@ -0,0 +1,134 @@ +由BERT到Bi-LSTM的知识蒸馏 +============ + + +整体原理介绍 +------------ + +本例是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 `Distilling Task-Specific Knowledge from BERT into Simple Neural Networks `_ \ +实现。整体原理如下: + +1. 在本例中,较大的模型是BERT被称为教师模型,Bi-LSTM被称为学生模型。 + +2. 小模型学习大模型的知识,需要小模型学习蒸馏相关的损失函数。在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。 + +3. 在论文的模型蒸馏阶段,作者为了能让教师模型表达出更多的“暗知识”(dark knowledge,通常指分类任务中低概率类别与高概率类别的关系)供学生模型学习,对训练数据进行了数据增强。通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的“暗知识”,在更大的数据集上进行训练,产生更好的蒸馏效果。本文的作者使用了三种数据增强方式,分别是: + + A. Masking,即以一定的概率将原数据中的word token替换成 ``[MASK]`` ; + + B. POS—guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换; + + C. n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。 + + + +模型蒸馏步骤介绍 +------------ + +本实验分为三个训练过程:在特定任务上对BERT进行微调、在特定任务上对基于Bi-LSTM的小模型进行训练(用于评价蒸馏效果)、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。 + +1. 基于bert-base-uncased预训练模型在特定任务上进行微调 +^^^^^^^^^^^^ + +训练BERT的fine-tuning模型,可以去 `PaddleNLP `_ 中\ +的 `glue `_ 目录下对bert-base-uncased做微调。 + +以GLUE的SST-2任务为例,用bert-base-uncased做微调之后,可以得到一个在SST-2任务上的教师模型,可以把在dev上取得最好Accuracy的模型保存下来,用于第三步的蒸馏。 + + +2. 训练基于Bi-LSTM的小模型 +^^^^^^^^^^^^ + +在本示例中,小模型采取的是基于双向LSTM的分类模型,网络层分别是 ``Embedding`` 、``LSTM`` 、 带有 ``tanh`` 激活函数的 ``Linear`` 层,最后经过\ +一个全连接的输出层得到logits。``LSTM`` 网络层定义如下: + +.. code-block:: + + self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, + 'bidirectional', dropout=dropout_prob) + +基于Bi-LSTM的小模型的 ``forward`` 函数定义如下: + +.. code-block:: + + def forward(self, x, seq_len): + x_embed = self.embedder(x) + lstm_out, (hidden, _) = self.lstm( + x_embed, sequence_length=seq_len) # 双向LSTM + out = paddle.concat((hidden[-2, :, :], hidden[-1, :, :]), axis=1) + out = paddle.tanh(self.fc(out)) + logits = self.output_layer(out) + + return logits + + +3.数据增强介绍 +^^^^^^^^^^^^ + +接下来的蒸馏过程,蒸馏时使用的训练数据集并不只包含数据集中原有的数据,而是按照上文原理介绍中的A、C两种方法进行数据增强后的总数据。 +在多数情况下,``alpha`` 会被设置为0,表示无视硬标签,学生模型只利用数据增强后的无标签数据进行训练。根据教师模型提供的软标签 ``teacher_logits`` \ +,对比学生模型的 ``logits`` ,计算均方误差损失。由于数据增强过程产生了更多的数据,学生模型可以从教师模型中学到更多的暗知识。 + +数据增强的核心代码如下: + +.. code-block:: + + def ngram_sampling(words, words_2=None, p_ng=0.25, ngram_range=(2, 6)): + if np.random.rand() < p_ng: + ngram_len = np.random.randint(ngram_range[0], ngram_range[1] + 1) + ngram_len = min(ngram_len, len(words)) + start = np.random.randint(0, len(words) - ngram_len + 1) + words = words[start:start + ngram_len] + if words_2: + words_2 = words_2[start:start + ngram_len] + return words if not words_2 else (words, words_2) + + def data_augmentation(data, whole_word_mask=whole_word_mask): + # 1. Masking + words = [] + if not whole_word_mask: + tokenized_list = tokenizer.tokenize(data) + words = [ + tokenizer.mask_token if np.random.rand() < p_mask else word + for word in tokenized_list + ] + else: + for word in data.split(): + words += [[tokenizer.mask_token]] if np.random.rand( + ) < p_mask else [tokenizer.tokenize(word)] + # 2. 
N-gram sampling + words = ngram_sampling(words, p_ng=p_ng, ngram_range=ngram_range) + words = flatten(words) if isinstance(words[0], list) else words + new_text = " ".join(words) + return words, new_text + + +4.蒸馏模型 +^^^^^^^^^^^^ + +这一步是将教师模型BERT的知识蒸馏到基于Bi-LSTM的学生模型中,在本例中,主要是让学生模型(Bi-LSTM)去学习教师模型的输出logits。\ +蒸馏时使用的训练数据集是由上一步数据增强后的数据,核心代码如下: + +.. code-block:: + + ce_loss = nn.CrossEntropyLoss() # 交叉熵损失函数 + mse_loss = nn.MSELoss() # 均方误差损失函数 + + for epoch in range(args.max_epoch): + for i, batch in enumerate(train_data_loader): + bert_input_ids, bert_segment_ids, student_input_ids, seq_len, labels = batch + + # Calculate teacher model's forward. + with paddle.no_grad(): + teacher_logits = teacher.model(bert_input_ids, bert_segment_ids) + + # Calculate student model's forward. + logits = model(student_input_ids, seq_len) + + # Calculate the loss, usually args.alpha equals to 0. + loss = args.alpha * ce_loss(logits, labels) + ( + 1 - args.alpha) * mse_loss(logits, teacher_logits) + + loss.backward() + optimizer.step() + diff --git a/docs/advanced_guide/model_compression/index.rst b/docs/advanced_guide/model_compression/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..74f38cf8b5d2efff22b9cb7bd5fdeebf981d35f7 --- /dev/null +++ b/docs/advanced_guide/model_compression/index.rst @@ -0,0 +1,10 @@ +============ +模型压缩 +============ + +.. toctree:: + :maxdepth: 1 + + introduction.rst + distill_lstm.rst + ofa_bert.rst diff --git a/docs/advanced_guide/model_compression/introduction.rst b/docs/advanced_guide/model_compression/introduction.rst new file mode 100644 index 0000000000000000000000000000000000000000..0fb932abc1d9b4d279fc10f63292d6d3ecf5fbba --- /dev/null +++ b/docs/advanced_guide/model_compression/introduction.rst @@ -0,0 +1,46 @@ +============ +模型压缩简介 +============ + + +近些年,基于Transformer的语言模型在机器翻译、阅读理解、文本匹配、自然语言推理等自然语言处理任务上取得了实质\ +进展。然而,海量的参数和计算资源的大量耗费,使BERT及其变体在部署中困难重重。模型压缩的发展,使得这些问题得到\ +了缓解。 + +模型压缩简介 +------------ + +模型压缩在保证一定精度的情况下,能够降低模型的存储,加速模型的推理时间。常见的模型压缩方法主要包括模型裁剪、量化和蒸馏。\ +下面分别对这几种方法进行简要的介绍。 + +模型裁剪 +^^^^^^^^^^^^ +模型裁剪是通过对已经训练好的模型中不重要的网络连接进行裁剪,减少模型的冗余和计算量,从而减少网络存储、大幅度进行加速的模型压缩方法。 + +量化 +^^^^^^^^^^^^ +一般而言,神经网络模型的参数都是用的32bit长度的浮点型数表示。实际上,有时不需要保留那么高的精度,可以通过量化的方法减少\ +模型的存储空间,通常用INT8代替Float32存储。比如,SGD(Stochastic Gradient Descent)所需要的精度仅为6~8bit,\ +因此合理的量化网络也可保证精度的情况下减小模型的存储体积,并且能够大幅度加速,使得神经网络在CPU上的运行成为可能。\ +通常,量化包含多种方法,例如:二值神经网络、三元权重网络以及XNOR网络。 + + +蒸馏 +^^^^^^^^^^^^ +蒸馏本质是student模型(参数量较少的模型)对teacher模型(参数量较多)的拟合,student模型从teacher中学到知识,比自己单独学习效果更好,。比较常见的方法通常是由Bert base蒸馏到\ +Bi-LSTM或者是Transformer层数更少的BERT小模型。例如DistilBERT,它保留了BERT-base 97%的精度,\ +减少了40%的参数,推理速度快了60%。 + + +模型压缩示例 +------------ + +下面将会对基于飞桨实现的常见的模型压缩示例进行介绍,其中《由BERT到Bi-LSTM的知识蒸馏》可以作为蒸馏实验的"Hello World"示例。\ +而《使用DynaBERT中的策略对BERT进行压缩》中使用的DynaBERT则是同时对不同尺寸的子网络进行训练,通过该方法训练后可以在推理阶段直接对模型裁剪。 + + +.. 
toctree:: + :maxdepth: 1 + + distill_lstm.rst + ofa_bert.rst diff --git a/docs/advanced_guide/model_compression/ofa_bert.rst b/docs/advanced_guide/model_compression/ofa_bert.rst new file mode 100644 index 0000000000000000000000000000000000000000..5b0db3b18db4c0e2cf6779dc809514e0eff9fb24 --- /dev/null +++ b/docs/advanced_guide/model_compression/ofa_bert.rst @@ -0,0 +1,177 @@ +使用DynaBERT中的策略对BERT进行压缩 +============ + +本教程使用的是 `DynaBERT-Dynamic BERT with Adaptive Width and Depth `_ 中的训练策略。\ +把原始模型作为超网络中最大的子模型,这里超网络指的是包含所有搜索空间在内的一个网络。\ +原始模型包括多个相同大小的Transformer Block。在每次训练前会选择当前轮次要训练的子模型,\ +每个子模型包含多个相同大小的Sub Transformer Block,每个Sub Transformer Block是选择不同宽度的Transformer Block得到的,\ +一个Transformer Block包含一个Multi-Head Attention和一个Feed-Forward Network,Sub Transformer Block获得方式为: + +1. 一个 ``Multi-Head Attention`` 层中有多个Head,每次选择不同宽度的子模型时,会同时对Head数量进行等比例减少,\ +例如:如果原始模型中有12个Head,本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer Block的Head数量为9。 + +2. ``Feed-Forward Network`` 层中 ``Linear`` 的参数大小进行等比例减少,例如:如果原始模型中 ``FFN`` 层的特征维度为3072,\ +本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer Block中 ``FFN`` 层的特征维度为2304。 + + +整体原理介绍 +------------ + +1. 首先对预训练模型的参数和head根据其重要性进行重排序,把重要的参数和head排在参数的前侧,保证训练过程中的参数裁剪不会裁剪掉这些重要的参数。\ +参数的重要性计算是先使用dev数据计算一遍每个参数的梯度,然后根据梯度和参数的整体大小来计算当前参数的重要性,head的重要性计算是通过传入一个\ +全1的对head的mask,并计算这个mask的梯度,根据mask的梯度来判断每个 ``Multi-Head Attention`` 层中每个Head的重要性。 + +2. 使用原本的预训练模型作为蒸馏过程中的教师网络。同时定义一个超网络,这个超网络中最大的子网络的结构和教师网络的结构相同其他小的子网络是对最大网络\ +进行不同的宽度选择来得到的,宽度选择具体指对网络中的参数进行裁剪,所有子网络在整个训练过程中都是参数共享的。 + +3. 使用重排序之后的预训练模型参数初始化超网络,并把这个超网络作为学生网络。分别为 ``Embedding`` 层,为每个transformer block层和最后的logits添加蒸馏损失。 + +4. 每个batch数据在训练前首先会选择当前要训练的子网络配置(子网络配置目前仅包括对整个模型的宽度的选择),参数更新时仅会更新当前子网络计算中用到的那部分参数。 + +5. 通过以上的方式来优化整个超网络参数,训练完成后选择满足加速要求和精度要求的子模型。 + +.. image:: ../../../examples/model_compression/ofa/imgs/ofa_bert.jpg + +.. centered:: 整体流程 + + +基于PaddleSlim进行模型压缩 +------------ + +在本例中,也需要训练基于特定任务的BERT模型,方法同上一篇教程《由BERT到Bi-LSTM的知识蒸馏》。下面重点介绍本例模型压缩的过程。 + +1. 定义初始网络 +^^^^^^^^^^^^ +定义原始BERT-base模型并定义一个字典保存原始模型参数。普通模型转换为超网络之后,由于其组网OP的改变导致原始模型加载的参数失效,所以需要定义一个字典保存原始模型的参数并用来初始化超网络。 + +.. code-block:: + + model = BertForSequenceClassification.from_pretrained('bert', num_classes=2) + origin_weights = {} + for name, param in model.named_parameters(): + origin_weights[name] = param + + +2. 构建超网络 +^^^^^^^^^^^^ +定义搜索空间,并根据搜索空间把普通网络转换为超网络。 + +.. code-block:: + + # 定义搜索空间 + sp_config = supernet(expand_ratio=[0.25, 0.5, 0.75, 1.0]) + # 转换模型为超网络 + model = Convert(sp_config).convert(model) + paddleslim.nas.ofa.utils.set_state_dict(model, origin_weights) + + +3. 定义教师网络 +^^^^^^^^^^^^ +构造教师网络。 + +.. code-block:: + + teacher_model = BertForSequenceClassification.from_pretrained('bert', num_classes=2) + + +4. 配置蒸馏相关参数 +^^^^^^^^^^^^ +需要配置的参数包括教师模型实例;需要添加蒸馏的层,在教师网络和学生网络的 ``Embedding`` 层和每一个 ``Tranformer Block`` 层\ +之间添加蒸馏损失,中间层的蒸馏损失使用默认的MSE损失函数;配置 ``lambda_distill`` 参数表示整体蒸馏损失的缩放比例。 + +.. code-block:: + + mapping_layers = ['bert.embeddings'] + for idx in range(model.bert.config['num_hidden_layers']): + mapping_layers.append('bert.encoder.layers.{}'.format(idx)) + + default_distill_config = { + 'lambda_distill': 0.1, + 'teacher_model': teacher_model, + 'mapping_layers': mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + +5. 定义Once-For-All模型 +^^^^^^^^^^^^ +普通模型和蒸馏相关配置传给 ``OFA`` 接口,自动添加蒸馏过程并把超网络训练方式转为 ``OFA`` 训练方式。 + +.. code-block:: + + ofa_model = paddleslim.nas.ofa.OFA(model, distill_config=distill_config) + + +6. 计算神经元和head的重要性并根据其重要性重排序参数 +^^^^^^^^^^^^ + +.. 
code-block:: + + head_importance, neuron_importance = utils.compute_neuron_head_importance( + 'sst-2', + ofa_model.model, + dev_data_loader, + num_layers=model.bert.config['num_hidden_layers'], + num_heads=model.bert.config['num_attention_heads']) + reorder_neuron_head(ofa_model.model, head_importance, neuron_importance) + + +7. 传入当前OFA训练所处的阶段 +^^^^^^^^^^^^ + +.. code-block:: + + ofa_model.set_epoch(epoch) + ofa_model.set_task('width') + + +8. 传入网络相关配置,开始训练 +^^^^^^^^^^^^ +本示例使用DynaBERT的策略进行超网络训练。 + +.. code-block:: + + width_mult_list = [1.0, 0.75, 0.5, 0.25] + lambda_logit = 0.1 + for width_mult in width_mult_list: + net_config = paddleslim.nas.ofa.utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + ofa_model.model.clear_gradients() + + + +**NOTE** + +由于在计算head的重要性时会利用一个mask来收集梯度,所以需要通过monkey patch的方式重新实现一下 ``BERTModel`` 类的 ``forward`` 函数。示例如下: + +.. code-block:: + + from paddlenlp.transformers import BertModel + def bert_forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=[None, None]): + wtype = self.pooler.dense.fn.weight.dtype if hasattr( + self.pooler.dense, 'fn') else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + + BertModel.forward = bert_forward diff --git a/docs/advanced_guide/prompt.md b/docs/advanced_guide/prompt.md new file mode 100644 index 0000000000000000000000000000000000000000..c8e5d5a98a82109b51456a4b8271b06b9c991612 --- /dev/null +++ b/docs/advanced_guide/prompt.md @@ -0,0 +1,600 @@ +# 提示学习:Prompt API + +随着预训练语言模型规模的增长,“预训练-微调”范式在下游自然语言处理任务上的表现越来越好,但与之相应地对训练数据量和计算存储资源的要求也越来越高。为了充分利用预训练语言模型学习到的知识,同时降低对数据和资源的依赖,**提示学习**(Prompt Learning)作为一种可能的新范式受到了越来越多的关注,在 FewCLUE、SuperGLUE 等榜单的小样本任务上取得了远优于传统微调范式的结果。 + +**提示学习**的核心思想是将下游任务转化为预训练阶段的掩码预测(MLM)任务。实现思路包括通过模板(Template)定义的提示语句,将原有任务转化为预测掩码位置的词,以及通过标签词(Verbalizer)的定义,建立预测词与真实标签之间的映射关系。 + +以情感分类任务为例,“预训练-微调”范式和“预训练-提示”范式(以 [PET](https://arxiv.org/abs/2001.07676) 为例)之间的区别如下图所示 + +
+*(图:“预训练-微调”范式与“预训练-提示”范式(以 PET 为例)的对比示意图)*
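+
+下面用一段简单的代码示意同一条情感分类样本在两种范式下送入模型的文本形式(示例句子与提示语句均为假设,仅作说明;提示语句与标签词的具体定义方式可参考后文“如何定义模板”和“如何定义标签词映射”两节):
+
+```python
+text = "这家餐厅的菜味道让人满意"
+
+# “预训练-微调”范式:取 [CLS] 位置的表示,训练一个随机初始化的分类器,输出 0/1 标签
+finetune_input = "[CLS] " + text + " [SEP]"
+
+# “预训练-提示”范式:拼接提示语句,预测 [MASK] 处的词(如“很”/“不”),再映射回原标签
+prompt_input = text + "这句话表示我[MASK]满意。"
+
+print(finetune_input)
+print(prompt_input)
+```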
+ +【微调学习】使用 `[CLS]` 来做分类,需要训练随机初始化的分类器,需要充分的训练数据来拟合。 + +【提示学习】通过提示语句和标签词映射的定义,转化为 MLM 任务,无需训练新的参数,适用于小样本场景。 + + +Prompt API 提供了这类算法实现的基本模块,支持[PET](https://arxiv.org/abs/2001.07676)、[P-Tuning](https://arxiv.org/abs/2103.10385)、[WARP](https://aclanthology.org/2021.acl-long.381/)、[RGL](https://aclanthology.org/2022.findings-naacl.81/)等经典算法的快速实现。 + +**目录** + +* [如何定义模板](#如何定义模板) + * [离散型模板](#离散型模板) + * [连续型模板](#连续型模板) + * [前缀连续型模板](#前缀连续型模板) + * [快速定义模板](#快速定义模板) +* [如何定义标签词映射](#如何定义标签词映射) + * [离散型标签词映射](#离散型标签词映射) + * [连续型标签词映射](#连续型标签词映射) +* [快速开始训练](#快速开始训练) + * [数据准备](#数据准备) + * [预训练参数准备](#预训练参数准备) + * [定义提示学习模型](#定义提示学习模型) + * [使用PromptTrainer训练](#使用PromptTrainer训练) +* [实践教程](#实践教程) + * [文本分类示例](#文本分类示例) + * 其他任务示例(待更新) +* [Reference](#Reference) + +## 如何定义模板 + +**模板**(Template)的功能是在原有输入文本上增加提示语句,从而将原任务转化为 MLM 任务,可以分为离散型和连续型两种。Prompt API 中提供了统一的数据结构来构造不同类型的模板,输入相应格式的**字符串**,通过解析得到对应的输入模板。模板由不同字段构成,可任意组合。每个字段中的关键字定义了数据文本或者提示文本,即 `input_ids`,属性可定义该字段是否可截断,以及对应的 `position_ids`,`token_type_ids` 等。 + +### 离散型模板 + +离散型模板 `ManualTemplate` 是直接将提示语句与原始输入文本拼接起来,二者的词向量矩阵共享,均为预训练模型学到的词向量矩阵。可用于实现 PET、RGL 等算法。 + +**模板关键字及属性** + +- ``text`` :数据集中原始输入文本对应的关键字,例如,`text_a`、`text_b` 和 `content`。 +- ``hard`` :自定义的提示语句文本。 +- ``mask`` :待预测词的占位符。 + - ``length`` :定义 ``mask`` 的数量。 +- ``sep`` :句间的标志符。不同句子的 `token_type_ids` 需使用 `token_type` 属性定义,默认相同。 +- ``options`` :数据集字典或者文件中的候选标签序列。 + - ``add_omask`` :在每个标签前新增 `[O-MASK]` 字符,用于计算候选标签的预测值。支持实现 [UniMC](https://arxiv.org/pdf/2210.08590.pdf) 算法。 + - ``add_prompt`` :给每个标签拼接固定的提示文本,标签位置由 `[OPT]` 标记。支持实现 [EFL](https://arxiv.org/pdf/2104.14690.pdf) 算法。 + +**模版通用属性** + +- `position`: 定义当前字段的起始 `position id`。 +- `token_type`: 定义当前字段及后续字段的 `token type id`。 +- `truncate`: 定义当提示和文本总长度超过最大长度时,当前字段是否可截断。可选 `True` 和 `False`。 + +**模板定义** + +``` +{'hard': '“'}{'text': 'text_a'}{'hard': '”和“'}{'text': 'text_b'}{'hard': '”之间的逻辑关系是'}{'mask'} +``` + +或者使用简化方式定义,省略关键字 ``hard`` 后与上述模板等价。 + +``` +“{'text': 'text_a'}”和“{'text': 'text_b'}”之间的逻辑关系是{'mask'} +``` + +``` +{'options': './data/label.txt'}{'sep'}下边两句话间的逻辑关系是什么?{'text': 'text_a'}{'sep': None, 'token_type': 1}{'text': 'text_b'} +``` +其中 `label.txt` 为候选标签的本地文件路径,每行一个候选标签,例如 + +``` +中立 +蕴含 +矛盾 +``` + +**样本示例** + +例如,对于自然语言推理任务,给定样本 + +```python +sample = { + "text_a": "心里有些生畏,又不知畏惧什么", "text_b": "心里特别开心", "labels": "矛盾" +} +``` + +按照模板修改拼接后,最终输入模型的文本数据为 + +``` +“心里有些生畏,又不知畏惧什么”和“心里特别开心”之间的逻辑关系是[MASK] +``` + + +**调用 API** + +```python +from paddlenlp.prompt import ManualTemplate +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +template = ManualTemplate(prompt="“{'text': 'text_a'}”和“{'text': 'text_b'}”之间的逻辑关系是{'mask'}", + tokenizer=tokenizer, + max_length=512) +input_dict = template(sample) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义提示语句以及与输入文本组合方式的字符串。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_length`` :定义输入模型文本的最大长度,包括提示部分。 + +**使用技巧** + +不同模板定义对结果的影响很明显。一般来说,提示语句与原始输入文本拼接后,语句越通顺自然,模型效果越好。在实践中,对于不同的任务需要分析文本特点,尝试不同的模板以取得好的效果。 + + +### 连续型模板 + +离散型模板的使用难点在于设计一个好的提示语句需要很多经验和语言专业知识。为了解决这一问题,连续型模板 `SoftTemplate` 尝试使用一组连续性 prompt 向量作为模板,这样模型训练时就无需人工给定提示语句。当然,`SoftTemplate` 也支持用人工构造的提示来初始化 prompt 向量。与离散型模板的区别在于连续型提示向量与输入文本的词向量矩阵不共享,二者在训练过程中分别进行参数更新。可用于实现 P-Tuning 等算法。 + +除此之外,连续型模板还支持混合模板定义,即在原始输入上同时拼接离散型提示和连续型提示向量。 + +**模板关键字** + +- ``text`` :数据集中原始输入文本对应的关键字,例如,`text_a`和`text_b`。 +- ``hard`` :自定义的文本提示语句。 +- ``mask`` :待预测词的占位符。 +- ``sep`` :句间的标志符。不同句子的 `token_type_ids` 需使用 `token_type` 属性定义,默认相同。 +- ``soft`` 表示连续型提示。若值为 ``None`` 
,则随机初始化提示向量;若值为文本,则使用文本对应的预训练字向量初始化提示向量。 + - ``length`` :定义 ``soft token`` 的数量。若定义文本长度小于该值,超过部分随机初始化。 + - ``encoder`` :定义 `soft token` 的编码器类型,可选 `lstm`,`mlp`。默认为 `None`, 不使用编码器。 + - ``hidden_size`` :定义编码器的隐藏层维度。默认与预训练词向量维度相同。 +- ``options`` :数据集字典或者文件中的候选标签序列。 + - ``add_omask`` :在每个标签前新增 `[O-MASK]` 字符,用于计算候选标签的预测值。支持实现 [UniMC](https://arxiv.org/pdf/2210.08590.pdf) 算法。 + - ``add_prompt`` :给每个标签拼接固定的提示文本,标签位置由 `[OPT]` 标记。支持实现 [EFL](https://arxiv.org/pdf/2104.14690.pdf) 算法。 + +**模版通用属性** + +- `position`: 定义当前字段的起始 `position id`。 +- `token_type`: 定义当前字段及后续字段的 `token type id`。 +- `truncate`: 定义当提示和文本总长度超过最大长度时,当前字段是否可截断。可选 `True` 和 `False`。 + +**模板定义** + +- 定义长度为 1 的连续型提示,随机初始化: + +```python +"{'soft'}{'text': 'text_a'}{'sep': None, 'token_type': 1}{'text': 'text_b'}" +``` + +- 定义长度为 10 的连续型提示,随机初始化,编码器为 `mlp`: + +```python +"{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': None, 'length':10, 'encoder': 'mlp'}{'mask'}" +``` + +- 定义长度为 15 的连续型提示,使用 `请判断` 初始化前三个 soft token,其余随机初始化,编码器为隐藏层维度为 100 的双层 LSTM: + +```python +"{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断:', 'length': 15, 'encoder': 'lstm', 'hidden_size': 100}{'mask'}" +``` + +- 定义长度为 15 的连续型提示,使用 `"请判断这两个句子间的逻辑关系:"` 的预训练词向量逐一进行初始化: + +```python +"{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断这两个句子间的逻辑关系:'}{'mask'}" +``` + +- 定义混合模板,这里`soft`关键字对应的提示和`hard`对应的提示对应两套不同的向量: + +```python +"{'soft': '自然语言推理任务:'}{'text': 'text_a'}{'sep'}{'text': 'text_b'}这两个句子间的逻辑关系是{'mask'}" +``` + + +**调用 API** + +```python +from paddlenlp.prompt import SoftTemplate +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +template = SoftTemplate(prompt="{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断这两个句子间的逻辑关系:'}{'mask'}", + tokenizer=tokenizer, + max_length=512, + word_embeddings=model.get_input_embeddings()) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义连续型模板的提示语句、初始化以及与输入文本组合方式的字符串。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_seq_length`` :定义输入模型文本的最大长度,包括提示部分。 +- ``word_embeddings`` :预训练语言模型的词向量,用于连续型提示向量初始化。 +- ``soft_embeddings`` :连续型提示向量矩阵,可用于不同模板间的连续型参数共享。设置后将覆盖默认连续型向量矩阵。 + +**使用技巧** + +- 对于分类任务,推荐的连续型提示长度一般为10-20。 +- 对于随机初始化的连续性 prompt 向量,通常用比预训练模型微调更大的学习率来更新参数。 +- 与离散型模板相似,连续型模板对初始化参数也比较敏感。自定义提示语句作为连续性 prompt 向量的初始化参数通常比随机初始化效果好。 +- prompt_encoder 为已有论文中的策略,用于建模不同连续型提示向量之间的序列关系。 + + +### 前缀连续型模板 + +`PrefixTemplate` 同样使用了连续型向量作为提示,与 `SoftTemplate` 的不同,该模版的提示向量不仅仅作用于输入层,每层都会有相应的提示向量。可用于实现 P-Tuning 等算法。 + +**模板关键字** + +- ``text`` :数据集中原始输入文本对应的关键字,例如,`text_a`和`text_b`。 +- ``hard`` :自定义的文本提示语句。 +- ``mask`` :待预测词的占位符。 +- ``sep`` :句间的标志符。不同句子的 `token_type_ids` 需使用 `token_type` 属性定义,默认相同。 +- ``prefix`` 表示连续型提示,该字段**必须**位于模板首位。若值为 ``None`` ,则随机初始化提示向量;若值为文本,则使用文本对应的预训练字向量初始化提示向量。 + - ``length`` :定义 ``soft token`` 的数量。若定义文本长度小于该值,超过部分随机初始化。 + - ``encoder`` :定义 `soft token` 的编码器类型,可选 `lstm`,`mlp`。默认为 `None`, 不使用编码器。 + - ``hidden_size`` :定义编码器的隐藏层维度。默认与预训练词向量维度相同。 +- ``options`` :数据集字典或者文件中的候选标签序列。 + - ``add_omask`` :在每个标签前新增 `[O-MASK]` 字符,用于计算候选标签的预测值。支持实现 [UniMC](https://arxiv.org/pdf/2210.08590.pdf) 算法。 + - ``add_prompt`` :给每个标签拼接固定的提示文本,标签位置由 `[OPT]` 标记。支持实现 [EFL](https://arxiv.org/pdf/2104.14690.pdf) 算法。 + +**模版通用属性** + +- `position`: 定义当前字段的起始 `position id`。 +- `token_type`: 定义当前字段及后续字段的 `token type id`。 +- `truncate`: 定义当提示和文本总长度超过最大长度时,当前字段是否可截断。可选 `True` 和 `False`。 + +**模板定义** + +- 定义长度为 15 的连续型提示,随机初始化: + +```python +"{'prefix': '新闻类别', 'length': 10, 'encoder': 
'lstm'}{'text': 'text_a'}" +``` + +- 定义混合模板,这里`prefix`关键字对应的提示和`hard`对应的提示对应两套不同的向量: + +```python +"{'prefix': '自然语言推理任务:', 'encoder': 'mlp'}{'text': 'text_a'}{'sep'}{'text': 'text_b'}这两个句子间的逻辑关系是{'mask'}" +``` + + +**调用 API** + +```python +from paddlenlp.prompt import PrefixTemplate +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +template = PrefixTemplate(prompt="{'prefix': '任务描述'}{'text': 'text_a'}{'mask'}", + tokenizer=tokenizer, + max_length=512, + model=model, + prefix_dropout=0.1) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义连续型模板的提示语句、初始化以及与输入文本组合方式的字符串。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_length`` :定义输入模型文本的最大长度,包括提示部分。 +- ``model`` :预训练语言模型,用于连续型提示向量初始化,以及根据模型结构生成每层对应的提示向量。 +- ``prefix_dropout`` :连续型提示向量的丢弃概率,用于正则化。 + + +### 快速定义模板 + +PaddleNLP 提供了 ``AutoTemplate`` API 快速定义简化离散型模板,也可根据完整模板字符串自动切换 ManualTemplate、SoftTemplate 和 PrefixTemplate。 + +**模板定义** + +- 快速定义离散型的文本提示。例如, + +```python +"这篇文章表达了怎样的情感?" +``` + +等价于 + +```python +"{'text': 'text_a'}{'hard': '这篇文章表达了怎样的情感?'}{'mask'}" +``` + +- 当输入为完整模板字符串时,解析得到的模板与[离散型模板](#离散型模板)和[连续型模板](#连续型模板)中描述的一致。 + +**调用 API** + +```python +from paddlenlp.prompt import AutoTemplate +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +# 离散型模板,返回值为 ManualTemplate 实例 +template = AutoTemplate.create_from(prompt="这个句子表达了怎样的情感?", + tokenizer=tokenizer, + max_length=512) + +template = AutoTemplate.create_from(prompt="这个句子表达了怎样的情感?{'text': 'text_a'}{'mask'}", + tokenizer=tokenizer, + max_length=512) + +# 连续型模板,返回值为 SoftTemplate 实例 +template = AutoTemplate.create_from(prompt="{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断这两个句子间的逻辑关系:'}{'mask'}", + tokenizer=tokenizer, + max_length=512, + model=model) + +# 前缀连续型模板,返回值为 PrefixTemplate 实例 +template = AutoTemplate.create_from(prompt="{'prefix': None, 'encoder': 'mlp', 'hidden_size': 50}{'text': 'text_a'}", + tokenizer=tokenizer, + max_length=512, + model=model) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义离散型/连续型提示、初始化以及和输入文本的组合方式。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_length`` :定义输入模型文本的最大长度,包括提示部分。 +- ``model`` :预训练语言模型,为了取预训练词向量用于连续型提示向量初始化。 + +## 如何定义标签词映射 + +**标签词映射**(Verbalizer)也是提示学习中可选的重要模块,用于建立预测词和标签之间的映射,将“预训练-微调”模式中预测标签的任务转换为预测模板中掩码位置的词语,从而将下游任务统一为预训练任务的形式。目前框架支持了离散型标签词映射和连续型标签词映射 [Word-level Adversarial ReProgramming (WARP)](https://aclanthology.org/2021.acl-long.381/) 方法。 + + +例如,在情感二分类任务中,微调方法和提示学习的标签体系如下 + +- **微调方式** : 数据集的标签为 ``负向`` 和 ``正向``,分别映射为 ``0`` 和 ``1`` ; + +- **提示学习** : 通过下边的标签词映射建立原始标签与预测词之间的映射。 + +``` python +{'负向': '不', '正向': '很'} +``` + +具体来说,对于模板 ``{'text':'text_a'}这句话表示我{'mask'}满意。`` ,我们使用映射 ``{'负向': '不', '正向': '很'}`` 将标签 ``负向`` 映射为 ``不`` ,将标签 ``正向`` 映射为 ``很`` 。也就是说,我们期望对于正向情感的文本,预测结果为 ``...这句话表示我很满意。`` ,对于负向情感的文本,预测结果为 ``...这句话表示我不满意。`` + + +### 离散型标签词映射 + +``ManualVerbalizer`` 支持构造 ``{'mask'}`` 对应的标签词映射,同一标签可对应多个不同长度的词,直接作用于 ``AutoMaskedLM`` 模型结构。当标签对应的预测词长度大于 ``1`` 时,默认取均值;当标签对应多个 `{'mask'}` 时,默认与单个 `{mask}` 效果等价。 + +**调用 API** + +```python +from paddlenlp.prompt import ManualVerbalizer +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +verbalizer = ManualVerbalizer(tokenizer=tokenizer, + label_words={'负向': '不', '正向': '很'}) +``` + +其中初始化参数定义如下 + +- ``label_words`` : 
原标签到预测词之间的映射字典。 +- ``tokenizer`` : 预训练模型的 tokenizer,用于预测词的编码。 + +``MaskedLMVerbalizer`` 同样支持构造 ``{'mask'}`` 对应的标签词映射,映射词与模板中的 `{'mask'}` 逐字对应,因此,映射词长度应与 `{'mask'}` 数量保持一致。当定义的标签词映射中同一标签对应多个词时,仅有第一个映射词生效。在自定义的 `compute_metric` 函数中需先调用 `verbalizer.aggregate_multiple_mask` 将多 `{'mask'}` 合并后再计算评估函数,默认使用乘积的方式。 + +**调用 API** +```python +from paddlenlp.prompt import MaskedLMVerbalizer +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +verbalizer = MaskedLMVerbalizer(tokenizer=tokenizer, + label_words={'负向': '不', '正向': '很'}) +``` + +其中初始化参数定义如下 + +- ``label_words`` : 原标签到预测词之间的映射字典。 +- ``tokenizer`` : 预训练模型的 tokenizer,用于预测词的编码。 + +### 连续型标签词映射 + +标签词映射分类器 ``SoftVerbalizer`` 修改了原 ``AutoMaskedLM`` 的模型结构,将预训练模型最后一层“隐藏层-词表”替换为“隐藏层-标签”的映射。该层网络的初始化参数由标签词映射中的预测词词向量来决定,如果预测词长度大于 ``1`` ,则使用词向量均值进行初始化。当前支持的预训练模型包括 ``ErnieForMaskedLM`` 、 ``BertForMaskedLM`` 、 ``AlbertForMaskedLM`` 和 ``RobertaForMaskedLM`` 。可用于实现 WARP 算法。 + + +**调用 API** + +```python +from paddlenlp.prompt import SoftVerbalizer +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +verbalizer = SoftVerbalizer(label_words={'负向': '生气', '正向': '高兴'}, + tokenizer=tokenizer, + model=model) +``` + +- ``label_words`` : 原标签到预测词之间的映射字典。 +- ``tokenizer`` : 预训练模型的 tokenizer,用于预测词的编码。 +- ``model`` :预训练语言模型,用于取预训练词向量进行“隐藏层-标签”网络的修改和初始化。 + +## 快速开始训练 + +本节介绍了如何使用 ``PromptTrainer`` 快速搭建提示训练流程。 + +### 数据准备 + +数据集封装为 ``MapDataset`` 类型。每条数据格式为字典结构,字典中关键字与模板中 `text` 定义的值相对应,统一使用 `labels` 关键字表示样本标签。 + +例如,文本语义相似度 BUSTM 数据集中的数据样本 + +```python +from paddlenlp.datasets import MapDataset + +data_ds = MapDataset([ + {'id': 3, 'sentence1': '你晚上吃了什么', 'sentence2': '你晚上吃啥了', 'label': 1}, + {'id': 4, 'sentence1': '我想打开滴滴叫的士', 'sentence2': '你叫小欧吗', 'label': 0}, + {'id': 5, 'sentence1': '女孩子到底是不是你', 'sentence2': '你不是女孩子吗', 'label': 1} +]) + +def convert_label_keyword(input_dict): + input_dict["labels"] = input_dict.pop("label") + return input_dict + +data_ds = data_ds.map(convert_label_keyword) +``` + +### 预训练参数准备 + +如果使用标签词映射,用 ``AutoModelForMaskedLM`` 和 ``AutoTokenizer`` 加载预训练模型参数。如果不使用标签词映射,可将 ``AutoModelForMaskedLM`` 替换为任务对应的模型。 + +```python +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +``` + + +### 定义提示学习模型 + +对于文本分类任务,我们将模板预处理和标签词映射封装为提示学习模型 ``PromptModelForSequenceClassification`` 。 + + +```python +from paddlenlp.prompt import AutoTemplate +from paddlenlp.prompt import ManualVerbalizer +from paddlenlp.prompt import PromptModelForSequenceClassification + +# 定义模板 +template = AutoTemplate.create_from(prompt="{'text': 'text_a'}和{'text': 'text_b'}说的是{'mask'}同的事情。", + tokenizer=tokenizer, + max_length=512) + +# 定义标签词映射 +verbalizer = ManualVerbalizer(label_words={0: '不', 1: '相'}, + tokenizer=tokenizer) + +# 定义文本分类提示模型 +prompt_model = PromptModelForSequenceClassification(model, + template, + verbalizer, + freeze_plm=False, + freeze_dropout=False) +``` + +其中提示模型初始化参数如下 + +- ``model`` : 预训练模型实例,支持 ``AutoModelForMaskedLM`` 和 ``AutoModelForSequenceClassification`` 。 +- ``template`` : 模板实例。 +- ``verbalizer`` : 标签词映射实例。当设为 ``None`` 时,不使用标签词映射,模型输出及损失值计算由 ``model`` 类型定义。 +- ``freeze_plm`` : 在训练时固定预训练模型参数,默认为 `False`。对于轻量级预训练模型,推荐使用默认值。 +- ``freeze_dropout`` : 在训练时固定预训练模型参数并关闭 ``dropout`` 。 当 
``freeze_dropout=True`` ,``freeze_plm`` 也为 ``True`` 。 + + +### 使用PromptTrainer训练 + +``PromptTrainer`` 继承自 ``Trainer`` , 封装了数据处理,模型训练、测试,训练策略等,便于训练流程的快速搭建。 + +**配置训练参数** + +``PromptTuningArguments`` 继承自 ``TrainingArguments`` ,包含了提示学习的主要训练参数。其中 ``TrainingArguments`` 参数见 `Trainer API 文档 `_ ,其余参数详见 [Prompt Trainer参数列表](#PromptTrainer参数列表) 。推荐使用 **命令行** 的形式进行参数配置,即 + +```shell +python xxx.py --output_dir xxx --learning_rate xxx +``` + +除了训练参数,还需要自定义数据和模型相关的参数。最后用 ``PdArgumentParser`` 输出参数。 + +```python +from dataclasses import dataclass, field +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.prompt import PromptTuningArguments + +@dataclass +class DataArguments: + data_path : str = field(default="./data", metadata={"help": "The path to dataset."}) + +parser = PdArgumentParser((DataArguments, PromptTuningArguments)) +data_args, training_args = parser.parse_args_into_dataclasses( + args=["--output_dir", "./", "--do_train", "True"], look_for_args_file=False) +``` + +**初始化和训练** + +除了上述准备,还需要定义损失函数和评估函数。 + +```python + +import paddle +from paddle.metric import Accuracy +from paddlenlp.prompt import PromptTrainer + +# 损失函数 +criterion = paddle.nn.CrossEntropyLoss() + +# 评估函数 +def compute_metrics(eval_preds): + metric = Accuracy() + correct = metric.compute(paddle.to_tensor(eval_preds.predictions), + paddle.to_tensor(eval_preds.label_ids)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + +# 初始化 +trainer = PromptTrainer(model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=data_ds, + eval_dataset=None, + callbacks=None, + compute_metrics=compute_metrics) + +# 训练模型 +if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() +``` + +## 实践教程 + +### 文本分类示例 + + +- [多分类文本分类示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/few-shot) + +- [多标签文本分类示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/few-shot) + +- [多层次文本分类示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/few-shot) + + +## Reference + +- Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. [[PDF]](https://arxiv.org/abs/2001.07676) +- GPT Understands, Too. [[PDF]](https://arxiv.org/abs/2103.10385) +- WARP: Word-level Adversarial ReProgramming. [[PDF]](https://aclanthology.org/2021.acl-long.381/) +- RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning. [[PDF]](https://aclanthology.org/2022.findings-naacl.81/) +- R-Drop: Regularized Dropout for Neural Networks. [[PDF]](https://arxiv.org/abs/2106.14448) +- Openprompt: An open-source framework for prompt-learning. 
[[PDF]](https://arxiv.org/abs/2111.01998) + + +### 附录 + + +#### PromptTrainer参数列表 + + +| 参数 | 类型 | 默认值 | 含义 | +| ---------------- | ------ | ------- | ------------------------------------------------------- | +| max_seq_length | int | 512 | 模型输入的最大长度,包括模板部分 | +| freeze_plm | bool | False | 是否在训练时固定预训练模型的参数 | +| freeze_dropout | bool | False | 是否在训练时固定预训练模型的参数,同时关闭 dropout | +| use_rdrop | bool | False | 是否使用 RDrop 策略,详见 [RDrop 论文](https://arxiv.org/abs/2106.14448) | +| alpha_rdrop | float | 5.0 | RDrop Loss 的权重 | +| use_rgl | bool | False | 是否使用 RGL 策略,详见 [RGL 论文](https://aclanthology.org/2022.findings-naacl.81/) | +| alpha_rgl | float | 0.5 | RGL Loss 的权重 | +| ppt_learning_rate| float | 1e-4 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的学习率 | +| ppt_weight_decay | float | 0.0 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的衰减参数 | +| ppt_adam_beta1 | float | 0.9 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的 beta1 | +| ppt_adam_beta2 | float | 0.999 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的 beta2 | +| ppt_adam_epsilon | float | 1e-8 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的 epsilon| diff --git a/docs/clear_api.py b/docs/clear_api.py new file mode 100644 index 0000000000000000000000000000000000000000..136beaae97d311ac58ce1a83301545c921077f72 --- /dev/null +++ b/docs/clear_api.py @@ -0,0 +1,138 @@ +import os +import re + + +def modify_doc_title_dir(abspath_rstfiles_dir): + """ + rst文件中:有‘========’和‘----------’行的表示其行上一行的文字是标题, + ‘=’和‘-’要大于等于标题的长度。 + 使用sphinx-apidoc -o ./source/rst_files /home/myubuntu/pro/mypro命令将 + 生成rst文件放在./source/rst_files目录下, 执行sphinx-quickstart命令生成的 + index.rst不用放到这个目录中。 或在source目录下新建 + rst_files目录然后将rst文件剪切到这个目录下,修改后再剪切出来 + 生成rst文件后将rst_files/modules.rst文件中的标题去掉,并修改maxdepth字段。 + 删除和修改使用sphinx-apidoc -o 命令的生成的rst文件中的标题 + :param abspath_rstfiles_dir: rst文件所在的文件夹的绝对路径 + :return: + """ + rst_files = os.listdir(abspath_rstfiles_dir) + # 要删除的节点(标题目录的节点) + del_nodes = ["Submodules", "Module contents", "Subpackages"] + # 要删除的标题中的字符串 + del_str = [" module", " package"] + # datasets需要的部分 + dataset_list = ["datasets", "dataset"] + # 需要call方法 + add_call_files = [ + "data.collate", + "data.iterator", + "data.sampler", + "data.tokenizer", + "data.vocab", + "tokenizer\_utils", + ] + # 删除inheritance + del_inheritance = [ + "crf", + "tcn", + "distributed", + "dataset", + "paraller", + "decoder", + "rdrop", + "decoding", + "fast\_transformer", + "Adamoptimizer", + "attention\_utils", + "model\_utils", + "batch\_sampler", + "model", + ] + # 文档中空白的part,不显示 + del_rst = ["iterator", "constant"] + for rst_file in rst_files: + f = open(os.path.join(abspath_rstfiles_dir, rst_file), "r") + file_lines = f.readlines() + f.close() + write_con = [] + flag = 0 + first_line = file_lines[0] + # 去除不需要的datasets + if "datasets" in first_line: + name = first_line.split()[0] + length = len(name.split(".")) + # paddlenlp.datasets 需要留下 + if length > 2: + if "datasets.dataset" not in first_line: + path = os.path.join(abspath_rstfiles_dir, rst_file) + print(path) + os.remove(path) + print(path) + continue + # 去除文档中空白页面,目前是data.iterator, embeddings.constant部分 + del_rst_flag = 0 + for pattern in del_rst: + if pattern in first_line: + path = os.path.join(abspath_rstfiles_dir, rst_file) + os.remove(path) + del_rst_flag = 1 + break + if del_rst_flag == 1: + continue + # 是否加入call + add_call_files_flag = 0 + for i in add_call_files: + if i in first_line: + add_call_files_flag = 1 + # 是否删除inheritance + del_inheritance_flag = 0 + for j in del_inheritance: + if j in first_line: + del_inheritance_flag = 1 + if "modeling" in first_line: + 
del_inheritance_flag = 0 + for file_line in file_lines: + if file_line.strip() in del_nodes: + flag = 1 + continue + if flag: + flag = 0 + continue + if re.search(del_str[0], file_line): + length = len(file_line.split(".")) + if length > 2: + modify_line = file_line.split(".")[-1].replace(del_str[0], "") + else: + modify_line = file_line.replace(del_str[0], "") + write_con.append(modify_line) + continue + if re.search(del_str[1], file_line): + length = len(file_line.split(".")) + if length > 2: + modify_line = file_line.split(".")[-1].replace(del_str[1], "") + else: + modify_line = file_line.replace(del_str[1], "") + write_con.append(modify_line) + continue + if "undoc-members" in file_line: + if "no-undoc-members" not in file_line: + file_line = file_line.replace("undoc-members", "no-undoc-members") + # 去除datasets中多余内容 + if "paddlenlp.datasets" in file_line: + last_name = file_line.split(".")[-1] + if last_name.strip() not in dataset_list: + continue + if "show-inheritance" in file_line: + if del_inheritance_flag == 0: + write_con.append(file_line) + else: + write_con.append(file_line) + if add_call_files_flag == 1: + write_con.append(" :special-members: __call__\n") + f = open(os.path.join(abspath_rstfiles_dir, rst_file), "w") + f.writelines(write_con) + f.close() + + +if __name__ == "__main__": + modify_doc_title_dir("./source") diff --git a/docs/community/contribute_datasets/how_to_write_a_DatasetBuilder.rst b/docs/community/contribute_datasets/how_to_write_a_DatasetBuilder.rst new file mode 100644 index 0000000000000000000000000000000000000000..a7c20303259bc859282ec79d968d36975adc7b65 --- /dev/null +++ b/docs/community/contribute_datasets/how_to_write_a_DatasetBuilder.rst @@ -0,0 +1,123 @@ +============== +创建 :class:`DatasetBuilder` +============== + +数据集的贡献通过定义一个 :class:`DatasetBuilder` 的子类来实现。一个合格的 :class:`DatasetBuilder` 需要遵循一些协议和规范。 + +下面我们以 :obj:`LCQMC` 为例了解一下 :class:`DatasetBuilder` 通常需要包含哪些方法和参数。 + +成员变量 +--------------- + +.. code-block:: + + from paddle.dataset.common import md5file + from paddle.utils.download import get_path_from_url + from paddlenlp.utils.env import DATA_HOME + + class LCQMC(DatasetBuilder): + """ + LCQMC:A Large-scale Chinese Question Matching Corpus + More information please refer to `https://www.aclweb.org/anthology/C18-1166/` + + """ + lazy = False + URL = "https://bj.bcebos.com/paddlehub-dataset/lcqmc.tar.gz" + MD5 = "62a7ba36f786a82ae59bbde0b0a9af0c" + META_INFO = collections.namedtuple('META_INFO', ('file', 'md5')) + SPLITS = { + 'train': META_INFO( + os.path.join('lcqmc', 'train.tsv'), + '2193c022439b038ac12c0ae918b211a1'), + 'dev': META_INFO( + os.path.join('lcqmc', 'dev.tsv'), + 'c5dcba253cb4105d914964fd8b3c0e94'), + 'test': META_INFO( + os.path.join('lcqmc', 'test.tsv'), + '8f4b71e15e67696cc9e112a459ec42bd'), + } + +首先贡献的数据集需要继承 :class:`paddlenlp.datasets.DatasetBuilder` 类,类名格式为camel case。之后应该添加一段注释,简要说明数据集的来源等信息。之后需定义以下成员变量: + +- :attr:`lazy` :数据集的默认类型。:obj:`False` 对应 :class:`MapDataset` ,:obj:`True` 对应 :class:`IterDataset` 。 +- :attr:`URL` :数据集压缩包下载地址,需提供有效并稳定的下载链接。如果数据集不是压缩包,可以不再这里提供。 +- :attr:`MD5` :数据集压缩包的md5值,用于文件校验,如果数据集文件不是压缩包,可以不再这里提供。 +- :attr:`META_INFO` :数据集split信息格式。 +- :attr:`SPLITS` :数据集的split信息,包含数据集解压后的不同文件的具体位置,文件名,md5值等,如果数据集不是压缩包则通常在这里提供下载地址,还可以包含诸如不同文件对应的文件读取参数等信息。 + +除此之外,不同的数据集可能还需要诸如 :attr:`VOCAB_INFO` 等其他成员变量(参见 `iwslt15.py `__ )。或者成员变量会有其他格式。贡献者可以根据实际情况自行调整。 + +.. 
note:: + + - 如果贡献的数据集没有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`SPLITS` 成员变量,且该变量必须是一个字典,字典的key是该数据集包含的splits。 + - 如果贡献的数据集有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`BUILDER_CONFIGS` 成员变量,且该变量必须是一个字典,字典的key是该数据集包含的子数据集的 :attr:`name` 。字典的value是包含该数据集的子数据集split信息的字典,key值必须是 `splits` 。具体格式(参见 `glue.py `__ ) + +:func:`_get_data` 方法 +----------------------- + +.. code-block:: + + def _get_data(self, mode, **kwargs): + ''' Check and download Dataset ''' + default_root = os.path.join(DATA_HOME, self.__class__.__name__) + filename, data_hash = self.SPLITS[mode] + fullname = os.path.join(default_root, filename) + if not os.path.exists(fullname) or (data_hash and + not md5file(fullname) == data_hash): + get_path_from_url(self.URL, default_root, self.MD5) + + return fullname + +:func:`_get_data` 方法根据传入的 :attr:`mode` 和数据集的split信息定位到具体数据集文件。首先进行md5值校验本地文件,若校验失败则调用 :func:`paddle.utils.download.get_path_from_url` 方法下载并校验数据集文件,最后返回数据集文件的本地地址。 + +:func:`_read` 方法 +----------------------- + +.. code-block:: + + def _read(self, filename): + """Reads data.""" + with open(filename, 'r', encoding='utf-8') as f: + head = None + for line in f: + data = line.strip().split("\t") + if not head: + head = data + else: + query, title, label = data + yield {"query": query, "title": title, "label": label} + +:func:`_read` 方法根据传入的文件地址读取数据。该方法必须是一个生成器,以确保 :class:`DatasetBuilder` 可以构造 :class:`MapDataset` 和 :class:`IterDataset` 两种数据集。 +当不同split对应的数据文件读取方式不同时,该方法还需要支持 :attr:`split` 参数,并支持不同split下的读取方式。 + +.. note:: + + - 该方法提供的每条example都应是一个 :class:`Dictionary` 对象。 + - :class:`DatasetBuilder` 在生成Dataset时提供了将class label转换为id的功能。如果用户需要此功能,需要将example中label对应的key设置为 **"label"** 或 **"labels"** ,并在类中正确添加 :func:`get_labels` 方法。 + +:func:`get_labels` 方法 +----------------------- + +.. code-block:: + + def get_labels(self): + """ + Return labels of the LCQMC object. + """ + return ["0", "1"] + +:func:`get_labels` 方法返回一个由该数据集中所有label组成的list。用于将数据集中的class label转换为id,并且这个list之后会作为实例变量传给生成的数据集。 + +:func:`get_vocab` 方法 +----------------------- + +如果数据集提供词典文件,则需要加入 :func:`get_vocab` 方法和 :attr:`VOCAB_INFO` 变量。 + +该方法会根据 :attr:`VOCAB_INFO` 变量返回一个包含数据集词典信息的 :class:`Dictionary` 对象并作为实例变量传给生成的数据集。用于在训练过程中初始化 :class:`paddlenlp.data.Vocab` 对象。 +该方法的写法请参考 `iwslt15.py `__ 。 + +.. note:: + + - 贡献数据集时 :func:`get_labels` 和 :func:`get_vocab` 方法是可选的,视具体数据集内容而定。 :func:`_read` 和 :func:`_get_data` 方法是 **必须包含** 的。 + - 如果您不希望在数据获取过程中进行md5值校验,可以不用给出相关成员变量和校验代码。 + diff --git a/docs/community/contribute_datasets/index.rst b/docs/community/contribute_datasets/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..e18717438ba1414365b7450cf2df883bd0c00883 --- /dev/null +++ b/docs/community/contribute_datasets/index.rst @@ -0,0 +1,9 @@ +============ +如何贡献数据集 +============ + +.. toctree:: + :maxdepth: 1 + + sharing_dataset.rst + how_to_write_a_DatasetBuilder.rst \ No newline at end of file diff --git a/docs/community/contribute_datasets/sharing_dataset.rst b/docs/community/contribute_datasets/sharing_dataset.rst new file mode 100644 index 0000000000000000000000000000000000000000..1f94a5c378d15958e37c41344ff6a0439a303b66 --- /dev/null +++ b/docs/community/contribute_datasets/sharing_dataset.rst @@ -0,0 +1,98 @@ +======================== +分享你的数据集 +======================== + +除了使用PaddleNLP内置的数据集以外,我们也鼓励用户向PaddleNLP贡献自己的数据集。 + +下面我们来介绍一下贡献数据集的详细流程: + +配置环境 +--------------- + +#. 编写和测试PaddleNLP代码需要依赖python3.6以上版本以及最新版本的PaddlePaddle。请确保正确安装以上依赖。 +#. 在PaddleNLP的github页面上点击Fork按钮,在自己的github中创建一份PaddleNLP repo的副本。 +#. 
将您frok的内容下载到本地,并将官方repo作为remote。 + + .. code-block:: + + git clone https://github.com/USERNAME/PaddleNLP + cd PaddleNLP + git remote add upstream https://github.com/PaddlePaddle/PaddleNLP.git + +#. 安装pre-commit钩子,它可以帮助我们格式化源代码,再提交前自动检查代码问题。不满足钩子的PR **不能** 被提交到PaddleNLP。 + + .. code-block:: + + pip install pre-commit + pre-commit install + +添加一个 :class:`DatasetBuilder` +---------------------------------- + +#. 创建一个新的本地分支,一般从develop 分支上创建新分支。 + + .. code-block:: + + git checkout -b my-new-dataset + +#. 找到您本地repo下的 `PaddleNLP/paddlenlp/datasets/` 路径,PaddleNLP的所有数据集代码都储存在这个文件夹下。 + + .. code-block:: + + cd paddlenlp/datasets + +#. 为您的数据集确定一个 `name`,例如 `squad` , `chnsenticorp` 等,这个 `name` 就是您的数据集被读取时的名称。 + + .. note:: + + - 为了方便别人使用您的数据集,确保这个 `name` **不会太长而且能够正确的表义**。 + - 数据集的 `name` 格式应为snake case。 + +#. 在该路径下创建python文件,文件名是数据集的 `name`,例如 `squad.py` 。并在这个文件中编写数据集的 :class:`DatasetBuilder` 代码。 + + :class:`DatasetBuilder` 的编写可以参考教程 :doc:`如何创建一个DatasetBuilder <./how_to_write_a_DatasetBuilder>` 。里面给出了详细的步骤和规范。 + + 我们也推荐您参考已有数据集的 :class:`DatasetBuilder` 进行创建,从已有代码copy一些共用部分可能对您编写自己的数据集代码有所帮助,下面是一些已有数据集的示例: + + - `iwslt15.py `__ 翻译数据集,包含词表文件。 + - `glue.py `__ glue数据集,包含多个子数据集,文件格式为tsv。 + - `squad.py `__ 阅读理解数据集,文件格式为json。 + - `imdb.py `__ imdb数据集,每个split包含多个文件。 + - `ptb.py `__ 语料库数据集。 + - `msra_ner.py `__ 序列标注数据集。 + +#. 开发完成后,可以使用 :attr:`load_dataset` 测试您创建的数据集中的split能否正确被识别。也可以使用 :attr:`print` 看看数据集读入的格式是否符合您的预期: + + .. code-block:: + + from paddlenlp.datasets import load_dataset + + ds = load_dataset('your_dataset_name', splits='your_split') + print(ds[0]) + +提交您的成果 +--------------- + +#. 当您认为数据集的代码已经ready后,就可以在本地commit您的修改了: + + .. code-block:: + + git add PaddleNLP/paddlenlp/datasets/your_dataset_name.py + git commit + +#. 在提交修改之前,最好获取获取先upstream的最新代码并更新当前分支。 + + .. code-block:: + + git fetch upstream + git pull upstream develop + +#. 将本地的修改推送到GitHub上,并在GitHub上向PaddleNLP提交Pull Request。 + + .. code-block:: + + git push origin my-new-dataset + +以上就是像PaddleNLP贡献数据集的完整流程了。我们看到您的PR后会尽快review,如果有任何问题都会尽快反馈给您。如果没有问题的话我们就会合入到PaddleNLP repo,您贡献的数据集就可以供其他人使用啦。 + +如果您对贡献数据集还有任何疑问,欢迎加入官方QQ技术交流群: 973379845向我们提出。我们会尽快为您解答。 \ No newline at end of file diff --git a/docs/community/contribute_docs.rst b/docs/community/contribute_docs.rst new file mode 100644 index 0000000000000000000000000000000000000000..19af97658064076f9b7c7c56246ad5dbe617ccdb --- /dev/null +++ b/docs/community/contribute_docs.rst @@ -0,0 +1,3 @@ +============== +如何贡献问答、案例 +============== diff --git a/docs/community/contribute_models/contribute_awesome_pretrained_models.rst b/docs/community/contribute_models/contribute_awesome_pretrained_models.rst new file mode 100644 index 0000000000000000000000000000000000000000..f2afc1d02c9f290578373cc896d2a0891f382cb8 --- /dev/null +++ b/docs/community/contribute_models/contribute_awesome_pretrained_models.rst @@ -0,0 +1,78 @@ +==================================================================================== +贡献预训练模型权重 +==================================================================================== + +1. 模型网络结构类型 +------------------------------------------------------------------------------------ +PaddleNLP目前已支持绝大多数主流的预训练模型网络结构,既包括百度自研的预训练模型(如ERNIE系列), +也涵盖业界主流的预训练模型(如BERT,ALBERT,GPT,RoBERTa,XLNet等)。 + +PaddleNLP目前支持的预训练模型结构类型汇总可见 +`Transformer预训练模型汇总 `_ +(持续增加中,也非常欢迎进行新模型贡献:`如何贡献新模型 `_ )。 + +2. 
模型参数权重类型 +------------------------------------------------------------------------------------ +非常欢迎大家贡献优质模型参数权重。 +参数权重类型包括但不限于(以BERT模型网络为例): + +- PaddleNLP还未收录的BERT预训练模型参数权重 + (如 `bert-base-japanese-char `_ ,`danish-bert-botxo `_ 等); +- BERT模型在其他垂类领域(如数学,金融,法律,医学等)的预训练模型参数权重 + (如 `MathBERT `_ ,`finbert `_ 等); +- 基于BERT在下游具体任务进行fine-tuning后的模型参数权重 + (如 `bert-base-multilingual-uncased-sentiment `_ , + `bert-base-NER `_ 等); +- 其他模型参数权重(任何你觉得有价值的模型参数权重); + +3. 参数权重格式转换 +------------------------------------------------------------------------------------ +当我们想要贡献github上开源的某模型权重时,但是发现该权重保存为其他的深度学习框架(PyTorch,TensorFlow等)的格式, +这就需要我们进行不同深度学习框架间的模型格式转换,下面的链接给出了一份详细的关于Pytorch到Paddle模型格式转换的教程: +`Pytorch到Paddle模型格式转换文档 <./convert_pytorch_to_paddle.rst>`_ 。 + +4. 进行贡献 +------------------------------------------------------------------------------------ +4.1 准备权重相关文件 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +一般来说,我们需要准备 **model_state.pdparams** ,**vocab.txt**,**tokenizer_config.json** +以及 **model_config.json** 这四个文件进行参数权重贡献。 + +- model_state.pdparams 文件可以通过上述的参数权重格式转换过程得到; +- vocab.txt 文件可以直接使用原始模型对应的vocab文件(根据模型对应tokenizer类型的不同,该文件名可能为spiece.model等); +- model_config.json 文件可以参考对应 model.save_pretrained() 接口保存的model_config.json文件; +- tokenizer_config.json 文件可以参考对应 tokenizer.save_pretrained() 接口保存的model_config.json文件; + +4.2 创建个人目录 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +如果你是首次进行权重贡献,那么你需要在 ``PaddleNLP/community/`` 下新建一个目录。 +目录名称使用你的github名称,比如新建目录 ``PaddleNLP/community/yingyibiao/`` 。 +如果已有个人目录,则可以跳过此步骤。 + +4.3 创建权重目录 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +在步骤4.2的个人目录下新建一个权重目录,权重目录名为本次贡献的模型权重名称。 +比如我想贡献 ``bert-base-uncased-sst-2-finetuned`` 这个模型, +则新建权重目录 ``PaddleNLP/community/yingyibiao/bert-base-uncased-sst-2-finetuned/`` 。 + +4.4 在权重目录下添加PR(pull request)相关文件 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +在步骤4.3的目录下加入两个文件,分别为 ``README.md`` 和 ``files.json`` 。 + +- ``README.md`` 是对你贡献的权重的详细介绍,使用示例,权重来源等。 +- ``files.json`` 为步骤4.1所得的权重相关文件以及对应地址。files.json文件内容示例如下,只需将地址中的 *yingyibiao* 和 + *bert-base-uncased-sst-2-finetuned* 分别更改为你的github用户名和权重名称。 + +.. 
code:: python + + { + "model_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/model_config.json", + "model_state": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/model_state.pdparams", + "tokenizer_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/tokenizer_config.json", + "vocab_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/vocab.txt" + } + +4.5 在github上提PR进行贡献 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +- 第一次进行开源贡献的同学可以参考 `first-contributions `_ 。 +- 模型权重贡献PR示例请参考 `bert-base-uncased-sst-2-finetuned PR <.>`_ 。 \ No newline at end of file diff --git a/docs/community/contribute_models/contribute_new_models.rst b/docs/community/contribute_models/contribute_new_models.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7de561eb55b5dd29a2276b9d1279df064f25051 --- /dev/null +++ b/docs/community/contribute_models/contribute_new_models.rst @@ -0,0 +1,3 @@ +========================================== +贡献新模型 +========================================== \ No newline at end of file diff --git a/docs/community/contribute_models/convert_pytorch_to_paddle.rst b/docs/community/contribute_models/convert_pytorch_to_paddle.rst new file mode 100644 index 0000000000000000000000000000000000000000..7da5dd6d008b8979a7c825a31b818aeacc63ee15 --- /dev/null +++ b/docs/community/contribute_models/convert_pytorch_to_paddle.rst @@ -0,0 +1,463 @@ +========================================== +模型格式转换 +========================================== + +0. 前言 +------------------------------------------ +本文将介绍如何进行不同框架下的模型权重转换(以模型权重从PyTorch框架到Paddle框架的格式转换为例)。 + +模型格式转换的过程需要用户对模型结构有一个较详细的了解,成功完成模型格式转换也会有助于加深用户对该模型结构的理解。 +让我们开始这个有趣的过程吧! + +1. 模型权重文件概述 +------------------------------------------ +不管在什么框架下,当我们保存训练好的模型时,我们都需要将模型的参数权重持久化保存下来; +当我们加载一个保存好的模型时,我们都需要将参数权重加载并重新赋值给相应的模型。 + +PyTorch和Paddle都是通过序列化和反序列化模型的 ``state dict`` (状态字典)来进行参数权重的存储和加载的。 +``state dict`` 从数据结构上来看就是一个字典(比如Python中的dict), +其中key是模型参数的名称(数据类型为string),而value则为key所对应的值(数据类型为Tensor)。 +参数存储时,先获取目标对象的 ``state dict`` ,然后将 ``state dict`` 存储至磁盘; +参数载入时,先从磁盘载入保存的 ``state dict`` ,然后通过 ``set_state_dict()`` 方法配置到目标对象中。 + +按照约定俗成的命名规则,Paddle框架保存的模型文件名一般后缀为 `'.pdparams'` , +PyTorch框架保存的模型文件名一般后缀为 `'.pt'` 、 `'.pth'` 或者 `'.bin'` 。 +虽然后缀并不影响模型的保存和加载,但我们一般都会遵循这个命名规范。 + +2. 模型的 ``state dict`` 概述 +------------------------------------------ +刚刚我们简单介绍了一下模型文件和其中存储的 ``state dict`` , +下面让我们来看一个具体的例子来对 ``state dict`` 有更进一步的了解。 + +``LeNet`` 是由Yann LeCun等人在1998年提出的一个CNN网络模型,并且成功应用于手写数字识别系统。 +Paddle集成了 ``LeNet`` 这个简单的模型,我们可以一键进行模型加载, +下面的代码实现了该模型的加载和对应 ``state dict`` 的输出: + +.. code:: python + + >>> import paddle + >>> from paddle.vision.models import LeNet + >>> model = LeNet() + >>> model.state_dict().keys() # 输出state_dict的所有keys + odict_keys(['features.0.weight', 'features.0.bias', 'features.3.weight', 'features.3.bias', + 'fc.0.weight', 'fc.0.bias', 'fc.1.weight', 'fc.1.bias', 'fc.2.weight', 'fc.2.bias']) + + >>> model.state_dict()['features.0.weight'] # 输出 'features.0.weight' 对应的value + Parameter containing: + Tensor(shape=[6, 1, 3, 3], dtype=float32, place=CPUPlace, stop_gradient=False, + [[[[-0.31584871, 0.27280194, -0.43816274], + [ 0.06681869, 0.44526964, 0.80944657], + [ 0.05796078, 0.57411081, 0.15335406]]], + ... + ... 
+ [[[-0.07211500, -0.14458601, -1.11733580], + [ 0.53036308, -0.19761689, 0.56962037], + [-0.09760553, -0.02011104, -0.50577533]]]]) + + +我们可以通过 ``model.state_dict().keys()`` 来获取模型的所有参数名称。 +可以看到 ``LeNet`` 一共有10组参数,分别为:*'features.0.weight'*、*'features.0.bias'*、*'features.3.weight'* +、*'features.3.bias'*、*'fc.0.weight'*、*'fc.0.bias'*、*'fc.1.weight'*、*'fc.1.bias'*、*'fc.2.weight'* 和 *'fc.2.bias'*。 + +通过查询 ``model.state_dict()['features.0.weight']`` 可以查看 **'features.0.weight'** 这个参数的具体权重数值。 +上述输出显示该权重是一个dtype=float32,shape=[6, 1, 3, 3]的Tensor。 + +3. 利用 ``state dict`` 进行权重格式转换 +------------------------------------------ +了解了模型的存储和加载以及相关的 ``state dict`` 之后,我们来看一下模型格式的转换的具体步骤。 +一般来说,我们可以通过 ``state dict`` 的相互转换来帮助我们进行模型格式的转换。 + +以从PyTorch框架到Paddle框架的模型权重转换为例,转换的具体流程为: + +1. 加载PyTorch模型得到 ``state dict`` +2. PyTorch下的 ``state dict`` 转换为Paddle下的 ``state dict`` +3. 保存Paddle下的 ``state dict`` 得到Paddle模型。 + +下面我们来看一个具体的例子:``'bert-base-uncased'`` 是一个谷歌开源的12层的bert英文模型。 +PaddleNLP(Paddle框架)和HuggingFace的transformers(PyTorch框架)里都集成了这个模型, +两者参数量和具体参数数值是完全一致的。我们可以来加载对比这两个模型的 ``state dict`` 来了解转换的细节。 + +3.1 PyTorch框架下的 ``state dict`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +首先加载transformers下的 ``'bert-base-uncased'`` 模型, + +.. code:: python + + >>> import torch + >>> model_name = "bert-base-uncased" + >>> # 模型下载地址: https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin + >>> model_file = "pytorch_model.bin" + >>> pytorch_state_dict = torch.load(model_file) + >>> pytorch_state_dict.keys() + odict_keys(['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', + 'bert.embeddings.LayerNorm.gamma', 'bert.embeddings.LayerNorm.beta', + 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.query.bias', + 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.key.bias', + 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.attention.self.value.bias', + 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.output.dense.bias', + 'bert.encoder.layer.0.attention.output.LayerNorm.gamma', 'bert.encoder.layer.0.attention.output.LayerNorm.beta', + 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.intermediate.dense.bias', + 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.0.output.dense.bias', + 'bert.encoder.layer.0.output.LayerNorm.gamma', 'bert.encoder.layer.0.output.LayerNorm.beta', + 'bert.encoder.layer.1'... + 'bert.encoder.layer.2'... + . + . + . + 'bert.encoder.layer.9'... + 'bert.encoder.layer.10'... 
+ 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.11.attention.self.query.bias', + 'bert.encoder.layer.11.attention.self.key.weight', 'bert.encoder.layer.11.attention.self.key.bias', + 'bert.encoder.layer.11.attention.self.value.weight', 'bert.encoder.layer.11.attention.self.value.bias', + 'bert.encoder.layer.11.attention.output.dense.weight', 'bert.encoder.layer.11.attention.output.dense.bias', + 'bert.encoder.layer.11.attention.output.LayerNorm.gamma', 'bert.encoder.layer.11.attention.output.LayerNorm.beta', + 'bert.encoder.layer.11.intermediate.dense.weight', 'bert.encoder.layer.11.intermediate.dense.bias', + 'bert.encoder.layer.11.output.dense.weight', 'bert.encoder.layer.11.output.dense.bias', + 'bert.encoder.layer.11.output.LayerNorm.gamma', 'bert.encoder.layer.11.output.LayerNorm.beta', + 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', + 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', + 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.gamma', + 'cls.predictions.transform.LayerNorm.beta', 'cls.predictions.decoder.weight', + 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']) + +\**odict_keys**(ordered_dict keys)所显示的是PyTorch模型文件所对应的 ``state dict`` 的keys: +我们仔细观察一下可以发现参数可以分成几大模块:**embeddings** 模块, +**encoder_layers** 模块, **pooler** 模块和 **cls** 模块。 + +我们可以结合bert的具体结构来解读一下各个模块: + +- **embeddings** 模块 + + *'bert.embeddings'* 开头的各个参数是embeddings模块的参数, + 包括word_embeddings矩阵,position_embeddings矩阵,token_type_embeddings矩阵以及embeddings模块的LayerNorm层参数等。 +- **encoder_layers** 模块 + + *'bert.encoder.layer'*开头的各个参数是各encoder层的参数, + 可以看到 ``'bert-base-uncased'`` 模型一共有12层encoder(编号0-11),每一层encoder的结构都相同。 + 每一层encoder主要由一个*self-attention*模块和一个*feed-forward*模块构成。 + 我们具体来看一下第1层encoder的参数(编号为0,'bert.encoder.layer.0'开头的参数): + + 首先是*self-attention*模块: + + * *'attention.self.query'*,*'attention.self.key'* 和 *'attention.self.value'* + 分别代表self-attention结构里面的query矩阵,key矩阵和value矩阵。 + * *'attention.output.dense'* 是self-attention结构的线性层。 + * *'attention.output.LayerNorm'* 则是self-attention结构后的LayerNorm层。 + + 接下来是*feed-forward*模块,对应 'intermediate.dense' 和 'output.dense' 开头的参数 + 。*feed-forward*之后还有一个*LayerNorm*层,对应的是 'output.LayerNorm' 开头的参数。 +- **pooler** 模块 + + pooler模块在最后一层encoder之后,是我们对最后一层encoder输出的池化操作, +- **cls** 模块 + + cls模块是我们计算mlm(masked language model)和next sentence prediction(nsp)任务的结构。 + 'cls.predictions'开头的参数是我们做mlm任务时的参数,'cls.seq_relationship'开头的参数是我们做nsp预测任务时的参数。 + +3.2 Paddle框架下的 ``state dict`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +相信到现在,我们已经对bert这个模型的结构以及相应的具体参数有了更进一步的了解。 +接下来我们来加载PaddleNLP下的模型: + +.. 
code:: python + + >>> import paddle + >>> model_name = "bert-base-uncased" + >>> # 模型下载地址: https://bj.bcebos.com/paddlenlp/models/transformers/bert-base-uncased.pdparams + >>> model_file = "bert-base-uncased.pdparams" + >>> paddle_state_dict = paddle.load(model_file) + >>> paddle_state_dict.keys() + dict_keys(['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', + 'bert.embeddings.layer_norm.weight', 'bert.embeddings.layer_norm.bias', + 'bert.encoder.layers.0.self_attn.q_proj.weight', 'bert.encoder.layers.0.self_attn.q_proj.bias', + 'bert.encoder.layers.0.self_attn.k_proj.weight', 'bert.encoder.layers.0.self_attn.k_proj.bias', + 'bert.encoder.layers.0.self_attn.v_proj.weight', 'bert.encoder.layers.0.self_attn.v_proj.bias', + 'bert.encoder.layers.0.self_attn.out_proj.weight', 'bert.encoder.layers.0.self_attn.out_proj.bias', + 'bert.encoder.layers.0.linear1.weight', 'bert.encoder.layers.0.linear1.bias', + 'bert.encoder.layers.0.linear2.weight', 'bert.encoder.layers.0.linear2.bias', + 'bert.encoder.layers.0.norm1.weight', 'bert.encoder.layers.0.norm1.bias', + 'bert.encoder.layers.0.norm2.weight', 'bert.encoder.layers.0.norm2.bias', + 'bert.encoder.layers.1'... + ... + ... + ... + 'bert.encoder.layers.10'... + 'bert.encoder.layers.11.self_attn.q_proj.weight', 'bert.encoder.layers.11.self_attn.q_proj.bias', + 'bert.encoder.layers.11.self_attn.k_proj.weight', 'bert.encoder.layers.11.self_attn.k_proj.bias', + 'bert.encoder.layers.11.self_attn.v_proj.weight', 'bert.encoder.layers.11.self_attn.v_proj.bias', + 'bert.encoder.layers.11.self_attn.out_proj.weight', 'bert.encoder.layers.11.self_attn.out_proj.bias', + 'bert.encoder.layers.11.linear1.weight', 'bert.encoder.layers.11.linear1.bias', + 'bert.encoder.layers.11.linear2.weight', 'bert.encoder.layers.11.linear2.bias', + 'bert.encoder.layers.11.norm1.weight', 'bert.encoder.layers.11.norm1.bias', + 'bert.encoder.layers.11.norm2.weight', 'bert.encoder.layers.11.norm2.bias', + 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', + 'cls.predictions.decoder_weight', 'cls.predictions.decoder_bias', + 'cls.predictions.transform.weight', 'cls.predictions.transform.bias', + 'cls.predictions.layer_norm.weight', 'cls.predictions.layer_norm.bias', + 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']) + +Paddle模型的 ``state dict`` 是通过一个dict来进行存储,可以看到,两者的 ``state dict`` 是十分相似的。 +我们对比一下两者: + +- 两者的存储是相似的,PyTorch里使用的是python中的ordered_dict来存储模型的参数状态, + 在Paddle中则使用的是python中的dict来来进行存储。 +- 两者的结构也是相似的,都可以分成embeddings,encoder_layer, pooler, cls等 + 模块(当然这也很直观,毕竟两者的模型结构和模型参数是完全一致的)。 +- 同时两者也存在一些区别,两者的 ``state dict`` 的keys有一些细微的差异,这是由于模型代码的具体实现的参数命名差异所造成的。 + +3.3 PyTorch和Paddle的 ``state dict`` 对比 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +我们接下来对上述两个 ``state dict`` 的参数名称以及对应权重来做一一对应。 +下面的表格是整理好的 ``state_dict`` 对应关系表格(同一行代表着相对应的参数): + ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| Keys (PyTorch) | Shape (PyTorch) | Keys (Paddle) | Shape (Paddle) | ++========================================================+============================+==================================================+===========================+ +| bert.embeddings.word_embeddings.weight | [30522, 768] | bert.embeddings.word_embeddings.weight | [30522, 768] | 
++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.position_embeddings.weight | [512, 768] | bert.embeddings.position_embeddings.weight | [512, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.token_type_embeddings.weight | [2, 768] | bert.embeddings.token_type_embeddings.weight | [2, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.LayerNorm.gamma | [768] | bert.embeddings.layer_norm.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.LayerNorm.beta | [768] | bert.embeddings.layer_norm.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.query.weight | [768, 768] | bert.encoder.layers.0.self_attn.q_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.query.bias | [768] | bert.encoder.layers.0.self_attn.q_proj.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.key.weight | [768, 768] | bert.encoder.layers.0.self_attn.k_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.key.bias | [768] | bert.encoder.layers.0.self_attn.k_proj.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.value.weight | [768, 768] | bert.encoder.layers.0.self_attn.v_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.value.bias | [768] | bert.encoder.layers.0.self_attn.v_proj.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.dense.weight | [768, 768] | bert.encoder.layers.0.self_attn.out_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.dense.bias | [768] | bert.encoder.layers.0.self_attn.out_proj.bias | [768] | 
++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.LayerNorm.gamma | [768] | bert.encoder.layers.0.norm1.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.LayerNorm.beta | [768] | bert.encoder.layers.0.norm1.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.intermediate.dense.weight | [3072, 768] | bert.encoder.layers.0.linear1.weight | [768, 3072] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.intermediate.dense.bias | [3072] | bert.encoder.layers.0.linear1.bias | [3072] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.dense.weight | [768, 3072] | bert.encoder.layers.0.linear2.weight | [3072, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.dense.bias | [768] | bert.encoder.layers.0.linear2.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.LayerNorm.gamma | [768] | bert.encoder.layers.0.norm2.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.LayerNorm.beta | [768] | bert.encoder.layers.0.norm2.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.pooler.dense.weight | [768, 768] | bert.pooler.dense.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.pooler.dense.bias | [768] | bert.pooler.dense.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.bias | [30522] | cls.predictions.decoder_bias | [30522] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.dense.weight | [768, 768] | cls.predictions.transform.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.dense.bias | [768] | cls.predictions.transform.bias | [768] | 
++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.LayerNorm.gamma | [768] | cls.predictions.layer_norm.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.LayerNorm.beta | [768] | cls.predictions.layer_norm.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.decoder.weight | [30522, 768] | cls.predictions.decoder_weight | [30522, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.seq_relationship.weight | [2, 768] | cls.seq_relationship.weight | [768, 2] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.seq_relationship.bias | [2] | cls.seq_relationship.bias | [2] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ + +正确地对应好 ``state dict`` 的参数以及权重有助于我们正确地进行 ``state dict`` 的转换。 + +我们从参数名称上能看出基本的一个对应关系,例如: + +* `bert.embeddings.LayerNorm.gamma` 对应 `bert.embeddings.layer_norm.weight` ; +* `bert.embeddings.LayerNorm.beta` 对应 `bert.embeddings.layer_norm.bias` ; +* `bert.encoder.layer.0.attention.self.query.weight` 对应 `bert.encoder.layers.0.self_attn.q_proj.weight` ; +* `bert.encoder.layer.0.attention.self.query.bias` 对应 `bert.encoder.layers.0.self_attn.q_proj.bias`。 + +两者的顺序是基本一致的,但也有一些例外,例如: + +* `bert.encoder.layers.0.norm1.weight` 对应 `bert.encoder.layer.0.attention.output.LayerNorm.gamma` ; +* `bert.encoder.layers.0.norm1.bias` 对应 `bert.encoder.layer.0.attention.output.LayerNorm.beta` ; +* `bert.encoder.layer.0.intermediate.dense.weight` 对应 `bert.encoder.layers.0.linear1.weight` ; +* `bert.encoder.layer.0.output.dense.weight` 对应 `bert.encoder.layers.0.linear2.weight` ; +* `bert.encoder.layer.0.output.LayerNorm.gamma` 对应 `bert.encoder.layers.0.norm2.weight`。 + +正确的参数对应关系可能需要我们阅读具体的代码进行判断。 +在上面的表格中我们已经将两者的keys准确地一一对应了。建立好了keys的对应关系之后,我们可以进行values的对应。 + +如果你仔细观察表格,会发现有些参数对应的values形状存在差异。 +比如 ``bert.encoder.layer.0.intermediate.dense.weight`` 和 ``bert.encoder.layers.0.linear1.weight`` +这两个keys是相对应的一组参数名,但是他们的values形状却不相同;前者是 ``[3072, 768]`` , +后者是 ``[768, 3072]`` ,两者刚好是一个转置的关系。这是因为PyTorch对于 ``nn.Linear`` 模块的保存是将权重的shape进行转置后保存的。 +所以在我们进行 ``state dict`` 转换的时候,需要注意做好shape的转换(例如将PyTorch模型里 +nn.Linear层对应的参数权重转置处理后生成Paddle的参数权重)。 + +另外还需要注意其他一些细节,这里列出来几个可能会遇到的情景以供参考: + +- 有些模型结构可能在实现时对参数的处理有差异导致存在参数的拆分或者合并等操作, + 此时我们需要进行参数多对一或者一对多的映射,同时将对应的values拆分或者合并。 +- 还有存在batch norm层时,我们需要注意todo。 + +3.4 bert模型转换代码 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +下一步就是进行最关键的模型转换环节。 +这一步十分关键,正确地进行 ``state dict`` 的转换才能确保我们通过精度验证。 + +下面是进行模型转换的代码(PyTorch转换为Paddle): + +.. 
code:: python + + import paddle + import torch + import numpy as np + + torch_model_path = "pytorch_model.bin" + torch_state_dict = torch.load(torch_model_path) + + paddle_model_path = "bert_base_uncased.pdparams" + paddle_state_dict = {} + + # State_dict's keys mapping: from torch to paddle + keys_dict = { + # about embeddings + "embeddings.LayerNorm.gamma": "embeddings.layer_norm.weight", + "embeddings.LayerNorm.beta": "embeddings.layer_norm.bias", + + # about encoder layer + 'encoder.layer': 'encoder.layers', + 'attention.self.query': 'self_attn.q_proj', + 'attention.self.key': 'self_attn.k_proj', + 'attention.self.value': 'self_attn.v_proj', + 'attention.output.dense': 'self_attn.out_proj', + 'attention.output.LayerNorm.gamma': 'norm1.weight', + 'attention.output.LayerNorm.beta': 'norm1.bias', + 'intermediate.dense': 'linear1', + 'output.dense': 'linear2', + 'output.LayerNorm.gamma': 'norm2.weight', + 'output.LayerNorm.beta': 'norm2.bias', + + # about cls predictions + 'cls.predictions.transform.dense': 'cls.predictions.transform', + 'cls.predictions.decoder.weight': 'cls.predictions.decoder_weight', + 'cls.predictions.transform.LayerNorm.gamma': 'cls.predictions.layer_norm.weight', + 'cls.predictions.transform.LayerNorm.beta': 'cls.predictions.layer_norm.bias', + 'cls.predictions.bias': 'cls.predictions.decoder_bias' + } + + + for torch_key in torch_state_dict: + paddle_key = torch_key + for k in keys_dict: + if k in paddle_key: + paddle_key = paddle_key.replace(k, keys_dict[k]) + + if ('linear' in paddle_key) or ('proj' in paddle_key) or ('vocab' in paddle_key and 'weight' in paddle_key) or ("dense.weight" in paddle_key) or ('transform.weight' in paddle_key) or ('seq_relationship.weight' in paddle_key): + paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[torch_key].cpu().numpy().transpose()) + else: + paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[torch_key].cpu().numpy()) + + print("torch: ", torch_key,"\t", torch_state_dict[torch_key].shape) + print("paddle: ", paddle_key, "\t", paddle_state_dict[paddle_key].shape, "\n") + + paddle.save(paddle_state_dict, paddle_model_path) + + +我们来看一下这份转换代码: +我们需要下载好待转换的PyTorch模型,并加载模型得到**torch_state_dict** +;**paddle_state_dict** 和 **paddle_model_path** 则定义了转换后的 ``state dict`` 和模型文件路径; +代码中 **keys_dict** 定义了两者keys的映射关系(可以通过上面的表格对比得到)。 + +下一步就是最关键的 *paddle_state_dict* 的构建,我们对 *torch_state_dict* 里的每一个key都进行映射, +得到对应的 *paddle_state_dict* 的key。获取 *paddle_state_dict* 的key之后我们需要 +对 *torch_state_dict* 的value进行转换,如果key对应的结构是 ``nn.Linear`` 模块的话, +我们还需要进行value的transpose操作。 + +最后我们保存得到的 *paddle_state_dict* 就能得到对应的Paddle模型。 +至此我们已经完成了模型的转换工作,得到了Paddle框架下的模型 ``"model_state.pdparams"`` 。 + +4. 模型权重验证 +------------------------------------------ +得到了模型权重后,我们还需要进行精度的对齐来验证我们上述转换的正确性。 +我们可以通过前向推理和下游任务fine-tuning这两个任务进行精度对齐验证。 + +4.1 对齐前向精度 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +前向精度的对齐十分简单,我们只需要保证两者输入是一致的前提下,观察得到的输出是否一致。 +这里有几个注意事项,我们运行推理时需要打开eval模式,设置dropout为0等操作去除随机性造成的影响。 + +除了得到的模型权重文件,我们还需要准备模型配置文件。将模型权重文件(model_state.pdparams)和模型配置文件(model_config.json) +这两个文件放在同一个路径下,我们就可以进行模型前向精度的对齐验证,下面提供了bert模型对齐前向精度的代码示例: + +.. code:: python + + text = "Welcome to use paddle paddle and paddlenlp!" 
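+
+    # 下面分别用 PyTorch 与 Paddle 对同一句输入做前向推理并对比输出;
+    # 经验上,FP32 模型若成功对齐,输出的最大绝对差通常在 1e-5 量级或更小,可据此粗略判断是否对齐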
+ torch_model_name = "bert-base-uncased" + paddle_model_name = "bert-base-uncased" + + # torch output + import torch + import transformers + from transformers.models.bert import * + + # torch_model = BertForPreTraining.from_pretrained(torch_model_name) + torch_model = BertModel.from_pretrained(torch_model_name) + torch_tokenizer = BertTokenizer.from_pretrained(torch_model_name) + torch_model.eval() + + torch_inputs = torch_tokenizer(text, return_tensors="pt") + torch_outputs = torch_model(**torch_inputs) + + torch_logits = torch_outputs[0] + torch_array = torch_logits.cpu().detach().numpy() + print("torch_prediction_logits shape:{}".format(torch_array.shape)) + print("torch_prediction_logits:{}".format(torch_array)) + + + # paddle output + import paddle + import paddlenlp + from paddlenlp.transformers.bert.modeling import * + import numpy as np + + # paddle_model = BertForPretraining.from_pretrained(paddle_model_name) + paddle_model = BertModel.from_pretrained(paddle_model_name) + paddle_tokenizer = BertTokenizer.from_pretrained(paddle_model_name) + paddle_model.eval() + + paddle_inputs = paddle_tokenizer(text) + paddle_inputs = {k:paddle.to_tensor([v]) for (k, v) in paddle_inputs.items()} + paddle_outputs = paddle_model(**paddle_inputs) + + paddle_logits = paddle_outputs[0] + paddle_array = paddle_logits.numpy() + print("paddle_prediction_logits shape:{}".format(paddle_array.shape)) + print("paddle_prediction_logits:{}".format(paddle_array)) + + + # the output logits should have the same shape + assert torch_array.shape == paddle_array.shape, "the output logits should have the same shape, but got : {} and {} instead".format(torch_array.shape, paddle_array.shape) + diff = torch_array - paddle_array + print(np.amax(abs(diff))) + +代码最后会打印模型输出矩阵的每个元素最大差值,根据这个差值可以判定我们是否对齐了前向精度。 + +4.2 下游任务fine-tuning验证(可选) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +当我们对齐前向精度时,一般来说我们的模型转换就已经成功了。我们还可以运行下游任务fine-tuning进行double check。 +同样的,我们需要设置相同的训练数据,相同的训练参数,相同的训练环境进行fine-tuning来对比两者的收敛性以及收敛指标。 + +5. 写在最后 +------------------------------------------ +恭喜你成功完成了模型权重的格式转换工作!欢迎向PaddleNLP提PR共享你的模型, +这样每一个使用PaddleNLP的用户都能使用你共享的模型哦~ diff --git a/docs/community/contribute_models/index.rst b/docs/community/contribute_models/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..9e15bd78f87b733f03f87ba0facfdfac5045c1db --- /dev/null +++ b/docs/community/contribute_models/index.rst @@ -0,0 +1,10 @@ +============ +如何贡献模型 +============ + +.. 
toctree:: + :maxdepth: 1 + + convert_pytorch_to_paddle.rst + contribute_awesome_pretrained_models.rst + contribute_new_models.rst \ No newline at end of file diff --git a/docs/community/join_in_PaddleNLP-SIG.rst b/docs/community/join_in_PaddleNLP-SIG.rst new file mode 100644 index 0000000000000000000000000000000000000000..44d55add660875f8cb87210c6331f068b5dac6af --- /dev/null +++ b/docs/community/join_in_PaddleNLP-SIG.rst @@ -0,0 +1,3 @@ +============== +如何加入PaddleNLP-SIG +============== diff --git a/docs/community/rfcs/20230304_api_design_for_tie_weight_task_103.md b/docs/community/rfcs/20230304_api_design_for_tie_weight_task_103.md new file mode 100644 index 0000000000000000000000000000000000000000..f023f95cde654d2af97a35378980a871a772b574 --- /dev/null +++ b/docs/community/rfcs/20230304_api_design_for_tie_weight_task_103.md @@ -0,0 +1,269 @@ +# 标题 + +标题如:paddle.io.dataset 设计文档 + +|API名称 | 新增API名称 | +|---|----------------------------------------------------| +|提交作者 | 丘文波, 刘旺旺 | +|提交时间 | 2022-03-10 | +|版本号 | V3 | +|依赖飞桨版本 | 如无特殊情况,都应基于develop版本开发 | +|文件名 | 20230304_api_design_for_tie_weight_task_103.md
| + + +# 一、概述 +## 1、相关背景 +对应任务是 No.103:新增tie_weights能力 + +权重绑定, 一般是指将输入层embedding和 输出层embeding共享权重, 从而在减少网络的参数量, 使得embeding层参数训练更加充分. + +其中《attention is all you need》中的提到的transformer模型也使用到了tie weigh这个技巧, 论文3.4节提到将encoder输入embedding与decoder输入embedding以及输出线性层权重共享 这个技巧的有效性在论文《Using the output embedding to improve language models》进行了验证 . + +所以预训练语言模型需要实现一个输入层embedding和 输出层embeding共享权重共享功能,方便使用者进行调用. + +相关issue: +* [https://github.com/PaddlePaddle/PaddleNLP/issues/4740](https://github.com/PaddlePaddle/PaddleNLP/issues/4740) + + +## 2、功能目标 +给预训练语言模型增加一个基础函数, 实现输入层embeding和输出层embedding的权重共享绑定: + +- 为PaddleNLP新增tie_weights功能,能够对齐HuggingFace Transformers中的[tie_weights](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.tie_weights)功能 +- 参考: [https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/modeling_utils.py#L1172](https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/modeling_utils.py#L1172) + + +## 3、意义 +实现权重绑定的函数, 作为一种模型技巧来提升训练效果.减少模型参数, + +权重绑定的函数作为模型的一个基本函数, 在基于预训练模型组网的时候 方便进行调用进行实验, 减少模型参数,提升模型效果. + + +# 二、飞桨现状 +对飞桨框架目前支持此功能的现状调研,如果不支持此功能,如是否可以有替代实现的API,是否有其他可绕过的方式,或者用其他API组合实现的方式; + +paddle 中并没有对tie weight的统一实现,调用者需自己写代码实现这部分功能. + +paddleNLP中的一些示例代码中也找到了一个tie weight的实现. + +(1) [代码链接1](https://github.com/qiuwenbogdut/PaddleNLP/blob/develop/examples/language_model/transformer-xl/mem_transformer.py#L811) + +```python +if tie_weight: + for i in range(len(self.crit.out_layers_weight)): + self.crit.out_layers_weight[i] = self.word_emb.emb_layers[i].weight + +if tie_projs: + for i, tie_proj in enumerate(tie_projs): + if tie_proj and div_val == 1 and d_model != d_embed: + self.crit.out_projs[i] = self.word_emb.emb_projs[0] + elif tie_proj and div_val != 1: + self.crit.out_projs[i] = self.word_emb.emb_projs[i] +``` + +(2) [代码链接2](https://github.com/PaddlePaddle/PaddleNLP/blob/4e5df921ff61ddae1d869c37aea621b9cac6bcd4/paddlenlp/transformers/reformer/modeling.py#L1977) + +```python +def tie_weights(self): + """ + Tie the weights between the input embeddings and the output embeddings. + """ + tie_word_embeddings = ( + self.tie_word_embeddings + if hasattr(self, "tie_word_embeddings") + else self.config.get("tie_word_embeddings", False) + ) + if hasattr(self, "get_output_embeddings") and hasattr(self, "get_input_embeddings") and tie_word_embeddings: + output_embeddings = self.get_output_embeddings() + if output_embeddings is not None: + self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings()) +``` + +(3) [代码链接3](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/ernie/modeling.py#L748) +```python +class ErnieLMPredictionHead(nn.Layer): + r""" + Ernie Model with a `language modeling` head on top. 
+ """ + + def __init__( + self, + config: ErnieConfig, + embedding_weights=None, + weight_attr=None, + ): + super(ErnieLMPredictionHead, self).__init__() + + self.transform = nn.Linear(config.hidden_size, config.hidden_size, weight_attr=weight_attr) + self.activation = getattr(nn.functional, config.hidden_act) + self.layer_norm = nn.LayerNorm(config.hidden_size) + self.decoder_weight = ( + self.create_parameter( + shape=[config.vocab_size, config.hidden_size], + dtype=self.transform.weight.dtype, + attr=weight_attr, + is_bias=False, + ) + if embedding_weights is None + else embedding_weights + ) + self.decoder_bias = self.create_parameter( + shape=[config.vocab_size], dtype=self.decoder_weight.dtype, is_bias=True + ) +``` + + +其实paddlenlp内大部分的tie_weights实现是直接在模型layer定义层面实现的,见[代码](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/ernie/modeling.py#L748) +,而不是类似transformers一样在模型以外统一实现的。这个项目的目标就是看一下能否在模型外统一实现,而不用每个模型都自己实现一次 + +paddle里面tie_weghts实现主要有两种方式: +* 一种在modeling.py中定义了tie_weghts函数,相应的模型也实现了get_input_embeding()和get_output_embeding()来获取输入和输出embeding层权重,然后通过赋值方式进行绑定。如上面的代码链接(1)(2) +* 另外一种是 在定义模型层的时候 直接将输入input_embeding的weight,赋值给输出层weight. 将embedding的weight直接传给head来构建linear输出层,期望是在get_input_embeding()拿到weight,然后传给head层,如上面代码链接(3) + + + +最好是在模型[基类里面model_utils.py#L897](https://github.com/PaddlePaddle/PaddleNLP/blob/be80a3e30fb681e53773c265babe611d4df62ead/paddlenlp/transformers/model_utils.py#L897) +去统一实现 tie_weights,减少调用者的开发. + +# 三、业内方案调研 +描述业内深度学习框架如何实现此功能,包括与此功能相关的现状、未来趋势;调研的范围包括不限于TensorFlow、PyTorch、NumPy等 + +(1)目前huggingface的transformers库中实现了这个tieweight 这个基础函数. [代码链接](https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/modeling_utils.py#L1172) +```python +def tie_weights(self): + """ + Tie the weights between the input embeddings and the output embeddings. + If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the + weights instead. 
+ """ + if getattr(self.config, "tie_word_embeddings", True): + output_embeddings = self.get_output_embeddings() + if output_embeddings is not None: + self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings()) + + if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False): + if hasattr(self, self.base_model_prefix): + self = getattr(self, self.base_model_prefix) + self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix) + + for module in self.modules(): + if hasattr(module, "_tie_weights"): + module._tie_weights() +``` + + +(2) tensor2tensor库 tieweight 实现代码 [代码链接](https://github.com/tensorflow/tensor2tensor/blob/316c9ce2f2b2373f44f5be0da712dda3e5861a75/tensor2tensor/layers/modalities.py#L1106) +```python +def symbol_top(body_output, targets, model_hparams, vocab_size): + del targets # unused arg + if model_hparams.shared_embedding_and_softmax_weights: + scope_name = "shared" + reuse = tf.AUTO_REUSE + else: + scope_name = "softmax" + reuse = False + with tf.variable_scope(scope_name, reuse=reuse): + body_output_shape = common_layers.shape_list(body_output) + var = get_weights(model_hparams, vocab_size, body_output_shape[-1]) + if (model_hparams.factored_logits and + model_hparams.mode == tf_estimator.ModeKeys.TRAIN): + # insert channels dimension + body_output = tf.expand_dims(body_output, 3) + return common_layers.FactoredTensor(body_output, var) + else: + body_output = tf.reshape(body_output, [-1, body_output_shape[-1]]) + logits = tf.matmul(body_output, var, transpose_b=True) + return tf.reshape(logits, + body_output_shape[:-1] + [1, vocab_size]) +``` + + +(3) fairseq库 中 tie weight实现函数 [代码链接](https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/fconv.py#L480) +```python +self.fc2 = Linear(in_channels, out_embed_dim) + if share_embed: + assert out_embed_dim == embed_dim, ( + "Shared embed weights implies same dimensions " + " out_embed_dim={} vs embed_dim={}".format(out_embed_dim, embed_dim) + ) + self.fc3 = nn.Linear(out_embed_dim, num_embeddings) + self.fc3.weight = self.embed_tokens.weight + else: + self.fc3 = Linear(out_embed_dim, num_embeddings, dropout=dropout) +``` + +# 四、对比分析 +paddle和 huggingface的transformers 都是基于动态图进行开发, 所以准备参照huggingface的transformers 的 tie weight 函数思路去实现功能. + +# 五、设计思路与实现方案 +参考huggingface的 transformers中的实现思路来基于paddle进行开发 + +实现tie_weight函数步骤: +1. 获取模型input embedding 权重对象 A +2. 获取模型 output embedding 权重对象 B +3. 让A和B 都指向同一个权重值 + + + + +## 命名与参数设计 +参考:[飞桨API 设计及命名规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.html) +## 底层OP设计 +## API实现方案 + +# 六、测试和验收的考量 +参考:[新增API 测试及验收规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_accpetance_criteria_cn.html) + +测试tie_weight有两个办法: +* 直接判断输出层weight和输入层weight的id,如果一致即通过,否则Failed. +* 训练几个step,经过几个反向后,看下输出层weight和输入层weight是否一致,如果一致即通过,否则Failed. 
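+
+例如,按第一种方式(id 一致性)可以写出如下形式的检查函数。注意这只是一个示意草稿,这里假设待测模型提供 get_input_embeddings() 与 get_output_embeddings() 方法,方法名与返回形式以最终实现为准:
+
+```python
+def check_tie_weights(model):
+    # 获取输入层 embedding 的权重
+    input_weight = model.get_input_embeddings().weight
+    # get_output_embeddings() 可能返回带 weight 属性的层,也可能直接返回权重参数,这里做兼容处理
+    output_embeddings = model.get_output_embeddings()
+    output_weight = getattr(output_embeddings, "weight", output_embeddings)
+    # 两者 id 一致即认为权重绑定成功
+    return id(input_weight) == id(output_weight)
+```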
+ +用过id的一致性判断是否绑定成功, 简单高效,后面准备采用这种方式进行单侧: +构建单元测试, 测试模型的get_input_embeding得到的权重的id 和get_output_embeding 得到的权重id 是都一致, 如果是一致就通过,都则不通过 + + + +# 七、可行性分析和排期规划 + +设计一个小脚本验证一下这种方式的有效性: +```python +import numpy as np +from paddle.nn import Embedding + +"""step1 定义两个不同的embedding 对象 AA 和 BB""" +print('------------step1') +AA = Embedding(1,2) +BB = Embedding(1,2) + +AA.weight = BB.weight # 进行权重的绑定 + +""" step2 测试一下绑定结果""" +print('------------step2') +print('检测 AA 和 BB 的id是否一致:', AA is BB,id(AA), id(BB)) # AA 和 BB 的id 不一致 +print('检测 AA.weight 和 BB.weight 的id是否一致:',AA.weight is BB.weight,id(AA.weight), id(BB.weight)) # 但是AA.weight 和 BB.weight 的id是一致的 + +print("AA.weight: ",AA.weight) +print("BB.weight: ",BB.weight) + + + +""" step3 尝试修改一下AA的weight的值 BB的weight的值是否也跟着会一起修改""" +# 修改一下其中一个AA 的权重值, 看一下 BB的权重值会不会变化 +print('------------step3') +AA.weight.set_value(np.array([[4.0,6.0]],dtype=np.float32)) + +print('检测 修改后的 AA.weight 和 BB.weight 的id是否一致:',AA.weight is BB.weight,id(AA.weight), id(BB.weight)) # AA.weight 和 BB.weight 的id是一致的 +print("AA.weight 修改后的值: ",AA.weight) +print("BB.weight:",BB.weight) + +``` + +时间和开发排期规划,主要milestone +- 3.10 跟官方确认好开发思路 +- 3.17 提交实现代码 + +# 八、影响面 +需要进一步讨论的问题,开放性问题,有争议问题;对其他模块是否有影响 + +# 名词解释 + +# 附件及参考资料 diff --git a/docs/community/rfcs/api_design_template.md b/docs/community/rfcs/api_design_template.md new file mode 100644 index 0000000000000000000000000000000000000000..fad836f99af4c7feeea401e5d8547f2861595852 --- /dev/null +++ b/docs/community/rfcs/api_design_template.md @@ -0,0 +1,50 @@ +# 标题 + +标题如:paddle.io.dataset 设计文档 +|API名称 | 新增API名称 | +|---|---| +|提交作者 | 李强、张明 | +|提交时间 | 2022-03-01 | +|版本号 | 此设计文档的版本号,如V1.0 | +|依赖飞桨版本 | 如无特殊情况,都应基于develop版本开发 | +|文件名 | 提交的markdown设计文档文件名称,如:20200301_api_design_for_dataset.md
| + + +# 一、概述 +## 1、相关背景 +填写此任务的开发背景,为什么想要开发这个API。如果有相关issue,请将issue链接填写至此。 + +## 2、功能目标 + +## 3、意义 +集中阐述本次升级的作用和意义。 + +# 二、飞桨现状 +对飞桨框架目前支持此功能的现状调研,如果不支持此功能,如是否可以有替代实现的API,是否有其他可绕过的方式,或者用其他API组合实现的方式; + + +# 三、业内方案调研 +描述业内深度学习框架如何实现此功能,包括与此功能相关的现状、未来趋势;调研的范围包括不限于TensorFlow、PyTorch、NumPy等 + +# 四、对比分析 +对第三部分调研的方案进行对比**评价**和**对比分析**,论述各种方案的优劣势。 + +# 五、设计思路与实现方案 + +## 命名与参数设计 +参考:[飞桨API 设计及命名规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.html) +## 底层OP设计 +## API实现方案 + +# 六、测试和验收的考量 +参考:[新增API 测试及验收规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_accpetance_criteria_cn.html) + +# 七、可行性分析和排期规划 +时间和开发排期规划,主要milestone + +# 八、影响面 +需要进一步讨论的问题,开放性问题,有争议问题;对其他模块是否有影响 + +# 名词解释 + +# 附件及参考资料 diff --git a/docs/compression.md b/docs/compression.md new file mode 100644 index 0000000000000000000000000000000000000000..35d8d6f8dc635510fecb37e714ce929fda5e5b3c --- /dev/null +++ b/docs/compression.md @@ -0,0 +1,470 @@ +# PaddleNLP 模型压缩 API + + **目录** + * [模型压缩 API 功能简介](#模型压缩API功简介) + * [三大场景快速启动模型压缩示例](#三大场景快速启动模型压缩示例) + * [四步启动模型压缩](#四步启动模型压缩) + * [Step1:获取模型压缩参数 compression_args](#获取模型压缩参数compression_args) + * [Step2:实例化 Trainer 并调用 compress()](#实例化Trainer并调用compress()) + * [Trainer 实例化参数介绍](#Trainer实例化参数介绍) + * [Step3:实现自定义评估函数(按需可选)](#实现自定义评估函数(按需可选)) + * [Step4:传参并运行压缩脚本](#传参并运行压缩脚本) + * [CompressionArguments 参数介绍](#CompressionArguments参数介绍) + * [模型评估与部署](#模型评估与部署) + * [FAQ](#FAQ) + * [参考文献](#References) + + + + +## 模型压缩 API 功能简介 + +PaddleNLP 模型压缩 API 功能支持对 ERNIE 类下游任务上微调后的模型进行裁剪、量化,以缩小模型体积、减少内存占用、减少计算、提升推理速度从而减少部署难度。模型压缩 API 效果好,且简洁易用。目前裁剪功能现在支持 DynaBERT 中的宽度自适应裁剪策略;量化现在支持静态离线量化方法(PTQ)、量化训练(QAT)和 Embedding 量化。PTQ 无需训练,只需少量校准数据,即可导出量化模型,QAT 类似 FP32 模型的训练过程,也基本能够做到精度无损,Embedding 量化过程较为简单,不需要训练也不需要校准数据即可完成。 + +- **效果好**:目前已经在分类(包含文本分类、文本匹配、自然语言推理、代词消歧、阅读理解等任务)、序列标注、抽取式阅读理解任务上进行过验证,基本达到精度无损。例如,对于 12L768H 和 6L768H 结构的模型,进行宽度保留比例为 2/3 的裁剪基本可以达到精度无损,模型裁剪后推理速度能够达到原先的 1-2 倍;6L768H 结构的模型量化后推理速度能够达到量化前的 2-3 倍。 + +- **简洁易用**:只需要简单几步即可开展模型压缩任务 + +##### ERNIE 3.0 压缩效果 +如下表所示,ERNIE 3.0-Medium (6-layer, 384-hidden, 12-heads) 模型在三类任务(文本分类、序列标注、抽取式阅读理解)经过裁剪 + 量化后加速比均达到 3 倍左右,所有任务上平均精度损失可控制在 0.5 以内(0.46)。 + +| | TNEWS 性能 | TNEWS 精度 | MSRA_NER 性能 | MSRA_NER 精度 | CMRC2018 性能 | CMRC2018 精度 | +| -------------------------- | ------------- | ------------ | ------------- | ------------- | ------------- | ------------- | +| ERNIE 3.0-Medium+FP32 | 1123.85(1.0x) | 57.45 | 366.75(1.0x) | 93.04 | 146.84(1.0x) | 66.95 | +| ERNIE 3.0-Medium+INT8 | 3226.26(2.9x) | 56.99(-0.46) | 889.33(2.4x) | 92.70(-0.34) | 348.84(2.4x) | 66.32(-0.63 | +| ERNIE 3.0-Medium+裁剪+FP32 | 1424.01(1.3x) | 57.31(-0.14) | 454.27(1.2x) | 93.27(+0.23) | 183.77(1.3x) | 65.92(-1.03) | +| ERNIE 3.0-Medium+裁剪+INT8 | 3635.48(3.2x) | 57.26(-0.19) | 1105.26(3.0x) | 93.20(+0.16) | 444.27(3.0x) | 66.17(-0.78) | + +(以上数据来自 [ERNIE 3.0 性能测试文档](../model_zoo/ernie-3.0/#性能测试),文档包含测试环境介绍) + +##### UIE 压缩效果 + +以报销工单信息抽取任务为例,使用 `uie-base` 进行微调,先得到原始 FP32 模型,然后使用 QAT 策略进一步量化。量化后的模型比原始 FP32 模型的 F1 值高 2.19。 + +| Models | F1 | +| ------------- |:------------:| +| uie-base+微调+FP32 | 91.93 | +| uie-base+微调+量化+INT8 | 94.12 | + + + + +### 三大场景快速启动模型压缩示例 + +本项目提供了压缩 API 在分类(包含文本分类、文本匹配、自然语言推理、代词消歧等任务)、序列标注、抽取式阅读理解三大场景下的使用样例,可以分别参考 [ERNIE 3.0](../model_zoo/ernie-3.0/) 目录下的 [compress_seq_cls.py](../model_zoo/ernie-3.0/compress_seq_cls.py) 
、[compress_token_cls.py](../model_zoo/ernie-3.0/compress_token_cls.py)、[compress_qa.py](../model_zoo/ernie-3.0/compress_qa.py) 脚本,启动方式如下: + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 config.yml 配置 +python compress_seq_cls.py \ + --dataset "clue tnews" \ + --model_name_or_path best_models/TNEWS \ + --output_dir ./ + +# 序列标注任务 +python compress_token_cls.py \ + --dataset "msra_ner" \ + --model_name_or_path best_models/MSRA_NER \ + --output_dir ./ \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --learning_rate 0.00005 \ + --remove_unused_columns False \ + --num_train_epochs 3 + +# 阅读理解任务 +python compress_qa.py \ + --dataset "clue cmrc2018" \ + --model_name_or_path best_models/CMRC2018 \ + --output_dir ./ \ + --max_seq_length 512 \ + --learning_rate 0.00003 \ + --num_train_epochs 8 \ + --per_device_train_batch_size 24 \ + --per_device_eval_batch_size 24 \ + --max_answer_length 50 \ + +``` + +示例代码中压缩使用的是 datasets 内置的数据集,若想要使用自定义数据集压缩,可参考 [datasets 加载自定义数据集文档](https://huggingface.co/docs/datasets/loading)。 + + + + +## 四步启动模型压缩 + +### 环境依赖 + +- paddlepaddle-gpu >=2.4.1 +- paddlenlp >= 2.5 +- paddleslim >= 2.4.0 + +模型压缩 API 中的压缩功能依赖最新的 `paddleslim` 包。可运行以下命令安装: + +```shell +pip install paddleslim -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +模型压缩 API 的使用大致分为四步: + +- Step 1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +- Step 2: 实例化 Trainer 并调用 `compress()` 压缩 API +- Step 3: 实现自定义评估函数和 loss 计算函数(按需可选),以适配自定义压缩任务 +- Step 4:传参并运行压缩脚本 + +**示例代码** + +```python +from paddlenlp.trainer import PdArgumentParser, CompressionArguments + +# Step1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +parser = PdArgumentParser(CompressionArguments) +compression_args = parser.parse_args_into_dataclasses() + +# Step2: 实例化 Trainer 并调用 compress() +trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion) + +# Step 3: 使用内置模型和评估方法,则不需要实现自定义评估函数和 loss 计算函数 +trainer.compress() +``` + +```shell +# Step4: 传参并运行压缩脚本 +python compress.py \ + --output_dir ./compress_models \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 4 \ + --width_mult_list 0.75 \ + --batch_size_list 4 8 16 \ + --batch_num_list 1 \ + +``` + + + + +### Step 1:获取模型压缩参数 compression_args + +使用 `PdArgumentParser` 对象解析从命令行得到的超参数,从而得到 `compression_args`,并将 `compression_args` 传给 `Trainer` 对象。获取 `compression_args` 的方法通常如下: + +```python +from paddlenlp.trainer import PdArgumentParser, CompressionArguments + +# Step1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +parser = PdArgumentParser(CompressionArguments) +compression_args = parser.parse_args_into_dataclasses() +``` + + + +### Step 2:实例化 Trainer 并调用 compress + + + +#### Trainer 实例化参数介绍 + +- **--model** 待压缩的模型,目前支持 ERNIE、BERT、RoBERTa、ERNIE-M、ELECTRA、ERNIE-Gram、PP-MiniLM、TinyBERT 等结构相似的模型,是在下游任务中微调后的模型,当预训练模型选择 ERNIE 时,需要继承 `ErniePretrainedModel`。以分类任务为例,可通过`AutoModelForSequenceClassification.from_pretrained(model_name_or_path)` 等方式来获取,这种情况下,`model_name_or_path`目录下需要有 model_config.json, model_state.pdparams 文件; +- **--data_collator** 三类任务均可使用 PaddleNLP 预定义好的 [DataCollator 类](../paddlenlp/data/data_collator.py),`data_collator` 可对数据进行 `Pad` 等操作。使用方法参考 [示例代码](../model_zoo/ernie-3.0/compress_seq_cls.py) 即可; +- **--train_dataset** 裁剪训练需要使用的训练集,是任务相关的数据。自定义数据集的加载可参考 [文档](https://huggingface.co/docs/datasets/loading)。不启动裁剪时,可以为 None; 
+- **--eval_dataset** 裁剪训练使用的评估集,也是量化使用的校准数据,是任务相关的数据。自定义数据集的加载可参考 [文档](https://huggingface.co/docs/datasets/loading)。是 Trainer 的必选参数; +- **--tokenizer** 模型 `model` 对应的 `tokenizer`,可使用 `AutoTokenizer.from_pretrained(model_name_or_path)` 来获取。 +- **--criterion** 模型的 loss 计算方法,可以是一个 nn.Layer 对象,也可以是一个函数,用于在 ofa_utils.py 计算模型的 loss 用于计算梯度从而确定神经元重要程度。 + +其中,`criterion` 函数定义示例: + +```python +# 支持的形式一: +def criterion(logits, labels): + loss_fct = paddle.nn.BCELoss() + start_ids, end_ids = labels + start_prob, end_prob = outputs + start_ids = paddle.cast(start_ids, 'float32') + end_ids = paddle.cast(end_ids, 'float32') + loss_start = loss_fct(start_prob, start_ids) + loss_end = loss_fct(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + return loss + +# 支持的形式二: +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, + label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, + label=end_position) + loss = (start_loss + end_loss) / 2 + return loss +``` + +用以上参数实例化 Trainer 对象,之后直接调用 `compress()` 。`compress()` 会根据选择的策略进入不同的分支,以进行裁剪或者量化的过程。 + +**示例代码** + +```python +from paddlenlp.trainer import PdArgumentParser, CompressionArguments + +# Step1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +parser = PdArgumentParser(CompressionArguments) +compression_args = parser.parse_args_into_dataclasses() + +# Step2: 实例化 Trainer 并调用 compress() +trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion) + +trainer.compress() +``` + + + +### Step3:实现自定义评估函数,以适配自定义压缩任务 + +当使用 DynaBERT 裁剪功能时,如果模型、Metrics 不符合下表的情况,那么模型压缩 API 中评估函数需要自定义。 + +目前 DynaBERT 裁剪功能只支持 SequenceClassification 等三类 PaddleNLP 内置 class,并且内置评估器对应为 Accuracy、F1、Squad。 + +| Model class name | SequenceClassification | TokenClassification | QuestionAnswering | +| ---------------- | ------------------------- | --------------------- | ----------------- | +| Metrics | Accuracy | F1 | Squad | + +需要注意以下三个条件: + +- 如果模型是自定义模型,需要继承 `XXXPretrainedModel`,例如当预训练模型选择 ERNIE 时,继承 `ErniePretrainedModel`,模型需要支持调用 `from_pretrained()` 导入模型,且只含 `pretrained_model_name_or_path` 一个必选参数,`forward` 函数返回 `logits` 或者 `tuple of logits`; + +- 如果模型是自定义模型,或者数据集比较特殊,压缩 API 中 loss 的计算不符合使用要求,需要自定义 `custom_evaluate` 评估函数,需要同时支持 `paddleslim.nas.ofa.OFA` 模型和 `paddle.nn.layer` 模型。可参考下方示例代码。 + - 输入`model` 和 `dataloader`,返回模型的评价指标(单个 float 值)。 + - 将该函数传入 `compress()` 中的 `custom_evaluate` 参数; + +`custom_evaluate()` 函数定义示例: + +```python + import paddle + from paddle.metric import Accuracy + + @paddle.no_grad() + def evaluate_seq_cls(self, model, data_loader): + metric = Accuracy() + model.eval() + metric.reset() + for batch in data_loader: + logits = model(input_ids=batch['input_ids'], + token_type_ids=batch['token_type_ids']) + # Supports paddleslim.nas.ofa.OFA model and nn.layer model. 
+ if isinstance(model, paddleslim.nas.ofa.OFA): + logits = logits[0] + correct = metric.compute(logits, batch['labels']) + metric.update(correct) + res = metric.accumulate() + logger.info("acc: %s, " % res) + model.train() + return res +``` + + +在调用 `compress()` 时传入这个自定义函数: + +```python +trainer.compress(custom_evaluate=evaluate_seq_cls) +``` + + + + +### Step 4:传参并运行压缩脚本 + +这一步主要是将压缩需要用到的参数通过命令行传入,并启动压缩脚本。 + +压缩启动命令: + +**示例代码** + +```shell +# Step4: 运行压缩脚本 +python compress.py \ + --output_dir ./compress_models \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 4 \ + --width_mult_list 0.75 \ + --batch_size_list 4 8 16 \ + --batch_num_list 1 \ + +``` + +下面会介绍模型压缩启动命令可以传递的超参数。 + + + +#### CompressionArguments 参数介绍 + +`CompressionArguments` 中的参数一部分是模型压缩功能特定参数,另一部分继承自 `TrainingArguments`,是压缩训练时需要设置的超参数。下面会进行具体介绍, + +**公共参数** + +公共参数中的参数和具体的压缩策略无关。 + +- **--strategy** 模型压缩策略,目前支持 `'dynabert+qat+embeddings'`、`'dynabert+qat'`、`'dynabert+embeddings'`、`'dynabert+ptq'`、 `'dynabert'` 、 `'ptq'` 和 `'qat'`。 +其中 `'dynabert'` 代表基于 DynaBERT 的宽度裁剪策略,`'qat'` 表示量化训练,`'ptq'` 表示静态离线量化,`'embeddings'` 表示词表量化,并且 `--strategy` 支持选择它们之间所有合理的策略组合。默认是 `'dynabert+ptq'`; + +- **--output_dir** 模型压缩后模型保存目录; + +- **--input_infer_model_path** 待压缩的静态图模型,该参数是为了支持对静态图模型的压缩。不需使用时可忽略。默认为 `None`; + +- **--input_dtype** 导出模型的输入类型,一般是 `int64` 或者是 `int32`。默认为 `int64`; + +**DynaBERT 裁剪参数** + +当用户使用了 DynaBERT 裁剪、PTQ 量化策略(即策略中包含 'dynabert'、'qat' 时需要传入以下可选参数: + +- **--width_mult_list** 裁剪宽度保留的搜索列表,对 6 层模型推荐 `3/4` ,对 12 层模型推荐 `2/3`,表示对 `q`、`k`、`v` 以及 `ffn` 权重宽度的保留比例,假设 12 层模型原先有 12 个 attention heads,裁剪后只剩 9 个 attention heads。默认是 `[3/4]`; + +- **--per_device_train_batch_size** 用于裁剪训练的每个 GPU/CPU 核心 的 batch 大小。默认是 8; + +- **--per_device_eval_batch_size** 用于裁剪评估的每个 GPU/CPU 核心 的 batch 大小。默认是 8; + +- **--num_train_epochs** 裁剪训练所需要的 epochs 数。默认是 3.0; + +- **--max_steps** 如果设置为正数,则表示要执行的训练步骤总数。覆盖 `num_train_epochs`。默认为 -1; + +- **--logging_steps** 两个日志之间的更新步骤数。默认为 500; + +- **--save_steps** 评估模型的步数。默认为 100; + +- **--optim** 裁剪训练使用的优化器名称,默认为adamw,默认为 'adamw'; + +- **--learning_rate** 裁剪训练使用优化器的初始学习率,默认为 5e-05; + +- **--weight_decay** 除了所有 bias 和 LayerNorm 权重之外,应用于所有层裁剪训练时的权重衰减数值。 默认为 0.0; + +- **--adam_beta1** 裁剪训练使用 AdamW 的优化器时的 beta1 超参数。默认为 0.9; + +- **--adam_beta2** 裁剪训练使用 AdamW 优化器时的 beta2 超参数。默认为 0.999; + +- **--adam_epsilon** 裁剪训练使用 AdamW 优化器时的 epsilon 超参数。默认为 1e-8; + +- **--max_grad_norm** 最大梯度范数(用于梯度裁剪)。默认为 1.0; + +- **--lr_scheduler_type** 要使用的学习率调度策略。默认为 'linear'; + +- **--warmup_ratio** 用于从 0 到 `learning_rate` 的线性 warmup 的总训练步骤的比例。 默认为 0.0; + +- **--warmup_steps** 用于从 0 到 `learning_rate` 的线性 warmup 的步数。覆盖 warmup_ratio 参数。默认是 0; + +- **--seed** 设置的随机种子。为确保多次运行的可复现性。默认为 42; + +- **--device** 运行的设备名称。支持 cpu/gpu。默认为 'gpu'; + +- **--remove_unused_columns** 是否去除 Dataset 中不用的字段数据。默认是 True; + +**量化公共参数** + + +**PTQ 量化参数** + +当用户使用了 PTQ 量化策略时需要传入以下可选参数: + +- **--algo_list** 量化策略搜索列表,目前支持 `'KL'`、`'abs_max'`、`'min_max'`、`'avg'`、`'hist'`、`'mse'` 和 `'emd'`,不同的策略计算量化比例因子的方法不同。建议传入多种策略,可批量得到由多种策略产出的多个量化模型,可从中选择效果最优模型。ERNIE 类模型较推荐 `'hist'`, `'mse'`, `'KL'`,`'emd'` 等策略。默认是 ['mse', 'KL']; + +- **--batch_num_list** batch_nums 的超参搜索列表,batch_nums 表示采样需要的 batch 数。校准数据的总量是 batch_size * batch_nums。如 batch_num 为 None,则 data loader 提供的所有数据均会被作为校准数据。默认是 [1]; + +- **--batch_size_list** 校准样本的 batch_size 搜索列表。并非越大越好,也是一个超参数,建议传入多种校准样本数,最后可从多个量化模型中选择最优模型。默认是 `[4]`; + +- **--weight_quantize_type** 权重的量化类型,支持 `'abs_max'` 和 `'channel_wise_abs_max'` 两种方式。通常使用 'channel_wise_abs_max', 这种方法得到的模型通常精度更高; + +- 
**activation_quantize_type** 激活 tensor 的量化类型。支持 'abs_max', 'range_abs_max' 和 'moving_average_abs_max'。在 'ptq' 策略中,默认是 'range_abs_max'; + +- **--round_type** 权重值从 FP32 到 INT8 的转化方法,目前支持 `'round'` 和 '[adaround](https://arxiv.org/abs/2004.10568.)',默认是 `'round'`; + +- **--bias_correction** 如果是 True,表示使用 [bias correction](https://arxiv.org/abs/1810.05723) 功能,默认为 False。 + +**QAT 量化参数** + +当用户使用了 QAT 量化策略时,除了可以设置上面训练相关的参数,还可以传入以下可选参数: + +- **--weight_quantize_type** 权重的量化类型,支持 `'abs_max'` 和 `'channel_wise_abs_max'` 两种方式。通常使用 'channel_wise_abs_max', 这种方法得到的模型通常精度更高; + +- **activation_quantize_type** 激活 tensor 的量化类型。支持 'abs_max', 'range_abs_max' 和 'moving_average_abs_max'。在'qat'策略中,它默认是 'moving_average_abs_max'; + +- **use_pact** 是否使用 PACT 量化策略,是对普通方法的改进,参考论文[PACT: Parameterized Clipping Activation for Quantized Neural Networks](https://arxiv.org/abs/1805.06085),打开后精度更高,默认是 True。 + +- **moving_rate** 'moving_average_abs_max' 量化方法中的衰减系数,默认为 0.9; + + + +## 模型评估与部署 + +裁剪、量化后的模型不能再通过 `from_pretrained` 导入进行预测,而是需要使用 Paddle 部署工具才能完成预测。 + +压缩后的模型部署可以参考 [部署文档](../model_zoo/ernie-3.0/deploy) 完成。 + +### Python 部署 + +服务端部署可以从这里开始。可以参考 [seq_cls_infer.py](../model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py) 或者 [token_cls_infer.py](../model_zoo/ernie-3.0/deploy/python/token_cls_infer.py) 来编写自己的预测脚本。并根据 [Python 部署指南](../model_zoo/ernie-3.0/deploy/python/README.md) 的介绍安装预测环境,对压缩后的模型进行精度评估、性能测试以及部署。 + + + + +### 服务化部署 + +- [FastDeploy ERNIE 3.0 模型 Serving 部署示例](../model_zoo/ernie-3.0/deploy/serving/README.md) +- [基于PaddleNLP SimpleServing 的服务化部署](../model_zoo/ernie-3.0/deploy/simple_serving/README.md) + +### 移动端部署 + + + + +## FAQ + +**Q:模型压缩需要数据吗?** + +A:DynaBERT 裁剪和量化训练 QAT 需要使用训练集进行训练,验证集进行评估,其过程类似微调;静态离线量化 PTQ 只需要验证集(对样本量要求较低,一般 4-16 个样本就可能可以满足要求); + +**Q:示例代码里是内置的数据集,如何使用我自己的数据呢** + +A:可以参考 UIE 的例子,也可以参考 [datasets 加载自定义数据集文档](https://huggingface.co/docs/datasets/loading); + +**Q:模型压缩后的模型还能继续训练吗?** + +A:模型压缩主要用于推理加速,因此压缩后的模型都是静态图(预测)模型,不能再通过 `from_pretrained()` API 导入继续训练; + +**Q:裁剪和量化怎么选?** + +A:可以设置参数 `--strategy` 来选择压缩的策略,默认是裁剪和量化同时选择,先裁剪后量化。目前裁剪策略有训练过程,需要下游任务的训练数据,其训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。静态离线量化则不需要额外的训练,更快,通常来说量化的加速比比裁剪更明显。建议裁剪和量化同时选择,有些情况下可能比单独量化效果更好; + +**Q:裁剪中也有训练过程吗?** + +A:DynaBERT 裁剪类似蒸馏过程,也会有模型训练时用到的超参,方便起见,可以直接使用微调时所用的最佳的超参。如果想进一步提升精度,可以对 `batch_size`、`learning_rate`、`epoch` 等超参数进行 Grid Search; + +**Q:使用 `TensorDataset` 对象做量化报错了,为什么?** + +A:使用量化时,`eval_dataset` 不可以是 `TensorDataset` 对象,因为量化功能内部在静态图模式下执行,而 `TensorDataset` 只能在动态图下使用,两者同时使用会导致错误; + + + +## 参考文献 +- Hou L, Huang Z, Shang L, Jiang X, Chen X and Liu Q. DynaBERT: Dynamic BERT with Adaptive Width and Depth[J]. arXiv preprint arXiv:2004.04037, 2020. + +- Cai H, Gan C, Wang T, Zhang Z, and Han S. Once for all: Train one network and specialize it for efficient deployment[J]. arXiv preprint arXiv:1908.09791, 2020. + +- Wu H, Judd P, Zhang X, Isaev M and Micikevicius P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation[J]. arXiv preprint arXiv:2004.09602v1, 2020. diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 0000000000000000000000000000000000000000..7477c20c7cbe42a9d44925601260336b5c06214c --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,111 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +import os +import sys + +sys.path.insert(0, os.path.abspath("../..")) +sys.path.append(os.path.abspath("..")) + +# -- Project information ----------------------------------------------------- + +project = "PaddleNLP" +copyright = "2023, PaddleNLP" +author = "PaddleNLP" +default_role = "py:obj" + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + "sphinx_rtd_theme", + "recommonmark", + "sphinx.ext.autodoc", + "sphinx.ext.autosummary", + "sphinx.ext.napoleon", + "sphinx_copybutton", + "sphinx_markdown_tables", + "sphinx.ext.viewcode", + "sphinx.ext.coverage", + "sphinx.ext.extlinks", +] + +autodoc_default_options = { + "member-order": "bysource", + "undoc-members": False, +} + +autosummary_generate = True + +# Add any paths that contain templates here, relative to this directory. +templates_path = ["_templates"] + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. +# Usually you set "language" from the command line for these cases. +locale_dirs = ["locale/"] +gettext_compact = False +language = "zh_CN" +add_module_names = False + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = [] + +# source_parsers = { +# '.md': recommonmark.parser.CommonMarkParser, +# } +source_suffix = [".rst", ".md"] + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = "sphinx_book_theme" + +html_theme_options = { + "collapse_navigation": True, + "display_version": True, + "navigation_depth": 5, + "navigation_with_keys": True, + "body_max_width": "80%", +} + +html_css_files = [ + "custom.css", +] + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
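+# Note: entries listed in html_css_files above (e.g. custom.css) are resolved
+# against these static paths, so custom.css is expected to live under _static.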
+html_static_path = ["_static"] +html_logo = "paddle.png" diff --git a/docs/data.md b/docs/data.md new file mode 100644 index 0000000000000000000000000000000000000000..32d0eef2296da1e4d4008f60ba04ad31f8cdbe4a --- /dev/null +++ b/docs/data.md @@ -0,0 +1,237 @@ +# PaddleNLP Data API + +该模块提供了在NLP任务中构建有效的数据处理Pipeline的常用API。 + +## APIl列表 + +| API | 简介 | +| ------------------------------- | :----------------------------------------- | +| `paddlenlp.data.Stack` | 堆叠N个具有相同shape的输入数据来构建一个batch | +| `paddlenlp.data.Pad` | 堆叠N个输入数据来构建一个batch,每个输入数据将会被padding到N个输入数据中最大的长度 | +| `paddlenlp.data.Tuple` | 将多个batchify函数包装在一起,组成tuple | +| `paddlenlp.data.Dict` | 将多个batchify函数包装在一起,组成dict | +| `paddlenlp.data.SamplerHelper` | 构建用于`Dataloader`的可迭代sampler | +| `paddlenlp.data.Vocab` | 用于文本token和ID之间的映射 | +| `paddlenlp.data.JiebaTokenizer` | Jieba分词 | + +## API使用方法 + +以上API都是用来辅助构建`DataLoader`,`DataLoader`比较重要的三个初始化参数是`dataset`、`batch_sampler`和`collate_fn`。 + +`paddlenlp.data.Vocab`和`paddlenlp.data.JiebaTokenizer`用在构建`dataset`时处理文本token到ID的映射。 + +`paddlenlp.data.SamplerHelper`用于构建可迭代的`batch_sampler`。 + +`paddlenlp.data.Stack`、`paddlenlp.data.Pad`、`paddlenlp.data.Tuple`和`paddlenlp.data.Dict`用于构建生成mini-batch的`collate_fn`函数。 + +### 数据预处理 + +#### `paddlenlp.data.Vocab` + +`paddlenlp.data.Vocab`词表类,集合了一系列文本token与ids之间映射的一系列方法,支持从文件、字典、json等一系方式构建词表。 + +```python +from paddlenlp.data import Vocab +# 从文件构建 +vocab1 = Vocab.load_vocabulary(vocab_file_path) +# 从字典构建 +# dic = {'unk':0, 'pad':1, 'bos':2, 'eos':3, ...} +vocab2 = Vocab.from_dict(dic) +# 从json构建,一般是已构建好的Vocab对象先保存为json_str或json文件后再进行恢复 +# json_str方式 +json_str = vocab1.to_json() +vocab3 = Vocab.from_json(json_str) +# json文件方式 +vocab1.to_json(json_file_path) +vocab4 = Vocab.from_json(json_file_path) +``` + +#### `paddlenlp.data.JiebaTokenizer` + +`paddlenlp.data.JiebaTokenizer`初始化需传入`paddlenlp.data.Vocab`类,包含`cut`分词方法和将句子明文转换为ids的`encode`方法。 + +```python +from paddlenlp.data import Vocab, JiebaTokenizer +# 词表文件路径,运行示例程序可先下载词表文件 +# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt +vocab_file_path = './senta_word_dict.txt' +# 构建词表 +vocab = Vocab.load_vocabulary( + vocab_file_path, + unk_token='[UNK]', + pad_token='[PAD]') +tokenizer = JiebaTokenizer(vocab) +tokens = tokenizer.cut('我爱你中国') # ['我爱你', '中国'] +ids = tokenizer.encode('我爱你中国') # [1170578, 575565] +``` + +### 构建`Sampler` + +#### `paddlenlp.data.SamplerHelper` + +`paddlenlp.data.SamplerHelper`的作用是构建用于`DataLoader`的可迭代采样器,它包含`shuffle`、`sort`、`batch`、`shard`等一系列方法,方便用户灵活使用。 + +```python +from paddlenlp.data import SamplerHelper +from paddle.io import Dataset + +class MyDataset(Dataset): + def __init__(self): + super(MyDataset, self).__init__() + self.data = [ + [[1, 2, 3, 4], [1]], + [[5, 6, 7], [0]], + [[8, 9], [1]], + ] + + def __getitem__(self, index): + data = self.data[index][0] + label = self.data[index][1] + return data, label + + def __len__(self): + return len(self.data) + +dataset = MyDataset() +# SamplerHelper返回的是数据索引的可迭代对象,产生的迭代的索引为:[0, 1, 2] +sampler = SamplerHelper(dataset) +# `shuffle()`的作用是随机打乱索引顺序,产生的迭代的索引为:[0, 2, 1] +sampler = sampler.shuffle() +# sort()的作用是按照指定key为排序方式并在buffer_size大小个样本中排序 +# 示例中以样本第一个字段的长度进行升序排序,产生的迭代的索引为:[2, 0, 1] +key = (lambda x, data_source: len(data_source[x][0])) +sampler = sampler.sort(key=key, buffer_size=2) +# batch()的作用是按照batch_size组建mini-batch,产生的迭代的索引为:[[2, 0], [1]] +sampler = sampler.batch(batch_size=2) +# shard()的作用是为多卡训练切分数据集,当前卡产生的迭代的索引为:[[2, 0]] +sampler = sampler.shard(num_replicas=2) +``` + +### 构建`collate_fn` + +#### 
`paddlenlp.data.Stack` + +`paddlenlp.data.Stack`用来组建batch,其输入必须具有相同的shape,输出便是这些输入的堆叠组成的batch数据。 + +```python +from paddlenlp.data import Stack +a = [1, 2, 3, 4] +b = [3, 4, 5, 6] +c = [5, 6, 7, 8] +result = Stack()([a, b, c]) +""" +[[1, 2, 3, 4], + [3, 4, 5, 6], + [5, 6, 7, 8]] +""" +``` + +#### `paddlenlp.data.Pad` + +`paddlenlp.data.Pad`用来组建batch,它的输入长度不同,它首先会将输入数据全部padding到最大长度,然后再堆叠组成batch数据输出。 + +```python +from paddlenlp.data import Pad +a = [1, 2, 3, 4] +b = [5, 6, 7] +c = [8, 9] +result = Pad(pad_val=0)([a, b, c]) +""" +[[1, 2, 3, 4], + [5, 6, 7, 0], + [8, 9, 0, 0]] +""" +``` + +#### `paddlenlp.data.Tuple` + +`paddlenlp.data.Tuple`会将多个组batch的函数包装在一起,组成tuple。 + +```python +from paddlenlp.data import Stack, Pad, Tuple +data = [ + [[1, 2, 3, 4], [1]], + [[5, 6, 7], [0]], + [[8, 9], [1]], + ] +batchify_fn = Tuple(Pad(pad_val=0), Stack()) +ids, label = batchify_fn(data) +""" +ids: +[[1, 2, 3, 4], + [5, 6, 7, 0], + [8, 9, 0, 0]] +label: [[1], [0], [1]] +""" +``` + +#### `paddlenlp.data.Dict` + +`paddlenlp.data.Dict`会将多个组batch的函数包装在一起,组成dict。 + +```python +from paddlenlp.data import Stack, Pad, Dict +data = [ + {'labels':[1], 'token_ids':[1, 2, 3, 4]}, + {'labels':[0], 'token_ids':[5, 6, 7]}, + {'labels':[1], 'token_ids':[8, 9]}, + ] +batchify_fn = Dict({'token_ids':Pad(pad_val=0), 'labels':Stack()}) +ids, label = batchify_fn(data) +""" +ids: +[[1, 2, 3, 4], + [5, 6, 7, 0], + [8, 9, 0, 0]] +label: [[1], [0], [1]] +""" +``` + +### 综合示例 + +```python +from paddlenlp.data import Vocab, JiebaTokenizer, Stack, Pad, Tuple, SamplerHelper +from paddlenlp.datasets import load_dataset +from paddle.io import DataLoader + +# 词表文件路径,运行示例程序可先下载词表文件 +# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt +vocab_file_path = './senta_word_dict.txt' +# 构建词表 +vocab = Vocab.load_vocabulary( + vocab_file_path, + unk_token='[UNK]', + pad_token='[PAD]') +# 初始化分词器 +tokenizer = JiebaTokenizer(vocab) + +def convert_example(example): + text, label = example['text'], example['label'] + ids = tokenizer.encode(text) + label = [label] + return ids, label + +dataset = load_dataset('chnsenticorp', splits='train') +dataset = dataset.map(convert_example, lazy=True) + +pad_id = vocab.token_to_idx[vocab.pad_token] +batchify_fn = Tuple( + Pad(axis=0, pad_val=pad_id), # ids + Stack(dtype='int64') # label +) + +batch_sampler = SamplerHelper(dataset).shuffle().batch(batch_size=16) +data_loader = DataLoader( + dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + +# 测试数据集 +for batch in data_loader: + ids, label = batch + print(ids.shape, label.shape) + print(ids) + print(label) + break +``` diff --git a/docs/data_prepare/data_preprocess.rst b/docs/data_prepare/data_preprocess.rst new file mode 100644 index 0000000000000000000000000000000000000000..cebf4102e2eca1b65ae4093f289c536ff7bd9a8d --- /dev/null +++ b/docs/data_prepare/data_preprocess.rst @@ -0,0 +1,212 @@ +================ +数据处理 +================ + +Dataset中通常为原始数据,需要经过一定的数据处理并进行采样组batch,而后通过 :class:`paddle.io.DataLoader` 为训练或预测使用,PaddleNLP中为其中各环节提供了相应的功能支持。 + +基于预训练模型的数据处理 +------------------------ + +在使用预训练模型做NLP任务时,需要加载对应的Tokenizer,PaddleNLP在 :class:`PreTrainedTokenizer` 中内置的 :func:`__call__` 方法可以实现基础的数据处理功能。PaddleNLP内置的所有预训练模型的Tokenizer都继承自 :class:`PreTrainedTokenizer` ,下面以BertTokenizer举例说明: + +.. 
code-block:: + + from paddlenlp.transformers import BertTokenizer + + tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') + + # 单句转换(单条数据) + print(tokenizer(text='天气不错')) # {'input_ids': [101, 1921, 3698, 679, 7231, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0]} + + # 句对转换(单条数据) + print(tokenizer(text='天气',text_pair='不错')) # {'input_ids': [101, 1921, 3698, 102, 679, 7231, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1]} + + # 单句转换(多条数据) + print(tokenizer(text=['天气','不错'])) # [{'input_ids': [101, 1921, 3698, 102], 'token_type_ids': [0, 0, 0, 0]}, + # {'input_ids': [101, 679, 7231, 102], 'token_type_ids': [0, 0, 0, 0]}] + +关于 :func:`__call__` 方法的其他参数和功能,请查阅PreTrainedTokenizer。 + +paddlenlp内置的 :class:`paddlenlp.datasets.MapDataset` 的 :func:`map` 方法支持传入一个函数,对数据集内的数据进行统一转换。下面我们以 :obj:`LCQMC` 的数据处理流程为例: + +.. code-block:: + + from paddlenlp.transformers import BertTokenizer + from paddlenlp.datasets import load_dataset + + tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') + train_ds = load_dataset('lcqmc', splits='train') + + print(train_ds[0]) # {'query': '喜欢打篮球的男生喜欢什么样的女生', 'title': '爱打篮球的男生喜欢什么样的女生', 'label': 1} + +可以看到, :obj:`LCQMC` 是一个句对匹配任务,即判断两个句子的意思是否相似的2分类任务。我们需要处理的是key为 **query** 和 **title** 的文本数据,我们编写基于 :class:`PreTrainedTokenizer` 的数据处理函数并传入数据集的 :func:`map` 方法。 + +.. code-block:: + + def convert_example(example, tokenizer): + tokenized_example = tokenizer( + text=example['query'], + text_pair=example['title']) + # 加上label用于训练 + tokenized_example['label'] = [example['label']] + return tokenized_example + + from functools import partial + + trans_func = partial( + convert_example, + tokenizer=tokenizer) + + train_ds.map(trans_func) + print(train_ds[0]) # {'input_ids': [101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, + # 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102, + # 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, + # 784, 720, 3416, 4638, 1957, 4495, 102], + # 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + # 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + # 'label': [1]} + +可以看到,数据集中的文本数据已经被处理成了模型可以接受的 *feature* 。 + +:func:`map` 方法有一个重要的参数 :attr:`batched`,当设置为 :obj:`True` 时(默认为 :obj:`False` ),数据处理函数 :func:`trans_func` 的输入不再是单条数据,而是数据集的所有数据: + +.. 
code-block:: + + def convert_examples(examples, tokenizer): + querys = [example['query'] for example in examples] + titles = [example['title'] for example in examples] + tokenized_examples = tokenizer(text=querys, text_pair=titles,return_dict=False) + + # 加上label用于训练 + for idx in range(len(tokenized_examples)): + tokenized_examples[idx]['label'] = [examples[idx]['label']] + + return tokenized_examples + + from functools import partial + + trans_func = partial(convert_examples, tokenizer=tokenizer) + + train_ds.map(trans_func, batched=True) + print(train_ds[0]) # {'input_ids': [101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, + # 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102, + # 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, + # 784, 720, 3416, 4638, 1957, 4495, 102], + # 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + # 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + # 'label': [1]} + +可以看到,在本例中两种实现的结果是相同的。但是在诸如阅读理解,对话等任务中,一条原始数据可能会产生多个 *feature* 的情况(参见 `run_squad.py `__ )通常需要将 :attr:`batched` 参数设置为 :obj:`True` 。 + +:func:`map` 方法还有一个 :attr:`num_workers` 参数,当其大于0时进行多进程数据处理,可以提高处理速度。但是需要注意如果在数据处理的函数中用到了 **数据index** 的相关信息,多进程处理可能会导致错误的结果。 + +关于 :func:`map` 方法的其他参数和 :class:`paddlenlp.datasets.MapDataset` 的其他数据处理方法,请查阅 :doc:`dataset <../source/paddlenlp.datasets.dataset>` 。 + +Batchify +----------- + +PaddleNLP内置了多种collate function,配合 :class:`paddle.io.BatchSampler` 可以协助用户简单的完成组batch的操作。 + +我们继续以 :obj:`LCQMC` 的数据处理流程为例。从上一节最后可以看到,处理后的单条数据是一个 **字典** ,包含 `input_ids` , `token_type_ids` 和 `label` 三个key。 + +其中 `input_ids` 和 `token_type_ids` 是需要进行 **padding** 操作后输入模型的,而 `label` 是需要 **stack** 之后传入loss function的。 + +因此,我们使用PaddleNLP内置的 :func:`Dict` ,:func:`Stack` 和 :func:`Pad` 函数整理batch中的数据。最终的 :func:`batchify_fn` 如下: + +.. code-block:: + + from paddlenlp.data import Dict, Stack, Pad + + # 使用Dict函数将Pad,Stack等函数与数据中的键值相匹配 + train_batchify_fn = lambda samples, fn=Dict({ + 'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id), + 'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + 'label': Stack(dtype="int64") + }): fn(samples) + +之后使用 :class:`paddle.io.BatchSampler` 和 :func:`batchify_fn` 构建 :class:`paddle.io.DataLoader` : + +.. code-block:: + + from paddle.io import DataLoader, BatchSampler + + train_batch_sampler = BatchSampler(train_ds, batch_size=2, shuffle=True) + + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn) + +到此,一个完整的数据准备流程就完成了。关于更多batchify方法,请查阅 :doc:`collate <../source/paddlenlp.data.collate>`。 + +.. note:: + + - 当需要进行 **单机多卡** 训练时,需要将 :class:`BatchSampler` 更换为 :class:`DistributedBatchSampler` 。更多有关 :class:`paddle.io.BatchSampler` 的信息,请查阅 `BatchSampler `_。 + + - 当需要诸如batch内排序,按token组batch等更复杂的组batch功能时。可以使用PaddleNLP内置的 :class:`SamplerHelper` 。相关用例请参考 `reader.py `__。 + +基于非预训练模型的数据处理 +------------------------- + +在使用非预训练模型做NLP任务时,我们可以借助PaddleNLP内置的 :class:`JiebaTokenizer` 和 :class:`Vocab` 完成数据处理的相关功能,整体流程与使用预训练模型基本相似。我们以中文情感分析 :obj:`ChnSentiCorp` 数据集为例: + +.. 
code-block:: + + from paddlenlp.data import JiebaTokenizer, Vocab + from paddlenlp.datasets import load_dataset + + train_ds = load_dataset('chnsenticorp', splits='train') + + print(train_ds[0]) # {'text': '选择珠江花园的原因就是方便,有电动扶梯直接到达海边,周围餐馆、食廊、商场、超市、摊位一应俱全。 + # 酒店装修一般,但还算整洁。 泳池在大堂的屋顶,因此很小,不过女儿倒是喜欢。 包的早餐是西式的,还算丰富。 + # 服务吗,一般', 'label': 1} + + # 从本地词典文件构建Vocab + vocab = Vocab.load_vocabulary('./senta_word_dict.txt', unk_token='[UNK]', pad_token='[PAD]') + + # 使用Vocab初始化JiebaTokenizer + tokenizer = JiebaTokenizer(vocab) + +.. note:: + + - :class:`Vocab` 除了可以从本地词典文件初始化之外,还提供多种初始化方法,包括从 :class:`dictionary` 创建、从数据集创建等。详情请查阅Vocab。 + - 除了使用内置的 :class:`JiebaTokenizer` 外,用户还可以使用任何自定义的方式或第三方库进行分词,之后使用 :func:`Vocab.to_indices` 方法将token转为id。 + +之后与基于预训练模型的数据处理流程相似,编写数据处理函数并传入 :func:`map` 方法: + +.. code-block:: + + def convert_example(example, tokenizer): + input_ids = tokenizer.encode(example["text"]) + valid_length = [len(input_ids)] + label = [example["label"]] + return input_ids, valid_length, label + + trans_fn = partial(convert_example, tokenizer=tokenizer) + train_ds.map(trans_fn) + + print(train_ds[0]) # ([417329, 128448, 140437, 173188, 118001, 213058, 595790, 1106339, 940533, 947744, 169206, + # 421258, 908089, 982848, 1106339, 35413, 1055821, 4782, 377145, 4782, 238721, 4782, 642263, + # 4782, 891683, 767091, 4783, 672971, 774154, 1250380, 1106339, 340363, 146708, 1081122, + # 4783, 1, 943329, 1008467, 319839, 173188, 909097, 1106339, 1010656, 261577, 1110707, + # 1106339, 770761, 597037, 1068649, 850865, 4783, 1, 993848, 173188, 689611, 1057229, 1239193, + # 173188, 1106339, 146708, 427691, 4783, 1, 724601, 179582, 1106339, 1250380], + # [67], + # [1]) + + +可以看到,原始数据已经被处理成了 *feature* 。但是这里我们发现单条数据并不是一个 **字典** ,而是 **元组** 。所以我们的 :func:`batchify_fn` 也要相应的做一些调整: + +.. 
code-block:: + + from paddlenlp.data import Tuple, Stack, Pad + + # 使用Tuple函数将Pad,Stack等函数与数据中的键值相匹配 + train_batchify_fn = lambda samples, fn=Tuple(( + Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)), # input_ids + Stack(dtype="int64"), # seq len + Stack(dtype="int64") # label + )): fn(samples) + +可以看到,:func:`Dict` 函数是将单条数据中的键值与 :func:`Pad` 等函数进行对应,适用于单条数据是字典的情况。而 :func:`Tuple` 是通过单条数据中不同部分的index进行对应的。 + +所以需要 **注意** 的是 :func:`convert_example` 方法和 :func:`batchify_fn` 方法的匹配。 + +之后的流程与基于预训练模型的数据处理相同。 diff --git a/docs/data_prepare/dataset_list.md b/docs/data_prepare/dataset_list.md new file mode 100644 index 0000000000000000000000000000000000000000..cd86cabae768af14acb1e7118c67d550dd28eb5e --- /dev/null +++ b/docs/data_prepare/dataset_list.md @@ -0,0 +1,106 @@ +# PaddleNLP Datasets API + +PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要**添加splits信息**: + +## 阅读理解 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | ----- | ------ | +| [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | 斯坦福问答数据集,包括SQuAD1.1和SQuAD2.0|`paddlenlp.datasets.load_dataset('squad')` | +| [DuReader-yesno](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,判断答案极性|`paddlenlp.datasets.load_dataset('dureader_yesno')` | +| [DuReader-robust](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,答案原文抽取|`paddlenlp.datasets.load_dataset('dureader_robust')` | +| [CMRC2018](http://hfl-rc.com/cmrc2018/) | 第二届“讯飞杯”中文机器阅读理解评测数据集|`paddlenlp.datasets.load_dataset('cmrc2018')` | +| [DRCD](https://github.com/DRCKnowledgeTeam/DRCD) | 台達閱讀理解資料集|`paddlenlp.datasets.load_dataset('drcd')` | +| [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) | Washington大学问答数据集|`paddlenlp.datasets.load_dataset('triviaqa')` | +| [C3](https://dataset.org/c3/) | 阅读理解单选题 |`paddlenlp.datasets.load_dataset('c3')` | + + +## 文本分类 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [CoLA](https://nyu-mll.github.io/CoLA/) | 单句分类任务,二分类,判断句子是否合法| `paddlenlp.datasets.load_dataset('glue','cola')`| +| [SST-2](https://nlp.stanford.edu/sentiment/index.html) | 单句分类任务,二分类,判断句子情感极性| `paddlenlp.datasets.load_dataset('glue','sst-2')`| +| [MRPC](https://microsoft.com/en-us/download/details.aspx?id=52398) | 句对匹配任务,二分类,判断句子对是否是相同意思| `paddlenlp.datasets.load_dataset('glue','mrpc')`| +| [STSB](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) | 计算句子对相似性,分数为1~5| `paddlenlp.datasets.load_dataset('glue','sts-b')`| +| [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) | 判定句子对是否等效,等效、不等效两种情况,二分类任务| `paddlenlp.datasets.load_dataset('glue','qqp')`| +| [MNLI](http://www.nyu.edu/projects/bowman/multinli/) | 句子对,一个前提,一个是假设。前提和假设的关系有三种情况:蕴含(entailment),矛盾(contradiction),中立(neutral)。句子对三分类问题| `paddlenlp.datasets.load_dataset('glue','mnli')`| +| [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) | 判断问题(question)和句子(sentence)是否蕴含,蕴含和不蕴含,二分类| `paddlenlp.datasets.load_dataset('glue','qnli')`| +| [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) | 判断句对是否蕴含,句子1和句子2是否互为蕴含,二分类任务| `paddlenlp.datasets.load_dataset('glue','rte')`| +| [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | 判断句子对是否相关,相关或不相关,二分类任务| `paddlenlp.datasets.load_dataset('glue','wnli')`| +| [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html) | A Large-scale Chinese Question Matching Corpus 语义匹配数据集| `paddlenlp.datasets.load_dataset('lcqmc')`| +| [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb) | 中文评论情感分析语料| `paddlenlp.datasets.load_dataset('chnsenticorp')`| +| 
[COTE-DP](https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLuge=1) | 中文观点抽取语料 | `paddlenlp.datasets.load_dataset('cote', 'dp')`| +| [SE-ABSA16_PHNS](https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLuge=1) | 中文评价对象级情感分析语料| `paddlenlp.datasets.load_dataset('seabsa16', 'phns')`| +| [AFQMC](https://github.com/CLUEbenchmark/CLUE) | 蚂蚁金融语义相似度数据集,1表示句子1和句子2的含义类似,0表示含义不同| `paddlenlp.datasets.load_dataset('clue', 'afqmc')`| +| [TNEWS](https://github.com/CLUEbenchmark/CLUE) | 今日头条中文新闻(短文本)分类,共15类| `paddlenlp.datasets.load_dataset('clue', 'tnews')`| +| [IFLYTEK](https://github.com/CLUEbenchmark/CLUE) | 长文本分类,共119个类别| `paddlenlp.datasets.load_dataset('clue', 'iflytek')`| +| [OCNLI](https://github.com/cluebenchmark/OCNLI) | 原生中文自然语言推理数据集,句子对三分类问题| `paddlenlp.datasets.load_dataset('clue', 'ocnli')`| +| [CMNLI ](https://github.com/CLUEbenchmark/CLUE) | 中文语言推理任务,判断sentence1和sentence2的关系:蕴含(entailment),矛盾(contradiction),中立(neutral)。句子对三分类问题 | `paddlenlp.datasets.load_dataset('clue', 'cmnli')`| +| [CLUEWSC2020](https://github.com/CLUEbenchmark/CLUE) | WSC Winograd模式挑战中文版,代词消歧任务,二分类任务| `paddlenlp.datasets.load_dataset('clue', 'cluewsc2020')`| +| [CSL](https://github.com/P01son6415/CSL) | 论文关键词识别,判断关键词是否全部为真实关键词,二分类任务 | `paddlenlp.datasets.load_dataset('clue', 'csl')`| +| [EPRSTMT](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的电商产品评论情感分析数据集,Positive、Negative 情感 2 分类任务| `paddlenlp.datasets.load_dataset('fewclue', 'eprstmt')`| +| [CSLDCP](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的中文科学文献学科分类数据集,根据文献的中文摘要判断文献类别,共 67 类别。| `paddlenlp.datasets.load_dataset('fewclue', 'csldcp')`| +| [TNEWSF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的今日头条中文新闻(短文本)分类,共15类 | `paddlenlp.datasets.load_dataset('fewclue', 'tnews')`| +| [IFLYTEK](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的长文本分类任务,共 119 个类别 | `paddlenlp.datasets.load_dataset('fewclue', 'iflytek')`| +| [OCNLIF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的中文自然语言推理数据集,句子对三分类问题 | `paddlenlp.datasets.load_dataset('fewclue', 'ocnli')`| +| [BUSTM](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中对话短文本语义匹配数据集, 2 分类任务 | `paddlenlp.datasets.load_dataset('fewclue', ‘bustm')`| +| [CHIDF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的成语阅读理解填空, 根据文本内容从候选 7 个成语中预测正确的成语 | `paddlenlp.datasets.load_dataset('fewclue', 'chid')`| +| [CSLF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的论文关键词识别,判断关键词是否全部为真实关键词,二分类任务 | `paddlenlp.datasets.load_dataset('fewclue', 'csl')`| +| [CLUEWSCF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的 WSC Winograd 模式挑战中文版,代词消歧任务,二分类任务 | `paddlenlp.datasets.load_dataset('fewclue', 'cluewsc')`| +| [THUCNews](https://github.com/gaussic/text-classification-cnn-rnn#%E6%95%B0%E6%8D%AE%E9%9B%86) | THUCNews中文新闻类别分类 | `paddlenlp.datasets.load_dataset('thucnews')` | +| [HYP](https://pan.webis.de/semeval19/semeval19-web/) | 英文政治新闻情感分类语料 | `paddlenlp.datasets.load_dataset('hyp')` | +| [XNLI](https://github.com/facebookresearch/XNLI) | 15种语言自然语言推理数据集,三分类任务. | `paddlenlp.datasets.load_dataset('xnli', 'ar')`| +| [XNLI_CN](https://github.com/facebookresearch/XNLI) | 中文自然语言推理数据集(XNLI的子集),三分类任务. 
| `paddlenlp.datasets.load_dataset('xnli_cn')`| + +## 文本匹配 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [CAIL2019-SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) | 相似法律案例匹配 | `paddlenlp.datasets.load_dataset('cail2019_scm')` | + +## 序列标注 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA 命名实体识别数据集| `paddlenlp.datasets.load_dataset('msra_ner')`| +| [People's Daily](https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/People's%20Daily) | 人民日报命名实体识别数据集| `paddlenlp.datasets.load_dataset('peoples_daily_ner')`| +| [CoNLL-2002](https://www.aclweb.org/anthology/W02-2024/) | 西班牙语和荷兰语实体识别数据集| `paddlenlp.datasets.load_dataset('conll2002', 'es')`| + + +## 机器翻译 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [IWSLT15](https://workshop2015.iwslt.org/) | IWSLT'15 English-Vietnamese data 英语-越南语翻译数据集| `paddlenlp.datasets.load_dataset('iwslt15')`| +| [WMT14ENDE](http://www.statmt.org/wmt14/translation-task.html) | WMT14 EN-DE 经过BPE分词的英语-德语翻译数据集| `paddlenlp.datasets.load_dataset('wmt14ende')`| + +## 机器同传 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [BSTC](https://aistudio.baidu.com/aistudio/competition/detail/44/) | 千言数据集:机器同传,包括transcription_translation和asr | `paddlenlp.datasets.load_dataset('bstc', 'asr')`| + +## 对话系统 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [DuConv](https://aistudio.baidu.com/aistudio/competition/detail/48/) | 千言数据集:开放域对话,中文知识型对话数据集 | `paddlenlp.datasets.load_dataset('duconv')`| + +## 文本生成 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [Poetry](https://github.com/chinese-poetry/chinese-poetry) | 中文诗歌古典文集数据| `paddlenlp.datasets.load_dataset('poetry')`| +| [Couplet](https://github.com/v-zich/couplet-clean-dataset) | 中文对联数据集| `paddlenlp.datasets.load_dataset('couplet')`| +| [DuReaderQG](https://github.com/PaddlePaddle/Research/tree/master/NLP/DuReader-Robust-BASELINE) | 基于DuReader的问题生成数据集| `paddlenlp.datasets.load_dataset('dureader_qg')`| +| [AdvertiseGen](https://github.com/ZhihongShao/Planning-based-Hierarchical-Variational-Model) | 中文文案生成数据集| `paddlenlp.datasets.load_dataset('advertisegen')`| +| [LCSTS_new](https://aclanthology.org/D15-1229.pdf) | 中文摘要生成数据集| `paddlenlp.datasets.load_dataset('lcsts_new')`| +| [CNN/Dailymail](https://github.com/abisee/cnn-dailymail) | 英文摘要生成数据集| `paddlenlp.datasets.load_dataset('cnn_dailymail')`| + +## 语料库 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [PTB](http://www.fit.vutbr.cz/~imikolov/rnnlm/) | Penn Treebank Dataset | `paddlenlp.datasets.load_dataset('ptb')`| +| [Yahoo Answer 100k](https://arxiv.org/pdf/1702.08139.pdf) | 从Yahoo Answer采样100K| `paddlenlp.datasets.load_dataset('yahoo_answer_100k')`| diff --git a/docs/data_prepare/dataset_load.rst b/docs/data_prepare/dataset_load.rst new file mode 100644 index 0000000000000000000000000000000000000000..e3d193da8ec29382c7637589ca536f078a703ec5 --- /dev/null +++ b/docs/data_prepare/dataset_load.rst @@ -0,0 +1,69 @@ +============ +加载数据集 +============ + +快速加载内置数据集 +--------------------- + +目前PaddleNLP内置20余个NLP数据集,涵盖阅读理解,文本分类,序列标注,机器翻译等多项任务。目前提供的数据集可以在 :doc:`数据集列表 <./dataset_list>` 中找到。 + +以 **msra_ner** 数据集为例: + +.. 
code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, test_ds = load_dataset("msra_ner", splits=("train", "test")) + +:func:`load_dataset` 方法会从 :obj:`paddlenlp.datasets` 下找到msra_ner数据集对应的数据读取脚本(默认路径:paddlenlp/datasets/msra_ner.py),并调用脚本中 :class:`DatasetBuilder` 类的相关方法生成数据集。 + +生成数据集可以以 :class:`MapDataset` 和 :class:`IterDataset` 两种类型返回,分别是对 :class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展,只需在 :func:`load_dataset` 时设置 :attr:`lazy` 参数即可获取相应类型。:obj:`Flase` 对应返回 :class:`MapDataset` ,:obj:`True` 对应返回 :class:`IterDataset`,默认值为None,对应返回 :class:`DatasetBuilder` 默认的数据集类型,大多数为 :class:`MapDataset` 。 + +.. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds = load_dataset("msra_ner", splits="train") + >>> print(type(train_ds)) + # Default + >>> train_ds = load_dataset("msra_ner", splits="train", lazy=True) + >>> print(type(train_ds)) + + +关于 :class:`MapDataset` 和 :class:`IterDataset` 功能和异同可以参考API文档 :doc:`datasets <../source/paddlenlp.datasets.dataset>`。 + +选择子数据集 +^^^^^^^^^^^^^^^^^^^^^^^ + +有些数据集是很多子数据集的集合,每个子数据集都是一个独立的数据集。例如 **GLUE** 数据集就包含COLA, SST2, MRPC, QQP等10个子数据集。 + +:func:`load_dataset` 方法提供了一个 :attr:`name` 参数用来指定想要获取的子数据集。使用方法如下: + +.. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, dev_ds = load_dataset("glue", name="cola", splits=("train", "dev")) + +以内置数据集格式读取本地数据集 +----------------------------- + +有的时候,我们希望使用数据格式与内置数据集相同的本地数据替换某些内置数据集的数据(例如参加SQuAD竞赛,对训练数据进行了数据增强)。 :func:`load_dataset` 方法提供的 :attr:`data_files` 参数可以实现这个功能。以 **SQuAD** 为例。 + +.. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, dev_ds = load_dataset("squad", data_files=("my_train_file.json", "my_dev_file.json")) + >>> test_ds = load_dataset("squad", data_files="my_test_file.json") + +.. note:: + + 对于某些数据集,不同的split的读取方式不同。对于这种情况则需要在 :attr:`splits` 参数中以传入与 :attr:`data_files` **一一对应** 的split信息。 + + 此时 :attr:`splits` 不再代表选取的内置数据集,而代表以何种格式读取本地数据集。 + + 下面以 **COLA** 数据集为例: + + .. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, test_ds = load_dataset("glue", "cola", splits=["train", "test"], data_files=["my_train_file.csv", "my_test_file.csv"]) + + **另外需要注意数据集的是没有默认加载选项的,**:attr:`splits` **和**:attr:`data_files` **必须至少指定一个。** \ No newline at end of file diff --git a/docs/data_prepare/dataset_self_defined.rst b/docs/data_prepare/dataset_self_defined.rst new file mode 100644 index 0000000000000000000000000000000000000000..673d78a26acb7af4a48c3c35dfbbd277933b9b17 --- /dev/null +++ b/docs/data_prepare/dataset_self_defined.rst @@ -0,0 +1,155 @@ +============ +如何自定义数据集 +============ + +通过使用PaddleNLP提供的 :func:`load_dataset` , :class:`MapDataset` 和 :class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。 + +从本地文件创建数据集 +------------------- + +从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 :func:`load_dataset` 中创建数据集。 + +以 `waybill_ie `__ 快递单信息抽取任务中的数据为例: + +.. 
code-block:: + + from paddlenlp.datasets import load_dataset + + def read(data_path): + with open(data_path, 'r', encoding='utf-8') as f: + # 跳过列名 + next(f) + for line in f: + words, labels = line.strip('\n').split('\t') + words = words.split('\002') + labels = labels.split('\002') + yield {'tokens': words, 'labels': labels} + + # data_path为read()方法的参数 + map_ds = load_dataset(read, data_path='train.txt',lazy=False) + iter_ds = load_dataset(read, data_path='train.txt',lazy=True) + +我们推荐将数据读取代码写成生成器(generator)的形式,这样可以更好的构建 :class:`MapDataset` 和 :class:`IterDataset` 两种数据集。同时我们也推荐将单条数据写成字典的格式,这样可以更方便的监测数据流向。 + +事实上,:class:`MapDataset` 在绝大多数时候都可以满足要求。一般只有在数据集过于庞大无法一次性加载进内存的时候我们才考虑使用 :class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。 + +.. note:: + + 需要注意的是,只有PaddleNLP内置的数据集具有将数据中的label自动转为id的功能(详细条件参见 :doc:`创建DatasetBuilder <../community/contribute_datasets/how_to_write_a_DatasetBuilder>`)。 + + 像上例中的自定义数据集需要在自定义的convert to feature方法中添加label转id的功能。 + + 自定义数据读取function中的参数可以直接以关键字参数的方式传入 :func:`load_dataset` 中。而且对于自定义数据集,:attr:`lazy` 参数是 **必须** 传入的。 + +从 :class:`paddle.io.Dataset/IterableDataset` 创建数据集 +------------------- + +虽然PaddlePddle内置的 :class:`Dataset` 和 :class:`IterableDataset` 是可以直接接入 :class:`DataLoader` 用于模型训练的,但有时我们希望更方便的使用一些数据处理(例如convert to feature, 数据清洗,数据增强等)。而PaddleNLP内置的 :class:`MapDataset` 和 :class:`IterDataset` 正好提供了能实现以上功能的API。 + +所以如果您习惯使用 :class:`paddle.io.Dataset/IterableDataset` 创建数据集的话。只需要在原来的数据集上套上一层 :class:`MapDataset` 或 :class:`IterDataset` 就可以把原来的数据集对象转换成PaddleNLP的数据集。 + +下面举一个简单的小例子。:class:`IterDataset` 的用法基本相同。 + +.. code-block:: + + from paddle.io import Dataset + from paddlenlp.datasets import MapDataset + + class MyDataset(Dataset): + def __init__(self, path): + + def load_data_from_source(path): + ... + ... + return data + + self.data = load_data_from_source(path) + + def __getitem__(self, idx): + return self.data[idx] + + def __len__(self): + return len(self.data) + + ds = MyDataset(data_path) # paddle.io.Dataset + new_ds = MapDataset(ds) # paddlenlp.datasets.MapDataset + +从其他python对象创建数据集 +------------------- + +理论上,我们可以使用任何包含 :func:`__getitem__` 方法和 :func:`__len__` 方法的python对象创建 :class:`MapDataset`。包括 :class:`List` ,:class:`Tuple` ,:class:`DataFrame` 等。只要将符合条件的python对象作为初始化参数传入 :class:`MapDataset` 即可完成创建。 + +.. code-block:: + + from paddlenlp.datasets import MapDataset + + data_source_1 = [1,2,3,4,5] + data_source_2 = ('a', 'b', 'c', 'd') + + list_ds = MapDataset(data_source_1) + tuple_ds = MapDataset(data_source_2) + + print(list_ds[0]) # 1 + print(tuple_ds[0]) # a + +同样的,我们也可以使用包含 :func:`__iter__` 方法的python对象创建 :class:`IterDataset` 。例如 :class:`List`, :class:`Generator` 等。创建方法与 :class:`MapDataset` 相同。 + +.. code-block:: + + from paddlenlp.datasets import IterDataset + + data_source_1 = ['a', 'b', 'c', 'd'] + data_source_2 = (i for i in range(5)) + + list_ds = IterDataset(data_source_1) + gen_ds = IterDataset(data_source_2) + + print([data for data in list_ds]) # ['a', 'b', 'c', 'd'] + print([data for data in gen_ds]) # [0, 1, 2, 3, 4] + +.. note:: + + 需要注意,像上例中直接将 **生成器** 对象传入 :class:`IterDataset` 所生成的数据集。其数据只能迭代 **一次** 。 + +与常规的python对象一样,只要满足以上的条件,我们也可以使用同样的方法从第三方数据集创建PaddleNLP数据集。 + +例如HuggingFace Dataset: + +.. 
code-block:: + + from paddlenlp.datasets import MapDataset + from datasets import load_dataset + + hf_train_ds = load_dataset('msra_ner', split='train') + print(type(train_ds)) # + + train_ds = MapDataset(train_ds) + print(type(train_ds)) # + + print(train_ds[2]) # {'id': '2', + # 'ner_tags': [0, 0, 0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + # 0, 0, 0, 0, 0, 0, 5, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0], + # 'tokens': ['因', '有', '关', '日', '寇', '在', '京', '掠', '夺', '文', '物', + # '详', '情', ',', '藏', '界', '较', '为', '重', '视', ',', '也', + # '是', '我', '们', '收', '藏', '北', '京', '史', '料', '中', '的', + # '要', '件', '之', '一', '。']} + + hf_train_ds = load_dataset('cmrc2018', split='train') + train_ds = MapDataset(hf_train_ds) + print(train_ds[1818]) # {'answers': {'answer_start': [9], 'text': ['字仲可']}, + # 'context': '徐珂(),原名昌,字仲可,浙江杭县(今属杭州市)人。光绪举人。 + # 后任商务印书馆编辑。参加南社。1901年在上海担任了《外交报》、 + # 《东方杂志》的编辑,1911年,接管《东方杂志》的“杂纂部”。与潘仕成、 + # 王晋卿、王辑塘、冒鹤亭等友好。编有《清稗类钞》、《历代白话诗选》、 + # 《古今词选集评》等。光绪十五年(1889年)举人。后任商务印书馆编辑。 + # 参加南社。曾担任袁世凯在天津小站练兵时的幕僚,不久离去。', + # 'id': 'TRAIN_113_QUERY_0', + # 'question': '徐珂字什么?'} + + hf_train_ds = load_dataset('glue', 'sst2', split='train') + train_ds = MapDataset(hf_train_ds) + print(train_ds[0]) # {'idx': 0, 'label': 0, 'sentence': 'hide new secretions from the parental units '} + + hf_train_ds = load_dataset('ptb_text_only', split='train') + train_ds = MapDataset(hf_train_ds) + print(train_ds[1]) # {'sentence': 'pierre N years old will join the board as a nonexecutive director nov. N'} diff --git a/docs/data_prepare/overview.rst b/docs/data_prepare/overview.rst new file mode 100644 index 0000000000000000000000000000000000000000..2fadab309ffa415b8dcd2959e828f0f1ca3a7eac --- /dev/null +++ b/docs/data_prepare/overview.rst @@ -0,0 +1,32 @@ +============ +整体介绍 +============ + +数据集和数据处理部分一直是NLP任务中最重要的环节之一。为了方便用户以更低的学习成本完成这一环节,PaddleNLP提供了以下特性: + +- 功能强大的API。可以帮助用户完成大部分常见NLP任务的数据处理流程。 +- 更灵活的封装。各个模块保持低耦合,高内聚,保证用户可以通过继承和改写满足特定的数据处理需求。 +- 内置数据集涵盖大部分NLP任务,搭配简洁易用的数据集加载协议和贡献协议。对新手和社区贡献者更加友好。 + +核心API +---------- + +- :func:`load_dataset` :数据集快速加载接口,通过传入数据集读取脚本的名称和其他参数调用 :class:`DatasetBuilder` 子类的相关方法生成数据集。关于加载数据集的详细方法,请查阅 :doc:`加载数据集 <./dataset_load>` 。 +- :class:`DatasetBuilder` : :class:`DatasetBuilder` 是一个基类,所有的内置数据集都继承自该类,该类的主要功能是下载和读取数据集文件并生成Dataset。其中大部分方法已经封装,不对贡献者暴露。贡献者通过重写 :func:`_get_data` 和 :func:`_read` 等方法像社区贡献数据集。详细信息请查阅 :doc:`如何贡献数据集 ` 。 +- :class:`MapDataset/IterDataset` :PaddleNLP内置数据集类型,分别是对 :class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展。内置诸如 :func:`map` , :func:`filter` 等适用于NLP任务的数据处理功能。同时还能帮助用户简单创建自定义数据集。详细信息请查阅***和 :doc:`如何自定义数据集 <./dataset_self_defined>` 。 + +数据处理流程设计 +----------------- + +目前PaddleNLP的通用数据处理流程如下: + +#. 加载数据集(内置数据集或者自定义数据集,数据集返回 **原始数据**)。 +#. 定义 :func:`trans_func` ,包括tokenize,token to id等操作,并传入数据集的 :func:`map` 方法,将原始数据转为 *feature* 。 +#. 根据上一步数据处理的结果定义 **batchify** 方法和 :class:`BatchSampler` 。 +#. 定义 :class:`DataLoader` , 传入 :class:`BatchSampler` 和 :func:`batchify_fn` 。 + +下面是基于Bert的文本分类任务的数据处理流程图: + +.. image:: ../imgs/data_preprocess_pipline.png + +关于数据处理的详细信息,请查阅 :doc:`./data_preprocess` 。 diff --git a/docs/dataaug.md b/docs/dataaug.md new file mode 100644 index 0000000000000000000000000000000000000000..0ae035990f8dbd8c17ce7328e995bf8f47e6dae9 --- /dev/null +++ b/docs/dataaug.md @@ -0,0 +1,1167 @@ +# Data Augmentation API + +PaddleNLP提供了Data Augmentation数据增强API,可用于训练数据数据增强 + +**目录** +* [1. 词级别数据增强策略](#词级别数据增强策略) + * [1.1 词替换](#词替换) + * [1.2 词插入](#词插入) + * [1.3 词删除](#词删除) + * [1.4 词交换](#词交换) +* [2. 
句子级别数据增强策略](#句子级别数据增强策略) + * [2.1 同义句生成](#同义句生成) + * [2.2 句子回译](#句子回译) + * [2.3 句子摘要](#句子摘要) + * [2.4 句子续写](#句子续写) +* [3. 字级别数据增强策略](#字级别数据增强策略) + * [3.1 字替换](#字替换) + * [3.2 字插入](#字插入) + * [3.3 字删除](#字删除) + * [3.4 字交换](#字交换) +* [4. 文档一键增强](#文档一键增强) + + + + +## 1.词级别数据增强策略 + + + +### 1.1 词替换 +词替换数据增强策略也即将句子中的词随机替换为其他单词进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordSubstitute`进行词级别替换的数据增强。 + +```text +WordSubstitute 参数介绍: + + aug_type(str or list(str)): + 词替换增强策略类别。可以选择"antonym"、"embedding"、"synonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前四种词替换增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强词表路径。如果词替换增强策略选择"custom",本地数据增强词表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被替换词数量。默认为None + + aug_percent(int): + 数据增强句子中被替换词数量占全句词比例。如果aug_n不为None,则被替换词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被替换词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被替换词数量最大值。默认为10。 + + tf_idf (bool): + 使用TF-IDF分数确定哪些词进行增强。默认为False。 + + tf_idf_file (str,*可选*): + 用于计算TF-IDF分数的文件。如果tf_idf为True,本地数据增强词表路径不能为None。默认为None。 +``` + +我们接下来将以下面的例子介绍词级别替换的使用: + +``` python +from paddlenlp.dataaug import WordSubstitute +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同义词替换** + +根据同义词词表将句子中的词替换为同义词,可以根据实际需要,设置被替换词数量占全句词比例`aug_percent`和生成增强句子数量`create_n`。`synonym`基于[中文同义词词表](https://github.com/guotong1988/chinese_dictionary)实现,`embedding`则是基于词向量(word embedding)之间的词距离构建的同义词词表确定,可以根据实际效果选择合适的词表。 + +``` python +aug = WordSubstitute('synonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的音符号,其中蕴含着丰富的语义信,生人可以很轻松地理解其中的含义。', '全人类语言是泛泛的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的意思。']] +augmented = aug.augment(s) +print(augmented) +# [['全人类言语是抽象的信符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '全人类语言是抽象的信息标记,其中蕴含着丰富的语义信息,人类可以很轻松地略知一二其中的含义。'], ['而计算机不得不处理数值化的信息,无法直接理解人类言语,所以需要将人类语言进行数值化更换。', '而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将生人语言进行数值化变换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的词数量 `aug_n`: +``` python +aug = WordSubstitute('synonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['全人类语言是空泛的信息符号,其中蕴含着丰富的涵义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的消息符号,其中蕴含着丰富的疑义信息,人类可以很轻松地理解其中的意义。'], ['而计算机唯其如此处理实测值化的信息,无法直接理解人类语言,所以需要将人类语言进行实测值化转换。']] +``` + +``` python +aug = WordSubstitute('embedding', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的音符号,其中蕴含着丰富的语义信,生人可以很轻松地理解其中的含义。', '全人类语言是泛泛的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的意思。']] +augmented = aug.augment(s) +print(augmented) +# [['全人类言语是抽象的信符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '全人类语言是抽象的信息标记,其中蕴含着丰富的语义信息,人类可以很轻松地略知一二其中的含义。'], ['而计算机不得不处理数值化的信息,无法直接理解人类言语,所以需要将人类语言进行数值化更换。', '而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将生人语言进行数值化变换。']] +``` + +**同音词替换** + +根据同音词词表将句子中的词替换为同音词: + +``` python +aug = WordSubstitute('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是臭香的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松德力竭其中的含义。', '任雷语言是抽象的信息富豪,其中蕴含着丰富的语义信息,任蕾可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是臭香的新潟符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含一。', '任雷语言是抽象的新潟符号,其中蕴含着丰富的语义信息,人类可以很庆松地理解其中的含义。'], ['而计算机只能处理数值化的新戏,无法直接丽姐人类语言,所以需要将人类语言进行书之化转换。', '而计算机只能处理数值化的心系,无法直接李杰人类玉烟,所以需要将人类语言进行数值化转换。']] +``` + +**反义词替换** + +根据反义词词表将句子中的词替换为反义词: + +``` python +aug = WordSubstitute('antonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地糊涂其中的含义。', '人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地懵懂其中的含义。']] +augmented = aug.augment(s) 
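+# augment() 同样支持传入句子列表:返回的嵌套列表中,每个子列表对应一条输入句子的 create_n 个增强结果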
+print(augmented) +# [['人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地糊涂其中的含义。', '人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地困惑其中的含义。'], ['而计算机只能处理数值冻的信息,无法直接困惑人类语言,所以需要将人类语言进行数值冻转换。', '而计算机只能处理数值冻的信息,无法直接懵懂人类语言,所以需要将人类语言进行数值冻转换。']] +``` + +**本地词表替换** + +只需要传入本地词表文件路径`custom_file_path`,即可使用自定义的词表进行替换。本地词表文件为固定格式的`json`文件,字典关键字(key)为词,字典键值(item)为列表形式的替换词。例如自定义本地词表`custom.json`如下: +``` +{"人类":["人", "人种","全人类"], "抽象":["abstract","具象"], "轻松":["简单","容易"]} +``` + +使用自定义的本地词表进行句子中词替换: +``` python +custom_file_path = "custom.json" +aug = WordSubstitute('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人语言是abstract的信息符号,其中蕴含着丰富的语义信息,全人类可以很轻松地理解其中的含义。', '全人类语言是具象的信息符号,其中蕴含着丰富的语义信息,人可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人语言是abstract的信息符号,其中蕴含着丰富的语义信息,人种可以很轻松地理解其中的含义。', '人语言是具象的信息符号,其中蕴含着丰富的语义信息,人种可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人语言,所以需要将全人类语言进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解全人类语言,所以需要将人语言进行数值化转换。']] +``` + +**组合替换** + +还可以选择将同义词、同音词、本地词表进行随机组合,例如组合同义词词表核本地词表进行词替换: +``` python +custom_file_path = "custom.json" +aug = WordSubstitute(['custom','synonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义音信,生人可以很轻松地领悟其中的含义。', '人种语言是抽象的信息符号,其中蕴含着丰富的贬义信息,人可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信符号,其中蕴含着丰富的语义消息,生人可以很轻松地理解其中的含义。', '人语言是抽象的信息符号,其中蕴含着丰富的语义消息,人类可以很轻松地亮堂其中的含义。'], ['而计算机只能处理数值变成的信息,无法直接理解人类语言,所以需要将生人语言进行数值变为转换。', '而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类言语进行标注值变为转换。']] +``` + +**随机词替换** + +使用随机词进行句子中词替换: +``` python +aug = WordSubstitute('random', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类塘屿是抽象的黄芪酒符号,其中蕴含着丰富的语义信息,人类可以很轻单官理解其中的含义。', '人类语言是抽象的亞符号,其中蕴含着丰富的语义镇咳药,人类可以いていた松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类共进退是抽象的信息符号,其中蕴含着丰富的梦界信息,人类可以很轻大凤理解其中的含义。', '人类语言是4490的信息符号,其中蕴含着丰富的语义信息,科摩可以很轻松地崔磊其中的含义。'], ['而库山乡只能处理数值化的信息,无法直接理解街亭失守MicrosoftWorks,所以需要将人类语言进行数值化转换。', '而0.57万只能处理数值化的信息,无法直接理解人类语言,所以需要将穆哥叶楚进行数值化转换。']] +``` + +**上下文替换** + +上下文替换是随机将句子中单词进行掩码,利用中文预训练模型ERNIE 1.0,根据句子中的上下文预测被掩码的单词。相比于根据词表进行词替换,上下文替换预测出的单词更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单词进行句子中词替换: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = WordSubstitute('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的信义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的语字符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信言,无法直接理解人类语言,所以需要将人类语言进行数值化转换。']] +``` +句子中被替换的词数量目前只支持 `aug_n` 为1。 + +**基于TF-IDF的词替换** + +TF-IDF算法认为如果一个词在同一个句子中出现的次数多,词对句子的重要性就会增加;如果它在语料库中出现频率越高,它的重要性将被降低。我们将计算每个词的TF-IDF分数,**低的TF-IDF得分将有很高的概率被替换**。 + +我们可以在上面所有词替换策略中使用TF-IDF计算词被替换的概率,我们首先需要将`tf_idf`设为True,并传入语料库文件(包含所有训练的数据) `tf_idf_file` 用于计算单词的TF-IDF分数。语料库文件为固定 `txt` 格式,每一行为一条句子。以语料库文件`"data.txt"`做同义词替换为例,语料库文件格式如下: +``` text +人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。 +而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。 +... 
+``` + +``` python +tf_idf_file = "data.txt" +aug = WordSubstitute('synonym', tf_idf=True, tf_idf_file=tf_idf_file, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的消息符号,其中蕴含着丰富的语义音信,人类可以很轻松地敞亮其中的含义。', '生人语言是抽象的消息符号,其中蕴含着丰富的语义信息,全人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类言语是抽象的信息符号,其中蕴含着丰富的语义信息,生人可以很轻松地分晓其中的含义。', '人类言语是抽象的音问符号,其中蕴含着丰富的语义信息,全人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将全人类言语进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解生人语言,所以需要将全人类语言进行数值化变换。']] +``` + + +### 词插入 +词插入数据增强策略也即将句子中的词随机插入其他单词进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordInsert`进行词级别插入的数据增强。 + +```text +WordInsert 参数介绍: + + aug_type(str or list(str)): + 词插入增强策略类别。可以选择"antonym"、"embedding"、"synonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前三种词插入增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强词表路径。如果词插入增强策略选择"custom",本地数据增强词表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被插入词数量。默认为None + + aug_percent(int): + 数据增强句子中被插入词数量占全句词比例。如果aug_n不为None,则被插入词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被插入词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被插入词数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍词级别插入的使用: + +``` python +from paddlenlp.dataaug import WordInsert +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同义词插入** +根据同义词词表将句子中的词前/后插入同义词,可以根据实际需要,设置插入词数量占全句词比例`aug_percent`和生成增强句子数量`create_n`。`synonym`基于[中文同义词词表](https://github.com/guotong1988/chinese_dictionary)实现,`embedding`则是基于词向量(word embedding)之间的词距离构建的同义词词表确定,可以根据实际效果选择合适的词表。 + +``` python +aug = WordInsert('synonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['全人类人类语言是华而不实抽象的信息符号,其中蕴含着丰富的语义消息信息,人类可以很轻松地理解其中的含义。', '人类语言是抽象的音信信息符号,其中蕴含着丰富的语义消息信息,生人人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言言语是抽象的信息符号,其中蕴含着丰富的语义褒义信息音问,人类可以很轻松地理解其中的含义。', '人类语言是抽象言之无物的信息符号记号,其中蕴含着丰富的语义信息,人类可以很轻松地理解清楚其中的含义。'], ['而计算机只能只得处理数值化变为的信息,无法直接理解人类生人语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值分值化化为的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换变换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的词数量 `aug_n`: +``` python +aug = WordInsert('synonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言自然语言是抽象的信息符号,其中蕴含着蕴含丰富的语义信息数据,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象具象的信息符号,其中蕴含着丰富的语义演算信息,人类人类文明可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类全人类语言进行数值最大值化转换切换。']] +``` + +``` python +aug = WordInsert('embedding', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的音符号,其中蕴含着丰富的语义信,生人可以很轻松地理解其中的含义。', '全人类语言是泛泛的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的意思。']] +augmented = aug.augment(s) +print(augmented) +# [['全人类言语是抽象的信符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '全人类语言是抽象的信息标记,其中蕴含着丰富的语义信息,人类可以很轻松地略知一二其中的含义。'], ['而计算机不得不处理数值化的信息,无法直接理解人类言语,所以需要将人类语言进行数值化更换。', '而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将生人语言进行数值化变换。']] +``` + +**同音词插入** + +根据同音词词表将句子中的词插入为同音词: + +``` python +aug = WordInsert('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言雨燕是抽象的信息符号,其中蕴含着丰富的语义信息,人类任雷可以很轻松地理解其中的含义寒意。', '人泪人类语言是丑像抽象的心细信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象筹饷的信息符号,其中蕴含着丰富的语义信息,人类可以很轻恨情松地理解力竭其中的含义。', '人类语言是抽象臭香的信息新戏符号,其中蕴含着丰富的语义信息,人类可以很轻很庆松地理解其中的含义。'], ['而计算机只能纸能处理数值化的信息新西,无法直接理解李杰人类语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解人类语言语嫣,所以需要将人类语言语嫣进行数值书之化转换。']] +``` + +**反义词插入** + +根据反义词词表将句子中的词前/后插入反义词: + +``` python +aug = WordInsert('antonym', create_n=2, 
aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解懵懂其中的含义。', '人类语言是具体抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地懵懂理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解懵懂其中的含义。', '人类语言是具体抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地困惑理解其中的含义。'], ['而计算机只能处理数值化凝的信息,无法直接理解困惑人类语言,所以需要将人类语言进行数值化冻转换。', '而计算机只能处理数值化凝的信息,无法直接理解懵懂人类语言,所以需要将人类语言进行数值化冻转换。']] +``` + +**本地词表插入** + +只需要传入本地词表文件路径`custom_file_path`,即可使用自定义的词表进行插入。本地词表文件为固定格式的`json`文件,字典关键字(key)为词,字典键值(item)为列表形式的插入词。例如自定义本地词表`custom.json`如下: +``` +{"人类":["人累", "扔雷"], "抽象":["丑相"], "符号":["富豪","负号","付豪"]} +``` + +使用自定义的本地词表进行句子中词插入: +``` python +custom_file_path = "custom.json" +aug = WordInsert('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类扔雷语言是抽象的信息符号富豪,其中蕴含着丰富的语义信息,人类扔雷可以很轻松地理解其中的含义。', '人类扔雷语言是抽象丑相的信息符号负号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['扔雷人类语言是丑相抽象的信息付豪符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '人类扔雷语言是抽象丑相的信息符号,其中蕴含着丰富的语义信息,人类人累可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类人累语言,所以需要将人类扔雷语言进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解人类扔雷语言,所以需要将人类人累语言进行数值化转换。']] +``` + + +**组合插入** + +还可以选择将同义词、同音词、本地词表进行随机组合,例如组合同义词词表核本地词表进行词插入: +``` python +custom_file_path = "custom.json" +aug = WordInsert(['custom','synonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言词汇是抽象的信息数据符号,其中蕴含着蕴含丰富的语义信息,人类可以很轻松地理解其中的含义。', '人类语言是丑相抽象的信息符号,其中蕴含蕴含着丰富的语义信息,人类可以很轻松地理解其中的含意含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含蕴含着丰富的语义数据信息,人类可以很轻松地理解其中的涵义含义。', '人类人累语言语法是抽象的信息符号,其中蕴含着丰富的语义信息数据,人类可以很轻松地理解其中的含义。'], ['而计算机计算机系统只能处理数值值化的信息,无法直接理解人类人累语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值计算结果化的信息,无法直接理解人类语言,所以需要将人类人类文明语言进行数值化转换变换。']] +``` + +**随机词插入** + +使用随机词进行句子中词插入: +``` python +aug = WordInsert('random', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['郎氏人类语言是抽象的魏略信息符号,其中晓畅蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', 'seeddestiny人类语言是抽象的那一双信息符号,其中九王坟蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类文胸语言是抽象解放日报的信息符号鸭池,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '堤平人类语言是文学作家抽象的信息中越关系符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而勤业计算机只能处理数值HIStory化的信息,无法直接理解人类语言,所以需要将唐本佑人类语言进行数值化转换。', '而计算机刀弓只能处理数值化苏雨琪的信息,无法直接理解人类语言,所以需要将人类平达语言进行数值化转换。']] +``` + + +**上下文插入** + +上下文插入是随机将句子中单词进行掩码,利用中文预训练模型ERNIE 1.0,根据句子中的上下文预测被掩码的单词。相比于根据词表进行词插入,上下文插入预测出的单词更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单词进行句子中词插入: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = WordInsert('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义语化信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号系统,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能直接处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。']] +``` +句子中插入的词数量目前只支持 `aug_n` 为1。 + +### 词删除 + +词删除数据增强策略也即将句子中的词随机删除进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordDelete`进行词级别删除的数据增强。 + +```text +WordDelete 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被删除词数量。默认为None + + aug_percent(int): + 数据增强句子中被删除词数量占全句词比例。如果aug_n不为None,则被删除词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被删除词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被删除词数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍词级别删除的使用: + +``` python +from paddlenlp.dataaug import WordDelete +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", 
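+# 与前面词替换示例相同,s 为待增强的句子列表,可同时传入多条句子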
"而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机删除句子中的词: +``` python +aug = WordDelete(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。', '人类语言是抽象的信息符号,其中蕴含着丰富的语义,人类可以松地其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是的信息符号,其中丰富的语义,人类可以很轻松地理解其中的含义。', '人类语言是的信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。'], ['而计算机只能处理数值化的信息,无法直接理解语言,所以需要将人类语言进行转换。', '而计算机处理数值化的信息,无法直接人类语言,所以需要将人类语言进行数值化。']] +``` + +### 词交换 + +词交换数据增强策略也即将句子中的词的位置随机交换进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordSwap`进行词级别交换的数据增强。 + +```text +WordSwap 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被交换词数量。默认为None + + aug_percent(int): + 数据增强句子中被交换词数量占全句词比例。如果aug_n不为None,则被交换词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被交换词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被交换词数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍词级别交换的使用: + +``` python +from paddlenlp.dataaug import WordSwap +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机交换句子中的词: +``` python +aug = WordSwap(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的符号信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以松地很轻理解其中的含义。'], ['而计算机只能处理化数值的信息,无法直接理解人类语言,所以需要将人类语言进行数值转换化。']] +``` + + + +## 2. 句子级别数据增强策略 + + + +### 2.1 同义句生成 + +同义句生成数据增强策略也即根据输入句子生成相似句,模型首先生成`generate_n`个句子,然后再利用模型筛选出最佳的`create_n`。这里我们将介绍如何使用`paddlenlp.dataaug.SentenceGenerate`进行同义句生成的数据增强。 + +```text +SentenceGenerate 参数介绍: + + model_name (str): + 生成同义句模型名,可选"roformer-chinese-sim-char-ft-base", "roformer-chinese-sim-char-base","roformer-chinese-sim-char-ft-small","roformer-chinese-sim-char-small"。默认为"roformer-chinese-sim-char-base"。 + + create_n(int): + 数据增强句子数量,从生成相似句中筛选最佳的句子数量。默认为1。 + + generate_n(int): + 模型生成相似句数量。默认为5。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + top_p (float): + “sampling”策略中top-p-filtering的累积概率。该值应满足:math:`0<=top_p<1`。默认为0.95 +``` + +我们接下来将以下面的例子介绍同义句生成的使用: + +``` python +from paddlenlp.dataaug import SentenceGenerate +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceGenerate(create_n=2, generate_n=5, max_length=128, top_p=0.95) +augmented = aug.augment(s[0]) +print(augmented) +# ['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义', '人类语言是一个抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义答。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,故需要将人类语言进行数值化转换。', '2、计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。']] +``` + + + +### 2.2 句子回译 + +句子回译数据增强策略也即将输入的句子翻译为另一种语言,然后再翻译回来,生成语义相同表达方式不同的句子,用于数据增强。这里我们将介绍如何使用基于百度翻译API`paddlenlp.dataaug.SentenceBackTranslateAPI`进行句子回译的数据增强和基于模型的`paddlenlp.dataaug.SentenceBackTranslate`。 + + +```text +SentenceBackTranslateAPI 参数介绍: + + src_lang (str): + 输入句子的语言。默认为"zh"。 + + tgt_lang(str): + 目标句子的语言,增强策略将会把句子翻译为目标句子语言,再翻译回输入句子语言。默认为"en"。 + + appid(str): + 百度通用翻译API的APPID(如果你使用自己的百度翻译API服务appid/secretKey)。默认为None。 + + secretKey (str): + 百度通用翻译API的密钥(如果你使用自己的百度翻译API服务appid/secretKey)。默认为1。 + + qps (int): + 百度通用翻译API的QPS(如果你使用自己的百度翻译API服务appid/secretKey)。 默认为1。 +``` + +我们接下来将以下面的例子介绍基于百度翻译API的句子回译的使用: + +使用SentenceBackTranslateAPI需要安装PaddleHub +```shell +pip install paddlehub==2.3.1 +``` + +``` python +from paddlenlp.dataaug import 
SentenceBackTranslateAPI +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +aug = SentenceBackTranslateAPI(src_lang='zh', tgt_lang='en') +augmented = aug.augment(s[0]) +print(augmented) +# ['人类语言是一种抽象的信息符号,蕴含着丰富的语义信息。人类很容易理解它的含义。'] +augmented = aug.augment(s) +print(augmented) +# ['人类语言是一种抽象的信息符号,蕴含着丰富的语义信息。人类很容易理解它的含义。', '然而,计算机只能处理数字信息,不能直接理解人类语言,因此有必要将人类语言转换为数字信息。'] +``` +**Note** +1. 默认使用PaddleHub提供的百度翻译API服务,也可以选择注册自己的百度翻译API服务账号获取相应的AppID和密钥,账号注册流程请参见[百度翻译API文档](https://fanyi-api.baidu.com/doc/21),使用自己AppID和密钥则无需安装PaddleHub。 +2. `src_lang`和`tgt_lang`支持的语言和服务异常报错详见[百度翻译API文档](https://fanyi-api.baidu.com/doc/21)中完整语种列表和错误码列表。 + +```text +SentenceBackTranslate 参数介绍: + + src_lang (str): + 输入句子的语言。默认为"zh"。可选语言:'ar', 'cs', 'de', 'en', 'es', 'et', 'fi', 'fr', 'gu', 'hi', 'it', 'ja', 'kk', 'ko', 'lt', 'lv', 'my', 'ne', 'nl', 'ro', 'ru', 'si', 'tr', 'vi', 'zh', 'af', 'az', 'bn', 'fa', 'he', 'hr', 'id', 'ka', 'km', 'mk', 'ml', 'mn', 'mr', 'pl', 'ps', 'pt', 'sv', 'sw', 'ta', 'te', 'th', 'tl', 'uk', 'ur', 'xh', 'gl', 'sl'。 + + tgt_lang(str): + 目标句子的语言,增强策略将会把句子翻译为目标句子语言,再翻译回输入句子语言。默认为"en"。可选语言:'ar', 'cs', 'de', 'en', 'es', 'et', 'fi', 'fr', 'gu', 'hi', 'it', 'ja', 'kk', 'ko', 'lt', 'lv', 'my', 'ne', 'nl', 'ro', 'ru', 'si', 'tr', 'vi', 'zh', 'af', 'az', 'bn', 'fa', 'he', 'hr', 'id', 'ka', 'km', 'mk', 'ml', 'mn', 'mr', 'pl', 'ps', 'pt', 'sv', 'sw', 'ta', 'te', 'th', 'tl', 'uk', 'ur', 'xh', 'gl', 'sl'。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + batch_size (int): + 批大小,如果显存不足,适当调小该值。默认为1。 + + num_beams (int): + “beam_search”策略中的beam值。 默认为 4。 + + use_faster (bool): + 是否使用FasterGeneration进行加速。默认为False。 + + decode_strategy (str): + 生成中的解码策略。 目前支持三种解码策略:“greedy_search”、“sampling”和“beam_search”。 默认为“beam_search”。 + +``` + +我们接下来将以下面的例子介绍基于模型的句子回译的使用: + +``` python +from paddlenlp.dataaug import SentenceBackTranslate +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceBackTranslate(src_lang='zh', tgt_lang='en', batch_size=1, max_length=128) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象信息符号, 它包含丰富的语义信息, 可以容易理解.']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象信息符号, 它包含丰富的语义信息, 可以容易理解.'], ['计算机只能处理数字化信息,不能直接理解人类语言,因此有必要进行数字化。']] +``` +**Note** +1. 
如果`use_faster`设为True,第一次执行PaddleNLP会启动即时编译(JIT Compile)自动编译高性能解码算子。编译过程通常会花费几分钟的时间编译只会进行一次,之后再次使用高性能解码就不需要重新编译了,编译完成后会继续运行。 + + + +### 2.3 句子摘要 + +句子摘要数据增强策略也即对输入句子生成摘要句子,这里我们将介绍如何使用`paddlenlp.dataaug.SentenceSummarize`进行句子摘要的数据增强。 + +```text +SentenceSummarize 参数介绍: + + create_n(int): + 数据增强句子数量,从生成相似句中筛选最佳的句子数量。默认为1。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + batch_size (int): + 批大小,如果显存不足,适当调小该值。默认为1。 + + top_k (int): + “sampling”策略中top-k-filtering的最高概率token的数量, 0表示没有影响。默认为5。 + + top_p (float): + “sampling”策略中top-p-filtering的累积概率。该值应满足:math:`0<=top_p<1`。默认为1.0,表示没有影响。 + + temperature (float): + “sampling”策略中对下一个token概率进行建模的值。 默认为 1.0,表示没有影响。 + + use_fp16_decoding (bool): + 是否使用fp16进行加速。默认为False。 +``` + +我们接下来将以下面的例子介绍句子摘要的使用: + +``` python +from paddlenlp.dataaug import SentenceSummarize +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceSummarize(create_n=2, batch_size=1, max_length=128) +augmented = aug.augment(s[0]) +print(augmented) +# [['什么是人类语言?', '为什么说人类语言是抽象的信息符号?']] +augmented = aug.augment(s) +print(augmented) +# [['什么是人类语言?', '为什么说人类语言是抽象的信息符号?'], ['计算机只能处理数值化的信息(图)', '计算机只能处理数值化的信息']] +``` + + + +### 2.4 句子续写 + +句子续写数据增强策略也即对输入句子进行句子续写,这里我们将介绍如何使用`paddlenlp.dataaug.SentenceContinue`进行句子续写的数据增强。 + +```text +SentenceContinue 参数介绍: + + model_name (str): + 生成同义句模型名,可选"gpt-cpm-large-cn", "gpt-cpm-small-cn-distill"。默认为"gpt-cpm-small-cn-distill"。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + decode_strategy (str): + 生成中的解码策略。 目前支持三种解码策略:“greedy_search”、“sampling”和“beam_search”。 默认为“beam_search”。 + + use_faster (bool): + 是否使用FasterGeneration进行加速。默认为False。 + + create_n(int): + 数据增强句子数量,从生成相似句中筛选最佳的句子数量。默认为1。 + + top_k (int): + “sampling”策略中top-k-filtering的最高概率token的数量, 0表示没有影响。默认为5。 + + top_p (float): + “sampling”策略中top-p-filtering的累积概率。该值应满足:math:`0<=top_p<1`。默认为1.0,表示没有影响。 + + temperature (float): + “sampling”策略中对下一个token概率进行建模的值。 默认为 1.0,表示没有影响。 + + batch_size (int): + 批大小,如果显存不足,适当调小该值。默认为1。 +``` + +我们接下来将以下面的例子介绍同义句生成的使用: + +``` python +from paddlenlp.dataaug import SentenceContinue +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceContinue(create_n=2, batch_size=1, max_length=64) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。然而语言本身的抽象不是简单的,语言的复杂性以及语言的抽象化则是人类认识世界的另一个重要途径。信息本身和人类的理解能力无关,人类理解世界的过程就是信息过程的不断丰富与不断', '人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。不过,这也是很不容易的。有一些事情是不可能实现的,对于一些人来说,不可能实现的事情只是遥不可及的梦,这也就是为什么在他们的思想中经常会']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。那么,为什么会出现这种现象呢?首先,我们知道人类拥有最简单的语言,但是我们无法通过语言去直接理解它,这就使得我们需要建立数学模型,使得理解过程比语言模型复杂得多', '人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。如果人类可以用语言解决语言问题,那么这个问题是不能回避的。这就是为什么计算机是一个语言的存在,因为它能够处理语言的逻辑关系。这就要求我们对语言的基本事实和各种各样的信息进行细致'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。因此,计算机在编程方面的功能就是将程序的数据进行算法处理,以便在特定情况下做出特定的功能。在这里可以看到,计算机编程的主要功能是处理文字的信息,而与文字的信息无关的', '而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。因此,“语言”这个词的含义,实际上可以由下面这个公式来表示:=\\alpha\\left(\\alpha-(\\alpha']] +``` + + +## 3.字级别数据增强策略 + + + +### 3.1 字替换 +字替换数据增强策略也即将句子中的字随机替换为其他单字进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharSubstitute`进行字级别替换的数据增强。 + +```text +CharSubstitute 参数介绍: + + aug_type(str or list(str)): + 
字替换增强策略类别。可以选择"antonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前三种字替换增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强字表路径。如果字替换增强策略选择"custom",本地数据增强字表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被替换字数量。默认为None + + aug_percent(int): + 数据增强句子中被替换字数量占全句字比例。如果aug_n不为None,则被替换字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被替换字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被替换字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别替换的使用: + +``` python +from paddlenlp.dataaug import CharSubstitute +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同音字替换** + +根据同音字表将句子中的字替换为同音字,可以根据实际需要,设置被替换字数量占全句字比例`aug_percent`和生成增强句子数量`create_n`。 + +``` python +aug = CharSubstitute('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是筹象的信汐符号,其中蕴含着逢富的语义锌息,人类可以很轻诵地理解其中的含义。', '人类语嫣是抽象的信息符号,其中蕴含着丰富的语义信息,人垒可以很情松地理婕其种的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的辛息符豪,其中匀含着丰富的语义信息,人类可以很庆耸地理解其中的含义。', '人磊语晏是抽象的新息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理劫其种的含义。'], ['而叽算机只能处理数值化的信息,无法直接理解人蕾语堰,所以需要将人类语演进行数值化专换。', '而疾算机只能杵理数值华的信息,无法直接理捷人类语验,所以需要将人类语言进行数值化转换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的字数量 `aug_n`: +``` python +aug = CharSubstitute('homonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的裕义信息,人类可以很轻送地理解其中的含漪。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中陨含着丰富的语义信息,人蕾可以很轻松地理解其种的含义。'], ['而计算机只能处理数值化的心息,无罚直接理解人类语言,所以需要将人类煜言进行数值化转换。']] +``` + +**反义字替换** + +根据反义字字表将句子中的字替换为反义字: + +``` python +aug = CharSubstitute('antonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰穷的语义信息,人类可以很轻紧地理结其西的露义。', '人类语言是抽象的疑息符号,其西蕴含着歉富的语义信息,人类可以很轻松地理结其中的露义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的疑作符号,其洋蕴含着丰贫的语义信息,人类可以很轻松地理解其中的露义。', '人类语言是抽象的信息符号,其洋蕴含着歉贫的语义信息,人类可以很轻紧地理系其中的含义。'], ['而计算机只能处理数值凝的疑作,无法曲接理扎人类语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值化的信作,无法屈接理结人类语言,所以需要将人类语言退行数值凝转换。']] +``` + +**本地字表替换** + +只需要传入本地字表文件路径`custom_file_path`,即可使用自定义的字表进行替换。本地字表文件为固定格式的`json`文件,字典关键字(key)为字,字典键值(item)为列表形式的替换字。例如自定义本地字表`custom.json`如下: +``` +{"人":["任", "认","忍"], "抽":["丑","臭"], "轻":["亲","秦"],"数":["书","树"],"转":["赚","专"],"理":["里","例"]} +``` + +使用自定义的本地字表进行句子中字替换: +``` python +custom_file_path = "custom.json" +aug = CharSubstitute('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是丑象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地理解其中的含义。', '人类语言是臭象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是丑象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地例解其中的含义。', '人类语言是臭象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地里解其中的含义。'], ['而计算机只能处例书值化的信息,无法直接里解人类语言,所以需要将人类语言进行书值化专换。', '而计算机只能处里书值化的信息,无法直接例解人类语言,所以需要将人类语言进行树值化赚换。']] +``` + +**组合替换** + +还可以选择将同音字、本地字表进行随机组合,例如组合同音字表和本地字表进行字替换: +``` python +custom_file_path = "custom.json" +aug = CharSubstitute(['custom','homonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信囍符号,其中蕴含着丰斧的遇倚信息,人类可以很轻颂地理解其中的含义。', '人类语言是抽乡的信吸符好,其终蕴含着丰富的语义芯息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类於言是抽想的信息肤号,其中蕴含着丰腐的语义信息,人类可以很轻松地理解其中的含诣。', '人类语言是抽项的信息符号,其中蕴憨着丰富的娱义信息,人类可以很请怂地理解其中的含义。'], ['而计算机只能处理数值划的信羲,无法直接理解人类钰言,所以墟要将人类语闫进行数值化转换。', '而计算羁只能处理数值化的信熙,无法直介理解人类语岩,所以需要将人类语焰进行数值化转换。']] +``` + +**随机字替换** + +使用随机字进行句子中字替换: +``` python +aug = CharSubstitute('random', create_n=2, aug_percent=0.1) +augmented = 
aug.augment(s[0]) +print(augmented) +# [['人开自言是抽象的信息符号,其中蕴正着丰富的语义信息,人类可以很拜松地理解其中的含侯。', '人类语言是抽象的许息符号,其世蕴银着丰B的语义莘息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类吧言是抽象的信息符号,其中蕴含着丰富的萎义桅小,人类可以很轻松地理解其中的后义。', '人类语言是河象的信夹符号,其中蕴含着丰刘的语义信息,人类可以很轻李地理解其中的含阿。'], ['而庙算机只能处葛数弘化的信息,无法直接理解人类语拉,所以需要将人吴语言进行数值化转换。', '而n算机只能处理数值化的信息,无法直接理解人红语言,所以需要将人类语言进行林值查转P。']] +``` + +**上下文替换** + +上下文替换是随机将句子中单字进行掩码,利用中文预训练模型ERNIE 3.0,根据句子中的上下文预测被掩码的单字。相比于根据字表进行字替换,上下文替换预测出的单字更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单字进行句子中字替换: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = CharSubstitute('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中包含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +``` +句子中被替换的字数量目前只支持 `aug_n` 为1。 + + + +### 字插入 +字插入数据增强策略也即将句子中的字随机插入其他单字进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharInsert`进行字级别插入的数据增强。 + +```text +CharInsert 参数介绍: + + aug_type(str or list(str)): + 字插入增强策略类别。可以选择"antonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前三种字插入增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强字表路径。如果字插入增强策略选择"custom",本地数据增强字表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被插入字数量。默认为None + + aug_percent(int): + 数据增强句子中被插入字数量占全句字比例。如果aug_n不为None,则被插入字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被插入字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被插入字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别插入的使用: + +``` python +from paddlenlp.dataaug import CharInsert +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同音字插入** +根据同音字表将句子中的字前/后插入同音字,可以根据实际需要,设置插入字数量占全句字比例`aug_percent`和生成增强句子数量`create_n`。 + +``` python +aug = CharInsert('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语寓言咽是抽象的信息符复号,其中蕴韵含着丰富夫的语义信息,人类可以很轻松地理解其中的含义。', '人镭类语岩言是抽想象的信息符号,其忠中蕴含着疯丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类勒语言是抽象想的信息符号,其中蕴含着丰富的语誉义以信息,人类可以很轻卿松地理解其中的含义。', '人泪类语言是抽象的芯信息符号,其中蕴含着枫丰富的语疑义锌信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数植值化的新信息,无法直接狸理解人类语言,所以需要将人类峪语言进行书数值化转换。', '而计算机只能处理梳数值化的新信息,无法直接笠理解人类语言,所以需要将人类语衍言进行数值化赚转换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的字数量 `aug_n`: +``` python +aug = CharInsert('homonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类勒语言是抽象的信息符号,其中蕴含着丰缝富的语义信息,人类可以很轻松颂地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义新信息,人类可以很轻松地荔理解其终中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以序需要将人類类语言进刑行数值化转换。']] +``` + + +**本地字表插入** + +只需要传入本地字表文件路径`custom_file_path`,即可使用自定义的字表进行插入。本地字表文件为固定格式的`json`文件,字典关键字(key)为字,字典键值(item)为列表形式的插入字。例如自定义本地字表`custom.json`如下: +``` +{"人":["任", "认","忍"], "抽":["丑","臭"], "轻":["亲","秦"],"数":["书","树"],"转":["赚","专"],"理":["里","例"]} +``` + +使用自定义的本地字表进行句子中字插入: +``` python +custom_file_path = "custom.json" +aug = CharInsert('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是臭抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很亲轻松地里理解其中的含义。', '人类语言是抽臭象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻秦松地理里解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是丑抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦轻松地例理解其中的含义。', '人类语言是丑抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很亲轻松地例理解其中的含义。'], ['而计算机只能处理例数树值化的信息,无法直接理例解人类语言,所以需要将人类语言进行数树值化转专换。', '而计算机只能处里理树数值化的信息,无法直接例理解人类语言,所以需要将人类语言进行书数值化赚转换。']] +``` +**反义字插入** + +根据反义字字表将句子中的字前/后插入反义字: + +``` python +aug = CharInsert('antonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# 
[['人类语言是抽象的疑信作息符号,其中蕴露含着丰富的语义信息,人类可以很轻紧松地理扎解其中的含义。', '人类语言是抽象的信疑息符号,其中洋蕴含着丰富穷的语义信息作,人类可以很轻松地理解其中的含露义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信作息符号,其中蕴含着丰富的语义信作息,人类可以很轻紧松地理系解其中的露含义。', '人类语言是抽象的信疑息符号,其中洋蕴含露着丰富的语义信息作,人类可以很轻松地理解扎其中的含义。'], ['而计算机只能处理数值凝化的信作息,无法屈直接理解人类语言,所以需要将人类语言进止行数值化停转换。', '而计算机只能处理数值化凝的信疑息,无法直接递理解系人类语言,所以需要将人类语言进行数值化凝转换。']] +``` + +**组合插入** + +还可以选择将同音字、同音字、本地字表进行随机组合,例如组合同音字表核本地字表进行字插入: +``` python +custom_file_path = "custom.json" +aug = CharInsert(['custom','homonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类镭语言是抽象的信鑫息夕符号壕,其中蕴含着丰富的语义信息,人类可以很轻晴松地理解其中的含义。', '人类语咽言是抽翔象的信息覆符号,其中蕴含着丰腐富的语义信息,人类可以很轻松地離理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是稠抽象的芯信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理桀解其重中的含裔义。', '人类语言是抽象的信息囍符号壕,其中蕴含着丰富孵的语义信息奚,人类可以很轻卿松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言阎,所以需要将人类语言衍进金行数值化哗转专换。', '而计纪算机只能处岀理隶数值化的信息,无法直接理解人类语言,所以需要将人类雷语言进行数值芷化转换。']] +``` + +**随机字插入** + +使用随机字进行句子中字插入: +``` python +aug = CharInsert('random', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类S语言是抽象的信息符号,其中蕴含着丰富的语义信息,人鞋类可以很轻J松地张理解其中的含陈义。', '人类谷语言是抽象的信息符号,其中蕴含着丰富的语义信息,人烘类可以很轻割松地灵理解其中的含异义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语创言是抽象的信好息符号,其中蕴王含着丰富的语义信如息,人类可以很轻松地理解其中的丹含义。', '人类语F言是抽象的信M息符号,其中蕴史含着丰富的语义信伊息,人类可以很轻松地理解其中的秀含义。'], ['而计算机只能处楚理数值化O的信息,无法直接理解人类语丁言,所以需P要将人类语言进行甲数值化转换。', '而计算机只能处漫理数值化翁的信息,无法直接理解人类语奚言,所以需中要将人类语言进行黄数值化转换。']] +``` + + +**上下文插入** + +上下文插入是随机将句子中单字进行掩码,利用中文预训练模型ERNIE 3.0,根据句子中的上下文预测被掩码的单字。相比于根据字表进行字插入,上下文插入预测出的单字更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单字进行句子中字插入: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = CharInsert('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值转化转换。']] +``` +句子中插入的字数量目前只支持 `aug_n` 为1。 + +### 字删除 + +字删除数据增强策略也即将句子中的字随机删除进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharDelete`进行字级别删除的数据增强。 + +```text +CharDelete 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被删除字数量。默认为None + + aug_percent(int): + 数据增强句子中被删除字数量占全句字比例。如果aug_n不为None,则被删除字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被删除字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被删除字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别删除的使用: + +``` python +from paddlenlp.dataaug import CharDelete +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机删除句子中的字: +``` python +aug = CharDelete(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。', '人类语言是抽象的信息符号,其中蕴含着丰富的语义,人类可以松地其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是的信息符号,其中丰富的语义,人类可以很轻松地理解其中的含义。', '人类语言是的信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。'], ['而计算机只能处理数值化的信息,无法直接理解语言,所以需要将人类语言进行转换。', '而计算机处理数值化的信息,无法直接人类语言,所以需要将人类语言进行数值化。']] +``` + +### 字交换 + +字交换数据增强策略也即将句子中的字的位置随机交换进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharSwap`进行字级别交换的数据增强。 + +```text +CharSwap 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被交换字数量。默认为None + + aug_percent(int): + 数据增强句子中被交换字数量占全句字比例。如果aug_n不为None,则被交换字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被交换字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被交换字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别交换的使用: + +``` python +from paddlenlp.dataaug import CharSwap +s = 
["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机交换句子中的字: +``` python +aug = CharSwap(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的符号信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以松地很轻理解其中的含义。'], ['而计算机只能处理化数值的信息,无法直接理解人类语言,所以需要将人类语言进行数值转换化。']] +``` + + + + +## 4. 文档一键增强 + +数据增强API也提供了文档一键增强功能,可以输入指定格式文件进行数据增强。 +```text +FileAugment 初始化参数介绍: + + strategies(list): + 输入应用的数据增强策略。 +``` + +我们接下来将以下面的例子介绍文档一键增强的使用。 + +只需要传入固定格式的`txt`文件,如下自定义输入文件`data.txt`: + +```text +25岁已经感觉脸部松弛了怎么办 +小孩的眉毛剪了会长吗? +... +``` + +我们对文件`data.txt`应用词替换和词插入数据增强策略。 + +```python +from paddlenlp.dataaug import WordSubstitute, WordInsert, FileAugment +aug1 = WordSubstitute('synonym', create_n=1, aug_percent=0.1) +aug2 = WordInsert('synonym', create_n=1, aug_percent=0.1) +aug = FileAugment([aug1,aug2]) +aug.augment(input_file='data.txt', output_file="aug.txt") +``` + +数据增强结果保存在`aug.txt`中,如下: +```text +25岁已经感觉面松弛了怎么办 +小朋友的眉毛剪了会长吗? +25岁已经感觉脸部松驰松弛了怎么办 +幼儿小孩的眉毛剪了会长吗? +``` + +如果输入的文件中带有文本标签,如下自定义输入文件`data.txt`: + +```text +25岁已经感觉脸部松弛了怎么办 治疗方案 +小孩的眉毛剪了会长吗? 其他 +``` +我们可以通过定义`separator`和`separator_id`选择只对其中部分文本进行数据增强策略。 +```python +aug.augment(input_file='data.txt', output_file="aug.txt", separator='\t', separator_id=0) +``` + +数据增强结果保存在`aug.txt`中,如下: + +```text +25阴历年已经感觉脸部松弛了怎么办 治疗方案 +小孩子的眉毛剪了会长吗? 其他 +25岁已经感觉面庞脸部松弛了怎么办 治疗方案 +小孩小朋友的眉毛剪了会长吗? 其他 +``` diff --git a/docs/datasets.md b/docs/datasets.md new file mode 100644 index 0000000000000000000000000000000000000000..0d5d598a7da97bdef5af69cb0e0a4ef960f8ee35 --- /dev/null +++ b/docs/datasets.md @@ -0,0 +1,67 @@ +# PaddleNLP Datasets API + +PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要**添加splits信息**: + +## 阅读理解 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | ----- | ------ | +| [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | 斯坦福问答数据集,包括SQuAD1.1和SQuAD2.0|`paddlenlp.datasets.load_dataset('squad')` | +| [DuReader-yesno](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,判断答案极性|`paddlenlp.datasets.load_dataset('dureader_yesno')` | +| [DuReader-robust](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,答案原文抽取|`paddlenlp.datasets.load_dataset('dureader_robust')` | +| [CMRC2018](http://hfl-rc.com/cmrc2018/) | 第二届“讯飞杯”中文机器阅读理解评测数据集|`paddlenlp.datasets.load_dataset('cmrc2018')` | +| [DRCD](https://github.com/DRCKnowledgeTeam/DRCD) | 台達閱讀理解資料集|`paddlenlp.datasets.load_dataset('drcd')` | + +## 文本分类 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [CoLA](https://nyu-mll.github.io/CoLA/) | 单句分类任务,二分类,判断句子是否合法| `paddlenlp.datasets.load_dataset('glue','cola')`| +| [SST-2](https://nlp.stanford.edu/sentiment/index.html) | 单句分类任务,二分类,判断句子情感极性| `paddlenlp.datasets.load_dataset('glue','sst-2')`| +| [MRPC](https://microsoft.com/en-us/download/details.aspx?id=52398) | 句对匹配任务,二分类,判断句子对是否是相同意思| `paddlenlp.datasets.load_dataset('glue','mrpc')`| +| [STSB](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) | 计算句子对相似性,分数为1~5| `paddlenlp.datasets.load_dataset('glue','sts-b')`| +| [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) | 判定句子对是否等效,等效、不等效两种情况,二分类任务| `paddlenlp.datasets.load_dataset('glue','qqp')`| +| [MNLI](http://www.nyu.edu/projects/bowman/multinli/) | 句子对,一个前提,一个是假设。前提和假设的关系有三种情况:蕴含(entailment),矛盾(contradiction),中立(neutral)。句子对三分类问题| `paddlenlp.datasets.load_dataset('glue','mnli')`| +| [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) | 
判断问题(question)和句子(sentence)是否蕴含,蕴含和不蕴含,二分类| `paddlenlp.datasets.load_dataset('glue','qnli')`| +| [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) | 判断句对是否蕴含,句子1和句子2是否互为蕴含,二分类任务| `paddlenlp.datasets.load_dataset('glue','rte')`| +| [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | 判断句子对是否相关,相关或不相关,二分类任务| `paddlenlp.datasets.load_dataset('glue','wnli')`| +| [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html) | A Large-scale Chinese Question Matching Corpus 语义匹配数据集| `paddlenlp.datasets.load_dataset('lcqmc')`| +| [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb) | 中文评论情感分析语料| `paddlenlp.datasets.load_dataset('chnsenticorp')`| + + +## 序列标注 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA 命名实体识别数据集| `paddlenlp.datasets.load_dataset('msra_ner')`| +| [People's Daily](https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/People's%20Daily) | 人民日报命名实体识别数据集| `paddlenlp.datasets.load_dataset('peoples_daily_ner')`| + + +## 机器翻译 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [IWSLT15](https://workshop2015.iwslt.org/) | IWSLT'15 English-Vietnamese data 英语-越南语翻译数据集| `paddlenlp.datasets.load_dataset('iwslt15')`| +| [WMT14ENDE](http://www.statmt.org/wmt14/translation-task.html) | WMT14 EN-DE 经过BPE分词的英语-德语翻译数据集| `paddlenlp.datasets.load_dataset('wmt14ende')`| + + +## 机器同传 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [BSTC](https://aistudio.baidu.com/aistudio/competition/detail/44/) | 千言数据集:机器同传,包括transcription_translation和asr | `paddlenlp.datasets.load_dataset('bstc', 'asr')`| + + +## 文本生成 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [Poetry](https://github.com/chinese-poetry/chinese-poetry) | 中文诗歌古典文集数据| `paddlenlp.datasets.load_dataset('poetry')`| +| [Couplet](https://github.com/v-zich/couplet-clean-dataset) | 中文对联数据集| `paddlenlp.datasets.load_dataset('couplet')`| + +## 语料库 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [PTB](http://www.fit.vutbr.cz/~imikolov/rnnlm/) | Penn Treebank Dataset | `paddlenlp.datasets.load_dataset('ptb')`| +| [Yahoo Answer 100k](https://arxiv.org/pdf/1702.08139.pdf) | 从Yahoo Answer采样100K| `paddlenlp.datasets.load_dataset('yahoo_answer_100k')`| diff --git a/docs/get_started/installation.rst b/docs/get_started/installation.rst new file mode 100644 index 0000000000000000000000000000000000000000..4b1362b411b5e76342a89ace9be037434ac8e7a6 --- /dev/null +++ b/docs/get_started/installation.rst @@ -0,0 +1,111 @@ +安装PaddleNLP +^^^^^^^^ +以下安装过程默认用户已安装好paddlepaddle-gpu或paddlepaddle(版本大于或等于2.0),paddlepaddle安装方式参照 飞桨官网_。 + +.. _飞桨官网: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/2.0/install/pip/windows-pip.html + +pip安装 +-------- +.. code-block:: + + pip install --upgrade paddlenlp>=2.0.0rc -i https://pypi.org/simple + +Anaconda安装 +-------- +Anaconda是一个开源的Python发行版本,其包含了conda、Python等180多个科学包及其依赖项。使用Anaconda可以通过创建多个独立的Python环境,避免用户的Python环境安装太多不同版本依赖导致冲突。 + +1、windows安装Anaconda +>>>>>>>>> + +第一步 下载 +::::::::: +* 在 Anaconda官网_ 选择下载Windows Python3.7 64-Bit版本。 + +.. _Anaconda官网: https://www.anaconda.com/products/individual + +* 确保已经安装Visual C++ Build Tools(可以在开始菜单中找到),如未安装,请点击 下载安装_。 + +.. _下载安装: https://go.microsoft.com/fwlink/?Linkid=691126 + +第二步 安装 +::::::::: +运行下载的安装包(以.exe为后辍),根据引导完成安装, 用户可自行修改安装目录(如下图)。 + +.. 
image:: ../imgs/anaconda_windows.png + +第三步 使用 +::::::::: +* 点击Windows系统左下角的Windows图标,打开:所有程序->Anaconda3/2(64-bit)->Anaconda Prompt +* 在命令行中执行下述命令 + +.. code-block:: + + # 创建名为my_paddlenlp的环境,指定Python版本为3.7 + conda create -n my_paddlenlp python=3.7 + # 进入my_paddlenlp环境 + conda activate my_paddlenlp + # 安装PaddleNLP + pip install --upgrade paddlenlp>=2.0.0rc -i https://pypi.org/simple + +按如上方式配置后,即可在环境中使用PaddleNLP了,命令行输入python回车后,import paddlenlp试试吧,之后再次使用都可以通过打开'所有程序->Anaconda3/2(64-bit)->Anaconda Prompt',再执行conda activate my_paddlenlp进入环境后,即可再次使用PaddleNLP。 + +2、Linux/Mac安装Anaconda +>>>>>>>>> + +第一步 下载 +::::::::: +在 Anaconda官网_ 选择下载对应系统 Python3.7版本下载(Mac下载Command Line Installer版本即可)。 + +.. _Anaconda官网: https://www.anaconda.com/products/individual + +第二步 安装 +::::::::: +打开终端,在终端安装Anaconda + +.. code-block:: + + # ~/Downloads/Anaconda3-2019.07-Linux-x86_64.sh即下载的文件 + bash ~/Downloads/Anaconda3-2019.07-Linux-x86_64.sh + +安装过程中一直回车即可,如提示设置安装路径,可根据需求修改,一般默认即可。 + +第三步 使用 +::::::::: + +.. code-block:: + + # 创建名为my_paddlenlp的环境,指定Python版本为3.7 + conda create -n my_paddlenlp python=3.7 + # 进入my_paddlenlp环境 + conda activate my_paddlenlp + # 安装PaddleNLP + pip install --upgrade paddlenlp>=2.0.0rc -i https://pypi.org/simple + +按如上方式配置后,即可在环境中使用PaddleNLP了,命令行输入python回车后,import paddlenlp试试吧,之后再次使用都可以通过打开'所有程序->Anaconda3/2(64-bit)->Anaconda Prompt',再执行conda activate my_paddlenlp进入环境后,即可再次使用PaddleNLP。 + +代码安装 +--------- +github代码会跟随开发进度不断更新 + +.. code-block:: + + git clone https://github.com/PaddlePaddle/PaddleNLP.git + cd PaddleNLP + git checkout develop + +使用Docker镜像体验PaddleNLP +^^^^^^^^ + +如果您没有Docker运行环境,请参考 `Docker官网`_ 进行安装 + +.. _Docker官网: https://www.docker.com + +PaddleNLP提供了带有最新代码的docker镜像供您使用,您只需要*拉取docker镜像*,然后*运行docker镜像*,无需其他任何额外操作,即可开始使用PaddleNLP的所有功能。 + +在 `Docker Hub`_ 中获取这些镜像及相应的使用指南,包括CPU、GPU、ROCm版本。 + +.. _Docker Hub: https://hub.docker.com/repository/docker/paddlecloud/paddlenlp + +如果您对自动化制作docker镜像感兴趣,或有自定义需求,请访问 `PaddlePaddle/PaddleCloud`_ 做进一步了解。 + +.. _PaddlePaddle/PaddleCloud: https://github.com/PaddlePaddle/PaddleCloud/tree/main/tekton diff --git a/docs/get_started/quick_start.rst b/docs/get_started/quick_start.rst new file mode 100644 index 0000000000000000000000000000000000000000..1ece7a7aa95c1e5f85b8a1e283d83e2fa117694b --- /dev/null +++ b/docs/get_started/quick_start.rst @@ -0,0 +1,164 @@ +======== +10分钟完成高精度中文情感分析 +======== + +1. 安装PaddleNLP +======== + +安装相关过程和问题可以参考PaddleNLP的 安装文档_。 + +.. _安装文档: https://paddlenlp.readthedocs.io/en/latest/gettingstarted/install.html + + +.. code-block:: + + >>> pip install --upgrade paddlenlp -i https://pypi.org/simple + +2. 一键加载预训练模型 +======== + +情感分析本质是一个文本分类任务。PaddleNLP内置了ERNIE、BERT、RoBERTa、Electra等丰富的预训练模型,并且内置了各种预训练模型对于不同下游任务的Fine-tune网络。用户可以使用PaddleNLP提供的模型,完成问答、序列分类、token分类等任务。查阅 预训练模型_ 了解更多。这里以ERNIE模型为例,介绍如何将预训练模型Fine-tune完成文本分类任务。 + +.. _预训练模型: https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html + +加载预训练模型ERNIE + +.. code-block:: + + >>> MODEL_NAME = "ernie-3.0-medium-zh" + >>> ernie_model = paddlenlp.transformers.ErnieModel.from_pretrained(MODEL_NAME) + +加载预训练模型ERNIE用于文本分类任务的Fine-tune网络,只需指定想要使用的模型名称和文本分类的类别数即可完成网络定义。 + +.. code-block:: + + >>> model = paddlenlp.transformers.ErnieForSequenceClassification.from_pretrained( + ... MODEL_NAME, num_classes=len(label_list)) + +3. 调用Tokenizer进行数据处理 +======== + +Tokenizer用于将原始输入文本转化成模型可以接受的输入数据形式。PaddleNLP对于各种预训练模型已经内置了相应的Tokenizer,指定想要使用的模型名字即可加载。 + +.. 
code-block:: + + >>> tokenizer = paddlenlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME) + +Transformer类预训练模型所需的数据处理步骤通常包括将原始输入文本切分token;将token映射为对应的token id;拼接上预训练模型对应的特殊token ,如[CLS]、[SEP];最后转化为框架所需的数据格式。为了方便使用,PaddleNLP提供了高阶API,一键即可返回模型所需数据格式。 + +一行代码完成切分token,映射token ID以及拼接特殊token: + +.. code-block:: + + >>> encoded_text = tokenizer(text="请输入测试样例") + +转化成paddle框架数据格式: + +.. code-block:: + + >>> input_ids = paddle.to_tensor([encoded_text['input_ids']]) + >>> print("input_ids : {}".format(input_ids)) + >>> token_type_ids = paddle.to_tensor([encoded_text['token_type_ids']]) + >>> print("token_type_ids : {}".format(token_type_ids)) + input_ids : Tensor(shape=[1, 9], dtype=int64, place=CUDAPlace(0), stop_gradient=True, + [[1 , 647, 789, 109, 558, 525, 314, 656, 2 ]]) + token_type_ids : Tensor(shape=[1, 9], dtype=int64, place=CUDAPlace(0), stop_gradient=True, + [[0, 0, 0, 0, 0, 0, 0, 0, 0]]) + +input_ids: 表示输入文本的token ID。 + +token_type_ids: 表示对应的token属于输入的第一个句子还是第二个句子。(Transformer类预训练模型支持单句以及句对输入。) + +此时即可输入ERNIE模型中得到相应输出。 + +.. code-block:: + + >>> sequence_output, pooled_output = ernie_model(input_ids, token_type_ids) + >>> print("Token wise output: {}, Pooled output: {}".format( + ... sequence_output.shape, pooled_output.shape)) + Token wise output: [1, 9, 768], Pooled output: [1, 768] + +可以看出,ERNIE模型输出有2个tensor。 + +sequence_output是对应每个输入token的语义特征表示,shape为(1, num_tokens, hidden_size)。其一般用于序列标注、问答等任务。 + +pooled_output是对应整个句子的语义特征表示,shape为(1, hidden_size)。其一般用于文本分类、信息检索等任务。 + +4. 加载数据集 +======== +PaddleNLP内置了适用于阅读理解、文本分类、序列标注、机器翻译等下游任务的多个数据集,这里我们使用公开中文情感分析数据集ChnSenticorp,包含7000多条正负向酒店评论数据。 + +一键加载PaddleNLP内置数据集: + +.. code-block:: + + >>> train_ds, dev_ds, test_ds = paddlenlp.datasets.load_dataset( + ... 'chnsenticorp', splits=['train', 'dev', 'test']) + +获取分类数据标签: + +.. code-block:: + + >>> label_list = train_ds.label_list + >>> print(label_list) + ['0', '1'] + +展示一些数据: + +.. code-block:: + + >>> for idx in range(5): + ... print(train_ds[idx]) + + {'text': '选择珠江花园的原因就是方便,有电动扶梯直接到达海边,周围餐馆、食廊、商场、超市、摊位一应俱全。 + 酒店装修一般,但还算整洁。 泳池在大堂的屋顶,因此很小,不过女儿倒是喜欢。 包的早餐是西式的,还算丰富。 服务吗,一般', 'label': 1} + {'text': '15.4寸笔记本的键盘确实爽,基本跟台式机差不多了,蛮喜欢数字小键盘,输数字特方便,样子也很美观,做工也相当不错', 'label': 1} + {'text': '房间太小。其他的都一般。。。。。。。。。', 'label': 0} + {'text': '1.接电源没有几分钟,电源适配器热的不行. 2.摄像头用不起来. 3.机盖的钢琴漆,手不能摸,一摸一个印. 4.硬盘分区不好办.', 'label': 0} + {'text': '今天才知道这书还有第6卷,真有点郁闷:为什么同一套书有两种版本呢?当当网是不是该跟出版社商量商量, + 单独出个第6卷,让我们的孩子不会有所遗憾。', 'label': 1} + +5. 模型训练与评估 +======== +数据读入时使用 :func:`paddle.io.DataLoader` 接口多线程异步加载数据,然后设置适用于ERNIE这类Transformer模型的动态学习率和损失函数、优化算法、评价指标等。 + +模型训练的过程通常按照以下步骤: + +#. 从dataloader中取出一个batch data。 +#. 将batch data喂给model,做前向计算。 +#. 将前向计算结果传给损失函数,计算loss。将前向计算结果传给评价方法,计算评价指标。 +#. loss反向回传,更新梯度。重复以上步骤。 +#. 每训练一个epoch时,程序将会评估一次,评估当前模型训练的效果。 + +本示例同步在AIStudio上,可直接 在线体验模型训练_。 + +.. _在线体验模型训练: https://aistudio.baidu.com/aistudio/projectdetail/1294333 + +最后,保存训练好的模型用于预测。 + +6. 模型预测 +======== +保存训练模型,定义预测函数 :func:`predict` ,即可开始预测文本情感倾向。 + +以自定义预测数据和数据标签为示例: + +.. code-block:: + + >>> data = [ + ... '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般', + ... '怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片', + ... '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。', + ... ] + >>> label_map = {0: 'negative', 1: 'positive'} + +得到预测结果: + +.. code-block:: + + >>> results = predict( + ... model, data, tokenizer, label_map, batch_size=batch_size) + >>> for idx, text in enumerate(data): + ... 
print('Data: {} \t Label: {}'.format(text, results[idx])) + Data: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般 Label: negative + Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 Label: negative + Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 Label: positive diff --git a/docs/imgs/anaconda_windows.png b/docs/imgs/anaconda_windows.png new file mode 100644 index 0000000000000000000000000000000000000000..fe1c62ff6134f1d3cba928d91940f404ae9ac11d Binary files /dev/null and b/docs/imgs/anaconda_windows.png differ diff --git a/docs/imgs/data_preprocess_pipline.png b/docs/imgs/data_preprocess_pipline.png new file mode 100644 index 0000000000000000000000000000000000000000..0bf1887ac272f1d961da4ca2c0ad41e538959b46 Binary files /dev/null and b/docs/imgs/data_preprocess_pipline.png differ diff --git a/docs/imgs/fast_generation.png b/docs/imgs/fast_generation.png new file mode 100644 index 0000000000000000000000000000000000000000..4e03be62ed9a3eceed467e083ef6bbdb8868d210 Binary files /dev/null and b/docs/imgs/fast_generation.png differ diff --git a/docs/imgs/paddlenlp.png b/docs/imgs/paddlenlp.png new file mode 100644 index 0000000000000000000000000000000000000000..5053c464119c15872c42acfee746881098e8e12b Binary files /dev/null and b/docs/imgs/paddlenlp.png differ diff --git a/docs/index.rst b/docs/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..8780d0e9093fa0e916b71457eadcd3b364db1741 --- /dev/null +++ b/docs/index.rst @@ -0,0 +1,113 @@ +欢迎使用PaddleNLP +================== + +`PaddleNLP `_ 是飞桨自然语言处理开发库,具备 **易用的文本领域API**,**多场景的应用示例**、和 **高性能分布式训练** 三大特点,旨在提升飞桨开发者文本领域建模效率,旨在提升开发者在文本领域的开发效率,并提供丰富的NLP应用示例。 + + +- **易用的文本领域API** + + - 提供丰富的产业级预置任务能力 **Taskflow** 和全流程的文本领域API:支持丰富中文数据集加载的 **Dataset API**,可灵活高效地完成数据预处理的 **Data API** ,预置60+预训练词向量的 **Embedding API** ,提供100+预训练模型的 **Transformer API** 等,可大幅提升NLP任务建模的效率。 + +- **多场景的应用示例** + + - 覆盖从学术到产业级的NLP应用示例,涵盖NLP基础技术、NLP系统应用以及相关拓展应用。全面基于飞桨核心框架2.0全新API体系开发,为开发者提供飞桨文本领域的最佳实践。 + +- **高性能分布式训练** + + - 基于飞桨核心框架领先的自动混合精度优化策略,结合分布式Fleet API,支持4D混合并行策略,可高效地完成大规模预训练模型训练。 + + +* 项目GitHub: https://github.com/PaddlePaddle/PaddleNLP +* 项目Gitee: https://gitee.com/paddlepaddle/PaddleNLP +* GitHub Issue反馈: https://github.com/PaddlePaddle/PaddleNLP/issues +* 微信交流群: 微信扫描二维码并填写问卷之后,即可加入交流群,与众多社区开发者以及官方团队深度交流。 + +.. image:: https://user-images.githubusercontent.com/11793384/184784832-bb97930f-a738-4480-99be-517aeb65afac.png + :align: center + :alt: paddlenlp微信交流群二维码 + +.. toctree:: + :maxdepth: 1 + :caption: 快速开始 + + 安装 + 10分钟完成高精度中文情感分析 + +.. toctree:: + :maxdepth: 1 + :caption: 数据准备 + + 整体介绍 + 数据集列表 + 加载数据集 + 自定义数据集 + 数据处理 + +.. toctree:: + :maxdepth: 1 + :caption: 模型库 + + Transformer预训练模型 + 使用Trainer API训练 + 使用Trainer API进行模型压缩 + 一键预测功能 + 预训练词向量 + + +.. toctree:: + :maxdepth: 1 + :caption: 评价指标 + + 评价指标 + +.. toctree:: + :maxdepth: 1 + :caption: 实践教程 + + AI Studio Notebook + +.. toctree:: + :maxdepth: 1 + :caption: 进阶指南 + + 模型压缩 + 文本生成高性能加速 + 大规模分布式训练 + +.. toctree:: + :maxdepth: 1 + :caption: 社区交流共建 + + 如何贡献模型 + 如何贡献数据集 + 如何贡献文档案例 + 如何加入兴趣小组 + +.. toctree:: + :maxdepth: 1 + :caption: FAQ + + FAQ + +.. 
toctree:: + :maxdepth: 1 + :caption: API Reference + + paddlenlp.data + paddlenlp.datasets + paddlenlp.embeddings + paddlenlp.layers + paddlenlp.losses + paddlenlp.metrics + paddlenlp.ops + paddlenlp.seq2vec + paddlenlp.taskflow + paddlenlp.trainer + paddlenlp.transformers + paddlenlp.utils + +Indices and tables +==================== +* :ref:`genindex` +* :ref:`modindex` +* :ref:`search` diff --git a/docs/locale/en/LC_MESSAGES/FAQ.po b/docs/locale/en/LC_MESSAGES/FAQ.po new file mode 100644 index 0000000000000000000000000000000000000000..2439942650b8553133c0aa8bac07f1b0efc79c29 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/FAQ.po @@ -0,0 +1,686 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../FAQ.md:1 +msgid "PaddleNLP常见问题汇总(持续更新)" +msgstr "" + +#: ../FAQ.md:3 +msgid "【精选】NLP精选5问" +msgstr "" + +#: ../FAQ.md:5 ../FAQ.md:59 +msgid "Q1.1 如何加载自己的本地数据集,以便使用PaddleNLP的功能?" +msgstr "" + +#: ../FAQ.md:6 ../FAQ.md:88 +msgid "Q1.2 PaddleNLP会将内置的数据集、模型下载到默认路径,如何修改路径?" +msgstr "" + +#: ../FAQ.md:7 ../FAQ.md:98 +msgid "Q1.3 PaddleNLP中如何保存、加载训练好的模型?" +msgstr "" + +#: ../FAQ.md:8 ../FAQ.md:134 +msgid "Q1.4 当训练样本较少时,有什么推荐的方法能提升模型效果吗?" +msgstr "" + +#: ../FAQ.md:9 ../FAQ.md:140 +msgid "Q1.5 如何提升模型的性能,提升QPS?" +msgstr "" + +#: ../FAQ.md:11 +msgid "【理论篇】NLP通用问题" +msgstr "" + +#: ../FAQ.md:13 ../FAQ.md:152 +msgid "Q2.1 数据类别分布不均衡, 有哪些应对方法?" +msgstr "" + +#: ../FAQ.md:14 ../FAQ.md:166 +msgid "Q2.2 如果使用预训练模型,一般需要多少条样本?" +msgstr "" + +#: ../FAQ.md:16 +msgid "【实战篇】PaddleNLP实战问题" +msgstr "" + +#: ../FAQ.md:18 ../FAQ.md:177 +msgid "数据集和数据处理" +msgstr "" + +#: ../FAQ.md:20 ../FAQ.md:181 +msgid "Q3.1 使用自己的数据集训练预训练模型时,如何引入额外的词表?" +msgstr "" + +#: ../FAQ.md:22 ../FAQ.md:192 +msgid "模型训练调优" +msgstr "" + +#: ../FAQ.md:24 ../FAQ.md:196 +msgid "Q3.2 如何加载自己的预训练模型,进而使用PaddleNLP的功能?" +msgstr "" + +#: ../FAQ.md:25 ../FAQ.md:230 +msgid "Q3.3 如果训练中断,需要继续热启动训练,如何保证学习率和优化器能从中断地方继续迭代?" +msgstr "" + +#: ../FAQ.md:26 ../FAQ.md:252 +msgid "Q3.4 如何冻结模型梯度?" +msgstr "" + +#: ../FAQ.md:27 ../FAQ.md:313 +msgid "Q3.5 如何在eval阶段打印评价指标,在各epoch保存模型参数?" +msgstr "" + +#: ../FAQ.md:28 ../FAQ.md:331 +msgid "Q3.6 训练过程中,训练程序意外退出或Hang住,应该如何排查?" +msgstr "" + +#: ../FAQ.md:30 ../FAQ.md:339 +msgid "Q3.7 在模型验证和测试过程中,如何保证每一次的结果是相同的?" +msgstr "" + +#: ../FAQ.md:31 ../FAQ.md:351 +msgid "Q3.8 ERNIE模型如何返回中间层的输出?" +msgstr "" + +#: ../FAQ.md:33 ../FAQ.md:358 +msgid "预测部署" +msgstr "" + +#: ../FAQ.md:35 ../FAQ.md:362 +msgid "Q3.9 PaddleNLP训练好的模型如何部署到服务器 ?" +msgstr "" + +#: ../FAQ.md:36 ../FAQ.md:380 +msgid "Q3.10 静态图模型如何转换成动态图模型?" +msgstr "" + +#: ../FAQ.md:38 +msgid "特定模型和应用场景咨询" +msgstr "" + +#: ../FAQ.md:39 ../FAQ.md:390 +msgid "Q4.1 【词法分析】LAC模型,如何自定义标签label,并继续训练?" +msgstr "" + +#: ../FAQ.md:40 ../FAQ.md:398 +msgid "Q4.2 信息抽取任务中,是否推荐使用预训练模型+CRF,怎么实现呢?" +msgstr "" + +#: ../FAQ.md:41 +msgid "" +"Q4.3 " +"【阅读理解】MapDatasets的map()方法中对应的batched=True怎么理解,在阅读理解任务中为什么必须把参数batched设置为True?" +msgstr "" + +#: ../FAQ.md:42 ../FAQ.md:410 +msgid "Q4.4 【语义匹配】语义索引和语义匹配有什么区别?" +msgstr "" + +#: ../FAQ.md:43 ../FAQ.md:416 +msgid "Q4.5 【解语】wordtag模型如何自定义添加命名实体及对应词类?" 
+msgstr "" + +#: ../FAQ.md:45 +msgid "其他使用咨询" +msgstr "" + +#: ../FAQ.md:46 ../FAQ.md:433 +msgid "Q5.1 在CUDA11使用PaddlNLP报错?" +msgstr "" + +#: ../FAQ.md:47 ../FAQ.md:439 +msgid "Q5.2 如何设置parameter?" +msgstr "" + +#: ../FAQ.md:48 ../FAQ.md:473 +msgid "Q5.3 GPU版的Paddle虽然能在CPU上运行,但是必须要有GPU设备吗?" +msgstr "" + +#: ../FAQ.md:49 ../FAQ.md:479 +msgid "Q5.4 如何指定用CPU还是GPU训练模型?" +msgstr "" + +#: ../FAQ.md:50 ../FAQ.md:487 +msgid "Q5.5 动态图模型和静态图模型的预测结果一致吗?" +msgstr "" + +#: ../FAQ.md:51 ../FAQ.md:493 +msgid "Q5.6 如何可视化acc、loss曲线图、模型网络结构图等?" +msgstr "" + +#: ../FAQ.md:53 +msgid "" +msgstr "" + +#: ../FAQ.md:55 +msgid "⭐️【精选】NLP精选5问" +msgstr "" + +#: ../FAQ.md:57 +msgid "" +msgstr "" + +#: ../FAQ.md:61 +msgid "" +"A: 通过使用PaddleNLP提供的 load_dataset, MapDataset 和 IterDataset " +",可以方便的自定义属于自己的数据集哦,也欢迎您贡献数据集到PaddleNLP repo。" +msgstr "" + +#: ../FAQ.md:63 +msgid "" +"从本地文件创建数据集时,我们 推荐 根据本地数据集的格式给出读取function并传入 load_dataset() 中创建数据集。 " +"以waybill_ie快递单信息抽取任务中的数据为例:" +msgstr "" + +#: ../FAQ.md:84 +msgid "如果您习惯使用paddle.io.Dataset/IterableDataset来创建数据集也是支持的,您也可以从其他python对象如List对象创建数据集,详细内容可参照官方文档-自定义数据集。" +msgstr "" + +#: ../FAQ.md:86 +msgid "" +msgstr "" + +#: ../FAQ.md:90 +msgid "A: 内置的数据集、模型默认会下载到$HOME/.paddlenlp/下,通过配置环境变量可下载到指定路径:" +msgstr "" + +#: ../FAQ.md:92 +msgid "(1)Linux下,设置 export PPNLP_HOME=\"xxxx\",注意不要设置带有中文字符的路径。" +msgstr "" + +#: ../FAQ.md:94 +msgid "(2)Windows下,同样配置环境变量 PPNLP_HOME 到其他非中文字符路径,重启即可。" +msgstr "" + +#: ../FAQ.md:96 +msgid "" +msgstr "" + +#: ../FAQ.md:100 +msgid "A:(1)PaddleNLP预训练模型" +msgstr "" + +#: ../FAQ.md:102 +msgid "​ 保存:" +msgstr "" + +#: ../FAQ.md:109 ../FAQ.md:125 +msgid "​ 加载:" +msgstr "" + +#: ../FAQ.md:116 +msgid "(2)常规模型 保存:" +msgstr "" + +#: ../FAQ.md:132 +msgid "" +msgstr "" + +#: ../FAQ.md:136 +msgid "" +"A: 增加训练样本带来的效果是最直接的。此外,可以基于我们开源的预训练模型进行热启,再用少量数据集fine-" +"tune模型。此外,针对分类、匹配等场景,小样本学习也能够带来不错的效果。" +msgstr "" + +#: ../FAQ.md:138 +msgid "" +msgstr "" + +#: ../FAQ.md:142 +msgid "" +"A: 从工程角度,对于服务器端部署可以使用Paddle " +"Inference高性能预测引擎进行预测部署。对于Transformer类模型的GPU预测还可以使用PaddleNLP中提供的FasterTransformer功能来进行快速预测,其集成了NV" +" FasterTransformer并进行了功能增强。" +msgstr "" + +#: ../FAQ.md:144 +msgid "" +"从模型策略角度,可以使用一些模型小型化技术来进行模型压缩,如模型蒸馏和裁剪,通过小模型来实现加速。PaddleNLP中集成了ERNIE-" +"Tiny这样一些通用小模型供下游任务微调使用。另外PaddleNLP提供了模型压缩示例,实现了DynaBERT、TinyBERT、MiniLM等方法策略,可以参考对自己的模型进行蒸馏压缩。" +msgstr "" + +#: ../FAQ.md:146 +msgid "" +msgstr "" + +#: ../FAQ.md:148 +msgid "⭐️【理论篇】NLP通用问题" +msgstr "" + +#: ../FAQ.md:150 +msgid "" +msgstr "" + +#: ../FAQ.md:154 +msgid "A: 可以采用以下几种方法优化类别分布不均衡问题:" +msgstr "" + +#: ../FAQ.md:156 +msgid "(1)欠采样:对样本量较多的类别进行欠采样,去除一些样本,使得各类别数目接近。" +msgstr "" + +#: ../FAQ.md:158 +msgid "(2)过采样:对样本量较少的类别进行过采样,选择样本进行复制,使得各类别数目接近。" +msgstr "" + +#: ../FAQ.md:160 +msgid "(3)修改分类阈值:直接使用类别分布不均衡的数据训练分类器,会使得模型在预测时更偏向于多数类,所以不再以0.5为分类阈值,而是针对少数类在模型仅有较小把握时就将样本归为少数类。" +msgstr "" + +#: ../FAQ.md:162 +msgid "(4)代价敏感学习:比如LR算法中设置class_weight参数。" +msgstr "" + +#: ../FAQ.md:164 +msgid "" +msgstr "" + +#: ../FAQ.md:168 +msgid "" +"A: " +"很难定义具体需要多少条样本,取决于具体的任务以及数据的质量。如果数据质量没问题的话,分类、文本匹配任务所需数据量级在百级别,翻译则需要百万级能够训练出一个比较鲁棒的模型。如果样本量较少,可以考虑数据增强,或小样本学习。" +msgstr "" + +#: ../FAQ.md:171 +msgid "" +msgstr "" + +#: ../FAQ.md:173 +msgid "⭐️【实战篇】PaddleNLP实战问题" +msgstr "" + +#: ../FAQ.md:175 +msgid "" +msgstr "" + +#: ../FAQ.md:179 +msgid "" +msgstr "" + +#: ../FAQ.md:183 +msgid "" +"A: " +"预训练模型通常会有配套的tokenzier和词典,对于大多数中文预训练模型,如ERNIE-3.0-Medium-zh,使用的都是字粒度的输入,tokenzier会将句子转换为字粒度的形式,模型无法收到词粒度的输入。如果希望引入额外的词典,需要修改预训练模型的tokenizer和词典,可以参考这里blog,另外注意embedding矩阵也要加上这些新增词的embedding表示。" +msgstr "" + +#: ../FAQ.md:185 +msgid "" 
+"另外还有一种方式可以使用这些字典信息,可以将数据中在词典信息中的词进行整体mask进行一个mask language " +"model的二次预训练,这样经过二次训练的模型就包含了对额外字典的表征。可参考 Mask Language Model 数据构建。" +msgstr "" + +#: ../FAQ.md:188 +msgid "此外还有些词粒度及字词混合粒度的预训练模型,在这些词粒度的模型下引入额外的词表也会容易些,我们也将持续丰富PaddleNLP中的预训练模型。" +msgstr "" + +#: ../FAQ.md:190 +msgid "" +msgstr "" + +#: ../FAQ.md:194 +msgid "" +msgstr "" + +#: ../FAQ.md:198 +msgid "" +"A: " +"以bert为例,如果是使用PaddleNLP训练,通过save_pretrained()接口保存的模型,可通过from_pretrained()来加载:" +msgstr "" + +#: ../FAQ.md:205 +msgid "如果不是上述情况,可以使用如下方式加载模型,也欢迎您贡献模型到PaddleNLP repo中。" +msgstr "" + +#: ../FAQ.md:207 +msgid "(1)加载BertTokenizer和BertModel" +msgstr "" + +#: ../FAQ.md:214 +msgid "" +"(2)调用save_pretrained()生成 model_config.json、 " +"tokenizer_config.json、model_state.pdparams、 vocab.txt " +"文件,保存到./checkpoint:" +msgstr "" + +#: ../FAQ.md:221 +msgid "" +"(3)修改model_config.json、 " +"tokenizer_config.json这两个配置文件,指定为自己的模型,之后通过from_pretrained()加载模型。" +msgstr "" + +#: ../FAQ.md:228 +msgid "" +msgstr "" + +#: ../FAQ.md:232 +msgid "A:" +msgstr "" + +#: ../FAQ.md:234 +msgid "(1)完全恢复训练状态,可以先将lr、 optimizer、model的参数保存下来:" +msgstr "" + +#: ../FAQ.md:242 +msgid "(2)加载lr、 optimizer、model参数即可恢复训练:" +msgstr "" + +#: ../FAQ.md:250 +msgid "" +msgstr "" + +#: ../FAQ.md:254 +msgid "A: 有多种方法可以尝试:" +msgstr "" + +#: ../FAQ.md:257 +msgid "(1)可以直接修改 PaddleNLP 内部代码实现,在需要冻结梯度的地方用 paddle.no_grad() 包裹一下" +msgstr "" + +#: ../FAQ.md:259 +msgid "paddle.no_grad() 的使用方式,以对 forward() 进行冻结为例:" +msgstr "" + +#: ../FAQ.md:282 +msgid "paddle.no_grad() 的使用也不局限于模型内部实现里面,也可以包裹外部的方法,比如:" +msgstr "" + +#: ../FAQ.md:296 +msgid "" +"(2)第二种方法:以ERNIE为例,将模型输出的 tensor 设置 stop_gradient 为 True。可以使用 " +"register_forward_post_hook 按照如下的方式尝试:" +msgstr "" + +#: ../FAQ.md:305 +msgid "" +"(3)第三种方法:在 optimizer 上进行处理,model.parameters 是一个 List,可以通过 name " +"进行相应的过滤,更新/不更新某些参数,这种方法需要对网络结构的名字有整体了解,因为网络结构的实体名字决定了参数的名字,这个使用方法有一定的门槛:" +msgstr "" + +#: ../FAQ.md:311 +msgid "" +msgstr "" + +#: ../FAQ.md:315 +msgid "" +"A: 飞桨主框架提供了两种训练与预测的方法,一种是用 " +"paddle.Model()对模型进行封装,通过高层API如Model.fit()、Model.evaluate()、Model.predict()等完成模型的训练与预测;另一种就是基于基础API常规的训练方式。" +msgstr "" + +#: ../FAQ.md:317 +msgid "(1)对于第一种方法:" +msgstr "" + +#: ../FAQ.md:319 +msgid "" +"我们可以设置 paddle.Model.fit() API中的 eval_data 和 eval_freq " +"参数在训练过程中打印模型评价指标:eval_data 参数是一个可迭代的验证集数据源,eval_freq 参数是评估的频率;当eval_data " +"给定后,eval_freq 的默认值为1,即每一个epoch进行一次评估。注意:在训练前,我们需要在 Model.prepare() " +"接口传入metrics参数才能在eval时打印模型评价指标。" +msgstr "" + +#: ../FAQ.md:321 +msgid "" +"关于模型保存,我们可以设置 paddle.Model.fit() 中的 save_freq 参数控制模型保存的频率:save_freq " +"的默认值为1,即每一个epoch保存一次模型。" +msgstr "" + +#: ../FAQ.md:323 +msgid "(2)对于第二种方法:" +msgstr "" + +#: ../FAQ.md:325 +msgid "我们在PaddleNLP的examples目录下提供了常见任务的训练与预测脚本:如GLUE 和 SQuAD等" +msgstr "" + +#: ../FAQ.md:327 +msgid "开发者可以参考上述脚本进行自定义训练与预测脚本的开发。" +msgstr "" + +#: ../FAQ.md:329 +msgid "" +msgstr "" + +#: ../FAQ.md:333 +msgid "A: 一般先考虑内存、显存(使用GPU训练的话)是否不足,可将训练和评估的batch size调小一些。" +msgstr "" + +#: ../FAQ.md:335 +msgid "需要注意,batch size调小时,学习率learning rate也要调小,一般可按等比例调整。" +msgstr "" + +#: ../FAQ.md:337 +msgid "" +msgstr "" + +#: ../FAQ.md:341 +msgid "A: 在验证和测试过程中常常出现的结果不一致情况一般有以下几种解决方法:" +msgstr "" + +#: ../FAQ.md:343 +msgid "(1)确保设置了eval模式,并保证数据相关的seed设置保证数据一致性。" +msgstr "" + +#: ../FAQ.md:345 +msgid "" +"(2)如果是下游任务模型,查看是否所有模型参数都被导入了,直接使用bert-" +"base这种预训练模型是不包含任务相关参数的,要确认导入的是微调后的模型,否则任务相关参数会随机初始化导致出现随机性。" +msgstr "" + +#: ../FAQ.md:347 +msgid "" 
+"(3)部分算子使用CUDNN后端产生的不一致性可以通过环境变量的设置来避免。如果模型中使用了CNN相关算子,可以设置FLAGS_cudnn_deterministic=True。如果模型中使用了RNN相关算子,可以设置CUBLAS_WORKSPACE_CONFIG=:16:8或CUBLAS_WORKSPACE_CONFIG=:4096:2(CUDNN" +" 10.2以上版本可用,参考CUDNN 8 release note)。" +msgstr "" + +#: ../FAQ.md:349 +msgid "" +msgstr "" + +#: ../FAQ.md:353 +msgid "" +"A: 目前的API设计不保留中间层输出,当然在PaddleNLP里可以很方便地修改源码。 " +"此外,还可以在ErnieModel的__init__函数中通过register_forward_post_hook()为想要保留输出的Layer注册一个forward_post_hook函数,在forward_post_hook函数中把Layer的输出保存到一个全局的List里面。forward_post_hook函数将会在forward函数调用之后被调用,并保存Layer输出到全局的List。详情参考register_forward_post_hook()。" +msgstr "" + +#: ../FAQ.md:356 +msgid "" +msgstr "" + +#: ../FAQ.md:360 +msgid "" +msgstr "" + +#: ../FAQ.md:364 +msgid "A: 我们推荐在动态图模式下开发,静态图模式部署。" +msgstr "" + +#: ../FAQ.md:366 +msgid "(1)动转静" +msgstr "" + +#: ../FAQ.md:368 +msgid "" +"动转静,即将动态图的模型转为可用于部署的静态图模型。 动态图接口更加易用,python " +"风格的交互式编程体验,对于模型开发更为友好,而静态图相比于动态图在性能方面有更绝对的优势。因此动转静提供了这样的桥梁,同时兼顾开发成本和性能。 " +"可以参考官方文档 动态图转静态图文档,使用 paddle.jit.to_static 完成动转静。 另外,在 PaddleNLP " +"我们也提供了导出静态图模型的例子,可以参考 waybill_ie 模型导出。" +msgstr "" + +#: ../FAQ.md:373 +msgid "(2)借助Paddle Inference部署" +msgstr "" + +#: ../FAQ.md:375 +msgid "" +"动转静之后保存下来的模型可以借助Paddle Inference完成高性能推理部署。Paddle Inference内置高性能的CPU/GPU " +"Kernel,结合细粒度OP横向纵向融合等策略,并集成 TensorRT 实现模型推理的性能提升。具体可以参考文档 Paddle " +"Inference 简介。 为便于初次上手的用户更易理解 NLP 模型如何使用Paddle Inference,PaddleNLP " +"也提供了对应的例子以供参考,可以参考 /PaddleNLP/examples 下的deploy目录,如基于ERNIE的命名实体识别模型部署。" +msgstr "" + +#: ../FAQ.md:378 +msgid "" +msgstr "" + +#: ../FAQ.md:382 +msgid "A: 首先,需要将静态图参数保存成ndarray数据,然后将静态图参数名和对应动态图参数名对应,最后保存成动态图参数即可。详情可参考参数转换脚本。" +msgstr "" + +#: ../FAQ.md:384 +msgid "" +msgstr "" + +#: ../FAQ.md:386 +msgid "⭐️特定模型和应用场景咨询" +msgstr "" + +#: ../FAQ.md:388 +msgid "" +msgstr "" + +#: ../FAQ.md:392 +msgid "A: 更新label文件tag.dict,添加 修改下CRF的标签数即可。" +msgstr "" + +#: ../FAQ.md:394 +msgid "可参考自定义标签示例,增量训练自定义LABLE示例。" +msgstr "" + +#: ../FAQ.md:396 +msgid "" +msgstr "" + +#: ../FAQ.md:400 +msgid "A: 预训练模型+CRF是一个通用的序列标注的方法,目前预训练模型对序列信息的表达也是非常强的,也可以尝试直接使用预训练模型对序列标注任务建模。" +msgstr "" + +#: ../FAQ.md:402 +msgid "" +msgstr "" + +#: ../FAQ.md:404 +msgid "Q4.3.【阅读理解】MapDatasets的map()方法中对应的batched=True怎么理解,在阅读理解任务中为什么必须把参数batched设置为True?" 
+msgstr "" + +#: ../FAQ.md:406 +msgid "" +"A: " +"batched=True就是对整个batch(这里不一定是训练中的batch,理解为一组数据就可以)的数据进行map,即map中的trans_func接受一组数据为输入,而非逐条进行map。在阅读理解任务中,根据使用的doc_stride不同,一条样本可能被转换成多条feature,对数据逐条map是行不通的,所以需要设置batched=True。" +msgstr "" + +#: ../FAQ.md:408 +msgid "" +msgstr "" + +#: ../FAQ.md:412 +msgid "" +"A: 语义索引要解决的核心问题是如何从海量 Doc 中通过 ANN 索引的方式快速、准确地找出与 query " +"相关的文档,语义匹配要解决的核心问题是对 query和文档更精细的语义匹配信息建模。换个角度理解, " +"语义索引是要解决搜索、推荐场景下的召回问题,而语义匹配是要解决排序问题,两者要解决的问题不同,所采用的方案也会有很大不同,但两者间存在一些共通的技术点,可以互相借鉴。" +msgstr "" + +#: ../FAQ.md:414 +msgid "" +msgstr "" + +#: ../FAQ.md:418 +msgid "" +"A: 其主要依赖于二次构造数据来进行finetune,同时要更新termtree信息。wordtag分为两个步骤: " +"(1)通过BIOES体系进行分词; (2)将分词后的信息和TermTree进行匹配。 因此我们需要: " +"(1)分词正确,这里可能依赖于wordtag的finetune数据,来让分词正确; " +"(2)wordtag里面也需要把分词正确后term打上相应的知识信息。wordtag自定义TermTree的方式将在后续版本提供出来。" +msgstr "" + +#: ../FAQ.md:425 +msgid "可参考issue。" +msgstr "" + +#: ../FAQ.md:427 +msgid "" +msgstr "" + +#: ../FAQ.md:429 +msgid "⭐️其他使用咨询" +msgstr "" + +#: ../FAQ.md:431 +msgid "" +msgstr "" + +#: ../FAQ.md:435 +msgid "A: 在CUDA11安装,可参考issue,其他CUDA版本安装可参考 官方文档" +msgstr "" + +#: ../FAQ.md:437 +msgid "" +msgstr "" + +#: ../FAQ.md:441 +msgid "A: 有多种方法: (1)可以通过set_value()来设置parameter,set_value()的参数可以是numpy或者tensor。" +msgstr "" + +#: ../FAQ.md:453 +msgid "(2)通过create_parameter()设置参数。" +msgstr "" + +#: ../FAQ.md:471 +msgid "" +msgstr "" + +#: ../FAQ.md:475 +msgid "" +"A: 不支持 GPU 的设备只能安装 CPU 版本的 PaddlePaddle。 GPU 版本的 PaddlePaddle 如果想只在 CPU " +"上运行,可以通过 export CUDA_VISIBLE_DEVICES=-1 来设置。" +msgstr "" + +#: ../FAQ.md:477 +msgid "" +msgstr "" + +#: ../FAQ.md:481 +msgid "A: 一般我们的训练脚本提供了 --device 选项,用户可以通过 --device 选择需要使用的设备。" +msgstr "" + +#: ../FAQ.md:483 +msgid "" +"具体而言,在Python文件中,我们可以通过·paddle.device.set_device()·,设置为gpu或者cpu,可参考 " +"set_device文档。" +msgstr "" + +#: ../FAQ.md:485 +msgid "" +msgstr "" + +#: ../FAQ.md:489 +msgid "A: 正常情况下,预测结果应当是一致的。如果遇到不一致的情况,可以及时反馈给 PaddleNLP 的开发人员,我们进行处理。" +msgstr "" + +#: ../FAQ.md:491 +msgid "" +msgstr "" + +#: ../FAQ.md:495 +msgid "" +"A: " +"可使用VisualDL进行可视化。其中acc、loss曲线图的可视化可参考Scalar——折线图组件使用指南,模型网络结构的可视化可参考Graph——网络结构组件使用指南。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/distributed_training.po b/docs/locale/en/LC_MESSAGES/advanced_guide/distributed_training.po new file mode 100644 index 0000000000000000000000000000000000000000..c2612396ac1f085ca14818e90ead14758b6467b9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/distributed_training.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/distributed_training.rst:3 +msgid "大规模分布式训练" +msgstr "" + +#: ../advanced_guide/distributed_training.rst:5 +msgid "大规模分布式训练:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fastgeneration.po b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fastgeneration.po new file mode 100644 index 0000000000000000000000000000000000000000..a927bd73636b72016446d75882fb7a505655f24c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fastgeneration.po @@ -0,0 +1,211 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:3 +msgid "FastGeneration加速生成API" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:5 +msgid "" +"FastGeneration是PaddleNLP " +"v2.2版本加入的一个高性能推理功能,可实现基于CUDA的序列解码。该功能可以用于多种生成类的预训练NLP模型,例如GPT、BART、UnifiedTransformer等,并且支持多种解码策略。因此该功能主要适用于机器翻译,文本续写,文本摘要,对话生成等任务。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:7 +msgid "" +"功能底层依托于 `FasterTransformer " +"`_ " +",该库专门针对Transformer系列模型及各种解码策略进行了优化。功能顶层封装于 `model.generate` " +"函数。功能的开启和关闭通过传入 `use_fast` 参数进行控制(默认为开启状态)。该功能具有如下特性:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:9 +msgid "全面支持生成式预训练模型。包括GPT、BART、mBART、UnifiedTransformer和UNIMO-text。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:10 +msgid "" +"支持大多数主流解码策略。包括Beam Search、Sampling、Greedy Search。以及Diverse Sibling " +"Search、Length Penalty等子策略。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:11 +msgid "" +"解码速度快。最高可达非加速版generate函数的 **17倍**。HuggingFace generate函数的 " +"**8倍**。**并支持FP16混合精度计算**。 详细性能试验数据请参见 `FastGeneration Performence " +"`_" +" 。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:12 +msgid "" +"易用性强。功能的入口为 `model.generate` " +",与非加速版生成api的使用方法相同,当满足加速条件时使用jit即时编译高性能算子并用于生成,不满足则自动切换回非加速版生成api。下图展示了FastGeneration的启动流程:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:17 +msgid "快速开始" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:19 +msgid "" +"为体现FastGeneration的易用性,我们在 `samples " +"`_" +" 文件夹中内置了几个典型任务示例,下面以基于GPT模型的中文文本续写任务为例:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:26 +msgid "" +"如果是第一次执行,PaddleNLP会启动即时编译( `JIT Compile " +"`_ )自动编译高性能解码算子。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:49 +msgid "编译过程通常会花费几分钟的时间但是只会进行一次,之后再次使用高性能解码不需要重新编译了。编译完成后会继续运行,可以看到生成的结果如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:56 +msgid "打开示例代码 `samples/gpt_sample.py` ,我们可以看到如下代码:" +msgstr "" + +#: 
../advanced_guide/fastgeneration/fastgeneration.rst:67 +msgid "" +"可以看到,FastGeneration的使用方法与 `model.generate()` " +"相同,只需传入输入tensor和解码相关参数即可,使用非常简便。如果要使用非加速版的 `model.generate()` 方法,只需传入 " +"`use_fast=False` 即可,示例如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:78 +msgid "" +"需要注意的是,如果传入 `model.generate()` 的参数不满足高性能版本的要求。程序会做出提示并自动切换为非加速版本,例如我们传入 " +"`min_length=1` ,会得到如下提示:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:86 +msgid "" +"关于该方法的更多参数可以参考API文档 `generate " +"`_" +" 。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:88 +msgid "" +"`samples " +"`_" +" 文件夹中的其他示例的使用方法相同。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:91 +msgid "其他示例" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:93 +msgid "" +"除了以上简单示例之外,PaddleNLP的examples中所有使用了 `model.generate()` " +"的示例都可以通过调整到合适的参数使用高性能推理。具体如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:95 +msgid "" +"`examples/dialogue/unified_transformer " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:96 +msgid "" +"`model_zoo/gpt/fast_gpt " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:97 +msgid "" +"`examples/text_generation/unimo-text " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:98 +msgid "" +"`examples/text_summarization/bart " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:100 +msgid "" +"根据提示修改对应参数即可使用FastGeneration加速生成。下面我们以基于 `Unified Transformer` " +"的任务型对话为例展示一下FastGeneration的加速效果:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:102 +msgid "打开以上链接中Unified Transformer对应的example,找到README中对应预测的脚本。稍作修改如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:122 +msgid "" +"由于这里只是展示性能,我们直接在 `model_name_or_path` 填入PaddleNLP预训练模型名称 " +"`unified_transformer-12L-cn-luge` 。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:124 +msgid "" +"可以看到,由于该任务为对话任务,我们为了防止模型生成过多安全回复(如:哈哈哈、不错等),保证生成结果具有更多的随机性,我们选择TopK-" +"sampling作为解码策略,并让k=5。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:126 +msgid "打开 `infer.py` ,可以看到我们传入的脚本参数大多都提供给了 `model.generate()` 方法:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:149 +msgid "运行脚本,输出结果如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:157 +msgid "可以看到,非加速版 `generate()` 方法的预测速度为每个step耗时1.5秒左右。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:159 +msgid "" +"下面我们在启动脚本中传入 `--faster` 参数,这会让 `generate()` 方法传入 `use_fast=True` " +",启动加速模式。同时我们需要设置 `--min_dec_len=0` " +",因为FastGeneration当前还不支持该参数。新的脚本启动参数如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:180 +msgid "再次运行脚本,输出结果如下(由于我们已经编译过高性能算子,所以这里不会重新编译):" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:189 +msgid "" +"可以看到,FastGeneration的预测速度为每个step耗时0.4秒左右,提速超过三倍。如果减少 " +"`num_return_sequences` ,可以得到更高的加速比。" +msgstr "" + +#~ msgid "" +#~ "如果是第一次执行,PaddleNLP会启动即时编译( `JIT Compile " +#~ "`_ )自动编译高性能解码算子。" +#~ msgstr "" + +#~ msgid "" +#~ "`examples/language_model/gpt/fast_gpt " +#~ "`_" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fasttransformer.po b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fasttransformer.po new file mode 100644 index 0000000000000000000000000000000000000000..4a7f27a8a45a692613e48e382172344191d04a7f --- /dev/null +++ 
b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fasttransformer.po @@ -0,0 +1,331 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:3 +msgid "Transformer高性能加速" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:7 +msgid "使用环境说明" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:9 +msgid "本项目依赖于 PaddlePaddle 2.1.0 及以上版本或适当的 develop 版本" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:10 +msgid "CMake >= 3.10" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:11 +msgid "CUDA 10.1 或 10.2(需要 PaddlePaddle 框架一致)" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:12 +msgid "gcc 版本需要与编译 PaddlePaddle 版本一致,比如使用 gcc8.2" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:13 +msgid "推荐使用 Python3" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:14 +msgid "" +"`FasterTransformer " +"`_ 使用必要的环境" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:15 +msgid "环境依赖" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:17 +msgid "attrdict" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:18 +msgid "pyyaml" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:26 +msgid "快速开始" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:28 +msgid "" +"我们实现了基于 FasterTransformer 的自定义 op 的接入,打造了 FastGeneration 的能力,用于加速文本生成模型在 GPU " +"上的预测性能。接下来,我们将分别介绍基于 Python 动态图和预测库使用 FastGeneration 自定义 op 的方式,包括 op " +"的编译与使用。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:31 +msgid "Python 动态图使用自定义 op" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:34 +msgid "JIT 自动编译" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:36 +msgid "" +"目前当基于动态图使用 FastGeneration 预测加速自定义 op 时,PaddleNLP 提供了 Just In Time " +"的自动编译,在一些 API 上,用户无需关注编译流程,可以直接执行对应的 API,程序会自动编译需要的第三方库。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:38 +msgid "" +"以 Transformer 为例,可以直接调用 `TransformerGenerator()` 这个 API,程序会自动编译。使用示例可以参考 " +"`Transformer 预测加速使用示例-sample " +"`_,`Transformer" +" 预测加速使用示例-机器翻译 " +"`_。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:41 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:154 +msgid "编译自定义OP" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:43 +msgid "除了自动编译外,如果需要自行编译,我们已经提供对应的 CMakeLists.txt,可以参考使用如下的方式完成编译。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:46 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:159 +msgid "PaddleNLP 准备" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:48 +msgid "" +"首先,如果需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 " +"PaddleNLP,并重新编译:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:50 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:163 +msgid "以下以从 github 上 clone 一个新版 PaddleNLP 
为例:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:56 +msgid "其次,配置环境变量,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:64 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:176 +msgid "编译" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:66 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:178 +msgid "编译之前,请确保安装的 PaddlePaddle 的版本高于 2.1.0 或是基于最新的 develop 分支的代码编译,并且正常可用。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:68 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:180 +msgid "编译自定义 OP 可以参照一下步骤:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:78 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:190 +msgid "可以使用的编译选项包括:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:80 +msgid "" +"`-DPY_CMD`: 指定当前装有 PaddlePaddle 版本的 python 环境,比如 " +"`-DPY_CMD=python3.7`。若未指定 `-DPY_CMD` 将会默认使用系统命令 `python` 对应的 Python。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:81 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:207 +msgid "" +"`-DSM`: 是指的所用 GPU 的 compute capability,建议不使用该选项设置,未设置时将自动检测。如要设置,需根据 " +"[compute capability](https://developer.nvidia.com/zh-cn/cuda-" +"gpus#compute) 进行设置,如 V100 时设置 `-DSM=70` 或 T4 时设置 `-DSM=75`。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:82 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:208 +msgid "" +"`-DWITH_GPT`: 是否编译带有 GPT 相关的 lib。若使用 GPT-2 高性能推理,需要加上 `-DWITH_GPT=ON`。默认为" +" OFF。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:83 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:209 +msgid "" +"`-DWITH_UNIFIED`: 是否编译带有 Unified Transformer 或是 UNIMOText 相关的 " +"lib。若使用,需要加上 `-DWITH_UNIFIED=ON`。默认为 ON。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:84 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:210 +msgid "`-DWITH_BART`: 是否编译带有 BART 支持的相关 lib。若使用,需要加上 `-DWITH_BART=ON`。默认为 ON。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:85 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:211 +msgid "`-DWITH_DECODER`: 是否编译带有 decoder 优化的 lib。默认为 ON。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:87 +msgid "" +"最终,编译会在 `./build/lib/` 路径下,产出 `libdecoding_op.so`,即需要的 FastGeneration " +"decoding 执行的库。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:90 +msgid "使用 Transformer decoding 高性能推理" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:92 +msgid "" +"编写 python 脚本的时候,调用 `FasterTransformer API " +"`_" +" 即可实现 Transformer 模型的高性能预测。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:94 +msgid "举例如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:120 +msgid "" +"若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op " +"所需的动态库,可以如前文所述进行编译。编译好后,使用 " +"`FasterTransformer(decoding_lib=\"/path/to/lib\", ...)` 可以完成导入。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:122 +msgid "" +"更详细的例子可以参考 `Transformer 预测加速使用示例-sample " +"`_,`Transformer" +" 预测加速使用示例-机器翻译 " +"`_,我们提供了更详细用例。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:125 +msgid "Transformer decoding 示例代码" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:127 +msgid "使用 PaddlePaddle 仅执行 decoding 测试(float32):" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:137 +msgid 
"" +"使用 PaddlePaddle 仅执行 decoding 测试(float16): 执行 float16 的 " +"decoding,需要在执行的时候,加上 `--use_fp16_decoding` 选项。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:148 +msgid "" +"其中,`decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 " +"`_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config " +"文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:151 +msgid "C++ 预测库使用自定义 op" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:156 +msgid "" +"在 C++ 预测库使用自定义 OP 需要将实现的 C++、CUDA 代码**以及 C++ 预测的 " +"demo**编译成一个可执行文件。因预测库支持方式与 Python 不同,这个过程将不会产生自定义 op " +"的动态库,将直接得到可执行文件。我们已经提供对应的 CMakeLists.txt ,可以参考使用如下的方式完成编译。并获取执行 demo。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:161 +msgid "" +"首先,因为需要基于当前环境重新编译,当前的 paddlenlp 的 python 包里面并不包含 FastGeneration 相关 " +"lib,需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 " +"PaddleNLP,并重新编译:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:169 +msgid "其次,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:192 +msgid "" +"`-DPADDLE_LIB`: 需要指明使用的 PaddlePaddle 预测库的路径 " +"`/path/to/paddle_inference_install_dir/`,需要使用的 PaddlePaddle 的 lib " +"可以选择自行编译或者直接从官网下载 `paddle_inference_linux_lib " +"`_。需要注意的是,在该路径下,预测库的组织结构满足:" +" .. code-block::" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:206 +msgid "" +"`-DDEMO`: 说明预测库使用 demo 的位置。比如指定 -DDEMO=./demo/transformer_e2e.cc 或是 " +"-DDEMO=./demo/gpt.cc。最好使用绝对路径,若使用相对路径,需要是相对于 " +"`PaddleNLP/paddlenlp/ops/fast_transformer/src/` 的相对路径。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:212 +msgid "`-DWITH_MKL`: 若当前是使用的 mkl 的 Paddle lib,那么需要打开 MKL 以引入 MKL 相关的依赖。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:213 +msgid "`-DON_INFER`: 是否编译 paddle inference 预测库。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:214 +msgid "**当使用预测库的自定义 op 的时候,请务必开启 `-DON_INFER=ON` 选项,否则,不会得到预测库的可执行文件。**" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:217 +msgid "执行 Transformer decoding on PaddlePaddle" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:219 +msgid "" +"编译完成后,在 `build/bin/` 路径下将会看到 `transformer_e2e` " +"的一个可执行文件。通过设置对应的设置参数完成执行的过程。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:226 +msgid "举例说明:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:235 +msgid "其中:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:237 +msgid "" +"`decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 " +"`_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config " +"文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:238 +msgid "`DATA_HOME` 则是 `paddlenlp.utils.env.DATA_HOME` 返回的路径。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:240 +msgid "" +"预测所需要的模型文件,可以通过 `fast_transformer/README.md " +"`_" +" 文档中所记述的方式导出。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/index.po b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/index.po new file mode 100644 index 0000000000000000000000000000000000000000..d0c0a3168a8c7b6341a3c1e736ae91a23c564bef --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/fastgeneration/index.rst:3 +msgid "文本生成高性能加速" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/distill_lstm.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/distill_lstm.po new file mode 100644 index 0000000000000000000000000000000000000000..a04e1157a185756ee74fa67b11aaeb7663505775 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/distill_lstm.po @@ -0,0 +1,128 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/model_compression/distill_lstm.rst:2 +msgid "由BERT到Bi-LSTM的知识蒸馏" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:6 +msgid "整体原理介绍" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:8 +msgid "" +"本例是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 `Distilling Task-Specific " +"Knowledge from BERT into Simple Neural Networks " +"`_ \\ 实现。整体原理如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:11 +msgid "在本例中,较大的模型是BERT被称为教师模型,Bi-LSTM被称为学生模型。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:13 +msgid "小模型学习大模型的知识,需要小模型学习蒸馏相关的损失函数。在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:15 +msgid "" +"在论文的模型蒸馏阶段,作者为了能让教师模型表达出更多的“暗知识”(dark " +"knowledge,通常指分类任务中低概率类别与高概率类别的关系)供学生模型学习,对训练数据进行了数据增强。通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的“暗知识”,在更大的数据集上进行训练,产生更好的蒸馏效果。本文的作者使用了三种数据增强方式,分别是:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:17 +msgid "Masking,即以一定的概率将原数据中的word token替换成 ``[MASK]`` ;" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:19 +msgid "POS—guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换;" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:21 +msgid "n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:26 +msgid "模型蒸馏步骤介绍" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:28 +msgid "" +"本实验分为三个训练过程:在特定任务上对BERT进行微调、在特定任务上对基于Bi-LSTM的小模型进行训练(用于评价蒸馏效果" +")、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:31 +msgid "1. 
基于bert-base-uncased预训练模型在特定任务上进行微调" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:33 +msgid "" +"训练BERT的fine-tuning模型,可以去 `PaddleNLP " +"`_ 中\\ 的 `glue " +"`_" +" 目录下对bert-base-uncased做微调。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:36 +msgid "" +"以GLUE的SST-2任务为例,用bert-base-" +"uncased做微调之后,可以得到一个在SST-2任务上的教师模型,可以把在dev上取得最好Accuracy的模型保存下来,用于第三步的蒸馏。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:40 +msgid "2. 训练基于Bi-LSTM的小模型" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:42 +msgid "" +"在本示例中,小模型采取的是基于双向LSTM的分类模型,网络层分别是 ``Embedding`` 、``LSTM`` 、 带有 ``tanh`` " +"激活函数的 ``Linear`` 层,最后经过\\ 一个全连接的输出层得到logits。``LSTM`` 网络层定义如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:50 +msgid "基于Bi-LSTM的小模型的 ``forward`` 函数定义如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:66 +msgid "3.数据增强介绍" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:68 +msgid "" +"接下来的蒸馏过程,蒸馏时使用的训练数据集并不只包含数据集中原有的数据,而是按照上文原理介绍中的A、C两种方法进行数据增强后的总数据。 " +"在多数情况下,``alpha`` 会被设置为0,表示无视硬标签,学生模型只利用数据增强后的无标签数据进行训练。根据教师模型提供的软标签 " +"``teacher_logits`` \\ ,对比学生模型的 ``logits`` " +",计算均方误差损失。由于数据增强过程产生了更多的数据,学生模型可以从教师模型中学到更多的暗知识。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:72 +msgid "数据增强的核心代码如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:107 +msgid "4.蒸馏模型" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:109 +msgid "" +"这一步是将教师模型BERT的知识蒸馏到基于Bi-LSTM的学生模型中,在本例中,主要是让学生模型(Bi-" +"LSTM)去学习教师模型的输出logits。\\ 蒸馏时使用的训练数据集是由上一步数据增强后的数据,核心代码如下:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/index.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/index.po new file mode 100644 index 0000000000000000000000000000000000000000..beff3fe46a772e5b129741b0eb162080c9f168bb --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/model_compression/index.rst:3 +msgid "模型压缩" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/introduction.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/introduction.po new file mode 100644 index 0000000000000000000000000000000000000000..0f2a0fb1c6053fce23acfe7854c5403a6dc6d1a0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/introduction.po @@ -0,0 +1,79 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../advanced_guide/model_compression/introduction.rst:3
+#: ../advanced_guide/model_compression/introduction.rst:11
+msgid "模型压缩简介"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:6
+msgid ""
+"近些年,基于Transformer的语言模型在机器翻译、阅读理解、文本匹配、自然语言推理等自然语言处理任务上取得了实质\\ "
+"进展。然而,海量的参数和计算资源的大量耗费,使BERT及其变体在部署中困难重重。模型压缩的发展,使得这些问题得到\\ 了缓解。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:13
+msgid ""
+"模型压缩在保证一定精度的情况下,能够降低模型的存储,加速模型的推理时间。常见的模型压缩方法主要包括模型裁剪、量化和蒸馏。\\ "
+"下面分别对这几种方法进行简要的介绍。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:17
+msgid "模型裁剪"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:18
+msgid "模型裁剪是通过对已经训练好的模型中不重要的网络连接进行裁剪,减少模型的冗余和计算量,从而减少网络存储、大幅度进行加速的模型压缩方法。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:21
+msgid "量化"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:22
+msgid ""
+"一般而言,神经网络模型的参数都是用的32bit长度的浮点型数表示。实际上,有时不需要保留那么高的精度,可以通过量化的方法减少\\ "
+"模型的存储空间,通常用INT8代替Float32存储。比如,SGD(Stochastic Gradient "
+"Descent)所需要的精度仅为6~8bit,\\ "
+"因此合理的量化网络也可在保证精度的情况下减小模型的存储体积,并且能够大幅度加速,使得神经网络在CPU上的运行成为可能。\\ "
+"通常,量化包含多种方法,例如:二值神经网络、三元权重网络以及XNOR网络。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:29
+msgid "蒸馏"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:30
+msgid ""
+"蒸馏本质是student模型(参数量较少的模型)对teacher模型(参数量较多)的拟合,student模型从teacher中学到知识,比自己单独学习效果更好。比较常见的方法通常是由Bert"
+" base蒸馏到\\ Bi-LSTM或者是Transformer层数更少的BERT小模型。例如DistilBERT,它保留了BERT-base "
+"97%的精度,\\ 减少了40%的参数,推理速度快了60%。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:36
+msgid "模型压缩示例"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:38
+msgid ""
+"下面将会对基于飞桨实现的常见的模型压缩示例进行介绍,其中《由BERT到Bi-LSTM的知识蒸馏》可以作为蒸馏实验的\"Hello "
+"World\"示例。\\ "
+"而《使用DynaBERT中的策略对BERT进行压缩》中使用的DynaBERT则是同时对不同尺寸的子网络进行训练,通过该方法训练后可以在推理阶段直接对模型裁剪。"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/ofa_bert.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/ofa_bert.po
new file mode 100644
index 0000000000000000000000000000000000000000..3ee3881eaa9144bf8dab1ada4fb3820683f2249d
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/ofa_bert.po
@@ -0,0 +1,167 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/model_compression/ofa_bert.rst:2 +msgid "使用DynaBERT中的策略对BERT进行压缩" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:4 +msgid "" +"本教程使用的是 `DynaBERT-Dynamic BERT with Adaptive Width and Depth " +"`_ 中的训练策略。\\ " +"把原始模型作为超网络中最大的子模型,这里超网络指的是包含所有搜索空间在内的一个网络。\\ 原始模型包括多个相同大小的Transformer " +"Block。在每次训练前会选择当前轮次要训练的子模型,\\ 每个子模型包含多个相同大小的Sub Transformer Block,每个Sub " +"Transformer Block是选择不同宽度的Transformer Block得到的,\\ 一个Transformer " +"Block包含一个Multi-Head Attention和一个Feed-Forward Network,Sub Transformer " +"Block获得方式为:" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:10 +msgid "" +"1. 一个 ``Multi-Head Attention`` " +"层中有多个Head,每次选择不同宽度的子模型时,会同时对Head数量进行等比例减少,\\ " +"例如:如果原始模型中有12个Head,本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer " +"Block的Head数量为9。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:13 +msgid "" +"2. ``Feed-Forward Network`` 层中 ``Linear`` 的参数大小进行等比例减少,例如:如果原始模型中 ``FFN``" +" 层的特征维度为3072,\\ 本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer Block中 " +"``FFN`` 层的特征维度为2304。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:18 +msgid "整体原理介绍" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:20 +msgid "" +"1. " +"首先对预训练模型的参数和head根据其重要性进行重排序,把重要的参数和head排在参数的前侧,保证训练过程中的参数裁剪不会裁剪掉这些重要的参数。\\" +" " +"参数的重要性计算是先使用dev数据计算一遍每个参数的梯度,然后根据梯度和参数的整体大小来计算当前参数的重要性,head的重要性计算是通过传入一个\\" +" 全1的对head的mask,并计算这个mask的梯度,根据mask的梯度来判断每个 ``Multi-Head Attention`` " +"层中每个Head的重要性。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:24 +msgid "" +"2. " +"使用原本的预训练模型作为蒸馏过程中的教师网络。同时定义一个超网络,这个超网络中最大的子网络的结构和教师网络的结构相同其他小的子网络是对最大网络\\" +" 进行不同的宽度选择来得到的,宽度选择具体指对网络中的参数进行裁剪,所有子网络在整个训练过程中都是参数共享的。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:27 +msgid "" +"使用重排序之后的预训练模型参数初始化超网络,并把这个超网络作为学生网络。分别为 ``Embedding`` 层,为每个transformer " +"block层和最后的logits添加蒸馏损失。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:29 +msgid "每个batch数据在训练前首先会选择当前要训练的子网络配置(子网络配置目前仅包括对整个模型的宽度的选择),参数更新时仅会更新当前子网络计算中用到的那部分参数。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:31 +msgid "通过以上的方式来优化整个超网络参数,训练完成后选择满足加速要求和精度要求的子模型。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:37 +msgid "整体流程" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:39 +msgid "基于PaddleSlim进行模型压缩" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:41 +msgid "在本例中,也需要训练基于特定任务的BERT模型,方法同上一篇教程《由BERT到Bi-LSTM的知识蒸馏》。下面重点介绍本例模型压缩的过程。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:44 +msgid "1. 定义初始网络" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:45 +msgid "" +"定义原始BERT-" +"base模型并定义一个字典保存原始模型参数。普通模型转换为超网络之后,由于其组网OP的改变导致原始模型加载的参数失效,所以需要定义一个字典保存原始模型的参数并用来初始化超网络。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:56 +msgid "2. 构建超网络" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:57 +msgid "定义搜索空间,并根据搜索空间把普通网络转换为超网络。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:69 +msgid "3. 
定义教师网络" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:70 +msgid "构造教师网络。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:78 +msgid "4. 配置蒸馏相关参数" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:79 +msgid "" +"需要配置的参数包括教师模型实例;需要添加蒸馏的层,在教师网络和学生网络的 ``Embedding`` 层和每一个 ``Tranformer " +"Block`` 层\\ 之间添加蒸馏损失,中间层的蒸馏损失使用默认的MSE损失函数;配置 ``lambda_distill`` " +"参数表示整体蒸馏损失的缩放比例。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:97 +msgid "5. 定义Once-For-All模型" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:98 +msgid "普通模型和蒸馏相关配置传给 ``OFA`` 接口,自动添加蒸馏过程并把超网络训练方式转为 ``OFA`` 训练方式。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:106 +msgid "6. 计算神经元和head的重要性并根据其重要性重排序参数" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:120 +msgid "7. 传入当前OFA训练所处的阶段" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:129 +msgid "8. 传入网络相关配置,开始训练" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:130 +msgid "本示例使用DynaBERT的策略进行超网络训练。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:150 +msgid "**NOTE**" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:152 +msgid "" +"由于在计算head的重要性时会利用一个mask来收集梯度,所以需要通过monkey patch的方式重新实现一下 ``BERTModel`` 类的" +" ``forward`` 函数。示例如下:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/changelog.po b/docs/locale/en/LC_MESSAGES/changelog.po new file mode 100644 index 0000000000000000000000000000000000000000..31069daadc58c0ddb51f8add953cf981acf11540 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/changelog.po @@ -0,0 +1,105 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../changelog.md:1 +msgid "v2.0.2 (2021.06.07)" +msgstr "" + +#: ../changelog.md:3 +msgid "丰富预训练模型" +msgstr "" + +#: ../changelog.md:4 +msgid "新增多粒度语言知识预训练模型ERNIE-Gram,该模型在多项中文NLP任务取得SOTA成绩。" +msgstr "" + +#: ../changelog.md:5 +msgid "新增NeZha中文预训练模型,感谢 @jm12138 的高质量贡献! 
🎉 🎉 🎉" +msgstr "" + +#: ../changelog.md:6 +msgid "新增GPT CPM-Distill中文小型化模型,感谢 @jm12138 的高质量贡献!🎉 🎉 🎉" +msgstr "" + +#: ../changelog.md:8 ../changelog.md:14 +msgid "Bug Fix" +msgstr "" + +#: ../changelog.md:9 +msgid "修复了softmax_with_crossentropy API导致的deprecated warning" +msgstr "" + +#: ../changelog.md:10 +msgid "更正了ChnSentiCorp, LCQMC等数据集的官方下载链接。" +msgstr "" + +#: ../changelog.md:12 +msgid "v2.0.1 (2021.05.21)" +msgstr "" + +#: ../changelog.md:15 +msgid "修复Windows CPU环境下的import产生的CUDA_HOME检测问题。" +msgstr "" + +#: ../changelog.md:17 +msgid "v2.0.0 (2021.05.20)" +msgstr "" + +#: ../changelog.md:19 +msgid "" +"PaddleNLP " +"2.0是飞桨生态的文本领域核心库,具备易用的文本领域API,多场景的应用示例、和高性能分布式训练三大特点,旨在提升飞桨开发者文本领域建模效率,并提供基于飞桨框架2.0的NLP领域最佳实践。" +msgstr "" + +#: ../changelog.md:21 +msgid "特性" +msgstr "" + +#: ../changelog.md:23 +msgid "易用的文本领域API" +msgstr "" + +#: ../changelog.md:24 +msgid "" +"提供从数据集加载、文本预处理、组网建模、评估、到推的领域API:如一键加载丰富中文数据集的Dataset API, " +"可灵活高效的进行数据与处理的Data API,预置60+预训练词向量的Embedding API, " +"内置50+预训练模型,提供预训练模型生态基础设施的Transformer " +"API等,可大幅提升NLP任务建模和迭代的效率。更多API详细说明请查看PaddleNLP官方文档" +msgstr "" + +#: ../changelog.md:27 +msgid "多场景的应用示例" +msgstr "" + +#: ../changelog.md:28 +msgid "" +"PaddleNLP " +"2.0提供多粒度多场景的应用示例,涵盖从NLP基础技术、NLP核心技术、NLP系统应用以及文本相关的拓展应用等。全面基于飞桨2.0全新API体系开发,为开发提供飞桨2.0框架在文本领域的最佳实践。" +msgstr "" + +#: ../changelog.md:31 +msgid "高性能分布式训练" +msgstr "" + +#: ../changelog.md:32 +msgid "" +"基于飞桨核心框架『动静统一』的特性与领先的自动混合精度优化策略,通过分布式Fleet " +"API,支持超大规模参数的4D混合并行策略,并且可根据硬件情况灵活可配,高效地完成超大规模参数的模型训练。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_datasets/how_to_write_a_DatasetBuilder.po b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/how_to_write_a_DatasetBuilder.po new file mode 100644 index 0000000000000000000000000000000000000000..775c576d3241fbc9d8b1988df6e726dcbd24a71b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/how_to_write_a_DatasetBuilder.po @@ -0,0 +1,160 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:3
+msgid "创建 :class:`DatasetBuilder`"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:5
+msgid ""
+"数据集的贡献通过定义一个 :class:`DatasetBuilder` 的子类来实现。一个合格的 :class:`DatasetBuilder`"
+" 需要遵循一些协议和规范。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:7
+msgid "下面我们以 :obj:`LCQMC` 为例了解一下 :class:`DatasetBuilder` 通常需要包含哪些方法和参数。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:10
+msgid "成员变量"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:40
+msgid ""
+"首先贡献的数据集需要继承 :class:`paddlenlp.datasets.DatasetBuilder` 类,类名格式为camel "
+"case。之后应该添加一段注释,简要说明数据集的来源等信息。之后需定义以下成员变量:"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:42
+msgid ""
+":attr:`lazy` :数据集的默认类型。:obj:`False` 对应 :class:`MapDataset` ,:obj:`True` "
+"对应 :class:`IterDataset` 。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:43
+msgid ":attr:`URL` :数据集压缩包下载地址,需提供有效并稳定的下载链接。如果数据集不是压缩包,可以不在这里提供。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:44
+msgid ":attr:`MD5` :数据集压缩包的md5值,用于文件校验,如果数据集文件不是压缩包,可以不在这里提供。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:45
+msgid ":attr:`META_INFO` :数据集split信息格式。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:46
+msgid ""
+":attr:`SPLITS` "
+":数据集的split信息,包含数据集解压后的不同文件的具体位置,文件名,md5值等,如果数据集不是压缩包则通常在这里提供下载地址,还可以包含诸如不同文件对应的文件读取参数等信息。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:48
+msgid ""
+"除此之外,不同的数据集可能还需要诸如 :attr:`VOCAB_INFO` 等其他成员变量(参见 `iwslt15.py "
+"`__"
+" )。或者成员变量会有其他格式。贡献者可以根据实际情况自行调整。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:52
+msgid ""
+"如果贡献的数据集没有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`SPLITS` "
+"成员变量,且该变量必须是一个字典,字典的key是该数据集包含的splits。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:53
+msgid ""
+"如果贡献的数据集有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`BUILDER_CONFIGS`"
+" 成员变量,且该变量必须是一个字典,字典的key是该数据集包含的子数据集的 :attr:`name` "
+"。字典的value是包含该数据集的子数据集split信息的字典,key值必须是 `splits` 。具体格式(参见 `glue.py "
+"`__"
+" )"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:56
+msgid ":func:`_get_data` 方法"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:71
+msgid ""
+":func:`_get_data` 方法根据传入的 :attr:`mode` "
+"和数据集的split信息定位到具体数据集文件。首先进行md5值校验本地文件,若校验失败则调用 "
+":func:`paddle.utils.download.get_path_from_url` "
+"方法下载并校验数据集文件,最后返回数据集文件的本地地址。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:74
+msgid ":func:`_read` 方法"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:90
+msgid ""
+":func:`_read` 方法根据传入的文件地址读取数据。该方法必须是一个生成器,以确保 :class:`DatasetBuilder` "
+"可以构造 :class:`MapDataset` 和 :class:`IterDataset` 两种数据集。 "
+"当不同split对应的数据文件读取方式不同时,该方法还需要支持 :attr:`split` 参数,并支持不同split下的读取方式。"
+msgstr ""
+ +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:95 +msgid "该方法提供的每条example都应是一个 :class:`Dictionary` 对象。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:96 +msgid "" +":class:`DatasetBuilder` 在生成Dataset时提供了将class " +"label转换为id的功能。如果用户需要此功能,需要将example中label对应的key设置为 **\"label\"** 或 " +"**\"labels\"** ,并在类中正确添加 :func:`get_labels` 方法。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:99 +msgid ":func:`get_labels` 方法" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:109 +msgid "" +":func:`get_labels` 方法返回一个由该数据集中所有label组成的list。用于将数据集中的class " +"label转换为id,并且这个list之后会作为实例变量传给生成的数据集。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:112 +msgid ":func:`get_vocab` 方法" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:114 +msgid "如果数据集提供词典文件,则需要加入 :func:`get_vocab` 方法和 :attr:`VOCAB_INFO` 变量。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:116 +msgid "" +"该方法会根据 :attr:`VOCAB_INFO` 变量返回一个包含数据集词典信息的 :class:`Dictionary` " +"对象并作为实例变量传给生成的数据集。用于在训练过程中初始化 :class:`paddlenlp.data.Vocab` 对象。 该方法的写法请参考" +" `iwslt15.py " +"`__" +" 。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:121 +msgid "" +"贡献数据集时 :func:`get_labels` 和 :func:`get_vocab` 方法是可选的,视具体数据集内容而定。 " +":func:`_read` 和 :func:`_get_data` 方法是 **必须包含** 的。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:122 +msgid "如果您不希望在数据获取过程中进行md5值校验,可以不用给出相关成员变量和校验代码。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_datasets/index.po b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/index.po new file mode 100644 index 0000000000000000000000000000000000000000..5124d3a4729456cf6af0f0d0b2c5d1f84f18eec0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_datasets/index.rst:3 +msgid "如何贡献数据集" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_datasets/sharing_dataset.po b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/sharing_dataset.po new file mode 100644 index 0000000000000000000000000000000000000000..67b63325e2d452aea6b0c70cd68fbe8aee6fd9ba --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/sharing_dataset.po @@ -0,0 +1,169 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_datasets/sharing_dataset.rst:3 +msgid "分享你的数据集" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:5 +msgid "除了使用PaddleNLP内置的数据集以外,我们也鼓励用户向PaddleNLP贡献自己的数据集。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:7 +msgid "下面我们来介绍一下贡献数据集的详细流程:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:10 +msgid "配置环境" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:12 +msgid "编写和测试PaddleNLP代码需要依赖python3.6以上版本以及最新版本的PaddlePaddle。请确保正确安装以上依赖。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:13 +msgid "在PaddleNLP的github页面上点击Fork按钮,在自己的github中创建一份PaddleNLP repo的副本。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:14 +msgid "将您frok的内容下载到本地,并将官方repo作为remote。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:22 +msgid "安装pre-commit钩子,它可以帮助我们格式化源代码,再提交前自动检查代码问题。不满足钩子的PR **不能** 被提交到PaddleNLP。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:30 +msgid "添加一个 :class:`DatasetBuilder`" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:32 +msgid "创建一个新的本地分支,一般从develop 分支上创建新分支。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:38 +msgid "" +"找到您本地repo下的 `PaddleNLP/paddlenlp/datasets/` " +"路径,PaddleNLP的所有数据集代码都储存在这个文件夹下。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:44 +msgid "为您的数据集确定一个 `name`,例如 `squad` , `chnsenticorp` 等,这个 `name` 就是您的数据集被读取时的名称。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:48 +msgid "为了方便别人使用您的数据集,确保这个 `name` **不会太长而且能够正确的表义**。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:49 +msgid "数据集的 `name` 格式应为snake case。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:51 +msgid "" +"在该路径下创建python文件,文件名是数据集的 `name`,例如 `squad.py` 。并在这个文件中编写数据集的 " +":class:`DatasetBuilder` 代码。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:53 +msgid "" +":class:`DatasetBuilder` 的编写可以参考教程 :doc:`如何创建一个DatasetBuilder " +"<./how_to_write_a_DatasetBuilder>` 。里面给出了详细的步骤和规范。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:55 +msgid "" +"我们也推荐您参考已有数据集的 :class:`DatasetBuilder` " +"进行创建,从已有代码copy一些共用部分可能对您编写自己的数据集代码有所帮助,下面是一些已有数据集的示例:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:57 +msgid "" +"`iwslt15.py " +"`__" +" 翻译数据集,包含词表文件。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:58 +msgid "" +"`glue.py " +"`__" +" glue数据集,包含多个子数据集,文件格式为tsv。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:59 +msgid "" +"`squad.py " +"`__" +" 阅读理解数据集,文件格式为json。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:60 +msgid "" +"`imdb.py " +"`__" +" imdb数据集,每个split包含多个文件。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:61 +msgid "" +"`ptb.py " +"`__" +" 语料库数据集。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:62 +msgid "" +"`msra_ner.py " +"`__" +" 序列标注数据集。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:64 +msgid "" +"开发完成后,可以使用 
:attr:`load_dataset` 测试您创建的数据集中的split能否正确被识别。也可以使用 " +":attr:`print` 看看数据集读入的格式是否符合您的预期:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:74 +msgid "提交您的成果" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:76 +msgid "当您认为数据集的代码已经ready后,就可以在本地commit您的修改了:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:83 +msgid "在提交修改之前,最好获取获取先upstream的最新代码并更新当前分支。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:90 +msgid "将本地的修改推送到GitHub上,并在GitHub上向PaddleNLP提交Pull Request。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:96 +msgid "" +"以上就是像PaddleNLP贡献数据集的完整流程了。我们看到您的PR后会尽快review,如果有任何问题都会尽快反馈给您。如果没有问题的话我们就会合入到PaddleNLP" +" repo,您贡献的数据集就可以供其他人使用啦。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:98 +msgid "如果您对贡献数据集还有任何疑问,欢迎加入官方QQ技术交流群: 973379845向我们提出。我们会尽快为您解答。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_docs.po b/docs/locale/en/LC_MESSAGES/community/contribute_docs.po new file mode 100644 index 0000000000000000000000000000000000000000..1ed304c34f68876424ff35e625ac25fd4eb9cca1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_docs.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_docs.rst:3 +msgid "如何贡献问答、案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_awesome_pretrained_models.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_awesome_pretrained_models.po new file mode 100644 index 0000000000000000000000000000000000000000..1aedaf36c857e0c7df137d27b9cd9ed0956ed69f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_awesome_pretrained_models.po @@ -0,0 +1,178 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:3 +msgid "贡献预训练模型权重" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:6 +msgid "1. 
模型网络结构类型" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:7 +msgid "" +"PaddleNLP目前已支持绝大多数主流的预训练模型网络结构,既包括百度自研的预训练模型(如ERNIE系列), " +"也涵盖业界主流的预训练模型(如BERT,ALBERT,GPT,RoBERTa,XLNet等)。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:10 +msgid "" +"PaddleNLP目前支持的预训练模型结构类型汇总可见 `Transformer预训练模型汇总 " +"`_" +" (持续增加中,也非常欢迎进行新模型贡献:`如何贡献新模型 " +"`_" +" )。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:15 +msgid "2. 模型参数权重类型" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:16 +msgid "非常欢迎大家贡献优质模型参数权重。 参数权重类型包括但不限于(以BERT模型网络为例):" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:19 +msgid "" +"PaddleNLP还未收录的BERT预训练模型参数权重 (如 `bert-base-japanese-char " +"`_ ,`danish-" +"bert-botxo `_ 等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:21 +msgid "" +"BERT模型在其他垂类领域(如数学,金融,法律,医学等)的预训练模型参数权重 (如 `MathBERT " +"`_ ,`finbert " +"`_ 等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:23 +msgid "" +"基于BERT在下游具体任务进行fine-tuning后的模型参数权重 (如 `bert-base-multilingual-uncased-" +"sentiment `_ , `bert-base-NER `_ 等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:26 +msgid "其他模型参数权重(任何你觉得有价值的模型参数权重);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:29 +msgid "3. 参数权重格式转换" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:30 +msgid "" +"当我们想要贡献github上开源的某模型权重时,但是发现该权重保存为其他的深度学习框架(PyTorch,TensorFlow等)的格式, " +"这就需要我们进行不同深度学习框架间的模型格式转换,下面的链接给出了一份详细的关于Pytorch到Paddle模型格式转换的教程: " +"`Pytorch到Paddle模型格式转换文档 <./convert_pytorch_to_paddle.rst>`_ 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:35 +msgid "4. 
进行贡献" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:37 +msgid "4.1 准备权重相关文件" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:38 +msgid "" +"一般来说,我们需要准备 **model_state.pdparams** " +",**vocab.txt**,**tokenizer_config.json** 以及 **model_config.json** " +"这四个文件进行参数权重贡献。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:41 +msgid "model_state.pdparams 文件可以通过上述的参数权重格式转换过程得到;" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:42 +msgid "" +"vocab.txt " +"文件可以直接使用原始模型对应的vocab文件(根据模型对应tokenizer类型的不同,该文件名可能为spiece.model等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:43 +msgid "" +"model_config.json 文件可以参考对应 model.save_pretrained() " +"接口保存的model_config.json文件;" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:44 +msgid "" +"tokenizer_config.json 文件可以参考对应 tokenizer.save_pretrained() " +"接口保存的model_config.json文件;" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:47 +msgid "4.2 创建个人目录" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:48 +msgid "" +"如果你是首次进行权重贡献,那么你需要在 ``PaddleNLP/community/`` 下新建一个目录。 " +"目录名称使用你的github名称,比如新建目录 ``PaddleNLP/community/yingyibiao/`` 。 " +"如果已有个人目录,则可以跳过此步骤。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:53 +msgid "4.3 创建权重目录" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:54 +msgid "" +"在步骤4.2的个人目录下新建一个权重目录,权重目录名为本次贡献的模型权重名称。 比如我想贡献 ``bert-base-uncased-" +"sst-2-finetuned`` 这个模型, 则新建权重目录 ``PaddleNLP/community/yingyibiao/bert-" +"base-uncased-sst-2-finetuned/`` 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:59 +msgid "4.4 在权重目录下添加PR(pull request)相关文件" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:60 +msgid "在步骤4.3的目录下加入两个文件,分别为 ``README.md`` 和 ``files.json`` 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:62 +msgid "``README.md`` 是对你贡献的权重的详细介绍,使用示例,权重来源等。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:63 +msgid "" +"``files.json`` 为步骤4.1所得的权重相关文件以及对应地址。files.json文件内容示例如下,只需将地址中的 " +"*yingyibiao* 和 *bert-base-uncased-sst-2-finetuned* 分别更改为你的github用户名和权重名称。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:76 +msgid "4.5 在github上提PR进行贡献" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:77 +msgid "" +"第一次进行开源贡献的同学可以参考 `first-contributions " +"`_ 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:78 +msgid "模型权重贡献PR示例请参考 `bert-base-uncased-sst-2-finetuned PR <.>`_ 。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_new_models.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_new_models.po new file mode 100644 index 0000000000000000000000000000000000000000..f33257f2724cc05864ebdc58693759113d514af4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_new_models.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/contribute_new_models.rst:3 +msgid "贡献新模型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/convert_pytorch_to_paddle.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/convert_pytorch_to_paddle.po new file mode 100644 index 0000000000000000000000000000000000000000..b3a66ac3773b97a737f7806dc1adf86bc634ea61 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/convert_pytorch_to_paddle.po @@ -0,0 +1,725 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:3 +msgid "模型格式转换" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:6 +msgid "0. 前言" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:7 +msgid "本文将介绍如何进行不同框架下的模型权重转换(以模型权重从PyTorch框架到Paddle框架的格式转换为例)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:9 +msgid "模型格式转换的过程需要用户对模型结构有一个较详细的了解,成功完成模型格式转换也会有助于加深用户对该模型结构的理解。 让我们开始这个有趣的过程吧!" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:13 +msgid "1. 模型权重文件概述" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:14 +msgid "" +"不管在什么框架下,当我们保存训练好的模型时,我们都需要将模型的参数权重持久化保存下来; " +"当我们加载一个保存好的模型时,我们都需要将参数权重加载并重新赋值给相应的模型。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:17 +msgid "" +"PyTorch和Paddle都是通过序列化和反序列化模型的 ``state dict`` (状态字典)来进行参数权重的存储和加载的。 " +"``state dict`` 从数据结构上来看就是一个字典(比如Python中的dict), " +"其中key是模型参数的名称(数据类型为string),而value则为key所对应的值(数据类型为Tensor)。 参数存储时,先获取目标对象的 " +"``state dict`` ,然后将 ``state dict`` 存储至磁盘; 参数载入时,先从磁盘载入保存的 ``state dict`` " +",然后通过 ``set_state_dict()`` 方法配置到目标对象中。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:23 +msgid "" +"按照约定俗成的命名规则,Paddle框架保存的模型文件名一般后缀为 `'.pdparams'` , PyTorch框架保存的模型文件名一般后缀为 " +"`'.pt'` 、 `'.pth'` 或者 `'.bin'` 。 虽然后缀并不影响模型的保存和加载,但我们一般都会遵循这个命名规范。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:28 +msgid "2. 
模型的 ``state dict`` 概述" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:29 +msgid "" +"刚刚我们简单介绍了一下模型文件和其中存储的 ``state dict`` , 下面让我们来看一个具体的例子来对 ``state dict`` " +"有更进一步的了解。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:32 +msgid "" +"``LeNet`` 是由Yann LeCun等人在1998年提出的一个CNN网络模型,并且成功应用于手写数字识别系统。 Paddle集成了 " +"``LeNet`` 这个简单的模型,我们可以一键进行模型加载, 下面的代码实现了该模型的加载和对应 ``state dict`` 的输出:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:58 +msgid "" +"我们可以通过 ``model.state_dict().keys()`` 来获取模型的所有参数名称。 可以看到 ``LeNet`` " +"一共有10组参数,分别为:*'features.0.weight'*、*'features.0.bias'*、*'features.3.weight'*" +" " +"、*'features.3.bias'*、*'fc.0.weight'*、*'fc.0.bias'*、*'fc.1.weight'*、*'fc.1.bias'*、*'fc.2.weight'*" +" 和 *'fc.2.bias'*。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:62 +msgid "" +"通过查询 ``model.state_dict()['features.0.weight']`` 可以查看 " +"**'features.0.weight'** 这个参数的具体权重数值。 上述输出显示该权重是一个dtype=float32,shape=[6, " +"1, 3, 3]的Tensor。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:66 +msgid "3. 利用 ``state dict`` 进行权重格式转换" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:67 +msgid "" +"了解了模型的存储和加载以及相关的 ``state dict`` 之后,我们来看一下模型格式的转换的具体步骤。 一般来说,我们可以通过 " +"``state dict`` 的相互转换来帮助我们进行模型格式的转换。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:70 +msgid "以从PyTorch框架到Paddle框架的模型权重转换为例,转换的具体流程为:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:72 +msgid "加载PyTorch模型得到 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:73 +msgid "PyTorch下的 ``state dict`` 转换为Paddle下的 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:74 +msgid "保存Paddle下的 ``state dict`` 得到Paddle模型。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:76 +msgid "" +"下面我们来看一个具体的例子:``'bert-base-uncased'`` 是一个谷歌开源的12层的bert英文模型。 " +"PaddleNLP(Paddle框架)和HuggingFace的transformers(PyTorch框架)里都集成了这个模型, " +"两者参数量和具体参数数值是完全一致的。我们可以来加载对比这两个模型的 ``state dict`` 来了解转换的细节。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:81 +msgid "3.1 PyTorch框架下的 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:82 +msgid "首先加载transformers下的 ``'bert-base-uncased'`` 模型," +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:123 +msgid "" +"\\**odict_keys**(ordered_dict keys)所显示的是PyTorch模型文件所对应的 ``state dict`` " +"的keys: 我们仔细观察一下可以发现参数可以分成几大模块:**embeddings** 模块, **encoder_layers** 模块, " +"**pooler** 模块和 **cls** 模块。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:127 +msgid "我们可以结合bert的具体结构来解读一下各个模块:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:129 +msgid "**embeddings** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:131 +msgid "" +"*'bert.embeddings'* 开头的各个参数是embeddings模块的参数, " +"包括word_embeddings矩阵,position_embeddings矩阵,token_type_embeddings矩阵以及embeddings模块的LayerNorm层参数等。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:133 +msgid "**encoder_layers** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:135 +msgid "" +"*'bert.encoder.layer'*开头的各个参数是各encoder层的参数, 可以看到 ``'bert-base-uncased'`` " +"模型一共有12层encoder(编号0-11),每一层encoder的结构都相同。 每一层encoder主要由一个*self-" 
+"attention*模块和一个*feed-forward*模块构成。 " +"我们具体来看一下第1层encoder的参数(编号为0,'bert.encoder.layer.0'开头的参数):" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:140 +msgid "首先是*self-attention*模块:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:142 +msgid "" +"*'attention.self.query'*,*'attention.self.key'* 和 " +"*'attention.self.value'* 分别代表self-attention结构里面的query矩阵,key矩阵和value矩阵。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:144 +msgid "*'attention.output.dense'* 是self-attention结构的线性层。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:145 +msgid "*'attention.output.LayerNorm'* 则是self-attention结构后的LayerNorm层。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:147 +msgid "" +"接下来是*feed-forward*模块,对应 'intermediate.dense' 和 'output.dense' 开头的参数 " +"。*feed-forward*之后还有一个*LayerNorm*层,对应的是 'output.LayerNorm' 开头的参数。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:149 +msgid "**pooler** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:151 +msgid "pooler模块在最后一层encoder之后,是我们对最后一层encoder输出的池化操作," +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:152 +msgid "**cls** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:154 +msgid "" +"cls模块是我们计算mlm(masked language model)和next sentence prediction(nsp)任务的结构。 " +"'cls.predictions'开头的参数是我们做mlm任务时的参数,'cls.seq_relationship'开头的参数是我们做nsp预测任务时的参数。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:158 +msgid "3.2 Paddle框架下的 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:159 +msgid "相信到现在,我们已经对bert这个模型的结构以及相应的具体参数有了更进一步的了解。 接下来我们来加载PaddleNLP下的模型:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:199 +msgid "" +"Paddle模型的 ``state dict`` 是通过一个dict来进行存储,可以看到,两者的 ``state dict`` 是十分相似的。 " +"我们对比一下两者:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:202 +msgid "" +"两者的存储是相似的,PyTorch里使用的是python中的ordered_dict来存储模型的参数状态, " +"在Paddle中则使用的是python中的dict来来进行存储。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:204 +msgid "" +"两者的结构也是相似的,都可以分成embeddings,encoder_layer, pooler, cls等 " +"模块(当然这也很直观,毕竟两者的模型结构和模型参数是完全一致的)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:206 +msgid "同时两者也存在一些区别,两者的 ``state dict`` 的keys有一些细微的差异,这是由于模型代码的具体实现的参数命名差异所造成的。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:209 +msgid "3.3 PyTorch和Paddle的 ``state dict`` 对比" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:210 +msgid "" +"我们接下来对上述两个 ``state dict`` 的参数名称以及对应权重来做一一对应。 下面的表格是整理好的 ``state_dict`` " +"对应关系表格(同一行代表着相对应的参数):" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Keys (PyTorch)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Shape (PyTorch)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Keys (Paddle)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Shape (Paddle)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:216 +msgid "bert.embeddings.word_embeddings.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:216 +#: 
../community/contribute_models/convert_pytorch_to_paddle.rst:272 +msgid "[30522, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:218 +msgid "bert.embeddings.position_embeddings.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:218 +msgid "[512, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:220 +msgid "bert.embeddings.token_type_embeddings.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:220 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:274 +msgid "[2, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:222 +msgid "bert.embeddings.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:222 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:224 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:228 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:232 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:236 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:240 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:242 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:244 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:252 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:254 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:256 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:260 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:266 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:268 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:270 +msgid "[768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:222 +msgid "bert.embeddings.layer_norm.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:224 +msgid "bert.embeddings.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:224 +msgid "bert.embeddings.layer_norm.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:226 +msgid "bert.encoder.layer.0.attention.self.query.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:226 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:230 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:234 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:238 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:258 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:264 +msgid "[768, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:226 +msgid "bert.encoder.layers.0.self_attn.q_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:228 +msgid "bert.encoder.layer.0.attention.self.query.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:228 +msgid "bert.encoder.layers.0.self_attn.q_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:230 +msgid "bert.encoder.layer.0.attention.self.key.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:230 +msgid "bert.encoder.layers.0.self_attn.k_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:232 +msgid 
"bert.encoder.layer.0.attention.self.key.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:232 +msgid "bert.encoder.layers.0.self_attn.k_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:234 +msgid "bert.encoder.layer.0.attention.self.value.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:234 +msgid "bert.encoder.layers.0.self_attn.v_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:236 +msgid "bert.encoder.layer.0.attention.self.value.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:236 +msgid "bert.encoder.layers.0.self_attn.v_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:238 +msgid "bert.encoder.layer.0.attention.output.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:238 +msgid "bert.encoder.layers.0.self_attn.out_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:240 +msgid "bert.encoder.layer.0.attention.output.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:240 +msgid "bert.encoder.layers.0.self_attn.out_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:242 +msgid "bert.encoder.layer.0.attention.output.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:242 +msgid "bert.encoder.layers.0.norm1.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:244 +msgid "bert.encoder.layer.0.attention.output.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:244 +msgid "bert.encoder.layers.0.norm1.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +msgid "bert.encoder.layer.0.intermediate.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "[3072, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +msgid "bert.encoder.layers.0.linear1.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "[768, 3072]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:248 +msgid "bert.encoder.layer.0.intermediate.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:248 +msgid "[3072]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:248 +msgid "bert.encoder.layers.0.linear1.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "bert.encoder.layer.0.output.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "bert.encoder.layers.0.linear2.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:252 +msgid "bert.encoder.layer.0.output.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:252 +msgid "bert.encoder.layers.0.linear2.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:254 +msgid "bert.encoder.layer.0.output.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:254 +msgid 
"bert.encoder.layers.0.norm2.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:256 +msgid "bert.encoder.layer.0.output.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:256 +msgid "bert.encoder.layers.0.norm2.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:258 +msgid "bert.pooler.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:260 +msgid "bert.pooler.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:262 +msgid "cls.predictions.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:262 +msgid "[30522]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:262 +msgid "cls.predictions.decoder_bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:264 +msgid "cls.predictions.transform.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:264 +msgid "cls.predictions.transform.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:266 +msgid "cls.predictions.transform.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:266 +msgid "cls.predictions.transform.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:268 +msgid "cls.predictions.transform.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:268 +msgid "cls.predictions.layer_norm.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:270 +msgid "cls.predictions.transform.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:270 +msgid "cls.predictions.layer_norm.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:272 +msgid "cls.predictions.decoder.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:272 +msgid "cls.predictions.decoder_weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:274 +msgid "cls.seq_relationship.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:274 +msgid "[768, 2]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:276 +msgid "cls.seq_relationship.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:276 +msgid "[2]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:279 +msgid "正确地对应好 ``state dict`` 的参数以及权重有助于我们正确地进行 ``state dict`` 的转换。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:281 +msgid "我们从参数名称上能看出基本的一个对应关系,例如:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:283 +msgid "`bert.embeddings.LayerNorm.gamma` 对应 `bert.embeddings.layer_norm.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:284 +msgid "`bert.embeddings.LayerNorm.beta` 对应 `bert.embeddings.layer_norm.bias` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:285 +msgid "" +"`bert.encoder.layer.0.attention.self.query.weight` 对应 " +"`bert.encoder.layers.0.self_attn.q_proj.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:286 +msgid "" +"`bert.encoder.layer.0.attention.self.query.bias` 对应 " +"`bert.encoder.layers.0.self_attn.q_proj.bias`。" +msgstr 
"" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:288 +msgid "两者的顺序是基本一致的,但也有一些例外,例如:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:290 +msgid "" +"`bert.encoder.layers.0.norm1.weight` 对应 " +"`bert.encoder.layer.0.attention.output.LayerNorm.gamma` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:291 +msgid "" +"`bert.encoder.layers.0.norm1.bias` 对应 " +"`bert.encoder.layer.0.attention.output.LayerNorm.beta` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:292 +msgid "" +"`bert.encoder.layer.0.intermediate.dense.weight` 对应 " +"`bert.encoder.layers.0.linear1.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:293 +msgid "" +"`bert.encoder.layer.0.output.dense.weight` 对应 " +"`bert.encoder.layers.0.linear2.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:294 +msgid "" +"`bert.encoder.layer.0.output.LayerNorm.gamma` 对应 " +"`bert.encoder.layers.0.norm2.weight`。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:296 +msgid "" +"正确的参数对应关系可能需要我们阅读具体的代码进行判断。 " +"在上面的表格中我们已经将两者的keys准确地一一对应了。建立好了keys的对应关系之后,我们可以进行values的对应。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:299 +msgid "" +"如果你仔细观察表格,会发现有些参数对应的values形状存在差异。 比如 " +"``bert.encoder.layer.0.intermediate.dense.weight`` 和 " +"``bert.encoder.layers.0.linear1.weight`` " +"这两个keys是相对应的一组参数名,但是他们的values形状却不相同;前者是 ``[3072, 768]`` , 后者是 ``[768, " +"3072]`` ,两者刚好是一个转置的关系。这是因为PyTorch对于 ``nn.Linear`` " +"模块的保存是将权重的shape进行转置后保存的。 所以在我们进行 ``state dict`` " +"转换的时候,需要注意做好shape的转换(例如将PyTorch模型里 nn.Linear层对应的参数权重转置处理后生成Paddle的参数权重)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:306 +msgid "另外还需要注意其他一些细节,这里列出来几个可能会遇到的情景以供参考:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:308 +msgid "" +"有些模型结构可能在实现时对参数的处理有差异导致存在参数的拆分或者合并等操作, " +"此时我们需要进行参数多对一或者一对多的映射,同时将对应的values拆分或者合并。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:310 +msgid "还有存在batch norm层时,我们需要注意todo。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:313 +msgid "3.4 bert模型转换代码" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:314 +msgid "下一步就是进行最关键的模型转换环节。 这一步十分关键,正确地进行 ``state dict`` 的转换才能确保我们通过精度验证。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:317 +msgid "下面是进行模型转换的代码(PyTorch转换为Paddle):" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:376 +msgid "" +"我们来看一下这份转换代码: 我们需要下载好待转换的PyTorch模型,并加载模型得到**torch_state_dict** " +";**paddle_state_dict** 和 **paddle_model_path** 则定义了转换后的 ``state dict`` " +"和模型文件路径; 代码中 **keys_dict** 定义了两者keys的映射关系(可以通过上面的表格对比得到)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:381 +msgid "" +"下一步就是最关键的 *paddle_state_dict* 的构建,我们对 *torch_state_dict* 里的每一个key都进行映射, " +"得到对应的 *paddle_state_dict* 的key。获取 *paddle_state_dict* 的key之后我们需要 对 " +"*torch_state_dict* 的value进行转换,如果key对应的结构是 ``nn.Linear`` 模块的话, " +"我们还需要进行value的transpose操作。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:386 +msgid "" +"最后我们保存得到的 *paddle_state_dict* 就能得到对应的Paddle模型。 " +"至此我们已经完成了模型的转换工作,得到了Paddle框架下的模型 ``\"model_state.pdparams\"`` 。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:390 +msgid "4. 
模型权重验证" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:391 +msgid "" +"得到了模型权重后,我们还需要进行精度的对齐来验证我们上述转换的正确性。 我们可以通过前向推理和下游任务fine-" +"tuning这两个任务进行精度对齐验证。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:395 +msgid "4.1 对齐前向精度" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:396 +msgid "" +"前向精度的对齐十分简单,我们只需要保证两者输入是一致的前提下,观察得到的输出是否一致。 " +"这里有几个注意事项,我们运行推理时需要打开eval模式,设置dropout为0等操作去除随机性造成的影响。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:399 +msgid "" +"除了得到的模型权重文件,我们还需要准备模型配置文件。将模型权重文件(model_state.pdparams)和模型配置文件(model_config.json)" +" 这两个文件放在同一个路径下,我们就可以进行模型前向精度的对齐验证,下面提供了bert模型对齐前向精度的代码示例:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:453 +msgid "代码最后会打印模型输出矩阵的每个元素最大差值,根据这个差值可以判定我们是否对齐了前向精度。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:456 +msgid "4.2 下游任务fine-tuning验证(可选)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:457 +msgid "" +"当我们对齐前向精度时,一般来说我们的模型转换就已经成功了。我们还可以运行下游任务fine-tuning进行double check。 " +"同样的,我们需要设置相同的训练数据,相同的训练参数,相同的训练环境进行fine-tuning来对比两者的收敛性以及收敛指标。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:461 +msgid "5. 写在最后" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:462 +msgid "恭喜你成功完成了模型权重的格式转换工作!欢迎向PaddleNLP提PR共享你的模型, 这样每一个使用PaddleNLP的用户都能使用你共享的模型哦~" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/index.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/index.po new file mode 100644 index 0000000000000000000000000000000000000000..774255fc5db06b285bfa429f0fc3719754aa15d1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/index.rst:3 +msgid "如何贡献模型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/join_in_PaddleNLP-SIG.po b/docs/locale/en/LC_MESSAGES/community/join_in_PaddleNLP-SIG.po new file mode 100644 index 0000000000000000000000000000000000000000..746ad42a6dcb8b40e4771b884c47c7dbfdf19534 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/join_in_PaddleNLP-SIG.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/join_in_PaddleNLP-SIG.rst:3 +msgid "如何加入PaddleNLP-SIG" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data.po b/docs/locale/en/LC_MESSAGES/data.po new file mode 100644 index 0000000000000000000000000000000000000000..1aa73e16fba0acb2d85e25dc186a33a9fd635429 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data.po @@ -0,0 +1,125 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data.md:1 +msgid "PaddleNLP Data API" +msgstr "" + +#: ../data.md:3 +msgid "该模块提供了在NLP任务中构建有效的数据处理Pipeline的常用API。" +msgstr "" + +#: ../data.md:5 +msgid "APIl列表" +msgstr "" + +#: ../data.md:46 +msgid "API使用方法" +msgstr "" + +#: ../data.md:48 +msgid "以上API都是用来辅助构建DataLoader,DataLoader比较重要的三个初始化参数是dataset、batch_sampler和collate_fn。" +msgstr "" + +#: ../data.md:50 +msgid "paddlenlp.data.Vocab和paddlenlp.data.JiebaTokenizer用在构建dataset时处理文本token到ID的映射。" +msgstr "" + +#: ../data.md:52 +msgid "paddlenlp.data.SamplerHelper用于构建可迭代的batch_sampler。" +msgstr "" + +#: ../data.md:54 +msgid "" +"paddlenlp.data.Stack、paddlenlp.data.Pad、paddlenlp.data.Tuple和paddlenlp.data" +".Dict用于构建生成mini-batch的collate_fn函数。" +msgstr "" + +#: ../data.md:56 +msgid "数据预处理" +msgstr "" + +#: ../data.md:58 +msgid "paddlenlp.data.Vocab" +msgstr "" + +#: ../data.md:60 +msgid "paddlenlp.data.Vocab词表类,集合了一系列文本token与ids之间映射的一系列方法,支持从文件、字典、json等一系方式构建词表。" +msgstr "" + +#: ../data.md:78 +msgid "paddlenlp.data.JiebaTokenizer" +msgstr "" + +#: ../data.md:80 +msgid "paddlenlp.data.JiebaTokenizer初始化需传入paddlenlp.data.Vocab类,包含cut分词方法和将句子明文转换为ids的encode方法。" +msgstr "" + +#: ../data.md:97 +msgid "构建Sampler" +msgstr "" + +#: ../data.md:99 +msgid "paddlenlp.data.SamplerHelper" +msgstr "" + +#: ../data.md:101 +msgid "paddlenlp.data.SamplerHelper的作用是构建用于DataLoader的可迭代采样器,它包含shuffle、sort、batch、shard等一系列方法,方便用户灵活使用。" +msgstr "" + +#: ../data.md:139 +msgid "构建collate_fn" +msgstr "" + +#: ../data.md:141 +msgid "paddlenlp.data.Stack" +msgstr "" + +#: ../data.md:143 +msgid "paddlenlp.data.Stack用来组建batch,其输入必须具有相同的shape,输出便是这些输入的堆叠组成的batch数据。" +msgstr "" + +#: ../data.md:158 +msgid "paddlenlp.data.Pad" +msgstr "" + +#: ../data.md:160 +msgid "paddlenlp.data.Pad用来组建batch,它的输入长度不同,它首先会将输入数据全部padding到最大长度,然后再堆叠组成batch数据输出。" +msgstr "" + +#: ../data.md:175 +msgid "paddlenlp.data.Tuple" +msgstr "" + +#: ../data.md:177 +msgid "paddlenlp.data.Tuple会将多个组batch的函数包装在一起,组成tuple。" +msgstr "" + +#: ../data.md:197 +msgid "paddlenlp.data.Dict" +msgstr "" + +#: ../data.md:199 +msgid "paddlenlp.data.Dict会将多个组batch的函数包装在一起,组成dict。" +msgstr "" + +#: ../data.md:219 +msgid "综合示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/data_preprocess.po 
b/docs/locale/en/LC_MESSAGES/data_prepare/data_preprocess.po new file mode 100644 index 0000000000000000000000000000000000000000..58676470db2b14eac6688031b6943af082d058ee --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/data_preprocess.po @@ -0,0 +1,190 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/data_preprocess.rst:3 +msgid "数据处理" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:5 +msgid "" +"Dataset中通常为原始数据,需要经过一定的数据处理并进行采样组batch,而后通过 :class:`paddle.io.DataLoader`" +" 为训练或预测使用,PaddleNLP中为其中各环节提供了相应的功能支持。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:8 +msgid "基于预训练模型的数据处理" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:10 +msgid "" +"在使用预训练模型做NLP任务时,需要加载对应的Tokenizer,PaddleNLP在 :class:`PreTrainedTokenizer` " +"中内置的 :func:`__call__` 方法可以实现基础的数据处理功能。PaddleNLP内置的所有预训练模型的Tokenizer都继承自 " +":class:`PreTrainedTokenizer` ,下面以BertTokenizer举例说明:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:28 +msgid "关于 :func:`__call__` 方法的其他参数和功能,请查阅PreTrainedTokenizer。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:30 +msgid "" +"paddlenlp内置的 :class:`paddlenlp.datasets.MapDataset` 的 :func:`map` " +"方法支持传入一个函数,对数据集内的数据进行统一转换。下面我们以 :obj:`LCQMC` 的数据处理流程为例:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:42 +msgid "" +"可以看到, :obj:`LCQMC` 是一个句对匹配任务,即判断两个句子的意思是否相似的2分类任务。我们需要处理的是key为 **query** " +"和 **title** 的文本数据,我们编写基于 :class:`PreTrainedTokenizer` 的数据处理函数并传入数据集的 " +":func:`map` 方法。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:69 +msgid "可以看到,数据集中的文本数据已经被处理成了模型可以接受的 *feature* 。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:71 +msgid "" +":func:`map` 方法有一个重要的参数 :attr:`batched`,当设置为 :obj:`True` 时(默认为 " +":obj:`False` ),数据处理函数 :func:`trans_func` 的输入不再是单条数据,而是数据集的所有数据:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:99 +msgid "" +"可以看到,在本例中两种实现的结果是相同的。但是在诸如阅读理解,对话等任务中,一条原始数据可能会产生多个 *feature* 的情况(参见 " +"`run_squad.py " +"`__" +" )通常需要将 :attr:`batched` 参数设置为 :obj:`True` 。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:101 +msgid "" +":func:`map` 方法还有一个 :attr:`num_workers` " +"参数,当其大于0时进行多进程数据处理,可以提高处理速度。但是需要注意如果在数据处理的函数中用到了 **数据index** " +"的相关信息,多进程处理可能会导致错误的结果。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:103 +msgid "" +"关于 :func:`map` 方法的其他参数和 :class:`paddlenlp.datasets.MapDataset` " +"的其他数据处理方法,请查阅 :doc:`dataset <../source/paddlenlp.datasets.dataset>` 。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:106 +msgid "Batchify" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:108 +msgid "" +"PaddleNLP内置了多种collate function,配合 :class:`paddle.io.BatchSampler` " +"可以协助用户简单的完成组batch的操作。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:110 +msgid "" +"我们继续以 :obj:`LCQMC` 的数据处理流程为例。从上一节最后可以看到,处理后的单条数据是一个 **字典** ,包含 " +"`input_ids` , `token_type_ids` 和 `label` 三个key。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:112 +msgid "" +"其中 `input_ids` 和 `token_type_ids` 是需要进行 **padding** 操作后输入模型的,而 `label` " +"是需要 **stack** 之后传入loss function的。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:114 +msgid "" 
+"因此,我们使用PaddleNLP内置的 :func:`Dict` ,:func:`Stack` 和 :func:`Pad` " +"函数整理batch中的数据。最终的 :func:`batchify_fn` 如下:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:127 +msgid "" +"之后使用 :class:`paddle.io.BatchSampler` 和 :func:`batchify_fn` 构建 " +":class:`paddle.io.DataLoader` :" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:137 +msgid "" +"到此,一个完整的数据准备流程就完成了。关于更多batchify方法,请查阅 :doc:`collate " +"<../source/paddlenlp.data.collate>`。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:141 +msgid "" +"当需要进行 **单机多卡** 训练时,需要将 :class:`BatchSampler` 更换为 " +":class:`DistributedBatchSampler` 。更多有关 :class:`paddle.io.BatchSampler` " +"的信息,请查阅 `BatchSampler " +"`_。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:143 +msgid "" +"当需要诸如batch内排序,按token组batch等更复杂的组batch功能时。可以使用PaddleNLP内置的 " +":class:`SamplerHelper` 。相关用例请参考 `reader.py " +"`__。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:146 +msgid "基于非预训练模型的数据处理" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:148 +msgid "" +"在使用非预训练模型做NLP任务时,我们可以借助PaddleNLP内置的 :class:`JiebaTokenizer` 和 " +":class:`Vocab` 完成数据处理的相关功能,整体流程与使用预训练模型基本相似。我们以中文情感分析 :obj:`ChnSentiCorp`" +" 数据集为例:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:169 +msgid "" +":class:`Vocab` 除了可以从本地词典文件初始化之外,还提供多种初始化方法,包括从 :class:`dictionary` " +"创建、从数据集创建等。详情请查阅Vocab。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:170 +msgid "" +"除了使用内置的 :class:`JiebaTokenizer` 外,用户还可以使用任何自定义的方式或第三方库进行分词,之后使用 " +":func:`Vocab.to_indices` 方法将token转为id。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:172 +msgid "之后与基于预训练模型的数据处理流程相似,编写数据处理函数并传入 :func:`map` 方法:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:195 +msgid "" +"可以看到,原始数据已经被处理成了 *feature* 。但是这里我们发现单条数据并不是一个 **字典** ,而是 **元组** 。所以我们的 " +":func:`batchify_fn` 也要相应的做一些调整:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:208 +msgid "" +"可以看到,:func:`Dict` 函数是将单条数据中的键值与 :func:`Pad` 等函数进行对应,适用于单条数据是字典的情况。而 " +":func:`Tuple` 是通过单条数据中不同部分的index进行对应的。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:210 +msgid "所以需要 **注意** 的是 :func:`convert_example` 方法和 :func:`batchify_fn` 方法的匹配。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:212 +msgid "之后的流程与基于预训练模型的数据处理相同。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_list.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_list.po new file mode 100644 index 0000000000000000000000000000000000000000..8018860488efb9f5ea1bfe6c1b627940781545d2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_list.po @@ -0,0 +1,63 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/dataset_list.md:1 +msgid "PaddleNLP Datasets API" +msgstr "" + +#: ../data_prepare/dataset_list.md:3 +msgid "PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要添加splits信息:" +msgstr "" + +#: ../data_prepare/dataset_list.md:5 +msgid "阅读理解" +msgstr "" + +#: ../data_prepare/dataset_list.md:55 +msgid "文本分类" +msgstr "" + +#: ../data_prepare/dataset_list.md:234 +msgid "文本匹配" +msgstr "" + +#: ../data_prepare/dataset_list.md:253 +msgid "序列标注" +msgstr "" + +#: ../data_prepare/dataset_list.md:283 +msgid "机器翻译" +msgstr "" + +#: ../data_prepare/dataset_list.md:307 +msgid "机器同传" +msgstr "" + +#: ../data_prepare/dataset_list.md:326 +msgid "对话系统" +msgstr "" + +#: ../data_prepare/dataset_list.md:345 +msgid "文本生成" +msgstr "" + +#: ../data_prepare/dataset_list.md:389 +msgid "语料库" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_load.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_load.po new file mode 100644 index 0000000000000000000000000000000000000000..d4575f4162c2e388c06de27ca4fdf099950ccb26 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_load.po @@ -0,0 +1,103 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/dataset_load.rst:3 +msgid "加载数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:6 +msgid "快速加载内置数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:8 +msgid "" +"目前PaddleNLP内置20余个NLP数据集,涵盖阅读理解,文本分类,序列标注,机器翻译等多项任务。目前提供的数据集可以在 " +":doc:`数据集列表 <./dataset_list>` 中找到。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:10 +msgid "以 **msra_ner** 数据集为例:" +msgstr "" + +#: ../data_prepare/dataset_load.rst:17 +msgid "" +":func:`load_dataset` 方法会从 :obj:`paddlenlp.datasets` " +"下找到msra_ner数据集对应的数据读取脚本(默认路径:paddlenlp/datasets/msra_ner.py),并调用脚本中 " +":class:`DatasetBuilder` 类的相关方法生成数据集。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:19 +msgid "" +"生成数据集可以以 :class:`MapDataset` 和 :class:`IterDataset` 两种类型返回,分别是对 " +":class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展,只需在 " +":func:`load_dataset` 时设置 :attr:`lazy` 参数即可获取相应类型。:obj:`Flase` 对应返回 " +":class:`MapDataset` ,:obj:`True` 对应返回 :class:`IterDataset`,默认值为None,对应返回 " +":class:`DatasetBuilder` 默认的数据集类型,大多数为 :class:`MapDataset` 。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:31 +msgid "" +"关于 :class:`MapDataset` 和 :class:`IterDataset` 功能和异同可以参考API文档 " +":doc:`datasets <../source/paddlenlp.datasets.dataset>`。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:34 +msgid "选择子数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:36 +msgid "" +"有些数据集是很多子数据集的集合,每个子数据集都是一个独立的数据集。例如 **GLUE** 数据集就包含COLA, SST2, MRPC, " +"QQP等10个子数据集。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:38 +msgid 
":func:`load_dataset` 方法提供了一个 :attr:`name` 参数用来指定想要获取的子数据集。使用方法如下:" +msgstr "" + +#: ../data_prepare/dataset_load.rst:46 +msgid "以内置数据集格式读取本地数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:48 +msgid "" +"有的时候,我们希望使用数据格式与内置数据集相同的本地数据替换某些内置数据集的数据(例如参加SQuAD竞赛,对训练数据进行了数据增强)。 " +":func:`load_dataset` 方法提供的 :attr:`data_files` 参数可以实现这个功能。以 **SQuAD** 为例。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:58 +msgid "" +"对于某些数据集,不同的split的读取方式不同。对于这种情况则需要在 :attr:`splits` 参数中以传入与 " +":attr:`data_files` **一一对应** 的split信息。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:60 +msgid "此时 :attr:`splits` 不再代表选取的内置数据集,而代表以何种格式读取本地数据集。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:62 +msgid "下面以 **COLA** 数据集为例:" +msgstr "" + +#: ../data_prepare/dataset_load.rst:69 +msgid "" +"**另外需要注意数据集的是没有默认加载选项的,**:attr:`splits` **和**:attr:`data_files` " +"**必须至少指定一个。**" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po new file mode 100644 index 0000000000000000000000000000000000000000..3d076811d2f3ac2880557884514818a06859b5a7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po @@ -0,0 +1,127 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/dataset_self_defined.rst:3 +msgid "如何自定义数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:5 +msgid "" +"通过使用PaddleNLP提供的 :func:`load_dataset` , :class:`MapDataset` 和 " +":class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:8 +msgid "从本地文件创建数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:10 +msgid "" +"从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 :func:`load_dataset` " +"中创建数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:12 +msgid "" +"以 `waybill_ie " +"`__" +" 快递单信息抽取任务中的数据为例:" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:32 +msgid "" +"我们推荐将数据读取代码写成生成器(generator)的形式,这样可以更好的构建 :class:`MapDataset` 和 " +":class:`IterDataset` 两种数据集。同时我们也推荐将单条数据写成字典的格式,这样可以更方便的监测数据流向。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:34 +msgid "" +"事实上,:class:`MapDataset` 在绝大多数时候都可以满足要求。一般只有在数据集过于庞大无法一次性加载进内存的时候我们才考虑使用 " +":class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:38 +msgid "" +"需要注意的是,只有PaddleNLP内置的数据集具有将数据中的label自动转为id的功能(详细条件参见 " +":doc:`创建DatasetBuilder " +"<../community/contribute_datasets/how_to_write_a_DatasetBuilder>`)。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:40 +msgid "像上例中的自定义数据集需要在自定义的convert to feature方法中添加label转id的功能。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:42 +msgid "" +"自定义数据读取function中的参数可以直接以关键字参数的方式传入 :func:`load_dataset` " +"中。而且对于自定义数据集,:attr:`lazy` 参数是 **必须** 传入的。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:45 +msgid "从 :class:`paddle.io.Dataset/IterableDataset` 创建数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:47 +msgid "" +"虽然PaddlePddle内置的 :class:`Dataset` 和 
:class:`IterableDataset` 是可以直接接入 " +":class:`DataLoader` 用于模型训练的,但有时我们希望更方便的使用一些数据处理(例如convert to feature, " +"数据清洗,数据增强等)。而PaddleNLP内置的 :class:`MapDataset` 和 :class:`IterDataset` " +"正好提供了能实现以上功能的API。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:49 +msgid "" +"所以如果您习惯使用 :class:`paddle.io.Dataset/IterableDataset` " +"创建数据集的话。只需要在原来的数据集上套上一层 :class:`MapDataset` 或 :class:`IterDataset` " +"就可以把原来的数据集对象转换成PaddleNLP的数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:51 +msgid "下面举一个简单的小例子。:class:`IterDataset` 的用法基本相同。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:78 +msgid "从其他python对象创建数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:80 +msgid "" +"理论上,我们可以使用任何包含 :func:`__getitem__` 方法和 :func:`__len__` 方法的python对象创建 " +":class:`MapDataset`。包括 :class:`List` ,:class:`Tuple` ,:class:`DataFrame` " +"等。只要将符合条件的python对象作为初始化参数传入 :class:`MapDataset` 即可完成创建。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:95 +msgid "" +"同样的,我们也可以使用包含 :func:`__iter__` 方法的python对象创建 :class:`IterDataset` 。例如 " +":class:`List`, :class:`Generator` 等。创建方法与 :class:`MapDataset` 相同。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:112 +msgid "需要注意,像上例中直接将 **生成器** 对象传入 :class:`IterDataset` 所生成的数据集。其数据只能迭代 **一次** 。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:114 +msgid "与常规的python对象一样,只要满足以上的条件,我们也可以使用同样的方法从第三方数据集创建PaddleNLP数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:116 +msgid "例如HuggingFace Dataset:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/overview.po b/docs/locale/en/LC_MESSAGES/data_prepare/overview.po new file mode 100644 index 0000000000000000000000000000000000000000..9da1d4654b3c84a5113c10da24e4879734d2a6dd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/overview.po @@ -0,0 +1,101 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/overview.rst:3 +msgid "整体介绍" +msgstr "" + +#: ../data_prepare/overview.rst:5 +msgid "数据集和数据处理部分一直是NLP任务中最重要的环节之一。为了方便用户以更低的学习成本完成这一环节,PaddleNLP提供了以下特性:" +msgstr "" + +#: ../data_prepare/overview.rst:7 +msgid "功能强大的API。可以帮助用户完成大部分常见NLP任务的数据处理流程。" +msgstr "" + +#: ../data_prepare/overview.rst:8 +msgid "更灵活的封装。各个模块保持低耦合,高内聚,保证用户可以通过继承和改写满足特定的数据处理需求。" +msgstr "" + +#: ../data_prepare/overview.rst:9 +msgid "内置数据集涵盖大部分NLP任务,搭配简洁易用的数据集加载协议和贡献协议。对新手和社区贡献者更加友好。" +msgstr "" + +#: ../data_prepare/overview.rst:12 +msgid "核心API" +msgstr "" + +#: ../data_prepare/overview.rst:14 +msgid "" +":func:`load_dataset` :数据集快速加载接口,通过传入数据集读取脚本的名称和其他参数调用 " +":class:`DatasetBuilder` 子类的相关方法生成数据集。关于加载数据集的详细方法,请查阅 :doc:`加载数据集 " +"<./dataset_load>` 。" +msgstr "" + +#: ../data_prepare/overview.rst:15 +msgid "" +":class:`DatasetBuilder` : :class:`DatasetBuilder` " +"是一个基类,所有的内置数据集都继承自该类,该类的主要功能是下载和读取数据集文件并生成Dataset。其中大部分方法已经封装,不对贡献者暴露。贡献者通过重写" +" :func:`_get_data` 和 :func:`_read` 等方法像社区贡献数据集。详细信息请查阅 :doc:`如何贡献数据集 " +"` 。" +msgstr "" + +#: ../data_prepare/overview.rst:16 +msgid "" +":class:`MapDataset/IterDataset` :PaddleNLP内置数据集类型,分别是对 " +":class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展。内置诸如 " +":func:`map` , :func:`filter` " +"等适用于NLP任务的数据处理功能。同时还能帮助用户简单创建自定义数据集。详细信息请查阅***和 :doc:`如何自定义数据集 " +"<./dataset_self_defined>` 。" +msgstr "" + +#: ../data_prepare/overview.rst:19 +msgid "数据处理流程设计" +msgstr "" + +#: ../data_prepare/overview.rst:21 +msgid "目前PaddleNLP的通用数据处理流程如下:" +msgstr "" + +#: ../data_prepare/overview.rst:23 +msgid "加载数据集(内置数据集或者自定义数据集,数据集返回 **原始数据**)。" +msgstr "" + +#: ../data_prepare/overview.rst:24 +msgid "" +"定义 :func:`trans_func` ,包括tokenize,token to id等操作,并传入数据集的 :func:`map` " +"方法,将原始数据转为 *feature* 。" +msgstr "" + +#: ../data_prepare/overview.rst:25 +msgid "根据上一步数据处理的结果定义 **batchify** 方法和 :class:`BatchSampler` 。" +msgstr "" + +#: ../data_prepare/overview.rst:26 +msgid "定义 :class:`DataLoader` , 传入 :class:`BatchSampler` 和 :func:`batchify_fn` 。" +msgstr "" + +#: ../data_prepare/overview.rst:28 +msgid "下面是基于Bert的文本分类任务的数据处理流程图:" +msgstr "" + +#: ../data_prepare/overview.rst:32 +msgid "关于数据处理的详细信息,请查阅 :doc:`./data_preprocess` 。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/datasets.po b/docs/locale/en/LC_MESSAGES/datasets.po new file mode 100644 index 0000000000000000000000000000000000000000..346f663b86c1beb880fe0f5534b1f92c0458fb2f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/datasets.po @@ -0,0 +1,55 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../datasets.md:1 +msgid "PaddleNLP Datasets API" +msgstr "" + +#: ../datasets.md:3 +msgid "PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要添加splits信息:" +msgstr "" + +#: ../datasets.md:5 +msgid "阅读理解" +msgstr "" + +#: ../datasets.md:44 +msgid "文本分类" +msgstr "" + +#: ../datasets.md:114 +msgid "序列标注" +msgstr "" + +#: ../datasets.md:139 +msgid "机器翻译" +msgstr "" + +#: ../datasets.md:164 +msgid "机器同传" +msgstr "" + +#: ../datasets.md:184 +msgid "文本生成" +msgstr "" + +#: ../datasets.md:208 +msgid "语料库" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/get_started/installation.po b/docs/locale/en/LC_MESSAGES/get_started/installation.po new file mode 100644 index 0000000000000000000000000000000000000000..4a74fb50ddb0a501f038a1f7f7518736279dd9ea --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/get_started/installation.po @@ -0,0 +1,108 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../get_started/installation.rst:2 +msgid "安装PaddleNLP" +msgstr "" + +#: ../get_started/installation.rst:3 +msgid "" +"以下安装过程默认用户已安装好paddlepaddle-" +"gpu或paddlepaddle(版本大于或等于2.0),paddlepaddle安装方式参照 飞桨官网_。" +msgstr "" + +#: ../get_started/installation.rst:8 +msgid "pip安装" +msgstr "" + +#: ../get_started/installation.rst:14 +msgid "Anaconda安装" +msgstr "" + +#: ../get_started/installation.rst:15 +msgid "Anaconda是一个开源的Python发行版本,其包含了conda、Python等180多个科学包及其依赖项。使用Anaconda可以通过创建多个独立的Python环境,避免用户的Python环境安装太多不同版本依赖导致冲突。" +msgstr "" + +#: ../get_started/installation.rst:18 +msgid "1、windows安装Anaconda" +msgstr "" + +#: ../get_started/installation.rst:21 ../get_started/installation.rst:56 +msgid "第一步 下载" +msgstr "" + +#: ../get_started/installation.rst:22 +msgid "在 Anaconda官网_ 选择下载Windows Python3.7 64-Bit版本。" +msgstr "" + +#: ../get_started/installation.rst:26 +msgid "确保已经安装Visual C++ Build Tools(可以在开始菜单中找到),如未安装,请点击 下载安装_。" +msgstr "" + +#: ../get_started/installation.rst:31 ../get_started/installation.rst:62 +msgid "第二步 安装" +msgstr "" + +#: ../get_started/installation.rst:32 +msgid "运行下载的安装包(以.exe为后辍),根据引导完成安装, 用户可自行修改安装目录(如下图)。" +msgstr "" + +#: ../get_started/installation.rst:37 ../get_started/installation.rst:73 +msgid "第三步 使用" +msgstr "" + +#: ../get_started/installation.rst:38 +msgid "点击Windows系统左下角的Windows图标,打开:所有程序->Anaconda3/2(64-bit)->Anaconda Prompt" +msgstr "" + +#: ../get_started/installation.rst:39 +msgid "在命令行中执行下述命令" +msgstr "" + +#: ../get_started/installation.rst:50 ../get_started/installation.rst:84 +msgid "" +"按如上方式配置后,即可在环境中使用PaddleNLP了,命令行输入python回车后,import " +"paddlenlp试试吧,之后再次使用都可以通过打开'所有程序->Anaconda3/2(64-bit)->Anaconda " +"Prompt',再执行conda activate my_paddlenlp进入环境后,即可再次使用PaddleNLP。" +msgstr "" + +#: 
../get_started/installation.rst:53 +msgid "2、Linux/Mac安装Anaconda" +msgstr "" + +#: ../get_started/installation.rst:57 +msgid "在 Anaconda官网_ 选择下载对应系统 Python3.7版本下载(Mac下载Command Line Installer版本即可)。" +msgstr "" + +#: ../get_started/installation.rst:63 +msgid "打开终端,在终端安装Anaconda" +msgstr "" + +#: ../get_started/installation.rst:70 +msgid "安装过程中一直回车即可,如提示设置安装路径,可根据需求修改,一般默认即可。" +msgstr "" + +#: ../get_started/installation.rst:87 +msgid "代码安装" +msgstr "" + +#: ../get_started/installation.rst:88 +msgid "github代码会跟随开发进度不断更新" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/get_started/quick_start.po b/docs/locale/en/LC_MESSAGES/get_started/quick_start.po new file mode 100644 index 0000000000000000000000000000000000000000..0b9dc90f343e3b652ca3534a86a82620b2209aa6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/get_started/quick_start.po @@ -0,0 +1,178 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../get_started/quick_start.rst:3 +msgid "10分钟完成高精度中文情感分析" +msgstr "" + +#: ../get_started/quick_start.rst:6 +msgid "1. 安装PaddleNLP" +msgstr "" + +#: ../get_started/quick_start.rst:8 +msgid "安装相关过程和问题可以参考PaddleNLP的 安装文档_。" +msgstr "" + +#: ../get_started/quick_start.rst:18 +msgid "2. 一键加载预训练模型" +msgstr "" + +#: ../get_started/quick_start.rst:20 +msgid "" +"情感分析本质是一个文本分类任务。PaddleNLP内置了ERNIE、BERT、RoBERTa、Electra等丰富的预训练模型" +",并且内置了各种预训练模型对于不同下游任务的Fine-" +"tune网络。用户可以使用PaddleNLP提供的模型,完成问答、序列分类、token分类等任务。查阅 预训练模型_ " +"了解更多。这里以ERNIE模型为例,介绍如何将预训练模型Fine-tune完成文本分类任务。" +msgstr "" + +#: ../get_started/quick_start.rst:24 +msgid "加载预训练模型ERNIE" +msgstr "" + +#: ../get_started/quick_start.rst:31 +msgid "加载预训练模型ERNIE用于文本分类任务的Fine-tune网络,只需指定想要使用的模型名称和文本分类的类别数即可完成网络定义。" +msgstr "" + +#: ../get_started/quick_start.rst:39 +msgid "3. 调用Tokenizer进行数据处理" +msgstr "" + +#: ../get_started/quick_start.rst:41 +msgid "Tokenizer用于将原始输入文本转化成模型可以接受的输入数据形式。PaddleNLP对于各种预训练模型已经内置了相应的Tokenizer,指定想要使用的模型名字即可加载。" +msgstr "" + +#: ../get_started/quick_start.rst:47 +msgid "" +"Transformer类预训练模型所需的数据处理步骤通常包括将原始输入文本切分token;将token映射为对应的token " +"id;拼接上预训练模型对应的特殊token " +",如[CLS]、[SEP];最后转化为框架所需的数据格式。为了方便使用,PaddleNLP提供了高阶API,一键即可返回模型所需数据格式。" +msgstr "" + +#: ../get_started/quick_start.rst:49 +msgid "一行代码完成切分token,映射token ID以及拼接特殊token:" +msgstr "" + +#: ../get_started/quick_start.rst:55 +msgid "转化成paddle框架数据格式:" +msgstr "" + +#: ../get_started/quick_start.rst:68 +msgid "input_ids: 表示输入文本的token ID。" +msgstr "" + +#: ../get_started/quick_start.rst:70 +msgid "token_type_ids: 表示对应的token属于输入的第一个句子还是第二个句子。(Transformer类预训练模型支持单句以及句对输入。)" +msgstr "" + +#: ../get_started/quick_start.rst:72 +msgid "此时即可输入ERNIE模型中得到相应输出。" +msgstr "" + +#: ../get_started/quick_start.rst:81 +msgid "可以看出,ERNIE模型输出有2个tensor。" +msgstr "" + +#: ../get_started/quick_start.rst:83 +msgid "" +"sequence_output是对应每个输入token的语义特征表示,shape为(1, num_tokens, " +"hidden_size)。其一般用于序列标注、问答等任务。" +msgstr "" + +#: ../get_started/quick_start.rst:85 +msgid "pooled_output是对应整个句子的语义特征表示,shape为(1, hidden_size)。其一般用于文本分类、信息检索等任务。" +msgstr "" + +#: ../get_started/quick_start.rst:88 +msgid "4. 
加载数据集" +msgstr "" + +#: ../get_started/quick_start.rst:89 +msgid "PaddleNLP内置了适用于阅读理解、文本分类、序列标注、机器翻译等下游任务的多个数据集,这里我们使用公开中文情感分析数据集ChnSenticorp,包含7000多条正负向酒店评论数据。" +msgstr "" + +#: ../get_started/quick_start.rst:91 +msgid "一键加载PaddleNLP内置数据集:" +msgstr "" + +#: ../get_started/quick_start.rst:98 +msgid "获取分类数据标签:" +msgstr "" + +#: ../get_started/quick_start.rst:106 +msgid "展示一些数据:" +msgstr "" + +#: ../get_started/quick_start.rst:122 +msgid "5. 模型训练与评估" +msgstr "" + +#: ../get_started/quick_start.rst:123 +msgid "" +"数据读入时使用 :func:`paddle.io.DataLoader` " +"接口多线程异步加载数据,然后设置适用于ERNIE这类Transformer模型的动态学习率和损失函数、优化算法、评价指标等。" +msgstr "" + +#: ../get_started/quick_start.rst:125 +msgid "模型训练的过程通常按照以下步骤:" +msgstr "" + +#: ../get_started/quick_start.rst:127 +msgid "从dataloader中取出一个batch data。" +msgstr "" + +#: ../get_started/quick_start.rst:128 +msgid "将batch data喂给model,做前向计算。" +msgstr "" + +#: ../get_started/quick_start.rst:129 +msgid "将前向计算结果传给损失函数,计算loss。将前向计算结果传给评价方法,计算评价指标。" +msgstr "" + +#: ../get_started/quick_start.rst:130 +msgid "loss反向回传,更新梯度。重复以上步骤。" +msgstr "" + +#: ../get_started/quick_start.rst:131 +msgid "每训练一个epoch时,程序将会评估一次,评估当前模型训练的效果。" +msgstr "" + +#: ../get_started/quick_start.rst:133 +msgid "本示例同步在AIStudio上,可直接 在线体验模型训练_。" +msgstr "" + +#: ../get_started/quick_start.rst:137 +msgid "最后,保存训练好的模型用于预测。" +msgstr "" + +#: ../get_started/quick_start.rst:140 +msgid "6. 模型预测" +msgstr "" + +#: ../get_started/quick_start.rst:141 +msgid "保存训练模型,定义预测函数 :func:`predict` ,即可开始预测文本情感倾向。" +msgstr "" + +#: ../get_started/quick_start.rst:143 +msgid "以自定义预测数据和数据标签为示例:" +msgstr "" + +#: ../get_started/quick_start.rst:154 +msgid "得到预测结果:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/index.po b/docs/locale/en/LC_MESSAGES/index.po new file mode 100644 index 0000000000000000000000000000000000000000..1cf3bcec97d47c85e6af8b37d096c05e72a34580 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/index.po @@ -0,0 +1,252 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../index.rst:26 +msgid "安装" +msgstr "" + +#: ../index.rst:26 +msgid "10分钟完成高精度中文情感分析" +msgstr "" + +#: ../index.rst:26 +msgid "快速开始" +msgstr "" + +#: ../index.rst:33 +msgid "整体介绍" +msgstr "" + +#: ../index.rst:33 +msgid "数据集列表" +msgstr "" + +#: ../index.rst:33 +msgid "加载数据集" +msgstr "" + +#: ../index.rst:33 +msgid "自定义数据集" +msgstr "" + +#: ../index.rst:33 +msgid "数据处理" +msgstr "" + +#: ../index.rst:33 +msgid "数据准备" +msgstr "" + +#: ../index.rst:43 +msgid "Transformer预训练模型" +msgstr "" + +#: ../index.rst:43 +msgid "使用Trainer API训练" +msgstr "" + +#: ../index.rst:43 +msgid "一键预测功能" +msgstr "" + +#: ../index.rst:43 +msgid "预训练词向量" +msgstr "" + +#: ../index.rst:43 +msgid "模型库" +msgstr "" + +#: ../index.rst:53 +msgid "评价指标" +msgstr "" + +#: ../index.rst:59 +msgid "AI Studio Notebook" +msgstr "" + +#: ../index.rst:59 +msgid "实践教程" +msgstr "" + +#: ../index.rst:65 +msgid "模型压缩" +msgstr "" + +#: ../index.rst:65 +msgid "文本生成高性能加速" +msgstr "" + +#: ../index.rst:65 +msgid "大规模分布式训练" +msgstr "" + +#: ../index.rst:65 +msgid "进阶指南" +msgstr "" + +#: ../index.rst:73 +msgid "如何贡献模型" +msgstr "" + +#: ../index.rst:73 +msgid "如何贡献数据集" +msgstr "" + +#: ../index.rst:73 +msgid "如何贡献文档案例" +msgstr "" + +#: ../index.rst:73 +msgid "如何加入兴趣小组" +msgstr "" + +#: ../index.rst:73 +msgid "社区交流共建" +msgstr "" + +#: ../index.rst:82 +msgid "FAQ" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.data" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.datasets" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.embeddings" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.layers" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.losses" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.metrics" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.ops" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.seq2vec" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.taskflow" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.trainer" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.transformers" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.utils" +msgstr "" + +#: ../index.rst:88 +msgid "API Reference" +msgstr "" + +#: ../index.rst:2 +msgid "欢迎使用PaddleNLP" +msgstr "" + +#: ../index.rst:4 +msgid "" +"`PaddleNLP `_ 是飞桨自然语言处理开发库,具备 " +"**易用的文本领域API**,**多场景的应用示例**、和 **高性能分布式训练** " +"三大特点,旨在提升飞桨开发者文本领域建模效率,旨在提升开发者在文本领域的开发效率,并提供丰富的NLP应用示例。" +msgstr "" + +#: ../index.rst:7 +msgid "**易用的文本领域API**" +msgstr "" + +#: ../index.rst:9 +msgid "" +"提供丰富的产业级预置任务能力 **Taskflow** 和全流程的文本领域API:支持丰富中文数据集加载的 **Dataset " +"API**,可灵活高效地完成数据预处理的 **Data API** ,预置60+预训练词向量的 **Embedding API** " +",提供100+预训练模型的 **Transformer API** 等,可大幅提升NLP任务建模的效率。" +msgstr "" + +#: ../index.rst:11 +msgid "**多场景的应用示例**" +msgstr "" + +#: ../index.rst:13 +msgid "覆盖从学术到产业级的NLP应用示例,涵盖NLP基础技术、NLP系统应用以及相关拓展应用。全面基于飞桨核心框架2.0全新API体系开发,为开发者提供飞桨文本领域的最佳实践。" +msgstr "" + +#: ../index.rst:15 +msgid "**高性能分布式训练**" +msgstr "" + +#: ../index.rst:17 +msgid "基于飞桨核心框架领先的自动混合精度优化策略,结合分布式Fleet API,支持4D混合并行策略,可高效地完成大规模预训练模型训练。" +msgstr "" + +#: ../index.rst:20 +msgid "项目GitHub: https://github.com/PaddlePaddle/PaddleNLP" +msgstr "" + +#: ../index.rst:21 +msgid "项目Gitee: 
https://gitee.com/paddlepaddle/PaddleNLP" +msgstr "" + +#: ../index.rst:22 +msgid "GitHub Issue反馈: https://github.com/PaddlePaddle/PaddleNLP/issues" +msgstr "" + +#: ../index.rst:23 +msgid "官方QQ技术交流群: 973379845" +msgstr "" + +#: ../index.rst:106 +msgid "Indices and tables" +msgstr "" + +#: ../index.rst:107 +msgid ":ref:`genindex`" +msgstr "" + +#: ../index.rst:108 +msgid ":ref:`modindex`" +msgstr "" + +#: ../index.rst:109 +msgid ":ref:`search`" +msgstr "" + +#~ msgid "TaskFlow" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/metrics.po b/docs/locale/en/LC_MESSAGES/metrics.po new file mode 100644 index 0000000000000000000000000000000000000000..18808669e1d277fff110a19a3198928a5620f3b3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/metrics.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../metrics.md:1 +msgid "PaddleNLP Metrics API" +msgstr "" + +#: ../metrics.md:3 +msgid "目前PaddleNLP提供以下模型评价指标:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/metrics/metrics.po b/docs/locale/en/LC_MESSAGES/metrics/metrics.po new file mode 100644 index 0000000000000000000000000000000000000000..e2fb3304a303b6873b906a743966f9f27339ef68 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/metrics/metrics.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../metrics/metrics.md:1 +msgid "PaddleNLP Metrics API" +msgstr "" + +#: ../metrics/metrics.md:3 +msgid "目前PaddleNLP提供以下模型评价指标:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/embeddings.po b/docs/locale/en/LC_MESSAGES/model_zoo/embeddings.po new file mode 100644 index 0000000000000000000000000000000000000000..ad259196948d182dfda771a9a017e1a3b00e86d6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/embeddings.po @@ -0,0 +1,219 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../model_zoo/embeddings.md:1 +msgid "PaddleNLP Embedding API" +msgstr "" + +#: ../model_zoo/embeddings.md:3 ../model_zoo/embeddings.md:24 +msgid "介绍" +msgstr "" + +#: ../model_zoo/embeddings.md:4 ../model_zoo/embeddings.md:28 +msgid "用法" +msgstr "" + +#: ../model_zoo/embeddings.md:5 ../model_zoo/embeddings.md:30 +msgid "TokenEmbedding参数" +msgstr "" + +#: ../model_zoo/embeddings.md:6 ../model_zoo/embeddings.md:69 +msgid "初始化" +msgstr "" + +#: ../model_zoo/embeddings.md:7 ../model_zoo/embeddings.md:101 +msgid "查询embedding结果" +msgstr "" + +#: ../model_zoo/embeddings.md:8 ../model_zoo/embeddings.md:115 +msgid "可视化embedding结果" +msgstr "" + +#: ../model_zoo/embeddings.md:9 ../model_zoo/embeddings.md:140 +msgid "计算词向量cosine相似度" +msgstr "" + +#: ../model_zoo/embeddings.md:10 ../model_zoo/embeddings.md:147 +msgid "计算词向量内积" +msgstr "" + +#: ../model_zoo/embeddings.md:11 ../model_zoo/embeddings.md:155 +msgid "训练" +msgstr "" + +#: ../model_zoo/embeddings.md:12 ../model_zoo/embeddings.md:171 +msgid "切词" +msgstr "" + +#: ../model_zoo/embeddings.md:13 ../model_zoo/embeddings.md:183 +msgid "预训练模型" +msgstr "" + +#: ../model_zoo/embeddings.md:14 ../model_zoo/embeddings.md:189 +msgid "中文词向量" +msgstr "" + +#: ../model_zoo/embeddings.md:15 ../model_zoo/embeddings.md:343 +msgid "英文词向量" +msgstr "" + +#: ../model_zoo/embeddings.md:16 ../model_zoo/embeddings.md:345 +msgid "Word2Vec" +msgstr "" + +#: ../model_zoo/embeddings.md:17 ../model_zoo/embeddings.md:362 +msgid "GloVe" +msgstr "" + +#: ../model_zoo/embeddings.md:18 ../model_zoo/embeddings.md:395 +msgid "FastText" +msgstr "" + +#: ../model_zoo/embeddings.md:19 ../model_zoo/embeddings.md:416 +msgid "使用方式" +msgstr "" + +#: ../model_zoo/embeddings.md:20 ../model_zoo/embeddings.md:427 +msgid "模型信息" +msgstr "" + +#: ../model_zoo/embeddings.md:21 ../model_zoo/embeddings.md:751 +msgid "致谢" +msgstr "" + +#: ../model_zoo/embeddings.md:22 ../model_zoo/embeddings.md:756 +msgid "参考论文" +msgstr "" + +#: ../model_zoo/embeddings.md:26 +msgid "PaddleNLP提供多个开源的预训练词向量模型,用户仅需在使用paddlenlp.embeddings.TokenEmbedding时,指定预训练模型的名称,即可加载相对应的预训练模型。以下将介绍TokenEmbeddign详细用法,并列出PaddleNLP所支持的预训练Embedding模型。" +msgstr "" + +#: ../model_zoo/embeddings.md:116 +msgid "使用深度学习可视化工具VisualDL的High Dimensional组件可以对embedding结果进行可视化展示,便于对其直观分析,步骤如下:" +msgstr "" + +#: ../model_zoo/embeddings.md:128 +msgid "执行完毕后会在当前路径下生成一个visualize目录,并将日志存放在其中,我们在命令行启动VisualDL即可进行查看,启动命令为:" +msgstr "" + +#: ../model_zoo/embeddings.md:132 +msgid "启动后打开浏览器即可看到可视化结果" +msgstr "" + +#: ../model_zoo/embeddings.md:138 +msgid "使用VisualDL除可视化embedding结果外,还可以对标量、图片、音频等进行可视化,有效提升训练调参效率。关于VisualDL更多功能和详细介绍,可参考VisualDL使用文档。" +msgstr "" + +#: ../model_zoo/embeddings.md:157 +msgid "" +"以下为TokenEmbedding简单的组网使用方法。有关更多TokenEmbedding训练流程相关的使用方法,请参考Word " +"Embedding with PaddleNLP。" +msgstr "" + +#: ../model_zoo/embeddings.md:185 +msgid "以下将列举PaddleNLP支持的Embedding预训练模型。" +msgstr "" + +#: ../model_zoo/embeddings.md:186 +msgid "模型命名方式为:${训练模型}.${语料}.${词向量类型}.${co-occurrence type}.dim${维度}。" +msgstr "" + +#: ../model_zoo/embeddings.md:187 +msgid "模型有三种,分别是Word2Vec(w2v, skip-gram), GloVe(glove)和FastText(fasttext)。" +msgstr "" + +#: 
../model_zoo/embeddings.md:191 +msgid "以下预训练词向量由Chinese-Word-Vectors提供。" +msgstr "" + +#: ../model_zoo/embeddings.md:193 +msgid "根据不同类型的上下文为每个语料训练多个目标词向量,第二列开始表示不同类型的上下文。以下为上下文类别:" +msgstr "" + +#: ../model_zoo/embeddings.md:195 +msgid "Word表示训练时目标词预测的上下文是一个Word。" +msgstr "" + +#: ../model_zoo/embeddings.md:196 +msgid "" +"Word + " +"N-gram表示训练时目标词预测的上下文是一个Word或者Ngram,其中bigram表示2-grams,ngram.1-2表示1-gram或者2-grams。" +msgstr "" + +#: ../model_zoo/embeddings.md:197 +msgid "" +"Word + Character表示训练时目标词预测的上下文是一个Word或者Character,其中word-" +"character.char1-2表示上下文是1个或2个Character。" +msgstr "" + +#: ../model_zoo/embeddings.md:198 +msgid "" +"Word + Character + Ngram表示训练时目标词预测的上下文是一个Word、Character或者Ngram。bigram-" +"char表示上下文是2-grams或者1个Character。" +msgstr "" + +#: ../model_zoo/embeddings.md:284 +msgid "特别地,对于百度百科语料,在不同的 Co-occurrence类型下分别提供了目标词与上下文向量:" +msgstr "" + +#: ../model_zoo/embeddings.md:418 +msgid "" +"以上所述的模型名称可直接以参数形式传入padddlenlp.embeddings.TokenEmbedding,加载相对应的模型。比如要加载语料为Wiki2017,通过FastText训练的预训练模型(fasttext" +".wiki-news.target.word-word.dim300.en),只需执行以下代码:" +msgstr "" + +#: ../model_zoo/embeddings.md:752 +msgid "感谢 Chinese-Word-Vectors提供Word2Vec中文预训练词向量。" +msgstr "" + +#: ../model_zoo/embeddings.md:753 +msgid "感谢 GloVe Project提供的GloVe英文预训练词向量。" +msgstr "" + +#: ../model_zoo/embeddings.md:754 +msgid "感谢 FastText Project提供的英文预训练词向量。" +msgstr "" + +#: ../model_zoo/embeddings.md:757 +msgid "" +"Li, Shen, et al. \"Analogical reasoning on chinese morphological and " +"semantic relations.\" arXiv preprint arXiv:1805.06504 (2018)." +msgstr "" + +#: ../model_zoo/embeddings.md:758 +msgid "" +"Qiu, Yuanyuan, et al. \"Revisiting correlations between intrinsic and " +"extrinsic evaluations of word embeddings.\" Chinese Computational " +"Linguistics and Natural Language Processing Based on Naturally Annotated " +"Big Data. Springer, Cham, 2018. 209-221." +msgstr "" + +#: ../model_zoo/embeddings.md:759 +msgid "" +"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. " +"GloVe: Global Vectors for Word Representation." +msgstr "" + +#: ../model_zoo/embeddings.md:760 +msgid "" +"T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. Advances in " +"Pre-Training Distributed Word Representations." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/index.po b/docs/locale/en/LC_MESSAGES/model_zoo/index.po new file mode 100644 index 0000000000000000000000000000000000000000..f3ddee8affb5262babcaa5b8366c11ffb18f9550 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/index.po @@ -0,0 +1,782 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/index.rst:75 +msgid "ALBERT" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "BART" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "BERT" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "BigBird" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Blenderbot" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Blenderbot-Small" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ChineseBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ConvBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "CTRL" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "DistilBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ELECTRA" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-CTM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-DOC" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-GEN" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-GRAM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-M" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "FNet" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Funnel" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "GPT" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "LayoutLM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "LayoutLMV2" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "LayoutXLM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Luke" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MBart" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MegatronBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MobileBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MPNet" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "NeZha" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "PPMiniLM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ProphetNet" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Reformer" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "RemBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "RoBERTa" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "RoFormer" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "SKEP" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "SqueezeBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "T5" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "TinyBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "UnifiedTransformer" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "UNIMO" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "XLNet" +msgstr "" + +#: ../model_zoo/index.rst:4 +msgid "PaddleNLP Transformer预训练模型" +msgstr "" + +#: ../model_zoo/index.rst:6 +msgid "" +"随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新了不同NLP任务的SOTA(State of the" +" Art),极大地推动了自然语言处理的进展。 PaddleNLP为用户提供了常用的预训练模型及其相应权重,如 " +"``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等,采用统一的API进行加载、训练和调用," +" 让开发者能够方便快捷地应用各种Transformer类预训练模型及其下游任务,且相应预训练模型权重下载速度快、稳定。" +msgstr "" + +#: ../model_zoo/index.rst:12 +msgid "预训练模型使用方法" +msgstr "" + +#: ../model_zoo/index.rst:14 +msgid "" +"PaddleNLP Transformer API在提供丰富预训练模型的同时,也降低了用户的使用门槛。 " 
+"使用Auto模块,可以加载不同网络结构的预训练模型,无需查找模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-tuning。" +msgstr "" + +#: ../model_zoo/index.rst:52 +msgid "" +"上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, 可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 " +"`_" +msgstr "" + +#: ../model_zoo/index.rst:55 +msgid "加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。" +msgstr "" + +#: ../model_zoo/index.rst:56 +msgid "" +"加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 " +"Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, " +"无需指定类别,即可调用不同网络结构的预训练模型。 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。" +" ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 " +"``num_classes`` 等, 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 " +"``from_pretrained`` 方法加载。" +msgstr "" + +#: ../model_zoo/index.rst:62 +msgid "通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。" +msgstr "" + +#: ../model_zoo/index.rst:63 +msgid "定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。" +msgstr "" + +#: ../model_zoo/index.rst:64 +msgid "定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。" +msgstr "" + +#: ../model_zoo/index.rst:68 +msgid "Transformer预训练模型汇总" +msgstr "" + +#: ../model_zoo/index.rst:70 +msgid "" +"PaddleNLP的Transformer预训练模型包含从 `huggingface.co`_ " +"直接转换的模型权重和百度自研模型权重,方便社区用户直接迁移使用。 目前共包含了40多个主流预训练模型,500多个模型权重。" +msgstr "" + +#: ../model_zoo/index.rst:125 +msgid "Transformer预训练模型适用任务汇总" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Model" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Sequence Classification" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Token Classification" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Question Answering" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Text Generation" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Multiple Choice" +msgstr "" + +#: ../model_zoo/index.rst:130 +msgid "ALBERT_" +msgstr "" + +#: ../model_zoo/index.rst:130 ../model_zoo/index.rst:132 +#: ../model_zoo/index.rst:134 ../model_zoo/index.rst:136 +#: ../model_zoo/index.rst:138 ../model_zoo/index.rst:140 +#: ../model_zoo/index.rst:142 ../model_zoo/index.rst:144 +#: ../model_zoo/index.rst:146 ../model_zoo/index.rst:148 +#: ../model_zoo/index.rst:150 ../model_zoo/index.rst:152 +#: ../model_zoo/index.rst:154 ../model_zoo/index.rst:156 +#: ../model_zoo/index.rst:158 ../model_zoo/index.rst:160 +#: ../model_zoo/index.rst:162 ../model_zoo/index.rst:164 +#: ../model_zoo/index.rst:166 ../model_zoo/index.rst:168 +#: ../model_zoo/index.rst:170 ../model_zoo/index.rst:172 +#: ../model_zoo/index.rst:174 ../model_zoo/index.rst:176 +#: ../model_zoo/index.rst:178 ../model_zoo/index.rst:180 +#: ../model_zoo/index.rst:182 ../model_zoo/index.rst:184 +#: ../model_zoo/index.rst:186 ../model_zoo/index.rst:188 +#: ../model_zoo/index.rst:190 ../model_zoo/index.rst:192 +#: ../model_zoo/index.rst:194 ../model_zoo/index.rst:196 +#: ../model_zoo/index.rst:198 ../model_zoo/index.rst:200 +#: ../model_zoo/index.rst:202 ../model_zoo/index.rst:204 +#: ../model_zoo/index.rst:206 ../model_zoo/index.rst:208 +#: ../model_zoo/index.rst:210 +msgid "✅" +msgstr "" + +#: ../model_zoo/index.rst:130 ../model_zoo/index.rst:132 +#: ../model_zoo/index.rst:134 ../model_zoo/index.rst:136 +#: ../model_zoo/index.rst:138 ../model_zoo/index.rst:140 +#: ../model_zoo/index.rst:142 ../model_zoo/index.rst:144 +#: ../model_zoo/index.rst:146 ../model_zoo/index.rst:148 +#: ../model_zoo/index.rst:150 ../model_zoo/index.rst:152 +#: ../model_zoo/index.rst:154 ../model_zoo/index.rst:156 +#: ../model_zoo/index.rst:158 ../model_zoo/index.rst:160 +#: 
../model_zoo/index.rst:162 ../model_zoo/index.rst:164 +#: ../model_zoo/index.rst:166 ../model_zoo/index.rst:168 +#: ../model_zoo/index.rst:170 ../model_zoo/index.rst:172 +#: ../model_zoo/index.rst:174 ../model_zoo/index.rst:176 +#: ../model_zoo/index.rst:178 ../model_zoo/index.rst:180 +#: ../model_zoo/index.rst:182 ../model_zoo/index.rst:184 +#: ../model_zoo/index.rst:186 ../model_zoo/index.rst:188 +#: ../model_zoo/index.rst:190 ../model_zoo/index.rst:192 +#: ../model_zoo/index.rst:194 ../model_zoo/index.rst:196 +#: ../model_zoo/index.rst:198 ../model_zoo/index.rst:200 +#: ../model_zoo/index.rst:202 ../model_zoo/index.rst:204 +#: ../model_zoo/index.rst:206 ../model_zoo/index.rst:208 +#: ../model_zoo/index.rst:210 +msgid "❌" +msgstr "" + +#: ../model_zoo/index.rst:132 +msgid "BART_" +msgstr "" + +#: ../model_zoo/index.rst:134 +msgid "BERT_" +msgstr "" + +#: ../model_zoo/index.rst:136 +msgid "BigBird_" +msgstr "" + +#: ../model_zoo/index.rst:138 +msgid "Blenderbot_" +msgstr "" + +#: ../model_zoo/index.rst:140 +msgid "Blenderbot-Small_" +msgstr "" + +#: ../model_zoo/index.rst:142 +msgid "ChineseBert_" +msgstr "" + +#: ../model_zoo/index.rst:144 +msgid "ConvBert_" +msgstr "" + +#: ../model_zoo/index.rst:146 +msgid "CTRL_" +msgstr "" + +#: ../model_zoo/index.rst:148 +msgid "DistilBert_" +msgstr "" + +#: ../model_zoo/index.rst:150 +msgid "ELECTRA_" +msgstr "" + +#: ../model_zoo/index.rst:152 +msgid "ERNIE_" +msgstr "" + +#: ../model_zoo/index.rst:154 +msgid "ERNIE-CTM_" +msgstr "" + +#: ../model_zoo/index.rst:156 +msgid "ERNIE-DOC_" +msgstr "" + +#: ../model_zoo/index.rst:158 +msgid "ERNIE-GEN_" +msgstr "" + +#: ../model_zoo/index.rst:160 +msgid "ERNIE-GRAM_" +msgstr "" + +#: ../model_zoo/index.rst:162 +msgid "ERNIE-M_" +msgstr "" + +#: ../model_zoo/index.rst:164 +msgid "FNet_" +msgstr "" + +#: ../model_zoo/index.rst:166 +msgid "Funnel_" +msgstr "" + +#: ../model_zoo/index.rst:168 +msgid "GPT_" +msgstr "" + +#: ../model_zoo/index.rst:170 +msgid "LayoutLM_" +msgstr "" + +#: ../model_zoo/index.rst:172 +msgid "LayoutLMV2_" +msgstr "" + +#: ../model_zoo/index.rst:174 +msgid "LayoutXLM_" +msgstr "" + +#: ../model_zoo/index.rst:176 +msgid "Luke_" +msgstr "" + +#: ../model_zoo/index.rst:178 +msgid "MBart_" +msgstr "" + +#: ../model_zoo/index.rst:180 +msgid "MegatronBert_" +msgstr "" + +#: ../model_zoo/index.rst:182 +msgid "MobileBert_" +msgstr "" + +#: ../model_zoo/index.rst:184 +msgid "MPNet_" +msgstr "" + +#: ../model_zoo/index.rst:186 +msgid "NeZha_" +msgstr "" + +#: ../model_zoo/index.rst:188 +msgid "PPMiniLM_" +msgstr "" + +#: ../model_zoo/index.rst:190 +msgid "ProphetNet_" +msgstr "" + +#: ../model_zoo/index.rst:192 +msgid "Reformer_" +msgstr "" + +#: ../model_zoo/index.rst:194 +msgid "RemBert_" +msgstr "" + +#: ../model_zoo/index.rst:196 +msgid "RoBERTa_" +msgstr "" + +#: ../model_zoo/index.rst:198 +msgid "RoFormer_" +msgstr "" + +#: ../model_zoo/index.rst:200 +msgid "SKEP_" +msgstr "" + +#: ../model_zoo/index.rst:202 +msgid "SqueezeBert_" +msgstr "" + +#: ../model_zoo/index.rst:204 +msgid "T5_" +msgstr "" + +#: ../model_zoo/index.rst:206 +msgid "TinyBert_" +msgstr "" + +#: ../model_zoo/index.rst:208 +msgid "UnifiedTransformer_" +msgstr "" + +#: ../model_zoo/index.rst:210 +msgid "XLNet_" +msgstr "" + +#: ../model_zoo/index.rst:259 +msgid "Reference" +msgstr "" + +#: ../model_zoo/index.rst:260 +msgid "" +"部分中文预训练模型来自: `brightmart/albert_zh " +"`_, `ymcui/Chinese-BERT-wwm " +"`_, `huawei-noah/Pretrained-" +"Language-Model/TinyBERT `_, `ymcui/Chinese-XLNet " +"`_, " 
+"`huggingface/xlnet_chinese_large " +"`_, `Knover/luge-" +"dialogue `_, `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ " +"`_, `ZhuiyiTechnology/simbert " +"`_" +msgstr "" + +#: ../model_zoo/index.rst:269 +msgid "" +"Lan, Zhenzhong, et al. \"Albert: A lite bert for self-supervised learning" +" of language representations.\" arXiv preprint arXiv:1909.11942 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:270 +msgid "" +"Lewis, Mike, et al. \"BART: Denoising Sequence-to-Sequence Pre-training " +"for Natural Language Generation, Translation, and Comprehension.\" arXiv " +"preprint arXiv:1910.13461 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:271 +msgid "" +"Devlin, Jacob, et al. \"Bert: Pre-training of deep bidirectional " +"transformers for language understanding.\" arXiv preprint " +"arXiv:1810.04805 (2018)." +msgstr "" + +#: ../model_zoo/index.rst:272 +msgid "" +"Zaheer, Manzil, et al. \"Big bird: Transformers for longer sequences.\" " +"arXiv preprint arXiv:2007.14062 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:273 +msgid "" +"Stephon, Emily, et al. \"Blenderbot: Recipes for building an open-domain " +"chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:274 +msgid "" +"Stephon, Emily, et al. \"Blenderbot-Small: Recipes for building an open-" +"domain chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:275 +msgid "" +"Sun, Zijun, et al. \"Chinesebert: Chinese pretraining enhanced by glyph " +"and pinyin information.\" arXiv preprint arXiv:2106.16038 (2021)." +msgstr "" + +#: ../model_zoo/index.rst:276 +msgid "" +"Zhang, zhengyan, et al. \"CPM: A Large-scale Generative Chinese Pre-" +"trained Language Model.\" arXiv preprint arXiv:2012.00413 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:277 +msgid "" +"Jiang, Zihang, et al. \"ConvBERT: Improving BERT with Span-based Dynamic " +"Convolution.\" arXiv preprint arXiv:2008.02496 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:278 +msgid "" +"Nitish, Bryan, et al. \"CTRL: A Conditional Transformer Language Model " +"for Controllable Generation.\" arXiv preprint arXiv:1909.05858 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:279 +msgid "" +"Sanh, Victor, et al. \"DistilBERT, a distilled version of BERT: smaller, " +"faster, cheaper and lighter.\" arXiv preprint arXiv:1910.01108 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:280 +msgid "" +"Clark, Kevin, et al. \"Electra: Pre-training text encoders as " +"discriminators rather than generators.\" arXiv preprint arXiv:2003.10555 " +"(2020)." +msgstr "" + +#: ../model_zoo/index.rst:281 +msgid "" +"Sun, Yu, et al. \"Ernie: Enhanced representation through knowledge " +"integration.\" arXiv preprint arXiv:1904.09223 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:282 +msgid "" +"Ding, Siyu, et al. \"ERNIE-Doc: A retrospective long-document modeling " +"transformer.\" arXiv preprint arXiv:2012.15688 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:283 +msgid "" +"Xiao, Dongling, et al. \"Ernie-gen: An enhanced multi-flow pre-training " +"and fine-tuning framework for natural language generation.\" arXiv " +"preprint arXiv:2001.11314 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:284 +msgid "" +"Xiao, Dongling, et al. \"ERNIE-Gram: Pre-Training with Explicitly N-Gram " +"Masked Language Modeling for Natural Language Understanding.\" arXiv " +"preprint arXiv:2010.12148 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:285 +msgid "" +"Ouyang, Xuan, et al. 
\"ERNIE-M: enhanced multilingual representation by " +"aligning cross-lingual semantics with monolingual corpora.\" arXiv " +"preprint arXiv:2012.15674 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:286 +msgid "" +"Lee-Thorp, James, et al. \"Fnet: Mixing tokens with fourier transforms.\"" +" arXiv preprint arXiv:2105.03824 (2021)." +msgstr "" + +#: ../model_zoo/index.rst:287 +msgid "" +"Dai, Zihang, et al. \"Funnel-transformer: Filtering out sequential " +"redundancy for efficient language processing.\" Advances in neural " +"information processing systems 33 (2020): 4271-4282." +msgstr "" + +#: ../model_zoo/index.rst:288 +msgid "" +"Radford, Alec, et al. \"Language models are unsupervised multitask " +"learners.\" OpenAI blog 1.8 (2019): 9." +msgstr "" + +#: ../model_zoo/index.rst:289 +msgid "" +"Xu, Yiheng, et al. \"LayoutLM: Pre-training of Text and Layout for " +"Document Image Understanding.\" arXiv preprint arXiv:1912.13318 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:290 +msgid "" +"Xu, Yang, et al. \"LayoutLMv2: Multi-modal Pre-training for Visually-Rich" +" Document Understanding\" arXiv preprint arXiv:2012.14740 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:291 +msgid "" +"Xu, Yiheng, et al. \"LayoutXLM: Multimodal Pre-training for Multilingual " +"Visually-rich Document Understanding\" arXiv preprint arXiv:2104.08836 " +"(2021)." +msgstr "" + +#: ../model_zoo/index.rst:292 +msgid "" +"Yamada, Ikuya, et al. \"Luke: deep contextualized entity representations " +"with entity-aware self-attention.\" arXiv preprint arXiv:2010.01057 " +"(2020)." +msgstr "" + +#: ../model_zoo/index.rst:293 +msgid "" +"Liu, Yinhan, et al. \"MBart: Multilingual Denoising Pre-training for " +"Neural Machine Translation\" arXiv preprint arXiv:2001.08210 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:294 +msgid "" +"Shoeybi, Mohammad, et al. \"Megatron-lm: Training multi-billion parameter" +" language models using model parallelism.\" arXiv preprint " +"arXiv:1909.08053 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:295 +msgid "" +"Sun, Zhiqing, et al. \"MobileBERT: a Compact Task-Agnostic BERT for " +"Resource-Limited Devices\" arXiv preprint arXiv:2004.02984 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:296 +msgid "" +"Song, Kaitao, et al. \"MPNet: Masked and Permuted Pre-training for " +"Language Understanding.\" arXiv preprint arXiv:2004.09297 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:297 +msgid "" +"Wei, Junqiu, et al. \"NEZHA: Neural contextualized representation for " +"chinese language understanding.\" arXiv preprint arXiv:1909.00204 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:298 +msgid "" +"Qi, Weizhen, et al. \"Prophetnet: Predicting future n-gram for sequence-" +"to-sequence pre-training.\" arXiv preprint arXiv:2001.04063 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:299 +msgid "" +"Kitaev, Nikita, et al. \"Reformer: The efficient Transformer.\" arXiv " +"preprint arXiv:2001.04451 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:300 +msgid "" +"Chung, Hyung Won, et al. \"Rethinking embedding coupling in pre-trained " +"language models.\" arXiv preprint arXiv:2010.12821 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:301 +msgid "" +"Liu, Yinhan, et al. \"Roberta: A robustly optimized bert pretraining " +"approach.\" arXiv preprint arXiv:1907.11692 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:302 +msgid "" +"Su Jianlin, et al. \"RoFormer: Enhanced Transformer with Rotary Position " +"Embedding.\" arXiv preprint arXiv:2104.09864 (2021)." 
+msgstr "" + +#: ../model_zoo/index.rst:303 +msgid "" +"Tian, Hao, et al. \"SKEP: Sentiment knowledge enhanced pre-training for " +"sentiment analysis.\" arXiv preprint arXiv:2005.05635 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:304 +msgid "" +"Forrest, ALbert, et al. \"SqueezeBERT: What can computer vision teach NLP" +" about efficient neural networks?\" arXiv preprint arXiv:2006.11316 " +"(2020)." +msgstr "" + +#: ../model_zoo/index.rst:305 +msgid "" +"Raffel, Colin, et al. \"T5: Exploring the Limits of Transfer Learning " +"with a Unified Text-to-Text Transformer.\" arXiv preprint " +"arXiv:1910.10683 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:306 +msgid "" +"Vaswani, Ashish, et al. \"Attention is all you need.\" arXiv preprint " +"arXiv:1706.03762 (2017)." +msgstr "" + +#: ../model_zoo/index.rst:307 +msgid "" +"Jiao, Xiaoqi, et al. \"Tinybert: Distilling bert for natural language " +"understanding.\" arXiv preprint arXiv:1909.10351 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:308 +msgid "" +"Bao, Siqi, et al. \"Plato-2: Towards building an open-domain chatbot via " +"curriculum learning.\" arXiv preprint arXiv:2006.16779 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:309 +msgid "" +"Yang, Zhilin, et al. \"Xlnet: Generalized autoregressive pretraining for " +"language understanding.\" arXiv preprint arXiv:1906.08237 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:310 +msgid "" +"Cui, Yiming, et al. \"Pre-training with whole word masking for chinese " +"bert.\" arXiv preprint arXiv:1906.08101 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:311 +msgid "" +"Wang, Quan, et al. “Building Chinese Biomedical Language Models via " +"Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/taskflow.po b/docs/locale/en/LC_MESSAGES/model_zoo/taskflow.po new file mode 100644 index 0000000000000000000000000000000000000000..a990df4e54a587f64a87e3ddb3e17467e07f5a57 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/taskflow.po @@ -0,0 +1,796 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/taskflow.md:1 +msgid "PaddleNLP一键预测功能:Taskflow API" +msgstr "" + +#: ../model_zoo/taskflow.md:23 +msgid "特性" +msgstr "" + +#: ../model_zoo/taskflow.md:24 +msgid "PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。" +msgstr "" + +#: ../model_zoo/taskflow.md:25 +msgid "最全的中文任务:覆盖自然语言理解与自然语言生成两大核心应用;" +msgstr "" + +#: ../model_zoo/taskflow.md:26 +msgid "极致的产业级效果:在多个中文场景上提供产业级的精度与预测性能;" +msgstr "" + +#: ../model_zoo/taskflow.md:27 +msgid "统一的应用范式:通过paddlenlp.Taskflow调用,简捷易用。" +msgstr "" + +#: ../model_zoo/taskflow.md:167 +msgid "QuickStart" +msgstr "" + +#: ../model_zoo/taskflow.md:169 +msgid "环境依赖" +msgstr "" + +#: ../model_zoo/taskflow.md:170 +msgid "python >= 3.6" +msgstr "" + +#: ../model_zoo/taskflow.md:171 +msgid "paddlepaddle >= 2.2.0" +msgstr "" + +#: ../model_zoo/taskflow.md:172 +msgid "paddlenlp >= 2.2.5" +msgstr "" + +#: ../model_zoo/taskflow.md:174 +msgid "taskflow1" +msgstr "" + +#: ../model_zoo/taskflow.md:176 +msgid "可进入 Jupyter Notebook 环境,在线体验 👉🏻 进入在线运行环境" +msgstr "" + +#: ../model_zoo/taskflow.md:178 +msgid "PaddleNLP Taskflow API 支持任务持续丰富中,我们将根据开发者反馈,灵活调整功能建设优先级,可通过Issue或问卷反馈给我们。" +msgstr "" + +#: ../model_zoo/taskflow.md:180 +msgid "社区交流👬" +msgstr "" + +#: ../model_zoo/taskflow.md:182 +msgid "微信扫描二维码并填写问卷之后,加入交流群领取福利" +msgstr "" + +#: ../model_zoo/taskflow.md:183 +msgid "获取5月18-19日每晚20:30《产业级通用信息抽取技术UIE+ERNIE轻量级模型》直播课链接" +msgstr "" + +#: ../model_zoo/taskflow.md:184 +msgid "10G重磅NLP学习大礼包:" +msgstr "" + +#: ../model_zoo/taskflow.md:190 +msgid "详细使用" +msgstr "" + +#: ../model_zoo/taskflow.md:192 +msgid "PART Ⅰ   一键预测" +msgstr "" + +#: ../model_zoo/taskflow.md:194 +msgid "中文分词" +msgstr "" + +#: ../model_zoo/taskflow.md:198 +msgid "三种分词模式,满足各类分词需求" +msgstr "" + +#: ../model_zoo/taskflow.md:220 ../model_zoo/taskflow.md:422 +#: ../model_zoo/taskflow.md:602 ../model_zoo/taskflow.md:1177 +#: ../model_zoo/taskflow.md:1209 +msgid "批量样本输入,平均速度更快" +msgstr "" + +#: ../model_zoo/taskflow.md:222 +msgid "输入为多个句子组成的list,平均速度会更快。" +msgstr "" + +#: ../model_zoo/taskflow.md:231 ../model_zoo/taskflow.md:369 +#: ../model_zoo/taskflow.md:542 +msgid "自定义词典" +msgstr "" + +#: ../model_zoo/taskflow.md:233 +msgid "" +"你可以通过传入user_dict参数,装载自定义词典来定制分词结果。 " +"在默认模式和精确模式下,词典文件每一行由一个或多个自定义item组成。词典文件user_dict.txt示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:240 +msgid "在快速模式下,词典文件每一行为一个自定义item+\"\\t\"+词频(词频可省略,词频省略则自动计算能保证分出该词的词频),暂时不支持黑名单词典(即通过设置”年“、”末“,以达到切分”年末“的目的)。词典文件user_dict.txt示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:246 +msgid "加载自定义词典及输出结果示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:256 +msgid "参数说明" +msgstr "" + +#: ../model_zoo/taskflow.md:257 +msgid "mode:指定分词模式,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:258 ../model_zoo/taskflow.md:393 +#: ../model_zoo/taskflow.md:572 ../model_zoo/taskflow.md:746 +#: ../model_zoo/taskflow.md:1073 ../model_zoo/taskflow.md:1092 +#: ../model_zoo/taskflow.md:1134 ../model_zoo/taskflow.md:1161 +#: ../model_zoo/taskflow.md:1186 ../model_zoo/taskflow.md:1217 +#: ../model_zoo/taskflow.md:1239 ../model_zoo/taskflow.md:1259 +#: ../model_zoo/taskflow.md:1278 +msgid "batch_size:批处理大小,请结合机器情况进行调整,默认为1。" +msgstr "" + +#: ../model_zoo/taskflow.md:259 +msgid 
"user_dict:自定义词典文件路径,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:260 ../model_zoo/taskflow.md:395 +#: ../model_zoo/taskflow.md:574 ../model_zoo/taskflow.md:753 +#: ../model_zoo/taskflow.md:1094 ../model_zoo/taskflow.md:1137 +#: ../model_zoo/taskflow.md:1162 ../model_zoo/taskflow.md:1188 +#: ../model_zoo/taskflow.md:1219 +msgid "task_path:自定义任务路径,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:263 +msgid "词性标注" +msgstr "" + +#: ../model_zoo/taskflow.md:267 +msgid "支持单条和批量预测" +msgstr "" + +#: ../model_zoo/taskflow.md:280 +msgid "标签集合" +msgstr "" + +#: ../model_zoo/taskflow.md:371 +msgid "你可以通过装载自定义词典来定制化分词和词性标注结果。词典文件每一行表示一个自定义item,可以由一个单词或者多个单词组成,单词后面可以添加自定义标签,格式为item/tag,如果不添加自定义标签,则使用模型默认标签n。" +msgstr "" + +#: ../model_zoo/taskflow.md:373 ../model_zoo/taskflow.md:546 +msgid "词典文件user_dict.txt示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:381 ../model_zoo/taskflow.md:561 +msgid "装载自定义词典及输出结果示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:392 ../model_zoo/taskflow.md:571 +#: ../model_zoo/taskflow.md:745 ../model_zoo/taskflow.md:1072 +#: ../model_zoo/taskflow.md:1160 ../model_zoo/taskflow.md:1185 +#: ../model_zoo/taskflow.md:1216 ../model_zoo/taskflow.md:1238 +#: ../model_zoo/taskflow.md:1258 +msgid "可配置参数说明" +msgstr "" + +#: ../model_zoo/taskflow.md:394 ../model_zoo/taskflow.md:573 +#: ../model_zoo/taskflow.md:1095 +msgid "user_dict:用户自定义词典文件,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:398 ../model_zoo/taskflow.md:763 +msgid "命名实体识别" +msgstr "" + +#: ../model_zoo/taskflow.md:402 +msgid "支持两种模式" +msgstr "" + +#: ../model_zoo/taskflow.md:430 +msgid "实体标签说明" +msgstr "" + +#: ../model_zoo/taskflow.md:432 +msgid "精确模式采用的标签集合" +msgstr "" + +#: ../model_zoo/taskflow.md:434 +msgid "包含66种词性及专名类别标签,标签集合如下表:" +msgstr "" + +#: ../model_zoo/taskflow.md:453 +msgid "快速模式采用的标签集合" +msgstr "" + +#: ../model_zoo/taskflow.md:544 +msgid "你可以通过装载自定义词典来定制化命名实体识别结果。词典文件每一行表示一个自定义item,可以由一个term或者多个term组成,term后面可以添加自定义标签,格式为item/tag,如果不添加自定义标签,则使用模型默认标签。" +msgstr "" + +#: ../model_zoo/taskflow.md:555 +msgid "以\"《长津湖》收尾,北美是最大海外票仓\"为例,原本的输出结果为:" +msgstr "" + +#: ../model_zoo/taskflow.md:575 +msgid "entity_only:只返回实体/概念词及其对应标签。" +msgstr "" + +#: ../model_zoo/taskflow.md:579 +msgid "依存句法分析" +msgstr "" + +#: ../model_zoo/taskflow.md:582 +msgid "支持多种形式输入" +msgstr "" + +#: ../model_zoo/taskflow.md:584 +msgid "未分词输入:" +msgstr "" + +#: ../model_zoo/taskflow.md:594 +msgid "使用分词结果来输入:" +msgstr "" + +#: ../model_zoo/taskflow.md:610 +msgid "多种模型选择,满足精度、速度需求" +msgstr "" + +#: ../model_zoo/taskflow.md:612 +msgid "使用ERNIE 1.0进行预测" +msgstr "" + +#: ../model_zoo/taskflow.md:620 +msgid "" +"除ERNIE 1.0外,还可使用ERNIE-Gram预训练模型,其中model=ddparser(基于LSTM " +"Encoder)速度最快,model=ddparser-ernie-gram-zh和model=ddparser-" +"ernie-1.0效果更优(两者效果相当)。" +msgstr "" + +#: ../model_zoo/taskflow.md:622 +msgid "输出方式" +msgstr "" + +#: ../model_zoo/taskflow.md:624 +msgid "输出概率值和词性标签:" +msgstr "" + +#: ../model_zoo/taskflow.md:632 +msgid "依存关系可视化" +msgstr "" + +#: ../model_zoo/taskflow.md:646 +msgid "依存句法分析标注关系集合" +msgstr "" + +#: ../model_zoo/taskflow.md:747 +msgid "model:选择任务使用的模型,可选有ddparser,ddparser-ernie-1.0和ddparser-ernie-gram-zh。" +msgstr "" + +#: ../model_zoo/taskflow.md:748 +msgid "tree:确保输出结果是正确的依存句法树,默认为True。" +msgstr "" + +#: ../model_zoo/taskflow.md:749 +msgid "prob:是否输出每个弧对应的概率值,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:750 +msgid "use_pos:是否返回词性标签,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:751 +msgid "use_cuda:是否使用GPU进行切词,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:752 +msgid 
"return_visual:是否返回句法树的可视化结果,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:756 +msgid "信息抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:759 +msgid "开放域信息抽取是信息抽取的一种全新范式,主要思想是减少人工参与,利用单一模型支持多种类型的开放抽取任务,用户可以使用自然语言自定义抽取目标,在实体、关系类别等未定义的情况下抽取输入文本中的信息片段。" +msgstr "" + +#: ../model_zoo/taskflow.md:761 +msgid "支持多场景信息抽取任务" +msgstr "" + +#: ../model_zoo/taskflow.md:765 +msgid "" +"命名实体识别(Named Entity " +"Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。" +msgstr "" + +#: ../model_zoo/taskflow.md:767 +msgid "例如抽取的目标实体类型是\"时间\"、\"选手\"和\"赛事名称\", schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:773 ../model_zoo/taskflow.md:804 +#: ../model_zoo/taskflow.md:844 ../model_zoo/taskflow.md:899 +#: ../model_zoo/taskflow.md:923 ../model_zoo/taskflow.md:959 +#: ../model_zoo/taskflow.md:990 +msgid "预测:" +msgstr "" + +#: ../model_zoo/taskflow.md:796 +msgid "例如抽取的目标实体类型是\"肿瘤的大小\"、\"肿瘤的个数\"、\"肝癌级别\"和\"脉管内癌栓分级\", schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:802 +msgid "在上例中我们已经实例化了一个Taskflow对象,这里可以通过set_schema方法重置抽取目标。" +msgstr "" + +#: ../model_zoo/taskflow.md:828 +msgid "关系抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:830 +msgid "" +"关系抽取(Relation " +"Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。" +msgstr "" + +#: ../model_zoo/taskflow.md:832 +msgid "例如以\"竞赛名称\"作为抽取主体,抽取关系类型为\"主办方\"、\"承办方\"和\"已举办次数\", schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:880 +msgid "事件抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:882 +msgid "事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词和事件要素,组合为相应的结构化信息。" +msgstr "" + +#: ../model_zoo/taskflow.md:884 +msgid "例如抽取的目标是\"地震\"事件的\"地震强度\"、\"时间\"、\"震中位置\"和\"震源深度\"这些信息,schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:897 +msgid "触发词的格式统一为XX触发词,XX表示具体事件类型,上例中的事件类型是地震,则对应触发词为地震触发词。" +msgstr "" + +#: ../model_zoo/taskflow.md:908 +msgid "评论观点抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:910 +msgid "评论观点抽取,是指抽取文本中包含的评价维度、观点词。" +msgstr "" + +#: ../model_zoo/taskflow.md:912 +msgid "例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:951 +msgid "情感倾向分类" +msgstr "" + +#: ../model_zoo/taskflow.md:953 +msgid "句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:968 +msgid "跨任务抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:970 +msgid "例如在法律场景同时对文本进行实体抽取和关系抽取,schema可按照如下方式进行构造:" +msgstr "" + +#: ../model_zoo/taskflow.md:1019 +msgid "多模型选择,满足精度、速度要求" +msgstr "" + +#: ../model_zoo/taskflow.md:1021 +msgid "模型选择" +msgstr "" + +#: ../model_zoo/taskflow.md:1046 +msgid "使用UIE-Tiny进行预测" +msgstr "" + +#: ../model_zoo/taskflow.md:1057 +msgid "定制训练" +msgstr "" + +#: ../model_zoo/taskflow.md:1059 +msgid "" +"对于简单的抽取目标可以直接使用paddlenlp.Taskflow实现零样本(zero-" +"shot)抽取,对于细分场景我们推荐使用定制训练(标注少量数据进行模型微调)以进一步提升效果。" +msgstr "" + +#: ../model_zoo/taskflow.md:1061 +msgid "我们在互联网、医疗、金融三大垂类自建测试集上进行了实验:" +msgstr "" + +#: ../model_zoo/taskflow.md:1070 +msgid "0-shot表示无训练数据直接通过paddlenlp.Taskflow进行预测,5-shot表示基于5条标注数据进行模型微调。" +msgstr "" + +#: ../model_zoo/taskflow.md:1074 +msgid "model:选择任务使用的模型,默认为uie-base,可选有uie-tiny,uie-base和uie-medical-base。" +msgstr "" + +#: ../model_zoo/taskflow.md:1075 +msgid "schema:定义任务抽取目标,可参考示例中对于不同信息抽取任务的schema配置自定义抽取目标。" +msgstr "" + +#: ../model_zoo/taskflow.md:1076 +msgid "position_prob:模型对于span的起始位置/终止位置的结果概率0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。" +msgstr "" + +#: ../model_zoo/taskflow.md:1079 +msgid "解语知识标注" +msgstr "" + +#: ../model_zoo/taskflow.md:1082 +msgid "词类知识标注" +msgstr "" + +#: ../model_zoo/taskflow.md:1091 ../model_zoo/taskflow.md:1133 
+msgid "可配置参数说明:" +msgstr "" + +#: ../model_zoo/taskflow.md:1093 +msgid "linking:实现基于词类的linking,默认为True。" +msgstr "" + +#: ../model_zoo/taskflow.md:1098 +msgid "知识挖掘-词类知识标注任务共包含66种词性及专名类别标签,标签集合如下表:" +msgstr "" + +#: ../model_zoo/taskflow.md:1118 +msgid "名词短语标注" +msgstr "" + +#: ../model_zoo/taskflow.md:1135 +msgid "max_seq_len:最大序列长度,默认为64。" +msgstr "" + +#: ../model_zoo/taskflow.md:1136 +msgid "linking:实现与WordTag类别标签的linking,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:1142 +msgid "文本纠错" +msgstr "" + +#: ../model_zoo/taskflow.md:1146 ../model_zoo/taskflow.md:1225 +#: ../model_zoo/taskflow.md:1245 +msgid "支持单条、批量预测" +msgstr "" + +#: ../model_zoo/taskflow.md:1165 +msgid "文本相似度" +msgstr "" + +#: ../model_zoo/taskflow.md:1168 +msgid "单条输入" +msgstr "" + +#: ../model_zoo/taskflow.md:1187 +msgid "max_seq_len:最大序列长度,默认为128。" +msgstr "" + +#: ../model_zoo/taskflow.md:1191 +msgid "情感倾向分析" +msgstr "" + +#: ../model_zoo/taskflow.md:1194 +msgid "支持不同模型,速度快和精度高两种模式" +msgstr "" + +#: ../model_zoo/taskflow.md:1218 +msgid "model:选择任务使用的模型,可选有bilstm和skep_ernie_1.0_large_ch。" +msgstr "" + +#: ../model_zoo/taskflow.md:1222 +msgid "生成式问答" +msgstr "" + +#: ../model_zoo/taskflow.md:1242 +msgid "智能写诗" +msgstr "" + +#: ../model_zoo/taskflow.md:1262 +msgid "开放域对话" +msgstr "" + +#: ../model_zoo/taskflow.md:1265 +msgid "非交互模式" +msgstr "" + +#: ../model_zoo/taskflow.md:1276 +msgid "可配置参数:" +msgstr "" + +#: ../model_zoo/taskflow.md:1279 +msgid "max_seq_len:最大序列长度,默认为512。" +msgstr "" + +#: ../model_zoo/taskflow.md:1281 +msgid "交互模式" +msgstr "" + +#: ../model_zoo/taskflow.md:1299 +msgid "交互模式参数:" +msgstr "" + +#: ../model_zoo/taskflow.md:1300 +msgid "max_turn:任务能记忆的对话轮数,当max_turn为1时,模型只能记住当前对话,无法获知之前的对话内容。" +msgstr "" + +#: ../model_zoo/taskflow.md:1304 +msgid "PART Ⅱ   定制化训练" +msgstr "" + +#: ../model_zoo/taskflow.md:1308 +msgid "如果你有自己的业务数据集,可以对模型效果进一步调优,支持定制化训练的任务如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:1397 +msgid "这里我们以命名实体识别Taskflow(\"ner\", mode=\"accurate\")为例,展示如何定制自己的模型。" +msgstr "" + +#: ../model_zoo/taskflow.md:1399 +msgid "调用Taskflow接口后,程序自动将相关文件下载到$HOME/.paddlenlp/taskflow/wordtag/,该默认路径包含以下文件:" +msgstr "" + +#: ../model_zoo/taskflow.md:1408 +msgid "参考上表中对应示例准备数据集和标签文件tags.txt,执行相应训练脚本得到自己的model_state.pdparams和model_config.json。" +msgstr "" + +#: ../model_zoo/taskflow.md:1410 +msgid "根据自己数据集情况,修改标签文件tags.txt。" +msgstr "" + +#: ../model_zoo/taskflow.md:1412 +msgid "将以上文件保存到任意路径中,自定义路径下的文件需要和默认路径的文件一致:" +msgstr "" + +#: ../model_zoo/taskflow.md:1420 +msgid "通过task_path指定自定义路径,使用Taskflow加载自定义模型进行一键预测:" +msgstr "" + +#: ../model_zoo/taskflow.md:1428 +msgid "模型算法" +msgstr "" + +#: ../model_zoo/taskflow.md:1456 +msgid "FAQ" +msgstr "" + +#: ../model_zoo/taskflow.md:1460 +msgid "" +"A: " +"Taskflow默认会将任务相关模型等文件保存到$HOME/.paddlenlp下,可以在任务初始化的时候通过home_path自定义修改保存路径。示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:1466 +msgid "通过以上方式即可将ner任务相关文件保存至/workspace路径下。" +msgstr "" + +#: ../model_zoo/taskflow.md:1472 +msgid "" +"A: " +"Taskflow默认会将任务相关模型等文件保存到$HOME/.paddlenlp/taskflow下,如果下载或调用失败,可删除相应路径下的文件,重新尝试即可" +msgstr "" + +#: ../model_zoo/taskflow.md:1478 +msgid "A: 可以结合设备情况适当调整batch_size,采用批量输入的方式来提升平均速率。示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:1489 +msgid "通过上述方式进行分词可以大幅提升预测速度。" +msgstr "" + +#: ../model_zoo/taskflow.md:1495 +msgid "A: Taskflow支持任务持续丰富中,我们将根据开发者反馈,灵活调整功能建设优先级,可通过Issue或问卷反馈给我们。" +msgstr "" + +#: ../model_zoo/taskflow.md:1500 +msgid "附录" +msgstr "" + +#: ../model_zoo/taskflow.md:1504 +msgid "fxsjy/jieba" +msgstr "" + +#: ../model_zoo/taskflow.md:1505 +msgid "ZhuiyiTechnology/simbert" 
+msgstr "" + +#: ../model_zoo/taskflow.md:1506 +msgid "CPM: A Large-scale Generative Chinese Pre-trained Language Model" +msgstr "" + +#~ msgid "PaddleNLP Taskflow" +#~ msgstr "" + +#~ msgid "介绍" +#~ msgstr "" + +#~ msgid "任务清单" +#~ msgstr "" + +#~ msgid "用法" +#~ msgstr "" + +#~ msgid "查看使用示例" +#~ msgstr "" + +#~ msgid "句法分析" +#~ msgstr "" + +#~ msgid "情感分析" +#~ msgstr "" + +#~ msgid "『解语』-词类知识标注" +#~ msgstr "" + +#~ msgid "『解语』-名词短语标注" +#~ msgstr "" + +#~ msgid "自定义任务" +#~ msgstr "" + +#~ msgid "paddlenlp.Taskflow提供开箱即用的NLP预置任务,覆盖自然语言理解与自然语言生成两大核心应用,在中文场景上提供产业级的效果与极致的预测性能。" +#~ msgstr "" + +#~ msgid "随着版本迭代会持续开放更多的应用场景。" +#~ msgstr "" + +#~ msgid "安装" +#~ msgstr "" + +#~ msgid "paddlenlp >= 2.2.0" +#~ msgstr "" + +#~ msgid "支持三种模式分词" +#~ msgstr "" + +#~ msgid "Base模式(默认)" +#~ msgstr "" + +#~ msgid "快速模式" +#~ msgstr "" + +#~ msgid "利用『结巴』中文分词工具,实现文本快速切分。" +#~ msgstr "" + +#~ msgid "精确模式" +#~ msgstr "" + +#~ msgid "试图将句子中的实体词完整切分,分词精确度高。" +#~ msgstr "" + +#~ msgid "快速模式词典载入方式:" +#~ msgstr "" + +#~ msgid "用户可以在词典文件每一行有两个部分:词语、词频(可省略),用空格隔开。词频省略则自动计算能保证分出该词的词频。" +#~ msgstr "" + +#~ msgid "\"国家卫健委修订完成了新冠肺炎诊疗方案\"原本的输出结果为:" +#~ msgstr "" + +#~ msgid "Base、精确模式词典载入方式:" +#~ msgstr "" + +#~ msgid "词典文件每一行表示一个自定义item。" +#~ msgstr "" + +#~ msgid "以默认模型为例,\"平原上的火焰计划于年末上映\"原本的输出结果为:" +#~ msgstr "" + +#~ msgid "标签集合:" +#~ msgstr "" + +#~ msgid "用户可以通过装载自定义词典来定制化分词和词性标注结果。词典文件每一行表示一个自定义item,可以由一个单词或者多个单词组成,单词后面可以添加自定义标签,格式为item/tag,如果不添加自定义标签,则使用模型默认标签。" +#~ msgstr "" + +#~ msgid "以\"赛里木湖是新疆海拔最高的高山湖泊\"为例,原本的输出结果为:" +#~ msgstr "" + +#~ msgid "精确模式(默认)" +#~ msgstr "" + +#~ msgid "只返回实体/概念词:" +#~ msgstr "" + +#~ msgid "entity_only:是否返回所有词性标签;若设置为True,则只返回实体/概念词;默认为False。" +#~ msgstr "" + +#~ msgid "使用ddparser-ernie-1.0进行预测:" +#~ msgstr "" + +#~ msgid "依存关系可视化:" +#~ msgstr "" + +#~ msgid "标注关系说明:" +#~ msgstr "" + +#~ msgid "使用BiLSTM模型:" +#~ msgstr "" + +#~ msgid "使用SKEP情感分析预训练模型进行预测:" +#~ msgstr "" + +#~ msgid "知识挖掘-词类知识标注" +#~ msgstr "" + +#~ msgid "知识挖掘-词类知识标注任务共包含66种词性及专名类别标签,标签集合如下表" +#~ msgstr "" + +#~ msgid "知识挖掘-名词短语标注" +#~ msgstr "" + +#~ msgid "非交互模式:" +#~ msgstr "" + +#~ msgid "交互模式:" +#~ msgstr "" + +#~ msgid "交互模式下,Taskflow具备多轮对话记忆功能。" +#~ msgstr "" + +#~ msgid "max_turn:仅在交互模式有效,表示任务能记忆的对话轮数;当max_turn为1时,模型只能记住当前对话,无法获知之前的对话内容。" +#~ msgstr "" + +#~ msgid "Taskflow提供了定制接口来使用自己的数据对模型进行微调/训练,适配任务如下:" +#~ msgstr "" + +#~ msgid "定制任务示例" +#~ msgstr "" + +#~ msgid "任务的默认路径为$HOME/.paddlenlp/taskflow/ner/wordtag/,该默认路径包含以下文件:" +#~ msgstr "" + +#~ msgid "参考表中对应示例准备数据集和标签文件tags.txt,执行相应训练脚本得到自己的model_state.pdparams和model_config.json。" +#~ msgstr "" + +#~ msgid "通过task_path指定用户自定义路径,自定义路径下的文件需要和默认路径的文件一致:" +#~ msgstr "" + +#~ msgid "使用Taskflow加载自定义模型进行一键预测:" +#~ msgstr "" + +#~ msgid "Q1 Taskflow如何修改任务保存路径?" +#~ msgstr "" + +#~ msgid "" +#~ "A: " +#~ "Taskflow默认会将任务相关模型等文件保存到$HOME/.paddlenlp下,可以在任务初始化的时候通过home_path自定义修改保存路径。" +#~ msgstr "" + +#~ msgid "示例:" +#~ msgstr "" + +#~ msgid "参考资料" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers.po new file mode 100644 index 0000000000000000000000000000000000000000..4b7022a1515c94fe5f5d34cea98ee912e36dfcfa --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers.po @@ -0,0 +1,1913 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../model_zoo/transformers.rst:2 +msgid "PaddleNLP Transformer API" +msgstr "" + +#: ../model_zoo/transformers.rst:4 +msgid "" +"随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新各种NLP任务SOTA(State of the " +"Art)。 PaddleNLP为用户提供了常用的 " +"``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等经典结构预训练模型, " +"让开发者能够方便快捷应用各类Transformer预训练模型及其下游任务。" +msgstr "" + +#: ../model_zoo/transformers.rst:10 +msgid "Transformer预训练模型汇总" +msgstr "" + +#: ../model_zoo/transformers.rst:14 +msgid "" +"下表汇总了介绍了目前PaddleNLP支持的各类预训练模型以及对应预训练权重。我们目前提供了 **32** 种网络结构, **136** " +"种预训练的参数权重供用户使用, 其中包含了 **59** 种中文语言模型的预训练权重。" +msgstr "" + +#: ../model_zoo/transformers.rst:18 ../model_zoo/transformers.rst:655 +msgid "Model" +msgstr "" + +#: ../model_zoo/transformers.rst:18 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers.rst:18 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers.rst:18 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers.rst:20 ../model_zoo/transformers.rst:657 +msgid "ALBERT_" +msgstr "" + +#: ../model_zoo/transformers.rst:20 +msgid "``albert-base-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:20 ../model_zoo/transformers.rst:24 +#: ../model_zoo/transformers.rst:28 ../model_zoo/transformers.rst:32 +#: ../model_zoo/transformers.rst:36 ../model_zoo/transformers.rst:40 +#: ../model_zoo/transformers.rst:44 ../model_zoo/transformers.rst:48 +#: ../model_zoo/transformers.rst:76 ../model_zoo/transformers.rst:80 +#: ../model_zoo/transformers.rst:84 ../model_zoo/transformers.rst:88 +#: ../model_zoo/transformers.rst:92 ../model_zoo/transformers.rst:96 +#: ../model_zoo/transformers.rst:148 ../model_zoo/transformers.rst:196 +#: ../model_zoo/transformers.rst:200 ../model_zoo/transformers.rst:204 +#: ../model_zoo/transformers.rst:208 ../model_zoo/transformers.rst:212 +#: ../model_zoo/transformers.rst:216 ../model_zoo/transformers.rst:220 +#: ../model_zoo/transformers.rst:224 ../model_zoo/transformers.rst:228 +#: ../model_zoo/transformers.rst:232 ../model_zoo/transformers.rst:236 +#: ../model_zoo/transformers.rst:241 ../model_zoo/transformers.rst:246 +#: ../model_zoo/transformers.rst:252 ../model_zoo/transformers.rst:256 +#: ../model_zoo/transformers.rst:260 ../model_zoo/transformers.rst:264 +#: ../model_zoo/transformers.rst:296 ../model_zoo/transformers.rst:300 +#: ../model_zoo/transformers.rst:304 ../model_zoo/transformers.rst:312 +#: ../model_zoo/transformers.rst:316 ../model_zoo/transformers.rst:320 +#: ../model_zoo/transformers.rst:324 ../model_zoo/transformers.rst:342 +#: ../model_zoo/transformers.rst:346 ../model_zoo/transformers.rst:350 +#: ../model_zoo/transformers.rst:354 ../model_zoo/transformers.rst:358 +#: ../model_zoo/transformers.rst:362 ../model_zoo/transformers.rst:366 +#: ../model_zoo/transformers.rst:370 ../model_zoo/transformers.rst:378 +#: ../model_zoo/transformers.rst:382 ../model_zoo/transformers.rst:386 +#: ../model_zoo/transformers.rst:390 ../model_zoo/transformers.rst:394 +#: ../model_zoo/transformers.rst:398 ../model_zoo/transformers.rst:402 +#: ../model_zoo/transformers.rst:406 ../model_zoo/transformers.rst:411 +#: ../model_zoo/transformers.rst:416 
../model_zoo/transformers.rst:421 +#: ../model_zoo/transformers.rst:425 ../model_zoo/transformers.rst:445 +#: ../model_zoo/transformers.rst:448 ../model_zoo/transformers.rst:467 +#: ../model_zoo/transformers.rst:471 ../model_zoo/transformers.rst:475 +#: ../model_zoo/transformers.rst:479 ../model_zoo/transformers.rst:527 +#: ../model_zoo/transformers.rst:531 ../model_zoo/transformers.rst:540 +#: ../model_zoo/transformers.rst:545 ../model_zoo/transformers.rst:550 +#: ../model_zoo/transformers.rst:554 ../model_zoo/transformers.rst:558 +#: ../model_zoo/transformers.rst:562 ../model_zoo/transformers.rst:566 +#: ../model_zoo/transformers.rst:570 ../model_zoo/transformers.rst:574 +#: ../model_zoo/transformers.rst:579 ../model_zoo/transformers.rst:584 +#: ../model_zoo/transformers.rst:589 ../model_zoo/transformers.rst:628 +#: ../model_zoo/transformers.rst:632 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers.rst:20 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model" +msgstr "" + +#: ../model_zoo/transformers.rst:24 +msgid "``albert-large-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:24 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model" +msgstr "" + +#: ../model_zoo/transformers.rst:28 +msgid "``albert-xlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:28 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model" +msgstr "" + +#: ../model_zoo/transformers.rst:32 +msgid "``albert-xxlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:32 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model" +msgstr "" + +#: ../model_zoo/transformers.rst:36 +msgid "``albert-base-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:36 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:40 +msgid "``albert-large-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:40 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:44 +msgid "``albert-xlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:44 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:48 +msgid "``albert-xxlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:48 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. 
ALBERT xxlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:52 +msgid "``albert-chinese-tiny``" +msgstr "" + +#: ../model_zoo/transformers.rst:52 ../model_zoo/transformers.rst:56 +#: ../model_zoo/transformers.rst:60 ../model_zoo/transformers.rst:64 +#: ../model_zoo/transformers.rst:68 ../model_zoo/transformers.rst:72 +#: ../model_zoo/transformers.rst:112 ../model_zoo/transformers.rst:117 +#: ../model_zoo/transformers.rst:123 ../model_zoo/transformers.rst:129 +#: ../model_zoo/transformers.rst:133 ../model_zoo/transformers.rst:137 +#: ../model_zoo/transformers.rst:154 ../model_zoo/transformers.rst:159 +#: ../model_zoo/transformers.rst:164 ../model_zoo/transformers.rst:169 +#: ../model_zoo/transformers.rst:173 ../model_zoo/transformers.rst:268 +#: ../model_zoo/transformers.rst:272 ../model_zoo/transformers.rst:276 +#: ../model_zoo/transformers.rst:280 ../model_zoo/transformers.rst:284 +#: ../model_zoo/transformers.rst:288 ../model_zoo/transformers.rst:292 +#: ../model_zoo/transformers.rst:308 ../model_zoo/transformers.rst:329 +#: ../model_zoo/transformers.rst:333 ../model_zoo/transformers.rst:337 +#: ../model_zoo/transformers.rst:374 ../model_zoo/transformers.rst:429 +#: ../model_zoo/transformers.rst:433 ../model_zoo/transformers.rst:437 +#: ../model_zoo/transformers.rst:441 ../model_zoo/transformers.rst:451 +#: ../model_zoo/transformers.rst:456 ../model_zoo/transformers.rst:461 +#: ../model_zoo/transformers.rst:464 ../model_zoo/transformers.rst:483 +#: ../model_zoo/transformers.rst:487 ../model_zoo/transformers.rst:491 +#: ../model_zoo/transformers.rst:495 ../model_zoo/transformers.rst:499 +#: ../model_zoo/transformers.rst:503 ../model_zoo/transformers.rst:507 +#: ../model_zoo/transformers.rst:511 ../model_zoo/transformers.rst:515 +#: ../model_zoo/transformers.rst:519 ../model_zoo/transformers.rst:523 +#: ../model_zoo/transformers.rst:535 ../model_zoo/transformers.rst:594 +#: ../model_zoo/transformers.rst:599 ../model_zoo/transformers.rst:604 +#: ../model_zoo/transformers.rst:608 ../model_zoo/transformers.rst:612 +#: ../model_zoo/transformers.rst:616 ../model_zoo/transformers.rst:620 +#: ../model_zoo/transformers.rst:624 ../model_zoo/transformers.rst:636 +#: ../model_zoo/transformers.rst:640 ../model_zoo/transformers.rst:644 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers.rst:52 +msgid "" +"4 repeating layers, 128 embedding, 312-hidden, 12-heads, 4M parameters. " +"ALBERT tiny model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:56 +msgid "``albert-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:56 +msgid "" +"6 repeating layers, 128 embedding, 384-hidden, 12-heads, _M parameters. " +"ALBERT small model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:60 +msgid "``albert-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:60 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 12M parameters." +" ALBERT base model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:64 +msgid "``albert-chinese-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:64 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 18M " +"parameters. ALBERT large model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:68 +msgid "``albert-chinese-xlarge``" +msgstr "" + +#: ../model_zoo/transformers.rst:68 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 60M " +"parameters. 
ALBERT xlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:72 +msgid "``albert-chinese-xxlarge``" +msgstr "" + +#: ../model_zoo/transformers.rst:72 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 16-heads, 235M " +"parameters. ALBERT xxlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:76 ../model_zoo/transformers.rst:659 +msgid "BART_" +msgstr "" + +#: ../model_zoo/transformers.rst:76 +msgid "``bart-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:76 +msgid "12-layer, 768-hidden, 12-heads, 217M parameters. BART base model (English)" +msgstr "" + +#: ../model_zoo/transformers.rst:80 +msgid "``bart-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:80 +msgid "" +"24-layer, 768-hidden, 16-heads, 509M parameters. BART large model " +"(English)." +msgstr "" + +#: ../model_zoo/transformers.rst:84 ../model_zoo/transformers.rst:661 +msgid "BERT_" +msgstr "" + +#: ../model_zoo/transformers.rst:84 +msgid "``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:84 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:88 +msgid "``bert-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:88 ../model_zoo/transformers.rst:304 +#: ../model_zoo/transformers.rst:320 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:92 +msgid "``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:92 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English" +" text." +msgstr "" + +#: ../model_zoo/transformers.rst:96 +msgid "``bert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:96 +msgid "" +"24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:100 +msgid "``bert-base-multilingual-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:100 ../model_zoo/transformers.rst:106 +#: ../model_zoo/transformers.rst:141 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers.rst:100 +msgid "" +"12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased " +"text in the top 102 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers.rst:106 +msgid "``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:106 +msgid "" +"12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in" +" the top 104 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers.rst:112 +msgid "``bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:112 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text." +msgstr "" + +#: ../model_zoo/transformers.rst:117 +msgid "``bert-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:117 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers.rst:123 +msgid "``bert-wwm-ext-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:123 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking with extented " +"data." 
+msgstr "" + +#: ../model_zoo/transformers.rst:129 +msgid "``junnyu/ckiplab-bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers.rst:129 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on NER task." +msgstr "" + +#: ../model_zoo/transformers.rst:133 +msgid "``junnyu/ckiplab-bert-base-chinese-pos``" +msgstr "" + +#: ../model_zoo/transformers.rst:133 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on POS task." +msgstr "" + +#: ../model_zoo/transformers.rst:137 +msgid "``junnyu/ckiplab-bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers.rst:137 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on WS task." +msgstr "" + +#: ../model_zoo/transformers.rst:141 +msgid "``junnyu/nlptown-bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers.rst:141 +msgid "" +"12-layer, 768-hidden, 12-heads, 167M parameters. Finetuned for sentiment " +"analysis on product reviews in six languages: English, Dutch, German, " +"French, Spanish and Italian." +msgstr "" + +#: ../model_zoo/transformers.rst:148 +msgid "``junnyu/tbs17-MathBERT``" +msgstr "" + +#: ../model_zoo/transformers.rst:148 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on pre-k to " +"graduate math language (English) using a masked language modeling (MLM) " +"objective." +msgstr "" + +#: ../model_zoo/transformers.rst:154 +msgid "``macbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:154 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers.rst:159 +msgid "``macbert-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:159 +msgid "" +"24-layer, 1024-hidden, 16-heads, 326M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers.rst:164 +msgid "``simbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:164 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on 22 million " +"pairs of similar sentences crawed from Baidu Know." +msgstr "" + +#: ../model_zoo/transformers.rst:169 +msgid "``Langboat/mengzi-bert-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:169 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 300G Chinese " +"Corpus Datasets." +msgstr "" + +#: ../model_zoo/transformers.rst:173 +msgid "``Langboat/mengzi-bert-base-fin``" +msgstr "" + +#: ../model_zoo/transformers.rst:173 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 20G Finacial " +"Corpus, based on ``Langboat/mengzi-bert-base``." +msgstr "" + +#: ../model_zoo/transformers.rst:178 +msgid "BERT-Japanese_" +msgstr "" + +#: ../model_zoo/transformers.rst:178 +msgid "``iverxin/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers.rst:178 ../model_zoo/transformers.rst:182 +#: ../model_zoo/transformers.rst:187 ../model_zoo/transformers.rst:191 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers.rst:178 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text." +msgstr "" + +#: ../model_zoo/transformers.rst:182 +msgid "``iverxin/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers.rst:182 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on Japanese text" +" using Whole-Word-Masking." 
+msgstr "" + +#: ../model_zoo/transformers.rst:187 +msgid "``iverxin/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers.rst:187 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text." +msgstr "" + +#: ../model_zoo/transformers.rst:191 +msgid "``iverxin/bert-base-japanese-char-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers.rst:191 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers.rst:196 ../model_zoo/transformers.rst:663 +msgid "BigBird_" +msgstr "" + +#: ../model_zoo/transformers.rst:196 +msgid "``bigbird-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:196 +msgid "" +"12-layer, 768-hidden, 12-heads, 127M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:200 ../model_zoo/transformers.rst:665 +msgid "Blenderbot_" +msgstr "" + +#: ../model_zoo/transformers.rst:200 +msgid "``blenderbot-3B``" +msgstr "" + +#: ../model_zoo/transformers.rst:200 +msgid "26-layer, 32-heads, 3B parameters. The Blenderbot base model." +msgstr "" + +#: ../model_zoo/transformers.rst:204 +msgid "``blenderbot-400M-distill``" +msgstr "" + +#: ../model_zoo/transformers.rst:204 +msgid "" +"14-layer, 384-hidden, 32-heads, 400M parameters. The Blenderbot distil " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:208 +msgid "``blenderbot-1B-distill``" +msgstr "" + +#: ../model_zoo/transformers.rst:208 +msgid "14-layer, 32-heads, 1478M parameters. The Blenderbot Distil 1B model." +msgstr "" + +#: ../model_zoo/transformers.rst:212 ../model_zoo/transformers.rst:667 +msgid "Blenderbot-Small_" +msgstr "" + +#: ../model_zoo/transformers.rst:212 +msgid "``blenderbot_small-90M``" +msgstr "" + +#: ../model_zoo/transformers.rst:212 +msgid "16-layer, 16-heads, 90M parameters. The Blenderbot small model." +msgstr "" + +#: ../model_zoo/transformers.rst:216 ../model_zoo/transformers.rst:669 +msgid "ConvBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:216 +msgid "``convbert-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:216 +msgid "12-layer, 768-hidden, 12-heads, 106M parameters. The ConvBERT base model." +msgstr "" + +#: ../model_zoo/transformers.rst:220 +msgid "``convbert-medium-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:220 +msgid "" +"12-layer, 384-hidden, 8-heads, 17M parameters. The ConvBERT medium small " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:224 +msgid "``convbert-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:224 +msgid "12-layer, 128-hidden, 4-heads, 13M parameters. The ConvBERT small model." +msgstr "" + +#: ../model_zoo/transformers.rst:228 ../model_zoo/transformers.rst:671 +msgid "CTRL_" +msgstr "" + +#: ../model_zoo/transformers.rst:228 +msgid "``ctrl``" +msgstr "" + +#: ../model_zoo/transformers.rst:228 +msgid "48-layer, 1280-hidden, 16-heads, 1701M parameters. The CTRL base model." +msgstr "" + +#: ../model_zoo/transformers.rst:232 +msgid "``sshleifer-tiny-ctrl``" +msgstr "" + +#: ../model_zoo/transformers.rst:232 +msgid "2-layer, 16-hidden, 2-heads, 5M parameters. The Tiny CTRL model." +msgstr "" + +#: ../model_zoo/transformers.rst:236 ../model_zoo/transformers.rst:673 +msgid "DistilBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:236 +msgid "``distilbert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:236 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. 
The DistilBERT model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:241 +msgid "``distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:241 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:246 +msgid "``distilbert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:246 +msgid "" +"6-layer, 768-hidden, 12-heads, 200M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:252 +msgid "``sshleifer-tiny-distilbert-base-uncase-finetuned-sst-2-english``" +msgstr "" + +#: ../model_zoo/transformers.rst:252 +msgid "2-layer, 2-hidden, 2-heads, 50K parameters. The DistilBERT model" +msgstr "" + +#: ../model_zoo/transformers.rst:256 ../model_zoo/transformers.rst:675 +msgid "ELECTRA_" +msgstr "" + +#: ../model_zoo/transformers.rst:256 +msgid "``electra-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:256 +msgid "" +"12-layer, 768-hidden, 4-heads, 14M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:260 +msgid "``electra-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:260 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:264 +msgid "``electra-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:264 +msgid "" +"24-layer, 1024-hidden, 16-heads, 334M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:268 +msgid "``chinese-electra-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:268 +msgid "12-layer, 768-hidden, 4-heads, 12M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:272 +msgid "``chinese-electra-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:272 ../model_zoo/transformers.rst:487 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:276 +msgid "``junnyu/hfl-chinese-electra-180g-base-discriminator``" +msgstr "" + +#: ../model_zoo/transformers.rst:276 +msgid "" +"Discriminator, 12-layer, 768-hidden, 12-heads, 102M parameters. Trained " +"on 180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:280 +msgid "``junnyu/hfl-chinese-electra-180g-small-ex-discriminator``" +msgstr "" + +#: ../model_zoo/transformers.rst:280 +msgid "" +"Discriminator, 24-layer, 256-hidden, 4-heads, 24M parameters. Trained on " +"180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:284 +msgid "``junnyu/hfl-chinese-legal-electra-small-generator``" +msgstr "" + +#: ../model_zoo/transformers.rst:284 +msgid "" +"Generator, 12-layer, 64-hidden, 1-heads, 3M parameters. Trained on " +"Chinese legal corpus." +msgstr "" + +#: ../model_zoo/transformers.rst:288 ../model_zoo/transformers.rst:677 +msgid "ERNIE_" +msgstr "" + +#: ../model_zoo/transformers.rst:288 +msgid "``ernie-3.0-medium-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:288 ../model_zoo/transformers.rst:308 +#: ../model_zoo/transformers.rst:329 ../model_zoo/transformers.rst:429 +#: ../model_zoo/transformers.rst:604 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." 
+msgstr "" + +#: ../model_zoo/transformers.rst:292 +msgid "``ernie-tiny``" +msgstr "" + +#: ../model_zoo/transformers.rst:292 +msgid "3-layer, 1024-hidden, 16-heads, _M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:296 +msgid "``ernie-2.0-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:296 ../model_zoo/transformers.rst:312 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:300 +msgid "``ernie-2.0-en-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers.rst:300 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on finetuned " +"squad text." +msgstr "" + +#: ../model_zoo/transformers.rst:304 +msgid "``ernie-2.0-large-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:308 ../model_zoo/transformers.rst:679 +msgid "ERNIE-DOC_" +msgstr "" + +#: ../model_zoo/transformers.rst:308 +msgid "``ernie-doc-base-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:312 +msgid "``ernie-doc-base-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:316 ../model_zoo/transformers.rst:681 +msgid "ERNIE-GEN_" +msgstr "" + +#: ../model_zoo/transformers.rst:316 +msgid "``ernie-gen-base-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:316 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:320 +msgid "``ernie-gen-large-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:324 +msgid "``ernie-gen-large-en-430g``" +msgstr "" + +#: ../model_zoo/transformers.rst:324 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text. with extended data (430 GB)." +msgstr "" + +#: ../model_zoo/transformers.rst:329 ../model_zoo/transformers.rst:683 +msgid "ERNIE-GRAM_" +msgstr "" + +#: ../model_zoo/transformers.rst:329 +msgid "``ernie-gram-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:333 ../model_zoo/transformers.rst:685 +msgid "GPT_" +msgstr "" + +#: ../model_zoo/transformers.rst:333 +msgid "``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers.rst:333 +msgid "32-layer, 2560-hidden, 32-heads, 2.6B parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:337 +msgid "``gpt-cpm-small-cn-distill``" +msgstr "" + +#: ../model_zoo/transformers.rst:337 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. The model distilled from" +" the GPT model ``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers.rst:342 +msgid "``gpt2-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:342 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:346 +msgid "``gpt2-medium-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:346 +msgid "24-layer, 1024-hidden, 16-heads, 345M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:350 +msgid "``gpt2-large-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:350 ../model_zoo/transformers.rst:370 +msgid "36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:354 +msgid "``gpt2-xl-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:354 +msgid "" +"48-layer, 1600-hidden, 25-heads, 1558M parameters. Trained on English " +"text." 
+msgstr "" + +#: ../model_zoo/transformers.rst:358 +msgid "``junnyu/distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers.rst:358 +msgid "6-layer, 768-hidden, 12-heads, 81M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:362 +msgid "``junnyu/microsoft-DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:362 ../model_zoo/transformers.rst:467 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:366 +msgid "``junnyu/microsoft-DialoGPT-medium``" +msgstr "" + +#: ../model_zoo/transformers.rst:366 +msgid "24-layer, 1024-hidden, 16-heads, 354M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:370 +msgid "``junnyu/microsoft-DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:374 +msgid "``junnyu/uer-gpt2-chinese-poem``" +msgstr "" + +#: ../model_zoo/transformers.rst:374 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on Chinese " +"poetry corpus." +msgstr "" + +#: ../model_zoo/transformers.rst:378 ../model_zoo/transformers.rst:687 +msgid "LayoutLM_" +msgstr "" + +#: ../model_zoo/transformers.rst:378 +msgid "``layoutlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:378 +msgid "" +"12-layer, 768-hidden, 12-heads, 339M parameters. LayoutLm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:382 +msgid "``layoutlm-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:382 +msgid "" +"24-layer, 1024-hidden, 16-heads, 51M parameters. LayoutLm large Uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:386 ../model_zoo/transformers.rst:689 +msgid "LayoutLMV2_" +msgstr "" + +#: ../model_zoo/transformers.rst:386 +msgid "``layoutlmv2-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:386 +msgid "" +"12-layer, 768-hidden, 12-heads, 200M parameters. LayoutLmv2 base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:390 +msgid "``layoutlmv2-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:390 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. LayoutLmv2 large uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:394 ../model_zoo/transformers.rst:691 +msgid "LayoutXLM_" +msgstr "" + +#: ../model_zoo/transformers.rst:394 +msgid "``layoutxlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:394 +msgid "" +"12-layer, 768-hidden, 12-heads, 369M parameters. Layoutxlm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:398 +msgid "MBart_" +msgstr "" + +#: ../model_zoo/transformers.rst:398 +msgid "``mbart-large-cc25``" +msgstr "" + +#: ../model_zoo/transformers.rst:398 +msgid "" +"12-layer, 1024-hidden, 12-heads, 1123M parameters. The ``mbart-large-" +"cc25`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:402 +msgid "``mbart-large-en-ro``" +msgstr "" + +#: ../model_zoo/transformers.rst:402 +msgid "" +"12-layer, 768-hidden, 16-heads, 1123M parameters. The ``mbart-large rn-" +"ro`` model ." +msgstr "" + +#: ../model_zoo/transformers.rst:406 +msgid "``mbart-large-50-one-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers.rst:406 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-one-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:411 +msgid "``mbart-large-50-many-to-one-mmt``" +msgstr "" + +#: ../model_zoo/transformers.rst:411 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. 
``mbart-large-50-many-" +"to-one-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:416 +msgid "``mbart-large-50-many-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers.rst:416 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:421 +msgid "Mobilebert_" +msgstr "" + +#: ../model_zoo/transformers.rst:421 +msgid "``mobilebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:421 +msgid "24-layer, 512-hidden, 4-heads, 24M parameters. Mobilebert uncased Model." +msgstr "" + +#: ../model_zoo/transformers.rst:425 ../model_zoo/transformers.rst:697 +msgid "MPNet_" +msgstr "" + +#: ../model_zoo/transformers.rst:425 +msgid "``mpnet-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:425 +msgid "12-layer, 768-hidden, 12-heads, 109M parameters. MPNet Base Model." +msgstr "" + +#: ../model_zoo/transformers.rst:429 ../model_zoo/transformers.rst:699 +msgid "NeZha_" +msgstr "" + +#: ../model_zoo/transformers.rst:429 +msgid "``nezha-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:433 +msgid "``nezha-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:433 ../model_zoo/transformers.rst:441 +msgid "24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:437 +msgid "``nezha-base-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:437 +msgid "12-layer, 768-hidden, 16-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:441 +msgid "``nezha-large-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:445 +msgid "Reformer_" +msgstr "" + +#: ../model_zoo/transformers.rst:445 +msgid "``reformer-enwik8``" +msgstr "" + +#: ../model_zoo/transformers.rst:445 +msgid "12-layer, 1024-hidden, 8-heads, 148M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:448 +msgid "``reformer-crime-and-punishment``" +msgstr "" + +#: ../model_zoo/transformers.rst:448 +msgid "6-layer, 256-hidden, 2-heads, 3M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:451 ../model_zoo/transformers.rst:703 +msgid "RoBERTa_" +msgstr "" + +#: ../model_zoo/transformers.rst:451 +msgid "``roberta-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers.rst:451 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on English Text " +"using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers.rst:456 +msgid "``roberta-wwm-ext-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:456 +msgid "" +"24-layer, 1024-hidden, 16-heads, 325M parameters. Trained on English Text" +" using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers.rst:461 +msgid "``rbt3``" +msgstr "" + +#: ../model_zoo/transformers.rst:461 +msgid "3-layer, 768-hidden, 12-heads, 38M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:464 +msgid "``rbtl3``" +msgstr "" + +#: ../model_zoo/transformers.rst:464 +msgid "3-layer, 1024-hidden, 16-heads, 61M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:467 +msgid "``nosaydomore/deepset-roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers.rst:471 +msgid "``nosaydomore/roberta-en-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:471 +msgid "12-layer, 768-hidden, 12-heads, 163M parameters. Trained on English text." 
+msgstr "" + +#: ../model_zoo/transformers.rst:475 +msgid "``nosaydomore/roberta-en-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:475 +msgid "24-layer, 1024-hidden, 16-heads, 408M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:479 +msgid "``nosaydomore/sshleifei-tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:479 +msgid "2-layer, 2-hidden, 2-heads, 0.25M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:483 +msgid "``nosaydomore/uer-roberta-base-chn-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers.rst:483 ../model_zoo/transformers.rst:491 +msgid "12-layer, 768-hidden, 12-heads, 101M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:487 +msgid "``nosaydomore/uer-roberta-base-ft-chinanews-chn``" +msgstr "" + +#: ../model_zoo/transformers.rst:491 +msgid "``nosaydomore/uer-roberta-base-ft-cluener2020-chn``" +msgstr "" + +#: ../model_zoo/transformers.rst:495 ../model_zoo/transformers.rst:705 +msgid "RoFormer_" +msgstr "" + +#: ../model_zoo/transformers.rst:495 +msgid "``roformer-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:495 +msgid "" +"6-layer, 384-hidden, 6-heads, 30M parameters. Roformer Small Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:499 +msgid "``roformer-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:499 +msgid "" +"12-layer, 768-hidden, 12-heads, 124M parameters. Roformer Base Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:503 +msgid "``roformer-chinese-char-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:503 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Small" +" model." +msgstr "" + +#: ../model_zoo/transformers.rst:507 +msgid "``roformer-chinese-char-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:507 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char " +"Base model." +msgstr "" + +#: ../model_zoo/transformers.rst:511 +msgid "``roformer-chinese-sim-char-ft-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:511 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Ft " +"Small model." +msgstr "" + +#: ../model_zoo/transformers.rst:515 +msgid "``roformer-chinese-sim-char-ft-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:515 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char Ft " +"Base model." +msgstr "" + +#: ../model_zoo/transformers.rst:519 +msgid "``roformer-chinese-sim-char-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:519 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Sim Char " +"Small model." +msgstr "" + +#: ../model_zoo/transformers.rst:523 +msgid "``roformer-chinese-sim-char-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:523 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Sim Char" +" Base model." +msgstr "" + +#: ../model_zoo/transformers.rst:527 +msgid "``roformer-english-small-discriminator``" +msgstr "" + +#: ../model_zoo/transformers.rst:527 +msgid "" +"12-layer, 256-hidden, 4-heads, 13M parameters. Roformer English Small " +"Discriminator." +msgstr "" + +#: ../model_zoo/transformers.rst:531 +msgid "``roformer-english-small-generator``" +msgstr "" + +#: ../model_zoo/transformers.rst:531 +msgid "" +"12-layer, 64-hidden, 1-heads, 5M parameters. Roformer English Small " +"Generator." 
+msgstr "" + +#: ../model_zoo/transformers.rst:535 ../model_zoo/transformers.rst:707 +msgid "SKEP_" +msgstr "" + +#: ../model_zoo/transformers.rst:535 +msgid "``skep_ernie_1.0_large_ch``" +msgstr "" + +#: ../model_zoo/transformers.rst:535 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_1.0``" +msgstr "" + +#: ../model_zoo/transformers.rst:540 +msgid "``skep_ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:540 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:545 +msgid "``skep_roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:545 +msgid "" +"24-layer, 1024-hidden, 16-heads, 355M parameters. Trained using the " +"RoBERTa model ``roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:550 ../model_zoo/transformers.rst:709 +msgid "SqueezeBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:550 +msgid "``squeezebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:550 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Uncased model." +msgstr "" + +#: ../model_zoo/transformers.rst:554 +msgid "``squeezebert-mnli``" +msgstr "" + +#: ../model_zoo/transformers.rst:554 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli model." +msgstr "" + +#: ../model_zoo/transformers.rst:558 +msgid "``squeezebert-mnli-headless``" +msgstr "" + +#: ../model_zoo/transformers.rst:558 +msgid "" +"12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli Headless" +" model." +msgstr "" + +#: ../model_zoo/transformers.rst:562 ../model_zoo/transformers.rst:711 +msgid "T5_" +msgstr "" + +#: ../model_zoo/transformers.rst:562 +msgid "``t5-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:562 +msgid "6-layer, 512-hidden, 8-heads, 93M parameters. T5 small model." +msgstr "" + +#: ../model_zoo/transformers.rst:566 +msgid "``t5-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:566 +msgid "12-layer, 768-hidden, 12-heads, 272M parameters. T5 base model." +msgstr "" + +#: ../model_zoo/transformers.rst:570 +msgid "``t5-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:570 +msgid "24-layer, 1024-hidden, 16-heads, 803M parameters. T5 large model." +msgstr "" + +#: ../model_zoo/transformers.rst:574 ../model_zoo/transformers.rst:713 +msgid "TinyBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:574 +msgid "``tinybert-4l-312d``" +msgstr "" + +#: ../model_zoo/transformers.rst:574 ../model_zoo/transformers.rst:584 +#: ../model_zoo/transformers.rst:594 +msgid "" +"4-layer, 312-hidden, 12-heads, 14.5M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:579 +msgid "``tinybert-6l-768d``" +msgstr "" + +#: ../model_zoo/transformers.rst:579 ../model_zoo/transformers.rst:589 +#: ../model_zoo/transformers.rst:599 +msgid "" +"6-layer, 768-hidden, 12-heads, 67M parameters. 
The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:584 +msgid "``tinybert-4l-312d-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:589 +msgid "``tinybert-6l-768d-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:594 +msgid "``tinybert-4l-312d-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:599 +msgid "``tinybert-6l-768d-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:604 ../model_zoo/transformers.rst:715 +msgid "UnifiedTransformer_" +msgstr "" + +#: ../model_zoo/transformers.rst:604 +msgid "``unified_transformer-12L-cn``" +msgstr "" + +#: ../model_zoo/transformers.rst:608 +msgid "``unified_transformer-12L-cn-luge``" +msgstr "" + +#: ../model_zoo/transformers.rst:608 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text " +"(LUGE.ai)." +msgstr "" + +#: ../model_zoo/transformers.rst:612 +msgid "``plato-mini``" +msgstr "" + +#: ../model_zoo/transformers.rst:612 +msgid "6-layer, 768-hidden, 12-heads, 66M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:616 +msgid "UNIMO_" +msgstr "" + +#: ../model_zoo/transformers.rst:616 +msgid "``unimo-text-1.0``" +msgstr "" + +#: ../model_zoo/transformers.rst:616 +msgid "12-layer, 768-hidden, 12-heads, 99M parameters. UNIMO-text-1.0 model." +msgstr "" + +#: ../model_zoo/transformers.rst:620 +msgid "``unimo-text-1.0-lcsts-new``" +msgstr "" + +#: ../model_zoo/transformers.rst:620 +msgid "" +"12-layer, 768-hidden, 12-heads, 99M parameters. Finetuned on lcsts_new " +"dataset." +msgstr "" + +#: ../model_zoo/transformers.rst:624 +msgid "``unimo-text-1.0-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:624 +msgid "" +"24-layer, 768-hidden, 16-heads, 316M parameters. UNIMO-text-1.0 large " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:628 ../model_zoo/transformers.rst:717 +msgid "XLNet_" +msgstr "" + +#: ../model_zoo/transformers.rst:628 +msgid "``xlnet-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:628 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model" +msgstr "" + +#: ../model_zoo/transformers.rst:632 +msgid "``xlnet-large-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:632 +msgid "" +"24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English " +"model" +msgstr "" + +#: ../model_zoo/transformers.rst:636 +msgid "``chinese-xlnet-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:636 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. XLNet Chinese model" +msgstr "" + +#: ../model_zoo/transformers.rst:640 +msgid "``chinese-xlnet-mid``" +msgstr "" + +#: ../model_zoo/transformers.rst:640 +msgid "" +"24-layer, 768-hidden, 12-heads, 209M parameters. XLNet Medium Chinese " +"model" +msgstr "" + +#: ../model_zoo/transformers.rst:644 +msgid "``chinese-xlnet-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:644 +msgid "24-layer, 1024-hidden, 16-heads, _M parameters. 
XLNet Large Chinese model" +msgstr "" + +#: ../model_zoo/transformers.rst:652 +msgid "Transformer预训练模型适用任务汇总" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Sequence Classification" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Token Classification" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Question Answering" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Text Generation" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Multiple Choice" +msgstr "" + +#: ../model_zoo/transformers.rst:657 ../model_zoo/transformers.rst:659 +#: ../model_zoo/transformers.rst:661 ../model_zoo/transformers.rst:663 +#: ../model_zoo/transformers.rst:665 ../model_zoo/transformers.rst:667 +#: ../model_zoo/transformers.rst:669 ../model_zoo/transformers.rst:671 +#: ../model_zoo/transformers.rst:673 ../model_zoo/transformers.rst:675 +#: ../model_zoo/transformers.rst:677 ../model_zoo/transformers.rst:679 +#: ../model_zoo/transformers.rst:681 ../model_zoo/transformers.rst:683 +#: ../model_zoo/transformers.rst:685 ../model_zoo/transformers.rst:687 +#: ../model_zoo/transformers.rst:689 ../model_zoo/transformers.rst:691 +#: ../model_zoo/transformers.rst:693 ../model_zoo/transformers.rst:695 +#: ../model_zoo/transformers.rst:697 ../model_zoo/transformers.rst:699 +#: ../model_zoo/transformers.rst:701 ../model_zoo/transformers.rst:703 +#: ../model_zoo/transformers.rst:705 ../model_zoo/transformers.rst:707 +#: ../model_zoo/transformers.rst:709 ../model_zoo/transformers.rst:711 +#: ../model_zoo/transformers.rst:713 ../model_zoo/transformers.rst:715 +#: ../model_zoo/transformers.rst:717 +msgid "✅" +msgstr "" + +#: ../model_zoo/transformers.rst:657 ../model_zoo/transformers.rst:659 +#: ../model_zoo/transformers.rst:661 ../model_zoo/transformers.rst:663 +#: ../model_zoo/transformers.rst:665 ../model_zoo/transformers.rst:667 +#: ../model_zoo/transformers.rst:671 ../model_zoo/transformers.rst:673 +#: ../model_zoo/transformers.rst:675 ../model_zoo/transformers.rst:677 +#: ../model_zoo/transformers.rst:679 ../model_zoo/transformers.rst:681 +#: ../model_zoo/transformers.rst:683 ../model_zoo/transformers.rst:685 +#: ../model_zoo/transformers.rst:687 ../model_zoo/transformers.rst:689 +#: ../model_zoo/transformers.rst:691 ../model_zoo/transformers.rst:693 +#: ../model_zoo/transformers.rst:695 ../model_zoo/transformers.rst:697 +#: ../model_zoo/transformers.rst:699 ../model_zoo/transformers.rst:701 +#: ../model_zoo/transformers.rst:703 ../model_zoo/transformers.rst:705 +#: ../model_zoo/transformers.rst:707 ../model_zoo/transformers.rst:709 +#: ../model_zoo/transformers.rst:711 ../model_zoo/transformers.rst:713 +#: ../model_zoo/transformers.rst:715 ../model_zoo/transformers.rst:717 +msgid "❌" +msgstr "" + +#: ../model_zoo/transformers.rst:693 +msgid "Mbart_" +msgstr "" + +#: ../model_zoo/transformers.rst:695 +msgid "MobileBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:701 +msgid "ReFormer_" +msgstr "" + +#: ../model_zoo/transformers.rst:756 +msgid "预训练模型使用方法" +msgstr "" + +#: ../model_zoo/transformers.rst:758 +msgid "" +"PaddleNLP Transformer API在提丰富预训练模型的同时,也降低了用户的使用门槛。 " +"使用Auto模块,可以加载不同网络结构的预训练模型,无需查找 模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-" +"tuning。" +msgstr "" + +#: ../model_zoo/transformers.rst:797 +msgid "" +"上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, 可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 " +"`_" +msgstr "" + +#: ../model_zoo/transformers.rst:800 +msgid "加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。" +msgstr "" + +#: ../model_zoo/transformers.rst:801 +msgid "" 
+"加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 " +"Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, " +"无需指定类别,即可调用不同网络结构的预训练模型。 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。" +" ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 " +"``num_classes`` 等, 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 " +"``from_pretrained`` 方法加载。" +msgstr "" + +#: ../model_zoo/transformers.rst:807 +msgid "通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。" +msgstr "" + +#: ../model_zoo/transformers.rst:808 +msgid "定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。" +msgstr "" + +#: ../model_zoo/transformers.rst:809 +msgid "定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。" +msgstr "" + +#: ../model_zoo/transformers.rst:813 +msgid "Reference" +msgstr "" + +#: ../model_zoo/transformers.rst:814 +msgid "" +"部分中文预训练模型来自: `brightmart/albert_zh " +"`_, `ymcui/Chinese-BERT-wwm " +"`_, `huawei-noah/Pretrained-" +"Language-Model/TinyBERT `_, `ymcui/Chinese-XLNet " +"`_, " +"`huggingface/xlnet_chinese_large " +"`_, `Knover/luge-" +"dialogue `_, `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ " +"`_ `ZhuiyiTechnology/simbert " +"`_" +msgstr "" + +#: ../model_zoo/transformers.rst:823 +msgid "" +"Lan, Zhenzhong, et al. \"Albert: A lite bert for self-supervised learning" +" of language representations.\" arXiv preprint arXiv:1909.11942 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:824 +msgid "" +"Lewis, Mike, et al. \"BART: Denoising Sequence-to-Sequence Pre-training " +"for Natural Language Generation, Translation, and Comprehension.\" arXiv " +"preprint arXiv:1910.13461 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:825 +msgid "" +"Devlin, Jacob, et al. \"Bert: Pre-training of deep bidirectional " +"transformers for language understanding.\" arXiv preprint " +"arXiv:1810.04805 (2018)." +msgstr "" + +#: ../model_zoo/transformers.rst:826 +msgid "" +"Zaheer, Manzil, et al. \"Big bird: Transformers for longer sequences.\" " +"arXiv preprint arXiv:2007.14062 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:827 +msgid "" +"Stephon, Emily, et al. \"Blenderbot: Recipes for building an open-domain " +"chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:828 +msgid "" +"Stephon, Emily, et al. \"Blenderbot-Small: Recipes for building an open-" +"domain chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:829 +msgid "" +"Jiang, Zihang, et al. \"ConvBERT: Improving BERT with Span-based Dynamic " +"Convolution.\" arXiv preprint arXiv:2008.02496 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:830 +msgid "" +"Nitish, Bryan, et al. \"CTRL: A Conditional Transformer Language Model " +"for Controllable Generation.\" arXiv preprint arXiv:1909.05858 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:831 +msgid "" +"Sanh, Victor, et al. \"DistilBERT, a distilled version of BERT: smaller, " +"faster, cheaper and lighter.\" arXiv preprint arXiv:1910.01108 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:832 +msgid "" +"Clark, Kevin, et al. \"Electra: Pre-training text encoders as " +"discriminators rather than generators.\" arXiv preprint arXiv:2003.10555 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:833 +msgid "" +"Sun, Yu, et al. \"Ernie: Enhanced representation through knowledge " +"integration.\" arXiv preprint arXiv:1904.09223 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:834 +msgid "" +"Xiao, Dongling, et al. 
\"Ernie-gen: An enhanced multi-flow pre-training " +"and fine-tuning framework for natural language generation.\" arXiv " +"preprint arXiv:2001.11314 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:835 +msgid "" +"Xiao, Dongling, et al. \"ERNIE-Gram: Pre-Training with Explicitly N-Gram " +"Masked Language Modeling for Natural Language Understanding.\" arXiv " +"preprint arXiv:2010.12148 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:836 +msgid "" +"Radford, Alec, et al. \"Language models are unsupervised multitask " +"learners.\" OpenAI blog 1.8 (2019): 9." +msgstr "" + +#: ../model_zoo/transformers.rst:837 +msgid "" +"Xu, Yiheng, et al. \"LayoutLM: Pre-training of Text and Layout for " +"Document Image Understanding.\" arXiv preprint arXiv:1912.13318 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:838 +msgid "" +"Xu, Yang, et al. \"LayoutLMv2: Multi-modal Pre-training for Visually-Rich" +" Document Understanding\" arXiv preprint arXiv:2012.14740 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:839 +msgid "" +"Xu, Yiheng, et al. \"LayoutXLM: Multimodal Pre-training for Multilingual " +"Visually-rich Document Understanding\" arXiv preprint arXiv:2104.08836 " +"(2021)." +msgstr "" + +#: ../model_zoo/transformers.rst:840 +msgid "" +"Liu, Yinhan, et al. \"MBart: Multilingual Denoising Pre-training for " +"Neural Machine Translation\" arXiv preprint arXiv:2001.08210 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:841 +msgid "" +"Sun, Zhiqing, et al. \"MobileBERT: a Compact Task-Agnostic BERT for " +"Resource-Limited Devices\" arXiv preprint arXiv:2004.02984 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:842 +msgid "" +"Song, Kaitao, et al. \"MPNet: Masked and Permuted Pre-training for " +"Language Understanding.\" arXiv preprint arXiv:2004.09297 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:843 +msgid "" +"Wei, Junqiu, et al. \"NEZHA: Neural contextualized representation for " +"chinese language understanding.\" arXiv preprint arXiv:1909.00204 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:844 +msgid "" +"Kitaev, Nikita, et al. \"Reformer: The efficient Transformer.\" arXiv " +"preprint arXiv:2001.04451 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:845 +msgid "" +"Liu, Yinhan, et al. \"Roberta: A robustly optimized bert pretraining " +"approach.\" arXiv preprint arXiv:1907.11692 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:846 +msgid "" +"Su Jianlin, et al. \"RoFormer: Enhanced Transformer with Rotary Position " +"Embedding.\" arXiv preprint arXiv:2104.09864 (2021)." +msgstr "" + +#: ../model_zoo/transformers.rst:847 +msgid "" +"Tian, Hao, et al. \"SKEP: Sentiment knowledge enhanced pre-training for " +"sentiment analysis.\" arXiv preprint arXiv:2005.05635 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:848 +msgid "" +"Forrest, ALbert, et al. \"SqueezeBERT: What can computer vision teach NLP" +" about efficient neural networks?\" arXiv preprint arXiv:2006.11316 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:849 +msgid "" +"Raffel, Colin, et al. \"T5: Exploring the Limits of Transfer Learning " +"with a Unified Text-to-Text Transformer.\" arXiv preprint " +"arXiv:1910.10683 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:850 +msgid "" +"Vaswani, Ashish, et al. \"Attention is all you need.\" arXiv preprint " +"arXiv:1706.03762 (2017)." +msgstr "" + +#: ../model_zoo/transformers.rst:851 +msgid "" +"Jiao, Xiaoqi, et al. 
\"Tinybert: Distilling bert for natural language " +"understanding.\" arXiv preprint arXiv:1909.10351 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:852 +msgid "" +"Bao, Siqi, et al. \"Plato-2: Towards building an open-domain chatbot via " +"curriculum learning.\" arXiv preprint arXiv:2006.16779 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:853 +msgid "" +"Yang, Zhilin, et al. \"Xlnet: Generalized autoregressive pretraining for " +"language understanding.\" arXiv preprint arXiv:1906.08237 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:854 +msgid "" +"Cui, Yiming, et al. \"Pre-training with whole word masking for chinese " +"bert.\" arXiv preprint arXiv:1906.08101 (2019)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ALBERT/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ALBERT/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..93a869bda14b22a7f42492b266f8977a330c396f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ALBERT/contents.po @@ -0,0 +1,199 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ALBERT/contents.rst:5 +msgid "ALBERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ALBERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:15 +msgid "``albert-base-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:15 +#: ../model_zoo/transformers/ALBERT/contents.rst:19 +#: ../model_zoo/transformers/ALBERT/contents.rst:23 +#: ../model_zoo/transformers/ALBERT/contents.rst:27 +#: ../model_zoo/transformers/ALBERT/contents.rst:31 +#: ../model_zoo/transformers/ALBERT/contents.rst:35 +#: ../model_zoo/transformers/ALBERT/contents.rst:39 +#: ../model_zoo/transformers/ALBERT/contents.rst:43 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:15 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:19 +msgid "``albert-large-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:19 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:23 +msgid "``albert-xlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:23 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. 
ALBERT xlarge model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:27 +msgid "``albert-xxlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:27 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:31 +msgid "``albert-base-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:31 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:35 +msgid "``albert-large-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:35 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:39 +msgid "``albert-xlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:39 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:43 +msgid "``albert-xxlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:43 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:47 +msgid "``albert-chinese-tiny``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:47 +#: ../model_zoo/transformers/ALBERT/contents.rst:51 +#: ../model_zoo/transformers/ALBERT/contents.rst:55 +#: ../model_zoo/transformers/ALBERT/contents.rst:59 +#: ../model_zoo/transformers/ALBERT/contents.rst:63 +#: ../model_zoo/transformers/ALBERT/contents.rst:67 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:47 +msgid "" +"4 repeating layers, 128 embedding, 312-hidden, 12-heads, 4M parameters. " +"ALBERT tiny model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:51 +msgid "``albert-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:51 +msgid "" +"6 repeating layers, 128 embedding, 384-hidden, 12-heads, _M parameters. " +"ALBERT small model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:55 +msgid "``albert-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:55 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 12M parameters." +" ALBERT base model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:59 +msgid "``albert-chinese-large``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:59 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 18M " +"parameters. ALBERT large model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:63 +msgid "``albert-chinese-xlarge``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:63 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 60M " +"parameters. ALBERT xlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:67 +msgid "``albert-chinese-xxlarge``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:67 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 16-heads, 235M " +"parameters. 
ALBERT xxlarge model (Chinese)" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BART/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BART/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..2c780422a18973cbe7345853e5a4edd485cfab50 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BART/contents.po @@ -0,0 +1,62 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/BART/contents.rst:5 +msgid "BART模型汇总" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的BART模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:15 +msgid "``bart-base``" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:15 +#: ../model_zoo/transformers/BART/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 217M parameters. BART base model (English)" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:19 +msgid "``bart-large``" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:19 +msgid "" +"24-layer, 768-hidden, 16-heads, 509M parameters. BART large model " +"(English)" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BERT/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BERT/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..8aa42f1927335d6b74f0f5fd1920be423b33a138 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BERT/contents.po @@ -0,0 +1,1670 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/BERT/contents.rst:5 +msgid "BERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的BERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:15 +msgid "``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:15 +#: ../model_zoo/transformers/BERT/contents.rst:19 +#: ../model_zoo/transformers/BERT/contents.rst:23 +#: ../model_zoo/transformers/BERT/contents.rst:27 +#: ../model_zoo/transformers/BERT/contents.rst:79 +#: ../model_zoo/transformers/BERT/contents.rst:109 +#: ../model_zoo/transformers/BERT/contents.rst:124 +#: ../model_zoo/transformers/BERT/contents.rst:133 +#: ../model_zoo/transformers/BERT/contents.rst:136 +#: ../model_zoo/transformers/BERT/contents.rst:139 +#: ../model_zoo/transformers/BERT/contents.rst:145 +#: ../model_zoo/transformers/BERT/contents.rst:148 +#: ../model_zoo/transformers/BERT/contents.rst:154 +#: ../model_zoo/transformers/BERT/contents.rst:157 +#: ../model_zoo/transformers/BERT/contents.rst:163 +#: ../model_zoo/transformers/BERT/contents.rst:166 +#: ../model_zoo/transformers/BERT/contents.rst:175 +#: ../model_zoo/transformers/BERT/contents.rst:178 +#: ../model_zoo/transformers/BERT/contents.rst:181 +#: ../model_zoo/transformers/BERT/contents.rst:184 +#: ../model_zoo/transformers/BERT/contents.rst:187 +#: ../model_zoo/transformers/BERT/contents.rst:193 +#: ../model_zoo/transformers/BERT/contents.rst:196 +#: ../model_zoo/transformers/BERT/contents.rst:199 +#: ../model_zoo/transformers/BERT/contents.rst:214 +#: ../model_zoo/transformers/BERT/contents.rst:220 +#: ../model_zoo/transformers/BERT/contents.rst:226 +#: ../model_zoo/transformers/BERT/contents.rst:235 +#: ../model_zoo/transformers/BERT/contents.rst:238 +#: ../model_zoo/transformers/BERT/contents.rst:241 +#: ../model_zoo/transformers/BERT/contents.rst:247 +#: ../model_zoo/transformers/BERT/contents.rst:262 +#: ../model_zoo/transformers/BERT/contents.rst:265 +#: ../model_zoo/transformers/BERT/contents.rst:268 +#: ../model_zoo/transformers/BERT/contents.rst:271 +#: ../model_zoo/transformers/BERT/contents.rst:277 +#: ../model_zoo/transformers/BERT/contents.rst:280 +#: ../model_zoo/transformers/BERT/contents.rst:283 +#: ../model_zoo/transformers/BERT/contents.rst:286 +#: ../model_zoo/transformers/BERT/contents.rst:289 +#: ../model_zoo/transformers/BERT/contents.rst:292 +#: ../model_zoo/transformers/BERT/contents.rst:295 +#: ../model_zoo/transformers/BERT/contents.rst:298 +#: ../model_zoo/transformers/BERT/contents.rst:301 +#: ../model_zoo/transformers/BERT/contents.rst:313 +#: ../model_zoo/transformers/BERT/contents.rst:316 +#: ../model_zoo/transformers/BERT/contents.rst:319 +#: ../model_zoo/transformers/BERT/contents.rst:322 +#: ../model_zoo/transformers/BERT/contents.rst:328 +#: 
../model_zoo/transformers/BERT/contents.rst:331 +#: ../model_zoo/transformers/BERT/contents.rst:334 +#: ../model_zoo/transformers/BERT/contents.rst:337 +#: ../model_zoo/transformers/BERT/contents.rst:340 +#: ../model_zoo/transformers/BERT/contents.rst:346 +#: ../model_zoo/transformers/BERT/contents.rst:352 +#: ../model_zoo/transformers/BERT/contents.rst:355 +#: ../model_zoo/transformers/BERT/contents.rst:358 +#: ../model_zoo/transformers/BERT/contents.rst:376 +#: ../model_zoo/transformers/BERT/contents.rst:379 +#: ../model_zoo/transformers/BERT/contents.rst:382 +#: ../model_zoo/transformers/BERT/contents.rst:394 +#: ../model_zoo/transformers/BERT/contents.rst:400 +#: ../model_zoo/transformers/BERT/contents.rst:418 +#: ../model_zoo/transformers/BERT/contents.rst:427 +#: ../model_zoo/transformers/BERT/contents.rst:445 +#: ../model_zoo/transformers/BERT/contents.rst:472 +#: ../model_zoo/transformers/BERT/contents.rst:481 +#: ../model_zoo/transformers/BERT/contents.rst:490 +#: ../model_zoo/transformers/BERT/contents.rst:493 +#: ../model_zoo/transformers/BERT/contents.rst:496 +#: ../model_zoo/transformers/BERT/contents.rst:499 +#: ../model_zoo/transformers/BERT/contents.rst:505 +#: ../model_zoo/transformers/BERT/contents.rst:511 +#: ../model_zoo/transformers/BERT/contents.rst:514 +#: ../model_zoo/transformers/BERT/contents.rst:523 +#: ../model_zoo/transformers/BERT/contents.rst:526 +#: ../model_zoo/transformers/BERT/contents.rst:529 +#: ../model_zoo/transformers/BERT/contents.rst:532 +#: ../model_zoo/transformers/BERT/contents.rst:538 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:19 +msgid "``bert-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:23 +msgid "``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:23 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English" +" text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:27 +msgid "``bert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:27 +msgid "" +"24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:31 +msgid "``bert-base-multilingual-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:31 +#: ../model_zoo/transformers/BERT/contents.rst:37 +#: ../model_zoo/transformers/BERT/contents.rst:72 +#: ../model_zoo/transformers/BERT/contents.rst:121 +#: ../model_zoo/transformers/BERT/contents.rst:421 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:31 +msgid "" +"12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased " +"text in the top 102 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:37 +msgid "``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:37 +msgid "" +"12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in" +" the top 104 languages with the largest Wikipedias." 
+msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:43 +msgid "``bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:43 +#: ../model_zoo/transformers/BERT/contents.rst:48 +#: ../model_zoo/transformers/BERT/contents.rst:54 +#: ../model_zoo/transformers/BERT/contents.rst:60 +#: ../model_zoo/transformers/BERT/contents.rst:64 +#: ../model_zoo/transformers/BERT/contents.rst:68 +#: ../model_zoo/transformers/BERT/contents.rst:85 +#: ../model_zoo/transformers/BERT/contents.rst:90 +#: ../model_zoo/transformers/BERT/contents.rst:95 +#: ../model_zoo/transformers/BERT/contents.rst:100 +#: ../model_zoo/transformers/BERT/contents.rst:104 +#: ../model_zoo/transformers/BERT/contents.rst:130 +#: ../model_zoo/transformers/BERT/contents.rst:256 +#: ../model_zoo/transformers/BERT/contents.rst:349 +#: ../model_zoo/transformers/BERT/contents.rst:367 +#: ../model_zoo/transformers/BERT/contents.rst:412 +#: ../model_zoo/transformers/BERT/contents.rst:415 +#: ../model_zoo/transformers/BERT/contents.rst:436 +#: ../model_zoo/transformers/BERT/contents.rst:457 +#: ../model_zoo/transformers/BERT/contents.rst:469 +#: ../model_zoo/transformers/BERT/contents.rst:487 +#: ../model_zoo/transformers/BERT/contents.rst:535 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:43 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:48 +msgid "``bert-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:48 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:54 +msgid "``bert-wwm-ext-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:54 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking with extented " +"data." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:60 +msgid "``junnyu/ckiplab-bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:60 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on NER task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:64 +msgid "``junnyu/ckiplab-bert-base-chinese-pos``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:64 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on POS task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:68 +msgid "``junnyu/ckiplab-bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:68 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on WS task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:72 +msgid "``junnyu/nlptown-bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:72 +msgid "" +"12-layer, 768-hidden, 12-heads, 167M parameters. Finetuned for sentiment " +"analysis in six languages: English, Dutch, German, French, Spanish and " +"Italian." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:79 +msgid "``junnyu/tbs17-MathBERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:79 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. 
Trained on pre-k to " +"graduate math language (English) using a masked language modeling (MLM) " +"objective." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:85 +msgid "``macbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:85 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:90 +msgid "``macbert-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:90 +msgid "" +"24-layer, 1024-hidden, 16-heads, 326M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:95 +msgid "``simbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:95 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on 22 million " +"pairs of similar sentences crawed from Baidu Know." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:100 +msgid "``Langboat/mengzi-bert-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:100 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 300G Chinese " +"Corpus Datasets." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:104 +msgid "``Langboat/mengzi-bert-base-fin``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:104 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 20G Finacial " +"Corpus, based on ``Langboat/mengzi-bert-base``." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:109 +msgid "``cross-encoder/ms-marco-MiniLM-L-12-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:109 +msgid "Please refer to: `cross-encoder/ms-marco-MiniLM-L-12-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:112 +msgid "``cl-tohoku/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:112 +#: ../model_zoo/transformers/BERT/contents.rst:115 +#: ../model_zoo/transformers/BERT/contents.rst:118 +#: ../model_zoo/transformers/BERT/contents.rst:223 +#: ../model_zoo/transformers/BERT/contents.rst:229 +#: ../model_zoo/transformers/BERT/contents.rst:310 +#: ../model_zoo/transformers/BERT/contents.rst:409 +#: ../model_zoo/transformers/BERT/contents.rst:439 +#: ../model_zoo/transformers/BERT/contents.rst:541 +#: ../model_zoo/transformers/BERT/contents.rst:545 +#: ../model_zoo/transformers/BERT/contents.rst:550 +#: ../model_zoo/transformers/BERT/contents.rst:554 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:112 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-char`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:115 +msgid "``cl-tohoku/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:115 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-whole-word-masking`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:118 +msgid "``cl-tohoku/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:118 +msgid "Please refer to: `cl-tohoku/bert-base-japanese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:121 +msgid "``nlptown/bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:121 +msgid "Please refer to: `nlptown/bert-base-multilingual-uncased-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:124 +msgid 
"``bert-large-uncased-whole-word-masking-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:124 +msgid "Please refer to: `bert-large-uncased-whole-word-masking-finetuned-squad`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:127 +msgid "``finiteautomata/beto-sentiment-analysis``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:127 +#: ../model_zoo/transformers/BERT/contents.rst:190 +#: ../model_zoo/transformers/BERT/contents.rst:373 +#: ../model_zoo/transformers/BERT/contents.rst:433 +#: ../model_zoo/transformers/BERT/contents.rst:463 +#: ../model_zoo/transformers/BERT/contents.rst:475 +msgid "Spanish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:127 +msgid "Please refer to: `finiteautomata/beto-sentiment-analysis`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:130 +msgid "``hfl/chinese-bert-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:130 +msgid "Please refer to: `hfl/chinese-bert-wwm-ext`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:133 +msgid "``emilyalsentzer/Bio_ClinicalBERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:133 +msgid "Please refer to: `emilyalsentzer/Bio_ClinicalBERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:136 +msgid "``dslim/bert-base-NER``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:136 +msgid "Please refer to: `dslim/bert-base-NER`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:139 +msgid "``deepset/bert-large-uncased-whole-word-masking-squad2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:139 +msgid "Please refer to: `deepset/bert-large-uncased-whole-word-masking-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:142 +msgid "``neuralmind/bert-base-portuguese-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:142 +#: ../model_zoo/transformers/BERT/contents.rst:244 +#: ../model_zoo/transformers/BERT/contents.rst:361 +#: ../model_zoo/transformers/BERT/contents.rst:520 +msgid "Portuguese" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:142 +msgid "Please refer to: `neuralmind/bert-base-portuguese-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:145 +msgid "``SpanBERT/spanbert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:145 +msgid "Please refer to: `SpanBERT/spanbert-large-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:148 +msgid "``dslim/bert-large-NER``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:148 +msgid "Please refer to: `dslim/bert-large-NER`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:151 +msgid "``bert-base-german-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:151 +#: ../model_zoo/transformers/BERT/contents.rst:160 +#: ../model_zoo/transformers/BERT/contents.rst:205 +#: ../model_zoo/transformers/BERT/contents.rst:211 +#: ../model_zoo/transformers/BERT/contents.rst:250 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:151 +msgid "Please refer to: `bert-base-german-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:154 +msgid "``deepset/sentence_bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:154 +msgid "Please refer to: `deepset/sentence_bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:157 +msgid "``ProsusAI/finbert``" +msgstr "" + +#: 
../model_zoo/transformers/BERT/contents.rst:157 +msgid "Please refer to: `ProsusAI/finbert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:160 +msgid "``oliverguhr/german-sentiment-bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:160 +msgid "Please refer to: `oliverguhr/german-sentiment-bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:163 +msgid "``google/bert_uncased_L-2_H-128_A-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:163 +msgid "Please refer to: `google/bert_uncased_L-2_H-128_A-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:166 +msgid "``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:166 +msgid "Please refer to: `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:169 +msgid "``DeepPavlov/rubert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:169 +#: ../model_zoo/transformers/BERT/contents.rst:202 +#: ../model_zoo/transformers/BERT/contents.rst:304 +#: ../model_zoo/transformers/BERT/contents.rst:430 +#: ../model_zoo/transformers/BERT/contents.rst:484 +msgid "Russian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:169 +msgid "Please refer to: `DeepPavlov/rubert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:172 +msgid "``wietsedv/bert-base-dutch-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:172 +#: ../model_zoo/transformers/BERT/contents.rst:397 +msgid "Dutch" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:172 +msgid "Please refer to: `wietsedv/bert-base-dutch-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:175 +msgid "``monologg/bert-base-cased-goemotions-original``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:175 +msgid "Please refer to: `monologg/bert-base-cased-goemotions-original`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:178 +msgid "``allenai/scibert_scivocab_uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:178 +msgid "Please refer to: `allenai/scibert_scivocab_uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:181 +msgid "``dbmdz/bert-large-cased-finetuned-conll03-english``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:181 +msgid "Please refer to: `dbmdz/bert-large-cased-finetuned-conll03-english`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:184 +msgid "``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:184 +msgid "" +"Please refer to: `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-" +"fulltext`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:187 +msgid "``bert-large-uncased-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:187 +msgid "Please refer to: `bert-large-uncased-whole-word-masking`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:190 +msgid "``dccuchile/bert-base-spanish-wwm-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:190 +msgid "Please refer to: `dccuchile/bert-base-spanish-wwm-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:193 +msgid "``google/bert_uncased_L-6_H-256_A-4``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:193 +msgid "Please refer to: `google/bert_uncased_L-6_H-256_A-4`_" 
+msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:196 +msgid "``google/bert_uncased_L-4_H-512_A-8``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:196 +msgid "Please refer to: `google/bert_uncased_L-4_H-512_A-8`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:199 +msgid "``FPTAI/vibert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:199 +msgid "Please refer to: `FPTAI/vibert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:202 +msgid "``cointegrated/rubert-tiny``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:202 +msgid "Please refer to: `cointegrated/rubert-tiny`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:205 +msgid "``bert-base-german-dbmdz-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:205 +msgid "Please refer to: `bert-base-german-dbmdz-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:208 +msgid "``dbmdz/bert-base-turkish-128k-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:208 +#: ../model_zoo/transformers/BERT/contents.rst:325 +#: ../model_zoo/transformers/BERT/contents.rst:460 +#: ../model_zoo/transformers/BERT/contents.rst:502 +msgid "Turkish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:208 +msgid "Please refer to: `dbmdz/bert-base-turkish-128k-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:211 +msgid "``dbmdz/bert-base-german-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:211 +msgid "Please refer to: `dbmdz/bert-base-german-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:214 +msgid "``deepset/minilm-uncased-squad2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:214 +msgid "Please refer to: `deepset/minilm-uncased-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:217 +msgid "``HooshvareLab/bert-base-parsbert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:217 +#: ../model_zoo/transformers/BERT/contents.rst:388 +#: ../model_zoo/transformers/BERT/contents.rst:478 +msgid "Persian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:217 +msgid "Please refer to: `HooshvareLab/bert-base-parsbert-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:220 +msgid "``textattack/bert-base-uncased-ag-news``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:220 +msgid "Please refer to: `textattack/bert-base-uncased-ag-news`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:223 +msgid "``cl-tohoku/bert-base-japanese-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:223 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:226 +msgid "``emilyalsentzer/Bio_Discharge_Summary_BERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:226 +msgid "Please refer to: `emilyalsentzer/Bio_Discharge_Summary_BERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:229 +msgid "``KoichiYasuoka/bert-base-japanese-upos``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:229 +msgid "Please refer to: `KoichiYasuoka/bert-base-japanese-upos`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:232 +msgid "``dbmdz/bert-base-italian-xxl-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:232 +#: ../model_zoo/transformers/BERT/contents.rst:403 +msgid "Italian" +msgstr "" + +#: 
../model_zoo/transformers/BERT/contents.rst:232 +msgid "Please refer to: `dbmdz/bert-base-italian-xxl-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:235 +msgid "``deepset/bert-base-cased-squad2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:235 +msgid "Please refer to: `deepset/bert-base-cased-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:238 +msgid "``beomi/kcbert-large``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:238 +msgid "Please refer to: `beomi/kcbert-large`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:241 +msgid "``bert-large-cased-whole-word-masking-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:241 +msgid "Please refer to: `bert-large-cased-whole-word-masking-finetuned-squad`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:244 +msgid "``neuralmind/bert-large-portuguese-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:244 +msgid "Please refer to: `neuralmind/bert-large-portuguese-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:247 +msgid "``Luyu/co-condenser-marco``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:247 +msgid "Please refer to: `Luyu/co-condenser-marco`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:250 +msgid "``Sahajtomar/German_Zeroshot``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:250 +msgid "Please refer to: `Sahajtomar/German_Zeroshot`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:253 +msgid "``indolem/indobert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:253 +msgid "Indonesian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:253 +msgid "Please refer to: `indolem/indobert-base-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:256 +msgid "``shibing624/text2vec-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:256 +msgid "Please refer to: `shibing624/text2vec-base-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:259 +msgid "``cointegrated/LaBSE-en-ru``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:259 +msgid "English and Russian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:259 +msgid "Please refer to: `cointegrated/LaBSE-en-ru`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:262 +msgid "``prithivida/parrot_fluency_on_BERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:262 +msgid "Please refer to: `prithivida/parrot_fluency_on_BERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:265 +msgid "``textattack/bert-base-uncased-SST-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:265 +msgid "Please refer to: `textattack/bert-base-uncased-SST-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:268 +msgid "``textattack/bert-base-uncased-snli``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:268 +msgid "Please refer to: `textattack/bert-base-uncased-snli`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:271 +msgid "``klue/bert-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:271 +msgid "Please refer to: `klue/bert-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:274 +msgid "``asafaya/bert-base-arabic``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:274 +#: ../model_zoo/transformers/BERT/contents.rst:424 +msgid 
"Arabic" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:274 +msgid "Please refer to: `asafaya/bert-base-arabic`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:277 +msgid "``textattack/bert-base-uncased-MRPC``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:277 +msgid "Please refer to: `textattack/bert-base-uncased-MRPC`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:280 +msgid "``textattack/bert-base-uncased-imdb``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:280 +msgid "Please refer to: `textattack/bert-base-uncased-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:283 +msgid "``cross-encoder/ms-marco-TinyBERT-L-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:283 +msgid "Please refer to: `cross-encoder/ms-marco-TinyBERT-L-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:286 +msgid "``mrm8488/bert-tiny-finetuned-sms-spam-detection``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:286 +msgid "Please refer to: `mrm8488/bert-tiny-finetuned-sms-spam-detection`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:289 +msgid "``felflare/bert-restore-punctuation``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:289 +msgid "Please refer to: `felflare/bert-restore-punctuation`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:292 +msgid "``sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:292 +msgid "" +"Please refer to: `sshleifer/tiny-dbmdz-bert-large-cased-finetuned-" +"conll03-english`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:295 +msgid "``textattack/bert-base-uncased-rotten-tomatoes``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:295 +msgid "Please refer to: `textattack/bert-base-uncased-rotten-tomatoes`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:298 +msgid "``nlpaueb/legal-bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:298 +msgid "Please refer to: `nlpaueb/legal-bert-base-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:301 +msgid "``hf-internal-testing/tiny-bert-for-token-classification``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:301 +msgid "Please refer to: `hf-internal-testing/tiny-bert-for-token-classification`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:304 +msgid "``cointegrated/rubert-tiny2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:304 +msgid "Please refer to: `cointegrated/rubert-tiny2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:307 +msgid "``kykim/bert-kor-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:307 +msgid "Korean" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:307 +msgid "Please refer to: `kykim/bert-kor-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:310 +msgid "``cl-tohoku/bert-base-japanese-char-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:310 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-char-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:313 +msgid "``mrm8488/bert-small-finetuned-squadv2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:313 +msgid "Please refer to: `mrm8488/bert-small-finetuned-squadv2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:316 +msgid "``beomi/kcbert-base``" 
+msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:316 +msgid "Please refer to: `beomi/kcbert-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:319 +msgid "``textattack/bert-base-uncased-MNLI``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:319 +msgid "Please refer to: `textattack/bert-base-uncased-MNLI`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:322 +msgid "``textattack/bert-base-uncased-WNLI``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:322 +msgid "Please refer to: `textattack/bert-base-uncased-WNLI`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:325 +msgid "``dbmdz/bert-base-turkish-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:325 +msgid "Please refer to: `dbmdz/bert-base-turkish-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:328 +msgid "``huawei-noah/TinyBERT_General_4L_312D``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:328 +msgid "Please refer to: `huawei-noah/TinyBERT_General_4L_312D`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:331 +msgid "``textattack/bert-base-uncased-QQP``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:331 +msgid "Please refer to: `textattack/bert-base-uncased-QQP`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:334 +msgid "``textattack/bert-base-uncased-STS-B``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:334 +msgid "Please refer to: `textattack/bert-base-uncased-STS-B`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:337 +msgid "``allenai/scibert_scivocab_cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:337 +msgid "Please refer to: `allenai/scibert_scivocab_cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:340 +msgid "``mrm8488/bert-medium-finetuned-squadv2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:340 +msgid "Please refer to: `mrm8488/bert-medium-finetuned-squadv2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:343 +msgid "``TurkuNLP/bert-base-finnish-cased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:343 +msgid "Finnish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:343 +msgid "Please refer to: `TurkuNLP/bert-base-finnish-cased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:346 +msgid "``textattack/bert-base-uncased-RTE``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:346 +msgid "Please refer to: `textattack/bert-base-uncased-RTE`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:349 +msgid "``uer/roberta-base-chinese-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:349 +msgid "Please refer to: `uer/roberta-base-chinese-extractive-qa`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:352 +msgid "``textattack/bert-base-uncased-QNLI``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:352 +msgid "Please refer to: `textattack/bert-base-uncased-QNLI`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:355 +msgid "``textattack/bert-base-uncased-CoLA``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:355 +msgid "Please refer to: `textattack/bert-base-uncased-CoLA`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:358 +msgid "``dmis-lab/biobert-base-cased-v1.2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:358 +msgid "Please refer to: 
`dmis-lab/biobert-base-cased-v1.2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:361 +msgid "``pierreguillou/bert-base-cased-squad-v1.1-portuguese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:361 +msgid "Please refer to: `pierreguillou/bert-base-cased-squad-v1.1-portuguese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:364 +msgid "``KB/bert-base-swedish-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:364 +#: ../model_zoo/transformers/BERT/contents.rst:466 +msgid "Swedish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:364 +msgid "Please refer to: `KB/bert-base-swedish-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:367 +msgid "``uer/roberta-base-finetuned-cluener2020-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:367 +msgid "Please refer to: `uer/roberta-base-finetuned-cluener2020-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:370 +msgid "``onlplab/alephbert-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:370 +msgid "Hebrew" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:370 +msgid "Please refer to: `onlplab/alephbert-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:373 +msgid "``mrm8488/bert-spanish-cased-finetuned-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:373 +msgid "Please refer to: `mrm8488/bert-spanish-cased-finetuned-ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:376 +msgid "``alvaroalon2/biobert_chemical_ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:376 +msgid "Please refer to: `alvaroalon2/biobert_chemical_ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:379 +msgid "``bert-base-cased-finetuned-mrpc``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:379 +msgid "Please refer to: `bert-base-cased-finetuned-mrpc`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:382 +msgid "``unitary/toxic-bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:382 +msgid "Please refer to: `unitary/toxic-bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:385 +msgid "``nlpaueb/bert-base-greek-uncased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:385 +msgid "Greek" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:385 +msgid "Please refer to: `nlpaueb/bert-base-greek-uncased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:388 +msgid "``HooshvareLab/bert-fa-base-uncased-sentiment-snappfood``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:388 +msgid "Please refer to: `HooshvareLab/bert-fa-base-uncased-sentiment-snappfood`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:391 +msgid "``Maltehb/danish-bert-botxo``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:391 +msgid "Danish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:391 +msgid "Please refer to: `Maltehb/danish-bert-botxo`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:394 +msgid "``shahrukhx01/bert-mini-finetune-question-detection``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:394 +msgid "Please refer to: `shahrukhx01/bert-mini-finetune-question-detection`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:397 +msgid "``GroNLP/bert-base-dutch-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:397 +msgid "Please refer to: 
`GroNLP/bert-base-dutch-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:400 +msgid "``SpanBERT/spanbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:400 +msgid "Please refer to: `SpanBERT/spanbert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:403 +msgid "``dbmdz/bert-base-italian-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:403 +msgid "Please refer to: `dbmdz/bert-base-italian-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:406 +msgid "``dbmdz/bert-base-german-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:406 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:406 +msgid "Please refer to: `dbmdz/bert-base-german-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:409 +msgid "``cl-tohoku/bert-large-japanese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:409 +msgid "Please refer to: `cl-tohoku/bert-large-japanese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:412 +msgid "``hfl/chinese-bert-wwm``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:412 +msgid "Please refer to: `hfl/chinese-bert-wwm`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:415 +msgid "``hfl/chinese-macbert-large``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:415 +msgid "Please refer to: `hfl/chinese-macbert-large`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:418 +msgid "``dslim/bert-base-NER-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:418 +msgid "Please refer to: `dslim/bert-base-NER-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:421 +msgid "``amberoad/bert-multilingual-passage-reranking-msmarco``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:421 +msgid "Please refer to: `amberoad/bert-multilingual-passage-reranking-msmarco`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:424 +msgid "``aubmindlab/bert-base-arabertv02``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:424 +msgid "Please refer to: `aubmindlab/bert-base-arabertv02`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:427 +msgid "``google/bert_uncased_L-4_H-256_A-4``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:427 +msgid "Please refer to: `google/bert_uncased_L-4_H-256_A-4`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:430 +msgid "``DeepPavlov/rubert-base-cased-conversational``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:430 +msgid "Please refer to: `DeepPavlov/rubert-base-cased-conversational`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:433 +msgid "``dccuchile/bert-base-spanish-wwm-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:433 +msgid "Please refer to: `dccuchile/bert-base-spanish-wwm-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:436 +msgid "``ckiplab/bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:436 +msgid "Please refer to: `ckiplab/bert-base-chinese-ws`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:439 +msgid "``daigo/bert-base-japanese-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:439 +msgid "Please refer to: `daigo/bert-base-japanese-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:442 +msgid "``SZTAKI-HLT/hubert-base-cc``" +msgstr 
"" + +#: ../model_zoo/transformers/BERT/contents.rst:442 +msgid "Hungarian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:442 +msgid "Please refer to: `SZTAKI-HLT/hubert-base-cc`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:445 +msgid "``nlpaueb/legal-bert-small-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:445 +msgid "Please refer to: `nlpaueb/legal-bert-small-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:448 +msgid "``dumitrescustefan/bert-base-romanian-uncased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:448 +#: ../model_zoo/transformers/BERT/contents.rst:508 +msgid "Romanian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:448 +msgid "Please refer to: `dumitrescustefan/bert-base-romanian-uncased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:451 +msgid "``google/muril-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:451 +msgid "Indian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:451 +msgid "Please refer to: `google/muril-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:454 +msgid "``dkleczek/bert-base-polish-uncased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:454 +msgid "Polish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:454 +msgid "Please refer to: `dkleczek/bert-base-polish-uncased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:457 +msgid "``ckiplab/bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:457 +msgid "Please refer to: `ckiplab/bert-base-chinese-ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:460 +msgid "``savasy/bert-base-turkish-sentiment-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:460 +msgid "Please refer to: `savasy/bert-base-turkish-sentiment-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:463 +msgid "``mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:463 +msgid "" +"Please refer to: `mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-" +"spa-squad2-es`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:466 +msgid "``KB/bert-base-swedish-cased-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:466 +msgid "Please refer to: `KB/bert-base-swedish-cased-ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:469 +msgid "``hfl/rbt3``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:469 +msgid "Please refer to: `hfl/rbt3`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:472 +msgid "``remotejob/gradientclassification_v0``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:472 +msgid "Please refer to: `remotejob/gradientclassification_v0`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:475 +msgid "``Recognai/bert-base-spanish-wwm-cased-xnli``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:475 +msgid "Please refer to: `Recognai/bert-base-spanish-wwm-cased-xnli`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:478 +msgid "``HooshvareLab/bert-fa-zwnj-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:478 +msgid "Please refer to: `HooshvareLab/bert-fa-zwnj-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:481 +msgid "``monologg/bert-base-cased-goemotions-group``" +msgstr "" 
+ +#: ../model_zoo/transformers/BERT/contents.rst:481 +msgid "Please refer to: `monologg/bert-base-cased-goemotions-group`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:484 +msgid "``blanchefort/rubert-base-cased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:484 +msgid "Please refer to: `blanchefort/rubert-base-cased-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:487 +msgid "``shibing624/macbert4csc-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:487 +msgid "Please refer to: `shibing624/macbert4csc-base-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:490 +msgid "``google/bert_uncased_L-8_H-512_A-8``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:490 +msgid "Please refer to: `google/bert_uncased_L-8_H-512_A-8`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:493 +msgid "``bert-large-cased-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:493 +msgid "Please refer to: `bert-large-cased-whole-word-masking`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:496 +msgid "``alvaroalon2/biobert_diseases_ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:496 +msgid "Please refer to: `alvaroalon2/biobert_diseases_ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:499 +msgid "``philschmid/BERT-Banking77``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:499 +msgid "Please refer to: `philschmid/BERT-Banking77`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:502 +msgid "``dbmdz/bert-base-turkish-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:502 +msgid "Please refer to: `dbmdz/bert-base-turkish-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:505 +msgid "``vblagoje/bert-english-uncased-finetuned-pos``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:505 +msgid "Please refer to: `vblagoje/bert-english-uncased-finetuned-pos`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:508 +msgid "``dumitrescustefan/bert-base-romanian-cased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:508 +msgid "Please refer to: `dumitrescustefan/bert-base-romanian-cased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:511 +msgid "``nreimers/BERT-Tiny_L-2_H-128_A-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:511 +msgid "Please refer to: `nreimers/BERT-Tiny_L-2_H-128_A-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:514 +msgid "``digitalepidemiologylab/covid-twitter-bert-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:514 +msgid "Please refer to: `digitalepidemiologylab/covid-twitter-bert-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:517 +msgid "``UBC-NLP/MARBERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:517 +msgid "(DA) and MSA" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:517 +msgid "Please refer to: `UBC-NLP/MARBERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:520 +msgid "``pierreguillou/bert-large-cased-squad-v1.1-portuguese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:520 +msgid "Please refer to: `pierreguillou/bert-large-cased-squad-v1.1-portuguese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:523 +msgid "``alvaroalon2/biobert_genetic_ner``" +msgstr "" + +#: 
../model_zoo/transformers/BERT/contents.rst:523 +msgid "Please refer to: `alvaroalon2/biobert_genetic_ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:526 +msgid "``bvanaken/clinical-assertion-negation-bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:526 +msgid "Please refer to: `bvanaken/clinical-assertion-negation-bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:529 +msgid "``cross-encoder/stsb-TinyBERT-L-4``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:529 +msgid "Please refer to: `cross-encoder/stsb-TinyBERT-L-4`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:532 +msgid "``sshleifer/tiny-distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:532 +msgid "Please refer to: `sshleifer/tiny-distilbert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:535 +msgid "``ckiplab/bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:535 +msgid "Please refer to: `ckiplab/bert-base-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:538 +msgid "``fabriceyhc/bert-base-uncased-amazon_polarity``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:538 +msgid "Please refer to: `fabriceyhc/bert-base-uncased-amazon_polarity`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:541 +msgid "``iverxin/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:541 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:545 +msgid "``iverxin/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:545 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on Japanese text" +" using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:550 +msgid "``iverxin/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:550 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:554 +msgid "``iverxin/bert-base-japanese-char-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:554 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text using Whole-Word-Masking." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BigBird/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BigBird/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..d77ecc282de00215eea51c174fff926b3a4a59de --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BigBird/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/BigBird/contents.rst:5 +msgid "BigBird模型汇总" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的BigBird模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:15 +msgid "``bigbird-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 127M parameters. Trained on lower-cased " +"English text." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot-Small/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot-Small/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..469637a391c9021aeac8f73d1bfe2ff89d46d760 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot-Small/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:5 +msgid "Blenderbot-Small模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Blenderbot-Small模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:15 +msgid "``blenderbot_small-90M``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:15 +msgid "16-layer, 16-heads, 90M parameters. The Blenderbot small model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..825a8687f226d4c3a49ff656d527d6c2129fa95d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot/contents.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:5 +msgid "Blenderbot模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Blenderbot模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:15 +msgid "``blenderbot-3B``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:15 +#: ../model_zoo/transformers/Blenderbot/contents.rst:19 +#: ../model_zoo/transformers/Blenderbot/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:15 +msgid "26-layer, 32-heads, 3B parameters. The Blenderbot base model." +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:19 +msgid "``blenderbot-400M-distill``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:19 +msgid "" +"14-layer, 384-hidden, 32-heads, 400M parameters. The Blenderbot distil " +"model." +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:23 +msgid "``blenderbot-1B-distill``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:23 +msgid "14-layer, 32-heads, 1478M parameters. The Blenderbot distil 1B model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/CTRL/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/CTRL/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..5359e6346dd16954ccef2fff81fa958b91f1c1ca --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/CTRL/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/CTRL/contents.rst:5 +msgid "CTRL模型汇总" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的CTRL模型对应预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:14 +msgid "``ctrl``" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:14 +#: ../model_zoo/transformers/CTRL/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:14 +msgid "48-layer, 1280-hidden, 16-heads, 1701M parameters. The CTRL base model." +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:18 +msgid "``sshleifer-tiny-ctrl``" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:18 +msgid "2-layer, 16-hidden, 2-heads, 5M parameters. The Tiny CTRL model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ChineseBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ChineseBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..704825363f2d83c30dd0e2341125ffa533116906 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ChineseBert/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:5 +msgid "ChineseBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ChineseBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:15 +msgid "``ChineseBERT-base``" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:15 +#: ../model_zoo/transformers/ChineseBert/contents.rst:18 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:15 +msgid "For details, please refer to: ChineseBERT-base_" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:18 +msgid "``ChineseBERT-large``" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:18 +msgid "For details, please refer to: ChineseBERT-large_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ConvBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ConvBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..2c56bbc24c132881d22be32689bd415493ecbedc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ConvBert/contents.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ConvBert/contents.rst:5 +msgid "ConvBERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ConvBERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:15 +msgid "``convbert-base``" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:15 +#: ../model_zoo/transformers/ConvBert/contents.rst:19 +#: ../model_zoo/transformers/ConvBert/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 106M parameters. The ConvBERT base model." 
+msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:19 +msgid "``convbert-medium-small``" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:19 +msgid "" +"12-layer, 384-hidden, 8-heads, 17M parameters. The ConvBERT medium small " +"model." +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:23 +msgid "``convbert-small``" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:23 +msgid "12-layer, 128-hidden, 4-heads, 13M parameters. The ConvBERT small model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/DistilBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/DistilBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..e184f5127268d6de7a59a2d7a0c82e2f9c78eb1c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/DistilBert/contents.po @@ -0,0 +1,84 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/DistilBert/contents.rst:5 +msgid "DistilBERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的DistilBERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:15 +msgid "``distilbert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:15 +#: ../model_zoo/transformers/DistilBert/contents.rst:20 +#: ../model_zoo/transformers/DistilBert/contents.rst:25 +#: ../model_zoo/transformers/DistilBert/contents.rst:30 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:15 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-uncased``." +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:20 +msgid "``distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:20 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-cased``." +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:25 +msgid "``renmada/distilbert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:25 +msgid "" +"6-layer, 768-hidden, 12-heads, 200M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-multilingual-cased``." +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:30 +msgid "``renmada/sshleifer-tiny-distilbert-base-uncase-finetuned-sst-2-english``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:30 +msgid "2-layer, 2-hidden, 2-heads, 50K parameters. The DistilBERT model." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ELECTRA/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ELECTRA/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..0c3eefb80966f77f0e880b2c420ec66f4a131f27 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ELECTRA/contents.po @@ -0,0 +1,140 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:5 +msgid "ELECTRA模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ELECTRA模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:15 +msgid "``electra-small``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:15 +#: ../model_zoo/transformers/ELECTRA/contents.rst:19 +#: ../model_zoo/transformers/ELECTRA/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 4-heads, 14M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:19 +msgid "``electra-base``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:23 +msgid "``electra-large``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:23 +msgid "" +"24-layer, 1024-hidden, 16-heads, 334M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:27 +msgid "``chinese-electra-small``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:27 +#: ../model_zoo/transformers/ELECTRA/contents.rst:31 +#: ../model_zoo/transformers/ELECTRA/contents.rst:35 +#: ../model_zoo/transformers/ELECTRA/contents.rst:39 +#: ../model_zoo/transformers/ELECTRA/contents.rst:43 +#: ../model_zoo/transformers/ELECTRA/contents.rst:47 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:27 +msgid "12-layer, 768-hidden, 4-heads, 12M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:31 +msgid "``chinese-electra-base``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:31 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:35 +msgid "``ernie-health-chinese``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:35 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese " +"medical corpus." 
+msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:39 +msgid "``junnyu/hfl-chinese-electra-180g-base-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:39 +msgid "" +"Discriminator, 12-layer, 768-hidden, 12-heads, 102M parameters. Trained " +"on 180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:43 +msgid "``junnyu/hfl-chinese-electra-180g-small-ex-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:43 +msgid "" +"Discriminator, 24-layer, 256-hidden, 4-heads, 24M parameters. Trained on " +"180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:47 +msgid "``junnyu/hfl-chinese-legal-electra-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:47 +msgid "" +"Generator, 12-layer, 64-hidden, 1-heads, 3M parameters. Trained on " +"Chinese legal corpus." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-CTM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-CTM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..1bd874dc170fdb6c92d0ebdceff21e566ae98ccf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-CTM/contents.po @@ -0,0 +1,65 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:5 +msgid "ERNIE-CTM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-CTM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:15 +msgid "``ernie-ctm``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:20 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:25 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:20 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:25 +msgid "" +"12-layer, 768-hidden, 12-heads, _M parameters. For details, please refer " +"to the ernie-ctm_" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:20 +msgid "``wordtag``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:25 +msgid "``nptag``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-DOC/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-DOC/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..9ef7c49ba57f5826d30a63727691955136e1743b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-DOC/contents.po @@ -0,0 +1,65 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:5 +msgid "ERNIE-DOC模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-DOC模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:15 +msgid "``ernie-doc-base-zh``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:15 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:19 +msgid "``ernie-doc-base-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GEN/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GEN/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..c73b76a6826c5ac86732c239be226abd7977dc2d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GEN/contents.po @@ -0,0 +1,75 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:5 +msgid "ERNIE-GEN模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-GEN模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:15 +msgid "``ernie-gen-base-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:20 +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:25 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:20 +msgid "``ernie-gen-large-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:20 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:25 +msgid "``ernie-gen-large-en-430g``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:25 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text with extended data (430 GB)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GRAM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GRAM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..088e37c7fda9205d6a8e9951a88c7c9984cced95 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GRAM/contents.po @@ -0,0 +1,62 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:5 +msgid "ERNIE-GRAM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-GRAM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:15 +msgid "``ernie-gram-zh``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:20 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:20 +msgid "``ernie-gram-zh-finetuned-dureader-robust``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:20 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +" Then finetuned on dreader-robust." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-M/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-M/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..4de3b49f55b7cb283270543531a19026b27e6792 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-M/contents.po @@ -0,0 +1,64 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:5 +msgid "ERNIE-M模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-M模型对应预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:14 +msgid "``ernie-m-base``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:14 +#: ../model_zoo/transformers/ERNIE-M/contents.rst:19 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:14 +msgid "" +"12-layer, 768-hidden, 12-heads, _M parameters. Trained on pseudo-" +"parallel sentence pairs on a monolingual corpus." 
+msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:19 +msgid "``ernie-m-large``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. Trained on pseudo-" +"parallel sentence pairs on a monolingual corpus." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..1584610c128f77d73cc401b10ef394a1e76a4811 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE/contents.po @@ -0,0 +1,105 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE/contents.rst:5 +msgid "ERNIE模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:15 +msgid "``ernie-3.0-medium-zh``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:15 +#: ../model_zoo/transformers/ERNIE/contents.rst:19 +#: ../model_zoo/transformers/ERNIE/contents.rst:35 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:19 +msgid "``ernie-tiny``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:19 +msgid "3-layer, 1024-hidden, 16-heads, _M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:23 +msgid "``ernie-2.0-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:23 +#: ../model_zoo/transformers/ERNIE/contents.rst:27 +#: ../model_zoo/transformers/ERNIE/contents.rst:31 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:23 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:27 +msgid "``ernie-2.0-en-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:27 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on finetuned " +"squad text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:31 +msgid "``ernie-2.0-large-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:31 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." 
+msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:35 +msgid "``zhui/ernie-1.0-cluecorpussmall``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:35 +msgid "Please refer to: `zhui/ernie-1.0-cluecorpussmall`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/FNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/FNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..4a662a8bb31586fb8caac6724be909151b191311 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/FNet/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/FNet/contents.rst:5 +msgid "FNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的FNet模型对应预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:14 +msgid "``fnet-base``" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:14 +#: ../model_zoo/transformers/FNet/contents.rst:17 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:14 +msgid "For details, please refer to: `google/fnet-base`_" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:17 +msgid "``fnet-large``" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:17 +msgid "For details, please refer to: `google/fnet-large`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Funnel/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Funnel/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..68a37ba326ed029d9dc340e18d628bfdf4e888ba --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Funnel/contents.po @@ -0,0 +1,132 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Funnel/contents.rst:5 +msgid "Funnel模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Funnel模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:15 +msgid "``funnel-transformer/small``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:15 +#: ../model_zoo/transformers/Funnel/contents.rst:18 +#: ../model_zoo/transformers/Funnel/contents.rst:21 +#: ../model_zoo/transformers/Funnel/contents.rst:24 +#: ../model_zoo/transformers/Funnel/contents.rst:27 +#: ../model_zoo/transformers/Funnel/contents.rst:30 +#: ../model_zoo/transformers/Funnel/contents.rst:33 +#: ../model_zoo/transformers/Funnel/contents.rst:36 +#: ../model_zoo/transformers/Funnel/contents.rst:39 +#: ../model_zoo/transformers/Funnel/contents.rst:42 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:15 +msgid "For details, please refer to: `funnel-transformer/small`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:18 +msgid "``funnel-transformer/small-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:18 +msgid "For details, please refer to: `funnel-transformer/small-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:21 +msgid "``funnel-transformer/meduim``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:21 +msgid "For details, please refer to: `funnel-transformer/meduim`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:24 +msgid "``funnel-transformer/meduim-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:24 +msgid "For details, please refer to: `funnel-transformer/meduim-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:27 +msgid "``funnel-transformer/intermediate``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:27 +msgid "For details, please refer to: `funnel-transformer/intermediate`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:30 +msgid "``funnel-transformer/intermediate-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:30 +msgid "For details, please refer to: `funnel-transformer/intermediate-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:33 +msgid "``funnel-transformer/large``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:33 +msgid "For details, please refer to: `funnel-transformer/large`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:36 +msgid "``funnel-transformer/large-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:36 +msgid "For details, please refer to: `funnel-transformer/large-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:39 +msgid "``funnel-transformer/xlarge``" +msgstr "" + +#: 
../model_zoo/transformers/Funnel/contents.rst:39 +msgid "For details, please refer to: `funnel-transformer/xlarge`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:42 +msgid "``funnel-transformer/xlarge-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:42 +msgid "For details, please refer to: `funnel-transformer/xlarge-base`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/GPT/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/GPT/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..819c17c7821aef9fdb17a4c8f7224226597d2d03 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/GPT/contents.po @@ -0,0 +1,779 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/GPT/contents.rst:5 +msgid "GPT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:8 +msgid "下表汇总介绍了目前PaddleNLP支持的GPT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:14 +msgid "``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:14 +#: ../model_zoo/transformers/GPT/contents.rst:18 +#: ../model_zoo/transformers/GPT/contents.rst:55 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:14 +msgid "32-layer, 2560-hidden, 32-heads, 2.6B parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:18 +msgid "``gpt-cpm-small-cn-distill``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:18 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. 
The model distilled from" +" the GPT model ``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:23 +msgid "``gpt2-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:23 +#: ../model_zoo/transformers/GPT/contents.rst:27 +#: ../model_zoo/transformers/GPT/contents.rst:31 +#: ../model_zoo/transformers/GPT/contents.rst:35 +#: ../model_zoo/transformers/GPT/contents.rst:39 +#: ../model_zoo/transformers/GPT/contents.rst:43 +#: ../model_zoo/transformers/GPT/contents.rst:47 +#: ../model_zoo/transformers/GPT/contents.rst:51 +#: ../model_zoo/transformers/GPT/contents.rst:59 +#: ../model_zoo/transformers/GPT/contents.rst:65 +#: ../model_zoo/transformers/GPT/contents.rst:68 +#: ../model_zoo/transformers/GPT/contents.rst:71 +#: ../model_zoo/transformers/GPT/contents.rst:74 +#: ../model_zoo/transformers/GPT/contents.rst:80 +#: ../model_zoo/transformers/GPT/contents.rst:83 +#: ../model_zoo/transformers/GPT/contents.rst:89 +#: ../model_zoo/transformers/GPT/contents.rst:95 +#: ../model_zoo/transformers/GPT/contents.rst:98 +#: ../model_zoo/transformers/GPT/contents.rst:104 +#: ../model_zoo/transformers/GPT/contents.rst:107 +#: ../model_zoo/transformers/GPT/contents.rst:110 +#: ../model_zoo/transformers/GPT/contents.rst:116 +#: ../model_zoo/transformers/GPT/contents.rst:122 +#: ../model_zoo/transformers/GPT/contents.rst:134 +#: ../model_zoo/transformers/GPT/contents.rst:140 +#: ../model_zoo/transformers/GPT/contents.rst:146 +#: ../model_zoo/transformers/GPT/contents.rst:149 +#: ../model_zoo/transformers/GPT/contents.rst:161 +#: ../model_zoo/transformers/GPT/contents.rst:164 +#: ../model_zoo/transformers/GPT/contents.rst:167 +#: ../model_zoo/transformers/GPT/contents.rst:170 +#: ../model_zoo/transformers/GPT/contents.rst:173 +#: ../model_zoo/transformers/GPT/contents.rst:176 +#: ../model_zoo/transformers/GPT/contents.rst:191 +#: ../model_zoo/transformers/GPT/contents.rst:197 +#: ../model_zoo/transformers/GPT/contents.rst:200 +#: ../model_zoo/transformers/GPT/contents.rst:203 +#: ../model_zoo/transformers/GPT/contents.rst:209 +#: ../model_zoo/transformers/GPT/contents.rst:212 +#: ../model_zoo/transformers/GPT/contents.rst:221 +#: ../model_zoo/transformers/GPT/contents.rst:227 +#: ../model_zoo/transformers/GPT/contents.rst:230 +#: ../model_zoo/transformers/GPT/contents.rst:233 +#: ../model_zoo/transformers/GPT/contents.rst:236 +#: ../model_zoo/transformers/GPT/contents.rst:239 +#: ../model_zoo/transformers/GPT/contents.rst:248 +#: ../model_zoo/transformers/GPT/contents.rst:251 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:23 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:27 +msgid "``gpt2-medium-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:27 +msgid "24-layer, 1024-hidden, 16-heads, 345M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:31 +msgid "``gpt2-large-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:31 +#: ../model_zoo/transformers/GPT/contents.rst:51 +msgid "36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:35 +msgid "``gpt2-xl-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:35 +msgid "" +"48-layer, 1600-hidden, 25-heads, 1558M parameters. Trained on English " +"text." 
+msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:39 +msgid "``junnyu/distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:39 +msgid "6-layer, 768-hidden, 12-heads, 81M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:43 +msgid "``junnyu/microsoft-DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:43 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:47 +msgid "``junnyu/microsoft-DialoGPT-medium``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:47 +msgid "24-layer, 1024-hidden, 16-heads, 354M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:51 +msgid "``junnyu/microsoft-DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:55 +msgid "``junnyu/uer-gpt2-chinese-poem``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:55 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on Chinese " +"poetry corpus." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:59 +msgid "``distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:59 +msgid "Please refer to: `distilgpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:62 +msgid "``w11wo/javanese-gpt2-small-imdb``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:62 +msgid "Javanese" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:62 +msgid "Please refer to: `w11wo/javanese-gpt2-small-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:65 +msgid "``remotejob/tweetsDISTILGPT2fi_v4``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:65 +msgid "Please refer to: `remotejob/tweetsDISTILGPT2fi_v4`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:68 +msgid "``TrLOX/gpt2-tdk``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:68 +msgid "Please refer to: `TrLOX/gpt2-tdk`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:71 +msgid "``huggingtweets/slime_machine``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:71 +msgid "Please refer to: `huggingtweets/slime_machine`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:74 +msgid "``microsoft/DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:74 +msgid "Please refer to: `microsoft/DialoGPT-small`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:77 +msgid "``sberbank-ai/rugpt3large_based_on_gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:77 +#: ../model_zoo/transformers/GPT/contents.rst:86 +#: ../model_zoo/transformers/GPT/contents.rst:101 +msgid "Russian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:77 +msgid "Please refer to: `sberbank-ai/rugpt3large_based_on_gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:80 +msgid "``sshleifer/tiny-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:80 +msgid "Please refer to: `sshleifer/tiny-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:83 +msgid "``microsoft/DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:83 +msgid "Please refer to: `microsoft/DialoGPT-large`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:86 +msgid "``sberbank-ai/rugpt3small_based_on_gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:86 +msgid "Please refer to: 
`sberbank-ai/rugpt3small_based_on_gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:89 +msgid "``uw-hai/polyjuice``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:89 +msgid "Please refer to: `uw-hai/polyjuice`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:92 +msgid "``NYTK/text-generation-poem-petofi-gpt2-small-hungarian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:92 +msgid "Hungarian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:92 +msgid "Please refer to: `NYTK/text-generation-poem-petofi-gpt2-small-hungarian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:95 +msgid "``microsoft/DialogRPT-human-vs-rand``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:95 +msgid "Please refer to: `microsoft/DialogRPT-human-vs-rand`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:98 +msgid "``hf-internal-testing/tiny-random-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:98 +msgid "Please refer to: `hf-internal-testing/tiny-random-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:101 +msgid "``Grossmend/rudialogpt3_medium_based_on_gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:101 +msgid "Please refer to: `Grossmend/rudialogpt3_medium_based_on_gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:104 +msgid "``pranavpsv/genre-story-generator-v2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:104 +msgid "Please refer to: `pranavpsv/genre-story-generator-v2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:107 +msgid "``microsoft/DialogRPT-updown``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:107 +msgid "Please refer to: `microsoft/DialogRPT-updown`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:110 +msgid "``microsoft/DialogRPT-human-vs-machine``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:110 +msgid "Please refer to: `microsoft/DialogRPT-human-vs-machine`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:113 +msgid "``pierreguillou/gpt2-small-portuguese``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:113 +msgid "Portuguese" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:113 +msgid "Please refer to: `pierreguillou/gpt2-small-portuguese`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:116 +msgid "``mrm8488/GPT-2-finetuned-covid-bio-medrxiv``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:116 +msgid "Please refer to: `mrm8488/GPT-2-finetuned-covid-bio-medrxiv`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:119 +msgid "``anonymous-german-nlp/german-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:119 +#: ../model_zoo/transformers/GPT/contents.rst:128 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:119 +msgid "Please refer to: `anonymous-german-nlp/german-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:122 +msgid "``microsoft/CodeGPT-small-py``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:122 +msgid "Please refer to: `microsoft/CodeGPT-small-py`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:125 +msgid "``antoiloui/belgpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:125 +#: ../model_zoo/transformers/GPT/contents.rst:131 +#: ../model_zoo/transformers/GPT/contents.rst:152 +#: ../model_zoo/transformers/GPT/contents.rst:245 +msgid "French" +msgstr "" 
+ +#: ../model_zoo/transformers/GPT/contents.rst:125 +msgid "Please refer to: `antoiloui/belgpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:128 +msgid "``benjamin/gerpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:128 +msgid "Please refer to: `benjamin/gerpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:131 +msgid "``asi/gpt-fr-cased-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:131 +msgid "Please refer to: `asi/gpt-fr-cased-small`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:134 +msgid "``microsoft/CodeGPT-small-java-adaptedGPT2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:134 +msgid "Please refer to: `microsoft/CodeGPT-small-java-adaptedGPT2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:137 +msgid "``GroNLP/gpt2-small-dutch``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:137 +msgid "Dutch" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:137 +msgid "Please refer to: `GroNLP/gpt2-small-dutch`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:140 +msgid "``lvwerra/gpt2-imdb``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:140 +msgid "Please refer to: `lvwerra/gpt2-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:143 +msgid "``DeepESP/gpt2-spanish``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:143 +#: ../model_zoo/transformers/GPT/contents.rst:194 +msgid "Spanish" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:143 +msgid "Please refer to: `DeepESP/gpt2-spanish`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:146 +msgid "``microsoft/CodeGPT-small-py-adaptedGPT2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:146 +msgid "Please refer to: `microsoft/CodeGPT-small-py-adaptedGPT2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:149 +msgid "``microsoft/DialogRPT-width``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:149 +msgid "Please refer to: `microsoft/DialogRPT-width`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:152 +msgid "``dbddv01/gpt2-french-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:152 +msgid "Please refer to: `dbddv01/gpt2-french-small`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:155 +msgid "``GroNLP/gpt2-small-italian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:155 +#: ../model_zoo/transformers/GPT/contents.rst:206 +msgid "Italian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:155 +msgid "Please refer to: `GroNLP/gpt2-small-italian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:158 +msgid "``flax-community/gpt2-medium-persian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:158 +#: ../model_zoo/transformers/GPT/contents.rst:185 +msgid "Persian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:158 +msgid "Please refer to: `flax-community/gpt2-medium-persian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:161 +msgid "``microsoft/DialogRPT-depth``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:161 +msgid "Please refer to: `microsoft/DialogRPT-depth`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:164 +msgid "``Nokia/nlgp-natural``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:164 +msgid "Please refer to: `Nokia/nlgp-natural`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:167 +msgid 
"``macedonizer/hr-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:167 +msgid "Please refer to: `macedonizer/hr-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:170 +msgid "``mrm8488/GPT-2-finetuned-common_gen``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:170 +msgid "Please refer to: `mrm8488/GPT-2-finetuned-common_gen`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:173 +msgid "``pranavpsv/gpt2-genre-story-generator``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:173 +msgid "Please refer to: `pranavpsv/gpt2-genre-story-generator`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:176 +msgid "``rbhushan/distilgpt2-finetuned-wikitext2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:176 +msgid "Please refer to: `rbhushan/distilgpt2-finetuned-wikitext2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:179 +msgid "``readerbench/RoGPT2-large``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:179 +msgid "Romanian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:179 +msgid "Please refer to: `readerbench/RoGPT2-large`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:182 +msgid "``flax-community/gpt2-small-indonesian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:182 +#: ../model_zoo/transformers/GPT/contents.rst:188 +#: ../model_zoo/transformers/GPT/contents.rst:215 +msgid "Indonesian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:182 +msgid "Please refer to: `flax-community/gpt2-small-indonesian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:185 +msgid "``HooshvareLab/gpt2-fa``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:185 +msgid "Please refer to: `HooshvareLab/gpt2-fa`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:188 +msgid "``cahya/gpt2-small-indonesian-522M``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:188 +msgid "Please refer to: `cahya/gpt2-small-indonesian-522M`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:191 +msgid "``DingleyMaillotUrgell/homer-bot``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:191 +msgid "Please refer to: `DingleyMaillotUrgell/homer-bot`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:194 +msgid "``datificate/gpt2-small-spanish``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:194 +msgid "Please refer to: `datificate/gpt2-small-spanish`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:197 +msgid "``ericzhou/tsundere_v1``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:197 +msgid "Please refer to: `ericzhou/tsundere_v1`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:200 +msgid "``huggingtweets/wwm_shakespeare``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:200 +msgid "Please refer to: `huggingtweets/wwm_shakespeare`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:203 +msgid "``SIC98/GPT2-python-code-generator``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:203 +msgid "Please refer to: `SIC98/GPT2-python-code-generator`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:206 +msgid "``GroNLP/gpt2-small-italian-embeddings``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:206 +msgid "Please refer to: `GroNLP/gpt2-small-italian-embeddings`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:209 +msgid 
"``huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:209 +msgid "" +"Please refer to: `huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-" +"witheredstrings`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:212 +msgid "``salesken/grammar_correction``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:212 +msgid "Please refer to: `salesken/grammar_correction`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:215 +msgid "``flax-community/gpt2-medium-indonesian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:215 +msgid "Please refer to: `flax-community/gpt2-medium-indonesian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:218 +msgid "``gorkemgoknar/gpt2-small-turkish``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:218 +msgid "Turkish" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:218 +msgid "Please refer to: `gorkemgoknar/gpt2-small-turkish`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:221 +msgid "``deepparag/DumBot``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:221 +msgid "Please refer to: `deepparag/DumBot`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:224 +msgid "``jcblaise/gpt2-tagalog``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:224 +msgid "Tagalog" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:224 +msgid "Please refer to: `jcblaise/gpt2-tagalog`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:227 +msgid "``BigSalmon/InformalToFormalLincoln21``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:227 +msgid "Please refer to: `BigSalmon/InformalToFormalLincoln21`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:230 +msgid "``LorenzoDeMattei/GePpeTto``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:230 +msgid "Please refer to: `LorenzoDeMattei/GePpeTto`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:233 +msgid "``macedonizer/sr-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:233 +msgid "Please refer to: `macedonizer/sr-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:236 +msgid "``indonesian-nlp/gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:236 +msgid "Please refer to: `indonesian-nlp/gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:239 +msgid "``ceostroff/harry-potter-gpt2-fanfiction``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:239 +msgid "Please refer to: `ceostroff/harry-potter-gpt2-fanfiction`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:242 +msgid "``akhooli/gpt2-small-arabic-poetry``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:242 +msgid "Arabic" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:242 +msgid "Please refer to: `akhooli/gpt2-small-arabic-poetry`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:245 +msgid "``asi/gpt-fr-cased-base``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:245 +msgid "Please refer to: `asi/gpt-fr-cased-base`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:248 +msgid "``congcongwang/gpt2_medium_fine_tuned_coder``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:248 +msgid "Please refer to: `congcongwang/gpt2_medium_fine_tuned_coder`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:251 +msgid "``cambridgeltl/simctg_wikitext103``" +msgstr "" + 
+#: ../model_zoo/transformers/GPT/contents.rst:251 +msgid "Please refer to: `cambridgeltl/simctg_wikitext103`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..20958ef2b77586917a438bf19aeb15200fc8547e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLM/contents.po @@ -0,0 +1,64 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:5 +msgid "LayoutLM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的LayoutLM模型以及对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:15 +msgid "``layoutlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:15 +#: ../model_zoo/transformers/LayoutLM/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 339M parameters. LayoutLm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:19 +msgid "``layoutlm-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 51M parameters. LayoutLm large Uncased " +"model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLMV2/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLMV2/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..d8535b9511a622f4df80384a2bbfd685f293252d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLMV2/contents.po @@ -0,0 +1,64 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:5 +msgid "LayoutLMV2模型汇总" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的LayoutLMV2模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:15 +msgid "``layoutlmv2-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:15 +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 200M parameters. LayoutLmv2 base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:19 +msgid "``layoutlmv2-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. LayoutLmv2 large uncased " +"model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutXLM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutXLM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..0fecb29cfba3cbc6dfe245cfdc4f7be839d562a8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutXLM/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:5 +msgid "LayoutXLM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的LayoutXLM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:15 +msgid "``layoutxlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 369M parameters. Layoutxlm base uncased " +"model." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Luke/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Luke/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..c13f1c2fc2b1bc65d3ab0ecad382fbf31753cd94 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Luke/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Luke/contents.rst:5 +msgid "Luke模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Luke模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:15 +msgid "``luke-base``" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:15 +#: ../model_zoo/transformers/Luke/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:15 +msgid "For details, please refer to: `studio-ousia/luke-base`_" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:18 +msgid "``luke-large``" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:18 +msgid "For details, please refer to: `studio-ousia/luke-large`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MBart/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MBart/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..e18558c6b269722a4ef81e5973cd5553d99f237f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MBart/contents.po @@ -0,0 +1,97 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MBart/contents.rst:5 +msgid "MBart模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MBart模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:15 +msgid "``mbart-large-cc25``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:15 +#: ../model_zoo/transformers/MBart/contents.rst:20 +#: ../model_zoo/transformers/MBart/contents.rst:25 +#: ../model_zoo/transformers/MBart/contents.rst:30 +#: ../model_zoo/transformers/MBart/contents.rst:35 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:15 +msgid "" +"12-layer, 1024-hidden, 12-heads, 1123M parameters. The ``mbart-large-" +"cc25`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:20 +msgid "``mbart-large-en-ro``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:20 +msgid "" +"12-layer, 768-hidden, 16-heads, 1123M parameters. The ``mbart-large rn-" +"ro`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:25 +msgid "``mbart-large-50-one-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:25 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-one-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:30 +msgid "``mbart-large-50-many-to-one-mmt``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:30 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-one-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:35 +msgid "``mbart-large-50-many-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:35 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-many-mmt`` model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MPNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MPNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..9270a38b9c0eac1b7eda6018fcd88654bda903f2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MPNet/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MPNet/contents.rst:5 +msgid "MPNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MPNet模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:15 +msgid "``mpnet-base``" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 109M parameters. MPNet Base Model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MegatronBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MegatronBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..3fce0ad660ad375361b632b6cfc5dcdb07fb173d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MegatronBert/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:5 +msgid "MegatronBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MegatronBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:15 +msgid "``megatronbert-cased``" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:15 +#: ../model_zoo/transformers/MegatronBert/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:15 +msgid "For details, please refer to: `nvidia/megatron-bert-cased-345m`_" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:18 +msgid "``megatronbert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:18 +msgid "For details, please refer to: `nvidia/megatron-bert-uncased-345m`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MobileBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MobileBert/contents.po new file mode 100644 index 
0000000000000000000000000000000000000000..68a3649645aeb0c247ff8bb9e70d2476ec1829e8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MobileBert/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MobileBert/contents.rst:5 +msgid "MobileBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MobileBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:15 +msgid "``mobilebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:15 +msgid "For details, please refer to: `google/mobilebert-uncased`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/NeZha/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/NeZha/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..e50f08e591ed72a6d3955be26c8ca9335d1888cf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/NeZha/contents.po @@ -0,0 +1,75 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/NeZha/contents.rst:5 +msgid "NeZha模型汇总" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的NeZha模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:15 +msgid "``nezha-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:15 +#: ../model_zoo/transformers/NeZha/contents.rst:19 +#: ../model_zoo/transformers/NeZha/contents.rst:23 +#: ../model_zoo/transformers/NeZha/contents.rst:27 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." 
+msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:19 +msgid "``nezha-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:19 +#: ../model_zoo/transformers/NeZha/contents.rst:27 +msgid "24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:23 +msgid "``nezha-base-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:23 +msgid "12-layer, 768-hidden, 16-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:27 +msgid "``nezha-large-wwm-chinese``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/PPMiniLM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/PPMiniLM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..4e878768196544a74fa2bae6d536cdcbcbf984d5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/PPMiniLM/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:5 +msgid "PPMiniLM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的PPMiniLM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:15 +msgid "``ppminilm-6l-768h``" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:15 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:15 +msgid "" +"A Chinese characteristic small model using multiple model compression. " +"Please refer to: ppminilm-6l-768h_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ProphetNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ProphetNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..2f2ee06dcd52d772221b6b380c01e10efddd87ef --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ProphetNet/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:5 +msgid "ProphetNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ProphetNet模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:15 +msgid "``prophetnet-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:15 +msgid "For details, please refer to: `microsoft/prophetnet-large-uncased`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Reformer/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Reformer/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..02f4319ef8f76a463786811afe0a37f66c98292d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Reformer/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Reformer/contents.rst:5 +msgid "Reformer模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Reformer模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:15 +msgid "``reformer-enwik8``" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:15 +#: ../model_zoo/transformers/Reformer/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:15 +msgid "12-layer, 1024-hidden, 8-heads, 148M parameters." +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:18 +msgid "``reformer-crime-and-punishment``" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:18 +msgid "6-layer, 256-hidden, 2-heads, 3M parameters." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RemBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RemBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..cf1d91f9411ac9b0ab3d1aaeb9732f18f71c8900 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RemBert/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/RemBert/contents.rst:5 +msgid "RemBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的RemBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:15 +msgid "``rembert``" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:15 +msgid "" +"For details, please refer to the corresponding model card of huggingface:" +" `google/rembert`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoBERTa/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoBERTa/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..afe2d722f2216309148c215678a5317a451d89b1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoBERTa/contents.po @@ -0,0 +1,984 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:5 +msgid "RoBERTa模型汇总" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:8 +msgid "下表汇总介绍了目前PaddleNLP支持的RoBERTa模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:14 +msgid "``roberta-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:14 +#: ../model_zoo/transformers/RoBERTa/contents.rst:19 +#: ../model_zoo/transformers/RoBERTa/contents.rst:24 +#: ../model_zoo/transformers/RoBERTa/contents.rst:27 +#: ../model_zoo/transformers/RoBERTa/contents.rst:46 +#: ../model_zoo/transformers/RoBERTa/contents.rst:50 +#: ../model_zoo/transformers/RoBERTa/contents.rst:54 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:14 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on English Text " +"using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:19 +msgid "``roberta-wwm-ext-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 325M parameters. Trained on English Text" +" using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:24 +msgid "``rbt3``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:24 +msgid "3-layer, 768-hidden, 12-heads, 38M parameters." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:27 +msgid "``rbtl3``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:27 +msgid "3-layer, 1024-hidden, 16-heads, 61M parameters." 
+msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:30 +msgid "``nosaydomore/deepset-roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:30 +#: ../model_zoo/transformers/RoBERTa/contents.rst:34 +#: ../model_zoo/transformers/RoBERTa/contents.rst:38 +#: ../model_zoo/transformers/RoBERTa/contents.rst:42 +#: ../model_zoo/transformers/RoBERTa/contents.rst:58 +#: ../model_zoo/transformers/RoBERTa/contents.rst:61 +#: ../model_zoo/transformers/RoBERTa/contents.rst:64 +#: ../model_zoo/transformers/RoBERTa/contents.rst:67 +#: ../model_zoo/transformers/RoBERTa/contents.rst:70 +#: ../model_zoo/transformers/RoBERTa/contents.rst:73 +#: ../model_zoo/transformers/RoBERTa/contents.rst:76 +#: ../model_zoo/transformers/RoBERTa/contents.rst:79 +#: ../model_zoo/transformers/RoBERTa/contents.rst:82 +#: ../model_zoo/transformers/RoBERTa/contents.rst:85 +#: ../model_zoo/transformers/RoBERTa/contents.rst:88 +#: ../model_zoo/transformers/RoBERTa/contents.rst:91 +#: ../model_zoo/transformers/RoBERTa/contents.rst:94 +#: ../model_zoo/transformers/RoBERTa/contents.rst:97 +#: ../model_zoo/transformers/RoBERTa/contents.rst:100 +#: ../model_zoo/transformers/RoBERTa/contents.rst:103 +#: ../model_zoo/transformers/RoBERTa/contents.rst:109 +#: ../model_zoo/transformers/RoBERTa/contents.rst:112 +#: ../model_zoo/transformers/RoBERTa/contents.rst:115 +#: ../model_zoo/transformers/RoBERTa/contents.rst:118 +#: ../model_zoo/transformers/RoBERTa/contents.rst:121 +#: ../model_zoo/transformers/RoBERTa/contents.rst:124 +#: ../model_zoo/transformers/RoBERTa/contents.rst:127 +#: ../model_zoo/transformers/RoBERTa/contents.rst:130 +#: ../model_zoo/transformers/RoBERTa/contents.rst:136 +#: ../model_zoo/transformers/RoBERTa/contents.rst:139 +#: ../model_zoo/transformers/RoBERTa/contents.rst:142 +#: ../model_zoo/transformers/RoBERTa/contents.rst:145 +#: ../model_zoo/transformers/RoBERTa/contents.rst:148 +#: ../model_zoo/transformers/RoBERTa/contents.rst:151 +#: ../model_zoo/transformers/RoBERTa/contents.rst:154 +#: ../model_zoo/transformers/RoBERTa/contents.rst:157 +#: ../model_zoo/transformers/RoBERTa/contents.rst:160 +#: ../model_zoo/transformers/RoBERTa/contents.rst:163 +#: ../model_zoo/transformers/RoBERTa/contents.rst:166 +#: ../model_zoo/transformers/RoBERTa/contents.rst:169 +#: ../model_zoo/transformers/RoBERTa/contents.rst:172 +#: ../model_zoo/transformers/RoBERTa/contents.rst:175 +#: ../model_zoo/transformers/RoBERTa/contents.rst:178 +#: ../model_zoo/transformers/RoBERTa/contents.rst:181 +#: ../model_zoo/transformers/RoBERTa/contents.rst:184 +#: ../model_zoo/transformers/RoBERTa/contents.rst:190 +#: ../model_zoo/transformers/RoBERTa/contents.rst:193 +#: ../model_zoo/transformers/RoBERTa/contents.rst:199 +#: ../model_zoo/transformers/RoBERTa/contents.rst:202 +#: ../model_zoo/transformers/RoBERTa/contents.rst:205 +#: ../model_zoo/transformers/RoBERTa/contents.rst:208 +#: ../model_zoo/transformers/RoBERTa/contents.rst:211 +#: ../model_zoo/transformers/RoBERTa/contents.rst:214 +#: ../model_zoo/transformers/RoBERTa/contents.rst:217 +#: ../model_zoo/transformers/RoBERTa/contents.rst:220 +#: ../model_zoo/transformers/RoBERTa/contents.rst:223 +#: ../model_zoo/transformers/RoBERTa/contents.rst:226 +#: ../model_zoo/transformers/RoBERTa/contents.rst:229 +#: ../model_zoo/transformers/RoBERTa/contents.rst:232 +#: ../model_zoo/transformers/RoBERTa/contents.rst:235 +#: ../model_zoo/transformers/RoBERTa/contents.rst:241 +#: ../model_zoo/transformers/RoBERTa/contents.rst:244 +#: 
../model_zoo/transformers/RoBERTa/contents.rst:247 +#: ../model_zoo/transformers/RoBERTa/contents.rst:253 +#: ../model_zoo/transformers/RoBERTa/contents.rst:256 +#: ../model_zoo/transformers/RoBERTa/contents.rst:259 +#: ../model_zoo/transformers/RoBERTa/contents.rst:262 +#: ../model_zoo/transformers/RoBERTa/contents.rst:265 +#: ../model_zoo/transformers/RoBERTa/contents.rst:268 +#: ../model_zoo/transformers/RoBERTa/contents.rst:271 +#: ../model_zoo/transformers/RoBERTa/contents.rst:274 +#: ../model_zoo/transformers/RoBERTa/contents.rst:277 +#: ../model_zoo/transformers/RoBERTa/contents.rst:286 +#: ../model_zoo/transformers/RoBERTa/contents.rst:289 +#: ../model_zoo/transformers/RoBERTa/contents.rst:295 +#: ../model_zoo/transformers/RoBERTa/contents.rst:301 +#: ../model_zoo/transformers/RoBERTa/contents.rst:307 +#: ../model_zoo/transformers/RoBERTa/contents.rst:313 +#: ../model_zoo/transformers/RoBERTa/contents.rst:316 +#: ../model_zoo/transformers/RoBERTa/contents.rst:319 +#: ../model_zoo/transformers/RoBERTa/contents.rst:322 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:30 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:34 +msgid "``nosaydomore/roberta-en-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:34 +msgid "12-layer, 768-hidden, 12-heads, 163M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:38 +msgid "``nosaydomore/roberta-en-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:38 +msgid "24-layer, 1024-hidden, 16-heads, 408M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:42 +msgid "``nosaydomore/sshleifei-tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:42 +msgid "2-layer, 2-hidden, 2-heads, 0.25M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:46 +msgid "``nosaydomore/uer-roberta-base-chinese-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:46 +#: ../model_zoo/transformers/RoBERTa/contents.rst:54 +msgid "12-layer, 768-hidden, 12-heads, 101M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:50 +msgid "``nosaydomore/uer-roberta-base-ft-chinanews-chn``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:50 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." 
+msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:54 +msgid "``nosaydomore/uer-roberta-base-ft-cluener2020-chn``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:58 +msgid "``roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:58 +msgid "Please refer to: roberta-base_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:61 +msgid "``cardiffnlp/twitter-roberta-base-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:61 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:64 +msgid "``roberta-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:64 +msgid "Please refer to: roberta-large_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:67 +msgid "``distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:67 +msgid "Please refer to: distilroberta-base_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:70 +msgid "``cross-encoder/nli-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:70 +msgid "Please refer to: `cross-encoder/nli-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:73 +msgid "``siebert/sentiment-roberta-large-english``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:73 +msgid "Please refer to: `siebert/sentiment-roberta-large-english`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:76 +msgid "``j-hartmann/emotion-english-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:76 +msgid "Please refer to: `j-hartmann/emotion-english-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:79 +msgid "``roberta-base-openai-detector``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:79 +msgid "Please refer to: `roberta-base-openai-detector`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:82 +msgid "``huggingface/CodeBERTa-small-v1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:82 +msgid "Please refer to: `huggingface/CodeBERTa-small-v1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:85 +msgid "``mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:85 +msgid "" +"Please refer to: `mrm8488/distilroberta-finetuned-financial-news-" +"sentiment-analysis`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:88 +msgid "``cardiffnlp/twitter-roberta-base-emotion``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:88 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-emotion`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:91 +msgid "``seyonec/PubChem10M_SMILES_BPE_396_250``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:91 +msgid "Please refer to: `seyonec/PubChem10M_SMILES_BPE_396_250`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:94 +msgid "``textattack/roberta-base-SST-2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:94 +msgid "Please refer to: `textattack/roberta-base-SST-2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:97 +msgid "``sshleifer/tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:97 +msgid "Please refer to: `sshleifer/tiny-distilroberta-base`_" 
+msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:100 +msgid "``thatdramebaazguy/roberta-base-squad``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:100 +msgid "Please refer to: `thatdramebaazguy/roberta-base-squad`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:103 +msgid "``ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:103 +msgid "Please refer to: `ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:106 +msgid "``ufal/robeczech-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:106 +msgid "Czech" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:106 +msgid "Please refer to: `ufal/robeczech-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:109 +msgid "``seyonec/PubChem10M_SMILES_BPE_450k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:109 +msgid "Please refer to: `seyonec/PubChem10M_SMILES_BPE_450k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:112 +msgid "``cardiffnlp/twitter-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:112 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:115 +msgid "``seyonec/PubChem10M_SMILES_BPE_50k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:115 +msgid "Please refer to: `seyonec/PubChem10M_SMILES_BPE_50k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:118 +msgid "``microsoft/codebert-base-mlm``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:118 +msgid "Please refer to: `microsoft/codebert-base-mlm`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:121 +msgid "``textattack/roberta-base-MNLI``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:121 +msgid "Please refer to: `textattack/roberta-base-MNLI`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:124 +msgid "``cardiffnlp/twitter-roberta-base-offensive``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:124 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-offensive`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:127 +msgid "``cross-encoder/stsb-roberta-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:127 +msgid "Please refer to: `cross-encoder/stsb-roberta-large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:130 +msgid "``seyonec/ChemBERTa_zinc250k_v2_40k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:130 +msgid "Please refer to: `seyonec/ChemBERTa_zinc250k_v2_40k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:133 +msgid "``uklfr/gottbert-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:133 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:133 +msgid "Please refer to: `uklfr/gottbert-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:136 +msgid "``seyonec/ChemBERTa-zinc-base-v1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:136 +msgid "Please refer to: `seyonec/ChemBERTa-zinc-base-v1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:139 +msgid "``roberta-large-openai-detector``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:139 +msgid "Please refer to: 
`roberta-large-openai-detector`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:142 +msgid "``cross-encoder/quora-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:142 +msgid "Please refer to: `cross-encoder/quora-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:145 +msgid "``cross-encoder/stsb-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:145 +msgid "Please refer to: `cross-encoder/stsb-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:148 +msgid "``microsoft/graphcodebert-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:148 +msgid "Please refer to: `microsoft/graphcodebert-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:151 +msgid "``cardiffnlp/twitter-roberta-base-hate``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:151 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-hate`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:154 +msgid "``chkla/roberta-argument``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:154 +msgid "Please refer to: `chkla/roberta-argument`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:157 +msgid "``Salesforce/grappa_large_jnt``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:157 +msgid "Please refer to: `Salesforce/grappa_large_jnt`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:160 +msgid "``vinai/bertweet-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:160 +msgid "Please refer to: `vinai/bertweet-large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:163 +msgid "``allenai/biomed_roberta_base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:163 +msgid "Please refer to: `allenai/biomed_roberta_base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:166 +msgid "``facebook/muppet-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:166 +msgid "Please refer to: `facebook/muppet-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:169 +msgid "``Rakib/roberta-base-on-cuad``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:169 +msgid "Please refer to: `Rakib/roberta-base-on-cuad`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:172 +msgid "``cross-encoder/stsb-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:172 +msgid "Please refer to: `cross-encoder/stsb-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:175 +msgid "``nyu-mll/roberta-base-1B-1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:175 +msgid "Please refer to: `nyu-mll/roberta-base-1B-1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:178 +msgid "``nyu-mll/roberta-med-small-1M-1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:178 +msgid "Please refer to: `nyu-mll/roberta-med-small-1M-1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:181 +msgid "``SkolkovoInstitute/roberta_toxicity_classifier``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:181 +msgid "Please refer to: `SkolkovoInstitute/roberta_toxicity_classifier`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:184 +msgid "``facebook/muppet-roberta-large``" +msgstr "" + +#: 
../model_zoo/transformers/RoBERTa/contents.rst:184 +msgid "Please refer to: `facebook/muppet-roberta-large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:187 +msgid "``lassl/roberta-ko-small``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:187 +msgid "Korean" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:187 +msgid "Please refer to: `lassl/roberta-ko-small`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:190 +msgid "``huggingface/CodeBERTa-language-id``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:190 +msgid "Please refer to: `huggingface/CodeBERTa-language-id`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:193 +msgid "``textattack/roberta-base-imdb``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:193 +msgid "Please refer to: `textattack/roberta-base-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:196 +msgid "``macedonizer/mk-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:196 +msgid "Macedonian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:196 +msgid "Please refer to: `macedonizer/mk-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:199 +msgid "``cross-encoder/nli-MiniLM2-L6-H768``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:199 +msgid "Please refer to: `cross-encoder/nli-MiniLM2-L6-H768`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:202 +msgid "``textattack/roberta-base-QNLI``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:202 +msgid "Please refer to: `textattack/roberta-base-QNLI`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:205 +msgid "``deepset/roberta-base-squad2-covid``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:205 +msgid "Please refer to: `deepset/roberta-base-squad2-covid`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:208 +msgid "``textattack/roberta-base-MRPC``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:208 +msgid "Please refer to: `textattack/roberta-base-MRPC`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:211 +msgid "``bhadresh-savani/roberta-base-emotion``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:211 +msgid "Please refer to: `bhadresh-savani/roberta-base-emotion`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:214 +msgid "``aychang/roberta-base-imdb``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:214 +msgid "Please refer to: `aychang/roberta-base-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:217 +msgid "``cross-encoder/quora-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:217 +msgid "Please refer to: `cross-encoder/quora-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:220 +msgid "``csarron/roberta-base-squad-v1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:220 +msgid "Please refer to: `csarron/roberta-base-squad-v1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:223 +msgid "``seyonec/ChemBERTA_PubChem1M_shard00_155k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:223 +msgid "Please refer to: `seyonec/ChemBERTA_PubChem1M_shard00_155k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:226 +msgid "``mental/mental-roberta-base``" +msgstr "" + +#: 
../model_zoo/transformers/RoBERTa/contents.rst:226 +msgid "Please refer to: `mental/mental-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:229 +msgid "``textattack/roberta-base-CoLA``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:229 +msgid "Please refer to: `textattack/roberta-base-CoLA`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:232 +msgid "``navteca/quora-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:232 +msgid "Please refer to: `navteca/quora-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:235 +msgid "``cardiffnlp/twitter-roberta-base-emoji``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:235 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-emoji`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:238 +msgid "``benjamin/roberta-base-wechsel-german``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:238 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:238 +msgid "Please refer to: `benjamin/roberta-base-wechsel-german`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:241 +msgid "``textattack/roberta-base-ag-news``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:241 +msgid "Please refer to: `textattack/roberta-base-ag-news`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:244 +msgid "``johngiorgi/declutr-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:244 +msgid "Please refer to: `johngiorgi/declutr-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:247 +msgid "``salesken/query_wellformedness_score``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:247 +msgid "Please refer to: `salesken/query_wellformedness_score`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:250 +msgid "``blinoff/roberta-base-russian-v0``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:250 +msgid "Russian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:250 +msgid "Please refer to: `blinoff/roberta-base-russian-v0`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:253 +msgid "``allenai/reviews_roberta_base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:253 +msgid "Please refer to: `allenai/reviews_roberta_base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:256 +msgid "``ruiqi-zhong/roberta-base-meta-tuning-test``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:256 +msgid "Please refer to: `ruiqi-zhong/roberta-base-meta-tuning-test`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:259 +msgid "``mrm8488/distilroberta-finetuned-tweets-hate-speech``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:259 +msgid "Please refer to: `mrm8488/distilroberta-finetuned-tweets-hate-speech`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:262 +msgid "``cointegrated/roberta-large-cola-krishna2020``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:262 +msgid "Please refer to: `cointegrated/roberta-large-cola-krishna2020`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:265 +msgid "``deepset/roberta-base-squad2-distilled``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:265 +msgid "Please refer to: `deepset/roberta-base-squad2-distilled`_" +msgstr "" + +#: 
../model_zoo/transformers/RoBERTa/contents.rst:268 +msgid "``tli8hf/unqover-roberta-base-squad``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:268 +msgid "Please refer to: `tli8hf/unqover-roberta-base-squad`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:271 +msgid "``cross-encoder/nli-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:271 +msgid "Please refer to: `cross-encoder/nli-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:274 +msgid "``nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:274 +msgid "Please refer to: `nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:277 +msgid "``seyonec/BPE_SELFIES_PubChem_shard00_160k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:277 +msgid "Please refer to: `seyonec/BPE_SELFIES_PubChem_shard00_160k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:280 +msgid "``CLTL/MedRoBERTa.nl``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:280 +msgid "Dutch" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:280 +msgid "Please refer to: `CLTL/MedRoBERTa.nl`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:283 +msgid "``HooshvareLab/roberta-fa-zwnj-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:283 +msgid "Persian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:283 +msgid "Please refer to: `HooshvareLab/roberta-fa-zwnj-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:286 +msgid "``nyu-mll/roberta-base-100M-1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:286 +msgid "Please refer to: `nyu-mll/roberta-base-100M-1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:289 +msgid "``deepset/tinyroberta-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:289 +msgid "Please refer to: `deepset/tinyroberta-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:292 +msgid "``youscan/ukr-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:292 +msgid "Ukrainian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:292 +msgid "Please refer to: `youscan/ukr-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:295 +msgid "``navteca/roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:295 +msgid "Please refer to: `navteca/roberta-base-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:298 +msgid "``bertin-project/bertin-roberta-base-spanish``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:298 +msgid "Spanish" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:298 +msgid "Please refer to: `bertin-project/bertin-roberta-base-spanish`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:301 +msgid "``shiyue/roberta-large-tac08``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:301 +msgid "Please refer to: `shiyue/roberta-large-tac08`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:304 +msgid "``softcatala/julibert``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:304 +msgid "Catalan" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:304 +msgid "Please refer to: `softcatala/julibert`_" +msgstr 
"" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:307 +msgid "``elozano/tweet_sentiment_eval``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:307 +msgid "Please refer to: `elozano/tweet_sentiment_eval`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:310 +msgid "``cahya/roberta-base-indonesian-1.5G``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:310 +msgid "Indonesian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:310 +msgid "Please refer to: `cahya/roberta-base-indonesian-1.5G`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:313 +msgid "``elozano/tweet_emotion_eval``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:313 +msgid "Please refer to: `elozano/tweet_emotion_eval`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:316 +msgid "``navteca/roberta-large-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:316 +msgid "Please refer to: `navteca/roberta-large-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:319 +msgid "``elozano/tweet_offensive_eval``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:319 +msgid "Please refer to: `elozano/tweet_offensive_eval`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:322 +msgid "``ynie/roberta-large_conv_contradiction_detector_v0``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:322 +msgid "Please refer to: `ynie/roberta-large_conv_contradiction_detector_v0`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoFormer/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoFormer/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..427d290450dcf86298b059c775e4e8cc1ad88314 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoFormer/contents.po @@ -0,0 +1,155 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/RoFormer/contents.rst:5 +msgid "RoFormer模型汇总" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:7 +msgid "下表汇总介绍了目前PaddleNLP支持的RoFormer模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:11 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:11 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:11 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:13 +msgid "``roformer-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:13 +#: ../model_zoo/transformers/RoFormer/contents.rst:17 +#: ../model_zoo/transformers/RoFormer/contents.rst:21 +#: ../model_zoo/transformers/RoFormer/contents.rst:25 +#: ../model_zoo/transformers/RoFormer/contents.rst:29 +#: ../model_zoo/transformers/RoFormer/contents.rst:33 +#: ../model_zoo/transformers/RoFormer/contents.rst:37 +#: ../model_zoo/transformers/RoFormer/contents.rst:41 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:13 +msgid "" +"6-layer, 384-hidden, 6-heads, 30M parameters. Roformer Small Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:17 +msgid "``roformer-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:17 +msgid "" +"12-layer, 768-hidden, 12-heads, 124M parameters. Roformer Base Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:21 +msgid "``roformer-chinese-char-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:21 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Small" +" model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:25 +msgid "``roformer-chinese-char-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:25 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:29 +msgid "``roformer-chinese-sim-char-ft-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:29 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Ft " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:33 +msgid "``roformer-chinese-sim-char-ft-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:33 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char Ft " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:37 +msgid "``roformer-chinese-sim-char-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:37 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Sim Char " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:41 +msgid "``roformer-chinese-sim-char-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:41 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. 
Roformer Chinese Sim Char" +" Base model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:45 +msgid "``roformer-english-small-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:45 +#: ../model_zoo/transformers/RoFormer/contents.rst:49 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:45 +msgid "" +"12-layer, 256-hidden, 4-heads, 13M parameters. Roformer English Small " +"Discriminator." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:49 +msgid "``roformer-english-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:49 +msgid "" +"12-layer, 64-hidden, 1-heads, 5M parameters. Roformer English Small " +"Generator." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SKEP/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SKEP/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..d8ec9283d84991ce7c69ef9ed548b784a7807b9d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SKEP/contents.po @@ -0,0 +1,78 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/SKEP/contents.rst:5 +msgid "SKEP模型汇总" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的SKEP模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:15 +msgid "``skep_ernie_1.0_large_ch``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:15 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:15 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_1.0``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:20 +msgid "``skep_ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:20 +#: ../model_zoo/transformers/SKEP/contents.rst:25 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:20 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:25 +msgid "``skep_roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:25 +msgid "" +"24-layer, 1024-hidden, 16-heads, 355M parameters. 
Trained using the " +"RoBERTa model ``roberta_large_en``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SqueezeBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SqueezeBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..614040be5af641feb7d9c449c6a9b63b05c198c9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SqueezeBert/contents.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:5 +msgid "SqueezeBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的SqueezeBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:15 +msgid "``squeezebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:15 +#: ../model_zoo/transformers/SqueezeBert/contents.rst:19 +#: ../model_zoo/transformers/SqueezeBert/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Uncased model." +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:19 +msgid "``squeezebert-mnli``" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:19 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli model." +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:23 +msgid "``squeezebert-mnli-headless``" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:23 +msgid "" +"12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli Headless" +" model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/T5/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/T5/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..7c4e944d04ebe0081650789c7563f766375c1db0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/T5/contents.po @@ -0,0 +1,458 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/T5/contents.rst:5 +msgid "T5模型汇总" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的T5模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:15 +msgid "``t5-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:15 +#: ../model_zoo/transformers/T5/contents.rst:19 +#: ../model_zoo/transformers/T5/contents.rst:23 +#: ../model_zoo/transformers/T5/contents.rst:27 +#: ../model_zoo/transformers/T5/contents.rst:30 +#: ../model_zoo/transformers/T5/contents.rst:36 +#: ../model_zoo/transformers/T5/contents.rst:39 +#: ../model_zoo/transformers/T5/contents.rst:42 +#: ../model_zoo/transformers/T5/contents.rst:45 +#: ../model_zoo/transformers/T5/contents.rst:48 +#: ../model_zoo/transformers/T5/contents.rst:51 +#: ../model_zoo/transformers/T5/contents.rst:54 +#: ../model_zoo/transformers/T5/contents.rst:57 +#: ../model_zoo/transformers/T5/contents.rst:60 +#: ../model_zoo/transformers/T5/contents.rst:63 +#: ../model_zoo/transformers/T5/contents.rst:66 +#: ../model_zoo/transformers/T5/contents.rst:72 +#: ../model_zoo/transformers/T5/contents.rst:75 +#: ../model_zoo/transformers/T5/contents.rst:78 +#: ../model_zoo/transformers/T5/contents.rst:81 +#: ../model_zoo/transformers/T5/contents.rst:84 +#: ../model_zoo/transformers/T5/contents.rst:87 +#: ../model_zoo/transformers/T5/contents.rst:90 +#: ../model_zoo/transformers/T5/contents.rst:93 +#: ../model_zoo/transformers/T5/contents.rst:96 +#: ../model_zoo/transformers/T5/contents.rst:99 +#: ../model_zoo/transformers/T5/contents.rst:102 +#: ../model_zoo/transformers/T5/contents.rst:105 +#: ../model_zoo/transformers/T5/contents.rst:111 +#: ../model_zoo/transformers/T5/contents.rst:114 +#: ../model_zoo/transformers/T5/contents.rst:117 +#: ../model_zoo/transformers/T5/contents.rst:120 +#: ../model_zoo/transformers/T5/contents.rst:123 +#: ../model_zoo/transformers/T5/contents.rst:126 +#: ../model_zoo/transformers/T5/contents.rst:129 +#: ../model_zoo/transformers/T5/contents.rst:132 +#: ../model_zoo/transformers/T5/contents.rst:135 +#: ../model_zoo/transformers/T5/contents.rst:138 +#: ../model_zoo/transformers/T5/contents.rst:144 +#: ../model_zoo/transformers/T5/contents.rst:147 +#: ../model_zoo/transformers/T5/contents.rst:150 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:15 +msgid "6-layer, 512-hidden, 8-heads, 93M parameters. T5 small model." +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:19 +msgid "``t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:19 +msgid "12-layer, 768-hidden, 12-heads, 272M parameters. T5 base model." +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:23 +msgid "``t5-large``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:23 +msgid "24-layer, 1024-hidden, 16-heads, 803M parameters. T5 large model." 
+msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:27 +msgid "``t5-v1_1-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:27 +msgid "Please refer to: t5-v1_1-base_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:30 +msgid "``t5-v1_1-large``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:30 +msgid "Please refer to: t5-v1_1-large_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:33 +msgid "``Langboat/mengzi-t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:33 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:33 +msgid "Please refer to: `Langboat/mengzi-t5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:36 +msgid "``deep-learning-analytics/wikihow-t5-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:36 +msgid "Please refer to: `deep-learning-analytics/wikihow-t5-small`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:39 +msgid "``sberbank-ai/ruT5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:39 +msgid "Please refer to: `sberbank-ai/ruT5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:42 +msgid "``Michau/t5-base-en-generate-headline``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:42 +msgid "Please refer to: `Michau/t5-base-en-generate-headline`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:45 +msgid "``google/t5-v1_1-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:45 +msgid "Please refer to: `google/t5-v1_1-small`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:48 +msgid "``prithivida/parrot_paraphraser_on_T5``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:48 +msgid "Please refer to: `prithivida/parrot_paraphraser_on_T5`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:51 +msgid "``prithivida/grammar_error_correcter_v1``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:51 +msgid "Please refer to: `prithivida/grammar_error_correcter_v1`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:54 +msgid "``valhalla/t5-small-qg-hl``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:54 +msgid "Please refer to: `valhalla/t5-small-qg-hl`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:57 +msgid "``valhalla/t5-small-qa-qg-hl``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:57 +msgid "Please refer to: `valhalla/t5-small-qa-qg-hl`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:60 +msgid "``ramsrigouthamg/t5-large-paraphraser-diverse-high-quality``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:60 +msgid "" +"Please refer to: `ramsrigouthamg/t5-large-paraphraser-diverse-high-" +"quality`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:63 +msgid "``mrm8488/t5-base-finetuned-common_gen``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:63 +msgid "Please refer to: `mrm8488/t5-base-finetuned-common_gen`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:66 +msgid "``valhalla/t5-small-e2e-qg``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:66 +msgid "Please refer to: `valhalla/t5-small-e2e-qg`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:69 +msgid "``sonoisa/t5-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:69 +msgid "japanese" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:69 +msgid "Please refer to: `sonoisa/t5-base-japanese`_" +msgstr "" + +#: 
../model_zoo/transformers/T5/contents.rst:72 +msgid "``google/t5-base-lm-adapt``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:72 +msgid "Please refer to: `google/t5-base-lm-adapt`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:75 +msgid "``google/t5-small-lm-adapt``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:75 +msgid "Please refer to: `google/t5-small-lm-adapt`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:78 +msgid "``valhalla/t5-small-qg-prepend``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:78 +msgid "Please refer to: `valhalla/t5-small-qg-prepend`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:81 +msgid "``prithivida/informal_to_formal_styletransfer``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:81 +msgid "Please refer to: `prithivida/informal_to_formal_styletransfer`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:84 +msgid "``KETI-AIR/ke-t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:84 +msgid "Please refer to: `KETI-AIR/ke-t5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:87 +msgid "``nielsr/nt5-small-rc1``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:87 +msgid "Please refer to: `nielsr/nt5-small-rc1`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:90 +msgid "``snrspeaks/t5-one-line-summary``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:90 +msgid "Please refer to: `snrspeaks/t5-one-line-summary`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:93 +msgid "``mrm8488/t5-small-finetuned-quora-for-paraphrasing``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:93 +msgid "Please refer to: `mrm8488/t5-small-finetuned-quora-for-paraphrasing`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:96 +msgid "``p-christ/12412fsasf``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:96 +msgid "Please refer to: `p-christ/12412fsasf`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:99 +msgid "``tscholak/3vnuv1vf``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:99 +msgid "Please refer to: `tscholak/3vnuv1vf`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:102 +msgid "``tennessejoyce/titlewave-t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:102 +msgid "Please refer to: `tennessejoyce/titlewave-t5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:105 +msgid "``vennify/t5-base-grammar-correction``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:105 +msgid "Please refer to: `vennify/t5-base-grammar-correction`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:108 +msgid "``megagonlabs/t5-base-japanese-web``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:108 +#: ../model_zoo/transformers/T5/contents.rst:141 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:108 +msgid "Please refer to: `megagonlabs/t5-base-japanese-web`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:111 +msgid "``sberbank-ai/ruT5-large``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:111 +msgid "Please refer to: `sberbank-ai/ruT5-large`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:114 +msgid "``tscholak/t5.1.1.lm100k.base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:114 +msgid "Please refer to: `tscholak/t5.1.1.lm100k.base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:117 +msgid 
"``deep-learning-analytics/GrammarCorrector``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:117 +msgid "Please refer to: `deep-learning-analytics/GrammarCorrector`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:120 +msgid "``ThomasNLG/t5-qa_squad2neg-en``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:120 +msgid "Please refer to: `ThomasNLG/t5-qa_squad2neg-en`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:123 +msgid "``flexudy/t5-small-wav2vec2-grammar-fixer``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:123 +msgid "Please refer to: `flexudy/t5-small-wav2vec2-grammar-fixer`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:126 +msgid "``KETI-AIR/ke-t5-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:126 +msgid "Please refer to: `KETI-AIR/ke-t5-small`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:129 +msgid "``razent/SciFive-large-Pubmed_PMC``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:129 +msgid "Please refer to: `razent/SciFive-large-Pubmed_PMC`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:132 +msgid "``google/t5-large-ssm-nq``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:132 +msgid "Please refer to: `google/t5-large-ssm-nq`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:135 +msgid "``ozcangundes/T5-base-for-BioQA``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:135 +msgid "Please refer to: `ozcangundes/T5-base-for-BioQA`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:138 +msgid "``Rostlab/prot_t5_base_mt_uniref50``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:138 +msgid "Please refer to: `Rostlab/prot_t5_base_mt_uniref50`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:141 +msgid "``sonoisa/t5-base-japanese-question-generation``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:141 +msgid "Please refer to: `sonoisa/t5-base-japanese-question-generation`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:144 +msgid "``Wikidepia/IndoT5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:144 +msgid "Please refer to: `Wikidepia/IndoT5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:147 +msgid "``razent/SciFive-base-Pubmed_PMC``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:147 +msgid "Please refer to: `razent/SciFive-base-Pubmed_PMC`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:150 +msgid "``google/t5-small-ssm-nq``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:150 +msgid "Please refer to: `google/t5-small-ssm-nq`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/TinyBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/TinyBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..c62e09a836b76bb956ffcf9d684d1a152f2e4bb7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/TinyBert/contents.po @@ -0,0 +1,91 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/TinyBert/contents.rst:5 +msgid "TinyBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的TinyBert模型以及对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:15 +msgid "``tinybert-4l-312d``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:15 +#: ../model_zoo/transformers/TinyBert/contents.rst:20 +#: ../model_zoo/transformers/TinyBert/contents.rst:25 +#: ../model_zoo/transformers/TinyBert/contents.rst:30 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:15 +#: ../model_zoo/transformers/TinyBert/contents.rst:25 +#: ../model_zoo/transformers/TinyBert/contents.rst:35 +msgid "" +"4-layer, 312-hidden, 12-heads, 14.5M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:20 +msgid "``tinybert-6l-768d``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:20 +#: ../model_zoo/transformers/TinyBert/contents.rst:30 +#: ../model_zoo/transformers/TinyBert/contents.rst:40 +msgid "" +"6-layer, 768-hidden, 12-heads, 67M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:25 +msgid "``tinybert-4l-312d-v2``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:30 +msgid "``tinybert-6l-768d-v2``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:35 +msgid "``tinybert-4l-312d-zh``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:35 +#: ../model_zoo/transformers/TinyBert/contents.rst:40 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:40 +msgid "``tinybert-6l-768d-zh``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UNIMO/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UNIMO/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..356f04e285cb6b561cb84360d50d19e6920c6ff1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UNIMO/contents.po @@ -0,0 +1,73 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/UNIMO/contents.rst:5 +msgid "UNIMO模型汇总" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的UNIMO模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:15 +msgid "``unimo-text-1.0``" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:15 +#: ../model_zoo/transformers/UNIMO/contents.rst:19 +#: ../model_zoo/transformers/UNIMO/contents.rst:23 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 99M parameters. UNIMO-text-1.0 model." +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:19 +msgid "``unimo-text-1.0-lcsts-new``" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 99M parameters. Finetuned on lcsts_new " +"dataset." +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:23 +msgid "``unimo-text-1.0-large``" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:23 +msgid "" +"24-layer, 768-hidden, 16-heads, 316M parameters. UNIMO-text-1.0 large " +"model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UnifiedTransformer/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UnifiedTransformer/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..9cfd0c77fec1d3ff81d53668927941217d3624cc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UnifiedTransformer/contents.po @@ -0,0 +1,80 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:5 +msgid "UnifiedTransformer模型汇总" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的UnifiedTransformer模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:15 +msgid "``unified_transformer-12L-cn``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:15 +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:19 +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:23 +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:27 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:19 +msgid "``unified_transformer-12L-cn-luge``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text " +"(LUGE.ai)." +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:23 +msgid "``plato-mini``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:23 +msgid "6-layer, 768-hidden, 12-heads, 66M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:27 +msgid "``plato-xl``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:27 +msgid "72-layer, 3072-hidden, 32-heads, ?M parameters. Trained on Chinese text." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/XLNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/XLNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..8dee56236538b9ca29232e34af9ce1e3d0da5703 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/XLNet/contents.po @@ -0,0 +1,94 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/XLNet/contents.rst:5 +msgid "XLNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的XLNet模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:15 +msgid "``xlnet-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:15 +#: ../model_zoo/transformers/XLNet/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:19 +msgid "``xlnet-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English " +"model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:23 +msgid "``chinese-xlnet-base``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:23 +#: ../model_zoo/transformers/XLNet/contents.rst:27 +#: ../model_zoo/transformers/XLNet/contents.rst:31 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:23 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. XLNet Chinese model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:27 +msgid "``chinese-xlnet-mid``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:27 +msgid "" +"24-layer, 768-hidden, 12-heads, 209M parameters. XLNet Medium Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:31 +msgid "``chinese-xlnet-large``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:31 +msgid "24-layer, 1024-hidden, 16-heads, _M parameters. XLNet Large Chinese model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/all/transformers.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/all/transformers.po new file mode 100644 index 0000000000000000000000000000000000000000..a76d68bd4804a3e9e86bed8f438a2111bb0a1e73 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/all/transformers.po @@ -0,0 +1,2090 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/all/transformers.rst:2 +msgid "PaddleNLP Transformer API" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:4 +msgid "" +"随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新各种NLP任务SOTA(State of the " +"Art)。 PaddleNLP为用户提供了常用的 " +"``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等经典结构预训练模型, " +"让开发者能够方便快捷应用各类Transformer预训练模型及其下游任务。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:10 +msgid "Transformer预训练模型汇总" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:14 +msgid "" +"下表汇总了介绍了目前PaddleNLP支持的各类预训练模型以及对应预训练权重。我们目前提供了 **32** 种网络结构, **136** " +"种预训练的参数权重供用户使用, 其中包含了 **59** 种中文语言模型的预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +#: ../model_zoo/transformers/all/transformers.rst:666 +msgid "ALBERT_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +msgid "``albert-base-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +#: ../model_zoo/transformers/all/transformers.rst:24 +#: ../model_zoo/transformers/all/transformers.rst:28 +#: ../model_zoo/transformers/all/transformers.rst:32 +#: ../model_zoo/transformers/all/transformers.rst:36 +#: ../model_zoo/transformers/all/transformers.rst:40 +#: ../model_zoo/transformers/all/transformers.rst:44 +#: ../model_zoo/transformers/all/transformers.rst:48 +#: ../model_zoo/transformers/all/transformers.rst:76 +#: ../model_zoo/transformers/all/transformers.rst:80 +#: ../model_zoo/transformers/all/transformers.rst:84 +#: ../model_zoo/transformers/all/transformers.rst:88 +#: ../model_zoo/transformers/all/transformers.rst:92 +#: ../model_zoo/transformers/all/transformers.rst:96 +#: ../model_zoo/transformers/all/transformers.rst:148 +#: ../model_zoo/transformers/all/transformers.rst:196 +#: ../model_zoo/transformers/all/transformers.rst:200 +#: ../model_zoo/transformers/all/transformers.rst:204 +#: ../model_zoo/transformers/all/transformers.rst:208 +#: ../model_zoo/transformers/all/transformers.rst:212 +#: ../model_zoo/transformers/all/transformers.rst:216 +#: ../model_zoo/transformers/all/transformers.rst:220 +#: ../model_zoo/transformers/all/transformers.rst:224 +#: ../model_zoo/transformers/all/transformers.rst:228 +#: ../model_zoo/transformers/all/transformers.rst:232 +#: ../model_zoo/transformers/all/transformers.rst:236 +#: ../model_zoo/transformers/all/transformers.rst:241 +#: ../model_zoo/transformers/all/transformers.rst:246 +#: ../model_zoo/transformers/all/transformers.rst:252 +#: ../model_zoo/transformers/all/transformers.rst:256 +#: ../model_zoo/transformers/all/transformers.rst:260 +#: ../model_zoo/transformers/all/transformers.rst:264 +#: ../model_zoo/transformers/all/transformers.rst:300 +#: 
../model_zoo/transformers/all/transformers.rst:304 +#: ../model_zoo/transformers/all/transformers.rst:308 +#: ../model_zoo/transformers/all/transformers.rst:316 +#: ../model_zoo/transformers/all/transformers.rst:320 +#: ../model_zoo/transformers/all/transformers.rst:324 +#: ../model_zoo/transformers/all/transformers.rst:328 +#: ../model_zoo/transformers/all/transformers.rst:351 +#: ../model_zoo/transformers/all/transformers.rst:355 +#: ../model_zoo/transformers/all/transformers.rst:359 +#: ../model_zoo/transformers/all/transformers.rst:363 +#: ../model_zoo/transformers/all/transformers.rst:367 +#: ../model_zoo/transformers/all/transformers.rst:371 +#: ../model_zoo/transformers/all/transformers.rst:375 +#: ../model_zoo/transformers/all/transformers.rst:379 +#: ../model_zoo/transformers/all/transformers.rst:387 +#: ../model_zoo/transformers/all/transformers.rst:391 +#: ../model_zoo/transformers/all/transformers.rst:395 +#: ../model_zoo/transformers/all/transformers.rst:399 +#: ../model_zoo/transformers/all/transformers.rst:403 +#: ../model_zoo/transformers/all/transformers.rst:407 +#: ../model_zoo/transformers/all/transformers.rst:411 +#: ../model_zoo/transformers/all/transformers.rst:415 +#: ../model_zoo/transformers/all/transformers.rst:420 +#: ../model_zoo/transformers/all/transformers.rst:425 +#: ../model_zoo/transformers/all/transformers.rst:430 +#: ../model_zoo/transformers/all/transformers.rst:434 +#: ../model_zoo/transformers/all/transformers.rst:454 +#: ../model_zoo/transformers/all/transformers.rst:457 +#: ../model_zoo/transformers/all/transformers.rst:476 +#: ../model_zoo/transformers/all/transformers.rst:480 +#: ../model_zoo/transformers/all/transformers.rst:484 +#: ../model_zoo/transformers/all/transformers.rst:488 +#: ../model_zoo/transformers/all/transformers.rst:536 +#: ../model_zoo/transformers/all/transformers.rst:540 +#: ../model_zoo/transformers/all/transformers.rst:549 +#: ../model_zoo/transformers/all/transformers.rst:554 +#: ../model_zoo/transformers/all/transformers.rst:559 +#: ../model_zoo/transformers/all/transformers.rst:563 +#: ../model_zoo/transformers/all/transformers.rst:567 +#: ../model_zoo/transformers/all/transformers.rst:571 +#: ../model_zoo/transformers/all/transformers.rst:575 +#: ../model_zoo/transformers/all/transformers.rst:579 +#: ../model_zoo/transformers/all/transformers.rst:583 +#: ../model_zoo/transformers/all/transformers.rst:588 +#: ../model_zoo/transformers/all/transformers.rst:593 +#: ../model_zoo/transformers/all/transformers.rst:598 +#: ../model_zoo/transformers/all/transformers.rst:637 +#: ../model_zoo/transformers/all/transformers.rst:641 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:24 +msgid "``albert-large-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:24 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:28 +msgid "``albert-xlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:28 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. 
ALBERT xlarge model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:32 +msgid "``albert-xxlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:32 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:36 +msgid "``albert-base-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:36 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:40 +msgid "``albert-large-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:40 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:44 +msgid "``albert-xlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:44 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:48 +msgid "``albert-xxlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:48 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:52 +msgid "``albert-chinese-tiny``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:52 +#: ../model_zoo/transformers/all/transformers.rst:56 +#: ../model_zoo/transformers/all/transformers.rst:60 +#: ../model_zoo/transformers/all/transformers.rst:64 +#: ../model_zoo/transformers/all/transformers.rst:68 +#: ../model_zoo/transformers/all/transformers.rst:72 +#: ../model_zoo/transformers/all/transformers.rst:112 +#: ../model_zoo/transformers/all/transformers.rst:117 +#: ../model_zoo/transformers/all/transformers.rst:123 +#: ../model_zoo/transformers/all/transformers.rst:129 +#: ../model_zoo/transformers/all/transformers.rst:133 +#: ../model_zoo/transformers/all/transformers.rst:137 +#: ../model_zoo/transformers/all/transformers.rst:154 +#: ../model_zoo/transformers/all/transformers.rst:159 +#: ../model_zoo/transformers/all/transformers.rst:164 +#: ../model_zoo/transformers/all/transformers.rst:169 +#: ../model_zoo/transformers/all/transformers.rst:173 +#: ../model_zoo/transformers/all/transformers.rst:268 +#: ../model_zoo/transformers/all/transformers.rst:272 +#: ../model_zoo/transformers/all/transformers.rst:276 +#: ../model_zoo/transformers/all/transformers.rst:280 +#: ../model_zoo/transformers/all/transformers.rst:284 +#: ../model_zoo/transformers/all/transformers.rst:288 +#: ../model_zoo/transformers/all/transformers.rst:292 +#: ../model_zoo/transformers/all/transformers.rst:296 +#: ../model_zoo/transformers/all/transformers.rst:312 +#: ../model_zoo/transformers/all/transformers.rst:333 +#: ../model_zoo/transformers/all/transformers.rst:337 +#: ../model_zoo/transformers/all/transformers.rst:342 +#: ../model_zoo/transformers/all/transformers.rst:346 +#: ../model_zoo/transformers/all/transformers.rst:383 +#: ../model_zoo/transformers/all/transformers.rst:438 +#: ../model_zoo/transformers/all/transformers.rst:442 +#: ../model_zoo/transformers/all/transformers.rst:446 +#: ../model_zoo/transformers/all/transformers.rst:450 +#: ../model_zoo/transformers/all/transformers.rst:460 +#: 
../model_zoo/transformers/all/transformers.rst:465 +#: ../model_zoo/transformers/all/transformers.rst:470 +#: ../model_zoo/transformers/all/transformers.rst:473 +#: ../model_zoo/transformers/all/transformers.rst:492 +#: ../model_zoo/transformers/all/transformers.rst:496 +#: ../model_zoo/transformers/all/transformers.rst:500 +#: ../model_zoo/transformers/all/transformers.rst:504 +#: ../model_zoo/transformers/all/transformers.rst:508 +#: ../model_zoo/transformers/all/transformers.rst:512 +#: ../model_zoo/transformers/all/transformers.rst:516 +#: ../model_zoo/transformers/all/transformers.rst:520 +#: ../model_zoo/transformers/all/transformers.rst:524 +#: ../model_zoo/transformers/all/transformers.rst:528 +#: ../model_zoo/transformers/all/transformers.rst:532 +#: ../model_zoo/transformers/all/transformers.rst:544 +#: ../model_zoo/transformers/all/transformers.rst:603 +#: ../model_zoo/transformers/all/transformers.rst:608 +#: ../model_zoo/transformers/all/transformers.rst:613 +#: ../model_zoo/transformers/all/transformers.rst:617 +#: ../model_zoo/transformers/all/transformers.rst:621 +#: ../model_zoo/transformers/all/transformers.rst:625 +#: ../model_zoo/transformers/all/transformers.rst:629 +#: ../model_zoo/transformers/all/transformers.rst:633 +#: ../model_zoo/transformers/all/transformers.rst:645 +#: ../model_zoo/transformers/all/transformers.rst:649 +#: ../model_zoo/transformers/all/transformers.rst:653 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:52 +msgid "" +"4 repeating layers, 128 embedding, 312-hidden, 12-heads, 4M parameters. " +"ALBERT tiny model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:56 +msgid "``albert-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:56 +msgid "" +"6 repeating layers, 128 embedding, 384-hidden, 12-heads, _M parameters. " +"ALBERT small model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:60 +msgid "``albert-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:60 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 12M parameters." +" ALBERT base model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:64 +msgid "``albert-chinese-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:64 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 18M " +"parameters. ALBERT large model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:68 +msgid "``albert-chinese-xlarge``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:68 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 60M " +"parameters. ALBERT xlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:72 +msgid "``albert-chinese-xxlarge``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:72 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 16-heads, 235M " +"parameters. ALBERT xxlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:76 +#: ../model_zoo/transformers/all/transformers.rst:668 +msgid "BART_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:76 +msgid "``bart-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:76 +msgid "12-layer, 768-hidden, 12-heads, 217M parameters. 
BART base model (English)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:80 +msgid "``bart-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:80 +msgid "" +"24-layer, 768-hidden, 16-heads, 509M parameters. BART large model " +"(English)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:84 +#: ../model_zoo/transformers/all/transformers.rst:670 +msgid "BERT_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:84 +msgid "``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:84 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:88 +msgid "``bert-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:88 +#: ../model_zoo/transformers/all/transformers.rst:308 +#: ../model_zoo/transformers/all/transformers.rst:324 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:92 +msgid "``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:92 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English" +" text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:96 +msgid "``bert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:96 +msgid "" +"24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:100 +msgid "``bert-base-multilingual-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:100 +#: ../model_zoo/transformers/all/transformers.rst:106 +#: ../model_zoo/transformers/all/transformers.rst:141 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:100 +msgid "" +"12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased " +"text in the top 102 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:106 +msgid "``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:106 +msgid "" +"12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in" +" the top 104 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:112 +msgid "``bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:112 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:117 +msgid "``bert-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:117 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:123 +msgid "``bert-wwm-ext-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:123 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking with extended " +"data."
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:129 +msgid "``junnyu/ckiplab-bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:129 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on NER task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:133 +msgid "``junnyu/ckiplab-bert-base-chinese-pos``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:133 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on POS task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:137 +msgid "``junnyu/ckiplab-bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:137 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on WS task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:141 +msgid "``junnyu/nlptown-bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:141 +msgid "" +"12-layer, 768-hidden, 12-heads, 167M parameters. Finetuned for sentiment " +"analysis on product reviews in six languages: English, Dutch, German, " +"French, Spanish and Italian." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:148 +msgid "``junnyu/tbs17-MathBERT``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:148 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on pre-k to " +"graduate math language (English) using a masked language modeling (MLM) " +"objective." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:154 +msgid "``macbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:154 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:159 +msgid "``macbert-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:159 +msgid "" +"24-layer, 1024-hidden, 16-heads, 326M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:164 +msgid "``simbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:164 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on 22 million " +"pairs of similar sentences crawed from Baidu Know." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:169 +msgid "``Langboat/mengzi-bert-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:169 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 300G Chinese " +"Corpus Datasets." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:173 +msgid "``Langboat/mengzi-bert-base-fin``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:173 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 20G Finacial " +"Corpus, based on ``Langboat/mengzi-bert-base``." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +msgid "BERT-Japanese_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +msgid "``iverxin/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +#: ../model_zoo/transformers/all/transformers.rst:182 +#: ../model_zoo/transformers/all/transformers.rst:187 +#: ../model_zoo/transformers/all/transformers.rst:191 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:182 +msgid "``iverxin/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:182 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on Japanese text" +" using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:187 +msgid "``iverxin/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:187 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:191 +msgid "``iverxin/bert-base-japanese-char-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:191 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:196 +#: ../model_zoo/transformers/all/transformers.rst:672 +msgid "BigBird_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:196 +msgid "``bigbird-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:196 +msgid "" +"12-layer, 768-hidden, 12-heads, 127M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:200 +#: ../model_zoo/transformers/all/transformers.rst:674 +msgid "Blenderbot_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:200 +msgid "``blenderbot-3B``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:200 +msgid "26-layer, 32-heads, 3B parameters. The Blenderbot base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:204 +msgid "``blenderbot-400M-distill``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:204 +msgid "" +"14-layer, 384-hidden, 32-heads, 400M parameters. The Blenderbot distil " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:208 +msgid "``blenderbot-1B-distill``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:208 +msgid "14-layer, 32-heads, 1478M parameters. The Blenderbot Distil 1B model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:212 +#: ../model_zoo/transformers/all/transformers.rst:676 +msgid "Blenderbot-Small_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:212 +msgid "``blenderbot_small-90M``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:212 +msgid "16-layer, 16-heads, 90M parameters. The Blenderbot small model." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:216 +#: ../model_zoo/transformers/all/transformers.rst:678 +msgid "ConvBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:216 +msgid "``convbert-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:216 +msgid "12-layer, 768-hidden, 12-heads, 106M parameters. The ConvBERT base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:220 +msgid "``convbert-medium-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:220 +msgid "" +"12-layer, 384-hidden, 8-heads, 17M parameters. The ConvBERT medium small " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:224 +msgid "``convbert-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:224 +msgid "12-layer, 128-hidden, 4-heads, 13M parameters. The ConvBERT small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:228 +#: ../model_zoo/transformers/all/transformers.rst:680 +msgid "CTRL_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:228 +msgid "``ctrl``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:228 +msgid "48-layer, 1280-hidden, 16-heads, 1701M parameters. The CTRL base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:232 +msgid "``sshleifer-tiny-ctrl``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:232 +msgid "2-layer, 16-hidden, 2-heads, 5M parameters. The Tiny CTRL model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:236 +#: ../model_zoo/transformers/all/transformers.rst:682 +msgid "DistilBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:236 +msgid "``distilbert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:236 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:241 +msgid "``distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:241 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:246 +msgid "``distilbert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:246 +msgid "" +"6-layer, 768-hidden, 12-heads, 200M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:252 +msgid "``sshleifer-tiny-distilbert-base-uncase-finetuned-sst-2-english``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:252 +msgid "2-layer, 2-hidden, 2-heads, 50K parameters. The DistilBERT model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:256 +#: ../model_zoo/transformers/all/transformers.rst:684 +msgid "ELECTRA_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:256 +msgid "``electra-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:256 +msgid "" +"12-layer, 768-hidden, 4-heads, 14M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:260 +msgid "``electra-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:260 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. 
Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:264 +msgid "``electra-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:264 +msgid "" +"24-layer, 1024-hidden, 16-heads, 334M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:268 +msgid "``chinese-electra-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:268 +msgid "12-layer, 768-hidden, 4-heads, 12M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:272 +msgid "``chinese-electra-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:272 +#: ../model_zoo/transformers/all/transformers.rst:496 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:276 +msgid "``ernie-health-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:276 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese " +"medical corpus." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:280 +msgid "``junnyu/hfl-chinese-electra-180g-base-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:280 +msgid "" +"Discriminator, 12-layer, 768-hidden, 12-heads, 102M parameters. Trained " +"on 180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:284 +msgid "``junnyu/hfl-chinese-electra-180g-small-ex-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:284 +msgid "" +"Discriminator, 24-layer, 256-hidden, 4-heads, 24M parameters. Trained on " +"180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:288 +msgid "``junnyu/hfl-chinese-legal-electra-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:288 +msgid "" +"Generator, 12-layer, 64-hidden, 1-heads, 3M parameters. Trained on " +"Chinese legal corpus." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:292 +#: ../model_zoo/transformers/all/transformers.rst:686 +msgid "ERNIE_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:292 +msgid "``ernie-3.0-medium-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:292 +#: ../model_zoo/transformers/all/transformers.rst:312 +#: ../model_zoo/transformers/all/transformers.rst:333 +#: ../model_zoo/transformers/all/transformers.rst:438 +#: ../model_zoo/transformers/all/transformers.rst:613 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:296 +msgid "``ernie-tiny``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:296 +msgid "3-layer, 1024-hidden, 16-heads, _M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:300 +msgid "``ernie-2.0-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:300 +#: ../model_zoo/transformers/all/transformers.rst:316 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:304 +msgid "``ernie-2.0-en-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:304 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on finetuned " +"squad text." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:308 +msgid "``ernie-2.0-large-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:312 +#: ../model_zoo/transformers/all/transformers.rst:688 +msgid "ERNIE-DOC_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:312 +msgid "``ernie-doc-base-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:316 +msgid "``ernie-doc-base-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:320 +#: ../model_zoo/transformers/all/transformers.rst:690 +msgid "ERNIE-GEN_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:320 +msgid "``ernie-gen-base-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:320 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:324 +msgid "``ernie-gen-large-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:328 +msgid "``ernie-gen-large-en-430g``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:328 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text. with extended data (430 GB)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:333 +#: ../model_zoo/transformers/all/transformers.rst:692 +msgid "ERNIE-GRAM_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:333 +msgid "``ernie-gram-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:337 +msgid "``ernie-gram-zh-finetuned-dureader-robust``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:337 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +" Then finetuned on dreader-robust" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:342 +#: ../model_zoo/transformers/all/transformers.rst:694 +msgid "GPT_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:342 +msgid "``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:342 +msgid "32-layer, 2560-hidden, 32-heads, 2.6B parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:346 +msgid "``gpt-cpm-small-cn-distill``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:346 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. The model distilled from" +" the GPT model ``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:351 +msgid "``gpt2-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:351 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:355 +msgid "``gpt2-medium-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:355 +msgid "24-layer, 1024-hidden, 16-heads, 345M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:359 +msgid "``gpt2-large-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:359 +#: ../model_zoo/transformers/all/transformers.rst:379 +msgid "36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:363 +msgid "``gpt2-xl-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:363 +msgid "" +"48-layer, 1600-hidden, 25-heads, 1558M parameters. Trained on English " +"text." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:367 +msgid "``junnyu/distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:367 +msgid "6-layer, 768-hidden, 12-heads, 81M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:371 +msgid "``junnyu/microsoft-DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:371 +#: ../model_zoo/transformers/all/transformers.rst:476 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:375 +msgid "``junnyu/microsoft-DialoGPT-medium``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:375 +msgid "24-layer, 1024-hidden, 16-heads, 354M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:379 +msgid "``junnyu/microsoft-DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:383 +msgid "``junnyu/uer-gpt2-chinese-poem``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:383 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on Chinese " +"poetry corpus." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:387 +#: ../model_zoo/transformers/all/transformers.rst:696 +msgid "LayoutLM_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:387 +msgid "``layoutlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:387 +msgid "" +"12-layer, 768-hidden, 12-heads, 339M parameters. LayoutLm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:391 +msgid "``layoutlm-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:391 +msgid "" +"24-layer, 1024-hidden, 16-heads, 51M parameters. LayoutLm large Uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:395 +#: ../model_zoo/transformers/all/transformers.rst:698 +msgid "LayoutLMV2_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:395 +msgid "``layoutlmv2-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:395 +msgid "" +"12-layer, 768-hidden, 12-heads, 200M parameters. LayoutLmv2 base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:399 +msgid "``layoutlmv2-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:399 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. LayoutLmv2 large uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:403 +#: ../model_zoo/transformers/all/transformers.rst:700 +msgid "LayoutXLM_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:403 +msgid "``layoutxlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:403 +msgid "" +"12-layer, 768-hidden, 12-heads, 369M parameters. Layoutxlm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:407 +msgid "MBart_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:407 +msgid "``mbart-large-cc25``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:407 +msgid "" +"12-layer, 1024-hidden, 12-heads, 1123M parameters. The ``mbart-large-" +"cc25`` model." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:411 +msgid "``mbart-large-en-ro``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:411 +msgid "" +"12-layer, 768-hidden, 16-heads, 1123M parameters. The ``mbart-large rn-" +"ro`` model ." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:415 +msgid "``mbart-large-50-one-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:415 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-one-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:420 +msgid "``mbart-large-50-many-to-one-mmt``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:420 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-one-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:425 +msgid "``mbart-large-50-many-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:425 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:430 +msgid "Mobilebert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:430 +msgid "``mobilebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:430 +msgid "24-layer, 512-hidden, 4-heads, 24M parameters. Mobilebert uncased Model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:434 +#: ../model_zoo/transformers/all/transformers.rst:706 +msgid "MPNet_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:434 +msgid "``mpnet-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:434 +msgid "12-layer, 768-hidden, 12-heads, 109M parameters. MPNet Base Model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:438 +#: ../model_zoo/transformers/all/transformers.rst:708 +msgid "NeZha_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:438 +msgid "``nezha-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:442 +msgid "``nezha-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:442 +#: ../model_zoo/transformers/all/transformers.rst:450 +msgid "24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:446 +msgid "``nezha-base-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:446 +msgid "12-layer, 768-hidden, 16-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:450 +msgid "``nezha-large-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:454 +msgid "Reformer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:454 +msgid "``reformer-enwik8``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:454 +msgid "12-layer, 1024-hidden, 8-heads, 148M parameters." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:457 +msgid "``reformer-crime-and-punishment``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:457 +msgid "6-layer, 256-hidden, 2-heads, 3M parameters." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:460 +#: ../model_zoo/transformers/all/transformers.rst:712 +msgid "RoBERTa_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:460 +msgid "``roberta-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:460 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on English Text " +"using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:465 +msgid "``roberta-wwm-ext-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:465 +msgid "" +"24-layer, 1024-hidden, 16-heads, 325M parameters. Trained on English Text" +" using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:470 +msgid "``rbt3``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:470 +msgid "3-layer, 768-hidden, 12-heads, 38M parameters." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:473 +msgid "``rbtl3``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:473 +msgid "3-layer, 1024-hidden, 16-heads, 61M parameters." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:476 +msgid "``nosaydomore/deepset-roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:480 +msgid "``nosaydomore/roberta-en-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:480 +msgid "12-layer, 768-hidden, 12-heads, 163M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:484 +msgid "``nosaydomore/roberta-en-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:484 +msgid "24-layer, 1024-hidden, 16-heads, 408M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:488 +msgid "``nosaydomore/sshleifei-tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:488 +msgid "2-layer, 2-hidden, 2-heads, 0.25M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:492 +msgid "``nosaydomore/uer-roberta-base-chn-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:492 +#: ../model_zoo/transformers/all/transformers.rst:500 +msgid "12-layer, 768-hidden, 12-heads, 101M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:496 +msgid "``nosaydomore/uer-roberta-base-ft-chinanews-chn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:500 +msgid "``nosaydomore/uer-roberta-base-ft-cluener2020-chn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:504 +#: ../model_zoo/transformers/all/transformers.rst:714 +msgid "RoFormer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:504 +msgid "``roformer-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:504 +msgid "" +"6-layer, 384-hidden, 6-heads, 30M parameters. Roformer Small Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:508 +msgid "``roformer-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:508 +msgid "" +"12-layer, 768-hidden, 12-heads, 124M parameters. Roformer Base Chinese " +"model." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:512 +msgid "``roformer-chinese-char-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:512 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Small" +" model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:516 +msgid "``roformer-chinese-char-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:516 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:520 +msgid "``roformer-chinese-sim-char-ft-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:520 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Ft " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:524 +msgid "``roformer-chinese-sim-char-ft-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:524 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char Ft " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:528 +msgid "``roformer-chinese-sim-char-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:528 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Sim Char " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:532 +msgid "``roformer-chinese-sim-char-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:532 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Sim Char" +" Base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:536 +msgid "``roformer-english-small-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:536 +msgid "" +"12-layer, 256-hidden, 4-heads, 13M parameters. Roformer English Small " +"Discriminator." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:540 +msgid "``roformer-english-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:540 +msgid "" +"12-layer, 64-hidden, 1-heads, 5M parameters. Roformer English Small " +"Generator." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:544 +#: ../model_zoo/transformers/all/transformers.rst:716 +msgid "SKEP_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:544 +msgid "``skep_ernie_1.0_large_ch``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:544 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_1.0``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:549 +msgid "``skep_ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:549 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:554 +msgid "``skep_roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:554 +msgid "" +"24-layer, 1024-hidden, 16-heads, 355M parameters. 
Trained using the " +"RoBERTa model ``roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:559 +#: ../model_zoo/transformers/all/transformers.rst:718 +msgid "SqueezeBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:559 +msgid "``squeezebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:559 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Uncased model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:563 +msgid "``squeezebert-mnli``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:563 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:567 +msgid "``squeezebert-mnli-headless``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:567 +msgid "" +"12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli Headless" +" model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:571 +#: ../model_zoo/transformers/all/transformers.rst:720 +msgid "T5_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:571 +msgid "``t5-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:571 +msgid "6-layer, 512-hidden, 8-heads, 93M parameters. T5 small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:575 +msgid "``t5-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:575 +msgid "12-layer, 768-hidden, 12-heads, 272M parameters. T5 base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:579 +msgid "``t5-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:579 +msgid "24-layer, 1024-hidden, 16-heads, 803M parameters. T5 large model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:583 +#: ../model_zoo/transformers/all/transformers.rst:722 +msgid "TinyBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:583 +msgid "``tinybert-4l-312d``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:583 +#: ../model_zoo/transformers/all/transformers.rst:593 +#: ../model_zoo/transformers/all/transformers.rst:603 +msgid "" +"4-layer, 312-hidden, 12-heads, 14.5M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:588 +msgid "``tinybert-6l-768d``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:588 +#: ../model_zoo/transformers/all/transformers.rst:598 +#: ../model_zoo/transformers/all/transformers.rst:608 +msgid "" +"6-layer, 768-hidden, 12-heads, 67M parameters. 
The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:593 +msgid "``tinybert-4l-312d-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:598 +msgid "``tinybert-6l-768d-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:603 +msgid "``tinybert-4l-312d-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:608 +msgid "``tinybert-6l-768d-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:613 +#: ../model_zoo/transformers/all/transformers.rst:724 +msgid "UnifiedTransformer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:613 +msgid "``unified_transformer-12L-cn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:617 +msgid "``unified_transformer-12L-cn-luge``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:617 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text " +"(LUGE.ai)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:621 +msgid "``plato-mini``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:621 +msgid "6-layer, 768-hidden, 12-heads, 66M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:625 +msgid "UNIMO_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:625 +msgid "``unimo-text-1.0``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:625 +msgid "12-layer, 768-hidden, 12-heads, 99M parameters. UNIMO-text-1.0 model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:629 +msgid "``unimo-text-1.0-lcsts-new``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:629 +msgid "" +"12-layer, 768-hidden, 12-heads, 99M parameters. Finetuned on lcsts_new " +"dataset." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:633 +msgid "``unimo-text-1.0-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:633 +msgid "" +"24-layer, 768-hidden, 16-heads, 316M parameters. UNIMO-text-1.0 large " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:637 +#: ../model_zoo/transformers/all/transformers.rst:726 +msgid "XLNet_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:637 +msgid "``xlnet-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:637 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:641 +msgid "``xlnet-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:641 +msgid "" +"24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English " +"model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:645 +msgid "``chinese-xlnet-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:645 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. XLNet Chinese model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:649 +msgid "``chinese-xlnet-mid``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:649 +msgid "" +"24-layer, 768-hidden, 12-heads, 209M parameters. XLNet Medium Chinese " +"model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:653 +msgid "``chinese-xlnet-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:653 +msgid "24-layer, 1024-hidden, 16-heads, _M parameters. 
XLNet Large Chinese model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:661 +msgid "Transformer预训练模型适用任务汇总" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Sequence Classification" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Token Classification" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Question Answering" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Text Generation" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Multiple Choice" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:666 +#: ../model_zoo/transformers/all/transformers.rst:668 +#: ../model_zoo/transformers/all/transformers.rst:670 +#: ../model_zoo/transformers/all/transformers.rst:672 +#: ../model_zoo/transformers/all/transformers.rst:674 +#: ../model_zoo/transformers/all/transformers.rst:676 +#: ../model_zoo/transformers/all/transformers.rst:678 +#: ../model_zoo/transformers/all/transformers.rst:680 +#: ../model_zoo/transformers/all/transformers.rst:682 +#: ../model_zoo/transformers/all/transformers.rst:684 +#: ../model_zoo/transformers/all/transformers.rst:686 +#: ../model_zoo/transformers/all/transformers.rst:688 +#: ../model_zoo/transformers/all/transformers.rst:690 +#: ../model_zoo/transformers/all/transformers.rst:692 +#: ../model_zoo/transformers/all/transformers.rst:694 +#: ../model_zoo/transformers/all/transformers.rst:696 +#: ../model_zoo/transformers/all/transformers.rst:698 +#: ../model_zoo/transformers/all/transformers.rst:700 +#: ../model_zoo/transformers/all/transformers.rst:702 +#: ../model_zoo/transformers/all/transformers.rst:704 +#: ../model_zoo/transformers/all/transformers.rst:706 +#: ../model_zoo/transformers/all/transformers.rst:708 +#: ../model_zoo/transformers/all/transformers.rst:710 +#: ../model_zoo/transformers/all/transformers.rst:712 +#: ../model_zoo/transformers/all/transformers.rst:714 +#: ../model_zoo/transformers/all/transformers.rst:716 +#: ../model_zoo/transformers/all/transformers.rst:718 +#: ../model_zoo/transformers/all/transformers.rst:720 +#: ../model_zoo/transformers/all/transformers.rst:722 +#: ../model_zoo/transformers/all/transformers.rst:724 +#: ../model_zoo/transformers/all/transformers.rst:726 +msgid "✅" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:666 +#: ../model_zoo/transformers/all/transformers.rst:668 +#: ../model_zoo/transformers/all/transformers.rst:670 +#: ../model_zoo/transformers/all/transformers.rst:672 +#: ../model_zoo/transformers/all/transformers.rst:674 +#: ../model_zoo/transformers/all/transformers.rst:676 +#: ../model_zoo/transformers/all/transformers.rst:680 +#: ../model_zoo/transformers/all/transformers.rst:682 +#: ../model_zoo/transformers/all/transformers.rst:684 +#: ../model_zoo/transformers/all/transformers.rst:686 +#: ../model_zoo/transformers/all/transformers.rst:688 +#: ../model_zoo/transformers/all/transformers.rst:690 +#: ../model_zoo/transformers/all/transformers.rst:692 +#: ../model_zoo/transformers/all/transformers.rst:694 +#: ../model_zoo/transformers/all/transformers.rst:696 +#: ../model_zoo/transformers/all/transformers.rst:698 +#: ../model_zoo/transformers/all/transformers.rst:700 +#: ../model_zoo/transformers/all/transformers.rst:702 +#: ../model_zoo/transformers/all/transformers.rst:704 +#: ../model_zoo/transformers/all/transformers.rst:706 +#: ../model_zoo/transformers/all/transformers.rst:708 +#: 
../model_zoo/transformers/all/transformers.rst:710 +#: ../model_zoo/transformers/all/transformers.rst:712 +#: ../model_zoo/transformers/all/transformers.rst:714 +#: ../model_zoo/transformers/all/transformers.rst:716 +#: ../model_zoo/transformers/all/transformers.rst:718 +#: ../model_zoo/transformers/all/transformers.rst:720 +#: ../model_zoo/transformers/all/transformers.rst:722 +#: ../model_zoo/transformers/all/transformers.rst:724 +#: ../model_zoo/transformers/all/transformers.rst:726 +msgid "❌" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:702 +msgid "Mbart_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:704 +msgid "MobileBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:710 +msgid "ReFormer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:765 +msgid "预训练模型使用方法" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:767 +msgid "" +"PaddleNLP Transformer API在提丰富预训练模型的同时,也降低了用户的使用门槛。 " +"使用Auto模块,可以加载不同网络结构的预训练模型,无需查找 模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-" +"tuning。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:806 +msgid "" +"上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, 可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 " +"`_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:809 +msgid "加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:810 +msgid "" +"加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 " +"Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, " +"无需指定类别,即可调用不同网络结构的预训练模型。 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。" +" ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 " +"``num_classes`` 等, 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 " +"``from_pretrained`` 方法加载。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:816 +msgid "通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:817 +msgid "定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:818 +msgid "定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:822 +msgid "Reference" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:823 +msgid "" +"部分中文预训练模型来自: `brightmart/albert_zh " +"`_, `ymcui/Chinese-BERT-wwm " +"`_, `huawei-noah/Pretrained-" +"Language-Model/TinyBERT `_, `ymcui/Chinese-XLNet " +"`_, " +"`huggingface/xlnet_chinese_large " +"`_, `Knover/luge-" +"dialogue `_, `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ " +"`_, `ZhuiyiTechnology/simbert " +"`_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:832 +msgid "" +"Lan, Zhenzhong, et al. \"Albert: A lite bert for self-supervised learning" +" of language representations.\" arXiv preprint arXiv:1909.11942 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:833 +msgid "" +"Lewis, Mike, et al. \"BART: Denoising Sequence-to-Sequence Pre-training " +"for Natural Language Generation, Translation, and Comprehension.\" arXiv " +"preprint arXiv:1910.13461 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:834 +msgid "" +"Devlin, Jacob, et al. \"Bert: Pre-training of deep bidirectional " +"transformers for language understanding.\" arXiv preprint " +"arXiv:1810.04805 (2018)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:835 +msgid "" +"Zaheer, Manzil, et al. 
\"Big bird: Transformers for longer sequences.\" " +"arXiv preprint arXiv:2007.14062 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:836 +msgid "" +"Stephon, Emily, et al. \"Blenderbot: Recipes for building an open-domain " +"chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:837 +msgid "" +"Stephon, Emily, et al. \"Blenderbot-Small: Recipes for building an open-" +"domain chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:838 +msgid "" +"Zhang, zhengyan, et al. \"CPM: A Large-scale Generative Chinese Pre-" +"trained Language Model.\" arXiv preprint arXiv:2012.00413 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:839 +msgid "" +"Jiang, Zihang, et al. \"ConvBERT: Improving BERT with Span-based Dynamic " +"Convolution.\" arXiv preprint arXiv:2008.02496 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:840 +msgid "" +"Nitish, Bryan, et al. \"CTRL: A Conditional Transformer Language Model " +"for Controllable Generation.\" arXiv preprint arXiv:1909.05858 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:841 +msgid "" +"Sanh, Victor, et al. \"DistilBERT, a distilled version of BERT: smaller, " +"faster, cheaper and lighter.\" arXiv preprint arXiv:1910.01108 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:842 +msgid "" +"Clark, Kevin, et al. \"Electra: Pre-training text encoders as " +"discriminators rather than generators.\" arXiv preprint arXiv:2003.10555 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:843 +msgid "" +"Sun, Yu, et al. \"Ernie: Enhanced representation through knowledge " +"integration.\" arXiv preprint arXiv:1904.09223 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:844 +msgid "" +"Xiao, Dongling, et al. \"Ernie-gen: An enhanced multi-flow pre-training " +"and fine-tuning framework for natural language generation.\" arXiv " +"preprint arXiv:2001.11314 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:845 +msgid "" +"Xiao, Dongling, et al. \"ERNIE-Gram: Pre-Training with Explicitly N-Gram " +"Masked Language Modeling for Natural Language Understanding.\" arXiv " +"preprint arXiv:2010.12148 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:846 +msgid "" +"Radford, Alec, et al. \"Language models are unsupervised multitask " +"learners.\" OpenAI blog 1.8 (2019): 9." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:847 +msgid "" +"Xu, Yiheng, et al. \"LayoutLM: Pre-training of Text and Layout for " +"Document Image Understanding.\" arXiv preprint arXiv:1912.13318 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:848 +msgid "" +"Xu, Yang, et al. \"LayoutLMv2: Multi-modal Pre-training for Visually-Rich" +" Document Understanding\" arXiv preprint arXiv:2012.14740 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:849 +msgid "" +"Xu, Yiheng, et al. \"LayoutXLM: Multimodal Pre-training for Multilingual " +"Visually-rich Document Understanding\" arXiv preprint arXiv:2104.08836 " +"(2021)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:850 +msgid "" +"Liu, Yinhan, et al. \"MBart: Multilingual Denoising Pre-training for " +"Neural Machine Translation\" arXiv preprint arXiv:2001.08210 (2020)." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:851 +msgid "" +"Sun, Zhiqing, et al. \"MobileBERT: a Compact Task-Agnostic BERT for " +"Resource-Limited Devices\" arXiv preprint arXiv:2004.02984 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:852 +msgid "" +"Song, Kaitao, et al. \"MPNet: Masked and Permuted Pre-training for " +"Language Understanding.\" arXiv preprint arXiv:2004.09297 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:853 +msgid "" +"Wei, Junqiu, et al. \"NEZHA: Neural contextualized representation for " +"chinese language understanding.\" arXiv preprint arXiv:1909.00204 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:854 +msgid "" +"Kitaev, Nikita, et al. \"Reformer: The efficient Transformer.\" arXiv " +"preprint arXiv:2001.04451 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:855 +msgid "" +"Liu, Yinhan, et al. \"Roberta: A robustly optimized bert pretraining " +"approach.\" arXiv preprint arXiv:1907.11692 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:856 +msgid "" +"Su Jianlin, et al. \"RoFormer: Enhanced Transformer with Rotary Position " +"Embedding.\" arXiv preprint arXiv:2104.09864 (2021)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:857 +msgid "" +"Tian, Hao, et al. \"SKEP: Sentiment knowledge enhanced pre-training for " +"sentiment analysis.\" arXiv preprint arXiv:2005.05635 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:858 +msgid "" +"Forrest, ALbert, et al. \"SqueezeBERT: What can computer vision teach NLP" +" about efficient neural networks?\" arXiv preprint arXiv:2006.11316 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:859 +msgid "" +"Raffel, Colin, et al. \"T5: Exploring the Limits of Transfer Learning " +"with a Unified Text-to-Text Transformer.\" arXiv preprint " +"arXiv:1910.10683 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:860 +msgid "" +"Vaswani, Ashish, et al. \"Attention is all you need.\" arXiv preprint " +"arXiv:1706.03762 (2017)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:861 +msgid "" +"Jiao, Xiaoqi, et al. \"Tinybert: Distilling bert for natural language " +"understanding.\" arXiv preprint arXiv:1909.10351 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:862 +msgid "" +"Bao, Siqi, et al. \"Plato-2: Towards building an open-domain chatbot via " +"curriculum learning.\" arXiv preprint arXiv:2006.16779 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:863 +msgid "" +"Yang, Zhilin, et al. \"Xlnet: Generalized autoregressive pretraining for " +"language understanding.\" arXiv preprint arXiv:1906.08237 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:864 +msgid "" +"Cui, Yiming, et al. \"Pre-training with whole word masking for chinese " +"bert.\" arXiv preprint arXiv:1906.08101 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:865 +msgid "" +"Wang, Quan, et al. “Building Chinese Biomedical Language Models via " +"Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021)." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/modules.po b/docs/locale/en/LC_MESSAGES/source/modules.po new file mode 100644 index 0000000000000000000000000000000000000000..caee4de4a69d33c64c63c428c6e876b509ba314d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/modules.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/modules.rst:2 +msgid "paddlenlp" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.collate.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.collate.po new file mode 100644 index 0000000000000000000000000000000000000000..0833b056f484eb23d23394a026955ad177efb0e0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.collate.po @@ -0,0 +1,219 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.collate.rst:2 +msgid "collate" +msgstr "" + +#: of paddlenlp.data.collate.Dict:1 paddlenlp.data.collate.Pad:1 +#: paddlenlp.data.collate.Stack:1 paddlenlp.data.collate.Tuple:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.collate.Stack:1 +msgid "" +"Stacks the input data samples to construct the batch. The N input samples" +" must have the same shape/length and will be stacked to construct a " +"batch." +msgstr "" + +#: of paddlenlp.data.collate.Dict paddlenlp.data.collate.Dict.__call__ +#: paddlenlp.data.collate.Pad paddlenlp.data.collate.Pad.__call__ +#: paddlenlp.data.collate.Stack paddlenlp.data.collate.Stack.__call__ +#: paddlenlp.data.collate.Tuple paddlenlp.data.collate.Tuple.__call__ +msgid "参数" +msgstr "" + +#: of paddlenlp.data.collate.Stack:4 +msgid "" +"The axis in the result data along which the input data are stacked. " +"Default: 0." +msgstr "" + +#: of paddlenlp.data.collate.Stack:7 +msgid "" +"The value type of the output. If it is set to None, the type of input " +"data is used. Default: None." +msgstr "" + +#: of paddlenlp.data.collate.Stack.__call__:1 +msgid "Batchifies the input data by stacking." +msgstr "" + +#: of paddlenlp.data.collate.Pad.__call__:6 +#: paddlenlp.data.collate.Stack.__call__:3 +msgid "" +"The input data samples. It is a list. Each element is a numpy.ndarray or " +"list." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__ paddlenlp.data.collate.Pad.__call__ +#: paddlenlp.data.collate.Stack.__call__ paddlenlp.data.collate.Tuple.__call__ +msgid "返回" +msgstr "" + +#: of paddlenlp.data.collate.Stack.__call__:7 +msgid "Stacked batch data." 
+msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__ paddlenlp.data.collate.Pad.__call__ +#: paddlenlp.data.collate.Stack.__call__ paddlenlp.data.collate.Tuple.__call__ +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:14 +#: paddlenlp.data.collate.Pad.__call__:18 +#: paddlenlp.data.collate.Stack.__call__:11 +#: paddlenlp.data.collate.Tuple.__call__:14 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.collate.Pad:1 +msgid "Pads the input data samples to the largest length at `axis`." +msgstr "" + +#: of paddlenlp.data.collate.Pad:3 +msgid "The padding value. Default: 0." +msgstr "" + +#: of paddlenlp.data.collate.Pad:5 +msgid "" +"The axis to pad the arrays. The arrays will be padded to the largest " +"length at `axis`. For example, assume the input arrays have shape (10, 8," +" 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded " +"into (10, 8, 5) and then stacked to form the final output, which has " +"shape (3, 10, 8, 5). Default: 0." +msgstr "" + +#: of paddlenlp.data.collate.Pad:12 +msgid "" +"If it is bool, indicate whether to return the valid length in the output," +" and the data type of returned length is int32 if True. If it is " +"numpy.dtype, indicate the data type of returned length. Default: None." +msgstr "" + +#: of paddlenlp.data.collate.Pad:17 +msgid "" +"The value type of the output. If it is set to None, the input data type " +"is used. Default: None." +msgstr "" + +#: of paddlenlp.data.collate.Pad:20 +msgid "" +"Whether the padding direction is right-side. If True, it indicates we pad" +" to the right side, while False indicates we pad to the left side. " +"Default: True." +msgstr "" + +#: of paddlenlp.data.collate.Pad.__call__:1 +msgid "" +"Batchifies the input data by padding. The input will be padded to the " +"largest dimension at `axis` and then stacked to form the final output. In" +" addition, the function will output the original dimensions at the `axis`" +" if `ret_length` is not None or False." +msgstr "" + +#: of paddlenlp.data.collate.Pad.__call__:10 +msgid "" +"If `ret_length` is False, it is a numpy.ndarray representing the padded " +"batch data and the shape is (N, …). Otherwise, it is a tuple, besides the" +" padded batch data, the tuple also includes a numpy.ndarray representing " +"original length at `axis` of all input samples, which shaped `(N,)`." +msgstr "" + +#: of paddlenlp.data.collate.Dict:1 paddlenlp.data.collate.Tuple:1 +msgid "" +"Wraps multiple batchify functions together. The input functions will be " +"applied to the corresponding input fields." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:4 +msgid "" +"Each sample should be a list or tuple containing multiple fields. The " +"i'th batchify function stored in Tuple will be applied on the i'th field." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:7 +msgid "" +"For example, when data sample is (nd_data, label), you can wrap two " +"batchify functions using `Tuple(DataBatchify, LabelBatchify)` to batchify" +" nd_data and label correspondingly." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:11 +msgid "" +"The batchify functions to wrap. It is a callable function or a list/tuple" +" of callable functions." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:14 +msgid "The additional batchify functions to wrap." +msgstr "" + +#: of paddlenlp.data.collate.Tuple.__call__:1 +msgid "" +"Batchifies data samples by applying each function on the corresponding " +"data field, and each data field is produced by stacking the field data of" +" samples." 
+msgstr "" + +#: of paddlenlp.data.collate.Tuple.__call__:5 +msgid "" +"The samples to batchfy. Each sample in list/tuple should contain `N` " +"fields." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:9 +#: paddlenlp.data.collate.Tuple.__call__:9 +msgid "A tuple composed of results from all including batchifying functions." +msgstr "" + +#: of paddlenlp.data.collate.Dict:4 +msgid "" +"Each sample should be a dict containing multiple fields. Each batchify " +"function with key stored in `Dict` will be applied on the field which has" +" the same key." +msgstr "" + +#: of paddlenlp.data.collate.Dict:8 +msgid "" +"For example, when data sample is {'tokens': tokens, 'labels': labels}, " +"you can wrap two batchify functions using `Dict({'tokens': DataBatchify, " +"'labels': LabelBatchify})` to batchify tokens and labels correspondingly." +msgstr "" + +#: of paddlenlp.data.collate.Dict:13 +msgid "" +"The batchify functions to wrap. It is a dict, which values is callable " +"functions." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:1 +msgid "" +"Batchifies data samples by applying each function on the corresponding " +"data field, and each data field is produced by stacking the field data " +"with the same key as batchify functions of all samples." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:5 +msgid "" +"The samples to batchfy. Each sample in list/tuple is a dict with `N` key-" +"values." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.data_collator.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.data_collator.po new file mode 100644 index 0000000000000000000000000000000000000000..f9ae0c435657599fd55104c9c935bb6a4d8bbec9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.data_collator.po @@ -0,0 +1,182 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.data.data_collator.rst:2 +msgid "data\\_collator" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:1 +#: paddlenlp.data.data_collator.DataCollatorForTokenClassification:1 +#: paddlenlp.data.data_collator.DataCollatorWithPadding:1 +#: paddlenlp.data.data_collator.DefaultDataCollator:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorWithPadding:1 +msgid "" +"Data collator that will dynamically pad the inputs to the longest " +"sequence in the batch." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq +#: paddlenlp.data.data_collator.DataCollatorForTokenClassification +#: paddlenlp.data.data_collator.DataCollatorWithPadding +msgid "参数" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:3 +#: paddlenlp.data.data_collator.DataCollatorForTokenClassification:3 +#: paddlenlp.data.data_collator.DataCollatorWithPadding:3 +msgid "The tokenizer used for encoding the data." 
+msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:1 +msgid "" +"Very simple data collator that simply collates batches of dict-like " +"objects and performs special handling for potential keys named:" +msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:3 +msgid "`label`: handles a single value (int or float) per object" +msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:4 +msgid "`label_ids`: handles a list of values per object" +msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:5 +msgid "" +"Does not do any additional preprocessing: property names of the input " +"object will be used as corresponding inputs to the model. See glue and " +"ner for example of how it's useful. This is an object (like other data " +"collators) rather than a pure function like default_data_collator. This " +"can be helpful if you need to set a return_tensors value at " +"initialization. :param return_tensors: Return Tensor or numpy array. " +":type return_tensors: `bool`" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForTokenClassification:1 +msgid "" +"Data collator that will dynamically pad the inputs to longest sequence in" +" the batch, as well as the labels." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForTokenClassification:5 +msgid "The id to use when padding the labels. Defaults to -100." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:1 +msgid "" +"Data collator that will dynamically pad the inputs received, as well as " +"the labels." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:5 +msgid "" +"The model that is being trained. If set and has the " +"*prepare_decoder_input_ids_from_labels*, use it to prepare the " +"*decoder_input_ids* This is useful when using *label_smoothing* to avoid" +" calculating loss twice." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:5 +msgid "" +"The model that is being trained. If set and has the " +"*prepare_decoder_input_ids_from_labels*, use it to prepare the " +"*decoder_input_ids*" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:8 +msgid "" +"This is useful when using *label_smoothing* to avoid calculating loss " +"twice." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:10 +msgid "" +"Select a strategy to pad the returned sequences (according to the model's" +" padding side and padding index) among: - `True` or `'longest'`: Pad to " +"the longest sequence in the batch (or no padding if only a single " +"sequence is provided). - `'max_length'`: Pad to a maximum length " +"specified with the argument `max_length` or to the maximum acceptable " +"input length for the model if that argument is not provided. - `False` or" +" `'do_not_pad'` (default): No padding (i.e., can output a batch with " +"sequences of different lengths)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:10 +msgid "" +"Select a strategy to pad the returned sequences (according to the model's" +" padding side and padding index) among:" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:13 +msgid "" +"`True` or `'longest'`: Pad to the longest sequence in the batch (or no " +"padding if only a single sequence is provided)." 
+msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:15 +msgid "" +"`'max_length'`: Pad to a maximum length specified with the argument " +"`max_length` or to the maximum acceptable input length for the model if " +"that argument is not provided." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:17 +msgid "" +"`False` or `'do_not_pad'` (default): No padding (i.e., can output a batch" +" with sequences of different lengths)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:20 +msgid "" +"Maximum length of the returned list and optionally padding length (see " +"above)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:22 +msgid "" +"If set will pad the sequence to a multiple of the provided value. This " +"is especially useful to enable the use of Tensor Cores on NVIDIA hardware" +" with compute capability >= 7.5 (Volta)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:22 +msgid "If set will pad the sequence to a multiple of the provided value." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:24 +msgid "" +"This is especially useful to enable the use of Tensor Cores on NVIDIA " +"hardware with compute capability >= 7.5 (Volta)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:27 +msgid "" +"The id to use when padding the labels (-100 will be automatically ignored" +" by PyTorch loss functions)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:29 +msgid "" +"The type of Tensor to return. Allowable values are \"np\", \"pt\" and " +"\"tf\"." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.po new file mode 100644 index 0000000000000000000000000000000000000000..8f85809137c1ee602c3f2058128bcdd16ab5afb1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.rst:2 +msgid "paddlenlp.data" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.sampler.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.sampler.po new file mode 100644 index 0000000000000000000000000000000000000000..741070ddb7b5bd3874c7800b64272eebf2abdbe1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.sampler.po @@ -0,0 +1,194 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.sampler.rst:2 +msgid "sampler" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:1 +msgid "" +"The class is to help construct iterable sampler used for " +":class:`paddle.io.DataLoader`. It wraps a dataset and uses its " +":meth:`__getitem__` method. Every subclass of :class:`SamplerHelper` has " +"to provide an :meth:`__iter__` method, providing a way to iterate over " +"indices of dataset elements, and a :meth:`__len__` method that returns " +"the length of the returned iterators." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:8 +msgid "" +"The class also can be used as batch iterator instead of indices iterator " +"when `iterator` yield samples rather than indices by initializing " +"`iterator` with a iterable dataset." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:13 +msgid "" +"The :meth:`__len__` method isn't strictly required by " +":class:`paddle.io.DataLoader`, but is expected in any calculation " +"involving the length of a :class:`paddle.io.DataLoader`." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper +#: paddlenlp.data.sampler.SamplerHelper.batch +#: paddlenlp.data.sampler.SamplerHelper.shard +#: paddlenlp.data.sampler.SamplerHelper.shuffle +#: paddlenlp.data.sampler.SamplerHelper.sort +msgid "参数" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:17 +msgid "Input dataset for :class:`SamplerHelper`." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:19 +msgid "Iterator of dataset. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.length:1 +msgid "Returns the length." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:1 +msgid "Shuffles the dataset according to the given buffer size and random seed." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:3 +msgid "" +"Buffer size for shuffle. If `buffer_size < 0` or more than the length of " +"the dataset, `buffer_size` is the length of the dataset. Default: -1." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:7 +msgid "Seed for the random. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch +#: paddlenlp.data.sampler.SamplerHelper.shard +#: paddlenlp.data.sampler.SamplerHelper.shuffle +#: paddlenlp.data.sampler.SamplerHelper.sort +msgid "返回" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:10 +msgid "A new shuffled :class:`SamplerHelper` object." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch +#: paddlenlp.data.sampler.SamplerHelper.shard +#: paddlenlp.data.sampler.SamplerHelper.shuffle +#: paddlenlp.data.sampler.SamplerHelper.sort +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:25 +#: paddlenlp.data.sampler.SamplerHelper.shard:18 +#: paddlenlp.data.sampler.SamplerHelper.shuffle:14 +#: paddlenlp.data.sampler.SamplerHelper.sort:21 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:1 +msgid "Sorts the dataset according to given callable :meth:`cmp` or :meth:`key`." 
+msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:3 +msgid "The function of comparison. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:5 +msgid "The function of key. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:7 +msgid "" +"Whether to reverse when sorting the data samples. If True, it means in " +"descending order, and False means in ascending order. Default: False." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:11 +msgid "" +"Buffer size for sort. If `buffer_size < 0` or `buffer_size` is more than " +"the length of the data, `buffer_size` will be set to the length of the " +"data. Default: -1." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:17 +msgid "A new sorted :class:`SamplerHelper` object." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:1 +msgid "Batches the dataset according to given `batch_size`." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:3 +msgid "The batch size." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:5 +msgid "Whether to drop the last mini batch. Default: False." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:8 +msgid "" +"It accepts four arguments: index of data source, the length of minibatch," +" the size of minibatch so far and data source, and it returns the size of" +" mini batch so far. Actually, the returned value can be anything and " +"would used as argument `size_so_far` in `key`. If None, it would return " +"the length of mini match. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:15 +msgid "" +"The function of key. It accepts the size of minibatch so far and the " +"length of minibatch, and returns what to be compared with `batch_size`. " +"If None, only the size of mini batch so far would be compared with " +"`batch_size`. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:21 +msgid "A new batched :class:`SamplerHelper` object." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:1 +msgid "Slices the dataset for multi GPU training." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:3 +msgid "" +"The number of training process, and is also the number of GPU cards used " +"in training. If None, it will be set by " +":meth:`paddle.distributed.get_world_size` method. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:8 +msgid "" +"The id of current training process. Equal to the value of the environment" +" variable PADDLE_TRAINER_ID. If None, it will be intialized by " +":meth:`paddle.distributed.get_rank` method. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:14 +msgid "A new sliced :class:`SamplerHelper` object." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..ec5b593ae6a7e3a89ef9c52c5290227412c288f1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.tokenizer.po @@ -0,0 +1,99 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer:1 +msgid "基类::class:`paddlenlp.data.tokenizer.BaseTokenizer`" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer:1 +msgid "" +"Constructs a tokenizer based on `jieba " +"`__. It supports :meth:`cut` method to " +"split the text to tokens, and :meth:`encode` method to covert text to " +"token ids." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer +#: paddlenlp.data.tokenizer.JiebaTokenizer.cut +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode +msgid "参数" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer:5 +msgid "An instance of :class:`paddlenlp.data.Vocab`." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:1 +msgid "The method used to cut the text to tokens." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:3 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:5 +msgid "The text that needs to be cuted." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:5 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:7 +msgid "" +"Whether to use the full mode. If True, using full mode that gets all the " +"possible words from the sentence, which is fast but not accurate. If " +"False, using accurate mode that attempts to cut the sentence into the " +"most accurate segmentations, which is suitable for text analysis. " +"Default: False." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:12 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:14 +msgid "Whether to use the HMM model. Default: True." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode +msgid "返回" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:15 +msgid "A list of tokens." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:19 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:21 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.encode:1 +msgid "" +"The method used to convert the text to ids. It will firstly call " +":meth:`cut` method to cut the text to tokens. Then, convert tokens to ids" +" using `vocab`." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.encode:17 +msgid "A list of ids." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.vocab.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.vocab.po new file mode 100644 index 0000000000000000000000000000000000000000..34418ef11149794efa1b2f23f734be6b2f1d4332 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.vocab.po @@ -0,0 +1,321 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.vocab.rst:2 +msgid "vocab" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:1 +msgid "" +"The class used to convert between tokens and ids. It also includes some " +"store/load functions." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab paddlenlp.data.vocab.Vocab.build_vocab +#: paddlenlp.data.vocab.Vocab.from_dict paddlenlp.data.vocab.Vocab.from_json +#: paddlenlp.data.vocab.Vocab.load_vocabulary +#: paddlenlp.data.vocab.Vocab.to_indices paddlenlp.data.vocab.Vocab.to_json +#: paddlenlp.data.vocab.Vocab.to_tokens +msgid "参数" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:4 +msgid "" +"A Counter instance describes the tokens and their frequencies. Its keys " +"will be indexed according to the order of frequency sorting to construct " +"mapping relationship. If None, `token_to_idx` must be provided as the " +"mapping relationship. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:10 +msgid "Max size of vocab, not including special tokens. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:13 paddlenlp.data.vocab.Vocab.build_vocab:11 +msgid "Ignore tokens whose frequencies are less than `min_freq`. Default: 1." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:16 paddlenlp.data.vocab.Vocab.build_vocab:14 +msgid "" +"A dict specifies the mapping relationship between tokens and indices to " +"be used. If provided, adjust the tokens and indices mapping according to " +"it. If None, counter must be provided. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:21 +msgid "" +"Special token for unknown token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:24 +msgid "" +"Special token for padding token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:27 +msgid "" +"Special token for bos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:30 +msgid "" +"Special token for eos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:33 paddlenlp.data.vocab.Vocab.build_vocab:31 +#: paddlenlp.data.vocab.Vocab.from_dict:18 +#: paddlenlp.data.vocab.Vocab.load_vocabulary:19 +msgid "" +"Keyword arguments ending with `_token`. It can be used to specify further" +" special tokens that will be exposed as attribute of the vocabulary and " +"associated with an index." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_tokens:1 +msgid "Maps the input indices to token list." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_tokens:3 +msgid "" +"The input indice(s) for mapping. Must be an `int` or 1D " +"`list[int]`|`tuple[int]`|`numpy.ndarray`."
+msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab +#: paddlenlp.data.vocab.Vocab.from_dict paddlenlp.data.vocab.Vocab.from_json +#: paddlenlp.data.vocab.Vocab.load_vocabulary +#: paddlenlp.data.vocab.Vocab.to_indices paddlenlp.data.vocab.Vocab.to_json +#: paddlenlp.data.vocab.Vocab.to_tokens +msgid "返回" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_tokens:7 +msgid "" +"Obtained token(s). If `indices` is an integer, it will return a str. If " +"`indices` is a list/tuple of integers, it will return a list of str." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab +#: paddlenlp.data.vocab.Vocab.from_dict paddlenlp.data.vocab.Vocab.from_json +#: paddlenlp.data.vocab.Vocab.load_vocabulary +#: paddlenlp.data.vocab.Vocab.to_indices paddlenlp.data.vocab.Vocab.to_json +#: paddlenlp.data.vocab.Vocab.to_tokens +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:41 +#: paddlenlp.data.vocab.Vocab.from_dict:28 +#: paddlenlp.data.vocab.Vocab.from_json:12 +#: paddlenlp.data.vocab.Vocab.load_vocabulary:28 +#: paddlenlp.data.vocab.Vocab.to_indices:13 +#: paddlenlp.data.vocab.Vocab.to_json:14 +#: paddlenlp.data.vocab.Vocab.to_tokens:13 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_indices:1 +msgid "Maps the input tokens into indices." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_indices:3 +msgid "The input token(s) for mapping." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_indices:7 +msgid "" +"Obationed indice(s). If `tokens` is a str, it will return an integer. If " +"`tokens` is a list/tuple of str, it will return a list of integers." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.__call__:1 +msgid "" +"Maps the input tokens into indices. Its function is the same as the " +":meth:`to_indices` method." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.__call__:4 +msgid "See detail at `to_indices`." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_json:1 +msgid "" +"Summarizes some information of vocab as JSON string. If path is gaven, " +"the JSON string will be saved into files. The JSON string and the saved " +"file all can be used to reconstruct the :class:`Vocab` by calling " +":meth:`from_json` method." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_json:6 +msgid "" +"The path to save JSON string. If None, the JSON will not be saved. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_json:10 +msgid "The JSON string including information of vocab." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_json:1 +msgid "" +"Loads :class:`Vocab` from JSON string or JSON file, which is gotten by " +"calling :meth:`to_json` method." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_json:4 +msgid "JSON string or file path of JSON string." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_json:7 +msgid "" +"An instance of :class:`Vocab` generated from information contained in " +"JSON string." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:1 +msgid "Builds the :class:`Vocab` from a dict." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:3 +msgid "A dict describes the mapping relationship between tokens and indices." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:6 +msgid "" +"The special token for unknow token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:9 +msgid "" +"The special token for padding token. If no need, it also could be None. " +"Default: None." 
+msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:12 +msgid "" +"The special token for bos token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:15 +msgid "" +"The special token for eos token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:23 +msgid "" +"An instance of :class:`Vocab` generated from the given dict and special " +"tokens." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:1 +msgid "" +"Builds the :class:`Vocab` accoring to given iterator and other " +"information. Firstly, iterate over the `iterator` to construct a " +":class:`collections.Counter` and used to init the as :class:`Vocab`." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:5 +msgid "" +"Iterator of tokens. Each element should be a list of tokens if wordlevel " +"vocab is needed." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:8 +msgid "The max size of vocab, not including special tokens. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:19 +msgid "" +"The special token for unknow token ''. If no need, it also could be " +"None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:22 +msgid "" +"The special token for padding token ''. If no need, it also could be" +" None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:25 +msgid "" +"The special token for bos token ''. If no need, it also could be " +"None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:28 +msgid "" +"The special token for eos token ''. If no need, it also could be " +"None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:36 +msgid "" +"An instance of :class:`Vocab` generated from given iterator and other " +"informations." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:1 +msgid "" +"Builds the :class:`Vocab` from a file reserving all tokens by calling " +":meth:`Vocab.from_dict` method. The file contains a token per line, and " +"the line index would be the index of corresponding token." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:5 +msgid "the path of file to construct vocabulary." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:24 +msgid "An instance of :class:`Vocab` generated from the given file." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.dataset.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.dataset.po new file mode 100644 index 0000000000000000000000000000000000000000..daafb8801ffe434f84878d9e7c1dc13e6f909fb8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.dataset.po @@ -0,0 +1,282 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.datasets.dataset.rst:2 +msgid "dataset" +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset:1 +msgid "" +"Wraps a map-style dataset-like object as an instance of `MapDataset`, and" +" equips it with `map` and other utility methods. All non-magic methods of" +" the raw object are also accessible." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read +#: paddlenlp.datasets.dataset.IterDataset +#: paddlenlp.datasets.dataset.IterDataset.filter +#: paddlenlp.datasets.dataset.IterDataset.map +#: paddlenlp.datasets.dataset.IterDataset.shard +#: paddlenlp.datasets.dataset.MapDataset +#: paddlenlp.datasets.dataset.MapDataset.filter +#: paddlenlp.datasets.dataset.MapDataset.map +#: paddlenlp.datasets.dataset.load_dataset +msgid "参数" +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset:5 +msgid "" +"An object with `__getitem__` and `__len__` methods. It could be a list or" +" a subclass of `paddle.io.Dataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:8 +#: paddlenlp.datasets.dataset.MapDataset:8 +msgid "Other information to be passed to the dataset." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:11 +#: paddlenlp.datasets.dataset.MapDataset:11 +msgid "" +"For examples of this class, please see `dataset_self_defined " +"`__." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.filter:1 +#: paddlenlp.datasets.dataset.MapDataset.filter:1 +msgid "" +"Filters samples by the filter function and uses the filtered data to " +"update this dataset." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.filter:4 +msgid "" +"A filter function that takes a sample as input and returns a boolean. " +"Samples that return False would be discarded." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.filter:7 +msgid "" +"Number of processes for multiprocessing. If set to 0, it doesn't use " +"multiprocessing. Defaults to `0`." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.map:1 +#: paddlenlp.datasets.dataset.MapDataset.map:1 +msgid "" +"Performs specific function on the dataset to transform and update every " +"sample." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:3 +msgid "" +"Transformations to be performed. It receives single sample as argument if" +" batched is False. Else it receives all examples." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:6 +msgid "" +"If True, transformations would be delayed and performed on demand. " +"Otherwise, transforms all samples at once. Note that if `fn` is " +"stochastic, `lazy` should be True or you will get the same result on all " +"epochs. Defaults to False." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:11 +msgid "" +"If True, transformations would take all examples as input and return a " +"collection of transformed examples. Note that if set True, `lazy` option " +"would be ignored. Defaults to False." 
+msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:15 +msgid "" +"Number of processes for multiprocessing. If set to 0, it doesn't use " +"multiprocessing. Note that if set to positive value, `lazy` option would " +"be ignored. Defaults to 0." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder:1 +msgid "" +"A base class for all DatasetBuilder. It provides a `read()` function to " +"turn a data file into a MapDataset or IterDataset." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder:4 +msgid "" +"`_get_data()` function and `_read()` function should be implemented to " +"download data file and read data file into a `Iterable` of the examples." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder:7 +msgid "" +"For how to define a custom `DatasetBuilder`, please see " +"`contribute_dataset " +"`__." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:1 +msgid "" +"Returns a dataset containing all the examples that can be read from the " +"file path." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:3 +msgid "" +"If `self.lazy` is False, this eagerly reads all instances from " +"`self._read()` and returns a `MapDataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:6 +msgid "" +"If `self.lazy` is True, this returns an `IterDataset`, which internally " +"relies on the generator created from `self._read()` to lazily produce " +"examples. In this case your implementation of `_read()` must also be lazy" +" (that is, not load all examples into memory at once)." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:11 +msgid "Path of data file to read, usually provided by `_get_data` function." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:14 +msgid "" +"The split name of selected dataset. This only makes a different when data" +" files of different splits have different structures." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read +#: paddlenlp.datasets.dataset.load_dataset +msgid "返回" +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:18 +msgid "A `MapDataset|IterDataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.get_labels:1 +msgid "Returns list of class labels of the dataset if specified." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.get_vocab:1 +msgid "Returns vocab file path of the dataset if specified." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:1 +msgid "" +"Wraps a dataset-like object as an instance of `IterDataset`, and equips " +"it with `map` and other utility methods. All non-magic methods of the raw" +" object also accessible." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:5 +msgid "" +"An object with `__iter__` function. It can be a Iterable or a subclass of" +" `paddle.io.IterableDataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.filter:4 +msgid "" +"A filter function that takes a sample as input and returns a boolean. " +"Samples that return False are discarded." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.shard:1 +msgid "Split the dataset into `num_shards` pieces." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.shard:3 +msgid "" +"An integer representing the number of data shards. If None, `num_shards` " +"would be number of trainers. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.shard:7 +msgid "" +"An integer representing the index of the current shard. 
If None, `index` " +"would be the current trainer rank id. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.map:3 +msgid "Transformations to be performed. It receives single sample as argument." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:1 +msgid "" +"This method will load a dataset, either form PaddleNLP library or from a " +"self-defined data loading script, by calling functions in " +"`DatasetBuilder`." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:4 +msgid "" +"For all the names of datasets in PaddleNLP library, see here: " +"`dataset_list " +"`__." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:7 +msgid "Either `splits` or `data_files` must be specified." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:9 +msgid "" +"Name of the dataset processing script in PaddleNLP library or a custom " +"data reading function." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:12 +msgid "Additional name to select a more specific dataset. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:15 +msgid "" +"Defining the path of dataset files. If None. `splits` must be specified. " +"Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:18 +msgid "" +"Which split of the data to load. If None. `data_files` must be specified." +" Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:21 +msgid "" +"Weather to return `MapDataset` or an `IterDataset`. True for " +"`IterDataset`. False for `MapDataset`. If None, return the default type " +"of this dataset. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:25 +msgid "Other keyword arguments to be passed to the `DatasetBuilder`." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:28 +msgid "A `MapDataset` or `IterDataset` or a tuple of those." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:30 +msgid "" +"For how to use this function, please see `dataset_load " +"`__" +" and `dataset_self_defined " +"`__" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.po new file mode 100644 index 0000000000000000000000000000000000000000..97715579e95c3d470e7223458043af91216c949d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.datasets.rst:2 +msgid "paddlenlp.datasets" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.po new file mode 100644 index 0000000000000000000000000000000000000000..d3c2572e9434b0278c479eaf599caa80a0380651 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. 
+# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.embeddings.rst:2 +msgid "paddlenlp.embeddings" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.token_embedding.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.token_embedding.po new file mode 100644 index 0000000000000000000000000000000000000000..638fc8a0a6664df00cc71c927054215f02b96156 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.token_embedding.po @@ -0,0 +1,200 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.embeddings.token_embedding.rst:2 +msgid "token\\_embedding" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.list_embedding_name:1 +msgid "Lists all names of pretrained embedding models paddlenlp provides." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:1 +msgid "基类::class:`paddle.nn.layer.common.Embedding`" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:1 +msgid "" +"A `TokenEmbedding` can load pre-trained embedding model which paddlenlp " +"provides by specifying embedding name. Furthermore, a `TokenEmbedding` " +"can load extended vocabulary by specifying extended_vocab_path." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.set_trainable +msgid "参数" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:5 +msgid "" +"The pre-trained embedding model name. Use " +"`paddlenlp.embeddings.list_embedding_name()` to list the names of all " +"embedding models that we provide. Defaults to " +"`w2v.baidu_encyclopedia.target.word-word.dim300`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:9 +msgid "Specifies unknown token. Defaults to `[UNK]`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:12 +msgid "" +"To initialize the vector of unknown token. If it's none, use normal " +"distribution to initialize the vector of unknown token. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:16 +msgid "The file path of extended vocabulary. Defaults to `None`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:19 +msgid "Whether the weight of embedding can be trained. 
Defaults to True." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:22 +msgid "" +"Whether to keep the extended vocabulary only, will be effective only if " +"provides extended_vocab_path. Defaults to False." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.set_trainable:1 +msgid "Whether or not to set the weights of token embedding to be trainable." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.set_trainable:3 +msgid "" +"The weights can be trained if trainable is set to True, or the weights " +"are fixed if trainable is False." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:1 +msgid "Gets the vectors of specifying words." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:3 +msgid "The words which need to be searched." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search +msgid "返回" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:6 +msgid "The vectors of specifying words." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search +msgid "返回类型" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:7 +msgid "`numpy.array`" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:13 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:14 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:10 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search:10 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:1 +msgid "Gets the index of specifying word by searching word_to_idx dict." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:3 +msgid "The input token word which we want to get the token index converted from." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:6 +msgid "The index of specifying word." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:7 +msgid "`int`" +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:1 +msgid "Gets the index list of specifying words by searching word_to_idx dict." +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:3 +msgid "" +"The input token words which we want to get the token indices converted " +"from." +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:6 +msgid "The indexes list of specifying words." +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:7 +msgid "`list`" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:1 +msgid "" +"Calculates the dot product of 2 words. 
Dot product or scalar product is " +"an algebraic operation that takes two equal-length sequences of numbers " +"(usually coordinate vectors), and returns a single number." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:4 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:5 +msgid "The first word string." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:6 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:7 +msgid "The second word string." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:10 +msgid "The dot product of 2 words." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:1 +msgid "" +"Calculates the cosine similarity of 2 word vectors. Cosine similarity is " +"the cosine of the angle between two n-dimensional vectors in an " +"n-dimensional space." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:9 +msgid "The cosine similarity of 2 words." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.ernie_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.ernie_model.po new file mode 100644 index 0000000000000000000000000000000000000000..31045f6c624d38299a58db84eb7742f33aa80578 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.ernie_model.po @@ -0,0 +1,161 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.ernie_model.rst:2 +msgid "ernie\\_model" +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:1 +msgid "The bare ERNIE Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward +#: paddlenlp.experimental.ernie_model.FasterErnieModel +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `ErnieModel`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." 
+msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward:1 +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward:1 +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward:4 +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward:4 +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward:6 +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward:6 +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.faster_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.faster_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..5e8ab0f96c3e7d53844c30129e5527dde4d95e90 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.faster_tokenizer.po @@ -0,0 +1,74 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.faster_tokenizer.rst:2 +msgid "faster\\_tokenizer" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_tensor:1 +msgid "" +"Create the tensor that the value holds the list of string. NOTICE: The " +"value will be holded in the cpu place." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward +#: paddlenlp.experimental.faster_tokenizer.to_tensor +#: paddlenlp.experimental.faster_tokenizer.to_vocab_buffer +msgid "参数" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_tensor:4 +msgid "The value will be setted to the tensor." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_tensor:6 +#: paddlenlp.experimental.faster_tokenizer.to_vocab_buffer:7 +msgid "The name of the tensor." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_vocab_buffer:1 +msgid "" +"Create the tensor that the value holds the map, the type of key is the " +"string. NOTICE: The value will be holded in the cpu place." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_vocab_buffer:4 +msgid "" +"The value will be setted to the tensor. The key is token and the value is" +" the token index." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.model_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.model_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..6eb76061d50380deba0f0ec4879007455882caf7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.model_utils.po @@ -0,0 +1,134 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.model_utils.rst:2 +msgid "model\\_utils" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:1 +msgid "" +"Creates an instance of `PretrainedModel`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_resources +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of a built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:8 +msgid "Name of a built-in pretrained model" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:10 +msgid "" +"Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:13 +msgid "" +"Position arguments for model `__init__`. If provided, use these as " +"position argument values for model initialization." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:16 +msgid "" +"Keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for model initialization. 
If the " +"keyword is in `__init__` argument names of base model, update argument " +"values of the base model; else update argument values of derived model." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:23 +msgid "An instance of `PretrainedModel`." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:27 +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:13 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:1 +msgid "" +"Saves model configuration and related resources (model state) as files " +"under `save_dir`. The model configuration would be saved into a file " +"named \"model_config.json\", and model state would be saved into a file " +"named \"model_state.pdparams\"." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:6 +msgid "" +"The `save_dir` can be used in `from_pretrained` as argument value of " +"`pretrained_model_name_or_path` to re-load the trained model." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:9 +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.po new file mode 100644 index 0000000000000000000000000000000000000000..c380d05ce24fb38ddf9023e630d8a62b865ead63 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.rst:2 +msgid "paddlenlp.experimental" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.crf.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.crf.po new file mode 100644 index 0000000000000000000000000000000000000000..3bea53b2814a1959bb95ab2ea6bc2dd26d082483 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.crf.po @@ -0,0 +1,223 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.crf.rst:2 +msgid "crf" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:1 +msgid "" +"LinearChainCrf is a linear chain Conditional Random Field layer, it can " +"implement sequential dependencies in the predictions. Therefore, it can " +"take context into account whereas a classifier predicts a label for a " +"single sample without considering \"neighboring\" samples. See " +"https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers" +" for reference." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf +#: paddlenlp.layers.crf.LinearChainCrf.forward +#: paddlenlp.layers.crf.LinearChainCrf.gold_score +#: paddlenlp.layers.crf.LinearChainCrfLoss +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward +#: paddlenlp.layers.crf.ViterbiDecoder +#: paddlenlp.layers.crf.ViterbiDecoder.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:5 +msgid "The label number." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:7 +msgid "The crf layer learning rate. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:9 +msgid "" +"If set to True, the start tag and stop tag will be considered, the " +"transitions params will be a tensor with a shape of `[num_labels+2, " +"num_labels+2]`. Else, the transitions params will be a tensor with a " +"shape of `[num_labels, num_labels]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:1 +msgid "" +"Computes the normalization in a linear-chain CRF. See " +"http://www.cs.columbia.edu/~mcollins/fb.pdf for reference." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:3 +msgid "" +"F & = logZ(x) = log\\sum_y exp(score(x,y))\n" +"\n" +"score(x,y) & = \\sum_i Emit(x_i,y_i) + Trans(y_{i-1}, y_i)\n" +"\n" +"p(y_i) & = Emit(x_i,y_i), T(y_{i-1}, y_i) = Trans(y_{i-1}, y_i)" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:10 +msgid "then we can get:" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:12 +msgid "" +"F(1) = log\\sum_{y1} exp(p(y_1) + T([START], y1))\n" +"\n" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:15 +msgid "" +"F(2) & = log\\sum_{y1}\\sum_{y2} exp(p(y_1) + T([START], y1) + p(y_2) + " +"T(y_1,y_2)) \\\\\n" +"& = log\\sum_{y2} exp(F(1) + p(y_2) + T(y_1,y_2))\n" +"\n" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:19 +msgid "Further, We can get F(n) is a recursive formula with F(n-1)." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:21 +#: paddlenlp.layers.crf.LinearChainCrf.gold_score:4 +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward:4 +msgid "" +"The input predicted tensor. Its dtype is float32 and has a shape of " +"`[batch_size, sequence_length, num_tags]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:23 +#: paddlenlp.layers.crf.LinearChainCrf.gold_score:8 +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward:6 +msgid "The input length. Its dtype is int64 and has a shape of `[batch_size]`." 
+msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward +#: paddlenlp.layers.crf.LinearChainCrf.gold_score +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward +#: paddlenlp.layers.crf.ViterbiDecoder.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:26 +msgid "" +"Returns the normalizers tensor `norm_score`. Its dtype is float32 and has" +" a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward +#: paddlenlp.layers.crf.LinearChainCrf.gold_score +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward +#: paddlenlp.layers.crf.ViterbiDecoder.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.gold_score:1 +msgid "" +"Computes the unnormalized score for a tag sequence. $$ score(x,y) = " +"\\sum_i Emit(x_i,y_i) + Trans(y_{i-1}, y_i) $$" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.gold_score:6 +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward:8 +msgid "" +"The input label tensor. Its dtype is int64 and has a shape of " +"`[batch_size, sequence_length]`" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.gold_score:11 +msgid "" +"Returns the unnormalized sequence scores tensor `unnorm_score`. Its dtype" +" is float32 and has a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss:1 +msgid "" +"The negative log-likelihood for linear chain Conditional Random Field " +"(CRF)." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss:3 +msgid "" +"The `LinearChainCrf` network object. Its parameter will be used to " +"calculate the loss." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss.forward:1 +msgid "" +"Calculate the crf loss. Let $$ Z(x) = \\sum_{y'}exp(score(x,y')) $$, " +"means the sum of all path scores, then we have $$ loss = -logp(y|x) = " +"-log(exp(score(x,y))/Z(x)) = -score(x,y) + logZ(x) $$" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss.forward:10 +msgid "" +"Unnecessary parameter for compatibility with older versions. Defaults to " +"``None``." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss.forward:13 +msgid "The crf loss. Its dtype is float32 and has a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder:1 +msgid "" +"ViterbiDecoder can decode the highest scoring sequence of tags, it should" +" only be used at test time." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder:3 +msgid "" +"The transition matrix. Its dtype is float32 and has a shape of " +"`[num_tags, num_tags]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder:5 +msgid "" +"If set to True, the last row and the last column of transitions will be " +"considered as start tag, the penultimate row and the penultimate " +"column of transitions will be considered as stop tag. Else, all the rows " +"and columns will be considered as the real tag. Defaults to ``None``." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:1 +msgid "Decode the highest scoring sequence of tags." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:3 +msgid "" +"The unary emission tensor. Its dtype is float32 and has a shape of " +"`[batch_size, sequence_length, num_tags]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:5 +msgid "" +"The input length tensor storing real length of each sequence for " +"correctness. Its dtype is int64 and has a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:8 +msgid "" +"Returns tuple (scores, paths). 
The `scores` tensor containing the score " +"for the Viterbi sequence. Its dtype is float32 and has a shape of " +"`[batch_size]`. The `paths` tensor containing the highest scoring tag " +"indices. Its dtype is int64 and has a shape of `[batch_size, " +"sequence_length]`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.po new file mode 100644 index 0000000000000000000000000000000000000000..15d7ab232f01d0cd875ce935b2d9074245569f46 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.rst:2 +msgid "paddlenlp.layers" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.sequence.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.sequence.po new file mode 100644 index 0000000000000000000000000000000000000000..c381dfaa47418a2ff26e568d7a77d796f43ac15d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.sequence.po @@ -0,0 +1,57 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.sequence.rst:2 +msgid "sequence" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:1 +msgid "" +"To boost the performance, this sequence_mask is different with " +"paddle.fluid.layers.sequence_mask" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask +msgid "参数" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:3 +msgid "" +"The whole sequence index, a tensor with a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:5 +msgid "The valid length of every sequence, a tensor with a shape of [batch_size]." +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask +msgid "返回" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:8 +msgid "" +"Returns the output sequence mask `mask`. Its dtype is `bool` and has a " +"shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.tcn.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.tcn.po new file mode 100644 index 0000000000000000000000000000000000000000..e4259a8fb04a1c28749dfb73f8577f7ede28ff85 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.tcn.po @@ -0,0 +1,93 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.tcn.rst:2 +msgid "tcn" +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:1 +msgid "" +"The TCN block, consists of dilated causal conv, relu and residual block. " +"See the Figure 1(b) in https://arxiv.org/pdf/1803.01271.pdf for more " +"details." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward paddlenlp.layers.tcn.TemporalBlock +#: paddlenlp.layers.tcn.TemporalBlock.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:4 +msgid "The number of channels in the input tensor." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:6 +msgid "The number of filters." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:8 +msgid "The filter size." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:10 +msgid "The stride size." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:12 +msgid "The dilation size." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:14 +msgid "The size of zeros to be padded." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:16 +msgid "Probability of dropout the units. Defaults to 0.2." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock.forward:1 +msgid "" +"The input tensor with a shape of [batch_size, input_channel, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward:1 +msgid "Apply temporal convolutional networks to the input tensor." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward:3 +msgid "" +"The input tensor with a shape of [batch_size, input_channel, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward:6 +msgid "" +"The `output` tensor with a shape of [batch_size, num_channels[-1], " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.po new file mode 100644 index 0000000000000000000000000000000000000000..0356af101fad1bc3d61232f48a9b90615781df4f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.losses.rst:2 +msgid "paddlenlp.losses" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.rdrop.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.rdrop.po new file mode 100644 index 0000000000000000000000000000000000000000..b3873b947f7482888cfa93468d5f4feda6b312c7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.rdrop.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.losses.rdrop.rst:2 +msgid "rdrop" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss:1 +msgid "" +"R-Drop Loss implementation For more information about R-drop please refer" +" to this paper: https://arxiv.org/abs/2106.14448 Original implementation " +"please refer to this code: https://github.com/dropreg/R-Drop" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss paddlenlp.losses.rdrop.RDropLoss.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss:5 +msgid "" +"Indicate how to average the loss, the candicates are " +"``'none'``,``'batchmean'``,``'mean'``,``'sum'``. If `reduction` is " +"``'mean'``, the reduced mean loss is returned; If `reduction` is " +"``'batchmean'``, the sum loss divided by batch size is returned; If " +"`reduction` is ``'sum'``, the reduced sum loss is returned; If " +"`reduction` is ``'none'``, no reduction will be applied. Defaults to " +"``'none'``." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:1 +msgid "the first forward logits of training examples." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:3 +msgid "the second forward logits of training examples." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:5 +msgid "" +"The Tensor containing the binary mask to index with, it's data type is " +"bool." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:8 +msgid "Returns tensor `loss`, the rdrop loss of p and q." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.bleu.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.bleu.po new file mode 100644 index 0000000000000000000000000000000000000000..4d7afbb8914db5653edf7d744bbd6ade6211513a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.bleu.po @@ -0,0 +1,188 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.bleu.rst:2 +msgid "bleu" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:1 +msgid "" +"BLEU (bilingual evaluation understudy) is an algorithm for evaluating the" +" quality of text which has been machine-translated from one natural " +"language to another. This metric uses a modified form of precision to " +"compare a candidate translation against multiple reference translations." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:6 +msgid "" +"BLEU could be used as `paddle.metric.Metric` class, or an ordinary class." +" When BLEU is used as `paddle.metric.Metric` class. A function is needed " +"that transforms the network output to reference string list, and " +"transforms the label to candidate string. By default, a default function " +"`default_trans_func` is provided, which gets target sequence id by " +"calculating the maximum probability of each step. In this case, user must" +" provide `vocab`. It should be noted that the BLEU here is different from" +" the BLEU calculated in prediction, and it is only for observation during" +" training and evaluation." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:16 +msgid "" +"BP & =\n" +"\\begin{cases}\n" +"1, & \\text{if }c>r \\\\\n" +"e_{1-r/c}, & \\text{if }c\\leq r\n" +"\\end{cases}\n" +"\n" +"BLEU & = BP\\exp(\\sum_{n=1}^N w_{n} \\log{p_{n}})" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:26 +msgid "" +"where `c` is the length of candidate sentence, and `r` is the length of " +"reference sentence." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU paddlenlp.metrics.bleu.BLEU.add_inst +#: paddlenlp.metrics.bleu.BLEUForDuReader +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:28 +msgid "`trans_func` transforms the network output to string to calculate." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:31 +msgid "" +"Vocab for target language. If `trans_func` is None and BLEU is used as " +"`paddle.metric.Metric` instance, `default_trans_func` will be performed " +"and `vocab` must be provided." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:36 paddlenlp.metrics.bleu.BLEUForDuReader:5 +msgid "Number of gram for BLEU metric. Defaults to 4." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:38 +msgid "The weights of precision of each gram. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:41 +msgid "Name of `paddle.metric.Metric` instance. Defaults to \"bleu\"." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:46 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:47 +msgid "Using as a general evaluation object." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:58 +msgid "Using as an instance of `paddle.metric.Metric`." 
+msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.update:1 +msgid "Update states for metric" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.update:3 +msgid "" +"Inputs of :code:`update` is the outputs of :code:`Metric.compute`, if " +":code:`compute` is not defined, the inputs of :code:`update` will be " +"flatten arguments of **output** of mode and **label** from data: " +":code:`update(output1, output2, ..., label1, label2,...)`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.update:8 +msgid "see :code:`Metric.compute`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.add_inst:1 +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst:1 +msgid "Update the states based on a pair of candidate and references." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.add_inst:3 +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst:3 +msgid "Tokenized candidate sentence." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.add_inst:5 +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst:5 +msgid "List of tokenized ground truth sentences." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.reset:1 +msgid "Reset states and result" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate:1 +msgid "Calculates and returns the final bleu metric." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate:3 +msgid "Returns the accumulated metric `bleu` and its data type is float64." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.name:1 +msgid "Returns metric name" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:1 +msgid "基类::class:`paddlenlp.metrics.bleu.BLEU`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:1 +msgid "BLEU metric with bonus for DuReader contest." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:3 +msgid "" +"Please refer to `DuReader " +"Homepage`_ for " +"more details." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:7 +msgid "" +"Weight of YesNo dataset when adding bonus for DuReader contest. Defaults " +"to 1.0." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:9 +msgid "" +"Weight of Entity dataset when adding bonus for DuReader contest. Defaults" +" to 1.0." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.chunk.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.chunk.po new file mode 100644 index 0000000000000000000000000000000000000000..31be210073d18c2f5468ee30a5c310fe0b1253da --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.chunk.po @@ -0,0 +1,176 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.chunk.rst:2 +msgid "chunk" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:1 +msgid "" +"ChunkEvaluator computes the precision, recall and F1-score for chunk " +"detection. 
It is often used in sequence tagging tasks, such as Named " +"Entity Recognition(NER)." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator +#: paddlenlp.metrics.chunk.ChunkEvaluator.compute +#: paddlenlp.metrics.chunk.ChunkEvaluator.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:4 +msgid "The label list." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:6 +msgid "" +"If set True, the label ends with '-B', '-I', '-E' or '-S', else the label" +" starts with them. Defaults to `False`." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:11 +msgid "示例" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:1 +msgid "Computes the precision, recall and F1-score for chunk detection." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:3 +msgid "The valid length of every sequence, a tensor with shape `[batch_size]`" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:5 +msgid "" +"The predictions index, a tensor with shape `[batch_size, " +"sequence_length]`." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:7 +msgid "The labels index, a tensor with shape `[batch_size, sequence_length]`." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:9 +msgid "" +"Unnecessary parameter for compatibility with older versions with " +"parameters list `inputs`, `lengths`, `predictions`, `labels`. Defaults to" +" None." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate +#: paddlenlp.metrics.chunk.ChunkEvaluator.compute +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:12 +msgid "" +"Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`)." +" With the fields: - `num_infer_chunks` (Tensor): The number of the " +"inference chunks. - `num_label_chunks` (Tensor): The number of the " +"label chunks. - `num_correct_chunks` (Tensor): The number of the " +"correct chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:12 +msgid "Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`)." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:14 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:17 +msgid "`num_infer_chunks` (Tensor):" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:17 +msgid "The number of the inference chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:20 +msgid "`num_label_chunks` (Tensor):" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:20 +msgid "The number of the label chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:22 +msgid "`num_correct_chunks` (Tensor):" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:23 +msgid "The number of the correct chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate +#: paddlenlp.metrics.chunk.ChunkEvaluator.compute +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:1 +msgid "" +"This function takes (num_infer_chunks, num_label_chunks, " +"num_correct_chunks) as input, to accumulate and update the corresponding " +"status of the ChunkEvaluator object. The update method is as follows:" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:4 +msgid "" +"\\\\ \\begin{array}{l}{\\text { self. num_infer_chunks }+=\\text { " +"num_infer_chunks }} \\\\ {\\text { self. 
num_Label_chunks }+=\\text { " +"num_label_chunks }} \\\\ {\\text { self. num_correct_chunks }+=\\text { " +"num_correct_chunks }}\\end{array} \\\\\n" +"\n" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:7 +msgid "The number of chunks in Inference on the given minibatch." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:9 +msgid "The number of chunks in Label on the given mini-batch." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:11 +msgid "The number of chunks both in Inference and Label on the given mini-batch." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate:1 +msgid "" +"This function returns the mean precision, recall and f1 score for all " +"accumulated minibatches." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate:3 +msgid "Returns tuple (`precision, recall, f1 score`)." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.reset:1 +msgid "Reset function empties the evaluation memory for previous mini-batches." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.name:1 +msgid "Return name of metric instance." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.distinct.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.distinct.po new file mode 100644 index 0000000000000000000000000000000000000000..b1d2bb23e8fd2e73790dc1467e94f9f142cfe3b7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.distinct.po @@ -0,0 +1,151 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.distinct.rst:2 +msgid "distinct" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:1 +msgid "" +"`Distinct` is an algorithm for evaluating the textual diversity of the " +"generated text by calculating the number of distinct n-grams. The larger " +"the number of distinct n-grams, the higher the diversity of the text. See" +" details at https://arxiv.org/abs/1510.03055." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:6 +msgid "" +":class:`Distinct` could be used as a :class:`paddle.metric.Metric` class," +" or an ordinary class. When :class:`Distinct` is used as a " +":class:`paddle.metric.Metric` class, a function is needed to transform " +"the network output to a string list." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct +#: paddlenlp.metrics.distinct.Distinct.add_inst +#: paddlenlp.metrics.distinct.Distinct.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:11 +msgid "Number of gram for :class:`Distinct` metric. Defaults to 2." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:13 +msgid "" +"`trans_func` transforms the network output to a string list. Defaults to " +"None. .. note:: When :class:`Distinct` is used as a " +":class:`paddle.metric.Metric` class, `trans_func` must be provided. " +"Please note that the input of `trans_func` is numpy array." 
+msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:13 +msgid "" +"`trans_func` transforms the network output to a string list. Defaults to " +"None." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:16 +msgid "" +"When :class:`Distinct` is used as a :class:`paddle.metric.Metric` class, " +"`trans_func` must be provided. Please note that the input of `trans_func`" +" is numpy array." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:20 +msgid "Name of :class:`paddle.metric.Metric` instance. Defaults to \"distinct\"." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:25 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:26 +msgid "Using as a general evaluation object." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:38 +msgid "Using as an instance of `paddle.metric.Metric`." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.update:1 +msgid "" +"Updates the metrics states. This method firstly will use " +":meth:`trans_func` method to process the `output` to get the tokenized " +"candidate sentence list. Then call :meth:`add_inst` method to process the" +" candidate list one by one." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.update:6 +msgid "The outputs of model." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.update:8 +msgid "The additional inputs." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.add_inst:1 +msgid "Updates the states based on the candidate." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.add_inst:3 +msgid "Tokenized candidate sentence generated by model." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.reset:1 +msgid "Resets states and result." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate:1 +msgid "Calculates the final distinct score." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate +#: paddlenlp.metrics.distinct.Distinct.name +#: paddlenlp.metrics.distinct.Distinct.score +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate:3 +#: paddlenlp.metrics.distinct.Distinct.score:3 +msgid "The final distinct score." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate +#: paddlenlp.metrics.distinct.Distinct.name +#: paddlenlp.metrics.distinct.Distinct.score +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.score:1 +msgid "The function is the same as :meth:`accumulate` method." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.name:1 +msgid "Returns the metric name." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.name:3 +msgid "The metric name." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.dureader.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.dureader.po new file mode 100644 index 0000000000000000000000000000000000000000..5731e3f52b95ef2f990f96b2c951a74a8b1f91c3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.dureader.po @@ -0,0 +1,55 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.dureader.rst:2 +msgid "dureader" +msgstr "" + +#: of paddlenlp.metrics.dureader:1 +msgid "Official evaluation script for SQuAD version 2.0." +msgstr "" + +#: of paddlenlp.metrics.dureader:3 +msgid "" +"In addition to basic functionality, we also compute additional statistics" +" and plot precision-recall curves if an additional na_prob.json file is " +"provided. This file is expected to map question ID's to the model's " +"predicted probability that a question is unanswerable." +msgstr "" + +#: of paddlenlp.metrics.dureader.compute_predictions:1 +msgid "Write final predictions to the json file and log-odds of null if needed." +msgstr "" + +#: of paddlenlp.metrics.dureader.get_final_text:1 +msgid "Project the tokenized prediction back to the original text." +msgstr "" + +#: of paddlenlp.metrics.dureader.normalize:1 +msgid "Normalize strings to space joined chars. :param s: a list of strings." +msgstr "" + +#: of paddlenlp.metrics.dureader.normalize +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.dureader.normalize:4 +msgid "A list of normalized strings." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.glue.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.glue.po new file mode 100644 index 0000000000000000000000000000000000000000..5c9b7b61a769b285ab404f4bfccc670e3d1df8bf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.glue.po @@ -0,0 +1,459 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.glue.rst:2 +msgid "glue" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:1 paddlenlp.metrics.glue.Mcc:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:1 +msgid "" +"This class encapsulates Accuracy, Precision, Recall and F1 metric logic, " +"and `accumulate` function returns accuracy, precision, recall and f1. The" +" overview of all metrics could be seen at the document of `paddle.metric " +"`_" +" for details." 
+msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1 +#: paddlenlp.metrics.glue.AccuracyAndF1.compute +#: paddlenlp.metrics.glue.AccuracyAndF1.update paddlenlp.metrics.glue.Mcc +#: paddlenlp.metrics.glue.Mcc.compute paddlenlp.metrics.glue.Mcc.update +#: paddlenlp.metrics.glue.MultiLabelsMetric +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute +#: paddlenlp.metrics.glue.MultiLabelsMetric.update +#: paddlenlp.metrics.glue.PearsonAndSpearman +#: paddlenlp.metrics.glue.PearsonAndSpearman.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:7 +msgid "" +"Number of top elements to look at for computing accuracy. Defaults to " +"(1,)." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:10 +msgid "The positive label for calculating precision and recall. Defaults to 1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:14 +msgid "String name of the metric instance. Defaults to 'acc_and_f1'." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:18 paddlenlp.metrics.glue.Mcc:7 +#: paddlenlp.metrics.glue.MultiLabelsMetric:11 +#: paddlenlp.metrics.glue.PearsonAndSpearman:9 +msgid "示例" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute:1 +msgid "" +"Accepts network's output and the labels, and calculates the top-k " +"(maximum value in topk) indices for accuracy." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:4 +msgid "" +"Predicted tensor, and its dtype is float32 or float64, and has a shape of" +" [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:7 +msgid "" +"The ground truth tensor, and its dtype is int64, and has a shape of " +"[batch_size, 1] or [batch_size, num_classes] in one hot representation." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate +#: paddlenlp.metrics.glue.AccuracyAndF1.compute +#: paddlenlp.metrics.glue.AccuracyAndF1.name +#: paddlenlp.metrics.glue.Mcc.accumulate paddlenlp.metrics.glue.Mcc.compute +#: paddlenlp.metrics.glue.Mcc.name +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute +#: paddlenlp.metrics.glue.MultiLabelsMetric.name +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate +#: paddlenlp.metrics.glue.PearsonAndSpearman.name +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:12 +msgid "" +"Correct mask, each element indicates whether the prediction equals to the" +" label. Its' a tensor with a data type of float32 and has a shape of " +"[batch_size, topk]." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate +#: paddlenlp.metrics.glue.AccuracyAndF1.compute +#: paddlenlp.metrics.glue.AccuracyAndF1.name +#: paddlenlp.metrics.glue.Mcc.accumulate paddlenlp.metrics.glue.Mcc.compute +#: paddlenlp.metrics.glue.Mcc.name +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute +#: paddlenlp.metrics.glue.MultiLabelsMetric.name +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate +#: paddlenlp.metrics.glue.PearsonAndSpearman.name +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.update:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.update:1 +msgid "" +"Updates the metrics states (accuracy, precision and recall), in order to " +"calculate accumulated accuracy, precision and recall of all instances." 
+msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.update:4 +msgid "" +"Correct mask for calculating accuracy, and it's a tensor with shape " +"[batch_size, topk] and has a dtype of float32." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:1 +#: paddlenlp.metrics.glue.Mcc.accumulate:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:1 +msgid "Calculates and returns the accumulated metric." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:3 +msgid "" +"The accumulated metric. A tuple of shape (acc, precision, recall, f1, " +"average_of_acc_and_f1) With the fields: - `acc` (numpy.float64): " +"The accumulated accuracy. - `precision` (numpy.float64): The " +"accumulated precision. - `recall` (numpy.float64): The accumulated " +"recall. - `f1` (numpy.float64): The accumulated f1. - " +"`average_of_acc_and_f1` (numpy.float64): The average of accumulated " +"accuracy and f1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:3 +msgid "" +"The accumulated metric. A tuple of shape (acc, precision, recall, f1, " +"average_of_acc_and_f1)" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:6 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:27 +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:6 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:8 +msgid "`acc` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:9 +msgid "The accumulated accuracy." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:10 +msgid "`precision` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:11 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:30 +msgid "The accumulated precision." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:12 +msgid "`recall` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:13 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:32 +msgid "The accumulated recall." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:14 +msgid "`f1` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:15 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:34 +msgid "The accumulated f1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:16 +msgid "`average_of_acc_and_f1` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:17 +msgid "The average of accumulated accuracy and f1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.reset:1 +#: paddlenlp.metrics.glue.Mcc.reset:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman.reset:1 +msgid "Resets all metric states." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.name:1 +#: paddlenlp.metrics.glue.Mcc.name:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.name:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman.name:1 +msgid "Returns name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.name:3 +#: paddlenlp.metrics.glue.Mcc.name:3 +#: paddlenlp.metrics.glue.MultiLabelsMetric.name:3 +#: paddlenlp.metrics.glue.PearsonAndSpearman.name:3 +msgid "The name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc:1 +msgid "" +"This class calculates `Matthews correlation coefficient " +"`_ ." 
+msgstr "" + +#: of paddlenlp.metrics.glue.Mcc:3 +msgid "String name of the metric instance. Defaults to 'mcc'." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:1 +msgid "" +"Processes the pred tensor, and returns the indices of the maximum of each" +" sample." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:4 +msgid "" +"The predicted value is a Tensor with dtype float32 or float64. Shape is " +"[batch_size, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:7 +msgid "" +"The ground truth value is Tensor with dtype int64, and its shape is " +"[batch_size, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:11 +msgid "" +"A tuple of preds and label. Each shape is [batch_size, 1], with dtype " +"float32 or float64." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.update:1 +msgid "" +"Calculates states, i.e. the number of true positive, false positive, true" +" negative and false negative samples." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.update:4 +msgid "" +"Tuple of predicted value and the ground truth label, with dtype float32 " +"or float64. Each shape is [batch_size, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.accumulate:3 +msgid "" +"Returns the accumulated metric, a tuple of shape (mcc,), `mcc` is the " +"accumulated mcc and its data type is float64." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman:1 +#, python-format +msgid "" +"The class calculates `Pearson correlation coefficient " +"`_ and " +"`Spearman's rank correlation coefficient " +"`_" +" ." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman:5 +msgid "String name of the metric instance. Defaults to 'pearson_and_spearman'." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.update:1 +msgid "" +"Ensures the type of preds and labels is numpy.ndarray and reshapes them " +"into [-1, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.update:4 +msgid "" +"Tuple or list of predicted value and the ground truth label. Its data " +"type should be float32 or float64 and its shape is [batch_size, d0, ..., " +"dN]." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:3 +msgid "" +"Returns the accumulated metric, a tuple of (pearson, spearman, " +"the_average_of_pearson_and_spearman). With the fields: - `pearson` " +"(numpy.float64): The accumulated pearson. - `spearman` " +"(numpy.float64): The accumulated spearman. - " +"`the_average_of_pearson_and_spearman` (numpy.float64): The average of" +" accumulated pearson and spearman correlation coefficient." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:3 +msgid "" +"Returns the accumulated metric, a tuple of (pearson, spearman, " +"the_average_of_pearson_and_spearman)." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:9 +msgid "`pearson` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:9 +msgid "The accumulated pearson." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:12 +msgid "`spearman` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:12 +msgid "The accumulated spearman." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:15 +msgid "`the_average_of_pearson_and_spearman` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:15 +msgid "The average of accumulated pearson and spearman correlation coefficient." 
+msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:1 +msgid "" +"This class encapsulates Accuracy, Precision, Recall and F1 metric logic " +"in multi-labels setting (also the binary setting). Some codes are taken " +"and modified from sklearn.metrics ." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:5 +msgid "The total number of labels which is usually the number of classes" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:7 +msgid "String name of the metric instance. Defaults to 'multi_labels_metric'." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:43 +msgid "" +"Note: When zero_division is encountered (details as followed), the " +"corresponding metrics will be set to 0.0" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:40 +msgid "" +"precision is zero_division if there are no positive predictions recall is" +" zero_division if there are no positive labels fscore is zero_division if" +" all labels AND predictions are negative" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.update:4 +msgid "the tuple returned from `compute` function" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:9 +msgid "Only report results for the class specified by pos_label." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:10 +msgid "" +"Calculate metrics globally by counting the total true positives, false " +"negatives and false positives." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:12 +msgid "" +"Calculate metrics for each label, and find their unweighted mean. This " +"does not take label imbalance into account." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:14 +msgid "" +"Calculate metrics for each label, and find their average weighted by " +"support (the number of true instances for each label). This alters " +"`macro` to account for label imbalance; it can result in an F-score that " +"is not between precision and recall." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:18 +msgid "" +"The positive label for calculating precision and recall in binary " +"settings. Noted: Only when `average='binary'`, this arguments will be " +"used. Otherwise, it will be ignored. Defaults to 1." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:24 +msgid "" +"The accumulated metric. A tuple of shape (precision, recall, f1) With" +" the fields: - `precision` (numpy.float64 or numpy.ndarray if " +"average=None): The accumulated precision. - `recall` " +"(numpy.float64 or numpy.ndarray if average=None): The accumulated" +" recall. - `f1` (numpy.float64 or numpy.ndarray if average=None):" +" The accumulated f1." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:33 +msgid "The accumulated metric. A tuple of shape (precision, recall, f1)" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:29 +msgid "`precision` (numpy.float64 or numpy.ndarray if average=None):" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:31 +msgid "`recall` (numpy.float64 or numpy.ndarray if average=None):" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:33 +msgid "`f1` (numpy.float64 or numpy.ndarray if average=None):" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.compute:4 +msgid "" +"Predicted tensor, and its dtype is float32 or float64, and has a shape of" +" [batch_size, *, num_labels]." 
+msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.compute:7 +msgid "" +"The ground truth tensor, and its dtype is int64, and has a shape of " +"[batch_size, *] or [batch_size, *, num_labels] in one hot representation." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.compute:12 +msgid "" +"it contains two Tensor of shape [*, 1]. The tuple should be passed to " +"`update` function." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.reset:1 +msgid "Reset states and result" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.perplexity.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.perplexity.po new file mode 100644 index 0000000000000000000000000000000000000000..227fa0e7a906c963765fc3dd6cdf7abc58ffe792 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.perplexity.po @@ -0,0 +1,148 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.perplexity.rst:2 +msgid "perplexity" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:1 +msgid "" +"Perplexity is a metric used to judge how good a language model is. We can" +" define perplexity as the inverse probability of the test set, normalised" +" by the number of the words in the test set. Perplexity is calculated " +"using cross entropy. It supports both padding data and no padding data." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:7 +msgid "" +"If data is not padded, users should provide `seq_len` for `Metric` " +"initialization. If data is padded, your label should contain `seq_mask`, " +"which indicates the actual length of samples." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:11 +msgid "" +"This Perplexity requires that the output of your network is prediction, " +"label and sequence length (optional). If the Perplexity here doesn't meet" +" your needs, you could override the `compute` or `update` method for " +"calculating Perplexity." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity +#: paddlenlp.metrics.perplexity.Perplexity.compute +#: paddlenlp.metrics.perplexity.Perplexity.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:16 +msgid "" +"Sequence length of each sample, it must be provided while data is not " +"padded. Defaults to 20." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:19 +msgid "Name of `Metric` instance. Defaults to 'Perplexity'." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:23 +msgid "示例" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:1 +msgid "Computes cross entropy loss." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:3 +msgid "" +"Predictor tensor, and its dtype is float32 or float64, and has a shape of" +" [batch_size, sequence_length, vocab_size]." 
+msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:6 +msgid "" +"Label tensor, and its dtype is int64, and has a shape of [batch_size, " +"sequence_length, 1] or [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:9 +msgid "" +"Sequence mask tensor, and its type could be float32, float64, int32 or " +"int64, and has a shape of [batch_size, sequence_length]. It's used to " +"calculate loss. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate +#: paddlenlp.metrics.perplexity.Perplexity.compute +#: paddlenlp.metrics.perplexity.Perplexity.name +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:14 +msgid "" +"Returns tuple (`ce, word_num`) if `seq_mask` is not None. Otherwise, " +"returns tensor `ce`. `ce` it the cross entropy loss, its shape is " +"[batch_size, sequence_length] and its data type should be float32." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate +#: paddlenlp.metrics.perplexity.Perplexity.compute +#: paddlenlp.metrics.perplexity.Perplexity.name +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.update:1 +msgid "Updates metric states." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.update:3 +msgid "" +"Cross entropy loss, it's calculated by `compute` and converted to " +"`numpy.ndarray`." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.update:6 +msgid "" +"The number of words of sequence, it's calculated by `compute` and " +"converted to `numpy.ndarray`. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.reset:1 +msgid "Resets all metric states." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate:1 +msgid "Calculates and returns the value of perplexity." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate:3 +msgid "Returns `perplexity`, the calculation results." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.name:1 +msgid "Returns name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.name:3 +msgid "The name of the metric instance." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.po new file mode 100644 index 0000000000000000000000000000000000000000..c9a3d201205f711bbf2a642a641ed454f261b426 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.rst:2 +msgid "paddlenlp.metrics" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.rouge.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.rouge.po new file mode 100644 index 0000000000000000000000000000000000000000..bb5f73841b34a8022249fea883ce13366baea858 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.rouge.po @@ -0,0 +1,172 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.rouge.rst:2 +msgid "rouge" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:1 +msgid "" +"Rouge-L is Recall-Oriented Understudy for Gisting Evaluation based on " +"Longest Common Subsequence (LCS). Longest common subsequence problem " +"takes into account sentence level structure similarity naturally and " +"identifies longest co-occurring in sequence n-grams automatically." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:6 +msgid "" +"R_{LCS} & = \\frac{LCS(C,S)}{len(S)}\n" +"\n" +"P_{LCS} & = \\frac{LCS(C,S)}{len(C)}\n" +"\n" +"F_{LCS} & = \\frac{(1 + \\gamma^2)R_{LCS}P_{LCS}}{R_{LCS}} + " +"\\gamma^2{R_{LCS}}" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:14 +msgid "where `C` is the candidate sentence, and `S` is the reference sentence." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL paddlenlp.metrics.rouge.RougeL.add_inst +#: paddlenlp.metrics.rouge.RougeL.lcs paddlenlp.metrics.rouge.RougeLForDuReader +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:16 +msgid "`trans_func` transforms the network output to string to calculate." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:19 +msgid "" +"Vocab for target language. If `trans_func` is None and RougeL is used as " +"`paddle.metric.Metric` instance, `default_trans_func` will be performed " +"and `vocab` must be provided." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:24 +msgid "A hyperparameter to decide the weight of recall. Defaults to 1.2." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:26 +msgid "Name of `paddle.metric.Metric` instance. Defaults to \"rouge-l\"." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:30 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:1 +msgid "Calculate the length of longest common subsequence of string and sub." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:3 +msgid "The string to be calculated, usually longer the sub string." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:5 +msgid "The sub string to be calculated." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:8 +msgid "Returns the length of the longest common subsequence of string and sub." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.add_inst:1 +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst:1 +msgid "Update the states based on the a pair of candidate and references." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.add_inst:3 +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst:3 +msgid "The candidate sentence generated by model." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.add_inst:5 +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst:5 +msgid "List of ground truth sentences." 
+msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.update:1 +msgid "Update states for metric" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.update:3 +msgid "" +"Inputs of :code:`update` is the outputs of :code:`Metric.compute`, if " +":code:`compute` is not defined, the inputs of :code:`update` will be " +"flatten arguments of **output** of mode and **label** from data: " +":code:`update(output1, output2, ..., label1, label2,...)`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.update:8 +msgid "see :code:`Metric.compute`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.accumulate:1 +msgid "Calculate the final rouge-l metric." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.reset:1 +msgid "Reset states and result" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.name:1 +msgid "Returns metric name" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:1 +msgid "基类::class:`paddlenlp.metrics.rouge.RougeL`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:1 +msgid "Rouge-L metric with bonus for DuReader contest." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:3 +msgid "" +"Please refer to `DuReader " +"Homepage`_ for " +"more details." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:5 +msgid "" +"Weight of YesNo dataset when adding bonus for DuReader contest. Defaults " +"to 1.0." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:7 +msgid "" +"Weight of Entity dataset when adding bonus for DuReader contest. Defaults" +" to 1.0." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.sighan.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.sighan.po new file mode 100644 index 0000000000000000000000000000000000000000..df04947c3ccd72c3a83f2e1ab4a4093ff9b49a21 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.sighan.po @@ -0,0 +1,74 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.sighan.rst:2 +msgid "sighan" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1.update:1 +#: paddlenlp.metrics.sighan.DetectionF1.update:1 +msgid "Update states for metric" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1.update:3 +#: paddlenlp.metrics.sighan.DetectionF1.update:3 +msgid "" +"Inputs of :code:`update` is the outputs of :code:`Metric.compute`, if " +":code:`compute` is not defined, the inputs of :code:`update` will be " +"flatten arguments of **output** of mode and **label** from data: " +":code:`update(output1, output2, ..., label1, label2,...)`" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1.update:8 +#: paddlenlp.metrics.sighan.DetectionF1.update:8 +msgid "see :code:`Metric.compute`" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.reset:1 +msgid "Resets all of the metric state." 
+msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.accumulate:1 +msgid "Accumulates statistics, computes and returns the metric value" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name:1 +msgid "Returns name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name:3 +msgid "The name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1:1 +msgid "基类::class:`paddlenlp.metrics.sighan.DetectionF1`" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.span.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.span.po new file mode 100644 index 0000000000000000000000000000000000000000..d6dc2614f71d075c2bfda5e06da57a1541639654 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.span.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.metrics.span.rst:2 +msgid "span" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.squad.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.squad.po new file mode 100644 index 0000000000000000000000000000000000000000..b1644f89ab8f94e8a586fa96b46abb4d6d667536 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.squad.po @@ -0,0 +1,125 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.metrics.squad.rst:2 +msgid "squad" +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:1 +msgid "" +"Post-processes the predictions of a question-answering model to convert " +"them to answers that are substrings of the original contexts. This is the" +" base postprocessing functions for models that only return start and end " +"logits." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction +#: paddlenlp.metrics.squad.squad_evaluate +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:6 +msgid "" +"List of raw squad-style data (see `run_squad.py " +"`__ for more " +"information)." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:11 +msgid "" +"List of processed squad-style features (see `run_squad.py " +"`__ for" +" more information)." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:16 +msgid "" +"The predictions of the model. Should be a tuple of two list containing " +"the start logits and the end logits." 
+msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:19 +msgid "Whether the dataset contains examples with no answers. Defaults to False." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:22 +msgid "The total number of candidate predictions to generate. Defaults to 20." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:25 +msgid "The maximum length of predicted answer. Defaults to 20." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:28 +msgid "" +"The threshold used to select the null answer. Only useful when " +"`version_2_with_negative` is True. Defaults to 0.0." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:33 +msgid "" +"A tuple of three dictionaries containing final selected answer, all " +"n_best answers along with their probability and scores, and the " +"score_diff of each example." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:1 +msgid "" +"Computes and prints the f1 score and em score of input prediction. :param" +" examples: List of raw squad-style data (see `run_squad.py" +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:3 +msgid "" +"`__ for more " +"information)." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:7 +msgid "" +"Dictionary of final predictions. Usually generated by " +"`compute_prediction`." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:10 +msgid "" +"Dictionary of score_diffs of each example. Used to decide if answer exits" +" and compute best score_diff threshold of null. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:14 +msgid "The threshold used to select the null answer. Defaults to 1.0." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:17 +msgid "" +"Whether the predictions and references can be tokenized by whitespace. " +"Usually set True for English and False for Chinese. Defaults to True." +msgstr "" + +#~ msgid "Computes and prints the f1 score and em score of input prediction." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..00d80ab061109471738d718f6f33dfe0d18ad807 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.utils.rst:2 +msgid "utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.parallel.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.parallel.po new file mode 100644 index 0000000000000000000000000000000000000000..ecc268adbae37e43a2ff56a20b0fa4469323d82b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.parallel.po @@ -0,0 +1,138 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.parallel.rst:2 +msgid "parallel" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:1 +msgid "Parallel Embedding." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner +#: paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:3 +msgid "" +"The size of embedding dictionary which dictates the maximum value of the " +"input id." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:5 +msgid "The dimensions of each embedding vector." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:7 +msgid "" +"The rank of the current part, which determines the start index of the " +"vocab." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:9 +msgid "The number of trainers." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:11 +msgid "" +"Specify the weight parameter property, including the initialization " +"method. Defaults to None which means the default weight parameter " +"property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:15 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding:14 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:15 +msgid "Normally there is no need for user to set this property. Defaults to None." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:1 +msgid "" +"A Tensor contains the id information. Its data type should be int32 or " +"int64, and the value of the input id should be in [0, weight.shape[0]] ." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:4 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:4 +msgid "Returns the embedding Tensor mapped by x." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:1 +msgid "Parallel Linear, axis=1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:3 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:3 +msgid "The size of embedding vector." 
+msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:5 +msgid "The number of parts within a model parallel group. Defaults to 1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:7 +msgid "Whether to gather the output tensor. Defaults to True." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:9 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:9 +msgid "" +"Specify the parameter property, including the initialization method. " +"Defaults to None which means the default parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:12 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:12 +msgid "" +"Specify the bias property. Defaults to None which means the default " +"parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:1 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:1 +msgid "The input tensor. Its data type can be int or float." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:1 +msgid "Parallel Linear, axis=0." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:7 +msgid "Whether the input is parallel. Defaults to `False`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.po new file mode 100644 index 0000000000000000000000000000000000000000..319dd56ee7c865b53a6848f001ea465fd00d3f0b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.po @@ -0,0 +1,138 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.rst:2 +msgid "distributed" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:1 +msgid "Parallel Embedding." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner +#: paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:3 +msgid "" +"The size of embedding dictionary which dictates the maximum value of the " +"input id." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:5 +msgid "The dimensions of each embedding vector." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:7 +msgid "" +"The rank of the current part, which determines the start index of the " +"vocab." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:9 +msgid "The number of trainers." 
+msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:11 +msgid "" +"Specify the weight parameter property, including the initialization " +"method. Defaults to None which means the default weight parameter " +"property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:15 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding:14 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:15 +msgid "Normally there is no need for user to set this property. Defaults to None." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:1 +msgid "" +"A Tensor contains the id information. Its data type should be int32 or " +"int64, and the value of the input id should be in [0, weight.shape[0]] ." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:4 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:4 +msgid "Returns the embedding Tensor mapped by x." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:1 +msgid "Parallel Linear, axis=1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:3 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:3 +msgid "The size of embedding vector." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:5 +msgid "The number of parts within a model parallel group. Defaults to 1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:7 +msgid "Whether to gather the output tensor. Defaults to True." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:9 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:9 +msgid "" +"Specify the parameter property, including the initialization method. " +"Defaults to None which means the default parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:12 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:12 +msgid "" +"Specify the bias property. Defaults to None which means the default " +"parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:1 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:1 +msgid "The input tensor. Its data type can be int or float." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:1 +msgid "Parallel Linear, axis=0." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:7 +msgid "Whether the input is parallel. Defaults to `False`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..a4a06c183ccc07dcbc52c752b2df64c8550dab3b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.utils.rst:2 +msgid "utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.random.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.random.po new file mode 100644 index 0000000000000000000000000000000000000000..532e33cfade4b61e76cf7f72bf125ea690c36079 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.random.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.utils.random.rst:2 +msgid "random" +msgstr "" + +#: of paddlenlp.ops.distributed.utils.random.RNGStatesTracker:1 +msgid "Tracker the RNG states." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.topo.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.topo.po new file mode 100644 index 0000000000000000000000000000000000000000..a8764d182b1752bfdf3268381a000ccfd97a807b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.topo.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.utils.topo.rst:2 +msgid "topo" +msgstr "" + +#: ../docstring of paddlenlp.ops.distributed.utils.topo.GroupInfo.rank:1 +msgid "Alias for field number 1" +msgstr "" + +#: ../docstring of paddlenlp.ops.distributed.utils.topo.GroupInfo.size:1 +msgid "Alias for field number 0" +msgstr "" + +#: ../docstring of paddlenlp.ops.distributed.utils.topo.GroupInfo.world:1 +msgid "Alias for field number 2" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.einsum.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.einsum.po new file mode 100644 index 0000000000000000000000000000000000000000..d1747337f5eeef0b0a359b599fa83c2662e9d5a3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.einsum.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.einsum.rst:2 +msgid "einsum" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:1 +msgid "" +"Executes the sum of product of provided operands based on the Einstein " +"summation convention. Einsum can be used to complete a variety of " +"operations, such as sum, transpose, batch matrix multiplication." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:5 +msgid "" +"Uses uncased letters to specify the dimension of the operands and result." +" The input equation is on the left hand before `->` while the output " +"equation is on the right side. Einsum can infer the result shape so that " +"the `->` and the result label letters can be omitted. Operands in the " +"input equation are splited by commas (','), e.g. 'abc,cde' describes two " +"3D operands. The dimensions labeled with same letter should be same or be" +" 1. Ellipsis ('...') can be used to specify the broadcast dimensions." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:12 +msgid "" +"The operands to compute the Einstein sum of. The number of operands " +"should be the same as the operands described in input equation." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:16 +msgid "The result of Einstein sum product." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:17 +msgid "`Tensor`" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:20 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.ext_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.ext_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..5a3ca41484a44b9c309bdaa423ecef62e89dbec9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.ext_utils.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.ops.ext_utils.rst:2 +msgid "ext\\_utils" +msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension:1 +msgid "基类::class:`setuptools.extension.Extension`" +msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension.build_with_command:1 +#: paddlenlp.ops.ext_utils.FasterTransformerExtension.build_with_command:1 +msgid "" +"Custom `build_ext.build_extension` in `Extension` instead of `Command`. " +"`ext_builder` is the instance of `build_ext` command." 
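The `paddlenlp.ops.einsum.einsum` entries above describe an equation-string interface following the Einstein summation convention, with comma-separated operand labels and an optional `->` output specification. A minimal usage sketch, assuming `einsum` is re-exported from `paddlenlp.ops` (the import path and the example tensors here are assumptions, not taken from the patch):

```python
import paddle
from paddlenlp.ops import einsum  # assumed re-export of paddlenlp.ops.einsum.einsum

x = paddle.randn([4, 3])
y = paddle.randn([3, 5])

# 'ij,jk->ik' contracts the shared 'j' label: a plain matrix multiplication.
out = einsum("ij,jk->ik", x, y)   # shape [4, 5]

# Explicit output labels also express reductions, e.g. summing over the last axis.
row_sums = einsum("ij->i", x)     # shape [4]
```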
+msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension.get_target_filename:1 +#: paddlenlp.ops.ext_utils.FasterTransformerExtension.get_target_filename:1 +msgid "" +"The file names of libraries. Currently only support one library for one " +"extension." +msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension.get_output_filename:1 +#: paddlenlp.ops.ext_utils.FasterTransformerExtension.get_output_filename:1 +msgid "" +"The file names of outputs, which mostly is the same with " +"`get_target_filename`." +msgstr "" + +#: of paddlenlp.ops.ext_utils.FasterTransformerExtension:1 +msgid "基类::class:`paddlenlp.ops.ext_utils.CMakeExtension`" +msgstr "" + +#: of paddlenlp.ops.ext_utils.BuildExtension:1 +msgid "基类::class:`paddle.utils.cpp_extension.cpp_extension.BuildExtension`" +msgstr "" + +#: of paddlenlp.ops.ext_utils.BuildExtension:1 +msgid "Support both `CppExtention` of Paddle and custom extensions of PaddleNLP." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..f9aff194c4ab68184fa8b0037d57730f1dca145a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.rst:2 +msgid "fast\\_transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoder.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoder.po new file mode 100644 index 0000000000000000000000000000000000000000..bec6fac5c86d068581ba88d6d4ecb531d705e2b0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoder.po @@ -0,0 +1,135 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.decoder.rst:2 +msgid "decoder" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:1 +msgid "FasterTransformer decoder block." 
+msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:3 +msgid "Transformer decoder block." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:13 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:5 +msgid "The number of head used in multi-head attention." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:7 +msgid "The size of per head used in multi-head attention." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:31 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:9 +msgid "The path to decoder_lib. Default to None." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:33 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:11 +msgid "Whether to use fp16 for decoder. Default to False." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:1 +msgid "FasterTransformer decoder for auto-regressive generation." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:3 +msgid "The size of source vocabulary." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:5 +msgid "The size of target vocabulary." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:7 +msgid "The maximum length of input sequences." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:9 +msgid "The number of sub-layers to be stacked in the encoder." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:11 +msgid "The number of sub-layers to be stacked in the decoder." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:15 +msgid "" +"The dimension for word embeddings, which is also the last dimension of " +"the input and output of multi-head attention, position-wise feed-forward " +"networks, encoder and decoder." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:19 +msgid "Size of the hidden layer in position-wise feed-forward networks." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:21 +msgid "Dropout rates. Used for pre-process, activation and inside attention." 
+msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:23 +msgid "Whether to use weight sharing." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:25 +msgid "The start token id and also is used as padding id. Defaults to 0." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:27 +msgid "The end token id. Defaults to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:29 +msgid "The maximum output length. Defaults to 256." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoding.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoding.po new file mode 100644 index 0000000000000000000000000000000000000000..c1001f5c6fb59d222f7ccd1e1ce97fb515318771 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoding.po @@ -0,0 +1,304 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.decoding.rst:2 +msgid "decoding" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:1 +msgid "" +"Convert parameters included in Transformer layer (`nn.TransformerEncoder`" +" and `gpt.modeling.TransformerDecoder`) from original models to the " +"format of faster models." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.set_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.convert_params +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:5 +msgid "The faster model object." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:7 +msgid "" +"The Transformer layer. It can be an instance of `nn.TransformerEncoder` " +"or `gpt.modeling.TransformerDecoder` currently, and " +"`nn.TransformerDecoder` would be supported soon." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:11 +msgid "" +"0 for nofuse, 1 for fuse, 2 for fuse and delete the unfused parameters. 
" +"If environment variable `PPFG_QKV_MEM_OPT` is set and the weights of " +"q/k/v is fused, it will try to delete the original unfused weights. Note " +"the rollback to original model would not be guarantee anymore when the " +"faster model failed if the original weights are deleted. Default to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:18 +msgid "" +"Whether to use float16. Maybe we should use the default dtype as the " +"highest priority later. Default to `False`." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:21 +msgid "" +"If `False`, need to reload the weight values. It should be `True` for " +"weight loaded models. Default to `False`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight +#: paddlenlp.ops.fast_transformer.transformer.decoding.convert_params +#: paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:25 +msgid "" +"Each value is a list including converted parameters in all layers. " +"For other parameters not included in Transformer module to be " +"converted, such as embeddings, you can achieve it by using the " +"returned dict `params` though `params['word_emb'].append()` directly " +"which would do CPU/GPU and fp32/fp16 transfer automatically." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:30 +msgid "Each value is a list including converted parameters in all" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:28 +msgid "" +"layers. For other parameters not included in Transformer module to be " +"converted, such as embeddings, you can achieve it by using the returned " +"dict `params` though `params['word_emb'].append()` directly which would " +"do CPU/GPU and fp32/fp16 transfer automatically." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight +#: paddlenlp.ops.fast_transformer.transformer.decoding.convert_params +#: paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:1 +msgid "" +"Configurations for model parallel in FasterTransformer. Currently only " +"support GPT. Please refer to `Megatron " +"`__ for details." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:5 +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:5 +msgid "" +"The size for tensor parallel. If it is 1, tensor parallel would not be " +"used. Default to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:8 +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:8 +msgid "" +"The size for layer parallel. If it is 1, layer parallel would not be " +"used. Default to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:11 +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:11 +msgid "" +"The local batch size for pipeline parallel. It is suggested to use " +"`batch_size // layer_para_size`. Default to 1." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_last_group:1 +msgid "" +"For layer parallel, only the process corresponding to the last layer " +"group can get the predict results. It is used to check whether this is " +"the process corresponding to the last layer group." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:1 +msgid "" +"Whether or not the given transformer layer of should be loaded to the " +"current parallel model. For layer parallel, there is no need not to load " +"other layer groups." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:5 +msgid "The index of Transformer layer." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:7 +msgid "The number of Transformer layers." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:10 +msgid "" +"Indicate whether or not the given transformer layer of should be " +"loaded to the current parallel model." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:12 +msgid "Indicate whether or not the given transformer layer of should" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:13 +msgid "be loaded to the current parallel model." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:1 +msgid "Get the weight slice for tensor parallel." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:3 +msgid "The weight or bias to be sliced." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:5 +msgid "The axis to perform slice." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:7 +msgid "" +"0 is used for creating partial model when initializing and " +"`from_pretrained`. While 1 is used in converting parameters to " +"FasterTransformer. No slice would be performed if it is 1, since " +"parameters have been sliced in `phase=0`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:12 +msgid "" +"If true, `weight` should be a Parameter and force the output to be a " +"Parameter." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:16 +msgid "The sliced weight." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.set_partial_model:1 +msgid "" +"This is used to set whether or not the current model has complete " +"parameters." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.set_partial_model:4 +msgid "" +"It is used to set whether or not the current model has complete " +"parameters." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:1 +msgid "" +"Slice every values included in `state_to_load` according to the shape of " +"corresponding parameters in `model`. This is used in `from_pratrained` to" +" get sliced parameter values." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:5 +msgid "The model to use." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:7 +msgid "The state dict including complete parameter values of model." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:11 +msgid "The state dict contains adjusted values." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf:1 +msgid "Get settings for model parallel." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf:3 +msgid "The settings for model parallel." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:1 +msgid "" +"Enable model parallel with the given settings in FasterTransformer. " +"Currently only support GPT. Please refer to `Megatron " +"`__ for details." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.encoder.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.encoder.po new file mode 100644 index 0000000000000000000000000000000000000000..6f0bc28d10fb015a24e26cf3c30be9bbb38a2245 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.encoder.po @@ -0,0 +1,178 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.encoder.rst:2 +msgid "encoder" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.infer_transformer_encoder:1 +msgid "" +"Fusion Encoder API intergrating Encoder inference in FasterTransformer. " +"It accepts the weight and bias of TransformerEncoder and some other " +"parameters for inference." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:1 +msgid "" +"Redefines `forward` function of `paddle.nn.TransformerEncoderLayer` for " +"integrating FasterTransformer for inference." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:4 +msgid "" +"The original `forward` function would not be replaced unless " +"`enable_fast_encoder` is called by objects of its base class. After " +"replacing, objects of `paddle.nn.TransformerEncoderLayer` also have the " +"same member variables as before." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:9 +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:9 +msgid "" +"After inference, `disable_fast_encoder` could be called to restore the " +"`forward` function of `paddle.nn.TransformerEncoder` and " +"`paddle.nn.TransformerEncoderLayer`." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.convert_to_fp16 +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:13 +msgid "" +"The input of Transformer encoder layer. It is a tensor with shape " +"`[batch_size, sequence_length, d_model]`. The data type should be float32" +" or float64." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:17 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape `[batch_size, 1, 1, sequence_length]`. When the " +"data type is bool, the unwanted positions have `False` values and the " +"others have `True` values. When the data type is int, the unwanted " +"positions have 0 values and the others have 1 values. When the data type " +"is float, the unwanted positions have `-INF` values and the others have 0" +" values. It can be None when nothing wanted or needed to be prevented " +"attention to. Defaults to None." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:28 +msgid "" +"It is a tensor that has the same shape and data type as `enc_input`, " +"representing the output of Transformer encoder layer. 
Or a tuple if " +"`cache` is not None, except for encoder layer output, the tuple includes " +"the new cache which is same as input `cache` argument but " +"`incremental_cache` has an incremental length. See " +"`paddle.nn.MultiHeadAttention.gen_cache` and " +"`paddle.nn.MultiHeadAttention.forward` for more details." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:1 +msgid "" +"Redefines `forward` function of `paddle.nn.TransformerEncoder` for " +"integrating FasterTransformer for inference." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:4 +msgid "" +"The original `forward` function would not be replaced unless " +"`enable_fast_encoder` is called by objects of its base class. After " +"replacing, objects of `paddle.nn.TransformerEncoder` also have the same " +"member variables as before." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:13 +msgid "" +"The input of Transformer encoder. It is a tensor with shape `[batch_size," +" sequence_length, d_model]`. The data type should be float32 or float16." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:17 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape `[batch_size, 1, 1, sequence_length]`. The data " +"type must be float, the unwanted positions have `-INF` values or other " +"non-zeros and the wanted positions must be 0.0." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:24 +msgid "" +"It is a tensor that has the same shape and data type as `src`, " +"representing the output of Transformer encoder. Or a tuple if `cache` is " +"not None, except for encoder output, the tuple includes the new cache " +"which is same as input `cache` argument but `incremental_cache` in it has" +" an incremental length. See `paddle.nn.MultiHeadAttention.gen_cache` and " +"`paddle.nn.MultiHeadAttention.forward` for more details." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.enable_fast_encoder:1 +msgid "" +"Compiles fusion encoder operator intergrated FasterTransformer using the " +"method of JIT(Just-In-Time) and replaces the `forward` function of " +"`paddle.nn.TransformerEncoder` and `paddle.nn.TransformerEncoderLayer` " +"objects inherited from `self` to support inference using " +"FasterTransformer." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.disable_fast_encoder:5 +#: paddlenlp.ops.fast_transformer.transformer.encoder.enable_fast_encoder:7 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.disable_fast_encoder:1 +msgid "" +"Restores the original `forward` function of " +"`paddle.nn.TransformerEncoder` and `paddle.nn.TransformerEncoderLayer` " +"objects inherited from `self`." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.convert_to_fp16:1 +msgid "Convert paddle.nn.TransformerEncoder's parameter from float32 to float16" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.convert_to_fp16:3 +msgid "" +"The object to be converted to float16 inplaced, it must be an isinstance " +"of paddle.nn.TransformerEncoder." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..9955784239be694fea07c91b9d60ccc0739262ba --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.po @@ -0,0 +1,523 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst:2 +msgid "fast\\_transformer" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:1 +msgid "基类::class:`paddlenlp.transformers.transformer.modeling.TransformerModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:1 +msgid "" +"FastGeneration is a fast version for generation with the Transformer" +" model. It uses a custom op based on and enhancing NV FasterTransformer " +"to do fast generation." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:5 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:5 +msgid "The size of source vocabulary." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:7 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:7 +msgid "The size of target vocabulary." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:9 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:9 +msgid "The maximum length of input sequences." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:11 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:11 +msgid "The number of sub-layers to be stacked in the encoder." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:13 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:13 +msgid "The number of sub-layers to be stacked in the decoder." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:15 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:15 +msgid "The number of head used in multi-head attention." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:17 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:17 +msgid "" +"The dimension for word embeddings, which is also the last dimension of " +"the input and output of multi-head attention, position-wise feed-forward " +"networks, encoder and decoder." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:21 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:21 +msgid "Size of the hidden layer in position-wise feed-forward networks." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:23 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:23 +msgid "Dropout rates. Used for pre-process, activation and inside attention." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:25 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:25 +msgid "Whether to use weight sharing." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:27 +msgid "" +"The dropout probability used in MHA to drop some attention target. If " +"None, use the value of dropout. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:30 +msgid "" +"The dropout probability used after FFN activition. If None, use the value" +" of dropout. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:33 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:27 +msgid "The start token id and also is used as padding id. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:35 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:29 +msgid "The end token id. Defaults to 1." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:37 +msgid "" +"Indicating the strategy of decoding. It can be 'beam_search', " +"'beam_search_v2', 'topk_sampling' and 'topp_sampling'. For beam search " +"strategies, 'v2' would select the top `beam_size * 2` beams and process " +"the top `beam_size` alive and finish beams in them separately, while 'v1'" +" would only select the top `beam_size` beams and mix up the alive and " +"finish beams. 'v2' always searchs more and get better results, since the " +"alive beams would always be `beam_size` while the number of alive beams " +"in `v1` might decrease when meeting the end token. However, 'v2' always " +"generates longer results thus might do more calculation and be slower." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:48 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:31 +msgid "The beam width for beam search. Defaults to 4." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:50 +msgid "" +"The number of highest probability tokens to keep for top-k sampling. " +"Defaults to 4." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:53 +msgid "" +"The most probable tokens whose cumulative probability is not less than " +"`topp` are kept for top-p sampling. Defaults to 4." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:56 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:33 +msgid "The maximum output length. Defaults to 256." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:58 +msgid "" +"Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation" +" `_ for details. Bigger " +"`diversity_rate` would lead to more diversity. if `diversity_rate == 0` " +"is equivalent to naive BeamSearch. Default to 0 if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:63 +msgid "Whether to use fp16 for decoding." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:65 +msgid "" +"Whether to use the fast version of encoder. This is experimental option" +" for now. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:68 +msgid "" +"Whether to use fp16 for encoder. Only works when enable_fast_encoder is" +" True. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:71 +msgid "" +"Indicating whether `max_out_len` in is the length relative to that of " +"source text. Only works in `v2` temporarily. It is suggest to set a small" +" `max_out_len` and use `rel_len=True`. Default to False if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:76 +msgid "" +"The power number in length penalty calculation. Only works in `v2` " +"temporarily. Refer to `GNMT `_. " +"Default to 0.6 if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:1 +msgid "" +"The Transformer forward methods. The input are source/target sequences, " +"and returns logits." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:4 +msgid "" +"The ids of source sequences words. It is a tensor with shape " +"`[batch_size, source_sequence_length]` and its data type can be int or " +"int64." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:8 +msgid "" +"The ids of target sequences words. It is a tensor with shape " +"`[batch_size, target_sequence_length]` and its data type can be int or " +"int64." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:13 +msgid "" +"Output tensor of the final layer of the model whose data type can be " +"float32 or float64 with shape `[batch_size, sequence_length, " +"vocab_size]`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:10 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:19 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:21 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:1 +msgid "" +"This method is used for load static graph from dygraph checkpoint or " +"export inference model using static graph." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:4 +msgid "The path to dygraph checkpoint." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:6 +msgid "The place to execute static graph." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:1 +msgid "" +"The Transformer model for auto-regressive generation with beam search. It" +" wraps `FasterTransformer` and `InferTransformerModel`, and automatically" +" chioces using `FasterTransformer` (with jit building) or the slower " +"verison `InferTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:35 +msgid "" +"The key word arguments can be `output_time_major`, `use_ft`, " +"`use_fp16_decoding`, `rel_len`, `alpha`: - `output_time_major(bool, " +"optional)`: Indicate the data layout of predicted Tensor. If `False`, the" +" data layout would be batch major with shape `[batch_size, seq_len, " +"beam_size]`. If `True`, the data layout would be time major with shape " +"`[seq_len, batch_size, beam_size]`. Default to `False`. - `use_ft(bool, " +"optional)`: Whether to use FastGeneration for decoding. Default to " +"True if not set. - `use_fp16_decoding(bool, optional)`: Whether to use " +"fp16 for decoding. Only works when using FastGeneration. - " +"`beam_search_version(str, optional)`: Indicating the strategy of beam " +"search. It can be 'v1' or 'v2'. 'v2' would select the top `beam_size * 2`" +" beams and process the top `beam_size` alive and finish beams in them " +"separately, while 'v1' would only select the top `beam_size` beams and " +"mix up the alive and finish beams. 'v2' always searchs more and get " +"better results, since the alive beams would always be `beam_size` while " +"the number of alive beams in `v1` might decrease when meeting the end " +"token. However, 'v2' always generates longer results thus might do more " +"calculation and be slower. 
- `rel_len(bool, optional)`: Indicating " +"whether `max_out_len` in is the length relative to that of source text. " +"Only works in `v2` temporarily. It is suggest to set a small " +"`max_out_len` and use `rel_len=True`. Default to False if not set. - " +"`alpha(float, optional)`: The power number in length penalty calculation." +" Refer to `GNMT `_. Only works in " +"`v2` temporarily. Default to 0.6 if not set. - diversity_rate(float, " +"optional): Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural" +" Generation `_ for details. Bigger " +"`diversity_rate` would lead to more diversity. if `diversity_rate == 0` " +"is equivalent to naive BeamSearch. Default to 0 if not set. **NOTE**: " +"Only works when using FastGeneration temporarily." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:35 +msgid "" +"The key word arguments can be `output_time_major`, `use_ft`, " +"`use_fp16_decoding`, `rel_len`, `alpha`:" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:38 +msgid "`output_time_major(bool, optional)`: Indicate the data layout of predicted" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:39 +msgid "" +"Tensor. If `False`, the data layout would be batch major with shape " +"`[batch_size, seq_len, beam_size]`. If `True`, the data layout would be " +"time major with shape `[seq_len, batch_size, beam_size]`. Default to " +"`False`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:44 +msgid "`use_ft(bool, optional)`: Whether to use FastGeneration" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:45 +msgid "for decoding. Default to True if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:47 +msgid "`use_fp16_decoding(bool, optional)`: Whether to use fp16" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:48 +msgid "for decoding. Only works when using FastGeneration." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:50 +msgid "`beam_search_version(str, optional)`: Indicating the strategy of" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:51 +msgid "" +"beam search. It can be 'v1' or 'v2'. 'v2' would select the top `beam_size" +" * 2` beams and process the top `beam_size` alive and finish beams in " +"them separately, while 'v1' would only select the top `beam_size` beams " +"and mix up the alive and finish beams. 'v2' always searchs more and get " +"better results, since the alive beams would always be `beam_size` while " +"the number of alive beams in `v1` might decrease when meeting the end " +"token. However, 'v2' always generates longer results thus might do more " +"calculation and be slower." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:60 +msgid "`rel_len(bool, optional)`: Indicating whether `max_out_len` in is" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:61 +msgid "" +"the length relative to that of source text. Only works in `v2` " +"temporarily. It is suggest to set a small `max_out_len` and use " +"`rel_len=True`. Default to False if not set." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:65 +msgid "`alpha(float, optional)`: The power number in length penalty" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:66 +msgid "" +"calculation. Refer to `GNMT `_. " +"Only works in `v2` temporarily. Default to 0.6 if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:69 +msgid "diversity_rate(float, optional): Refer to `A Simple, Fast Diverse" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:70 +msgid "" +"Decoding Algorithm for Neural Generation " +"`_ for details. Bigger `diversity_rate`" +" would lead to more diversity. if `diversity_rate == 0` is equivalent to " +"naive BeamSearch. Default to 0 if not set. **NOTE**: Only works when " +"using FastGeneration temporarily." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:1 +msgid "Performs decoding for transformer model." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:3 +msgid "" +"The ids of source sequence words. It is a tensor with shape `[batch_size," +" source_sequence_length]` and its data type can be int or int64." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:7 +msgid "" +"The ids of target sequence words. Normally, it should NOT be given. If " +"it's given, force decoding with previous output token will be trigger. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:12 +msgid "" +"An int64 tensor shaped indicating the predicted ids. Its shape is " +"`[batch_size, seq_len, beam_size]` or `[seq_len, batch_size, beam_size]` " +"according to `output_time_major`. While, when using FastGeneration and" +" beam search v2, the beam dimension would be doubled to include both the " +"top `beam_size` alive and finish beams, thus the tensor shape is " +"`[batch_size, seq_len, beam_size * 2]` or `[seq_len, batch_size, " +"beam_size * 2]`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT:1 +msgid "基类::class:`paddlenlp.transformers.gpt.modeling.GPTPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer:1 +msgid "基类::class:`paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText:1 +msgid "基类::class:`paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART:1 +msgid "基类::class:`paddlenlp.transformers.bart.modeling.BartPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART:1 +msgid "基类::class:`paddlenlp.transformers.mbart.modeling.MBartPretrainedModel`" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..b86c4d9321a5dc4038698c895039b8e3dca3e07a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.rst:2 +msgid "transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.AdamwOptimizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.AdamwOptimizer.po new file mode 100644 index 0000000000000000000000000000000000000000..f52a3d7f4044c127d38ec9b8437f9772038543dc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.AdamwOptimizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.AdamwOptimizer.rst:2 +msgid "AdamwOptimizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamw.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamw.po new file mode 100644 index 0000000000000000000000000000000000000000..16b4edae7c465a9543a0ba98c702573119192c34 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamw.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.adamw.rst:2 +msgid "adamw" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamwdl.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamwdl.po new file mode 100644 index 0000000000000000000000000000000000000000..c2a5f6042d0508ad1e0e69bf2387f36c288c1fe0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamwdl.po @@ -0,0 +1,168 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.adamwdl.rst:2 +msgid "adamwdl" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "基类::class:`paddle.optimizer.adamw.AdamW`" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "" +"The AdamWDL optimizer is implemented based on the AdamW Optimization with" +" dynamic lr setting. Generally it's used for transformer model." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:4 +msgid "" +"We use \"layerwise_lr_decay\" as default dynamic lr setting method of " +"AdamWDL. “Layer-wise decay” means exponentially decaying the learning " +"rates of individual layers in a top-down manner. For example, suppose the" +" 24-th layer uses a learning rate l, and the Layer-wise decay rate is α, " +"then the learning rate of layer m is lα^(24-m). See more details on: " +"https://arxiv.org/abs/1906.08237." 
+msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:10 +msgid "" +"& t = t + 1\n" +"\n" +"& moment\\_1\\_out = {\\beta}_1 * moment\\_1 + (1 - {\\beta}_1) * grad\n" +"\n" +"& moment\\_2\\_out = {\\beta}_2 * moment\\_2 + (1 - {\\beta}_2) * grad * " +"grad\n" +"\n" +"& learning\\_rate = learning\\_rate * \\frac{\\sqrt{1 - {\\beta}_2^t}}{1 " +"- {\\beta}_1^t}\n" +"\n" +"& param\\_out = param - learning\\_rate * " +"(\\frac{moment\\_1}{\\sqrt{moment\\_2} + \\epsilon} + \\lambda * param)" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:21 +msgid "" +"The learning rate used to update ``Parameter``. It can be a float value " +"or a LRScheduler. The default value is 0.001." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:24 +msgid "" +"The exponential decay rate for the 1st moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.9." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:28 +msgid "" +"The exponential decay rate for the 2nd moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.999." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:32 +msgid "" +"A small float value for numerical stability. It should be a float number " +"or a Tensor with shape [1] and data type as float32. The default value is" +" 1e-08." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:36 +msgid "" +"List/Tuple of ``Tensor`` to update to minimize ``loss``. \\ This " +"parameter is required in dygraph mode. \\ The default value is None in " +"static mode, at this time all parameters will be updated." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:40 +msgid "" +"The weight decay coefficient, it can be float or Tensor. The default " +"value is 0.01." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:42 +msgid "" +"If it is not None, only tensors that makes " +"apply_decay_param_fun(Tensor.name)==True will be updated. It only works " +"when we want to specify tensors. Default: None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:47 +msgid "" +"Gradient cliping strategy, it's an instance of some derived class of " +"``GradientClipBase`` . There are three cliping strategies ( " +":ref:`api_fluid_clip_GradientClipByGlobalNorm` , " +":ref:`api_fluid_clip_GradientClipByNorm` , " +":ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there " +"is no gradient clipping." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:52 +msgid "" +"The official Adam algorithm has two moving-average accumulators. The " +"accumulators are updated at every step. Every element of the two moving-" +"average is updated in both dense mode and sparse mode. If the size of " +"parameter is very large, then the update may be very slow. The lazy mode " +"only update the element that has gradient in current mini-batch, so it " +"will be much more faster. But this mode has different semantics with the " +"original Adam algorithm and may lead to different result. The default " +"value is False." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:60 +msgid "Whether to use multi-precision during weight updating. Default is false." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:62 +msgid "The layer-wise decay ratio. Defaults to 1.0." 
+msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:64 +msgid "The total number of encoder layers. Defaults to 12." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:66 +msgid "" +"If it's not None, set_param_lr_fun() will set the parameter learning " +"rate before it executes Adam Operator. Defaults to " +":ref:`layerwise_lr_decay`." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:69 +msgid "" +"The keys of name_dict is dynamic name of model while the value of " +"name_dict is static name. Use model.named_parameters() to get name_dict." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:72 +msgid "" +"Normally there is no need for user to set this property. For more " +"information, please refer to :ref:`api_guide_Name`. The default value is " +"None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:78 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.ema.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.ema.po new file mode 100644 index 0000000000000000000000000000000000000000..174756237d0497df58ef8b2406c3716e27e5b636 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.ema.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.ops.optimizer.ema.rst:2 +msgid "ema" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.po new file mode 100644 index 0000000000000000000000000000000000000000..2ff6d35cf192d6d7737432bbb342a5b9a6612080 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.po @@ -0,0 +1,168 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.rst:2 +msgid "optimizer" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "基类::class:`paddle.optimizer.adamw.AdamW`" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "" +"The AdamWDL optimizer is implemented based on the AdamW Optimization with" +" dynamic lr setting. Generally it's used for transformer model." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:4 +msgid "" +"We use \"layerwise_lr_decay\" as default dynamic lr setting method of " +"AdamWDL. “Layer-wise decay” means exponentially decaying the learning " +"rates of individual layers in a top-down manner. 
For example, suppose the" +" 24-th layer uses a learning rate l, and the Layer-wise decay rate is α, " +"then the learning rate of layer m is lα^(24-m). See more details on: " +"https://arxiv.org/abs/1906.08237." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:10 +msgid "" +"& t = t + 1\n" +"\n" +"& moment\\_1\\_out = {\\beta}_1 * moment\\_1 + (1 - {\\beta}_1) * grad\n" +"\n" +"& moment\\_2\\_out = {\\beta}_2 * moment\\_2 + (1 - {\\beta}_2) * grad * " +"grad\n" +"\n" +"& learning\\_rate = learning\\_rate * \\frac{\\sqrt{1 - {\\beta}_2^t}}{1 " +"- {\\beta}_1^t}\n" +"\n" +"& param\\_out = param - learning\\_rate * " +"(\\frac{moment\\_1}{\\sqrt{moment\\_2} + \\epsilon} + \\lambda * param)" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:21 +msgid "" +"The learning rate used to update ``Parameter``. It can be a float value " +"or a LRScheduler. The default value is 0.001." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:24 +msgid "" +"The exponential decay rate for the 1st moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.9." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:28 +msgid "" +"The exponential decay rate for the 2nd moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.999." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:32 +msgid "" +"A small float value for numerical stability. It should be a float number " +"or a Tensor with shape [1] and data type as float32. The default value is" +" 1e-08." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:36 +msgid "" +"List/Tuple of ``Tensor`` to update to minimize ``loss``. \\ This " +"parameter is required in dygraph mode. \\ The default value is None in " +"static mode, at this time all parameters will be updated." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:40 +msgid "" +"The weight decay coefficient, it can be float or Tensor. The default " +"value is 0.01." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:42 +msgid "" +"If it is not None, only tensors that makes " +"apply_decay_param_fun(Tensor.name)==True will be updated. It only works " +"when we want to specify tensors. Default: None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:47 +msgid "" +"Gradient cliping strategy, it's an instance of some derived class of " +"``GradientClipBase`` . There are three cliping strategies ( " +":ref:`api_fluid_clip_GradientClipByGlobalNorm` , " +":ref:`api_fluid_clip_GradientClipByNorm` , " +":ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there " +"is no gradient clipping." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:52 +msgid "" +"The official Adam algorithm has two moving-average accumulators. The " +"accumulators are updated at every step. Every element of the two moving-" +"average is updated in both dense mode and sparse mode. If the size of " +"parameter is very large, then the update may be very slow. The lazy mode " +"only update the element that has gradient in current mini-batch, so it " +"will be much more faster. But this mode has different semantics with the " +"original Adam algorithm and may lead to different result. The default " +"value is False." 
+msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:60 +msgid "Whether to use multi-precision during weight updating. Default is false." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:62 +msgid "The layer-wise decay ratio. Defaults to 1.0." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:64 +msgid "The total number of encoder layers. Defaults to 12." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:66 +msgid "" +"If it's not None, set_param_lr_fun() will set the parameter learning " +"rate before it executes Adam Operator. Defaults to " +":ref:`layerwise_lr_decay`." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:69 +msgid "" +"The keys of name_dict is dynamic name of model while the value of " +"name_dict is static name. Use model.named_parameters() to get name_dict." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:72 +msgid "" +"Normally there is no need for user to set this property. For more " +"information, please refer to :ref:`api_guide_Name`. The default value is " +"None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:78 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.po new file mode 100644 index 0000000000000000000000000000000000000000..c6c62ce5e251eb44bf5c0b1fe503b9328a7d7a5b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.rst:2 +msgid "paddlenlp.ops" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.strings.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.strings.po new file mode 100644 index 0000000000000000000000000000000000000000..3136f0d90b94f699a1f5722a77a6388091713d1f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.strings.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.strings.rst:2 +msgid "strings" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.decoding.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.decoding.po new file mode 100644 index 0000000000000000000000000000000000000000..9c31bf833d87fc12c06a90283a7ea06255c2a02b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.decoding.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.transformer.decoding.rst:2 +msgid "decoding" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.fast_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.fast_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..49a6647b9d42808ae2f390bef6ac17c2a2198c51 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.fast_transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.transformer.fast_transformer.rst:2 +msgid "fast\\_transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..92f9893b41b0c119147fb6508340cb7785cfacf5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.transformer.rst:2 +msgid "transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.po new file mode 100644 index 0000000000000000000000000000000000000000..a1c26dcfa3be07be61669e1df9d052865a79901c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.rst:2 +msgid "paddlenlp" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.encoder.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.encoder.po new file mode 100644 index 0000000000000000000000000000000000000000..35d01ba951acb02ec29879d41ca057ad270aad74 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.encoder.po @@ -0,0 +1,566 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.seq2vec.encoder.rst:2 +msgid "encoder" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:1 +#: paddlenlp.seq2vec.encoder.CNNEncoder:1 +#: paddlenlp.seq2vec.encoder.GRUEncoder:1 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:1 +#: paddlenlp.seq2vec.encoder.RNNEncoder:1 +#: paddlenlp.seq2vec.encoder.TCNEncoder:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:1 +msgid "" +"A `BoWEncoder` takes as input a sequence of vectors and returns a single " +"vector, which simply sums the embeddings of a sequence across the time " +"dimension. The input to this encoder is of shape `(batch_size, " +"num_tokens, emb_dim)`, and the output is of shape `(batch_size, " +"emb_dim)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder +#: paddlenlp.seq2vec.encoder.BoWEncoder.forward +#: paddlenlp.seq2vec.encoder.CNNEncoder +#: paddlenlp.seq2vec.encoder.CNNEncoder.forward +#: paddlenlp.seq2vec.encoder.GRUEncoder +#: paddlenlp.seq2vec.encoder.GRUEncoder.forward +#: paddlenlp.seq2vec.encoder.LSTMEncoder +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward +#: paddlenlp.seq2vec.encoder.RNNEncoder +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward +#: paddlenlp.seq2vec.encoder.TCNEncoder +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:6 +#: paddlenlp.seq2vec.encoder.CNNEncoder:20 +msgid "The dimension of each vector in the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:10 +#: paddlenlp.seq2vec.encoder.CNNEncoder:39 +#: paddlenlp.seq2vec.encoder.GRUEncoder:44 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:43 +#: paddlenlp.seq2vec.encoder.RNNEncoder:43 +msgid "示例" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `BoWEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `BoWEncoder`. 
" +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:1 +msgid "It simply sums the embeddings of a sequence across the time dimension." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:3 +msgid "" +"Shape as `(batch_size, num_tokens, emb_dim)` and dtype as `float32` or " +"`float64`. The sequence length of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:6 +msgid "" +"Shape same as `inputs`. Its each elements identify whether the " +"corresponding input token is padding or not. If True, not padding token. " +"If False, padding token. Defaults to `None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward +#: paddlenlp.seq2vec.encoder.CNNEncoder.forward +#: paddlenlp.seq2vec.encoder.GRUEncoder.forward +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:12 +msgid "" +"Returns tensor `summed`, the result vector of BagOfEmbedding. Its data " +"type is same as `inputs` and its shape is `[batch_size, emb_dim]`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward +#: paddlenlp.seq2vec.encoder.CNNEncoder.forward +#: paddlenlp.seq2vec.encoder.GRUEncoder.forward +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:1 +msgid "" +"A `CNNEncoder` takes as input a sequence of vectors and returns a single " +"vector, a combination of multiple convolution layers and max pooling " +"layers. The input to this encoder is of shape `(batch_size, num_tokens, " +"emb_dim)`, and the output is of shape `(batch_size, output_dim)` or " +"`(batch_size, len(ngram_filter_sizes) * num_filter)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:6 +msgid "" +"The CNN has one convolution layer for each ngram filter size. Each " +"convolution operation gives out a vector of size num_filter. The number " +"of times a convolution layer will be used is `num_tokens - ngram_size + " +"1`. The corresponding maxpooling layer aggregates all these outputs from " +"the convolution layer and outputs the max." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:11 +msgid "" +"This operation is repeated for every ngram size passed, and consequently " +"the dimensionality of the output after maxpooling is " +"`len(ngram_filter_sizes) * num_filter`. This then gets (optionally) " +"projected down to a lower dimensional output, specified by `output_dim`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:15 +msgid "" +"We then use a fully connected layer to project in back to the desired " +"output_dim. For more details, refer to `A Sensitivity Analysis of (and " +"Practitioners’ Guide to) Convolutional Neural Networks for Sentence " +"Classification `__ , Zhang and Wallace " +"2016, particularly Figure 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:22 +msgid "" +"This is the output dim for each convolutional layer, which is the number " +"of \"filters\" learned by that layer." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:25 +msgid "" +"This specifies both the number of convolutional layers we will create and" +" their sizes. 
The default of `(2, 3, 4, 5)` will have four convolutional" +" layers, corresponding to encoding ngrams of size 2 to 5 with some number" +" of filters." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:29 +msgid "" +"Activation to use after the convolution layers. Defaults to " +"`paddle.nn.Tanh()`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:32 +msgid "" +"After doing convolutions and pooling, we'll project the collected " +"features into a vector of this size. If this value is `None`, we will " +"just return the result of the max pooling, giving an output of shape " +"`len(ngram_filter_sizes) * num_filter`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `CNNEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `CNNEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:1 +msgid "The combination of multiple convolution layers and max pooling layers." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:3 +msgid "" +"Shape as `(batch_size, num_tokens, emb_dim)` and dtype as `float32` or " +"`float64`. Tensor containing the features of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:6 +msgid "" +"Shape shoule be same as `inputs` and dtype as `int32`, `int64`, `float32`" +" or `float64`. Its each elements identify whether the corresponding input" +" token is padding or not. If True, not padding token. If False, padding " +"token. Defaults to `None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:12 +msgid "" +"Returns tensor `result`. If output_dim is None, the result shape is of " +"`(batch_size, output_dim)` and dtype is `float`; If not, the result shape" +" is of `(batch_size, len(ngram_filter_sizes) * num_filter)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:1 +msgid "" +"A GRUEncoder takes as input a sequence of vectors and returns a single " +"vector, which is a combination of multiple `paddle.nn.GRU " +"`__ subclass. The input to this encoder " +"is of shape `(batch_size, num_tokens, input_size)`, The output is of " +"shape `(batch_size, hidden_size * 2)` if GRU is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:9 +msgid "" +"Paddle's GRU have two outputs: the hidden state for every time step at " +"last layer, and the hidden state at the last time step for every layer. " +"If `pooling_type` is not None, we perform the pooling on the hidden state" +" of every time step at last layer to create a single vector. If None, we " +"use the hidden state of the last time step at last layer as a single " +"output (shape of `(batch_size, hidden_size)`); And if direction is " +"bidirection, the we concat the hidden state of the last forward gru and " +"backward gru layer to create a single vector (shape of `(batch_size, " +"hidden_size * 2)`)." 
+msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:17 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:17 +#: paddlenlp.seq2vec.encoder.RNNEncoder:17 +#: paddlenlp.seq2vec.encoder.TCNEncoder:14 +msgid "The number of expected features in the input (the last dimension)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:19 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:19 +#: paddlenlp.seq2vec.encoder.RNNEncoder:19 +msgid "The number of features in the hidden state." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:21 +msgid "" +"Number of recurrent layers. E.g., setting num_layers=2 would mean " +"stacking two GRUs together to form a stacked GRU, with the second GRU " +"taking in outputs of the first GRU and computing the final results. " +"Defaults to 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:26 +msgid "" +"The direction of the network. It can be \"forward\" and \"bidirect\" (it " +"means bidirection network). If \"bidirect\", it is a birectional GRU, and" +" returns the concat output from both directions. Defaults to \"forward\"." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:31 +msgid "" +"If non-zero, introduces a Dropout layer on the outputs of each GRU layer " +"except the last layer, with dropout probability equal to dropout. " +"Defaults to 0.0." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:35 +msgid "" +"If `pooling_type` is None, then the GRUEncoder will return the hidden " +"state of the last time step at last layer as a single vector. If " +"pooling_type is not None, it must be one of \"sum\", \"max\" and " +"\"mean\". Then it will be pooled on the GRU output (the hidden state of " +"every time step at last layer) to create a single vector. Defaults to " +"`None`" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `GRUEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `GRUEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:1 +msgid "" +"GRUEncoder takes the a sequence of vectors and returns a single " +"vector, which is a combination of multiple GRU layers. The input to this " +"encoder is of shape `(batch_size, num_tokens, input_size)`, The output is" +" of shape `(batch_size, hidden_size * 2)` if GRU is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:7 +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward:7 +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward:7 +msgid "" +"Shape as `(batch_size, num_tokens, input_size)`. Tensor containing the " +"features of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:10 +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward:10 +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward:10 +msgid "Shape as `(batch_size)`. The sequence length of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:14 +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward:14 +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward:14 +msgid "" +"Returns tensor `output`, the hidden state at the last time step for every" +" layer. 
Its data type is `float` and its shape is `[batch_size, " +"hidden_size]`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:1 +msgid "" +"An LSTMEncoder takes as input a sequence of vectors and returns a single " +"vector, which is a combination of multiple `paddle.nn.LSTM " +"`__ subclass. The input to this encoder" +" is of shape `(batch_size, num_tokens, input_size)`. The output is of " +"shape `(batch_size, hidden_size * 2)` if LSTM is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:9 +msgid "" +"Paddle's LSTM have two outputs: the hidden state for every time step at " +"last layer, and the hidden state and cell at the last time step for every" +" layer. If `pooling_type` is not None, we perform the pooling on the " +"hidden state of every time step at last layer to create a single vector. " +"If None, we use the hidden state of the last time step at last layer as a" +" single output (shape of `(batch_size, hidden_size)`); And if direction " +"is bidirection, the we concat the hidden state of the last forward lstm " +"and backward lstm layer to create a single vector (shape of `(batch_size," +" hidden_size * 2)`)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:21 +msgid "" +"Number of recurrent layers. E.g., setting num_layers=2 would mean " +"stacking two LSTMs together to form a stacked LSTM, with the second LSTM " +"taking in outputs of the first LSTM and computing the final results. " +"Defaults to 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:26 +msgid "" +"The direction of the network. It can be \"forward\" or \"bidirect\" (it " +"means bidirection network). If \"bidirect\", it is a birectional LSTM, " +"and returns the concat output from both directions. Defaults to " +"\"forward\"." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:30 +msgid "" +"If non-zero, introduces a Dropout layer on the outputs of each LSTM layer" +" except the last layer, with dropout probability equal to dropout. " +"Defaults to 0.0 ." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:34 +msgid "" +"If `pooling_type` is None, then the LSTMEncoder will return the hidden " +"state of the last time step at last layer as a single vector. If " +"pooling_type is not None, it must be one of \"sum\", \"max\" and " +"\"mean\". Then it will be pooled on the LSTM output (the hidden state of " +"every time step at last layer) to create a single vector. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `LSTMEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `LSTMEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder.forward:1 +msgid "" +"LSTMEncoder takes the a sequence of vectors and returns a single " +"vector, which is a combination of multiple LSTM layers. The input to this" +" encoder is of shape `(batch_size, num_tokens, input_size)`, The output " +"is of shape `(batch_size, hidden_size * 2)` if LSTM is bidirection; If " +"not, output is of shape `(batch_size, hidden_size)`." 
+msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:1 +msgid "" +"A RNNEncoder takes as input a sequence of vectors and returns a single " +"vector, which is a combination of multiple `paddle.nn.RNN " +"`__ subclass. The input to this encoder " +"is of shape `(batch_size, num_tokens, input_size)`, The output is of " +"shape `(batch_size, hidden_size * 2)` if RNN is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:9 +msgid "" +"Paddle's RNN have two outputs: the hidden state for every time step at " +"last layer, and the hidden state at the last time step for every layer. " +"If `pooling_type` is not None, we perform the pooling on the hidden state" +" of every time step at last layer to create a single vector. If None, we " +"use the hidden state of the last time step at last layer as a single " +"output (shape of `(batch_size, hidden_size)`); And if direction is " +"bidirection, the we concat the hidden state of the last forward rnn and " +"backward rnn layer to create a single vector (shape of `(batch_size, " +"hidden_size * 2)`)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:21 +msgid "" +"Number of recurrent layers. E.g., setting num_layers=2 would mean " +"stacking two RNNs together to form a stacked RNN, with the second RNN " +"taking in outputs of the first RNN and computing the final results. " +"Defaults to 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:26 +msgid "" +"The direction of the network. It can be \"forward\" and \"bidirect\" (it " +"means bidirection network). If \"biderect\", it is a birectional RNN, and" +" returns the concat output from both directions. Defaults to \"forward\"" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:30 +msgid "" +"If non-zero, introduces a Dropout layer on the outputs of each RNN layer " +"except the last layer, with dropout probability equal to dropout. " +"Defaults to 0.0." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:34 +msgid "" +"If `pooling_type` is None, then the RNNEncoder will return the hidden " +"state of the last time step at last layer as a single vector. If " +"pooling_type is not None, it must be one of \"sum\", \"max\" and " +"\"mean\". Then it will be pooled on the RNN output (the hidden state of " +"every time step at last layer) to create a single vector. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `RNNEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `RNNEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder.forward:1 +msgid "" +"RNNEncoder takes the a sequence of vectors and returns a single " +"vector, which is a combination of multiple RNN layers. The input to this " +"encoder is of shape `(batch_size, num_tokens, input_size)`. The output is" +" of shape `(batch_size, hidden_size * 2)` if RNN is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." 
+msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:1 +msgid "" +"A `TCNEncoder` takes as input a sequence of vectors and returns a single " +"vector, which is the last one time step in the feature map. The input to " +"this encoder is of shape `(batch_size, num_tokens, input_size)`, and the " +"output is of shape `(batch_size, num_channels[-1])` with a receptive " +"filed:" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:7 +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward:7 +msgid "" +"receptive filed = 2 * " +"\\sum_{i=0}^{len(num\\_channels)-1}2^i(kernel\\_size-1)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:11 +msgid "" +"Temporal Convolutional Networks is a simple convolutional architecture. " +"It outperforms canonical recurrent networks such as LSTMs in many tasks. " +"See https://arxiv.org/pdf/1803.01271.pdf for more details." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:16 +msgid "The number of channels in different layer." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:18 +msgid "The kernel size. Defaults to 2." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:20 +msgid "The dropout probability. Defaults to 0.2." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `TCNEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `TCNEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.forward:1 +msgid "" +"TCNEncoder takes as input a sequence of vectors and returns a single " +"vector, which is the last one time step in the feature map. The input to " +"this encoder is of shape `(batch_size, num_tokens, input_size)`, and the " +"output is of shape `(batch_size, num_channels[-1])` with a receptive " +"filed:" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.forward:11 +msgid "The input tensor with shape `[batch_size, num_tokens, input_size]`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.forward:14 +msgid "Returns tensor `output` with shape `[batch_size, num_channels[-1]]`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.po new file mode 100644 index 0000000000000000000000000000000000000000..b834eb1c2519b88a2d72c395bf5bf3b4113d0b36 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.seq2vec.rst:2 +msgid "paddlenlp.seq2vec" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dependency_parsing.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dependency_parsing.po new file mode 100644 index 0000000000000000000000000000000000000000..28f4a1fc13f643978c8098a0d5c69cbc15a72317 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dependency_parsing.po @@ -0,0 +1,158 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.dependency_parsing.rst:2 +msgid "dependency\\_parsing" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DDParserTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DDParserTask:1 +msgid "" +"DDParser task to analyze the dependency relationship between words in a " +"sentence :param task: The name of task. :type task: string :param model: " +"The model name in the task. :type model: string :param tree: Ensure the " +"output conforms to the tree structure. :type tree: bool :param prob: " +"Whether to return the probability of predicted heads. :type prob: bool " +":param use_pos: Whether to return the postag. :type use_pos: bool :param " +"batch_size: Numbers of examples a batch. :type batch_size: int :param " +"return_visual: If True, the result will contain the dependency " +"visualization. :type return_visual: bool :param kwargs: Additional " +"keyword arguments passed along to the specific task. :type kwargs: dict, " +"optional" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.pad_sequence:1 +msgid "Fill sequences(np.ndarray) into a fixed-length matrix." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:1 +msgid "" +"Eisner algorithm is a general dynamic programming decoding algorithm for " +"bilexical grammar." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:5 +msgid "Args:" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:4 +msgid "" +"scores: Adjacency matrix,shape=(batch, seq_len, seq_len) mask: mask " +"matrix,shape=(batch, sql_len)" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:7 +msgid "" +"output,shape=(batch, seq_len),the index of the parent node corresponding " +"to the token in the query" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.fill_diagonal:1 +msgid "" +"Fill value into the diagoanl of x that offset is ${offset} and the " +"coordinate system is (dim1, dim2)." 
+msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.backtrack:1 +msgid "Backtrack the position matrix of eisner to generate the tree" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:1 +msgid "Returns a diagonal stripe of the tensor." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:3 +msgid "the input tensor with 2 or more dims." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:5 +msgid "the length of the stripe." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:7 +msgid "the width of the stripe." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:9 +msgid "the offset of the first two dims." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:11 +msgid "0 if returns a horizontal stripe; 1 else." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:14 +msgid "" +"Example: >>> x = np.arange(25).reshape(5, 5) >>> x tensor([[ 0, 1, 2, " +"3, 4]," +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:18 +msgid "" +"[ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19], [20, " +"21, 22, 23, 24]])" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree:1 +#: paddlenlp.taskflow.dependency_parsing.Node:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.Node:1 +msgid "Node class" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree:1 +msgid "" +"DepTree class, used to check whether the prediction result is a project " +"Tree. A projective tree means that you can project the tree without " +"crossing arcs." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.build_tree:1 +msgid "Build the tree" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.add:1 +msgid "Add a child node" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.judge_legal:1 +msgid "Determine whether it is a project tree" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.inorder_traversal:1 +msgid "Inorder traversal" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.istree:1 +msgid "Is the sequence a project tree" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dialogue.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dialogue.po new file mode 100644 index 0000000000000000000000000000000000000000..03c5d6d4de824833c56cabf07cb57d9defac60f8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dialogue.po @@ -0,0 +1,39 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.dialogue.rst:2 +msgid "dialogue" +msgstr "" + +#: of paddlenlp.taskflow.dialogue.DialogueTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.dialogue.DialogueTask:1 +msgid "" +"Task of Chinese open domain dialogue. :param task: The name of task. " +":type task: string :param model: The model name in the task. 
:type model:" +" string :param kwargs: Additional keyword arguments passed along to the " +"specific task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.dialogue.DialogueTask.interactive_mode:1 +msgid "Enter the interactive mode." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.information_extraction.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.information_extraction.po new file mode 100644 index 0000000000000000000000000000000000000000..e5eb592a77eea9a4d361f755872b01d725146536 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.information_extraction.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.taskflow.information_extraction.rst:2 +msgid "information\\_extraction" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.knowledge_mining.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.knowledge_mining.po new file mode 100644 index 0000000000000000000000000000000000000000..7e684c5191a0e3a795490540e947c477bf75c16e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.knowledge_mining.po @@ -0,0 +1,62 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.knowledge_mining.rst:2 +msgid "knowledge\\_mining" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask:1 +#: paddlenlp.taskflow.knowledge_mining.WordTagTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.WordTagTask:1 +msgid "" +"This the NER(Named Entity Recognition) task that convert the raw text to " +"entities. And the task with the `wordtag` model will link the more " +"meesage with the entity. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask:12 +#: paddlenlp.taskflow.knowledge_mining.WordTagTask:11 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask.summary_num:1 +#: paddlenlp.taskflow.knowledge_mining.WordTagTask.summary_num:1 +msgid "Number of model summary token" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.WordTagTask.linking:1 +msgid "Whether to do term linking." 
+msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask:1 +msgid "" +"Noun phrase tagging task that convert the noun phrase to POS tag. :param " +"task: The name of task. :type task: string :param model: The model name " +"in the task. :type model: string :param batch_size: Numbers of examples a" +" batch. :type batch_size: int :param linking: Returns the categories. If " +"`linking` is True, the fine-grained label (label) will link with the " +"coarse-grained label (category). :type linking: bool" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.lexical_analysis.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.lexical_analysis.po new file mode 100644 index 0000000000000000000000000000000000000000..680107ee5916083e520c96c199482e99cbdd408c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.lexical_analysis.po @@ -0,0 +1,41 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.lexical_analysis.rst:2 +msgid "lexical\\_analysis" +msgstr "" + +#: of paddlenlp.taskflow.lexical_analysis.load_vocab:1 +msgid "Load vocab from file" +msgstr "" + +#: of paddlenlp.taskflow.lexical_analysis.LacTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.lexical_analysis.LacTask:1 +msgid "" +"Lexical analysis of Chinese task to segement the chinese sentence. :param" +" task: The name of task. :type task: string :param model: The model name " +"in the task. :type model: string :param user_dict: The user-defined " +"dictionary, default to None. :type user_dict: string :param kwargs: " +"Additional keyword arguments passed along to the specific task. :type " +"kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.model.po new file mode 100644 index 0000000000000000000000000000000000000000..680bb94c7a78e711e703de04c2f337106daf14ae --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.model.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.model.rst:2 +msgid "model" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.dependency_parsing_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.dependency_parsing_model.po new file mode 100644 index 0000000000000000000000000000000000000000..b422296e89e3040fabea34345387449df89efde0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.dependency_parsing_model.po @@ -0,0 +1,81 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.dependency_parsing_model.rst:2 +msgid "dependency\\_parsing\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser:1 +msgid "DDParser" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:1 +msgid "Select input value according to index" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:4 +msgid "Arags:" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:4 +msgid "input: input matrix index: index matrix" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:6 +msgid "output" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.information_extraction_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.information_extraction_model.po new file mode 100644 index 0000000000000000000000000000000000000000..22676e1db76165cc30c69079e58ecb957d9c0366 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.information_extraction_model.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.taskflow.models.information_extraction_model.rst:2 +msgid "information\\_extraction\\_model" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.lexical_analysis_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.lexical_analysis_model.po new file mode 100644 index 0000000000000000000000000000000000000000..efaac39394ad6d82b3b48e657ad93a7ad7a07252 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.lexical_analysis_model.po @@ -0,0 +1,55 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.lexical_analysis_model.rst:2 +msgid "lexical\\_analysis\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf:1 +msgid "" +"The network for lexical analysis, based on two layers of BiGRU and one " +"layer of CRF. More details see https://arxiv.org/abs/1807.01882 :param " +"word_emb_dim: The dimension in which a word is embedded. :type " +"word_emb_dim: int :param hidden_size: The number of hidden nodes in the " +"GRU layer. :type hidden_size: int :param vocab_size: the word vocab size." +" :type vocab_size: int :param num_labels: the labels amount. :type " +"num_labels: int :param emb_lr: The scaling of the learning rate of the " +"embedding layer. Defaults to 2.0. :type emb_lr: float, optional :param " +"crf_lr: The scaling of the learning rate of the crf layer. Defaults to " +"0.2. :type crf_lr: float, optional" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.po new file mode 100644 index 0000000000000000000000000000000000000000..ca52fe50a94ab58e16bf4f48f91c9080c3f6dffd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.rst:2 +msgid "models" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.sentiment_analysis_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.sentiment_analysis_model.po new file mode 100644 index 0000000000000000000000000000000000000000..3b23dedcef965b6e29c24e750a0b1f83a54e5e63 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.sentiment_analysis_model.po @@ -0,0 +1,119 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.sentiment_analysis_model.rst:2 +msgid "sentiment\\_analysis\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:1 +msgid "" +"This class implements the Bag of Words Classification Network model to " +"classify texts. At a high level, the model starts by embedding the tokens" +" and running them through a word embedding. Then, we encode these " +"epresentations with a `BoWEncoder`. Lastly, we take the output of the " +"encoder to create a final representation, which is passed through some " +"feed-forward layers to output a logits (`output_layer`). :param " +"vocab_size: The vocab size that used to create the embedding. :type " +"vocab_size: int :param num_class: The num class of the classifier. :type " +"num_class: int :param emb_dim: The size of the embedding, default value " +"is 128. :type emb_dim: int. optinal :param padding_idx: The padding value" +" in the embedding, the padding_idx of embedding value will" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:13 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:13 +msgid "not be updated, the default value is 0." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel +#: paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:15 +msgid "The output size of linear that after the bow, default value is 128." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:17 +msgid "" +"The output size of linear that after the fisrt linear, default value is " +"96." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward:1 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward:1 +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward:4 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward:4 +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward:6 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward:6 +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:1 +msgid "" +"This class implements the Bag of Words Classification Network model to " +"classify texts. At a high level, the model starts by embedding the tokens" +" and running them through a word embedding. 
Then, we encode these " +"epresentations with a `BoWEncoder`. Lastly, we take the output of the " +"encoder to create a final representation, which is passed through some " +"feed-forward layers to output a logits (`output_layer`). :param " +"vocab_size: The vocab size that used to create the embedding. :type " +"vocab_size: int :param num_class: The num clas of the classifier. :type " +"num_class: int :param emb_dim: The size of the embedding, default value " +"is 128. :type emb_dim: int. optinal :param padding_idx: The padding value" +" in the embedding, the padding_idx of embedding value will" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:15 +msgid "The output size of the lstm, defalut value 198." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:17 +msgid "The direction of lstm, default value is `forward`." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:19 +msgid "The num of lstm layer." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:21 +msgid "The dropout rate of lstm." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:23 +msgid "" +"The pooling type of lstm. Defalut value is None, if `pooling_type` is " +"None, then the LSTMEncoder will return the hidden state of the last time " +"step at last layer as a single vector." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.text_correction_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.text_correction_model.po new file mode 100644 index 0000000000000000000000000000000000000000..e989ef85091b23abf2eaf6f772b3edaf011f2685 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.text_correction_model.po @@ -0,0 +1,167 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.text_correction_model.rst:2 +msgid "text\\_correction\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC:1 +msgid "ErnieForCSC is a model specified for Chinese Spelling Correction task." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC:3 +msgid "" +"It integrates phonetic features into language model by leveraging the " +"powerful pre-training and fine-tuning method." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC:6 +msgid "" +"See more details on https://aclanthology.org/2021.findings-acl.198.pdf. " +":param ernie: An instance of `paddlenlp.transformers.ErnieModel`. :type " +"ernie: ErnieModel :param pinyin_vocab_size: The vocab size of pinyin " +"vocab. :type pinyin_vocab_size: int :param pad_pinyin_id: The pad token " +"id of pinyin vocab. Defaults to 0. 
:type pad_pinyin_id: int, optional" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:5 +msgid "" +"Indices of pinyin tokens of input sequence in the pinyin vocabulary. They" +" are numerical representations of tokens that build the pinyin input " +"sequence. It's data type should be `int64` and has a shape of " +"[batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:12 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:13 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:15 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, config.max_position_embeddings - " +"1]``. Defaults to `None`. Shape as `(batch_sie, num_tokens)` and dtype as" +" `int32` or `int64`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:22 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:22 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." 
+msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:27 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:28 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:30 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:34 +msgid "" +"A Tensor of the detection prediction of each tokens. Shape as " +"`(batch_size, sequence_length)` and dtype as `int`. char_preds (Tensor):" +" A Tensor of the correction prediction of each tokens. Shape as " +"`(batch_size, sequence_length)` and dtype as `int`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:35 +msgid "A Tensor of the detection prediction of each tokens." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:35 +msgid "Shape as `(batch_size, sequence_length)` and dtype as `int`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:38 +msgid "char_preds (Tensor):" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:38 +msgid "" +"A Tensor of the correction prediction of each tokens. Shape as " +"`(batch_size, sequence_length)` and dtype as `int`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.named_entity_recognition.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.named_entity_recognition.po new file mode 100644 index 0000000000000000000000000000000000000000..1e374e1453239eb95bc5e909d7f1e48a88324cc4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.named_entity_recognition.po @@ -0,0 +1,49 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.named_entity_recognition.rst:2 +msgid "named\\_entity\\_recognition" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERWordTagTask:1 +msgid "基类::class:`paddlenlp.taskflow.knowledge_mining.WordTagTask`" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERWordTagTask:1 +msgid "" +"This the NER(Named Entity Recognition) task that convert the raw text to " +"entities. And the task with the `wordtag` model will link the more " +"meesage with the entity. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. 
:type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERLACTask:1 +msgid "基类::class:`paddlenlp.taskflow.lexical_analysis.LacTask`" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERLACTask:1 +msgid "" +"Part-of-speech tagging task for the raw text. :param task: The name of " +"task. :type task: string :param model: The model name in the task. :type " +"model: string :param kwargs: Additional keyword arguments passed along to" +" the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.po new file mode 100644 index 0000000000000000000000000000000000000000..88c4509069ea68ff7ccc36cde2f9225f36848fcf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.rst:2 +msgid "paddlenlp.taskflow" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.poetry_generation.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.poetry_generation.po new file mode 100644 index 0000000000000000000000000000000000000000..f9d4a4d5bd3567da1c797d06a71b590479db12e4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.poetry_generation.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.poetry_generation.rst:2 +msgid "poetry\\_generation" +msgstr "" + +#: of paddlenlp.taskflow.poetry_generation.PoetryGenerationTask:1 +msgid "基类::class:`paddlenlp.taskflow.text_generation.TextGenerationTask`" +msgstr "" + +#: of paddlenlp.taskflow.poetry_generation.PoetryGenerationTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.pos_tagging.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.pos_tagging.po new file mode 100644 index 0000000000000000000000000000000000000000..125024b9e302679d3f712fbe835b00fbd5cb939a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.pos_tagging.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.pos_tagging.rst:2 +msgid "pos\\_tagging" +msgstr "" + +#: of paddlenlp.taskflow.pos_tagging.POSTaggingTask:1 +msgid "基类::class:`paddlenlp.taskflow.lexical_analysis.LacTask`" +msgstr "" + +#: of paddlenlp.taskflow.pos_tagging.POSTaggingTask:1 +msgid "" +"Part-of-speech tagging task for the raw text. :param task: The name of " +"task. :type task: string :param model: The model name in the task. :type " +"model: string :param kwargs: Additional keyword arguments passed along to" +" the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.question_answering.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.question_answering.po new file mode 100644 index 0000000000000000000000000000000000000000..b5c34939926a4f6db33e67b502137b870aedb98b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.question_answering.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.question_answering.rst:2 +msgid "question\\_answering" +msgstr "" + +#: of paddlenlp.taskflow.question_answering.QuestionAnsweringTask:1 +msgid "基类::class:`paddlenlp.taskflow.text_generation.TextGenerationTask`" +msgstr "" + +#: of paddlenlp.taskflow.question_answering.QuestionAnsweringTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.sentiment_analysis.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.sentiment_analysis.po new file mode 100644 index 0000000000000000000000000000000000000000..8626d0f8ba41911710cfdd4d8d8d2b6ff7f76da5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.sentiment_analysis.po @@ -0,0 +1,46 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.sentiment_analysis.rst:2 +msgid "sentiment\\_analysis" +msgstr "" + +#: of paddlenlp.taskflow.sentiment_analysis.SentaTask:1 +#: paddlenlp.taskflow.sentiment_analysis.SkepTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.sentiment_analysis.SentaTask:1 +msgid "" +"Sentiment analysis task using RNN or BOW model to predict sentiment " +"opinion on Chinese text. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.sentiment_analysis.SkepTask:1 +msgid "" +"Sentiment analysis task using ERNIE-Gram model to predict sentiment " +"opinion on Chinese text. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.task.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.task.po new file mode 100644 index 0000000000000000000000000000000000000000..22119bc44d00e2ea6730252aba2362969d94923c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.task.po @@ -0,0 +1,57 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.task.rst:2 +msgid "task" +msgstr "" + +#: of paddlenlp.taskflow.task.Task:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.task.Task:1 +msgid "" +"The meta classs of task in Taskflow. The meta class has the five abstract" +" function," +msgstr "" + +#: of paddlenlp.taskflow.task.Task:2 +msgid "the subclass need to inherit from the meta class." +msgstr "" + +#: of paddlenlp.taskflow.task.Task +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.task.Task:3 +msgid "The name of task." +msgstr "" + +#: of paddlenlp.taskflow.task.Task:5 +msgid "The model name in the task." +msgstr "" + +#: of paddlenlp.taskflow.task.Task:7 +msgid "Additional keyword arguments passed along to the specific task." +msgstr "" + +#: of paddlenlp.taskflow.task.Task.help:1 +msgid "Return the usage message of the current task." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.taskflow.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.taskflow.po new file mode 100644 index 0000000000000000000000000000000000000000..c10c2d08a26a213c9d318ae5d413c726c8d2e758 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.taskflow.po @@ -0,0 +1,84 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.taskflow.rst:2 +msgid "taskflow" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:3 +msgid "" +"The Taskflow is the end2end inferface that could convert the raw text to " +"model result, and decode the model result to task result. The main " +"functions as follows:" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:2 +msgid "Convert the raw text to task result." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:3 +msgid "Convert the model to the inference model." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:4 +msgid "Offer the usage and help message." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:5 +msgid "The task name for the Taskflow, and get the task class from the name." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:7 +msgid "The model name in the task, if set None, will use the default model." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:9 +msgid "" +"Select the mode of the task, only used in the tasks of word_segmentation " +"and ner. If set None, will use the default mode." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:12 +msgid "The device id for the gpu, xpu and other devices, the defalut value is 0." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:14 +msgid "Additional keyword arguments passed along to the specific task." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow.help:1 +msgid "Return the task usage message." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow.task_path:1 +msgid "Return the path of current task" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow.tasks:1 +msgid "Return the available task list." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text2knowledge.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text2knowledge.po new file mode 100644 index 0000000000000000000000000000000000000000..0e8fa7b1ec681792528087267399daf741f9168c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text2knowledge.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text2knowledge.rst:2 +msgid "text2knowledge" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_correction.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_correction.po new file mode 100644 index 0000000000000000000000000000000000000000..7c8db153a50a895995fd35213759cf99ab16f9df --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_correction.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text_correction.rst:2 +msgid "text\\_correction" +msgstr "" + +#: of paddlenlp.taskflow.text_correction.CSCTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.text_correction.CSCTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_generation.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_generation.po new file mode 100644 index 0000000000000000000000000000000000000000..2c2131c87cd543879dc03c8624276cfb76c2d133 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_generation.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text_generation.rst:2 +msgid "text\\_generation" +msgstr "" + +#: of paddlenlp.taskflow.text_generation.TextGenerationTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.text_generation.TextGenerationTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. 
:type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_similarity.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_similarity.po new file mode 100644 index 0000000000000000000000000000000000000000..df0db4d96d4c82a9c730eb4298dd25cc0a6ffb88 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_similarity.po @@ -0,0 +1,36 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text_similarity.rst:2 +msgid "text\\_similarity" +msgstr "" + +#: of paddlenlp.taskflow.text_similarity.TextSimilarityTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.text_similarity.TextSimilarityTask:1 +msgid "" +"Text similarity task using SimBERT to predict the similarity of sentence " +"pair. :param task: The name of task. :type task: string :param model: The" +" model name in the task. :type model: string :param kwargs: Additional " +"keyword arguments passed along to the specific task. :type kwargs: dict, " +"optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..710f4c8c0c086d0a312d37e8c206c5863034d97f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.utils.po @@ -0,0 +1,343 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.utils.rst:2 +msgid "utils" +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:1 +msgid "" +"Download the file from the url to specified directory. Check md5 value " +"when the file is exists, if the md5 value is the same as the existed " +"file, just use the older file, if not, will download the file from the " +"url." 
+msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode +#: paddlenlp.taskflow.utils.BurkhardKellerTree.add +#: paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word +#: paddlenlp.taskflow.utils.TermTree.add_term +#: paddlenlp.taskflow.utils.TermTree.build_from_dir +#: paddlenlp.taskflow.utils.TermTree.find_term +#: paddlenlp.taskflow.utils.TermTree.from_dir +#: paddlenlp.taskflow.utils.TermTree.save paddlenlp.taskflow.utils.TermTreeNode +#: paddlenlp.taskflow.utils.TermTreeNode.from_dict +#: paddlenlp.taskflow.utils.TermTreeNode.from_json +#: paddlenlp.taskflow.utils.TriedTree.search +#: paddlenlp.taskflow.utils.download_check +#: paddlenlp.taskflow.utils.download_file +#: paddlenlp.taskflow.utils.levenstein_distance +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:5 +msgid "The specified directory saving the file." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:7 +msgid "The specified filename saving the file." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:9 +msgid "The url downling the file." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:11 +msgid "The md5 value that checking the version downloaded." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_check:1 +msgid "Check the resource statuc in the specified task." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_check:3 +msgid "The name of specified task." +msgstr "" + +#: of paddlenlp.taskflow.utils.add_docstrings:1 +msgid "The function that add the doc string to doc of class." +msgstr "" + +#: of paddlenlp.taskflow.utils.cut_chinese_sent:1 +msgid "" +"Cut the Chinese sentences more precisely, reference to " +"\"https://blog.csdn.net/blmoistawinde/article/details/82379256\"." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode:1 +#: paddlenlp.taskflow.utils.BurkhardKellerTree:1 +#: paddlenlp.taskflow.utils.Customization:1 paddlenlp.taskflow.utils.TermTree:1 +#: paddlenlp.taskflow.utils.TermTreeNode:1 paddlenlp.taskflow.utils.TriedTree:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:1 +msgid "" +"Defination of term node. All members are protected, to keep rigorism of " +"data struct." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:3 +msgid "term id of node." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:5 +msgid "term, common name of this term." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:7 +msgid "`cb` indicates concept base, `eb` indicates entity base." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:9 +msgid "type of this term, constructs hirechical of `term` node. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:11 +msgid "parent type of a `type` node. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:13 +msgid "type statement of node, `type` or `term`. Defaults to \"term\"." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:13 +#: paddlenlp.taskflow.utils.TermTreeNode:15 +msgid "alias of this term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:17 +msgid "extended alias of this term, CANNOT be used in matching. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:20 +msgid "grouped by some term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:22 +msgid "some lower term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:24 +msgid "to sore full imformation of a term. 
Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_dict:1 +msgid "Build a node from dictionary data." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_dict:3 +msgid "Dictionary data contain all k-v data." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word +#: paddlenlp.taskflow.utils.TermTree.find_term +#: paddlenlp.taskflow.utils.TermTree.from_dir +#: paddlenlp.taskflow.utils.TermTreeNode.from_dict +#: paddlenlp.taskflow.utils.TermTreeNode.from_json +#: paddlenlp.taskflow.utils.TriedTree.search +#: paddlenlp.taskflow.utils.levenstein_distance +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_dict:6 +#: paddlenlp.taskflow.utils.TermTreeNode.from_json:6 +msgid "TermTree node object." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word +#: paddlenlp.taskflow.utils.TermTree.find_term +#: paddlenlp.taskflow.utils.TermTree.from_dir +#: paddlenlp.taskflow.utils.TermTreeNode.from_dict +#: paddlenlp.taskflow.utils.TermTreeNode.from_json +#: paddlenlp.taskflow.utils.TriedTree.search +#: paddlenlp.taskflow.utils.levenstein_distance +msgid "返回类型" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_json:1 +msgid "Build a node from JSON string." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_json:3 +msgid "JSON string formatted by TermTree data." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree:1 +msgid "TermTree class." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:1 +msgid "Add a term into TermTree." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:3 +msgid "common name of name." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:5 +msgid "term is concept or entity." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:7 +msgid "term type of this term" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:9 +msgid "sub type of this term, must exists in TermTree. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:11 +msgid "sub terms of this term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:15 +msgid ". Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:17 +msgid "[description]. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.find_term:1 +msgid "" +"Find a term in Term Tree. If term not exists, return None. If `term_type`" +" is not None, will find term with this type." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.find_term:4 +msgid "term to look up." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.find_term:6 +msgid "find term in this term_type. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.build_from_dir:3 +#: paddlenlp.taskflow.utils.TermTree.find_term:9 +#: paddlenlp.taskflow.utils.TermTree.from_dir:3 +#: paddlenlp.taskflow.utils.TermTree.from_dir:6 +msgid "[description]" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.build_from_dir:1 +#: paddlenlp.taskflow.utils.TermTree.from_dir:1 +msgid "" +"Build TermTree from a directory which should contain type schema and term" +" data." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.save:1 +msgid "Save term tree to directory `save_dir`" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.save:3 +msgid "Directory." +msgstr "" + +#: of paddlenlp.taskflow.utils.levenstein_distance:1 +msgid "Calculate minimal Levenstein distance between s1 and s2." 
+msgstr "" + +#: of paddlenlp.taskflow.utils.levenstein_distance:3 +#: paddlenlp.taskflow.utils.levenstein_distance:5 +msgid "string" +msgstr "" + +#: of paddlenlp.taskflow.utils.levenstein_distance:8 +msgid "the minimal distance." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode:1 +msgid "" +"Node implementatation for BK-Tree. A BK-Tree node stores the information " +"of current word, and its approximate words calculated by levenstein " +"distance." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode:3 +msgid "word of current node." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree:1 +msgid "Implementataion of BK-Tree" +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.add:1 +msgid "Insert a word into current tree. If tree is empty, set this word to root." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.add:3 +msgid "word to be inserted." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word:1 +msgid "Search the most similar (minimal levenstain distance) word between `s`." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word:3 +msgid "target word" +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word:6 +msgid "similar words." +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree:1 +msgid "Implementataion of TriedTree" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.add_word:1 +msgid "add single word into TriedTree" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:1 +msgid "Backward maximum matching" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:3 +msgid "string to be searched" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:6 +msgid "" +"list of maximum matching words, each element represents the starting " +"and ending position of the matching string." +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:8 +msgid "list of maximum matching words, each element represents" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:9 +msgid "the starting and ending position of the matching string." +msgstr "" + +#: of paddlenlp.taskflow.utils.Customization:1 +msgid "User intervention based on Aho-Corasick automaton" +msgstr "" + +#: of paddlenlp.taskflow.utils.Customization.load_customization:1 +msgid "Load the custom vocab" +msgstr "" + +#: of paddlenlp.taskflow.utils.Customization.parse_customization:1 +msgid "Use custom vocab to modify the lac results" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.word_segmentation.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.word_segmentation.po new file mode 100644 index 0000000000000000000000000000000000000000..8dcf598ef031685b7e05c2e5ff7bccaca8b6a08a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.word_segmentation.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.word_segmentation.rst:2 +msgid "word\\_segmentation" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegJiebaTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegJiebaTask:1 +msgid "" +"Word Segmentation task for the raw text. :param task: The name of task. " +":type task: string :param model: The model name in the task. :type model:" +" string :param user_dict: The user-defined dictionary, default to None. " +":type user_dict: string :param kwargs: Additional keyword arguments " +"passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegLACTask:1 +msgid "基类::class:`paddlenlp.taskflow.lexical_analysis.LacTask`" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegLACTask:1 +msgid "" +"Segement the sentences to the words using LAC mode. :param task: The name" +" of task. :type task: string :param model: The model name in the task. " +":type model: string :param kwargs: Additional keyword arguments passed " +"along to the specific task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegWordTagTask:1 +msgid "基类::class:`paddlenlp.taskflow.named_entity_recognition.NERWordTagTask`" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegWordTagTask:1 +msgid "" +"Segement the sentences to the words using WordTag model. :param task: The" +" name of task. :type task: string :param model: The model name in the " +"task. :type model: string :param kwargs: Additional keyword arguments " +"passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.argparser.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.argparser.po new file mode 100644 index 0000000000000000000000000000000000000000..24c32480ffedb0f60dad1ceaf345fcf234d87b00 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.argparser.po @@ -0,0 +1,137 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.argparser.rst:2 +msgid "argparser" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser:1 +msgid "基类::class:`argparse.ArgumentParser`" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser:1 +msgid "" +"This subclass of `argparse.ArgumentParser` uses type hints on dataclasses" +" to generate arguments." +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser:3 +msgid "" +"The class is designed to play well with the native argparse. 
In " +"particular, you can add more (non-dataclass backed) arguments to the " +"parser after initialization and you'll get the output back after parsing " +"as an additional namespace. Optional: To create sub argument groups use " +"the `_argument_group_name` attribute in the dataclass." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:1 +msgid "Parse command-line args into instances of the specified dataclass types." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:3 +msgid "" +"This relies on argparse's `ArgumentParser.parse_known_args`. See the doc " +"at: " +"docs.python.org/3.7/library/argparse.html#argparse.ArgumentParser.parse_args" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:6 +msgid "" +"List of strings to parse. The default is taken from sys.argv. (same as " +"argparse.ArgumentParser)" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:7 +msgid "If true, also return a list of remaining argument strings." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:8 +msgid "" +"If true, will look for a \".args\" file with the same base name as the " +"entry point script for this process, and will append its potential " +"content to the command line args." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:10 +msgid "" +"If not None, will uses this file instead of the \".args\" file specified " +"in the previous argument." +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:12 +msgid "" +"- the dataclass instances in the same order as they were passed to the " +"initializer.abspath - if applicable, an additional namespace for more " +"(non-dataclass backed) arguments added to the parser after " +"initialization. - The potential list of remaining argument strings. (same" +" as argparse.ArgumentParser.parse_known_args)" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:12 +msgid "" +"the dataclass instances in the same order as they were passed to the " +"initializer.abspath" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:13 +msgid "" +"if applicable, an additional namespace for more (non-dataclass backed) " +"arguments added to the parser after initialization." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:15 +msgid "" +"The potential list of remaining argument strings. (same as " +"argparse.ArgumentParser.parse_known_args)" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses +msgid "返回类型" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_json_file:1 +msgid "" +"Alternative helper method that does not use `argparse` at all, instead " +"loading a json file and populating the dataclass types." +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_dict:1 +msgid "" +"Alternative helper method that does not use `argparse` at all, instead " +"uses a dict and populating the dataclass types." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.integrations.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.integrations.po new file mode 100644 index 0000000000000000000000000000000000000000..40af1d077b2643c74ffa6a0424c87bf63ae42289 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.integrations.po @@ -0,0 +1,47 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.integrations.rst:2 +msgid "integrations" +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback:1 +msgid "基类::class:`paddlenlp.trainer.trainer_callback.TrainerCallback`" +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback:1 +msgid "" +"A [`TrainerCallback`] that sends the logs to " +"[VisualDL](https://www.paddlepaddle.org.cn/paddle/visualdl). :param " +"vdl_writer: The writer to use. Will instantiate one if not set. :type " +"vdl_writer: `LogWriter`, *optional*" +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback.on_train_begin:1 +msgid "Event called at the beginning of training." +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback.on_log:1 +msgid "Event called after logging the last logs." +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback.on_train_end:1 +msgid "Event called at the end of training." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.po new file mode 100644 index 0000000000000000000000000000000000000000..9791ba7b7a659245e71780983ce427eaede2202c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.rst:2 +msgid "paddlenlp.trainer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_base.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_base.po new file mode 100644 index 0000000000000000000000000000000000000000..12cd6dcf1b03eaf68e930034d73e34f54b277098 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_base.po @@ -0,0 +1,600 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.trainer_base.rst:2 +msgid "trainer\\_base" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:1 +msgid "" +"Trainer is a simple but feature-complete training and eval loop for " +"PaddlePaddle, optimized for PaddleNLP." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer +#: paddlenlp.trainer.trainer_base.Trainer.add_callback +#: paddlenlp.trainer.trainer_base.Trainer.create_scheduler +#: paddlenlp.trainer.trainer_base.Trainer.evaluate +#: paddlenlp.trainer.trainer_base.Trainer.export_model +#: paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader +#: paddlenlp.trainer.trainer_base.Trainer.get_optimizer_cls_and_kwargs +#: paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader +#: paddlenlp.trainer.trainer_base.Trainer.log +#: paddlenlp.trainer.trainer_base.Trainer.predict +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step +#: paddlenlp.trainer.trainer_base.Trainer.training_step +msgid "参数" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:3 +msgid "" +"The model to train, evaluate or use for predictions. [`Trainer`] " +"is optimized to work with the [`PretrainedModel`] provided by the " +"library. You can still use your own models defined as `paddle.nn.Layer` " +"as long as they work the same way as the PaddleNLP models. " +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:3 +msgid "The model to train, evaluate or use for predictions." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:5 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:7 +msgid "" +"[`Trainer`] is optimized to work with the [`PretrainedModel`] provided by" +" the library. You can still use your own models defined as " +"`paddle.nn.Layer` as long as they work the same way as the PaddleNLP " +"models." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:11 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:13 +msgid "" +"The arguments to tweak for training. Will default to a basic instance of " +"[`TrainingArguments`] with the `output_dir` set to a directory named " +"*tmp_trainer* in the current directory if not provided." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:16 +msgid "" +"The function to use to form a batch from a list of elements of " +"`train_dataset` or `eval_dataset`. Will default to " +"[`default_data_collator`] if no `tokenizer` is provided, an instance of " +"[`DataCollatorWithPadding`] otherwise." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:20 +msgid "" +"The dataset to use for training. If it is an `datasets.Dataset`, columns " +"not accepted by the `model.forward()` method are automatically removed." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:23 +msgid "" +"The dataset to use for evaluation. If it is an `datasets.Dataset`, " +"columns not accepted by the `model.forward()` method are automatically " +"removed." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:26 +msgid "" +"The tokenizer used to preprocess the data. 
If provided, will be used to " +"automatically pad the inputs the maximum length when batching inputs, and" +" it will be saved along the model to make it easier to rerun an " +"interrupted training or reuse the fine-tuned model." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:30 +msgid "" +"The function that will be used to compute metrics at evaluation. Must " +"take a [`EvalPrediction`] and return a dictionary string to metric " +"values." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:33 +msgid "" +"A tuple containing the optimizer and the scheduler to use. Will default " +"to an instance of [`AdamW`] on your model and a scheduler given by " +"[`get_linear_schedule_with_warmup`] controlled by `args`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:38 +msgid "Important attributes:" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:40 +msgid "" +"**model** -- Always points to the core model. If using a transformers " +"model, it will be a [`PretrainedModel`] subclass." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:42 +msgid "" +"**model_wrapped** -- Always points to the most external model in case one" +" or more other modules wrap the original model. This is the model that " +"should be used for the forward pass. For example, the inner model is " +"wrapped in `paddle.DataParallel`. If model hasn't been wrapped, then " +"`self.model_wrapped` is the same as `self.model`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:46 +msgid "" +"**is_model_parallel** -- Whether or not a model has been switched to a " +"model parallel mode (different from data parallelism, this means some of " +"the model layers are split on different GPUs)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:48 +msgid "" +"**place_model_on_device** -- Whether or not to automatically place the " +"model on the device - it will be set to `False` if model parallel or " +"deepspeed is used, or if the default " +"`TrainingArguments.place_model_on_device` is overridden to return `False`" +" ." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:51 +msgid "" +"**is_in_train** -- Whether or not a model is currently running `train` " +"(e.g. when `evaluate` is called while in `train`)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.log_metrics:1 +msgid "" +"Log metrics in a specially formatted way Under distributed environment " +"this is done only for a process with rank 0. 
:param split: Mode/split " +"name: one of `train`, `eval`, `test` :type split: `str` :param metrics: " +"The metrics returned from train/evaluate/predictmetrics: metrics dict " +":type metrics: `Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:1 +msgid "" +"Reformat Trainer metrics values to a human-readable format :param " +"metrics: The metrics returned from train/evaluate/predict :type metrics: " +"`Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate +#: paddlenlp.trainer.trainer_base.Trainer.pop_callback +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step +#: paddlenlp.trainer.trainer_base.Trainer.training_step +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:5 +msgid "The reformatted metrics" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step +#: paddlenlp.trainer.trainer_base.Trainer.training_step +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回类型" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:6 +msgid "metrics (`Dict[str, float]`)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:1 +msgid "" +"Save metrics into a json file for that split, e.g. `train_results.json`. " +"Under distributed environment this is done only for a process with rank " +"0. :param split: Mode/split name: one of `train`, `eval`, `test`, `all` " +":type split: `str` :param metrics: The metrics returned from " +"train/evaluate/predict :type metrics: `Dict[str, float]` :param combined:" +" Creates combined metrics by updating `all_results.json` with metrics of " +"this call :type combined: `bool`, *optional*, defaults to `True`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:10 +msgid "" +"To understand the metrics please read the docstring of " +"[`~Trainer.log_metrics`]. The only difference is that raw unformatted " +"numbers are saved in the current method." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_state:1 +msgid "" +"Saves the Trainer state, since Trainer.save_model saves only the " +"tokenizer with the model Under distributed environment this is done only " +"for a process with rank 0." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.add_callback:1 +msgid "Add a callback to the current list of [`~TrainerCallback`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.add_callback:3 +msgid "" +"A [`~TrainerCallback`] class or an instance of a [`~TrainerCallback`]. In" +" the first case, will instantiate a member of that class." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:1 +msgid "" +"Remove a callback from the current list of [`~TrainerCallback`] and " +"returns it. If the callback is not found, returns `None` (and no error is" +" raised). :param callback: A [`~TrainerCallback`] class or an instance of" +" a [`~TrainerCallback`]. In the" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:4 +msgid "" +"first case, will pop the first member of that class found in the list of " +"callbacks." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:7 +msgid "The callback removed, if found." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:8 +msgid "[`~TrainerCallback`]" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.remove_callback:1 +msgid "" +"Remove a callback from the current list of [`~TrainerCallback`]. :param " +"callback: A [`~TrainerCallback`] class or an instance of a " +"[`~TrainerCallback`]. In the" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.remove_callback:3 +msgid "" +"first case, will remove the first member of that class found in the list " +"of callbacks." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_train_dataloader:1 +msgid "Returns the training [`~paddle.io.DataLoader`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_train_dataloader:3 +msgid "" +"Will use no sampler if `self.train_dataset` does not implement `__len__`," +" a random sampler (adapted to distributed training if necessary) " +"otherwise." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader:3 +#: paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader:3 +#: paddlenlp.trainer.trainer_base.Trainer.get_train_dataloader:6 +msgid "" +"Subclass and override this method if you want to inject some custom " +"behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader:1 +msgid "Returns the evaluation [`~paddle.io.DataLoader`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader:5 +msgid "" +"If provided, will override `self.eval_dataset`. If it is an " +"`datasets.Dataset`, columns not accepted by the `model.forward()` method " +"are automatically removed. It must implement `__len__`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader:1 +msgid "Returns the test [`~paddle.io.DataLoader`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader:5 +msgid "" +"The test dataset to use. If it is an `datasets.Dataset`, columns not " +"accepted by the `model.forward()` method are automatically removed. It " +"must implement `__len__`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer_and_scheduler:1 +msgid "Setup the optimizer and the learning rate scheduler." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer_and_scheduler:3 +msgid "" +"We provide a reasonable default that works well. If you want to use " +"something else, you can pass a tuple in the Trainer's init through " +"`optimizers`, or subclass and override this method (or `create_optimizer`" +" and/or `create_scheduler`) in a subclass." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer:1 +msgid "Setup the optimizer." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer:3 +msgid "" +"We provide a reasonable default that works well. If you want to use " +"something else, you can pass a tuple in the Trainer's init through " +"`optimizers`, or subclass and override this method in a subclass." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_optimizer_cls_and_kwargs:1 +msgid "" +"Returns the optimizer class and optimizer parameters based on the " +"training arguments." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_optimizer_cls_and_kwargs:3 +msgid "The training arguments for the training session." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_scheduler:1 +msgid "" +"Setup the scheduler. 
The optimizer of the trainer must have been set up " +"either before this method is called or passed as an argument." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_scheduler:4 +msgid "The number of training steps to do." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.autocast_smart_context_manager:1 +msgid "" +"A helper wrapper that creates an appropriate context manager for " +"`autocast` while feeding it the desired arguments, depending on the " +"situation." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.compute_loss:1 +msgid "" +"How the loss is computed by Trainer. By default, all models return the " +"loss in the first element. Subclass and override for custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:1 +msgid "Perform a training step on a batch of inputs." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:3 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:3 +msgid "Subclass and override to inject custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:5 +msgid "The model to train." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:7 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:7 +msgid "" +"The inputs and targets of the model. The dictionary will be unpacked " +"before being fed to the model. Most models expect the targets under the " +"argument `labels`. Check your model's documentation for all accepted " +"arguments." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:7 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:7 +msgid "The inputs and targets of the model." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:9 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:9 +msgid "" +"The dictionary will be unpacked before being fed to the model. Most " +"models expect the targets under the argument `labels`. Check your model's" +" documentation for all accepted arguments." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:13 +msgid "The tensor with training loss on this batch." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:14 +msgid "`paddle.Tensor`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.save_model:1 +msgid "Will save the model, so you can reload it using `from_pretrained()`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.save_model:3 +msgid "Will only save from the main process." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:1 +msgid "Export paddle inference model." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:3 +msgid "" +"InputSpec describes the signature information of the model input, such as" +" shape , dtype , name. Defaults to None." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:6 +msgid "Load best model. Defaults to False." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:8 +msgid "Output dir to save the exported model. Defaults to None." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.log:1 +msgid "Log `logs` on the various objects watching training." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.log:3 +msgid "Subclass and override this method to inject custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.log:5 +msgid "The values to log." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:1 +msgid "Run evaluation and returns metrics." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:3 +msgid "" +"The calling script will be responsible for providing a method to compute " +"metrics, as they are task-dependent (pass it to the init " +"`compute_metrics` argument)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:6 +msgid "You can also subclass and override this method to inject custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:8 +msgid "" +"Pass a dataset if you wish to override `self.eval_dataset`. If it is an " +"`datasets.Dataset`, columns not accepted by the `model.forward()` method " +"are automatically removed. It must implement the `__len__` method." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:12 +#: paddlenlp.trainer.trainer_base.Trainer.predict:7 +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step:14 +msgid "" +"A list of keys in the output of your model (if it is a dictionary) that " +"should be ignored when gathering predictions." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:15 +msgid "" +"An optional prefix to be used as the metrics key prefix. For example the " +"metrics \"bleu\" will be named \"eval_bleu\" if the prefix is \"eval\" " +"(default)" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:19 +msgid "" +"A dictionary containing the evaluation loss and the potential metrics " +"computed from the predictions. The dictionary also contains the epoch " +"number which comes from the training state." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluation_loop:1 +msgid "" +"Prediction/evaluation loop, shared by `Trainer.evaluate()` and " +"`Trainer.predict()`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluation_loop:3 +msgid "Works both with or without labels." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:1 +msgid "" +"Run prediction and returns predictions and potential metrics. Depending " +"on the dataset and your use case, your test dataset may contain labels. " +"In that case, this method will also return metrics, like in `evaluate()`." +" :param test_dataset: Dataset to run the predictions on. If it is an " +"`datasets.Dataset`, columns not accepted by the" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:5 +msgid "" +"`model.forward()` method are automatically removed. Has to implement the " +"method `__len__`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:10 +msgid "" +"An optional prefix to be used as the metrics key prefix. For example the " +"metrics \"bleu\" will be named \"test_bleu\" if the prefix is \"test\" " +"(default)" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:14 +msgid "" +" If your predictions or labels have different sequence length (for " +"instance because you're doing dynamic padding in a token classification " +"task) the predictions will be padded (on the right) to allow for " +"concatenation into one array. The padding index is -100. Returns: " +"*NamedTuple* A namedtuple with the following keys:" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:20 +msgid "predictions (`np.ndarray`): The predictions on `test_dataset`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:21 +msgid "" +"label_ids (`np.ndarray`, *optional*): The labels (if the dataset " +"contained some)." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:22 +msgid "" +"metrics (`Dict[str, float]`, *optional*): The potential dictionary of " +"metrics (if the dataset contained labels)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:1 +msgid "Perform an evaluation step on `model` using `inputs`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:5 +msgid "The model to evaluate." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:12 +msgid "Whether or not to return the loss only." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:18 +msgid "A tuple with the loss, logits and labels (each being optional)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.num_examples:1 +msgid "" +"Helper to get number of samples in a [`~paddle.io.DataLoader`] by " +"accessing its dataset." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.num_examples:3 +msgid "" +"Will raise an exception if the underlying dataset does not implement " +"method `__len__`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.is_local_process_zero:1 +msgid "" +"Whether or not this process is the local (e.g., on one machine if " +"training in a distributed fashion on several machines) main process." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.is_world_process_zero:1 +msgid "" +"Whether or not this process is the global main process (when training in " +"a distributed fashion on several machines, this is only going to be " +"`True` for one process)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.print_config:1 +msgid "print config values" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_callback.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_callback.po new file mode 100644 index 0000000000000000000000000000000000000000..93c3875a65cf32579498593eb6974e598f24d90b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_callback.po @@ -0,0 +1,431 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.trainer_callback.rst:2 +msgid "trainer\\_callback" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback:1 +msgid "Callbacks to use with the Trainer class and customize the training loop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:1 +#: paddlenlp.trainer.trainer_callback.TrainerControl:1 +#: paddlenlp.trainer.trainer_callback.TrainerState:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:1 +msgid "" +"A class containing the [`Trainer`] inner state that will be saved along " +"the model and optimizer when checkpointing and passed to the " +"[`TrainerCallback`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:4 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:6 +msgid "" +"In all this class, one step is to be understood as one update step. 
When " +"using gradient accumulation, one update step may require several forward " +"and backward passes: if you use `gradient_accumulation_steps=n`, then one" +" update step requires going through *n* batches." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:10 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback +#: paddlenlp.trainer.trainer_callback.TrainerCallback +#: paddlenlp.trainer.trainer_callback.TrainerControl +#: paddlenlp.trainer.trainer_callback.TrainerState +msgid "参数" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:12 +msgid "" +"Only set during training, will represent the epoch the training is at " +"(the decimal part being the percentage of the current epoch completed)." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:15 +msgid "During training, represents the number of update steps completed." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:17 +msgid "The number of update steps to do during the current training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:19 +msgid "" +"The total number of floating operations done by the model since the " +"beginning of training (stored as floats to avoid overflow)." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:22 +msgid "The list of logs done since the beginning of training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:24 +msgid "" +"When tracking the best model, the value of the best metric encountered so" +" far." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:26 +msgid "" +"When tracking the best model, the value of the name of the checkpoint for" +" the best model encountered so far." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:29 +msgid "" +"Whether or not this process is the local (e.g., on one machine if " +"training in a distributed fashion on several machines) main process." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:32 +msgid "" +"Whether or not this process is the global main process (when training in " +"a distributed fashion on several machines, this is only going to be " +"`True` for one process)." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState.save_to_json:1 +msgid "Save the content of this instance in JSON format inside `json_path`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState.load_from_json:1 +msgid "Create an instance from the content of `json_path`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:1 +msgid "" +"A class that handles the [`Trainer`] control flow. This class is used by " +"the [`TrainerCallback`] to activate some switches in the training loop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:4 +msgid "" +"Whether or not the training should be interrupted. If `True`, this " +"variable will not be set back to `False`. The training will just stop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:4 +msgid "Whether or not the training should be interrupted." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:6 +msgid "" +"If `True`, this variable will not be set back to `False`. The training " +"will just stop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:8 +msgid "" +"Whether or not the current epoch should be interrupted. 
If `True`, this " +"variable will be set back to `False` at the beginning of the next epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:8 +msgid "Whether or not the current epoch should be interrupted." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:10 +msgid "" +"If `True`, this variable will be set back to `False` at the beginning of " +"the next epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:12 +msgid "" +"Whether or not the model should be saved at this step. If `True`, this " +"variable will be set back to `False` at the beginning of the next step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:12 +msgid "Whether or not the model should be saved at this step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:14 +#: paddlenlp.trainer.trainer_callback.TrainerControl:18 +#: paddlenlp.trainer.trainer_callback.TrainerControl:22 +msgid "" +"If `True`, this variable will be set back to `False` at the beginning of " +"the next step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:16 +msgid "" +"Whether or not the model should be evaluated at this step. If `True`, " +"this variable will be set back to `False` at the beginning of the next " +"step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:16 +msgid "Whether or not the model should be evaluated at this step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:20 +msgid "" +"Whether or not the logs should be reported at this step. If `True`, this" +" variable will be set back to `False` at the beginning of the next step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:20 +msgid "Whether or not the logs should be reported at this step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:1 +msgid "" +"A class for objects that will inspect the state of the training loop at " +"some events and take some decisions. At each of those events the " +"following arguments are available:" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:4 +msgid "The training arguments used to instantiate the [`Trainer`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:6 +msgid "The current state of the [`Trainer`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:8 +msgid "" +"The object that is returned to the [`Trainer`] and can be used to make " +"some decisions." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:10 +msgid "The model being trained." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:12 +msgid "The tokenizer used for encoding the data." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:14 +msgid "The optimizer used for the training steps." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:16 +msgid "The scheduler used for setting the learning rate." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:18 +#: paddlenlp.trainer.trainer_callback.TrainerCallback:20 +msgid "The current dataloader used for training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:22 +msgid "" +"The metrics computed by the last evaluation phase. Those are only " +"accessible in the event `on_evaluate`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:22 +msgid "The metrics computed by the last evaluation phase." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:24 +msgid "Those are only accessible in the event `on_evaluate`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:26 +msgid "The values to log. Those are only accessible in the event `on_log`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:26 +msgid "The values to log." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:28 +msgid "Those are only accessible in the event `on_log`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:31 +msgid "" +"The `control` object is the only one that can be changed by the callback," +" in which case the event that changes it should return the modified " +"version." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:34 +msgid "" +"The argument `args`, `state` and `control` are positionals for all " +"events, all the others are grouped in `kwargs`. You can unpack the ones " +"you need in the signature of the event using them. As an example, see the" +" code of the simple [`~transformer.PrinterCallback`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:38 +msgid "Example:" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:40 +msgid "```python class PrinterCallback(TrainerCallback):" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:44 +msgid "def on_log(self, args, state, control, logs=None, **kwargs):" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:43 +msgid "_ = logs.pop(\"total_flos\", None) if state.is_local_process_zero:" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:45 +msgid "print(logs)" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:46 +msgid "```" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_init_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_init_end:1 +msgid "Event called at the end of the initialization of the [`Trainer`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_train_begin:1 +#: paddlenlp.trainer.trainer_callback.EarlyStoppingCallback.on_train_begin:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_train_begin:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_train_begin:1 +msgid "Event called at the beginning of training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_train_end:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_train_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_train_end:1 +msgid "Event called at the end of training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_epoch_begin:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_epoch_begin:1 +msgid "Event called at the beginning of an epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_epoch_end:1 +#: paddlenlp.trainer.trainer_callback.DefaultFlowCallback.on_epoch_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_epoch_end:1 +msgid "Event called at the end of an epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_step_begin:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_step_begin:1 +msgid "" +"Event called at the beginning of a training step. If using gradient " +"accumulation, one training step might take several inputs." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_substep_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_substep_end:1 +msgid "Event called at the end of an substep during gradient accumulation." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_step_end:1 +#: paddlenlp.trainer.trainer_callback.DefaultFlowCallback.on_step_end:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_step_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_step_end:1 +msgid "" +"Event called at the end of a training step. If using gradient " +"accumulation, one training step might take several inputs." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_evaluate:1 +#: paddlenlp.trainer.trainer_callback.EarlyStoppingCallback.on_evaluate:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_evaluate:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_evaluate:1 +msgid "Event called after an evaluation phase." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_save:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_save:1 +msgid "Event called after a checkpoint save." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_log:1 +#: paddlenlp.trainer.trainer_callback.PrinterCallback.on_log:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_log:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_log:1 +msgid "Event called after logging the last logs." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_prediction_step:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_prediction_step:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_prediction_step:1 +msgid "Event called after a prediction step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler:1 +#: paddlenlp.trainer.trainer_callback.DefaultFlowCallback:1 +#: paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:1 +#: paddlenlp.trainer.trainer_callback.PrinterCallback:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback:1 +msgid "基类::class:`paddlenlp.trainer.trainer_callback.TrainerCallback`" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler:1 +msgid "Internal class that just calls the list of callbacks in order." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.DefaultFlowCallback:1 +msgid "" +"A [`TrainerCallback`] that handles the default flow of the training loop " +"for logs, evaluation and checkpoints." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.ProgressCallback:1 +msgid "" +"A [`TrainerCallback`] that displays the progress of training or " +"evaluation." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.PrinterCallback:1 +msgid "A bare [`TrainerCallback`] that just prints the logs." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:1 +msgid "A [`TrainerCallback`] that handles early stopping." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:3 +msgid "" +"Use with `metric_for_best_model` to stop training when the specified " +"metric worsens for `early_stopping_patience` evaluation calls." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:6 +msgid "" +"Use with TrainingArguments `metric_for_best_model` and " +"`early_stopping_patience` to denote how much the specified metric must " +"improve to satisfy early stopping conditions. 
`" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:10 +msgid "" +"This callback depends on [`TrainingArguments`] argument " +"*load_best_model_at_end* functionality to set best_metric in " +"[`TrainerState`]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..f4c282e4865cab4d856419967bbacf392c4dfb33 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_utils.po @@ -0,0 +1,246 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.trainer_utils.rst:2 +msgid "trainer\\_utils" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils:1 +msgid "Utilities for the Trainer class." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.ExplicitEnum:1 +msgid "基类::class:`enum.Enum`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.ExplicitEnum:1 +msgid "Enum with more explicit error message for missing values." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput:1 +#: paddlenlp.trainer.trainer_utils.EvalPrediction:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput:1 +msgid "基类::class:`tuple`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvalPrediction:1 +msgid "Evaluation output (always contains labels), to be used to compute metrics." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun +#: paddlenlp.trainer.trainer_utils.EvalPrediction +#: paddlenlp.trainer.trainer_utils.default_compute_objective +msgid "参数" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvalPrediction:3 +msgid "Predictions of the model." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvalPrediction:5 +msgid "Targets to be matched." 
+msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.BestRun.run_id:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput.predictions:1 +#: paddlenlp.trainer.trainer_utils.EvalPrediction.predictions:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput.predictions:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput.global_step:1 +msgid "Alias for field number 0" +msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.BestRun.objective:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput.label_ids:1 +#: paddlenlp.trainer.trainer_utils.EvalPrediction.label_ids:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput.label_ids:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput.training_loss:1 +msgid "Alias for field number 1" +msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.BestRun.hyperparameters:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput.metrics:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput.metrics:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput.metrics:1 +msgid "Alias for field number 2" +msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.EvalLoopOutput.num_samples:1 +msgid "Alias for field number 3" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvaluationStrategy:1 +#: paddlenlp.trainer.trainer_utils.IntervalStrategy:1 +#: paddlenlp.trainer.trainer_utils.OptimizerNames:1 +#: paddlenlp.trainer.trainer_utils.SchedulerType:1 +msgid "基类::class:`paddlenlp.trainer.trainer_utils.ExplicitEnum`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvaluationStrategy:1 +#: paddlenlp.trainer.trainer_utils.IntervalStrategy:1 +#: paddlenlp.trainer.trainer_utils.SchedulerType:1 +msgid "An enumeration." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.OptimizerNames:1 +msgid "Stores the acceptable string identifiers for optimizers." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:1 +msgid "" +"The best run found by an hyperparameter search (see " +"[`~Trainer.hyperparameter_search`])." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:3 +msgid "" +"The id of the best run (if models were saved, the corresponding " +"checkpoint will be in the folder ending with run-{run_id})." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:6 +msgid "The objective that was obtained for this run." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:8 +msgid "The hyperparameters picked to get this run." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:1 +msgid "" +"The default objective to maximize/minimize when doing an hyperparameter " +"search. It is the evaluation loss if no metrics are provided to the " +"[`Trainer`], the sum of all metrics otherwise." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:4 +msgid "The metrics returned by the evaluate method." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:7 +msgid "The objective to minimize or maximize" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回类型" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:8 +msgid "`float`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.is_main_process:1 +msgid "" +"Whether or not the current process is the local process, based on " +"`xm.get_ordinal()` (for TPUs) first, then on `local_rank`." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.total_processes_number:1 +msgid "" +"Return the number of processes launched in parallel. Works with " +"`paddle.distributed` and TPUs." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:1 +msgid "Measure and return speed performance metrics." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:3 +msgid "" +"This function requires a time snapshot `start_time` before the operation " +"to be measured starts and this function should be run immediately after " +"the operation to be measured has completed." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:6 +msgid "Args:" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:8 +msgid "split: name to prefix metric (like train, eval, test...)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:9 +msgid "start_time: operation start time" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:10 +msgid "num_samples: number of samples processed" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:1 +msgid "" +"Reformat Trainer metrics values to a human-readable format :param " +"metrics: The metrics returned from train/evaluate/predict :type metrics: " +"`Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:5 +msgid "The reformatted metrics" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:6 +msgid "metrics (`Dict[str, float]`)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.log_metrics:1 +msgid "" +"Log metrics in a specially formatted way Under distributed environment " +"this is done only for a process with rank 0. :param split: Mode/split " +"name: one of `train`, `eval`, `test` :type split: `str` :param metrics: " +"The metrics returned from train/evaluate/predictmetrics: metrics dict " +":type metrics: `Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:1 +msgid "" +"Save metrics into a json file for that split, e.g. `train_results.json`. " +"Under distributed environment this is done only for a process with rank " +"0. :param split: Mode/split name: one of `train`, `eval`, `test`, `all` " +":type split: `str` :param metrics: The metrics returned from " +"train/evaluate/predict :type metrics: `Dict[str, float]` :param combined:" +" Creates combined metrics by updating `all_results.json` with metrics of " +"this call :type combined: `bool`, *optional*, defaults to `True`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:10 +msgid "" +"To understand the metrics please read the docstring of " +"[`~Trainer.log_metrics`]. The only difference is that raw unformatted " +"numbers are saved in the current method." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_state:1 +msgid "" +"Saves the Trainer state, since Trainer.save_model saves only the " +"tokenizer with the model Under distributed environment this is done only " +"for a process with rank 0." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.has_length:1 +msgid "Checks if the dataset implements __len__() and it doesn't raise an error" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.training_args.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.training_args.po new file mode 100644 index 0000000000000000000000000000000000000000..66c243a762ec95dd7f65d87989d1dcacf2279880 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.training_args.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.training_args.rst:2 +msgid "training\\_args" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.helper.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.helper.po new file mode 100644 index 0000000000000000000000000000000000000000..b1f0d1ac761cbebd8862c96037a02be0ad79be8c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.helper.po @@ -0,0 +1,49 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.utils.helper.rst:2 +msgid "helper" +msgstr "" + +#: of paddlenlp.trainer.utils.helper.paddle_pad_and_concatenate:1 +msgid "" +"Concatenates `tensor1` and `tensor2` on first axis, applying padding on " +"the second if necessary." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_concat:1 +msgid "" +"Concat the `new_tensors` to `tensors` on the first dim and pad them on " +"the second if needed. Works for tensors or nested list/tuples of tensors." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_detach:1 +msgid "Detach `tensors` (even if it's a nested list/tuple of tensors)." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_numpify:1 +msgid "Numpify `tensors` (even if it's a nested list/tuple of tensors)." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_truncate:1 +msgid "" +"Truncate `tensors` at `limit` (even if it's a nested list/tuple of " +"tensors)." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..80cf9f323f57b8fc7d8e34db07c04e2803a3d014 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.utils.rst:2 +msgid "utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..a9813e3bbb7f70eee4b39e969f97ede513a0fc2d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.modeling.po @@ -0,0 +1,774 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.albert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling:1 +msgid "Modeling classes for ALBERT model." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertPretrainedModel:1 +msgid "" +"An abstract class for pretrained ALBERT models. It provides ALBERT " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM:1 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice:1 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:1 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:1 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:1 +#: paddlenlp.transformers.albert.modeling.AlbertModel:1 +msgid "基类::class:`paddlenlp.transformers.albert.modeling.AlbertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:1 +msgid "The bare Albert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. 
Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertModel +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `AlbertModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `AlbertModel`." +" Defaults to `30000`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:14 +msgid "Dimensionality of the embedding layer. Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:16 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:18 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:20 +msgid "Number of hidden groups in the Transformer encoder. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:22 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:25 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:29 +msgid "Number of inner groups in a hidden group. Default to `1`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:31 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:35 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:38 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0`." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:41 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:44 +msgid "The vocabulary size of `token_type_ids`. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:46 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:46 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:49 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:52 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. A small value to the variance " +"added to the normalization layer to prevent division by zero. Default to " +"`1e-12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:56 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:58 +msgid "Whether or not to add the pooling layer. Default to `False`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:1 +msgid "The AlbertModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:16 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. 
Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:16 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:21 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:22 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:24 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:27 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:31 +msgid "" +"Mask to nullify selected heads of the self-attention modules. Masks " +"values can either be 0 or 1: - 1 indicates the head is **not masked**, -" +" 0 indicated the head is **masked**." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:31 +msgid "" +"Mask to nullify selected heads of the self-attention modules. Masks " +"values can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:33 +msgid "1 indicates the head is **not masked**," +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:34 +msgid "0 indicated the head is **masked**." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:36 +msgid "" +"If you want to control how to convert `inputs_ids` indices into " +"associated vectors, you can pass an embedded representation directly " +"instead of passing `inputs_ids`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:39 +msgid "" +"Whether or not to return a dict instead of a plain tuple. Default to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or a dict with " +"`last_hidden_state`, `pooled_output`, `all_hidden_states`, " +"`all_attentions` fields. With the fields: - `sequence_output` (Tensor):" +" Sequence of hidden-states at the last layer of the model. It's " +"data type should be float32 and has a shape of [`batch_size, " +"sequence_length, hidden_size`]. - `pooled_output` (Tensor): The " +"output of first token (`[CLS]`) in sequence. We \"pool\" the model by " +"simply taking the hidden state corresponding to the first token. 
Its " +"data type should be float32 and has a shape of [batch_size, " +"hidden_size]. - `last_hidden_state` (Tensor): The output of the last " +"encoder layer, it is also the `sequence_output`. It's data type should" +" be float32 and has a shape of [batch_size, sequence_length, " +"hidden_size]. - `all_hidden_states` (Tensor): Hidden_states of all " +"layers in the Transformer encoder. The length of `all_hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. - `all_attentions` (Tensor): Attentions of all layers " +"of in the Transformer encoder. The length of `all_attentions` is " +"`num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads, " +"sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or a dict with " +"`last_hidden_state`, `pooled_output`, `all_hidden_states`, " +"`all_attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:21 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:25 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:25 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:20 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:20 +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward:45 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:49 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:48 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and has a shape of [`batch_size, sequence_length, " +"hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:55 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:52 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and has a shape of [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:59 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:58 +msgid "" +"The output of the last encoder layer, it is also the `sequence_output`. " +"It's data type should be float32 and has a shape of [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:63 +msgid "`all_hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:62 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`all_hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:67 +msgid "`all_attentions` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:66 +msgid "" +"Attentions of all layers of in the Transformer encoder. The length of " +"`all_attentions` is `num_hidden_layers`. For all element in the tuple, " +"its data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward:72 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining:1 +msgid "" +"Albert Model with a `masked language modeling` head and a `sentence order" +" prediction` head on top." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM:3 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:4 +msgid "An instance of :class:`AlbertModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining:6 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:8 +msgid "An instance of :class:`AlbertSOPHead`." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:15 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:19 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:10 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:19 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:15 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:15 +msgid "See :class:`AlbertModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:1 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:1 +msgid "" +"The AlbertForPretraining forward method, overrides the __call__() special" +" method." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:15 +msgid "" +"Labels of the next sequence prediction. Input should be a sequence pair " +"Indices should be 0 or 1. ``0`` indicates original order (sequence A, " +"then sequence B), and ``1`` indicates switched order (sequence B, then " +"sequence A). Defaults to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:22 +msgid "" +"Returns tuple (`prediction_scores`, `sop_scores`) or a dict with " +"`prediction_logits`, `sop_logits`, `pooled_output`, `hidden_states`, " +"`attentions` fields. With the fields: - `prediction_scores` (Tensor):" +" The scores of masked token prediction. Its data type should be " +"float32. and its shape is [batch_size, sequence_length, vocab_size]." +" - `sop_scores` (Tensor): The scores of sentence order prediction." +" Its data type should be float32 and its shape is [batch_size, 2]. -" +" `prediction_logits` (Tensor): The scores of masked token prediction." +" Its data type should be float32. and its shape is [batch_size, " +"sequence_length, vocab_size]. - `sop_logits` (Tensor): The scores of" +" sentence order prediction. Its data type should be float32 and its " +"shape is [batch_size, 2]. - `hidden_states` (Tensor): Hidden_states " +"of all layers in the Transformer encoder. The length of `hidden_states` " +"is `num_hidden_layers + 1`. For all element in the tuple, its data " +"type should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. - `attentions` (Tensor): Attentions of all layers of " +"in the Transformer encoder. The length of `attentions` is " +"`num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads," +" sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:22 +msgid "" +"Returns tuple (`prediction_scores`, `sop_scores`) or a dict with " +"`prediction_logits`, `sop_logits`, `pooled_output`, `hidden_states`, " +"`attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:25 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:29 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:24 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:28 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:28 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:36 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:33 +msgid "`sop_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:40 +msgid "" +"The scores of sentence order prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:37 +msgid "`prediction_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:41 +msgid "`sop_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:33 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:33 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:45 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:28 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:28 +msgid "`hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:44 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:27 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:27 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:49 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:32 +msgid "`attentions` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:36 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:36 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:48 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:31 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:31 +msgid "" +"Attentions of all layers of in the Transformer encoder. The length of " +"`attentions` is `num_hidden_layers`. For all element in the tuple, its " +"data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM:1 +msgid "Albert Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:18 +msgid "" +"Returns tensor `prediction_scores` or a dict with `logits`, " +"`hidden_states`, `attentions` fields. With the fields: - " +"`prediction_scores` (Tensor): The scores of masked token prediction. " +"Its data type should be float32. and its shape is [batch_size, " +"sequence_length, vocab_size]. - `logits` (Tensor): The scores of " +"masked token prediction. Its data type should be float32. and its " +"shape is [batch_size, sequence_length, vocab_size]. - `hidden_states` " +"(Tensor): Hidden_states of all layers in the Transformer encoder. The" +" length of `hidden_states` is `num_hidden_layers + 1`. For all " +"element in the tuple, its data type should be float32 and its shape is " +"[`batch_size, sequence_length, hidden_size`]. 
- `attentions` (Tensor):" +" Attentions of all layers of in the Transformer encoder. The length " +"of `attentions` is `num_hidden_layers`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:18 +msgid "" +"Returns tensor `prediction_scores` or a dict with `logits`, " +"`hidden_states`, `attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:29 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:24 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:24 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:1 +msgid "" +"Albert Model with a linear layer on top of the output layer, designed for" +" sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice:4 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:4 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:4 +msgid "An instance of AlbertModel." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:6 +msgid "The dropout probability for the classifier. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:9 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:1 +msgid "" +"The AlbertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:18 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields. With the fields: - `logits` (Tensor): A tensor" +" of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32. - `hidden_states` (Tensor): " +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]. - `attentions` (Tensor): Attentions " +"of all layers of in the Transformer encoder. The length of `attentions` " +"is `num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads," +" sequence_length, sequence_length`]." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:18 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:18 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:23 +msgid "" +"A tensor of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:1 +msgid "" +"Albert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:1 +msgid "" +"The AlbertForTokenClassification forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:18 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields. With the fields: - `logits` (Tensor): A tensor" +" of the input token classification logits. Shape as `[batch_size, " +"sequence_length, num_classes]` and dtype as `float32`. - `hidden_states`" +" (Tensor): Hidden_states of all layers in the Transformer encoder. " +"The length of `hidden_states` is `num_hidden_layers + 1`. For all " +"element in the tuple, its data type should be float32 and its shape is " +"[`batch_size, sequence_length, hidden_size`]. - `attentions` (Tensor):" +" Attentions of all layers of in the Transformer encoder. The length " +"of `attentions` is `num_hidden_layers`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:23 +msgid "" +"A tensor of the input token classification logits. Shape as `[batch_size," +" sequence_length, num_classes]` and dtype as `float32`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice:1 +msgid "" +"Albert Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:1 +msgid "" +"The AlbertForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:15 +msgid "Start positions of the text. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:17 +msgid "End positions of the text. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:22 +msgid "" +"Returns tensor `reshaped_logits` or a dict with `reshaped_logits`, " +"`hidden_states`, `attentions` fields. With the fields: - " +"`reshaped_logits` (Tensor): A tensor of the input multiple choice " +"classification logits. Shape as `[batch_size, num_classes]` and dtype" +" as `float32`. - `hidden_states` (Tensor): Hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. - `attentions` (Tensor): Attentions of all layers of " +"in the Transformer encoder. The length of `attentions` is " +"`num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads," +" sequence_length, sequence_length`]." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:22 +msgid "" +"Returns tensor `reshaped_logits` or a dict with `reshaped_logits`, " +"`hidden_states`, `attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:29 +msgid "`reshaped_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:28 +msgid "" +"A tensor of the input multiple choice classification logits. Shape as " +"`[batch_size, num_classes]` and dtype as `float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.po new file mode 100644 index 0000000000000000000000000000000000000000..cae0d3112373564530ab46219a5a4e609b285287 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.albert.rst:2 +msgid "albert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..7bfcd12e49d88de3ec2593263e16747bb77c1b66 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.tokenizer.po @@ -0,0 +1,348 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.albert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer:1 +msgid "Tokenization class for ALBERT model." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:1 +msgid "Constructs an Albert tokenizer based on SentencePiece or `BertTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.save_resources +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:7 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:10 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:13 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:15 +msgid "Whether or note to remove space when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:17 +msgid "Whether or note to keep accents when tokenizing. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:19 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:23 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:26 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:29 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:32 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:38 +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:10 +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:10 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:1 +msgid "Converts a string to a list of tokens." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (list of string) to a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:3 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:6 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:4 +msgid "An Albert sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A Albert offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.attention_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.attention_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..9c623fbf987f8e82eb42b883afabcd0b2f245da3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.attention_utils.po @@ -0,0 +1,73 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.attention_utils.rst:2 +msgid "attention\\_utils" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward:1 +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward:1 +#: paddlenlp.transformers.attention_utils.Linear3D.forward:1 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward +#: paddlenlp.transformers.attention_utils.Linear3D.forward +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward:4 +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward:4 +#: paddlenlp.transformers.attention_utils.Linear3D.forward:4 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward:6 +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward:6 +#: paddlenlp.transformers.attention_utils.Linear3D.forward:6 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.BigBirdSparseAttention.forward:1 +msgid "" +"query_matrix: [B, H, T, D] key_matrix: [B, H, T, D] value_matrix: [B, H, " +"T, D] query_mask: [B, 1, T, 1] bool mask key_mask: [B, 1, 1, T] bool " +"mask rand_mask_idx: [H, T//bs, bs] Global Attention Random Attention " +"Window Attention" +msgstr "" + +#: ../docstring of +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.Cache.k:1 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.StaticCache.k:1 +msgid "Alias for field number 0" +msgstr "" + +#: ../docstring of +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.Cache.v:1 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.StaticCache.v:1 +msgid "Alias for field number 1" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..c9de1097ab21438a2f20d4000dd46c19954917df --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.modeling.po @@ -0,0 +1,409 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.auto.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder:1 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator:1 +#: paddlenlp.transformers.auto.modeling.AutoEncoder:1 +#: paddlenlp.transformers.auto.modeling.AutoGenerator:1 +#: paddlenlp.transformers.auto.modeling.AutoModel:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification:1 +msgid "基类::class:`paddlenlp.transformers.auto.modeling._BaseAutoModelClass`" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel:1 +msgid "" +"AutoClass can help you automatically retrieve the relevant model given " +"the provided pretrained weights/vocabulary. AutoModel is a generic model " +"class that will be instantiated as one of the base model classes when " +"created with the from_pretrained() classmethod." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModel`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of a built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:8 +msgid "Name of a built-in pretrained model" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:10 +msgid "" +"Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:13 +msgid "" +"Specify a downstream task. Task can be 'Model', 'ForPretraining', " +"'ForSequenceClassification', 'ForTokenClassification', " +"'ForQuestionAnswering', 'ForMultipleChoice', 'ForMaskedLM', " +"'ForCausalLM', 'Encoder', 'Decoder', 'Generator', 'Discriminator', " +"'ForConditionalGeneration'. We only support specify downstream tasks in " +"AutoModel. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:19 +msgid "" +"Position arguments for model `__init__`. If provided, use these as " +"position argument values for model initialization." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:22 +msgid "" +"Keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for model initialization. If the " +"keyword is in `__init__` argument names of base model, update argument " +"values of the base model; else update argument values of derived model." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:29 +msgid "An instance of `AutoModel`." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:33 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:16 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForPretraining:1 +msgid "AutoModelForPretraining." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForPretraining`. Model weights are " +"loaded by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:9 +msgid "See :class:`AutoModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:12 +msgid "An instance of `AutoModelForPretraining`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification:1 +msgid "AutoModelForSequenceClassification." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForSequenceClassification`. 
Model " +"weights are loaded by specifying name of a built-in pretrained model, or " +"a community contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:12 +msgid "An instance of `AutoModelForSequenceClassification`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification:1 +msgid "AutoModelForTokenClassification." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForTokenClassification`. Model weights " +"are loaded by specifying name of a built-in pretrained model, or a " +"community contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:12 +msgid "An instance of `AutoModelForTokenClassification`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering:1 +msgid "AutoModelForQuestionAnswering." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForQuestionAnswering`. Model weights are" +" loaded by specifying name of a built-in pretrained model, or a community" +" contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:12 +msgid "An instance of `AutoModelForQuestionAnswering`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice:1 +msgid "AutoModelForMultipleChoice." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForMultipleChoice`. Model weights are " +"loaded by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:12 +msgid "An instance of `AutoModelForMultipleChoice`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM:1 +msgid "AutoModelForMaskedLM." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForMaskedLM`. Model weights are loaded " +"by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:12 +msgid "An instance of `AutoModelForMaskedLM`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForCausalLM:1 +msgid "AutoModelForCausalLM." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForCausalLM`. Model weights are loaded " +"by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:12 +msgid "An instance of `AutoModelForCausalLM`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoEncoder:1 +msgid "AutoEncoder." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:1 +msgid "" +"Creates an instance of `AutoEncoder`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:12 +msgid "An instance of `AutoEncoder`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder:1 +msgid "AutoDecoder." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:1 +msgid "" +"Creates an instance of `AutoDecoder`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:12 +msgid "An instance of `AutoDecoder`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoGenerator:1 +msgid "AutoGenerator." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:1 +msgid "" +"Creates an instance of `AutoGenerator`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:12 +msgid "An instance of `AutoGenerator`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDiscriminator:1 +msgid "AutoDiscriminator." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:1 +msgid "" +"Creates an instance of `AutoDiscriminator`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:12 +msgid "An instance of `AutoDiscriminator`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration:1 +msgid "AutoModelForConditionalGeneration." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForConditionalGeneration`. Model weights" +" are loaded by specifying name of a built-in pretrained model, or a " +"community contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:12 +msgid "An instance of `AutoModelForConditionalGeneration`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.po new file mode 100644 index 0000000000000000000000000000000000000000..fe7904e1e221f52a3329cd5c1afb345b8f51d758 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.auto.rst:2 +msgid "auto" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..fe94e163b5fc6b803171e20c77f5df398cb4ae9a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.tokenizer.po @@ -0,0 +1,101 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.auto.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer:1 +msgid "" +"AutoClass can help you automatically retrieve the relevant model given " +"the provided pretrained weights/vocabulary. AutoTokenizer is a generic " +"tokenizer class that will be instantiated as one of the base tokenizer " +"classes when created with the AutoTokenizer.from_pretrained() " +"classmethod." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:1 +msgid "" +"Creates an instance of `AutoTokenizer`. Related resources are loaded by " +"specifying name of a built-in pretrained model, or a community-" +"contributed pretrained model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains tokenizer related" +" resources and tokenizer config file (\"tokenizer_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:8 +msgid "Name of built-in pretrained model" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:10 +msgid "" +"Local directory path which contains tokenizer related resources and " +"tokenizer config file (\"tokenizer_config.json\")." 
+msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:13 +msgid "" +"position arguments for model `__init__`. If provided, use these as " +"position argument values for tokenizer initialization." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:16 +msgid "" +"keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for tokenizer initialization." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:21 +msgid "An instance of `PretrainedTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:25 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..27542df560b8f46d3cb84530968371892b44ba30 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.modeling.po @@ -0,0 +1,505 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bart.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder:1 +#: paddlenlp.transformers.bart.modeling.BartEncoder:1 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration:1 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering:1 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification:1 +#: paddlenlp.transformers.bart.modeling.BartModel:1 +msgid "基类::class:`paddlenlp.transformers.bart.modeling.BartPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:1 +msgid "The bare Bart Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead.forward +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward +#: paddlenlp.transformers.bart.modeling.BartModel +#: paddlenlp.transformers.bart.modeling.BartModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `BartModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:13 +msgid "" +"The beginning of sequence token that was used during pretraining. Can be " +"used a sequence classifier token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:17 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:20 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:23 +msgid "" +"Dimensionality of the embedding layer, encoder layer and decoder layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:25 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:27 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:29 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:32 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:35 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`encoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`encoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:40 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`decoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`decoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:45 +msgid "" +"The dropout probability used in all fully connected layers (pre-process " +"and post-process of MHA and FFN sub-layer) in the encoders and decoders. " +"Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:48 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:52 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"and decoder layers to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:55 +msgid "" +"The dropout probability used after FFN activation in all encoder layers " +"and decoder layers. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:58 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`1024`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:61 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Default to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:1 +msgid "The BartModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:18 +msgid "" +"Indices of decoder input sequence tokens in the vocabulary. Its data type" +" should be `int64` and it has a shape of [batch_size, sequence_length]. " +"Defaults to `None`, which means no `decoder_input_ids` is provided, the " +"model will create the tensor by shifting the `input_ids` to the right." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:23 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions in `decoder_input_ids`. Its data type and shape is the" +" same as `attention_mask`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:26 +msgid "" +"The output of the encoder, a tuple consists `last_hidden_state`, " +"`hidden_states`(optional), `attentions`(optional). 
The data type of " +"`last_hidden_state` is float32 and its shape is `[batch_size, " +"sequence_length, hidden_size]`. `hidden_states` is hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. `attentions` is attentions of all layers of in the " +"Transformer encoder. The length of `attentions` is `num_hidden_layers`. " +"For all element in the tuple, its data type should be float32 and its " +"shape is [`batch_size, num_attention_heads, sequence_length, " +"sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:33 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:36 +msgid "" +"It is a list, and each element in the list is a tuple " +"`(incremental_cache, static_cache)`. See `TransformerDecoder.gen_cache " +"`__" +" for more details. It is only used for inference and should be None for " +"training. Default to `None`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward +#: paddlenlp.transformers.bart.modeling.BartModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward:14 +#: paddlenlp.transformers.bart.modeling.BartModel.forward:42 +msgid "" +"Returns tensor `decoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward +#: paddlenlp.transformers.bart.modeling.BartModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:31 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:32 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:23 +#: paddlenlp.transformers.bart.modeling.BartModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartPretrainedModel:1 +msgid "" +"An abstract class for pretrained Bart models. It provides Bart related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartEncoder:1 +msgid "" +"The Transformer Encoder of BartModel. The arguments of BartEncoder can " +"see :class:`BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartEncoder.forward:1 +msgid "The BartEncoder forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward:3 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:5 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:7 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:9 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:11 +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward:3 +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:3 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:7 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:9 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:11 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:13 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:15 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:27 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:3 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:7 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:9 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:11 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:13 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:15 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:3 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:7 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:9 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:11 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:13 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:15 +msgid "See :class:`BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartEncoder.forward:8 +msgid "" +"Returns tensor `encoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder:1 +msgid "" +"The Transformer Decoder of BartModel. The arguments of BartDecoder can " +"see :class:`BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward:1 +msgid "The BartDecoder forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead:1 +msgid "Perform sentence-level classification tasks." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead.forward:1 +msgid "Hidden states of the classification model." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForSequenceClassification:1 +msgid "" +"Bart Model with a linear layer on top of the pooled output, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForConditionalGeneration:3 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering:4 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification:4 +msgid "An instance of BartModel." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForSequenceClassification:6 +msgid "The number of different labels. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForSequenceClassification:8 +msgid "" +"The dropout probability for output of Bart. If None, use the same value " +"as `hidden_dropout_prob` of `BartModel` instance `bart`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:1 +msgid "" +"The BartForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:18 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_labels]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering:1 +msgid "" +"Bart Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:1 +msgid "" +"The BartForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:18 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:18 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:20 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:20 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:24 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. 
Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:27 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:27 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForConditionalGeneration:1 +msgid "Bart Model with a `language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:1 +msgid "" +"The BartForConditionalGeneration forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`). With the fields: - `lm_logits` (Tensor):" +" The generated sentence of the model. Its data type should be " +"float32 and has a shape of [batch_size, sequence_length, vocab_size]. - " +"`cache` (Tensor): See :class:`BartModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`)." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:24 +msgid "`lm_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:23 +msgid "" +"The generated sentence of the model. Its data type should be float32 and " +"has a shape of [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:26 +msgid "`cache` (Tensor):" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.po new file mode 100644 index 0000000000000000000000000000000000000000..919ed199b2784097af46434dab10db7746ab9113 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bart.rst:2 +msgid "bart" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..6deb0f8e7fd2f45fea08d0db4332e8830fc4c425 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.tokenizer.po @@ -0,0 +1,133 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bart.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:1 +msgid "Construct a BART tokenizer based on byte-level Byte-Pair-Encoding." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`. For more " +"information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:6 +msgid "" +"Path to the vocabulary file. The vocab file contains a mapping from " +"vocabulary strings to indices." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:9 +msgid "" +"Path to the merge file. The merge file is used to split the input " +"sentence into \"subword\" units. The vocab file is then used to encode " +"those units as intices." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:13 +msgid "Paradigm to follow when decoding bytes to UTF-8. Defaults to `'replace'`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:16 +msgid "The maximum value of the input sequence length. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:19 +msgid "" +"The beginning of sequence token that was used during pretraining. Can be " +"used a sequence classifier token. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:23 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:26 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:30 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:33 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:37 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:40 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. 
Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:46 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.bart.tokenizer.BartTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.tokenizer.BartTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.tokenizer.BartTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.faster_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.faster_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..c13b624495d40542f6bac89800a19d932e6f273c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.faster_tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.bert.fast_tokenizer.rst:2 +msgid "fast\\_tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..d401e85f0e2a4c953c1f4c969a7951399dbe8df2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.modeling.po @@ -0,0 +1,679 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM:1 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice:1 +#: paddlenlp.transformers.bert.modeling.BertForPretraining:1 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:1 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification:1 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:1 +#: paddlenlp.transformers.bert.modeling.BertModel:1 +msgid "基类::class:`paddlenlp.transformers.bert.modeling.BertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:1 +msgid "The bare BERT Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward +#: paddlenlp.transformers.bert.modeling.BertForPretraining +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward +#: paddlenlp.transformers.bert.modeling.BertModel +#: paddlenlp.transformers.bert.modeling.BertModel.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `BertModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:41 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:51 +msgid "" +"The non-linear activation function in the pooling layer. Defaults to " +"`\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:1 +msgid "The BertModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. 
Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward +#: paddlenlp.transformers.bert.modeling.BertModel.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. 
Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:14 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:10 +#: paddlenlp.transformers.bert.modeling.BertModel.forward:37 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:16 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:40 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:1 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:46 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:44 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:4 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:50 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:49 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward +#: paddlenlp.transformers.bert.modeling.BertModel.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:15 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:17 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:22 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:17 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:17 +#: paddlenlp.transformers.bert.modeling.BertModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainedModel:1 +msgid "" +"An abstract class for pretrained BERT models. It provides BERT related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining:1 +msgid "Bert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM:3 +#: paddlenlp.transformers.bert.modeling.BertForPretraining:3 +msgid "An instance of :class:`BertModel`." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:1 +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:1 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:9 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:9 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads:3 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads:5 +msgid "See :class:`BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:9 +msgid "See :class:`BertPretrainingHeads`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:12 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:14 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:12 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:14 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:19 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:21 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:17 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:19 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:22 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:24 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:22 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:24 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion:1 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `BertModel`. Defines the number of " +"different tokens that can be represented by the `inputs_ids` passed when " +"calling `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:16 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:20 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads:1 +msgid "Perform language modeling task and next sentence classification task." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads:7 +msgid "Activation function used in the language modeling task." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads:9 +msgid "" +"Decoding weights used to map hidden_states to logits of the masked token " +"prediction. Its data type should be float32 and its shape is [vocab_size," +" hidden_size]. Defaults to `None`, which means use the same weights of " +"the embedding layer." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:8 +msgid "" +"A tensor indicates positions to be masked in the position embedding. Its " +"data type should be int64 and its shape is [batch_size, mask_token_num]. " +"`mask_token_num` is the number of masked tokens. It should be no bigger " +"than `sequence_length`. Defaults to `None`, which means we output hidden-" +"states of all tokens in masked token prediction." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForSequenceClassification:1 +msgid "" +"Bert Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:4 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:4 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification:4 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:4 +msgid "An instance of BertModel." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForSequenceClassification:6 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForSequenceClassification:8 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:8 +msgid "" +"The dropout probability for output of BERT. If None, use the same value " +"as `hidden_dropout_prob` of `BertModel` instance `bert`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:1 +msgid "" +"The BertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForTokenClassification:1 +msgid "" +"Bert Model with a linear layer on top of the hidden-states output layer, " +"designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:1 +msgid "" +"The BertForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:1 +msgid "" +"Bert Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:6 +msgid "" +"The dropout probability for output of BERT. If None, use the same value " +"as `hidden_dropout_prob` of `BertModel` instance `bert`. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:1 +msgid "" +"The BertForQuestionAnswering forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:8 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:8 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:14 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:17 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:1 +msgid "" +"Bert Model with a linear layer on top of the hidden-states output layer, " +"designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:8 +msgid "" +"The dropout probability for output of Bert. If None, use the same value " +"as `hidden_dropout_prob` of `BertModel` instance `bert`. Defaults to " +"None." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:1 +msgid "" +"The BertForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:9 +msgid "" +"See :class:`BertModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM:1 +msgid "Bert Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:10 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.po new file mode 100644 index 0000000000000000000000000000000000000000..aba47451e1de48437f8d453adecb47974f673851 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert.rst:2 +msgid "bert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..00bf336ab2b5f380d378e30faff64451262ed0f9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.tokenizer.po @@ -0,0 +1,358 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer:1 +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer:1 +msgid "Runs basic tokenization (punctuation splitting, lower casing, etc.)." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer +#: paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer:3 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:1 +msgid "Tokenizes a piece of text using basic tokenizer." 
+msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:3 +msgid "A piece of text." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:6 +msgid "A list of tokens." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:10 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer:30 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:12 +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:1 +msgid "" +"Constructs a BERT tokenizer. It uses a basic tokenizer to do punctuation " +"splitting, lower casing and so on, and follows a WordPiece tokenizer to " +"tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:5 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:11 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:15 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:18 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." 
+msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:21 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:24 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:4 +msgid "A BERT sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A BERT offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:3 +msgid "A BERT sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:1 +msgid "Runs WordPiece tokenization." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:3 +msgid "Vocab of the word piece tokenizer." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:5 +msgid "A specific token to replace all unknown tokens." 
+msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:7 +msgid "" +"If a word's length is more than max_input_chars_per_word, it will be " +"dealt as unknown word. Defaults to 100." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:1 +msgid "" +"Tokenizes a piece of text into its word pieces. This uses a greedy " +"longest-match-first algorithm to perform tokenization using the given " +"vocabulary." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:5 +msgid "" +"A single token or whitespace separated tokens. This should have already " +"been passed through `BasicTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:8 +msgid "A list of wordpiece tokens." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.po new file mode 100644 index 0000000000000000000000000000000000000000..b016b4e1a04987291399e46e6c470de98c734560 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst:2 +msgid "convert\\_bert\\_japanese\\_params" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.po new file mode 100644 index 0000000000000000000000000000000000000000..6241941cf9728f6cb3ad6db6e5eac7b0fac5f568 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert_japanese.rst:2 +msgid "bert\\_japanese" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..1224431a96b89265c119e67ac812d918b49513f9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.tokenizer.po @@ -0,0 +1,152 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert_japanese.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:1 +msgid "Construct a BERT tokenizer for Japanese text, based on a MecabTokenizer." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:3 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:6 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`False`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:9 +msgid "Whether to do word tokenization. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:11 +msgid "Whether to do subword tokenization. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:13 +msgid "Type of word tokenizer. Defaults to`basic`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:15 +msgid "Type of subword tokenizer. Defaults to`wordpiece`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:17 +msgid "Kept for backward compatibility purposes. Defaults to`None`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:19 +msgid "Dictionary passed to the `MecabTokenizer` constructor." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:21 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:25 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:28 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:31 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:34 +msgid "" +"A special token representing a masked token. 
This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:40 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer:1 +#: paddlenlp.transformers.bert_japanese.tokenizer.MecabTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.MecabTokenizer:1 +msgid "Runs basic tokenization with MeCab morphological parser." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.MecabTokenizer.tokenize:1 +msgid "Tokenizes a piece of text." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer:1 +msgid "Runs Character tokenization." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:1 +msgid "Tokenizes a piece of text into characters." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:3 +msgid "" +"For example, `input = \"apple\"\"` wil return as output `[\"a\", \"p\", " +"\"p\", \"l\", \"e\"]`." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:5 +msgid "" +"A single token or whitespace separated tokens. This should have already " +"been passed through `BasicTokenizer`." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:8 +msgid "A list of characters." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..b074b8478b22ef35a31292718d056542217d2f4b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.modeling.po @@ -0,0 +1,793 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bigbird.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel:1 +msgid "基类::class:`paddlenlp.transformers.bigbird.modeling.BigBirdPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:1 +msgid "The bare BigBird Model outputting raw hidden-states." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM +#: paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:10 +msgid "Number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:12 +msgid "" +"Vocabulary size of `inputs_ids` in `BigBirdModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`BigBirdModel`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:15 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:17 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the Transformer encoder." +" Input tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"``, ``\"silu\"`` and ``\"gelu_new\"`` are " +"supported. Defaults to `\"gelu\"`." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:29 +msgid "" +"Indicates whether to put layer normalization into preprocessing of MHA " +"and FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:35 +msgid "The block size for the attention mask. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:38 +msgid "The number of block in a window. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:41 +msgid "Number of global blocks per sequence. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:44 +msgid "Number of random blocks per row. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:47 +msgid "The random seed for generating random block id. Defaults to ``None``." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:50 +msgid "The index of padding token for BigBird embedding. Defaults to ``0``." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:53 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:56 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:59 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:62 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:1 +msgid "The BigBirdModel forward method, overrides the __call__() special method." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. Its data type should " +"be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:6 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can either be 0 or 1: - 0 corresponds to a *sentence A* " +"token, - 1 corresponds to a *sentence B* token. Its data type should be " +"`int64` and it has a shape of [batch_size, sequence_length]. Defaults to " +"``None``, which means we don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:6 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:9 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:10 +msgid "1 corresponds to a *sentence B* token." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:12 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to ``None``, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:15 +msgid "" +"A list which contains some tensors used in multi-head attention to " +"prevents attention to some unwanted positions, usually the paddings or " +"the subsequent positions. Its data type can be int, float and bool. When " +"the data type is bool, the `masked` tokens have `False` values and the " +"others have `True` values. When the data type is int, the `masked` tokens" +" have `0` values and the others have `1` values. When the data type is " +"float, the `masked` tokens have `-INF` values and the others have `0` " +"values. It is a tensor with shape broadcasted to `[batch_size, n_head, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:27 +msgid "A list which contains some tensors used in bigbird random block." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:30 +msgid "" +"Returns tuple (`encoder_output`, `pooled_output`). With the fields: - " +"encoder_output (Tensor): Sequence of output at the last layer of the " +"model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]. - pooled_output (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:30 +msgid "Returns tuple (`encoder_output`, `pooled_output`)." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:12 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:12 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:19 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:14 +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:32 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:17 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:36 +msgid "encoder_output (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:35 +msgid "" +"Sequence of output at the last layer of the model. Its data type should " +"be float32 and has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:40 +msgid "pooled_output (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:39 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:31 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:17 +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:45 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainedModel:1 +msgid "" +"An abstract class for pretrained BigBird models. It provides BigBird " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining:1 +msgid "BigBird Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:4 +msgid "An instance of :class:`BigBirdModel`." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:1 +msgid "" +"The BigBirdForPretraining forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:7 +msgid "See :class:`BigBirdModel`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:11 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:9 +msgid "" +"A tensor indicates positions to be masked in the position embedding. Its " +"data type should be int64 and its shape is [batch_size, mask_token_num]. " +"`mask_token_num` is the number of masked tokens. It should be no bigger " +"than `sequence_length`. Defaults to `None`, which means we output hidden-" +"states of all tokens in masked token prediction." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:17 +msgid "" +"Returns tuple (`prediction_scores`, `seq_relationship_score`). With the " +"fields: - prediction_scores (Tensor): The scores of masked token " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, sequence_length, vocab_size]. - seq_relationship_score " +"(Tensor): The scores of next sentence prediction. Its data type " +"should be float32 and its shape is [batch_size, 2]." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:17 +msgid "Returns tuple (`prediction_scores`, `seq_relationship_score`)." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:23 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:22 +msgid "prediction_scores (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:22 +msgid "" +"The scores of masked token prediction. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:26 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:25 +msgid "seq_relationship_score (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:26 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:7 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:1 +msgid "BigBird Criterion for a pretraining task on top." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:5 +msgid "" +"It decides whether it considers Next Sentence Prediction loss. Defaults " +"to `False`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:8 +msgid "" +"Specifies a target value that is ignored and does not contribute to the " +"input gradient. Only valid if :attr:`soft_label` is set to :attr:`False`." +" Defaults to `0`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:1 +msgid "" +"The BigBirdPretrainingCriterion forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:20 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:10 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:14 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:17 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. 
If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:20 +msgid "" +"The weight of masked tokens. Its data type should be float32 and its " +"shape is [mask_token_num, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:24 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:15 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:26 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:17 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:29 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:1 +msgid "" +"BigBird Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:6 +msgid "The number of classes. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:1 +msgid "" +"The BigBirdForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:12 +msgid "" +"Returns tensor `output`, a tensor of the input text classification " +"logits. Its data type should be float32 and it has a shape of " +"[batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:1 +msgid "The BigBird pretraining heads for a pretraining task." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:9 +msgid "" +"The weight of pretraining embedding layer. Its data type should be " +"float32 and its shape is [hidden_size, vocab_size]. If set to `None`, use" +" normal distribution to initialize weight. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:1 +msgid "" +"The BigBirdPretrainingHeads forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:3 +msgid "" +"The sequence output of BigBirdModel. Its data type should be float32 and " +"has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:6 +msgid "" +"The pooled output of BigBirdModel. Its data type should be float32 and " +"has a shape of [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:15 +msgid "" +"(``prediction_scores``, ``seq_relationship_score``). With the fields: -" +" prediction_scores (Tensor): The scores of masked token prediction. " +"Its data type should be float32. If `masked_positions` is None, its " +"shape is [batch_size, sequence_length, vocab_size]. Otherwise, its " +"shape is [batch_size, mask_token_num, vocab_size]. 
- " +"seq_relationship_score (Tensor): The logits whether 2 sequences are " +"NSP relationship. Its data type should be float32 and has a shape of " +"[batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:15 +msgid "(``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:25 +msgid "" +"The logits whether 2 sequences are NSP relationship. Its data type should" +" be float32 and has a shape of [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:1 +msgid "" +"BigBird Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:4 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:4 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:4 +msgid "An instance of BigBirdModel." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:6 +msgid "" +"The dropout probability for output of BigBirdModel. If None, use the same" +" value as `hidden_dropout_prob` of `BigBirdModel` instance `bigbird`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:1 +msgid "" +"The BigBirdForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:1 +msgid "" +"BigBird Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:8 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:8 +msgid "" +"The dropout probability for output of BIGBIRD. If None, use the same " +"value as `hidden_dropout_prob` of `BigBirdModel` instance `bigbird`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:1 +msgid "" +"BigBird Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:1 +msgid "" +"The BigBirdForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:3 +msgid "" +"See :class:`BigBirdModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:5 +msgid "" +"See :class:`BigBirdModel` and shape as [batch_size, num_choice, n_head, " +"sequence_length, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, 1]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM:1 +msgid "BigBird Model with a `language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:7 +msgid "" +"The Labels for computing the masked language modeling loss. Indices " +"should be in ``[-100, 0, ..., vocab_size]`` Tokens with indices set to " +"``-100`` are ignored (masked), the loss is only computed for the tokens " +"with labels in ``[0, ..., vocab_size]`` Its shape is [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:10 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:10 +msgid "" +"Returns tuple (`masked_lm_loss`, `prediction_scores`, " +"``sequence_output`). With the fields: - `masked_lm_loss` (Tensor): " +"The masked lm loss. Its data type should be float32 and its shape is [1]." +" - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. Its shape is [batch_size, " +"sequence_length, vocab_size]. - `sequence_output` (Tensor): Sequence" +" of hidden-states at the last layer of the model. Its data type should be" +" float32. Its shape is `[batch_size, sequence_length, hidden_size]`." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:10 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:10 +msgid "Returns tuple (`masked_lm_loss`, `prediction_scores`, ``sequence_output`)." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:15 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:15 +msgid "`masked_lm_loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:15 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:15 +msgid "The masked lm loss. Its data type should be float32 and its shape is [1]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:18 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:18 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:18 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:18 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"Its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:20 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:20 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:21 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:21 +msgid "" +"Sequence of hidden-states at the last layer of the model. Its data type " +"should be float32. Its shape is `[batch_size, sequence_length, " +"hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM:1 +msgid "BigBird Model for casual language model tasks." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.po new file mode 100644 index 0000000000000000000000000000000000000000..bd93b32f357387bef3e839976a6bbe9288b501a0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bigbird.rst:2 +msgid "bigbird" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..cb7fec73508f6640d43a81a1510b954161fb665d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.tokenizer.po @@ -0,0 +1,241 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bigbird.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:1 +msgid "" +"Constructs an BigBird tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:7 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:10 +msgid "" +"Whether the text strips accents and convert to Whether or not to " +"lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:14 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:18 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:21 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:24 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:27 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer +msgid "引发" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:32 +msgid "If file sentencepiece_model_file doesn't exist." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:1 +msgid "Returns a tuple containing the encoded sequence and mask information." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:3 +msgid "" +"The first sequence to be encoded. This can be a string, a list of strings" +" (tokenized string using the `tokenize` method) or a list of integers " +"(tokenized string ids using the `convert_tokens_to_ids` method)" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:7 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If set to None, will not limit the total sequence. " +"Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:11 +msgid "" +"If set to a number, will limit the mask sequence returned so that it has " +"a maximum prediction length. If set to None, will not limit the mask " +"sequence." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:14 +msgid "The probability of the token to be masked. Defaults to `0.15`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:17 +msgid "" +"Returns tuple (span_ids, masked_lm_positions, masked_lm_ids, " +"masked_lm_weights)." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. 
Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:4 +msgid "A BigBird sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:11 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..72618fc633594ed8de483a3d11151bf96a577284 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.modeling.po @@ -0,0 +1,461 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotDecoder:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:1 +msgid "基类::class:`paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:1 +msgid "Construct a bare Blenderbot Model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Check the " +"superclass documentation for the generic methods and the library " +"implements for all its model." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:10 +msgid "Vocabulary size of the Blenderbot model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:12 +msgid "The id for begging of sentences token. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:14 +msgid "The id for padding token. Defaults to ``0``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:16 +msgid "The id for end of sentence token. Defaults to ``2``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:18 +msgid "The id indicating the start of decoding sentence. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:20 +msgid "Dimensionality of the layers and the pooler layer. Defaults to ``1280``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:22 +msgid "" +"Number of Transformer encoder layers for BlenderbotEncoder. Defaults to " +"``2``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:24 +msgid "" +"Number of Transformer decoder layers for BlenderbotDecoder. Defaults to " +"``12``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:26 +msgid "" +"Number of attention heads for each Transformer encoder layer in " +"BlenderbotEncoder. Defaults to ``32``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:29 +msgid "" +"Number of attention heads for each Transformer decoder layer in " +"BlenderbotDecoder. Defaults to ``32``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:32 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer encoder " +"layer in BlenderbotEncoder. Defaults to ``5120``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:35 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer dncoder " +"layer in BlenderbotDncoder. Defaults to ``5120``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:38 +msgid "" +"The dropout probability for all fully connected layers in the embeddings," +" encoder, and pooler. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:41 +msgid "" +"The non-linear activation function (function or string) in the encoder " +"and pooler. ``\"gelu\"``, ``\"relu\"`` and any other paddle supported " +"activation functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:45 +msgid "The dropout ratio for the attention probabilities. Defaults to ``0.0``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:48 +msgid "The dropout ratio for activations inside the fully connected layer." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:50 +msgid ", The max position index of an input sequence. Defaults to ``128``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:53 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:56 +msgid "" +"Indicate whether to scale embeddings by diving by sqrt(d_model). Defaults" +" to ``True``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:58 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to ``True``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:10 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:10 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:11 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:11 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:13 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:13 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:16 +msgid "" +"If not provided, ``decoder_input_ids`` will be automatically generated " +"based on ``decoder_start_token_id`` and ``input_ids``." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:19 +msgid "" +"If not provided, the default ``decoder_attention_mask`` will be a tensor " +"with upper triangular part being ``-np.inf``. the shape will be " +"``(decoder_length, decoder_length)``" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:22 +msgid "" +"The output of encoder. If not provided, a ``encoder_output`` will be " +"generated from BlenderbotEncoder. Defaults to ``None``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:16 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:25 +msgid "Indicates whether to use cache to speed up decoding. Defaults to ``False``" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:18 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:27 +msgid "" +"It is a list, and each element in the list is a tuple( " +":code:`(incremental_cache, static_cache)` ). See " +"`paddle.nn.TransformerDecoder.gen_cache` for more details. It is only " +"used for inference and should be None for training. Default None." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:33 +msgid "" +"If ``use_cache=False``, the return will be the last hidden state of " +"decoder with shape of [batch_size, seq_lens, hidden_size]. ``seq_lens`` " +"corresponds to the length of input sequence. Otherwise, the return will " +"be a tuple of ``(decoder_output, cache)``. Please refer to class " +":class:`paddle.nn.TransformerDecoder` for more information regarding " +"``cache``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:32 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:11 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:40 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.get_encoder:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.get_encoder:1 +msgid "This method is required for model with encoder-decoder architecture." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel:1 +msgid "" +"An abstract class for pretrained Blenderbot models. It provides " +"Blenderbot related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder:1 +msgid "" +"The encoder of Blenderbot Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` or " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding methods and arguments." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder.forward:1 +msgid "" +"The last hidden states at the last layer of the encoder. It's data type " +"should be `float` and has a shape of `(batch_size, seq_lens, " +"hidden_size)`. ``seq_lens`` corresponds to the length of input sequence." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotDecoder:1 +msgid "" +"The decoder of Blenderbot Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` and " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding methods and arguments." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotDecoder.forward:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding the arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding arguments. :returns:" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:27 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:6 +msgid "If ``use_cache=False``, the return will be a tensor with shape of" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:6 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(decoder_output, cache)``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:14 +msgid "" +"import paddle from paddlenlp.transformers import BlenderbotTokenizer, " +"BlenderbotForConditionalGeneration" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:17 +msgid "" +"pretrained_model_name = \"blenderbot-400M-distill\" tokenizer = " +"BlenderbotTokenizer.from_pretrained(pretrained_model_name) model = " +"BlenderbotForConditionalGeneration.from_pretrained(pretrained_model_name)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:21 +msgid "" +"sample_text = \"My friends are cool but they eat too many carbs.\" inputs" +" = tokenizer(sample_text, return_attention_mask=True, " +"return_token_type_ids=False) inputs = {k: paddle.to_tensor([v]) for (k, " +"v) in inputs.items()}" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:25 +msgid "" +"# Generate response using beam search result_ids, scores = " +"model.generate(input_ids=inputs['input_ids']," +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:27 +msgid "" +"max_length=60, min_length=20, decode_strategy='beam_search', " +"num_beams=10, length_penalty=0.65)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:34 +msgid "for sequence_ids in result_ids.numpy().tolist():" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:33 +msgid "" +"print(\"User: \", sample_text) print(\"bot: \", " +"tokenizer.convert_ids_to_string(sequence_ids)) # \"bot: That's " +"unfortunate. Are they trying to lose weight?\"" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.prepare_inputs_for_generation:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.prepare_inputs_for_generation:1 +msgid "" +"Prepare inputs for decoder to generate sentences. :returns: A dictionary " +"containing necessary inputs for generating next token. :rtype: dict" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM:1 +msgid "" +"Constructs BLenderbot For Causal Language Model. This model is equivalent" +" to the blenderbot decoder without cross-attention." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:24 +msgid "" +"If ``use_cache=False``, the return will be a tensor with shape of " +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:27 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:35 +msgid "" +"import paddle from paddlenlp.transformers import BlenderbotTokenizer, " +"BlenderbotForCausalLM use_cache = False text = \"My friends are cool but " +"they eat too many carbs.\" model_name = \"blenderbot-400M-distill\" " +"tokenizer = BlenderbotTokenizer.from_pretrained(model_name) model = " +"BlenderbotForCausalLM.from_pretrained(model_name) model.eval() inputs = " +"tokenizer(text) inputs = {k: paddle.to_tensor([v]) for (k, v) in " +"inputs.items()}" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:47 +msgid "with paddle.no_grad():" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:47 +msgid "" +"outputs = model(**inputs, use_cache=use_cache) # outputs is a tuple of " +"(lm_logits, cache) if ``use_cache=True``." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.po new file mode 100644 index 0000000000000000000000000000000000000000..e0443109a91e8ba8f83c747c4543faef8eea9dd5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot.rst:2 +msgid "blenderbot" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..904d63a5aa318252cf69e691b3dba83085005ad1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.tokenizer.po @@ -0,0 +1,110 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:1 +msgid "" +"Construct a Blenderbot tokenizer, derived from the GPT tokenizer, using " +"byte-level Byte-Pair-Encoding." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:4 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.GPTTokenizer`, which contains most of the" +" main methods. Please should refer to the superclass for more information" +" regarding methods. :param vocab_file: file path of the vocabulary :type " +"vocab_file: str :param merges_file: file path of the merges file. :type " +"merges_file: str :param errors: The method to handle errors in decoding " +":type errors: str :param max_len: The specified maximum sequence length. " +"Default: \"None\". :type max_len: int :param special_tokens: The " +"additional special tokens. Default: \"None\". :type special_tokens: dict " +":param bos_token: The special token for beginning of sequence token. " +"Default: \"\". :type bos_token: str :param eos_token: The special " +"token for end of sequence token. Default: \"\". :type eos_token: str " +":param cls_token: The special token for cls. Default: \"\". :type " +"cls_token: str :param sep_token: The special token for separator token . " +"Default: \"\". :type sep_token: str :param pad_token: The special " +"token for padding. Default: \"\". :type pad_token: str :param " +"eol_token: The special token for newline. Default: \"\\u010a\". :type " +"eol_token: str :param add_prefix: Whether or not to add an initial space " +"to the input." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:30 +msgid "" +"This allows to treat the leading word just as any other word. (Blenderbot" +" adds an initial space when tokenizes input text, which" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:32 +msgid "is differnt from BlenderbotSmall)" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:36 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:1 +msgid "A Blenderbot sequence has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:5 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:7 +msgid "token_ids_1 Will be ignored" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:10 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:11 +msgid ":obj:`List[int]`" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..0145bef9318f8615ee0f3f5a42067a62e8cecb98 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.modeling.po @@ -0,0 +1,477 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot_small.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallDecoder:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:1 +msgid "基类::class:`paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:1 +msgid "Construct a bare BlenderbotSmall Model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Check the " +"superclass documentation for the generic methods and the library " +"implements for all its model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:10 +msgid "Vocabulary size of the BlenderbotSmall model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:12 +msgid "The id for begging of sentences token. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:14 +msgid "The id for padding token. Defaults to ``0``." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:16 +msgid "The id for end of sentence token. Defaults to ``2``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:18 +msgid "The id indicating the start of decoding sentence. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:20 +msgid "Dimensionality of the layers and the pooler layer. Defaults to ``512``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:22 +msgid "" +"Number of Transformer encoder layers for BlenderbotSmallEncoder. Defaults" +" to ``8``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:24 +msgid "" +"Number of Transformer decoder layers for BlenderbotSmallDecoder. Defaults" +" to ``8``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:26 +msgid "" +"Number of attention heads for each Transformer encoder layer in " +"BlenderbotSmallEncoder. Defaults to ``16``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:29 +msgid "" +"Number of attention heads for each Transformer decoder layer in " +"BlenderbotSmallDecoder. Defaults to ``16``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:32 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer encoder " +"layer in BlenderbotSmallEncoder. Defaults to ``2048``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:35 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer dncoder " +"layer in BlenderbotSmallDncoder. Defaults to ``2048``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:38 +msgid "" +"The dropout probability for all fully connected layers in the embeddings," +" encoder, and pooler. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:41 +msgid "" +"The non-linear activation function (function or string) in the encoder " +"and pooler. ``\"gelu\"``, ``\"relu\"`` and any other paddle supported " +"activation functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:45 +msgid "The dropout ratio for the attention probabilities. Defaults to ``0.0``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:48 +msgid "The dropout ratio for activations inside the fully connected layer." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:50 +msgid ", The max position index of an input sequence. Defaults to ``512``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:53 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:56 +msgid "" +"Indicate whether to scale embeddings by diving by sqrt(d_model). Defaults" +" to ``True``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:58 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. 
If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to ``False``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:10 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:10 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:11 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:11 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:13 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:13 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:16 +msgid "" +"If not provided, ``decoder_input_ids`` will be automatically generated " +"based on ``decoder_start_token_id`` and ``input_ids``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:19 +msgid "" +"If not provided, the default ``decoder_attention_mask`` will be a tensor " +"with upper triangular part being ``-np.inf``. the shape will be " +"``(decoder_length, decoder_length)``" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:22 +msgid "" +"The output of encoder. If not provided, a new ``encoder_output`` will be " +"generated from BlenderbotEncoder. Defaults to ``None``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:16 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:25 +msgid "Indicates whether to use cache to speed up decoding. Defaults to ``False``" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:27 +msgid "" +"It is a list, and each element in the list is a tuple( " +":code:`(incremental_cache, static_cache)` ). See " +"`TransformerDecoder.gen_cache` for more details. It is only used for " +"inference and should be None for training. Default None." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:33 +msgid "" +"If ``use_cache=False``, the return will be the last hidden state of " +"decoder with shape of [batch_size, seq_lens, hidden_size]. ``seq_lens`` " +"corresponds to the length of input sequence. Otherwise, the return will " +"be a tuple of ``(decoder_output, cache)``. Please refer to class " +":class:`paddle.nn.TransformerDecoder` for more information regarding " +"``cache``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:32 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:11 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:40 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:43 +msgid "" +"import paddle from paddlenlp.transformers import " +"BlenderbotSmallTokenizer, BlenderbotSmallModel" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:46 +msgid "" +"# \"blenderbot_small-90M\" is pretrained weight of " +"BlenderbotSmallForConditionalGeneration, # Therefore some weight of " +"additional layers in BlenderbotSmallForConditionalGeneration # might not " +"be loaded and used. pretrained_model_name = \"blenderbot_small-90M\" " +"tokenizer = " +"BlenderbotSmallTokenizer.from_pretrained(pretrained_model_name) model = " +"BlenderbotSmallModel.from_pretrained(pretrained_model_name)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:53 +msgid "" +"sample_text = \"My friends are cool but they eat too many carbs.\" inputs" +" = tokenizer(sample_text, return_attention_mask=True, " +"return_token_type_ids=False) inputs = {k:paddle.to_tensor([v]) for (k, v)" +" in inputs.items()} decoder_output = model(**inputs)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.get_encoder:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.get_encoder:1 +msgid "This method is required for model with encoder-decoder architecture." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel:1 +msgid "" +"An abstract class for pretrained BlenderbotSmall models. It provides " +"BlenderbotSmall related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder:1 +msgid "" +"The encoder of BlenderbotSmall Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` or " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotSmallModel` for more" +" details regarding methods and arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder.forward:1 +msgid "" +"The last hidden-states at the last layer of the encoder. It's data type " +"should be `float` and has a shape of `(batch_size, seq_lens, " +"hidden_size)`. ``seq_lens`` corresponds to the length of input sequence." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallDecoder:1 +msgid "" +"The decoder of BlenderbotSmall Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` and " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding methods and arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallDecoder.forward:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding the arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding arguments. :returns:" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:27 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:6 +msgid "If ``use_cache=False``, the return will be a tensor with shape of" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:6 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(decoder_output, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM:1 +msgid "" +"Constructs BLenderbotSmall For Causal Language Model. This model is " +"equivalent to the blenderbotSmall decoder without cross-attention." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:18 +msgid "" +"It is a list, and each element in the list is a tuple( " +":code:`(incremental_cache, static_cache)` ). See " +"`paddle.nn.TransformerDecoder.gen_cache` for more details. It is only " +"used for inference and should be None for training. Default None." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:24 +msgid "" +"If ``use_cache=False``, the return will be a tensor with shape of " +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:27 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:35 +msgid "" +"import paddle from paddlenlp.transformers import " +"BlenderbotSmallTokenizer, BlenderbotSmallForCausalLM use_cache = False " +"text = \"My friends are cool but they eat too many carbs.\" model_name = " +"\"blenderbot_small-90M\" tokenizer = " +"BlenderbotSmallTokenizer.from_pretrained(model_name) model = " +"BlenderbotSmallForCausalLM.from_pretrained(model_name) model.eval() " +"inputs = tokenizer(text, return_attention_mask=True, " +"return_token_type_ids=False) inputs = {k: paddle.to_tensor([v]) for (k, " +"v) in inputs.items()}" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:47 +msgid "with paddle.no_grad():" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:47 +msgid "" +"outputs = model(**inputs, use_cache=use_cache) # outputs is a tuple of " +"(lm_logits, cache) if ``use_cache=True``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.prepare_inputs_for_generation:1 +msgid "" +"Prepare inputs for decoder to generate sentences. :returns: A dictionary " +"containing necessary inputs for generating next token. :rtype: dict" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.po new file mode 100644 index 0000000000000000000000000000000000000000..8ff26039ee79f9cbd33457d187133f662ca54165 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot_small.rst:2 +msgid "blenderbot\\_small" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..28b6a258b076c918880a2be844ae7d4029172789 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.tokenizer.po @@ -0,0 +1,116 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot_small.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:1 +msgid "Constructs a BlenderbotSmall tokenizer based on Byte-Pair-Encoding." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.GPTTokenizer`, which contains most of the" +" main methods. Please should refer to the superclass for more information" +" regarding methods. :param vocab_file: file path of the vocabulary :type " +"vocab_file: str :param merges_file: file path of the merges file. :type " +"merges_file: str :param errors: The method to handle errors in decoding " +":type errors: str :param max_len: The specified maximum sequence length. " +"Default: \"None\". :type max_len: int :param special_tokens: The " +"additional special tokens. Default: \"None\". :type special_tokens: dict " +":param bos_token: The special token for beginning of sequence token. " +"Default: \"__start__\". :type bos_token: str :param eos_token: The " +"special token for end of sequence token. Default: \"__end__\". :type " +"eos_token: str :param unk_token: The special token for unknown tokens. " +"Default: \"__unk__\" :type unk_token: str :param pad_token: The special " +"token for padding. Default: \"__null__\". :type pad_token: str :param " +"eol_token: The special token for newline. Default: \"__newln__\". :type " +"eol_token: str" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:28 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe:1 +msgid "" +"Apply Byte-Pair-Encoding on token. The process of bpe in BlenderbotSmall " +"is different from Blenderbot. 
:param token: The token to be converted. " +":type token: str" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe:6 +msgid "Converted token." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. :param" +" tokens: A sequence of tokens. :type tokens: list[str]" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string:10 +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string:5 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string:1 +msgid "" +"Converts a sequence of ids (list of integers) to a single string. :param " +"ids: A sequence of ids corresponding to tokens. :type ids: list[int] " +":param skip_special_tokens: Whether to skip and not decode special tokens" +" when converting. Defaults to `False`. :type skip_special_tokens: bool, " +"optional :param clean_up_tokenization_spaces: Whether to Clean up a list " +"of simple English tokenization artifacts" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string:7 +msgid "like spaces before punctuations and abbreviated forms." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..29f90d47cd9c5e9fb337c9a1414daab6fa6bd301 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.modeling.po @@ -0,0 +1,634 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.chinesebert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:1 +msgid "基类::class:`paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:1 +msgid "The bare ChineseBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `BChineseBertModel`. Also is the vocab" +" size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ChineseBertModel`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:41 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:51 +msgid "" +"The non-linear activation function in the pooling layer. Defaults to " +"`\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:54 +msgid "The epsilon of layernorm. Defaults to `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:56 +msgid "The dim of glyph_embedding. Defaults to `1728`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:59 +msgid "The length of pinyin map. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:1 +msgid "The ChineseBert forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:7 +msgid "" +"Indices of input sequence tokens pinyin. We apply a CNN model with width " +"2 on the pinyin sequence, followed by max-pooling to derive the resulting" +" pinyin embedding. This makes output dimensionality immune to the length " +"of the input pinyin sequence. The length of the input pinyin sequence is " +"fixed at 8. Its data type should be `int64` and it has a shape of " +"[batch_size, sequence_length, 8]. Defaults to `None`, which means we " +"don't add pinyin embeddings." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:14 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:14 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:19 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:20 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:22 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:25 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:29 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:38 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`)." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:16 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:12 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:44 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:48 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:47 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:53 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:51 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:57 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:56 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:24 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:19 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:19 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:62 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained ChineseBert models. It provides " +"ChineseBert related `model_config_file`, `pretrained_init_configuration`," +" `resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel.init_weights:1 +msgid "Initialize the weights." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining:1 +msgid "ChineseBert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining:3 +msgid "An instance of :class:`ChineseBertModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:9 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:11 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:9 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:11 +msgid "See :class:`ChineseBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:11 +msgid "See :class:`ChineseBertPretrainingHeads`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:14 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:14 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:21 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:19 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:24 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:24 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `ChineseBertModel`. Defines the number" +" of different tokens that can be represented by the `inputs_ids` passed " +"when calling `ChineseBertBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:16 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:20 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:1 +msgid "" +"ChineseBert Model with a linear layer on top of the output layer, " +"designed for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:4 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:4 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:4 +msgid "An instance of ChineseBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:6 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:8 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:8 +msgid "" +"The dropout probability for output of ChineseBert. If None, use the same " +"value as `hidden_dropout_prob` of `ChineseBertModel` instance " +"`chinesebert`. 
Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:1 +msgid "" +"The ChineseBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:14 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:1 +msgid "" +"ChineseBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:1 +msgid "" +"The ChineseBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:14 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:1 +msgid "" +"ChineseBert Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:6 +msgid "" +"The dropout probability for output of ChineseBert. If None, use the same " +"value as `hidden_dropout_prob` of `ChineseBertModel` instance " +"`chinesebert`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:1 +msgid "" +"The ChineseBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.po new file mode 100644 index 0000000000000000000000000000000000000000..0df25a22b97674b2a3993b96bbbe1f3f3fcc2652 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.chinesebert.rst:2 +msgid "chinesebert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..665864174807e676101c9f85f19ca71c5927e68c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.tokenizer.po @@ -0,0 +1,565 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.chinesebert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:1 +msgid "" +"Construct a ChineseBert tokenizer. `ChineseBertTokenizer` is similar to " +"`BertTokenizerr`. The difference between them is that ChineseBert has the" +" extra process about pinyin id. For more information regarding those " +"methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:5 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:11 +msgid "" +"A dict of pinyin map, the map between pinyin char and id. pinyin char is " +"26 Romanian characters and 0-5 numbers. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:14 +msgid "A dict of char id map tensor. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:17 +msgid "A dict of pinyin map tensor. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:20 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:24 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:27 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:30 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:33 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:39 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports sequence or sequence pair as input, and batch input " +"is not allowed." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:5 +msgid "" +"The sequence to be processed. One sequence is a string, a list of " +"strings, or a list of integers depending on whether it has been " +"pretokenized and converted to ids." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:9 +msgid "" +"Same as `text` argument, while it represents for the latter sequence of " +"the sequence pair." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:10 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:12 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:15 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:17 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:25 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:27 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:29 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:31 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:29 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:31 +msgid "String selected in the following options:" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:31 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:33 +msgid "'longest_first' (default) Iteratively reduce the inputs sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:32 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:34 +msgid "" +"until the input is under `max_seq_len` starting from the longest one at " +"each token (when there is a pair of input sequences). - 'only_first': " +"Only truncate the first sequence. - 'only_second': Only truncate the " +"second sequence. - 'do_not_truncate': Do not truncate (raise an error if " +"the input sequence is longer than `max_seq_len`)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:39 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:41 +msgid "Defaults to 'longest_first'." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:41 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:43 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:44 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:46 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:47 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:49 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:50 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:52 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:53 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:55 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:56 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:58 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:62 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **pinyin_ids** (list[int]): " +"List of pinyin ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. 
Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:60 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:62 +msgid "The dict has the following optional items:" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:62 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:64 +msgid "**input_ids** (list[int]): List of token ids to be fed to a model." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:63 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:65 +msgid "**pinyin_ids** (list[int]): List of pinyin ids to be fed to a model." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:64 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:66 +msgid "" +"**position_ids** (list[int], optional): List of token position ids to be " +"fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:66 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:68 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:68 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:70 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:71 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:73 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:73 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:75 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:76 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:78 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:79 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:81 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. 
Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:4 +msgid "" +"The element of list can be sequence or sequence pair, and the sequence is" +" a string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:60 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **pinyin_ids** (list[int]): " +"List of pinyin ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` " +"works. - **overflow_to_sample** (int, optional): Index of example from " +"which this feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:82 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:86 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:1 +msgid "Truncates a sequence pair in place to the maximum length." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:3 +msgid "" +"list of tokenized input ids. Can be obtained from a string by chaining " +"the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:5 +msgid "" +"Optional second list of input ids. Can be obtained from a string by " +"chaining the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:7 +msgid "" +"The map of tokens and the start and end index of their start and end " +"character" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:9 +msgid "" +"The map of token pairs and the start and end index of their start and end" +" character" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:11 +msgid "number of tokens to remove using the truncation strategy" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:13 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len starting from the longest one at each token (when there " +"is a pair of input sequences). Overflowing tokens only contains " +"overflow from the first sequence. - 'only_first': Only truncate the first" +" sequence. raise an error if the first sequence is shorter or equal to " +"than num_tokens_to_remove. - 'only_second': Only truncate the second " +"sequence - 'do_not_truncate': Does not truncate (raise an error if the " +"input sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:13 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:15 +msgid "" +"starting from the longest one at each token (when there is a pair of " +"input sequences). Overflowing tokens only contains overflow from the " +"first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:17 +msgid "" +"'only_first': Only truncate the first sequence. raise an error if the " +"first sequence is shorter or equal to than num_tokens_to_remove." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:18 +msgid "'only_second': Only truncate the second sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:19 +msgid "" +"'do_not_truncate': Does not truncate (raise an error if the input " +"sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:20 +msgid "" +"If set to a number along with max_seq_len, the overflowing tokens " +"returned will contain some tokens from the main sequence returned. 
The " +"value of this argument defines the number of additional tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map:1 +msgid "Get the map of pinyin locations and pinyin tensor." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:3 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map:3 +msgid "The sequence to be processed." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map:6 +msgid "the map of pinyin locations and pinyin tensor." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:1 +msgid "Find chinese character location, and generate pinyin ids." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:5 +msgid "" +"Same as `text` argument, while it represents for the latter sequence of " +"the sequence pair. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:8 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:12 +msgid "The list of pinyin id tensor." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..eba96010f926fbbbe6b6a9b31567bfdc12f56752 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.modeling.po @@ -0,0 +1,703 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.convbert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel:1 +msgid "基类::class:`paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:1 +msgid "The bare ConvBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator +#: paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertModel +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ConvBertModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:15 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:40 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:43 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`ConvBertPretrainedModel.init_weights()` for" +" how weights are initialized in `ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:43 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:47 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ConvBertPretrainedModel.init_weights()` for how weights are " +"initialized in `ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:50 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:53 +msgid "The size of the convolutional kernel. Defaults to `9`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:56 +msgid "Ratio gamma to reduce the number of attention heads. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:59 +msgid "" +"The number of groups for grouped linear layers for ConvBert model. " +"Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:1 +msgid "" +"The ConvBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:12 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:12 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:13 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:13 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:15 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:15 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:18 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:18 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. - **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:27 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:27 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:27 +msgid "**1** for tokens that **not masked**," +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:28 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:28 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:28 +msgid "**0** for tokens that **masked**." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:30 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:30 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:30 +msgid "" +"It is a tensor with shape broadcasted to `[batch_size, " +"num_attention_heads, sequence_length, sequence_length]`. Defaults to " +"`None`, which means nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:34 +msgid "" +"Returns Tensor `sequence_output`, sequence of hidden-states at the last " +"layer of the model. Shape as `[batch_size, sequence_length, hidden_size]`" +" and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:39 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:17 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:26 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:17 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:17 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:39 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:39 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained ConvBert models. 
It provides ConvBert " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel.tie_weights:1 +msgid "Tie the weights between the input embeddings and the output embeddings." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining:1 +msgid "" +"Combine generator with discriminator for Replaced Token Detection (RTD) " +"pretraining." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.get_discriminator_inputs:1 +msgid "Sample from the generator to create discriminator input." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:9 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:9 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:28 +msgid "See :class:`ConvBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:9 +msgid "" +"The raw input_ids. Its data type should be `int64` and it has a shape of " +"[batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:11 +msgid "" +"The generator labels. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:14 +msgid "" +"Returns tuple (``gen_logits``, ``disc_logits``, ``disc_labels``, " +"``attention_mask``). With the fields: - `gen_logits` (Tensor): a " +"tensor of the generator prediction logits. Shape as `[batch_size, " +"sequence_length, vocab_size]` and dtype as float32. - `disc_logits` " +"(Tensor): a tensor of the discriminator prediction logits. Shape as " +"`[batch_size, sequence_length]` and dtype as float32. 
- `disc_labels` " +"(Tensor): a tensor of the discriminator prediction labels. Shape as " +"`[batch_size, sequence_length]` and dtype as int64. - `attention_mask` " +"(Tensor): See :class:`ConvBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:14 +msgid "" +"Returns tuple (``gen_logits``, ``disc_logits``, ``disc_labels``, " +"``attention_mask``)." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:14 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:16 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:19 +msgid "`gen_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:19 +msgid "" +"a tensor of the generator prediction logits. Shape as `[batch_size, " +"sequence_length, vocab_size]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:22 +msgid "`disc_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:22 +msgid "" +"a tensor of the discriminator prediction logits. Shape as `[batch_size, " +"sequence_length]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:25 +msgid "`disc_labels` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:25 +msgid "" +"a tensor of the discriminator prediction labels. Shape as `[batch_size, " +"sequence_length]` and dtype as int64." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:27 +msgid "`attention_mask` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator:1 +msgid "ConvBert Model with a discriminator prediction head on top." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator:3 +msgid "An instance of ConvBertModel." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:1 +msgid "" +"The ConvBertDiscriminator forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:34 +msgid "" +"Returns tensor `logits`, a tensor of the discriminator prediction logits." +" Shape as `[batch_size, sequence_length]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertGenerator:1 +msgid "ConvBert Model with a generator prediction head on top." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:1 +msgid "" +"The ConvBertGenerator forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:34 +msgid "" +"Returns tensor `prediction_scores`, a tensor of the generator prediction " +"scores. 
Shape as `[batch_size, sequence_length, vocab_size]` and dtype as" +" float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead:1 +msgid "ConvBert head for sentence-level classification tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward:6 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:1 +msgid "" +"ConvBert Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:6 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:8 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:8 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:8 +msgid "" +"The dropout probability for output of ConvBert. If None, use the same " +"value as `hidden_dropout_prob` of `ConvBertModel` instance `convbert`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:1 +msgid "" +"The ConvBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:1 +msgid "" +"ConvBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:1 +msgid "" +"The ConvBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `ConvBertModel`. 
Defines the number of" +" different tokens that can be represented by the `inputs_ids` passed when" +" calling `ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:4 +msgid "This is the generator weight." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:6 +msgid "This is the discriminator weight." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering:1 +msgid "" +"ConvBert Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:1 +msgid "" +"The ConvBertForQuestionAnswering forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:1 +msgid "" +"ConvBert Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:1 +msgid "" +"The ConvBertForMultipleChoice forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:9 +msgid "" +"See :class:`ConvBertModel` and shape as [batch_size,num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.po new file mode 100644 index 0000000000000000000000000000000000000000..2e9252d1894316c8a413cf66b9ca6debdd33c4ad --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.convbert.rst:2 +msgid "convbert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..7bef9ac5e1c3a473e4ba03d0af0e4cb1eac81d4b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.tokenizer.po @@ -0,0 +1,34 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.convbert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.convbert.tokenizer.ConvBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.electra.tokenizer.ElectraTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.convbert.tokenizer.ConvBertTokenizer:1 +msgid "" +"Construct a ConvBERT tokenizer. `ConvBertTokenizer` is identical to " +"`ElectraTokenizer`. For more information regarding those methods, please " +"refer to this superclass." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convert_slow_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convert_slow_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..56a66eef978591fe5eea4e5cf52ee5901091dd00 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convert_slow_tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.convert_slow_tokenizer.rst:2 +msgid "convert\\_slow\\_tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..85b35cd74795ca63738910a79ff8d0447b3dd04c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.modeling.po @@ -0,0 +1,517 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ctrl.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel:1 +msgid "基类::class:`paddlenlp.transformers.ctrl.modeling.CTRLPreTrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:1 +msgid "" +"The bare CTRL Model transformer outputting raw hidden-states without any " +"specific head on top." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLModel +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `CTRLModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `CTRLModel`. " +"Defaults to `246534`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:14 +msgid "" +"The maximum sequence length that this model might ever be used with. " +"Typically set this to something large just in case (e.g., 512 or 1024 or " +"2048 or 50000). Defaults to `50000`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:17 +msgid "Dimensionality of the embeddings and hidden states. Defaults to `1280`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:20 +msgid "" +"Dimensionality of the inner dimension of the feed forward networks (FFN)." +" Defaults to `8192`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:23 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `48`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:26 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:29 +msgid "" +"The dropout ratio for all fully connected layers in the encoder. Defaults" +" to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:32 +msgid "The dropout ratio for the embeddings. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:35 +msgid "The epsilon to use in the layer normalization layers. Defaults to `1e-6`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:38 +msgid "" +"Whether the model's input and output word embeddings should be tied. Note" +" that this is only relevant if the model has a output word embedding " +"layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:41 +msgid "The id of the `padding` token. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:44 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`CTRLPreTrainedModel._init_weights()` for " +"how weights are initialized in `CTRLModel`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:44 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:48 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`CTRLPreTrainedModel._init_weights()` for how weights are " +"initialized in `CTRLModel`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:1 +msgid "The CTRLModel forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:7 +msgid "" +"Contains pre-computed hidden-states (key and values in the attention " +"blocks) as computed by the model. Can be used to speed up sequential " +"decoding. The `input_ids` which have their past given to this model " +"should not be passed as input ids as they have already been computed. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:13 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `0.0` values and the others have `1.0` values. It is" +" a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:23 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range `[0, type_vocab_size - 1]`. If `type_vocab_size` is" +" 2, which means the inputs have two portions. Indices can either be 0 or " +"1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:23 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range `[0, type_vocab_size - 1]`. If `type_vocab_size` is" +" 2, which means the inputs have two portions. Indices can either be 0 or " +"1:" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:28 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:29 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:31 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:34 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range `[0, max_position_embeddings - 1]`. " +"Shape as [batch_size, num_tokens] and dtype as int64. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:38 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." 
+msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:41 +msgid "" +"Whether or not to return the attentions tensors of all attention layers. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:44 +msgid "" +"Whether or not to return the output of all hidden layers. Defaults to " +"`False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:48 +msgid "" +"Returns tuple (`last_hidden_state`, `caches`, `hidden_states`, " +"`attentions`) With the fields: - `last_hidden_state` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `caches` (tuple(tuple(Tensor), optional): returned " +"when `use_cache=True` is passed. Tuple of `tuple(Tensor)` of length " +"`num_hidden_layers`, with each tuple having 2 tensors of shape " +"[batch_size, num_heads, sequence_length, embed_size_per_head] and float32" +" dtype. - `hidden_states` (tuple(Tensor), optional): returned when " +"`output_hidden_states=True` is passed. Tuple of `Tensor` (one for the" +" output of the embeddings + one for the output of each layer). Each " +"Tensor has a data type of float32 and its shape is [batch_size, " +"sequence_length, hidden_size]. - `attentions` (tuple(Tensor), optional):" +" returned when `output_attentions=True` is passed. Tuple of " +"`Tensor` (one for each layer) of shape. Each Tensor has a data type of" +" float32 and its shape is [batch_size, num_heads, sequence_length, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:48 +msgid "" +"Returns tuple (`last_hidden_state`, `caches`, `hidden_states`, " +"`attentions`)" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:50 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:54 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:53 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:38 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:39 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:59 +msgid "`caches` (tuple(tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:57 +msgid "" +"returned when `use_cache=True` is passed. Tuple of `tuple(Tensor)` of " +"length `num_hidden_layers`, with each tuple having 2 tensors of shape " +"[batch_size, num_heads, sequence_length, embed_size_per_head] and float32" +" dtype." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:41 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:42 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:65 +msgid "`hidden_states` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:62 +msgid "" +"returned when `output_hidden_states=True` is passed. 
Tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of each " +"layer). Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:43 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:44 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:69 +msgid "`attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:68 +msgid "" +"returned when `output_attentions=True` is passed. Tuple of `Tensor` (one " +"for each layer) of shape. Each Tensor has a data type of float32 and its " +"shape is [batch_size, num_heads, sequence_length, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:48 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:49 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:74 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel:1 +msgid "" +"The CTRL Model transformer with a language modeling head on top (linear " +"layer with weights tied to the input embeddings)." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:8 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel:4 +msgid "An instance of :class:`CTRLModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:3 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:5 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:7 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:9 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:17 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:19 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:21 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:38 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:41 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:44 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:3 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:5 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:7 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:9 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:17 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:19 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:21 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:39 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:42 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:45 +msgid "See :class:`CTRLModel`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:11 +msgid "" +"Labels for language modeling. 
Note that the labels **are shifted** inside" +" the model, i.e. you can set `labels = input_ids` Indices are selected in" +" `[-100, 0, ..., vocab_size]` All labels set to `-100` are ignored " +"(masked), the loss is only computed for labels in `[0, ..., vocab_size]`." +" Shape is [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:24 +msgid "" +"Returns tuple `(loss, logits, caches, hidden_states, attentions)`. With " +"the fields: - `loss` (Tensor): returned when `labels` is provided." +" Language modeling loss (for next-token prediction). It's data " +"type should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be " +"float32 and its shape is [batch_size, sequence_length, vocab_size]. " +"- `caches` (tuple(tuple(Tensor), optional): See :class:`CTRLModel`. " +"- `hidden_states` (tuple(Tensor), optional): See :class:`CTRLModel`." +" - `attentions` (tuple(Tensor), optional): See :class:`CTRLModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:24 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:24 +msgid "" +"Returns tuple `(loss, logits, caches, hidden_states, attentions)`. With " +"the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:30 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:30 +msgid "`loss` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:28 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:28 +msgid "" +"returned when `labels` is provided. Language modeling loss (for next-" +"token prediction). It's data type should be float32 and its shape is " +"[1,]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:35 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:36 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:33 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:1 +msgid "" +"The CTRL Model transformer with a sequence classification head on top " +"(linear layer). `CTRLForSequenceClassification` uses the last token in " +"order to do the classification, as other causal models (e.g. GPT-2) do. " +"Since it does classification on the last token, it requires to know the " +"position of the last token. If a `pad_token_id` is defined in the " +"configuration, it finds the last token that is not a padding token in " +"each row. If no `pad_token_id` is defined, it simply takes the last value" +" in each row of the batch." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:10 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:12 +msgid "" +"The dropout probability for output of CTRL. If None, use the same value " +"as `hidden_dropout_prob` of `CTRLModel` instance `ctrl`. Defaults to " +"None." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:11 +msgid "" +"Labels for computing the sequence classification/regression loss. Indices" +" should be in `[0, ...,num_classes - 1]`. If `num_classes == 1` a " +"regression loss is computed (Mean-Square loss), If `num_classes > 1` a " +"classification loss is computed (Cross-Entropy). Shape is [batch_size,] " +"and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:24 +msgid "" +"Returns tuple `(loss, logits, caches, hidden_states, attentions)`. With " +"the fields: - `loss` (Tensor): returned when `labels` is provided." +" Language modeling loss (for next-token prediction). It's data " +"type should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be " +"float32 and its shape is [batch_size, num_classes]. - `caches` " +"(tuple(tuple(Tensor), optional): See :class:`CTRLModel`. - " +"`hidden_states` (tuple(Tensor), optional): See :class:`CTRLModel`. -" +" `attentions` (tuple(Tensor), optional): See :class:`CTRLModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:33 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding:1 +msgid "基类::class:`paddle.nn.layer.common.Embedding`" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding:1 +msgid "This module produces sinusoidal positional embeddings of any length." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.po new file mode 100644 index 0000000000000000000000000000000000000000..a2dc827a0ccb9b40788424d57f6c2f53b8096750 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.ctrl.rst:2
+msgid "ctrl"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.tokenizer.po
new file mode 100644
index 0000000000000000000000000000000000000000..416dc598a80b33852c8d648af68c07b274234da0
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.tokenizer.po
@@ -0,0 +1,170 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.ctrl.tokenizer.rst:2
+msgid "tokenizer"
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:1
+msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`"
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:1
+msgid "Constructs a CTRL tokenizer based on byte-level Byte-Pair-Encoding."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:3
+msgid ""
+"This tokenizer inherits from "
+":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` "
+"which contains most of the main methods. For more information regarding "
+"those methods, please refer to this superclass."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.save_resources
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:7
+msgid ""
+"Path to the vocab file. The vocab file contains a mapping from vocabulary"
+" strings to indices."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:10
+msgid ""
+"Path to the merge file. The merge file is used to split the input "
+"sentence into \"subword\" units. The vocab file is then used to encode "
+"those units as indices."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:14
+msgid "The maximum value of the input sequence length. Defaults to `None`."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:17
+msgid ""
+"A special token representing the *unknown (out-of-vocabulary)* token. An "
+"unknown token is set to be `unk_token` in order to be converted to an ID. "
+"Defaults to \"\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:1
+msgid "Converts a string to a list of tokens."
+msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:14 +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:11 +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:10 +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:10 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (list of string) to a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:3 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:6 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:1 +msgid "" +"Converts a single token or a sequence of tokens to an index or a sequence" +" of indices using the vocab." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:4 +msgid "A single token or a sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:7 +msgid "The converted token id or token ids." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:1 +msgid "" +"Converts an index or a sequence indices to a single token or a sequence " +"of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:4 +msgid "The token id (or token ids) to be converted to text." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:6 +msgid "" +"Whether or not to skip the special tokens. Defaults to `False`, which " +"means we don't skip the special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:10 +msgid "The converted token or the sequence of tokens." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.save_resources:3 +msgid "Directory to save files into." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..2a483df03b586de5f8759c52a1215e71470585e2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.modeling.po @@ -0,0 +1,389 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.distilbert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel:1 +msgid "基类::class:`paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:1 +msgid "The bare DistilBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `DistilBertModel`. Defines the number " +"of different tokens that can be represented by the `inputs_ids` passed " +"when calling `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and the pooler " +"layer. Defaults to `768`." 
+msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`DistilBertPretrainedModel.init_weights()` " +"for how weights are initialized in `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:41 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`DistilBertPretrainedModel.init_weights()` for how weights are" +" initialized in `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:1 +msgid "" +"The DistilBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:19 +msgid "" +"Returns tensor `encoder_output`, which means the sequence of hidden-" +"states at the last layer of the model. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:13 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:22 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:13 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:13 +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:24 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained DistilBert models. It provides " +"DistilBert related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:1 +msgid "" +"DistilBert Model with a linear layer on top of the output layer, designed" +" for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:4 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:4 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:4 +msgid "An instance of DistilBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:6 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:6 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:8 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:8 +msgid "" +"The dropout probability for output of DistilBert. If None, use the same " +"value as `hidden_dropout_prob` of `DistilBertModel` instance " +"`distilbert`. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:1 +msgid "" +"The DistilBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:5 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:5 +msgid "See :class:`DistilBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:1 +msgid "" +"DistilBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:1 +msgid "" +"The DistilBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:1 +msgid "" +"DistilBert Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:1 +msgid "" +"The DistilBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:8 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"start_logits(Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" end_logits(Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:8 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:10 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:14 +msgid "start_logits(Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:17 +msgid "end_logits(Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM:1 +msgid "DistilBert Model with a `language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:1 +msgid "" +"The DistilBertForMaskedLM forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:8 +msgid "" +"Returns tensor `prediction_logits`, the scores of masked token " +"prediction. Its data type should be float32 and its shape is [batch_size," +" sequence_length, vocab_size]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.po new file mode 100644 index 0000000000000000000000000000000000000000..6036cb24084d21512874e2ce357c9836d1e226a1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.distilbert.rst:2
+msgid "distilbert"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.tokenizer.po
new file mode 100644
index 0000000000000000000000000000000000000000..f3e52ce990a7457f464bcb8b08d83099dc28a8b9
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.tokenizer.po
@@ -0,0 +1,36 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.distilbert.tokenizer.rst:2
+msgid "tokenizer"
+msgstr ""
+
+#: of paddlenlp.transformers.distilbert.tokenizer.DistilBertTokenizer:1
+msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`"
+msgstr ""
+
+#: of paddlenlp.transformers.distilbert.tokenizer.DistilBertTokenizer:1
+msgid ""
+"Constructs a DistilBert tokenizer. The usage of DistilBertTokenizer is "
+"the same as `BertTokenizer "
+"`__."
+" For more information regarding those methods, please refer to this "
+"superclass."
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distill_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distill_utils.po
new file mode 100644
index 0000000000000000000000000000000000000000..4d59ea3942031813e6cb2dcf56deec2a18fee9c3
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distill_utils.po
@@ -0,0 +1,116 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.distill_utils.rst:2
+msgid "distill\_utils"
+msgstr ""
+
+#: of paddlenlp.transformers.distill_utils.to_distill:1
+msgid ""
+"Can be bound to an object with transformer encoder layers, making the "
+"model expose the attributes `outputs.q`, `outputs.k`, `outputs.v`, "
+"`outputs.scaled_qks`, `outputs.hidden_states` and `outputs.attentions` of "
+"the object for distillation. These intermediate tensors can then be "
+"returned for use in the MiniLM and TinyBERT strategies."
+msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss:1 +msgid "" +"Calculates loss for Q-Q, K-K, V-V relation from MiniLMv2. :param " +"loss_fct: Loss function for distillation. It only supports kl_div loss " +"now. :type loss_fct: callable :param s: Q, K, V of Student. :type s: " +"Tensor :param t: Q, K, V of teacher. :type t: Tensor :param attn_mask: " +"Attention mask for relation. :type attn_mask: Tensor :param " +"num_relation_heads: The number of relation heads. 0 means " +"`num_relation_heads` equals" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss:11 +msgid "to origin head num. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss +#: paddlenlp.transformers.distill_utils.calc_multi_relation_loss +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss:15 +msgid "MiniLM loss value." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss +#: paddlenlp.transformers.distill_utils.calc_multi_relation_loss +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:1 +msgid "" +"Calculates loss for multiple Q-Q, K-K and V-V relation. It supports head-" +"head relation, sample-sample relation and origin token-token relation. " +"The final loss value could be balanced by weight `alpha` and `beta`." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:5 +msgid "Loss function for distillation. It only supports kl_div loss now." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:7 +msgid "Q, K, V of Student." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:9 +msgid "Q, K, V of teacher." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:11 +msgid "Attention mask for relation." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:13 +msgid "" +"The number of relation heads. 0 means `num_relation_heads` equals to " +"origin head num. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:17 +msgid "The weight for head-head relation. Defaults to 0.0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:20 +msgid "The weight for sample-sample relation. Defaults to 0.0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:24 +msgid "" +"Weighted loss of token-token loss, head-head loss and sample-sample " +"loss." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:26 +msgid "Weighted loss of token-token loss, head-head loss and" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:27 +msgid "sample-sample loss." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..821939bccd2221d46d3ef288d4bfde24a03dbf3d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.modeling.po @@ -0,0 +1,905 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.electra.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator:1 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:1 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering:1 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:1 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:1 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:1 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator:1 +#: paddlenlp.transformers.electra.modeling.ElectraModel:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:1 +msgid "基类::class:`paddlenlp.transformers.electra.modeling.ElectraPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:1 +msgid "The bare Electra Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ElectraGenerator +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward +#: paddlenlp.transformers.electra.modeling.ElectraModel +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ElectraModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:3 +#: paddlenlp.transformers.electra.modeling.ElectraModel:13 +msgid "Dimensionality of the embedding layer." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:15 +msgid "Dimensionality of the encoder layer and pooler layer." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:17 +msgid "Number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:21 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:31 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:33 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:36 +msgid "The vocabulary size of `token_type_ids`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:38 +msgid "" +"The standard deviation of the normal initializer. .. note:: A " +"normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ElectraPretrainedModel.init_weights()` for how weights " +"are initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:38 +msgid "The standard deviation of the normal initializer." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:41 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ElectraPretrainedModel.init_weights()` for how weights are " +"initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:44 +msgid "The index of padding token in the token vocabulary." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:1 +msgid "" +"The ElectraModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:32 +msgid "" +"Returns tensor `encoder_outputs`, which is the output at the last layer " +"of the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:16 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:17 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:26 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:17 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:17 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:15 +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward:37 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainedModel:1 +msgid "" +"An abstract class for pretrained Electra models. It provides Electra " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainedModel.tie_weights:1 +msgid "Tie the weights between the input embeddings and the output embeddings." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:1 +msgid "Electra Model for pretraining tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:3 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:3 +msgid "An instance of :class:`ElectraGenerator`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:5 +msgid "An instance of :class:`ElectraDiscriminator`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:1 +msgid "" +"The ElectraForPretraining forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:1 +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:1 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:14 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:6 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:8 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:10 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:3 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:5 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:7 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:9 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:14 +msgid "See :class:`ElectraModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:11 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:11 +msgid "" +"Raw inputs used to get discriminator labels. Its data type should be " +"`int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:14 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:14 +msgid "" +"Labels to compute the discriminator inputs. Its data type should be int64" +" and its shape is [batch_size, sequence_length]. The value for unmasked " +"tokens should be -100 and value for masked tokens should be 0." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:19 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:19 +msgid "" +"Returns tuple (generator_logits, disc_logits, disc_labels, " +"attention_mask). With the fields: - `generator_logits` (Tensor): " +"The scores of Electra Generator. Its data type should be int64 and " +"its shape is [batch_size, sequence_length, vocab_size]. - `disc_logits` " +"(Tensor): The prediction result of replaced tokens. Its data " +"type should be float32 and if batch_size>1, its shape is [batch_size, " +"sequence_length], if batch_size=1, its shape is [sequence_length]. -" +" `disc_labels` (Tensor): The labels of electra discriminator. Its " +"data type should be int32, and its shape is [batch_size, " +"sequence_length]. - `attention_mask` (Tensor): See " +":class:`ElectraModel`. Its data type should be bool." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:19 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:19 +msgid "" +"Returns tuple (generator_logits, disc_logits, disc_labels, " +"attention_mask)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:14 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:21 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:21 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:25 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:25 +msgid "`generator_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:24 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:24 +msgid "" +"The scores of Electra Generator. Its data type should be int64 and its " +"shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:30 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:30 +msgid "`disc_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:28 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:28 +msgid "" +"The prediction result of replaced tokens. Its data type should be " +"float32 and if batch_size>1, its shape is [batch_size, sequence_length], " +"if batch_size=1, its shape is [sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:34 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:34 +msgid "`disc_labels` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:33 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:33 +msgid "" +"The labels of electra discriminator. Its data type should be int32, and " +"its shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:36 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:36 +msgid "`attention_mask` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:37 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:37 +msgid "See :class:`ElectraModel`. Its data type should be bool." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator:1 +msgid "" +"The Electra Discriminator can detect the tokens that are replaced by the " +"Electra Generator." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator:3 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator:4 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:6 +msgid "An instance of :class:`ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:10 +msgid "" +"Returns tensor `logits`, the prediction result of replaced tokens. Its " +"data type should be float32 and if batch_size>1, its shape is " +"[batch_size, sequence_length], if batch_size=1, its shape is " +"[sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraGenerator:1 +msgid "" +"The Electra Generator will replace some tokens of the given sequence, it " +"is trained as a masked language model." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:10 +msgid "" +"Returns tensor `prediction_scores`, the scores of Electra Generator. Its " +"data type should be int64 and its shape is [batch_size, sequence_length, " +"vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:1 +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:1 +msgid "Perform sentence-level classification tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:5 +msgid "The dropout probability for all fully connected layers." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:7 +msgid "The number of classes." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:9 +msgid "The activation function name between layers." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward:1 +msgid "" +"The ElectraClassificationHead forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward:3 +msgid "" +"Input sequence, usually the `sequence_output` of electra model. Its data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward:7 +msgid "" +"Returns a tensor of the input text classification logits. Shape as " +"`[batch_size, num_classes]` and dtype as float32." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:1 +msgid "" +"Electra Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:4 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering:4 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:4 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:4 +msgid "An instance of ElectraModel." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:6 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:8 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:8 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:8 +msgid "" +"The dropout probability for output of Electra. If None, use the same " +"value as `hidden_dropout_prob` of `ElectraModel` instance `electra`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:12 +msgid "The activation function name for classifier. Defaults to \"gelu\"." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:15 +msgid "The epsilon to initialize nn.LayerNorm layers. Defaults to 1e-12." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:1 +msgid "" +"The ElectraForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:1 +msgid "" +"Electra Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:1 +msgid "" +"The ElectraForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `ElectraModel`. Defines the number of " +"different tokens that can be represented by the `inputs_ids` passed when " +"calling `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:4 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:4 +msgid "The weight of the Electra Generator." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:6 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:6 +msgid "The weight of the Electra Discriminator." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:4 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:7 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length] or [sequence length] if " +"batch_size=1." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:7 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:4 +msgid "" +"The labels of the generator, its dimensionality is equal to " +"`generator_prediction_scores`. Its data type should be int64 and its " +"shape is [batch_size, sequence_size, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:10 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:10 +msgid "" +"The labels of the discriminator, its dimensionality is equal to " +"`discriminator_prediction_scores`. The labels should be numbers between 0" +" and 1. Its data type should be float32 and its shape is [batch_size, " +"sequence_size] or [sequence length] if batch_size=1." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:17 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:17 +msgid "" +"The pretraining loss, equals to weighted generator loss plus the weighted" +" discriminator loss. Its data type should be float32 and its shape is " +"[1]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:1 +msgid "" +"Electra Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:1 +msgid "" +"The ElectraForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:9 +msgid "" +"See :class:`ElectraModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering:1 +msgid "" +"Electra Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:1 +msgid "" +"The ElectraForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:1 +msgid "基类::class:`paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:1 +msgid "ERNIE-Health Model for pretraining task." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:5 +msgid "" +"class:`ErnieHealthDiscriminator): An instance of " +":class:`ErnieHealthDiscriminator`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax:1 +msgid "Sample K=5 non-original negative samples for candidate set." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax:3 +msgid "" +"Returns tensor `neg_samples_ids`, a tensor of the negative samples of " +"original inputs. Shape as ` [batch_size, sequence_length, K, vocab_size]`" +" and dtype as `int64`." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:4 +msgid "" +"The Discriminators in ERNIE-Health (https://arxiv.org/abs/2110.07244), " +"including" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:2 +msgid "token-level Replaced Token Detection (RTD) task" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:3 +msgid "token-level Multi-Token Selection (MTS) task" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:4 +msgid "sequence-level Contrastive Sequence Prediction (CSP) task." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:3 +msgid "" +"The candidate indices of input sequence tokens in the vocabulary for MTS " +"task. Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:13 +msgid "" +"Returns list of tensors, the prediction results of RTD, MTS and CSP. The " +"logits' data type should be float32 and if batch_size > 1, - the " +"shape of `logits_rtd` is [batch_size, sequence_length], - the shape " +"of `logits_mts` is [batch_size, sequence_length, num_candidate], - " +"the shape of `logits_csp` is [batch_size, 128]. If batch_size=1, the " +"shapes are [sequence_length], [sequence_length, num_cadidate], [128], " +"separately." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:13 +msgid "" +"Returns list of tensors, the prediction results of RTD, MTS and CSP. The " +"logits' data type should be float32 and if batch_size > 1," +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:15 +msgid "the shape of `logits_rtd` is [batch_size, sequence_length]," +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:16 +msgid "the shape of `logits_mts` is [batch_size, sequence_length, num_candidate]," +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:17 +msgid "the shape of `logits_csp` is [batch_size, 128]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:18 +msgid "" +"If batch_size=1, the shapes are [sequence_length], [sequence_length, " +"num_cadidate], [128], separately." +msgstr "" + +#~ msgid "" +#~ "Returns tuple (gen_logits, disc_logits, " +#~ "disc_labels, attention_mask). With the " +#~ "fields: - `gen_logits` (Tensor): The " +#~ "scores of Electra Generator. Its " +#~ "data type should be int64 and its" +#~ " shape is [batch_size, sequence_length, " +#~ "vocab_size]. - `disc_logits` (Tensor): " +#~ "The prediction result of replaced" +#~ " tokens. Its data type should be" +#~ " float32 and if batch_size>1, its " +#~ "shape is [batch_size, sequence_length], if" +#~ " batch_size=1, its shape is " +#~ "[sequence_length]. - `disc_labels` (Tensor):" +#~ " The labels of electra discriminator." +#~ " Its data type should be int32," +#~ " and its shape is [batch_size, " +#~ "sequence_length]. - `attention_mask` (Tensor):" +#~ " See :class:`ElectraModel`. Its data " +#~ "type should be bool." +#~ msgstr "" + +#~ msgid "Returns tuple (gen_logits, disc_logits, disc_labels, attention_mask)." 
+#~ msgstr "" + +#~ msgid "`gen_logits` (Tensor):" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.po new file mode 100644 index 0000000000000000000000000000000000000000..24f321c4c9b2f582a1705248a388c0b5134668d1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.electra.rst:2 +msgid "electra" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..c79fb28ec8eb1907ac512701ab906355b2440e0e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.tokenizer.po @@ -0,0 +1,303 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.electra.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:1 +msgid "" +"Constructs an Electra tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:34 +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size:3 +msgid "The size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add:3 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add:6 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:4 +msgid "A ELECTRA sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:12 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A ELECTRA offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:3 +msgid "A ELECTRA sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:8 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:10 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:15 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.faster_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.faster_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..92d346421ca16a999d4394aad38d012a0ffa2edc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.faster_tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.ernie.fast_tokenizer.rst:2 +msgid "fast\\_tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..7461cbeed8cf5e78d2ca085e929d1c682f1ba064 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.modeling.po @@ -0,0 +1,598 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:1 +#: paddlenlp.transformers.ernie.modeling.ErnieModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie.modeling.ErniePretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:1 +msgid "The bare ERNIE Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieModel +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. 
See " +":meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:5 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:5 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:10 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:11 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:13 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:16 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:20 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. 
For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. Defaults to `None`, which means nothing needed to " +"be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:34 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:34 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:12 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:12 +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward:36 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:40 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:44 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:43 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:15 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:24 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:15 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward:49 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainedModel:1 +msgid "" +"An abstract class for pretrained ERNIE models. It provides ERNIE related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:1 +msgid "" +"Ernie Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:8 +msgid "" +"The dropout probability for output of ERNIE. If None, use the same value " +"as `hidden_dropout_prob` of `paddlenlp.transformers.ErnieModel` instance." +" Defaults to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:7 +msgid "See :class:`ErnieModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:1 +msgid "" +"ERNIE Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:4 +msgid "An instance of `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE. If None, use the same value " +"as `hidden_dropout_prob` of `ErnieModel` instance `ernie`. Defaults to " +"`None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering:1 +msgid "" +"Ernie Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. 
Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining:1 +msgid "" +"Ernie Model with a `masked language modeling` head and a `sentence order " +"prediction` head on top." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:10 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:10 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:17 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:15 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:20 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:20 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion:1 +msgid "" +"The loss output of Ernie Model during the pretraining: a `masked language" +" modeling` head and a `next sentence prediction (classification)` head." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:17 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM:1 +msgid "Ernie Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM:3 +msgid "An instance of :class:`ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:10 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:1 +msgid "" +"Ernie Model with a linear layer on top of the hidden-states output layer," +" designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:4 +msgid "An instance of ErnieModel." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:8 +msgid "" +"The dropout probability for output of Ernie. If None, use the same value " +"as `hidden_dropout_prob` of `ErnieModel` instance `ernie`. Defaults to " +"None." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:1 +msgid "" +"The ErnieForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:9 +msgid "" +"See :class:`ErnieModel` and shape as [batch_size, num_choice, " +"sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.po new file mode 100644 index 0000000000000000000000000000000000000000..d63a014b0a77f342f5b403c19af8e967132316b0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie.rst:2 +msgid "ernie" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..1f3a1cfcfe77c7b4e7e45067789c1860dc918e99 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.tokenizer.po @@ -0,0 +1,421 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:1 +msgid "" +"Constructs an ERNIE tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.save_resources +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:23 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:26 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:30 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:33 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:36 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:39 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:45 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:34 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size:3 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:8 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. Do not put this inside your training loop." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:8 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:8 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:12 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:12 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:4 +msgid "An Ernie sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:7 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:6 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:15 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:3 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "An ERNIE offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:6 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:8 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:10 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:14 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:3 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:3 +msgid "A ERNIE sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:9 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:17 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:17 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:1 +msgid "" +"Constructs a ErnieTiny tokenizer. It uses the `dict.wordseg.pickle` cut " +"the text to words, and use the `sentencepiece` tools to cut the words to " +"sub-words." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:17 +msgid "The file path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:19 +msgid "The file path of sentencepiece model." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:21 +msgid "" +"The file path of word vocabulary, which is used to do chinese word " +"segmentation." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:11 +msgid "Examples: .. code-block::" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.save_resources:3 +msgid "Directory to save files into." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:4 +msgid "An ERNIE sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:14 +msgid "List of wordpiece offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:9 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:13 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ef101c8733d39a41cfe1e3ddb8e5d70967c53e27 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.modeling.po @@ -0,0 +1,443 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_ctm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:1 +msgid "" +"An abstract class for pretrained ErnieCtm models. It provides ErnieCtm " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:4 +msgid "and loading pretrained models." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:5 +msgid "" +"See :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more" +" details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:1 +msgid "The bare ErnieCtm Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieCtmModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ErnieCtmModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:16 +msgid "" +"Dimensionality of the encoder layers and the pooler layer. Defaults to " +"`768`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:19 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:21 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:24 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:44 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:47 +msgid "Whether or not to add content summary tokens. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:50 +msgid "" +"The number of the content summary tokens. Only valid when " +"use_content_summary is True. Defaults to `1`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:53 +msgid "" +"The number of the CLS tokens. Only valid when use_content_summary is " +"True. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:1 +msgid "The ErnieCtmModel forward method, overrides the __call__() special method." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. 
Defaults to `None`, which means nothing needed to " +"be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:35 +msgid "" +"Whether the `content_output` is clone from `sequence_output`. If set to " +"`True`, the content_output is clone from sequence_output, which may cause" +" the classification task impact on the sequence labeling task. Defaults " +"to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:40 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``, " +"``content_output``). With the fields: - `sequence_output` (Tensor):" +" Sequence of output at the last layer of the model. Its data type " +"should be float32 and has a shape of [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`content_output` (Tensor): The output of content summary token " +"(`[CLS1]` in sequence). Its data type should be float32 and has a " +"shape of [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:40 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``, " +"``content_output``)." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:42 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:19 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:46 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:45 +msgid "" +"Sequence of output at the last layer of the model. Its data type should " +"be float32 and has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:51 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:49 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:54 +msgid "`content_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:54 +msgid "" +"The output of content summary token (`[CLS1]` in sequence). Its data type" +" should be float32 and has a shape of [batch_size, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:59 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:15 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:27 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:1 +msgid "" +"ErnieCtmWordtag Model with a token classification head on top (a crf " +"layer on top of the hidden-states output) . e.g. for Named-Entity-" +"Recognition (NER) tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:4 +msgid "An instance of :class:`ErnieCtmModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:6 +msgid "The number of different tags." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:8 +msgid "The learning rate of the crf. Defaults to `100`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:7 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:5 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:7 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:5 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:7 +msgid "See :class:`ErnieCtmModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:9 +msgid "" +"The input length. Its dtype is int64 and has a shape of `[batch_size]`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:12 +msgid "" +"The input predicted tensor. Its dtype is float32 and has a shape of " +"`[batch_size, sequence_length, num_tags]`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:17 +msgid "" +"Returns tuple (`seq_logits`, `cls_logits`). With the fields: - " +"`seq_logits` (Tensor): A tensor of next sentence prediction logits." +" Its data type should be float32 and its shape is [batch_size, " +"sequence_length, num_tag]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:17 +msgid "Returns tuple (`seq_logits`, `cls_logits`)." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:22 +msgid "`seq_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:22 +msgid "" +"A tensor of next sentence prediction logits. Its data type should be " +"float32 and its shape is [batch_size, sequence_length, num_tag]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel:1 +msgid "ErnieCtmNptag Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:10 +msgid "" +"Returns tensor `logits`, the scores of masked token prediction. Its data " +"type should be float32 and shape is [batch_size, sequence_length, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:1 +msgid "" +"ERNIECtm Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:4 +msgid "An instance of `ErnieModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE. If None, use the same value " +"as `hidden_dropout_prob` of `ErnieCtmModel` instance `ernie`. Defaults to" +" `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[sequence_length, num_classes]` and dtype as `float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.po new file mode 100644 index 0000000000000000000000000000000000000000..d28a42a646a75e1bc35e5b6b6708785e756898de --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_ctm.rst:2 +msgid "ernie\\_ctm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..fc4e22a00b052d3190870c102b316d600c3c3aa6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.tokenizer.po @@ -0,0 +1,295 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_ctm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:1 +msgid "Construct an ERNIE-CTM tokenizer." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:7 +msgid "File path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:9 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:11 +msgid "" +"Whether or not to do basic tokenization before WordPiece. Defaults to " +"`True`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:13 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:17 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:20 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:23 +msgid "" +"The template of summary token for multiple summary placeholders. Defaults" +" to `\"[CLS{}]\"`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:25 +msgid "" +"Summary placeholder used in ernie-ctm model. For catching a sentence " +"global feature from multiple aware. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task. This is the token which the model will" +" try to predict the original unmasked ones. Defaults to `\"[MASK]\"`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:32 +msgid "" +"(bool, optional): Whether or not to strip all accents. If this option is " +"not specified, then it will be determined by the value for `lowercase` " +"(as in the original BERT)." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:37 +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequences for sequence " +"classification tasks by concatenating and add special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:4 +msgid "A ERNIE-CTM sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: [CLS0][CLS1]... X [SEP]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: [CLS0][CLS1]... X [SEP] X [SEP]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:11 +msgid "Optional second list of IDs for sequence pairs. Defaults to ``None``." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:14 +msgid "The input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:1 +msgid "" +"Creates a special tokens mask from the input sequences. This method is " +"called when adding special tokens using the tokenizer `encode` method." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:23 +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:25 +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:6 +msgid "" +"Optional second list of `inputs_ids` for the second sequence. Defaults to" +" `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:9 +msgid "" +"Whether or not the token list already contains special tokens for the " +"model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:13 +msgid "" +"A list of integers which is either 0 or 1: 1 for a special token, 0 for a" +" sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:1 +msgid "Creates a token_type mask from the input sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"If `token_ids_1` is not `None`, then a sequence pair token_type mask has " +"the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:11 +msgid "" +"Else if `token_ids_1` is `None`, then a single sequence token_type mask " +"has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:19 +msgid "0 stands for the segment id of **first segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:20 +msgid "1 stands for the segment id of **second segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:21 +msgid "2 stands for the segment id of **cls_token**." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:29 +msgid "List of token type IDs according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. 
Do not put this inside your training loop." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:8 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:12 +msgid "Number of tokens added to sequences." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ad156d1cebf8cc0ef7aaffea898e7ea34c0c02b9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.modeling.po @@ -0,0 +1,535 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_doc.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:1 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:1 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:1 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_doc.modeling.ErnieDocPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:1 +msgid "The bare ERNIE-Doc Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:6 +msgid "" +"This model is also a `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:10 +msgid "The number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:12 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:14 +msgid "Dimensionality of the embedding layers, encoder layers and pooler layer." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:16 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:18 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:20 +msgid "The dropout probability of FFN." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:22 +msgid "The non-linear activation function of FFN." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:24 +msgid "" +"The number of tokens to cache. If not 0, the last `memory_len` hidden " +"states in each layer will be cached into memory." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:27 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieDocModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ErnieDocModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:30 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:33 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:35 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:40 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. Defaults to `1e-5`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:43 +msgid "Whether to share the relative position parameters. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:46 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:49 +msgid "" +"The token id of [PAD] token whose parameters won't be updated when " +"training. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:52 +msgid "The token id of [CLS] token. Defaults to `-1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:1 +msgid "" +"The ErnieDocModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. 
It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length, 1]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:7 +msgid "" +"A list of length `n_layers` with each Tensor being a pre-computed hidden-" +"state for each layer. Each Tensor has a dtype `float32` and a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:10 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length, 1]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:10 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:13 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:14 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:16 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length, 1]. Defaults to None, which means no segment embeddings " +"is added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:19 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, config.max_position_embeddings - " +"1]``. Shape as `(batch_sie, num_tokens)` and dtype as `int32` or `int64`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. Defaults to `None`, which means nothing needed to " +"be prevented attention to." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:36 +msgid "" +"Returns tuple (``encoder_output``, ``pooled_output``, ``new_mem``). With" +" the fields: - `encoder_output` (Tensor): Sequence of hidden-states " +"at the last layer of the model. It's data type should be float32 and " +"its shape is [batch_size, sequence_length, hidden_size]. - " +"`pooled_output` (Tensor): The output of first token (`[CLS]`) in " +"sequence. We \"pool\" the model by simply taking the hidden state " +"corresponding to the first token. Its data type should be float32 and" +" its shape is [batch_size, hidden_size]. - `new_mem` (List[Tensor]):" +" A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32`" +" and shape as [batch_size, memory_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:36 +msgid "Returns tuple (``encoder_output``, ``pooled_output``, ``new_mem``)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:16 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:16 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:17 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:38 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:42 +msgid "`encoder_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:41 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:47 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:45 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:50 +msgid "`new_mem` (List[Tensor]):" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:50 +msgid "" +"A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32` and" +" shape as [batch_size, memory_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:33 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:29 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:30 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocPretrainedModel:1 +msgid "" +"An abstract class for pretrained ErnieDoc models. It provides ErnieDoc " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:1 +msgid "" +"ErnieDoc Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:4 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:4 +msgid "An instance of :class:`ErnieDocModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:6 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:8 +msgid "The dropout ratio of last output. Default to `0.1`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:1 +msgid "" +"The ErnieDocForSequenceClassification forward method, overrides the " +"`__call__()` special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:9 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:11 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:9 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:11 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:10 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:12 +msgid "See :class:`ErnieDocModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:14 +msgid "" +"Returns tuple (`logits`, `mem`). With the fields: - `logits` (Tensor):" +" A tensor containing the [CLS] of hidden-states of the model at the " +"output of last layer. Each Tensor has a data type of `float32` and " +"has a shape of [batch_size, num_classes]. - `mem` (List[Tensor]): A " +"list of pre-computed hidden-states. The length of the list is `n_layers`." +" Each element in the list is a Tensor with dtype `float32` and has a " +"shape of [batch_size, memory_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:14 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:15 +msgid "Returns tuple (`logits`, `mem`)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:20 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:21 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:19 +msgid "" +"A tensor containing the [CLS] of hidden-states of the model at the output" +" of last layer. Each Tensor has a data type of `float32` and has a shape " +"of [batch_size, num_classes]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:28 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:24 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:25 +msgid "`mem` (List[Tensor]):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:27 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:23 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:24 +msgid "" +"A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32` and" +" has a shape of [batch_size, memory_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:1 +msgid "" +"ErnieDoc Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:7 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:8 +msgid "The dropout ratio of last output. Default to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:1 +msgid "" +"The ErnieDocForTokenClassification forward method, overrides the " +"`__call__()` special method." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:7 +msgid "" +"See :class:`ErnieDocModel`. Defaults to None, which means no segment " +"embeddings is added to token embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:15 +msgid "" +"Returns tuple (`logits`, `mem`). With the fields: - `logits` (Tensor):" +" A tensor containing the hidden-states of the model at the output of " +"last layer. Each Tensor has a data type of `float32` and has a shape " +"of [batch_size, sequence_length, num_classes]. - `mem` (List[Tensor]):" +" A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32`" +" and has a shape of [batch_size, memory_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:20 +msgid "" +"A tensor containing the hidden-states of the model at the output of last " +"layer. Each Tensor has a data type of `float32` and has a shape of " +"[batch_size, sequence_length, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:1 +msgid "" +"ErnieDoc Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:1 +msgid "" +"The ErnieDocForQuestionAnswering forward method, overrides the " +"`__call__()` special method." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:14 +msgid "" +"Returns tuple (`start_logits`, `end_logits`, `mem`). With the fields: -" +" `start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `mem` (List[Tensor]): A list of pre-computed hidden-states. The " +"length of the list is `n_layers`. Each element in the list is a " +"Tensor with dtype `float32` and has a shape of [batch_size, " +"memory_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:14 +msgid "Returns tuple (`start_logits`, `end_logits`, `mem`)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:20 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:24 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.po new file mode 100644 index 0000000000000000000000000000000000000000..09c4c0c23d421ec1da371bae3bddccc1ddf28448 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_doc.rst:2 +msgid "ernie\\_doc" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..bddacc7b7ee00985d1ff75722a50b9b3b2b392a1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.tokenizer.po @@ -0,0 +1,148 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_doc.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:1 +msgid "" +"Constructs an ERNIE-Doc tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`. For more" +" information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:8 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:11 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:13 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:14 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:17 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:18 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:20 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:21 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:23 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:24 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:26 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:27 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:32 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:33 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.BPETokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:1 +msgid "" +"Constructs an ERNIE-Doc BPE tokenizer. It uses a bpe tokenizer to do " +"punctuation splitting, lower casing and so on, then tokenize words as " +"subwords." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:4 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.BPETokenizer`. For more " +"information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:7 +msgid "File path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:9 +msgid "File path of the id to vocab." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:11 +msgid "File path of word merge text." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..585be964010e36c5591945c8577c18801531c8a1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.modeling.po @@ -0,0 +1,168 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gen.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieGenPretrainedModel:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieGenPretrainedModel:1 +msgid "" +"An abstract class for pretrained ErnieGen models. It provides ErnieGen " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gen.modeling.ErnieGenPretrainedModel.save_pretrained:1 +msgid "" +"Save model configuration and related resources (model state) to files " +"under `save_directory`. :param save_directory: Directory to save files " +"into. :type save_directory: str" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration:1 +msgid "基类::class:`paddlenlp.transformers.ernie_gen.modeling.ErnieModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration:1 +msgid "Ernie Model for sequence to sequence generation." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.ernie.modeling.ErnieModel`. Refer to the " +"superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:1 +msgid "" +"The ground truth target sequence id (hard label) or distribution (soft " +"label). It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length] or [batch_size, sequence_length, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:5 +msgid "" +"Index of tgt_labels in `src_ids`. It's data type should be `int64` and " +"has a shape of [n_targets, 2])." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:8 +msgid "" +"Whether the model will output the logits or only encode the inputs. If " +"`encode_only` is `True`, `loss` and `logits_2d` will not be returned." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:12 +msgid "" +"Returns tuple (`None`, `None`, `info`) if `encode_only` is `True`, " +"returns (`output_ids`, `logits`, `info`) if `tgt_labels` or `tgt_pos` is " +"`None`, else, returns (`loss`, `logits_2d`, `info`). With the fields: -" +" `info`(dict): Middle level info, includes all hidden stats and k/v " +"caches. - `output_ids`(Tensor): The output index. Its data type " +"should be float32 and its shape is [batch_size]. If `encode_only`, " +"returns None. - `logits`(Tensor): Logits for every targets. Its " +"data type should be float32 and its shape is [batch_size, " +"sequence_length]. If `encode_only`, returns None. - `loss`(Tensor):" +" Cross entropy loss mean over every target label. If " +"`encode_only`, returns None. - `logits_2d`(Tensor): Logits for every" +" targets if `tgt_labels` or `tgt_pos` is not `None` . Its data type " +"should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:12 +msgid "" +"Returns tuple (`None`, `None`, `info`) if `encode_only` is `True`, " +"returns (`output_ids`, `logits`, `info`) if `tgt_labels` or `tgt_pos` is " +"`None`, else, returns (`loss`, `logits_2d`, `info`)." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:16 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:19 +msgid "`info`(dict):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:19 +msgid "Middle level info, includes all hidden stats and k/v caches." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:23 +msgid "`output_ids`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:22 +msgid "" +"The output index. Its data type should be float32 and its shape is " +"[batch_size]. If `encode_only`, returns None." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:28 +msgid "`logits`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:26 +msgid "" +"Logits for every targets. Its data type should be float32 and its shape " +"is [batch_size, sequence_length]. If `encode_only`, returns None." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:32 +msgid "`loss`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:31 +msgid "" +"Cross entropy loss mean over every target label. If `encode_only`, " +"returns None." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:35 +msgid "`logits_2d`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:35 +msgid "" +"Logits for every targets if `tgt_labels` or `tgt_pos` is not `None` . Its" +" data type should be float32 and its shape is [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.po new file mode 100644 index 0000000000000000000000000000000000000000..c4aa4f4ff1729736b3401de0642f6c1adc50b2a5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gen.rst:2 +msgid "ernie\\_gen" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.matching_param_name.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.matching_param_name.po new file mode 100644 index 0000000000000000000000000000000000000000..b017641491e348c56a3000dcc39e88a0eeb546fe --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.matching_param_name.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.matching_param_name.rst:2 +msgid "matching\\_param\\_name" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..34a2cd874dfa50ca3aed9d472bf2faf08749ee80 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.modeling.po @@ -0,0 +1,435 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:1 +msgid "The bare ERNIE-Gram Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:10 +msgid "" +"Vocabulary size of the ERNIE-Gram model. Also is the vocab size of token " +"embedding matrix." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:12 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:14 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:16 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:19 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:24 +msgid "" +"The non-linear activation function in the feed-forward layer. 
" +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:28 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoders. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:31 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:34 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:37 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:40 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieGramModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:40 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:44 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieGramModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:47 +msgid "" +"The relative position size just for ERNIE-Gram English model. Defaults to" +" None." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:49 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:8 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:9 +msgid "1 corresponds to a **sentence B** token." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:11 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:14 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, config.max_position_embeddings - " +"1]``. Defaults to `None`. Shape as `(batch_sie, num_tokens)` and dtype as" +" `int32` or `int64`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:18 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. Defaults to `None`, which means nothing needed to " +"be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:32 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:32 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:12 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:34 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:38 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:37 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:42 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:41 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:24 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:15 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel:1 +msgid "" +"An abstract class for pretrained ERNIE-Gram models. It provides ERNIE-" +"Gram related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:1 +msgid "" +"ERNIE-Gram Model with a linear layer on top of the output layer, designed" +" for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.ErnieGramModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:6 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:8 +msgid "" +"The dropout probability for output of ERNIE-Gram. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.ErnieGramModel`" +" instance. Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:7 +msgid "See :class:`ErnieGramModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:1 +msgid "" +"ERNIE-Gram Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:4 +msgid "An instance of `ErnieGramModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE-Gram. If None, use the same " +"value as `hidden_dropout_prob` of `ErnieGramModel` instance `ernie_gram`." +" Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering:1 +msgid "" +"ERNIE-Gram Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD.." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.po new file mode 100644 index 0000000000000000000000000000000000000000..2397fb7858eb3b5259d4f0eafccbf7d3d9b0496a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.rst:2 +msgid "ernie\\_gram" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..a99977fa731d901cd5deca74d8c97673d4ea00c8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.tokenizer.po @@ -0,0 +1,91 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:1 +msgid "" +"Constructs an ERNIE-Gram tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:4 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`. For more" +" information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:7 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:10 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:13 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:17 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:20 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:23 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:26 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:32 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..563c4e457b395571e84b66ab14dd49bb2b83cd03 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.modeling.po @@ -0,0 +1,391 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_m.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:1 +msgid "The bare ERNIE-M Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieMModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. 
" +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`ErnieMPretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErnieMPretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:5 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:9 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. 
Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:21 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:21 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:12 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:23 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:27 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:26 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:31 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:30 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:24 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:15 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:36 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel:1 +msgid "" +"An abstract class for pretrained ERNIE-M models. It provides ERNIE-M " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:1 +msgid "" +"Ernie-M Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:8 +msgid "" +"The dropout probability for output of ERNIE-M. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.ErnieMModel` " +"instance. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:7 +msgid "See :class:`ErnieMModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:1 +msgid "" +"ERNIE-M Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:4 +msgid "An instance of `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE-M. If None, use the same " +"value as `hidden_dropout_prob` of `ErnieMModel` instance `ernie_m`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering:1 +msgid "" +"Ernie-M Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.po new file mode 100644 index 0000000000000000000000000000000000000000..0bcbb7a03cbd2bed69bd57759f252ceb760cd950 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_m.rst:2 +msgid "ernie\\_m" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..51866a4482e1cd3426ecd0edbb3de6b363ac2a1d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.tokenizer.po @@ -0,0 +1,216 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_m.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:1 +msgid "" +"Constructs a ErnieM tokenizer. It uses the `sentencepiece` tools to cut " +"the words to sub-words." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:3 +msgid "The file path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:5 +msgid "The file path of sentencepiece model." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:7 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:10 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:14 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:17 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:20 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:23 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size:3 +msgid "The size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.clean_text:1 +msgid "Performs invalid character removal and whitespace cleanup on text." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize:1 +msgid "Converts a string to a list of tokens." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.convert_ids_to_string:1 +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:4 +msgid "" +"An ERNIE-M sequence has the following format: - single sequence: " +"``[CLS] X [SEP]`` - pair of sequences: ``[CLS] A [SEP] [SEP] B " +"[SEP]`` :param token_ids_0: List of IDs to which the special tokens will " +"be added. :type token_ids_0: List[int] :param token_ids_1: Optional " +"second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:10 +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:6 +msgid "Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:13 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "" +"An ERNIE-M offset_mapping has the following format: - single sequence:" +" ``(0,0) X (0,0)`` - pair of sequences: ``(0,0) A (0,0) (0,0)" +" B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:9 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of wordpiece offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods. :param token_ids_0: List of ids of the " +"first sequence. :type token_ids_0: List[int] :param token_ids_1: Optional" +" second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.export.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.export.po new file mode 100644 index 0000000000000000000000000000000000000000..4baa1d450bff7493a7ba88da07aa41fedf4768c4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.export.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.export.rst:2 +msgid "export" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..7bd4c32fa12003ec986dc377cea94658f5aed5b9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.modeling.po @@ -0,0 +1,539 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.fnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling:1 +msgid "Modeling classes for FNet model." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained FNet models. 
It provides FNet related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM:1 +#: paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice:1 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction:1 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining:1 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:1 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:1 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:1 +#: paddlenlp.transformers.fnet.modeling.FNetModel:1 +msgid "基类::class:`paddlenlp.transformers.fnet.modeling.FNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:1 +msgid "" +"The model can behave as an encoder, following the architecture described " +"in `FNet: Mixing Tokens with Fourier Transforms " +"`__ by James Lee-Thorp, Joshua Ainslie," +" Ilya Eckstein, Santiago Ontanon." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward +#: paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice +#: paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetModel +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:5 +msgid "" +"Vocabulary size of `inputs_ids` in `FNetModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `FNetModel`. " +"Defaults to `32000`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:9 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:11 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:13 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:18 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `glue_new`." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:22 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:25 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:28 +msgid "The vocabulary size of `token_type_ids`. Defaults to `4`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:30 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:30 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .. " +"note::" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:35 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. A small value to the variance " +"added to the normalization layer to prevent division by zero. Defaults to" +" `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:39 +msgid "The index of padding token in the token vocabulary. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:41 +msgid "Whether or not to add the pooling layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:1 +msgid "The FNetModel forward method." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:3 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. 
Indices can either be 0 " +"or 1:" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:12 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:13 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:15 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:18 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:22 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:22 +msgid "" +"If you want to control how to convert `inputs_ids` indices into " +"associated vectors, you can pass an embedded representation directly " +"instead of passing `inputs_ids`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`, `encoder_outputs[1:]`)" +" or a dict with last_hidden_state`, `pooled_output`, `all_hidden_states`," +" fields. With the fields: - `sequence_output` (Tensor): Sequence of " +"hidden-states at the last layer of the model. It's data type should be" +" float32 and has a shape of [`batch_size, sequence_length, hidden_size`]." +" - `pooled_output` (Tensor): The output of first token (`[CLS]`) in " +"sequence. We \"pool\" the model by simply taking the hidden state " +"corresponding to the first token. Its data type should be float32 and" +" has a shape of [batch_size, hidden_size]. - `last_hidden_state` " +"(Tensor): The output of the last encoder layer, it is also the " +"`sequence_output`. It's data type should be float32 and has a shape of" +" [batch_size, sequence_length, hidden_size]. - `all_hidden_states` " +"(Tensor): Hidden_states of all layers in the Transformer encoder. The " +"length of `all_hidden_states` is `num_hidden_layers + 1`. For all " +"element in the tuple, its data type should be float32 and its shape is " +"[`batch_size, sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`, `encoder_outputs[1:]`)" +" or a dict with last_hidden_state`, `pooled_output`, `all_hidden_states`," +" fields." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:22 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:34 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:35 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:39 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:38 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and has a shape of [`batch_size, sequence_length, " +"hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:45 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:42 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and has a shape of [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:49 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:48 +msgid "" +"The output of the last encoder layer, it is also the `sequence_output`. " +"It's data type should be float32 and has a shape of [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:52 +msgid "`all_hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:52 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`all_hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:46 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:57 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:1 +msgid "" +"FNet Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice:4 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:4 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:4 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:4 +msgid "An instance of FNetModel." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:6 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:1 +msgid "The FNetForSequenceClassification forward method." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:32 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields. 
With the fields: - `logits` (Tensor): A tensor" +" of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32. - `hidden_states` (Tensor): " +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:32 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:38 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:37 +msgid "" +"A tensor of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:29 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:41 +msgid "`hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:29 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:41 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining:1 +msgid "" +"FNet Model with two heads on top as done during the pretraining: a " +"`masked language modeling` head and a `next sentence prediction " +"(classification)` head." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:1 +msgid "The FNetForPretraining forward method." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:3 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:5 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:9 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:15 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:17 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:3 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:5 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:11 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:17 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:19 +msgid "See :class:`FNetModel`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:9 +msgid "Labels for computing the masked language modeling loss." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:13 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. 
Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:22 +msgid "" +"Returns tuple (`prediction_scores`, `seq_relationship_score`) or a dict " +"with `prediction_logits`, `seq_relationship_logits`, `hidden_states` " +"fields." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM:1 +msgid "FNet Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM:3 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction:3 +msgid "An instance of :class:`FNetModel`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:1 +msgid "The FNetForMaskedLM forward method." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:11 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:13 +msgid "See :class:`FNetForPreTraining`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:20 +msgid "" +"Returns tensor `prediction_scores` or a dict with `prediction_logits`, " +"`hidden_states` fields. With the fields: - `prediction_scores` " +"(Tensor): The scores of masked token prediction. Its data type should" +" be float32. and its shape is [batch_size, sequence_length, " +"vocab_size]. - `hidden_states` (Tensor): Hidden_states of all layers" +" in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:20 +msgid "" +"Returns tensor `prediction_scores` or a dict with `prediction_logits`, " +"`hidden_states` fields." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:26 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:25 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction:1 +msgid "FNet Model with a `next sentence prediction` head on top." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward:1 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward:1 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward:1 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward:4 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward:4 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward:4 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward:6 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward:6 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward:6 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice:1 +msgid "" +"FNet Model with a linear layer on top of the hidden-states output layer, " +"designed for multiple choice tasks like SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:1 +msgid "" +"FNet Model with a linear layer on top of the hidden-states output layer, " +"designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:1 +msgid "" +"FNet Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:6 +msgid "The number of labels." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.po new file mode 100644 index 0000000000000000000000000000000000000000..d1ca3ef4986a7eebcb571f9ea6ab1fa2467925e4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.fnet.rst:2 +msgid "fnet" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..79004b0383ca1f76ff5c64931f67f19e87e61d11 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.tokenizer.po @@ -0,0 +1,308 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.fnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer:1 +msgid "Tokenization class for FNet model." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:1 +msgid "" +"Construct a FNet tokenizer. Inherit from :class:`AlbertEnglishTokenizer`." +" Based on `SentencePiece `__." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:4 +msgid "" +"`SentencePiece `__ file " +"(generally has a `.spm` extension) that contains the vocabulary necessary" +" to instantiate a tokenizer." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:7 +msgid "Whether or not to lowercase the input when tokenizing." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:9 +msgid "" +"Whether or not to strip the text when tokenizing (removing excess spaces " +"before and after the string)." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:11 +msgid "Whether or not to keep accents when tokenizing." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:13 +msgid "" +"The unknown token. A token that is not in the vocabulary cannot be " +"converted to an ID and is set to be this token instead." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:16 +msgid "" +"The separator token, which is used when building a sequence from multiple" +" sequences, e.g. two sequences for sequence classification or for a text " +"and a question for question answering. It is also used as the last token " +"of a sequence built with special tokens." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:20 +msgid "" +"The token used for padding, for example when batching sequences of " +"different lengths." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:22 +msgid "" +"The classifier token which is used when doing sequence classification " +"(classification of the whole sequence instead of per-token " +"classification). It is the first token of the sequence when built with " +"special tokens." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:25 +msgid "" +"The token used for masking values. This is the token used when training " +"this model with masked language modeling. This is the token which the " +"model will try to predict." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:28 +msgid "" +"Will be passed to the ``SentencePieceProcessor.__init__()`` method. 
The " +"`Python wrapper for SentencePiece " +"`__ can be " +"used, among other things, to set: - ``enable_sampling``: Enable subword " +"regularization. - ``nbest_size``: Sampling parameters for unigram. " +"Invalid for BPE-Dropout. - ``nbest_size = {0,1}``: No sampling is " +"performed. - ``nbest_size > 1``: samples from the nbest_size results." +" - ``nbest_size < 0``: assuming that nbest_size is infinite and samples" +" from the all hypothesis (lattice) using forward-filtering-and-" +"backward-sampling algorithm. - ``alpha``: Smoothing parameter for unigram" +" sampling, and dropout probability of merge operations for BPE-dropout." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:28 +msgid "" +"Will be passed to the ``SentencePieceProcessor.__init__()`` method. The " +"`Python wrapper for SentencePiece " +"`__ can be " +"used, among other things, to set:" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:31 +msgid "``enable_sampling``: Enable subword regularization." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:32 +msgid "``nbest_size``: Sampling parameters for unigram. Invalid for BPE-Dropout." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:34 +msgid "``nbest_size = {0,1}``: No sampling is performed." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:35 +msgid "``nbest_size > 1``: samples from the nbest_size results." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:36 +msgid "" +"``nbest_size < 0``: assuming that nbest_size is infinite and samples from" +" the all hypothesis (lattice) using forward-filtering-and-backward-" +"sampling algorithm." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:38 +msgid "" +"``alpha``: Smoothing parameter for unigram sampling, and dropout " +"probability of merge operations for BPE-dropout." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:44 +msgid "" +"The `SentencePiece` processor that is used for every conversion (string, " +"tokens and IDs)." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer +msgid "type" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:46 +msgid ":obj:`SentencePieceProcessor`" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. An FNet " +"sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:4 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:5 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:12 +msgid "" +"List of `input IDs <../glossary.html#input-ids>`__ with the appropriate " +"special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:15 +msgid ":obj:`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. An FNet sequence pair mask has the following " +"format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:13 +msgid "" +"List of `token type IDs <../glossary.html#token-type-ids>`_ according to " +"the given sequence(s)." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..a5a0d757377a5688887059dde06c222cfa4eadda --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.modeling.po @@ -0,0 +1,109 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.funnel.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering:1 +#: paddlenlp.transformers.funnel.modeling.FunnelForSequenceClassification:1 +#: paddlenlp.transformers.funnel.modeling.FunnelForTokenClassification:1 +#: paddlenlp.transformers.funnel.modeling.FunnelModel:1 +msgid "基类::class:`paddlenlp.transformers.funnel.modeling.FunnelPreTrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForSequenceClassification.forward:3 +msgid "labels (:obj:`paddle.Tensor` of shape :obj:`(batch_size,)`, `optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForSequenceClassification.forward:2 +msgid "" +"Labels for computing the sequence classification/regression loss. Indices" +" should be in :obj:`[0, ..., config.num_labels - 1]`. If " +":obj:`config.num_labels == 1` a regression loss is computed (Mean-Square " +"loss), If :obj:`config.num_labels > 1` a classification loss is computed " +"(Cross-Entropy)." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForTokenClassification.forward:2 +msgid "" +"labels (:obj:`paddle.Tensor` of shape :obj:`(batch_size, " +"sequence_length)`, `optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForTokenClassification.forward:2 +msgid "" +"Labels for computing the token classification loss. Indices should be in " +"``[0, ..., config.num_labels - 1]``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:3 +msgid "" +"start_positions (:obj:`paddle.Tensor` of shape :obj:`(batch_size,)`, " +"`optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:2 +msgid "" +"Labels for position (index) of the start of the labelled span for " +"computing the token classification loss. Positions are clamped to the " +"length of the sequence (:obj:`sequence_length`). Position outside of the " +"sequence are not taken into account for computing the loss." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:7 +msgid "" +"end_positions (:obj:`paddle.Tensor` of shape :obj:`(batch_size,)`, " +"`optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:6 +msgid "" +"Labels for position (index) of the end of the labelled span for computing" +" the token classification loss. Positions are clamped to the length of " +"the sequence (:obj:`sequence_length`). Position outside of the sequence " +"are not taken into account for computing the loss." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.po new file mode 100644 index 0000000000000000000000000000000000000000..50cc071d876c721c467240e39ac2f08228cfbcc5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.funnel.rst:2 +msgid "funnel" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..d8f3f6ca3911cb73ca200c21498832d05e6ca0c7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.tokenizer.po @@ -0,0 +1,479 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.funnel.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.vocab_size:1 +msgid "" +"return the size of vocabulary. :returns: the size of vocabulary. 
:rtype: " +"int" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize:1 +msgid "" +"End-to-end tokenization for BERT models. :param text: The text to be " +"tokenized. :type text: str" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize:5 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting. :param tokens: A list of string representing tokens" +" to be converted. :type tokens: list" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string:7 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. Do not put this inside your training loop." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:8 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False." 
+msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:11 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A BERT offset_mapping has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:14 +msgid ":obj:`List[tuple]`" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:3 +msgid "A BERT sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:1 +msgid "Truncates a sequence pair in place to the maximum length." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:3 +msgid "" +"list of tokenized input ids. Can be obtained from a string by chaining " +"the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:5 +msgid "" +"Optional second list of input ids. Can be obtained from a string by " +"chaining the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:7 +msgid "number of tokens to remove using the truncation strategy" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len starting from the longest one at each token (when there " +"is a pair of input sequences). Overflowing tokens only contains " +"overflow from the first sequence. - 'only_first': Only truncate the first" +" sequence. raise an error if the first sequence is shorter or equal to " +"than num_tokens_to_remove. - 'only_second': Only truncate the second " +"sequence - 'do_not_truncate': Does not truncate (raise an error if the " +"input sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:11 +msgid "" +"starting from the longest one at each token (when there is a pair of " +"input sequences). Overflowing tokens only contains overflow from the " +"first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:13 +msgid "" +"'only_first': Only truncate the first sequence. raise an error if the " +"first sequence is shorter or equal to than num_tokens_to_remove." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:14 +msgid "'only_second': Only truncate the second sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:15 +msgid "" +"'do_not_truncate': Does not truncate (raise an error if the input " +"sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:16 +msgid "" +"If set to a number along with max_seq_len, the overflowing tokens " +"returned will contain some tokens from the main sequence returned. The " +"value of this argument defines the number of additional tokens." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair. 
:param " +"batch_text_or_text_pairs: The element of list can be sequence or sequence" +" pair, and the" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:4 +msgid "" +"sequence is a string or a list of strings depending on whether it has " +"been pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:9 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:14 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:24 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:28 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:38 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:41 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:44 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:47 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:50 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." 
+msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:53 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:57 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` " +"works. - **overflow_to_sample** (int, optional): Index of example from " +"which this feature is generated. Included when `stride` works." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:57 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:60 +msgid "fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:61 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:63 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:66 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:68 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." 
+msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:71 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:74 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:77 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` works." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:81 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.rematch:1 +msgid "" +"changed from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.generation_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.generation_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..e4609c0b17e0ab7ae0535230a0ba667bac29db00 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.generation_utils.po @@ -0,0 +1,238 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.generation_utils.rst:2 +msgid "generation\\_utils" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin:1 +msgid "This class implements the interface for generation task." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin:3 +msgid "" +"It's used as the base class of `paddlenlp.transformers.PretrainedModel " +"`__." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:1 +msgid "" +"The interface for generation task. This method can generate sequences by " +"using decoding strategy. Currently, there are three decoding strategies " +"supported: \"greedy_search\", \"sampling\" and \"beam_search\"." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:5 +msgid "" +"The input sequence ids for the generation. It is a Tensor with shape " +"[batch_size, sequence_length]. The data type should be int32 or int64. 
" +"Default to None, which we will initialize it as a Tensor with shape [1, " +"1], filled with the value `bos_token_id`." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:11 +msgid "The maximum length of the sequence to be generated. Default to 20." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:14 +msgid "The minimum length of the sequence to be generated. Default to 0." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:17 +msgid "" +"The decoding strategy in generation. Currently, there are three decoding " +"strategies supported: \"greedy_search\", \"sampling\" and " +"\"beam_search\". Default to \"greedy_search\"." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:22 +msgid "" +"The value used to module the next token probabilities in the \"sampling\"" +" strategy. Default to 1.0, which means no effect." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:26 +msgid "" +"The number of highest probability tokens to keep for top-k-filtering in " +"the \"sampling\" strategy. Default to 0, which means no effect." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:30 +msgid "" +"The cumulative probability for top-p-filtering in the \"sampling\" " +"strategy. The value should satisfy :math:`0 <= top\\_p < 1`. Default to " +"1.0, which means no effect." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:35 +msgid "" +"The parameter for repetition penalty. 1.0 means no penalty. See `this " +"paper `__ for more details. " +"Defaults to 1.0." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:38 +msgid "The number of beams in the \"beam_search\" strategy. Default to 1." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:41 +msgid "" +"Number of groups to divide `num_beams` into in order to use DIVERSE BEAM " +"SEARCH. See `this paper `__ for " +"more details. Default to 1." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:45 +msgid "" +"The exponential penalty to the sequence length in the \"beam_search\" " +"strategy. The larger this param is, the more that the model would " +"generate shorter sequences. Default to 0.0, which means no penalty." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:50 +msgid "" +"Whether to stop searching in the \"beam_search\" strategy when at least " +"`num_beams` sentences are finished per batch or not. Default to False." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:54 +msgid "The id of the `bos_token`. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:57 +msgid "The id of the `eos_token`. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:60 +msgid "The id of the `pad_token`. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:63 +msgid "The start token id for encoder-decoder models. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:66 +msgid "" +"The id of the token to force as the first generated token. Usually use " +"for multilingual models. Default to None." 
+msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:70 +msgid "The id of the token to force as the last generated token. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:73 +msgid "" +"The number of returned sequences for each sequence in the batch. Default " +"to 1." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:76 +msgid "" +"If num_beam_groups is 1, this is the diversity_rate for Diverse Siblings " +"Search. See `this paper https://arxiv.org/abs/1611.08562`__ for more " +"details. If not, this is the diversity_rate for DIVERSE BEAM SEARCH." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:81 +msgid "" +"(bool, optional): Whether to use the model cache to speed up decoding. " +"Default to True." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:83 +msgid "" +"(bool, optional): Whether to use fast entry of model for " +"FastGeneration. Default to False." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:85 +msgid "" +"(bool, optional): Whether to use fp16 for decoding. Only works when " +"fast entry is avalible. Default to False." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:87 +msgid "It can be used to specify additional kwargs passed to the model." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:91 +msgid "" +"It is a tuple contains two elements: ids and scores. Each element is a " +"Tensor. With the fields: - ids (Tensor): The ids of the generated " +"sequences. It is a Tensor with shape [batch_size * " +"num_return_sequences, sequence_length]. The data type is same as the " +"input `input_ids`. - scores (Tensor): The scores of the generated " +"sequences. It is a Tensor with shape [batch_size * " +"num_return_sequences, 1]. The data type is float32 or float64, which " +"is the same as the parameters in the model." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:91 +msgid "" +"It is a tuple contains two elements: ids and scores. Each element is a " +"Tensor." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:94 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:98 +msgid "ids (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:97 +msgid "" +"The ids of the generated sequences. It is a Tensor with shape [batch_size" +" * num_return_sequences, sequence_length]. The data type is same as the " +"input `input_ids`." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:102 +msgid "scores (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:101 +msgid "" +"The scores of the generated sequences. It is a Tensor with shape " +"[batch_size * num_return_sequences, 1]. The data type is float32 or " +"float64, which is the same as the parameters in the model." 
+msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:107 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..915270047f8c402de8b61b36a3b12e950a1b9cc2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.modeling.po @@ -0,0 +1,428 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.gpt.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:1 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining:1 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:1 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:1 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel:1 +#: paddlenlp.transformers.gpt.modeling.GPTModel:1 +msgid "基类::class:`paddlenlp.transformers.gpt.modeling.GPTPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:1 +msgid "The bare GPT Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTModel +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `GPTModel`. Also is the vocab size of " +"token embedding matrix. 
Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:13 +msgid "" +"Dimensionality of the embedding layer and decoder layer. Defaults to " +"`768`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:15 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the decoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and decoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all decoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:38 +msgid "" +"The vocabulary size of the `token_type_ids`. Defaults to `16`. .. note::" +" Please NOT using `type_vocab_size`, for it will be obsolete in the " +"future.." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:41 +msgid "" +"Please NOT using `type_vocab_size`, for it will be obsolete in the " +"future.." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:43 +msgid "" +"The standard deviation of the normal initializer. Default to `0.02`. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`GPTPretrainedModel._init_weights()` for how" +" weights are initialized in `GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:43 +msgid "The standard deviation of the normal initializer. Default to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:46 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`GPTPretrainedModel._init_weights()` for how weights are " +"initialized in `GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:49 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:1 +msgid "The GPTModel forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:7 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:11 +msgid "" +"Mask used in self attention to avoid performing attention to some " +"unwanted positions, usually the subsequent positions. It is a tensor with" +" shape broadcasted to `[batch_size, num_attention_heads, sequence_length," +" sequence_length]`. It is a tensor with shape broadcasted to " +"`[batch_size, num_attention_heads, sequence_length, sequence_length]`. " +"For example, its shape can be [batch_size, sequence_length], " +"[batch_size, sequence_length, sequence_length], [batch_size, " +"num_attention_heads, sequence_length, sequence_length]. Its data type " +"should be float32. The `masked` tokens have `-1e-9` values, and the " +"`unmasked` tokens have `0` values. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:21 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:24 +msgid "" +"It is a list, and each element in the list is a tuple " +"`(incremental_cache, static_cache)`. See `TransformerDecoder.gen_cache " +"`__" +" for more details. It is only used for inference and should be None for " +"training. Default to `None`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:30 +msgid "" +"Returns tensor `encoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:19 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:15 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:15 +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward:35 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainedModel:1 +msgid "" +"An abstract class for pretrained GPT models. It provides GPT related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForPretraining:1 +msgid "GPT Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForPretraining:3 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel:3 +msgid "An instance of :class:`GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward:1 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:1 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:3 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:5 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:7 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:9 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:1 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:9 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:1 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:9 +msgid "See :class:`GPTModel`." 
+msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:12 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:12 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:12 +msgid "" +"Returns tensor `logits` or tuple `(logits, cached_kvs)`. If `use_cache` " +"is True, tuple (`logits, cached_kvs`) will be returned. Otherwise, tensor" +" `logits` will be returned. `logits` is the output of the gpt model. " +"`cache_kvs` is the cache output of gpt model if `use_cache` is True." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion:1 +msgid "Criterion for GPT. It calculates the final loss." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:1 +msgid "" +"The logits of masked token prediction. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:4 +msgid "" +"The labels of the masked language modeling, the dimensionality of " +"`masked_lm_labels` is equal to `prediction_scores`. Its data type should " +"be int64 and its shape is [batch_size, sequence_length, 1]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:8 +msgid "" +"Mask used for calculating the loss of the masked language modeling to " +"avoid calculating some unwanted tokens. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, 1]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:13 +msgid "" +"The pretraining loss. Its data type should be float32 and its shape is " +"[1]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:1 +msgid "" +"The generate model for GPT-2. It use the greedy strategy and generate the" +" output sequence with highest probability." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:4 +msgid "An instance of `paddlenlp.transformers.GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:6 +msgid "The max length of the prediction." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward:4 +msgid "" +"Returns tensor `src_ids`, which means the indices of output sequence " +"tokens in the vocabulary. They are numerical representations of tokens " +"that build the output sequence." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTLMHeadModel:1 +msgid "The GPT Model with a `language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:1 +msgid "" +"GPT Model with a token classification head on top (a linear layer on top " +"of the hidden-states output) e.g. for Named-Entity-Recognition (NER) " +"tasks." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:4 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:4 +msgid "An instance of GPTModel." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:6 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:8 +msgid "" +"The dropout probability for output of GPT. 
If None, use the same value as" +" `hidden_dropout_prob` of `GPTModel` instance `gpt`. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:1 +msgid "" +"The GPTForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:1 +msgid "" +"GPT Model with a sequence classification/regression head on top (a linear" +" layer on top of the pooled output) e.g. for GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:1 +msgid "" +"The GPTForSequenceClassification forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.po new file mode 100644 index 0000000000000000000000000000000000000000..73f3ea9694b457fa2bf4b224f035fd335921c138 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.gpt.rst:2 +msgid "gpt" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..b35fc7efb1d262feaae43bdc059141c2fdf31d17 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.tokenizer.po @@ -0,0 +1,189 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.gpt.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:1 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:1 +msgid "Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:3 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.save_resources +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:7 +msgid "" +"Path to the vocab file. The vocab file contains a mapping from vocabulary" +" strings to indices." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:10 +msgid "" +"Path to the merge file. The merge file is used to split the input " +"sentence into \"subword\" units. The vocab file is then used to encode " +"those units as intices." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:14 +msgid "Paradigm to follow when decoding bytes to UTF-8. Defaults to `'replace'`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:17 +msgid "The maximum value of the input sequence length. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:19 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:22 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size:1 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size:1 +msgid "Returns the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size:3 +msgid "The sum of size of vocabulary and the size of speical tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:1 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:1 +msgid "Converts a single index or a sequence of indices to texts." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:3 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:3 +msgid "The token id (or token ids) to be converted to text." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:6 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:6 +msgid "The decoded text." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:10 +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:11 +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size:7 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:10 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.save_resources:1 +msgid "" +"Saves `SentencePiece `__ file " +"(ends with '.spm') under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.save_resources:3 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:1 +msgid "" +"Constructs a GPT Chinese tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:7 +msgid "" +"The vocabulary file required to instantiate a `SentencePiece " +"`__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:10 +msgid "The maximum value of the input sequence length. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:13 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:1 +msgid "" +"Converts a single index or a sequence of indices to a token or a sequence" +" of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:4 +msgid "The token id (or token ids) to be converted to token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:7 +msgid "The converted token or sequence of tokens." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..8446e7b167ea52591c49eb96607f005d4f443607 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.modeling.po @@ -0,0 +1,382 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling:1 +msgid "Modeling classes for LayoutLM model." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:1 +msgid "基类::class:`paddlenlp.transformers.layoutlm.modeling.LayoutLMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:1 +msgid "The bare LayoutLM Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:10 +msgid "" +"Vocabulary size of the LayoutLM model. Defines the number of different " +"tokens that can be represented by the `inputs_ids` passed when calling " +"LayoutLMModel." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:13 +msgid "Dimensionality of the encoder layers and the pooler layer." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:15 +msgid "Number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:19 +msgid "" +"Dimensionality of the \"intermediate\" (often named feed-forward) layer " +"in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:21 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:25 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:27 +msgid "The dropout probability for all fully connected layers in the pooler." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:29 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:32 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`LayoutLMPretrainedModel.init_weights()` for" +" how weights are initialized in `LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:32 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:36 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`LayoutLMPretrainedModel.init_weights()` for how weights are " +"initialized in `LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:39 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:42 +msgid "" +"The non-linear activation function in the pooling layer. Defaults to " +"`\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:1 +msgid "" +"The LayoutLMModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. 
Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:35 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:37 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:40 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:45 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:44 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM:1 +msgid "LayoutLM Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM:3 +msgid "An instance of :class:`LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:3 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:5 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:7 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:9 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:3 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:5 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:7 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:9 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:11 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:13 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:3 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:5 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:7 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:9 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:11 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:13 +msgid "See :class:`LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:12 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:17 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:21 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:21 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:1 +msgid "" +"LayoutLM Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:4 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:4 +msgid "An instance of LayoutLMModel." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:6 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:8 +msgid "" +"The dropout probability for output of LayoutLM. If None, use the same " +"value as `hidden_dropout_prob` of `LayoutLMModel` instance `layoutlm`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:1 +msgid "" +"The LayoutLMForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:16 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:1 +msgid "" +"LayoutLM Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:1 +msgid "" +"The LayoutLMForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:16 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.po new file mode 100644 index 0000000000000000000000000000000000000000..9ad11da107a28272077cf9d55b404b09f4320b04 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlm.rst:2 +msgid "layoutlm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..9ecd8d8859bef56735cdb856421c4ffe44d6f546 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.tokenizer.po @@ -0,0 +1,39 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.tokenizer:1 +msgid "Tokenization classes for LayoutLM model." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.tokenizer.LayoutLMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.tokenizer.LayoutLMTokenizer:1 +msgid "" +"The usage of LayoutLMTokenizer is the same as `BertTokenizer " +"`__." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..1169d16673304a192292e249a51dfe476890e096 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.modeling.po @@ -0,0 +1,157 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlmv2.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling:1 +msgid "Modeling classes for LayoutLMv2 model." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:1 +msgid "基类::class:`paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:1 +msgid "The bare LayoutLMv2 Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:10 +msgid "" +"Vocabulary size of the XLNet model. Defines the number of different " +"tokens that can be represented by the `inputs_ids` passed when calling " +"XLNetModel." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:13 +msgid "" +"Dimensionality of the encoder layers and the pooler layer. Defaults to " +"``768``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:20 +msgid "" +"Dimensionality of the \"intermediate\" (often named feed-forward) layer " +"in the Transformer encoder. Defaults to ``3072``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:23 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:27 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:30 +msgid "" +"The dropout probability for all fully connected layers in the pooler. " +"Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:33 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward:4 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward:4 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward:4 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward:6 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward:6 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward:6 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2PretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2PretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.init_weights:1 +msgid "Initialize the weights" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.po new file mode 100644 index 0000000000000000000000000000000000000000..e228ffd620941c77e98fbe2549ba65e77a7e8f1c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlmv2.rst:2 +msgid "layoutlmv2" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..36187c4795111a788b10699a494769807d28f745 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.tokenizer.po @@ -0,0 +1,39 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlmv2.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.tokenizer:1 +msgid "Tokenization classes for LayoutLMv2 model." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.tokenizer.LayoutLMv2Tokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.tokenizer.LayoutLMv2Tokenizer:1 +msgid "" +"The usage of LayoutLMv2Tokenizer is the same as `BertTokenizer " +"`__." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ea5ad2bac1471c74cb3a4095d831bbb1cb2f45f9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.modeling.po @@ -0,0 +1,156 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling:1 +msgid "Modeling classes for LayoutXLM model." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.modeling.LayoutXLMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:1 +msgid "The bare LayoutXLM Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:10 +msgid "" +"Vocabulary size of the XLNet model. Defines the number of different " +"tokens that can be represented by the `inputs_ids` passed when calling " +"XLNetModel." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:13 +msgid "" +"Dimensionality of the encoder layers and the pooler layer. Defaults to " +"``768``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:20 +msgid "" +"Dimensionality of the \"intermediate\" (often named feed-forward) layer " +"in the Transformer encoder. Defaults to ``3072``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:23 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:27 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:30 +msgid "" +"The dropout probability for all fully connected layers in the pooler. " +"Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:33 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward:4 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward:4 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward:4 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward:6 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward:6 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward:6 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.init_weights:1 +msgid "Initialize the weights" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.po new file mode 100644 index 0000000000000000000000000000000000000000..5088b7bb51a55d7b99c9aecef777aa74cecd3c88 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.rst:2 +msgid "layoutxlm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..b493f08595fc77f1ece491a5b2628170cf3231db --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.tokenizer.po @@ -0,0 +1,176 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.tokenizer:1 +msgid "Tokenization classes for LayoutXLM model." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:4 +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:6 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:8 +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:11 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." 
+msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the number of added tokens should be computed in the case of a " +"sequence pair or a single sequence. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add:7 +msgid "Number of special tokens added to sequences." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.visual_backbone.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.visual_backbone.po new file mode 100644 index 0000000000000000000000000000000000000000..62dc467e525171f1223428d14e170849d8a23a54 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.visual_backbone.po @@ -0,0 +1,286 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.visual_backbone.rst:2 +msgid "visual\\_backbone" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d:1 +msgid "基类::class:`paddle.nn.layer.conv.Conv2D`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.get_norm +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.CNNBlockBase:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ShapeSpec:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.visual_backbone._ShapeSpec`" +msgstr "" + +#: of 
paddlenlp.transformers.layoutxlm.visual_backbone.get_norm:1 +msgid "" +"either one of BN, SyncBN, FrozenBN, GN; or a callable that takes a " +"channel number and returns the normalization layer as a nn.Layer." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.get_norm:5 +msgid "out_channels" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage +#: paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone +#: paddlenlp.transformers.layoutxlm.visual_backbone.get_norm +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.get_norm:8 +msgid "the normalization layer" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage +#: paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone +#: paddlenlp.transformers.layoutxlm.visual_backbone.get_norm +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FrozenBatchNorm:1 +msgid "基类::class:`paddle.fluid.dygraph.nn.BatchNorm`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BasicBlock:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.DeformBottleneckBlock:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.visual_backbone.CNNBlockBase`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BasicBlock:1 +msgid "" +"The basic residual block for ResNet-18 and ResNet-34 defined in " +":paper:`ResNet`, with two 3x3 conv layers and a projection shortcut if " +"needed." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock:1 +msgid "" +"The standard bottleneck residual block used by ResNet-50, 101 and 152 " +"defined in :paper:`ResNet`. It contains 3 conv layers with kernels 1x1, " +"3x3, 1x1, and a projection shortcut if needed." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.DeformBottleneckBlock:1 +msgid "" +"Similar to :class:`BottleneckBlock`, but with :paper:`deformable conv " +"` in the 3x3 convolution." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem:1 +msgid "" +"The standard ResNet stem (layers before the first residual block), with a" +" conv, relu and max_pool." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.visual_backbone.Backbone`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward:1 +msgid "" +"Tensor of shape (N,C,H,W). H, W must be a multiple of " +"``self.size_divisibility``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward:3 +msgid "names and the corresponding features" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:1 +msgid "Create a list of blocks of the same type that forms one ResNet stage." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:3 +msgid "" +"a subclass of CNNBlockBase that's used to create all blocks in this " +"stage. A module of this type must not change spatial resolution of inputs" +" unless its stride != 1." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:7 +msgid "number of blocks in this stage" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:9 +msgid "input channels of the entire stage." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:11 +msgid "output channels of **every block** in the stage." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:13 +msgid "" +"other arguments passed to the constructor of `block_class`. If the " +"argument name is \"xx_per_block\", the argument is a list of values to be" +" passed to each block in the stage. Otherwise, the same argument is " +"passed to every block in the stage." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:19 +msgid "a list of block module." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:22 +msgid "Examples: ::" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:31 +msgid "" +"Usually, layers that produce the same feature map spatial size are " +"defined as one \"stage\" (in :paper:`FPN`). Under such definition, " +"``stride_per_block[1:]`` should all be 1." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:1 +msgid "" +"Created list of ResNet stages from pre-defined depth (one of 18, 34, 50, " +"101, 152). If it doesn't create the ResNet variant you need, please use " +":meth:`make_stage` instead for fine-grained customization." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:5 +msgid "depth of ResNet" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:7 +msgid "" +"the CNN block class. Has to accept `bottleneck_channels` argument for " +"depth > 50. By default it is BasicBlock or BottleneckBlock, based on the " +"depth." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:12 +msgid "" +"other arguments to pass to `make_stage`. Should not contain stride and " +"channels, as they are predefined for each depth." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:15 +msgid "modules in all stages; see arguments of :class:`ResNet.__init__`." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:17 +msgid "modules in all stages; see arguments of" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:18 +msgid ":class:`ResNet.__init__`." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool:1 +msgid "" +"This module is used in the original FPN to generate a downsampled P6 " +"feature from P5." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward:1 +msgid "" +"mapping feature map name (e.g., \"res5\") to feature map tensor for each " +"feature level in high to low resolution order." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward:5 +msgid "" +"mapping from feature map name to FPN feature map tensor in high to low " +"resolution order. Returned feature names follow the FPN paper convention:" +" \"p\", where stage has stride = 2 ** stage e.g., [\"p2\", \"p3\"," +" ..., \"p6\"]." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.make_stage:1 +msgid "Deprecated alias for backward compatibiltiy." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone:1 +msgid "Create a ResNet instance from config." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone:3 +msgid "a :class:`ResNet` instance." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..783249b186716de930651c6f75e1243434fb6773 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.modeling.po @@ -0,0 +1,598 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.luke.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:1 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:1 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:1 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM:1 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering:1 +#: paddlenlp.transformers.luke.modeling.LukeModel:1 +msgid "基类::class:`paddlenlp.transformers.luke.modeling.LukePretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:1 +msgid "The bare Luke Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward +#: paddlenlp.transformers.luke.modeling.LukeModel +#: paddlenlp.transformers.luke.modeling.LukeModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `LukeModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `LukeModel`. " +"Defaults to 50267." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:14 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:16 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:18 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:21 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:26 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:30 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:33 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:36 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`514`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:39 +msgid "The vocabulary size of `token_type_ids`. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:42 +msgid "" +"Vocabulary size of `entity_ids` in `LukeModel`. Also is the vocab size of" +" token entity embedding matrix. Defines the number of different entity " +"that can be represented by the `entity_ids` passed when calling " +"`LukeModel`. Defaults to 500000." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:46 +msgid "Dimensionality of the entity embedding layer Defaults to `256`." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:48 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:48 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:52 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:55 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:58 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:1 +msgid "The LukeModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. 
When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:31 +msgid "" +"Indices of entity sequence tokens in the entity vocabulary. They are " +"numerical representations of entities that build the entity input " +"sequence. Its data type should be `int64` and it has a shape of " +"[batch_size, entity_sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:35 +msgid "" +"Indices of positions of each entity sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_entity_tokens)` and dtype as int64. Defaults " +"to `None`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:39 +msgid "" +"Segment entity token indices to indicate different portions of the entity" +" inputs. Selected in the range ``[0, type_vocab_size - 1]``. If " +"`type_vocab_size` is 2, which means the inputs have two portions. Indices" +" can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:44 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor will be concat with `attention_mask`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward +#: paddlenlp.transformers.luke.modeling.LukeModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:53 +msgid "" +"Returns tuple (`word_hidden_state, entity_hidden_state, pool_output`). " +"With the fields: - `word_hidden_state` (Tensor): Sequence of hidden-" +"states at the last layer of the model. It's data type should be " +"float32 and its shape is [batch_size, sequence_length, hidden_size]. - " +"`entity_hidden_state` (Tensor): Sequence of entity hidden-states at " +"the last layer of the model. It's data type should be float32 and its" +" shape is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (``) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:53 +msgid "Returns tuple (`word_hidden_state, entity_hidden_state, pool_output`)." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:22 +#: paddlenlp.transformers.luke.modeling.LukeModel.forward:55 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:59 +msgid "`word_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:58 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:63 +msgid "`entity_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:62 +msgid "" +"Sequence of entity hidden-states at the last layer of the model. It's " +"data type should be float32 and its shape is [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:67 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:66 +msgid "" +"The output of first token (``) in sequence. We \"pool\" the model by " +"simply taking the hidden state corresponding to the first token. Its data" +" type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward +#: paddlenlp.transformers.luke.modeling.LukeModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:25 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:25 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:27 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:34 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:31 +#: paddlenlp.transformers.luke.modeling.LukeModel.forward:72 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukePretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukePretrainedModel:1 +msgid "" +"An abstract class for pretrained Luke models. It provides Luke related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukePretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:1 +msgid "" +"The LUKE model with a span classification head on top (a linear layer on " +"top of the hidden states output) for tasks such as named entity " +"recognition." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:4 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:4 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:4 +msgid "An instance of LukeModel." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:6 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:6 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:1 +msgid "" +"The LukeForEntitySpanClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:4 +msgid "The start position of entities in sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:19 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:11 +#: 
paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:17 +msgid "See :class:`LukeModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:7 +msgid "See :class: `LukeModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:22 +msgid "" +"Returns tensor `logits`, a tensor of the entity span classification " +"logits. Shape as `[batch_size, num_entities, num_classes]` and dtype as " +"float32." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:1 +msgid "" +"The LUKE model with a classification head on top (a linear layer on top " +"of the hidden states of the two entity tokens) for entity pair " +"classification tasks, such as TACRED." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:1 +msgid "" +"The LukeForEntityPairClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:20 +msgid "" +"Returns tensor `logits`, a tensor of the entity pair classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:1 +msgid "" +"The LUKE model with a classification head on top (a linear layer on top " +"of the hidden state of the first entity token) for entity classification " +"tasks, such as Open Entity." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:1 +msgid "" +"The LukeForEntityClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:20 +msgid "" +"Returns tensor `logits`, a tensor of the entity classification logits. " +"Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM:1 +msgid "Luke Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM:3 +msgid "An instance of :class:`LukeModel`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:1 +msgid "" +"The LukeForMaskedLM forward method, overrides the __call__() special " +"method." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:20 +msgid "" +"Returns tuple (``logits``, ``entity_logits``). With the fields: - " +"`logits` (Tensor): The scores of masked token prediction. Its " +"data type should be float32 and shape is [batch_size, sequence_length, " +"vocab_size]. - `entity_logits` (Tensor): The scores of masked entity" +" prediction. Its data type should be float32 and its shape is " +"[batch_size, entity_length, entity_vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:20 +msgid "Returns tuple (``logits``, ``entity_logits``)." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:26 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:25 +msgid "" +"The scores of masked token prediction. Its data type should be float32 " +"and shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:29 +msgid "`entity_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:29 +msgid "" +"The scores of masked entity prediction. Its data type should be float32 " +"and its shape is [batch_size, entity_length, entity_vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering:1 +msgid "" +"LukeBert Model with question answering tasks. :param luke: An instance of" +" :class:`LukeModel`. :type luke: :class:`LukeModel`" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:1 +msgid "" +"The LukeForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:20 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. - " +"`end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:20 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:26 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:26 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.po new file mode 100644 index 0000000000000000000000000000000000000000..d934a3c3cf16a1b79e0e9ffa4a79977a2d62846d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.luke.rst:2 +msgid "luke" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..3361b2a80d0a56855e10d40d46d6859dc93b04a5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.tokenizer.po @@ -0,0 +1,262 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.luke.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer:1 +msgid "Tokenization classes for LUKE." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer`" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:1 +msgid "" +"Constructs a Luke tokenizer. It uses a basic tokenizer to do punctuation " +"splitting, lower casing and so on, and follows a WordPiece tokenizer to " +"tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.add_special_tokens +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.json') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:12 +msgid "" +"The entity vocabulary file path (ends with '.tsv') required to " +"instantiate a `EntityTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:15 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:18 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." 
+msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:22 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:25 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:28 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:31 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:37 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_entity_vocab:1 +msgid "Get the entity vocab" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "Tokenize a string." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "Args:" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:3 +msgid "text (str):" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:4 +msgid "The sentence to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "add_prefix_space (boolean, default False):" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "" +"Begin the sentence with at least one space to get invariance to word " +"order in GPT-2 (and Luke) tokenizers." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (string) in a single string." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.add_special_tokens:1 +msgid "Adding special tokens if you need." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.add_special_tokens:3 +msgid "" +"The special token list you provided. If you provide a Dict, the key of " +"the Dict must be \"additional_special_tokens\" and the value must be " +"token list." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.convert_entity_to_id:1 +msgid "Convert the entity to id" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.entity_encode:1 +msgid "Convert the string entity to digital entity" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping:1 +msgid "" +"Returns the map of tokens and the start and end index of their start and " +"end character. Modified from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping:4 +msgid "Input text." 
+msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping:7 +msgid "The offset map of input text." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:3 +msgid "A Luke sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#~ msgid "基类::class:`paddlenlp.transformers.roberta.tokenizer.RobertaTokenizer`" +#~ msgstr "" + +#~ msgid "Converts a sequence of tokens (list of string) to a list of ids." +#~ msgstr "" + +#~ msgid "A list of string representing tokens to be converted." +#~ msgstr "" + +#~ msgid "Converted ids from tokens." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..b86ffd95ee657dc00d2fec4852c1e31433ccdfb6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.modeling.po @@ -0,0 +1,518 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mbart.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder:1 +#: paddlenlp.transformers.mbart.modeling.MBartEncoder:1 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration:1 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering:1 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:1 +#: paddlenlp.transformers.mbart.modeling.MBartModel:1 +msgid "基类::class:`paddlenlp.transformers.mbart.modeling.MBartPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:1 +msgid "The bare MBart Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead.forward +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward +#: paddlenlp.transformers.mbart.modeling.MBartModel +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `MBartModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:13 +msgid "" +"The beginning of sequence token that was used during pretraining. Can be " +"used a sequence classifier token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:17 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:20 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `2`." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:23 +msgid "" +"Dimensionality of the embedding layer, encoder layer and decoder layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:25 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:27 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:29 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:32 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:35 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`encoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`encoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:40 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`decoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`decoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:45 +msgid "" +"The dropout probability used in all fully connected layers (pre-process " +"and post-process of MHA and FFN sub-layer) in the encoders and decoders. " +"Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:48 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:52 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"and decoder layers to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:55 +msgid "" +"The dropout probability used after FFN activation in all encoder layers " +"and decoder layers. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:58 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`1024`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:61 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Default to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:1 +msgid "The MBartModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:18 +msgid "" +"Indices of decoder input sequence tokens in the vocabulary. Its data type" +" should be `int64` and it has a shape of [batch_size, sequence_length]. " +"Defaults to `None`, which means no `decoder_input_ids` is provided, the " +"model will create the tensor by shifting the `input_ids` to the right." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:23 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions in `decoder_input_ids`. Its data type and shape is the" +" same as `attention_mask`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:26 +msgid "" +"The output of the encoder, a tuple consists `last_hidden_state`, " +"`hidden_states`(optional), `attentions`(optional). The data type of " +"`last_hidden_state` is float32 and its shape is `[batch_size, " +"sequence_length, hidden_size]`. `hidden_states` is hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. `attentions` is attentions of all layers of in the " +"Transformer encoder. The length of `attentions` is `num_hidden_layers`. " +"For all element in the tuple, its data type should be float32 and its " +"shape is [`batch_size, num_attention_heads, sequence_length, " +"sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:33 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:36 +msgid "" +"It is a list, and each element in the list is a tuple " +"`(incremental_cache, static_cache)`. See `TransformerDecoder.gen_cache " +"`__" +" for more details. It is only used for inference and should be None for " +"training. Default to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:14 +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward:42 +msgid "" +"Returns tensor `decoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:31 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:32 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:23 +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartPretrainedModel:1 +msgid "" +"An abstract class for pretrained MBart models. It provides MBart related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartEncoder:1 +msgid "" +"The Transformer Encoder of MBartModel. The arguments of MBartEncoder can " +"see :class:`MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:1 +msgid "" +"The MBartEncoder forward method, overrides the `__call__()` special " +"method." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:13 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:15 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:27 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:13 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:15 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:13 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:15 +msgid "See :class:`MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:8 +msgid "" +"Returns tensor `encoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder:1 +msgid "" +"The Transformer Decoder of MBartModel. The arguments of MBartDecoder can " +"see :class:`MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:1 +msgid "" +"The MBartDecoder forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead:1 +msgid "Head for sentence-level classification tasks." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead.forward:1 +msgid "Hidden states of the classification model." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:1 +msgid "" +"MBart Model with a linear layer on top of the pooled output, designed for" +" sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration:4 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering:4 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:4 +msgid "An instance of MBartModel." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:6 +msgid "The number of different labels. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:8 +msgid "" +"The dropout probability for output of MBart. If None, use the same value " +"as `hidden_dropout_prob` of `MBartModel` instance `mbart`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:1 +msgid "" +"The MBartForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:18 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_labels]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering:1 +msgid "" +"MBart Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:1 +msgid "" +"The MBartForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:18 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:18 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:20 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:20 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:24 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:27 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:27 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration:1 +msgid "" +"MBart Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD ." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:1 +msgid "" +"The MBartForConditionalGeneration forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`). With the fields: - `lm_logits` (Tensor):" +" The generated sentence of the model. Its data type should be " +"float32 and has a shape of [batch_size, sequence_length, vocab_size]. - " +"`cache` (Tensor): See :class:`MBartModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`)." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:24 +msgid "`lm_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:23 +msgid "" +"The generated sentence of the model. Its data type should be float32 and " +"has a shape of [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:26 +msgid "`cache` (Tensor):" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.po new file mode 100644 index 0000000000000000000000000000000000000000..59ecf1f1db7ff6ee1d9393c6c87b5408a062c2a6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mbart.rst:2 +msgid "mbart" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..9d2b9fd6c7faf7f0f1855c0d5dee9d928880658e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.tokenizer.po @@ -0,0 +1,91 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mbart.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.tokenize:1 +msgid "Tokenize a string." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.convert_ids_to_string:1 +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.get_special_tokens_mask:1 +msgid "Retrieve sequence ids from a token list that has no special tokens added." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. An MBART" +" sequence has the following format, where ``X`` represents the sequence:" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:4 +msgid "``input_ids`` (for encoder) ``X [eos, src_lang_code]``" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:5 +msgid "``decoder_input_ids``: (for decoder) ``X [eos, tgt_lang_code]``" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:7 +msgid "" +"BOS is never used. Pairs of sequences are not the expected use case, but " +"they will be handled without a separator." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.as_target_tokenizer:1 +msgid "" +"Temporarily sets the tokenizer for encoding the targets. Useful for " +"tokenizer associated to sequence-to-sequence models that need a slightly " +"different processing for the labels." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..9a2181b19282af23ac0466fef840f6c45317cd74 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.modeling.po @@ -0,0 +1,686 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.megatronbert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:1 +msgid "基类::class:`paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:1 +msgid "The bare MegatronBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `MegatronBertModel`. Also is the vocab" +" size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`MegatronBert`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:13 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `1024`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:15 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:18 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:21 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:25 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:28 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:31 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `24`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:33 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:36 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:39 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `4096`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:44 +msgid "Type of position embedding. Defaults to \"absolute\"" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:46 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`MegatronBertPretrainedModel.init_weights()`" +" for how weights are initialized in `MegatronBertModel`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:46 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:50 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`MegatronBertPretrainedModel.init_weights()` for how weights " +"are initialized in `MegatronBertModel`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:1 +msgid "" +"The MegatronBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. 
Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. - **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:27 +msgid "**1** for tokens that **not masked**," +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:28 +msgid "**0** for tokens that **masked**." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:30 +msgid "" +"It is a tensor with shape broadcasted to `[batch_size, " +"num_attention_heads, sequence_length, sequence_length]`. Defaults to " +"`None`, which means nothing needed to be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:34 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:14 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:14 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:36 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:40 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:44 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:43 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:21 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:21 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:19 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:19 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:27 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:26 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:16 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:19 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:49 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained MegatronBert models. It provides RoBerta" +" related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. 
See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering:1 +msgid "MegatronBert Model with question answering tasks." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:3 +msgid "An instance of :class:`MegatronBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:1 +msgid "" +"The MegatronBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:9 +#: 
paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:9 +msgid "See :class:`MegatronBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:1 +msgid "MegatronBert Model with sequence classification tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:5 +msgid "The number of labels." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:1 +msgid "" +"The MegatronBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:12 +msgid "Returns tensor `logits`, a tensor of the sequence classification logits." 
+msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction:1 +msgid "" +"MegatronBert Model with a `next sentence prediction (classification)` " +"head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:1 +msgid "" +"The MegatronBertForNextSentencePrediction forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:12 +msgid "" +"Returns Tensor `seq_relationship_scores`. The scores of next sentence " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:14 +msgid "" +"Returns Tensor `seq_relationship_scores`. The scores of next sentence " +"prediction." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:15 +msgid "Its data type should be float32 and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM:1 +msgid "MegatronBert Model with a `causal masked language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:1 +msgid "" +"The MegatronBertForCausalLM forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:12 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:12 +msgid "" +"Returns Tensor `prediction_scores`. The scores of masked token " +"prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, " +"sequence_length, vocab_size]. Otherwise, its shape is " +"[batch_size, mask_token_num, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:16 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:16 +msgid "Returns Tensor `prediction_scores`. The scores of masked token prediction." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:15 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:15 +msgid "" +"Its data type should be float32. If `masked_positions` is None, its shape" +" is [batch_size, sequence_length, vocab_size]. Otherwise, its shape is " +"[batch_size, mask_token_num, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining:1 +msgid "Megatronbert Model with pretraining tasks on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:1 +msgid "" +"The MegatronBertForPreTraining forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:12 +msgid "" +"Returns tuple (`prediction_scores`, `seq_relationship_score`). With the " +"fields: - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. If `masked_positions` is" +" None, its shape is [batch_size, sequence_length, vocab_size]. " +"Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. 
- " +"`seq_relationship_score` (Tensor): The scores of next sentence " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:12 +msgid "Returns tuple (`prediction_scores`, `seq_relationship_score`)." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:19 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:17 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:22 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:22 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM:1 +msgid "MegatronBert Model with a `masked language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:1 +msgid "" +"The MegatronBertForMaskedLM forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice:1 +msgid "MegatronBert Model with a multiple choice classification head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:1 +msgid "" +"The MegatronBertForMultipleChoice forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:12 +msgid "" +"Returns Tensor `reshaped_logits`. A tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and " +"dtype as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:14 +msgid "" +"Returns Tensor `reshaped_logits`. A tensor of the multiple choice " +"classification logits." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:15 +msgid "Shape as `[batch_size, num_choice]` and dtype as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:1 +msgid "MegatronBert Model with a token classification head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:1 +msgid "" +"The MegatronBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and" +" dtype as `float32`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:14 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:15 +msgid "" +"Shape as `[batch_size, sequence_length, num_classes]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.po new file mode 100644 index 0000000000000000000000000000000000000000..74024acea98d21c0b82eba8ed42a76810cf152ec --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.megatronbert.rst:2 +msgid "megatronbert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..39359ac4a5110a184c21bbe5bb4cc6864d64f90d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.tokenizer.po @@ -0,0 +1,88 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.megatronbert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer:1 +msgid "Tokenization classes for MegatronBert." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:1 +msgid "" +"Constructs a MegatronBert tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:5 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:11 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:15 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:18 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:21 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:24 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:30 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..02165c7089d29a8c9a216d33a184ffebf6435444 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.modeling.po @@ -0,0 +1,538 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mobilebert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining:1 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering:1 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:1 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel:1 +msgid "基类::class:`paddlenlp.transformers.mobilebert.modeling.MobileBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:1 +msgid "" +"The bare MobileBert Model transformer outputting raw hidden-states. This " +"model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods. This model is also " +"a Paddle `paddle.nn.Layer `__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:8 +msgid "" +"Vocabulary size of `inputs_ids` in `MobileBertModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`MobileBertModel`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:11 +msgid "" +"Embedding dimensionality of lookup_table in the embedding layer. Defaults" +" to `128`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:15 +msgid "Dimensionality of input_tensor in self attention layer. Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:17 +msgid "" +"Using bottleneck to value tensor in self attention layer. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:19 +msgid "Key and query shared bottleneck layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:21 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `24`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:23 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `4`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:26 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:31 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"relu\"`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:35 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:38 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:41 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:44 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`MobileBertPretrainedModel.init_weights()` " +"for how weights are initialized in `MobileBertModel`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note::" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:53 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:56 +msgid "Adding the pooling Layer after the encoder layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:58 +msgid "" +"Using the non-linear activation function in the pooling layer. Defaults " +"to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:1 +msgid "Prepare the head mask if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:3 +msgid "" +"The mask indicating if we should keep the heads or not (1.0 for keep, 0.0" +" for discard)." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:5 +msgid "The number of hidden layers in the model." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:7 +msgid "" +"(:obj:`bool`, `optional`, defaults to :obj:`False`): Whether or not the " +"attentions scores are computed by chunks or not." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:10 +msgid "" +":obj:`paddle.Tensor` with shape :obj:`[num_hidden_layers x batch x " +"num_heads x seq_length x seq_length]` or list with :obj:`[None]` for each" +" layer." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:1 +msgid "" +"The MobileBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape of" +" [batch_size, sequence_length]. Defaults to `None`, which means we don't " +"add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:16 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:20 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:29 +msgid "" +"The mask indicating if we should keep the heads or not (1.0 for keep, 0.0" +" for discard). Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:34 +msgid "" +"Whether to return the output of each self attention layers. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:38 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:38 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). 
With the fields: - `sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:41 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:45 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:44 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:49 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:48 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:39 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:25 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:54 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained MobileBert models. It provides " +"MobileBert related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining:1 +msgid "MobileBert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining:3 +msgid "An instance of :class:`MobileBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:1 +msgid "" +"The MobileBertForPreTraining forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:3 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:5 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:7 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:9 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:11 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:13 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:15 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:17 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:9 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:11 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:13 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:15 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:17 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:11 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:13 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:15 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:17 +msgid "See :class:`MobileBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:20 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next " +"sentence prediction. Its data type should be float32 and its shape is" +" [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:20 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:23 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:27 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:27 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:1 +msgid "" +"MobileBert Model with a linear layer on top of the output layer, designed" +" for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering:4 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:4 +msgid "An instance of MobileBert." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:1 +msgid "" +"The MobileBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:20 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering:1 +msgid "" +"MobileBert Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:1 +msgid "" +"The MobileBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:19 +msgid "" +"Labels for position (index) of the start of the labelled span for " +"computing the token classification loss. Positions are clamped to the " +"length of the sequence (:obj:`sequence_length`). Position outside of the " +"sequence are not taken into account for computing the loss." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:23 +msgid "" +"Labels for position (index) of the end of the labelled span for computing" +" the token classification loss. Positions are clamped to the length of " +"the sequence (:obj:`sequence_length`). Position outside of the sequence " +"are not taken into account for computing the loss." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:28 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. - " +"`end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:28 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:31 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:34 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:34 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.po new file mode 100644 index 0000000000000000000000000000000000000000..37a54f0d9800da5fb15a216d7034265f5cd2117b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mobilebert.rst:2 +msgid "mobilebert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..204f416d4d503a192fecafc98f2652421b8f2767 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.tokenizer.po @@ -0,0 +1,256 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mobilebert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer:1 +msgid "" +"Construct a MobileBERT tokenizer. " +":class:`~paddlenlp.transformers.MobileBertTokenizer is identical to " +":class:`~paddlenlp.transformers.BertTokenizer` and runs end-to-end " +"tokenization: punctuation splitting and wordpiece. 
Refer to superclass " +":class:`~~paddlenlp.transformers.BertTokenizer` for usage examples and " +"documentation concerning parameters." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:4 +msgid "" +"The element of list can be sequence or sequence pair, and the sequence is" +" a string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:10 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:15 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:25 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:29 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:39 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:42 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:45 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:48 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:51 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:54 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:58 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` " +"works. - **overflow_to_sample** (int, optional): Index of example from " +"which this feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:58 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:61 +msgid "fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:62 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:64 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:67 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:69 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:72 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:75 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:78 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:82 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.model_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.model_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..4c0a2579baf4e2f0b6fe49275038777732f78f81 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.model_utils.po @@ -0,0 +1,258 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.model_utils.rst:2 +msgid "model\\_utils" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:1 +msgid "" +"The base class for all pretrained models. It mainly provides common " +"methods for loading (construction and loading) and saving pretrained " +"models. 
Loading and saving also rely on the following class attributes " +"which should be overridden by derived classes accordingly:" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:6 +msgid "" +"**model_config_file** (str): Represents the file name of model " +"configuration for configuration saving and loading in local file system. " +"The value is `model_config.json`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:9 +msgid "" +"**resource_files_names** (dict): Name of local file where the model " +"configuration can be saved and loaded locally. Currently, resources only " +"include the model state, thus the dict only includes `'model_state'` as " +"key with corresponding value `'model_state.pdparams'` for model weights " +"saving and loading." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:13 +msgid "" +"**pretrained_init_configuration** (dict): Provides the model " +"configurations of built-in pretrained models (contrasts to models in " +"local file system). It has pretrained model names as keys (such as `bert-" +"base-uncased`), and the values are dict preserving corresponding " +"configuration for model initialization." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:17 +msgid "" +"**pretrained_resource_files_map** (dict): Provides resource URLs of " +"built-in pretrained models (contrasts to models in local file system). It" +" has the same key as resource_files_names (that is \"model_state\"), and " +"the corresponding value is a dict with specific model name to model " +"weights URL mapping (such as \"bert-base-uncased\" -> " +"\"https://bj.bcebos.com/paddlenlp/models/transformers/bert-base-" +"uncased.pdparams\")." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:23 +msgid "" +"**base_model_prefix** (str): Represents the attribute associated to the " +"base model in derived classes of the same architecture adding layers on " +"top of the base model. Note: A base model class is pretrained model class" +" decorated by `register_base_model`, such as `BertModel`; A derived model" +" class is a pretrained model class adding layers on top of the base " +"model, and it has a base model as attribute, such as " +"`BertForSequenceClassification`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:30 +msgid "" +"Methods common to models for text generation are defined in " +"`GenerationMixin` and also inherited here." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:33 +msgid "" +"Besides, metaclass `InitTrackerMeta` is used to create `PretrainedModel`," +" by which subclasses can track arguments for initialization " +"automatically." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.base_model:1 +msgid "" +"The body of the same model architecture. It is the base model itself for " +"base model or the base model attribute for derived model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.base_model +#: paddlenlp.transformers.model_utils.PretrainedModel.model_name_list +msgid "type" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.base_model:5 +msgid "PretrainedModel" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.model_name_list:1 +msgid "" +"Contains all supported built-in pretrained model names of the current " +"PretrainedModel class." 
+msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.model_name_list:4 +msgid "list" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:1 +msgid "" +"Creates an instance of `PretrainedModel`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained +#: paddlenlp.transformers.model_utils.PretrainedModel.save_model_config +#: paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained +#: paddlenlp.transformers.model_utils.register_base_model +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of a built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:8 +msgid "Name of a built-in pretrained model" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:10 +msgid "" +"Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:13 +msgid "" +"Position arguments for model `__init__`. If provided, use these as " +"position argument values for model initialization." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:16 +msgid "" +"Keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for model initialization. If the " +"keyword is in `__init__` argument names of base model, update argument " +"values of the base model; else update argument values of derived model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:22 +msgid "" +"The weights read in can be choosed to place on CPU or GPU though the " +"model is on the default device. If `True`, load the model weights as " +"`numpy.ndarray` on CPU. Otherwise, weights would be loaded as tensors on " +"the default device. Note that if on GPU, the latter would creates extra " +"temporary tensors in addition to the model weights, which doubles the " +"memory usage . Thus it is suggested to use `True` for big models on GPU. " +"Default to `False`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained +#: paddlenlp.transformers.model_utils.PretrainedModel.get_model_config +#: paddlenlp.transformers.model_utils.register_base_model +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:32 +msgid "An instance of `PretrainedModel`." 
+msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained +#: paddlenlp.transformers.model_utils.PretrainedModel.get_model_config +#: paddlenlp.transformers.model_utils.register_base_model +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:36 +#: paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:13 +#: paddlenlp.transformers.model_utils.register_base_model:13 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.get_model_config:1 +msgid "Get model configuration." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.get_model_config:3 +msgid "The config of the model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_model_config:1 +msgid "" +"Saves model configuration to a file named \"model_config.json\" under " +"`save_dir`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_model_config:3 +msgid "Directory to save model_config file into." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:1 +msgid "" +"Saves model configuration and related resources (model state) as files " +"under `save_dir`. The model configuration would be saved into a file " +"named \"model_config.json\", and model state would be saved into a file " +"named \"model_state.pdparams\"." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:6 +msgid "" +"The `save_dir` can be used in `from_pretrained` as argument value of " +"`pretrained_model_name_or_path` to re-load the trained model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:9 +msgid "Directory to save files into." +msgstr "" + +#: of paddlenlp.transformers.model_utils.register_base_model:1 +msgid "" +"A decorator for `PretrainedModel` class. It first retrieves the parent " +"class of the class being decorated, then sets the `base_model_class` " +"attribute of that parent class to be the class being decorated. In " +"summary, the decorator registers the decorated class as the base model " +"class in all derived classes under the same architecture." +msgstr "" + +#: of paddlenlp.transformers.model_utils.register_base_model:6 +msgid "The class (inherited from PretrainedModel) to be decorated ." +msgstr "" + +#: of paddlenlp.transformers.model_utils.register_base_model:9 +msgid "The input class `cls` after decorating." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ea77223e7f09950a56ca1f0b872c34a8a65bbd5b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.modeling.po @@ -0,0 +1,521 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mpnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel:1 +msgid "基类::class:`paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:1 +msgid "The bare MPNet Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM +#: paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetModel +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `MPNetModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. 
Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`514`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:38 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`MPNetPretrainedModel.init_weights()` for " +"how weights are initialized in `MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:38 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:42 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`MPNetPretrainedModel.init_weights()` for how weights are " +"initialized in `MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:45 +msgid "The number of buckets to use for each attention layer. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:48 +msgid "The epsilon used by the layer normalization layers. Defaults to `1e-5`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:51 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:1 +msgid "The MPNetModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:7 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:11 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. 
- **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:11 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:16 +msgid "**1** for tokens that **not masked**," +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:17 +msgid "**0** for tokens that **masked**." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:19 +msgid "" +"It is a tensor with shape broadcasted to `[batch_size, " +"num_attention_heads, sequence_length, sequence_length]`. Defaults to " +"`None`, which means nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:23 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (``) in sequence. We \"pool\" the " +"model by simply taking the hidden state corresponding to the first token." +" Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:23 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:12 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:12 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:25 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:20 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:29 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:28 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:33 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:32 +msgid "" +"The output of first token (``) in sequence. We \"pool\" the model by " +"simply taking the hidden state corresponding to the first token. Its data" +" type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:15 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:24 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:15 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:15 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:38 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained MPNet models. It provides MPNet related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM:1 +msgid "MPNet Model with a `language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM:3 +msgid "An instance of :class:`MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:7 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:7 +msgid "See :class:`MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:7 +msgid "" +"The Labels for computing the masked language modeling loss. Indices " +"should be in ``[-100, 0, ..., vocab_size]`` Tokens with indices set to " +"``-100`` are ignored (masked), the loss is only computed for the tokens " +"with labels in ``[0, ..., vocab_size]`` Its shape is [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:10 +msgid "" +"Returns tuple (`masked_lm_loss`, `prediction_scores`, " +"``sequence_output`). With the fields: - `masked_lm_loss` (Tensor): " +"The masked lm loss. Its data type should be float32 and its shape is [1]." 
+" - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. Its shape is [batch_size, " +"sequence_length, vocab_size]. - `sequence_output` (Tensor): Sequence" +" of hidden-states at the last layer of the model. Its data type should be" +" float32. Its shape is `[batch_size, sequence_length, hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:10 +msgid "Returns tuple (`masked_lm_loss`, `prediction_scores`, ``sequence_output`)." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:15 +msgid "`masked_lm_loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:15 +msgid "The masked lm loss. Its data type should be float32 and its shape is [1]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:18 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:18 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"Its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:21 +msgid "" +"Sequence of hidden-states at the last layer of the model. Its data type " +"should be float32. Its shape is `[batch_size, sequence_length, " +"hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:1 +msgid "" +"MPNet Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:4 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:4 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:4 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:4 +msgid "An instance of MPNetModel." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:6 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:6 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:8 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:8 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:8 +msgid "" +"The dropout probability for output of MPNet. If None, use the same value " +"as `hidden_dropout_prob` of `MPNetModel` instance `mpnet`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:1 +msgid "" +"The MPNetForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:1 +msgid "" +"MPNet Model with a linear layer on top of the hidden-states output layer," +" designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." 
+msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:1 +msgid "" +"The MPNetForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:7 +msgid "" +"See :class:`MPNetModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:10 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:1 +msgid "" +"MPNet Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:1 +msgid "" +"The MPNetForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:1 +msgid "" +"MPNet Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:1 +msgid "" +"The MPNetForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. 
Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.po new file mode 100644 index 0000000000000000000000000000000000000000..433d59d822d107262bce9b1d1298852c0e7bbee3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mpnet.rst:2 +msgid "mpnet" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..011b07f96ccf938585fe935d60048bb6739c8c2d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.tokenizer.po @@ -0,0 +1,135 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mpnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer:1 +msgid "" +"Construct a MPNet tokenizer which is almost identical to `BertTokenizer`." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:4 +msgid "A MPNet sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: `` X ``" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: `` A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:6 +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:4 +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Creates a mask from the two sequences passed to be used in a sequence-" +"pair classification task. MPNet does not make use of token type ids, " +"therefore a list of zeros is returned." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:9 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..faae4fc0fa755ca69dbb96005d6020639f278c45 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.modeling.po @@ -0,0 +1,600 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.nezha.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaModel:1 +msgid "基类::class:`paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:1 +msgid "The bare NeZha Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice +#: paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaModel +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `DistilBertModel`. 
Defines the number " +"of different tokens that can be represented by the `inputs_ids` passed " +"when calling `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and the pooler " +"layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`NeZhaPretrainedModel.init_weights()` for " +"how weights are initialized in `NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:41 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`NeZhaPretrainedModel.init_weights()` for how weights are " +"initialized in `NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:48 +msgid "" +"The maximum value of the dimensionality of relative encoding, which " +"dictates the maximum supported relative distance of two sentences. " +"Defaults to `64`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:52 +msgid "" +"The small value added to the variance in `LayerNorm` to prevent division " +"by zero. Defaults to `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:55 +msgid "Whether or not to use relative position embedding. Defaults to `True`." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:1 +msgid "The NeZhaModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:18 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. We " +"use whole-word-mask in NeZha, so the whole word will have the same value." +" For example, \"使用\" as a word, \"使\" and \"用\" will have the same value." +" Defaults to `None`, which means nothing needed to be prevented attention" +" to." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:32 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:17 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:12 +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:34 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:11 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:38 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:37 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:1 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:42 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:41 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:4 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:24 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:15 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:15 +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel:1 +msgid "" +"An abstract class for pretrained NeZha models. It provides NeZha related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining:1 +msgid "NeZha Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining:3 +msgid "An instance of :class:`NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:5 +msgid "See :class:`NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:7 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64 and its shape is " +"[batch_size, sequence_length, 1]." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:10 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:14 +msgid "" +"Returns Tensor ``total_loss`` if `masked_lm_labels` is not None. Returns " +"tuple (``prediction_scores``, ``seq_relationship_score``) if " +"`masked_lm_labels` is None. With the fields: - `total_loss` (Tensor):" +" - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. If `masked_positions` is" +" None, its shape is [batch_size, sequence_length, vocab_size]. " +"Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. - " +"`seq_relationship_score` (Tensor): The scores of next sentence " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:14 +msgid "" +"Returns Tensor ``total_loss`` if `masked_lm_labels` is not None. Returns " +"tuple (``prediction_scores``, ``seq_relationship_score``) if " +"`masked_lm_labels` is None." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:19 +msgid "`total_loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:25 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:16 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:23 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:14 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:28 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:19 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:28 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:19 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:1 +msgid "" +"NeZha Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:4 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:4 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:4 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:4 +msgid "An instance of NeZhaModel." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:6 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:8 +msgid "" +"The dropout probability for output of NeZha. 
If None, use the same value " +"as `hidden_dropout_prob` of `NeZhaModel` instance `nezha`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:1 +msgid "" +"The NeZhaForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:1 +msgid "Perform language modeling task and next sentence classification task." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:7 +msgid "Activation function used in the language modeling task." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:9 +msgid "" +"Decoding weights used to map hidden_states to logits of the masked token " +"prediction. Its data type should be float32 and its shape is [vocab_size," +" hidden_size]. Defaults to `None`, which means use the same weights of " +"the embedding layer." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:9 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:9 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:1 +msgid "" +"NeZha Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:8 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:6 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:8 +msgid "" +"The dropout probability for output of NeZha. If None, use the same value " +"as `hidden_dropout_prob` of `NeZhaModel` instance `nezha`. Defaults to " +"`None`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:1 +msgid "" +"The NeZhaForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:1 +msgid "" +"NeZha with a linear layer on top of the hidden-states output to compute " +"`span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:1 +msgid "" +"The NeZhaForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:1 +msgid "" +"NeZha Model with a linear layer on top of the hidden-states output layer," +" designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:1 +msgid "" +"The NeZhaForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:10 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the input multiple choice " +"classification logits. Shape as `[batch_size, num_classes]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.po new file mode 100644 index 0000000000000000000000000000000000000000..8428ce2ba07414c2f71ca74f05ed13e7c53aed87 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.nezha.rst:2
+msgid "nezha"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.tokenizer.po
new file mode 100644
index 0000000000000000000000000000000000000000..d1b64ed92e9b23927afdbd373dfd2bcc812f790c
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.tokenizer.po
@@ -0,0 +1,302 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.nezha.tokenizer.rst:2
+msgid "tokenizer"
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:1
+msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`"
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:1
+msgid ""
+"Constructs a NeZha tokenizer. It uses a basic tokenizer to do punctuation"
+" splitting, lower casing and so on, and follows a WordPiece tokenizer to "
+"tokenize as subwords."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:5
+msgid ""
+"This tokenizer inherits from "
+":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` "
+"which contains most of the main methods. For more information regarding "
+"those methods, please refer to this superclass."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:9
+msgid ""
+"The vocabulary file path (ends with '.txt') required to instantiate a "
+"`WordpieceTokenizer`."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:12
+msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:15
+msgid ""
+"A special token representing the *unknown (out-of-vocabulary)* token. An "
+"unknown token is set to be `unk_token` in order to be converted to an ID. "
+"Defaults to \"[UNK]\"."
+msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:34 +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:4 +msgid "A NeZha sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A NeZha offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." 
+msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:3 +msgid "A NeZha sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.optimization.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.optimization.po new file mode 100644 index 0000000000000000000000000000000000000000..7ba6a123422c649f776b0bd6f7b04e7e4333f01f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.optimization.po @@ -0,0 +1,152 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.optimization.rst:2 +msgid "optimization" +msgstr "" + +#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:1 +#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:1 +#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:1 +#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:1 +msgid "基类::class:`paddle.optimizer.lr.LambdaDecay`" +msgstr "" + +#: of paddlenlp.transformers.optimization.LinearDecayWithWarmup:1 +msgid "" +"Creates a learning rate scheduler, which increases learning rate linearly" +" from 0 to given `learning_rate`, after this warmup period learning rate " +"would be decreased linearly from the base learning rate to 0." 
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:5
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:7
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:5
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:6
+msgid "The base learning rate. It is a python float number."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:9
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:7
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:8
+msgid "The number of training steps."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:7
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:11
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:9
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:10
+msgid ""
+"If int, it means the number of steps for warmup. If float, it means the "
+"proportion of warmup in total training steps."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:14
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:23
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:12
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:19
+msgid ""
+"The index of last epoch. It can be set to restart training. If None, it "
+"means initial learning rate. Defaults to -1."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.LinearDecayWithWarmup:16
+msgid "If True, prints a message to stdout for each update. Defaults to False."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:20
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:29
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:21
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:25
+msgid "实际案例"
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:1
+msgid ""
+"Creates a learning rate scheduler, which increases learning rate linearly"
+" from 0 to given `learning_rate` during warmup periods and keeps learning"
+" rate a constant after that."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:10
+msgid ""
+"The number of training steps. If `warmup` is a float number, "
+"`total_steps` must be provided. Defaults to None."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:1
+msgid ""
+"Creates a learning rate scheduler, which increases learning rate linearly"
+" from 0 to given `learning_rate`, after this warmup period learning rate "
+"would be decreased following the values of the cosine function. If "
+"`with_hard_restarts` is True, the cosine function could have several "
+"hard restarts."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:14
+msgid "Whether cosine function has several hard restarts. Defaults to False."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:17
+msgid ""
+"If `with_hard_restarts` is False, it means the number of waves in cosine "
+"scheduler and should be an integer number and defaults to 1. If "
+"`with_hard_restarts` is True, it means the number of hard restarts to use"
+" and should be a float number and defaults to 0.5. 
Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.optimization.PolyDecayWithWarmup:1 +msgid "" +"Creates a learning rate scheduler, which increases learning rate linearly" +" from 0 to given `lr_init`, after this warmup period learning rate would " +"be decreased as a polynomial decay from the base learning rate to the end" +" learning rate `lr_end`." +msgstr "" + +#: of paddlenlp.transformers.optimization.PolyDecayWithWarmup:13 +msgid "The end learning rate. Defaults to 1e-7." +msgstr "" + +#: of paddlenlp.transformers.optimization.PolyDecayWithWarmup:16 +msgid "Power factor. Defaults to 1.0." +msgstr "" + +#: of paddlenlp.transformers.optimization.CosineAnnealingWithWarmupDecay:1 +msgid "基类::class:`paddle.optimizer.lr.LRScheduler`" +msgstr "" + +#: of +#: paddlenlp.transformers.optimization.CosineAnnealingWithWarmupDecay.get_lr:1 +msgid "" +"For those subclass who overload ``LRScheduler`` (Base Class), User should" +" have a custom implementation of ``get_lr()`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.optimization.CosineAnnealingWithWarmupDecay.get_lr:3 +msgid "Otherwise, an ``NotImplementedError`` exception will be thrown." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.po new file mode 100644 index 0000000000000000000000000000000000000000..f1ac70debea6f48d2be291d37592ee93fc11a432 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.rst:2 +msgid "paddlenlp.transformers" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..855d3381963115d311803c162cc924d6e3a51fad --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.modeling.po @@ -0,0 +1,348 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ppminilm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:1 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:1 +msgid "基类::class:`paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:1 +msgid "The bare PPMiniLM Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `PPMiniLMModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`PPMiniLMModel`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`PPMiniLMPretrainedModel._init_weights()` for how weights are " +"initialized in `PPMiniLMModel`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`PPMiniLMPretrainedModel._init_weights()` for how weights are " +"initialized in `PPMiniLMModel`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:1 +msgid "" +"If `input_ids` is a Tensor object, it is an indices of input sequence " +"tokens in the vocabulary. They are numerical representations of tokens " +"that build the input sequence. It's data type should be `int64` and has a" +" shape of [batch_size, sequence_length]. If `input_ids` is a list of " +"string, `self.use_faster_tokenizer` should be True, and the network " +"contains `faster_tokenizer` operator." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:9 +msgid "" +"If `token_type_ids` is a Tensor object: Segment token indices to indicate" +" different portions of the inputs. Selected in the range ``[0, " +"type_vocab_size - 1]``. If `type_vocab_size` is 2, which means the inputs" +" have two portions. Indices can either be 0 or 1: - 0 corresponds to a " +"*sentence A* token, - 1 corresponds to a *sentence B* token. Its data " +"type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings. If `token_type_ids` is a list of string: " +"`self.use_faster_tokenizer` should be True, and the network contains " +"`faster_tokenizer` operator." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:9 +msgid "" +"If `token_type_ids` is a Tensor object: Segment token indices to indicate" +" different portions of the inputs. Selected in the range ``[0, " +"type_vocab_size - 1]``. If `type_vocab_size` is 2, which means the inputs" +" have two portions. 
Indices can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:15 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:16 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:18 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:21 +msgid "" +"If `token_type_ids` is a list of string: `self.use_faster_tokenizer` " +"should be True, and the network contains `faster_tokenizer` operator." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:24 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:28 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in PPMiniLM, so the whole word " +"will have the same value. For example, \"使用\" as a word, \"使\" and \"用\" " +"will have the same value. Defaults to `None`, which means nothing needed " +"to be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:42 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:42 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." 
+msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:44 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:48 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:47 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:52 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:51 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:15 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:57 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel:1 +msgid "基类::class:`paddlenlp.experimental.model_utils.FasterPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel:1 +msgid "" +"An abstract class for pretrained PPMiniLM models. It provides PPMiniLM " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:1 +msgid "" +"PPMiniLM Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.PPMiniLMModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:8 +msgid "" +"The dropout probability for output of PPMiniLM. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.PPMiniLMModel` " +"instance. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:1 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:3 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:5 +msgid "See :class:`PPMiniLMModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:7 +msgid "See :class:`MiniLMModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.po new file mode 100644 index 0000000000000000000000000000000000000000..2904e71719a238f4f380f604a396361be1f14127 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ppminilm.rst:2 +msgid "ppminilm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..fdeb47bbfe519ecc9a3dc1622d7ebba8521a62b2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.tokenizer.po @@ -0,0 +1,294 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ppminilm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:1 +msgid "" +"Constructs an PPMiniLM tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:9
+msgid ""
+"The vocabulary file path (ends with '.txt') required to instantiate a "
+"`WordpieceTokenizer`."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:12
+msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:15
+msgid ""
+"A special token representing the *unknown (out-of-vocabulary)* token. An "
+"unknown token is set to be `unk_token` in order to be converted to an ID. "
+"Defaults to \"[UNK]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:19
+msgid ""
+"A special token separating two different sentences in the same input. "
+"Defaults to \"[SEP]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:22
+msgid ""
+"A special token used to make arrays of tokens the same size for batching "
+"purposes. Defaults to \"[PAD]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:25
+msgid ""
+"A special token used for sequence classification. It is the last token of"
+" the sequence when built with special tokens. Defaults to \"[CLS]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:28
+msgid ""
+"A special token representing a masked token. This is the token used in "
+"the masked language modeling task which the model tries to predict the "
+"original unmasked ones. Defaults to \"[MASK]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:34
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:12
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:10
+msgid "实际案例"
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size:1
+msgid "Return the size of vocabulary."
+msgstr ""
+
+#: of
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size
+msgid "返回"
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size:3
+msgid "The size of vocabulary."
+msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:1 +msgid "Converts a string to a list of tokens." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. Do not put this inside your training loop." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:8 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:12 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:4 +msgid "A sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "An offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:14 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:3 +msgid "A sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:17 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..77077e04da76b1a1b06a6501183a3b2685e652ec --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.modeling.po @@ -0,0 +1,85 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.prophetnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel:1 +msgid "基类::class:`paddlenlp.transformers.prophetnet.modeling.ProphetNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward:4 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward:4 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward:4 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward:6 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward:6 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward:6 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained Prophetnet models. It provides " +"Prophetnet related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder:4 +msgid "" +"word_embeddings (:obj:`paddle.nn.Embeddings` of shape " +":obj:`(config.vocab_size, config.hidden_size)`, `optional`):" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder:2 +msgid "" +"The word embedding parameters. This can be used to initialize " +":class:`~transformers.ProphetNetEncoder` with pre-defined word embeddings" +" instead of randomly initialized word embeddings." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.po new file mode 100644 index 0000000000000000000000000000000000000000..b6ee17f11e6a006ccf2c541cb85bdeb261fab6a4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.prophetnet.rst:2 +msgid "prophetnet" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..87f73f783bf49e0ce9850d842e063fb244ce6878 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.tokenizer.po @@ -0,0 +1,292 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.prophetnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.load_vocab:1 +msgid "Loads a vocabulary file into a dictionary." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:1 +msgid "Construct a ProphetNetTokenizer. Based on WordPiece." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:3 +msgid "" +"This tokenizer inherits from [`PreTrainedTokenizer`] which contains most " +"of the main methods. Users should refer to this superclass for more " +"information regarding those methods." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:6 +msgid "File containing the vocabulary." 
+msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:10 +msgid "Whether or not to do basic tokenization before WordPiece." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:12 +msgid "" +"The unknown token. A token that is not in the vocabulary cannot be " +"converted to an ID and is set to be this token instead." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:15 +msgid "" +"The separator token, which is used when building a sequence from multiple" +" sequences, e.g. two sequences for sequence classification or for a text " +"and a question for question answering. It is also used as the last token " +"of a sequence built with special tokens." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:19 +msgid "" +"Special second separator token, which can be generated by " +"[`ProphetNetForConditionalGeneration`]. It is used to separate bullet-" +"point like sentences in summarization, *e.g.*." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:23 +msgid "" +"The token used for padding, for example when batching sequences of " +"different lengths." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:25 +msgid "" +"The classifier token which is used when doing sequence classification " +"(classification of the whole sequence instead of per-token " +"classification). It is the first token of the sequence when built with " +"special tokens." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:28 +msgid "" +"The token used for masking values. This is the token used when training " +"this model with masked language modeling. This is the token which the " +"model will try to predict." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:1 +msgid "" +"Converts a sequence of tokens into ids using the `vocab` attribute (an " +"instance of `Vocab`). Override it if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:5 +msgid "Args:" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:5 +msgid "tokens (list[int]): List of token ids." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:7 +msgid "Converted id list." 
+msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:1 +msgid "" +"Converts a single index or a sequence of indices to a token or a sequence" +" of tokens, using the vocabulary and added tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:4 +msgid "The token id (or token ids) to be converted to token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:6 +msgid "" +"Whether or not to remove special tokens in the decoding. Defaults to " +"`False` and we do not remove special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:10 +msgid "The decoded token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (string) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieve sequence ids from a token list that has no special tokens added." +" This method is called when adding special tokens using the tokenizer " +"`prepare_for_model` method." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:4 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:11 +msgid "" +"A list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:18 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:12 +msgid "`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. 
A ProphetNet sequence pair mask has the following " +"format:" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:4 +msgid "" +"``` 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second " +"sequence | ```" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:16 +msgid "" +"List of [token type IDs](../glossary#token-type-ids) according to the " +"given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. A BERT " +"sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:4 +msgid "single sequence: `[CLS] X [SEP]`" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:5 +msgid "pair of sequences: `[CLS] A [SEP] B [SEP]`" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:12 +msgid "" +"List of [input IDs](../glossary#input-ids) with the appropriate special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary:1 +msgid "" +"Save all tokens to a vocabulary file. The file contains a token per line," +" and the line number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary:4 +msgid "File path to be saved to." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary:6 +msgid "The `Vocab` or `dict` instance to be saved." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..a486f3f2e4d0afd8d854bf0dade44cc5732ef6ff --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.modeling.po @@ -0,0 +1,807 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.reformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:1 +#: paddlenlp.transformers.reformer.modeling.ReformerModel:1 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead:1 +msgid "基类::class:`paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:1 +msgid "" +"The bare Reformer Model transformer outputting raw hidden-states without " +"any specific head on top." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModel +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:10 +msgid "Whether to tie input and output embeddings. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:12 +msgid "" +"Whether or not to use a causal mask in addition to the `attention_mask` " +"passed to `ReformerModel`. When using the Reformer for causal language " +"modeling, this argument should be set to `True`. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:14 +msgid "" +"The chunk size of all feed forward layers in the residual attention " +"blocks. A chunk size of `0` means that the feed forward layer is not " +"chunked. A chunk size of n means that the feed forward layer processes " +"`n` < sequence_length embeddings at a time. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:18 +msgid "The id of the `padding` token. Defaults to `0`." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:20 +msgid "" +"Seed that can be used to make local sensitive hashing in " +"`LSHSelfAttention` deterministic. This should only be set for testing " +"purposed. For evaluation and training purposes `hash_seed` should be left" +" as `None` to ensure fully random rotations in local sensitive hashing " +"scheme. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:24 +msgid "" +"Vocabulary size of `inputs_ids` in `ReformerModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ReformerModel`. Defaults to `258`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:27 +msgid "" +"Dimensionality of the projected key, query and value vectors. Defaults to" +" `128`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:29 +msgid "Dimensionality of the embedding layer, encoder layer.Defaults to `1024`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:31 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `8`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:34 +msgid "" +"Number of hashing rounds (e.g., number of random rotations) in Local " +"Sensitive Hashing scheme. The higher `num_hashes`, the more accurate the " +"`LSHSelfAttention` becomes, but also the more memory and time intensive " +"the hashing becomes. Defaults to `4`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:36 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:38 +msgid "" +"Number of buckets, the key query vectors can be \"hashed into\" using the" +" locality sensitive hashing scheme. Each query key vector is hashed into " +"a hash in `1, ..., num_buckets`. The number of buckets can also be " +"factorized into a list for improved memory complexity. In this case, each" +" query key vector is hashed into a hash in `1-1, 1-2, ..., " +"num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is" +" factorized into two factors. The number of buckets (or the product the " +"factors) should approximately equal sequence length / lsh_chunk_length. " +"If `num_buckets` not set, a good value is calculated on the fly. Defaults" +" to `512`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:41 +msgid "" +"Length of chunk which attends to itself in `LSHSelfAttention`. Chunking " +"reduces memory complexity from sequence length x sequence length (self " +"attention) to chunk length x chunk length x sequence length / chunk " +"length (chunked self attention).Defaults to `256`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:43 +msgid "" +"Length of chunk which attends to itself in `LocalSelfAttention`. Chunking" +" reduces memory complexity from sequence length x sequence length (self " +"attention) to chunk length x chunk length x sequence length / chunk " +"length (chunked self attention).Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:45 +msgid "" +"Number of following neighbouring chunks to attend to in " +"`LSHSelfAttention` layer to itself. Defaults to `0`." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:47 +msgid "" +"Number of previous neighbouring chunks to attend to in `LSHSelfAttention`" +" layer to itself. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:49 +msgid "" +"Number of following neighbouring chunks to attend to in " +"`LocalSelfAttention` layer to itself. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:51 +msgid "" +"Number of previous neighbouring chunks to attend to in " +"`LocalSelfAttention` layer to itself. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:53 +msgid "" +"The non-linear activation function (function or string) in the feed " +"forward layer in the residual attention block. If string, `\"gelu\"`, " +"`\"relu\"`, `\"tanh\"`, `\"mish\"` and `\"gelu_new\"` are supported. " +"Defaults to `\"relu\"`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:55 +msgid "" +"Dimensionality of the feed_forward layer in the residual attention block." +" Defaults to `4096`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:57 +msgid "" +"The dropout ratio for all fully connected layers in the embeddings and " +"encoder. Defaults to `0.2`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:59 +msgid "" +"The dropout ratio for the attention probabilities in `LSHSelfAttention`. " +"Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:61 +msgid "" +"The dropout ratio for the attention probabilities in " +"`LocalSelfAttention`. Defaults to `0.2`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:63 +msgid "" +"The maximum sequence length that this model might ever be used with. " +"Typically set this to something large just in case (e.g., 512 or 1024 or " +"2048). Defaults to `65536`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:65 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`ReformerPretrainedModel._init_weights()` " +"for how weights are initialized in `ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:65 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:68 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ReformerPretrainedModel._init_weights()` for how weights are " +"initialized in `ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:71 +msgid "The epsilon used by the layer normalization layers. Defaults to `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:73 +msgid "Whether or not to use axial position embeddings. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:75 +msgid "" +"The position dims of the axial position encodings. During training, the " +"product of the position dims has to be equal to the sequence length. " +"Defaults to `[128, 512]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:77 +msgid "" +"The embedding dims of the axial position encodings. 
The sum of the " +"embedding dims has to be equal to the hidden size. Defaults to `[256, " +"768]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:80 +msgid "" +"The standard deviation of the normal_initializer for initializing the " +"weight matrices of the axial positional encodings. Defaults to `1.0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:83 +msgid "" +"The chunk size of the final language model feed forward head layer. A " +"chunk size of 0 means that the feed forward layer is not chunked. A chunk" +" size of n means that the feed forward layer processes n < " +"sequence_length embeddings at a time. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:86 +msgid "" +"List of attention layer types in ascending order. It can be chosen " +"between a LSHSelfAttention layer (`\"lsh\"`) and a LocalSelfAttention " +"layer (`\"local\"`). Defaults to `[\"local\", \"local\", \"lsh\", " +"\"local\", \"local\", \"local\", \"lsh\", \"local\", \"local\", " +"\"local\", \"lsh\", \"local\"]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:1 +msgid "" +"The ReformerModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float. When the data type is int, " +"the `masked` tokens have `0` values and the others have `1` values. When " +"the data type is float, the `masked` tokens have `0.0` values and the " +"others have `1.0` values. It is a tensor with shape broadcasted to " +"`[batch_size, num_attention_heads, sequence_length, sequence_length]`. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:17 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range `[0, max_position_embeddings - 1]`. " +"Shape as [batch_size, num_tokens] and dtype as int64. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:21 +msgid "" +"The number of hashing rounds that should be performed during bucketing. " +"Setting this argument overwrites the default defined in " +"`config[\"num_hashes\"]`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:25 +msgid "" +"List of `tuple(Tensor, Tensor)` of length " +"`config[\"num_hidden_layers\"]`, with the first element being the " +"previous `buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` of " +"shape `[batch_size, sequence_length, hidden_size]`. Contains precomputed " +"hidden-states and buckets (only relevant for LSH Self-Attention). Can be " +"used to speed up sequential decoding. Defaults to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:32 +msgid "" +"Whether or not to use cache. If set to `True`, `cache` states are " +"returned and can be used to speed up decoding. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:36 +msgid "" +"Whether or not to return the attentions tensors of all attention layers. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:39 +msgid "" +"Whether or not to return the output of all hidden layers. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:43 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `hidden_states`, " +"`attentions`) With the fields: - `last_hidden_state` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length," +" hidden_size]. - `cache` (List[tuple(Tensor, Tensor)], optional): " +"returned when `use_cache=True` is passed. List of `tuple(Tensor, " +"Tensor)` of length `config[\"num_hidden_layers\"]`, with the first " +"element being the previous `buckets` of shape `[batch_size, " +"num_heads, num_hashes, sequence_length]` and the second being the " +"previous `hidden_states` of shape `[batch_size, sequence_length, " +"hidden_size]`. - `hidden_states` (tuple(Tensor), optional): returned" +" when `output_hidden_states=True` is passed. tuple of `Tensor` (one " +"for the output of the embeddings + one for the output of each layer)." +" Each Tensor has a data type of float32 and its shape is [batch_size," +" sequence_length, hidden_size]. - `attentions` (tuple(Tensor), " +"optional): returned when `output_attentions=True` is passed. " +"tuple of `Tensor` (one for each layer) of shape. Each Tensor has a data" +" type of float32 and its shape is [batch_size, num_heads, " +"sequence_length, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:43 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `hidden_states`, " +"`attentions`)" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:23 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:30 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:22 +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward:45 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:26 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:50 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:48 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:57 +msgid "`cache` (List[tuple(Tensor, Tensor)], optional):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:53 +msgid "" +"returned when `use_cache=True` is passed. List of `tuple(Tensor, Tensor)`" +" of length `config[\"num_hidden_layers\"]`, with the first element being " +"the previous `buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` of " +"shape `[batch_size, sequence_length, hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:63 +msgid "`hidden_states` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:60 +msgid "" +"returned when `output_hidden_states=True` is passed. tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of each " +"layer). Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:67 +msgid "`attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:66 +msgid "" +"returned when `output_attentions=True` is passed. tuple of `Tensor` (one " +"for each layer) of shape. Each Tensor has a data type of float32 and its " +"shape is [batch_size, num_heads, sequence_length, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:44 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:55 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:41 +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward:72 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:50 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel:1 +msgid "" +"An abstract class for pretrained Reformer models. It provides Reformer " +"related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel.tie_weights:1 +msgid "Tie the weights between the input embeddings and the output embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:1 +msgid "" +"The Reformer Model transformer with a sequence classification head on top" +" (linear layer)." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:3 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead:3 +msgid "An instance of :class:`ReformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:5 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:7 +msgid "" +"The dropout probability for output of Reformer. If None, use the same " +"value as `hidden_dropout_prob` of `ReformerModel` instance `reformer`. " +"Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:16 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:18 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:37 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:40 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:23 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:25 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:48 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:51 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:15 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:17 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:34 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:37 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:9 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:11 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:19 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:21 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:40 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:43 +#: 
paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:46 +msgid "See :class:`ReformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:9 +msgid "" +"Labels for computing the sequence classification/regression loss. Indices" +" should be in `[0, ...,num_classes - 1]`. If `num_classes == 1` a " +"regression loss is computed (Mean-Square loss), If `num_classes > 1` a " +"classification loss is computed (Cross-Entropy). Shape is [batch_size,] " +"and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:20 +msgid "" +"Returns tuple `(loss, logits, hidden_states, attentions)`. With the " +"fields: - `loss` (Tensor): returned when `labels` is provided. " +"Classification (or regression if num_classes==1) loss. It's data type" +" should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Classification (or regression if num_classes==1) scores (before SoftMax)." +" It's data type should be float32 and its shape is [batch_size, " +"num_classes]. - `hidden_states` (tuple(Tensor)): See " +":class:`ReformerModel`. - `attentions` (tuple(Tensor)): See " +":class:`ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:21 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:28 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:20 +msgid "Returns tuple `(loss, logits, hidden_states, attentions)`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:28 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:35 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:27 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:31 +msgid "`loss` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:33 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:25 +msgid "" +"returned when `labels` is provided. Classification (or regression if " +"num_classes==1) loss. It's data type should be float32 and its shape is " +"[1,]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:34 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:31 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:37 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:30 +msgid "" +"Classification (or regression if num_classes==1) scores (before SoftMax)." +" It's data type should be float32 and its shape is [batch_size, " +"num_classes]." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:37 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:48 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:34 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:43 +msgid "`hidden_states` (tuple(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:39 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:50 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:36 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:45 +msgid "`attentions` (tuple(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:1 +msgid "" +"Reformer Model with a span classification head on top for extractive " +"question-answering tasks like SQuAD (a linear layers on top of the " +"hidden-states output to compute `span start logits` and `span end " +"logits`)." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:5 +msgid "An instance of ReformerModel." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:7 +msgid "" +"The dropout probability for output of Reformer. If None, use the same " +"value as `hidden_dropout_prob` of `ReformerModel` instance `reformer`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:9 +msgid "" +"Labels for position (index) of the start of the labelled span for " +"computing the token classification loss. Positions are clamped to the " +"length of the sequence (`sequence_length`). Position outside of the " +"sequence are not taken into account for computing the loss. Shape is " +"[batch_size,] and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:16 +msgid "" +"Labels for position (index) of the end of the labelled span for computing" +" the token classification loss. Positions are clamped to the length of " +"the sequence (`sequence_length`). Position outside of the sequence are " +"not taken into account for computing the loss. Shape is [batch_size,] and" +" dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:28 +msgid "" +"Returns tuple `(loss, logits, hidden_states, attentions)`. With the " +"fields: - `loss` (Tensor): returned when `labels` is provided. " +"Classification (or regression if num_classes==1) loss. It's data type" +" should be float32 and its shape is [1,]. - `start_logits` (Tensor):" +" A tensor of the input token classification logits, indicates the" +" start position of the labelled span. Its data type should be float32" +" and its shape is [batch_size, sequence_length]. - `end_logits` " +"(Tensor): A tensor of the input token classification logits, " +"indicates the end position of the labelled span. Its data type " +"should be float32 and its shape is [batch_size, sequence_length]. - " +"`hidden_states` (tuple(Tensor)): See :class:`ReformerModel`. - " +"`attentions` (tuple(Tensor)): See :class:`ReformerModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:40 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:38 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:45 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:43 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead:1 +msgid "The Reformer Model transformer with a language modeling head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:13 +msgid "" +"Labels for language modeling. Note that the labels **are shifted** inside" +" the model, i.e. you can set `labels = input_ids` Indices are selected in" +" `[-100, 0, ..., vocab_size]` All labels set to `-100` are ignored " +"(masked), the loss is only computed for labels in `[0, ..., vocab_size]`." +" Shape is [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:24 +msgid "" +"Returns tuple `(loss, logits, cache, hidden_states, attentions)`. With " +"the fields: - `loss` (Tensor): returned when `labels` is provided." +" Language modeling loss (for next-token prediction). It's data " +"type should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 " +"and its shape is [batch_size, sequence_length, vocab_size]. - " +"`cache` (List[tuple(Tensor, Tensor)]): See :class:`ReformerModel`. -" +" `hidden_states` (tuple(Tensor)): See :class:`ReformerModel`. - " +"`attentions` (tuple(Tensor)): See :class:`ReformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:24 +msgid "Returns tuple `(loss, logits, cache, hidden_states, attentions)`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:29 +msgid "" +"returned when `labels` is provided. Language modeling loss (for next-" +"token prediction). It's data type should be float32 and its shape is " +"[1,]." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:34 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:40 +msgid "`cache` (List[tuple(Tensor, Tensor)]):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM:1 +msgid "" +"The Reformer Model transformer with a masked language modeling head on " +"top." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:9 +msgid "" +"Labels for computing the masked language modeling loss. Indices should be" +" in ``[-100, 0, ..., vocab_size]`` (see ``input_ids`` docstring) Tokens " +"with indices set to ``-100`` are ignored(masked), the loss is only " +"computed for the tokens with labels in ``[0, ..., vocab_size]``. Shape is" +" [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:21 +msgid "" +"Returns tuple `(loss, logits, hidden_states, attentions)`. With the " +"fields: - `loss` (Tensor): returned when `labels` is provided. " +"Masked Language modeling loss. It's data type should be float32 and " +"its shape is [1,]. - `logits` (Tensor): Prediction scores of the " +"masked language modeling head (scores for each vocabulary token " +"before SoftMax). It's data type should be float32 and its shape is" +" [batch_size, sequence_length, vocab_size]. - `hidden_states` " +"(tuple(Tensor)): See :class:`ReformerModel`. - `attentions` " +"(tuple(Tensor)): See :class:`ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:26 +msgid "" +"returned when `labels` is provided. Masked Language modeling loss. It's " +"data type should be float32 and its shape is [1,]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:31 +msgid "" +"Prediction scores of the masked language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.po new file mode 100644 index 0000000000000000000000000000000000000000..d656b21a7e577f4ec956d04ed963b61c99315666 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.reformer.rst:2 +msgid "reformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..a1dadaf14a90d32338bcccdd9329627d77b01cc1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.tokenizer.po @@ -0,0 +1,155 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.reformer.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:1 +msgid "" +"Constructs a Reformer tokenizer based on SentencePiece . This tokenizer " +"inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:6 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:9 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:11 +msgid "Whether or note to remove space when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:13 +msgid "Whether or note to keep accents when tokenizing. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:15 +msgid "" +"A special token representing the *eos (end-of-sentence)* token. Defaults " +"to \"\"." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:18 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"\"." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"\"." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:1 +msgid "Build model inputs from a sequence or a pair of sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:3 +msgid "An Reformer sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:5 +msgid "single sequence: ``X``" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:6 +msgid "pair of sequences: ``A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:8 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:10 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:13 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:1 +msgid "Create a mask from the two sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:5 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:7 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:10 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..e291691d880a1716dee8cb0c75249bf3c843e573 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.modeling.po @@ -0,0 +1,484 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.rembert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:1 +#: paddlenlp.transformers.rembert.modeling.RemBertModel:1 +msgid "基类::class:`paddlenlp.transformers.rembert.modeling.RembertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:1 +msgid "The bare RemBERT Model transformer outputting raw hidden-states." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertModel +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `RemBertModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`RemBertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `256`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:15 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `1152`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `18`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:40 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:43 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:43 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:47 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:50 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:1 +msgid "" +"The RemBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:32 +msgid "Returns tuple (`sequence_output`, `pooled_output`)" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:14 +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward:34 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:38 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:37 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:42 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:41 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:15 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:17 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:26 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:17 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:17 +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM:1 +msgid "RemBert Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM:3 +msgid "An instance of :class:`RemBertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:9 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:9 +msgid "See :class:`RemBertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:10 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering:1 +msgid "" +"RemBert Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:4 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering:4 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:4 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:4 +msgid "An instance of RemBertModel." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:1 +msgid "" +"The RemBertForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:1 +msgid "" +"RemBert Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:6 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:1 +msgid "" +"The RemBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:1 +msgid "" +"RemBert Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:6 +msgid "The number of choices." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:1 +msgid "" +"The BertForMultipleChoice forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:9 +msgid "" +"See :class:`RemBertModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RembertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RembertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:1 +msgid "" +"RemBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:1 +msgid "" +"The RemBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.po new file mode 100644 index 0000000000000000000000000000000000000000..f6e5af4d2b038e7dc695895292d8883fe0bb2ad0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.rembert.rst:2 +msgid "rembert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..722f3493018ac4ff8b2a8e0898313b4987ed7345 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.tokenizer.po @@ -0,0 +1,234 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.rembert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:1 +msgid "" +"Construct a RemBertTokenizer. For more information regarding those " +"methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:4 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:7 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:10 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:14 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:17 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:20 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:23 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:29 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. A " +"REMBERT sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:4 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:5 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:12 +msgid "" +"List of `input IDs <../glossary.html#input-ids>`__ with the appropriate " +"special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:18 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:12 +msgid ":obj:`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieve sequence ids from a token list that has no special tokens added." +" This method is called when adding special tokens using the tokenizer " +"``prepare_for_model`` method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:4 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:11 +msgid "" +"A list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. A RemBERT sequence pair mask has the following " +"format:" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:16 +msgid "" +"List of `token type IDs <../glossary.html#token-type-ids>`_ according to " +"the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary:1 +msgid "" +"Save all tokens to a vocabulary file. The file contains a token per line," +" and the line number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary:4 +msgid "File path to be saved to." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary:6 +msgid "The `Vocab` or `dict` instance to be saved." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..7bd5d069b183abd1e96b2f354fa052bb575b279a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.modeling.po @@ -0,0 +1,676 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roberta.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:1 +#: paddlenlp.transformers.roberta.modeling.RobertaModel:1 +msgid "基类::class:`paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:1 +msgid "The bare Roberta Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaModel +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:38 +msgid "" +"The vocabulary size of the `token_type_ids` passed when calling " +"`~transformers.RobertaModel`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`RobertaPretrainedModel._init_weights()` for" +" how weights are initialized in `RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:41 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:44 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`RobertaPretrainedModel._init_weights()` for how weights are " +"initialized in `RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:47 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:50 +msgid "The index of cls token in the token vocabulary. Defaults to `101`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:8 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:9 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:11 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:14 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:19 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. 
When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:30 +msgid "" +"Whether or not to output hidden states for all hidden layers. Defaults to" +" `False`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) by default. Returns " +"(`encoder_outputs`, `pooled_output`) if output_hidden_states is `True`. " +"With the fields: - `sequence_output` (Tensor): Sequence of hidden-" +"states at the last layer of the model. It's data type should be " +"float32 and its shape is [batch_size, sequence_length, hidden_size]. - " +"`pooled_output` (Tensor): The output of first token (`[CLS]`) in " +"sequence. We \"pool\" the model by simply taking the hidden state " +"corresponding to the first token. Its data type should be float32 and" +" its shape is [batch_size, hidden_size]. - `encoder_outputs` " +"(List(Tensor)): A list of Tensor containing hidden-states of the " +"model at each hidden layer in the Transformer encoder. The length of " +"the list is `num_hidden_layers`. Each Tensor has a data type of " +"float32 and its shape is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) by default. Returns " +"(`encoder_outputs`, `pooled_output`) if output_hidden_states is `True`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward:37 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:40 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:46 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:44 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:27 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward:50 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:49 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:32 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel:1 +msgid "" +"An abstract class for pretrained RoBerta models. It provides RoBerta " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:1 +msgid "" +"Roberta Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:4 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:4 +msgid "An instance of `RobertaModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:6 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:8 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:8 +msgid "" +"The dropout probability for output of Roberta. If None, use the same " +"value as `hidden_dropout_prob` of `RobertaModel` instance `roberta`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:9 +msgid "See :class:`RobertaModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits` by default. Returns tuple (`logits`, " +"`encoder_outputs`) if output_hidden_states is set to `True`. With the " +"fields: - `logits` (Tensor): a tensor of the input text " +"classification logits. Its data type should be float32 and it has a " +"shape of [batch_size, num_classes]. - `encoder_outputs` (List(Tensor)):" +" A list of Tensor containing hidden-states of the model at each " +"hidden layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and a " +"shape of [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:12 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits` by default. Returns tuple (`logits`, " +"`encoder_outputs`) if output_hidden_states is set to `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:19 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:19 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:18 +msgid "" +"a tensor of the input text classification logits. Its data type should be" +" float32 and it has a shape of [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:22 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:22 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:26 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:22 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:22 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and a shape " +"of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:1 +msgid "" +"Roberta Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits` by default. Returns tuple (`logits`, " +"`encoder_outputs`) if output_hidden_states is set to `True`. With the " +"fields: - `logits` (Tensor): a tensor of the input token " +"classification logits. Shape as `[batch_size, sequence_length, " +"num_classes]` and dtype as `float32`. - `encoder_outputs` " +"(List(Tensor)): A list of Tensor containing hidden-states of the " +"model at each hidden layer in the Transformer encoder. The length of " +"the list is `num_hidden_layers`. Each Tensor has a data type of " +"float32 and a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:18 +msgid "" +"a tensor of the input token classification logits. Shape as `[batch_size," +" sequence_length, num_classes]` and dtype as `float32`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:2 +msgid "" +"Roberta Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits`" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:2 +msgid "and `span_end_logits`, designed for question-answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:4 +msgid "An instance of RobertaModel." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`) by default if " +"output_hidden_states is `False`. Returns tuple (`start_logits`, " +"`end_logits`, `encoder_outputs`) if output_hidden_states is set to " +"`True`. 
With the fields: - `start_logits` (Tensor): A tensor of the" +" input token classification logits, indicates the start position of the " +"labelled span. Its data type should be float32 and its shape is " +"[batch_size, sequence_length]. - `end_logits` (Tensor): A tensor of " +"the input token classification logits, indicates the end position of the " +"labelled span. Its data type should be float32 and its shape is " +"[batch_size, sequence_length]. - `encoder_outputs` (List(Tensor)): A" +" list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and a " +"shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`) by default if " +"output_hidden_states is `False`. Returns tuple (`start_logits`, " +"`end_logits`, `encoder_outputs`) if output_hidden_states is set to " +"`True`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:19 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:18 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:23 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:22 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM:1 +msgid "Roberta Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM:3 +msgid "class:RobertaModel`): An instance of :class:`RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:12 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:12 +msgid "" +"Returns tensor `prediction_scores` by default. Returns tuple " +"(`prediction_scores`, `encoder_outputs`) if output_hidden_states is set " +"to `True`. With the fields: - `prediction_scores` (Tensor): The " +"scores of masked token prediction. Its data type should be float32 " +"and shape is [batch_size, sequence_length, vocab_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and a shape of [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:12 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:12 +msgid "" +"Returns tensor `prediction_scores` by default. Returns tuple " +"(`prediction_scores`, `encoder_outputs`) if output_hidden_states is set " +"to `True`." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:19 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:19 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:18 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:18 +msgid "" +"The scores of masked token prediction. Its data type should be float32 " +"and shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM:1 +msgid "Roberta Model with a `Causal language modeling` head on top." +msgstr "" + +#~ msgid "" +#~ "Returns tuple (`sequence_output`, `pooled_output`)." +#~ " With the fields: - sequence_output " +#~ "(Tensor): Sequence of hidden-states " +#~ "at the last layer of the model." +#~ " It's data type should be float32" +#~ " and its shape is [batch_size, " +#~ "sequence_length, hidden_size]. - pooled_output " +#~ "(Tensor): The output of first token" +#~ " (`[CLS]`) in sequence. We \"pool\" " +#~ "the model by simply taking the " +#~ "hidden state corresponding to the first" +#~ " token. Its data type should be" +#~ " float32 and its shape is " +#~ "[batch_size, hidden_size]." +#~ msgstr "" + +#~ msgid "Returns tuple (`sequence_output`, `pooled_output`)." +#~ msgstr "" + +#~ msgid "sequence_output (Tensor):" +#~ msgstr "" + +#~ msgid "pooled_output (Tensor):" +#~ msgstr "" + +#~ msgid "" +#~ "Returns tensor `logits`, a tensor of " +#~ "the input text classification logits. " +#~ "Its data type should be float32 " +#~ "and it has a shape of [batch_size," +#~ " num_classes]." +#~ msgstr "" + +#~ msgid "" +#~ "Returns tensor `logits`, a tensor of " +#~ "the input token classification logits. " +#~ "Shape as `[batch_size, sequence_length, " +#~ "num_classes]` and dtype as `float32`." +#~ msgstr "" + +#~ msgid "" +#~ "Returns tuple (`start_logits`, `end_logits`). " +#~ "With the fields: - `start_logits` " +#~ "(Tensor): A tensor of the input " +#~ "token classification logits, indicates the " +#~ "start position of the labelled span." +#~ " Its data type should be float32" +#~ " and its shape is [batch_size, " +#~ "sequence_length]. - `end_logits` (Tensor): " +#~ "A tensor of the input token " +#~ "classification logits, indicates the end " +#~ "position of the labelled span. Its" +#~ " data type should be float32 and " +#~ "its shape is [batch_size, sequence_length]." +#~ msgstr "" + +#~ msgid "Returns tuple (`start_logits`, `end_logits`)." +#~ msgstr "" + +#~ msgid "" +#~ "Returns tensor `prediction_scores`, The scores" +#~ " of masked token prediction. Its data" +#~ " type should be float32 and shape " +#~ "is [batch_size, sequence_length, vocab_size]." 
+#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.po new file mode 100644 index 0000000000000000000000000000000000000000..ac6284724b338cc2c48158f64b054c2d082c964f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roberta.rst:2 +msgid "roberta" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..d8dbdfc30131a8e56dc38fb6a79b2407af88bea2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.tokenizer.po @@ -0,0 +1,495 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roberta.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaTokenizer:1 +msgid "" +"RobertaTokenizer is a generic tokenizer class that will be instantiated " +"as either RobertaChineseTokenizer or RobertaBPETokenizer when created " +"with the RobertaTokenizer.from_pretrained() class method." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:1 +msgid "" +"Constructs a RoBerta tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:25 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:34 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add:3 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add:7 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_inputs_with_special_tokens:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:4 +msgid "A RoBERTa sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:11 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A RoBERTa offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:5 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:8 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:10 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:13 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:3 +msgid "A RoBERTa sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:4 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:6 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:8 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:12 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:1 +msgid "Constructs a Roberta tokenizer based on byte-level Byte-Pair-Encoding." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.GPTTokenizer` which contains most of the " +"main methods. For more information regarding those methods, please refer " +"to this superclass." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:7 +msgid "" +"Path to the vocab file. The vocab file contains a mapping from vocabulary" +" strings to indices." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:10 +msgid "" +"Path to the merge file. The merge file is used to split the input " +"sentence into \"subword\" units. The vocab file is then used to encode " +"those units as intices." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:14 +msgid "Paradigm to follow when decoding bytes to UTF-8. Defaults to `'replace'`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:17 +msgid "The maximum value of the input sequence length. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:20 +msgid "A list of special tokens not in the vocabulary. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping:1 +msgid "" +"Returns the map of tokens and the start and end index of their start and " +"end character. Modified from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping:4 +msgid "Input text." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping:7 +msgid "The offset map of input text." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A Roberta offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#~ msgid "Converts a string to a list of tokens." +#~ msgstr "" + +#~ msgid "The text to be tokenized." +#~ msgstr "" + +#~ msgid "A list of string representing converted tokens." +#~ msgstr "" + +#~ msgid "Converts a sequence of tokens (list of string) to a list of ids." +#~ msgstr "" + +#~ msgid "Converted ids from tokens." +#~ msgstr "" + +#~ msgid "A list of ids to be converted." +#~ msgstr "" + +#~ msgid "Whether or not to skip specical tokens. Defaults to `False`." +#~ msgstr "" + +#~ msgid "A list of converted tokens." +#~ msgstr "" + +#~ msgid "" +#~ "Save tokenizer related resources to " +#~ "`resource_files_names` indicating files under " +#~ "`save_directory` by copying directly. Override" +#~ " it if necessary." +#~ msgstr "" + +#~ msgid "Directory to save files into." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..8a649e338ca4bdcac093cae87382212e4ed4a18a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.modeling.po @@ -0,0 +1,635 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerModel:1 +msgid "基类::class:`paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:1 +msgid "The bare RoFormer Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerModel +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `RoFormerModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`RoFormerModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:15 +msgid "Dimensionality of the, encoder layers and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:40 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:43 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:43 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:47 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:50 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:53 +msgid "The non-linear activation function in the pooler. Defaults to `\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:56 +msgid "" +"Whether or not apply rotay position embeddings to value. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:1 +msgid "" +"The RoFormerModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). 
With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:12 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:10 +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:37 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:16 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:40 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:1 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:46 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:44 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:4 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:50 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:49 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:22 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:13 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:13 +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel:1 +msgid "" +"An abstract class for pretrained RoFormer models. It provides RoFormer " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining:1 +msgid "RoFormer Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining:3 +msgid "An instance of :class:`RoFormerModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:7 +msgid "See :class:`RoFormerModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:7 +msgid "See :class:`RoFormerPretrainingHeads`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:10 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:14 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:10 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:14 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:17 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:21 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:15 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:19 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:20 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:24 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:20 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:24 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `RoFormerModel`. Defines the number of" +" different tokens that can be represented by the `inputs_ids` passed when" +" calling `RoFormerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. 
Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:16 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:20 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:1 +msgid "Perform language modeling task and next sentence classification task." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:9 +msgid "Activation function used in the language modeling task." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:11 +msgid "" +"Decoding weights used to map hidden_states to logits of the masked token " +"prediction. Its data type should be float32 and its shape is [vocab_size," +" hidden_size]. Defaults to `None`, which means use the same weights of " +"the embedding layer." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:8 +msgid "" +"A tensor indicates positions to be masked in the position embedding. Its " +"data type should be int64 and its shape is [batch_size, mask_token_num]. " +"`mask_token_num` is the number of masked tokens. It should be no bigger " +"than `sequence_length`. Defaults to `None`, which means we output hidden-" +"states of all tokens in masked token prediction." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:1 +msgid "" +"RoFormer Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:4 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:4 +msgid "An instance of `paddlenlp.transformers.RoFormerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:6 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:8 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:8 +msgid "" +"The dropout probability for output of RoFormer. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.RoFormerModel` " +"instance. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:1 +msgid "" +"RoFormer Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:1 +msgid "" +"RoFormer with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:4 +msgid "An instance of RoFormerModel." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:6 +msgid "" +"The dropout probability for output of RoFormer. If None, use the same " +"value as `hidden_dropout_prob` of `RoFormerModel` instance `roformer`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:1 +msgid "" +"The RoFormerForQuestionAnswering forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:8 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:8 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:14 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:17 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.po new file mode 100644 index 0000000000000000000000000000000000000000..6c168e065af2e55f04a9010ca0ce08f4e453d00d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roformer.rst:2 +msgid "roformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..59189b79266395381f6f0bd55803b98d104c4ac5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.tokenizer.po @@ -0,0 +1,322 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roformer.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:1 +msgid "" +"Constructs a RoFormer tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing, jieba pretokenizer and so on, and " +"follows a WordPiece tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:12 +msgid "" +"Whether or not to lowercase the input when tokenizing. 
If you use the " +"RoFormer pretrained model, lower is set to False when using the cased " +"model, otherwise it is set to True. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:17 +msgid "Whether or not to tokenize the text with jieba. Default: False." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:19 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:23 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:26 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:29 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:32 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:38 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:10 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (list of string) in a single string." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:3 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:6 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:4 +msgid "A Roformer sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A RoFormer offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: `(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of wordpiece offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:3 +msgid "A RoFormer sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BasicTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:1 +msgid "" +"Runs basic tokenization with jieba (punctuation splitting, lower casing, " +"jieba pretokenizer etc)." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:3 +msgid "An instance of paddlenlp.data.Vocab." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:5 +msgid "" +"Whether the text strips accents and converts to lower case. If you use " +"the RoFormer Pretrained model, lower is set to False when using the cased" +" model, otherwise it is set to True. Defaults to `True`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..049ad921adaef995df1b92e95bf5fe3bd47e6c20 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.modeling.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roformerv2.modeling.rst:2 +msgid "modeling" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.po new file mode 100644 index 0000000000000000000000000000000000000000..ba4bfda1b4382f0638427757329a0575a786d7d1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roformerv2.rst:2 +msgid "roformerv2" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..8c9eb6b2179a64b7a5d212e538b51b43a58890cd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roformerv2.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..f9542ce184ba5ec4a3a7b8b4d40f16e9ed71d425 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.modeling.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.semantic_indexing.modeling.rst:2 +msgid "modeling" +msgstr "" + +#~ msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +#~ msgstr "" + +#~ msgid "" +#~ "This class encapsulates two ErnieEncoder " +#~ "models into one model, so query " +#~ "embedding and title embedding could be" +#~ " obtained using one model. And this" +#~ " class allows two ErnieEncoder models " +#~ "to be trained at the same time." +#~ msgstr "" + +#~ msgid "示例" +#~ msgstr "" + +#~ msgid "" +#~ "Defines the computation performed at " +#~ "every call. Should be overridden by " +#~ "all subclasses." +#~ msgstr "" + +#~ msgid "参数" +#~ msgstr "" + +#~ msgid "unpacked tuple arguments" +#~ msgstr "" + +#~ msgid "unpacked dict arguments" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.po new file mode 100644 index 0000000000000000000000000000000000000000..fd0bc26a1616d9e6b738b6feb914c28f9a195118 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.semantic_indexing.rst:2 +msgid "semantic\\_indexing" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..10c87b9a27751e36a76af2776cd41f1efdee4755 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.modeling.po @@ -0,0 +1,65 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.semantic_search.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder:1 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder:1 +msgid "" +"This class encapsulates two ErnieEncoder models into one model, so query " +"embedding and title embedding could be obtained using one model. And this" +" class allows two ErnieEncoder models to be trained at the same time." +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder:2 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder:6 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward:1 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward:4 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward:6 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.po new file mode 100644 index 0000000000000000000000000000000000000000..5ec6b92a52a7beb070883369ec95f0a38030cef9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.semantic_search.rst:2 +msgid "semantic\\_search" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..3f1162fb0f64665eb78447f5a2ffda03fd13d549 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.modeling.po @@ -0,0 +1,455 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.skep.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:1 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:1 +#: paddlenlp.transformers.skep.modeling.SkepModel:1 +msgid "基类::class:`paddlenlp.transformers.skep.modeling.SkepPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:1 +msgid "The bare SKEP Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:10 +msgid "" +"More details refer to `SKEP `." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepModel +#: paddlenlp.transformers.skep.modeling.SkepModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:12 +msgid "" +"Vocabulary size of `inputs_ids` in `SKEPModel`. Defines the number of " +"different tokens that can be represented by the `inputs_ids` passed when " +"calling `SKEPModel`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:15 +msgid "" +"Dimensionality of the embedding layer, encoder layers and the pooler " +"layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. 
" +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding. The " +"dimensionality of position encoding is the dimensionality of the sequence" +" in `TinyBertModel`. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:41 +msgid "" +"The vocabulary size of the `token_type_ids` passed when calling " +"`~transformers.SkepModel`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:44 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`SkepPretrainedModel.init_weights()` for how" +" weights are initialized in `SkepModel`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:44 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:48 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`SkepPretrainedModel.init_weights()` for how weights are " +"initialized in `SkepModel`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:51 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:1 +msgid "The SkepModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:13 +msgid "1 corresponds to a *sentence B* token." 
+msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:34 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:36 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:40 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:44 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:43 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:17 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:17 +#: paddlenlp.transformers.skep.modeling.SkepModel.forward:49 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepPretrainedModel:1 +msgid "" +"An abstract class for pretrained Skep models. It provides Skep related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:1 +msgid "" +"SKEP Model with a linear layer on top of the pooled output, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:4 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:4 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:4 +msgid "An instance of SkepModel." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:6 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:8 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:8 +msgid "" +"The dropout probability for output of SKEP. If None, use the same value " +"as `hidden_dropout_prob` of `SkepModel` instance `skep`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:1 +msgid "" +"The SkepForSequenceClassification forward method, overrides the " +"__call__() special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:3 +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:5 +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:7 +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:9 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:3 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:5 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:7 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:9 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:3 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:5 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:7 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:9 +msgid "See :class:`SkepModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForTokenClassification:1 +msgid "" +"SKEP Model with a linear layer on top of the hidden-states output layer, " +"designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:1 +msgid "" +"The SkepForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:1 +msgid "" +"SKEPCRF Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:1 +msgid "" +"The SkepCrfForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:11 +msgid "" +"The input length tensor storing real length of each sequence for " +"correctness. Its data type should be int64 and its shape is " +"`[batch_size]`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:15 +msgid "" +"The input label tensor. Its data type should be int64 and its shape is " +"`[batch_size, sequence_length]`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:19 +msgid "" +"Returns tensor `loss` if `labels` is not None. Otherwise, returns tensor " +"`prediction`. - `loss` (Tensor): The crf loss. Its data type is " +"float32 and its shape is `[batch_size]`. 
- `prediction` (Tensor): " +"The prediction tensor containing the highest scoring tag indices. Its" +" data type is int64 and its shape is `[batch_size, sequence_length]`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:19 +msgid "" +"Returns tensor `loss` if `labels` is not None. Otherwise, returns tensor " +"`prediction`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:22 +msgid "`loss` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:22 +msgid "The crf loss. Its data type is float32 and its shape is `[batch_size]`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:25 +msgid "`prediction` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:25 +msgid "" +"The prediction tensor containing the highest scoring tag indices. Its " +"data type is int64 and its shape is `[batch_size, sequence_length]`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.po new file mode 100644 index 0000000000000000000000000000000000000000..d7925d9d9d603cba2f3dbe65136dc2cf3e78a383 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.skep.rst:2 +msgid "skep" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..e415d3306b2b925dfb14927ce1a01546e0e4c1f8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.tokenizer.po @@ -0,0 +1,239 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.skep.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:1 +msgid "" +"Constructs a Skep tokenizer. It uses a basic tokenizer to do punctuation " +"splitting, lower casing and so on, and follows a WordPiece tokenizer to " +"tokenize as subwords." 
+msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:12 +msgid "The vocabulary file path of a `BpeTokenizer`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:14 +msgid "The json file path of a `BpeTokenizer`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:16 +msgid "Whether or not to use BPE Encoder. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:18 +msgid "Whether or not to use token type id. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:20 +msgid "Whether or not to add two different `sep_token`. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:22 +msgid "The special token for unknown words. Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:25 +msgid "The special token for separator token. Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:28 +msgid "The special token for padding. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:31 +msgid "The special token for cls. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:34 +msgid "The special token for mask. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:39 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size:3 +msgid "the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add:3 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add:8 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:4 +msgid "" +"A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the " +"following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:6 +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:11 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:9 +msgid "A skep_roberta_large_en sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:12 +msgid "pair of sequences: ``[CLS] A [SEP] [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:14 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:16 +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:15 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:20 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has " +"the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:11 +msgid "note: There is no need token type ids for skep_roberta_large_ch model." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:19 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.save_resources:3 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..f002c3f3141d814d653d9c3b782f777018f7932b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.modeling.po @@ -0,0 +1,417 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.squeezebert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering:1 +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification:1 +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification:1 +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:1 +msgid "基类::class:`paddlenlp.transformers.squeezebert.modeling.SqueezeBertPreTrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:1 +msgid "" +"Vocabulary size of `inputs_ids` in `SqueezeBertModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`BertModel`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:4 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." 
+msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:6 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:8 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:11 +msgid "Output chans for intermediate layer." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:13 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:17 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:20 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:23 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:26 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:29 +msgid "" +"number of query groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:32 +msgid "" +"number of key groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:35 +msgid "" +"number of value groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:38 +msgid "" +"number of output groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:41 +msgid "" +"number of intermediate groups for all layers in the BertModule. " +"(eventually we could change the interface to allow different groups for " +"different layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:44 +msgid "" +"number of post groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." 
+msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note::" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:1 +msgid "" +"The forward method, overrides the `__call__()` special method. :param " +"input_ids: Indices of input sequence tokens in the vocabulary. They are" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:3 +msgid "" +"numerical representations of tokens that build the input sequence. Its " +"data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:6 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. - **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:15 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape of" +" [batch_size, sequence_length]. Defaults to `None`, which means we don't " +"add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:24 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:28 +msgid "" +"Whether to return the attention_weight of each hidden layers. Defaults to" +" `False`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) with " +"(`encoder_outputs`, `encoder_attentions`) by optional. With the fields: -" +" `sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. 
We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]. - `encoder_outputs` (List(Tensor)): A list of Tensor " +"containing hidden-states of the model at each hidden layer in the " +"Transformer encoder. The length of the list is `num_hidden_layers` + " +"1 (Embedding Layer output). Each Tensor has a data type of float32 " +"and its shape is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) with " +"(`encoder_outputs`, `encoder_attentions`) by optional. With the fields: -" +" `sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:43 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:42 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:47 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:46 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers` + 1 (Embedding Layer output). Each Tensor has a data " +"type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification:1 +msgid "" +"SqueezeBert Model with a sequence classification/regression head on top " +"(a linear layer on top of the pooled output) e.g. for GLUE tasks. :param " +"squeezebert: An instance of SqueezeBert. :type squeezebert: " +":class:`SqueezeBertModel` :param num_classes: The number of classes. " +"Defaults to `2`. :type num_classes: int, optional :param dropout: The " +"dropout probability for output of SqueezeBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification:8 +msgid "" +"If None, use the same value as `hidden_dropout_prob` of " +"`SqueezeBertModel` instance `squeezebert`. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward:1 +msgid "" +"The SqueezeBertForSequenceClassification forward method, overrides the " +"__call__() special method. :param input_ids: See " +":class:`SqueezeBertModel`. 
:type input_ids: Tensor :param token_type_ids:" +" See :class:`SqueezeBertModel`. :type token_type_ids: Tensor, optional " +":param position_ids: See :class:`SqueezeBertModel`. :type position_ids: " +"Tensor, optional :param attention_mask: See :class:`SqueezeBertModel`. " +":type attention_mask: list, optional" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward:11 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification:1 +msgid "" +"SqueezeBert Model with a token classification head on top (a linear layer" +" on top of the hidden-states output) e.g. for Named-Entity-Recognition " +"(NER) tasks. :param squeezebert: An instance of SqueezeBertModel. :type " +"squeezebert: :class:`SqueezeBertModel` :param num_classes: The number of " +"classes. Defaults to `2`. :type num_classes: int, optional :param " +"dropout: The dropout probability for output of squeezebert." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification:8 +msgid "" +"If None, use the same value as `hidden_dropout_prob` of `SqueezeBert` " +"instance `squeezebert`. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward:1 +msgid "" +"The SqueezeBertForTokenClassification forward method, overrides the " +"__call__() special method. :param input_ids: See " +":class:`SqueezeBertModel`. :type input_ids: Tensor :param token_type_ids:" +" See :class:`SqueezeBertModel`. :type token_type_ids: Tensor, optional " +":param position_ids: See :class:`SqueezeBertModel`. :type position_ids: " +"Tensor, optional :param attention_mask: See :class:`SqueezeBertModel`. " +":type attention_mask: list, optional" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward:11 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering:1 +msgid "" +"SqueezeBert Model with a span classification head on top for extractive " +"question-answering tasks like SQuAD (a linear layers on top of the " +"hidden-states output to compute `span start logits` and `span end " +"logits`). :param squeezebert: An instance of SqueezeBertModel. :type " +"squeezebert: :class:`SqueezeBertModel` :param dropout: The dropout " +"probability for output of SqueezeBert." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering:7 +msgid "" +"If None, use the same value as `hidden_dropout_prob` of " +"`SqueezeBertModel` instance `squeezebert`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:1 +msgid "" +"The SqueezeBertForQuestionAnswering forward method, overrides the " +"__call__() special method. :param input_ids: See " +":class:`SqueezeBertModel`. :type input_ids: Tensor :param token_type_ids:" +" See :class:`SqueezeBertModel`. 
:type token_type_ids: Tensor, optional" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:7 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. - " +"`end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:7 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:10 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:13 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.po new file mode 100644 index 0000000000000000000000000000000000000000..73515a3111c35060a34ac2c5fde3a175c4e467de --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.squeezebert.rst:2 +msgid "squeezebert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..42eb8d320e1a4200d5f60d840d2901c09ae77232 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.tokenizer.po @@ -0,0 +1,237 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.squeezebert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:1 +msgid "" +"Constructs a SqueezeBert tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:5 +msgid "file path of the vocabulary" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:7 +msgid "" +"Whether the text strips accents and convert to lower case. Default: " +"`True`. Default: True." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:11 +msgid "The special token for unkown words. Default: \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:13 +msgid "The special token for separator token . Default: \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:15 +msgid "The special token for padding. Default: \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:17 +msgid "The special token for cls. Default: \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:19 +msgid "The special token for mask. Default: \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:23 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.vocab_size:1 +msgid "" +"return the size of vocabulary. :returns: the size of vocabulary. :rtype: " +"int" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting. :param tokens: A list of string representing tokens" +" to be converted. 
:type tokens: list" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string:7 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens. .. note::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add:7 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add:10 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. A " +"SqueezeBert sequence has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:9 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:12 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:13 +msgid ":obj:`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens. 
A SqueezeBert offset_mapping has the following" +" format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:11 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:12 +msgid ":obj:`List[tuple]`" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. A SqueezeBert sequence pair mask has the following " +"format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s). :param token_ids_0: List of IDs. :type " +"token_ids_0: :obj:`List[int]` :param token_ids_1: Optional second list of" +" IDs for sequence pairs. :type token_ids_1: :obj:`List[int]`, `optional`" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:12 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods. :param token_ids_0: List of ids of the " +"first sequence. :type token_ids_0: List[int] :param token_ids_1: List of " +"ids of the second sequence. :type token_ids_1: List[int], optinal :param " +"already_has_special_tokens: Whether or not the token list is already" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask:8 +msgid "formatted with special tokens for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask:11 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..3cc4a09950ed14f98625bbb5eac8700bfb432666 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.modeling.po @@ -0,0 +1,475 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.t5.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration:1 +#: paddlenlp.transformers.t5.modeling.T5Model:1 +msgid "基类::class:`paddlenlp.transformers.t5.modeling.T5PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:1 +msgid "" +"The bare T5 Model transformer outputting raw hidden-states without any " +"specific head on top." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward +#: paddlenlp.transformers.t5.modeling.T5Model +#: paddlenlp.transformers.t5.modeling.T5Model.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:10 +msgid "Whether to tie input and output embeddings. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:12 +msgid "The id of the `padding` token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:14 +msgid "The id of the `bos` token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:16 +msgid "The id of the `eos` token. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:18 +msgid "" +"A factor for initializing all weight matrices (should be kept to 1, used " +"internally for initialization testing). Defaults to `1.0`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:21 +msgid "" +"Vocabulary size of `inputs_ids` in `T5Model`. Also is the vocab size of " +"token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `T5Model`. " +"Defaults to `32128`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:24 +msgid "Dimensionality of the embedding layer, encoder layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:26 +msgid "" +"Size of the key, query, value projections per attention head. Defaults to" +" `64`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:28 +msgid "" +"Dimensionality of the feed_forward layer in the residual attention block." +" Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:30 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:32 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:34 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder and decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:37 +msgid "The number of buckets to use for each attention layer. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:39 +msgid "The dropout ratio for all layers. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:41 +msgid "The epsilon used by the layer normalization layers. Defaults to `1e-6`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:43 +#: paddlenlp.transformers.t5.modeling.T5Model:45 +msgid "" +"The non-linear activation function (function or string) in the feed " +"forward layer in the residual attention block. If string, `\"relu\"`, " +"`\"gated-gelu\"` are supported. Defaults to `\"relu\"`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:1 +msgid "The T5Model forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float. When the data type is int, " +"the `masked` tokens have `0` values and the others have `1` values. When " +"the data type is float, the `masked` tokens have `0.0` values and the " +"others have `1.0` values. It is a tensor with shape broadcasted to " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:17 +msgid "" +"Indices of decoder input sequence tokens in the vocabulary. Its data type" +" should be `int64` and it has a shape of [batch_size, sequence_length]. " +"Defaults to `None`, which means no `decoder_input_ids` is provided, the " +"model will create the tensor by shifting the `input_ids` to the right." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions in `decoder_input_ids`. Its data type and shape is the" +" same as `attention_mask`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:25 +msgid "" +"The output of the encoder, a tuple consists `last_hidden_state`, " +"`hidden_states`(optional), `attentions`(optional). The data type of " +"`last_hidden_state` is float32 and its shape is [batch_size, " +"sequence_length, hidden_size]. `hidden_states` is hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. `attentions` is attentions of all layers of in the " +"Transformer encoder. The length of `attentions` is `num_hidden_layers`. 
" +"For all element in the tuple, its data type should be float32 and its " +"shape is [batch_size, num_attention_heads, sequence_length, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:32 +msgid "" +"Contains pre-computed hidden-states (key and values in the attention " +"blocks) as computed by the model. Can be used to speed up sequential " +"decoding. The `input_ids` which have their past given to this model " +"should not be passed as input ids as they have already been computed. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:38 +msgid "" +"Whether or not to use cache. If set to `True`, `past_buckets_states` " +"states are returned and can be used to speed up decoding. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:42 +msgid "" +"Whether or not to return the attentions tensors of all attention layers. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:45 +msgid "" +"Whether or not to return the output of all hidden layers. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward +#: paddlenlp.transformers.t5.modeling.T5Model.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:49 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`) With the fields: - " +"`last_hidden_state` (Tensor): Sequence of hidden-states at the last " +"layer of the decoder of the model. It's data type should be float32 " +"and its shape is [batch_size, sequence_length, hidden_size]. - " +"`cache` (List[tuple(Tensor, Tensor)], optional): returned when " +"`use_cache=True` is passed. List of `tuple(Tensor, Tensor)` of length" +" `config[\"num_layers\"]`, with the first element being the previous " +"`buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` " +"of shape `[batch_size, sequence_length, hidden_size]`. - " +"`decoder_hidden_states` (tuple(Tensor), optional) returned when " +"``output_hidden_states=True`` is passed. Tuple of `Tensor` (one for " +"the output of the embeddings + one for the output of decoder each layer) " +"of shape `(batch_size, sequence_length, hidden_size)`. - " +"`decoder_attentions` (tuple(Tensor), optional): returned when " +"`output_attentions=True` is passed. tuple of `Tensor` (one for each " +"layer) of shape. Each Tensor has a data type of float32 and its shape" +" is [batch_size, num_heads, sequence_length, sequence_length]. - " +"`cross_attentions` (tuple(Tensor), optional): returned when " +"`output_attentions=True` is passed. tuple of `Tensor` (one for each " +"layer) of shape. Each Tensor has a data type of float32 and its shape" +" is [batch_size, num_heads, sequence_length, sequence_length]. - " +"`encoder_last_hidden_state` (Tensor): Sequence of hidden-states at " +"the last layer of the encoder of the model. It's data type should be " +"float32 and its shape is [batch_size, sequence_length, hidden_size]." +" - `encoder_hidden_states` (tuple(Tensor), optional): returned when " +"`output_hidden_states=True` is passed. tuple of `Tensor` (one for the" +" output of the embeddings + one for the output of encoder each " +"layer). 
Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]. - `encoder_attentions` " +"(tuple(Tensor), optional): returned when `output_attentions=True` is " +"passed. tuple of `Tensor` (one for each layer) of shape. Each Tensor " +"has a data type of float32 and its shape is [batch_size, num_heads, " +"sequence_length, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:49 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`)" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:29 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:52 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:57 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:55 +msgid "" +"Sequence of hidden-states at the last layer of the decoder of the model. " +"It's data type should be float32 and its shape is [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:42 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:64 +msgid "`cache` (List[tuple(Tensor, Tensor)], optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:60 +msgid "" +"returned when `use_cache=True` is passed. List of `tuple(Tensor, Tensor)`" +" of length `config[\"num_layers\"]`, with the first element being the " +"previous `buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` of " +"shape `[batch_size, sequence_length, hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:45 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:68 +msgid "`decoder_hidden_states` (tuple(Tensor), optional)" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:67 +msgid "" +"returned when ``output_hidden_states=True`` is passed. Tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of decoder " +"each layer) of shape `(batch_size, sequence_length, hidden_size)`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:48 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:73 +msgid "`decoder_attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:71 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:76 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:92 +msgid "" +"returned when `output_attentions=True` is passed. tuple of `Tensor` (one " +"for each layer) of shape. Each Tensor has a data type of float32 and its " +"shape is [batch_size, num_heads, sequence_length, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:51 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:78 +msgid "`cross_attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:54 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:83 +msgid "`encoder_last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:81 +msgid "" +"Sequence of hidden-states at the last layer of the encoder of the model. " +"It's data type should be float32 and its shape is [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:57 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:89 +msgid "`encoder_hidden_states` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:86 +msgid "" +"returned when `output_hidden_states=True` is passed. tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of encoder " +"each layer). Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:59 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:93 +msgid "`encoder_attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward +#: paddlenlp.transformers.t5.modeling.T5Model.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:64 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:98 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5PretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5PretrainedModel:1 +msgid "" +"An abstract class for pretrained T5 models. It provides T5 related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5PretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration:1 +msgid "The T5 Model transformer with a language modeling head on top." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration:3 +msgid "An instance of :class:`T5Model`." 
+msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:1 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:3 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:5 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:7 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:9 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:11 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:19 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:21 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:23 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:42 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:45 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:48 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:51 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:54 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:57 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:60 +msgid "See :class:`T5Model`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:13 +msgid "" +"Labels for language modeling. Note that the labels **are shifted** inside" +" the model, i.e. you can set `labels = input_ids` Indices are selected in" +" `[-100, 0, ..., vocab_size]` All labels set to `-100` are ignored " +"(masked), the loss is only computed for labels in `[0, ..., vocab_size]`." +" Shape is [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:26 +msgid "" +"Returns tuple (`loss`, `logits`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`) With the fields: - " +"`loss` (Tensor): returned when `labels` is provided. Language " +"modeling loss. It's data type should be float32 and its shape is [1,]. -" +" `logits` (Tensor): Prediction scores of the language modeling head" +" (scores for each vocabulary token before SoftMax). It's data " +"type should be float32 and its shape is [batch_size, sequence_length," +" vocab_size]. - `cache` (List[tuple(Tensor, Tensor)], optional): See" +" :class:`T5Model`. - `decoder_hidden_states` (tuple(Tensor), optional)" +" See :class:`T5Model`. - `decoder_attentions` (tuple(Tensor), " +"optional): See :class:`T5Model`. - `cross_attentions` " +"(tuple(Tensor), optional): See :class:`T5Model`. - " +"`encoder_last_hidden_state` (Tensor): See :class:`T5Model`. - " +"`encoder_hidden_states` (tuple(Tensor), optional): See " +":class:`T5Model`. - `encoder_attentions` (tuple(Tensor), optional): " +"See :class:`T5Model`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:26 +msgid "" +"Returns tuple (`loss`, `logits`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`)" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:33 +msgid "`loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:32 +msgid "" +"returned when `labels` is provided. Language modeling loss. 
It's data " +"type should be float32 and its shape is [1,]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:39 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:36 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.po new file mode 100644 index 0000000000000000000000000000000000000000..30b62d3a0f55690d31dfab9a00f67b890b39cbde --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.t5.rst:2 +msgid "t5" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..2023df5590557450059c3921a868d6a7659e0229 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.tokenizer.po @@ -0,0 +1,266 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.t5.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:1 +msgid "基类::class:`paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:1 +msgid "" +"Constructs a T5 tokenizer based on SentencePiece . This tokenizer " +"inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:6 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:9 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:11 +msgid "Whether or note to remove space when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:13 +msgid "Whether or note to keep accents when tokenizing. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:15 +msgid "" +"A special token representing the *eos (end-of-sentence)* token. Defaults " +"to \"\"." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:18 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"\"." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"\"." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:1 +msgid "Build model inputs from a sequence or a pair of sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:3 +msgid "An Reformer sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:5 +msgid "single sequence: ``X ``" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:6 +msgid "pair of sequences: ``A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:8 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:10 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:13 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:1 +msgid "Create a mask from the two sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:5 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:7 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:10 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (string) in a single string." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:1 +msgid "" +"Converts a sequence of ids in a string, using the tokenizer and " +"vocabulary with options to remove special tokens and clean up " +"tokenization spaces." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:4 +msgid "" +"Similar to doing " +"``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:3 +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:6 +msgid "List of tokenized input ids." 
+msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:5 +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:8 +msgid "" +"Whether or not to remove special tokens in the decoding. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:7 +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:10 +msgid "Whether or not to clean up the tokenization spaces. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:13 +msgid "The decoded sentence." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:1 +msgid "" +"Convert a list of lists of token ids into a list of strings by calling " +"decode." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:10 +msgid "The list of decoded sentences." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization:1 +msgid "" +"Clean up a list of simple English tokenization artifacts like spaces " +"before punctuations and abbreviated forms." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization:3 +msgid "The text to clean up." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization:6 +msgid "The cleaned-up string." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..efaa3ae560edb0998aa18a9ab40db77f4c996562 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.modeling.po @@ -0,0 +1,369 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.tinybert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining:1 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:1 +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel:1 +msgid "基类::class:`paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:1 +msgid "The bare TinyBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `TinyBertModel`. Defines the number of" +" different tokens that can be represented by the `inputs_ids` passed when" +" calling `TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding. The " +"dimensionality of position encoding is the dimensionality of the sequence" +" in `TinyBertModel`. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:39 +msgid "" +"The vocabulary size of `token_type_ids` passed when calling `~ " +"transformers.TinyBertModel`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:42 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`TinyBertPretrainedModel.init_weights()` for" +" how weights are initialized in `TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:42 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:46 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." 
+" See :meth:`TinyBertPretrainedModel.init_weights()` for how weights are " +"initialized in `TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:49 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:52 +msgid "" +"Dimensionality of the output layer of `fit_dense(s)`, which is the hidden" +" size of the teacher model. `fit_dense(s)` means a hidden states' " +"transformation from student to teacher. `fit_dense(s)` will be generated " +"when bert model is distilled during the training, and will not be " +"generated during the prediction process. `fit_denses` is used in v2 " +"models and it has `num_hidden_layers+1` layers. `fit_dense` is used in " +"other pretraining models and it has one linear layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:1 +msgid "" +"The TinyBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:18 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. 
For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:30 +msgid "" +"Returns tuple (`encoder_output`, `pooled_output`). With the fields: - " +"`encoder_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:30 +msgid "Returns tuple (`encoder_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:32 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:36 +msgid "`encoder_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:35 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:40 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:39 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:15 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:15 +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:45 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained TinyBERT models. It provides TinyBERT " +"related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining:1 +msgid "TinyBert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining:3 +msgid "An instance of :class:`TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:1 +msgid "" +"The TinyBertForPretraining forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:3 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:5 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:7 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:7 +msgid "See :class:`TinyBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:10 +msgid "" +"Returns tensor `sequence_output`, sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:1 +msgid "" +"TinyBert Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:4 +msgid "An instance of TinyBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:8 +msgid "" +"The dropout probability for output of TinyBert. If None, use the same " +"value as `hidden_dropout_prob` of `TinyBertModel` instance `tinybert`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:1 +msgid "" +"The TinyBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.po new file mode 100644 index 0000000000000000000000000000000000000000..480d2d850b17be12ec92549619f254aabd88c88b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.tinybert.rst:2 +msgid "tinybert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..4de3dbadfb45193b9efa9089ebd2cc3842b4bb53 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.tokenizer.po @@ -0,0 +1,36 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.tinybert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.tinybert.tokenizer.TinyBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.tinybert.tokenizer.TinyBertTokenizer:1 +msgid "" +"Constructs a TinyBert tokenizer. The usage of TinyBertTokenizer is the " +"same as `BertTokenizer " +"`__." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..ab2e9240532e3fd1468830e9f940455ecf028a51 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils.po @@ -0,0 +1,1264 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.tokenizer_utils.rst:2 +msgid "tokenizer\\_utils" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:1 +msgid "" +"The base class for all pretrained tokenizers. It mainly provides common " +"methods for loading (construction and loading) and saving pretrained " +"tokenizers. 
Loading and saving also rely on the following class " +"attributes which should be overridden by derived classes accordingly:" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:6 +msgid "" +"**tokenizer_config_file** (str): Represents the file name of tokenizer " +"configuration for configuration saving and loading in local file system. " +"The value is `tokenizer_config.json`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:9 +msgid "" +"**resource_files_names** (dict): Represents resources to specific file " +"names mapping for resource saving and loading in local file system. The " +"keys of dict representing resource items should be argument names in " +"tokenizer's `__init__` method, and the values are file names for saving " +"and loading corresponding resources. The mostly used resources here are " +"vocabulary file and sentence-piece model file." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:15 +msgid "" +"**pretrained_init_configuration** (dict): Provides the tokenizer " +"configurations of built-in pretrained tokenizers (contrasts to tokenizers" +" in local file system). It has pretrained tokenizer names as keys (the " +"same as pretrained model names, such as `bert-base-uncased`), and the " +"values are dict preserving corresponding configuration for tokenizer " +"initialization." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:20 +msgid "" +"**pretrained_resource_files_map** (dict): Provides resource URLs of " +"built-in pretrained tokenizers (contrasts to tokenizers in local file " +"system). It has the same keys as `resource_files_names`, and the values " +"are also `dict` mapping specific pretrained tokenizer names (such as " +"`bert-base-uncased`) to corresponding resource URLs." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:26 +msgid "" +"Moreover, methods common to tokenizers for tokenization, token/id " +"conversion and encoding as model inputs are also provided here." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:29 +msgid "" +"Besides, metaclass `InitTrackerMeta` is used to create " +"`PretrainedTokenizer`, by which subclasses can track arguments for " +"initialization automatically and expose special tokens initialization " +"used as attributes." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports sequence or sequence pair as input, and batch input " +"is allowed. `self.encode()` or `self.batch_encode()` would be called " +"separately for single or batch input depending on input format and " +"`is_split_into_words` argument." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__ +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_resources +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:7 +msgid "" +"The sequence or batch of sequences to be processed. One sequence is a " +"string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a batch of sequences." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:13 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:9 +msgid "" +"Same as `text` argument, while it represents for the latter sequence of " +"the sequence pair." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:16 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:10 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:12 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:21 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:15 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:31 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:25 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:17 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:35 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:29 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:21 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:35 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:29 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:21 +msgid "String selected in the following options:" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:37 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:31 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:23 +msgid "'longest_first' (default) Iteratively reduce the inputs sequence" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:38 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:32 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:24 +msgid "" +"until the input is under `max_seq_len` starting from the longest one at " +"each token (when there is a pair of input sequences). - 'only_first': " +"Only truncate the first sequence. - 'only_second': Only truncate the " +"second sequence. - 'do_not_truncate': Do not truncate (raise an error if " +"the input sequence is longer than `max_seq_len`)." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:45 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:39 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:31 +msgid "Defaults to 'longest_first'." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:47 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:41 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:33 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:50 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:44 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:36 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:53 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:47 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:39 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:56 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:50 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:42 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:59 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:53 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:45 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:62 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:56 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:48 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:65 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:59 +msgid "" +"Decide the format for returned encoded batch inputs. Only works when " +"input is a batch of data. :: - If True, encoded inputs would be a " +"dictionary like: {'input_ids': [[1, 4444, 4385, 1545, 6712],[1, " +"4444, 4385]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0]]}" +" - If False, encoded inputs would be a list like: " +"[{'input_ids': [1, 4444, 4385, 1545, 6712], 'token_type_ids': " +"[0, 0, 0, 0, 0]}, {'input_ids': [1, 4444, 4385], " +"'token_type_ids': [0, 0, 0]}] Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:65 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:59 +msgid "" +"Decide the format for returned encoded batch inputs. Only works when " +"input is a batch of data. ::" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:76 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:70 +msgid "Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:78 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:72 +msgid "" +"Whether to include the list of pair preserving the index of start and end" +" char in original input for each token in the returned dictionary. Would " +"be automatically set to `True` when `stride` > 0. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:83 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:77 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:55 +msgid "" +"Whether to add the special tokens associated with the corresponding model" +" to the encoded inputs. 
Defaults to `True`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__ +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.tokenizer_utils.convert_to_unicode +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:87 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int] or" +" list[list[int]]): List of token ids to be fed to a model. - " +"**position_ids** (list[int] or list[list[int]], optional): List of token " +"position ids to be fed to a model. Included when `return_position_ids` " +"is `True` - **token_type_ids** (list[int] or list[list[int]], optional): " +"List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int] or " +"list[list[int]], optional): List of integers valued 0 or 1, where 0 " +"specifies paddings and should not be attended to by the model. Included" +" when `return_attention_mask` is `True`. - **seq_len** (int or list[int]," +" optional): The input_ids length. Included when `return_length` is " +"`True`. - **overflowing_tokens** (list[int] or list[list[int]], " +"optional): List of overflowing tokens. Included when if `max_seq_len` " +"is specified and `return_overflowing_tokens` is True. - " +"**num_truncated_tokens** (int or list[int], optional): The number of " +"overflowing tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **special_tokens_mask** " +"(list[int] or list[list[int]], optional): List of integers valued 0 or 1," +" with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True or `stride` > 0. - " +"**overflow_to_sample** (int or list[int], optional): Index of example " +"from which this feature is generated. Included when `stride` works." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:87 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:81 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:59 +msgid "The dict has the following optional items:" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:89 +msgid "" +"**input_ids** (list[int] or list[list[int]]): List of token ids to be fed" +" to a model." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:90 +msgid "" +"**position_ids** (list[int] or list[list[int]], optional): List of token " +"position ids to be fed to a model. Included when `return_position_ids` is" +" `True`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:92 +msgid "" +"**token_type_ids** (list[int] or list[list[int]], optional): List of " +"token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:94 +msgid "" +"**attention_mask** (list[int] or list[list[int]], optional): List of " +"integers valued 0 or 1, where 0 specifies paddings and should not be " +"attended to by the model. Included when `return_attention_mask` is " +"`True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:97 +msgid "" +"**seq_len** (int or list[int], optional): The input_ids length. Included " +"when `return_length` is `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:99 +msgid "" +"**overflowing_tokens** (list[int] or list[list[int]], optional): List of " +"overflowing tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:102 +msgid "" +"**num_truncated_tokens** (int or list[int], optional): The number of " +"overflowing tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:105 +msgid "" +"**special_tokens_mask** (list[int] or list[list[int]], optional): List of" +" integers valued 0 or 1, with 0 specifying special added tokens and 1 " +"specifying sequence tokens. Included when `return_special_tokens_mask` is" +" `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:108 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:102 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True or `stride` > 0." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:112 +msgid "" +"**overflow_to_sample** (int or list[int], optional): Index of example " +"from which this feature is generated. Included when `stride` works." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__ +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.tokenizer_utils.convert_to_unicode +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens:1 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens_extended:1 +msgid "" +"All the special tokens ('', ''...) corresponding to special " +"token arguments in `__init__` (arguments end with '_end')." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_ids +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens_extended +msgid "type" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_ids:3 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens:4 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens_extended:4 +msgid "list" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_ids:1 +msgid "All the token ids corresponding to all the special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:1 +msgid "" +"Creates an instance of `PretrainedTokenizer`. Related resources are " +"loaded by specifying name of a built-in pretrained model, or a community-" +"contributed pretrained model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains tokenizer related" +" resources and tokenizer config file (\"tokenizer_config.json\")." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:8 +msgid "Name of built-in pretrained model" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:10 +msgid "" +"Local directory path which contains tokenizer related resources and " +"tokenizer config file (\"tokenizer_config.json\")." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:13 +msgid "" +"position arguments for model `__init__`. If provided, use these as " +"position argument values for tokenizer initialization." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:16 +msgid "" +"keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for tokenizer initialization." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:21 +msgid "An instance of `PretrainedTokenizer`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:25 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:14 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:1 +msgid "" +"Save tokenizer configuration and related resources to files under " +"`save_directory`. The tokenizer configuration would be saved into " +"`tokenizer_config_file` indicating file (thus `tokenizer_config.json`), " +"and resources would be saved into `resource_files_names` indicating files" +" by using `self.save_resources(save_directory)`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:7 +msgid "" +"The `save_directory` can be used in `from_pretrained` as argument value " +"of `pretrained_model_name_or_path` to re-load the tokenizer." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:10 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:1 +msgid "" +"Instantiate an instance of `Vocab` from a file reserving all tokens by " +"using `Vocab.from_dict`. The file contains a token per line, and the line" +" number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:5 +msgid "path of file to construct vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be `None`. " +"Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:19 +msgid "keyword arguments for `Vocab.from_dict`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:22 +msgid "An instance of `Vocab`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary:1 +msgid "" +"Save all tokens to a vocabulary file. The file contains a token per line," +" and the line number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary:4 +msgid "File path to be saved to." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary:6 +msgid "The `Vocab` or `dict` instance to be saved." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:1 +msgid "Truncates a sequence pair in place to the maximum length." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:3 +msgid "" +"list of tokenized input ids. Can be obtained from a string by chaining " +"the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:5 +msgid "" +"Optional second list of input ids. Can be obtained from a string by " +"chaining the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:7 +msgid "number of tokens to remove using the truncation strategy" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len starting from the longest one at each token (when there " +"is a pair of input sequences). Overflowing tokens only contains " +"overflow from the first sequence. - 'only_first': Only truncate the first" +" sequence. raise an error if the first sequence is shorter or equal to " +"than num_tokens_to_remove. - 'only_second': Only truncate the second " +"sequence - 'do_not_truncate': Does not truncate (raise an error if the " +"input sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:11 +msgid "" +"starting from the longest one at each token (when there is a pair of " +"input sequences). 
Overflowing tokens only contains overflow from the " +"first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:13 +msgid "" +"'only_first': Only truncate the first sequence. raise an error if the " +"first sequence is shorter or equal to than num_tokens_to_remove." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:14 +msgid "'only_second': Only truncate the second sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:15 +msgid "" +"'do_not_truncate': Does not truncate (raise an error if the input " +"sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:16 +msgid "" +"If set to a number along with max_seq_len, the overflowing tokens " +"returned will contain some tokens from the main sequence returned. The " +"value of this argument defines the number of additional tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:4 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:3 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:6 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:8 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:11 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the number of added tokens should be computed in the case of a " +"sequence pair or a single sequence. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add:7 +msgid "Number of special tokens added to sequences." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports sequence or sequence pair as input, and batch input " +"is not allowed." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:5 +msgid "" +"The sequence to be processed. One sequence is a string, a list of " +"strings, or a list of integers depending on whether it has been " +"pretokenized and converted to ids." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:51 +msgid "" +"Whether to include the list of pair preserving the index of start and end" +" char in original input for each token in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:59 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. 
- **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:83 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:61 +msgid "**input_ids** (list[int]): List of token ids to be fed to a model." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:84 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:62 +msgid "" +"**position_ids** (list[int], optional): List of token position ids to be " +"fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:86 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:64 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:88 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:66 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:91 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:69 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:93 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:71 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:96 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:74 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:99 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:77 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:80 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:4 +msgid "" +"The element of list can be sequence or sequence pair, and the sequence is" +" a string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:81 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True or `stride` > 0. - " +"**overflow_to_sample** (int, optional): Index of example from which this" +" feature is generated. Included when `stride` works." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:106 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping:1 +msgid "" +"Returns the map of tokens and the start and end index of their start and " +"end character. Modified from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping:4 +msgid "Input text." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping:7 +msgid "The offset map of input text." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:1 +msgid "" +"The base class for all bpe tokenizers. It mainly provides common tokenize" +" methods for bpe type tokenizer." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:4 +msgid "file path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:6 +msgid "file path of the id to vocab." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:8 +msgid "file path of word merge text." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:10 +msgid "The special token for unknown words. Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:13 +msgid "The special token for separator token. Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:16 +msgid "The special token for padding. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:19 +msgid "The special token for cls. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:22 +msgid "The special token for mask. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.tokenize_chinese_chars:1 +msgid "Adds whitespace around any CJK character." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.is_chinese_char:1 +msgid "Checks whether CP is the codepoint of a CJK character." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.normalize_chars:1 +msgid "" +"Normalize the text for multiligual and chinese models. Unicode range: " +"https://www.ling.upenn.edu/courses/Spring_2003/ling538/UnicodeRanges.html" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.tokenize_special_chars:1 +msgid "Adds whitespace around any special character." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.convert_to_unicode:1 +msgid "" +"Converts `text` to Unicode (if it's not already), assuming utf-8 input. " +":param text: Text to be converted to unicode. :type text: str|bytes" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.convert_to_unicode:5 +msgid "converted text." +msgstr "" + +#~ msgid "" +#~ "The dict has the following optional " +#~ "items: - **input_ids** (list[int]): List " +#~ "of token ids to be fed to a" +#~ " model. - **position_ids** (list[int], " +#~ "optional): List of token position ids" +#~ " to be fed to a model. 
" +#~ "Included when `return_position_ids` is `True`" +#~ " - **token_type_ids** (list[int], optional): " +#~ "List of token type ids to be " +#~ "fed to a model. Included when " +#~ "`return_token_type_ids` is `True`. - " +#~ "**attention_mask** (list[int], optional): List " +#~ "of integers valued 0 or 1, where" +#~ " 0 specifies paddings and should not" +#~ " be attended to by the model. " +#~ "Included when `return_attention_mask` is " +#~ "`True`. - **seq_len** (int, optional): " +#~ "The input_ids length. Included when " +#~ "`return_length` is `True`. - " +#~ "**overflowing_tokens** (list[int], optional): List" +#~ " of overflowing tokens. Included when " +#~ "if `max_seq_len` is specified and " +#~ "`return_overflowing_tokens` is True. - " +#~ "**num_truncated_tokens** (int, optional): The " +#~ "number of overflowing tokens. Included " +#~ "when if `max_seq_len` is specified and" +#~ " `return_overflowing_tokens` is True. - " +#~ "**special_tokens_mask** (list[int], optional): List" +#~ " of integers valued 0 or 1, " +#~ "with 0 specifying special added tokens" +#~ " and 1 specifying sequence tokens. " +#~ "Included when `return_special_tokens_mask` is " +#~ "`True`. - **offset_mapping** (list[int], " +#~ "optional): list of pair preserving the" +#~ " index of start and end char " +#~ "in original input for each token. " +#~ "For a special token, the index " +#~ "pair is `(0, 0)`. Included when " +#~ "`stride` works. - **overflow_to_sample** (int," +#~ " optional): Index of example from " +#~ "which this feature is generated. " +#~ "Included when `stride` works." +#~ msgstr "" + +#~ msgid "" +#~ "**offset_mapping** (list[int], optional): list " +#~ "of pair preserving the index of " +#~ "start and end char in original " +#~ "input for each token. For a " +#~ "special token, the index pair is " +#~ "`(0, 0)`. Included when `stride` works." +#~ msgstr "" + +#~ msgid "" +#~ "The dict has the following optional " +#~ "items: - **input_ids** (list[int]): List " +#~ "of token ids to be fed to a" +#~ " model. - **position_ids** (list[int], " +#~ "optional): List of token position ids" +#~ " to be fed to a model. " +#~ "Included when `return_position_ids` is `True`" +#~ " - **token_type_ids** (list[int], optional): " +#~ "List of token type ids to be " +#~ "fed to a model. Included when " +#~ "`return_token_type_ids` is `True`. - " +#~ "**attention_mask** (list[int], optional): List " +#~ "of integers valued 0 or 1, where" +#~ " 0 specifies paddings and should not" +#~ " be attended to by the model. " +#~ "Included when `return_attention_mask` is " +#~ "`True`. - **seq_len** (int, optional): " +#~ "The input_ids length. Included when " +#~ "`return_length` is `True`. - " +#~ "**overflowing_tokens** (list[int], optional): List" +#~ " of overflowing tokens. Included when " +#~ "if `max_seq_len` is specified and " +#~ "`return_overflowing_tokens` is True. - " +#~ "**num_truncated_tokens** (int, optional): The " +#~ "number of overflowing tokens. Included " +#~ "when if `max_seq_len` is specified and" +#~ " `return_overflowing_tokens` is True. - " +#~ "**special_tokens_mask** (list[int], optional): List" +#~ " of integers valued 0 or 1, " +#~ "with 0 specifying special added tokens" +#~ " and 1 specifying sequence tokens. " +#~ "Included when `return_special_tokens_mask` is " +#~ "`True`." +#~ msgstr "" + +#~ msgid "" +#~ "The dict has the following optional " +#~ "items: - **input_ids** (list[int]): List " +#~ "of token ids to be fed to a" +#~ " model. 
- **position_ids** (list[int], " +#~ "optional): List of token position ids" +#~ " to be fed to a model. " +#~ "Included when `return_position_ids` is `True`" +#~ " - **token_type_ids** (list[int], optional): " +#~ "List of token type ids to be " +#~ "fed to a model. Included when " +#~ "`return_token_type_ids` is `True`. - " +#~ "**attention_mask** (list[int], optional): List " +#~ "of integers valued 0 or 1, where" +#~ " 0 specifies paddings and should not" +#~ " be attended to by the model. " +#~ "Included when `return_attention_mask` is " +#~ "`True`. - **seq_len** (int, optional): " +#~ "The input_ids length. Included when " +#~ "`return_length` is `True`. - " +#~ "**overflowing_tokens** (list[int], optional): List" +#~ " of overflowing tokens. Included when " +#~ "if `max_seq_len` is specified and " +#~ "`return_overflowing_tokens` is True. - " +#~ "**num_truncated_tokens** (int, optional): The " +#~ "number of overflowing tokens. Included " +#~ "when if `max_seq_len` is specified and" +#~ " `return_overflowing_tokens` is True. - " +#~ "**special_tokens_mask** (list[int], optional): List" +#~ " of integers valued 0 or 1, " +#~ "with 0 specifying special added tokens" +#~ " and 1 specifying sequence tokens. " +#~ "Included when `return_special_tokens_mask` is " +#~ "`True`. - **offset_mapping** (list[int], " +#~ "optional): list of pair preserving the" +#~ " index of start and end char " +#~ "in original input for each token. " +#~ "For a sqecial token, the index " +#~ "pair is `(0, 0)`. Included when " +#~ "`stride` works. - **overflow_to_sample** (int," +#~ " optional): Index of example from " +#~ "which this feature is generated. " +#~ "Included when `stride` works." +#~ msgstr "" + +#~ msgid "" +#~ "**offset_mapping** (list[int], optional): list " +#~ "of pair preserving the index of " +#~ "start and end char in original " +#~ "input for each token. For a " +#~ "sqecial token, the index pair is " +#~ "`(0, 0)`. Included when `stride` works." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_base.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_base.po new file mode 100644 index 0000000000000000000000000000000000000000..b135d3b7f82db0cd1127817bd9629fe024652199 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_base.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.tokenizer_utils_base.rst:2 +msgid "tokenizer\\_utils\\_base" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_fast.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_fast.po new file mode 100644 index 0000000000000000000000000000000000000000..5c99e1c224cdef1a0dbce4d338c0b12cc4fb7edd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_fast.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.tokenizer_utils_fast.rst:2 +msgid "tokenizer\\_utils\\_fast" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..f188f8fcffba594e92e5f1244126b91a6c4e7c07 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.modeling.po @@ -0,0 +1,761 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.transformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:1 +msgid "" +"Generates the initial values for the sinusoidal position encoding table. " +"This method follows the implementation in tensor2tensor, but is slightly " +"different from the description in \"Attention Is All You Need\"." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward +#: paddlenlp.transformers.transformer.modeling.TransformerModel +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.WordEmbedding +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.position_encoding_init +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:5 +msgid "" +"The largest position for sequences, that is, the maximum length of source" +" or target sequences." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:8 +msgid "The size of positional embedding vector." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:10 +msgid "The output `numpy.array`'s data type. Defaults to \"float32\"." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.position_encoding_init +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:13 +msgid "" +"The embedding table of sinusoidal position encoding with shape " +"`[n_position, d_pos_vec]`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.position_encoding_init +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:30 +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:18 +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:13 +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:18 +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:40 +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward:19 +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:14 +#: paddlenlp.transformers.transformer.modeling.position_encoding_init:18 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:1 +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding:1 +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:1 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:1 +#: paddlenlp.transformers.transformer.modeling.WordEmbedding:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:1 +msgid "Word Embedding layer of Transformer." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:3 +msgid "" +"This layer automatically constructs a 2D embedding matrix based on the " +"input the size of vocabulary (`vocab_size`) and the size of each " +"embedding vector (`emb_dim`). This layer lookups embeddings vector of ids" +" provided by input `word`." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:8 +msgid "" +"After the embedding, those weights are multiplied by `sqrt(d_model)` " +"which is `sqrt(emb_dim)` in the interface." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:11 +msgid "Out = embedding(word) * sqrt(emb\\_dim)" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:15 +msgid "The size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:17 +msgid "Dimensionality of each embedding vector." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:31 +#: paddlenlp.transformers.transformer.modeling.WordEmbedding:19 +msgid "The start token id and also is used as padding id. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:1 +msgid "Computes word embedding." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:3 +msgid "" +"The input ids which indicates the sequences' words with shape " +"`[batch_size, sequence_length]` whose data type can be int or int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:8 +msgid "" +"The (scaled) embedding tensor of shape `(batch_size, sequence_length, " +"emb_dim)` whose data type can be float32 or float64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding:1 +msgid "" +"This layer produces sinusoidal positional embeddings of any length. While" +" in `forward()` method, this layer lookups embeddings vector of ids " +"provided by input `pos`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding:5 +msgid "The size of each embedding vector." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding:7 +msgid "The maximum length of sequences." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:1 +msgid "Computes positional embedding." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:3 +msgid "" +"The input position ids with shape `[batch_size, sequence_length]` whose " +"data type can be int or int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:7 +msgid "" +"The positional embedding tensor of shape `(batch_size, sequence_length, " +"emb_dim)` whose data type can be float32 or float64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:1 +msgid "" +"Computes the cross entropy loss for given input with or without label " +"smoothing." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:3 +msgid "" +"The weight used to mix up the original ground-truth distribution and the " +"fixed distribution. Defaults to None. If given, label smoothing will be " +"applied on `label`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:7 +msgid "The token id used to pad variant sequence. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:1 +msgid "Computes cross entropy loss with or without label smoothing." 
+msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:3 +msgid "" +"The predict results of `TransformerModel` with shape `[batch_size, " +"sequence_length, vocab_size]` whose data type can be float32 or float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:7 +msgid "" +"The label for correspoding results with shape `[batch_size, " +"sequence_length, 1]`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:11 +msgid "" +"A tuple with items: (`sum_cost`, `avg_cost`, `token_num`). With the " +"corresponding fields: - `sum_cost` (Tensor): The sum of loss of " +"current batch whose data type can be float32, float64. - `avg_cost` " +"(Tensor): The average loss of current batch whose data type can be " +"float32, float64. The relation between `sum_cost` and `avg_cost` can " +"be described as: .. math:: avg\\_cost = sum\\_cost / " +"token\\_num - `token_num` (Tensor): The number of tokens of current " +"batch. Its data type can be float32, float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:11 +msgid "A tuple with items: (`sum_cost`, `avg_cost`, `token_num`)." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:13 +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:28 +msgid "With the corresponding fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:15 +msgid "`sum_cost` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:16 +msgid "The sum of loss of current batch whose data type can be float32, float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:23 +msgid "`avg_cost` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:18 +msgid "" +"The average loss of current batch whose data type can be float32, " +"float64. The relation between `sum_cost` and `avg_cost` can be described " +"as:" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:21 +msgid "avg\\_cost = sum\\_cost / token\\_num" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:25 +msgid "`token_num` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:26 +msgid "" +"The number of tokens of current batch. Its data type can be float32, " +"float64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:1 +msgid "" +"This layer wraps a Transformer decoder combined with embedding layer and " +"output layer to produce logits from ids and position." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:4 +msgid "" +"Can be a `paddle.nn.TransformerDecoder` instance. Or a wrapper that " +"includes an embedding layer accepting ids and positions and includes an " +"output layer transforming decoder output to logits." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:8 +msgid "" +"Can be a `WordEmbedding` instance or a callable that accepts ids as " +"arguments and return embeddings. It can be None if `decoder` includes a " +"embedding layer. Defaults to None." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:12 +msgid "" +"Can be a `PositionalEmbedding` instance or a callable that accepts " +"position as arguments and return embeddings. It can be None if `decoder` " +"includes a positional embedding layer. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:16 +msgid "" +"Can be a `paddle.nn.Linear` instance or a callable to transform decoder " +"output to logits." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:19 +msgid "" +"The dropout rate for the results of `word_embedding` and `pos_embedding`." +" Defaults to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:1 +msgid "Produces logits." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:3 +msgid "" +"A tuple/list includes target ids and positions. If `word_embedding` is " +"None, then it should be a Tensor which means the input for decoder." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:6 +msgid "" +"It is a list and each element of the list is an instance of " +"`paddle.nn.MultiheadAttention.Cache` for corresponding decoder layer. It " +"can be produced by `paddle.nn.TransformerDecoder.gen_cache`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:10 +msgid "" +"It is a list and each element of the list is an instance of " +"`paddle.nn.MultiheadAttention.StaticCache` for corresponding decoder " +"layer. It can be produced by `paddle.nn.TransformerDecoder.gen_cache`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:14 +msgid "" +"A tensor used in self attention to prevents attention to some unwanted " +"positions, usually the subsequent positions. It is a tensor with shape " +"broadcasted to `[batch_size, n_head, target_length, target_length]`, " +"where the unwanted positions have `-INF` values and the others have 0 " +"values. The data type should be float32 or float64. It can be None when " +"nothing wanted or needed to be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:21 +msgid "" +"The output of Transformer encoder. It is a tensor with shape " +"`[batch_size, source_length, d_model]` and its data type can be float32 " +"or float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:26 +msgid "" +"A tuple with items: `(outputs, new_states)` With the corresponding " +"fields: - `outputs` (Tensor): A float32 or float64 3D tensor " +"representing logits shaped `[batch_size, sequence_length, " +"vocab_size]`. - `new_states` (Tensor): This output has the same " +"structure and data type with `states` while the length is one larger " +"since concatanating the intermediate results of current step." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:26 +msgid "A tuple with items: `(outputs, new_states)`" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:31 +msgid "`outputs` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:31 +msgid "" +"A float32 or float64 3D tensor representing logits shaped `[batch_size, " +"sequence_length, vocab_size]`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:35 +msgid "`new_states` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:34 +msgid "" +"This output has the same structure and data type with `states` while the " +"length is one larger since concatanating the intermediate results of " +"current step." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:1 +msgid "基类::class:`paddle.fluid.layers.rnn.BeamSearchDecoder`" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:1 +msgid "" +"This layer is a subclass of `BeamSearchDecoder` to make beam search adapt" +" to Transformer decoder." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:4 +msgid "An instance of `TransformerDecoderCell`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:6 +msgid "The start token id." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:8 +msgid "The end token id." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:10 +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:10 +msgid "The beam width used in beam search." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:12 +msgid "Indicate which dimension of states is variant." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:1 +msgid "" +"Tiles the batch dimension of a tensor. Specifically, this function takes " +"a tensor t shaped `[batch_size, s0, s1, ...]` composed of minibatch " +"entries `t[0], ..., t[batch_size - 1]` and tiles it to have a shape " +"`[batch_size * beam_size, s0, s1, ...]` composed of minibatch entries " +"`t[0], t[0], ..., t[1], t[1], ...` where each minibatch entry is repeated" +" `beam_size` times." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:8 +msgid "A list of tensor with shape `[batch_size, ...]`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:13 +msgid "" +"A tensor with shape `[batch_size * beam_size, ...]`, whose data type is " +"same as `t`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:1 +msgid "" +"Perform a beam search decoding step, which uses cell to get " +"probabilities, and follows a beam search step to calculate scores and " +"select candidate token ids." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:4 +msgid "" +"An `int64` tensor with shape `[1]` provided by the caller, representing " +"the current time step number of decoding." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:7 +msgid "" +"A tensor variable. It is same as `initial_inputs` returned by " +"`initialize()` for the first decoding step and `next_inputs` returned by " +"`step()` for the others." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:11 +msgid "" +"A structure of tensor variables. 
It is same as the `initial_cell_states` " +"returned by `initialize()` for the first decoding step and `next_states` " +"returned by `step()` for the others." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:16 +msgid "Additional keyword arguments, provided by the caller `dynamic_decode`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:19 +msgid "" +"Returns tuple (``beam_search_output, beam_search_state, next_inputs, " +"finished``). `beam_search_state` and `next_inputs` have the same " +"structure, shape and data type as the input arguments states and inputs " +"separately. `beam_search_output` is a namedtuple(including scores, " +"predicted_ids, parent_ids as fields) of tensor variables, where `scores, " +"predicted_ids, parent_ids` all has a tensor value shaped [batch_size, " +"beam_size] with data type float32, int64, int64. `finished` is a bool " +"tensor with shape [batch_size, beam_size]." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:1 +msgid "The Transformer model." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:3 +msgid "" +"This model is a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:3 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:7 +msgid "The size of source vocabulary." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:5 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:9 +msgid "The size of target vocabulary." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:7 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:11 +msgid "The maximum length of input sequences." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:9 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:13 +msgid "The number of sub-layers to be stacked in the encoder." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:11 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:15 +msgid "The number of sub-layers to be stacked in the decoder." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:13 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:17 +msgid "The number of head used in multi-head attention." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:15 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:19 +msgid "" +"The dimension for word embeddings, which is also the last dimension of " +"the input and output of multi-head attention, position-wise feed-forward " +"networks, encoder and decoder." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:19 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:23 +msgid "Size of the hidden layer in position-wise feed-forward networks." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:21 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:25 +msgid "Dropout rates. Used for pre-process, activation and inside attention." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:23 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:27 +msgid "Whether to use weight sharing." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:25 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:29 +msgid "" +"The dropout probability used in MHA to drop some attention target. If " +"None, use the value of dropout. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:32 +msgid "" +"The dropout probability used after FFN activation. If None, use the value" +" of dropout. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:35 +msgid "The start token id and also be used as padding id. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:33 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:37 +msgid "The end token id. Defaults to 1." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:1 +msgid "" +"The Transformer forward methods. The input are source/target sequences, " +"and returns logits." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:4 +msgid "" +"The ids of source sequences words. It is a tensor with shape " +"`[batch_size, source_sequence_length]` and its data type can be int or " +"int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:8 +msgid "" +"The ids of target sequences words. It is a tensor with shape " +"`[batch_size, target_sequence_length]` and its data type can be int or " +"int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:13 +msgid "" +"Output tensor of the final layer of the model whose data type can be " +"float32 or float64 with shape `[batch_size, sequence_length, " +"vocab_size]`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:1 +msgid "基类::class:`paddlenlp.transformers.transformer.modeling.TransformerModel`" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:1 +msgid "The Transformer model for auto-regressive generation." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:28 +msgid "" +"The dropout probability used after FFN activition. If None, use the value" +" of dropout. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:35 +msgid "The beam width for beam search. Defaults to 4." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:37 +msgid "The maximum output length. Defaults to 256." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:39 +msgid "" +"Indicate the data layout of predicted Tensor. If `False`, the data layout" +" would be batch major with shape `[batch_size, seq_len, beam_size]`. If " +"`True`, the data layout would be time major with shape `[seq_len, " +"batch_size, beam_size]`. Default to `False`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:45 +msgid "" +"Specify beam search version. It should be in one of [`v1`, `v2`]. If " +"`v2`, need to set `alpha`(default to 0.6) for length penalty. Default to " +"`v1`." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:49 +msgid "" +"The key word arguments can be `rel_len` and `alpha`: - `rel_len(bool, " +"optional)`: Indicating whether `max_out_len` in is the length relative to" +" that of source text. Only works in `v2` temporarily. It is suggest to " +"set a small `max_out_len` and use `rel_len=True`. Default to False if not" +" set. - `alpha(float, optional)`: The power number in length penalty " +"calculation. Refer to `GNMT `_. " +"Only works in `v2` temporarily. Default to 0.6 if not set." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:49 +msgid "The key word arguments can be `rel_len` and `alpha`:" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:51 +msgid "`rel_len(bool, optional)`: Indicating whether `max_out_len` in" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:52 +msgid "" +"is the length relative to that of source text. Only works in `v2` " +"temporarily. It is suggest to set a small `max_out_len` and use " +"`rel_len=True`. Default to False if not set." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:56 +msgid "`alpha(float, optional)`: The power number in length penalty" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:57 +msgid "" +"calculation. Refer to `GNMT `_. " +"Only works in `v2` temporarily. Default to 0.6 if not set." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:1 +msgid "The Transformer forward method." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:3 +msgid "" +"The ids of source sequence words. It is a tensor with shape `[batch_size," +" source_sequence_length]` and its data type can be int or int64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:7 +msgid "" +"The ids of target sequence words. Normally, it should NOT be given. If " +"it's given, force decoding with previous output token will be trigger. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:12 +msgid "" +"An int64 tensor shaped indicating the predicted ids. Its shape is " +"`[batch_size, seq_len, beam_size]` or `[seq_len, batch_size, beam_size]` " +"according to `output_time_major`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.beam_search_v2:1 +msgid "" +"Beam search with the alive and finished two queues, both have a beam size" +" capicity separately. It includes `grow_topk` `grow_alive` `grow_finish` " +"as steps. 1. `grow_topk` selects the top `2*beam_size` candidates to " +"avoid all getting EOS. 2. `grow_alive` selects the top `beam_size` non-" +"EOS candidates as the inputs of next decoding step. 3. `grow_finish` " +"compares the already finished candidates in the finished queue and newly " +"added finished candidates from `grow_topk`, and selects the top " +"`beam_size` finished candidates." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..8fc2bc3acd8b78ab2bacac3a99c4638814703909 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.transformer.rst:2 +msgid "transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.convert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.convert.po new file mode 100644 index 0000000000000000000000000000000000000000..4a3a54f3f3da0d9f00a13dd0c5cd8a26573703e0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.convert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.convert.rst:2 +msgid "convert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..02178cd8c75bf2001afef298df7bd1aa498fc5f7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.modeling.po @@ -0,0 +1,421 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.unified_transformer.modeling:1 +msgid "Modeling classes for UnifiedTransformer model." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel:1 +msgid "" +"An abstract class for pretrained UnifiedTransformer models. It provides " +"UnifiedTransformer related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel:1 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:1 +msgid "基类::class:`paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:1 +msgid "The bare UnifiedTransformer Model outputting raw hidden-states." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:6 +msgid "" +"This model is also a `paddle.nn.Layer `__ " +"subclass. Use it as a regular Paddle Layer and refer to the Paddle " +"documentation for all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:11 +msgid "" +"Vocabulary size of `inputs_ids` in :class:`UnifiedTransformerModel`. Also" +" is the vocab size of token embedding matrix." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:14 +msgid "" +"Dimensionality of the embedding layers, encoder layers and pooler layer. " +"Defaults to 768." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:17 +msgid "The number of hidden layers in the encoder. Defaults to 12." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:19 +msgid "The number of heads in multi-head attention(MHA). Defaults to 12." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:21 +msgid "" +"Dimensionality of the feed-forward layer in the encoder. Input tensors to" +" feed-forward layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to 3072." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:27 +msgid "The activation function in the feedforward network. Defaults to \"gelu\"." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:30 +msgid "" +"The dropout probability used in pre-process and post-precess of MHA and " +"FFN sub-layer. Defaults to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:33 +msgid "" +"The dropout probability used in MHA to drop some attention target. " +"Defaults to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:36 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:42 +msgid "The maximum length of input `position_ids`. Defaults to 512." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:44 +msgid "The size of the input `token_type_ids`. Defaults to 2." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:46 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal" +" distributions. See " +":meth:`UnifiedTransformerPretrainedModel.init_weights` method for how" +" weights are initialized in :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:46 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:49 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`UnifiedTransformerPretrainedModel.init_weights` method for " +"how weights are initialized in :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:55 +msgid "The id of special token `unk_token`. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:57 +msgid "The id of special token `pad_token`. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:59 +msgid "The id of special token `bos_token`. Defaults to 1." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:61 +msgid "The id of special token `eos_token`. Defaults to 2." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:63 +msgid "The id of special token `mask_token`. Defaults to 30000." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:1 +msgid "" +"The UnifiedTransformerModel forward method, overrides the special " +":meth:`__call__` method." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:4 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:12 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:13 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:15 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:18 +msgid "" +"The position indices of input sequence tokens. It's data type should be " +"`int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:21 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape broadcasted to [batch_size, n_head, " +"sequence_length, sequence_length]. - When the data type is bool, the " +"unwanted positions have `False` values and the others have `True` " +"values. - When the data type is int, the unwanted positions have 0 " +"values and the others have 1 values. - When the data type is float, the " +"unwanted positions have `-INF` values and the others have 0 values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:21 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape broadcasted to [batch_size, n_head, " +"sequence_length, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:26 +msgid "" +"When the data type is bool, the unwanted positions have `False` values " +"and the others have `True` values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:28 +msgid "" +"When the data type is int, the unwanted positions have 0 values and the " +"others have 1 values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:30 +msgid "" +"When the data type is float, the unwanted positions have `-INF` values " +"and the others have 0 values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:33 +msgid "" +"(bool, optional): Whether or not use the model cache to speed up " +"decoding. Defaults to False." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:36 +msgid "" +"It is a list, and each element in the list is `incremental_cache` " +"produced by :meth:`paddle.nn.TransformerEncoderLayer.gen_cache` method. " +"See :meth:`paddle.nn.TransformerEncoder.gen_cache` method for more " +"details. It is only used for inference and should be None for training. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:42 +msgid "" +"Indices of role ids indicated different roles. It's data type should be " +"`int64` and has a shape of [batch_size, sequence_length]. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:43 +msgid "Indices of role ids indicated different roles." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:44 +msgid "It's data type should be `int64` and has a shape of" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:45 +msgid "[batch_size, sequence_length]. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:48 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UnifiedTransformerModel`, with shape [batch_size, " +"sequence_length, hidden_size]. The data type is float32 or float64. " +"Otherwise, it is a tuple, besides the output of " +":class:`UnifiedTransformerModel`, the tuple also includes the new cache " +"which is same as input `cache` but `incremental_cache` in it has an " +"incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache` " +"method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:31 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:60 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel:1 +msgid "" +"The UnifiedTransformer Model with a language modeling head on top for " +"generation tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel:4 +msgid "An instance of :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:1 +msgid "" +"The UnifiedTransformerLMHeadModel forward method, overrides the special " +":meth:`__call__` method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:4 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:6 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:8 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:10 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:14 +msgid "See :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:12 +msgid "(bool, optional): See :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:16 +msgid "(Tensor, optional): See :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:19 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UnifiedTransformerLMHeadModel`, with shape [batch_size, " +"sequence_length, vocab_size]. The data type is float32 or float64. " +"Otherwise, it is a tuple, besides the output of " +":class:`UnifiedTransformerLMHeadModel`, the tuple also includes the new " +"cache which is same as input `cache` but `incremental_cache` in it has an" +" incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache` " +"method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..7303924289a46bf2630261e610ede3f3b2ca3a8f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.rst:2 +msgid "unified\\_transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..15517e061e49822217e82db00c2e6c20991eac3d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.tokenizer.po @@ -0,0 +1,647 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.unified_transformer.tokenizer:1 +msgid "Tokenization class for UnifiedTransformer model." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:1 +msgid "" +"Constructs an UnifiedTransformer tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:7 +msgid "The path of file to construct vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:9 +msgid "" +"The sentencepiece model file (ends with '.spm') required to instantiate a" +" `SentencePiece `__." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:12 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to False " +"and **does not** lowercase the input." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:19 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:22 +msgid "" +"A special token representing the beginning of a sequence. Defaults to " +"\"[CLS]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:25 +msgid "" +"A special token representing the end of a sequence or separating two " +"different sentences in the same input. Defaults to \"[SEP]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:28 +msgid "A special token representing a masked token. Defaults to \"[MASK]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:30 +msgid "" +"The path of file that contains additional special tokens to be used by " +"the tokenizer. Defaults to \"\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.vocab_size:1 +msgid "Returns the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:14 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:15 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:107 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.vocab_size:4 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `__` to concat subwords, also remove " +"`__` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:6 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:7 +msgid "Whether or not to keep the segmentation with space. Defaults to True." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:11 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:1 +msgid "" +"Converts a single index or a sequence of indices to a token or a sequence" +" of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:4 +msgid "The token id (or token ids) to be converted to token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:10 +msgid "The decoded token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the number of added tokens should be computed in the case of a " +"sequence pair or a single sequence. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add:7 +msgid "Number of special tokens added to sequences." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:4 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:3 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:6 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:8 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:11 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. 
This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:1 +msgid "" +"Instantiate an instance of `Vocab` from a file reserving all tokens by " +"using `Vocab.from_dict`. The file contains a token per line, and the line" +" number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:5 +msgid "path of file to construct vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:19 +msgid "keyword arguments for `Vocab.from_dict`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:22 +msgid "An instance of `Vocab`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:1 +msgid "" +"Main method to encode the single-turn or multi-turn dialogue " +"conversation. It will return a dictionary containing the encoded sequence" +" and other relative informations which meets the input format " +"requirements of the UnifiedTransformer model. See detail at " +"https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:8 +msgid "" +"The history of dialogue conversation. It is an utterance or list of " +"utterances to be encoded. Each utterance is a string." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:12 +msgid "" +"The response of dialogue conversation. It should be set when training the" +" model. It should not be set when running inference. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:16 +msgid "" +"The knowledge information of dialogue conversation. It should be set if " +"the `task_type` is \"knowledge\" or \"recommend\". Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:20 +msgid "" +"The type of dialogue conversation. It is one of \"chitchat\", " +"\"knowledge\" and \"recommend\". They represent the chitchat dialogue, " +"knowledge grounded dialogue and conversational recommendation " +"respectively. Defaults to None, which means there is no `special_token` " +"added in output sequence for identifying different conversation types." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:27 +msgid "The maximum encoded sequence length. Defaults to 512." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:30 +msgid "" +"The maximum encoded sequence length of the input `response`. Defaults to " +"128." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:33 +msgid "" +"The maximum encoded sequence length of the input `knowledge`. Defaults to" +" 128." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:36 +msgid "Whether to return the position_ids. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:39 +msgid "Whether to return the token_type_ids. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:42 +msgid "Whether to return the role_ids. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:45 +msgid "Whether to return the attention_mask. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:48 +msgid "Whether to return the length of the encoded sequence. Defaults to False." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:51 +msgid "" +"Whether to add the special token \"[CLS]\" at the end of sequence as the " +"beginning of the response when running inference to force the model to " +"start generating response sequence. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:56 +msgid "" +"Whether to pad the returned sequences to the `max_seq_len`. Note that, in" +" this method, returned sequences will be padded on the left. Defaults to " +"False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:60 +msgid "Whether to convert the returned sequences to Tensor. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:63 +msgid "" +"Whether or not the input text (`history`, `response` and `knowledge`) has" +" been pretokenized. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:67 +msgid "" +"Specify the involved positional style which must be one of [relative, " +"continuous]. Defaults to continuous which means start from 0 to maximum " +"length of history." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:72 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations. With the corresponding fields: - input_ids " +"(list[int]|Tensor): A list of indices of input tokens to be feed to " +"UnifiedTransformer model. If `return_tensors` is True, it is a Tensor" +" with shape [1, sequence_length] and data type 'int64'. - role_ids " +"(list[int]|Tensor, optional): A list of indices of role indices. If " +"`return_role_ids` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'. - token_type_ids " +"(list[int]|Tensor, optional): A list of segment token indices to " +"indicate whether the token belongs to the dialogue response. If " +"`return_tensors` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True. - position_ids (list[int]|Tensor," +" optional): A list of The position indices. If `return_tensors` is " +"True, it is a Tensor with shape [1, sequence_length] and data type" +" 'int64'. Being returned when `return_position_ids` is set to " +"True. - attention_mask (numpy.ndarray|Tensor, optional): A " +"numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being " +"returned when `return_attention_mask` is set to True. - seq_len (int, " +"optional): The actual length of the `input_ids`, excluding the pad " +"token. Being returned when `return_length` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:72 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:75 +msgid "With the corresponding fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:79 +msgid "input_ids (list[int]|Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:78 +msgid "" +"A list of indices of input tokens to be feed to UnifiedTransformer model." +" If `return_tensors` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:82 +msgid "role_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:82 +msgid "" +"A list of indices of role indices. If `return_role_ids` is True, it is a " +"Tensor with shape [1, sequence_length] and data type 'int64'." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:88 +msgid "token_type_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:85 +msgid "" +"A list of segment token indices to indicate whether the token belongs to " +"the dialogue response. If `return_tensors` is True, it is a Tensor with " +"shape [1, sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:93 +msgid "position_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:91 +msgid "" +"A list of The position indices. If `return_tensors` is True, it is a " +"Tensor with shape [1, sequence_length] and data type 'int64'. Being " +"returned when `return_position_ids` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:99 +msgid "attention_mask (numpy.ndarray|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:96 +msgid "" +"A numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being returned" +" when `return_attention_mask` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:102 +msgid "seq_len (int, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:102 +msgid "" +"The actual length of the `input_ids`, excluding the pad token. Being " +"returned when `return_length` is set to True." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..2a1873a8eda037e4363d1fb5539487bc3f0deea7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.modeling.po @@ -0,0 +1,346 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unimo.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling:1 +msgid "Modeling classes for UNIMO model." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel:1 +msgid "" +"An abstract class for pretrained UNIMO models. It provides UNIMO related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel:1 +#: paddlenlp.transformers.unimo.modeling.UNIMOModel:1 +msgid "基类::class:`paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:1 +msgid "The bare UNIMO Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:6 +msgid "" +"This model is also a `paddle.nn.Layer " +"`__" +" subclass. Use it as a regular Paddle Layer and refer to the Paddle " +"documentation for all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward +#: paddlenlp.transformers.unimo.modeling.UNIMOModel +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `UNIMOModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:13 +msgid "" +"Dimensionality of the embedding layers and encoder layers. Defaults to " +"`768`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:15 +msgid "The number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:29 +msgid "" +"The dropout probability used in pre-process and post-precess of MHA and " +"FFN sub-layer. Defaults to 0.1." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:35 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:41 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:44 +msgid "" +"The vocabulary size of the `token_type_ids` passed when calling " +"`~transformers.UNIMOModel`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`UNIMOPretrainedModel._init_weights()` for " +"how weights are initialized in `UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:47 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:50 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`UNIMOPretrainedModel._init_weights()` for how weights are " +"initialized in `UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:53 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` in order to be converted to an ID." +" Defaults to `17963`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:57 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:60 +msgid "" +"A special token representing the beginning of a sequence that was used " +"during pretraining. Defaults to `1`." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:63 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:66 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:1 +msgid "" +"The UNIMOModel forward method, overrides the special :meth:`__call__` " +"method." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:10 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:11 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:13 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:16 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:21 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:32 +msgid "" +"(bool, optional): Whether or not use the model cache to speed up " +"decoding. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:35 +msgid "" +"It is a list, and each element in the list is `incremental_cache` " +"produced by :meth:`paddle.nn.TransformerEncoderLayer.gen_cache` method. " +"See :meth:`paddle.nn.TransformerEncoder.gen_cache` method for more " +"details. It is only used for inference and should be None for training. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:42 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UNIMOModel`, with shape [batch_size, sequence_length, " +"hidden_size]. The data type is float64. Otherwise, it is a tuple, besides" +" the output of :class:`UNIMOModel`, the tuple also includes the new cache" +" which is same as input `cache` but `incremental_cache` in it has an " +"incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache` " +"method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:26 +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:51 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel:1 +msgid "" +"The UNIMO Model with a `language modeling` head on top designed for " +"generation tasks." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel:3 +msgid "An instance of :class:`UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:1 +msgid "" +"The UNIMOLMHeadModel forward method, overrides the special " +":meth:`__call__` method." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:4 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:6 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:8 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:10 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:14 +msgid "See :class:`UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:12 +msgid "(bool, optional): See :class:`UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:17 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UNIMOModel`, with shape [batch_size, sequence_length, " +"hidden_size]. The data type is float64. Otherwise, it is a tuple, besides" +" the output of :class:`UNIMOLMHeadModel`, the tuple also includes the new" +" cache which is same as input `cache` but `incremental_cache` in it has " +"an incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache`" +" method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.po new file mode 100644 index 0000000000000000000000000000000000000000..54632057c48c9ae07ab0e50ab8a9aaf3f8d0962a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unimo.rst:2 +msgid "unimo" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..50c8a3dc632a6966bdf17dc3e776ce8b7b4ca484 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.tokenizer.po @@ -0,0 +1,512 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unimo.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:1 +msgid "" +"Constructs an UNIMO tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:34 +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size:3 +msgid "The size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:1 +msgid "" +"Instantiate an instance of `Vocab` from a file reserving all tokens by " +"using `Vocab.from_dict`. The file contains a token per line, and the line" +" number would be the index of corresponding token." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:5 +msgid "path of file to construct vocabulary." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:19 +msgid "keyword arguments for `Vocab.from_dict`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:22 +msgid "An instance of `Vocab`." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:5 +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword:4 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:4 +msgid "A UNIMO sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:15 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword:1 +msgid "" +"Converts the subwords in a sequence of tokens (list of string) to whole " +"words, also remove `##` when converting." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword:7 +msgid "Converted sequence of whole words." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:4 +msgid "A UNIMO offset_mapping has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:9 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:11 +msgid "" +"Optional second list of char offsets for offset mapping pairs. Defaults " +"to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:15 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:17 +msgid "List of char offsets with the appropriate offsets" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:18 +msgid "of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:4 +msgid "A UNIMO sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:10 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:19 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:1 +msgid "" +"Main method for encoding the source for generation. It will return a " +"dictionary containing the encoded sequence and other relative " +"informations which meets the input format requirements of the UNIMO-text " +"model." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:5 +msgid "The source text of generation. It should be a string." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:7 +msgid "" +"The target text of generation. It should be set when training the model " +"and should be None when running inference. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:11 +msgid "" +"The additional information of some of the generation tasks such as " +"summary. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:14 +msgid "The maximum encoded sequence length. Defaults to 512." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:17 +msgid "" +"The maximum encoded sequence length of the input `target`. Defaults to " +"128." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:20 +msgid "The maximum encoded sequence length of the input `title`. Defaults to 128." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:23 +msgid "Whether to return the position_ids. Defaults to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:26 +msgid "Whether to return the token_type_ids. Defaults to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:29 +msgid "Whether to return the attention_mask. Defaults to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:32 +msgid "Whether to return the length of the encoded sequence. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:35 +msgid "" +"Whether to add the special token \"[CLS]\" at the end of sequence as the " +"beginning of the target when running inference to force the model to start" +" generating target sequence. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:40 +msgid "" +"Whether to pad the returned sequences to the `max_seq_len`. Note that, in" +" this method, returned sequences will be padded on the left. Defaults to " +"False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:44 +msgid "Whether to convert the returned sequences to Tensor. Defaults to False." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:47 +msgid "" +"Whether or not the input text (`source`, `target` and `title`) has been " +"pretokenized. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:51 +msgid "" +"Whether the position ids is continuous between source ids and target ids." +" Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:55 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations. With the corresponding fields: - input_ids " +"(list[int]|Tensor): A list of indices of input tokens to be feed to " +"UNIMO-text model. If `return_tensors` is True, it is a Tensor with " +"shape [1, sequence_length] and data type 'int64'. - token_type_ids " +"(list[int]|Tensor, optional): A list of segment token indices to " +"indicate whether the token belongs to the dialogue target. If " +"`return_tensors` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True. - position_ids (list[int]|Tensor," +" optional): A list of The position indices. If `return_tensors` is " +"True, it is a Tensor with shape [1, sequence_length] and data type" +" 'int64'. Being returned when `return_position_ids` is set to " +"True. - attention_mask (numpy.ndarray|Tensor, optional): A " +"numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being " +"returned when `return_attention_mask` is set to True. - seq_len (int, " +"optional): The actual length of the `input_ids`, excluding the pad " +"token. Being returned when `return_length` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:55 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:58 +msgid "With the corresponding fields:" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:62 +msgid "input_ids (list[int]|Tensor):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:61 +msgid "" +"A list of indices of input tokens to be feed to UNIMO-text model. If " +"`return_tensors` is True, it is a Tensor with shape [1, sequence_length] " +"and data type 'int64'." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:68 +msgid "token_type_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:65 +msgid "" +"A list of segment token indices to indicate whether the token belongs to " +"the dialogue target. If `return_tensors` is True, it is a Tensor with " +"shape [1, sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:73 +msgid "position_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:71 +msgid "" +"A list of The position indices. If `return_tensors` is True, it is a " +"Tensor with shape [1, sequence_length] and data type 'int64'. 
Being " +"returned when `return_position_ids` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:79 +msgid "attention_mask (numpy.ndarray|Tensor, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:76 +msgid "" +"A numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being returned" +" when `return_attention_mask` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:82 +msgid "seq_len (int, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:82 +msgid "" +"The actual length of the `input_ids`, excluding the pad token. Being " +"returned when `return_length` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:87 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..a792eb80f03fd4fffde5bd47bb099b304cdec395 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.utils.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.utils.rst:2 +msgid "utils" +msgstr "" + +#: of paddlenlp.transformers.utils.fn_args_to_dict:1 +msgid "" +"Inspect function `func` and its arguments for running, and extract a dict" +" mapping between argument names and keys." +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta:1 +msgid "基类::class:`type`" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta:1 +msgid "" +"This metaclass wraps the `__init__` method of a class to add " +"`init_config` attribute for instances of that class, and `init_config` " +"use a dict to track the initial configuration. If the class has " +"`_wrap_init` method, it would be hooked after `__init__` and called as " +"`_wrap_init(self, init_fn, init_args)`. Since InitTrackerMeta would be " +"used as metaclass for pretrained model classes, which always are Layer " +"and `type(Layer)` is not `type`, thus use `type(Layer)` rather than " +"`type` as base class for it to avoid inheritance metaclass conflicts." +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf:1 +msgid "" +"wraps `init_func` which is `__init__` method of a class to add " +"`init_config` attribute for instances of that class. :param init_func: It" +" should be the `__init__` method of a class. 
:type init_func: callable " +":param help_func: If provided, it would be hooked after" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf:6 +msgid "" +"`init_func` and called as `_wrap_init(self, init_func, *init_args, " +"**init_args)`. Default None." +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf:10 +msgid "the wrapped function" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..6862a9012543b7c6efc6af564eef1c62294000f0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.modeling.po @@ -0,0 +1,915 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.xlnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling:1 +msgid "Modeling classes for XLNet model." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained XLNet models. It provides XLNet related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel:1 +msgid "基类::class:`paddlenlp.transformers.xlnet.modeling.XLNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:1 +msgid "The bare XLNet Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:6 +msgid "" +"This model is also a `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetModel +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `XLNetModel`. Also is the vocab size " +"of token embedding matrix." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:13 +msgid "" +"The number of tokens to cache. If not 0 or None, the last `mem_len` " +"hidden states in each layer will be cached into memory. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:16 +msgid "" +"The number of tokens in the current batch to be cached. If positive, then" +" at most `reuse_len` tokens can be cached in the current batch. " +"Otherwise, there is no limit to the number of tokens. Defaults to `None`." +" .. note:: The difference between `mem_len` and `reuse_len` is that " +"`mem_len` defines **the total number** of tokens to cache while " +"`reuse_len` defines the number of tokens in **the current batch** to " +"be cached." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:16 +msgid "" +"The number of tokens in the current batch to be cached. If positive, then" +" at most `reuse_len` tokens can be cached in the current batch. " +"Otherwise, there is no limit to the number of tokens. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:21 +msgid "" +"The difference between `mem_len` and `reuse_len` is that `mem_len` " +"defines **the total number** of tokens to cache while `reuse_len` defines" +" the number of tokens in **the current batch** to be cached." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:25 +msgid "" +"Dimensionality of the embedding layers, encoder layers and pooler layer. " +"Defaults to 768." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:28 +msgid "" +"Whether or not to use the same attention length for each token. Defaults " +"to `False`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:31 +msgid "" +"The attention type used in the attention layer. Set **\"bi\"** for " +"``XLNet``, **\"uni\"** for ``Transformer-XL``. Defaults to **\"bi\"**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:34 +msgid "" +"Whether or not to use bidirectional input pipeline. Set to `True` during " +"pretraining and `False` during fine-tuning. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:37 +msgid "" +"Maximum relative distance supported. All relative distances larger than " +"`clamp_len` will be clamped. Setting this attribute to -1 means no " +"clamping. Defaults to -1." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:40 +msgid "The number of hidden layers in the encoder. Defaults to 12." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:42 +msgid "" +"The dropout ratio for all fully connected layers in the embeddings and " +"encoder. Defaults to 0.1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:45 +msgid "" +"The dropout ratio for all fully connected layers in the pooler " +"(classification head). Defaults to 0.1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:48 +msgid "Number of attention heads in each attention layer. Defaults to 12." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:51 +msgid "" +"Dimensionality of each attention head. Defaults to 64. .. note:: " +"`d_head` should be equal to `d_model` divided by `n_head`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:51 +msgid "Dimensionality of each attention head. Defaults to 64." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:54 +msgid "`d_head` should be equal to `d_model` divided by `n_head`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:56 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. Defaults to 1e-12." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:59 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to `d_inner`, " +"and then projected back to `d_model`. Typically `d_inner` is larger than " +"`d_model`. Defaults to 3072." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:64 +msgid "" +"The non-linear activation function in the feed-forward layers in the " +"encoder. Choose from the following supported activation functions: " +"`[\"relu\", \"gelu\", \"tanh\", \"sigmoid\", \"mish\", \"swish\"]`. " +"Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:68 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`XLNetPretrainedModel._init_weights()` for " +"how weights are initialized in `XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:68 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:71 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`XLNetPretrainedModel._init_weights()` for how weights are " +"initialized in `XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:1 +msgid "The XLNetModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. 
Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:10 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:11 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:13 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:16 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:16 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:21 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:22 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:24 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:27 +msgid "" +"A list of length `n_layers` with each Tensor being a pre-computed hidden-" +"state for each layer. Each Tensor has a dtype `float32` and a shape of " +"[batch_size, sequence_length, hidden_size]. Defaults to None, and we " +"don't use mems. .. note:: `use_mems` has to be set to `True` in " +"order to make use of `mems`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:27 +msgid "" +"A list of length `n_layers` with each Tensor being a pre-computed hidden-" +"state for each layer. Each Tensor has a dtype `float32` and a shape of " +"[batch_size, sequence_length, hidden_size]. Defaults to None, and we " +"don't use mems." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:32 +msgid "`use_mems` has to be set to `True` in order to make use of `mems`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:34 +msgid "" +"Mask to indicate the permutation pattern of the input sequence with " +"values being either 0 or 1. 
- if ``perm_mask[k, i, j] = 0``, i " +"**attend** to j in batch k; - if ``perm_mask[k, i, j] = 1``, i **does not" +" attend** to j in batch k. Only used during pretraining (to define " +"factorization order) or for sequential decoding (generation). It's data " +"type should be `float32` and has a shape of [batch_size, sequence_length," +" sequence_length]. Defaults to `None`, then each token attends to all the" +" other tokens (full bidirectional attention)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:34 +msgid "" +"Mask to indicate the permutation pattern of the input sequence with " +"values being either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:36 +msgid "if ``perm_mask[k, i, j] = 0``, i **attend** to j in batch k;" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:37 +msgid "if ``perm_mask[k, i, j] = 1``, i **does not attend** to j in batch k." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:39 +msgid "" +"Only used during pretraining (to define factorization order) or for " +"sequential decoding (generation). It's data type should be `float32` and " +"has a shape of [batch_size, sequence_length, sequence_length]. Defaults " +"to `None`, then each token attends to all the other tokens (full " +"bidirectional attention)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:44 +msgid "" +"Mask to indicate the output tokens to use with values being either 0 or " +"1. If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on " +"the j-th token. It's data type should be `float32` and has a shape of " +"[batch_size, num_predict, sequence_length]. Only used during pretraining " +"for partial prediction or for sequential decoding (generation). Defaults " +"to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:50 +msgid "" +"Mask to avoid performing attention on padding token with values being " +"either 0 or 1. It's data type should be `float32` and it has a shape of " +"[batch_size, sequence_length]. This mask is negative of `attention_mask`:" +" - 1 for tokens that are **masked**, - 0 for tokens that are **not " +"masked**. You should use only one of `input_mask` and `attention_mask`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:50 +msgid "" +"Mask to avoid performing attention on padding token with values being " +"either 0 or 1. It's data type should be `float32` and it has a shape of " +"[batch_size, sequence_length]. This mask is negative of `attention_mask`:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:54 +msgid "1 for tokens that are **masked**," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:55 +msgid "0 for tokens that are **not masked**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:57 +msgid "" +"You should use only one of `input_mask` and `attention_mask`. Defaults to" +" `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:59 +msgid "" +"Mask to nullify selected heads of the self-attention layers with values " +"being either 0 or 1. - 1 indicates the head is **not masked**, - 0 " +"indicates the head is **masked**. It's data type should be `float32` and" +" has a shape of [num_heads] or [num_layers, num_heads]. Defaults to " +"`None`, which means we keep all heads." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:59 +msgid "" +"Mask to nullify selected heads of the self-attention layers with values " +"being either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:61 +msgid "1 indicates the head is **not masked**," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:62 +msgid "0 indicates the head is **masked**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:64 +msgid "" +"It's data type should be `float32` and has a shape of [num_heads] or " +"[num_layers, num_heads]. Defaults to `None`, which means we keep all " +"heads." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:67 +msgid "" +"An embedded representation tensor which is an alternative of `input_ids`." +" You should specify only either one of them to avoid contradiction. It's " +"data type should be `float32` and has a shape of [batch_size, " +"sequence_length, hidden_size]. Defaults to `None`, which means we only " +"specify `input_ids`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:72 +msgid "" +"Whether or not to use recurrent memory mechanism during training. " +"Defaults to `False` and we don't use recurrent memory mechanism in " +"training mode." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:75 +msgid "" +"Whether or not to use recurrent memory mechanism during evaluation. " +"Defaults to `False` and we don't use recurrent memory mechanism in " +"evaluation mode." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:78 +msgid "" +"Whether or not to return additional information other than the output " +"tensor. If True, then returns information about `output`, `new_mems`, " +"`hidden_states` and `attentions` which will also be formatted as a dict. " +"Else only returns the output tensor. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:84 +msgid "" +"Returns tensor `output` or a dict with key-value pairs: " +"{\"last_hidden_state\": `output`, \"mems\": `mems`, \"hidden_states\": " +"`hidden_states`, \"attentions\": `attentions`}. With the corresponding " +"fields: - `output` (Tensor): Output of the final layer of the model." +" It's a Tensor of dtype `float32` and has a shape of [batch_size, " +"num_predict, hidden_size]. .. note:: `num_predict` " +"corresponds to `target_mapping.shape[1]`. If `target_mapping` is " +"`None`, then `num_predict` equals to `sequence_length`. - `mems` " +"(List[Tensor]): A list of pre-computed hidden-states. The length of " +"the list is `n_layers`. Each element in the list is a Tensor with " +"dtype `float32` and has a shape of [batch_size, sequence_length, " +"hidden_size]. - `hidden_states` (List[Tensor], optional): A list of " +"Tensor containing hidden-states of the model at the output of each layer" +" plus the initial embedding outputs. 
Each Tensor has a data type of " +"`float32` and has a shape of [batch_size, sequence_length, " +"hidden_size]. Being returned when `output_hidden_states` is set to " +"`True`. - `attentions` (List[Tensor], optional): A list of Tensor " +"containing attentions weights of each hidden layer. Each Tensor (one " +"for each layer) has a data type of `float32` and has a shape of " +"[batch_size, num_heads, sequence_length, sequence_length]. Being " +"returned when `output_attentions` is set to `True`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:84 +msgid "" +"Returns tensor `output` or a dict with key-value pairs: " +"{\"last_hidden_state\": `output`, \"mems\": `mems`, \"hidden_states\": " +"`hidden_states`, \"attentions\": `attentions`}." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:32 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:34 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:34 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:88 +msgid "With the corresponding fields:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:95 +msgid "`output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:91 +msgid "" +"Output of the final layer of the model. It's a Tensor of dtype `float32` " +"and has a shape of [batch_size, num_predict, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:95 +msgid "" +"`num_predict` corresponds to `target_mapping.shape[1]`. If " +"`target_mapping` is `None`, then `num_predict` equals to " +"`sequence_length`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:38 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:37 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:99 +msgid "`mems` (List[Tensor]):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:98 +msgid "" +"A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32` and" +" has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:40 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:104 +msgid "`hidden_states` (List[Tensor], optional):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:102 +msgid "" +"A list of Tensor containing hidden-states of the model at the output of " +"each layer plus the initial embedding outputs. Each Tensor has a data " +"type of `float32` and has a shape of [batch_size, sequence_length, " +"hidden_size]. Being returned when `output_hidden_states` is set to " +"`True`." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:45 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:109 +msgid "`attentions` (List[Tensor], optional):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:107 +msgid "" +"A list of Tensor containing attentions weights of each hidden layer. Each" +" Tensor (one for each layer) has a data type of `float32` and has a shape" +" of [batch_size, num_heads, sequence_length, sequence_length]. Being " +"returned when `output_attentions` is set to `True`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:47 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:50 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:46 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:48 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:48 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:114 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:1 +msgid "" +"XLNet Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel:3 +msgid "An instance of :class:`XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:6 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:6 +msgid "The number of classes. Defaults to 2." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:1 +msgid "" +"The XLNetForSequenceClassification forward method, overrides the " +"`__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:44 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:46 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:38 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:40 +#: 
paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:40 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:44 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:40 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:44 +msgid "See :class:`XLNetModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:28 +msgid "" +"Returns tensor `logits` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`}. With the corresponding fields: - " +"`logits` (Tensor): Classification scores before SoftMax (also called " +"logits). It's data type should be `float32` and has a shape of " +"[batch_size, num_classes]. - `mems` (List[Tensor]): See " +":class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): See " +":class:`XLNetModel`. - `attentions` (List[Tensor], optional): See " +":class:`XLNetModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:28 +msgid "" +"Returns tensor `logits` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`}." 
+msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:35 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:37 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:37 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:35 +msgid "" +"Classification scores before SoftMax (also called logits). It's data type" +" should be `float32` and has a shape of [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:1 +msgid "" +"XLNet Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:1 +msgid "" +"The XLNetForTokenClassification forward method, overrides the " +"`__call__()` special method." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:28 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:28 +msgid "" +"Returns tensor `logits` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`}. With the corresponding fields: - " +"`logits` (Tensor): Classification scores before SoftMax (also called " +"logits). It's data type should be `float32` and has a shape of " +"[batch_size, sequence_length, num_classes]. - `mems` (List[Tensor]): " +"See :class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): " +"See :class:`XLNetModel`. - `attentions` (List[Tensor], optional): See" +" :class:`XLNetModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:30 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:30 +msgid "Returns tensor `logits` or a dict with key-value pairs:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:31 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:31 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:31 +msgid "{\"logits\": `logits`, \"mems\": `mems`," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:32 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:32 +msgid "\"hidden_states\": `hidden_states`, \"attentions\": `attentions`}." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:36 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:37 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:37 +msgid "" +"Classification scores before SoftMax (also called logits). It's data type" +" should be `float32` and has a shape of [batch_size, sequence_length, " +"num_classes]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel:1 +msgid "" +"XLNet Model with a language modeling head on top (linear layer with " +"weights tied to the input embeddings)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:1 +msgid "" +"The XLNetLMHeadModel forward method, overrides the `__call__()` special " +"method." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice:1 +msgid "" +"XLNet Model with a multiple choice classification head on top (a linear " +"layer on top of the pooled output and a softmax) e.g. for RACE/SWAG " +"tasks." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:1 +msgid "" +"The XLNetForMultipleChoice forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:28 +msgid "" +"Returns tensor `logtis` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`} With the corresponding fields: - `logits` " +"(Tensor): Classification scores before SoftMax (also called logits). " +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length, num_classes]. - `mems` (List[Tensor]): See " +":class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): See " +":class:`XLNetModel`. - `attentions` (List[Tensor], optional): See " +":class:`XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:30 +msgid "Returns tensor `logtis` or a dict with key-value pairs:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:32 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:32 +msgid "\"hidden_states\": `hidden_states`, \"attentions\": `attentions`}" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:34 +msgid "With the corresponding fields: - `logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering:1 +msgid "" +"XLNet Model with a span classification head on top for extractive " +"question-answering tasks like SQuAD (a linear layers on top of the " +"hidden-states output to compute `span start logits` and `span end " +"logits`)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:1 +msgid "" +"The XLNetForQuestionAnswering forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:28 +msgid "" +"Returns tensor (`start_logits`, `end_logits`) or a dict with key-value " +"pairs: {\"start_logits\": `start_logits`, \"end_logits\": `end_logits`, " +"\"mems\": `mems`, \"hidden_states\": `hidden_states`, \"attentions\": " +"`attentions`} With the corresponding fields: - `start_logits` (Tensor):" +" A tensor of the input token classification logits, indicates the " +"start position of the labelled span. Its data type should be float32 " +"and its shape is [batch_size, sequence_length]. - `end_logits` (Tensor):" +" A tensor of the input token classification logits, indicates the end" +" position of the labelled span. Its data type should be float32 and " +"its shape is [batch_size, sequence_length]. - `mems` (List[Tensor]): " +"See :class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): " +"See :class:`XLNetModel`. - `attentions` (List[Tensor], optional): See" +" :class:`XLNetModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:30 +msgid "" +"Returns tensor (`start_logits`, `end_logits`) or a dict with key-value " +"pairs:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:31 +msgid "" +"{\"start_logits\": `start_logits`, \"end_logits\": `end_logits`, " +"\"mems\": `mems`," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:34 +msgid "With the corresponding fields: - `start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:36 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:39 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:39 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.po new file mode 100644 index 0000000000000000000000000000000000000000..95bbdac3087e9c2265785bc5a2677c1255958900 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.po @@ -0,0 +1,41 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.xlnet.rst:2 +msgid "xlnet" +msgstr "" + +#: of paddlenlp.transformers.xlnet:4 +msgid "" +"`XLNet: Generalized Autoregressive Pretraining for Language Understanding" +" `__ 是一款无监督的自回归预训练语言模型。" +msgstr "" + +#: of paddlenlp.transformers.xlnet:7 +msgid "" +"有别于传统的单向自回归模型,XLNet通过最大化输入序列所有排列的期望来进行语言建模,这使得它可以同时关注到上下文的信息。 " +"另外,XLNet在预训练阶段集成了 `Transformer-XL `__ " +"模型, Transformer-XL中的片段循环机制(Segment Recurrent Mechanism)和相对位置编码(Relative " +"Positional Encoding)机制 能够支持XLNet接受更长的输入序列,这使得XLNet在长文本序列的语言任务上有着优秀的表现。" +msgstr "" + +#: of paddlenlp.transformers.xlnet:12 +msgid "本项目是XLNet在 Paddle 2.0上的开源实现,由modeling和tokenizer两部分组成。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..3272bbcb31be4f7ca0a2952ba18d611dd33df3c3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.tokenizer.po @@ -0,0 +1,352 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.xlnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer:1 +msgid "Tokenization class for XLNet model." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:1 +msgid "" +"Constructs an XLNet tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:7 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:10 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False` and **does not** lowercase the input." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:13 +msgid "" +"Whether or not to strip the text when tokenizing. Defaults to `True` and " +"removes excess spaces before and after the string." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:16 +msgid "" +"Whether or not to keep accents when tokenizing. Defaults to `False` and " +"**does not** keep accents." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:18 +msgid "" +"A special token representing the beginning of a sequence that was used " +"during pretraining. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:21 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:24 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:28 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to `\"\"`." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:31 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:34 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:37 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:41 +msgid "" +"A list of additional special tokens to be used by the tokenizer. Defaults" +" to `[\"\", \"\"]`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:47 +msgid "" +"The *SentencePiece* processor that is used for every conversion (string, " +"tokens and IDs)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer +msgid "type" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:49 +msgid "SentencePieceProcessor" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." 
+msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Builds model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. An XLNet" +" sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:5 +msgid "single sequence: ``X ``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:6 +msgid "pair of sequences: ``A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:8 +msgid "List of IDs for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:10 +msgid "Optional second list of IDs for the second sequenze. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:13 +msgid "List of input IDs with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Builds offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:4 +msgid "An XLNet offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "single sequence: ``X (0,0) (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "pair of sequences: ``A (0,0) B (0,0) (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:9 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:11 +msgid "" +"Optional second list of char offsets for offset mapping pairs. Defaults " +"to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:15 +msgid "A list of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Creates a special tokens mask from the input sequences. This method is " +"called when adding special tokens using the tokenizer `encode` method." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:22 +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:24 +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:6 +msgid "" +"Optional second list of `inputs_ids` for the second sequence. Defaults to" +" `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:9 +msgid "" +"Whether or not the token list already contains special tokens for the " +"model. 
Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:13 +msgid "" +"A list of integers which is either 0 or 1: 1 for a special token, 0 for a" +" sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Creates a token_type mask from the input sequences. If `token_ids_1` is " +"not `None`, then a sequence pair token_type mask has the following " +"format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:10 +msgid "" +"Else if `token_ids_1` is `None`, then a single sequence token_type mask " +"has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:18 +msgid "0 stands for the segment id of **first segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:19 +msgid "1 stands for the segment id of **second segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:20 +msgid "2 stands for the segment id of **cls_token**." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:28 +msgid "List of token type IDs according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.save_resources:1 +msgid "" +"Saves `SentencePiece `__ file " +"(ends with '.spm') under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.batch_sampler.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.batch_sampler.po new file mode 100644 index 0000000000000000000000000000000000000000..d09a8f042cf3c9d9f1daa3e4bd83b6a1efdc3d99 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.batch_sampler.po @@ -0,0 +1,98 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.batch_sampler.rst:2 +msgid "batch\\_sampler" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:1 +msgid "Sampler that restricts data loading to a subset of the dataset." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:3 +msgid "" +"In such case, each process can pass a DistributedBatchSampler instance as" +" a DataLoader sampler, and load a subset of the original dataset that is " +"exclusive to it." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:8 +msgid "Dataset is assumed to be of constant size." 
+msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler +#: paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:10 +msgid "" +"this could be a `paddle.io.Dataset` implement or other python object " +"which implemented `__len__` for BatchSampler to get sample number of data" +" source." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:15 +msgid "sample indice number in a mini-batch indices." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:17 +msgid "" +"porcess number in distributed training. If :attr:`num_replicas` is None, " +":attr:`num_replicas` will be retrieved from " +":code:`paddle.distributed.ParallenEnv`. Default None." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:22 +msgid "" +"the rank of the current process among :attr:`num_replicas` processes. If " +":attr:`rank` is None, :attr:`rank` is retrieved from " +":code:`paddle.distributed.ParallenEnv`. Default None." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:26 +msgid "" +"whther to shuffle indices order before genrating batch indices. Default " +"False." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:29 +msgid "" +"whether drop the last incomplete batch dataset size is not divisible by " +"the batch size. Default False" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:34 +#: paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch:11 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch:1 +msgid "" +"Sets the epoch number. When :attr:`shuffle=True`, this number is used as " +"seeds of random numbers. By default, users may not set this, all replicas" +" (workers) use a different random ordering for each epoch. If set same " +"number at each epoch, this sampler will yield the same ordering at all " +"epoches." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch:7 +msgid "Epoch number." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.downloader.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.downloader.po new file mode 100644 index 0000000000000000000000000000000000000000..996dc51a87061c05139973ef61f9355c337834e0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.downloader.po @@ -0,0 +1,46 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.downloader.rst:2 +msgid "downloader" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url:1 +msgid "" +"Get weights path from WEIGHT_HOME, if not exists, download it from url. 
" +":param url: download url :type url: str :param md5sum: md5 sum of " +"download package :type md5sum: str" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url +msgid "返回" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url:8 +msgid "a local path to save downloaded weights." +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url +msgid "返回类型" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url:12 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.env.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.env.po new file mode 100644 index 0000000000000000000000000000000000000000..63c65bb62a50bc9b3387e7f6ea7b588c1254ce3e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.env.po @@ -0,0 +1,33 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.env.rst:2 +msgid "env" +msgstr "" + +#: of paddlenlp.utils.env:1 +msgid "" +"This module is used to store environmental variables in PaddleNLP. " +"PPNLP_HOME --> the root directory for storing PaddleNLP " +"related data. Default to ~/.paddlenlp. Users can change the ├" +" default value through the PPNLP_HOME " +"environment variable. ├─ MODEL_HOME --> Store model files. " +"└─ DATA_HOME --> Store automatically downloaded datasets." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.file_lock.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.file_lock.po new file mode 100644 index 0000000000000000000000000000000000000000..b6a1ff8f3d66051a171fd1c1d4d91af72bb6ac6c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.file_lock.po @@ -0,0 +1,52 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.utils.file_lock.rst:2 +msgid "file\\_lock" +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLockException:1 +msgid "基类::class:`Exception`" +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock:1 +msgid "" +"A file locking mechanism that has context-manager support so you can use " +"it in a with statement. This should be relatively cross compatible as it " +"doesn't rely on msvcrt or fcntl for the locking." +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock.acquire:1 +msgid "" +"Acquire the lock, if possible. If the lock is in use, it check again " +"every `wait` seconds. 
It does this until it either gets the lock or " +"exceeds `timeout` number of seconds, in which case it throws an " +"exception." +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock.release:1 +msgid "" +"Get rid of the lock by deleting the lockfile. When working in a `with` " +"statement, this gets automatically called at the end." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.import_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.import_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..c63490425c91f68d32219608c323d28c24bf942b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.import_utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.utils.import_utils.rst:2 +msgid "import\\_utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.log.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.log.po new file mode 100644 index 0000000000000000000000000000000000000000..4b4330933c4467aa22056ab07c1e0f09036b1db7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.log.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.log.rst:2 +msgid "log" +msgstr "" + +#: of paddlenlp.utils.log.Logger:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.log.Logger:1 +msgid "Deafult logger in PaddleNLP" +msgstr "" + +#: of paddlenlp.utils.log.Logger paddlenlp.utils.log.Logger.processing +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.log.Logger:3 +msgid "Logger name, default is 'PaddleNLP'" +msgstr "" + +#: of paddlenlp.utils.log.Logger.processing:1 +msgid "Continuously print a progress bar with rotating special effects." +msgstr "" + +#: of paddlenlp.utils.log.Logger.processing:3 +msgid "Message to be printed." +msgstr "" + +#: of paddlenlp.utils.log.Logger.processing:5 +msgid "Rotation interval. Default to 0.1." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..0ecaeaa053ebf5b3651185a2151ce4d6cfbd4c34 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.rst:2 +msgid "paddlenlp.utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.profiler.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.profiler.po new file mode 100644 index 0000000000000000000000000000000000000000..2b6ec35b4f6d6ef5f1b40da6c3610a77cd669c2c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.profiler.po @@ -0,0 +1,92 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.profiler.rst:2 +msgid "profiler" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:1 +msgid "" +"Use a string to initialize a ProfilerOptions. The string should be in the" +" format: \"key1=value1;key2=value;key3=value3\". For example:" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:4 +msgid "" +"\"profile_path=model.profile\" \"batch_range=[50, 60]; " +"profile_path=model.profile\" \"batch_range=[50, 60]; " +"tracer_option=OpDetail; profile_path=model.profile\"" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:16 +msgid "ProfilerOptions supports following key-value pair:" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:9 +msgid "" +"batch_range - a integer list, e.g. [100, 110]. state - a " +"string, the optional values are 'CPU', 'GPU' or 'All'. sorted_key -" +" a string, the optional values are 'calls', 'total'," +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:12 +msgid "'max', 'min' or 'ave." +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:13 +msgid "" +"tracer_option - a string, the optional values are 'Default', " +"'OpDetail'," +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:14 +msgid "'AllOpDetail'." +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:15 +msgid "profile_path - a string, the path to save the serialized profile data," +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:16 +msgid "which can be used to generate a timeline." +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:17 +msgid "exit_on_finished - a boolean." +msgstr "" + +#: of paddlenlp.utils.profiler.add_profiler_step:1 +msgid "" +"Enable the operator-level timing using PaddlePaddle's profiler. The " +"profiler uses a independent variable to count the profiler steps. One " +"call of this function is treated as a profiler step." +msgstr "" + +#: of paddlenlp.utils.profiler.add_profiler_step +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.profiler.add_profiler_step:5 +msgid "Default is None, and the profiler is disabled." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.tools.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.tools.po new file mode 100644 index 0000000000000000000000000000000000000000..be7be41c852f415ea6e5cd1dbeea816de4d66d09 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.tools.po @@ -0,0 +1,130 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.tools.rst:2 +msgid "tools" +msgstr "" + +#: of paddlenlp.utils.tools.static_params_to_dygraph:1 +msgid "Simple tool for convert static paramters to dygraph paramters dict." +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:3 +#: paddlenlp.utils.tools.static_params_to_dygraph:3 +msgid "**NOTE** The model must both support static graph and dygraph mode." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version +#: paddlenlp.utils.tools.dygraph_params_to_static +#: paddlenlp.utils.tools.static_params_to_dygraph +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:5 +#: paddlenlp.utils.tools.static_params_to_dygraph:5 +msgid "the model of a neural network." +msgstr "" + +#: of paddlenlp.utils.tools.static_params_to_dygraph:7 +msgid "" +"path of which locate the saved paramters in static mode. Usualy load by " +"`paddle.static.load_program_state`." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version +#: paddlenlp.utils.tools.dygraph_params_to_static +#: paddlenlp.utils.tools.static_params_to_dygraph +msgid "返回" +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:10 +#: paddlenlp.utils.tools.static_params_to_dygraph:11 +msgid "a state dict the same as the dygraph mode." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version +#: paddlenlp.utils.tools.dygraph_params_to_static +#: paddlenlp.utils.tools.static_params_to_dygraph +msgid "返回类型" +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:1 +msgid "Simple tool for convert dygraph paramters to static paramters dict." +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:7 +msgid "path of which locate the saved paramters in static mode." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage:1 +msgid "" +"Simple tool for calcluating time average cost in the process of training " +"and inferencing." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage.reset:1 +msgid "Reset the recoder state, and reset the `cnt` to zero." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage.record:1 +msgid "Recoding the time cost in current step and accumulating the `cnt`." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage.get_average:1 +msgid "Returning the average time cost after the start of training." +msgstr "" + +#: of paddlenlp.utils.tools.get_env_device:1 +msgid "Return the device name of running environment." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:1 +msgid "" +"The first version string needed to be compared. 
The format of version " +"string should be as follow : \"xxx.yyy.zzz\"." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:4 +msgid "" +"The second version string needed to be compared. The format of version " +"string should be as follow : \"xxx.yyy.zzz\"." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:8 +msgid "" +"The result of comparasion. 1 means version > pair_version; 0 means " +"version = pair_version; -1 means version < pair_version." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:10 +msgid "The result of comparasion. 1 means version > pair_version; 0 means" +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:11 +msgid "version = pair_version; -1 means version < pair_version." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:15 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/trainer.po b/docs/locale/en/LC_MESSAGES/trainer.po new file mode 100644 index 0000000000000000000000000000000000000000..1fe4d12bd7443783f87a8e7c606151752de0f6eb --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/trainer.po @@ -0,0 +1,137 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../trainer.md:1 +msgid "PaddleNLP Trainer API" +msgstr "" + +#: ../trainer.md:3 +msgid "PaddleNLP提供了Trainer训练API,针对训练过程的通用训练配置做了封装,比如:" +msgstr "" + +#: ../trainer.md:5 +msgid "优化器、学习率调度等训练配置" +msgstr "" + +#: ../trainer.md:6 +msgid "多卡,混合精度,梯度累积等功能" +msgstr "" + +#: ../trainer.md:7 +msgid "checkpoint断点,断点重启(数据集,随机数恢复)" +msgstr "" + +#: ../trainer.md:8 +msgid "日志显示,loss可视化展示等" +msgstr "" + +#: ../trainer.md:10 +msgid "用户输入模型,数据集,就可以使用Trainer API高效快速的实现预训练、微调等任务。" +msgstr "" + +#: ../trainer.md:13 +msgid "Trainer基本使用方法介绍" +msgstr "" + +#: ../trainer.md:15 +msgid "" +"下面是用户使用 Trainer API进行finetune任务的简单示例,这里以中文情感分类数据集chnsenticorp为例。 " +"更详细的使用可以参考CLUE Trainer版本。" +msgstr "" + +#: ../trainer.md:18 +msgid "导入需要用到的头文件。" +msgstr "" + +#: ../trainer.md:19 +msgid "主要是模型、Tokenizer" +msgstr "" + +#: ../trainer.md:20 +msgid "还有Trainer组件" +msgstr "" + +#: ../trainer.md:21 +msgid "其中Trainer是训练主要入口,用户传入模型,数据集,即可进行训练" +msgstr "" + +#: ../trainer.md:22 +msgid "TrainingArguments 包含了用户需要的大部分训练参数。" +msgstr "" + +#: ../trainer.md:23 +msgid "PdArgumentParser 是用户输出参数的工具" +msgstr "" + +#: ../trainer.md:31 +msgid "设置好用户参数" +msgstr "" + +#: ../trainer.md:32 +msgid "" +"PdArgumentParser 可以接受多个类似TrainingArguments的参数。用户可以自定义所需要的ModelArguments, " +"DataArguments为 tuple 传入 PdArgumentParser即可。" +msgstr "" + +#: ../trainer.md:33 +msgid "" +"这些参数都是通过python xxx.py --dataset xx --max_seq_length " +"xx的方式传入。TrainingArguments的所有可配置参数见后文。" +msgstr "" + +#: ../trainer.md:50 +msgid "加载模型,tokenizer, 数据集" +msgstr "" + +#: ../trainer.md:51 +msgid "注意,这里的数据集,需要输出的是一个dict。dict中的key,需要和模型的输入名称对应。" +msgstr "" + +#: ../trainer.md:52 +msgid "这里的,labels如果模型没有使用到,我们还需要额外定义criterion,计算最后的loss损失。" +msgstr "" + +#: ../trainer.md:66 +msgid "构造Trainer实例,进行模型训练。" +msgstr "" + +#: ../trainer.md:67 +msgid "这里传入model,criterion,args,train_dataset,tokenizer这些训练需要的组件,构建了实例化的trainer" +msgstr "" + +#: ../trainer.md:68 +msgid 
"使用trainer.train()接口开始训练过程。训练完成后,可以保存模型,保存一些日志。" +msgstr "" + +#: ../trainer.md:84 +msgid "预训练的使用方式可以参考ERNIE-1.0 Trainer版本。" +msgstr "" + +#: ../trainer.md:87 +msgid "Trainer 实例化参数介绍" +msgstr "" + +#: ../trainer.md:88 +msgid "Trainer 是一个简单,但功能完整的 Paddle训练和评估模块,并针对 PaddleNLP 模型进行了优化。" +msgstr "" + +#: ../trainer.md:172 +msgid "TrainingArguments 参数介绍" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/classify.po b/docs/locale/en/LC_MESSAGES/tutorials/classify.po new file mode 100644 index 0000000000000000000000000000000000000000..1690b3b2028482e49d4e038e1306f7e534b06c9b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/classify.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/classify.rst:3 +msgid "文本分类" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/embedding.po b/docs/locale/en/LC_MESSAGES/tutorials/embedding.po new file mode 100644 index 0000000000000000000000000000000000000000..ea594eb16d9d7c0cd6126e4946963e77d445bbe3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/embedding.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/embedding.rst:3 +msgid "词向量" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/general_dialogue.po b/docs/locale/en/LC_MESSAGES/tutorials/general_dialogue.po new file mode 100644 index 0000000000000000000000000000000000000000..bed957915ad3a139c08a70daa5e9a4cd99fa48af --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/general_dialogue.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/general_dialogue.rst:3 +msgid "通用对话" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/lexical_analysis.po b/docs/locale/en/LC_MESSAGES/tutorials/lexical_analysis.po new file mode 100644 index 0000000000000000000000000000000000000000..41ad758c3fb1f4bd94480525b8ed41b97371928c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/lexical_analysis.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/lexical_analysis.rst:3 +msgid "词法分析" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/machine_translation.po b/docs/locale/en/LC_MESSAGES/tutorials/machine_translation.po new file mode 100644 index 0000000000000000000000000000000000000000..8759d5ffc4d65358c5593ab8030cfe7b95e6049f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/machine_translation.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/machine_translation.rst:3 +msgid "机器翻译" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/ner.po b/docs/locale/en/LC_MESSAGES/tutorials/ner.po new file mode 100644 index 0000000000000000000000000000000000000000..3bca60c8ae4bd3e104c59902fdc6c53e24dc1942 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/ner.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/ner.rst:3 +msgid "序列标注" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/overview.po b/docs/locale/en/LC_MESSAGES/tutorials/overview.po new file mode 100644 index 0000000000000000000000000000000000000000..54ca186a71b386c1b623deaef209e2638d10b348 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/overview.po @@ -0,0 +1,133 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/overview.rst:3 +msgid "整体介绍" +msgstr "" + +#: ../tutorials/overview.rst:7 +msgid "案例集" +msgstr "" + +#: ../tutorials/overview.rst:9 +msgid "词向量" +msgstr "" + +#: ../tutorials/overview.rst:11 +msgid "" +"`使用预训练词向量改善模型效果 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:13 +msgid "文本分类" +msgstr "" + +#: ../tutorials/overview.rst:15 +msgid "" +"`基于LSTM等RNN网络的文本分类 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:16 +msgid "" +"`基于预训练模型的文本分类 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:17 +msgid "" +"`自定义数据集实现文本多分类任务 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:19 +msgid "信息抽取" +msgstr "" + +#: ../tutorials/overview.rst:21 +msgid "" +"`使用BiGRU-CRF模型完成快递单信息抽取 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:22 +msgid "" +"`使用预训练模型ERNIE优化快递单信息抽取 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:23 +msgid "`关系抽取 `_" +msgstr "" + +#: ../tutorials/overview.rst:24 +msgid "`事件抽取 `_" +msgstr "" + +#: ../tutorials/overview.rst:26 +msgid "阅读理解式问答" +msgstr "" + +#: ../tutorials/overview.rst:28 +msgid "" +"`使用预训练模型完成阅读理解 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:30 +msgid "对话" +msgstr "" + +#: ../tutorials/overview.rst:32 +msgid "`多技能对话 `_" +msgstr "" + +#: ../tutorials/overview.rst:34 +msgid "文本生成" +msgstr "" + +#: ../tutorials/overview.rst:36 +msgid "" +"`使用Seq2Seq模型完成自动对联 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:37 +msgid "" +"`使用预训练模型ERNIE-GEN实现智能写诗 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:39 +msgid "时序预测" +msgstr "" + +#: ../tutorials/overview.rst:41 +msgid "" +"`使用TCN网络完成新冠疫情病例数预测 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:43 +msgid "" +"更多教程参见 `PaddleNLP on AI Studio " +"`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/reading_comprehension.po b/docs/locale/en/LC_MESSAGES/tutorials/reading_comprehension.po new file mode 100644 index 0000000000000000000000000000000000000000..dd64cc674261d5f00a25ac145e952a4e914f8b08 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/reading_comprehension.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/reading_comprehension.rst:3 +msgid "阅读理解" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/semantic_matching.po b/docs/locale/en/LC_MESSAGES/tutorials/semantic_matching.po new file mode 100644 index 0000000000000000000000000000000000000000..16629a2f69eef79ed57bab50297391190ebabb7a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/semantic_matching.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/semantic_matching.rst:3 +msgid "语义匹配" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/text_generation.po b/docs/locale/en/LC_MESSAGES/tutorials/text_generation.po new file mode 100644 index 0000000000000000000000000000000000000000..22f5a95187db8208fae666677d9df84f4b1f2b42 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/text_generation.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/text_generation.rst:3 +msgid "文本生成" +msgstr "" + diff --git a/docs/metrics.md b/docs/metrics.md new file mode 100644 index 0000000000000000000000000000000000000000..581287460d49eb3bead3920791d562dc9bbd8c69 --- /dev/null +++ b/docs/metrics.md @@ -0,0 +1,15 @@ +# PaddleNLP Metrics API + +目前PaddleNLP提供以下模型评价指标: + +| Metric | 简介 | API | +| ------ | --- | --- | +| [Perplexity](https://en.wikipedia.org/wiki/Perplexity) | 困惑度,常用来衡量语言模型优劣,也可用于机器翻译、文本生成等任务。 | `paddlenlp.metrics.Perplexity` | +| [BLEU(BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) | 机器翻译常用评价指标 | `paddlenlp.metrics.BLEU` | +| [Rouge(Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) | 评估自动文摘以及机器翻译的指标 | `paddlenlp.metrics.RougeL`, `paddlenlp.metrics.RougeN` | +| AccuracyAndF1 | 准确率及F1-score,可用于GLUE中的MRPC 和QQP任务 | `paddlenlp.metrics.AccuracyAndF1` | +| PearsonAndSpearman | 皮尔森相关性系数和斯皮尔曼相关系数。可用于GLUE中的STS-B任务 | `paddlenlp.metrics.PearsonAndSpearman` | +| Mcc(Matthews correlation coefficient) | 马修斯相关系数,用以测量二分类的分类性能的指标。可用于GLUE中的CoLA任务 | `paddlenlp.metrics.Mcc` | +| ChunkEvaluator | 计算了块检测的精确率、召回率和F1-score。常用于序列标记任务,如命名实体识别(NER) | `paddlenlp.metrics.ChunkEvaluator` | +| Squad Evalutaion | 用于SQuAD和DuReader-robust的评价指标 | `paddlenlp.metrics.compute_predictions`, `paddlenlp.metrics.squad_evaluate` | +| [Distinct](https://arxiv.org/abs/1510.03055) | 多样性指标,常用来衡量文本生成模型生成的句子形式上的多样性。 | `paddlenlp.metrics.Distinct` | diff --git a/docs/metrics/metrics.md b/docs/metrics/metrics.md new file mode 100644 index 0000000000000000000000000000000000000000..581287460d49eb3bead3920791d562dc9bbd8c69 --- /dev/null +++ b/docs/metrics/metrics.md @@ -0,0 +1,15 @@ +# PaddleNLP Metrics API + +目前PaddleNLP提供以下模型评价指标: + +| Metric | 简介 | API | +| ------ | --- | --- | +| [Perplexity](https://en.wikipedia.org/wiki/Perplexity) | 困惑度,常用来衡量语言模型优劣,也可用于机器翻译、文本生成等任务。 | `paddlenlp.metrics.Perplexity` | +| [BLEU(BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) | 机器翻译常用评价指标 | `paddlenlp.metrics.BLEU` | +| [Rouge(Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) | 评估自动文摘以及机器翻译的指标 | `paddlenlp.metrics.RougeL`, 
`paddlenlp.metrics.RougeN` | +| AccuracyAndF1 | 准确率及F1-score,可用于GLUE中的MRPC 和QQP任务 | `paddlenlp.metrics.AccuracyAndF1` | +| PearsonAndSpearman | 皮尔森相关性系数和斯皮尔曼相关系数。可用于GLUE中的STS-B任务 | `paddlenlp.metrics.PearsonAndSpearman` | +| Mcc(Matthews correlation coefficient) | 马修斯相关系数,用以测量二分类的分类性能的指标。可用于GLUE中的CoLA任务 | `paddlenlp.metrics.Mcc` | +| ChunkEvaluator | 计算了块检测的精确率、召回率和F1-score。常用于序列标记任务,如命名实体识别(NER) | `paddlenlp.metrics.ChunkEvaluator` | +| Squad Evalutaion | 用于SQuAD和DuReader-robust的评价指标 | `paddlenlp.metrics.compute_predictions`, `paddlenlp.metrics.squad_evaluate` | +| [Distinct](https://arxiv.org/abs/1510.03055) | 多样性指标,常用来衡量文本生成模型生成的句子形式上的多样性。 | `paddlenlp.metrics.Distinct` | diff --git a/docs/model_zoo/embeddings.md b/docs/model_zoo/embeddings.md new file mode 100644 index 0000000000000000000000000000000000000000..b24bb5ba18f56a9f9181dade82a694416a9caf6e --- /dev/null +++ b/docs/model_zoo/embeddings.md @@ -0,0 +1,307 @@ +# PaddleNLP Embedding API + +- [介绍](#介绍) +- [用法](#用法) + * [TokenEmbedding参数](#TokenEmbedding参数) + * [初始化](#初始化) + * [查询embedding结果](#查询embedding结果) + * [可视化embedding结果](#可视化embedding结果) + * [计算词向量cosine相似度](#计算词向量cosine相似度) + * [计算词向量内积](#计算词向量内积) + * [训练](#训练) + * [切词](#切词) +- [预训练模型](#预训练模型) + * [中文词向量](#中文词向量) + * [英文词向量](#英文词向量) + * [Word2Vec](#word2vec) + * [GloVe](#glove) + * [FastText](#fasttext) + * [使用方式](#使用方式) + * [模型信息](#模型信息) +- [致谢](#致谢) +- [参考论文](#参考论文) + +## 介绍 + +PaddleNLP提供多个开源的预训练词向量模型,用户仅需在使用`paddlenlp.embeddings.TokenEmbedding`时,指定预训练模型的名称,即可加载相对应的预训练模型。以下将介绍`TokenEmbedding`详细用法,并列出PaddleNLP所支持的预训练Embedding模型。 + +## 用法 + +### TokenEmbedding参数 + +| 参数 | 类型 | 属性 | +| ------------ | ------------ | ------------ | +| embedding_name | **string** | 预训练embedding名称,可通过paddlenlp.embeddings.list_embedding_name()或[Embedding 模型汇总](#中文词向量)查询。 | +| unknown_token | **string** | unknown token。 | +| unknown_token_vector | **list** 或者 **np.array** | 用来初始化unknown token对应的vector。默认为None(以正态分布方式初始化vector)| +| extended_vocab_path | **string** | 扩展词表的文件名路径。词表格式为一行一个词。 | +| trainable | **bool** | 是否可训练。True表示Embedding可以更新参数,False为不可更新。 | + +### 初始化 +```python +import paddle +from paddlenlp.embeddings import TokenEmbedding, list_embedding_name +paddle.set_device("cpu") + +# 查看预训练embedding名称: +print(list_embedding_name()) # ['w2v.baidu_encyclopedia.target.word-word.dim300'] + +# 初始化TokenEmbedding, 预训练embedding没下载时会自动下载并加载数据 +token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300") + +# 查看token_embedding详情 +print(token_embedding) + +Object type: +Unknown index: 635963 +Unknown token: [UNK] +Padding index: 635964 +Padding token: [PAD] +Parameter containing: +Tensor(shape=[635965, 300], dtype=float32, place=CPUPlace, stop_gradient=False, + [[-0.24200200, 0.13931701, 0.07378800, ..., 0.14103900, 0.05592300, -0.08004800], + [-0.08671700, 0.07770800, 0.09515300, ..., 0.11196400, 0.03082200, -0.12893000], + [-0.11436500, 0.12201900, 0.02833000, ..., 0.11068700, 0.03607300, -0.13763499], + ..., + [ 0.02628800, -0.00008300, -0.00393500, ..., 0.00654000, 0.00024600, -0.00662600], + [-0.00924490, 0.00652097, 0.01049327, ..., -0.01796000, 0.03498908, -0.02209341], + [ 0. , 0. , 0. , ..., 0. , 0. , 0. 
]]) + +``` + +### 查询embedding结果 + +```python +test_token_embedding = token_embedding.search("中国") +print(test_token_embedding) +[[ 0.260801 0.1047 0.129453 -0.257317 -0.16152 0.19567 -0.074868 + 0.361168 0.245882 -0.219141 -0.388083 0.235189 0.029316 0.154215 + -0.354343 0.017746 0.009028 0.01197 -0.121429 0.096542 0.009255 + ..., + -0.260592 -0.019668 -0.063312 -0.094939 0.657352 0.247547 -0.161621 + 0.289043 -0.284084 0.205076 0.059885 0.055871 0.159309 0.062181 + 0.123634 0.282932 0.140399 -0.076253 -0.087103 0.07262 ]] +``` + +### 可视化embedding结果 +使用深度学习可视化工具[VisualDL](https://github.com/PaddlePaddle/VisualDL)的High Dimensional组件可以对embedding结果进行可视化展示,便于对其直观分析,步骤如下: +```python +# 获取词表中前1000个单词 +labels = token_embedding.vocab.to_tokens(list(range(0,1000))) +test_token_embedding = token_embedding.search(labels) + +# 引入VisualDL的LogWriter记录日志 +from visualdl import LogWriter + +with LogWriter(logdir='./visualize') as writer: + writer.add_embeddings(tag='test', mat=test_token_embedding, metadata=labels) +``` +执行完毕后会在当前路径下生成一个visualize目录,并将日志存放在其中,我们在命令行启动VisualDL即可进行查看,启动命令为: +```shell +visualdl --logdir ./visualize +``` +启动后打开浏览器即可看到可视化结果 + +
+（此处为 VisualDL High Dimensional 组件展示 embedding 可视化结果的效果图）
+
+ +使用VisualDL除可视化embedding结果外,还可以对标量、图片、音频等进行可视化,有效提升训练调参效率。关于VisualDL更多功能和详细介绍,可参考[VisualDL使用文档](https://github.com/PaddlePaddle/VisualDL/tree/develop/docs)。 + +### 计算词向量cosine相似度 + +```python +score = token_embedding.cosine_sim("中国", "美国") +print(score) # 0.49586025 +``` + +### 计算词向量内积 + +```python +score = token_embedding.dot("中国", "美国") +print(score) # 8.611071 +``` + + +### 训练 + +以下为`TokenEmbedding`简单的组网使用方法。有关更多`TokenEmbedding`训练流程相关的使用方法,请参考[Word Embedding with PaddleNLP](../../examples/word_embedding/README.md)。 + +```python +in_words = paddle.to_tensor([0, 2, 3]) +input_embeddings = token_embedding(in_words) +linear = paddle.nn.Linear(token_embedding.embedding_dim, 20) +input_fc = linear(input_embeddings) +print(input_fc) +Tensor(shape=[3, 20], dtype=float32, place=CPUPlace, stop_gradient=False, + [[ 0. , 0. , 0. , ..., 0. , 0. , 0. ], + [-0.23473957, 0.17878169, 0.07215232, ..., 0.03698236, 0.14291850, 0.05136518], + [-0.42466098, 0.15017235, -0.04780108, ..., -0.04995505, 0.15847842, 0.00025209]]) +``` + +### 切词 + +```python +from paddlenlp.data import JiebaTokenizer +tokenizer = JiebaTokenizer(vocab=token_embedding.vocab) +words = tokenizer.cut("中国人民") +print(words) # ['中国人', '民'] + +tokens = tokenizer.encode("中国人民") +print(tokens) # [12530, 1334] +``` + +## 预训练模型 + +以下将列举PaddleNLP支持的Embedding预训练模型。 +- 模型命名方式为:\${训练模型}.\${语料}.\${词向量类型}.\${co-occurrence type}.dim\${维度}。 +- 模型有三种,分别是Word2Vec(w2v, skip-gram), GloVe(glove)和FastText(fasttext)。 + +### 中文词向量 + +以下预训练词向量由[Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)提供。 + +根据不同类型的上下文为每个语料训练多个目标词向量,第二列开始表示不同类型的上下文。以下为上下文类别: + +* Word表示训练时目标词预测的上下文是一个Word。 +* Word + N-gram表示训练时目标词预测的上下文是一个Word或者Ngram,其中bigram表示2-grams,ngram.1-2表示1-gram或者2-grams。 +* Word + Character表示训练时目标词预测的上下文是一个Word或者Character,其中word-character.char1-2表示上下文是1个或2个Character。 +* Word + Character + Ngram表示训练时目标词预测的上下文是一个Word、Character或者Ngram。bigram-char表示上下文是2-grams或者1个Character。 + +| 语料 | Word | Word + N-gram | Word + Character | Word + Character + N-gram | +| ------------------------------------------- | ---- | ---- | ---- | ---- | +| Baidu Encyclopedia 百度百科 | w2v.baidu_encyclopedia.target.word-word.dim300 | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | w2v.baidu_encyclopedia.target.bigram-char.dim300 | +| Wikipedia_zh 中文维基百科 | w2v.wiki.target.word-word.dim300 | w2v.wiki.target.word-bigram.dim300 | w2v.wiki.target.word-char.dim300 | w2v.wiki.target.bigram-char.dim300 | +| People's Daily News 人民日报 | w2v.people_daily.target.word-word.dim300 | w2v.people_daily.target.word-bigram.dim300 | w2v.people_daily.target.word-char.dim300 | w2v.people_daily.target.bigram-char.dim300 | +| Sogou News 搜狗新闻 | w2v.sogou.target.word-word.dim300 | w2v.sogou.target.word-bigram.dim300 | w2v.sogou.target.word-char.dim300 | w2v.sogou.target.bigram-char.dim300 | +| Financial News 金融新闻 | w2v.financial.target.word-word.dim300 | w2v.financial.target.word-bigram.dim300 | w2v.financial.target.word-char.dim300 | w2v.financial.target.bigram-char.dim300 | +| Zhihu_QA 知乎问答 | w2v.zhihu.target.word-word.dim300 | w2v.zhihu.target.word-bigram.dim300 | w2v.zhihu.target.word-char.dim300 | w2v.zhihu.target.bigram-char.dim300 | +| Weibo 微博 | w2v.weibo.target.word-word.dim300 | w2v.weibo.target.word-bigram.dim300 | w2v.weibo.target.word-char.dim300 | w2v.weibo.target.bigram-char.dim300 | +| Literature 文学作品 | w2v.literature.target.word-word.dim300 | w2v.literature.target.word-bigram.dim300 | 
w2v.literature.target.word-char.dim300 | w2v.literature.target.bigram-char.dim300 | +| Complete Library in Four Sections 四库全书 | w2v.sikuquanshu.target.word-word.dim300 | w2v.sikuquanshu.target.word-bigram.dim300 | 无 | 无 | +| Mixed-large 综合 | w2v.mixed-large.target.word-word.dim300 | 暂无 | w2v.mixed-large.target.word-word.dim300 | 暂无 | + +特别地,对于百度百科语料,在不同的 Co-occurrence类型下分别提供了目标词与上下文向量: + +| Co-occurrence 类型 | 目标词向量 | 上下文词向量 | +| --------------------------- | ------ | ---- | +| Word → Word | w2v.baidu_encyclopedia.target.word-word.dim300 | w2v.baidu_encyclopedia.context.word-word.dim300 | +| Word → Ngram (1-2) | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300 | +| Word → Ngram (1-3) | w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300 | w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300 | +| Ngram (1-2) → Ngram (1-2)| w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | +| Word → Character (1) | w2v.baidu_encyclopedia.target.word-character.char1-1.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-1.dim300 | +| Word → Character (1-2) | w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-2.dim300 | +| Word → Character (1-4) | w2v.baidu_encyclopedia.target.word-character.char1-4.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-4.dim300 | +| Word → Word (left/right) | w2v.baidu_encyclopedia.target.word-wordLR.dim300 | w2v.baidu_encyclopedia.context.word-wordLR.dim300 | +| Word → Word (distance) | w2v.baidu_encyclopedia.target.word-wordPosition.dim300 | w2v.baidu_encyclopedia.context.word-wordPosition.dim300 | + +### 英文词向量 + +### Word2Vec + +| 语料 | 名称 | +|------|------| +| Google News | w2v.google_news.target.word-word.dim300.en | + +### GloVe + +| 语料 | 25维 | 50维 | 100维 | 200维 | 300 维 | +| ----------------- | ------ | ------ | ------ | ------ | ------ | +| Wiki2014 + GigaWord | 无 | glove.wiki2014-gigaword.target.word-word.dim50.en | glove.wiki2014-gigaword.target.word-word.dim100.en | glove.wiki2014-gigaword.target.word-word.dim200.en | glove.wiki2014-gigaword.target.word-word.dim300.en | +| Twitter | glove.twitter.target.word-word.dim25.en | glove.twitter.target.word-word.dim50.en | glove.twitter.target.word-word.dim100.en | glove.twitter.target.word-word.dim200.en | 无 | + +### FastText + +| 语料 | 名称 | +|------|------| +| Wiki2017 | fasttext.wiki-news.target.word-word.dim300.en | +| Crawl | fasttext.crawl.target.word-word.dim300.en | + +### 使用方式 + +以上所述的模型名称可直接以参数形式传入`padddlenlp.embeddings.TokenEmbedding`,加载相对应的模型。比如要加载语料为Wiki2017,通过FastText训练的预训练模型(`fasttext.wiki-news.target.word-word.dim300.en`),只需执行以下代码: + +```python +import paddle +from paddlenlp.embeddings import TokenEmbedding + +token_embedding = TokenEmbedding(embedding_name="fasttext.wiki-news.target.word-word.dim300.en") +``` + +### 模型信息 + +| 模型 | 文件大小 | 词表大小 | +|-----|---------|---------| +| w2v.baidu_encyclopedia.target.word-word.dim300 | 678.21 MB | 635965 | +| w2v.baidu_encyclopedia.target.word-character.char1-1.dim300 | 679.15 MB | 636038 | +| w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | 679.30 MB | 636038 | +| w2v.baidu_encyclopedia.target.word-character.char1-4.dim300 | 679.51 MB | 636038 | +| w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | 679.48 MB | 635977 | +| w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300 | 671.27 MB | 628669 | +| w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | 7.28 GB | 
6969069 | +| w2v.baidu_encyclopedia.target.word-wordLR.dim300 | 678.22 MB | 635958 | +| w2v.baidu_encyclopedia.target.word-wordPosition.dim300 | 679.32 MB | 636038 | +| w2v.baidu_encyclopedia.target.bigram-char.dim300 | 679.29 MB | 635976 | +| w2v.baidu_encyclopedia.context.word-word.dim300 | 677.74 MB | 635952 | +| w2v.baidu_encyclopedia.context.word-character.char1-1.dim300 | 678.65 MB | 636200 | +| w2v.baidu_encyclopedia.context.word-character.char1-2.dim300 | 844.23 MB | 792631 | +| w2v.baidu_encyclopedia.context.word-character.char1-4.dim300 | 1.16 GB | 1117461 | +| w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300 | 7.25 GB | 6967598 | +| w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300 | 5.21 GB | 5000001 | +| w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300 | 7.26 GB | 6968998 | +| w2v.baidu_encyclopedia.context.word-wordLR.dim300 | 1.32 GB | 1271031 | +| w2v.baidu_encyclopedia.context.word-wordPosition.dim300 | 6.47 GB | 6293920 | +| w2v.wiki.target.bigram-char.dim300 | 375.98 MB | 352274 | +| w2v.wiki.target.word-char.dim300 | 375.52 MB | 352223 | +| w2v.wiki.target.word-word.dim300 | 374.95 MB | 352219 | +| w2v.wiki.target.word-bigram.dim300 | 375.72 MB | 352219 | +| w2v.people_daily.target.bigram-char.dim300 | 379.96 MB | 356055 | +| w2v.people_daily.target.word-char.dim300 | 379.45 MB | 355998 | +| w2v.people_daily.target.word-word.dim300 | 378.93 MB | 355989 | +| w2v.people_daily.target.word-bigram.dim300 | 379.68 MB | 355991 | +| w2v.weibo.target.bigram-char.dim300 | 208.24 MB | 195199 | +| w2v.weibo.target.word-char.dim300 | 208.03 MB | 195204 | +| w2v.weibo.target.word-word.dim300 | 207.94 MB | 195204 | +| w2v.weibo.target.word-bigram.dim300 | 208.19 MB | 195204 | +| w2v.sogou.target.bigram-char.dim300 | 389.81 MB | 365112 | +| w2v.sogou.target.word-char.dim300 | 389.89 MB | 365078 | +| w2v.sogou.target.word-word.dim300 | 388.66 MB | 364992 | +| w2v.sogou.target.word-bigram.dim300 | 388.66 MB | 364994 | +| w2v.zhihu.target.bigram-char.dim300 | 277.35 MB | 259755 | +| w2v.zhihu.target.word-char.dim300 | 277.40 MB | 259940 | +| w2v.zhihu.target.word-word.dim300 | 276.98 MB | 259871 | +| w2v.zhihu.target.word-bigram.dim300 | 277.53 MB | 259885 | +| w2v.financial.target.bigram-char.dim300 | 499.52 MB | 467163 | +| w2v.financial.target.word-char.dim300 | 499.17 MB | 467343 | +| w2v.financial.target.word-word.dim300 | 498.94 MB | 467324 | +| w2v.financial.target.word-bigram.dim300 | 499.54 MB | 467331 | +| w2v.literature.target.bigram-char.dim300 | 200.69 MB | 187975 | +| w2v.literature.target.word-char.dim300 | 200.44 MB | 187980 | +| w2v.literature.target.word-word.dim300 | 200.28 MB | 187961 | +| w2v.literature.target.word-bigram.dim300 | 200.59 MB | 187962 | +| w2v.sikuquanshu.target.word-word.dim300 | 20.70 MB | 19529 | +| w2v.sikuquanshu.target.word-bigram.dim300 | 20.77 MB | 19529 | +| w2v.mixed-large.target.word-char.dim300 | 1.35 GB | 1292552 | +| w2v.mixed-large.target.word-word.dim300 | 1.35 GB | 1292483 | +| w2v.google_news.target.word-word.dim300.en | 1.61 GB | 3000000 | +| glove.wiki2014-gigaword.target.word-word.dim50.en | 73.45 MB | 400002 | +| glove.wiki2014-gigaword.target.word-word.dim100.en | 143.30 MB | 400002 | +| glove.wiki2014-gigaword.target.word-word.dim200.en | 282.97 MB | 400002 | +| glove.wiki2014-gigaword.target.word-word.dim300.en | 422.83 MB | 400002 | +| glove.twitter.target.word-word.dim25.en | 116.92 MB | 1193516 | +| glove.twitter.target.word-word.dim50.en | 221.64 MB | 1193516 | +| glove.twitter.target.word-word.dim100.en | 431.08 
MB | 1193516 | +| glove.twitter.target.word-word.dim200.en | 848.56 MB | 1193516 | +| fasttext.wiki-news.target.word-word.dim300.en | 541.63 MB | 999996 | +| fasttext.crawl.target.word-word.dim300.en | 1.19 GB | 2000002 | + +## 致谢 +- 感谢 [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)提供Word2Vec中文预训练词向量。 +- 感谢 [GloVe Project](https://nlp.stanford.edu/projects/glove)提供的GloVe英文预训练词向量。 +- 感谢 [FastText Project](https://fasttext.cc/docs/en/english-vectors.html)提供的英文预训练词向量。 + +## 参考论文 +- Li, Shen, et al. "Analogical reasoning on chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018). +- Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. +- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. +- T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. Advances in Pre-Training Distributed Word Representations. diff --git a/docs/model_zoo/index.rst b/docs/model_zoo/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..d15e4b6ab022179267ae8afa63cd4705eaa5184c --- /dev/null +++ b/docs/model_zoo/index.rst @@ -0,0 +1,311 @@ + + +PaddleNLP Transformer预训练模型 +==================================== + +随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新了不同NLP任务的SOTA(State of the Art),极大地推动了自然语言处理的进展。 +PaddleNLP为用户提供了常用的预训练模型及其相应权重,如 ``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等,采用统一的API进行加载、训练和调用, +让开发者能够方便快捷地应用各种Transformer类预训练模型及其下游任务,且相应预训练模型权重下载速度快、稳定。 + +------------------------------------ +预训练模型使用方法 +------------------------------------ + +PaddleNLP Transformer API在提供丰富预训练模型的同时,也降低了用户的使用门槛。 +使用Auto模块,可以加载不同网络结构的预训练模型,无需查找模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-tuning。 + +.. code:: python + + from functools import partial + import numpy as np + + import paddle + from paddlenlp.datasets import load_dataset + from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + + train_ds = load_dataset("chnsenticorp", splits=["train"]) + + model = AutoModelForSequenceClassification.from_pretrained("bert-wwm-chinese", num_classes=len(train_ds.label_list)) + + tokenizer = AutoTokenizer.from_pretrained("bert-wwm-chinese") + + def convert_example(example, tokenizer): + encoded_inputs = tokenizer(text=example["text"], max_seq_len=512, pad_to_max_seq_len=True) + return tuple([np.array(x, dtype="int64") for x in [ + encoded_inputs["input_ids"], encoded_inputs["token_type_ids"], [example["label"]]]]) + train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer)) + + batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=8, shuffle=True) + train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=batch_sampler, return_list=True) + + optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters()) + + criterion = paddle.nn.loss.CrossEntropyLoss() + + for input_ids, token_type_ids, labels in train_data_loader(): + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + +上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, +可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 `_ + +1. 加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。 +2. 
加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 + Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, + 无需指定类别,即可调用不同网络结构的预训练模型。 + 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。 + ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 ``num_classes`` 等, + 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 ``from_pretrained`` 方法加载。 +3. 通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。 +4. 定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。 +5. 定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。 + +------------------------------------ +Transformer预训练模型汇总 +------------------------------------ + +PaddleNLP的Transformer预训练模型包含从 `huggingface.co`_ 直接转换的模型权重和百度自研模型权重,方便社区用户直接迁移使用。 +目前共包含了40多个主流预训练模型,500多个模型权重。 + +.. _huggingface.co: https://huggingface.co/models + +.. toctree:: + :maxdepth: 3 + + ALBERT + BART + BERT + BigBird + Blenderbot + Blenderbot-Small + ChineseBert + ConvBert + CTRL + DistilBert + ELECTRA + ERNIE + ERNIE-CTM + ERNIE-DOC + ERNIE-GEN + ERNIE-GRAM + ERNIE-M + FNet + Funnel + GPT + LayoutLM + LayoutLMV2 + LayoutXLM + Luke + MBart + MegatronBert + MobileBert + MPNet + NeZha + PPMiniLM + ProphetNet + Reformer + RemBert + RoBERTa + RoFormer + SKEP + SqueezeBert + T5 + TinyBert + UnifiedTransformer + UNIMO + XLNet + + + +------------------------------------ +Transformer预训练模型适用任务汇总 +------------------------------------ + ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice | ++====================+=========================+======================+====================+=================+=================+ +|ALBERT_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|BART_ | ✅ | ✅ | ✅ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|BERT_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|BigBird_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Blenderbot_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Blenderbot-Small_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ChineseBert_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ConvBert_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|CTRL_ | ✅ | ❌ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|DistilBert_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ELECTRA_ | ✅ | ✅ | ✅ | ❌ | ✅ | 
++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-CTM_ | ❌ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-DOC_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-GEN_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-GRAM_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-M_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|FNet_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Funnel_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|GPT_ | ✅ | ✅ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|LayoutLM_ | ✅ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|LayoutLMV2_ | ❌ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|LayoutXLM_ | ❌ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Luke_ | ❌ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MBart_ | ✅ | ❌ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MegatronBert_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MobileBert_ | ✅ | ❌ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MPNet_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|NeZha_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|PPMiniLM_ | ✅ | ❌ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ProphetNet_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Reformer_ | ✅ | ❌ | ✅ | ❌ | ❌ | 
++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|RemBert_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|RoBERTa_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|RoFormer_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|SKEP_ | ✅ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|SqueezeBert_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|T5_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|TinyBert_ | ✅ | ❌ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|UnifiedTransformer_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|XLNet_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ + +.. _ALBERT: https://arxiv.org/abs/1909.11942 +.. _BART: https://arxiv.org/abs/1910.13461 +.. _BERT: https://arxiv.org/abs/1810.04805 +.. _BERT-Japanese: https://arxiv.org/abs/1810.04805 +.. _BigBird: https://arxiv.org/abs/2007.14062 +.. _Blenderbot: https://arxiv.org/pdf/2004.13637.pdf +.. _Blenderbot-Small: https://arxiv.org/pdf/2004.13637.pdf +.. _ChineseBert: https://arxiv.org/abs/2106.16038 +.. _ConvBert: https://arxiv.org/abs/2008.02496 +.. _CTRL: https://arxiv.org/abs/1909.05858 +.. _DistilBert: https://arxiv.org/abs/1910.01108 +.. _ELECTRA: https://arxiv.org/abs/2003.10555 +.. _ERNIE: https://arxiv.org/abs/1904.09223 +.. _ERNIE-CTM: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm +.. _ERNIE-DOC: https://arxiv.org/abs/2012.15688 +.. _ERNIE-GEN: https://arxiv.org/abs/2001.11314 +.. _ERNIE-GRAM: https://arxiv.org/abs/2010.12148 +.. _ERNIE-M: https://arxiv.org/abs/2012.15674 +.. _FNet: https://arxiv.org/abs/2105.03824 +.. _Funnel: https://arxiv.org/abs/2006.03236 +.. _GPT: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf +.. _LayoutLM: https://arxiv.org/abs/1912.13318 +.. _LayoutLMV2: https://arxiv.org/abs/2012.14740 +.. _LayoutXLM: https://arxiv.org/abs/2104.08836 +.. _Luke: https://arxiv.org/abs/2010.01057 +.. _MBart: https://arxiv.org/abs/2001.08210 +.. _MegatronBert: https://arxiv.org/abs/1909.08053 +.. _MobileBert: https://arxiv.org/abs/2004.02984 +.. _MPNet: https://arxiv.org/abs/2004.09297 +.. _NeZha: https://arxiv.org/abs/1909.00204 +.. _PPMiniLM: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm +.. _ProphetNet: https://arxiv.org/abs/2001.04063 +.. _Reformer: https://arxiv.org/abs/2001.04451 +.. _RemBert: https://arxiv.org/abs/2010.12821 +.. _RoBERTa: https://arxiv.org/abs/1907.11692 +.. _RoFormer: https://arxiv.org/abs/2104.09864 +.. 
_SKEP: https://arxiv.org/abs/2005.05635 +.. _SqueezeBert: https://arxiv.org/abs/2006.11316 +.. _T5: https://arxiv.org/abs/1910.10683 +.. _TinyBert: https://arxiv.org/abs/1909.10351 +.. _UnifiedTransformer: https://arxiv.org/abs/2006.16779 +.. _UNIMO: https://arxiv.org/abs/2012.15409 +.. _XLNet: https://arxiv.org/abs/1906.08237 + +------------------------------------ +Reference +------------------------------------ +- 部分中文预训练模型来自: + `brightmart/albert_zh `_, + `ymcui/Chinese-BERT-wwm `_, + `huawei-noah/Pretrained-Language-Model/TinyBERT `_, + `ymcui/Chinese-XLNet `_, + `huggingface/xlnet_chinese_large `_, + `Knover/luge-dialogue `_, + `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ `_, + `ZhuiyiTechnology/simbert `_ +- Lan, Zhenzhong, et al. "Albert: A lite bert for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019). +- Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." arXiv preprint arXiv:1910.13461 (2019). +- Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). +- Zaheer, Manzil, et al. "Big bird: Transformers for longer sequences." arXiv preprint arXiv:2007.14062 (2020). +- Stephon, Emily, et al. "Blenderbot: Recipes for building an open-domain chatbot." arXiv preprint arXiv:2004.13637 (2020). +- Stephon, Emily, et al. "Blenderbot-Small: Recipes for building an open-domain chatbot." arXiv preprint arXiv:2004.13637 (2020). +- Sun, Zijun, et al. "Chinesebert: Chinese pretraining enhanced by glyph and pinyin information." arXiv preprint arXiv:2106.16038 (2021). +- Zhang, zhengyan, et al. "CPM: A Large-scale Generative Chinese Pre-trained Language Model." arXiv preprint arXiv:2012.00413 (2020). +- Jiang, Zihang, et al. "ConvBERT: Improving BERT with Span-based Dynamic Convolution." arXiv preprint arXiv:2008.02496 (2020). +- Nitish, Bryan, et al. "CTRL: A Conditional Transformer Language Model for Controllable Generation." arXiv preprint arXiv:1909.05858 (2019). +- Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019). +- Clark, Kevin, et al. "Electra: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020). +- Sun, Yu, et al. "Ernie: Enhanced representation through knowledge integration." arXiv preprint arXiv:1904.09223 (2019). +- Ding, Siyu, et al. "ERNIE-Doc: A retrospective long-document modeling transformer." arXiv preprint arXiv:2012.15688 (2020). +- Xiao, Dongling, et al. "Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation." arXiv preprint arXiv:2001.11314 (2020). +- Xiao, Dongling, et al. "ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding." arXiv preprint arXiv:2010.12148 (2020). +- Ouyang, Xuan, et al. "ERNIE-M: enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora." arXiv preprint arXiv:2012.15674 (2020). +- Lee-Thorp, James, et al. "Fnet: Mixing tokens with fourier transforms." arXiv preprint arXiv:2105.03824 (2021). +- Dai, Zihang, et al. "Funnel-transformer: Filtering out sequential redundancy for efficient language processing." Advances in neural information processing systems 33 (2020): 4271-4282. +- Radford, Alec, et al. 
"Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9. +- Xu, Yiheng, et al. "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." arXiv preprint arXiv:1912.13318 (2019). +- Xu, Yang, et al. "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding" arXiv preprint arXiv:2012.14740 (2020). +- Xu, Yiheng, et al. "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding" arXiv preprint arXiv:2104.08836 (2021). +- Yamada, Ikuya, et al. "Luke: deep contextualized entity representations with entity-aware self-attention." arXiv preprint arXiv:2010.01057 (2020). +- Liu, Yinhan, et al. "MBart: Multilingual Denoising Pre-training for Neural Machine Translation" arXiv preprint arXiv:2001.08210 (2020). +- Shoeybi, Mohammad, et al. "Megatron-lm: Training multi-billion parameter language models using model parallelism." arXiv preprint arXiv:1909.08053 (2019). +- Sun, Zhiqing, et al. "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices" arXiv preprint arXiv:2004.02984 (2020). +- Song, Kaitao, et al. "MPNet: Masked and Permuted Pre-training for Language Understanding." arXiv preprint arXiv:2004.09297 (2020). +- Wei, Junqiu, et al. "NEZHA: Neural contextualized representation for chinese language understanding." arXiv preprint arXiv:1909.00204 (2019). +- Qi, Weizhen, et al. "Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training." arXiv preprint arXiv:2001.04063 (2020). +- Kitaev, Nikita, et al. "Reformer: The efficient Transformer." arXiv preprint arXiv:2001.04451 (2020). +- Chung, Hyung Won, et al. "Rethinking embedding coupling in pre-trained language models." arXiv preprint arXiv:2010.12821 (2020). +- Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019). +- Su Jianlin, et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv preprint arXiv:2104.09864 (2021). +- Tian, Hao, et al. "SKEP: Sentiment knowledge enhanced pre-training for sentiment analysis." arXiv preprint arXiv:2005.05635 (2020). +- Forrest, ALbert, et al. "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?" arXiv preprint arXiv:2006.11316 (2020). +- Raffel, Colin, et al. "T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv preprint arXiv:1910.10683 (2019). +- Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017). +- Jiao, Xiaoqi, et al. "Tinybert: Distilling bert for natural language understanding." arXiv preprint arXiv:1909.10351 (2019). +- Bao, Siqi, et al. "Plato-2: Towards building an open-domain chatbot via curriculum learning." arXiv preprint arXiv:2006.16779 (2020). +- Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." arXiv preprint arXiv:1906.08237 (2019). +- Cui, Yiming, et al. "Pre-training with whole word masking for chinese bert." arXiv preprint arXiv:1906.08101 (2019). +- Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021). diff --git a/docs/model_zoo/taskflow.md b/docs/model_zoo/taskflow.md new file mode 100644 index 0000000000000000000000000000000000000000..fde8f31b6bcc069a09f9129226495e0592c24b4e --- /dev/null +++ b/docs/model_zoo/taskflow.md @@ -0,0 +1,1912 @@ +# PaddleNLP一键预测功能:Taskflow API + + + +
+ + +------------------------------------------------------------------------------------------ + +## 特性 +PaddleNLP提供**开箱即用**的产业级NLP预置任务能力,无需训练,一键预测。 +- 最全的中文任务:覆盖自然语言理解与自然语言生成两大核心应用; +- 极致的产业级效果:在多个中文场景上提供产业级的精度与预测性能; +- 统一的应用范式:通过`paddlenlp.Taskflow`调用,简捷易用。 + +| 任务名称 | 调用方式 | 一键预测 | 单条输入 | 多条输入 | 文档级输入 | 定制化训练 | 其它特性 | +| :--------------------------------- | -------------------------------- | -------- | -------- | -------- | ---------- | ---------- | ------------------------------------------------------ | +| [中文分词](#中文分词) | `Taskflow("word_segmentation")` | ✅ | ✅ | ✅ | ✅ | ✅ | 多种分词模式,满足快速切分和实体粒度精准切分 | +| [词性标注](#词性标注) | `Taskflow("pos_tagging")` | ✅ | ✅ | ✅ | ✅ | ✅ | 基于百度前沿词法分析工具LAC | +| [命名实体识别](#命名实体识别) | `Taskflow("ner")` | ✅ | ✅ | ✅ | ✅ | ✅ | 覆盖最全中文实体标签 | +| [依存句法分析](#依存句法分析) | `Taskflow("dependency_parsing")` | ✅ | ✅ | ✅ | | ✅ | 基于最大规模中文依存句法树库研发的DDParser | +| [信息抽取](#信息抽取) | `Taskflow("information_extraction")`| ✅ | ✅ | ✅ | ✅ | ✅ | 适配多场景的开放域通用信息抽取工具 | +| [『解语』-知识标注](#解语知识标注) | `Taskflow("knowledge_mining")` | ✅ | ✅ | ✅ | ✅ | ✅ | 覆盖所有中文词汇的知识标注工具 | +| [文本纠错](#文本纠错) | `Taskflow("text_correction")` | ✅ | ✅ | ✅ | ✅ | ✅ | 融合拼音特征的端到端文本纠错模型ERNIE-CSC | +| [文本相似度](#文本相似度) | `Taskflow("text_similarity")` | ✅ | ✅ | ✅ | | | 基于百万量级Dureader Retrieval数据集训练RocketQA并达到前沿文本相似效果| +| [情感分析](#情感分析) | `Taskflow("sentiment_analysis")` | ✅ | ✅ | ✅ | | ✅ | 集成BiLSTM、SKEP、UIE等模型,支持评论维度、观点抽取、情感极性分类等情感分析任务 | +| [生成式问答](#生成式问答) | `Taskflow("question_answering")` | ✅ | ✅ | ✅ | | | 使用最大中文开源CPM模型完成问答 | +| [智能写诗](#智能写诗) | `Taskflow("poetry_generation")` | ✅ | ✅ | ✅ | | | 使用最大中文开源CPM模型完成写诗 | +| [开放域对话](#开放域对话) | `Taskflow("dialogue")` | ✅ | ✅ | ✅ | | | 十亿级语料训练最强中文闲聊模型PLATO-Mini,支持多轮对话 | +| [代码生成](#代码生成) | `Taskflow("code_generation")` | ✅ | ✅ | ✅ | ✅ | | 代码生成大模型 | +| [文本摘要](#文本摘要) | `Taskflow("text_summarization")` | ✅ | ✅ | ✅ | ✅ | | 文本摘要大模型 | +| [文档智能](#文档智能) | `Taskflow("document_intelligence")` | ✅ | ✅ | ✅ | ✅ | | 以多语言跨模态布局增强文档预训练模型ERNIE-Layout为核心底座 | +| [问题生成](#问题生成) | `Taskflow("question_generation")` | ✅ | ✅ | ✅ | ✅ | | 问题生成大模型 | +| [零样本文本分类](#零样本文本分类) | `Taskflow("zero_shot_text_classification")` | ✅ | ✅ | ✅ | | ✅ | 集成多场景的通用文本分类工具 | +| [模型特征提取](#模型特征提取) | `Taskflow("feature_extraction")` | ✅ | ✅ | ✅ | ✅ | | 集成文本,图片的特征抽取工具 | + +## QuickStart + +**环境依赖** + - python >= 3.6 + - paddlepaddle >= 2.3.0 + - paddlenlp >= 2.3.4 + +![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif) + +可进入 Jupyter Notebook 环境,在线体验 👉🏻 [进入在线运行环境](https://aistudio.baidu.com/aistudio/projectdetail/3696243) + +PaddleNLP Taskflow API 支持任务持续丰富中,我们将根据开发者反馈,灵活调整功能建设优先级,可通过Issue或[问卷](https://iwenjuan.baidu.com/?code=44amg8)反馈给我们。 + +## 社区交流👬 + +- 微信扫描二维码并填写问卷之后,加入交流群领取福利 + - 获取5月18-19日每晚20:30《产业级通用信息抽取技术UIE+ERNIE轻量级模型》直播课链接 + - 10G重磅NLP学习大礼包: + +
+(此处为微信交流群二维码图片)
+ +## 详细使用 + +## PART Ⅰ   一键预测 + +### 中文分词 + +
+**多种分词模式,满足快速切分和实体粒度精准切分**
+ +#### 三种分词模式,满足各类分词需求 + +```python +from paddlenlp import Taskflow + +# 默认模式————实体粒度分词,在精度和速度上的权衡,基于百度LAC +>>> seg = Taskflow("word_segmentation") +>>> seg("近日国家卫健委发布第九版新型冠状病毒肺炎诊疗方案") +['近日', '国家卫健委', '发布', '第九版', '新型', '冠状病毒肺炎', '诊疗', '方案'] + +# 快速模式————最快:实现文本快速切分,基于jieba中文分词工具 +>>> seg_fast = Taskflow("word_segmentation", mode="fast") +>>> seg_fast("近日国家卫健委发布第九版新型冠状病毒肺炎诊疗方案") +['近日', '国家', '卫健委', '发布', '第九版', '新型', '冠状病毒', '肺炎', '诊疗', '方案'] + +# 精确模式————最准:实体粒度切分准确度最高,基于百度解语 +# 精确模式基于预训练模型,更适合实体粒度分词需求,适用于知识图谱构建、企业搜索Query分析等场景中 +>>> seg_accurate = Taskflow("word_segmentation", mode="accurate") +>>> seg_accurate("近日国家卫健委发布第九版新型冠状病毒肺炎诊疗方案") +['近日', '国家卫健委', '发布', '第九版', '新型冠状病毒肺炎', '诊疗', '方案'] +``` + +#### 批量样本输入,平均速度更快 + +输入为多个句子组成的list,平均速度会更快。 + +```python +>>> from paddlenlp import Taskflow +>>> seg = Taskflow("word_segmentation") +>>> seg(["第十四届全运会在西安举办", "三亚是一个美丽的城市"]) +[['第十四届', '全运会', '在', '西安', '举办'], ['三亚', '是', '一个', '美丽', '的', '城市']] +``` + +#### 自定义词典 + +你可以通过传入`user_dict`参数,装载自定义词典来定制分词结果。 +在默认模式和精确模式下,词典文件每一行由一个或多个自定义item组成。词典文件`user_dict.txt`示例: +```text +平原上的火焰 +上 映 +``` + +在快速模式下,词典文件每一行为一个自定义item+"\t"+词频(词频可省略,词频省略则自动计算能保证分出该词的词频),暂时不支持黑名单词典(即通过设置”年“、”末“,以达到切分”年末“的目的)。词典文件`user_dict.txt`示例: + +```text +平原上的火焰 10 +``` + +加载自定义词典及输出结果示例: +```python +>>> from paddlenlp import Taskflow +>>> seg = Taskflow("word_segmentation") +>>> seg("平原上的火焰宣布延期上映") +['平原', '上', '的', '火焰', '宣布', '延期', '上映'] +>>> seg = Taskflow("word_segmentation", user_dict="user_dict.txt") +>>> seg("平原上的火焰宣布延期上映") +['平原上的火焰', '宣布', '延期', '上', '映'] +``` +#### 参数说明 +* `mode`:指定分词模式,默认为None。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `user_dict`:自定义词典文件路径,默认为None。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 词性标注 + +
+**基于百度词法分析工具LAC**
+ +#### 支持单条和批量预测 +```python +>>> from paddlenlp import Taskflow +# 单条预测 +>>> tag = Taskflow("pos_tagging") +>>> tag("第十四届全运会在西安举办") +[('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')] + +# 批量样本输入,平均速度更快 +>>> tag(["第十四届全运会在西安举办", "三亚是一个美丽的城市"]) +[[('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')], [('三亚', 'LOC'), ('是', 'v'), ('一个', 'm'), ('美丽', 'a'), ('的', 'u'), ('城市', 'n')]] +``` + +#### 标签集合 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +#### 自定义词典 + +你可以通过装载自定义词典来定制化分词和词性标注结果。词典文件每一行表示一个自定义item,可以由一个单词或者多个单词组成,单词后面可以添加自定义标签,格式为`item/tag`,如果不添加自定义标签,则使用模型默认标签`n`。 + +词典文件`user_dict.txt`示例: + +```text +赛里木湖/LAKE +高/a 山/n +海拔最高 +``` + +装载自定义词典及输出结果示例: + +```python +>>> from paddlenlp import Taskflow +>>> tag = Taskflow("pos_tagging") +>>> tag("赛里木湖是新疆海拔最高的高山湖泊") +[('赛里木湖', 'LOC'), ('是', 'v'), ('新疆', 'LOC'), ('海拔', 'n'), ('最高', 'a'), ('的', 'u'), ('高山', 'n'), ('湖泊', 'n')] +>>> my_tag = Taskflow("pos_tagging", user_dict="user_dict.txt") +>>> my_tag("赛里木湖是新疆海拔最高的高山湖泊") +[('赛里木湖', 'LAKE'), ('是', 'v'), ('新疆', 'LOC'), ('海拔最高', 'n'), ('的', 'u'), ('高', 'a'), ('山', 'n'), ('湖泊', 'n')] +``` +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `user_dict`:用户自定义词典文件,默认为None。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 命名实体识别 + +
+**最全中文实体标签**
+ +#### 支持两种模式 + +```python +# 精确模式(默认),基于百度解语,内置91种词性及专名类别标签 +>>> from paddlenlp import Taskflow +>>> ner = Taskflow("ner") +>>> ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +[('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')] + +>>> ner = Taskflow("ner", entity_only=True) # 只返回实体/概念词 +>>> ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +[('孤女', '作品类_实体'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('小说', '作品类_概念'), ('作者', '人物类_概念'), ('余兼羽', '人物类_实体')] + +# 快速模式,基于百度LAC,内置24种词性和专名类别标签 +>>> from paddlenlp import Taskflow +>>> ner = Taskflow("ner", mode="fast") +>>> ner("三亚是一个美丽的城市") +[('三亚', 'LOC'), ('是', 'v'), ('一个', 'm'), ('美丽', 'a'), ('的', 'u'), ('城市', 'n')] +``` + +#### 批量样本输入,平均速度更快 +```python +>>> from paddlenlp import Taskflow +>>> ner = Taskflow("ner") +>>> ner(["热梅茶是一道以梅子为主要原料制作的茶饮", "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"]) +[[('热梅茶', '饮食类_饮品'), ('是', '肯定词'), ('一道', '数量词'), ('以', '介词'), ('梅子', '饮食类'), ('为', '肯定词'), ('主要原料', '物体类'), ('制作', '场景事件'), ('的', '助词'), ('茶饮', '饮食类_饮品')], [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')]] +``` + +#### 实体标签说明 + +- 精确模式采用的标签集合 + +包含91种词性及专名类别标签,标签集合如下表: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| WordTag标签集合 | | | | | | |
+|:---|:---|:---|:---|:---|:---|:---|
+| 人物类_实体 | 组织机构类_军事组织机构_概念 | 文化类_制度政策协议 | 位置方位 | 术语类_医药学术语 | 信息资料_性别 | 否定词 |
+| 人物类_概念 | 组织机构类_医疗卫生机构 | 文化类_姓氏与人名 | 世界地区类 | 术语类_生物体 | 链接地址 | 数量词 |
+| 作品类_实体 | 组织机构类_医疗卫生机构_概念 | 生物类 | 世界地区类_国家 | 疾病损伤类 | 个性特征 | 数量词_序数词 |
+| 作品类_概念 | 组织机构类_教育组织机构 | 生物类_植物 | 世界地区类_区划概念 | 疾病损伤类_植物病虫害 | 感官特征 | 数量词_单位数量词 |
+| 组织机构类 | 组织机构类_教育组织机构_概念 | 生物类_动物 | 世界地区类_地理概念 | 宇宙类 | 场景事件 | 叹词 |
+| 组织机构类_概念 | 物体类 | 品牌名 | 饮食类 | 事件类 | 介词 | 拟声词 |
+| 组织机构类_企事业单位 | 物体类_概念 | 品牌名_品牌类型 | 饮食类_菜品 | 时间类 | 介词_方位介词 | 修饰词 |
+| 组织机构类_企事业单位_概念 | 物体类_兵器 | 场所类 | 饮食类_饮品 | 时间类_特殊日 | 助词 | 修饰词_性质 |
+| 组织机构类_国家机关 | 物体类_化学物质 | 场所类_概念 | 药物类 | 时间类_朝代 | 代词 | 修饰词_类型 |
+| 组织机构类_国家机关_概念 | 其他角色类 | 场所类_交通场所 | 药物类_中药 | 时间类_具体时间 | 连词 | 修饰词_化 |
+| 组织机构类_体育组织机构 | 文化类 | 场所类_交通场所_概念 | 术语类 | 时间类_时长 | 副词 | 外语单词 |
+| 组织机构类_体育组织机构_概念 | 文化类_语言文字 | 场所类_网上场所 | 术语类_术语类型 | 词汇用语 | 疑问词 | 汉语拼音 |
+| 组织机构类_军事组织机构 | 文化类_奖项赛事活动 | 场所类_网上场所_概念 | 术语类_符号指标类 | 信息资料 | 肯定词 | w(标点) |
+ +- 快速模式采用的标签集合 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +#### 自定义词典 + +你可以通过装载自定义词典来定制化命名实体识别结果。词典文件每一行表示一个自定义item,可以由一个term或者多个term组成,term后面可以添加自定义标签,格式为`item/tag`,如果不添加自定义标签,则使用模型默认标签。 + +词典文件`user_dict.txt`示例: + +```text +长津湖/电影类_实体 +收/词汇用语 尾/术语类 +最 大 +海外票仓 +``` + +以"《长津湖》收尾,北美是最大海外票仓"为例,原本的输出结果为: + +```text +[('《', 'w'), ('长津湖', '作品类_实体'), ('》', 'w'), ('收尾', '场景事件'), (',', 'w'), ('北美', '世界地区类'), ('是', '肯定词'), ('最大', '修饰词'), ('海外', '场所类'), ('票仓', '词汇用语')] +``` + +装载自定义词典及输出结果示例: + +```python +>>> from paddlenlp import Taskflow + +>>> my_ner = Taskflow("ner", user_dict="user_dict.txt") +>>> my_ner("《长津湖》收尾,北美是最大海外票仓") +[('《', 'w'), ('长津湖', '电影类_实体'), ('》', 'w'), ('收', '词汇用语'), ('尾', '术语类'), (',', 'w'), ('北美', '世界地区类'), ('是', '肯定词'), ('最', '修饰词'), ('大', '修饰词'), ('海外票仓', '场所类')] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `user_dict`:用户自定义词典文件,默认为None。 +* `task_path`:自定义任务路径,默认为None。 +* `entity_only`:只返回实体/概念词及其对应标签。 +
+ + +### 依存句法分析 +
+**基于最大规模中文依存句法树库研发的DDParser**
+ +#### 支持多种形式输入 + +未分词输入: + +```python +>>> from paddlenlp import Taskflow +>>> ddp = Taskflow("dependency_parsing") +>>> ddp("2月8日谷爱凌夺得北京冬奥会第三金") +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}] + +``` + +使用分词结果来输入: + +```python +>>> ddp = Taskflow("dependency_parsing") +>>> ddp.from_segments([['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金']]) +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}] +``` + +#### 批量样本输入,平均速度更快 + +```python +>>> from paddlenlp import Taskflow +>>> ddp(["2月8日谷爱凌夺得北京冬奥会第三金", "他送了一本书"]) +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}, {'word': ['他', '送', '了', '一本', '书'], 'head': [2, 0, 2, 5, 2], 'deprel': ['SBV', 'HED', 'MT', 'ATT', 'VOB']}] +``` + +#### 多种模型选择,满足精度、速度需求 + +使用ERNIE 1.0进行预测 + +```python +>>> ddp = Taskflow("dependency_parsing", model="ddparser-ernie-1.0") +>>> ddp("2月8日谷爱凌夺得北京冬奥会第三金") +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}] +``` + +除ERNIE 1.0外,还可使用ERNIE-Gram预训练模型,其中`model=ddparser`(基于LSTM Encoder)速度最快,`model=ddparser-ernie-gram-zh`和`model=ddparser-ernie-1.0`效果更优(两者效果相当)。 + +#### 输出方式 + +输出概率值和词性标签: + +```python +>>> ddp = Taskflow("dependency_parsing", prob=True, use_pos=True) +>>> ddp("2月8日谷爱凌夺得北京冬奥会第三金") +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB'], 'postag': ['TIME', 'PER', 'v', 'ORG', 'n'], 'prob': [0.97, 1.0, 1.0, 0.99, 0.99]}] +``` + +依存关系可视化 + +```python +>>> from paddlenlp import Taskflow +>>> ddp = Taskflow("dependency_parsing", return_visual=True) +>>> result = ddp("2月8日谷爱凌夺得北京冬奥会第三金")[0]['visual'] +>>> import cv2 +>>> cv2.imwrite('test.png', result) +``` + +
+(此处为依存句法树可视化结果示例图,对应上文保存的 test.png)
+ +#### 依存句法分析标注关系集合 + +| Label | 关系类型 | 说明 | 示例 | +| :---: | :--------: | :----------------------- | :----------------------------- | +| SBV | 主谓关系 | 主语与谓词间的关系 | 他送了一本书(他<--送) | +| VOB | 动宾关系 | 宾语与谓词间的关系 | 他送了一本书(送-->书) | +| POB | 介宾关系 | 介词与宾语间的关系 | 我把书卖了(把-->书) | +| ADV | 状中关系 | 状语与中心词间的关系 | 我昨天买书了(昨天<--买) | +| CMP | 动补关系 | 补语与中心词间的关系 | 我都吃完了(吃-->完) | +| ATT | 定中关系 | 定语与中心词间的关系 | 他送了一本书(一本<--书) | +| F | 方位关系 | 方位词与中心词的关系 | 在公园里玩耍(公园-->里) | +| COO | 并列关系 | 同类型词语间关系 | 叔叔阿姨(叔叔-->阿姨) | +| DBL | 兼语结构 | 主谓短语做宾语的结构 | 他请我吃饭(请-->我,请-->吃饭) | +| DOB | 双宾语结构 | 谓语后出现两个宾语 | 他送我一本书(送-->我,送-->书) | +| VV | 连谓结构 | 同主语的多个谓词间关系 | 他外出吃饭(外出-->吃饭) | +| IC | 子句结构 | 两个结构独立或关联的单句 | 你好,书店怎么走?(你好<--走) | +| MT | 虚词成分 | 虚词与中心词间的关系 | 他送了一本书(送-->了) | +| HED | 核心关系 | 指整个句子的核心 | | + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,可选有`ddparser`,`ddparser-ernie-1.0`和`ddparser-ernie-gram-zh`。 +* `tree`:确保输出结果是正确的依存句法树,默认为True。 +* `prob`:是否输出每个弧对应的概率值,默认为False。 +* `use_pos`:是否返回词性标签,默认为False。 +* `use_cuda`:是否使用GPU进行切词,默认为False。 +* `return_visual`:是否返回句法树的可视化结果,默认为False。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 信息抽取 +
+**适配多场景的开放域通用信息抽取工具**
+ +开放域信息抽取是信息抽取的一种全新范式,主要思想是减少人工参与,利用单一模型支持多种类型的开放抽取任务,用户可以使用自然语言自定义抽取目标,在实体、关系类别等未定义的情况下抽取输入文本中的信息片段。 + +#### 实体抽取 + + 命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + + - 例如抽取的目标实体类型是"时间"、"选手"和"赛事名称", schema构造如下: + + ```text + ['时间', '选手', '赛事名称'] + ``` + + 调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction + >>> ie = Taskflow('information_extraction', schema=schema) + >>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint + [{'时间': [{'end': 6, + 'probability': 0.9857378532924486, + 'start': 0, + 'text': '2月8日上午'}], + '赛事名称': [{'end': 23, + 'probability': 0.8503089953268272, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'end': 31, + 'probability': 0.8981548639781138, + 'start': 28, + 'text': '谷爱凌'}]}] + ``` + + - 例如抽取的目标实体类型是"肿瘤的大小"、"肿瘤的个数"、"肝癌级别"和"脉管内癌栓分级", schema构造如下: + + ```text + ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + ``` + + 在上例中我们已经实例化了一个`Taskflow`对象,这里可以通过`set_schema`方法重置抽取目标。 + + 调用示例: + + ```python + >>> schema = ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + >>> ie.set_schema(schema) + >>> pprint(ie("(右肝肿瘤)肝细胞性肝癌(II-III级,梁索型和假腺管型),肿瘤包膜不完整,紧邻肝被膜,侵及周围肝组织,未见脉管内癌栓(MVI分级:M0级)及卫星子灶形成。(肿物1个,大小4.2×4.0×2.8cm)。")) + [{'肝癌级别': [{'end': 20, + 'probability': 0.9243267447402701, + 'start': 13, + 'text': 'II-III级'}], + '肿瘤的个数': [{'end': 84, + 'probability': 0.7538413804059623, + 'start': 82, + 'text': '1个'}], + '肿瘤的大小': [{'end': 100, + 'probability': 0.8341128043459491, + 'start': 87, + 'text': '4.2×4.0×2.8cm'}], + '脉管内癌栓分级': [{'end': 70, + 'probability': 0.9083292325934664, + 'start': 67, + 'text': 'M0级'}]}] + ``` + + - 例如抽取的目标实体类型是"person"和"organization",schema构造如下: + + ```text + ['person', 'organization'] + ``` + + 英文模型调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + >>> schema = ['Person', 'Organization'] + >>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en') + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Organization': [{'end': 53, + 'probability': 0.9985840259877357, + 'start': 48, + 'text': 'Apple'}], + 'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'start': 9, + 'text': 'Steve'}]}] + ``` + +#### 关系抽取 + + 关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + + - 例如以"竞赛名称"作为抽取主体,抽取关系类型为"主办方"、"承办方"和"已举办次数", schema构造如下: + + ```text + { + '竞赛名称': [ + '主办方', + '承办方', + '已举办次数' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。')) + [{'竞赛名称': [{'end': 13, + 'probability': 0.7825402622754041, + 'relations': {'主办方': [{'end': 22, + 'probability': 0.8421710521379353, + 'start': 14, + 'text': '中国中文信息学会'}, + {'end': 30, + 'probability': 0.7580801847701935, + 'start': 23, + 'text': '中国计算机学会'}], + '已举办次数': [{'end': 82, + 'probability': 0.4671295049136148, + 'start': 80, + 'text': '4届'}], + '承办方': [{'end': 39, + 'probability': 0.8292706618236352, + 'start': 35, + 'text': '百度公司'}, + {'end': 72, + 'probability': 0.6193477885474685, + 'start': 56, + 'text': '中国计算机学会自然语言处理专委会'}, + {'end': 55, + 'probability': 0.7000497331473241, + 'start': 40, + 'text': '中国中文信息学会评测工作委员会'}]}, + 'start': 0, + 'text': '2022语言与智能技术竞赛'}]}] + 
``` + + - 例如以"person"作为抽取主体,抽取关系类型为"Company"和"Position", schema构造如下: + + ```text + { + 'Person': [ + 'Company', + 'Position' + ] + } + ``` + + 英文模型调用示例: + + ```python + >>> schema = [{'Person': ['Company', 'Position']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'relations': {'Company': [{'end': 53, + 'probability': 0.9960158209451642, + 'start': 48, + 'text': 'Apple'}], + 'Position': [{'end': 44, + 'probability': 0.8871063806420736, + 'start': 41, + 'text': 'CEO'}]}, + 'start': 9, + 'text': 'Steve'}]}] + ``` + +#### 事件抽取 + + 事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词(Trigger)和事件论元(Argument),组合为相应的事件结构化信息。 + + - 例如抽取的目标是"地震"事件的"地震强度"、"时间"、"震中位置"和"震源深度"这些信息,schema构造如下: + + ```text + { + '地震触发词': [ + '地震强度', + '时间', + '震中位置', + '震源深度' + ] + } + ``` + + 触发词的格式统一为`触发词`或``XX触发词`,`XX`表示具体事件类型,上例中的事件类型是`地震`,则对应触发词为`地震触发词`。 + + 调用示例: + + ```python + >>> schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction + >>> ie.set_schema(schema) # Reset schema + >>> ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。') + [{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}] + ``` + + - 英文模型zero-shot方式**暂不支持事件抽取**,如有英文事件抽取相关语料请进行训练定制。 + +#### 评论观点抽取 + + 评论观点抽取,是指抽取文本中包含的评价维度、观点词。 + + - 例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,schema构造如下: + + ```text + { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie("店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队")) # Better print results using pprint + [{'评价维度': [{'end': 20, + 'probability': 0.9817040258681473, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9966142505350533, + 'text': '正向'}], + '观点词': [{'end': 22, + 'probability': 0.957396472711558, + 'start': 21, + 'text': '高'}]}, + 'start': 17, + 'text': '性价比'}, + {'end': 2, + 'probability': 0.9696849569741168, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9982153274927796, + 'text': '正向'}], + '观点词': [{'end': 4, + 'probability': 0.9945318044652538, + 'start': 2, + 'text': '干净'}]}, + 'start': 0, + 'text': '店面'}]}] + ``` + + - 英文模型schema构造如下: + + ```text + { + 'Aspect': [ + 'Opinion', + 'Sentiment classification [negative, positive]' + ] + } + ``` + + 英文模型调用示例: + + ```python + >>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en("The teacher is very nice.")) + [{'Aspect': [{'end': 11, + 'probability': 0.4301476415932193, + 'relations': {'Opinion': [{'end': 24, + 'probability': 0.9072940447883724, + 'start': 15, + 'text': 'very nice'}], + 'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685, + 'text': 'positive'}]}, + 'start': 4, + 'text': 'teacher'}]}] + ``` + +#### 情感分类 + + - 句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,schema构造如下: + + ```text + '情感倾向[正向,负向]' + ``` + + 调用示例: + + ```python + >>> schema = '情感倾向[正向,负向]' # Define the schema for 
sentence-level sentiment classification + >>> ie.set_schema(schema) # Reset schema + >>> ie('这个产品用起来真的很流畅,我非常喜欢') + [{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.9988661643929895}]}] + ``` + + 英文模型schema构造如下: + + ```text + '情感倾向[正向,负向]' + ``` + + 英文模型调用示例: + + ```python + >>> schema = 'Sentiment classification [negative, positive]' + >>> ie_en.set_schema(schema) + >>> ie_en('I am sorry but this is the worst film I have ever seen in my life.') + [{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}] + ``` + +#### 跨任务抽取 + + - 例如在法律场景同时对文本进行实体抽取和关系抽取,schema可按照如下方式进行构造: + + ```text + [ + "法院", + { + "原告": "委托代理人" + }, + { + "被告": "委托代理人" + } + ] + ``` + + 调用示例: + + ```python + >>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}] + >>> ie.set_schema(schema) + >>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint + [{'原告': [{'end': 37, + 'probability': 0.9949814024296764, + 'relations': {'委托代理人': [{'end': 46, + 'probability': 0.7956844697990384, + 'start': 44, + 'text': '李四'}]}, + 'start': 35, + 'text': '张三'}], + '法院': [{'end': 10, + 'probability': 0.9221074192336651, + 'start': 0, + 'text': '北京市海淀区人民法院'}], + '被告': [{'end': 67, + 'probability': 0.8437349536631089, + 'relations': {'委托代理人': [{'end': 92, + 'probability': 0.7267121388225029, + 'start': 90, + 'text': '赵六'}]}, + 'start': 64, + 'text': 'B公司'}]}] + ``` + +#### 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 | + | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 | + | `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 | + | `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 | + | `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 | + | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 | + | `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | + +- `uie-nano`调用示例: + + ```python + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano") + >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!") + [{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}] + ``` + +- `uie-m-base`和`uie-m-large`支持中英文混合抽取,调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['Time', 'Player', 'Competition', 'Score'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en") + >>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"])) + [{'Competition': [{'end': 23, + 'probability': 0.9373889907291257, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + 'Player': [{'end': 31, + 'probability': 0.6981119555336441, + 'start': 28, + 'text': '谷爱凌'}], + 'Score': [{'end': 39, + 'probability': 0.9888507878270296, + 'start': 32, + 'text': '188.25分'}], + 'Time': [{'end': 6, + 'probability': 0.9784080036931151, + 'start': 0, + 'text': '2月8日上午'}]}, + {'Competition': [{'end': 35, + 'probability': 0.9851549932171295, + 'start': 18, + 'text': 'French Open 
Final'}], + 'Player': [{'end': 12, + 'probability': 0.9379371275888104, + 'start': 0, + 'text': 'Rafael Nadal'}]}] + ``` + +#### 定制训练 + +对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用[定制训练](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)(标注少量数据进行模型微调)以进一步提升效果。 + +我们在互联网、医疗、金融三大垂类自建测试集上进行了实验: + + +
+| 模型 | 金融 0-shot | 金融 5-shot | 医疗 0-shot | 医疗 5-shot | 互联网 0-shot | 互联网 5-shot |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |
+| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |
+| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |
+| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |
+| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |
+| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |
+| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |
+ +0-shot表示无训练数据直接通过```paddlenlp.Taskflow```进行预测,5-shot表示每个类别包含5条标注数据进行模型微调。**实验表明UIE在垂类场景可以通过少量数据(few-shot)进一步提升效果**。 + +#### 可配置参数说明 + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置schema的语言,默认为`zh`, 可选有`zh`和`en`。因为中英schema的构造有所不同,因此需要指定schema的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,默认为`uie-base`,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en`。 +* `position_prob`:模型对于span的起始位置/终止位置的结果概率0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖(主要是**确保安装onnxruntime-gpu**)。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 +
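+在上述参数基础上,下面给出一个组合设置 `position_prob`、`precision` 与 `batch_size` 的调用示意(取值仅为演示用的假设,实际请结合抽取效果与硬件环境调整;开启 `fp16` 前需满足上文所述的驱动与依赖要求):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> schema = ['时间', '选手', '赛事名称']
+>>> # 示例取值:将阈值调高到0.6、开启fp16推理、批大小设为4
+>>> ie = Taskflow('information_extraction', schema=schema, position_prob=0.6, precision='fp16', batch_size=4)
+>>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")
+```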
+ +### 解语知识标注 +
+**覆盖所有中文词汇的知识标注工具**
+ +#### 词类知识标注 + +```python +>>> from paddlenlp import Taskflow +>>> wordtag = Taskflow("knowledge_mining") +>>> wordtag("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +[{'text': '《孤女》是2010年九州出版社出版的小说,作者是余兼羽', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '孤女', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 2}, {'item': '》', 'offset': 3, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 4, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '2010年', 'offset': 5, 'wordtag_label': '时间类', 'length': 5, 'termid': '时间阶段_cb_2010年'}, {'item': '九州出版社', 'offset': 10, 'wordtag_label': '组织机构类', 'length': 5, 'termid': '组织机构_eb_九州出版社'}, {'item': '出版', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_出版'}, {'item': '的', 'offset': 17, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '小说', 'offset': 18, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '小说_cb_小说'}, {'item': ',', 'offset': 20, 'wordtag_label': 'w', 'length': 1}, {'item': '作者', 'offset': 21, 'wordtag_label': '人物类_概念', 'length': 2, 'termid': '人物_cb_作者'}, {'item': '是', 'offset': 23, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '余兼羽', 'offset': 24, 'wordtag_label': '人物类_实体', 'length': 3}]}] +``` + +**可配置参数说明:** +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `linking`:实现基于词类的linking,默认为True。 +* `task_path`:自定义任务路径,默认为None。 +* `user_dict`:用户自定义词典文件,默认为None。 + + +知识挖掘-词类知识标注任务共包含91种词性及专名类别标签,标签集合如下表: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| WordTag标签集合 | | | | | | |
+|:---|:---|:---|:---|:---|:---|:---|
+| 人物类_实体 | 组织机构类_军事组织机构_概念 | 文化类_制度政策协议 | 位置方位 | 术语类_医药学术语 | 信息资料_性别 | 否定词 |
+| 人物类_概念 | 组织机构类_医疗卫生机构 | 文化类_姓氏与人名 | 世界地区类 | 术语类_生物体 | 链接地址 | 数量词 |
+| 作品类_实体 | 组织机构类_医疗卫生机构_概念 | 生物类 | 世界地区类_国家 | 疾病损伤类 | 个性特征 | 数量词_序数词 |
+| 作品类_概念 | 组织机构类_教育组织机构 | 生物类_植物 | 世界地区类_区划概念 | 疾病损伤类_植物病虫害 | 感官特征 | 数量词_单位数量词 |
+| 组织机构类 | 组织机构类_教育组织机构_概念 | 生物类_动物 | 世界地区类_地理概念 | 宇宙类 | 场景事件 | 叹词 |
+| 组织机构类_概念 | 物体类 | 品牌名 | 饮食类 | 事件类 | 介词 | 拟声词 |
+| 组织机构类_企事业单位 | 物体类_概念 | 品牌名_品牌类型 | 饮食类_菜品 | 时间类 | 介词_方位介词 | 修饰词 |
+| 组织机构类_企事业单位_概念 | 物体类_兵器 | 场所类 | 饮食类_饮品 | 时间类_特殊日 | 助词 | 修饰词_性质 |
+| 组织机构类_国家机关 | 物体类_化学物质 | 场所类_概念 | 药物类 | 时间类_朝代 | 代词 | 修饰词_类型 |
+| 组织机构类_国家机关_概念 | 其他角色类 | 场所类_交通场所 | 药物类_中药 | 时间类_具体时间 | 连词 | 修饰词_化 |
+| 组织机构类_体育组织机构 | 文化类 | 场所类_交通场所_概念 | 术语类 | 时间类_时长 | 副词 | 外语单词 |
+| 组织机构类_体育组织机构_概念 | 文化类_语言文字 | 场所类_网上场所 | 术语类_术语类型 | 词汇用语 | 疑问词 | 汉语拼音 |
+| 组织机构类_军事组织机构 | 文化类_奖项赛事活动 | 场所类_网上场所_概念 | 术语类_符号指标类 | 信息资料 | 肯定词 | w(标点) |
+ +#### 知识模板信息抽取 +```python +>>> from paddlenlp import Taskflow +>>> wordtag_ie = Taskflow("knowledge_mining", with_ie=True) +>>> wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。') +[[{'text': '《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 5, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 6, 'wordtag_label': '肯定词', 'length': 1}, {'item': '一首', 'offset': 7, 'wordtag_label': '数量词_单位数量词', 'length': 2}, {'item': '由', 'offset': 9, 'wordtag_label': '介词', 'length': 1}, {'item': '王杰', 'offset': 10, 'wordtag_label': '人物类_实体', 'length': 2}, {'item': '作词', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2}, {'item': '、', 'offset': 14, 'wordtag_label': 'w', 'length': 1}, {'item': '作曲', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2}, {'item': '并', 'offset': 17, 'wordtag_label': '连词', 'length': 1}, {'item': '演唱', 'offset': 18, 'wordtag_label': '场景事件', 'length': 2}, {'item': '的', 'offset': 20, 'wordtag_label': '助词', 'length': 1}, {'item': '歌曲', 'offset': 21, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': ',', 'offset': 23, 'wordtag_label': 'w', 'length': 1}, {'item': '收录', 'offset': 24, 'wordtag_label': '场景事件', 'length': 2}, {'item': '在', 'offset': 26, 'wordtag_label': '介词', 'length': 1}, {'item': '专辑', 'offset': 27, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '同名', 'offset': 29, 'wordtag_label': '场景事件', 'length': 2}, {'item': '《', 'offset': 31, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 32, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 36, 'wordtag_label': 'w', 'length': 1}, {'item': '中', 'offset': 37, 'wordtag_label': '词汇用语', 'length': 1}, {'item': ',', 'offset': 38, 'wordtag_label': 'w', 'length': 1}, {'item': '由', 'offset': 39, 'wordtag_label': '介词', 'length': 1}, {'item': '波丽佳音', 'offset': 40, 'wordtag_label': '人物类_实体', 'length': 4}, {'item': '唱片', 'offset': 44, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '于', 'offset': 46, 'wordtag_label': '介词', 'length': 1}, {'item': '1996年08月31日', 'offset': 47, 'wordtag_label': '时间类_具体时间', 'length': 11}, {'item': '发行', 'offset': 58, 'wordtag_label': '场景事件', 'length': 2}, {'item': '。', 'offset': 60, 'wordtag_label': 'w', 'length': 1}]}], [[{'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '创作', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], 'GROUP': '创作者', 'SRC': 'HTG', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '歌曲', 'offset': 21, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '收录', 'TRIG': [{'item': '收录', 'offset': 24}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '收录于', 'SRC': 'HGT', 'TRIG': [{'item': '收录', 'offset': 24}]}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '王杰', 
'type': '人物类_实体', 'offset': 10}], 'GROUP': '创作者', 'TRIG': [{'item': '专辑', 'offset': 27}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '王杰', 'type': '人物类_实体', 'offset': 10}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '创作', 'SRC': 'HGT', 'TRIG': [{'item': '专辑', 'offset': 27}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 32}, 'TAIL_ROLE': [{'item': '唱片', 'offset': 44, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}]]] + +``` + +**自定义抽取的schema** + +``` python +>>> from pprint import pprint +>>> schema = [ + { + "head_role": "作品类_实体", #头实体词类 + "group": "创作者", #关系名 + "tail_role": [ + { + "main": [ + "人物类_实体" #尾实体词类 + ], + "support": [] #相关词类,可作为该关系的补充,不可作为尾实体独立存在 + } + ], + "trig_word": [ + "作词", #触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null + ], + "trig_type": "trigger", #trigger表明由触发词触发,tail表明为尾实体触发 + "reverse": False, #是否为反向配置,即尾实体实际是头,头实体实际是尾 + "trig_direction": "B", #触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B + "rel_group": "创作" #对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 + }] +>>> wordtag_ie.set_schema(schema) +>>> pprint(wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。')[1]) +[[{'GROUP': '创作', + 'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, + 'SRC': 'REVERSE', + 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}, + {'GROUP': '创作者', + 'HEAD_ROLE': {'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}, + 'SRC': 'HTG', + 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}]] +``` +具体的WordTag-IE信息抽取的功能可以见[WordTag-IE具体介绍](../../examples/text_to_knowledge/wordtag-ie/README.md) . + + +#### 名词短语标注 +```python +>>> from paddlenlp import Taskflow +>>> nptag = Taskflow("knowledge_mining", model="nptag") +>>> nptag("糖醋排骨") +[{'text': '糖醋排骨', 'label': '菜品'}] + +>>> nptag(["糖醋排骨", "红曲霉菌"]) +[{'text': '糖醋排骨', 'label': '菜品'}, {'text': '红曲霉菌', 'label': '微生物'}] + +# 使用`linking`输出粗粒度类别标签`category`,即WordTag的词汇标签。 +>>> nptag = Taskflow("knowledge_mining", model="nptag", linking=True) +>>> nptag(["糖醋排骨", "红曲霉菌"]) +[{'text': '糖醋排骨', 'label': '菜品', 'category': '饮食类_菜品'}, {'text': '红曲霉菌', 'label': '微生物', 'category': '生物类_微生物'}] +``` +**可配置参数说明:** +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_seq_len`:最大序列长度,默认为64。 +* `linking`:实现与WordTag类别标签的linking,默认为False。 +* `task_path`:自定义任务路径,默认为None。 + + +
+ +### 文本纠错 +
+**融合拼音特征的端到端文本纠错模型ERNIE-CSC**
+ + +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> corrector = Taskflow("text_correction") +# 单条输入 +>>> corrector('遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。') +[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', 'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇。', 'errors': [{'position': 3, 'correction': {'竟': '境'}}]}] + +# 批量预测 +>>> corrector(['遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。']) +[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', 'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇。', 'errors': [{'position': 3, 'correction': {'竟': '境'}}]}, {'source': '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。', 'target': '人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。', 'errors': [{'position': 18, 'correction': {'拙': '茁'}}]}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 文本相似度 +
+**基于百万量级Dureader Retrieval数据集训练RocketQA并达到前沿文本相似效果**
+ +#### 单条输入 + ++ Query-Query的相似度匹配 + +```python +>>> from paddlenlp import Taskflow +>>> similarity = Taskflow("text_similarity") +>>> similarity([["春天适合种什么花?", "春天适合种什么菜?"]]) +[{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.83402544}] +``` + ++ Query-Passage的相似度匹配 + +```python +>>> similarity = Taskflow("text_similarity", model='rocketqa-base-cross-encoder') +>>> similarity([["国家法定节假日共多少天?", "现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。"]]) +[{'text1': '国家法定节假日共多少天?', 'text2': '现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。', 'similarity': 0.7174624800682068}] +``` + +#### 批量样本输入,平均速度更快 + ++ Query-Query的相似度匹配 + +```python +>>> from paddlenlp import Taskflow +>>> similarity = Taskflow("text_similarity") +>>> similarity([['春天适合种什么花?','春天适合种什么菜?'],['谁有狂三这张高清的','这张高清图,谁有']]) +[{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.83402544}, {'text1': '谁有狂三这张高清的', 'text2': '这张高清图,谁有', 'similarity': 0.6540646}] +``` + ++ Query-Passage的相似度匹配 + +```python +>>> similarity = Taskflow("text_similarity", model='rocketqa-base-cross-encoder') +>>> similarity([["国家法定节假日共多少天?", "现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。"],["衡量酒水的价格的因素有哪些?", "衡量酒水的价格的因素很多的,酒水的血统(也就是那里产的,采用什么工艺等);存储的时间等等,酒水是一件很难标准化得商品,只要你敢要价,有买的那就值那个钱。"]]) +[{'text1': '国家法定节假日共多少天?', 'text2': '现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。', 'similarity': 0.7174624800682068}, {'text1': '衡量酒水的价格的因素有哪些?', 'text2': '衡量酒水的价格的因素很多的,酒水的血统(也就是那里产的,采用什么工艺等);存储的时间等等,酒水是一件很难标准化得商品,只要你敢要价,有买的那就值那个钱。', 'similarity': 0.9069755673408508}] + +``` + +#### 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `rocketqa-zh-dureader-cross-encoder` | 12-layers, 768-hidden, 12-heads | 中文 | + | `simbert-base-chinese` (默认) | 12-layers, 768-hidden, 12-heads | 中文 | + | `rocketqa-base-cross-encoder` | 12-layers, 768-hidden, 12-heads | 中文 | + | `rocketqa-medium-cross-encoder` | 6-layers, 768-hidden, 12-heads | 中文 | + | `rocketqa-mini-cross-encoder` | 6-layers, 384-hidden, 12-heads | 中文 | + | `rocketqa-micro-cross-encoder` | 4-layers, 384-hidden, 12-heads | 中文 | + | `rocketqa-nano-cross-encoder` | 4-layers, 312-hidden, 12-heads | 中文 | + | `rocketqav2-en-marco-cross-encoder` | 12-layers, 768-hidden, 12-heads | 英文 | + + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_seq_len`:最大序列长度,默认为384。 +* `task_path`:自定义任务路径,默认为None。 +
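+如需在速度与精度之间折中,可按上表切换更小的模型并调整 `max_seq_len`,下面是一个调用示意(模型与参数取值仅作演示):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:改用6层的rocketqa-medium-cross-encoder,并调小max_seq_len
+>>> similarity = Taskflow("text_similarity", model="rocketqa-medium-cross-encoder", max_seq_len=128, batch_size=2)
+>>> similarity([["春天适合种什么花?", "春天适合种什么菜?"]])
+```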
+ +### 情感分析 +
+**集成BiLSTM、SKEP、UIE等模型,支持评论维度、观点抽取、情感极性分类等情感分析任务**
+ +#### 支持不同模型,速度快和精度高两种模式 + +```python +>>> from paddlenlp import Taskflow +# 默认使用bilstm模型进行预测,速度快 +>>> senta = Taskflow("sentiment_analysis") +>>> senta("这个产品用起来真的很流畅,我非常喜欢") +[{'text': '这个产品用起来真的很流畅,我非常喜欢', 'label': 'positive', 'score': 0.9938690066337585}] + +# 使用SKEP情感分析预训练模型进行预测,精度高 +>>> senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch") +>>> senta("作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。") +[{'text': '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。', 'label': 'positive', 'score': 0.984320878982544}] + +# 使用UIE模型进行情感分析,具有较强的样本迁移能力 +# 1. 语句级情感分析 +>>> schema = ['情感倾向[正向,负向]'] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> senta('蛋糕味道不错,店家服务也很好') +[{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.996646058824652}]}] + +# 2. 评价维度级情感分析 +>>> # Aspect Term Extraction +>>> # schema = ["评价维度"] +>>> # Aspect - Opinion Extraction +>>> # schema = [{"评价维度":["观点词"]}] +>>> # Aspect - Sentiment Extraction +>>> # schema = [{"评价维度":["情感倾向[正向,负向,未提及]"]}] +>>> # Aspect - Sentiment - Opinion Extraction +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] + +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> senta('蛋糕味道不错,店家服务也很热情') +[{'评价维度': [{'text': '服务', 'start': 9, 'end': 11, 'probability': 0.9709093024793489, 'relations': { '观点词': [{'text': '热情', 'start': 13, 'end': 15, 'probability': 0.9897222206316556}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9999327669598301}]}}, {'text': '味道', 'start': 2, 'end': 4, 'probability': 0.9105472387838915, 'relations': {'观点词': [{'text': '不错', 'start': 4, 'end': 6, 'probability': 0.9946981266891619}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9998829392709467}]}}]}] +``` + +#### 批量样本输入,平均速度更快 +```python +>>> from paddlenlp import Taskflow +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> senta(["房间不大,很干净", "老板服务热情,价格也便宜"]) +[{'评价维度': [{'text': '房间', 'start': 0, 'end': 2, 'probability': 0.998526653966298, 'relations': {'观点词': [{'text': '干净', 'start': 6, 'end': 8, 'probability': 0.9899580841973474}, {'text': '不大', 'start': 2, 'end': 4, 'probability': 0.9945525066163512}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.6077412795680956}]}}]}, {'评价维度': [{'text': '服务', 'start': 2, 'end': 4, 'probability': 0.9913965811617516, 'relations': {'观点词': [{'text': '热情', 'start': 4, 'end': 6, 'probability': 0.9995530034336753}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9956709542206106}]}}, {'text': '价格', 'start': 7, 'end': 9, 'probability': 0.9970075537913772, 'relations': {'观点词': [{'text': '便宜', 'start': 10, 'end': 12, 'probability': 0.9991568497876635}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9943191048602245}]}}]}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,可选有`bilstm`,`skep_ernie_1.0_large_ch`,`uie-senta-base`,`uie-senta-medium`,`uie-senta-mini`,`uie-senta-micro`,`uie-senta-nano`。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 生成式问答 +
+**使用最大中文开源CPM模型完成问答**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> qa = Taskflow("question_answering") +# 单条输入 +>>> qa("中国的国土面积有多大?") +[{'text': '中国的国土面积有多大?', 'answer': '960万平方公里。'}] +# 多条输入 +>>> qa(["中国国土面积有多大?", "中国的首都在哪里?"]) +[{'text': '中国国土面积有多大?', 'answer': '960万平方公里。'}, {'text': '中国的首都在哪里?', 'answer': '北京。'}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +
+ +### 智能写诗 +
+**使用最大中文开源CPM模型完成写诗**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> poetry = Taskflow("poetry_generation") +# 单条输入 +>>> poetry("林密不见人") +[{'text': '林密不见人', 'answer': ',但闻人语响。'}] +# 多条输入 +>>> poetry(["林密不见人", "举头邀明月"]) +[{'text': '林密不见人', 'answer': ',但闻人语响。'}, {'text': '举头邀明月', 'answer': ',低头思故乡。'}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +
+ +### 开放域对话 +
+**十亿级语料训练最强中文闲聊模型PLATO-Mini,支持多轮对话**
+ +#### 非交互模式 +```python +>>> from paddlenlp import Taskflow +>>> dialogue = Taskflow("dialogue") +>>> dialogue(["吃饭了吗"]) +['刚吃完饭,你在干什么呢?'] + +>>> dialogue(["你好", "吃饭了吗"], ["你是谁?"]) +['吃过了,你呢', '我是李明啊'] +``` + +可配置参数: + +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_seq_len`:最大序列长度,默认为512。 + +#### 交互模式 +```python +>>> from paddlenlp import Taskflow + +>>> dialogue = Taskflow("dialogue") +# 输入`exit`可退出交互模式 +>>> dialogue.interactive_mode(max_turn=3) + +''' +[Human]:你好 +[Bot]:你好,很高兴认识你,我想问你一下,你喜欢运动吗? +[Human]:喜欢 +[Bot]:那你喜欢什么运动啊? +[Human]:篮球,你喜欢篮球吗 +[Bot]:当然了,我很喜欢打篮球的 +''' +``` + +交互模式参数: +* `max_turn`:任务能记忆的对话轮数,当max_turn为1时,模型只能记住当前对话,无法获知之前的对话内容。 +
+ +### 代码生成 +
+**通过CodeGen模型来生成代码**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +# 默认模型为 Salesforce/codegen-350M-mono +>>> codegen = Taskflow("code_generation", model="Salesforce/codegen-2B-mono") +# 单条输入 +>>> codegen("def hello_world():") +['\n print("Hello World")'] +# 多条输入 +>>> codegen(["Get the length of array", "def hello_world():"]) +['\n n = len(a)\n\n #', '\n print("Hello World!")'] +``` + +#### 可配置参数说明 +* `model`:可选模型,默认为Salesforce/codegen-350M-mono,支持的模型参考[CodeGen文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/code_generation/codegen/README.md)。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_length`:生成代码的最大长度,默认为128。 +* `min_length`:生成代码的最小长度,默认为0。 +* `decode_strategy`:解码策略,支持greedy_search,beam_search和sampling,默认为sampling。 +* `temperature`:解码参数temperature,默认为0.6。 +* `top_k`:解码参数top_k,默认为5。 +* `top_p`:解码参数top_p,默认为1.0。 +* `num_beams`:beam_search解码的beam size,默认为4。 +* `length_penalty`:解码长度控制值,默认为1.0。 +* `repetition_penalty`:解码重复惩罚值,默认为1.1。 +* `output_scores`:是否要输出解码得分,请默认为False。 +
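+上述解码参数可在构造 `Taskflow` 时直接传入,下面是一个调用示意(参数取值仅作演示,实际生成结果受解码策略与模型影响,可能与示例不同):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:缩短生成长度并改用greedy_search解码
+>>> codegen = Taskflow("code_generation", max_length=64, decode_strategy="greedy_search", repetition_penalty=1.2)
+>>> codegen("def hello_world():")
+```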
+ + + +### 文本摘要 +
+**通过Pegasus模型来生成摘要**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> summarizer = Taskflow("text_summarization") +# 单条输入 +>>> summarizer('2022年,中国房地产进入转型阵痛期,传统“高杠杆、快周转”的模式难以为继,万科甚至直接喊话,中国房地产进入“黑铁时代”') +# 输出:['万科喊话中国房地产进入“黑铁时代”'] + +# 多条输入 +>>> summarizer([ + '据悉,2022年教育部将围绕“巩固提高、深化落实、创新突破”三个关键词展开工作。要进一步强化学校教育主阵地作用,继续把落实“双减”作为学校工作的重中之重,重点从提高作业设计水平、提高课后服务水平、提高课堂教学水平、提高均衡发展水平四个方面持续巩固提高学校“双减”工作水平。', + '党参有降血脂,降血压的作用,可以彻底消除血液中的垃圾,从而对冠心病以及心血管疾病的患者都有一定的稳定预防工作作用,因此平时口服党参能远离三高的危害。另外党参除了益气养血,降低中枢神经作用,调整消化系统功能,健脾补肺的功能。' + ]) +#输出:['教育部:将从四个方面持续巩固提高学校“双减”工作水平', '党参能降低三高的危害'] +``` + +#### 可配置参数说明 +* `model`:可选模型,默认为`IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + +
+ +### 文档智能 +
+**以多语言跨模态布局增强文档预训练模型ERNIE-Layout为核心底座**
+ +#### 输入格式 + +``` +[ + {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +默认使用PaddleOCR进行OCR识别,同时支持用户通过``word_boxes``传入自己的OCR结果,格式为``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +#### 支持单条、批量预测 + +- 支持本地图片路径输入 + +
+(此处为本地简历图片 resume.png 的示例图)
+ + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> docprompt = Taskflow("document_intelligence") +>>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) +[{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, +{'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, +{'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] +``` + +- http图片链接输入 + +
+(此处为发票图片 invoice.jpg 的示例图)
+ + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> docprompt = Taskflow("document_intelligence") +>>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])) +[{'prompt': '发票号码是多少?', + 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]}, +{'prompt': '校验码是多少?', + 'result': [{'end': 233, + 'prob': 1.0, + 'start': 231, + 'value': '01107 555427109891646'}]}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 +* `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。 + + +
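+如需返回多个候选答案,可结合上文的 `topn`、`batch_size` 等参数使用,下面是一个调用示意(取值仅作演示,输出从略):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:返回概率最高的前2个候选答案
+>>> docprompt = Taskflow("document_intelligence", topn=2, batch_size=2)
+>>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?"]}])
+```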
+ +### 问题生成 +
+**基于百度自研中文预训练模型UNIMO-Text和大规模多领域问题生成数据集**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +# 默认模型为 unimo-text-1.0-dureader_qg +>>> question_generator = Taskflow("question_generation") +# 单条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"} + ]) +''' + ['黄山最高峰是什么'] +''' +# 多条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"}, + {"context": "弗朗索瓦·韦达外文名:franciscusvieta国籍:法国出生地:普瓦图出生日期:1540年逝世日期:1603年12月13日职业:数学家主要成就:为近代数学的发展奠定了基础。", "answer": "法国"} + ]) +''' + ['黄山最高峰是什么', '弗朗索瓦是哪里人'] +''' +``` + +#### 可配置参数说明 +* `model`:可选模型,默认为unimo-text-1.0-dureader_qg,支持的模型有["unimo-text-1.0", "unimo-text-1.0-dureader_qg", "unimo-text-1.0-question-generation", "unimo-text-1.0-question-generation-dureader_qg"]。 +* `device`:运行设备,默认为"gpu"。 +* `template`:模版,可选项有[0, 1, 2, 3],1表示使用默认模版,0表示不使用模版。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `output_scores`:是否要输出解码得分,默认为False。 +* `is_select_from_num_return_sequences`:是否对多个返回序列挑选最优项输出,当为True时,若num_return_sequences不为1则自动根据解码得分选择得分最高的序列最为最终结果,否则返回num_return_sequences个序列,默认为True。 +* `max_length`:生成代码的最大长度,默认为50。 +* `min_length`:生成代码的最小长度,默认为3。 +* `decode_strategy`:解码策略,支持beam_search和sampling,默认为beam_search。 +* `temperature`:解码参数temperature,默认为1.0。 +* `top_k`:解码参数top_k,默认为0。 +* `top_p`:解码参数top_p,默认为1.0。 +* `num_beams`:解码参数num_beams,表示beam_search解码的beam size,默认为6。 +* `num_beam_groups`:解码参数num_beam_groups,默认为1。 +* `diversity_rate`:解码参数diversity_rate,默认为0.0。 +* `length_penalty`:解码长度控制值,默认为1.2。 +* `num_return_sequences`:解码返回序列数,默认为1。 +* `repetition_penalty`:解码重复惩罚值,默认为1。 +* `use_fast`:表示是否开启基于FastGeneration的高性能预测,注意FastGeneration的高性能预测仅支持gpu,默认为False。 +* `use_fp16_decoding`: 表示在开启高性能预测的时候是否使用fp16来完成预测过程,若不使用则使用fp32,默认为False。 + +
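+如希望一次得到多个候选问题,可结合 `num_return_sequences` 与 `is_select_from_num_return_sequences` 使用,下面是一个调用示意(取值仅作演示,输出从略):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:返回2个候选问题,而不是自动挑选得分最高的一个
+>>> question_generator = Taskflow("question_generation", num_return_sequences=2, is_select_from_num_return_sequences=False)
+>>> question_generator([{"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"}])
+```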
+ +### 零样本文本分类 +
+**适配多场景的零样本通用文本分类工具**
+ +通用文本分类主要思想是利用单一模型支持通用分类、评论情感分析、语义相似度计算、蕴含推理、多项式阅读理解等众多“泛分类”任务。用户可以自定义任意标签组合,在不限定领域、不设定 prompt 的情况下进行文本分类。 + + +#### 情感分析 + +``` +>>> cls = Taskflow("zero_shot_text_classification", schema=["这是一条好评", "这是一条差评"]) +>>> cls("房间干净明亮,非常不错") +[{'predictions': [{'label': '这是一条好评', 'score': 0.9072999699439914}], 'text_a': '房间干净明亮,非常不错'}] +>>> cls("东西还可以,但是快递非常慢,下次不会再买这家了。") +[{'predictions': [{'label': '这是一条差评', 'score': 0.9282672873429476}], 'text_a': '东西还可以,但是快递非常慢,下次不会再买这家了。'}] +``` + +#### 意图识别 + +``` +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用"] +>>> cls("先天性厚甲症去哪里治") +[{'predictions': [{'label': '就医建议', 'score': 0.5494891306403806}], 'text_a': '先天性厚甲症去哪里治'}] +>>> cls("男性小腹疼痛是什么原因?") +[{'predictions': [{'label': '病因分析', 'score': 0.5763229815300723}], 'text_a': '男性小腹疼痛是什么原因?'}] +``` + +#### 语义相似度计算 + +``` +>>> from paddlenlp import Taskflow +>>> cls = Taskflow("zero_shot_text_classification", schema=["不同", "相同"]) +>>> cls([["怎么查看合同", "从哪里可以看到合同"]]) +[{'predictions': [{'label': '相同', 'score': 0.9951385264364382}], 'text_a': '怎么查看合同', 'text_b': '从哪里可以看到合同'}] +>>> cls([["为什么一直没有电话来确认借款信息", "为何我还款了,今天却接到客服电话通知"]]) +[{'predictions': [{'label': '不同', 'score': 0.9991497973466908}], 'text_a': '为什么一直没有电话来确认借款信息', 'text_b': '为何我还款了,今天却接到客服电话通知'}] +``` + +#### 蕴含推理 + +``` +>>> from paddlenlp import Taskflow +>>> cls = Taskflow("zero_shot_text_classification", schema=["无关", "蕴含", "矛盾"]) +>>> cls([["一个骑自行车的人正沿着一条城市街道朝一座有时钟的塔走去。", "骑自行车的人正朝钟楼走去。"]]) +[{'predictions': [{'label': '蕴含', 'score': 0.9931122738524856}], 'text_a': '一个骑自行车的人正沿着一条城市街道朝一座有时钟的塔走去。', 'text_b': '骑自行车的人正朝钟楼走去。'}] +>>> cls([["一个留着长发和胡须的怪人,在地铁里穿着一件颜色鲜艳的衬衫。", "这件衬衫是新的。"]]) +[{'predictions': [{'label': '无关', 'score': 0.997680189334587}], 'text_a': '一个留着长发和胡须的怪人,在地铁里穿着一件颜色鲜艳的衬衫。', 'text_b': '这件衬衫是新的。'}] +>>> cls([["一个穿着绿色衬衫的妈妈和一个穿全黑衣服的男人在跳舞。", "两人都穿着白色裤子。"]]) +[{'predictions': [{'label': '矛盾', 'score': 0.9666946163628479}], 'text_a': '一个穿着绿色衬衫的妈妈和一个穿全黑衣服的男人在跳舞。', 'text_b': '两人都穿着白色裤子。'}] +``` + +#### 可配置参数说明 + +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `task_path`:自定义任务路径,默认为None。 +* `schema`:定义任务标签候选集合。 +* `model`:选择任务使用的模型,默认为`utc-base`, 支持`utc-xbase`, `utc-base`, `utc-medium`, `utc-micro`, `utc-mini`, `utc-nano`, `utc-pico`。 +* `max_seq_len`:最长输入长度,包括所有标签的长度,默认为512。 +* `pred_threshold`:模型对标签预测的概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + +
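+下面给出一个显式构造并切换轻量模型的调用示意(`model` 与 `pred_threshold` 的取值仅作演示,标签集合沿用上文意图识别示例):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用"]
+>>> # 示例:使用更轻量的utc-nano,并将概率阈值调整为0.6
+>>> cls = Taskflow("zero_shot_text_classification", schema=schema, model="utc-nano", pred_threshold=0.6)
+>>> cls("先天性厚甲症去哪里治")
+```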
+ +### 模型特征提取 + +
  基于百度自研中文图文跨模态预训练模型ERNIE-ViL 2.0
+ +#### 多模态特征提取 + +```python +>>> from paddlenlp import Taskflow +>>> from PIL import Image +>>> import paddle.nn.functional as F +>>> vision_language= Taskflow("feature_extraction") +# 单条输入 +>>> image_embeds = vision_language(Image.open("demo/000000039769.jpg")) +>>> image_embeds["features"] +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[-0.59475428, -0.69795364, 0.22144008, 0.88066685, -0.58184201, +# 单条输入 +>>> text_embeds = vision_language("猫的照片") +>>> text_embeds['features'] +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.04250504, -0.41429776, 0.26163983, 0.29910022, 0.39019185, + -0.41884750, -0.19893740, 0.44328332, 0.08186490, 0.10953025, + ...... + +# 多条输入 +>>> image_embeds = vision_language([Image.open("demo/000000039769.jpg")]) +>>> image_embeds["features"] +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[-0.59475428, -0.69795364, 0.22144008, 0.88066685, -0.58184201, + ...... +# 多条输入 +>>> text_embeds = vision_language(["猫的照片","狗的照片"]) +>>> text_embeds["features"] +Tensor(shape=[2, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.04250504, -0.41429776, 0.26163983, ..., 0.26221892, + 0.34387422, 0.18779707], + [ 0.06672225, -0.41456309, 0.13787819, ..., 0.21791610, + 0.36693242, 0.34208685]]) +>>> image_features = image_embeds["features"] +>>> text_features = text_embeds["features"] +>>> image_features /= image_features.norm(axis=-1, keepdim=True) +>>> text_features /= text_features.norm(axis=-1, keepdim=True) +>>> logits_per_image = 100 * image_features @ text_features.t() +>>> probs = F.softmax(logits_per_image, axis=-1) +>>> probs +Tensor(shape=[1, 2], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[0.99833173, 0.00166824]]) +``` +#### 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 视觉| 文本 | 语言 | + | :---: | :--------: | :--------: | :--------: | + | `PaddlePaddle/ernie_vil-2.0-base-zh` (默认) | ViT | ERNIE | 中文 | + | `OFA-Sys/chinese-clip-vit-base-patch16` | ViT-B/16 |RoBERTa-wwm-Base| 中文 | + | `OFA-Sys/chinese-clip-vit-large-patch14` | ViT-L/14 | RoBERTa-wwm-Base | 中文 | + | `OFA-Sys/chinese-clip-vit-large-patch14-336px` | ViT-L/14 | RoBERTa-wwm-Base | 中文 | + + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `_static_mode`:静态图模式,默认开启。 +* `model`:选择任务使用的模型,默认为`PaddlePaddle/ernie_vil-2.0-base-zh`。 + +#### 文本特征提取 + +```python +>>> from paddlenlp import Taskflow +>>> import paddle.nn.functional as F +>>> text_encoder = Taskflow("feature_extraction", model='rocketqa-zh-base-query-encoder') +>>> text_embeds = text_encoder(['春天适合种什么花?','谁有狂三这张高清的?']) +>>> text_features1 = text_embeds["features"] +>>> text_features1 +Tensor(shape=[2, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.27640465, -0.13405125, 0.00612330, ..., -0.15600294, + -0.18932408, -0.03029604], + [-0.12041329, -0.07424965, 0.07895312, ..., -0.17068857, + 0.04485796, -0.18887770]]) +>>> text_embeds = text_encoder('春天适合种什么菜?') +>>> text_features2 = text_embeds["features"] +>>> text_features2 +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.32578075, -0.02398480, -0.18929179, -0.18639392, -0.04062131, + ...... 
+>>> probs = F.cosine_similarity(text_features1, text_features2)
+>>> probs
+Tensor(shape=[2], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [0.86455142, 0.41222256])
+```
+
+#### 模型选择
+
+- 多模型选择,满足精度、速度要求
+
+  | 模型 | 层数 | 维度 | 语言 |
+  | :---: | :--------: | :--------: | :--------: |
+  | `rocketqa-zh-dureader-query-encoder` | 12 | 768 | 中文 |
+  | `rocketqa-zh-dureader-para-encoder` | 12 | 768 | 中文 |
+  | `rocketqa-zh-base-query-encoder` | 12 | 768 | 中文 |
+  | `rocketqa-zh-base-para-encoder` | 12 | 768 | 中文 |
+  | `moka-ai/m3e-base` | 12 | 768 | 中文 |
+  | `rocketqa-zh-medium-query-encoder` | 6 | 768 | 中文 |
+  | `rocketqa-zh-medium-para-encoder` | 6 | 768 | 中文 |
+  | `rocketqa-zh-mini-query-encoder` | 6 | 384 | 中文 |
+  | `rocketqa-zh-mini-para-encoder` | 6 | 384 | 中文 |
+  | `rocketqa-zh-micro-query-encoder` | 4 | 384 | 中文 |
+  | `rocketqa-zh-micro-para-encoder` | 4 | 384 | 中文 |
+  | `rocketqa-zh-nano-query-encoder` | 4 | 312 | 中文 |
+  | `rocketqa-zh-nano-para-encoder` | 4 | 312 | 中文 |
+  | `rocketqav2-en-marco-query-encoder` | 12 | 768 | 英文 |
+  | `rocketqav2-en-marco-para-encoder` | 12 | 768 | 英文 |
+  | `ernie-search-base-dual-encoder-marco-en` | 12 | 768 | 英文 |
+
+#### 可配置参数说明
+* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。
+* `max_seq_len`:文本序列的最大长度,默认为128。
+* `return_tensors`: 返回的类型,有pd和np,默认为pd。
+* `model`:选择任务使用的模型,默认为`PaddlePaddle/ernie_vil-2.0-base-zh`。
+* `pooling_mode`:选择句向量获取方式,有'max_tokens','mean_tokens','mean_sqrt_len_tokens','cls_token',默认为'cls_token'(`moka-ai/m3e-base`)。
+
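+下面是一个结合上述参数的文本特征提取示例(以 `moka-ai/m3e-base` 为例,`pooling_mode`、`return_tensors` 等取值仅作示意):
+
+```python
+from paddlenlp import Taskflow
+
+text_encoder = Taskflow(
+    "feature_extraction",
+    model="moka-ai/m3e-base",
+    pooling_mode="mean_tokens",   # 句向量获取方式
+    return_tensors="np",          # 返回 numpy 数组,便于与其他库配合使用
+    max_seq_len=128,
+    batch_size=2,
+)
+embeds = text_encoder(["春天适合种什么花?", "谁有狂三这张高清的?"])
+print(embeds["features"].shape)   # 预期为 (2, 768)
+```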
+ +## PART Ⅱ   定制化训练 + +
适配任务列表
+
+如果你有自己的业务数据集,可以对模型效果进一步调优,支持定制化训练的任务如下:
+
+| 任务名称 | 默认路径 | 示例 |
+| :----------------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
+| `Taskflow("word_segmentation", mode="base")` | `$HOME/.paddlenlp/taskflow/lac` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
+| `Taskflow("word_segmentation", mode="accurate")` | `$HOME/.paddlenlp/taskflow/wordtag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
+| `Taskflow("pos_tagging")` | `$HOME/.paddlenlp/taskflow/lac` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
+| `Taskflow("ner", mode="fast")` | `$HOME/.paddlenlp/taskflow/lac` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
+| `Taskflow("ner", mode="accurate")` | `$HOME/.paddlenlp/taskflow/wordtag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
+| `Taskflow("information_extraction", model="uie-base")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-base` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) |
+| `Taskflow("information_extraction", model="uie-tiny")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-tiny` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) |
+| `Taskflow("text_correction", model="ernie-csc")` | `$HOME/.paddlenlp/taskflow/text_correction/ernie-csc` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_correction/ernie-csc) |
+| `Taskflow("dependency_parsing", model="ddparser")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
+| `Taskflow("dependency_parsing", model="ddparser-ernie-1.0")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser-ernie-1.0` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
+| `Taskflow("dependency_parsing", model="ddparser-ernie-gram-zh")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser-ernie-gram-zh` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
+| `Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch")` | `$HOME/.paddlenlp/taskflow/sentiment_analysis/skep_ernie_1.0_large_ch` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/sentiment_analysis/skep) |
+| `Taskflow("knowledge_mining", model="wordtag")` | `$HOME/.paddlenlp/taskflow/wordtag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
+| `Taskflow("knowledge_mining", model="nptag")` | `$HOME/.paddlenlp/taskflow/knowledge_mining/nptag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/nptag) |
+| `Taskflow("zero_shot_text_classification", model="utc-base")` | `$HOME/.paddlenlp/taskflow/zero_shot_text_classification/utc-base` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification) |
+
+ + +
定制化训练示例
+ +这里我们以命名实体识别`Taskflow("ner", mode="accurate")`为例,展示如何定制自己的模型。 + +调用`Taskflow`接口后,程序自动将相关文件下载到`$HOME/.paddlenlp/taskflow/wordtag/`,该默认路径包含以下文件: + +```text +$HOME/.paddlenlp/taskflow/wordtag/ +├── model_state.pdparams # 默认模型参数文件 +├── model_config.json # 默认模型配置文件 +└── tags.txt # 默认标签文件 +``` + +* 参考上表中对应[示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm)准备数据集和标签文件`tags.txt`,执行相应训练脚本得到自己的`model_state.pdparams`和`model_config.json`。 + +* 根据自己数据集情况,修改标签文件`tags.txt`。 + +* 将以上文件保存到任意路径中,自定义路径下的文件需要和默认路径的文件一致: + +```text +custom_task_path/ +├── model_state.pdparams # 定制模型参数文件 +├── model_config.json # 定制模型配置文件 +└── tags.txt # 定制标签文件 +``` +* 通过`task_path`指定自定义路径,使用Taskflow加载自定义模型进行一键预测: + +```python +from paddlenlp import Taskflow +my_ner = Taskflow("ner", mode="accurate", task_path="./custom_task_path/") +``` +
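+
+加载完成后,`my_ner` 的调用方式与使用默认模型时一致,例如(输入句子仅作示意):
+
+```python
+my_ner("热梅茶是一道以梅子为主要原料制作的茶饮")
+```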
+ +## 模型算法 + +
模型算法说明
+ + +
+| 任务名称 | 模型 | 模型详情 | 训练集 |
+| :---: | :---: | :---: | :---: |
+| 中文分词 | 默认模式:BiGRU+CRF | 训练详情 | 百度自建数据集,包含近2200万句子,覆盖多种场景 |
+| 中文分词 | 快速模式:Jieba | - | - |
+| 中文分词 | 精确模式:WordTag | 训练详情 | 百度自建数据集,词类体系基于TermTree构建 |
+| 词性标注 | BiGRU+CRF | 训练详情 | 百度自建数据集,包含2200万句子,覆盖多种场景 |
+| 命名实体识别 | 精确模式:WordTag | 训练详情 | 百度自建数据集,词类体系基于TermTree构建 |
+| 命名实体识别 | 快速模式:BiGRU+CRF | 训练详情 | 百度自建数据集,包含2200万句子,覆盖多种场景 |
+| 依存句法分析 | DDParser | 训练详情 | 百度自建数据集,DuCTB 1.0中文依存句法树库 |
+| 信息抽取 | UIE | 训练详情 | 百度自建数据集 |
+| 解语知识标注 | 词类知识标注:WordTag | 训练详情 | 百度自建数据集,词类体系基于TermTree构建 |
+| 解语知识标注 | 名词短语标注:NPTag | 训练详情 | 百度自建数据集 |
+| 文本纠错 | ERNIE-CSC | 训练详情 | SIGHAN简体版数据集及Automatic Corpus Generation生成的中文纠错数据集 |
+| 文本相似度 | SimBERT | - | 收集百度知道2200万对相似句组 |
+| 情感分析 | BiLSTM | - | 百度自建数据集 |
+| 情感分析 | SKEP | 训练详情 | 百度自建数据集 |
+| 情感分析 | UIE | 训练详情 | 百度自建数据集 |
+| 生成式问答 | CPM | - | 100GB级别中文数据 |
+| 智能写诗 | CPM | - | 100GB级别中文数据 |
+| 开放域对话 | PLATO-Mini | - | 十亿级别中文对话数据 |
+| 零样本文本分类 | UTC | 训练详情 | 百度自建数据集 |
+ +
+ +## FAQ + +
+**Q:** Taskflow如何修改任务保存路径?
+ +**A:** Taskflow默认会将任务相关模型等文件保存到`$HOME/.paddlenlp`下,可以在任务初始化的时候通过`home_path`自定义修改保存路径。示例: +```python +from paddlenlp import Taskflow + +ner = Taskflow("ner", home_path="/workspace") +``` +通过以上方式即可将ner任务相关文件保存至`/workspace`路径下。 +
+ + +
+**Q:** 下载或调用模型失败,多次下载均失败怎么办?
+
+**A:** Taskflow默认会将任务相关模型等文件保存到`$HOME/.paddlenlp/taskflow`下,如果下载或调用失败,可删除相应路径下的文件后重新尝试即可。
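+
+例如,`Taskflow("ner", mode="accurate")` 对应的默认缓存目录为 `$HOME/.paddlenlp/taskflow/wordtag`(见上文适配任务列表),可参考以下方式清理后重新下载(仅作示意):
+
+```python
+import shutil
+from pathlib import Path
+
+# 删除 wordtag 相关缓存文件,下次调用 Taskflow 时会重新下载
+shutil.rmtree(Path.home() / ".paddlenlp" / "taskflow" / "wordtag", ignore_errors=True)
+```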
+ +
+**Q:** Taskflow如何提升预测速度?
+ +**A:** 可以结合设备情况适当调整batch_size,采用批量输入的方式来提升平均速率。示例: +```python +from paddlenlp import Taskflow + +# 精确模式模型体积较大,可结合机器情况适当调整batch_size,采用批量样本输入的方式。 +seg_accurate = Taskflow("word_segmentation", mode="accurate", batch_size=32) + +# 批量样本输入,输入为多个句子组成的list,预测速度更快 +texts = ["热梅茶是一道以梅子为主要原料制作的茶饮", "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"] +seg_accurate(texts) +``` +通过上述方式进行分词可以大幅提升预测速度。 + +
+ +
+**Q:** 后续会增加更多任务支持吗?
+
+**A:** Taskflow 支持的任务正在持续丰富中,我们会根据开发者反馈灵活调整功能建设的优先级,欢迎通过 Issue 或[问卷](https://wenjuan.baidu-int.com/manage/?r=survey/pageEdit&sid=85827)反馈给我们。
+
+ + +## 附录 + +
参考资料
+ +1. [fxsjy/jieba](https://github.com/fxsjy/jieba) +2. [ZhuiyiTechnology/simbert](https://github.com/ZhuiyiTechnology/simbert) +3. [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) + +
diff --git a/docs/model_zoo/transformers/ALBERT/contents.rst b/docs/model_zoo/transformers/ALBERT/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..febf37681a07d7d47b9dfaec6acabd8a7f5683b2 --- /dev/null +++ b/docs/model_zoo/transformers/ALBERT/contents.rst @@ -0,0 +1,70 @@ + + +------------------------------------ +ALBERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ALBERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``albert-base-v1`` | English | 12 repeating layers, 128 embedding, | +| | | 768-hidden, 12-heads, 11M parameters. | +| | | ALBERT base model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-large-v1`` | English | 24 repeating layers, 128 embedding, | +| | | 1024-hidden, 16-heads, 17M parameters. | +| | | ALBERT large model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xlarge-v1`` | English | 24 repeating layers, 128 embedding, | +| | | 2048-hidden, 16-heads, 58M parameters. | +| | | ALBERT xlarge model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xxlarge-v1`` | English | 12 repeating layers, 128 embedding, | +| | | 4096-hidden, 64-heads, 223M parameters. | +| | | ALBERT xxlarge model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-base-v2`` | English | 12 repeating layers, 128 embedding, | +| | | 768-hidden, 12-heads, 11M parameters. | +| | | ALBERT base model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-large-v2`` | English | 24 repeating layers, 128 embedding, | +| | | 1024-hidden, 16-heads, 17M parameters. | +| | | ALBERT large model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xlarge-v2`` | English | 24 repeating layers, 128 embedding, | +| | | 2048-hidden, 16-heads, 58M parameters. | +| | | ALBERT xlarge model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xxlarge-v2`` | English | 12 repeating layers, 128 embedding, | +| | | 4096-hidden, 64-heads, 223M parameters. 
| +| | | ALBERT xxlarge model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-tiny`` | Chinese | 4 repeating layers, 128 embedding, | +| | | 312-hidden, 12-heads, 4M parameters. | +| | | ALBERT tiny model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-small`` | Chinese | 6 repeating layers, 128 embedding, | +| | | 384-hidden, 12-heads, _M parameters. | +| | | ALBERT small model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-base`` | Chinese | 12 repeating layers, 128 embedding, | +| | | 768-hidden, 12-heads, 12M parameters. | +| | | ALBERT base model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-large`` | Chinese | 24 repeating layers, 128 embedding, | +| | | 1024-hidden, 16-heads, 18M parameters. | +| | | ALBERT large model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-xlarge`` | Chinese | 24 repeating layers, 128 embedding, | +| | | 2048-hidden, 16-heads, 60M parameters. | +| | | ALBERT xlarge model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-xxlarge`` | Chinese | 12 repeating layers, 128 embedding, | +| | | 4096-hidden, 16-heads, 235M parameters. | +| | | ALBERT xxlarge model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/BART/contents.rst b/docs/model_zoo/transformers/BART/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..d9205a31fc48bf36cc9155a46c20f7433c581af3 --- /dev/null +++ b/docs/model_zoo/transformers/BART/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +BART模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的BART模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``bart-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 217M parameters. 
| +| | | BART base model (English) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``bart-large`` | English | 24-layer, 768-hidden, | +| | | 16-heads, 509M parameters. | +| | | BART large model (English) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/BERT/contents.rst b/docs/model_zoo/transformers/BERT/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..5cc533a10007f0f21b2c469387430c347b2c1180 --- /dev/null +++ b/docs/model_zoo/transformers/BERT/contents.rst @@ -0,0 +1,692 @@ + + +------------------------------------ +BERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的BERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +| ``bert-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 110M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-uncased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-cased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | Trained on cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-cased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 335M parameters. | +| | | Trained on cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-multilingual-uncased`` | Multilingual | 12-layer, 768-hidden, | +| | | 12-heads, 168M parameters. | +| | | Trained on lower-cased text | +| | | in the top 102 languages | +| | | with the largest Wikipedias. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-multilingual-cased`` | Multilingual | 12-layer, 768-hidden, | +| | | 12-heads, 179M parameters. | +| | | Trained on cased text | +| | | in the top 104 languages | +| | | with the largest Wikipedias. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on cased Chinese Simplified | +| | | and Traditional text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-wwm-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on cased Chinese Simplified | +| | | and Traditional text using | +| | | Whole-Word-Masking. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-wwm-ext-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on cased Chinese Simplified | +| | | and Traditional text using | +| | | Whole-Word-Masking with extented data. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-base`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-12_H-768`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-medium`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-8_H-512`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-small`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-4_H-512`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-mini`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-4_H-256`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-tiny`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-2_H-128`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-6l-768h`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-6_H-768`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese-pos`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese-pos`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``tbs17/MathBERT`` | English | Please refer to: | +| | | `tbs17/MathBERT`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macbert-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained with novel MLM as correction | +| | | pre-training task. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macbert-large-chinese`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 326M parameters. | +| | | Trained with novel MLM as correction | +| | | pre-training task. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``simbert-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on 22 million pairs of similar | +| | | sentences crawed from Baidu Know. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Langboat/mengzi-bert-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on 300G Chinese Corpus Datasets. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Langboat/mengzi-bert-base-fin`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on 20G Finacial Corpus, | +| | | based on ``Langboat/mengzi-bert-base``. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cross-encoder/ms-marco-MiniLM-L-12-v2`` | English | Please refer to: | +| | | `cross-encoder/ms-marco-MiniLM-L-12-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-char`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-char`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-whole-word-masking`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-whole-word-masking`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlptown/bert-base-multilingual-uncased-sentiment`` | Multilingual | Please refer to: | +| | | `nlptown/bert-base-multilingual-uncased-sentiment`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-uncased-whole-word-masking-finetuned-squad`` | English | Please refer to: | +| | | `bert-large-uncased-whole-word-masking-finetuned-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``finiteautomata/beto-sentiment-analysis`` | Spanish | Please refer to: | +| | | `finiteautomata/beto-sentiment-analysis`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/chinese-bert-wwm-ext`` | Chinese | Please refer to: | +| | | `hfl/chinese-bert-wwm-ext`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``emilyalsentzer/Bio_ClinicalBERT`` | English | Please refer to: | +| | | `emilyalsentzer/Bio_ClinicalBERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dslim/bert-base-NER`` | English | Please refer to: | +| | | `dslim/bert-base-NER`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/bert-large-uncased-whole-word-masking-squad2`` | English | Please refer to: | +| | | `deepset/bert-large-uncased-whole-word-masking-squad2`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``neuralmind/bert-base-portuguese-cased`` | Portuguese | Please refer to: | +| | | `neuralmind/bert-base-portuguese-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SpanBERT/spanbert-large-cased`` | English | Please refer to: | +| | | `SpanBERT/spanbert-large-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dslim/bert-large-NER`` | English | Please refer to: | +| | | `dslim/bert-large-NER`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-german-cased`` | German | Please refer to: | +| | | `bert-base-german-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/sentence_bert`` | English | Please refer to: | +| | | `deepset/sentence_bert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ProsusAI/finbert`` | English | Please refer to: | +| | | `ProsusAI/finbert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``oliverguhr/german-sentiment-bert`` | German | Please refer to: | +| | | `oliverguhr/german-sentiment-bert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-2_H-128_A-2`` | English | Please refer to: | +| | | `google/bert_uncased_L-2_H-128_A-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`` | English | Please refer to: | +| | | `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DeepPavlov/rubert-base-cased`` | Russian | Please refer to: | +| | | `DeepPavlov/rubert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``wietsedv/bert-base-dutch-cased`` | Dutch | Please refer to: | +| | | `wietsedv/bert-base-dutch-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``monologg/bert-base-cased-goemotions-original`` | English | Please refer to: | +| | | 
`monologg/bert-base-cased-goemotions-original`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``allenai/scibert_scivocab_uncased`` | English | Please refer to: | +| | | `allenai/scibert_scivocab_uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-large-cased-finetuned-conll03-english`` | English | Please refer to: | +| | | `dbmdz/bert-large-cased-finetuned-conll03-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`` | English | Please refer to: | +| | | `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-uncased-whole-word-masking`` | English | Please refer to: | +| | | `bert-large-uncased-whole-word-masking`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dccuchile/bert-base-spanish-wwm-uncased`` | Spanish | Please refer to: | +| | | `dccuchile/bert-base-spanish-wwm-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-6_H-256_A-4`` | English | Please refer to: | +| | | `google/bert_uncased_L-6_H-256_A-4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-4_H-512_A-8`` | English | Please refer to: | +| | | `google/bert_uncased_L-4_H-512_A-8`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``FPTAI/vibert-base-cased`` | English | Please refer to: | +| | | `FPTAI/vibert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cointegrated/rubert-tiny`` | Russian | Please refer to: | +| | | `cointegrated/rubert-tiny`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-german-dbmdz-uncased`` | German | Please refer to: | +| | | `bert-base-german-dbmdz-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-turkish-128k-cased`` | Turkish | Please refer to: | +| | | `dbmdz/bert-base-turkish-128k-cased`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-german-uncased`` | German | Please refer to: | +| | | `dbmdz/bert-base-german-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/minilm-uncased-squad2`` | English | Please refer to: | +| | | `deepset/minilm-uncased-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/bert-base-parsbert-uncased`` | Persian | Please refer to: | +| | | `HooshvareLab/bert-base-parsbert-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-ag-news`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-ag-news`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-v2`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``emilyalsentzer/Bio_Discharge_Summary_BERT`` | English | Please refer to: | +| | | `emilyalsentzer/Bio_Discharge_Summary_BERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``KoichiYasuoka/bert-base-japanese-upos`` | Japanese | Please refer to: | +| | | `KoichiYasuoka/bert-base-japanese-upos`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-italian-xxl-cased`` | Italian | Please refer to: | +| | | `dbmdz/bert-base-italian-xxl-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/bert-base-cased-squad2`` | English | Please refer to: | +| | | `deepset/bert-base-cased-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``beomi/kcbert-large`` | English | Please refer to: | +| | | `beomi/kcbert-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-cased-whole-word-masking-finetuned-squad`` | English | Please refer to: | +| | | `bert-large-cased-whole-word-masking-finetuned-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| 
``neuralmind/bert-large-portuguese-cased`` |Portuguese | Please refer to: | +| | | `neuralmind/bert-large-portuguese-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Luyu/co-condenser-marco`` | English | Please refer to: | +| | | `Luyu/co-condenser-marco`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Sahajtomar/German_Zeroshot`` | German | Please refer to: | +| | | `Sahajtomar/German_Zeroshot`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``indolem/indobert-base-uncased`` | Indonesian | Please refer to: | +| | | `indolem/indobert-base-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``shibing624/text2vec-base-chinese`` | Chinese | Please refer to: | +| | | `shibing624/text2vec-base-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cointegrated/LaBSE-en-ru`` | English | Please refer to: | +| | and Russian | `cointegrated/LaBSE-en-ru`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``prithivida/parrot_fluency_on_BERT`` | English | Please refer to: | +| | | `prithivida/parrot_fluency_on_BERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-SST-2`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-SST-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-snli`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-snli`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``klue/bert-base`` | English | Please refer to: | +| | | `klue/bert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``asafaya/bert-base-arabic`` | Arabic | Please refer to: | +| | | `asafaya/bert-base-arabic`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-MRPC`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-MRPC`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| 
``textattack/bert-base-uncased-imdb`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cross-encoder/ms-marco-TinyBERT-L-2`` | English | Please refer to: | +| | | `cross-encoder/ms-marco-TinyBERT-L-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-tiny-finetuned-sms-spam-detection`` | English | Please refer to: | +| | | `mrm8488/bert-tiny-finetuned-sms-spam-detection`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``felflare/bert-restore-punctuation`` | English | Please refer to: | +| | | `felflare/bert-restore-punctuation`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english`` | English | Please refer to: | +| | | `sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-rotten-tomatoes`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-rotten-tomatoes`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlpaueb/legal-bert-base-uncased`` | English | Please refer to: | +| | | `nlpaueb/legal-bert-base-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hf-internal-testing/tiny-bert-for-token-classification`` | English | Please refer to: | +| | | `hf-internal-testing/tiny-bert-for-token-classification`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cointegrated/rubert-tiny2`` | Russian | Please refer to: | +| | | `cointegrated/rubert-tiny2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``kykim/bert-kor-base`` | Korean | Please refer to: | +| | | `kykim/bert-kor-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-char-v2`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-char-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-small-finetuned-squadv2`` | English | Please refer to: | +| | | `mrm8488/bert-small-finetuned-squadv2`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``beomi/kcbert-base`` | English | Please refer to: | +| | | `beomi/kcbert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-MNLI`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-MNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-WNLI`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-WNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-turkish-cased`` | Turkish | Please refer to: | +| | | `dbmdz/bert-base-turkish-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huawei-noah/TinyBERT_General_4L_312D`` | English | Please refer to: | +| | | `huawei-noah/TinyBERT_General_4L_312D`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-QQP`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-QQP`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-STS-B`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-STS-B`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``allenai/scibert_scivocab_cased`` | English | Please refer to: | +| | | `allenai/scibert_scivocab_cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-medium-finetuned-squadv2`` | English | Please refer to: | +| | | `mrm8488/bert-medium-finetuned-squadv2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``TurkuNLP/bert-base-finnish-cased-v1`` | Finnish | Please refer to: | +| | | `TurkuNLP/bert-base-finnish-cased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-RTE`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-RTE`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/roberta-base-chinese-extractive-qa`` | Chinese | Please refer to: | 
+| | | `uer/roberta-base-chinese-extractive-qa`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-QNLI`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-QNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-CoLA`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-CoLA`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dmis-lab/biobert-base-cased-v1.2`` | English | Please refer to: | +| | | `dmis-lab/biobert-base-cased-v1.2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pierreguillou/bert-base-cased-squad-v1.1-portuguese`` | Portuguese | Please refer to: | +| | | `pierreguillou/bert-base-cased-squad-v1.1-portuguese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``KB/bert-base-swedish-cased`` | Swedish | Please refer to: | +| | | `KB/bert-base-swedish-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/roberta-base-finetuned-cluener2020-chinese`` | Chinese | Please refer to: | +| | | `uer/roberta-base-finetuned-cluener2020-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``onlplab/alephbert-base`` | Hebrew | Please refer to: | +| | | `onlplab/alephbert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-spanish-cased-finetuned-ner`` | Spanish | Please refer to: | +| | | `mrm8488/bert-spanish-cased-finetuned-ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``alvaroalon2/biobert_chemical_ner`` | English | Please refer to: | +| | | `alvaroalon2/biobert_chemical_ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-cased-finetuned-mrpc`` | English | Please refer to: | +| | | `bert-base-cased-finetuned-mrpc`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``unitary/toxic-bert`` | English | Please refer to: | +| | | `unitary/toxic-bert`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlpaueb/bert-base-greek-uncased-v1`` | Greek | Please refer to: | +| | | `nlpaueb/bert-base-greek-uncased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/bert-fa-base-uncased-sentiment-snappfood`` | Persian | Please refer to: | +| | | `HooshvareLab/bert-fa-base-uncased-sentiment-snappfood`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Maltehb/danish-bert-botxo`` | Danish | Please refer to: | +| | | `Maltehb/danish-bert-botxo`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``shahrukhx01/bert-mini-finetune-question-detection`` | English | Please refer to: | +| | | `shahrukhx01/bert-mini-finetune-question-detection`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/bert-base-dutch-cased`` | Dutch | Please refer to: | +| | | `GroNLP/bert-base-dutch-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SpanBERT/spanbert-base-cased`` | English | Please refer to: | +| | | `SpanBERT/spanbert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-italian-uncased`` | Italian | Please refer to: | +| | | `dbmdz/bert-base-italian-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-german-cased`` | Germanh | Please refer to: | +| | | `dbmdz/bert-base-german-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-large-japanese`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-large-japanese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/chinese-bert-wwm`` | Chinese | Please refer to: | +| | | `hfl/chinese-bert-wwm`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/chinese-macbert-large`` | Chinese | Please refer to: | +| | | `hfl/chinese-macbert-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dslim/bert-base-NER-uncased`` | English | Please refer to: | +| | | 
`dslim/bert-base-NER-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``amberoad/bert-multilingual-passage-reranking-msmarco`` | Multilingual | Please refer to: | +| | | `amberoad/bert-multilingual-passage-reranking-msmarco`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``aubmindlab/bert-base-arabertv02`` | Arabic | Please refer to: | +| | | `aubmindlab/bert-base-arabertv02`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-4_H-256_A-4`` | English | Please refer to: | +| | | `google/bert_uncased_L-4_H-256_A-4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DeepPavlov/rubert-base-cased-conversational`` | Russian | Please refer to: | +| | | `DeepPavlov/rubert-base-cased-conversational`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dccuchile/bert-base-spanish-wwm-cased`` | Spanish | Please refer to: | +| | | `dccuchile/bert-base-spanish-wwm-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese-ws`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese-ws`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``daigo/bert-base-japanese-sentiment`` | Japanese | Please refer to: | +| | | `daigo/bert-base-japanese-sentiment`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SZTAKI-HLT/hubert-base-cc`` | Hungarian | Please refer to: | +| | | `SZTAKI-HLT/hubert-base-cc`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlpaueb/legal-bert-small-uncased`` | English | Please refer to: | +| | | `nlpaueb/legal-bert-small-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dumitrescustefan/bert-base-romanian-uncased-v1`` | Romanian | Please refer to: | +| | | `dumitrescustefan/bert-base-romanian-uncased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/muril-base-cased`` | Indian | Please refer to: | +| | | `google/muril-base-cased`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dkleczek/bert-base-polish-uncased-v1`` | Polish | Please refer to: | +| | | `dkleczek/bert-base-polish-uncased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese-ner`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese-ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``savasy/bert-base-turkish-sentiment-cased`` | Turkish | Please refer to: | +| | | `savasy/bert-base-turkish-sentiment-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es`` | Spanish | Please refer to: | +| | | `mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``KB/bert-base-swedish-cased-ner`` | Swedish | Please refer to: | +| | | `KB/bert-base-swedish-cased-ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/rbt3`` | Chinese | Please refer to: | +| | | `hfl/rbt3`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``remotejob/gradientclassification_v0`` | English | Please refer to: | +| | | `remotejob/gradientclassification_v0`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Recognai/bert-base-spanish-wwm-cased-xnli`` | Spanish | Please refer to: | +| | | `Recognai/bert-base-spanish-wwm-cased-xnli`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/bert-fa-zwnj-base`` | Persian | Please refer to: | +| | | `HooshvareLab/bert-fa-zwnj-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``monologg/bert-base-cased-goemotions-group`` | English | Please refer to: | +| | | `monologg/bert-base-cased-goemotions-group`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``blanchefort/rubert-base-cased-sentiment`` | Russian | Please refer to: | +| | | `blanchefort/rubert-base-cased-sentiment`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``shibing624/macbert4csc-base-chinese`` | Chinese | Please refer to: | +| | | `shibing624/macbert4csc-base-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-8_H-512_A-8`` | English | Please refer to: | +| | | `google/bert_uncased_L-8_H-512_A-8`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-cased-whole-word-masking`` | English | Please refer to: | +| | | `bert-large-cased-whole-word-masking`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``alvaroalon2/biobert_diseases_ner`` | English | Please refer to: | +| | | `alvaroalon2/biobert_diseases_ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``philschmid/BERT-Banking77`` | English | Please refer to: | +| | | `philschmid/BERT-Banking77`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-turkish-uncased`` | Turkish | Please refer to: | +| | | `dbmdz/bert-base-turkish-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``vblagoje/bert-english-uncased-finetuned-pos`` | English | Please refer to: | +| | | `vblagoje/bert-english-uncased-finetuned-pos`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dumitrescustefan/bert-base-romanian-cased-v1`` | Romanian | Please refer to: | +| | | `dumitrescustefan/bert-base-romanian-cased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nreimers/BERT-Tiny_L-2_H-128_A-2`` | English | Please refer to: | +| | | `nreimers/BERT-Tiny_L-2_H-128_A-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``digitalepidemiologylab/covid-twitter-bert-v2`` | English | Please refer to: | +| | | `digitalepidemiologylab/covid-twitter-bert-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``UBC-NLP/MARBERT`` | (DA) and MSA | Please refer to: | +| | | `UBC-NLP/MARBERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| 
``pierreguillou/bert-large-cased-squad-v1.1-portuguese`` | Portuguese | Please refer to: | +| | | `pierreguillou/bert-large-cased-squad-v1.1-portuguese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``alvaroalon2/biobert_genetic_ner`` | English | Please refer to: | +| | | `alvaroalon2/biobert_genetic_ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bvanaken/clinical-assertion-negation-bert`` | English | Please refer to: | +| | | `bvanaken/clinical-assertion-negation-bert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cross-encoder/stsb-TinyBERT-L-4`` | English | Please refer to: | +| | | `cross-encoder/stsb-TinyBERT-L-4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sshleifer/tiny-distilbert-base-cased`` | English | Please refer to: | +| | | `sshleifer/tiny-distilbert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``fabriceyhc/bert-base-uncased-amazon_polarity`` | English | Please refer to: | +| | | `fabriceyhc/bert-base-uncased-amazon_polarity`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _ckiplab/bert-base-chinese-pos: https://huggingface.co/ckiplab/bert-base-chinese-pos +.. _uer/chinese_roberta_L-12_H-768: https://huggingface.co/uer/chinese_roberta_L-12_H-768 +.. _uer/chinese_roberta_L-6_H-768: https://huggingface.co/uer/chinese_roberta_L-6_H-768 +.. _uer/chinese_roberta_L-8_H-512: https://huggingface.co/uer/chinese_roberta_L-8_H-512 +.. _uer/chinese_roberta_L-4_H-512: https://huggingface.co/uer/chinese_roberta_L-4_H-512 +.. _uer/chinese_roberta_L-4_H-256: https://huggingface.co/uer/chinese_roberta_L-4_H-256 +.. _uer/chinese_roberta_L-2_H-128: https://huggingface.co/uer/chinese_roberta_L-2_H-128 +.. _tbs17/MathBERT: https://huggingface.co/tbs17/MathBERT +.. _cross-encoder/ms-marco-MiniLM-L-12-v2: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2 +.. _cl-tohoku/bert-base-japanese-char: https://huggingface.co/cl-tohoku/bert-base-japanese-char +.. _cl-tohoku/bert-base-japanese-whole-word-masking: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking +.. _cl-tohoku/bert-base-japanese: https://huggingface.co/cl-tohoku/bert-base-japanese +.. _nlptown/bert-base-multilingual-uncased-sentiment: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment +.. _bert-large-uncased-whole-word-masking-finetuned-squad: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad +.. 
_finiteautomata/beto-sentiment-analysis: https://huggingface.co/finiteautomata/beto-sentiment-analysis +.. _hfl/chinese-bert-wwm-ext: https://huggingface.co/hfl/chinese-bert-wwm-ext +.. _emilyalsentzer/Bio_ClinicalBERT: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT +.. _dslim/bert-base-NER: https://huggingface.co/dslim/bert-base-NER +.. _deepset/bert-large-uncased-whole-word-masking-squad2: https://huggingface.co/deepset/bert-large-uncased-whole-word-masking-squad2 +.. _neuralmind/bert-base-portuguese-cased: https://huggingface.co/neuralmind/bert-base-portuguese-cased +.. _SpanBERT/spanbert-large-cased: https://huggingface.co/SpanBERT/spanbert-large-cased +.. _dslim/bert-large-NER: https://huggingface.co/dslim/bert-large-NER +.. _bert-base-german-cased: https://huggingface.co/bert-base-german-cased +.. _deepset/sentence_bert: https://huggingface.co/deepset/sentence_bert +.. _ProsusAI/finbert: https://huggingface.co/ProsusAI/finbert +.. _oliverguhr/german-sentiment-bert: https://huggingface.co/oliverguhr/german-sentiment-bert +.. _google/bert_uncased_L-2_H-128_A-2: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2 +.. _DeepPavlov/rubert-base-cased: https://huggingface.co/DeepPavlov/rubert-base-cased +.. _wietsedv/bert-base-dutch-cased: https://huggingface.co/wietsedv/bert-base-dutch-cased +.. _monologg/bert-base-cased-goemotions-original: https://huggingface.co/monologg/bert-base-cased-goemotions-original +.. _allenai/scibert_scivocab_uncased: https://huggingface.co/allenai/scibert_scivocab_uncased +.. _microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract +.. _dbmdz/bert-large-cased-finetuned-conll03-english: https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english +.. _microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext +.. _bert-large-uncased-whole-word-masking: https://huggingface.co/bert-large-uncased-whole-word-masking +.. _dccuchile/bert-base-spanish-wwm-uncased: https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased +.. _google/bert_uncased_L-6_H-256_A-4: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4 +.. _google/bert_uncased_L-4_H-512_A-8: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8 +.. _FPTAI/vibert-base-cased: https://huggingface.co/FPTAI/vibert-base-cased +.. _cointegrated/rubert-tiny: https://huggingface.co/cointegrated/rubert-tiny +.. _bert-base-german-dbmdz-uncased: https://huggingface.co/bert-base-german-dbmdz-uncased +.. _dbmdz/bert-base-turkish-128k-cased: https://huggingface.co/dbmdz/bert-base-turkish-128k-cased +.. _dbmdz/bert-base-german-uncased: https://huggingface.co/dbmdz/bert-base-german-uncased +.. _deepset/minilm-uncased-squad2: https://huggingface.co/deepset/minilm-uncased-squad2 +.. _HooshvareLab/bert-base-parsbert-uncased: https://huggingface.co/HooshvareLab/bert-base-parsbert-uncased +.. _textattack/bert-base-uncased-ag-news: https://huggingface.co/textattack/bert-base-uncased-ag-news +.. _cl-tohoku/bert-base-japanese-v2: https://huggingface.co/cl-tohoku/bert-base-japanese-v2 +.. _emilyalsentzer/Bio_Discharge_Summary_BERT: https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT +.. _KoichiYasuoka/bert-base-japanese-upos: https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos +.. _dbmdz/bert-base-italian-xxl-cased: https://huggingface.co/dbmdz/bert-base-italian-xxl-cased +.. 
_deepset/bert-base-cased-squad2: https://huggingface.co/deepset/bert-base-cased-squad2 +.. _beomi/kcbert-large: https://huggingface.co/beomi/kcbert-large +.. _bert-large-cased-whole-word-masking-finetuned-squad: https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad +.. _neuralmind/bert-large-portuguese-cased: https://huggingface.co/neuralmind/bert-large-portuguese-cased +.. _Luyu/co-condenser-marco: https://huggingface.co/Luyu/co-condenser-marco +.. _Sahajtomar/German_Zeroshot: https://huggingface.co/Sahajtomar/German_Zeroshot +.. _indolem/indobert-base-uncased: https://huggingface.co/indolem/indobert-base-uncased +.. _shibing624/text2vec-base-chinese: https://huggingface.co/shibing624/text2vec-base-chinese +.. _cointegrated/LaBSE-en-ru: https://huggingface.co/cointegrated/LaBSE-en-ru +.. _prithivida/parrot_fluency_on_BERT: https://huggingface.co/prithivida/parrot_fluency_on_BERT +.. _textattack/bert-base-uncased-SST-2: https://huggingface.co/textattack/bert-base-uncased-SST-2 +.. _textattack/bert-base-uncased-snli: https://huggingface.co/textattack/bert-base-uncased-snli +.. _klue/bert-base: https://huggingface.co/klue/bert-base +.. _asafaya/bert-base-arabic: https://huggingface.co/asafaya/bert-base-arabic +.. _textattack/bert-base-uncased-MRPC: https://huggingface.co/textattack/bert-base-uncased-MRPC +.. _textattack/bert-base-uncased-imdb: https://huggingface.co/textattack/bert-base-uncased-imdb +.. _cross-encoder/ms-marco-TinyBERT-L-2: https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2 +.. _mrm8488/bert-tiny-finetuned-sms-spam-detection: https://huggingface.co/mrm8488/bert-tiny-finetuned-sms-spam-detection +.. _felflare/bert-restore-punctuation: https://huggingface.co/felflare/bert-restore-punctuation +.. _sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english: https://huggingface.co/sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english +.. _textattack/bert-base-uncased-rotten-tomatoes: https://huggingface.co/textattack/bert-base-uncased-rotten-tomatoes +.. _nlpaueb/legal-bert-base-uncased: https://huggingface.co/nlpaueb/legal-bert-base-uncased +.. _hf-internal-testing/tiny-bert-for-token-classification: https://huggingface.co/hf-internal-testing/tiny-bert-for-token-classification +.. _cointegrated/rubert-tiny2: https://huggingface.co/cointegrated/rubert-tiny2 +.. _kykim/bert-kor-base: https://huggingface.co/kykim/bert-kor-base +.. _cl-tohoku/bert-base-japanese-char-v2: https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2 +.. _mrm8488/bert-small-finetuned-squadv2: https://huggingface.co/mrm8488/bert-small-finetuned-squadv2 +.. _beomi/kcbert-base: https://huggingface.co/beomi/kcbert-base +.. _textattack/bert-base-uncased-MNLI: https://huggingface.co/textattack/bert-base-uncased-MNLI +.. _textattack/bert-base-uncased-WNLI: https://huggingface.co/textattack/bert-base-uncased-WNLI +.. _dbmdz/bert-base-turkish-cased: https://huggingface.co/dbmdz/bert-base-turkish-cased +.. _huawei-noah/TinyBERT_General_4L_312D: https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D +.. _textattack/bert-base-uncased-QQP: https://huggingface.co/textattack/bert-base-uncased-QQP +.. _textattack/bert-base-uncased-STS-B: https://huggingface.co/textattack/bert-base-uncased-STS-B +.. _allenai/scibert_scivocab_cased: https://huggingface.co/allenai/scibert_scivocab_cased +.. _mrm8488/bert-medium-finetuned-squadv2: https://huggingface.co/mrm8488/bert-medium-finetuned-squadv2 +.. 
_TurkuNLP/bert-base-finnish-cased-v1: https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1 +.. _textattack/bert-base-uncased-RTE: https://huggingface.co/textattack/bert-base-uncased-RTE +.. _uer/roberta-base-chinese-extractive-qa: https://huggingface.co/uer/roberta-base-chinese-extractive-qa +.. _textattack/bert-base-uncased-QNLI: https://huggingface.co/textattack/bert-base-uncased-QNLI +.. _textattack/bert-base-uncased-CoLA: https://huggingface.co/textattack/bert-base-uncased-CoLA +.. _dmis-lab/biobert-base-cased-v1.2: https://huggingface.co/dmis-lab/biobert-base-cased-v1.2 +.. _pierreguillou/bert-base-cased-squad-v1.1-portuguese: https://huggingface.co/pierreguillou/bert-base-cased-squad-v1.1-portuguese +.. _KB/bert-base-swedish-cased: https://huggingface.co/KB/bert-base-swedish-cased +.. _uer/roberta-base-finetuned-cluener2020-chinese: https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese +.. _onlplab/alephbert-base: https://huggingface.co/onlplab/alephbert-base +.. _mrm8488/bert-spanish-cased-finetuned-ner: https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner +.. _alvaroalon2/biobert_chemical_ner: https://huggingface.co/alvaroalon2/biobert_chemical_ner +.. _bert-base-cased-finetuned-mrpc: https://huggingface.co/bert-base-cased-finetuned-mrpc +.. _unitary/toxic-bert: https://huggingface.co/unitary/toxic-bert +.. _nlpaueb/bert-base-greek-uncased-v1: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 +.. _HooshvareLab/bert-fa-base-uncased-sentiment-snappfood: https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood +.. _Maltehb/danish-bert-botxo: https://huggingface.co/Maltehb/danish-bert-botxo +.. _shahrukhx01/bert-mini-finetune-question-detection: https://huggingface.co/shahrukhx01/bert-mini-finetune-question-detection +.. _GroNLP/bert-base-dutch-cased: https://huggingface.co/GroNLP/bert-base-dutch-cased +.. _SpanBERT/spanbert-base-cased: https://huggingface.co/SpanBERT/spanbert-base-cased +.. _dbmdz/bert-base-italian-uncased: https://huggingface.co/dbmdz/bert-base-italian-uncased +.. _dbmdz/bert-base-german-cased: https://huggingface.co/dbmdz/bert-base-german-cased +.. _cl-tohoku/bert-large-japanese: https://huggingface.co/cl-tohoku/bert-large-japanese +.. _hfl/chinese-bert-wwm: https://huggingface.co/hfl/chinese-bert-wwm +.. _hfl/chinese-macbert-large: https://huggingface.co/hfl/chinese-macbert-large +.. _dslim/bert-base-NER-uncased: https://huggingface.co/dslim/bert-base-NER-uncased +.. _amberoad/bert-multilingual-passage-reranking-msmarco: https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco +.. _aubmindlab/bert-base-arabertv02: https://huggingface.co/aubmindlab/bert-base-arabertv02 +.. _google/bert_uncased_L-4_H-256_A-4: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4 +.. _DeepPavlov/rubert-base-cased-conversational: https://huggingface.co/DeepPavlov/rubert-base-cased-conversational +.. _dccuchile/bert-base-spanish-wwm-cased: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased +.. _ckiplab/bert-base-chinese-ws: https://huggingface.co/ckiplab/bert-base-chinese-ws +.. _daigo/bert-base-japanese-sentiment: https://huggingface.co/daigo/bert-base-japanese-sentiment +.. _SZTAKI-HLT/hubert-base-cc: https://huggingface.co/SZTAKI-HLT/hubert-base-cc +.. _nlpaueb/legal-bert-small-uncased: https://huggingface.co/nlpaueb/legal-bert-small-uncased +.. _dumitrescustefan/bert-base-romanian-uncased-v1: https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1 +.. 
_google/muril-base-cased: https://huggingface.co/google/muril-base-cased +.. _dkleczek/bert-base-polish-uncased-v1: https://huggingface.co/dkleczek/bert-base-polish-uncased-v1 +.. _ckiplab/bert-base-chinese-ner: https://huggingface.co/ckiplab/bert-base-chinese-ner +.. _savasy/bert-base-turkish-sentiment-cased: https://huggingface.co/savasy/bert-base-turkish-sentiment-cased +.. _mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es: https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es +.. _KB/bert-base-swedish-cased-ner: https://huggingface.co/KB/bert-base-swedish-cased-ner +.. _hfl/rbt3: https://huggingface.co/hfl/rbt3 +.. _remotejob/gradientclassification_v0: https://huggingface.co/remotejob/gradientclassification_v0 +.. _Recognai/bert-base-spanish-wwm-cased-xnli: https://huggingface.co/Recognai/bert-base-spanish-wwm-cased-xnli +.. _HooshvareLab/bert-fa-zwnj-base: https://huggingface.co/HooshvareLab/bert-fa-zwnj-base +.. _monologg/bert-base-cased-goemotions-group: https://huggingface.co/monologg/bert-base-cased-goemotions-group +.. _blanchefort/rubert-base-cased-sentiment: https://huggingface.co/blanchefort/rubert-base-cased-sentiment +.. _shibing624/macbert4csc-base-chinese: https://huggingface.co/shibing624/macbert4csc-base-chinese +.. _google/bert_uncased_L-8_H-512_A-8: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8 +.. _bert-large-cased-whole-word-masking: https://huggingface.co/bert-large-cased-whole-word-masking +.. _alvaroalon2/biobert_diseases_ner: https://huggingface.co/alvaroalon2/biobert_diseases_ner +.. _philschmid/BERT-Banking77: https://huggingface.co/philschmid/BERT-Banking77 +.. _dbmdz/bert-base-turkish-uncased: https://huggingface.co/dbmdz/bert-base-turkish-uncased +.. _vblagoje/bert-english-uncased-finetuned-pos: https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos +.. _dumitrescustefan/bert-base-romanian-cased-v1: https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1 +.. _nreimers/BERT-Tiny_L-2_H-128_A-2: https://huggingface.co/nreimers/BERT-Tiny_L-2_H-128_A-2 +.. _digitalepidemiologylab/covid-twitter-bert-v2: https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2 +.. _UBC-NLP/MARBERT: https://huggingface.co/UBC-NLP/MARBERT +.. _pierreguillou/bert-large-cased-squad-v1.1-portuguese: https://huggingface.co/pierreguillou/bert-large-cased-squad-v1.1-portuguese +.. _alvaroalon2/biobert_genetic_ner: https://huggingface.co/alvaroalon2/biobert_genetic_ner +.. _bvanaken/clinical-assertion-negation-bert: https://huggingface.co/bvanaken/clinical-assertion-negation-bert +.. _cross-encoder/stsb-TinyBERT-L-4: https://huggingface.co/cross-encoder/stsb-TinyBERT-L-4 +.. _sshleifer/tiny-distilbert-base-cased: https://huggingface.co/sshleifer/tiny-distilbert-base-cased +.. _ckiplab/bert-base-chinese: https://huggingface.co/ckiplab/bert-base-chinese +.. 
_fabriceyhc/bert-base-uncased-amazon_polarity: https://huggingface.co/fabriceyhc/bert-base-uncased-amazon_polarity diff --git a/docs/model_zoo/transformers/BigBird/contents.rst b/docs/model_zoo/transformers/BigBird/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..ba2c57911fb39a6e2cf65196375b5a829620ea53 --- /dev/null +++ b/docs/model_zoo/transformers/BigBird/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +BigBird模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的BigBird模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``bigbird-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 127M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/Blenderbot-Small/contents.rst b/docs/model_zoo/transformers/Blenderbot-Small/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..d348b9059da88ca2ebe354b0a2e0b7fe90470f1e --- /dev/null +++ b/docs/model_zoo/transformers/Blenderbot-Small/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +Blenderbot-Small模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Blenderbot-Small模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``blenderbot_small-90M`` | English | 16-layer, | +| | | 16-heads, 90M parameters. | +| | | The Blenderbot small model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/Blenderbot/contents.rst b/docs/model_zoo/transformers/Blenderbot/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..e4948268f5c901431d0dc76d38ef27e31ebcd5a2 --- /dev/null +++ b/docs/model_zoo/transformers/Blenderbot/contents.rst @@ -0,0 +1,26 @@ + + +------------------------------------ +Blenderbot模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Blenderbot模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``blenderbot-3B`` | English | 26-layer, | +| | | 32-heads, 3B parameters. | +| | | The Blenderbot base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``blenderbot-400M-distill`` | English | 14-layer, 384-hidden, | +| | | 32-heads, 400M parameters. | +| | | The Blenderbot distil model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``blenderbot-1B-distill`` | English | 14-layer, | +| | | 32-heads, 1478M parameters. | +| | | The Blenderbot distil 1B model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/CTRL/contents.rst b/docs/model_zoo/transformers/CTRL/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..6f57970b2119d58dc9d33fa72a8b494b153ff1bd --- /dev/null +++ b/docs/model_zoo/transformers/CTRL/contents.rst @@ -0,0 +1,21 @@ + + +------------------------------------ +CTRL模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的CTRL模型对应预训练权重。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ctrl`` | English | 48-layer, 1280-hidden, | +| | | 16-heads, 1701M parameters. | +| | | The CTRL base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sshleifer-tiny-ctrl`` | English | 2-layer, 16-hidden, | +| | | 2-heads, 5M parameters. | +| | | The Tiny CTRL model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ChineseBert/contents.rst b/docs/model_zoo/transformers/ChineseBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..cdacb2ec1e0722fb2f7c3d560a79e141cbaaa96e --- /dev/null +++ b/docs/model_zoo/transformers/ChineseBert/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +ChineseBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ChineseBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ChineseBERT-base`` | Chinese | For details, please refer to: | +| | | ChineseBERT-base_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ChineseBERT-large`` | Chinese | For details, please refer to: | +| | | ChineseBERT-large_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _ChineseBERT-base: https://huggingface.co/ShannonAI/ChineseBERT-base +.. _ChineseBERT-large: https://huggingface.co/ShannonAI/ChineseBERT-large diff --git a/docs/model_zoo/transformers/ConvBert/contents.rst b/docs/model_zoo/transformers/ConvBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..889f2e5da446b9221514e11a123644d41ef25b6d --- /dev/null +++ b/docs/model_zoo/transformers/ConvBert/contents.rst @@ -0,0 +1,26 @@ + + +------------------------------------ +ConvBERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ConvBERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``convbert-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 106M parameters. | +| | | The ConvBERT base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``convbert-medium-small`` | English | 12-layer, 384-hidden, | +| | | 8-heads, 17M parameters. | +| | | The ConvBERT medium small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``convbert-small`` | English | 12-layer, 128-hidden, | +| | | 4-heads, 13M parameters. | +| | | The ConvBERT small model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/DistilBert/contents.rst b/docs/model_zoo/transformers/DistilBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..06e2a8bdc77a3776be71dc2958d852fa7c85d2e5 --- /dev/null +++ b/docs/model_zoo/transformers/DistilBert/contents.rst @@ -0,0 +1,42 @@ + + +------------------------------------ +DistilBERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的DistilBERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``distilbert-base-uncased`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 66M parameters. | +| | | The DistilBERT model distilled from | +| | | the BERT model ``bert-base-uncased``. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``distilbert-base-cased`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 66M parameters. | +| | | The DistilBERT model distilled from | +| | | the BERT model ``bert-base-cased``. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``distilbert-base-multilingual-cased`` | Multilingual | 6-layer, 768-hidden, 12-heads, | +| | | 200M parameters. The DistilBERT model | +| | | distilled from the BERT model | +| | | ``bert-base-multilingual-cased``. | +| | | | +| | | Please refer to: | +| | | `distilbert-base-multilingual-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english`` | English | 2-layer, 2-hidden, | +| | | 2-heads, 50K parameters. | +| | | The DistilBERT model. | +| | | | +| | | Please refer to: | +| | | `sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _distilbert-base-multilingual-cased: https://huggingface.co/distilbert-base-multilingual-cased +.. 
_sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english: https://huggingface.co/sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english diff --git a/docs/model_zoo/transformers/ELECTRA/contents.rst b/docs/model_zoo/transformers/ELECTRA/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0176aac948fd5738ad221a7a5e47af22c8883061 --- /dev/null +++ b/docs/model_zoo/transformers/ELECTRA/contents.rst @@ -0,0 +1,63 @@ + + +------------------------------------ +ELECTRA模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ELECTRA模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``electra-small`` | English | 12-layer, 768-hidden, | +| | | 4-heads, 14M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``electra-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``electra-large`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 334M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-electra-small`` | Chinese | 12-layer, 768-hidden, | +| | | 4-heads, 12M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-electra-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-health-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on Chinese medical corpus. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``hfl/chinese-electra-180g-base-discriminator`` | Chinese | Discriminator, 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on 180g Chinese text. 
| +| | | | +| | | Please refer to: | +| | | `hfl/chinese-electra-180g-base-discriminator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``hfl/chinese-electra-180g-small-ex-discriminator`` | Chinese | Discriminator, 24-layer, 256-hidden, | +| | | 4-heads, 24M parameters. | +| | | Trained on 180g Chinese text. | +| | | | +| | | Please refer to: | +| | | `hfl/chinese-electra-180g-small-ex-discriminator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``hfl/chinese-legal-electra-small-generator`` | Chinese | Generator, 12-layer, 64-hidden, | +| | | 1-heads, 3M parameters. | +| | | Trained on Chinese legal corpus. | +| | | | +| | | Please refer to: | +| | | `hfl/chinese-legal-electra-small-generator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _hfl/chinese-electra-180g-base-discriminator: https://huggingface.co/hfl/chinese-electra-180g-base-discriminator +.. _hfl/chinese-electra-180g-small-ex-discriminator: https://huggingface.co/hfl/chinese-electra-180g-small-ex-discriminator +.. _hfl/chinese-legal-electra-small-generator: https://huggingface.co/hfl/chinese-legal-electra-small-generator diff --git a/docs/model_zoo/transformers/ERNIE-CTM/contents.rst b/docs/model_zoo/transformers/ERNIE-CTM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..c0e5719585ac966d2c8a3f3edacbd667df267c37 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-CTM/contents.rst @@ -0,0 +1,31 @@ + + +------------------------------------ +ERNIE-CTM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-CTM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-ctm`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | For details, please refer to the | +| | | ernie-ctm_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``wordtag`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | For details, please refer to the | +| | | ernie-ctm_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nptag`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | For details, please refer to the | +| | | ernie-ctm_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. 
_ernie-ctm: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm \ No newline at end of file diff --git a/docs/model_zoo/transformers/ERNIE-DOC/contents.rst b/docs/model_zoo/transformers/ERNIE-DOC/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..905708628d8c83c2bb9695a00cc62a92ab930744 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-DOC/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +ERNIE-DOC模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-DOC模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-doc-base-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-doc-base-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 103M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE-GEN/contents.rst b/docs/model_zoo/transformers/ERNIE-GEN/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..a0822f91379e022cf1581c9f6fdebe19aa68f7b0 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-GEN/contents.rst @@ -0,0 +1,29 @@ + + +------------------------------------ +ERNIE-GEN模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-GEN模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-gen-base-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on lower-cased English text. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-gen-large-en`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-gen-large-en-430g`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | +| | | with extended data (430 GB). 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE-GRAM/contents.rst b/docs/model_zoo/transformers/ERNIE-GRAM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..2ed80d0e35e66d3b87f352f373ae8687aee66e9b --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-GRAM/contents.rst @@ -0,0 +1,24 @@ + + +------------------------------------ +ERNIE-GRAM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-GRAM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-gram-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-gram-zh-finetuned-dureader-robust`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | +| | | Then finetuned on DuReader-robust. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE-M/contents.rst b/docs/model_zoo/transformers/ERNIE-M/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..62493fed917ed883d4fd67e6fe80945e7a925eff --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-M/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +ERNIE-M模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-M模型对应预训练权重。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-m-base`` | Multilingual | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | Trained on pseudo-parallel sentence | +| | | pairs on a monolingual corpus. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-m-large`` | Multilingual | 24-layer, 1024-hidden, | +| | | 16-heads, _M parameters. | +| | | Trained on pseudo-parallel sentence | +| | | pairs on a monolingual corpus. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE/contents.rst b/docs/model_zoo/transformers/ERNIE/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..5bab4d1dc2eedba51c3903c52b0baaf9cdcf8649 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE/contents.rst @@ -0,0 +1,127 @@ + + +------------------------------------ +ERNIE模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-1.0-base-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-1.0-base-zh-cw`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-1.0-large-zh-cw`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 272M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-tiny`` | Chinese | 3-layer, 1024-hidden, | +| | | 16-heads, _M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-2.0-base-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 103M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-2.0-base-en-finetuned-squad`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 110M parameters. | +| | | Trained on finetuned squad text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-2.0-large-en`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-xbase-zh`` | Chinese | 20-layer, 1024-hidden, | +| | | 16-heads, 296M parameters. | +| | | Trained on Chinese text. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-base-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-medium-zh`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-mini-zh`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-micro-zh`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-nano-zh`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-base-cross-encoder`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-medium-cross-encoder`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-mini-cross-encoder`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-micro-cross-encoder`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-nano-cross-encoder`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-base-query-encoder`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. 
| +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-base-para-encoder`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-medium-query-encoder`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-medium-para-encoder`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-mini-query-encoder`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-mini-para-encoder`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-micro-query-encoder`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-micro-para-encoder`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-nano-query-encoder`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-nano-para-encoder`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +.. 
_zhui/ernie-1.0-cluecorpussmall: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/community/zhui/ernie-1.0-cluecorpussmall diff --git a/docs/model_zoo/transformers/FNet/contents.rst b/docs/model_zoo/transformers/FNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..2fa95029adc7e97a39658d1b43918f5dafa38f56 --- /dev/null +++ b/docs/model_zoo/transformers/FNet/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +FNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的FNet模型对应预训练权重。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``fnet-base`` | English | For details, please refer to: | +| | | `google/fnet-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``fnet-large`` | English | For details, please refer to: | +| | | `google/fnet-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _google/fnet-base: https://huggingface.co/google/fnet-base +.. _google/fnet-large: https://huggingface.co/google/fnet-large \ No newline at end of file diff --git a/docs/model_zoo/transformers/Funnel/contents.rst b/docs/model_zoo/transformers/Funnel/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..89b192f17152741e79dfc2b9137b9659b32aef3d --- /dev/null +++ b/docs/model_zoo/transformers/Funnel/contents.rst @@ -0,0 +1,55 @@ + + +------------------------------------ +Funnel模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Funnel模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``funnel-transformer/small`` | English | For details, please refer to: | +| | | `funnel-transformer/small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/small-base`` | English | For details, please refer to: | +| | | `funnel-transformer/small-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/meduim`` | English | For details, please refer to: | +| | | `funnel-transformer/meduim`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/meduim-base`` | English | For details, please 
refer to: | +| | | `funnel-transformer/meduim-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/intermediate`` | English | For details, please refer to: | +| | | `funnel-transformer/intermediate`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/intermediate-base`` | English | For details, please refer to: | +| | | `funnel-transformer/intermediate-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/large`` | English | For details, please refer to: | +| | | `funnel-transformer/large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/large-base`` | English | For details, please refer to: | +| | | `funnel-transformer/large-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/xlarge`` | English | For details, please refer to: | +| | | `funnel-transformer/xlarge`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/xlarge-base`` | English | For details, please refer to: | +| | | `funnel-transformer/xlarge-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _funnel-transformer/small: https://huggingface.co/funnel-transformer/small +.. _funnel-transformer/small-base: https://huggingface.co/funnel-transformer/small-base +.. _funnel-transformer/meduim: https://huggingface.co/funnel-transformer/medium +.. _funnel-transformer/meduim-base: https://huggingface.co/funnel-transformer/medium-base +.. _funnel-transformer/intermediate: https://huggingface.co/funnel-transformer/intermediate +.. _funnel-transformer/intermediate-base: https://huggingface.co/funnel-transformer/intermediate-base +.. _funnel-transformer/large: https://huggingface.co/funnel-transformer/large +.. _funnel-transformer/large-base: https://huggingface.co/funnel-transformer/large-base +.. _funnel-transformer/xlarge: https://huggingface.co/funnel-transformer/xlarge +.. 
_funnel-transformer/xlarge-base: https://huggingface.co/funnel-transformer/xlarge-base \ No newline at end of file diff --git a/docs/model_zoo/transformers/GPT/contents.rst b/docs/model_zoo/transformers/GPT/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..a5b802387d9f9564133d2289c4e91783a0279528 --- /dev/null +++ b/docs/model_zoo/transformers/GPT/contents.rst @@ -0,0 +1,315 @@ + + +------------------------------------ +GPT模型汇总 +------------------------------------ + + +下表汇总介绍了目前PaddleNLP支持的GPT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``gpt-cpm-large-cn`` | Chinese | 32-layer, 2560-hidden, | +| | | 32-heads, 2.6B parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt-cpm-small-cn-distill`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | The model distilled from | +| | | the GPT model ``gpt-cpm-large-cn`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 117M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-medium-en`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 345M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-large-en`` | English | 36-layer, 1280-hidden, | +| | | 20-heads, 774M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-xl-en`` | English | 48-layer, 1600-hidden, | +| | | 25-heads, 1558M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``microsoft/DialoGPT-medium`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 354M parameters. | +| | | Trained on English text. | +| | | | +| | | Please refer to: | +| | | `microsoft/DialoGPT-medium`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uer/gpt2-chinese-poem`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 103M parameters. | +| | | Trained on Chinese poetry corpus. 
| +| | | | +| | | Please refer to: | +| | | `uer/gpt2-chinese-poem`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``distilgpt2`` | English | Please refer to: | +| | | `distilgpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``w11wo/javanese-gpt2-small-imdb`` | Javanese | Please refer to: | +| | | `w11wo/javanese-gpt2-small-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``remotejob/tweetsDISTILGPT2fi_v4`` | English | Please refer to: | +| | | `remotejob/tweetsDISTILGPT2fi_v4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``TrLOX/gpt2-tdk`` | English | Please refer to: | +| | | `TrLOX/gpt2-tdk`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huggingtweets/slime_machine`` | English | Please refer to: | +| | | `huggingtweets/slime_machine`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialoGPT-small`` | English | Please refer to: | +| | | `microsoft/DialoGPT-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sberbank-ai/rugpt3large_based_on_gpt2`` | Russian | Please refer to: | +| | | `sberbank-ai/rugpt3large_based_on_gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sshleifer/tiny-gpt2`` | English | Please refer to: | +| | | `sshleifer/tiny-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialoGPT-large`` | English | Please refer to: | +| | | `microsoft/DialoGPT-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sberbank-ai/rugpt3small_based_on_gpt2`` | Russian | Please refer to: | +| | | `sberbank-ai/rugpt3small_based_on_gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uw-hai/polyjuice`` | English | Please refer to: | +| | | `uw-hai/polyjuice`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``NYTK/text-generation-poem-petofi-gpt2-small-hungarian`` | Hungarian | Please refer to: | +| | | `NYTK/text-generation-poem-petofi-gpt2-small-hungarian`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-human-vs-rand`` | English | Please refer to: | +| | | `microsoft/DialogRPT-human-vs-rand`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hf-internal-testing/tiny-random-gpt2`` | English | Please refer to: | +| | | `hf-internal-testing/tiny-random-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Grossmend/rudialogpt3_medium_based_on_gpt2`` | Russian | Please refer to: | +| | | `Grossmend/rudialogpt3_medium_based_on_gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pranavpsv/genre-story-generator-v2`` | English | Please refer to: | +| | | `pranavpsv/genre-story-generator-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-updown`` | English | Please refer to: | +| | | `microsoft/DialogRPT-updown`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-human-vs-machine`` | English | Please refer to: | +| | | `microsoft/DialogRPT-human-vs-machine`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pierreguillou/gpt2-small-portuguese`` | Portuguese | Please refer to: | +| | | `pierreguillou/gpt2-small-portuguese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/GPT-2-finetuned-covid-bio-medrxiv`` | English | Please refer to: | +| | | `mrm8488/GPT-2-finetuned-covid-bio-medrxiv`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``anonymous-german-nlp/german-gpt2`` | German | Please refer to: | +| | | `anonymous-german-nlp/german-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/CodeGPT-small-py`` | English | Please refer to: | +| | | `microsoft/CodeGPT-small-py`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``antoiloui/belgpt2`` | French | Please refer to: | +| | | `antoiloui/belgpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``benjamin/gerpt2`` | German | Please refer to: | +| | 
| `benjamin/gerpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``asi/gpt-fr-cased-small`` | French | Please refer to: | +| | | `asi/gpt-fr-cased-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/CodeGPT-small-java-adaptedGPT2`` | English | Please refer to: | +| | | `microsoft/CodeGPT-small-java-adaptedGPT2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/gpt2-small-dutch`` | Dutch | Please refer to: | +| | | `GroNLP/gpt2-small-dutch`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``lvwerra/gpt2-imdb`` | English | Please refer to: | +| | | `lvwerra/gpt2-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DeepESP/gpt2-spanish`` | Spanish | Please refer to: | +| | | `DeepESP/gpt2-spanish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/CodeGPT-small-py-adaptedGPT2`` | English | Please refer to: | +| | | `microsoft/CodeGPT-small-py-adaptedGPT2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-width`` | English | Please refer to: | +| | | `microsoft/DialogRPT-width`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbddv01/gpt2-french-small`` | French | Please refer to: | +| | | `dbddv01/gpt2-french-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/gpt2-small-italian`` | Italian | Please refer to: | +| | | `GroNLP/gpt2-small-italian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``flax-community/gpt2-medium-persian`` | Persian | Please refer to: | +| | | `flax-community/gpt2-medium-persian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-depth`` | English | Please refer to: | +| | | `microsoft/DialogRPT-depth`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Nokia/nlgp-natural`` | English | Please refer to: | +| | | `Nokia/nlgp-natural`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macedonizer/hr-gpt2`` | English | Please refer to: | +| | | `macedonizer/hr-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/GPT-2-finetuned-common_gen`` | English | Please refer to: | +| | | `mrm8488/GPT-2-finetuned-common_gen`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pranavpsv/gpt2-genre-story-generator`` | English | Please refer to: | +| | | `pranavpsv/gpt2-genre-story-generator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``rbhushan/distilgpt2-finetuned-wikitext2`` | English | Please refer to: | +| | | `rbhushan/distilgpt2-finetuned-wikitext2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``readerbench/RoGPT2-large`` | Romanian | Please refer to: | +| | | `readerbench/RoGPT2-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``flax-community/gpt2-small-indonesian`` | Indonesian | Please refer to: | +| | | `flax-community/gpt2-small-indonesian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/gpt2-fa`` | Persian | Please refer to: | +| | | `HooshvareLab/gpt2-fa`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cahya/gpt2-small-indonesian-522M`` | Indonesian | Please refer to: | +| | | `cahya/gpt2-small-indonesian-522M`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DingleyMaillotUrgell/homer-bot`` | English | Please refer to: | +| | | `DingleyMaillotUrgell/homer-bot`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``datificate/gpt2-small-spanish`` | Spanish | Please refer to: | +| | | `datificate/gpt2-small-spanish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ericzhou/tsundere_v1`` | English | Please refer to: | +| | | `ericzhou/tsundere_v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huggingtweets/wwm_shakespeare`` | English | Please refer to: | +| | | `huggingtweets/wwm_shakespeare`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SIC98/GPT2-python-code-generator`` | English | Please refer to: | +| | | `SIC98/GPT2-python-code-generator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/gpt2-small-italian-embeddings`` | Italian | Please refer to: | +| | | `GroNLP/gpt2-small-italian-embeddings`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings`` | English | Please refer to: | +| | | `huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``salesken/grammar_correction`` | English | Please refer to: | +| | | `salesken/grammar_correction`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``flax-community/gpt2-medium-indonesian`` | Indonesian | Please refer to: | +| | | `flax-community/gpt2-medium-indonesian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``gorkemgoknar/gpt2-small-turkish`` | Turkish | Please refer to: | +| | | `gorkemgoknar/gpt2-small-turkish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepparag/DumBot`` | English | Please refer to: | +| | | `deepparag/DumBot`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``jcblaise/gpt2-tagalog`` | Tagalog | Please refer to: | +| | | `jcblaise/gpt2-tagalog`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``BigSalmon/InformalToFormalLincoln21`` | English | Please refer to: | +| | | `BigSalmon/InformalToFormalLincoln21`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``LorenzoDeMattei/GePpeTto`` | English | Please refer to: | +| | | `LorenzoDeMattei/GePpeTto`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macedonizer/sr-gpt2`` | English | Please refer to: | +| | | `macedonizer/sr-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``indonesian-nlp/gpt2`` | English | Please refer to: | +| | | `indonesian-nlp/gpt2`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ceostroff/harry-potter-gpt2-fanfiction`` | English | Please refer to: | +| | | `ceostroff/harry-potter-gpt2-fanfiction`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``akhooli/gpt2-small-arabic-poetry`` | Arabic | Please refer to: | +| | | `akhooli/gpt2-small-arabic-poetry`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``asi/gpt-fr-cased-base`` | French | Please refer to: | +| | | `asi/gpt-fr-cased-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``congcongwang/gpt2_medium_fine_tuned_coder`` | English | Please refer to: | +| | | `congcongwang/gpt2_medium_fine_tuned_coder`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cambridgeltl/simctg_wikitext103`` | English | Please refer to: | +| | | `cambridgeltl/simctg_wikitext103`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _microsoft/DialoGPT-medium: https://huggingface.co/microsoft/DialoGPT-medium +.. _uer/gpt2-chinese-poem: https://huggingface.co/uer/gpt2-chinese-poem +.. _distilgpt2: https://huggingface.co/distilgpt2 +.. _w11wo/javanese-gpt2-small-imdb: https://huggingface.co/w11wo/javanese-gpt2-small-imdb +.. _remotejob/tweetsDISTILGPT2fi_v4: https://huggingface.co/remotejob/tweetsDISTILGPT2fi_v4 +.. _TrLOX/gpt2-tdk: https://huggingface.co/TrLOX/gpt2-tdk +.. _huggingtweets/slime_machine: https://huggingface.co/huggingtweets/slime_machine +.. _microsoft/DialoGPT-small: https://huggingface.co/microsoft/DialoGPT-small +.. _sberbank-ai/rugpt3large_based_on_gpt2: https://huggingface.co/sberbank-ai/rugpt3large_based_on_gpt2 +.. _sshleifer/tiny-gpt2: https://huggingface.co/sshleifer/tiny-gpt2 +.. _microsoft/DialoGPT-large: https://huggingface.co/microsoft/DialoGPT-large +.. _sberbank-ai/rugpt3small_based_on_gpt2: https://huggingface.co/sberbank-ai/rugpt3small_based_on_gpt2 +.. _uw-hai/polyjuice: https://huggingface.co/uw-hai/polyjuice +.. _NYTK/text-generation-poem-petofi-gpt2-small-hungarian: https://huggingface.co/NYTK/text-generation-poem-petofi-gpt2-small-hungarian +.. _microsoft/DialogRPT-human-vs-rand: https://huggingface.co/microsoft/DialogRPT-human-vs-rand +.. _hf-internal-testing/tiny-random-gpt2: https://huggingface.co/hf-internal-testing/tiny-random-gpt2 +.. _Grossmend/rudialogpt3_medium_based_on_gpt2: https://huggingface.co/Grossmend/rudialogpt3_medium_based_on_gpt2 +.. _pranavpsv/genre-story-generator-v2: https://huggingface.co/pranavpsv/genre-story-generator-v2 +.. _microsoft/DialogRPT-updown: https://huggingface.co/microsoft/DialogRPT-updown +.. _microsoft/DialogRPT-human-vs-machine: https://huggingface.co/microsoft/DialogRPT-human-vs-machine +.. 
_pierreguillou/gpt2-small-portuguese: https://huggingface.co/pierreguillou/gpt2-small-portuguese +.. _mrm8488/GPT-2-finetuned-covid-bio-medrxiv: https://huggingface.co/mrm8488/GPT-2-finetuned-covid-bio-medrxiv +.. _anonymous-german-nlp/german-gpt2: https://huggingface.co/anonymous-german-nlp/german-gpt2 +.. _microsoft/CodeGPT-small-py: https://huggingface.co/microsoft/CodeGPT-small-py +.. _antoiloui/belgpt2: https://huggingface.co/antoiloui/belgpt2 +.. _benjamin/gerpt2: https://huggingface.co/benjamin/gerpt2 +.. _asi/gpt-fr-cased-small: https://huggingface.co/asi/gpt-fr-cased-small +.. _microsoft/CodeGPT-small-java-adaptedGPT2: https://huggingface.co/microsoft/CodeGPT-small-java-adaptedGPT2 +.. _GroNLP/gpt2-small-dutch: https://huggingface.co/GroNLP/gpt2-small-dutch +.. _lvwerra/gpt2-imdb: https://huggingface.co/lvwerra/gpt2-imdb +.. _DeepESP/gpt2-spanish: https://huggingface.co/DeepESP/gpt2-spanish +.. _microsoft/CodeGPT-small-py-adaptedGPT2: https://huggingface.co/microsoft/CodeGPT-small-py-adaptedGPT2 +.. _microsoft/DialogRPT-width: https://huggingface.co/microsoft/DialogRPT-width +.. _dbddv01/gpt2-french-small: https://huggingface.co/dbddv01/gpt2-french-small +.. _GroNLP/gpt2-small-italian: https://huggingface.co/GroNLP/gpt2-small-italian +.. _flax-community/gpt2-medium-persian: https://huggingface.co/flax-community/gpt2-medium-persian +.. _microsoft/DialogRPT-depth: https://huggingface.co/microsoft/DialogRPT-depth +.. _Nokia/nlgp-natural: https://huggingface.co/Nokia/nlgp-natural +.. _macedonizer/hr-gpt2: https://huggingface.co/macedonizer/hr-gpt2 +.. _mrm8488/GPT-2-finetuned-common_gen: https://huggingface.co/mrm8488/GPT-2-finetuned-common_gen +.. _pranavpsv/gpt2-genre-story-generator: https://huggingface.co/pranavpsv/gpt2-genre-story-generator +.. _rbhushan/distilgpt2-finetuned-wikitext2: https://huggingface.co/rbhushan/distilgpt2-finetuned-wikitext2 +.. _readerbench/RoGPT2-large: https://huggingface.co/readerbench/RoGPT2-large +.. _flax-community/gpt2-small-indonesian: https://huggingface.co/flax-community/gpt2-small-indonesian +.. _HooshvareLab/gpt2-fa: https://huggingface.co/HooshvareLab/gpt2-fa +.. _cahya/gpt2-small-indonesian-522M: https://huggingface.co/cahya/gpt2-small-indonesian-522M +.. _DingleyMaillotUrgell/homer-bot: https://huggingface.co/DingleyMaillotUrgell/homer-bot +.. _datificate/gpt2-small-spanish: https://huggingface.co/datificate/gpt2-small-spanish +.. _ericzhou/tsundere_v1: https://huggingface.co/ericzhou/tsundere_v1 +.. _huggingtweets/wwm_shakespeare: https://huggingface.co/huggingtweets/wwm_shakespeare +.. _SIC98/GPT2-python-code-generator: https://huggingface.co/SIC98/GPT2-python-code-generator +.. _GroNLP/gpt2-small-italian-embeddings: https://huggingface.co/GroNLP/gpt2-small-italian-embeddings +.. _huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings: https://huggingface.co/huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings +.. _salesken/grammar_correction: https://huggingface.co/salesken/grammar_correction +.. _flax-community/gpt2-medium-indonesian: https://huggingface.co/flax-community/gpt2-medium-indonesian +.. _gorkemgoknar/gpt2-small-turkish: https://huggingface.co/gorkemgoknar/gpt2-small-turkish +.. _deepparag/DumBot: https://huggingface.co/deepparag/DumBot +.. _jcblaise/gpt2-tagalog: https://huggingface.co/jcblaise/gpt2-tagalog +.. _BigSalmon/InformalToFormalLincoln21: https://huggingface.co/BigSalmon/InformalToFormalLincoln21 +.. _LorenzoDeMattei/GePpeTto: https://huggingface.co/LorenzoDeMattei/GePpeTto +.. 
_macedonizer/sr-gpt2: https://huggingface.co/macedonizer/sr-gpt2 +.. _indonesian-nlp/gpt2: https://huggingface.co/indonesian-nlp/gpt2 +.. _ceostroff/harry-potter-gpt2-fanfiction: https://huggingface.co/ceostroff/harry-potter-gpt2-fanfiction +.. _akhooli/gpt2-small-arabic-poetry: https://huggingface.co/akhooli/gpt2-small-arabic-poetry +.. _asi/gpt-fr-cased-base: https://huggingface.co/asi/gpt-fr-cased-base +.. _congcongwang/gpt2_medium_fine_tuned_coder: https://huggingface.co/congcongwang/gpt2_medium_fine_tuned_coder +.. _cambridgeltl/simctg_wikitext103: https://huggingface.co/cambridgeltl/simctg_wikitext103 \ No newline at end of file diff --git a/docs/model_zoo/transformers/LayoutLM/contents.rst b/docs/model_zoo/transformers/LayoutLM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..47662bcb3df4d7d9979b07dd69b12dcdd92a8673 --- /dev/null +++ b/docs/model_zoo/transformers/LayoutLM/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +LayoutLM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的LayoutLM模型以及对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``layoutlm-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 339M parameters. | +| | | LayoutLm base uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``layoutlm-large-uncased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 51M parameters. | +| | | LayoutLm large Uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/LayoutLMV2/contents.rst b/docs/model_zoo/transformers/LayoutLMV2/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..efecc7fbbc69c67fb1c84d8f37f30759e2696dd6 --- /dev/null +++ b/docs/model_zoo/transformers/LayoutLMV2/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +LayoutLMV2模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的LayoutLMV2模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``layoutlmv2-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 200M parameters. | +| | | LayoutLmv2 base uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``layoutlmv2-large-uncased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, _M parameters. 
| +| | | LayoutLmv2 large uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/LayoutXLM/contents.rst b/docs/model_zoo/transformers/LayoutXLM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..212bb27e272f1484ea0465da2b58150984a4d551 --- /dev/null +++ b/docs/model_zoo/transformers/LayoutXLM/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +LayoutXLM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的LayoutXLM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``layoutxlm-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 369M parameters. | +| | | Layoutxlm base uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/Luke/contents.rst b/docs/model_zoo/transformers/Luke/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..396600f55c4298f9b4efd15c2dfb6bff926d2118 --- /dev/null +++ b/docs/model_zoo/transformers/Luke/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +Luke模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Luke模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``luke-base`` | English | For details, please refer to: | +| | | `studio-ousia/luke-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``luke-large`` | English | For details, please refer to: | +| | | `studio-ousia/luke-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _studio-ousia/luke-base: https://huggingface.co/studio-ousia/luke-base +.. 
_studio-ousia/luke-large: https://huggingface.co/studio-ousia/luke-large diff --git a/docs/model_zoo/transformers/MBart/contents.rst b/docs/model_zoo/transformers/MBart/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..a1a432e171c7d3a50074b5f19ee36847e7815852 --- /dev/null +++ b/docs/model_zoo/transformers/MBart/contents.rst @@ -0,0 +1,39 @@ + + +------------------------------------ +MBart模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MBart模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``mbart-large-cc25`` | English | 12-layer, 1024-hidden, | +| | | 12-heads, 1123M parameters. | +| | | The ``mbart-large-cc25`` model. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-en-ro`` | English | 12-layer, 768-hidden, | +| | | 16-heads, 1123M parameters. | +| | | The ``mbart-large rn-ro`` model. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-50-one-to-many-mmt`` | English | 12-layer, 1024-hidden, | +| | | 16-heads, 1123M parameters. | +| | | ``mbart-large-50-one-to-many-mmt`` | +| | | model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-50-many-to-one-mmt`` | English | 12-layer, 1024-hidden, | +| | | 16-heads, 1123M parameters. | +| | | ``mbart-large-50-many-to-one-mmt`` | +| | | model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-50-many-to-many-mmt`` | English | 12-layer, 1024-hidden, | +| | | 16-heads, 1123M parameters. | +| | | ``mbart-large-50-many-to-many-mmt`` | +| | | model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/MPNet/contents.rst b/docs/model_zoo/transformers/MPNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..2f3dd1c078dbb33f67a57e5ba9dcac9f3ea3a0a7 --- /dev/null +++ b/docs/model_zoo/transformers/MPNet/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +MPNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MPNet模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``mpnet-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | MPNet Base Model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/MegatronBert/contents.rst b/docs/model_zoo/transformers/MegatronBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0d9118115858931813048406ee29d18b96314a93 --- /dev/null +++ b/docs/model_zoo/transformers/MegatronBert/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +MegatronBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MegatronBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``megatronbert-cased`` | English | For details, please refer to: | +| | | `nvidia/megatron-bert-cased-345m`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``megatronbert-uncased`` | English | For details, please refer to: | +| | | `nvidia/megatron-bert-uncased-345m`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _nvidia/megatron-bert-cased-345m: https://huggingface.co/nvidia/megatron-bert-cased-345m +.. 
_nvidia/megatron-bert-uncased-345m: https://huggingface.co/nvidia/megatron-bert-uncased-345m diff --git a/docs/model_zoo/transformers/MobileBert/contents.rst b/docs/model_zoo/transformers/MobileBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..ad948ff8411899e7e912da6f33c90426accf3125 --- /dev/null +++ b/docs/model_zoo/transformers/MobileBert/contents.rst @@ -0,0 +1,19 @@ + + +------------------------------------ +MobileBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MobileBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``mobilebert-uncased`` | English | For details, please refer to: | +| | | `google/mobilebert-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _google/mobilebert-uncased: https://huggingface.co/google/mobilebert-uncased diff --git a/docs/model_zoo/transformers/NeZha/contents.rst b/docs/model_zoo/transformers/NeZha/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..23a54aa8eb62a935a4530e864ed2050c8b5b1cf1 --- /dev/null +++ b/docs/model_zoo/transformers/NeZha/contents.rst @@ -0,0 +1,30 @@ + + +------------------------------------ +NeZha模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的NeZha模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``nezha-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nezha-large-chinese`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nezha-base-wwm-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 16-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nezha-large-wwm-chinese`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on Chinese text. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/PPMiniLM/contents.rst b/docs/model_zoo/transformers/PPMiniLM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..90ec6236f2c78b0ffcabd497ac05b91f290106ca --- /dev/null +++ b/docs/model_zoo/transformers/PPMiniLM/contents.rst @@ -0,0 +1,20 @@ + + +------------------------------------ +PPMiniLM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的PPMiniLM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +| ``ppminilm-6l-768h`` | Chinese | A Chinese characteristic small model | +| | | using multiple model compression. | +| | | Please refer to: ppminilm-6l-768h_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _ppminilm-6l-768h: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm diff --git a/docs/model_zoo/transformers/ProphetNet/contents.rst b/docs/model_zoo/transformers/ProphetNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0ae7d3e510dcb7e65ef8b41355a7b80847872f49 --- /dev/null +++ b/docs/model_zoo/transformers/ProphetNet/contents.rst @@ -0,0 +1,19 @@ + + +------------------------------------ +ProphetNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ProphetNet模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``prophetnet-large-uncased`` | English | For details, please refer to: | +| | | `microsoft/prophetnet-large-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. 
_microsoft/prophetnet-large-uncased: https://huggingface.co/microsoft/prophetnet-large-uncased diff --git a/docs/model_zoo/transformers/Reformer/contents.rst b/docs/model_zoo/transformers/Reformer/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7f54848a55c3e4c5621b39d26567f1233ffc3db --- /dev/null +++ b/docs/model_zoo/transformers/Reformer/contents.rst @@ -0,0 +1,20 @@ + + +------------------------------------ +Reformer模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Reformer模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``reformer-enwik8`` | English | 12-layer, 1024-hidden, | +| | | 8-heads, 148M parameters. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``reformer-crime-and-punishment`` | English | 6-layer, 256-hidden, | +| | | 2-heads, 3M parameters. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/RemBert/contents.rst b/docs/model_zoo/transformers/RemBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0cd7914c1ef15415f86b9437dd332b7ccca618bc --- /dev/null +++ b/docs/model_zoo/transformers/RemBert/contents.rst @@ -0,0 +1,20 @@ + + +------------------------------------ +RemBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的RemBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``rembert`` | English | For details, please refer to the | +| | | corresponding model card of huggingface: | +| | | `google/rembert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. 
_google/rembert: https://huggingface.co/google/rembert
diff --git a/docs/model_zoo/transformers/RoBERTa/contents.rst b/docs/model_zoo/transformers/RoBERTa/contents.rst
new file mode 100644
index 0000000000000000000000000000000000000000..859bfbcde559a3e9e9a047e897246c9dd3dd5063
--- /dev/null
+++ b/docs/model_zoo/transformers/RoBERTa/contents.rst
@@ -0,0 +1,424 @@
+
+
+------------------------------------
+RoBERTa模型汇总
+------------------------------------
+
+
+下表汇总介绍了目前PaddleNLP支持的RoBERTa模型对应预训练权重。
+关于模型的具体细节可以参考对应链接。
+
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+| Pretrained Weight                                                                | Language     | Details of the model                                                             |
++==================================================================================+==============+==================================================================================+
+|``hfl/roberta-wwm-ext``                                                           | Chinese      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 102M parameters.                                                       |
+|                                                                                  |              | Trained on Chinese text using                                                    |
+|                                                                                  |              | Whole-Word-Masking with extended data.                                           |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/roberta-wwm-ext-large``                                                     | Chinese      | 24-layer, 1024-hidden,                                                           |
+|                                                                                  |              | 16-heads, 325M parameters.                                                       |
+|                                                                                  |              | Trained on Chinese text using                                                    |
+|                                                                                  |              | Whole-Word-Masking with extended data.                                           |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbt3``                                                                      | Chinese      | 3-layer, 768-hidden,                                                             |
+|                                                                                  |              | 12-heads, 38M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbtl3``                                                                     | Chinese      | 3-layer, 1024-hidden,                                                            |
+|                                                                                  |              | 16-heads, 61M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbt4``                                                                      | Chinese      | 4-layer, 768-hidden,                                                             |
+|                                                                                  |              | 12-heads, 47M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbt6``                                                                      | Chinese      | 6-layer, 768-hidden,                                                             |
+|                                                                                  |              | 12-heads, 60M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``deepset/roberta-base-squad2``                                                   | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 124M parameters.                                                       |
+|                                                                                  |              | Trained on English text.                                                         |
+|                                                                                  |              |                                                                                  |
+|                                                                                  |              | Please refer to:                                                                 |
+|                                                                                  |              | `deepset/roberta-base-squad2`_                                                   |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``uer/roberta-base-chinese-extractive-qa``                                        | Chinese      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 101M parameters.                                                       |
+|                                                                                  |              | Trained on Chinese text. 
| +| | | | +| | | Please refer to: | +| | | `uer/roberta-base-chinese-extractive-qa`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uer/roberta-base-finetuned-chinanews-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on Chinese text. | +| | | | +| | | Please refer to: | +| | | `uer/roberta-base-finetuned-chinanews-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uer/roberta-base-finetuned-cluener2020-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 101M parameters. | +| | | Trained on Chinese text. | +| | | | +| | | Please refer to: | +| | | `uer/roberta-base-finetuned-cluener2020-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-base`` | English | Please refer to: | +| | | `roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-sentiment`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-sentiment`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-large`` | English | Please refer to: | +| | | `roberta-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``distilroberta-base`` | English | Please refer to: | +| | | `distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/nli-distilroberta-base`` | English | Please refer to: | +| | | `cross-encoder/nli-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``siebert/sentiment-roberta-large-english`` | English | Please refer to: | +| | | `siebert/sentiment-roberta-large-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``j-hartmann/emotion-english-distilroberta-base`` | English | Please refer to: | +| | | `j-hartmann/emotion-english-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-base-openai-detector`` | English | Please refer to: | +| | | `roberta-base-openai-detector`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``huggingface/CodeBERTa-small-v1`` | English | Please refer to: | 
+| | | `huggingface/CodeBERTa-small-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis`` | English | Please refer to: | +| | | `mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-emotion`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-emotion`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/PubChem10M_SMILES_BPE_396_250`` | English | Please refer to: | +| | | `seyonec/PubChem10M_SMILES_BPE_396_250`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-SST-2`` | English | Please refer to: | +| | | `textattack/roberta-base-SST-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sshleifer/tiny-distilroberta-base`` | English | Please refer to: | +| | | `sshleifer/tiny-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``thatdramebaazguy/roberta-base-squad`` | English | Please refer to: | +| | | `thatdramebaazguy/roberta-base-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli`` | English | Please refer to: | +| | | `ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ufal/robeczech-base`` | Czech | Please refer to: | +| | | `ufal/robeczech-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/PubChem10M_SMILES_BPE_450k`` | English | Please refer to: | +| | | `seyonec/PubChem10M_SMILES_BPE_450k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/PubChem10M_SMILES_BPE_50k`` | English | Please refer to: | +| | | `seyonec/PubChem10M_SMILES_BPE_50k`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``microsoft/codebert-base-mlm`` | English | Please refer to: | +| | | `microsoft/codebert-base-mlm`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-MNLI`` | English | Please refer to: | +| | | `textattack/roberta-base-MNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-offensive`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-offensive`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/stsb-roberta-large`` | English | Please refer to: | +| | | `cross-encoder/stsb-roberta-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/ChemBERTa_zinc250k_v2_40k`` | English | Please refer to: | +| | | `seyonec/ChemBERTa_zinc250k_v2_40k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uklfr/gottbert-base`` | German | Please refer to: | +| | | `uklfr/gottbert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/ChemBERTa-zinc-base-v1`` | English | Please refer to: | +| | | `seyonec/ChemBERTa-zinc-base-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-large-openai-detector`` | English | Please refer to: | +| | | `roberta-large-openai-detector`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/quora-roberta-base`` | English | Please refer to: | +| | | `cross-encoder/quora-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/stsb-roberta-base`` | English | Please refer to: | +| | | `cross-encoder/stsb-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``microsoft/graphcodebert-base`` | English | Please refer to: | +| | | `microsoft/graphcodebert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-hate`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-hate`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chkla/roberta-argument`` | English | Please refer to: | +| | | `chkla/roberta-argument`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Salesforce/grappa_large_jnt`` | English | Please refer to: | +| | | `Salesforce/grappa_large_jnt`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``vinai/bertweet-large`` | English | Please refer to: | +| | | `vinai/bertweet-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``allenai/biomed_roberta_base`` | English | Please refer to: | +| | | `allenai/biomed_roberta_base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``facebook/muppet-roberta-base`` | English | Please refer to: | +| | | `facebook/muppet-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Rakib/roberta-base-on-cuad`` | English | Please refer to: | +| | | `Rakib/roberta-base-on-cuad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/stsb-distilroberta-base`` | English | Please refer to: | +| | | `cross-encoder/stsb-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nyu-mll/roberta-base-1B-1`` | English | Please refer to: | +| | | `nyu-mll/roberta-base-1B-1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nyu-mll/roberta-med-small-1M-1`` | English | Please refer to: | +| | | `nyu-mll/roberta-med-small-1M-1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``SkolkovoInstitute/roberta_toxicity_classifier`` | English | Please refer to: | +| | | `SkolkovoInstitute/roberta_toxicity_classifier`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``facebook/muppet-roberta-large`` | English | Please refer to: | +| | | `facebook/muppet-roberta-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``lassl/roberta-ko-small`` | Korean | Please refer to: | +| | | `lassl/roberta-ko-small`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``huggingface/CodeBERTa-language-id`` | English | Please refer to: | +| | | `huggingface/CodeBERTa-language-id`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-imdb`` | English | Please refer to: | +| | | `textattack/roberta-base-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``macedonizer/mk-roberta-base`` | Macedonian | Please refer to: | +| | | `macedonizer/mk-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/nli-MiniLM2-L6-H768`` | English | Please refer to: | +| | | `cross-encoder/nli-MiniLM2-L6-H768`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-QNLI`` | English | Please refer to: | +| | | `textattack/roberta-base-QNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deepset/roberta-base-squad2-covid`` | English | Please refer to: | +| | | `deepset/roberta-base-squad2-covid`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-MRPC`` | English | Please refer to: | +| | | `textattack/roberta-base-MRPC`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``bhadresh-savani/roberta-base-emotion`` | English | Please refer to: | +| | | `bhadresh-savani/roberta-base-emotion`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``aychang/roberta-base-imdb`` | English | Please refer to: | +| | | `aychang/roberta-base-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/quora-distilroberta-base`` | English | Please refer to: | +| | | `cross-encoder/quora-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``csarron/roberta-base-squad-v1`` | English | Please refer to: | +| | | `csarron/roberta-base-squad-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/ChemBERTA_PubChem1M_shard00_155k`` | English | Please refer to: | +| | | 
`seyonec/ChemBERTA_PubChem1M_shard00_155k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mental/mental-roberta-base`` | English | Please refer to: | +| | | `mental/mental-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-CoLA`` | English | Please refer to: | +| | | `textattack/roberta-base-CoLA`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``navteca/quora-roberta-base`` | English | Please refer to: | +| | | `navteca/quora-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-emoji`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-emoji`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``benjamin/roberta-base-wechsel-german`` | Multilingual | Please refer to: | +| | | `benjamin/roberta-base-wechsel-german`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-ag-news`` | English | Please refer to: | +| | | `textattack/roberta-base-ag-news`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``johngiorgi/declutr-base`` | English | Please refer to: | +| | | `johngiorgi/declutr-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``salesken/query_wellformedness_score`` | English | Please refer to: | +| | | `salesken/query_wellformedness_score`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``blinoff/roberta-base-russian-v0`` | Russian | Please refer to: | +| | | `blinoff/roberta-base-russian-v0`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``allenai/reviews_roberta_base`` | English | Please refer to: | +| | | `allenai/reviews_roberta_base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ruiqi-zhong/roberta-base-meta-tuning-test`` | English | Please refer to: | +| | | `ruiqi-zhong/roberta-base-meta-tuning-test`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ 
+|``mrm8488/distilroberta-finetuned-tweets-hate-speech`` | English | Please refer to: | +| | | `mrm8488/distilroberta-finetuned-tweets-hate-speech`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cointegrated/roberta-large-cola-krishna2020`` | English | Please refer to: | +| | | `cointegrated/roberta-large-cola-krishna2020`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deepset/roberta-base-squad2-distilled`` | English | Please refer to: | +| | | `deepset/roberta-base-squad2-distilled`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tli8hf/unqover-roberta-base-squad`` | English | Please refer to: | +| | | `tli8hf/unqover-roberta-base-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/nli-roberta-base`` | English | Please refer to: | +| | | `cross-encoder/nli-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large`` | English | Please refer to: | +| | | `nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/BPE_SELFIES_PubChem_shard00_160k`` | English | Please refer to: | +| | | `seyonec/BPE_SELFIES_PubChem_shard00_160k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``CLTL/MedRoBERTa.nl`` | Dutch | Please refer to: | +| | | `CLTL/MedRoBERTa.nl`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``HooshvareLab/roberta-fa-zwnj-base`` | Persian | Please refer to: | +| | | `HooshvareLab/roberta-fa-zwnj-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nyu-mll/roberta-base-100M-1`` | English | Please refer to: | +| | | `nyu-mll/roberta-base-100M-1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deepset/tinyroberta-squad2`` | English | Please refer to: | +| | | `deepset/tinyroberta-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``youscan/ukr-roberta-base`` | Ukrainian | Please refer to: | +| | | `youscan/ukr-roberta-base`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``navteca/roberta-base-squad2`` | English | Please refer to: | +| | | `navteca/roberta-base-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``bertin-project/bertin-roberta-base-spanish`` | Spanish | Please refer to: | +| | | `bertin-project/bertin-roberta-base-spanish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``shiyue/roberta-large-tac08`` | English | Please refer to: | +| | | `shiyue/roberta-large-tac08`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``softcatala/julibert`` | Catalan | Please refer to: | +| | | `softcatala/julibert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``elozano/tweet_sentiment_eval`` | English | Please refer to: | +| | | `elozano/tweet_sentiment_eval`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cahya/roberta-base-indonesian-1.5G`` | Indonesian | Please refer to: | +| | | `cahya/roberta-base-indonesian-1.5G`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``elozano/tweet_emotion_eval`` | English | Please refer to: | +| | | `elozano/tweet_emotion_eval`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``navteca/roberta-large-squad2`` | English | Please refer to: | +| | | `navteca/roberta-large-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``elozano/tweet_offensive_eval`` | English | Please refer to: | +| | | `elozano/tweet_offensive_eval`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ynie/roberta-large_conv_contradiction_detector_v0`` | English | Please refer to: | +| | | `ynie/roberta-large_conv_contradiction_detector_v0`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _deepset/roberta-base-squad2: https://huggingface.co/deepset/roberta-base-squad2 +.. _uer/roberta-base-chinese-extractive-qa: https://huggingface.co/uer/roberta-base-chinese-extractive-qa +.. _uer/roberta-base-finetuned-chinanews-chinese: https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese +.. 
_uer/roberta-base-finetuned-cluener2020-chinese: https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese
+.. _roberta-base: https://huggingface.co/roberta-base
+.. _cardiffnlp/twitter-roberta-base-sentiment: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
+.. _roberta-large: https://huggingface.co/roberta-large
+.. _distilroberta-base: https://huggingface.co/distilroberta-base
+.. _cross-encoder/nli-distilroberta-base: https://huggingface.co/cross-encoder/nli-distilroberta-base
+.. _roberta-base-openai-detector: https://huggingface.co/roberta-base-openai-detector
+.. _huggingface/CodeBERTa-small-v1: https://huggingface.co/huggingface/CodeBERTa-small-v1
+.. _mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis: https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis
+.. _siebert/sentiment-roberta-large-english: https://huggingface.co/siebert/sentiment-roberta-large-english
+.. _j-hartmann/emotion-english-distilroberta-base: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
+.. _cardiffnlp/twitter-roberta-base-emotion: https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion
+.. _seyonec/PubChem10M_SMILES_BPE_396_250: https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_396_250
+.. _textattack/roberta-base-SST-2: https://huggingface.co/textattack/roberta-base-SST-2
+.. _sshleifer/tiny-distilroberta-base: https://huggingface.co/sshleifer/tiny-distilroberta-base
+.. _thatdramebaazguy/roberta-base-squad: https://huggingface.co/thatdramebaazguy/roberta-base-squad
+.. _ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli: https://huggingface.co/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli
+.. _ufal/robeczech-base: https://huggingface.co/ufal/robeczech-base
+.. _seyonec/PubChem10M_SMILES_BPE_450k: https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_450k
+.. _cardiffnlp/twitter-roberta-base: https://huggingface.co/cardiffnlp/twitter-roberta-base
+.. _seyonec/PubChem10M_SMILES_BPE_50k: https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_50k
+.. _microsoft/codebert-base-mlm: https://huggingface.co/microsoft/codebert-base-mlm
+.. _textattack/roberta-base-MNLI: https://huggingface.co/textattack/roberta-base-MNLI
+.. _cardiffnlp/twitter-roberta-base-offensive: https://huggingface.co/cardiffnlp/twitter-roberta-base-offensive
+.. _cross-encoder/stsb-roberta-large: https://huggingface.co/cross-encoder/stsb-roberta-large
+.. _seyonec/ChemBERTa_zinc250k_v2_40k: https://huggingface.co/seyonec/ChemBERTa_zinc250k_v2_40k
+.. _uklfr/gottbert-base: https://huggingface.co/uklfr/gottbert-base
+.. _seyonec/ChemBERTa-zinc-base-v1: https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1
+.. _roberta-large-openai-detector: https://huggingface.co/roberta-large-openai-detector
+.. _cross-encoder/quora-roberta-base: https://huggingface.co/cross-encoder/quora-roberta-base
+.. _cross-encoder/stsb-roberta-base: https://huggingface.co/cross-encoder/stsb-roberta-base
+.. _microsoft/graphcodebert-base: https://huggingface.co/microsoft/graphcodebert-base
+.. _cardiffnlp/twitter-roberta-base-hate: https://huggingface.co/cardiffnlp/twitter-roberta-base-hate
+.. _chkla/roberta-argument: https://huggingface.co/chkla/roberta-argument
+.. _Salesforce/grappa_large_jnt: https://huggingface.co/Salesforce/grappa_large_jnt
+.. _vinai/bertweet-large: https://huggingface.co/vinai/bertweet-large
+.. _allenai/biomed_roberta_base: https://huggingface.co/allenai/biomed_roberta_base
+.. 
_facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base +.. _Rakib/roberta-base-on-cuad: https://huggingface.co/Rakib/roberta-base-on-cuad +.. _cross-encoder/stsb-distilroberta-base: https://huggingface.co/cross-encoder/stsb-distilroberta-base +.. _nyu-mll/roberta-base-1B-1: https://huggingface.co/nyu-mll/roberta-base-1B-1 +.. _nyu-mll/roberta-med-small-1M-1: https://huggingface.co/nyu-mll/roberta-med-small-1M-1 +.. _SkolkovoInstitute/roberta_toxicity_classifier: https://huggingface.co/SkolkovoInstitute/roberta_toxicity_classifier +.. _facebook/muppet-roberta-large: https://huggingface.co/facebook/muppet-roberta-large +.. _lassl/roberta-ko-small: https://huggingface.co/lassl/roberta-ko-small +.. _huggingface/CodeBERTa-language-id: https://huggingface.co/huggingface/CodeBERTa-language-id +.. _textattack/roberta-base-imdb: https://huggingface.co/textattack/roberta-base-imdb +.. _macedonizer/mk-roberta-base: https://huggingface.co/macedonizer/mk-roberta-base +.. _cross-encoder/nli-MiniLM2-L6-H768: https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768 +.. _textattack/roberta-base-QNLI: https://huggingface.co/textattack/roberta-base-QNLI +.. _deepset/roberta-base-squad2-covid: https://huggingface.co/deepset/roberta-base-squad2-covid +.. _textattack/roberta-base-MRPC: https://huggingface.co/textattack/roberta-base-MRPC +.. _bhadresh-savani/roberta-base-emotion: https://huggingface.co/bhadresh-savani/roberta-base-emotion +.. _aychang/roberta-base-imdb: https://huggingface.co/aychang/roberta-base-imdb +.. _cross-encoder/quora-distilroberta-base: https://huggingface.co/cross-encoder/quora-distilroberta-base +.. _csarron/roberta-base-squad-v1: https://huggingface.co/csarron/roberta-base-squad-v1 +.. _seyonec/ChemBERTA_PubChem1M_shard00_155k: https://huggingface.co/seyonec/ChemBERTA_PubChem1M_shard00_155k +.. _mental/mental-roberta-base: https://huggingface.co/mental/mental-roberta-base +.. _textattack/roberta-base-CoLA: https://huggingface.co/textattack/roberta-base-CoLA +.. _navteca/quora-roberta-base: https://huggingface.co/navteca/quora-roberta-base +.. _cardiffnlp/twitter-roberta-base-emoji: https://huggingface.co/cardiffnlp/twitter-roberta-base-emoji +.. _benjamin/roberta-base-wechsel-german: https://huggingface.co/benjamin/roberta-base-wechsel-german +.. _textattack/roberta-base-ag-news: https://huggingface.co/textattack/roberta-base-ag-news +.. _johngiorgi/declutr-base: https://huggingface.co/johngiorgi/declutr-base +.. _salesken/query_wellformedness_score: https://huggingface.co/salesken/query_wellformedness_score +.. _blinoff/roberta-base-russian-v0: https://huggingface.co/blinoff/roberta-base-russian-v0 +.. _allenai/reviews_roberta_base: https://huggingface.co/allenai/reviews_roberta_base +.. _ruiqi-zhong/roberta-base-meta-tuning-test: https://huggingface.co/ruiqi-zhong/roberta-base-meta-tuning-test +.. _mrm8488/distilroberta-finetuned-tweets-hate-speech: https://huggingface.co/mrm8488/distilroberta-finetuned-tweets-hate-speech +.. _cointegrated/roberta-large-cola-krishna2020: https://huggingface.co/cointegrated/roberta-large-cola-krishna2020 +.. _deepset/roberta-base-squad2-distilled: https://huggingface.co/deepset/roberta-base-squad2-distilled +.. _tli8hf/unqover-roberta-base-squad: https://huggingface.co/tli8hf/unqover-roberta-base-squad +.. _cross-encoder/nli-roberta-base: https://huggingface.co/cross-encoder/nli-roberta-base +.. 
_nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large: https://huggingface.co/nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large +.. _seyonec/BPE_SELFIES_PubChem_shard00_160k: https://huggingface.co/seyonec/BPE_SELFIES_PubChem_shard00_160k +.. _CLTL/MedRoBERTa.nl: https://huggingface.co/CLTL/MedRoBERTa.nl +.. _HooshvareLab/roberta-fa-zwnj-base: https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base +.. _nyu-mll/roberta-base-100M-1: https://huggingface.co/nyu-mll/roberta-base-100M-1 +.. _deepset/tinyroberta-squad2: https://huggingface.co/deepset/tinyroberta-squad2 +.. _youscan/ukr-roberta-base: https://huggingface.co/youscan/ukr-roberta-base +.. _navteca/roberta-base-squad2: https://huggingface.co/navteca/roberta-base-squad2 +.. _bertin-project/bertin-roberta-base-spanish: https://huggingface.co/bertin-project/bertin-roberta-base-spanish +.. _shiyue/roberta-large-tac08: https://huggingface.co/shiyue/roberta-large-tac08 +.. _softcatala/julibert: https://huggingface.co/softcatala/julibert +.. _elozano/tweet_sentiment_eval: https://huggingface.co/elozano/tweet_sentiment_eval +.. _cahya/roberta-base-indonesian-1.5G: https://huggingface.co/cahya/roberta-base-indonesian-1.5G +.. _elozano/tweet_emotion_eval: https://huggingface.co/elozano/tweet_emotion_eval +.. _navteca/roberta-large-squad2: https://huggingface.co/navteca/roberta-large-squad2 +.. _elozano/tweet_offensive_eval: https://huggingface.co/elozano/tweet_offensive_eval +.. _ynie/roberta-large_conv_contradiction_detector_v0: https://huggingface.co/ynie/roberta-large_conv_contradiction_detector_v0 diff --git a/docs/model_zoo/transformers/RoFormer/contents.rst b/docs/model_zoo/transformers/RoFormer/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..3b4db00dc70f9c93fc45a92ad3aee21cba3cfb6f --- /dev/null +++ b/docs/model_zoo/transformers/RoFormer/contents.rst @@ -0,0 +1,53 @@ + + +------------------------------------ +RoFormer模型汇总 +------------------------------------ + +下表汇总介绍了目前PaddleNLP支持的RoFormer模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``roformer-chinese-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 30M parameters. | +| | | Roformer Small Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 124M parameters. | +| | | Roformer Base Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-char-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 15M parameters. | +| | | Roformer Chinese Char Small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-char-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 95M parameters. 
| +| | | Roformer Chinese Char Base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-ft-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 15M parameters. | +| | | Roformer Chinese Char Ft Small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-ft-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 95M parameters. | +| | | Roformer Chinese Char Ft Base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 15M parameters. | +| | | Roformer Chinese Sim Char Small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 95M parameters. | +| | | Roformer Chinese Sim Char Base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-english-small-discriminator`` | English | 12-layer, 256-hidden, | +| | | 4-heads, 13M parameters. | +| | | Roformer English Small Discriminator. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-english-small-generator`` | English | 12-layer, 64-hidden, | +| | | 1-heads, 5M parameters. | +| | | Roformer English Small Generator. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + diff --git a/docs/model_zoo/transformers/SKEP/contents.rst b/docs/model_zoo/transformers/SKEP/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..f250842787d6ee5c235c750941837018c6306a9c --- /dev/null +++ b/docs/model_zoo/transformers/SKEP/contents.rst @@ -0,0 +1,29 @@ + + +------------------------------------ +SKEP模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的SKEP模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``skep_ernie_1.0_large_ch`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. 
|
+|                                                                                  |              | Trained using the ERNIE model                                                    |
+|                                                                                  |              | ``ernie_1.0``                                                                    |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``skep_ernie_2.0_large_en``                                                       | English      | 24-layer, 1024-hidden,                                                           |
+|                                                                                  |              | 16-heads, 336M parameters.                                                       |
+|                                                                                  |              | Trained using the ERNIE model                                                    |
+|                                                                                  |              | ``ernie_2.0_large_en``                                                           |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``skep_roberta_large_en``                                                         | English      | 24-layer, 1024-hidden,                                                           |
+|                                                                                  |              | 16-heads, 355M parameters.                                                       |
+|                                                                                  |              | Trained using the RoBERTa model                                                  |
+|                                                                                  |              | ``roberta_large_en``                                                             |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
diff --git a/docs/model_zoo/transformers/SqueezeBert/contents.rst b/docs/model_zoo/transformers/SqueezeBert/contents.rst
new file mode 100644
index 0000000000000000000000000000000000000000..6f1b2afcc1efc4fb054353cb8b5b147fa9060a25
--- /dev/null
+++ b/docs/model_zoo/transformers/SqueezeBert/contents.rst
@@ -0,0 +1,26 @@
+
+
+------------------------------------
+SqueezeBert模型汇总
+------------------------------------
+
+
+
+下表汇总介绍了目前PaddleNLP支持的SqueezeBert模型对应预训练权重。
+关于模型的具体细节可以参考对应链接。
+
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+| Pretrained Weight                                                                | Language     | Details of the model                                                             |
++==================================================================================+==============+==================================================================================+
+|``squeezebert-uncased``                                                           | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 51M parameters.                                                        |
+|                                                                                  |              | SqueezeBert Uncased model.                                                       |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``squeezebert-mnli``                                                              | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 51M parameters.                                                        |
+|                                                                                  |              | SqueezeBert MNLI model.                                                          |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``squeezebert-mnli-headless``                                                     | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 51M parameters.                                                        |
+|                                                                                  |              | SqueezeBert MNLI Headless model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/T5/contents.rst b/docs/model_zoo/transformers/T5/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..b5b0ea1f9ead660dd40206cc19af150407551831 --- /dev/null +++ b/docs/model_zoo/transformers/T5/contents.rst @@ -0,0 +1,203 @@ + + +------------------------------------ +T5模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的T5模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``t5-small`` | English | 6-layer, 512-hidden, | +| | | 8-heads, 93M parameters. | +| | | T5 small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 272M parameters. | +| | | T5 base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-large`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 803M parameters. | +| | | T5 large model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-v1_1-base`` | English | Please refer to: | +| | | t5-v1_1-base_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-v1_1-large`` | English | Please refer to: | +| | | t5-v1_1-large_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Langboat/mengzi-t5-base`` | Chinese | Please refer to: | +| | | `Langboat/mengzi-t5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Langboat/mengzi-t5-base-mt`` | Chinese | Please refer to: | +| | | `Langboat/mengzi-t5-base-mt`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deep-learning-analytics/wikihow-t5-small`` | English | Please refer to: | +| | | `deep-learning-analytics/wikihow-t5-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sberbank-ai/ruT5-base`` | English | Please refer to: | +| | | `sberbank-ai/ruT5-base`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Michau/t5-base-en-generate-headline`` | English | Please refer to: | +| | | `Michau/t5-base-en-generate-headline`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-v1_1-small`` | English | Please refer to: | +| | | `google/t5-v1_1-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``prithivida/parrot_paraphraser_on_T5`` | English | Please refer to: | +| | | `prithivida/parrot_paraphraser_on_T5`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``prithivida/grammar_error_correcter_v1`` | English | Please refer to: | +| | | `prithivida/grammar_error_correcter_v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-qg-hl`` | English | Please refer to: | +| | | `valhalla/t5-small-qg-hl`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-qa-qg-hl`` | English | Please refer to: | +| | | `valhalla/t5-small-qa-qg-hl`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ramsrigouthamg/t5-large-paraphraser-diverse-high-quality`` | English | Please refer to: | +| | | `ramsrigouthamg/t5-large-paraphraser-diverse-high-quality`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mrm8488/t5-base-finetuned-common_gen`` | English | Please refer to: | +| | | `mrm8488/t5-base-finetuned-common_gen`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-e2e-qg`` | English | Please refer to: | +| | | `valhalla/t5-small-e2e-qg`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sonoisa/t5-base-japanese`` | japanese | Please refer to: | +| | | `sonoisa/t5-base-japanese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-base-lm-adapt`` | English | Please refer to: | +| | | `google/t5-base-lm-adapt`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-small-lm-adapt`` | English | Please refer to: | +| | | `google/t5-small-lm-adapt`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-qg-prepend`` | English | Please refer to: | +| | | `valhalla/t5-small-qg-prepend`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``prithivida/informal_to_formal_styletransfer`` | English | Please refer to: | +| | | `prithivida/informal_to_formal_styletransfer`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``KETI-AIR/ke-t5-base`` | English | Please refer to: | +| | | `KETI-AIR/ke-t5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nielsr/nt5-small-rc1`` | English | Please refer to: | +| | | `nielsr/nt5-small-rc1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``snrspeaks/t5-one-line-summary`` | English | Please refer to: | +| | | `snrspeaks/t5-one-line-summary`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mrm8488/t5-small-finetuned-quora-for-paraphrasing`` | English | Please refer to: | +| | | `mrm8488/t5-small-finetuned-quora-for-paraphrasing`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``p-christ/12412fsasf`` | English | Please refer to: | +| | | `p-christ/12412fsasf`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tscholak/3vnuv1vf`` | English | Please refer to: | +| | | `tscholak/3vnuv1vf`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tennessejoyce/titlewave-t5-base`` | English | Please refer to: | +| | | `tennessejoyce/titlewave-t5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``vennify/t5-base-grammar-correction`` | English | Please refer to: | +| | | `vennify/t5-base-grammar-correction`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``megagonlabs/t5-base-japanese-web`` | Japanese | Please refer to: | +| | | `megagonlabs/t5-base-japanese-web`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sberbank-ai/ruT5-large`` | English | Please refer to: | +| | | `sberbank-ai/ruT5-large`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tscholak/t5.1.1.lm100k.base`` | English | Please refer to: | +| | | `tscholak/t5.1.1.lm100k.base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deep-learning-analytics/GrammarCorrector`` | English | Please refer to: | +| | | `deep-learning-analytics/GrammarCorrector`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ThomasNLG/t5-qa_squad2neg-en`` | English | Please refer to: | +| | | `ThomasNLG/t5-qa_squad2neg-en`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``flexudy/t5-small-wav2vec2-grammar-fixer`` | English | Please refer to: | +| | | `flexudy/t5-small-wav2vec2-grammar-fixer`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``KETI-AIR/ke-t5-small`` | English | Please refer to: | +| | | `KETI-AIR/ke-t5-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``razent/SciFive-large-Pubmed_PMC`` | English | Please refer to: | +| | | `razent/SciFive-large-Pubmed_PMC`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-large-ssm-nq`` | English | Please refer to: | +| | | `google/t5-large-ssm-nq`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ozcangundes/T5-base-for-BioQA`` | English | Please refer to: | +| | | `ozcangundes/T5-base-for-BioQA`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Rostlab/prot_t5_base_mt_uniref50`` | English | Please refer to: | +| | | `Rostlab/prot_t5_base_mt_uniref50`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sonoisa/t5-base-japanese-question-generation`` | Japanese | Please refer to: | +| | | `sonoisa/t5-base-japanese-question-generation`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Wikidepia/IndoT5-base`` | English | Please refer to: | +| | | `Wikidepia/IndoT5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``razent/SciFive-base-Pubmed_PMC`` | English | Please refer to: | +| | | `razent/SciFive-base-Pubmed_PMC`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-small-ssm-nq`` | English | Please refer to: | +| | | `google/t5-small-ssm-nq`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + + + +.. _t5-v1_1-base: https://huggingface.co/google/t5-v1_1-base +.. _t5-v1_1-large: https://huggingface.co/google/t5-v1_1-large +.. _Langboat/mengzi-t5-base: https://huggingface.co/Langboat/mengzi-t5-base +.. _Langboat/mengzi-t5-base-mt: https://huggingface.co/Langboat/mengzi-t5-base-mt +.. _deep-learning-analytics/wikihow-t5-small: https://huggingface.co/deep-learning-analytics/wikihow-t5-small +.. _sberbank-ai/ruT5-base: https://huggingface.co/sberbank-ai/ruT5-base +.. _Michau/t5-base-en-generate-headline: https://huggingface.co/Michau/t5-base-en-generate-headline +.. _google/t5-v1_1-small: https://huggingface.co/google/t5-v1_1-small +.. _prithivida/parrot_paraphraser_on_T5: https://huggingface.co/prithivida/parrot_paraphraser_on_T5 +.. _prithivida/grammar_error_correcter_v1: https://huggingface.co/prithivida/grammar_error_correcter_v1 +.. _valhalla/t5-small-qg-hl: https://huggingface.co/valhalla/t5-small-qg-hl +.. _valhalla/t5-small-qa-qg-hl: https://huggingface.co/valhalla/t5-small-qa-qg-hl +.. _ramsrigouthamg/t5-large-paraphraser-diverse-high-quality: https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality +.. _mrm8488/t5-base-finetuned-common_gen: https://huggingface.co/mrm8488/t5-base-finetuned-common_gen +.. _valhalla/t5-small-e2e-qg: https://huggingface.co/valhalla/t5-small-e2e-qg +.. _sonoisa/t5-base-japanese: https://huggingface.co/sonoisa/t5-base-japanese +.. _google/t5-base-lm-adapt: https://huggingface.co/google/t5-base-lm-adapt +.. _google/t5-small-lm-adapt: https://huggingface.co/google/t5-small-lm-adapt +.. _valhalla/t5-small-qg-prepend: https://huggingface.co/valhalla/t5-small-qg-prepend +.. _prithivida/informal_to_formal_styletransfer: https://huggingface.co/prithivida/informal_to_formal_styletransfer +.. _KETI-AIR/ke-t5-base: https://huggingface.co/KETI-AIR/ke-t5-base +.. _nielsr/nt5-small-rc1: https://huggingface.co/nielsr/nt5-small-rc1 +.. _snrspeaks/t5-one-line-summary: https://huggingface.co/snrspeaks/t5-one-line-summary +.. _mrm8488/t5-small-finetuned-quora-for-paraphrasing: https://huggingface.co/mrm8488/t5-small-finetuned-quora-for-paraphrasing +.. _p-christ/12412fsasf: https://huggingface.co/p-christ/12412fsasf +.. _tscholak/3vnuv1vf: https://huggingface.co/tscholak/3vnuv1vf +.. _tennessejoyce/titlewave-t5-base: https://huggingface.co/tennessejoyce/titlewave-t5-base +.. _vennify/t5-base-grammar-correction: https://huggingface.co/vennify/t5-base-grammar-correction +.. _megagonlabs/t5-base-japanese-web: https://huggingface.co/megagonlabs/t5-base-japanese-web +.. _sberbank-ai/ruT5-large: https://huggingface.co/sberbank-ai/ruT5-large +.. _tscholak/t5.1.1.lm100k.base: https://huggingface.co/tscholak/t5.1.1.lm100k.base +.. _deep-learning-analytics/GrammarCorrector: https://huggingface.co/deep-learning-analytics/GrammarCorrector +.. _ThomasNLG/t5-qa_squad2neg-en: https://huggingface.co/ThomasNLG/t5-qa_squad2neg-en +.. _t5-small-wav2vec2-grammar-fixer: https://huggingface.co/t5-small-wav2vec2-grammar-fixer +.. _KETI-AIR/ke-t5-small: https://huggingface.co/KETI-AIR/ke-t5-small +.. 
_razent/SciFive-large-Pubmed_PMC: https://huggingface.co/razent/SciFive-large-Pubmed_PMC +.. _google/t5-large-ssm-nq: https://huggingface.co/google/t5-large-ssm-nq +.. _ozcangundes/T5-base-for-BioQA: https://huggingface.co/ozcangundes/T5-base-for-BioQA +.. _Rostlab/prot_t5_base_mt_uniref50: https://huggingface.co/Rostlab/prot_t5_base_mt_uniref50 +.. _sonoisa/t5-base-japanese-question-generation: https://huggingface.co/sonoisa/t5-base-japanese-question-generation +.. _Wikidepia/IndoT5-base: https://huggingface.co/Wikidepia/IndoT5-base +.. _razent/SciFive-base-Pubmed_PMC: https://huggingface.co/razent/SciFive-base-Pubmed_PMC +.. _google/t5-small-ssm-nq: https://huggingface.co/google/t5-small-ssm-nq +.. _flexudy/t5-small-wav2vec2-grammar-fixer: https://huggingface.co/flexudy/t5-small-wav2vec2-grammar-fixer + diff --git a/docs/model_zoo/transformers/TinyBert/contents.rst b/docs/model_zoo/transformers/TinyBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..92bf2879c39d9ad2f546c67007737ccd080d8f1a --- /dev/null +++ b/docs/model_zoo/transformers/TinyBert/contents.rst @@ -0,0 +1,44 @@ + + +------------------------------------ +TinyBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的TinyBert模型以及对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``tinybert-4l-312d`` | English | 4-layer, 312-hidden, | +| | | 12-heads, 14.5M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-6l-768d`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 67M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-4l-312d-v2`` | English | 4-layer, 312-hidden, | +| | | 12-heads, 14.5M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-6l-768d-v2`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 67M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-4l-312d-zh`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 14.5M parameters. 
| +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-6l-768d-zh`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 67M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/UNIMO/contents.rst b/docs/model_zoo/transformers/UNIMO/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..85e3bbbdde9c16080e87e5dd46e01ed6a99be274 --- /dev/null +++ b/docs/model_zoo/transformers/UNIMO/contents.rst @@ -0,0 +1,26 @@ + + +------------------------------------ +UNIMO模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的UNIMO模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``unimo-text-1.0`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 99M parameters. | +| | | UNIMO-text-1.0 model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``unimo-text-1.0-lcsts-new`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 99M parameters. | +| | | Finetuned on lcsts_new dataset. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``unimo-text-1.0-large`` | Chinese | 24-layer, 768-hidden, | +| | | 16-heads, 316M parameters. | +| | | UNIMO-text-1.0 large model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/UnifiedTransformer/contents.rst b/docs/model_zoo/transformers/UnifiedTransformer/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..16c62d9c3e7fad6e808276f09ef7ba4983e51939 --- /dev/null +++ b/docs/model_zoo/transformers/UnifiedTransformer/contents.rst @@ -0,0 +1,32 @@ + + +------------------------------------ +UnifiedTransformer模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的UnifiedTransformer模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``unified_transformer-12L-cn`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. 
| +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``unified_transformer-12L-cn-luge`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text (LUGE.ai). | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``plato-mini`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 66M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``plato-xl`` | Chinese | 72-layer, 3072-hidden, | +| | | 32-heads, ?M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + + diff --git a/docs/model_zoo/transformers/XLNet/contents.rst b/docs/model_zoo/transformers/XLNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..5856dd99bb3c7b058d05725de000f48c653175b9 --- /dev/null +++ b/docs/model_zoo/transformers/XLNet/contents.rst @@ -0,0 +1,34 @@ + + +------------------------------------ +XLNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的XLNet模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``xlnet-base-cased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 110M parameters. | +| | | XLNet English model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``xlnet-large-cased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 340M parameters. | +| | | XLNet Large English model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-xlnet-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 117M parameters. | +| | | XLNet Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-xlnet-mid`` | Chinese | 24-layer, 768-hidden, | +| | | 12-heads, 209M parameters. | +| | | XLNet Medium Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-xlnet-large`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, _M parameters. | +| | | XLNet Large Chinese model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/paddle.png b/docs/paddle.png new file mode 100644 index 0000000000000000000000000000000000000000..bc1135abfab7aa48f29392da4bca614f688314af Binary files /dev/null and b/docs/paddle.png differ diff --git a/docs/peft.md b/docs/peft.md new file mode 100644 index 0000000000000000000000000000000000000000..64e0a87038146e8ad43196dde69eb604e324a027 --- /dev/null +++ b/docs/peft.md @@ -0,0 +1,277 @@ +# PaddleNLP PEFT API + +PaddleNLP PEFT API提供单卡/分布式LoRA和Prefix-Tuning,用户定义好模型,数据集, 以及相应的配置,就可以快速使用PEFT适配模型进行低参数模型微调。 + +# 预备知识 +## LoRA +
+ +
+
+大模型网络中有很多线性层,需要进行密集的矩阵乘法计算,而这些权重矩阵通常是满秩(full rank)的,直接全量微调的计算和显存开销很大。LoRA 论文的研究表明,将输入表示投影到较小的子空间后,不仅仍然可以有效地学习,还可以节约大量的计算与显存需求。具体做法:对于预训练的权重矩阵,引入两个低秩(low rank)矩阵 $A$、$B$(图中橙色的两个矩阵)来近似权重的更新过程 $W_0+\Delta W=W_0+B A$,其中 $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$,$r$ 远小于原权重矩阵的 rank。训练期间,$W_0$ 参数冻结,只对 $A$ 和 $B$ 两个矩阵进行梯度更新,前向传播公式如下:
+
+$$
+h=W_{0}x+BAx
+$$
+
+由于可训练参数大幅减少,训练过程中需要保存的中间变量也随之减少,由此节约大量的训练显存消耗。
+更多算法细节可参考 LoRA [论文](https://arxiv.org/abs/2106.09685)。
+
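+为便于理解上面的前向公式,下面给出一个最小的 LoRA 线性层示意实现(仅用于说明原理,并非 PaddleNLP 中 `LoRALinear` 的实际源码,类名 `SketchLoRALinear` 为自拟):
+
+```python
+import paddle
+from paddle import nn
+
+class SketchLoRALinear(nn.Layer):
+    """示意实现:冻结原始权重 W0,仅训练低秩矩阵 A、B。"""
+
+    def __init__(self, in_features, out_features, r=8, lora_alpha=16):
+        super().__init__()
+        self.base = nn.Linear(in_features, out_features)  # 对应公式中的 W0
+        self.base.weight.stop_gradient = True  # 冻结 W0,不参与梯度更新
+        self.base.bias.stop_gradient = True
+        self.lora_A = nn.Linear(in_features, r, bias_attr=False)
+        # B 初始化为全 0,保证训练开始时 BAx 为 0,输出与原模型完全一致
+        self.lora_B = nn.Linear(
+            r, out_features,
+            weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)),
+            bias_attr=False)
+        self.scaling = lora_alpha / r
+
+    def forward(self, x):
+        # h = W0 x + B A x,并乘以 lora_alpha / r 的缩放系数
+        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling
+
+layer = SketchLoRALinear(768, 768, r=8, lora_alpha=16)
+print(layer(paddle.randn([2, 16, 768])).shape)  # [2, 16, 768]
+```
+
+可以看到,训练开始时 LoRA 分支的输出为 0,不会改变原模型的行为;训练过程中只有 $A$、$B$ 参与梯度更新。
+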
+## Prefix-tuning
+
+ + +Prefix-tuning是一个针对NLG类型下游任务的轻量级微调方案,受提示学习(Prompt learning)的影响,加入的一部分 prefix embedding 作为连续型提示进行训练。prefix embedding是由专门的 prefix encoder 网络生成的数个张量,会以 past_key_value的方式被插入到语言模型每一层的 hidden_state之前。和 LoRA 类似,它也会冻结整个预训练模型的所有参数权重,只对prefix embedding进行梯度更新,因此训练参数量只有常规 SFT 的 0.1%。Prefix-tuning可以在全样本下获得与 SFT 比肩的训练效果,在小样本环境下甚至可以超越 SFT。更多算法细节参考 +Prefix-tuning[论文](https://arxiv.org/abs/2101.00190) + +# 快速开始 +## LoRA + +1. 要对 model 进行 LoRA 微调,首先需要定义LoRAConfig, 再通过 LoRAConfig 对 LoRAModel 进行构建,再通过 mark_only_lora_as_trainable函数冻结主干参数: +```python + from paddlenlp.peft import LoRAConfig, LoRAModel + from paddlenlp.transformers import AutoModelForCausalLM + + model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b') + target_modules = [".*q_proj.*", ".*v_proj.*", ".*k_proj.*"] + lora_rank = 8 + lora_config = LoRAConfig( + target_modules=target_modules, + r=lora_rank, + lora_alpha=2 * lora_rank, + merge_weights=True + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() +``` + +2. 模型的保存和载入 + +LoRAModel的保存和载入和普通的 model 没有太大区别,都是通过 save_pretrained/from_pretrained调用 +```python + # 保存 + model.save_pretrained('lora_path') +``` +Paddle会将 LoRAModel 的矩阵 AB 权重保存为lora_mode_state.pdparams文件,LoRAConfig 配置保存为 lora_config.json 文件在 lora_path 目录下。 +之后当需要载入模型权重进行推理时,则直接进行 from_pretrained即可。 +```python + from paddlenlp.transformers import AutoModelForCausalLM + + from paddlenlp.peft import LoRAModel, LoRAConfig + + # 载入 + + config = LoRAConfig.from_pretrained('lora_path') + model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b') + + model = LoRAModel.from_pretrained(model, 'lora_path') + model.eval() +``` +## class LoRAConfig +```python +Parameters: + + --r + 默认为 8,LoRA A/B 矩阵秩。 + + --target_modules + 指定哪些 module 需要适配 LoRA 算法,格式为module 的名字 + 或正则表达式的 List,比如, ['q', 'v'] 或者 '.*decoder.*(SelfAttention|EncDecAttention).*(q|v)$' + + --trainable_modules + 指定除LoRA参数外的需要进行梯度更新参数的 modules,格式为module 的名字 + 或正则表达式的 List,比如, ['q', 'v'] 或者 '.*decoder.*(SelfAttention|EncDecAttention).*(q|v)$' + + --lora_alpha + 默认为 8,LoRA算法的 alpha 值,int 类型 + + --lora_dropout + 默认为 0.0,dropout的比例设置,float 类型 + + --merge_weights + 默认为 False,模型推理时,是否进行base model 权重和 LoRA 权重的合参操作,bool 类型 + + --trainable_bias + 指定可训练的 bias, 可选项 ['lora', 'all'] + + --enable_lora_list + 指定是否需要使用`MergedLoRALinear`,如果不指定则默认使用 + `LoRALinear` + + --tensor_parallel_degree + 默认为-1,多 GPU 并行的控制参数,传入tensor_parallel_degree 必须与 base model保持一致 + + --dtype + LoRA矩阵参数类型设置 + + --head_dim + 多头注意力的头数,只有`LoRAMergedLinear`和 + `ColumnParallelLoRAMergedLinear`使用 +``` +## class LoRAModel +```python +Parameters: + + --model + 指定 base model,必须是 nn.Layer 类型的对象 + + --lora_config + 指定 LoRAConfig 用于配置 LoRAModel + +key function: + + -mark_only_lora_as_trainable() + + 其作用是将模型中与LoRA相关的的一些层标记为可训练,而其他层则标记为不可训练。 + + + -save_pretrained(save_directory, merge_tensor_parallel) + --save_directory + 保存目录的路径 + --merge_tensor_parallel + 是否合并多卡参数,默认为True + + 如果merge_tensor_parallel为真且模型的配置中的张量并行度大于1,则获取可训练的state_dict,并使用_merge_trainable_tensor_parallel方法合并张量并行训练的state_dict。如果merge_tensor_parallel为真且模型的张量并行度大于1,只有主进程会进行保存操作。 + + + -from_pretrained(model, lora_path) + --model + 要加载LORA权重参数的model对象 + --lora_path + 保存LORA权重参数和 config 的路径 + + 该函数用于从预先训练的模型中加载LORA权重参数,并将其设置到给定的模型中,以便在后续的任务中使用该模型进行预测或训练。 + + + -print_trainable_parameters() + + 该函数会遍历整个权重参数列表,对于每个权重参数weight,统计所有进行梯度更新的参数,最后将信息打印出来。 +``` + + +## Prefix-tuning +1. 
设置Prefix-tuning参数
+```python
+    from paddlenlp.peft import PrefixConfig, PrefixModelForCausalLM
+    from paddlenlp.transformers import AutoModelForCausalLM
+
+    model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b')
+
+    prefix_config = PrefixConfig(
+        num_prefix_tokens=64,
+        num_attention_heads=model.config.n_head,
+        num_hidden_layers=model.config.n_layer,
+        hidden_size=model.config.hidden_size,
+        prefix_projection=False,
+        prefix_projection_hidden_size=model.config.hidden_size
+    )
+    model = PrefixModelForCausalLM(model=model, prefix_config=prefix_config)
+    model.mark_only_prefix_as_trainable()
+    model.print_trainable_parameters()
+```
+
+2. 模型的保存和载入
+
+和 LoRAModel 一致,通过 save_pretrained/from_pretrained 调用:
+```python
+    # 保存
+    model.save_pretrained('prefix_path')
+```
+Paddle 会将 PrefixModelForCausalLM 中 prefix_encoder(包含 Embedding layer 和 Linear layers)的网络权重,以及 PrefixConfig 配置(保存为 prefix_config.json 文件)一并保存到 prefix_path 目录下。
+之后当需要载入模型权重进行推理时,直接调用 from_pretrained 即可。
+```python
+    from paddlenlp.transformers import AutoModelForCausalLM
+
+    from paddlenlp.peft import PrefixConfig, PrefixModelForCausalLM
+
+    # 载入
+
+    config = PrefixConfig.from_pretrained('prefix_path')
+    model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b')
+
+    model = PrefixModelForCausalLM.from_pretrained(model, 'prefix_path')
+    model.eval()
+```
+
+## class PrefixConfig
+```python
+Parameters:
+
+    --prefix_dropout
+        默认为 0.0,prefix projection dropout比例设置,float 类型
+
+    --num_prefix_tokens
+        prefix tokens个数设定,int 类型
+
+    --num_attention_heads
+        注意力头数设置,int 类型
+
+    --multi_query_group_num
+        multi query group 个数设置,int 类型
+
+    --num_hidden_layers
+        base model 的 layer层数设置,int 类型
+
+    --hidden_size
+        base model 的 hidden size 设置,int 类型
+
+    --prefix_projection
+        默认为 False,是否对 prefix tokens 进行 projection 操作,bool 类型
+
+    --prefix_projection_hidden_size
+        如果 prefix_projection 设置为 True,则在这里设置
+        projection 操作的 hidden size,int 类型
+
+    --tensor_parallel_degree
+        默认为-1,多 GPU 并行的控制参数
+
+    --dtype
+        prefix embeddings 参数类型设置
+
+```
+
+## class PrefixModelForCausalLM
+```python
+Parameters:
+
+    --model
+        指定 base model,必须是 nn.Layer 类型的对象
+
+    --prefix_config
+        指定 PrefixConfig 用于配置 PrefixModelForCausalLM
+
+    --postprocess_past_key_value
+        指定对 past_key_value 进行后处理的函数
+
+    --pad_attention_mask
+        指定处理新增的 prefix embedding 的 pad_attention_mask函数
+
+key function:
+
+    -mark_only_prefix_as_trainable()
+
+        其作用是只把模型中的 Prefix embedding 和 Prefix projection 层标记为可训练,其他层参数全部冻结。
+
+    -save_pretrained(save_directory, merge_tensor_parallel)
+        --save_directory
+            保存目录的路径
+        --merge_tensor_parallel
+            是否合并多卡参数,默认为True
+
+        如果merge_tensor_parallel为真且模型配置中的张量并行度大于1,则获取可训练的state_dict,并使用_merge_trainable_tensor_parallel方法合并张量并行训练的state_dict,此时只有主进程会进行保存操作。
+
+    -from_pretrained(model, prefix_path, postprocess_past_key_value, pad_attention_mask)
+        --model
+            要加载Prefix权重参数的model对象
+        --prefix_path
+            保存Prefix权重参数和 config 文件的路径
+        --postprocess_past_key_value
+            功能同 PrefixModelForCausalLM 构造参数
+        --pad_attention_mask
+            功能同 PrefixModelForCausalLM 构造参数
+
+        该函数用于从预先训练的模型中加载Prefix权重参数,并将其设置到给定的模型中,以便在后续的任务中使用该模型进行预测或训练。
+
+    -print_trainable_parameters()
+
+        该函数会遍历整个权重参数列表,统计所有参与梯度更新的参数量,并将信息打印出来。
+```
+
+更详细的使用可以参考 [finetuning 脚本](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/finetune_generation.py),以及对应的启动脚本编写方式(见 [README.md](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/README.md) 文件)。
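+
+作为补充,下面用一段示意代码展示 Prefix-tuning 中 prefix embedding 如何被整理成逐层的 past_key_value(仅用于说明原理,并非 PaddleNLP 中 prefix_encoder 的实际源码;类名 `SketchPrefixEncoder` 为自拟,张量形状按常见的多头注意力实现假设):
+
+```python
+import paddle
+from paddle import nn
+
+class SketchPrefixEncoder(nn.Layer):
+    """示意实现:由可学习的 prefix embedding 生成每一层的 (key, value)。"""
+
+    def __init__(self, num_prefix_tokens, num_hidden_layers, num_attention_heads,
+                 hidden_size, prefix_projection=False, projection_hidden_size=None):
+        super().__init__()
+        self.num_prefix_tokens = num_prefix_tokens
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.head_dim = hidden_size // num_attention_heads
+        out_dim = num_hidden_layers * 2 * hidden_size  # 每层各需要一份 key 和 value
+        if prefix_projection:
+            # prefix_projection=True 时,先查表再经两层 MLP 投影到目标维度
+            self.embedding = nn.Embedding(num_prefix_tokens, hidden_size)
+            self.projection = nn.Sequential(
+                nn.Linear(hidden_size, projection_hidden_size),
+                nn.Tanh(),
+                nn.Linear(projection_hidden_size, out_dim))
+        else:
+            self.embedding = nn.Embedding(num_prefix_tokens, out_dim)
+            self.projection = None
+
+    def forward(self, batch_size):
+        prefix_ids = paddle.arange(self.num_prefix_tokens).unsqueeze(0).expand([batch_size, -1])
+        prefix = self.embedding(prefix_ids)
+        if self.projection is not None:
+            prefix = self.projection(prefix)
+        # 整理成 [n_layers * 2, batch, n_heads, prefix_len, head_dim],
+        # 再按层拆分,即可作为每层 attention 的 past_key_value 插到 hidden_state 之前
+        prefix = prefix.reshape([batch_size, self.num_prefix_tokens,
+                                 self.num_hidden_layers * 2,
+                                 self.num_attention_heads, self.head_dim])
+        prefix = prefix.transpose([2, 0, 3, 1, 4])
+        return paddle.split(prefix, self.num_hidden_layers)
+
+past_key_values = SketchPrefixEncoder(64, 12, 12, 768)(batch_size=2)
+print(len(past_key_values), past_key_values[0].shape)  # 12 [2, 2, 12, 64, 64]
+```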
diff --git a/docs/requirements.txt b/docs/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..129805924a933b43302ead432016fcb1167776fb
--- /dev/null
+++ b/docs/requirements.txt
@@ -0,0 +1,18 @@
+# Defining the exact version will make sure things don't break
+
+urllib3==1.26.2 # fix urllib3 version dependency: https://github.com/psf/requests/issues/6432#issuecomment-1537221990
+scipy==1.9.1
+aiohttp==3.8.4
+numpy<1.27.0,>=1.19.5
+h11<0.13,>=0.11
+jinja2
+sphinx
+sphinx_book_theme
+readthedocs-sphinx-search
+
+Markdown
+sphinx-copybutton
+sphinx-markdown-tables
+
+# use paddlepaddle == 2.3.* according to: https://github.com/PaddlePaddle/Paddle/issues/48243
+paddlepaddle>=2.4.2
diff --git a/docs/server.md b/docs/server.md
new file mode 100644
index 0000000000000000000000000000000000000000..1505790b65c18df8f5785b2df83aa32d728ceeac
--- /dev/null
+++ b/docs/server.md
@@ -0,0 +1,207 @@
+# PaddleNLP SimpleServing
+
+PaddleNLP SimpleServing 是基于 uvicorn 封装的模型部署服务化工具,可以简易部署预训练模型和预训练模型工具 Taskflow。PaddleNLP SimpleServing 具备以下两个特性:
+ - 易用:一行代码即可部署预训练模型和预训练工具Taskflow
+ - 灵活:Handler机制可以快速定制化服务化部署方式
+
+
+## Taskflow部署
+
+Taskflow 是 PaddleNLP 的预训练模型工具,具备开箱即用的特性,同时 Taskflow 支持加载微调后的模型,基于 Taskflow 的服务化方式可以进一步降低使用者的部署难度。PaddleNLP SimpleServing 基于这样的设计需求,设计了一套基于 Taskflow 的快速部署方式。下面从 server 搭建、client 发送请求两方面详细介绍使用方式。
+
+### server 搭建
+
+下面是 Taskflow 搭建服务的简易代码:
+
+```python
+from paddlenlp import SimpleServer, Taskflow
+
+schema = ['出发地', '目的地', '费用', '时间']
+uie = Taskflow("information_extraction", schema=schema)
+app = SimpleServer()
+app.register_taskflow('taskflow/uie', uie)
+```
+这里主要是使用 `SimpleServer` 服务类来注册 Taskflow Server,下面我们具体介绍一下 `register_taskflow` 相关参数:
+
+```text
+def register_taskflow(
+    task_name,
+    task,
+    taskflow_handler=None)
+
+task_name(str):
+    服务化的名称,最终的服务化的URL: https://host:port/{task_name}
+task(paddlenlp.Taskflow or list(paddlenlp.Taskflow)):
+    Taskflow的实例对象,将想要注册的Taskflow任务注册进去,可以是多个Taskflow实例来支持多卡服务化
+taskflow_handler(paddlenlp.server.BaseTaskflowHandler, 可选):
+    Taskflow句柄处理类,可以自定义处理类来定制化Taskflow服务,默认为None,即使用默认的TaskflowHandler
+```
+### 多卡服务化(可选)
+如果机器上有多张卡,可以在 register_taskflow 注册服务时传入多个 Taskflow 实例,服务在处理请求的过程中会做负载均衡,充分利用机器设备。下面是具体的使用例子:
+```python
+from paddlenlp import SimpleServer, Taskflow
+
+schema = ['出发地', '目的地', '费用', '时间']
+uie1 = Taskflow("information_extraction", schema=schema, device_id=0)
+uie2 = Taskflow("information_extraction", schema=schema, device_id=1)
+app = SimpleServer()
+app.register_taskflow('taskflow/uie', [uie1, uie2])
+```
+### 启动服务化
+执行以下命令即可启动服务:
+```
+paddlenlp server server:app --host 0.0.0.0 --port 8189 --workers 1
+```
+服务化整体参数配置如下:
+```text
+--host: 启动服务化的IP地址,通常可以设置成 0.0.0.0
+--port:启动服务化的网络端口
+--workers: 接收服务化的进程数,默认为1
+--log_level:服务化输出日志的级别,默认为 info 级别
+--limit_concurrency:服务化能接受的并发数目,默认为None,即不做限制
+--timeout_keep_alive:保持服务化连接的时间,默认为15s
+--app_dir:服务化本地的路径,默认为服务化启动的位置
+--reload: 当 app_dir的服务化相关配置和代码发生变化时,是否重启server,默认为False
+```
+
+### client 发送
+```python
+import requests
+import json
+
+url = "http://0.0.0.0:8189/taskflow/uie"
+headers = {"Content-Type": "application/json"}
+texts = ["城市内交通费7月5日金额114广州至佛山", "5月9日交通费29元从北苑到望京搜后"]
+data = {
+    "data": {
+        "text": texts,
+    }
+}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+datas = json.loads(r.text)
+print(datas)
+```
+通过上述代码即可发送 POST 请求,注意请求文本需要放在 `data` 这个 key 下。
+
+同时支持把自定义 `schema` 传入 client 请求中,从而快速切换 `schema`:
+
+```python
+import requests
+import json
+
+url = "http://0.0.0.0:8189/taskflow/uie"
+headers = {"Content-Type": "application/json"}
+texts = ["城市内交通费7月5日金额114广州至佛山", "5月9日交通费29元从北苑到望京搜后"]
+data = {
+    "data": {
+        "text": texts,
+    },
+    "parameters": {
+        "schema": []  # 自定义schema
+    }
+}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+datas = json.loads(r.text)
+print(datas)
+```
+
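+如果需要在业务代码中反复调用服务,也可以把上述请求封装成一个简单的函数(以下仅为示意,函数名 `query_uie` 为自拟,默认地址与上文启动示例保持一致):
+
+```python
+import json
+
+import requests
+
+
+def query_uie(texts, schema=None, url="http://0.0.0.0:8189/taskflow/uie", timeout=60):
+    """向 SimpleServer 的 Taskflow 服务发送抽取请求;schema 不为 None 时随请求切换抽取目标。"""
+    payload = {"data": {"text": texts}}
+    if schema is not None:
+        payload["parameters"] = {"schema": schema}
+    r = requests.post(url, headers={"Content-Type": "application/json"},
+                      data=json.dumps(payload), timeout=timeout)
+    r.raise_for_status()  # 服务端返回异常状态码时直接抛错,便于排查
+    return r.json()
+
+
+print(query_uie(["城市内交通费7月5日金额114广州至佛山"], schema=["时间", "费用"]))
+```
+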
+## 预训练模型部署
+PaddleNLP SimpleServing 除了支持 Taskflow 的服务化部署,也支持预训练模型的部署:通过简单的配置即可加载预训练模型进行服务化,同时在接口层面支持服务化的扩展,满足模型前后处理的定制化需求。
+
+### server 搭建
+
+下面是预训练模型服务搭建的简易代码:
+```python
+from paddlenlp import SimpleServer
+from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler
+
+app = SimpleServer()
+app.register('cls_multi_class',
+             model_path="./export",
+             tokenizer_name='ernie-3.0-medium-zh',
+             model_handler=CustomModelHandler,
+             post_handler=MultiClassificationPostHandler)
+```
+
+这里主要是使用 `SimpleServer` 服务类来注册 Transformers Server,下面我们具体介绍一下 `register` 相关参数:
+
+```text
+def register(task_name,
+             model_path,
+             tokenizer_name,
+             model_handler,
+             post_handler,
+             precision='fp32',
+             device_id=0)
+task_name(str):
+    服务化的名称,最终的服务化的URL: https://host:port/{task_name}
+model_path(str):
+    需要部署的模型路径,这里的路径必须是动转静后的模型路径
+model_handler(paddlenlp.server.BaseModelHandler):
+    模型前置处理以及模型预测的Handler类别名字,这里可以继承 BaseModelHandler 自定义处理逻辑
+post_handler(paddlenlp.server.BasePostHandler):
+    模型后置处理的Handler类别名字,这里可以继承 BasePostHandler 自定义处理逻辑
+precision(str):
+    模型的预测精度,默认为fp32;可选fp16,fp16的支持需要以下条件 1) **硬件**: V100、T4、A10、A100/GA100、Jetson AGX Xavier、2080、3080 等显卡 2)**CUDA环境**:确保 CUDA >= 11.2,cuDNN >= 8.1.1 3) **安装依赖**:安装 onnx、onnxruntime-gpu
+device_id(int, list(int)):
+    GPU设备,device_id默认为0;如果有多张显卡,可以设置成list,例如[0, 1]即可支持多卡服务化;CPU设备无需设置。
+```
+- BaseModelHandler继承类:主要是 `CustomModelHandler`,该类的实现可以参考[链接](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/server/handlers/custom_model_handler.py),绝大多数语义理解模型均可使用该继承类
+- BasePostHandler继承类:主要是文本分类的 `MultiClassificationPostHandler`、`MultiLabelClassificationPostHandler`,分别支持多分类、多标签分类,实现代码可以参考[链接](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/server/handlers/cls_post_handler.py);`TokenClsModelHandler` 支持序列标注任务,实现代码可以参考[链接](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/server/handlers/token_model_handler.py)
+
+### 启动服务化
+执行以下命令即可启动服务:
+```
+paddlenlp server server:app --host 0.0.0.0 --port 8189 --workers 1
+```
+服务化整体参数配置如下:
+```text
+--host: 启动服务化的IP地址,通常可以设置成 0.0.0.0
+--port:启动服务化的网络端口,注意不要与已被占用的端口冲突
+--workers: 接收服务化的进程数,默认为1
+--log_level:服务化输出日志的级别,默认为 info 级别
+--limit_concurrency:服务化能接受的并发数目,默认为None,即不做限制
+--timeout_keep_alive:保持服务化连接的时间,默认为15s
+--app_dir:服务化本地的路径,默认为服务化启动的位置
+--reload: 当 app_dir的服务化相关配置和代码发生变化时,是否重启server,默认为False
+```
+
+### 多卡服务化(可选)
+如果机器上有多张卡,只需设置 device_id 即可实现多卡服务化,充分利用机器设备。下面是具体的使用例子:
+```python
+from paddlenlp import SimpleServer
+from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler
+
+app = SimpleServer()
+app.register('models/cls_multi_class',
+             model_path="../../export",
+             tokenizer_name='ernie-3.0-medium-zh',
+             model_handler=CustomModelHandler,
+             post_handler=MultiClassificationPostHandler,
+             device_id=[0, 1])  # device_id 为 0、1 两张卡
+```
+### client 发送
+```python
+import requests
+import json
+
+url = "http://0.0.0.0:8189/models/cls_multi_class"  # 对应 register 时的 task_name
+headers = {"Content-Type": "application/json"}
+texts = [
+    '黑苦荞茶的功效与作用及食用方法', '交界痣会凸起吗', '检查是否能怀孕挂什么科', '鱼油怎么吃咬破吃还是直接咽下去',
+    '幼儿挑食的生理原因是'
+]
+data = {
+    'data': {
+        'text': texts,
+    },
+    'parameters': {
+        'max_seq_len': 128,
+        'batch_size': 2
+    }
+}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+result_json = json.loads(r.text)
+print(result_json)
+```
+在 client 发送请求的过程中,可以传入一些参数来控制服务化的处理逻辑,例如上面的 `max_seq_len` 和 `batch_size` 分别控制服务化处理时的序列长度和 batch size。
+
+## 参考示例
+- [UIE 服务化部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie/deploy/serving/simple_serving)
+- 
[文本分类服务化部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/deploy/simple_serving) +- [预训练模型定制化post_handler](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-health/cblue/deploy/serving/simple_serving) diff --git a/docs/source/modules.rst b/docs/source/modules.rst new file mode 100644 index 0000000000000000000000000000000000000000..e470b1c22bcc177a73949cdf6fe379992aea679b --- /dev/null +++ b/docs/source/modules.rst @@ -0,0 +1,7 @@ +paddlenlp +========= + +.. toctree:: + :maxdepth: 4 + + paddlenlp diff --git a/docs/source/paddlenlp.data.collate.rst b/docs/source/paddlenlp.data.collate.rst new file mode 100644 index 0000000000000000000000000000000000000000..a0b8e9058a98198c0d069fbae7990a09294f201c --- /dev/null +++ b/docs/source/paddlenlp.data.collate.rst @@ -0,0 +1,8 @@ +collate +============================= + +.. automodule:: paddlenlp.data.collate + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.data.data_collator.rst b/docs/source/paddlenlp.data.data_collator.rst new file mode 100644 index 0000000000000000000000000000000000000000..1543cdd6eb86c88e170ff3644a1359719bb4468a --- /dev/null +++ b/docs/source/paddlenlp.data.data_collator.rst @@ -0,0 +1,7 @@ +data\_collator +==================================== + +.. automodule:: paddlenlp.data.data_collator + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.data.rst b/docs/source/paddlenlp.data.rst new file mode 100644 index 0000000000000000000000000000000000000000..608f8eff45258c49fa16fba168f401278bc97190 --- /dev/null +++ b/docs/source/paddlenlp.data.rst @@ -0,0 +1,18 @@ +paddlenlp.data +====================== + +.. automodule:: paddlenlp.data + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.data.collate + paddlenlp.data.data_collator + paddlenlp.data.iterator + paddlenlp.data.sampler + paddlenlp.data.tokenizer + paddlenlp.data.vocab diff --git a/docs/source/paddlenlp.data.sampler.rst b/docs/source/paddlenlp.data.sampler.rst new file mode 100644 index 0000000000000000000000000000000000000000..97ef992ab92ec5d705a1c2681e6f2115328fe645 --- /dev/null +++ b/docs/source/paddlenlp.data.sampler.rst @@ -0,0 +1,8 @@ +sampler +============================= + +.. automodule:: paddlenlp.data.sampler + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.data.tokenizer.rst b/docs/source/paddlenlp.data.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..80848dcd8372590944401b1909fc648237209bae --- /dev/null +++ b/docs/source/paddlenlp.data.tokenizer.rst @@ -0,0 +1,8 @@ +tokenizer +=============================== + +.. automodule:: paddlenlp.data.tokenizer + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.data.vocab.rst b/docs/source/paddlenlp.data.vocab.rst new file mode 100644 index 0000000000000000000000000000000000000000..15c436f6f914246270fc63f38c69ed91a71e5832 --- /dev/null +++ b/docs/source/paddlenlp.data.vocab.rst @@ -0,0 +1,8 @@ +vocab +=========================== + +.. 
automodule:: paddlenlp.data.vocab + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.dataaug.base_augment.rst b/docs/source/paddlenlp.dataaug.base_augment.rst new file mode 100644 index 0000000000000000000000000000000000000000..be43595b87cbd4114f147788dd13fd5b919fed93 --- /dev/null +++ b/docs/source/paddlenlp.dataaug.base_augment.rst @@ -0,0 +1,7 @@ +base\_augment +====================================== + +.. automodule:: paddlenlp.dataaug.base_augment + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.rst b/docs/source/paddlenlp.dataaug.rst new file mode 100644 index 0000000000000000000000000000000000000000..3df52b5ab89c7f328cddcc605a62e4ed02e9024b --- /dev/null +++ b/docs/source/paddlenlp.dataaug.rst @@ -0,0 +1,17 @@ +paddlenlp.dataaug +========================= + +.. automodule:: paddlenlp.dataaug + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.dataaug.base_augment + paddlenlp.dataaug.word_delete + paddlenlp.dataaug.word_insert + paddlenlp.dataaug.word_substitute + paddlenlp.dataaug.word_swap diff --git a/docs/source/paddlenlp.dataaug.word_delete.rst b/docs/source/paddlenlp.dataaug.word_delete.rst new file mode 100644 index 0000000000000000000000000000000000000000..36f4fd46bb830426437a397242a3297fba04e50c --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_delete.rst @@ -0,0 +1,7 @@ +word\_delete +===================================== + +.. automodule:: paddlenlp.dataaug.word_delete + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.word_insert.rst b/docs/source/paddlenlp.dataaug.word_insert.rst new file mode 100644 index 0000000000000000000000000000000000000000..90aea8210552ca06b00b720e2fd8386019e2a5bf --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_insert.rst @@ -0,0 +1,7 @@ +word\_insert +===================================== + +.. automodule:: paddlenlp.dataaug.word_insert + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.word_substitute.rst b/docs/source/paddlenlp.dataaug.word_substitute.rst new file mode 100644 index 0000000000000000000000000000000000000000..dd343af4cc477a2bfe9c2f77862dbc02f479b49e --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_substitute.rst @@ -0,0 +1,7 @@ +word\_substitute +========================================= + +.. automodule:: paddlenlp.dataaug.word_substitute + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.word_swap.rst b/docs/source/paddlenlp.dataaug.word_swap.rst new file mode 100644 index 0000000000000000000000000000000000000000..bedc8d99621e19e5d10cb4d266773d2948b5eed0 --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_swap.rst @@ -0,0 +1,7 @@ +word\_swap +=================================== + +.. automodule:: paddlenlp.dataaug.word_swap + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.datasets.dataset.rst b/docs/source/paddlenlp.datasets.dataset.rst new file mode 100644 index 0000000000000000000000000000000000000000..5faa5b46234761bcca1f2a350dbf1764c39836d5 --- /dev/null +++ b/docs/source/paddlenlp.datasets.dataset.rst @@ -0,0 +1,6 @@ +dataset +================================= + +.. 
automodule:: paddlenlp.datasets.dataset + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.datasets.rst b/docs/source/paddlenlp.datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..6681f0fb142ff7a3579f66e1e5793318fada5851 --- /dev/null +++ b/docs/source/paddlenlp.datasets.rst @@ -0,0 +1,17 @@ +paddlenlp.datasets +========================== + +.. automodule:: paddlenlp.datasets + :members: + :no-undoc-members: + + +.. toctree:: + :maxdepth: 4 + + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.datasets.dataset diff --git a/docs/source/paddlenlp.embeddings.rst b/docs/source/paddlenlp.embeddings.rst new file mode 100644 index 0000000000000000000000000000000000000000..1c85548c46d4429e514154983334fb11683dd41e --- /dev/null +++ b/docs/source/paddlenlp.embeddings.rst @@ -0,0 +1,14 @@ +paddlenlp.embeddings +============================ + +.. automodule:: paddlenlp.embeddings + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.embeddings.constant + paddlenlp.embeddings.token_embedding diff --git a/docs/source/paddlenlp.embeddings.token_embedding.rst b/docs/source/paddlenlp.embeddings.token_embedding.rst new file mode 100644 index 0000000000000000000000000000000000000000..eabb899600e01d9014b3b69dc65d18d95081c83a --- /dev/null +++ b/docs/source/paddlenlp.embeddings.token_embedding.rst @@ -0,0 +1,7 @@ +token\_embedding +============================================ + +.. automodule:: paddlenlp.embeddings.token_embedding + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.experimental.ernie_model.rst b/docs/source/paddlenlp.experimental.ernie_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..607fd5740a1854cc768e748aff6fe52f32d96de3 --- /dev/null +++ b/docs/source/paddlenlp.experimental.ernie_model.rst @@ -0,0 +1,6 @@ +ernie\_model +========================================== + +.. automodule:: paddlenlp.experimental.ernie_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.experimental.faster_tokenizer.rst b/docs/source/paddlenlp.experimental.faster_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a53da32782808505dea088fa738acebce85550fb --- /dev/null +++ b/docs/source/paddlenlp.experimental.faster_tokenizer.rst @@ -0,0 +1,7 @@ +faster\_tokenizer +=============================================== + +.. automodule:: paddlenlp.experimental.faster_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.experimental.model_utils.rst b/docs/source/paddlenlp.experimental.model_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..07715f9552b4a627978089f61e0311e08d3747b7 --- /dev/null +++ b/docs/source/paddlenlp.experimental.model_utils.rst @@ -0,0 +1,6 @@ +model\_utils +========================================== + +.. automodule:: paddlenlp.experimental.model_utils + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.experimental.rst b/docs/source/paddlenlp.experimental.rst new file mode 100644 index 0000000000000000000000000000000000000000..7018ced0af7bae9ef9ed22fa8fd769ad5c9e5895 --- /dev/null +++ b/docs/source/paddlenlp.experimental.rst @@ -0,0 +1,15 @@ +paddlenlp.experimental +============================== + +.. automodule:: paddlenlp.experimental + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.experimental.ernie_model + paddlenlp.experimental.faster_tokenizer + paddlenlp.experimental.model_utils diff --git a/docs/source/paddlenlp.layers.crf.rst b/docs/source/paddlenlp.layers.crf.rst new file mode 100644 index 0000000000000000000000000000000000000000..ed340a8fea0c51a0b389038d5bd7da882b5ba727 --- /dev/null +++ b/docs/source/paddlenlp.layers.crf.rst @@ -0,0 +1,6 @@ +crf +=========================== + +.. automodule:: paddlenlp.layers.crf + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.layers.rst b/docs/source/paddlenlp.layers.rst new file mode 100644 index 0000000000000000000000000000000000000000..c41c7e41962eb8b5eef8fe6b15028b052d08c74e --- /dev/null +++ b/docs/source/paddlenlp.layers.rst @@ -0,0 +1,15 @@ +paddlenlp.layers +======================== + +.. automodule:: paddlenlp.layers + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.layers.crf + paddlenlp.layers.sequence + paddlenlp.layers.tcn diff --git a/docs/source/paddlenlp.layers.sequence.rst b/docs/source/paddlenlp.layers.sequence.rst new file mode 100644 index 0000000000000000000000000000000000000000..945540340c9cce8c94ea04891c3a2091511011f0 --- /dev/null +++ b/docs/source/paddlenlp.layers.sequence.rst @@ -0,0 +1,7 @@ +sequence +================================ + +.. automodule:: paddlenlp.layers.sequence + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.layers.tcn.rst b/docs/source/paddlenlp.layers.tcn.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7a6784837d1eb7511ddec480aab38b04f8a11be --- /dev/null +++ b/docs/source/paddlenlp.layers.tcn.rst @@ -0,0 +1,6 @@ +tcn +=========================== + +.. automodule:: paddlenlp.layers.tcn + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.losses.rdrop.rst b/docs/source/paddlenlp.losses.rdrop.rst new file mode 100644 index 0000000000000000000000000000000000000000..39ccfd28fb728facb93249991451817134b5c192 --- /dev/null +++ b/docs/source/paddlenlp.losses.rdrop.rst @@ -0,0 +1,6 @@ +rdrop +============================= + +.. automodule:: paddlenlp.losses.rdrop + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.losses.rst b/docs/source/paddlenlp.losses.rst new file mode 100644 index 0000000000000000000000000000000000000000..9c7196336aee4f317dfa931f793b9ceb21fe675a --- /dev/null +++ b/docs/source/paddlenlp.losses.rst @@ -0,0 +1,13 @@ +paddlenlp.losses +======================== + +.. automodule:: paddlenlp.losses + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.losses.rdrop diff --git a/docs/source/paddlenlp.metrics.bleu.rst b/docs/source/paddlenlp.metrics.bleu.rst new file mode 100644 index 0000000000000000000000000000000000000000..610ddf2ccd792425116e4d0db4104e991fee46b3 --- /dev/null +++ b/docs/source/paddlenlp.metrics.bleu.rst @@ -0,0 +1,7 @@ +bleu +============================= + +.. automodule:: paddlenlp.metrics.bleu + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.chunk.rst b/docs/source/paddlenlp.metrics.chunk.rst new file mode 100644 index 0000000000000000000000000000000000000000..b496e40bfced04a440c0fe6c5c91a1f850953380 --- /dev/null +++ b/docs/source/paddlenlp.metrics.chunk.rst @@ -0,0 +1,7 @@ +chunk +============================== + +.. 
automodule:: paddlenlp.metrics.chunk + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.distinct.rst b/docs/source/paddlenlp.metrics.distinct.rst new file mode 100644 index 0000000000000000000000000000000000000000..135b5b0cbdf5db495dd7475e35d9d136fd4fe852 --- /dev/null +++ b/docs/source/paddlenlp.metrics.distinct.rst @@ -0,0 +1,7 @@ +distinct +================================= + +.. automodule:: paddlenlp.metrics.distinct + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.dureader.rst b/docs/source/paddlenlp.metrics.dureader.rst new file mode 100644 index 0000000000000000000000000000000000000000..deeb9372db3f434f4d6a71e86b08c8b6980e9993 --- /dev/null +++ b/docs/source/paddlenlp.metrics.dureader.rst @@ -0,0 +1,7 @@ +dureader +================================= + +.. automodule:: paddlenlp.metrics.dureader + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.glue.rst b/docs/source/paddlenlp.metrics.glue.rst new file mode 100644 index 0000000000000000000000000000000000000000..55c3a972f7a85de74d9981aa031b615af0938954 --- /dev/null +++ b/docs/source/paddlenlp.metrics.glue.rst @@ -0,0 +1,7 @@ +glue +============================= + +.. automodule:: paddlenlp.metrics.glue + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.perplexity.rst b/docs/source/paddlenlp.metrics.perplexity.rst new file mode 100644 index 0000000000000000000000000000000000000000..5a0ebed6b3348fe1db7fedfc2d8c356f8e722514 --- /dev/null +++ b/docs/source/paddlenlp.metrics.perplexity.rst @@ -0,0 +1,7 @@ +perplexity +=================================== + +.. automodule:: paddlenlp.metrics.perplexity + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.rouge.rst b/docs/source/paddlenlp.metrics.rouge.rst new file mode 100644 index 0000000000000000000000000000000000000000..5844354ca18f1b3d63ade339835d40be473e3627 --- /dev/null +++ b/docs/source/paddlenlp.metrics.rouge.rst @@ -0,0 +1,7 @@ +rouge +============================== + +.. automodule:: paddlenlp.metrics.rouge + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.rst b/docs/source/paddlenlp.metrics.rst new file mode 100644 index 0000000000000000000000000000000000000000..995355b941a021163c8e70f7c27fba90a32257d3 --- /dev/null +++ b/docs/source/paddlenlp.metrics.rst @@ -0,0 +1,23 @@ +paddlenlp.metrics +========================= + +.. automodule:: paddlenlp.metrics + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.metrics.bleu + paddlenlp.metrics.chunk + paddlenlp.metrics.distinct + paddlenlp.metrics.dureader + paddlenlp.metrics.glue + paddlenlp.metrics.perplexity + paddlenlp.metrics.rouge + paddlenlp.metrics.sighan + paddlenlp.metrics.span + paddlenlp.metrics.squad + paddlenlp.metrics.utils diff --git a/docs/source/paddlenlp.metrics.sighan.rst b/docs/source/paddlenlp.metrics.sighan.rst new file mode 100644 index 0000000000000000000000000000000000000000..058c86387d138d9499b4861d90f4e276fa2599f0 --- /dev/null +++ b/docs/source/paddlenlp.metrics.sighan.rst @@ -0,0 +1,7 @@ +sighan +=============================== + +.. 
automodule:: paddlenlp.metrics.sighan + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.span.rst b/docs/source/paddlenlp.metrics.span.rst new file mode 100644 index 0000000000000000000000000000000000000000..5476262a60fd98ed8cd558c07d15b99ad7a87713 --- /dev/null +++ b/docs/source/paddlenlp.metrics.span.rst @@ -0,0 +1,7 @@ +span +============================= + +.. automodule:: paddlenlp.metrics.span + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.squad.rst b/docs/source/paddlenlp.metrics.squad.rst new file mode 100644 index 0000000000000000000000000000000000000000..ef2ffabf35c46339537793fd3e6861b4578e51bc --- /dev/null +++ b/docs/source/paddlenlp.metrics.squad.rst @@ -0,0 +1,7 @@ +squad +============================== + +.. automodule:: paddlenlp.metrics.squad + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.utils.rst b/docs/source/paddlenlp.metrics.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff1640ba062c358067661302ac5154b7fe190534 --- /dev/null +++ b/docs/source/paddlenlp.metrics.utils.rst @@ -0,0 +1,7 @@ +utils +============================== + +.. automodule:: paddlenlp.metrics.utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.distributed.parallel.rst b/docs/source/paddlenlp.ops.distributed.parallel.rst new file mode 100644 index 0000000000000000000000000000000000000000..83c36944bdefc2347e751cc748299e3ebffbf33b --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.parallel.rst @@ -0,0 +1,6 @@ +parallel +========================================= + +.. automodule:: paddlenlp.ops.distributed.parallel + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.distributed.rst b/docs/source/paddlenlp.ops.distributed.rst new file mode 100644 index 0000000000000000000000000000000000000000..733f114bf4ea0350efad768102bec87071dad985 --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.rst @@ -0,0 +1,18 @@ +distributed +================================= + +.. automodule:: paddlenlp.ops.distributed + :members: + :no-undoc-members: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed.utils + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed.parallel diff --git a/docs/source/paddlenlp.ops.distributed.utils.random.rst b/docs/source/paddlenlp.ops.distributed.utils.random.rst new file mode 100644 index 0000000000000000000000000000000000000000..27370ba9f718241cbadc822ea854253c90d2b38f --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.utils.random.rst @@ -0,0 +1,6 @@ +random +============================================= + +.. automodule:: paddlenlp.ops.distributed.utils.random + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.distributed.utils.rst b/docs/source/paddlenlp.ops.distributed.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7e1c223995d18b4374b7276644f2d6a1b1fa629 --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.utils.rst @@ -0,0 +1,13 @@ +utils +======================================= + +.. automodule:: paddlenlp.ops.distributed.utils + :members: + :no-undoc-members: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed.utils.random + paddlenlp.ops.distributed.utils.topo diff --git a/docs/source/paddlenlp.ops.distributed.utils.topo.rst b/docs/source/paddlenlp.ops.distributed.utils.topo.rst new file mode 100644 index 0000000000000000000000000000000000000000..6cd8a7aa2ae43ce0f0a2a6ec948980e6b49825cb --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.utils.topo.rst @@ -0,0 +1,6 @@ +topo +=========================================== + +.. automodule:: paddlenlp.ops.distributed.utils.topo + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.einsum.rst b/docs/source/paddlenlp.ops.einsum.rst new file mode 100644 index 0000000000000000000000000000000000000000..c5907b8e135f1fe3baa68a442f41cc6288a0177a --- /dev/null +++ b/docs/source/paddlenlp.ops.einsum.rst @@ -0,0 +1,7 @@ +einsum +=========================== + +.. automodule:: paddlenlp.ops.einsum + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.ext_utils.rst b/docs/source/paddlenlp.ops.ext_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..a3750293b1bb6463fa38a8226dbe7dd9628942ac --- /dev/null +++ b/docs/source/paddlenlp.ops.ext_utils.rst @@ -0,0 +1,7 @@ +ext\_utils +=============================== + +.. automodule:: paddlenlp.ops.ext_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.fast_transformer.rst b/docs/source/paddlenlp.ops.fast_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..02a51ecdcb01a776d80a2345cabaa1ee5f451e8d --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.rst @@ -0,0 +1,13 @@ +fast\_transformer +========================================= + +.. automodule:: paddlenlp.ops.fast_transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.fast_transformer.transformer diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.decoder.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoder.rst new file mode 100644 index 0000000000000000000000000000000000000000..50a3a23fa2386aa90c74b34bf84cc39951310cfc --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoder.rst @@ -0,0 +1,6 @@ +decoder +============================================================ + +.. automodule:: paddlenlp.ops.fast_transformer.transformer.decoder + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.decoding.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoding.rst new file mode 100644 index 0000000000000000000000000000000000000000..e8ed3550c06f9062025b39f0d4f469c0c77b423f --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoding.rst @@ -0,0 +1,6 @@ +decoding +============================================================= + +.. automodule:: paddlenlp.ops.fast_transformer.transformer.decoding + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.encoder.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.encoder.rst new file mode 100644 index 0000000000000000000000000000000000000000..5ca4d481f64b083f3797048beee72e3003d46435 --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.encoder.rst @@ -0,0 +1,7 @@ +encoder +============================================================ + +.. 
automodule:: paddlenlp.ops.fast_transformer.transformer.encoder + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d57dd44eb85b96479d03a02153b6823227e21bf6 --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst @@ -0,0 +1,7 @@ +fast\_transformer +======================================================================== + +.. automodule:: paddlenlp.ops.fast_transformer.transformer.fast_transformer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..4fade55a679ee2c72a71430e377f4e91102a6a00 --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.rst @@ -0,0 +1,16 @@ +transformer +===================================================== + +.. automodule:: paddlenlp.ops.fast_transformer.transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.fast_transformer.transformer.decoder + paddlenlp.ops.fast_transformer.transformer.decoding + paddlenlp.ops.fast_transformer.transformer.encoder + paddlenlp.ops.fast_transformer.transformer.fast_transformer diff --git a/docs/source/paddlenlp.ops.optimizer.AdamwOptimizer.rst b/docs/source/paddlenlp.ops.optimizer.AdamwOptimizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..88be3e6ddc10b42c96f532bc2e8c553670f33a1c --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.AdamwOptimizer.rst @@ -0,0 +1,6 @@ +AdamwOptimizer +============================================= + +.. automodule:: paddlenlp.ops.optimizer.AdamwOptimizer + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.optimizer.adamw.rst b/docs/source/paddlenlp.ops.optimizer.adamw.rst new file mode 100644 index 0000000000000000000000000000000000000000..5aab38563dcd740b4d4540694b35083dfc94b8fa --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.adamw.rst @@ -0,0 +1,7 @@ +adamw +==================================== + +.. automodule:: paddlenlp.ops.optimizer.adamw + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.optimizer.adamwdl.rst b/docs/source/paddlenlp.ops.optimizer.adamwdl.rst new file mode 100644 index 0000000000000000000000000000000000000000..78be7f88818fbfccb0db82d413b219b774a26df5 --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.adamwdl.rst @@ -0,0 +1,7 @@ +adamwdl +====================================== + +.. automodule:: paddlenlp.ops.optimizer.adamwdl + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.optimizer.ema.rst b/docs/source/paddlenlp.ops.optimizer.ema.rst new file mode 100644 index 0000000000000000000000000000000000000000..b02eb7eb6b073133a6550282406e98709e7098d2 --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.ema.rst @@ -0,0 +1,7 @@ +ema +================================== + +.. 
automodule:: paddlenlp.ops.optimizer.ema + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.optimizer.rst b/docs/source/paddlenlp.ops.optimizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..df220f9fd7f89552e8d6f277a01a60cb1ffc9544 --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.rst @@ -0,0 +1,14 @@ +optimizer +=============================== + +.. automodule:: paddlenlp.ops.optimizer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.optimizer.adamwdl + paddlenlp.ops.optimizer.ema diff --git a/docs/source/paddlenlp.ops.rst b/docs/source/paddlenlp.ops.rst new file mode 100644 index 0000000000000000000000000000000000000000..831aeb95411e6cd689611cea4b190f233613f335 --- /dev/null +++ b/docs/source/paddlenlp.ops.rst @@ -0,0 +1,22 @@ +paddlenlp.ops +===================== + +.. automodule:: paddlenlp.ops + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed + paddlenlp.ops.fast_transformer + paddlenlp.ops.optimizer + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.einsum + paddlenlp.ops.ext_utils diff --git a/docs/source/paddlenlp.ops.strings.rst b/docs/source/paddlenlp.ops.strings.rst new file mode 100644 index 0000000000000000000000000000000000000000..8ccb7d3a66b3f9e5a1dec024391618ae1c8a4b0c --- /dev/null +++ b/docs/source/paddlenlp.ops.strings.rst @@ -0,0 +1,7 @@ +strings +============================ + +.. automodule:: paddlenlp.ops.strings + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.transformer.decoding.rst b/docs/source/paddlenlp.ops.transformer.decoding.rst new file mode 100644 index 0000000000000000000000000000000000000000..e3966647cb3400017ef55c0863d5878db6eeae08 --- /dev/null +++ b/docs/source/paddlenlp.ops.transformer.decoding.rst @@ -0,0 +1,6 @@ +decoding +========================================= + +.. automodule:: paddlenlp.ops.transformer.decoding + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.transformer.fast_transformer.rst b/docs/source/paddlenlp.ops.transformer.fast_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..2718eb8f6fc9f4bd1d4af9b182aac42c565d8cb0 --- /dev/null +++ b/docs/source/paddlenlp.ops.transformer.fast_transformer.rst @@ -0,0 +1,7 @@ +fast\_transformer +==================================================== + +.. automodule:: paddlenlp.ops.transformer.fast_transformer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.transformer.rst b/docs/source/paddlenlp.ops.transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ecab713ecfa8b9622c07eb9b4bd88125e7c8b1ee --- /dev/null +++ b/docs/source/paddlenlp.ops.transformer.rst @@ -0,0 +1,14 @@ +transformer +================================= + +.. automodule:: paddlenlp.ops.transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.transformer.decoding + paddlenlp.ops.transformer.fast_transformer diff --git a/docs/source/paddlenlp.rst b/docs/source/paddlenlp.rst new file mode 100644 index 0000000000000000000000000000000000000000..fbc7672055fad96c4a03f9b736938eb0a7f0d8f9 --- /dev/null +++ b/docs/source/paddlenlp.rst @@ -0,0 +1,26 @@ +paddlenlp +================= + +.. automodule:: paddlenlp + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.data + paddlenlp.dataaug + paddlenlp.datasets + paddlenlp.embeddings + paddlenlp.experimental + paddlenlp.layers + paddlenlp.losses + paddlenlp.metrics + paddlenlp.ops + paddlenlp.seq2vec + paddlenlp.taskflow + paddlenlp.trainer + paddlenlp.transformers + paddlenlp.utils diff --git a/docs/source/paddlenlp.seq2vec.encoder.rst b/docs/source/paddlenlp.seq2vec.encoder.rst new file mode 100644 index 0000000000000000000000000000000000000000..e2a42957efa67d603f7451766bfe9feb4ce23b25 --- /dev/null +++ b/docs/source/paddlenlp.seq2vec.encoder.rst @@ -0,0 +1,7 @@ +encoder +================================ + +.. automodule:: paddlenlp.seq2vec.encoder + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.seq2vec.rst b/docs/source/paddlenlp.seq2vec.rst new file mode 100644 index 0000000000000000000000000000000000000000..48497e5874f9b7417d9f0d97906d4a0bdd51bf1f --- /dev/null +++ b/docs/source/paddlenlp.seq2vec.rst @@ -0,0 +1,13 @@ +paddlenlp.seq2vec +========================= + +.. automodule:: paddlenlp.seq2vec + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.seq2vec.encoder diff --git a/docs/source/paddlenlp.taskflow.code_generation.rst b/docs/source/paddlenlp.taskflow.code_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..7fbd7ab20bb097e8270aeda1e309c4c8389c0354 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.code_generation.rst @@ -0,0 +1,7 @@ +code\_generation +========================================== + +.. automodule:: paddlenlp.taskflow.code_generation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.dependency_parsing.rst b/docs/source/paddlenlp.taskflow.dependency_parsing.rst new file mode 100644 index 0000000000000000000000000000000000000000..a2c7ab93f4dace23c64f997bdad5c90a77cba4e2 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.dependency_parsing.rst @@ -0,0 +1,7 @@ +dependency\_parsing +============================================= + +.. automodule:: paddlenlp.taskflow.dependency_parsing + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.dialogue.rst b/docs/source/paddlenlp.taskflow.dialogue.rst new file mode 100644 index 0000000000000000000000000000000000000000..dbba5586323d9b3d7381ccfa69ab4d1d4468f907 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.dialogue.rst @@ -0,0 +1,7 @@ +dialogue +================================== + +.. automodule:: paddlenlp.taskflow.dialogue + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.information_extraction.rst b/docs/source/paddlenlp.taskflow.information_extraction.rst new file mode 100644 index 0000000000000000000000000000000000000000..bc5cf227ea099b8119a46c35942e7a32220a7afc --- /dev/null +++ b/docs/source/paddlenlp.taskflow.information_extraction.rst @@ -0,0 +1,7 @@ +information\_extraction +================================================= + +.. automodule:: paddlenlp.taskflow.information_extraction + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.knowledge_mining.rst b/docs/source/paddlenlp.taskflow.knowledge_mining.rst new file mode 100644 index 0000000000000000000000000000000000000000..2bc62bf7b6ac1ed60293844f594641010d3da3bb --- /dev/null +++ b/docs/source/paddlenlp.taskflow.knowledge_mining.rst @@ -0,0 +1,7 @@ +knowledge\_mining +=========================================== + +.. 
automodule:: paddlenlp.taskflow.knowledge_mining + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.lexical_analysis.rst b/docs/source/paddlenlp.taskflow.lexical_analysis.rst new file mode 100644 index 0000000000000000000000000000000000000000..0916c249d95ab0a0dd1cdac9acbc5a2dc00df078 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.lexical_analysis.rst @@ -0,0 +1,7 @@ +lexical\_analysis +=========================================== + +.. automodule:: paddlenlp.taskflow.lexical_analysis + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.model.rst b/docs/source/paddlenlp.taskflow.model.rst new file mode 100644 index 0000000000000000000000000000000000000000..a09842dd516df4068ee47b715ac1ff3e1c4496cf --- /dev/null +++ b/docs/source/paddlenlp.taskflow.model.rst @@ -0,0 +1,6 @@ +model +=============================== + +.. automodule:: paddlenlp.taskflow.model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.dependency_parsing_model.rst b/docs/source/paddlenlp.taskflow.models.dependency_parsing_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..ef0bc9010245e07957c0b9fadf2680fa1f28d987 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.dependency_parsing_model.rst @@ -0,0 +1,6 @@ +dependency\_parsing\_model +=========================================================== + +.. automodule:: paddlenlp.taskflow.models.dependency_parsing_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.information_extraction_model.rst b/docs/source/paddlenlp.taskflow.models.information_extraction_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..78407871c9d6ca762619e997d10c3f4db573176e --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.information_extraction_model.rst @@ -0,0 +1,6 @@ +information\_extraction\_model +=============================================================== + +.. automodule:: paddlenlp.taskflow.models.information_extraction_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.lexical_analysis_model.rst b/docs/source/paddlenlp.taskflow.models.lexical_analysis_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..db0eadd0273f68427b8aa3481ab1655cba74822f --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.lexical_analysis_model.rst @@ -0,0 +1,6 @@ +lexical\_analysis\_model +========================================================= + +.. automodule:: paddlenlp.taskflow.models.lexical_analysis_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.rst b/docs/source/paddlenlp.taskflow.models.rst new file mode 100644 index 0000000000000000000000000000000000000000..f5966db3a44c0b68affaca7040714bdd1e138efa --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.rst @@ -0,0 +1,16 @@ +models +================================= + +.. automodule:: paddlenlp.taskflow.models + :members: + :no-undoc-members: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.taskflow.models.dependency_parsing_model + paddlenlp.taskflow.models.information_extraction_model + paddlenlp.taskflow.models.lexical_analysis_model + paddlenlp.taskflow.models.sentiment_analysis_model + paddlenlp.taskflow.models.text_correction_model diff --git a/docs/source/paddlenlp.taskflow.models.sentiment_analysis_model.rst b/docs/source/paddlenlp.taskflow.models.sentiment_analysis_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..df299b6af547c3601c06b5ea84cc358929321137 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.sentiment_analysis_model.rst @@ -0,0 +1,6 @@ +sentiment\_analysis\_model +=========================================================== + +.. automodule:: paddlenlp.taskflow.models.sentiment_analysis_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.text_correction_model.rst b/docs/source/paddlenlp.taskflow.models.text_correction_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..4231184a34671107931328cae3d25cdeaa2cbb99 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.text_correction_model.rst @@ -0,0 +1,6 @@ +text\_correction\_model +======================================================== + +.. automodule:: paddlenlp.taskflow.models.text_correction_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.named_entity_recognition.rst b/docs/source/paddlenlp.taskflow.named_entity_recognition.rst new file mode 100644 index 0000000000000000000000000000000000000000..3183ae7930c7efd1d548afbf0cf9f4a9c3a1e50b --- /dev/null +++ b/docs/source/paddlenlp.taskflow.named_entity_recognition.rst @@ -0,0 +1,7 @@ +named\_entity\_recognition +==================================================== + +.. automodule:: paddlenlp.taskflow.named_entity_recognition + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.poetry_generation.rst b/docs/source/paddlenlp.taskflow.poetry_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..092e1838dae2273ce291b7c29ce4130f2fcb7b2e --- /dev/null +++ b/docs/source/paddlenlp.taskflow.poetry_generation.rst @@ -0,0 +1,7 @@ +poetry\_generation +============================================ + +.. automodule:: paddlenlp.taskflow.poetry_generation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.pos_tagging.rst b/docs/source/paddlenlp.taskflow.pos_tagging.rst new file mode 100644 index 0000000000000000000000000000000000000000..e8251cae44c35aca02ed79d9d4e1a276086c5747 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.pos_tagging.rst @@ -0,0 +1,7 @@ +pos\_tagging +====================================== + +.. automodule:: paddlenlp.taskflow.pos_tagging + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.question_answering.rst b/docs/source/paddlenlp.taskflow.question_answering.rst new file mode 100644 index 0000000000000000000000000000000000000000..3d9a34d72e0ec70590c9cd3222388ad09bb345fa --- /dev/null +++ b/docs/source/paddlenlp.taskflow.question_answering.rst @@ -0,0 +1,7 @@ +question\_answering +============================================= + +.. 
automodule:: paddlenlp.taskflow.question_answering + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.rst b/docs/source/paddlenlp.taskflow.rst new file mode 100644 index 0000000000000000000000000000000000000000..791033a8812b5fd2ee246a5cbba727a15c948c92 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.rst @@ -0,0 +1,37 @@ +paddlenlp.taskflow +========================== + +.. automodule:: paddlenlp.taskflow + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.taskflow.models + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.taskflow.code_generation + paddlenlp.taskflow.dependency_parsing + paddlenlp.taskflow.dialogue + paddlenlp.taskflow.information_extraction + paddlenlp.taskflow.knowledge_mining + paddlenlp.taskflow.lexical_analysis + paddlenlp.taskflow.named_entity_recognition + paddlenlp.taskflow.poetry_generation + paddlenlp.taskflow.pos_tagging + paddlenlp.taskflow.question_answering + paddlenlp.taskflow.sentiment_analysis + paddlenlp.taskflow.task + paddlenlp.taskflow.taskflow + paddlenlp.taskflow.text_to_image + paddlenlp.taskflow.text_correction + paddlenlp.taskflow.text_generation + paddlenlp.taskflow.text_similarity + paddlenlp.taskflow.utils + paddlenlp.taskflow.word_segmentation diff --git a/docs/source/paddlenlp.taskflow.sentiment_analysis.rst b/docs/source/paddlenlp.taskflow.sentiment_analysis.rst new file mode 100644 index 0000000000000000000000000000000000000000..4d539b970840759024e2a495be2fda2c119f07d6 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.sentiment_analysis.rst @@ -0,0 +1,7 @@ +sentiment\_analysis +============================================= + +.. automodule:: paddlenlp.taskflow.sentiment_analysis + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.task.rst b/docs/source/paddlenlp.taskflow.task.rst new file mode 100644 index 0000000000000000000000000000000000000000..64804791adf32ea03be498df693670479ca0b16f --- /dev/null +++ b/docs/source/paddlenlp.taskflow.task.rst @@ -0,0 +1,7 @@ +task +============================== + +.. automodule:: paddlenlp.taskflow.task + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.taskflow.rst b/docs/source/paddlenlp.taskflow.taskflow.rst new file mode 100644 index 0000000000000000000000000000000000000000..6244b2c54055fd666f13b140dfe89805d0829f81 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.taskflow.rst @@ -0,0 +1,7 @@ +taskflow +================================== + +.. automodule:: paddlenlp.taskflow.taskflow + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text2knowledge.rst b/docs/source/paddlenlp.taskflow.text2knowledge.rst new file mode 100644 index 0000000000000000000000000000000000000000..e855e1520dd9f530baa295068eb9f3d9679fbf1e --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text2knowledge.rst @@ -0,0 +1,7 @@ +text2knowledge +======================================== + +.. automodule:: paddlenlp.taskflow.text2knowledge + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_correction.rst b/docs/source/paddlenlp.taskflow.text_correction.rst new file mode 100644 index 0000000000000000000000000000000000000000..d3737ea366562065774da14617f7fabcd42edca1 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_correction.rst @@ -0,0 +1,7 @@ +text\_correction +========================================== + +.. 
automodule:: paddlenlp.taskflow.text_correction + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_generation.rst b/docs/source/paddlenlp.taskflow.text_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff69b87fbea13400ab2af3e7b236f871dadc8f9d --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_generation.rst @@ -0,0 +1,7 @@ +text\_generation +========================================== + +.. automodule:: paddlenlp.taskflow.text_generation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_similarity.rst b/docs/source/paddlenlp.taskflow.text_similarity.rst new file mode 100644 index 0000000000000000000000000000000000000000..0b3dde8dad74a5bed0682919e760382e13f98b90 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_similarity.rst @@ -0,0 +1,7 @@ +text\_similarity +========================================== + +.. automodule:: paddlenlp.taskflow.text_similarity + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_to_image.rst b/docs/source/paddlenlp.taskflow.text_to_image.rst new file mode 100644 index 0000000000000000000000000000000000000000..fa834e27471dc3bb17118bad26393df3f197aa9d --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_to_image.rst @@ -0,0 +1,7 @@ +text\_to\_image +================================================ + +.. automodule:: paddlenlp.taskflow.text_to_image + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.utils.rst b/docs/source/paddlenlp.taskflow.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..097a299a56d020d47a64d54565db59c04efee976 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.utils.rst @@ -0,0 +1,7 @@ +utils +=============================== + +.. automodule:: paddlenlp.taskflow.utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.word_segmentation.rst b/docs/source/paddlenlp.taskflow.word_segmentation.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7c6090f145e177cf1ee55a55573bf6b2a1da302 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.word_segmentation.rst @@ -0,0 +1,7 @@ +word\_segmentation +============================================ + +.. automodule:: paddlenlp.taskflow.word_segmentation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.argparser.rst b/docs/source/paddlenlp.trainer.argparser.rst new file mode 100644 index 0000000000000000000000000000000000000000..80c48c001e69ca104655324ec5fd52eb3c972cd5 --- /dev/null +++ b/docs/source/paddlenlp.trainer.argparser.rst @@ -0,0 +1,7 @@ +argparser +================================== + +.. automodule:: paddlenlp.trainer.argparser + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.integrations.rst b/docs/source/paddlenlp.trainer.integrations.rst new file mode 100644 index 0000000000000000000000000000000000000000..e16235dedb7b8602bfb8c1f249576ff201a290c7 --- /dev/null +++ b/docs/source/paddlenlp.trainer.integrations.rst @@ -0,0 +1,7 @@ +integrations +===================================== + +.. 
automodule:: paddlenlp.trainer.integrations + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.rst b/docs/source/paddlenlp.trainer.rst new file mode 100644 index 0000000000000000000000000000000000000000..af33d53c245bd0226880f31b1f5a212a8b485b17 --- /dev/null +++ b/docs/source/paddlenlp.trainer.rst @@ -0,0 +1,24 @@ +paddlenlp.trainer +========================= + +.. automodule:: paddlenlp.trainer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.trainer.utils + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.trainer.argparser + paddlenlp.trainer.integrations + paddlenlp.trainer.trainer_base + paddlenlp.trainer.trainer_callback + paddlenlp.trainer.trainer_utils + paddlenlp.trainer.training_args diff --git a/docs/source/paddlenlp.trainer.trainer_base.rst b/docs/source/paddlenlp.trainer.trainer_base.rst new file mode 100644 index 0000000000000000000000000000000000000000..4378dd75d758ce0c587b9f42fd27d5806228f8d3 --- /dev/null +++ b/docs/source/paddlenlp.trainer.trainer_base.rst @@ -0,0 +1,7 @@ +trainer\_base +====================================== + +.. automodule:: paddlenlp.trainer.trainer_base + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.trainer_callback.rst b/docs/source/paddlenlp.trainer.trainer_callback.rst new file mode 100644 index 0000000000000000000000000000000000000000..f4de42a59b965b2c54581d89972b426bcbf46eff --- /dev/null +++ b/docs/source/paddlenlp.trainer.trainer_callback.rst @@ -0,0 +1,7 @@ +trainer\_callback +========================================== + +.. automodule:: paddlenlp.trainer.trainer_callback + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.trainer_utils.rst b/docs/source/paddlenlp.trainer.trainer_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..c58e4c52d91a850b2874793f886b48bc4cc40d95 --- /dev/null +++ b/docs/source/paddlenlp.trainer.trainer_utils.rst @@ -0,0 +1,7 @@ +trainer\_utils +======================================= + +.. automodule:: paddlenlp.trainer.trainer_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.training_args.rst b/docs/source/paddlenlp.trainer.training_args.rst new file mode 100644 index 0000000000000000000000000000000000000000..e4909f9aef910e818057d6807a0a906991d04975 --- /dev/null +++ b/docs/source/paddlenlp.trainer.training_args.rst @@ -0,0 +1,7 @@ +training\_args +======================================= + +.. automodule:: paddlenlp.trainer.training_args + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.utils.helper.rst b/docs/source/paddlenlp.trainer.utils.helper.rst new file mode 100644 index 0000000000000000000000000000000000000000..2d8b6f6e1e2580da788a02b54fcf96ed891ebf1f --- /dev/null +++ b/docs/source/paddlenlp.trainer.utils.helper.rst @@ -0,0 +1,7 @@ +helper +===================================== + +.. automodule:: paddlenlp.trainer.utils.helper + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.utils.rst b/docs/source/paddlenlp.trainer.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..de6a9daff5160eeabd5f253aa11d4a3bd71326fc --- /dev/null +++ b/docs/source/paddlenlp.trainer.utils.rst @@ -0,0 +1,13 @@ +utils +=============================== + +.. 
automodule:: paddlenlp.trainer.utils + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.trainer.utils.helper diff --git a/docs/source/paddlenlp.transformers.albert.modeling.rst b/docs/source/paddlenlp.transformers.albert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..0a7948f40a04f4ba99fd3bb287b695727071fe9f --- /dev/null +++ b/docs/source/paddlenlp.transformers.albert.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================= + +.. automodule:: paddlenlp.transformers.albert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.albert.rst b/docs/source/paddlenlp.transformers.albert.rst new file mode 100644 index 0000000000000000000000000000000000000000..0c0da687851368c5a9e9ffefa195929f1fd4cd51 --- /dev/null +++ b/docs/source/paddlenlp.transformers.albert.rst @@ -0,0 +1,14 @@ +albert +===================================== + +.. automodule:: paddlenlp.transformers.albert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.albert.modeling + paddlenlp.transformers.albert.tokenizer diff --git a/docs/source/paddlenlp.transformers.albert.tokenizer.rst b/docs/source/paddlenlp.transformers.albert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..3feef349131446cd5d7ab1b6b3d20e4c8c29d0e9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.albert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================== + +.. automodule:: paddlenlp.transformers.albert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.artist.modeling.rst b/docs/source/paddlenlp.transformers.artist.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..751786f8b99fd5e2c5f3a126caef2d990769fe90 --- /dev/null +++ b/docs/source/paddlenlp.transformers.artist.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================= + +.. automodule:: paddlenlp.transformers.artist.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.artist.rst b/docs/source/paddlenlp.transformers.artist.rst new file mode 100644 index 0000000000000000000000000000000000000000..cacfe699e3e58ecabefb1b1ba2bb2236c1dcc8d7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.artist.rst @@ -0,0 +1,14 @@ +artist +===================================== + +.. automodule:: paddlenlp.transformers.artist + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.artist.modeling + paddlenlp.transformers.artist.tokenizer diff --git a/docs/source/paddlenlp.transformers.artist.tokenizer.rst b/docs/source/paddlenlp.transformers.artist.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..0e1a05f005ae64a3b115f2182ee1e53def1ab393 --- /dev/null +++ b/docs/source/paddlenlp.transformers.artist.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================== + +.. 
automodule:: paddlenlp.transformers.artist.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.attention_utils.rst b/docs/source/paddlenlp.transformers.attention_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..9cbf2e8705edbd8aadc957ffca09f8e766465275 --- /dev/null +++ b/docs/source/paddlenlp.transformers.attention_utils.rst @@ -0,0 +1,6 @@ +attention\_utils +============================================== + +.. automodule:: paddlenlp.transformers.attention_utils + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.auto.modeling.rst b/docs/source/paddlenlp.transformers.auto.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..b8778dd70d0d32a963281b9093a2903b05d1f091 --- /dev/null +++ b/docs/source/paddlenlp.transformers.auto.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.auto.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.auto.rst b/docs/source/paddlenlp.transformers.auto.rst new file mode 100644 index 0000000000000000000000000000000000000000..7ff49719749f11438d9839c65abffe4b40c9fc3f --- /dev/null +++ b/docs/source/paddlenlp.transformers.auto.rst @@ -0,0 +1,14 @@ +auto +=================================== + +.. automodule:: paddlenlp.transformers.auto + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.auto.modeling + paddlenlp.transformers.auto.tokenizer diff --git a/docs/source/paddlenlp.transformers.auto.tokenizer.rst b/docs/source/paddlenlp.transformers.auto.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..3cf2c590985a57d7ab91d33eb8d300c5f5401e11 --- /dev/null +++ b/docs/source/paddlenlp.transformers.auto.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.auto.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bart.modeling.rst b/docs/source/paddlenlp.transformers.bart.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..4ed15ed2a64d4bfead9f58371f49f656d6c7be32 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bart.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.bart.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bart.rst b/docs/source/paddlenlp.transformers.bart.rst new file mode 100644 index 0000000000000000000000000000000000000000..de955a2d3af8ff62a405a98b77420b6956cf9fbc --- /dev/null +++ b/docs/source/paddlenlp.transformers.bart.rst @@ -0,0 +1,14 @@ +bart +=================================== + +.. automodule:: paddlenlp.transformers.bart + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bart.modeling + paddlenlp.transformers.bart.tokenizer diff --git a/docs/source/paddlenlp.transformers.bart.tokenizer.rst b/docs/source/paddlenlp.transformers.bart.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..6fd4c7a597f5f666ae1ea1d67c051c48d0149a34 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bart.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. 
automodule:: paddlenlp.transformers.bart.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.bert.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d102c88993748c7335bbe0f915e9b0081ff79d24 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +==================================================== + +.. automodule:: paddlenlp.transformers.bert.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert.modeling.rst b/docs/source/paddlenlp.transformers.bert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..01b8b37ded3989b378cfed8926b5aaede3db8538 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.bert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert.rst b/docs/source/paddlenlp.transformers.bert.rst new file mode 100644 index 0000000000000000000000000000000000000000..bca1ffa00ecada671c30c74719ec6a9c6e0724d9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.rst @@ -0,0 +1,15 @@ +bert +=================================== + +.. automodule:: paddlenlp.transformers.bert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bert.fast_tokenizer + paddlenlp.transformers.bert.modeling + paddlenlp.transformers.bert.tokenizer diff --git a/docs/source/paddlenlp.transformers.bert.tokenizer.rst b/docs/source/paddlenlp.transformers.bert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..f451e3d3e510408c53f2384d3461b90b6645ea03 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.bert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst b/docs/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst new file mode 100644 index 0000000000000000000000000000000000000000..c8608818dceb162037fb8fd4bfcd095bc2be3e89 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst @@ -0,0 +1,7 @@ +convert\_bert\_japanese\_params +============================================================================ + +.. automodule:: paddlenlp.transformers.bert_japanese.convert_bert_japanese_params + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert_japanese.rst b/docs/source/paddlenlp.transformers.bert_japanese.rst new file mode 100644 index 0000000000000000000000000000000000000000..4e0a617de60829ea12233d06faab6688b5260ea5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert_japanese.rst @@ -0,0 +1,13 @@ +bert\_japanese +============================================= + +.. automodule:: paddlenlp.transformers.bert_japanese + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bert_japanese.tokenizer diff --git a/docs/source/paddlenlp.transformers.bert_japanese.tokenizer.rst b/docs/source/paddlenlp.transformers.bert_japanese.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..6f78bbf99884782c655522c30510ca3931954841 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert_japanese.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +====================================================== + +.. automodule:: paddlenlp.transformers.bert_japanese.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bigbird.modeling.rst b/docs/source/paddlenlp.transformers.bigbird.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..44f82bbe22905d1fbcd038f6b8d4813b45f9a9da --- /dev/null +++ b/docs/source/paddlenlp.transformers.bigbird.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.bigbird.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bigbird.rst b/docs/source/paddlenlp.transformers.bigbird.rst new file mode 100644 index 0000000000000000000000000000000000000000..21fb8e6e8e0e2c09520913d48102557a915de448 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bigbird.rst @@ -0,0 +1,14 @@ +bigbird +====================================== + +.. automodule:: paddlenlp.transformers.bigbird + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bigbird.modeling + paddlenlp.transformers.bigbird.tokenizer diff --git a/docs/source/paddlenlp.transformers.bigbird.tokenizer.rst b/docs/source/paddlenlp.transformers.bigbird.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..34e240e414f5744894579830c58a85044745ab7f --- /dev/null +++ b/docs/source/paddlenlp.transformers.bigbird.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.bigbird.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot.modeling.rst b/docs/source/paddlenlp.transformers.blenderbot.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..a52aa9d2e4347024a37b3e8aee0381e612d35eba --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.blenderbot.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot.rst b/docs/source/paddlenlp.transformers.blenderbot.rst new file mode 100644 index 0000000000000000000000000000000000000000..b7f3e335a0ddeaf3e67c93804e62b2ac0d15f5cb --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot.rst @@ -0,0 +1,14 @@ +blenderbot +========================================= + +.. automodule:: paddlenlp.transformers.blenderbot + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.blenderbot.modeling + paddlenlp.transformers.blenderbot.tokenizer diff --git a/docs/source/paddlenlp.transformers.blenderbot.tokenizer.rst b/docs/source/paddlenlp.transformers.blenderbot.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..3f74eeaad3fe1c6693c6ea79c4b6b3fd8b2da6b9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.blenderbot.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot_small.modeling.rst b/docs/source/paddlenlp.transformers.blenderbot_small.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e6b486f4b2930483b37806e6b37ce98728bcb5c5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot_small.modeling.rst @@ -0,0 +1,7 @@ +modeling +======================================================== + +.. automodule:: paddlenlp.transformers.blenderbot_small.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot_small.rst b/docs/source/paddlenlp.transformers.blenderbot_small.rst new file mode 100644 index 0000000000000000000000000000000000000000..13736397b4542c20cd8d0826c43ae2d6378a0bc8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot_small.rst @@ -0,0 +1,14 @@ +blenderbot\_small +================================================ + +.. automodule:: paddlenlp.transformers.blenderbot_small + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.blenderbot_small.modeling + paddlenlp.transformers.blenderbot_small.tokenizer diff --git a/docs/source/paddlenlp.transformers.blenderbot_small.tokenizer.rst b/docs/source/paddlenlp.transformers.blenderbot_small.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..496c3e03f03b90b22bda57e3ddd61b8040aaaefe --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot_small.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +========================================================= + +.. automodule:: paddlenlp.transformers.blenderbot_small.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.chinesebert.modeling.rst b/docs/source/paddlenlp.transformers.chinesebert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..a93209e0655e689391fcc0fb22d9335833de3024 --- /dev/null +++ b/docs/source/paddlenlp.transformers.chinesebert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.chinesebert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.chinesebert.rst b/docs/source/paddlenlp.transformers.chinesebert.rst new file mode 100644 index 0000000000000000000000000000000000000000..fa8cad8e88a91f6a7922049ed34d069a1cd3150e --- /dev/null +++ b/docs/source/paddlenlp.transformers.chinesebert.rst @@ -0,0 +1,14 @@ +chinesebert +========================================== + +.. automodule:: paddlenlp.transformers.chinesebert + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.chinesebert.modeling + paddlenlp.transformers.chinesebert.tokenizer diff --git a/docs/source/paddlenlp.transformers.chinesebert.tokenizer.rst b/docs/source/paddlenlp.transformers.chinesebert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..9521fc2adb2dfd3b3f5b555f38de0b865aed5bcb --- /dev/null +++ b/docs/source/paddlenlp.transformers.chinesebert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=================================================== + +.. automodule:: paddlenlp.transformers.chinesebert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.codegen.modeling.rst b/docs/source/paddlenlp.transformers.codegen.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..94bf3c40080b8b7e90098a28d926108a615775fc --- /dev/null +++ b/docs/source/paddlenlp.transformers.codegen.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.codegen.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.codegen.rst b/docs/source/paddlenlp.transformers.codegen.rst new file mode 100644 index 0000000000000000000000000000000000000000..06b98a2db02c13665327b7877a68258fadf9d847 --- /dev/null +++ b/docs/source/paddlenlp.transformers.codegen.rst @@ -0,0 +1,14 @@ +codegen +====================================== + +.. automodule:: paddlenlp.transformers.codegen + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.codegen.modeling + paddlenlp.transformers.codegen.tokenizer diff --git a/docs/source/paddlenlp.transformers.codegen.tokenizer.rst b/docs/source/paddlenlp.transformers.codegen.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..49ebd1510971e953085113f70aa3b03bafa14cd8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.codegen.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.codegen.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.convbert.modeling.rst b/docs/source/paddlenlp.transformers.convbert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..f24b7e95f7520a5acc731b7bf370fc4de7e8ac5b --- /dev/null +++ b/docs/source/paddlenlp.transformers.convbert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.convbert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.convbert.rst b/docs/source/paddlenlp.transformers.convbert.rst new file mode 100644 index 0000000000000000000000000000000000000000..93b396b679e9d7015064ba6d157656c48a725b6a --- /dev/null +++ b/docs/source/paddlenlp.transformers.convbert.rst @@ -0,0 +1,14 @@ +convbert +======================================= + +.. automodule:: paddlenlp.transformers.convbert + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.convbert.modeling + paddlenlp.transformers.convbert.tokenizer diff --git a/docs/source/paddlenlp.transformers.convbert.tokenizer.rst b/docs/source/paddlenlp.transformers.convbert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..715b3fad7b7d0ea9f53b9c225c5e63d066501f10 --- /dev/null +++ b/docs/source/paddlenlp.transformers.convbert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.convbert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.convert_slow_tokenizer.rst b/docs/source/paddlenlp.transformers.convert_slow_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..c8e4e4b576fb6dbbf8bf90c46b18717afb382166 --- /dev/null +++ b/docs/source/paddlenlp.transformers.convert_slow_tokenizer.rst @@ -0,0 +1,7 @@ +convert\_slow\_tokenizer +====================================================== + +.. automodule:: paddlenlp.transformers.convert_slow_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ctrl.modeling.rst b/docs/source/paddlenlp.transformers.ctrl.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..2a00824efcc49fca708324f7143dca58fc04199a --- /dev/null +++ b/docs/source/paddlenlp.transformers.ctrl.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.ctrl.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ctrl.rst b/docs/source/paddlenlp.transformers.ctrl.rst new file mode 100644 index 0000000000000000000000000000000000000000..ac550371b5a781faeeb6b8b74171fb965ff927e9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ctrl.rst @@ -0,0 +1,14 @@ +ctrl +=================================== + +.. automodule:: paddlenlp.transformers.ctrl + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ctrl.modeling + paddlenlp.transformers.ctrl.tokenizer diff --git a/docs/source/paddlenlp.transformers.ctrl.tokenizer.rst b/docs/source/paddlenlp.transformers.ctrl.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ae6909f8367624a1ef42a6e093dae955161b6850 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ctrl.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.ctrl.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.dallebart.modeling.rst b/docs/source/paddlenlp.transformers.dallebart.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..43b9008645f6bdb0055d2cf800be51f30a70073a --- /dev/null +++ b/docs/source/paddlenlp.transformers.dallebart.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================ + +.. 
automodule:: paddlenlp.transformers.dallebart.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.dallebart.rst b/docs/source/paddlenlp.transformers.dallebart.rst new file mode 100644 index 0000000000000000000000000000000000000000..b4c076fe1095d2ebf470f5d44d14efff6cbab5a8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.dallebart.rst @@ -0,0 +1,14 @@ +dallebart +======================================== + +.. automodule:: paddlenlp.transformers.dallebart + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.dallebart.modeling + paddlenlp.transformers.dallebart.tokenizer diff --git a/docs/source/paddlenlp.transformers.dallebart.tokenizer.rst b/docs/source/paddlenlp.transformers.dallebart.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..dc6ce082462a55b60f0ceb98800e4acf138d8728 --- /dev/null +++ b/docs/source/paddlenlp.transformers.dallebart.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================= + +.. automodule:: paddlenlp.transformers.dallebart.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.distilbert.modeling.rst b/docs/source/paddlenlp.transformers.distilbert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..d269bf4a4ccdc7eceafcfd231054e945587347d2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distilbert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.distilbert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.distilbert.rst b/docs/source/paddlenlp.transformers.distilbert.rst new file mode 100644 index 0000000000000000000000000000000000000000..09ed79d561cbe7c2816e10ff50dd809dc4e11397 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distilbert.rst @@ -0,0 +1,14 @@ +distilbert +========================================= + +.. automodule:: paddlenlp.transformers.distilbert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.distilbert.modeling + paddlenlp.transformers.distilbert.tokenizer diff --git a/docs/source/paddlenlp.transformers.distilbert.tokenizer.rst b/docs/source/paddlenlp.transformers.distilbert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..490daa508c15737a3b363c2e032bbbc16b1545a7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distilbert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.distilbert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.distill_utils.rst b/docs/source/paddlenlp.transformers.distill_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..776b2c9fb4d6f3e456a6c920ea9c6ce3b83508f1 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distill_utils.rst @@ -0,0 +1,7 @@ +distill\_utils +============================================ + +.. 
automodule:: paddlenlp.transformers.distill_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.electra.modeling.rst b/docs/source/paddlenlp.transformers.electra.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..c31be0555a89424aca414f20f9c2663cc5eb00a6 --- /dev/null +++ b/docs/source/paddlenlp.transformers.electra.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.electra.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.electra.rst b/docs/source/paddlenlp.transformers.electra.rst new file mode 100644 index 0000000000000000000000000000000000000000..9c9c44eaa6fddd9ea94bbec6352978750daa20b0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.electra.rst @@ -0,0 +1,14 @@ +electra +====================================== + +.. automodule:: paddlenlp.transformers.electra + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.electra.modeling + paddlenlp.transformers.electra.tokenizer diff --git a/docs/source/paddlenlp.transformers.electra.tokenizer.rst b/docs/source/paddlenlp.transformers.electra.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..870b37825fb1727eda7871195a1fb2163b78b61d --- /dev/null +++ b/docs/source/paddlenlp.transformers.electra.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.electra.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.ernie.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d076125bafe460bf74cde7a5eb802c5612608c05 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +===================================================== + +.. automodule:: paddlenlp.transformers.ernie.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie.modeling.rst b/docs/source/paddlenlp.transformers.ernie.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..380a8d64cfcb5f983f34aad1e2a01802cf070658 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.ernie.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie.rst b/docs/source/paddlenlp.transformers.ernie.rst new file mode 100644 index 0000000000000000000000000000000000000000..7444d131dbb07cb2e941c1fc8bb1ff8496f7a77f --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.rst @@ -0,0 +1,15 @@ +ernie +==================================== + +.. automodule:: paddlenlp.transformers.ernie + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie.fast_tokenizer + paddlenlp.transformers.ernie.modeling + paddlenlp.transformers.ernie.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..fb6c1fa1c503f7e94e518897a8db7ef26df7e956 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.ernie.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_ctm.modeling.rst b/docs/source/paddlenlp.transformers.ernie_ctm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..3b8770cddca47fa9800ee1b7fc894b3009d3fab2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_ctm.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.ernie_ctm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_ctm.rst b/docs/source/paddlenlp.transformers.ernie_ctm.rst new file mode 100644 index 0000000000000000000000000000000000000000..f1494fa75c0eb3b32dc383835a1ad7a9aaa46c4e --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_ctm.rst @@ -0,0 +1,14 @@ +ernie\_ctm +========================================= + +.. automodule:: paddlenlp.transformers.ernie_ctm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_ctm.modeling + paddlenlp.transformers.ernie_ctm.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_ctm.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_ctm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..15a8daed187666dcca9bc90a146f3e2343fe590a --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_ctm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.ernie_ctm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_doc.modeling.rst b/docs/source/paddlenlp.transformers.ernie_doc.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e8fd915642d1fed055f8a9afda23a59ebc203f3f --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_doc.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.ernie_doc.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_doc.rst b/docs/source/paddlenlp.transformers.ernie_doc.rst new file mode 100644 index 0000000000000000000000000000000000000000..8cf67597ebca15a64c06e76d1eb97302ad8acb60 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_doc.rst @@ -0,0 +1,14 @@ +ernie\_doc +========================================= + +.. automodule:: paddlenlp.transformers.ernie_doc + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_doc.modeling + paddlenlp.transformers.ernie_doc.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_doc.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_doc.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..cdf63976128cd831c044185574c3b6a00e9a75b0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_doc.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.ernie_doc.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gen.modeling.rst b/docs/source/paddlenlp.transformers.ernie_gen.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..b33a1d0abc2d9f34a48d5fc29f4a838257a1dc34 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gen.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.ernie_gen.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gen.rst b/docs/source/paddlenlp.transformers.ernie_gen.rst new file mode 100644 index 0000000000000000000000000000000000000000..be4f6f3b853f728a0d0e0adb129f18807471a8fe --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gen.rst @@ -0,0 +1,13 @@ +ernie\_gen +========================================= + +.. automodule:: paddlenlp.transformers.ernie_gen + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_gen.modeling diff --git a/docs/source/paddlenlp.transformers.ernie_gram.matching_param_name.rst b/docs/source/paddlenlp.transformers.ernie_gram.matching_param_name.rst new file mode 100644 index 0000000000000000000000000000000000000000..7968e19f2db34028fd6c63f50d257fd55c03055c --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.matching_param_name.rst @@ -0,0 +1,7 @@ +matching\_param\_name +=============================================================== + +.. automodule:: paddlenlp.transformers.ernie_gram.matching_param_name + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gram.modeling.rst b/docs/source/paddlenlp.transformers.ernie_gram.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..3053c06384527dbc5ce95d58563aab49655586b8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.ernie_gram.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gram.rst b/docs/source/paddlenlp.transformers.ernie_gram.rst new file mode 100644 index 0000000000000000000000000000000000000000..2e615be1548dedb02e3ac20f24e7b103704f8e6c --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.rst @@ -0,0 +1,15 @@ +ernie\_gram +========================================== + +.. automodule:: paddlenlp.transformers.ernie_gram + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_gram.matching_param_name + paddlenlp.transformers.ernie_gram.modeling + paddlenlp.transformers.ernie_gram.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_gram.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_gram.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..459dba4fc07658cd499b1f4ab0afc78b95388be8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=================================================== + +.. automodule:: paddlenlp.transformers.ernie_gram.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_m.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_m.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a0c7dcb07bababd9af9006af9e2f620906e34fe7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +====================================================== + +.. automodule:: paddlenlp.transformers.ernie_m.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_m.modeling.rst b/docs/source/paddlenlp.transformers.ernie_m.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..4a836830edd1ca0b59fa27f7d85d70ba4f6ad775 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.ernie_m.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_m.rst b/docs/source/paddlenlp.transformers.ernie_m.rst new file mode 100644 index 0000000000000000000000000000000000000000..ab97f8f9553a4f367ea9d2b00fb6f2709e8652bd --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.rst @@ -0,0 +1,15 @@ +ernie\_m +======================================= + +.. automodule:: paddlenlp.transformers.ernie_m + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_m.fast_tokenizer + paddlenlp.transformers.ernie_m.modeling + paddlenlp.transformers.ernie_m.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_m.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_m.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..fb9eaec8cce3e9be1b84e9c19a67354716ba8f8d --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.ernie_m.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.export.rst b/docs/source/paddlenlp.transformers.export.rst new file mode 100644 index 0000000000000000000000000000000000000000..7f49d2faafde10ddfcda6d1c62f38c4bd9660cff --- /dev/null +++ b/docs/source/paddlenlp.transformers.export.rst @@ -0,0 +1,7 @@ +export +==================================== + +.. 
automodule:: paddlenlp.transformers.export + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.fnet.modeling.rst b/docs/source/paddlenlp.transformers.fnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..141baad80d80ea8009f29a8cc9ff405ea0067253 --- /dev/null +++ b/docs/source/paddlenlp.transformers.fnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.fnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.fnet.rst b/docs/source/paddlenlp.transformers.fnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..7e0440c2704d608c5ef7862342ed29397157d4ad --- /dev/null +++ b/docs/source/paddlenlp.transformers.fnet.rst @@ -0,0 +1,14 @@ +fnet +=================================== + +.. automodule:: paddlenlp.transformers.fnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.fnet.modeling + paddlenlp.transformers.fnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.fnet.tokenizer.rst b/docs/source/paddlenlp.transformers.fnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..22bf444a04bf5be97da770c35466e69d6a500a7d --- /dev/null +++ b/docs/source/paddlenlp.transformers.fnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.fnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.funnel.modeling.rst b/docs/source/paddlenlp.transformers.funnel.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..329f6bec4b098f17a5c35654244402a6428c0ed5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.funnel.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================= + +.. automodule:: paddlenlp.transformers.funnel.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.funnel.rst b/docs/source/paddlenlp.transformers.funnel.rst new file mode 100644 index 0000000000000000000000000000000000000000..3a69c3398ff81e27004026f1f76a22bb99a0a09b --- /dev/null +++ b/docs/source/paddlenlp.transformers.funnel.rst @@ -0,0 +1,14 @@ +funnel +===================================== + +.. automodule:: paddlenlp.transformers.funnel + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.funnel.modeling + paddlenlp.transformers.funnel.tokenizer diff --git a/docs/source/paddlenlp.transformers.funnel.tokenizer.rst b/docs/source/paddlenlp.transformers.funnel.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..49b413a063184ce5e25be48b49b3dfcfcb207004 --- /dev/null +++ b/docs/source/paddlenlp.transformers.funnel.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================== + +.. 
automodule:: paddlenlp.transformers.funnel.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gau_alpha.modeling.rst b/docs/source/paddlenlp.transformers.gau_alpha.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..9bc4a2e4e01c9eed0cee941415ba6ca331bd04c0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gau_alpha.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.gau_alpha.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gau_alpha.rst b/docs/source/paddlenlp.transformers.gau_alpha.rst new file mode 100644 index 0000000000000000000000000000000000000000..91684515e801b83f53c0370d6edff3f0a96e52fa --- /dev/null +++ b/docs/source/paddlenlp.transformers.gau_alpha.rst @@ -0,0 +1,14 @@ +gau\_alpha +========================================= + +.. automodule:: paddlenlp.transformers.gau_alpha + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.gau_alpha.modeling + paddlenlp.transformers.gau_alpha.tokenizer diff --git a/docs/source/paddlenlp.transformers.gau_alpha.tokenizer.rst b/docs/source/paddlenlp.transformers.gau_alpha.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..40d966a2e0efc3e2b77ca063b1228221ab436db4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gau_alpha.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.gau_alpha.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.generation_utils.rst b/docs/source/paddlenlp.transformers.generation_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..a36d1f0b26fc9344b78c1ed691a2054a9828dc51 --- /dev/null +++ b/docs/source/paddlenlp.transformers.generation_utils.rst @@ -0,0 +1,7 @@ +generation\_utils +=============================================== + +.. automodule:: paddlenlp.transformers.generation_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gpt.modeling.rst b/docs/source/paddlenlp.transformers.gpt.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e50a4d51a9ebf71444de1075a9e2aff47a5f6363 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gpt.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================== + +.. automodule:: paddlenlp.transformers.gpt.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gpt.rst b/docs/source/paddlenlp.transformers.gpt.rst new file mode 100644 index 0000000000000000000000000000000000000000..08123c1ee019ff6d74e693178307d14bec3970c1 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gpt.rst @@ -0,0 +1,14 @@ +gpt +================================== + +.. automodule:: paddlenlp.transformers.gpt + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.gpt.modeling + paddlenlp.transformers.gpt.tokenizer diff --git a/docs/source/paddlenlp.transformers.gpt.tokenizer.rst b/docs/source/paddlenlp.transformers.gpt.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..58dd364ba1b535aad91d3402ad6c79fceabf62ab --- /dev/null +++ b/docs/source/paddlenlp.transformers.gpt.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=========================================== + +.. automodule:: paddlenlp.transformers.gpt.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlm.modeling.rst b/docs/source/paddlenlp.transformers.layoutlm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..389635c84f7edb9b9b72e7a147e774205474d77c --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlm.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.layoutlm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlm.rst b/docs/source/paddlenlp.transformers.layoutlm.rst new file mode 100644 index 0000000000000000000000000000000000000000..5777c44f88a228c18243c1abcbefa312ab6be572 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlm.rst @@ -0,0 +1,14 @@ +layoutlm +======================================= + +.. automodule:: paddlenlp.transformers.layoutlm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.layoutlm.modeling + paddlenlp.transformers.layoutlm.tokenizer diff --git a/docs/source/paddlenlp.transformers.layoutlm.tokenizer.rst b/docs/source/paddlenlp.transformers.layoutlm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..4328daa8d95140f346c567c6f139e13229ce33be --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.layoutlm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlmv2.modeling.rst b/docs/source/paddlenlp.transformers.layoutlmv2.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..bbce41e59313253c2cfd34432cdbb28ff883b0ab --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlmv2.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.layoutlmv2.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlmv2.rst b/docs/source/paddlenlp.transformers.layoutlmv2.rst new file mode 100644 index 0000000000000000000000000000000000000000..208cec4cf7568a518b095cfaba9c57cf22de05da --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlmv2.rst @@ -0,0 +1,14 @@ +layoutlmv2 +========================================= + +.. automodule:: paddlenlp.transformers.layoutlmv2 + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.layoutlmv2.modeling + paddlenlp.transformers.layoutlmv2.tokenizer diff --git a/docs/source/paddlenlp.transformers.layoutlmv2.tokenizer.rst b/docs/source/paddlenlp.transformers.layoutlmv2.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..dfb3da6c03c9a0bc160c82c740f0a333348f72a2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlmv2.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.layoutlmv2.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutxlm.modeling.rst b/docs/source/paddlenlp.transformers.layoutxlm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..696a754912797ad52171036e11209123fcf9988f --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================ + +.. automodule:: paddlenlp.transformers.layoutxlm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutxlm.rst b/docs/source/paddlenlp.transformers.layoutxlm.rst new file mode 100644 index 0000000000000000000000000000000000000000..5c5f45f562ea2c18052cdaaaee5d867f145635ef --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.rst @@ -0,0 +1,15 @@ +layoutxlm +======================================== + +.. automodule:: paddlenlp.transformers.layoutxlm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.layoutxlm.modeling + paddlenlp.transformers.layoutxlm.tokenizer + paddlenlp.transformers.layoutxlm.visual_backbone diff --git a/docs/source/paddlenlp.transformers.layoutxlm.tokenizer.rst b/docs/source/paddlenlp.transformers.layoutxlm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..8d59aa88ead34650cd343728b7f304ccb9c9ac64 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================= + +.. automodule:: paddlenlp.transformers.layoutxlm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutxlm.visual_backbone.rst b/docs/source/paddlenlp.transformers.layoutxlm.visual_backbone.rst new file mode 100644 index 0000000000000000000000000000000000000000..7f513c7592828569be937a8f7fcab03db0b6da68 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.visual_backbone.rst @@ -0,0 +1,7 @@ +visual\_backbone +======================================================== + +.. automodule:: paddlenlp.transformers.layoutxlm.visual_backbone + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.luke.modeling.rst b/docs/source/paddlenlp.transformers.luke.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..68b42e54af00222a03a3407715f78abd661af836 --- /dev/null +++ b/docs/source/paddlenlp.transformers.luke.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. 
automodule:: paddlenlp.transformers.luke.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.luke.rst b/docs/source/paddlenlp.transformers.luke.rst new file mode 100644 index 0000000000000000000000000000000000000000..ddc1975aed911dce5518c4dc2b1997df9fec9ad4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.luke.rst @@ -0,0 +1,14 @@ +luke +=================================== + +.. automodule:: paddlenlp.transformers.luke + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.luke.modeling + paddlenlp.transformers.luke.tokenizer diff --git a/docs/source/paddlenlp.transformers.luke.tokenizer.rst b/docs/source/paddlenlp.transformers.luke.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..5ffdc300ad83101442b101c60c29831b5df36c51 --- /dev/null +++ b/docs/source/paddlenlp.transformers.luke.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.luke.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mbart.modeling.rst b/docs/source/paddlenlp.transformers.mbart.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..57877eb210ec547f77e5a0c5775868286769f7ce --- /dev/null +++ b/docs/source/paddlenlp.transformers.mbart.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.mbart.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mbart.rst b/docs/source/paddlenlp.transformers.mbart.rst new file mode 100644 index 0000000000000000000000000000000000000000..ee9e94b7a194016efdedded4355a42cec4407187 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mbart.rst @@ -0,0 +1,14 @@ +mbart +==================================== + +.. automodule:: paddlenlp.transformers.mbart + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.mbart.modeling + paddlenlp.transformers.mbart.tokenizer diff --git a/docs/source/paddlenlp.transformers.mbart.tokenizer.rst b/docs/source/paddlenlp.transformers.mbart.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..baf9e956d91259a1c9a8dfe87ffb5a86c452b788 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mbart.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.mbart.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.megatronbert.modeling.rst b/docs/source/paddlenlp.transformers.megatronbert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..fe34bcc5f8be0b71ed536dc96c5270010021395b --- /dev/null +++ b/docs/source/paddlenlp.transformers.megatronbert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=================================================== + +.. 
automodule:: paddlenlp.transformers.megatronbert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.megatronbert.rst b/docs/source/paddlenlp.transformers.megatronbert.rst new file mode 100644 index 0000000000000000000000000000000000000000..ed428d1afad26a3e36e667cb259091236ffebd3d --- /dev/null +++ b/docs/source/paddlenlp.transformers.megatronbert.rst @@ -0,0 +1,14 @@ +megatronbert +=========================================== + +.. automodule:: paddlenlp.transformers.megatronbert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.megatronbert.modeling + paddlenlp.transformers.megatronbert.tokenizer diff --git a/docs/source/paddlenlp.transformers.megatronbert.tokenizer.rst b/docs/source/paddlenlp.transformers.megatronbert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..61243ad237eff9d4bd6594a65ab3556417a0d0a0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.megatronbert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +==================================================== + +.. automodule:: paddlenlp.transformers.megatronbert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mobilebert.modeling.rst b/docs/source/paddlenlp.transformers.mobilebert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..d912aeea993d468de2eddf51956f2f9471b76ce2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mobilebert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.mobilebert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mobilebert.rst b/docs/source/paddlenlp.transformers.mobilebert.rst new file mode 100644 index 0000000000000000000000000000000000000000..da320896cb8671db2279245684955eee0179ee81 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mobilebert.rst @@ -0,0 +1,14 @@ +mobilebert +========================================= + +.. automodule:: paddlenlp.transformers.mobilebert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.mobilebert.modeling + paddlenlp.transformers.mobilebert.tokenizer diff --git a/docs/source/paddlenlp.transformers.mobilebert.tokenizer.rst b/docs/source/paddlenlp.transformers.mobilebert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..20576d7b3d0e432b8274c9b72a2fbabc1a6b3a9c --- /dev/null +++ b/docs/source/paddlenlp.transformers.mobilebert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.mobilebert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.model_outputs.rst b/docs/source/paddlenlp.transformers.model_outputs.rst new file mode 100644 index 0000000000000000000000000000000000000000..7c3220fc0fd0a54b7235e2aa202381ba77b1ade5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.model_outputs.rst @@ -0,0 +1,6 @@ +model\_outputs +============================================ + +.. 
automodule:: paddlenlp.transformers.model_outputs + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.model_utils.rst b/docs/source/paddlenlp.transformers.model_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..f4e465e7fcfc594de2d8088cefeded21332ac111 --- /dev/null +++ b/docs/source/paddlenlp.transformers.model_utils.rst @@ -0,0 +1,6 @@ +model\_utils +========================================== + +.. automodule:: paddlenlp.transformers.model_utils + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.mpnet.modeling.rst b/docs/source/paddlenlp.transformers.mpnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..3af94ea79612c9a315c31e86611dc54e20005e46 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mpnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.mpnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mpnet.rst b/docs/source/paddlenlp.transformers.mpnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..ecf0185c05494cc104ecc60d5670b6ae624c148f --- /dev/null +++ b/docs/source/paddlenlp.transformers.mpnet.rst @@ -0,0 +1,14 @@ +mpnet +==================================== + +.. automodule:: paddlenlp.transformers.mpnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.mpnet.modeling + paddlenlp.transformers.mpnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.mpnet.tokenizer.rst b/docs/source/paddlenlp.transformers.mpnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..4d46073b97fdccc582d719dbc5c3943ff1bfc7cd --- /dev/null +++ b/docs/source/paddlenlp.transformers.mpnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.mpnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.nezha.modeling.rst b/docs/source/paddlenlp.transformers.nezha.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..4246cb72fea66aab2571191d116aa21ca4c4ce6d --- /dev/null +++ b/docs/source/paddlenlp.transformers.nezha.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.nezha.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.nezha.rst b/docs/source/paddlenlp.transformers.nezha.rst new file mode 100644 index 0000000000000000000000000000000000000000..02a28088eb3261c3dff14d9acb6bec6a26541ff4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.nezha.rst @@ -0,0 +1,14 @@ +nezha +==================================== + +.. automodule:: paddlenlp.transformers.nezha + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.nezha.modeling + paddlenlp.transformers.nezha.tokenizer diff --git a/docs/source/paddlenlp.transformers.nezha.tokenizer.rst b/docs/source/paddlenlp.transformers.nezha.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..435140acb8a9031cbb83e05ab9a793c110193571 --- /dev/null +++ b/docs/source/paddlenlp.transformers.nezha.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. 
automodule:: paddlenlp.transformers.nezha.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.opt.modeling.rst b/docs/source/paddlenlp.transformers.opt.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..71e2af7e61ddb973ca034b540b830a25c2bccf50 --- /dev/null +++ b/docs/source/paddlenlp.transformers.opt.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================== + +.. automodule:: paddlenlp.transformers.opt.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.opt.rst b/docs/source/paddlenlp.transformers.opt.rst new file mode 100644 index 0000000000000000000000000000000000000000..445e3c2639b9871d85c016a319827cd1aba3a8ed --- /dev/null +++ b/docs/source/paddlenlp.transformers.opt.rst @@ -0,0 +1,13 @@ +opt +================================== + +.. automodule:: paddlenlp.transformers.opt + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.opt.modeling diff --git a/docs/source/paddlenlp.transformers.optimization.rst b/docs/source/paddlenlp.transformers.optimization.rst new file mode 100644 index 0000000000000000000000000000000000000000..eb8c4eb182e07e17c302f3bf61e53a8255f0a45f --- /dev/null +++ b/docs/source/paddlenlp.transformers.optimization.rst @@ -0,0 +1,7 @@ +optimization +========================================== + +.. automodule:: paddlenlp.transformers.optimization + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ppminilm.modeling.rst b/docs/source/paddlenlp.transformers.ppminilm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..a786fb47e2ae6f20f17c3bd5c93fc136d239219f --- /dev/null +++ b/docs/source/paddlenlp.transformers.ppminilm.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.ppminilm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ppminilm.rst b/docs/source/paddlenlp.transformers.ppminilm.rst new file mode 100644 index 0000000000000000000000000000000000000000..8676e01e332b54d76c225a85010c636cbb4b9336 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ppminilm.rst @@ -0,0 +1,14 @@ +ppminilm +======================================= + +.. automodule:: paddlenlp.transformers.ppminilm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ppminilm.modeling + paddlenlp.transformers.ppminilm.tokenizer diff --git a/docs/source/paddlenlp.transformers.ppminilm.tokenizer.rst b/docs/source/paddlenlp.transformers.ppminilm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..974c8579e852fd9a1219b0ea27becca227dc9b31 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ppminilm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. 
automodule:: paddlenlp.transformers.ppminilm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.prophetnet.modeling.rst b/docs/source/paddlenlp.transformers.prophetnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..9c6c587fcdefc0b60f6ace7041ad57141b5fbd75 --- /dev/null +++ b/docs/source/paddlenlp.transformers.prophetnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.prophetnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.prophetnet.rst b/docs/source/paddlenlp.transformers.prophetnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..91a594abc38d7ee87d29cc6e73bc166b3eaef513 --- /dev/null +++ b/docs/source/paddlenlp.transformers.prophetnet.rst @@ -0,0 +1,14 @@ +prophetnet +========================================= + +.. automodule:: paddlenlp.transformers.prophetnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.prophetnet.modeling + paddlenlp.transformers.prophetnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.prophetnet.tokenizer.rst b/docs/source/paddlenlp.transformers.prophetnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..38fba82a15e741eabff4e94d7cde58cdb0608339 --- /dev/null +++ b/docs/source/paddlenlp.transformers.prophetnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.prophetnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.reformer.modeling.rst b/docs/source/paddlenlp.transformers.reformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..1b1d8d26f71b33f4c19a85d69d8bf18b9d74de49 --- /dev/null +++ b/docs/source/paddlenlp.transformers.reformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.reformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.reformer.rst b/docs/source/paddlenlp.transformers.reformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..e65273244234347b42eba22560613290daa4cfa7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.reformer.rst @@ -0,0 +1,14 @@ +reformer +======================================= + +.. automodule:: paddlenlp.transformers.reformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.reformer.modeling + paddlenlp.transformers.reformer.tokenizer diff --git a/docs/source/paddlenlp.transformers.reformer.tokenizer.rst b/docs/source/paddlenlp.transformers.reformer.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ffb7a11cc76c92a436e81b2702b9e30408e412b7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.reformer.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. 
automodule:: paddlenlp.transformers.reformer.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.rembert.modeling.rst b/docs/source/paddlenlp.transformers.rembert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..f1b5aa838b65ba48f538dd777031d2b113377484 --- /dev/null +++ b/docs/source/paddlenlp.transformers.rembert.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.rembert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.rembert.rst b/docs/source/paddlenlp.transformers.rembert.rst new file mode 100644 index 0000000000000000000000000000000000000000..a559fa9c5378f49db4ff251fa451b50031f42a02 --- /dev/null +++ b/docs/source/paddlenlp.transformers.rembert.rst @@ -0,0 +1,14 @@ +rembert +====================================== + +.. automodule:: paddlenlp.transformers.rembert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.rembert.modeling + paddlenlp.transformers.rembert.tokenizer diff --git a/docs/source/paddlenlp.transformers.rembert.tokenizer.rst b/docs/source/paddlenlp.transformers.rembert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..0ee175516cc5cde31677c949c115dab19f925077 --- /dev/null +++ b/docs/source/paddlenlp.transformers.rembert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.rembert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roberta.modeling.rst b/docs/source/paddlenlp.transformers.roberta.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..2e46e4358b4bb2894fdd2155c95f2600d5eb0b47 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roberta.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.roberta.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roberta.rst b/docs/source/paddlenlp.transformers.roberta.rst new file mode 100644 index 0000000000000000000000000000000000000000..3ff4cb5fcdf5d699d803647a5c1cff4972d57e0a --- /dev/null +++ b/docs/source/paddlenlp.transformers.roberta.rst @@ -0,0 +1,14 @@ +roberta +====================================== + +.. automodule:: paddlenlp.transformers.roberta + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.roberta.modeling + paddlenlp.transformers.roberta.tokenizer diff --git a/docs/source/paddlenlp.transformers.roberta.tokenizer.rst b/docs/source/paddlenlp.transformers.roberta.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..c9cbc66227e169784af566480815caa21be35c1a --- /dev/null +++ b/docs/source/paddlenlp.transformers.roberta.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. 
automodule:: paddlenlp.transformers.roberta.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformer.modeling.rst b/docs/source/paddlenlp.transformers.roformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..6d1d02576738956df899d9887ba6d338984e77d4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.roformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformer.rst b/docs/source/paddlenlp.transformers.roformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..5a67d90fe63dcb2845bc3b05a25f50f0b66e02af --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformer.rst @@ -0,0 +1,14 @@ +roformer +======================================= + +.. automodule:: paddlenlp.transformers.roformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.roformer.modeling + paddlenlp.transformers.roformer.tokenizer diff --git a/docs/source/paddlenlp.transformers.roformer.tokenizer.rst b/docs/source/paddlenlp.transformers.roformer.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ac6c2332f819b4b6f9c41c00aff6a396635ba775 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformer.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.roformer.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformerv2.modeling.rst b/docs/source/paddlenlp.transformers.roformerv2.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e145ea6b352e143c1895de235718c0d90a88474c --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformerv2.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.roformerv2.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformerv2.rst b/docs/source/paddlenlp.transformers.roformerv2.rst new file mode 100644 index 0000000000000000000000000000000000000000..a19ac028c7d2aebb7e6bc5afc0ad8a2d99ce7191 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformerv2.rst @@ -0,0 +1,14 @@ +roformerv2 +========================================= + +.. automodule:: paddlenlp.transformers.roformerv2 + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.roformerv2.modeling + paddlenlp.transformers.roformerv2.tokenizer diff --git a/docs/source/paddlenlp.transformers.roformerv2.tokenizer.rst b/docs/source/paddlenlp.transformers.roformerv2.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..25f32bd4bfbab8d3497d8e17721db559a800e2ee --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformerv2.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. 
automodule:: paddlenlp.transformers.roformerv2.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.rst b/docs/source/paddlenlp.transformers.rst new file mode 100644 index 0000000000000000000000000000000000000000..1ec7ff404395051fad08d3a0e2d9d722b0a156eb --- /dev/null +++ b/docs/source/paddlenlp.transformers.rst @@ -0,0 +1,83 @@ +paddlenlp.transformers +============================== + +.. automodule:: paddlenlp.transformers + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.albert + paddlenlp.transformers.artist + paddlenlp.transformers.auto + paddlenlp.transformers.bart + paddlenlp.transformers.bert + paddlenlp.transformers.bert_japanese + paddlenlp.transformers.bigbird + paddlenlp.transformers.blenderbot + paddlenlp.transformers.blenderbot_small + paddlenlp.transformers.chinesebert + paddlenlp.transformers.codegen + paddlenlp.transformers.convbert + paddlenlp.transformers.ctrl + paddlenlp.transformers.dallebart + paddlenlp.transformers.distilbert + paddlenlp.transformers.electra + paddlenlp.transformers.ernie + paddlenlp.transformers.ernie_ctm + paddlenlp.transformers.ernie_doc + paddlenlp.transformers.ernie_gen + paddlenlp.transformers.ernie_gram + paddlenlp.transformers.ernie_m + paddlenlp.transformers.fnet + paddlenlp.transformers.funnel + paddlenlp.transformers.gau_alpha + paddlenlp.transformers.gpt + paddlenlp.transformers.layoutlm + paddlenlp.transformers.layoutlmv2 + paddlenlp.transformers.layoutxlm + paddlenlp.transformers.luke + paddlenlp.transformers.mbart + paddlenlp.transformers.megatronbert + paddlenlp.transformers.mobilebert + paddlenlp.transformers.mpnet + paddlenlp.transformers.nezha + paddlenlp.transformers.opt + paddlenlp.transformers.ppminilm + paddlenlp.transformers.prophetnet + paddlenlp.transformers.reformer + paddlenlp.transformers.rembert + paddlenlp.transformers.roberta + paddlenlp.transformers.roformer + paddlenlp.transformers.roformerv2 + paddlenlp.transformers.semantic_search + paddlenlp.transformers.skep + paddlenlp.transformers.squeezebert + paddlenlp.transformers.t5 + paddlenlp.transformers.tinybert + paddlenlp.transformers.transformer + paddlenlp.transformers.unified_transformer + paddlenlp.transformers.unimo + paddlenlp.transformers.xlm + paddlenlp.transformers.xlnet + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.attention_utils + paddlenlp.transformers.convert_slow_tokenizer + paddlenlp.transformers.distill_utils + paddlenlp.transformers.export + paddlenlp.transformers.generation_utils + paddlenlp.transformers.model_outputs + paddlenlp.transformers.model_utils + paddlenlp.transformers.optimization + paddlenlp.transformers.sentencepiece_model_pb2 + paddlenlp.transformers.tokenizer_utils + paddlenlp.transformers.tokenizer_utils_base + paddlenlp.transformers.tokenizer_utils_fast + paddlenlp.transformers.utils diff --git a/docs/source/paddlenlp.transformers.semantic_indexing.modeling.rst b/docs/source/paddlenlp.transformers.semantic_indexing.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e9280aa743c86c592b86f10770b3f919cb0d28f9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_indexing.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================================= + +.. 
automodule:: paddlenlp.transformers.semantic_indexing.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.semantic_indexing.rst b/docs/source/paddlenlp.transformers.semantic_indexing.rst new file mode 100644 index 0000000000000000000000000000000000000000..ae92d819be16fb6af3d025a8ea4624b1966cb1fa --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_indexing.rst @@ -0,0 +1,13 @@ +semantic\_indexing +================================================= + +.. automodule:: paddlenlp.transformers.semantic_indexing + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.semantic_indexing.modeling diff --git a/docs/source/paddlenlp.transformers.semantic_search.modeling.rst b/docs/source/paddlenlp.transformers.semantic_search.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..8d64e83628531cda8244ec0dc384cbbe3cc37bbd --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_search.modeling.rst @@ -0,0 +1,7 @@ +modeling +======================================================= + +.. automodule:: paddlenlp.transformers.semantic_search.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.semantic_search.rst b/docs/source/paddlenlp.transformers.semantic_search.rst new file mode 100644 index 0000000000000000000000000000000000000000..cb255233b7f3520e94557431b1fc8e13f5018bd5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_search.rst @@ -0,0 +1,13 @@ +semantic\_search +=============================================== + +.. automodule:: paddlenlp.transformers.semantic_search + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.semantic_search.modeling diff --git a/docs/source/paddlenlp.transformers.sentencepiece_model_pb2.rst b/docs/source/paddlenlp.transformers.sentencepiece_model_pb2.rst new file mode 100644 index 0000000000000000000000000000000000000000..27793b44085afc768616e4143751da07c86d1c54 --- /dev/null +++ b/docs/source/paddlenlp.transformers.sentencepiece_model_pb2.rst @@ -0,0 +1,6 @@ +sentencepiece\_model\_pb2 +======================================================= + +.. automodule:: paddlenlp.transformers.sentencepiece_model_pb2 + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.skep.modeling.rst b/docs/source/paddlenlp.transformers.skep.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..78f4c10a35c339259694912b880a3174e5503195 --- /dev/null +++ b/docs/source/paddlenlp.transformers.skep.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.skep.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.skep.rst b/docs/source/paddlenlp.transformers.skep.rst new file mode 100644 index 0000000000000000000000000000000000000000..116bfd5bfef3f3bd40d34da55f2b845cef2fe353 --- /dev/null +++ b/docs/source/paddlenlp.transformers.skep.rst @@ -0,0 +1,14 @@ +skep +=================================== + +.. automodule:: paddlenlp.transformers.skep + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.skep.modeling + paddlenlp.transformers.skep.tokenizer diff --git a/docs/source/paddlenlp.transformers.skep.tokenizer.rst b/docs/source/paddlenlp.transformers.skep.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a33afb457018ed0d935945a5f7e2aac615d1b753 --- /dev/null +++ b/docs/source/paddlenlp.transformers.skep.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.skep.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.squeezebert.modeling.rst b/docs/source/paddlenlp.transformers.squeezebert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..1992788a2091892931269fb70cf5ed8c1ee74254 --- /dev/null +++ b/docs/source/paddlenlp.transformers.squeezebert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.squeezebert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.squeezebert.rst b/docs/source/paddlenlp.transformers.squeezebert.rst new file mode 100644 index 0000000000000000000000000000000000000000..7498b55a0b271318368d2e18d9fc1502a21e3c75 --- /dev/null +++ b/docs/source/paddlenlp.transformers.squeezebert.rst @@ -0,0 +1,14 @@ +squeezebert +========================================== + +.. automodule:: paddlenlp.transformers.squeezebert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.squeezebert.modeling + paddlenlp.transformers.squeezebert.tokenizer diff --git a/docs/source/paddlenlp.transformers.squeezebert.tokenizer.rst b/docs/source/paddlenlp.transformers.squeezebert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..f6a680c40f5184878a8bf6117dbd767453de2120 --- /dev/null +++ b/docs/source/paddlenlp.transformers.squeezebert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=================================================== + +.. automodule:: paddlenlp.transformers.squeezebert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.t5.modeling.rst b/docs/source/paddlenlp.transformers.t5.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..ee60d516ef5f078bc503698dbe9309c09137827b --- /dev/null +++ b/docs/source/paddlenlp.transformers.t5.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================= + +.. automodule:: paddlenlp.transformers.t5.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.t5.rst b/docs/source/paddlenlp.transformers.t5.rst new file mode 100644 index 0000000000000000000000000000000000000000..35c1aec1639b2ae0ae18b4eac2c43471f30d7ec2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.t5.rst @@ -0,0 +1,14 @@ +t5 +================================= + +.. automodule:: paddlenlp.transformers.t5 + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.t5.modeling + paddlenlp.transformers.t5.tokenizer diff --git a/docs/source/paddlenlp.transformers.t5.tokenizer.rst b/docs/source/paddlenlp.transformers.t5.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a7b912bc055d226a230a7cb98786d2fddbd31f41 --- /dev/null +++ b/docs/source/paddlenlp.transformers.t5.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +========================================== + +.. automodule:: paddlenlp.transformers.t5.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tinybert.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.tinybert.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..0f37e8ad03e7f8c92cc4106a8e92395a831964ef --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +======================================================== + +.. automodule:: paddlenlp.transformers.tinybert.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tinybert.modeling.rst b/docs/source/paddlenlp.transformers.tinybert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..344656a9970690a5572f9fdd7aa5b153980f75d7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.tinybert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tinybert.rst b/docs/source/paddlenlp.transformers.tinybert.rst new file mode 100644 index 0000000000000000000000000000000000000000..43b80a253545b730424cc1d9c1158dab7cfa88d0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.rst @@ -0,0 +1,15 @@ +tinybert +======================================= + +.. automodule:: paddlenlp.transformers.tinybert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.tinybert.fast_tokenizer + paddlenlp.transformers.tinybert.modeling + paddlenlp.transformers.tinybert.tokenizer diff --git a/docs/source/paddlenlp.transformers.tinybert.tokenizer.rst b/docs/source/paddlenlp.transformers.tinybert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..eaa651268369a0512e797c3959f55caddde60e00 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.tinybert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tokenizer_utils.rst b/docs/source/paddlenlp.transformers.tokenizer_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..52693b51a499a90b1fdb11e91980dca5ccd100f3 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tokenizer_utils.rst @@ -0,0 +1,8 @@ +tokenizer\_utils +============================================== + +.. 
automodule:: paddlenlp.transformers.tokenizer_utils + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.transformers.tokenizer_utils_base.rst b/docs/source/paddlenlp.transformers.tokenizer_utils_base.rst new file mode 100644 index 0000000000000000000000000000000000000000..91de4951b9060912fe03b7c457e9138237fc939c --- /dev/null +++ b/docs/source/paddlenlp.transformers.tokenizer_utils_base.rst @@ -0,0 +1,8 @@ +tokenizer\_utils\_base +==================================================== + +.. automodule:: paddlenlp.transformers.tokenizer_utils_base + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.transformers.tokenizer_utils_fast.rst b/docs/source/paddlenlp.transformers.tokenizer_utils_fast.rst new file mode 100644 index 0000000000000000000000000000000000000000..3c24a50a8d8cdefa327bed2896865d52db8d00d3 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tokenizer_utils_fast.rst @@ -0,0 +1,8 @@ +tokenizer\_utils\_fast +====================================================== + +.. automodule:: paddlenlp.transformers.tokenizer_utils_fast + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.transformers.transformer.modeling.rst b/docs/source/paddlenlp.transformers.transformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..53361828f7865dccd5920dbd138ed5898af953b4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.transformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.transformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.transformer.rst b/docs/source/paddlenlp.transformers.transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..7d8435bff395c4b9ddce9896f9ab31393e011de4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.transformer.rst @@ -0,0 +1,13 @@ +transformer +========================================== + +.. automodule:: paddlenlp.transformers.transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.transformer.modeling diff --git a/docs/source/paddlenlp.transformers.unified_transformer.convert.rst b/docs/source/paddlenlp.transformers.unified_transformer.convert.rst new file mode 100644 index 0000000000000000000000000000000000000000..c0722b806a7c4d9732895a789c6929da7ab49672 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.convert.rst @@ -0,0 +1,7 @@ +convert +========================================================== + +.. automodule:: paddlenlp.transformers.unified_transformer.convert + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unified_transformer.modeling.rst b/docs/source/paddlenlp.transformers.unified_transformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff123566d6a32200691bbc02c39387111718bff4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================================== + +.. 
automodule:: paddlenlp.transformers.unified_transformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unified_transformer.rst b/docs/source/paddlenlp.transformers.unified_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..1c6a1b68df30f8e6e1a77890f3c3f0b8461304f0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.rst @@ -0,0 +1,15 @@ +unified\_transformer +=================================================== + +.. automodule:: paddlenlp.transformers.unified_transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.unified_transformer.convert + paddlenlp.transformers.unified_transformer.modeling + paddlenlp.transformers.unified_transformer.tokenizer diff --git a/docs/source/paddlenlp.transformers.unified_transformer.tokenizer.rst b/docs/source/paddlenlp.transformers.unified_transformer.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..96a227173b17b2ed81e52a9f7503938222bb521d --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================================ + +.. automodule:: paddlenlp.transformers.unified_transformer.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unimo.modeling.rst b/docs/source/paddlenlp.transformers.unimo.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..026cf3230aa139652306d111e10c8d3aaac5bd28 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unimo.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.unimo.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unimo.rst b/docs/source/paddlenlp.transformers.unimo.rst new file mode 100644 index 0000000000000000000000000000000000000000..8a4d4afe8d5d3aeb5c7748ab3ebbce31f5a71c0d --- /dev/null +++ b/docs/source/paddlenlp.transformers.unimo.rst @@ -0,0 +1,14 @@ +unimo +==================================== + +.. automodule:: paddlenlp.transformers.unimo + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.unimo.modeling + paddlenlp.transformers.unimo.tokenizer diff --git a/docs/source/paddlenlp.transformers.unimo.tokenizer.rst b/docs/source/paddlenlp.transformers.unimo.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d7e711fa5b01ae351f5d524480fd96fcabcafa15 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unimo.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.unimo.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.utils.rst b/docs/source/paddlenlp.transformers.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..bd1f403f5cb64d78aa71c6663b4f0ee892d427e6 --- /dev/null +++ b/docs/source/paddlenlp.transformers.utils.rst @@ -0,0 +1,7 @@ +utils +=================================== + +.. 
automodule:: paddlenlp.transformers.utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlm.modeling.rst b/docs/source/paddlenlp.transformers.xlm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..24df639bdc43f4d9b3245bcfae462047b3dc16d3 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlm.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================== + +.. automodule:: paddlenlp.transformers.xlm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlm.rst b/docs/source/paddlenlp.transformers.xlm.rst new file mode 100644 index 0000000000000000000000000000000000000000..54ec485dd13b5c7f5f78bb868fa570099b1f3b69 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlm.rst @@ -0,0 +1,14 @@ +xlm +================================== + +.. automodule:: paddlenlp.transformers.xlm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.xlm.modeling + paddlenlp.transformers.xlm.tokenizer diff --git a/docs/source/paddlenlp.transformers.xlm.tokenizer.rst b/docs/source/paddlenlp.transformers.xlm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..9ba4a1cddbe0401189a2a9aa461d2834f452e827 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=========================================== + +.. automodule:: paddlenlp.transformers.xlm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlnet.modeling.rst b/docs/source/paddlenlp.transformers.xlnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..edc53492499e2798fb7874ae5e0fa24e0d85fe35 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.xlnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlnet.rst b/docs/source/paddlenlp.transformers.xlnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..eb6b5c4911cb5edfe1b65500f7aa9e25b34e2106 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlnet.rst @@ -0,0 +1,14 @@ +xlnet +==================================== + +.. automodule:: paddlenlp.transformers.xlnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.xlnet.modeling + paddlenlp.transformers.xlnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.xlnet.tokenizer.rst b/docs/source/paddlenlp.transformers.xlnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..b55fa0a21d452861a1e0f65b98adcaa9559849bf --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.xlnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.batch_sampler.rst b/docs/source/paddlenlp.utils.batch_sampler.rst new file mode 100644 index 0000000000000000000000000000000000000000..5c9ef2fb6068aa940cc0033136871abbae4af96e --- /dev/null +++ b/docs/source/paddlenlp.utils.batch_sampler.rst @@ -0,0 +1,6 @@ +batch\_sampler +===================================== + +.. 
automodule:: paddlenlp.utils.batch_sampler + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.utils.downloader.rst b/docs/source/paddlenlp.utils.downloader.rst new file mode 100644 index 0000000000000000000000000000000000000000..946af01660db9a371a9ee89a6ed402589da2b33f --- /dev/null +++ b/docs/source/paddlenlp.utils.downloader.rst @@ -0,0 +1,7 @@ +downloader +================================= + +.. automodule:: paddlenlp.utils.downloader + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.env.rst b/docs/source/paddlenlp.utils.env.rst new file mode 100644 index 0000000000000000000000000000000000000000..c650dfead4b252c28afcde5f07870363e2eaa814 --- /dev/null +++ b/docs/source/paddlenlp.utils.env.rst @@ -0,0 +1,7 @@ +env +========================== + +.. automodule:: paddlenlp.utils.env + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.file_lock.rst b/docs/source/paddlenlp.utils.file_lock.rst new file mode 100644 index 0000000000000000000000000000000000000000..dbf1a60ebc4a0a865be67c909f81dfbec70a56c4 --- /dev/null +++ b/docs/source/paddlenlp.utils.file_lock.rst @@ -0,0 +1,7 @@ +file\_lock +================================= + +.. automodule:: paddlenlp.utils.file_lock + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.import_utils.rst b/docs/source/paddlenlp.utils.import_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..49a928b734552d878904eb0b47efc75460df3c1d --- /dev/null +++ b/docs/source/paddlenlp.utils.import_utils.rst @@ -0,0 +1,7 @@ +import\_utils +==================================== + +.. automodule:: paddlenlp.utils.import_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.log.rst b/docs/source/paddlenlp.utils.log.rst new file mode 100644 index 0000000000000000000000000000000000000000..343d32eb7ceee056a64fe8fddf79053b9cac371d --- /dev/null +++ b/docs/source/paddlenlp.utils.log.rst @@ -0,0 +1,7 @@ +log +========================== + +.. automodule:: paddlenlp.utils.log + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.profiler.rst b/docs/source/paddlenlp.utils.profiler.rst new file mode 100644 index 0000000000000000000000000000000000000000..3a8d3644bf82f5c0c0c04bbf619788c1e69df0bb --- /dev/null +++ b/docs/source/paddlenlp.utils.profiler.rst @@ -0,0 +1,7 @@ +profiler +=============================== + +.. automodule:: paddlenlp.utils.profiler + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.rst b/docs/source/paddlenlp.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..55f2d79907d5b0abf9670af8725cb69d37c43285 --- /dev/null +++ b/docs/source/paddlenlp.utils.rst @@ -0,0 +1,20 @@ +paddlenlp.utils +======================= + +.. automodule:: paddlenlp.utils + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.utils.batch_sampler + paddlenlp.utils.downloader + paddlenlp.utils.env + paddlenlp.utils.file_lock + paddlenlp.utils.import_utils + paddlenlp.utils.log + paddlenlp.utils.profiler + paddlenlp.utils.tools diff --git a/docs/source/paddlenlp.utils.tools.rst b/docs/source/paddlenlp.utils.tools.rst new file mode 100644 index 0000000000000000000000000000000000000000..d3451531212cb573d60c6d08f79881009e2b8693 --- /dev/null +++ b/docs/source/paddlenlp.utils.tools.rst @@ -0,0 +1,7 @@ +tools +============================ + +.. automodule:: paddlenlp.utils.tools + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/trainer.md b/docs/trainer.md new file mode 100644 index 0000000000000000000000000000000000000000..31728bbaa627b0ed523ee8f4d33aca01580ba66b --- /dev/null +++ b/docs/trainer.md @@ -0,0 +1,674 @@ +# PaddleNLP Trainer API + +PaddleNLP提供了Trainer训练API,针对训练过程的通用训练配置做了封装,比如: + +- 优化器、学习率调度等训练配置 +- 多卡,混合精度,梯度累积等功能 +- checkpoint断点,断点重启(数据集,随机数恢复) +- 日志显示,loss可视化展示等 + +用户输入模型,数据集,就可以使用Trainer API高效快速的实现预训练、微调等任务。 + + +## Trainer基本使用方法介绍 + +下面是用户使用 Trainer API进行finetune任务的简单示例,这里以中文情感分类数据集`chnsenticorp`为例。 +更详细的使用可以参考[CLUE Trainer](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/benchmark/clue/classification/run_clue_classifier_trainer.py)版本。 + +1. 导入需要用到的头文件。 + - 主要是模型、Tokenizer + - 还有Trainer组件 + - 其中`Trainer`是训练主要入口,用户传入模型,数据集,即可进行训练 + - `TrainingArguments` 包含了用户需要的大部分训练参数。 + - `PdArgumentParser` 是用户输出参数的工具 +```python +from functools import partial +import paddle +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.trainer import Trainer, TrainingArguments, PdArgumentParser +``` +2. 设置好用户参数 + - PdArgumentParser 可以接受多个类似`TrainingArguments`的参数。用户可以自定义所需要的`ModelArguments`, `DataArguments`为 tuple 传入 PdArgumentParser即可。 + - 这些参数都是通过`python xxx.py --dataset xx --max_seq_length xx`的方式传入。`TrainingArguments`的所有可配置参数见后文。 +```python +from dataclasses import dataclass +@dataclass +class DataArguments: + dataset: str = field( + default=None, + metadata={"help": "The name of the dataset to use."}) + + max_seq_length: int = field( + default=128, + metadata={"help": "The maximum total input sequence length after tokenization."}) + +parser = PdArgumentParser(TrainingArguments, DataArguments) +(training_args, data_args) = parser.parse_args_into_dataclasses() +``` + +3. 加载模型,tokenizer, 数据集 + - 注意,这里的数据集,需要输出的是一个dict。dict中的key,需要和模型的输入名称对应。 + - 这里的,`labels`如果模型没有使用到,我们还需要额外定义`criterion`,计算最后的loss损失。 +```python +train_dataset = load_dataset("chnsenticorp", splits=["train"]) +model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(train_dataset.label_list)) +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + +def convert_example(example, tokenizer): + encoded_inputs = tokenizer(text=example["text"], max_seq_len=128, pad_to_max_seq_len=True) + encoded_inputs["labels"] = int(example["label"]) + return encoded_inputs + +train_dataset = train_dataset.map(partial(convert_example, tokenizer=tokenizer)) +``` + +4. 
构造Trainer实例,进行模型训练。 + - 这里传入`model,criterion,args,train_dataset,tokenizer`这些训练需要的组件,构建了实例化的trainer + - 使用trainer.train()接口开始训练过程。训练完成后,可以保存模型,保存一些日志。 +```python +trainer = Trainer( + model=model, + criterion=paddle.nn.loss.CrossEntropyLoss(), + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + tokenizer=tokenizer) + +if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_state() +``` +预训练的使用方式可以参考[ERNIE-1.0 Trainer](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/run_pretrain_trainer.py)版本。 + + +## Trainer进阶分布式能力使用介绍 + +**通用分布式能力** +对于通用的分布式能力, PaddleNLP主要做了数据并行data_parallel, 分布式参数sharding功能的支持. +这类功能无需用户修改组网, 直接多卡即可运行. + +用户使用 `paddle.distruted.launch --devices "0,1,2,3" train.py`即可将运行的程序切换为多卡数据并行. +如果想要使用sharding功能, 减少模型显存占用, 指定参数`--sharding "stage2"`即可. 更多sharding功能配置见参数介绍部分. + + +**混合并行分布式能力** + +飞桨4D并行, 即: data parallel + sharding parallel + tensor parallel + pipeline parallel. + +混合并行这里, 主要添加了 tensor parallel (TP) 和 pipeline parallel(PP)支持. +目前, PaddleNLP主要对一些大模型, 如 GPT, Llama等做了 TP PP支持, 用户可以使用这些策略. + +相关代码实现可以参考llama训练的[例子](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/llama/run_trainer_tp4pp2.sh) + +流水线并行的组网改造可以参见[modeling_pp.py](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/llama/modeling_pp.py) + + +当组网适配好 张量并行(TP), 流水线并行(PP)之后, 用户使用 `--tensor_parallel_degree` `--pipeline_parallel_degree` 即可启用混合并行训练. + + + + +## Trainer 实例化参数介绍 +Trainer 是一个简单,但功能完整的 Paddle训练和评估模块,并针对 PaddleNLP 模型进行了优化。 + +```python +参数: + model([`PretrainedModel`] 或 `paddle.nn.Layer`,可选): + 用于训练、评估或预测的模型。 + [`Trainer`] 对PaddleNLP的 [`PretrainedModel`] 一起使用进行了优化。你仍然可以使用 + 您自己的模型定义为`paddle.nn.Layer`,只要它们的工作方式与 PaddleNLP 模型相同。 + + ([`PretrainedModel`] or `paddle.nn.Layer`, *optional*): + The model to train, evaluate or use for predictions. + criterion (`paddle.nn.Layer`,*可选*): + model可能只输出中间结果loggit,如果想对模型的输出做更多的计算,可以添加criterion层。 + + The model may only output the loggit, if you want do more computation for the output of model, + you can add the criterion Layer. + + args([`TrainingArguments`],可选): + 训练时需要用到的参数。将默认使用 [`TrainingArguments`] 初始化。 + `output_dir` 设置为当前目录中名为 *tmp_trainer* 的目录(如果未提供)。 + + ([`TrainingArguments`], *optional*): + The arguments to tweak for training. Will default to a basic instance of [`TrainingArguments`] with the + `output_dir` set to a directory named *tmp_trainer* in the current directory if not provided. + + data_collator(`DataCollator`,可选): + 用于将 `train_dataset` 或 `eval_dataset` 的数据,组合为batch的函数。 + 如果没有提供 `tokenizer`,则默认为 [`default_data_collator`], 否则为 + [`DataCollatorWithPadding`]。 + + (`DataCollator`, *optional*): + The function to use to form a batch from a list of elements of `train_dataset` or `eval_dataset`. Will + default to [`default_data_collator`] if no `tokenizer` is provided, an instance of + [`DataCollatorWithPadding`] otherwise. + + + train_dataset(`paddle.io.Dataset` 或 `paddle.io.IterableDataset`,可选): + 用于训练的数据集。如果是 `datasets.Dataset`,那么 + `model.forward()` 不需要的输入字段会被自动删除。 + + (`paddle.io.Dataset` or `paddle.io.IterableDataset`, *optional*): + The dataset to use for training. If it is an `datasets.Dataset`, columns not accepted by the + `model.forward()` method are automatically removed. 
+ + eval_dataset(`paddle.io.Dataset` 或 `Dict[str, paddle.io.Dataset]`,可选): + 用于评估的数据集。如果是 `datasets.Dataset`,那么 + `model.forward()` 不需要的输入字段会被自动删除。 + 如果它是一个字典,则将对字典中每个数据集进行评估, + 并将字典中的键添加到评估指标名称前。 + + The dataset to use for evaluation. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. If it is a dictionary, it will evaluate on each + dataset prepending the dictionary key to the metric name. + + tokenizer([`PretrainedTokenizer`],可选): + 用于数据预处理的tokenizer。如果传入,将用于自动Pad输入 + batch输入的最大长度,它随模型保存,可以重新运行中断的训练过程。 + + ([`PretrainedTokenizer`], *optional*): + The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the + maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an + interrupted training or reuse the fine-tuned model. + + compute_metrics (`Callable[[EvalPrediction], Dict]`, 可选): + 用于评估的计算指标的函数。必须采用 [`EvalPrediction`] 并返回 + dict形式的metrics结果。 + + (`Callable[[EvalPrediction], Dict]`, *optional*): + The function that will be used to compute metrics at evaluation. Must take a [`EvalPrediction`] and return + a dictionary string to metric values. + + callbacks (List of [`TrainerCallback`],*可选*): + 用于自定义训练call列表函数。将这些函数会被添加到默认回调函数列表。 + 如果要删除使用的回调函数,请使用 [`Trainer.remove_callback`] 方法。 + + A list of callbacks to customize the training loop. Will add those to the list of default callbacks. + If you want to remove one of the default callbacks used, use the [`Trainer.remove_callback`] method. + + optimizers (`Tuple[paddle.optimizer.Optimizer, paddle.optimizer.lr.LRScheduler]`, 可选): + 一个tuple, 包含要使用Optimizer和LRScheduler。将默认为模型上的 [`AdamW`] 实例 + 和LinearDecayWithWarmup。 + + (`Tuple[paddle.optimizer.Optimizer, paddle.optimizer.lr.LRScheduler]`, *optional*) + A tuple containing the optimizer and the scheduler to use. Will default to an instance of [`AdamW`] on your model + and a scheduler [`LinearDecayWithWarmup`]. + + preprocess_logits_for_metrics (`Callable[[paddle.Tensor, paddle.Tensor], paddle.Tensor]`, 可选)): + 一个函数, 在每次评估之前对logits进行预处理。 + + (`Callable[[paddle.Tensor, paddle.Tensor], paddle.Tensor]`, *optional*) + A function that preprocess the logits right before caching them at each evaluation step. Must take two + tensors, the logits and the labels, and return the logits once processed as desired. The modifications made + by this function will be reflected in the predictions received by `compute_metrics`. +``` + + +## TrainingArguments 参数介绍 +```python + --output_dir + 保存模型输出和中间checkpoints的输出目录。(`str`, 必须, 默认为 `None`) + + The output directory where the model predictions and + checkpoints will be written. (default: None) + + --overwrite_output_dir + 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点 + 目录,则使用它继续训练。(`bool`, 可选, 默认为 `False`) + + Overwrite the content of the output directory. Use + this to continue training if output_dir points to a + checkpoint directory. (default: False) + + --do_train + 是否进行训练任务。 注:`Trainer`不直接使用此参数,而是提供给用户 + 的训练/评估脚本使用。(`bool`, 可选, 默认为 `False`) + + Whether to run training. (default: False) + + --do_eval + 是否进行评估任务。同上。(`bool`, 可选, 默认为 `False`) + + Whether to run eval on the dev set. (default: False) + + --do_predict + 是否进行预测任务。同上。(`bool`, 可选, 默认为 `False`) + + Whether to run predictions on the test set. (default:False) + + --do_export + 是否进行模型导出任务。同上。(`bool`, 可选, 默认为 `False`) + + Whether to export infernece model. 
(default: False) + + --evaluation_strategy {no,steps,epoch} + 评估策略,(`str`, 可选,默认为 `"no"`): + 训练期间采用的评估策略。可能的值为: + - `"no"`:训练期间不进行评估。 + - `"steps"`:评估在每个`eval_steps`完成(并记录)。 + - `"epoch"`:在每个 epoch 结束时进行评估。 + + The evaluation strategy to use. (default: no) + + --prediction_loss_only + 在执行评估和预测任务时,只返回loss的值。(`bool`, 可选, 默认为 `False`) + + When performing evaluation and predictions, only + returns the loss. (default: False) + + --per_device_train_batch_size + 用于训练的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) + + Batch size per GPU core/CPU for training. (default: 8) + + --per_device_eval_batch_size + 用于评估的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) + + Batch size per GPU core/CPU for evaluation. (default:8) + + --gradient_accumulation_steps + 在执行反向,更新回传梯度之前,累积梯度的更新步骤数(`int`,可选,默认为 1) + + Number of updates steps to accumulate before + performing a backward/update pass. (default: 1) + + --eval_accumulation_steps + 在将结果移动到CPU之前,累积输出张量的预测步骤数。如果如果未设置, + 则在移动到CPU之前,整个预测都会在GPU上累积(速度更快需要更多的显存)。 + (`int`,可选,默认为 None 不设置) + + Number of predictions steps to accumulate the output tensors for, + before moving the results to the CPU. If left unset, the whole predictions are + accumulated on GPU before being moved to the CPU (faster butrequires more memory) + (default: None) + + --learning_rate + 优化器的初始学习率, (`float`,可选,默认为 5e-05) + + The initial learning rate for optimizer. (default: 5e-05) + + --weight_decay + 除了所有bias和 LayerNorm 权重之外,应用于所有层的权重衰减数值。(`float`,可选,默认为 0.0) + + Weight decay for AdamW if we apply some. (default: + 0.0) + + --adam_beta1 + AdamW的优化器的 beta1 超参数。(`float`,可选,默认为 0.9) + + Beta1 for AdamW optimizer (default: 0.9) + + --adam_beta2 + AdamW的优化器的 beta2 超参数。(`float`,可选,默认为 0.999) + + Beta2 for AdamW optimizer (default: 0.999) + + --adam_epsilon + AdamW的优化器的 epsilon 超参数。(`float`,可选,默认为 1e-8) + + Epsilon for AdamW optimizer. (default: 1e-08) + + --max_grad_norm + 最大梯度范数(用于梯度裁剪)。(`float`,可选,默认为 1.0) + + Max gradient norm. (default: 1.0) + + --num_train_epochs + 要执行的训练 epoch 总数(如果不是整数,将在停止训练 + 之前执行最后一个 epoch 的小数部分百分比)。 + (`float`, 可选, 默认为 3.0): + + Total number of training epochs to perform. (default:3.0) + + --max_steps + 如果设置为正数,则表示要执行的训练步骤总数。 + 覆盖`num_train_epochs`。(`int`,可选,默认为 -1) + + If > 0: set total number of training steps to + perform.Override num_train_epochs. (default: -1 + + --lr_scheduler_type + 要使用的学习率调度策略。 (`str`, 可选, 默认为 `"linear"`) + + The scheduler type to use. (default: linear) 支持,linear, cosine, constant, constant_with_warmup. + + --warmup_ratio + 用于从 0 到 `learning_rate` 的线性warmup的总训练步骤的比例。(`float`,可选,默认为 0.0) + + Linear warmup over warmup_ratio fraction of total + steps. (default: 0.0) + + --warmup_steps + 用于从 0 到 `learning_rate` 的线性warmup的步数。覆盖warmup_ratio参数。 + (`int`,可选,默认为 0) + + Linear warmup over warmup_steps. (default: 0) + + --log_on_each_node + 在多节点分布式训练中,是在每个节点上记录一次,还是仅在主节点上记录节点。(`bool`,可选,默认为`True`) + + When doing a multinode distributed training, whether + to log once per node or just once on the main node. + (default: True) + + --logging_dir + VisualDL日志目录。(`str`,可选,默认为None) + None情况下会修改为 *output_dir/runs/**CURRENT_DATETIME_HOSTNAME** + + VisualDL log dir. (default: None) + + --logging_strategy {no,steps,epoch} + (`str`, 可选,默认为 `"steps"`) + 训练期间采用的日志记录策略。可能的值为: + - `"no"`:训练期间不进行记录。 + - `"epoch"`:记录在每个 epoch 结束时完成。 + - `"steps"`:记录是每 `logging_steps` 完成的。 + + The logging strategy to use. 
(default: steps) + + --logging_first_step + 是否记录和评估第一个 `global_step`。(`bool`,可选,默认为`False`) + + Log the first global_step (default: False) + + --logging_steps + 如果 `logging_strategy="steps"`,则两个日志之间的更新步骤数。 + (`int`,可选,默认为 500) + + Log every X updates steps. (default: 500) + + --save_strategy {no,steps,epoch} + (`str`, 可选,默认为 `"steps"`) + 训练期间采用的checkpoint保存策略。可能的值为: + - `"no"`:训练期间不保存。 + - `"epoch"`:保存在每个 epoch 结束时完成。 + - `"steps"`:保存是每`save_steps`完成。 + The checkpoint save strategy to use. (default: steps) + + --save_steps + 如果 `save_strategy="steps"`,则在两个checkpoint保存之间的更新步骤数。 + (`int`,可选,默认为 500) + + Save checkpoint every X updates steps. (default: 500) + + --save_total_limit + 如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints + `输出目录`。(`int`,可选) + + Limit the total amount of checkpoints. Deletes the + older checkpoints in the output_dir. Default is + unlimited checkpoints (default: None) + + --save_on_each_node + 在做多节点分布式训练时,是在每个节点上保存模型和checkpoints, + 还是只在主节点上。当不同的节点使用相同的存储时,不应激活此功能, + 因为每个节点的文件将以相同的名称保存。(`bool`, 可选, 默认为 `False`) + + When doing multi-node distributed training, whether to + save models and checkpoints on each node, or only on + the main one (default: False) + + --no_cuda + 是否不使用 CUDA,即使CUDA环境可用。(`bool`, 可选, 默认为 `False`) + Do not use CUDA even when it is available (default: + False) + --seed + 设置的随机种子。为确保多次运行的可复现性。(`int`,可选,默认为 42) + + Random seed that will be set at the beginning of + training. (default: 42) + + --bf16 + 是否使用 bf16 混合精度训练而不是 fp32 训练。需要 Ampere 或更高的 NVIDIA + 显卡架构支持。这是实验性质的API,以后可能会修改。 + (`bool`, 可选, 默认为 `False`) + + Whether to use bf16 (mixed) precision instead of + 32-bit. Requires Ampere or higher NVIDIA architecture. + This is an experimental API and it may change. + (default: False) + + --fp16 + 是否使用 fp16 混合精度训练而不是 fp32 训练。 + (`bool`, 可选, 默认为 `False`) + + Whether to use fp16 (mixed) precision instead of + 32-bit (default: False) + + --fp16_opt_level + 混合精度训练模式,可为``O1``或``O2``模式,默认``O1``模式,默认O1. + O1表示混合精度训练,O2表示纯fp16/bf16训练。 + 只在fp16或bf16选项开启时候生效. + (`str`, 可选, 默认为 `O1`) + + For fp16: AMP optimization level selected in + ['O0', 'O1', and 'O2']. See details at https://www.pad + dlepaddle.org.cn/documentation/docs/zh/develop/api/pad + dle/amp/auto_cast_cn.html (default: O1) + --amp_custom_black_list + 飞桨有默认的黑名单,可以根据模型特点设置自定义黑名单。自定义黑名单中的算子在计算时会被认为是数值危险的,它们的影响也可能会在下游算子中观察到。该名单中的算子不会转为 float16/bfloat16 计算。(可选,默认为None) + + The custom black_list. The set of ops that support fp16/bf16 calculation and are considered numerically-dangerous and whose effects may also be observed in downstream ops. These ops will not be converted to fp16/bf16. (default:None) + + --amp_custom_white_list + 飞桨有默认的白名单,通常不需要设置自定义白名单。自定义白名单中的算子在计算时会被认为是数值安全的,并且对性能至关重要。如果设置了该名单,其中的算子会使用 float16/bfloat16 计算。(可选,默认为None) + + The custom white_list. It’s the set of ops that support fp16/bf16 calculation and are considered numerically-safe and performance-critical. These ops will be converted to fp16/bf16. (default:None) + + --amp_master_grad + 当使用pure fp16/bf16的时候, 可能对梯度的数值精度有更高要求, + 例如梯度裁剪, weight decay, 权重更新的时候. + 打开此选项, 梯度的数值精度会变成float32类型. + 只在 --fp16_opt_level O2 生效, 默认为 False + + For amp opt level=’O2’, whether to use float32 weight gradients + for calculations such as gradient clipping, weight decay, and weight updates. + If master_grad is enabled, the weight gradients will be float32 dtype after the backpropagation. + Note: only support model parallel and pipeline parallel for now !!! 
(default: False) + + --scale_loss + fp16/bf16训练时,scale_loss的初始值。 + (`float`,可选,默认为 32768) + + The value of initial scale_loss for fp16. (default: 32768) + + --sharding + 是否使用Paddle的Sharding数据并行功能,用户的参数。支持sharding `stage1`, `stage2` or `stage3`。 + 其中`stage2``stage3`可以和`offload`组合使用。 + 每个种策略分别为: + stage1 : optimizer 中的参数切分到不同卡 + stage2 : optimizer + gradient 中的参数切分到不同卡 + stage3 : parameter + gradient + optimizer 中的参数都切分到不同卡 + offload : offload parameters to cpu 部分参数存放到cpu中 + (`str`, 可选, 默认为 `` 不使用sharding) + + Whether or not to use Paddle Sharding Data Parallel training (in distributed training + only). The base option should be `stage1`, `stage2` or `stage3` and you can add + CPU-offload to `stage2` or `stage3` like this: `stage2 offload` or `stage3 offload`. + Each stage means: + stage1 : optimizer state segmentation + stage2 : optimizer state + gradient segmentation + stage3 : parameter + gradient + optimizer state segmentation + offload : offload parameters to cpu + + --sharding_parallel_degree + 设置sharding的通信组参数,表示通信组的大小。同一个sharding通信组内的参数,进行sharding,分布到不同卡上。 + 不同sharding通信组之间,相当于单纯的数据并行。此选项只在sharding选项开启时候生效。 + 默认值为-1,表示所有训练的卡在同一个通信组内。 + (`int`, 可选, 默认为 `-1`) + + Sharding parameter in certain cards group. For example, aussume we use 2 machines each + with 8 cards, then set sharding_degree=8, sharding will only communication inside machine. + default -1 means sharding parameters between all workers. (`int`, *optional*, defaults to `-1`) + + --tensor_parallel_degree + 张量并行是Megatron论文针对Transformer结构的张量切分方法. + 此方法将一层transformer的计算划分到了不同卡上. + 此参数tensor_parallel_degree表示将一层transformer结构的份数. + 默认值-1, 表示不启用张量并行, + (`int`, 可选, 默认为 `-1`) + (注: 该方法需要修改模型结构, 目前支持GPT/BLOOM/LLAMA/BLOOM/CLM/CHATGLM) + (注: 该方法对通信开销较大, 建议 tensor_parallel_degree<=8, 尽量使用机器内部通信) + + Tensor parallelism is a parallel technique which proposed in (https://arxiv.org/pdf/2104.04473.pdf see 2.3 Tensor Model Parallelism). + This techique splits one transformer layer into multi-cards (For examples, tensor_parallel_degree=4, will split a layer to 4-parts) + tensor_parallel_degree means split the transformer layer to how many parts. + default -1 for not use tensor parallel, Suggest tensor_parallel_degree<=8 for better proformance. + Note, this need model support in source code, currently GPT/BLOOM/LLAMA/BLOOM/CLM/CHATGLM is supported. + + + --pipeline_parallel_degree + 流水线并行是Megatron论文针对多层Transformer结构提出的按层划分方法. + 该方法将多层的transformer结构,按照不同层,均匀划分到不同的卡上. + 然后数据流先后在不同的卡上传递, 形成流水线. + 参数pipeline_parallel_degree表示划分流水线的大小.(假设该参数为4, 模型12层, 则每一个pp stage 包含3层模型) + 默认值-1, 表示不启用流水线并行, + (`int`, 可选, 默认为 `-1`) + (注, 使用此功能需要修改源码,请参见language_model/llama/modeling_pp.py文件) + + Pipeline parallelism is parallel technique proposed in (https://arxiv.org/pdf/2104.04473.pdf see 2.2 Pipeline Model Parallelism). + Pipeline parallelism assigns multi-transformer layers to different cards, the micro batch data stream passed between cards like pipelines. + pipeline_parallel_degree means split all transformer layers to how many stages. + default -1 for not use pipeline parallel. + Note. this need model support in source code, see llama modeling_pp.py file + + --pipeline_parallel_config + 对于流水线并行,一些选项会影响训练性能,这里将一些选项配置集中管理,以str形式传入配置. + 支持如下选项: + disable_p2p_cache_shape : 关闭通信时候的tensor shape cache, 如果你的模型输入的tensor, shape 是不断变化的(如sequence length) 必须配置此选项 + disable_partial_send_recv : 关闭与张量并行合用时候的通信优化. + enable_dp_comm_overlap : 开启PP+DP使用时候的通信优化. + enable_delay_scale_loss : 开启, 使得梯度累积, 先累积最后除以累积次数. 而不是每次除以累积次数. 
+ + Some additional config it highly affect the useage of pipeline parallel, we provide some option to config it. + following config is support: + disable_p2p_cache_shape, if you max sequence length is varying, please set disable_p2p_cache_shape. + disable_partial_send_recv, optmize send speed for tensor parallel. + enable_delay_scale_loss, accumulate gradients util optimizer step, all gradients div by inner pipeline accumute step. instead of div accumute step on loss directly. + enable_dp_comm_overlap, fuse data parallel gradient communication. + + + --recompute + 是否使用重计算训练。可以节省显存。 + 重新计算前向过程以获取梯度,减少中间变量显存. + 注:需要组网支持 recompute,默认使用 enable_recompute 关键字作为recompute功能开关。 + (`bool`, 可选, 默认为 `False`) + + Recompute the forward pass to calculate gradients. Used for saving memory (default: False) + + --minimum_eval_times + 最少评估次数,如果当前设置的eval_steps,评估次数少于minimum_eval_times, + 此选项会覆盖eval_steps参数。 + (`int`,可选,默认为 None) + + If under eval_steps, the valid time is less then + minimum_eval_times, the config of override eval_steps. + (default: None) + + --local_rank + 分布式训练时,设备的本地rank值。 + For distributed training: local_rank (default: -1) + + --dataloader_drop_last + 是否丢弃最后一个不完整的批次(如果数据集的长度不能被批次大小整除) + (`bool`,可选,默认为 False) + + Drop the last incomplete batch if it is not divisible + by the batch size. (default: False) + + --eval_steps + 如果 `evaluation_strategy="steps"`,则两次评估之间的更新步骤数。将默认为相同如果未设置,则值为 `logging_steps`。 + (`int`,可选,默认为 None) + + Run an evaluation every X steps. (default: None) + + --max_evaluate_steps + 如果设置为正数,则表示要执行的评估步骤的总数。 + (`int`,可选,默认为 -1) + + If set to a positive number, the total number of evaluation steps to perform. (default: -1) + + --dataloader_num_workers + 用于数据加载的子进程数。 0 表示数据将在主进程制造。 + (`int`,可选,默认为 0) + + Number of subprocesses to use for data loading. 0 means + that the data will be loaded in the main process. (default: 0) + + --past_index + If >=0, uses the corresponding part of the output as + the past state for next step. (default: -1) + + --run_name + An optional descriptor for the run. (default: None) + --device + 运行的设备名称。支持cpu/gpu, 默认gpu + (`str`,可选,默认为 'gpu') + + select cpu, gpu, xpu devices. (default: gpu) + + --disable_tqdm + 是否使用tqdm进度条 + Whether or not to disable the tqdm progress bars. + (default: None) + + --remove_unused_columns + 去除Dataset中不用的字段数据 + Remove columns not required by the model when using an + nlp.Dataset. (default: True) + + --label_names + 训练数据标签label的名称 + The list of keys in your dictionary of inputs that + correspond to the labels. (default: None) + + --load_best_model_at_end + 训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用 + Whether or not to load the best model found during + training at the end of training. (default: False) + + --metric_for_best_model + 最优模型指标,如`eval_accuarcy`等,用于比较模型好坏。 + The metric to use to compare two different models. + (default: None) + + --greater_is_better + 与`metric_for_best_model`配合使用。 + Whether the `metric_for_best_model` should be + maximized or not. (default: None) + + --ignore_data_skip + 重启训练时候,不略过已经训练的数据。 + When resuming training, whether or not to skip the + first epochs and batches to get to the same training + data. (default: False) + + --optim + 优化器名称,默认为adamw,(`str`, 可选,默认为 `adamw`) + The optimizer to use. (default: adamw) + + --report_to + 日志可视化显示,默认使用visualdl可视化展示。(可选,默认为 None,展示所有) + The list of integrations to report the results and + logs to. (default: None) + + --resume_from_checkpoint + 是否从断点重启恢复训练,(可选,默认为 None) + The path to a folder with a valid checkpoint for your + model. 
(default: None) + + --skip_memory_metrics + 是否跳过内存profiler检测。(可选,默认为True,跳过) + Whether or not to skip adding of memory profiler reports + to metrics.(default:True) + + --flatten_param_grads + 是否在优化器中使用flatten_param_grads策略,该策略将素有参数摊平后输入Optimizer更新。目前该策略仅在NPU设备上生效。(可选,默认为False) + Whether use flatten_param_grads method in optimizer, + only used on NPU devices.(default:False) + +``` diff --git a/docs/tutorials/classify.rst b/docs/tutorials/classify.rst new file mode 100644 index 0000000000000000000000000000000000000000..75cc774f6f26f8d9d63c9dbd5cfaaa744e3cb220 --- /dev/null +++ b/docs/tutorials/classify.rst @@ -0,0 +1,3 @@ +======================== +文本分类 +======================== diff --git a/docs/tutorials/embedding.rst b/docs/tutorials/embedding.rst new file mode 100644 index 0000000000000000000000000000000000000000..8c618c6fde192ad03a645d002913cffb94d183bf --- /dev/null +++ b/docs/tutorials/embedding.rst @@ -0,0 +1,3 @@ +======================== +词向量 +======================== diff --git a/docs/tutorials/general_dialogue.rst b/docs/tutorials/general_dialogue.rst new file mode 100644 index 0000000000000000000000000000000000000000..2b9e0f28110e14ad2d52f240b5ce7b6c1438c9bb --- /dev/null +++ b/docs/tutorials/general_dialogue.rst @@ -0,0 +1,3 @@ +======================== +通用对话 +======================== diff --git a/docs/tutorials/lexical_analysis.rst b/docs/tutorials/lexical_analysis.rst new file mode 100644 index 0000000000000000000000000000000000000000..872bcb710582c676dbfee23775fce0523eea17ac --- /dev/null +++ b/docs/tutorials/lexical_analysis.rst @@ -0,0 +1,3 @@ +======================== +词法分析 +======================== diff --git a/docs/tutorials/machine_translation.rst b/docs/tutorials/machine_translation.rst new file mode 100644 index 0000000000000000000000000000000000000000..e153b7a33b51db0266adec18567b346824e22fb3 --- /dev/null +++ b/docs/tutorials/machine_translation.rst @@ -0,0 +1,3 @@ +======================== +机器翻译 +======================== diff --git a/docs/tutorials/ner.rst b/docs/tutorials/ner.rst new file mode 100644 index 0000000000000000000000000000000000000000..cf2a8b257aba879d7fdab441201f628d526a7b24 --- /dev/null +++ b/docs/tutorials/ner.rst @@ -0,0 +1,3 @@ +======================== +序列标注 +======================== diff --git a/docs/tutorials/overview.rst b/docs/tutorials/overview.rst new file mode 100644 index 0000000000000000000000000000000000000000..040de50bc802f667a57839161db333b8163e3c6a --- /dev/null +++ b/docs/tutorials/overview.rst @@ -0,0 +1,43 @@ +============ +整体介绍 +============ + + +案例集 +---------- + + - 词向量 + + - `使用预训练词向量改善模型效果 `_ + + - 文本分类 + + - `基于LSTM等RNN网络的文本分类 `_ + - `基于预训练模型的文本分类 `_ + - `自定义数据集实现文本多分类任务 `_ + + - 信息抽取 + + - `使用BiGRU-CRF模型完成快递单信息抽取 `_ + - `使用预训练模型ERNIE优化快递单信息抽取 `_ + - `关系抽取 `_ + - `事件抽取 `_ + + - 阅读理解式问答 + + - `使用预训练模型完成阅读理解 `_ + + - 对话 + + - `多技能对话 `_ + + - 文本生成 + + - `使用Seq2Seq模型完成自动对联 `_ + - `使用预训练模型ERNIE-GEN实现智能写诗 `_ + + - 时序预测 + + - `使用TCN网络完成新冠疫情病例数预测 `_ + + 更多教程参见 `PaddleNLP on AI Studio `_ \ No newline at end of file diff --git a/docs/tutorials/reading_comprehension.rst b/docs/tutorials/reading_comprehension.rst new file mode 100644 index 0000000000000000000000000000000000000000..7ff09f6388c6484b1f1598578da13e3fb6215d3d --- /dev/null +++ b/docs/tutorials/reading_comprehension.rst @@ -0,0 +1,3 @@ +======================== +阅读理解 +======================== diff --git a/docs/tutorials/semantic_matching.rst b/docs/tutorials/semantic_matching.rst new file mode 100644 index 
0000000000000000000000000000000000000000..40bd7cfd9c82c1e611529749535ee23e6824c578 --- /dev/null +++ b/docs/tutorials/semantic_matching.rst @@ -0,0 +1,3 @@ +======================== +语义匹配 +======================== diff --git a/docs/tutorials/text_generation.rst b/docs/tutorials/text_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..3fbb1f3d1be82138fc5c8e2a02df5fc8e44056a5 --- /dev/null +++ b/docs/tutorials/text_generation.rst @@ -0,0 +1,3 @@ +======================== +文本生成 +======================== diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000000000000000000000000000000000000..817c9400b2ddf39de38ce0143c64cda8c92b3fcc --- /dev/null +++ b/examples/README.md @@ -0,0 +1,43 @@ +# PaddleNLP Examples + +PaddleNLP旨在提供覆盖从研究到产业应用的丰富示例,助力开发者加速文本任务开发效率。 + +PaddleNLP provides rich application examples covering mainstream NLP task to help developers accelerate problem solving. + +## NLP 基础技术 (NLP Basic Technique) + +| 目录 Folder | 任务 Task | +| :--------------- | ------- | +| word_embedding | [词向量 (Word Embedding)](./word_embedding/) | +| lexical_analysis | [词法分析 (Lexical Analysis)](./lexical_analysis/) | +| dependency_parsing | [句法依存分析 (Dependency Parsing)](./dependency_parsing/) | +| language_model | [预训练语言模型 (Pretrained Language Model)](./language_model/) | +| text_to_sql | [语义解析 (Semantic Parsing/Text to SQL)](./text_to_sql):star: | +| text_classification | [文本分类 (Text Classification)](./text_classification/) | +| text_matching | [文本匹配 (Text Matching)](./text_matching/) | +| text_generation | [文本生成 (Text Generation)](./text_generation/) | +| text_summarization | [文本摘要 (Text Summarization)](./text_summarization/) | +| text_correction |[文本纠错 (Text Correction)](./text_correction/):star: | +| semantic_indexing | [语义索引 (Semantic Indexing)](./semantic_indexing/)| +| information_extraction | [信息抽取 (Information Extraction)](./information_extraction/) | +| question_generation | [问题生成 (Question Generation)](./question_generation/) | + +## NLP 系统应用 (NLP System Applications) + +| 目录 Folder | 任务 Task | +| :--------------- | ------- | +| sentiment_analysis|[情感分析 (Sentiment Analysis)](./sentiment_analysis/):star2: | +| dialogue |[通用对话 (General Dialogue System)](./dialogue/) | +| machine_translation |[文本翻译 (Machine Translation)](./machine_translation/) | +| simultaneous_translation|[同声翻译 (Simultaneous Translation)](./simultaneous_translation/) | +| machine_reading_comprehension | [阅读理解 (Machine Reading Comprehension)](./machine_reading_comprehension/) | + +## NLP 拓展应用 (NLP Extented Applications) + +| 目录 Folder | 任务 Task | +| :--------------- | ------- | +| few_shot |[小样本学习 (Few-shot Learning)](./few_shot/):star2: | +| text_to_knowledge |[解语知识关联框架 (Text Knowledge Mining)](./text_to_knowledge/):star2: | +| model_compression |[模型压缩 (Model Compression)](./model_compression/) | +| text_graph |[文本图学习 (Text Graph Learning)](./text_graph/erniesage/) | +| time_series |[时间序列预测 (Time Series Prediction)](./time_series/) | diff --git a/examples/benchmark/ceval/README.md b/examples/benchmark/ceval/README.md new file mode 100644 index 0000000000000000000000000000000000000000..67b803ed81dcc2e4152dbd1c6d564e1fd196a50f --- /dev/null +++ b/examples/benchmark/ceval/README.md @@ -0,0 +1,84 @@ +# C-Eval评测脚本 + +此C-Eval评测脚本修改自[ymcui/Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)项目。 + +## 数据准备 + +从C-Eval官方指定路径下载评测数据集,并解压至data文件夹: + +``` +wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip +unzip ceval-exam.zip -d 
data +``` +将data文件夹放置于本项目的scripts/ceval目录下。 + +## 运行预测脚本 + +运行以下脚本: + +``` +cd scripts/ceval +python eval.py \ + --model_name_or_path /path/to/your/model \ + --cot False \ + --few_shot False \ + --with_prompt True \ + --constrained_decoding True \ + --temperature 0.2 \ + --n_times 1 \ + --ntrain 5 \ + --do_save_csv False \ + --do_test False \ + --output_dir ${output_path} \ +``` + +参数说明 + +- model_path:待评测模型所在目录(合并LoRA后的HF格式模型) +- cot:是否使用chain-of-thought +- few_shot:是否使用few-shot +- ntrain:few_shot=True时,指定few-shot实例的数量(5-shot:ntrain=5);few_shot=False时该项不起作用 +- with_prompt:模型输入是否包含针对Alpaca模型的指令模板 +- constrained_decoding:由于C-Eval评测的标准答案格式为选项'A'/'B'/'C'/'D',所以我们提供了两种从模型生成内容中抽取答案的方案: + - 当constrained_decoding=True,计算模型生成的第一个token分别为'A', 'B', 'C', 'D'的概率,选择其中概率最大的一个作为答案 + - 当constrained_decoding=False,用正则表达式从模型生成内容中提取答案 +- temperature:模型解码时的温度 +- n_times:指定评测的重复次数,将在output_dir下生成指定次数的文件夹 +- do_save_csv:是否将模型生成结果、提取的答案等内容保存在csv文件中 +- output_dir:指定评测结果的输出路径 +- do_test:在valid或test集上测试:当do_test=False,在valid集上测试;当do_test=True,在test集上测试 + +## 评测输出 +模型预测完成后,生成目录`outputs/take*`,其中*代表数字,范围为0至`n_times-1`,分别储存了`n_times`次解码的结果。 + +`outputs/take*`下包含`submission.json`和`summary.json`两个json文件。若`do_save_csv=True`,还将包含52个保存的模型生成结果、提取的答案等内容的csv文件。 + +`submission.json`为依据官方提交规范生成的存储模型评测答案的文件,形式如: + +``` +{ + "computer_network": { + "0": "A", + "1": "B", + ... + }, + "marxism": { + "0": "B", + "1": "A", + ... + }, + ... +} +``` + +summary.json包含模型在52个主题下、4个大类下和总体平均的评测结果。例如,json文件最后的All字段中会显示总体平均效果: + +``` + "All": { + "score": 0.36701337295690933, + "num": 1346, + "correct": 494.0 +} +``` + +其中score为准确率,num为测试的总样本条数,correct为正确的数量。 diff --git a/examples/benchmark/ceval/eval.py b/examples/benchmark/ceval/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..47b2227bbf3c9155a687b1923d5bd6c5af6efeac --- /dev/null +++ b/examples/benchmark/ceval/eval.py @@ -0,0 +1,130 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Adapted from https://github.com/ymcui/Chinese-LLaMA-Alpaca and https://github.com/SJTU-LIT/ceval +import argparse +import json +import os +import time + +import pandas as pd +from model_evaluator import ModelEvaluator + +choices = ["A", "B", "C", "D"] + + +def main(args, evaluator, take): + assert os.path.exists("subject_mapping.json"), "subject_mapping.json not found!" 
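+    # Load the subject -> [English name, Chinese name, category] mapping;
+    # the category ("STEM" / "Social Science" / "Humanities" / "Other") is used below to build the grouped summary.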
+ with open("subject_mapping.json") as f: + subject_mapping = json.load(f) + filenames = os.listdir("data/val") + subject_list = [val_file.replace("_val.csv", "") for val_file in filenames] + accuracy, summary = {}, {} + + run_date = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime(time.time())) + output_dir = args.output_dir + save_result_dir = os.path.join(output_dir, f"take{take}") + if not os.path.exists(save_result_dir): + os.makedirs(save_result_dir, exist_ok=True) + + all_answers = {} + for index, subject_name in enumerate(subject_list): + print( + f"{index/len(subject_list)} Inference starts at {run_date} on {args.model_name_or_path} with subject of {subject_name}!" + ) + val_file_path = os.path.join("data/val", f"{subject_name}_val.csv") + dev_file_path = os.path.join("data/dev", f"{subject_name}_dev.csv") + test_file_path = os.path.join("data/test", f"{subject_name}_test.csv") + + val_df = pd.read_csv(val_file_path) if args.do_test is False else pd.read_csv(test_file_path) + dev_df = pd.read_csv(dev_file_path) if args.few_shot else None + + correct_ratio, answers = evaluator.eval_subject( + subject_name, + val_df, + dev_df, + save_result_dir=save_result_dir if args.do_save_csv else None, + few_shot=args.few_shot, + cot=args.cot, + with_prompt=args.with_prompt, + constrained_decoding=args.constrained_decoding, + do_test=args.do_test, + ) + print(f"Subject: {subject_name}") + print(f"Acc: {correct_ratio}") + accuracy[subject_name] = correct_ratio + summary[subject_name] = { + "score": correct_ratio, + "num": len(val_df), + "correct": correct_ratio * len(val_df) / 100, + } + all_answers[subject_name] = answers + + json.dump(all_answers, open(save_result_dir + "/submission.json", "w"), ensure_ascii=False, indent=4) + print("Accuracy:") + for k, v in accuracy.items(): + print(k, ": ", v) + + total_num = 0 + total_correct = 0 + summary["grouped"] = { + "STEM": {"correct": 0.0, "num": 0}, + "Social Science": {"correct": 0.0, "num": 0}, + "Humanities": {"correct": 0.0, "num": 0}, + "Other": {"correct": 0.0, "num": 0}, + } + for subj, info in subject_mapping.items(): + group = info[2] + summary["grouped"][group]["num"] += summary[subj]["num"] + summary["grouped"][group]["correct"] += summary[subj]["correct"] + for group, info in summary["grouped"].items(): + info["score"] = info["correct"] / info["num"] + total_num += info["num"] + total_correct += info["correct"] + summary["All"] = {"score": total_correct / total_num, "num": total_num, "correct": total_correct} + + json.dump(summary, open(save_result_dir + "/summary.json", "w"), ensure_ascii=False, indent=2) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", type=str) + parser.add_argument("--cot", choices=["False", "True"], default="False") + parser.add_argument("--few_shot", choices=["False", "True"], default="True") + parser.add_argument("--ntrain", "-k", type=int, default=5) + parser.add_argument("--with_prompt", choices=["False", "True"], default="False") + parser.add_argument("--constrained_decoding", choices=["False", "True"], default="True") + parser.add_argument("--temperature", type=float, default=0.2) + parser.add_argument("--n_times", default=1, type=int) + parser.add_argument("--do_save_csv", choices=["False", "True"], default="False") + parser.add_argument("--output_dir", type=str) + parser.add_argument("--do_test", choices=["False", "True"], default="False") + + args = parser.parse_args() + + args.cot = args.cot == "True" + args.few_shot = args.few_shot == 
"True" + args.with_prompt = args.with_prompt == "True" + args.constrained_decoding = args.constrained_decoding == "True" + args.do_test = args.do_test == "True" + args.do_save_csv = args.do_save_csv == "True" + if args.constrained_decoding is True: + args.n_times = max(args.n_times, 1) + print(args) + + evaluator = ModelEvaluator( + choices=choices, k=args.ntrain, model_name_or_path=args.model_name_or_path, temperature=args.temperature + ) + for i in range(args.n_times): + main(args, evaluator=evaluator, take=i) diff --git a/examples/benchmark/ceval/evaluator.py b/examples/benchmark/ceval/evaluator.py new file mode 100644 index 0000000000000000000000000000000000000000..47eff428b9baf1d63ea5c3fe4139485544d84dba --- /dev/null +++ b/examples/benchmark/ceval/evaluator.py @@ -0,0 +1,61 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Adapted from https://github.com/ymcui/Chinese-LLaMA-Alpaca and https://github.com/SJTU-LIT/ceval +import string + + +class Evaluator: + def __init__(self, choices, model_name, k=-1): + self.choices = choices + self.model_name = model_name + self.k = k + self.puncs = list(string.punctuation) + + def format_example(self, line, include_answer=True): + example = line["question"] + for choice in self.choices: + example += f'\n{choice}. {line[f"{choice}"]}' + example += "\n答案:" + if include_answer: + example += f'{line["answer"]}\n\n' + return example + + def generate_few_shot_prompt(self, subject, dev_df): + prompt = f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n" + k = self.k + if self.k == -1: + k = dev_df.shape[0] + for i in range(k): + prompt += self.format_example(dev_df.iloc[i, :]) + return prompt + + def eval_subject(self, subject_name, test_df, dev_df=None, few_shot=False, save_result_dir=None): + pass + + def normalize_answer(self, s): + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(self.puncs) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_punc(lower(s))) + + def exact_match(self, pred, target): + return self.normalize_answer(pred) == self.normalize_answer(target) diff --git a/examples/benchmark/ceval/model_evaluator.py b/examples/benchmark/ceval/model_evaluator.py new file mode 100644 index 0000000000000000000000000000000000000000..4fbef4fe26c93a2f6f5c583d49a4e11b73774637 --- /dev/null +++ b/examples/benchmark/ceval/model_evaluator.py @@ -0,0 +1,189 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +# Adapted from https://github.com/ymcui/Chinese-LLaMA-Alpaca and https://github.com/SJTU-LIT/ceval +import os +import random +import re + +import numpy as np +import paddle +from evaluator import Evaluator +from tqdm import tqdm + +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer + + +class ModelEvaluator(Evaluator): + def __init__(self, choices, k, model_name_or_path, temperature=0.2): + super().__init__(choices, model_name_or_path, k) + self.model_name_or_path = model_name_or_path + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path, dtype="float16", low_cpu_mem_usage=True) + self.model.eval() + self.generation_config = dict( + temperature=temperature, + top_k=40, + top_p=0.9, + do_sample=True, + num_beams=1, + repetition_penalty=1.1, + max_new_tokens=20, + ) + + self.A_id = self.tokenizer.encode("A", add_special_tokens=False)["input_ids"][0] + self.B_id = self.tokenizer.encode("B", add_special_tokens=False)["input_ids"][0] + self.C_id = self.tokenizer.encode("C", add_special_tokens=False)["input_ids"][0] + self.D_id = self.tokenizer.encode("D", add_special_tokens=False)["input_ids"][0] + + def eval_subject( + self, + subject_name, + test_df, + dev_df=None, + few_shot=False, + cot=False, + save_result_dir=None, + with_prompt=False, + constrained_decoding=False, + do_test=False, + ): + all_answers = {} + + correct_num = 0 + if save_result_dir: + result = [] + score = [] + if few_shot: + history = self.generate_few_shot_prompt(subject_name, dev_df, cot=cot) + else: + history = "" + answers = ["NA"] * len(test_df) if do_test is True else list(test_df["answer"]) + for row_index, row in tqdm(test_df.iterrows(), total=len(test_df)): + question = self.format_example(row, include_answer=False, cot=cot, with_prompt=with_prompt) + instruction = history + question + inputs = self.tokenizer(instruction, return_tensors="pd") + batch_size, length = inputs.input_ids.shape + if constrained_decoding is True: + # batch_size is 1, take the last logits as the logits for next token prediction + with paddle.no_grad(): + logits = self.model(**inputs)[0][0, -1, :] + choices_logits = logits[[self.A_id, self.B_id, self.C_id, self.D_id]].numpy() + assert not (np.any(np.isinf(choices_logits)) or np.any(np.isnan(choices_logits))) + ans = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(choices_logits)] + response = self.tokenizer.decode([logits.argmax(-1).item()]) + else: + generation_output = self.model.generate( + **inputs, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + **self.generation_config, + ) + response = self.tokenizer.decode(generation_output[0][0, length:], skip_special_tokens=True) + ans, direct_extract = self.extract_answer(row, response) + if ans == answers[row_index]: + correct_num += 1 + correct = 1 + else: + correct = 0 + print(f"\n=======begin {str(row_index)}=======") + print("question: ", question) + print("response: ", response) + print("ans: ", ans) + print("ground truth: ", answers[row_index], "\n") + if save_result_dir: + result.append(response) + score.append(correct) + print(f"=======end {str(row_index)}=======") + + all_answers[str(row_index)] = ans + + correct_ratio = 100 * correct_num / len(answers) + + if save_result_dir: + test_df["model_output"] = result + test_df["correctness"] = score + test_df.to_csv(os.path.join(save_result_dir, 
f"{subject_name}_test.csv")) + + return correct_ratio, all_answers + + def format_example(self, line, include_answer=True, cot=False, with_prompt=False): + example = line["question"] + for choice in self.choices: + example += f'\n{choice}. {line[f"{choice}"]}' + if include_answer: + if cot: + example += "\n答案:让我们一步一步思考,\n" + line["explanation"] + f"\n所以答案是{line['answer']}。\n\n" + else: + example += "\n答案:" + line["answer"] + "\n\n" + else: + if with_prompt is False: + if cot: + example += "\n答案:让我们一步一步思考,\n1." + else: + example += "\n答案:" + else: + if cot: + example += "\n答案是什么?让我们一步一步思考,\n1." + else: + example += "\n答案是什么? " + return example + + def generate_few_shot_prompt(self, subject, dev_df, cot=False): + prompt = f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n" + k = self.k + if self.k == -1: + k = dev_df.shape[0] + for i in range(k): + prompt += self.format_example(dev_df.iloc[i, :], include_answer=True, cot=cot) + return prompt + + def extract_answer(self, line, gen_ans): + m = re.findall(r"所以答案是(.+?)。", gen_ans, re.M) + if len(m) > 0 and m[-1] in self.choices: + return m[-1], True + answer_patterns = [ + r"([ABCD])是正确的", + r"选项([ABCD])正确", + r"答案为([ABCD])", + r"答案是([ABCD])", + r"答案([ABCD])", + r"选择([ABCD])", + r"答案:([ABCD])", + r"选择答案([ABCD])", + ] + # RE extraction + for answer_pattern in answer_patterns: + m = re.search(answer_pattern, gen_ans, re.M) + if m: + answer = m.group(1) + return answer, False + # only containing one choice-character + m = re.findall(r"[ABCD]", gen_ans, re.M) + if len(m) >= 1: + answer = m[0] + return answer, False + # only containing one choice-context + choices_dict = {} + pattern = "" + for c in self.choices: + choices_dict[str(line[f"{c}"])] = c + pattern += re.escape(str(line[f"{c}"])) + "|" + pattern = pattern[:-1] + m = re.findall(pattern, gen_ans, re.M) + print("w/ escape:", repr(pattern), gen_ans, (len(m) >= 1)) + if len(m) >= 1: + answer = choices_dict[m[0]] + return answer, False + return random.choice("ABCD"), False diff --git a/examples/benchmark/ceval/subject_mapping.json b/examples/benchmark/ceval/subject_mapping.json new file mode 100644 index 0000000000000000000000000000000000000000..493c0f38e42213604cd9a46550d577ee9a51e0db --- /dev/null +++ b/examples/benchmark/ceval/subject_mapping.json @@ -0,0 +1,262 @@ +{ + "computer_network": [ + "Computer Network", + "\u8ba1\u7b97\u673a\u7f51\u7edc", + "STEM" + ], + "operating_system": [ + "Operating System", + "\u64cd\u4f5c\u7cfb\u7edf", + "STEM" + ], + "computer_architecture": [ + "Computer Architecture", + "\u8ba1\u7b97\u673a\u7ec4\u6210", + "STEM" + ], + "college_programming": [ + "College Programming", + "\u5927\u5b66\u7f16\u7a0b", + "STEM" + ], + "college_physics": [ + "College Physics", + "\u5927\u5b66\u7269\u7406", + "STEM" + ], + "college_chemistry": [ + "College Chemistry", + "\u5927\u5b66\u5316\u5b66", + "STEM" + ], + "advanced_mathematics": [ + "Advanced Mathematics", + "\u9ad8\u7b49\u6570\u5b66", + "STEM" + ], + "probability_and_statistics": [ + "Probability and Statistics", + "\u6982\u7387\u7edf\u8ba1", + "STEM" + ], + "discrete_mathematics": [ + "Discrete Mathematics", + "\u79bb\u6563\u6570\u5b66", + "STEM" + ], + "electrical_engineer": [ + "Electrical Engineer", + "\u6ce8\u518c\u7535\u6c14\u5de5\u7a0b\u5e08", + "STEM" + ], + "metrology_engineer": [ + "Metrology Engineer", + "\u6ce8\u518c\u8ba1\u91cf\u5e08", + "STEM" + ], + "high_school_mathematics": [ + "High School Mathematics", + "\u9ad8\u4e2d\u6570\u5b66", + "STEM" + ], + "high_school_physics": [ + "High School Physics", + 
"\u9ad8\u4e2d\u7269\u7406", + "STEM" + ], + "high_school_chemistry": [ + "High School Chemistry", + "\u9ad8\u4e2d\u5316\u5b66", + "STEM" + ], + "high_school_biology": [ + "High School Biology", + "\u9ad8\u4e2d\u751f\u7269", + "STEM" + ], + "middle_school_mathematics": [ + "Middle School Mathematics", + "\u521d\u4e2d\u6570\u5b66", + "STEM" + ], + "middle_school_biology": [ + "Middle School Biology", + "\u521d\u4e2d\u751f\u7269", + "STEM" + ], + "middle_school_physics": [ + "Middle School Physics", + "\u521d\u4e2d\u7269\u7406", + "STEM" + ], + "middle_school_chemistry": [ + "Middle School Chemistry", + "\u521d\u4e2d\u5316\u5b66", + "STEM" + ], + "veterinary_medicine": [ + "Veterinary Medicine", + "\u517d\u533b\u5b66", + "STEM" + ], + "college_economics": [ + "College Economics", + "\u5927\u5b66\u7ecf\u6d4e\u5b66", + "Social Science" + ], + "business_administration": [ + "Business Administration", + "\u5de5\u5546\u7ba1\u7406", + "Social Science" + ], + "marxism": [ + "Marxism", + "\u9a6c\u514b\u601d\u4e3b\u4e49\u57fa\u672c\u539f\u7406", + "Social Science" + ], + "mao_zedong_thought": [ + "Mao Zedong Thought", + "\u6bdb\u6cfd\u4e1c\u601d\u60f3\u548c\u4e2d\u56fd\u7279\u8272\u793e\u4f1a\u4e3b\u4e49\u7406\u8bba\u4f53\u7cfb\u6982\u8bba", + "Social Science" + ], + "education_science": [ + "Education Science", + "\u6559\u80b2\u5b66", + "Social Science" + ], + "teacher_qualification": [ + "Teacher Qualification", + "\u6559\u5e08\u8d44\u683c", + "Social Science" + ], + "high_school_politics": [ + "High School Politics", + "\u9ad8\u4e2d\u653f\u6cbb", + "Social Science" + ], + "high_school_geography": [ + "High School Geography", + "\u9ad8\u4e2d\u5730\u7406", + "Social Science" + ], + "middle_school_politics": [ + "Middle School Politics", + "\u521d\u4e2d\u653f\u6cbb", + "Social Science" + ], + "middle_school_geography": [ + "Middle School Geography", + "\u521d\u4e2d\u5730\u7406", + "Social Science" + ], + "modern_chinese_history": [ + "Modern Chinese History", + "\u8fd1\u4ee3\u53f2\u7eb2\u8981", + "Humanities" + ], + "ideological_and_moral_cultivation": [ + "Ideological and Moral Cultivation", + "\u601d\u60f3\u9053\u5fb7\u4fee\u517b\u4e0e\u6cd5\u5f8b\u57fa\u7840", + "Humanities" + ], + "logic": [ + "Logic", + "\u903b\u8f91\u5b66", + "Humanities" + ], + "law": [ + "Law", + "\u6cd5\u5b66", + "Humanities" + ], + "chinese_language_and_literature": [ + "Chinese Language and Literature", + "\u4e2d\u56fd\u8bed\u8a00\u6587\u5b66", + "Humanities" + ], + "art_studies": [ + "Art Studies", + "\u827a\u672f\u5b66", + "Humanities" + ], + "professional_tour_guide": [ + "Professional Tour Guide", + "\u5bfc\u6e38\u8d44\u683c", + "Humanities" + ], + "legal_professional": [ + "Legal Professional", + "\u6cd5\u5f8b\u804c\u4e1a\u8d44\u683c", + "Humanities" + ], + "high_school_chinese": [ + "High School Chinese", + "\u9ad8\u4e2d\u8bed\u6587", + "Humanities" + ], + "high_school_history": [ + "High School History", + "\u9ad8\u4e2d\u5386\u53f2", + "Humanities" + ], + "middle_school_history": [ + "Middle School History", + "\u521d\u4e2d\u5386\u53f2", + "Humanities" + ], + "civil_servant": [ + "Civil Servant", + "\u516c\u52a1\u5458", + "Other" + ], + "sports_science": [ + "Sports Science", + "\u4f53\u80b2\u5b66", + "Other" + ], + "plant_protection": [ + "Plant Protection", + "\u690d\u7269\u4fdd\u62a4", + "Other" + ], + "basic_medicine": [ + "Basic Medicine", + "\u57fa\u7840\u533b\u5b66", + "Other" + ], + "clinical_medicine": [ + "Clinical Medicine", + "\u4e34\u5e8a\u533b\u5b66", + "Other" + ], + "urban_and_rural_planner": [ + 
"Urban and Rural Planner", + "\u6ce8\u518c\u57ce\u4e61\u89c4\u5212\u5e08", + "Other" + ], + "accountant": [ + "Accountant", + "\u6ce8\u518c\u4f1a\u8ba1\u5e08", + "Other" + ], + "fire_engineer": [ + "Fire Engineer", + "\u6ce8\u518c\u6d88\u9632\u5de5\u7a0b\u5e08", + "Other" + ], + "environmental_impact_assessment_engineer": [ + "Environmental Impact Assessment Engineer", + "\u73af\u5883\u5f71\u54cd\u8bc4\u4ef7\u5de5\u7a0b\u5e08", + "Other" + ], + "tax_accountant": [ + "Tax Accountant", + "\u7a0e\u52a1\u5e08", + "Other" + ], + "physician": [ + "Physician", + "\u533b\u5e08\u8d44\u683c", + "Other" + ] +} \ No newline at end of file diff --git a/examples/benchmark/clue/README.md b/examples/benchmark/clue/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cf6662d543a9162392b21d8976a5c2e53beeacba --- /dev/null +++ b/examples/benchmark/clue/README.md @@ -0,0 +1,1522 @@ +# CLUE Benchmark + +**目录** + * [CLUE 评测结果](#CLUE评测结果) + * [一键复现模型效果](#一键复现模型效果) + * [启动 CLUE 分类任务](#启动CLUE分类任务) + * [使用 Trainer 启动 CLUE 分类任务](#使用Trainer启动CLUE分类任务) + * [启动 CLUE 阅读理解任务](#启动CLUE阅读理解任务) + * [批量启动 Grid Search](#批量启动GridSearch) + * [环境依赖](#环境依赖) + * [一键启动方法](#一键启动方法) + * [Grid Search 脚本说明](#GridSearch脚本说明) + * [参加 CLUE 竞赛](#参加CLUE竞赛) + * [分类任务](#分类任务) + * [阅读理解任务](#阅读理解任务) + +[CLUE](https://www.cluebenchmarks.com/) 自成立以来发布了多项 NLP 评测基准,包括分类榜单,阅读理解榜单和自然语言推断榜单等,在学术界、工业界产生了深远影响。是目前应用最广泛的中文语言测评指标之一。详细可参考 [CLUE论文](https://arxiv.org/abs/2004.05986)。 + +本项目基于 PaddlePaddle 在 CLUE 数据集上对领先的开源预训练模型模型进行了充分评测,为开发者在预训练模型选择上提供参考,同时开发者基于本项目可以轻松一键复现模型效果,也可以参加 CLUE 竞赛取得好成绩。 + + + + +## CLUE 评测结果 + +使用多种**中文**预训练模型微调在 CLUE 的各验证集上有如下结果: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Arch + + Model + + AVG + + AFQMC + + TNEWS + + IFLYTEK + + CMNLI + + OCNLI + + CLUEWSC2020 + + CSL + + CMRC2018 + + CHID + + C3 +
24L1024H + ERNIE 1.0-Large-zh-cw + + 79.03 + + 75.97 + + 59.65 + + 62.91 + + 85.09 + + 81.73 + + 93.09 + + 84.53 + + 74.22/91.88 + + 88.57 + + 84.54 +
+ ERNIE 2.0-Large-zh + + 77.03 + + 76.41 + + 59.67 + + 62.29 + + 83.82 + + 79.69 + + 89.14 + + 84.10 + + 71.48/90.35 + + 85.52 + + 78.12 +
+ HFL/RoBERTa-wwm-ext-large + + 76.61 + + 76.00 + + 59.33 + + 62.02 + + 83.88 + + 78.81 + + 90.79 + + 83.67 + + 70.58/89.82 + + 85.72 + + 75.26 +
20L1024H + ERNIE 3.0-Xbase-zh + + 78.39 + + 76.16 + + 59.55 + + 61.87 + + 84.40 + + 81.73 + + 88.82 + + 83.60 + + 75.99/93.00 + + 86.78 + + 84.98 +
12L768H + + + ERNIE 3.0-Base-zh + + + + 76.05 + + 75.93 + + 58.26 + + 61.56 + + 83.02 + + 80.10 + + 86.18 + + 82.63 + + 70.71/90.41 + + 84.26 + + 77.88 +
+ ERNIE 1.0-Base-zh-cw + + 76.47 + + 76.07 + + 57.86 + + 59.91 + + 83.41 + + 79.58 + + 89.91 + + 83.42 + + 72.88/90.78 + + 84.68 + + 76.98 +
+ ERNIE-Gram-zh + + 75.72 + + 75.28 + + 57.88 + + 60.87 + + 82.90 + + 79.08 + + 88.82 + + 82.83 + + 71.82/90.38 + + 84.04 + + 73.69 +
+ ERNIE 2.0-Base-zh + + 74.32 + + 75.65 + + 58.25 + + 61.64 + + 82.62 + + 78.71 + + 81.91 + + 82.33 + + 66.08/87.46 + + 82.78 + + 73.19 +
+ Langboat/Mengzi-BERT-Base + + 74.69 + + 75.35 + + 57.76 + + 61.64 + + 82.41 + + 77.93 + + 88.16 + + 82.20 + + 67.04/88.35 + + 83.74 + + 70.70 +
+ ERNIE 1.0-Base-zh + + 74.17 + + 74.84 + + 58.91 + + 62.25 + + 81.68 + + 76.58 + + 85.20 + + 82.77 + + 67.32/87.83 + + 82.47 + + 69.68 +
+ HFL/RoBERTa-wwm-ext + + 74.11 + + 74.60 + + 58.08 + + 61.23 + + 81.11 + + 76.92 + + 88.49 + + 80.77 + + 68.39/88.50 + + 83.43 + + 68.03 +
+ BERT-Base-Chinese + + 72.57 + + 74.63 + + 57.13 + + 61.29 + + 80.97 + + 75.22 + + 81.91 + + 81.90 + + 65.30/86.53 + + 82.01 + + 65.38 +
+ UER/Chinese-RoBERTa-Base + + 71.78 + + 72.89 + + 57.62 + + 61.14 + + 80.01 + + 75.56 + + 81.58 + + 80.80 + + 63.87/84.95 + + 81.52 + + 62.76 +
8L512H + UER/Chinese-RoBERTa-Medium + + 67.06 + + 70.64 + + 56.10 + + 58.29 + + 77.35 + + 71.90 + + 68.09 + + 78.63 + + 57.63/78.91 + + 75.13 + + 56.84 +
6L768H + + + ERNIE 3.0-Medium-zh + + + + 72.49 + + 73.37 + + 57.00 + + 60.67 + + 80.64 + + 76.88 + + 79.28 + + 81.60 + + 65.83/87.30 + + 79.91 + + 69.73 +
+ HLF/RBT6, Chinese + + 70.06 + + 73.45 + + 56.82 + + 59.64 + + 79.36 + + 73.32 + + 76.64 + + 80.67 + + 62.72/84.77 + + 78.17 + + 59.85 +
+ TinyBERT6, Chinese + + 69.62 + + 72.22 + + 55.70 + + 54.48 + + 79.12 + + 74.07 + + 77.63 + + 80.17 + + 63.03/83.75 + + 77.64 + + 62.11 +
+ RoFormerV2 Small + + 68.52 + + 72.47 + + 56.53 + + 60.72 + + 76.37 + + 72.95 + + 75.00 + + 81.07 + + 62.97/83.64 + + 67.66 + + 59.41 +
+ UER/Chinese-RoBERTa-L6-H768 + + 67.09 + + 70.13 + + 56.54 + + 60.48 + + 77.49 + + 72.00 + + 72.04 + + 77.33 + + 53.74/75.52 + + 76.73 + + 54.40 +
6L384H + + + ERNIE 3.0-Mini-zh + + + + 66.90 + + 71.85 + + 55.24 + + 54.48 + + 77.19 + + 73.08 + + 71.05 + + 79.30 + + 58.53/81.97 + + 69.71 + + 58.60 +
4L768H + HFL/RBT4, Chinese + + 67.42 + + 72.41 + + 56.50 + + 58.95 + + 77.34 + + 70.78 + + 71.05 + + 78.23 + + 59.30/81.93 + + 73.18 + + 56.45 +
4L512H + UER/Chinese-RoBERTa-Small + + 63.25 + + 69.21 + + 55.41 + + 57.552 + + 73.64 + + 69.80 + + 66.78 + + 74.83 + + 46.75/69.69 + + 67.59 + + 50.92 +
4L384H + + + ERNIE 3.0-Micro-zh + + + + 64.21 + + 71.15 + + 55.05 + + 53.83 + + 74.81 + + 70.41 + + 69.08 + + 76.50 + + 53.77/77.82 + + 62.26 + + 55.53 +
4L312H + + + ERNIE 3.0-Nano-zh + + + + 62.97 + + 70.51 + + 54.57 + + 48.36 + + 74.97 + + 70.61 + + 68.75 + + 75.93 + + 52.00/76.35 + + 58.91 + + 55.11 +
+ TinyBERT4, Chinese + + 60.82 + + 69.07 + + 54.02 + + 39.71 + + 73.94 + + 69.59 + + 70.07 + + 75.07 + + 46.04/69.34 + + 58.53 + + 52.18 +
4L256H + UER/Chinese-RoBERTa-Mini + + 53.40 + + 69.32 + + 54.22 + + 41.63 + + 69.40 + + 67.36 + + 65.13 + + 70.07 + + 5.96/17.13 + + 51.19 + + 39.68 +
3L1024H + HFL/RBTL3, Chinese + + 66.63 + + 71.11 + + 56.14 + + 59.56 + + 76.41 + + 71.29 + + 69.74 + + 76.93 + + 58.50/80.90 + + 71.03 + + 55.56 +
3L768H + HFL/RBT3, Chinese + + 65.72 + + 70.95 + + 55.53 + + 59.18 + + 76.20 + + 70.71 + + 67.11 + + 76.63 + + 55.73/78.63 + + 70.26 + + 54.93 +
2L128H + UER/Chinese-RoBERTa-Tiny + + 44.45 + + 69.02 + + 51.47 + + 20.28 + + 59.95 + + 57.73 + + 63.82 + + 67.43 + + 3.08/14.33 + + 23.57 + + 28.12 +
+
+ +AFQMC(语义相似度)、TNEWS(文本分类)、IFLYTEK(长文本分类)、CMNLI(自然语言推理)、OCNLI(自然语言推理)、CLUEWSC2020(代词消歧)、CSL(论文关键词识别)、CHID(成语阅读理解填空) 和 C3(中文多选阅读理解) 任务使用的评估指标均是 Accuracy。CMRC2018(阅读理解) 的评估指标是 EM (Exact Match)/F1,计算每个模型效果的平均值时,取 EM 为最终指标。 + +其中前 7 项属于分类任务,后面 3 项属于阅读理解任务,这两种任务的训练过程在下面将会分开介绍。 + +**NOTE:具体评测方式如下** +1. 以上所有任务均基于 Grid Search 方式进行超参寻优。分类任务训练每间隔 100 steps 评估验证集效果,阅读理解任务每隔一个 epoch 评估验证集效果,取验证集最优效果作为表格中的汇报指标。 + +2. 分类任务 Grid Search 超参范围: batch_size: 16, 32, 64; learning rates: 1e-5, 2e-5, 3e-5, 5e-5;因为 CLUEWSC2020 数据集较小,所以模型在该数据集上的效果对 batch_size 较敏感,所以对 CLUEWSC2020 评测时额外增加了 batch_size = 8 的超参搜索; 因为CLUEWSC2020 和 IFLYTEK 数据集对 dropout 概率值较为敏感,所以对 CLUEWSC2020 和 IFLYTEK 数据集评测时额外增加了 dropout_prob = 0.0 的超参搜索。 + +3. 阅读理解任务 Grid Search 超参范围:batch_size: 24, 32; learning rates: 1e-5, 2e-5, 3e-5。阅读理解任务均使用多卡训练,其中 Grid Search 中的 batch_size 是指多张卡上的 batch_size 总和。 + +4. 以上每个下游任务的固定超参配置如下表所示: + +| TASK | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C3 | +| ----------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ---- | -------- | ---- | ------------- | +| epoch | 3 | 3 | 3 | 2 | 5 | 50 | 5 | 2 | 3 | 8 | +| max_seq_length | 128 | 128 | 128 | 128 | 128 | 128 | 256 | 512 | 64 | 512 | +| warmup_proportion | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.06 | 0.1 | +| num_cards | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 4 | + +不同预训练模型在下游任务上做 Grid Search 之后的最优超参(learning_rate、batch_size)如下: + +| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C3 | +| -------------------------------- | ------- | ------- | ------- | -------- | -------- | ----------- | ------- | -------- | ------- | ------------- | +| ERNIE 1.0-Large-zh-cw | 2e-5,64 | 3e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,32 | 1e-5,32 | 1e-5,16 | 2e-5,24 | 1e-5,24 | 2e-5,32 | +| ERNIE 3.0-Xbase-zh | 2e-5,16 | 3e-5,32 | 3e-5,32 | 3e-5,64 | 3e-5,64 | 2e-5,32 | 1e-5,16 | 3e-5,24 | 2e-5,24 | 3e-5,24 | +| ERNIE 2.0-Large-zh | 1e-5,32 | 3e-5,64 | 3e-5,32 | 2e-5,32 | 1e-5,16 | 3e-5,32 | 1e-5,64 | 2e-5,24 | 2e-5,24 | 3e-5,32 | +| HFL/RoBERTa-wwm-ext-large | 1e-5,32 | 3e-5,32 | 2e-5,32 | 1e-5,16 | 1e-5,16 | 2e-5,16 | 2e-5,16 | 3e-5,32 | 1e-5,24 | 2e-5,24 | +| ERNIE 3.0-Base-zh | 3e-5,16 | 3e-5,32 | 5e-5,32 | 3e-5,32 | 2e-5,64 | 2e-5,16 | 2e-5,32 | 2e-5,24 | 3e-5,24 | 3e-5,32 | +| ERNIE 1.0-Base-zh-cw | 2e-5,16 | 3e-5,32 | 5e-5,16 | 2e-5,16 | 3e-5,32 | 2e-5,16 | 2e-5,32 | 3e-5,24 | 2e-5,32 | 3e-5,24 | +| ERNIE-Gram-zh | 1e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,32 | 2e-5,64 | 3e-5,16 | 3e-5,64 | 3e-5,32 | 2e-5,24 | 2e-5,24 | +| ERNIE 2.0-Base-zh | 3e-5,64 | 3e-5,64 | 5e-5,16 | 5e-5,64 | 5e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,32 | 3e-5,24 | 3e-5,32 | +| Langboat/Mengzi-Bert-Base | 3e-5,32 | 5e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,16 | 3e-5,8 | 1e-5,16 | 3e-5,24 | 3e-5,24 | 2e-5,32 | +| ERNIE 1.0-Base-zh | 3e-5,16 | 3e-5,32 | 5e-5,16 | 5e-5,32 | 3e-5,16 | 2e-5,8 | 2e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,24 | +| HFL/RoBERTa-wwm-ext | 3e-5,32 | 3e-5,64 | 5e-5,16 | 3e-5,32 | 2e-5,32 | 3e-5,32 | 2e-5,32 | 3e-5,32 | 2e-5,32 | 3e-5,24 | +| BERT-Base-Chinese | 2e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,64 | 3e-5,16 | 3e-5,16 | 1e-5,16 | 3e-5,24 | 2e-5,32 | 3e-5,24 | +| UER/Chinese-RoBERTa-Base | 2e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,16 | 3e-5,16 | 3e-5,8 | 2e-5,16 | 3e-5,24 | 3e-5,32 | 3e-5,32 | +| UER/Chinese-RoBERTa-Medium | 3e-5,32 | 5e-5,64 | 5e-5,16 | 5e-5,32 | 3e-5,32 | 3e-5,16 | 5e-5,32 | 3e-5,24 | 3e-5,24 | 3e-5,32 | +| ERNIE 3.0-Medium-zh | 3e-5,32 | 3e-5,64 | 5e-5,32 | 2e-5,32 | 1e-5,64 | 3e-5,16 | 2e-5,32 | 3e-5,24 | 2e-5,24 | 
1e-5,24 | +| TinyBERT6, Chinese | 1e-5,16 | 3e-5,32 | 5e-5,16 | 5e-5,32 | 3e-5,64 | 3e-5,16 | 3e-5,16 | 3e-5,32 | 3e-5,24 | 2e-5,24 | +| RoFormerV2 Small | 5e-5,16 | 2e-5,16 | 5e-5,16 | 5e-5,32 | 2e-5,16 | 3e-5,8 | 3e-5,16 | 3e-5,24 | 3e-5,24 | 3e-5,24 | +| HLF/RBT6, Chinese | 3e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,64 | 3e-5,16 | 3e-5,8 | 5e-5,64 | 2e-5,24 | 3e-5,32 | 2e-5,32 | +| UER/Chinese-RoBERTa-L6-H768 | 2e-5,16 | 3e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,32 | 2e-5,32 | 3e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,24 | +| ERNIE 3.0-Mini-zh | 5e-5,64 | 5e-5,64 | 5e-5,16 | 5e-5,32 | 2e-5,16 | 2e-5,8 | 2e-5,16 | 3e-5,24 | 3e-5,24 | 3e-5,24 | +| HFL/RBT4, Chinese | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,16 | 2e-5,8 | 2e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,32 | +| UER/Chinese-RoBERTa-Small | 2e-5,32 | 5e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,64 | 5e-5,32 | 3e-5,24 | 3e-5,24 | 3e-5,24 | +| ERNIE 3.0-Micro-zh | 3e-5,16 | 5e-5,32 | 5e-5,16 | 5e-5,16 | 2e-5,32 | 5e-5,16 | 3e-5,64 | 3e-5,24 | 3e-5,32 | 3e-5,24 | +| ERNIE 3.0-Nano-zh | 2e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 3e-5,16 | 1e-5,8 | 3e-5,32 | 3e-5,24 | 3e-5,24 | 2e-5,24 | +| TinyBERT4, Chinese | 3e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 3e-5,16 | 1e-5,16 | 5e-5,16 | 3e-5,24 | 3e-5,24 | 2e-5,24 | +| UER/Chinese-RoBERTa-Mini | 3e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,32 | 3e-5,8 | 5e-5,32 | 3e-5,24 | 3e-5,32 | 3e-5,32 | +| HFL/RBTL3, Chinese | 5e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,32 | 2e-5,16 | 5e-5,8 | 2e-5,16 | 3e-5,24 | 2e-5,24 | 3e-5,24 | +| HFL/RBT3, Chinese | 5e-5,64 | 5e-5,32 | 5e-5,16 | 5e-5,16 | 2e-5,16 | 3e-5,16 | 5e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,32 | +| UER/Chinese-RoBERTa-Tiny | 5e-5,64 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,8 | 5e-5,16 | 3e-5,24 | 3e-5,24 | 3e-5,24 | + +其中,`ERNIE 3.0-Base-zh`、`ERNIE 3.0-Medium-zh`、`ERNIE-Gram-zh`、`ERNIE 1.0-Base-zh`、`ERNIE 3.0-Mini-zh`、`ERNIE 3.0-Micro-zh`、`ERNIE 3.0-Nano-zh` 、`HFL/RBT3, Chinese`、`HFL/RBTL3, Chinese`、`HFL/RBT6, Chinese`、`TinyBERT4, Chinese`、`UER/Chinese-RoBERTa-Base`、`UER/Chinese-RoBERTa-Mini`、`UER/Chinese-RoBERTa-Small` 在 CLUEWSC2020 处的 dropout_prob 为 0.0,`ERNIE 3.0-Base-zh`、`HLF/RBT6, Chinese`、`Langboat/Mengzi-BERT-Base`、`ERNIE-Gram-zh`、`ERNIE 1.0-Base-zh` 、`TinyBERT6, Chinese`、`UER/Chinese-RoBERTa-L6-H768`、`ERNIE 3.0-Mini-zh`、`ERNIE 3.0-Micro-zh`、`ERNIE 3.0-Nano-zh`、`HFL/RBT3, Chinese`、`HFL/RBT4, Chinese`、`HFL/RBT6, Chinese`、`TinyBERT4, Chinese`、`UER/Chinese-RoBERTa-Medium`、`UER/Chinese-RoBERTa-Base`、`UER/Chinese-RoBERTa-Mini`、`UER/Chinese-RoBERTa-Tiny`、`UER/Chinese-RoBERTa-Small` 在 IFLYTEK 处的 dropout_prob 为 0.0。 + + + +## 一键复现模型效果 + +这一节将会对分类、阅读理解任务分别展示如何一键复现本文的评测结果。 + + + +### 启动 CLUE 分类任务 +以 CLUE 的 TNEWS 任务为例,启动 CLUE 任务进行 Fine-tuning 的方式如下: + +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=TNEWS +export LR=3e-5 +export BS=32 +export EPOCH=6 +export MAX_SEQ_LEN=128 +export MODEL_PATH=ernie-3.0-medium-zh + +cd classification +mkdir ernie-3.0-medium-zh +python -u ./run_clue_classifier.py \ + --model_name_or_path ${MODEL_PATH} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --batch_size ${BS} \ + --learning_rate ${LR} \ + --num_train_epochs ${EPOCH} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \ + --device gpu \ + --dropout 0.1 \ + --gradient_accumulation_steps 1 \ + --save_best_model True \ + --do_train \ + +``` + +另外,如需评估,传入参数 `--do_eval` 即可,如果只对读入的 checkpoint 进行评估不训练,则不需传入 `--do_train`。 + 
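+例如,只想对某个已训练好的 checkpoint 做评估(不再训练)时,可以参考如下启动方式。该示例仅为参考写法,其中 checkpoint 路径为示意,请替换为实际保存目录:
+
+```shell
+# 仅评估,不训练:传入 --do_eval,省略 --do_train
+python -u ./run_clue_classifier.py \
+    --task_name ${TASK_NAME} \
+    --model_name_or_path ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --batch_size ${BS} \
+    --device gpu \
+    --do_eval
+```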
+其中参数释义如下:
+
+- `model_name_or_path` 指示了 Fine-tuning 使用的具体预训练模型,可以是 PaddleNLP 提供的预训练模型,可以选择[Transformer预训练模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中相对应的中文预训练权重。注意 CLUE 任务应选择中文预训练权重。
+- `task_name` 表示 Fine-tuning 的分类任务,当前支持 AFQMC、TNEWS、IFLYTEK、OCNLI、CMNLI、CSL、CLUEWSC2020。
+- `max_seq_length` 表示最大句子长度,超过该长度将被截断。
+- `batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与 learning rate scheduler 产生的值相乘作为当前学习率。
+- `num_train_epochs` 表示训练轮数。
+- `logging_steps` 表示日志打印间隔。
+- `save_steps` 表示模型保存及评估间隔。
+- `save_best_model` 是否保存在评估集上效果最好的模型,默认为 True。
+- `output_dir` 表示模型保存路径。
+- `device` 表示训练使用的设备, 'gpu' 表示使用GPU, 'xpu' 表示使用百度昆仑卡, 'cpu' 表示使用 CPU。
+
+Fine-tuning 过程将按照 `logging_steps` 和 `save_steps` 的设置打印出如下日志:
+
+```
+global step 100/20010, epoch: 0, batch: 99, rank_id: 0, loss: 2.734340, lr: 0.0000014993, speed: 8.7969 step/s
+eval loss: 2.720359, acc: 0.0827, eval done total : 25.712125062942505 s
+global step 200/20010, epoch: 0, batch: 199, rank_id: 0, loss: 2.608563, lr: 0.0000029985, speed: 2.5921 step/s
+eval loss: 2.652753, acc: 0.0945, eval done total : 25.64827537536621 s
+global step 300/20010, epoch: 0, batch: 299, rank_id: 0, loss: 2.555283, lr: 0.0000044978, speed: 2.6032 step/s
+eval loss: 2.572999, acc: 0.112, eval done total : 25.67190170288086 s
+global step 400/20010, epoch: 0, batch: 399, rank_id: 0, loss: 2.631579, lr: 0.0000059970, speed: 2.6238 step/s
+eval loss: 2.476962, acc: 0.1697, eval done total : 25.794789791107178 s
+```
+
+
+
+#### 使用 Trainer 启动 CLUE 分类任务
+PaddleNLP 提供了 Trainer API,本示例新增了`run_clue_classifier_trainer.py`脚本供用户使用。
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+export TASK_NAME=TNEWS
+export LR=3e-5
+export BS=32
+export EPOCH=6
+export MAX_SEQ_LEN=128
+export MODEL_PATH=ernie-3.0-medium-zh
+
+cd classification
+mkdir ernie-3.0-medium-zh
+
+python -u ./run_clue_classifier_trainer.py \
+    --model_name_or_path ${MODEL_PATH} \
+    --dataset "clue ${TASK_NAME}" \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --per_device_train_batch_size ${BS} \
+    --per_device_eval_batch_size ${BS} \
+    --learning_rate ${LR} \
+    --num_train_epochs ${EPOCH} \
+    --logging_steps 100 \
+    --seed 42 \
+    --save_steps 100 \
+    --warmup_ratio 0.1 \
+    --weight_decay 0.01 \
+    --adam_epsilon 1e-8 \
+    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --device gpu \
+    --do_train \
+    --do_eval \
+    --metric_for_best_model "eval_accuracy" \
+    --load_best_model_at_end \
+    --save_total_limit 3 \
+```
+大部分参数含义如上文所述,这里简要介绍一些新参数:
+- `dataset` 同上文 `task_name`,此处为小写字母。表示 Fine-tuning 的分类任务,当前支持 afqmc、tnews、iflytek、ocnli、cmnli、csl、cluewsc2020。
+- `per_device_train_batch_size` 同上文 `batch_size`。训练时,每次迭代**每张卡**上的样本数目。
+- `per_device_eval_batch_size` 同上文 `batch_size`。评估时,每次迭代**每张卡**上的样本数目。
+- `warmup_ratio` 同上文 `warmup_proportion`,warmup 步数占总步数的比例。
+- `metric_for_best_model` 评估时用于挑选最优模型的评估指标。
+- `load_best_model_at_end` 训练结束时,是否加载评估结果最好的 ckpt。
+- `save_total_limit` 保存的 ckpt 数量的最大限制。
+
+
+
+### 启动 CLUE 阅读理解任务
+以 CLUE 的 C3 任务为例,多卡启动 CLUE 任务进行 Fine-tuning 的方式如下:
+
+```shell
+
+cd mrc
+
+MODEL_PATH=ernie-3.0-medium-zh
+BATCH_SIZE=6
+LR=2e-5
+
+python -m paddle.distributed.launch --gpus "0,1,2,3" run_c3.py \
+    --model_name_or_path ${MODEL_PATH} \
+    --batch_size ${BATCH_SIZE} \
+    --learning_rate ${LR} \
+    --max_seq_length 512 \
+    --num_train_epochs 8 \
+    --do_train \
+    --warmup_proportion 0.1 \
+    --gradient_accumulation_steps 3 \
+
+```
+需要注意的是,如果显存无法容纳所传入的 `batch_size`,可以通过传入 `gradient_accumulation_steps` 参数来模拟该 `batch_size`。
+
+
+
+### 批量启动 Grid Search
+
+
+
+#### 环境依赖
+
+Grid
Search 需要在 GPU 环境下进行,需要注意的是 C3 任务需要显存大于 16 GB,最好是在显存 32 GB的环境下启动。 + +Grid Search 中的 GPU 调度需要依赖 pynvml 库,pynvml 库提供了 GPU 管理的 Python 接口。可启动以下命令进行安装 pynvml: + +```shell +pip install pynvml +``` + + + +#### 一键启动方法 + +运行下面一句命令即可启动 Grid Search 任务。前期需要注意数据集是否正常下载,否则训练任务不会正式启动。 +脚本默认不保存模型,如需保存每个超参数下最好的模型,需要修改 Python 脚本中的 `--save_best_models` 参数为 True。 + +```shell +cd grid_search_tools + +# 这里 ernie-3.0-base-zh 是模型名,也可以传用户自定义的模型目录 +# 自定义的模型目录需要有 model_config.json, model_state.pdparams, tokenizer_config.json 和 vocab.txt 四个文件 +python grid_seach.py ernie-3.0-base-zh + +``` + +确认模型所有任务训练完成后,可以调用脚本 `extract_result.sh` 一键抽取 Grid Search 结果,打印出每个任务的最佳结果和对应的超参数,例如: + +```shell +bash extract_result.sh ernie-3.0-base-zh +``` +```text +AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL CMRC2018 CHID C3 +75.93 58.26 61.56 83.02 80.10 86.18 82.63 70.71/90.41 84.26 77.88 +==================================================================== +Best hyper-parameters list: +==================================================================== +TASK result (lr, batch_size, dropout_p) +AFQMC 75.93 (3e-05,16,0.1) +TNEWS 58.26 (3e-05,32,0.1) +IFLYTEK 61.56 (5e-05,32,0.0) +CMNLI 83.02 (3e-05,32,0.1) +OCNLI 80.10 (2e-05,64,0.1) +CLUEWSC2020 86.18 (2e-05,16,0.0) +CSL 82.63 (2e-05,32,0.1) +CMRC2018 70.71/90.41 (2e-05,24,0.1) +CHID 84.26 (3e-05,24,0.1) +C3 77.88 (3e-05,32,0.1) +``` + +另外,如遇意外情况(如机器重启)导致训练中断,可以直接再次启动 `grid_search.py` 脚本,之前已完成(输出完整日志)的任务则会直接跳过。 + + + +#### Grid Search 脚本说明 + +本节介绍 grid_search_tools 目录下各个脚本的功能: + +- `grid_search.py` Grid Search 任务入口脚本,该脚本负责调度 GPU 资源,可自动将 7 个分类任务、3 个阅读理解下所有超参数对应的任务完成,训练完成后会自动调用抽取结果的脚本 `extract_result.sh` 打印出所有任务的最佳结果和对应的超参。 +- `warmup_dataset_and_model.py` 首次运行时,该脚本完成模型下载(如需)、数据集下载,阅读理解任务数据预处理、预处理文件缓存等工作,再次运行则会检查这些文件是否存在,存在则跳过。该脚本由 `grid_search.py` 在 Grid Search 训练前自动调用,预处理 cache 文件生成后,后面所有训练任务即可加载缓存文件避免重复进行数据预处理。如果该阶段任务失败,大多需要检查网络,解决之后需重启 `grid_search.py`,直到训练正常开始。该脚本也可手动调用,需要 1 个参数,模型名称或目录。该脚本在使用 Intel(R) Xeon(R) Gold 6271C CPU 且 `--num_proc`默认为 4 的情况下需约 30 分钟左右完成,可以更改 `run_mrc.sh` 中的 `--num_proc` 参数以改变生成 cache 的进程数。需要注意的是,若改变 num_proc,之前的缓存则不能再使用,该脚本会重新处理数据并生成新的 cache,cache 相关内容可查看[datasets.Dataset.map文档](https://huggingface.co/docs/datasets/v2.0.0/package_reference/main_classes?highlight=map#datasets.Dataset.map)。 +- `extract_result.sh` 从日志抽取每个任务的最佳结果和对应的最佳超参并打印,`grid_search.py` 在完成训练任务后会自动调用,也可手动调用,需要 1 个参数:模型名称或目录。手动调用前需要确认训练均全部完成,并且保证该目录下有分类和阅读理解所有任务的日志。 +- `run_mrc.sh` 阅读理解任务的启动脚本。 +- `run_cls.sh` 分类任务的启动脚本。 + + + + +## 参加 CLUE 竞赛 + +对各个任务运行预测脚本,汇总多个结果文件压缩之后,即可提交至 CLUE 官网进行评测。 + +下面 2 小节会分别介绍分类、阅读理解任务产生预测结果的方法。 + + + +### 分类任务 + +以 TNEWS 为例,可以直接使用脚本 `classification/run_clue_classifier.py` 对单个任务进行预测,注意脚本启动时需要传入参数 `--do_predict`。假设 TNEWS 模型所在路径为 `${TNEWS_MODEL}`,运行如下脚本可得到模型在测试集上的预测结果,预测结果会写入地址 `${OUTPUT_DIR}/tnews_predict.json`。 + +``` +cd classification +OUTPUT_DIR=results +mkdir ${OUTPUT_DIR} + +python run_clue_classifier.py \ + --task_name TNEWS \ + --model_name_or_path ${TNEWS_MODEL} \ + --output_dir ${OUTPUT_DIR} \ + --do_predict \ +``` + + +### 阅读理解任务 + +以 C3 为例,直接使用 `mrc/run_c3.py`对该任务进行预测,注意脚本启动时需要传入参数 `--do_predict`。假设 C3 模型所在路径为 `${C3_MODEL}`,运行如下脚本可得到模型在测试集上的预测结果,预测结果会写入地址 `${OUTPUT_DIR}/c311_predict.json`。 + +```shell +cd mrc +OUTPUT_DIR=results +mkdir ${OUTPUT_DIR} + +python run_c3.py \ + --model_name_or_path ${C3_MODEL} \ + --output_dir ${OUTPUT_DIR} \ + --do_predict \ +``` diff --git a/examples/benchmark/clue/classification/run_clue_classifier.py b/examples/benchmark/clue/classification/run_clue_classifier.py new file mode 100644 index 
0000000000000000000000000000000000000000..81e1ee690fd4281823e0698fe1a202770537c36f --- /dev/null +++ b/examples/benchmark/clue/classification/run_clue_classifier.py @@ -0,0 +1,481 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name.", + ) + parser.add_argument( + "--output_dir", + default="best_clue_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." 
+ ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether do train.") + parser.add_argument("--do_eval", action="store_true", help="Whether do train.") + parser.add_argument("--do_predict", action="store_true", help="Whether do predict.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--save_best_model", + default=True, + type=strtobool, + help="Whether to save best model.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument("--dropout", default=0.1, type=float, help="dropout.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + labels = batch.pop("labels") + logits = model(**batch) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + logger.info("eval loss: %f, acc: %s, " % (loss.numpy(), res)) + model.train() + return res + + +def convert_example(example, tokenizer, label_list, is_test=False, max_seq_length=512): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + # Get the label + label = np.array(example["label"], dtype="int64") + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in 
example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + if not is_test: + example["labels"] = label + return example + + +def do_eval(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, label_list=dev_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = DataCollatorWithPadding(tokenizer) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if dev_ds.label_list is None else len(dev_ds.label_list) + + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + metric = metric_class() + model.eval() + metric.reset() + for batch in dev_data_loader: + labels = batch.pop("labels") + logits = model(**batch) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + logger.info("acc: %s\n, " % (res)) + + +def do_train(args): + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." 
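+    # Note: args.batch_size is treated as the effective (global) batch size here;
+    # it is divided by gradient_accumulation_steps below to get the per-step batch size.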
+ paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + train_ds, dev_ds = load_dataset("clue", args.task_name, splits=("train", "dev")) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, label_list=train_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = DataCollatorWithPadding(tokenizer) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + + if args.dropout != 0.1: + update_model_dropout(model, args.dropout) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps / args.gradient_accumulation_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + best_acc = 0.0 + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + labels = batch.pop("labels") + logits = model(**batch) + loss = loss_fct(logits, labels) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + acc = evaluate(model, loss_fct, metric, dev_data_loader) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if acc > best_acc: + best_acc = acc + if args.save_best_model: + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + logger.info("best_result: %.2f" % (best_acc * 100)) + return + logger.info("best_result: %.2f" % (best_acc * 100)) + + +def do_predict(args): + paddle.set_device(args.device) + args.task_name = args.task_name.lower() + + train_ds, test_ds = load_dataset("clue", args.task_name, splits=("train", "test")) + if args.task_name == "cluewsc2020" or args.task_name == "tnews": + test_ds_10 = load_dataset("clue", args.task_name, splits="test1.0") + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=args.max_seq_length, + is_test=True, + ) + + batchify_fn = DataCollatorWithPadding(tokenizer) + + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "cluewsc2020" or args.task_name == "tnews": + test_ds_10 = test_ds_10.map(trans_func, lazy=True) + test_batch_sampler_10 = paddle.io.BatchSampler(test_ds_10, batch_size=args.batch_size, shuffle=False) + test_data_loader_10 = DataLoader( + dataset=test_ds_10, + batch_sampler=test_batch_sampler_10, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + + model 
= AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + prediction_filename = args.task_name + + if args.task_name == "ocnli": + prediction_filename = "ocnli_50k" + elif args.task_name == "cluewsc2020": + prediction_filename = "cluewsc" + "11" + elif args.task_name == "tnews": + prediction_filename = args.task_name + "11" + + # For version 1.1 + f = open(os.path.join(args.output_dir, prediction_filename + "_predict.json"), "w") + preds = [] + for step, batch in enumerate(test_data_loader): + with paddle.no_grad(): + logits = model(**batch) + pred = paddle.argmax(logits, axis=1).numpy().tolist() + preds += pred + for idx, pred in enumerate(preds): + j = json.dumps({"id": idx, "label": train_ds.label_list[pred]}) + f.write(j + "\n") + + # For version 1.0 + if args.task_name == "cluewsc2020" or args.task_name == "tnews": + prediction_filename = args.task_name + "10" + if args.task_name == "cluewsc2020": + prediction_filename = "cluewsc10" + f = open(os.path.join(args.output_dir, prediction_filename + "_predict.json"), "w") + + preds = [] + for step, batch in enumerate(test_data_loader_10): + with paddle.no_grad(): + logits = model(**batch) + pred = paddle.argmax(logits, axis=1).numpy().tolist() + preds += pred + for idx, pred in enumerate(preds): + j = json.dumps({"id": idx, "label": train_ds.label_list[pred]}) + f.write(j + "\n") + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def update_model_dropout(model, p=0.0): + model.base_model.embeddings.dropout.p = p + for i in range(len(model.base_model.encoder.layers)): + model.base_model.encoder.layers[i].dropout.p = p + model.base_model.encoder.layers[i].dropout1.p = p + model.base_model.encoder.layers[i].dropout2.p = p + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + if args.do_train: + do_train(args) + if args.do_eval: + do_eval(args) + if args.do_predict: + do_predict(args) diff --git a/examples/benchmark/clue/classification/run_clue_classifier_trainer.py b/examples/benchmark/clue/classification/run_clue_classifier_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..0bfabedc44e9478968557a2fbe4b8678a443fcd6 --- /dev/null +++ b/examples/benchmark/clue/classification/run_clue_classifier_trainer.py @@ -0,0 +1,282 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
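+# Fine-tune and evaluate CLUE classification tasks with the PaddleNLP Trainer API.
+# This is the Trainer-based counterpart of run_clue_classifier.py described in the README.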
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import paddle +import paddle.nn as nn +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + export_model, +) +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + dataset: str = field(default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + do_lower_case: bool = field( + default=False, + metadata={ + "help": "Whether to lower case the input text. Should be True for uncased models and False for cased models." + }, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + } + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the dataset cache."}, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +# Data pre-process function for clue benchmark datatset +def convert_clue(example, label_list, tokenizer=None, max_seq_length=512, **kwargs): + """convert a glue example into necessary features""" + is_test = False + if "label" not in example.keys(): + is_test = True + + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # print("label_list", label_list) + # Get the label + # example['label'] = np.array(example["label"], dtype="int64") + example["label"] = int(example["label"]) if label_dtype != "float32" else float(example["label"]) + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx 
> query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + + if tokenizer is None: + return example + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"]} + + +def clue_trans_fn(example, tokenizer, args): + return convert_clue(example, tokenizer=tokenizer, label_list=args.label_list, max_seq_length=args.max_seq_length) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + data_args.dataset = data_args.dataset.strip() + + dataset_config = data_args.dataset.split(" ") + print(dataset_config) + raw_datasets = load_dataset( + dataset_config[0], name=None if len(dataset_config) <= 1 else dataset_config[1], splits=("train", "dev") + ) + + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = 1 if raw_datasets["train"].label_list is None else len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. 
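+    # num_classes is taken from the dataset's label list loaded above; CrossEntropyLoss
+    # is used when a label list exists, otherwise MSELoss is used for regression-style tasks.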
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + criterion = nn.loss.CrossEntropyLoss() if data_args.label_list else nn.loss.MSELoss() + + # Define dataset pre-process function + trans_fn = partial(clue_trans_fn, tokenizer=tokenizer, args=data_args) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Dataset pre-process + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn) + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + metric.reset() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/clue/grid_search_tools/draw_pic.py b/examples/benchmark/clue/grid_search_tools/draw_pic.py new file mode 100644 index 0000000000000000000000000000000000000000..31710f26992db0525fd37e6bcc2aceb4304c8d39 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/draw_pic.py @@ -0,0 +1,146 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import matplotlib.pyplot as plt + +mode = sys.argv[1] +batch_size = sys.argv[2] + +ylabel_name = "CLUE Avg Score" +title_name = "PaddleNLP Chinese Models" + +if mode == "gpu": + picture_name = "./gpu_bs" + batch_size + ".png" + xlabel_name = "Latency (ms) under FP16 on Tesla T4" +elif mode == "cpu1": + picture_name = "./cpu_thread1_bs" + batch_size + ".png" + xlabel_name = "Latency (ms) under FP32 on Intel(R) Xeon(R) Gold 6271C, num_threads=1" +elif mode == "cpu8": + picture_name = "./cpu_thread8_bs" + batch_size + ".png" + xlabel_name = "Latency (ms) under FP32 on Intel(R) Xeon(R) Gold 6271C, num_threads=8" +else: + raise ValueError("Only supports gpu, cpu1, cpu8.") +xlabel_name += ", batch_size=" + batch_size + +# Each element has model_name, model_param_num, latency(ms), clue avg score, +# color, the size of circle. +# Models of the same series are best represented by colors of the same color +# system. https://zhuanlan.zhihu.com/p/65220518 is for reference. +data = [ + [ + ["ERNIE 3.0-Base", "117.95M", 2.69, 226.43, 33.08, 3.43, 205.57, 34.10, 76.05, "#F08080", 11.8], # #F08080 + ["ERNIE 3.0-Medium", "75.43M", 1.42, 113.35, 17.32, 2.11, 104.06, 17.50, 72.49, "#A52A2A", 7.5], + ["ERNIE 3.0-Mini", "26.95M", 0.75, 38.24, 5.54, 1.59, 30.28, 8.18, 66.90, "#CD5C5C", 2.7], + ["ERNIE 3.0-Micro", "23.40M", 0.62, 26.44, 3.76, 1.33, 20.06, 5.46, 64.21, "#FF6347", 2.3], + ["ERNIE 3.0-Nano", "17.91M", 0.57, 20.93, 3.22, 1.25, 15.24, 4.89, 62.97, "#FF0000", 1.8], + ], + [ + ["RoBERTa-Base", "102.27M", 2.69, 226.16, 32.18, 3.44, 204.27, 34.10, 71.78, "royalblue", 10.2], # #4169E1 + ["RoBERTa-6L768H", "59.74M", 1.43, 112.55, 16.21, 2.14, 102.95, 18.55, 67.09, "#6495ED", 6.0], + ["RoBERTa-Medium", "36.56M", 1.02, 71.23, 10.84, 1.91, 65.74, 13.26, 67.06, "#87CEFA", 3.7], + ["RoBERTa-Small", "23.95M", 0.63, 36.33, 5.61, 1.41, 33.26, 7.01, 63.25, "#B0E0E6", 2.4], + # ['RoBERTa-Mini','8.77M', 0.59, 10.61, 2.02, 1.41, 10.03, 3.60, 53.40, '#40E0D0', 0.9], + # ['RoBERTa-Tiny','3.18M', 0.37, 2.08, 0.72, 1.03, 2.25, 1.30, 44.45, '#4682B4', 0.3], + ], + [ + ["TinyBERT6", "59.74M", 1.44, 113.90, 16.37, 2.14, 104.06, 17.44, 69.62, "gold", 6.5], # '#008000' + ["TinyBERT4", "11.46M", 0.54, 16.53, 2.93, 1.22, 14.02, 4.64, 60.82, "#8FBC8F", 1.3], + ], + [ + # [ + # 'RBTL3', '61.00M', 1.34, 113.27, 16.02, 1.69, 101.59, + # 15.47, 66.79, '#FFA500', 6.1 + # ], + ["RBT6", "59.74M", 1.43, 114.24, 16.35, 2.14, 103.53, 17.27, 70.06, "mediumseagreen", 6.0], # #FFDEAD + ["RBT4", "46.56M", 1.03, 76.19, 11.08, 1.60, 69.90, 12.60, 67.42, "#FFD700", 4.7], + ["RBT3", "38.43M", 0.86, 58.65, 8.28, 1.40, 52.12, 10.63, 65.72, "#FFE4B5", 3.8], + ], +] + +fig, ax = plt.subplots() +ax.spines["right"].set_visible(False) +ax.spines["top"].set_visible(False) + +ln_list = [] +size = 7 +for i in range(len(data)): + model_name_list = [model_info[0] for model_info in data[i]] + clue_res_list = [model_info[-3] for model_info in data[i]] + color = data[i][0][-2] + num_param_list = [model_info[1] for model_info in data[i]] + + if mode == "gpu": + if batch_size == "32": + latency_list = [model_info[2] for model_info in 
data[i]] + else: + latency_list = [model_info[5] for model_info in data[i]] + + (ln,) = plt.plot(latency_list, clue_res_list, color=color, linewidth=2.0, linestyle="-", marker="o", ms=5) + ln_list.append(ln) + for j, model in enumerate(data[i]): + xytext = (latency_list[j] + 0.05, clue_res_list[j] - 0.1) + model_name = model_name_list[j] + clue_res = clue_res_list[j] + num_param = num_param_list[j] + latency = latency_list[j] + if model_name in ("RoBERTa-Medium", "TinyBERT6", "ERNIE 3.0-Nano"): + xytext = (latency + 0.05, clue_res - 0.6) + if model_name in ("RBT4"): + xytext = (latency + 0.05, clue_res + 0.1) + plt.annotate(model_name, xy=(latency, clue_res), xytext=xytext, size=size, alpha=1.0) + plt.annotate(num_param, xy=(latency, clue_res), xytext=(xytext[0], xytext[1] - 0.3), size=5, alpha=1.0) + + elif mode == "cpu1": + if batch_size == "32": + latency_list = [model_info[3] for model_info in data[i]] + else: + latency_list = [model_info[6] for model_info in data[i]] + (ln,) = plt.plot(latency_list, clue_res_list, color=color, linewidth=2.0, linestyle="-", marker="o", ms=5) + ln_list.append(ln) + for j, model in enumerate(data[i]): + xytext = (latency_list[j] + 5.0, clue_res_list[j] - 0.1) + model_name = model_name_list[j] + clue_res = clue_res_list[j] + num_param = num_param_list[j] + latency = latency_list[j] + if model_name in ("RoBERTa-Medium", "TinyBERT6", "ERNIE 3.0-Nano"): + xytext = (latency + 5.0, clue_res - 0.6) + plt.annotate(model_name, xy=(latency, clue_res), xytext=xytext, size=size, alpha=1.0) + plt.annotate(num_param, xy=(latency, clue_res), xytext=(xytext[0], xytext[1] - 0.3), size=5, alpha=1.0) + else: + if batch_size == "32": + latency_list = [model_info[4] for model_info in data[i]] + else: + latency_list = [model_info[7] for model_info in data[i]] + (ln,) = plt.plot(latency_list, clue_res_list, color=color, linewidth=2.0, linestyle="-", marker="o", ms=5) + ln_list.append(ln) + for j, model in enumerate(data[i]): + xytext = (latency_list[j] + 0.8, clue_res_list[j] - 0.1) + model_name = model_name_list[j] + clue_res = clue_res_list[j] + num_param = num_param_list[j] + latency = latency_list[j] + if model_name in ("RoBERTa-Medium", "TinyBERT6", "ERNIE 3.0-Nano"): + xytext = (latency + 0.8, clue_res - 0.6) + plt.annotate(model_name, xy=(latency, clue_res), xytext=xytext, size=size, alpha=1.0) + plt.annotate(num_param, xy=(latency, clue_res), xytext=(xytext[0], xytext[1] - 0.3), size=5, alpha=1.0) + plt.legend(handles=ln_list, labels=["Baidu/ERNIE 3.0", "UER/RoBERTa", "Huawei/TinyBERT", "HFL/RBT"], loc="best") + +plt.title(title_name) +plt.xlabel(xlabel_name) +plt.ylabel(ylabel_name) + +plt.savefig(picture_name, dpi=500) diff --git a/examples/benchmark/clue/grid_search_tools/extract_result.sh b/examples/benchmark/clue/grid_search_tools/extract_result.sh new file mode 100644 index 0000000000000000000000000000000000000000..138c704a06e69e3a28d7a764700a3617b260ebfc --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/extract_result.sh @@ -0,0 +1,55 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +unset GREP_OPTIONS + +MODEL_PATH=$1 +declare -A dict + +for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl cmrc2018 chid c3 +do + # `awk '{print substr($0,1,11)}'` is to prevent '[' brought by logger. + if [ $task == 'cmrc2018' ]; then + dict[${task}]=`cat ${MODEL_PATH}/${task}/*|grep best_result|awk '{print $7}' |awk '{print substr($0,1,11)}'|awk 'BEGIN {max = 0} {if ($1+0 > max+0) max=$1} END {print max}'` + else + dict[${task}]=`tail -n 1 ${MODEL_PATH}/${task}/*|grep best_result|awk '{print $7}'|awk '{print substr($0,1,5)}'|awk 'BEGIN {max = 0} {if ($1+0 > max+0) max=$1} END {print max}'` + fi +done + +echo -e AFQMC"\t"TNEWS"\t"IFLYTEK"\t"CMNLI"\t"OCNLI"\t"CLUEWSC2020"\t"CSL"\t"CMRC2018"\t"CHID"\t"C3 + +for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl cmrc2018 chid c3 +do + echo -e -n "${dict[$task]}\t" +done + +echo -e "\n====================================================================\nBest hyper-parameters list: \n====================================================================" +echo -e TASK"\t"result"\t(lr, batch_size, dropout_p)" + +for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl cmrc2018 chid c3 +do + if [ -z ${dict[$task]} ] + then + continue + fi + s=`find ${MODEL_PATH}/${task}/* | xargs grep -rin "best_result: ${dict[$task]}"` + if [ $task == 'cmrc2018' ]; then + s=${s%/*} + fi + s=${s##*/} + s=`echo $s|awk '{split($1, hy, "."); print hy[1]"."hy[2]}'` + s=`echo $s|awk '{split($1, hy, "_"); print hy[1]"," hy[2]"," hy[3]}'` + echo -n ${task}| tr 'a-z' 'A-Z' + echo -e "\t"${dict[$task]}"\t("$s")" +done diff --git a/examples/benchmark/clue/grid_search_tools/grid_search.py b/examples/benchmark/clue/grid_search_tools/grid_search.py new file mode 100644 index 0000000000000000000000000000000000000000..6887f09bc10bd095a64f7db4e04cd69d911ea0c7 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/grid_search.py @@ -0,0 +1,187 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
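+# Entry point of the CLUE grid search: schedules classification and MRC jobs onto
+# GPUs with enough free memory (queried via pynvml), retries failed jobs up to 5
+# times, and finally calls extract_result.sh to summarize the best results.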
+ +import os +import subprocess +import sys +import time +from collections import defaultdict + +from pynvml import ( + nvmlDeviceGetCount, + nvmlDeviceGetHandleByIndex, + nvmlDeviceGetMemoryInfo, + nvmlInit, +) + +nvmlInit() + +world_size = nvmlDeviceGetCount() +handles = [] +mrc_device = {} +handle_mapping = {} +for i in range(world_size): + h = nvmlDeviceGetHandleByIndex(i) + handles.append(h) + handle_mapping[str(h)] = i + + +def get_availble(est=15, is_mrc=False): + # Sort handles according to info.free + handles.sort(key=lambda x: nvmlDeviceGetMemoryInfo(x).free, reverse=True) + for i, h in enumerate(handles): + device_id = handle_mapping[str(h)] + if device_id in mrc_device.values(): + continue + info = nvmlDeviceGetMemoryInfo(h) + gb = 1024 * 1024 * 1024 + print(f"- device_id: {device_id}") + print(f"- free : {info.free/gb}") + if info.free / gb >= est: + return device_id + return None + + +# TODO Support multi-machine + + +def get_mrc_tasks(model_name_or_path): + learning_rate_list = [1e-5, 2e-5, 3e-5] + batch_size_list = [32, 24] + cls_base_grd_acc = 4 + tasks = [] + for lr in learning_rate_list: + for bs in batch_size_list: + tasks.append(f"bash run_mrc.sh {model_name_or_path} chid {bs} {lr} {cls_base_grd_acc*2}") + tasks.append(f"bash run_mrc.sh {model_name_or_path} cmrc2018 {bs} {lr} {cls_base_grd_acc}") + tasks.append(f"bash run_mrc.sh {model_name_or_path} c3 {bs} {lr} {bs//2}") + return tasks + + +def get_cls_tasks(model_name_or_path): + learning_rate_list = [1e-5, 2e-5, 3e-5, 5e-5] + batch_size_list = [16, 32, 64] + datasets = ["afqmc", "tnews", "iflytek", "ocnli", "cmnli", "cluewsc2020", "csl"] + cls_base_grd_acc = 1 + hyper_params = { + "afqmc": [[3, 128, cls_base_grd_acc, 0.1]], + "tnews": [[3, 128, cls_base_grd_acc, 0.1]], + "iflytek": [[3, 128, cls_base_grd_acc, 0.1], [3, 128, cls_base_grd_acc, 0.0]], + "ocnli": [[5, 128, cls_base_grd_acc, 0.1]], + "cluewsc2020": [[50, 128, cls_base_grd_acc, 0.1], [50, 128, cls_base_grd_acc, 0.0]], + "csl": [[5, 256, cls_base_grd_acc * 2, 0.1]], + "cmnli": [[2, 128, cls_base_grd_acc, 0.1]], + } + tasks = [] + for dataset in datasets: + for lr in learning_rate_list: + for bs in batch_size_list: + for hyper_param in hyper_params[dataset]: + epoch, max_seq_len, grd_acc, dropout = hyper_param + tasks.append( + f"bash run_cls.sh {dataset} {lr} {bs} {epoch} {max_seq_len} {model_name_or_path} {grd_acc} {dropout}" + ) + for lr in learning_rate_list: + for hyper_param in hyper_params["cluewsc2020"]: + bs = 8 + epoch, max_seq_len, grd_acc, dropout = hyper_param + tasks.append( + f"bash run_cls.sh cluewsc2020 {lr} {bs} {epoch} {max_seq_len} {model_name_or_path} {grd_acc} {dropout}" + ) + return tasks + + +def do_task(task): + # tmp = task.split(" ") + est = 15 + # if int(tmp[4]) * int(tmp[6]) > 32 * 128: + # est = 30 + print(est) + is_mrc = False + if "cmrc" in task or "chid" in task or "c3" in task: + is_mrc = True + device_id = get_availble(est, is_mrc) + retry = 5 + while device_id is None and retry > 0: + print("> No device avaliable, wait 120 seconds.") + time.sleep(120) + device_id = get_availble(est, is_mrc) + retry -= 1 + if retry == 0: + return None + task_ps = f"set -x \nexport CUDA_VISIBLE_DEVICES={device_id}\n" + task + print(f"> Send task \n{task_ps}\n") + ps = subprocess.Popen(task_ps, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + if is_mrc and device_id is not None: + mrc_device[task] = device_id + print("mrc_device", mrc_device) + return ps + + +def main(): + model_name_or_path = sys.argv[1] + # Make sure that 
dataset has been downloaded first + status = os.system(f"python warmup_dataset_and_model.py {model_name_or_path}") + assert status == 0, "Please make sure clue dataset has been downloaded successfully." + tasks = [] + tasks = get_cls_tasks(model_name_or_path) + tasks += get_mrc_tasks(model_name_or_path) + + for x in tasks: + print(x) + + runs = [] + retry = defaultdict(int) + while len(tasks) > 0 or len(runs) > 0: + i = 0 + print("\n\n\n>> Round start") + while i < len(runs): + returncode = runs[i]["ps"].poll() + if returncode is not None: + if returncode != 0: + retry[runs[i]["ts"]] += 1 + print(f"> {runs[i]['ts']} task failed, will retried, tryed {retry[runs[i]['ts']]} times.") + output = runs[i]["ps"].communicate()[0] + for line in output.decode("utf-8").split("\n"): + print(line) + if retry[runs[i]["ts"]] <= 5: + tasks.append(runs[i]["ts"]) + else: + if "cmrc" in runs[i]["ts"] or "chid" in runs[i]["ts"] or "c3" in runs[i]["ts"]: + mrc_device.pop(runs[i]["ts"]) + print("mrc_device", mrc_device) + print(f"> Done! {runs[i]['ts']}") + runs.pop(i) + i = i - 1 + else: + print(">> DOING", runs[i]["ts"]) + i += 1 + + if len(tasks) > 0: + task = tasks.pop(0) + print(f"> Try to append {task}") + ps = do_task(task) + if ps is None: + tasks.append(task) + else: + runs.append({"ps": ps, "ts": task}) + + print("> Wait for 15 seconds to start!") + time.sleep(15) + print("All done!") + status = os.system(f"bash extract_result.sh {model_name_or_path}") + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/clue/grid_search_tools/run_cls.sh b/examples/benchmark/clue/grid_search_tools/run_cls.sh new file mode 100644 index 0000000000000000000000000000000000000000..dd0c474c002566f1ce766d5e4e4a0e38ed821618 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/run_cls.sh @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +export TASK_NAME=$1 +export LR=$2 +export BS=$3 +export EPOCH=$4 +export MAX_SEQ_LEN=$5 +export MODEL_PATH=$6 +export grad_acc=$7 +export dropout_p=$8 + +export FLAGS_cudnn_deterministic=True + +if [ -f "${MODEL_PATH}/${TASK_NAME}/${LR}_${BS}_${dropout_p}.log" ] +then + # Exits if log exits and best_result is computed. 
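+    # This makes the grid search re-entrant: when grid_search.py retries a task
+    # or is re-run, combinations that already produced a best_result are skipped.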
+ best_acc=`cat ${MODEL_PATH}/${TASK_NAME}/${LR}_${BS}_${dropout_p}.log |grep "best_result"` + if [ "${best_acc}" != "" ] + then + exit 0 + fi +fi + +mkdir -p ${MODEL_PATH}/${TASK_NAME} + +python -u ../classification/run_clue_classifier.py \ + --model_name_or_path ${MODEL_PATH} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --batch_size ${BS} \ + --learning_rate ${LR} \ + --num_train_epochs ${EPOCH} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \ + --device gpu \ + --gradient_accumulation_steps ${grad_acc} \ + --do_train \ + --dropout ${dropout_p} \ + --save_best_model False > ${MODEL_PATH}/${TASK_NAME}/${LR}_${BS}_${dropout_p}.log 2>&1 + diff --git a/examples/benchmark/clue/grid_search_tools/run_mrc.sh b/examples/benchmark/clue/grid_search_tools/run_mrc.sh new file mode 100644 index 0000000000000000000000000000000000000000..1ddb5cdfc507fe3c8f613ccab576c346d9805cd0 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/run_mrc.sh @@ -0,0 +1,69 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +MODEL_PATH=$1 +TASK_NAME=$2 +BATCH_SIZE=$3 +LR=$4 +GRAD_ACCU_STEPS=$5 + +export FLAGS_cudnn_deterministic=True + +if [ -f "${MODEL_PATH}/${TASK_NAME}/${LR}_${BATCH_SIZE}_0.1.log" ] +then + # Exits if log exits and best_result is computed. + best_res=`cat ${MODEL_PATH}/${TASK_NAME}/${LR}_${BATCH_SIZE}_0.1.log |grep "best_result"` + if [ "${best_res}" != "" ] + then + exit 0 + fi +fi + +if [ $TASK_NAME == 'cmrc2018' ]; then +MAX_SEQ_LEN=512 +EPOCHS=2 +WARMUP_PROP=0.1 +fi + +if [ $TASK_NAME == 'c3' ]; then +MAX_SEQ_LEN=512 +EPOCHS=8 +WARMUP_PROP=0.1 +fi + +if [ $TASK_NAME == 'chid' ]; then +MAX_SEQ_LEN=64 +EPOCHS=3 +WARMUP_PROP=0.06 +fi + +mkdir -p ${MODEL_PATH}/${TASK_NAME} + +python ../mrc/run_${TASK_NAME}.py \ + --model_name_or_path ${MODEL_PATH} \ + --batch_size ${BATCH_SIZE} \ + --learning_rate ${LR} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --num_train_epochs ${EPOCHS} \ + --output_dir ${MODEL_PATH}/${TASK_NAME}_model/${LR}_${BATCH_SIZE}/ \ + --do_train \ + --seed 42 \ + --weight_decay 0.01 \ + --device gpu \ + --num_proc 4 \ + --logging_steps 100 \ + --warmup_proportion ${WARMUP_PROP} \ + --gradient_accumulation_steps ${GRAD_ACCU_STEPS} \ + --save_best_model False > ${MODEL_PATH}/${TASK_NAME}/${LR}_${BATCH_SIZE}_0.1.log 2>&1 diff --git a/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py b/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py new file mode 100644 index 0000000000000000000000000000000000000000..d5775a7b2f4c1c4f0d7c11a2cf09194af20c1ac5 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py @@ -0,0 +1,50 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys + +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger + +model_name_or_path = sys.argv[1] + +# CLUE classification dataset warmup +logger.info("Download model and data for CLUE classification tasks.") +for task in ["afqmc", "tnews", "iflytek", "ocnli", "cmnli", "cluewsc2020", "csl"]: + load_dataset("clue", task, splits=("train", "dev", "test")) + +# Downloads HF dataset +from datasets import load_dataset # noqa: E402 + +load_dataset("clue", "chid") +load_dataset("clue", "cmrc2018") +load_dataset("clue", "c3") + +# HF dataset process and cache +logger.info("Data process for CHID tasks, and this will take some time. If cache exists, this will skip.") +status = os.system( + f"python ../mrc/run_chid.py --do_train --max_steps 0 --model_name_or_path {model_name_or_path} --batch_size 1 --gradient_accumulation_steps 1" +) +assert status == 0, "Please make sure clue dataset CHID has been preprocessed successfully." +logger.info("Data process for CMRC2018 tasks. If cache exists, this will skip.") +status = os.system( + f"python ../mrc/run_cmrc2018.py --do_train --max_steps 0 --model_name_or_path {model_name_or_path} --batch_size 1 --gradient_accumulation_steps 1" +) +assert status == 0, "Please make sure clue dataset CMRC2018 has been preprocessed successfully." +logger.info("Data process for C3 tasks. If cache exists, this will skip.") +status = os.system( + f"python ../mrc/run_c3.py --do_train --max_steps 0 --model_name_or_path {model_name_or_path} --batch_size 1 --gradient_accumulation_steps 1" +) +assert status == 0, "Please make sure clue dataset C3 has been preprocessed successfully." diff --git a/examples/benchmark/clue/mrc/run_c3.py b/examples/benchmark/clue/mrc/run_c3.py new file mode 100644 index 0000000000000000000000000000000000000000..a7bc139b4ca7571da235ae0420038696f6d24620 --- /dev/null +++ b/examples/benchmark/clue/mrc/run_c3.py @@ -0,0 +1,413 @@ +# coding: utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
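+
+# Fine-tunes a multiple-choice model on the CLUE C3 reading comprehension task.
+# Each example packs the context, the question and up to four candidate answers
+# into a [num_choices, seq_len] input; the best dev accuracy is logged as
+# "best_result" so that grid_search.py / extract_result.sh can collect it.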
+ +import argparse +import contextlib +import json +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from datasets import load_dataset + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + AutoModelForMultipleChoice, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name.", + ) + parser.add_argument( + "--num_proc", + default=None, + type=int, + help="Max number of processes when generating cache. Already cached shards are loaded sequentially.", + ) + parser.add_argument("--output_dir", default="best_c3_model", type=str, help="The path of the checkpoints .") + parser.add_argument("--save_best_model", default=True, type=strtobool, help="Whether to save best model.") + parser.add_argument( + "--overwrite_cache", + default=False, + type=strtobool, + help="Whether to overwrite cache for dataset.", + ) + parser.add_argument("--num_train_epochs", default=8, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--learning_rate", default=2e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + parser.add_argument("--batch_size", default=24, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--eval_batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=4, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. 
+ random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, dev_data_loader, metric): + metric.reset() + model.eval() + for _, batch in enumerate(dev_data_loader): + input_ids, segment_ids, label_id = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + correct = metric.compute(logits, label_id) + metric.update(correct) + acc = metric.accumulate() + model.train() + return acc + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +def run(args): + if args.do_train: + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." + max_seq_length = args.max_seq_length + max_num_choices = 4 + + def preprocess_function(examples, do_predict=False): + def _truncate_seq_tuple(tokens_a, tokens_b, tokens_c, max_length): + """Truncates a sequence tuple in place to the maximum length.""" + # This is a simple heuristic which will always truncate the longer + # sequence one token at a time. This makes more sense than + # truncating an equal percent of tokens from each, since if one + # sequence is very short then each token that's truncated likely + # contains more information than a longer sequence. 
+ while True: + total_length = len(tokens_a) + len(tokens_b) + len(tokens_c) + if total_length <= max_length: + break + if len(tokens_a) >= len(tokens_b) and len(tokens_a) >= len(tokens_c): + tokens_a.pop() + elif len(tokens_b) >= len(tokens_a) and len(tokens_b) >= len(tokens_c): + tokens_b.pop() + else: + tokens_c.pop() + + num_examples = len(examples.data["question"]) + if do_predict: + result = {"input_ids": [], "token_type_ids": []} + else: + result = {"input_ids": [], "token_type_ids": [], "labels": []} + for idx in range(num_examples): + text = "\n".join(examples.data["context"][idx]).lower() + question = examples.data["question"][idx].lower() + choice_list = examples.data["choice"][idx] + choice_list = [choice.lower() for choice in choice_list][:max_num_choices] + if not do_predict: + answer = examples.data["answer"][idx].lower() + label = choice_list.index(answer) + + tokens_t = tokenizer.tokenize(text) + tokens_q = tokenizer.tokenize(question) + + tokens_t_list = [] + tokens_c_list = [] + + # Pad each new example for axis=1, [batch_size, num_choices, seq_len] + while len(choice_list) < max_num_choices: + choice_list.append("无效答案") + + for choice in choice_list: + tokens_c = tokenizer.tokenize(choice.lower()) + _truncate_seq_tuple(tokens_t, tokens_q, tokens_c, max_seq_length - 4) + + tokens_c = tokens_q + ["[SEP]"] + tokens_c + tokens_t_list.append(tokens_t) + tokens_c_list.append(tokens_c) + + new_data = tokenizer(tokens_t_list, text_pair=tokens_c_list, is_split_into_words="token") + + # Pad each new example for axis=2 of [batch_size, num_choices, seq_len], + # because length of each choice could be different. + input_ids = Pad(axis=0, pad_val=tokenizer.pad_token_id)(new_data["input_ids"]) + token_type_ids = Pad(axis=0, pad_val=tokenizer.pad_token_id)(new_data["token_type_ids"]) + + # Final shape of input_ids: [batch_size, num_choices, seq_len] + result["input_ids"].append(input_ids) + result["token_type_ids"].append(token_type_ids) + if not do_predict: + result["labels"].append([label]) + if (idx + 1) % 1000 == 0: + logger.info("%d samples have been processed." 
% (idx + 1)) + return result + + paddle.set_device(args.device) + set_seed(args) + + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForMultipleChoice.from_pretrained(args.model_name_or_path, num_choices=max_num_choices) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_ds, dev_ds, test_ds = load_dataset("clue", "c3", split=["train", "validation", "test"]) + + if args.do_train: + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + column_names = train_ds.column_names + with main_process_first(desc="train dataset map pre-processing"): + train_ds = train_ds.map( + preprocess_function, + batched=True, + batch_size=len(train_ds), + num_proc=args.num_proc, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + "labels": Stack(dtype="int64"), # label + } + ): fn(samples) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + with main_process_first(desc="evaluate dataset map pre-processing"): + dev_ds = dev_ds.map( + preprocess_function, + batched=True, + batch_size=len(dev_ds), + remove_columns=column_names, + num_proc=args.num_proc, + load_from_cache_file=args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.eval_batch_size, shuffle=False) + dev_data_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + num_training_steps = ( + int(args.max_steps / args.gradient_accumulation_steps) + if args.max_steps >= 0 + else int(len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps) + ) + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
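+        # The filter is name-based: any parameter whose name contains "bias" or
+        # "norm" is left out of decay_params and therefore gets no weight decay
+        # through apply_decay_param_fun.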
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + loss_fct = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + model.train() + global_step = 0 + best_acc = 0.0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + input_ids, segment_ids, label = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, label) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step + 1, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step >= num_training_steps: + break + if global_step > num_training_steps: + break + tic_eval = time.time() + acc = evaluate(model, loss_fct, dev_data_loader, metric) + logger.info("eval acc: %.5f, eval done total : %s s" % (acc, time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0 and acc > best_acc: + best_acc = acc + if args.save_best_model: + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + if global_step >= num_training_steps: + break + logger.info("best_result: %.2f" % (best_acc * 100)) + + if args.do_predict: + column_names = test_ds.column_names + test_ds = test_ds.map( + partial(preprocess_function, do_predict=True), + batched=True, + batch_size=len(test_ds), + remove_columns=column_names, + num_proc=args.num_proc, + ) + # Serveral samples have more than four choices. 
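+        # A batch size of 1 keeps samples with different numbers of choices from
+        # being padded/stacked along the choice dimension within one batch.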
+ test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=1, shuffle=False) + + batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + } + ): fn(samples) + + test_data_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + f = open(os.path.join(args.output_dir, "c311_predict.json"), "w") + result = {} + idx = 0 + for step, batch in enumerate(test_data_loader): + input_ids, segment_ids = batch + with paddle.no_grad(): + logits = model(input_ids, segment_ids) + preds = paddle.argmax(logits, axis=1).numpy().tolist() + for pred in preds: + result[str(idx)] = pred + j = json.dumps({"id": idx, "label": pred}) + f.write(j + "\n") + idx += 1 + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + run(args) diff --git a/examples/benchmark/clue/mrc/run_chid.py b/examples/benchmark/clue/mrc/run_chid.py new file mode 100644 index 0000000000000000000000000000000000000000..952e78b3b1bb2068adb64189f6e167702b18966f --- /dev/null +++ b/examples/benchmark/clue/mrc/run_chid.py @@ -0,0 +1,583 @@ +# coding: utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import contextlib +import json +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from datasets import load_dataset + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + AutoModelForMultipleChoice, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." 
+ ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name.", + ) + parser.add_argument("--output_dir", default="best_chid_model", type=str, help="The path of the checkpoints .") + parser.add_argument("--save_best_model", default=True, type=strtobool, help="Whether to save best model.") + parser.add_argument( + "--overwrite_cache", + default=False, + type=strtobool, + help="Whether to overwrite cache for dataset.", + ) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--num_proc", + default=None, + type=int, + help="Max number of processes when generating cache. Already cached shards are loaded sequentially.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--learning_rate", default=2e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + parser.add_argument("--batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--eval_batch_size", default=24, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument( + "--max_seq_length", + default=64, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def calc_global_pred_results(logits): + logits = np.array(logits) + # [num_choices, tag_size] + logits = np.transpose(logits) + tmp = [] + for i, row in enumerate(logits): + for j, col in enumerate(row): + tmp.append((i, j, col)) + else: + choice = set(range(i + 1)) + blanks = set(range(j + 1)) + tmp = sorted(tmp, key=lambda x: x[2], reverse=True) + results = [] + for i, j, v in tmp: + if (j in blanks) and (i in choice): + results.append((i, j)) + blanks.remove(j) + choice.remove(i) + results = sorted(results, key=lambda x: x[1], reverse=False) + results = [i for i, j in results] + return results + + +@paddle.no_grad() +def evaluate(model, data_loader, do_predict=False): + model.eval() + right_num, total_num = 0, 0 + all_results = [] + for step, batch in enumerate(data_loader): + if do_predict: + input_ids, segment_ids, example_ids = batch + else: + input_ids, segment_ids, labels, example_ids = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + batch_num = example_ids.shape[0] + l = 0 + r = batch_num - 1 + batch_results = [] + for i in range(batch_num - 1): + if example_ids[i] != example_ids[i + 1]: + r = i + batch_results.extend(calc_global_pred_results(logits[l : r + 1, :])) + l = i + 1 + if l <= batch_num - 1: + batch_results.extend(calc_global_pred_results(logits[l:batch_num, :])) + if do_predict: + all_results.extend(batch_results) + else: + right_num += np.sum(np.array(batch_results) == labels.numpy()) + total_num += labels.shape[0] + model.train() + if not do_predict: + acc = right_num / total_num + return acc + return all_results + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +def run(args): + if args.do_train: + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." + paddle.set_device(args.device) + set_seed(args) + + max_seq_length = args.max_seq_length + max_num_choices = 10 + + def preprocess_function(examples, do_predict=False): + SPIECE_UNDERLINE = "▁" + + def _is_chinese_char(cp): + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def is_fuhao(c): + if ( + c == "。" + or c == "," + or c == "!" + or c == "?" 
+ or c == ";" + or c == "、" + or c == ":" + or c == "(" + or c == ")" + or c == "-" + or c == "~" + or c == "「" + or c == "《" + or c == "》" + or c == "," + or c == "」" + or c == '"' + or c == "“" + or c == "”" + or c == "$" + or c == "『" + or c == "』" + or c == "—" + or c == ";" + or c == "。" + or c == "(" + or c == ")" + or c == "-" + or c == "~" + or c == "。" + or c == "‘" + or c == "’" + ): + return True + return False + + def _tokenize_chinese_chars(text): + """Adds whitespace around any CJK character.""" + output = [] + is_blank = False + for index, char in enumerate(text): + cp = ord(char) + if is_blank: + output.append(char) + if context[index - 12 : index + 1].startswith("#idiom"): + is_blank = False + output.append(SPIECE_UNDERLINE) + else: + if text[index : index + 6] == "#idiom": + is_blank = True + if len(output) > 0 and output[-1] != SPIECE_UNDERLINE: + output.append(SPIECE_UNDERLINE) + output.append(char) + elif _is_chinese_char(cp) or is_fuhao(char): + if len(output) > 0 and output[-1] != SPIECE_UNDERLINE: + output.append(SPIECE_UNDERLINE) + output.append(char) + output.append(SPIECE_UNDERLINE) + else: + output.append(char) + return "".join(output) + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F or c == SPIECE_UNDERLINE: + return True + return False + + def add_tokens_for_around(tokens, pos, num_tokens): + num_l = num_tokens // 2 + num_r = num_tokens - num_l + + if pos >= num_l and (len(tokens) - 1 - pos) >= num_r: + tokens_l = tokens[pos - num_l : pos] + tokens_r = tokens[pos + 1 : pos + 1 + num_r] + elif pos <= num_l: + tokens_l = tokens[:pos] + right_len = num_tokens - len(tokens_l) + tokens_r = tokens[pos + 1 : pos + 1 + right_len] + elif (len(tokens) - 1 - pos) <= num_r: + tokens_r = tokens[pos + 1 :] + left_len = num_tokens - len(tokens_r) + tokens_l = tokens[pos - left_len : pos] + else: + raise ValueError("impossible") + + return tokens_l, tokens_r + + max_tokens_for_doc = max_seq_length - 3 + num_tokens = max_tokens_for_doc - 5 + num_examples = len(examples.data["candidates"]) + if do_predict: + result = {"input_ids": [], "token_type_ids": [], "example_ids": []} + else: + result = {"input_ids": [], "token_type_ids": [], "labels": [], "example_ids": []} + for idx in range(num_examples): + candidate = 0 + options = examples.data["candidates"][idx] + + # Each content may have several sentences. 
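+            # Every "#idiom" blank found in a sentence becomes its own example:
+            # the tokens around the blank are kept, other blanks are masked with
+            # [MASK] tokens, and the 10 candidate idioms form the choice axis.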
+ for context in examples.data["content"][idx]: + context = ( + context.replace("“", '"') + .replace("”", '"') + .replace("——", "--") + .replace("—", "-") + .replace("―", "-") + .replace("…", "...") + .replace("‘", "'") + .replace("’", "'") + ) + context = _tokenize_chinese_chars(context) + paragraph_text = context.strip() + doc_tokens = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + all_doc_tokens = [] + for (i, token) in enumerate(doc_tokens): + if "#idiom" in token: + sub_tokens = [str(token)] + else: + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + all_doc_tokens.append(sub_token) + tags = [blank for blank in doc_tokens if "#idiom" in blank] + + # Each sentence may have several tags + for tag_index, tag in enumerate(tags): + pos = all_doc_tokens.index(tag) + + tmp_l, tmp_r = add_tokens_for_around(all_doc_tokens, pos, num_tokens) + num_l = len(tmp_l) + num_r = len(tmp_r) + tokens_l = [] + for token in tmp_l: + if "#idiom" in token and token != tag: + # Mask tag which is not considered in this new sample. + # Each idiom has four words, so 4 mask tokens are used. + tokens_l.extend(["[MASK]"] * 4) + else: + tokens_l.append(token) + tokens_l = tokens_l[-num_l:] + del tmp_l + + tokens_r = [] + for token in tmp_r: + if "#idiom" in token and token != tag: + tokens_r.extend(["[MASK]"] * 4) + else: + tokens_r.append(token) + tokens_r = tokens_r[:num_r] + del tmp_r + + tokens_list = [] + # Each tag has ten choices, and the shape of each new + # example is [num_choices, seq_len] + for i, elem in enumerate(options): + option = tokenizer.tokenize(elem) + tokens = option + ["[SEP]"] + tokens_l + ["[unused1]"] + tokens_r + tokens_list.append(tokens) + new_data = tokenizer(tokens_list, is_split_into_words=True) + # Final shape of input_ids: [batch_size, num_choices, seq_len] + result["input_ids"].append(new_data["input_ids"]) + result["token_type_ids"].append(new_data["token_type_ids"]) + result["example_ids"].append(idx) + if not do_predict: + label = examples.data["answers"][idx]["candidate_id"][candidate] + result["labels"].append(label) + candidate += 1 + if (idx + 1) % 10000 == 0: + logger.info("%d samples have been processed." 
% (idx + 1)) + return result + + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + model = AutoModelForMultipleChoice.from_pretrained(args.model_name_or_path, num_choices=max_num_choices) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_ds, dev_ds, test_ds = load_dataset("clue", "chid", split=["train", "validation", "test"]) + + if args.do_train: + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + column_names = train_ds.column_names + with main_process_first(desc="train dataset map pre-processing"): + train_ds = train_ds.map( + partial(preprocess_function), + batched=True, + batch_size=len(train_ds), + num_proc=args.num_proc, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + "labels": Stack(dtype="int64"), # label + "example_ids": Stack(dtype="int64"), # example id + } + ): fn(samples) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + with main_process_first(desc="evaluate dataset map pre-processing"): + dev_ds = dev_ds.map( + partial(preprocess_function), + batched=True, + batch_size=len(dev_ds), + remove_columns=column_names, + num_proc=args.num_proc, + load_from_cache_file=args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.eval_batch_size, shuffle=False) + + dev_data_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + num_training_steps = ( + int(args.max_steps / args.gradient_accumulation_steps) + if args.max_steps >= 0 + else int(len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps) + ) + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + loss_fct = nn.CrossEntropyLoss() + + model.train() + global_step = 0 + best_acc = 0.0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + input_ids, segment_ids, labels, example_ids = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, labels) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % ( + global_step, + num_training_steps, + epoch, + step + 1, + loss, + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step >= num_training_steps: + break + if global_step > num_training_steps: + break + tic_eval = time.time() + acc = evaluate(model, dev_data_loader) + logger.info("eval acc: %.5f, eval done total : %s s" % (acc, time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0 and acc > best_acc: + best_acc = acc + if args.save_best_model: + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + if global_step >= num_training_steps: + break + logger.info("best_result: %.2f" % (best_acc * 100)) + + if args.do_predict: + column_names = test_ds.column_names + test_ds = test_ds.map( + partial(preprocess_function, do_predict=True), + batched=True, + batch_size=len(test_ds), + remove_columns=column_names, + num_proc=args.num_proc, + ) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.eval_batch_size, shuffle=False) + + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + "example_ids": Stack(dtype="int64"), # example id + } + ): fn(samples) + + test_data_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + result = {} + idx = 623377 + preds = evaluate(model, test_data_loader, do_predict=True) + for pred in preds: + result["#idiom" + str(idx) + "#"] = pred + idx += 1 + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + with open(os.path.join(args.output_dir, "chid11_predict.json"), "w") as writer: + json.dump(result, writer, indent=2) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + run(args) diff --git a/examples/benchmark/clue/mrc/run_cmrc2018.py 
b/examples/benchmark/clue/mrc/run_cmrc2018.py new file mode 100644 index 0000000000000000000000000000000000000000..be12a216edd45a8ea07924d2d3a76afe85c5c988 --- /dev/null +++ b/examples/benchmark/clue/mrc/run_cmrc2018.py @@ -0,0 +1,488 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import contextlib +import distutils.util +import json +import os +import random +import time + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + AutoModelForQuestionAnswering, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="best_cmrc_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--save_best_model", default=True, type=distutils.util.strtobool, help="Whether to save best model." + ) + parser.add_argument( + "--overwrite_cache", + default=False, + type=distutils.util.strtobool, + help="Whether to overwrite cache for dataset.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--num_proc", + default=None, + type=int, + help="Max number of processes when generating cache. Already cached shards are loaded sequentially.", + ) + parser.add_argument("--eval_batch_size", default=12, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. 
If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=50, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument("--do_train", action="store_true", help="Whether to train.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=2, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, raw_dataset, dataset, data_loader, args, do_eval=True): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + for batch in data_loader: + start_logits, end_logits = model(**batch) + for idx in range(start_logits.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + logger.info("Processing example: %d" % len(all_start_logits)) + logger.info("time per 1000: %s" % (time.time() - tic_eval)) + tic_eval = time.time() + + all_start_logits.append(start_logits.numpy()[idx]) + all_end_logits.append(end_logits.numpy()[idx]) + + all_predictions, _, _ = compute_prediction( + raw_dataset, dataset, (all_start_logits, all_end_logits), False, args.n_best_size, args.max_answer_length + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + if do_eval: + filename = os.path.join(args.output_dir, "prediction_validation.json") + else: + filename = os.path.join(args.output_dir, "cmrc2018_predict.json") + with open(filename, "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + if do_eval: + res = squad_evaluate( + examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, is_whitespace_splited=False + ) + model.train() + return res["exact"], res["f1"] + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, 
axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +def run(args): + if args.do_train: + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + set_seed(args) + + train_examples, dev_examples, test_examples = load_dataset( + "clue", "cmrc2018", split=["train", "validation", "test"] + ) + + column_names = train_examples.column_names + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + model = AutoModelForQuestionAnswering.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + def prepare_train_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). 
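+            # token_type_ids double as sequence ids here: 0 marks question
+            # tokens and 1 marks context tokens, which the search below uses to
+            # find the context span.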
+ sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + def prepare_validation_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HuggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. 
+ sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + if args.do_train: + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + + with main_process_first(desc="train dataset map pre-processing"): + train_ds = train_examples.map( + prepare_train_features, + batched=True, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + num_proc=args.num_proc, + desc="Running tokenizer on train dataset", + ) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + batchify_fn = DataCollatorWithPadding(tokenizer) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + with main_process_first(desc="evaluate dataset map pre-processing"): + dev_ds = dev_examples.map( + prepare_validation_features, + batched=True, + remove_columns=column_names, + num_proc=args.num_proc, + load_from_cache_file=args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + dev_ds_for_model = dev_ds.remove_columns(["example_id", "offset_mapping", "attention_mask"]) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.eval_batch_size, shuffle=False) + + dev_data_loader = DataLoader( + dataset=dev_ds_for_model, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + num_training_steps = ( + int(args.max_steps / args.gradient_accumulation_steps) + if args.max_steps >= 0 + else int(len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps) + ) + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
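+        # The filter below works on parameter names: any parameter whose name contains
+        # "bias" or "norm" is excluded from `decay_params`, and `apply_decay_param_fun`
+        # then restricts AdamW weight decay to the remaining parameters only.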
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + best_res = (0.0, 0.0) + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + start_positions = batch.pop("start_positions") + end_positions = batch.pop("end_positions") + logits = model(**batch) + loss = criterion(logits, (start_positions, end_positions)) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % ( + global_step, + num_training_steps, + epoch, + step + 1, + loss, + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step >= num_training_steps: + break + if global_step > num_training_steps: + break + em, f1 = evaluate(model, dev_examples, dev_ds, dev_data_loader, args) + if paddle.distributed.get_rank() == 0 and em > best_res[0]: + best_res = (em, f1) + if args.save_best_model: + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + break + logger.info("best_result: %.2f/%.2f" % (best_res[0], best_res[1])) + + if args.do_predict and rank == 0: + test_ds = test_examples.map( + prepare_validation_features, batched=True, remove_columns=column_names, num_proc=args.num_proc + ) + test_ds_for_model = test_ds.remove_columns(["example_id", "offset_mapping", "attention_mask"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds_for_model, batch_size=args.eval_batch_size, shuffle=False) + + batchify_fn = DataCollatorWithPadding(tokenizer) + test_data_loader = DataLoader( + dataset=test_ds_for_model, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + evaluate(model, test_examples, test_ds, test_data_loader, args, do_eval=False) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + run(args) diff --git a/examples/benchmark/glue/README.md b/examples/benchmark/glue/README.md new file mode 100644 index 0000000000000000000000000000000000000000..14f9abc434606081297b69087372e7c1987de8b0 --- /dev/null +++ b/examples/benchmark/glue/README.md @@ -0,0 +1,126 @@ +# GLUE Benchmark + +[GLUE](https://gluebenchmark.com/)是当今使用最为普遍的自然语言理解评测基准数据集,评测数据涵盖新闻、电影、百科等许多领域,其中有简单的句子,也有困难的句子。其目的是通过公开的得分榜,促进自然语言理解系统的发展。详细可参考 [GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7) + +本项目是 GLUE评测任务 在 Paddle 2.0上的开源实现。 + +本项目支持BERT, ELECTRA,ERNIE,ALBERT,RoBERTa模型,可在model_type中进行指定。 + +## 快速开始 + 
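+在运行下述命令前,请确保当前环境中已安装 PaddlePaddle 与 PaddleNLP(GLUE 数据会在首次运行时自动下载)。一个最小的安装示例如下,具体版本请以实际环境为准:
+
+```shell
+pip install --upgrade paddlenlp
+```
+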
+### 启动GLUE任务 +以 GLUE/SST-2 任务为例,启动GLUE任务进行Fine-tuning的方式如下: + +#### 单卡训练 +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_name_or_path bert-base-uncased \ + --tokenizer_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu + +``` + +#### 多卡训练 +```shell +unset CUDA_VISIBLE_DEVICES +export TASK_NAME=SST-2 + +python -m paddle.distributed.launch --gpus "0,1" run_glue.py \ + --model_name_or_path bert-base-uncased \ + --tokenizer_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu + +``` +其中参数释义如下: +- `model_name_or_path` 指示了Fine-tuning使用的具体预训练模型,可以是PaddleNLP提供的预训练模型 或者 本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: /home/xx_model/,目录中需包含paddle预训练模型model_state.pdparams。 +如果使用PaddleNLP提供的预训练模型,可以选择`model_type`在[Transformer预训练模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中相对应的英文预训练权重。注意这里选择的模型权重要和上面配置的模型类型匹配,例如model_type 配置的是bert,则model_name_or_path只能选择bert相关的模型。另,glue任务应选择英文预训练权重。 +- `tokenizer_name_or_path` 指示了Fine-tuning使用的具体tokenizer,一般保持和model_name_or_path一致,也可以单独指定 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 + +Fine-tuning过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下日志: + +``` +global step 6310/6315, epoch: 2, batch: 2099, rank_id: 0, loss: 0.035772, lr: 0.0000000880, speed: 3.1527 step/s +global step 6311/6315, epoch: 2, batch: 2100, rank_id: 0, loss: 0.056789, lr: 0.0000000704, speed: 3.4201 step/s +global step 6312/6315, epoch: 2, batch: 2101, rank_id: 0, loss: 0.096717, lr: 0.0000000528, speed: 3.4694 step/s +global step 6313/6315, epoch: 2, batch: 2102, rank_id: 0, loss: 0.044982, lr: 0.0000000352, speed: 3.4513 step/s +global step 6314/6315, epoch: 2, batch: 2103, rank_id: 0, loss: 0.139579, lr: 0.0000000176, speed: 3.4566 step/s +global step 6315/6315, epoch: 2, batch: 2104, rank_id: 0, loss: 0.046043, lr: 0.0000000000, speed: 3.4590 step/s +eval loss: 0.549763, acc: 0.9151376146788991, eval done total : 1.8206987380981445 s +``` + +使用各种预训练模型进行 Fine-tuning ,在GLUE验证集上有如下结果: + +| Model GLUE Score | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | +|--------------------|-------|--------|--------|--------|--------|--------|--------|--------| +| electra-small | 58.22 | 91.85 | 88.24 | 87.24 | 88.83 | 82.45 | 88.61 | 66.78 | +| ernie-2.0-large-en | 65.4 | 96.0 | 88.7 | 92.3 | 92.5 | 89.1 | 94.3 | 85.2 | + +关于GLUE Score的说明: +1. 因Fine-tuning过程中有dropout等随机因素影响,同样预训练模型每次运行的GLUE Score会有较小差异,上表中的GLUE Score是运行多次取eval最好值的得分。 +2. 不同GLUE任务判定得分所使用的评价指标有些差异,简单如下表,详细说明可参考[GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7)。 + +| GLUE Task | Metric | +|------------|------------------------------| +| CoLA | Matthews corr | +| SST-2 | acc. | +| MRPC | acc./F1 | +| STS-B | Pearson/Spearman corr | +| QQP | acc./F1 | +| MNLI | matched acc./mismatched acc. | +| QNLI | acc. 
| +| RTE | acc. | + +#### trainer 版本 + +```shell +export task_name=mnli +export learning_rate=5e-5 + +python run_glue_trainer.py \ +--model_name_or_path roberta-large \ +--task_name $task_name \ +--do_train \ +--do_eval \ +--max_seq_length 512 \ +--per_device_train_batch_size 16 \ +--per_device_eval_batch_size 64 \ +--learning_rate $learning_rate \ +--num_train_epochs 10 \ +--output_dir ./checkpoints/$task_name/ft \ +--overwrite_output_dir \ +--logging_steps 10 \ +--evaluation_strategy epoch \ +--save_strategy epoch \ +--warmup_ratio 0.06 \ +--seed 0 \ +--weight_decay 0.1 \ +--disable_tqdm True +``` diff --git a/examples/benchmark/glue/run_glue.py b/examples/benchmark/glue/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..29ad6c42236a7efe3eb594d9eefad27da08dc167 --- /dev/null +++ b/examples/benchmark/glue/run_glue.py @@ -0,0 +1,346 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument("--model_type", default=None, type=str, required=False, help="should be remove later") + parser.add_argument( + "--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model" + ) + parser.add_argument("--tokenizer_name_or_path", default=None, type=str, required=False, help="Path to tokenizer") + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + train_ds = load_dataset("glue", args.task_name, splits="train") + if args.tokenizer_name_or_path: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path) + else: + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + 
return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) 
+ tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/benchmark/glue/run_glue_trainer.py b/examples/benchmark/glue/run_glue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..03c4e2f347bf911bb64df16ed381181231f1eba4 --- /dev/null +++ b/examples/benchmark/glue/run_glue_trainer.py @@ -0,0 +1,333 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + export_model, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + task_name: str = field( + default=None, + metadata={"help": "The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys())}, + ) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+ """ + + model_name_or_path: str = field( + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + } + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_rank: int = field(default=8, metadata={"help": "Lora rank"}) + lora_alpha: int = field(default=16, metadata={"help": "Lora alpha"}) + qat: bool = field(default=False, metadata={"help": "Whether to use QAT technique"}) + qat_type: str = field(default="A8W8", metadata={"help": "Quantization type. Supported values: A8W8, W4,A8W4"}) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"]} + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + data_args.task_name = data_args.task_name.strip().lower() + metric = METRIC_CLASSES[data_args.task_name]() + + train_ds = load_dataset("glue", data_args.task_name, splits="train") + if model_args.tokenizer_name_or_path: + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=data_args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + + if data_args.task_name == "mnli": + dev_ds, dev_ds_mismatched = load_dataset("glue", data_args.task_name, splits=["dev_matched", "dev_mismatched"]) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + + test_ds, test_ds_mismatched = load_dataset( + "glue", data_args.task_name, splits=["test_matched", "test_mismatched"] + ) + + test_ds = test_ds.map(trans_func, lazy=True) + test_ds_mismatched = test_ds_mismatched.map(trans_func, lazy=True) + + else: + dev_ds = load_dataset("glue", data_args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + + test_ds = load_dataset("glue", data_args.task_name, splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + if model_args.lora: + # TODO: hardcode parameters for now. 
Change after MergedLoRA is introduced + lora_config = LoRAConfig( + target_modules=[ + ".*self_attn.q_proj.*", + ".*self_attn.k_proj.*", + ".*self_attn.v_proj.*", + ".*self_attn.out_proj.*", + ".*linear1.*", + ".*linear2.*", + ], + trainable_modules=[".*classifier.*"], + r=model_args.lora_rank, + lora_alpha=model_args.lora_alpha, + merge_weights=False, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + if model_args.qat: + from paddle import nn + from paddle.quantization import QAT, QuantConfig + from paddle.quantization.quanters import ( + FakeQuanterChannelWiseAbsMaxObserver, + FakeQuanterWithAbsMaxObserver, + ) + + from paddlenlp.peft.lora import LoRALinear + from paddlenlp.peft.lora.lora_quant_layers import QuantedLoRALinear + + q_config = QuantConfig(activation=None, weight=None) + q_config.add_qat_layer_mapping(LoRALinear, QuantedLoRALinear) + + if model_args.qat_type == "A8W8": + activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype=dtype) + elif model_args.qat_type == "W4": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype=dtype) + elif model_args.qat_type == "A8W4": + activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype=dtype) + else: + raise ValueError("qat_type should be one of ['A8W8', 'W4', 'A8W4']") + + q_config.add_type_config(LoRALinear, weight=weight, activation=activation) + q_config.add_type_config(nn.Linear, weight=weight, activation=activation) + + qat = QAT(q_config) + model = qat.quantize(model, inplace=True) + + # Define the metrics of tasks. 
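+    # `compute_metrics` receives the Trainer's EvalPrediction-style object: `p.predictions`
+    # holds the logits (a tuple when the model returns several outputs) and `p.label_ids`
+    # holds the gold labels; the paddle metric object is reused through its
+    # compute / update / accumulate interface and reset between evaluations.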
+ def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + res = metric.accumulate() + metric.reset() + if isinstance(metric, AccuracyAndF1): + return { + "accuracy": res[0], + "precision": res[1], + "recall": res[2], + "f1 score": res[3], + "accuracy and f1": res[4], + } + elif isinstance(metric, Mcc): + return {"mcc": res[0]} + elif isinstance(metric, PearsonAndSpearman): + return { + "pearson": res[0], + "spearman": res[1], + "pearson and spearman": res[2], + } + else: + return {"accuracy": res} + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + logger.info("*** Evaluate ***") + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + if data_args.task_name == "mnli": + eval_metrics = trainer.evaluate(dev_ds_mismatched) + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + logger.info("*** Predict ***") + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + if data_args.task_name == "mnli": + test_ret = trainer.predict(test_ds_mismatched) + trainer.log_metrics("test", test_ret.metrics) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/lambada/eval.py b/examples/benchmark/lambada/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..8a9aa31043fb0850a388c36a7579f97009af59f9 --- /dev/null +++ b/examples/benchmark/lambada/eval.py @@ -0,0 +1,384 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
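+
+# This script evaluates a causal LM in one of two modes, selected by `--cloze_eval`:
+#   * default: `--eval_path` points to a raw text file (e.g. WikiText); the script reports
+#     loss / perplexity over overlapping windows controlled by `--overlapping_eval`;
+#   * `--cloze_eval`: `--eval_path` points to a JSON-lines file whose records contain a
+#     "text" field (LAMBADA style); the script reports last-word prediction accuracy.
+# Illustrative single-card invocation (model name and file path are placeholders):
+#   python eval.py --model_name_or_path <model> --eval_path lambada_test.jsonl --cloze_eval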
+from __future__ import annotations + +import argparse +import json +import math +import re +import time +from pprint import pprint as print + +# from paddle.distributed.apis import env +import numpy as np +import paddle +from paddle.distributed import fleet +from paddle.io import DataLoader + +from paddlenlp.data import Stack, Tuple +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +def get_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", default=None, type=str, required=False, help="Model type selected in the list") + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: ", + ) + + # only support tensor_parallel_degree + parser.add_argument( + "--tensor_parallel_degree", + type=int, + default=1, + help="Model Parallelism degree. Spliting the linear layers to many cards.", + ) + + # Other config + parser.add_argument("--seed", type=int, default=1024, help="Random seed for initialization") + parser.add_argument("--sample_nums", type=int, default=16, help="Random seed for initialization") + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "eval_pathgpu", "xpu", "npu"], + help="select cpu, gpu, xpu devices.", + ) + parser.add_argument( + "--dtype", + type=str, + default="float16", + choices=["bfloat16", "float16", "float32"], + help="set the dtype of model", + ) + + # load autodist name files, eg: bloom-176b + parser.add_argument("--load_autodist", action="store_true", help="whether load auto-dist wieght file") + + return parser + + +def get_eval_parser(): + parser = get_parser() + parser.add_argument( + "--eval_path", + default=None, + type=str, + required=True, + help="The eval file path.", + ) + parser.add_argument( + "--cloze_eval", action="store_true", help="Evaluation dataset from `--eval_path` is a cloze task." + ) + parser.add_argument("--overlapping_eval", type=int, default=32, help="Sliding window for overlapping eval.") + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--seq_length", type=int, default=512, help="Maximum sequence length to process for evaluation." 
+ ) + parser.add_argument("--logging_steps", type=int, default=10, help="logging step for eval") + return parser + + +class LM_Eval_Dataset(paddle.io.Dataset): + def __init__(self, tokens, seq_len, pad_idx, overlapping_eval=None): + self.tokens = tokens + self.seq_len = seq_len + self.pad_idx = pad_idx + self.overlapping_eval = overlapping_eval + if self.overlapping_eval is None: + self.overlapping_eval = self.seq_len + self.overlapping_eval = max(1, self.overlapping_eval) + + self.total_targets = len(self.tokens) - 1 + # remove first sequence tokens + targets = max(self.total_targets - self.overlapping_eval, 0) + self.total_sequences = max(math.ceil(targets / self.overlapping_eval) + 1, 1) + + def __len__(self): + return self.total_sequences + + def _construct_sample(self, tokens): + tokens = np.array(tokens).astype("int64").tolist() + labels = tokens[1:] + tokens = tokens[:-1] + seq_length = len(tokens) + # attention mask for the attention calulate + attention_mask = np.tri(seq_length, seq_length).reshape((1, seq_length, seq_length)) + + # the pad and eos tokens do not contribute the loss + loss_mask = np.ones(seq_length, dtype="float32") + loss_mask[np.where(np.array(tokens) == self.pad_idx)] = 0.0 + position_ids = np.arange(0, seq_length, dtype="int64") + + # -INF mask value as default + # attention_mask = (attention_mask - 1.0) * 1e9 + # Bool mask of attention + attention_mask = attention_mask.astype("float32") + return [tokens, loss_mask, attention_mask, position_ids, labels] + + def __getitem__(self, idx): + start_idx = idx * self.overlapping_eval + end_idx = start_idx + self.seq_len + tokens = self.tokens[start_idx : end_idx + 1] + num_tokens = len(tokens) + if num_tokens < self.seq_len + 1: + num_pad = self.seq_len + 1 - num_tokens + tokens += [self.pad_idx] * num_pad + [tokens, loss_mask, attention_mask, position_ids, labels] = self._construct_sample(tokens) + if self.overlapping_eval != self.seq_len and idx != 0: + loss_mask[: -self.overlapping_eval] *= 0 + + return [tokens, loss_mask, attention_mask, position_ids, labels] + + +class Lambada_Eval_Dataset(paddle.io.Dataset): + def __init__(self, tokens, labels, seq_len, pad_idx): + self.seq_len = seq_len + self.pad_idx = pad_idx + self.tokens = tokens + self.labels = labels + + def __len__(self): + return len(self.tokens) + + def _construct_sample(self, tokens): + tokens = np.array(tokens).astype("int64").tolist() + labels = tokens[1:] + tokens = tokens[:-1] + + seq_length = len(tokens) + # attention mask for the attention calulate + attention_mask = np.tri(seq_length, seq_length).reshape((1, seq_length, seq_length)) + + # the pad and eos tokens do not contribute the loss + position_ids = np.arange(0, seq_length, dtype="int64") + + # -INF mask value as default + # attention_mask = (attention_mask - 1.0) * 1e9 + # Bool mask of attention + attention_mask = attention_mask.astype("float32") + return [tokens, attention_mask, position_ids, labels] + + def __getitem__left_padding(self, idx): + tokens = self.tokens[idx][: self.seq_len] + labels = self.labels[idx] + tokens = tokens + labels + num_tokens = len(tokens) + if num_tokens < self.seq_len + 1: + num_pad = self.seq_len + 1 - num_tokens + # tokens += [self.pad_idx] * num_pad + tokens + tokens = [self.pad_idx] * num_pad + tokens + loss_mask = np.zeros(self.seq_len, dtype="float32") + loss_mask[-len(labels) :] = 1.0 + [tokens, attention_mask, position_ids, labels] = self._construct_sample(tokens) + return [tokens, loss_mask, attention_mask, position_ids, labels] + + def 
__getitem__(self, idx): + tokens = self.tokens[idx][: self.seq_len] + labels = self.labels[idx] + tokens = tokens + labels + + num_tokens = len(tokens) + if num_tokens < self.seq_len + 1: + num_pad = self.seq_len + 1 - num_tokens + tokens += [self.pad_idx] * num_pad + loss_mask = np.zeros(self.seq_len, dtype="float32") + loss_mask[num_tokens - len(labels) - 1 : num_tokens - 1] = 1.0 + [tokens, attention_mask, position_ids, labels] = self._construct_sample(tokens) + return [tokens, loss_mask, attention_mask, position_ids, labels] + + +def wikitext_detokenizer(string): + # contractions + string = string.replace("s '", "s'") + string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string) + # number separators + string = string.replace(" @-@ ", "-") + string = string.replace(" @,@ ", ",") + string = string.replace(" @.@ ", ".") + # punctuation + string = string.replace(" : ", ": ") + string = string.replace(" ; ", "; ") + string = string.replace(" . ", ". ") + string = string.replace(" ! ", "! ") + string = string.replace(" ? ", "? ") + string = string.replace(" , ", ", ") + # double brackets + string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string) + string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string) + string = re.sub(r"{\s*([^}]*?)\s*}", r"{\1}", string) + string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string) + string = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", string) + # miscellaneous + string = string.replace("= = = =", "====") + string = string.replace("= = =", "===") + string = string.replace("= =", "==") + string = string.replace(" " + chr(176) + " ", chr(176)) + string = string.replace(" \n", "\n") + string = string.replace("\n ", "\n") + string = string.replace(" N ", " 1 ") + string = string.replace(" 's", "'s") + return string + + +def get_tokens(tokenizer, text, strict=True): + if not strict: + tokens = tokenizer(text)["input_ids"] + return tokens[:-1], [tokens[-1]] + last_token = text.split()[-1] + start_idx = text.rfind(last_token) + beginning_tokens = tokenizer(text[:start_idx].strip())["input_ids"] + last_token = tokenizer(" " + last_token)["input_ids"] + return beginning_tokens, last_token + + +def create_eval_dataset(args): + val_dataloader = None + eval_batch_size = args.batch_size + seq_len = args.seq_length + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + tokenizer.pad_token = tokenizer.eos_token if tokenizer.eos_token else "" + if not args.cloze_eval: + with open(args.eval_path, "rb") as reader: + entire_data = reader.read().decode("utf-8") + num_original_tokens = len(entire_data.strip().split(" ")) + entire_data = wikitext_detokenizer(entire_data) + tokenized_data = tokenizer(entire_data)["input_ids"] + num_tokenized_tokens = len(tokenized_data) + print("Original Tokens: %d, Detokenized tokens: %d" % (num_tokenized_tokens, num_original_tokens)) + val_dataset = LM_Eval_Dataset(tokenized_data, seq_len, tokenizer.pad_token_id, args.overlapping_eval) + else: + tokenized_data = [] + tokenized_label = [] + with open(args.eval_path, "r") as f: + for line in f.readlines(): + text = json.loads(line)["text"] + tokens, labels = get_tokens(tokenizer, text, strict=False) + tokenized_data.append(tokens) + tokenized_label.append(labels) + val_dataset = Lambada_Eval_Dataset(tokenized_data, tokenized_label, seq_len, tokenizer.pad_token_id) + num_tokenized_tokens = 0 + num_original_tokens = 0 + + args.num_examples = len(val_dataset) + args.num_original_tokens = num_original_tokens + args.num_tokenized_tokens = num_tokenized_tokens + val_dataloader = DataLoader( + 
val_dataset, + batch_size=eval_batch_size, + drop_last=False, + collate_fn=Tuple(Stack(), Stack(), Stack(), Stack(), Stack()), + ) + + return val_dataloader + + +def do_generation(): + + # env.set_seed(seed) + parser = get_eval_parser() + args = parser.parse_args() + paddle.set_default_dtype(args.dtype) + + if args.tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "mp_degree": args.tensor_parallel_degree, + } + # Set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + fleet.init(is_collective=True, strategy=strategy) + + eval_data_loader = create_eval_dataset(args) + + tic_eval = time.time() + + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + tensor_parallel_output=False, + tensor_parallel_degree=args.tensor_parallel_degree, + tensor_parallel_rank=paddle.distributed.get_rank(), + use_flash_attention=False, + dtype=args.dtype, # todo enable set dtype to avoid additional mem usage + ) + + model.eval() + args.use_pure_fp16 = False + + total_score = 0 + score_name = "loss" if not args.cloze_eval else "number correct" + args.use_pure_fp16 = False + eval_data_loader = create_eval_dataset(args) + with paddle.no_grad(): + for step, batch in enumerate(eval_data_loader): + + tokens, loss_mask = batch[:2] + labels = batch[-1] + with paddle.amp.auto_cast(args.use_pure_fp16): + if args.model_type == "bloom": + preds = model(tokens).detach() + else: + preds = model(tokens)[0].detach() + # print(preds) + + # cast preds to float32 to keep high-precision + preds = preds.astype(paddle.float32) + + if not args.cloze_eval: + masked_lm_loss = paddle.nn.functional.cross_entropy(preds, labels, reduction="none") + loss = paddle.sum(masked_lm_loss * loss_mask) + total_score += float(loss) / (args.num_tokenized_tokens - 1) + else: + outputs = paddle.argmax(preds, -1) + acc = paddle.cast(outputs == labels, "float32") + acc = paddle.where(paddle.cast(loss_mask, "bool"), acc, paddle.ones_like(acc)) + acc = paddle.sum(paddle.prod(acc, -1)) + total_score += float(acc) + + if step % args.logging_steps == 0: + logger.info( + "step %d, batch: %d, %s: %f, speed: %.2f step/s" + % (step, step, score_name, total_score, args.logging_steps / (time.time() - tic_eval)) + ) + tic_eval = time.time() + + if not args.cloze_eval: + total_loss = float(total_score) + ppl = math.exp(min(20, total_loss)) + token_ratio = (args.num_tokenized_tokens - 1) / (args.num_original_tokens - 1) + adjusted_ppl = math.exp(min(20, total_loss * token_ratio)) + string = " validation results on {} | ".format(args.eval_path) + string += "avg loss: {:.4E} | ".format(total_loss) + string += "ppl: {:.4E} | ".format(ppl) + string += "adjusted ppl: {:.4E} | ".format(adjusted_ppl) + string += "token ratio: {} |".format(token_ratio) + else: + num_correct = float(total_score) + acc = float(num_correct / args.num_examples) + string = " validation results on {} | ".format(args.eval_path) + string += "number correct: {:.4E} | ".format(num_correct) + string += "total examples: {:.4E} | ".format(args.num_examples) + string += "avg accuracy: {:.4E}".format(acc) + logger.info(string) + + +if __name__ == "__main__": + do_generation() \ No newline at end of file diff --git a/examples/benchmark/peft/README.md b/examples/benchmark/peft/README.md new file mode 100644 index 0000000000000000000000000000000000000000..841e7f880af8d9b43c56cd062a5bf5bbaa139bcf --- /dev/null +++ b/examples/benchmark/peft/README.md @@ -0,0 +1,97 @@ +# Benchmark Results + +### 配置 
+ +- 硬件: A100-80G with NVLink, 具体卡数见表 +- Torch环境: 见torch/requirements.txt +- FP16配置: torch 使用 cuda amp fp16, paddle 使用 fp16 O2 opt level, intokens 设置为 1024, 并开启了flash attention + +### Bloom + +- 数据: 10k条[Chinese-Vicuna/guanaco_belle_merge_v1.0](https://huggingface.co/datasets/Chinese-Vicuna/guanaco_belle_merge_v1.0) + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | **Speedup** | +|---------------|----------|----------|------------|--------------|---------------------------|-------------|--------------------------|---------| +| Bloomz-7b1-mt | LoRA | 1 | 4 | | 4097.03 | | 1980.32 | **107%** | +| Bloomz-7b1-mt | Finetune | 4 | 8 | MP 4 | 4136.69 | ZeRO 3 | 1702.00 | **143%** | +| Bloomz-7b1-mt | Finetune | 4 | 16 | MP 4 | 4359.72 | ZeRO 3 | 2849.90 | **53%** | + +###### 多卡分布式实验记录 + +- Finetuning with 4 GPUs + +| Model | Setup | Paddle Effective Tokens /s | Torch Effective Tokens /s | Speedup | +|----------------|-----------------|----------------------------|----------------------------|-----------| +| Bloomz-7b1-mt | bsz 8 MP4 | 7421.09 | N/A | N/A | +| Bloomz-7b1-mt | bsz 8 ZeRO 3 | 6063.23 | 1702.00 | 256% | +| Bloomz-7b1-mt | bsz 8 ZeRO 2 | 5191.47 | 1891.16 | 175% | +| Bloomz-7b1-mt | bsz 16 MP4 | 8214.55 | N/A | N/A | +| Bloomz-7b1-mt | bsz 16 ZeRO 3 | 5822.23 | 2849.90 | 104 | +| Bloomz-7b1-mt | bsz 16 ZeRO 2 | 5572.13 | 2719.92 | 105% | + + +### Llama + +- 数据: 使用10k条[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | Speedup | +|-----------|----------|----------|-------------|--------------|---------------------------|-------------|--------------------------|---------| +| Llama-7b | LoRA | 1 | 4 | | 4406.23 | | 1895.90 | **132%** | +| Llama-13b | LoRA | 1 | 4 | | 1975.94 | | 1101.85 | **79%** | +| Llama-13b | LoRA | 1 | 8 | recompute | 1869.60 | gradient ckpt | 768.26 | **143%** | +| Llama-7b | Finetune | 4 | 8 | MP4 | 3275.90 | ZeRO 2 | 1621.52 | **102%** | +| Llama-7b | Finetune | 4 | 16 | sharding 2 | 6798.72 | ZeRO 2 | 2465.55 | **176%** | +| Llama-13b | Finetune | 4 | 8 | MP4 recompute| 1938.19 | ZeRO 3 | 736.19 | **127%** | +| Llama-65b | LoRA | 4 | 8 | MP4 recompute| 840.57 | gradient ckpt, bits 4, max_memory_MB 50000, qlora | 327.75 | **156%** | +| Llama-65b | LoRA | 4 | 16 | MP4 recompute| 993.38 | gradient ckpt, bits 4, max_memory_MB 50000, qlora | 405.90 | **122%** | + + +###### 多卡分布式实验记录 + +- Finetuning with 4 GPUs + +| Model | Setup | Paddle Effective Tokens /s | Torch Effective Tokens /s | Speedup | +|-----------|---------------|----------------------------|----------------------------|-----------| +| LLaMA-7b | bsz 8 MP4 | **3841.61** | N/A | N/A | +| LLaMA-7b | bsz 8 ZeRO 3 | 4189.43 | 1177.93 | 256% | +| LLaMA-7b | bsz 8 ZeRO 2 | 4611.10 | 1621.52 | 184% | +| LLaMA-7b | bsz 16 (4*4) MP4 | 4829.47 | N/A | N/A | +| LLaMA-7b | bsz 16 ZeRO 3 | 4048.61 | 2268.16 | 78% | +| LLaMA-7b | bsz 16 ZeRO 2 | **3463.45** | 2465.55 | 40% | +| LLaMA-13b | bsz 8 MP4 recompute | **2509.50** | N/A | N/A | +| LLaMA-13b | bsz 8 ZeRO 3 | 1867.99 | 736.19 | 154% | +| LLaMA-13b | bsz 8 ZeRO 2 | 1201.75 | OOM | N/A | + + +### ChatGLM + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | Speedup | 
+|---------------|----------|----------|------------|--------------|---------------------------|-------------|--------------------------|---------| +| chatglm-6b | LoRA | 1 | 4 | | 4216.76 | | 1866.48 | **126%** | +| chatglm-6b | Finetune | 4 | 8 | MP 4 | 3799.78 | ZeRO 2 | 2124.17 | **79%** | +| chatglm-6b | Finetune | 4 | 16 | MP 4 | 5720.21 | ZeRO 3 | 3191.35 | **79%** | + + +###### 多卡分布式实验记录 + +- Finetuning with 4 GPUs + +| Model | Setup | Paddle Effective Tokens /s | Torch Effective Tokens /s | Speedup | +|-----------|-----------------|----------------------------|----------------------------|-----------| +| chatglm-6b | bsz 8 MP4 | 4564.94 | N/A | N/A | +| chatglm-6b | bsz 8 ZeRO 3 | 6480.36 | 1840.99 | 252% | +| chatglm-6b | bsz 8 ZeRO 2 | 4707.74 | 2124.17 | 122% | +| chatglm-6b | bsz 16 MP4 | 4972.21 | N/A | N/A | +| chatglm-6b | bsz 16 ZeRO 3 | 5282.28 | 3184.26 | 66% | +| chatglm-6b | bsz 16 ZeRO 2 | 5751.00 | 3151.07 | 83% | + + +### GPT 3 + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | Speedup | +|---------------|----------|----------|------------|--------------|---------------------------|-------------|--------------------------|---------| +| gpt3-6.7b | LoRA | 1 | 4 | | 3450.06 | | 1186.74 | **191%**| +| gpt3-13b | LoRA | 1 | 4 | | 2008.40 | | 969.60 | **107%**| +| gpt3-6.7b | Finetune | 4 | 8 | MP 4 | 3301.49 | ZeRO 2 | 1441.65 | **129%**| +| gpt3-13b | Finetune | 4 | 8 | MP 4 | 1890.38 | ZeRO 2 | 783.26 | **141%**| +| gpt3-6.7b | Finetune | 4 | 16 | MP 4 | 4666.19 | ZeRO 3 | 1756.42 | **166%**| diff --git a/examples/benchmark/peft/paddle/benchmark.py b/examples/benchmark/peft/paddle/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..1849dd9d73127e8bb35f5d53458dda55864dabf3 --- /dev/null +++ b/examples/benchmark/peft/paddle/benchmark.py @@ -0,0 +1,328 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
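+
+# This script fine-tunes the model given by `--model_name_or_path` (optionally with LoRA via
+# `--lora`, sequence packing via `--intokens` and flash attention via `--use_flash_attention`)
+# on a sample of an instruction-tuning dataset (`--train_data_size` examples) in order to
+# measure training throughput, as compared in ../README.md. Example launch commands for
+# single-card, tensor-parallel and sharding runs are listed in the docstring below.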
+ +from dataclasses import dataclass, field +from typing import Optional + +import numpy as np +import paddle.profiler as profiler +from datasets import load_dataset +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from utils import CustomTrainer, ProfilerCallback + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import InTokensMapDataset +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import PdArgumentParser, TrainingArguments +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer, GPTForCausalLM + +""" +单卡 +python benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 4 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --fp16_opt_level O2 --lora \ + --logging_steps 50 --output_dir outputs + +多卡mp +python -m paddle.distributed.launch --gpus "0,1,2,3" benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 8 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --fp16_opt_level O2 --tensor_parallel_degree 4 \ + --logging_steps 50 --output_dir outputs + +多卡sharding 3 +python -m paddle.distributed.launch --gpus "0,1,2,3" benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 4 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --fp16_opt_level O2 \ + --sharding "stage3" --sharding_parallel_degree 4 \ + --logging_steps 50 --output_dir outputs +""" + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. + """ + + model_name_or_path: str = field(default=None, metadata={"help": "model name or local path"}) + lora: Optional[bool] = field(default=False, metadata={"help": "whether to use LoRA"}) + english: Optional[bool] = field(default=False, metadata={"help": "whether to english benchmark dataset"}) + profiler: Optional[bool] = field(default=False, metadata={"help": "whether to use profiler"}) + train_data_size: int = field(default=1000, metadata={"help": "Number of dataset for training"}) + intokens_length: int = field(default=2048, metadata={"help": "Intokens length"}) + intokens: Optional[bool] = field(default=False, metadata={"help": "whether to use intokens"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + + +def main(): + parser = PdArgumentParser((ModelArguments, TrainingArguments)) + model_args, training_args = parser.parse_args_into_dataclasses() + + # Set the dtype for loading model + dtype = None + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + else: + dtype = "float32" + if model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + tokenizer = AutoTokenizer.from_pretrained("gpt3-13B-en") + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + if "llama" in model_args.model_name_or_path or "Baichuan" in model_args.model_name_or_path: + tokenizer.pad_token = tokenizer.unk_token + + if model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + model = GPTForCausalLM.from_pretrained( + model_args.model_name_or_path, + low_cpu_mem_usage=True, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, + tensor_parallel_degree=training_args.tensor_parallel_degree, + 
tensor_parallel_rank=training_args.tensor_parallel_rank, + ) + tracker = get_rng_state_tracker() + tracker.add("global_seed", 111) + tracker.add("local_seed", 222) + else: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + low_cpu_mem_usage=True, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + ) + + if model_args.lora: + if "llama" in model_args.model_name_or_path or "Baichuan" in model_args.model_name_or_path: + target_modules = [".*q_proj.*", ".*k_proj.*", ".*v_proj.*"] + elif model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + target_modules = [ + ".*qkv_proj.*", + ".*q_proj.*", + ".*k_proj.*", + ".*v_proj.*", + ".*linear1.*", + ".*linear2.*", + ".*out_proj.*", + ] + elif "chatglm2" in model_args.model_name_or_path: + target_modules = [ + ".*query.*", + ".*key.*", + ".*value.*", + ".*dense.*", + ".*dense_h_to_4h.*", + ".*dense_4h_to_h.*", + ] + else: + target_modules = [".*query_key_value.*"] + + lora_config = LoRAConfig( + target_modules=target_modules, + r=8, + lora_alpha=32, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + def preprocess_function(example, max_src_length=256, max_tgt_length=384, intokens=False): + inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + # shift input and labels + model_inputs["input_ids"] = model_inputs["input_ids"][:-1] + model_inputs["labels"] = model_inputs["labels"][1:] + seq_length = len(model_inputs["input_ids"]) + model_inputs["position_ids"] = list(range(seq_length)) + if intokens: + model_inputs["attention_mask"] = np.tril(np.ones([seq_length, seq_length], dtype=bool)) + return model_inputs + + def preprocess_function_chatglm(example, max_src_length=256, max_tgt_length=384, intokens=False): + inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + # shift input and labels + model_inputs["input_ids"] = model_inputs["input_ids"][:-1] + model_inputs["labels"] = model_inputs["labels"][1:] + + if intokens: + context_length = model_inputs["input_ids"].index(tokenizer.bos_token_id) + seq_length = len(model_inputs["input_ids"]) + position_ids = np.arange(seq_length, dtype=np.int64) + block_position_ids = np.concatenate( + [ + np.zeros(context_length, dtype=np.int64), + np.arange(1, seq_length - context_length + 1, dtype=np.int64), + ] + ) + model_inputs["position_ids"] = 
np.stack([position_ids, block_position_ids], axis=0) + attention_mask = np.tri(seq_length, seq_length, dtype=bool) + attention_mask[:, :context_length] = 1 + model_inputs["attention_mask"] = attention_mask + + return model_inputs + + def preprocess_function_bloom(example, max_src_length=256, max_tgt_length=384, intokens=False): + inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + # shift input and labels + model_inputs["input_ids"] = model_inputs["input_ids"][:-1] + model_inputs["labels"] = model_inputs["labels"][1:] + + if intokens: + model_inputs["attention_mask"] = np.tril( + np.ones([len(model_inputs["input_ids"]), len(model_inputs["input_ids"])], dtype=bool) + ) + return model_inputs + + def preprocess_function_gpt(example, max_source_length=256, max_target_length=384, intokens=False): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. 
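+        # In this benchmark, however, the instruction and output are simply tokenized with
+        # truncation, concatenated, and shifted by one position; no sliding-window / stride
+        # features are produced.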
+ inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + + input_seq = inputs + output_seq = targets + + outputs = tokenizer( + output_seq, + max_length=max_target_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_token_type_ids=False, + ) + inputs = tokenizer( + input_seq, + max_length=max_source_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_length=False, + ) + + final = {} + for k in outputs.keys(): + final[k] = inputs[k] + outputs[k] + if k == "input_ids": + final["labels"] = [tokenizer.pad_token_id] * len(inputs["input_ids"]) + outputs[k] + + # shift inputs and labels + final["input_ids"] = final["input_ids"][:-1] + final["labels"] = final["labels"][1:] + return final + + if model_args.english: + dataset = load_dataset("tatsu-lab/alpaca") + else: + dataset = load_dataset("Chinese-Vicuna/guanaco_belle_merge_v1.0") + + # select first 10k examples for benchmarking + dataset = dataset["train"].select(range(model_args.train_data_size)) + if "chatglm2" in model_args.model_name_or_path: + dataset = dataset.map( + lambda example: preprocess_function(example, intokens=model_args.intokens), + ) + elif "chatglm" in model_args.model_name_or_path: + dataset = dataset.map( + lambda example: preprocess_function_chatglm(example, intokens=model_args.intokens), + ) + elif "bloom" in model_args.model_name_or_path: + + dataset = dataset.map( + lambda example: preprocess_function_bloom(example, intokens=model_args.intokens), + ) + elif model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + dataset = dataset.map( + lambda example: preprocess_function_gpt(example, intokens=model_args.intokens), + ) + else: + dataset = dataset.map(lambda example: preprocess_function(example, intokens=model_args.intokens)) + total_effective_tokens = sum([len(i["input_ids"]) for i in dataset]) * training_args.num_train_epochs + if model_args.intokens: + dataset = InTokensMapDataset( + dataset, + tokenizer=tokenizer, + max_length=model_args.intokens_length, + ) + if model_args.profiler: + prof = profiler.Profiler( + targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU], + profile_memory=True, + scheduler=profiler.make_scheduler(closed=1, ready=2, record=1, repeat=1), + on_trace_ready=profiler.export_chrome_tracing("./log"), + ) + if model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + data_collator = DataCollatorForSeq2Seq( + return_tensors="pd", tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id + ) + else: + data_collator = DataCollatorForSeq2Seq(return_tensors="pd", tokenizer=tokenizer) + + trainer = CustomTrainer( + model=model, + tokenizer=tokenizer, + train_dataset=dataset, + callbacks=[ProfilerCallback(prof)] if model_args.profiler else [], + args=training_args, + data_collator=data_collator, + ) + + train_metrics = trainer.train() + tokens_per_second = trainer.total_observed_tokens / train_metrics.metrics["train_runtime"] + effective_tokens_per_second = total_effective_tokens / train_metrics.metrics["train_runtime"] + print(f"Tokens per second: {tokens_per_second:.2f}") + print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/peft/paddle/inference_benchmark.py b/examples/benchmark/peft/paddle/inference_benchmark.py new file mode 100644 index 
0000000000000000000000000000000000000000..287e9e13d14b91b82040006b3a0604eb96b2eecf --- /dev/null +++ b/examples/benchmark/peft/paddle/inference_benchmark.py @@ -0,0 +1,99 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import paddle + +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer + + +def parse_args(prog=None): + """ + parse_args + """ + parser = argparse.ArgumentParser(prog=prog) + parser.add_argument("--model_name_or_path", type=str, help="model name or local path", required=True) + parser.add_argument("--do_forward", action="store_true", help="fowrward test") + parser.add_argument("--do_generate", action="store_true", help="generate test") + parser.add_argument("--dtype", type=str, default="float16", choices=["float16", "float32"], help="set dtype of model", required=True) + return parser.parse_args() + + +@paddle.no_grad() +def predict_generate(model, inputs): + for i in range(10): + start = time.perf_counter() + result = model.generate( + **inputs, + max_length=100, + decode_strategy="greedy_search", + use_cache=True, + ) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + infer_data = result[0] + for x in infer_data.tolist(): + res = tokenizer.decode(x, skip_special_tokens=True) + print(res) + + +@paddle.no_grad() +def predict_forward(model, inputs): + for i in range(10): + start = time.perf_counter() + _ = model(**inputs) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + + +if __name__ == "__main__": + args = parse_args() + all_texts = [ + "你好", + "去年9月,拼多多海外版“Temu”正式在美国上线。数据显示,截至2023年2月23日,Temu App新增下载量4000多万,新增用户增速第一,AppStore购物榜霸榜69天、Google Play购物榜霸榜114天。", + ] + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + load_state_as_np=True, + low_cpu_mem_usage=True, + dtype=args.dtype, + ) + if model.base_model_prefix == "llama": + tokenizer.pad_token = tokenizer.unk_token + model.eval() + + if args.do_forward: + for input_text in all_texts: + print(f"text: {input_text}") + inputs = tokenizer([input_text], return_tensors="pd", return_token_type_ids=False) + predict_forward(model, inputs) + + if args.do_generate: + for input_text in all_texts: + print(f"text: {input_text}") + _inputs = tokenizer( + input_text, + padding=True, + return_tensors="np", + max_length=50, + return_attention_mask=True, + return_position_ids=True, + ) + inputs_tensor = {} + for key, value in _inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + predict_generate(model, inputs_tensor) diff --git a/examples/benchmark/peft/paddle/utils.py b/examples/benchmark/peft/paddle/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..70d8a8824cdaa108b7837e509d94ec37e098814d --- /dev/null +++ b/examples/benchmark/peft/paddle/utils.py @@ -0,0 +1,49 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from paddlenlp.trainer import ( + Trainer, + TrainerCallback, + TrainerControl, + TrainerState, + TrainingArguments, +) + + +class CustomTrainer(Trainer): + total_observed_tokens = 0.0 + + def training_step(self, model, inputs): + input_ids = inputs["input_ids"] + self.total_observed_tokens += float(input_ids.shape[0] * input_ids.shape[1]) + return super().training_step(model, inputs) + + +class ProfilerCallback(TrainerCallback): + "A callback that prints a message at the beginning of training" + + def __init__(self, prof): + self.prof = prof + self.prof.start() + + def on_train_begin(self, args, state, control, **kwargs): + print("Starting training") + + def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): + self.prof.step() + + def on_train_end(self, args, state, control, **kwargs): + self.prof.stop() + self.prof.summary() diff --git a/examples/benchmark/peft/torch/benchmark.py b/examples/benchmark/peft/torch/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..7e4d7e1bd634d02e7affe50c39b3af306eee59bc --- /dev/null +++ b/examples/benchmark/peft/torch/benchmark.py @@ -0,0 +1,226 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +from dataclasses import dataclass, field +from typing import Optional + +import torch +import torch.profiler as profiler +from datasets import load_dataset +from peft import LoraConfig, TaskType, get_peft_model +from transformers import ( + AutoModel, + AutoModelForCausalLM, + AutoTokenizer, + BitsAndBytesConfig, + DataCollatorForSeq2Seq, + HfArgumentParser, + LlamaTokenizer, + TrainingArguments, +) +from utils import CustomTrainer, ProfilerCallback + +""" +单卡 +python benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 4 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --lora \ + --logging_steps 50 --output_dir outputs + +多卡 deepspeed zero3 +python -m torch.distributed.run --nproc_per_node=4 benchmark.py --deepspeed ds_config.json \ + --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 2 \ + --evaluation_strategy no --save_strategy no \ + --fp16 \ + --logging_steps 50 --output_dir outputs +""" + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. 
+ """ + + model_name_or_path: str = field(default=None, metadata={"help": "model name or local path"}) + lora: Optional[bool] = field(default=False, metadata={"help": "whether to use LoRA"}) + qlora: Optional[bool] = field(default=False, metadata={"help": "whether to use qLoRA"}) + english: Optional[bool] = field(default=False, metadata={"help": "whether to english benchmark dataset"}) + profiler: Optional[bool] = field(default=False, metadata={"help": "whether to use profiler"}) + double_quant: bool = field( + default=True, metadata={"help": "Compress the quantization statistics through double quantization."} + ) + quant_type: str = field( + default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."} + ) + bits: int = field(default=4, metadata={"help": "How many bits to use."}) + max_memory_MB: int = field(default=80000, metadata={"help": "Free memory per gpu."}) + train_data_size: int = field(default=1000, metadata={"help": "Number of dataset for training"}) + + +def main(): + parser = HfArgumentParser((ModelArguments, TrainingArguments)) + model_args, training_args = parser.parse_args_into_dataclasses() + + if "llama" in model_args.model_name_or_path: + tokenizer = LlamaTokenizer.from_pretrained(model_args.model_name_or_path, use_fast=False) + tokenizer.pad_token_id = 0 + elif model_args.model_name_or_path in ["cerebras/Cerebras-GPT-13B", "stanford-crfm/levanter-gpt2-7B"]: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, use_fast=False) + tokenizer.pad_token_id = 0 + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True) + + compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32) + if "chatglm" in model_args.model_name_or_path: + # Add empty_init=False for zero3 training, refer to https://github.com/THUDM/ChatGLM-6B/issues/530 + model = AutoModel.from_pretrained( + model_args.model_name_or_path, + empty_init=False if training_args.deepspeed is not None else True, + trust_remote_code=True, + torch_dtype="auto", + ) + + else: + if model_args.qlora: + n_gpus = torch.cuda.device_count() + max_memory = f"{model_args.max_memory_MB}MB" + max_memory = {i: max_memory for i in range(n_gpus)} + device_map = "auto" + + # if we are in a distributed setting, we need to set the device map and max memory per device + if os.environ.get("LOCAL_RANK") is not None: + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + device_map = {"": local_rank} + max_memory = {"": max_memory[local_rank]} + + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + torch_dtype="auto", + load_in_4bit=model_args.bits == 4, + load_in_8bit=model_args.bits == 8, + device_map=device_map, + max_memory=max_memory, + quantization_config=BitsAndBytesConfig( + load_in_4bit=model_args.bits == 4, + load_in_8bit=model_args.bits == 8, + llm_int8_threshold=6.0, + llm_int8_has_fp16_weight=False, + bnb_4bit_compute_dtype=compute_dtype, + bnb_4bit_use_double_quant=model_args.double_quant, + bnb_4bit_quant_type=model_args.quant_type, + ), + ) + elif model_args.model_name_or_path in ["cerebras/Cerebras-GPT-13B", "stanford-crfm/levanter-gpt2-7B"]: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + torch_dtype=torch.float16, + ) + else: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + torch_dtype="auto", + ) + if model_args.lora: + if "llama" in model_args.model_name_or_path: + 
target_modules = ["q_proj", "k_proj", "v_proj"] + elif model_args.model_name_or_path in ["cerebras/Cerebras-GPT-13B", "stanford-crfm/levanter-gpt2-7B"]: + target_modules = [ + ".*c_attn.*", + ".*q_attn.*", + ".*c_proj.*", + ".*c_fc.*", + ] + else: + target_modules = ["query_key_value"] + peft_config = LoraConfig( + task_type=TaskType.CAUSAL_LM, target_modules=target_modules, r=8, lora_alpha=32, lora_dropout=0.0 + ) + model = get_peft_model(model, peft_config) + model.print_trainable_parameters() + + if model_args.lora and training_args.gradient_checkpointing: + # For backward compatibility + if hasattr(model, "enable_input_require_grads"): + model.enable_input_require_grads() + else: + + def make_inputs_require_grad(module, input, output): + output.requires_grad_(True) + + model.get_input_embeddings().register_forward_hook(make_inputs_require_grad) + + # enable gradient checkpointing for memory efficiency + model.gradient_checkpointing_enable() + + def preprocess_function(example, max_src_length=256, max_tgt_length=384): + inputs = example["instruction"] + if "input" in example: + inputs += example["input"] + targets = example["output"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + return model_inputs + + if model_args.english: + dataset = load_dataset("tatsu-lab/alpaca") + else: + dataset = load_dataset("Chinese-Vicuna/guanaco_belle_merge_v1.0") + + # select first 10k examples for benchmarking + dataset = dataset["train"].select(range(model_args.train_data_size)) + dataset = dataset.map( + lambda example: preprocess_function(example), remove_columns=["instruction", "input", "output"] + ) + total_effective_tokens = sum([len(i["input_ids"]) for i in dataset]) * training_args.num_train_epochs + + if model_args.profiler: + prof = profiler.profile( + activities=[ + torch.profiler.ProfilerActivity.CPU, + torch.profiler.ProfilerActivity.CUDA, + ], + schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1), + on_trace_ready=torch.profiler.tensorboard_trace_handler("hf-training-trainer"), + profile_memory=True, + with_stack=True, + ) + + data_collator = DataCollatorForSeq2Seq(return_tensors="pt", tokenizer=tokenizer) + + trainer = CustomTrainer( + model=model, + tokenizer=tokenizer, + train_dataset=dataset, + callbacks=[ProfilerCallback(prof=prof)] if model_args.profiler else [], + args=training_args, + data_collator=data_collator, + ) + model.config.use_cache = False # silence the warnings. Please re-enable for inference! 
+ train_metrics = trainer.train() + tokens_per_second = trainer.total_observed_tokens / train_metrics.metrics["train_runtime"] + effective_tokens_per_second = total_effective_tokens / train_metrics.metrics["train_runtime"] + print(f"Tokens per second: {tokens_per_second:.2f}") + print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/peft/torch/ds_config_stage2.json b/examples/benchmark/peft/torch/ds_config_stage2.json new file mode 100644 index 0000000000000000000000000000000000000000..da73ccf988b937e55256a25f3bcb3f402dff320d --- /dev/null +++ b/examples/benchmark/peft/torch/ds_config_stage2.json @@ -0,0 +1,35 @@ +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "gradient_accumulation_steps": "auto", + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + "fp16": { + "enabled": "auto" + }, + "zero_optimization": { + "stage": 2, + "allgather_partitions": true, + "allgather_bucket_size": 5e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 5e8, + "contiguous_gradients": true + } +} + diff --git a/examples/benchmark/peft/torch/ds_config_stage3.json b/examples/benchmark/peft/torch/ds_config_stage3.json new file mode 100644 index 0000000000000000000000000000000000000000..f988bc0c66900f44b19025d63716b291420d5fc5 --- /dev/null +++ b/examples/benchmark/peft/torch/ds_config_stage3.json @@ -0,0 +1,37 @@ +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "gradient_accumulation_steps": "auto", + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + "fp16": { + "enabled": "auto" + }, + "zero_optimization": { + "stage": 3, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + } +} \ No newline at end of file diff --git a/examples/benchmark/peft/torch/inference_benchmark.py b/examples/benchmark/peft/torch/inference_benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..85b23014bc36d28e0bcc4a484ec27b5a4379a44b --- /dev/null +++ b/examples/benchmark/peft/torch/inference_benchmark.py @@ -0,0 +1,80 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
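+
+# Example usage (flags defined by the argument parser below; the model name is only an
+# illustration):
+#   python inference_benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt --do_forward
+#   python inference_benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt --do_generate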
+ +import argparse +import time + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + + +def parse_args(prog=None): + """ + parse_args + """ + parser = argparse.ArgumentParser(prog=prog) + parser.add_argument("--model_name_or_path", type=str, help="model name or local path", required=True) + parser.add_argument("--do_forward", action="store_true", help="fowrward test") + parser.add_argument("--do_generate", action="store_true", help="generate test") + return parser.parse_args() + + +@torch.no_grad() +def predict_generate(model, inputs): + for i in range(10): + start = time.perf_counter() + generate_ids = model.generate( + inputs.input_ids, + max_length=100, + do_sample=False, + ) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + print(result) + + +@torch.no_grad() +def predict_forward(model, inputs): + for i in range(10): + start = time.perf_counter() + _ = model(**inputs) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + + +if __name__ == "__main__": + args = parse_args() + all_texts = [ + "你好", + "去年9月,拼多多海外版“Temu”正式在美国上线。数据显示,截至2023年2月23日,Temu App新增下载量4000多万,新增用户增速第一,AppStore购物榜霸榜69天、Google Play购物榜霸榜114天。", + ] + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + if "llama" in args.model_name_or_path: + tokenizer.pad_token_id = 0 + model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path).cuda() + model = model.eval() + if args.do_forward: + for input_text in all_texts: + print(f"text: {input_text}") + inputs = tokenizer([input_text], return_tensors="pt", max_length=50, padding=True) + inputs = inputs.to("cuda") + predict_forward(model, inputs) + + if args.do_generate: + for input_text in all_texts: + print(f"text: {input_text}") + inputs = tokenizer([input_text], return_tensors="pt", max_length=50, padding=True) + inputs = inputs.to("cuda") + predict_generate(model, inputs) diff --git a/examples/benchmark/peft/torch/requirements.txt b/examples/benchmark/peft/torch/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..c683fbf8ee1bc17202a7e23bfb4ea9ea2778e1ae --- /dev/null +++ b/examples/benchmark/peft/torch/requirements.txt @@ -0,0 +1,8 @@ +deepspeed==0.9.4 +datasets==2.12.0 +transformers==4.30.2 +peft @ git+https://github.com/huggingface/peft.git +torch==2.0.1 +sentencepiece +bitsandbytes==0.39.0 +# CUDA VERSION 11.7 \ No newline at end of file diff --git a/examples/benchmark/peft/torch/utils.py b/examples/benchmark/peft/torch/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..cb8a286e84d03e75270c2f9ac65be17135f2d4a3 --- /dev/null +++ b/examples/benchmark/peft/torch/utils.py @@ -0,0 +1,43 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
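+
+# CustomTrainer accumulates the number of observed tokens (batch_size * seq_len per
+# training step) so benchmark.py can report tokens per second, and ProfilerCallback
+# advances the torch profiler schedule once per optimizer step.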
+ +from transformers import ( + Trainer, + TrainerCallback, + TrainerControl, + TrainerState, + TrainingArguments, +) + + +class CustomTrainer(Trainer): + total_observed_tokens = 0.0 + + def training_step(self, model, inputs): + input_ids = inputs["input_ids"] + self.total_observed_tokens += float(input_ids.shape[0] * input_ids.shape[1]) + return super().training_step(model, inputs) + + +class ProfilerCallback(TrainerCallback): + "A callback that prints a message at the beginning of training" + + def __init__(self, prof): + self.prof = prof + + def on_train_begin(self, args, state, control, **kwargs): + print("Starting training") + + def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): + self.prof.step() diff --git a/examples/code_generation/codegen/README.md b/examples/code_generation/codegen/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c71803fdba2052f06dbd9d4eff7883d9e8ef5e5b --- /dev/null +++ b/examples/code_generation/codegen/README.md @@ -0,0 +1,327 @@ +# 代码生成:写代码的AI助理 + +**目录** +- [代码生成](#代码生成) + - [简介](#简介) + - [特色](#特色) + - [效果展示](#效果展示) + - [Github Copilot插件配置](#GithubCopilot插件配置) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [启动服务](#启动服务) + - [配置参数](#配置参数说明) + - [测试服务](#测试服务) + - [配置插件](#配置插件) + - [注意事项](#注意事项) + - [训练定制](#训练定制) + - [数据准备](#数据准备) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型训练](#模型训练) + - [TaskFlow调用](#TaskFlow调用) + - [更多使用案例](#更多使用案例) + - [模型列表](#模型列表) + - [References](#references) + + +## 简介 +代码生成是根据编程人员的输入,生成出编程人员想要的代码,能够帮助编程人员甚至独立生成代码,提高编程效率。 + + +### 特色 + +本项目是基于预训练语言模型CodeGen的代码生成,具有以下优势: +- **效果领先**。CodeGen(16B)在HumanEval benchmark上评估指标已经超过[OpenAI's Codex](https://arxiv.org/pdf/2107.03374.pdf)。 +- **免费的Github Copilot**。支持通过Github Copilot调用该模型,让你免费体验代码AI助理。 +- **高性能**。基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/fast_generation)打造高性能推理,毫秒级响应。具体加速指标可参考[perf](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/fast_generation/README.md)。 +- **支持自定义数据集训练**。可增加自己的代码数据加以微调,让其更智能。 +- **开箱即用**。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 + + +## 效果展示 + +- Github Copilot代码提示效果展示 +

+ +- 解算法题效果展示。求解无重复字符的最长子串的长度 +```python +from paddlenlp import Taskflow + +prompt = "def lengthOfLongestSubstring(self, s: str) -> int:" +codegen = Taskflow("code_generation", model="Salesforce/codegen-2B-mono",decode_strategy="greedy_search", repetition_penalty=1.0) +print(codegen(prompt)) +``` +结果输出为: +```python + if not s: + return 0 + + start = 0 + end = 0 + max_len = 0 + + while end < len(s): + if s[end] not in s[start:end]: + max_len = max(max_len, end - start + 1) + end += 1 + else: + start += 1 + + return max_len +``` +

+ + +## Jupyter Lab插件配置 + +请参考[codegenJupyterLabExt](https://github.com/chenqianhe/codegenJupyterLabExt), 感谢生态开发者[@chenqianhe](https://github.com/chenqianhe)的贡献!👏👏 + +## GithubCopilot插件配置 + +**以VS Code的插件为例** + +### 环境依赖 +- PaddleNLP >= 2.4.0 +- PaddlePaddle >= 2.3.1 + +其他依赖:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +codegen/ +├── requirements.txt # 环境依赖 +├── codegen_server.py # server启动脚本 +├── run_clm.py # 训练评估脚本 +├── run_clm.sh # 启动脚本 +└── README.md # 说明文档 +``` + +### 启动服务 + +```python +python codegen_server.py +``` + +##### 配置参数说明 +在codegen_server.py中配置如下参数: +- `model_name_or_path`:模型名,默认为 "Salesforce/codegen-350M-mono" +- `device`:运行设备,默认为"gpu" +- `temperature`:解码参数temperature,默认为0.5 +- `top_k`:解码参数top_k,默认为10 +- `top_p`:解码参数top_p,默认为1.0 +- `repetition_penalty`:解码重复惩罚项,默认为1.0 +- `min_length`:生成的最小长度,默认为0 +- `max_length`:生成的最大长度,默认为16 +- `decode_strategy`:解码策略,默认为"greedy_search" +- `load_state_as_np`:以numpy格式加载模型参数,可节省显存,默认为True +- `use_fast`:是否使用FastGeneration,可加速推理,默认为True +- `use_fp16_decoding`:是否使用fp16推理,可节省显存和加速推理,默认为True + +### 测试服务 +```python +import openai +openai.api_key = 'dummy' +openai.api_base = 'http://127.0.0.1:8978' +result = openai.Completion.create( + engine='codegen', prompt='def hello', max_tokens=16, temperature=0.1) +print(result) +''' + JSON: { + "id": "cmpl-dmhoeHmcw9DJ4NeqOJDQVKv3iivJ0", + "choices": [ + { + "text": "_world():\n print(\"Hello World!\")\n\n\n#", + "index": 0, + "finish_reason": "stop", + "logprobs": null, + } + ], + "usage": { + "completion_tokens": null, + "prompt_tokens": null, + "total_tokens": null + } +} +''' +``` +**注意**:如果要从本地访问服务器,`127.0.0.1`需要换成服务器的对外IP。 + + +### 配置插件 +打开用户设置([settings.json](https://code.visualstudio.com/docs/getstarted/settings#_settings-file-locations)),增加一行配置 +```json + "github.copilot.advanced": { + "debug.overrideEngine": "codegen", + "debug.testOverrideProxyUrl": "http://127.0.0.1:8978", + "debug.overrideProxyUrl": "http://127.0.0.1:8978" + }, +``` +接下来就可以愉快地使用了😊。 + + +#### 注意事项 +- 如果使用FastGeneration,需要设置[codegen_server.py](#配置参数说明)中`use_fast=True`,第一次推理会涉及到编译,会耗费一些时间。FastGeneration的环境依赖参考[这里](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/ops/README.md#%E4%BD%BF%E7%94%A8%E7%8E%AF%E5%A2%83%E8%AF%B4%E6%98%8E)。 +- 如果要使用自己训练好的模型,可以设置[codegen_server.py](#配置参数说明)中`model_name_or_path`为本地模型路径。 +- 如果要从本地访问服务器,上述的`127.0.0.1`需要换成服务器的对外IP。 +- 如果出现下方的提示和报错,则说明FastGeneration没有启动成功,需要定位下失败的原因。或者也可设置`use_fast=False`,不启动FastGeneration加速,但推理速度会较慢。 +```shell + FastGeneration is not available, and the original version would be used instead. +``` +```shell + RuntimeError: (NotFound) There are no kernels which are registered in the unsqueeze2 operator. + [Hint: Expected kernels_iter != all_op_kernels.end(), but received kernels_iter == all_op_kernels.end().] 
(at /home/Paddle/paddle/fluid/imperative/prepared_operator.cc:341) + [operator < unsqueeze2 > error] +``` +- 本代码也支持插件[fauxpilot](https://marketplace.visualstudio.com/items?itemName=Venthe.fauxpilot),感谢[@linonetwo](https://github.com/linonetwo)测试。`settings.json`中配置"fauxpilot.server": "http://服务器ip:8978/v1/engines" + +## 训练定制 + +### 数据准备 + +#### 从本地文件创建数据集 + +在许多情况,我们需要使用本地数据集来训练我们的代码生成模型,本项目支持使用固定格式本地数据集文件进行训练。 + +本地数据集文件格式如下: +- train.json/test.json 文件格式: +每行为一个jsonline +```text +{ + "code": "from paddlenlp.transformers import CodeGenForCausalLM\n\n\nmodel = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-2B-mono')\n" +} +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + + +### 模型训练 +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +python -m paddle.distributed.launch --gpus 0,1 run_clm.py \ + --model_name_or_path Salesforce/codegen-350M-mono \ + --block_size 1024 \ + --output_dir output \ + --train_file train.json \ + --validation_file test.json \ + --num_train_epochs 5 \ + --logging_steps 10 \ + --save_steps 1000 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --learning_rate 1e-4 \ + --warmup_ratio 0.1 \ + --do_train \ + --do_eval \ + --device gpu +``` +使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1" + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型(详见[模型列表](#模型列表)),或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 +- `block_size` 表示训练时候数据被拆分的块数。 +- `output_dir` 表示模型的保存路径。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `validation_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `per_device_train_batch_size` 表示训练时**每张卡**上的样本数目。 +- `per_device_eval_batch_size` 表示测试时**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `warmup_ratio` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `do_train` 表示是否训练。 +- `do_eval` 表示是否评测。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +可通过`bash run_clm.sh`启动训练,更多参数详情和参数的默认值请参考`run_clm.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./output/ +│── model_config.json +│── model_state.pdparams +│── tokenizer_config.json +│── special_tokens_map.json +│── added_tokens.json +│── vocab.json +│── merges.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +## TaskFlow调用 +参考[TaskFlow文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md) + +## 更多使用案例 + +- 根据注释/功能描述写代码 + +```python +import re +import paddle +from paddlenlp.transformers import CodeGenTokenizer, CodeGenForCausalLM + +# The supported models are shown in the following table +model_name = 'Salesforce/codegen-2B-mono' +# Init tokenizer +tokenizer = CodeGenTokenizer.from_pretrained(model_name) +# Init model +model = CodeGenForCausalLM.from_pretrained(model_name) + +prompt = "# this function prints hello world" +inputs = tokenizer([prompt]) +inputs = {k: paddle.to_tensor(v) for (k, v) in inputs.items()} +# Generate +output, score = model.generate(inputs['input_ids'], + max_length=128, + decode_strategy='greedy_search') +# Decode the result +print( + tokenizer.decode(output[0], + truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"], + skip_special_tokens=True, + spaces_between_special_tokens=False)) +``` +结果输出为: +```python +def hello_world(): + print("Hello World") + +hello_world() +``` + +## 模型列表 +模型列表 +| 模型名称 | 说明 | +| :--------------------------------- | -------------------------------- | +| Salesforce/codegen-350M-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-2B-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-6B-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-16B-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-350M-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-2B-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-6B-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-16B-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-350M-multi | 基于多编程语言数据集BIGQUERY训练 | +| Salesforce/codegen-2B-multi | 基于多编程语言数据集BIGQUERY训练 | +| Salesforce/codegen-6B-multi | 基于多编程语言数据集BIGQUERY训练 | +| Salesforce/codegen-16B-multi | 基于多编程语言数据集BIGQUERY训练 | + +## References +- Nijkamp, Erik, et al. "A conversational paradigm for program synthesis." arXiv preprint arXiv:2203.13474 (2022). +- [https://github.com/features/copilot/](https://github.com/features/copilot/) +- [https://github.com/AndPuQing/Papilot](https://github.com/AndPuQing/Papilot) diff --git a/examples/code_generation/codegen/codegen_server.py b/examples/code_generation/codegen/codegen_server.py new file mode 100644 index 0000000000000000000000000000000000000000..5f8d80bb7cfab4ce436f5461a14e696a172bbff6 --- /dev/null +++ b/examples/code_generation/codegen/codegen_server.py @@ -0,0 +1,140 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
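+
+# Serves an OpenAI-compatible completion endpoint
+# (POST /v1/engines/codegen/completions on port 8978, see uvicorn.run below), which is
+# what the Github Copilot / fauxpilot configuration described in the README points to.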
+import random +import string +import time + +import paddle +import uvicorn +from fastapi import FastAPI, Response, status +from pydantic import BaseModel +from sse_starlette.sse import EventSourceResponse + +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer +from paddlenlp.utils.log import logger + + +class DefaultConfig: + model_name_or_path = "Salesforce/codegen-350M-mono" + device = "gpu" + temperature = 0.5 + top_k = 10 + top_p = 1.0 + repetition_penalty = 1.0 + min_length = 0 + max_length = 16 + decode_strategy = "greedy_search" + load_state_as_np = True + use_faster = True + use_fp16_decoding = True + default_dtype = "float16" if use_faster and use_fp16_decoding else "float32" + + +class Input(BaseModel): + prompt: str + stream: bool = False + + +class Output(BaseModel): + id: str + model: str = "codegen" + object: str = "text_completion" + created: int = int(time.time()) + choices: list = None + usage = { + "completion_tokens": None, + "prompt_tokens": None, + "total_tokens": None, + } + + +generate_config = DefaultConfig() +paddle.set_device(generate_config.device) +paddle.set_default_dtype(generate_config.default_dtype) + +tokenizer = CodeGenTokenizer.from_pretrained(generate_config.model_name_or_path) +model = CodeGenForCausalLM.from_pretrained( + generate_config.model_name_or_path, load_state_as_np=generate_config.load_state_as_np +) + +app = FastAPI() + + +def random_completion_id(): + return "cmpl-" + "".join(random.choice(string.ascii_letters + string.digits) for _ in range(29)) + + +@app.post("/v1/engines/codegen/completions", status_code=200) +async def gen(item: Input): + item = item.dict() + logger.info(f"Request: {item}") + temperature = item.get("temperature", generate_config.temperature) + top_k = item.get("top_k", generate_config.top_k) + if temperature == 0.0: + temperature = 1.0 + top_k = 1 + repetition_penalty = item.get("frequency_penalty", generate_config.repetition_penalty) + + start_time = time.time() + logger.info("Start generating code") + tokenized = tokenizer([item["prompt"]], truncation=True, return_tensors="pd") + output, _ = model.generate( + tokenized["input_ids"], + max_length=16, + min_length=generate_config.min_length, + decode_strategy=generate_config.decode_strategy, + top_k=top_k, + repetition_penalty=repetition_penalty, + temperature=temperature, + use_fast=generate_config.use_faster, + use_fp16_decoding=generate_config.use_fp16_decoding, + ) + logger.info("Finish generating code") + end_time = time.time() + logger.info(f"Time cost: {end_time - start_time}") + output = tokenizer.decode(output[0], skip_special_tokens=True) + logger.info(f"Generated code: {output}") + output_json = Output( + id=random_completion_id(), + choices=[ + { + "text": output, + "index": 0, + "finish_reason": "stop", + "logprobs": None, + } + ], + usage={ + "completion_tokens": None, + "prompt_tokens": None, + "total_tokens": None, + }, + ).json() + + def stream_response(response): + yield f"{response}\n\n" + yield "data: [DONE]\n\n" + + if item.get("stream", False): + return EventSourceResponse(stream_response(output_json)) + else: + return Response( + status_code=status.HTTP_200_OK, + content=output_json, + media_type="application/json", + ) + + +if __name__ == "__main__": + uvicorn.run("codegen_server:app", host="0.0.0.0", port=8978) diff --git a/examples/code_generation/codegen/requirements.txt b/examples/code_generation/codegen/requirements.txt new file mode 100644 index 
0000000000000000000000000000000000000000..ae00f4799fa170e98fa4cb1744bc3aa537439fe3 --- /dev/null +++ b/examples/code_generation/codegen/requirements.txt @@ -0,0 +1,7 @@ +fastapi==0.79.0 +pydantic==1.9.1 +python-dotenv==0.20.0 +sse_starlette==0.10.3 +uvicorn==0.17.6 +openai==0.8.0 +regex==2022.6.2 \ No newline at end of file diff --git a/examples/code_generation/codegen/run_clm.py b/examples/code_generation/codegen/run_clm.py new file mode 100644 index 0000000000000000000000000000000000000000..4e9d5668763fc74fb9da9a14ad08995a41bbdc38 --- /dev/null +++ b/examples/code_generation/codegen/run_clm.py @@ -0,0 +1,162 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +from dataclasses import dataclass, field +from functools import partial +from itertools import chain +from typing import Optional + +import paddle +import paddle.nn as nn +from datasets import load_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer +from paddlenlp.utils.log import logger + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="Salesforce/codegen-350M-mono", + metadata={"help": ("Path to pre-trained model.")}, + ) + overwrite_cache: Optional[bool] = field( + default=False, + metadata={"help": ("Whether to overwrite cache for dataset.")}, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default=None, + metadata={"help": "The input training data file."}, + ) + validation_file: Optional[str] = field( + default=None, + metadata={"help": "The input validation data file."}, + ) + block_size: Optional[int] = field( + default=None, + metadata={"help": ("The training dataset will be truncated in block of this size for training. 
")}, + ) + + +def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + logits = paddle.to_tensor(eval_preds.predictions) + loss_fct = nn.CrossEntropyLoss() + eval_loss = loss_fct(logits[:, :-1, :], labels[:, 1:]) + perplexity = math.exp(eval_loss) + return {"perplexity": perplexity} + + +def convert_example(examples, tokenizer): + """convert examples into necessary features""" + # Convert raw text to feature + tokenized_examples = tokenizer( + examples["code"], return_attention_mask=True, return_position_ids=False, return_token_type_ids=False + ) + return tokenized_examples + + +def group_texts(examples, block_size): + concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + if total_length >= block_size: + total_length = (total_length // block_size) * block_size + result = { + k: [t[i : i + block_size] for i in range(0, total_length, block_size)] + for k, t in concatenated_examples.items() + } + result["labels"] = result["input_ids"].copy() + return result + + +def process_ds(dataset, tokenizer, overwrite_cache, block_size): + trans_func = partial(convert_example, tokenizer=tokenizer) + dataset = dataset.map( + trans_func, batched=True, remove_columns=dataset.column_names, load_from_cache_file=overwrite_cache + ) + trans_func = partial(group_texts, block_size=block_size) + dataset = dataset.map(trans_func, batched=True, load_from_cache_file=overwrite_cache) + return dataset + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + model = CodeGenForCausalLM.from_pretrained(model_args.model_name_or_path) + + tokenizer = CodeGenTokenizer.from_pretrained(model_args.model_name_or_path) + + train_set = load_dataset("json", data_files=data_args.train_file, split="train") + dev_set = load_dataset("json", data_files=data_args.validation_file, split="train") + + if data_args.block_size is None: + block_size = tokenizer.model_max_length + if block_size > 1024: + logger.warning( + f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " + "Picking 1024 instead. You can change that default value by passing --block_size xxx." + ) + block_size = 1024 + else: + if data_args.block_size > tokenizer.model_max_length: + logger.warning( + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." 
+ ) + block_size = min(data_args.block_size, tokenizer.model_max_length) + + train_set = process_ds(train_set, tokenizer, model_args.overwrite_cache, block_size) + dev_set = process_ds(dev_set, tokenizer, model_args.overwrite_cache, block_size) + + batchify_fn = DataCollatorWithPadding(tokenizer, return_attention_mask=True) + + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_set if training_args.do_train else None, + eval_dataset=dev_set if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + compute_metrics=compute_metrics, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/code_generation/codegen/run_clm.sh b/examples/code_generation/codegen/run_clm.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c76ea178e3f4c2c8a270e799199372151aa78f7 --- /dev/null +++ b/examples/code_generation/codegen/run_clm.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
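+
+# Fine-tunes Salesforce/codegen-350M-mono on the local train.json / test.json files
+# described in the README; adjust --gpus to the devices available on your machine.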
+ +python -m paddle.distributed.launch --gpus 0,1 run_clm.py \ + --model_name_or_path Salesforce/codegen-350M-mono \ + --block_size 1024 \ + --output_dir output \ + --train_file train.json \ + --validation_file test.json \ + --num_train_epochs 5 \ + --logging_steps 10 \ + --save_steps 1000 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --learning_rate 1e-4 \ + --warmup_ratio 0.1 \ + --do_train \ + --do_eval \ + --device gpu diff --git a/examples/dependency_parsing/ddparser/README.md b/examples/dependency_parsing/ddparser/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c0c964dc23d545204c1aee408bfa6ce6ed79d56d --- /dev/null +++ b/examples/dependency_parsing/ddparser/README.md @@ -0,0 +1,393 @@ +# DDParser + + - [模型简介](#模型简介) + - [快速开始](#快速开始) + - [模型效果](#模型效果) + - [数据格式](#数据格式) + - [数据准备](#数据准备) + - [文件结构](#文件结构) + - [模型训练、预测与部署](#模型训练、预测与部署) + - [Taskflow一键预测](#Taskflow一键预测) + - [Reference](#Reference) + +## 模型简介 + +依存句法分析任务通过分析句子中词语之间的依存关系来确定句子的句法结构, +该项目是基于Paddle v2.1的[baidu/ddparser](https://github.com/baidu/DDParser)实现, +模型结构为[Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)。 +同时本项目引入了[ERNIE](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 系列预训练模型, +用户可以基于预训练模型finetune完成依存句法分析训练(参考以下[示例](#模型训练))。 + +## 快速开始 + +本项目展示了基于NLPCC2013_EVSAM05_THU和NLPCC2013_EVSAM05_HIT数据集的进行模型训练、预测和部署的示例。 + +### 模型效果 + +以下是NLPCC2013_EVSAM05_THU和NLPCC2013_EVSAM05_HIT数据集的模型性能对比,baseline为第二届自然语言处理与中文计算会议发布的[评测报告](http://tcci.ccf.org.cn/conference/2013/dldoc/evrpt05.rar)。 + +#### NLPCC2013_EVSAM05_THU + +| model | dev UAS | dev LAS | test UAS | test LAS | +| ------------------------- | :-----: | :------:| :-------:| :-------:| +| `baseline` | 81.49 | 72.17 | 84.68 | 76.02 | +| `biaffine-dep(+char)` | 84.11 | 75.16 | 85.31 | 76.73 | +| `biaffine-dep(+pos)` | 83.28 | 74.20 | 84.54 | 75.33 | +| `biaffine-dep-lstm-pe` | 81.02 | 71.20 | 82.86 | 73.86 | +| `biaffine-dep-ernie-tiny` | 89.02 | 81.39 | 89.31 | 81.51 | +| `biaffine-dep-ernie-1.0` | 92.25 | 84.77 | 92.12 | 84.62 | +| `biaffine-dep-ernie-gram` | 92.20 | 85.10 | 91.96 | 84.10 | + +#### NLPCC2013_EVSAM05_HIT + +| model | dev UAS | dev LAS | test UAS | test LAS | +| ------------------------- | :-----: | :------:| :-------:| :-------:| +| `baseline` | 82.96 | 65.45 | 82.65 | 65.25 | +| `biaffine-dep(+char)` | 80.90 | 65.29 | 80.77 | 65.43 | +| `biaffine-dep(+pos)` | 83.85 | 68.27 | 83.75 | 68.04 | +| `biaffine-dep-lstm-pe` | 77.48 | 61.34 | 76.41 | 60.32 | +| `biaffine-dep-ernie-tiny` | 84.21 | 68.89 | 83.98 | 68.67 | +| `biaffine-dep-ernie-1.0` | 89.24 | 74.12 | 88.64 | 74.09 | +| `biaffine-dep-ernie-gram` | 89.59 | 74.75 | 88.79 | 74.46 | + +其中`lstm-pe`表示lstm by positional encoding,`biaffine-dep`的模型输入可以选择句子的word级表示加char级表示(`biaffine-dep(+char)`)或者句子的word级表示加上pos词性标签(`biaffine-dep(+pos)`),其他模型使用句子的word级表示和char级表示。 + +指标释义: +```text +UAS (依存准确率) = number of words assigned correct head / total words +LAS (依存标注准备率) = number of words assigned correct head and relation / total words +``` + +### 数据格式 + +本用例数据格式基于[CoNLL-X](https://ilk.uvt.nl/~emarsi/download/pubs/14964.pdf)。 + +| 名称 | 含义 | +| --- | --- | +| ID | 单词ID,序号从1开始 | +| FORM | 当前单词 | +| LEMMA | 当前词语的原型或词干,在中文中此列与FORM相同 | +| CPOSTAG | 当前词语的词性(粗粒度) | +| POSTAG | 当前词语的词性(细粒度) | +| FEATS | 句法特征 | +| HEAD | 当前单词的中心词 | +| DEPREL | 当前单词与中心词的依存关系 | +| PHEAD | 当前单词的主观中心词 | +| PDEPREL | 当前单词与主观中心词的依存关系 | + +NLPCC2013_EVSAM05_THU数据集示例: +``` +ID FROM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL +1 世界 世界 
n n _ 5 限定 +2 第 第 m m _ 4 限定 +3 八 八 m m _ 2 连接依存 +4 大 大 a a _ 5 限定 +5 奇迹 奇迹 n n _ 6 存现体 +6 出现 出现 v v _ 0 核心成分 +``` + +NLPCC2013_EVSAM05_HIT数据集示例: +``` +ID FROM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL +1 城建 城建 NN NN _ 2 relevant _ _ +2 成为 成为 VV VV _ 0 ROOT _ _ +3 外商 外商 NN NN _ 4 agent _ _ +4 投资 投资 VV VV _ 7 d-restrictive _ _ +5 青海 青海 NR NR _ 4 patient _ _ +6 新 新 JJ JJ _ 7 d-attribute _ _ +7 热点 热点 NN NN _ 2 isa _ _ +``` + +- 该用例中用户只需关注`FORM`、`POSTTAG`、`HEAD`和`DEPREL`这几列信息即可,'_'表示数值不可用。 + +### 数据准备 + +该用例使用的是[第二届自然语言处理与中文计算会议(NLP&CC 2013)](http://tcci.ccf.org.cn/conference/2013/pages/page04_sam.html) +提供的数据集,其中`NLPCC2013_EVSAM05_THU`为清华大学语义依存网络语料,`NLPCC2013_EVSAM05_HIT`为哈尔滨工业大学依存网络语料。 + +为了方便用户的快速使用,PaddleNLP Dataset API内置了数据集,一键可完成数据集加载。 + +加载`NLPCC2013_EVSAM05_THU`数据集: +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds, test_ds = load_dataset("nlpcc13_evsam05_thu", splits=["train", "dev", "test"]) +``` + +加载`NLPCC2013_EVSAM05_HIT`数据集: +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds, test_ds = load_dataset("nlpcc13_evsam05_hit", splits=["train", "dev", "test"]) +``` + +### 文件结构 + +以下是本项目主要代码结构及说明: + +```text +ddparser/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── model +│ ├── dropouts.py # dropout +│ ├── encoder.py # 编码器 +│ └── dep.py # 模型网络 +├── README.md # 使用说明 +├── export_model.py # 模型导出脚本 +├── criterion.py # 损失函数 +├── data.py # 数据结构 +├── metric.py # 指标计算 +├── train.py # 训练脚本 +├── predict.py # 预测脚本 +└── utils.py # 工具函数 +``` + +### 模型训练、预测与部署 + +本项目提供了三种模型结构:LSTMEncoder+MLP+BiAffine、LSTMByWPEncoder+MLP+BiAffine和ErnieEncoder+MLP+BiAffine,用户可通过`--encoding_model`指定所使用的模型结构。 + +#### LSTMEncoder+MLP+BiAffine + +##### 启动训练 + +通过如下命令,指定GPU 0卡,以`lstm`为encoder在`nlpcc13_evsam05_thu`数据集上训练与评估: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --device=gpu \ + --epochs=100 \ + --task_name=nlpcc13_evsam05_thu \ + --save_dir=./model_file \ + --encoding_model=lstm \ + --feat=pos \ + --batch_size=1000 \ + --lstm_lr=0.002 +``` + +##### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --task_name=nlpcc13_evsam05_thu \ + --encoding_model=lstm \ + --feat=pos \ + --params_path=./model_file/best.pdparams \ + --infer_output_file=infer_output.conll +``` + +**NOTE**: 预测时的`encoding_model`和`feat`需要与训练时的参数保持一致。 + +#### LSTMByWPEncoder+MLP+BiAffine + +##### 启动训练 + +通过如下命令,指定GPU 0卡,以`lstm-pe`为encoder在`nlpcc13_evsam05_hit`数据集上训练与评估: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --device=gpu \ + --epochs=100 \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=lstm-pe \ + --save_dir=./model_file \ + --lstm_lr=0.002 +``` + +##### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=lstm-pe \ + --params_path=./model_file/best.pdparams \ + --infer_output_file=infer_output.conll +``` + +##### 基于静态图的预测部署 + +使用动态图训练结束后,可以将动态图参数导出成静态图参数, 从而获得较优的预测部署性能,执行如下命令完成动态图转换静态图的功能: + +```shell +python export_model.py --encoding_model=lstm-pe \ + --params_path=./model_file/best.pdparams \ + --output_path=./output +``` + +导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: +```shell +python deploy/python/predict.py --model_dir=./output --task_name=nlpcc13_evsam05_hit +``` + +#### ErnieEncoder+MLP+BiAffine + +##### 
启动训练 + +通过如下命令,指定GPU 0卡,以预训练模型`ernie-gram-zh`为encoder在`nlpcc13_evsam05_hit`数据集上训练与评估: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --device=gpu \ + --epochs=100 \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=ernie-gram-zh \ + --save_dir=./model_file \ + --ernie_lr=5e-5 +``` + +##### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=ernie-gram-zh \ + --params_path=./model_file/best.pdparams \ + --infer_output_file=infer_output.conll +``` + +##### 基于静态图的预测部署 + +使用动态图训练结束后,可以将动态图参数导出成静态图参数, 从而获得较优的预测部署性能,执行如下命令完成动态图转换静态图的功能: + +```shell +python export_model.py --encoding_model=ernie-gram-zh \ + --params_path=./model_file/best.pdparams \ + --output_path=./output +``` + +导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: +```shell +python deploy/python/predict.py --model_dir=./output --task_name=nlpcc13_evsam05_hit +``` + +#### 参数释义 + +项目中的参数具体说明如下: + +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `task_name`: 选择训练所用的数据集,可选nlpcc13_evsam05_thu和nlpcc13_evsam05_hit。 +* `encoding_model`: 选择模型编码网络,可选lstm、lstm-pe、ernie-1.0、ernie-tiny和ernie-gram-zh。 +* `epochs`: 训练轮数。 +* `save_dir`: 保存训练模型的路径;默认将当前在验证集上LAS指标最高的模型`best.pdparams`和训练最近一个epoch的模型`last_epoch.pdparams`保存在目录model_file文件夹下。 +* `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数,默认为1000。 +* `init_from_params`: 模型参数路径,热启动模型训练;默认为None。 +* `clip`: 梯度裁剪阈值,将梯度限制在阈值范围内。 +* `lstm_lr`: 模型编码网络为lstm或lstm-pe时的学习率,默认为0.002。 +* `ernie_lr`: 模型编码网络为ernie-1.0、ernie-tiny、ernie-gram-zh时的学习率,默认为5e-5。 +* `seed`: 随机种子,默认为1000。 +* `min_freq`: 训练模式下的使用参数,基于训练数据生成的词表的最小词频,默认为2。 +* `n_buckets`: 选择数据分桶数,对训练数据按照长度进行分桶。 +* `tree`: 确保输出结果是正确的依存句法树,默认为True。 +* `feat`: 模型编码网络为lstm时的使用参数,选择输入的特征,可选char(句子的char级表示)和pos(词性标签);ernie类别的模型只能为None。 +* `warmup_proportion`: 学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `weight_decay`: 控制正则项力度的参数,用于防止过拟合,默认为0.0。 + +## Taskflow一键预测 + +Taskflow向用户提供了一个百度基于大规模标注数据集[DuCTB1.0](#数据来源)训练的依存句法分析工具ddparser。用户可以方便地使用该工具完成[一键预测](#一键预测)。 + +### 环境依赖 + +- LAC >= 2.1 +- matplotlib >= 3.4.2 + +### 一键预测 + +```python +from paddlenlp import Taskflow + +ddp = Taskflow("dependency_parsing") +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}] +''' +ddp(["百度是一家高科技公司", "他送了一本书"]) +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}, + {'word': ['他', '送', '了', '一本', '书'], + 'head': ['2', '0', '2', '5', '2'], + 'deprel': ['SBV', 'HED', 'MT', 'ATT', 'VOB']}] +''' + +# 输出概率和词性标签 +ddp = Taskflow("dependency_parsing", prob=True, use_pos=True) +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'postag': ['ORG', 'v', 'm', 'n', 'n'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB'], + 'prob': [1.0, 1.0, 1.0, 1.0, 1.0]}] +''' + +# 使用ddparser-ernie-1.0进行预测 +ddp = Taskflow("dependency_parsing", model="ddparser-ernie-1.0") +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}] +''' + +# 使用ddparser-ernie-gram-zh进行预测 +ddp = Taskflow("dependency_parsing", model="ddparser-ernie-gram-zh") +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': 
['SBV', 'HED', 'ATT', 'ATT', 'VOB']}] +''' +``` + +### 依存关系可视化 + +```python +from paddlenlp import Taskflow + +ddp = Taskflow("dependency_parsing", return_visual=True) +result = ddp("百度是一家高科技公司")[0]['visual'] +import cv2 +cv2.imwrite('test.png', result) +``` + +### 标注关系说明 + +DuCTB1.0数据集含14种标注关系,具体含义见下表: + +| Label | 关系类型 | 说明 | 示例 | +| :---: | :--------: | :----------------------- | :----------------------------- | +| SBV | 主谓关系 | 主语与谓词间的关系 | 他送了一本书(他<--送) | +| VOB | 动宾关系 | 宾语与谓词间的关系 | 他送了一本书(送-->书) | +| POB | 介宾关系 | 介词与宾语间的关系 | 我把书卖了(把-->书) | +| ADV | 状中关系 | 状语与中心词间的关系 | 我昨天买书了(昨天<--买) | +| CMP | 动补关系 | 补语与中心词间的关系 | 我都吃完了(吃-->完) | +| ATT | 定中关系 | 定语与中心词间的关系 | 他送了一本书(一本<--书) | +| F | 方位关系 | 方位词与中心词的关系 | 在公园里玩耍(公园-->里) | +| COO | 并列关系 | 同类型词语间关系 | 叔叔阿姨(叔叔-->阿姨) | +| DBL | 兼语结构 | 主谓短语做宾语的结构 | 他请我吃饭(请-->我,请-->吃饭) | +| DOB | 双宾语结构 | 谓语后出现两个宾语 | 他送我一本书(送-->我,送-->书) | +| VV | 连谓结构 | 同主语的多个谓词间关系 | 他外出吃饭(外出-->吃饭) | +| IC | 子句结构 | 两个结构独立或关联的单句 | 你好,书店怎么走?(你好<--走) | +| MT | 虚词成分 | 虚词与中心词间的关系 | 他送了一本书(送-->了) | +| HED | 核心关系 | 指整个句子的核心 | | + +### 数据来源 + +**DuCTB1.0**: `Baidu Chinese Treebank1.0`是百度构建的中文句法树库,即Taskflow所提供的依存句法分析工具-DDParser的训练数据来源。 + +## Reference + +- [baidu/ddparser](https://github.com/baidu/DDParser) +- [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734) diff --git a/examples/dependency_parsing/ddparser/criterion.py b/examples/dependency_parsing/ddparser/criterion.py new file mode 100644 index 0000000000000000000000000000000000000000..dcb25995f0e4dc8e72c2491088619ad523c19ae0 --- /dev/null +++ b/examples/dependency_parsing/ddparser/criterion.py @@ -0,0 +1,40 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from utils import index_sample + + +class ParserCriterion(nn.Layer): + def __init__(self, *args, **kwargs): + super(ParserCriterion, self).__init__(*args, **kwargs) + + def __call__(self, s_arc, s_rel, arcs, rels, mask): + + arcs = paddle.masked_select(arcs, mask) + rels = paddle.masked_select(rels, mask) + + select = paddle.nonzero(mask) + s_arc = paddle.gather_nd(s_arc, select) + s_rel = paddle.gather_nd(s_rel, select) + + s_rel = index_sample(s_rel, paddle.unsqueeze(arcs, axis=1)) + + arc_cost = F.cross_entropy(s_arc, arcs) + rel_cost = F.cross_entropy(s_rel, rels) + + avg_cost = paddle.mean(arc_cost + rel_cost) + return avg_cost diff --git a/examples/dependency_parsing/ddparser/data.py b/examples/dependency_parsing/ddparser/data.py new file mode 100644 index 0000000000000000000000000000000000000000..1663dfbe2a2bb41463859744ea23dd66b1851078 --- /dev/null +++ b/examples/dependency_parsing/ddparser/data.py @@ -0,0 +1,278 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import math +import numpy as np + +import paddle +from paddle.io import Dataset +from paddlenlp.data import Vocab +from utils import kmeans, pad_sequence + + +def build_vocab(corpus, tokenizer, encoding_model, feat): + """ + Build vocabs use the api of paddlenlp.data.Vocab.build_vocab(), + Using token_to_idx to specifies the mapping relationship between + tokens and indices to be used. + + Args: + Corpus(obj:`list[list[str]]`): The training corpus which contains + list of input words, features and relations. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from + :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. If the encoding model is lstm, + tokenizer is None. + encoding_model(obj:`str`): The encoder used for embedding. + feat(obj:`str`): The features used for model inputs. If the encoding + model is lstm, feat can be `pos` or `char`, otherwise the feat is None. + + Returns: + word_vocab(obj:`Vocab`): Word vocab. + feat_vocab(obj:`Vocab`): Feature vocab. + rel_vocab(obj:`Vocab`): Relation vocab. + """ + word_examples, feat_examples, rel_examples = corpus + + # Build word vocab and feature vocab + if encoding_model == "lstm": + # Using token_to_idx to specifies the mapping + # relationship between tokens and indices + word_vocab = Vocab.build_vocab( + word_examples, + min_freq=2, + token_to_idx={"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token="[BOS]", + eos_token="[EOS]", + ) + if feat == "pos": + feat_vocab = Vocab.build_vocab( + feat_examples, + token_to_idx={"[BOS]": 0, "[EOS]": 1}, + bos_token="[BOS]", + eos_token="[EOS]", + ) + else: + feat_vocab = Vocab.build_vocab( + feat_examples, + token_to_idx={"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token="[BOS]", + eos_token="[EOS]", + ) + else: + word_vocab = tokenizer.vocab + feat_vocab = None + + # Build relation vocab + rel_vocab = Vocab.build_vocab( + rel_examples, + token_to_idx={"[BOS]": 0, "[EOS]": 1, "[UNK]": 2}, + bos_token="[BOS]", + eos_token="[EOS]", + unk_token="[UNK]", + ) + return word_vocab, feat_vocab, rel_vocab + + +def load_vocab(vocab_dir): + """load vocabs""" + word_vocab = Vocab.from_json(os.path.join(vocab_dir, "word_vocab.json")) + rel_vocab = Vocab.from_json(os.path.join(vocab_dir, "rel_vocab.json")) + feat_vocab_path = os.path.join(vocab_dir, "feat_vocab.json") + if os.path.exists(feat_vocab_path): + feat_vocab = Vocab.from_json(os.path.join(feat_vocab_path)) + else: + feat_vocab = None + return word_vocab, feat_vocab, rel_vocab + + +def convert_example(example, vocabs, encoding_model="ernie-3.0-medium-zh", feat=None, mode="train", fix_len=20): + """Builds model inputs for dependency parsing task.""" + word_vocab, feat_vocab, rel_vocab = vocabs + if encoding_model == "lstm": + word_bos_index = word_vocab.to_indices("[BOS]") + word_eos_index = word_vocab.to_indices("[EOS]") + else: + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + if feat_vocab: + feat_bos_index = 
feat_vocab.to_indices("[BOS]") + feat_eos_index = feat_vocab.to_indices("[EOS]") + + arc_bos_index, arc_eos_index = 0, 1 + + rel_bos_index = rel_vocab.to_indices("[BOS]") + rel_eos_index = rel_vocab.to_indices("[EOS]") + + if mode != "test": + arcs = list(example["HEAD"]) + arcs = [arc_bos_index] + arcs + [arc_eos_index] + arcs = np.array(arcs, dtype=int) + + rels = rel_vocab.to_indices(example["DEPREL"]) + rels = [rel_bos_index] + rels + [rel_eos_index] + rels = np.array(rels, dtype=int) + + if encoding_model == "lstm": + words = word_vocab.to_indices(example["FORM"]) + words = [word_bos_index] + words + [word_eos_index] + words = np.array(words, dtype=int) + + if feat == "pos": + feats = feat_vocab.to_indices(example["CPOS"]) + feats = [feat_bos_index] + feats + [feat_eos_index] + feats = np.array(feats, dtype=int) + else: + feats = [[feat_vocab.to_indices(token) for token in word] for word in example["FORM"]] + feats = [[feat_bos_index]] + feats + [[feat_eos_index]] + feats = pad_sequence([np.array(ids[:fix_len], dtype=int) for ids in feats], fix_len=fix_len) + if mode == "test": + return words, feats + return words, feats, arcs, rels + else: + words = [[word_vocab.to_indices(char) for char in word] for word in example["FORM"]] + words = [[word_bos_index]] + words + [[word_eos_index]] + words = pad_sequence([np.array(ids[:fix_len], dtype=int) for ids in words], fix_len=fix_len) + if mode == "test": + return [words] + return words, arcs, rels + + +def create_dataloader(dataset, batch_size, mode="train", n_buckets=None, trans_fn=None): + """ + Create dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will + shuffle the dataset randomly. + n_buckets(obj:`int`, optional, defaults to `None`): If n_buckets is not None, it will devide + the dataset into n_buckets according to the sequence lengths. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a + data sample to input ids, etc. 
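+
+    Returns:
+        data_loader(obj:`paddle.io.DataLoader`): Dataloader which yields the already
+            batchified samples produced by `Batchify`.
+        buckets(obj:`dict`): Mapping from bucket length to sample indices when
+            `n_buckets` is given, otherwise `None`.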
+ """ + if n_buckets: + word_examples = [seq["FORM"] for seq in dataset] + lengths = [len(i) + 1 for i in word_examples] + buckets = dict(zip(*kmeans(lengths, n_buckets))) + else: + buckets = None + if trans_fn: + dataset = dataset.map(trans_fn) + + if n_buckets: + if mode == "train": + batch_sampler = BucketsSampler( + buckets=buckets, + batch_size=batch_size, + shuffle=True, + ) + else: + batch_sampler = BucketsSampler( + buckets=buckets, + batch_size=batch_size, + shuffle=False, + ) + else: + batch_sampler = SequentialSampler( + batch_size=batch_size, + corpus_length=len(dataset), + ) + + # Subclass of `paddle.io.Dataset` + dataset = Batchify(dataset, batch_sampler) + + # According to the api of `paddle.io.DataLoader` set `batch_size` + # and `batch_sampler` to `None` to disable batchify dataset automatically + data_loader = paddle.io.DataLoader(dataset=dataset, batch_sampler=None, batch_size=None, return_list=True) + return data_loader, buckets + + +class Batchify(Dataset): + def __init__(self, dataset, batch_sampler): + + self.batches = [] + for batch_sample_id in batch_sampler: + batch = [] + raw_batch = self._collate_fn([dataset[sample_id] for sample_id in batch_sample_id]) + for data in raw_batch: + if isinstance(data[0], np.ndarray): + data = pad_sequence(data) + batch.append(data) + self.batches.append(batch) + + def __getitem__(self, idx): + return self.batches[idx] + + def __len__(self): + return len(self.batches) + + def _collate_fn(self, batch): + """Return batch samples""" + return (raw for raw in zip(*batch)) + + +class BucketsSampler(object): + """BucketsSampler""" + + def __init__(self, buckets, batch_size, shuffle=False): + self.batch_size = batch_size + self.shuffle = shuffle + self.sizes, self.buckets = zip(*[(size, bucket) for size, bucket in buckets.items()]) + # The number of chunks in each bucket, which is clipped by range [1, len(bucket)] + self.chunks = [] + for size, bucket in zip(self.sizes, self.buckets): + max_ch = max(math.ceil(size * len(bucket) / batch_size), 1) + chunk = min(len(bucket), int(max_ch)) + self.chunks.append(chunk) + + def __iter__(self): + """Returns an iterator, randomly or sequentially returns a batch id""" + range_fn = np.random.permutation if self.shuffle else np.arange + for i in range_fn(len(self.buckets)).tolist(): + split_sizes = [(len(self.buckets[i]) - j - 1) // self.chunks[i] + 1 for j in range(self.chunks[i])] + for batch in np.split(range_fn(len(self.buckets[i])), np.cumsum(split_sizes)): + if len(batch): + yield [self.buckets[i][j] for j in batch.tolist()] + + def __len__(self): + """Returns the number of batches""" + return sum(self.chunks) + + +class SequentialSampler(object): + """SequentialSampler""" + + def __init__(self, batch_size, corpus_length): + self.batch_size = batch_size + self.corpus_length = corpus_length + + def __iter__(self): + """iter""" + batch = [] + for i in range(self.corpus_length): + batch.append(i) + if len(batch) == self.batch_size: + yield batch + batch = [] + else: + if len(batch): + yield batch diff --git a/examples/dependency_parsing/ddparser/deploy/python/predict.py b/examples/dependency_parsing/ddparser/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..027d0f358621c02f4cdeb90ddf5fa57628161345 --- /dev/null +++ b/examples/dependency_parsing/ddparser/deploy/python/predict.py @@ -0,0 +1,150 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import os + +import numpy as np +import paddle +from data import convert_example, load_vocab +from utils import eisner, flat_words, istree, pad_sequence + +from paddlenlp.datasets import load_dataset + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The path to static model.") +parser.add_argument("--task_name", choices=["nlpcc13_evsam05_thu", "nlpcc13_evsam05_hit"], type=str, default="nlpcc13_evsam05_thu", help="Select the task.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=64, help="Numbers of examples a batch for training.") +parser.add_argument("--infer_output_file", type=str, default='infer_output.conll', help="The path to save infer results.") +parser.add_argument("--tree", type=bool, default=True, help="Ensure the output conforms to the tree structure.") +args = parser.parse_args() +# fmt: on + + +def batchify_fn(batch): + raw_batch = [raw for raw in zip(*batch)] + batch = [pad_sequence(data) for data in raw_batch] + return batch + + +def decode(s_arc, s_rel, mask, tree=True): + + lens = np.sum(mask.astype(int), axis=-1) + arc_preds = np.argmax(s_arc, axis=-1) + + bad = [not istree(seq[: i + 1]) for i, seq in zip(lens, arc_preds)] + if tree and any(bad): + arc_preds[bad] = eisner(s_arc[bad], mask[bad]) + + rel_preds = np.argmax(s_rel, axis=-1) + rel_preds = [rel_pred[np.arange(len(arc_pred)), arc_pred] for arc_pred, rel_pred in zip(arc_preds, rel_preds)] + return arc_preds, rel_preds + + +class Predictor(object): + def __init__(self, model_dir, device): + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = [self.predictor.get_output_handle(name) for name in self.predictor.get_output_names()] + + def predict(self, data, vocabs): + word_vocab, _, rel_vocab = vocabs + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + examples = [] + for text in data: + 
example = { + "FORM": text["FORM"], + "CPOS": text["CPOS"], + } + example = convert_example( + example, + vocabs=vocabs, + mode="test", + ) + examples.append(example) + + batches = [examples[idx : idx + args.batch_size] for idx in range(0, len(examples), args.batch_size)] + + arcs, rels = [], [] + for batch in batches: + words = batchify_fn(batch)[0] + words, position = flat_words(words, word_pad_index) + self.input_handles[0].copy_from_cpu(words) + self.input_handles[1].copy_from_cpu(position) + self.predictor.run() + s_arc = self.output_handle[0].copy_to_cpu() + s_rel = self.output_handle[1].copy_to_cpu() + words = self.output_handle[2].copy_to_cpu() + + mask = np.logical_and( + np.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + arc_preds, rel_preds = decode(s_arc, s_rel, mask, args.tree) + + arcs.extend([arc_pred[m] for arc_pred, m in zip(arc_preds, mask)]) + rels.extend([rel_pred[m] for rel_pred, m in zip(rel_preds, mask)]) + + arcs = [[str(s) for s in seq] for seq in arcs] + rels = [rel_vocab.to_tokens(seq) for seq in rels] + return arcs, rels + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device) + + # Load vocabs from model file path + vocabs = load_vocab(args.model_dir) + + test_ds = load_dataset(args.task_name, splits=["test"]) + test_ds_copy = copy.deepcopy(test_ds) + + pred_arcs, pred_rels = predictor.predict(test_ds, vocabs) + + with open(args.infer_output_file, "w", encoding="utf-8") as out_file: + for res, head, rel in zip(test_ds_copy, pred_arcs, pred_rels): + res["HEAD"] = tuple(head) + res["DEPREL"] = tuple(rel) + res = "\n".join("\t".join(map(str, line)) for line in zip(*res.values())) + "\n" + out_file.write("{}\n".format(res)) + out_file.close() + print("Results saved!") diff --git a/examples/dependency_parsing/ddparser/export_model.py b/examples/dependency_parsing/ddparser/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..97b6cabb5eb507072567b93fc33b2ff80d816eb8 --- /dev/null +++ b/examples/dependency_parsing/ddparser/export_model.py @@ -0,0 +1,85 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
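+
+# Export a trained dygraph DDParser checkpoint to a static graph inference model
+# (inference.pdmodel / inference.pdiparams) via paddle.jit.to_static, together with
+# the word/rel vocab json files, so it can be served by deploy/python/predict.py.
+# Example invocation (from the README):
+#   python export_model.py --encoding_model=ernie-gram-zh \
+#       --params_path=./model_file/best.pdparams --output_path=./output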
+ +import argparse +import os + +import paddle +from data import load_vocab +from model.dep import BiAffineParser + +from paddlenlp.transformers import AutoModel + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--encoding_model", choices=["lstm-pe", "ernie-1.0", "ernie-3.0-medium-zh", "ernie-tiny", "ernie-gram-zh"], type=str, default="ernie-3.0-medium-zh", help="Select the encoding model.") +parser.add_argument("--params_path", type=str, required=True, default='./model_file/best.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + + # Load pretrained model if encoding model is ernie-3.0-medium-zh, ernie-1.0, ernie-tiny or ernie-gram-zh + if args.encoding_model in ["ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"]: + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + else: + pretrained_model = None + + # Load vocabs from model file path + vocab_dir = os.path.split(args.params_path)[0] + word_vocab, _, rel_vocab = load_vocab(vocab_dir) + + if not os.path.exists(args.output_path): + os.makedirs(args.output_path) + + # Save vocabs to output path + word_vocab.to_json(path=os.path.join(args.output_path, "word_vocab.json")) + rel_vocab.to_json(path=os.path.join(args.output_path, "rel_vocab.json")) + + n_rels, n_words, n_feats = len(rel_vocab), len(word_vocab), None + + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + # Load ddparser model + model = BiAffineParser( + encoding_model=args.encoding_model, + feat=None, + n_rels=n_rels, + n_feats=n_feats, + n_words=n_words, + pad_index=word_pad_index, + eos_index=word_eos_index, + pretrained_model=pretrained_model, + ) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/dependency_parsing/ddparser/metric.py b/examples/dependency_parsing/ddparser/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..ad3256fc3c5906a334ffeb792b6af72ef3faaa7f --- /dev/null +++ b/examples/dependency_parsing/ddparser/metric.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
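+
+# UAS/LAS evaluation for the dependency parser. During evaluation, call
+# metric.update(arc_preds, rel_preds, arcs, rels, mask) after each batch and
+# metric.accumulate() at the end to obtain the (uas, las) tuple over all masked tokens.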
+ +import numpy as np + +import paddle +from paddle.metric import Metric + + +class ParserEvaluator(Metric): + """ + UAS and LAS for dependency parser. + + UAS = number of words assigned correct head / total words + LAS = number of words assigned correct head and relation / total words + """ + + def __init__(self, name="ParserEvaluator", eps=1e-8): + super(ParserEvaluator, self).__init__() + + self.eps = eps + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.total = 0.0 + self.correct_arcs = 0.0 + self.correct_rels = 0.0 + + def update(self, arc_preds, rel_preds, arcs, rels, mask): + select = paddle.nonzero(mask) + arc_mask = paddle.gather_nd(arc_preds == arcs, select) + rel_mask = paddle.logical_and(paddle.gather_nd(rel_preds == rels, select), arc_mask) + + self.total += len(arc_mask) + self.correct_arcs += np.sum(arc_mask.numpy()).item() + self.correct_rels += np.sum(rel_mask.numpy()).item() + + def accumulate(self): + uas = self.correct_arcs / (self.total + self.eps) + las = self.correct_rels / (self.total + self.eps) + return uas, las + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/examples/dependency_parsing/ddparser/model/dep.py b/examples/dependency_parsing/ddparser/model/dep.py new file mode 100644 index 0000000000000000000000000000000000000000..73d7659d85f02478ad83adcc8949ecc85721b836 --- /dev/null +++ b/examples/dependency_parsing/ddparser/model/dep.py @@ -0,0 +1,140 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
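+
+# BiAffine dependency parser: an encoder (LSTMEncoder, LSTMByWPEncoder or ErnieEncoder)
+# produces token representations, four MLP heads project them into arc/rel spaces,
+# and two BiAffine layers score head attachments s_arc of shape (batch, seq, seq)
+# and relations s_rel of shape (batch, seq, seq, n_rels).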
+ +import paddle +import paddle.nn as nn + +from model.dropouts import SharedDropout +from model.encoder import LSTMEncoder, LSTMByWPEncoder, ErnieEncoder + + +class BiAffineParser(nn.Layer): + """DDParser""" + + def __init__( + self, + encoding_model, + feat, + n_rels, + n_feats, + n_words, + pad_index, + eos_index, + pretrained_model=None, + n_mlp_arc=500, + n_mlp_rel=100, + mlp_dropout=0.33, + ): + super(BiAffineParser, self).__init__() + self.pad_index = pad_index + self.eos_index = eos_index + + if encoding_model == "lstm": + self.embed = LSTMEncoder(feat, n_feats, n_words) + elif encoding_model == "lstm-pe": + self.embed = LSTMByWPEncoder(n_words, pad_index) + else: + self.embed = ErnieEncoder(pad_index, pretrained_model) + + # MLP layer + self.mlp_arc_h = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_arc, dropout=mlp_dropout) + self.mlp_arc_d = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_arc, dropout=mlp_dropout) + self.mlp_rel_h = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_rel, dropout=mlp_dropout) + self.mlp_rel_d = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_rel, dropout=mlp_dropout) + + # Biaffine layer + self.arc_attn = BiAffine(n_in=n_mlp_arc, bias_x=True, bias_y=False) + self.rel_attn = BiAffine(n_in=n_mlp_rel, n_out=n_rels, bias_x=True, bias_y=True) + + def forward(self, words, feats): + + words, x = self.embed(words, feats) + mask = paddle.logical_and(words != self.pad_index, words != self.eos_index) + + arc_h = self.mlp_arc_h(x) + arc_d = self.mlp_arc_d(x) + rel_h = self.mlp_rel_h(x) + rel_d = self.mlp_rel_d(x) + + # Get arc and rel scores from the bilinear attention + # Shape: (batch_size, seq_len, seq_len) + s_arc = self.arc_attn(arc_d, arc_h) + # Shape: (batch_size, seq_len, seq_len, n_rels) + s_rel = paddle.transpose(self.rel_attn(rel_d, rel_h), perm=[0, 2, 3, 1]) + # Set the scores that exceed the length of each sentence to -1e5 + s_arc_mask = paddle.unsqueeze(mask, 1) + s_arc = s_arc * s_arc_mask + paddle.scale( + paddle.cast(s_arc_mask, "int32"), scale=1e5, bias=-1, bias_after_scale=False + ) + return s_arc, s_rel, words + + +class MLP(nn.Layer): + """MLP""" + + def __init__(self, n_in, n_out, dropout=0): + super(MLP, self).__init__() + + self.linear = nn.Linear( + n_in, + n_out, + weight_attr=nn.initializer.XavierNormal(), + ) + self.leaky_relu = nn.LeakyReLU(negative_slope=0.1) + self.dropout = SharedDropout(p=dropout) + + def forward(self, x): + # Shape: (batch_size, output_size) + x = self.linear(x) + x = self.leaky_relu(x) + x = self.dropout(x) + return x + + +class BiAffine(nn.Layer): + """BiAffine""" + + def __init__(self, n_in, n_out=1, bias_x=True, bias_y=True): + super(BiAffine, self).__init__() + + self.n_in = n_in + self.n_out = n_out + self.bias_x = bias_x + self.bias_y = bias_y + self.weight = self.create_parameter(shape=[n_out, n_in + bias_x, n_in + bias_y], dtype="float32") + + def forward(self, x, y): + if self.bias_x: + x = paddle.concat([x, paddle.ones_like(x[:, :, :1])], axis=-1) + if self.bias_y: + y = paddle.concat([y, paddle.ones_like(x[:, :, :1])], axis=-1) + # Shape x: (batch_size, num_tokens, input_size + bias_x) + b = x.shape[0] + o = self.weight.shape[0] + # Shape x: (batch_size, output_size, num_tokens, input_size + bias_x) + x = paddle.expand(paddle.unsqueeze(x, axis=1), shape=(x.shape[0], o, x.shape[1], x.shape[2])) + # Shape y: (batch_size, output_size, num_tokens, input_size + bias_y) + y = paddle.expand(paddle.unsqueeze(y, axis=1), shape=(y.shape[0], o, y.shape[1], y.shape[2])) + # Shape weight: (batch_size, 
output_size, input_size + bias_x, input_size + bias_y) + weight = paddle.expand( + paddle.unsqueeze(self.weight, axis=0), + shape=(b, self.weight.shape[0], self.weight.shape[1], self.weight.shape[2]), + ) + + # Shape: (batch_size, output_size, num_tokens, num_tokens) + s = paddle.matmul(paddle.matmul(x, weight), paddle.transpose(y, perm=[0, 1, 3, 2])) + # Remove dim 1 if n_out == 1 + if s.shape[1] == 1: + s = paddle.squeeze(s, axis=1) + return s diff --git a/examples/dependency_parsing/ddparser/model/dropouts.py b/examples/dependency_parsing/ddparser/model/dropouts.py new file mode 100644 index 0000000000000000000000000000000000000000..1aa8c5ac1723a8bb8f207ea3da5be337de326667 --- /dev/null +++ b/examples/dependency_parsing/ddparser/model/dropouts.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + + +class SharedDropout(nn.Layer): + """SharedDropout""" + + def __init__(self, p=0.5, batch_first=True): + super(SharedDropout, self).__init__() + + self.p = p + self.batch_first = batch_first + + def forward(self, x): + """Forward network""" + if self.training and self.p > 0: + if self.batch_first: + mask = self.get_mask(x[:, 0], self.p) + else: + mask = self.get_mask(x[0], self.p) + x *= paddle.unsqueeze(mask, axis=1) if self.batch_first else mask + return x + + @staticmethod + def get_mask(x, p): + """Generate the mask matrix of the dropout by the input.""" + mask = paddle.uniform(shape=x.shape, min=0, max=1) >= p + mask = paddle.cast(mask, "float32") + mask = mask / (1 - p) + return mask + + +class IndependentDropout(nn.Layer): + """IndependentDropout""" + + def __init__(self, p=0.5): + super(IndependentDropout, self).__init__() + self.p = p + + def forward(self, *items): + """Forward network""" + if self.training and self.p > 0: + masks = [paddle.uniform(shape=x.shape[:2], min=0, max=1) >= self.p for x in items] + masks = [paddle.cast(x, "float32") for x in masks] + total = paddle.add(*masks) + scale = len(items) / paddle.maximum(total, paddle.ones_like(total)) + masks = [mask * scale for mask in masks] + items = [item * paddle.unsqueeze(mask, axis=-1) for item, mask in zip(items, masks)] + return items diff --git a/examples/dependency_parsing/ddparser/model/encoder.py b/examples/dependency_parsing/ddparser/model/encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..6c319b2e736addf8d69ea47920e6e1a0075a5ef1 --- /dev/null +++ b/examples/dependency_parsing/ddparser/model/encoder.py @@ -0,0 +1,165 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +from model.dropouts import IndependentDropout, SharedDropout +from utils import index_sample, pad_sequence_paddle + + +class ErnieEncoder(nn.Layer): + def __init__(self, pad_index, pretrained_model): + super(ErnieEncoder, self).__init__() + self.pad_index = pad_index + self.ptm = pretrained_model + self.mlp_input_size = self.ptm.config["hidden_size"] + + def forward(self, words, wp): + x, _ = self.ptm(words) + x = paddle.reshape( + index_sample(x, wp), + shape=[wp.shape[0], wp.shape[1], x.shape[2]], + ) + words = index_sample(words, wp) + return words, x + + +class LSTMByWPEncoder(nn.Layer): + def __init__( + self, + n_words, + pad_index, + lstm_by_wp_embed_size=200, + n_embed=300, + n_lstm_hidden=300, + n_lstm_layers=3, + lstm_dropout=0.33, + ): + super(LSTMByWPEncoder, self).__init__() + self.pad_index = pad_index + self.word_embed = nn.Embedding(n_words, lstm_by_wp_embed_size) + + self.lstm = nn.LSTM( + input_size=lstm_by_wp_embed_size, + hidden_size=n_lstm_hidden, + num_layers=n_lstm_layers, + dropout=lstm_dropout, + direction="bidirectional", + ) + + self.lstm_dropout = SharedDropout(p=lstm_dropout) + self.mlp_input_size = n_lstm_hidden * 2 + + def forward(self, words, wp): + + word_embed = self.word_embed(words) + mask = words != self.pad_index + seq_lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + + x, _ = self.lstm(word_embed, sequence_length=seq_lens) + x = paddle.reshape( + index_sample(x, wp), + shape=[wp.shape[0], wp.shape[1], x.shape[2]], + ) + words = paddle.index_sample(words, wp) + x = self.lstm_dropout(x) + return words, x + + +class LSTMEncoder(nn.Layer): + def __init__( + self, + feat, + n_feats, + n_words, + pad_index=0, + feat_pad_index=0, + n_char_embed=50, + n_feat_embed=60, + n_lstm_char_embed=100, + n_embed=300, + embed_dropout=0.33, + n_lstm_hidden=300, + n_lstm_layers=3, + lstm_dropout=0.33, + ): + super(LSTMEncoder, self).__init__() + self.pad_index = pad_index + + if feat == "char": + self.feat_embed = CharLSTMEncoder( + n_chars=n_feats, + n_embed=n_char_embed, + n_out=n_lstm_char_embed, + pad_index=feat_pad_index, + ) + feat_embed_size = n_lstm_char_embed + else: + self.feat_embed = nn.Embedding(num_embeddings=n_feats, embedding_dim=n_feat_embed) + feat_embed_size = n_feat_embed + + self.word_embed = nn.Embedding(num_embeddings=n_words, embedding_dim=n_embed) + self.embed_dropout = IndependentDropout(p=embed_dropout) + + self.lstm = nn.LSTM( + input_size=n_embed + feat_embed_size, + hidden_size=n_lstm_hidden, + num_layers=n_lstm_layers, + dropout=lstm_dropout, + direction="bidirectional", + ) + self.lstm_dropout = SharedDropout(p=lstm_dropout) + self.mlp_input_size = n_lstm_hidden * 2 + + def forward(self, words, feats): + word_embed = self.word_embed(words) + feat_embed = self.feat_embed(feats) + word_embed, feat_embed = self.embed_dropout(word_embed, feat_embed) + embed = paddle.concat([word_embed, feat_embed], axis=-1) + mask = words != self.pad_index + seq_lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + x, _ = self.lstm(embed, sequence_length=seq_lens) + x = self.lstm_dropout(x) + return 
words, x + + +class CharLSTMEncoder(nn.Layer): + def __init__(self, n_chars, n_embed, n_out, pad_index=0): + super(CharLSTMEncoder, self).__init__() + self.n_chars = n_chars + self.n_embed = n_embed + self.n_out = n_out + self.pad_index = pad_index + + # the embedding layer + self.embed = nn.Embedding(num_embeddings=n_chars, embedding_dim=n_embed) + # the lstm layer + self.lstm = nn.LSTM(input_size=n_embed, hidden_size=n_out // 2, direction="bidirectional") + + def forward(self, x): + """Forward network""" + mask = paddle.any(x != self.pad_index, axis=-1) + + lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + select = paddle.nonzero(mask) + masked_x = paddle.gather_nd(x, select) + char_mask = masked_x != self.pad_index + emb = self.embed(masked_x) + word_lens = paddle.sum(paddle.cast(char_mask, "int32"), axis=-1) + _, (h, _) = self.lstm(emb, sequence_length=word_lens) + h = paddle.concat(paddle.unstack(h), axis=-1) + + feat_embed = pad_sequence_paddle(h, lens, pad_index=self.pad_index) + + return feat_embed diff --git a/examples/dependency_parsing/ddparser/predict.py b/examples/dependency_parsing/ddparser/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..53c1835cc919353d7cc02d35e949710eb9c725e4 --- /dev/null +++ b/examples/dependency_parsing/ddparser/predict.py @@ -0,0 +1,187 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
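+
+# Batch prediction with a trained checkpoint: loads the vocabs saved next to
+# --params_path, decodes arc and relation predictions, and writes CoNLL-style
+# results to --infer_output_file. The --encoding_model and --feat flags must
+# match the values used at training time (see the README).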
+ +import argparse +import copy +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, load_vocab +from model.dep import BiAffineParser +from utils import decode, flat_words + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel + +# fmt: off +parser = argparse.ArgumentParser() +# Predict +parser.add_argument("--params_path", type=str, default='model_file/best.pdparams', required=True, help="Directory to load model parameters.") +parser.add_argument("--task_name", choices=["nlpcc13_evsam05_thu", "nlpcc13_evsam05_hit"], type=str, default="nlpcc13_evsam05_thu", help="Select the task.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--encoding_model", choices=["lstm", "lstm-pe", "ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"], type=str, default="ernie-3.0-medium-zh", help="Select the encoding model.") +parser.add_argument("--batch_size", type=int, default=1000, help="Numbers of examples a batch for training.") +parser.add_argument("--infer_output_file", type=str, default='infer_output.conll', help="The path to save infer results.") +# Preprocess +parser.add_argument("--n_buckets", type=int, default=15, help="Number of buckets to devide the dataset.") +# Postprocess +parser.add_argument("--tree", type=bool, default=True, help="Ensure the output conforms to the tree structure.") +# Lstm +parser.add_argument("--feat", choices=["char", "pos"], type=str, default=None, help="The feature representation to use.") +args = parser.parse_args() +# fmt: on + + +@paddle.no_grad() +def batch_predict( + model, + data_loader, + rel_vocab, + word_pad_index, + word_bos_index, + word_eos_index, +): + + model.eval() + arcs, rels = [], [] + for inputs in data_loader(): + if args.encoding_model.startswith("ernie") or args.encoding_model == "lstm-pe": + words = inputs[0] + words, feats = flat_words(words) + s_arc, s_rel, words = model(words, feats) + else: + words, feats = inputs + s_arc, s_rel, words = model(words, feats) + + mask = paddle.logical_and( + paddle.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + arc_preds, rel_preds = decode(s_arc, s_rel, mask) + arcs.extend(paddle.split(paddle.masked_select(arc_preds, mask), lens.numpy().tolist())) + rels.extend(paddle.split(paddle.masked_select(rel_preds, mask), lens.numpy().tolist())) + + arcs = [[str(s) for s in seq.numpy().tolist()] for seq in arcs] + rels = [rel_vocab.to_tokens(seq.numpy().tolist()) for seq in rels] + + return arcs, rels + + +def do_predict(args): + paddle.set_device(args.device) + + # if args.encoding_model.startswith("ernie"): + # tokenizer = AutoTokenizer.from_pretrained(args.encoding_model) + # elif args.encoding_model == "lstm-pe": + # tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + # else: + # tokenizer = None + + # Load vocabs from model file path + vocab_dir = os.path.split(args.params_path)[0] + word_vocab, feat_vocab, rel_vocab = load_vocab(vocab_dir) + + n_rels, n_words = len(rel_vocab), len(word_vocab) + if args.encoding_model == "lstm": + n_feats = len(feat_vocab) + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[BOS]") + word_eos_index = word_vocab.to_indices("[EOS]") + else: + n_feats = None + word_pad_index = 
word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + test_ds = load_dataset(args.task_name, splits=["test"]) + test_ds_copy = copy.deepcopy(test_ds) + + trans_fn = partial( + convert_example, + vocabs=[word_vocab, feat_vocab, rel_vocab], + encoding_model=args.encoding_model, + feat=args.feat, + mode="test", + ) + + test_data_loader, buckets = create_dataloader( + test_ds, + batch_size=args.batch_size, + mode="test", + n_buckets=args.n_buckets, + trans_fn=trans_fn, + ) + + # Load pretrained model if encoding model is ernie-3.0-medium-zh, ernie-1.0, ernie-tiny or ernie-gram-zh + if args.encoding_model in ["ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny"]: + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + elif args.encoding_model == "ernie-gram-zh": + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + else: + pretrained_model = None + + # Load model + model = BiAffineParser( + encoding_model=args.encoding_model, + feat=args.feat, + n_rels=n_rels, + n_feats=n_feats, + n_words=n_words, + pad_index=word_pad_index, + eos_index=word_eos_index, + pretrained_model=pretrained_model, + ) + + # Load saved model parameters + if os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("The parameters path is incorrect or not specified.") + + # Start predict + pred_arcs, pred_rels = batch_predict( + model, + test_data_loader, + rel_vocab, + word_pad_index, + word_bos_index, + word_eos_index, + ) + + # Restore the order of sentences in the buckets + if buckets: + indices = np.argsort(np.array([i for bucket in buckets.values() for i in bucket])) + else: + indices = range(len(pred_arcs)) + pred_heads = [pred_arcs[i] for i in indices] + pred_deprels = [pred_rels[i] for i in indices] + + with open(args.infer_output_file, "w", encoding="utf-8") as out_file: + for res, head, rel in zip(test_ds_copy, pred_heads, pred_deprels): + res["HEAD"] = tuple(head) + res["DEPREL"] = tuple(rel) + res = "\n".join("\t".join(map(str, line)) for line in zip(*res.values())) + "\n" + out_file.write("{}\n".format(res)) + out_file.close() + print("Results saved!") + + +if __name__ == "__main__": + do_predict(args) diff --git a/examples/dependency_parsing/ddparser/train.py b/examples/dependency_parsing/ddparser/train.py new file mode 100644 index 0000000000000000000000000000000000000000..473391fb0d68fb2c77825845c2f888036107d6fb --- /dev/null +++ b/examples/dependency_parsing/ddparser/train.py @@ -0,0 +1,300 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
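+
+# Training entry point: builds vocabs from the train split, trains the BiAffine
+# parser and evaluates UAS/LAS on the dev split after each epoch. The checkpoint
+# with the best dev LAS is saved as <save_dir>/best.pdparams and the latest epoch
+# as <save_dir>/last_epoch.pdparams (see the README for launch commands).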
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from criterion import ParserCriterion +from data import build_vocab, convert_example, create_dataloader +from metric import ParserEvaluator +from model.dep import BiAffineParser +from utils import decode, flat_words + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.transformers.optimization import LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +# Train +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--task_name", choices=["nlpcc13_evsam05_thu", "nlpcc13_evsam05_hit"], type=str, default="nlpcc13_evsam05_thu", help="Select the task.") +parser.add_argument("--encoding_model", choices=["lstm", "lstm-pe", "ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"], type=str, default="ernie-3.0-medium-zh", help="Select the encoding model.") +parser.add_argument("--epochs", type=int, default=100, help="Number of epoches for training.") +parser.add_argument("--save_dir", type=str, default='model_file/', help="Directory to save model parameters.") +parser.add_argument("--batch_size", type=int, default=1000, help="Numbers of examples a batch for training.") +parser.add_argument("--init_from_params", type=str, default=None, help="The path of model parameters to be loaded.") +parser.add_argument("--clip", type=float, default=1.0, help="The threshold of gradient clip.") +parser.add_argument("--lstm_lr", type=float, default=0.002, help="The Learning rate of lstm encoding model.") +parser.add_argument("--ernie_lr", type=float, default=5e-05, help="The Learning rate of ernie encoding model.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +# Preprocess +parser.add_argument("--n_buckets", type=int, default=15, help="Number of buckets to devide the dataset.") +# Postprocess +parser.add_argument("--tree", type=bool, default=True, help="Ensure the output conforms to the tree structure.") +# Lstm +parser.add_argument("--feat", choices=["char", "pos"], type=str, default=None, help="The feature representation to use.") +# Ernie +parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Linear warmup proportion over total steps.") +parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay if we apply some.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def batch_evaluate( + model, + metric, + criterion, + data_loader, + word_pad_index, + word_bos_index, + word_eos_index, +): + model.eval() + metric.reset() + losses = [] + for batch in data_loader(): + if args.encoding_model.startswith("ernie") or args.encoding_model == "lstm-pe": + words, arcs, rels = batch + words, feats = flat_words(words) + s_arc, s_rel, words = model(words, feats) + else: + words, feats, arcs, rels = batch + s_arc, s_rel, words = model(words, feats) + + mask = paddle.logical_and( + paddle.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + loss = criterion(s_arc, s_rel, arcs, rels, mask) + + losses.append(loss.item()) + + arc_preds, rel_preds = decode(s_arc, s_rel, mask) + metric.update(arc_preds, rel_preds, arcs, rels, mask) + uas, las = metric.accumulate() + 
total_loss = np.mean(losses) + model.train() + metric.reset() + return total_loss, uas, las + + +def do_train(args): + set_seed(args.seed) + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + if args.encoding_model == "ernie-gram-zh": + tokenizer = AutoTokenizer.from_pretrained(args.encoding_model) + elif args.encoding_model.startswith("ernie"): + tokenizer = AutoTokenizer.from_pretrained(args.encoding_model) + elif args.encoding_model == "lstm-pe": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + else: + tokenizer = None + + train_ds, dev_ds = load_dataset(args.task_name, splits=["train", "dev"]) + + # Build the vocabs based on train corpus + word_examples = [seq["FORM"] for seq in train_ds] + if args.feat == "pos": + feat_examples = [seq["CPOS"] for seq in train_ds] + elif args.feat == "char": + feat_examples = [token for seq in train_ds for token in seq["FORM"]] + else: + feat_examples = None + rel_examples = [seq["DEPREL"] for seq in train_ds] + + train_corpus = [word_examples, feat_examples, rel_examples] + vocabs = build_vocab( + train_corpus, + tokenizer, + encoding_model=args.encoding_model, + feat=args.feat, + ) + word_vocab, feat_vocab, rel_vocab = vocabs + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + # Save vocabs into json file + word_vocab.to_json(path=os.path.join(args.save_dir, "word_vocab.json")) + rel_vocab.to_json(path=os.path.join(args.save_dir, "rel_vocab.json")) + + if feat_vocab: + n_feats = len(feat_vocab) + feat_vocab.to_json(path=os.path.join(args.save_dir, "feat_vocab.json")) + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[BOS]") + word_eos_index = word_vocab.to_indices("[EOS]") + else: + n_feats = None + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + n_rels, n_words = len(rel_vocab), len(word_vocab) + + trans_fn = partial( + convert_example, + vocabs=vocabs, + encoding_model=args.encoding_model, + feat=args.feat, + ) + + train_data_loader, _ = create_dataloader( + train_ds, + batch_size=args.batch_size, + mode="train", + n_buckets=args.n_buckets, + trans_fn=trans_fn, + ) + dev_data_loader, _ = create_dataloader( + dev_ds, + batch_size=args.batch_size, + mode="dev", + n_buckets=args.n_buckets, + trans_fn=trans_fn, + ) + + # Load pretrained model if encoding model is ernie-3.0-medium-zh, ernie-1.0, ernie-tiny or ernie-gram-zh + if args.encoding_model in ["ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"]: + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + else: + pretrained_model = None + + # Load ddparser model + model = BiAffineParser( + encoding_model=args.encoding_model, + feat=args.feat, + n_rels=n_rels, + n_feats=n_feats, + n_words=n_words, + pad_index=word_pad_index, + eos_index=word_eos_index, + pretrained_model=pretrained_model, + ) + + # Define learning rate + if args.encoding_model.startswith("ernie"): + lr = args.ernie_lr + else: + lr = args.lstm_lr + + # Continue training from a pretrained model if the checkpoint is specified + if args.init_from_params and os.path.isfile(args.init_from_params): + state_dict = paddle.load(args.init_from_params) + model.set_dict(state_dict) + + # Data parallel for distributed training + model = paddle.DataParallel(model) + + num_training_steps = len(list(train_data_loader)) * 
args.epochs + + # Define the training strategy + lr_scheduler = LinearDecayWithWarmup(lr, num_training_steps, args.warmup_proportion) + grad_clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.clip) + if args.encoding_model.startswith("ernie"): + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=grad_clip, + ) + else: + optimizer = paddle.optimizer.Adam( + learning_rate=lr, + beta1=0.9, + beta2=0.9, + epsilon=1e-12, + parameters=model.parameters(), + grad_clip=grad_clip, + ) + + # Load metric and criterion + best_las = 0 + metric = ParserEvaluator() + criterion = ParserCriterion() + + # Epoch train + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for inputs in train_data_loader(): + if args.encoding_model.startswith("ernie") or args.encoding_model == "lstm-pe": + words, arcs, rels = inputs + words, feats = flat_words(words) + s_arc, s_rel, words = model(words, feats) + else: + words, feats, arcs, rels = inputs + s_arc, s_rel, words = model(words, feats) + + mask = paddle.logical_and( + paddle.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + loss = criterion(s_arc, s_rel, arcs, rels, mask) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss.item(), 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + if rank == 0: + # Evaluate on dev dataset + loss, uas, las = batch_evaluate( + model, + metric, + criterion, + dev_data_loader, + word_pad_index, + word_bos_index, + word_eos_index, + ) + print("eval loss: %.5f, UAS: %.2f%%, LAS: %.2f%%" % (loss, uas * 100, las * 100)) + # Save model parameter of last epoch + save_param_path = os.path.join(args.save_dir, "last_epoch.pdparams") + paddle.save(model.state_dict(), save_param_path) + # Save the model if it get a higher score of las + if las > best_las: + save_param_path = os.path.join(args.save_dir, "best.pdparams") + paddle.save(model.state_dict(), save_param_path) + best_las = las + + +if __name__ == "__main__": + do_train(args) diff --git a/examples/dependency_parsing/ddparser/utils.py b/examples/dependency_parsing/ddparser/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..807d4be437d3d074ddde0801e558d4c958faf044 --- /dev/null +++ b/examples/dependency_parsing/ddparser/utils.py @@ -0,0 +1,436 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
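+
+# Shared helpers for the DDParser example: numpy/paddle padding utilities,
+# flat_words for wordpiece-style inputs, index_sample/mask_fill tensor helpers,
+# kmeans-based length bucketing, and the decoding utilities (decode, backtrack,
+# stripe) used to keep the predicted arcs a well-formed dependency tree.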
+ +import copy + +import numpy as np +import paddle + +from paddlenlp.data import Pad + + +def decode(s_arc, s_rel, mask, tree=True): + """Decode function""" + mask = mask.numpy() + lens = np.sum(mask, -1) + # Prevent self-loops + arc_preds = paddle.argmax(s_arc, axis=-1).numpy() + bad = [not istree(seq[: i + 1]) for i, seq in zip(lens, arc_preds)] + if tree and any(bad): + arc_preds[bad] = eisner(s_arc.numpy()[bad], mask[bad]) + arc_preds = paddle.to_tensor(arc_preds) + rel_preds = paddle.argmax(s_rel, axis=-1) + rel_preds = index_sample(rel_preds, paddle.unsqueeze(arc_preds, axis=-1)) + rel_preds = paddle.squeeze(rel_preds, axis=-1) + return arc_preds, rel_preds + + +def pad_sequence(sequences, padding_value=0, fix_len=None): + """Fill sequences(np.ndarray) into a fixed-length matrix.""" + max_size = sequences[0].shape + trailing_dims = max_size[1:] + max_len = max([s.shape[0] for s in sequences]) + if fix_len is not None: + assert fix_len >= max_len, "fix_len is too small." + max_len = fix_len + out_dims = (len(sequences), max_len) + trailing_dims + out_tensor = np.full(out_dims, padding_value, dtype=sequences[0].dtype) + for i, tensor in enumerate(sequences): + length = tensor.shape[0] + out_tensor[i, :length, ...] = tensor + return out_tensor + + +def pad_sequence_paddle(inputs, lens, pad_index=0): + sequences = [] + idx = 0 + for l in lens: + sequences.append(np.array(inputs[idx : idx + l])) + idx += l + outputs = Pad(pad_val=pad_index)(sequences) + output_tensor = paddle.to_tensor(outputs) + return output_tensor + + +def fill_diagonal(x, value, offset=0, dim1=0, dim2=1): + """Fill value into the diagoanl of x that offset is ${offset} and the coordinate system is (dim1, dim2).""" + strides = x.strides + shape = x.shape + if dim1 > dim2: + dim1, dim2 = dim2, dim1 + assert 0 <= dim1 < dim2 <= 2 + assert len(x.shape) == 3 + assert shape[dim1] == shape[dim2] + + dim_sum = dim1 + dim2 + dim3 = 3 - dim_sum + if offset >= 0: + diagonal = np.lib.stride_tricks.as_strided( + x[:, offset:] if dim_sum == 1 else x[:, :, offset:], + shape=(shape[dim3], shape[dim1] - offset), + strides=(strides[dim3], strides[dim1] + strides[dim2]), + ) + else: + diagonal = np.lib.stride_tricks.as_strided( + x[-offset:, :] if dim_sum in [1, 2] else x[:, -offset:], + shape=(shape[dim3], shape[dim1] + offset), + strides=(strides[dim3], strides[dim1] + strides[dim2]), + ) + + diagonal[...] = value + return x + + +def backtrack(p_i, p_c, heads, i, j, complete): + """Backtrack the position matrix of eisner to generate the tree""" + if i == j: + return + if complete: + r = p_c[i, j] + backtrack(p_i, p_c, heads, i, r, False) + backtrack(p_i, p_c, heads, r, j, True) + else: + r, heads[j] = p_i[i, j], i + i, j = sorted((i, j)) + backtrack(p_i, p_c, heads, i, r, True) + backtrack(p_i, p_c, heads, j, r + 1, True) + + +def stripe(x, n, w, offset=(0, 0), dim=1): + """ + Returns a diagonal stripe of the tensor. + + Args: + x (Tensor): the input tensor with 2 or more dims. + n (int): the length of the stripe. + w (int): the width of the stripe. + offset (tuple): the offset of the first two dims. + dim (int): 0 if returns a horizontal stripe; 1 else. 
+ + Example: + >>> x = np.arange(25).reshape(5, 5) + >>> x + tensor([[ 0, 1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, 13, 14], + [15, 16, 17, 18, 19], + [20, 21, 22, 23, 24]]) + >>> stripe(x, 2, 3, (1, 1)) + tensor([[ 6, 7, 8], + [12, 13, 14]]) + >>> stripe(x, 2, 3, dim=0) + tensor([[ 0, 5, 10], + [ 6, 11, 16]]) + """ + if not x.flags["C_CONTIGUOUS"]: + x = np.ascontiguousarray(x) + strides = x.strides + m = strides[0] + strides[1] + k = strides[1] if dim == 1 else strides[0] + return np.lib.stride_tricks.as_strided( + x[offset[0] :, offset[1] :], shape=[n, w] + list(x.shape[2:]), strides=[m, k] + list(strides[2:]) + ) + + +def flat_words(words, pad_index=0): + mask = words != pad_index + lens = paddle.sum(paddle.cast(mask, "int64"), axis=-1) + position = paddle.cumsum(lens + paddle.cast((lens == 0), "int64"), axis=1) - 1 + select = paddle.nonzero(mask) + words = paddle.gather_nd(words, select) + lens = paddle.sum(lens, axis=-1) + words = pad_sequence_paddle(words, lens, pad_index) + max_len = words.shape[1] + position = mask_fill(position, position >= max_len, max_len - 1) + return words, position + + +def index_sample(x, index): + """ + Select input value according to index + + Arags: + input: input matrix + index: index matrix + Returns: + output + >>> input + [ + [1, 2, 3], + [4, 5, 6] + ] + >>> index + [ + [1, 2], + [0, 1] + ] + >>> index_sample(input, index) + [ + [2, 3], + [4, 5] + ] + """ + x_s = x.shape + dim = len(index.shape) - 1 + assert x_s[:dim] == index.shape[:dim] + + if len(x_s) == 3 and dim == 1: + r_x = paddle.reshape(x, shape=[-1, x_s[1], x_s[-1]]) + else: + r_x = paddle.reshape(x, shape=[-1, x_s[-1]]) + + index = paddle.reshape(index, shape=[len(r_x), -1, 1]) + # Generate arange index, shape like index + arr_index = paddle.arange(start=0, end=len(index), dtype=index.dtype) + arr_index = paddle.unsqueeze(arr_index, axis=[1, 2]) + arr_index = paddle.expand(arr_index, index.shape) + # Genrate new index + new_index = paddle.concat((arr_index, index), -1) + new_index = paddle.reshape(new_index, (-1, 2)) + # Get output + out = paddle.gather_nd(r_x, new_index) + if len(x_s) == 3 and dim == 2: + out = paddle.reshape(out, shape=[x_s[0], x_s[1], -1]) + else: + out = paddle.reshape(out, shape=[x_s[0], -1]) + return out + + +def mask_fill(input, mask, value): + """ + Fill value to input according to mask + + Args: + input: input matrix + mask: mask matrix + value: Fill value + + Returns: + output + + >>> input + [ + [1, 2, 3], + [4, 5, 6] + ] + >>> mask + [ + [True, True, False], + [True, False, False] + ] + >>> mask_fill(input, mask, 0) + [ + [1, 2, 0], + [4, 0, 0] + ] + """ + return input * paddle.logical_not(mask) + paddle.cast(mask, input.dtype) * value + + +def kmeans(x, k): + """ + kmeans algorithm, put sentence id into k buckets according to sentence length + + Args: + x: list, sentence length + k: int, k clusters + + Returns: + centroids: list, center point of k clusters + clusters: list(tuple), k clusters + """ + x = np.array(x, dtype=np.float32) + # Count the frequency of each datapoint + d, indices, f = np.unique(x, return_inverse=True, return_counts=True) + # Calculate the sum of the values of the same datapoints + total = d * f + # Initialize k centroids randomly + c, old = d[np.random.permutation(len(d))[:k]], None + # Assign labels to each datapoint based on centroids + dists_abs = np.absolute(d[..., np.newaxis] - c) + dists, y = dists_abs.min(axis=-1), dists_abs.argmin(axis=-1) + # The number of clusters must not be greater than that of datapoints + k = 
min(len(d), k) + + while old is None or not np.equal(c, old).all(): + # If an empty cluster is encountered, + # choose the farthest datapoint from the biggest cluster + # and move that the empty one + for i in range(k): + if not np.equal(y, i).any(): + # mask.shape=[k, n] + mask = y == np.arange(k)[..., np.newaxis] + lens = mask.sum(axis=-1) + biggest = mask[lens.argmax()].nonzero()[0] + farthest = dists[biggest].argmax() + y[biggest[farthest]] = i + mask = y == np.arange(k)[..., np.newaxis] + # Update the centroids + c, old = (total * mask).sum(-1) / (f * mask).sum(-1), c + # Re-assign all datapoints to clusters + dists_abs = np.absolute(d[..., np.newaxis] - c) + dists, y = dists_abs.min(axis=-1), dists_abs.argmin(axis=-1) + # Assign all datapoints to the new-generated clusters without considering the empty ones + y, assigned = y[indices], np.unique(y).tolist() + # Get the centroids of the assigned clusters + centroids = c[assigned].tolist() + # Map all values of datapoints to buckets + clusters = [np.equal(y, i).nonzero()[0].tolist() for i in assigned] + + return centroids, clusters + + +def eisner(scores, mask): + """Eisner algorithm is a general dynamic programming decoding algorithm for bilexical grammar. + + Args: + scores: Adjacency matrix,shape=(batch, seq_len, seq_len) + mask: mask matrix,shape=(batch, sql_len) + + Returns: + output,shape=(batch, seq_len),the index of the parent node corresponding to the token in the query + + """ + lens = mask.sum(1) + batch_size, seq_len, _ = scores.shape + scores = scores.transpose(2, 1, 0) + # Score for incomplete span + s_i = np.full_like(scores, float("-inf")) + # Score for complete span + s_c = np.full_like(scores, float("-inf")) + # Incompelte span position for backtrack + p_i = np.zeros((seq_len, seq_len, batch_size), dtype=np.int64) + # Compelte span position for backtrack + p_c = np.zeros((seq_len, seq_len, batch_size), dtype=np.int64) + # Set 0 to s_c.diagonal + s_c = fill_diagonal(s_c, 0) + # Contiguous + s_c = np.ascontiguousarray(s_c) + s_i = np.ascontiguousarray(s_i) + for w in range(1, seq_len): + n = seq_len - w + starts = np.arange(n, dtype=np.int64)[np.newaxis, :] + # ilr = C(i->r) + C(j->r+1) + ilr = stripe(s_c, n, w) + stripe(s_c, n, w, (w, 1)) + # Shape: (batch_size, n, w) + ilr = ilr.transpose(2, 0, 1) + # scores.diagonal(-w).shape:[batch, n] + il = ilr + scores.diagonal(-w)[..., np.newaxis] + # I(j->i) = max(C(i->r) + C(j->r+1) + s(j->i)), i <= r < j + il_span, il_path = il.max(-1), il.argmax(-1) + s_i = fill_diagonal(s_i, il_span, offset=-w) + p_i = fill_diagonal(p_i, il_path + starts, offset=-w) + + ir = ilr + scores.diagonal(w)[..., np.newaxis] + # I(i->j) = max(C(i->r) + C(j->r+1) + s(i->j)), i <= r < j + ir_span, ir_path = ir.max(-1), ir.argmax(-1) + s_i = fill_diagonal(s_i, ir_span, offset=w) + p_i = fill_diagonal(p_i, ir_path + starts, offset=w) + + # C(j->i) = max(C(r->i) + I(j->r)), i <= r < j + cl = stripe(s_c, n, w, (0, 0), 0) + stripe(s_i, n, w, (w, 0)) + cl = cl.transpose(2, 0, 1) + cl_span, cl_path = cl.max(-1), cl.argmax(-1) + s_c = fill_diagonal(s_c, cl_span, offset=-w) + p_c = fill_diagonal(p_c, cl_path + starts, offset=-w) + + # C(i->j) = max(I(i->r) + C(r->j)), i < r <= j + cr = stripe(s_i, n, w, (0, 1)) + stripe(s_c, n, w, (1, w), 0) + cr = cr.transpose(2, 0, 1) + cr_span, cr_path = cr.max(-1), cr.argmax(-1) + s_c = fill_diagonal(s_c, cr_span, offset=w) + s_c[0, w][np.not_equal(lens, w)] = float("-inf") + p_c = fill_diagonal(p_c, cr_path + starts + 1, offset=w) + + predicts = [] + p_c = p_c.transpose(2, 
0, 1) + p_i = p_i.transpose(2, 0, 1) + for i, length in enumerate(lens.tolist()): + heads = np.ones(length + 1, dtype=np.int64) + backtrack(p_i[i], p_c[i], heads, 0, length, True) + predicts.append(heads) + + return pad_sequence(predicts, fix_len=seq_len) + + +class Node: + """Node class""" + + def __init__(self, id=None, parent=None): + self.lefts = [] + self.rights = [] + self.id = int(id) + self.parent = parent if parent is None else int(parent) + + +class DepTree: + """ + DepTree class, used to check whether the prediction result is a project Tree. + A projective tree means that you can project the tree without crossing arcs. + """ + + def __init__(self, sentence): + # set root head to -1 + sentence = copy.deepcopy(sentence) + sentence[0] = -1 + self.sentence = sentence + self.build_tree() + self.visit = [False] * len(sentence) + + def build_tree(self): + """Build the tree""" + self.nodes = [Node(index, p_index) for index, p_index in enumerate(self.sentence)] + # set root + self.root = self.nodes[0] + for node in self.nodes[1:]: + self.add(self.nodes[node.parent], node) + + def add(self, parent, child): + """Add a child node""" + if parent.id is None or child.id is None: + raise Exception("id is None") + if parent.id < child.id: + parent.rights = sorted(parent.rights + [child.id]) + else: + parent.lefts = sorted(parent.lefts + [child.id]) + + def judge_legal(self): + """Determine whether it is a project tree""" + target_seq = list(range(len(self.nodes))) + if len(self.root.lefts + self.root.rights) != 1: + return False + cur_seq = self.inorder_traversal(self.root) + if target_seq != cur_seq: + return False + else: + return True + + def inorder_traversal(self, node): + """Inorder traversal""" + if self.visit[node.id]: + return [] + self.visit[node.id] = True + lf_list = [] + rf_list = [] + for ln in node.lefts: + lf_list += self.inorder_traversal(self.nodes[ln]) + for rn in node.rights: + rf_list += self.inorder_traversal(self.nodes[rn]) + + return lf_list + [node.id] + rf_list + + +def istree(sequence): + """Is the sequence a project tree""" + return DepTree(sequence).judge_legal() diff --git a/examples/dialogue/dgu/README.md b/examples/dialogue/dgu/README.md new file mode 100644 index 0000000000000000000000000000000000000000..90a9f57ec209ab1c67abd57be9b2032287f1c5bf --- /dev/null +++ b/examples/dialogue/dgu/README.md @@ -0,0 +1,143 @@ +# 对话通用理解模型 (DGU, Dialogue General Understanding) + +## 模型简介 + +对话系统 (Dialogue System) 常常需要根据应用场景的变化去解决多种多样的任务。任务的多样性(意图识别、槽填充、行为识别、状态追踪等等),以及领域训练数据的稀少,给Dialogue System的研究和应用带来了巨大的困难和挑战,要使得Dialogue System得到更好的发展,需要开发一个通用的对话理解模型。为此,我们给出了基于BERT的对话通用理解模型 (DGU: Dialogue General Understanding),通过实验表明,使用base-model (BERT)并结合常见的学习范式,就可以在几乎全部对话理解任务上取得比肩甚至超越各个领域业内最好的模型的效果,展现了学习一个通用对话理解模型的巨大潜力。 + +DGU模型内共包含6个任务,全部基于公开数据集在Paddle2.0上完成训练及评估,详细说明如下: + +``` +udc: 使用UDC (Ubuntu Corpus V1) 数据集完成对话匹配 (Dialogue Response Selection) 任务; +dstc2: 使用DSTC2 (Dialog State Tracking Challenge 2) 数据集完成对话状态追踪 (Dialogue State Tracking) 任务; +atis_slot: 使用ATIS (Airline Travel Information System) 数据集完成对话槽填充 (Dialogue Slot Filling) 任务; +atis_intent: 使用ATIS (Airline Travel Information System) 数据集完成对话意图识别 (Dialogue Intent Detection) 任务; +mrda: 使用MRDAC (Meeting Recorder Dialogue Act Corpus) 数据集完成对话行为识别 (Dialogue Act Detection) 任务; +swda: 使用SwDAC (Switchboard Dialogue Act Corpus) 数据集完成对话行为识别 (Dialogue Act Detection) 任务; +``` + +## 模型效果 + +DGU模型中的6个任务,分别采用不同的评估指标在test集上进行评估,结果如下: + + + + + + + + + + + +
+| 任务 | 评估指标 | DGU |
+| :---------: | :-------: | :----: |
+| udc | R1@10 | 81.04% |
+| udc | R2@10 | 89.85% |
+| udc | R5@10 | 97.59% |
+| dstc2 | Joint_Acc | 90.43% |
+| atis_slot | F1_Micro | 97.98% |
+| atis_intent | Acc | 97.42% |
+| mrda | Acc | 90.94% |
+| swda | Acc | 80.61% |
+ +**NOTE:** 以上结果均是采用默认配置在GPU单卡上训练和评估得到的,用户如需复现效果,可采用默认配置在单卡上进行训练评估。 + +## 快速开始 + +### 数据准备 + +下载数据集压缩包并解压后,DGU_datasets目录下共存在6个目录,分别对应每个任务的训练集train.txt、评估集dev.txt和测试集test.txt。 + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/DGU_datasets.tar.gz +tar -zxf DGU_datasets.tar.gz +``` + +DGU_datasets目录结构: + +```text +DGU_datasets/ +├── atis_intent +│   ├── dev.txt +│   ├── map_tag_intent_id.txt +│   ├── test.txt +│   └── train.txt +├── udc +│   ├── dev.txt +│   ├── dev.txt-small +│   ├── test.txt +│   └── train.txt +├── atis_slot +│   ├── dev.txt +│   ├── map_tag_slot_id.txt +│   ├── test.txt +│   └── train.txt +├── dstc2 +│   ├── dev.txt +│   ├── map_tag_id.txt +│   ├── test.txt +│   └── train.txt +├── mrda +│   ├── dev.txt +│   ├── map_tag_id.txt +│   ├── test.txt +│   └── train.txt +└── swda + ├── dev.txt + ├── map_tag_id.txt + ├── test.txt + └── train.txt +``` + +数据的每一行由多列组成,都以"\t"作为分割符,详细数据格式说明如下: + +``` +udc:由label、多轮对话conv和回应response组成 +格式:label \t conv1 \t conv2 \t conv3 \t ... \t response + +dstc2:由多轮对话id、当前轮QA对(使用\1拼接)和对话状态序列state_list(state_list中每个state由空格分割)组成 +格式:conversation_id \t question \1 answer \t state1 state2 state3 ... + +atis_slot:由对话内容conversation_content和标签序列label_list (label_list中每个label由空格分割) 组成, 其中标签序列和对话内容中word为一一对应关系 +格式:conversation_content \t label1 label2 label3 ... + +atis_intent:由标签label和对话内容conversation_content组成 +格式: label \t conversation_content + +mrda:由多轮对话id、标签label、发言人caller、对话内容conversation_content组成 +格式:conversation_id \t label \t caller \t conversation_content + +swda:由多轮对话id、标签label、发言人caller、对话内容conversation_content组成 +格式:conversation_id \t label \t caller \t conversation_content +``` + +**NOTE:** 上述数据集来自于 [Paddle1.8静态图版本](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/dialogue_system/dialogue_general_understanding),是由相应的开源数据集经过数据格式转换而得来的,本项目中暂未包含数据格式转换脚本,细节请参考 [Paddle1.8静态图版本](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/dialogue_system/dialogue_general_understanding)。 + +### 模型训练 + +运行如下命令即可在训练集 (train.tsv) 上进行模型训练,并在开发集 (dev.tsv) 验证,训练结束后会在测试集 (test.txt) 上进行模型评估 + +```shell +# GPU启动,gpus指定训练所用的GPU卡号,可以是单卡,也可以多卡。默认会进行训练、验证和评估 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir ./log main.py --task_name=udc --data_dir=./DGU_datasets/udc --output_dir=./checkpoints/udc --device=gpu +# 若只需进行评估,do_train设为False,并且必须指定init_from_ckpt +# python -m paddle.distributed.launch --gpus "0" --log_dir ./log main.py --task_name=udc --data_dir=./DGU_datasets/udc --do_train=False --init_from_ckpt=./checkpoints/udc/best --device=gpu +``` + +以上参数表示: + +* `task_name`:任务名称,可以为udc、dstc2、atis_slot、atis_intent、mrda或swda。 +* `data_dir`:训练数据路径。 +* `output_dir`:训练保存模型的文件路径。 +* `do_train:是否进行训练,默认为`True`。 +* `init_from_ckpt`:恢复模型参数的路径。 +* `device`:表示训练使用的设备。 + +其他可选参数和参数的默认值请参考`args.py`。 + +程序运行时将会自动进行训练,验证和评估。同时训练过程中会自动保存模型在指定的`output_dir`中。 +如: +```text +checkpoints/ +├── 1000.pdopt +├── 1000.pdparams +├── 2000.pdopt +├── 2000.pdparams +├── ... +├── best.pdopt +└── best.pdparams +``` + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/1000`即可,程序会自动加载模型参数`checkpoints/1000.pdparams`,也会自动加载优化器状态`checkpoints/1000.pdopt`。 diff --git a/examples/dialogue/dgu/args.py b/examples/dialogue/dgu/args.py new file mode 100644 index 0000000000000000000000000000000000000000..4139474c906b47b7b690b9d5eab32c14861cb29b --- /dev/null +++ b/examples/dialogue/dgu/args.py @@ -0,0 +1,118 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--task_name", default=None, type=str, required=True, help="The name of the task to train.") + parser.add_argument("--model_name_or_path", default='bert-base-uncased', type=str, help="Path to pre-trained bert model or shortcut name.") + parser.add_argument("--output_dir", default=None, type=str, help="The output directory where the checkpoints will be saved.") + parser.add_argument("--data_dir", default=None, type=str, help="The directory where the dataset will be load.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") + parser.add_argument("--max_seq_len", default=None, type=int, help="The maximum total input sequence length after tokenization for trainng. Sequences longer than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--test_max_seq_len", default=None, type=int, help="The maximum total input sequence length after tokenization for testing. Sequences longer than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--batch_size", default=None, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--test_batch_size", default=None, type=int, help="Batch size per GPU/CPU for testing.") + parser.add_argument("--learning_rate", default=None, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--epochs", default=None, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--logging_steps", default=None, type=int, help="Log every X updates steps.") + parser.add_argument("--save_steps", default=None, type=int, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", default=42, type=int, help="Random seed for initialization.") + parser.add_argument("--warmup_proportion", default=0.1, type=float, help="The proportion of warmup.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + parser.add_argument("--do_train", default=True, type=eval, help="Whether training.") + parser.add_argument("--do_eval", default=True, type=eval, help="Whether evaluation.") + parser.add_argument("--do_test", default=True, type=eval, help="Whether testing.") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def set_default_args(args): + args.task_name = args.task_name.lower() + if args.task_name == "udc": + if not args.save_steps: + args.save_steps = 1000 + if not args.logging_steps: + args.logging_steps = 100 + if not args.epochs: + args.epochs = 2 + if not args.max_seq_len: + args.max_seq_len = 210 + if not args.test_batch_size: + 
args.test_batch_size = 100 + elif args.task_name == "dstc2": + if not args.save_steps: + args.save_steps = 400 + if not args.logging_steps: + args.logging_steps = 20 + if not args.epochs: + args.epochs = 40 + if not args.learning_rate: + args.learning_rate = 5e-5 + if not args.max_seq_len: + args.max_seq_len = 256 + if not args.test_max_seq_len: + args.test_max_seq_len = 512 + elif args.task_name == "atis_slot": + if not args.save_steps: + args.save_steps = 100 + if not args.logging_steps: + args.logging_steps = 10 + if not args.epochs: + args.epochs = 50 + elif args.task_name == "atis_intent": + if not args.save_steps: + args.save_steps = 100 + if not args.logging_steps: + args.logging_steps = 10 + if not args.epochs: + args.epochs = 20 + elif args.task_name == "mrda": + if not args.save_steps: + args.save_steps = 500 + if not args.logging_steps: + args.logging_steps = 200 + if not args.epochs: + args.epochs = 7 + elif args.task_name == "swda": + if not args.save_steps: + args.save_steps = 500 + if not args.logging_steps: + args.logging_steps = 200 + if not args.epochs: + args.epochs = 3 + else: + raise ValueError("Not support task: %s." % args.task_name) + + if not args.data_dir: + args.data_dir = "./DGU_datasets/" + args.task_name + if not args.output_dir: + args.output_dir = "./checkpoints/" + args.task_name + if not args.learning_rate: + args.learning_rate = 2e-5 + if not args.batch_size: + args.batch_size = 32 + if not args.test_batch_size: + args.test_batch_size = args.batch_size + if not args.max_seq_len: + args.max_seq_len = 128 + if not args.test_max_seq_len: + args.test_max_seq_len = args.max_seq_len diff --git a/examples/dialogue/dgu/data.py b/examples/dialogue/dgu/data.py new file mode 100644 index 0000000000000000000000000000000000000000..469134f7cfc738a8b15573a9599f15e6307f193d --- /dev/null +++ b/examples/dialogue/dgu/data.py @@ -0,0 +1,509 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import numpy as np +from typing import List + +from paddle.io import Dataset + +# The input data bigin with '[CLS]', using '[SEP]' split conversation content( +# Previous part, current part, following part, etc.). If there are multiple +# conversation in split part, using 'INNER_SEP' to further split. +INNER_SEP = "[unused0]" + + +def get_label_map(label_list): + """Create label maps""" + label_map = {} + for (i, l) in enumerate(label_list): + label_map[l] = i + return label_map + + +class UDCv1(Dataset): + """ + The UDCv1 dataset is using in task Dialogue Response Selection. + The source dataset is UDCv1(Ubuntu Dialogue Corpus v1.0). 
See detail at + http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/ + """ + + MAX_LEN_OF_RESPONSE = 60 + LABEL_MAP = get_label_map(["0", "1"]) + + def __init__(self, data_dir, mode="train"): + super(UDCv1, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt-small") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) < 3: + print("Data format error: %s" % "\t".join(arr)) + print("Data row contains at least three parts: label\tconversation1\t.....\tresponse.") + continue + label = arr[0] + text_a = arr[1:-1] + text_b = arr[-1] + self.data.append([label, text_a, text_b]) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + + def _truncate_and_concat(text_a: List[str], text_b: str, tokenizer, max_seq_length): + tokens_b = tokenizer.tokenize(text_b) + tokens_b = tokens_b[: min(cls.MAX_LEN_OF_RESPONSE, len(tokens_b))] + tokens_a = [] + for text in text_a: + tokens_a.extend(tokenizer.tokenize(text)) + tokens_a.append(INNER_SEP) + tokens_a = tokens_a[:-1] + if len(tokens_a) > max_seq_length - len(tokens_b) - 3: + tokens_a = tokens_a[len(tokens_a) - max_seq_length + len(tokens_b) + 3 :] + tokens, segment_ids = [], [] + tokens.extend([tokenizer.cls_token] + tokens_a + [tokenizer.sep_token]) + segment_ids.extend([0] * len(tokens)) + tokens.extend(tokens_b + [tokenizer.sep_token]) + segment_ids.extend([1] * (len(tokens_b) + 1)) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + return input_ids, segment_ids + + label, text_a, text_b = example + label = np.array([cls.get_label(label)], dtype="int64") + input_ids, segment_ids = _truncate_and_concat(text_a, text_b, tokenizer, max_seq_length) + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class DSTC2(Dataset): + """ + The dataset DSTC2 is using in task Dialogue State Tracking. + The source dataset is DSTC2(Dialog State Tracking Challenges 2). 
See detail at + https://github.com/matthen/dstc + """ + + LABEL_MAP = get_label_map([str(i) for i in range(217)]) + + def __init__(self, data_dir, mode="train"): + super(DSTC2, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + def _concat_dialogues(examples): + """concat multi turns dialogues""" + new_examples = [] + max_turns = 20 + for i in range(len(examples)): + multi_turns = examples[max(i - max_turns, 0) : i + 1] + new_qa = "\1".join([example[0] for example in multi_turns]) + new_examples.append((new_qa.split("\1"), examples[i][1])) + return new_examples + + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + pre_idx = -1 + examples = [] + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 3: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains three parts: id\tquestion\1answer\tlabel1 label2 ...") + continue + idx = arr[0] + qa = arr[1] + label_list = arr[2].split() + if idx != pre_idx: + if idx != 0: + examples = _concat_dialogues(examples) + self.data.extend(examples) + examples = [] + pre_idx = idx + examples.append((qa, label_list)) + if examples: + examples = _concat_dialogues(examples) + self.data.extend(examples) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + + def _truncate_and_concat(texts: List[str], tokenizer, max_seq_length): + tokens = [] + for text in texts: + tokens.extend(tokenizer.tokenize(text)) + tokens.append(INNER_SEP) + tokens = tokens[:-1] + if len(tokens) > max_seq_length - 2: + tokens = tokens[len(tokens) - max_seq_length + 2 :] + tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] + segment_ids = [0] * len(tokens) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + return input_ids, segment_ids + + texts, labels = example + input_ids, segment_ids = _truncate_and_concat(texts, tokenizer, max_seq_length) + labels = [cls.get_label(l) for l in labels] + label = np.zeros(cls.num_classes(), dtype="int64") + for l in labels: + label[l] = 1 + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class ATIS_DSF(Dataset): + """ + The dataset ATIS_DSF is using in task Dialogue Slot Filling. + The source dataset is ATIS(Airline Travel Information System). 
See detail at + https://www.kaggle.com/siddhadev/ms-cntk-atis + """ + + LABEL_MAP = get_label_map([str(i) for i in range(130)]) + + def __init__(self, data_dir, mode="train"): + super(ATIS_DSF, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 2: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains two parts: conversation_content\tlabel1 label2 label3.") + continue + text = arr[0] + label_list = arr[1].split() + self.data.append([text, label_list]) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + text, labels = example + tokens, label_list = [], [] + words = text.split() + assert len(words) == len(labels) + for word, label in zip(words, labels): + piece_words = tokenizer.tokenize(word) + tokens.extend(piece_words) + label = cls.get_label(label) + label_list.extend([label] * len(piece_words)) + if len(tokens) > max_seq_length - 2: + tokens = tokens[len(tokens) - max_seq_length + 2 :] + label_list = label_list[len(tokens) - max_seq_length + 2 :] + tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] + label_list = [0] + label_list + [0] + segment_ids = [0] * len(tokens) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + label = np.array(label_list, dtype="int64") + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class ATIS_DID(Dataset): + """ + The dataset ATIS_ID is using in task Dialogue Intent Detection. + The source dataset is ATIS(Airline Travel Information System). 
See detail at + https://www.kaggle.com/siddhadev/ms-cntk-atis + """ + + LABEL_MAP = get_label_map([str(i) for i in range(26)]) + + def __init__(self, data_dir, mode="train"): + super(ATIS_DID, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 2: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains two parts: label\tconversation_content.") + continue + label = arr[0] + text = arr[1] + self.data.append([label, text]) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + label, text = example + tokens = tokenizer.tokenize(text) + if len(tokens) > max_seq_length - 2: + tokens = tokens[len(tokens) - max_seq_length + 2 :] + tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] + segment_ids = [0] * len(tokens) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + label = np.array([cls.get_label(label)], dtype="int64") + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +def read_da_data(data_dir, mode): + def _concat_dialogues(examples): + """concat multi turns dialogues""" + new_examples = [] + for i in range(len(examples)): + label, caller, text = examples[i] + cur_txt = "%s : %s" % (caller, text) + pre_txt = ["%s : %s" % (item[1], item[2]) for item in examples[max(0, i - 5) : i]] + suf_txt = ["%s : %s" % (item[1], item[2]) for item in examples[i + 1 : min(len(examples), i + 3)]] + sample = [label, pre_txt, cur_txt, suf_txt] + new_examples.append(sample) + return new_examples + + if mode == "train": + data_path = os.path.join(data_dir, "train.txt") + elif mode == "dev": + data_path = os.path.join(data_dir, "dev.txt") + elif mode == "test": + data_path = os.path.join(data_dir, "test.txt") + data = [] + with open(data_path, "r", encoding="utf8") as fin: + pre_idx = -1 + examples = [] + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 4: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains four parts: id\tlabel\tcaller\tconversation_content.") + continue + idx, label, caller, text = arr + if idx != pre_idx: + if idx != 0: + examples = _concat_dialogues(examples) + data.extend(examples) + examples = [] + pre_idx = idx + examples.append((label, caller, text)) + if examples: + examples = _concat_dialogues(examples) + data.extend(examples) + return data + + +def truncate_and_concat( + pre_txt: List[str], cur_txt: str, suf_txt: List[str], tokenizer, max_seq_length, max_len_of_cur_text +): + cur_tokens = tokenizer.tokenize(cur_txt) + cur_tokens = cur_tokens[: min(max_len_of_cur_text, len(cur_tokens))] + pre_tokens = [] + for text in pre_txt: + pre_tokens.extend(tokenizer.tokenize(text)) + pre_tokens.append(INNER_SEP) + pre_tokens = pre_tokens[:-1] + suf_tokens = [] + for text in 
suf_txt: + suf_tokens.extend(tokenizer.tokenize(text)) + suf_tokens.append(INNER_SEP) + suf_tokens = suf_tokens[:-1] + if len(cur_tokens) + len(pre_tokens) + len(suf_tokens) > max_seq_length - 4: + left_num = max_seq_length - 4 - len(cur_tokens) + if len(pre_tokens) > len(suf_tokens): + suf_num = int(left_num / 2) + suf_tokens = suf_tokens[:suf_num] + pre_num = left_num - len(suf_tokens) + pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num) :] + else: + pre_num = int(left_num / 2) + pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num) :] + suf_num = left_num - len(pre_tokens) + suf_tokens = suf_tokens[:suf_num] + tokens, segment_ids = [], [] + tokens.extend([tokenizer.cls_token] + pre_tokens + [tokenizer.sep_token]) + segment_ids.extend([0] * len(tokens)) + tokens.extend(cur_tokens + [tokenizer.sep_token]) + segment_ids.extend([1] * (len(cur_tokens) + 1)) + if suf_tokens: + tokens.extend(suf_tokens + [tokenizer.sep_token]) + segment_ids.extend([0] * (len(suf_tokens) + 1)) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + return input_ids, segment_ids + + +class MRDA(Dataset): + """ + The dataset MRDA is using in task Dialogue Act. + The source dataset is MRDA(Meeting Recorder Dialogue Act). See detail at + https://www.aclweb.org/anthology/W04-2319.pdf + """ + + MAX_LEN_OF_CUR_TEXT = 50 + LABEL_MAP = get_label_map([str(i) for i in range(5)]) + + def __init__(self, data_dir, mode="train"): + super(MRDA, self).__init__() + self.data = read_da_data(data_dir, mode) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + label, pre_txt, cur_txt, suf_txt = example + label = np.array([cls.get_label(label)], dtype="int64") + input_ids, segment_ids = truncate_and_concat( + pre_txt, cur_txt, suf_txt, tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT + ) + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class SwDA(Dataset): + """ + The dataset SwDA is using in task Dialogue Act. + The source dataset is SwDA(Switchboard Dialog Act). See detail at + http://compprag.christopherpotts.net/swda.html + """ + + MAX_LEN_OF_CUR_TEXT = 50 + LABEL_MAP = get_label_map([str(i) for i in range(42)]) + + def __init__(self, data_dir, mode="train"): + super(SwDA, self).__init__() + self.data = read_da_data(data_dir, mode) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + label, pre_txt, cur_txt, suf_txt = example + label = np.array([cls.get_label(label)], dtype="int64") + input_ids, segment_ids = truncate_and_concat( + pre_txt, cur_txt, suf_txt, tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT + ) + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) diff --git a/examples/dialogue/dgu/main.py b/examples/dialogue/dgu/main.py new file mode 100644 index 0000000000000000000000000000000000000000..f5ca4faf457242b57b4b23ee09be242e3404ff0a --- /dev/null +++ b/examples/dialogue/dgu/main.py @@ -0,0 +1,290 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time +import numpy as np +from functools import partial + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddle.optimizer import AdamW +from paddle.metric import Accuracy + +from paddlenlp.datasets import MapDataset +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.transformers import BertTokenizer, BertForSequenceClassification, BertForTokenClassification +from paddlenlp.transformers import LinearDecayWithWarmup + +from args import parse_args, set_default_args +import data +import metric + +TASK_CLASSES = { + "udc": (data.UDCv1, metric.RecallAtK), + "dstc2": (data.DSTC2, metric.JointAccuracy), + "atis_slot": (data.ATIS_DSF, metric.F1Score), + "atis_intent": (data.ATIS_DID, Accuracy), + "mrda": (data.MRDA, Accuracy), + "swda": (data.SwDA, Accuracy), +} + + +def set_seed(seed): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def load_ckpt(args, model, optimizer=None): + if args.init_from_ckpt: + params_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") + model.set_state_dict(params_state_dict) + if optimizer: + opt_state_dict = paddle.load(args.init_from_ckpt + ".pdopt") + optimizer.set_state_dict(opt_state_dict) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + +def save_ckpt(model, optimizer, output_dir, name): + params_path = os.path.join(output_dir, "{}.pdparams".format(name)) + opt_path = os.path.join(output_dir, "{}.pdopt".format(name)) + paddle.save(model.state_dict(), params_path) + paddle.save(optimizer.state_dict(), opt_path) + + +class DGULossFunction(nn.Layer): + def __init__(self, task_name): + super(DGULossFunction, self).__init__() + + self.task_name = task_name + self.loss_fn = self.get_loss_fn() + + def get_loss_fn(self): + if self.task_name in ["udc", "atis_slot", "atis_intent", "mrda", "swda"]: + return F.cross_entropy + elif self.task_name == "dstc2": + return nn.BCEWithLogitsLoss(reduction="sum") + + def forward(self, logits, labels): + if self.task_name in ["udc", "atis_intent", "mrda", "swda"]: + loss = self.loss_fn(logits, labels) + elif self.task_name == "dstc2": + loss = self.loss_fn(logits, paddle.cast(labels, dtype=logits.dtype)) + elif self.task_name == "atis_slot": + labels = paddle.unsqueeze(labels, axis=-1) + loss = self.loss_fn(logits, labels) + return loss + + +def print_logs(args, step, logits, labels, loss, total_time, metric): + if args.task_name in ["udc", "atis_intent", "mrda", "swda"]: + if args.task_name == "udc": + metric = Accuracy() + metric.reset() + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + print("step %d - loss: %.4f - acc: %.4f - %.3fs/step" % (step, loss, acc, total_time / args.logging_steps)) + elif args.task_name == "dstc2": + metric.reset() + metric.update(logits, labels) + joint_acc = 
metric.accumulate() + print( + "step %d - loss: %.4f - joint_acc: %.4f - %.3fs/step" + % (step, loss, joint_acc, total_time / args.logging_steps) + ) + elif args.task_name == "atis_slot": + metric.reset() + metric.update(logits, labels) + f1_micro = metric.accumulate() + print( + "step %d - loss: %.4f - f1_micro: %.4f - %.3fs/step" + % (step, loss, f1_micro, total_time / args.logging_steps) + ) + + +def train(args, model, train_data_loader, dev_data_loader, metric, n_procs, rank): + num_examples = len(train_data_loader) * args.batch_size * n_procs + max_train_steps = args.epochs * len(train_data_loader) + print("\nNum train examples: %d" % num_examples) + print("Max train steps: %d" % max_train_steps) + print("Warmup proportion: %.2f" % args.warmup_proportion) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, max_train_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + loss_fn = DGULossFunction(args.task_name) + + load_ckpt(args, model, optimizer) + + step = 0 + best_metric = 0.0 + total_time = 0.0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for batch in train_data_loader: + step += 1 + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fn(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + print_logs(args, step, logits, labels, loss, total_time, metric) + total_time = 0.0 + if step % args.save_steps == 0 or step == max_train_steps: + if rank == 0: + save_ckpt(model, optimizer, args.output_dir, step) + if args.do_eval: + print("\nEval begin...") + metric_out = evaluation(args, model, dev_data_loader, metric) + if rank == 0 and metric_out > best_metric: + best_metric = metric_out + save_ckpt(model, optimizer, args.output_dir, "best") + print("Best model, step: %d\n" % step) + batch_start_time = time.time() + + +@paddle.no_grad() +def evaluation(args, model, data_loader, metric): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + if args.task_name in ["atis_intent", "mrda", "swda"]: + correct = metric.compute(logits, labels) + metric.update(correct) + else: + metric.update(logits, labels) + model.train() + metric_out = metric.accumulate() + print("Total samples: %d" % (len(data_loader) * args.test_batch_size)) + if args.task_name == "udc": + print("R1@10: %.4f - R2@10: %.4f - R5@10: %.4f\n" % (metric_out[0], metric_out[1], metric_out[2])) + return metric_out[0] + elif args.task_name == "dstc2": + print("Joint_acc: %.4f\n" % metric_out) + return metric_out + elif args.task_name == "atis_slot": + print("F1_micro: %.4f\n" % metric_out) + return metric_out + elif args.task_name in ["atis_intent", "mrda", "swda"]: + print("Acc: %.4f\n" % metric_out) + return metric_out + + +def create_data_loader(args, dataset_class, trans_func, batchify_fn, mode): + dataset = dataset_class(args.data_dir, mode) + dataset = 
MapDataset(dataset).map(trans_func, lazy=True) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.test_batch_size, shuffle=False) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + return data_loader + + +def main(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + rank = dist.get_rank() + if world_size > 1 and args.do_train: + dist.init_parallel_env() + + set_seed(args.seed) + + dataset_class, metric_class = TASK_CLASSES[args.task_name] + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(dataset_class.convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_len) + test_trans_func = partial(dataset_class.convert_example, tokenizer=tokenizer, max_seq_length=args.test_max_seq_len) + metric = metric_class() + + if args.task_name in ("udc", "dstc2", "atis_intent", "mrda", "swda"): + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64"), # label + ): fn(samples) + model = BertForSequenceClassification.from_pretrained( + args.model_name_or_path, num_classes=dataset_class.num_classes() + ) + elif args.task_name == "atis_slot": + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Pad(axis=0, pad_val=0, dtype="int64"), # label + ): fn(samples) + model = BertForTokenClassification.from_pretrained( + args.model_name_or_path, num_classes=dataset_class.num_classes(), dropout=0.0 + ) + if world_size > 1 and args.do_train: + model = paddle.DataParallel(model) + + if args.do_train: + train_data_loader = create_data_loader(args, dataset_class, trans_func, batchify_fn, "train") + if args.do_eval: + dev_data_loader = create_data_loader(args, dataset_class, test_trans_func, batchify_fn, "dev") + else: + dev_data_loader = None + train(args, model, train_data_loader, dev_data_loader, metric, world_size, rank) + + if args.do_test: + if rank == 0: + test_data_loader = create_data_loader(args, dataset_class, test_trans_func, batchify_fn, "test") + if args.do_train: + # If do_eval=True, use best model to evaluate the test data. + # Otherwise, use final model to evaluate the test data. + if args.do_eval: + args.init_from_ckpt = os.path.join(args.output_dir, "best") + load_ckpt(args, model) + else: + if not args.init_from_ckpt: + raise ValueError('"init_from_ckpt" should be set.') + load_ckpt(args, model) + print("\nTest begin...") + evaluation(args, model, test_data_loader, metric) + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + set_default_args(args) + print_args(args) + + main(args) diff --git a/examples/dialogue/dgu/metric.py b/examples/dialogue/dgu/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..b5ef869f768cbc912435539d32fe078ef6447665 --- /dev/null +++ b/examples/dialogue/dgu/metric.py @@ -0,0 +1,245 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn as nn +from paddle.metric import Metric + + +class RecallAtK(Metric): + """ + Recall@K is the fraction of relevant results among the retrieved Top K + results, using to evaluate the performance of Dialogue Response Selection. + + Noted that this class manages the Recall@K score only for binary + classification task. + """ + + def __init__(self, name="Recall@K", *args, **kwargs): + super(RecallAtK, self).__init__(*args, **kwargs) + self._name = name + self.softmax = nn.Softmax() + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.num_sampls = 0 + self.p_at_1_in_10 = 0.0 + self.p_at_2_in_10 = 0.0 + self.p_at_5_in_10 = 0.0 + + def get_p_at_n_in_m(self, data, n, m, idx): + """ + calculate precision in recall n + """ + pos_score = data[idx][0] + curr = data[idx : idx + m] + curr = sorted(curr, key=lambda x: x[0], reverse=True) + if curr[n - 1][0] <= pos_score: + return 1 + return 0 + + def update(self, logits, labels): + """ + Update the states based on the current mini-batch prediction results. + + Args: + logits (Tensor): The predicted value is a Tensor with + shape [batch_size, 2] and type float32 or float64. + labels (Tensor): The ground truth value is a 2D Tensor, + its shape is [batch_size, 1] and type is int64. + """ + probs = self.softmax(logits) + probs = probs.numpy() + labels = labels.numpy() + assert probs.shape[0] == labels.shape[0] + data = [] + for prob, label in zip(probs, labels): + data.append((prob[1], label)) + assert len(data) % 10 == 0 + + length = int(len(data) / 10) + self.num_sampls += length + for i in range(length): + idx = i * 10 + assert data[idx][1] == 1 + self.p_at_1_in_10 += self.get_p_at_n_in_m(data, 1, 10, idx) + self.p_at_2_in_10 += self.get_p_at_n_in_m(data, 2, 10, idx) + self.p_at_5_in_10 += self.get_p_at_n_in_m(data, 5, 10, idx) + + def accumulate(self): + """ + Calculate the final Recall@K. + + Returns: + A list with scaler float: results of the calculated R1@K, R2@K, R5@K. + """ + metrics_out = [ + self.p_at_1_in_10 / self.num_sampls, + self.p_at_2_in_10 / self.num_sampls, + self.p_at_5_in_10 / self.num_sampls, + ] + return metrics_out + + def name(self): + """ + Returns metric name + """ + return self._name + + +class JointAccuracy(Metric): + """ + The joint accuracy rate is used to evaluate the performance of multi-turn + Dialogue State Tracking. For each turn, if and only if all state in + state_list are correctly predicted, the dialog state prediction is + considered correct. And the joint accuracy rate is equal to 1, otherwise + it is equal to 0. + """ + + def __init__(self, name="JointAccuracy", *args, **kwargs): + super(JointAccuracy, self).__init__(*args, **kwargs) + self._name = name + self.sigmoid = nn.Sigmoid() + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.num_samples = 0 + self.correct_joint = 0.0 + + def update(self, logits, labels): + """ + Update the states based on the current mini-batch prediction results. + + Args: + logits (Tensor): The predicted value is a Tensor with + shape [batch_size, num_classes] and type float32 or float64. + labels (Tensor): The ground truth value is a 2D Tensor, + its shape is [batch_size, num_classes] and type is int64. + """ + probs = self.sigmoid(logits) + probs = probs.numpy() + labels = labels.numpy() + assert probs.shape[0] == labels.shape[0] + assert probs.shape[1] == labels.shape[1] + for i in range(probs.shape[0]): + pred, refer = [], [] + for j in range(probs.shape[1]): + if probs[i][j] >= 0.5: + pred.append(j) + if labels[i][j] == 1: + refer.append(j) + if not pred: + pred = [np.argmax(probs[i])] + if pred == refer: + self.correct_joint += 1 + self.num_samples += probs.shape[0] + + def accumulate(self): + """ + Calculate the final JointAccuracy. + + Returns: + A scaler float: results of the calculated JointAccuracy. + """ + joint_acc = self.correct_joint / self.num_samples + return joint_acc + + def name(self): + """ + Returns metric name + """ + return self._name + + +class F1Score(Metric): + """ + F1-score is the harmonic mean of precision and recall. Micro-averaging is + to create a global confusion matrix for all examples, and then calculate + the F1-score. This class is using to evaluate the performance of Dialogue + Slot Filling. + """ + + def __init__(self, name="F1Score", *args, **kwargs): + super(F1Score, self).__init__(*args, **kwargs) + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.tp = {} + self.fn = {} + self.fp = {} + + def update(self, logits, labels): + """ + Update the states based on the current mini-batch prediction results. + + Args: + logits (Tensor): The predicted value is a Tensor with + shape [batch_size, seq_len, num_classes] and type float32 or + float64. + labels (Tensor): The ground truth value is a 2D Tensor, + its shape is [batch_size, seq_len] and type is int64. + """ + probs = paddle.argmax(logits, axis=-1) + probs = probs.numpy() + labels = labels.numpy() + assert probs.shape[0] == labels.shape[0] + assert probs.shape[1] == labels.shape[1] + for i in range(probs.shape[0]): + start, end = 1, probs.shape[1] + while end > start: + if labels[i][end - 1] != 0: + break + end -= 1 + prob, label = probs[i][start:end], labels[i][start:end] + for y_pred, y in zip(prob, label): + if y_pred == y: + self.tp[y] = self.tp.get(y, 0) + 1 + else: + self.fp[y_pred] = self.fp.get(y_pred, 0) + 1 + self.fn[y] = self.fn.get(y, 0) + 1 + + def accumulate(self): + """ + Calculate the final micro F1 score. + + Returns: + A scaler float: results of the calculated micro F1 score. 
+ """ + tp_total = sum(self.tp.values()) + fn_total = sum(self.fn.values()) + fp_total = sum(self.fp.values()) + p_total = float(tp_total) / (tp_total + fp_total) + r_total = float(tp_total) / (tp_total + fn_total) + if p_total + r_total == 0: + return 0 + f1_micro = 2 * p_total * r_total / (p_total + r_total) + return f1_micro + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/examples/dialogue/lic2021_baseline/README.md b/examples/dialogue/lic2021_baseline/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b0354b0f18d0761dbbba67efc5f72a41a2ac027d --- /dev/null +++ b/examples/dialogue/lic2021_baseline/README.md @@ -0,0 +1,146 @@ +# LIC 2021对话比赛baseline + +## 模型简介 + +近年来,人机对话系统受到了学术界和产业界的广泛关注并取得了不错的发展。开放域对话系统旨在建立一个开放域的多轮对话系统,使得机器可以流畅自然地与人进行语言交互,既可以进行日常问候类的闲聊,又可以完成特定功能,以使得开放域对话系统具有实际应用价值,例如进行对话式推荐,或围绕一个主题进行深入的知识对话等。具体的说,开放域对话可以继续拆分为支持不同功能的对话形式,例如对话式推荐,知识对话技术等,如何解决并有效融合以上多个技能面临诸多挑战。 + +LIC 2021对话比赛收集了一系列公开的开放域对话数据并提供了统一的评测方式,旨在为研究人员和开发者提供学术和技术交流的平台,进一步提升开放域对话的研究水平,推动自然语言理解和人工智能领域技术的应用和发展。 + +为了方便参赛者快速了解LIC 2021对话比赛的流程,并快速地参与到比赛中,本项目基于UnifiedTransformer模型提供了一个基础baseline,利用小规模样例数据在预训练模型上完成了微调及预测。参赛者可以针对赛题进行其他改进,例如修改数据预处理方法、修改网络结构、修改训练方式、修改预测的解码方式或对结果的后处理策略等方式提升模型效果。 + +UnifiedTransformer模型的细节可以[参阅论文](https://arxiv.org/abs/2006.16779)。 + +## 快速开始 + +### 环境依赖 + +- sentencepiece + +安装方式:`pip install sentencepiece` + +### 数据准备 + +由于样例数据涉及LIC 2021对话比赛,暂不开放。 +关于数据集及数据集的预处理过程,详见[2021语言与智能技术竞赛:多技能对话](https://aistudio.baidu.com/aistudio/competition/detail/67)及官方提供的基线系统Baselines。 + +模型的输入由3部分组成:词向量token_ids,句向量token_type_ids和位置向量position_ids。本项目的数据集是样例文本经过数据预处理脚本得到的id化的数据集。数据的每一行由3列组成,以";"作为分割符,格式:token_ids;token_type_ids;position_ids。具体细节请参考`data.py`。 + +### 模型训练 + +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir ./log finetune.py \ + --model_name_or_path=unified_transformer-12L-cn \ + --train_data_path=./datasets/train.txt \ + --valid_data_path=./datasets/valid.txt \ + --save_dir=./checkpoints \ + --logging_steps=500 \ + --save_steps=8000 \ + --seed=2021 \ + --epochs=10 \ + --batch_size=8192 \ + --lr=1e-5 \ + --weight_decay=0.01 \ + --warmup_steps=4000 \ + --max_grad_norm=0.1 \ + --sort_pool_size=65536 \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `train_data_path` 表示训练集文件路径。 +- `valid_data_path` 表示验证集文件路径。 +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `lr` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_steps` 表示学习率逐渐升高到基础学习率(即上面配置的lr)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_grad_norm` 表示梯度裁剪允许的最大梯度值。 +- `sort_pool_size` 表示在构建batch数据时,用来排序的pool size。 +- `device` 表示训练使用的设备。 + +参数详情和参数的默认值请参考`args.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./checkpoints/ +├── model_8000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── spm.model +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +### 模型预测 + +运行如下命令即可在样例测试集上进行测试 + +```shell +export CUDA_VISIBLE_DEVICES=0 +# GPU启动,预测仅支持单卡 +python infer.py \ + --model_name_or_path=./checkpoints/model_80000 \ + --test_data_path=./datasets/test.txt \ + --output_path=./predict.txt \ + --logging_steps=500 \ + --seed=2021 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_samples=20 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `test_data_path` 表示预测集文件路径。 +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `seed` 表示随机数生成器的种子。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `min_dec_len` 表示预测生成的句子的最小长度。 +- `max_dec_len` 表示预测生成的句子的最大长度。 +- `num_samples` 表示每条样本生成的句子的数量。对于每条样本,模型会生成`num_samples`个句子,根据每个句子的概率得分进行排序,得分最高的句子作为最终的生成结果。 +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 +- `device` 表示训练使用的设备。 + +参数详情和参数的默认值请参考`args.py`。 + +程序运行结束后会将预测结果保存在`output_path`中。将预测结果准备成比赛官网要求的格式,提交评估即可得评估结果。 + +采用不同的模型在样例测试集上有如下结果: + +| model_name_or_path | F1 | BLEU1 / BLEU2 | DISTINCT1 / DISTINCT2 | +| :-----------------------------: | :---: | :-----------: | :-------------------: | +| unified_transformer-12L-cn | 10.62 | 0.070 / 0.022 | 0.065 / 0.304 | +| unified_transformer-12L-cn-luge | 33.11 | 0.245 / 0.157 | 0.074 / 0.238 | +| ./checkpoints/model_80000 | 32.38 | 0.239 / 0.150 | 0.070 / 0.219 | diff --git a/examples/dialogue/lic2021_baseline/args.py b/examples/dialogue/lic2021_baseline/args.py new file mode 100644 index 0000000000000000000000000000000000000000..cc08c7d618f6e3bdf93fb410adeaca63184b7966 --- /dev/null +++ b/examples/dialogue/lic2021_baseline/args.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
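+"""Argument definitions for the LIC 2021 dialogue baseline (shared by finetune.py and infer.py)."""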
+ +import argparse + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument('--train_data_path', type=str, default='./datasets/train.txt', help='Specify the path to load train data.') + parser.add_argument('--valid_data_path', type=str, default='./datasets/valid.txt', help='Specify the path to load valid data.') + parser.add_argument('--test_data_path', type=str, default='./datasets/test.txt', help='Specify the path to load test data.') + parser.add_argument('--logging_steps', type=int, default=500, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=8000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=8192, required=True, help='Batch size per GPU/CPU for training.') + parser.add_argument('--lr', type=float, default=1e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=10, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_steps', type=int, default=4000, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=0.1, help='The max value of grad norm.') + parser.add_argument('--sort_pool_size', type=int, default=65536, help='The pool size for sort in build batch data.') + parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') + parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') + parser.add_argument('--num_samples', type=int, default=1, help='The decode numbers in generation.') + parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument('--device', type=str, default='gpu', help='Device for selecting for the training.') + + args = parser.parse_args() + return args +# yapf: enable + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + 
print("------------------------------------------------") diff --git a/examples/dialogue/lic2021_baseline/data.py b/examples/dialogue/lic2021_baseline/data.py new file mode 100644 index 0000000000000000000000000000000000000000..6bc938d1b72902aa3504b02d42dfb7114c3c9bc4 --- /dev/null +++ b/examples/dialogue/lic2021_baseline/data.py @@ -0,0 +1,258 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from glob import glob + +import numpy as np +import paddle.distributed as dist +from paddle.io import IterableDataset + +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + + +class DialogueDataset(IterableDataset): + def __init__( + self, + filepattern, + batch_size, + pad_token_id, + bos_token_id, + sort_pool_size=2**16, + seed=1, + n_procs=None, + rank=None, + mode="test", + ): + super(DialogueDataset, self).__init__() + + self.file_list = glob(filepattern) + self.sort_pool_size = 0 if mode == "test" else sort_pool_size + self.n_procs = n_procs if n_procs else dist.get_world_size() + self.rank = rank if rank else dist.get_rank() + self.batch_size = batch_size * self.n_procs + self.shuffle = True if mode == "train" else False + self.mode = mode + self.pad_id = pad_token_id + self.bos_id = bos_token_id + self.global_rng = np.random.RandomState(seed) + + assert len(self.file_list) > 0, "There is no files in %s." 
% filepattern + + def load_file(self, file_path): + with open(file_path, "r", encoding="utf-8") as fin: + for i, line in enumerate(fin): + cols = convert_to_unicode(line).strip().split(";") + cols = list(map(lambda x: list(map(int, x.split(" "))), cols)) + if len(cols) > 3: + cols = cols[:3] + token_ids, type_ids, pos_ids = cols + if self.mode == "test": + tgt_start_idx = len(cols[0]) + else: + tgt_start_idx = token_ids.index(self.bos_id, 1) + sample = [token_ids, type_ids, pos_ids, tgt_start_idx] + yield sample + + def get_sorted_batch(self, pool): + """Generate sorted batches from pool.""" + pool = sorted(pool, key=lambda sample: len(sample[0])) + batches = [] + batch, max_len = [], 0 + for sample in pool: + max_len = max(max_len, len(sample[0])) + if self.mode == "test": + to_append = len(batch) < self.batch_size + else: + to_append = (len(batch) + 1) * max_len <= self.batch_size + if to_append: + batch.append(sample) + else: + batches.append(batch) + batch, max_len = [sample], len(sample[0]) + if len(batch) > 0: + batches.append(batch) + if self.shuffle: + self.global_rng.shuffle(batches) + for batch in batches: + yield batch + + @property + def get_batch(self): + all_files = list(self.file_list) + if self.shuffle: + self.global_rng.shuffle(all_files) + if self.sort_pool_size > 0: + pool = [] + for file_path in all_files: + for sample in self.load_file(file_path): + pool.append(sample) + if len(pool) == self.sort_pool_size: + for batch in self.get_sorted_batch(pool): + yield batch + pool = [] + if len(pool) > 0: + for batch in self.get_sorted_batch(pool): + yield batch + else: + batch, max_len = [], 0 + for file_path in all_files: + for sample in self.load_file(file_path): + max_len = max(max_len, len(sample[0])) + if self.mode == "test": + to_append = len(batch) < self.batch_size + else: + to_append = (len(batch) + 1) * max_len <= self.batch_size + if to_append: + batch.append(sample) + else: + yield batch + batch, max_len = [sample], len(sample[0]) + if len(batch) > 0: + yield batch + + def pad_batch_data(self, batch): + """Pad the instances to the max sequence length in batch.""" + max_len = max(map(len, batch)) + batch_data = np.array([list(data) + [self.pad_id] * (max_len - len(data)) for data in batch], dtype="int64") + return batch_data + + def gen_tgt_label_and_pos(self, batch_token_ids, batch_tgt_start_idx): + max_len = max(map(len, batch_token_ids)) + tgt_label = [] + tgt_pos = [] + for sent_index, sent in enumerate(batch_token_ids): + sent_b_index = batch_tgt_start_idx[sent_index] + tgt_label.extend(sent[sent_b_index + 1 :]) + tgt_pos.extend([sent_index * max_len + i for i in range(sent_b_index, len(sent) - 1)]) + tgt_label = np.array(tgt_label).astype("int64") + tgt_pos = np.array(tgt_pos).astype("int64") + + return tgt_label, tgt_pos + + def gen_self_attn_mask(self, batch_token_ids, batch_tgt_start_idx): + max_len = max(map(len, batch_token_ids)) + input_mask_data = np.zeros((len(batch_token_ids), max_len, max_len)) + for index, mask_data in enumerate(input_mask_data): + start = batch_tgt_start_idx[index] + end = len(batch_token_ids[index]) + mask_data[:end, :start] = 1.0 + # Generate the lower triangular matrix using the slice of matrix + b = np.tril(np.ones([end - start, end - start]), 0) + mask_data[start:end, start:end] = b + return input_mask_data.astype("float32") + + def __iter__(self): + for batch_data in self.get_batch: + # sample [token_ids, type_ids, pos_ids, tgt_start_idx] + # raw_batch [sample0, sample1, ...] 
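+            # The raw batch was assembled with a budget of batch_size * n_procs
+            # (see __init__), so each process keeps every n_procs-th sample
+            # starting at its own rank, giving disjoint shards across GPUs.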
+ if self.n_procs > 1: + batch_data = batch_data[self.rank :: self.n_procs] + batch_data = zip(*batch_data) + token_ids, type_ids, pos_ids, tgt_start_idx = batch_data + + pad_token_ids = self.pad_batch_data(token_ids) + pad_type_ids = self.pad_batch_data(type_ids) + pad_pos_ids = self.pad_batch_data(pos_ids) + + generation_mask = self.gen_self_attn_mask(token_ids, tgt_start_idx) + + if self.mode == "test": + # [batch_size, 1] + tgt_ids = np.array([[self.bos_id]] * len(token_ids), dtype="int64") + tgt_type = np.ones((len(token_ids), 1), dtype="int64") + tgt_pos = np.array(tgt_start_idx, dtype="int64").reshape(-1, 1) + tgt_generation_mask = generation_mask[:, 0:1, :].astype("float32") + + pad_token_ids = np.concatenate((pad_token_ids, tgt_ids), axis=1) + pad_type_ids = np.concatenate((pad_type_ids, tgt_type), axis=1) + pad_pos_ids = np.concatenate((pad_pos_ids, tgt_pos), axis=1) + generation_mask = np.concatenate((generation_mask, tgt_generation_mask), axis=1) + + append_mask = np.zeros((generation_mask.shape[0], generation_mask.shape[1], 1), dtype="float32") + append_mask[:, -1, :] = 1.0 + generation_mask = np.concatenate((generation_mask, append_mask), axis=2) + generation_mask = (generation_mask - 1.0) * 1e9 + generation_mask = np.expand_dims(generation_mask, axis=1) + yield (pad_token_ids, pad_type_ids, pad_pos_ids, generation_mask) + else: + tgt_label, tgt_pos = self.gen_tgt_label_and_pos(token_ids, tgt_start_idx) + generation_mask = (generation_mask - 1.0) * 1e9 + generation_mask = np.expand_dims(generation_mask, axis=1) + yield (pad_token_ids, pad_type_ids, pad_pos_ids, generation_mask, tgt_label, tgt_pos) + + +def post_process_response(token_ids, tokenizer): + """ + Post-process the decoded sequence. Truncate from the first + and remove the and tokens currently. 
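+    (i.e. cut the sequence at the first sep/eos token, convert the remaining
+    ids back to tokens and merge subwords into the response string.)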
+ """ + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + response = tokenizer.merge_subword(tokens) + return token_ids, response + + +def get_in_turn_repetition(pred, is_cn=False): + """Get in-turn repetition.""" + if len(pred) == 0: + return 1.0 + if isinstance(pred[0], str): + pred = [tok.lower() for tok in pred] + if is_cn: + pred = "".join(pred) + tri_grams = set() + for i in range(len(pred) - 2): + tri_gram = tuple(pred[i : i + 3]) + if tri_gram in tri_grams: + return True + tri_grams.add(tri_gram) + return False + + +def select_response(ids, scores, tokenizer, max_dec_len=None, num_samples=1): + ids = ids.numpy().tolist() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_samples) != 0: + raise ValueError("the length of `ids` is {}, but the `num_samples` is {}".format(len(ids), num_samples)) + + group = [] + tmp = [] + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) + num_token = len(pred_token_ids) + response = " ".join(pred_tokens) + + in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + elif in_turn_repetition: + score -= 1e3 + + tmp.append([response, score]) + if len(tmp) == num_samples: + group.append(tmp) + tmp = [] + + results = [] + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + return results diff --git a/examples/dialogue/lic2021_baseline/finetune.py b/examples/dialogue/lic2021_baseline/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..8f74d75ef84bc06e6a86cf79c246ffdbee114e9b --- /dev/null +++ b/examples/dialogue/lic2021_baseline/finetune.py @@ -0,0 +1,149 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
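+"""Finetuning entry for the LIC 2021 dialogue baseline (UnifiedTransformer)."""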
+ +import math +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn as nn +import paddle.nn.functional as F +from args import parse_args, print_args +from data import DialogueDataset +from paddle.io import DataLoader +from paddle.optimizer import AdamW +from paddle.optimizer.lr import NoamDecay + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def main(args): + paddle.set_device(args.device) + paddle.seed(args.seed) + world_size = dist.get_world_size() + if world_size > 1: + dist.init_parallel_env() + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + if world_size > 1: + model = paddle.DataParallel(model) + + train_dataset = DialogueDataset( + args.train_data_path, + args.batch_size, + tokenizer.pad_token_id, + tokenizer.cls_token_id, + args.sort_pool_size, + args.seed, + mode="train", + ) + train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None) + valid_dataset = DialogueDataset( + args.valid_data_path, + args.batch_size, + tokenizer.pad_token_id, + tokenizer.cls_token_id, + args.sort_pool_size, + mode="valid", + ) + valid_dataloader = DataLoader(valid_dataset, return_list=True, batch_size=None) + + lr_scheduler = NoamDecay(1 / (args.warmup_steps * (args.lr**2)), args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
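+    # NOTE: NoamDecay above is constructed with d_model = 1 / (warmup_steps * lr**2);
+    # with Paddle's Noam schedule this makes the learning rate rise to args.lr at
+    # step == warmup_steps and then decay proportionally to step**-0.5.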
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_dataloader: + step += 1 + token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs + + logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos) + loss = F.cross_entropy(logits, tgt_label) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + if step % args.save_steps == 0: + evaluation(model, valid_dataloader) + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + batch_start_time = time.time() + + +@paddle.no_grad() +def evaluation(model, data_loader): + print("\nEval begin...") + model.eval() + total_tokens = 0 + total_loss = 0.0 + start_time = time.time() + step = 0 + for inputs in data_loader: + step += 1 + token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs + + logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos) + loss = F.cross_entropy(logits, tgt_label, reduction="sum") + + total_loss += float(loss.numpy()) + total_tokens += tgt_label.shape[0] + + avg_loss = total_loss / total_tokens + ppl = math.exp(avg_loss) + avg_speed = (time.time() - start_time) / step + print("loss: %.4f - ppl: %.4f - %.3fs/step\n" % (avg_loss, ppl, avg_speed)) + model.train() + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + + main(args) diff --git a/examples/dialogue/lic2021_baseline/infer.py b/examples/dialogue/lic2021_baseline/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..b41cb6fcf2d0ba644525ec17684858437b0fe453 --- /dev/null +++ b/examples/dialogue/lic2021_baseline/infer.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
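+"""Inference entry for the LIC 2021 dialogue baseline: generate and select responses."""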
+ +import time + +import paddle +from args import parse_args, print_args +from data import DialogueDataset, select_response +from paddle.io import DataLoader + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +def main(args): + paddle.set_device(args.device) + paddle.seed(args.seed) + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + test_dataset = DialogueDataset( + args.test_data_path, args.batch_size, tokenizer.pad_token_id, tokenizer.cls_token_id, mode="test" + ) + test_dataloader = DataLoader(test_dataset, return_list=True, batch_size=None) + + infer(model, test_dataloader, tokenizer) + + +@paddle.no_grad() +def infer(model, data_loader, tokenizer): + print("\nInfer begin...") + model.eval() + total_time = 0.0 + start_time = time.time() + responses = [] + for step, inputs in enumerate(data_loader, 1): + token_ids, type_ids, pos_ids, generation_mask = inputs + ids, scores = model.generate( + input_ids=token_ids, + token_type_ids=type_ids, + position_ids=pos_ids, + attention_mask=generation_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_samples, + use_fast=False, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + results = select_response(ids, scores, tokenizer, args.max_dec_len, args.num_samples) + responses.extend(results) + + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for response in responses: + fout.write(response + "\n") + print("\nSave inference result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + main(args) diff --git a/examples/dialogue/plato-2/README.md b/examples/dialogue/plato-2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..20d6e6644d702162a2d4b4d7b2a9047aeb091e83 --- /dev/null +++ b/examples/dialogue/plato-2/README.md @@ -0,0 +1,67 @@ +# PLATO-2 + +## 模型简介 + +构建高质量的开放领域(Open-Domain)的对话机器人,使得它能用自然语言与人自由地交流,这一直是自然语言处理领域终极目标之一。 + +为了能够简易地构建一个高质量的开放域聊天机器人,本项目在Paddle2.0上实现了PLATO-2的预测模型,并基于终端实现了简单的人机交互。用户可以通过下载预训练模型快速构建一个开放域聊天机器人。 + +PLATO-2的网络结构见下图: + +
+![PLATO-2 网络结构](./imgs/network.png)
+ +PLATO-2的训练过程及其他细节详见 [Knover](https://github.com/PaddlePaddle/Knover) + +## 快速开始 + +### 环境依赖 + +- python 3.7+ +- sentencepiece +- termcolor + +安装方式:`pip install sentencepiece termcolor` + +### 数据准备 + +您可以从以下位置下载预训练模型文件: + +* PLATO-2, 24-layers, 16-heads, 1024-hidden, EN: [预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/plato2/24L.pdparams) +* PLATO-2, 32-layers, 32-heads, 2048-hidden, EN: [预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/plato2/32L.pdparams) + +以24层预训练模型为例: + +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/plato2/24L.pdparams +``` + +**NOTE:** PLATO-2网络参数量较大,24层网络至少需要显存16G,32层网络至少需要显存22G,用户可选择合适的网络层数及预训练模型。 + +sentencepiece分词预训练模型和词表文件下载: + +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/plato2/data.tar.gz +tar -zxf data.tar.gz +``` + +### 人机交互 + +运行如下命令即可开始与聊天机器人用英语进行简单的对话 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python interaction.py --vocab_path ./data/vocab.txt --spm_model_file ./data/spm.model --num_layers 24 --init_from_ckpt ./24L.pdparams +``` + +以上参数表示: + +* vocab_path:词表文件路径。 +* spm_model_file:sentencepiece分词预训练模型路径。 +* num_layers:PLATO-2组网层数。 +* init_from_ckpt:PLATO-2预训练模型路径。 + +32层PLATO-2网络交互示例: + +
+![32层PLATO-2 交互示例](./imgs/case.jpg)
+ +**NOTE:** 输入"[EXIT]"退出交互程序,输入"[NEXT]"开启下一轮新的对话。 diff --git a/examples/dialogue/plato-2/imgs/case.jpg b/examples/dialogue/plato-2/imgs/case.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e3378e4164a3d0fe0a43334414bd4b0a13605458 Binary files /dev/null and b/examples/dialogue/plato-2/imgs/case.jpg differ diff --git a/examples/dialogue/plato-2/imgs/network.png b/examples/dialogue/plato-2/imgs/network.png new file mode 100644 index 0000000000000000000000000000000000000000..c14de8e75d7411b0ce9ea94d565402b2b245e7a2 Binary files /dev/null and b/examples/dialogue/plato-2/imgs/network.png differ diff --git a/examples/dialogue/plato-2/interaction.py b/examples/dialogue/plato-2/interaction.py new file mode 100644 index 0000000000000000000000000000000000000000..6c4227b3d0d12d31fb168078356c93cd5c3b418a --- /dev/null +++ b/examples/dialogue/plato-2/interaction.py @@ -0,0 +1,104 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import argparse +from collections import namedtuple + +import paddle +from model import Plato2InferModel +from readers.nsp_reader import NSPReader +from readers.plato_reader import PlatoReader +from termcolor import colored, cprint +from utils import gen_inputs +from utils.args import parse_args + +from paddlenlp.trainer.argparser import strtobool + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + group = parser.add_argument_group("Model") + group.add_argument("--init_from_ckpt", type=str, default="") + group.add_argument("--vocab_size", type=int, default=8001) + group.add_argument("--latent_type_size", type=int, default=20) + group.add_argument("--num_layers", type=int, default=24) + + group = parser.add_argument_group("Task") + group.add_argument("--is_cn", type=strtobool, default=False) + + args, _ = parser.parse_known_args() + NSPReader.add_cmdline_args(parser) + + args = parse_args(parser) + args.batch_size *= args.latent_type_size + + # print(json.dumps(args, indent=2)) + return args + + +def load_params(model, init_from_ckpt): + state_dict = paddle.load(init_from_ckpt) + model.set_state_dict(state_dict) + + +def interact(args): + """Inference main function.""" + plato_reader = PlatoReader(args) + nsp_reader = NSPReader(args) + + if args.num_layers == 24: + n_head = 16 + hidden_size = 1024 + elif args.num_layers == 32: + n_head = 32 + hidden_size = 2048 + else: + raise ValueError( + "The pre-trained model only support 24 or 32 layers, " "but received num_layers=%d." % args.num_layers + ) + + model = Plato2InferModel(nsp_reader, args.num_layers, n_head, hidden_size) + load_params(model, args.init_from_ckpt) + model.eval() + + Example = namedtuple("Example", ["src", "data_id"]) + context = [] + start_info = "Enter [EXIT] to quit the interaction, [NEXT] to start a new conversation." 
+ cprint(start_info, "yellow", attrs=["bold"]) + while True: + user_utt = input(colored("[Human]: ", "red", attrs=["bold"])).strip() + if user_utt == "[EXIT]": + break + elif user_utt == "[NEXT]": + context = [] + cprint(start_info, "yellow", attrs=["bold"]) + else: + context.append(user_utt) + example = Example(src=" [SEP] ".join(context), data_id=0) + record = plato_reader._convert_example_to_record(example, is_infer=True) + data = plato_reader._pad_batch_records([record], is_infer=True) + inputs = gen_inputs(data, args.latent_type_size) + inputs["tgt_ids"] = inputs["tgt_ids"].astype("int64") + pred = model(inputs)[0] + bot_response = pred["response"] + print(colored("[Bot]:", "blue", attrs=["bold"]), colored(bot_response, attrs=["bold"])) + context.append(bot_response) + return + + +if __name__ == "__main__": + args = setup_args() + interact(args) diff --git a/examples/dialogue/plato-2/model.py b/examples/dialogue/plato-2/model.py new file mode 100644 index 0000000000000000000000000000000000000000..437a708876838bce79489cac3204f60a2c861a2a --- /dev/null +++ b/examples/dialogue/plato-2/model.py @@ -0,0 +1,458 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import namedtuple + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +def post_process_context(token_ids, reader, merge=True): + """Post-process the context sequence.""" + context = [] + utt = [] + for tok_id in token_ids[1:]: + if tok_id == reader.eos_id: + utt = reader.tokenizer.convert_ids_to_tokens(utt) + if merge: + utt = reader.tokenizer.merge_subword(utt) + context.append(utt) + utt = [] + else: + utt.append(tok_id) + return context + + +def post_process_response(token_ids, reader, merge=True): + """ + Post-process the decoded sequence. Truncate from the first + and remove the and tokens currently. 
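+    (i.e. drop the leading bos id, cut at the first eos id, and map the rest
+    back to tokens, optionally merging subwords.)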
+ """ + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == reader.eos_id: + eos_pos = i + break + token_ids = token_ids[1:eos_pos] + response = reader.tokenizer.convert_ids_to_tokens(token_ids) + if merge: + response = reader.tokenizer.merge_subword(response) + return token_ids, response + + +def get_cross_turn_repetition(context, pred_tokens, eos_idx, is_cn=False): + """Get cross-turn repetition.""" + if len(pred_tokens) == 0: + return 1.0 + if is_cn: + context = ["".join(utt) for utt in context] + pred_tokens = "".join(pred_tokens) + + pred_tri_grams = set() + for i in range(len(pred_tokens) - 2): + tri_gram = tuple(pred_tokens[i : i + 3]) + pred_tri_grams.add(tri_gram) + for utt in context: + for i in range(len(utt) - 2): + tri_gram = tuple(utt[i : i + 3]) + if tri_gram in pred_tri_grams: + return 1.0 + return 0.0 + + +def get_in_turn_repetition(pred, is_cn=False): + """Get in-turn repetition.""" + if len(pred) == 0: + return 1.0 + if isinstance(pred[0], str): + pred = [tok.lower() for tok in pred] + if is_cn: + pred = "".join(pred) + tri_grams = set() + for i in range(len(pred) - 2): + tri_gram = tuple(pred[i : i + 3]) + if tri_gram in tri_grams: + return 1.0 + tri_grams.add(tri_gram) + return 0.0 + + +class Plato2EncoderLayer(nn.Layer): + def __init__(self, n_head, hidden_size, attn_dropout, act_dropout): + super(Plato2EncoderLayer, self).__init__() + + self.self_attn = nn.MultiHeadAttention(hidden_size, n_head, attn_dropout) + self.pre_norm_layer = nn.LayerNorm(hidden_size) + self.post_norm_layer = nn.LayerNorm(hidden_size) + self.fc1 = nn.Linear(hidden_size, hidden_size * 4) + self.fc2 = nn.Linear(hidden_size * 4, hidden_size) + + self.dropout_layer = nn.Dropout(act_dropout) + self.gelu_layer = nn.GELU() + + def forward(self, x, attn_mask, cache): + query = self.pre_norm_layer(x) + attn_output, new_cache = self.self_attn(query, None, None, attn_mask, cache) + attn_output = self.dropout_layer(attn_output) + attn_output = attn_output + x + ffd_input = self.post_norm_layer(attn_output) + + ffd_output = self.fc1(ffd_input) + ffd_output = self.gelu_layer(ffd_output) + ffd_output = self.dropout_layer(ffd_output) + + ffd_output = self.fc2(ffd_output) + ffd_output = self.dropout_layer(ffd_output) + out = ffd_output + attn_output + + return out, new_cache + + def gen_cache(self, key): + return self.self_attn.gen_cache(key) + + +class Plato2Encoder(nn.Layer): + def __init__( + self, vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ): + super(Plato2Encoder, self).__init__() + + self.n_head = n_head + + self.word_embedding_layer = nn.Embedding(vocab_size, hidden_size) + self.sent_embedding_layer = nn.Embedding(type_size, hidden_size) + self.pos_embedding_layer = nn.Embedding(max_position_seq_len, hidden_size) + + self.encoder_layers = [] + for i in range(num_layers): + encoder_layer = Plato2EncoderLayer(n_head, hidden_size, attn_dropout, act_dropout) + self.encoder_layers.append(encoder_layer) + self.add_sublayer("layers." 
+ str(i), encoder_layer) + self.post_encoder_layer_norm = nn.LayerNorm(hidden_size) + + self.dropout_layer = nn.Dropout(act_dropout) + + def forward(self, caches, token_ids, type_ids, pos_ids, generation_mask, aux_emb=None): + out, self_attn_mask = self.gen_input(token_ids, type_ids, pos_ids, generation_mask, aux_emb) + + new_caches = [] + for i, encoder_layer in enumerate(self.encoder_layers): + out, new_cache = encoder_layer(out, self_attn_mask, caches[i]) + new_caches.append(new_cache) + + enc_output = self.post_encoder_layer_norm(out) + return enc_output, new_caches + + def gen_input(self, token_ids, type_ids, pos_ids, input_mask, aux_emb=None): + token_emb_out = self.word_embedding_layer(token_ids) + type_emb_out = self.sent_embedding_layer(type_ids) + pos_emb_out = self.pos_embedding_layer(pos_ids) + emb_out = token_emb_out + type_emb_out + pos_emb_out + + # auxiliary memory embeddings + if aux_emb is not None: + emb_out = paddle.concat([aux_emb, emb_out], axis=1) + + emb_out = self.dropout_layer(emb_out) + + # generate n-head self-attention mask + self_attn_mask = input_mask + self_attn_mask = paddle.scale(x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = paddle.stack(x=[self_attn_mask] * self.n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + return emb_out, n_head_self_attn_mask + + def gen_caches(self, key): + caches = [encoder_layer.gen_cache(key) for encoder_layer in self.encoder_layers] + return caches + + +class NSP(nn.Layer): + def __init__( + self, vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ): + super(NSP, self).__init__() + + self.n_head = n_head + self.hidden_size = hidden_size + + self.word_embedding_layer = nn.Embedding(vocab_size, hidden_size) + self.sent_embedding_layer = nn.Embedding(type_size, hidden_size) + self.pos_embedding_layer = nn.Embedding(max_position_seq_len, hidden_size) + + encoder_layer = nn.TransformerEncoderLayer( + hidden_size, n_head, hidden_size * 4, act_dropout, "gelu", attn_dropout, act_dropout, "True" + ) + encoder_norm = nn.LayerNorm(hidden_size) + self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, encoder_norm) + self.fc1 = nn.Linear(hidden_size, hidden_size) + self.fc2 = nn.Linear(hidden_size, 2) + + self.dropout_layer = nn.Dropout(act_dropout) + self.tanh_layer = nn.Tanh() + self.softmax = nn.Softmax() + + def forward(self, inputs): + token_ids = inputs["token_ids"] + type_ids = inputs["type_ids"] + pos_ids = inputs["pos_ids"] + attention_mask = inputs["attention_mask"] + label_pos = inputs["label_pos"] + + out, self_attn_mask = self.gen_input(token_ids, type_ids, pos_ids, attention_mask) + # [-1, seq_len, hidden_size] + enc_out = self.encoder(out, self_attn_mask) + + enc_out = paddle.reshape(enc_out, [-1, self.hidden_size]) + label_pos = paddle.cast(label_pos, "int64") + out = paddle.gather(enc_out, label_pos) + pooled_out = self.fc1(out) + pooled_out = self.tanh_layer(pooled_out) + + # [-1, 2] + logits = self.fc2(pooled_out) + probs = self.softmax(logits) + + return probs + + def gen_input(self, token_ids, type_ids, pos_ids, input_mask, aux_emb=None): + token_emb_out = self.word_embedding_layer(token_ids) + type_emb_out = self.sent_embedding_layer(type_ids) + pos_emb_out = self.pos_embedding_layer(pos_ids) + emb_out = token_emb_out + type_emb_out + pos_emb_out + + # auxiliary memory embeddings + if aux_emb is not None: + emb_out = paddle.concat([aux_emb, emb_out], axis=1) + + emb_out = 
self.dropout_layer(emb_out) + + # generate n-head self-attention mask + self_attn_mask = input_mask + self_attn_mask = paddle.scale(x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = paddle.stack(x=[self_attn_mask] * self.n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + return emb_out, n_head_self_attn_mask + + +class Plato2InferModel(nn.Layer): + def __init__( + self, + nsp_reader, + num_layers, + n_head, + hidden_size, + vocab_size=8001, + type_size=2, + latent_type_size=20, + max_position_seq_len=256, + act_dropout=0.1, + attn_dropout=0.1, + max_dec_len=64, + min_dec_len=1, + topk=10, + ): + super(Plato2InferModel, self).__init__() + + self.nsp_reader = nsp_reader + self.num_layers = num_layers + self.latent_type_size = latent_type_size + self.max_dec_len = max_dec_len + self.min_dec_len = min_dec_len + self.topk = topk + self.unk_id = 0 + self.bos_id = 1 + self.eos_id = 2 + self.mask_id = 8000 + self.after_eos = paddle.ones([vocab_size]) * -1e9 + self.after_eos[self.eos_id] = 0 + self.is_cn = False + self.batch_size = 1 + + self.latent_weight = paddle.create_parameter([hidden_size, latent_type_size], "float32") + + self.plato2_encoder = Plato2Encoder( + vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ) + + self.logits_fc_layer = nn.Linear(hidden_size, hidden_size) + self.logits_layer_norm = nn.LayerNorm(hidden_size) + self.logits_bias = paddle.create_parameter([vocab_size], "float32", is_bias=True) + + self.nsp_predictor = NSP( + vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ) + + self.gelu_layer = nn.GELU() + self.softmax = nn.Softmax() + + @paddle.no_grad() + def forward(self, inputs): + token_ids = inputs["token_ids"] + type_ids = inputs["type_ids"] + pos_ids = inputs["pos_ids"] + generation_mask = inputs["generation_mask"] + latent_id = inputs["latent_id"] + data_id = inputs["data_id"] + + # [-1, 1, latent_type_size] + latent_id = F.one_hot(latent_id, self.latent_type_size) + # [-1, 1, hidden_size] + latent_emb = paddle.matmul(latent_id, self.latent_weight, transpose_y=True) + + caches = self.plato2_encoder.gen_caches(token_ids) + + # [-1, seq_len + 1, hidden_size] + enc_out, new_caches = self.plato2_encoder(caches, token_ids, type_ids, pos_ids, generation_mask, latent_emb) + + pred_ids = self.decode(inputs, new_caches) + + nsp_inputs = self.gen_nsp_input(token_ids, pred_ids) + # [-1, 2] + probs = self.nsp_predictor(nsp_inputs) + + return self.get_results(data_id, token_ids, pred_ids, probs) + + def decode(self, inputs, caches): + tgt_ids = inputs["tgt_ids"] + tgt_pos = inputs["tgt_pos"] + tgt_generation_mask = inputs["tgt_generation_mask"] + predictions = tgt_ids + + # TODO + step = 0 + while step < self.max_dec_len: + # [-1, 1] + append_mask = paddle.cast(tgt_ids != self.eos_id, dtype=tgt_generation_mask.dtype) + tgt_generation_mask = paddle.concat([tgt_generation_mask, paddle.unsqueeze(append_mask, 1)], axis=-1) + tgt_sent = paddle.ones([tgt_generation_mask.shape[0], 1], dtype=tgt_ids.dtype) + + # [-1, 1, hidden_size] + out, caches = self.plato2_encoder(caches, tgt_ids, tgt_sent, tgt_pos, tgt_generation_mask) + out = paddle.squeeze(out, axis=1) + + # [-1, hidden_size] + trans = self.logits_fc_layer(out) + trans = self.gelu_layer(trans) + trans = self.logits_layer_norm(trans) + + # [-1, vocab_size] + logits = ( + paddle.matmul(trans, self.plato2_encoder.word_embedding_layer.weight, transpose_y=True) + + 
self.logits_bias + ) + logits[:, self.unk_id] = -1e9 + logits[:, self.bos_id] = -1e9 + logits[:, self.mask_id] = -1e9 + if step < self.min_dec_len: + logits[:, self.eos_id] = -1e9 + logits = logits * append_mask + (1 - append_mask) * self.after_eos + probs = self.softmax(logits) + + # [-1, topk] + topk_probs, _ = paddle.topk(probs, k=self.topk) + mask = paddle.cast(probs >= topk_probs[:, -1:], "float32") + sums = paddle.sum(topk_probs, axis=-1, keepdim=True) + new_probs = probs * mask / sums + # [-1, 1] + sampling_ids = paddle.multinomial(new_probs) + + step = step + 1 + tgt_ids = sampling_ids + tgt_pos = tgt_pos + 1 + predictions = paddle.concat([predictions, tgt_ids], axis=1) + return predictions + + def gen_nsp_input(self, token_ids, pred_ids): + token_ids = token_ids.numpy() + pred_ids = pred_ids.numpy() + + def __reader__(): + headers = ["src", "tgt", "data_id"] + + Example = namedtuple("Example", headers) + + for i, (raw, pred) in enumerate(zip(token_ids, pred_ids)): + context = post_process_context(raw, self.nsp_reader, merge=False) + _, response = post_process_response(pred, self.nsp_reader, merge=False) + context_tokenized_input = " [SEP] ".join(" ".join(utt) for utt in context) + response_tokenized_input = " ".join(response) + example = Example(src=context_tokenized_input, tgt=response_tokenized_input, data_id=i) + data = self.nsp_reader._convert_example_to_record(example, is_infer=True) + yield data + return + + generator = self.nsp_reader.data_generator( + reader=__reader__, + is_infer=True, + phase="test", + ) + inputs = next(generator()) + + # print('\nnsp_inputs:') + for key in inputs: + inputs[key] = paddle.to_tensor(inputs[key]) + if key in ["token_ids", "type_ids", "pos_ids"]: + inputs[key] = paddle.squeeze(inputs[key], axis=-1) + # print(key, inputs[key].shape) + # print(inputs[key]) + return inputs + + def get_results(self, data_id, token_ids, pred_ids, probs): + data_id = data_id.numpy() + token_ids = token_ids.numpy() + pred_ids = pred_ids.numpy() + probs = probs.numpy() + + infos = [] + for raw, pred, prob in zip(token_ids, pred_ids, probs): + tokens = post_process_context(raw, self.nsp_reader) + pred_token_ids, pred_tokens = post_process_response(pred, self.nsp_reader) + info = {} + info["response"] = " ".join(pred_tokens) + cross_turn_repetition = get_cross_turn_repetition(tokens, pred_tokens, self.nsp_reader.eos_id, self.is_cn) + in_turn_repetition = max( + get_in_turn_repetition(pred_tokens, self.is_cn), get_in_turn_repetition(pred_token_ids) + ) + + info["score"] = float(prob[1]) + if len(pred_token_ids) >= self.max_dec_len: + info["score"] -= 1e3 + elif cross_turn_repetition > 0: + info["score"] -= 1e3 + elif in_turn_repetition > 0: + info["score"] -= 1e3 + infos.append(info) + + results = [] + pre_idx = 0 + sample = [] + for idx, info in zip(data_id, infos): + if idx != pre_idx: + sample = sorted(sample, key=lambda info: -info["score"]) + result = sample[0] + result["data_id"] = pre_idx + results.apeend(result) + sample = [] + pre_idx = idx + sample.append(info) + if sample: + sample = sorted(sample, key=lambda info: -info["score"]) + result = sample[0] + result["data_id"] = pre_idx + results.append(result) + return results diff --git a/examples/dialogue/plato-2/readers/dialog_reader.py b/examples/dialogue/plato-2/readers/dialog_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..00339c1f0a0e83ba75015598c1159790d48cb69b --- /dev/null +++ b/examples/dialogue/plato-2/readers/dialog_reader.py @@ -0,0 +1,436 @@ +# Copyright (c) 2020 
PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Dialogue Reader.""" + +import csv +import gzip +from collections import namedtuple +from contextlib import contextmanager + +import numpy as np +import utils.tokenization as tokenization +from utils import pad_batch_data +from utils.masking import mask + +from paddlenlp.trainer.argparser import strtobool + + +class DialogReader(object): + """The implement of DialogReader.""" + + @classmethod + def add_cmdline_args(cls, parser): + """Add cmdline argurments.""" + group = parser.add_argument_group("Reader") + group.add_argument("--max_src_len", type=int, default=128) + group.add_argument("--max_tgt_len", type=int, default=128) + group.add_argument("--truncate_first_turn", type=strtobool, default=False) + group.add_argument("--file_format", type=str, default="file", choices=["file", "filelist"]) + group.add_argument("--data_format", type=str, default="raw", choices=["raw", "tokenized", "numerical"]) + group.add_argument("--in_tokens", type=strtobool, default=False) + group.add_argument("--batch_size", type=int, default=16) + group.add_argument("--continuous_position", type=strtobool, default=True) + group.add_argument("--random_seed", type=int, default=11) + group.add_argument("--sort_pool_size", type=int, default=2**16) + + group = parser.add_argument_group("Tokenizer") + group.add_argument("--tokenizer", type=str, default="SentencePieceTokenizer") + args, _ = parser.parse_known_args() + tokenizer_cls = getattr(tokenization, args.tokenizer) + tokenizer_cls.add_cmdline_args(parser) + return group + + def __init__(self, args): + tokenizer_cls = getattr(tokenization, args.tokenizer) + self.tokenizer = tokenizer_cls(args) + self.vocab = self.tokenizer.vocab + self.pad_id = args.pad_id = self.vocab["[PAD]"] + self.bos_id = args.bos_id = self.vocab["[CLS]"] + self.eos_id = args.eos_id = self.vocab["[SEP]"] + self.unk_id = args.unk_id = self.vocab["[UNK]"] + self.mask_id = args.mask_id = self.vocab["[MASK]"] + self.vocab_size = args.get("vocab_size", 0) + self.max_src_len = args.max_src_len + self.max_tgt_len = args.max_tgt_len + self.truncate_first_turn = args.truncate_first_turn + self.file_format = args.file_format + self.data_format = args.data_format + self.in_tokens = args.in_tokens + self.batch_size = args.batch_size + self.continuous_position = args.continuous_position + self.sort_pool_size = args.sort_pool_size + + # random_seed must be set for data slicing when using multi-gpu + self.global_rng = np.random.RandomState(args.random_seed) + + # training progress + self.current_example = 0 + self.current_epoch = 0 + self.num_examples = 0 + + # model related + + self.fields = ["token_ids", "type_ids", "pos_ids"] + self.num_numerical_fields = len(self.fields) + self.fields += ["tgt_start_idx", "data_id"] + self.sort_key = lambda record: [len(record.token_ids)] + + self.Record = namedtuple("Record", self.fields, defaults=(None,) * len(self.fields)) + + self.features = {} + return + + def 
get_train_progress(self): + """Gets progress for training phase.""" + return self.current_epoch, self.current_file_index, self.total_file + + def _convert_example_to_record(self, example, is_infer): + # process src + src_token_ids = [] + src_pos_ids = [] + + # tokenize src + s_token_ids_list = [] + for s in example.src.split("[SEP]"): + s = tokenization.convert_to_unicode(s).strip() + + if self.data_format == "tokenized": + s_tokens = s.split(" ") + else: + s_tokens = self.tokenizer.tokenize(s) + + s_token_ids = self.tokenizer.convert_tokens_to_ids(s_tokens) + [self.eos_id] + s_token_ids_list.append(s_token_ids) + + # trim src + idx = len(s_token_ids_list) - 1 + total_token_num = 1 + while idx >= 0: + total_token_num += len(s_token_ids_list[idx]) + if total_token_num > self.max_src_len: + if self.truncate_first_turn and idx == 0: + truncated_ids = s_token_ids_list[idx][: self.max_src_len - total_token_num] + if len(truncated_ids) > 1: + s_token_ids_list[idx] = truncated_ids[:-1] + [self.eos_id] + idx -= 1 + break + idx -= 1 + + for i, s_token_ids in enumerate(s_token_ids_list[idx + 1 :], idx + 1): + src_token_ids += s_token_ids + src_pos_ids += list(range(1, len(s_token_ids) + 1)) + + src_token_ids = [self.bos_id] + src_token_ids + src_type_ids = [0] * len(src_token_ids) + src_pos_ids = [0] + src_pos_ids + assert ( + len(src_token_ids) == len(src_type_ids) == len(src_pos_ids) + ), "not len(src_token_ids) == len(src_type_ids) == len(src_pos_ids)" + + token_ids = src_token_ids + type_ids = src_type_ids + pos_ids = src_pos_ids + tgt_start_idx = len(token_ids) + + if not is_infer: + # process tgt + # tokenize tgt + tgt = tokenization.convert_to_unicode(example.tgt).strip() + if self.data_format == "tokenized": + tgt_tokens = tgt.split(" ") + else: + tgt_tokens = self.tokenizer.tokenize(tgt) + + tgt_token_ids = self.tokenizer.convert_tokens_to_ids(tgt_tokens) + tgt_token_ids.append(self.eos_id) + + # trim tgt + if len(tgt_token_ids) > self.max_tgt_len - 1: + tgt_token_ids = tgt_token_ids[: self.max_tgt_len - 1] + + tgt_token_ids = [self.bos_id] + tgt_token_ids + tgt_type_ids = [1] * len(tgt_token_ids) + tgt_pos_ids = list(range(1, len(tgt_token_ids) + 1)) + assert ( + len(tgt_token_ids) == len(tgt_type_ids) == len(tgt_pos_ids) + ), "not len(tgt_token_ids) == len(tgt_type_ids) == len(tgt_pos_ids)" + + token_ids += tgt_token_ids + type_ids += tgt_type_ids + pos_ids += tgt_pos_ids + + assert len(token_ids) == len(type_ids) == len(pos_ids), "not len(token_ids) == len(type_ids) == len(pos_ids)" + + if self.continuous_position: + src_pos_ids = list(range(len(src_token_ids))) + if not is_infer: + tgt_pos_ids = list(range(len(tgt_token_ids))) + pos_ids = list(range(len(token_ids))) + + field_values = {"token_ids": src_token_ids, "type_ids": src_type_ids, "pos_ids": src_pos_ids} + field_values["tgt_start_idx"] = tgt_start_idx + field_values["data_id"] = example.data_id + + record = self.Record(**field_values) + return record + + def _read_tsv(self, fp, phase, is_infer, delimiter="\t", quotechar=None): + """Reads a tab separated value file.""" + csv.field_size_limit(2**20) + reader = csv.reader(fp, delimiter=delimiter, quotechar=quotechar) + headers = next(reader) + headers.append("data_id") + Example = namedtuple("Example", headers) + + for i, line in enumerate(reader): + example = Example(*line, data_id=i) + if is_infer or phase.endswith("test"): + self.features[phase][i] = example + record = self._convert_example_to_record(example, is_infer) + yield record + + def _read_numerical_file(self, fp, 
delimiter=";"): + for i, line in enumerate(fp): + cols = tokenization.convert_to_unicode(line).strip().split(delimiter) + cols = list(map(lambda x: list(map(int, x.split(" "))), cols)) + if len(cols) > self.num_numerical_fields: + cols = cols[: self.num_numerical_fields] + tgt_start_idx = cols[0].index(self.bos_id, 1) + record = self.Record(*cols, tgt_start_idx=tgt_start_idx, data_id=i) + yield record + + def _read_file(self, input_file, phase, is_infer): + def __wrapper__(): + with open_file(input_file) as fp: + if self.data_format == "numerical": + records = self._read_numerical_file(fp) + else: + records = self._read_tsv(fp, phase, is_infer) + for record in records: + yield record + + return __wrapper__ + + def _read_files(self, filelist, phase, is_infer, shuffle_files): + input_files = open(filelist).readlines() + + def __wrapper__(): + if shuffle_files: + self.global_rng.shuffle(input_files) + + if phase == "train": + self.total_file = len(input_files) + for file_index, input_file in enumerate(input_files, 1): + if phase == "train": + self.current_file_index = file_index + self.current_file = input_file + file_reader = self._read_file(input_file.strip(), phase, is_infer) + for record in file_reader(): + yield record + + return __wrapper__ + + def _batch_reader(self, reader, phase=None, is_infer=False, sort_pool_size=2**16): + """Construct a batch reader.""" + + def update_max_lens(max_lens, record): + """Update max_lens.""" + if max_lens is None: + return self.sort_key(record) + else: + return [max(max_len, l) for max_len, l in zip(max_lens, self.sort_key(record))] + + def get_batch(reader): + """Generate batches from reader.""" + batch, max_lens = [], None + for record in reader(): + if record is None: + yield batch + batch, max_lens = [], None + continue + + self.current_example += 1 + max_lens = update_max_lens(max_lens, record) + if self.in_tokens: + to_append = (len(batch) + 1) * sum(max_lens) <= self.batch_size + else: + to_append = len(batch) < self.batch_size + if to_append: + batch.append(record) + else: + yield batch + batch, max_lens = [record], self.sort_key(record) + + if len(batch) > 0: + yield batch + + def get_sorted_batch(pool): + """Generate sorted batches from pool.""" + pool = sorted(pool, key=self.sort_key) + batches = [] + batch, max_lens = [], None + for record in pool: + self.current_example += 1 + max_lens = update_max_lens(max_lens, record) + if self.in_tokens: + to_append = (len(batch) + 1) * sum(max_lens) <= self.batch_size + else: + to_append = len(batch) < self.batch_size + if to_append: + batch.append(record) + else: + batches.append(batch) + batch, max_lens = [record], self.sort_key(record) + + if len(batch) > 0: + batches.append(batch) + self.global_rng.shuffle(batches) + + for batch in batches: + yield batch + + def __wrapper__(): + if sort_pool_size > 0: + pool = [] + for record in reader(): + pool.append(record) + if len(pool) == sort_pool_size: + for batch in get_sorted_batch(pool): + yield batch + pool = [] + if len(pool) > 0: + for batch in get_sorted_batch(pool): + yield batch + else: + for batch in get_batch(reader): + yield batch + + return __wrapper__ + + def _distributed_batch_reader(self, batch_reader, num_part, part_id, is_test=False): + def __wrapper__(): + batches = [] + for batch in batch_reader(): + batches.append(batch) + if len(batches) == num_part: + yield batches[part_id] + batches = [] + if is_test and 0 <= part_id < len(batches): + yield batches[part_id] + return + + return __wrapper__ + + def data_generator( + self, 
input_file=None, reader=None, num_epochs=1, num_part=1, part_id=0, phase=None, is_infer=False + ): + """Data generator.""" + + def __wrapper__(): + if is_infer or phase.endswith("test"): + self.features[phase] = {} + + nonlocal reader + if reader is None: + if self.file_format == "filelist": + reader = self._read_files(input_file, phase, is_infer, not phase.endswith("test")) + else: + if phase == "train": + self.total_file = 1 + self.current_file_index = 1 + self.current_file = input_file + reader = self._read_file(input_file, phase, is_infer) + + batch_reader = self._batch_reader( + reader, phase, is_infer, sort_pool_size=self.sort_pool_size if not is_infer else 0 + ) + if phase == "train": + batch_reader = self._distributed_batch_reader(batch_reader, num_part, part_id) + elif phase.startswith("distributed"): + batch_reader = self._distributed_batch_reader(batch_reader, num_part, part_id, is_test=True) + + for epoch_index in range(num_epochs): + if phase == "train": + self.current_example = 0 + self.current_epoch = epoch_index + 1 + for batch in batch_reader(): + yield self._pad_batch_records(batch, is_infer) + + return __wrapper__ + + def _gen_self_attn_mask(self, batch_token_ids, batch_tgt_start_idx=None, is_unidirectional=True, shift_len=0): + max_len = max(map(len, batch_token_ids)) + input_mask_data = np.zeros((len(batch_token_ids), max_len + shift_len, max_len + shift_len)) + if is_unidirectional: + for index, mask_data in enumerate(input_mask_data): + start = 0 if batch_tgt_start_idx is None else batch_tgt_start_idx[index] + end = len(batch_token_ids[index]) + mask_data[: end + shift_len, : start + shift_len] = 1.0 + # Generate the lower triangular matrix using the slice of matrix + b = np.tril(np.ones([end - start, end - start]), 0) + mask_data[start + shift_len : end + shift_len, start + shift_len : end + shift_len] = b + else: + for index, token_ids in enumerate(batch_token_ids): + input_mask_data[index, : len(token_ids) + shift_len, : len(token_ids) + shift_len] = 1.0 + return input_mask_data.astype("float32") + + def _pad_batch_records(self, batch_records, is_infer): + """ + Padding batch records and construct model's inputs. 
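+        Pads token_ids/type_ids/pos_ids to the batch max length, builds the
+        generation_mask, and adds either decoding inputs (tgt_ids, tgt_pos,
+        init_score) for inference or tgt_label/tgt_pos for training.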
+ """ + batch = {} + batch_token_ids = [record.token_ids for record in batch_records] + batch_type_ids = [record.type_ids for record in batch_records] + batch_pos_ids = [record.pos_ids for record in batch_records] + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + + batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] + batch["generation_mask"] = self._gen_self_attn_mask(batch_token_ids, batch_tgt_start_idx=batch_tgt_start_idx) + + if is_infer: + tgt_ids = np.array([[[self.bos_id]]] * len(batch_token_ids), dtype="int64") + if self.continuous_position: + tgt_pos = np.array(batch_tgt_start_idx, dtype="int64") + else: + tgt_pos = np.zeros_like(batch_tgt_start_idx, dtype="int64") + tgt_pos = tgt_pos.reshape(-1, 1, 1) + batch["init_score"] = np.zeros_like(tgt_ids, dtype="float32").reshape(-1, 1).tolist() + batch["tgt_ids"] = tgt_ids.tolist() + batch["tgt_pos"] = tgt_pos.tolist() + + batch["tgt_generation_mask"] = batch["generation_mask"][:, 0:1, :].astype("float32") + else: + batch["tgt_label"], batch["tgt_pos"] = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + sent_b_starts=batch_tgt_start_idx, + is_unidirectional=True, + ) + + batch_data_id = [record.data_id for record in batch_records] + batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) + return batch + + +@contextmanager +def open_file(filename): + """Open file.""" + if filename.endswith(".gz"): + fp = gzip.open(filename, "rt") + else: + fp = open(filename) + yield fp + fp.close() diff --git a/examples/dialogue/plato-2/readers/nsp_reader.py b/examples/dialogue/plato-2/readers/nsp_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..968da9ff1ba0e7c8ac87e993df5c964d1ea0bd72 --- /dev/null +++ b/examples/dialogue/plato-2/readers/nsp_reader.py @@ -0,0 +1,152 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
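The `_gen_self_attn_mask` helper defined earlier in `dialog_reader.py` gives every token full visibility of the dialogue context while restricting response tokens to a causal (lower-triangular) view. A minimal numpy sketch of that construction, ignoring `shift_len` and using made-up token ids (`toy_generation_mask` is purely illustrative, not part of the reader):

```python
import numpy as np

def toy_generation_mask(batch_token_ids, batch_tgt_start_idx):
    """Toy re-creation of the unidirectional mask; shift_len is ignored."""
    max_len = max(map(len, batch_token_ids))
    mask = np.zeros((len(batch_token_ids), max_len, max_len), dtype="float32")
    for i, token_ids in enumerate(batch_token_ids):
        start, end = batch_tgt_start_idx[i], len(token_ids)
        mask[i, :end, :start] = 1.0  # the context prefix is visible to every real position
        mask[i, start:end, start:end] = np.tril(np.ones((end - start, end - start)))
    return mask

# 4 context tokens followed by a 3-token response that starts at index 4.
print(toy_generation_mask([[1, 5, 6, 7, 1, 8, 9]], [4])[0])
```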
+"""NSP Reader.""" + +from collections import namedtuple + +import numpy as np +from readers.dialog_reader import DialogReader +from utils import pad_batch_data +from utils.masking import mask + +from paddlenlp.trainer.argparser import strtobool + + +class NSPReader(DialogReader): + """NSP Reader.""" + + @classmethod + def add_cmdline_args(cls, parser): + """Add cmdline argurments.""" + group = DialogReader.add_cmdline_args(parser) + group.add_argument( + "--attention_style", type=str, default="bidirectional", choices=["bidirectional", "unidirectional"] + ) + group.add_argument("--mix_negative_sample", type=strtobool, default=False) + return group + + def __init__(self, args): + super(NSPReader, self).__init__(args) + self.fields.append("label") + self.Record = namedtuple("Record", self.fields, defaults=(None,) * len(self.fields)) + + self.attention_style = args.attention_style + self.mix_negative_sample = args.mix_negative_sample + return + + def _convert_example_to_record(self, example, is_infer): + record = super(NSPReader, self)._convert_example_to_record(example, False) + if "label" in example._fields: + record = record._replace(label=int(example.label)) + return record + + def _mix_negative_sample(self, reader, neg_pool_size=2**16): + def gen_from_pool(pool): + num_samples = len(pool) + if num_samples == 1: + # only one sample: it is impossible to generate negative sample + yield pool[0]._replace(label=1) + return + self.global_rng.shuffle(pool) + for i in range(num_samples): + pool[i] = pool[i]._replace(label=1) + j = (i + 1) % num_samples + idx_i = pool[i].tgt_start_idx + idx_j = pool[j].tgt_start_idx + field_values = {} + field_values["token_ids"] = pool[i].token_ids[:idx_i] + pool[j].token_ids[idx_j:] + field_values["type_ids"] = pool[i].type_ids[:idx_i] + pool[j].type_ids[idx_j:] + field_values["pos_ids"] = list(range(len(field_values["token_ids"]))) + neg_record = self.Record(**field_values, tgt_start_idx=idx_i, data_id=-1, label=0) + pool.append(neg_record) + assert len(neg_record.token_ids) <= self.max_seq_len + self.global_rng.shuffle(pool) + for record in pool: + yield record + + def __wrapper__(): + pool = [] + for record in reader(): + pool.append(record) + if len(pool) == neg_pool_size: + for record in gen_from_pool(pool): + yield record + pool = [] + if len(pool) > 0: + for record in gen_from_pool(pool): + yield record + + return __wrapper__ + + def _batch_reader(self, reader, phase=None, is_infer=False, sort_pool_size=2**16): + if self.mix_negative_sample: + reader = self._mix_negative_sample(reader) + return super(NSPReader, self)._batch_reader( + reader, phase=phase, is_infer=is_infer, sort_pool_size=sort_pool_size + ) + + def _pad_batch_records(self, batch_records, is_infer): + """ + Padding batch records and construct model's inputs. 
+ """ + batch = {} + batch_token_ids = [record.token_ids for record in batch_records] + batch_type_ids = [record.type_ids for record in batch_records] + batch_pos_ids = [record.pos_ids for record in batch_records] + batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] + batch_label = [record.label for record in batch_records] + + if self.attention_style == "unidirectional": + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + tgt_label, tgt_pos, label_pos = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + bos_id=self.bos_id, + sent_b_starts=batch_tgt_start_idx, + labels=batch_label, + is_unidirectional=True, + ) + attention_mask = self._gen_self_attn_mask(batch_token_ids, batch_tgt_start_idx) + else: + batch_mask_token_ids, tgt_label, tgt_pos, label_pos = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + bos_id=self.bos_id, + eos_id=self.eos_id, + mask_id=self.mask_id, + sent_b_starts=batch_tgt_start_idx, + labels=batch_label, + is_unidirectional=False, + ) + if not is_infer: + batch_token_ids = batch_mask_token_ids + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + attention_mask = self._gen_self_attn_mask(batch_token_ids, is_unidirectional=False) + + batch["attention_mask"] = attention_mask + batch["label_pos"] = label_pos + + if not is_infer: + batch_label = np.array(batch_label).astype("int64").reshape([-1, 1]) + batch["label"] = batch_label + batch["tgt_label"] = tgt_label + batch["tgt_pos"] = tgt_pos + + batch_data_id = [record.data_id for record in batch_records] + batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) + return batch diff --git a/examples/dialogue/plato-2/readers/plato_reader.py b/examples/dialogue/plato-2/readers/plato_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..3d3cd790ee76900799aa2cf2683dc64a2d1524d9 --- /dev/null +++ b/examples/dialogue/plato-2/readers/plato_reader.py @@ -0,0 +1,85 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Plato Reader.""" + +import numpy as np + +from readers.dialog_reader import DialogReader +from utils import pad_batch_data +from utils.masking import mask + + +class PlatoReader(DialogReader): + """The implement of PlatoReader""" + + def __init__(self, args): + super(PlatoReader, self).__init__(args) + self.latent_type_size = args.latent_type_size + self.use_bow = args.use_bow + + def _pad_batch_records(self, batch_records, is_infer): + """ + Padding batch records and construct model's inputs. 
+ """ + batch = {} + batch_token_ids = [record.token_ids for record in batch_records] + batch_type_ids = [record.type_ids for record in batch_records] + batch_pos_ids = [record.pos_ids for record in batch_records] + + batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] + + batch_size = len(batch_token_ids) + + # padding + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + + batch["generation_mask"] = self._gen_self_attn_mask( + batch_token_ids, batch_tgt_start_idx=batch_tgt_start_idx, is_unidirectional=True, shift_len=1 + ) + if not is_infer: + batch["recognition_mask"] = self._gen_self_attn_mask(batch_token_ids, is_unidirectional=False, shift_len=1) + + if is_infer: + tgt_ids = np.array([[[self.bos_id]]] * batch_size, dtype="int64") + if self.continuous_position: + tgt_pos = np.array(batch_tgt_start_idx, dtype="int64") + else: + tgt_pos = np.zeros_like(batch_tgt_start_idx, dtype="int64") + tgt_pos = tgt_pos.reshape(-1, 1, 1) + batch["init_score"] = np.zeros_like(tgt_ids, dtype="float32").reshape(-1, 1).tolist() + batch["tgt_ids"] = tgt_ids.tolist() + batch["tgt_pos"] = tgt_pos.tolist() + batch["parent_idx"] = np.array(range(batch_size), dtype="int32") + + batch["tgt_generation_mask"] = batch["generation_mask"][:, 0:1, :].astype("float32") + else: + mask_return_list = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + sent_b_starts=batch_tgt_start_idx, + is_unidirectional=True, + use_latent=True, + use_bow=self.use_bow, + ) + batch["tgt_label"] = mask_return_list[0] + batch["tgt_pos"] = mask_return_list[1] + if self.use_bow: + batch["bow_label"] = mask_return_list[2] + batch["bow_pos"] = mask_return_list[3] + + batch_data_id = [record.data_id for record in batch_records] + batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) + return batch diff --git a/examples/dialogue/plato-2/utils/__init__.py b/examples/dialogue/plato-2/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..1a9ff1098dd3bbde8f4bf169216c404bb0b689de --- /dev/null +++ b/examples/dialogue/plato-2/utils/__init__.py @@ -0,0 +1,51 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
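`NSPReader._mix_negative_sample` above builds next-sentence-prediction negatives by splicing the context of one record onto the response of the next record in a shuffled pool. A self-contained sketch of the same idea, using a simplified, hypothetical `Sample` tuple in place of the reader's `Record`:

```python
import random
from collections import namedtuple

# label: 1 = genuine context/response pair, 0 = spliced negative.
Sample = namedtuple("Sample", ["token_ids", "tgt_start_idx", "label"])

def mix_negatives(pool, seed=0):
    rng = random.Random(seed)
    rng.shuffle(pool)
    positives = [s._replace(label=1) for s in pool]
    negatives = []
    for i, sample in enumerate(positives):
        other = positives[(i + 1) % len(positives)]  # borrow the next sample's response
        context = sample.token_ids[: sample.tgt_start_idx]
        response = other.token_ids[other.tgt_start_idx :]
        negatives.append(Sample(context + response, sample.tgt_start_idx, 0))
    mixed = positives + negatives
    rng.shuffle(mixed)
    return mixed

pool = [Sample([1, 5, 6, 1, 7, 8], 3, None), Sample([1, 9, 1, 10], 2, None)]
for record in mix_negatives(pool):
    print(record)
```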
+"""Utils.""" + +from itertools import chain + +import numpy as np +import paddle + + +def repeat_array(array, times): + """Repeate numpy array.""" + if isinstance(array, list): + return list(chain(*([array] * times))) + else: + return np.concatenate([array] * times, axis=0) + + +def gen_inputs(inputs, latent_type_size): + batch_size = len(inputs["data_id"]) + inputs = {name: repeat_array(array, latent_type_size) for name, array in inputs.items()} + # Add latent_id + inputs["latent_id"] = np.array( + [i for i in range(latent_type_size) for _ in range(batch_size)], dtype="int64" + ).reshape([-1, 1]) + + # print('\nplato_inputs:') + for key in inputs: + inputs[key] = paddle.to_tensor(inputs[key]) + if key in ["token_ids", "type_ids", "pos_ids", "tgt_ids", "tgt_pos", "data_id"]: + inputs[key] = paddle.squeeze(inputs[key], axis=-1) + # print(key, inputs[key].shape, inputs[key].dtype) + return inputs + + +def pad_batch_data(insts, pad_id=0): + """Pad the instances to the max sequence length in batch.""" + max_len = max(map(len, insts)) + inst_data = np.array([list(inst) + [pad_id] * (max_len - len(inst)) for inst in insts]) + return inst_data.astype("int64").reshape([-1, max_len, 1]) diff --git a/examples/dialogue/plato-2/utils/args.py b/examples/dialogue/plato-2/utils/args.py new file mode 100644 index 0000000000000000000000000000000000000000..b112acf6ba7354f47cab8e35e502fc3ccdda1b03 --- /dev/null +++ b/examples/dialogue/plato-2/utils/args.py @@ -0,0 +1,88 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Parse argument.""" + +import argparse +import json + + +class Args(dict): + """Arguments class + + Store arguments in training / infer / ... scripts. 
+ """ + + def __getattr__(self, name): + if name in self.keys(): + return self[name] + for v in self.values(): + if isinstance(v, Args): + if name in v: + return v[name] + return None + + def get(self, key, default_value=None): + """Get the value of corresponding key.""" + if key in self.keys(): + return self[key] + for v in self.values(): + if isinstance(v, Args): + if key in v: + return v[key] + return default_value + + def __setattr__(self, name, value): + self[name] = value + + def save(self, filename): + with open(filename, "w") as fp: + json.dump(self, fp, ensure_ascii=False, indent=4, sort_keys=False) + + def load(self, filename, group_name=None): + if group_name is not None: + if group_name not in self: + self[group_name] = Args() + self[group_name].load(filename) + return + with open(filename, "r") as fp: + params_dict = json.load(fp) + for k, v in params_dict.items(): + if isinstance(v, dict): + self[k].update(Args(v)) + else: + self[k] = v + + +def parse_args(parser: argparse.ArgumentParser, allow_unknown=False) -> Args: + """Parse hyper-parameters from cmdline.""" + if allow_unknown: + parsed, _ = parser.parse_known_args() + else: + parsed = parser.parse_args() + args = Args() + optional_args = parser._action_groups[1] + for action in optional_args._group_actions[1:]: + arg_name = action.dest + args[arg_name] = getattr(parsed, arg_name) + for group in parser._action_groups[2:]: + group_args = Args() + for action in group._group_actions: + arg_name = action.dest + group_args[arg_name] = getattr(parsed, arg_name) + if len(group_args) > 0: + if group.title in args: + args[group.title].update(group_args) + else: + args[group.title] = group_args + return args diff --git a/examples/dialogue/plato-2/utils/masking.py b/examples/dialogue/plato-2/utils/masking.py new file mode 100644 index 0000000000000000000000000000000000000000..fb6be808448a84bbc0039eafbfaf65ecdbcbecd7 --- /dev/null +++ b/examples/dialogue/plato-2/utils/masking.py @@ -0,0 +1,119 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
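`parse_args` above keeps plain optional arguments at the top level of the returned `Args` mapping and folds each named `argparse` group into a nested `Args`, whose entries remain reachable as attributes from the top level. A rough usage sketch with made-up option names, assuming the module is importable as `utils.args` (the import path used elsewhere in this example):

```python
import argparse

from utils.args import parse_args

parser = argparse.ArgumentParser()
parser.add_argument("--run_name", type=str, default="demo")  # stays at the top level
group = parser.add_argument_group("Model")                    # becomes args["Model"]
group.add_argument("--hidden_size", type=int, default=768)

args = parse_args(parser, allow_unknown=True)
print(args.run_name)           # "demo"
print(args.Model.hidden_size)  # 768, read from the nested group
print(args.hidden_size)        # 768 as well: lookup falls through to nested groups
```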
+"""Reader utils.""" + +import numpy as np + + +def mask( + batch_tokens, + vocab_size, + bos_id=1, + eos_id=2, + mask_id=3, + sent_b_starts=None, + labels=None, + is_unidirectional=False, + use_latent=False, + use_bow=False, +): + """ + Add mask for batch_tokens, return out, mask_label, mask_pos; + Note: mask_pos responding the batch_tokens after padded; + """ + batch_tokens = np.copy(batch_tokens) + max_len = max(map(len, batch_tokens)) + mask_label = [] + mask_pos = [] + if labels is not None: + label_pos = [] + + if is_unidirectional: + # unidirectional language model + if use_latent: + max_len += 1 + shift_len = 1 + else: + shift_len = 0 + for sent_index, sent in enumerate(batch_tokens): + sent_b_index = sent_b_starts[sent_index] if sent_b_starts is not None else 0 + if labels is not None: + label_pos.append(sent_index * max_len + len(sent) - 1 + shift_len) + mask_label.extend(sent[sent_b_index + 1 :]) + mask_pos.extend([sent_index * max_len + i + shift_len for i in range(sent_b_index, len(sent) - 1)]) + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return_list = [mask_label, mask_pos] + + # latent related (bow label and pos) + if use_latent and use_bow: + bow_label = [] + bow_pos = [] + for sent_index, sent in enumerate(batch_tokens): + sent_b_index = sent_b_starts[sent_index] if sent_b_starts is not None else 0 + + def __filter__(tok_id): + # TODO: exclude [EOS] from bow loss + return True + + bow_pos.extend([sent_index for i in range(sent_b_index + 1, len(sent)) if __filter__(sent[i])]) + bow_label.extend([sent[i] for i in range(sent_b_index + 1, len(sent)) if __filter__(sent[i])]) + bow_label = np.array(bow_label).astype("int64").reshape([-1, 1]) + bow_pos = np.array(bow_pos).astype("int64").reshape([-1, 1]) + return_list += [bow_label, bow_pos] + else: + # bidirectional mask language model + total_token_num = sum(map(len, batch_tokens)) + prob_mask = np.random.rand(total_token_num) + # TODO: fix replace_ids, include [UNK] + replace_ids = np.random.randint(3, high=vocab_size, size=total_token_num) + prob_index = 0 + for sent_index, sent in enumerate(batch_tokens): + # add pair label position + if labels is not None: + label_pos.append(sent_index * max_len) + + # add mask label and position + for token_index, token in enumerate(sent): + if token == eos_id or token == bos_id: + continue + prob = prob_mask[prob_index + token_index] + if prob > 0.15: + continue + elif 0.03 < prob <= 0.15: + # mask + mask_label.append(sent[token_index]) + sent[token_index] = mask_id + mask_pos.append(sent_index * max_len + token_index) + elif 0.015 < prob <= 0.03: + # random replace + mask_label.append(sent[token_index]) + sent[token_index] = replace_ids[prob_index + token_index] + mask_pos.append(sent_index * max_len + token_index) + else: + # keep the original token + mask_label.append(sent[token_index]) + mask_pos.append(sent_index * max_len + token_index) + + prob_index += len(sent) + + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return_list = [batch_tokens, mask_label, mask_pos] + + if labels is not None: + label_pos = np.array(label_pos).astype("int64").reshape([-1, 1]) + assert len(labels) == len(label_pos) + return_list.append(label_pos) + return return_list diff --git a/examples/dialogue/plato-2/utils/tokenization.py b/examples/dialogue/plato-2/utils/tokenization.py new file mode 100644 index 
0000000000000000000000000000000000000000..7d4741ba984ef17c75a6d2967f573f1f343b895b --- /dev/null +++ b/examples/dialogue/plato-2/utils/tokenization.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes.""" + +import collections +import sentencepiece as spm +import unicodedata + +from utils.args import str2bool + + +def clean_text(text): + """Performs invalid character removal and whitespace cleanup on text.""" + text = text.replace("“", '"').replace("”", '"').replace("‘", "'").replace("’", "'").replace("—", "-") + + output = [] + for char in text: + if _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +def preprocess_text(inputs, remove_space=True, lower=False): + """preprocess data by removing extra space and normalize data.""" + outputs = inputs + if remove_space: + outputs = " ".join(inputs.strip().split()) + + outputs = unicodedata.normalize("NFKD", outputs) + outputs = "".join([c for c in outputs if not unicodedata.combining(c)]) + if lower: + outputs = outputs.lower() + + return outputs + + +def encode_pieces(spm_model, text, return_unicode=True, sample=False): + """turn sentences into word pieces.""" + # liujiaxiang: add for ernie-albert, mainly consider for “/”/‘/’/— causing too many unk + text = clean_text(text) + + if not sample: + pieces = spm_model.EncodeAsPieces(text) + else: + pieces = spm_model.SampleEncodeAsPieces(text, 64, 0.1) + + return pieces + + +def encode_ids(spm_model, text, sample=False): + """turn sentences into word pieces.""" + pieces = encode_pieces(spm_model, text, return_unicode=False, sample=sample) + ids = [spm_model.PieceToId(piece) for piece in pieces] + return ids + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + fin = open(vocab_file, "r", encoding="UTF-8") + for num, line in enumerate(fin): + items = convert_to_unicode(line.rstrip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +class SentencePieceTokenizer(object): + """Runs end-to-end tokenziation.""" + + @classmethod + def add_cmdline_args(cls, parser): + """Add cmdline argurments.""" + group = parser.add_argument_group("Tokenizer") + group.add_argument("--vocab_path", type=str, required=True) + group.add_argument("--do_lower_case", 
type=str2bool, default=False) + group.add_argument("--spm_model_file", type=str, required=True) + return group + + def __init__(self, args): + self.spm_model = spm.SentencePieceProcessor() + self.spm_model.Load(args.spm_model_file) + self.vocab = load_vocab(args.vocab_path) + self.do_lower_case = args.do_lower_case + self.inv_vocab = {v: k for k, v in self.vocab.items()} + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = preprocess_text(text, lower=self.do_lower_case) + return encode_pieces(self.spm_model, text, return_unicode=True) + + def convert_tokens_to_ids(self, tokens): + """Convert tokens to ids.""" + ret = [] + unk_id = self.vocab[""] + for token in tokens: + if token in self.vocab: + ret.append(self.vocab[token]) + else: + ret.append(unk_id) + return ret + + def convert_ids_to_tokens(self, ids): + """Convert ids to tokens.""" + return convert_by_vocab(self.inv_vocab, ids) + + def merge_subword(self, tokens): + """Merge subword.""" + ret = [] + for token in tokens: + if token.startswith("▁"): + ret.append(token[1:]) + else: + if len(ret): + ret[-1] += token + else: + ret.append(token) + + ret = [token for token in ret if token] + return ret + + def convert_ids_to_str(self, ids): + """Convert ids to string.""" + tokens = self.convert_ids_to_tokens(ids) + tokens = self.merge_subword(tokens) + res = " ".join(tokens).replace("", "") + res = res.replace("", "\n").replace("\n ", "\n").strip() + return res + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False diff --git a/examples/dialogue/plato-xl b/examples/dialogue/plato-xl new file mode 100644 index 0000000000000000000000000000000000000000..3225c24e6351bef4a4c8b0031723042adab8fa06 --- /dev/null +++ b/examples/dialogue/plato-xl @@ -0,0 +1 @@ +../../model_zoo/plato-xl \ No newline at end of file diff --git a/examples/dialogue/unified_transformer/README.md b/examples/dialogue/unified_transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..45cf6f32b2399ed35e48f383ed32e8b5af5c2e44 --- /dev/null +++ b/examples/dialogue/unified_transformer/README.md @@ -0,0 +1,217 @@ +# UnifiedTransformer + +## 模型简介 + +近年来,人机对话系统受到了学术界和产业界的广泛关注并取得了不错的发展。开放域对话系统旨在建立一个开放域的多轮对话系统,使得机器可以流畅自然地与人进行语言交互,既可以进行日常问候类的闲聊,又可以完成特定功能,以使得开放域对话系统具有实际应用价值。具体的说,开放域对话可以继续拆分为支持不同功能的对话形式,例如对话式推荐,知识对话技术等,如何解决并有效融合以上多个技能面临诸多挑战。 + +[UnifiedTransformer](https://arxiv.org/abs/2006.16779)以[Transformer](https://arxiv.org/abs/1706.03762) 编码器为网络基本组件,采用灵活的注意力机制,十分适合对话生成任务。 + +本项目是UnifiedTransformer在 Paddle 2.0上的开源实现,介绍了如何使用UnifiedTransformer在DuConv任务型对话数据集上进行微调,并给出了一个搭建简单中文聊天机器人的例子。 + +## 快速开始 + +### 环境依赖 + +- sentencepiece +- termcolor + +安装方式:`pip install sentencepiece termcolor` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. 
+├── finetune.py # 模型finetune主程序入口 +├── infer.py # 模型预测主程序入口 +├── utils.py # 定义参数及一些工具函数 +└── README.md # 文档说明 +``` + +### 数据准备 + +**DuConv**是百度发布的基于知识图谱的主动聊天任务数据集,让机器根据构建的知识图谱进行主动聊天,使机器具备模拟人类用语言进行信息传递的能力。数据集的创新性是:强调了bot的主动性,并且在闲聊对话中引入了明确的对话目标,即将对话引导到特定实体上。数据中的知识信息来源于电影和娱乐人物领域有聊天价值的知识信息,如票房、导演、评价等,以三元组SPO的形式组织,对话目标中的话题为电影或娱乐人物实体。数据集中共有3万session,约12万轮对话,划分为训练集、开发集、测试集1和测试集2,其中测试集1中包含对话的response,而测试集2中只有对话历史。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了DuConv数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds, test1_ds, test2_ds = load_dataset('duconv', splits=('train', 'dev', 'test_1', 'test_2')) +``` + +### 预训练模型 + +以下是PaddleNLP支持的对话类预训练模型: + +|模型名称| 模型参数 | 模型特点 | +|:-----:|:------:|:-------:| +|unified_transformer-12L-cn| 12-layers, 12-heads, 768-hidden| 在千万级别的中文会话数据上进行预训练。| +|unified_transformer-12L-cn-luge| 12-layers, 12-heads, 768-hidden|由unified_transformer-12L-cn预训练模型在千言对话数据集上进行微调。并且模型输入中加入了标识不同对话技能的special token,使得模型能同时支持闲聊对话、推荐对话和知识对话。| +|plato-mini| 6-layers, 12-heads, 768-hidden|在十亿级别的中文对话数据上进行预训练。参数量更小,但效果更好。只支持闲聊型对话。| + +### 模型训练 + +运行如下命令即可在训练集上进行finetune,并在验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu '1,2'` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' --log_dir ./log finetune.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --save_dir=./checkpoints \ + --logging_steps=100 \ + --save_steps=1000 \ + --seed=2021 \ + --epochs=3 \ + --batch_size=16 \ + --lr=5e-5 \ + --weight_decay=0.01 \ + --warmup_steps=2500 \ + --max_grad_norm=0.1 \ + --max_seq_len=512 \ + --max_response_len=128 \ + --max_knowledge_len=256 \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU +- `log_dir` 指示了日志保存目录 +- `model_name_or_path` 指示了finetune使用的预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `lr` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_steps` 表示学习率逐渐升高到基础学习率(即上面配置的lr)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_grad_norm` 表示梯度裁剪允许的最大梯度值。 +- `max_seq_len` 表示输入序列的最大长度。 +- `max_response_len` 表示输入response的最大长度。 +- `max_knowledge_len` 表示输入knowledge序列的最大长度。 +- `device` 表示使用的设备。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中,其中loss最小的模型会被保存在`save_dir/model_best`中。如: + +```text +./checkpoints/ +├── model_1000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── spm.model +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,只需指定`model_name_or_path`为本地微调模型的路径即可。 + +### 模型预测 + +运行如下命令即可在测试集上进行测试。 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python infer.py \ + --model_name_or_path=./checkpoints/model_best \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_return_sequences=20 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `seed` 表示随机数生成器的种子。 +- `max_seq_len` 表示输入序列的最大长度。 +- `max_knowledge_len` 表示输入knowledge序列的最大长度。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `min_dec_len` 表示预测生成的句子的最小长度。 +- `max_dec_len` 表示预测生成的句子的最大长度。 +- `num_return_sequences` 表示每条样本生成的句子的数量。对于每条样本,模型会生成`num_return_sequences`个句子,根据每个句子的概率得分进行排序,得分最高的句子作为最终的生成结果。 +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 +- `device` 表示使用的设备。 + +同时,我们提供了基于 FastGeneration 的高性能预测的选项,可以选择性开启是否需要采用高性能预测,PaddleNLP 提供了 JIT 的方式,可以自动完成对所需的自定义 op 的动态库编译: +- `faster` 表示是否开启高性能预测。设置 `--faster` 即表示开启。 +- `use_fp16_decoding` 表示在开启高性能预测的时候,是否使用 fp16 来完成预测过程。设置 `--use_fp16_decoding` 即表示使用 fp16 进行预测,否则使用 fp32。 + +程序运行结束后会将预测生成的response保存在`output_path`中。同时终端中会输出评估结果。 + +采用预训练模型及微调模型在测试集上有如下结果: + +| model_name_or_path | BLEU1 / BLEU2 | DISTINCT1 / DISTINCT2 | +| :-----------------------------: | :-------------: | :-------------------: | +| unified_transformer-12L-cn-luge | 0.2606 / 0.1576 | 0.1168 / 0.2977 | +| ./checkpoints/model_best | 0.2808 / 0.1744 | 0.1124 / 0.2899 | + +**NOTE:** `./checkpoints/model_best`是按本项目中的超参在单卡上finetune得到的结果。 + +### 人机交互 + +运行如下命令即可开始与聊天机器人用中文进行简单的对话。 + +```shell +# GPU启动,仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python interaction.py \ + --model_name_or_path=plato-mini \ + --min_dec_len=0 \ + --max_dec_len=64 \ + --num_return_sequences=20 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + | plato-mini | + +- `min_dec_len` 表示预测生成的句子的最小长度。 +- `max_dec_len` 表示预测生成的句子的最大长度。 +- `num_return_sequences` 表示每条样本生成的句子的数量。对于每条样本,模型会生成`num_return_sequences`个句子,根据每个句子的概率得分进行排序,得分最高的句子作为最终的生成结果。 +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 +- `device` 表示使用的设备。 + +**NOTE:** 输入"[EXIT]"退出交互程序,输入"[NEXT]"开启下一轮新的对话。需要注意使用退格会导致错误。 + +## Reference + +- [UnifiedTransformer](https://arxiv.org/abs/2006.16779) +- [Knover/luge-dialogue](https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue) +- [DuConv](https://www.aclweb.org/anthology/P19-1369/) diff --git a/examples/dialogue/unified_transformer/finetune.py b/examples/dialogue/unified_transformer/finetune.py new file mode 100644 index 
0000000000000000000000000000000000000000..daeb700d3605878ffbc3293e232f4ba0778757da --- /dev/null +++ b/examples/dialogue/unified_transformer/finetune.py @@ -0,0 +1,165 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import math +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import load_dataset +from paddle.optimizer import AdamW +from paddle.optimizer.lr import NoamDecay +from utils import create_data_loader, print_args, set_seed + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn-luge', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--lr', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_steps', type=int, default=2500, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=0.1, help='The max value of grad norm.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_response_len', type=int, default=128, help='The maximum response sequence length of training.') + parser.add_argument('--max_knowledge_len', type=int, default=256, help='The maximum knowledge sequence length of training.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + + args = parser.parse_args() + return args +# yapf: enable + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def train(args): + paddle.set_device(args.device) + world_size = 
dist.get_world_size() + if world_size > 1: + dist.init_parallel_env() + + set_seed(args.seed) + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + train_ds, dev_ds = load_dataset("duconv", split=("train", "dev")) + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "dev") + + lr_scheduler = NoamDecay(1 / (args.warmup_steps * (args.lr**2)), args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + best_ppl = 1e9 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + + logits = model(*inputs[:-1]) + loss = F.cross_entropy(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + if step % args.save_steps == 0: + ppl = evaluation(model, dev_data_loader) + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + if ppl < best_ppl: + best_ppl = ppl + save_ckpt(model, tokenizer, args.save_dir, "best") + print("Saved step {} as best model.\n".format(step)) + batch_start_time = time.time() + print("\nTraining completed.") + + +@paddle.no_grad() +def evaluation(model, data_loader): + print("\nEval begin...") + model.eval() + total_tokens = 0 + total_loss = 0.0 + start_time = time.time() + step = 0 + for inputs in data_loader: + step += 1 + labels = inputs[-1] + + logits = model(*inputs[:-1]) + loss = F.cross_entropy(logits, labels, reduction="sum") + + total_loss += loss.item() + total_tokens += labels.shape[0] + + avg_loss = total_loss / total_tokens + ppl = math.exp(avg_loss) + avg_speed = (time.time() - start_time) / step + print("loss: %.4f - ppl: %.4f - %.3fs/step" % (avg_loss, ppl, avg_speed)) + model.train() + return ppl + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + train(args) diff --git a/examples/dialogue/unified_transformer/infer.py b/examples/dialogue/unified_transformer/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..3781f73e16aa998d84fd18a732c74e64b1c8283e --- /dev/null +++ b/examples/dialogue/unified_transformer/infer.py @@ -0,0 +1,146 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import paddle +from datasets import load_dataset +from utils import create_data_loader, print_args, select_response, set_seed + +from paddlenlp.metrics import BLEU, Distinct +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn-luge', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_response_len', type=int, default=128, help='The maximum response sequence length of training.') + parser.add_argument('--max_knowledge_len', type=int, default=256, help='The maximum knowledge sequence length of training.') + parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') + parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') + parser.add_argument('--num_return_sequences', type=int, default=20, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--faster', action='store_true', help='Whether to process inference using faster transformer. ') + parser.add_argument('--use_fp16_decoding', action='store_true', help='Whether to use fp16 when using faster transformer. Only works when using faster transformer. 
') + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_and_distinct(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu1 = BLEU(n_size=1) + bleu2 = BLEU(n_size=2) + distinct1 = Distinct(n_size=1) + distinct2 = Distinct(n_size=2) + for pred, target in zip(preds, targets): + pred_tokens = pred.split() + target_token = target.split() + + bleu1.add_inst(pred_tokens, [target_token]) + bleu2.add_inst(pred_tokens, [target_token]) + + distinct1.add_inst(pred_tokens) + distinct2.add_inst(pred_tokens) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-1:", bleu1.score()) + print("BLEU-2:", bleu2.score()) + print("DISTINCT-1:", distinct1.score()) + print("DISTINCT-2:", distinct2.score()) + + +@paddle.no_grad() +def infer(args): + paddle.set_device(args.device) + set_seed(args.seed) + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + test_ds = load_dataset("duconv", split="test_1") + test_ds, test_data_loader = create_data_loader(test_ds, tokenizer, args, "test") + + model.eval() + total_time = 0.0 + start_time = time.time() + pred_responses = [] + for step, inputs in enumerate(test_data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask, seq_len = inputs + output = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + seq_len=seq_len, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fp16_decoding=args.use_fp16_decoding, + use_fast=args.faster, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + ids, scores = output + results = select_response(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_responses.extend(results) + + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for response in pred_responses: + fout.write(response + "\n") + print("\nSave inference result into: %s" % args.output_path) + + target_responses = [example["response"] for example in test_ds] + calc_bleu_and_distinct(pred_responses, target_responses) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + infer(args) diff --git a/examples/dialogue/unified_transformer/interaction.py b/examples/dialogue/unified_transformer/interaction.py new file mode 100644 index 0000000000000000000000000000000000000000..cde62e5057d6ecc5cf143b5c80519a46952fa5f5 --- /dev/null +++ b/examples/dialogue/unified_transformer/interaction.py @@ -0,0 +1,107 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from termcolor import colored, cprint +from utils import print_args, select_response, set_seed + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='plato-mini', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--seed', type=int, default=None, help='Random seed for initialization.') + parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') + parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') + parser.add_argument('--num_return_sequences', type=int, default=20, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=5, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + + args = parser.parse_args() + return args +# yapf: enable + + +def interaction(args, model, tokenizer): + history = [] + start_info = "Enter [EXIT] to quit the interaction, [NEXT] to start a new conversation." 
+ cprint(start_info, "yellow", attrs=["bold"]) + while True: + user_utt = input(colored("[Human]: ", "red", attrs=["bold"])).strip() + if user_utt == "[EXIT]": + break + elif user_utt == "[NEXT]": + history = [] + cprint(start_info, "yellow", attrs=["bold"]) + else: + history.append(user_utt) + inputs = tokenizer.dialogue_encode( + history, add_start_token_as_response=True, return_tensors=True, is_split_into_words=False + ) + inputs["input_ids"] = inputs["input_ids"].astype("int64") + ids, scores = model.generate( + input_ids=inputs["input_ids"], + token_type_ids=inputs["token_type_ids"], + position_ids=inputs["position_ids"], + attention_mask=inputs["attention_mask"], + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fast=True, + ) + bot_response = select_response( + ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences, keep_space=False + )[0] + print(colored("[Bot]:", "blue", attrs=["bold"]), colored(bot_response, attrs=["bold"])) + history.append(bot_response) + return + + +def main(args): + paddle.set_device(args.device) + if args.seed is not None: + set_seed(args.seed) + + # Initialize the model and tokenizer + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + model.eval() + interaction(args, model, tokenizer) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + main(args) diff --git a/examples/dialogue/unified_transformer/utils.py b/examples/dialogue/unified_transformer/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..90585d69e0ee684639151a47127b979ba44df165 --- /dev/null +++ b/examples/dialogue/unified_transformer/utils.py @@ -0,0 +1,265 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np + +import paddle +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
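+    # For example, with seed=2021 and four trainers, ranks 0-3 seed paddle with 2021-2024.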
+ paddle.seed(seed + dist.get_rank()) + + +def preprocess_examples(examples, mode="train"): + """ + For training set and dev set, treat each utterance of the first speaker as + the response, and concatenate the goal, knowledge and the dialog’s previous + utterances as the history. In this way, multiple history-response pairs + are constructed. + """ + if mode == "test": + return examples + new_examples = {} + goal = [] + knowledge = [] + history = [] + response = [] + + conv = examples["conversation"] + for index, conversation in enumerate(conv): + for i in range(0, len(conversation), 2): + goal.append(examples["goal"][index]) + knowledge.append(examples["knowledge"][index]) + history.append(conversation[:i]) + response.append(conversation[i]) + new_examples["goal"] = goal + new_examples["knowledge"] = knowledge + new_examples["history"] = history + new_examples["response"] = response + + return new_examples + + +def convert_example(example, tokenizer, max_seq_len=512, max_response_len=128, max_knowledge_len=256, mode="train"): + """Convert all examples into necessary features.""" + goal = example["goal"] + knowledge = example["knowledge"] + goal_knowledge = " ".join([" ".join(lst) for lst in goal + knowledge]) + + if mode != "test": + tokenized_example = tokenizer.dialogue_encode( + example["history"], + response=example["response"], + knowledge=goal_knowledge, + task_type="knowledge", + max_seq_len=max_seq_len, + max_response_len=max_response_len, + max_knowledge_len=max_knowledge_len, + return_length=True, + ) + response_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + response_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(response_start, response_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][response_start + 1 : response_end] + return tokenized_example + else: + tokenized_example = tokenizer.dialogue_encode( + example["history"], + knowledge=goal_knowledge, + task_type="knowledge", + max_seq_len=max_seq_len, + max_knowledge_len=max_knowledge_len, + add_start_token_as_response=True, + return_length=True, + ) + + if "response" in example: + tokenized_example["response"] = example["response"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). 
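+        # For example, with two padded sequences and max_len = 4 the mask built
+        # above has shape (2, 4, 4); after the expand_dims below it becomes
+        # (2, 1, 4, 4) and broadcasts over the n_head axis of the attention scores.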
+ attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode != "test": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + else: + seq_len = np.asarray([example["seq_len"] for example in batch_examples]).astype("int32") + return input_ids, token_type_ids, position_ids, attention_mask, seq_len + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func1 = partial(preprocess_examples, mode=mode) + trans_func2 = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_response_len=args.max_response_len, + max_knowledge_len=args.max_knowledge_len, + mode=mode, + ) + remove_columns = None + if mode in ["train", "dev"]: + remove_columns = ["id", "conversation"] + + dataset = dataset.map(trans_func1, batched=True, batch_size=None, remove_columns=remove_columns).map(trans_func2) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return token_ids, tokens + + +def get_in_turn_repetition(pred, is_cn=False): + """Get in-turn repetition.""" + if len(pred) == 0: + return 1.0 + if isinstance(pred[0], str): + pred = [tok.lower() for tok in pred] + if is_cn: + pred = "".join(pred) + tri_grams = set() + for i in range(len(pred) - 2): + tri_gram = tuple(pred[i : i + 3]) + if tri_gram in tri_grams: + return True + tri_grams.add(tri_gram) + return False + + +def select_response(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1, keep_space=True): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) + num_token = len(pred_token_ids) + if keep_space: + response = " ".join(pred_tokens) + else: + response = "".join(pred_tokens) + + in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + elif in_turn_repetition: + score -= 1e3 + + tmp.append([response, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) + num_token = len(pred_token_ids) + if keep_space: + response = " ".join(pred_tokens) + else: + response = "".join(pred_tokens) + + in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) + + last_pos = 0 + if (max_dec_len is not None and num_token >= max_dec_len) or in_turn_repetition: + tmp.append([response]) + else: + tmp.insert(last_pos, [response]) + last_pos += 1 + + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + return results diff --git a/examples/few_shot/README.md b/examples/few_shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..317ce55d06837bfc1f5f49aec4b8f0fa79982025 --- /dev/null +++ b/examples/few_shot/README.md @@ -0,0 +1,34 @@ +# Few-Shot Learning (FSL) + +Few-Shot Learning 旨在研究如何从少量有监督的训练样本中学习出具有良好泛化性的模型,对训练数据很少或监督数据获取成本极高的应用场景有很大价值。 + +随着大规模预训练模型的不断涌现,FSL 结合预训练模型的先验知识和强大的泛化能力在下游任务效果上取得了显著提升,为大规模预训练模型结合 FSL 的工业落地应用带来了无限可能性。 + +我们旨在为 FSL 领域的研究者提供简单易用、全面、前沿的 FSL 策略库,便于研究者基于 FSL 策略库将注意力集中在算法创新上。我们会持续开源 FSL 领域的前沿学术工作,并在中文小样本学习测评基准 [FewCLUE](https://github.com/CLUEbenchmark/FewCLUE) 上进行评测。 + +## Benchmark +我们在 FewCLUE 9 个任务的 test_public.json 测试集上进行了效果评测 + +| 算法 | 预训练模型 | eprstmt | csldcp | iflytek | tnews | ocnli | bustm | chid | csl | cluewsc | +| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |------------ | ------------ | ---------- | +| PET | ERNIE-1.0-Large-CW | 88.03 | 63.79 | 56.43 | 56.57 | 56.27 | 72.69 | 91.39 | 76.00 | 78.79 | +| 
P-Tuning | ERNIE-1.0-Large-CW | 89.84 | 64.57 | 45.80 | 57.41 | 44.13 | 68.51 | 90.00 | 74.67 | 73.26 | +| EFL | ERNIE-1.0-Large-CW | 90.82 | 54.48 | 46.71 | 54.43 | 43.17 | 72.63 | 85.71 | 61.52 | 80.02 | + +**注释**: +- 表格中 CHID 数据集的指标与 FewCLUE 榜单指标计算方式不同。 +- 由于 iflytek 和 csldcp 标签数较多,每条样本采样 5 个非正确标签作为负样本训练评测。 +- 为统一配置,除 EFL-iflytek 外均训练 1000 steps,EFL-iflytek 训练 5000 steps。 + +## Models +- [P-tuning](./p-tuning) +- [EFL](./efl) +- [PET](./pet) + +## References + +- [1] X. Liu et al., “GPT Understands, Too,” arXiv:2103.10385 [cs], Mar. 2021, Accessed: Mar. 22, 2021. [Online]. Available: http://arxiv.org/abs/2103.10385. + +- [2] Wang, Sinong, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. “Entailment as Few-Shot Learner.” ArXiv:2104.14690 [Cs], April 29, 2021. http://arxiv.org/abs/2104.14690. + +- [3] Schick, Timo, and Hinrich Schütze. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” ArXiv:2001.07676 [Cs], January 25, 2021. http://arxiv.org/abs/2001.07676. diff --git a/examples/few_shot/RGL/README.md b/examples/few_shot/RGL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..14c183d8a6ed227bbc1f236c101d4b46cbf3938d --- /dev/null +++ b/examples/few_shot/RGL/README.md @@ -0,0 +1,129 @@ +# RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning + +This is the implementation of the paper [RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning](https://aclanthology.org/2022.findings-naacl.81/). + +**************************** Updates ***************************** + +2022-07-11: Our training code has been released. + +2022-04-08: Our paper has been accepted to Findings of [NAACL 2022](https://aclanthology.org/2022.findings-naacl.81/)! + +# Overview + +
+*Figure: overview of the RGL framework.*
+ +We propose a simple yet effective Relation Graph augmented Learning RGL method that can obtain better performance in few-shot natural language understanding tasks. + +RGL constructs a relation graph based on the label consistency between samples in the same batch, and learns to solve the resultant node classification and link prediction problems of the relation graphs. In this way, RGL fully exploits the limited supervised information, which can boost the tuning effectiveness. + +# Prepare the data + +We evaluate on the GLUE variant for few-shot learning in the paper, including SST-2, SST-5, MR, CR, MPQA, Subj, TREC, CoLA, MNLI, MNLI-mm, SNLI, QNLI, RTE, MRPC, QQP and STS-B. Please download the [datasets](https://paddlenlp.bj.bcebos.com/datasets/k-shot-glue/rgl-k-shot.zip) and extract the data files to the path ``./data/k-shot``. + + +# Experiments + +The structure of the code: + +``` +├── scripts/ +│ ├── run_pet.sh # Script for PET +│ └── run_rgl.sh # Script for RGL +├── template.py # The parser for prompt template +├── verbalizer.py # The mapping from labels to corresponding words +├── tokenizer.py # The tokenizer wrapeer to conduct text truncation +├── utils.py # The tools +└── rgl.py # The training process of RGL +``` + +## How to define a template + +We inspire from [OpenPrompt](https://github.com/thunlp/OpenPrompt/tree/main) and define template as a list of dictionary. The key of raw texts in datasets is `text`, and the corresponding value is the keyword of text in loaded dataset, where we use `text_a` to denote the first sentence in every example and `text_b` to denote the other sentences by default. + +For example, given the template ``{'text':'text_a'} It was {'mask'}.`` and a sample text ``nothing happens , and it happens to flat characters .`` the input text will be ``nothing happens , and it happens to flat characters . It was .`` + + +## Quick start + +Run the following code for prompt-tuning. + +``` +export CUDA_VISIBLE_DEVICES=0 +python rgl.py \ +--output_dir ./checkpoints/ \ +--dataset SST-2 \ +--data_path ./data/k-hot/SST-2/16-13/ \ +--max_seq_length 128 \ +--max_steps 1000 \ +--logging_step 10 \ +--eval_step 100 \ +--batch_size 4 \ +--alpha 0.1 \ +--seed 13 \ +--learning_rate 1e-5 \ +--template "{'text':'text_a'} It was {'mask'}." \ +--verbalizer "{'0':'terrible','1':'great'}" +``` + +The configurations consist of: +- ``output_dir``: The directory to save model checkpoints. +- ``dataset``: The dataset name for few-shot learning. +- ``data_path``: The path to data files of ``dataset``. +- ``max_seq_length``: The maximum length of input text, including the prompt. +- ``max_steps``: The maximum steps for training. +- ``logging_step``: Print logs every ``logging_step``. +- ``eval_step``: Evaluate model every ``eval_step``. +- ``batch_size``: The number of samples per batch. +- ``alpha``: The weight of the loss proposed in RGL. +- ``seed``: Random seed. +- ``learning_rate``: The learning rate for tuning. +- ``template``: The template to define how to combine text data and prompt. +- ``verbalizer``: The verbalizer to map labels to words in vocabulary. + + +## Multiple runs for the best results + +To reproduce our experiments, you can use the scripts to get the results under different settings. We have defined the templates and the verbalizers in both ``./script/run_pet.sh`` and ``./script/run_rgl.sh``. You can refer to these scripts for more details. 
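+
+Before launching the scripts, it can help to sanity-check how a template string is parsed and wrapped around an example. The snippet below is a minimal sketch based on `template.py` and `data.py` in this directory (assuming `paddlenlp` is installed and the code is run from this directory; the sample sentence and the `demo-0` id are made-up values, and `roberta-large` is the default backbone used by `rgl.py`):
+
+```
+from data import InputExample
+from template import ManualTemplate
+
+from paddlenlp.transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("roberta-large")
+
+# Parse the template string into a list of parts (text / hard words / mask).
+template = ManualTemplate(tokenizer, template="{'text':'text_a'} It was {'mask'}.")
+print(template.template)
+
+# Wrap a single example; the first returned element holds the text pieces to
+# tokenize, tagged with their mask_ids / shortenable_ids flags.
+example = InputExample(uid="demo-0", text_a="nothing happens , and it happens to flat characters .")
+to_tokenize, not_to_tokenize = template.wrap_one_example(example)
+print(to_tokenize)
+```
+
+The wrapped parts are what `MLMTokenizerWrapper` in `tokenizer.py` then tokenizes, truncates and pads into model inputs.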
+ +### Run PET + +``` +bash ./scripts/run_pet.sh SST-2 0 +``` + +where ``SST-2`` specifies the dataset used for prompt-tuning and you can replace it with any other downloaded datasets in ``./data/k-shot/ ``. Besides, ``0`` refers to the gpu device id. + +**NOTE**: The dataset name is case-sensitive to run the scripts. + +### Run RGL + +``` +bash ./scripts/run_rgl.sh SST-2 0 +``` + +Please see the descriptions above for the arguments. + + +# Citation + +Please cite our paper if you use RGL in your work: +``` +@inproceedings{wang-etal-2022-rgl, + title = "{RGL}: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning", + author = "Wang, Yaqing and + Tian, Xin and + Xiong, Haoyi and + Li, Yueyang and + Chen, Zeyu and + Guo, Sheng and + Dou, Dejing", + booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022", + year = "2022", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2022.findings-naacl.81", + pages = "1078--1084", +} + +``` diff --git a/examples/few_shot/RGL/data.py b/examples/few_shot/RGL/data.py new file mode 100644 index 0000000000000000000000000000000000000000..32efac286aad456acbb8b057dd9b12f76bf52ece --- /dev/null +++ b/examples/few_shot/RGL/data.py @@ -0,0 +1,496 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import csv +import json +import os +from abc import abstractmethod +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +import pandas as pd +from paddle.metric import Accuracy + +from paddlenlp.datasets import MapDataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman + + +@dataclass +class InputExample(object): + """Data structure of every example in datasets.""" + + uid: str = field(default=None, metadata={"help": "A unique identifier of the example."}) + text_a: str = field(default=None, metadata={"help": "The first text sequence in each example."}) + text_b: str = field(default=None, metadata={"help": "The other text sequences in each example."}) + cls_label: int = field(default=None, metadata={"help": "The label of classification tasks."}) + seq_label: list = field(default=None, metadata={"help": "The label of generation tasks."}) + meta: dict = field(default=None, metadata={"help": "An optional dictionary of other data for each example."}) + + def __repr__(self): + content = {k: v for k, v in self.__dict__.items() if v is not None} + content = json.dumps(content, indent=2, sort_keys=True) + "\n" + return str(content) + + def keys(self, keep_none=False): + return [key for key in self.__dict__.keys() if getattr(self, key) is not None] + + +class InputFeatures(dict): + """ + Data structure of every wrapped example or a batch of examples as the input of model. + + Args: + input_ids (paddle.Tensor): + The token ids. + attention_mask (paddle.Tensor): + The mask ids. + token_type_ids (paddle.Tensor, optional): + The token type ids. 
+ input_embeds (paddle.Tensor, optional): + The embeddings of soft tokens. + mask_ids (paddle.Tensor, optional): + The mask ids where 1 denotes that a token is a mask, 0 denotes it is not a mask. + cls_label (list, optional): + The label of classification task. + seq_label (list, optional): + The label of generation task. + uid (list, optional): + The unique id(s) for example(s). + """ + + input_keys = [ + "input_ids", + "attention_mask", + "token_type_ids", + "input_embeds", + "cls_label", + "seq_label", + "label", + "uid", + "mask_ids", + "soft_token_ids", + ] + + def __init__( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + input_embeds=None, + mask_ids=None, + label=None, + cls_label=None, + seq_label=None, + uid=None, + soft_token_ids=None, + ): + self.input_ids = input_ids + self.attention_mask = attention_mask + self.token_type_ids = token_type_ids + self.input_embeds = input_embeds + self.label = label + self.cls_label = cls_label + self.seq_label = seq_label + self.mask_ids = mask_ids + self.uid = uid + self.soft_token_ids = soft_token_ids + + @classmethod + def add_keys(cls, *args): + cls.input_keys.extend(args) + + def keys(self, keep_none=False): + if keep_none: + return self.input_keys + else: + return [key for key in self.input_keys if getattr(self, key) is not None] + + def values(self, keep_none=False): + return [getattr(self, key) for key in self.keys(keep_none=keep_none)] + + def items(self): + return [(key, getattr(self, key)) for key in self.keys()] + + def __len__(self): + return len(self.keys()) + + def __repr__(self): + return str(json.dumps(self.items()) + "\n") + + def __getitem__(self, key): + return getattr(self, key) + + def __iter__(self): + return iter(self.keys()) + + def __contains__(self, key, keep_none): + return key in self.keys(keep_none) + + def __setitem__(self, key, value): + if key not in self.input_keys: + raise KeyError("{} not in predefined keys, use add_keys to add it.".format(key)) + setattr(self, key, value) + + @staticmethod + def collate_fn(batch): + """Collate batch data in form of InputFeatures.""" + new_batch = {} + for key in batch[0]: + values = [b[key] for b in batch] + try: + new_batch[key] = paddle.to_tensor(values) + except ValueError: + new_batch[key] = values + + return InputFeatures(**new_batch) + + +class DataProcessor(object): + """Base class for reading datasets from files.""" + + def __init__(self, labels=None): + self._labels = labels + if labels is not None: + self._labels = sorted(labels) + + @property + def labels(self): + if not getattr(self, "_labels"): + raise ValueError("labels and label_mappings are not setted yet.") + return self._labels + + @labels.setter + def labels(self, labels): + if labels is not None: + self._labels = sorted(labels) + + @property + def label_mapping(self): + if not getattr(self, "_labels"): + raise ValueError("labels and label_mappings are not setted yet.") + if not getattr(self, "_label_mapping"): + self._label_mapping = {k: i for i, k in enumerate(self._labels)} + return self._label_mapping + + @label_mapping.setter + def label_mapping(self, label_mapping): + if getattr(self, "_labels"): + assert self._labels == sorted(list(label_mapping.keys())) + self._label_mapping = label_mapping + + @abstractmethod + def get_examples(self, data_dir, split): + raise NotImplementedError + + def get_train_examples(self, data_dir): + return self.get_examples(data_dir, "train") + + def get_dev_examples(self, data_dir): + return self.get_examples(data_dir, "dev") + + def 
get_test_exaples(self, data_dir): + return self.get_examples(data_dir, "test") + + @classmethod + def read_tsv(cls, input_file, quotechar=None): + with open(input_file, "r", encoding="utf-8-sig") as f: + data = csv.reader(f, delimiter="\t", quotechar=quotechar) + return [x for x in data] + + @classmethod + def read_csv(cls, input_file, header=None): + data = pd.read_csv(input_file, header=header) + return data.values.tolist() + + @classmethod + def read_json(cls, input_file): + with open(input_file, "r") as f: + data = [json.loads(x) for x in f.readlines()] + return data + + +class BoolQProcessor(DataProcessor): + def __init__(self): + super().__init__(["False", "True"]) + self.split_map = {"train": "train", "dev": "dev32", "test": "val"} + + def get_examples(self, data_dir, split): + split = self.split_map[split] + raw_data = self.read_json(os.path.join(data_dir, split + ".jsonl")) + examples = [] + for i, line in enumerate(raw_data): + examples.append( + InputExample( + uid="%s-%d" % (split, i), + text_a=line["passage"], + text_b=line["question"], + cls_label=str(line["label"]), + ) + ) + + return examples + + +class MrpcProcesser(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[3], text_b=line[4], cls_label=line[0])) + + return examples + + +class MnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["contradiction", "entailment", "neutral"]) + + def _process_file(self, split): + if split in ["dev", "test"]: + return split + "_matched" + return split + + def get_examples(self, data_dir, split): + split = self._process_file(split) + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[8], text_b=line[9], cls_label=line[-1]) + ) + return examples + + +class MnliMismatchedProcessor(MnliProcessor): + def _process_file(self, split): + if split == "dev": + return split + "_matched" + if split == "test": + return split + "_mismatched" + return split + + +class SnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["contradiction", "entailment", "neutral"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[7], text_b=line[8], cls_label=line[-1]) + ) + return examples + + +class ColaProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[3], text_b=None, cls_label=line[1])) + return examples + + +class Sst2Processor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[0], text_b=None, 
cls_label=line[1])) + return examples + + +class StsbProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[7], text_b=line[8], cls_label=line[-1]) + ) + return examples + + +class QqpProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + try: + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[3], text_b=line[4], cls_label=line[5]) + ) + except IndexError: + continue + return examples + + +class QnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["entailment", "not_entailment"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) + ) + return examples + + +class RteProcessor(DataProcessor): + def __init__(self): + super().__init__(["entailment", "not_entailment"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) + ) + return examples + + +class WnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) + ) + return examples + + +class TextClassificationProcessor(DataProcessor): + def __init__(self, task_name): + NUM_LABELS = {"mr": 2, "sst-5": 5, "subj": 2, "trec": 6, "cr": 2, "mpqa": 2} + assert task_name in NUM_LABELS, "task_name not supported." + self.task_name = task_name + self._labels = list(range(NUM_LABELS[self.task_name])) + + def get_examples(self, data_dir, split): + raw_data = self.read_csv(os.path.join(data_dir, split + ".csv")) + examples = [] + for i, line in enumerate(raw_data): + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[1], cls_label=line[0])) + return examples + + +# The processor mapping for datasets in RGL paper. 
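+# Keys are lowercase dataset names; load_dataset() below uses this mapping (args.dataset is lowercased in utils.check_args).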
+PROCESSOR_MAPPING = { + "mrpc": MrpcProcesser(), + "mnli": MnliProcessor(), + "mnli-mm": MnliMismatchedProcessor(), + "snli": SnliProcessor(), + "cola": ColaProcessor(), + "sst-2": Sst2Processor(), + "sts-b": StsbProcessor(), + "qqp": QqpProcessor(), + "qnli": QnliProcessor(), + "rte": RteProcessor(), + "wnli": WnliProcessor(), + "cr": TextClassificationProcessor("cr"), + "mr": TextClassificationProcessor("mr"), + "sst-5": TextClassificationProcessor("sst-5"), + "subj": TextClassificationProcessor("subj"), + "mpqa": TextClassificationProcessor("mpqa"), + "trec": TextClassificationProcessor("trec"), + "boolq": BoolQProcessor(), +} + +# The task mapping for datasets. +TASK_MAPPING = defaultdict(lambda: "classification") +TASK_MAPPING["sts-b"] = "regression" + +# The metric mapping for datasets. +METRIC_MAPPING = defaultdict(Accuracy) +METRIC_MAPPING.update( + { + "mrpc": AccuracyAndF1(name=["acc", "precision", "recall", "f1", "acc_and_f1"]), + "qqp": AccuracyAndF1(name=["acc", "precision", "recall", "f1", "acc_and_f1"]), + "cola": Mcc(), + "sts-b": PearsonAndSpearman(name=["pearson", "spearman", "corr"]), + } +) + + +def load_dataset(dataset, data_path=None, splits=[]): + """ + Read datasets from files. + + Args: + dataset (str): + The dataset name in lowercase. + data_path (str): + The path to the dataset directory, including train, dev or test file. + splits (list): + Which file(s) of dataset to read, such as ['train', 'dev', 'test']. + + """ + assert len(splits) > 0, "No splits, can not load dataset {}".format(dataset) + processor = PROCESSOR_MAPPING[dataset] + data = [] + if "train" in splits: + train_examples = processor.get_train_examples(data_path) + data.append(MapDataset(train_examples)) + if "dev" in splits: + dev_examples = processor.get_dev_examples(data_path) + data.append(MapDataset(dev_examples)) + if "test" in splits: + test_examples = processor.get_test_exaples(data_path) + data.append(MapDataset(test_examples)) + data.append(processor.labels) + return data diff --git a/examples/few_shot/RGL/rgl.py b/examples/few_shot/RGL/rgl.py new file mode 100644 index 0000000000000000000000000000000000000000..dd137c71b7003101a704f6000312a1083ebc0fe4 --- /dev/null +++ b/examples/few_shot/RGL/rgl.py @@ -0,0 +1,239 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
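+# RGL training entry point: prompt-based tuning with an optional relation-graph (contrastive) loss weighted by --alpha.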
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import METRIC_MAPPING, TASK_MAPPING, InputFeatures, load_dataset +from template import ManualTemplate +from tokenizer import MLMTokenizerWrapper +from utils import ( + LinearSchedulerWarmup, + check_args, + convert_example, + create_dataloader, + set_seed, +) +from verbalizer import ManualVerbalizer +from visualdl import LogWriter + +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser('Implementation of RGL paper.') +parser.add_argument('--seed', type=int, default=1000, help='Random seed.') +parser.add_argument('--device', type=str, default='gpu', choices=['gpu', 'cpu'], help='Device for training, default to gpu.') +parser.add_argument('--dataset', type=str, default='SST-2', help='The build-in few-shot dataset.') +parser.add_argument('--data_path', type=str, default=None, help='The path to local dataset in .tsv files.') + +parser.add_argument('--model_name_or_path', type=str, default='roberta-large', help='The build-in pretrained LM or the path to local model parameters.') +parser.add_argument('--template', type=str, default="{'text':'text_a'} It was {'mask'}.", help='The input template.') +parser.add_argument('--verbalizer', type=str, default="{'0':'terrible', '1':'great'}", help='The label mapping of output.') +parser.add_argument('--alpha', type=float, default=0, help='The weight of link prediction loss in RGL.') +parser.add_argument('--max_seq_length', type=int, default=512, help='The maximum length of input text.') +parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The maximum norm of all parameters.') + +parser.add_argument('--num_epoch', type=int, default=0, help='The number of epoch for training.') +parser.add_argument('--max_steps', type=int, default=1000, help='Maximum steps, which overwrites num_epoch.') +parser.add_argument('--batch_size', type=int, default=32, help='The number of samples used per step.') +parser.add_argument('--learning_rate', type=float, default=1e-5, help='The learning rate of optimizer.') +parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay if we apply some.') +parser.add_argument('--warmup_steps', type=int, default=0, help='The warmup steps for leanring rate scheduler.') +parser.add_argument('--logging_step', type=int, default=100, help='Print logs every logging_step steps.') +parser.add_argument('--eval_step', type=int, default=100, help='Evaluate model every eval_step steps.') +parser.add_argument('--save_best', action='store_true', help='Save the best model according to evaluation results. 
Save the last checkpoint if False.') +parser.add_argument('--output_dir', type=str, default='./checkpoints/', help='The path to save checkpoints.') +parser.add_argument('--overwrite_output', action='store_true', help='Whether overwrite the output_dir.') +args = parser.parse_args() +# yapf: enable + +check_args(args) +for arg in vars(args): + logger.info(format(arg, "<20") + format(str(getattr(args, arg)), "<")) + + +@paddle.no_grad() +def evaluate(model, dataloader, metric, verbalizer, task_type, bound=(0, 5)): + if task_type == "regression": + logsoftmax = nn.LogSoftmax(axis=-1) + lb, ub = bound + model.eval() + metric.reset() + for batch in dataloader: + logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) + label_logits = verbalizer.process_logits(logits, batch["mask_ids"]) + if task_type == "regression": + label_logits = logsoftmax(label_logits) + label_logits = paddle.exp(label_logits[..., 1].unsqueeze(-1)) * (ub - lb) + lb + correct = metric.compute(label_logits, batch["label"]) + metric.update(correct) + score = metric.accumulate() + score = score if isinstance(score, (list, tuple)) else [score] + logger.info("{:>20}".format("Evaluation results:")) + for name, value in zip(metric.name(), score): + logger.info("{:>20} = {:.6f}".format(name, value)) + model.train() + return score[0] + + +def contrastive_loss(sentence_embeddings, labels, task_type="classification"): + """Compute the loss proposed in RGL method.""" + + def _raw_equal(x, y): + return int(x == y) + + def _max_equal(x, y): + return int(np.argmax(x, axis=0) == np.argmax(y, axis=0)) + + equal_int = _raw_equal if task_type == "classification" else _max_equal + bce_metric = nn.CrossEntropyLoss() + cos_metric = nn.CosineSimilarity(axis=0, eps=1e-6) + batch_size = sentence_embeddings.shape[0] + loss = 0 + for i in range(batch_size): + for j in range(batch_size): + score = cos_metric(sentence_embeddings[i], sentence_embeddings[j]) + score = score.unsqueeze(0) + logits = paddle.concat([(1 - score) * 50, (1 + score) * 50], axis=-1) + label = paddle.to_tensor(equal_int(labels[i], labels[j])) + loss += bce_metric(logits.reshape([-1, logits.shape[-1]]), label.unsqueeze(0)) + loss = loss / (batch_size * (batch_size - 1)) + loss = loss / 100 + return loss + + +def main(): + paddle.set_device(args.device) + set_seed(args.seed) + + task_type = TASK_MAPPING[args.dataset] + model = AutoModelForMaskedLM.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + tokenizer_wrapper = MLMTokenizerWrapper(args.max_seq_length, tokenizer) + + train_ds, dev_ds, test_ds, label_list = load_dataset( + args.dataset, data_path=args.data_path, splits=["train", "dev", "test"] + ) + + template = ManualTemplate(tokenizer, args.template) + logger.info("Set template: {}".format(template.template)) + verbalizer = ManualVerbalizer(tokenizer, labels=label_list, label_to_words=eval(args.verbalizer), prefix=" ") + logger.info("Set verbalizer: {}".format(args.verbalizer)) + + trans_fn = partial(convert_example, template=template, verbalizer=verbalizer, tokenizer_wrapper=tokenizer_wrapper) + + train_loader = create_dataloader(train_ds, "train", args.batch_size, InputFeatures.collate_fn, trans_fn) + dev_loader = create_dataloader(dev_ds, "dev", args.batch_size, InputFeatures.collate_fn, trans_fn) + test_loader = create_dataloader(test_ds, "test", args.batch_size, InputFeatures.collate_fn, trans_fn) + if args.max_steps > 0: + num_epoch = args.max_steps // len(train_loader) + 
int(args.max_steps % len(train_loader) > 0) + max_steps = args.max_steps + else: + num_epoch = args.num_epoch + max_steps = args.num_epoch * len(train_loader) + + lr_scheduler = LinearSchedulerWarmup(args.learning_rate, args.warmup_steps, max_steps) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + apply_decay_param_fun=lambda x: x in decay_params, + ) + + metric_fn = METRIC_MAPPING[args.dataset] + if task_type == "regression": + loss_fn = nn.KLDivLoss() + lb, ub = 0, 5 + logsoftmax = nn.LogSoftmax(axis=-1) + else: + loss_fn = nn.CrossEntropyLoss() + with LogWriter(logdir="./log/pet/train") as writer: + best_metric = -float("inf") + global_step = 1 + global_loss = 0 + for epoch in range(1, num_epoch + 1): + for step, batch in enumerate(train_loader, start=1): + writer.add_scalar("train/lr", lr_scheduler.get_lr(), global_step) + + logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) + label_logits = verbalizer.process_logits(logits, batch["mask_ids"]) + if task_type == "regression": + label_logits = logsoftmax(label_logits) + + labels = paddle.stack( + [ + 1 - (batch["label"].reshape([-1]) - lb) / (ub - lb), + (batch["label"].reshape([-1]) - lb) / (ub - lb), + ], + axis=-1, + ) + loss = loss_fn(label_logits.reshape([-1, 2]), labels) + else: + labels = paddle.to_tensor(batch["label"], dtype="int64") + loss = loss_fn(label_logits.reshape([-1, label_logits.shape[-1]]), labels.reshape([-1])) + if args.alpha > 0: + con_loss = contrastive_loss(logits, labels, task_type=task_type) + loss += args.alpha * con_loss + global_loss += loss.item() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + writer.add_scalar("train/loss", loss.item(), global_step) + + if global_step % args.logging_step == 0: + avg_loss = global_loss / args.logging_step + logger.info( + "Epoch: {:3d}/{:3d}, Global Step: {:4d}, Loss: {:e}".format( + epoch, num_epoch, global_step, avg_loss + ) + ) + global_loss = 0 + + if global_step % args.eval_step == 0: + logger.info("{0:-^30}".format(" Validate ")) + value = evaluate(model, dev_loader, metric_fn, verbalizer, task_type) + if args.save_best and value > best_metric: + best_metric = value + save_path = os.path.join(args.output_dir, "model_best") + if not os.path.exists(save_path): + os.makedirs(save_path) + model.save_pretrained(save_path) + tokenizer.save_pretrained(save_path) + + global_step += 1 + if global_step > max_steps: + break + + logger.info("{0:-^30}".format(" Test ")) + evaluate(model, test_loader, metric_fn, verbalizer, task_type) + if not args.save_best: + save_path = os.path.join(args.output_dir, "model_last") + if not os.path.exists(save_path): + os.makedirs(save_path) + model.save_pretrained(save_path) + tokenizer.save_pretrained(save_path) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/RGL/scripts/run_pet.sh b/examples/few_shot/RGL/scripts/run_pet.sh new file mode 100644 index 0000000000000000000000000000000000000000..ca50d8c6bd00d8300cbb02430844664939f56c18 --- /dev/null +++ b/examples/few_shot/RGL/scripts/run_pet.sh @@ -0,0 +1,114 @@ +dataset=$1 +device=$2 + +MAX_LEN=128 +dataname=$dataset + +case $dataset in + CoLA) + temp="{'text':'text_a'} This is {'mask'}." 
+ verb="{'0':'incorrect','1':'correct'}" + ;; + MRPC) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + QQP) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + STS-B) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + MNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + MNLI-mm) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + dataname='MNLI' + ;; + SNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + QNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + ;; + RTE) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + MAX_LEN=256 + ;; + mr) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + sst-5) + temp="{'text':'text_a'} It was {'mask'}." + temp="{'text':'text_a'} {'mask'}" + verb="{0:'terrible',1:'bad',2:'okay',3:'good',4:'great'}" + ;; + SST-2) + temp="{'text':'text_a'} It was {'mask'}." + verb="{'0':'terrible','1':'great'}" + ;; + subj) + temp="{'text':'text_a'} This is {'mask'}." + verb="{0:'subjective',1:'objective'}" + MAX_LEN=256 + ;; + trec) + temp="{'mask'}:{'text':'text_a'}" + verb="{0:'Description',1:'Entity',2:'Expression',3:'Human',4:'Location',5:'Number'}" + ;; + cr) + temp="{'text':'text_a'} It was {'mask'}." + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + mpqa) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=128 + ;; + +esac + +echo $temp +echo $verb + + +ALPHA=0 +for seed in 13 21 42 87 100 +do + for lr in 1e-5 2e-5 5e-5 + do + for bs in 2 4 8 + do + CUDA_VISIBLE_DEVICES=$device python rgl.py \ + --output_dir ./ckpt_pet_roberta_$seed/ \ + --dataset $dataset \ + --data_path ./data/k-shot/$dataname/16-$seed/ \ + --max_seq_length $MAX_LEN \ + --max_steps 1000 \ + --logging_step 10 \ + --eval_step 100 \ + --batch_size $bs \ + --alpha $ALPHA \ + --seed $seed \ + --learning_rate $lr \ + --template "$temp" \ + --verbalizer "$verb" \ + --overwrite_output + done + done +done + diff --git a/examples/few_shot/RGL/scripts/run_rgl.sh b/examples/few_shot/RGL/scripts/run_rgl.sh new file mode 100644 index 0000000000000000000000000000000000000000..9b1a5d2dc216b7b8bcc37a607c6606ee955ef30d --- /dev/null +++ b/examples/few_shot/RGL/scripts/run_rgl.sh @@ -0,0 +1,115 @@ +dataset=$1 +device=$2 + +MAX_LEN=128 +dataname=$dataset + +case $dataset in + CoLA) + temp="{'text':'text_a'} This is {'mask'}." 
+ verb="{'0':'incorrect','1':'correct'}" + ;; + MRPC) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + QQP) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + STS-B) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + MNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + MNLI-mm) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + dataname='MNLI' + ;; + SNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + QNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + ;; + RTE) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + MAX_LEN=256 + ;; + mr) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + sst-5) + temp="{'text':'text_a'} It was {'mask'}." + temp="{'text':'text_a'} {'mask'}" + verb="{0:'terrible',1:'bad',2:'okay',3:'good',4:'great'}" + ;; + SST-2) + temp="{'text':'text_a'} It was {'mask'}." + verb="{'0':'terrible','1':'great'}" + ;; + subj) + temp="{'text':'text_a'} This is {'mask'}." + verb="{0:'subjective',1:'objective'}" + MAX_LEN=256 + ;; + trec) + temp="{'mask'}:{'text':'text_a'}" + verb="{0:'Description',1:'Entity',2:'Expression',3:'Human',4:'Location',5:'Number'}" + ;; + cr) + temp="{'text':'text_a'} It was {'mask'}." + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + mpqa) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=128 + ;; + +esac + +echo $temp +echo $verb + + +for seed in 13 21 42 87 100 +do + for lr in 1e-5 2e-5 5e-5 + do + for bs in 2 4 8 + do + for alpha in 0.1 0.3 0.5 0.7 1 + do + CUDA_VISIBLE_DEVICES=$device python rgl.py \ + --output_dir ./ckpt_rgl_$seed/ \ + --dataset $dataset \ + --data_path ./data/k-shot/$dataname/16-$seed/ \ + --max_seq_length $MAX_LEN \ + --max_steps 1000 \ + --logging_step 100 \ + --eval_step 1000 \ + --batch_size $bs \ + --alpha $alpha \ + --seed $seed \ + --learning_rate $lr \ + --template "$temp" \ + --verbalizer "$verb" \ + --overwrite_output + done + done + done +done diff --git a/examples/few_shot/RGL/template.py b/examples/few_shot/RGL/template.py new file mode 100644 index 0000000000000000000000000000000000000000..9f0561fc240296715100086200c156a3c6ef4600 --- /dev/null +++ b/examples/few_shot/RGL/template.py @@ -0,0 +1,391 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
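+# Prompt templates: ManualTemplate for hard prompts (e.g. PET, EFL) and SoftTemplate/PTuningTemplate for soft prompts (e.g. P-tuning).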
+ +from abc import abstractmethod + +import paddle +import paddle.nn as nn +from data import InputExample + +from paddlenlp.utils.log import logger + + +class Template(nn.Layer): + """ + Base template class used to preprocess the inputs of model. + + Args: + tokenizer (paddlenlp.transformers.PretrainedTokenizer): + The tokenizer of pretrained models. + text_mapping (dict): + The dictionary to map text name in template to that in InputExample. + For example, {'premise': 'text_a', 'hypothesis': 'text_b'}. + + """ + + registered_input_names = ["mask_ids", "shortenable_ids"] + + def __init__(self, tokenizer, text_mapping=None): + super().__init__() + self.tokenizer = tokenizer + self.text_mapping = text_mapping + self._process_lock = False + + self.part_start = "{" + self.part_end = "}" + + @property + def template(self): + if not hasattr(self, "_template"): + raise RuntimeError("Property template has not been set before used.") + return self._template + + @template.setter + def template(self, template): + if template is None: + return + self._template = template + self.process_template() + + @abstractmethod + def process_template(self): + """A hook to process template text when it is set.""" + raise NotImplementedError + + def get_default_mask_ids(self): + """List to denote whether an item in template is a mask token.""" + return [1 if "mask" in p else 0 for p in self.template] + + def get_default_shortenable_ids(self): + """List to denote whther an item in template can be truncated.""" + idx = [] + for p in self.template: + if "shortenable" in p: + idx.append(1 if p["shortenable"] else 0) + else: + idx.append(1 if "text" in p else 0) + return idx + + def incorporate_template_text(self, example, template=None): + """Replace each item in template with real text.""" + inputs = template.copy() if self.template is None else self.template.copy() + + for i, p in enumerate(inputs): + if "text" in p: + inputs[i] = p["add_prefix_space"] + getattr(example, p["text"]) + elif "mask" in p: + inputs[i] = self.tokenizer.mask_token + elif "hard" in p: + inputs[i] = p["add_prefix_space"] + p["hard"] + elif "sep" in p: + inputs[i] = self.tokenizer.sep_token + else: + raise ValueError("can not parse {}".format(p)) + + return inputs + + def parse_inputs(self, inputs: str): + """Parse items from the input template text.""" + parsed = [] + i = 0 + while i < len(inputs): + p = {"add_prefix_space": " " if (i > 0 and inputs[i - 1] == " ") else ""} + while i < len(inputs) and inputs[i] == " ": + p["add_prefix_space"] = " " + i = i + 1 + if i == len(inputs): + break + + if inputs[i] == self.part_start: + j = i + 1 + count_part = 1 + while j < len(inputs): + if inputs[j] == self.part_end: + count_part -= 1 + if count_part == 0: + break + elif inputs[j] == self.part_start: + count_part += 1 + j = j + 1 + if j == len(inputs): + raise ValueError( + "{} at position {} has no corresponding {}".format(self.part_start, i, self.part_end) + ) + try: + part = eval("{%s}" % inputs[i + 1 : j]) + if isinstance(part, set): + part = {k: None for k in part} + p.update(part) + except: + import traceback + + logger.error(traceback.format_exc()) + logger.error("syntax error in {}".format("{%s}" % inputs[i + 1 : j])) + exit() + i = j + 1 + else: + j = i + 1 + while j < len(inputs): + if inputs[j] == self.part_start: + break + j = j + 1 + p["hard"] = inputs[i:j].rstrip(" ") + i = j + parsed.append(p) + + return parsed + + def wrap_one_example(self, example): + """Process InputExample according to the predefined template.""" + if 
self.template is None: + raise ValueError("template has not been initialized.") + if isinstance(example, InputExample): + text = self.incorporate_template_text(example) + + non_empty_keys = example.keys() + for key in self.text_mapping: + if self.text_mapping[key] in non_empty_keys: + non_empty_keys.remove(self.text_mapping[key]) + + keys, values = ["text"], [text] + for name in self.registered_input_names: + keys.append(name) + v = None + if hasattr(self, name) and getattr(self, name) is not None: + v = getattr(self, name) + elif hasattr(self, "get_default_" + name): + v = getattr(self, "get_default_" + name)() + setattr(self, name, v) + else: + raise ValueError( + """ + Template's part attribute '{}' is registered but not + initialized. Try using template.{} = [...] to + initialize or create a get_default_{}(self) + method in your template.""".format( + name, name, name + ) + ) + values.append(v) + + wrapped_parts_to_tokenize = [] + for value in list(zip(*values)): + wrapped_parts_to_tokenize.append(dict(zip(keys, value))) + + wrapped_parts_not_to_tokenize = {key: getattr(example, key) for key in non_empty_keys} + return [wrapped_parts_to_tokenize, wrapped_parts_not_to_tokenize] + else: + raise TypeError("InputExample") + + +class ManualTemplate(Template): + """ + ManualTemplate for hard prompt methods, such as PET, EFL. + """ + + def __init__(self, tokenizer, template=None, text_mapping={"text_a": "text_a", "text_b": "text_b"}): + super().__init__(tokenizer=tokenizer, text_mapping=text_mapping) + self.template = template + + def process_template(self): + self._template = self.parse_inputs(self._template) + + +class SoftTemplate(Template): + """ + SoftTemplate on the input layer for soft prompt methods, such as p-tuning. + """ + + registered_input_names = ["soft_token_ids", "mask_ids", "shortenable_ids"] + + def __init__(self, tokenizer, model, template=None, text_mapping={"text_a": "text_a", "text_b": "text_b"}): + super().__init__(tokenizer=tokenizer, text_mapping=text_mapping) + for module in model.children(): + if type(module).__name__.endswith("Model"): + self.token_embeddings = module.embeddings.word_embeddings + break + self.token_embeddings.weight.stop_gradient = True + self.embedding_size = self.token_embeddings.weight.shape[-1] + self.template = template + + def process_template(self): + self._template = self.parse_inputs(self._template) + self.process_soft_tokens() + self.generate_parameters() + + def incorporate_template_text(self, example, template=None): + """Replace each item in template with real text.""" + inputs = template.copy() if self.template is None else self.template.copy() + + for i, p in enumerate(inputs): + if "text" in p: + inputs[i] = p["add_prefix_space"] + getattr(example, p["text"]) + elif "mask" in p: + inputs[i] = self.tokenizer.mask_token + elif "hard" in p: + inputs[i] = p["add_prefix_space"] + p["hard"] + elif "soft" in p: + inputs[i] = p["add_prefix_space"] + p["soft"] + elif "sep" in p: + inputs[i] = self.tokenizer.sep_token + else: + raise ValueError("can not parse {}".format(p)) + + return inputs + + def process_soft_tokens(self): + inputs = [] + soft_token_ids = [] + num_soft_token = 0 + soft2word_init = {} + soft_id_reindex = {} + + for part in self.template: + if "soft" not in part and "soft_id" not in part: + soft_token_ids.append(0) + inputs.append(part) + continue + + if "soft" in part and part["soft"] is not None: + if "duplicate" in part: + logger.warnings("Ignore ``duplicate``. 
It is " "incompatible with ``soft`` with text values.") + + # Get word tokens and ids for soft token initialization. + init_token_ids = self.tokenizer( + part["add_prefix_space"] + part["soft"], add_special_tokens=False, return_token_type_ids=False + )["input_ids"] + init_tokens = self.tokenizer.convert_ids_to_tokens(init_token_ids) + assert len(init_tokens) == len(init_token_ids) + + # Create soft ids and corresponding ``soft`` part in template. + next_num_soft = num_soft_token + 1 + num_soft_token += len(init_tokens) + id_list = list(range(next_num_soft, num_soft_token + 1)) + + soft_token_ids.extend(id_list) + inputs.extend([{"add_prefix_space": part["add_prefix_space"], "soft": token} for token in init_tokens]) + for soft_id, word_id in zip(id_list, init_token_ids): + soft2word_init[soft_id] = word_id + + # Check the ids of ``soft`` and ``soft_id``. + if "soft_id" in part: + if part["soft_id"] in soft_id_reindex: + assert id_list == soft_id_reindex[part["soft_id"]] + else: + soft_id_reindex[part["soft_id"]] = id_list + continue + + if "soft_id" in part and part["soft_id"] in soft_id_reindex: + if "duplicate" in part: + logger.warnings("Ignore ``duplicate``. Initialize " "``soft`` by ``soft_id`` directly.") + id_list = soft_id_reindex[part["soft_id"]] + + elif "duplicate" in part: + assert isinstance(part["duplicate"], int) + if "same" in part: + num_soft_token += 1 + id_list = [num_soft_token for _ in range(part["duplicate"])] + else: + next_num_soft = num_soft_token + 1 + num_soft_token += part["duplicate"] + id_list = list(range(next_num_soft, num_soft_token + 1)) + else: + num_soft_token += 1 + id_list = [num_soft_token] + + if "soft_id" in part: + soft_id_reindex[part["soft_id"]] = id_list + + soft_token_ids.extend(id_list) + inputs.extend([{"add_prefix_space": part["add_prefix_space"], "soft": ""} for _ in range(len(id_list))]) + + self._template = inputs + self.soft_token_ids = soft_token_ids + self.num_soft_token = num_soft_token + self.soft2word_init = soft2word_init + + if self.num_soft_token == 0: + logger.warnings("No soft tokens in template. " "Use ManualTemplate for better performance.") + + def generate_parameters(self): + """ + Generate parameters for soft tokens. 
+ """ + if self.num_soft_token == 0: + return None + self.soft_embeddings = nn.Embedding(self.num_soft_token + 1, self.embedding_size) + + weight = self.soft_embeddings.weight.clone().detach() + for soft_id, word_id in self.soft2word_init.items(): + weight[soft_id] = self.token_embeddings(paddle.to_tensor(word_id)) + self.soft_embeddings.weight.set_value(weight) + + def process_batch(self, batch): + word_embeds = self.token_embeddings(batch["input_ids"]) + batch["input_ids"] = None + if not hasattr(self, "soft_embeddings"): + batch["input_embeds"] = word_embeds + else: + soft_embeds = self.soft_embeddings(batch["soft_token_ids"]) + input_embeds = paddle.where((batch["soft_token_ids"] > 0).unsqueeze(-1), soft_embeds, word_embeds) + batch["input_embeds"] = input_embeds + return batch + + +class PTuningTemplate(SoftTemplate): + def __init__( + self, tokenizer, model, template, prompt_encoder="lstm", text_mapping={"text_a": "text_a", "text_b": "text_b"} + ): + super().__init__(tokenizer=tokenizer, model=model, text_mapping=text_mapping) + self.prompt_encoder = prompt_encoder + self.template = template + + def generate_parameters(self): + super().generate_parameters() + if self.prompt_encoder == "lstm": + self.lstm_head = nn.LSTM( + input_size=self.embedding_size, + hidden_size=self.embedding_size, + num_layers=2, + direction="bidirect", + time_major=False, + ) + self.mlp_head = nn.Sequential( + nn.Linear(2 * self.embedding_size, self.embedding_size), + nn.ReLU(), + nn.Linear(self.embedding_size, self.embedding_size), + ) + elif self.prompt_encoder == "mlp": + self.mlp_head = nn.Sequential( + nn.Linear(self.embedding_size, self.embedding_size), + nn.ReLU(), + nn.Linear(self.embedding_size, self.embedding_size), + ) + else: + raise ValueError("Unsupported soft token encoder: {}".format(self.prompt_encoder)) + + def process_batch(self, batch): + word_embeds = self.token_embeddings(batch["input_ids"]) + batch["input_ids"] = None + if not hasattr(self, "soft_embeddings"): + batch["input_embeds"] = word_embeds + else: + soft_embeds = self.soft_embeddings(batch["soft_token_ids"]) + if self.prompt_encoder == "lstm": + soft_embeds = self.lstm_head(soft_embeds)[0] + soft_embeds = self.mlp_head(soft_embeds) + + input_embeds = paddle.where((batch["soft_token_ids"] > 0).unsqueeze(-1), soft_embeds, word_embeds) + batch["input_embeds"] = input_embeds + return batch diff --git a/examples/few_shot/RGL/tokenizer.py b/examples/few_shot/RGL/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..91f2fbd1fad616b3a289abc5a5f304d7adc28e26 --- /dev/null +++ b/examples/few_shot/RGL/tokenizer.py @@ -0,0 +1,261 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import itertools +import warnings +from collections import defaultdict +from functools import partial + +import numpy as np + + +class TokenizerWrapper: + """ + Process examples encoded by template, such as truncating and padding. 
+ + Args: + max_seq_length (int): + The maximum length of input data (prompt and text). + tokenizer (paddlenlp.transformers.PreTrainedTokenizer): + The tokenizer of pretrained model. + truncate_method (str): + How to truncate input data. + Choices: ``tail``, ``head``, ``manual``. + create_token_type_ids (bool): + Whether to create token_type_ids for inputs. + seq_length_list (list, optional): + The list of maximum length for every part in input data. + """ + + def __init__(self, max_seq_length, tokenizer, truncate_method="tail", create_token_type_ids=False, **kwargs): + self.max_seq_length = max_seq_length + self.tokenizer = tokenizer + if truncate_method == "manual": + assert hasattr(kwargs, "seq_length_list"), "seq_length_list " "should be defined for manual truncation." + self.seq_length_list = kwargs["seq_length_list"] + self.truncate_fn = partial(self.truncate_from_end, etype="tail") + elif truncate_method == "tail" or truncate_method == "head": + self.truncate_fn = partial(self.truncate_from_end, etype=truncate_method) + else: + raise NotImplementedError + + self.create_token_type_ids = create_token_type_ids + + self.num_truncated_sentences = 0 + self.total_passed_sentences = 0 + + @property + def special_tokens_maps(self): + if not hasattr(self, "_special_tokens_map"): + self._special_tokens_map = { + "": getattr(self.tokenizer, "cls_token", ""), + "": getattr(self.tokenizer, "sep_token", ""), + "": getattr(self.tokenizer, "pad_token", ""), + "": getattr(self.tokenizer, "mask_token", ""), + "": getattr(self.tokenizer, "unk_token", ""), + } + return self._special_tokens_map + + @property + def truncate_rate(self): + if self.total_passed_sentences == 0: + return None + else: + return self.num_truncated_sentences / self.total_passed_sentences + + @staticmethod + def truncate_by_manual(input_dict, max_len_list=[]): + """ + Truncate input data by manually defined maximum sequence length. + + Args: + input_dict (dict): + The dictionary of an input example. + max_len_list (list): + The maximum length of every part in example. + ``-1`` denotes that there is no limit on length. 
+ """ + truncated_dict = defaultdict(list) + shortenable_ids = input_dict["shortenable_ids"] + truncated_dict["shortenable_ids"] = shortenable_ids + for attr_name, attr_values in input_dict.items(): + text_idx = 0 + for i, value in enumerate(attr_values): + if shortenable_ids[i][0] == 0: + continue + if text_idx >= len(max_len_list): + break + if len(value) > 0: + max_len = max_len_list[text_idx] + if max_len < 0: + attr_values[i] = value + else: + attr_values[i] = value[:max_len] + text_idx += 1 + truncated_dict[attr_name] = attr_values + return truncated_dict + + @staticmethod + def truncate_from_end(input_dict, num_tokens_to_truncate=0, etype="tail"): + assert etype in ["head", "tail"] + step = 1 if etype == "head" else -1 + idx_offset = 0 if etype == "head" else 1 + truncated_dict = defaultdict(list) + shortenable_ids = input_dict["shortenable_ids"] + for attr_name in input_dict: + attr_values = input_dict[attr_name] + count = num_tokens_to_truncate + for i, value in enumerate(attr_values[::step]): + index = int(step * (idx_offset + i)) + if len(value) == 0 or shortenable_ids[index][0] == 0: + continue + if count < len(value): + attr_values[index] = value[:-count] + else: + attr_values[index] = [] + count -= len(value) + if count <= 0: + break + truncated_dict[attr_name] = attr_values + + return truncated_dict + + @staticmethod + def concate_parts(input_dict): + for key in input_dict: + input_dict[key] = list(itertools.chain(*input_dict[key])) + return input_dict + + @staticmethod + def padding(input_dict, max_len, pad_id_for_inputs=0, pad_id_for_others: int = 0) -> None: + for key, value in input_dict.items(): + if len(input_dict[key]) > max_len: + raise ValueError( + f"""Truncated seq length of '{key}' still greater than + max length {max_len}. One possible reason is that + no enough shortenable parts in template. Try adding + {{"shortenable": "True"}} property. 
+ """ + ) + if "input" in key: + input_dict[key].extend([pad_id_for_inputs] * (max_len - len(value))) + else: + input_dict[key].extend([pad_id_for_others] * (max_len - len(value))) + return input_dict + + def truncate(self, inputs): + if hasattr(self, "seq_length_list"): + inputs = self.truncate_by_manual(inputs, self.seq_length_list) + total_tokens = sum([len(part) for part in inputs["input_ids"]]) + num_specials = self.num_special_tokens_to_add + num_tokens_to_truncate = total_tokens - self.max_seq_length + num_specials + self.total_passed_sentences += 1 + if num_tokens_to_truncate > 0: + self.num_truncated_sentences += 1 + inputs = self.truncate_fn(input_dict=inputs, num_tokens_to_truncate=num_tokens_to_truncate) + return inputs + + def add_special_tokens(self, encode_inputs): + for key in encode_inputs: + if key == "input_ids": + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + encode_inputs[key] = self.tokenizer.build_inputs_with_special_tokens(encode_inputs[key]) + else: + special_tokens_mask = np.array(self.tokenizer.get_special_tokens_mask(encode_inputs[key])) + with_special_tokens = np.array(self.tokenizer.build_inputs_with_special_tokens(encode_inputs[key])) + with_special_tokens[special_tokens_mask == 1] = 0 + encode_inputs[key] = with_special_tokens.tolist() + return encode_inputs + + +class MLMTokenizerWrapper(TokenizerWrapper): + input_keys = ["input_ids", "attention_mask", "token_type_ids"] + + @property + def mask_token(self): + return self.tokenizer.mask_token + + @property + def mask_token_id(self): + return self.tokenizer.mask_token_id + + @property + def soft_token(self): + return self.tokenizer.unk_token + + @property + def soft_token_id(self): + return self.tokenizer.unk_token_id + + @property + def num_special_tokens_to_add(self): + if not hasattr(self, "_num_specials"): + self._num_specials = self.tokenizer.num_special_tokens_to_add() + return self._num_specials + + def get_token_type_ids(self, encoded_inputs): + token_type_ids = [0] * len(encoded_inputs["input_ids"]) + sep_token = getattr(self.tokenizer, "sep_token", -1) + if sep_token >= 0: + sep_index = np.where([x == sep_token for x in encoded_inputs["input_ids"]])[0] + for i, x in enumerate(sep_index[1:]): + pre_x = sep_index[i - 1] + sep_index[pre_x + 1 : x + 1] = [i + 1] * (x - pre_x) + return token_type_ids + + def tokenize_one_example(self, wrapped_example): + to_tokenize, not_to_tokenize = wrapped_example + + encode_inputs = defaultdict(list) + for part in to_tokenize: + if part["mask_ids"] == 1: + text = [self.mask_token_id] + + if part["text"] in self.special_tokens_maps.keys(): + to_replace = self.special_tokens_maps[part["text"]] + if to_replace is not None: + part["text"] = to_replace + else: + raise KeyError("This tokenizer doesn't specify {} token.".format(part["prompt"])) + + if "soft_token_ids" in part and part["soft_token_ids"] == 1: + text = [self.soft_token_id] + else: + text = self.tokenizer.encode(part["text"], add_special_tokens=False, return_token_type_ids=False)[ + "input_ids" + ] + + text_len = len(text) + encode_inputs["input_ids"].append(text) + for key in part: + if key not in ["text"]: + encode_inputs[key].append([part[key]] * text_len) + encode_inputs = self.truncate(inputs=encode_inputs) + encode_inputs.pop("shortenable_ids") + encode_inputs = self.concate_parts(encode_inputs) + encode_inputs = self.add_special_tokens(encode_inputs) + encode_inputs["attention_mask"] = [1] * len(encode_inputs["input_ids"]) + if self.create_token_type_ids: + 
encode_inputs["token_type_ids"] = self.get_token_type_ids(encode_inputs) + encode_inputs = self.padding( + encode_inputs, max_len=self.max_seq_length, pad_id_for_inputs=self.tokenizer.pad_token_id + ) + + return {**encode_inputs} + + +tokenizer_mapping = { + "roberta": MLMTokenizerWrapper, +} diff --git a/examples/few_shot/RGL/utils.py b/examples/few_shot/RGL/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f855145c444dd092083c50bc3894e8045678ebb6 --- /dev/null +++ b/examples/few_shot/RGL/utils.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random + +import numpy as np +import paddle +from data import InputFeatures +from paddle.io import DataLoader +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.datasets import MapDataset + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def check_args(args): + """check output_dir and make it when not exist""" + if os.path.exists(args.output_dir): + if os.listdir(args.output_dir) and not args.overwrite_output: + raise ValueError("Path Configuration: output_dir {} exists!".format(args.output_dir)) + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + args.dataset = args.dataset.lower() + + +def convert_example(example, template, tokenizer_wrapper, verbalizer=None): + if verbalizer is not None and hasattr(verbalizer, "wrap_one_example"): + example = verbalizer.wrap_one_example(example) + example = template.wrap_one_example(example) + encoded_inputs = InputFeatures(**tokenizer_wrapper.tokenize_one_example(example), **example[1]) + return encoded_inputs + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if isinstance(dataset, list): + dataset = MapDataset(dataset) + assert isinstance(dataset, MapDataset) + + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +class LinearSchedulerWarmup(LambdaDecay): + """ + Linear scheduler with warm up. 
+ """ + + def __init__(self, learning_rate, warmup_steps, max_steps, last_epoch=-1, verbose=False): + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, float(max_steps - current_step) / float(max(1, max_steps - warmup_steps))) + + super().__init__(learning_rate, lr_lambda, last_epoch, verbose) diff --git a/examples/few_shot/RGL/verbalizer.py b/examples/few_shot/RGL/verbalizer.py new file mode 100644 index 0000000000000000000000000000000000000000..0e741235dcc072bb4e99a9db8cbda742fbbf676f --- /dev/null +++ b/examples/few_shot/RGL/verbalizer.py @@ -0,0 +1,188 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import abstractmethod +from typing import Dict, List, Union + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from data import InputExample + +from paddlenlp.transformers import PretrainedTokenizer + + +class Verbalizer(nn.Layer): + """ + Base verbalizer class used to process the outputs and labels. + + Args: + tokenizer (paddlenlp.transformers.PretrainedTokenizer): + The tokenizer of pretrained models. + labels (list): + The sequence of labels in task. + + """ + + def __init__(self, tokenizer: PretrainedTokenizer = None, labels: List = None): + super().__init__() + assert labels is not None, "Label list for current task is not set yet." + self.tokenizer = tokenizer + self.labels = sorted(labels) + self._process_lock = False + + @property + def vocab(self): + if not hasattr(self, "_vocab"): + self._vocab = self.tokenizer.convert_ids_to_tokens(np.arange(self.vocab_size).tolist()) + return self._vocab + + @property + def vocab_size(self): + return self.tokenizer.vocab_size + + @property + def label_to_words(self): + if not hasattr(self, "_label_to_words"): + raise RuntimeError("Property label_to_words has not been set before used.") + return self._label_to_words + + @label_to_words.setter + def label_to_words(self, label_to_words: Union[List, Dict]): + if label_to_words is None: + return + if isinstance(label_to_words, dict): + new_keys = sorted(list(label_to_words.keys())) + assert new_keys == self.labels, "label_to_words {} does not match the predefined labels {}.".format( + new_keys, self.labels + ) + self._label_to_words = {k: label_to_words[k] for k in self.labels} + elif isinstance(label_to_words, list): + assert len(self.labels) == len( + label_to_words + ), "The lengths of label_to_words and predefined labels do not match." 
+ self._label_to_words = {k: v for k, v in zip(self.labels, label_to_words)} + else: + raise TypeError("Unsupported type {} for label_to_words".format(type(label_to_words))) + self.process_label_words() + + @property + def labels_to_ids(self): + if not hasattr(self, "labels"): + raise RuntimeError("Property labels_to_ids has not been set before used.") + return {k: i for i, k in enumerate(self.labels)} + + @property + def ids_to_labels(self): + if not hasattr(self, "labels"): + raise RuntimeError("Property ids_to_labels has not been set before used.") + return {i: k for i, k in enumerate(self.labels)} + + @abstractmethod + def process_label_words( + self, + ): + """A hook to process verbalizer when it is set.""" + raise NotImplementedError + + @abstractmethod + def project(self, logits, **kwargs): + """ + Project the logits with shape ```[batch_size, vocab_size]``` into + label_word_logits with shape ```[batch_size, num_label_words]```. + """ + raise NotImplementedError + + @staticmethod + def aggregate(label_words_logits, atype="mean", ndim=2): + """ + Aggregate embeddings when multiple words are mapped to one label. + + Args: + label_words_logits (paddle.Tensor): + The logits of words which could be mapped to labels. + atype (str): + The aggregation strategy, including mean and first. + ndim (str): + The aggregated embeddings' number of dimensions. + + """ + if label_words_logits.ndim > ndim: + if atype == "mean": + return label_words_logits.mean(axis=-1) + elif atype == "max": + return label_words_logits.max(axis=-1) + elif atype == "first": + return label_words_logits[..., 0, :] + else: + raise ValueError("Unsupported aggreate type {}".format(atype)) + return label_words_logits + + def normalize(self, logits): + """Normalize the logits of every example.""" + new_logits = F.softmax(logits.reshape(logits.shape[0], -1), axis=-1) + return new_logits.reshape(*logits.shape) + + +class ManualVerbalizer(Verbalizer): + """ + Manual Verbalizer to map labels to words for hard prompt methods. + + Args: + tokenizer (paddlenlp.transformers.PretrainedTokenizer): + The tokenizer of pretrained models. + labels (list): + The sequence of all labels. + label_to_words (dict or list): + The dictionary or corresponding list to map labels to words. + prefix (str): + The prefix string of words, used in PLMs like RoBERTa, which is sensitive to the prefix. + """ + + def __init__(self, tokenizer, labels=None, label_to_words=None, prefix=""): + super().__init__(tokenizer=tokenizer, labels=labels) + self.tokenizer = tokenizer + self.labels = labels + self.prefix = prefix + self.label_to_words = label_to_words + + def process_label_words(self): + word_ids = [] + for label in self.labels: + word_ids.append( + self.tokenizer.encode( + self.prefix + self._label_to_words[label], add_special_tokens=False, return_token_type_ids=False + )["input_ids"] + ) + self.word_ids = paddle.to_tensor(word_ids, dtype="int64").squeeze() + self.label_to_words_ids = {k: v for k, v in zip(self.labels, word_ids)} + + def process_logits(self, logits, mask_ids=None, **kwargs): + if mask_ids is not None: + logits = logits[mask_ids == 1] + label_words_logits = logits.index_select(index=self.word_ids, axis=-1) + return label_words_logits + + def wrap_one_example(self, example): + """Process labels in InputExample According to the predefined verbalizer.""" + if isinstance(example, InputExample): + try: + example.label = self.labels_to_ids[example.cls_label] + except KeyError: + # Regression tasks. 
+ example.label = eval(example.cls_label) + return example + else: + raise TypeError("InputExample") diff --git a/examples/few_shot/efl/README.md b/examples/few_shot/efl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f8656b690172935109a3c2fbb2b1325c76a1f9b1 --- /dev/null +++ b/examples/few_shot/efl/README.md @@ -0,0 +1,85 @@ +# EFL + + +[Entailment as Few-Shot Learner](https://arxiv.org/abs/2104.14690) + + +## 算法简介 + +Entailment as Few-Shot Learner(EFL)提出将 NLP Fine-tune 任务转换统一转换为 Entailment 二分类任务,为小样本场景下的任务求解提供了新的视角。EFL 的主要思想如下图所示,该算法也可以使用 `Template` 实现标签描述与数据文本的拼接,定义方式详见[Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 + +![EFL](https://user-images.githubusercontent.com/25607475/204245126-bd94e87c-f25f-471e-af1c-d1e05f7a2897.png) + +## 快速开始 + +CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 EFL 算法训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 + +### 代码结构说明 +``` +├── run_train.py # EFL 算法提示学习脚本 +├── data.py # 数据集构造、数据增强 +├── utils.py # FewCLUE 提交结果保存等工具函数 +└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 +``` + +### 数据准备 + +读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: + +``` +from paddlenlp.datasets import load_dataset + +# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的 eprstmt 数据集 +train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) +``` + +### 模型训练、评估、预测 + +通过如下命令,指定 GPU 0 卡, 在 FewCLUE 的 `eprstmt` 数据集上进行训练&评估 +``` +python -u -m paddle.distributed.launch --gpus "0" run_train.py \ + --output_dir checkpoint_eprstmt \ + --task_name eprstmt \ + --split_id few_all \ + --prompt_path prompt/eprstmt.json \ + --prompt_index 0 \ + --do_train \ + --do_eval \ + --do_test \ + --do_predict \ + --do_label \ + --max_steps 1000 \ + --learning_rate 3e-5 \ + --eval_steps 100 \ + --save_steps 100 \ + --logging_steps 5 \ + --per_device_train_batch_size 16 \ + --max_seq_length 128 \ + --load_best_model_at_end \ + --metric_for_best_model accuracy \ + --save_total_limit 1 +``` +参数含义说明 +- `task_name`: FewCLUE 中的数据集名字 +- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all +- `prompt_path`: prompt 定义文件名 +- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt +- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute +- `num_augment`: 数据增强策略为每个样本生成的样本数量 +- `word_augment_percent`: 每个序列中数据增强词所占的比例 +- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 +- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 +- `do_test`: 是否在公开测试集上评估模型效果 +- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` +- `use_rdrop`: 是否使用对比学习策略 R-Drop +- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 +- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 +- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 +- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) + +### 模型部署 + +Coming soon... + +## References +[1] Wang, Sinong, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. “Entailment as Few-Shot Learner.” ArXiv:2104.14690 [Cs], April 29, 2021. http://arxiv.org/abs/2104.14690. 
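+
+补充示意:下面的代码片段仅用于说明 EFL 的转换思路:把一个多分类样本展开为若干“标签描述 + 文本”的二分类(蕴含)样本,预测时取正例得分最高的候选标签。它不是本目录 `data.py` 中 `convert_efl` 的实现,其中的 `label_words` 标签词映射也仅为假设示例。
+
+```python
+# 仅为示意(非本目录实现):把一个 N 分类样本展开成 N 个二分类样本。
+label_words = {"Negative": "不满意", "Positive": "满意"}  # 假设的标签词映射
+
+
+def to_entailment_examples(sentence, gold_label=None):
+    examples = []
+    for label in sorted(label_words):
+        new_example = {"candidate_label": label_words[label], "sentence": sentence}
+        if gold_label is not None:
+            # 1 表示候选标签与真实标签一致,0 表示不一致
+            new_example["labels"] = int(label == gold_label)
+        examples.append(new_example)
+    return examples
+
+
+print(to_entailment_examples("外观漂亮,做工也不错", gold_label="Positive"))
+```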
diff --git a/examples/few_shot/efl/data.py b/examples/few_shot/efl/data.py new file mode 100644 index 0000000000000000000000000000000000000000..b33ea7927166ec03b614be73b3c2eaede7679698 --- /dev/null +++ b/examples/few_shot/efl/data.py @@ -0,0 +1,134 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +from paddlenlp.datasets import MapDataset, load_dataset + + +def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): + """ + Extend train dataset with pseudo labeled examples if exists. + """ + if pseudo_path is None: + return data_ds + with open(pseudo_path, "r", encoding="utf-8") as fp: + pseudo_data = [json.loads(x.strip()) for x in fp] + data_ds = MapDataset([x for x in data_ds] + pseudo_data) + return data_ds + + +def convert_efl(data_ds, label_words, orig_key, is_train=False, num_neg=5): + efl_data_ds = [] + label_list = sorted(label_words.keys()) + for example in data_ds: + label = label_words[example[orig_key]] if orig_key in example else None + sub_list = label_list + if is_train and label is not None and len(label_list) > num_neg: + rand_index = np.random.permutation(len(label_list)) + sub_list = [example[orig_key]] + [label_list[i] for i in rand_index[:num_neg]] + for key in sub_list: + new_example = example.copy() + cand = label_words[key] + new_example["candidate_label"] = cand + if label is not None: + new_example["labels"] = int(cand == label) + efl_data_ds.append(new_example) + return MapDataset(efl_data_ds) + + +def convert_chid(data_ds): + """ + Insert idioms into positions of `#idiom#` so that the task is converted + to binary classification. + """ + split_data_ds = [] + for example in data_ds: + fragments = example["content"].split("#idiom#") + label = example.get("answer", None) + for index, cand in enumerate(example["candidates"]): + new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} + if label is not None: + new_example["label"] = str(int(index == label)) + split_data_ds.append(new_example) + return MapDataset(split_data_ds) + + +def convert_cluewsc(data_ds): + """ + Mark the pronoun and entity with special tokens. 
+ """ + marked_data_ds = [] + for example in data_ds: + target, text = example["target"], list(example["text"]) + pronoun, p_index = target["span2_text"], target["span2_index"] + entity, e_index = target["span1_text"], target["span1_index"] + label = example.get("label", None) + if p_index > e_index: + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + else: + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} + if label is not None: + new_example["label"] = label + marked_data_ds.append(new_example) + return MapDataset(marked_data_ds) + + +def load_fewclue_dataset(args, verbalizer): + """ + Load fewclue datasets and convert them to the standard format of PET. + """ + split_id = args.split_id + splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] + if args.task_name == "cluewsc": + train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) + unlabeled_ds = None + else: + splits.append("unlabeled") + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( + "fewclue", name=args.task_name, splits=splits + ) + data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] + + # Preprocess data for EFL. + if args.task_name == "chid": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_chid(sub_data_ds) + elif args.task_name == "cluewsc": + for index, sub_data_ds in enumerate(data_ds[:-1]): + data_ds[index] = convert_cluewsc(sub_data_ds) + + orig_key = "label" + if args.task_name == "tnews": + orig_key = "label_desc" + elif args.task_name == "iflytek": + orig_key = "label_des" + for index, sub_data_ds in enumerate(data_ds): + is_train = index == 0 + if sub_data_ds is not None: + data_ds[index] = convert_efl(sub_data_ds, args.label_words, orig_key, is_train) + + # Extend train dataset with pseudo-label data. 
+ data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) + + return data_ds diff --git a/examples/few_shot/efl/prompt/bustm.json b/examples/few_shot/efl/prompt/bustm.json new file mode 100644 index 0000000000000000000000000000000000000000..b44363510642badc49e8414b6a1f135436ed8ede --- /dev/null +++ b/examples/few_shot/efl/prompt/bustm.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "下边两个句子说的是{'text': 'candidate_label'}的事情。“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"} + ], + "verbalizer": [ + {"0": "不同", "1": "相关"} + ] +} diff --git a/examples/few_shot/efl/prompt/chid.json b/examples/few_shot/efl/prompt/chid.json new file mode 100644 index 0000000000000000000000000000000000000000..b3c1e648e29a60d37fc21320fc4933b88453f052 --- /dev/null +++ b/examples/few_shot/efl/prompt/chid.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "成语[{'text':'idiom'}]使用{'text': 'candidate_label'}的例子:{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}"} + ], + "verbalizer": [ + {"0": "错误", "1": "正确"} + ] +} diff --git a/examples/few_shot/efl/prompt/cluewsc.json b/examples/few_shot/efl/prompt/cluewsc.json new file mode 100644 index 0000000000000000000000000000000000000000..1e736c43332da941622bb3d03b7d159973bae57a --- /dev/null +++ b/examples/few_shot/efl/prompt/cluewsc.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'text': 'text'}{'text': 'pronoun'}指的{'text': 'candidate_label'}{'text': 'entity'}"} + ], + "verbalizer": [ + {"false": "不是", "true": "是"} + ] +} diff --git a/examples/few_shot/efl/prompt/csl.json b/examples/few_shot/efl/prompt/csl.json new file mode 100644 index 0000000000000000000000000000000000000000..6d19eee927f8badc4c77297e76e04f34e85c02e5 --- /dev/null +++ b/examples/few_shot/efl/prompt/csl.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "给定以下几个词语:{'options': 'keyword', 'add_prompt': '[OPT],'}{'text': 'candidate_label'}扩写成“{'text': 'abst'}”"} + ], + "verbalizer": [ + {"0": "不能", "1": "可以"} + ] +} diff --git a/examples/few_shot/efl/prompt/csldcp.json b/examples/few_shot/efl/prompt/csldcp.json new file mode 100644 index 0000000000000000000000000000000000000000..2dd84e36ca2194851b7865b0a33b0e855ffb373e --- /dev/null +++ b/examples/few_shot/efl/prompt/csldcp.json @@ -0,0 +1,76 @@ +{ + "template": [ + {"text": "这篇论文阐述了{'text': 'candidate_label'}。{'text': 'content'}"} + ], + "verbalizer": [ + [ + "材料科学与工程", + "作物学", + "口腔医学", + "药学", + "教育学", + "水利工程", + "理论经济学", + "食品科学与工程", + "畜牧学/兽医学", + "体育学", + "核科学与技术", + "力学", + "园艺学", + "水产", + "法学", + "地质学/地质资源与地质工程", + "石油与天然气工程", + "农林经济管理", + "信息与通信工程", + "图书馆、情报与档案管理", + "政治学", + "电气工程", + "海洋科学", + "民族学", + "航空宇航科学与技术", + "化学/化学工程与技术", + "哲学", + "公共卫生与预防医学", + "艺术学", + "农业工程", + "船舶与海洋工程", + "计算机科学与技术", + "冶金工程", + "交通运输工程", + "动力工程及工程热物理", + "纺织科学与工程", + "建筑学", + "环境科学与工程", + "公共管理", + "数学", + "物理学", + "林学/林业工程", + "心理学", + "历史学", + "工商管理", + "应用经济学", + "中医学/中药学", + "天文学", + "机械工程", + "土木工程", + "光学工程", + "地理学", + "农业资源利用", + "生物学/生物科学与工程", + "兵器科学与技术", + "矿业工程", + "大气科学", + "基础医学/临床医学", + "电子科学与技术", + "测绘科学与技术", + "控制科学与工程", + "军事学", + "中国语言文学", + "新闻传播学", + "社会学", + "地球物理学", + "植物保护" + ] + ] +} diff --git a/examples/few_shot/efl/prompt/eprstmt.json b/examples/few_shot/efl/prompt/eprstmt.json new file mode 100644 index 0000000000000000000000000000000000000000..309146c5e7e514d7c84c2c041fcbc9589f8ec69a --- /dev/null +++ b/examples/few_shot/efl/prompt/eprstmt.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "这表达了{'text': 'candidate_label'}的情感。{'text':'sentence'}"} + ], + "verbalizer": [ 
+ {"Negative": "不满意", "Positive": "满意"} + ] +} diff --git a/examples/few_shot/efl/prompt/iflytek.json b/examples/few_shot/efl/prompt/iflytek.json new file mode 100644 index 0000000000000000000000000000000000000000..5199508e6f035355b17b0ba6006fae28228543fe --- /dev/null +++ b/examples/few_shot/efl/prompt/iflytek.json @@ -0,0 +1,129 @@ +{ + "template": [ + {"text": "这段文本的应用描述主题是{'text': 'candidate_label'}。{'text': 'sentence'}"} + ], + "verbalizer": [ + [ + "银行", + "社区服务", + "电商", + "支付", + "经营养成", + "卡牌", + "借贷", + "驾校", + "理财", + "职考", + "新闻", + "旅游资讯", + "公共交通", + "魔幻", + "医疗服务", + "影像剪辑", + "动作类", + "工具", + "体育竞技", + "小说", + "运动健身", + "相机", + "辅助工具", + "快递物流", + "高等教育", + "股票", + "菜谱", + "行车辅助", + "仙侠", + "亲子儿童", + "购物咨询", + "射击游戏", + "漫画", + "中小学", + "同城服务", + "成人教育", + "求职", + "电子产品", + "艺术", + "薅羊毛", + "约会社交", + "经营", + "兼职", + "短视频", + "音乐", + "英语", + "棋牌中心", + "摄影修图", + "养生保健", + "办公", + "政务", + "视频", + "论坛圈子", + "彩票", + "直播", + "其他", + "休闲益智", + "策略", + "即时通讯", + "汽车交易", + "违章", + "地图导航", + "民航", + "电台", + "语言(非英语)", + "搞笑", + "婚恋社交", + "社区超市", + "日常养车", + "杂志", + "视频教育", + "家政", + "影视娱乐", + "装修家居", + "体育咨讯", + "社交工具", + "餐饮店", + "美颜", + "问诊挂号", + "飞行空战", + "综合预定", + "电影票务", + "笔记", + "买房", + "外卖", + "母婴", + "打车", + "情侣社交", + "日程管理", + "租车", + "微博博客", + "百科", + "绘画", + "铁路", + "生活社交", + "租房", + "酒店", + "保险", + "问答交流", + "收款", + "MOBA", + "K歌", + "技术", + "减肥瘦身", + "工作社交", + "团购", + "记账", + "女性", + "公务员", + "二手", + "美妆美业", + "汽车咨询", + "行程管理", + "免费WIFI", + "教辅", + "成人", + "婚庆", + "民宿短租", + "出国" + ] + ] +} + diff --git a/examples/few_shot/efl/prompt/ocnli.json b/examples/few_shot/efl/prompt/ocnli.json new file mode 100644 index 0000000000000000000000000000000000000000..caa7fd2c5719fccea5e3dd95a7c30648cae73b4e --- /dev/null +++ b/examples/few_shot/efl/prompt/ocnli.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”之间{'text': 'candidate_label'}。"} + ], + "verbalizer": [ + {"contradiction": "互相矛盾", "entailment": "相互包含", "neutral": "没有关系"} + ] +} diff --git a/examples/few_shot/efl/prompt/tnews.json b/examples/few_shot/efl/prompt/tnews.json new file mode 100644 index 0000000000000000000000000000000000000000..4580cd7662088667f59526dd93c196a9f3e8f730 --- /dev/null +++ b/examples/few_shot/efl/prompt/tnews.json @@ -0,0 +1,24 @@ +{ + "template": [ + {"text": "下边报道一条{'text': 'candidate_label'}新闻{'text':'sentence'}"} + ], + "verbalizer": [ + { + "news_story": "故事", + "news_entertainment": "明星", + "news_finance": "财经", + "news_sports": "体育", + "news_edu": "校园", + "news_game": "游戏", + "news_culture": "文化", + "news_tech": "科技", + "news_car": "汽车", + "news_travel": "旅行", + "news_world": "国际", + "news_agriculture": "农业", + "news_military": "军事", + "news_house": "房产", + "news_stock": "股票" + } + ] +} diff --git a/examples/few_shot/efl/run_train.py b/examples/few_shot/efl/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..8dd47043d762097f0ea9a5fdc7563ca72afe6338 --- /dev/null +++ b/examples/few_shot/efl/run_train.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import load_fewclue_dataset +from paddle.metric import Accuracy +from paddle.static import InputSpec +from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data + +from paddlenlp.prompt import ( + ManualTemplate, + ManualVerbalizer, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) + split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) + prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) + prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) + pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) + do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) + do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + data_args = load_prompt_arguments(data_args) + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, + num_labels=2, + hidden_dropout_prob=model_args.dropout, + attention_probs_dropout_prob=model_args.dropout, + ) + + # Define template for preprocess and verbalizer for postprocess. + template = ManualTemplate(data_args.prompt, tokenizer, training_args.max_seq_length) + logger.info("Using template: {}".format(template.prompt)) + + verbalizer = ManualVerbalizer(data_args.label_words, tokenizer) + ids_to_labels = {idx: label for idx, label in enumerate(verbalizer.labels)} + logger.info("Using verbalizer: {}".format(data_args.label_words)) + + # Load datasets. 
+ train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_fewclue_dataset(data_args, verbalizer=verbalizer) + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds, num_labels): + metric = Accuracy() + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([-1, num_labels]) + labels = paddle.to_tensor(eval_preds.label_ids) + labels = paddle.argmax(labels.reshape([-1, num_labels]), axis=1) + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Initialize the trainer. + compute_metrics = partial(compute_metrics, num_labels=len(verbalizer.labels)) + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=compute_metrics, + ) + + # Traininig. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) + + # Test. + if data_args.do_test and public_test_ds is not None: + test_ret = trainer.predict(public_test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Predict. + if training_args.do_predict and test_ds is not None: + pred_ret = trainer.predict(test_ds) + logger.info("Prediction done.") + predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) + save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) + + # Label unsupervised data. + if data_args.do_label and unlabeled_ds is not None: + label_ret = trainer.predict(unlabeled_ds) + logger.info("Labeling done.") + pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") + save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) + + # Export static model. + if training_args.do_export: + input_spec = [ + InputSpec(shape=[None, None], dtype="int64"), # input_ids, + InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + InputSpec(shape=[None, None], dtype="int64"), # position_ids + InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask + ] + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/efl/utils.py b/examples/few_shot/efl/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..dfc6463bb69d4d6719cf514ad55ec8a3ba765b4a --- /dev/null +++ b/examples/few_shot/efl/utils.py @@ -0,0 +1,252 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import pathlib + +import numpy as np +import paddle + +from paddlenlp.datasets import load_dataset + +LABEL_TO_STANDARD = { + "tnews": { + "news_story": "100", + "news_culture": "101", + "news_entertainment": "102", + "news_sports": "103", + "news_finance": "104", + "news_house": "106", + "news_car": "107", + "news_edu": "108", + "news_tech": "109", + "news_military": "110", + "news_travel": "112", + "news_world": "113", + "news_stock": "114", + "news_agriculture": "115", + "news_game": "116", + }, + "iflytek": { + "打车": 0, + "美颜": 100, + "影像剪辑": 101, + "摄影修图": 102, + "相机": 103, + "绘画": 104, + "二手": 105, + "电商": 106, + "团购": 107, + "外卖": 108, + "电影票务": 109, + "社区服务": 10, + "社区超市": 110, + "购物咨询": 111, + "笔记": 112, + "办公": 113, + "日程管理": 114, + "女性": 115, + "经营": 116, + "收款": 117, + "其他": 118, + "薅羊毛": 11, + "魔幻": 12, + "仙侠": 13, + "卡牌": 14, + "飞行空战": 15, + "射击游戏": 16, + "休闲益智": 17, + "动作类": 18, + "体育竞技": 19, + "地图导航": 1, + "棋牌中心": 20, + "经营养成": 21, + "策略": 22, + "MOBA": 23, + "辅助工具": 24, + "约会社交": 25, + "即时通讯": 26, + "工作社交": 27, + "论坛圈子": 28, + "婚恋社交": 29, + "免费WIFI": 2, + "情侣社交": 30, + "社交工具": 31, + "生活社交": 32, + "微博博客": 33, + "新闻": 34, + "漫画": 35, + "小说": 36, + "技术": 37, + "教辅": 38, + "问答交流": 39, + "租车": 3, + "搞笑": 40, + "杂志": 41, + "百科": 42, + "影视娱乐": 43, + "求职": 44, + "兼职": 45, + "视频": 46, + "短视频": 47, + "音乐": 48, + "直播": 49, + "同城服务": 4, + "电台": 50, + "K歌": 51, + "成人": 52, + "中小学": 53, + "职考": 54, + "公务员": 55, + "英语": 56, + "视频教育": 57, + "高等教育": 58, + "成人教育": 59, + "快递物流": 5, + "艺术": 60, + "语言(非英语)": 61, + "旅游资讯": 62, + "综合预定": 63, + "民航": 64, + "铁路": 65, + "酒店": 66, + "行程管理": 67, + "民宿短租": 68, + "出国": 69, + "婚庆": 6, + "工具": 70, + "亲子儿童": 71, + "母婴": 72, + "驾校": 73, + "违章": 74, + "汽车咨询": 75, + "汽车交易": 76, + "日常养车": 77, + "行车辅助": 78, + "租房": 79, + "家政": 7, + "买房": 80, + "装修家居": 81, + "电子产品": 82, + "问诊挂号": 83, + "养生保健": 84, + "医疗服务": 85, + "减肥瘦身": 86, + "美妆美业": 87, + "菜谱": 88, + "餐饮店": 89, + "公共交通": 8, + "体育咨讯": 90, + "运动健身": 91, + "支付": 92, + "保险": 93, + "股票": 94, + "借贷": 95, + "理财": 96, + "彩票": 97, + "记账": 98, + "银行": 99, + "政务": 9, + }, +} + + +def load_prompt_arguments(args): + """ + Load prompt and label words according to prompt index. 
+ """ + with open(args.prompt_path, "r", encoding="utf-8") as fp: + configs = json.load(fp) + assert len(configs["verbalizer"]) == len(configs["template"]) + assert configs["verbalizer"][0] is not None + verbalizer = [configs["verbalizer"][0]] + last_verb_index = 0 + for index, verb in enumerate(configs["verbalizer"][1:]): + if verb is None or len(verb) == 0: + verbalizer.append(configs["verbalizer"][last_verb_index]) + else: + verbalizer.append(verb) + last_verb_index = index + 1 + configs["verbalizer"] = verbalizer + args.prompt = configs["template"][args.prompt_index]["text"] + label_words = configs["verbalizer"][args.prompt_index] + if isinstance(label_words, list): + label_words = {k: k for k in label_words} + args.label_words = label_words + return args + + +def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): + """ + Combine unsupervised data and corresponding predicted labels and + save one example per line. + """ + if task_name == "cluewsc": + return None + + num_labels = len(labels) + data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") + preds = paddle.to_tensor(label_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1].numpy() + preds = preds.reshape([-1, num_labels]) + label_preds = np.argmax(preds, axis=1) + label_probs = np.max(preds, axis=1) + pseudo_data = [] + for index, example in enumerate(data_ds): + example["labels"] = labels[label_preds[index]] + example["prob"] = str(label_probs[index]) + pseudo_data.append(example) + save_data(pseudo_data, save_path) + + +def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): + """ + Extract predicted labels and save as the format required by FewCLUE. + """ + num_labels = len(labels) + preds = paddle.to_tensor(label_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([-1, num_labels]) + if task_name == "chid": + batch_size = preds.shape[0] + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([batch_size // 7, 7]) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + preds = np.argmax(preds, axis=1) + test_ds = load_dataset("fewclue", name=task_name, splits="test") + + ret_list = [] + maps = LABEL_TO_STANDARD.get(task_name, None) + for idx, example in enumerate(test_ds): + uid = example.get("id", idx) + if task_name in ["bustm", "csl"]: + ret_list.append({"id": uid, "label": str(preds[idx])}) + elif task_name == "chid": + ret_list.append({"id": uid, "answer": preds[idx]}) + elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: + ret_list.append({"id": uid, "label": labels[preds[idx]]}) + elif task_name in ["iflytek", "tnews"]: + ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) + save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" + save_data(ret_list, save_path, save_file + "_predict.json") + + +def save_data(data, save_path, save_file=None): + if save_file is not None: + pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) + save_path = os.path.join(save_path, save_file) + with open(save_path, "w") as fp: + for example in data: + fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/few_shot/p-tuning/README.md b/examples/few_shot/p-tuning/README.md new file mode 100644 index 0000000000000000000000000000000000000000..38d2f0af9c527f77e9743764d50d74804cfaa287 --- /dev/null +++ b/examples/few_shot/p-tuning/README.md @@ -0,0 +1,85 @@ +# P-Tuning + +[GPT 
Understands, Too](https://arxiv.org/pdf/2103.10385.pdf) + +## 算法简介 + +P-tuning 引入可学习的连续型提示向量 prompt embeddings 参数, 让模型自己去学习最优的 prompt embedding, 而不再依赖人工去设置自然语言形式的提示(Prompt)信息。P-Tuning 算法的数据和模型定义如下图所示,对应于数据预处理模块 `SoftTemplate` 和标签词映射模块 `MaskedLMVerbalizer`,详细介绍及定义方法参见 [Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 + +![p-tuning](https://user-images.githubusercontent.com/25607475/204214359-3036c6c6-f101-4a5f-958c-abe0e40c243a.png) + + +## 快速开始 + +CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 PET 策略训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 +PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 P-tuning 策略训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 + +### 代码结构及说明 +``` +├── run_train.py # P-Tuning 算法提示学习脚本 +├── data.py # 数据集构造、数据增强 +├── utils.py # FewCLUE 提交结果保存等工具函数 +└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 +``` + +### 数据准备 + +读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: +``` +from paddlenlp.datasets import load_dataset + +# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的 eprstmt 数据集 +train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) +``` + +### 模型训练、评估、预测 + +通过如下命令,指定 GPU 0 卡, 使用一个连续型提示向量在 FewCLUE 的 `eprstmt` 数据集上进行训练和评估。如果要使用多个可学习连续型提示向量,可修改 `./prompt/` 目录下相应的文件,修改 `soft` 的长度属性 `length` 即可。 +``` +python -u -m paddle.distributed.launch --gpus "0" run_train.py \ + --output_dir checkpoint_eprstmt \ + --task_name eprstmt \ + --split_id few_all \ + --prompt_path prompt/eprstmt.json \ + --prompt_index 0 \ + --do_train \ + --do_eval \ + --do_test \ + --do_predict \ + --do_label \ + --max_steps 1000 \ + --learning_rate 3e-5 \ + --eval_steps 100 \ + --save_steps 100 \ + --logging_steps 5 \ + --per_device_train_batch_size 16 \ + --max_seq_length 128 \ + --load_best_model_at_end \ + --metric_for_best_model accuracy \ + --save_total_limit 1 +``` + +参数含义说明 +- `task_name`: FewCLUE 中的数据集名字 +- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all +- `prompt_path`: prompt 定义文件名 +- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt +- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute +- `num_augment`: 数据增强策略为每个样本生成的样本数量 +- `word_augment_percent`: 每个序列中数据增强词所占的比例 +- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 +- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 +- `do_test`: 是否在公开测试集上评估模型效果 +- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` +- `use_rdrop`: 是否使用对比学习策略 R-Drop +- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 +- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 +- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 +- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) + +### 模型部署 + +Coming soon... + +## References +[1]X. Liu et al., “GPT Understands, Too,” arXiv:2103.10385 [cs], Mar. 2021, Accessed: Mar. 22, 2021. [Online]. Available: http://arxiv.org/abs/2103.10385 diff --git a/examples/few_shot/p-tuning/data.py b/examples/few_shot/p-tuning/data.py new file mode 100644 index 0000000000000000000000000000000000000000..6f96ac02cdc8cddcde08a7cc09ca82587b41b174 --- /dev/null +++ b/examples/few_shot/p-tuning/data.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+from functools import partial
+
+import paddle
+
+from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap
+from paddlenlp.datasets import MapDataset, load_dataset
+
+
+def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids):
+    """
+    Extend the train dataset with pseudo-labeled examples if they exist.
+    """
+    if pseudo_path is None:
+        return data_ds
+    with open(pseudo_path, "r", encoding="utf-8") as fp:
+        pseudo_data = [json.loads(x.strip()) for x in fp]
+    data_ds = MapDataset([x for x in data_ds] + pseudo_data)
+    return data_ds
+
+
+def extend_with_data_augment(data_ds, aug_type, num_aug=10, percent=0.1, aug_base="mlm", example_keys=None):
+    """
+    Extend the train dataset with augmented examples.
+    """
+    if example_keys is None:
+        return data_ds
+    if aug_type is None or aug_type == "None":
+        return data_ds
+    if aug_type == "delete":
+        aug = WordDelete(create_n=num_aug, aug_percent=percent)
+    elif aug_type == "substitute":
+        aug = WordSubstitute(aug_base, create_n=num_aug, aug_percent=percent)
+    elif aug_type == "insert":
+        aug = WordInsert(aug_base, create_n=num_aug, aug_percent=percent)
+    elif aug_type == "swap":
+        aug = WordSwap(create_n=num_aug, aug_percent=percent)
+    else:
+        raise ValueError("Unsupported data augment strategy `{}`".format(aug_type))
+
+    aug_data = []
+    for example in data_ds:
+        for key in example_keys:
+            text_aug = aug.augment(example[key])
+            for text in text_aug:
+                # Modify the copy rather than the original example so that the
+                # source data is left untouched.
+                new_example = example.copy()
+                new_example[key] = text
+                aug_data.append(new_example)
+
+    data_ds = MapDataset([x for x in data_ds] + aug_data)
+    return data_ds
+
+
+def convert_chid(data_ds):
+    """
+    Insert idioms into positions of `#idiom#` so that the task is converted
+    to binary classification.
+    """
+    split_data_ds = []
+    for example in data_ds:
+        fragments = example["content"].split("#idiom#")
+        label = example.get("answer", None)
+        for index, cand in enumerate(example["candidates"]):
+            new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand}
+            if label is not None:
+                new_example["label"] = str(int(index == label))
+            split_data_ds.append(new_example)
+    return MapDataset(split_data_ds)
+
+
+def convert_csl(data_ds):
+    """
+    Concatenate keywords. The manual concatenation can be replaced by the
+    `options` keyword in the develop version.
+    """
+    concat_data_ds = []
+    for example in data_ds:
+        example["keyword"] = ",".join(example["keyword"])
+        concat_data_ds.append(example)
+    return MapDataset(concat_data_ds)
+
+
+def convert_cluewsc(data_ds):
+    """
+    Mark the pronoun and entity with special tokens.
+ """ + marked_data_ds = [] + for example in data_ds: + target, text = example["target"], list(example["text"]) + pronoun, p_index = target["span2_text"], target["span2_index"] + entity, e_index = target["span1_text"], target["span1_index"] + label = example.get("label", None) + if p_index > e_index: + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + else: + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} + if label is not None: + new_example["label"] = label + marked_data_ds.append(new_example) + return MapDataset(marked_data_ds) + + +def convert_labels_to_ids(example, orig_key, labels_to_ids, pop_keys=None): + """ + Convert the keyword in datasets to `labels`. + """ + if orig_key in example: + example["label_ids"] = labels_to_ids[example.pop(orig_key)] + if pop_keys is not None: + for key in pop_keys: + if key in example: + example.pop(key) + return example + + +def convert_ids_to_words(example, token_ids): + """ + Convert label id to the first word in mapping from labels to words, + the length of which should coincide with that of `mask` in prompt. + """ + if "label_ids" in example: + labels = paddle.index_select(token_ids, paddle.to_tensor(example.pop("label_ids")), axis=0).squeeze(0) + example["labels"] = labels + return example + + +def load_fewclue_dataset(args, verbalizer, example_keys=None): + """ + Load fewclue datasets and convert them to the standard format of PET. + """ + split_id = args.split_id + splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] + if args.task_name == "cluewsc": + train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) + unlabeled_ds = None + else: + splits.append("unlabeled") + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( + "fewclue", name=args.task_name, splits=splits + ) + data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] + + # Preprocess data for mask prediction task. + if args.task_name == "chid": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_chid(sub_data_ds) + elif args.task_name == "cluewsc": + for index, sub_data_ds in enumerate(data_ds[:-1]): + data_ds[index] = convert_cluewsc(sub_data_ds) + elif args.task_name == "csl": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_csl(sub_data_ds) + orig_key = "label" + pop_keys = ["id"] + if args.task_name == "tnews": + orig_key = "label_desc" + pop_keys = ["keywords", "label", "id"] + elif args.task_name == "iflytek": + orig_key = "label_des" + pop_keys = ["id", "label"] + elif args.task_name == "ocnli": + pop_keys = ["level", "label0", "label1", "label2", "label3", "label4", "genre", "prem_id", "id"] + convert_label = partial( + convert_labels_to_ids, orig_key=orig_key, labels_to_ids=verbalizer.labels_to_ids, pop_keys=pop_keys + ) + for index, sub_data_ds in enumerate(data_ds): + if sub_data_ds is not None: + data_ds[index] = sub_data_ds.map(convert_label) + + # Extend train dataset with data augmentation and pseudo-label data. 
+ data_ds[0] = extend_with_data_augment( + data_ds[0], args.augment_type, args.num_augment, args.word_augment_percent, args.augment_method, example_keys + ) + data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) + + dev_labels = [x["label_ids"] for x in data_ds[1]] + test_labels = [x["label_ids"] for x in data_ds[2]] + + convert_fn = partial(convert_ids_to_words, token_ids=verbalizer.token_ids[:, 0, :]) + data_ds[:3] = [x.map(convert_fn) for x in data_ds[:3]] + + return data_ds, (dev_labels, test_labels) diff --git a/examples/few_shot/p-tuning/prompt/bustm.json b/examples/few_shot/p-tuning/prompt/bustm.json new file mode 100644 index 0000000000000000000000000000000000000000..345930ea51a90b229df961300cc8b90ac954b436 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/bustm.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}{'text': 'sentence1'}{'text': 'sentence2'}"} + ], + "verbalizer": [ + {"0": "不", "1": "很"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/chid.json b/examples/few_shot/p-tuning/prompt/chid.json new file mode 100644 index 0000000000000000000000000000000000000000..cc3b30195fa7c826921fec7bd2587176b2009f27 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/chid.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}{'text':'content_pre'}{'text': 'idiom'}{'text': 'content_post'}"} + ], + "verbalizer": [ + {"0": "否", "1": "是"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/cluewsc.json b/examples/few_shot/p-tuning/prompt/cluewsc.json new file mode 100644 index 0000000000000000000000000000000000000000..c0ef7573441bd2c7fe05597c3a1e4371bee64dee --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/cluewsc.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text': 'text'}{'text': 'pronoun'}指的是{'text': 'entity'}"} + ], + "verbalizer": [ + {"false": "错误", "true": "正确"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/csl.json b/examples/few_shot/p-tuning/prompt/csl.json new file mode 100644 index 0000000000000000000000000000000000000000..443ba172a2fea9e01c13089622b857c8105d8a0b --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/csl.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}本文关键词有{'text': 'keyword'}{'text': 'abst'}"} + ], + "verbalizer": [ + {"0": "不", "1": "很"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/csldcp.json b/examples/few_shot/p-tuning/prompt/csldcp.json new file mode 100644 index 0000000000000000000000000000000000000000..5bb12c680f4e5b5d0c4da69bfbbda8312e6135e5 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/csldcp.json @@ -0,0 +1,76 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text': 'content'}"} + ], + "verbalizer": [ + { + "材料科学与工程": "材料", + "作物学": "作物", + "口腔医学": "口腔", + "药学": "药学", + "教育学": "教育", + "水利工程": "水利", + "理论经济学": "理经", + "食品科学与工程": "食品", + "畜牧学/兽医学": "畜牧", + "体育学": "体育", + "核科学与技术": "核科", + "力学": "力学", + "园艺学": "园艺", + "水产": "水产", + "法学": "法学", + "地质学/地质资源与地质工程": "地质", + "石油与天然气工程": "石油", + "农林经济管理": "农林", + "信息与通信工程": "通信", + "图书馆、情报与档案管理": "图书", + "政治学": "政治", + "电气工程": "电气", + "海洋科学": "海洋", + "民族学": "民族", + "航空宇航科学与技术": "航空", + "化学/化学工程与技术": "化学", + "哲学": "哲学", + "公共卫生与预防医学": "卫生", + "艺术学": "艺术", + "农业工程": "农工", + "船舶与海洋工程": "船舶", + "计算机科学与技术": "计科", + "冶金工程": "冶金", + "交通运输工程": "交通", + "动力工程及工程热物理": "动力", + "纺织科学与工程": "纺织", + "建筑学": "建筑", + "环境科学与工程": "环境", + "公共管理": "公管", + "数学": "数学", + "物理学": "物理", + "林学/林业工程": "林学", + "心理学": "心理", + "历史学": "历史", + "工商管理": "工管", + 
"应用经济学": "应经", + "中医学/中药学": "中医", + "天文学": "天文", + "机械工程": "机械", + "土木工程": "土木", + "光学工程": "光学", + "地理学": "地理", + "农业资源利用": "农业", + "生物学/生物科学与工程": "生物", + "兵器科学与技术": "兵器", + "矿业工程": "矿业", + "大气科学": "大气", + "基础医学/临床医学": "基础", + "电子科学与技术": "电子", + "测绘科学与技术": "测绘", + "控制科学与工程": "控制", + "军事学": "军事", + "中国语言文学": "中文", + "新闻传播学": "新闻", + "社会学": "社会", + "地球物理学":"地球", + "植物保护":"植保" + } + ] +} diff --git a/examples/few_shot/p-tuning/prompt/eprstmt.json b/examples/few_shot/p-tuning/prompt/eprstmt.json new file mode 100644 index 0000000000000000000000000000000000000000..ea6941cdd963f096f4429c28368c738594df149a --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/eprstmt.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}{'text':'sentence'}"} + ], + "verbalizer": [ + {"Negative": "不", "Positive": "很"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/iflytek.json b/examples/few_shot/p-tuning/prompt/iflytek.json new file mode 100644 index 0000000000000000000000000000000000000000..198ce19949738290b6494777b47ff359fdd00ab7 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/iflytek.json @@ -0,0 +1,129 @@ +{ + "template": [ + {"text": "{'mask': None, 'length': 4}{'soft'}{'text': 'sentence'}"} + ], + "verbalizer": [ + { + "银行": "银行办理", + "社区服务": "社区服务", + "电商": "电商网购", + "支付": "支付交易", + "经营养成": "经营养成", + "卡牌": "卡牌游戏", + "借贷": "借贷借款", + "驾校": "驾校学车", + "理财": "投资理财", + "职考": "职业考试", + "新闻": "新闻资讯", + "旅游资讯": "旅游资讯", + "公共交通": "公共交通", + "魔幻": "魔幻游戏", + "医疗服务": "医疗服务", + "影像剪辑": "影像剪辑", + "动作类": "动作游戏", + "工具": "使用工具", + "体育竞技": "体育竞技", + "小说": "小说阅读", + "运动健身": "运动健身", + "相机": "相机拍照", + "辅助工具": "辅助工具", + "快递物流": "快递物流", + "高等教育": "高等教育", + "股票": "股票炒股", + "菜谱": "做菜菜谱", + "行车辅助": "行车帮助", + "仙侠": "仙侠小说", + "亲子儿童": "亲子儿童", + "购物咨询": "购物资讯", + "射击游戏": "射击游戏", + "漫画": "动漫漫画", + "中小学": "中学小学", + "同城服务": "同城跑腿", + "成人教育": "成人教育", + "求职": "面试求职", + "电子产品": "电子产品", + "艺术": "艺术学习", + "薅羊毛": "比价省钱", + "约会社交": "约会社交", + "经营": "经营管理", + "兼职": "兼职赚钱", + "短视频": "拍短视频", + "音乐": "音乐乐库", + "英语": "英语学习", + "棋牌中心": "棋牌中心", + "摄影修图": "摄影修图", + "养生保健": "养生保健", + "办公": "办公工具", + "政务": "政务服务", + "视频": "视频拍摄", + "论坛圈子": "论坛圈子", + "彩票": "彩票乐透", + "直播": "直播娱乐", + "其他": "其他类别", + "休闲益智": "休闲益智", + "策略": "策略游戏", + "即时通讯": "即时通讯", + "汽车交易": "汽车交易", + "违章": "违章罚款", + "地图导航": "地图导航", + "民航": "民用航空", + "电台": "电台播报", + "语言(非英语)": "小语种类", + "搞笑": "搞笑娱乐", + "婚恋社交": "婚恋社交", + "社区超市": "社区超市", + "日常养车": "日常养车", + "杂志": "杂志期刊", + "视频教育": "线上教育", + "家政": "家政服务", + "影视娱乐": "影视娱乐", + "装修家居": "装修家居", + "体育咨讯": "体育资讯", + "社交工具": "社交工具", + "餐饮店": "餐饮美食", + "美颜": "美颜相机", + "问诊挂号": "问诊挂号", + "飞行空战": "飞行空战", + "综合预定": "综合预定", + "电影票务": "电影票务", + "笔记": "笔记记录", + "买房": "买房购房", + "外卖": "外卖配送", + "母婴": "母婴产品", + "打车": "打车出行", + "情侣社交": "情侣社交", + "日程管理": "日程管理", + "租车": "租车出行", + "微博博客": "微博博客", + "百科": "知识百科", + "绘画": "绘画学习", + "铁路": "铁路交通", + "生活社交": "生活社交", + "租房": "租房房源", + "酒店": "酒店住宿", + "保险": "保险理赔", + "问答交流": "问答交流", + "收款": "收款交易", + "MOBA": "多人竞技", + "K歌": "唱歌K歌", + "技术": "技术学习", + "减肥瘦身": "减肥瘦身", + "工作社交": "工作社交", + "团购": "团购拼单", + "记账": "记录记账", + "女性": "女性生活", + "公务员": "公务员类", + "二手": "二手交易", + "美妆美业": "美妆美业", + "汽车咨询": "汽车资讯", + "行程管理": "行程管理", + "免费WIFI": "WIFI", + "教辅": "教育辅助", + "成人": "成人两性", + "婚庆": "婚庆结婚", + "民宿短租": "民宿短租", + "出国": "出国相关" + } + ] +} + diff --git a/examples/few_shot/p-tuning/prompt/ocnli.json b/examples/few_shot/p-tuning/prompt/ocnli.json new file mode 100644 index 0000000000000000000000000000000000000000..796cb691f99d80806e8b01be2ef6240964c9e7a2 --- /dev/null +++ 
b/examples/few_shot/p-tuning/prompt/ocnli.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text': 'sentence1'}{'text': 'sentence2'}"} + ], + "verbalizer": [ + {"contradiction": "不同", "entailment": "相似", "neutral": "无关"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/tnews.json b/examples/few_shot/p-tuning/prompt/tnews.json new file mode 100644 index 0000000000000000000000000000000000000000..822c30badd5237c2ca4313e94cb273f13e737eb2 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/tnews.json @@ -0,0 +1,24 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text':'sentence'}"} + ], + "verbalizer": [ + { + "news_story": "八卦", + "news_entertainment": "明星", + "news_finance": "财经", + "news_sports": "体育", + "news_edu": "校园", + "news_game": "游戏", + "news_culture": "文化", + "news_tech": "科技", + "news_car": "汽车", + "news_travel": "旅行", + "news_world": "国际", + "news_agriculture": "农业", + "news_military": "军事", + "news_house": "房子", + "news_stock": "股票" + } + ] +} diff --git a/examples/few_shot/p-tuning/run_train.py b/examples/few_shot/p-tuning/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..abe66b7bd3fa883f246f05893ec432d57ace5f48 --- /dev/null +++ b/examples/few_shot/p-tuning/run_train.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
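+
+# This script runs P-tuning on FewCLUE classification tasks: the prompt text
+# and label words are read from the JSON files under `prompt/`, a SoftTemplate
+# supplies the trainable prompt embeddings, and PromptTrainer drives training,
+# evaluation, prediction, pseudo-labeling and static-graph export.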
+ +import os +import time +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import load_fewclue_dataset +from paddle.metric import Accuracy +from paddle.static import InputSpec +from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data + +from paddlenlp.prompt import ( + MaskedLMVerbalizer, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftTemplate, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) + split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) + prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) + prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) + augment_type: str = field(default=None, metadata={"help": "The strategy used for data augmentation, including `swap`, `delete`, `insert`, `subsitute`."}) + num_augment: str = field(default=5, metadata={"help": "Number of augmented data per example, which works when `augment_type` is set."}) + word_augment_percent: str = field(default=0.1, metadata={"help": "Percentage of augmented words in sequences, used for `swap`, `delete`, `insert`, `subsitute`."}) + augment_method: str = field(default="mlm", metadata={"help": "Strategy used for `insert` and `subsitute`."}) + pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) + do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) + do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + data_args = load_prompt_arguments(data_args) + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForMaskedLM.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.dropout, + attention_probs_dropout_prob=model_args.dropout, + ) + + # Define template for preprocess and verbalizer for postprocess. 
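+    # The {'soft'} tokens declared in the prompt JSON become continuous,
+    # trainable prompt embeddings (the core of P-tuning); the word embeddings
+    # passed to SoftTemplate are used to initialize them. The {'mask'}
+    # positions are scored by the MLM head and mapped back to label words by
+    # the verbalizer defined below.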
+ template = SoftTemplate(data_args.prompt, tokenizer, training_args.max_seq_length, model.get_input_embeddings()) + logger.info("Using template: {}".format(template.prompt)) + + verbalizer = MaskedLMVerbalizer(data_args.label_words, tokenizer) + labels_to_ids = verbalizer.labels_to_ids + ids_to_labels = {idx: label for label, idx in labels_to_ids.items()} + logger.info("Using verbalizer: {}".format(data_args.label_words)) + + # Load datasets. + data_ds, label_list = load_fewclue_dataset(data_args, verbalizer=verbalizer, example_keys=template.example_keys) + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = data_ds + dev_labels, test_labels = label_list + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds, labels, verbalizer): + metric = Accuracy() + predictions = paddle.to_tensor(eval_preds.predictions) + predictions = verbalizer.aggregate_multiple_mask(predictions) + correct = metric.compute(predictions, paddle.to_tensor(labels)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Initialize the trainer. + dev_compute_metrics = partial(compute_metrics, labels=dev_labels, verbalizer=verbalizer) + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=dev_compute_metrics, + ) + + # Traininig. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) + + # Test. + if data_args.do_test and public_test_ds is not None: + test_compute_metrics = partial(compute_metrics, labels=test_labels, verbalizer=verbalizer) + trainer.compute_metrics = test_compute_metrics + test_ret = trainer.predict(public_test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Predict. + if training_args.do_predict and test_ds is not None: + pred_ret = trainer.predict(test_ds) + logger.info("Prediction done.") + predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) + save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) + + # Label unsupervised data. + if data_args.do_label and unlabeled_ds is not None: + label_ret = trainer.predict(unlabeled_ds) + logger.info("Labeling done.") + pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") + save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) + + # Export static model. 
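+    # Each InputSpec below matches one tensor produced by the template for a
+    # batch: the standard ERNIE inputs, the masked positions to predict, and
+    # soft_token_ids marking where the trainable prompt embeddings are placed;
+    # encoder_ids is appended only when the template defines a prompt encoder.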
+ if training_args.do_export: + template = prompt_model.template + template_keywords = template.extract_template_keywords(template.prompt) + input_spec = [ + InputSpec(shape=[None, None], dtype="int64"), # input_ids, + InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + InputSpec(shape=[None, None], dtype="int64"), # position_ids + InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask + InputSpec(shape=[None], dtype="int64"), # masked_positions + InputSpec(shape=[None, None], dtype="int64"), # soft_token_ids + ] + if "encoder" in template_keywords: + input_spec.append(InputSpec(shape=[None, None], dtype="int64")) # encoder_ids + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/p-tuning/utils.py b/examples/few_shot/p-tuning/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..989b4e6b81a8d156f93bf256e79ba5a6ed201197 --- /dev/null +++ b/examples/few_shot/p-tuning/utils.py @@ -0,0 +1,249 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import pathlib + +import numpy as np +import paddle + +from paddlenlp.datasets import load_dataset + +LABEL_TO_STANDARD = { + "tnews": { + "news_story": "100", + "news_culture": "101", + "news_entertainment": "102", + "news_sports": "103", + "news_finance": "104", + "news_house": "106", + "news_car": "107", + "news_edu": "108", + "news_tech": "109", + "news_military": "110", + "news_travel": "112", + "news_world": "113", + "news_stock": "114", + "news_agriculture": "115", + "news_game": "116", + }, + "iflytek": { + "打车": 0, + "美颜": 100, + "影像剪辑": 101, + "摄影修图": 102, + "相机": 103, + "绘画": 104, + "二手": 105, + "电商": 106, + "团购": 107, + "外卖": 108, + "电影票务": 109, + "社区服务": 10, + "社区超市": 110, + "购物咨询": 111, + "笔记": 112, + "办公": 113, + "日程管理": 114, + "女性": 115, + "经营": 116, + "收款": 117, + "其他": 118, + "薅羊毛": 11, + "魔幻": 12, + "仙侠": 13, + "卡牌": 14, + "飞行空战": 15, + "射击游戏": 16, + "休闲益智": 17, + "动作类": 18, + "体育竞技": 19, + "地图导航": 1, + "棋牌中心": 20, + "经营养成": 21, + "策略": 22, + "MOBA": 23, + "辅助工具": 24, + "约会社交": 25, + "即时通讯": 26, + "工作社交": 27, + "论坛圈子": 28, + "婚恋社交": 29, + "免费WIFI": 2, + "情侣社交": 30, + "社交工具": 31, + "生活社交": 32, + "微博博客": 33, + "新闻": 34, + "漫画": 35, + "小说": 36, + "技术": 37, + "教辅": 38, + "问答交流": 39, + "租车": 3, + "搞笑": 40, + "杂志": 41, + "百科": 42, + "影视娱乐": 43, + "求职": 44, + "兼职": 45, + "视频": 46, + "短视频": 47, + "音乐": 48, + "直播": 49, + "同城服务": 4, + "电台": 50, + "K歌": 51, + "成人": 52, + "中小学": 53, + "职考": 54, + "公务员": 55, + "英语": 56, + "视频教育": 57, + "高等教育": 58, + "成人教育": 59, + "快递物流": 5, + "艺术": 60, + "语言(非英语)": 61, + "旅游资讯": 62, + "综合预定": 63, + "民航": 64, + "铁路": 65, + "酒店": 66, + "行程管理": 67, + "民宿短租": 68, + "出国": 69, + "婚庆": 6, + "工具": 70, + "亲子儿童": 71, + "母婴": 72, + "驾校": 73, + "违章": 74, + "汽车咨询": 75, + "汽车交易": 76, + "日常养车": 77, + "行车辅助": 
78, + "租房": 79, + "家政": 7, + "买房": 80, + "装修家居": 81, + "电子产品": 82, + "问诊挂号": 83, + "养生保健": 84, + "医疗服务": 85, + "减肥瘦身": 86, + "美妆美业": 87, + "菜谱": 88, + "餐饮店": 89, + "公共交通": 8, + "体育咨讯": 90, + "运动健身": 91, + "支付": 92, + "保险": 93, + "股票": 94, + "借贷": 95, + "理财": 96, + "彩票": 97, + "记账": 98, + "银行": 99, + "政务": 9, + }, +} + + +def load_prompt_arguments(args): + """ + Load prompt and label words according to prompt index. + """ + with open(args.prompt_path, "r", encoding="utf-8") as fp: + configs = json.load(fp) + assert len(configs["verbalizer"]) == len(configs["template"]) + assert configs["verbalizer"][0] is not None + verbalizer = [configs["verbalizer"][0]] + last_verb_index = 0 + for index, verb in enumerate(configs["verbalizer"][1:]): + if verb is None or len(verb) == 0: + verbalizer.append(configs["verbalizer"][last_verb_index]) + else: + verbalizer.append(verb) + last_verb_index = index + 1 + configs["verbalizer"] = verbalizer + args.prompt = configs["template"][args.prompt_index]["text"] + label_words = configs["verbalizer"][args.prompt_index] + if isinstance(label_words, list): + label_words = {k: k for k in label_words} + args.label_words = label_words + return args + + +def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): + """ + Combine unsupervised data and corresponding predicted labels and + save one example per line. + """ + if task_name == "cluewsc": + return None + + data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + label_preds = np.argmax(preds, axis=1) + label_probs = np.max(preds, axis=1) + pseudo_data = [] + for index, example in enumerate(data_ds): + example["labels"] = labels[label_preds[index]] + example["prob"] = str(label_probs[index]) + pseudo_data.append(example) + save_data(pseudo_data, save_path) + + +def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): + """ + Extract predicted labels and save as the format required by FewCLUE. 
+ """ + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + if task_name == "chid": + batch_size = preds.shape[0] + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([batch_size // 7, 7]) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + preds = np.argmax(preds, axis=1) + test_ds = load_dataset("fewclue", name=task_name, splits="test") + + ret_list = [] + maps = LABEL_TO_STANDARD.get(task_name, None) + for idx, example in enumerate(test_ds): + uid = example.get("id", idx) + if task_name in ["bustm", "csl"]: + ret_list.append({"id": uid, "label": str(preds[idx])}) + elif task_name == "chid": + ret_list.append({"id": uid, "answer": preds[idx]}) + elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: + ret_list.append({"id": uid, "label": labels[preds[idx]]}) + elif task_name in ["iflytek", "tnews"]: + ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) + save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" + save_data(ret_list, save_path, save_file + "_predict.json") + + +def save_data(data, save_path, save_file=None): + if save_file is not None: + pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) + save_path = os.path.join(save_path, save_file) + with open(save_path, "w") as fp: + for example in data: + fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/few_shot/pet/README.md b/examples/few_shot/pet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0499883d4707227692f38d3acffee4c82176b9cc --- /dev/null +++ b/examples/few_shot/pet/README.md @@ -0,0 +1,84 @@ +# PET + +[Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference](https://arxiv.org/abs/2001.07676) + +## 算法简介 + +自然语言处理任务可以通过给预训练模型提供“任务描述”等方式来进行无监督学习,但效果一般低于有监督训练。而 Pattern-Exploiting Training (PET) 是一种半监督方法,通过将输入转换为完形填空形式的短语来帮助语言模型理解任务。然后用这些短语来给无标注数据打软标签。最后在得到的标注数据集上用有监督方法进行训练。在小样本设置下,PET 在部分任务上远超有监督学习和强半监督学习方法。以 PET 为代表的提示学习与微调学习的区别如下图所示,包括数据预处理模块 `Template` 和标签词映射模块 `Verbalizer`。详细介绍及定义方法参见 [Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 + +![PET_and_FT](https://user-images.githubusercontent.com/25607475/192727706-0a17b5ef-db6b-46be-894d-0ee315306776.png) + + +## 快速开始 + +CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 PET 算法训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 + +### 代码结构说明 +``` +├── run_train.py # PET 算法提示学习脚本 +├── data.py # 数据集构造、数据增强 +├── utils.py # FewCLUE 提交结果保存等工具函数 +└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 +``` + +### 数据准备 + +读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: + +``` +from paddlenlp.datasets import load_dataset + +# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的eprstmt 数据集 +train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) +``` + +### 模型训练、评估、预测 + +通过如下命令,指定 GPU 0 卡, 在 FewCLUE 的 `eprstmt` 数据集上进行训练&评估 +``` +python -u -m paddle.distributed.launch --gpus "0" run_train.py \ + --output_dir checkpoint_eprstmt \ + --task_name eprstmt \ + --split_id few_all \ + --prompt_path prompt/eprstmt.json \ + --prompt_index 0 \ + --do_train \ + --do_eval \ + --do_test \ + --do_predict \ + --do_label \ + --max_steps 1000 \ + --learning_rate 3e-5 \ + --eval_steps 100 \ + --save_steps 100 \ + --logging_steps 5 \ + 
--per_device_train_batch_size 16 \ + --max_seq_length 128 \ + --load_best_model_at_end \ + --metric_for_best_model accuracy \ + --save_total_limit 1 +``` +参数含义说明 +- `task_name`: FewCLUE 中的数据集名字 +- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all +- `prompt_path`: prompt 定义文件名 +- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt +- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute +- `num_augment`: 数据增强策略为每个样本生成的样本数量 +- `word_augment_percent`: 每个序列中数据增强词所占的比例 +- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 +- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 +- `do_test`: 是否在公开测试集上评估模型效果 +- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` +- `use_rdrop`: 是否使用对比学习策略 R-Drop +- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 +- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 +- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 +- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) + +### 模型部署 + +Coming soon... + +## References +[1] Schick, Timo, and Hinrich Schütze. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” ArXiv:2001.07676 [Cs], January 25, 2021. http://arxiv.org/abs/2001.07676. diff --git a/examples/few_shot/pet/data.py b/examples/few_shot/pet/data.py new file mode 100644 index 0000000000000000000000000000000000000000..ba2cca68383011e989c4b0197500dd6757fe3708 --- /dev/null +++ b/examples/few_shot/pet/data.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from functools import partial + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import MapDataset, load_dataset + + +def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): + """ + Extend train dataset with pseudo labeled examples if exists. + """ + if pseudo_path is None: + return data_ds + with open(pseudo_path, "r", encoding="utf-8") as fp: + pseudo_data = [json.loads(x.strip()) for x in fp] + data_ds = MapDataset([x for x in data_ds] + pseudo_data) + return data_ds + + +def extend_with_data_augment(data_ds, aug_type, num_aug=10, percent=0.1, aug_base="mlm", example_keys=None): + """ + Extend train dataset with augmentation. 
+ """ + if example_keys is None: + return data_ds + if aug_type is None or aug_type == "None": + return data_ds + if aug_type == "delete": + aug = WordDelete(create_n=num_aug, aug_percent=percent) + elif aug_type == "substitute": + aug = WordSubstitute(aug_base, create_n=num_aug, aug_percent=percent) + elif aug_type == "insert": + aug = WordInsert(aug_base, create_n=num_aug, aug_percent=percent) + elif aug_type == "swap": + aug = WordSwap(create_n=num_aug, aug_percent=percent) + else: + raise ValueError("Unsupported data augment strategy `{}`".format(aug_type)) + + aug_data = [] + for example in data_ds: + for key in example_keys: + text_aug = aug.augment(example[key]) + for text in text_aug: + new_example = example.copy() + example[key] = text + aug_data.append(new_example) + + data_ds = MapDataset([x for x in data_ds] + aug_data) + return data_ds + + +def convert_chid(data_ds): + """ + Insert idioms into positions of `#idiom#` so that the task is converted + to binary classification. + """ + split_data_ds = [] + for example in data_ds: + fragments = example["content"].split("#idiom#") + label = example.get("answer", None) + for index, cand in enumerate(example["candidates"]): + new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} + if label is not None: + new_example["label"] = str(int(index == label)) + split_data_ds.append(new_example) + return MapDataset(split_data_ds) + + +def convert_csl(data_ds): + """ + Concatanate keywords and it can be replaced by keyword `options` in develop versioin. + """ + concat_data_ds = [] + for example in data_ds: + example["keyword"] = ",".join(example["keyword"]) + concat_data_ds.append(example) + return MapDataset(concat_data_ds) + + +def convert_cluewsc(data_ds): + """ + Mark the pronoun and entity with special tokens. + """ + marked_data_ds = [] + for example in data_ds: + target, text = example["target"], list(example["text"]) + pronoun, p_index = target["span2_text"], target["span2_index"] + entity, e_index = target["span1_text"], target["span1_index"] + label = example.get("label", None) + if p_index > e_index: + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + else: + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} + if label is not None: + new_example["label"] = label + marked_data_ds.append(new_example) + return MapDataset(marked_data_ds) + + +def convert_labels_to_ids(example, orig_key, labels_to_ids): + """ + Convert the keyword in datasets to `labels`. + """ + if orig_key in example: + example["label_ids"] = labels_to_ids[example.pop(orig_key)] + return example + + +def convert_ids_to_words(example, token_ids): + """ + Convert label id to the first word in mapping from labels to words, + the length of which should coincide with that of `mask` in prompt. + """ + if "label_ids" in example: + labels = paddle.index_select(token_ids, paddle.to_tensor(example.pop("label_ids")), axis=0).squeeze(0) + example["labels"] = labels + return example + + +def load_fewclue_dataset(args, verbalizer, example_keys=None): + """ + Load fewclue datasets and convert them to the standard format of PET. 
+ """ + split_id = args.split_id + splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] + if args.task_name == "cluewsc": + train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) + unlabeled_ds = None + else: + splits.append("unlabeled") + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( + "fewclue", name=args.task_name, splits=splits + ) + data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] + + # Preprocess data for mask prediction task. + if args.task_name == "chid": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_chid(sub_data_ds) + elif args.task_name == "cluewsc": + for index, sub_data_ds in enumerate(data_ds[:-1]): + data_ds[index] = convert_cluewsc(sub_data_ds) + elif args.task_name == "csl": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_csl(sub_data_ds) + orig_key = "label" + if args.task_name == "tnews": + orig_key = "label_desc" + elif args.task_name == "iflytek": + orig_key = "label_des" + convert_label = partial(convert_labels_to_ids, orig_key=orig_key, labels_to_ids=verbalizer.labels_to_ids) + for index, sub_data_ds in enumerate(data_ds): + if sub_data_ds is not None: + data_ds[index] = sub_data_ds.map(convert_label) + + # Extend train dataset with data augmentation and pseudo-label data. + data_ds[0] = extend_with_data_augment( + data_ds[0], args.augment_type, args.num_augment, args.word_augment_percent, args.augment_method, example_keys + ) + data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) + + dev_labels = [x["label_ids"] for x in data_ds[1]] + test_labels = [x["label_ids"] for x in data_ds[2]] + + convert_fn = partial(convert_ids_to_words, token_ids=verbalizer.token_ids[:, 0, :]) + data_ds[:3] = [x.map(convert_fn) for x in data_ds[:3]] + + return data_ds, (dev_labels, test_labels) diff --git a/examples/few_shot/pet/prompt/bustm.json b/examples/few_shot/pet/prompt/bustm.json new file mode 100644 index 0000000000000000000000000000000000000000..ab377ea85708450af14a138fd48049a6b9df447d --- /dev/null +++ b/examples/few_shot/pet/prompt/bustm.json @@ -0,0 +1,14 @@ +{ + "template": [ + {"text": "下边两句话说的是一个事情吗?{'mask'}“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"}, + {"text": "下边两个句子说的是{'mask'}{'mask'}的事情。“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"}, + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”意思{'mask'}{'mask'}。"}, + {"text": "“{'text':'sentence1'}”和“{'text':'sentence2'}”描述的是{'mask'}{'mask'}的事情。"} + ], + "verbalizer": [ + {"0": "不", "1": "是"}, + {"0": "不同", "1": "相同"}, + {"0": "不同", "1": "一样"}, + {"0": "不同", "1": "相同"} + ] +} diff --git a/examples/few_shot/pet/prompt/chid.json b/examples/few_shot/pet/prompt/chid.json new file mode 100644 index 0000000000000000000000000000000000000000..24dac2d41100e9640af01269479fb6de12b0ca97 --- /dev/null +++ b/examples/few_shot/pet/prompt/chid.json @@ -0,0 +1,14 @@ +{ + "template": [ + {"text": "{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}{'mask'}"}, + {"text": "{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}成语{'text':'idiom'}用在这个句子中{'mask'}合适。"}, + {"text": "选一个合适的词语填在括号里,你会选“{'text': 'idiom'}”吗?{'mask'}。“{'text':'content_pre'}(){'text': 'content_post'}”"}, + {"text": "下边句中成语[{'text':'idiom'}]的理解正确吗?{'mask'}{'mask'}。“{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}”"} + ], + "verbalizer": [ + {"0": "否", "1": "是"}, + {"0": "不", "1": "很"}, + {"0": "不", "1": 
"会"}, + {"0": "错误", "1": "正确"} + ] +} \ No newline at end of file diff --git a/examples/few_shot/pet/prompt/cluewsc.json b/examples/few_shot/pet/prompt/cluewsc.json new file mode 100644 index 0000000000000000000000000000000000000000..76badab27eb87ccd8054866e173afb76731fc56e --- /dev/null +++ b/examples/few_shot/pet/prompt/cluewsc.json @@ -0,0 +1,12 @@ +{ + "template": [ + {"text": "{'text': 'text'}{'text': 'pronoun'}指的{'mask'}是{'text': 'entity'}"}, + {"text": "{'text': 'text'}{'text': 'pronoun'}指的是{'text': 'entity'}。这里{'text': 'pronoun'}理解得对吗?{'mask'}"}, + {"text": "{'text': 'text'}{'text': 'pronoun'}{'mask'}{'mask'}地代表了{'text': 'entity'}"} + ], + "verbalizer": [ + {"false": "不", "true": "就"}, + {"false": "错", "true": "对"}, + {"false": "错误", "true": "正确"} + ] +} diff --git a/examples/few_shot/pet/prompt/csl.json b/examples/few_shot/pet/prompt/csl.json new file mode 100644 index 0000000000000000000000000000000000000000..c604c90d0ce506089aea570966f6b3fb995345ef --- /dev/null +++ b/examples/few_shot/pet/prompt/csl.json @@ -0,0 +1,14 @@ +{ + "template": [ + {"text": "给定以下几个词语:{'text': 'keyword'}{'mask'}{'mask'}扩写成“{'text': 'abst'}”"}, + {"text": "{'text':'abst'}这段话中关键词包括{'text':'keyword', 'truncate': False}对吗?{'mask'}。"}, + {"text": "{'text':'keyword'}这几个词和下边这段话内容{'mask'}关。“{'text':'abst'}”"}, + {"text": "“{'text':'abst'}”本文的内容{'mask'}{'mask'}“{'text':'keyword'}”"} + ], + "verbalizer": [ + {"0": "不能", "1": "可以"}, + {"0": "错", "1": "对"}, + {"0": "无", "1": "有"}, + {"0": "不含", "1": "包括"} + ] +} diff --git a/examples/few_shot/pet/prompt/csldcp.json b/examples/few_shot/pet/prompt/csldcp.json new file mode 100644 index 0000000000000000000000000000000000000000..e0fcf846b7ed3ff0730f24d4319e288b11760e3d --- /dev/null +++ b/examples/few_shot/pet/prompt/csldcp.json @@ -0,0 +1,82 @@ +{ + "template": [ + {"text": "阅读下边一段{'mask'}{'mask'}学的资料:“{'text': 'content'}”"}, + {"text": "阅读下边这段{'mask'}{'mask'}方面的材料:“{'text': 'content'}”"}, + {"text": "阅读这段{'mask'}{'mask'}学的文献:“{'text': 'content'}”"}, + {"text": "阅读这段{'mask'}{'mask'}学的材料:“{'text': 'content'}”"} + ], + "verbalizer": [ + { + "材料科学与工程": "材料", + "作物学": "作物", + "口腔医学": "口腔", + "药学": "药学", + "教育学": "教育", + "水利工程": "水利", + "理论经济学": "理经", + "食品科学与工程": "食品", + "畜牧学/兽医学": "畜牧", + "体育学": "体育", + "核科学与技术": "核科", + "力学": "力学", + "园艺学": "园艺", + "水产": "水产", + "法学": "法学", + "地质学/地质资源与地质工程": "地质", + "石油与天然气工程": "石油", + "农林经济管理": "农林", + "信息与通信工程": "通信", + "图书馆、情报与档案管理": "图书", + "政治学": "政治", + "电气工程": "电气", + "海洋科学": "海洋", + "民族学": "民族", + "航空宇航科学与技术": "航空", + "化学/化学工程与技术": "化学", + "哲学": "哲学", + "公共卫生与预防医学": "卫生", + "艺术学": "艺术", + "农业工程": "农工", + "船舶与海洋工程": "船舶", + "计算机科学与技术": "计科", + "冶金工程": "冶金", + "交通运输工程": "交通", + "动力工程及工程热物理": "动力", + "纺织科学与工程": "纺织", + "建筑学": "建筑", + "环境科学与工程": "环境", + "公共管理": "公管", + "数学": "数学", + "物理学": "物理", + "林学/林业工程": "林学", + "心理学": "心理", + "历史学": "历史", + "工商管理": "工管", + "应用经济学": "应经", + "中医学/中药学": "中医", + "天文学": "天文", + "机械工程": "机械", + "土木工程": "土木", + "光学工程": "光学", + "地理学": "地理", + "农业资源利用": "农业", + "生物学/生物科学与工程": "生物", + "兵器科学与技术": "兵器", + "矿业工程": "矿业", + "大气科学": "大气", + "基础医学/临床医学": "基础", + "电子科学与技术": "电子", + "测绘科学与技术": "测绘", + "控制科学与工程": "控制", + "军事学": "军事", + "中国语言文学": "中文", + "新闻传播学": "新闻", + "社会学": "社会", + "地球物理学":"地球", + "植物保护":"植保" + }, + {}, + {}, + {} + ] +} diff --git a/examples/few_shot/pet/prompt/eprstmt.json b/examples/few_shot/pet/prompt/eprstmt.json new file mode 100644 index 0000000000000000000000000000000000000000..84e408def08760013c8f483582a6761ac72dd612 --- /dev/null +++ b/examples/few_shot/pet/prompt/eprstmt.json @@ 
-0,0 +1,18 @@ +{ + "template": [ + {"text": "{'text':'sentence'}我{'mask'}喜欢。"}, + {"text": "我{'mask'}喜欢。{'text':'sentence'}"}, + {"text": "{'mask'}{'mask'}推荐这件商品!{'text':'sentence'}"}, + {"text": "我对这个东西{'mask'}满意。{'text':'sentence'}"}, + {"text": "{'mask'}理想。{'text':'sentence'}"}, + {"text": "{'text':'sentence'}这句话表示我{'mask'}满意。"} + ], + "verbalizer": [ + {"Negative": "不", "Positive": "很"}, + {"Negative": "不", "Positive": "很"}, + {"Negative": "很不", "Positive": "非常"}, + {"Negative": "不", "Positive": "很"}, + {"Negative": "不", "Positive": "很"}, + {"Negative": "不", "Positive": "很"} + ] +} diff --git a/examples/few_shot/pet/prompt/iflytek.json b/examples/few_shot/pet/prompt/iflytek.json new file mode 100644 index 0000000000000000000000000000000000000000..9bce98d3f57ab67ce8284ba8a280641c30d37987 --- /dev/null +++ b/examples/few_shot/pet/prompt/iflytek.json @@ -0,0 +1,253 @@ +{ + "template": [ + {"text": "下边介绍的是和{'mask': None, 'length': 4}相关的产品:{'text': 'sentence'}"}, + {"text": "搜索更多{'mask'}{'mask'}相关的应用程序。{'text': 'sentence'}"}, + {"text": "这段话跟什么有关?{'mask'}{'mask'}“{'text': 'sentence'}”"} + ], + "verbalizer": [ + { + "银行": "银行办理", + "社区服务": "社区服务", + "电商": "电商网购", + "支付": "支付交易", + "经营养成": "经营养成", + "卡牌": "卡牌游戏", + "借贷": "借贷借款", + "驾校": "驾校学车", + "理财": "投资理财", + "职考": "职业考试", + "新闻": "新闻资讯", + "旅游资讯": "旅游资讯", + "公共交通": "公共交通", + "魔幻": "魔幻游戏", + "医疗服务": "医疗服务", + "影像剪辑": "影像剪辑", + "动作类": "动作游戏", + "工具": "使用工具", + "体育竞技": "体育竞技", + "小说": "小说阅读", + "运动健身": "运动健身", + "相机": "相机拍照", + "辅助工具": "辅助工具", + "快递物流": "快递物流", + "高等教育": "高等教育", + "股票": "股票炒股", + "菜谱": "做菜菜谱", + "行车辅助": "行车帮助", + "仙侠": "仙侠小说", + "亲子儿童": "亲子儿童", + "购物咨询": "购物资讯", + "射击游戏": "射击游戏", + "漫画": "动漫漫画", + "中小学": "中学小学", + "同城服务": "同城跑腿", + "成人教育": "成人教育", + "求职": "面试求职", + "电子产品": "电子产品", + "艺术": "艺术学习", + "薅羊毛": "比价省钱", + "约会社交": "约会社交", + "经营": "经营管理", + "兼职": "兼职赚钱", + "短视频": "拍短视频", + "音乐": "音乐乐库", + "英语": "英语学习", + "棋牌中心": "棋牌中心", + "摄影修图": "摄影修图", + "养生保健": "养生保健", + "办公": "办公工具", + "政务": "政务服务", + "视频": "视频拍摄", + "论坛圈子": "论坛圈子", + "彩票": "彩票乐透", + "直播": "直播娱乐", + "其他": "其他类别", + "休闲益智": "休闲益智", + "策略": "策略游戏", + "即时通讯": "即时通讯", + "汽车交易": "汽车交易", + "违章": "违章罚款", + "地图导航": "地图导航", + "民航": "民用航空", + "电台": "电台播报", + "语言(非英语)": "小语种类", + "搞笑": "搞笑娱乐", + "婚恋社交": "婚恋社交", + "社区超市": "社区超市", + "日常养车": "日常养车", + "杂志": "杂志期刊", + "视频教育": "线上教育", + "家政": "家政服务", + "影视娱乐": "影视娱乐", + "装修家居": "装修家居", + "体育咨讯": "体育资讯", + "社交工具": "社交工具", + "餐饮店": "餐饮美食", + "美颜": "美颜相机", + "问诊挂号": "问诊挂号", + "飞行空战": "飞行空战", + "综合预定": "综合预定", + "电影票务": "电影票务", + "笔记": "笔记记录", + "买房": "买房购房", + "外卖": "外卖配送", + "母婴": "母婴产品", + "打车": "打车出行", + "情侣社交": "情侣社交", + "日程管理": "日程管理", + "租车": "租车出行", + "微博博客": "微博博客", + "百科": "知识百科", + "绘画": "绘画学习", + "铁路": "铁路交通", + "生活社交": "生活社交", + "租房": "租房房源", + "酒店": "酒店住宿", + "保险": "保险理赔", + "问答交流": "问答交流", + "收款": "收款交易", + "MOBA": "多人竞技", + "K歌": "唱歌K歌", + "技术": "技术学习", + "减肥瘦身": "减肥瘦身", + "工作社交": "工作社交", + "团购": "团购拼单", + "记账": "记录记账", + "女性": "女性生活", + "公务员": "公务员类", + "二手": "二手交易", + "美妆美业": "美妆美业", + "汽车咨询": "汽车资讯", + "行程管理": "行程管理", + "免费WIFI": "WIFI", + "教辅": "教育辅助", + "成人": "成人两性", + "婚庆": "婚庆结婚", + "民宿短租": "民宿短租", + "出国": "出国相关" + }, + { + "银行": "银行", + "社区服务": "社区", + "电商": "网购", + "支付": "付钱", + "经营养成": "养成", + "卡牌": "纸牌", + "借贷": "借钱", + "驾校": "学车", + "理财": "投资", + "职考": "考试", + "新闻": "新闻", + "旅游资讯": "旅游", + "公共交通": "交通", + "魔幻": "魔幻", + "医疗服务": "医疗", + "影像剪辑": "剪辑", + "动作类": "动作", + "工具": "工具", + "体育竞技": "体育", + "小说": "小说", + "运动健身": "运动", + "相机": "相机", + "辅助工具": "辅助", + "快递物流": "快递", + "高等教育": "教育", + "股票": "股票", + "菜谱": 
"菜谱", + "行车辅助": "帮助", + "仙侠": "仙侠", + "亲子儿童": "小孩", + "购物咨询": "购物", + "射击游戏": "射击", + "漫画": "漫画", + "中小学": "小学", + "同城服务": "跑腿", + "成人教育": "成人", + "求职": "面试", + "电子产品": "电子", + "艺术": "艺术", + "薅羊毛": "赚钱", + "约会社交": "约会", + "经营": "经营", + "兼职": "兼职", + "短视频": "短片", + "音乐": "音乐", + "英语": "英语", + "棋牌中心": "棋牌", + "摄影修图": "拍照", + "养生保健": "养生", + "办公": "办公", + "政务": "政务", + "视频": "视频", + "论坛圈子": "论坛", + "彩票": "彩票", + "直播": "直播", + "其他": "其他", + "休闲益智": "休闲", + "策略": "策略", + "即时通讯": "通讯", + "汽车交易": "买车", + "违章": "违章", + "地图导航": "地图", + "民航": "航空", + "电台": "电台", + "语言(非英语)": "语言", + "搞笑": "搞笑", + "婚恋社交": "婚恋", + "社区超市": "超市", + "日常养车": "养车", + "杂志": "杂志", + "视频教育": "线上", + "家政": "家政", + "影视娱乐": "影视", + "装修家居": "装修", + "体育咨讯": "资讯", + "社交工具": "交流", + "餐饮店": "美食", + "美颜": "美颜", + "问诊挂号": "挂号", + "飞行空战": "飞行", + "综合预定": "预定", + "电影票务": "票务", + "笔记": "笔记", + "买房": "买房", + "外卖": "外卖", + "母婴": "母婴", + "打车": "打车", + "情侣社交": "情侣", + "日程管理": "日程", + "租车": "租车", + "微博博客": "博客", + "百科": "百科", + "绘画": "绘画", + "铁路": "铁路", + "生活社交": "生活", + "租房": "租房", + "酒店": "酒店", + "保险": "保险", + "问答交流": "问答", + "收款": "收款", + "MOBA": "多人", + "K歌": "唱歌", + "技术": "技术", + "减肥瘦身": "减肥", + "工作社交": "工作", + "团购": "团购", + "记账": "记录", + "女性": "女性", + "公务员": "公务", + "二手": "二手", + "美妆美业": "美妆", + "汽车咨询": "汽车", + "行程管理": "行程", + "免费WIFI": "上网", + "教辅": "教辅", + "成人": "两性", + "婚庆": "婚庆", + "民宿短租": "民宿", + "出国": "出国" + }, + {} + ] +} + diff --git a/examples/few_shot/pet/prompt/ocnli.json b/examples/few_shot/pet/prompt/ocnli.json new file mode 100644 index 0000000000000000000000000000000000000000..0519816f080b9fe9d71cacd5d757a72ff54e5427 --- /dev/null +++ b/examples/few_shot/pet/prompt/ocnli.json @@ -0,0 +1,12 @@ +{ + "template": [ + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”之间的逻辑关系是{'mask'}{'mask'}"}, + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”说的是{'mask'}{'mask'}的东西。"}, + {"text": "下边两句话之间有什么逻辑关系?{'mask'}{'mask'}“{'text': 'sentence1'}”{'sep'}“{'text': 'sentence2'}”"} + ], + "verbalizer": [ + {"contradiction": "矛盾", "entailment": "蕴含", "neutral": "中立"}, + {"contradiction": "矛盾", "entailment": "蕴含", "neutral": "中立"}, + {"contradiction": "不同", "entailment": "类似", "neutral": "无关"} + ] +} \ No newline at end of file diff --git a/examples/few_shot/pet/prompt/tnews.json b/examples/few_shot/pet/prompt/tnews.json new file mode 100644 index 0000000000000000000000000000000000000000..5a2c7449b8e132b9a4dde3048217df9951d7fb65 --- /dev/null +++ b/examples/few_shot/pet/prompt/tnews.json @@ -0,0 +1,30 @@ +{ + "template": [ + {"text": "阅读下边一则{'mask'}{'mask'}新闻:{'text':'sentence'}"}, + {"text": "阅读这篇标题为「{'text':'sentence'}」的文章,它讲的是{'mask'}{'mask'}。"}, + {"text": "下边这则新闻属于{'mask'}{'mask'}话题{'text':'sentence'}"}, + {"text": "下边这则新闻属于什么话题呢?{'mask'}{'mask'}{'text':'sentence'}"} + ], + "verbalizer": [ + { + "news_story": "八卦", + "news_entertainment": "明星", + "news_finance": "财经", + "news_sports": "体育", + "news_edu": "校园", + "news_game": "游戏", + "news_culture": "文化", + "news_tech": "科技", + "news_car": "汽车", + "news_travel": "旅行", + "news_world": "国际", + "news_agriculture": "农业", + "news_military": "军事", + "news_house": "房子", + "news_stock": "股票" + }, + {}, + {}, + {} + ] +} diff --git a/examples/few_shot/pet/run_train.py b/examples/few_shot/pet/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..3bab91cfe712d2456ee390073a56b21e7e8f6d4c --- /dev/null +++ b/examples/few_shot/pet/run_train.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import load_fewclue_dataset +from paddle.metric import Accuracy +from paddle.static import InputSpec +from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data + +from paddlenlp.prompt import ( + ManualTemplate, + MaskedLMVerbalizer, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) + split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) + prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) + prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) + augment_type: str = field(default=None, metadata={"help": "The strategy used for data augmentation, including `swap`, `delete`, `insert`, `subsitute`."}) + num_augment: str = field(default=5, metadata={"help": "Number of augmented data per example, which works when `augment_type` is set."}) + word_augment_percent: str = field(default=0.1, metadata={"help": "Percentage of augmented words in sequences, used for `swap`, `delete`, `insert`, `subsitute`."}) + augment_method: str = field(default="mlm", metadata={"help": "Strategy used for `insert` and `subsitute`."}) + pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) + do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) + do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + data_args = load_prompt_arguments(data_args) + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. 
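+    # PET casts classification as a cloze task, so the backbone keeps its
+    # masked-language-model head (AutoModelForMaskedLM) and the {'mask'} slots
+    # of the manual template are filled and mapped to label words. The dropout
+    # override corresponds to the `dropout` argument used by the R-Drop
+    # strategy described in the README.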
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForMaskedLM.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.dropout, + attention_probs_dropout_prob=model_args.dropout, + ) + + # Define template for preprocess and verbalizer for postprocess. + template = ManualTemplate(data_args.prompt, tokenizer, training_args.max_seq_length) + logger.info("Using template: {}".format(template.prompt)) + + verbalizer = MaskedLMVerbalizer(data_args.label_words, tokenizer) + labels_to_ids = verbalizer.labels_to_ids + ids_to_labels = {idx: label for label, idx in labels_to_ids.items()} + logger.info("Using verbalizer: {}".format(data_args.label_words)) + + # Load datasets. + data_ds, label_list = load_fewclue_dataset(data_args, verbalizer=verbalizer, example_keys=template.example_keys) + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = data_ds + dev_labels, test_labels = label_list + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds, labels, verbalizer): + metric = Accuracy() + predictions = paddle.to_tensor(eval_preds.predictions) + predictions = verbalizer.aggregate_multiple_mask(predictions) + correct = metric.compute(predictions, paddle.to_tensor(labels)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Initialize the trainer. + dev_compute_metrics = partial(compute_metrics, labels=dev_labels, verbalizer=verbalizer) + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=dev_compute_metrics, + ) + + # Traininig. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) + + # Test. + if data_args.do_test and public_test_ds is not None: + test_compute_metrics = partial(compute_metrics, labels=test_labels, verbalizer=verbalizer) + trainer.compute_metrics = test_compute_metrics + test_ret = trainer.predict(public_test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Predict. + if training_args.do_predict and test_ds is not None: + pred_ret = trainer.predict(test_ds) + logger.info("Prediction done.") + predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) + save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) + + # Label unsupervised data. + if data_args.do_label and unlabeled_ds is not None: + label_ret = trainer.predict(unlabeled_ds) + logger.info("Labeling done.") + pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") + save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) + + # Export static model. 
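+    # With a ManualTemplate there are no soft-prompt tensors, so the exported
+    # static graph only needs the five standard inputs listed below.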
+ if training_args.do_export: + input_spec = [ + InputSpec(shape=[None, None], dtype="int64"), # input_ids, + InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + InputSpec(shape=[None, None], dtype="int64"), # position_ids + InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask + InputSpec(shape=[None], dtype="int64"), # masked_positions + ] + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/pet/utils.py b/examples/few_shot/pet/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..989b4e6b81a8d156f93bf256e79ba5a6ed201197 --- /dev/null +++ b/examples/few_shot/pet/utils.py @@ -0,0 +1,249 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import pathlib + +import numpy as np +import paddle + +from paddlenlp.datasets import load_dataset + +LABEL_TO_STANDARD = { + "tnews": { + "news_story": "100", + "news_culture": "101", + "news_entertainment": "102", + "news_sports": "103", + "news_finance": "104", + "news_house": "106", + "news_car": "107", + "news_edu": "108", + "news_tech": "109", + "news_military": "110", + "news_travel": "112", + "news_world": "113", + "news_stock": "114", + "news_agriculture": "115", + "news_game": "116", + }, + "iflytek": { + "打车": 0, + "美颜": 100, + "影像剪辑": 101, + "摄影修图": 102, + "相机": 103, + "绘画": 104, + "二手": 105, + "电商": 106, + "团购": 107, + "外卖": 108, + "电影票务": 109, + "社区服务": 10, + "社区超市": 110, + "购物咨询": 111, + "笔记": 112, + "办公": 113, + "日程管理": 114, + "女性": 115, + "经营": 116, + "收款": 117, + "其他": 118, + "薅羊毛": 11, + "魔幻": 12, + "仙侠": 13, + "卡牌": 14, + "飞行空战": 15, + "射击游戏": 16, + "休闲益智": 17, + "动作类": 18, + "体育竞技": 19, + "地图导航": 1, + "棋牌中心": 20, + "经营养成": 21, + "策略": 22, + "MOBA": 23, + "辅助工具": 24, + "约会社交": 25, + "即时通讯": 26, + "工作社交": 27, + "论坛圈子": 28, + "婚恋社交": 29, + "免费WIFI": 2, + "情侣社交": 30, + "社交工具": 31, + "生活社交": 32, + "微博博客": 33, + "新闻": 34, + "漫画": 35, + "小说": 36, + "技术": 37, + "教辅": 38, + "问答交流": 39, + "租车": 3, + "搞笑": 40, + "杂志": 41, + "百科": 42, + "影视娱乐": 43, + "求职": 44, + "兼职": 45, + "视频": 46, + "短视频": 47, + "音乐": 48, + "直播": 49, + "同城服务": 4, + "电台": 50, + "K歌": 51, + "成人": 52, + "中小学": 53, + "职考": 54, + "公务员": 55, + "英语": 56, + "视频教育": 57, + "高等教育": 58, + "成人教育": 59, + "快递物流": 5, + "艺术": 60, + "语言(非英语)": 61, + "旅游资讯": 62, + "综合预定": 63, + "民航": 64, + "铁路": 65, + "酒店": 66, + "行程管理": 67, + "民宿短租": 68, + "出国": 69, + "婚庆": 6, + "工具": 70, + "亲子儿童": 71, + "母婴": 72, + "驾校": 73, + "违章": 74, + "汽车咨询": 75, + "汽车交易": 76, + "日常养车": 77, + "行车辅助": 78, + "租房": 79, + "家政": 7, + "买房": 80, + "装修家居": 81, + "电子产品": 82, + "问诊挂号": 83, + "养生保健": 84, + "医疗服务": 85, + "减肥瘦身": 86, + "美妆美业": 87, + "菜谱": 88, + "餐饮店": 89, + "公共交通": 8, + "体育咨讯": 90, + "运动健身": 91, + "支付": 92, + "保险": 93, + "股票": 94, + "借贷": 95, + "理财": 96, + "彩票": 97, + "记账": 98, + "银行": 99, + "政务": 
9, + }, +} + + +def load_prompt_arguments(args): + """ + Load prompt and label words according to prompt index. + """ + with open(args.prompt_path, "r", encoding="utf-8") as fp: + configs = json.load(fp) + assert len(configs["verbalizer"]) == len(configs["template"]) + assert configs["verbalizer"][0] is not None + verbalizer = [configs["verbalizer"][0]] + last_verb_index = 0 + for index, verb in enumerate(configs["verbalizer"][1:]): + if verb is None or len(verb) == 0: + verbalizer.append(configs["verbalizer"][last_verb_index]) + else: + verbalizer.append(verb) + last_verb_index = index + 1 + configs["verbalizer"] = verbalizer + args.prompt = configs["template"][args.prompt_index]["text"] + label_words = configs["verbalizer"][args.prompt_index] + if isinstance(label_words, list): + label_words = {k: k for k in label_words} + args.label_words = label_words + return args + + +def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): + """ + Combine unsupervised data and corresponding predicted labels and + save one example per line. + """ + if task_name == "cluewsc": + return None + + data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + label_preds = np.argmax(preds, axis=1) + label_probs = np.max(preds, axis=1) + pseudo_data = [] + for index, example in enumerate(data_ds): + example["labels"] = labels[label_preds[index]] + example["prob"] = str(label_probs[index]) + pseudo_data.append(example) + save_data(pseudo_data, save_path) + + +def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): + """ + Extract predicted labels and save as the format required by FewCLUE. 
+ """ + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + if task_name == "chid": + batch_size = preds.shape[0] + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([batch_size // 7, 7]) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + preds = np.argmax(preds, axis=1) + test_ds = load_dataset("fewclue", name=task_name, splits="test") + + ret_list = [] + maps = LABEL_TO_STANDARD.get(task_name, None) + for idx, example in enumerate(test_ds): + uid = example.get("id", idx) + if task_name in ["bustm", "csl"]: + ret_list.append({"id": uid, "label": str(preds[idx])}) + elif task_name == "chid": + ret_list.append({"id": uid, "answer": preds[idx]}) + elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: + ret_list.append({"id": uid, "label": labels[preds[idx]]}) + elif task_name in ["iflytek", "tnews"]: + ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) + save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" + save_data(ret_list, save_path, save_file + "_predict.json") + + +def save_data(data, save_path, save_file=None): + if save_file is not None: + pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) + save_path = os.path.join(save_path, save_file) + with open(save_path, "w") as fp: + for example in data: + fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/information_extraction/DuEE/README.md b/examples/information_extraction/DuEE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1bdf5fab793a5a85882ac22126afd1b04c70e637 --- /dev/null +++ b/examples/information_extraction/DuEE/README.md @@ -0,0 +1,231 @@ +# LIC2021 DuEE 事件抽取基线 + + +信息抽取旨在从非结构化自然语言文本中提取结构化知识,如实体、关系、事件等。事件抽取的目标是对于给定的自然语言句子,根据预先指定的事件类型和论元角色,识别句子中所有目标事件类型的事件,并根据相应的论元角色集合抽取事件所对应的论元。其中目标事件类型 (event_type) 和论元角色 (role) 限定了抽取的范围,例如 (event_type:胜负,role:时间,胜者,败者,赛事名称)、(event_type:夺冠,role:夺冠事件,夺冠赛事,冠军)。 + +
+事件抽取 +
+ +该示例展示了如何使用PaddleNLP快速复现[LIC2021事件抽取比赛](http://lic2021.ccf.org.cn/)基线并进阶优化基线。 + +同时,我们提供了该示例在线运行展示教程: +[PaddleNLP实战——LIC2021事件抽取任务基线](https://aistudio.baidu.com/aistudio/projectdetail/1639964) + +## 目录结构 + +以下是本项目主要目录结构及说明: + +```text +DuEE/ +├── classifier.py # 文本分类训练脚本 +├── duee_1_data_prepare.py # 句子级事件抽取数据预处理 +├── duee_1_postprocess.py # 句子级事件抽取数据后处理 +├── duee_fin_data_prepare.py # 篇章级事件抽取数据预处理 +├── duee_fin_postprocess.py # 篇章级事件抽取数据后处理 +├── README.md # 文档说明 +├── run_classifier.sh # 文本分类训练启动脚本 +├── run_duee_1.sh # 句子级事件抽取启动脚本 +├── run_duee_fin.sh # 篇章级事件抽取启动脚本 +├── run_sequence_labeling.sh # 序列标注训练启动脚本 +├── sequence_labeling.py # 序列标注训练脚本 +└── utils.py # 效能函数 +``` + + +## 篇章级事件抽取基线 + +篇章级事件抽取数据集(DuEE-Fin)是金融领域篇章级别事件抽取数据集, +共包含13个已定义好的事件类型约束和1.15万中文篇章(存在部分非目标篇章作为负样例),其中6900训练集,1150验证集和3450测试集,数据集下载[地址](https://aistudio.baidu.com/aistudio/competition/detail/65) 。 +在该数据集上基线采用基于[ERNIE](https://github.com/PaddlePaddle/ERNIE)的序列标注(sequence labeling)方案,分为基于序列标注的触发词抽取模型、基于序列标注的论元抽取模型和枚举属性分类模型,属于PipeLine模型;基于序列标注的触发词抽取模型采用BIO方式,识别触发词的位置以及对应的事件类型,基于序列标注的论元抽取模型采用BIO方式识别出事件中的论元以及对应的论元角色;枚举属性分类模型采用ernie进行分类。 + +### 评测方法 + +本任务采用预测论元F1值作为评价指标,对于每个篇章,采用不放回的方式给每个目标事件寻找最相似的预测事件(事件级别匹配),搜寻方式是优先寻找与目标事件的事件类型相同且角色和论元正确数量最多的预测事件 + +f1_score = (2 * P * R) / (P + R),其中 + +- 预测论元正确=事件类型和角色相同且论元正确 +- P=预测论元正确数量 / 所有预测论元的数量 +- R=预测论元正确数量 / 所有人工标注论元的数量 + +### 快速复现基线Step1:数据预处理并加载 + +从比赛官网下载数据集,逐层解压存放于data/DuEE-fin目录下,运行以下脚本将原始数据预处理成序列标注格式数据。 +处理之后的数据放在data/DuEE-Fin下,触发词识别数据文件存放在data/DuEE-Fin/role下,论元角色识别数据文件存放在data/DuEE-Fin/trigger下。 +枚举分类数据存放在data/DuEE-Fin/enum下。 + +``` +sh run_duee_fin.sh data_prepare +``` + +我们可以加载自定义数据集。通过继承[`paddle.io.Dataset`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/Dataset_cn.html#dataset),自定义实现`__getitem__` 和 `__len__`两个方法。 + +如下代码已完成加载数据集操作: +``` +train_ds = DuEventExtraction(args.train_data, args.vocab_path, args.tag_path) +dev_ds = DuEventExtraction(args.dev_data, args.vocab_path, args.tag_path) +test_ds = DuEventExtraction(args.test_data, args.vocab_path, args.tag_path) +``` + +### 快速复现基线Step2:构建模型 + + +基于序列标注的触发词抽取模型是整体模型的一部分,该部分主要是给定事件类型,识别句子中出现的事件触发词对应的位置以及对应的事件类别,该模型是基于ERNIE开发序列标注模型,模型原理图如下: + +
+基于序列标注的触发词抽取模型 +
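+
+触发词抽取模型按字进行BIO序列标注,标签中同时携带事件类型。下面用一个假设的句子示意标注格式(句子与标签只是示意,并非数据集中的真实样例):
+
+```python
+# 示意:事件类型为"收购"时,触发词"收购"标注为 B-收购、I-收购,其余字标注为 O
+tokens = ["新", "东", "方", "宣", "布", "收", "购", "东", "方", "优", "播"]
+labels = ["O", "O", "O", "O", "O", "B-收购", "I-收购", "O", "O", "O", "O"]
+```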
+
+
+同样地,基于序列标注的论元抽取模型也是基于ERNIE开发的序列标注模型,该部分主要识别出事件中的论元以及对应的论元角色,模型原理图如下:
+
+
+基于序列标注的论元抽取模型 +
+
+上述样例中通过模型识别出:1)论元"新东方",并分配标签"B-收购方"、"I-收购方"、"I-收购方";2)论元"东方优播",并分配标签"B-被收购方"、"I-被收购方"、"I-被收购方"、"I-被收购方"。最终识别出文本中包含的论元角色和论元对是<收购方,新东方>、<被收购方,东方优播>。
+
+**PaddleNLP提供了基于ERNIE预训练模型的常用序列标注模型,可以通过指定模型名字完成一键加载**:
+
+```python
+from paddlenlp.transformers import AutoModelForTokenClassification
+
+model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map))
+```
+
+同时,对于枚举分类数据,采用的是基于ERNIE的文本分类模型,枚举角色类型为"环节"。模型原理图如下:
+
+
+枚举属性分类模型 +
+ +> 给定文本,对文本进行分类,得到不同类别上的概率 筹备上市(0.8)、暂停上市(0.02)、正式上市(0.15)、终止上市(0.03) + + +**同样地,PaddleNLP提供了ERNIE预训练模型常用文本分类模型,可以通过指定模型名字完成一键加载**: + +```python +from paddlenlp.transformers import AutoModelForSequenceClassification + +model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map)) +``` + +### 快速复现基线Step3:数据处理 + +我们需要将原始数据处理成模型可读入的数据。PaddleNLP为了方便用户处理数据,内置了对于各个预训练模型对应的Tokenizer,可以完成 +文本token化,转token ID,文本长度截断等操作。与加载模型类似地,也可以一键加载。 + +```python +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") +``` + +文本数据处理直接调用tokenizer即可输出模型所需输入数据。 + +```python +inputs = tokenizer(text="请输入测试样例", max_seq_len=20) +# {'input_ids': [1, 647, 789, 109, 558, 525, 314, 656, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'seq_len': 9} +``` + + +### 快速复现基线Step4:定义损失函数和优化器,开始训练 + +在该基线上,我们选择交叉墒作为损失函数,使用[`paddle.optimizer.AdamW`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/adamw/AdamW_cn.html#adamw)作为优化器。 + +启动训练: +```shell +# 触发词识别模型训练 +sh run_duee_fin.sh trigger_train + +# 触发词识别预测 +sh run_duee_fin.sh trigger_predict + +# 论元识别模型训练 +sh run_duee_fin.sh role_train + +# 论元识别预测 +sh run_duee_fin.sh role_predict + +# 枚举分类模型训练 +sh run_duee_fin.sh enum_train + +# 枚举分类预测 +sh run_duee_fin.sh enum_predict +``` + + +### 快速复现基线Step5:数据后处理,提交结果 + +按照比赛预测指定格式提交结果至[评测网站](http://aistudio-bce.bcc-bdbl.baidu.com/aistudio/competition/detail/141)。 + + +``` shell +sh run_duee_fin.sh pred_2_submit +``` + +结果存放于`submit/test_duee_fin.json` + + +#### to-do 增加基线效果 + + +## 句子级事件抽取基线 + + +句子级别通用领域的事件抽取数据集([DuEE 1.0](https://aistudio.baidu.com/aistudio/competition/detail/32?isFromCcf=true))上进行事件抽取的基线模型,该模型采用基于[ERNIE](https://github.com/PaddlePaddle/ERNIE)的序列标注(sequence labeling)方案,分为基于序列标注的触发词抽取模型和基于序列标注的论元抽取模型,属于PipeLine模型;基于序列标注的触发词抽取模型采用BIO方式,识别触发词的位置以及对应的事件类型,基于序列标注的论元抽取模型采用BIO方式识别出事件中的论元以及对应的论元角色。模型和数据处理方式与篇章级事件抽取相同,此处不再赘述。句子级别通用领域的事件抽取无枚举角色分类。 + +启动训练: +```shell +# 训练触发词识别模型 +sh run_duee_1.sh trigger_train + +# 触发词识别预测 +sh run_duee_1.sh trigger_predict + +# 论元识别模型训练 +sh run_duee_1.sh role_train + +# 论元识别预测 +sh run_duee_1.sh role_predict + +# 数据后处理,提交预测结果 +# 结果存放于submit/test_duee_1.json` +sh run_duee_1.sh pred_2_submit +``` + +### 评测方法 + +事件论元结果与人工标注的事件论元结果进行匹配,并按字级别匹配F1进行打分,不区分大小写,如论元有多个表述,则取多个匹配F1中的最高值 + +f1_score = (2 * P * R) / (P + R),其中 + +- P=预测论元得分总和 / 所有预测论元的数量 +- R=预测论元得分总和 / 所有人工标注论元的数量 +- 预测论元得分=事件类型是否准确 * 论元角色是否准确 * 字级别匹配F1值 (*是相乘) +- 字级别匹配F1值 = 2 * 字级别匹配P值 * 字级别匹配R值 / (字级别匹配P值 + 字级别匹配R值) +- 字级别匹配P值 = 预测论元和人工标注论元共有字的数量/ 预测论元字数 +- 字级别匹配R值 = 预测论元和人工标注论元共有字的数量/ 人工标注论元字数 + +#### to-do 增加基线效果 + +## 进阶优化基线效果 + +基线采用的预训练模型为ERNIE,PaddleNLP提供了丰富的预训练模型,如BERT,RoBERTa,Electra,XLNet等 +参考[PaddleNLP预训练模型介绍](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) + +如可以选择RoBERTa large中文模型优化模型效果,只需更换模型和tokenizer即可无缝衔接。 + +```python +from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + +model = RobertaForTokenClassification.from_pretrained("roberta-wwm-ext-large", num_classes=len(label_map)) +tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") +``` + +## Reference + +- [DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios](https://link.springer.com/chapter/10.1007/978-3-030-60457-8_44) diff --git a/examples/information_extraction/DuEE/classifier.py b/examples/information_extraction/DuEE/classifier.py new file mode 100644 index 
0000000000000000000000000000000000000000..9732593639a78eac11b6fbde12438939e288112a --- /dev/null +++ b/examples/information_extraction/DuEE/classifier.py @@ -0,0 +1,323 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +classification +""" +import argparse +import ast +import csv +import json +import os +import random +import traceback +from collections import namedtuple +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from utils import load_dict, read_by_lines, write_by_lines + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +""" +For All pre-trained model(English and Chinese), +Please refer to https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer +""" + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.") +parser.add_argument("--tag_path", type=str, default=None, help="tag set path") +parser.add_argument("--train_data", type=str, default=None, help="train data") +parser.add_argument("--dev_data", type=str, default=None, help="dev data") +parser.add_argument("--test_data", type=str, default=None, help="test data") +parser.add_argument("--predict_data", type=str, default=None, help="predict data") +parser.add_argument("--do_train", type=ast.literal_eval, default=True, help="do train") +parser.add_argument("--do_predict", type=ast.literal_eval, default=True, help="do predict") +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") +parser.add_argument( + "--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy" +) +parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.") +parser.add_argument("--valid_step", type=int, default=100, help="validation step") +parser.add_argument("--skip_step", type=int, default=20, help="skip step") +parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.") +parser.add_argument("--checkpoints", type=str, default=None, help="Directory to model checkpoint") +parser.add_argument("--init_ckpt", type=str, default=None, help="already pretraining model checkpoint") +parser.add_argument("--predict_save_path", type=str, default=None, help="predict data save path") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+) +args = parser.parse_args() + + +def set_seed(random_seed): + """sets random seed""" + random.seed(random_seed) + np.random.seed(random_seed) + paddle.seed(random_seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accuracy = metric.accumulate() + metric.reset() + model.train() + return float(np.mean(losses)), accuracy + + +def convert_example(example, tokenizer, label_map=None, max_seq_len=512, is_test=False): + """convert_example""" + has_text_b = False + if isinstance(example, dict): + has_text_b = "text_b" in example.keys() + else: + has_text_b = "text_b" in example._fields + + text_b = None + if has_text_b: + text_b = example.text_b + + tokenized_input = tokenizer(text=example.text_a, text_pair=text_b, max_seq_len=max_seq_len) + input_ids = tokenized_input["input_ids"] + token_type_ids = tokenized_input["token_type_ids"] + + if is_test: + return input_ids, token_type_ids + else: + label = np.array([label_map[example.label]], dtype="int64") + return input_ids, token_type_ids, label + + +class DuEventExtraction(paddle.io.Dataset): + """Du""" + + def __init__(self, data_path, tag_path): + self.label_vocab = load_dict(tag_path) + self.examples = self._read_tsv(data_path) + + def _read_tsv(self, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="UTF-8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + headers = next(reader) + text_indices = [index for index, h in enumerate(headers) if h != "label"] + Example = namedtuple("Example", headers) + examples = [] + for line in reader: + for index, text in enumerate(line): + if index in text_indices: + line[index] = text + try: + example = Example(*line) + except Exception as e: + traceback.print_exc() + raise Exception(e) + examples.append(example) + return examples + + def __len__(self): + return len(self.examples) + + def __getitem__(self, index): + return self.examples[index] + + +def data_2_examples(datas): + """data_2_examples""" + has_text_b, examples = False, [] + if isinstance(datas[0], list): + Example = namedtuple("Example", ["text_a", "text_b"]) + has_text_b = True + else: + Example = namedtuple("Example", ["text_a"]) + for item in datas: + if has_text_b: + example = Example(text_a=item[0], text_b=item[1]) + else: + example = Example(text_a=item) + examples.append(example) + return examples + + +def do_train(): + paddle.set_device(args.device) + world_size = paddle.distributed.get_world_size() + rank = paddle.distributed.get_rank() + if world_size > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + label_map = load_dict(args.tag_path) + model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map)) + model = paddle.DataParallel(model) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + 
print("============start train==========") + train_ds = DuEventExtraction(args.train_data, args.tag_path) + dev_ds = DuEventExtraction(args.dev_data, args.tag_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, label_map=label_map, max_seq_len=args.max_seq_len) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype="int32"), + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype="int32"), + Stack(dtype="int64"), # label + ): fn(list(map(trans_func, samples))) + + batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=batch_sampler, collate_fn=batchify_fn) + dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_size=args.batch_size, collate_fn=batchify_fn) + + num_training_steps = len(train_loader) * args.num_epoch + metric = paddle.metric.Accuracy() + criterion = paddle.nn.loss.CrossEntropyLoss() + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + step, best_performerence = 0, 0.0 + model.train() + for epoch in range(args.num_epoch): + for idx, (input_ids, token_type_ids, labels) in enumerate(train_loader): + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_item = loss.item() + if step > 0 and step % args.skip_step == 0 and rank == 0: + print( + f"train epoch: {epoch} - step: {step} (total: {num_training_steps}) - loss: {loss_item:.6f} acc {acc:.5f}" + ) + if step > 0 and step % args.valid_step == 0 and rank == 0: + loss_dev, acc_dev = evaluate(model, criterion, metric, dev_loader) + print( + f"dev step: {step} - loss: {loss_dev:.6f} accuracy: {acc_dev:.5f}, current best {best_performerence:.5f}" + ) + if acc_dev > best_performerence: + best_performerence = acc_dev + print( + f"==============================================save best model best performerence {best_performerence:5f}" + ) + paddle.save(model.state_dict(), f"{args.checkpoints}/best.pdparams") + step += 1 + + # save the final model + if rank == 0: + paddle.save(model.state_dict(), "{}/final.pdparams".format(args.checkpoints)) + + +def do_predict(): + set_seed(args.seed) + paddle.set_device(args.device) + + label_map = load_dict(args.tag_path) + id2label = {val: key for key, val in label_map.items()} + + model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map)) + model = paddle.DataParallel(model) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + print("============start predict==========") + if not args.init_ckpt or not os.path.isfile(args.init_ckpt): + raise Exception("init checkpoints {} not exist".format(args.init_ckpt)) + else: + state_dict = paddle.load(args.init_ckpt) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.init_ckpt) + + # load data from predict file + sentences = read_by_lines(args.predict_data) # origin data format + sentences = 
[json.loads(sent) for sent in sentences] + + encoded_inputs_list = [] + for sent in sentences: + sent = sent["text"] + input_sent = [sent] # only text_a + if "text_b" in sent: + input_sent = [[sent, sent["text_b"]]] # add text_b + example = data_2_examples(input_sent)[0] + input_ids, token_type_ids = convert_example(example, tokenizer, max_seq_len=args.max_seq_len, is_test=True) + encoded_inputs_list.append((input_ids, token_type_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), + ): fn(samples) + # Separates data into some batches. + batch_encoded_inputs = [ + encoded_inputs_list[i : i + args.batch_size] for i in range(0, len(encoded_inputs_list), args.batch_size) + ] + results = [] + model.eval() + for batch in batch_encoded_inputs: + input_ids, token_type_ids = batchify_fn(batch) + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=1) + probs_ids = paddle.argmax(probs, -1).numpy() + probs = probs.numpy() + for prob_one, p_id in zip(probs.tolist(), probs_ids.tolist()): + label_probs = {} + for idx, p in enumerate(prob_one): + label_probs[id2label[idx]] = p + results.append({"probs": label_probs, "label": id2label[p_id]}) + + assert len(results) == len(sentences) + for sent, ret in zip(sentences, results): + sent["pred"] = ret + sentences = [json.dumps(sent, ensure_ascii=False) for sent in sentences] + write_by_lines(args.predict_save_path, sentences) + print("save data {} to {}".format(len(sentences), args.predict_save_path)) + + +if __name__ == "__main__": + + if args.do_train: + do_train() + elif args.do_predict: + do_predict() diff --git a/examples/information_extraction/DuEE/duee_1_data_prepare.py b/examples/information_extraction/DuEE/duee_1_data_prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..dee77a7899313b9fd35808e9f27805cb476ef717 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_1_data_prepare.py @@ -0,0 +1,127 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""duee 1.0 dataset process""" +import json +import os + +from utils import read_by_lines, write_by_lines + + +def data_process(path, model="trigger", is_predict=False): + """data_process""" + + def label_data(data, start, l, _type): + """label_data""" + for i in range(start, start + l): + suffix = "B-" if i == start else "I-" + data[i] = "{}{}".format(suffix, _type) + return data + + sentences = [] + output = ["text_a"] if is_predict else ["text_a\tlabel"] + with open(path) as f: + for line in f: + d_json = json.loads(line.strip()) + _id = d_json["id"] + text_a = ["," if t == " " or t == "\n" or t == "\t" else t for t in list(d_json["text"].lower())] + if is_predict: + sentences.append({"text": d_json["text"], "id": _id}) + output.append("\002".join(text_a)) + else: + if model == "trigger": + labels = ["O"] * len(text_a) + for event in d_json.get("event_list", []): + event_type = event["event_type"] + start = event["trigger_start_index"] + trigger = event["trigger"] + labels = label_data(labels, start, len(trigger), event_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + elif model == "role": + for event in d_json.get("event_list", []): + labels = ["O"] * len(text_a) + for arg in event["arguments"]: + role_type = arg["role"] + argument = arg["argument"] + start = arg["argument_start_index"] + labels = label_data(labels, start, len(argument), role_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + return output + + +def schema_process(path, model="trigger"): + """schema_process""" + + def label_add(labels, _type): + """label_add""" + if "B-{}".format(_type) not in labels: + labels.extend(["B-{}".format(_type), "I-{}".format(_type)]) + return labels + + labels = [] + for line in read_by_lines(path): + d_json = json.loads(line.strip()) + if model == "trigger": + labels = label_add(labels, d_json["event_type"]) + elif model == "role": + for role in d_json["role_list"]: + labels = label_add(labels, role["role"]) + labels.append("O") + tags = [] + for index, label in enumerate(labels): + tags.append("{}\t{}".format(index, label)) + return tags + + +if __name__ == "__main__": + print("\n=================DUEE 1.0 DATASET==============") + conf_dir = "./conf/DuEE1.0" + schema_path = "{}/event_schema.json".format(conf_dir) + tags_trigger_path = "{}/trigger_tag.dict".format(conf_dir) + tags_role_path = "{}/role_tag.dict".format(conf_dir) + print("\n=================start schema process==============") + print("input path {}".format(schema_path)) + tags_trigger = schema_process(schema_path, "trigger") + write_by_lines(tags_trigger_path, tags_trigger) + print("save trigger tag {} at {}".format(len(tags_trigger), tags_trigger_path)) + tags_role = schema_process(schema_path, "role") + write_by_lines(tags_role_path, tags_role) + print("save trigger tag {} at {}".format(len(tags_role), tags_role_path)) + print("=================end schema process===============") + + # data process + data_dir = "./data/DuEE1.0" + trigger_save_dir = "{}/trigger".format(data_dir) + role_save_dir = "{}/role".format(data_dir) + print("\n=================start schema process==============") + if not os.path.exists(trigger_save_dir): + os.makedirs(trigger_save_dir) + if not os.path.exists(role_save_dir): + os.makedirs(role_save_dir) + print("\n----trigger------for dir {} to {}".format(data_dir, trigger_save_dir)) + train_tri = data_process("{}/duee_train.json".format(data_dir), "trigger") + write_by_lines("{}/train.tsv".format(trigger_save_dir), train_tri) + 
dev_tri = data_process("{}/duee_dev.json".format(data_dir), "trigger") + write_by_lines("{}/dev.tsv".format(trigger_save_dir), dev_tri) + test_tri = data_process("{}/duee_test1.json".format(data_dir), "trigger") + write_by_lines("{}/test.tsv".format(trigger_save_dir), test_tri) + print("train {} dev {} test {}".format(len(train_tri), len(dev_tri), len(test_tri))) + print("\n----role------for dir {} to {}".format(data_dir, role_save_dir)) + train_role = data_process("{}/duee_train.json".format(data_dir), "role") + write_by_lines("{}/train.tsv".format(role_save_dir), train_role) + dev_role = data_process("{}/duee_dev.json".format(data_dir), "role") + write_by_lines("{}/dev.tsv".format(role_save_dir), dev_role) + test_role = data_process("{}/duee_test1.json".format(data_dir), "role") + write_by_lines("{}/test.tsv".format(role_save_dir), test_role) + print("train {} dev {} test {}".format(len(train_role), len(dev_role), len(test_role))) + print("=================end schema process==============") diff --git a/examples/information_extraction/DuEE/duee_1_postprocess.py b/examples/information_extraction/DuEE/duee_1_postprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..6a9efbc030a8fcd1c109e038a57664bfc20a17f6 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_1_postprocess.py @@ -0,0 +1,80 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""duee 1.0 data predict post-process""" + +import argparse +import json + +from utils import extract_result, read_by_lines, write_by_lines + + +def predict_data_process(trigger_file, role_file, schema_file, save_path): + """predict_data_process""" + pred_ret = [] + trigger_data = read_by_lines(trigger_file) + role_data = read_by_lines(role_file) + schema_data = read_by_lines(schema_file) + print("trigger predict {} load from {}".format(len(trigger_data), trigger_file)) + print("role predict {} load from {}".format(len(role_data), role_file)) + print("schema {} load from {}".format(len(schema_data), schema_file)) + + schema = {} + for s in schema_data: + d_json = json.loads(s) + schema[d_json["event_type"]] = [r["role"] for r in d_json["role_list"]] + + # process the role data + sent_role_mapping = {} + for d in role_data: + d_json = json.loads(d) + r_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + role_ret = {} + for r in r_ret: + role_type = r["type"] + if role_type not in role_ret: + role_ret[role_type] = [] + role_ret[role_type].append("".join(r["text"])) + sent_role_mapping[d_json["id"]] = role_ret + + for d in trigger_data: + d_json = json.loads(d) + t_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + pred_event_types = list(set([t["type"] for t in t_ret])) + event_list = [] + for event_type in pred_event_types: + role_list = schema[event_type] + arguments = [] + for role_type, ags in sent_role_mapping[d_json["id"]].items(): + if role_type not in role_list: + continue + for arg in ags: + if len(arg) == 1: + continue + arguments.append({"role": role_type, "argument": arg}) + event = {"event_type": event_type, "arguments": arguments} + event_list.append(event) + pred_ret.append({"id": d_json["id"], "text": d_json["text"], "event_list": event_list}) + pred_ret = [json.dumps(r, ensure_ascii=False) for r in pred_ret] + print("submit data {} save to {}".format(len(pred_ret), save_path)) + write_by_lines(save_path, pred_ret) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Official evaluation script for DuEE version 1.0") + parser.add_argument("--trigger_file", help="trigger model predict data path", required=True) + parser.add_argument("--role_file", help="role model predict data path", required=True) + parser.add_argument("--schema_file", help="schema file path", required=True) + parser.add_argument("--save_path", help="save file path", required=True) + args = parser.parse_args() + predict_data_process(args.trigger_file, args.role_file, args.schema_file, args.save_path) diff --git a/examples/information_extraction/DuEE/duee_fin_data_prepare.py b/examples/information_extraction/DuEE/duee_fin_data_prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..49b101007703bf9cb84d2f720bcccfd8c75b8f10 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_fin_data_prepare.py @@ -0,0 +1,276 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""duee finance dataset proces""" +import json +import os + +from utils import cal_md5, read_by_lines, text_to_sents, write_by_lines + +enum_role = "环节" + + +def data_process(path, model="trigger", is_predict=False): + """data_process""" + + def label_data(data, start, l, _type): + """label_data""" + for i in range(start, start + l): + suffix = "B-" if i == start else "I-" + data[i] = "{}{}".format(suffix, _type) + return data + + sentences = [] + output = ["text_a"] if is_predict else ["text_a\tlabel"] + + for line in read_by_lines(path): + d_json = json.loads(line) + _id = d_json["id"] + text_a = ["," if t == " " or t == "\n" or t == "\t" else t for t in list(d_json["text"].lower())] + if is_predict: + sentences.append({"text": d_json["text"], "id": _id}) + output.append("\002".join(text_a)) + else: + if model == "trigger": + labels = ["O"] * len(text_a) + if len(d_json.get("event_list", [])) == 0: + continue + for event in d_json.get("event_list", []): + event_type = event["event_type"] + start = event["trigger_start_index"] + trigger = event["trigger"] + labels = label_data(labels, start, len(trigger), event_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + elif model == "role": + for event in d_json.get("event_list", []): + labels = ["O"] * len(text_a) + for arg in event["arguments"]: + role_type = arg["role"] + if role_type == enum_role: + continue + argument = arg["argument"] + start = arg["argument_start_index"] + labels = label_data(labels, start, len(argument), role_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + return output + + +def enum_data_process(path, is_predict=False): + """enum_data_process""" + output = ["text_a"] if is_predict else ["label\ttext_a"] + for line in read_by_lines(path): + d_json = json.loads(line) + text = d_json["text"].lower().replace("\t", " ") + if is_predict: + output.append(text) + continue + if len(d_json.get("event_list", [])) == 0: + continue + label = None + for event in d_json["event_list"]: + if event["event_type"] != "公司上市": + continue + for argument in event["arguments"]: + role_type = argument["role"] + if role_type == enum_role: + label = argument["argument"] + if label: + output.append("{}\t{}".format(label, text)) + return output + + +def schema_process(path, model="trigger"): + """schema_process""" + + def label_add(labels, _type): + """label_add""" + if "B-{}".format(_type) not in labels: + labels.extend(["B-{}".format(_type), "I-{}".format(_type)]) + return labels + + labels = [] + for line in read_by_lines(path): + d_json = json.loads(line.strip()) + if model == "trigger": + labels = label_add(labels, d_json["event_type"]) + elif model == "role": + for role in d_json["role_list"]: + if role["role"] == enum_role: + continue + labels = label_add(labels, role["role"]) + elif model == "enum": + for role in d_json["role_list"]: + if role["role"] == enum_role: + labels = role["enum_items"] + + labels.append("O") + tags = [] + for index, label in enumerate(labels): + tags.append("{}\t{}".format(index, label)) + if model == "enum": + tags = tags[:-1] + return tags + + +def marked_doc_2_sentence(doc): + """marked_doc_2_sentence""" + + def argument_in_sent(sent, argument_list, trigger): + """argument_in_sent""" + trigger_start = sent.find(trigger) + if trigger_start < 0: + return trigger_start, [], None + new_arguments, enum_argument = [], None + for argument in argument_list: + word = argument["argument"] + role_type = argument["role"] + if role_type == enum_role: + # 
special + enum_argument = argument + continue + start = sent.find(word) + if start < 0: + continue + new_arguments.append({"role": role_type, "argument": word, "argument_start_index": start}) + return trigger_start, new_arguments, enum_argument + + title = doc["title"] + text = doc["text"] + sents = text_to_sents(text) + sent_mapping_event, sents_order = {}, [] + step = 3 + batch_sents = [sents[i : i + step] for i in range(0, len(sents), step)] + if len(title) > 0: + batch_sents = [[title]] + batch_sents + for batch in batch_sents: + b_sent = " ".join(batch).replace("\n", " ").replace("\r\n", " ").replace("\r", " ").replace("\t", " ") + if b_sent in sent_mapping_event: + continue + sent_id = cal_md5(b_sent.encode("utf-8")) + sent_mapping_event[b_sent] = {"id": doc["id"], "sent_id": sent_id, "text": b_sent} + sents_order.append(b_sent) + + for event in doc.get("event_list", []): + cur_sent, trigger_start, arguments, enum_argument = "", -1, [], None + for sent in sents_order: + tri_start, argus, enum_arg = argument_in_sent(sent, event["arguments"], event["trigger"]) + if tri_start < 0: + continue + if len(argus) > len(arguments): + cur_sent, trigger_start, arguments = sent, tri_start, argus + if enum_arg: + enum_argument = enum_arg + if trigger_start >= 0 and len(arguments) > 0: + # add enum 2 event + if enum_argument: + arguments.append(enum_argument) + if "event_list" not in sent_mapping_event[cur_sent]: + sent_mapping_event[cur_sent]["event_list"] = [] + new_event = { + "arguments": arguments, + "event_type": event["event_type"], + "trigger": event["trigger"], + "trigger_start_index": trigger_start, + } + sent_mapping_event[cur_sent]["event_list"].append(new_event) + return sent_mapping_event.values() + + +def docs_data_process(path): + """docs_data_process""" + lines = read_by_lines(path) + sentences = [] + for line in lines: + d_json = json.loads(line) + sentences.extend(marked_doc_2_sentence(d_json)) + sentences = [json.dumps(s, ensure_ascii=False) for s in sentences] + return sentences + + +if __name__ == "__main__": + # schema process + print("\n=================DUEE FINANCE DATASET==============") + conf_dir = "./conf/DuEE-Fin" + if not os.path.exists(conf_dir): + os.makedirs(conf_dir) + schema_path = "./data/DuEE-fin/duee_fin_event_schema.json" + tags_trigger_path = "{}/trigger_tag.dict".format(conf_dir) + tags_role_path = "{}/role_tag.dict".format(conf_dir) + tags_enum_path = "{}/enum_tag.dict".format(conf_dir) + print("\n=================start schema process==============") + print("input path {}".format(schema_path)) + tags_trigger = schema_process(schema_path, "trigger") + write_by_lines(tags_trigger_path, tags_trigger) + print("save trigger tag {} at {}".format(len(tags_trigger), tags_trigger_path)) + tags_role = schema_process(schema_path, "role") + write_by_lines(tags_role_path, tags_role) + print("save trigger tag {} at {}".format(len(tags_role), tags_role_path)) + tags_enum = schema_process(schema_path, "enum") + write_by_lines(tags_enum_path, tags_enum) + print("save enum enum tag {} at {}".format(len(tags_enum), tags_enum_path)) + print("=================end schema process===============") + + # data process + data_dir = "./data/DuEE-Fin" + sentence_dir = "{}/sentence".format(data_dir) + trigger_save_dir = "{}/trigger".format(data_dir) + role_save_dir = "{}/role".format(data_dir) + enum_save_dir = "{}/enum".format(data_dir) + print("\n=================start data process==============") + print("\n********** start document process **********") + if not 
os.path.exists(sentence_dir): + os.makedirs(sentence_dir) + train_sent = docs_data_process("./data/DuEE-fin/duee_fin_train.json/duee_fin_train.json") + write_by_lines("{}/train.json".format(sentence_dir), train_sent) + dev_sent = docs_data_process("./data/DuEE-fin/duee_fin_dev.json/duee_fin_dev.json") + write_by_lines("{}/dev.json".format(sentence_dir), dev_sent) + test_sent = docs_data_process("./data/DuEE-fin/duee_fin_test2.json/duee_fin_test2.json") + write_by_lines("{}/test.json".format(sentence_dir), test_sent) + print("train {} dev {} test {}".format(len(train_sent), len(dev_sent), len(test_sent))) + print("********** end document process **********") + + print("\n********** start sentence process **********") + print("\n----trigger------for dir {} to {}".format(sentence_dir, trigger_save_dir)) + if not os.path.exists(trigger_save_dir): + os.makedirs(trigger_save_dir) + train_tri = data_process("{}/train.json".format(sentence_dir), "trigger") + write_by_lines("{}/train.tsv".format(trigger_save_dir), train_tri) + dev_tri = data_process("{}/dev.json".format(sentence_dir), "trigger") + write_by_lines("{}/dev.tsv".format(trigger_save_dir), dev_tri) + test_tri = data_process("{}/test.json".format(sentence_dir), "trigger") + write_by_lines("{}/test.tsv".format(trigger_save_dir), test_tri) + print("train {} dev {} test {}".format(len(train_tri), len(dev_tri), len(test_tri))) + + print("\n----role------for dir {} to {}".format(sentence_dir, role_save_dir)) + if not os.path.exists(role_save_dir): + os.makedirs(role_save_dir) + train_role = data_process("{}/train.json".format(sentence_dir), "role") + write_by_lines("{}/train.tsv".format(role_save_dir), train_role) + dev_role = data_process("{}/dev.json".format(sentence_dir), "role") + write_by_lines("{}/dev.tsv".format(role_save_dir), dev_role) + test_role = data_process("{}/test.json".format(sentence_dir), "role") + write_by_lines("{}/test.tsv".format(role_save_dir), test_role) + print("train {} dev {} test {}".format(len(train_role), len(dev_role), len(test_role))) + + print("\n----enum------for dir {} to {}".format(sentence_dir, enum_save_dir)) + if not os.path.exists(enum_save_dir): + os.makedirs(enum_save_dir) + trian_enum = enum_data_process("{}/train.json".format(sentence_dir)) + write_by_lines("{}/train.tsv".format(enum_save_dir), trian_enum) + dev_enum = enum_data_process("{}/dev.json".format(sentence_dir)) + write_by_lines("{}/dev.tsv".format(enum_save_dir), dev_enum) + test_enum = enum_data_process("{}/test.json".format(sentence_dir)) + write_by_lines("{}/test.tsv".format(enum_save_dir), test_enum) + print("train {} dev {} test {}".format(len(trian_enum), len(dev_enum), len(test_enum))) + print("********** end sentence process **********") + print("=================end data process==============") diff --git a/examples/information_extraction/DuEE/duee_fin_postprocess.py b/examples/information_extraction/DuEE/duee_fin_postprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..544bf88efcb10de6d8fbbc0e88853ab02e052db8 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_fin_postprocess.py @@ -0,0 +1,139 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""duee finance data predict post-process""" + +import argparse +import json + +from utils import extract_result, read_by_lines, write_by_lines + +enum_event_type = "公司上市" +enum_role = "环节" + + +def event_normalization(doc): + """event_merge""" + for event in doc.get("event_list", []): + argument_list = [] + argument_set = set() + for arg in event["arguments"]: + arg_str = "{}-{}".format(arg["role"], arg["argument"]) + if arg_str not in argument_set: + argument_list.append(arg) + argument_set.add(arg_str) + event["arguments"] = argument_list + + event_list = sorted(doc.get("event_list", []), key=lambda x: len(x["arguments"]), reverse=True) + new_event_list = [] + for event in event_list: + event_type = event["event_type"] + event_argument_set = set() + for arg in event["arguments"]: + event_argument_set.add("{}-{}".format(arg["role"], arg["argument"])) + flag = True + for new_event in new_event_list: + if event_type != new_event["event_type"]: + continue + new_event_argument_set = set() + for arg in new_event["arguments"]: + new_event_argument_set.add("{}-{}".format(arg["role"], arg["argument"])) + if len(event_argument_set & new_event_argument_set) == len(new_event_argument_set): + flag = False + if flag: + new_event_list.append(event) + doc["event_list"] = new_event_list + return doc + + +def predict_data_process(trigger_file, role_file, enum_file, schema_file, save_path): + """predict_data_process""" + pred_ret = [] + trigger_data = read_by_lines(trigger_file) + role_data = read_by_lines(role_file) + enum_data = read_by_lines(enum_file) + schema_data = read_by_lines(schema_file) + print("trigger predict {} load from {}".format(len(trigger_data), trigger_file)) + print("role predict {} load from {}".format(len(role_data), role_file)) + print("enum predict {} load from {}".format(len(enum_data), enum_file)) + print("schema {} load from {}".format(len(schema_data), schema_file)) + + schema, sent_role_mapping, sent_enum_mapping = {}, {}, {} + for s in schema_data: + d_json = json.loads(s) + schema[d_json["event_type"]] = [r["role"] for r in d_json["role_list"]] + + # role depends on id and sent_id + for d in role_data: + d_json = json.loads(d) + r_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + role_ret = {} + for r in r_ret: + role_type = r["type"] + if role_type not in role_ret: + role_ret[role_type] = [] + role_ret[role_type].append("".join(r["text"])) + _id = "{}\t{}".format(d_json["id"], d_json["sent_id"]) + sent_role_mapping[_id] = role_ret + + # process the enum_role data + for d in enum_data: + d_json = json.loads(d) + _id = "{}\t{}".format(d_json["id"], d_json["sent_id"]) + label = d_json["pred"]["label"] + sent_enum_mapping[_id] = label + + # process trigger data + for d in trigger_data: + d_json = json.loads(d) + t_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + pred_event_types = list(set([t["type"] for t in t_ret])) + event_list = [] + _id = "{}\t{}".format(d_json["id"], d_json["sent_id"]) + for event_type in pred_event_types: + role_list = schema[event_type] + arguments = [] + for role_type, ags in sent_role_mapping[_id].items(): + if 
role_type not in role_list: + continue + for arg in ags: + arguments.append({"role": role_type, "argument": arg}) + # 特殊处理环节 + if event_type == enum_event_type: + arguments.append({"role": enum_role, "argument": sent_enum_mapping[_id]}) + event = {"event_type": event_type, "arguments": arguments, "text": d_json["text"]} + event_list.append(event) + pred_ret.append( + {"id": d_json["id"], "sent_id": d_json["sent_id"], "text": d_json["text"], "event_list": event_list} + ) + doc_pred = {} + for d in pred_ret: + if d["id"] not in doc_pred: + doc_pred[d["id"]] = {"id": d["id"], "event_list": []} + doc_pred[d["id"]]["event_list"].extend(d["event_list"]) + + # unfiy the all prediction results and save them + doc_pred = [json.dumps(event_normalization(r), ensure_ascii=False) for r in doc_pred.values()] + print("submit data {} save to {}".format(len(doc_pred), save_path)) + write_by_lines(save_path, doc_pred) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Official evaluation script for DuEE version 1.0") + parser.add_argument("--trigger_file", help="trigger model predict data path", required=True) + parser.add_argument("--role_file", help="role model predict data path", required=True) + parser.add_argument("--enum_file", help="enum model predict data path", required=True) + parser.add_argument("--schema_file", help="schema file path", required=True) + parser.add_argument("--save_path", help="save file path", required=True) + args = parser.parse_args() + predict_data_process(args.trigger_file, args.role_file, args.enum_file, args.schema_file, args.save_path) diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png new file mode 100644 index 0000000000000000000000000000000000000000..2d2f6ce67170d302a62842de19d8154db3fa1f70 Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png differ diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png new file mode 100644 index 0000000000000000000000000000000000000000..d52fddb93b0ea43d36349becd2f499f9019d43f9 Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png differ diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png new file mode 100644 index 0000000000000000000000000000000000000000..97ac8ec639a41cd7942506f2832f9875d5849e6d Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png differ diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png new file mode 100644 index 0000000000000000000000000000000000000000..9332eb56a321d9866c882e37b1ee4eb31aa98198 Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png differ diff --git a/examples/information_extraction/DuEE/run_classifier.sh b/examples/information_extraction/DuEE/run_classifier.sh new file mode 100644 index 0000000000000000000000000000000000000000..a75fcaef4635f4a4cf200de0f3928af0060c81e4 --- /dev/null +++ b/examples/information_extraction/DuEE/run_classifier.sh @@ -0,0 +1,53 @@ +data_dir=${1} +conf_path=${2} +ckpt_dir=${3} +predict_data=${4} +learning_rate=${5} +is_train=${6} +max_seq_len=${7} +batch_size=${8} +epoch=${9} 
+pred_save_path=${10} + + +if [ "$is_train" = True ]; then + unset CUDA_VISIBLE_DEVICES + python -m paddle.distributed.launch --gpus "0" classifier.py \ + --num_epoch ${epoch} \ + --learning_rate 5e-5 \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train True \ + --do_predict False \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 1 \ + --valid_step 5 \ + --checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +else + export CUDA_VISIBLE_DEVICES=0 + python classifier.py \ + --num_epoch ${epoch} \ + --learning_rate 5e-5 \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train False \ + --do_predict True \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 1 \ + --valid_step 1 \ + --checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +fi diff --git a/examples/information_extraction/DuEE/run_duee_1.sh b/examples/information_extraction/DuEE/run_duee_1.sh new file mode 100644 index 0000000000000000000000000000000000000000..4d599594123c911a7d069255bd59ae972fc619c4 --- /dev/null +++ b/examples/information_extraction/DuEE/run_duee_1.sh @@ -0,0 +1,60 @@ +dataset_name=DuEE1.0 +data_dir=./data/${dataset_name} +conf_dir=./conf/${dataset_name} +ckpt_dir=./ckpt/${dataset_name} +submit_data_path=./submit/test_duee_1.json +pred_data=${data_dir}/duee_test1.json # 换其他数据,需要修改它 + +learning_rate=5e-5 +max_seq_len=300 +batch_size=16 +epoch=20 + +echo -e "check and create directory" +dir_list=(./ckpt ${ckpt_dir} ./submit) +for item in ${dir_list[*]} +do + if [ ! 
-d ${item} ]; then + mkdir ${item} + echo "create dir * ${item} *" + else + echo "dir ${item} exist" + fi +done + +process_name=${1} + +run_sequence_labeling_model(){ + model=${1} + is_train=${2} + pred_save_path=${ckpt_dir}/${model}/test_pred.json + sh run_sequence_labeling.sh ${data_dir}/${model} ${conf_dir}/${model}_tag.dict ${ckpt_dir}/${model} ${pred_data} ${learning_rate} ${is_train} ${max_seq_len} ${batch_size} ${epoch} ${pred_save_path} +} + +if [ ${process_name} == data_prepare ]; then + echo -e "\nstart ${dataset_name} data prepare" + python duee_1_data_prepare.py + echo -e "end ${dataset_name} data prepare" +elif [ ${process_name} == trigger_train ]; then + echo -e "\nstart ${dataset_name} trigger train" + run_sequence_labeling_model trigger True + echo -e "end ${dataset_name} trigger train" +elif [ ${process_name} == trigger_predict ]; then + echo -e "\nstart ${dataset_name} trigger predict" + run_sequence_labeling_model trigger False + echo -e "end ${dataset_name} trigger predict" +elif [ ${process_name} == role_train ]; then + echo -e "\nstart ${dataset_name} role train" + run_sequence_labeling_model role True + echo -e "end ${dataset_name} role train" +elif [ ${process_name} == role_predict ]; then + echo -e "\nstart ${dataset_name} role predict" + run_sequence_labeling_model role False + echo -e "end ${dataset_name} role predict" +elif [ ${process_name} == pred_2_submit ]; then + echo -e "\nstart ${dataset_name} predict data merge to submit fotmat" + python duee_1_postprocess.py --trigger_file ${ckpt_dir}/trigger/test_pred.json --role_file ${ckpt_dir}/role/test_pred.json --schema_file ${conf_dir}/event_schema.json --save_path ${submit_data_path} + echo -e "end ${dataset_name} role predict data merge" +else + echo "no process name ${process_name}" +fi \ No newline at end of file diff --git a/examples/information_extraction/DuEE/run_duee_fin.sh b/examples/information_extraction/DuEE/run_duee_fin.sh new file mode 100644 index 0000000000000000000000000000000000000000..5b0b7b12e4a4474c5dedc96be3953f3162dd652e --- /dev/null +++ b/examples/information_extraction/DuEE/run_duee_fin.sh @@ -0,0 +1,75 @@ +dataset_name=DuEE-Fin +data_dir=./data/${dataset_name} +conf_dir=./conf/${dataset_name} +ckpt_dir=./ckpt/${dataset_name} +submit_data_path=./submit/test_duee_fin.json +pred_data=${data_dir}/sentence/test.json # 换其他数据,需要修改它 + +learning_rate=5e-5 +max_seq_len=300 +batch_size=16 +epoch=20 + +echo -e "check and create directory" +dir_list=(./ckpt ${ckpt_dir} ./submit) +for item in ${dir_list[*]} +do + if [ ! 
-d ${item} ]; then + mkdir ${item} + echo "create dir * ${item} *" + else + echo "dir ${item} exist" + fi +done + +process_name=${1} + +run_sequence_labeling_model(){ + model=${1} + is_train=${2} + pred_save_path=${ckpt_dir}/${model}/test_pred.json + sh run_sequence_labeling.sh ${data_dir}/${model} ${conf_dir}/${model}_tag.dict ${ckpt_dir}/${model} ${pred_data} ${learning_rate} ${is_train} ${max_seq_len} ${batch_size} ${epoch} ${pred_save_path} +} + +run_classifier_model(){ + model=${1} + is_train=${2} + pred_save_path=${ckpt_dir}/${model}/test_pred.json + sh run_classifier.sh ${data_dir}/${model} ${conf_dir}/${model}_tag.dict ${ckpt_dir}/${model} ${pred_data} ${learning_rate} ${is_train} ${max_seq_len} ${batch_size} ${epoch} ${pred_save_path} +} + +if [ ${process_name} == data_prepare ]; then + echo -e "\nstart ${dataset_name} data prepare" + python duee_fin_data_prepare.py + echo -e "end ${dataset_name} data prepare" +elif [ ${process_name} == trigger_train ]; then + echo -e "\nstart ${dataset_name} trigger train" + run_sequence_labeling_model trigger True + echo -e "end ${dataset_name} trigger train" +elif [ ${process_name} == trigger_predict ]; then + echo -e "\nstart ${dataset_name} trigger predict" + run_sequence_labeling_model trigger False + echo -e "end ${dataset_name} trigger predict" +elif [ ${process_name} == role_train ]; then + echo -e "\nstart ${dataset_name} role train" + run_sequence_labeling_model role True + echo -e "end ${dataset_name} role train" +elif [ ${process_name} == role_predict ]; then + echo -e "\nstart ${dataset_name} role predict" + run_sequence_labeling_model role False + echo -e "end ${dataset_name} role predict" +elif [ ${process_name} == enum_train ]; then + echo -e "\nstart ${dataset_name} enum train" + run_classifier_model enum True + echo -e "end ${dataset_name} enum train" +elif [ ${process_name} == enum_predict ]; then + echo -e "\nstart ${dataset_name} enum predict" + run_classifier_model enum False + echo -e "end ${dataset_name} enum predict" +elif [ ${process_name} == pred_2_submit ]; then + echo -e "\nstart ${dataset_name} predict data merge to submit fotmat" + python duee_fin_postprocess.py --trigger_file ${ckpt_dir}/trigger/test_pred.json --role_file ${ckpt_dir}/role/test_pred.json --enum_file ${ckpt_dir}/enum/test_pred.json --schema_file ${conf_dir}/event_schema.json --save_path ${submit_data_path} + echo -e "end ${dataset_name} role predict data merge" +else + echo "no process name ${process_name}" +fi \ No newline at end of file diff --git a/examples/information_extraction/DuEE/run_sequence_labeling.sh b/examples/information_extraction/DuEE/run_sequence_labeling.sh new file mode 100644 index 0000000000000000000000000000000000000000..05f3e337f3a9cf125b4f95adfa1a5931050a195b --- /dev/null +++ b/examples/information_extraction/DuEE/run_sequence_labeling.sh @@ -0,0 +1,53 @@ + +data_dir=$1 +conf_path=$2 +ckpt_dir=$3 +predict_data=$4 +learning_rate=$5 +is_train=$6 +max_seq_len=$7 +batch_size=$8 +epoch=${9} +pred_save_path=${10} + +if [ "$is_train" = True ]; then + unset CUDA_VISIBLE_DEVICES + python -m paddle.distributed.launch --gpus "0" sequence_labeling.py \ + --num_epoch ${epoch} \ + --learning_rate ${learning_rate} \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train True \ + --do_predict False \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 10 \ + --valid_step 50 \ + 
--checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +else + export CUDA_VISIBLE_DEVICES=0 + python sequence_labeling.py \ + --num_epoch ${epoch} \ + --learning_rate ${learning_rate} \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train False \ + --do_predict True \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 10 \ + --valid_step 50 \ + --checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +fi diff --git a/examples/information_extraction/DuEE/sequence_labeling.py b/examples/information_extraction/DuEE/sequence_labeling.py new file mode 100644 index 0000000000000000000000000000000000000000..c2c596d00eab002dac447d906508379bfe018dc6 --- /dev/null +++ b/examples/information_extraction/DuEE/sequence_labeling.py @@ -0,0 +1,312 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +sequence labeling +""" +import argparse +import ast +import json +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from utils import load_dict, read_by_lines, write_by_lines + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +warnings.filterwarnings("ignore") + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.") +parser.add_argument("--tag_path", type=str, default=None, help="tag set path") +parser.add_argument("--train_data", type=str, default=None, help="train data") +parser.add_argument("--dev_data", type=str, default=None, help="dev data") +parser.add_argument("--test_data", type=str, default=None, help="test data") +parser.add_argument("--predict_data", type=str, default=None, help="predict data") +parser.add_argument("--do_train", type=ast.literal_eval, default=True, help="do train") +parser.add_argument("--do_predict", type=ast.literal_eval, default=True, help="do predict") +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") +parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy") +parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.") +parser.add_argument("--valid_step", type=int, default=100, help="validation step") +parser.add_argument("--skip_step", type=int, default=20, help="skip step") +parser.add_argument("--batch_size", 
type=int, default=32, help="Total examples' number in batch for training.") +parser.add_argument("--checkpoints", type=str, default=None, help="Directory to model checkpoint") +parser.add_argument("--init_ckpt", type=str, default=None, help="already pretraining model checkpoint") +parser.add_argument("--predict_save_path", type=str, default=None, help="predict data save path") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable. + + +def set_seed(args): + """sets random seed""" + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, num_label, data_loader): + """evaluate""" + model.eval() + metric.reset() + losses = [] + for input_ids, seg_ids, seq_lens, labels in data_loader: + logits = model(input_ids, seg_ids) + loss = paddle.mean( + criterion(logits.reshape([-1, num_label]), labels.reshape([-1]))) + losses.append(loss.numpy()) + preds = paddle.argmax(logits, axis=-1) + n_infer, n_label, n_correct = metric.compute(None, seq_lens, preds, + labels) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + avg_loss = np.mean(losses) + model.train() + + return precision, recall, f1_score, avg_loss + + +def convert_example_to_feature(example, + tokenizer, + label_vocab=None, + max_seq_len=512, + no_entity_label="O", + ignore_label=-1, + is_test=False): + tokens, labels = example + tokenized_input = tokenizer(tokens, + return_length=True, + is_split_into_words='token', + max_seq_len=max_seq_len) + + input_ids = tokenized_input['input_ids'] + token_type_ids = tokenized_input['token_type_ids'] + seq_len = tokenized_input['seq_len'] + + if is_test: + return input_ids, token_type_ids, seq_len + elif label_vocab is not None: + labels = labels[:(max_seq_len - 2)] + encoded_label = [no_entity_label] + labels + [no_entity_label] + encoded_label = [label_vocab[x] for x in encoded_label] + return input_ids, token_type_ids, seq_len, encoded_label + + +class DuEventExtraction(paddle.io.Dataset): + """DuEventExtraction""" + + def __init__(self, data_path, tag_path): + self.label_vocab = load_dict(tag_path) + self.word_ids = [] + self.label_ids = [] + with open(data_path, 'r', encoding='utf-8') as fp: + # skip the head line + next(fp) + for line in fp.readlines(): + words, labels = line.strip('\n').split('\t') + words = words.split('\002') + labels = labels.split('\002') + self.word_ids.append(words) + self.label_ids.append(labels) + self.label_num = max(self.label_vocab.values()) + 1 + + def __len__(self): + return len(self.word_ids) + + def __getitem__(self, index): + return self.word_ids[index], self.label_ids[index] + + +def do_train(): + paddle.set_device(args.device) + world_size = paddle.distributed.get_world_size() + rank = paddle.distributed.get_rank() + if world_size > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + no_entity_label = "O" + ignore_label = -1 + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + label_map = load_dict(args.tag_path) + model = AutoModelForTokenClassification.from_pretrained( + "ernie-3.0-medium-zh", num_classes=len(label_map)) + model = paddle.DataParallel(model) + + print("============start train==========") + train_ds = DuEventExtraction(args.train_data, 
args.tag_path) + dev_ds = DuEventExtraction(args.dev_data, args.tag_path) + + trans_func = partial(convert_example_to_feature, + tokenizer=tokenizer, + label_vocab=train_ds.label_vocab, + max_seq_len=args.max_seq_len, + no_entity_label=no_entity_label, + ignore_label=ignore_label, + is_test=False) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # input ids + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # token type ids + Stack(dtype='int64'), # sequence lens + Pad(axis=0, pad_val=ignore_label, dtype='int64') # labels + ): fn(list(map(trans_func, samples))) + + batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True) + train_loader = paddle.io.DataLoader(dataset=train_ds, + batch_sampler=batch_sampler, + collate_fn=batchify_fn) + dev_loader = paddle.io.DataLoader(dataset=dev_ds, + batch_size=args.batch_size, + collate_fn=batchify_fn) + + num_training_steps = len(train_loader) * args.num_epoch + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + metric = ChunkEvaluator(label_list=train_ds.label_vocab.keys(), + suffix=False) + criterion = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + step, best_f1 = 0, 0.0 + model.train() + for epoch in range(args.num_epoch): + for idx, (input_ids, token_type_ids, seq_lens, + labels) in enumerate(train_loader): + logits = model(input_ids, + token_type_ids).reshape([-1, train_ds.label_num]) + loss = paddle.mean(criterion(logits, labels.reshape([-1]))) + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_item = loss.item() + if step > 0 and step % args.skip_step == 0 and rank == 0: + print( + f'train epoch: {epoch} - step: {step} (total: {num_training_steps}) - loss: {loss_item:.6f}' + ) + if step > 0 and step % args.valid_step == 0 and rank == 0: + p, r, f1, avg_loss = evaluate(model, criterion, metric, + len(label_map), dev_loader) + print(f'dev step: {step} - loss: {avg_loss:.5f}, precision: {p:.5f}, recall: {r:.5f}, f1: {f1:.5f} current best {best_f1:.5f}') + if f1 > best_f1: + best_f1 = f1 + print(f'==============================================save best model best performerence {best_f1:5f}') + paddle.save(model.state_dict(), f'{args.checkpoints}/best.pdparams') + step += 1 + + # save the final model + if rank == 0: + paddle.save(model.state_dict(), + '{}/final.pdparams'.format(args.checkpoints)) + + +def do_predict(): + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + label_map = load_dict(args.tag_path) + id2label = {val: key for key, val in label_map.items()} + model = AutoModelForTokenClassification.from_pretrained( + "ernie-3.0-medium-zh", num_classes=len(label_map)) + + print("============start predict==========") + if not args.init_ckpt or not os.path.isfile(args.init_ckpt): + raise Exception("init checkpoints {} not exist".format(args.init_ckpt)) + else: + state_dict = paddle.load(args.init_ckpt) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.init_ckpt) + + # load data from predict file + sentences = 
read_by_lines(args.predict_data) # origin data format + sentences = [json.loads(sent) for sent in sentences] + + encoded_inputs_list = [] + for sent in sentences: + sent = sent["text"].replace(" ", "\002") + input_ids, token_type_ids, seq_len = convert_example_to_feature( + [list(sent), []], + tokenizer, + max_seq_len=args.max_seq_len, + is_test=True) + encoded_inputs_list.append((input_ids, token_type_ids, seq_len)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # input_ids + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # token_type_ids + Stack(dtype='int64') # sequence lens + ): fn(samples) + # Separates data into some batches. + batch_encoded_inputs = [ + encoded_inputs_list[i:i + args.batch_size] + for i in range(0, len(encoded_inputs_list), args.batch_size) + ] + results = [] + model.eval() + for batch in batch_encoded_inputs: + input_ids, token_type_ids, seq_lens = batchify_fn(batch) + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=-1) + probs_ids = paddle.argmax(probs, -1).numpy() + probs = probs.numpy() + for p_list, p_ids, seq_len in zip(probs.tolist(), probs_ids.tolist(), + seq_lens.tolist()): + prob_one = [ + p_list[index][pid] + for index, pid in enumerate(p_ids[1:seq_len - 1]) + ] + label_one = [id2label[pid] for pid in p_ids[1:seq_len - 1]] + results.append({"probs": prob_one, "labels": label_one}) + assert len(results) == len(sentences) + for sent, ret in zip(sentences, results): + sent["pred"] = ret + sentences = [json.dumps(sent, ensure_ascii=False) for sent in sentences] + write_by_lines(args.predict_save_path, sentences) + print("save data {} to {}".format(len(sentences), args.predict_save_path)) + + +if __name__ == '__main__': + + if args.do_train: + do_train() + elif args.do_predict: + do_predict() diff --git a/examples/information_extraction/DuEE/utils.py b/examples/information_extraction/DuEE/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..e4c912ace48e241af88f3709660d27dc4ccf4caf --- /dev/null +++ b/examples/information_extraction/DuEE/utils.py @@ -0,0 +1,102 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
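+# DuEE 基线的工具函数:按行读写数据文件、按中文标点切分句子、加载标签词典,
+# 以及将 BIO 标签序列解码为 {"start", "text", "type"} 形式的实体片段。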
+ +import hashlib + + +def cal_md5(str): + """calculate string md5""" + str = str.decode("utf-8", "ignore").encode("utf-8", "ignore") + return hashlib.md5(str).hexdigest() + + +def read_by_lines(path): + """read the data by line""" + result = list() + with open(path, "r", encoding="utf8") as infile: + for line in infile: + result.append(line.strip()) + return result + + +def write_by_lines(path, data): + """write the data""" + with open(path, "w", encoding="utf8") as outfile: + [outfile.write(d + "\n") for d in data] + + +def text_to_sents(text): + """text_to_sents""" + deliniter_symbols = ["。", "?", "!"] + paragraphs = text.split("\n") + ret = [] + for para in paragraphs: + if para == "": + continue + sents = [""] + for s in para: + sents[-1] += s + if s in deliniter_symbols: + sents.append("") + if sents[-1] == "": + sents = sents[:-1] + ret.extend(sents) + return ret + + +def load_dict(dict_path): + """load_dict""" + vocab = {} + for line in open(dict_path, "r", encoding="utf-8"): + value, key = line.strip("\n").split("\t") + vocab[key] = int(value) + return vocab + + +def extract_result(text, labels): + """extract_result""" + ret, is_start, cur_type = [], False, None + if len(text) != len(labels): + # 韩文回导致label 比 text要长 + labels = labels[: len(text)] + for i, label in enumerate(labels): + if label != "O": + _type = label[2:] + if label.startswith("B-"): + is_start = True + cur_type = _type + ret.append({"start": i, "text": [text[i]], "type": _type}) + elif _type != cur_type: + """ + # 如果是没有B-开头的,则不要这部分数据 + cur_type = None + is_start = False + """ + cur_type = _type + is_start = True + ret.append({"start": i, "text": [text[i]], "type": _type}) + elif is_start: + ret[-1]["text"].append(text[i]) + else: + cur_type = None + is_start = False + else: + cur_type = None + is_start = False + return ret + + +if __name__ == "__main__": + s = "xxdedewd" + print(cal_md5(s.encode("utf-8"))) diff --git a/examples/information_extraction/DuIE/README.md b/examples/information_extraction/DuIE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c40cd28a3e156bac7b6b3295ee63147410e24e8d --- /dev/null +++ b/examples/information_extraction/DuIE/README.md @@ -0,0 +1,142 @@ +# LIC2021 DuIE 关系抽取基线 + +信息抽取旨在从非结构化自然语言文本中提取结构化知识,如实体、关系、事件等。关系抽取的目标是对于给定的自然语言句子,根据预先定义的schema集合,抽取出所有满足schema约束的SPO三元组。schema定义了关系P以及其对应的主体S和客体O的类别。 +本基线系统基于预训练语言模型ERNIE设计了结构化的标注策略,可以实现多条、交叠的SPO抽取。 + +该示例展示了如何使用PaddleNLP快速复现[LIC2021关系抽取比赛](http://lic2021.ccf.org.cn/)基线并进阶优化模型基线。 + +同时,我们提供了该示例在线运行展示教程: +[PaddleNLP实战——LIC2021关系抽取任务基线](https://aistudio.baidu.com/aistudio/projectdetail/1639963) + + +## 目录结构 + +以下是本项目主要目录结构及说明: + +```text +DuIE/ +├── data_loader.py # 加载数据 +├── extract_chinese_and_punct.py # 文本数据预处理 +├── README.md # 文档说明 +├── re_official_evaluation.py # 比赛评价脚本 +├── run_duie.py # 模型训练脚本 +├── train.sh # 启动脚本 +└── utils.py # 效能函数 +``` + +## 关系抽取基线 + +针对 DuIE2.0 任务中多条、交叠SPO这一抽取目标,比赛对标准的 'BIO' 标注进行了扩展。 +对于每个 token,根据其在实体span中的位置(包括B、I、O三种),我们为其打上三类标签,并且根据其所参与构建的predicate种类,将 B 标签进一步区分。给定 schema 集合,对于 N 种不同 predicate,以及头实体/尾实体两种情况,我们设计对应的共 2*N 种 B 标签,再合并 I 和 O 标签,故每个 token 一共有 (2*N+2) 个标签,如下图所示。 + +
+<img src="./images/tagging_strategy.png" alt="标注策略" />
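+
+下面给出一个最小示意(假设 schema 文件即本目录提供的 data/predicate2id.json,其中包含 "O"、"I" 两个标签以及 N 个关系类别),演示 (2*N+2) 个标签总数的计算方式,与 run_duie.py 中 `num_classes = (len(label_map) - 2) * 2 + 2` 的实现保持一致:
+
+```python
+import json
+
+# predicate2id.json:除 "O" 与 "I" 外,其余 key 均为关系(predicate)标签
+with open("./data/predicate2id.json", "r", encoding="utf8") as fp:
+    label_map = json.load(fp)
+
+num_predicates = len(label_map) - 2       # 去掉 "O"、"I" 后得到 N
+num_classes = 2 * num_predicates + 2      # 头实体/尾实体各 N 种 B 标签,再加 I、O
+print(num_predicates, num_classes)        # 对于本例提供的 schema,输出 55 112
+```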
+ +### 评价方法 + +对测试集上参评系统输出的SPO结果和人工标注的SPO结果进行精准匹配,采用F1值作为评价指标。注意,对于复杂O值类型的SPO,必须所有槽位都精确匹配才认为该SPO抽取正确。针对部分文本中存在实体别名的问题,使用百度知识图谱的别名词典来辅助评测。F1值的计算方式如下: + +F1 = (2 * P * R) / (P + R),其中 + +- P = 测试集所有句子中预测正确的SPO个数 / 测试集所有句子中预测出的SPO个数 +- R = 测试集所有句子中预测正确的SPO个数 / 测试集所有句子中人工标注的SPO个数 + +### 快速复现基线Step1:构建模型 + +该任务可以看作一个序列标注任务,所以基线模型采用的是ERNIE序列标注模型。 + +**PaddleNLP提供了ERNIE预训练模型常用序列标注模型,可以通过指定模型名字完成一键加载.PaddleNLP为了方便用户处理数据,内置了对于各个预训练模型对应的Tokenizer,可以完成文本token化,转token ID,文本长度截断等操作。** + +```python +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=(len(label_map) - 2) * 2 + 2) +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") +``` + +文本数据处理直接调用tokenizer即可输出模型所需输入数据。 + +```python +inputs = tokenizer(text="请输入测试样例", max_seq_len=20) +# {'input_ids': [1, 647, 789, 109, 558, 525, 314, 656, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'seq_len': 9} +``` + +### 快速复现基线Step2:加载并处理 + + + +从比赛官网[下载数据集](https://aistudio.baidu.com/aistudio/competition/detail/65),解压存放于data/目录下并重命名为train_data.json, dev_data.json, test_data.json. + +我们可以加载自定义数据集。通过继承[`paddle.io.Dataset`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/Dataset_cn.html#dataset),自定义实现`__getitem__` 和 `__len__`两个方法。 + + +如下代码已完成加载数据集操作: + +```python +train_dataset = DuIEDataset.from_file( + os.path.join(args.data_path, 'train_data.json'), + tokenizer, + args.max_seq_length, + True) +test_dataset = DuIEDataset.from_file( + os.path.join(args.data_path, 'dev_data.json'), + tokenizer, + args.max_seq_length, + True) +``` + +### 快速复现基线Step3:定义损失函数和优化器,开始训练 + +在该基线上,我们选择均方误差作为损失函数,使用[`paddle.optimizer.AdamW`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/adamw/AdamW_cn.html#adamw)作为优化器。 + + +启动训练: +```shell +sh train.sh +``` + +在训练过程中,模型保存在当前目录checkpoints文件夹下。同时在训练的同时使用官方评测脚本进行评估,输出P/R/F1指标。 +在验证集上F1可以达到69.42。 + + +### 快速复现基线Step4:提交预测结果 + +将训练保存的模型加载后进行预测 + +```shell +sh predict.sh +``` + +预测结果会被保存在data/predictions.json,data/predictions.json.zip,其格式与原数据集文件一致。 + +之后可以使用官方评估脚本评估训练模型在dev_data.json上的效果。如: + +```shell +python re_official_evaluation.py --golden_file=dev_data.json --predict_file=predicitons.json.zip [--alias_file alias_dict] +``` +输出指标为Precision, Recall 和 F1,Alias file包含了合法的实体别名,最终评测的时候会使用,这里不予提供。 + +之后在test_data.json上预测,然后预测结果(.zip文件)至[评测网站](http://aistudio-bce.bcc-bdbl.baidu.com/aistudio/competition/detail/141)。 + + +## 进阶优化基线效果 + +基线采用的预训练模型为ERNIE,PaddleNLP提供了丰富的预训练模型,如BERT,RoBERTa,Electra,XLNet等 +参考[PaddleNLP预训练模型介绍](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) + +如可以选择RoBERTa large中文模型优化模型效果,只需更换模型和tokenizer即可无缝衔接。 + +```python +from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + +model = RobertaForTokenClassification.from_pretrained( + "roberta-wwm-ext-large", + num_classes=(len(label_map) - 2) * 2 + 2) +tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") +``` +## Reference + +- [DuIE: A Large-scale Chinese Dataset for Information Extraction](http://tcci.ccf.org.cn/conference/2019/papers/EV10.pdf) diff --git a/examples/information_extraction/DuIE/data/id2spo.json b/examples/information_extraction/DuIE/data/id2spo.json new file mode 100644 index 0000000000000000000000000000000000000000..8313a9ef942fba1fa79957acc1652688d2503b5d --- /dev/null +++ b/examples/information_extraction/DuIE/data/id2spo.json @@ 
-0,0 +1 @@ +{"predicate": ["empty", "empty", "注册资本", "作者", "所属专辑", "歌手", "邮政编码", "主演", "上映时间", "上映时间", "饰演", "饰演", "国籍", "成立日期", "毕业院校", "作曲", "作词", "编剧", "导演", "面积", "占地面积", "总部地点", "制片人", "嘉宾", "简称", "主持人", "获奖", "获奖", "获奖", "获奖", "海拔", "出品公司", "配音", "配音", "所在城市", "号", "主角", "创始人", "父亲", "祖籍", "母亲", "朝代", "董事长", "人口数量", "妻子", "丈夫", "票房", "票房", "专业代码", "气候", "修业年限", "改编自", "官方语言", "首都", "主题曲", "校长", "代言人"], "subject_type": ["empty", "empty", "企业", "图书作品", "歌曲", "歌曲", "行政区", "影视作品", "影视作品", "影视作品", "娱乐人物", "娱乐人物", "人物", "机构", "人物", "歌曲", "歌曲", "影视作品", "影视作品", "行政区", "机构", "企业", "影视作品", "电视综艺", "机构", "电视综艺", "娱乐人物", "娱乐人物", "娱乐人物", "娱乐人物", "地点", "影视作品", "娱乐人物", "娱乐人物", "景点", "历史人物", "文学作品", "企业", "人物", "人物", "人物", "历史人物", "企业", "行政区", "人物", "人物", "影视作品", "影视作品", "学科专业", "行政区", "学科专业", "影视作品", "国家", "国家", "影视作品", "学校", "企业/品牌"], "object_type": ["empty", "empty", "Number", "人物", "音乐专辑", "人物", "Text", "人物", "Date_@value", "地点_inArea", "人物_@value", "影视作品_inWork", "国家", "Date", "学校", "人物", "人物", "人物", "人物", "Number", "Number", "地点", "人物", "人物", "Text", "人物", "奖项_@value", "作品_inWork", "Date_onDate", "Number_period", "Number", "企业", "人物_@value", "影视作品_inWork", "城市", "Text", "人物", "人物", "人物", "地点", "人物", "Text", "人物", "Number", "人物", "人物", "Number_@value", "地点_inArea", "Text", "气候", "Number", "作品", "语言", "城市", "歌曲", "人物", "人物"]} \ No newline at end of file diff --git a/examples/information_extraction/DuIE/data/predicate2id.json b/examples/information_extraction/DuIE/data/predicate2id.json new file mode 100644 index 0000000000000000000000000000000000000000..a94c10304a85910d7b0a1f967541471cbc97b940 --- /dev/null +++ b/examples/information_extraction/DuIE/data/predicate2id.json @@ -0,0 +1 @@ +{"O": 0, "I": 1, "注册资本": 2, "作者": 3, "所属专辑": 4, "歌手": 5, "邮政编码": 6, "主演": 7, "上映时间_@value": 8, "上映时间_inArea": 9, "饰演_@value": 10, "饰演_inWork": 11, "国籍": 12, "成立日期": 13, "毕业院校": 14, "作曲": 15, "作词": 16, "编剧": 17, "导演": 18, "面积": 19, "占地面积": 20, "总部地点": 21, "制片人": 22, "嘉宾": 23, "简称": 24, "主持人": 25, "获奖_@value": 26, "获奖_inWork": 27, "获奖_onDate": 28, "获奖_period": 29, "海拔": 30, "出品公司": 31, "配音_@value": 32, "配音_inWork": 33, "所在城市": 34, "号": 35, "主角": 36, "创始人": 37, "父亲": 38, "祖籍": 39, "母亲": 40, "朝代": 41, "董事长": 42, "人口数量": 43, "妻子": 44, "丈夫": 45, "票房_@value": 46, "票房_inArea": 47, "专业代码": 48, "气候": 49, "修业年限": 50, "改编自": 51, "官方语言": 52, "首都": 53, "主题曲": 54, "校长": 55, "代言人": 56} \ No newline at end of file diff --git a/examples/information_extraction/DuIE/data_loader.py b/examples/information_extraction/DuIE/data_loader.py new file mode 100644 index 0000000000000000000000000000000000000000..7a1d26a8a88f7321b8c5ec985b6dd6423a2e1be4 --- /dev/null +++ b/examples/information_extraction/DuIE/data_loader.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
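+# DuIE 基线的数据处理模块:将每行 JSON 样本转换为 token 级别的 multi-hot 标签矩阵,
+# 标签维度为 (2*N+2),其中 N 为 predicate2id.json 中去掉 "O"、"I" 后的关系类别数。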
+ +import collections +import json +import os +from dataclasses import dataclass +from typing import Dict, List, Optional, Union + +import numpy as np +import paddle +from extract_chinese_and_punct import ChineseAndPunctuationExtractor + +from paddlenlp.transformers import AutoTokenizer, PretrainedTokenizer + +InputFeature = collections.namedtuple( + "InputFeature", ["input_ids", "seq_len", "tok_to_orig_start_index", "tok_to_orig_end_index", "labels"] +) + + +def parse_label(spo_list, label_map, tokens, tokenizer): + # 2 tags for each predicate + I tag + O tag + num_labels = 2 * (len(label_map.keys()) - 2) + 2 + seq_len = len(tokens) + # initialize tag + labels = [[0] * num_labels for i in range(seq_len)] + # find all entities and tag them with corresponding "B"/"I" labels + for spo in spo_list: + for spo_object in spo["object"].keys(): + # assign relation label + if spo["predicate"] in label_map.keys(): + # simple relation + label_subject = label_map[spo["predicate"]] + label_object = label_subject + 55 + subject_tokens = tokenizer._tokenize(spo["subject"]) + object_tokens = tokenizer._tokenize(spo["object"]["@value"]) + else: + # complex relation + label_subject = label_map[spo["predicate"] + "_" + spo_object] + label_object = label_subject + 55 + subject_tokens = tokenizer._tokenize(spo["subject"]) + object_tokens = tokenizer._tokenize(spo["object"][spo_object]) + + subject_tokens_len = len(subject_tokens) + object_tokens_len = len(object_tokens) + + # assign token label + # there are situations where s entity and o entity might overlap, e.g. xyz established xyz corporation + # to prevent single token from being labeled into two different entity + # we tag the longer entity first, then match the shorter entity within the rest text + forbidden_index = None + if subject_tokens_len > object_tokens_len: + for index in range(seq_len - subject_tokens_len + 1): + if tokens[index : index + subject_tokens_len] == subject_tokens: + labels[index][label_subject] = 1 + for i in range(subject_tokens_len - 1): + labels[index + i + 1][1] = 1 + forbidden_index = index + break + + for index in range(seq_len - object_tokens_len + 1): + if tokens[index : index + object_tokens_len] == object_tokens: + if forbidden_index is None: + labels[index][label_object] = 1 + for i in range(object_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + # check if labeled already + elif index < forbidden_index or index >= forbidden_index + len(subject_tokens): + labels[index][label_object] = 1 + for i in range(object_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + + else: + for index in range(seq_len - object_tokens_len + 1): + if tokens[index : index + object_tokens_len] == object_tokens: + labels[index][label_object] = 1 + for i in range(object_tokens_len - 1): + labels[index + i + 1][1] = 1 + forbidden_index = index + break + + for index in range(seq_len - subject_tokens_len + 1): + if tokens[index : index + subject_tokens_len] == subject_tokens: + if forbidden_index is None: + labels[index][label_subject] = 1 + for i in range(subject_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + elif index < forbidden_index or index >= forbidden_index + len(object_tokens): + labels[index][label_subject] = 1 + for i in range(subject_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + + # if token wasn't assigned as any "B"/"I" tag, give it an "O" tag for outside + for i in range(seq_len): + if labels[i] == [0] * num_labels: + labels[i][0] = 1 + + return labels + + +def convert_example_to_feature( + 
example, + tokenizer: PretrainedTokenizer, + chineseandpunctuationextractor: ChineseAndPunctuationExtractor, + label_map, + max_length: Optional[int] = 512, + pad_to_max_length: Optional[bool] = None, +): + spo_list = example["spo_list"] if "spo_list" in example.keys() else None + text_raw = example["text"] + + sub_text = [] + buff = "" + for char in text_raw: + if chineseandpunctuationextractor.is_chinese_or_punct(char): + if buff != "": + sub_text.append(buff) + buff = "" + sub_text.append(char) + else: + buff += char + if buff != "": + sub_text.append(buff) + + tok_to_orig_start_index = [] + tok_to_orig_end_index = [] + orig_to_tok_index = [] + tokens = [] + text_tmp = "" + for (i, token) in enumerate(sub_text): + orig_to_tok_index.append(len(tokens)) + sub_tokens = tokenizer._tokenize(token) + text_tmp += token + for sub_token in sub_tokens: + tok_to_orig_start_index.append(len(text_tmp) - len(token)) + tok_to_orig_end_index.append(len(text_tmp) - 1) + tokens.append(sub_token) + if len(tokens) >= max_length - 2: + break + else: + continue + break + + seq_len = len(tokens) + # 2 tags for each predicate + I tag + O tag + num_labels = 2 * (len(label_map.keys()) - 2) + 2 + # initialize tag + labels = [[0] * num_labels for i in range(seq_len)] + if spo_list is not None: + labels = parse_label(spo_list, label_map, tokens, tokenizer) + + # add [CLS] and [SEP] token, they are tagged into "O" for outside + if seq_len > max_length - 2: + tokens = tokens[0 : (max_length - 2)] + labels = labels[0 : (max_length - 2)] + tok_to_orig_start_index = tok_to_orig_start_index[0 : (max_length - 2)] + tok_to_orig_end_index = tok_to_orig_end_index[0 : (max_length - 2)] + tokens = ["[CLS]"] + tokens + ["[SEP]"] + # "O" tag for [PAD], [CLS], [SEP] token + outside_label = [[1] + [0] * (num_labels - 1)] + + labels = outside_label + labels + outside_label + tok_to_orig_start_index = [-1] + tok_to_orig_start_index + [-1] + tok_to_orig_end_index = [-1] + tok_to_orig_end_index + [-1] + if seq_len < max_length: + tokens = tokens + ["[PAD]"] * (max_length - seq_len - 2) + labels = labels + outside_label * (max_length - len(labels)) + tok_to_orig_start_index = tok_to_orig_start_index + [-1] * (max_length - len(tok_to_orig_start_index)) + tok_to_orig_end_index = tok_to_orig_end_index + [-1] * (max_length - len(tok_to_orig_end_index)) + + token_ids = tokenizer.convert_tokens_to_ids(tokens) + + return InputFeature( + input_ids=np.array(token_ids), + seq_len=np.array(seq_len), + tok_to_orig_start_index=np.array(tok_to_orig_start_index), + tok_to_orig_end_index=np.array(tok_to_orig_end_index), + labels=np.array(labels), + ) + + +class DuIEDataset(paddle.io.Dataset): + def __init__(self, data, label_map, tokenizer, max_length=512, pad_to_max_length=False): + super(DuIEDataset, self).__init__() + + self.data = data + self.chn_punc_extractor = ChineseAndPunctuationExtractor() + self.tokenizer = tokenizer + self.max_seq_length = max_length + self.pad_to_max_length = pad_to_max_length + self.label_map = label_map + + def __len__(self): + return len(self.data) + + def __getitem__(self, item): + + example = json.loads(self.data[item]) + input_feature = convert_example_to_feature( + example, + self.tokenizer, + self.chn_punc_extractor, + self.label_map, + self.max_seq_length, + self.pad_to_max_length, + ) + return { + "input_ids": np.array(input_feature.input_ids, dtype="int64"), + "seq_lens": np.array(input_feature.seq_len, dtype="int64"), + "tok_to_orig_start_index": np.array(input_feature.tok_to_orig_start_index, dtype="int64"), 
+ "tok_to_orig_end_index": np.array(input_feature.tok_to_orig_end_index, dtype="int64"), + # If model inputs is generated in `collate_fn`, delete the data type casting. + "labels": np.array(input_feature.labels, dtype="float32"), + } + + @classmethod + def from_file( + cls, + file_path: Union[str, os.PathLike], + tokenizer: PretrainedTokenizer, + max_length: Optional[int] = 512, + pad_to_max_length: Optional[bool] = None, + ): + assert os.path.exists(file_path) and os.path.isfile( + file_path + ), f"{file_path} dose not exists or is not a file." + label_map_path = os.path.join(os.path.dirname(file_path), "predicate2id.json") + assert os.path.exists(label_map_path) and os.path.isfile( + label_map_path + ), f"{label_map_path} dose not exists or is not a file." + with open(label_map_path, "r", encoding="utf8") as fp: + label_map = json.load(fp) + with open(file_path, "r", encoding="utf-8") as fp: + data = fp.readlines() + return cls(data, label_map, tokenizer, max_length, pad_to_max_length) + + +@dataclass +class DataCollator: + """ + Collator for DuIE. + """ + + def __call__(self, examples: List[Dict[str, Union[list, np.ndarray]]]): + batched_input_ids = np.stack([x["input_ids"] for x in examples]) + seq_lens = np.stack([x["seq_lens"] for x in examples]) + tok_to_orig_start_index = np.stack([x["tok_to_orig_start_index"] for x in examples]) + tok_to_orig_end_index = np.stack([x["tok_to_orig_end_index"] for x in examples]) + labels = np.stack([x["labels"] for x in examples]) + + return (batched_input_ids, seq_lens, tok_to_orig_start_index, tok_to_orig_end_index, labels) + + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + d = DuIEDataset.from_file("./data/train_data.json", tokenizer) + sampler = paddle.io.RandomSampler(data_source=d) + batch_sampler = paddle.io.BatchSampler(sampler=sampler, batch_size=2) + + collator = DataCollator() + loader = paddle.io.DataLoader(dataset=d, batch_sampler=batch_sampler, collate_fn=collator, return_list=True) + for dd in loader(): + model_input = { + "input_ids": dd[0], + "seq_len": dd[1], + "tok_to_orig_start_index": dd[2], + "tok_to_orig_end_index": dd[3], + "labels": dd[4], + } + print(model_input) diff --git a/examples/information_extraction/DuIE/extract_chinese_and_punct.py b/examples/information_extraction/DuIE/extract_chinese_and_punct.py new file mode 100644 index 0000000000000000000000000000000000000000..2cd8e9966746f48832f4ae55b61042f3f369c3fb --- /dev/null +++ b/examples/information_extraction/DuIE/extract_chinese_and_punct.py @@ -0,0 +1,132 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
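+# 依据 Unicode 码位区间构建正则表达式,用于判断单个字符是否为中文(CJK)字符或中英文标点,
+# 供数据预处理阶段将原始文本切分成逐字符/逐词的片段。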
+import re + +LHan = [ + [0x2E80, 0x2E99], # Han # So [26] CJK RADICAL REPEAT, CJK RADICAL RAP + [0x2E9B, 0x2EF3], # Han # So [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE + [0x2F00, 0x2FD5], # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE + 0x3005, # Han # Lm IDEOGRAPHIC ITERATION MARK + 0x3007, # Han # Nl IDEOGRAPHIC NUMBER ZERO + [0x3021, 0x3029], # Han # Nl [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE + [0x3038, 0x303A], # Han # Nl [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY + 0x303B, # Han # Lm VERTICAL IDEOGRAPHIC ITERATION MARK + [0x3400, 0x4DB5], # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5 + [0x4E00, 0x9FC3], # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3 + [0xF900, 0xFA2D], # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D + [0xFA30, 0xFA6A], # Han # Lo [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A + [0xFA70, 0xFAD9], # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9 + [0x20000, 0x2A6D6], # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6 + [0x2F800, 0x2FA1D], +] # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D + +CN_PUNCTS = [ + (0x3002, "。"), + (0xFF1F, "?"), + (0xFF01, "!"), + (0xFF0C, ","), + (0x3001, "、"), + (0xFF1B, ";"), + (0xFF1A, ":"), + (0x300C, "「"), + (0x300D, "」"), + (0x300E, "『"), + (0x300F, "』"), + (0x2018, "‘"), + (0x2019, "’"), + (0x201C, "“"), + (0x201D, "”"), + (0xFF08, "("), + (0xFF09, ")"), + (0x3014, "〔"), + (0x3015, "〕"), + (0x3010, "【"), + (0x3011, "】"), + (0x2014, "—"), + (0x2026, "…"), + (0x2013, "–"), + (0xFF0E, "."), + (0x300A, "《"), + (0x300B, "》"), + (0x3008, "〈"), + (0x3009, "〉"), + (0x2015, "―"), + (0xFF0D, "-"), + (0x0020, " "), +] +# (0xFF5E, "~"), + +EN_PUNCTS = [[0x0021, 0x002F], [0x003A, 0x0040], [0x005B, 0x0060], [0x007B, 0x007E]] + + +class ChineseAndPunctuationExtractor(object): + def __init__(self): + self.chinese_re = self.build_re() + + def is_chinese_or_punct(self, c): + if self.chinese_re.match(c): + return True + else: + return False + + def build_re(self): + L = [] + for i in LHan: + if isinstance(i, list): + f, t = i + try: + f = chr(f) + t = chr(t) + L.append("%s-%s" % (f, t)) + except: + pass # A narrow python build, so can't use chars > 65535 without surrogate pairs! + + else: + try: + L.append(chr(i)) + except: + pass + for j, _ in CN_PUNCTS: + try: + L.append(chr(j)) + except: + pass + + for k in EN_PUNCTS: + f, t = k + try: + f = chr(f) + t = chr(t) + L.append("%s-%s" % (f, t)) + except: + raise ValueError() + pass # A narrow python build, so can't use chars > 65535 without surrogate pairs! 
+ + RE = "[%s]" % "".join(L) + # print('RE:', RE.encode('utf-8')) + return re.compile(RE, re.UNICODE) + + +if __name__ == "__main__": + extractor = ChineseAndPunctuationExtractor() + for c in "韩邦庆(1856~1894)曾用名寄,字子云,别署太仙、大一山人、花也怜侬、三庆": + if extractor.is_chinese_or_punct(c): + print(c, "yes") + else: + print(c, "no") + + print("~", extractor.is_chinese_or_punct("~")) + print("~", extractor.is_chinese_or_punct("~")) + print("―", extractor.is_chinese_or_punct("―")) + print("-", extractor.is_chinese_or_punct("-")) diff --git a/examples/information_extraction/DuIE/images/tagging_strategy.png b/examples/information_extraction/DuIE/images/tagging_strategy.png new file mode 100644 index 0000000000000000000000000000000000000000..0b67f69d775c1811f70bb0de880fa32bfc7009d3 Binary files /dev/null and b/examples/information_extraction/DuIE/images/tagging_strategy.png differ diff --git a/examples/information_extraction/DuIE/predict.sh b/examples/information_extraction/DuIE/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..dd4a1da7f4cde4457d65a43d64387031215ea16f --- /dev/null +++ b/examples/information_extraction/DuIE/predict.sh @@ -0,0 +1,14 @@ +set -eux + +export CUDA_VISIBLE_DEVICES=0 +export BATCH_SIZE=64 +export CKPT=./checkpoints/model_90000.pdparams +export DATASET_FILE=./data/test1.json + +python run_duie.py \ + --do_predict \ + --init_checkpoint $CKPT \ + --predict_data_file $DATASET_FILE \ + --max_seq_length 128 \ + --batch_size $BATCH_SIZE + diff --git a/examples/information_extraction/DuIE/re_official_evaluation.py b/examples/information_extraction/DuIE/re_official_evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..827601948873fdcc796ef51d9da0a3c7c2c26e6c --- /dev/null +++ b/examples/information_extraction/DuIE/re_official_evaluation.py @@ -0,0 +1,271 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# imitations under the License. +""" +This module to calculate precision, recall and f1-value +of the predicated results. 
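+Predicted SPO results are read from a zip archive of JSON lines and matched
+against the golden file; an optional alias dictionary is used to normalize entity mentions.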
+""" +import argparse +import json +import os +import sys +import zipfile + +SUCCESS = 0 +FILE_ERROR = 1 +NOT_ZIP_FILE = 2 +ENCODING_ERROR = 3 +JSON_ERROR = 4 +SCHEMA_ERROR = 5 +ALIAS_FORMAT_ERROR = 6 + +CODE_INFO = { + SUCCESS: "success", + FILE_ERROR: "file is not exists", + NOT_ZIP_FILE: "predict file is not a zipfile", + ENCODING_ERROR: "file encoding error", + JSON_ERROR: "json parse is error", + SCHEMA_ERROR: "schema is error", + ALIAS_FORMAT_ERROR: "alias dict format is error", +} + + +def del_bookname(entity_name): + """delete the book name""" + if entity_name.startswith("《") and entity_name.endswith("》"): + entity_name = entity_name[1:-1] + return entity_name + + +def check_format(line): + """检查输入行是否格式错误""" + ret_code = SUCCESS + json_info = {} + try: + line = line.strip() + except: + ret_code = ENCODING_ERROR + return ret_code, json_info + try: + json_info = json.loads(line) + except: + ret_code = JSON_ERROR + return ret_code, json_info + if "text" not in json_info or "spo_list" not in json_info: + ret_code = SCHEMA_ERROR + return ret_code, json_info + required_key_list = ["subject", "predicate", "object"] + for spo_item in json_info["spo_list"]: + if type(spo_item) is not dict: + ret_code = SCHEMA_ERROR + return ret_code, json_info + if not all([required_key in spo_item for required_key in required_key_list]): + ret_code = SCHEMA_ERROR + return ret_code, json_info + if not isinstance(spo_item["subject"], str) or not isinstance(spo_item["object"], dict): + ret_code = SCHEMA_ERROR + return ret_code, json_info + return ret_code, json_info + + +def _parse_structured_ovalue(json_info): + spo_result = [] + for item in json_info["spo_list"]: + s = del_bookname(item["subject"].lower()) + o = {} + for o_key, o_value in item["object"].items(): + o_value = del_bookname(o_value).lower() + o[o_key] = o_value + spo_result.append({"predicate": item["predicate"], "subject": s, "object": o}) + return spo_result + + +def load_predict_result(predict_filename): + """Loads the file to be predicted""" + predict_result = {} + ret_code = SUCCESS + if not os.path.exists(predict_filename): + ret_code = FILE_ERROR + return ret_code, predict_result + try: + predict_file_zip = zipfile.ZipFile(predict_filename) + except: + ret_code = NOT_ZIP_FILE + return ret_code, predict_result + for predict_file in predict_file_zip.namelist(): + for line in predict_file_zip.open(predict_file): + ret_code, json_info = check_format(line) + if ret_code != SUCCESS: + return ret_code, predict_result + sent = json_info["text"] + spo_result = _parse_structured_ovalue(json_info) + predict_result[sent] = spo_result + return ret_code, predict_result + + +def load_test_dataset(golden_filename): + """load golden file""" + golden_dict = {} + ret_code = SUCCESS + if not os.path.exists(golden_filename): + ret_code = FILE_ERROR + return ret_code, golden_dict + with open(golden_filename, "r", encoding="utf-8") as gf: + for line in gf: + ret_code, json_info = check_format(line) + if ret_code != SUCCESS: + return ret_code, golden_dict + + sent = json_info["text"] + spo_result = _parse_structured_ovalue(json_info) + golden_dict[sent] = spo_result + return ret_code, golden_dict + + +def load_alias_dict(alias_filename): + """load alias dict""" + alias_dict = {} + ret_code = SUCCESS + if alias_filename == "": + return ret_code, alias_dict + if not os.path.exists(alias_filename): + ret_code = FILE_ERROR + return ret_code, alias_dict + with open(alias_filename, "r", encoding="utf-8") as af: + for line in af: + line = line.strip() + try: + words = 
line.split("\t") + alias_dict[words[0].lower()] = set() + for alias_word in words[1:]: + alias_dict[words[0].lower()].add(alias_word.lower()) + except: + ret_code = ALIAS_FORMAT_ERROR + return ret_code, alias_dict + return ret_code, alias_dict + + +def del_duplicate(spo_list, alias_dict): + """delete synonyms triples in predict result""" + normalized_spo_list = [] + for spo in spo_list: + if not is_spo_in_list(spo, normalized_spo_list, alias_dict): + normalized_spo_list.append(spo) + return normalized_spo_list + + +def is_spo_in_list(target_spo, golden_spo_list, alias_dict): + """target spo是否在golden_spo_list中""" + if target_spo in golden_spo_list: + return True + target_s = target_spo["subject"] + target_p = target_spo["predicate"] + target_o = target_spo["object"] + target_s_alias_set = alias_dict.get(target_s, set()) + target_s_alias_set.add(target_s) + for spo in golden_spo_list: + s = spo["subject"] + p = spo["predicate"] + o = spo["object"] + if p != target_p: + continue + if s in target_s_alias_set and _is_equal_o(o, target_o, alias_dict): + return True + return False + + +def _is_equal_o(o_a, o_b, alias_dict): + for key_a, value_a in o_a.items(): + if key_a not in o_b: + return False + value_a_alias_set = alias_dict.get(value_a, set()) + value_a_alias_set.add(value_a) + if o_b[key_a] not in value_a_alias_set: + return False + for key_b, value_b in o_b.items(): + if key_b not in o_a: + return False + value_b_alias_set = alias_dict.get(value_b, set()) + value_b_alias_set.add(value_b) + if o_a[key_b] not in value_b_alias_set: + return False + return True + + +def calc_pr(predict_filename, alias_filename, golden_filename): + """calculate precision, recall, f1""" + ret_info = {} + + # load alias dict + ret_code, alias_dict = load_alias_dict(alias_filename) + if ret_code != SUCCESS: + ret_info["errorCode"] = ret_code + ret_info["errorMsg"] = CODE_INFO[ret_code] + return ret_info + # load test golden dataset + ret_code, golden_dict = load_test_dataset(golden_filename) + if ret_code != SUCCESS: + ret_info["errorCode"] = ret_code + ret_info["errorMsg"] = CODE_INFO[ret_code] + return ret_info + # load predict result + ret_code, predict_result = load_predict_result(predict_filename) + if ret_code != SUCCESS: + ret_info["errorCode"] = ret_code + ret_info["errorMsg"] = CODE_INFO[ret_code] + return ret_info + + # evaluation + correct_sum, predict_sum, recall_sum, recall_correct_sum = 0.0, 0.0, 0.0, 0.0 + for sent in golden_dict: + golden_spo_list = del_duplicate(golden_dict[sent], alias_dict) + predict_spo_list = predict_result.get(sent, list()) + normalized_predict_spo = del_duplicate(predict_spo_list, alias_dict) + recall_sum += len(golden_spo_list) + predict_sum += len(normalized_predict_spo) + for spo in normalized_predict_spo: + if is_spo_in_list(spo, golden_spo_list, alias_dict): + correct_sum += 1 + for golden_spo in golden_spo_list: + if is_spo_in_list(golden_spo, predict_spo_list, alias_dict): + recall_correct_sum += 1 + sys.stderr.write("correct spo num = {}\n".format(correct_sum)) + sys.stderr.write("submitted spo num = {}\n".format(predict_sum)) + sys.stderr.write("golden set spo num = {}\n".format(recall_sum)) + sys.stderr.write("submitted recall spo num = {}\n".format(recall_correct_sum)) + precision = correct_sum / predict_sum if predict_sum > 0 else 0.0 + recall = recall_correct_sum / recall_sum if recall_sum > 0 else 0.0 + f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 + precision = round(precision, 4) + recall = round(recall, 4) + f1 = 
round(f1, 4) + ret_info["errorCode"] = SUCCESS + ret_info["errorMsg"] = CODE_INFO[SUCCESS] + ret_info["data"] = [] + ret_info["data"].append({"name": "precision", "value": precision}) + ret_info["data"].append({"name": "recall", "value": recall}) + ret_info["data"].append({"name": "f1-score", "value": f1}) + return ret_info + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--golden_file", type=str, help="true spo results", required=True) + parser.add_argument("--predict_file", type=str, help="spo results predicted", required=True) + parser.add_argument("--alias_file", type=str, default="", help="entities alias dictionary") + args = parser.parse_args() + golden_filename = args.golden_file + predict_filename = args.predict_file + alias_filename = args.alias_file + ret_info = calc_pr(predict_filename, alias_filename, golden_filename) + print(json.dumps(ret_info)) diff --git a/examples/information_extraction/DuIE/run_duie.py b/examples/information_extraction/DuIE/run_duie.py new file mode 100644 index 0000000000000000000000000000000000000000..94e1a227292be505567703a8112ebca93fbd5156 --- /dev/null +++ b/examples/information_extraction/DuIE/run_duie.py @@ -0,0 +1,317 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import sys +import time + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from data_loader import DataCollator, DuIEDataset +from paddle.io import DataLoader +from tqdm import tqdm +from utils import decoding, get_precision_recall_f1, write_prediction_results + +from paddlenlp.transformers import ( + AutoModelForTokenClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) + +parser = argparse.ArgumentParser() +parser.add_argument("--do_train", action="store_true", default=False, help="do train") +parser.add_argument("--do_predict", action="store_true", default=False, help="do predict") +parser.add_argument("--init_checkpoint", default=None, type=str, required=False, help="Path to initialize params from") +parser.add_argument("--data_path", default="./data", type=str, required=False, help="Path to data.") +parser.add_argument( + "--predict_data_file", default="./data/test_data.json", type=str, required=False, help="Path to data." +) +parser.add_argument( + "--output_dir", + default="./checkpoints", + type=str, + required=False, + help="The output directory where the model predictions and checkpoints will be written.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", +) +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_ratio", default=0, type=float, help="Linear warmup over warmup_ratio * total_steps.") +parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +args = parser.parse_args() + + +class BCELossForDuIE(nn.Layer): + def __init__( + self, + ): + super(BCELossForDuIE, self).__init__() + self.criterion = nn.BCEWithLogitsLoss(reduction="none") + + def forward(self, logits, labels, mask): + loss = self.criterion(logits, labels) + mask = paddle.cast(mask, "float32") + loss = loss * mask.unsqueeze(-1) + loss = paddle.sum(loss.mean(axis=2), axis=1) / paddle.sum(mask, axis=1) + loss = loss.mean() + return loss + + +def set_random_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, data_loader, file_path, mode): + """ + mode eval: + eval on development set and compute P/R/F1, called between training. + mode predict: + eval on development / test set, then write predictions to \ + predict_test.json and predict_test.json.zip \ + under args.data_path dir for later submission or evaluation. 
+ """ + example_all = [] + with open(file_path, "r", encoding="utf-8") as fp: + for line in fp: + example_all.append(json.loads(line)) + id2spo_path = os.path.join(os.path.dirname(file_path), "id2spo.json") + with open(id2spo_path, "r", encoding="utf8") as fp: + id2spo = json.load(fp) + + model.eval() + loss_all = 0 + eval_steps = 0 + formatted_outputs = [] + current_idx = 0 + for batch in tqdm(data_loader, total=len(data_loader)): + eval_steps += 1 + input_ids, seq_len, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch + logits = model(input_ids=input_ids) + mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and((input_ids != 2)) + loss = criterion(logits, labels, mask) + loss_all += loss.item() + probs = F.sigmoid(logits) + logits_batch = probs.numpy() + seq_len_batch = seq_len.numpy() + tok_to_orig_start_index_batch = tok_to_orig_start_index.numpy() + tok_to_orig_end_index_batch = tok_to_orig_end_index.numpy() + formatted_outputs.extend( + decoding( + example_all[current_idx : current_idx + len(logits)], + id2spo, + logits_batch, + seq_len_batch, + tok_to_orig_start_index_batch, + tok_to_orig_end_index_batch, + ) + ) + current_idx = current_idx + len(logits) + loss_avg = loss_all / eval_steps + print("eval loss: %f" % (loss_avg)) + + if mode == "predict": + predict_file_path = os.path.join(args.data_path, "predictions.json") + else: + predict_file_path = os.path.join(args.data_path, "predict_eval.json") + + predict_zipfile_path = write_prediction_results(formatted_outputs, predict_file_path) + + if mode == "eval": + precision, recall, f1 = get_precision_recall_f1(file_path, predict_zipfile_path) + os.system("rm {} {}".format(predict_file_path, predict_zipfile_path)) + return precision, recall, f1 + elif mode != "predict": + raise Exception("wrong mode for eval func") + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # Reads label_map. + label_map_path = os.path.join(args.data_path, "predicate2id.json") + if not (os.path.exists(label_map_path) and os.path.isfile(label_map_path)): + sys.exit("{} dose not exists or is not a file.".format(label_map_path)) + with open(label_map_path, "r", encoding="utf8") as fp: + label_map = json.load(fp) + num_classes = (len(label_map.keys()) - 2) * 2 + 2 + + # Loads pretrained model ERNIE + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=num_classes) + model = paddle.DataParallel(model) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + criterion = BCELossForDuIE() + + # Loads dataset. 
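+    # DuIEDataset.from_file 会在数据文件同级目录下读取 predicate2id.json 作为标签映射,
+    # 并在 __getitem__ 中按需把每行 JSON 转换成定长 input_ids 与 multi-hot 标签。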
+ train_dataset = DuIEDataset.from_file( + os.path.join(args.data_path, "train_data.json"), tokenizer, args.max_seq_length, True + ) + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.batch_size, shuffle=True, drop_last=True + ) + collator = DataCollator() + train_data_loader = DataLoader( + dataset=train_dataset, batch_sampler=train_batch_sampler, collate_fn=collator, return_list=True + ) + eval_file_path = os.path.join(args.data_path, "dev_data.json") + test_dataset = DuIEDataset.from_file(eval_file_path, tokenizer, args.max_seq_length, True) + test_batch_sampler = paddle.io.BatchSampler( + test_dataset, batch_size=args.batch_size, shuffle=False, drop_last=True + ) + test_data_loader = DataLoader( + dataset=test_dataset, batch_sampler=test_batch_sampler, collate_fn=collator, return_list=True + ) + + # Defines learning rate strategy. + steps_by_epoch = len(train_data_loader) + num_training_steps = steps_by_epoch * args.num_train_epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_ratio) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + # Starts training. + global_step = 0 + logging_steps = 50 + save_steps = 10000 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + print("\n=====start training of %d epochs=====" % epoch) + tic_epoch = time.time() + model.train() + for step, batch in enumerate(train_data_loader): + input_ids, seq_lens, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch + logits = model(input_ids=input_ids) + mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and((input_ids != 2)) + loss = criterion(logits, labels, mask) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + loss_item = loss.item() + global_step += 1 + + if global_step % logging_steps == 0 and rank == 0: + print( + "epoch: %d / %d, steps: %d / %d, loss: %f, speed: %.2f step/s" + % ( + epoch, + args.num_train_epochs, + step, + steps_by_epoch, + loss_item, + logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_step % save_steps == 0 and rank == 0: + print("\n=====start evaluating ckpt of %d steps=====" % global_step) + precision, recall, f1 = evaluate(model, criterion, test_data_loader, eval_file_path, "eval") + print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" % (100 * precision, 100 * recall, 100 * f1)) + print("saving checkpoing model_%d.pdparams to %s " % (global_step, args.output_dir)) + paddle.save(model.state_dict(), os.path.join(args.output_dir, "model_%d.pdparams" % global_step)) + model.train() # back to train mode + + tic_epoch = time.time() - tic_epoch + print( + "epoch time footprint: %d hour %d min %d sec" + % (tic_epoch // 3600, (tic_epoch % 3600) // 60, tic_epoch % 60) + ) + + # Does final evaluation. 
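+    # 训练结束后仅在 rank 0 上评估一次并保存最终权重。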
+ if rank == 0: + print("\n=====start evaluating last ckpt of %d steps=====" % global_step) + precision, recall, f1 = evaluate(model, criterion, test_data_loader, eval_file_path, "eval") + print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" % (100 * precision, 100 * recall, 100 * f1)) + paddle.save(model.state_dict(), os.path.join(args.output_dir, "model_%d.pdparams" % global_step)) + print("\n=====training complete=====") + + +def do_predict(): + paddle.set_device(args.device) + + # Reads label_map. + label_map_path = os.path.join(args.data_path, "predicate2id.json") + if not (os.path.exists(label_map_path) and os.path.isfile(label_map_path)): + sys.exit("{} dose not exists or is not a file.".format(label_map_path)) + with open(label_map_path, "r", encoding="utf8") as fp: + label_map = json.load(fp) + num_classes = (len(label_map.keys()) - 2) * 2 + 2 + + # Loads pretrained model ERNIE + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=num_classes) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + criterion = BCELossForDuIE() + + # Loads dataset. + test_dataset = DuIEDataset.from_file(args.predict_data_file, tokenizer, args.max_seq_length, True) + collator = DataCollator() + test_batch_sampler = paddle.io.BatchSampler( + test_dataset, batch_size=args.batch_size, shuffle=False, drop_last=True + ) + test_data_loader = DataLoader( + dataset=test_dataset, batch_sampler=test_batch_sampler, collate_fn=collator, return_list=True + ) + + # Loads model parameters. + if not (os.path.exists(args.init_checkpoint) and os.path.isfile(args.init_checkpoint)): + sys.exit("wrong directory: init checkpoints {} not exist".format(args.init_checkpoint)) + state_dict = paddle.load(args.init_checkpoint) + model.set_dict(state_dict) + + # Does predictions. + print("\n=====start predicting=====") + evaluate(model, criterion, test_data_loader, args.predict_data_file, "predict") + print("=====predicting complete=====") + + +if __name__ == "__main__": + + if args.do_train: + do_train() + elif args.do_predict: + do_predict() diff --git a/examples/information_extraction/DuIE/train.sh b/examples/information_extraction/DuIE/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..89a69e9ab9fba6d10f6d59481c844653280fe598 --- /dev/null +++ b/examples/information_extraction/DuIE/train.sh @@ -0,0 +1,18 @@ +set -eux + +export BATCH_SIZE=8 +export LR=2e-5 +export EPOCH=12 + +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_duie.py \ + --device gpu \ + --seed 42 \ + --do_train \ + --data_path ./data \ + --max_seq_length 128 \ + --batch_size $BATCH_SIZE \ + --num_train_epochs $EPOCH \ + --learning_rate $LR \ + --warmup_ratio 0.06 \ + --output_dir ./checkpoints diff --git a/examples/information_extraction/DuIE/utils.py b/examples/information_extraction/DuIE/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b810043bdee8fcadc59e796e890ebdac429195a1 --- /dev/null +++ b/examples/information_extraction/DuIE/utils.py @@ -0,0 +1,171 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import codecs +import json +import os +import re +import zipfile + +import numpy as np + + +def find_entity(text_raw, id_, predictions, tok_to_orig_start_index, tok_to_orig_end_index): + """ + retrieval entity mention under given predicate id for certain prediction. + this is called by the "decoding" func. + """ + entity_list = [] + for i in range(len(predictions)): + if [id_] in predictions[i]: + j = 0 + while i + j + 1 < len(predictions): + if [1] in predictions[i + j + 1]: + j += 1 + else: + break + entity = "".join(text_raw[tok_to_orig_start_index[i] : tok_to_orig_end_index[i + j] + 1]) + entity_list.append(entity) + return list(set(entity_list)) + + +def decoding( + example_batch, id2spo, logits_batch, seq_len_batch, tok_to_orig_start_index_batch, tok_to_orig_end_index_batch +): + """ + model output logits -> formatted spo (as in data set file) + """ + formatted_outputs = [] + for (i, (example, logits, seq_len, tok_to_orig_start_index, tok_to_orig_end_index)) in enumerate( + zip(example_batch, logits_batch, seq_len_batch, tok_to_orig_start_index_batch, tok_to_orig_end_index_batch) + ): + + logits = logits[1 : seq_len + 1] # slice between [CLS] and [SEP] to get valid logits + logits[logits >= 0.5] = 1 + logits[logits < 0.5] = 0 + tok_to_orig_start_index = tok_to_orig_start_index[1 : seq_len + 1] + tok_to_orig_end_index = tok_to_orig_end_index[1 : seq_len + 1] + predictions = [] + for token in logits: + predictions.append(np.argwhere(token == 1).tolist()) + + # format predictions into example-style output + formatted_instance = {} + text_raw = example["text"] + complex_relation_label = [8, 10, 26, 32, 46] + complex_relation_affi_label = [9, 11, 27, 28, 29, 33, 47] + + # flatten predictions then retrival all valid subject id + flatten_predictions = [] + for layer_1 in predictions: + for layer_2 in layer_1: + flatten_predictions.append(layer_2[0]) + subject_id_list = [] + for cls_label in list(set(flatten_predictions)): + if 1 < cls_label <= 56 and (cls_label + 55) in flatten_predictions: + subject_id_list.append(cls_label) + subject_id_list = list(set(subject_id_list)) + + # fetch all valid spo by subject id + spo_list = [] + for id_ in subject_id_list: + if id_ in complex_relation_affi_label: + continue # do this in the next "else" branch + if id_ not in complex_relation_label: + subjects = find_entity(text_raw, id_, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + objects = find_entity(text_raw, id_ + 55, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + for subject_ in subjects: + for object_ in objects: + spo_list.append( + { + "predicate": id2spo["predicate"][id_], + "object_type": {"@value": id2spo["object_type"][id_]}, + "subject_type": id2spo["subject_type"][id_], + "object": {"@value": object_}, + "subject": subject_, + } + ) + else: + # traverse all complex relation and look through their corresponding affiliated objects + subjects = find_entity(text_raw, id_, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + objects = find_entity(text_raw, id_ + 55, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + for subject_ in 
subjects: + for object_ in objects: + object_dict = {"@value": object_} + object_type_dict = {"@value": id2spo["object_type"][id_].split("_")[0]} + if id_ in [8, 10, 32, 46] and id_ + 1 in subject_id_list: + id_affi = id_ + 1 + object_dict[id2spo["object_type"][id_affi].split("_")[1]] = find_entity( + text_raw, id_affi + 55, predictions, tok_to_orig_start_index, tok_to_orig_end_index + )[0] + object_type_dict[id2spo["object_type"][id_affi].split("_")[1]] = id2spo["object_type"][ + id_affi + ].split("_")[0] + elif id_ == 26: + for id_affi in [27, 28, 29]: + if id_affi in subject_id_list: + object_dict[id2spo["object_type"][id_affi].split("_")[1]] = find_entity( + text_raw, + id_affi + 55, + predictions, + tok_to_orig_start_index, + tok_to_orig_end_index, + )[0] + object_type_dict[id2spo["object_type"][id_affi].split("_")[1]] = id2spo[ + "object_type" + ][id_affi].split("_")[0] + spo_list.append( + { + "predicate": id2spo["predicate"][id_], + "object_type": object_type_dict, + "subject_type": id2spo["subject_type"][id_], + "object": object_dict, + "subject": subject_, + } + ) + + formatted_instance["text"] = example["text"] + formatted_instance["spo_list"] = spo_list + formatted_outputs.append(formatted_instance) + return formatted_outputs + + +def write_prediction_results(formatted_outputs, file_path): + """write the prediction results""" + + with codecs.open(file_path, "w", "utf-8") as f: + for formatted_instance in formatted_outputs: + json_str = json.dumps(formatted_instance, ensure_ascii=False) + f.write(json_str) + f.write("\n") + zipfile_path = file_path + ".zip" + f = zipfile.ZipFile(zipfile_path, "w", zipfile.ZIP_DEFLATED) + f.write(file_path) + + return zipfile_path + + +def get_precision_recall_f1(golden_file, predict_file): + r = os.popen( + "python3 ./re_official_evaluation.py --golden_file={} --predict_file={}".format(golden_file, predict_file) + ) + result = r.read() + r.close() + precision = float( + re.search('"precision", "value":.*?}', result).group(0).lstrip('"precision", "value":').rstrip("}") + ) + recall = float(re.search('"recall", "value":.*?}', result).group(0).lstrip('"recall", "value":').rstrip("}")) + f1 = float(re.search('"f1-score", "value":.*?}', result).group(0).lstrip('"f1-score", "value":').rstrip("}")) + + return precision, recall, f1 diff --git a/examples/information_extraction/DuUIE/README.md b/examples/information_extraction/DuUIE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b8bfc1edaf4b1e36034be4cdd7332f81d3e34983 --- /dev/null +++ b/examples/information_extraction/DuUIE/README.md @@ -0,0 +1,180 @@ +# CCKS 2022 通用信息抽取 -- 基于UIE的基线系统 + +信息抽取任务旨在根据特定的抽取需求(Schema,S)从非结构化文本(Text,X)中自动抽取结构化信息(Structure,Y)。 +其中,特定的抽取需求是指抽取任务中的抽取框架,主要由抽取类别(人物名称、公司名称、企业上市事件)及目标结构(实体、关系、事件等)组成。 +本任务为中文信息抽取任务,即按照特定的抽取框架S,从给定的一组自由文本X中抽取出所有符合抽取需求的信息结构Y(实体、关系、事件记录等)。 +对于同一输入文本,不同的抽取框架会抽取不同的信息结构。 + +本例中包含四类抽取任务:实体抽取、关系抽取、事件抽取和情感抽取。 +以“In 1997, Steve was excited to become the CEO of Apple.”为例,各个任务的目标结构为: + +- 实体:Steve - 人物实体、Apple - 组织机构实体 +- 关系:(Steve, Work For Apple) +- 事件:{类别: 就职事件, 触发词: become, 论元: [[雇主, Apple], [雇员, Steve]]} +- 情感:(exicted, become the CEO of Apple, Positive) + +该示例展示了如何使用 PaddleNLP 快速构建 [CCKS 2022 通用信息抽取比赛](https://aistudio.baidu.com/aistudio/competition/detail/161/0/task-definition)基线,构建单个模型同时对上述四个任务进行抽取。 + +## 环境安装 + +``` bash +pip install -r requirements.txt +``` + +## 目录结构 +``` text +. 
+├── config/ # 配置文件 +├── inference.py # 推理入口 +├── process_data.py # 比赛数据处理相关脚本 +├── README.md # 说明文件 +├── requirements.txt # Python 依赖包文件 +├── run_seq2struct.py # Python 入口 +└── uie/ + ├── evaluation # 信息抽取代码 + └── seq2struct # 序列到结构代码 +``` + +## 通用信息抽取基线 + +### 基线说明 + +本例采用面向信息抽取的统一序列到结构生成模型作为任务基线。 + +该模型将多种不同的信息抽取目标结构表示为统一的结构化抽取语言(Structured Extraction Language,SEL),并且通过端到端生成的方式实现复杂结构的抽取。 + +同时,该模型使用结构化框架前缀(Structural Schema Instructor,SSI)作为抽取目标来帮助模型区分不同的抽取任务。 + +**[报名竞赛](https://aistudio.baidu.com/aistudio/competition/detail/161/0/introduction)下载数据集后,从[这里](#quick-start)开始实现快速基线。** + +#### 结构化抽取语言 +结构化抽取语言将不同的目标结构进行统一结构表示。 +典型的结构化抽取语言的形式如下: +``` +( + (Spot Name: Info Span + (Assocation Name: Info Span) + (Assocation Name: Info Span) + ) +) +``` +其中, +- Spot Name: 信息点类别,如实体类型; +- Assocation Name (asoc/asso): 信息点关联类别,如关系类型、事件论元类型; +- Info Span: 信息点所对应的文本片段。 + +以`2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!`中的信息结构为例: + +- 该句中的国籍关系 SEL 表达式为: +``` +( + (人物: 谷爱凌 + (国籍: 中国) + ) +) +``` +- 该句中的夺冠事件 SEL 表达式为: +``` +( + (夺冠: 金牌 + (夺冠时间: 2月8号上午) + (冠军: 谷爱凌) + (夺冠赛事: 北京冬奥会自由式滑雪女子大跳台决赛) + ) +) +``` + +生成SEL表达式后,我们通过解析器将表达式解析成对应的结构化抽取记录。 + +``` +records = sel2record.sel2record( + pred=predicted_sel, # 生成的 SEL 表达式,例如 ((夺冠: 金牌 (冠军: 谷爱凌))) + text=raw_text, + ... +) +records 为解析后的抽取结果。例如 {类别: 夺冠, 触发词: 金牌, 冠军: 谷爱凌} + +``` + +#### 结构化模式前缀 +结构化模式前缀与带抽取的文本一同输入序列到结构生成模型,用于区分不同的抽取任务。 +基线模型使用特殊字符 `[spot]`、`[asoc]` 来组织结构化模式前缀,`[spot]` 对应 SEL 中的 SpotName 类别,`[asoc]` 对应 +不同任务的形式是: +- 实体抽取:[spot] 实体类别 [text] +- 关系抽取:[spot] 实体类别 [asoc] 关系类别 [text] +- 事件抽取:[spot] 事件类别 [asoc] 论元类别 [text] +- 情感抽取:[spot] 评价维度 [asoc] 观点类别 [text] + +以夺冠事件为例,其对应的SSI为 `[spot] 夺冠 [asoc] 夺冠事件 [asoc] 冠军 [asoc] 夺冠赛事 [text] 2月8日上午北京冬奥会自由...`。 + +本次大赛在框架定义文件([seen_schema.zip](https://aistudio.baidu.com/aistudio/competition/detail/161/0/datasets))中提供了详细的抽取类别定义和模板类型,欢迎选手尝试多种多样不同的输入形式和输出形式。 + +### 快速基线第一步:数据预处理并加载 + +从比赛官网下载数据集([duuie.zip](https://aistudio.baidu.com/aistudio/competition/detail/161/0/datasets)),解压存放于 data/ 目录下。 +预处理脚本将在原始数据中添加 Spot 和 Asoc 标注,将需要抽取的内容表示为 Spot-Asoc 的数据形式。 + +``` bash +python process_data.py preprocess +``` + +处理之后的数据将自动生成在 data/DuUIE_pre 下,每个实例中添加了 `spot`, `asoc` 和 `spot_asoc` 三个字段。 + +### 快速基线第二步:多任务模型训练 + +基线采用的预训练模型为字符级别的中文模型 [uie-char-small](https://paddlenlp.bj.bcebos.com/models/ccks2022/uie-char-small.zip),该模型采用两阶段的训练方式构建:首先使用 100G 中文数据进行 Span Corruption 预训练;然后使用远距离监督产生的文本-结构数据进行结构生成预训练。 +下载解压缩后开始多任务训练。 + +#### 多任务配置 + +本例中采用 Yaml 配置文件来配置不同任务的数据来源和验证方式,详见多任务配置文件 `config/multi-task-duuie.yaml`。 +本例将依据配置文件自动读取每个任务所需的训练数据进行训练,并对每个任务进行验证并汇报结果。 +``` bash +python3 run_seq2struct.py \ + --multi_task_config config/multi-task-duuie.yaml \ + --negative_keep 1.0 \ + --do_train \ + --metric_for_best_model=all-task-ave \ + --model_name_or_path=./uie-char-small \ + --num_train_epochs=10 \ + --per_device_train_batch_size=32 \ + --per_device_eval_batch_size=256 \ + --output_dir=output/duuie_multi_task_b32_lr5e-4 \ + --logging_dir=output/duuie_multi_task_b32_lr5e-4_log \ + --learning_rate=5e-4 \ + --overwrite_output_dir \ + --gradient_accumulation_steps 1 +``` + +训练完成后,将生成对应的文件夹 `output/duuie_multi_task_b32_lr5e-4` + +### 快速基线第三步:使用多任务模型进行预测 + +该步骤将依据不同任务的抽取框架进行信息抽取,并输出到对应的文件夹中。 +首先下载[测试文件](https://aistudio.baidu.com/aistudio/competition/detail/161/0/datasets)放置在data目录下,然后使用脚本将其自动处理并预测。 + +``` bash +python process_data.py split-test +python inference.py --data data/duuie_test_a --model output/duuie_multi_task_b32_lr5e-4 +``` + +### 快速基线第四步:后处理提交结果 + +该步骤将按照实例的 `id` 将不同任务的预测进行合并,生成提交数据 `submit.txt`。 + +``` bash +python 
process_data.py merge-test +``` + +### 进阶优化,提升模型性能 +本次大赛联合多个开源平台,提供大量开源数据集、免费计算资源、预训练语言模型和结构化知识图谱数据,参赛选手可以进一步构建数据集开发模型,提升模型性能。 +- [千言中文开源数据集](https://aistudio.baidu.com/aistudio/index) +- [飞桨AI Studio-人工智能学习实训社区](https://aistudio.baidu.com/aistudio/index) +- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) +- [OpenKG 开放知识图谱社区](http://openkg.cn) + +## Reference +- **[Unified Structure Generation for Universal Information Extraction](https://arxiv.org/pdf/2203.12277.pdf)** +- [DuIE: A Large-scale Chinese Dataset for Information Extraction](http://tcci.ccf.org.cn/conference/2019/papers/EV10.pdf) +- [DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios](https://link.springer.com/chapter/10.1007/978-3-030-60457-8_44) +- [CASA: Conversational Aspect Sentiment Analysis for Dialogue Understanding](https://jair.org/index.php/jair/article/view/12802) diff --git a/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml b/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fafefbad28934241a36f897c996d30b4d6170ce5 --- /dev/null +++ b/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml @@ -0,0 +1,172 @@ +T1: + name: DUIE_LIFE_SPO + path: data/duuie_pre/DUIE_LIFE_SPO + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-rel-boundary-F1 + +T2: + name: DUIE_ORG_SPO + path: data/duuie_pre/DUIE_ORG_SPO + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-rel-boundary-F1 + +T3: + name: 体育竞赛 + path: data/duuie_pre/体育竞赛 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T4: + name: 灾害意外 + path: data/duuie_pre/灾害意外 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T5: + name: CONV-ASA + path: data/duuie_pre/CONV-ASA + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-rel-strict-F1 + +T6: + name: MSRA + path: data/duuie_pre/MSRA + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-ent-F1 + +T7: + name: PEOPLE_DAILY + path: data/duuie_pre/PEOPLE_DAILY + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-ent-F1 + +T8: + name: 金融信息_中标 + path: data/duuie_pre/金融信息_中标 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T9: + name: 金融信息_企业融资 + path: data/duuie_pre/金融信息_企业融资 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T10: + name: 金融信息_股份回购 + path: data/duuie_pre/金融信息_股份回购 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T11: + name: 金融信息_中标 + path: data/duuie_pre/金融信息_中标 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T12: + name: 金融信息_高管变动 + path: data/duuie_pre/金融信息_高管变动 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T13: + name: 金融信息_亏损 + path: data/duuie_pre/金融信息_亏损 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T14: + name: 金融信息_公司上市 + path: data/duuie_pre/金融信息_公司上市 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T15: + name: 金融信息_被约谈 + path: data/duuie_pre/金融信息_被约谈 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T16: + name: 金融信息_企业收购 + path: data/duuie_pre/金融信息_企业收购 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - 
string-evt-role-F1 + +T17: + name: 金融信息_股东减持 + path: data/duuie_pre/金融信息_股东减持 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T18: + name: 金融信息_解除质押 + path: data/duuie_pre/金融信息_解除质押 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T19: + name: 金融信息_企业破产 + path: data/duuie_pre/金融信息_企业破产 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T20: + name: 金融信息_股东增持 + path: data/duuie_pre/金融信息_股东增持 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T21: + name: 金融信息_质押 + path: data/duuie_pre/金融信息_质押 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 diff --git a/examples/information_extraction/DuUIE/inference.py b/examples/information_extraction/DuUIE/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..b0a0662837c092eed0f8fa407fa214f995b5c17a --- /dev/null +++ b/examples/information_extraction/DuUIE/inference.py @@ -0,0 +1,164 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import math +from tqdm import tqdm + +import paddle +from paddlenlp.data import Pad +from paddlenlp.transformers import T5ForConditionalGeneration + +from uie.evaluation.sel2record import RecordSchema, MapConfig, SEL2Record +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer + +special_to_remove = {"", ""} + + +def read_json_file(file_name): + return [json.loads(line) for line in open(file_name, encoding="utf8")] + + +def schema_to_ssi(schema: RecordSchema): + # Convert Schema to SSI + # spot type ... 
asoc type + ssi = "" + "".join(sorted(schema.type_list)) + ssi += "" + "".join(sorted(schema.role_list)) + ssi += "" + return ssi + + +def post_processing(x): + for special in special_to_remove: + x = x.replace(special, "") + return x.strip() + + +class Predictor: + def __init__(self, model_path, max_source_length=256, max_target_length=192) -> None: + self.tokenizer = T5BertTokenizer.from_pretrained(model_path) + self.model = T5ForConditionalGeneration.from_pretrained(model_path) + self.model.eval() + self.max_source_length = max_source_length + self.max_target_length = max_target_length + + @paddle.no_grad() + def predict(self, text, schema): + def to_tensor(x): + return paddle.to_tensor(x, dtype="int64") + + ssi = schema_to_ssi(schema=schema) + + text = [ssi + x for x in text] + + inputs = self.tokenizer( + text, return_token_type_ids=False, return_attention_mask=True, max_seq_len=self.max_source_length + ) + + inputs = { + "input_ids": to_tensor(Pad(pad_val=self.tokenizer.pad_token_id)(inputs["input_ids"])), + "attention_mask": to_tensor(Pad(pad_val=0)(inputs["attention_mask"])), + } + + pred, _ = self.model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=self.max_target_length, + ) + + pred = self.tokenizer.batch_decode(pred.numpy()) + + return [post_processing(x) for x in pred] + + +def find_to_predict_folder(folder_name): + for root, dirs, _ in os.walk(folder_name): + for dirname in dirs: + data_name = os.path.join(root, dirname) + if os.path.exists(os.path.join(data_name, "record.schema")): + yield data_name + + +def main(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--data", "-d", required=True, help="Folder need to been predicted.") + parser.add_argument("--model", "-m", required=True, help="Trained model for inference") + parser.add_argument( + "--max_source_length", default=384, type=int, help="Max source length for inference, ssi + text" + ) + parser.add_argument("--max_target_length", default=192, type=int) + parser.add_argument("--batch_size", default=512, type=int) + parser.add_argument( + "-c", + "--config", + dest="map_config", + help="Offset mapping config, maping generated sel to offset record", + default="longer_first_zh", + ) + parser.add_argument("--verbose", action="store_true") + options = parser.parse_args() + + # Find the folder need to be predicted with `record.schema` + data_folder = find_to_predict_folder(options.data) + model_path = options.model + + predictor = Predictor( + model_path=model_path, max_source_length=options.max_source_length, max_target_length=options.max_target_length + ) + + for task_folder in data_folder: + + print(f"Extracting on {task_folder}") + schema = RecordSchema.read_from_file(os.path.join(task_folder, "record.schema")) + sel2record = SEL2Record( + schema_dict=SEL2Record.load_schema_dict(task_folder), + map_config=MapConfig.load_by_name(options.map_config), + tokenizer=predictor.tokenizer, + ) + + test_filename = os.path.join(f"{task_folder}", "test.json") + if not os.path.exists(test_filename): + print(f"{test_filename} not found, skip ...") + continue + + instances = read_json_file(test_filename) + text_list = [x["text"] for x in instances] + token_list = [list(x["text"]) for x in instances] + + batch_num = math.ceil(len(text_list) / options.batch_size) + + predict = list() + for index in tqdm(range(batch_num)): + start = index * options.batch_size + end = index * options.batch_size + options.batch_size + predict += 
predictor.predict(text_list[start:end], schema=schema) + + records = list() + for p, text, tokens in zip(predict, text_list, token_list): + records += [sel2record.sel2record(pred=p, text=text, tokens=tokens)] + + pred_filename = os.path.join(f"{task_folder}", "pred.json") + with open(pred_filename, "w", encoding="utf8") as output: + for record in records: + output.write(json.dumps(record, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + main() diff --git a/examples/information_extraction/DuUIE/process_data.py b/examples/information_extraction/DuUIE/process_data.py new file mode 100644 index 0000000000000000000000000000000000000000..0269223ccb7a287f73e777c694255c9fcce66981 --- /dev/null +++ b/examples/information_extraction/DuUIE/process_data.py @@ -0,0 +1,612 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +from typing import List, Dict +from collections import defaultdict +import yaml +import json +import os +from uie.evaluation.sel2record import RecordSchema, merge_schema + + +def load_definition_schema_file(filename): + """Load schema file in Yaml + 读取 YAML 定义的 Schema 文件 + """ + return yaml.load(open(filename, encoding="utf8"), Loader=yaml.FullLoader) + + +def load_jsonlines_file(filename): + """Load Data file in JSONLINE + 读取 JSONLINE 文件 + """ + return [json.loads(line) for line in open(filename, encoding="utf8")] + + +def convert_entity_schema(entity_schema): + """Convert entity schmea to record schema""" + spots = list() + asocs = list() + spot_asoc_map = dict() + for entity in entity_schema: + spots += [entity] + spot_asoc_map[entity] = list() + return spots, asocs, spot_asoc_map + + +def convert_entity_relation_schema(entity_schema, relation_schema): + """Convert entity and relation chmea to record schema""" + spots = list() + asocs = list() + spot_asoc_map = dict() + for entity in entity_schema: + spots += [entity] + spot_asoc_map[entity] = list() + for relation in relation_schema: + asocs += [relation] + arg1_type = relation_schema[relation]["主体"] + if arg1_type not in spots: + spots += [arg1_type] + spot_asoc_map[arg1_type] = list() + spot_asoc_map[arg1_type] += [relation] + return spots, asocs, spot_asoc_map + + +def convert_event_schema(schema): + """Convert event schmea to record schema""" + spots = list() + asocs = set() + spot_asoc_map = dict() + for event_type, definition in schema.items(): + spots += [event_type] + spot_asoc_map[event_type] = list() + for arg in definition["参数"]: + asocs.add(arg) + spot_asoc_map[event_type] += [arg] + return spots, list(asocs), spot_asoc_map + + +def dump_schema(output_folder, schema_dict): + if not os.path.exists(output_folder): + os.makedirs(output_folder) + + for schema_name, schema in schema_dict.items(): + schema_file = f"{output_folder}/{schema_name}.schema" + with open(schema_file, "w", encoding="utf8") as output: + for element in schema: + output.write(json.dumps(element, ensure_ascii=False) 
+ "\n") + + +def main_entity_relation(schema_file, schema_name, instances, output_folder): + schema = yaml.load(open(schema_file, encoding="utf8"), Loader=yaml.FullLoader) + entity_schema = convert_entity_schema(schema.get("实体", {})) + relation_schema = convert_entity_relation_schema(schema.get("实体", {}), schema.get("关系", {})) + event_schema = convert_event_schema({}) + dump_schema( + output_folder=output_folder, + schema_dict={ + "entity": entity_schema, + "relation": relation_schema, + "event": event_schema, + "record": relation_schema, + }, + ) + + with open(f"{output_folder}/test.json", "w", encoding="utf8") as output: + for instance in instances: + if instance["schema"] == schema_name: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + return schema_name + + +def main_event(schema_file, schema_name, instances, output_folder): + schema = yaml.load(open(schema_file, encoding="utf8"), Loader=yaml.FullLoader) + event_schema = convert_event_schema(schema.get("事件", {})) + dump_schema( + output_folder=output_folder, + schema_dict={ + "entity": [[], [], {}], + "relation": [[], [], {}], + "event": event_schema, + "record": event_schema, + }, + ) + + with open(f"{output_folder}/test.json", "w", encoding="utf8") as output: + for instance in instances: + if instance["schema"] == schema_name: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + return schema_name + + +def main_seprate_event(schema_file, schema_name, instances, output_folder): + """Prediction tasks are separated by event types + 按照事件类别分离预测任务生成抽取的 Schema + """ + valid_instances = list() + + for instance in instances: + if schema_name == instance["schema"]: + valid_instances += [instance] + + schema = yaml.load(open(schema_file, encoding="utf8"), Loader=yaml.FullLoader) + _, _, event_map = convert_event_schema(schema.get("事件", {})) + + for event in event_map: + subevent_output_folder = f"{output_folder}_{event}" + dump_schema( + output_folder=subevent_output_folder, + schema_dict={ + "entity": [[], [], {}], + "relation": [[], [], {}], + "event": [[event], event_map[event], {event: event_map[event]}], + "record": [[event], event_map[event], {event: event_map[event]}], + }, + ) + + with open(f"{subevent_output_folder}/test.json", "w", encoding="utf8") as output: + for instance in valid_instances: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + return event_map.keys() + + +# 将关系抽取结果转换到提交格式 +def convert_relation(relation): + return { + "type": relation[0], + "args": [ + {"type": relation[1], "text": relation[2]}, + {"type": relation[3], "text": relation[4]}, + ], + } + + +# 将实体抽取结果转换到提交格式 +def convert_entity(entity): + return { + "type": entity[0], + "text": entity[1], + } + + +def convert_event(event): + return { + "type": event["type"], + "text": event["trigger"], + "args": [{"type": role_type, "text": arg} for role_type, arg in event["roles"]], + } + + +def merge_pred_text_file(text_filename, pred_filename): + """Merge extracted result + 基于实例编号合并抽取结果 + """ + + # 读取原始文件中的数据,用于获取 ID + test_instances = load_jsonlines_file(text_filename) + # 读取抽取结果的预测文件 + pred_instances = load_jsonlines_file(pred_filename) + + assert len(test_instances) == len(pred_instances) + + to_sumbit_instances = dict() + for test_instance, pred_instance in zip(test_instances, pred_instances): + + # 获取抽取结果中的字符串结果 + entity_list = pred_instance["entity"].get("string", []) + relation_list = pred_instance["relation"].get("string", []) + event_list = pred_instance["event"].get("string", []) + + # 将抽取结果转换为提交的数据格式 + 
to_sumbit_instance = { + "id": test_instance["id"], + "entity": [convert_entity(entity) for entity in entity_list], + "relation": [convert_relation(relation) for relation in relation_list], + "event": [convert_event(event) for event in event_list], + } + to_sumbit_instances[test_instance["id"]] = to_sumbit_instance + + return to_sumbit_instances + + +def split_test(options): + test_file = options.data_file + schema_folder = options.schema_folder + output_folder = options.output_folder + instances = [json.loads(line) for line in open(test_file, encoding="utf8")] + main_entity_relation( + os.path.join(schema_folder, "人生信息.yaml"), "人生信息", instances, os.path.join(output_folder, "人生信息") + ) + main_entity_relation( + os.path.join(schema_folder, "机构信息.yaml"), "机构信息", instances, os.path.join(output_folder, "机构信息") + ) + main_entity_relation( + os.path.join(schema_folder, "影视情感.yaml"), "影视情感", instances, os.path.join(output_folder, "影视情感") + ) + main_event(os.path.join(schema_folder, "灾害意外.yaml"), "灾害意外", instances, os.path.join(output_folder, "灾害意外")) + main_event(os.path.join(schema_folder, "体育竞赛.yaml"), "体育竞赛", instances, os.path.join(output_folder, "体育竞赛")) + main_seprate_event( + os.path.join(schema_folder, "金融信息.yaml"), "金融信息", instances, os.path.join(output_folder, "金融信息") + ) + + +def merge_test(options): + """Merge predicted result from trained model + 将预测文件夹中的预测结果进行合并 + """ + output_folder = options.pred_folder + submit_filename = options.submit + + to_sumbit_instances = dict() + for schema in os.listdir(output_folder): + test_filename = os.path.join(output_folder, schema, "test.json") + pred_filename = os.path.join(output_folder, schema, "pred.json") + sub_to_sumbit_instances = merge_pred_text_file( + text_filename=test_filename, + pred_filename=pred_filename, + ) + print(f"Merge {schema} with {len(sub_to_sumbit_instances)} instances ...") + for instance_id, instance in sub_to_sumbit_instances.items(): + if instance_id in to_sumbit_instances: + to_sumbit_instances[instance_id]["entity"] += instance.get("entity", []) + to_sumbit_instances[instance_id]["relation"] += instance.get("relation", []) + to_sumbit_instances[instance_id]["event"] += instance.get("event", []) + else: + to_sumbit_instances[instance_id] = instance + + print(f"To submit instances number: {len(to_sumbit_instances)}") + with open(submit_filename, "w", encoding="utf8") as output: + for instance in to_sumbit_instances.values(): + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + + +def annonote_graph(entities: List[Dict] = [], relations: List[Dict] = [], events: List[Dict] = []): + """Convert Entity Relation Event to Spot-Assocation Graph + 将实体、关系和事件的标注信息转换成需要生成的 Spot-Assocation 结构 + + Args: + tokens (List[str]): Token List + entities (List[Entity], optional): Entity List. Defaults to []. + relations (List[Relation], optional): Relation List. Defaults to []. + events (List[Event], optional): Event List. Defaults to []. 
+ + Returns: + set: Set of Spot + set: Set of Asoc + list: Instance of Spot-Asoc + """ + spot_dict = dict() + asoc_dict = defaultdict(list) + + def add_spot(spot): + spot_key = (tuple(spot["offset"]), spot["type"]) + spot_dict[spot_key] = spot + + def add_asoc(spot, asoc, tail): + spot_key = (tuple(spot["offset"]), spot["type"]) + asoc_dict[spot_key] += [(tuple(tail["offset"]), tail["text"], asoc)] + + for entity in entities: + add_spot(spot=entity) + + for relation in relations: + add_spot(spot=relation["args"][0]) + add_asoc(spot=relation["args"][0], asoc=relation["type"], tail=relation["args"][1]) + + for event in events: + add_spot(spot=event) + for argument in event["args"]: + add_asoc(spot=event, asoc=argument["type"], tail=argument) + + spot_asoc_instance = list() + for spot_key in sorted(spot_dict.keys()): + offset, label = spot_key + + if len(spot_dict[spot_key]["offset"]) == 0: + continue + + spot_instance = { + "span": spot_dict[spot_key]["text"], + "label": label, + "asoc": list(), + } + for tail_offset, tail_text, asoc in sorted(asoc_dict.get(spot_key, [])): + if len(tail_offset) == 0: + continue + + spot_instance["asoc"] += [(asoc, tail_text)] + spot_asoc_instance += [spot_instance] + + spot_labels = set([label for _, label in spot_dict.keys()]) + asoc_labels = set() + for _, asoc_list in asoc_dict.items(): + for _, _, asoc in asoc_list: + asoc_labels.add(asoc) + return spot_labels, asoc_labels, spot_asoc_instance + + +def add_spot_asoc_to_single_file(filename): + instances = [json.loads(line) for line in open(filename, encoding="utf8")] + print(f"Add spot asoc to {filename} ...") + with open(filename, "w", encoding="utf8") as output: + for instance in instances: + spots, asocs, spot_asoc_instance = annonote_graph( + entities=instance["entity"], + relations=instance["relation"], + events=instance["event"], + ) + # 将信息结构转换成 Spot Asoc 形式 + instance["spot_asoc"] = spot_asoc_instance + # 添加该实例中存在的 Spot 类别 + instance["spot"] = list(spots) + # 添加该实例中存在的 Asoc 类别 + instance["asoc"] = list(asocs) + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + + +def convert_duuie_to_spotasoc(data_folder, ignore_datasets): + + schema_list = list() + + for task_folder in os.listdir(data_folder): + if task_folder in ignore_datasets: + continue + if not os.path.isdir(os.path.join(data_folder, task_folder)): + continue + + print(f"Add spot asoc to {task_folder} ...") + # 读取单任务的 Schema + task_schema_file = os.path.join(data_folder, task_folder, "record.schema") + + # 向单任务数据中添加 Spot Asoc 标注 + add_spot_asoc_to_single_file(os.path.join(data_folder, task_folder, "train.json")) + add_spot_asoc_to_single_file(os.path.join(data_folder, task_folder, "val.json")) + record_schema = RecordSchema.read_from_file(task_schema_file) + + schema_list += [record_schema] + + for line in open(os.path.join(data_folder, task_folder, "train.json"), encoding="utf8"): + new_instance = json.loads(line) + # 添加任务中所有的 Spot 类别 + new_instance["spot"] = record_schema.type_list + # 添加任务中所有的 Asoc 类别 + new_instance["asoc"] = record_schema.role_list + + for line in open(os.path.join(data_folder, task_folder, "val.json"), encoding="utf8"): + new_instance = json.loads(line) + # 添加任务中所有的 Spot 类别 + new_instance["spot"] = record_schema.type_list + # 添加任务中所有的 Asoc 类别 + new_instance["asoc"] = record_schema.role_list + + # 融合不同任务的 Schema + multi_schema = merge_schema(schema_list) + multi_schema.write_to_file(os.path.join(data_folder, "record.schema")) + + +def dump_instances(instances, output_filename): + with open(output_filename, 
"w", encoding="utf8") as output: + for instance in instances: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + + +def dump_event_schema(event_map, output_folder): + role_list = list() + for roles in event_map.values(): + role_list += roles["参数"] + rols_list = list(set(role_list)) + type_list = list(event_map.keys()) + type_role_map = {event_type: list(event_map[event_type]["参数"].keys()) for event_type in event_map} + dump_schema( + output_folder=output_folder, + schema_dict={ + "entity": [[], [], {}], + "relation": [[], [], {}], + "event": [type_list, rols_list, type_role_map], + "record": [type_list, rols_list, type_role_map], + }, + ) + + +def filter_event_in_instance(instances, required_event_types): + """Filter events in the instance, keep event mentions with `required_event_types` + 过滤实例中的事件,只保留需要的事件类别的事件标注 + """ + import copy + + new_instances = list() + for instance in instances: + new_instance = copy.deepcopy(instance) + new_instance["event"] = list(filter(lambda x: x["type"] in required_event_types, new_instance["event"])) + new_instances += [new_instance] + return new_instances + + +def filter_event(data_folder, event_types, output_folder): + """Keep event with `event_types` in `data_folder` save to `output_folder` + 过滤 `data_folder` 中的事件,只保留 `event_types` 类型事件保存到 `output_folder`""" + dump_event_schema(event_types, output_folder) + for split in ["train", "val"]: + filename = os.path.join(data_folder, f"{split}.json") + instances = [json.loads(line.strip()) for line in open(filename, encoding="utf8")] + new_instances = filter_event_in_instance(instances, required_event_types=event_types) + dump_instances(new_instances, os.path.join(output_folder, f"{split}.json")) + + +def preprocess_event(data_folder, schema_folder): + """Preprocessing event dataset for CCKS 2022 + 针对 CCKS 2022 竞赛数据进行预处理 + """ + # Filter event annotation in raw data, only keep the required event in CCKS 2022 + # 对事件数据进行预处理,过滤除 `灾害意外` 和 `体育竞赛` 外的事件标注 + for schema in ["灾害意外", "体育竞赛"]: + print(f"Building {schema} dataset ...") + duee_folder = os.path.join(data_folder, "DUEE") + schema_file = os.path.join(schema_folder, f"{schema}.yaml") + output_folder = os.path.join(data_folder, schema) + schema = load_definition_schema_file(schema_file) + filter_event( + data_folder=duee_folder, + event_types=schema["事件"], + output_folder=output_folder, + ) + + for schema in ["金融信息"]: + print(f"Building {schema} dataset ...") + duee_fin_folder = os.path.join(data_folder, "DUEE_FIN_LITE") + schema_file = os.path.join(schema_folder, f"{schema}.yaml") + output_folder = os.path.join(data_folder, schema) + schema = load_definition_schema_file(schema_file) + # 依据不同事件类别将多事件抽取分割成多个单事件类型抽取 + # Separate multi-type extraction to multiple single-type extraction + for event_type in schema["事件"]: + filter_event( + data_folder=duee_fin_folder, + event_types={event_type: schema["事件"][event_type]}, + output_folder=output_folder + "_" + event_type, + ) + + +def merge_instance(instance_list): + """Merge instances with same text but different annotation + 合并文本相同标记不同的实例 + """ + + def all_equal(_x): + for __x in _x: + if __x != _x[0]: + return False + return True + + def entity_key(_x): + return (tuple(_x["offset"]), _x["type"]) + + def relation_key(_x): + return ( + tuple(_x["type"]), + tuple(_x["args"][0]["offset"]), + _x["args"][0]["type"], + tuple(_x["args"][1]["offset"]), + _x["args"][1]["type"], + ) + + def event_key(_x): + return (tuple(_x["offset"]), _x["type"]) + + assert all_equal([x["text"] for x in instance_list]) + + 
element_dict = { + "entity": dict(), + "relation": dict(), + "event": dict(), + } + instance_id_list = list() + for x in instance_list: + instance_id_list += [x["id"]] + for entity in x.get("entity", list()): + element_dict["entity"][entity_key(entity)] = entity + for relation in x.get("relation", list()): + element_dict["relation"][relation_key(relation)] = relation + for event in x.get("event", list()): + element_dict["event"][event_key(event)] = event + + return { + "id": "-".join(instance_id_list), + "text": instance_list[0]["text"], + "tokens": instance_list[0]["tokens"], + "entity": list(element_dict["entity"].values()), + "relation": list(element_dict["relation"].values()), + "event": list(element_dict["event"].values()), + } + + +def preprocess_duie(data_folder): + life_folder = os.path.join(data_folder, "DUIE_LIFE_SPO") + org_folder = os.path.join(data_folder, "DUIE_ORG_SPO") + life_train_instances = load_jsonlines_file(f"{life_folder}/train.json") + org_train_instances = load_jsonlines_file(f"{org_folder}/train.json") + life_relation = RecordSchema.read_from_file(f"{life_folder}/record.schema").role_list + org_relation = RecordSchema.read_from_file(f"{org_folder}/record.schema").role_list + + instance_dict = defaultdict(list) + for instance in life_train_instances + org_train_instances: + instance_dict[instance["text"]] += [instance] + + for text in instance_dict: + instance_dict[text] = merge_instance(instance_dict[text]) + + with open(f"{life_folder}/train.json", "w") as output: + for instance in instance_dict.values(): + new_instance = copy.deepcopy(instance) + new_instance["relation"] = list(filter(lambda x: x["type"] in life_relation, instance["relation"])) + output.write(json.dumps(new_instance) + "\n") + + with open(f"{org_folder}/train.json", "w") as output: + for instance in instance_dict.values(): + new_instance = copy.deepcopy(instance) + new_instance["relation"] = list(filter(lambda x: x["type"] in org_relation, instance["relation"])) + output.write(json.dumps(new_instance) + "\n") + + +def preprocess(options): + """Preprocessing event dataset for CCKS 2022 + 针对 CCKS 2022 竞赛数据进行预处理 + """ + import shutil + + shutil.rmtree(options.output_folder) if os.path.exists(options.output_folder) else None + shutil.copytree(options.train_data, options.output_folder) + + preprocess_duie(data_folder=options.output_folder) + preprocess_event(data_folder=options.output_folder, schema_folder=options.schema_folder) + convert_duuie_to_spotasoc(data_folder=options.output_folder, ignore_datasets=options.ignore_datasets) + + +if __name__ == "__main__": + import argparse + + parser = argparse.ArgumentParser() + + subparsers = parser.add_subparsers(help="Data preprocessing scripts for CCKS 2022") + + parser_t = subparsers.add_parser("preprocess", help="Data preprocessing") + parser_t.add_argument("--train_data", default="data/duuie", help="Path for DuUIE data folder") + parser_t.add_argument("--output_folder", default="data/duuie_pre", help="Path for Preprocessed DuUIE data folder") + parser_t.add_argument( + "--ignore_datasets", + default=["DUEE", "DUEE_FIN_LITE"], + nargs="+", + help="Ignore dataset in `output_folder` for training", + ) + parser_t.add_argument("--schema_folder", default="data/seen_schema", help="Path for seen schema folder") + parser_t.set_defaults(func=preprocess) + + parser_a = subparsers.add_parser("split-test", help="Split test file with schema for prediction") + parser_a.add_argument("--data_file", default="data/duuie_test_a.json", help="Path for DuUIE data file") + 
parser_a.add_argument("--output_folder", default="data/duuie_test_a", help="Path for DuUIE predicted folder") + parser_a.add_argument("--schema_folder", default="data/seen_schema", help="Path for seen schema folder") + parser_a.set_defaults(func=split_test) + + parser_b = subparsers.add_parser("merge-test", help="Merge predicted result for submission") + parser_b.add_argument("--data_file", default="data/duuie_test_a.json", help="Path for DuUIE data file") + parser_b.add_argument("--pred_folder", default="data/duuie_test_a", help="Path for DuUIE predicted folder") + parser_b.add_argument("--submit", default="submit.txt", help="Path for output submission file") + parser_b.set_defaults(func=merge_test) + + options = parser.parse_args() + options.func(options) diff --git a/examples/information_extraction/DuUIE/requirements.txt b/examples/information_extraction/DuUIE/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..88bbb7f977ad1f22e2e26b8f4ae3082f73c385fe --- /dev/null +++ b/examples/information_extraction/DuUIE/requirements.txt @@ -0,0 +1,5 @@ +tabulate +importlib_metadata +nltk +visualdl +paddlenlp \ No newline at end of file diff --git a/examples/information_extraction/DuUIE/run_seq2struct.py b/examples/information_extraction/DuUIE/run_seq2struct.py new file mode 100644 index 0000000000000000000000000000000000000000..cb566bfe8e3917f6569b3de2ae465eb60efe2fe8 --- /dev/null +++ b/examples/information_extraction/DuUIE/run_seq2struct.py @@ -0,0 +1,497 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os + +import paddle +from paddle.amp import GradScaler, auto_cast +from paddle.optimizer import AdamW +from uie.evaluation.sel2record import evaluate_extraction_results +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer +from uie.seq2struct.utils import ( + better_print_multi, + get_scheduler, + get_train_dataloader, + get_writer, + load_eval_tasks, + save_checkpoint, + set_logger, + set_seed, +) + +from paddlenlp.transformers import T5ForConditionalGeneration + +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--multi_task_config", + required=True, + help="Path to multi-task config file.", + ) + parser.add_argument( + "--model_name_or_path", + default="t5-large", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + type=str, + help="The output directory where the model predictions and checkpoints" + " will be written. 
Default as `outputs`", + ) + parser.add_argument( + "--overwrite_output_dir", + action="store_true", + help="Overwrite output directory", + ) + parser.add_argument( + "--logging_dir", + type=str, + help="The output logging directory", + ) + parser.add_argument( + "--max_source_length", + default=384, + type=int, + help="The maximum total input sequence length after tokenization." + " Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--max_target_length", + default=192, + type=int, + help="The maximum total target sequence length to be generated.", + ) + parser.add_argument("--max_prefix_length", default=None, type=int, help="The maximum prefix length.") + parser.add_argument( + "--per_device_train_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--per_device_eval_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for evaluating.", + ) + parser.add_argument( + "--metric_for_best_model", + type=str, + help="The main metric to choose the best checkpoint.", + ) + parser.add_argument( + "--gradient_accumulation_steps", + default=1, + type=int, + help="gradient_accumulation_steps.", + ) + parser.add_argument( + "--learning_rate", + default=5e-4, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_beta1", default=0.9, type=float, help="Beta1 for AdamW optimizer") + parser.add_argument("--adam_beta2", default=0.999, type=float, help="Beta2 for AdamW optimizer") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=10, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_ratio", + default=0.06, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument( + "--ignore_pad_token_for_loss", + type=bool, + default=True, + help="Whether to ignore the tokens corresponding to padded labels in the loss computation or not.", + ) + parser.add_argument( + "--warmup_steps", + type=int, + default=-1, + help="warmup_steps.", + ) + parser.add_argument( + "--logging_steps", + type=int, + default=500, + help="Log every X updates steps.", + ) + parser.add_argument( + "--seed", + type=int, + default=42, + help="random seed for initialization", + ) + parser.add_argument( + "--writer_type", + choices=["visualdl", "tensorboard"], + default="visualdl", + help="writer_type.", + ) + parser.add_argument( + "--lr_scheduler_type", + choices=["linear", "cosine", "poly"], + default="linear", + type=str, + help="lr_scheduler_type.", + ) + parser.add_argument("--use_amp", "--fp16", action="store_true", help="Enable mixed precision training.") + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + parser.add_argument( + "--dataloader_num_workers", + type=int, + default=0, + help="dataloader_num_workers.", + ) + parser.add_argument("--spot_noise", type=float, default=0, help="The noise rate of inserting rejection null spot.") + parser.add_argument("--asoc_noise", type=float, default=0, help="The noise rate of inserting rejection null asoc.") + parser.add_argument( + "--negative_keep", type=float, default=1.0, help="The keep rate of negative instance for fast training." + ) + parser.add_argument("--meta_positive_rate", type=float, default=1, help="The keep rate of positive spot.") + parser.add_argument("--meta_negative", type=int, default=-1, help="Negative Schema Number in Training.") + parser.add_argument( + "--ordered_prompt", action="store_true", help="Whether to sort the spot prompt and asoc prompt or not." + ) + parser.add_argument("--do_train", action="store_true") + parser.add_argument("--do_eval", action="store_true") + parser.add_argument( + "--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Device for selecting for the training." 
+ ) + + args = parser.parse_args() + + # Sanity check + if not (args.do_train or args.do_eval): + raise ValueError('At least one of the "--do_train" or "--do_eval" should be true.') + if args.do_train and not args.output_dir: + raise ValueError("--output_dir should be given when --do_train is true.") + + return args + + +@paddle.no_grad() +def evaluate(model, tokenizer, data_loader, generate_max_length, eval_instances, sel2record, eval_match_mode): + """Evaluate single task""" + + model = model._layers if isinstance(model, paddle.DataParallel) else model + + model.eval() + + to_remove_token_list = list() + if tokenizer.eos_token: + to_remove_token_list += [tokenizer.eos_token] + if tokenizer.pad_token: + to_remove_token_list += [tokenizer.pad_token] + + def postprocess_text(x_str): + # Clean `bos` `eos` `pad` for cleaned text + for to_remove_token in to_remove_token_list: + x_str = x_str.replace(to_remove_token, "") + + return x_str.strip() + + # Generate SEL using Trained Model + all_preds = [] + for batch in data_loader: + + outputs, scores = model.generate( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"], + max_length=generate_max_length, + use_fast=True, + ) + + # Convert Token id to Token String + outputs = tokenizer.batch_decode(outputs, clean_up_tokenization_spaces=False, skip_special_tokens=False) + + preds = [postprocess_text(output) for output in outputs] + all_preds.extend(preds) + + assert len(all_preds) == len(eval_instances) + + # Parsing SEL to Record + all_records = [] + for predicted_sel, instance in zip(all_preds, eval_instances): + record = sel2record.sel2record(pred=predicted_sel, text=instance["text"], tokens=instance["tokens"]) + all_records += [record] + + task_metrics = evaluate_extraction_results(eval_instances, all_records, eval_match_mode=eval_match_mode) + + prediction = {"record": all_records, "sel": all_preds, "metric": task_metrics} + + return task_metrics, prediction + + +def eval_all_tasks(eval_tasks, model, tokenizer, generate_max_length): + """Evaluate all tasks""" + eval_overall_results = dict() + eval_overall_predictions = dict() + for task_name, eval_task in eval_tasks.items(): + # Evaulate single task + logger.info(f"Evaluate {task_name} ...") + eval_results, eval_prediction = evaluate( + model=model, + tokenizer=tokenizer, + data_loader=eval_task.dataloader, + generate_max_length=generate_max_length, + eval_instances=eval_task.val_instances, + sel2record=eval_task.sel2record, + eval_match_mode=eval_task.config.eval_match_mode, + ) + + for metric_name in eval_task.metrics: + metric_key = f"{task_name}:{metric_name}" + eval_overall_results[metric_key] = eval_results[metric_name] + + eval_overall_predictions[task_name] = eval_prediction + + sum_metric = sum(eval_overall_results.values()) + number_metric = len(eval_overall_results.values()) + eval_overall_results["all-task-ave"] = sum_metric / float(number_metric) + + return eval_overall_results, eval_overall_predictions + + +def test(args, model, tokenizer): + eval_tasks = load_eval_tasks(model=model, tokenizer=tokenizer, args=args) + + eval_overall_results, eval_predictions = eval_all_tasks( + eval_tasks=eval_tasks, + model=model, + tokenizer=tokenizer, + generate_max_length=args.max_target_length, + ) + + for line in better_print_multi(eval_overall_results).split("\n"): + logger.info(line) + + +def train(args, model, tokenizer): + + set_seed(args) + + generate_max_length = args.max_target_length + + writer = get_writer(args) + + # Distributed Setting + if 
paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + model = paddle.DataParallel(model) + + # get dataloader + train_dataloader = get_train_dataloader( + model=model, + tokenizer=tokenizer, + args=args, + ) + eval_tasks = load_eval_tasks(model=model, tokenizer=tokenizer, args=args) if args.do_eval else None + + def math_ceil(x, y): + return math.ceil(x / float(y)) + + num_update_steps_per_epoch = math_ceil(len(train_dataloader), args.gradient_accumulation_steps) + if args.logging_steps > num_update_steps_per_epoch: + args.logging_steps = num_update_steps_per_epoch + if args.max_steps > 0: + args.num_train_epochs = math_ceil(args.max_steps, num_update_steps_per_epoch) + else: + args.max_steps = args.num_train_epochs * num_update_steps_per_epoch + + # get lr_scheduler + lr_scheduler = get_scheduler( + learning_rate=args.learning_rate, + scheduler_type=args.lr_scheduler_type, + num_warmup_steps=args.warmup_steps if args.warmup_steps > 0 else args.warmup_ratio, + num_training_steps=args.max_steps, + ) + + total_batch_size = ( + args.per_device_train_batch_size * args.gradient_accumulation_steps * paddle.distributed.get_world_size() + ) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = AdamW( + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + if args.use_amp: + scaler = GradScaler(init_loss_scaling=args.scale_loss) + + logger.info("********** Running training **********") + logger.info(f" Num examples = {len(train_dataloader.dataset)}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Device train batch size = {args.per_device_train_batch_size}") + logger.info(f" Device eval batch size = {args.per_device_eval_batch_size}") + logger.info(f" Total train batch size (w. 
accumulation) = {total_batch_size}") + logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") + logger.info(f" Total optimization steps = {args.max_steps}") + + global_steps = 0 + tr_loss, logging_loss = 0.0, 0.0 + + best_score = 0.0 + + def logging_lr_loss(): + cur_lr = lr_scheduler.get_lr() + cur_loss = (tr_loss - logging_loss) / args.logging_steps + writer.add_scalar("lr", cur_lr, global_steps) + writer.add_scalar("loss", cur_loss, global_steps) + logger.info(f"global_steps {global_steps}/{args.max_steps}" f" - lr: {cur_lr:.10f} loss: {cur_loss:.10f}") + + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_dataloader): + model.train() + + with auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax"]): + outputs = model(**batch) + loss = outputs[0] / args.gradient_accumulation_steps + tr_loss += loss.item() + + if args.use_amp: + scaler.scale(loss).backward() + else: + loss.backward() + + if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: + if args.use_amp: + scaler.minimize(optimizer, loss) + else: + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + global_steps += 1 + + if args.logging_steps > 0 and global_steps % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logging_lr_loss() + logging_loss = tr_loss + + save_checkpoint(tokenizer, model, os.path.join(args.output_dir, f"ckpt_epoch{epoch}")) + if args.do_eval and paddle.distributed.get_rank() == 0: + + logger.info("********** Running evaluating **********") + logger.info(f"************* Epoch {epoch} ************") + + eval_overall_results, eval_predictions = eval_all_tasks( + eval_tasks=eval_tasks, + model=model, + tokenizer=tokenizer, + generate_max_length=generate_max_length, + ) + + for line in better_print_multi(eval_overall_results).split("\n"): + logger.info(line) + + if args.metric_for_best_model not in eval_overall_results: + raise ValueError( + f"Main metric {args.metric_for_best_model} " f"is not in {eval_overall_results.keys()}." + ) + + logger.info("********** Evaluating Done **********") + current_score = eval_overall_results[args.metric_for_best_model] + if current_score > best_score: + logger.info("********** Saving Model **********") + best_score = current_score + save_checkpoint(tokenizer, model, os.path.join(args.output_dir, "best")) + + best_ckpt_file = os.path.join(args.output_dir, "best", "model_state.pdparams") + if os.path.exists(best_ckpt_file): + logger.info(f"Load best checkpoint from {best_ckpt_file}") + model.load_dict(paddle.load(best_ckpt_file)) + + save_checkpoint(tokenizer, model, args.output_dir) + + +def main(args): + logger.info("********** Configuration Arguments **********") + for arg, value in sorted(vars(args).items()): + logger.info(f"{arg}: {value}") + logger.info("**********************************************") + + if args.do_train and args.output_dir is not None: + if os.path.exists(args.output_dir) and not args.overwrite_output_dir: + raise ValueError( + f"Output directory ({args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + else: + os.makedirs(args.output_dir, exist_ok=True) + + # Set device + paddle.set_device(args.device) + + # Prepare model and tokenizer + tokenizer = T5BertTokenizer.from_pretrained(args.model_name_or_path) + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + + if args.do_train: + train(args, model, tokenizer) + + if args.do_eval: + test(args, model, tokenizer) + + logger.info(f"Output Dir: {args.output_dir}") + + +if __name__ == "__main__": + args = parse_args() + os.makedirs("caches", exist_ok=True) + if args.logging_dir is not None: + os.makedirs(args.logging_dir, exist_ok=True) + set_logger(args) + logger.info(args) + main(args) diff --git a/examples/information_extraction/DuUIE/uie/__init__.py b/examples/information_extraction/DuUIE/uie/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..b7fa799a6a6e843079c76d8dae328db51d57c8e7 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Code for Evaluation and Sequence-to-Structure +""" diff --git a/examples/information_extraction/DuUIE/uie/evaluation/__init__.py b/examples/information_extraction/DuUIE/uie/evaluation/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..595126a161a77fa1215a2a2032dbd19d35f88a51 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/evaluation/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Code for Evaluation +""" diff --git a/examples/information_extraction/DuUIE/uie/evaluation/constants.py b/examples/information_extraction/DuUIE/uie/evaluation/constants.py new file mode 100644 index 0000000000000000000000000000000000000000..ac6eacf6a36c296c3150b6030673657c12f3030a --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/evaluation/constants.py @@ -0,0 +1,72 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+
+spot_prompt = ""
+asoc_prompt = ""
+
+type_start = ""
+type_end = ""
+text_start = ""
+span_start = ""
+null_span = ""
+null_label = ""
+
+offset_map_strategy = {
+    "closest_en": {
+        "map_strategy": "closest",
+        "de_duplicate": True,
+        "span_to_token": "space",
+    },
+    "closest_zh": {
+        "map_strategy": "closest",
+        "de_duplicate": True,
+        "span_to_token": "list",
+    },
+    "first_en": {
+        "map_strategy": "first",
+        "de_duplicate": True,
+        "span_to_token": "space",
+    },
+    "first_zh": {
+        "map_strategy": "first",
+        "de_duplicate": True,
+        "span_to_token": "list",
+    },
+    "longer_first_zh": {
+        "map_strategy": "longer_first",
+        "de_duplicate": True,
+        "span_to_token": "list",
+    },
+}
+
+
+@dataclass
+class BaseStructureMarker:
+    sent_start = ""
+    sent_end = ""
+    record_start = ""
+    record_end = ""
+    span_start = ""
+    span_end = ""
+    text_start = ""
+    source_span_start = ""
+    source_span_end = ""
+    target_span_start = ""
+    null_span = ""
+    null_label = ""
diff --git a/examples/information_extraction/DuUIE/uie/evaluation/scorer.py b/examples/information_extraction/DuUIE/uie/evaluation/scorer.py
new file mode 100644
index 0000000000000000000000000000000000000000..1fa366e4e10f6971237c3c551d6fba9cc3f57a75
--- /dev/null
+++ b/examples/information_extraction/DuUIE/uie/evaluation/scorer.py
@@ -0,0 +1,569 @@
+#!/usr/bin/env python3
+# -*- coding:utf-8 -*-
+
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
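+# 评测工具:Metric 按 (gold_num, pred_num, tp) 计数并由 compute_f1 计算 P/R/F1;
+# EntityScorer / RelationScorer / EventScorer 分别在 string 与 offset 两个层面
+# 对实体、关系、事件的抽取结果进行对齐与打分。最小使用示意:
+#     m = Metric(match_mode="normal")
+#     m.count_instance(gold_list=[("人物", "谷爱凌")], pred_list=[("人物", "谷爱凌")])
+#     m.compute_f1(prefix="string-ent-")  # 返回包含 "string-ent-F1": 100.0 等键值的字典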
+ +import sys +from collections import defaultdict +from copy import deepcopy +from typing import Dict, List + + +def tuple_offset(offset): + if isinstance(offset, tuple): + return offset + else: + return tuple(offset) + + +def warning_tp_increment(gold, pred, prefix): + sys.stderr.write(f"{prefix} TP Increment Warning, Gold Offset: {gold['offset']}\n") + sys.stderr.write(f"{prefix} TP Increment Warning, Pred Offset: {pred['offset']}\n") + sys.stderr.write(f"{prefix} TP Increment Warning, Gold String: {gold['string']}\n") + sys.stderr.write(f"{prefix} TP Increment Warning, Pred String: {pred['string']}\n") + sys.stderr.write("===============\n") + + +class Metric: + """Tuple Metric""" + + def __init__(self, verbose=False, match_mode="normal"): + self.tp = 0.0 + self.gold_num = 0.0 + self.pred_num = 0.0 + self.verbose = verbose + self.match_mode = match_mode + assert self.match_mode in {"set", "normal", "multimatch"} + + def __repr__(self) -> str: + return f"tp: {self.tp}, gold: {self.gold_num}, pred: {self.pred_num}" + + @staticmethod + def safe_div(a, b): + if b == 0.0: + return 0.0 + else: + return a / b + + def compute_f1(self, prefix=""): + tp = self.tp + pred_num = self.pred_num + gold_num = self.gold_num + p, r = self.safe_div(tp, pred_num), self.safe_div(tp, gold_num) + return { + prefix + "tp": tp, + prefix + "gold": gold_num, + prefix + "pred": pred_num, + prefix + "P": p * 100, + prefix + "R": r * 100, + prefix + "F1": self.safe_div(2 * p * r, p + r) * 100, + } + + def count_instance(self, gold_list, pred_list): + if self.match_mode == "set": + gold_list = set(gold_list) + pred_list = set(pred_list) + if self.verbose: + print("Gold:", gold_list) + print("Pred:", pred_list) + self.gold_num += len(gold_list) + self.pred_num += len(pred_list) + self.tp += len(gold_list & pred_list) + + else: + if self.verbose: + print("Gold:", gold_list) + print("Pred:", pred_list) + self.gold_num += len(gold_list) + self.pred_num += len(pred_list) + + if len(gold_list) > 0 and len(pred_list) > 0: + # guarantee length same + assert len(gold_list[0]) == len(pred_list[0]) + + dup_gold_list = deepcopy(gold_list) + for pred in pred_list: + if pred in dup_gold_list: + self.tp += 1 + if self.match_mode == "normal": + # Each Gold Instance can be matched one time + dup_gold_list.remove(pred) + + def count_batch_instance(self, batch_gold_list, batch_pred_list): + for gold_list, pred_list in zip(batch_gold_list, batch_pred_list): + self.count_instance(gold_list=gold_list, pred_list=pred_list) + + +class Scorer: + @staticmethod + def load_gold_list(gold_list, offset_key=None): + raise NotImplementedError + + @staticmethod + def load_pred_list(pred_list): + raise NotImplementedError + + @staticmethod + def eval_instance_list(gold_instance_list, pred_instance_list, verbose=False, match_mode="normal"): + raise NotImplementedError + + +class EntityScorer(Scorer): + @staticmethod + def load_gold_list(gold_list: List[List[Dict]]): + """Load gold instance to `string` and `offset` + + Args: + gold_list (List[List[Dict]]): [description] + [ + [ + {'type': 'Geo-political', 'offset': [7], 'text': 'seattle'}, + {'type': 'Location', 'offset': [11], 'text': 'lot'}, + {'type': 'Geo-political', 'offset': [14], 'text': 'city'} + ], + [...] + ] + + Returns: + List[Dict]: each instance has `offset` and `string` + [ + { + 'offset': [('Geo-political', (7,)), ('Location', (11,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Location', 'lot'), ('Geo-political', 'city')] + }, + {...}, ... 
+ ] + """ + gold_instance_list = [] + for gold in gold_list: + gold_offset = list() + gold_string = list() + for span in gold: + span_label = span["type"] + span_offset = span["offset"] + span_text = span["text"] + gold_offset += [(span_label, tuple_offset(span_offset))] + gold_string += [(span_label, span_text)] + gold_instance = { + "offset": gold_offset, + "string": gold_string, + } + gold_instance_list += [gold_instance] + return gold_instance_list + + @staticmethod + def load_pred_list(pred_list: List[Dict]): + """[summary] + + Args: + pred_list (List[Dict]): [description] + [ + { + 'offset': [['Geo-political', [7]], ['Geo-political', [14]]], + 'string': [['Geo-political', 'seattle'], ['Geo-political', 'city']] + }, + {...}, + ] + Returns: + List[Dict] : each relation instance has `offset` and `string` + [ + { + 'offset': [('Geo-political', (7,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Geo-political', 'city')] + } + ] + """ + pred_instance_list = list() + for pred in pred_list: + for offset_pred in pred["offset"]: + if not isinstance(offset_pred[1], tuple): + offset_pred[1] = tuple_offset(offset_pred[1]) + pred["offset"] = [tuple_offset(p) for p in pred["offset"]] + pred["string"] = [tuple_offset(p) for p in pred["string"]] + pred_instance_list += [pred] + return pred_instance_list + + @staticmethod + def eval_instance_list( + gold_instance_list: List[Dict], pred_instance_list: List[Dict], verbose=False, match_mode="normal" + ): + """[summary] + + Args: + gold_instance_list (List[Dict]): [description] + [ + { + 'offset': [('Geo-political', (7,)), ('Location', (11,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Location', 'lot'), ('Geo-political', 'city')] + }, + {...}, ... + ] + pred_instance_list (List[Dict]): [description] + [ + { + 'offset': [('Geo-political', (7,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Geo-political', 'city')] + } + ] + verbose (bool, optional): [description]. Defaults to False. + match_mode (string, optional): [description]. Defaults to `normal` . + + Returns: + Dict: Result of Evaluation + (offset, string) X (gold, pred, tp, P, R, F1) + """ + metrics = { + "string": Metric(verbose=verbose, match_mode=match_mode), + "offset": Metric(verbose=verbose, match_mode=match_mode), + } + for pred, gold in zip(pred_instance_list, gold_instance_list): + + pre_string_tp, pre_offset_tp = metrics["string"].tp, metrics["offset"].tp + + for eval_key in metrics: + metrics[eval_key].count_instance(gold_list=gold.get(eval_key, []), pred_list=pred.get(eval_key, [])) + + post_string_tp, post_offset_tp = metrics["string"].tp, metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Entity") + + results = dict() + for eval_key in metrics: + results.update(metrics[eval_key].compute_f1(prefix=eval_key + "-ent-")) + + return results + + +class RelationScorer(Scorer): + @staticmethod + def load_gold_list(gold_list: List[List[Dict]]): + """[summary] + + Args: + gold_list (List[List[Dict]]): List of Sentece, each sentence contains a List of Relation Dict + [ + [ + { + 'type': 'Part-whole', + 'args': [{'type': 'Location', 'offset': [11], 'text': 'lot'}, {'type': 'Geo-political', 'offset': [14], 'text': 'city'}] + }, ... 
+ ], + [...], + ] + + Returns: + List[Dict]: List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,)), ... ], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan'), ...] + } + ] + """ + gold_instance_list = [] + for gold in gold_list: + gold_instance = defaultdict(list) + for record in gold: + assert len(record["args"]) == 2 + gold_instance["offset"] += [ + ( + record["type"], + record["args"][0]["type"], + tuple_offset(record["args"][0]["offset"]), + record["args"][1]["type"], + tuple_offset(record["args"][1]["offset"]), + ) + ] + gold_instance["string"] += [ + ( + record["type"], + record["args"][0]["type"], + record["args"][0]["text"], + record["args"][1]["type"], + record["args"][1]["text"], + ) + ] + gold_instance_list += [gold_instance] + + return gold_instance_list + + @staticmethod + def load_pred_list(pred_list): + """[summary] + + Args: + pred_list (List[Dict]): List of Sentece, each sentence contains two List (offset, string) of Relation List + [ + { + 'offset': [['Part-whole', 'Geo-political', [0], 'Geo-political', [2]]], + 'string': [['Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan']], + }, ... + ] + Returns: + List[Dict]: List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,))], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan')] + }, ... + ] + """ + pred_instance_list = list() + for pred in pred_list: + for offset_pred in pred["offset"]: + + if not isinstance(offset_pred[2], tuple): + offset_pred[2] = tuple_offset(offset_pred[2]) + + if not isinstance(offset_pred[4], tuple): + offset_pred[4] = tuple_offset(offset_pred[4]) + + pred["offset"] = [tuple_offset(p) for p in pred["offset"]] + pred["string"] = [tuple_offset(p) for p in pred["string"]] + pred_instance_list += [pred] + return pred_instance_list + + @staticmethod + def eval_instance_list(gold_instance_list, pred_instance_list, verbose=False, match_mode="normal"): + """[summary] + + Args: + gold_instance_list (List[Dict]): List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,)), ... ], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan'), ...] + } + ] + pred_instance_list ([type]): List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,))], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan')] + }, ... + ] + verbose (bool, optional): Defaults to False. + match_mode (string, optional): [description]. Defaults to `normal` . 
+ + Returns: + Dict: Result of Evaluation + (offset, string) X (boundary, strict) X (gold, pred, tp, P, R, F1) + """ + # Span Boundary and Type + metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + # Span Boundary Only + boundary_metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + for pred, gold in zip(pred_instance_list, gold_instance_list): + + pre_string_tp, pre_offset_tp = metrics["string"].tp, metrics["offset"].tp + + for eval_key in metrics: + # Span Boundary and Type + metrics[eval_key].count_instance( + gold_list=gold.get(eval_key, []), + pred_list=pred.get(eval_key, []), + ) + + post_string_tp, post_offset_tp = metrics["string"].tp, metrics["offset"].tp + if verbose and (post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp): + warning_tp_increment(gold=gold, pred=pred, prefix="Relation Strict") + + pre_string_tp, pre_offset_tp = boundary_metrics["string"].tp, boundary_metrics["offset"].tp + + for eval_key in boundary_metrics: + # Span Boundary Only + boundary_metrics[eval_key].count_instance( + gold_list=[(x[0], x[2], x[4]) for x in gold.get(eval_key, [])], + pred_list=[(x[0], x[2], x[4]) for x in pred.get(eval_key, [])], + ) + post_string_tp, post_offset_tp = boundary_metrics["string"].tp, boundary_metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Relation Boundary") + + results = dict() + for eval_key in metrics: + results.update(metrics[eval_key].compute_f1(prefix=eval_key + "-rel-strict-")) + for eval_key in boundary_metrics: + results.update(boundary_metrics[eval_key].compute_f1(prefix=eval_key + "-rel-boundary-")) + return results + + +class EventScorer(Scorer): + @staticmethod + def load_gold_list(gold_list): + """[summary] + + Args: + gold_list (List[List[Dict]]): List of Sentece, each sentence contains a List of Event Dict + [ + [ # Sentance + { # Event Record + 'type': 'Die', + 'offset': [16], + 'text': 'shot', + 'args': [ + {'type': 'Victim', 'offset': [17], 'text': 'himself'}, + {'type': 'Agent', 'offset': [5, 6], 'text': 'John Joseph'}, + {'type': 'Place', 'offset': [23], 'text': 'court'} + ] + }, + ] + ] + + Returns: + List[Dict]: List of Sentece, each sentence contains Four List of Event Tuple + [ + { + 'offset_trigger': [('Die', (16,)), ('Convict', (30,))], + 'string_trigger': [('Die', 'shot'), ('Convict', 'convicted')], + 'offset_role': [('Die', 'Victim', (17,)), ('Die', 'Agent', (5, 6)), ('Die', 'Place', (23,))], + 'string_role': [('Die', 'Victim', 'himself'), ('Die', 'Agent', 'John Joseph'), ('Die', 'Place', 'court')] + }, + ... 
+ ] + """ + gold_instance_list = [] + for gold in gold_list: + gold_instance = defaultdict(list) + for record in gold: + gold_instance["offset_trigger"] += [(record["type"], tuple_offset(record["offset"]))] + gold_instance["string_trigger"] += [(record["type"], record["text"])] + for arg in record["args"]: + gold_instance["offset_role"] += [(record["type"], arg["type"], tuple_offset(arg["offset"]))] + gold_instance["string_role"] += [(record["type"], arg["type"], arg["text"])] + gold_instance_list += [gold_instance] + return gold_instance_list + + @staticmethod + def load_pred_list(pred_list): + """[summary] + + Args: + pred_list (List[Dict]): List of Sentece, each sentence contains two List (offset, string) of Event List + [ + { + 'offset': [{'type': 'Attack', 'roles': [['Attacker', [5, 6]], ['Place', [23]], ['Target', [17]]], 'trigger': [16]}], + 'string': [{'roles': [['Attacker', 'John Joseph'], ['Place', 'court'], ['Target', 'himself']], 'type': 'Attack', 'trigger': 'shot'}], + }, + ... + ] + Returns: + List[Dict]: List of Sentece, each sentence contains four List (offset, string) X (trigger, role) of Event List + [ + { + 'offset_trigger': [('Attack', (16,))], + 'offset_role': [('Attack', 'Attacker', (5, 6)), ('Attack', 'Place', (23,)), ('Attack', 'Target', (17,))], + 'string_trigger': [('Attack', 'shot')], + 'string_role': [('Attack', 'Attacker', 'John Joseph'), ('Attack', 'Place', 'court'), ('Attack', 'Target', 'himself')], + }, + ... + ] + """ + pred_instance_list = list() + for pred in pred_list: + pred_instance = defaultdict(list) + + for offset_pred in pred["offset"]: + event_type, trigger_offset = offset_pred["type"], tuple_offset(offset_pred["trigger"]) + pred_instance["offset_trigger"] += [(event_type, trigger_offset)] + for role_type, role_offset in offset_pred["roles"]: + pred_instance["offset_role"] += [(event_type, role_type, tuple_offset(role_offset))] + + for string_pred in pred["string"]: + event_type, trigger_string = string_pred["type"], string_pred["trigger"] + pred_instance["string_trigger"] += [(event_type, trigger_string)] + for role_type, role_string in string_pred["roles"]: + pred_instance["string_role"] += [(event_type, role_type, role_string)] + pred_instance_list += [pred_instance] + return pred_instance_list + + @staticmethod + def eval_instance_list(gold_instance_list, pred_instance_list, verbose=False, match_mode="normal"): + """[summary] + + Args: + gold_instance_list (List[Dict]): List of Sentece, each sentence contains Four List of Event Tuple + [ + { + 'offset_trigger': [('Die', (16,)), ('Convict', (30,))], + 'string_trigger': [('Die', 'shot'), ('Convict', 'convicted')], + 'offset_role': [('Die', 'Victim', (17,)), ('Die', 'Agent', (5, 6)), ('Die', 'Place', (23,))], + 'string_role': [('Die', 'Victim', 'himself'), ('Die', 'Agent', 'John Joseph'), ('Die', 'Place', 'court')] + }, + ... + ] + pred_instance_list (List[Dict]): List of Sentece, each sentence contains four List (offset, string) X (trigger, role) of Event List + [ + { + 'offset_trigger': [('Attack', (16,))], + 'offset_role': [('Attack', 'Attacker', (5, 6)), ('Attack', 'Place', (23,)), ('Attack', 'Target', (17,))], + 'string_trigger': [('Attack', 'shot')], + 'string_role': [('Attack', 'Attacker', 'John Joseph'), ('Attack', 'Place', 'court'), ('Attack', 'Target', 'himself')], + }, + ... + ] + verbose (bool, optional): [description]. Defaults to False. + match_mode (string, optional): [description]. Defaults to `normal`. 
+ + Returns: + Dict: Result of Evaluation + (offset, string) X (trigger, role) X (gold, pred, tp, P, R, F1) + """ + trigger_metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + role_metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + + for pred, gold in zip(pred_instance_list, gold_instance_list): + + pre_string_tp, pre_offset_tp = trigger_metrics["string"].tp, trigger_metrics["offset"].tp + + for eval_key in trigger_metrics: + trigger_metrics[eval_key].count_instance( + gold_list=gold.get(eval_key + "_trigger", []), pred_list=pred.get(eval_key + "_trigger", []) + ) + + post_string_tp, post_offset_tp = trigger_metrics["string"].tp, trigger_metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Trigger") + + pre_string_tp, pre_offset_tp = role_metrics["string"].tp, role_metrics["offset"].tp + + for eval_key in role_metrics: + role_metrics[eval_key].count_instance( + gold_list=gold.get(eval_key + "_role", []), pred_list=pred.get(eval_key + "_role", []) + ) + + post_string_tp, post_offset_tp = role_metrics["string"].tp, role_metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Role") + + results = dict() + for eval_key in trigger_metrics: + results.update(trigger_metrics[eval_key].compute_f1(prefix=f"{eval_key}-evt-trigger-")) + for eval_key in role_metrics: + results.update(role_metrics[eval_key].compute_f1(prefix=f"{eval_key}-evt-role-")) + + return results diff --git a/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py b/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py new file mode 100644 index 0000000000000000000000000000000000000000..77885638d5080f883c72332088ce26381ac46a5f --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py @@ -0,0 +1,1069 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from typing import Tuple, List, Dict +from collections import defaultdict, OrderedDict, Counter +import os +import numpy +import logging +import re +import json +from nltk.tree import ParentedTree +from uie.evaluation.constants import span_start, type_start, type_end, null_span, offset_map_strategy +from uie.evaluation.scorer import EntityScorer, RelationScorer, EventScorer + +logger = logging.getLogger("__main__") + +left_bracket = "【" +right_bracket = "】" +brackets = left_bracket + right_bracket + +split_bracket = re.compile(r"") + + +def proprocessing_graph_record(graph, schema_dict): + """Mapping generated spot-asoc result to Entity/Relation/Event""" + records = { + "entity": list(), + "relation": list(), + "event": list(), + } + + entity_dict = OrderedDict() + + # 根据不同任务的 Schema 将不同的 Spot 对应到不同抽取结果: Entity/Event + # Mapping generated spot result to Entity/Event + for record in graph["pred_record"]: + + if record["type"] in schema_dict["entity"].type_list: + records["entity"] += [{"text": record["spot"], "type": record["type"]}] + entity_dict[record["spot"]] = record["type"] + + elif record["type"] in schema_dict["event"].type_list: + records["event"] += [{"trigger": record["spot"], "type": record["type"], "roles": record["asocs"]}] + + else: + print("Type `%s` invalid." % record["type"]) + + # 根据不同任务的 Schema 将不同的 Asoc 对应到不同抽取结果: Relation/Argument + # Mapping generated asoc result to Relation/Argument + for record in graph["pred_record"]: + if record["type"] in schema_dict["entity"].type_list: + for role in record["asocs"]: + records["relation"] += [ + { + "type": role[0], + "roles": [ + (record["type"], record["spot"]), + (entity_dict.get(role[1], record["type"]), role[1]), + ], + } + ] + + if len(entity_dict) > 0: + for record in records["event"]: + if record["type"] in schema_dict["event"].type_list: + new_role_list = list() + for role in record["roles"]: + if role[1] in entity_dict: + new_role_list += [role] + record["roles"] = new_role_list + + return records + + +def match_sublist(the_list, to_match): + """Find sublist in the whole list + + Args: + the_list (list(str)): the whole list + - [1, 2, 3, 4, 5, 6, 1, 2, 4, 5] + to_match (list(str)): the sublist + - [1, 2] + + Returns: + list(tuple): matched (start, end) position list + - [(0, 1), (6, 7)] + """ + len_to_match = len(to_match) + matched_list = list() + for index in range(len(the_list) - len_to_match + 1): + if to_match == the_list[index : index + len_to_match]: + matched_list += [(index, index + len_to_match - 1)] + return matched_list + + +def check_overlap(x, y): + """Check two span whether overlap or not + + Args: + x (Tuple[int, int]): start, end including position of span x + y (Tuple[int, int]): start, end including position of span y + + x: (3, 4), y: (4, 5) -> True + x: (3, 3), y: (4, 5) -> False + + Returns: + bool: two span whether overlap or not + """ + # x start > y end or y start > x end, no overlap + if x[0] > y[1] or y[0] > x[1]: + return False + else: + return True + + +def get_index_tuple(matched: Tuple[int, int]): + """Convert start, end (inlcuding) tuple to index list + + Args: + matched (Tuple[int, int]): start and end position tuple + + (3, 4) -> [3, 4] + (3, 3) -> [3] + + Returns: + List[int]: List of index + """ + return tuple(range(matched[0], matched[1] + 1)) + + +def span_to_token(text, span_to_token_strategy="space"): + """Convert text span string to token list + + Args: + text (string): text span string + span_to_token_strategy (str, optional): Defaults to 'space'. 
+ - space: split text to tokens using space + - list: split text to toekns as list + + Raises: + NotImplementedError: No implemented span_to_token_strategy + + Returns: + list(str): list of token + """ + if span_to_token_strategy == "space": + return text.split(" ") + elif span_to_token_strategy == "list": + return list(text) + else: + raise NotImplementedError(f"The span to token strategy {span_to_token_strategy} is not implemented.") + + +class MapConfig: + """Config of mapping string to offset""" + + def __init__(self, map_strategy: str = "first", de_duplicate: bool = True, span_to_token: str = "space") -> None: + self.map_strategy = map_strategy + self.de_duplicate = de_duplicate + self.span_to_token = span_to_token + + def __repr__(self) -> str: + repr_list = [ + f"map_strategy: {self.map_strategy}", + f"de_duplicate: {self.de_duplicate}", + f"span_to_token: {self.span_to_token}", + ] + return ", ".join(repr_list) + + @staticmethod + def load_by_name(config_name): + offset_map = offset_map_strategy[config_name] + return MapConfig( + map_strategy=offset_map["map_strategy"], + de_duplicate=offset_map["de_duplicate"], + span_to_token=offset_map["span_to_token"], + ) + + +class RecordSchema: + """Record Schema Class + type_list: list of spot name + role_list: list of asoc name + type_role_dict: the mapping of spot-to-asoc + """ + + def __init__(self, type_list, role_list, type_role_dict): + self.type_list = type_list + self.role_list = role_list + self.type_role_dict = type_role_dict + + def __repr__(self) -> str: + repr_list = [f"Type: {self.type_list}\n", f"Role: {self.role_list}\n", f"Map: {self.type_role_dict}"] + return "\n".join(repr_list) + + @staticmethod + def get_empty_schema(): + return RecordSchema(type_list=list(), role_list=list(), type_role_dict=dict()) + + @staticmethod + def read_from_file(filename): + lines = open(filename, encoding="utf8").readlines() + type_list = json.loads(lines[0]) + role_list = json.loads(lines[1]) + type_role_dict = json.loads(lines[2]) + return RecordSchema(type_list, role_list, type_role_dict) + + def write_to_file(self, filename): + with open(filename, "w", encoding="utf8") as output: + output.write(json.dumps(self.type_list, ensure_ascii=False) + "\n") + output.write(json.dumps(self.role_list, ensure_ascii=False) + "\n") + output.write(json.dumps(self.type_role_dict, ensure_ascii=False) + "\n") + + +def merge_schema(schema_list: List[RecordSchema]): + """Merge list of schema + + Args: + schema_list (List[RecordSchema]): list of record schema + + Returns: + RecordSchema: A merged schema + """ + type_set = set() + role_set = set() + type_role_dict = defaultdict(list) + + for schema in schema_list: + + for type_name in schema.type_list: + type_set.add(type_name) + + for role_name in schema.role_list: + role_set.add(role_name) + + for type_name in schema.type_role_dict: + type_role_dict[type_name] += schema.type_role_dict[type_name] + + for type_name in type_role_dict: + type_role_dict[type_name] = list(set(type_role_dict[type_name])) + + return RecordSchema( + type_list=list(type_set), + role_list=list(role_set), + type_role_dict=type_role_dict, + ) + + +class Record: + """Record for converting generated string to information record""" + + def __init__(self, map_config) -> None: + self._map_config = map_config + + def span_to_token(self, text): + return span_to_token(text, span_to_token_strategy=self._map_config.span_to_token) + + +class EntityRecord(Record): + """Record for converting generated string to information record """ + + @staticmethod 
+ def to_string(pred_record_list): + entity_list = list() + for pred_record in pred_record_list: + record_type, record_text = pred_record["type"], pred_record["text"] + if record_text == "": + logger.warning(f"Empty Extraction {pred_record}") + continue + entity_list += [(record_type, record_text)] + return entity_list + + def to_offset(self, instance, tokens): + map_strategy_dict = { + "first": self.record_to_offset_first_role, + "closest": self.record_to_offset_closest_role, + "longer_first": self.record_to_offset_longer_first, + } + + if self._map_config.map_strategy in map_strategy_dict: + map_function = map_strategy_dict[self._map_config.map_strategy] + return map_function( + instance=instance, + token_list=tokens, + ) + else: + raise NotImplementedError( + f"The map strategy {self._map_config.map_strategy} in {self.__class__} is not implemented." + ) + + def record_to_offset_closest_role( + self, + instance, + token_list, + ): + """ + Find Role's offset using closest matched with trigger word. + :param instance: + :return: + """ + return self.record_to_offset_first_role(instance, token_list=token_list) + + def record_to_offset_first_role(self, instance, token_list): + """ + Find Entity's offset using first matched in the sentence. + :param instance: + :return: + """ + entity_list = list() + + entity_matched_set = set() + for pred_record in instance: + record_type, record_text = pred_record["type"], pred_record["text"] + if record_text == "": + logger.warning(f"Empty Extraction {pred_record}") + continue + matched_list = match_sublist(token_list, self.span_to_token(record_text)) + for matched in matched_list: + if (record_type, matched) not in entity_matched_set: + entity_list += [(record_type, tuple(range(matched[0], matched[1] + 1)))] + entity_matched_set.add((record_type, matched)) + break + + return entity_list + + def record_to_offset_longer_first(self, instance, token_list): + """ + Find Entity's offset using first matched in the sentence. + :param instance: + :return: + """ + entity_list = list() + + entity_matched_set = set() + for x in instance: + x["length"] = len(x["text"]) + instance.sort(reverse=True, key=lambda x: x["length"]) + + for pred_record in instance: + record_type, record_text = pred_record["type"], pred_record["text"] + if record_text == "": + logger.warning(f"Empty Extraction {pred_record}") + continue + + matched_list = match_sublist(token_list, self.span_to_token(record_text)) + for matched in matched_list: + flag = False + for _, g in entity_matched_set: + if check_overlap(g, matched): + flag = True + if flag: + continue + + if (record_type, matched) not in entity_matched_set: + entity_list += [(record_type, tuple(range(matched[0], matched[1] + 1)))] + entity_matched_set.add((record_type, matched)) + break + + return entity_list + + +class RelationRecord(Record): + """Record for converting generated string to information record + + """ + + def to_offset(self, instance, tokens): + map_strategy_dict = { + "first": self.record_to_offset_first_role, + "closest": self.record_to_offset_closest_role, + "longer_first": self.record_to_offset_closest_role, + } + if self._map_config.map_strategy in map_strategy_dict: + map_function = map_strategy_dict[self._map_config.map_strategy] + return map_function( + instance=instance, + token_list=tokens, + ) + else: + raise NotImplementedError( + f"The map strategy {self._map_config.map_strategy} in {self.__class__} is not implemented." 
+ ) + + @staticmethod + def to_string(instance): + relation_list = list() + for record in instance: + relation_type = record["type"] + relation = [relation_type] + if len(record["roles"]) < 2: + continue + for role_type, text_str in record["roles"][:2]: + relation += [role_type, text_str] + relation_list += [tuple(relation)] + return relation_list + + def record_to_offset_first_role(self, instance, token_list): + """ + Find Role's offset using first matched in the sentence. + :param instance: + :return: + """ + relation_list = list() + + for record in instance: + relation_type = record["type"] + + if len(record["roles"]) < 2: + continue + + relation = [relation_type] + for role_type, text_str in record["roles"][:2]: + matched_list = match_sublist(token_list, self.span_to_token(text_str)) + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (text_str, token_list)) + break + relation += [role_type, get_index_tuple(matched_list[0])] + if len(relation) != 5 or (self._map_config.de_duplicate and tuple(relation) in relation_list): + continue + relation_list += [tuple(relation)] + + return relation_list + + def record_to_offset_closest_role(self, instance, token_list): + """ + Find Role's offset using closest matched with trigger word. + :param instance: + :return: + """ + relation_list = list() + + for record in instance: + relation_type = record["type"] + + if len(record["roles"]) < 2: + continue + + arg1_type, arg1_text = record["roles"][0] + arg2_type, arg2_text = record["roles"][1] + arg1_matched_list = match_sublist(token_list, self.span_to_token(arg1_text)) + arg2_matched_list = match_sublist(token_list, self.span_to_token(arg2_text)) + + if len(arg1_matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (arg1_text, token_list)) + break + if len(arg2_matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (arg2_text, token_list)) + break + + distance_tuple = list() + for arg1_match in arg1_matched_list: + for arg2_match in arg2_matched_list: + distance = abs(arg1_match[0] - arg2_match[0]) + distance_tuple += [(distance, arg1_match, arg2_match)] + distance_tuple.sort() + + relation = [ + relation_type, + arg1_type, + get_index_tuple(distance_tuple[0][1]), + arg2_type, + get_index_tuple(distance_tuple[0][2]), + ] + if self._map_config.de_duplicate and tuple(relation) in relation_list: + continue + relation_list += [tuple(relation)] + + return relation_list + + +class EventRecord(Record): + """Record for converting generated string to information record in predicate-arguments + { + type: pred_type, + trigger: predicate_span, + args: [(arg_type, arg_span), ...] + } + """ + + def to_offset(self, instance, tokens): + map_strategy_dict = { + "first": self.record_to_offset_first_role, + "closest": self.record_to_offset_closest_role, + "longer_first": self.record_to_offset_closest_role, + } + if self._map_config.map_strategy in map_strategy_dict: + map_function = map_strategy_dict[self._map_config.map_strategy] + return map_function( + instance=instance, + token_list=tokens, + ) + else: + raise NotImplementedError( + f"The map strategy {self._map_config.map_strategy} in {self.__class__} is not implemented." + ) + + @staticmethod + def to_string(instance): + """ + {'type': 'Justice:Appeal', + 'trigger': 'appeal', + 'roles': [ + ('Adjudicator', 'court'), + ('Plaintiff', 'Anwar') + ], } + """ + return instance + + def record_to_offset_first_role(self, instance, token_list): + """ + Find Role's offset using first matched in the sentence. 
+ """ + record_list = list() + + trigger_matched_set = set() + for record in instance: + event_type = record["type"] + trigger = record["trigger"] + matched_list = match_sublist(token_list, self.span_to_token(trigger)) + + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (trigger, token_list)) + continue + + trigger_offset = None + for matched in matched_list: + if matched not in trigger_matched_set: + trigger_offset = get_index_tuple(matched) + trigger_matched_set.add(matched) + break + + # No trigger word, skip the record + if trigger_offset is None: + break + + pred_record = {"type": event_type, "roles": [], "trigger": trigger_offset} + + for role_type, text_str in record["roles"]: + matched_list = match_sublist(token_list, self.span_to_token(text_str)) + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (text_str, token_list)) + continue + pred_record["roles"] += [(role_type, get_index_tuple(matched_list[0]))] + + record_list += [pred_record] + + return record_list + + def record_to_offset_closest_role(self, instance, token_list): + """ + Find Role's offset using closest matched with trigger word. + """ + record_list = list() + + trigger_matched_set = set() + for record in instance: + event_type = record["type"] + trigger = record["trigger"] + matched_list = match_sublist(token_list, self.span_to_token(trigger)) + + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (trigger, token_list)) + continue + + trigger_offset = None + for matched in matched_list: + if matched not in trigger_matched_set: + trigger_offset = get_index_tuple(matched) + trigger_matched_set.add(matched) + break + + # No trigger word, skip the record + if trigger_offset is None or len(trigger_offset) == 0: + break + + pred_record = {"type": event_type, "roles": [], "trigger": trigger_offset} + + for role_type, text_str in record["roles"]: + matched_list = match_sublist(token_list, self.span_to_token(text_str)) + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (text_str, token_list)) + else: + abs_distances = [abs(match[0] - trigger_offset[0]) for match in matched_list] + closest_index = numpy.argmin(abs_distances) + pred_record["roles"] += [(role_type, get_index_tuple(matched_list[closest_index]))] + + record_list += [pred_record] + return record_list + + +class SEL2Record: + """Converting sel expression to information records""" + + def __init__(self, schema_dict, map_config: MapConfig, tokenizer=None) -> None: + self._schema_dict = schema_dict + self._predict_parser = SpotAsocPredictParser( + record_schema=schema_dict["record"], + tokenizer=tokenizer, + ) + self._map_config = map_config + self._tokenizer = tokenizer + + def __repr__(self) -> str: + return f"## {self._map_config}" + + def sel2record(self, pred, text, tokens): + """Converting sel expression to information records + + Args: + pred (str): sel expression + text (str): input text + tokens (list(str)): token list + + Returns: + _type_: _description_ + """ + # Parsing generated SEL to String-level Record + # 将生成的结构表达式解析成 String 级别的 Record + well_formed_list, counter = self._predict_parser.decode( + gold_list=[], + pred_list=[pred], + text_list=[text], + ) + + # Convert String-level Record to Entity/Relation/Event + # 将抽取的 Spot-Asoc Record 结构 + # 根据不同的 Schema 转换成 Entity/Relation/Event 结果 + pred_records = proprocessing_graph_record(well_formed_list[0], self._schema_dict) + + task_record_map = { + "entity": EntityRecord, + "relation": 
RelationRecord, + "event": EventRecord, + } + + parsed_record = defaultdict(dict) + # Mapping String-level record to Offset-level record + # 将 String 级别的 Record 回标成 Offset 级别的 Record + for task in task_record_map: + record_map = task_record_map[task]( + map_config=self._map_config, + ) + + parsed_record[task]["offset"] = record_map.to_offset( + instance=pred_records.get(task, []), + tokens=tokens, + ) + + parsed_record[task]["string"] = record_map.to_string( + pred_records.get(task, []), + ) + return parsed_record + + @staticmethod + def load_schema_dict(schema_folder): + schema_dict = dict() + for schema_key in ["record", "entity", "relation", "event"]: + schema_filename = os.path.join(schema_folder, f"{schema_key}.schema") + if os.path.exists(schema_filename): + schema_dict[schema_key] = RecordSchema.read_from_file(schema_filename) + else: + logger.warning(f"{schema_filename} is empty, ignore.") + schema_dict[schema_key] = RecordSchema.get_empty_schema() + return schema_dict + + +def fix_unk_from_text(span, text, unk="", tokenizer=None): + """Fixing unknown tokens `unk` in the generated expression + + Args: + span (str): generated span + text (str): raw text + unk (str, optional): symbol of unk token + tokenizer (Tokenizer, optional): the tokenizer + + Returns: + fixed span + """ + if tokenizer is not None: + return fix_unk_from_text_with_tokenizer(span, text, unk=unk, tokenizer=tokenizer) + else: + return fix_unk_from_text_without_tokenizer(span, text, unk=unk) + + +def fix_unk_from_text_without_tokenizer(span, text, unk=""): + """ + Find span from the text to fix unk in the generated span + Example: + span = " colo e Bengo" + text = "Angola International Airport is located at Ícolo e Bengo" + fixed_span = "Ícolo e Bengo" + """ + if unk not in span: + return span + + def clean_wildcard(x): + sp = ".*?()[]+" + return re.sub("(" + "|".join([f"\\{s}" for s in sp]) + ")", "\\\\\g<1>", x) + + match = r"\s*[^,?。\s]+\s*".join([clean_wildcard(item.strip()) for item in span.split(unk)]) + + if len(match) > 100: + # Too long regular expression may lead re problem + return span + + result = re.search(match, text) + + if not result: + return span + return result.group().strip() + + +def fix_unk_from_text_with_tokenizer(span, text, tokenizer, unk=""): + unk_id = tokenizer.vocab.to_indices(unk) + tokenized_span = tokenizer.encode(span, add_special_tokens=False, return_token_type_ids=None)["input_ids"] + tokenized_text = tokenizer.encode(text, add_special_tokens=False, return_token_type_ids=None)["input_ids"] + + matched = match_sublist(tokenized_text, tokenized_span) + if len(matched) == 0: + return span + + if tokenized_span[0] == unk_id and matched[0][0] > 0: + previous_token = [tokenized_text[matched[0][0] - 1]] + pre_strip = tokenizer.vocab.to_tokens(previous_token[0]) + else: + previous_token = [] + pre_strip = "" + + if tokenized_span[-1] == unk_id and matched[0][1] < len(tokenized_text) - 1: + next_token = [tokenized_text[matched[0][1] + 1]] + next_strip = tokenizer.vocab.to_tokens(next_token[0]) + else: + next_token = [] + next_strip = "" + + extend_span = tokenized_span + if len(previous_token) > 0: + extend_span = previous_token + extend_span + if len(next_token) > 0: + extend_span = extend_span + next_token + + extend_span = tokenizer.decode(extend_span) + fixed_span = fix_unk_from_text_without_tokenizer(extend_span, text, unk) + return fixed_span.rstrip(next_strip).lstrip(pre_strip) + + +def add_space(text): + """ + add space between special token + """ + new_text_list = list() + for item 
in zip(split_bracket.findall(text), split_bracket.split(text)[1:]): + new_text_list += item + return " ".join(new_text_list) + + +def convert_bracket(text): + text = add_space(text) + for start in [type_start]: + text = text.replace(start, left_bracket) + for end in [type_end]: + text = text.replace(end, right_bracket) + return text + + +def find_bracket_num(tree_str): + """ + Count Bracket Number (num_left - num_right), 0 indicates num_left = num_right + """ + count = 0 + for char in tree_str: + if char == left_bracket: + count += 1 + elif char == right_bracket: + count -= 1 + else: + pass + return count + + +def check_well_form(tree_str): + return find_bracket_num(tree_str) == 0 + + +def clean_text(tree_str): + count = 0 + sum_count = 0 + + tree_str_list = tree_str.split() + + for index, char in enumerate(tree_str_list): + if char == left_bracket: + count += 1 + sum_count += 1 + elif char == right_bracket: + count -= 1 + sum_count += 1 + else: + pass + if count == 0 and sum_count > 0: + return " ".join(tree_str_list[: index + 1]) + return " ".join(tree_str_list) + + +def resplit_label_span(label, span, split_symbol=span_start): + label_span = label + " " + span + + if split_symbol in label_span: + splited_label_span = label_span.split(split_symbol) + if len(splited_label_span) == 2: + return splited_label_span[0].strip(), splited_label_span[1].strip() + + return label, span + + +def add_bracket(tree_str): + """add right bracket to fix ill-formed expression""" + tree_str_list = tree_str.split() + bracket_num = find_bracket_num(tree_str_list) + tree_str_list += [right_bracket] * bracket_num + return " ".join(tree_str_list) + + +def get_tree_str(tree): + """get str from sel tree""" + str_list = list() + for element in tree: + if isinstance(element, str): + str_list += [element] + return " ".join(str_list) + + +def rewrite_label_span(label, span, label_set=None, text=None, tokenizer=None): + + # Invalid Type + if label_set and label not in label_set: + logger.debug("Invalid Label: %s" % label) + return None, None + + # Fix unk using Text + if text is not None and "" in span: + span = fix_unk_from_text(span, text, unk="", tokenizer=tokenizer) + + # Invalid Text Span + if text is not None and span not in text: + logger.debug("Invalid Text Span: %s\n%s\n" % (span, text)) + return None, None + + return label, span + + +def convert_spot_asoc(spot_asoc_instance, structure_maker): + """Convert spot asoc instance to target string""" + spot_instance_str_rep_list = list() + for spot in spot_asoc_instance: + spot_str_rep = [ + spot["label"], + structure_maker.target_span_start, + spot["span"], + ] + for asoc_label, asoc_span in spot.get("asoc", list()): + asoc_str_rep = [ + structure_maker.span_start, + asoc_label, + structure_maker.target_span_start, + asoc_span, + structure_maker.span_end, + ] + spot_str_rep += [" ".join(asoc_str_rep)] + spot_instance_str_rep_list += [ + " ".join( + [ + structure_maker.record_start, + " ".join(spot_str_rep), + structure_maker.record_end, + ] + ) + ] + target_text = " ".join( + [ + structure_maker.sent_start, + " ".join(spot_instance_str_rep_list), + structure_maker.sent_end, + ] + ) + return target_text + + +class SpotAsocPredictParser: + """Parser for converting generated sel to extraction record""" + + def __init__(self, record_schema: RecordSchema = None, tokenizer=None): + self.spot_set = set(record_schema.type_list) if record_schema else None + self.asoc_set = set(record_schema.role_list) if record_schema else None + self.tokenizer = tokenizer + + def decode( + 
self, + gold_list, + pred_list, + text_list=None, + raw_list=None, + ) -> Tuple[List[Dict], Counter]: + counter = Counter() + well_formed_list = [] + + if gold_list is None or len(gold_list) == 0: + gold_list = ["%s%s" % (type_start, type_end)] * len(pred_list) + + if text_list is None: + text_list = [None] * len(pred_list) + + if raw_list is None: + raw_list = [None] * len(pred_list) + + for gold, pred, text, raw_data in zip(gold_list, pred_list, text_list, raw_list): + gold = convert_bracket(gold) + pred = convert_bracket(pred) + + pred = clean_text(pred) + + try: + gold_tree = ParentedTree.fromstring(gold, brackets=brackets) + except ValueError: + logger.warning(f"Ill gold: {gold}") + logger.warning(f"Fix gold: {add_bracket(gold)}") + gold_tree = ParentedTree.fromstring(add_bracket(gold), brackets=brackets) + counter.update(["gold_tree add_bracket"]) + + instance = {"gold": gold, "pred": pred, "gold_tree": gold_tree, "text": text, "raw_data": raw_data} + + counter.update(["gold_tree" for _ in gold_tree]) + + _, _, instance["gold_record"] = self.get_record_list(sel_tree=instance["gold_tree"], text=instance["text"]) + + try: + if not check_well_form(pred): + pred = add_bracket(pred) + counter.update(["fixed"]) + + pred_tree = ParentedTree.fromstring(pred, brackets=brackets) + counter.update(["pred_tree" for _ in pred_tree]) + + instance["pred_tree"] = pred_tree + counter.update(["well-formed"]) + + except ValueError: + counter.update(["ill-formed"]) + logger.debug(f"ill-formed: {pred}") + instance["pred_tree"] = ParentedTree.fromstring(left_bracket + right_bracket, brackets=brackets) + + _, _, instance["pred_record"] = self.get_record_list(sel_tree=instance["pred_tree"], text=instance["text"]) + + well_formed_list += [instance] + + return well_formed_list, counter + + def get_record_list(self, sel_tree, text=None): + """Convert single sel expression to extraction records + + Args: + sel_tree (Tree): sel tree + text (str, optional): _description_. Defaults to None. 
+ + Returns: + spot_list: list of (spot_type: str, spot_span: str) + asoc_list: list of (spot_type: str, asoc_label: str, asoc_text: str) + record_list: list of {'asocs': list(), 'type': spot_type, 'spot': spot_text} + """ + spot_list = list() + asoc_list = list() + record_list = list() + + for spot_tree in sel_tree: + + # Drop incomplete tree + if isinstance(spot_tree, str) or len(spot_tree) == 0: + continue + + spot_type = spot_tree.label() + spot_text = get_tree_str(spot_tree) + spot_type, spot_text = resplit_label_span(spot_type, spot_text) + spot_type, spot_text = rewrite_label_span( + label=spot_type, + span=spot_text, + label_set=self.spot_set, + text=text, + tokenizer=self.tokenizer, + ) + + # Drop empty generated span + if spot_text is None or spot_text == null_span: + continue + # Drop empty generated type + if spot_type is None: + continue + # Drop invalid spot type + if self.spot_set is not None and spot_type not in self.spot_set: + continue + + record = {"asocs": list(), "type": spot_type, "spot": spot_text} + + for asoc_tree in spot_tree: + if isinstance(asoc_tree, str) or len(asoc_tree) < 1: + continue + + asoc_label = asoc_tree.label() + asoc_text = get_tree_str(asoc_tree) + asoc_label, asoc_text = resplit_label_span(asoc_label, asoc_text) + asoc_label, asoc_text = rewrite_label_span( + label=asoc_label, + span=asoc_text, + label_set=self.asoc_set, + text=text, + tokenizer=self.tokenizer, + ) + + # Drop empty generated span + if asoc_text is None or asoc_text == null_span: + continue + # Drop empty generated type + if asoc_label is None: + continue + # Drop invalid asoc type + if self.asoc_set is not None and asoc_label not in self.asoc_set: + continue + + asoc_list += [(spot_type, asoc_label, asoc_text)] + record["asocs"] += [(asoc_label, asoc_text)] + + spot_list += [(spot_type, spot_text)] + record_list += [record] + + return spot_list, asoc_list, record_list + + +def evaluate_extraction_results(gold_instances, pred_instances, eval_match_mode="normal"): + task_scorer_dict = {"entity": EntityScorer, "relation": RelationScorer, "event": EventScorer} + # Score Record + results = dict() + for task, scorer in task_scorer_dict.items(): + + gold_list = [x[task] for x in gold_instances] + pred_list = [x[task] for x in pred_instances] + + gold_instance_list = scorer.load_gold_list(gold_list) + pred_instance_list = scorer.load_pred_list(pred_list) + sub_results = scorer.eval_instance_list( + gold_instance_list=gold_instance_list, + pred_instance_list=pred_instance_list, + verbose=False, + match_mode=eval_match_mode, + ) + results.update(sub_results) + return results diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py b/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..bc13950c0b3bc029002a2fa67ed49e7eff7cd6d1 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" +Code for Sequence-to-Structure +""" diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py b/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py new file mode 100644 index 0000000000000000000000000000000000000000..9a602aee4c664e319cc4c8bfe066b9d9ec991416 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py @@ -0,0 +1,510 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import logging +import math +import random +from collections import OrderedDict +from dataclasses import dataclass +from typing import Optional + +import numpy as np +import paddle +from uie.evaluation.constants import ( + BaseStructureMarker, + asoc_prompt, + null_span, + spot_prompt, + text_start, +) +from uie.evaluation.sel2record import RecordSchema, convert_spot_asoc + +from paddlenlp.data import Pad + +logger = logging.getLogger("__main__") + + +@dataclass +class SpotAsocNoiser: + spot_noise_ratio: float = 0.0 # Ratio of insert spot not in raw record + asoc_noise_ratio: float = 0.0 # Ratio of insert asoc not in raw recordf + + def random_insert_spot(self, spot_asoc, spot_label_list=None): + """Insert negative spot in random, sample negative spot from spot_label_list""" + # If no negative spot_label_list, skip insertion of spot null span + if spot_label_list is None or len(spot_label_list) == 0: + return spot_asoc + + random_num = sum(np.random.binomial(1, self.spot_noise_ratio, len(spot_asoc))) + for _ in range(random_num): + random_position = np.random.randint(low=0, high=len(spot_asoc)) + to_insert_negative_spot = { + "span": null_span, + "label": np.random.choice(spot_label_list), # Sample negative spot from spot_label_list + "asoc": list(), + } + spot_asoc.insert(random_position, to_insert_negative_spot) + return spot_asoc + + def random_insert_asoc(self, spot_asoc, asoc_label_list=None): + """Insert negative asoc in random, sample negative asoc from asoc_label_list""" + # If no negative asoc_label_list, skip insertion of asoc null span + if asoc_label_list is None or len(asoc_label_list) == 0: + return spot_asoc + + spot_sum = len(spot_asoc) + random_num = sum(np.random.binomial(1, self.asoc_noise_ratio, spot_sum)) + for _ in range(random_num): + random_label = np.random.choice(asoc_label_list) + spot_position = np.random.randint(low=0, high=len(spot_asoc)) + asoc_position = np.random.randint(low=0, high=len(spot_asoc[spot_position]["asoc"]) + 1) + # Insert random negative span at `asoc_position` + spot_asoc[spot_position]["asoc"].insert(asoc_position, (random_label, null_span)) + return spot_asoc + + def add_noise(self, spot_asoc, spot_label_list, asoc_label_list): + """Add noise to target spot-asoc structure + spot_asoc: raw spot-asoc structure + spot_label_list: negative spot candidates + 
asoc_label_list: negative asoc candidates + """ + spot_asoc = self.random_insert_asoc(spot_asoc, asoc_label_list) + spot_asoc = self.random_insert_spot(spot_asoc, spot_label_list) + return spot_asoc + + +class DynamicSSIGenerator: + """ + Sample negative spot and asoc to construct SSI + """ + + def __init__(self, tokenizer, schema: RecordSchema, positive_rate=1, negative=-1, ordered_prompt=False) -> None: + self.spot_dict = self.get_ordered_dict(schema.type_list, tokenizer) + self.asoc_dict = self.get_ordered_dict(schema.role_list, tokenizer) + self.spot_list = list(self.spot_dict.keys()) + self.asoc_list = list(self.asoc_dict.keys()) + self.spot_prompt_id = tokenizer.convert_tokens_to_ids(spot_prompt) + self.asoc_prompt_id = tokenizer.convert_tokens_to_ids(asoc_prompt) + self.text_start_id = tokenizer.convert_tokens_to_ids(text_start) + self.positive_rate = positive_rate if 0 < positive_rate < 1 else 1 + self.negative = negative + self.ordered_prompt = ordered_prompt + logger.info(f"Meta Sample " f"Negative: {self.negative}, " f"Ordered SSI: {self.ordered_prompt}") + + @staticmethod + def get_ordered_dict(schema_name_list, tokenizer): + """Get schema name -> id dict + schema_name_list: ["人物", "组织机构"] + """ + schema_ordered_dict = OrderedDict() + for name in schema_name_list: + # tokenizer.encode("人物") -> [8, 122] + encoded_name = tokenizer.encode(name, add_special_tokens=False, return_token_type_ids=None) + schema_ordered_dict[name] = encoded_name["input_ids"] + return schema_ordered_dict + + @staticmethod + def sample_negative(postive, candidates, k=5): + if k < 0: + k = len(candidates) + negative_set = set() + for index in np.random.permutation(len(candidates))[:k].tolist(): + negative = candidates[index] + if negative not in postive: + negative_set.add(negative) + + return list(negative_set) + + def sample_spot(self, positive, candidates=None): + """Sample spot + + Args: + positive (List[str]): Positive Spot List + + Returns: + List[int]: spot index list + List[str]: Sampled Positive Spot List + List[str]: Sampled Negative Spot List + """ + neg_cands = candidates if candidates is not None else self.spot_list + + negative_spot = self.sample_negative(postive=positive, candidates=neg_cands, k=self.negative) + positive_spot = random.sample(positive, math.floor(len(positive) * self.positive_rate)) + + converted_spot_prefix = self.convert_prefix( + candidates=positive_spot + negative_spot, + prompt=self.spot_prompt_id, + mapper=self.spot_dict, + ordered_prompt=self.ordered_prompt, + ) + + return converted_spot_prefix, positive_spot, negative_spot + + def sample_asoc(self, positive, candidates=None): + """Sample Asoc + + Args: + positive (List[str]): Positive Asoc List + + Returns: + List[int]: asoc index list + List[str]: Sampled Negative Asoc List + """ + neg_cands = candidates if candidates is not None else self.asoc_list + negative_asoc = self.sample_negative(postive=positive, candidates=neg_cands, k=self.negative) + converted_asoc_prefix = self.convert_prefix( + candidates=positive + negative_asoc, + prompt=self.asoc_prompt_id, + mapper=self.asoc_dict, + ordered_prompt=self.ordered_prompt, + ) + return converted_asoc_prefix, negative_asoc + + def full_spot(self, candidates=None, shuffle=False): + # Random Prompt + Shuffle + if not self.ordered_prompt and shuffle: + ordered_prompt = False + else: + ordered_prompt = True + + prefix_cands = candidates if candidates is not None else self.spot_list + + return self.convert_prefix( + candidates=prefix_cands, + prompt=self.spot_prompt_id, + 
mapper=self.spot_dict, + ordered_prompt=ordered_prompt, + ) + + def full_asoc(self, candidates=None, shuffle=False): + # Random Prompt + Shuffle + if not self.ordered_prompt and shuffle: + ordered_prompt = False + else: + ordered_prompt = True + + prefix_cands = candidates if candidates is not None else self.asoc_list + + return self.convert_prefix( + candidates=prefix_cands, + prompt=self.asoc_prompt_id, + mapper=self.asoc_dict, + ordered_prompt=ordered_prompt, + ) + + @staticmethod + def convert_prefix(candidates, prompt, mapper, ordered_prompt=True): + prefix = list() + + if ordered_prompt: + candidate_sorted = sorted([(candidate, index) for index, candidate in enumerate(candidates)]) + index_list = [index for _, index in candidate_sorted] + else: + index_list = np.random.permutation(len(candidates)).tolist() + + for index in index_list: + prefix += [prompt] + prefix += mapper[candidates[index]] + return prefix + + +class DataCollatorForSeq2Seq: + def __init__( + self, + tokenizer, + ssi_generator: DynamicSSIGenerator, + model=None, + label_pad_token_id=-100, + padding=True, + max_source_length: Optional[int] = None, + max_target_length: Optional[int] = None, + max_prefix_length: Optional[int] = None, + spot_asoc_nosier: SpotAsocNoiser = None, + return_tensors=True, + ): + + self.tokenizer = tokenizer + self.ssi_generator = ssi_generator + self.model = model + self.label_pad_token_id = label_pad_token_id + self.padding = padding + self.max_source_length = max_source_length + self.max_target_length = max_target_length + self.max_prefix_length = max_prefix_length + self.spot_asoc_nosier = spot_asoc_nosier + self.return_tensors = return_tensors + + def __call__(self, data, return_tensors=None): + + new_data = [] # To avoid the orgin data being covered + + for ins in data: + + target_spot_asoc = copy.deepcopy(ins["spot_asoc"]) + + if ins["sample_ssi"] is True: + # Sample Dynamic SSI + spot_prefix, pos_spot, neg_spot = self.ssi_generator.sample_spot(positive=ins.get("spots", [])) + asoc_prefix, neg_asoc = self.ssi_generator.sample_asoc(positive=ins.get("asocs", [])) + + # Filter spot-asoc not in Positive Spot + target_spot_asoc = list(filter(lambda x: x["label"] in pos_spot, target_spot_asoc)) + + # Inject rejection noise + if self.spot_asoc_nosier is not None: + target_spot_asoc = self.spot_asoc_nosier.add_noise( + target_spot_asoc, + spot_label_list=neg_spot, + asoc_label_list=neg_asoc, + ) + else: + # Evaluation using Ordered SSI + spot_prefix = self.ssi_generator.full_spot(shuffle=self.model.training) + asoc_prefix = self.ssi_generator.full_asoc(shuffle=self.model.training) + + # Prepare prefix ids + prefix = spot_prefix + asoc_prefix + # truncate `prefix` to max length + if self.max_prefix_length is not None and self.max_prefix_length >= 0: + prefix = prefix[: self.max_prefix_length] + prefix = prefix + [self.ssi_generator.text_start_id] + + # Prepare source text ids + source_text_id = prefix + ins["input_ids"] + # truncate `input_ids` to max source length + if self.max_source_length is not None: + source_text_id = source_text_id[: self.max_source_length] + + # Prepare target record ids + # Generate new record + target_record = convert_spot_asoc(target_spot_asoc, structure_maker=BaseStructureMarker()) + target_labels = self.tokenizer.encode( + target_record, + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=self.max_target_length, + ) + + new_data.append( + { + "input_ids": source_text_id, + "labels": target_labels["input_ids"], + "attention_mask": [1] * 
len(source_text_id), + "decoder_attention_mask": target_labels["attention_mask"], + } + ) + + first = new_data[0] + assert isinstance( + first, dict + ), f"Input pattern not understood. The input of collatot must be a dict with key of input column name and value of data Received input type: {type(first)}" + + labels = [d["labels"] for d in new_data] if "labels" in new_data[0].keys() else None + + batch = {} + + def _pad_function(sequence, pad_value): + return Pad(axis=0, pad_val=pad_value, dtype="int64")(sequence) + + pad_value_map = { + "token_type_ids": self.tokenizer.pad_token_type_id, + "attention_mask": 0, + "decoder_attention_mask": 0, + "special_tokens_mask": 1, + "input_ids": self.tokenizer.pad_token_id, + } + + for k, v in first.items(): + if k not in ("labels", "label_ids") and v is not None and not isinstance(v, str): + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=pad_value_map[k], + ) + else: + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=self.label_pad_token_id, + ) + + # prepare decoder_input_ids + if ( + labels is not None + and self.model is not None + and hasattr(self.model, "prepare_decoder_input_ids_from_labels") + ): + decoder_input_ids = self.model.prepare_decoder_input_ids_from_labels(labels=batch["labels"]) + if not return_tensors: + batch["decoder_input_ids"] = decoder_input_ids.numpy() + if self.return_tensors: + for k, v in batch.items(): + batch[k] = paddle.to_tensor(v) + return batch + + +class DataCollatorForMultiTaskSeq2Seq: + def __init__( + self, + tokenizer, + ssi_generator: DynamicSSIGenerator, + model=None, + label_pad_token_id=-100, + padding=True, + max_source_length: Optional[int] = None, + max_target_length: Optional[int] = None, + max_prefix_length: Optional[int] = None, + spot_asoc_nosier: SpotAsocNoiser = None, + return_tensors=True, + ): + + self.tokenizer = tokenizer + self.ssi_generator = ssi_generator + self.model = model + self.label_pad_token_id = label_pad_token_id + self.padding = padding + self.max_source_length = max_source_length + self.max_target_length = max_target_length + self.max_prefix_length = max_prefix_length + self.spot_asoc_nosier = spot_asoc_nosier + self.return_tensors = return_tensors + + def __call__(self, data, return_tensors=None): + + new_data = [] # To avoid the orgin data being covered + + for ins in data: + + target_spot_asoc = copy.deepcopy(ins["spot_asoc"]) + + if ins["sample_ssi"] is True: + + positive_spot = set() + positive_asoc = set() + for spot_asoc in ins["spot_asoc"]: + positive_spot.add(spot_asoc["label"]) + for asoc in spot_asoc["asoc"]: + positive_asoc.add(asoc[0]) + + # 对 SSI 进行采样 + # 在多任务中,每个数据Instance + # ‘spots’ 对应该任务的 spots + # ‘asocs’ 对应该任务的 asocs + # 因此 candidates 在任务内进行采样 + spot_prefix, pos_spot, neg_spot = self.ssi_generator.sample_spot( + positive=list(positive_spot), + candidates=ins["spots"], + ) + asoc_prefix, neg_asoc = self.ssi_generator.sample_asoc( + positive=list(positive_asoc), + candidates=ins["asocs"], + ) + + # Filter spot-asoc not in Positive Spot + target_spot_asoc = list(filter(lambda x: x["label"] in pos_spot, target_spot_asoc)) + + # Inject rejection noise + if self.spot_asoc_nosier is not None: + target_spot_asoc = self.spot_asoc_nosier.add_noise( + target_spot_asoc, + spot_label_list=neg_spot, + asoc_label_list=neg_asoc, + ) + + else: + # Evaluation using Ordered SSI + spot_prefix = self.ssi_generator.full_spot(candidates=ins["spots"], shuffle=self.model.training) + asoc_prefix = 
self.ssi_generator.full_asoc(candidates=ins["asocs"], shuffle=self.model.training) + + # Prepare prefix ids + prefix_id = spot_prefix + asoc_prefix + # truncate `prefix` to max length + if self.max_prefix_length is not None and self.max_prefix_length >= 0: + prefix_id = prefix_id[: self.max_prefix_length] + prefix_id = prefix_id + [self.ssi_generator.text_start_id] + + # Prepare source text ids + source_text_id = prefix_id + ins["input_ids"] + # truncate `input_ids` to max source length + if self.max_source_length is not None: + source_text_id = source_text_id[: self.max_source_length] + + # Prepare target record ids + # Generate new record + target_record = convert_spot_asoc(target_spot_asoc, structure_maker=BaseStructureMarker()) + target_labels = self.tokenizer.encode( + target_record, + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=self.max_target_length, + ) + + new_data.append( + { + "input_ids": source_text_id, + "labels": target_labels["input_ids"], + "attention_mask": [1] * len(source_text_id), + "decoder_attention_mask": target_labels["attention_mask"], + } + ) + + first = new_data[0] + assert isinstance( + first, dict + ), f"Input pattern not understood. The input of collatot must be a dict with key of input column name and value of data Received input type: {type(first)}" + + labels = [d["labels"] for d in new_data] if "labels" in new_data[0].keys() else None + + batch = {} + + def _pad_function(sequence, pad_value): + return Pad(axis=0, pad_val=pad_value, dtype="int64")(sequence) + + pad_value_map = { + "token_type_ids": self.tokenizer.pad_token_type_id, + "attention_mask": 0, + "decoder_attention_mask": 0, + "special_tokens_mask": 1, + "input_ids": self.tokenizer.pad_token_id, + } + + for k, v in first.items(): + if k not in ("labels", "label_ids") and v is not None and not isinstance(v, str): + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=pad_value_map[k], + ) + else: + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=self.label_pad_token_id, + ) + + # prepare decoder_input_ids + if ( + labels is not None + and self.model is not None + and hasattr(self.model, "prepare_decoder_input_ids_from_labels") + ): + decoder_input_ids = self.model.prepare_decoder_input_ids_from_labels(labels=batch["labels"]) + if not return_tensors: + batch["decoder_input_ids"] = decoder_input_ids.numpy() + + if self.return_tensors: + for k, v in batch.items(): + batch[k] = paddle.to_tensor(v) + + return batch diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py b/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..cb761d8093f21a895a92f5a520a8f1941d574c99 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional, Union, List + +from paddle import Tensor +from paddlenlp.transformers import BertTokenizer + +logger = logging.getLogger(__name__) + + +class T5BertTokenizer(BertTokenizer): + + model_input_names = ["input_ids", "attention_mask"] + + def __init__( + self, + vocab_file, + do_lower_case=False, + do_basic_tokenize=True, + never_split=None, + unk_token="", + sep_token=None, + pad_token="", + cls_token=None, + mask_token=None, + space_token="", + tokenize_chinese_chars=True, + strip_accents=None, + **kwargs + ): + super().__init__( + vocab_file=vocab_file, + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + **kwargs, + ) + if space_token not in self._additional_special_tokens: + self._additional_special_tokens += [space_token] + + self._space_token = space_token + + def get_vocab(self): + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} + vocab.update(self.added_tokens_encoder) + return vocab + + def tokenize(self, text, **kwargs): + import re + + # Remove space between + split_bracket = re.compile(r"\s*\s*|\s*\s*|\s*\s*") + + if len(split_bracket.split(text)) > 1: + new_text_list = [split_bracket.split(text)[0]] + for item in zip(split_bracket.findall(text), split_bracket.split(text)[1:]): + new_text_list += [item[0].strip(), item[1]] + text = "".join(new_text_list) + text = text.replace(" ", self._space_token) + return super().tokenize(text, **kwargs) + + def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]: + """Do not add eos again if user already added it.""" + if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id: + logging.warn( + f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated eos tokens being added." + ) + return token_ids + else: + return token_ids + [self.eos_token_id] + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and + adding special tokens. A sequence has the following format: + + - single sequence: ``X `` + - pair of sequences: ``A B `` + + Args: + token_ids_0 (:obj:`List[int]`): + List of IDs to which the special tokens will be added. + token_ids_1 (:obj:`List[int]`, `optional`): + Optional second list of IDs for sequence pairs. + + Returns: + :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. 
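+
+        Example (illustrative, assuming the tokenizer's ``eos_token_id`` is set;
+        the concrete ids depend on the loaded vocab)::
+
+            tokenizer.build_inputs_with_special_tokens([5, 6])       # -> [5, 6, eos_token_id]
+            tokenizer.build_inputs_with_special_tokens([5, 6], [7])  # -> [5, 6, eos_token_id, 7, eos_token_id]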
+ """ + token_ids_0 = self._add_eos_if_not_present(token_ids_0) + if token_ids_1 is None: + return token_ids_0 + else: + token_ids_1 = self._add_eos_if_not_present(token_ids_1) + return token_ids_0 + token_ids_1 + + def _decode(self, token_ids: Union[List[int], Tensor], skip_special_tokens: bool = False, **kwargs) -> str: + if isinstance(token_ids, Tensor): + tokens = self.convert_ids_to_tokens(token_ids.tolist(), skip_special_tokens=skip_special_tokens) + else: + tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens) + + # Fix '##' subtoken + tokens = [x.lstrip("#") if x.startswith("##") else x for x in tokens] + + x_str = "".join(tokens) + x_str = x_str.replace(" ", "") + x_str = x_str.replace(self._space_token, " ") + return x_str + + def decode(self, token_ids: Union[List[int], Tensor], skip_special_tokens: bool = False, **kwargs) -> str: + return self._decode(token_ids, skip_special_tokens) + + def batch_decode(self, sequences, skip_special_tokens=False, clean_up_tokenization_spaces=True): + """ + Convert a list of lists of token ids into a list of strings by calling decode. + Args: + sequences (Union[List[int], List[List[int]], Tensor]): + List of tokenized input ids. + skip_special_tokens (bool, optional): + Whether or not to remove special tokens in the decoding. Defaults to `False`. + clean_up_tokenization_spaces (bool, optional): + Whether or not to clean up the tokenization spaces. Defaults to `True`. + Returns: + List[str]: The list of decoded sentences. + """ + return [ + self._decode( + seq, skip_special_tokens=skip_special_tokens, clean_up_tokenization_spaces=clean_up_tokenization_spaces + ) + for seq in sequences + ] diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/utils.py b/examples/information_extraction/DuUIE/uie/seq2struct/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..06112300d92d35fe4c970586bf31d1dbff4e032c --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/utils.py @@ -0,0 +1,479 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import logging +import os +import random +from dataclasses import dataclass +from typing import List + +import numpy as np +import paddle +import tabulate +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from uie.evaluation import constants +from uie.evaluation.sel2record import MapConfig, RecordSchema, SEL2Record, merge_schema +from uie.seq2struct.data_collator import ( + DataCollatorForMultiTaskSeq2Seq, + DataCollatorForSeq2Seq, + DynamicSSIGenerator, + SpotAsocNoiser, +) +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + +logger = logging.getLogger("__main__") + + +def get_writer(args): + if args.writer_type == "visualdl": + from visualdl import LogWriter + + writer = LogWriter(logdir=args.logging_dir) + elif args.writer_type == "tensorboard": + from tensorboardX import SummaryWriter + + writer = SummaryWriter(logdir=args.logging_dir) + else: + raise ValueError("writer_type must be in ['visualdl', 'tensorboard']") + return writer + + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "poly": PolyDecayWithWarmup, +} + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + """Set learning rate scheduler""" + + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, total_steps=num_training_steps, warmup=num_warmup_steps, **scheduler_kwargs + ) + + +def set_seed(args): + """Set default seed""" + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +def save_checkpoint(tokenizer, model, output_dir): + """Save tokenizer and checkpoint model to output_dir""" + logger.info(f"saving checkpoint to {output_dir}") + if isinstance(model, paddle.DataParallel): + model = model._layers + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def set_logger(args): + """Set logger""" + logger.setLevel(logging.DEBUG if "DEBUG" in os.environ else logging.INFO) + + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO, + handlers=[ + logging.FileHandler( + filename=f"{args.output_dir}.log", + mode="w", + encoding="utf-8", + ) + ], + ) + # create console handler and set level to debug + console_handler = logging.StreamHandler() + console_handler.setLevel(level=logging.DEBUG) + # add formatter to console_handler + console_handler.setFormatter(fmt=logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")) + # add console_handler to logger + logger.addHandler(console_handler) + + +def read_json_file(file_name): + """Read jsonline file as generator""" + with open(file_name, encoding="utf8") as fin: + for line in fin: + yield json.loads(line) + + +def better_print_multi(results): + """Better print multi task results + results: 
Dictionary of task and metric {"task:metric": "value", ...} + """ + table = [(task, results[task]) for task in results] + return tabulate.tabulate(table, headers=["Task", "Metric"]) + + +def read_func(tokenizer, data_file: str, max_source_length: int, is_train: bool = False, negative_keep: float = 1.0): + """Read instance from data_file + + Args: + tokenizer (PretrainedTokenizer): Tokenizer + data_file (str): Data filename + max_source_length (int): Max source length + is_train (bool): instance from this file whether for training + negative_keep (float): the ratio of keeping negative instances + """ + + negative_drop_num = 0 + with open(data_file, "r", encoding="utf-8") as fin: + for line in fin: + instance = json.loads(line) + + # Drop negative sample in random during training stage + if is_train and len(instance["spot_asoc"]) == 0: + # if negative_keep >= 1, keep all negative instances + # else drop negative instance when random() > negative_keep + if random.random() > negative_keep: + negative_drop_num += 1 + continue + + inputs = tokenizer( + instance["text"], + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=max_source_length, + ) + + # `sample_ssi` can be True in the training stage + # `sample_ssi` can only be False in the evaluation stage + # 在训练时,ssi可以动态变化 (sample_ssi=True) + # 但是在推理和验证时,ssi必须固定保证推理结果的一致 (sample_ssi=False) + inputs.update( + { + "spots": instance["spot"], + "asocs": instance["asoc"], + "spot_asoc": instance["spot_asoc"], + "sample_ssi": is_train, + } + ) + yield inputs + + if negative_drop_num > 0: + logger.info(f"Drop negative {negative_drop_num} instance during loading {data_file}.") + + +def read_training_instance_based_config( + tokenizer, config_file: str, max_source_length: int, negative_keep: float = 1.0 +): + """Read training instances based on config_file + + Args: + tokenizer (PretrainedTokenizer): Tokenizer + config_file (str): Config filename + max_source_length (int): Max source length + negative_keep: the ratio of keeping negative instances + + Yields: + dict: instance for training + """ + task_configs = list(TaskConfig.load_list_from_yaml(config_file)) + + for task_config in task_configs: + negative_drop_num = 0 + + train_file = os.path.join(task_config.data_path, "train.json") + schema_file = os.path.join(task_config.data_path, "record.schema") + record_schema = RecordSchema.read_from_file(schema_file) + with open(train_file, "r", encoding="utf-8") as fin: + count = 0 + for line in fin: + instance = json.loads(line) + + # Drop negative sample in random during training stage + if len(instance["spot_asoc"]) == 0: + # if negative_keep >= 1, keep all negative instances + # else drop negative instance when random() > negative_keep + if random.random() > negative_keep: + negative_drop_num += 1 + continue + + inputs = tokenizer( + instance["text"], + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=max_source_length, + ) + + # `sample_ssi` is True in the training stage + inputs.update( + { + "spots": record_schema.type_list, + "asocs": record_schema.role_list, + "spot_asoc": instance["spot_asoc"], + "sample_ssi": True, + } + ) + yield inputs + count += 1 + logger.info(f"Load {count} instances from {train_file}") + + if negative_drop_num > 0: + logger.info(f"Drop negative {negative_drop_num} instance during loading {train_file}.") + + +def get_train_dataloader(model, tokenizer, args): + logger.info(f"Load data according to {args.multi_task_config} ...") + + dataset = load_dataset( + 
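+        # `load_dataset` is called with a custom reader here; with `lazy=False` the whole
+        # multi-task training set is read and tokenized up front, and the keyword arguments
+        # below are forwarded to `read_training_instance_based_config`.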
read_training_instance_based_config, + tokenizer=tokenizer, + config_file=args.multi_task_config, + max_source_length=args.max_source_length, + lazy=False, + negative_keep=args.negative_keep, + ) + + # Merge schema in all datasets for pre-tokenize + schema_list = list() + for task_config in TaskConfig.load_list_from_yaml(args.multi_task_config): + schema_file = os.path.join(task_config.data_path, "record.schema") + schema_list += [RecordSchema.read_from_file(schema_file)] + schema = merge_schema(schema_list) + + batch_sampler = DistributedBatchSampler( + dataset=dataset, + batch_size=args.per_device_train_batch_size, + shuffle=True, + ) + + if args.spot_noise > 0 or args.asoc_noise > 0: + spot_asoc_nosier = SpotAsocNoiser( + spot_noise_ratio=args.spot_noise, + asoc_noise_ratio=args.asoc_noise, + null_span=constants.null_span, + ) + else: + spot_asoc_nosier = None + + label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id + + collate_fn = DataCollatorForMultiTaskSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=label_pad_token_id, + max_source_length=args.max_source_length, + max_prefix_length=args.max_prefix_length, + max_target_length=args.max_target_length, + ssi_generator=DynamicSSIGenerator( + tokenizer=tokenizer, + schema=schema, + positive_rate=args.meta_positive_rate, + negative=args.meta_negative, + ordered_prompt=args.ordered_prompt, + ), + spot_asoc_nosier=spot_asoc_nosier, + ) + + data_loader = DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=collate_fn, + num_workers=args.dataloader_num_workers, + return_list=True, + ) + + return data_loader + + +def get_eval_dataloader(model, tokenizer, eval_filename, record_schema, args): + """Get evaluation dataloader""" + + logger.info(f"Load data from {eval_filename} ...") + + schema = RecordSchema.read_from_file(record_schema) + + dataset = load_dataset( + read_func, + tokenizer=tokenizer, + data_file=eval_filename, + max_source_length=args.max_source_length, + is_train=False, + lazy=False, + ) + + batch_sampler = BatchSampler(dataset=dataset, batch_size=args.per_device_eval_batch_size, shuffle=False) + + label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id + + collate_fn = DataCollatorForSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=label_pad_token_id, + max_source_length=args.max_source_length, + max_prefix_length=args.max_prefix_length, + max_target_length=args.max_target_length, + ssi_generator=DynamicSSIGenerator( + tokenizer=tokenizer, + schema=schema, + positive_rate=1, + negative=-1, + ordered_prompt=True, + ), + spot_asoc_nosier=None, + ) + + data_loader = DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=collate_fn, + num_workers=args.dataloader_num_workers, + return_list=True, + ) + + return data_loader + + +def load_eval_tasks(model, tokenizer, args): + """Load evaluation tasks + + Args: + model (PretrainedModel): Pretrain Model + tokenizer (PretrainedTokenizer): Tokenizer + args (Namespace): arguments for loading eval tasks + + Returns: + list(Task): list of evaluation tasks + """ + eval_tasks = dict() + task_configs = list(TaskConfig.load_list_from_yaml(args.multi_task_config)) + + for task_config in task_configs: + + val_filename = os.path.join(task_config.data_path, "val.json") + record_schema = os.path.join(task_config.data_path, "record.schema") + + task_dataloader = get_eval_dataloader( + model=model, tokenizer=tokenizer, eval_filename=val_filename, record_schema=record_schema, 
args=args + ) + + sel2record = SEL2Record( + schema_dict=SEL2Record.load_schema_dict(task_config.data_path), + map_config=MapConfig.load_by_name(task_config.sel2record), + tokenizer=tokenizer if isinstance(tokenizer, T5BertTokenizer) else None, + ) + + eval_tasks[task_config.dataset_name] = Task( + config=task_config, + dataloader=task_dataloader, + sel2record=sel2record, + val_instances=list(read_json_file(val_filename)), + metrics=task_config.metrics, + ) + + return eval_tasks + + +def write_prediction(eval_prediction, output_dir, prefix="eval"): + """Write prediction to output_dir + + Args: + eval_prediction (dict): + - `record` (list(dict)), each element is extraction reocrd + - `sel` (list(str)): each element is sel expression + - `metric` (dict) + output_dir (str): Output directory path + prefix (str, optional): prediction file prefix. Defaults to 'eval'. + + Write prediction to files: + - `preds_record.txt`, each line is extracted record + - `preds_seq2seq.txt`, each line is generated sel + - `results.txt`, detailed metrics of prediction + """ + output_filename = os.path.join(output_dir, f"{prefix}-preds_record.txt") + with open(output_filename, "w", encoding="utf8") as output: + for pred in eval_prediction.get("record", []): + output.write(json.dumps(pred, ensure_ascii=False) + "\n") + + output_filename = os.path.join(output_dir, f"{prefix}-preds_seq2seq.txt") + with open(output_filename, "w", encoding="utf8") as output: + for pred in eval_prediction.get("sel", []): + output.write(pred + "\n") + + output_filename = os.path.join(output_dir, f"{prefix}-results.txt") + with open(output_filename, "w", encoding="utf8") as output: + for key, value in eval_prediction.get("metric", {}).items(): + output.write(f"{prefix}-{key} = {value}\n") + + +class TaskConfig: + def __init__(self, task_dict) -> None: + self.dataset_name = task_dict.get("name", "") + self.task_name = task_dict.get("task", "") + self.data_path = task_dict.get("path", "") + self.sel2record = task_dict.get("sel2record", "") + self.metrics = task_dict.get("metrics", []) + self.eval_match_mode = task_dict.get("eval_match_mode", "normal") + self.schema = RecordSchema.read_from_file(f"{self.data_path}/record.schema") + + def __repr__(self) -> str: + task_config_list = [ + f"dataset: {self.dataset_name}", + f"task : {self.task_name}", + f"path : {self.data_path}", + f"schema : {self.schema}", + f"metrics: {self.metrics}", + f"eval_match_mode : {self.eval_match_mode}", + ] + return "\n".join(task_config_list) + + @staticmethod + def load_list_from_yaml(task_config): + import yaml + + configs = yaml.load(open(task_config, encoding="utf8"), Loader=yaml.FullLoader) + task_configs = filter(lambda x: x.startswith("T"), configs) + for task_config in task_configs: + yield TaskConfig(configs[task_config]) + + +@dataclass +class Task: + config: TaskConfig + dataloader: DataLoader + sel2record: SEL2Record + val_instances: List[dict] + metrics: List[str] diff --git a/examples/information_extraction/msra_ner/README.md b/examples/information_extraction/msra_ner/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3d9e04bf6fe7dc1cdf8f6330e856e378c79243d5 --- /dev/null +++ b/examples/information_extraction/msra_ner/README.md @@ -0,0 +1,125 @@ +# 使用PaddleNLP完成中文命名实体识别 + +## 1. 
简介 + +MSRA-NER 数据集由微软亚研院发布,其目标是识别文本中具有特定意义的实体,主要包括人名、地名、机构名等。示例如下: + +``` +不\002久\002前\002,\002中\002国\002共\002产\002党\002召\002开\002了\002举\002世\002瞩\002目\002的\002第\002十\002五\002次\002全\002国\002代\002表\002大\002会\002。 O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O\002O\002O\002O\002O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O +这\002次\002代\002表\002大\002会\002是\002在\002中\002国\002改\002革\002开\002放\002和\002社\002会\002主\002义\002现\002代\002化\002建\002设\002发\002展\002的\002关\002键\002时\002刻\002召\002开\002的\002历\002史\002性\002会\002议\002。 O\002O\002O\002O\002O\002O\002O\002O\002B-LOC\002I-LOC\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O +``` + +PaddleNLP集成的数据集MSRA-NER数据集对文件格式做了调整:每一行文本、标签以特殊字符"\t"进行分隔,每个字之间以特殊字符"\002"分隔。 + +## 快速开始 + +### 模型训练 + +#### 单卡训练 + +```shell +export CUDA_VISIBLE_DEVICES=0 + +python -u ./train.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-uncased \ + --dataset msra_ner \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/msra_ner/ \ + --device gpu +``` + +其中参数释义如下: +- `model_type`: 指定模型的类型,可选的有 bert、ernie、ernie-ctm。 +- `model_name_or_path`: 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer,支持[PaddleNLP Transformer API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中除ernie-gen以外的所有模型。若使用其他系列模型,需修改脚本导入相应的Task和Tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `dataset`: 目前支持 msra_ner 和 peoples_daily_ner 数据集。 +- `max_seq_length`: 表示最大句子长度,超过该长度将被截断。 +- `batch_size`: 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate`: 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs`: 表示训练轮数。 +- `logging_steps`: 表示日志打印间隔。 +- `save_steps`: 表示模型保存及评估间隔。 +- `output_dir`: 表示模型保存路径。 +- `device`: 训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 + +#### 多卡训练 +```shell +python -m paddle.distributed.launch --gpus "0,1" ./train.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-uncased \ + --dataset msra_ner \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/msra_ner/ \ + --device gpu +``` + + +训练过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下日志: + +``` +global step 3996, epoch: 2, batch: 1184, loss: 0.008593, speed: 4.15 step/s +global step 3997, epoch: 2, batch: 1185, loss: 0.008453, speed: 4.17 step/s +global step 3998, epoch: 2, batch: 1186, loss: 0.002294, speed: 4.19 step/s +global step 3999, epoch: 2, batch: 1187, loss: 0.005351, speed: 4.16 step/s +global step 4000, epoch: 2, batch: 1188, loss: 0.004734, speed: 4.18 step/s +eval loss: 0.006829, precision: 0.908957, recall: 0.926683, f1: 0.917734 +``` + +使用以上命令进行单卡 Fine-tuning ,在验证集上有如下结果: + Metric | Result | +------------------------------|-------------| +Precision | 0.908957 | +Recall | 0.926683 | +F1 | 0.917734 | + +### 模型评估 +目前支持bert类型模型,其他模型可修改为对应的Task和Tokenizer。支持msra_ner数据集。 +```shell +export CUDA_VISIBLE_DEVICES=0 + +python -u ./eval.py \ + --model_name_or_path bert-base-multilingual-uncased \ + --max_seq_length 128 \ + --batch_size 32 \ + --device gpu \ + --init_checkpoint_path tmp/msra_ner/model_500.pdparams +``` + +其中参数释义如下: +- `model_name_or_path`: 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_seq_length`: 表示最大句子长度,超过该长度将被截断。 +- 
`batch_size`: 表示每次迭代**每张卡**上的样本数目。 +- `use_gpu`: 是否使用GPU。 +- `init_checkpoint_path`: 模型加载路径。 + +### 模型预测 + +目前支持bert类型模型,其他模型可修改为对应的Task和Tokenizer。支持msra_ner数据集。 +```shell +export CUDA_VISIBLE_DEVICES=0 + +python -u ./predict.py \ + --model_name_or_path bert-base-multilingual-uncased \ + --max_seq_length 128 \ + --batch_size 32 \ + --device gpu \ + --init_checkpoint_path tmp/msra_ner/model_500.pdparams +``` + +### 使用其它预训练模型 + +请参考[Transformer API文档](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 了解更多PaddleNLP支持的预训练模型信息,并更换`--model_name_or_path`参数即可对比其他预训练模型的效果。 + +## Reference + +- [The third international Chinese language processing bakeoff: Word segmentation and named entity recognition](https://faculty.washington.edu/levow/papers/sighan06.pdf) diff --git a/examples/information_extraction/msra_ner/eval.py b/examples/information_extraction/msra_ner/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..2e30532cd43697b7763a03a6938467d6487bc6aa --- /dev/null +++ b/examples/information_extraction/msra_ner/eval.py @@ -0,0 +1,126 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import BertForTokenClassification, BertTokenizer + +parser = argparse.ArgumentParser() + +parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join(list(BertTokenizer.pretrained_init_configuration.keys())), +) +parser.add_argument("--init_checkpoint_path", default=None, type=str, required=True, help="The model checkpoint path.") +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", +) + + +def do_eval(args): + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. + train_ds, eval_ds = load_dataset("msra_ner", split=("train", "test")) + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + label_list = train_ds.features["ner_tags"].feature.names + label_num = len(label_list) + no_entity_id = 0 + + def tokenize_and_align_labels(examples): + tokenized_inputs = tokenizer( + examples["tokens"], + max_seq_len=args.max_seq_length, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). 
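+            # The MSRA-NER samples are already lists of single-character tokens, so each
+            # token maps to exactly one input id and the ner_tags align one-to-one
+            # (the [CLS]/[SEP] offsets are handled in the loop below).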
+ is_split_into_words="token", + return_length=True, + ) + labels = [] + + for i, label in enumerate(examples["ner_tags"]): + label_ids = label + if len(tokenized_inputs["input_ids"][i]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_inputs["input_ids"][i]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + label_ids += [no_entity_id] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids)) + + labels.append(label_ids) + tokenized_inputs["labels"] = labels + return tokenized_inputs + + ignore_label = -100 + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"), # input + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"), # segment + "seq_len": Stack(dtype="int64"), + "labels": Pad(axis=0, pad_val=ignore_label, dtype="int64"), # label + } + ): fn(samples) + + eval_ds = eval_ds.select(range(len(eval_ds) - 1)) + eval_ds = eval_ds.map(tokenize_and_align_labels, batched=True) + eval_data_loader = DataLoader( + dataset=eval_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + # Define the model netword and its loss + model = BertForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=label_num) + if args.init_checkpoint_path: + model_dict = paddle.load(args.init_checkpoint_path) + model.set_dict(model_dict) + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + metric = ChunkEvaluator(label_list=label_list) + + model.eval() + metric.reset() + for step, batch in enumerate(eval_data_loader): + input_ids, token_type_ids, length, labels = batch + logits = model(input_ids, token_type_ids) + loss = loss_fct(logits, labels) + avg_loss = paddle.mean(loss) + preds = logits.argmax(axis=2) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(length, preds, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + print("eval loss: %f, precision: %f, recall: %f, f1: %f" % (avg_loss, precision, recall, f1_score)) + + +if __name__ == "__main__": + args = parser.parse_args() + do_eval(args) diff --git a/examples/information_extraction/msra_ner/predict.py b/examples/information_extraction/msra_ner/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5aa4de5197cfcb4d5264ea5c5a7d5ddff5ee0948 --- /dev/null +++ b/examples/information_extraction/msra_ner/predict.py @@ -0,0 +1,133 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
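+# See the License for the specific language governing permissions and
+# limitations under the License.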
+ +import argparse + +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.transformers import BertForTokenClassification, BertTokenizer + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(list(BertTokenizer.pretrained_init_configuration.keys()))) +parser.add_argument("--init_checkpoint_path", default=None, type=str, required=True, help="The model checkpoint path.", ) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded.", ) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu", "npu"] , help="The device to select to train the model, is must be cpu/gpu/xpu/npu.") +# yapf: enable + + +def parse_decodes(input_words, id2label, decodes, lens): + decodes = [x for batch in decodes for x in batch] + lens = [x for batch in lens for x in batch] + + outputs = [] + for idx, end in enumerate(lens): + sent = "".join(input_words[idx]["tokens"]) + tags = [id2label[x] for x in decodes[idx][1:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.startswith("B-") or t == "O": + if len(words): + sent_out.append(words) + if t.startswith("B-"): + tags_out.append(t.split("-")[1]) + else: + tags_out.append(t) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def do_predict(args): + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. + train_examples, predict_examples = load_dataset("msra_ner", split=("train", "test")) + column_names = train_examples.column_names + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + label_list = train_examples.features["ner_tags"].feature.names + label_num = len(label_list) + no_entity_id = 0 + + def tokenize_and_align_labels(examples): + tokenized_inputs = tokenizer( + examples["tokens"], + max_seq_len=args.max_seq_length, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). 
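+            # `return_length=True` keeps a "seq_len" field in the features; it is read back
+            # from each batch below and used to trim padding before decoding the tags.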
+ is_split_into_words="token", + return_length=True, + ) + labels = [] + + for i, label in enumerate(examples["ner_tags"]): + label_ids = label + if len(tokenized_inputs["input_ids"][i]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_inputs["input_ids"][i]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + label_ids += [no_entity_id] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids)) + + labels.append(label_ids) + tokenized_inputs["labels"] = labels + return tokenized_inputs + + batchify_fn = DataCollatorForTokenClassification(tokenizer) + + id2label = dict(enumerate(label_list)) + + predict_examples = predict_examples.select(range(len(predict_examples) - 1)) + predict_ds = predict_examples.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + predict_data_loader = DataLoader( + dataset=predict_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + # Define the model netword + model = BertForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=label_num) + if args.init_checkpoint_path: + model_dict = paddle.load(args.init_checkpoint_path) + model.set_dict(model_dict) + + model.eval() + pred_list = [] + len_list = [] + for step, batch in enumerate(predict_data_loader): + logits = model(batch["input_ids"], batch["token_type_ids"]) + pred = paddle.argmax(logits, axis=-1) + pred_list.append(pred.numpy()) + len_list.append(batch["seq_len"].numpy()) + + preds = parse_decodes(predict_examples, id2label, pred_list, len_list) + + file_path = "results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) + + +if __name__ == "__main__": + args = parser.parse_args() + do_predict(args) diff --git a/examples/information_extraction/msra_ner/train.py b/examples/information_extraction/msra_ner/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e87ba6ad94f4f65971fa275c809e50eb33ee7a8e --- /dev/null +++ b/examples/information_extraction/msra_ner/train.py @@ -0,0 +1,216 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
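+# See the License for the specific language governing permissions and
+# limitations under the License.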
+ +import argparse +import os +import time + +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import ( + BertForTokenClassification, + BertTokenizer, + ErnieCtmForTokenClassification, + ErnieCtmTokenizer, + ErnieForTokenClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "bert": (BertForTokenClassification, BertTokenizer), + "ernie": (ErnieForTokenClassification, ErnieTokenizer), + "ernie-ctm": (ErnieCtmForTokenClassification, ErnieCtmTokenizer), +} + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--model_type", default="bert", type=str, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), ) +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])), ) +parser.add_argument("--dataset", default="msra_ner", type=str, choices=["msra_ner", "peoples_daily_ner"] , help="The named entity recognition datasets.") +parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") +parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.", ) +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu", "npu"] , help="The device to select to train the model, is must be cpu/gpu/xpu/npu.") +# yapf: enable + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader, label_num, mode="valid"): + model.eval() + metric.reset() + avg_loss, precision, recall, f1_score = 0, 0, 0, 0 + for batch in data_loader: + logits = model(batch["input_ids"], batch["token_type_ids"]) + loss = loss_fct(logits, batch["labels"]) + avg_loss = paddle.mean(loss) + preds = logits.argmax(axis=2) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute( + batch["seq_len"], preds, batch["labels"] + ) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + print("%s: eval loss: %f, precision: %f, recall: %f, f1: %f" % (mode, avg_loss, precision, recall, f1_score)) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # Create dataset, tokenizer and dataloader. + if args.dataset == "peoples_daily_ner": + raw_datasets = load_dataset(args.dataset) + else: + raw_datasets = load_dataset(args.dataset) + + AutoForTokenClassification, AutoTokenizer = MODEL_CLASSES[args.model_type] + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + train_ds = raw_datasets["train"] + + label_list = train_ds.features["ner_tags"].feature.names + label_num = len(label_list) + no_entity_id = 0 + + def tokenize_and_align_labels(examples): + tokenized_inputs = tokenizer( + examples["tokens"], + max_seq_len=args.max_seq_length, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). 
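+            # The alignment below truncates ner_tags to len(input_ids) - 2, then fills the
+            # [CLS]/[SEP] positions and any remaining padding with no_entity_id ("O").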
+ is_split_into_words="token", + return_length=True, + ) + labels = [] + + for i, label in enumerate(examples["ner_tags"]): + label_ids = label + if len(tokenized_inputs["input_ids"][i]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_inputs["input_ids"][i]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + label_ids += [no_entity_id] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids)) + + labels.append(label_ids) + tokenized_inputs["labels"] = labels + return tokenized_inputs + + train_ds = train_ds.select(range(len(train_ds) - 1)) + column_names = train_ds.column_names + train_ds = train_ds.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + + ignore_label = -100 + + batchify_fn = DataCollatorForTokenClassification(tokenizer=tokenizer, label_pad_token_id=ignore_label) + + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True + ) + + train_data_loader = DataLoader( + dataset=train_ds, collate_fn=batchify_fn, num_workers=0, batch_sampler=train_batch_sampler, return_list=True + ) + + test_ds = raw_datasets["test"] + test_ds = test_ds.select(range(len(test_ds) - 1)) + test_ds = test_ds.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + + test_data_loader = DataLoader( + dataset=test_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + if args.dataset == "peoples_daily_ner": + dev_ds = raw_datasets["validation"] + dev_ds = dev_ds.select(range(len(dev_ds) - 1)) + dev_ds = dev_ds.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + + dev_data_loader = DataLoader( + dataset=dev_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + # Define the model netword and its loss + model = AutoForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=label_num) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
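+    # `apply_decay_param_fun` receives each parameter's name and applies weight decay
+    # only to the names collected in `decay_params`.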
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + metric = ChunkEvaluator(label_list=label_list) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + logits = model(batch["input_ids"], batch["token_type_ids"]) + loss = loss_fct(logits, batch["labels"]) + avg_loss = paddle.mean(loss) + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, avg_loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + avg_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if paddle.distributed.get_rank() == 0: + if args.dataset == "peoples_daily_ner": + evaluate(model, loss_fct, metric, dev_data_loader, label_num, "valid") + evaluate(model, loss_fct, metric, test_data_loader, label_num, "test") + + paddle.save(model.state_dict(), os.path.join(args.output_dir, "model_%d.pdparams" % global_step)) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + args = parser.parse_args() + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + do_train(args) diff --git a/examples/information_extraction/waybill_ie/README.md b/examples/information_extraction/waybill_ie/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c842c91c7242aa8916665df0199f0a252f4cf009 --- /dev/null +++ b/examples/information_extraction/waybill_ie/README.md @@ -0,0 +1,102 @@ +# 快递单信息抽取 (Waybill Information Extraction) + +## 简介 + +本示例将通过BiGRU-CRF和ERNIE + FC两类模型,演示如何从用户提供的快递单中,抽取姓名、电话、省、市、区、详细地址等内容,形成结构化信息。辅助物流行业从业者进行有效信息的提取,从而降低客户填单的成本。 + +## 快速开始 + +### 数据准备 + +执行以下命令,下载并解压示例数据集: + +```bash +python download.py --data_dir ./waybill_ie +``` + +数据示例如下: + +``` +1^B6^B6^B2^B0^B2^B0^B0^B0^B7^B7^B宣^B荣^B嗣^B甘^B肃^B省^B白^B银^B市^B会^B宁^B县^B河^B畔^B镇^B十^B字^B街^B金^B海^B超^B市^B西^B行^B5^B0^B米 T-B^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BP-B^BP-I^BP-I^BA1-B^BA1-I^BA1-I^BA2-B^BA2-I^BA2-I^BA3-B^BA3-I^BA3-I^BA4-B^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I +1^B3^B5^B5^B2^B6^B6^B4^B3^B0^B7^B姜^B骏^B炜^B云^B南^B省^B德^B宏^B傣^B族^B景^B颇^B族^B自^B治^B州^B盈^B江^B县^B平^B原^B镇^B蜜^B回^B路^B下^B段 T-B^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BP-B^BP-I^BP-I^BA1-B^BA1-I^BA1-I^BA2-B^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA3-B^BA3-I^BA3-I^BA4-B^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I +``` +数据集中以特殊字符"\t"分隔文本、标签,以特殊字符"\002"(示例中显示为"^B")分隔每个字。标签的定义如下: + +| 标签 | 定义 | 标签 | 定义 | +| -------- | -------- |-------- | -------- | +| P-B | 姓名起始位置 | P-I | 姓名中间位置或结束位置 | +| T-B | 电话起始位置 | T-I | 电话中间位置或结束位置 | +| A1-B | 省份起始位置 | A1-I | 省份中间位置或结束位置 | +| A2-B | 城市起始位置 | A2-I | 城市中间位置或结束位置 | +| A3-B | 县区起始位置 | A3-I | 县区中间位置或结束位置 | +| A4-B | 详细地址起始位置 | A4-I | 详细地址中间位置或结束位置 | +| O | 无关字符 | | | + +数据标注采用**BIO模式**。其中 B(begin) 表示一个标签类别的开头,比如 P-B 指的是姓名的开头;相应的,I(inside) 表示一个标签的延续。O表示Outside无关字符。更多标注模式介绍请参考[Inside–outside–beginning 
(tagging)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) + +### 启动训练 + +本项目提供了两种模型结构,一种是BiGRU+CRF结构,另一种是ERNIE+FC结构,前者显存占用小,推理速度快;后者能够在更快收敛并取得更高的精度,但推理速度较慢。 + +#### 启动BiGRU + CRF训练 + +```bash +export CUDA_VISIBLE_DEVICES=0 +python run_bigru_crf.py +``` + +#### 启动ERNIE + FC训练 + +```bash +export CUDA_VISIBLE_DEVICES=0 +python run_ernie.py +``` +##### 模型导出 +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在output_path指定路径中。 运行方式: + +基于 `ERNIE` 的模型结构的导出方式 + +```bash +python export_ernie_model.py --params_path ernie_ckpt/model_80/model_state.pdparams --output_path=./output +``` + +基于 `ERNIE + CRF` 的模型结构的导出方式 + +```bash +python export_ernie_crf_model.py --params_path ernie_ckpt/model_80/model_state.pdparams --output_path=./output +``` + +基于 `BIGRU + CRF` 的模型结构的导出方式 + +```bash +python export_bigru_crf_model.py --params_path bigru_crf_ckpt/model_80/model_state.pdparams --output_path=./output +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +#### 模型部署 +导出模型之后,可以用于部署,deploy/python文件提供了python部署预测示例。运行方式: + +基于 `ERNIE` 的模型 + +```bash +python deploy/python/predict_ernie.py --model_dir ./output +``` + +基于 `ERNIE + CRF` 的模型 + +```bash +python deploy/python/predict_ernie_crf.py --model_dir ./output +``` + +基于 `BIGRU + CRF` 的模型 + +```bash +python deploy/python/predict_bigru_crf.py --model_dir ./output +``` + +## 更多详细教程请参考: + +[基于Bi-GRU+CRF的快递单信息抽取](https://aistudio.baidu.com/aistudio/projectdetail/1317771) + +[使用预训练模型ERNIE优化快递单信息抽取](https://aistudio.baidu.com/aistudio/projectdetail/1329361) diff --git a/examples/information_extraction/waybill_ie/data.py b/examples/information_extraction/waybill_ie/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d276b45510749ffb56ba6636827dbc43b1d4c49c --- /dev/null +++ b/examples/information_extraction/waybill_ie/data.py @@ -0,0 +1,79 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.datasets import MapDataset + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_dataset(datafiles): + def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. 
+ + Returns: + outputs (list): the formatted output. + """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py b/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..2578b69c4c6c09228e4970416e9cde9d89e15e80 --- /dev/null +++ b/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py @@ -0,0 +1,290 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from functools import partial + +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_dir", type=str, default="./output", help="The path to parameters in static graph.") +parser.add_argument( + "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located." +) +parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", +) +parser.add_argument( + "--use_tensorrt", default=False, type=eval, choices=[True, False], help="Enable to use tensorrt to speed up." +) +parser.add_argument( + "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The tensorrt precision." +) +parser.add_argument("--cpu_threads", default=10, type=int, help="Number of threads to predict when using cpu.") +parser.add_argument( + "--enable_mkldnn", + default=False, + type=eval, + choices=[True, False], + help="Enable to use mkldnn to speed up when using cpu.", +) +parser.add_argument( + "--benchmark", type=eval, default=False, help="To log some information about environment and running." 
+) +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. + + Returns: + outputs (list): the formatted output. + """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + print(predictions[idx][:end]) + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def convert_tokens_to_ids(tokens, vocab, oov_token=None): + token_ids = [] + oov_id = vocab.get(oov_token) if oov_token else None + for token in tokens: + token_id = vocab.get(token, oov_id) + token_ids.append(token_id) + return token_ids + + +def convert_to_features(example, word_vocab): + tokens = example[0] + token_ids = convert_tokens_to_ids(tokens, word_vocab, "OOV") + return token_ids, len(token_ids) + + +def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + batch_size=200, + use_tensorrt=False, + precision="fp32", + enable_mkldnn=False, + benchmark=False, + save_log_path="", + ): + self.batch_size = batch_size + model_file = os.path.join(model_dir, "inference.pdmodel") + param_file = os.path.join(model_dir, "inference.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(param_file): + raise ValueError("not find params file path {}".format(param_file)) + config = paddle.inference.Config(model_file, param_file) + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + 
precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-3.0-medium-zh", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, dataset, batchify_fn, word_vocab, label_vocab): + if args.benchmark: + self.autolog.times.start() + all_preds = [] + all_lens = [] + num_of_examples = len(dataset) + trans_func = partial(convert_to_features, word_vocab=word_vocab) + start_idx = 0 + while start_idx < num_of_examples: + end_idx = start_idx + self.batch_size + end_idx = end_idx if end_idx < num_of_examples else num_of_examples + batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] + + if args.benchmark: + self.autolog.times.stamp() + input_ids, lens = batchify_fn(batch_data) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(lens) + self.predictor.run() + preds = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + # Drop CLS prediction + all_preds.append(preds) + print(preds.shape) + all_lens.append(lens) + + start_idx += self.batch_size + + if args.benchmark: + self.autolog.times.end(stamp=True) + sentences = [example[0] for example in dataset.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + word_vocab = load_dict(os.path.join(args.data_dir, "word.dic")) + + trans_func = partial(convert_to_features, word_vocab=word_vocab) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=word_vocab.get("OOV", 0), dtype="int64"), # token_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + predictor = Predictor( + args.model_dir, + args.device, + args.batch_size, + args.use_tensorrt, + args.precision, + args.enable_mkldnn, + args.benchmark, + args.save_log_path, + ) + + results = predictor.predict(test_ds, batchify_fn, word_vocab, label_vocab) + print("\n".join(results)) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py new file mode 100644 index 
0000000000000000000000000000000000000000..bdd3ccfeba9b0808c0f5434915bbb0bc5f7e7eb5 --- /dev/null +++ b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py @@ -0,0 +1,283 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_dir", type=str, default="./output", help="The path to parameters in static graph.") +parser.add_argument( + "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located." +) +parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", +) +parser.add_argument( + "--use_tensorrt", default=False, type=eval, choices=[True, False], help="Enable to use tensorrt to speed up." +) +parser.add_argument( + "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The tensorrt precision." +) +parser.add_argument("--cpu_threads", default=10, type=int, help="Number of threads to predict when using cpu.") +parser.add_argument( + "--enable_mkldnn", + default=False, + type=eval, + choices=[True, False], + help="Enable to use mkldnn to speed up when using cpu.", +) +parser.add_argument( + "--benchmark", type=eval, default=False, help="To log some information about environment and running." +) +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. 
+ + Returns: + outputs (list): the formatted output. + """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def convert_to_features(example, tokenizer): + tokens = example[0] + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] + + +def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + batch_size=200, + use_tensorrt=False, + precision="fp32", + enable_mkldnn=False, + benchmark=False, + save_log_path="", + ): + self.batch_size = batch_size + model_file = os.path.join(model_dir, "inference.pdmodel") + param_file = os.path.join(model_dir, "inference.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(param_file): + raise ValueError("not find params file path {}".format(param_file)) + config = paddle.inference.Config(model_file, param_file) + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-3.0-medium-zh", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", 
"postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, dataset, batchify_fn, tokenizer, label_vocab): + if args.benchmark: + self.autolog.times.start() + all_preds = [] + all_lens = [] + num_of_examples = len(dataset) + trans_func = partial(convert_to_features, tokenizer=tokenizer) + start_idx = 0 + while start_idx < num_of_examples: + end_idx = start_idx + self.batch_size + end_idx = end_idx if end_idx < num_of_examples else num_of_examples + batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] + + if args.benchmark: + self.autolog.times.stamp() + input_ids, segment_ids, lens = batchify_fn(batch_data) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + preds = np.argmax(logits, axis=-1) + # Drop CLS prediction + preds = preds[:, 1:] + all_preds.append(preds) + all_lens.append(lens) + + start_idx += self.batch_size + + if args.benchmark: + self.autolog.times.end(stamp=True) + sentences = [example[0] for example in dataset.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + predictor = Predictor( + args.model_dir, + args.device, + args.batch_size, + args.use_tensorrt, + args.precision, + args.enable_mkldnn, + args.benchmark, + args.save_log_path, + ) + + results = predictor.predict(test_ds, batchify_fn, tokenizer, label_vocab) + print("\n".join(results)) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..1158a49aafe25b4b3cc7ec6677a3f52978a927dc --- /dev/null +++ b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py @@ -0,0 +1,263 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +import os +from functools import partial + +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_dir", type=str, default='./output', help="The path to parameters in static graph.") +parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") +parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. + + Returns: + outputs (list): the formatted output. 
+ """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def convert_to_features(example, tokenizer): + tokens = example[0] + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] + + +def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + batch_size=200, + use_tensorrt=False, + precision="fp32", + enable_mkldnn=False, + benchmark=False, + save_log_path="", + ): + self.batch_size = batch_size + model_file = os.path.join(model_dir, "inference.pdmodel") + param_file = os.path.join(model_dir, "inference.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(param_file): + raise ValueError("not find params file path {}".format(param_file)) + config = paddle.inference.Config(model_file, param_file) + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-3.0-medium-zh", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def 
predict(self, dataset, batchify_fn, tokenizer, label_vocab): + if args.benchmark: + self.autolog.times.start() + all_preds = [] + all_lens = [] + num_of_examples = len(dataset) + trans_func = partial(convert_to_features, tokenizer=tokenizer) + start_idx = 0 + while start_idx < num_of_examples: + end_idx = start_idx + self.batch_size + end_idx = end_idx if end_idx < num_of_examples else num_of_examples + batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] + + if args.benchmark: + self.autolog.times.stamp() + input_ids, segment_ids, lens = batchify_fn(batch_data) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.input_handles[2].copy_from_cpu(lens) + self.predictor.run() + preds = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + preds = [pred[1:] for pred in preds] + all_preds.append(preds) + all_lens.append(lens) + + start_idx += self.batch_size + + if args.benchmark: + self.autolog.times.end(stamp=True) + sentences = [example[0] for example in dataset.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + predictor = Predictor( + args.model_dir, + args.device, + args.batch_size, + args.use_tensorrt, + args.precision, + args.enable_mkldnn, + args.benchmark, + args.save_log_path, + ) + + results = predictor.predict(test_ds, batchify_fn, tokenizer, label_vocab) + print("\n".join(results)) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/download.py b/examples/information_extraction/waybill_ie/download.py new file mode 100644 index 0000000000000000000000000000000000000000..a76b56b99aeed624ec9c71b8b8b512ea7889ad35 --- /dev/null +++ b/examples/information_extraction/waybill_ie/download.py @@ -0,0 +1,32 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
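+
+# Downloads and unpacks the waybill example dataset. A possible invocation
+# (the target directory is just the -d/--data_dir default):
+#
+#   python download.py --data_dir ./
+#
+# get_path_from_url fetches waybill.tar.gz from the URL below and extracts it
+# under data_dir.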
+ +import argparse +import sys + +from paddle.utils.download import get_path_from_url + +URL = "https://bj.bcebos.com/paddlenlp/paddlenlp/datasets/waybill.tar.gz" + + +def main(arguments): + parser = argparse.ArgumentParser() + parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="./") + args = parser.parse_args(arguments) + get_path_from_url(URL, args.data_dir) + + +if __name__ == "__main__": + sys.exit(main(sys.argv[1:])) diff --git a/examples/information_extraction/waybill_ie/export_bigru_crf_model.py b/examples/information_extraction/waybill_ie/export_bigru_crf_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b439dc30836afd303e81344b39958021d58d8c3d --- /dev/null +++ b/examples/information_extraction/waybill_ie/export_bigru_crf_model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data import load_dict +from model import BiGRUWithCRF + +parser = argparse.ArgumentParser() +parser.add_argument( + "--params_path", + type=str, + required=True, + default="./checkpoint/model_900/model_state.pdparams", + help="The path to model parameters to be loaded.", +) +parser.add_argument( + "--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved." +) +parser.add_argument( + "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located." +) +args = parser.parse_args() + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + word_vocab = load_dict(os.path.join(args.data_dir, "word.dic")) + + # Define the model netword and its loss + model = BiGRUWithCRF(300, 256, len(word_vocab), len(label_vocab)) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None], dtype="int64"), # lengths + ], + ) + + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/export_ernie_crf_model.py b/examples/information_extraction/waybill_ie/export_ernie_crf_model.py new file mode 100644 index 0000000000000000000000000000000000000000..9f1d6839e54e52047f5a96b11ee454c67ad9aac5 --- /dev/null +++ b/examples/information_extraction/waybill_ie/export_ernie_crf_model.py @@ -0,0 +1,55 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data import load_dict +from model import ErnieCrfForTokenClassification + +from paddlenlp.transformers import AutoModelForTokenClassification + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default="./checkpoint/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved.") +parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + # Define the model netword and its loss + ernie = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + model = ErnieCrfForTokenClassification(ernie) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + paddle.static.InputSpec(shape=[None], dtype="int64"), # lengths + ], + ) + + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/export_ernie_model.py b/examples/information_extraction/waybill_ie/export_ernie_model.py new file mode 100644 index 0000000000000000000000000000000000000000..2436a98b4af8878c7ce59fb29efbd5132f85efcc --- /dev/null +++ b/examples/information_extraction/waybill_ie/export_ernie_model.py @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
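+
+# Converts a fine-tuned dygraph checkpoint into a static graph for inference.
+# A possible invocation (the checkpoint path is an assumption; point it at a
+# model_state.pdparams saved by run_ernie.py):
+#
+#   python export_ernie_model.py \
+#       --params_path ./ernie_ckpt/model_<step>/model_state.pdparams \
+#       --output_path ./output
+#
+# paddle.jit.save then writes output/inference.pdmodel and output/inference.pdiparams,
+# which deploy/python/predict_ernie.py loads via its --model_dir argument.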
+ +import argparse +import os + +import paddle +from data import load_dict + +from paddlenlp.transformers import AutoModelForTokenClassification + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default="./checkpoint/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved.") +parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/model.py b/examples/information_extraction/waybill_ie/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d6d1e8dfb36fa194dad4b7030c4073ec9151ca6b --- /dev/null +++ b/examples/information_extraction/waybill_ie/model.py @@ -0,0 +1,76 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +import paddle.nn as nn + +from paddlenlp.embeddings import TokenEmbedding +from paddlenlp.layers.crf import LinearChainCrf, LinearChainCrfLoss +from paddlenlp.utils.tools import compare_version + +if compare_version(paddle.version.full_version, "2.2.0") >= 0: + # paddle.text.ViterbiDecoder is supported by paddle after version 2.2.0 + from paddle.text import ViterbiDecoder +else: + from paddlenlp.layers.crf import ViterbiDecoder + + +class BiGRUWithCRF(nn.Layer): + def __init__(self, emb_size, hidden_size, word_num, label_num, use_w2v_emb=False): + super(BiGRUWithCRF, self).__init__() + if use_w2v_emb: + self.word_emb = TokenEmbedding(extended_vocab_path="./data/word.dic", unknown_token="OOV") + else: + self.word_emb = nn.Embedding(word_num, emb_size) + self.gru = nn.GRU(emb_size, hidden_size, num_layers=2, direction="bidirect") + # We need `label_num + 2` for appending BOS and EOS tag + self.fc = nn.Linear(hidden_size * 2, label_num + 2) + self.crf = LinearChainCrf(label_num) + self.crf_loss = LinearChainCrfLoss(self.crf) + self.viterbi_decoder = ViterbiDecoder(self.crf.transitions) + + def forward(self, inputs, lengths, labels=None): + embs = self.word_emb(inputs) + output, _ = self.gru(embs) + emission = self.fc(output) + if labels is not None: + loss = self.crf_loss(emission, lengths, labels) + return loss + else: + _, prediction = self.viterbi_decoder(emission, lengths) + return prediction + + +class ErnieCrfForTokenClassification(nn.Layer): + def __init__(self, ernie, crf_lr=100): + super().__init__() + self.num_labels = ernie.num_labels + self.ernie = ernie # allow ernie to be config + self.crf = LinearChainCrf(self.num_labels, crf_lr=crf_lr, with_start_stop_tag=False) + self.crf_loss = LinearChainCrfLoss(self.crf) + self.viterbi_decoder = ViterbiDecoder(self.crf.transitions, False) + + def forward( + self, input_ids, token_type_ids=None, lengths=None, position_ids=None, attention_mask=None, labels=None + ): + logits = self.ernie( + input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, position_ids=position_ids + ) + + if labels is not None: + loss = self.crf_loss(logits, lengths, labels) + return loss + else: + _, prediction = self.viterbi_decoder(logits, lengths) + return prediction diff --git a/examples/information_extraction/waybill_ie/run_bigru_crf.py b/examples/information_extraction/waybill_ie/run_bigru_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..f458d36de5b39123cf387f1dc3e55c749af2eba8 --- /dev/null +++ b/examples/information_extraction/waybill_ie/run_bigru_crf.py @@ -0,0 +1,149 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
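+
+# Trains the BiGRU-CRF tagger on the waybill dataset. A possible invocation
+# (paths mirror the argparse defaults below):
+#
+#   python run_bigru_crf.py --data_dir ./waybill_ie/data --device gpu
+#
+# Checkpoints are written to <save_dir>/model_<step>/model_state.pdparams and can
+# later be converted to a static graph with export_bigru_crf_model.py.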
+ +import argparse +import os +from functools import partial + +import paddle +from data import load_dataset, load_dict, parse_decodes +from model import BiGRUWithCRF + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--save_dir", default='./bigru_crf_ckpt', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--data_dir", default='./waybill_ie/data', type=str, help="The folder where the dataset is located.") + +args = parser.parse_args() +# yapf: enable + + +def convert_tokens_to_ids(tokens, vocab, oov_token=None): + token_ids = [] + oov_id = vocab.get(oov_token) if oov_token else None + for token in tokens: + token_id = vocab.get(token, oov_id) + token_ids.append(token_id) + return token_ids + + +def convert_to_features(example, word_vocab, label_vocab): + tokens, labels = example + token_ids = convert_tokens_to_ids(tokens, word_vocab, "OOV") + label_ids = convert_tokens_to_ids(labels, label_vocab, "O") + return token_ids, len(token_ids), label_ids + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for token_ids, lengths, label_ids in data_loader: + preds = model(token_ids, lengths) + n_infer, n_label, n_correct = metric.compute(lengths, preds, label_ids) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score)) + model.train() + + +@paddle.no_grad() +def predict(model, data_loader, ds, label_vocab): + all_preds = [] + all_lens = [] + for token_ids, lengths, label_ids in data_loader: + preds = model(token_ids, lengths) + all_preds.append(preds.numpy()) + all_lens.append(lengths) + sentences = [example[0] for example in ds.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. 
+ train_ds, dev_ds, test_ds = load_dataset( + datafiles=( + os.path.join(args.data_dir, "train.txt"), + os.path.join(args.data_dir, "dev.txt"), + os.path.join(args.data_dir, "test.txt"), + ) + ) + + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + word_vocab = load_dict(os.path.join(args.data_dir, "word.dic")) + + trans_func = partial(convert_to_features, word_vocab=word_vocab, label_vocab=label_vocab) + train_ds.map(trans_func) + dev_ds.map(trans_func) + test_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=word_vocab.get("OOV", 0), dtype="int32"), # token_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # label_ids + ): fn(samples) + + train_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_size=args.batch_size, + shuffle=True, + drop_last=True, + return_list=True, + collate_fn=batchify_fn, + ) + + dev_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_size=args.batch_size, drop_last=True, return_list=True, collate_fn=batchify_fn + ) + + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_size=args.batch_size, drop_last=True, return_list=True, collate_fn=batchify_fn + ) + + # Define the model netword and its loss + model = BiGRUWithCRF(300, 256, len(word_vocab), len(label_vocab)) + + optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()) + metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + + step = 0 + for epoch in range(args.epochs): + for token_ids, lengths, label_ids in train_loader: + loss = model(token_ids, lengths, label_ids) + loss = loss.mean() + loss.backward() + optimizer.step() + optimizer.clear_grad() + step += 1 + print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, loss)) + evaluate(model, metric, dev_loader) + paddle.save(model.state_dict(), os.path.join(args.save_dir, "model_%d" % step, "model_state.pdparams")) + + preds = predict(model, test_loader, test_ds, label_vocab) + file_path = "bigru_crf_results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved into: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) diff --git a/examples/information_extraction/waybill_ie/run_ernie.py b/examples/information_extraction/waybill_ie/run_ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..d21baad79a7772be433eb8c11d363787bd800953 --- /dev/null +++ b/examples/information_extraction/waybill_ie/run_ernie.py @@ -0,0 +1,166 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
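+
+# Fine-tunes an ERNIE token-classification model on the waybill dataset. The
+# script initializes paddle.distributed when more than one trainer is detected,
+# so multi-card training can be launched as below (a sketch; the GPU ids are an
+# example):
+#
+#   python -m paddle.distributed.launch --gpus "0,1" run_ernie.py \
+#       --data_dir ./waybill_ie/data --save_dir ./ernie_ckpt
+#
+# A plain `python run_ernie.py ...` runs single-card training.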
+ +import argparse +import os +from functools import partial + +import paddle +from data import load_dataset, load_dict, parse_decodes + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default="./ernie_ckpt", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--data_dir", default="./waybill_ie/data", type=str, help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + + +def convert_to_features(example, tokenizer, label_vocab): + tokens, labels = example + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + labels = ["O"] + labels + ["O"] + tokenized_input["labels"] = [label_vocab[x] for x in labels] + return ( + tokenized_input["input_ids"], + tokenized_input["token_type_ids"], + tokenized_input["seq_len"], + tokenized_input["labels"], + ) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for input_ids, seg_ids, lens, labels in data_loader: + logits = model(input_ids, seg_ids) + preds = paddle.argmax(logits, axis=-1) + n_infer, n_label, n_correct = metric.compute(lens, preds, labels) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score)) + model.train() + + +@paddle.no_grad() +def predict(model, data_loader, ds, label_vocab): + all_preds = [] + all_lens = [] + for input_ids, seg_ids, lens, labels in data_loader: + logits = model(input_ids, seg_ids) + preds = paddle.argmax(logits, axis=-1) + # Drop CLS prediction + preds = [pred[1:] for pred in preds.numpy()] + all_preds.append(preds) + all_lens.append(lens) + sentences = [example[0] for example in ds.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + # Create dataset, tokenizer and dataloader. 
+ train_ds, dev_ds, test_ds = load_dataset( + datafiles=( + os.path.join(args.data_dir, "train.txt"), + os.path.join(args.data_dir, "dev.txt"), + os.path.join(args.data_dir, "test.txt"), + ) + ) + + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_to_features, tokenizer=tokenizer, label_vocab=label_vocab) + + train_ds.map(trans_func) + dev_ds.map(trans_func) + test_ds.map(trans_func) + + ignore_label = -1 + + def batchify_fn(samples): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"), # token_type_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # labels + ) + return fn(samples) + + train_loader = create_dataloader( + dataset=train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn + ) + + dev_loader = create_dataloader(dataset=dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn) + + test_loader = create_dataloader(dataset=test_ds, mode="test", batch_size=args.batch_size, batchify_fn=batchify_fn) + + # Define the model netword and its loss + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + if trainer_num > 1: + model = paddle.DataParallel(model) + metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters()) + + step = 0 + for epoch in range(args.epochs): + for input_ids, token_type_ids, length, labels in train_loader: + logits = model(input_ids, token_type_ids) + loss = paddle.mean(loss_fn(logits, labels)) + loss.backward() + optimizer.step() + optimizer.clear_grad() + step += 1 + print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, loss)) + evaluate(model, metric, dev_loader) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(os.path.join(args.save_dir, "model_%d" % step)) + + if rank == 0: + preds = predict(model, test_loader, test_ds, label_vocab) + file_path = "ernie_results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) diff --git a/examples/information_extraction/waybill_ie/run_ernie_crf.py b/examples/information_extraction/waybill_ie/run_ernie_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..b9d03b77e6437e12745cc70bbbe35f3301b2a2cc --- /dev/null +++ b/examples/information_extraction/waybill_ie/run_ernie_crf.py @@ -0,0 +1,147 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
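+
+# A sketch of the intended end-to-end flow for the ERNIE-CRF variant, pieced
+# together from the scripts in this example (paths are illustrative, matching
+# the defaults of each script):
+#
+#   1. python run_ernie_crf.py --data_dir ./waybill_ie/data
+#      # checkpoints land in ./ernie_crf_ckpt/model_<step>/model_state.pdparams
+#   2. python export_ernie_crf_model.py \
+#          --params_path ./ernie_crf_ckpt/model_<step>/model_state.pdparams \
+#          --output_path ./output
+#   3. python deploy/python/predict_ernie_crf.py --model_dir ./output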
+ +import argparse +import os +from functools import partial + +import paddle +from data import load_dataset, load_dict, parse_decodes +from model import ErnieCrfForTokenClassification + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default="./ernie_crf_ckpt", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--data_dir", default="./waybill_ie/data", type=str, help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + + +def convert_to_features(example, tokenizer, label_vocab): + tokens, labels = example + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + labels = ["O"] + labels + ["O"] + tokenized_input["labels"] = [label_vocab[x] for x in labels] + return ( + tokenized_input["input_ids"], + tokenized_input["token_type_ids"], + tokenized_input["seq_len"], + tokenized_input["labels"], + ) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for input_ids, seg_ids, lens, labels in data_loader: + preds = model(input_ids, seg_ids, lengths=lens) + n_infer, n_label, n_correct = metric.compute(lens, preds, labels) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score)) + model.train() + + +@paddle.no_grad() +def predict(model, data_loader, ds, label_vocab): + all_preds = [] + all_lens = [] + for input_ids, seg_ids, lens, labels in data_loader: + preds = model(input_ids, seg_ids, lengths=lens) + # Drop CLS prediction + preds = [pred[1:] for pred in preds.numpy()] + all_preds.append(preds) + all_lens.append(lens) + sentences = [example[0] for example in ds.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. 
+ train_ds, dev_ds, test_ds = load_dataset( + datafiles=( + os.path.join(args.data_dir, "train.txt"), + os.path.join(args.data_dir, "dev.txt"), + os.path.join(args.data_dir, "test.txt"), + ) + ) + + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_to_features, tokenizer=tokenizer, label_vocab=label_vocab) + + train_ds.map(trans_func) + dev_ds.map(trans_func) + test_ds.map(trans_func) + + def batchify_fn(samples): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"), # token_type_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # labels + ) + return fn(samples) + + train_loader = paddle.io.DataLoader( + dataset=train_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn + ) + dev_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn + ) + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn + ) + + # Define the model netword and its loss + ernie = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + model = ErnieCrfForTokenClassification(ernie) + + metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters()) + + step = 0 + for epoch in range(args.epochs): + for input_ids, token_type_ids, lengths, labels in train_loader: + loss = model(input_ids, token_type_ids, lengths=lengths, labels=labels) + avg_loss = paddle.mean(loss) + avg_loss.backward() + optimizer.step() + optimizer.clear_grad() + step += 1 + print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, avg_loss)) + evaluate(model, metric, dev_loader) + + paddle.save(model.state_dict(), os.path.join(args.save_dir, "model_%d" % step, "model_state.pdparams")) + + preds = predict(model, test_loader, test_ds, label_vocab) + file_path = "ernie_crf_results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) diff --git a/examples/language_model/bert b/examples/language_model/bert new file mode 100644 index 0000000000000000000000000000000000000000..d4b3e50787afc1c64b23948635b163a7f181e76c --- /dev/null +++ b/examples/language_model/bert @@ -0,0 +1 @@ +../../model_zoo/bert \ No newline at end of file diff --git a/examples/language_model/bigbird/README.md b/examples/language_model/bigbird/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9bc3f26b8346b94e4f496086320dfe4df40035a0 --- /dev/null +++ b/examples/language_model/bigbird/README.md @@ -0,0 +1,151 @@ +# Big Bird + +## 模型介绍 +[Big Bird](https://arxiv.org/abs/2007.14062)(Transformers for Longer Sequences) 是Google的研究人员提出的针对长序列预训练模型,使用了稀疏注意力机制,将计算复杂度、空间复杂度降到线性复杂度,大大提升了长序列任务的预测能力。 + +本项目是 Big Bird 的 PaddlePaddle 实现, 包含模型训练,模型验证等内容。以下是本例的简要目录结构及说明: + +```text +. 
+├── args.py # 预训练任务的配置 +├── run_classifier.py # IMDB数据集的分类任务 +├── run_pretrain.py # 预训练任务脚本 +├── README.md # 文档 +└── data/ # 示例数据 +``` +## 快速开始 + +### 环境依赖 + +- sentencepiece + +安装命令:`pip install sentencepiece` + +### 数据准备 +根据论文中的信息,目前 Big Bird 的预训练数据是主要是由 Books,CC-News,Stories, Wikipedia 4种预训练数据来构造,用户可以根据自己的需要来下载和清洗相应的数据。目前已提供一份示例数据在 data 目录。 + + +### 预训练任务 + +下面是预训练任务的具体的执行方式 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir log run_pretrain.py --model_name_or_path bigbird-base-uncased \ + --input_dir "./data" \ + --output_dir "output" \ + --batch_size 4 \ + --weight_decay 0.01 \ + --learning_rate 1e-5 \ + --max_steps 100000 \ + --save_steps 10000 \ + --logging_steps 1 \ + --max_encoder_length 512 \ + --max_pred_length 75 +``` + +其中参数释义如下: + +- `gpus` paddle.distributed.launch参数,用于指定使用哪张显卡。单卡格式:"0";多卡格式:"0,1,2"。 +- `log_dir` paddle.distributed.launch参数,用于指定训练日志输出的目录,默认值为`log`。(注意,如果需要在同一目录多次启动run_pretrain.py,需要设置不同的log_dir,否则日志会重定向到相同的文件中)。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。目前支持的训练模型配置有:"bigbird-base-uncased"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/" +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `batch_size` 训练的batch大小 +- `weight_decay` AdamW权重衰减参数 +- `learning_rate` 训练的学习率 +- `max_steps` 最大训练步数 +- `save_steps` 保存模型间隔 +- `logging_steps` 打印日志的步数 +- `max_encoder_length` MLM任务的最大的token数目 +- `max_pred_length` MLM任务最大的需要预测token的数目 + + +### 验证任务 + +#### Imdb分类任务 +通过预训练任务训练完成之后,可以预训练的模型参数,在 Big Bird 的验证任务中通过IMDB数据集来进行最终模型效果的验证,[IMDB数据集](http://ai.stanford.edu/~amaas/data/sentiment/) ,IMDB数据集是关于电影用户评论情感分析的数据集,主要是包含了50000条偏向明显的评论,其中25000条作为训练集,25000作为测试集。label为pos(positive)和neg(negative),是一个序列文本分类任务,具体的执行脚本如下。 + + +```shell +export CUDA_VISIBLE_DEVICES=0 +python run_classifier.py --model_name_or_path bigbird-base-uncased \ + --output_dir "output" \ + --batch_size 2 \ + --learning_rate 5e-6 \ + --max_steps 16000 \ + --save_steps 1000 \ + --max_encoder_length 3072 +``` + +其中参数释义如下: + +- `model_name_or_path` 指示了finetune使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"bigbird-base-uncased"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `output_dir` 指定输出文件。 +- `batch_size` 训练的batch大小。 +- `learning_rate` 训练的学习率。 +- `max_steps` 最大训练步数。 +- `save_steps` 保存模型间隔。 +- `logging_steps` 打印日志的步数。 +- `max_encoder_length` MLM任务的最大的token数目。 + + +基于`bigbird-base-uncased`在IMDB评测任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:-----------------:| +| IMDB | Accuracy | 0.9449 | + +#### Glue任务 + +以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type bigbird \ + --model_name_or_path bigbird-base-uncased \ + --task_name SST-2 \ + --max_encoder_length 128 \ + --batch_size 32 \ + --learning_rate 1e-5 \ + --epochs 5 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用bigbird模型时设置为bigbird即可。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"bigbird-base-uncased"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `task_name` 表示Fine-tuning的任务。 +- `max_encoder_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 
'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。
+
+基于`bigbird-base-uncased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果:
+
+| Task  | Metric                       | Result            |
+|:-----:|:----------------------------:|:-----------------:|
+| SST-2 | Accuracy                     | 0.9365            |
+| QNLI  | Accuracy                     | 0.9017            |
+| CoLA  | Matthew's corr               | 0.5708            |
+| MRPC  | F1/Accuracy                  | 0.9019 / 0.8603   |
+| STS-B | Pearson/Spearman corr        | 0.8591 / 0.8607   |
+| QQP   | Accuracy/F1                  | 0.9132 / 0.8828   |
+| MNLI  | Matched acc/Mismatched acc   | 0.8615 / 0.8606   |
+| RTE   | Accuracy                     | 0.7004            |
+
+### 致谢
+
+* 感谢[Google 研究团队](https://github.com/google-research/bigbird)提供BigBird开源代码的实现以及预训练模型。
+
+### 参考论文
+
+* Zaheer, et al. "Big bird: Transformers for longer sequences" Advances in Neural Information Processing Systems, 2020
diff --git a/examples/language_model/bigbird/args.py b/examples/language_model/bigbird/args.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d33e05c31b2cbfd1381e7a08f95b95d0d408372
--- /dev/null
+++ b/examples/language_model/bigbird/args.py
@@ -0,0 +1,104 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+
+from paddlenlp.trainer.argparser import strtobool
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model_type", default="bigbird", type=str, help="Model type selected in training model.")
+
+    parser.add_argument(
+        "--model_name_or_path",
+        default="bigbird-base-uncased",
+        type=str,
+        help="Path to pre-trained model or shortcut model name for training model.",
+    )
+
+    parser.add_argument(
+        "--input_dir", default=None, type=str, help="The input directory where the data will be read from."
+    )
+
+    parser.add_argument(
+        "--output_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The output directory where the model predictions and checkpoints will be written.",
+    )
+
+    parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
+
+    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for AdamW.")
+
+    parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.")
+
+    parser.add_argument(
+        "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps."
+    )
+
+    parser.add_argument(
+        "--weight_decay", default=0.01, type=float, help="Weight decay rate applied in the AdamW optimizer."
+    )
+
+    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for AdamW optimizer.")
+
+    parser.add_argument(
+        "--max_steps", default=100000, type=int, help="If > 0: set total number of training steps to perform."
+ ) + + parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") + + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization.") + + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "gpu", "npu"], + help="Select cpu, gpu, xpu, npu devices to train model.", + ) + + parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") + + parser.add_argument( + "--max_encoder_length", + type=int, + default=512, + help="The maximum total input sequence length after SentencePiece tokenization.", + ) + + parser.add_argument( + "--max_pred_length", default=75, type=int, help="The maximum total of masked tokens in input sequence." + ) + + parser.add_argument( + "--use_nsp", default=False, type=bool, help="Whether or not add the nsp loss to the total loss." + ) + + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + + parser.add_argument( + "--task_name", + default="sst-2", + type=str, + required=False, + help="The name of the task to train selected in the list: sst-2, cola, mrpc, sts-b, qqp, mnli, qnli, rte", + ) + + args = parser.parse_args() + return args diff --git a/examples/language_model/bigbird/data/data.txt b/examples/language_model/bigbird/data/data.txt new file mode 100644 index 0000000000000000000000000000000000000000..8f264377effd4e015f777600c0998298257cfa56 --- /dev/null +++ b/examples/language_model/bigbird/data/data.txt @@ -0,0 +1,200 @@ + _START_ARTICLE_ Kasturba Road _START_PARAGRAPH_ Kasturba Road is a street in Bangalore, the capital of Karnataka, India, which is connected to M G Road to the north and J C Road to the south. Some important landmarks situated along Kasturba Road are Sree Kanteerava Stadium, Kanteerava Indoor Stadium, Cubbon Park, Government Museum, Venkatappa Art Gallery, Visvesvaraya Industrial and Technological Museum and UB City. A 600-year-old Ganesha temple is also situated on Kasturba Road._NEWLINE_It was earlier known as Sydney Road._NEWLINE_Other important landmarks close to the road are Karnataka High Court, Vidhana Soudha and Chinnaswamy Stadium. + _START_ARTICLE_ Amazon (yacht) _START_SECTION_ Construction _START_PARAGRAPH_ Carvel planked in teak and pitch pine on oak frames, with alternate wrought iron strap floor reinforcement, bronze fastenings, lead keel and copper sheathing, the Amazon's hull is still largely original. _START_SECTION_ History _START_PARAGRAPH_ Her builder and first owner, Tankerville Chamberlayne, an English gentleman, personally supervised her construction by his own Arrow Yard at Northam on the River Itchen. This small private facility was established by the Chamberlayne family for the maintenance of the famous cutter Arrow, which was adapted continuously and thereby kept racing competitively into the 1890s. Amazon's engine and boiler were supplied by the adjacent works of Day, Summers and Company._NEWLINE_Amazon was used for summer cruising, to attend sailing regattas along the south coast of England, and to visit France. 
Having been prepared appropriately for the occasion of Queen Victoria's Diamond Jubilee Royal Fleet Review in 1897 (at which Turbinia made her debut), she was shortly after sold to a prominent French yachtsman and was based at Saint Malo as Armoricain until 1900, when she returned to British ownership._NEWLINE_Already too old (and with a coal-fired compound engine thought to be rather too old-fashioned) for the First World War, she remained in south coast ports as a private yacht. A new owner took her to London and after 52 years of service her original engine and boiler were removed on her conversion to diesel in 1937. The Second World War put paid to pleasure cruising and she subsequently became a houseboat for some years in a west London Yacht Basin._NEWLINE_The actor Arthur Lowe bought her as a houseboat in 1968 and, encouraged by his surveyor's positive report, made her ready for sea again in 1971; at first a private yacht she then pursued a successful charter business in the 1980s, before migrating to northern Scotland in 1990._NEWLINE_In 1997 she made passage from Scotland to Malta, where her new owners used her for cruising in the Mediterranean. In 2009 Amazon crossed the Atlantic Ocean via the Cape Verde Islands and travelled in the Caribbean and to Bermuda._NEWLINE_She arrived at Newport, Rhode Island, United States from Bermuda on Labor Day 2009. Amazon was hosted by the Herreshoff Marine Museum at Bristol, Rhode Island in October 2009. She spent time in Narragansett Bay. The yacht subsequently travelled to Mystic Seaport in late 2009 and was based there in early 2011. Amazon remained at Mystic Seaport until mid-2011._NEWLINE_Amazon acted as Flagship for the Commodore of the Mystic River Yacht Club for a charity regatta in Long Island Sound in June 2011 and visited Canada in July 2011 _NEWLINE_In August 2011 the yacht made a trans-Atlantic passage from Newfoundland to Ireland, and arrived at Waterford on 2 September 2011 where she was described by a local boat owner as the "classiest motor boat I have ever seen!". She remained at Waterford for the winter._NEWLINE_In May 2012 she visited Bristol before sailing to London, where she took part in the Thames Diamond Jubilee Pageant on Sunday 3 June 2012. She was the only vessel present that had also witnessed the Diamond Jubilee Fleet Review for Queen Victoria at Spithead on 26 June 1897. The Director of National Historic Ships referred to her in his public letter of criticism concerning the BBC's coverage of the event._NEWLINE_She was subsequently at the Ramsgate Maritime Museum until late June, at Shoreham on 28 June 2012, then at Cowes and in the Bassin Vauban at St Malo, France in late July 2012._NEWLINE_In August and September 2012, Amazon was in the Channel Islands, visiting Alderney in August and Jersey in September, berthing in St Helier and Gorey Harbours; on 13 September she was in St Aubin's Bay to watch the 2012 Jersey International Air Display._NEWLINE_She was in Bristol during the winter and at the Southampton Maritime Festival on 5 & 6 May 2013._NEWLINE_On 23 May she was in the Bristol Channel en route to Gloucester where she arrived on 24 May for the city's Tall Ships Festival on 25 & 26 May, and was on the Gloucester and Sharpness Canal during June. She was at Gorey, Jersey on 22 July 2013 and had returned to Malta by October that year. 
+ _START_ARTICLE_ Dolores Kuenz _START_SECTION_ Table tennis career _START_PARAGRAPH_ She won a World Championship gold medal in the Women's Team event at the 1937 World Table Tennis Championships known as the Corbillon Cup. she was the captain of the team. _START_SECTION_ Hall of Fame _START_PARAGRAPH_ She was inducted into the USA Hall of Fame in 1979. + _START_ARTICLE_ Mark E. Clayton _START_SECTION_ Background _START_PARAGRAPH_ Clayton was born in Mobile in south Alabama, but he was reared in Alexandria, Virginia. His parents, Mr. and Mrs. William Gill Clayton, were Goldwater Republicans. His father lobbied Congress on religious liberty issues. Clayton graduated from high school and served in the United States Army Reserve. He studied to be an aircraft electrician before he enrolled at Pensacola Christian College in Pensacola, Florida, from which he graduated in 2002._NEWLINE_In 2003, Clayton moved to Nashville. His father died in 2004 and Clayton purchased a 1920s-era farmhouse on three acres in Whites Creek in suburban Davidson County. Clayton, who has never been married, lives in Whites Creek with his dog, Saint._NEWLINE_Clayton has worked at numerous jobs, including Target, a call center, as a floor installer, and as a salesperson of insurance, siding, and roofs. He is a church youth group leader. He is currently employed with a moving company. _START_SECTION_ 2008 U.S. Senate candidacy _START_PARAGRAPH_ Clayton first ran for U.S. Senate in 2008, when he finished fourth among six candidates in the Democratic primary with 32,309 votes. Bob Tuke won the nomination with 59,000 votes and was then decisively defeated in the general election by the Republican incumbent Lamar Alexander of Maryville, whom Clayton had described as a "neo-conservative", _START_SECTION_ 2012 candidacy for U.S. Senate _START_PARAGRAPH_ After Clayton's primary triumph, the Tennessee Democratic Party disavowed his candidacy and his vice-presidency of the socially conservative interest group, the Public Advocate of the United States, based in Washington, D.C. The group has been labelled a hate group by the Southern Poverty Law Center for its anti-gay rhetoric._NEWLINE_The Tennessee Democratic Party issued this statement:_NEWLINE_The only time that Clayton has voted in a Democratic primary was when he was voting for himself. Many Democrats in Tennessee knew nothing about any of the candidates in the race; so they voted for the person at the top of the ticket. Unfortunately, none of the other Democratic candidates was able to run the race needed to gain statewide visibility or support. Mark Clayton is associated with a known hate group in Washington, D.C., and the Tennessee Democratic Party disavows his candidacy, will not do anything to promote or support him in any way, and urges Democrats to write-in a candidate of their choice in November._NEWLINE_Clayton won the Democratic nomination with 30% of the vote, despite raising no money and having a website that was four years out of date. The next day Tennessee's Democratic Party disavowed the candidate over his active role in the Public Advocate of the United States, which they described as a "known hate group". They blamed his victory among candidates, for whom the TNDP provided little forums to become known, on the fact that his name appeared first on the ballot, and said they would do nothing to help his campaign, urging Democrats to vote for "the write-in candidate of their choice" in November. 
One of the Democratic candidates, Larry Crim, filed a petition seeking to offer the voters a new primary in which to select a Democratic Nominee based on Democratic Chair Chip Forrester permitting Clayton, a nondemocratic candidate, at the top of ballot to benefit a candidate Forrester recruited and improperly endorsed - Overall - who did not win the primary. Forrester then disavowed Clayton he had allowed on the ballot after he received the most votes. The background is that the TNDP placed little emphasis on the U.S. Senate race in 2012 to replace Corker. Treasurer and Financial benefactor of the TNDP Bill Freeman who was under Forrester actually contributed to Republican Bob Corker's campaign and was later removed from office, followed by Forrester's not seeking another term subsequent to the TNDP 2012 fiasco. Crim filed a preliminary motion seeking a temporary restraining order against certification of the results until the merits of the case for a new primary could be decided. Yet, after a judge denied the temporary restraining order Crim withdrew his petition stating at a news conference outside the Federal Courthouse that the costs of proceeding and the costs of a new primary to the Democratic Party, even if Crim won, would be overwhelming especially given the political realities the party leaders conducted and permitted to the detriment of any Democratic Nominee. Mr. Crim was subsequently elected Chair of Democrats United For Tennessee in 2012._NEWLINE_Clayton's nomination has been compared to that of the previously unknown Alvin Greene in the Democratic primary in the 2010 Senate race in South Carolina. Greene was then handily defeated by the Republican Jim DeMint. _START_SECTION_ 2014 candidacy for Tennessee governor _START_PARAGRAPH_ Clayton attempted to register as a candidate in the Democratic primary for the Tennessee governor in 2014. The Democratic Party of Tennessee denied his attempt to run in the primary election, describing him as, "not a bona fide Democrat." Clayton filed a suit in federal court, but lost when the judge found there were no grounds for the suit at the federal level. _START_SECTION_ Political positions _START_PARAGRAPH_ Clayton opposes abortion and same-sex marriage. He believes that the Transportation Security Administration (TSA) should be shut down, stating that the TSA "mandates [transsexuals] and homosexuals grabbing children in their stranger-danger zones." He opposes national ID cards. The Washington Post and Vox describe him as a conspiracy theorist. _START_SECTION_ Public Advocate of the United States _START_PARAGRAPH_ Clayton's work at the Public Advocate of the United States, a conservative group based in Washington, D.C., has come under scrutiny. The group has been designated as an anti-gay hate group by the Alabama-based Southern Poverty Law Center._NEWLINE_In a press release The Public Advocate proclaimed that it "associates with members of both major parties in a non-partisan fashion and promotes traditional values". The organization contends that Clayton has demonstrated that "an American patriot can put his or her name on the ballot and win big as a conservative, even in the Democratic Party." A Clayton spokesman criticized the state Democratic party for disowning the nominee and argued that the state party had violated the law by using its resources to attack one of its own candidates. The Clayton campaign also said that it would file a complaint with the Federal Election Commission. + _START_ARTICLE_ Ben T. 
Elliott _START_SECTION_ Life _START_PARAGRAPH_ Elliott was born on November 6, 1944, in Philadelphia, Pennsylvania. He is a 1966 graduate of Bucknell University and a 1970 graduate of The Institut d'Études Politiques de Paris. At Bucknell he played free safety for the Bison football team and was a member of the Phi Gamma Delta fraternity. In 1962 he graduated from the Haverford School, which honored him with its Distinguished Alumni Award in 2007._NEWLINE_In 1978, he worked for the U.S. Chamber of Commerce._NEWLINE_In 1980, he was hired as White house speech writer, by Ken Khachigian._NEWLINE_In 1983, he succeeded him as chief speech writer._NEWLINE_When Treasury Secretary Don Regan took over as Reagan’s chief of staff in 1985, he asked Elliott to resign, which occurred in 1986. Eleanor Clift reported, “Elliott, a staunch believer in supply side economics and a fervent right-to-life advocate, clashed with Regan and other top presidential assistants over the rhetorical tone of the President's State of the Union address last January. Although Elliott was widely believed to have won the battle, his language was nevertheless softened considerably.”_NEWLINE_After leaving the White House, Elliott wrote speeches for Jack Kemp, William Simon, Steve Forbes, IBM, Pepsi, Goldman Sachs and the New York Stock Exchange. He serves as a speechwriter at Bank of America and is a trustee of Bucknell University. + _START_ARTICLE_ The Ex-File 3: The Return of the Exes _START_SECTION_ Release _START_PARAGRAPH_ It was released in China on 29 December 2017 _START_SECTION_ Box office _START_PARAGRAPH_ As of 26 January 2018, it has earned ¥1.9 billion in China._NEWLINE_The film gained some notoriety in the overseas press for beating Star Wars: The Last Jedi at the Chinese box office. + _START_ARTICLE_ Fluor-liddicoatite _START_SECTION_ Crystal habit _START_PARAGRAPH_ Crystals are stout prismatic, with a curved convex trigonal outline, generally elongated and striated parallel to the c axis. Crystals are hemimorphic, meaning that the two ends of the crystal have different forms. Fluor-liddicoatite usually has a pedion (a single crystal face) opposite one or two pyramids. _START_SECTION_ Physical properties _START_PARAGRAPH_ The color is usually smoky brown, but also pink, red, green, blue, or rarely white. Color zoning is abundant at the type locality, parallel to pyramid faces. This is due to changes in the solution during crystal growth. As the concentration of trace elements that serve as coloring agents changes, there will be areas of less or more color in different parts of the crystal. When the crystal is sliced perpendicular to the c axis, triangular zoning may be seen, together with a trigonal star that radiates from the centre of the crystal, with the three rays directed towards the corners of the triangular color patterns. _NEWLINE_The pink-red color is due to the manganese Mn³⁺ content, and the green color is due to intervalence charge transfer transactions between iron Fe²⁺ and titanium Ti⁴⁺._NEWLINE__NEWLINE_The streak is white to very light brown, lighter than the mass color, luster is vitreous and crystals are transparent to translucent._NEWLINE_ _NEWLINE_Cleavage is poor perpendicular to the c crystal axis, or it may be totally absent. The mineral is brittle, with an uneven to conchoidal fracture. It is very hard, with hardness ​7 ¹⁄₂, a little harder than zircon, making it suitable for use as a gemstone. Specific gravity is 3.02, a little lighter than fluorite. 
It is neither fluorescent nor radioactive. _START_SECTION_ Optical properties _START_PARAGRAPH_ Fluor-liddicoatite is uniaxial (-), with _NEWLINE_refractive Indices Nₒ = 1.637 and Nₑ = 1.621 for the type specimen. The refractive indices, however, will vary from specimen to specimen, as they depend on the content of iron and manganese, which are usually present as trace elements.Pleochroism is strong: O dark brown or pink, E light brown or pale pink. _START_SECTION_ Environment _START_PARAGRAPH_ Fluor-liddicoatite is detrital in soil at the type locality, presumably derived from the weathering of granitic pegmatites. Associated minerals are quartz, elbaite, albite and micas. + _START_ARTICLE_ Château Bilquin de Cartier _START_SECTION_ History _START_PARAGRAPH_ Origins of the château can be traced back to the 17th century, around 1635, when the Honoré family builds a castle on the Sambre river bank. The place had formerly been occupied by a seigneurial manor which was destroyed on 21 July 1554._NEWLINE_In 1667, the unfinished Spanish fortress of Charleroy is captured by Louis XIV's troops during the War of Devolution. As the castle in Marchienne was located in neutral territory (under authority of the Prince-Bishopric of Liège), it was used as a hospital for both French and Spanish soldiers._NEWLINE_In 1695, the castle is bought by Guillaume de Bilquin, a wealthy forge owner, who completes and enhances it. In 1717, his daughter, Marie-Agnès Bilquin, marries Jean-Louis Cartier, son of the general treasurer of the prince-bishop of Liège. As such, the castle becomes the property of the Cartier de Marchienne family._NEWLINE_In 1740, the castle hosts Remacle Le Loup, a famous draftsman from the Liège region. It is severely damaged by a fire in 1932, and bought over by the municipality of Marchienne-au-Pont in 1938, ending more than two centuries of ownership by the Cartier family._NEWLINE_Marguerite Yourcenar, a Belgian-born French novelist and essayist, and the first woman elected to the Académie française, is the daughter of Fernande de Cartier de Marchienne, from the Cartier family related to the Cartier castle. She visited the castle in Marchienne-au-Pont in 1956, and mentions her Cartier de Marchienne ancestry and the castle in her 1974 memoir Dear Departed: A Memoir (French: Souvenirs pieux)._NEWLINE_The Cartier castle was listed on 21 August 1980. It underwent restoration in phases between 1986 and 2001 (helped by ERDF), after having been left in a sorry condition (infested by dry rot). _START_SECTION_ Current condition _START_PARAGRAPH_ The castle today hosts a public library on the ground floor (Bibliothèque Marguerite Yourcenar), and administrative services of the Walloon region on the first floor. The library has a section dedicated to books in Turkish language._NEWLINE_The courtyard is equipped with benches and is publicly accessible as part of the Marchienne-au-Pont municipal park. The castle wing which was located on the southern side of the courtyard has been demolished to create an entrance for the De Cartier station of the Charleroi metro. At the tip of the western wing, a stone porch is adorned with the arms of the Bilquin-Baillencourt family, and a 1699 date inscription. Similarly, the lintel above the northern wing door shows a scalloped key with the arms of the Cartier family. 
Other demolished features include a barnyard where the Marchienne-au-Pont municipal swimming pool now stands._NEWLINE_A 19th century grain elevator in neo-renaissance style can be seen on the Sambre embankment. _START_SECTION_ Beijing replica _START_PARAGRAPH_ Emile de Cartier de Marchienne, Marguerite Yourcenar's uncle, who served as the Belgian ambassador in China at the start of the 20thcentury (1910-1917), ordered the construction of a Cartier castle replica, to serve as the Belgian legation building in Beijing. Plans were drawn in Marchienne-au-Pont, and bricks, slates, tiles, panelling and other materials were transported from Belgium to China for the construction. The building, which is now the Zijin Guest House, still exists in the Beijing Legation Quarter, although the original entrance has disappeared (Photo). + _START_ARTICLE_ Paralympic swimming _START_PARAGRAPH_ Paralympic swimming is an adaptation of the sport of swimming for athletes with disabilities. Paralympic swimmers compete at the Summer Paralympic Games and at other sports competitions throughout the world. The sport is governed by the International Paralympic Committee. Both men and women compete in Paralympic swimming, racing against competitors of their own gender. Swimming has been a part of the Paralympic program since the 1960 Summer Olympics in Rome, Italy. _START_SECTION_ Rules _START_PARAGRAPH_ Rules for the sport are adapted from those set forth by the International Swimming Federation (FINA). Swimmers compete individually in backstroke, breaststroke, butterfly, freestyle, individual medley, and as teams in relay races. At the Paralympics, World Championships and other elite level competitions, swimmers compete in an Olympic-size swimming pool._NEWLINE_Significant differences between able-bodied and Paralympic swimming include the starting position and adaptations allowed for visually impaired swimmers. Competitors may start a race by standing on a platform and diving into the pool, as in non-disabled swimming, or by sitting on the platform and diving in, or they may start the race in the water. In events for the blind and visually impaired, people called "tappers" may stand at the end of the pool and use a pole to tap the swimmers when they approach the wall, indicating when the swimmer should turn or end the race. No prostheses or assistive devices may be worn during competition. _START_SECTION_ Classification _START_PARAGRAPH_ Swimmers are classified according to the type and extent of their disability. The classification system allows swimmers to compete against others with a similar level of function._NEWLINE_Swimmers with physical disabilities are allocated a category between 1 and 10, with 1 corresponding to the most severe types of disability. Physical disabilities of Paralympic swimmers include single or multiple limb loss (through birth defects and/or amputation), cerebral palsy, spinal cord injuries (leading to paralysis or disability in limb coordination), dwarfism, and disabilities which impair the use of joints._NEWLINE_Blind and visually impaired swimmers compete within separate categories, being allocated to categories 11, 12 or 13. Category 11 corresponds to totally blind swimmers, while competitors in category 13 have severe but not total visual impairment. Category 11 swimmers compete with blackened goggles to ensure competitors are on an even level. 
Category 11 swimmers are also required to use tappers but they are optional for category 12 and 13._NEWLINE_Swimmers with mental disabilities compete in category 14._NEWLINE_Numbers are combined with a letter prefix depending on the event type. An "S" prefix corresponds to freestyle, backstroke and butterfly, while "SB" corresponds to breaststroke and "SM" to the medley. Hence, a swimmer with severe physical disabilities competing in backstroke may compete in an S3 event, while a blind swimmer in the medley would compete in class SM11._NEWLINE_For relay races, athletes from different classifications compete together, but the sum of their individual classifications must not exceed a given points total. For example, a relay team for a 34 points freestyle relay may consist of two S8 swimmers and two S9 swimmers (9 + 9 + 8 + 8 = 34), or an S10 swimmer and three S8 swimmers (10 + 8 + 8 + 8 = 34) + _START_ARTICLE_ Noah's Ark (1956 TV series) _START_SECTION_ Synopsis _START_PARAGRAPH_ Noah's Ark stars Paul Burke as young veterinarian Dr. Noah McCann, partner with the older Dr. Sam Rinehart, played by Victor Rodman (1892–1965), who in the series uses a wheelchair. May Wynn plays the young receptionist, Liz Clark._NEWLINE_Another similarly titled series, Second Noah, a family drama with Daniel Hugh Kelly in the title role of author Noah Beckett and Betsy Brantley as his veterinarian-wife, was televised on ABC from 1996 to 1997. _START_SECTION_ Production notes _START_PARAGRAPH_ Noah's Ark was created, produced, and directed by Jack Webb through his Mark VII Limited production company, and filmed at Revue Studios, later part of Universal Television. Its pilot episode on September 18, 1956, is titled "Jack Webb Presents". At the time, Webb and Ben Alexander co-starred on NBC's popular police drama Dragnet. In the October 2 episode of Noah's Ark titled "The Petition", a dispute develops over a rezoning request for the veterinary clinic. When Noah tries to reason with recalcitrant neighbors, violence results. _START_SECTION_ Scheduling _START_PARAGRAPH_ The 24th episode of the series, "Irmgaard's Problem", never aired. Instead, on March 5, 1957, the suspense series Panic! made its debut to replace Noah's Ark. The theme song for Noah's Ark was performed by the Hi-Los. _NEWLINE_Noah's Ark aired at 8:30 p.m. EST on Tuesdays, opposite The Brothers (with Gale Gordon, Bob Sweeney, and Barbara Billingsley) on CBS and Hugh O'Brian's The Life and Legend of Wyatt Earp on ABC. Noah's Ark followed the quiz show The Big Surprise and preceded the anthology series, The Jane Wyman Show on NBC. + _START_ARTICLE_ Katy de la Cruz _START_SECTION_ Early life _START_PARAGRAPH_ Catalina de la Cruz was born in Bustos, Bulacan. Even as a young child, de la Cruz would be hired to sing at town fiestas, and at intermissions during cockfights and boxing matches. Her formal schooling ended at the third grade._NEWLINE_In 1914, when she was seven years old, she was hired by the owner of a Manila film theater to sing to the audiences in between movie screenings. Such performances were typical in Manila theaters during that period, and from those routines would emerge a distinct genre eventually known as bodabil. She learned her songs through listening to phonograph records, and mastered the English language with the help of her brother. _START_SECTION_ Bodabil star _START_PARAGRAPH_ By the age of thirteen, de la Cruz was a rising star in the bodabil circuit, performing alongside other leading stage performers such as Atang de la Rama. 
She soon became a solo headliner, performing in Manila's largest theaters such as the Savoy, the Palace, and the Lux. By 1925, de la Cruz was the highest paid entertainer in the Philippines. She fell in love, and later married, the piano player of her stage show. Some of the chorus girls who performed alongside her onstage, such as Chichay, Etang Discher, and Mary Walter, later become prominent entertainers in their own right._NEWLINE_De la Cruz was acknowledged as a proficient performer of torch songs who drew comparisons to Sophie Tucker._NEWLINE_Initially, her signature tune was the bluesy ballad St. Louis Blues. After jazz became popular in the Philippines in the 1920s, de la Cruz adapted her singing style and soon mastered the art of scat singing, which became a trademark of hers. By the 1930s, de la Cruz would be most identified with the song Balut, a fast-paced jazzy tune written by Jerry Brandy. Her take on the song, which afforded her to showcase her scatting ability, has been described as impish and rustic, rounded out by her low, playfully dragging key. A slightly bawdy take, called "Balut", named for a notorious Filipino culinary delicacy of the same name, remains popular to date, with versions performed by the New Minstrels, Pilita Corrales, and Lani Misalucha._NEWLINE_She also occasionally acted in films, most prominently in Inspirasyon (1953), for which she received the FAMAS Best Supporting Actress Award in 1953. Many of her films were for Sampaguita Pictures. _START_SECTION_ Later career _START_PARAGRAPH_ As bodabil slowly declined, de la Cruz concentrated on concert performances and international tours. In the late 1940s and early 1950s, she was a top-billed performer at the famed Forbidden City nightclub in San Francisco. In 1961, she starred in her own stage show in Las Vegas. De la Cruz also performed concert tours in Thailand, Taiwan, Hong Kong, Singapore, Australia, and Hawaii._NEWLINE_De la Cruz eventually retired to San Francisco, California, though she would occasionally perform until the late 1980s. In 1989, she visited the Philippines to attend the premiere at the Cultural Center of the Philippines of Katy!, a highly publicized stage musical based on her life. _START_SECTION_ Family _START_PARAGRAPH_ Of de la Cruz's four children, her daughter Angie followed her into showbusiness, pairing with Nikki Ross to form Wing Duo, a singing tandem that was popular on the bodabil circuit and on film during the 1950s. + _START_ARTICLE_ Wood, North Carolina _START_PARAGRAPH_ Wood (formerly, Wood's Store) is a small unincorporated community in northeastern Franklin County, North Carolina, on North Carolina Highway 561 east of Centerville. Settled in 1893, Wood was incorporated as a town in 1917. The town charter was repealed on May 5, 1961. Wood lies at an elevation of 322 feet (98 m)._NEWLINE_Archibald Taylor House was listed on the National Register of Historic Places in 1975. + _START_ARTICLE_ Requiem Canticles (Robbins) _START_PARAGRAPH_ Requiem Canticles is a ballet made for New York City Ballet's Stravinsky Festival by balletmaster Jerome Robbins to eponymous music from 1966 by Igor Stravinsky. The premiere took place June 25, 1972, at the New York State Theater, Lincoln Center. + _START_ARTICLE_ Tennō, Akita _START_PARAGRAPH_ Tennō (天王町 Tennō-machi) was a town located in Minamiakita District, Akita Prefecture, Japan._NEWLINE_In 2003, the town had an estimated population of 22,115 and a density of 532.76 persons per km². 
The total area was 41.51 km²._NEWLINE_On March 22, 2005, Tennō, along with the towns of Iitagawa and Shōwa (all from Minamiakita District), merged to create the city of Katagami. + _START_ARTICLE_ James W. Jones _START_PARAGRAPH_ James William Jones (1843 – 26 April 1920) was a South Australian surveyor. He was the son of civil engineer Thomas Jones, founder of the Manchester Unity of Oddfellows in South Australia. He studied at J. L. Young's Adelaide Educational Institution. He was soon working for his father on the Port Elliot to Goolwa tramway, for which his father received official criticism. He joined the State public service as a draughtsman in 1865 and was appointed Chief Surveyor then Deputy Surveyor-General in the Department of Survey and Crown Lands. He explored the area north-east of Eucla in 1880, and discovered the Kudna rockhole and catacombs, an immense network of limestone caves, lakes and underground passages under the Nullarbor Plain. He was appointed Conservator of Water in 1887 and Secretary to the Commissioner of Works in 1902. He was secretary of the South Australia branch of the Royal Geographical Society of Australasia from its foundation in 1885 to 1894. He was elected president of the Institute of Surveyors in 1912. He was chairman and president of the Harbors and Marine Board in 1914. He was secretary of the Cheer-Up Society during the First World War._NEWLINE_He was appointed Companion of the Imperial Service Order in 1911. + _START_ARTICLE_ Mathilda Ebeling _START_PARAGRAPH_ Aurora Mathilda Ebeling (1826–1851) was a Swedish soprano opera singer. After first appearing as a concert pianist in 1842, she made her singing début at Stockholm's Mindre Theatre in 1844. She performed at the Royal Swedish Opera from 1846 to 1848 before further study in Paris and an engagement with Berlin's Royal Opera in 1850. _START_SECTION_ Biography _START_PARAGRAPH_ Born in Stockholm County on 11 October 1826, Mathilda Ebeling was the daughter of the flautist Johan Ludvig Ebeling and Aurora Olivia Björkman. She received piano lessons from Wilhelmina Josephson. When 15, she gave her first major concert as a pianist in Stockholm's Stock Exchange Hall where she showed promise as one of the country's most notable performers. Nevertheless, following the encouragement of the composer Johan Magnus Rosén, she decided to take singing lessons which led to her début in the opera Farinelli at the Mindre Theatre in 1844._NEWLINE_In 1845, she was admitted to the Royal Opera as a student, making a highly successful début the following year in The Magic Flute. The opera proved so popular that it ran for over a hundred performances, playing to full houses ten times in succession. Writing in Stockholms Figaro, a critic complimented her as a "bright star" who had "great promise with her full voice, a true sense of music and a propensity for deep perception". She went on to play the roles of Anna in Don Giovanni, Agathe in Der Freischütz, Adalgisa in Norma and the countess in The Marriage of Figaro. Her career in Sweden came to an abrupt end when she began suffering from a pulmonary ailment._NEWLINE_In 1848 Ebeling continued her studies in Paris under Manuel García. Thereafter she appeared in various locations in Germany, giving her last performance in Berlin, shortly before her death on 1 December 1851. + _START_ARTICLE_ William Marshall (illustrator) _START_PARAGRAPH_ William Marshall (fl. 
1617–1649) was a seventeenth-century British engraver and illustrator, best known for his print depicting "Charles the Martyr", a symbolic portrayal of King Charles I of England as a Christian martyr. _START_SECTION_ Early career _START_PARAGRAPH_ Nothing is known of Marshall's life beyond references to his career as an engraver. Marshall's earliest known work is the frontispiece to the book A Solemne Joviall Disposition Briefly Shadowing the Law of Drinking, which was published in 1617. In the 1630s he produced a number of portrait engravings and book frontispieces, depicting Puritan divines, poets, and figures associated with the High Church establishment of the day, such as William Laud._NEWLINE_His most ambitious work was the highly elaborate frontispiece to George Wither's 1635 Collection of Emblemes, Ancient and Moderne, an unusually complex example of the Emblem book. Wither left the design to Marshall, having given general instructions, but expressed himself exasperated with the result, on the grounds that its symbolism was thoroughly incoherent. As he wrote,_NEWLINE_Instead thereof, the Workman brought to light,_NEWLINE__NEWLINE_What, here, you see; therein, mistaking quite_NEWLINE__NEWLINE_The true Design: And, so (with pains, and cost)_NEWLINE__NEWLINE_The first intended FRONTISPIECE, is lost._NEWLINE_Wither's lengthy poem on the engraving claims that its apparently inconsistent symbolism revealed, unintentionally, a deeper truth. The lower part of the frontispiece depicts people wandering in confusion in a cave, apparently having emerged from a womb-like pool in which babies are shown swimming. They exit the cave to draw lots given to them by the goddess of Fortune, symbolic of their allotted place in life. They then climb up a mountain, which divides into two peaks, symbolic of the right and the wrong paths in life. The path to the peak on the right appears more attractive at first, but then becomes rocky and finally leads only to death; the path on the left is at first harder, but eventually becomes pleasant and leads to paradise. A Christian church is depicted on the left and a Pagan temple on the right._NEWLINE_Marshall also created forty-one of the seventy-nine plates in Francis Quarles's Emblems of the life of man._NEWLINE_In 1640 he created the image of William Shakespeare for John Benson's (notoriously inaccurate) edition of the poet's sonnets. This was an adapted and reversed version of the original Martin Droeshout print. Five years later, he created the image of John Milton surrounded by four muses for Milton's 1645 Poems. The muses are Melpomene (tragedy), Erato, (lyric poetry), Urania, (astronomy), and Clio (history). Like Wither, Milton was unimpressed by Marshall's work, considering the portrait to be deeply unflattering. He had Marshall engrave satirical verses written in Greek underneath the image. It is assumed that this was a practical joke on Marshall, who is unlikely to have known that he was engraving insults directed at himself. The verses read in translation,_NEWLINE_Looking at the form of the original, you could say, perhaps, that this likeness had been drawn by a rank beginner; but, my friends, since you do not recognize what is pictured here, have a chuckle at a caricature by a useless artist. + _START_ARTICLE_ Energy policy of Venezuela _START_SECTION_ Oil _START_PARAGRAPH_ Venezuela has been producing oil for nearly a century and was an OPEC founder-member. 
In 2005, Venezuela produced 162 million tons of oil, which is 4.1% of world's total production. By the oil production Venezuela ranks seventh in the world. Venezuela is the world's eight oil exporter and fifth largest net exporter. In 2012, 11 percent of US oil imports came from Venezuela._NEWLINE_Since 2010, when the heavy oil from the Orinoco Belt was considered to be economically recoverable, Venezuela has had the largest proved reserves of petroleum in the world, about 298 billion barrels._NEWLINE_Oil accounts for about half of total government revenues. The leading oil company is Petróleos de Venezuela S.A. (PDVSA), which according to Venezuelan authorities produces 3.3 million barrels per day (520,000 m³/d). However, oil industry analysts and the U.S. Energy Information Administration believe it to be only 2.8-2.9 million barrels per day (460,000 m³/d). Venezuela's main oil fields are located at four major sedimentary basins: Maracaibo, Falcon, Apure, and Oriental. PdVSA has 1.28 million barrels per day (204,000 m³/d) of crude oil refining capacity. The major facilities are the Paraguaná Refining Center, Puerto de la Cruz, and El Palito. _START_SECTION_ Natural gas _START_PARAGRAPH_ As of 2013, Venezuela has the eighth-largest proved gas reserves in the world and the largest in South America. Proved reserves were estimated at 5.5 trillion cubic meters (tcm). However, inadequate transportation and distribution infrastructure has prevented it from making the most of its resources. More than 70% of domestic gas production is consumed by the petroleum industry. Nearly 35% of gross natural gas output are re-injected in order to boost or maintain reservoir pressures, while smaller_NEWLINE_amounts (5%) are vented or flared. About 10% of production volumes are subject to shrinkage as a result of the extraction of NGLs. The 2010 estimate is 176 trillion cubic feet (5,000 km³), and the nation reportedly produced about 848 billion cubic feet (2.40×10¹⁰ m³) in 2008._NEWLINE_The leading gas company is PdVSA. The largest private natural gas producer is Repsol-YPF, who supplies 80-megawatt (MW) power station in Portuguesa, and plans to develop a 450-MW power plant in Obispos. _START_SECTION_ Tar sands and heavy oils _START_PARAGRAPH_ Venezuela has non-conventional oil deposits (extra-heavy crude oil, bitumen and tar sands) at 1,200 billion barrels (1.9×10¹¹ m³) approximately equal to the world's reserves of conventional oil. About 267 billion barrels (4.24×10¹⁰ m³) of this may be producible at current prices using current technology. The main deposits are located in the Orinoco Belt in central Venezuela (Orinoco tar sands), some deposits are also found in the Maracaibo Basin and Lake Guanoco, near the Caribbean coast. _START_SECTION_ Coal _START_PARAGRAPH_ Venezuela has recoverable coal reserves of approximately 528 million short tons (Mmst), most of which is bituminous._NEWLINE_Coal production was at 9.254 million short tons as of 2007. Most coal exports go to Latin American countries, the United States and Europe._NEWLINE_The main coal company in Venezuelas is Carbozulia, a former subsidiary of PdVSA, which is controlled by Venezuela's state development agency Corpozulia. The major coal-producing region in Venezuela is the Guasare Basin, which is located near the Colombian border. The coal industry development plans include the construction of a railway linking coal mines to the coast and a new deepwater port. 
_START_SECTION_ Electricity _START_PARAGRAPH_ The main electricity source is hydropower, which accounts for 71% in 2004. A gross theoretical capability of hydropower is 320 TWh per annum, of which 130 TWh per annum is considered as economically feasible. In 2004, Venezuela produced 70 TWh of hydropower, which accounts 2.5% of world's total. At the end of 2002, total installed hydroelectric generating capacity accounted 13.76 GW with additional 4.5 GW under construction and 7.4 GW of planned capacity._NEWLINE_Hydroelectricity production is concentrated on the Caroní River in Guayana Region. Today it has 4 different dams. The largest hydroplant is the Guri dam with 10,200 MW of installed capacity, which makes it the third-largest hydroelectric plant in the world. Other facilities on the Caroní are Caruachi, Macagua I, Macagua II and Macagua III, with a total of 15.910 MW of installed capacity in 2003. New dams, Tocoma (2 160 MW) and Tayucay (2 450 MW), are currently under construction between Guri and Caruachi. With a projected installed capacity for the whole Hydroelectric Complex (upstream Caroni River and downstream Caroni River), between 17.250 and 20.000 MW in 2010._NEWLINE_The largest power companies are state-owned CVG Electrificación del Caroní (EDELCA), a subsidiary of the mining company Corporación Venezolana de Guayana (CVG), and Compania Anonima de Administracion y Fomento Electrico (CADAFE) accounting respectively for approximately 63% and 18% of generating capacities. Other state-owned power companies are ENELBAR and ENELVEN-ENELCO (approximately 8% of capacities). In 2007, PDVSA bought 82.14% percent of Electricidad de Caracas (EDC) from AES Corporation as part of a renationalization program. Subsequently, the ownership share rose to 93.62% (December 2008). EDC has 11% of Venezuelan capacity, and owns the majority of conventional thermal power plants. The rest of the power production is owned by private companies._NEWLINE_The national transmission system (Sistema Inrterconectado Nacional- SIN) is composed by four interconnected regional transmission systems operated by EDELCA, CADAFE, EDC and ENELVEN-ENELCO. Oficina de Operacion de Sistema Interconectados (OPSIS), jointly owned by the four vertical integrated electric companies, operate the SIN under an RTPA regime. _START_SECTION_ Environmental issues _START_PARAGRAPH_ Prolonged oil production has resulted in significant oil pollution along the Caribbean coast. Hydrocarbons extraction has resulted also in the subsiding of the eastern shore of Lake Maracaibo, South America's largest lake. Venezuela is also the region's top emitter of carbon dioxide. _START_SECTION_ Regional cooperation _START_PARAGRAPH_ Venezuela has pushed the creation of regional oil initiatives for the Caribbean (Petrocaribe), the Andean region (Petroandino), and South America (Petrosur), and Latin America (Petroamerica). The initiatives include assistance for oil developments, investments in refining capacity, and preferential oil pricing. The most developed of these three is the Petrocaribe initiative, with 13 nations signed agreement in 2005. Under Petrocaribe, Venezuela will offer crude oil and petroleum products to Caribbean nations under preferential terms and prices. The payment system allows for a few nations to buy oil on market value but only a certain amount is needed up front; the remainder can be paid through a 25-year financing agreement on 1% interest. 
The deal allows for the Caribbean nations to purchase up to 185 million barrels (29,400,000 m³) of oil per day on these terms. In addition it allows for nations to pay part of the cost with other products provided to Venezuela, such as bananas, rice, and sugar._NEWLINE_In 2000, Venezuela and Cuba signed an agreement, which grants Venezuelan oil supplies to Cuba._NEWLINE_In 2006, the construction of the Trans-Caribbean gas pipeline, which will connect Venezuela and Colombia with extension to Panama (and probably to Nicaragua) began. The pipeline will pump gas from Colombia to Venezuela and, after 7 years, from Venezuela to Colombia. Venezuela has also proposed the project of Gran Gasoducto del Sur, which would connect Venezuela with Brazil and Argentina. There has been some discussion about constructing an oil pipeline to Colombia along the Pacific Ocean._NEWLINE_Venezuela also exports electricity to neighboring countries. Santa Elena/Boa Vista Interconnector permits electricity export to Brazil, and Cuatricenternario/Cuestecitas Interconnector and EI Corozo/San Mateo Interconnector to Colombia. + _START_ARTICLE_ Marvin Creamer _START_PARAGRAPH_ Marvin Creamer (born January 24, 1916) is a former college professor and amateur American sailor noted for having sailed around the globe without the aid of navigational instruments. Between December, 21, 1982, and May 17, 1984, Creamer and the crew of his 36-foot boat, Globe Star, circumnavigated the globe without a compass, sextant, watch, or other instruments. The ship spent 510 days at sea. As general guides, Creamer observed the sun and stars, currents, and occasionally the regional biological setting. In honor of his voyage, Rowan University created the Marvin Creamer Scholarship Fund. He turned 100 in January 2016. _START_SECTION_ Personal life _START_PARAGRAPH_ Creamer was born in Vineland, New Jersey, and taught Geography at Glassboro State College from 1948 until 1977. He has three children, six grandchildren, and two great-grandchildren. He was married to Blanche Creamer for 59 years until her death in 2005. + _START_ARTICLE_ Save a Child's Heart _START_PARAGRAPH_ Save a Child's Heart (SACH) is a humanitarian organization with a mission to improve the quality of pediatric cardiac care for children from developing countries who suffer from heart disease, and who cannot get adequate medical care in their home countries. It also works to create centers of pediatric cardiac competence in these countries, so these children can be treated at home. SACH was founded in 1996 and is based at the Edith Wolfson Medical Center near Tel Aviv, Israel. _START_SECTION_ History _START_PARAGRAPH_ Save a Child's Heart is the creation of Dr. Amram Cohen, and grew out of Cohen's experiences as a doctor serving with the U.S. Armed Forces in Korea in 1988, where he joined a program that helped poor local children with heart disease. The experience introduced him to a network of doctors doing similar work in developing countries, inspiring him to start his own program after moving to Israel in 1992. He brought three Ethiopian children to Israel for heart surgery in 1996, and then went on to make use of a network of professional and personal contacts to build a volunteer organization to help others for whom the operations were unavailable or too expensive._NEWLINE_Through a foundation he established, Save a Child's Heart, Dr. 
Cohen and other surgeons conducted hundreds of operations on children with congenital heart diseases, mostly at the Wolfson Medical Center, where Dr. Cohen was the head of pediatric cardiac surgery and served as Save a Child's Heart's chief surgeon. _NEWLINE_Children were also brought from Nigeria, Tanzania, Congo, Moldova, Russia, Ghana, Vietnam, Ecuador, Jordan and the Palestinian Authority. Dr. Cohen and his team also traveled to China and Ethiopia to operate on about 60 children and taught medical staff there and in other countries. His foundation helped bring doctors and nurses to Israel for training, with the aim of creating centers for treatment of pediatric heart disease in their home countries. _NEWLINE_Dr. Cohen died on August 16, 2001, while climbing Mount Kilimanjaro in Tanzania. He was 47._NEWLINE_Since Dr. Cohen's death, SACH has continued its efforts to benefit children with life-threatening cardiac problems and to teach medical personnel in developing nations the surgical techniques needed to treat these young patients._NEWLINE_In 2006, SACH was selected as a featured charity by KLM Royal Dutch Airlines Air Cares program, with the airline showing a video of the charity's work on board its flights. The airline also donated EUR10,000 and donated free air miles to SACH._NEWLINE_In April, 2007, Israeli musician Idan Raichel traveled with Save a Child's Heart to Rwanda and Ethiopia._NEWLINE_In May 2011, SACH received recognition for special consultative status with the Economic and Social Council (ECOSOC) of the United Nations_NEWLINE_In November 2011, a new children's home was inaugurated. The facility was built specifically to meet the needs of the young patients and staff and will allow Save a Child's Heart to house and treat a larger number of the children._NEWLINE_In June 2012, SACH received the Israeli Presidential Award for Volunteerism._NEWLINE_In July 2016, SACH saves its 4,000th child._NEWLINE_In January 2017, Israeli Prime Minister Benjamin Netanyahu visited SACH_NEWLINE_In January 2019, SACH saved its 5,000th child. _START_SECTION_ Surgeries performed in Israel _START_PARAGRAPH_ Save a Child's Heart has treated over 5,000 children from 59 developing nations in Israeli hospitals._NEWLINE_In 2013, amidst the Syrian Civil War, SACH conducted an open-heart surgery on a 5-year old Syrian girl. The pre-schooler, living as a refugee in an undisclosed country, traveled to the Wolfson Medical Hospital in Holon to receive the treatment. She was the first Syrian child to receive the free medical care and surgery._NEWLINE_SACH is embarking on its biggest project yet, to build an International Pediatric Cardiac Center (IPCC) at the Wolfson Medical Center (WMC), which will serve as a Children's Hospital. The IPCC will be a worldwide center of competence in pediatric cardiac care with international recognition in pediatric cardiac treatment, training and research. It will serve as a model for other SACH centers of competence in developing countries. This new state of the art child oriented medical facility will house all of the infrastructure and equipment needed to perform pediatric heart surgeries, including all pre and post-operative care. _START_SECTION_ International activities _START_PARAGRAPH_ China – On November 16, 2008, a Save a Child's Heart (SACH) training and surgical mission left for Shijiazhuang in the Hebei Province in China. 
This was SACH's 8th mission to China where its medical teams have saved, with Chinese colleagues, more than 100 Chinese children._NEWLINE_Angola - On May 3, 2009, a Save a Child's Heart medical team left for Luanda, Angola, to examine and screen Angolan children. The team examined 88 children. Among them were children who had been treated in Israel and needed a follow up examination._NEWLINE_Moldova – On November 11, 2007, a SACH team arrived in Kishinev, Moldova, to work with a team of local pediatric cardiologists. The mixed surgical group examined children and performed surgeries for five days._NEWLINE_Tanzania – In August 2011, a SACH team of Doctors, Nurses, Staff and Volunteers traveled to Tanzania to the Bugando Medical Center to work alongside local partners. During this mission SACH, together with the local partners doctors screened 300 children and performed 12 surgeries on Ethiopian children. A week later, a team of SACH volunteers, doctors and staff climbed Mount Kilimanjaro in an effort to raise $1M to save the lives of African children in need. As of January 2018 there have been 7 medical missions to Tanzania._NEWLINE_Romania - In 2017, there were two missions to Romania in March and November, as well as one mission in 2018. During these missions Israeli doctors traveled to help assist Romanian medical staff in performing over 11 life saving heart procedures as well as performing their own procedures._NEWLINE_Zanzibar - There have been 8 medical missions to Zanzibar since 2008, the most recent being in February 2019. Save a Child's Heart (SACH) sent an all-women's mission to Zanzibar in mid-February 2019 to screen and diagnose children in need of life-saving heart surgery. SACH worked with its medical partners at the Mnazi Mmoia Hospital in Zanzibar to conduct screenings and determine which children are in need of heart surgeries. Throughout the mission, there were a total of 398 children in Zanzibar screened._NEWLINE_SACH Photo Exhibit Tours the Globe_NEWLINE_Since 2008, a photo exhibit of SACH activities has been presented in cities around the world, including Abuja (Nigeria), Brussels, Detroit, Glasgow, Hebei (China), Jerusalem, Johannesburg, Melbourne, Miami, Moscow, Philadelphia, Quezon City (Philippines), Singapore, Sydney, Toronto, Vancouver and Washington, DC. + _START_ARTICLE_ Nerve point of neck _START_SECTION_ Convergence of nerves _START_PARAGRAPH_ Erb's point is formed by the union of the C5 and C6 nerve roots, which later converge. At the nerve trunk, branches of suprascapular nerves and the nerve to the subclavius also merge. The merged nerve divides into the anterior and posterior division of C5 and C6. _START_SECTION_ Clinical significance _START_PARAGRAPH_ Injury to Erb's point is commonly sustained at birth or from a fall onto the shoulder. The nerve roots normally involved are C5 and partly C6. Symptoms include paralysis of the biceps, brachialis, and coracobrachialis (through the musculocutaneous nerve); the brachioradialis (through the radial nerve); and the deltoid (through the axillary nerve). The effect is called "Erb's palsy". Typically, an affected person's arm hangs at the side with the hand rotated medially, like a porter waiting for a tip; hence the colloquial name "porter's tip hand". + _START_ARTICLE_ Ken Knight _START_SECTION_ Fire service career _START_PARAGRAPH_ Knight started work at Westminster Bank in Reigate (1964-1966). He commenced his fire service career as a firefighter in 1966 and subsequently served in a number of UK fire brigades. 
He was appointed as Chief Fire Officer of Dorset Fire and Rescue Service (1994-1998) and West Midlands Fire Service (1998-2003), before becoming London’s Fire Commissioner in 2003. In 2007 he was appointed as the Government's Chief Fire and Rescue Adviser for England based at the Department for Communities and Local Government._NEWLINE_As the Chief Fire and Rescue Adviser for England (2007-2013), Knight was responsible for advising ministers and senior officials on fire policy matters and for providing advice during major and catastrophic emergencies together with operational advice on preparedness and response during the 2012 London Olympics. He was also responsible for the enforcement of fire safety regulations in Crown Premises in England._NEWLINE_He produced an independent report for the UK Government on fire and rescue services response to the widespread flooding in 2007 entitled Facing the Challenge. He was also tasked by the Secretary of State to undertake a review in the immediate aftermath of the fire in high rise flats at Lakanal House, London, 2009 in which six people died._NEWLINE_In May 2013 Knight published an efficiencies review of the 46 Fire and Rescue Authorities, Facing the Future, which had been commissioned by the UK Government. _START_SECTION_ Independent consultancy _START_PARAGRAPH_ Since leaving his position as the Chief Fire and Rescue Adviser to the UK Government in 2013, Knight has provided independent consultancy advice to public and private sector organisations._NEWLINE_From 2014 to 2017 Knight served as the Lead Commissioner for the London Borough of Tower Hamlets under the Local Government Act 1999._NEWLINE_From 2016-17, the Department for Business, Innovation and Skills commissioned Knight to make recommendations regarding the introduction of electronic balloting for industrial action._NEWLINE_Since June 2017, Knight has been chair of the Independent Expert Advisory Panel at the Ministry of Housing, Communities and Local Government in the immediate aftermath of London's Grenfell Tower fire, in which 71 people lost their lives. His role was to provide advice to officials and Ministers on action to be taken in high rise buildings following the fire._NEWLINE_Knight has also completed reviews of the fire and rescue services in Southern Ireland, Bermuda and Gibraltar, and undertook a review of the national fire safety and civil defence arrangements in the Kurdish Region of Iraq at the request of the Kurdish Regional Government. _START_SECTION_ Honours _START_PARAGRAPH_ Her Majesty the Queen awarded Knight the Queen's Fire Service Medal in 1991 and a CBE in the 2001 Birthday Honours. He was appointed as Her Majesty’s Representative Deputy Lieutenant for Richmond upon Thames (2017-2017), and has been a Deputy Lieutenant for Greater London since 2007. Knight was knighted in the 2006 Birthday Honours in recognition of his outstanding contribution to the fire and rescue service._NEWLINE_Knight is a Companion of the Chartered Management Institute and a Fellow of the Institution of Fire Engineers. He was a founder Trustee of the UK Firefighters Memorial Trust and is a past Master of the Worshipful Company of Firefighters (1998). 
+ _START_ARTICLE_ James Brophy (public servant) _START_SECTION_ Biography _START_PARAGRAPH_ James Brophy was born on 26 September 1889 in South Melbourne, Victoria, the eldest child of Richard Brophy, labourer, and his wife Catherine, née Mackey, both from Ireland._NEWLINE_On 14 January 1922 he married Elizabeth Constance Ridley at St Brigid's Catholic Church, Red Hill, Brisbane. They had ten children. She died in 1965. They moved to Melbourne, Victoria in 1927 and to Canberra in 1930._NEWLINE_He was appointed Papal Knight-Commander of the Order of St. Gregory the Great in 1951 by Pope Pius XII. He was awarded the Companion of the Imperial Service Order in the Queen's Birthday Honours list in 1954._NEWLINE_He was a life member of the NSW Hockey Association, NSW Junior Hockey Association, ACT Hockey Association and the NSW Amateur Swimming Association. He served as a swimming official at the Rome Olympics and the Perth Commonwealth Games. He was also involved with Australian Rules Football and was president of the Canberra Australian National Football Junior League from 1940 to 1942 and senior vice-president of the CANFL in 1941 and 1942. He was a foundation member of the National Eisteddfod Society._NEWLINE_He retired on 25 September 1955._NEWLINE_He died on 24 May 1969 at Canberra Hospital aged 79 and was buried at Canberra Cemetery. + _START_ARTICLE_ Paeania _START_PARAGRAPH_ Paeania or Paiania (Ancient Greek: Παιανία) were two demoi of ancient Attica, divided into Upper Paeania and Lower Paeania, that were situated on the eastern side of Hymettus, near the modern village of Liopesi, since renamed Paiania. It was the deme of Demosthenes. + _START_ARTICLE_ Federal University of Rio de Janeiro Faculty of Law _START_SECTION_ History _START_PARAGRAPH_ The National Faculty of Law of UFRJ is the result of the merger in 1920 of two private schools, the Free Faculty of Law and Social Sciences of Rio de Janeiro and the Free School of Law. It fulfilled a long-held ambition of prominent citizens such as Fernando Mendes de Almeida, who had dreamed of creating a private law school. With the establishment of the republic and the creation of a free educational system, Mendes de Almeida called on former supporters of the idea and, with new members, worked for the establishment of the Free School of Law and Social Sciences of Rio de Janeiro, which eventually became the National Faculty of Law._NEWLINE_The creation of the National Faculty of Law, through the merger of the two private colleges, represented an end to the monopoly of legal education, which until then was the nearly exclusive province of the Faculdade de Direito do Recife in Olinda, and the Faculdade de Direito da Universidade de São Paulo. The founding of the National Faculty of Law added much-needed diversity to the nation's legal education._NEWLINE_The National Faculty of Law, together with the UFRJ Polytechnical School and the UFRJ Medical School, became in 1945 the basis for a new university, the University of Brazil. During that period the faculty's library was created, the college's magazine "A Época" was launched, and the Literary Guild and the Law Journal were created, under a committee formed by Cândido de Oliveira Filho, Luiz Carpenter, Raul Pederneiras, Virgílio de Sá Pereira, Gilberto Amado and Afrânio Peixoto._NEWLINE_In the 1930s, the National Faculty of Law held memorable public contests that attracted remarkable teachers, such as Joaquim Pimenta (sociology).
The class of 1937 was especially noted for graduates such as José Honorio Rodrigues and Evaristo de Moraes Filho, who became a professor of Labor Law and Sociology with his thesis on Auguste Comte._NEWLINE_In the 1940s the National Faculty of Law transferred to its current building, during a period marked by strong student mobilization (especially as resistance to the Estado Novo). Notable recruiting drives continued, bringing young lawyers to the Chairs of the Faculty, such as San Tiago Dantas and Hélio Tornaghi._NEWLINE_The 1950s consolidated the reputation of the National Faculty of Law. In 1955, the inaugural lecture of San Tiago Dantas, entitled "Legal Education and the Brazilian crisis", attracted much attention. At that time, San Tiago presented new guidelines for legal education and criticized the legal teaching methods of the time, defending the case system as opposed to the text system, and also argued that an interdisciplinary approach to Law was more suitable to modern times._NEWLINE_In 1960, the Brazilian capital moved to Brasília, and the process of federalisation of higher education began, with UFRJ as a part of it. With the coup of 1964, the National Faculty of Law faced some consequences, but the CACO – Centro Acadêmico Cândido de Oliveira (the faculty's students' union) fought against the military regime._NEWLINE_In the 1970s, the National Faculty of Law went through a deep crisis, characterized by only a few entrance examinations being held and a gradual reduction of faculty staff. The 1980s were also marked by crises and obstacles in student admissions._NEWLINE_In the 1990s, there were some initiatives, such as curriculum changes, the rearrangement of departmental structure and the creation of a center for community outreach, including a Special Court, an office of the Ombudsman, and a center for legal practice._NEWLINE_Since the end of 2009, following the election of a new directing board, the National Faculty of Law has been going through deep changes in academic and structural matters, aimed at improving the school's quality and reputation. + _START_ARTICLE_ Special flight rules area _START_PARAGRAPH_ In United States aviation, a special flight rules area (SFRA) is a region in which the normal regulations of flight do not apply in whole or in part, especially regulations concerning airspace classification, altitude, course, and speed restrictions, and the like. _START_SECTION_ Washington, DC Special Flight Rules Area _START_PARAGRAPH_ Following the terrorist attacks of September 11, 2001, the airspace around Washington DC underwent a number of changes designed to restrict flying around the city. In 2003, a temporary flight rules area was created and was named the Washington DC Air Defense Identification Zone. In 2008 the temporary status of the ADIZ was removed and the rule was made permanent._NEWLINE_In order to fly within the DC SFRA, pilots of general aviation aircraft are required to file a special flight rules flight plan, obtain a discrete transponder code, and remain in contact with air traffic control at all times. Special training is required in order to fly within 30 nm of the Washington DC (KDCA) VOR. _START_SECTION_ Los Angeles Special Flight Rules Area _START_PARAGRAPH_ Long established in the Los Angeles basin is the Los Angeles SFRA. Los Angeles International Airport is surrounded by extensive Class B airspace, which is difficult for VFR traffic to navigate.
In particular, the airport has four large runways running east/west that have airspace protection, 25 statute miles wide, extending from 10,000 feet down to the surface. This large swath of Class B airspace bisects Los Angeles and means that flights between airports north of LAX and airports south of LAX must be routed by air traffic control. To alleviate this load on ATC, the SFRA over LAX defines two exceptions to the Class B airspace to allow VFR aircraft to transit without control from ATC._NEWLINE_There are two routes, one for southeast-bound traffic and one for northwest-bound traffic. Both follow the 132° radial of the Santa Monica VOR between the Santa Monica Airport and the intersection of Interstate 405 and Imperial Highway. Southeast-bound traffic flies at 3,500 feet. Northwest-bound traffic flies at 4,500 feet. Despite being in the Class B airspace, aircraft following the rules of this corridor need not communicate with ATC._NEWLINE_The rules are fairly simple: Turn on all practical lights, day or night. Squawk 1201. Do not exceed 140 knots IAS. Monitor and self-report on 128.55 MHz. Have a copy of the Los Angeles TAC in the aircraft. No jets. _START_SECTION_ Hudson River Special Flight Rules Area _START_PARAGRAPH_ On November 19, 2009, the FAA established an SFRA in the New York City Class B airspace, motivated largely by the mid-air collision of a private general aviation aircraft and a sightseeing helicopter along the Hudson River VFR corridor in the summer of 2009._NEWLINE_The Hudson River Class B exclusion area is formed from the airspace above the Hudson River between the Alpine Tower and the Verrazano-Narrows Bridge. It is bounded by the banks of the Hudson River and runs from the surface of the river up to 1,300 feet. Aircraft fly along the right-hand bank to separate northbound and southbound traffic. Aircraft transiting the entire corridor fly between 1,000' and 1,300'. Aircraft performing local operations (mostly landing and taking off) inside the area fly under 1,000'. Aircraft need not communicate with ATC, but they must make certain mandatory self-reports at certain charted points._NEWLINE_In addition, there is a further Class B exclusion area over the East River between the Hudson River and just past the Queensboro Bridge. This length cannot be transited, as the Queensboro end of the corridor ends inside the Class B airspace. Additionally, aircraft not landing or taking off inside the East River exclusion must be in contact with ATC. + _START_ARTICLE_ Volf Bronner _START_SECTION_ Early life _START_PARAGRAPH_ Volf Bronner was born in Buriat-Mongolia in 1876. He attended high school in Chita and then began to study medicine at the University of Tomsk but was expelled because of his revolutionary political activities. One of his classmates at Tomsk was A. T. Trubacheev, later the People's Commissar of Public Health of the Buryat Autonomous Soviet Socialist Republic. He continued his medical studies at the University of Berlin, where he obtained his doctorate in medicine in 1900. _START_SECTION_ Career _START_PARAGRAPH_ From 1900 to autumn 1901, Bronner was a doctor in Verkhneudinsk, and from 1906 to 1913 he was in Paris, where he worked with Professor Guyon and subsequently at the Pasteur Institute. He edited the Journal Clinique d'Urologie.
From 1915 he worked in Moscow and in 1922 he established the State Venereological Institute in Moscow, of which he became the director._NEWLINE_Bronner helped to organise the 1928 Soviet-German Syphilis Expedition which aimed to tackle the endemic syphilis in Buriat-Mongolia, Bronner's place of birth, and to determine the method of transmission of the disease. Contrary to expectations, the expedition concluded that the syphilis in the area was spread principally by sexual activity._NEWLINE_In 1927, Bronner edited Prostitutsiia v Rossii (Prostitution in Russia) with Arkadii Elistratov, professor of police law at Moscow University, and in 1936, his book, La lutte contre la prostitution en URSS (The fight against prostitution in the USSR), revealed that two-thirds of prostitutes had been servants._NEWLINE_Following the Russian Communist Party's 17th Congress in 1934, which emphasised service to the collective over individual needs, Bronner was one of a number of public figures who changed their public utterances to match the new ethos, moving away from a humanistic approach that saw syphilitic infection as the result of misfortune and nothing to be ashamed about, towards an approach that characterised it as impeding the efforts of the party and something that carried shameful connotations. _START_SECTION_ Death _START_PARAGRAPH_ Bronner was arrested on suspicion of spying and terrorist activity during Joseph Stalin's Great Purge of 1937. He was executed in 1939. + _START_ARTICLE_ Fordongianus _START_SECTION_ History _START_PARAGRAPH_ In antiquity, Fordongianus was called Forum Trajani in honor of Roman emperor Trajan, who is credited with the building of what are now considerable Roman remains, including those of a bridge, and of thermae on a scale of great magnificence (Valéry, Voy. en Sardaigne, vol. ii. c. 35). The city, in the interior of Sardinia, is known from the Itineraries, which place it on the road from Tibula, through the interior of the island, to Othoca. (Itin. Ant. p. 82.) Fordongianus sits on the left bank of the river Tirsi (ancient Thyrsus), about 25 kilometres (16 mi) from Oristano. + _START_ARTICLE_ Cherry Hill Gourmet Market _START_PARAGRAPH_ The Cherry Hill Gourmet Market is a 19,000-square-foot (1,800 m²) Russian-themed specialty grocery and deli located on the corner of Emmons Avenue and Ocean Avenue on the water in Sheepshead Bay, Brooklyn, New York City. It is the principal establishment occupying the former Lundy's Restaurant (now Lundy's Landing Shopping Plaza), which also houses Momoyama Japanese Restaurant and Masal Turkish Cafe on the first floor, and professional offices on the upper floors._NEWLINE_Opened by Russian-born fruit and vegetable produce entrepreneur David Isaev in May 2009, the creation of the market put an end to attempts to revive Lundy's. The facade of the Lundy's building, an official New York City landmark, remains the same. The market and other businesses located within the landmark Lundy's structure remain embroiled in legal controversy due to ongoing violations of zoning laws created to protect Lundy's. _START_SECTION_ Community opposition _START_PARAGRAPH_ The Lundy's Building underwent a seven-million-dollar renovation in order to be saved and reopened. Brooklynite neighborhood traditionalists have continued to attempt to force the new businesses out of the historic site.
Community opposition centers on previous inclusion of the Lundy's building into a special maritime zoning district enacted in the 1970s to promote water-related commercial and recreational development. The Cherry Hill Gourmet Market also features dining tables and serves fish salad and a dozen kinds of smoked fish and caviar. As such, it is not primarily or exclusively a grocery store and, according to the New York City Department of Buildings, is theoretically not permitted under the special zoning designation imposed on the historic Lundy's building. Mr. Isaev is involved in ongoing negotiations to legalize the market, and keep its roughly 100 workers employed. Isaev was fined for violations of the zoning laws, settled the fines with the New York City Landmarks Commission, and is now pursuing zoning changes which would legalize his business and the other businesses now housed within the Lundy's structure that are out of zoning compliance; these remain in operation despite being in technical violation of still legal and enforceable community zoning laws enacted to protect the Lundy's structure._NEWLINE_As part of a community settlement, Isaev, who had removed the brass lettering 'Lundy Bros' and 'FWIL' (for Lundy's founder Frederick William Irving Lundy) from several arched doorways, restored the brass lettering to its original positions, and placed a large screen around the large refrigeration units on the parking lot behind the market to settle landmark commission objections. The Lundy's building, which was a garbage-strewn decaying structure going to ruin when Mr. Isaev acquired it, is today a popular Brooklyn shopping spot for the Russian emigre communities situated in Sheepshead Bay, and nearby Manhattan Beach and Brighton Beach. In opening his establishment and later pursuing changes to the zoning laws applying to Lundy's, Isaev maintained he was neither aware nor informed of all the complicated landmark designations and zoning requirements applying to the site when he signed his lease and opened a community business in good faith. _START_SECTION_ Hurricane Sandy flood _START_PARAGRAPH_ The effects of Hurricane Sandy in October 2012 caused the waters of Sheepshead Bay to overflow. The storm surge flooded the Cherry Hill Gourmet Market at ground level, causing it to sustain water damage and resulting in tons of spoiled food. During the post-hurricane cleanup the food had to be discarded, but the building was otherwise unaffected. + _START_ARTICLE_ Herping _START_SECTION_ Photography _START_PARAGRAPH_ Photography of reptiles and amphibians is largely dependent on digital cameras with a macro lens. An adequate lens is necessary for successfully capturing many species' images in an efficient manner, as it keeps photographer and subject from being injured, as well as maintaining the natural behavior of the subject. In some cases it is more practical to temporarily capture and pose the subject manually, such as when it is moving or obscured by debris, or when a fossorial snake is scurrying into its burrow. _START_SECTION_ Equipment _START_PARAGRAPH_ Herping activities are often recorded using the latest digital camera or camcorder technology. As many as three flashes may be used for optimal lighting, especially in challenging environments such as tropical rainforests.
The multiple flashes create three distracting catchlights in the subject's eye; two may be edited out of the photo by using Photoshop or similar applications._NEWLINE_Photographing venomous snakes at close range places the photographer within striking range, and various shields have evolved to minimize the danger. These bite shields often take the form of an opaque or transparent plastic covering which surrounds the camera and exposes only the lens. Modifications are made to accommodate various flash setups. Snakes are temperature-dependent and are often active in large numbers during optimal weather. Consequently, the greatest danger in venomous snake photography may lie in a bite from an unseen snake near the photographer. Great care must be taken to survey the area, and bites of this nature have occurred on several occasions._NEWLINE_The safest way to photograph venomous snakes is never to touch them. Snakes may be manipulated with a variety of specialized hooks, ranging from large hooks used for moving snakes, to extendable pocket hooks used for minor posing adjustments. Bite-resistant gloves may also be worn. _START_SECTION_ Setups _START_PARAGRAPH_ Herptiles are extremely weather-sensitive and often appear in heavy rain or other challenging photographic conditions. Some photographers carry cardboard boxes which can be modified in the field to create tiny sets for photography. In a desert area, sand is sprinkled on the bottom of the box and desert debris is scattered about. In wet areas, mossy sets are often developed, which work well for salamanders. The herp is posed to show identifying features and can be photographed at leisure, creating a realistic photo. During heavy rain or cold temperatures, this "studio" work is usually done in the back of an SUV or similar vehicle._NEWLINE_For aquatic herptiles, early spring is often the best period to find them, as aquatic vegetation is still sparse. Aquariums with natural or prefitted substrate may be used to obtain natural photographs. The extent of aquatic setups is limited only by the photographer's imagination, and elaborate studio setups have been used to photograph specialized scenes like basilisks running on water. _START_SECTION_ Techniques _START_PARAGRAPH_ Because reptiles and amphibians are often agitated when captured, various techniques have evolved to pacify subjects of herpetological photography. One technique involves placing a hat or similar object over an animal (typically a snake) so that it coils and rests quietly. The object is then quickly lifted off the animal and a series of photos are taken. Assistants are often standing by out-of-frame to head off escape attempts. _START_SECTION_ Field Techniques _START_PARAGRAPH_ Many techniques are used when a person goes “herping” or looking for reptiles and amphibians. _NEWLINE_One technique is known as road running, road cruising, or cruising. This is done by riding in a vehicle and traveling down stretches of road at a slow speed to count or catch animals. The use of a road as a natural transect can generate estimates of species density by cruising the road at peak migration time. Similarly, driving roads at night during anuran breeding times can yield a high diversity of species. The North American Amphibian Monitoring Program (NAAMP) uses road surveys to log a species count into a database to study amphibian populations across the nation.
This is done by traveling down a set route, stopping at predetermined spots, listening for a few minutes, and writing down every species heard at that location (Dodd 2010)._NEWLINE_Another technique for observing reptiles for research or photo opportunities is the use of cover boards. Silvy (2012) suggests that metal and wood cover boards be set out at least two months prior to searching. These boards act like natural cover under which herpetofauna can hide._NEWLINE_Tree frogs can be caught and photographed by using PVC pipes that are capped on the bottom and hung vertically in a tree near water._NEWLINE_If aquatic species are the target, an aquatic funnel trap can be used._NEWLINE_Drift fences have been used with a high success rate for capturing snakes. The use of a drift fence along with a pit-fall or funnel box trap has yielded high success. The length of the fence is variable, but a longer fence results in a higher success rate. The fence is set with traps in the middle and/or the ends. Snakes encounter the fence and are directed or led to the trap. Care must be taken in providing enough cover so the species do not die of heat exhaustion. Identifying all the species in the trap is recommended so that accidental envenomation is avoided. Pit-fall traps are small buckets that are placed in holes dug out next to the drift fence._NEWLINE_Turtles can be caught by using a variety of techniques; hoop traps, basking traps, floating pitfall traps, and funnel traps are among the best traps to use. Basking traps are used to catch basking turtles. These traps float on the surface and have an elevated platform for the turtle to bask. The net is underwater so they cannot escape once they fall into the trap. _START_SECTION_ Tourism _START_PARAGRAPH_ Herp-related tourism, like bird-related tourism, is on the rise. Because there are several hundred birders for every herper, herp-related tourism presently has a negligible economic impact. Fortunately, there is no way to engineer wildlife preserves for a specified vertebrate group. Instead, large areas of wilderness are conserved, benefiting all wildlife. Some of the more popular herping destinations include the United States, Costa Rica, the Amazon, Madagascar, and Australia._NEWLINE_Other countries such as India and South Africa possess tremendous herpetological diversity and there are entrepreneurial individuals developing ecotourism infrastructure in these areas. One example is Exo-Terra, a division of the Hagen pet supplies company, which since 2004 has traveled to a different tropical African country each year. The company also holds an annual photography contest that showcases some of the best herp photography in the world. The winner of the photo contest goes on the next trip. _START_SECTION_ Geographical differences _START_PARAGRAPH_ In Canada and other high-latitude countries, the herping season lasts 6–8 months, depending on the area. Ontario is the most herpetologically diverse province in Canada. While species lists may seem high, many Canadian herps have extremely limited ranges and exist only in isolated populations. Many Canadian herp species are threatened and in some cases great care is taken to protect remnant populations._NEWLINE_The United States contains a large number of different habitats and thus has a wide diversity of reptiles and amphibians. In some parts of the country, such as South Florida and South Texas, herping can be productive year round because of moderate winter temperatures.
In most cooler parts of the country the herps hibernate in the winter and thus are mostly inaccessible to herpers. Popular herping destinations in the United States are southern California, southern Arizona, Texas, and Florida. These states boast an incredible diversity of herps as well as a number of species that are highly sought after by herpers. It is no coincidence that all of these states are in the southern part of the country; reptiles and amphibians are ectothermic (cold-blooded) and thus are typically more abundant in warmer climates. _START_SECTION_ Safety _START_PARAGRAPH_ Herping can potentially be a dangerous activity if not pursued with proper caution. A strike from a venomous snake can potentially be life-threatening. Other herping activities, especially "flipping," put a herper at risk of accidentally coming in contact with a scorpion or spider. Safety equipment used to mitigate such dangers includes snake hooks, snake tongs, boots and gloves. _START_SECTION_ Ethical and legal issues _START_PARAGRAPH_ Field herpers encompass a wide ethical spectrum, ranging from behavioural observation without approaching the animal to "feeder" animal collection for existing herpetoculture. The majority of herpers practice careful capture and release in the same spot, as many herps have their own territories and relocating them elsewhere would be a disturbance. As wilderness areas shrink, herpers are concentrated into smaller areas, and commercial collectors often encounter field biologists who may have quite different approaches to their study animals. Many species are also threatened or endangered and thus it is illegal to take them from the wild. Another consideration is the spreading of diseases, such as the fungus Batrachochytrium dendrobatidis responsible for a worldwide decline in amphibian populations, which may be spread inadvertently by humans._NEWLINE_Since many herps are nocturnal, herpers often remove animals temporarily for daylight photo sessions. The animals are then replaced exactly where found. There is no "herpers code" and ethical considerations are left to the individual. From time to time, albino and other unusually coloured animals are encountered and these are sometimes kept for herpetoculture. The ethical justification in these cases is that conspicuous animals would be easy prey in the wild. Although true in the case of albino or other light-coloured animals, this is not true, for example, when normally barred individuals are born with striped patterns. In this case the motive is usually commercial, with the collector planning to develop a striped bloodline and charge high prices for an exclusive morph._NEWLINE_There are many different laws in place that affect herpers. Laws vary by country and state and are designed to protect the wildlife and habitats. In most states, a hunting license is required to collect reptiles and amphibians. Some states are stricter than others in terms of herping-related legislation. In Texas, for example, it is illegal to collect herps on public land, and thus the "road cruising" strategy described above is illegal. Herpers should be careful to obey all laws in the areas where they hunt. Lawbreaking herpers risk fines or even legal prosecution.
+ _START_ARTICLE_ Patriot Guard Riders _START_SECTION_ History _START_PARAGRAPH_ The group was formed in 2005 to shelter and protect the families of deceased service members against protesters from the Westboro Baptist Church, who claim that the deaths of American troops in Iraq and Afghanistan are divine retribution for American tolerance of homosexuality. PGR members position themselves to physically shield the mourners from the presence of the Westboro protesters by blocking the protesters from view with their motorcade, or by having members hold American flags. The group also drowns out the protesters' chants by singing patriotic songs or by revving motorcycle engines._NEWLINE_Although initially founded by motorcyclists, the organization is open to anyone, regardless of political affiliation, veteran status, or whether they ride or not. The only prerequisite is "a deep respect for those who serve our country; military and first responders." The Patriot Guard was established in Mulvane, Kansas, at American Legion Post 136 in 2005. The founding members incorporated the organization as a 501(c)(3) non-profit in the State of Oklahoma on February 21, 2006._NEWLINE_The group's mission quickly expanded to include the funerals of law enforcement officers, fire department personnel, all first responders, and any active duty member or veteran of the U.S. Armed Forces from all previous wars and conflicts, and is now largely focused on recognizing and honoring the sacrifices of dead service members as well as their families and loved ones. As of March 2011, PGR reported over 220,000 members. In addition to their attendance at funerals, the group also greets troops returning from overseas at welcome home celebrations and deployment ceremonies, and performs volunteer work for veterans' organizations such as Veterans Homes. The group also assists families in financial difficulties with travel and housing arrangements, and visits military hospitals to encourage and honor wounded service members of the United States Armed Forces. _START_SECTION_ Trademark lawsuit _START_PARAGRAPH_ In 2007, the Patriot Guard Riders attempted to register the name with the United States Patent and Trademark Office. One of the organization's founding members and first President, Jeff Brown, who previously operated the PGR merchandise store, filed an objection. PGR rebutted this, stating in papers filed with the Patent and Trademark Office that Brown had been ejected as a director of PGR in November 2006, and had therefore relinquished all rights to the store and the organization's name. After resigning, Brown filed a trademark request, but this was rejected since the PGR had submitted its own request. PGR contacted all its members asking for donations to establish a defense fund for the lawsuit._NEWLINE_On 16 July 2012 the Trademark Trial and Appeal Board (TTAB) rendered its decision on Brown's opposition to PGR, Inc.'s registration. They stated: "The record further reflects that during Brown's tenure as Executive Director, despite his use of personal funds, he was acting in his official capacity when ordering the collateral merchandise to sell on the online store. Consumers who bought the goods prior to Brown's departure and the subsequent creation of "Twister's Store" were led to believe the goods originated from the PGR. Hence, Brown cannot prevail on his claim of priority since he cannot show by a preponderance of the evidence a prior proprietary interest in the word mark PATRIOT GUARD RIDERS for collateral merchandise.
Decision: The opposition is dismissed." _START_SECTION_ Defending their trademark _START_PARAGRAPH_ After successfully registering multiple trademarks, the Patriot Guard Riders (PGR), Inc., began taking steps to enforce and defend its marks from unauthorized use._NEWLINE_A group in Michigan split from the PGR but continued to use multiple marks while conducting fundraising activities, most notably adopting the name "Michigan Patriot Guard" (MPG). The PGR made multiple requests of the MPG to cease and desist utilizing the name and trademarks. When the MPG failed to comply, the PGR filed a lawsuit in the US District Court in Flint, Michigan._NEWLINE_Before the lawsuit went to trial, the PGR and MPG reached a settlement. As part of the agreement, the MPG agreed to change its name. The organization's new name is Michigan Bikers Helping Veterans. + _START_ARTICLE_ Martyn Farr _START_PARAGRAPH_ Martyn Farr (born Crickhowell, Wales, March 3, 1951) is a leading exploratory cave diver and caver, known for his record-breaking cave dives and the exploration of many miles of previously undiscovered underground passages (e.g. in Ogof y Daren Cilau and Noon's Hole). As an author and photographer he has written many books on the subject of cave diving history and techniques and caving locations. _START_SECTION_ Life and career _START_PARAGRAPH_ Farr began caving in 1961 and cave diving in 1971, and within 10 years had established a world record for underwater cave penetration in the Bahamas. He is noted within the cave diving community for his explorations in Wookey Hole in 1977 and 1982, and for completing the first traverse of Llangattock Mountain in Wales in 1986, the execution of which was a televised media event, being the longest and deepest caving through-trip in the British Isles. In 1978, Farr also discovered the Pollatoomary cave in the Partry Mountains of the Republic of Ireland. In 2008, his student Artur Kozłowski explored this cave to a depth of 103 metres, which made it the deepest known cave in the British Isles._NEWLINE_As well as running a cave diving training facility in South Wales, Farr is a regular contributor to diving magazines around the world. Farr has also acted as support diver in some of the world's most notable cave diving penetrations, including the British-led expedition to Pozo Azul in Spain in September 2010, which at 8.8 km (5.5 miles) of underwater travel is the world's longest cave diving penetration. _NEWLINE_Martyn also owns Cwmdu Campsite, a Visit Wales 4 star campsite and caravan park in the Brecon Beacons National Park, an area on which many of Martyn's books are centered. _NEWLINE_Farr is the author of The Darkness Beckons, regarded as the definitive book on the history of UK cave diving. + _START_ARTICLE_ The Covenant, The Sword, and the Arm of the Lord _START_SECTION_ Leadership _START_PARAGRAPH_ The founder of the CSA was James Ellison, who served time in federal prison along with his "high priest" Kerry Noble. Robert G. Millar became one of Ellison's spiritual advisers, and he was also the founder of Elohim City. Ellison was also mentored by Richard Girnt Butler, founder of the Aryan Nations, and Robert E. Miles, founder of the Mountain Church in Cohoctah, Michigan. Both extreme right-wing leaders taught and practiced the theology of Christian Identity, a belief system which the FBI includes on its watch list as an extremist religion.
Ellison had close ties to the Ku Klux Klan and the Aryan Nations, based in Hayden Lake, Idaho, and led by Richard Girnt Butler, who was described as "the glue of the Aryan Nations movement in the Northwest, if not the country" by the supervisor of the Inland Northwest Joint Terrorism Task Force. Miles had a prison ministry and newsletter, relating mostly to violent white Aryan groups, of which there are many, most notably the Aryan Brotherhood. After Ellison was released from prison, he moved to Elohim City, where he married Millar's granddaughter._NEWLINE_The entire Council of Elders in the CSA community was deeply influenced and mentored by many outside sources. This nine-man council deliberated on both the spiritual meaning and the direction of CSA activities. Jim Ellison, Kerry Noble and William Wade were the only known members of the council. _START_SECTION_ Purpose _START_PARAGRAPH_ The CSA was an organization which believed that doomsday was imminent, and the 224-acre compound that was set up in Elijah became a community for its members. There they trained their members in paramilitary operations. The group believed in white supremacy and was anti-Semitic. Like other prominent anti-Semitic groups that believed in anti-Semitic canards, they referred to the United States Government as ZOG, short for Zionist Occupied Government. The military leader, who used the name Randall Rader during his stay at the CSA compound, left the group in a rift with Ellison and joined the newly forming group The Order in Idaho. The CSA initially professed the belief that the United States government would dissolve due to its own corruption, whereas The Order advocated revolution. However, in July 1983, the CSA published a manifesto called A.T.T.A.C.K. (Aryan Tactical Treaty for the Advancement of Christ’s Kingdom) which declared war against the government. This was seen by followers as the Second American Revolution. _START_SECTION_ Operations _START_PARAGRAPH_ CSA assassins monitored the homes of their targets, practiced mock assassinations of these targets with scoped rifles, and practiced attacks in a mock residential training facility known as Silhouette City. The perimeter of the CSA compound had 100-, 200-, and 300-yard (270 m) indicator plates nailed to trees to allow the defenders to adjust their sights accordingly to engage attackers. The central rallying point in the event of an attack was a concrete bunk house that housed the communications radios next to the 95-foot (29 m) tower, which was constructed for defense. The perimeter of the compound had built-in bunkers for one to three men, and each was numbered as a post and assigned to individuals as an area of responsibility. _NEWLINE_The line infantryman carried a Ruger Mini-14 .223 Remington rifle. As in the early days of the United States Marine Corps, the squads were set up in four-man fire teams. One man in the fire team carried a Heckler and Koch Model 91 rifle in .308 caliber. These had been modified via a technique which the organization sold to "brother groups," converting the rifles into illegal selective-fire weapons (capable of firing either single shots or fully automatically). The Elite "A" Team had black clothing and some fairly sophisticated weapons, such as the .22 caliber Ruger target pistol fitted with an integral silencer, and several MAC-10 submachineguns in both 9 mm and .45 ACP, also with attached suppressors.
These men trained in the covert aspects of military action and were to be the core of the defense initiative._NEWLINE_The Bureau of Alcohol, Tobacco, and Firearms (ATF) later determined that the CSA had obtained 155 Krugerrands, one live light antitank rocket, 94 long arms, 30 handguns, 35 sawed-off shotguns and machine guns, one heavy machine gun, and a quantity of C-4 explosives._NEWLINE_Within "Silhouette City", the CSA also ran a boot camp-style program known as the End Time Overcomer Survival Training School, which was conducted by Order member Randall Rader. Here, the group trained an estimated 1,500 like-minded Christian Identity adherents in combat techniques and paramilitary exercises. Upon completing this training, a newly trained militant would leave to join or start other similar militia groups._NEWLINE_The CSA and its paramilitary arm taught basic pistol and rifle use as well as personal home defense, rural and urban warfare, weapons proficiency, general military field craft, Christian martial arts, and natural wilderness survival._NEWLINE_In 1983, CSA member William Thomas, accompanying Richard Wayne Snell and Steven Scott, attempted to dynamite a natural gas pipeline which crossed the Red River on its way from the Gulf of Mexico to Chicago. This event was part of the group's A.T.T.A.C.K. operations. According to Kerry Noble, the group predicted this would result in riots (since it was winter). However, the trio were unsuccessful in carrying out the act of terror. _NEWLINE_The CSA had links with other radical organizations, including the Aryan Brotherhood, the Mountain Church, and The Order, which were all dangerous white supremacist organizations which advocated the violent overthrow of the United States Government. Many of their members were seen traveling in and out of the compound, and after a search of the compound, several stolen vehicles, including one belonging to The Order, were recovered._NEWLINE_According to a report conducted by the California Department of Justice, The Pagans Motorcycle Club provided the CSA with training in booby trap devices and survival techniques in return for weapons and ammunition._NEWLINE_Things began to go downhill for the organization after Snell, an alleged member, was arrested for killing an African-American police officer. Snell was later tied to the killing of a gun store owner in 1981, having obtained and used the same gun, the serial number of which had been removed by the CSA armorer, Kent Yates. Yates was arrested on Friday, July 13, 1984, on an outstanding warrant out of New Mexico for firearms violations in Farmington. He was later also charged and convicted of weapons manufacture and modification for the CSA._NEWLINE_After the incident with Snell, the FBI began to seek ways to infiltrate the CSA compound and stop the organization which it deemed dangerous. Its agents obtained warrants under Arkansas state law to arrest Ellison, the leader of the CSA, for multiple firearms violations. (The FBI later claimed that at all times it had an "inside man" in the CSA.) _START_SECTION_ Siege _START_PARAGRAPH_ On 16 April 1985, the FBI obtained a search warrant for the CSA compound._NEWLINE_Beginning on 19 April 1985, the FBI and the ATF, led by the FBI's Hostage Rescue Team (HRT), positioned around 300 federal agents in Elijah. It was necessary to keep the operation a secret, but this was not easy in the small community.
However, the FBI and ATF agents took advantage of Elijah being a common destination for anglers by pretending to be fishermen and registering at different motels near the various fishing destinations. On the morning of 19 April, they moved in and surrounded the CSA compound, putting some of their agents in fishing boats in order to seal off the lakeside area of the compound. There they waited, until a few hours later when two guards emerged from the compound. They appeared to be unaware of the presence of the officers and walked towards a sniper hold-out, until an officer yelled commands to return to the compound, with which the guards complied. Later, an unnamed individual emerged from the compound and talked with the federal agents and reported to Ellison that the FBI agents were outside and willing to negotiate his surrender and the emptying of the compound. Ellison emerged later. FBI agents had expected that he would not go down without a firefight, but the FBI negotiators convinced him that the CSA would certainly lose if they had one. They convinced him that they wanted peaceful cooperation, and he asked that his spiritual adviser, assumed to be Millar, come to the compound to instruct him. The individual was flown to the area and seemed eager to convince Ellison to stand down. They allowed the individual into the compound, and the FBI instructed him to call in every 30 minutes in order to report on how negotiations were going._NEWLINE_U.S. Attorney Asa Hutchinson, who later successfully prosecuted Ellison and other leaders of the CSA, put on an FBI flak jacket and entered the compound in order to join the negotiations, leading to a peaceful conclusion to the armed stand-off. After several calls requesting more time, early on the morning of the fourth day of the siege, Arkansas State Police entered the compound and escorted out the remaining members without further bloodshed. Women and children had earlier been evacuated to nearby motels. _START_SECTION_ Charges _START_PARAGRAPH_ Ellison and most of his leadership were charged in federal court with illegal weapons possession and racketeering. In September 1985, Ellison, Kerry Noble, and four other CSA members (Gary Stone, Timothy Russell, Rudy Loewen and David Giles) were sentenced to lengthy federal prison terms. A seventh CSA member, Stephen Scott, pleaded guilty in an Arkansas federal court to charges he dynamited a natural gas pipeline near Fulton, Arkansas in 1983, and was also sent to prison. Ex-CSA member Kent Yates also pleaded guilty to a charge of conspiring to make and transfer automatic weapons silencers._NEWLINE_Ellison faced the maximum sentence of 20 years in prison after he was convicted on federal racketeering and weapons charges. However, Ellison was released in 1987 after agreeing to testify against the leader and six senior members of the Aryan Nations. All seven men were arrested and indicted on charges of sedition. The jury found all the defendants not guilty on all charges. Upon his release from federal prison, Ellison moved to Elohim City._NEWLINE_Richard Wayne Snell, the man who shot and killed the police officer and a pawn shop owner, was sentenced to death by lethal injection, which was carried out on 19 April 1995, the same day as the Oklahoma City bombing. _START_SECTION_ Possible ties to the Oklahoma City bombing _START_PARAGRAPH_ There are several claims that the 1995 Oklahoma City bombing on the Alfred P. Murrah Federal Building was tied to the "New Day" teachings of Elohim City. 
No proof, however, has been established. Elohim City was assembled for the purpose of gathering "prophets of the New Day". Leader Robert G. Millar envisioned himself to be the "Shepherd of Shepherds", traveling to numerous alternative societies, many of which were and are still communes. His ambition was to unite these underground organizations. He appeared several times at the Padanaram Settlement, in southern Indiana, but contrary to reports, members of the Padanaram Settlement did not concur with the radical callings of either Millar or Ellison, who made two appearances there. "The Valley" was and still is known more for being a cultural hub for artists and philosophers, and until roughly 2003 it operated a sawmill._NEWLINE_Timothy McVeigh, who was convicted and executed for perpetrating the Oklahoma City bombing, had no association with the CSA and had just enlisted in the U.S. Army when the CSA compound was besieged and broken up. The Oklahoma City bombing occurred exactly on the 10-year anniversary of the start of the siege of the CSA compound in 1985. The most plausible link is the fact that Richard Wayne Snell, who was executed on the day of the Oklahoma City bombing, had planned a similar attack on the Murrah building in 1983 after becoming upset with the Internal Revenue Service. Additionally, Snell was heard taunting jailers that something drastic would happen on the day of his execution. However, McVeigh has stated that he chose the date of 19 April to coincide with the violent end of the Waco siege exactly two years prior. McVeigh had traveled to and visited Waco during the 51-day siege and cited it and the 1992 Ruby Ridge events as his primary motivation for carrying out the bombing._NEWLINE_The single incident in which the CSA was involved, the robbery of a pawn shop in Springfield, Missouri, was, in fact, foiled by a CSA member on the orders of Jim Ellison, unknown to Wayne Snell, who headed up the plan. It was in regard to this event, not the attack on the Oklahoma City Federal Building, that Ellison saw a "sign from God", which he interpreted to mean that they should not carry out the attempt._NEWLINE_The death knell of the CSA was its attempt to kill FBI special agent Jack Knox, the lead agent assigned to investigate the group; Asa Hutchinson, the federal prosecutor; and the federal judge who presided over the affair that brought about the eventual action against Gordon Kahl, a tax protester and a member of the Posse Comitatus, by federal agents at CSA member Leonard Ginter's home (called 'The Bunker', due to its construction from concrete covered with earth). Ellison revered Kahl as a hero. Like McVeigh, Kahl was a decorated American soldier; Kahl earned a Silver Star in the Korean War, and McVeigh earned a Bronze Star in the first Gulf War – Desert Storm. _START_SECTION_ Media _START_PARAGRAPH_ In 2013, Kerry Noble appeared on the Investigation Discovery show Dangerous Persuasions talking about his time with the group. He was also interviewed for an episode of Brainwashed on the Slice Network in Canada and discussed his time with the CSA._NEWLINE_The Discovery Channel crime series The FBI Files' sixth season featured an episode whose topic was the CSA. The episode reveals the details of the federal investigation into the group, the 1985 siege and aftermath. The episode originally aired 10 December 2002.
+ _START_ARTICLE_ I Alone (The Vampire Diaries) _START_SECTION_ Plot _START_PARAGRAPH_ After Damon (Ian Somerhalder) compels Alaric (Matt Davis) to do whatever he has to do to get the ascendant from Jo (Jodi Lyn O'Keefe), Alaric has no choice. After obtaining it, Damon compels him again to forget everything, and along with Elena (Nina Dobrev) they meet Liv (Penelope Mitchell), who sends them back to 1994 to find Bonnie (Kat Graham) and bring her home. While searching for Bonnie, Elena wonders why Jo agreed to give Damon the ascendant, which is the only thing protecting her from Kai (Chris Wood), but Damon manages to avoid the truth._NEWLINE_Damon and Elena page Bonnie and are able to speak to her and tell her that they are bringing her home. In their conversation Bonnie tells them that Kai stabbed her to get her blood, leaving her in Portland, and that she fears Kai might be free. While waiting for Bonnie to get back to Mystic Falls, Damon tells Elena the truth of how he got the ascendant, something that makes her furious; she accuses him of being willing to do anything to make her fall in love with him again, no matter who gets hurt. Damon confesses to her that Bonnie kept him alive while the two of them were stuck in 1994 and was the one giving him hope, explaining that this is the reason he wants to save Bonnie, and not only to make Elena fall in love with him again._NEWLINE_Back in the present, Kai kills a cab driver, and once he arrives in Mystic Falls he finds Liv. He steals some of her magic and tries to kill her, but Tyler (Michael Trevino) comes in time and saves her. Seeing that Kai is free, Tyler wants to take Liv inside the borders of Mystic Falls, where no magic works, to protect her from Kai. That forces Liv to bring Damon and Elena back to the present before Bonnie gets to them, and she is left behind again. They try to convince her to send them back, but Liv leaves with Tyler. A little later, Kai finds Elena and Damon at the cemetery and destroys the ascendant while he crosses the border into Mystic Falls, with Elena and Damon unable to do anything to stop him._NEWLINE_Jo finds out that the ascendant is gone and confronts Alaric, the only other person who knew where she kept it, but Alaric swears he did not take it. Jo tells him that it is possible that he took it but does not remember because he was compelled to forget. Alaric tells her that Damon is his friend and would never do that to him, so Jo makes him cross the border so they can see whether he has been compelled._NEWLINE_Meanwhile, Stefan (Paul Wesley) meets Matt's (Zach Roerig) friend who claims to be Sarah (Gabrielle Walsh), the daughter of their uncle. Later, it is revealed that she is actually an impostor who goes by the name Monique. Stefan knows about the real Sarah and where she has been all these years, since he has been watching over her for her whole life, so the moment he saw Monique he knew she was lying. He compels Monique to forget she ever knew Sarah and to leave Mystic Falls, because he does not want Damon, or anyone else, to know about Sarah. Enzo (Michael Malarkey) suspects that Stefan is hiding something and kills Monique because Stefan refuses to tell him. That makes Matt go to Jeremy (Steven R. McQueen) and ask him to help him kill Enzo._NEWLINE_At the end of the episode, Alaric, now knowing the truth about Damon compelling him, confronts him; even though Damon tries to apologize, Alaric hits him and leaves when he finds out that Kai is already free.
In the meantime, Bonnie returns to the Mystic Falls of 1994 but does not find Elena and Damon there, while in the present, Kai gets to Tyler's home and tells him that he wants to save Liv's life, asking for Tyler's help. _START_SECTION_ Ratings _START_PARAGRAPH_ In its original American broadcast, "I Alone" was watched by 1.49 million viewers, down by 0.19 million from the previous episode. _START_SECTION_ Reviews _START_PARAGRAPH_ "I Alone" received mixed reviews._NEWLINE_Stephanie Flasher from TV After Dark gave the episode a B+ rating, saying that the episode had a little bit of something for everyone and the writers Brian Young and Holly Brix took viewers on an emotional journey filled with ups and downs._NEWLINE_Ashley Dominique of Geeked Out Nation rated the episode 7.1/10, stating that the episode moved the plot with the Gemini Coven forward, readjusting the tensions within our characters._NEWLINE_Jen from TV Overmind rated the episode 7/10, saying that the episode left her feeling a little uneasy about the second half of the season and the next week's midseason finale._NEWLINE_Leigh Raines of TV Fanatic rated the episode 3.5/5, stating that the episode was a full hour of good intentions but with poor planning._NEWLINE_Sara Ditta from Next Projection rated the episode 5.7/10, saying that the only characters with any real spark in the episode were Enzo and Kai. "While neither the plot nor characters developed significantly in this episode, the show's baddies brought some fun moments in an episode that mostly sets up next week's midseason finale."_NEWLINE_Caroline Preece of Den of Geek gave a good review to the episode, saying that the show delivered yet another fantastic hour of television. "There are shows, like The Vampire Diaries, that start off pretty terribly before going on to become sizeable hits. They burn hot and bright for a couple of seasons before the complacency sets in and eventually drives even the most enthusiastic fans away. Vampire Diaries was a textbook example of this, and to see it get back to its early quality in its sixth year is fantastic." + _START_ARTICLE_ Zoltán Kontár _START_SECTION_ Club career _START_PARAGRAPH_ On 13 July 2015, FK Senica signed Kontár on a one-year loan from FC Petržalka akadémia. At the age of 21, he made his professional debut for FK Senica against FC DAC 1904 Dunajská Streda on 18 July 2015. + _START_ARTICLE_ Dawnn Lewis _START_SECTION_ Early life and education _START_PARAGRAPH_ Dawnn Lewis was born in Brooklyn, New York City, to Carl and Joyce Lewis, who are of African-American and Guyanese descent. She began singing at the age of four and acting at eleven. _NEWLINE_Lewis graduated at 16 from the High School of Music & Art in New York City, now known as Fiorello H. LaGuardia High School. Then she majored in musical theatre with a minor in journalism at the University of Miami, graduating with a Bachelor of Music degree, cum laude, in 1982. _START_SECTION_ A Different World (1987–1992) _START_PARAGRAPH_ Lewis appeared for the first five seasons of the six-season run as Jaleesa Vinson (later Vinson–Taylor), from 1987 until 1992. Lewis co-wrote the theme song to A Different World, with Bill Cosby and Stu Gardner, and co-performed the song for the first season. Although her character was married to another of the main characters on the show, she disappeared from A Different World without explanation, like Chuck Cunningham of Happy Days.
Lewis appeared in a special week-long segment of A Different World called the Hillman College Reunion, airing on Nick at Nite, along with Lisa Bonet, Jasmine Guy, Kadeem Hardison, Darryl M. Bell, Cree Summer, and Sinbad. On her Super Password appearance in 1988, she was paired with Dallas star Ken Kercheval, not any of her co-stars. _START_SECTION_ Hangin' with Mr. Cooper (1992–1993) _START_PARAGRAPH_ In September 1992, Lewis began starring in ABC's Hangin' with Mr. Cooper alongside Mark Curry and Holly Robinson. Lewis appeared in 20 of the 22 episodes of the first season as Robin Dumars, Mark's childhood best friend and roommate. Lewis did not appear on the two shows concurrently — she left A Different World to star in Hangin' with Mr. Cooper. Lewis and Holly Robinson, along with R&B quartet En Vogue, performed the theme song for season one of Hangin' with Mr. Cooper. Sometime before the end of season one, the show's producers decided to scale back on the concept of an updated version of the 1970s ABC sitcom Three's Company. Lewis left the show after the conclusion of the first season due to the producers deciding to change the direction of the show, replacing her character with a mother and child: Mark's cousin Geneva Lee (portrayed by Saundra Quarterman) and her daughter Nicole (portrayed by Raven-Symoné). _START_SECTION_ Other work _START_PARAGRAPH_ Lewis portrayed Deloris Van Cartier in Peter Schneider's Sister Act the Musical, which opened at the Pasadena Playhouse on October 24, 2006. Lewis has voiced Storm of the X-Men in three games, most recently Marvel: Ultimate Alliance 2. She also voiced Granny Grim on The Grim Adventures of Billy and Mandy, and voiced the female Shokan (Sheeva) in Mortal Kombat: Defenders of the Realm. Lewis has also done voice work as LaBarbara in Futurama, Detective Terri Lee on Spider-Man: The Animated Series, villainess Di Archer on Bruno the Kid, and voiced a number of characters on The Boondocks. Additionally, she voiced the character 'Sharona' on "King of the Hill."_NEWLINE_Lewis co-starred in two Disney Channel Original Movies, The Poof Point as Marigold Ballord, and as Gail DeBarge in Let It Shine. In 2000, Lewis played Blabberwort the Troll in the five-episode NBC miniseries The 10th Kingdom. _START_SECTION_ 2006–present _START_PARAGRAPH_ In 2006, Lewis starred as Melba Early in the film adaptation of Dreamgirls. Lewis released her debut CD, entitled Worth Waiting For, in 2006. She most recently played Addaperle in The Wiz with New York City Center's Encores! In 2009, Lewis played Denise Fields on One Tree Hill. In 2010, Lewis played a minor recurring role as Lauren's mother in The Secret Life of the American Teenager. In 2012, she voiced Malora in Strange Frame. She also appeared as Dr. Knapp on Days of Our Lives in 2012–2013. As of 2015, Lewis is playing a recurring role on Major Crimes as Patrice, a love interest for Lt. Provenza, whom he met during a case. In that same year, she also voiced Ruby's mother Helen Hanshaw in one episode of Sofia the First. _NEWLINE_In March 2016, Lewis was cast in Disney Junior's animated series Doc McStuffins as the voice of Grandma McStuffins._NEWLINE_In 2017 she provided the voice of Maybelle Mundy in the film Bunyan and Babe._NEWLINE_In 2018 she began voicing Fannie Granger on DreamWorks’s Spirit Riding Free, and in 2019 began voicing The Chief on Netflix’s animated Carmen Sandiego._NEWLINE_She will be portraying Zelma in the Tina Turner Broadway musical.
_START_SECTION_ Personal life _START_PARAGRAPH_ Lewis was married to former NBA player Johnny Newman in 2004. They divorced in 2006. + _START_ARTICLE_ International Space Year _START_PARAGRAPH_ The International Space Year (ISY) was 1992, the year of the quincentenary of Christopher Columbus's voyage to the Americas in 1492. First proposed by U.S. Senator Spark Matsunaga, the designation of 1992 as International Space Year was endorsed by 18 national and international space agencies, who also proposed the year's theme, "Mission to Planet Earth". Eventually, 29 national space agencies and 10 international organizations took part in coordinated activities to promote space exploration and the use of sustainable technology on Earth. _START_SECTION_ United Nations endorsement _START_PARAGRAPH_ The United Nations Committee on the Peaceful Uses of Outer Space agreed to recognize the International Space Year to promote peaceful cooperation between nations during its 1990 session. United Nations Secretary-General Boutros Boutros-Ghali, addressing the World Space Congress in Washington, D.C. on August 28, 1992 said, "One of the central goals of International Space year is to highlight the importance of understanding the Earth as a single, complex, interdependent system and to stress the unique role that space science and technology can play in promoting that understanding." _START_SECTION_ Global activities _START_PARAGRAPH_ International Space Year was celebrated by 29 space agencies in various countries with the purpose of establishing peaceful international relations in space programs. International Space Year conferences were held regularly in many nations. _START_SECTION_ Australia _START_PARAGRAPH_ In Australia, many public events were organized to augment public awareness of space by the National Space Society chapters of Australia. CSIRO led the "Mission to Planet Earth" Land Cover Change project, using satellites to study plant life on Earth in relation to climate and civilization. CSIRO and various Australian Universities also studied the ocean using European and Japanese satellites. Additionally, a series of commemorative stamps was issued by the Australia Post for International Space Year. _START_SECTION_ Japan _START_PARAGRAPH_ In Tokyo, Japan, a conference — the Asia-Pacific International Space Year Conference — was held to discuss the "Mission to Planet Earth" theme and international cooperation. _START_SECTION_ Russia _START_PARAGRAPH_ In Russia, the Foundation for Social Inventions launched Space Flight Europe-America 500 in an attempt to promote a peaceful social and economic relationship between the former Soviet states and the United States of America. Space Flight Europe-America 500 consisted of a Proton rocket carrying various items symbolizing peace, which orbited the Earth for a few days. The space craft was scheduled to land near Washington in late November. Its cost was estimated by Russian authorities at over US$200 million. _START_SECTION_ United States _START_PARAGRAPH_ In the United States, NASA, which led the US space agencies, responded to ISY with the completion or creation of many important space programs, including numerous collaborations with other domestic and international space agencies. A total of twelve programmes were launched, the most in any year up to that point. 
NASA focused particularly on projects — such as the Mars Observer, which studied the atmosphere and climate of Mars — that examined the possibility of sustaining human life outside Earth, as well as those exploring problems that existed on Earth at the time. ISY was also recognized with the opening of a new exhibit, entitled "Where Next, Columbus?" at the National Air and Space Museum. _START_SECTION_ They Might Be Giants _START_PARAGRAPH_ Alternative rock band They Might Be Giants were designated by NASA as the "Musical Ambassador" of the International Space Year when they were searching the NASA archives for images for their album, Apollo 18. The title of the album came directly from the NASA Apollo program—the last mission of which was Apollo 17. Accordionist and singer/songwriter John Linnell jokingly speculated that an album named Apollo 18 would be a cheaper alternative to actually manning a flight to the Moon as part of the International Space Year, although the album title was selected prior to the band's involvement with ISY. In support of the celebration, the album's back cover artwork and some promotional materials feature the International Space Year logo. Linnell explained that "[the band is] supposed to be included on lists of events happening in connection with International Space Year...In other words, on a particular month they'll say in some town there's this lecture about space telescopes and then there's this They Might Be Giants concert." On a different occasion, however, he pointed out that he "[didn't] think most people have heard that this is International Space Year". + _START_ARTICLE_ Pteraspidomorphi _START_SECTION_ Characteristics _START_PARAGRAPH_ Pteraspidomorphs are characterized by their massive dermal head armour having large, median, ventral and dorsal plates or shields._NEWLINE_The fossils show extensive shielding of the head. Many had hypocercal tails in order to generate lift to increase ease of movement through the water for their armoured bodies, which were covered in dermal bone. They also had sucking mouth parts and some species may have lived in fresh water._NEWLINE_Most pteraspidomorphs were marine, but lived very near to the shore, in lagoons and deltas. Some groups are thought to have been fresh water-dwelling. They were certainly bottom-dwellers, as shown by traces of abrasion of the ventral surfaces of their headshields. _START_SECTION_ Classification _START_PARAGRAPH_ Pteraspidomorphs have been first regarded as related to bony fishes, then to sharks, then ancestral to hagfishes, and finally as the closest jawless relatives of the gnathostomes._NEWLINE_This last theory was based on the fact that they seem to have a paired olfactory organ and a sensory-line pattern which is quite similar to that of the gnathostomes. These characteristics are, however, likely to be general for either the vertebrates or, at any rate, for the ensemble of all ostracoderms and the gnathostomes. Other ostracoderms, such as the Galeaspida are now known to have a paired olfactory organ. Current phylogenetic analysis using a large number of characteristics now place pteraspidomorphs as the sister-group of all other ostracoderms and the gnathostomes. + _START_ARTICLE_ Black Park _START_SECTION_ Wildlife _START_PARAGRAPH_ Black Park SSSI has heath, alder carr - both rare in the county - mixed and coniferous woodland and some areas of acid grassland. It has a varied fauna, and insects include the nationally rare Roesel's bush cricket. 
There are eighteen species of butterfly, birds including hobbies and nightjars, and snakes and lizards. _START_SECTION_ Filming location _START_PARAGRAPH_ Black Park is adjacent to Pinewood Film Studios and has been used as an outdoor location for many film and television productions. The woods and lake featured prominently in the Hammer Horror films from the late 1950s to the 1970s, including The Curse of Frankenstein (1957), The Brides of Dracula (1960), The Curse of the Werewolf (1961) and Dracula: Prince of Darkness (1966). In these films the location was often used to represent Transylvania. The park has also been used in film productions such as the James Bond film Goldfinger, where it was used for a night car chase scene (actually set in Switzerland and featuring Bond's Aston Martin DB5), and the 2006 version of Casino Royale, plus several Carry On films, Wombling Free, Batman, Hawk the Slayer, Sleepy Hollow, Bugsy Malone, the Harry Potter film series, Captain America: The First Avenger, Robin Hood, 47 Ronin,Eden Lake, and the Monty Python film And Now for Something Completely Different._NEWLINE_In television, Black Park, together with its lake, was used extensively in location filming for the planet Alzarius in the 1980 Doctor Who serial Full Circle, and was employed again two years later in the recording of the Restoration-era set serial The Visitation. Dressed with fake cobwebs, it was also used for the filming of the early Blake's 7 episode The Web. _START_SECTION_ Recreation and sports _START_PARAGRAPH_ Black Park is popular with walkers and dog owners due to the wide open spaces and well-maintained routes. _NEWLINE_During summer 2010 a 'Go-Ape' activity centre was established in the park with the construction of climbing rigging and zip lines between the trees. The area is properly supervised by park staff during opening hours. The Go Ape team now offers cycle hire._NEWLINE_Runners are commonplace within the park and the increase in private persons using the park for exercise/training has led to the establishment of a Parkrun event on Saturday mornings. The professionally organised events are free to enter and form part of a network of nationwide parkruns._NEWLINE_Mountain biking is popular in the park as the combination of dense woodland, open plains, technical sections and narrow but quick draining trails make for exciting riding. Beyond Black Park XC10 is an annual event organised in conjunction with Black Park staff by West Drayton Mountain Bike Club and Beyond Mountain Bikes of Surrey. The event attracts riders of all ages and skill levels._NEWLINE_The lake is open for fishing during the normal rod licence season, though pre-baiting, keep nets and night fishing are all forbidden. _START_SECTION_ Black Park at war _START_PARAGRAPH_ During both World War One and Two the Park saw service for the Empire with troops from the Canadian Forestry Regiment helping to farm the Park and harvest the wood, for use in the trenches of France or building air strips in France for the Royal Flying Corps. To this day the lines of trees they planted can still be clearly seen._NEWLINE_Sadly one of the Forestry Regiment never went home after being killed in a road traffic accident on the nearby Crooked Billet Roundabout. He is buried in the nearby St Margaret's Church, Iver Heath. 
Since 2007 the local Scout Group, 1st Iver Heath have laid poppies on his grave, as part of the Centenary of Scouting and an event called 'Uniform Day 007' that featured a representative of the Canadian Army who helped the Scouts' routine of laying a wreath for this young soldier many miles from home._NEWLINE_On the fields between the park and Iver Heath near Pinewood Studios, a World War One fighter crashed on its way to France after stopping off in Iver Heath. In World War Two a V2 rocket fell very close by the site of the fighter's location._NEWLINE_The Park was also used to store military supplies hidden amongst the trees from enemy surveillance, as was nearby Langley Park. _START_SECTION_ Geology _START_PARAGRAPH_ The park is the type site for Black Park Gravel Member, a layer of sand and gravel dating to the Anglian ice age, around 450,000 years ago. + _START_ARTICLE_ Johannes Adam Simon Oertel _START_SECTION_ Biography _START_PARAGRAPH_ After studying art in Germany at Nuremberg and Munich, he practiced engraving until 1848, in which year he came to the United States and taught for a time in Newark, New Jersey. In 1851, he married Julia Adelaide Torrey. They eventually had four children. After his marriage, he engraved plates for bank notes, painted portraits and colored photographs. In 1857 he was elected an associate of the National Academy of Design. In 1857 he moved to Madison, New Jersey, where he painted Lament of the Fallen Spirits and Redemption. _NEWLINE_ About this time, he was invited to assist in preparing new decorations for the capitol in Washington. In 1861 he transferred his studio to Westerly, Rhode Island, where he painted Father Time and his Family and The Final Harvest (1862), The Dispensation of the Promise and the Law (containing 150 figures, 1863), Walk to Emmaus, The Walk to Gethsemane, Easter Morning, Magdalen at the Sepulchre, The Rock of Ages, and others (1868)._NEWLINE__NEWLINE_His painting Rock of Ages became enormously popular and was reproduced in millions of photographs and chromolithographs for sale both in the United States and England. _START_SECTION_ American Civil War _START_PARAGRAPH_ During the American Civil War, Oertel accompanied the Army of Virginia under General Burnside for several months in 1862. His Virginia Turnpike and other landscapes were the fruit of this military experience. He also did some historical battle scenes, such as the "Battle of Sullivan's Island" that happened during the American Revolutionary War, and some illustrations for Harper's Weekly, such as the cover for November 15, 1864 issue, of "Convalescent Soldiers Passing through Washington, DC, to Re-join their Units" and "The Union Scout"._NEWLINE_While residing at Westerly, he prepared himself for orders in the Episcopal church, and he was made deacon in 1865, and subsequently presbyter. He then confined himself almost entirely to the domain of Christian art, and painted pictures that he presented to churches in Glen Cove, New York, New York City, Washington, D.C., North Carolina, and elsewhere._NEWLINE_He was in Washington, D.C., during the funeral of President Abraham Lincoln on April 19, 1865, and left an eloquent account of the event. _START_SECTION_ St James Episcopal Church _START_PARAGRAPH_ The Rev. Johannes Oertel served as the priest of St James Episcopal Church in Lenoir, North Carolina, from 1869 to 1874. 
He was one of the first in the valley to offer a school for African American children, and offered religious services to those recently freed from slavery, including baptism, confirmation, marriage and funeral rites._NEWLINE_The reredos in front of the church is an outstanding example of his woodworking skills. Made from over four hundred pieces of chestnut, oak, poplar, holly, cherry, beech, and pine, it was carved largely during missionary trips to the Chapel of Rest in Happy Valley, North Carolina, and the Chapel of Peace in Witnel, North Carolina. The architectural design is Gothic perpendicular from the 14th and 16th centuries. While Rev. Oertel carved other reredos and altar rails, the one in St. James is considered to be the most intricate and notable. The altar painting (1872) is layered oil on canvas with gold gilt, and depicts Jesus administering Holy Communion to male and female communicants._NEWLINE_While he was at St. James, friends in New York donated the 100-year-old pump organ from Christ Episcopal Church (Tarrytown, New York). The organ, dating from about 1770, was the first instrument to enhance the service in Lenoir. Rev. Oertel rebuilt the damaged organ, making new pipes, a new wind chest and bellows. He then carved an illuminated case for the organ works._NEWLINE_By the main door of the church is "Father Time and His Family" (1862, charcoal and pen on paper), which was completed in Westerly, Rhode Island. It depicts Father Time, his wife (the year) and their children (the months). Each child carries an item from the cornucopia representative of their month._NEWLINE_A collection of his art is held by the church, and includes: "The Wandering Jew" (1902?, oil on canvas); "Capturing Wild Horses" (print); "Founded Upon a Rock" (1900, oil on canvas); "Rock of Ages" (offset lithography), known as his most popular work; "Man Rowing Out on the Sea of Life With Christ as Pilot" (1880, oil on canvas); "In Memorium" (between 1880 and 1900, oil on canvas board); "Christian Hope" (1867, oil on canvas); "Head of St Paul" (oil on canvas, unknown date); "Expulsion from the Garden of Eden" (1893, oil on canvas); "Prophecy of Balaam" (1891, monochrome oil on canvas); "The Four Evangelists" (1884, monochrome oil on canvas); "Lament of the Fallen Spirits" (1850, oil on canvas); "Mary Magdalene at the Cross" (ca 1902, oil on canvas); "The Good Shepherd" (1878, oil on canvas); "The Prophet Jeremiah" (oil on canvas, unknown date); "The King of Truth" (1903, oil on canvas); "The Prophet Joel"; "The Prophet Ezekiel"; "The Prophet Isaiah"; "The Unknown Prophet"; "The Dispensation of Promise and the Law" (1864-1865, chalk and ink on linen-backed paper)._NEWLINE_He had charge of two parishes in North Carolina (in Lenoir) until 1876. He moved around a great deal as a priest and spent time in North Carolina, Florida, Tennessee, St. Louis, Washington DC, Maryland and Virginia. _START_SECTION_ Portrait painter _START_PARAGRAPH_ During his time, Johannes Oertel was also well known as a portrait painter, and he would often leave the church in Lenoir, North Carolina, to go north to raise money by painting portraits. Many of his head and bust portraits were popular after the Civil War, and he did a number of them for prosperous clients in New England. He made an interesting portrait of the Mayor of Providence, Rhode Island, Thomas A. Doyle, as a young man on his way up.
He would later serve eighteen years as the mayor, and brought Providence, Rhode Island from a manufacturing town to a small metropolis._NEWLINE_The Rev. Oertel is also known for his head of St Paul, held today by the St. James Episcopal Church, and portrayed as a weary but stern man. _START_SECTION_ Later years _START_PARAGRAPH_ Oertel was an instructor of art at Washington University in St Louis, in 1889-91. He spent the last 18 years of his life in a town near Washington DC, where he made many religious paintings and wood carvings. He painted a series of four large pictures entitled The Plan of Redemption which he presented to Sewanee (the University of the South in Tennessee). His last major work came in 1906-07 when he created the paintings and designed the new woodwork for the altarpiece of the Cathedral at Quincy, Illinois._NEWLINE_Oertel died in Vienna, Virginia, where he was living with one of his sons, and is buried in Flint Hill Cemetery in nearby Oakton. Collections of his papers are held by the libraries of George Washington University and the University of North Carolina at Chapel Hill. + _START_ARTICLE_ The Viral Fever _START_SECTION_ Early days _START_PARAGRAPH_ After graduating from IIT Kharagpur, Kumar quit his job as a consultant for US Air Force to try his hand at production jobs, assisting Farah Khan on Om Shanti Om. After a few production jobs, Kumar began to write and produce his own short films and videos._NEWLINE_Arunabh Kumar, along with long-time friends and IIT alumni, Amit Golani, and Biswapati Sarkar began pitching shows aimed at the youth to MTV. Faced with rejection, The Viral Fever was founded when the group came together and released a video titled Rowdies on YouTube starring Deepak Kumar Mishra and Naveen Kasturia. The runaway success of the video prompted Arunabh to create the YouTube focused video company, The Viral Fever, in 2012. _START_SECTION_ Growth on Youtube (2012–2014) _START_PARAGRAPH_ The Viral Fever released the first Barely Speaking with Arnub video with an interview of Shah Rukh Khan, and an appearance by then and current Delhi Chief Minister, Arvind Kejriwal. Biswapati Sarkar's parody of Indian news anchor Arnab Goswami was widely appreciated. TVF's content was dominated by parodies during these years with videos like Gaana Waala Song, Gangs of Social Media and Munna Jazbaati contributing to the growing popularity of The Viral Fever. The channel was recognised as one of the first success stories of original digital content in India. _START_SECTION_ Pioneering web-series in India (2015–present) _START_PARAGRAPH_ After a few years of creating "viral videos", The Viral Fever released India's first web-series, Permanent Roommates, in 2015. Featuring then unknown actor Sumeet Vyas and Nidhi Singh, Permanent Roommates has been watched over 5 crore (50 million) times. Pitchers was lauded as one of the best shows in recent memory in Indian entertainment for capturing the zeitgeist of the Indian startup scene. In 2016, The Viral Fever released the smash hit Tripling, and the widely applauded Humorously Yours. Other TVF network channels like Girliyapa, The Screen Patti and The Timeliners have also successfully debuted web-series. _START_SECTION_ TVFPlay & funding from Tiger Global (early 2016–present) _START_PARAGRAPH_ TVF debuted their platform, releasing the final two episodes of Pitchers on TVFPlay. The platform saw 10 lakh (1 million) hits in the first two days and crashed for 3 hours. 
In early 2016, venture capital firm Tiger Global Management invested $10 million into The Viral Fever, acquiring a 20% stake in the company. The Viral Fever has since launched other YouTube channels for original content: Girliyapa, a female-run channel, The Screen Patti, and The Timeliners, headquartered in New Delhi. TVF currently has offices in Mumbai, New Delhi, and Palo Alto. _START_SECTION_ Branded entertainment _START_PARAGRAPH_ The Viral Fever is one of the most sought-after creators of branded original content in India. TVF claims to have worked with over 150 brands. Some major companies to have worked with The Viral Fever or allied channels in the past include Procter & Gamble, Ola, Flipkart, Vodafone, Bharti Airtel, OnePlus, Xiaomi, Nokia and Tata Motors. _START_SECTION_ The Making Of... (2014) _START_PARAGRAPH_ The Making Of... is a series about the making of entertainment products in India, ranging from a blockbuster movie to a decade-long TV soap. The first season of The Making Of... comprised five episodes, with a standalone episode released for Season 2 in 2016. _START_SECTION_ Permanent Roommates (2014–present) _START_PARAGRAPH_ Permanent Roommates is an Indian web series created by TVF and Biswapati Sarkar. The series revolves around a young couple, Tanya and Mikesh, who, after being in a long-distance relationship for 3 years, face the prospect of marriage. The first season was released on YouTube on October 29, 2014. The second season was released on TVFPlay, The Viral Fever's video streaming medium, on February 14, 2016. Permanent Roommates has been renewed for a third season, rumoured to premiere in 2018._NEWLINE_Permanent Roommates was lauded for its portrayal of live-in relationships in conservative urban Indian families. Actors Sumeet Vyas and Nidhi Singh have gone on from Permanent Roommates to be showcased in Bollywood films. _START_SECTION_ Barely Speaking with Arnub (2014–present) _START_PARAGRAPH_ A talk show starring Biswapati Sarkar as Arnub with a U, parodying Indian news anchor Arnab Goswami. Barely Speaking with Arnub was picked up after the success of an earlier video titled "Bollywood Aam Aadmi Party" featuring Jitendra Kumar, Nidhi Bisht, and novelist Mayank Shekhar. The parody talk show has been lauded for its portrayal of the loud and boisterous nature of Indian news, where anchors prefer theatrics over nuance. Season one of the show opened with an interview with Shah Rukh Khan. Popular celebrities who have appeared on the show for an interview with Sarkar include Delhi Chief Minister Arvind Kejriwal, Ranveer Singh, Parineeti Chopra, Sunny Leone and Chetan Bhagat. In the first two seasons, Biswapati Sarkar, Amit Golani, Vipul Goyal, Shivankit Parihar, Jasmeet Singh Bhatia, and Abhishek Upmanyu have been writers on the show._NEWLINE_Barely Speaking with Arnub returned for a shortened season two in 2016. The show has been on hold as writer Biswapati Sarkar focuses on writing web series, including the sequel to TVF Pitchers. _START_SECTION_ TVF Pitchers (2015) _START_PARAGRAPH_ TVF Pitchers is an Indian web series created by The Viral Fever (TVF) and developed by Arunabh Kumar with assistance from others. The first season consists of five episodes and premiered online on The Viral Fever's content portal TVFPlay on 3 June 2015. A week later, on 10 June, it premiered on YouTube. The season finale premiered on TVFPlay on 30 August 2015.
It follows four friends, Naveen, Jitu, Yogi and Mandal, who quit their jobs in order to develop their own start-up company._NEWLINE_In 2016, TVF announced December 2017 as the projected release of Season 2 of Pitchers with the last scene of Permanent Roommates. _START_SECTION_ TVF Tripling (2016) _START_PARAGRAPH_ TVF Tripling is an Asian Television Award-winning Indian web series created by The Viral Fever. It traces the story of three siblings, Chandan, Chanchal & Chitvan, who together embark on a hilarious journey to find themselves and mend their relationships. It features Sumeet Vyas, Maanvi Gagroo and Amol Parashar, was written by Sumeet Vyas and Akarsh Khurana along with other contributors, and has won several awards, including a Kyoorius Blue Elephant. The Viral Fever partnered with Tata Motors for the project to promote the newly launched Tata Tiago. Tripling was recognised as one of the best web-series of 2016 and is a benchmark of success in Indian branded content. _START_SECTION_ Chai Sutta Chronicles (2013–present) _START_PARAGRAPH_ Chai Sutta Chronicles is TVF's Jim Jarmusch-inspired series about conversations between friends. Each episode deals with a theme and a conversation over a cigarette and a cup of tea. Season 1 of Chai Sutta Chronicles aired in 2013, with a season 2 released over 2017-18. _START_SECTION_ Tech Conversations With Dad (2014–present) _START_PARAGRAPH_ An ongoing TVF series about Jeetu and his conversations with his father - often about technology, relationships, and families in India. This is TVF's longest-running digital title, with 9 videos in 4 years. _START_SECTION_ Bisht, Please! (2017) _START_PARAGRAPH_ Nidhi Bisht's solo vehicle released in 2017 - about a small-town Indian girl who moves to the city. _START_SECTION_ Humorously Yours (2016-present) _START_PARAGRAPH_ Long-time TVF writer and stand-up comic Vipul Goyal features in a semi-autobiographical story about the life of a stand-up comic. The show furthered TVF's reputation for story-telling in the Indian context. Humorously Yours has been renewed for a second season, which was released in 2019. _START_SECTION_ F.A.T.H.E.R.S (2017) _START_PARAGRAPH_ A Tech Conversations with Dad spin-off, Fathers features three veterans of the Indian television screen - Brijendra Kala, Gajraj Rao, and Atul Srivastava - playing three fathers trying to keep up with the times. _START_SECTION_ Inmates (2017) _START_PARAGRAPH_ Inmates is a sitcom about living in Mumbai in the 21st century, starring Ashish Verma, Mukti Mohan and Aakansha Thakur, along with writers Raghav Raj Kakkar and Kashyap Kapoor. _START_SECTION_ Bachelors (2016–present) _START_PARAGRAPH_ Bachelors is a show about four bachelors who live together and face hilarious situations as they take on the world after college. In 2016, TVF released a pilot episode of Bachelors featuring popular YouTube star Bhuvan Bam. The success of the pilot led to a 4-episode extension with Bhuvan and over 25 million views. Bachelors was picked up for Season 2, which was released in late 2017, with Jeetendra Kumar replacing Bhuvan Bam as the lead. Season 2 has also crossed 25 million views since release and was listed among the best web-series of 2017 by FilmCompanion and Indian Express. _START_SECTION_ The Aam Aadmi Family (2016) _START_PARAGRAPH_ The Aam Aadmi Family is an offering from TVF's sister channel, The Timeliners. It is an album of moments from the life of a middle-class family based in Delhi.
It has garnered an average of 8 million views across 3 seasons. _START_SECTION_ Flames (2018) _START_PARAGRAPH_ Flames is another offering from TVF's sister channel, The Timeliners. It is the story of a young romance unfolding as a chemical reaction. Studious Rajat falls for Ishita, the new girl at the tuition centre. Rajat's BFFs, Pandey and Anusha, find their friendship beginning to turn into a relationship. The equations of friendship evolve in the first season of this teenage romance. This series has already garnered an average of 5 million views. _START_SECTION_ Zeroes (2018) _START_PARAGRAPH_ Zeroes is a mini-series from TheScreenpatti. It is about four zeroes who come together with an almost delusional ambition to create a great company despite being already late in the startup race. Whether they rise above their label of zeroes forms the crux of the story. _START_SECTION_ Yeh Meri Family (2018) _START_PARAGRAPH_ Yeh Meri Family is a mini-series released by TVF. It is about a nuclear middle-class family living at the height of the fads of the 1990s. The story revolves around a 13-year-old boy who is average at academics and is constantly being bullied and blackmailed by his elder brother. It is a mixture of the drama and thrills of being a teenager. _START_SECTION_ Awkward Conversations With Parents (2018) _START_PARAGRAPH_ Awkward Conversations With Parents is a six-episode series produced by The Screen Patti. The story is about a modern boy named Ishan and his down-to-earth, traditional parents. Ishan is in his late teens, a year before he leaves home for college. He and his parents find themselves in strange conversations about the desires of a boy growing into an adult. The series covers various awkward topics such as condoms, wet dreams, etc. _START_SECTION_ Kota Factory (2019) _START_PARAGRAPH_ Kota Factory is a TVF original released in 2019. _START_SECTION_ Controversy _START_PARAGRAPH_ In March 2017, Arunabh Kumar found himself in the middle of a sexual harassment claim that was posted anonymously via Medium. TVF released a press release via Medium refuting the claims. Several women have come out with similar stories of harassment. An FIR has been filed against Kumar in this case._NEWLINE_A second FIR was then lodged against Arunabh Kumar, the sexual harassment-accused CEO and founder of The Viral Fever. Mumbai's Versova Police registered the FIR after another woman filed a complaint against Kumar over an incident she said took place in 2014. On 16 June 2017, Arunabh Kumar, accused in multiple sexual harassment cases, stepped down as TVF CEO. Dhawal Gusain now leads the company as CEO, with Karan Chaudhry stepping in as COO and Sameer Saxena, the creator of Permanent Roommates & Tripling, appointed CCO. + _START_ARTICLE_ Tobias Bridge _START_PARAGRAPH_ Sir Tobias Bridge fought for Parliament in the English Civil War, and served the Lord Protector Oliver Cromwell during the Interregnum. After the Restoration he served King Charles II._NEWLINE_During the English Civil War, Bridge fought for Parliament under Fairfax. During the Interregnum he was an active supporter of Oliver Cromwell and served on several influential committees.
From 1655 to 1659 he was a Colonel of Horse, and on the death of Charles Worsley he succeeded to the governorship of the Cheshire, Lancashire and Staffordshire district during the second half of 1656, under the Rule of the Major-Generals._NEWLINE_During the Second Commonwealth, in the immediate prelude to the restoration of the monarchy, he served as a major in Lord Lockhart's Regiment of Horse at Dunkirk, and after the restoration he was appointed Captain of Horse at Dunkirk, a post in which he took direct orders from the Governor of Dunkirk and King Charles II. He held the post until 1662, when Dunkirk was sold to France. On his return from Dunkirk he was commissioned into the Duke of Richmond's Regiment as a captain._NEWLINE_A year after he was knighted in 1666, Bridge went to Barbados as colonel of his regiment. In 1673 he commanded the local land forces against the Baron of Tobago in one of the many wars over that island. In 1674 he was admitted to the council of Barbados. He probably died in Bridgetown, a town named after him and the capital of Barbados. + _START_ARTICLE_ General Jewish Labour Bund in Belarus _START_PARAGRAPH_ The Belarusian chapter of the General Jewish Labour Bund was among the political parties forming the government and parliament of the Belarusian People's Republic, which gained independence in 1918._NEWLINE_Members of the Bund became members of the Parliament. Bund member Mojżesz Gutman even became a Minister without portfolio in the Government of the newly created republic and drafted its constitution. The Bund later left the government bodies of the BNR._NEWLINE_Contrary to the attitude of the Communist party in Ukraine and Russia proper, the Communist Party (Bolsheviks) of Lithuania and Belorussia agreed to integrate the local Bund, renamed the Belarusian Kombund in April 1920 after the Twelfth Conference of the Bund held on April 12–19, 1920 in Gomel, into its ranks as an autonomous Jewish Communist Party, thus not forcing individual members to join the party either directly or through the Yevsektsiya. They even demanded that the Yevsektsiya be dissolved into the Kombund. This seems, however, to have been a mere agreement on paper that was never implemented, a manoeuvre by the Communists to attract support from Bundists, as the Bund was more powerful than them in the Belarusian cities._NEWLINE_In 1921, at the Thirteenth Conference of what was still the "General Jewish Labour Bund in Lithuania, Poland and Russia", i.e. by then in Russia, Ukraine and Belarus, a majority of the Bundist delegates decided to dissolve the party; part of its membership joined the R.C.P.(B.) on the basis of the rules of admission, while the minority formed the Social Democratic Bund._NEWLINE_In West Belarus, which was part of interwar Poland, the remnants of the party finally merged into the Polish Bund, while many activists chose to join the Polish Communist Party. + _START_ARTICLE_ Albrecht v. Herald Co. _START_SECTION_ Background _START_PARAGRAPH_ Lester J. Albrecht, an independent newspaper carrier, bought from Herald Publishing Company at wholesale and sold at retail copies of Herald's morning newspaper, the St. Louis Globe-Democrat, under an exclusive territory arrangement terminable if a carrier exceeded the maximum retail price advertised by Herald. When Albrecht exceeded that price, Herald Co. protested to him and then informed Albrecht's subscribers that it would itself deliver the paper at the lower price. Herald Co. engaged an agency (Milne) to solicit Albrecht's customers.
About 300 of Albrecht's 1200 subscribers switched to direct delivery by Herald._NEWLINE_Herald Co. later turned these customers over, without cost, to another carrier (Kroner), who was aware of Herald's purpose and knew that he might have to return the route if Albrecht discontinued his pricing practice. Herald Co. told Albrecht that he could have his customers back if he adhered to the suggested price. Albrecht filed a treble-damage complaint which, as later amended, charged a combination in restraint of trade, in violation of section 1 of the Sherman Antitrust Act, among Herald, Albrecht's customers, Milne, and Kroner. Albrecht's appointment as carrier was terminated, and Herald required him to sell his route. Albrecht made the sale at a price found to be lower than it would have been but for the conduct of Herald Co._NEWLINE_The jury found for Herald Co. Albrecht moved for a judgment notwithstanding the verdict, asserting that, under United States v. Parke, Davis & Co., and like cases, the undisputed facts showed a combination to fix resale prices, which was per se illegal under § 1 of the Sherman Act. The district court denied the motion. _START_SECTION_ Ruling of Eighth Circuit _START_PARAGRAPH_ The court of appeals affirmed. It held that there could be no violation of § 1 of the Sherman Act, which requires concerted action, because Herald's action was unilateral. Herald was entitled to refuse to deal with Albrecht because he violated his contract requiring him to observe Herald's maximum price, and Herald was entitled to engage in competition with Albrecht because he was not entitled to exclusivity after violating the contract. _START_SECTION_ Ruling of Supreme Court _START_PARAGRAPH_ The Supreme Court reversed. Justice White wrote the opinion for the Court; Justice Douglas concurred, and Justices Harlan and Stewart dissented. _START_SECTION_ Majority opinion _START_PARAGRAPH_ The Court decided two principal points, one of which was later overruled. First, the conduct was not unilateral but rather was concerted. Second, maximum price-fixing was illegal per se, the holding later overruled by Khan. _START_SECTION_ The combination _START_PARAGRAPH_ Relying on the Parke Davis case, the Court found that Herald had put together a combination. In Parke Davis the "combination with retailers arose because their acquiescence in the suggested prices was secured by threats of termination; the combination with wholesalers arose because they cooperated in terminating price-cutting retailers." By the same token, "there can be no doubt that a combination arose between respondent, Milne, and Kroner to force petitioner to conform to the advertised retail price." Herald:_NEWLINE_hired Milne to solicit customers away from petitioner in order to get petitioner to reduce his price. It was through the efforts of Milne, as well as because of respondent's letter to petitioner's customers, that about 300 customers were obtained for Kroner. Milne's purpose was undoubtedly to earn its fee, but it was aware that the aim of the solicitation campaign was to force petitioner to lower his price. Kroner knew that respondent was giving him the customer list as part of a program to get petitioner to conform to the advertised price, and he knew that he might have to return the customers if petitioner ultimately complied with respondent's demands. He undertook to deliver papers at the suggested price, and materially aided in the accomplishment of respondent's plan.
Given the uncontradicted facts recited by the Court of Appeals, there was a combination within the meaning of § 1 between respondent, Milne, and Kroner, and the Court of Appeals erred in holding to the contrary._NEWLINE_Justice White pointed out other possible combinations that Albrecht might properly have argued existed. First, he could have claimed a combination between Herald and himself, at least "as of the day he unwillingly complied" with Herald's advertised price. Second, "he might successfully have claimed that respondent [Herald] had combined with other carriers because the firmly enforced price policy applied to all carriers, most of whom acquiesced in it." A third possible combination was between Herald and Albrecht's customers. _START_SECTION_ The price fix _START_PARAGRAPH_ Price-fixing agreements and combinations are illegal per se, including ones to fix maximum prices. In Kiefer-Stewart Co. v. Seagram & Sons, the Court pointed out, liquor distributors combined to set maximum resale prices. The court of appeals perceived no restraint of trade, but the Supreme Court reversed. It held "that agreements to fix maximum prices 'no less than those to fix minimum prices, cripple the freedom of traders, and thereby restrain their ability to sell in accordance with their own judgment.' " The Court said that it agreed with the Kiefer-Stewart decision:_NEWLINE_Maximum and minimum price-fixing may have different consequences in many situations. But schemes to fix maximum prices, by substituting the perhaps erroneous judgment of a seller for the forces of the competitive market, may severely intrude upon the ability of buyers to compete and survive in that market. Competition, even in a single product, is not cast in a single mold. Maximum prices may be fixed too low for the dealer to furnish services essential to the value which goods have for the consumer or to furnish services and conveniences which consumers desire and for which they are willing to pay. Maximum price-fixing may channel distribution through a few large or specifically advantaged dealers who otherwise would be subject to significant non-price competition. Moreover, if the actual price charged under a maximum price scheme is nearly always the fixed maximum price, which is increasingly likely as the maximum price approaches the actual cost of the dealer, the scheme tends to acquire all the attributes of an arrangement fixing minimum prices. It is our view, therefore, that the combination formed by the respondent in this case to force petitioner to maintain a specified price for the resale of the newspapers which he had purchased from respondent constituted, without more, an illegal restraint of trade under § 1 of the Sherman Act. _START_SECTION_ Concurring opinion _START_PARAGRAPH_ Justice Douglas agreed that the court of appeals erred, but considered that "this is a 'rule of reason' case." _START_SECTION_ Harlan dissent _START_PARAGRAPH_ Justice Harlan considered maximum price-fixing beneficial to the public:_NEWLINE_Other things being equal, a manufacturer would like to restrict those distributing his product to the lowest feasible profit margin, for, in this way, he achieves the lowest overall price to the public and the largest volume. When a manufacturer dictates a minimum resale price, he is responding to the interest of his [retailer] customers, who may treat his product better if they have a secure high margin of profits. 
When the same manufacturer dictates a price ceiling, however, he is acting directly in his own interest, and there is no room for the inference that he is merely a mechanism for accomplishing anticompetitive purposes of his customers._NEWLINE_Justice Harlan also disagreed that one who merely acquiesces engages in concerted action within the meaning of § 1 of the Sherman Act. _START_SECTION_ Stewart dissent _START_PARAGRAPH_ Justice Stewart considered that Herald was justified in fixing maximum prices to its ultimate customers, the consuming public, because that was a necessary defensive measure in the face of the territorial monopoly granted the distributors. By not permitting this, "The Court today stands the Sherman Act on its head." _START_SECTION_ Economic background _START_PARAGRAPH_ A newspaper's profits are determined by its circulation and the number of advertisements it sells. As in every circulation industry, circulation depends upon the price of a copy as well as the amount of advertising. Similarly, the demand for advertising space is determined by circulation: the higher the circulation, the higher the demand for advertising space. The profit-maximizing newspaper monopolist therefore sets his copy price taking into account the cost per copy, the marginal cost of advertisement, the traditional price elasticity of demand, and a term that captures the feedback effect of lower copy prices inducing more advertising and vice versa. Most important is the term that captures the marginal advertising profit from selling additional advertising due to increased circulation. The newspaper monopolist's optimal price is therefore lower than for traditional monopolists in non-circulation industries. + _START_ARTICLE_ Gudastviri _START_SECTION_ Construction _START_PARAGRAPH_ The gudastviri is made up of two main parts: the first is a whole sheep or goat skin, or a sewn, rectangular leather bag ("guda"); the second is a yoked double chanter ("stviri"), terminating in a single horn bell, which makes the gudastviri a member of the hornpipe class of bagpipes._NEWLINE_There is a small wooden blow-pipe (khreko) with a check-valve tied into one leg, or corner, of the bag. A fixed round wooden stock holding the chanter is tied into the bag in the opposite foreleg, or corner. The chanter itself has two wooden pipes (dedani) of equal length, bore and wall thickness, which are inserted into the stock. The left chanter tube, the "leader", has the most finger holes; it is also called the "teller" or "beginner". The right chanter tube, the "bass", is called mebane or "deep voice producer". This bass pipe has three front-facing holes, and the "beginner" has six holes (but the Adjaran chiboni's leader pipe has only five holes). _NEWLINE_The three bottom holes of the left pipe are placed symmetrically across from the three holes of the right pipe. _START_SECTION_ Tuning _START_PARAGRAPH_ The Adjaran chiboni has a diatonic scale. It can produce two-part chords and two-part tunes. The two parts are produced by the simultaneous sound of both dedanis. The player's left hand plays the highest notes of the scale on the left chanter tube, while the fingers of the player's right hand cover and uncover the lower notes of the scale, which is made possible by the limited number of finger holes (only 3 or 4 holes) disposed lower down, toward the distal end of the right chanter tube. _NEWLINE_The compass of a chiboni is a major sixth (but the Rachian gudastviri's diapason can be a minor or a major seventh).
The ends of the pipes are fixed inside the resonator/horn. The horn is made of Caucasian goat or bull horn. The gudastviri is decorated with silver bands, mounted with colored glass beads, and numerous small chains. There is a ball of cotton wool inserted into the open end of the horn to absorb the moisture formed during playing. The bag (guda) can have a bag cover of cloth or leather, or have the natural goat hair left on the outside of the bag. _NEWLINE_The six holes of the left reed pipe emit notes of the first octave: F, E, D, C, B, A, G; the three holes of the right one emit deep-voiced notes: C, B, A, G. _START_SECTION_ Playing and application _START_PARAGRAPH_ The gudastviri is used for vocal accompaniment. A majority of recitative songs in the region of Racha were performed with its accompaniment. The gudastviri player's repertoire consists of historical, epic, satirical, comic, and lyrical verses, which are performed as one-part songs. These songs are recitatives, and it is the text, not the melody, that is the most important part of the performance. _NEWLINE_Traditionally, only men play this instrument, and Rachian gudastviri players were strolling musicians who were welcomed as guests at every family celebration, party, or wedding. It was a kind of profession that served as the main source of the player's income. Gudastviri players often took part in the old Georgian improvisation competition known as berikaoba, where they had to invent witty epic, lyrical or comical poems on the spot and recite these poems, accompanied by gudastviri music. _NEWLINE_The competition was often won by the most skillful berika (participant). _START_SECTION_ Design and development _START_PARAGRAPH_ In the region of eastern Javakheti, gudastviri pipes are made of very young branches of a dog-rose. The gudastviri itself is normally constructed by the player to his tastes. Jewellers may also be hired to ornament the instrument._NEWLINE_Among the kinds of Georgian gudastviri, the most developed is the Adjarian chiboni. As for the gudastviri of Pshavi, it belongs, perhaps, to an earlier stage of development, as it has only one hole on the bass chanter, possibly indicating this instrument's early origin. + _START_ARTICLE_ Ignacio Cáceres _START_SECTION_ Biography _START_PARAGRAPH_ He was the silver medallist in the 10,000 m at the 2001 Summer Universiade, and the following year he finished twelfth in the event at the 2002 European Championships. He was selected for the marathon team at the 2010 European Athletics Championships, but failed to finish the race._NEWLINE_He set a personal best at the 2012 Rotterdam Marathon, taking ninth place in a time of 2:11:58 hours. Also in 2012, he competed in the marathon at the Olympic Games, finishing in 31st place. + _START_ARTICLE_ Masayoshi Tomizuka _START_SECTION_ Career _START_PARAGRAPH_ Tomizuka joined the faculty of the Department of Mechanical Engineering at the University of California, Berkeley in 1974. He served as Vice Chair of Mechanical Engineering in charge of instruction from December 1989 to December 1991, and as Vice Chair in charge of graduate studies from July 1995 to December 1996. Since June 11, 2009, he has been Executive Associate Dean for the College of Engineering at UC Berkeley. He also served as Program Director of the Dynamic Systems and Control Program at the National Science Foundation from Sept. 2002 to Dec. 2004.
_START_SECTION_ Research interests _START_PARAGRAPH_ Tomizuka's current research interests include optimal and adaptive control, digital control, signal processing, motion control, and control problems related to robotics, manufacturing, data storage devices, vehicles and human-machine systems. _START_SECTION_ Society activities _START_PARAGRAPH_ Tomizuka has long been an active member of the Dynamic Systems and Control Division (DSCD) of the American Society of Mechanical Engineers (ASME). He served as chairman of the Executive Committee of the Division (1986–87), Technical Editor of the ASME Journal of Dynamic Systems, Measurement and Control, J-DSMC (1988–93) and Editor-in-Chief of the IEEE/ASME Transactions on Mechatronics (1997–99). He served as President of the American Automatic Control Council (1998–99). He chairs the IFAC Technical Committee on Mechatronic Systems. He is a Fellow of the ASME, the Institute of Electrical and Electronics Engineers (IEEE) and the Society of Manufacturing Engineers. He is the recipient of the J-DSMC Best Paper Award (1995), the DSCD Outstanding Investigator Award (1996), the Pi Tau Sigma-ASME Charles Russ Richards Memorial Award (1997), the DSCD Leadership Award (2000), the Rufus Oldenburger Medal (2002) and the John Ragazzini Award (2006). The Oldenburger Medal was awarded to him for his seminal contributions in the areas of adaptive control, preview control and zero-phase control. + _START_ARTICLE_ A-normal form _START_SECTION_ Grammar _START_PARAGRAPH_ The following BNF grammar describes the pure λ-calculus modified to support the constraints of ANF:_NEWLINE_ EXP ::= VAL_NEWLINE_ | let VAR = VAL in EXP_NEWLINE__NEWLINE_ VAL ::= λ VAR . EXP_NEWLINE_ | VAR_NEWLINE_ | VAL VAL_NEWLINE_Variants of ANF used in compilers or in research often allow constants, records, tuples, multi-argument functions, primitive operations and conditional expressions as well. _START_SECTION_ Examples _START_PARAGRAPH_ The expression:_NEWLINE_f(g(x),h(y))_NEWLINE_is written in ANF as:_NEWLINE_let v0 = g(x) in_NEWLINE_ let v1 = h(y) in_NEWLINE_ f(v0,v1) (an illustrative conversion sketch follows below) + _START_ARTICLE_ Prometheus Global Media _START_SECTION_ Founding _START_PARAGRAPH_ On December 10, 2009, the Nielsen Company announced that it would sell its Business Media division, which included brands such as Adweek, Billboard, and The Hollywood Reporter, to a new company known as e5 Global Media, a joint venture between Guggenheim Partners and Pluribus Capital Management, a company led by James Finkelstein, Matthew Doull, and George Green. Two Nielsen properties, Editor & Publisher and Kirkus Reviews, were not included in the sale, and were to be shut down. Editor & Publisher would instead be sold to the Duncan McIntosh Company, and Kirkus Reviews would be sold to Herbert Simon. The company's first CEO was Richard Beckman, previously an executive and publisher at Condé Nast and Fairchild Publications, and former publisher of the magazines GQ and Vogue. Beckman's career suffered a setback in 1999 following "some inappropriate behavior" resulting in injuries to Vogue's West Coast advertising director Carol Matthews, while Beckman was Matthews' publisher at Condé Nast._NEWLINE_Beckman's first major move was a re-launch of The Hollywood Reporter; with the hiring of Janice Min, formerly of Us Weekly, as editorial director, THR replaced its daily print publication with a weekly magazine, and performed a significant redesign to its website with an increased focus on breaking scoops.
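As a brief aside to the A-normal form grammar and example above, the following is a minimal, self-contained sketch of the conversion illustrated there, turning nested calls such as f(g(x), h(y)) into let-bound temporaries. The AST classes (Var, Call, Let), the fresh-name counter, and the to_anf helper are illustrative assumptions for this sketch, not part of any standard library or a canonical ANF implementation.

```python
from dataclasses import dataclass
from typing import List, Union

# Tiny illustrative expression AST; names are assumptions for this sketch.
@dataclass
class Var:
    name: str

@dataclass
class Call:
    func: "Expr"
    args: List["Expr"]

@dataclass
class Let:
    var: str
    value: "Expr"
    body: "Expr"

Expr = Union[Var, Call, Let]

_counter = 0

def fresh() -> str:
    """Return a fresh temporary name: v0, v1, ..."""
    global _counter
    name = f"v{_counter}"
    _counter += 1
    return name

def to_anf(expr: Expr) -> Expr:
    """Hoist nested calls into let-bindings so every call argument is atomic."""
    bindings: List[tuple] = []

    def atomize(e: Expr) -> Var:
        # A variable is already atomic; a nested call gets a fresh let-bound name.
        if isinstance(e, Var):
            return e
        bound = to_anf(e)
        name = fresh()
        bindings.append((name, bound))
        return Var(name)

    if isinstance(expr, Var):
        return expr
    if isinstance(expr, Call):
        func = expr.func if isinstance(expr.func, Var) else atomize(expr.func)
        args = [atomize(a) for a in expr.args]
        result: Expr = Call(func, args)
        # Wrap the now-atomic call in the collected bindings, innermost last.
        for name, value in reversed(bindings):
            result = Let(name, value, result)
        return result
    return expr  # Let nodes are assumed to be in ANF already in this sketch.

# f(g(x), h(y))  ==>  let v0 = g(x) in let v1 = h(y) in f(v0, v1)
source = Call(Var("f"), [Call(Var("g"), [Var("x")]), Call(Var("h"), [Var("y")])])
print(to_anf(source))
```

The sketch deliberately mirrors only the example: call arguments are atomized into fresh let-bindings, while lambda and let bodies are left untouched; a full ANF pass would also normalize those positions.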
The new format was meant to compete against up-and-coming blogs focusing on industry news, such as Deadline Hollywood and TheWrap, along with its then-struggling rival Variety. The changes had a significant impact on the publication's performance: by 2013, ad sales were up more than 50%, while traffic to the magazine's website had grown by 800%. In October 2010, the company was renamed Prometheus Global Media; named after the Greek mythological figure, Beckman stated in an internal memo that the new name would "[carry] more weight and gravitas in the marketplace." _START_SECTION_ Re-organization and acquisition _START_PARAGRAPH_ In late 2011, Prometheus went through a number of cost-cutting measures. In August 2011, Backstage was sold to a group of investors led by John Amato in a transaction funded by Guggenheim, and the following month, Prometheus laid off the staff responsible for the Hollywood Creative Directory and announced it had sold the publication._NEWLINE_In January 2013, Guggenheim Partners acquired the stake in Prometheus owned by Pluribus Capital, giving it full ownership; following the acquisition, former Yahoo! executive Ross Levinsohn was named as CEO of the new Guggenheim Digital Media division, which would oversee Prometheus and other digital assets for Guggenheim companies (such as Dick Clark Productions). In April 2013, Guggenheim re-acquired Backstage (which had also acquired Sonicbids, a platform for allowing musicians to book gigs online) and made its CEO John Amato president of the Billboard Group—a new group consisting of Billboard, Backstage, and Sonicbids._NEWLINE_In a January 2014 restructuring, Levinsohn was shifted to a business development role and no longer directly manages the Prometheus properties. Additionally, the company was split into two operating groups; an Entertainment Group was formed by merging The Hollywood Reporter into the Billboard Group, with Janice Min becoming co-president and chief creative officer of the group alongside Amato. The remaining properties, consisting of Adweek and Film Expo Group, are led by Jeff Wilbur._NEWLINE_On May 29, 2014, Prometheus announced it would acquire the publishing assets of Mediabistro—a network of websites focusing on various aspects of the mass media industry—which includes the media job listing site Mediabistro and its network of blogs such as AgencySpy, FishbowlNY, Lost Remote and TVNewser—for $8 million. The acquisition did not include Mediabistro's expo business, which were retained under the name Mecklermedia. On January 13, 2015, Adweek and Film Expo Group were merged into Mediabistro to form a new Prometheus subsidiary, Mediabistro Holdings. At the same time, its blogs were re-launched under the new "Adweek Blog Network" banner, and all of Mediabistro's social media-oriented blogs were merged into SocialTimes._NEWLINE_In March 2015, Guggenheim Partners reported that its president Todd Boehly was exploring the possibility of forming his own company. A representative stated that such a company would "likely be harmonious with Guggenheim, especially since Todd's role for some time has been strategic and transaction-oriented, rather than working in or managing any of our day-to-day businesses." On December 17, 2015, in response to losses across Guggenheim Partners, the company announced that it would spin out its media properties to a group led by Boehly, including the Hollywood Reporter-Billboard Media Group, Mediabistro, and Dick Clark Productions, all under their existing leadership. 
The resultant company is known as Eldridge Industries. + _START_ARTICLE_ Debora Menicucci _START_SECTION_ Personal life _START_PARAGRAPH_ She is married to the Venezuelan Supreme Court President, Maikel Moreno. + _START_ARTICLE_ Šajkaši _START_SECTION_ Personal armament _START_PARAGRAPH_ The Šajkaši were armed with sabres, spears and ordinary and mechanical arrows. Sometimes they wore helmets and shields. Their spears likely were longer than ordinary, set to be used at longer distances. They used arrows until the end of the 16th century when the arquebus had been perfected. Later, when gunpowder began to be widely used, the Šajkaši were armed with sabres, long spears and muskets. _START_SECTION_ Clothing _START_PARAGRAPH_ Their clothing was in dark-blue colour. _START_SECTION_ 16th century _START_PARAGRAPH_ Pavle Bakić commanded the Šajkaši in the service of Ferdinand, the Archduke of Austria and King of Hungary and Croatia. The Šajkaši participated in the Battle of Mohács (1526). After the battle the Šajkaši were still unpaid for their services. Ferdinand reprimanded the court for not having paid at least part of the unpaid salary to the Šajkaši. Bakić once again turned to Ferdinand, alerting him that the nonpayment to the Šajkaši would cause estrangement of the Serbs in his lands, and those of Zapolya and the Ottoman Empire. He also informed Ferdinand of the persecution of Serbs by the Austrian staff and officers. _START_SECTION_ 17th century _START_PARAGRAPH_ From all the writings in Serbian and German by the estimable Archimandrite Jovan Rajić (Johann Raics, 1726-1801), it has been established that the old Šajkaš Corps had their staff in the city of Komarno (Comorn, Hungarian: Komorom) which is along the upper Danube and that the personnel were under the Hungarian and Polish King Ladislaus III (1424-1444) and those following him until they were included under the rule of Leopold I (1640-1705)._NEWLINE_The old Šajkaš Corps was established in 1526 and dissolved in 1746._NEWLINE_By far the majority of the Šajkaši were Serbs who had come north and west as direct result of the Turkish advance into the Balkans. As Ottoman conquest continued throughout the 1500s, thousands fled north across the Danube into lands vacated by the Serbs, also moving away from the Turks. In addition, thousands of refugees, generally of the Orthodox faith, had entered the largely deserted lands of northern Medieval Serbia where few of the original natives had survived the brutal wars. These Serbs formed the nucleus of the military frontiersmen, beginning in the early 1500s, and of the river fleet, the Šajkaši formations._NEWLINE_In 1690, a major Serbian immigration took place. Some 30,000 families from Kosovo sought refuge across the Sava and Danube rivers among their kin folk after the Austrian supported revolt failed and left them defenceless in the face of Ottoman reprisals. _START_SECTION_ 18th century _START_PARAGRAPH_ These newly-arrived Serbs together with the members of the earlier Šajkaši Corps formed the new Šajkaši Battalion on the lower Danube. They defended the border between the Habsburg and Ottoman empires. _NEWLINE_Like the other regiments of the Austrian Military Frontier, the Šajkaši settled the deserted borderlands designated for them by the Austrian crown in exchange for military service. 
The Šajkaši battalion's first lower Danube villages were: Titel, Lok, Mošorin (Moschorin), Vilovo (Willova), Gardinovci (Gardinovatz) and Žabalj (Xablia, formerly Josefdorf)._NEWLINE_Six more villages were authorized on 7 June 1769: Čurug, Gospođinci, Šajkaš (St. Ivan), Upper Kovilj, Lower Kovilj, and Kać (Kaacs). In 1800 and 1801, two more -- Djurdjevo and Nadalj -- were also settled. On 1 January 1809, the battalion totaled six companies (a division)._NEWLINE_This battalion, along the model of the other regiments of the Military Frontier, was organized according to the standing order of the Austrian military-civil administration. The duties of and support for the Battalion's frontiersmen were the same as all other frontiersmen._NEWLINE_A history of the Šajkaš Battalion was written between 1842 and 1847 by one of its officers, Captain Jovan Trumić. Apparently the only surviving copy is now at the Serbian scholarly society, Matica srpska, in Novi Sad._NEWLINE_The original Trumić manuscript and other documents about the Šajkaši were brought to light in 2004 by Slavko Gavrilović, a Serbian scholar who specialized in the Šajkaš Battalion. _NEWLINE_Included in Captain Trumić's study are valuable statistics from 1844. There were 30,315 inhabitants on the lands of the Šajkaš Battalion at that time: 28,656 Serbs; 758 Germans; 528 Hungarians; 196 Wallachians; and 177 others. Within the jurisdiction of the Battalion, there were 28,275 Eastern Orthodox Christians (Non-Uniates, mostly Serbs); 1,627 Roman Catholics (includes Croatians); 329 Protestant Evangelists (Lutherans); 63 Uniates; and 21 Protestant Reformed (Calvenists)._NEWLINE_Among the additional documents Slavko Gavrilović published in 2004 is a list of the officers who served in the Šajka Battalion between 1762 and 1873. Captain Trumić is among them._NEWLINE_A total of 246 Serbian officers are listed. The list is interesting today for the large number of Serbian officers and for the details about their military service which provide information about the officers of both the Šajkaš Battalion and the Military Frontier._NEWLINE_The list is chronological. The dates and places correspond not only to assignments within the Military Frontier but also to postings in far-away wars waged by the Habsburg crown. The list reveals that these officers were transferred to those wars. _NEWLINE_After 1699, the service of the military frontiersmen was not limited to protecting the Habsburg Empire from the Ottomans. They were required, by regulation, to serve where called. The high command in Vienna viewed the Military Frontier as a vast pool of self-sufficient recruits for the Austrian military, and their participation in the major wars involving the Austrian crown seems to bear this out. _START_SECTION_ 19th Century _START_PARAGRAPH_ By an imperial decree of the Habsburg ruler, Empress Maria Theresa, a special unit of the Austrian Danube Fleet, the Šajkaš Battalion of the lower Danube, was created in 1763. It was abolished under another Habsburg, Emperor Franz Joseph, in 1872._NEWLINE_Of the Battalion's Serbian officers, 89 were promoted from the ranks of non-commissioned officers. Sixty came from several of the Military Frontier's other regiments which were either named for the regimental headquarter cities -- Brod, Gradiska, Ogulin, Otocac, Petrovaradin, Slunj, Titel and Varazdin -- or regions of the Military Frontier -- Banat, Banija, Lika, and Slavonia. Some were cadets or arrived fresh from the Military Frontier School in Vienna. 
_NEWLINE_The Trumić list includes the year each officer joined the Battalion and his rank and prior post. Also shown are each officer's promotions and the year he was transferred out of the Battalion as well as his rank and new post. _NEWLINE_A staggering number of military frontiersmen (graničari, as they were called in Serbian) from the regiments of the Military Frontier were called up to the Austrian armies engaged in various European wars, such as the Austrian War of Succession (1741-1748); the Seven Years' War (1756-1763); the Bavarian War of Succession (1778-1779); the wars against post-revolutionary France (1792-1800); the Napoleonic Wars (1805-1815); the Austro-Italian Wars (1848-1849, 1859, 1866) and the Revolution of 1848 and the wars against the Hungarians (1848-1849)._NEWLINE_Postings on the list include places in northern Italy, Mantua and Solferino, made famous in the Napoleonic Wars and in the Austro-Italian Wars. According to the list, Lieutenant Michael Stanisavljević was transferred to Mantua in 1784 and Captain Marcus Rajčević was killed in the Battle of Solferino in 1859. From the Šajkaš Battalion, Adjutant George Bešanović transferred to the Bosnian-Serbian Frei Corps in 1788. Lieutenant Gligorije Popović and Lieutenant Thimotie Zivković transferred to Count Gyulay's Frei Corps in 1793. Lieutenant Arsenije Sečujac transferred to the 3rd Serbian Frei Corps Battalion in 1813. _NEWLINE_During the Austro-Turkish War of 1788-1791, the Austrians organized Serbian volunteers into a special military force known as the Frei Corps (Free Corps), usually commanded by Austrian officers of Serbian descent._NEWLINE_Thirty-five were transferred to other regiments of the Military Frontier. Fifty-one died while serving in the Battalion, but whether the deaths were in the line of duty is not specified. Seventy-seven retired from the Battalion with pensions. The length of service varied from a few months to as many as 30 years, if not more. _NEWLINE_Several Serbs became majors, colonels and battalion commanders. In 1763, the year the Šajkaš Battalion was created, Theodor Stanisavljević was a Major and the Battalion Commander of the Petrovaradin Frontier Infantry Regiment. In 1773, he was a Colonel in the Šajkaš Battalion. He died in 1783._NEWLINE_Colonel Aron Stanisavljević, in 1813 after 35 years with the Battalion, was promoted to Brigadier General and Major-General and transferred to Banat. That same year, Lieutenant Colonel Johann Nepomuk Majdić became Battalion Commander._NEWLINE_In 1816, Captain Thimotie Zivković returned to the Battalion and was promoted to Colonel and Battalion Commander. In 1835, Colonel Franz Jankovic was appointed Major General and Commander of the Supreme Shipping Office in Vienna._NEWLINE_In 1849, Major Johann Bunčić was the Battalion Commander of the Ogulin Frontier Regiment and Adjutant to the Austrian-Serbian Army Corps when he joined the Šajkaš Battalion. The next year, he was transferred to the Petrovaradin Frontier Infantry Regiment as a Colonel. _START_SECTION_ Šajkaši Battalion _START_PARAGRAPH_ The Frontier Šajkaši Battalion (Krajiški šajkaški bataljon), known in German as Czaikisten-Bataillon, was active in the period 1763–1873. After the Treaty of Belgrade (1739) the Habsburg-Ottoman border was set up on the Danube and Sava rivers. The Šajkaši bands in Komarno, Esztergom, Györ and other places were abolished until the establishment of the Šajkaši Battalion in Bačka, between the Danube and Tisa, in 1763 upon decision of the Habsburg War Council.
The Serb colonising community which was employed in the battalion (the šajkaši) was given the Šajkaška region, which initially included six villages, eventually increased by eight. The battalion headquarters were in Titel. The battalion had four bands in 1769, with ca. 1,116 men, although it was constantly expanded. _START_SECTION_ Šajkaši migrations _START_PARAGRAPH_ Šajkaši families, Serbs, settled in Esztergom during the rule of Matthias Corvinus; a settlement in the lower town, called Srpska varoš, developed from the community._NEWLINE_A group of Serbian Šajkaši settled in Slovakia, where they continued their service, known in Slovak as čajkári. _START_SECTION_ Legacy _START_PARAGRAPH_ In Hungarian war annals, the clearest and also most vulnerable place is taken by the King's Šajkaši. They were the most important factors and participants in the victories of the Royal Army. Whenever there was a threat to Hungary, the Šajkaši were the main support of the territorial defence and the most reliable aid to the Royal Army._NEWLINE_Stationed at many locations, the most important Šajkaši units were those of Komárom, as this was the most important Imperial fortress in Hungary; they were stationed here up until the reign of Maria Theresa, when they were moved to South Bačka._NEWLINE_The šajkača hat is derived from the 18th-century Šajkaši in Banat. _START_SECTION_ Annotations _START_PARAGRAPH_ Another term used in German was Nassadisten (Serbian: насадисте/nasadiste). + _START_ARTICLE_ Path to Paradise: The Untold Story of the World Trade Center Bombing _START_SECTION_ Production _START_PARAGRAPH_ The film was shot in New Jersey and Manhattan. _START_SECTION_ Accolades _START_PARAGRAPH_ The American-Arab Anti-Discrimination Committee awarded the film the Advancing Tolerance award for Enhancing Intolerance. _START_SECTION_ Aftermath _START_PARAGRAPH_ The film was scheduled for a repeat broadcast on HBO the week of the 9/11 attacks. HBO pulled it from its schedule following the attacks. + _START_ARTICLE_ Marilyn Frye _START_SECTION_ Education and career _START_PARAGRAPH_ Frye received the BA with honors in philosophy from Stanford University in 1963 and received the PhD in Philosophy at Cornell University in 1969, writing a dissertation titled "Meaning and Illocutionary Force," under the supervision of Max Black. Before coming to Michigan State University in 1974, she taught in the Philosophy Department at the University of Pittsburgh. From 2003 until her retirement, Frye was University Distinguished Professor at Michigan State University; she also served as Associate Dean for Graduate Studies of the College of Arts and Letters. In 2008 she was the Phi Beta Kappa Romanell Lecturer. _START_SECTION_ Research and publications _START_PARAGRAPH_ Frye is the author of The Politics of Reality (1983), a collection of nine essays which has become a "classic" of feminist philosophy._NEWLINE_In her chapter entitled "Oppression" in the book Feminist Frontiers, Frye discusses the idea of the double bind in gender. This double bind refers to "situations in which options are reduced to a very few and all of them expose one to penalty, censure or deprivation". Frye applies this principle to gender and the dilemma women often face in her discussion of oppression. For example, it is not socially acceptable for a woman to be sexually active, nor for her to be sexually inactive, in which case she is labelled a "man-hater" or "uptight".
This absence of choice permeates so thoroughly into women's day-to-day life that even small things like how they choose to dress or talk are criticized. Frye acknowledges that men face issues as well, but differentiates the issues of men and women through the metaphor of a bird cage. Each individual bind women face can be thought of as a single bar in a cage: by itself, it isn't enough to contain the bird. But, with enough bars, the bird is trapped inside the cage, left with nowhere to go. This is the complete absence of choice Frye describes: how it is the culmination of issues women face that is so "immobilizing" and why their struggle, and not men's, is considered oppression._NEWLINE_Frye is openly lesbian, and much of her work explores social categories—in particular, those based on race and gender. + _START_ARTICLE_ Dmitri Kurakin _START_PARAGRAPH_ Dmitri Kurakin (born 5 June 1975 in Tallinn) is an Estonian ice dancer who also competed internationally for Germany. He originally competed with Anna Mosenkova for Estonia. They were multiple medalists at the Estonian Figure Skating Championships and competed at the World Figure Skating Championships, the European Figure Skating Championships, and the World Junior Figure Skating Championships. He teamed up with Jill Vernekohl in 2001 and they are the 2001 and 2002 German silver medalists._NEWLINE_Kurakin is the older brother of Juri Kurakin, who is also an ice dancer. + _START_ARTICLE_ Giorgi Japaridze _START_PARAGRAPH_ Giorgi Japaridze (also spelled Giorgie Dzhaparidze) is a Georgian-American researcher in logic and theoretical computer science. He currently holds the title of Full Professor at the Computing Sciences Department of Villanova University. Japaridze is best known for his invention of computability logic, cirquent calculus, and Japaridze's polymodal logic. _START_SECTION_ Research _START_PARAGRAPH_ During 1985-1988 Japaridze elaborated the system GLP, known as Japaridze's polymodal logic. This is a system of modal logic with the "necessity" operators [0],[1],[2],…, understood as a natural series of incrementally weak provability predicates for Peano arithmetic. In "The polymodal logic of provability" Japaridze proved the arithmetical completeness of this system, as well as its inherent incompleteness with respect to Kripke frames. GLP has been extensively studied by various authors during the subsequent three decades, especially after Lev Beklemishev, in 2004, pointed out its usefulness in understanding the proof theory of arithmetic (provability algebras and proof-theoretic ordinals)._NEWLINE_Japaridze has also studied the first-order (predicate) versions of provability logic. He came up with an axiomatization of the single-variable fragment of that logic, and proved its arithmetical completeness and decidability. In the same paper he showed that, on the condition of the 1-completeness of the underlying arithmetical theory, predicate provability logic with non-iterated modalities is recursively enumerable. In he did the same for the predicate provability logic with non-modalized quantifiers._NEWLINE_In 1992-1993, Japaridze came up with the concepts of cointerpretability, tolerance and cotolerance, naturally arising in interpretability logic. He proved that cointerpretability is equivalent to 1-conservativity and tolerance is equivalent to 1-consistency. The former was an answer to the long-standing open problem regarding the metamathematical meaning of 1-conservativity. 
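For readers who want a more concrete picture of the system GLP described above, its usual axiomatization in the provability-logic literature can be summarized as follows; the notation follows the article's [n] operators, and this is a sketch of the standard presentation rather than a quotation from Japaridze's papers.

```latex
% Japaridze's polymodal logic GLP, as usually axiomatized in the literature.
% Modalities [0],[1],[2],...; \langle n\rangle\varphi abbreviates \neg[n]\neg\varphi.
\begin{align*}
  &[n](\varphi\to\psi)\to([n]\varphi\to[n]\psi),
   \qquad [n]([n]\varphi\to\varphi)\to[n]\varphi && \text{for every } n,\\
  &[m]\varphi\to[n]\varphi,
   \qquad \langle m\rangle\varphi\to[n]\langle m\rangle\varphi && \text{for all } m<n.
\end{align*}
% Rules: modus ponens and, for each n, necessitation (from \varphi infer [n]\varphi).
```

Under the arithmetical reading mentioned above, [0] can be taken as ordinary provability in Peano arithmetic and each [n+1] as a strictly stronger provability predicate, which is what the axiom [m]p → [n]p (for m < n) reflects.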
Within this same line of research on interpretability, Japaridze constructed the modal logics of tolerance (1993) and of the arithmetical hierarchy (1994), and proved their arithmetical completeness._NEWLINE_In 2002 Japaridze introduced "the Logic of Tasks", which later became a part of his Abstract Resource Semantics on one hand, and a fragment of Computability Logic (see below) on the other hand._NEWLINE_Japaridze is best known for founding Computability Logic in 2003 and making subsequent contributions to its evolution. This is a long-term research program and a semantical platform for "redeveloping logic as a formal theory of (interactive) computability, as opposed to the formal theory of truth that it has more traditionally been"._NEWLINE_In 2006 Japaridze conceived cirquent calculus as a proof-theoretic approach that manipulates graph-style constructs, termed cirquents, instead of the more traditional and less general tree-like constructs such as formulas or sequents. This novel proof-theoretic approach was later successfully used to "tame" various fragments of computability logic, which had otherwise stubbornly resisted all axiomatization attempts within traditional proof theory such as sequent calculus or Hilbert-style systems. It was also used to (define and) axiomatize the purely propositional fragment of Independence-Friendly Logic._NEWLINE_The birth of cirquent calculus was accompanied by the associated "abstract resource semantics". Cirquent calculus with that semantics can be seen as a logic of resources which, unlike Linear Logic, makes it possible to account for resource-sharing. As such, it has been presented as a viable alternative to linear logic by Japaridze, who has repeatedly criticized the latter for being neither sufficiently expressive nor complete as a resource logic. This challenge, however, has remained largely unnoticed by the linear logic community, which never responded to it._NEWLINE_Japaridze has cast a similar (and also never answered) challenge to intuitionistic logic, criticizing it for lacking a convincing semantical justification of the associated constructivistic claims, and for being incomplete as a result of "throwing out the baby with the bath water". Heyting's intuitionistic logic, in its full generality, has been shown to be sound but incomplete with respect to the semantics of computability logic. The positive (negation-free) propositional fragment of intuitionistic logic, however, has been proven to be complete with respect to the computability-logic semantics._NEWLINE_In "On the system CL12 of computability logic", on the platform of computability logic, Japaridze generalized the traditional concepts of time and space complexity to interactive computations, and introduced a third sort of complexity measure for such computations, termed "amplitude complexity"._NEWLINE_Among Japaridze's contributions is the elaboration of a series of systems of (Peano) arithmetic based on computability logic, named "clarithmetics". These include complexity-oriented systems (in the style of bounded arithmetic) for various combinations of time, space and amplitude complexity classes. _START_SECTION_ Biography and academic career _START_PARAGRAPH_ Giorgi Japaridze was born in 1961 in Tbilisi, Georgia (then in the Soviet Union)._NEWLINE_He graduated from Tbilisi State University in 1983, received a PhD degree (in philosophy) from Moscow State University in 1987, and then a second PhD degree (in computer science) from the University of Pennsylvania in 1998.
During 1987-1992 Japaridze worked as a Senior Researcher at the Institute of Philosophy of the Georgian Academy of Sciences. During 1992-1993 he was a Postdoctoral Fellow at the University of Amsterdam (Mathematics and Computer Science department). During 1993-1994 he held the position of a Visiting Associate Professor at the University of Notre Dame (Philosophy Department). He has joined the faculty of Villanova University (Computing Sciences Department). Japaridze has also worked as a Visiting Professor at Xiamen University (2007) and Shandong University (2010-2013) in China. _START_SECTION_ Awards _START_PARAGRAPH_ In 1982, for his work "Determinism and Freedom of Will", Japaridze received a Medal from the Georgian Academy of Sciences for the best student research paper, granted to one student in the nation each year. In 2015, he received an Outstanding Faculty Research Award from Villanova University, granted to one faculty member each year. Japaridze has been a recipient of various grants and scholarships, including research grants from the US National Science Foundation, Villanova University and Shandong University, Postdoctoral Fellowship from the Dutch government, Smullyan Fellowship from Indiana University (never utilized), and Dean's Fellowship from the University of Pennsylvania. + _START_ARTICLE_ Zera Luther Tanner _START_SECTION_ Career _START_PARAGRAPH_ Zera Tanner was born in Warsaw, New York in 1835, the son of Zerah and Ruth (Foster) Tanner. The elder Tanner died when his son was one year old, so the younger Tanner worked in family farms until his late teens, when he apprenticed to a mechanic. Tanner traveled by ship to Great Britain in 1855, and because of ill health chose to take a longer voyage from Liverpool to Bombay, India aboard SS Culloden in 1856. After two round trips, one as third officer, Tanner chose sailing for his profession._NEWLINE_Returning to the United States, after Tanner served aboard American merchantmen, he eventually assisted several seaborne troop movements in the Gulf of Mexico. _START_SECTION_ Civil War service _START_PARAGRAPH_ Tanner chose to join government service and was appointed acting ensign of the Union Navy in the summer of 1862. Tanner served upon the bark USS Midnight and the supply steamer USS Rhode Island during the American Civil War. When Rhode Island captured a British blockade runner in December 1864, Tanner was put in charge of the prize crew. During the Second Battle of Fort Fisher in 1865, Tanner commanded the boats from his vessel landing Union ground forces. _START_SECTION_ Post war service _START_PARAGRAPH_ Tanner entered the United States Navy in 1868, coming over from the deactivated volunteer services. Until his retirement in 1897, Tanner served the navy in hydrographic survey and dredging commands, often in conjunction with the United States Commission of Fish and Fisheries, generally known as the United States Fish Commission. Tanner partially designed and oversaw the construction of two ships for the commission. USFC Fish Hawk, in service from 1880 to 1926 and the first large vessel ever built expressly for the promotion of fisheries, was a smaller vessel designed for coastal waters and was primarily used as a mobile fish hatchery although she also conducted fisheries research, while USFC Albatross, which served as a fisheries research ship from 1882 to 1921 except for brief periods in U.S. Navy service in 1898 and from 1917 to 1919, was the first full-sized vessel primarily designed for marine research. 
Tanner was the first commanding officer of Fish Hawk, and he commanded Albatross for many years, including transporting famed naturalist Alexander Emanuel Agassiz on an 1891 voyage to the Galapagos Islands._NEWLINE_Tanner was promoted to commander in 1893 and was relieved of command of Albatross on 1 May 1894. After an extended furlough, was assigned to duty with the Fish Commission on 1 January 1895. He retired from the Navy on 5 December 1897, having reached the mandatory retirement age of 62._NEWLINE_Tanner died in Washington, D.C. in late 1906 and was buried with military honors at Arlington National Cemetery._NEWLINE_Tanner was a Companion of the Military Order of the Loyal Legion of the United States. He was also a member of the Grand Army of the Republic and the Society of American Wars. _START_SECTION_ Legacy _START_PARAGRAPH_ Tanner developed an improved method of depth sounding, using instruments of his own design. He patented his system in 1899 as the Tanner navigational sounding apparatus._NEWLINE_Two U.S. Navy ships have been named after Tanner. After World War II, USS Pamina (AKA-34), an attack cargo ship with service in the Okinawa campaign was re-purposed for oceanographic survey work and renamed USS Tanner (AGS-15). Tanner spent her career mapping significant coastline areas and was retired in 1969. In 1990, USNS Tanner (T-AGS-40) was built for the U.S. Navy as a fast oceanographic research vessel. Now named TS State of Maine, she serves as the training ship of the Maine Maritime Academy. + _START_ARTICLE_ We Got Married (season 1) _START_PARAGRAPH_ We Got Married (Season 1) is the first season of We Got Married (우리 결혼했어요), a popular reality South Korean variety show and a segment of the Sunday Sunday Night program. First broadcast in 2008, the show pairs up Korean celebrities to show what life would be like if they were married. Each week, couples are assigned missions to complete, with candid interviews of the participants to reveal their thoughts and feelings._NEWLINE_With a new format and slightly different couples, newlyweds are given a mission to complete each week. As during the special pilot episode, interviewed participants provide a unique perspective on the ongoing relationship conflicts and developments. All of the recorded material is then played in front of the participants, MCs, and audience who add commentary or clarification._NEWLINE_Beginning with a Lunar New Year's Special in 2009 with three new couples, a new format is introduced into the show, first forecasted through the addition of Kangin and Lee Yoon Ji. Each couple is given a concept to portray; in Kangin and Lee Yoon Ji's case, a college couple living with a limited income. The show now consists of more special effects and editing in order to show each couple in a set atmosphere and theme. + _START_ARTICLE_ 2006 World Lacrosse Championship _START_SECTION_ Pool play _START_PARAGRAPH_ For the round-robin phase of the tournament, nations were separated into blue, red, orange and yellow divisions according to strength. Each of the twenty-one nations was eligible to win the championship. 
_START_SECTION_ Best Positional Players _START_PARAGRAPH_ Brodie Merrill - Defence_NEWLINE_ Jay Jalbert - Midfield_NEWLINE_ Jeff Zywicki - Attack _START_SECTION_ Tournament MVP _START_PARAGRAPH_ Geoff Snider - Midfield, face-off + _START_ARTICLE_ Thomas Porter (Vermont politician) _START_PARAGRAPH_ Thomas Porter (February 15, 1734 – May 30, 1833) was a Connecticut and Vermont military and political figure who served as Speaker of the Vermont House of Representatives. _START_SECTION_ Biography _START_PARAGRAPH_ Thomas Porter was born in Farmington, Connecticut Colony, on February 15, 1734 and became a farmer in Cornwall. He served with the British during the French and Indian War and held several local offices, including member of the Connecticut House of Representatives._NEWLINE_Porter served against the British at the start of the American Revolution as a Captain in the Connecticut Militia, and relocated to Tinmouth, Vermont in 1779._NEWLINE_In 1780 Porter was elected to the Vermont House of Representatives. He served until 1782 and was Speaker of the House during his entire House tenure._NEWLINE_Porter resigned as Speaker to accept election to the Governor's Council, on which he served until 1795._NEWLINE_From 1781 to 1782 Porter was Assistant Judge of the Rutland County Court, and he was the court's Chief Judge from 1788 to 1789._NEWLINE_In 1783 Porter became a Judge on the Vermont Supreme Court, serving until 1785._NEWLINE_He died in Granville, New York on May 30, 1833. Porter was buried at Sawyer Cemetery in Tinmouth._NEWLINE_Porter was the father of college president and theologian Ebenezer Porter. + _START_ARTICLE_ The Ice House (1969 film) _START_SECTION_ Plot _START_PARAGRAPH_ Ric Martin (Robert Story), a disgraced cop, long since fired from police work, makes a sexual approach to Ice House dancer Venus De Marco (Sabrina) and is struck with a beer bottle for his efforts. Angered, he stalks the dancer, and when she again raises a bottle in a defensive manner, he strangles her. He is thwarted in his efforts to hide the body at a local lovers' lane, and ends up hiding it at The Ice House, where he works in the menial position of attendant. Other women become his victims and their bodies are stored there as well. His identical twin brother Fred Martin (David Story), himself a cop and investigating the disappearances, cannot understand why his brother is acting oddly. In an effort to slow down the hunt for the serial killer, Ric kills Fred and takes his place investigating the case. _START_SECTION_ Production _START_PARAGRAPH_ The film's production began in early 1967, with director Stuart E. McGowan wanting blonde bombshell Jayne Mansfield to play Venus De Marco. However, after Mansfield's death in an automobile accident in June 1967, filming was postponed. Over the next year McGowan offered the role to Mamie Van Doren, Diana Dors and Joi Lansing, all of whom turned down the offer. Eventually the role was filled by model-turned-actress Sabrina. _START_SECTION_ Release _START_PARAGRAPH_ The film had its original United States release in the United States on July 9, 1969 by Orbit Media Group. Marden Films gave the film a theatrical release in Canada in 1972. The film was released on VHS in the USA by Something Weird Video in 1996 as part of Frank Henenlotter's Sexy Shockers from the Vaults (Vol. 60), and a fully restored director's cut was given a worldwide release in 2008 by Grindhouse Releasing. 
The film is also known as Cold Blood, Crimen on the Rocks (Spain), Love in Cold Blood and The Passion Pit. _START_SECTION_ Critical response _START_PARAGRAPH_ John Charles, editor of Video Watchdog magazine, wrote: "Character actors Scott Brady, Jim Davis and Tris Coffin, and a pair of musclebound, thespically challenged leading men are the main points of interest in this thriller/softcore hybrid, which delivers little more than copious nudity." He panned the film for the poor direction of Stuart E. McGowan, and notes that while the film set up the viewer for mystery and horror, it failed to deliver and meandered to a predictable twist ending. He also panned the performances of real-life twins David and Robert Story as "incredibly stiff", and made note that "some amusingly unhip slang" and an undramatic "ridiculous" and "undercranked" motorcycle chase provided only "intermittent entertainment". While noting Grindhouse Releasing's intent to remarket the film, they spoke toward Something Weird Video's 1996 video release, and noted that although SWV's 35mm source material was "damaged in every way imaginable", its color and resolution were still decent. + _START_ARTICLE_ William J. Thaler _START_PARAGRAPH_ William J. Thaler, Ph.D. (December 4, 1925 – June 5, 2005) was an American experimental physicist. Working for the Office of Naval Research (ONR) at the Naval Research Laboratory in the 1950s, Thaler developed an early warning system to detect the launching of ballistic missiles using high frequency radio waves bounced between the Earth's surface and the ionosphere, part of the upper atmosphere._NEWLINE_Monitoring the disruption of the returning radio waves, called back-scatter, allowed for the long distance detection of rocket launchings and nuclear tests. Based in the Washington D.C. area, the experimental monitoring systems, termed "over-the-horizon radar", were able to pick up radio disruptions from nuclear tests held in Nevada and were later successful in tracking a Polaris missile fired from Cape Canaveral. _START_SECTION_ Education _START_PARAGRAPH_ Thaler attended St. James Parochial School in Baltimore and Loyola High School in Towson, MD. He received his undergraduate degree from Loyola College of Baltimore in 1947 and earned his master's degree in science at The Catholic University of America. He received his doctorate in physics at Catholic University in 1951. _START_SECTION_ Operation Argus _START_PARAGRAPH_ In 1958, Thaler was in charge of the ONR section of Operation Argus, a secret series of tests conducted over the Atlantic Ocean that looked at the effect of high-altitude detonations of nuclear weapons on radar and radio transmissions. _START_SECTION_ Later career _START_PARAGRAPH_ In late 1960, Thaler joined the faculty of Georgetown University, expanded the Physics department and chaired the department from 1960 to 1976. From 1976 to 1979, he took a leave of absence to serve as chief scientist and director of the Office of Telecommunications Policy, within the Executive Office of the President, in the Ford and Carter administrations. He returned to Georgetown University and retired in 1996. _START_SECTION_ Awards _START_PARAGRAPH_ In 1960, Thaler was awarded the Mendel Medal by Villanova University. 
This honor "is awarded to outstanding scientists who have done much by their painstaking work to advance the cause of science, and, by their lives and their standing before the world as scientists, have demonstrated that between true science and true religion there is no intrinsic conflict." _START_SECTION_ Personal life _START_PARAGRAPH_ Thaler was married to Barbara Thaler and had six children, two of whom preceded him in death. _START_SECTION_ Death _START_PARAGRAPH_ Thaler died of complications resulting from a stroke at his home in Centreville, Virginia. He was 79 years old. + _START_ARTICLE_ J&D's Down Home Enterprises _START_PARAGRAPH_ J&D's Down Home Enterprises, also known as J&D's Foods, is an American company based in Seattle, Washington, that produces bacon-related products such as Bacon Salt and Baconnaise. The company was founded in 2007 by Justin Esch and Dave Lefkow who used a $5,000 prize winning from America's Funniest Home Videos as start capital to launch the business. J&D's used an unconventional, yet successful advertising campaign consisting of on-line Social network services and personal telephone calls to the media to get stories written about them resulting in selling 20,000 jars of their initial product, Bacon Salt, in their first five months of operation. The business continues to be successful and has since added several new products to the Bacon Salt range. _START_SECTION_ History _START_PARAGRAPH_ The company launched sales on July 16, 2007, when entrepreneurs Justin Esch and David Lefkow started selling Bacon Salt, a product that was conceived by Esch while considering the virtues of the drink Mitch Morgan, which is composed of a shot of bourbon and a piece of fried bacon as a garnish. After registering a trademark and purchasing an internet domain, Esch and Lefkow started producing Bacon Salt. They funded their company with a $5,000 loan from Lefkow's 3-year-old son Dean, who had won the money on America's Funniest Home Videos for a video in which he smacks his dad with a hit while playing T-Ball. Bacon Salt became a hit, and as of 2007 was sold in all 50 states and 26 other countries. Esch and Lefkow have since repaid the $5,000 loan._NEWLINE_Additional financing was provided through the sale of series B stock by outside investors in 2011. _START_SECTION_ Marketing _START_PARAGRAPH_ Initially Esch and Lefkow had no advertising budget and were met by resistance from food brokers and distributors who were unwilling to take a chance on an unknown product. Instead of traditional advertising, they sent jars of Bacon Salt to editors of food blogs and set up online profiles for Bacon Salt for online communities including MySpace and Facebook. The technique worked, Esch and Lefkow received 800 orders in the first week for their three varieties (original, hickory, and peppered) which included an order from a customer in Texas for 36 jars. They sold 20,000 jars between their opening in July 2007 and November 2007. In addition to online marketing, Esch and Lefkow spent an average of 30 minutes per day phoning the media, and explaining why they should do a story about them. _START_SECTION_ Mayonnaise Wrestling _START_PARAGRAPH_ On October 28, 2008 J&D's Foods hosted The World’s First Charity No Holds Barred Battle To The Death Mayonnaise Wrestling Match at Heavens night club in Seattle, Washington. The event featured three bouts of wrestling in 6,000 lbs of mayonnaise inside a ring made of old mattresses and hay bales. 
The main event included a fight between combatants dressed in a 7' foam Bacon suit and a giant jar of Mayonnaise. The event promoted the launch of their new bacon-flavored mayonnaise, Baconnaise, and the money raised benefited the family of a deceased co-worker. J&D's hosted a complimentary BLT sandwich bar and served $3 Mitch Morgan shots made with Maker's Mark bourbon. _START_SECTION_ The Bacathlon _START_PARAGRAPH_ On November 19, 2009 J&D's Foods hosted The Bacathlon at Heavens night club in Seattle, Washington. It was promoted as The World's First Bacon-Themed Multi-Sport Athletic and Endurance Event and featured an attempt to set the World Record for Bacon Eating. The Bacathlon featured local Seattle personalities competing against each other dressed in giant foam Bacon suits. The competitors included Ben Dragavon from the Seattle Sounders, Mark Rahner from ROTTEN Comics, Thee Ted Smith from KISW 99.9 FM, Josh Black a.k.a. Ronald McFondle from Seattle Semi-Pro Wrestling, Justin Barnes from KFNK 104.9 FM, Sheeza Brickhouse from the Rat City Rollergirls, Erik "The Red" Denmark from Major League Eating and Jessica Williams, the J&D's Foods Fall 2009 intern. J&D's once again served $3 Mitch Morgan shots made with Maker's Mark bourbon. _START_SECTION_ Novelty Products _START_PARAGRAPH_ J&D's Foods has used novelty bacon-flavored products such as Bacon Lip Balm, baconlube, Mmmvelopes, BaconAir, Bacon Coffins, Bacon Baby Formula and bacon soda to generate media buzz and spread awareness of their mainstream consumer products. _START_SECTION_ Products _START_PARAGRAPH_ J&D's markets a variety of bacon-flavored products, including: _START_SECTION_ Bacon Salt _START_PARAGRAPH_ Bacon Salt was the first product developed by J&D's Foods. The first attempt at making bacon salt was to take bacon grease and pour it over kosher salt. That attempt was unsuccessful and, according to Esch, the result tasted disgusting. Upon perfecting the recipe, the pair introduced the first three flavors of Bacon Salt — Original, Hickory and Peppered varieties. J&D's went on to launch Natural Bacon Salt and the Limited Edition Holiday Flavors Cheddar, Jalapeño, Maple, Applewood and Mesquite Bacon Salt in 2008. Bacon Salt is vegetarian and kosher. Hickory Bacon Salt is vegan._NEWLINE_In 2008, J&D's Foods launched Operation Bacon Salt to provide Bacon Salt to troops serving abroad in Iraq and Afghanistan, where pork is not readily available. After receiving several emails from troops stationed abroad, J&D's began mailing free Bacon Salt to troops. _START_SECTION_ Baconnaise _START_PARAGRAPH_ Introduced in October 2008, Baconnaise is a bacon-flavored mayonnaise spread available in both Regular and Lite versions. Jon Stewart of The Daily Show described Baconnaise as being "for people who want heart disease but are too lazy to actually make the bacon." _START_SECTION_ Bacon Lip Balm _START_PARAGRAPH_ Introduced in October 2008, Bacon Lip Balm is a bacon-flavored lip balm. J&D's bacon-inspired lip balm is a "hot seller" on their web site, having sold over 50,000 units, often by the dozen. Customer reviews range from describing it as the worst thing they have ever heard of to the greatest thing ever. When Esch and Lefkow appeared on The Oprah Winfrey Show in March 2009, Winfrey's co-hosts Mark Consuelos and Ali Wentworth applied Bacon Lip Balm during the interview. _START_SECTION_ BaconPOP _START_PARAGRAPH_ Introduced in November 2009, BaconPOP is a bacon-flavored microwave popcorn.
_START_SECTION_ Bacon Ranch _START_PARAGRAPH_ Introduced in November 2009, Bacon Ranch is a bacon-flavored ranch dip/mix, to be combined with buttermilk, sour cream and/or mayonnaise. _START_SECTION_ Mmmvelopes _START_PARAGRAPH_ Introduced in November 2009, Mmmvelopes are bacon-flavored envelopes that both look like pieces of bacon and taste like bacon when licked. + _START_ARTICLE_ Lindholmiola spectabilis _START_SECTION_ Geographic distribution _START_PARAGRAPH_ This species is endemic to Greece, where it occurs in the north and north-eastern part of the country's mainland. + _START_ARTICLE_ Jean Vander Pyl _START_SECTION_ Early life and career _START_PARAGRAPH_ Vander Pyl was born in Philadelphia to John Howard and Kathleen Hale Vander Pyl. Her grandfather had come from the Netherlands. Her father was the district manager for Knit Underwear, and her mother was a Southerner from Tennessee. The two died within six months of each other in the early 1950s. By 1939, she was already working as a radio actress. _NEWLINE_On radio she was heard on such programs as The Halls of Ivy (1950–52) and on Father Knows Best during the early 1950s, where she portrayed Margaret Anderson; the role was played on television by Jane Wyatt. Her husband, Carroll G. O'Meara, was a graduate of Stanford University who worked as a copywriter at KHJ radio in the mid-1930s and later became an advertising executive._NEWLINE_Vander Pyl made numerous TV appearances as an actress in programs such as Leave It to Beaver, The Donna Reed Show, Father Knows Best, The Beverly Hillbillies, That Girl, and Petticoat Junction. One of her final TV appearances was in the opening scene of the Season Two Murder, She Wrote episode, "One Good Bid Deserves a Murder". Vander Pyl also had a cameo appearance in the 1994 live-action film version of The Flintstones as Mrs. Feldspar, an elderly woman in a conga line. _START_SECTION_ Voice work _START_PARAGRAPH_ Vander Pyl was the voice of Wilma Flintstone, her best-known character, in the original Flintstones series. She told an interviewer in 1995 that she received $250 per episode for making The Flintstones, and in 1966, when the series ended, she rushed to accept $15,000 in lieu of residual payments from syndication. The Flintstones ran in syndication across the globe for decades. At the time, Vander Pyl lived in San Clemente, California, and remarked: "If I got residuals, I wouldn't live in San Clemente. I'd own San Clemente."_NEWLINE_Most of her other voice acting work was also for the Hanna-Barbera studio, where she played her first voice role in 1958 on an episode of The Huckleberry Hound Show, voicing an actress in the Yogi Bear episode "Show Biz Bear". She did additional voices, including the narrator and various Southern belles and beautiful girls, on The Quick Draw McGraw Show, Snagglepuss and The Yogi Bear Show. In 1961-62, Vander Pyl played Nurse Larue, Charlie the baby, Goldie, Lola Glamour and additional voices on multiple episodes of Top Cat, and in 1962 she took on another memorable role as Rosie, the Jetsons' robotic maid; 23 years later, in 1985, she reprised the character on the returning series._NEWLINE_Later, she did the voices of Maw Rugg and her daughter Floral Rugg on the rural cartoon The Hillbilly Bears and on Winsome Witch; both shows were part of The Atom Ant/Secret Squirrel Show (1965–1967). Jean Vander Pyl was also the voice of Little Ogee on The Magilla Gorilla Show. In 1969, Vander Pyl guest starred on the Scooby-Doo, Where Are You!
episode "Foul Play in Funland", playing Sarah Jenkins._NEWLINE_In the 1970s, she was the voice of Marge Huddles, the main character's wife on Where's Huddles?, in which she played a role similar to that of Wilma Flintstone and was reunited with her Flintstones cast members Alan Reed and Mel Blanc. She went on to voice Mrs. Finkerton on Inch High, Private Eye, as well as several female characters on Hong Kong Phooey, The Tom and Jerry Show and Captain Caveman and the Teen Angels._NEWLINE_In the 1980s and 1990s, the talented voice actress did voices on Mister T, Snorks, Yogi's Treasure Hunt and also on The Flintstone Kids as Mrs. Slaghoople. She mostly reprised Wilma Flintstone on spin-off series and films such as The Flintstone Comedy Hour, The New Fred and Barney Show, The Flintstone Comedy Show, The Jetsons Meet the Flintstones, I Yabba-Dabba Do!, Hollyrock-a-Bye Baby, and A Flintstones Christmas Carol._NEWLINE_Her last roles were again as Wilma Flintstone on What a Cartoon! episode "Dino: Stay Out!" in 1995, A Flintstone Family Christmas in 1996 and on The Weird Al Show in 1997. _START_SECTION_ Personal life _START_PARAGRAPH_ Vander Pyl was married twice. First to Carroll G. O'Meara on March 9, 1939; together they had three children, O'Meara died on February 18, 1962, at the age of 53. She then married her second husband Roger Wells DeWitt in 1963; the couple had one son, they remained married until DeWitt's death in 1992. _START_SECTION_ Death _START_PARAGRAPH_ On April 10, 1999, Vander Pyl, the last surviving original cast member of The Flintstones, died of lung cancer at her home in Dana Point, California, at the age of 79. Vander Pyl was interred in Ascension Cemetery in Lake Forest, California. + _START_ARTICLE_ Choleretic _START_PARAGRAPH_ Choleretics are substances that increase the volume of secretion of bile from the liver as well as the amount of solids secreted. + _START_ARTICLE_ Once Upon a Castle _START_PARAGRAPH_ Once Upon a Castle is a symphonie concertante for organ and orchestra composed in 2003 and revised in 2015 by American composer Michael Daugherty. The music is inspired by both the life and times of American media mogul William Randolph Hearst, Hearst Castle, and the Hollywood lore of Charles Foster Kane, a fictional character based on Hearst in the movie Citizen Kane. _START_SECTION_ Origin and performance history _START_PARAGRAPH_ The composition was commissioned by the Ann Arbor Symphony Orchestra and a consortium consisting of the Cedar Rapids Symphony Orchestra, the Rockford Symphony Orchestra and the West Michigan Symphony Orchestra. The world premiere was given by the Ann Arbor Symphony Orchestra conducted by Arie Lipsky, with Steven Ball, organ, at the Michigan Theater, Ann Arbor, Michigan on November 15, 2003. The world premiere of the revised version was given by the Nashville Symphony conducted by Giancarlo Guerrero, with Paul Jacobs, organ, at the Schermerhorn Symphony Center, Nashville, Tennessee on November 6, 2015. _START_SECTION_ Recording and reception _START_PARAGRAPH_ The concerto was recorded in 2015 with the Nashville Symphony and released on the Naxos label. 
Many critics reviewed the recording favorably, including a 10/10 for both artistic quality and sound quality from music critic David Hurwitz and 4 out of 5 stars from critic James Manheim._NEWLINE__NEWLINE_Donald Rosenberg of Gramophone wrote:_NEWLINE_Hearst’s extravagant abode in San Simeon comes to brilliant life in Once Upon a Castle (2015), whose four movements receive an extra sonic kick with the presence of a pipe organ, played to the glowing hilt by Paul Jacobs. Guerrero and the orchestra sound as if they’re savouring every fresh Daugherty detail._NEWLINE_— Donald Rosenberg, Gramophone_NEWLINE__NEWLINE_Bob McQuiston of Classical Lost and Found wrote:_NEWLINE_American organist Paul Jacobs gives a technically flawless, magnificent account of this colorful work. He's at the console of the U.S.-built, three-manual Schoenstein Organ (64 ranks) in the Schermerhorn Symphony Center's Laura Turner Hall, Nashville, Tennessee, where these recordings were made (see 10 November 2014)...Jacobs receives superb support from the Nashville Symphony under its Music Director Giancarlo Guerrero."_NEWLINE_— Bob McQuiston, "Classical Lost and Found"_NEWLINE_ The album won the 2017 Grammy Award for Best Classical Compendium. _START_SECTION_ Four movements of the composition _START_PARAGRAPH_ The music of the four movements is programmatic, and is based on the composer's framing of different architectural, geographical and fictional aspects of the Hearst/Kane history and lore. The composer's published score includes descriptive program notes explaining the imagery and inspiration for each movement. _START_SECTION_ I. The Road to San Simeon _START_PARAGRAPH_ The music of this movement is meant to represent the winding drive to Hearst Castle from San Simeon, as well as Hearst's opulent antique collection. The composer explains his intent for the music in this movement to occasionally remind the listener of a musical "antique". _START_SECTION_ II. Neptune Pool _START_PARAGRAPH_ This movement is dedicated to William Albright (1944-98), who was the composer's colleague at the University of Michigan and who is recognized as one of the 20th century's greatest composers of contemporary organ music. The music is meant to portray the vast, grandiose architecture of the famous pool at Hearst Castle. _START_SECTION_ III. Rosebud _START_PARAGRAPH_ In his program notes the composer states: "...the ground breaking film [Citizen Kane] starring and directed by Orson Welles, presents a caricature of Randolph Hearst...[the] music for this movement echoes a brilliant scene in the film where the boisterous Kane (the organ) and lonely Susan (the solo violin) argue from opposite ends of a cavernous empty room of the castle." Sleigh bells are used as a musical representation of "rosebud," the famous final word of the fictional character Citizen Kane. _START_SECTION_ IV. Xanadu _START_PARAGRAPH_ This movement is composed to capture the spirit of the bombastic, lavish parties famously held at Hearst Castle, attended by the likes of Winston Churchill and famous film stars of the day, including Clark Gable, Charlie Chaplin, and Greta Garbo.
The composer notes in the published score, "I also had in mind fragments of Samuel Taylor Coleridge's 1798 poem, Kubla Khan...[the music uses] virtuoso bass pedal riffs surrounded by sizzling strings, rumbling brass, shimmering percussion and pulsating timpani.” + _START_ARTICLE_ Inhabited initial _START_PARAGRAPH_ An inhabited initial is an initial, an enlarged letter at the beginning of a paragraph or other section of text that contains an illustration of human or animal figures within the letter. It is similar to a historiated initial; however, the figures in historiated initials show an identifiable scene or story, while the figures in inhabited initials do not show a narrative. Figures in inhabited initials may be related to the contents of the text, but do not have to be. They may be purely decorative instead. + _START_ARTICLE_ Jim St. James _START_PARAGRAPH_ Jim Bozyk (1954–1990), known professionally as Jim St. James, was a Canadian actor and HIV/AIDS activist. He was best known as the star of a series of public service announcements on AIDS awareness which aired on Canadian television in the 1980s, and as the subject of June Callwood's 1988 book Jim: A Life with AIDS. _START_SECTION_ Background _START_PARAGRAPH_ He was raised in rural Southern Ontario in a Jehovah's Witness family, and was briefly married to a woman. He struggled with his sexuality, and undertook at least one suicide attempt before coming out as gay. Many of his family disowned him when he came out, although he remained in occasional contact with his father. He was also excommunicated from the Jehovah's Witnesses, although he remained devoutly religious in his personal life._NEWLINE_He worked as a stage actor in Toronto for several years, winning an award from Theatre Ontario as best actor in a musical for his performance in a production of Man of La Mancha in 1984. Just two days after winning that award, he was first diagnosed HIV-positive. _START_SECTION_ Activism _START_PARAGRAPH_ Following his diagnosis, he battled clinical depression for about a year before deciding in 1985 to get on with life, and renewed his commitment to both acting and HIV activism. He was one of the founding members of Toronto's People With AIDS Foundation, appeared in the AIDS-themed documentary film No Sad Songs in 1985 and a production of Robert E. Sherwood's play Idiot's Delight in 1987, and began appearing as a public speaker on HIV and AIDS issues. During this era, he was commonly credited as Canada's longest-living survivor of the disease, and as the country's most prominent HIV/AIDS activist._NEWLINE_In 1987, he appeared in an HIV education segment on CBC Television's youth public affairs program What's New, and in 1988 he starred in several HIV/AIDS awareness commercials, funded by CJOH-TV and the Canadian Public Health Association, which aired on television stations across Canada. During this era, he was also meeting regularly with Callwood in preparation for the book Jim: A Life with AIDS, which was published in fall 1988. By this time, he had developed Kaposi's sarcoma. In both 1988 and 1989, he invited the media to cover his birthday party as a news story, to highlight his continued survival and to promote further awareness of the disease. At the time of his 1989 party, however, he was making plans to move into Casey House, Toronto's AIDS hospice, due to his declining health._NEWLINE_He died on March 24, 1990 at Casey House, just a few weeks short of his 36th birthday. 
+ _START_ARTICLE_ Festival y Reinado Nacional del Carbón _START_PARAGRAPH_ The National Coal Festival and Pageant (Spanish: Festival y Reinado Nacional del Carbón) is a festival in Colombia that takes place in the town of Barrancas, Department of La Guajira, from October 10 to 13. The festival is an artistic and cultural event which celebrates the municipality's Roman Catholic devotion to the "Virgin of Pilar"; there are art expositions showing different coal sculptures and paintings as well as local gastronomy events. There are also cultural expositions showing the arts and crafts of the Wayuu indigenous ethnic group. + _START_ARTICLE_ Calliostoma zizyphinum _START_SECTION_ Description _START_PARAGRAPH_ The solid, regularly conical shell is straight-sided and imperforate. The shell contains up to 12-13 whorls. It is sculptured with regular spiral grooves and ridges, traversed by fine prosocline growth lines. The apex is minute, composed of a single smooth rounded whorl. Several whorls follow, each with 4 granose spiral ridges. These become smooth and either obsolete or narrow on the later whorls. The body whorl has a prominent peripheral keel bearing two broad ridges; these ridges appear above the suture in the preceding whorls. The base of the shell is rather flat, with the inner lip reflected over a shallow umbilical groove. The periphery is angular, encircled by a smooth rounded rib that becomes a supra-sutural band or fasciole on the spire. The base of the shell is nearly flat. The aperture is quadrate. The cylindrical columella is nearly straight._NEWLINE_The color of the shell is variable. The ground color is yellowish brown, pale pink, or violet with streaks and blotches of brown, red or purple on the periphery. Blotches on the keel are generally darker, more frequent and more regular than on other parts of the shell. It is radiately clouded with brown on the upper surface. The base of the shell is unicolored or obscurely radiately streaked. Pure white or violet specimens are occasionally found. _START_SECTION_ Predators _START_PARAGRAPH_ The starfish Asterias rubens is a known predator of Calliostoma zizyphinum. _START_SECTION_ Distribution _START_PARAGRAPH_ This marine species occurs in European waters from Northern Norway to the Azores, and in the Mediterranean Sea. + _START_ARTICLE_ Shahnshah Zakarian _START_SECTION_ Biography _START_PARAGRAPH_ After the death of Zakaria II Mkhargrdzeli, his five-year-old son Shanshe was adopted by his uncle Ivane Mkhargrdzeli, who raised him and converted him to Chalcedonism. As soon as he reached the age of adulthood he was raised to the office of mandaturukhutsesi._NEWLINE_During the Mongol invasion of Georgia, Queen Rusudan had to evacuate Tbilisi for Kutaisi, while Shanshe took refuge in Adjara, then made peace with the Mongols and agreed to pay them tribute. They confirmed Shanshe in his fief. Soon Rusudan sent Avag Mkhargrdzeli to arrange her submission to the Mongols in 1243 and arrived in eastern Georgia, where she was met by Shanshe and other notable nobles._NEWLINE_During the period of interregnum (1245–1250), with the two Davids absent at the court of the Great Khan in Karakorum, the Mongols divided the Kingdom of Georgia into eight districts (tumen), three of which belonged to the Mkhargrdzelis, i.e., the territories of Shanshe in Ani and Kars; of Avag in Syunik and Artsakh; and of Vahram (Gagi, Shamkor and the surrounding area)._NEWLINE_Rubroek, envoy extraordinary of the French king Louis IX to the Khan of Mongolia, stayed in 1255 with Shanshe on one of his Armenian estates.
Rubroek characterizes Shanshe as a great feudal lord and owner of 15 cities._NEWLINE_In 1260, Hulagu Khan requested that David Ulu support him in the war against the Mamluk Sultanate in Cairo. David, remembering the Georgian losses at Baghdad (1258), refused to comply and revolted. The Georgian nobles, led by David Ulu, were defeated and once again submitted to Mongol rule. Although Prince Shanshe was freed for a ransom, his son Zakaria was killed. Shanshe died not long after this event. He was buried in his ancestral Kobayr monastery. + _START_ARTICLE_ Howlin' at the Moon _START_SECTION_ Background _START_PARAGRAPH_ The up-tempo "Howlin' at the Moon" celebrates the giddiness of true love. Lyrically, the song reflects Williams' sense of humor and love of hunting. The title is punctuated by the hound dog yodels of fiddler Jerry Rivers. In his book Hank Williams: The Biography, writer Colin Escott observes, "The performance tears along...It was but a short step from there to rockabilly." Williams recorded the song at Castle Studio in Nashville on March 16, 1951. Williams was backed on the session by members of his Drifting Cowboys band, including Jerry Rivers (fiddle), Don Helms (steel guitar), Sammy Pruett (electric guitar), Jack Shook (rhythm guitar), Ernie Newton or "Cedric Rainwater," aka Howard Watts (bass), and either Owen Bradley or producer Fred Rose on piano. The B-side of "Howlin' at the Moon," the ballad "I Can't Help It (If I'm Still in Love with You)," outperformed the A-side on the charts, peaking at #2._NEWLINE_Williams disciple George Jones recorded this song for his 1960 album George Jones Salutes Hank Williams. + _START_ARTICLE_ Oddjobs _START_SECTION_ History _START_PARAGRAPH_ Oddjobs released the album Drums in 2001. The 12-inch single "Blue Collar Holler" reached number 6 on the CMJ college radio hip hop chart in 2002. The six-track EP The Shopkeeper's Wife was released in 2003. The group toured with DJ Shadow in the same year. + _START_ARTICLE_ Groep Otten _START_SECTION_ Internal conflicts within Forum for Democracy _START_PARAGRAPH_ After the establishment of Forum for Democracy in 2016, the founders Henk Otten and Thierry Baudet fully devoted themselves to the development of their party. In a short time the party managed to win two seats in the 2017 Dutch general election. Two years later, further efforts led to a landslide victory in the 2019 Dutch provincial elections, yet the party failed to join a provincial governing coalition (college) in any province. Baudet was blamed for this by the political parties involved. His earlier statements about social issues have been viewed by many as far-right, racist and hostile to women. In addition, a number of tweets also caused a great deal of controversy._NEWLINE_On 19 April 2019, co-founder Otten, who had always been involved with the party behind the scenes, stepped forward in an interview with NRC Handelsblad. In this interview, Otten criticised the course of the party and the behaviour of Baudet. "Baudet should not unnecessarily ignore our people. It might be nice to make a daring statement, but we are now a big party. He is talking about his own words, but words have consequences. You have responsibility for other people. The question is whether you should use a political party as a vehicle for academic debates that you enjoy yourself. I don't think so." The party leader was "not amused" that this criticism came out._NEWLINE_Less than a week later, Otten came under public fire.
Two of the party's three board members, Baudet and Rob Rooken, accused Otten of taking money from the party treasury. The money in question was immediately deposited back into the party account by Otten. Moreover, Otten resigned as a board member of the party at the request of the other two board members. During this period Otten opted for media silence so as not to further harm the party, which was especially important with the Senate election approaching. In this election, Otten was also the lead candidate for Forum for Democracy, and the party's intended group chairman. In the aftermath of the internal conflict, Otten renounced the group chairmanship. He was succeeded by Paul Cliteur, with whom he entered the Senate (together with ten brand-new members of the Dutch parliament)._NEWLINE_Owing to suspicions of financial malpractice, Otten was expelled by the party on 24 July 2019. During this period, Otten immediately visited all kinds of media to speak out against the accusations against him, which he himself dismissed as defamation and slander. According to Otten, the expulsion was due to disagreements about the course of the party. Otten also promised to press charges for defamation. In Nieuwsuur, Otten hinted at his further political aspirations, mentioning for the first time the possibility of establishing his own party. _START_SECTION_ Founding of Group Otten _START_PARAGRAPH_ On 18 August 2019, Otten officially set up his political party. The new party leader was still looking for a suitable name for his party, but would continue under the name "Fractie-Otten" for the time being. Two members of the Senate joined him out of the same dissatisfaction with the course of Forum: former party spokesman Jeroen de Vries and Dorien Rookmaker. The three chose to keep their seats. There was even a possibility that the Otten Group could gain a seat in the European Parliament. This had to do with the redistribution of seats in the European Parliament after the United Kingdom's departure from the EU, which would increase FvD's number of seats from three to four. As Rookmaker was fourth on Forum's list of candidates for the European Parliament elections of 2019, she still claimed the "Brexit seat" despite having cancelled her membership._NEWLINE_The situation in the provincial states was equally uneasy. An increasing number of members of the provincial states expressed their dissatisfaction with the direction of FvD and then cancelled their membership. Some even chose to join Otten's party, including North Holland member of the states Robert Baljeu. In an interview with the NOS, Otten expressed the expectation that more FvD members would make the switch to his party. + _START_ARTICLE_ Lethrinus rubrioperculatus _START_SECTION_ Description _START_PARAGRAPH_ This species grows to and is brown or olive-grey in colour. It has small, scattered blotches that are irregular in shape. The body depth is 2.94 to 3.18 times in standard length. Body color is olive-gray or brown, with scattered irregular small black blotches. There is normally a red spot present on the top edge of the operculum. The lips are normally red. The fins are pinkish or pale in colour. _START_SECTION_ Distribution _START_PARAGRAPH_ Lethrinus rubrioperculatus is found in numerous locations, including East African waters, southern Japan and Taiwan, the Marquesas Islands, New Caledonia and the northern half of Australia. 
_START_SECTION_ Habitat _START_PARAGRAPH_ This species lives over sandy bottoms, in areas where rubble is present, and along the slopes of outer reefs. Although reef-associated, Lethrinus rubrioperculatus also occurs at depths of up to 160 metres, much deeper than most other species in this genus. This species is non-migratory. _START_SECTION_ Diet _START_PARAGRAPH_ Lethrinus rubrioperculatus eats mostly crustaceans, mollusks, echinoderms, and other fishes. _START_SECTION_ Human uses _START_PARAGRAPH_ This fish is caught commercially. _START_SECTION_ Parasites _START_PARAGRAPH_ As most fish, Lethrinus rubrioperculatus is the host of many species of parasites. _NEWLINE_Monogeneans parasitic on the gills include the diplectanid Calydiscoides euzeti, the ancyrocephalids Lethrinitrema gibbus and Lethrinitrema dossenus and several capsalids. _NEWLINE_Copepods parasitic on the gills include the caligid Caligus lethrinicola and the lernanthropid Sagum vespertilio. _NEWLINE_The gills also harbour unidentified gnathiid isopod larvae. _NEWLINE_The digestive tract harbours an unidentified Acanthocephala, unidentified tetraphyllid cestodes, species of the anisakid nematode Raphidascaris (Ichthyascaris), and a variety of digeneans, including the acanthocolpid Stephanostomum aaravi, the hemiurid Lecithochirium sp. and Tubulovesicula angusticauda, the opecoelid Pseudoplagioporus interruptus and three other opecoelids. _NEWLINE_The abdominal cavity contains two species of larval tetrarhynch cestodes, the otobothriid Otobothrium parvum and the tentaculariid Nybelinia goreensis. _NEWLINE_In New Caledonia, where its parasites were particularly studied, Lethrinus rubrioperculatus has a total of twenty species of parasites. + _START_ARTICLE_ Claussen pickles _START_PARAGRAPH_ Claussen is a brand of pickled cucumbers. It is headquartered in Woodstock, Illinois, an exurb of Chicago. Unlike many other brands, Claussen pickles are uncooked, and are typically found in the refrigerated section of grocery stores._NEWLINE_Claussen is advertised as having superior crunchiness to other brands. In a 1992 television advertisement, Claussen pickles were shown to snap under pressure, whereas unidentified competing brands merely bent without snapping. In response, Vlasic Foods Inc. submitted a complaint to an advertising industry tribunal, claiming that the commercial was unfair and misleading. Ultimately, however, the claims of Claussen were upheld by the tribunal. _START_SECTION_ Other products _START_PARAGRAPH_ Additionally, Claussen is the manufacturer of sauerkraut and a sweet pickle relish which won the San Francisco Chronicle's June 18, 2008, Taster's Choice challenge. _START_SECTION_ History _START_PARAGRAPH_ The company C. F. Claussen & Sons was founded by Claus Claussen Sr. in Chicago in May 1870. Claussen was a vegetable farmer on land that today is in the Chicago city limits at 51st and South Western Blvd. He had a surplus crop of cucumbers one year, and so he decided to pickle them. Claussen pickles were produced on the same piece of land until 1976 when the plant moved to Woodstock, Ill._NEWLINE_Claus Claussen Sr. was succeeded by his son Claus S. Claussen, who was serving as president of the company when he died following an automobile accident on December 20, 1932._NEWLINE_For some years, William C. Claussen (b. 1890) served as president of the Claussen Pickle Company._NEWLINE_The company was sold to Oscar Mayer in 1970 and moved to Woodstock in 1976. 
Oscar Mayer was later acquired by General Foods in 1981, who in turn merged with Kraft, Inc. in 1990 to form Kraft General Foods, renamed Kraft Foods in 1995._NEWLINE_In 2002, the investment group that owned Vlasic Pickles sought to acquire the Claussen brand as well. The Federal Trade Commission blocked the proposed merger on the grounds that it would have severe anticompetitive effects, leading to a monopoly in the refrigerated-pickle market. + _START_ARTICLE_ Steve Molloy _START_SECTION_ Background _START_PARAGRAPH_ Steve Molloy was born in Gorton, Manchester, Lancashire, England. _START_SECTION_ International honours _START_PARAGRAPH_ Steve Molloy won caps for England while at Leeds in 1992 against Wales, while at Featherstone Rovers in 1996 against France (interchange/substitute), and Wales, while at Sheffield Eagles in 1999 against France (2 matches), and won caps for Great Britain while at Leeds in 1993 against France, while at Featherstone Rovers in 1994 against Fiji, 1996 against Fiji (interchange/substitute), and New Zealand (interchange/substitute). _START_SECTION_ County Cup Final appearances _START_PARAGRAPH_ Steve Molloy played right-prop, i.e. number 10, in Warrington's 24-16 victory over Oldham in the 1989 Lancashire County Cup Final during the 1989–90 season at Knowsley Road, St. Helens on Saturday 14 October 1989. _START_SECTION_ Club career _START_PARAGRAPH_ Molloy made his début for Warrington on Sunday 28 August 1988, and he played his last match for Warrington on Monday 16 April 1990, he made his début for Featherstone Rovers on Sunday 29 August 1993, and he played his last match for Featherstone Rovers during the 1997 season. + _START_ARTICLE_ Hemigalinae _START_SECTION_ Characteristics _START_PARAGRAPH_ The tails of Hemigalinae species are ringed. The toes and the middle of the lower part of the tarsus are bald. The frenum, upper part, and sides of the lower part are hairy. The orbit is imperfect._NEWLINE_Hemigalinae resemble the Viverrinae in having the scent glands present in both sexes and wholly perineal, but differing by their simpler structure, consisting in the male of a shallower, _NEWLINE_smaller pouch, with less tumid lips, situated midway between the scrotum and the penis, but not extending to either. In the female, the scent glands consist of a pair of swellings, each with a slit-like orifice, situated one on each side of the vulva and a little behind it and on a common eminence, the perineal area behind this eminence being naked. The prepuce is long and pendulous. The feet are nearly intermediate in structure between those of the digitigrade Viverrinae and the semiplantigrade Paradoxurinae, but more like the latter, both the carpal and metatarsal pads being well developed, double, and joining the plantar pad below, and as wide as it is at the point of contact. But the feet, with the pads, are considerably narrower, the carpals and metatarsals converging and meeting above so that a much larger area of the under surface is hairy. The area between the four main digits and the plantar pad is covered with short hair, and the pads of the third and fourth digits of the hind foot are separated as in the Viverrinae, not confluent as in the Paradoxurinae. The retractile claws are not protected by skin-lobes. + _START_ARTICLE_ Durand Scott _START_SECTION_ High school career _START_PARAGRAPH_ The Bronx native attended Rice High School where he was a teammate of Kemba Walker until the latter left for college. 
He was crucial in the state championship they earned in 2009, including a good performance in the semifinal, a 77-50 win over a Lance Stephenson-led Lincoln. For his efforts, he was named the Daily News City Player of the Year and was selected to the Jordan Brand Classic. During that time, he also played AAU basketball for the Gauchos. _START_SECTION_ College career _START_PARAGRAPH_ He passed up offers from Memphis, West Virginia and UConn to join Miami (Florida) and play in the Atlantic Coast Conference (ACC) of NCAA Division I._NEWLINE_In his freshman year, Scott played in all 33 games (28 starts) while averaging 10.3 points, 4 rebounds, 3.4 assists and a team-high 1.2 steals per game. He made the ACC All-Rookie team and the ACC All-Tournament First Team._NEWLINE_In his sophomore year, he started in all but one of the 36 games he played in, averaging 13.6 points (second-best on the team), 4.2 rebounds, 3.1 assists and 1.2 steals (team best) in 32.8 minutes (team most) per game._NEWLINE_In his junior year, he played 33.2 minutes per game (6th most in ACC), posting 12.9 points (ACC 14th, team best), 3.1 assists (ACC 7th), 5.4 rebounds (team second best) and 1 steal. He was an All-ACC Honorable Mention._NEWLINE_He scored a career-high 32 points versus NC State in the 2013 ACC Tournament semi-finals. In his senior year, he averaged 13.1 points and 4 rebounds. He was named ACC Defensive Player of the Year and selected to the ACC All-Tournament First Team as Miami won the tournament._NEWLINE_At the end of his college career, he averaged 12.5 points, 4.4 rebounds, 3.1 assists, 1.3 steals and 32.1 minutes in 132 total games played. He was first in Miami history for games started and minutes played (125 and 4,238 respectively), 8th in points scored (1,650), 5th in assists (404) and 7th in steals (166). _START_SECTION_ Professional career _START_PARAGRAPH_ After his college career, Scott attended the Portsmouth Invitational, where he was an all-tournament selection. He also worked out with a number of NBA teams, but went undrafted in the 2013 NBA draft. Scott then joined the San Antonio Spurs for the 2013 NBA Summer League._NEWLINE_In August 2013, Scott signed with Blu:sens Monbús of the Spanish Liga ACB for the 2013–14 season. He registered 4.6 points and 1.2 rebounds in 12.3 minutes per game during the season._NEWLINE_Scott signed with Israeli side Hapoel Tel Aviv for the 2014–15 season; he finished the season with 15.2 points, 4.5 rebounds and 1.5 steals in 31 Israeli League games as Hapoel reached the playoffs._NEWLINE_In July 2015, Scott signed with Italian Serie A side Enel Brindisi for one year. The same month, he was announced as part of the Milwaukee Bucks roster for the 2015 NBA Summer League. On July 22, 2016, he re-signed with Brindisi for one more season._NEWLINE_On July 15, 2017, Scott signed with Italian club Auxilium Torino for the 2017–18 season. On August 20, 2017, it was announced that he would not play for the team for personal reasons. On October 5, 2017, he signed with the Memphis Grizzlies. On October 14, 2017, he was waived by the Grizzlies. On March 29, 2018, EWE Baskets Oldenburg of the Basketball Bundesliga was reported to have signed Scott for the rest of the 2017–18 season._NEWLINE_For the 2018–19 season, Scott signed with the Long Island Nets of the NBA G League. He did not make the final roster._NEWLINE_On November 28, 2018, Scott signed a one-year deal with the French team Levallois Metropolitans. 
In January 2019, Scott parted ways with Levallois Metropolitans after appearing in five games._NEWLINE_On January 22, 2019, Scott returned to Israel for a second stint, signing with Hapoel Gilboa Galil for the rest of the season. On February 4, 2019, Scott recorded a season-high 25 points in his second game with Gilboa Galil, shooting 9-for-12 from the field, along with three rebounds and assists in an 89–87 win over Ironi Nahariya. On April 10, 2019, Scott parted ways after appearing in nine games._NEWLINE_On August 30, 2019, Scott returned to France for a second stint, signing a one-year deal with Cholet Basket. On September 17, 2019, he parted ways with Cholet before appearing in a game. _START_SECTION_ National team career _START_PARAGRAPH_ Scott has played for the Jamaican national team. He participated in the 2013 FIBA Americas Championship, posting 10.5 points, 3.9 rebounds and 0.8 assists in around 28 minutes per game. + _START_ARTICLE_ McCarthy of Muskerry _START_PARAGRAPH_ The MacCarthy dynasty of Muskerry is a branch of the great MacCarthy Mor dynasty, the Kings of Desmond. Their branch descends from Dermod Mor MacCarthy, 1st Lord of Muscry (1310-1367/8), second son of Cormac MacCarthy Mor (1271–1359), King of Desmond._NEWLINE_Dermod Mor was created Lord of Muscry (Muskerry, along the Lee river in central County Cork) in 1353. His descendant Cormac Oge MacCarthy, 17th Lord of Muscry, was in 1628 created Charles MacCarty, 1st Viscount Muskerry, and his son, the 2nd Viscount Muskerry, was in 1658 created Donough MacCarty, 1st Earl of Clancarty._NEWLINE_The dynasty is still in existence and can be considered to still broadly belong to the Irish nobility, but its leadership is in confusion. There also remains some dispute with their (friendly) rivals and kinsmen the MacCarthys Reagh, concerning the title Prince of Desmond. The late main line of the MacCarthy Mor dynasty became extinct in the late 16th century and it has ever since been unclear who inherits the title, because of the advent of the career of Florence MacCarthy. See Kingdom of Desmond. There are also earlier MacCarthy Mor septs in existence who are claimants. The situation was recently thrown into even more exotic confusion by the impostor Terence Francis MacCarthy. + _START_ARTICLE_ Deltahedron _START_PARAGRAPH_ In geometry, a deltahedron (plural deltahedra) is a polyhedron whose faces are all equilateral triangles. The name is taken from the Greek majuscule delta (Δ), which has the shape of an equilateral triangle. There are infinitely many deltahedra, but of these only eight are convex, having 4, 6, 8, 10, 12, 14, 16 and 20 faces. The number of faces, edges, and vertices is listed below for each of the eight convex deltahedra. + _START_ARTICLE_ Stephan Dweck _START_PARAGRAPH_ Stephan Dweck Esq. (born 1960) is an African-American humorist, attorney, radio show host and the author or co-author of several books._NEWLINE_He co-hosted the Sports Funk show on WFAN-AM radio in New York City with Monteria Ivey. Dweck and Ivey lived in the Frederick Douglass Houses housing project in Manhattan._NEWLINE_Ivey, Dweck and James Percelay co-authored several books on African-American humor, from slavery to American ghettos, including the Snaps trilogy. Ivey and Dweck also wrote two books on pick-up lines called You're So Fine I'd Drink a Tub of Your Bathwater and Baby, All Those Curves. And Me With No Brakes. 
Other books include Laugh Your Ass Off: The Big Book of African American Humor and The Field Guide to White People._NEWLINE_Dweck executive produced the Snaps series for HBO and the animated show The Big Head People for Spike TV. He has worked as a screenwriter for Eddie Murphy Productions and Miramax Films. He also was a regular guest on the IMUS in the morning program. His WFAN radio show, Sports Funk, was one of the first African American Sports talk shows in the nation._NEWLINE_He is a graduate of Dartmouth College, where he received the Ernest E. Just award for academic excellence. He is a member of Alpha Phi Alpha fraternity, and a member of the New York, New Jersey and Connecticut bar. As an attorney he has represented several rappers, singers and actors, including the cast of the film Paris Is Burning in their lawsuit against the producers of the film. He is the co-owner and founder of the Digital Cannabis Magazine Bloomin "The Cannabis Life" He Practices Entertainment law in New York City. + _START_ARTICLE_ Izaak Walton Killam Memorial Prize _START_PARAGRAPH_ The Izaak Walton Killam Memorial Prize was established according to the will of Dorothy J. Killam to honour the memory of her husband Izaak Walton Killam._NEWLINE_Five Killam Prizes, each having a value of $100,000, are annually awarded by the Canada Council to eminent Canadian researchers who distinguish themselves in the fields of social, human, natural, or health sciences. + _START_ARTICLE_ Canada-Wide Science Fair _START_SECTION_ History _START_PARAGRAPH_ The First Canada-Wide Science Fair was held May 11 and 12, 1962 at the Science Building at Carleton University in Ottawa. In 1962, the fair was co-sponsored by the Kiwanis Club of Ottawa Incorporated. The initial Headquarters for the Canadian Science Fairs Council was 45 Rideau Street, Ottawa. The two-day science fair was made up of 45 exhibits of regional winners from secondary school fairs across the country. _START_SECTION_ Intel International Science and Engineering Fair (Intel ISEF) _START_PARAGRAPH_ Several competitors and winners from the Canada-Wide Science Fair have been selected for competition at the Intel International Science and Engineering Fair as part of Team Canada, among them inventors Ann Makosinski and Alex Deans. Past Canada-Wide Science Fair winners Raymond Wang and Austin Wang both from Vancouver, BC, won the Gordon E. Moore award at Intel ISEF in 2015 and 2016, respectively. _START_SECTION_ Awards _START_PARAGRAPH_ Almost $1 million in awards and scholarships is given out each year at the Canada-Wide Science Fair._NEWLINE_Bronze, silver, and gold medals are awarded to outstanding projects in each age/grade category (see above). Challenge awards are presented for the best project in each of seven STEM challenges (discovery, energy, environment, health, information, innovation and resources) for each age/grade category. Sponsored special awards are also offered._NEWLINE_Three Grand Awards recognize the top project from the gold medal winners in each age/grade category: The Best Project Award (including $2,500 cash) is presented to the top overall project, regardless of category. The top projects from the two remaining categories receive Platinum Awards, which include $1,000 cash. Two or three of the platinum award winners compete at the European Union Contest for Young Scientists. 
+ _START_ARTICLE_ Home Energy Saver _START_SECTION_ The Home Energy Simulation Model _START_PARAGRAPH_ The Home Energy Saver is built on DOE-2, a computer program for building heating and cooling energy analysis and design. DOE-2 performs a thermal load simulation that accounts for heating and cooling equipment and thermal distribution efficiencies, infiltration, and thermostat management. User-entered zip codes are mapped to one of about 300 unique "weather tapes" that impose a year's worth of local weather conditions on the home to determine heating and cooling needs._NEWLINE_Home Energy Saver extends DOE-2 in a number of ways to improve the simulation model. For example, when users enter their actual electricity tariffs, the predictive power of the model improves. Other methods are used to calculate the energy used by appliances, water heating, and lighting._NEWLINE_The public domain HES calculation methods and underlying data are clearly documented on the website. Other web-based tool developers are welcome to use this information at no cost, providing that the source is properly credited. _START_SECTION_ Awards & Recognition _START_PARAGRAPH_ Each year, the R&D 100 Awards recognize the year's 100 most significant, innovative, newly introduced research and development advances. The awards are recognized in industry, government, and academia as proof that a product is one of the most innovative ideas of the year, nationally and internationally. Home Energy Saver and Hohm received an R&D 100 Award in 2010._NEWLINE_Home Energy Saver received the U.S. Department of Energy's "Energy 100" award as one of the best 100 scientific and technological accomplishments over DOE's 23-year lifetime. The discoveries were chosen based on their impact in saving consumers money and improving quality of life._NEWLINE_PC Magazine recognized Home Energy Saver in 2004 as one of the "Top 100 Undiscovered Websites._NEWLINE_MSN-Money rates Home Energy Saver among the "Best Sites for Free Government Help" including it in the list of "The 100 most Useful Sites on the Internet. _START_SECTION_ Licensing _START_PARAGRAPH_ Organizations who want to provide their customers tools to predict home energy consumption can license the Home Energy Saver Application Programming Interfaces (APIs)._NEWLINE_Microsoft, the first organization to license the Home Energy Saver, uses it to drive Microsoft Hohm. + _START_ARTICLE_ Lego Racers (video game) _START_SECTION_ Gameplay _START_PARAGRAPH_ Lego Racers is a racing game played from a third person perspective. Set in the fictional Legoland universe, the game depicts Rocket Racer, the "greatest racing champion" in Legoland. After becoming bored from beating everyone at racing, he decides to create a racing contest, and finds the best racers in the history of Legoland using a dimensional warp machine created by his friend, Veronica Voltage, a genius scientist and mechanic. The player takes on the hosts and co-racers in an attempt to beat Rocket Racer and become the "Greatest Lego Racer of All Time", completing the game._NEWLINE_Players assume the role of either one of several pre-built or custom-built minifigures and compete against other minifigure characters in races set across different tracks in the Legoland universe, using a variety of cars built out of Lego. At the beginning of each race, the player can perform a "Turbo Start", which allows the player to start the race at full speed. 
Throughout races, the player can also perform power slides and "Super Slides", which allow the player's car to turn around corners more sharply._NEWLINE_Each of the game's tracks contain power up bricks, which can be collected by the player and used to gain an advantage over other racers. The power ups are divided into four categories: Projectile, Hazard, Shield and Turbo, with each providing a different use to the player. The player can also collect up to three "power plus" bricks, which increase the capability of any power ups collected. Most tracks contain one shortcut that players can use to get ahead of opponents, which are usually either found with careful looking, or accessed using power-ups, mainly Projectile power-ups that destroy part of the scenery. During a race, the in-game HUD displays the player's position, lap number, "lap timers", and a "Power Up Icon" if the player is carrying any power up or power plus bricks. The player can also choose between viewing the "Speedometer", the "Course Map" or the "Close-up Map"._NEWLINE_The game contains three single-player modes: "Circuit Race", "Single Race" and "Time Race", as well as one multiplayer mode, "Versus Race". The Circuit Race mode follows the game's main plot, and allows players to race through circuits made up of multiple tracks, gaining points based on where they place, while contending with a highly skilled racer who leads each circuit. In a circuit, the player must earn enough points to move on to the next race, and will win if they finish with the most points. Placing third or above in a circuit unlocks the next circuit for the player. The Single Race mode allows the player to race on a single track unlocked from the Circuit Race mode. The Time Race mode places the player in a race against Veronica Voltage driving a ghost car with the aim of beating her best time around a track chosen by the player. Versus Race allows two players to race against each other in a split screen view without non-player character minifigures on the track._NEWLINE_Throughout the game, the player can unlock various brick sets and character pieces by completing certain tasks, such as coming first in a Circuit Race. The game's "Build Menu" allows the player to build custom cars, minifigures and driving licenses of their own design using unlocked bricks and character parts. Minifigures can be customized with different hat, hair, head, body and leg parts, and given a name entered by the player on the minifigure's driving license. A picture of the player's minifigure is also placed on their driving license, and their facial expression can be changed by the player. The player can create a custom car using a combination of different chassis and car sets. The player can rotate, move and place bricks from these sets directly on to the chassis. Placement of the bricks changes the car's balance and weight, which affects its overall performance. The "Mix" option creates minifigures from randomly selected parts, while the "Quick Build" option creates one of 2 presets for a specific chassis. _START_SECTION_ Development _START_PARAGRAPH_ The concept for Lego Racers was initially created by High Voltage Software founder Kerry J. Ganofsky, with the idea of players being able to build and race cars created with Lego bricks. After a year of development, Lego Media began production of the game, hiring Ganofsky's company to develop it. 
Lego Media and other facilities within The Lego Group collaborated with High Voltage Software during the production of the game._NEWLINE_A large number of character models, documents and pictures from different Lego System characters and models were sent to the developers, who eventually chose to use the Castle, Space, Adventurers and Pirates themes in the game. High Voltage Software chose the characters they liked best from these themes and created character studies for them to "capture the mood of each persona". Certain characters would assume the role of bosses, while others were featured as less skilled opponents. The developers also created two original characters, Rocket Racer and Veronica Voltage._NEWLINE_High Voltage Software spent over a year creating Lego Racers' car building mechanics. The game's lead programmer, Dwight Luetscher, created a formula that was used by the game's artists to create individual Lego elements in the game. The pieces available to the player were selected from hundreds of Lego elements by the developers, chosen first by aesthetics, and then analysed to see if they would fit into Luetscher's formula. The developers chose to affect the attributes of the player's car, such as handling, acceleration and top speed, through how many bricks are placed on the chassis, as this is simpler to understand for the game's main age demographic._NEWLINE_Due to the high number of Lego sets and pieces in the game, a custom mesh code was created to "weld" the geometry in place and optimize the cars polygon count, creating one solid mesh for each car created by the player. Every element in the game, including bricks and character pieces, had different levels of detail created for use in menu screens and cut scenes, where the models had to be a higher quality due to the player seeing them up close. The developers planned a damage system where bricks would break apart from the car upon crashing, but this presented "too many problems to make it a real possibility". Lego Racers was available to play before release by journalists at E3 1999. _START_SECTION_ Sequel _START_PARAGRAPH_ Following Lego Racers' success, news arose in April 2001 that Pocket Studios was working on a sequel to the Game Boy Advance version of Lego Racers, titled Lego Racers 2, which was then shown at E3 2001 in May that year. The eponymous Microsoft Windows and PlayStation 2 counterpart to Lego Racers 2, developed by Attention to Detail, was announced in August 2001, and released in September 2001. The sequel followed up immediately after Rocket Racer's defeat in Lego Racers, who is shown a new opportunity to reclaim his title as world champion, by travelling to Xalax and prove himself worthy of it. After Rocket Racer proceeds to do so and succeeds, the player is tasked to control their self-built protagonist, racing through various worlds based on Lego themes, and eventually face Rocket Racer again. Lego Racers 2 was received less favorably than Lego Racers, and incorporated numerous elements from both Lego Racers and Rollcage, another game developed by Attention to Detail._NEWLINE_An arcade-style version of Lego Racers was shown in Legoland Windsor's Lego Rocket Racers building in "The Beginning" area, between 2000 and 2004, as well as 2009 and 2011. + _START_ARTICLE_ Grange railway station _START_SECTION_ History _START_PARAGRAPH_ The original station, located 13.2 kilometers from Adelaide and on the western side of Military Road, was opened in September 1882 as the terminus of the Grange railway line. 
Initially operated by a private company, South Australian Railways took over the line in the 1890s, and extended it to Henley Beach station via the Henley Beach railway line. On 31 August 1957, however, the line was cut back to Grange._NEWLINE_On 9 March 1986, the current Grange station, on the eastern side of Military Road replaced the original station on the western side. The station was relocated to prevent traffic flow along Military Road from being interrupted by the arrival of trains. The ticket office and shelter of the original station were demolished shortly after, but the unused platform remains in place._NEWLINE_The train station shelter was replaced in 2017. + _START_ARTICLE_ Heinrich Graf von Einsiedel _START_SECTION_ Biography _START_PARAGRAPH_ Einsiedel, a great-grandson of Otto von Bismarck, was born in Potsdam, Province of Brandenburg, as the youngest child to Herbert von Einsiedel (1885–1945) and Irene von Bismarck-Schönhausen (1888–1982). His parents were divorced in 1931._NEWLINE_In World War II Einsiedel served as a German fighter pilot, initially with Jagdgeschwader 2 over the Western Front, flying the Messerschmitt Bf 109. He took part in escort operations over the cruisers Scharnhorst, Gneisenau and Prinz Eugen as they made their 'Channel dash' from Brest to Germany in February 1942. von Einsiedel claimed two of the six Fairey Swordfish of No. 825 Squadron Fleet Air Arm, who made an unsuccessful low-level torpedo attack. _NEWLINE_On one occasion he was shot down and crash-landed near Rotterdam and was also shot down into the Channel and rescued. In June 1942 von Einsiedel was transferred to Jagdgeschwader 3 on the Russian Front for the forthcoming offensive against Stalingrad. Over the next six weeks, he claimed 33 Russian aircraft downed, including four Petlyakov Pe-2 bombers in the space of six minutes on 20 August. He was awarded the German Cross in Gold._NEWLINE_On 30 August 1942, during combat with Russian 'Ratas', he was forced to land and was captured by Russian ground forces, becoming a prisoner of war in the Soviet Union. The Soviet authorities soon realised the pilot was a well-connected member of the German nobility and thus a potentially valuable propaganda weapon. On capture von Einsiedel refused to divulge any military intelligence to his captors. He finally agreed however to send an open letter home stating he was being treated correctly and that Germany was going to lose the war, and that his great-grandfather Otto von Bismarck, would never have invaded Russia._NEWLINE_He became a founding member, Vice-president and commissary of propaganda of the National Committee for a Free Germany and led a propaganda unit which broadcast and distributed leaflets to German forces._NEWLINE_Released after the war, von Einsiedel initially worked for the Tägliche Rundschau, the German newspaper of the Soviet Military Administration in Germany, but became increasingly disillusioned with the Soviet regime, experiencing at first hand the Russian corruption and inefficiency. He was given permission to visit West Berlin on behalf of the NKVD for intelligence gathering purposes. While meeting his mother he was arrested by US Forces and sentenced by an American court for spying and having forged documents. He was released on appeal. 
_NEWLINE_Despite a highly publicised press conference when back in the East, he was by now seen as a liability by the Soviet authorities._NEWLINE_He thus moved to West Germany in late 1948, where he worked as a translator, script-writer and journalist. The governing Socialist Unity Party of East Germany acknowledged von Einsiedel as a bonafide anti-fascist but a petit bourgeois who, "as soon as the class war became acute", had wavered and switched political camps for his own self interests._NEWLINE_von Einsiedel also wrote 'The Shadow of Stalingrad: Being the Diary of Temptation' in 1953, which attempted to tell his complex story. Eventually von Einsiedel joined the film industry, as a scriptwriter and a film soundtrack dubber. He also played the role of a pilot in the drama 'The Last Bridge' (1953) with his first wife, Barbara Rütting._NEWLINE_He also wrote for the liberal Hamburg weekly, Die Zeit. He twice won the German bridge championship and played in the bridge World Cup._NEWLINE_Einsiedel was a member of the Social Democratic Party of Germany from 1957 until 1992 and was elected as a member of the German Bundestag as a candidate of the Party of Democratic Socialism (PDS) from 1994 until 1998._NEWLINE_Einsiedel died in Munich on 18 July 2007 aged 85. + _START_ARTICLE_ Calmar, Iowa _START_SECTION_ History _START_PARAGRAPH_ Calmar was platted in 1854. It was named after Kalmar, a city in Sweden._NEWLINE_The settlement experienced growth in 1868 when the railroad was built through it. Calmar was incorporated on July 14, 1869. _START_SECTION_ Geography _START_PARAGRAPH_ Calmar is located at 43°10′55″N 91°51′59″W (43.182054, -91.866446)._NEWLINE_According to the United States Census Bureau, the city has a total area of 1.07 square miles (2.77 km²), all of it land. _START_SECTION_ 2010 census _START_PARAGRAPH_ As of the census of 2010, there were 978 people, 444 households, and 252 families residing in the city. The population density was 914.0 inhabitants per square mile (352.9/km²). There were 492 housing units at an average density of 459.8 per square mile (177.5/km²). The racial makeup of the city was 98.0% White, 0.3% African American, 0.1% Asian, 0.6% from other races, and 1.0% from two or more races. Hispanic or Latino of any race were 2.0% of the population._NEWLINE_There were 444 households of which 27.0% had children under the age of 18 living with them, 43.9% were married couples living together, 9.9% had a female householder with no husband present, 2.9% had a male householder with no wife present, and 43.2% were non-families. 31.5% of all households were made up of individuals and 10.8% had someone living alone who was 65 years of age or older. The average household size was 2.20 and the average family size was 2.84._NEWLINE_The median age in the city was 34.9 years. 21.5% of residents were under the age of 18; 13.7% were between the ages of 18 and 24; 23.9% were from 25 to 44; 27.7% were from 45 to 64; and 13.2% were 65 years of age or older. The gender makeup of the city was 51.7% male and 48.3% female. _START_SECTION_ 2000 census _START_PARAGRAPH_ As of the census of 2000, there were 1,058 people, 452 households, and 269 families residing in the city. The population density was 1,006.8 people per square mile (389.0/km²). There were 482 housing units at an average density of 458.7 per square mile (177.2/km²). The racial makeup of the city was 98.87% White, 0.19% African American, 0.09% Native American, 0.19% Asian, and 0.66% from two or more races. 
Hispanic or Latino of any race were 0.47% of the population._NEWLINE_There were 452 households out of which 27.2% had children under the age of 18 living with them, 49.8% were married couples living together, 7.1% had a female householder with no husband present, and 40.3% were non-families. 30.3% of all households were made up of individuals and 14.6% had someone living alone who was 65 years of age or older. The average household size was 2.33 and the average family size was 2.97._NEWLINE_In the city, the population was spread out with 23.3% under the age of 18, 14.7% from 18 to 24, 26.6% from 25 to 44, 18.6% from 45 to 64, and 16.8% who were 65 years of age or older. The median age was 36 years. For every 100 females, there were 105.0 males. For every 100 females age 18 and over, there were 104.8 males._NEWLINE_The median income for a household in the city was $36,250, and the median income for a family was $50,063. Males had a median income of $29,875 versus $21,708 for females. The per capita income for the city was $17,958. About 3.4% of families and 9.6% of the population were below the poverty line, including 5.7% of those under age 18 and 17.5% of those age 65 or over. _START_SECTION_ Education _START_PARAGRAPH_ Calmar is home to one of two campuses of Northeast Iowa Community College._NEWLINE_South Winneshiek High School is in Calmar. Its elementary and middle schools are in Ossian. + _START_ARTICLE_ Texas State Highway Loop 207 _START_SECTION_ Route description _START_PARAGRAPH_ Loop 207 begins and ends in Mont Belvieu at SH 146. Between the terminuses, Loop 207 intersects FM 565. _START_SECTION_ History _START_PARAGRAPH_ Loop 207 was designated on its current route on September 12, 1946. + _START_ARTICLE_ Swarovski crystal mesh Armani Privé gown _START_SECTION_ Reception _START_PARAGRAPH_ The Daily Mail remarked that Armani "stole the show" at the 2007 Oscars, as both Beyoncé and Katie Holmes also turned up wearing Giorgio Armani dresses and stated that Blanchett's dress "set the tone for the evening: a pale colour, with clean lines, that paid lip service to the metallic trend so prevalant on the catwalks this season." Cosmopolitan magazine cited the slinky, shimmery-silver, one-shoulder dress as one of the Best Oscar dresses of all time, saying, "Cate makes the list twice because of her consistently impeccable style. This one-shouldered gunmetal gown clings to her fabulous body like it was painted on, and the delicate and elegant hair and makeup complete the look without distracting us from the dress." + _START_ARTICLE_ William G. Bissell _START_PARAGRAPH_ William George Bissell (1857–1925) was a member of the Wisconsin State Senate. _START_SECTION_ Biography _START_PARAGRAPH_ Bissell was born on September 18, 1857 in Massena, New York. He moved with his parents to Lodi, Wisconsin in the spring of 1866. Bissell worked as a farmer and travelling salesman before becoming a general merchant in 1896. He married Eva S. Sisson (1860–1937). Bissell died in 1925 and is interred in Baraboo, Wisconsin. _START_SECTION_ Senate career _START_PARAGRAPH_ In the fall of 1898 Bissell was nominated for the state senate by the Republicans of the Twenty-seventh district, comprising Columbia and Sauk counties, and he was elected over Edmund S. Baker, the candidate of the democrats and James M. Blachly, the candidate of the Prohibitionists. Bissell represented the 27th District in the Senate from 1899 to 1902. 
Bissell served on the committees on state affairs, manufacturers and agriculture of the senate. + _START_ARTICLE_ Patrick Davis (politics) _START_PARAGRAPH_ Patrick Davis is a political consultant and strategist. Davis has worked in the George H.W. Bush Administration and the National Republican Senatorial Committee, most notably as Political Director in 2004. Davis also served as the Executive Director of the South Dakota Republican Party before going into private business in 2005. He lives in Colorado Springs, Colorado, with his wife, Jo Ann, and their two children. _START_SECTION_ Life in the public sphere _START_PARAGRAPH_ After graduating from college in 1990, Davis served as the Assistant to the Deputy Director of White House Political Affairs in the George H. W. Bush Administration. Davis then worked for the 1992 Bush-Quayle Presidential campaign, serving as the field desk coordinator for eleven Northwestern states._NEWLINE_Davis served as the Executive Director of the South Dakota Republican Party from 1995 to 1999. During this time, South Dakota Republicans increased their majorities in both houses of the State Legislature, John Thune was elected to the United States House of Representatives and Governor Bill Janklow was re-elected._NEWLINE_In 1999, Davis was hired to represent the National Republican Senatorial Committee as a Regional Political Director in ten Republican United States Senate campaigns. During the 2004 election cycle, Davis served as the NRSC's Political Director, increasing the Republican majority from 51 to 55. In his position as Political Director, Davis managed the political and strategic operations of the committee, including candidate recruitment, message development, and campaign management. He also directed the committee's $35 million voter contact budget._NEWLINE_Davis was involved in the competitive winning United States Senate campaigns for John Thune, Norm Coleman, Wayne Allard, Gordon Smith, Conrad Burns, Tom Coburn, Mel Martinez, Richard Burr, David Vitter, Jim Bunning, Johnny Isakson, Mike Lee, Richard Burr, John Hoeven, and Jim DeMint._NEWLINE_Davis was involved in the competitive winning United States Representative campaigns for Cynthia Lummis, Rick Berg, Steve Daines, Kristi Noem, Tim Huelskamp, and Mike Coffman. _START_SECTION_ Private consultant _START_PARAGRAPH_ In 2005, Davis founded Patrick Davis Consulting, LLC, a company that serves candidates, campaigns and corporations clients. Patrick Davis Consulting has been hired to work for both local and national campaigns, including Steve House for Governor (CO), Joe Gschwendtner for Governor (CO), Steve Laffey for Congress (CO), Floyd Trujillo for U.S. Senate (CO), Ron Saxton for Governor (OR), Don Stenberg for U.S. Senate (NE), Mike Protack for U.S. Senate (DE), Scott Tipton for Congress (CO-3), Jeff Crank for Congress (CO-5), Duane Sand for Congress (ND), Bruce Whalen for Congress (SD), Rick O’Donnell for Congress (CO-7), Kyle Hybl for CU Regent (CO-5), Eli Schwiesow for State Senate (SD), Glen Urquhart for Congress (DE), Karen England for Lt. Governor (CA), Rhonda Sivarajah for Congress (MN), Sharna Wahlgren for Congress (MN), David Gerson for Congress (MN) and Dan Lederman for Senate (SD)_NEWLINE_Since 1990, Davis has been involved in the winning U.S. 
Senate campaigns for John Thune, John Hoeven, Larry Pressler ('90), Steve Daines, Dan Sullivan, Lisa Murkowski, Jim DeMint, Joni Ernst, Cory Gardner, Thom Tillis, Bill Cassidy, Mike Lee, Norm Coleman, Rudy Boschwitz ('90), Tom Cotton, Bill Frist, Wayne Allard, Gordon Smith, Conrad Burns, Tom Coburn, Mel Martinez, Richard Burr, Jim Inhofe, John Cornyn, Sam Brownback, David Vitter, Jim Bunning, and Johnny Isakson._NEWLINE_He has been involved in the winning US House campaigns of Kristi Noem, Kevin Cramer, Tim Huelskamp, Mike Coffman, Steve Pearce, and Cynthia Lummis._NEWLINE_Finally, Patrick Davis Consulting also provides public relations services for private, non-political clients, such as Wal-Mart, Comcast, Neumann Education Foundation, and Neumann Systems Group. + _START_ARTICLE_ Cubelets _START_PARAGRAPH_ Cubelets are a line of construction toys manufactured by Modular Robotics._NEWLINE_The Cubelets are small color coded cubes that people magnetically stick together to form a variety of simple robots, a kind of modular robot._NEWLINE_The cubes connect with magnets and a genderless connector. + _START_ARTICLE_ 1991 World Artistic Gymnastics Championships _START_PARAGRAPH_ The 26th Artistic Gymnastics World Championships were held in Indianapolis, United States, in the Hoosier Dome from September 6 to 15, 1991. This was the last championships at which the Soviet Union competed. + _START_ARTICLE_ MN 5 (biostratigraphic zone) _START_PARAGRAPH_ In biostratigraphy, MN 5 is one of the MN zones used to characterize the fossil mammal faunas of the Neogene of Europe. It is preceded by MN 4 and followed by MN 6 and is part of the Orleanian age of the middle Miocene. MN 5 starts within magnetostratigraphic chron C5Cr, at 17.0 million years ago, and ends at the start of chron C5Bn.1r, at 15.0 million years ago, although some different correlations have been proposed. The reference locality used to correlate faunas with this zone is Pontlevoy-Thenay in France; other localities include La Retama in Spain, Castelnau-d'Arbieu in France, and Sandelzhausen in Germany._NEWLINE_In this zone, the muroid rodent Cricetodon first appears in western Europe, as do the poorly known Lartetomys and Mixocricetodon. In the extinct rodent family Eomyidae, the genus Ligerimys last appears during MN 5, but Keramidomys and Eomyops appear as immigrants. The last European marsupial, Amphiperatherium, last appears in France and Spain during MN 5, but persists into MN 6 in Germany._NEWLINE_The primate Pliopithecus first appears during MN 5. The rhinoceroses Prosantorhinus, Plesiaceratherium, Hispanotherium, and Gaindatherium make their last appearance, but Ancylotherium and Hoploaceratherium first appear during MN 5. Chalicotherium, a member of the related extinct family Chalicotheriidae, also appears for the first time. Several artiodactyls, such as the pig Conohyus, the deer Heteroprox and Dicrocerus, and the musk deer Micromeryx first appear, and the pigs Bunolistriodon and Aureliachoerus and the ruminants Amphimoschus and Lagomeryx last appear in MN 5. Two artiodactyl genera, Triceromeryx and Pseudoeotragus, occur only during MN 5. The primitive artiodactyl Cainotherium last appears in France and Spain, but persists into MN 6 in Germany. 
+ _START_ARTICLE_ Etheostoma gracile _START_SECTION_ Distribution and habitat _START_PARAGRAPH_ Etheostoma gracile is found in the Mississippi River basin from central Illinois and northeastern Missouri to Louisiana, also in the Red River drainages to southeastern Kansas and eastern Oklahoma, and the Gulf Slope drainages from the Tombigbee River in Mississippi to the Nueces River in Texas. Suitable habitats include pools of slow-flowing water in small streams, backwaters of larger rivers, turbid water over sand or mud, oxbow lakes, swamps, and among vegetation. _START_SECTION_ Status _START_PARAGRAPH_ The IUCN has listed this species as being of "Least Concern", because it has an extensive range in the Mississippi River system, has a large total population size, and numerous subpopulations. In general, the population trend seems stable and no major threats have been identified. + _START_ARTICLE_ N-acylneuraminate-9-phosphate synthase _START_SECTION_ Structural studies _START_PARAGRAPH_ As of late 2007, only one structure has been solved for this class of enzymes, with the PDB accession code 1WVO. + _START_ARTICLE_ GE Healthcare _START_SECTION_ 19th century _START_PARAGRAPH_ In 1893, C.F. Samms and J.B. Wantz founded the Victor Electric Company in a basement. By 1896 they made electrostatic generators for exciting x-ray tubes and electrotherapeutic devices._NEWLINE_They had a staff of six and a capital of $3,000 invested in the company._NEWLINE_Victor Electric_NEWLINE_plunged into the x-ray business and by 1896 (one year after Roentgen’s discovery) were making x-ray machines. The business grew rapidly and so, in 1896, moved into new premises three times the original size, but this did not solve the space problems and the company made 3 moves by 1899._NEWLINE_Victor Electric had competitors. In 1896, G.A.Frye began making x-ray tubes, which in 1897 was purchased by Swett & Lewis as the first merger in the x-ray business. _START_SECTION_ 20th century _START_PARAGRAPH_ During the first years, it was easier to keep up with the competition than space requirements. By 1903, Victor Electric had outgrown its facilities at 418 Dearborn St. in Chicago and bought two floors of a building at 55 Market Street, Chicago. This was again only a temporary stop; by 1910 it was too small and the firm moved again in 1911 to a building at the corner of Jackson Blvd. and Damen Avenue. This was the first permanent home of Victor Electric Co. They stayed there 35 years and during this time, gradually acquired all the space in the building and several around it._NEWLINE_During the first 20 years of the x-ray business, many new names appeared. In 1901 the Western Electric Coil Co. was formed. In 1902 MacAlaster & Wiggin purchased the x-ray tube business of Swett & Lewis. Two other companies were the Radio Electric Co., which was later to be known as Snook-Roentgen Manufacturing and the Scheidel Western X-Ray Coil Co. In 1907, Homer Clyde Snook introduced the Snook apparatus, the first interrupterless device produced for X-ray work. The Snook apparatus was manufactured in England._NEWLINE_In 1916, the first significant merger took place, Scheidel Western, Snook-Roentgen, MacAlaster & Wiggin, and Victor Electric Co. were merged with Victor, the surviving name. 
Victor’s two founders had key roles in the new firm; C.F.Samms was company president and J.B.Wantz was Vice-President of manufacturing and engineering._NEWLINE_Four years later, in 1920, a second major merger was accomplished when Victor was acquired by General Electric_NEWLINE_which was, at that time, the foremost manufacturer of x-ray tubes._NEWLINE_The marriage of Victor Electric and General Electric became complete of July 28, 1926 when Victor was declared a wholly owned affiliate of General Electric. The merger brought renewed vitality to the organization and Victor entered the foreign market with equipment sold and serviced in nearly 70 countries. In 1930, the name was changed from Victor to General Electric X-Ray Corporation._NEWLINE_World War II saw the dramatic use of x-rays in industry for non-destructive testing of war materials. It also saw the broad use of x-rays as a medical tool for military services._NEWLINE_As the war ended, GE X-Ray Corporation continued to grow. Greater production capacity and greater expertise was needed in the core business of building X-ray tubes. Since the tubes were made from hand-blown glass, the decision was made to move the company 90 miles north to Milwaukee, Wisconsin, in order to tap into the enormous amount of glass-blowing talent in Milwaukee's beer-brewing industry. The company moved from Jackson Blvd. in Chicago to a 43-acre (170,000 m²) site in the city of West Milwaukee, which had been used for building turbochargers during the war. The street in front was renamed Electric Avenue, and the General Electric X-Ray Corporation had a new home in 1947._NEWLINE_In 1951, the corporate structure was dissolved and the name changed to General Electric x-Ray Department. This new name lasted less than 10 years as the department divested itself of its industrial x-ray business, widened its medical business, and took on the name of GE Medical Systems Department. One of the reasons for the name of Medical Systems was due to the increase in the electro-medical business, which began in 1961 with the introduction of patient monitoring equipment. By 1967 modular equipment was developed which was soon popular in cardiac and intensive care units. Early in 1960, pacemakers were developed in Corporate Research & Development in Schenectady, New York, and in 1969 the Standby Pacemaker was developed._NEWLINE_In 1968, the Biomedical Business Section opened its first factory in Edgerton Avenue. Late in 1970 a surgical package was introduced and in 1971, equipment to monitor blood gasses during surgery was introduced._NEWLINE_Later in 1971, Biomedical opened a 9,000 square meter admin and engineering building opposite its factory and in 1972, the section was renamed The cardio-Surgical Product Section. With the growth of its medical business, the General Electric Company upgraded the department to The Medical Systems Division in 1971. Also in 1971, a major expansion programme was started and the Waukesha factory was planned. Work started in July 1972, and was completed in 1973._NEWLINE_In 1973, work on CT was started and eventually the first CT machine was installed in 1976. Development continued to the first CT 8800, and after long negotiations, GE acquired the medical division of EMI Group Ltd. 
in late 1980 soon after the 1979 takeover of EMI medical division by Thorn Electric company._NEWLINE_The American Anti-Trust Authorities stopped the takeover in the USA however, and the EMI factory in Chicago was bought up by Omni-Medical, who continued to make CTs for a number of years._NEWLINE_Meanwhile, back at GE, the Patient Monitoring Department was sold off in 1981. The initial boost provided by the EMI takeover turned into the doldrums as Reaganomics sent the US dollar soaring, so in 1984 GE bought a 49% share of YMS (Yokogawa Medical Systems), a Japanese company._NEWLINE_In 1983, GE Medical started investing heavily in Magnetic Resonance Imaging (MRI) technology, investing nearly 1 billion US dollars in a new plant in Waukesha, and the MR Signa was born, which would go on to become the very successful MR model range. The magnet plant in Florence (USA) was opened a short time later, giving GE its own magnet production. In the same year, GE divested its dental x-ray division to form Gendex Dental Systems._NEWLINE_In 1985 GE acquired Technicare from Johnson and Johnson. Originally named Ohio Nuclear (and in 1979, after another fusion, Ohio Nuclear Unirad), the name was changed to Technicare in 1982. Technicare (with headquarters in Cleveland, Ohio) had been producing a range of rotate-stationary CTs with an installed base in the thousands, as well as some x-ray diagnostic equipment and a nascent MRI product range._NEWLINE_Up to this time, the medical Systems Division had simply been divided into domestic and international, but in 1987 it was decided to re-organize into the three "poles" of America, Europe and Pacific. In 1988, GE Medical Europe merged with CGR (a medical equipment supplier based in France) to form General Electric CGR Medical Systems. The European headquarters were moved from Hammersmith (UK) to Buc near Paris._NEWLINE_In 1992, GE had a setback after long negotiations to buy Picker International, who were a major producer of CT and MR equipment. The deal was not approved by the American authorities, and so GE just bought the Picker Service organization in the U.K., leaving the rest of Picker intact._NEWLINE_In 1994, it was decided to change the name in Europe from GE-CGR back to General Electric Medical Systems. At the close of 1998, GE Medical acquired the Nuclear and MR businesses of Elscint, (then a division of Elron, based in Haifa, Israel), the CT business being bought by Picker, and in the same year Marquette Medical Systems became a wholly owned subsidiary of GE Medical. In 1998, GE medical bought Diasonics Vingmed Ltd. from Elbit Medical Imaging (of Haifa, Israel), thus expanding its ultrasound imaging business. _START_SECTION_ 21st century _START_PARAGRAPH_ In 2001, GE Medical Systems acquired San Francisco, CA, based CT maker Imatron, Inc for $210 million. Imatron produced an Electron beam tomography (EBT) scanner that performs imaging applications used by physicians specializing in cardiology, pulmonology and gastroenterology. The formal Imatron business was later incorporated into GE Healthcare's Diagnostic Imaging business segment. 
In early 2002, GE Healthcare acquired MedicaLogic (creator of the former Logician, an ambulatory electronic medical records system) for approximately $32 million._NEWLINE_By January 2003, GE had acquired Millbrook Corporation, maker of Millbrook Practice Manager, a billing and scheduling system for doctors' offices._NEWLINE_GE Healthcare IT would later merge the two products into one, although the stand-alone EMR product is still available and in development. Also in April 2002, GE Healthcare completed the acquisition of Visualization Technology, Inc. of Boston, MA, a manufacturer of intra-operative medical devices and related products for use in minimally invasive image guided surgery._NEWLINE_In 2003, GE Healthcare acquired Instrumentarium (including its Datex-Ohmeda division), a producer, manufacturer, and supplier of anesthesia machines and mechanical ventilators. To satisfy regulatory concerns in the United States and in Europe, GE Healthcare was forced to divest Instrumentarium’s Ziehm Imaging mobile C-arm business, as well as its Spacelabs patient-monitoring unit. Currently, GE Healthcare owns 80% of all anesthesia machines in the United States and 60% of the machines in the world. The former Instrumentarium business was incorporated into GE Healthcare's Clinical Systems business segment._NEWLINE_In 2004, the former Amersham plc business segments were separated into the GE Healthcare Medical Diagnostics and Life Sciences business segments, and on 1 May 2013 both businesses were combined again under the GE Life Sciences brand with Kieran Murphy taking the leadership role. Also in 2004, GE Healthcare, along with other healthcare companies, built a research reactor for neutron and unit cell research at GE's European Research Center near Garching (outside of Munich), Germany. It is the only such reactor currently in operation. In 2005, Sir William Castell, CEO of GE Healthcare and former CEO of Amersham plc, stepped down as CEO to become Chairman of the Wellcome Trust—a charity that fosters and promotes human and animal research—in the United Kingdom. Former GE Medical Systems CEO Joe Hogan became the overall CEO for the GE Healthcare business. In 2005, Dental Imaging operations were separated from GE Healthcare. The PaloDEx Group Oy was founded and continues the business with its subsidiaries Instrumentarium Dental and SOREDEX. Specifically, Instrumentarium Dental continues the Orthopantomograph brand and the intraoral systems FOCUS and SIGMA, formerly known as Instrumentarium Imaging or GE Healthcare products._NEWLINE_In September 2005, GE Healthcare and IDX Systems Corporation announced that they had entered into a definitive $1.2 billion merger agreement for GE to acquire IDX, a leading healthcare information technology (IT) provider. The acquisition was completed in January 2006. IDX was folded into GE Healthcare Integrated IT Solutions, which specializes in clinical information systems and healthcare revenue management._NEWLINE_On 4 February 2008, GE Healthcare announced that it had completed the acquisition of Whatman plc (LSE:WHM), a global supplier of filtration products and technologies, at 270p in cash for each Whatman share, valuing Whatman at approximately £363 million (approximately $713 million). In July 2008, Joseph Hogan announced his intent to leave his post as CEO of GE Healthcare to take the role of CEO at ABB. On July 17, 2008, GE Healthcare announced John Dineen had been chosen to replace outgoing CEO Joseph Hogan. Mr.
Dineen had been head of GE's Transportation division since 2005. On March 24, 2010, GE Healthcare announced the acquisition of MedPlexus. In late April 2010, GE Healthcare announced it was investing €3 million in the Technology Research for Independent Living Centre (TRIL). The Irish centre seeks to enhance independence for elderly people through technological innovation._NEWLINE_In July 2015, GE Healthcare partnered with the 2015 CrossFit Games to provide athletes with mobile imaging equipment._NEWLINE_In January 2016, it was announced that GE Healthcare's global headquarters would move to Chicago in early 2016._NEWLINE_In June 2017, GE announced Kieran Murphy as the new CEO of GE Healthcare, with former CEO John Flannery's appointment as CEO of GE._NEWLINE_In April 2018, GE announced the sale of several healthcare information technology assets for $1.05 billion to Veritas Capital._NEWLINE_In June 2018, GE announced it would spin off GE Healthcare into its own company, most likely to be based in Chicago. This represents the conglomerate's efforts to shrink and focus more on the aviation, power and renewable energy sectors. _START_SECTION_ Criticism _START_PARAGRAPH_ According to The Independent, the firm has received more money back in tax benefits (£1.6 million) in the UK over the past 12 years than it has paid in. Its UK operations are all ultimately owned by a holding company in the Netherlands. Tax paid was £250,000, 1.7% of its £14.3m profit. The group employs 22,000 people in the UK._NEWLINE_It supplies a cloud-based imaging system to the East Midlands Radiology Consortium which was described in October 2017 as breaking down, so that medical images had to be sent between hospitals by taxi. + _START_ARTICLE_ National Association Opposed to Woman Suffrage _START_PARAGRAPH_ The National Association Opposed to Woman Suffrage (NAOWS) was founded in the United States by women opposed to the suffrage movement in 1911. It was the most popular anti-suffrage organization in northeastern cities. NAOWS had influential local chapters in many states, including Texas and Virginia. _START_SECTION_ History _START_PARAGRAPH_ The National Association Opposed to Woman Suffrage (NAOWS) was established by Josephine Jewell Dodge in New York City in 1911. Dodge had the first meeting at her house and women came from New York and surrounding states. Dodge was at the time the president of the New York State Association Opposed to Woman Suffrage (NYSAOWS). Dodge resigned from NYSAOWS to take over as president of NAOWS. Shortly after formation, state branches of NAOWS began to form. Headquarters in Washington, D.C. were opened in 1913, giving the organization a presence in both New York and the U.S. capital._NEWLINE_Like other anti-suffrage organizations, NAOWS published a newsletter as well as other publications, containing their opinions on the current political issues of the time. The newsletter of the association was called Woman's Protest (later renamed Woman Patriot in 1918). Dodge also toured the country, spreading anti-suffrage views to other states._NEWLINE_Members of the NAOWS typically belonged to wealthy families who feared suffrage would upset the status quo. In the South, the NAOWS gained support from many plantation owners who believed rights for women would lead to rights for minorities. Josephine Dodge, the founding president, was replaced in 1917 by Alice Hay Wadsworth, wife of U.S. Senator James W. Wadsworth, Jr. from New York.
Upon the amendment of the New York State Constitution granting women the right to vote, the focus of the NAOWS shifted from the state level to the federal level. The organization also began to see more men join NAOWS than before. The headquarters were moved solely to Washington, D.C., and the organization merged with the Woman Patriot Publishing Company. The organization disbanded in 1920 as a result of the passage of the Nineteenth Amendment. _START_SECTION_ Texas Association Opposed to Woman Suffrage _START_PARAGRAPH_ In March 1916, the Texas Association Opposed to Woman Suffrage (TAOWS) was created as a chapter of NAOWS in Houston with Pauline Wells as the president. The chapter in Texas also connected the increase in African Americans voting to women's suffrage and stoked fears of "domination by the black race in the South." They also believed that women's suffrage was linked to "feminism, sex antagonism, socialism, anarchy and Mormonism." Like their parent organization, TAOWS had local chapters in major Texas cities. TAOWS fought against the Texas Equal Suffrage Association, which was pushing for Texas women's right to vote in Texas primary elections in 1918. In April 1919, headquarters were moved to Fort Worth. In 1919, TAOWS successfully campaigned against a state measure for women's suffrage, which was defeated by 25,000 votes in May. However, in June 1919, Texas passed a suffrage amendment, allowing women to vote, and TAOWS stopped fighting against women's suffrage. _START_SECTION_ Virginia Association Opposed to Woman Suffrage _START_PARAGRAPH_ A group, the Virginia Association Opposed to Woman Suffrage (VAOWS), formed in Richmond in March 1912 and affiliated with NAOWS. Jane Rutherford served as the president of VAOWS. Local branches in different cities formed by 1913 and the organization distributed anti-suffrage literature. In 1915, VAOWS helped raise money for the Belgian Relief Fund during World War I. By May 1917, VAOWS had doubled in size and continued to grow through 1918. Around 8,000 women had signed up with the anti-suffrage cause in Richmond by 1919._NEWLINE_Like the Texas Association Opposed to Woman Suffrage, VAOWS also suggested that race riots, the black vote and women's suffrage were connected. In a sponsored editorial published in The Richmond Times-Dispatch on September 2, 1919, VAOWS exclaimed, "RACE RIOTS WILL INCREASE IF THERE IS MORE POLITICS BETWEEN THE RACES AND IF WOMEN ARE MIXED UP IN POLITICS!" _START_SECTION_ Political views _START_PARAGRAPH_ One of NAOWS' publications was a pamphlet, Some Reasons Why We Oppose Votes for Women, which, as the title suggests, outlined some of the reasons why they were opposed to woman suffrage. They believed it was irrelevant to the success of the country, as stated in their pamphlet:_NEWLINE_"Because the great advance of women in the last century— moral, intellectual and economic— has been made without the vote; which goes to prove that it is not needed for their further advancement along the same lines."_NEWLINE_The National Association Opposed to Woman Suffrage opposed women's right to vote because they said that the majority of women did not want the right to vote, and because they believed that the men in their lives accurately represented the political will of women around the United States. NAOWS submitted pamphlets like these to the general public as well as directing them to government officials so that political figures would see that women opposed the then-unratified Nineteenth Amendment.
They did this in order to counteract the rhetoric of the suffragettes of the time. According to the NAOWS and the state-based organizations that it inspired, voting would severely and negatively affect the true submissive and domestic state of the feminine. These organizations were championed by women who thought themselves the prime examples of true womanhood—quiet, dignified, and regal. They looked with disdain at the outward protests of suffragettes._NEWLINE_NAOWS wanted to appeal to conservative and traditional members of their community, including other women and religious figures. They positioned themselves as being in opposition to "the militant suffragette" and militant or "hysterical" tactics. NAOWS also believed that women's involvement in politics would interfere with their "civic duties for which they are peculiarly adapted." NAOWS believed that women were equal to men, but had different duties and "functions." _START_SECTION_ Quotes from Some Reasons Why We Oppose Votes For Women _START_PARAGRAPH_ "We believe that political equality will deprive us of special privileges hitherto accorded to us by law."_NEWLINE_"[We oppose suffrage] Because it means simply doubling the vote, and especially the undesirable and corrupt vote of our large cities." _NEWLINE_"[We oppose suffrage] Because our present duties fill up the whole measure of our time and ability, and are such as none but ourselves can perform." + _START_ARTICLE_ N4 road (Ghana) _START_SECTION_ Route _START_PARAGRAPH_ Major towns and cities along the route of the N4 include Madina, Adenta, Aburi, Mamfe, Koforidua, Asokore, and Bunso. _START_SECTION_ Greater Accra Region _START_PARAGRAPH_ The N4 begins at the Tetteh Quarshie Interchange in Accra and travels north, running by the University of Ghana at Legon. Continuing north to Madina, the N4 intersects with the R40 near the Madina Police Station and veers slightly northwest toward Oyarifa before entering the Eastern Region. _START_SECTION_ Eastern Region _START_PARAGRAPH_ In the Eastern Region, the N4 continues north to Aburi, where it intersects with the IR1 near the Aburi Botanical Gardens. The route continues north, intersecting with the R22 at Mamfe, then turns west through Saforo and Kwamoso before veering northwest at Adawso toward Koforidua. At Koforidua, the N4 intersects with the R42 near the Koforidua Technical University and continues north to Asokore, where it intersects with the R41. From Asokore, the N4 turns northwest through New Tafo before terminating at Bunso, where it intersects with the R60 and the N6, which continues north to Kumasi. + _START_ARTICLE_ Hussein Adan Isack _START_PARAGRAPH_ Hussein Adan Isack (born 1957) is a naturalist and ethnobiologist living in Kenya, and a former research scientist in ornithology. _START_SECTION_ Background and youth _START_PARAGRAPH_ Hussein Adan Isack was born to parents who lived in the pastoral Northern region of Kenya. Throughout his youth Hussein developed and cultivated a keen interest in bird watching and began keeping birds at his home._NEWLINE_His interest was further stimulated when he joined high school, where he met Paul Scholes, a biologist and an ardent birdwatcher from Liverpool. He reared falcons in school and would later become the leader of the wildlife clubs in his area.
During his holidays, he would spend time pursuing his passion. Later, he met Van Someren, an ornithologist with whom he worked at the National Museum of Kenya, paving the way for a career spanning over 30 years._NEWLINE_He was later awarded the Christopher Welch Scholarship for Natural Sciences at Oxford University, where he studied zoology and majored in behavioral ecology. It was during this time that Hussein travelled extensively across the world and met Heinz Ulrich Reyer, a zoologist from Zurich. The two would become close friends and colleagues. He then made frequent trips to the North alongside Heinz, studying the behavioural patterns of birds in that area, specifically the honeyguide, a bird revered for its ability to locate and direct locals towards honey in the remote desert. _START_SECTION_ Professional life _START_PARAGRAPH_ Having received his PhD in ornithology from Oxford, Hussein went back to Kenya, where he worked as a scientist at the National Museum of Kenya. He became the head of the Ornithology department and co-ordinated research activities in the region._NEWLINE_He is a founding board member of the Ewaso Ng'iro Development Authority, appointed by H.E. President Daniel Toroitich arap Moi and tasked with the responsibility of ensuring sustainable development of the water basin._NEWLINE_In 1991, he took part in the making of a documentary, The Trials of Life, with Sir David Attenborough, produced by the BBC and the Australian Broadcasting Service. It focused on the communication between humans and birds, specifically the honeyguide, on which Hussein was known for his expertise._NEWLINE_Currently based in Kenya, he founded and runs the Kivulini Heritage Trust, a non-governmental organization that seeks to preserve indigenous cultures and promote sustainable use of natural resources. + _START_ARTICLE_ Cloisters Cross _START_PARAGRAPH_ The Cloisters Cross, also referred to as the Bury St Edmunds Cross, is a complex 12th-century ivory Romanesque altar cross in The Cloisters, part of the Metropolitan Museum of Art in New York. The cross is carved from walrus ivory. _START_SECTION_ Description _START_PARAGRAPH_ The carvings, which cover both the front and back sides, include ninety-two intricately carved figures and ninety-eight inscriptions. The figures, each of which is only about one-half inch tall, illustrate a number of Biblical scenes and, on the back, a number of the Old Testament prophets with banderoles containing quotations from their books. There is debate over whether or not these inscriptions were chosen with anti-Semitic intent. The Metropolitan Museum of Art's website currently says: "Prominent among the inscriptions are several strong invectives against Jews. Though it is impossible to know precisely who commissioned this piece and with what aims, the cross certainly offers some indication of the anti-Semitism prevalent in England at this time. By the end of the thirteenth century, Jews were expelled from the country". This theme was developed in a book by Thomas Hoving, the curator involved when the Metropolitan acquired the cross, and later Director. This was unkindly described in an academic review by Elizabeth C. Parker and Charles T.
Little as "an autobiographical romance...written in Raymond Chandler style"._NEWLINE_Parker and Little disagree with Hoving and think that it is doubtful that the cross, a sophisticated theological object, was specifically designed for the purpose of either castigating or converting any member of the small Jewish population in England in the mid-twelfth century._NEWLINE_The sculptor is not known. Thomas Hoving, who managed the acquisition of the cross while he was associate curator at The Cloisters, concluded that it was carved by Master Hugo at Bury St Edmunds Abbey in Suffolk. However, beyond stylistic affinities there is no certain evidence to suggest that the cross was even made in England; although this is accepted by most scholars, other places of origin such as Germany have been proposed._NEWLINE_The history of the cross before it was acquired by Ante Topić Mimara (1898–1987) is unknown. He sold it to the Metropolitan Museum in 1963. The British Museum was also keen to buy the cross, but they eventually declined, because Topić Mimara steadfastly refused to provide proof that he had full title to sell the cross. Hoving reportedly sat up drinking coffee with Topić Mimara until after midnight on the night that the British Museum's option lapsed, and he purchased the cross immediately afterwards for £200,000. + _START_ARTICLE_ Štefan Ružička _START_SECTION_ Playing career _START_PARAGRAPH_ Ružička was drafted in the third round of the 2003 NHL Entry Draft by the Philadelphia Flyers and proceeded to the Canadian Hockey League to play for the Owen Sound Attack, of the Ontario Hockey League, under the direction of head coach Mike Stothers._NEWLINE_On September 14, 2015 he opted to take a hiatus from professional hockey in spite of being only 30 years old. Over a calendar year later, he returned to the professional ranks in securing a contract with HC Sparta Praha of the Czech Extraliga on September 4, 2016. + _START_ARTICLE_ Nikki Marshall _START_SECTION_ Early life _START_PARAGRAPH_ Born in Thornton, Colorado to Mike and Kelly Marshall, Nikki was raised with her younger sister, Shaye, in Mead, Colorado. She attended Skyline High School in the nearby city of Longmont and was the leading scorer for the soccer team. She was named All-State three times during her sophomore, junior and senior years and was a three-time Longmont Times-Call Soccer Player of the Year. Marshall finished her high school career with 100 goals and 38 assists. She graduated as the top scorer in the school's history. She scored 23 goals and served 10 assists during her senior year alone and was named 2006 Northern Conference Player of the Year, all-region soccer player of the year, and Skyline Falcon of the Year. Also a decorated track athlete at the school, she earned All-State honors in the 100 meter, 400 meter relay and long jump during her junior year and won the state championship in the 800 medley as a senior. _START_SECTION_ Colorado Buffaloes _START_PARAGRAPH_ Marshall currently holds seventeen school records for the University of Colorado, and is the all-time leading goal scorer for the Colorado women's soccer team. As a freshman, she led the Buffaloes' most prolific offensive season with seventeen goals, all the way to the Sweet Sixteen in the 2006 NCAA postseason. She was on Soccer Buzz's All-American Fourth Team, All-American Freshman Team, All-Central Region First Team, All-Central Region Freshman team, and Freshman of the Year. 
She was on the Freshman All-American Team in Soccer America and was a National Player of the Week in Soccer Times. Her Big 12 Conference awards in 2006 include Newcomer of the Year, First Team All-Big 12, All-Big 12 Newcomer Team, and Big 12 All-Tournament Team. The University of Colorado Athletic Department awarded her Female Athlete of the Year and Female Freshman of the Year._NEWLINE_In 2007, Marshall led the Buffaloes in scoring for the second straight year with nine goals, even though she played defender that year for Colorado. Before her sophomore season, Marshall was named to the Pre-season All-American Team by Soccer America and was ranked Pre-season All-Big 12 by the Big 12 Conference. She was second on the team in playing time, totaling 1,927 minutes on the season. She led the Buffs to a 10–8–4 record and another bid to the NCAA tournament. She ended the season with a First Team All-Big 12 award from the Big 12 Conference._NEWLINE_During her junior year, Marshall was moved back up to the striker position. She was ranked Pre-season All-Big 12 by the Big 12 Conference. After another successful season, Marshall was second on the team for goals scored, with eight goals. She led her team to second place in the Big 12 Tournament and to a bid to the NCAA Tournament. For the third year in a row, she received First Team All-Big 12 honors from the Big 12 Conference._NEWLINE_In 2009, Marshall was again ranked a Pre-season All-American by Soccer America. She was captain of the Buffaloes along with fellow senior Kara Linder. She led the Buffs in scoring with a total of eight goals on the season. She also led the team in shots taken, with 53. She finished the season with another First Team All-Big 12 accolade. _START_SECTION_ The WPS Years, 2010–11 _START_PARAGRAPH_ Marshall was the first draft pick (seventh overall) for the Washington Freedom in the 2010 WPS Draft. During the 2010 season, she started in all 24 of the team's regular season matches and scored three goals playing as a defender. The Freedom finished fourth during the regular season with an 8–7–9 record, earning a berth to the playoffs. During the playoff quarterfinals, the Freedom were defeated by the Philadelphia Independence 1–0._NEWLINE_Marshall remained with the club in 2011 when they relocated to Florida and became magicJack under new ownership. She played in eight games for magicJack before being traded to the Boston Breakers. She made seven appearances for the Breakers as the team finished fourth in the regular season with a 5–4–0 record. On August 17, 2011, the Breakers were defeated 3–1 by magicJack and eliminated from the playoffs. _START_SECTION_ Portland Thorns FC, 2013–2014 _START_PARAGRAPH_ In February 2013, Marshall signed with Portland Thorns FC for the inaugural season of the National Women's Soccer League. She was in the starting lineup as a defender in all of the Thorns' 22 games as the team finished third in regular season play and received a berth to the playoffs. During the team's playoff semi-final match, Marshall provided the assist on Tiffany Weimer's equalizer goal. The Thorns would eventually defeat FC Kansas City 3–2 in overtime. The Thorns then defeated the Western New York Flash in the playoff final, clinching the league's first championship title._NEWLINE_Marshall was waived by the Thorns during the post-season and picked up during the waiver draft by the Washington Spirit. A few months later, she was traded to the Seattle Reign FC. In December 2013, she was traded back to the Portland Thorns.
Thorns management clarified that Marshall had been put on waivers due to a cap issue. With the trades, she was signed to a new contract at a different salary for the 2014 season._NEWLINE_During the 2014 season, Marshall started in all 24 matches for the Thorns, playing 2,072 minutes. After finishing third during the regular season with a 10–8–6 record, the Thorns advanced to the playoffs, where they were defeated in the semifinals 2–0 by eventual champions FC Kansas City._NEWLINE_In August 2014, Marshall suffered an anterior cruciate ligament (ACL) tear during a match against the Seattle Reign. She announced her retirement in February 2015, citing low pay. _START_SECTION_ International _START_PARAGRAPH_ Marshall was a member of the United States under-20 women's national soccer team that competed in the 2008 FIFA U-20 Women's World Cup in Chile. Marshall and fellow central defender Lauren Fowlkes were the only two members of that squad to start in and play every minute of all six matches of the tournament; both were praised for their poised performance as the anchors of the team's defense, especially during the final game against North Korea. + _START_ARTICLE_ Roberto Calasso _START_SECTION_ Biography _START_PARAGRAPH_ Calasso was born in Florence in 1941, into a family of the Tuscan upper class, well connected with some of the great Italian intellectuals of their time. His maternal grandfather Ernesto Codignola was a professor of philosophy at Florence University. Codignola created a new publishing house called La Nuova Italia, in Florence, as his friend Benedetto Croce had done in Bari with Laterza. Calasso's uncle, Tristano Codignola, was a partisan during World War II who after the war joined the political life of the new republic, and was for a while Minister of Education. His mother Melisenda – who gave up an academic career to raise her three children – was a scholar of German literature, working on Hölderlin’s translations of the Greek poet Pindar. Calasso's father Francesco was a law professor, first at Florence University and then in Rome, where he eventually became dean of his faculty. He was arrested by the fascist militia after the assassination of Giovanni Gentile and sentenced to be killed in reprisal, but was saved by the intervention of friends of Gentile, with whom the family had connections on the maternal side, and of the German consul Gerhard Wolf._NEWLINE_At 12, Calasso met and was greatly influenced by Enzo Turolla, a professor at Padua University, and they became lifelong friends. In 1954 the family moved to Rome, where Calasso developed a passion for cinema. His doctoral dissertation, on Sir Thomas Browne's theory of hieroglyphs, was completed under Mario Praz while he indulged himself with hashish._NEWLINE_Calasso has worked for the publishing firm of Adelphi Edizioni since its founding by Roberto Bazlen in 1962 and became its Chairman in 1999. His books have been translated into most European languages._NEWLINE_He is the author of an unnamed ongoing work reflecting on the culture of modernity, which began with The Ruin of Kasch in 1983, a book admired by Italo Calvino. Dedicated to the French statesman Talleyrand, it was followed in 1988 by The Marriage of Cadmus and Harmony, in which the tale of Cadmus and his wife Harmonia becomes a pretext for re-telling the great tales of Greek mythology and reflecting on the reception of Greek culture for a contemporary readership.
Another world civilization is surveyed in Ka (1996), where the subject of the re-telling is Hindu mythology. K restricts the focus to a single author, Franz Kafka; this trend continues with Il rosa Tiepolo, inspired by an adjective used by Proust to describe a shade of pink used by Tiepolo in his paintings. With La folie Baudelaire, Calasso once more broadens his scope to depict a whole civilisation, that of Paris in the latter half of the 19th century, reconsidering the lives and works of the post-romantic generation of writers and artists from Baudelaire to Valéry. In his most recent work, Ardore (2010), the author returns to India for an exhaustive analysis of the theory and practice of Vedic sacrifice and its significance for post-modern epistemology._NEWLINE_His more narrowly focused essays relating to European modernity are collected in I quarantanove gradini (The Forty-nine Steps), addressed to Pierre Klossowski and his wife; Literature and the Gods (2002), based on his Weidenfeld Lectures at Oxford, on the decline and return of pagan imagery in the art of the West; and La follia che viene dalle ninfe (The Madness that Comes from the Nymphs), a collection of related essays ranging from Plato's Phaedrus to Nabokov's Lolita._NEWLINE_Along with his status as a major analyst specifically of the works of Kafka, Calasso has, more broadly, been active in many essays in retrieving and re-invigorating the notion of a Central European literary culture. He also serves as the president of the International Alexander Lernet-Holenia Society, which promotes the publication, translation and study of this multi-genre Austrian writer and his focus on the identity crisis of his characters at odds with postimperial Austria and Central Europe. _START_SECTION_ Reception _START_PARAGRAPH_ Terri Windling selected the English translation of The Marriage of Cadmus and Harmony as one of the best fantasy books of 1994, describing it as "a complex and intellectually dazzling novel using ancient Greek mythology to explore the origins of Western thought." + _START_ARTICLE_ Henry J. Rosner _START_SECTION_ Personal life _START_PARAGRAPH_ Rosner and his twin sister, Sally Miller, were the oldest of seven children. He married Sophie Kimels in December 1929. The couple, who had met at a Young People's Socialist League picnic, honeymooned in Russia, which they found to be a totalitarian dictatorship rather than the socialist utopia they had hoped to see. They later wrote a report to Norman Thomas about their experience of Russia. (see Barbara Seaman)_NEWLINE_Rosner and Kimels had three children: Barbara Seaman, Jeri Drucker, and Elaine Rosner-Jeria. After Kimels' death, Rosner married journalist Ruth Gruber in 1974. _START_SECTION_ Career _START_PARAGRAPH_ As Norman Thomas's policy researcher, Rosner helped write the socialist platform for the 1932 presidential race. Rosner contributed "The Myth of a Progressive Governor," a statistic-filled six-page tract blasting Franklin D. Roosevelt's failure to honor his promise to "remember the forgotten man at the bottom." On Roosevelt's position (or lack thereof) regarding the seven-day work week, Rosner wrote:_NEWLINE_While distinguished economists urge the five-day week as a solution for the unemployment problem, Roosevelt has done nothing to abolish the seven-day work week among New York transit employees, hotel and cafeteria workers, and elevator operators in apartment houses. The records of the N.Y.C.
Transit Commission reveal that there are 25,000 subway guards, platform men, street car conductors, motormen and bus drivers in New York City alone who work ten hours a day or more seven days a week. There are 25,000 hotel workers in New York City who never get a day off. Thousands of cafeteria workers and elevator operators are in the same predicament. The same conditions exist on the state payroll. Guards and attendants in state hospitals and state prisons work ten and twelve hours a day seven days a week. Watchmen, lock tenders, and bridge workers in the state department of public works are also denied one day of rest in seven._NEWLINE__NEWLINE_It would be a simple matter to amend that section of the N.Y. labor law so as to give all workers in New York this protection. At the request of the City Affairs Committee such an amendment was introduced at the 1932 session of the Legislature. The bill was never reported out of committee or given a public hearing. Communications were sent to the Governor, acquainting him with the facts and requesting his support, but he did not make any effort to compel action from the legislature. In his gubernatorial messages to the legislature, he never mentioned the abolition of the seven-day week." _NEWLINE_The following year, Rosner was co-editor of the 1933 handbook of the New York Socialist Party._NEWLINE_After trouncing the Socialists and the Republicans in the 1932 election, Roosevelt met with Norman Thomas, Henry Rosner, and other members of the Socialist Party. As president, Roosevelt took on many of the social issues Rosner had criticized him for ignoring during his years as governor of New York._NEWLINE_In this respect, Rosner played an important, though low-key, role as an early proponent of New Deal programs._NEWLINE_As fiscal officer for welfare for New York City, Rosner served under all New York City mayors from Fiorello LaGuardia through Abraham Beame. He retired as Assistant Administrator of the New York City Human Resources Administration in 1975._NEWLINE_Rosner contributed controversial and influential articles to The Nation magazine and other political periodicals. + _START_ARTICLE_ Plazenica _START_PARAGRAPH_ Plazenica is a mountain in the municipality of Kupres, Bosnia and Herzegovina. It has an altitude of 1,765 metres (5,791 ft). + _START_ARTICLE_ Error guessing _START_PARAGRAPH_ In software testing, error guessing is a test method in which test cases used to find bugs in programs are established based on experience in prior testing. The scope of the test cases usually relies on the software tester involved, who uses past experience and intuition to determine what situations commonly cause software failure, or may cause errors to appear. Typical errors include divide by zero, null pointers, or invalid parameters._NEWLINE_Error guessing has no explicit rules for testing; test cases can be designed depending on the situation, either drawing from functional documents or when an unexpected/undocumented error is found while testing operations. + _START_ARTICLE_ David Jay _START_SECTION_ Activism _START_PARAGRAPH_ Frustrated with the lack of resources available regarding asexuality, Jay launched AVEN's website in 2001.
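As an aside to the error guessing entry above, the following is a minimal, hypothetical sketch in Python using pytest of what experience-based test cases might look like. The function `parse_ratio` and the specific guessed inputs are assumptions invented for this example; the cases mirror the typical errors listed in that entry (divide by zero, null values, invalid parameters).

```python
# Minimal error-guessing sketch (hypothetical example, not from any real project).
import pytest


def parse_ratio(numerator, denominator):
    """Hypothetical function under test: returns numerator / denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# Test cases guessed from experience with inputs that commonly break code.
@pytest.mark.parametrize("num, den", [
    (1, 0),       # divide by zero
    (None, 2),    # null / missing value
    ("1", 2),     # invalid parameter type
])
def test_error_guessing_cases(num, den):
    # Error guessing expects these inputs to fail in a controlled way, so the
    # test asserts that an exception is raised rather than allowing a crash
    # or a silently wrong result.
    with pytest.raises((ValueError, TypeError)):
        parse_ratio(num, den)
```

Unlike systematic techniques such as equivalence partitioning, the value of such cases depends entirely on the tester's experience of where defects have appeared before.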
Since then, he has taken a leading role in the asexuality movement, appearing on multiple television shows and being featured heavily in Arts Engine's 2011 documentary (A)sexual._NEWLINE_AVEN, which Salon.com referred to as the "unofficial online headquarters" of the asexuality movement, is widely recognised as the largest online asexual community. Its two main goals are to create public acceptance and discussion about asexuality and to facilitate the growth of a large online asexual community. As of June 17, 2013, AVEN had nearly 70,000 registered members._NEWLINE_In New York City, working both with the Department of Education and private organizations, he has been providing training on ace inclusion to health educators._NEWLINE_He has a vision for a post-sex world, one that asks all of us to work on building a more empathetic, intimate society that celebrates any kind of close human relationship, whether or not it involves sex. _START_SECTION_ Personal life _START_PARAGRAPH_ Jay is from St. Louis, Missouri, and he graduated from Crossroads College Preparatory School in 2000. At the age of 15, Jay began considering himself asexual, and he came out as asexual while a student at Wesleyan University in Connecticut. + _START_ARTICLE_ Lynn Norment _START_SECTION_ Personal _START_PARAGRAPH_ Norment was born the third of nine children. Norment's mother Ester worked as a licensed practical nurse. Her father Alex Norment owned a local repair shop, which was named Norment's Radio and TV. While in elementary school, Norment attended an all-black, segregated school known as Bolivar Industrial Elementary. She then went to vocational school, where she became a member of the school newspaper and the Beta Club. In 1969, Tennessee offered African Americans in Bolivar the opportunity to transfer to the mostly white Bolivar High School. Norment was among the few African Americans who helped integrate the school; she graduated in 1970._NEWLINE_Lynn Norment is an alumna of Memphis State University, where she received a Bachelor of Arts degree in journalism. In college, Norment was an intern for The Commercial Appeal, a newspaper in Memphis, Tennessee._NEWLINE_She resides in the South Loop area of Chicago, Illinois. _START_SECTION_ Career _START_PARAGRAPH_ Later, Norment traveled north to Chicago and worked as a freelance writer for Ebony Magazine. Norment has worked with a number of celebrities, athletes and public figures including Denzel Washington, Barack Obama, Whitney Houston, Steve Harvey, Will Smith, and Michael Jordan. She became the managing editor of Ebony._NEWLINE_Norment has also held different leadership roles for the National Association of Black Journalists, including being chairperson for the Convention in Chicago held in 1977._NEWLINE_She is a board member of Habilitative Inc. She operates programs for residents in need on the West Side of Chicago. Norment has taught college courses at Columbia College Chicago, and mentors young journalists. Norment has launched a company that offers media relations and editorial services to individuals as well as agencies and corporations. _START_SECTION_ Notable works _START_PARAGRAPH_ Lynn Norment is most recognized for her 30 years of writing for Ebony Magazine. Norment has written a wide range of stories on subjects such as religion, business, relationships, social issues and lifestyle.
_START_SECTION_ Context _START_PARAGRAPH_ While growing up in Bolivar, Tennessee, Lynn Norment went to a segregated school; the town had one school built specifically for African Americans and another built for White Americans. Segregation formally began with the passing of Jim Crow laws following the end of the Reconstruction Era in 1877. Those laws prevented blacks, and later Mexican Americans and Native Americans, from going to the same schools as white individuals, and affected other public spaces such as churches, bathrooms, movie theaters, etc. However, in 1969 racial integration in Tennessee schools allowed the African American community to transfer to the mostly white schools. Norment was among the students who helped desegregate the high school._NEWLINE_Later, she moved north to Chicago and began working for Ebony Magazine. The magazine was founded in 1947 by John H. Johnson in Chicago. It is a monthly magazine for the African American community. The magazine has always brought up African American issues and interests while remaining positive despite the negative events of the time. For years, ads were created specifically for Ebony, featuring black models and advertising black-owned products. + _START_ARTICLE_ Come Darkness, Come Light: Twelve Songs of Christmas _START_SECTION_ Background _START_PARAGRAPH_ Come Darkness Come Light: Twelve Songs of Christmas contained twelve holiday-themed songs, six of which were written or co-written by Carpenter. The six additional tracks consisted of rare traditional holiday songs. The album features collaborations with John Jennings, the producer of many of Carpenter's previous albums. Come Darkness, Come Light includes covers of songs by Robin and Linda Williams, Tommy Thompson, and composer John Rutter. The opening track "Once in Royal David's City" was originally performed during the Festival of Nine Lessons and Carols in Cambridge, England, which Carpenter states she listens to every Christmas. Mark Deming of Allmusic thought that the album focused more on the "thoughtful and spiritual side of the season", while Scott Sexton of About.com said that the album's arrangement evoked "a calming vibe that is perfect for any holiday event"._NEWLINE_In an interview with Country Music Television in late 2008, Carpenter explained that she and producer John Jennings tried to create a more solemn approach to the record, without the use of symphonies or orchestras. In the interview, Carpenter commented that she wanted to keep the focus of the album "spare" and make its sound more acoustic._NEWLINE__NEWLINE_"That was the whole point from the beginning -- to make a real acoustic record. Whatever instruments we thought might add a texture or color, John was able to provide himself. We brought in Jon Carroll, my longtime keyboard player. He is so gifted, and he really did the heavy lifting on the piano, but John was able to fill in where it was needed. At most, there were three people in the room, but mostly it was me and John. It's really fun to do that. You feel like you're wacky scientists, late at night in the lab, experimenting to your heart's content." + _START_ARTICLE_ Salut d'Amour _START_SECTION_ History _START_PARAGRAPH_ Elgar finished the piece in July 1888, when he was romantically involved with Caroline Alice Roberts, and he called it "Liebesgruss" ('Love's Greeting') because of Miss Roberts' fluency in German.
On their engagement she had already presented him with a poem, "The Wind at Dawn", which he set to music and, when he returned home to London on 22 September from a holiday at the house of his friend Dr. Charles Buck in Settle, he gave her Salut d'Amour as an engagement present. _NEWLINE_The dedication was in French: "à Carice". "Carice" was a combination of his wife's names Caroline Alice, and was the name to be given to their daughter born two years later._NEWLINE_It was not published by Schott & Co., a German publisher with offices in Mainz, London, Paris and Brussels, until a year later, and the first editions were for violin and piano, piano solo, cello and piano, and for small orchestra. Few copies were sold until Schott changed the title to "Salut d'Amour" with Liebesgruss as a sub-title, and the composer's name as 'Ed. Elgar'. The French title, Elgar realised, would help the work to be sold not only in France but in other European countries. _NEWLINE_The first public performance was of the orchestral version, at a Crystal Palace concert on 11 November 1889, conducted by August Manns. The first recording of that version was made in 1915 for The Gramophone Company with an orchestra conducted by the composer. As a violin-and-piano piece Salut d'Amour had been recorded for The Gramophone & Typewriter Ltd (predecessor to The Gramophone Company) as early as 1901 by Jacques Jacobs, leader/director of the Trocadero Restaurant orchestra. Auguste van Biene recorded a cello transcription in 1907. _START_SECTION_ Legacy _START_PARAGRAPH_ "Salut d'amour" is one of Elgar's best-known works and has inspired numerous arrangements for widely varying instrumental combinations. It was even arranged as a song, "Woo thou, Sweet Music", with words by A. C. Bunten._NEWLINE_The tune, in E major, was included in the 2015 video game Fallout 4 as part of its "Classical Radio Station" songs. + _START_ARTICLE_ 1976 Colgate International _START_SECTION_ Doubles _START_PARAGRAPH_ Chris Evert / Martina Navratilova defeated Olga Morozova / Virginia Wade 6–4, 1–1 divided due to rain + _START_ARTICLE_ Victoria Cross for New Zealand _START_SECTION_ Victoria Cross _START_PARAGRAPH_ The original Victoria Cross was created by Queen Victoria in 1856 to recognise incidents of gallantry that were unconnected with a man's lengthy or meritorious service. She signed a Royal Warrant on 29 January 1856 that officially instituted the VC. The order was retroactive to 1854 to recognise acts of valour during the Crimean War._NEWLINE_The Australian and New Zealand Victoria Crosses are made from the same gunmetal as the originals. It was originally intended that the VCs would be cast from the bronze cascabels of two cannon that were captured from the Russians at the siege of Sevastopol. The historian John Glanfield has since shown that the metal used for VCs is in fact from Chinese cannon, not Russian, and their origin is a mystery._NEWLINE_The barrels of the cannon in question are stationed outside the Officers' Mess at the Royal Artillery Barracks at Woolwich. The remaining portion of the only surviving cascabel, weighing 10 kilograms (385 oz), is stored in a vault maintained by 15 Regiment Royal Logistic Corps at MoD Donnington. It can only be removed under armed guard. It is estimated that approximately 80 to 85 more VCs could be cast from this source. A single company of jewellers, Hancocks of London, has been responsible for the production of every VC.
_START_SECTION_ Appearance _START_PARAGRAPH_ The Victoria Cross for New Zealand is identical to the original design. The decoration is a cross pattée, 41 millimetres (1.6 in) high, 36 millimetres (1.4 in) wide, bearing a crown surmounted by a lion, and the inscription FOR VALOUR. This was originally to have been FOR BRAVERY, until it was changed on the recommendation of Queen Victoria, who thought some might erroneously consider that only the recipients of the VC were brave in battle. The decoration, suspension bar and link weigh about 27 grams (0.87 troy ounces)._NEWLINE_The cross is suspended by a ring from a seriffed "V" to a bar ornamented with laurel leaves, through which the ribbon passes. The reverse of the suspension bar is engraved with the recipient's name, rank, number and unit. On the reverse of the medal is a circular panel on which the date of the act for which it was awarded is engraved in the centre. The ribbon is crimson, 38 millimetres (1.5 inches) wide. Although the warrants state the colour as being red it is described by most commentators as being crimson or "wine-red". + _START_ARTICLE_ Indiana State Road 357 _START_SECTION_ Route description _START_PARAGRAPH_ State Road 357 starts at State Road 64, which is also Morton Street, on the south side of town. It runs north for about 200 feet, then veers to the north-northeast and runs parallel with the railroad track about 100 feet to the east. Upon reaching Mill Street, the road veers back to the north and proceeds to its northern terminus at State Road 57 at the north edge of town. It is concurrent with Main Street over its entire length. + _START_ARTICLE_ George Geldorp _START_PARAGRAPH_ George Geldorp, Georg Geldorp or Jorge Geldorp (1580/1595, Cologne – 4 November 1665, London) was a Flemish painter who was mainly active in England where he was known for his portraits and history paintings. He was also active as an art dealer and impresario. _START_SECTION_ Life _START_PARAGRAPH_ Geldorp was the son of the Flemish portrait painter Gortzius Geldorp who lived and worked in Cologne. Geldorp first trained and worked as a painter in Cologne before being admitted as a Master in the Guild of Saint Luke in Antwerp in 1610. Two years later his first wife Margriet Parmentiers died in Antwerp._NEWLINE_In 1623, Geldorp moved to London where he painted a number of portraits in the Anglo-Netherlandish style, notably of William Cecil, 2nd Earl of Salisbury and his wife Catherine in 1626 (Hatfield House, Hertfordshire) and of Sir Arthur Ingram in late 1638/early 1639._NEWLINE_He was involved in organizing commissions in England for Flemish and Dutch artists including Rubens, Anthony van Dyck and Peter Lely. Upon the Restoration, he assisted with the reconstitution of the art collection and possessions of the English Royal family and was rewarded for his services with the post of picture mender and cleaner to the King._NEWLINE_He was the teacher of Isaac Sailmaker. _START_SECTION_ Work _START_PARAGRAPH_ George Geldorp was a portrait specialist. His portraits are regarded as less accomplished and more stiffly articulated than those of contemporary painters active in London such as Daniel Mijtens. The surfaces of his paintings are decorative. 
The background of the Portrait of William Cecil, 2nd Earl of Salisbury contains an historically important view of Hatfield House with sportsmen in the foreground._NEWLINE_Geldorp was also active as a collaborator and copyist of Anthony van Dyck and later Peter Lely._NEWLINE_The Dutch biographer Arnold Houbraken reported that Geldorp was known to the artist biographer Joachim von Sandrart. Von Sandrart had written that Geldorp was not a very accomplished draughtsman and had the habit of tracing others' sketches, pricking holes in these sketches, and sponging through them onto the canvas as a guide for painting his subjects. Houbraken disapproved of this practice and wrote that he preferred to write about painters who were good draughtsmen. + _START_ARTICLE_ A381 road _START_SECTION_ Route _START_PARAGRAPH_ The A381 starts in Teignmouth from a junction with the A379 at Shaldon Bridge, following the Teign Estuary to Kingsteignton, where it overlaps the A380 to cross the River Teign. At the Penn Inn Roundabout it then continues west on a short dual carriageway into central Newton Abbot and southwest to Totnes. Here it overlaps the A385 to cross the River Dart and the main London-Penzance railway line._NEWLINE_From a junction on the west of Totnes it rises southwards into the South Hams. This section of the road is an important link to the national road network for the town of Dartmouth (served by the A3122) as the alternative A379 via Torbay is reliant upon the Dartmouth Higher Ferry with its associated fares and peak-time queues. As the road approaches Kingsbridge it enters the South Devon Area of Outstanding Natural Beauty and skirts around the edge of the town, overlapping for a short distance with the A379 road before finally turning south to Salcombe. An identically-numbered spur from this road turns back eastward to Kingsbridge. _START_SECTION_ History _START_PARAGRAPH_ The constant pressure of traffic through the narrow streets of Totnes town centre prompted the construction of the Western Bypass around the edge of the town, together with a second crossing of the River Dart at Brutus Bridge in 1982. The tight-knit nature of the town centre's development, quickly thinning to countryside, meant that relatively few buildings needed demolition to facilitate construction of the new road. _NEWLINE_The route through and around Kingsbridge was redrawn twice, in 1991 and 2006: first the section northwest of Kingsbridge was downgraded to B-road status, leaving a gap in the route, and the route was subsequently diverted to the former course of the B3197 around the west side of the town, leaving the original section through West Alvington as a spur of the new road. _NEWLINE_As a rural main road, the A381 has been the scene of multiple accidents. During 2008–2010 there were three fatal accidents on the section from Totnes to Halwell, prompting Devon County Council to implement a Casualty Severity Reduction Scheme, improving road markings and signage. _NEWLINE_On Sunday 20 May 2012 a 0.7-mile (1.1 km) section of the road through Totnes was part of the Olympic Torch procession for the London 2012 Olympics.
+ _START_ARTICLE_ United States House Select Committee on Alleged Abstraction of Books from the Library of the House _START_PARAGRAPH_ The Committee on Alleged Abstraction of Books from the Library of the House was a select committee of the United States House of Representatives that existed from February 14–28, 1861._NEWLINE_The committee was charged with investigating rumors that several members of the House of Representatives from states that had seceded from the United States had taken books from the House Library for personal use, or, alternatively, to help start a congressional library for the Confederate States of America. The allegations were first made public by The New York Times in an article published February 13, 1861, that accused Members of Congress of taking "some of the most valuable volumes in the collection"._NEWLINE_The select committee's investigation determined those rumors to be in error, finding that the supposedly missing books had simply not been properly credited back to the representatives' accounts after being returned. + _START_ARTICLE_ Shazam (application) _START_SECTION_ Overview _START_PARAGRAPH_ Shazam identifies songs using an audio fingerprint derived from a time-frequency graph called a spectrogram. It uses a smartphone or computer's built-in microphone to gather a brief sample of audio being played. Shazam stores a catalogue of audio fingerprints in a database. The user tags a song for 10 seconds and the application creates an audio fingerprint. Shazam works by analyzing the captured sound and seeking a match based on an acoustic fingerprint in a database of millions of songs. If it finds a match, it sends information such as the artist, song title, and album back to the user. Some implementations of Shazam incorporate relevant links to services such as iTunes, Spotify, YouTube, or Groove Music._NEWLINE_Shazam can identify music being played from any source, provided that the background noise level is not high enough to prevent an acoustic fingerprint being taken, and that the song is present in the software's database._NEWLINE_As well as the free app, the company has released a paid app called Shazam Encore. In September 2012, the service was expanded to enable TV users in the US to identify featured music, access cast information, and get links to show information online, as well as adding social networking capabilities._NEWLINE_Shazam redesigned its app in 2014 and added new features. _START_SECTION_ Compatible devices _START_PARAGRAPH_ Shazam runs on Android, iOS (including Apple Watch), BlackBerry OS, and Windows Phone systems. Shazam is also available for Mac as a desktop application that, when enabled, runs in the background and automatically recognises any song played on or near the computer. Apple's launch of iOS 8 in September 2014 came with the integration of Shazam into Apple's Siri function. _START_SECTION_ History _START_PARAGRAPH_ The company was founded in 1999 by Barton and Inghelbrecht, who were students at the University of California, Berkeley, and Mukherjee, who worked at a London-based internet consulting firm called Viant. In need of a digital signal processing specialist, the founding team then hired Wang, who had received his PhD from Stanford University. It first made a profit in 2016, 17 years after launch.
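As a rough, non-authoritative sketch of the spectrogram-based fingerprinting idea described in the Shazam overview above: the clip is converted into a spectrogram, the strongest frequency peaks are kept, and pairs of nearby peaks are hashed so a short sample can be matched against a database by hash lookup. This is not Shazam's actual code; the function name, parameter values, and hashing scheme below are assumptions chosen for readability.

```python
# Illustrative spectrogram-peak fingerprinting sketch (assumed parameters).
import numpy as np
from scipy import signal


def fingerprint(samples: np.ndarray, sample_rate: int = 44100):
    """Return a set of (hash, time_offset) pairs for a mono audio clip."""
    # 1. Time-frequency graph (spectrogram) of the captured audio.
    freqs, times, spec = signal.spectrogram(samples, fs=sample_rate, nperseg=4096)

    # 2. Keep only the strongest frequency bin in each time slice; these
    #    peaks are relatively robust to background noise.
    peak_bins = spec.argmax(axis=0)

    # 3. Hash pairs of nearby peaks together with their time delta so a
    #    10-second sample can be matched against a large database.
    fan_out = 5  # how many subsequent peaks each peak is paired with (assumed)
    prints = set()
    for i in range(len(peak_bins)):
        for j in range(1, fan_out + 1):
            if i + j < len(peak_bins):
                f1, f2 = int(peak_bins[i]), int(peak_bins[i + j])
                dt_ms = int(round((times[i + j] - times[i]) * 1000))
                prints.add((hash((f1, f2, dt_ms)), float(times[i])))
    return prints
```

Matching would then look up each hash in a database of fingerprints computed from known tracks and pick the song whose matching hashes agree on a consistent time offset.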
_START_SECTION_ 2002–2006: Early days of the service _START_PARAGRAPH_ Initially, in 2002, the service was launched only in the UK and was known as "2580", as the number was the shortcode that customers dialled from their mobile phone to get music recognised. The phone would automatically hang up after 30 seconds. A result was then sent to the user in the form of a text message containing the song title and artist name. At a later date, the service also began to add hyperlinks in the text message to allow the user to download the song online._NEWLINE_Shazam launched in the US on the AT&T Wireless network in 2004 in a joint offering with Musicphone, a now defunct San Francisco-based company. The service was free at launch with AT&T saying that it would charge $0.99 for each use in future._NEWLINE_In 2006, users were charged £0.60 per call or had unlimited use for £4.50 a month, as well as an online service to keep track of all tags. _START_SECTION_ 2006–2017: Smartphone app and expansion _START_PARAGRAPH_ Shazam for iPhone debuted on 10 July 2008, with the launch of Apple's App Store. The free app enabled users to launch iTunes and buy the song directly, although the service struggled to identify classical music._NEWLINE_Shazam launched on the Android platform later that year, and on the Windows Mobile Marketplace a year later. Encore first appeared for iPhone in November 2009._NEWLINE_By December 2009, Shazam was downloaded 10 million times in 150 countries across 350 mobile operators. Around eight percent of users purchased a track after it was identified by the service. Its success led to a funding round from Kleiner Perkins Caufield & Byers in October 2009. In January 2011, Apple announced that Shazam was the fourth most downloaded free app of all time on the App Store, while rival SoundHound had the top paid iPad app._NEWLINE_In August 2012, Shazam announced the service had been used to tag five billion songs, TV shows and advertisements. In addition, Shazam claimed to have over 225 million users across 200 countries. A month later, the service claimed to have more than 250 million users with 2 million active users per week. The Shazam app currently has more than 100 million monthly active users and has been used on more than 500 million mobile devices. In October 2014, Shazam announced its technology has been used to identify 15 billion songs._NEWLINE_The Shazam app was listed among Techland's 50 Best Android Applications for 2013._NEWLINE_In August 2014, Shazam announced there would be no more updates for Shazam(RED) after 7 August. _NEWLINE_Apple's launch of iOS 8 in September 2014 came with the seamless integration of Shazam into Apple's intelligent personal assistant Siri function._NEWLINE_In February 2013, Shazam announced a partnership with the music store Beatport, adding its library of electronic music to the service. On 3 April 2013, Shazam announced an exclusive partnership with Saavn, an Indian online music streaming service. The deal added nearly 1 million songs in Indian languages to Shazam's database. In July 2014, Shazam announced a partnership with Rdio that allows Shazam users to stream full songs within the app._NEWLINE__NEWLINE_Rich Riley joined Shazam as CEO in April 2013 to increase the company's growth, after over 13 years at Yahoo! and with more than 17 years of experience as an entrepreneur and internet executive. 
"I look forward to extending our dominance in media engagement, from our roots in music to our leadership position in second-screen TV and want to ensure that Shazam is the company that helps people recognize and engage with the world around them", Riley said at the time. Riley replaced Andrew Fisher, who was hired from Infospace into the CEO role in 2005 to strengthen industry partnerships and grow the userbase. Fisher is now executive chairman._NEWLINE_In addition to music, Shazam has announced collaborations with partners across television, advertising and cinema. In May 2014, National CineMedia announced a partnership with Shazam to incorporate Shazam into FirstLook pre-show segments that run in Regal, AMC and Cinemark theatres. In November 2014, NCM and Shazam announced that NCM FirstLook pre-shows are now Shazam enabled on over 20,000 movie screens across the United States._NEWLINE_In August 2014, Shazam announced the launch of Resonate, a sales product that allows TV networks to access its technology and user base. The news included the announcement of partnerships with AMC, A&E, Dick Clark Productions and Fuse._NEWLINE_Shazam recently announced a partnership with Sun Broadcast Group on Shazam for Radio, a new offering that will allow radio stations to push customised content to listeners on Sun Broadcast's over 8,000 radio stations in the US._NEWLINE_In December 2016, Shazam announced a partnership with Snapchat. The new feature comes as part of the latest Snapchat update and integration with Shazam, which allows Snapchat users to use Shazam's music recognition technology by pressing and holding the camera screen. _START_SECTION_ 2017–present: subsidiary of Apple _START_PARAGRAPH_ In December 2017, Apple announced it would be acquiring Shazam for a reported $400 million (£300 million). On 23 April 2018, the European Commission stated that it would be reviewing the acquisition. The European Commission approved the acquisition on 6 September 2018 and the deal was completed on 24 September 2018. _START_SECTION_ Funding _START_PARAGRAPH_ As of September 2012, Shazam had raised US$32 million in funding. In July 2013, Carlos Slim invested US$40 million in Shazam for an undisclosed share. In March 2014, Shazam confirmed another US$20 million in new funding, raising the total value of the company to US$500 million. The company's earlier backers include European venture capital firm DN Capital, which invested in Shazam in 2004. _START_SECTION_ Patent infringement lawsuit _START_PARAGRAPH_ In May 2009, Tune Hunter accused Shazam of violating U.S. Patent 6,941,275, which covers music identification and purchase in a portable device. Shazam settled the case in January 2010. + _START_ARTICLE_ International scale of river difficulty _START_SECTION_ Caution in application _START_PARAGRAPH_ Classifications can vary enormously, depending on the skill level and experience of the paddlers who rated the river. For example, at the 1999 International Conference on Outdoor Recreation and Education, an author of a paddling guide pointed out that there is too much variation in what is covered by the Class I designation, and proposed making further distinctions within the Class I flat water designations and Class I+ moving water designations, with the goal of providing better information for canoeists, instructors leading trips, and families with young children._NEWLINE_The grade of a river or rapid is likely to change along with the level of the water. 
High water usually makes rapids more difficult and dangerous, although some rapids may be easier at high flows because features are covered or washed out. At spate/flood stage, even rapids which are usually easy can contain lethal and unpredictable hazards. Conversely, some rapids may be easier with lower water levels when dangerous hydraulics become easier to manage. Some rivers with high volumes of fast moving water may require little maneuvering, but will pose serious risk of injury or death in the event of a capsize. + _START_ARTICLE_ Zambian kwacha _START_SECTION_ Etymology _START_PARAGRAPH_ The name kwacha derives from the Nyanja, Bemba, and Tonga word for "dawn", alluding to the Zambian nationalist slogan of a "new dawn of freedom". The name ngwee translates as "bright" in the Nyanja language. _START_SECTION_ History _START_PARAGRAPH_ Prior to independence, the Rhodesia and Nyasaland pound was the legal tender of the short-lived British protectorate of Northern Rhodesia. Banknotes of 10 shillings, 1, 5, and 10 pounds issued by the Central Africa Currency Board were in circulation, together with coins of ½, 1, 3, 6 pence, and 1, 2, 2½, and 5 shillings. After independence, the Bank of Zambia issued the first Zambian currency, the Zambian pound, in 1964. The issued paper bills and coins were of similar denominations to those used before independence, except for the 10 pounds note, which was never issued by the Bank of Zambia. A new design depicting the newly independent country's history and struggle was adopted. The two currencies, the Rhodesia and Nyasaland pound and the Zambian pound, were allowed to circulate in parallel until December 15, 1965, when the Rhodesia and Nyasaland pound bills and coins were withdrawn from circulation, except for the 3 pence coin, which was allowed to circulate alongside its Zambian alternative for a brief period._NEWLINE_On July 1, 1966, the parliament approved the arrangements for a decimal currency system (Act 40 of 1966). The government voted in favor of decimalisation, changing the main currency unit to the kwacha, with one kwacha being equal to 100 ngwee. The exchange rate was set at one kwacha to ten Zambian shillings, or one half of a Zambian pound. Thus, by January 16, 1968, all Zambian pound bills and coins were removed from circulation and replaced by the new kwacha bills and ngwee coins. The Zambian pound bills of 10 shillings, 1, and 5 pounds were changed into 1, 2 and 10 kwacha respectively; a bill of 50 ngwee was issued to replace the old 5 shillings coin, alongside a new bill of 20 kwacha. Ngwee coins with denominations of 1, 2, 5, 10, and 20 ngwee replaced the existing 1, 3, 6 pence, 1, and 2 shillings coins respectively. The Zambian pound notes and coins ceased to be legal tender on January 31, 1974._NEWLINE_Initially, the kwacha was pegged to the pound at a fixed rate of 1.7094 kwacha per 1 pound. After the devaluation of the dollar on August 15, 1971, however, Zambia broke its currency's ties to the British monetary unit and pegged the kwacha to the US dollar. These reforms resulted in a reduction of the kwacha's gold parity by 7.8%.
A few months later, the British Chancellor of the Exchequer, Anthony Barber, announced the end of the sterling area and the flotation of the pound sterling, causing Zambia to renounce the monetary privileges it had once enjoyed as a member state._NEWLINE_Over the years, the Zambian currency suffered high rates of inflation, forcing the Bank of Zambia to introduce high-value denominations in 2003, including 20,000 and 50,000 kwacha bills, to facilitate transactions. In 2013, a new, redenominated kwacha was introduced. _START_SECTION_ Banknotes _START_PARAGRAPH_ The Zambian kwacha was first issued in 1968 to replace the Zambian pound. The design of the kwacha bills changed over time, and different bills were introduced into or withdrawn from circulation. Seven emissions of the first kwacha are known to exist, while only one emission of the second kwacha has been issued; it was introduced into circulation on January 1, 2013, and has remained unchanged in design and security features since then. Each emission shares similar general design features across all the banknotes, with slight changes in the colors and the activity-based theme on the reverse of the banknotes. _START_SECTION_ New Kwacha (2012 series) _START_PARAGRAPH_ On January 23, 2012, the Bank of Zambia proposed certain measures regarding the redenomination of the Zambian kwacha. The recommendations were initially approved by the government as one of the measures required to address the costs associated with the continuous devaluation of the national currency, the direct result of several years of high inflation that characterized the national economy during the late 20th and early 21st centuries. The recommendations were assented to by the parliament on November 3, 2012, and the Re-Denomination of Currency Act (Act 8 of 2012) was enacted on December 3, 2012._NEWLINE_The old currency unit was divided by 1,000, removing three zeros from the preexisting K50,000, K20,000, K10,000, K5,000, and K1,000 banknotes. The lower denominations of K500, K100, and K50 were also divided by 1,000 and were changed into the 50, 10, and 5 ngwee coins respectively. On the other hand, the preexisting K20 banknote was removed from circulation due to its extremely low purchasing power._NEWLINE_The Bank of Zambia announced January 1, 2013, as the changeover date. On the same day, the new redenominated currency became the legal tender of Zambia. The old and new currencies were allowed to circulate side-by-side for a transition period of six months, until 30 June 2013. During this period, the old currency was denoted by 'K', whilst the new one was denoted by 'KR'. After the six-month period, the 'KR' symbol was dropped, and the new currency was referred to by the 'K' symbol._NEWLINE_By June 26, 2013, the Bank of Zambia had withdrawn 3.7 trillion kwacha in old banknotes, amounting to about 95.3% of the banknotes in circulation. Although the old currency ceased to be legal tender four days later, the Bank of Zambia Deputy Governor announced that residents who were still holding the old currency, especially those living in rural areas, could still exchange it for the new one through commercial banks and other designated agents. + _START_ARTICLE_ Transportin 1 _START_SECTION_ Function _START_PARAGRAPH_ This protein is a karyopherin which interacts with nuclear localization sequences to target nuclear proteins to the nucleus.
The classical karyopherin receptor complex, such as the complex that uses Importin-β1 (encoded by the gene KPNB1), is a heterodimer of an alpha subunit which recognizes the nuclear localization signal and a beta subunit which docks the complex at nucleoporins. However, Transportin-1 can directly bind to its cargo proteins and may not need an importin alpha subunit to do so._NEWLINE_Transportin-1 is thought to use the same principal mechanism to carry out nuclear transport as other importins. It mediates docking to the nuclear pore complex through binding to nucleoporins and is subsequently translocated through the pore by an energy-requiring mechanism. Then, in the nucleus, Ran binds to Transportin-1, the cargo dissociates, and Transportin-1 is re-exported from the nucleus to the cytoplasm, where GTP hydrolysis releases Ran. Transportin-1 is then free to bind new cargo._NEWLINE_In addition, Transportin-1 is implicated in helping transport proteins into the primary cilium. Its function in this case is thought to be similar to carrying proteins into the nucleus through a nuclear pore: Transportin-1 binds cargo and then helps it pass through a pore at the base of the cilium. Ran and nucleoporins are also implicated in this mechanism._NEWLINE_Alternate splicing of this gene results in two transcript variants encoding different proteins. _START_SECTION_ Targets _START_PARAGRAPH_ Transportin 1 (TRN1) is part of the non-classical nuclear import pathway. In conjunction with the RanGTP hydrolysis cascade, TRN1 acts to import a selection of proteins into the nucleus of cells. These targets typically contain a PY-motif, otherwise known as an M9 nuclear localisation signal. Well-described examples include hnRNP A1._NEWLINE_The types of cargo proteins that Transportin 1 can carry into the nucleus include RNA-binding proteins (such as hnRNP A1 and hnRNP F) as well as ribosomal proteins. _START_SECTION_ Clinical Significance _START_PARAGRAPH_ TRN1 has been implicated in the pathogenesis of two neurodegenerative diseases, namely amyotrophic lateral sclerosis and frontotemporal dementia. + _START_ARTICLE_ Embedded feminism _START_PARAGRAPH_ Embedded feminism is the attempt of state authorities to legitimize an intervention in a conflict by co-opting feminist discourses and instrumentalizing feminist activists and groups for their own agenda. The term was introduced in the analysis of the US-led invasion of Afghanistan, but can also be applied to several historical examples where women's rights were used as justification and legitimization of Western interventionism. _START_SECTION_ Concept _START_PARAGRAPH_ Originally, the Canadian gender researcher Krista Hunt developed the conceptual framework of embedded feminism to describe the gendered nature of the US-led invasion of Afghanistan in 2001 and the US government's practice of using it to justify the War on Terror in the eyes of the public._NEWLINE_Hunt defines the concept as the "incorporation of feminist discourse and feminist activists into political projects that claim to serve the interests of women, but ultimately subordinate and/or subvert that goal"._NEWLINE_Hunt coined the term embedded feminism in reference to the "embedded journalism" or "embedded media" approach of the US Department of Defense, which became prominent in the media coverage of the 2003 invasion of Iraq. The US government attached journalists, photographers, and camera operators to military units and granted them unprecedented access to the battle frontline.
Although "embedded journalism" allowed the public to get an exclusive look at the situation in Iraq, this practice was regarded as problematic, as it could undermine the independent reporting and promote the preferences of the government._NEWLINE_The "far-reaching process of appropriating and subverting feminism through appeals to women's rights" that is embedded feminism is different from simple co-optation practices by state authorities in so far as it goes beyond the absorption "of the meanings of the original concepts to fit into the prevailing political priorities". _START_SECTION_ Historical examples _START_PARAGRAPH_ Krista Hunt argues that appeals to women's liberation have been embedded in political projects for centuries to mobilize feminists and their discourses._NEWLINE_A large body of feminist literature has analyzed the gender-related dimensions of (post-) colonial projects where feminists from the Global North were convinced to get involved in order to "save" other oppressed women. Such rescue narratives generally presuppose a homogeneity of women as an oppressed group, as showed in the work of Chandra Mohanty, and put into play the orientalized nature of the seemingly dangerous "brown man". Thus, feminism which was incorporated in the modernization and civilization projects of imperial countries is argued to have helped strengthening colonialism and patriarchy instead of promoting women's rights._NEWLINE_Feminists also claim that feminist activists and their discourses have been instrumentalized for nationalist projects. During the Nasser era in Egypt for example, feminists are said to have played a major role in helping create a sense of cohesion and bonding and therefore directly contributed to the emergence of a national identity during and after the struggle for independence. Nevertheless, women remained mostly absent from the public sphere of politics once the project succeed. _START_SECTION_ The War on Terror _START_PARAGRAPH_ The history of the war on terrorism throughout the IR realm consistently showcased a male-stream discipline and a hyper-masculine war hero narrative. In other words, the story is narrated by these men, who hold high positions of power and are fixated to exemplify their heroic qualities to shield women from harm and collide with the world’s difficulties. For example, according to the former US President, George W. Bush, the central goal of the terrorists is the brutal oppression of women…that is the reason this great nation, with our friends and allies, will not rest until we bring them all to justice . This rallying cry by the Bush administration is exactly the narrative that is at question. The time-honored tradition of the good guys defeating the bad guys and protecting racialized women serves to reinforce patriotism and justify violence both abroad and at home. However, how does one exemplify “the bad guys”? By using a gendered lens and looking at the war of terror through a gendered perspective, a simple rallying cry has far more complexities. For instance, there is a power dynamic at play here involving two opposing parties. There is the Western men and women who are deemed to be the saviors. Then there are the Afghan women who needs saving. What does this do? This creates a subtle social construct that the war on terror has created different kinds of men and women based on race, religion and nationality . Having said that, a gendered lens does ignore specific factors. 
It ignores the power dynamic of liberated white western women against their oppressed Afghan women. Basically, in a war, your race and nationality come heavily into play when it comes to who is deemed to be more liberated. It ignores the historical colonial justification for invasion by proclaiming racialized men are harmful to racialized women. Feminists analyzed Bush’s rallying cry and found similarities to the white men knowing what’s right and saving the racialized women because of perceptions about the racialized men. It ignores the reinforced resistance to women’s rights, whereas men see it as Western imposition. In a war situation, when a Western country tries to help an oppressed nation, it is seen as western imposition because it is as if “the west knows best”, without even being apart or living in an oppressed nation and gives the perception that anything the West does (even empowering women) is treated as imposition. It ignores the obscurity of the reality that white Western women are still being oppressed by the same powers that are trying to liberate Afghan women. Finally, it ignores the overarching point of all these factors which is the creation of a divide and conquer situation of women while also initiating solidarity of all women . In other words, what all these factors mean is the examination of race, class, nationality, religion and sexuality, we notice factors that were passive such as progressing onward with political agendas that are traditional, old testament and problematic while also trying to play the good guy by silencing other related key issues. In conclusion, Gender has become a topic that is heavily scrutinized but also heavily appreciated and even in the most traditional of scenarios, a gender lens/perspective is very much needed to tackle the real issue of IR._NEWLINE_In 2001, the Bush administration began expressing their concerns for the situation of women under the Taliban regime. According to Hunt, it invoked the struggle for women's rights and women's liberation as a rational to justify the invasion of Afghanistan. This increased gender awareness can be interpreted as part of a framing strategy which conflated the War on Terror with the fight for women's rights as a proxy for universal human rights. In the eyes of many feminists, the rescue of oppressed women from the Taliban became the powerful normative legitimation of the invasion which obtained broad public approval. More importantly, this strategy could align itself with feminist groups, that are traditionally pacifist, and could win their approval, thereby removing a critical opposition. The doubts in the government's commitment to further women's rights through war arose due to its lack of interest before 9/11. It was only after the terror attacks, that politicians in the US and in Europe began broadly supporting women's liberation from the Taliban._NEWLINE_Despite its usual non-violent stance, the Feminist Majority Foundation (FMF) supporter the policies of the Bush administration and is therefore regarded as one of the most vocal feminist supporters embedded in the War on Terror. 
Although the FMF saw the government's increased gender awareness as a success of their 'Stop Gender Apartheid' campaign, their involvement in Bush's political project was strongly criticized by other NGOs and the critical public because their role was seen as lending legitimacy to the war._NEWLINE_Hunt sees embedded feminism as a concept that was used to advance the engendered war story of the Bush administration that the invasion of Afghanistan could liberate Afghan women. It further created a division between feminist groups that supported the war and those groups that refused to get involved in the usurpation of feminism for war. A division also emerged between "Western" feminists who strove to save the "Other" women from an orientalised enemy and Afghan feminists who criticized the notion that war could liberate them. _START_SECTION_ Hegemonic Western feminism and post-colonial critique _START_PARAGRAPH_ Hunt notes that there is a striking similarity between the logic of embedded feminism in colonialist projects and in the War on Terror. Both are inherently Eurocentric and present the West as culturally and normatively superior to "unmodern" Eastern societies. This rationale would give the West a prerogative to intervene and rescue the "monolithic group" of Other women who have no agency of their own. Spivak's famous post-colonial critique of the relationship between the colonizers and the colonized subjects in "Can the Subaltern Speak?" condenses this relationship into the strategy of "white men saving brown women from brown men". This analysis can also be applied to the seemingly neo-imperialist strategy that the US government was pursuing by framing Taliban men in Afghanistan as a danger to women, who were presented as victims in need of help from the West._NEWLINE_Characteristic of Western hegemonic feminism was the disregard of Western actors for the opinions of Afghan women's groups, who argued that a war would certainly have a negative impact on women and fuel fundamentalist sentiments. In the aftermath, Bush's agenda was in fact interpreted as an attack on Islamic values and resulted in a backlash from conservative forces. Hegemonic feminism also tends to reproduce binary gender roles, especially in the visual representation of women and children as victims of war or oppression in the media. Cynthia Enloe has called this conflation of women and children as victimized subjects "womenandchildren", a single trope that is invoked in patriarchal narratives to support state security interests. _START_SECTION_ Contextualisation _START_PARAGRAPH_ The unique nature of embedded feminism as a state strategy lies not just in argumentation based on the representation of women and children as victims but in the conjunction of this discourse with the struggle for women's rights. _NEWLINE_Hunt's concept has made an impact on gender-related conflict research and has been applied to the wars in Iraq, Kosovo and Afghanistan. Embedded feminism can also be used in other contexts, such as neo-liberal globalization, and can be applied to several other policy fields where pseudo-feminist arguments and feminist groups are misused to legitimize a state-led action or to construct an alternative story. + _START_ARTICLE_ Ridgeway Mine _START_SECTION_ Extraction _START_PARAGRAPH_ Ore was blasted out of the mine pits and hauled in 85-ton trucks to ore storage heaps at the surface of the mine.
Each ore truckload had only around 1.5 to 3 ounces of extractable gold in it, and for every ore truckload there was a truckload of waste rock. Recovering the ore's microscopic gold deposits was a multi-step process. First, the ore was milled to 200 mesh and soaked in a mixture of sodium cyanide, lime and water in one of ten giant tanks for twenty-seven days to dissolve the gold in the ore. Next, the solution was sent through a carbon in leach process where activated charcoal is used to absorb the gold from the solution. The gold on the carbon granules was dissolved into a second solution as it was sent through a stripping column under high pressure. This solution is used in an electrowinning process that deposits the gold onto steel wool cathodes. The gold-plated steel wool undergoes electrorefining and its gold is deposited onto stainless steel plates. Finally, the gold is scraped off of these stainless steel plates, fed into a gold crucible and melted into doré bars. _START_SECTION_ Geology _START_PARAGRAPH_ The Carolina Slate Belt (CSB) originated in the Proterozoic or Cambrian age with a volcanic arc terrane, a chain of volcanoes that form on the edge of one continental plate when it has another continental plate subducted under it. Later, as part of the Taconic orogeny these volcanoes were metamorphosed to greenschist. Finally, the Acadian orogeny and Alleghanian orogeny sheared the rocks and created intrusive granite from the intrusions of magma into cracks in the rock._NEWLINE_The mine is located in an east-west trend of the CSB which otherwise follows in a northeastern trend. It is also on the boundary between an older igneous terrane called the Sawney's Creek volcanics and a newer Bear Creek turbidite (sedimentary rock made from slides of liquefied sediments) terrane. The North Pit's primary gold ore is a chert made from either fine-grained sediment or a volcanic tuff and the South Pit's primary gold ore is finely laminated sediments from underwater flows of felsic (rich in feldspar and quartz) ash from a volcano. There is some debate on the processes that created the ore but there is consensus that exhalative sediments formed when hot water from a deep geothermal vent cooled and precipitated gold onto the ocean floor and these gold-bearing sediments were further concentrated when they were dissolved and precipitated again by an epithermal process (cooler geothermal vents closer to the surface). _START_SECTION_ History _START_PARAGRAPH_ Gold was first found in the Carolina Slate Belt in 1827 at the Haile mine in nearby Lancaster County. Placer mining, hydraulic mining and later use of the Newbery-Vautin chlorination process kept South Carolina mines in operation but by the year 1900 mining became unprofitable and had mostly stopped. In the 1960s interest in mining the Ridgeway area increased after John Chapman found gold while panning in nearby creeks. Amselco Minerals, a company later merged with Kennecott Minerals, began investigating Ridgeway after geologist Irving T. Kiff noticed similarities between Ridgeway's slate outcrops and outcrops at the Haile mine. Kennecott Minerals began buying land for the Ridgeway mine in 1980 and began mining in 1988. _START_SECTION_ Accidents _START_PARAGRAPH_ The Ridgeway mine was a relatively safe operation, only killing three employees in eleven years of operation._NEWLINE_On August 18, 1988 James F. Wise, while working as a spotter for a pan scraper, was crushed to death when it backed over him. 
Roosevelt Williams, the scraper driver, was also injured and was sent to the hospital for treatment. The Caterpillar model 623 scraper that Williams was driving that day has limited visibility for the driver and no right side rear view mirror, blocking the driver from seeing anything within 30 feet of the right side of the scraper. Williams was instructed in weekly safety meetings to keep spotters in his eyesight while driving the vehicle and to not start moving the vehicle until the spotter is located. On the day of the accident Williams did not locate the spotter before moving and only watched the left side of the vehicle while backing it out. Wise was crushed under the right tires of the scraper. Williams did not notice that his spotter was missing until he had backed up for around 100 feet and saw Wise lying around 98 feet in front of him. _NEWLINE_On April 21, 1993 Johnny Ray fell to his death after he was pushed off the top of the mine into the pit by a forklift. Steven Crapps, the forklift driver, was attempting to park the forklift by putting it in neutral and applying the parking brake. However, he also turned off the forklift's engine which provides power to the hydraulic brake system pumps. The braking system has a backup accumulator system that allows the brakes to function even when the engine is turned off, however later tests of the forklift showed that the accumulator was broken. When the brakes failed the forklift rolled down a 5%-6% grade slope towards the edge of the mine pit. Ray was between the forks of the forklift and was pushed by it for around 15 feet as it rolled towards the mine pit. A small berm at the edge of the pit prevented the forklift from rolling over the edge but Ray was pushed into the pit, suffered a drop of around 85 feet, was evacuated by helicopter to a local hospital and later died._NEWLINE_On February 2, 1997 Joseph Sumpter was fatally crushed by a runaway haulage truck. The truck, assigned to Sumpter for the day's work, had a bad battery and starter and had to be jump started by another vehicle. It also had a completely inoperable parking brake. His co-worker, Robert Stover, pulled his truck in front of Sumpter's truck to jump start it and Sumpter climbed across the engine decks of both trucks to connect jumper cables. Sumpter's truck was immobilized with a wooden chock but when Stover parked facing Sumpter's truck he pushed Sumpter's truck backward about two feet behind the chock. The jump start was successful and Stover backed his truck away, allowing Sumpter's truck to roll forward, bypass the chock and continue rolling down a 3% grade hill. Sumpter held on to the engine deck while the truck rolled for around 57 feet, but near the end of the ride Sumpter fell off and was run over by the front wheel of the truck. Sumpter was pronounced dead at the scene._NEWLINE_On the weekend of December 3, 1988, soon after the mine began chemically processing the mined ore, a flock of around 400 sea gulls landed in and around the tailings pond and 65 of the seagulls died. Ridgeway Mining had installed scare cannons to prevent birds from landing in the cyanide pond but in this case it was ineffective. The South Carolina Department of Natural Resources levied a $6,500 fine against Ridgeway Mining for killing the birds. _START_SECTION_ Closure _START_PARAGRAPH_ Mining stopped in the South Pit mine in September 1996 and the North Pit shut down in November 1999. Ore processing finished on December 3, 1999. 
The Confederate States Mint struck commemorative gold, silver and bronze coins with metal from the mine and sold them to the general public. The coins were marketed as wholly South Carolina gold and featured a local building used by the Confederate army during the American Civil War. The Ridgeway Mining Company also commissioned the Confederate States Mint to mint a coin out of the mine's gold and silver which it gave to the mine employees as gifts. _START_SECTION_ Reclamation _START_PARAGRAPH_ The mine tailings are a pyrite-bearing rock which, when exposed to water and oxygen, causes acid rock drainage and environmental damage. Kennecott Ridgeway designed an impoundment for the tailings that seals them off from rainwater runoff and the atmosphere. If fresh water was allowed to run through the tailings it would constantly dissolve and oxidize the rock's sulfides so the tailings are instead kept in stagnant water that has already had a chance to fully saturate with sulfides. The impoundment was created by saturating the tailings with water, mining nearby inert saprolites and clay, sending the saprolites and clay through the ore mills and pouring them over the tailing pile to create a cone over the mine tailings._NEWLINE_The Ridgeway mine recovery has been an ecological success with minimal seepage of acid water into the environment, a stark contrast to the nearby Barite Hill and Brewer gold mines that were both declared Superfund sites after being abandoned by their owners. _START_SECTION_ South Pit Lake _START_PARAGRAPH_ Both mine pits are expected to fill with water by 2020 and public access is planned for the reclaimed South Pit Lake. Mine management hopes to create a meromictic lake, a self-regulated lake with distinct layers of water that do not mix, to minimize oxidation of sulfides and prevent toxic metals from circulating in the environment. A comprehensive study of the South Pit Lake's limnology was conducted from 2000 to 2004, measuring its physical, chemical and biological properties. Wind is typically the most important contributing factor to mixing in a lake and the South Pit Lake was found to have a poor alignment with the prevailing direction of wind in the area. To counteract the mixing forces of wind the lake has other properties that promote water layer stability and stratification. Its relative depth, a ratio of the lake's surface area to its depth, is around 10%, typical of other meromictic lakes. There are several underwater features that dissipate wave energy and reduce the strength of wind-powered seiches in the lake. Finally, bacteria in the lake contributed to a pycnocline by precipitating carbon, gypsum and metals out of the upper layer, lowering the density of the upper layer, increasing the density of the lower layer and promoting lake stability. The lake successfully achieved meromixis in the winter of 2001 and maintained it until the end of the study in 2004. _START_SECTION_ Future mining _START_PARAGRAPH_ The Strongbow Exploration company announced that it was investigating a possible new strike of gold within three miles of Rio Tinto's Ridgeway mine in 2011. After releasing several press releases announcing positive test results from drilling samples and the signing of property purchase option agreements with landowners there has been no further communication on their Ridgeway exploration since December 2012. 
+ _START_ARTICLE_ Atlético Tucumán _START_SECTION_ History _START_PARAGRAPH_ The club was founded in 1902, which makes Atlético the oldest football club from the province of Tucumán._NEWLINE_Atlético has participated in nine seasons in the Primera División: eight seasons between 1973 and 1981, and a single season in 1984. The team's best ever performance in Primera División was in 1979, when they reached the semi-finals of the Torneo Nacional._NEWLINE_In 2008 Atlético Tucumán was promoted to the Argentine 2nd Division after defeating Racing de Córdoba in the final game of Torneo Argentino A, and one year later the squad achieved its 2nd consecutive promotion by winning the B Nacional tournament and reaching the Primera División. _START_SECTION_ Tucumán Derby _START_PARAGRAPH_ The Tucumán Derby is played between Atlético and its longtime rival San Martín, both from the same city. The Santo (as San Martín is nicknamed) currently plays in the Torneo Argentino A, the regionalized third division of the Argentine league system. _START_SECTION_ Ground _START_PARAGRAPH_ The stadium was constructed in 1922 by Spanish architect José Graña (1885–1950) with an original capacity of 5,000 spectators. It was inaugurated on May 21 of the same year. It was originally named the "Grand Stadium" because it was the largest in the north of Argentina, and Racing Club de Avellaneda was invited to play a friendly match against Atlético Tucumán as part of the inauguration celebration. The stadium was named Monumental "José Fierro" in honor of a well-remembered Atlético chairman._NEWLINE_It was the first roofed stadium in Tucumán Province and the first to have an upper stand. The structure was built out of concrete. The record attendance was in 2008, during a match between Atlético and Racing de Córdoba, when all the seats were filled._NEWLINE_The stadium is located in the northern part of the city of San Miguel de Tucumán (the neighborhood known as "Barrio Norte"). It can currently accommodate up to 32,500 people following an upgrade of the facilities that added an extra 2,500 seats. + _START_ARTICLE_ Ford Motor Company Philippines _START_SECTION_ History _START_PARAGRAPH_ Ford's history in the Philippines can be traced back to 1913 with the local assembly of the Ford Model T. In 1929, Henry Ford established Pilipinas Ford Car Works, Inc. (PFCW). In 1967, Ford Philippines, Inc. (FPI) was established as a subsidiary of the Ford Motor Company and began production operations on May 3, 1968, at Sucat, Parañaque. In 1976, FPI inaugurated a body stamping plant in Mariveles, Bataan. On March 20, 1984, FPI formally and unexpectedly announced it would cease its operations in the Philippines by August 1984, in accordance with a decision reached by the management of Ford Motor Company._NEWLINE_In 1997, Ford returned to the Philippines with the establishment of Ford Motor Company Philippines, Inc. (FMCPI), introducing US-made vehicles such as the Expedition, the F-150, the Clubwagon and the Lincoln Town Car. A new P4 billion state-of-the-art assembly plant in Santa Rosa, Laguna, opened in September 1999. The first car manufactured at the plant was the Ford Lynx, and the company began building the Mazda-based Ford Ranger in March 2000. FMCPI expanded its line-up with the introduction of the Escape SUV, Explorer SUV and Everest SUV.
Towards the end of the decade, the Fiesta and Mustang were also introduced in the local market._NEWLINE_In 2012, Ford announced the consolidation of manufacturing operations in Southeast Asia and the cessation of operations at the Santa Rosa plant, citing "lack of supply base and economies of scale." 250 workers were affected by the decision, which Ford Philippines tried to resolve by offering them work at other Ford manufacturing facilities overseas. Despite the closure, Ford Philippines continued to open more dealerships and expand its vehicle lineup through 2015. In March 2014, Mitsubishi Motors Philippines Corporation announced it had acquired the former Ford assembly plant. + _START_ARTICLE_ Imrana Alhaji Buba _START_SECTION_ Early Life and Education _START_PARAGRAPH_ Buba was born in Jakusko, Yobe State, on 6 August 1992 and grew up in Potiskum, Yobe State. He is an alumnus of the University of Maiduguri, Borno State, where he graduated with a first-class honours degree in Political Science in 2015, and he received a master's degree in Africa and International Development from the University of Edinburgh, United Kingdom, in 2018. _START_SECTION_ Career and Activism _START_PARAGRAPH_ Buba had a traumatic experience with Boko Haram in June 2010 when, as an undergraduate travelling to the University of Maiduguri, his bus was stopped by the terrorists and passengers were kidnapped; he survived, but friends and family of his were killed in the Boko Haram insurgency. As a result, he founded the Youth Coalition Against Terrorism (YOCAT) in August 2010 to offer counselling services to victims of terrorism, as well as to provide peace education and skills training for unemployed youths._NEWLINE_He has provided employment opportunities for over 2,000 young people in north-eastern Nigeria through partnerships with local government agencies and private organisations, and the organization has recruited over 600 volunteers and partnered with many local bodies to organize beneficial programs for young people in north-eastern Nigeria._NEWLINE_In 2016, he was selected as one of 3 Nigerians and 21 African changemakers in the Commonwealth to receive the Queen's Young Leaders Award from Her Majesty Queen Elizabeth II in recognition of his peacebuilding work in northern Nigeria, and he also became a fellow of the Generation Change Fellowship of the United States Institute of Peace (USIP) as a result of his work in combating terrorism._NEWLINE_He was selected for the 2017 JCI Ten Outstanding Young Persons of the World in recognition of his efforts to counter violent extremism and promote a culture of peace in Nigeria, and was part of the 2017 Mandela Washington Fellowship program for young African leaders in Washington, D.C. He is also a fellow of LEAP Africa SIP and YALI West Africa._NEWLINE_His work and public accolades have made him an expert and speaker, particularly regarding political instability in Nigeria. He was a speaker/panelist at the 2016/2017 International Day of Peace events at the United States Institute of Peace (USIP), a speaker at the 2017 Wage Peace event at the American University, a speaker/panelist at the 2017 United Nations International Youth Day event, and a speaker at the 2018 United Nations International Day for the Remembrance of Victims of Terrorism and the 2018 One Young World Summit._NEWLINE_Buba's vision is to promote a culture of peace and tolerance that can break the cycle of conflict, violence, and terror that plagues Nigeria.
_START_SECTION_ Links _START_PARAGRAPH_ Age is not a limit to making a difference –Imrana Alhaji Buba_NEWLINE_Interview with Nigerian recipient of 2016 Queen's Young Leader award + _START_ARTICLE_ Princeton High School (Texas) _START_SECTION_ Football _START_PARAGRAPH_ The Princeton football team has made playoff appearances 13 times: in 1948, 1972, 1974, 1975, 1976, 2010, 2011, 2012, 2013, 2014, 2015, 2016 and 2017. Princeton won district championships in 1948, 1972, 1974, 1975 and 1976, when only one team per district made the playoffs. In the new format, where 4 teams from each district make the playoffs, Princeton has added two more district championships, in 2013 and 2017. _START_SECTION_ Basketball _START_PARAGRAPH_ PHS men's and women's basketball teams have experienced success over the years. The 1994-95 men's team finished with a record of 30-5, finishing the season as district 9-3A Runner-up and Area Champions after defeating West High School in the playoffs. _START_SECTION_ Marching Band _START_PARAGRAPH_ The "Panther Pride Marching Band" has made appearances at the UIL State Marching Band Contest on many consecutive occasions, earning the bronze medal three times: in 2010, 2014, and 2016. The band has been well represented locally and at the state level since the early 1970s. _START_SECTION_ Extracurricular activities _START_PARAGRAPH_ Princeton High School offers an extensive range of extracurricular activities and programs, including Speech, Debate, Theatre, UIL Academics, Band, Choir, Winterguard, Art Club, Robotics, Fishing, Journalism, Student Council, Welding, Cheerleading, National Honor Society, Junior Reserve Officers' Training Corps (JROTC), the Fellowship of Christian Athletes, Future Farmers of America (FFA), and Spanish Club. The school also has a widely recognized Career and Technology Exploration (CATE) program that serves Princeton High School students and others from the area's surrounding schools. + _START_ARTICLE_ Bangladesh Institute of Science and Technology _START_PARAGRAPH_ Bangladesh Institute of Science and Technology (BIST) is a university-level institution affiliated with the National University, Bangladesh, and located in Dhaka, Bangladesh. The institute was established in 1999 at Kakrail in Dhaka. The institute is regulated by a governing body consisting of the Principal, the Head of the Faculty of Science and Engineering, and the Head of the Faculty of Business Studies, all operating within the rules and regulations of the National University of Bangladesh. _START_SECTION_ History _START_PARAGRAPH_ Bangladesh Institute of Science and Technology (BIST) was established in 1999. _START_SECTION_ Campus _START_PARAGRAPH_ BIST is located at 122, New Kakrail Road, Dhaka 1000, Bangladesh. + _START_ARTICLE_ Rosemary R. Haggett _START_PARAGRAPH_ Rosemary R. Haggett, Ph.D., is the Vice Chancellor for Academic Affairs and Student Success for the University of North Texas System. She was the second woman to serve as a dean of a college of agriculture in the U.S. _START_SECTION_ Education _START_PARAGRAPH_ Rosemary Haggett earned a bachelor's degree in biology from the University of Bridgeport and a Ph.D. in physiology from the University of Virginia. She conducted postdoctoral work in reproductive biology at Northwestern University. _START_SECTION_ Career _START_PARAGRAPH_ Rosemary Haggett began her career as a postdoctoral associate at Northwestern University, conducting research in reproductive biology. She worked at the U.S. Department of Agriculture for several years.
In 1994, she became a professor of Animal and Veterinary Sciences and the Dean of the College of Agriculture, Forestry, and Consumer Sciences at West Virginia University, only the second woman in the U.S. to serve in such a position. In 1999, she became the Associate Provost for Academic Programs at West Virginia University. _NEWLINE_In 2003, she left West Virginia University to serve at the National Science Foundation as the director of the Division of Undergraduate Education. She also served as the Acting Deputy Assistant Director of the Education and Human Resources Directorate, the Acting Director of the Division of Graduate Education, and the Senior Adviser of the Education and Human Resources Directorate at the National Science Foundation. _NEWLINE_In 2007, she became Provost and Executive Vice President for Academic Affairs at the University of Toledo. _NEWLINE_In 2010, she became Vice Chancellor for Academic Affairs and Student Success for the University of North Texas System._NEWLINE_Rosemary Haggett served as Chair of the Committee of Visitors for the National Science Foundation's Training Cluster in the Division of Biological Infrastructure in 2003. She served on the National Academies Committee to review the USDA Agricultural and Food Research Initiative in 2012. + _START_ARTICLE_ Ted Hooper (rugby league) _START_PARAGRAPH_ Edward James Hooper (1871–1925) was an Australian rugby league referee and administrator. Born in Kent, England, Hooper played rugby union in Sydney before becoming a referee in the sport. When the New South Wales Rugby League (NSWRL) formed in 1908, marking the start of professional rugby league in Australia, he joined the ranks of players and referees who switched to the new code, becoming the president of its referees' association. He officiated the match between Eastern Suburbs and Newtown in the first round of the NSWRL's inaugural season. He died in 1925, collapsing in the dressing rooms after refereeing an exhibition match in Brisbane. + _START_ARTICLE_ Jason Baird Jackson _START_PARAGRAPH_ Jason Baird Jackson, Ph.D. (born 1969) is the Director of the Mathers Museum of World Cultures and Professor of Folklore and Anthropology at Indiana University Bloomington. He is "an advocate of open access issues and works for scholarly communications and scholarly publishing projects." At IUB, he has served as Chair of the Department of Folklore and Ethnomusicology and as Director of the Folklore Institute. According to the Journal of American Folklore, "Jason Baird Jackson establishes himself as one of the foremost scholars in American Indian studies today." _START_SECTION_ Early life and education _START_PARAGRAPH_ Jason Jackson was born in 1969._NEWLINE_He received his B.A. in sociology from University of Florida in 1990 with a minor in anthropology. He earned his M.A. degrees in cultural anthropology and in folklore, as well as his Ph.D. degree in anthropology from Indiana University Bloomington. _START_SECTION_ Career _START_PARAGRAPH_ Jackson was Curator of Anthropology at the Gilcrease Museum in Tulsa, Oklahoma (1995–2000) and Assistant Curator of Ethnology at the Sam Noble Oklahoma Museum of Natural History in Norman, Oklahoma (2000–2004). He remains a Research Associate at SNOMNH._NEWLINE_A noted scholar in the tradition of Boasian anthropology, Dr. 
Jackson's research interests include the following areas: (1) folklore and ethnology (intellectual and cultural property issues, folklore and folklife, material culture, religion, ritual, cultural change, ethnohistory, music and dance, ethnobotany, ethnomedicine, social organization, social theory, history of folkloristics and anthropology), (2) linguistic anthropology (verbal art, oratory, language shift, language ideologies, theories of performance, language and culture), (3) curatorship (community collaboration, exhibitions, collections management), (4) American and Native American studies (Eastern North America)._NEWLINE_Dr. Jackson's ethnographic and historical work has focused on the life of the Yuchi, a Native American people residing today in Oklahoma, USA. He has published and edited several books on Native American topics, including Yuchi Ceremonial Life: Performance, Meaning and Tradition in a Contemporary American Indian Community. He has also published numerous articles based on his studies of Native American ethnography and folklore. Dr. Jackson has additionally spent time as an editor of the Journal of Folklore Research._NEWLINE_Dr. Jackson is the founding editor of Museum Anthropology Review, the first open access, peer-reviewed journal on the subject of museum anthropology. He is also the principal for the Open Folklore Project, which is tasked with "developing tools and resources for open access within Folklore studies." He also serves on the Editorial Board of Anthropological Quarterly and is one of the 2017 Visiting Faculty for the Smithsonian Summer Institute in Museum Anthropology, a position he has also held previously._NEWLINE_In June 2001, Dr. Jackson was awarded a Post-Ph.D. Research Grant from the Wenner-Gren Foundation "to aid archival and ethnographic field research on the role that social dance, musical performance, and cultural performances more generally, play in the network connecting the Woodland Indian communities of central and eastern Oklahoma into a regional system of exchange." + _START_ARTICLE_ The Umbrella (film) _START_SECTION_ Premise _START_PARAGRAPH_ After being released from prison, two incompetent crooks allow the umbrella with their stolen valuables stashed away in it to be carried off by someone else. A series of confusions ensues as they desperately try to recover the missing umbrella. + _START_ARTICLE_ Rick Rivet _START_SECTION_ Background and education _START_PARAGRAPH_ Rivet's family lived both in the country and in town at Aklavik, which was a Métis trading center. The Métis have a distinct culture with First Nations and European roots. He began school in Aklavik at age seven._NEWLINE_Rivet earned four degrees: his Bachelor of Arts from the University of Alberta in 1972; his Bachelor of Fine Arts from the University of Victoria in 1980; his Master of Fine Arts from the University of Saskatchewan in 1985; and his Bachelor of Education from the University of Saskatchewan in 1986. _START_SECTION_ Artwork _START_PARAGRAPH_ His art is deeply influenced by ideas of fusion and hybridity of cultures. He works primarily in acrylic on canvas in a style he has referred to as "an expressionist/primitivist approach." + _START_ARTICLE_ Bogdan Radenković _START_SECTION_ Biography _START_PARAGRAPH_ Radenković was born in 1874 in Srbovac, a village in the municipality of Zvečan, then part of the Ottoman Empire and now in Kosovo, which remains a contested political territory to this day.
As a university graduate and a tonsured monk with the chosen name Vasilije, he became a secretary to the Serbian Orthodox Metropolitanate of Skopje. In this influential post, he had numerous contacts with his people and with the consulates of Serbia, Russia, and France. Among the clergy, he was known as Vasilije (Radenković) and among the laity simply as Bogdan Radenković._NEWLINE_Bogdan Radenković was a member of the Serbian Committee of Skopje and the main organizer of the Serbian Chetnik action in the Ottoman Empire. He was an intermediary between the Serbian consulate and the Chetnik organization and their supporters. During 1905, the Turkish authorities caught a farmer who, after being tortured, revealed that Radenković was the president of the Serbian Committee in Skopje. At the Skopje trial, the farmer recanted, stating that his testimony had been extracted by force, and Radenković was ultimately acquitted. Radenković was a friend of Milan Rakić, then Serbia's vice-consul in Skopje, with whom he discussed confidential operational plans of the Chetnik organization. Milan Rakić at the time began writing a poem called "On Gazimestan" that became popular even before the Balkan War of 1912. _START_SECTION_ Founding of the Serb Democratic League _START_PARAGRAPH_ The Serb Democratic League in the Ottoman Empire (Serbian: Српска демократска лига у Отоманској царевини) was an Ottoman Serb political organisation established on August 13, 1908, at the First Serb Conference (August 10–13), immediately after the Young Turk Revolution. Some 26 of the most distinguished Serbs in the Ottoman Empire attended, and Bogdan Radenković was selected to head the "Temporary Central Board of the Organization of Ottoman Serbs" in July 1908. Bishop Vicentije Krdzić of the Serbian Orthodox Eparchy of Skopje headed the clergy and Bogdan Radenković the lay membership of the "Assembly of Ottoman Serbs in Skopje", held on Sretenje in 1909. These organizations included the Serb elite of Old Raška, Kosovo and Metohija, and Vardar Macedonia and Aegean Macedonia, as well as many members of the Serbian Chetnik Organization. They were: Bogdan Radenković; Aleksandar Bukvić; Gligorije "Gliša" Elezović; Vasa Jovanović; Milan Čemerikić; Sava Stojanović; David Dimitrijević; Đorđe Hadzi-Kostić; Velimir Prelić and Jovan Šantrić._NEWLINE_With its founding, the Serb Democratic League became the first political party to represent the interests of the Serbs in the Ottoman Empire. The League sent Bogdan Radenković, Jovan Šantrić and Đorđe Hadži-Kostić to Thessaloniki to negotiate with the Central Young Turk Board. The Serbian demand was for the three non-Muslim "ethnic groups" (Serbian, Greek and Bulgarian) to get an equal number of seats in the Ottoman Parliament. But the Young Turks refused that concept and made an electoral agreement with the Serbs conditional on a broader agreement that would not have a national background. In 1910, as a representative of the party, he was sent to Istanbul, where he urged the Turkish authorities to stop using their irregular troops (bashi-bazouks) to terrorize the Serbian population in Gjilan. The Sublime Porte denied the violence in Kosovo, claiming that it was a fabrication. Yet many of the outrages committed in Old Serbia, where Turkish troops are alleged to have massacred more than 60,000 Christians, have been credited to the Albanians.
_START_SECTION_ Black Hand _START_PARAGRAPH_ Radenković and a few others, particularly Colonel Dragutin Dimitrijević, were the initiators of the creation of the "Unification or Death" organization, better known as the Black Hand, in 1911. Along with Ljuba Čupa and Vojislav Tankosić, Radenković wrote the constitution of the "Unification or Death" organization, which was modelled on similar German secret nationalistic associations and the Italian Carbonari. _START_SECTION_ Bishop _START_PARAGRAPH_ After bishop Nićifor Perić of Raška-Prizren withdrew from his office (1911), owing to disagreement with the Serbian diplomacy, the Patriarchate of Constantinople appointed Bishop Gavrilo Dožić as successor, as the Serbian diplomacy wanted. There was a conflict within the Serbian Church regarding the appointment of Gavrilo; the "Old Serbs" (clergy from Kosovo and Old Serbia) wanted their candidate, the previous secretary of the Eparchy of Skoplje, monk Vasilije (Bogdan) Radenković. While waiting for the Ottoman government approval, the Serbian government changed the decision and ordered through the consuls that Ottoman Serbs request that Radenković be appointed instead. However, Gavrilo ended up being chosen. Meanwhile, Radenković became a founder of the Black Hand conspiracy group. _START_SECTION_ Secret Mission in Korçë _START_PARAGRAPH_ After the occupation of Serbia in late 1915 by the Germans, Austrians, Hungarians and Bulgarians, Bogdan Radenković withdrew through Montenegro, Albania to the island of Corfu, where he was temporarily hospitalized with tuberculosis. When his condition improved somewhat he was sent to Athens and from there to Korçë County in eastern Albania. There he stayed until August 1916 when a surprise Bulgarian invasion took place and he was forced to flee. He was almost caught while escaping but eventually managed to reach Thessaloniki, where the Serbian Supreme Command was stationed. Weak, suffering from tuberculosis after the harrowing escape from Korçë, he was advised by his doctor to go to Egypt where the climate may improve his condition. Nikola Pašić, however, purposefully delayed his departure until his condition worsened. _START_SECTION_ High Military Court _START_PARAGRAPH_ The Serbian Supreme Command on 15 March 1917 sent a warrant for Bogdan Radenković's arrest, though the main accused was Dragutin Dimitrijević, better known as Apis, and his associates. Radenković was sentenced to death for allegedly plotting against the prince regent Aleksandar Karadjordjević and Nikola Pašić, the head of the Serbian government-in-exile, even though there was no concrete evidence that could link him to such an outrageous plot, not him nor Dimitrijević and others._NEWLINE_Bogdan Radenković died of tuberculosis in a prison hospital in Thessaloniki, Greece on 30 July 1917. Years later it was revealed that Nikola Pašić fabricated the story to rid himself of Dragutin Dimitrijević and other Serbian nationalists that may pose a threat after the war during election time. All the accused were vindicated, but many years later. + _START_ARTICLE_ Super Mario Bros. 3 _START_SECTION_ Gameplay _START_PARAGRAPH_ Super Mario Bros. 3 is a two-dimensional, side-scrolling platform game in which the player controls either Mario or Luigi. The game shares similar gameplay mechanics with previous games in the series — Super Mario Bros., Super Mario Bros. 2 in Japan, and Super Mario Bros. 2 internationally — while introducing several new elements. 
In addition to running and jumping found in past games, the player can slide down slopes, pick up and throw special blocks, and freely climb vines. Mario can also fly and float with power-ups. The game world consists of eight kingdoms, each subdivided into multiple levels. The eight worlds feature distinct visual themes: for example, the second world, "Desert Land", contains sand-covered levels with pyramids, while the levels in the fourth world, "Giant Land", contain obstacles and enemies four times their normal size._NEWLINE_The player navigates through the game via two game screens: an overworld map and a level playfield. The overworld map displays an overhead representation of the current kingdom and has several paths leading from the world's entrance to a castle. Paths connect to action panels, fortresses, and other map icons, and allow players to take different routes to reach the kingdom's goal. Moving the on-screen character to an action panel or fortress will allow access to that level's playfield, a linear stage populated with obstacles and enemies. The majority of the game takes place in these levels, with the player traversing the stage by running, jumping, flying, swimming, and dodging or defeating enemies. Players start with a certain number of lives and may gain additional lives by picking up green spotted 1-Up mushrooms hidden in bricks, or by collecting 100 coins, defeating several enemies in a row with a Koopa shell, or bouncing on enemies successively without touching the ground. Mario and Luigi lose a life if they take damage while small, fall in a bottomless pit, or run out of time. The game ends when all lives are lost, although the player can continue from the last level played by selecting "Continue"._NEWLINE_Completing stages allows the player to progress through the overworld map and to succeeding worlds. Each world features a final stage with a boss to defeat. The first seven worlds feature an airship controlled by one of the Koopalings, while the player battles Bowser in his castle in the eighth world as the Final boss. Other map icons include large boulders and locked doors that impede paths. Mini-games and bonus screens on the map provide the player a chance to obtain special power-ups and additional lives. Power-ups obtained in these mini-games are stored in a reserve until activated by the player from the map screen._NEWLINE_In addition to special items from previous games like the Super Mushroom and the Fire Flower, new power-ups are introduced that provide the player with new options. The Super Leaf and Tanooki Suit give Mario raccoon and tanooki appearances, allowing him to fly. The Tanooki Suit enables him to turn into stone to avoid enemies for a short period of time. Changing into a Tanooki statue while jumping results in Mario pounding the ground and killing whatever enemies are directly under him; this is the first appearance of the now standard "ground pound" move in the Mario series. The new "Frog Suit" increases the character's underwater speed, agility, and jumping height on land. Another new suit, the Hammer Suit, gives Mario the appearance of the Hammer Bro. enemy and allows him to throw hammers at enemies and resist fire attacks when crouching._NEWLINE_Super Mario Bros. 3 includes a multiplayer option which allows two players to play the game by taking turns at navigating the overworld map and accessing stage levels. The first player controls Mario, while the other controls Luigi (a palette swap of Mario). 
Through this mode, players can access several mini-games, including a remake of the original Mario Bros. arcade game, in which one player has the opportunity to steal the cards of another, but may lose their turn if they lose the mini-game. _START_SECTION_ Plot and characters _START_PARAGRAPH_ The plot of Super Mario Bros. 3 is described in the instruction booklet. The Mushroom World, the setting of the game, is invaded by the Koopalings, Bowser's seven children. The Koopalings conquer each of the seven kingdoms by stealing its king's magical wand and using it to transform him into an animal. Princess Toadstool sends Mario and Luigi to travel to each kingdom, retrieve the stolen wand, and restore its king to normal._NEWLINE_Mario and Luigi receive notes and special items from Princess Toadstool after rescuing each of the first six kings. When they rescue the seventh king, they instead receive a note from Bowser, boasting that he has kidnapped Toadstool and imprisoned her within the castle of his own realm, Dark Land. The brothers travel through Dark Land, enter his castle, and defeat Bowser in a battle. The game ends with Princess Toadstool being freed from the castle._NEWLINE_Super Mario Bros. 3 was conceived as a stage play. The title screen features a stage curtain being drawn open, and in-game objects hang from off-screen catwalks, are bolted to the background, or cast shadows on the skyline. When Mario finishes a level, he walks off the stage. _START_SECTION_ Development _START_PARAGRAPH_ Beginning development shortly after the 1986 release of the Famicom's Super Mario Bros. 2, Super Mario Bros. 3 was developed by Nintendo Entertainment Analysis and Development, a team that consisted of more than ten people. The game took more than two years to complete at a budget of about $800,000. Developer Shigeru Miyamoto served as director. He worked closely with the designers and programmers during the conceptual and final stages, encouraging a free interchange of ideas. Miyamoto considered intriguing and original ideas to be key to creating a successful game. Originally, the team intended for the game to be played from an isometric point of view, but the developers found that this made it too difficult to position jumps, so the game was changed to the 2D side view used in previous games. Some isometric elements remain, such as the checkered floor present in the title screen._NEWLINE_The game was designed to appeal to players of varying skill levels. To assist less skilled players, bonus coins and 1-ups are more abundant in earlier worlds, while later worlds present more complex challenges for experienced players. In the two-player mode, the players alternate turns to balance play time. The development team introduced new power-ups and concepts that would give Mario the appearance of different creatures as a means of providing him with new abilities. An early idea changed Mario into a centaur, but was dropped in favor of a raccoon tail with limited flying ability. Other costumes with different abilities were added to his repertoire, and levels were designed to take advantage of these abilities. New enemies were included to add diversity to the game, along with variants of previous enemies, such as Goombas, Hammer Bros., and Koopa Troopas._NEWLINE_Some of the enemies designed for Super Mario Bros. 3 were inspired by the team's personal experiences. 
For example, Miyamoto stated that the Chain Chomp enemy, a tethered ball and chain creature that lunges at the player when in close proximity, was based on a "bad [childhood] experience" he had with a dog. Bowser's children, the Koopalings, were designed to be unique in appearance and personality; Miyamoto based the characters on seven of his programmers as a tribute to their work and efforts. Nintendo of America named the Koopalings after well-known musicians: for example, the characters "Ludwig von Koopa" and "Roy Koopa" are named after Ludwig van Beethoven and Roy Orbison respectively._NEWLINE_The character graphics were created with a special graphics machine ("Character Generator Computer Aided Design") that generated a collection of the graphical shapes used in the game. Shapes in the collection were assigned numbers that the game's code used to access and combine to form complete images on the screen in real time. The Super Mario Bros. 3 cartridge uses Nintendo's custom MMC3 (memory management controller) ASIC to enhance the NES capabilities. The MMC3 chip allows for animated tiles, extra RAM for diagonal scrolling, and a scan line timer to split the screen. The game uses these functions to split the game screen into two portions, a playfield on the top and a status bar on the bottom. This allows the top portion to scroll as the character navigates the stage while the bottom portion remains static to display text and other information._NEWLINE_Like its predecessors, the music in Super Mario Bros. 3 was composed by Koji Kondo, who composed several new songs as well as returning melodies from Super Mario Bros. According to Kondo, who had composed the music in Super Mario Bros. based on what he believed fit the levels rather than focusing on composing a specific genre of music, the game was the most difficult game for him to compose. Kondo experimented with several different genres of music, unsure of how to follow up the music from the first game after hearing from several people that it sounded a lot like latin or fusion music, and came up with several different melodies throughout its development before settling on what ultimately made it into the game. The development team decided that music on the title screen was unnecessary._NEWLINE_During 1988, a global shortage of ROM chips, along with Nintendo's preparation of Super Mario Bros. 2, prevented Nintendo from performing various North American game releases according to their original schedules. The delayed products included Super Mario Bros. 3 and, according to Nintendo Power, Zelda II: The Adventure of Link. The delay, however, presented Nintendo with an opportunity to promote the game in a feature film. In 1989, Tom Pollack of Universal Studios approached Nintendo of America's marketing department about a video game movie; inspired by Nintendo video game competitions, Pollack envisioned a video game version of Tommy for younger audiences. Nintendo licensed its products for inclusion in what would become the film The Wizard. During the movie's production, the filmmakers requested and were granted approval from Nintendo regarding the script and the portrayal of the company's games. Super Mario Bros. 3 was one of the products shown in the film and was used in a final scene involving a video game competition. The film was released in December 1989, between the Japanese and English versions of the game. _START_SECTION_ Sales _START_PARAGRAPH_ Super Mario Bros. 3 became a best-selling game. 
Its inclusion in The Wizard served as a preview which generated a high level of anticipation in the United States prior to its release. Levi Buchanan of IGN considered Super Mario Bros. 3's appearance in the film as a show-stealing element, referring to the movie as a "90-minute commercial" for the game. The game sold 250,000 copies in its first two days of release, according to a spokeswoman for Nintendo. By 1993, the game had sold 4 and 7 million unbundled units in Japan and the United States respectively. In the United States alone, the game generated over US$500 million in revenue for Nintendo. Author David Sheff commented that, in music industry terms, the game went platinum 11 times. The game was later bundled with new NES systems. Including bundled units, the NES version of the game sold over 17 million copies. Game Informer reported in their October 2009 issue that the Virtual Console version had sold one million copies. As of 2011, Super Mario Bros. 3 remains the highest-grossing non-bundled home video game to date, having grossed $1.7 billion, adjusted for inflation. _START_SECTION_ Legacy _START_PARAGRAPH_ Super Mario Bros. 3 introduced several elements carried over to subsequent Mario games. A similar overworld map is used in Super Mario World and New Super Mario Bros., and Mario's ability to fly has been a feature in games such as Super Mario World, Super Mario 64 and Super Mario Galaxy. The game's "Super Leaf" item has returned in more recent Mario games for the Nintendo 3DS, like Super Mario 3D Land, Mario Kart 7 and New Super Mario Bros. 2. Bowser's red hair was first popularized in the game and has since become a part of his standard appearance._NEWLINE_Through a collaboration between NBC and Nintendo of America, an animated television series, The Adventures of Super Mario Bros. 3, was created in 1990 by DIC Entertainment. The show aired weekly and featured numerous characters, enemies, and settings from the video game; the original seven Koopalings are given different names based on their given personalities and are also given a new age order. Other Nintendo products have included various elements of the game as well. Music from Super Mario Bros. 3 appears as a track on Nintendo Sound Selection Koopa, a collection of songs from Nintendo games. The game's stages and graphics comprise a background theme in the 2006 Nintendo DS game Tetris DS. The Koopalings are also world bosses in Super Mario World, Mario is Missing!, Yoshi's Safari, Hotel Mario and all New Super Mario Bros. games except New Super Mario Bros. Boom Boom, another boss from this game, additionally reappears in Super Mario 3D Land and Super Mario 3D World, alongside a boomerang-wielding female counterpart named Pom Pom. Super Mario Bros. 3 is one of the games represented in both Super Mario Maker and Super Mario Maker 2._NEWLINE_Super Mario Bros. 3 has appeared on numerous top video game lists. The game debuted on Nintendo Power's Top 30 best games ever list at number 20 in September 1989. It entered the list's top 10 a few months later and reached number one in May 1990. Super Mario Bros. 3 remained within the top 20 for more than five years. More than a decade later, the magazine ranked the game number six on their list of 200 Greatest Nintendo Games. In August 2008, Nintendo Power listed Super Mario Bros. 3 as the second best NES video game, praising it for making the series more complex and introducing new abilities that have since become signature abilities in the series. 
The game placed 11th, behind Super Mario Bros., in Official Nintendo Magazine's "100 greatest Nintendo games of all time". In 2007, ScrewAttack called Super Mario Bros. 3 the best Mario game in the series as well as the best game on the NES, citing the graphics, power-ups, secrets, and popularity, summing it up as "just incredible" and stating, "If you haven't experienced this greatness, we pity you". In a poll conducted by Dengeki, the game tied with Super Mario World as the number three video game their readers first played._NEWLINE_The game has been ranked on several of IGN's lists of "top games". In 2005, they rated it 23rd among their Top 100 Games, and praised the precise and intuitive controls. IGN editors from the United States, United Kingdom, and Australia ranked Super Mario Bros. 3 number 39 in their 2007 Top 100 Games, citing Miyamoto's "ingenious" designs. They further commented that the game improved on the "already-brilliant concepts" of the previous games with new power-ups and enemies. Users and readers of the website placed the game high on similar lists: 32nd in 2005 and 21st in 2006. In 2007, the game was included in the "game canon", a list of the ten most important video games selected for preservation by the Library of Congress. In 2009, Game Informer put Super Mario Bros. 3 9th on their list of "The Top 200 Games of All Time", saying that it is "a game with incredible lasting power that we won't soon forget". This is down one place from Game Informer's previous ranking in 2001. Edge ranked the game #20 on its list of "The 100 Best Games To Play Today", calling it "the one 8-bit game that still shines today, no caveats required." UGO listed Super Mario Bros. 3 on their list of the "Top 50 Games That Belong On the 3DS", calling it "Arguably the greatest Mario game ever made." GameSpot placed the game on their list of the greatest games of all time. USgamer ranked the game as the third best Mario platformer ever. Super Mario Bros. 3 ranked 34th on Warp Zoned's "Scientifically Proven Best Video Games of All Time" list, a statistical meta-analysis of 44 "top games" lists published between 1995 and 2016._NEWLINE_In the early 1990s, game developers John Carmack and Tom Hall developed an adaptive tile refresh technology to perform smooth, side-scrolling graphics on EGA cards for IBM clone personal computers. They used it to develop a clone of Super Mario Bros. 3 and presented it to Nintendo, who rejected it to retain exclusivity for their games on Nintendo consoles. Carmack and Hall went on to found Id Software and develop Commander Keen, a series of platform games inspired by Super Mario Bros. 3. _START_SECTION_ Remakes _START_PARAGRAPH_ The game has been ported or remade on several other Nintendo consoles. It was included in the 1993 Super NES game Super Mario All-Stars, a compilation of remakes of NES Super Mario games featuring updated graphics and sound, which was also later released on the Wii in 2010. A Game Boy Advance version, Super Mario Advance 4: Super Mario Bros. 3, was released in 2003. This version features support for the Nintendo e-Reader peripheral, which allows the player to access additional levels stored on e-Reader cards, in addition to updated graphics, power-ups, and sound._NEWLINE_Super Mario Bros. 3 was rereleased in emulation as a downloadable Virtual Console game in 2007 for the Wii and in 2014 for the Nintendo 3DS and Wii U consoles. 
It is one of thirty pre-installed games in the NES Classic Edition console, and is on the Nintendo Switch Online service. + _START_ARTICLE_ Mobile Public Library _START_SECTION_ History _START_PARAGRAPH_ The Mobile Public Library has roots going back to the 1850s, when it was started as a subscription organization by the Franklin Society. The library was officially established as the Mobile Public Library in 1902 and was originally housed in an antebellum structure at the corner of Conti and Hamilton Street. The library association appealed to city leaders in the late 1910s to provide operating funds for the library, and it offered to give the city the library property if it would build a new building to house the collections. The city declined to finance the construction of a new building, but did approve operating funds on 2 April 1918. _NEWLINE_Due to increasing public demand for a library, on 15 December 1925, the city commissioners voted to schedule a special election on a $250,000 bond issue. The voters approved the bond and, along with a gift of $30,000 from Eli H. Bernheim of New York City, the new library building was constructed. Noted Mobile architect George Bigelow Rogers designed the building in the Classical Revival style. The new structure, now known as the Ben May Main Library, was opened on 15 September 1928._NEWLINE_The state had passed racial segregation laws at the turn of the century after disenfranchising most blacks and many poor whites in the state, excluding them from politics. Mobile's African-American community did not have access to a public library until one was completed for them in 1931; it was known as the Davis Avenue Branch. It was also designed by George Bigelow Rogers. It was funded by a city bond issue and the city's sale of the old library property on Conti Street._NEWLINE_The Ben May Main Library building is a contributing building to the Church Street East Historic District, which was listed on the National Register of Historic Places on 16 December 1971. The system opened a new branch, the West Regional Branch, in 2002, with First Lady Laura Bush making an address. Beginning in 2006, the Ben May Main Library building was restored and expanded by 22,000 square feet (2,000 m²). It was reopened on 31 May 2007. _START_SECTION_ Services _START_PARAGRAPH_ In addition to basic services, participation in several interlibrary loan systems, and internet access at all locations, the Mobile Public Library provides a range of other services. Free library cards are made available to all residents in the Alabama counties of Baldwin, Washington, Clarke, Monroe, Escambia and Conecuh. Alabama Virtual Library (AVL) cards are also made available for free at all branches. _START_SECTION_ Local history and genealogy _START_PARAGRAPH_ The Local History and Genealogy Division includes works by local authors, Mobile histories, periodicals, Mobile newspapers on microfilm from 1819 to the present, city directories from 1837 onward, federal census records for most of the Southeastern United States, and the Mobile Historic Development Commission's survey of historic architecture in Mobile with 10,000 images stored and indexed on CD-ROM. _START_SECTION_ Youth services _START_PARAGRAPH_ This department attempts to meet the needs and interests of children and young adults through the various library collections, services and programs. Books, movie DVDs and VHS, music CDs, audiobooks, back issues of magazines, and video games are available to be checked out. 
Story time for young children is provided at most library locations. _START_SECTION_ Disabilities _START_PARAGRAPH_ All branches provide handicapped access, materials, and services for patrons with disabilities. A few of the services provided are magnifying glasses, large type books, closed captioned videos, books for and about the handicapped, and instructional books and videos on sign language. In addition, recorded books on discs and cassettes and the equipment for using them are available on free loan to eligible individuals from the Alabama Regional Library for the Blind and Physically Handicapped in Montgomery, Alabama. _START_SECTION_ Bookmobile _START_PARAGRAPH_ The library operates a bookmobile three days a week at over 30 different stops across Mobile County. Each location is visited every three weeks. + _START_ARTICLE_ Daugherty Furniture Building _START_SECTION_ History _START_PARAGRAPH_ The furniture store was originally located in Fork Mountain, Tennessee, but was moved to Clinton in the late 1930s because of the population growth of Anderson County. More than 75,000 individuals had moved to the area in a two-year period, and the demand for household items was high. Daugherty Furniture Store served as the only "one-stop shop" for Anderson County residents. The store offered home delivery, with a fleet of delivery trucks. The third and fourth floors of the building served as apartments, which housed Oak Ridge workers and scientists in need of housing in a region running out of housing options. The business expanded into the nearby regions, and the main location served as a meeting place for local businessmen. The attention and popularity of the main location helped to develop the neighborhood of Market and Main Streets in Clinton._NEWLINE_The store was owned and operated by the family, and most of the employees were family members. Daugherty's brothers, Leonard, Emmitt, and Laford, worked for the shop, with Leonard serving as a manager overseeing deliveries and purchases. Their sons, along with Daugherty's sons, all worked at the store and remained working there until its closing in 1985. All employees were trained to be familiar with the store's merchandise, including demonstrating Singer sewing machines. Employees also provided repair services, both in-store and in the homes of customers. Many employees lived in the upstairs apartments, and Daugherty lived in the corner unit and oversaw the company until his death in 1985. _START_SECTION_ Architecture _START_PARAGRAPH_ The original building was located in Fork Mountain, Tennessee, and was a simple one-story building. When Daugherty moved to Clinton in 1935, he rented two buildings across the street from where the current building stands. While planning to build a new location, he found inspiration in a building in Oak Ridge owned by Elza Gate. Gate's home, known as the Glenn Copeland House, was entirely faced in stone (except the roof). Daugherty would use Frank Gilbreath as an advisor on building his store; Gilbreath served as stonemason on the Glenn Copeland House._NEWLINE_When Daugherty began exploring options for building his store, his friends and colleagues advised him to avoid building a multi-story building. He hired architect Clem H. Meyer, who had designed Huntsville High School in Scott County, Tennessee. 
Oba Hill was also hired; he worked alongside Gilbreath on the stone work._NEWLINE_About 99,000 pounds of locally quarried stone was used on the building's exterior, most of which came from areas near the New River region of Morgan County and Scruggs Farm in Bethel. All of the stone was hand-chiseled and laid by Gilbreath and Sebastian Marie, another local stone cutter. When the building was completed in 1942, it was the largest commercial building in Clinton, along with Magnet Mills, Inc._NEWLINE_The Daugherty Furniture Building does not follow traditional architectural styles, but serves as an example of vernacular architecture, using local construction methods and materials. The building's interior, from the fourth floor to the basement, looks like an inverted step pyramid. The load-bearing walls are stair-stepped, so the fifth-floor walls are thinner than those in the basement; the basement walls are 26 inches thick, while the fifth-floor walls are 12 inches thick. The architecture of the building was also influenced by Meyer's work on Huntsville High School, using techniques found in that project. Examples include the rectangular shape, stone exterior, and rectangular steel windows. The design is also minimalistic, reflecting the wartime emphasis on simplicity seen in design during the period. + _START_ARTICLE_ Waka (poetry) _START_SECTION_ Etymology _START_PARAGRAPH_ The word waka has two different but related meanings: the original meaning was "poetry in Japanese" and encompassed several genres such as chōka and sedōka (discussed below); the later, more common definition refers to poetry in a 5-7-5-7-7 metre. Up to and during the compilation of the Man'yōshū in the eighth century, the word waka was a general term for poetry composed in Japanese, and included several genres such as tanka (短歌, "short poem"), chōka (長歌, "long poem"), bussokusekika (仏足石歌, "Buddha footprint poem") and sedōka (旋頭歌, "repeating-the-first-part poem"). However, by the time of the Kokinshū's compilation at the beginning of the tenth century, all of these forms except for the tanka and chōka had effectively gone extinct, and chōka had significantly diminished in prominence. As a result, the word waka became effectively synonymous with tanka, and the word tanka fell out of use until it was revived at the end of the nineteenth century (see Tanka)._NEWLINE_Tanka (hereafter referred to as waka) consist of five lines (句 ku, literally "phrases") of 5-7-5-7-7 on or syllabic units. Therefore, tanka is sometimes called Misohitomoji (三十一文字), meaning it contains 31 syllables in total. _START_SECTION_ Kamakura and Muromachi periods _START_PARAGRAPH_ After the Heian period, during the Kamakura period and later, renga, a form of collaborative linked poetry, began to develop. In the late Heian period, three of the last great waka poets appeared: Fujiwara no Shunzei, his son Fujiwara no Teika, and Emperor Go-Toba. Emperor Go-Toba ordered the creation of a new anthology and joined in editing it. The anthology was named Shin Kokin Wakashū. He edited it again and again until he died in 1239. Teika made copies of ancient books and wrote on the theory of waka. His descendants, and indeed almost all subsequent poets, such as Shōtetsu, taught his methods and studied his poems. 
The courtly poetry scenes were historically dominated by a few noble clans and allies, each of which staked out a position._NEWLINE_By this period, a number of clans had fallen by the wayside, leaving the Reizei and the Nijō families; the former stood for "progressive" approaches, the varied use of the "ten styles" and novelty, while the latter conservatively hewed to already established norms and the "ushin" (deep feelings) style that dominated courtly poetry. Eventually, the Nijō family became defunct, leading to the ascendancy of the "liberal" Reizei family. Their innovative ascendancy was soon ended by the Asukai family, aided by the Ashikaga shōgun, Ashikaga Yoshinori._NEWLINE_In the Muromachi period, renga became popular in the court and among the people around it. It spread to the priestly classes and thence to wealthy commoners. In much the same way as waka, renga anthologies were produced under the imperial aegis. As momentum and popular interest shifted to the renga form, the tanka style was left to the Imperial court. Conservative tendencies exacerbated the loss of vitality and flexibility. A tradition named Kokin-denju, the heritage of Kokin Wakashū, was developed. It was a system for analyzing the Kokin Wakashū and included the secret (or, more precisely, lost) meanings of words. Studying waka degenerated into learning the many intricate rules, allusions, theories, and secrets, so as to produce tanka that would be accepted by the court._NEWLINE_There were comical waka already in the Kojiki and the Man'yōshū, but the noble style of waka in the court inhibited and scorned such aspects of waka. Renga was soon in the same position, with many codes and strictures reflecting literary tradition. Haikai no renga (also called just haikai (playful renga)) and kyōka, comical waka, were a reaction to this seriousness. But in the Edo period, waka itself lost almost all of its flexibility and began to echo and repeat old poems and themes. _START_SECTION_ Edo period (1603–1867) _START_PARAGRAPH_ In the early Edo period, waka was not a fashionable genre. Newly created haikai no renga (of whose hokku, or opening verse, haiku was a late 19th-century revision) was the favored genre. This tendency persisted throughout the period, but in the late Edo period waka faced new trends from beyond the court. Motoori Norinaga, the great reviver of traditional Japanese literature, attempted to revive waka as a way of providing "traditional feeling expressed in genuine Japanese way". He wrote waka, and waka became an important form to his followers, the Kokugaku scholars._NEWLINE_In Echigo Province a Buddhist priest, Ryōkan, composed many waka in a naïve style, intentionally avoiding complex rules and the traditional way of waka. He belonged to another great tradition of waka: waka for expressing religious feeling. His frank expression of feeling found many admirers, then and now. In the cities, a comical, ironic and satiric form of waka emerged. It was called kyōka (狂歌), or "mad poem", and was loved by intellectual people in big cities like Edo and Osaka. It was not precisely a new form; satirical waka was a style known since ancient times. But it was in the Edo period that this aspect of waka developed and reached an artistic peak. Still, most waka poets kept to ancient tradition or turned those reforms into another stereotype, and waka was generally not a vibrant genre at the end of this period. 
+ _START_ARTICLE_ 1982 Monaco Grand Prix Formula Three _START_PARAGRAPH_ Results from the 1982 Monaco Grand Prix Formula Three held in Monte Carlo on May 22, 1982, at the Circuit de Monaco. + _START_ARTICLE_ John Jairo Castillo _START_SECTION_ Club career _START_PARAGRAPH_ He signed with the team in January 2008 on a long-term contract. He has played for numerous Colombian teams such as Deportes Tolima in the Copa Mustang and also for Bolivian club Oriente Petrolero and most recently for Venezuelan club Guaros FC. He has recently been training with Deportivo Cali. + _START_ARTICLE_ The Rainbow Agenda _START_PARAGRAPH_ The Rainbow Agenda was a set of demands put forth by a coalition of student groups at Stanford University in the late 1980s. Inspired by Jesse Jackson's Rainbow Coalition (now Rainbow/PUSH), Stanford's Rainbow Coalition demanded that the university "explore the critical concerns of minority students, faculty, and staff at Stanford University". _START_SECTION_ History _START_PARAGRAPH_ On January 17, 1987, some 500 Stanford students marched with Jesse Jackson to celebrate a new course at Stanford to replace its previous "Western Culture" requirement. In 1988, the Faculty Senate voted to change the course to "Cultures, Ideas, and Values" (CIV)._NEWLINE_In response to backlash, a "Rainbow Coalition" was formed, a coalition of student groups which made a number of demands of the university (the Rainbow Agenda). These student groups included the Black Student Union, MEChA, the Asian American Student Association, and the Stanford American Indian Organization. The demands included requests concerning student and faculty diversity, support for community centers, and a "renewed commitment to discourage Indian mascot fanatics." _NEWLINE_On 15 May 1989, students from the Rainbow Coalition also occupied President Donald Kennedy's office, to "emphasize the need for an Asian American Studies tenure-track professor; a full-time dean for El Centro Chicano, the Chicano student center; and a discrimination review board to act on complaints of racial slurs and incidents. Fifty-five students... were arrested." _START_SECTION_ Impact _START_PARAGRAPH_ The Rainbow Coalition was renamed the Students of Color Coalition (SOCC) in 1994._NEWLINE_Twenty-five years after its founding, SOCC has recently been one of the most influential student groups in determining the result of student elections on campus._NEWLINE_"Cultures, Ideas, and Values" was eventually replaced by IHUM._NEWLINE_The Stanford Review was founded in 1987 partly to provide an alternative viewpoint to issues raised by the Rainbow Coalition. + _START_ARTICLE_ 1997–98 Bradford City A.F.C. season _START_SECTION_ Season summary _START_PARAGRAPH_ In the 1997–98 season, Bradford started well with 13 points from a possible 15, which saw the Bantams top of the table after five games, but results declined and chairman Geoffrey Richmond sacked Kamara on 6 January, three days after a 2–0 FA Cup defeat to Manchester City._NEWLINE_Richmond turned to Jewell, who was by now Kamara's assistant, and he won his first game 2–1 against Stockport County. In his 21 games in charge, Jewell won six games and drew five to guide Bradford to 13th, their highest position since Jewell had joined the club. He was rewarded with a permanent contract when others expected Richmond to turn to a big name. 
_START_SECTION_ Results _START_PARAGRAPH_ Bradford City's score comes first + _START_ARTICLE_ Ulaanbaatar Hotel _START_SECTION_ History _START_PARAGRAPH_ The infamous Anastasia Filatova, the wife of Mongolian communist leader Yumjaagiin Tsedenbal, and de facto co-ruler of the country, was personally involved in the construction and design. She chose the best workers and designers available at the time to complete the hotel, which was designed to be a flagship property for the Mongolian hospitality industry. Senior employees say that she personally picked the colors and design for the lobby and main hall._NEWLINE_It was the first public building with running hot water; in the 1960s, Mongolian elites used to rent rooms by the hour to enjoy a hot bath or shower. A number of foreign embassies were quartered at the hotel during the 1980s and 1990s._NEWLINE_The last embassy, the Turkish mission, moved out in 1997. Ever since its foundation the hotel has been frequented by politicians and lobbyists. During the Democratic revolution of 1991, Communist rulers used the hotel to meet unofficially with the democratic activists. The future fate of the country was decided during these meetings. The hotel has also become a cultural phenomenon: more than a dozen movies were filmed here._NEWLINE_Julia Roberts, Demis Roussos, Richard Gere, Steven Seagal, the Dalai Lama, Fradkov, Andre Kim and Alsou are among the famous guests who have stayed at the hotel. + _START_ARTICLE_ Ireland–Spain relations _START_SECTION_ Early relations _START_PARAGRAPH_ The first awareness and contact between both nations was through stories about Celtic migration from Iberia to Ireland as mentioned in the Lebor Gabála Érenn regarding the Milesians. The first diplomatic contact between Irish and Spanish nobility happened in April 1529, when the Spanish ambassador, Don Gonzalez Fernandez, visited Ireland and met with the 10th Earl of Desmond. The agreement, known as the Treaty of Dingle, gave a formal legal and constitutional foundation to the rights of citizenship and other privileges that Irish exiles and emigrés enjoyed in Habsburg Spain, Habsburg Austria and the Habsburg Netherlands from the 16th to the early 20th centuries. Both nations felt united in their common belief in Catholicism, but this was not an issue in 1529. In 1554–58 Philip, Prince of Asturias, was married to Mary I and was named as titular King of Ireland in the Papal Bull Ilius ad quem. As a result, during the first plantations of Ireland, what is now County Offaly was shired as "King's County", and Philipstown (now Daingean) was named in his honour, the first Irish place named after someone from Spain. Soon after Mary's death he succeeded as Philip II of Spain._NEWLINE_In 1601, Spain supported Irish rebels fighting against England during the Nine Years' War, and especially during the Siege of Kinsale. At the time, the Catholics of Ireland saw Spain as a potential liberator of their country from Protestant England, and in 1595 Hugh O'Neill offered the crown of Ireland to Philip II of Spain. Philip refused the offer, having already been the titular King of Ireland. _NEWLINE_Many Irishmen in rebellion against English rule subsequently sought refuge in Spain following the Flight of the Earls (1607), and for the next two centuries Irish soldiers contributed to the Spanish Army of Flanders and fought side by side during the Dutch Revolt and during the Thirty Years' War. 
During the Anglo-Spanish War (1625–1630) proposals were made in 1627 to launch an invasion of Ireland under Shane O'Neill and Hugh O'Donnell, but these did not go further than the planning stage._NEWLINE_Irish soldiers in the service of Spain also participated in the colonization of the Americas. Several prominent Spanish officials of Irish origin governed and administered Spanish colonies as viceroys, such as Ambrosio O'Higgins in Peru and Juan O'Donojú in Mexico, or became ministers in the Spanish government, most notably Leopoldo O'Donnell and his relatives Carlos O'Donnell and Juan O'Donnell. At the same time, during the independence wars of Gran Colombia, several thousand Irish soldiers fought for South America against Spain and made up parts of the British Legions. _START_SECTION_ Independence for the Republic of Ireland and the Spanish Civil War _START_PARAGRAPH_ In January 1801, Ireland became a part of the newly created United Kingdom of Great Britain and Ireland, and all relations between Ireland and Spain were henceforth carried out through the Court of St James's. Napoleon's Irish Legion took part in suppressing the Dos de Mayo Uprising. After the Napoleonic Wars the Regiment of Hibernia was disbanded at the demand of their British allies._NEWLINE_In December 1922, most of Ireland gained a form of independence within the British Empire as the Irish Free State and, in 1924, diplomatic relations were officially established between the new Irish Free State and the Kingdom of Spain. That same year, Spain opened its first consulate in Dublin. In 1935, the first Irish Minister was appointed to Spain with residence in Madrid._NEWLINE_In 1936, Spain was engulfed in a civil war between the Republican faction, led by President Manuel Azaña, and the Nationalist faction, led by General Francisco Franco. The Free State was a member of the Non-intervention Committee. However, Ireland (North and South) was divided in opinion over the war. Many Irish Catholics sided with Francisco Franco. A smaller number sided with the Spanish Republican faction. In 1936 the Irish Christian Front was established to financially support Francisco Franco, and the Irish Brigade was created to fight for the Nationalist side, contributing 700 Irish volunteer soldiers to Franco. The Irish-American politician Joe Kennedy stopped the US Congress from supplying arms to the Republic. At the same time, 250 Irish socialist volunteers joined the International Brigades and fought for the Spanish Republican faction. Both Irish factions took part in the Battle of Jarama in February 1937. By the summer of 1937, the Irish Brigade was "disarmed and ordered out of Spain by Franco" (Fearghal McGarry); most of the socialists stayed until late 1938, although they were frequently treated as pariahs on their return home, and many emigrated to the UK. After the war ended in 1939, the Irish Minister presented his credentials in Burgos and formally recognized the new Spanish government under General Franco. _START_SECTION_ Post-war relations _START_PARAGRAPH_ In 1973, both Northern Ireland (as part of the United Kingdom) and the Republic of Ireland joined the European Economic Community (EEC), and Spain followed suit in 1986. In July 1986, King Juan Carlos I of Spain paid his first official visit to the Republic of Ireland. In 1993, Mary Robinson became the first Irish President to pay an official visit to Spain. Since then, there have been numerous visits between leaders of both states. 
Recently, in January 2017, Irish Taoiseach Enda Kenny paid a visit to Spain._NEWLINE_Spain has increasingly become an important tourist destination for Irish travelers. In 2016, 1.4 million Irish citizens visited Spain for tourism. At the same time, 263,000 Spanish tourists visited Ireland. In 2016, 35,000 Spanish nationals studied English in Ireland. Several Irish and Spanish airlines provide direct services between both nations. _START_SECTION_ Bilateral agreements _START_PARAGRAPH_ Both the Republic of Ireland and Spain have signed several bilateral agreements (mostly prior to both states joining the European Union), such as an Agreement on the Exchange of Diplomatic Pouches (1935); Agreement on the Exchange of Information regarding Meteorology (1950); Extradition Treaty (1957); Cultural Cooperation Agreement (1980); Spanish Agreement on Renouncing Historic Rights of Fishing in Irish Waters (1980) and an Agreement on the Avoidance of Double Taxation (1994). _START_SECTION_ Trade _START_PARAGRAPH_ In 2015, trade between the Republic of Ireland and Spain totaled 4.5 billion euros. Ireland's exports to Spain include pharmaceutical products, electrical equipment, perfume and chemical-based products. Spain's exports to Ireland include automobiles, clothing and organic chemical products. That same year, Irish investments in Spain totaled 200 million euros while, at the same time, Spanish investments in the Republic of Ireland totaled 4 billion euros. + _START_ARTICLE_ Tokyo College of Music _START_SECTION_ History _START_PARAGRAPH_ The college moved to Toshima in Tokyo in 1924 after the original campus was destroyed by the Great Kantō earthquake. + _START_ARTICLE_ Fort Pond Bay _START_PARAGRAPH_ Fort Pond Bay is a bay off Long Island Sound at Montauk, New York, that was the site of the first port at the end of Long Island. The bay has a long naval and civilian history. _START_SECTION_ New-York Province and the American Revolution _START_PARAGRAPH_ Fort Pond Bay was first listed by name in a 1655 map, published in 1680 by John Scott, which makes note of a Montaukett Native-American fort on its banks._NEWLINE_Early settlers in the area raised cattle and sheep on the bluffs above the bay. During the American Revolutionary War, at the time of the Siege of Boston, British warships sailed into the bay in 1775. Local militia under Captain John Dayton feigned having more men than they had, turning their coats inside out as they marched back and forth on top of a high hill to the south. The tactic is called Dayton's Ruse._NEWLINE_Long Island was occupied throughout the war and the bay was used by the British for their blockade of Connecticut. In 1781 HMS Culloden ran aground while pursuing a French frigate during a January storm. The ship, which survived the initial grounding, hit a rock and had to be scuttled in the bay at Culloden Point and burned, with its cannons thrown overboard. Its debris field and wreck site are now the only underwater park on Long Island._NEWLINE_In the late 18th century, the small fishing village of Montauk was established at the southeast corner of the bay. _START_SECTION_ The 19th century and today _START_PARAGRAPH_ In 1839 the slave ship Amistad anchored in the bay (also at Culloden Point) when the surviving crew tried to convince their captors, the slaves who had revolted, that they had returned to Africa as the latter went for provisions in the village of Montauk. 
The ship was seized by the USS Washington in the bay._NEWLINE_In the 1890s, Austin Corbin extended the Long Island Rail Road from Bridgehampton, New York, to the Montauk fishing village (the line extension was called the Fort Pond Railway). His friend Arthur Bensen purchased 10,000 acres (40 km²) of Montaukett land around the village, and the LIRR began advertising that it could cut a day off ship travel by docking in Montauk and taking the train rather than going to New York. Corbin built a steel pier into the bay for the overseas ships (even as the Corps of Engineers continued to caution against using the bay because of rocks)._NEWLINE_The dream never materialized, and the U.S. Army bought the land for Camp Wikoff. Following the Spanish–American War, Theodore Roosevelt and his Rough Riders came by transport into the bay to be quarantined at the camp over concerns about yellow fever._NEWLINE_The fishing village was obliterated by the Great Hurricane of 1938. The Navy took over the area for a seaplane and dirigible base during World War II (the dock is still in use). The Montauk fishing village was moved a mile south, closer to the Atlantic Ocean._NEWLINE_During the years after World War II, the bay ceased to be used by most boats because of flooding and rocks. Boats now dock in the dredged Lake Montauk. In the 1960s the bluffs above the bay were used to build Leisurama homes as inexpensive second homes that had been inspired by the Kitchen Debate between Richard Nixon and Nikita Khrushchev. + _START_ARTICLE_ William Johnson (cricketer, born 1884) _START_PARAGRAPH_ William James Johnson (22 September 1884 – 14 August 1941) was a wine and spirit grocer and keen cricketer who played one first-class match for Victoria in 1924–25. He was later a selector of the Australian Test team._NEWLINE_Johnson's son, Ian, went on to captain the Australian Test cricket team. + _START_ARTICLE_ Sports broadcasting contracts in Kosovo _START_SECTION_ Basketball Cups _START_PARAGRAPH_ Kosovo Basketball Cup: ArtMotion (2018–2021) + _START_ARTICLE_ Resettable fuse _START_SECTION_ Operation _START_PARAGRAPH_ A polymeric PTC device is made up of a non-conductive crystalline organic polymer matrix that is loaded with carbon black particles to make it conductive. While cool, the polymer is in a crystalline state, with the carbon forced into the regions between crystals, forming many conductive chains. Since it is conductive (the "initial resistance"), it will pass a current. If too much current is passed through the device, it will begin to heat. As the device heats, the polymer will expand, changing from a crystalline into an amorphous state. The expansion separates the carbon particles and breaks the conductive pathways, causing the device to heat faster and expand more, further raising the resistance. This increase in resistance substantially reduces the current in the circuit. A small (leakage) current still flows through the device and is sufficient to maintain the temperature at a level which will keep it in the high-resistance state. Leakage current can range from less than a hundred mA at rated voltage up to a few hundred mA at lower voltages. The device can be said to have latching functionality. The hold current is the maximum current at which the device is guaranteed not to trip. The trip current is the current at which the device is guaranteed to trip._NEWLINE_When power is removed, the heating due to the leakage current will stop and the PPTC device will cool. 
As the device cools, it regains its original crystalline structure and returns to a low-resistance state where it can hold the current as specified for the device. This cooling usually takes a few seconds, though a tripped device will retain a slightly higher resistance for hours, slowly approaching the initial resistance value; how long this takes depends on how much power it has been passing and how often it has been tripped. Resetting will often not take place even if the fault alone has been removed while power is still flowing, as the operating current may be above the hold current of the PPTC. The device may not return to its original resistance value; it will most likely stabilize at a significantly higher resistance (up to 4 times the initial value). It could take hours, days, weeks or even years for the device to return to a resistance value similar to its original value, if at all._NEWLINE_A PPTC device has a current rating and a voltage rating. _START_SECTION_ Applications _START_PARAGRAPH_ These devices are often used in computer power supplies, largely due to the PC 97 standard (which recommends a sealed PC that the user never has to open), and in aerospace/nuclear applications where replacement is difficult. Another application for such devices is protecting audio loudspeakers, particularly tweeters, from damage when overdriven: by putting a resistor or light bulb in parallel with the PPTC device, it is possible to design a circuit that limits total current through the tweeter to a safe value instead of cutting it off, allowing the speaker to continue operating without damage when the amplifier is delivering more power than the tweeter can tolerate. While a fuse could also offer similar protection, if the fuse is blown, the tweeter cannot operate until the fuse is replaced. _START_SECTION_ Device trade names _START_PARAGRAPH_ These devices are sold by different companies under various trademarks, including Multifuse (Bourns), PolySwitch (TE Connectivity), and Polyfuse (Littelfuse). PolySwitch is the earliest product of this type, having been invented at Raychem Corporation (now TE Connectivity) and introduced in the early 1980s. Due to common availability, electronics engineers and technicians often refer to this device as a "polyswitch" or "polyfuse", in the generic sense, regardless of actual brand. + _START_ARTICLE_ Table Table _START_SECTION_ Expansion _START_PARAGRAPH_ The expansion of Table Table, as with Taybarns, slowed during the recession of 2009/10 as the company sought to consolidate its position. From April 2010 the company began expanding the Table Table brand further as it continued to be the most profitable restaurant brand relative to its number of sites. Although it made one of the smallest profits for Whitbread, Table Table was on average the second-best performing brand throughout 2009 on a site-average basis, behind Taybarns._NEWLINE_Table Table has continued this success, being the best-performing restaurant brand within Whitbread in 2013, 2014, 2015 and 2016._NEWLINE_The conversion of Brewers Fayres to Table Tables has now come to an end following a successful relaunch of the Brewers Fayre brand, and both brands are now expanding in their own markets. 
All new Table Table restaurants are new builds alongside new Premier Inns. _START_SECTION_ Future and trials _START_PARAGRAPH_ Despite the popularity of the brand, during summer–autumn 2012, a number of Table Table pubs were converted back to Brewers Fayre, such as The Brampton Hut and Howgate._NEWLINE_From 2014 a number of selected Table Table pub restaurants, including the Hobbs Boat in Weston, the Liskeard Tavern in Liskeard, The Carclaze in St Austell, The Roundstone in Littlehampton and The Globe Inn located in Christchurch, have been converted to a new brand called Whitbread Inn, harking back to the company's past._NEWLINE_Early 2017 saw some restaurants being moved to the Beefeater brand, including "Mosley Park" in Wolverhampton and "Liberty Bell" in Romford. + _START_ARTICLE_ Simon Duiker _START_PARAGRAPH_ Simon Liekele Eeltje Duiker (16 April 1874, Amsterdam – 6 March 1941, Amsterdam) was a Dutch painter._NEWLINE_Duiker lived and worked in Amsterdam while studying at its National Academy. He painted interior scenes and still-life paintings. His work depicts middle-class Dutch life, particularly men and women at work in the home._NEWLINE_Duiker was greatly influenced by the work of Jan Vermeer. Duiker, along with Jacques Snoeck and Gijbertus Jan Sijhoff, is considered one of the last great Dutch interior scene painters. Duiker's paintings are part of the Dutch National Collection (Rijkscollectie). + _START_ARTICLE_ Stierlitz _START_SECTION_ Character _START_PARAGRAPH_ In the universe of Seventeen Moments of Spring, Stierlitz is the cover name of a Soviet super-spy, Colonel Maxim Maximovich Isayev (Макси́м Макси́мович Иса́ев), whose "real" name is Vsevolod Vladimirovich Vladimirov (Все́волод Влади́мирович Владимиров)._NEWLINE_Stierlitz takes a key role in the SS Reich Main Security Office in Berlin during World War II, infiltrating the Ausland-SD (foreign intelligence) headed by Walter Schellenberg. Working deep undercover, Stierlitz tries to collect intelligence about the Germans' war plans and communicate it to Moscow. He receives instructions from Moscow on how to proceed, on one occasion traveling to Switzerland on a secret mission. He diverts the German nuclear "Vengeance Weapon" research program into a fruitless dead-end, thwarts separate peace talks between Nazi Germany, the United Kingdom and the United States, engages in intellectual games with members of the Nazi high command and sacrifices his own happiness for the good of his motherland. Despite being wracked with desire to return home to his wife, he subordinates his feelings to his duty, thus embodying an idealised Soviet vision of patriotism._NEWLINE_Stierlitz is quite the opposite of the action-oriented James Bond; most of the time he gains his knowledge without any Bond-style stunts and gadgets, while in the film adaptation of the stories the action is presented through a narrative voice-over by Yefim Kopelyan. He is presented in a deeply patriotic but non-ideological light, fighting to defend the Soviet motherland against external enemies rather than just defending the Communist government against its ideological opponents. _START_SECTION_ Influences in Russian life and culture _START_PARAGRAPH_ Although Stierlitz was a much-loved character, he was also the butt of a common genre of Russian jokes, often satirising his deductive trains of thought, with unexpected twists, delivered in the deadpan style of the voice-overs in the film adaptations; for example:_NEWLINE_Stierlitz approaches Berlin. 
The city is veiled in smoke from the fires. "Forgot to switch off the iron again," thought Stierlitz with slight irritation._NEWLINE_Stierlitz continues to be a popular character in modern Russia. Despite the fact that references and Stierlitz jokes still penetrate contemporary speech, Seventeen Moments of Spring is very popular mainly because it is quite patriotic. It is repeated annually on Russian television, usually around Victory Day. Stierlitz also continues to have a political significance. When his portrayer Vyacheslav Tikhonov died in December 2009, the Foreign Intelligence Service—one of the successor organisations of the former Soviet KGB—sent its condolences to his family. Ivan Zassoursky notes that Russian Prime Minister (and former and current President) Vladimir Putin, a former KGB agent, has been portrayed as "embod[ying] the image—very important for the Russian television audience—of Standartenführer von Stierlitz... If anyone missed the connection between Putin, who served in Germany, and von Stierlitz, articles in the press reminded them of the resemblance and helped create the association." The connection went both ways; Putin was strongly influenced by the novels, commenting: "What amazed me most of all was how one man's effort could achieve what whole armies could not."_NEWLINE_Stierlitz movies contributed a number of catchphrases, such as "Character: nordic, robust" (Характер — нордический, выдержанный, a personal characteristic, usually mocking or ironic). + _START_ARTICLE_ Thomas Hunter (New York politician) _START_PARAGRAPH_ Thomas Hunter (September 11, 1834 Baltimore, Maryland – March 11, 1903 Sterling, Cayuga County, New York) was an American businessman and politician from New York. _START_SECTION_ Life _START_PARAGRAPH_ He attended the common schools, worked on a farm, and then took part in the construction of the Manassas Gap Railroad. From 1857 to 1860, he engaged in the milling business in Sterling Valley. During the American Civil War he enlisted as a private in the 110th New York Volunteers, and finished the war as a captain. After the war he engaged in the lumber business, and then became a railroad contractor._NEWLINE_He was a member of the New York State Assembly (Cayuga Co., 1st D.) in 1881 and 1882._NEWLINE_He was a member of the New York State Senate (26th D.) from 1890 to 1893, sitting in the 113th, 114th, 115th and 116th New York State Legislatures. + _START_ARTICLE_ Ford Taunus P3 _START_SECTION_ High profile launch _START_PARAGRAPH_ The car received its public launch at the Beethoven Hall in Bonn. Unusually for a car launch, both the by now 84-year-old German chancellor, Konrad Adenauer, and the grandson of the firm's founder, Henry Ford, were present. In addition to the latest Ford Taunus, they were celebrating the thirtieth anniversary of Ford's Cologne plant. It had been on 2 October 1930 that Henry Ford and the then mayor of nearby Cologne, Konrad Adenauer, had laid the foundation stone for the Cologne Ford Plant. _START_SECTION_ European design _START_PARAGRAPH_ The first post-war Taunus models had been designed in North America. The Taunus P3 was designed by Uwe Bahnsen, a German born designer who would dominate car design at Ford of Germany for nearly thirty years and whose subsequent designs included the 1969 Ford Capri and its successors. Towards the end of his time in charge of design with Ford of Germany, Bahnsen also led the teams that designed the Fords Sierra and Scorpio. 
In the context of 1960 the Taunus P3 can nevertheless be seen as Bahnsen's most innovative design for a production car._NEWLINE_The 1960 Taunus design featured a recurring geometrical shape, which was a cross between a short sausage and a long lozenge. The rear panel and the side panels respected the same basic shape as did the front grill, subject to two large cut-outs for the headlights._NEWLINE_The same shape was carried over to the interior of the car, where the main dials and controls on the dash-board were surrounded by a thick frame in the same short-sausage (or very long lozenge) shape. The repetitious use of a single simple shape at different levels of the design gave the overall car a consistent visual unity which was in stark contrast to the high-finned flamboyance of the previous Taunus 17M and was seen at the time as a radical switch by Ford of Germany away from American styling in favour of European styling. There were no tail fins and there was very little decorative chrome included. The efficiency of its superficially much simpler design enabled Ford to boast that the 1960 car, despite being fractionally narrower on the outside, offered usefully more interior width than the car it replaced._NEWLINE_Despite the importance of sausages in German cuisine, the award for a catchy soubriquet was earned by the person who saw the car and was reminded not of a sausage but of a bathtub. It was and remains the "Badewanne" (bathtub) soubriquet that caught the eye of the press reporters, and it is as the "Badewannetaunus" that the car continues to be remembered by enthusiasts. _START_SECTION_ Technical Innovation _START_PARAGRAPH_ The P3 and the 1961 Citroën Ami were the first cars with rectangular or lozenge-shaped (non-round) headlights. This technical innovation was developed by lighting manufacturers Hella (Taunus) and Cibie (Ami). At the time, it was an unquestioned article of faith that headlights were round, and in the United States it was the law, so these new headlights were illegal there. Ten years later this had inspired European automakers to come up with various non-round headlamp shapes, though many had by 1970 settled on a standardised shared rectangular shape. _START_SECTION_ Body _START_PARAGRAPH_ Most of the cars were sold as two- or four-door sedans/saloons. A three-door "Turnier" station wagon was also available. The confident determination of the car's designers to celebrate the new decade with something new and different was reflected in the unusual positioning of the rear lights on the early station wagons, on the leading edge at the back of the roof of the car, as two red horizontal units lined up directly above the tailgate. Later P3 Turniers had their rear lights more conventionally positioned._NEWLINE_The P3 also followed the tradition of its predecessor in that coach-built two-door cabriolets and coupes were offered, converted by a traditional Cologne-based body builder called Karl Deutsch. However, these special-bodied cars appear to have been relatively expensive, and only about 150 were produced._NEWLINE_The cars were offered with an unusually broad choice of colour and interior trim options._NEWLINE_The early 1960s were a period of rapid expansion for the west European auto-industry, and export markets for the new 17M included Greece and Australia, where several cars were converted locally into "pickups" or, in Australian English, "utes". The 17M was also assembled in South Africa and Southern Rhodesia in right-hand drive. 
_START_SECTION_ Engine and running gear _START_PARAGRAPH_ The cars were all branded as Ford Taunus 17Ms which might have led observers who thought they had understood Ford Germany's naming conventions to conclude that the cars all came with 1.7 litre engines. In fact, there were three different engine sizes offered, being the 1498 cc unit first seen in the Taunus 15M of 1954, the 1698 cc unit originally introduced in 1957 to cope with the weight of the first Ford Taunus 17M and, from September 1961, a new larger 1757 cc engine. Power outputs initially ranged from 55 PS (40 kW; 54 hp) to 60 PS (44 kW; 59 hp), and these engine versions remained available throughout the model's four-year life, but several more powerful engines featuring raised compression ratios in response to the increased availability of higher octane fuels appeared during the four-year period: by 1964 the most powerful Ford 17M offered 75 PS (55 kW; 74 hp). In Sweden, the more powerful version first available in 1962 was sold as the "Taunus Sport", with the inclusion of a more lavish interior. Approximately 50% of the cars built were delivered with the smallest of the three engines, the 1498 cc unit._NEWLINE_The engines were all gasoline/petrol powered four-cylinder inline four-stroke water-cooled units._NEWLINE_Changing gear involved a column-mounted gear change, which by now was becoming increasingly mainstream in Germany. It was possible to specify a "Saxomat" automatic clutch with the three-speed transmission,: drivers content to accept a fully manual gear change system could also specify a four-speed gear box._NEWLINE_There were several important technical innovations during the four-year model run which no doubt go some way to explain the car's commercial success when compared to that achieved by its predecessor, and will have strengthened the Ford image in a market which had grown used to seeing Ford sales trailing those of General Motors’ Opel business. In April 1962 the 17M became the first mainstream production car in Germany to offer, as an option, disc brakes on the front wheels. Just over a year later front disc brakes became a standard fitting on all models. 1962 was also the year when the car acquired an "automatic starter" which reportedly made the traditional manual choke unnecessary. _START_SECTION_ Optional extras _START_PARAGRAPH_ Options available at extra cost included velour carpeting, grab handles for the passengers in the back, whitewall tires, a two tone paint finish and even a "make-up mirror" cleverly incorporated into the sun-visor on the passenger's side. _START_SECTION_ Commercial _START_PARAGRAPH_ The boldly styled and regularly upgraded Taunus P3 was a commercial success. 669,731 were produced. The figure includes 86,010 station wagons. In the sales statistics for several months of 1961/62 the success of the model even enabled Ford briefly to overtake Opel on the German market, becoming the second best selling auto-brand, beaten to the top spot only by Volkswagen._NEWLINE_The Taunus P3 was replaced by the Ford Taunus P5 which would come with a wider range of engines and which would sell at approximately the same rate. However, the overall market size was growing through the 1960s, and with it grew the sales of the Opel Rekord. After the Taunus P3 no future Taunus model would come close to challenging Opel’s dominance of the large lucrative middle-market portion of the German auto-market. 
_START_SECTION_ Fifty years on _START_PARAGRAPH_ The Taunus P3 continues to generate enthusiasm, and most of the surviving vehicles in Germany enjoy the financial privileges and responsibilities available to owners of cars designated and maintained as oldtimers (the term used in Germany for a classic car)._NEWLINE_In 2006, 484 Taunus P3 sedans/saloons were registered in Germany along with 21 "Turnier" station wagons. There were thought to be fewer than 10 in the USA, a relatively high proportion of registered and/or roadworthy vehicles in the Scandinavian countries, and a handful still surviving in other countries (for instance some of the countries in Eastern Europe and Latin America where it was marketed) where statistics are less readily accessible than in Germany. + _START_ARTICLE_ Deflowering (flowers) _START_PARAGRAPH_ Deflowering is a form of pruning that consists of removing flowers before they develop. It is similar to deadheading but stricter, as deadheading refers to the removal of faded flowers. Deflowering is usually performed on fruit-forming and seed-forming shrubs and trees in their first year. The aim is to prevent the plants from spending energy and nutrients on seed development before they establish themselves. Leaving some flowers and fruits to develop may be necessary for identification purposes. + _START_ARTICLE_ Ford railway station (Merseyside) _START_SECTION_ History _START_PARAGRAPH_ It opened for service on 1 June 1906 and closed on 2 April 1951. Passenger trains then only ran once a year on this line, transporting passengers for the Grand National, although this service also ceased in 1956. Demolition of Ford station was completed on 1 May 1959. _START_SECTION_ Reopening proposals _START_PARAGRAPH_ This section of the line still exists, although it has no passenger services running and is no longer electrified, with the only trains running being for engineer access to the Ormskirk line. Plans to open this section as part of Merseyrail's Northern Line have been put forward in Sefton's transport plan, with the first details to emerge about its possible reopening being published by the media on 28 February 2008. + _START_ARTICLE_ Hispanics and Latinos in Texas _START_PARAGRAPH_ Hispanic and Latino Texans are residents of the state of Texas who are of Hispanic or Latino ancestry. As of the 2010 U.S. Census, Hispanics and Latinos of any race were 38.2% of the state's population. Moreover, the U.S. Census shows that the 2010 estimated Hispanic population in Texas was 9.7 million and increased to 11.1 million in 2017 with a calculated 18% change from the 2010 Hispanic population estimate._NEWLINE_Tejano or Texano (Spanish for "Texan") is a term used to identify a Texan of Criollo Spanish or Mexican heritage. _START_SECTION_ Origins _START_PARAGRAPH_ The first European to see Texas was Alonso Álvarez de Pineda, who led an expedition for the governor of Jamaica, Francisco de Garay, in 1520. While searching for a passage between the Gulf of Mexico and Asia, Álvarez de Pineda created the first map of the northern Gulf Coast. This map is the earliest recorded document of Texas history. Moreover, the area of present-day Texas was claimed by Spain at this time._NEWLINE_Years later, in June 1527, an expedition led by Pánfilo de Narváez and Álvar Núñez Cabeza de Vaca set out to reach Florida and build a city, but the mission failed due to harsh weather and disease. 
Instead, the Spanish explorers were left shipwrecked off the coast of Texas, where they lived for around six years. After the years spent living in Texas among Indigenous civilization, Narváez and Cabeza de Vaca, along with some of their men, found their way back to Mexico City in 1536 and told stories about the extravagances witnessed in the north. Learning about this, the Spanish set out due north in 1539 with the purpose of discovering riches in places yet to be explored. One of the primary motives for the excursions was the discovery of gold._NEWLINE_The excursion of the Spanish in 1539 into the north, or what is today Texas, New Mexico, and Arizona, was led by the Spanish conquistador Francisco Vázquez de Coronado. On July 7, 1540, Coronado's army reached the outskirts of the rumored city of much gold, Cibola, near the upper Rio Grande, where the Spanish encountered massive resistance from Puebloans. The violence between the Spanish and the Puebloans continued at Cibola until the Puebloan soldiers inhabiting Cibola were forced to withdraw to a village where their wives and children had moved for shelter._NEWLINE_When the fighting settled, Coronado decided to explore the land more extensively, and one of the expeditions arrived in Texas in 1541, where they encountered groups of people from the Caddo tribe, leading to more events of violence. In the end, Coronado returned to New Spain in April 1542 and reported on the cruel reality of the cities in the north that were explored, describing them as not having any gold or silver. Soon after this, the Spanish decided to remain away from the north, or the present-day southwest region of the United States, for approximately 150 years, though expeditions led by Spaniards and not authorized by Spain did take place within those years. Until 1688, Spain essentially remained out of Texas._NEWLINE_Around 1688, the Spanish learned about French interventions occurring in the area of Texas, land that had already been claimed by Spain. This led to the actions taken by the Spaniard Alonso de León, the then governor of Coahuila, to march into Texas towards Fort St. Louis. Fort St. Louis was the location where the French had set themselves up. In April 1689, Alonso de León arrived with his army ready to take down the French fort and looking for any remaining French in the area. During the time there, de León was informed by some of the French he located that the Karankawa people had attacked them and left the fort in ruins, forcing the French to flee. A year after going back to New Spain, de León returned to Texas because he was concerned about the French returning to Spanish territory. Spanish activity in Texas remained minimal and only resumed when the French attempted to intervene._NEWLINE_In 1690, when de León returned to Texas, he had with him an army of about 100 men made up of soldiers and priests and built the first church in Texas, named San Francisco de los Tejas. The construction of this church was a major stepping stone for Spain, as Spanish Texas was headed to become an area of greater importance for Spain. After San Francisco de los Tejas was established, the construction of many more missions followed, such as Mision Nuestra Senora del Rosario and Nuestra Senora del Refugio. A year later, in January 1691, Domingo Terán de los Ríos was appointed to be the governor of Spanish Texas. Throughout the construction of various churches, the Spanish had interactions with different Indigenous groups. 
Soon enough, interracial marriages led to the development of different races such as mestizos, criollos, and culebras/mulattos. This led to the development of the caste system in Texas and throughout the southwest United States. During this time, Spain faced problems with the French, with the Natives, and also internally. As the years passed, other events, such as the American Revolution in 1775, led to more problems in Texas. Soon enough, Spain would have to face the ever-growing United States and the Mexican population while having problems with the Natives and the French._NEWLINE_Mexico declared its independence from Spain on September 16, 1810, and the war ended on September 26, 1821. With Mexico's independence from Spain, Texas became the property of Mexico. Around this time, the United States had obtained massive amounts of land from France through the Louisiana Purchase in 1803. In addition, under Mexican law, Texas was available for anyone to move to, and land grants were offered to empresarios. During this time, the population of Texas grew quickly. The population was not only Mexican but also included United States citizens, Native Americans and enslaved people. When people residing in Texas disagreed with Mexican law and did not follow it, Mexico ended all immigration into Texas. Such events led to Texas independence, which then led to the annexation of Texas and then to the Mexican–American War._NEWLINE_On February 2, 1848, the peace treaty, the Treaty of Guadalupe Hidalgo, was signed between Mexico and the United States; it essentially gave the United States much of the land that Mexico had owned in the north and established the Rio Grande as the border between Texas and Mexico. Moreover, Hispanics and Latinos already living in the territory that became part of the United States were given the opportunity to stay and obtain United States citizenship. While many chose to leave for their home country, many also decided to stay._NEWLINE_The major immigration of Mexicans into Texas began during the 1890s, as the growth and industrialisation of Texas created a plethora of jobs. _START_SECTION_ Demographics _START_PARAGRAPH_ Hispanics (of any race) were 7.1% of the state population in 1910. As of 2010, 45% of Texas residents had Hispanic ancestry; these include recent immigrants from Mexico, Central America, and South America, as well as Tejanos, whose ancestors have lived in Texas since as early as the 1700s. Tejanos are the largest ancestry group in southern Duval County and among the largest in and around Bexar County, including San Antonio, where over one million Hispanics live. The state has the second largest Hispanic population in the United States, behind California._NEWLINE_Hispanics dominate southern, south-central, and western Texas and form a significant portion of the residents in the cities of San Antonio, Dallas, Houston, and Austin. The Hispanic population contributes to Texas having a younger population than the American average, because Hispanic births have outnumbered non-Hispanic white births since the early 1990s. In 2007, for the first time since the early nineteenth century, Hispanics accounted for more than half of all births (50.2%), while non-Hispanic whites accounted for just 34%._NEWLINE_Steve Murdock, a demographer with the Hobby Center for the Study of Texas at Rice University and a former director of the U.S. 
Census Bureau, predicted that, between 2000 and 2040 (assuming that the net migration rate will equal half that of 1990-2000), Hispanic public school enrollment will increase by 213 percent, while non-Hispanic white enrollment will decrease by 15 percent. As of 2010, 29.21% (6,543,702) of Texas residents age 5 and older spoke Spanish at home as a primary language. _START_SECTION_ Spanish language in Texas _START_PARAGRAPH_ In Texas, English is the state's de facto official language (though it lacks de jure status) and is used in government. However, the continual influx of Spanish-speaking immigrants has increased the importance of Spanish in Texas. Texas's counties bordering Mexico are mostly Hispanic, and consequently, Spanish is commonly spoken in the region. The Government of Texas, through Section 2054.116 of the Government Code, mandates that state agencies provide information on their websites in Spanish to assist residents who have limited English proficiency. _START_SECTION_ Origins _START_PARAGRAPH_ From 1915 to 1919, during the Mexican Revolution, Mexicans and Tejanos in South Texas faced increased violence from Texas Rangers. Due to tensions caused by changes in both governments and the border, people of Latino descent were hanged, shot, burnt, decapitated, and tortured. A Texas legislative investigation ended this period of violence by finding the Texas Rangers guilty. More recently, the Texas government has acknowledged this period of history with the "Life and Death on the Border, 1910 to 1920" exhibit._NEWLINE_Anti-Latino attitudes spiked during the Great Depression. Latinos, among other foreigners, were accused of stealing jobs from Americans and contributing to the decline of the economy. In response to the growing Anglo-American frustration, the United States government forcibly removed 2 million Latinos, the majority of them American citizens. During these repatriations, local governments denied aid to those of Mexican descent, offered train fares to Mexico and raided Latino communities. Hospitals removed Latinos with disabilities and illnesses while employers laid off Latino workers. To avoid raids and discrimination, many Latinos returned to Mexico voluntarily. By 1936, approximately one third of Texas's Latino population had been repatriated._NEWLINE_Such sentiments had earlier heightened in the 1840s with the end of the Mexican-American War and the signing of the Treaty of Guadalupe Hidalgo. The increased population of Latinos was met with further illegal deportations, violence, racism, and segregation. One instance of these reactions was the Olvera Street raid of 1931. During this raid, law enforcement and immigration agents arrested and deported nearly 400 Mexican-Americans regardless of their citizenship or immigration status in America. _START_SECTION_ Mob Violence _START_PARAGRAPH_ Anti-Latino sentiments grew during the California Gold Rush as many Latinos demonstrated more advanced mining skills than their white counterparts. From the Gold Rush era of the mid-19th century to the early 20th century, mob violence against Spanish-speaking individuals became a common occurrence, and the number of victims reached well into the thousands. During this period, Texas Rangers carried out lynchings of Hispanic men, women, and children for accusations that included cattle theft, murder, witchcraft, and even refusal to play the fiddle. 
Some case studies included the 1880 burning of Refugio Ramírez and his family by a mob in Collin County, North Texas, for the alleged bewitching of neighbors. Another event was the Porvenir Massacre of 1918, which involved the seizure and killing of 15 men and boys from the village of Porvenir in Presidio County, Texas. Although Texas Rangers justified the murders by accusing the people of being "thieves, spies and murderers", the United States Army's and the State Department's investigations found that the denizens of Porvenir were unarmed and innocent. As a result, the Texas state government began an investigation of the Texas Rangers. _START_SECTION_ Environmental Racism _START_PARAGRAPH_ With a high number of chemical industries and facilities, various neighborhoods within Houston are susceptible to toxic air pollution. The communities closest to these environmentally hazardous spaces are low-income communities of color. Located in East Houston, Harrisburg/Manchester and Galena Park are the two communities with the closest proximity to Risk Management Plan (RMP) facilities, or facilities that use certain hazardous substances._NEWLINE_Both Harrisburg/Manchester and Galena Park are largely made up of impoverished Latino communities with average household incomes of $49,732 and $45,431, respectively. Due to their close proximity to RMP facilities, the people of these neighborhoods are at a 24 to 36 percent higher risk of getting cancer when compared to the predominantly white neighborhoods of Houston. Harrisburg/Manchester is geographically centered in the middle of "21 Toxic Release Inventory (TRI) reporting facilities, 11 large quantity generators of hazardous waste, 4 facilities that treat, store or dispose of hazardous waste, 9 major dischargers of air pollutants, and 8 major stormwater discharging facilities". An average of 484,000 pounds of toxic chemicals is released into the Harrisburg/Manchester air, while none is released in communities with an average household income of $226,333 and a poverty rate of 3 percent. _START_SECTION_ School Segregation _START_PARAGRAPH_ Spanning from the 1890s to the 1980s, 122 school districts throughout 59 counties established segregated schools for Mexican-Americans. These poorly developed schools lacked an adequate schooling environment. Teachers possessed no credentials or experience, while the classrooms lacked the necessary equipment. School administrators often placed Tejano students into 'low-track' classes. By assessing Tejano students on biased rubrics that evaluated mental, emotional, and language abilities, school officials classified Tejano students as inferior and underdeveloped. Beginning with elementary schools, administrators assigned Tejano children to low-level and nonacademic courses, aiming to lead the students to vocational or general-education courses. Due to unequal educational platforms, disregard for Tejano culture, and linguistic intolerance, Hispanic students had higher withdrawal rates and lower academic performance. + _START_ARTICLE_ So You Won't Squawk _START_SECTION_ Plot _START_PARAGRAPH_ Handyman Eddie (Buster Keaton) is mistaken for gangster Louie the Wolf (Eddie Fetherston). Louie encourages this deception and lets rival gangster Slugger McGraw (Matt McHugh) think Eddie is him. Slugger attempts to kill Eddie many times. After one final attempt a car chase ensues, with Eddie throwing various items out the window to get the attention of the police. 
_START_SECTION_ Production _START_PARAGRAPH_ The chase scene is a reworking of Buster's final chase from his feature Le Roi des Champs-Élysées (1934). + _START_ARTICLE_ Rehearsal letter _START_PARAGRAPH_ A rehearsal letter is a boldface letter of the alphabet in an orchestral score, and its corresponding parts, that provides the conductor, who typically leads rehearsals, with a convenient spot to tell the orchestra to begin at places other than the start of movements or pieces. Rehearsal letters are most often used in scores of the Romantic era and onwards, beginning with Louis Spohr. Rehearsal letters are typically placed at structural points in the piece. _START_SECTION_ Terminology _START_PARAGRAPH_ They may also be generically called rehearsal marks or rehearsal figures, or, when numbers are used instead of letters, rehearsal numbers. _START_SECTION_ Purpose _START_PARAGRAPH_ In the course of rehearsing a symphony or piece, it is often necessary for the conductor to stop and go back to some point in the middle, in order to master the more difficult passages or sections, or to resolve a challenge that the ensemble is having. Many scores and parts have bar numbers, every five or ten bars, or at the beginning of each page or line. But as pieces and individual movements of works became longer (extending to several hundred bars) as the Romantic era progressed, bar numbers became less practical in rehearsal._NEWLINE_For example, a conductor can tell their musicians to resume at bar 387, so that the musicians have to find the nearest bar number in their parts (e.g. 385 or 390) and count back or forward a couple of measures. Even if the number 387 is written at the appropriate bar, it might not particularly stand out. But if there is, for example, a big, bold letter M in the score and parts, it is much easier for the conductor to just say "begin at letter M". Even if the conductor were to say "one bar before letter M", that would still be more convenient than saying "bar 386". Alternatively the conductor could first say "before M..." and allow the players time to find M and then say "one bar"._NEWLINE_In the score of a full orchestra, rehearsal letters are typically placed over the flutes' (or piccolo's) staff, and duplicated above the first violins' staff. For concert bands, rehearsal letters are placed over the piccolo's staff (or flutes'), and over the trumpets'. Rehearsal letters should appear in every part, but the conductor or librarian should check this and also make sure that they agree with the conductor's score; if they do not, the letters from the parts should be copied to the conductor's score. For typical pieces or movements of the Romantic era marked allegro, the letters A to Z can be used up, though the letters I, J or O (or all) may be skipped._NEWLINE_Placement and frequency of the letters do not follow a hard-and-fast rule. Generally they are inserted at places where there is a musically significant change, for example a new theme, or a change in dynamic or instrumentation or the start of a new section – just those places where a conductor might want to restart in rehearsal. In addition, having the letters coincide with musical signposts can help players who are counting rests confirm they are still in the right place, which would not be possible if the marks were placed at numerically regular intervals._NEWLINE_The letter A is almost always used for a point close to the beginning, but not for the very beginning itself because it is much easier to say "from the beginning". 
Likewise, rehearsal letters are not necessary at changes in tempo, key signature or time signature, as the name of the new tempo or signature can serve the same purpose. For example, in some editions of Beethoven's Ninth Symphony, letter A of the Finale does not occur until bar 140, when the relatively late entry of the first violins with the "Ode to Joy" theme might not stand out enough to the other players to be a convenient point of reference, whereas the reminiscences of the previous movements' themes are more easily referenced by their tempo markings._NEWLINE_A rehearsal letter usually breaks a multimeasure rest in a part (except in cases where a given instrument does not play at all in a given movement of the work). Because rehearsal letters are sometimes independent of edition and in some cases even version, they are also useful for telling applicants for positions in the orchestra what passages they need to play at the audition. They are also useful for easy reference in scholarly essays about orchestral works. However, rehearsal letters are altogether absent from some editions of some pieces that have them in other editions, such as the older editions of Richard Wagner's Meistersinger prelude._NEWLINE_Rehearsal letters are less useful in unaccompanied instrumental music such as the solo piano repertoire (although they may be used in duets), since the instrumentalist has no need to communicate to a fellow player where to resume playing. Songs also tend not to use them, because it is more useful to refer to the lyrics (except in pieces where the lyrics are highly repetitive, or those with long lyric-less sections). _START_SECTION_ Usage in the late 19th century to 21st century _START_PARAGRAPH_ In some cases, A to Z might not be enough. After Z, Aa may be used, followed by Bb, and so on until Zz (though Ii, Jj and/or Oo might also be skipped). The Wilhelm Hansen edition of Jean Sibelius's Symphony No. 7 in C major presents one unusual case: the letters A to Z (including both I and J, as well as O) are used up with just three more pages left in the score. For the final flute and bassoon solo, the editors use Ö (the final letter of the Finnish alphabet) as a rehearsal letter._NEWLINE_But in the case of some composers, such as Gustav Mahler and Dmitri Shostakovich, twice through the alphabet might still not be enough. For this reason, some editors prefer rehearsal numbers to rehearsal letters. Mahler's and Shostakovich's scores use rehearsal numbers rather than letters. These are typically in boldface and enclosed in a box, or less commonly, a circle. Confusingly, however, some editions enclose bar numbers in boxes, though usually not in bold. In the Schirmer edition of Roy Harris's Symphony No. 3 (in one movement), the rehearsal numbers are enclosed in circles, and they occur every ten measures, actually being the bar number divided by 10. That rehearsal numbers "are easily confused with measure numbers" is a reason sometimes given in favor of rehearsal letters._NEWLINE_Advocates of rehearsal numbers counter that even 26 letters are not enough for some scores. Whereas rehearsal letters reset to A for each movement of a multi-movement work (even for connected movements), rehearsal numbers typically run over the course of the entire work, even if the movements are not connected. For example, the rehearsal number for the last few bars of the first movement of Edward Elgar's First Symphony is 55; the first rehearsal number of the second movement is 56. There are exceptions, however. 
The final outburst in the first movement of Mahler's Second Symphony is rehearsal number 27. Mahler actually wanted a pause of five minutes before the next movement, so the rehearsal numbers reset to 1, ending with 15. The third movement follows after a short break, but its first rehearsal number is 28. _START_SECTION_ Jazz and pop _START_PARAGRAPH_ For jazz and pop compositions with several choruses, "many jazz composers and arrangers" use a format in which "each successive verse/chorus part of the form is assigned successive letters of the alphabet" combined with a measure number: for example, letter A for the first 8-bar phrase of the verse after the introduction, A9 for the next 8-bar phrase, A17, A25, then B, B9, B17, B25 for the chorus, etc., with the special rehearsal marking TAG for the tag ending. In jazz and pop music, the musicians frequently refer to the "A section" or the "B section" of a 32 bar song during rehearsals. In pop music, the music is commonly organized into standard sections, such as an intro, multiple verses and choruses (refrain), one or more bridges, a guitar solo (or other instrumental solos), and an outros. As such, a bandleader who wishes to start in the middle of a song will typically specify which part of this structure the band should start on (e.g., "last four bars of the bridge, going into the guitar solo" or "last verse and go to the outro"). + _START_ARTICLE_ Joseph Imre _START_PARAGRAPH_ Joseph B. Imre is a historian, political scientist, researcher, and business entrepreneur. Joseph is the proprietor of Seasons Fine Foods and Cookery School. Joseph has been elected to the Board of Directors of the Downtown Napanee Business Improvement Area (BIA) (2019-2023), is a member of the Napanee & District Chamber of Commerce, and a volunteer at the L&A County Museum and Archives. An active member of the Canadian-Hungarian community, Imre has served as the President and founder of the Hungarian Students' Association at the University of Toronto (2002–2005); Vice-President of the Albert Apponyi Association (2000–2010); and on the Board of Directors of several Hungarian organizations including the National Alliance of Hungarians in Canada (Kanadai Magyarok Országos Szövetsége). In 2007, Mr. Imre was awarded the Order of the Knights Cross from the 1956 Hungarian Freedom Fighters Association in Hungary. Mr. Imre is a member of the Friends of Hungary Foundation._NEWLINE_Imre graduated from the University of Toronto with an Honours Bachelor of Arts in history and political science in 2005. Upon completion of postgraduate studies at the University of Oxford (2006) he completed a master's degree in history from the University of Bristol (2007) with distinction. In 2009, Joseph completed a graduate diploma in comparative politics from the London School of Economics. Imre is a member of the Project Management Institute (PMI) where he holds additional certification in project management._NEWLINE_Mr. Imre has written extensively on medieval, early modern, and modern European history. While his academic focus is primarily Renaissance studies and the religious history of 15th century Italy, Mr. Imre has also published widely on 20th century Hungarian historiography and the issue of Hungarian minorities in the Carpathian Basin. His graduate dissertation on Girolamo Savonarola linked the controversial figure to humanistic elements in Renaissance society and challenged existing scholarship. In his capacity as a historian, Mr. 
Imre regularly contributes to a number of newspapers and online blogs as a columnist and contributor. Mr. Imre is an editor with the NAHC._NEWLINE_Mr. Imre and his wife, Mrs. Jazmin Bansagi, also own and operate a farm in Lennox and Addington County in eastern Ontario. Seven Fields Farm & Orchard was founded on organic principles and sustainable approaches to stewardship that protect the land and environment. Seven Fields Farm & Orchard operates a walnut and apple orchard, hay production, and the raising of beef cattle in co-operation with other farms in the area. + _START_ARTICLE_ Vorlage (ski hill) _START_PARAGRAPH_ Vorlage is a ski hill located at the village of Wakefield, within the municipality of La Pêche, Quebec, in the Gatineau Hills north of Ottawa, Ontario. It consists of 18 runs, 13 of which have night skiing. It was opened in 1941 and has a terrain park. + _START_ARTICLE_ Cauldcots railway station _START_SECTION_ History _START_PARAGRAPH_ The station opened in 1883 by the North British, Arbroath and Montrose Railway. The station closed to both passengers and goods traffic on 22 September 1930. + _START_ARTICLE_ Paolo Brescia _START_PARAGRAPH_ Paolo Brescia is an Italian architect and founder of Open Building Research. He graduated with a degree in architecture from the Politecnico di Milano in 1996 and had his academic fellowship at Architectural Association in London._NEWLINE_After working with Renzo Piano, he founded in 2000 OBR with Tommaso Principi to investigate new ways of contemporary living, creating a design network among Milan, London, Mumbai and New York._NEWLINE_He combines his professional experience with the academic world as guest lecturer in several athenaeums, such as Accademia di Architettura di Mendrisio, Kent State University, Aalto University, University of Oulu, Academy of Architecture of Mumbai, College of Architecture of Pune, Mimar Sinan Fine Art University, Hacettepe University. He was university professor in charge at the Faculty of Industrial Design at the Politecnico di Milano (2004-2005) and professor of architectural design at University of Genoa (2013-2015)._NEWLINE_With OBR his projects have been featured in international exhibitions, including at X Biennale di Architettura, Venice 2006; Architecture: Where to, London 2007; V Bienal de Arquitetura, Brasilia 2007; XI Bienal Internacional de Arquitectura, Buenos Aires 2007; AR Award Exhibition, Berlin 2008; China International Architectural Expo, Beijing 2009; International Expo, Shangai 2010; UIA 24th World Congress of Architecture, Tokyo 2011; Energy at MAXXI, Rome 2013; Italy Now, Bogotá 2014; Small Utopias, Johannesburg 2014, XIV Biennale di Architettura, Venice 2014, Triennale di Milano, Milan 2015 and Cooper Hewitt Smithsonian Design Museum, New York. _START_SECTION_ Working life _START_PARAGRAPH_ From 1998, Brescia worked with Renzo Piano till the establishment of OBR Open Building Research in 2000._NEWLINE_With OBR, he won several design competitions, such as: Pythagoras Museum (2003), Galleria Sabauda in the Real Palace in Turin (2003), Milanofiori Residential Complex (2005), Ex Cinema Roma in Parma (2006), Galliera Hospital in Genoa (2009), Polo di Funo in Bologna (2009), Fair of Messina (2010), Cesme Waterfront (2012), Via XX Settembre in Genoa (2012), Santa Margherita Ligure Waterfront (2013), Michelin HQ and Research Labs in Delhi (2014), Terrazza Triennale in Milan (2014), Parco Centrale di Prato (2015), Comparto Stazioni Varese (2016). 
+ _START_ARTICLE_ Bryan Lugo _START_SECTION_ Career _START_PARAGRAPH_ Bryan Lugo played Officer Burton in the IFC Films and La Petite Reine film Maniac. The film premiered at the 2012 Cannes Film Festival, out of competition. Lugo plays opposite Elijah Wood in the film, which premiered to a limited distribution in theaters on January 2, 2013. His other films include Afternoon Delight, which premiered in the 2013 Sundance Film Festival, opposite Kathryn Hahn, Jane Lynch, and Juno Temple; I Am Gangster (2015); and Marvel Studios' Ant-Man and the Wasp, opposite Tip "T.I." Harris, David Dastmalchian, and Walton Goggins, which premiered on July 6, 2018, worldwide in theaters. + _START_ARTICLE_ Feudalism in England _START_SECTION_ Origins of Feudalism _START_PARAGRAPH_ The word feudal derives from an ancient Gothic source faihu signifying simply "property" which in its most basic sense was "cattle". This can be compared to the very ancient classical Latin word pecunia, which means both cattle and money. Many societies in existence today demonstrate the traditional use of cattle as financial currency, for example the Masai of Kenya and Tanzania, who pay dowries in this form._NEWLINE_Because feudalism was in its origin a Teutonic or Gothic system from northern Europe untouched by Roman civilization, it did not exist in ancient Rome, where the nearest equivalent was clientelism. No classical Latin word therefore exists to signify it, and a new Low-Latin word feodum was invented by mediaeval European scribes to use in their Latinised charters and other writings. _START_SECTION_ Classic English feudalism _START_PARAGRAPH_ Under the English feudal system, the person of the king (asserting his allodial right) was the only absolute "owner" of land. All nobles, knights and other tenants, termed vassals, merely "held" land from the king, who was thus at the top of the "feudal pyramid". When feudal land grants were of indefinite or indeterminate duration, such grants were deemed freehold, while fixed term and non-hereditable grants were deemed non-freehold. However, even freehold fiefs were not unconditionally heritable—before inheriting, the heir had to pay a suitable feudal relief._NEWLINE_Below the king in the feudal pyramid was a tenant-in-chief (generally in the form of a baron or knight) who was a vassal of the king, and holding from him in turn was a mesne tenant (generally a knight, sometimes a baron, including tenants-in-chief in their capacity as holders of other fiefs) who held when sub-enfeoffed by the tenant-in-chief. Below the mesne tenant further mesne tenants could hold from each other in series. The obligations and corresponding rights between lord and vassal concerning the fief form the basis of the feudal relationship. _START_SECTION_ Vassalage _START_PARAGRAPH_ Before a lord could grant land (a fief) to a tenant, he had to make that person a vassal. This was done at a formal and symbolic ceremony called a commendation ceremony composed of the two-part act of homage and oath of fealty. During homage, the lord and vassal entered a contract in which the vassal promised to fight for the lord at his command, whilst the lord agreed to protect the vassal from external forces, a valuable right in a society without police and with only a rudimentary justice system._NEWLINE_The word fealty derives from the Latin fidelitas and denotes the fidelity owed by a vassal to his feudal lord. Fealty also refers to an oath which more explicitly reinforces the commitments of the vassal made during homage. 
Such an oath follows homage. Once the commendation was complete, the lord and vassal were now in a feudal relationship with agreed-upon mutual obligations to one another. The vassal's principal obligation to the lord was the performance of military service. Using whatever equipment the vassal could obtain by virtue of the revenues from the fief, the vassal was responsible to answer to calls to military service on behalf of the lord._NEWLINE_This security of military help was the primary reason the lord entered into the feudal relationship, but the vassal had another obligation to his lord, namely attendance at his court, whether manorial, baronial or at the king's court itself in the form of parliament. This involved the vassal providing "counsel", so that if the lord faced a major decision he would summon all his vassals and hold a council. On the manorial level this might be a fairly mundane matter of agricultural policy, but also included the handing down by the lord of sentences for criminal offences, including capital punishment in some cases. Concerning the king's feudal court, the prototype of parliament, such deliberation could include the question of declaring war. Depending on the period of time and location, feudal customs and practices varied, see examples of feudalism. _START_SECTION_ Varieties of feudal tenure _START_PARAGRAPH_ Under the feudal system several different forms of land tenure existed, each effectively a contract with differing rights and duties attached thereto. The main varieties are as follows: + _START_ARTICLE_ Fuso Maru _START_SECTION_ Construction _START_PARAGRAPH_ Fuso Maru was laid down in 1907 at the Barcay Curle Co. Ltd. shipyard in Glasgow, Scotland, United Kingdom. She was launched on 19 March 1908 and was completed in February 1909. She was built for the Russian East Asiatic Steamship Company and was named Russia. She was renamed Fuso Maru when she was bought by the Japanese company Osaka Shosen K. K. - OSK Line on 24 December 1923._NEWLINE_Fuso Maru was 144.78 metres (475 ft 0 in) long, with a beam of 17.53 metres (57 ft 6 in) and a depth of 11.2 metres (36 ft 9 in). The ship was assessed at 8,596 GRT. She had two triple expansion steam engines rated at 7,113 ihp (5,304 kilowatts) and driving two screws. She had two funnels and four masts. _START_SECTION_ Pre-World War II career _START_PARAGRAPH_ As Russia, the ship completed her maiden voyage from Libau, Russia, to New York City, United States, on 2 June 1909, and her last voyage on 26 June 1914. She was then laid up at Kronstadt, Russia, until 1917, when she was renamed Rossija and later Russ. In 1921 she was transferred to the Baltic American Line and renamed Latvia. She started service on the Libau–Danzig–Halifax–New York City route on 11 July 1921. Her ninth and last transatlantic voyage started on 7 February 1923. She then was sold to Osaka Shosen Kaisha of Japan on 24 December 1923 and renamed Fuso Maru. Two of her masts were removed at this time. Fuso Maru then served two different companies under four different names before finally being purchased by the Japanese Company Osaka Shosen K. K. - OSK Line._NEWLINE_Fuso Maru operated on the Kobe, Japan–Kirun, Taiwan route from 18 July 1924 until March 1934. She then provided service on the Kobe–Dairen, Manchukuo, route from March 1934 until November 1941. She had accommodation for 42 first-class, 56 second-class, 212 third-class, and 1,414 fourth-class passengers, and had a crew of 144. 
_START_SECTION_ World War II career _START_PARAGRAPH_ In November 1941, the Imperial Japanese Army chartered Fuso Maru for use as a troopship. She was most likely painted grey overall and armed with a suite of antiaircraft guns at this time._NEWLINE_Fuso Maru participated as a troopship in Operation "E", the Japanese invasion of Malaya beginning on 13 December 1941. In late December 1941, she was rerated as a hospital ship. She was most likely disarmed because of the international prohibition against hospital ships carrying armament, and she was painted white with a green horizontal stripe and red crosses on her sides and funnel._NEWLINE_Shortly after sunrise on 15 April 1943, Allied aircraft attacked Fuso Maru three times near the Shortland Islands at (03°33′S 152°20′E). Fuso Maru returned to service as a troopship later in 1943 and was repainted overall grey and again armed with antiaircraft guns. _START_SECTION_ Sinking _START_PARAGRAPH_ On 31 July 1944, Fuso Maru was part of Convoy MI-11, which consisted of 23 ships, including the tankers Koei Maru, Taketoyo Maru, Shichiyo Maru, Ayagumo Maru, Harima Maru, and Ogura Maru No. 1 and the cargo ships and troopships Fuso Maru, Ayayuki Maru, Yoshino Maru, Miho Maru, Enoshima Maru, Manko Maru, Hachijin Maru, Dakar Maru, Teiritsu Maru, Fukuju Maru, and Banshu Maru No. 16, escorted by the destroyer Shiokaze, the escort ship Shimushu, the minesweepers W-38 and W-39, the submarine chaser CH-55, and the auxiliary gunboat Kazan Maru. While the convoy was proceeding from Moji, Japan, to Miri, Borneo, it was attacked in the South China Sea 280 nautical miles (520 km) northwest of Cape Mayraira, Luzon, by a United States Navy submarine wolfpack patrolling the Luzon Strait under the command of Captain (later Rear Admiral) Lewis S. Parks. The wolfpack consisted of Lieutenant Commander (later Vice Admiral) Lawson P. Ramage's USS Parche (SS-384), Lieutenant Commander (later Captain) David L. Whelchel's USS Steelhead (SS-280), and Lieutenant Commander John C. Martin's USS Hammerhead (SS-364)._NEWLINE_At 3:32 AM, Parche torpedoed and sank Koei Maru with four torpedoes. Although she was carrying a unit of 1,050 Imperial Japanese Army troops, the casualties aboard her were relatively light; about 150 troops and nine crewmen were killed. About the same time, the tanker Ogura Maru No. 1 was hit by two torpedoes, killing five men, but she did not sink. At 3:40 AM, Parche torpedoed and sank Yoshino Maru with four torpedoes; she carried down 2,442 of the 5,063 Imperial Japanese Army troops she was carrying, as well as 18 gunners, 35 crewmen, and 400 cubic meters (14,120 cubic feet) of ammunition. At 4:20 AM, Steelhead hit Dakar Maru with two torpedoes, killing six men, but Dakar Maru did not sink and quickly beached herself._NEWLINE_Aboard Fuso Maru, 40 men were assigned to duty as lookouts, including Imperial Japanese Army artillerymen and infantrymen. At 4:55 AM, one lookout spotted a torpedo approaching the ship and her captain ordered her rudder turned hard to port, but it was too late. Steelhead's torpedo hit Fuso Maru's engine room on the port side of the ship. Fuso Maru bucked and trembled from the explosion and the blast blew upwards, destroying several lifeboats that were on deck. Fuso Maru took on a 25-degree list to port in heavy seas when the order to abandon ship was issued. 
The ground vehicles carried as deck cargo broke loose and fell onto men swimming in the water._NEWLINE_At 5:10 AM, Fuso Maru sank only 15 minutes after the torpedo hit, taking down 1,316 of the 4,500 troops aboard. Seventy men of the 2nd Company, Sixth Aviation Signal Regiment, 12 other passengers, and 22 crew members also perished, bringing the death toll to 1,384 people. A cargo consisting of food and medical supplies, oil, trucks, 36 railway carriages, and 1,120 tons of other military supplies also was lost._NEWLINE_At 5:14 AM, Parche torpedoed and sank Manko Maru. She carried several hundred Imperial Japanese Navy personnel, 17 crewmen, about 20 gunners, and a cargo of ammunition down with her. Altogether, four of the 23 ships of Convoy MI-11 sank and two were damaged. The ships took down several thousand military personnel, gunners, and crewmen, as well as their cargoes of ammunition and other supplies. Thousands of troops were left floating in the waters of the Balintang Channel. _START_SECTION_ Wreck _START_PARAGRAPH_ The wreck of Fuso Maru lies at (18°51′N 122°55′E). Its condition is unknown. + _START_ARTICLE_ Brazil at the 2017 World Aquatics Championships _START_SECTION_ Water polo _START_PARAGRAPH_ Brazil's men's water polo team qualified for the World Championships with a gold medal performance at the 2017 UANA Cup in Couva, Trinidad and Tobago. + _START_ARTICLE_ Ann Catley _START_SECTION_ Life _START_PARAGRAPH_ Catley was born near Tower Hill, London, and first made money singing in pubs and to the garrison of the Tower of London. She was apprenticed aged fifteen to William Bates, a composer and singing teacher. A scandal emerged when Bates sold Ann's apprenticeship to her admirer Sir Francis Blake Delaval of Seaton Delaval Hall for £200. Bates was additionally given money by Delaval to make up for any financial loss to him. Catley's father, Robert Catley, could see that Ann had been sold. Aided by his employer, her father sued the rake Delaval and Bates. Lord Chief Justice Mansfield's judgement extended British law, as he ruled that Delaval had offended society and that the King's Bench could take action against Delaval on society's behalf; he was heavily fined. Catley's relationship with Delaval ended. Delaval found future relationships difficult, and Catley continued her career._NEWLINE_In 1768 she met Lieutenant-Colonel Francis Lascelles (1744–1799) and they became a couple. She took the name Lascelles but they never married. In her will she left her property to their eight surviving children._NEWLINE_She performed many roles on the London and Dublin stage, until 1782. Her pupil Margaret Martyr's style is said to have come from Catley. Thomas Bellamy wrote of Martyr in 1795: "Catley's pupil - Catley's boast, Sportive, playful, arch and free, Lovely MARTYR, hail to thee!"_NEWLINE_Catley spent her last years living at Little Ealing and died on 14 October 1789._NEWLINE_John O'Keeffe wrote of her: "she was one of the most beautiful women I ever saw: the expression of her eyes, and the smiles and dimples that played round her lips and cheeks, enchanting". + _START_ARTICLE_ William Paul (author) _START_SECTION_ Background _START_PARAGRAPH_ Paul grew up in the Fife village of Kingskettle in Scotland. He has an older brother, Donald, and a younger sister, Elizabeth Anne. _NEWLINE_Paul attended the village primary school before going to Bell Baxter High School in Cupar. 
In 1973 he went to Aberdeen University, where he graduated with an MA in English and Politics._NEWLINE_After university, Paul became a trainee reporter for the Press and Journal. He then moved to Edinburgh where he was a reporter on The Scotsman. Paul's first son, Andrew, was born in 1983 while Paul was working on his debut novel Seasons of Revenge._NEWLINE_In 1985, Paul's second son, William, was born and Seasons of Revenge was published. A second novel, Mummy's Boy, was published the next year._NEWLINE_In 1988, Paul joined Scotland on Sunday as senior writer. He was promoted to Chief Reporter, then News Editor, then Assistant Editor and finally Executive Editor. In 1999, Paul was temporarily seconded from Scotland on Sunday to be the launch editor of www.scotsman.com. He left in 2001 to become the Head of Digital Communications for the Scottish Government, a post he held until February 2019._NEWLINE_Paul continues to write novels. He also writes news articles and reports on rugby matches and is a lifelong supporter of Raith Rovers F.C. + _START_ARTICLE_ Gol Talab _START_PARAGRAPH_ Gol Talab or Gol Talaab (talab means tank), also known as Nawab Bari Pukur, is a small oval-shaped water tank/pond in Islampur, Old Dhaka, Dhaka, Bangladesh, located immediately to the north-west of the Ahsan Manzil Palace and north of the Buriganga River. Gol Talab is an official heritage site, designated by the city government of Dhaka. _START_SECTION_ Description _START_PARAGRAPH_ Gol Talab dates to the 19th century. It covers an area of 2.23 acres and has a maximum depth of 23 feet (7.0 m). There are plans to upgrade it into a park. The pond is fenced. Vegetation found around the lake consists of coconut, mango, neem, jackfruit and Chinese date trees. Aquatic fauna reported include fish, frogs, insects and others. The fish species reported are ruhi, tilapia, silver carp, pangash, katal, koi, puti and many more. Other fauna reported include beetles, dragonflies, grasshoppers, butterflies, small birds and water scorpions. The pond has a bathing ghat only on its northwestern part. Boating competitions are held in the pond. A path for jogging and walking exists around the water. _START_SECTION_ Scientific analysis _START_PARAGRAPH_ Gol Talab is one of the five ponds in Dhaka which have a significant effect on the environment and biodiversity of the urban climate. Field research studies have been carried out to assess its link with the environment, economy and society. A socio-environmental survey, involving both quantitative and qualitative aspects, was carried out by three faculty members of the Department of Architecture, Stamford University Bangladesh._NEWLINE_Water quality studies of Gol Talab indicate a degree of pollution, with the following water quality parameters: TDS of 261 mg/liter; a conductivity of 0.528 mS/cm; a pH of 6.92; dissolved oxygen (DO) of 13.92 mg/liter; an arsenic content of less than 10 ppb; a COD value of −23 mg/liter; and a BOD value of 59.4 mg per liter. _START_SECTION_ Conservation _START_PARAGRAPH_ The pond is maintained by the Moulvi Khawaja Abdullah Welfare Trust and the Bangladesh Water Development Board, as part of the 2000 National Water Management Plan. The pond has been cleaned up and restored as part of the water development plan. In 2008, The Daily Star reported that heritage buildings and sites were under threat in the city, including Gol Talab. 
In 2009 the Dhaka City Corporation reaffirmed the conservation status of 93 structures and sites in Dhaka, in consideration of their "historical, aesthetic, scientific, social, cultural, religious, political and heritage value"; Gol Talab is an official heritage site under the Dhaka Metropolitan Development Plan. + _START_ARTICLE_ Clean Up Your Own Backyard _START_SECTION_ Background _START_PARAGRAPH_ Written by Mac Davis and Billy Strange and published by Gladys Music, Inc., it was released as a 7" single in 1969 with "The Fair Is Moving On" on the B-side, but was not featured on any studio album. The single was also released in the UK, Canada, Germany, Australia, New Zealand, and India._NEWLINE_It reached #35 on the Billboard Hot 100 and #21 on the UK Singles Chart. The single reached #18 on the Record World chart and #29 on the Australian Go-Set chart. The RIAA certified the single Gold in March 1992._NEWLINE_The song was from the soundtrack of the MGM film The Trouble with Girls, and was later included on the budget RCA Camden album Almost In Love._NEWLINE_Although The Trouble with Girls is set in the 1920s, several lyrics within this song are anachronistic for the era, such as a reference to "armchair quarterbacks", a term not coined until the advent of television sports broadcasting decades later. _START_SECTION_ Other recordings _START_PARAGRAPH_ The song has been recorded by Nat Stuckey, O.C. Smith, Tom Green, Jennifer Scott, Sue Moreno, Darrel Higham and The Enforcers, and Lee Birchfield in 2012. + _START_ARTICLE_ Houston Direct Navigation Company _START_SECTION_ Houston and Galveston Navigation Company _START_PARAGRAPH_ In 1851, William Marsh Rice founded the Houston and Galveston Navigation Company with his own capital of $5,000 and the capital of twenty-five other investors. These included Paul Bremond, Cornelius Ennis, William J. Hutchins, and John H. Sterrett. _START_SECTION_ Houston Navigation Company _START_PARAGRAPH_ One of the company's antecedents was the Houston Navigation Company, formed in 1854 by many of the same principals as the Houston and Galveston Navigation Company. _START_SECTION_ After the Civil War _START_PARAGRAPH_ The Houston Direct Navigation Company transported freight and passengers from Houston to railheads along Buffalo Bayou._NEWLINE_Houston Direct Navigation Company was founded on 9 October 1866 by William Marsh Rice, Thomas M. Bagby, John H. Sterrett, and several others. Businesses receiving and shipping goods from Houston were paying high fees for moving freight through Galveston, Texas. The company offered cheaper transportation, which bypassed Galveston and its Galveston Wharf Company._NEWLINE_At first, the company’s main business in the late 1860s consisted of lightering around Galveston and interlining freight through the Buffalo Bayou, Brazos and Colorado Railroad; however, it expanded service, running five passenger steamers by 1870. The company continued to expand its fleet, even as passenger demand diminished. By 1873, three steamers operated freight-only service along with 22 barges and three tugs, while only two steamers transported passengers. + _START_ARTICLE_ NY Med _START_SECTION_ Critical reception _START_PARAGRAPH_ NY Med received "universal acclaim" based on an aggregate score of 84 out of 100 from six critics on Metacritic. Verne Gay of Newsday called the series "beautiful and moving," adding "NY Med brings it all home with power, beauty, insight and a degree of emotion that's an occasional sharp uppercut to the jaw." 
New York Magazine's Matt Zoller Seitz said the series "is filled with warm, honest moments like this — some poignant, others comic — and characters who would be plenty compelling even if they didn't keep revealing surprising new sides." Mary McNamara of the Los Angeles Times called the series "surprisingly addictive", adding "NY Med appears less self-conscious about its medical pedigree than its predecessors, more willing to embrace the dramatic pacing and elasticities of television." The New York Times' Mike Hale thought the series was "predictably absorbing" but added "NY Med and its predecessors have an interesting, though certainly unintentional, effect: The intense focus on the heroics of the nurses and doctors can make the patients look just as helpless and pathetic as we fear we will be when our day in the ward comes." _START_SECTION_ Patient privacy lawsuit _START_PARAGRAPH_ One episode of NY Med depicted the case of an elderly man, Mark Chanko, who arrived at NewYork–Presbyterian hospital after he was hit by a garbage truck. The episode showed Chanko's final moments as he died from his injuries. While his face was blurred, Chanko's widow was able to identify him when she watched the episode. The patient's family had not granted ABC or the hospital permission to film his treatment, and they were deeply upset by the episode. The family sued ABC and New York–Presbyterian Hospital for violating Mark Chanko's privacy. The case was dismissed by an appellate court. However, the family appealed and the NY judiciary felt there was sufficient reason to bring it before the state's highest appellate court. New York–Presbyterian agreed to a $2.2M settlement with the Department of Health and Human Services, Office for Civil Rights, who investigated this as a HIPAA violation. ABC removed the segment from the DVD version of the episode and from future broadcasts. + _START_ARTICLE_ Community Professional Loudspeakers _START_PARAGRAPH_ Community Professional Loudspeakers is an American manufacturer of loudspeakers and sound reinforcement equipment. The company has been located in the Philadelphia area since its inception in 1968, and has occupied its present location in Chester, Pennsylvania since 1981. _START_SECTION_ Background _START_PARAGRAPH_ Bruce Howze founded the company in 1968, which was first named Community Light and Sound. The company originally started in the Philadelphia area, and now occupies a 100,000 square foot space in Chester, Pennsylvania._NEWLINE_Community established itself as the first company to utilize fiberglass to create large yet lightweight loudspeaker horns and enclosures. In 1970, it introduced its first notable live sound reinforcement loudspeaker product, the LMF, a fiberglass midrange horn. The company next developed the Leviathan fiberglass composite bass horn, which Elvis Presley used in his 1971 tour. Several top musical groups from that era used Leviathans as well, such as the Eagles, Linda Ronstadt, and Earth, Wind & Fire._NEWLINE_In the mid-1970s, Community became one of the first companies to meticulously test and document the performance of both its own loudspeakers and competitors’ loudspeakers. Community based its test measurement philosophy on the underlying principles of “free field” and “far field,” believing that far more dependable and relevant data can be obtained by testing loudspeakers at measurement distances that correspond to actual listening distances._NEWLINE_Since its founding, Community has pursued pioneering loudspeaker technologies. 
In 2010, the United States Patent and Trademark Office granted the company a patent for Carbon Ring Cone Technology._NEWLINE_In 2014, Community was acquired by Audio Prof, which also owns Apart Audio. Bruce W. Howze is still active with the company today, as is Christine Howze. Steve Johnson joined Community in 2013 as president. The Community brand has a global presence with its products being for live performance and permanent installation in houses of worship, schools, and other venues._NEWLINE_Community Professional is also well known for its weather-resistant loudspeaker designs, which are installed in major sports stadia and arenas throughout the world. This same quality makes the company’s loudspeakers a valuable component in emergency notification systems, such as the one used by the Tidal Information System in Venice, Italy. + _START_ARTICLE_ 2017–18 Argentine Primera División _START_SECTION_ Competition format _START_PARAGRAPH_ The tournament was contested by 28 teams. Each team played the other 27 teams in a single round-robin tournament. The additional match against the main rival team in the so-called "Fecha de Clásicos" was not played in this season. _START_SECTION_ Top goalscorers _START_PARAGRAPH_ Source: AFA _START_SECTION_ Top assists _START_PARAGRAPH_ Source: AFA + _START_ARTICLE_ Kelly Lai Chen _START_SECTION_ Early life _START_PARAGRAPH_ In 1933, Lai was born as Xi Zhongjian (奚重俭) into a prominent family from Pudong, Shanghai, owner of the Xi Fu Ji (奚福记) Factory. He was the third child among six siblings; Betty Loh Ti (born Xi Zhongyi) was the youngest. Their maternal grandfather was the tycoon Gu Zhuxuan, who owned Shanghai's Tianchan Theatre. When Lai was four, his father was killed by Japanese bombing during the Battle of Shanghai, and his mother died ten years later. He and his siblings were brought up by their maternal grandmother._NEWLINE_When the Communists took over Mainland China in 1949, Lai's grandmother brought the children to Hong Kong. He trained in the Republic of China Air Force cadet school in Taiwan for half a year, but was forced to quit because of heart disease. _START_SECTION_ Career _START_PARAGRAPH_ After returning to Hong Kong, Lai attended the actor training school of Motion Picture & General Investment (MP&GI, later Cathay Organisation HK). In 1956, he starred in his first film Green Hills and Jade Valleys directed by Yueh Feng. In his second film, Golden Lotus, also by Yueh Feng, he acted opposite the star actress Lin Dai. The highly successful film launched Lai into stardom. In the following years, he appeared in more than 40 films, including Evan Yang's Our Dream Car (1959), Chung Kai-man (鍾啟文)'sThe Education of Love (1961), and Wong Tin-lam's Father Takes a Bride (1963), starring opposite popular actresses such as Ge Lan, Jeannette Lin Cui, and Lucilla You Min. He was known for his portrayals of "gentle, vulnerable, and sensitive" young men._NEWLINE_In 1967, Lai, his sister Betty Loh Ti, and director Yuan Qiufeng founded Gold Eagle Film Company. It made a number of martial arts films, including Duel at the Supreme Gate (1968), but they were commercially unsuccessful._NEWLINE_Lai retired from acting in 1971 and focused on producing films. He retired in the early 1990s, but returned to the screen for Andrew Lau's 1996 film Young and Dangerous 2 and Wong Kar-wai's award-winning In the Mood for Love (2000), in which he played Maggie Cheung's boss. 
_START_SECTION_ Personal life _START_PARAGRAPH_ In 1974, Lai married martial arts actress Angela Mao, nicknamed "female Bruce Lee". They had a daughter together, and divorced after six years of marriage. Lai raised their daughter._NEWLINE_In his later years Lai lived alone in Hong Kong, and his daughter often visited him. He died on 3 April 2018, aged 84. + _START_ARTICLE_ Douglas v. Veterans Administration _START_PARAGRAPH_ Curtis Douglas vs. Veterans Administration (5 Merit Systems Protection Board (MSPB), 313 (1981) was a case decided by the Merit Systems Protection Board which established criteria that supervisors must consider in determining an appropriate penalty to impose for an act of federal employee misconduct. + _START_ARTICLE_ Blackadder, Scottish Borders _START_PARAGRAPH_ Blackadder is a hamlet on the B6460, in the Scottish Borders area of Scotland, located at grid reference NT846523._NEWLINE_Places nearby include Allanton, Duns, Edrom, Fogo, Gavinton, and Whitsome. + _START_ARTICLE_ James White (engineer) _START_PARAGRAPH_ James Lindsay White (January 3, 1938 – November 26, 2009) was an American polymer scientist._NEWLINE_White was a key figure in defining the field of polymer engineering. He founded two polymer engineering programs, one at the University of Tennessee and the other at the University of Akron. He also founded the International Polymer Processing Society and two scholarly journals: the Journal of Polymer Engineering and the International Polymer Processing Journal. He authored the textbook Rubber Processing, which was long popular among engineers. He published more than 500 papers and eight books based on his studies of flow in internal mixers, pin barrel extruders, and twin screw extruders with and without simultaneous chemical reactions._NEWLINE_He received the Charles Goodyear Medal in 2009, for “fundamental understanding of rheology and mathematical modeling of unfilled and filled rubbers and simulations of flow in batch and continuous mixing machines.” He received the Bingham Medal in 1981. _START_SECTION_ Education _START_PARAGRAPH_ White obtained his BS in chemical engineering at the Brooklyn Polytechnic Institute. He then pursued graduate studies in the research group of Arthur B. Metzner at the University of Delaware, receiving his MS degree in 1962 and his doctorate in 1965. His research resulted in development of the White-Metzner rheological model, which is still widely used in polymer processing simulations. + _START_ARTICLE_ Karen Yarbrough _START_SECTION_ Early life _START_PARAGRAPH_ Yarbrough's father, Don Williams, is a State Farm insurance agent and former village president of the Village of Maywood. Yarbrough earned a bachelor's degree in Business Administration from Chicago State University, a master’s in Inner City Studies from Northeastern Illinois University and attended the advanced leadership institute at Harvard University's Kennedy School of Government. Yarbrough is the founder and CEO of Hathaway Insurance Agency, where she has worked for thirty years. _START_SECTION_ Public service _START_PARAGRAPH_ For eight years, Yarbrough served as the first female president of the Maywood Chamber of Commerce. She established the Gold Card Program, which provided opportunities for business and education to work together to provide scholarships for deserving students._NEWLINE_Yarbrough also served as a board member of the United Way of Suburban Chicago and the Oak Park YMCA. 
_START_SECTION_ State Representative _START_PARAGRAPH_ Yarbrough was first elected in 2000, defeating the appointed incumbent Wanda Sharp. Her term began in January, 2001, and she rose to become an assistant Majority Leader._NEWLINE_During her tenure in the Illinois House, Yarbrough served on seven committees: Housing and Urban Development (Chairwoman), House Insurance Committee (Vice-Chairwoman), Environmental Health Committee, Appropriations: Public Safety, Computer Technology, Illinois Legislative Black Caucus, and the Governor’s Safety and Re-Entry Commission._NEWLINE_She pushed for "All Kids", a program helping to ensure that all children in Illinois get affordable health care. The All Kids program provides working families that make too much for KidCare, but not enough to afford private insurance, with affordable health insurance. Yarbrough also sponsored a bill to support challenged children. It funds training teachers and school administrators to recognize and be sensitive to the special needs of students who are pregnant or living in abusive environments._NEWLINE_Yarbrough also fought the tobacco lobby for a "Smoke Free Illinois", making Illinois the first Midwestern state to go smoke-free. She also fought to fund programs to provide a "Second Chance" for previously incarcerated individuals and sponsored a Quality of Life bill to use state lottery money to treat people with HIV-AIDs. She was also the chief House sponsor for legislation which ended the death penalty in Illinois, making it the 16th state to do so and earning her recognition from the Pope._NEWLINE_After Yarbrough was elected Recorder, Cory Foster was appointed to fulfill the remainder of her term in the 97th General Assembly. Emanuel Chris Welch was elected to succeed her as the representative from the 7th district. _START_SECTION_ Personal life _START_PARAGRAPH_ Yarbrough’s husband, Henderson Yarbrough, Sr. is the current trustee, and was the former mayor of the village of Maywood. They have seven grandchildren. + _START_ARTICLE_ The Iron Lady (TV series) _START_SECTION_ Synopsis _START_PARAGRAPH_ In earlier times and not very long ago, girls born to traditional Chinese families were deprived of privileges and opportunities reserved for males. Most of them accepted their fate dutifully and submissively._NEWLINE_But one woman of those oppressed times dared defy the hand she was dealt. Opportunist, manipulator, ambitious, she desired power to control her own destiny above all else._NEWLINE_This is the story of an unconventional woman of extraordinary will and determination who seized control of her fate at all costs! + _START_ARTICLE_ Elephant Hotel _START_SECTION_ History _START_PARAGRAPH_ Hachaliah Bailey (known as the creator of Bailey Circus) built the Elephant Hotel in Somers, NY, after buying an African elephant, which he named "Old Bet". Bailey intended to use the elephant for farm work, but the number of people it attracted caused Bailey to tour her throughout the northeast. Old Bet was killed on tour in 1816, when she was shot by a local farmer. Bailey's tours were the first of their type in the nation, and inspired numerous others to tour with exotic animals, and during the 1830s the old style circus and Bailey's attractions merged to form the modern circus. Due to this, Somers is known as the "Cradle of the American Circus"._NEWLINE_Bailey had purchased this land in 1805, and began construction of the hotel in 1821, as a memorial to the animals he displayed. 
It is said Old Bet was buried in front of the building. The monument to Old Bet that stands in front of the hotel was placed in 1827. In 1835, the hotel was incorporated by the Zoological Institute._NEWLINE_The Elephant Hotel was purchased by the town of Somers in 1927. It is a town landmark and was designated a National Historic Landmark in 2005. _START_SECTION_ Museum of the Early American Circus _START_PARAGRAPH_ The Somers Historical Society occupies the third floor of the building. The Society operates the Museum of the Early American Circus, which is open on Thursday afternoons and for special holidays.
 + _START_ARTICLE_ Marrit Boonstra _START_SECTION_ Biography _START_PARAGRAPH_ Boonstra, who was born in Groningen, played tennis as a right-hander with a double-handed backhand._NEWLINE_Her junior career included a win over Caroline Wozniacki, with whom she also partnered to reach the girls' doubles quarterfinals of the 2006 Wimbledon Championships._NEWLINE_As a 17-year-old, she played three doubles rubbers for the Netherlands in the 2006 Fed Cup competition, teaming up with Dutch veteran Brenda Schultz-McCarthy to win all three matches._NEWLINE_Boonstra received a wild card to compete in the main draw of the 2006 Ordina Open, a WTA Tour tournament in Rosmalen. She lost in the opening round to Jelena Janković._NEWLINE_From 2008 to 2010, she played collegiate tennis in the United States for the University of Florida._NEWLINE_During her time on the ITF circuit, she won one singles and five doubles titles.
 + _START_ARTICLE_ Mayfair Theatre _START_SECTION_ Description _START_PARAGRAPH_ The Mayfair is a surviving atmospheric cinema of the Spanish Revival form, the second theatre house of this kind to be constructed in Ottawa. Interior features include four faux-balconies, two of which feature clay-tile canopies. Other significant features include stained-glass windows, a proscenium arch, a painted ceiling, decorative plastering and wrought ironwork. The Mayfair has retained the theatre clock used since its inception, a unit which features blue illuminated numbering. _START_SECTION_ 1932 to 1970s _START_PARAGRAPH_ Fred Robertson, a retailer from Almonte, was the Mayfair's original owner. The Mayfair opened on 5 December 1932 with showings of The Blue Danube. Adult admission prices were 15 cents for matinees and 25 cents for evening performances, with children admitted for 10 and 15 cents respectively. After The Blue Danube completed a three-day run, the Mayfair presented its first double bill with Bring 'Em Back Alive and X Marks the Spot. At the outset, the theatre's sound system was supplied by Northern Electric while Montreal-based Canadian Theatre Supply provided the projection and stage equipment._NEWLINE_For the first half century of its existence, the cinema remained under Robertson family ownership. The theatre later operated as a second-run cinema for numerous years. In the late 1970s the Mayfair concentrated on pornographic films, a phase which lasted less than two years. _START_SECTION_ 1980s _START_PARAGRAPH_ In October 1981, the Mayfair adopted a repertory format and in the following month Keith Davidson became theatre manager. The Mayfair became known for its economical double features, which were introduced in June 1982 for five days each week, excluding Sundays and Mondays when Chinese-language films would be presented.
The Mayfair's ownership then consisted of several investors, most of whom were Ottawa-based._NEWLINE_The Mayfair cancelled a planned showing of Videodrome in April 1983 when police threatened the theatre with obscenity charges. A handful of citizens, including Maude Barlow, objected to the violent content of the film which was approved by the Ontario Board of Censors and was previously shown without incident in Nepean, Ontario._NEWLINE_Director Michael Rubbo rented the theatre for three days in early 1986 to conduct a "four-waller" promotion for his film The Peanut Butter Solution which had fared poorly in the English Canadian market._NEWLINE_In 1986, major renovations resulted in new concession stand and box office spaces, while 492 wider seats replaced the original set of approximately 600 seats. In 1988, the Mayfair's regular admission price was $5, or $3.50 for those with theatre memberships which were available for $5 per year, or $3 per year for students. During that time, membership numbered more than 5000. _START_SECTION_ 1990s _START_PARAGRAPH_ Double features became available on all days as of 1 April 1990 as the Chinese-language films were discontinued. Sunday afternoon double features were also introduced at that time. Regular prices for the double features were $5.50, or $4 for those who obtained a $6 annual membership. Featured films were predominantly hit American productions with a minority of classic and international films._NEWLINE_Tom Bergin became manager in the early 1990s when Davidson left to pursue other interests. _START_SECTION_ 2000s _START_PARAGRAPH_ In August 2008, local media indicated that the theatre would close effective 30 November 2008, the date at which the theatre would terminate its membership programme. The City of Ottawa declared the theatre building as a heritage site under the Ontario Heritage Act on 8 October 2008, a designation which prohibits outright demolition of the building._NEWLINE_Public and community concern over the closure of the Mayfair and interest in its heritage value resulted in the formation of the Friends of the Mayfair Theatre, a loosely organized community group that claimed several hundred members._NEWLINE_In November 2008, a new partnership consisting of filmmakers Lee Demarbre and Ian Driscoll, projectionist and film conservator Paul Gordon and film scholar John Yemen announced that they had signed a ten-year lease with owner Stephen Ng. The new owners renovated the facility with new seating, some couches in the balcony, a digital video projection system, a new 16mm projector, a Dolby Digital sound system for the 35mm projectors, and a long play tower system. Seating capacity was reduced from 492 to 343. The Mayfair reopened on 2 January 2009 with the film Metropolis accompanied by short subjects from local filmmakers. The theatre's reopening was accompanied with a renewed emphasis on its repertory role. During this relaunch month, thirteen Ottawa premieres were presented while double bills were now limited to Tuesday nights and occasionally other nights. Midnight screenings on Friday and Saturday nights were also introduced._NEWLINE_In July 2009 two of the founding members of the new partnership, John Yemen and Paul Gordon left the group to pursue other projects. John Yemen was the individual who sent the city a proposal for heritage designation in the summer of 2008. The makeup of the partnership is now Lee Demarbre (programmer), Ian Driscoll, and Josh Stafford. 
_START_SECTION_ 2010s _START_PARAGRAPH_ Currently, the Mayfair's programming includes cult films, family matinees,independent films, Ottawa premieres, local films, festivals, and late night presentations. It also became the main venue for the Ottawa International Writers Festival in spring of 2010, hosting readings and lectures. The theatre also reports continued success with its annual Halloween screenings of The Rocky Horror Picture Show. _START_SECTION_ Mayfair Orleans _START_PARAGRAPH_ The Mayfair opened a three-screen cinema in Orleans on 2 December 2011. It was situated at the former Empire Six theatre facility. This location presented similar programming as the original Mayfair location, with some emphasis on family-oriented films. The Mayfair Orleans location closed on 13 February 2013 when its lease was cancelled due to arrears in rent. + _START_ARTICLE_ DC Comics _START_SECTION_ Origins _START_PARAGRAPH_ Entrepreneur Major Malcolm Wheeler-Nicholson founded National Allied Publications in autumn 1934. The company debuted with the tabloid-sized New Fun: The Big Comic Magazine #1 with a cover date of February 1935. The company's second title, New Comics #1 (Dec. 1935), appeared in a size close to what would become comic books' standard during the period fans and historians call the Golden Age of Comic Books, with slightly larger dimensions than today's. That title evolved into Adventure Comics, which continued through issue #503 in 1983, becoming one of the longest-running comic-book series. In 2009 DC revived Adventure Comics with its original numbering. In 1935, Jerry Siegel and Joe Shuster, the future creators of Superman, created Doctor Occult, who is the earliest DC Comics character to still be in the DC Universe._NEWLINE_Wheeler-Nicholson's third and final title, Detective Comics, advertised with a cover illustration dated December 1936, eventually premiered three months late with a March 1937 cover date. The themed anthology series would become a sensation with the introduction of Batman in issue #27 (May 1939). By then, however, Wheeler-Nicholson had gone. In 1937, in debt to printing-plant owner and magazine distributor Harry Donenfeld—who also published pulp magazines and operated as a principal in the magazine distributorship Independent News—Wheeler-Nicholson had to take Donenfeld on as a partner in order to publish Detective Comics #1. Detective Comics, Inc. was formed, with Wheeler-Nicholson and Jack S. Liebowitz, Donenfeld's accountant, listed as owners. Major Wheeler-Nicholson remained for a year, but cash-flow problems continued, and he was forced out. Shortly afterwards, Detective Comics, Inc. purchased the remains of National Allied, also known as Nicholson Publishing, at a bankruptcy auction._NEWLINE_Detective Comics, Inc. soon launched a fourth title, Action Comics, the premiere of which introduced Superman. Action Comics #1 (June 1938), the first comic book to feature the new character archetype—soon known as "superheroes"—proved a sales hit. The company quickly introduced such other popular characters as the Sandman and Batman._NEWLINE_On February 22, 2010, a copy of Action Comics #1 (June 1938) sold at an auction from an anonymous seller to an anonymous buyer for $1 million, besting the $317,000 record for a comic book set by a different copy, in lesser condition, the previous year. _START_SECTION_ Golden Age _START_PARAGRAPH_ National Allied Publications soon merged with Detective Comics, Inc., forming National Comics Publications on September 30, 1946. 
National Comics Publications absorbed an affiliated concern, Max Gaines' and Liebowitz' All-American Publications. In the same year Gaines let Liebowitz buy him out, and kept only Picture Stories from the Bible as the foundation of his own new company, EC Comics. At that point, "Liebowitz promptly orchestrated the merger of All-American and Detective Comics into National Comics... Next he took charge of organizing National Comics, [the self-distributorship] Independent News, and their affiliated firms into a single corporate entity, National Periodical Publications". National Periodical Publications became publicly traded on the stock market in 1961._NEWLINE_Despite the official names "National Comics" and "National Periodical Publications", the company began branding itself as "Superman-DC" as early as 1940, and the company became known colloquially as DC Comics for years before the official adoption of that name in 1977._NEWLINE_The company began to move aggressively against what it saw as copyright-violating imitations from other companies, such as Fox Comics' Wonder Man, which (according to court testimony) Fox started as a copy of Superman. This extended to DC suing Fawcett Comics over Captain Marvel, at the time comics' top-selling character (see National Comics Publications, Inc. v. Fawcett Publications, Inc.). Faced with declining sales and the prospect of bankruptcy if it lost, Fawcett capitulated in 1953 and ceased publishing comics. Years later, Fawcett sold the rights for Captain Marvel to DC—which in 1972 revived Captain Marvel in the new title Shazam! featuring artwork by his creator, C. C. Beck. In the meantime, the abandoned trademark had been seized by Marvel Comics in 1967, with the creation of their Captain Marvel, forbidding the DC comic itself to be called that. While Captain Marvel did not recapture his old popularity, he later appeared in a Saturday morning live action TV adaptation and gained a prominent place in the mainstream continuity DC calls the DC Universe._NEWLINE_When the popularity of superheroes faded in the late 1940s the company focused on such genres as science fiction, Westerns, humor, and romance. DC also published crime and horror titles, but relatively tame ones, and thus avoided the mid-1950s backlash against such comics. A handful of the most popular superhero-titles, including Action Comics and Detective Comics, the medium's two longest-running titles, continued publication. _START_SECTION_ Silver Age _START_PARAGRAPH_ In the mid-1950s, editorial director Irwin Donenfeld and publisher Liebowitz directed editor Julius Schwartz (whose roots lay in the science-fiction book market) to produce a one-shot Flash story in the try-out title Showcase. Instead of reviving the old character, Schwartz had writers Robert Kanigher and John Broome, penciler Carmine Infantino, and inker Joe Kubert create an entirely new super-speedster, updating and modernizing the Flash's civilian identity, costume, and origin with a science-fiction bent. The Flash's reimagining in Showcase #4 (October 1956) proved sufficiently popular that it soon led to a similar revamping of the Green Lantern character, the introduction of the modern all-star team Justice League of America (JLA), and many more superheroes, heralding what historians and fans call the Silver Age of comic books._NEWLINE_National did not reimagine its continuing characters (primarily Superman, Batman, and Wonder Woman), but radically overhauled them. 
The Superman family of titles, under editor Mort Weisinger, introduced such enduring characters as Supergirl, Bizarro, and Brainiac. The Batman titles, under editor Jack Schiff, introduced the successful Batwoman, Bat-Girl, Ace the Bat-Hound, and Bat-Mite in an attempt to modernize the strip with non-science-fiction elements. Schwartz, together with artist Infantino, then revitalized Batman in what the company promoted as the "New Look", re-emphasizing Batman as a detective. Meanwhile, editor Kanigher successfully introduced a whole family of Wonder Woman characters having fantastic adventures in a mythological context._NEWLINE_Since the 1940s, when Superman, Batman, and many of the company's other heroes began appearing in stories together, DC's characters inhabited a shared continuity that, decades later, was dubbed the "DC Universe" by fans. With the story "Flash of Two Worlds", in Flash #123 (September 1961), editor Schwartz (with writer Gardner Fox and artists Infantino and Joe Giella) introduced a concept that allowed slotting the 1930s and 1940s Golden Age heroes into this continuity via the explanation that they lived on an other-dimensional "Earth 2", as opposed to the modern heroes' "Earth 1"—in the process creating the foundation for what would later be called the DC Multiverse._NEWLINE_DC's introduction of the reimagined superheroes did not go unnoticed by other comics companies. In 1961, with DC's JLA as the specific spur, Marvel Comics writer-editor Stan Lee and a robust creator Jack Kirby ushered in the sub-Silver Age "Marvel Age" of comics with the debut issue of The Fantastic Four. Reportedly, DC ignored the initial success of Marvel with this editorial change until its consistently strengthening sales, albeit also benefiting Independent News' business as their distributor as well, made that impossible. That commercial situation especially applied with Marvel's superior sell-through percentage numbers which were typically 70% to DC's roughly 50%, which meant DC's publications were barely making a profit in comparison after returns from the distributors were calculated while Marvel was making an excellent profit by comparison._NEWLINE_However, the senior DC staff were reportedly at a loss at this time to understand how this small publishing house was achieving this increasingly threatening commercial strength. For instance, when Marvel's product was examined in a meeting, Marvel's emphasis on more sophisticated character-based narrative and artist-driven visual storytelling was apparently ignored for self-deluding guesses at the brand's popularity which included superficial reasons like the presence of the color red or word balloons on the cover, or that the perceived crudeness of the interior art was somehow more appealing to readers. When Lee learned about DC's subsequent experimental attempts to imitate these perceived details, he amused himself by arranging direct defiance of those assumptions in Marvel's publications as sales strengthened further to frustrate the competition._NEWLINE_However, this ignorance of Marvel's true appeal did not extend to some of the writing talent during this period, from which there were some attempts to emulate Marvel's narrative approach. For instance, there was the Doom Patrol series by Arnold Drake, a writer who previously warned the management of the new rival's strength; a superhero team of outsiders who resented their freakish powers, which Drake later speculated was plagiarized by Stan Lee to create The X-Men. 
There was also the young Jim Shooter who purposely emulated Marvel's writing when he wrote for DC after much study of both companies' styles, such as for the Legion of Super-Heroes feature._NEWLINE_A 1966 Batman TV show on the ABC network sparked a temporary spike in comic book sales, and a brief fad for superheroes in Saturday morning animation (Filmation created most of DC's initial cartoons) and other media. DC significantly lightened the tone of many DC comics—particularly Batman and Detective Comics—to better complement the "camp" tone of the TV series. This tone coincided with the famous "Go-Go Checks" checkerboard cover-dress which featured a black-and-white checkerboard strip (all DC books cover dated February 1966 until August 1967) at the top of each comic, a misguided attempt by then-managing editor Irwin Donenfeld to make DC's output "stand out on the newsracks". In particular, DC artist, Carmine Infantino, complained that the visual cover distinctiveness made DC's titles easier for readers to see and then avoid in favor of Marvel's titles._NEWLINE_In 1967, Batman artist Infantino (who had designed popular Silver Age characters Batgirl and the Phantom Stranger) rose from art director to become DC's editorial director. With the growing popularity of upstart rival Marvel Comics threatening to topple DC from its longtime number-one position in the comics industry, he attempted to infuse the company with more focus towards marketing new and existing titles and characters with more adult sensibilities towards an emerging older age group of superhero comic book fans that grew out of Marvel's efforts to market their superhero line to college-aged adults. He also recruited major talents such as ex-Marvel artist and Spider-Man co-creator Steve Ditko and promising newcomers Neal Adams and Denny O'Neil and replaced some existing DC editors with artist-editors, including Joe Kubert and Dick Giordano, to give DC's output a more artistic critical eye. _START_SECTION_ Kinney National/Warner Communications subsidiary (1967-1990) _START_PARAGRAPH_ In 1967, National Periodical Publications was purchased by Kinney National Company, which purchased Warner Bros.-Seven Arts in 1969. Kinney National spun off its non-entertainment assets in 1972 (as National Kinney Corporation) and changed its name to Warner Communications Inc._NEWLINE_In 1970, Jack Kirby moved from Marvel Comics to DC, at the end of the Silver Age of Comics, in which Kirby's contributions to Marvel played a large, integral role. Given carte blanche to write and illustrate his own stories, he created a handful of thematically linked series he called collectively The Fourth World. In the existing series Superman's Pal Jimmy Olsen and in his own, newly launched series New Gods, Mister Miracle, and The Forever People, Kirby introduced such enduring characters and concepts as archvillain Darkseid and the other-dimensional realm Apokolips. Furthermore, Kirby intended their stories to be later reprinted in collected editions in a publishing format that would later be called the trade paperback, which would become a standard industry practice decades later. While sales were respectable, they did not meet DC management's initially high expectations, and also suffered from a lack of comprehension and internal support from Infantino. By 1973 the "Fourth World" was all cancelled, although Kirby's conceptions would soon become integral to the broadening of the DC Universe. 
Obligated by his contract, Kirby created other unrelated series for DC, including Kamandi, The Demon, and OMAC, before ultimately returning to Marvel Comics. _START_SECTION_ The Bronze Age _START_PARAGRAPH_ Following the science-fiction innovations of the Silver Age, the comics of the 1970s and 1980s would become known as the Bronze Age, as fantasy gave way to more naturalistic and sometimes darker themes. Illegal drug use, banned by the Comics Code Authority, explicitly appeared in comics for the first time in Marvel Comics' story "Green Goblin Reborn!" in The Amazing Spider-Man #96 (May 1971), and after the Code's updating in response, DC offered a drug-fueled storyline in writer Dennis O'Neil and artist Neal Adams' Green Lantern, beginning with the story "Snowbirds Don't Fly" in the retitled Green Lantern / Green Arrow #85 (Sept. 1971), which depicted Speedy, the teen sidekick of superhero archer Green Arrow, as having become a heroin addict._NEWLINE_Jenette Kahn, a former children's magazine publisher, replaced Infantino as editorial director in January 1976. DC had attempted to compete with the now-surging Marvel by dramatically increasing its output and attempting to win the market by flooding it. This included launching series featuring such new characters as Firestorm and Shade, the Changing Man, as well as an increasing array of non-superhero titles, in an attempt to recapture the pre-Wertham days of post-War comicdom. In June 1978, five months before the release of the first Superman movie, Kahn expanded the line further, increasing the number of titles and story pages, and raising the price from 35 cents to 50 cents. Most series received eight-page back-up features while some had full-length twenty-five-page stories. This was a move the company called the "DC Explosion". The move was not successful, however, and corporate parent Warner dramatically cut back on these largely unsuccessful titles, firing many staffers in what industry watchers dubbed "the DC Implosion". In September 1978, the line was dramatically reduced and standard-size books returned to 17 story pages, but at an increased price of 40 cents. By 1980, the books returned to 50 cents with a 25-page story count, the extra story pages replacing house ads in the books._NEWLINE_Seeking new ways to boost market share, the new team of publisher Kahn, vice president Paul Levitz, and managing editor Giordano addressed the issue of talent instability. To that end, and following the example of Atlas/Seaboard Comics and such independent companies as Eclipse Comics, DC began to offer royalties in place of the industry-standard work-for-hire agreement in which creators worked for a flat fee and signed away all rights, giving talent a financial incentive tied to the success of their work. As it happened, the implementation of these incentives proved opportune: Marvel Comics' Editor-in-Chief, Jim Shooter, was alienating much of his company's creative staff with his authoritarian manner, and major talents there, such as Roy Thomas, Gene Colan, Marv Wolfman, and George Pérez, went to DC._NEWLINE_In addition, emulating the era's new television form, the miniseries, while addressing the problem of an excessive number of ongoing titles fizzling out within a few issues of their start, DC created the industry concept of the comic book limited series.
This publishing format allowed for the deliberate creation of finite storylines within a more flexible publishing format that could showcase creations without forcing the talent into unsustainable open-ended commitments. The first such title was World of Krypton in 1979, and its positive results lead to subsequent similar titles and later more ambitious productions like Camelot 3000 for the direct market in 1982._NEWLINE_These changes in policy shaped the future of the medium as a whole, and in the short term allowed DC to entice creators away from rival Marvel, and encourage stability on individual titles. In November 1980 DC launched the ongoing series The New Teen Titans, by writer Marv Wolfman and artist George Pérez, two popular talents with a history of success. Their superhero-team comic, superficially similar to Marvel's ensemble series X-Men, but rooted in DC history, earned significant sales in part due to the stability of the creative team, who both continued with the title for six full years. In addition, Wolfman and Pérez took advantage of the limited-series option to create a spin-off title, Tales of the New Teen Titans, to present origin stories of their original characters without having to break the narrative flow of the main series or oblige them to double their work load with another ongoing title. _START_SECTION_ Modern Age _START_PARAGRAPH_ This successful revitalization of the Silver Age Teen Titans led DC's editors to seek the same for the wider DC Universe. The result, the Wolfman/Pérez 12-issue limited series Crisis on Infinite Earths, gave the company an opportunity to realign and jettison some of the characters' complicated backstory and continuity discrepancies. A companion publication, two volumes entitled The History of the DC Universe, set out the revised history of the major DC characters. Crisis featured many key deaths that would shape the DC Universe for the following decades, and separate the timeline of DC publications into pre- and post-"Crisis"._NEWLINE_Meanwhile, a parallel update had started in the non-superhero and horror titles. Since early 1984, the work of British writer Alan Moore had revitalized the horror series The Saga of the Swamp Thing, and soon numerous British writers, including Neil Gaiman and Grant Morrison, began freelancing for the company. The resulting influx of sophisticated horror-fantasy material led to DC in 1993 establishing the Vertigo mature-readers imprint, which did not subscribe to the Comics Code Authority._NEWLINE_Two DC limited series, Batman: The Dark Knight Returns by Frank Miller and Watchmen by Moore and artist Dave Gibbons, drew attention in the mainstream press for their dark psychological complexity and promotion of the antihero. These titles helped pave the way for comics to be more widely accepted in literary-criticism circles and to make inroads into the book industry, with collected editions of these series as commercially successful trade paperbacks._NEWLINE_The mid-1980s also saw the end of many long-running DC war comics, including series that had been in print since the 1960s. These titles, all with over 100 issues, included Sgt. Rock, G.I. Combat, The Unknown Soldier, and Weird War Tales. _START_SECTION_ Time Warner/AOL Time Warner/WarnerMedia unit (1990–present) _START_PARAGRAPH_ In March 1989, Warner Communications merged with Time Inc., making DC Comics a subsidiary of Time Warner. 
In June, the first Tim Burton directed Batman movie was released, and DC began publishing its hardcover series of DC Archive Editions, collections of many of their early, key comics series, featuring rare and expensive stories unseen by many modern fans. Restoration for many of the Archive Editions was handled by Rick Keene with colour restoration by DC's long-time resident colourist, Bob LeRose. These collections attempted to retroactively credit many of the writers and artists who had worked without much recognition for DC during the early period of comics when individual credits were few and far between._NEWLINE_The comics industry experienced a brief boom in the early 1990s, thanks to a combination of speculative purchasing (mass purchase of the books as collectible items, with intent to resell at a higher value as the rising value of older issues, was thought to imply that all comics would rise dramatically in price) and several storylines which gained attention from the mainstream media. DC's extended storylines in which Superman was killed, Batman was crippled and superhero Green Lantern turned into the supervillain Parallax resulted in dramatically increased sales, but the increases were as temporary as the hero's replacements. Sales dropped off as the industry went into a major slump, while manufactured "collectables" numbering in the millions replaced quality with quantity until fans and speculators alike deserted the medium in droves._NEWLINE_DC's Piranha Press and other imprints (including the mature readers line Vertigo, and Helix, a short-lived science fiction imprint) were introduced to facilitate compartmentalized diversification and allow for specialized marketing of individual product lines. They increased the use of non-traditional contractual arrangements, including the dramatic rise of creator-owned projects, leading to a significant increase in critically lauded work (much of it for Vertigo) and the licensing of material from other companies. DC also increased publication of book-store friendly formats, including trade paperback collections of individual serial comics, as well as original graphic novels._NEWLINE_One of the other imprints was Impact Comics from 1991 to 1992 in which the Archie Comics superheroes were licensed and revamped. The stories in the line were part of its own shared universe._NEWLINE_DC entered into a publishing agreement with Milestone Media that gave DC a line of comics featuring a culturally and racially diverse range of superhero characters. Although the Milestone line ceased publication after a few years, it yielded the popular animated series Static Shock. DC established Paradox Press to publish material such as the large-format Big Book of... series of multi-artist interpretations on individual themes, and such crime fiction as the graphic novel Road to Perdition. In 1998, DC purchased WildStorm Comics, Jim Lee's imprint under the Image Comics banner, continuing it for many years as a wholly separate imprint – and fictional universe – with its own style and audience. As part of this purchase, DC also began to publish titles under the fledgling WildStorm sub-imprint America's Best Comics (ABC), a series of titles created by Alan Moore, including The League of Extraordinary Gentlemen, Tom Strong, and Promethea. Moore strongly contested this situation, and DC eventually stopped publishing ABC. 
_START_SECTION_ 2000s _START_PARAGRAPH_ In March 2003 DC acquired publishing and merchandising rights to the long-running fantasy series Elfquest, previously self-published by creators Wendy and Richard Pini under their WaRP Graphics publication banner. This series then followed another non-DC title, Tower Comics' series T.H.U.N.D.E.R. Agents, in collection into DC Archive Editions. In 2004 DC temporarily acquired the North American publishing rights to graphic novels from European publishers 2000 AD and Humanoids. It also rebranded its younger-audience titles with the mascot Johnny DC and established the CMX imprint to reprint translated manga. In 2006, CMX took over from Dark Horse Comics publication of the webcomic Megatokyo in print form. DC also took advantage of the demise of Kitchen Sink Press and acquired the rights to much of the work of Will Eisner, such as his The Spirit series and his graphic novels._NEWLINE_In 2004, DC began laying the groundwork for a full continuity-reshuffling sequel to Crisis on Infinite Earths, promising substantial changes to the DC Universe (and side-stepping the 1994 Zero Hour event which similarly tried to ret-con the history of the DCU). In 2005, the critically lauded Batman Begins film was released; also, the company published several limited series establishing increasingly escalated conflicts among DC's heroes, with events climaxing in the Infinite Crisis limited series. Immediately after this event, DC's ongoing series jumped forward a full year in their in-story continuity, as DC launched a weekly series, 52, to gradually fill in the missing time. Concurrently, DC lost the copyright to "Superboy" (while retaining the trademark) when the heirs of Jerry Siegel used a provision of the 1976 revision to the copyright law to regain ownership._NEWLINE_In 2005, DC launched its "All-Star" line (evoking the title of the 1940s publication), designed to feature some of the company's best-known characters in stories that eschewed the long and convoluted continuity of the DC Universe. The line began with All-Star Batman & Robin the Boy Wonder and All-Star Superman, with All-Star Wonder Woman and All-Star Batgirl announced in 2006 but neither being released nor scheduled as of the end of 2009._NEWLINE_DC licensed characters from the Archie Comics imprint Red Circle Comics by 2007. They appeared in the Red Circle line, based in the DC Universe, with a series of one-shots followed by a miniseries that lead into two ongoing titles, each lasting 10 issues. _START_SECTION_ 2010s _START_PARAGRAPH_ In 2011, DC rebooted all of its running titles following the Flashpoint storyline. The reboot called The New 52 gave new origin stories and costume designs to many of DC's characters._NEWLINE_DC licensed pulp characters including Doc Savage and the Spirit which it then used, along with some DC heroes, as part of the First Wave comics line launched in 2010 and lasting through fall 2011._NEWLINE_In May 2011, DC announced it would begin releasing digital versions of their comics on the same day as paper versions._NEWLINE_On June 1, 2011, DC announced that it would end all ongoing series set in the DC Universe in August and relaunch its comic line with 52 issue #1s, starting with Justice League on August 31 (written by Geoff Johns and drawn by Jim Lee), with the rest to follow later on in September._NEWLINE_On June 4, 2013, DC unveiled two new digital comic innovations to enhance interactivity: DC² and DC² Multiverse. 
DC² layers dynamic artwork onto digital comic panels, adding a new level of dimension to digital storytelling, while DC² Multiverse allows readers to determine a specific story outcome by selecting individual characters, storylines and plot developments while reading the comic, meaning one digital comic has multiple outcomes. DC² will first appear in the upcoming digital-first title, Batman '66, based on the 1960s television series and DC² Multiverse will first appear in Batman: Arkham Origins, a digital-first title based on the video game of the same name._NEWLINE_In 2014, DC announced an eight-issue miniseries titled "Convergence" which began in April 2015._NEWLINE_In 2016, DC announced a line-wide relaunch titled DC Rebirth. The new line would launch with an 80-page one-shot titled DC Universe: Rebirth, written by Geoff Johns, with art from Gary Frank, Ethan Van Sciver, and more. After that, many new series would launch with a twice-monthly release schedule and new creative teams for nearly every title. The relaunch was meant to bring back the legacy and heart many felt had been missing from DC characters since the launch of the New 52. Rebirth brought huge success, both financially and critically. _START_SECTION_ Logo _START_PARAGRAPH_ DC's first logo appeared on the April 1940 issues of its titles. The letters "DC" stood for Detective Comics, the name of Batman's flagship title. The small logo, with no background, read simply, "A DC Publication"._NEWLINE_The November 1941 DC titles introduced an updated logo. This version was almost twice the size of the previous one and was the first version with a white background. The name "Superman" was added to "A DC Publication", effectively acknowledging both Superman and Batman. This logo was the first to occupy the top-left corner of the cover, where the logo has usually resided since. The company now referred to itself in its advertising as "Superman-DC"._NEWLINE_In November 1949, the logo was modified to incorporate the company's formal name, National Comics Publications. This logo would also serve as the round body of Johnny DC, DC's mascot in the 1960s._NEWLINE_In October 1970, DC briefly retired the circular logo in favour of a simple "DC" in a rectangle with the name of the title, or the star of the book; the logo on many issues of Action Comics, for example, read "DC Superman". An image of the lead character either appeared above or below the rectangle. For books that did not have a single star, such as anthologies like House of Mystery or team series such as Justice League of America, the title and "DC" appeared in a stylized logo, such as a bat for "House of Mystery". This use of characters as logos helped to establish the likenesses as trademarks, and was similar to Marvel's contemporaneous use of characters as part of its cover branding._NEWLINE_DC's "100 Page Super-Spectacular" titles and later 100-page and "Giant" issues published from 1972 to 1974 featured a logo exclusive to these editions: the letters "DC" in a simple sans-serif typeface within a circle. A variant had the letters in a square._NEWLINE_The July 1972 DC titles featured a new circular logo. The letters "DC" were rendered in a block-like typeface that would remain through later logo revisions until 2005. The title of the book usually appeared inside the circle, either above or below the letters._NEWLINE_In December 1973, this logo was modified with the addition of the words "The Line of DC Super-Stars" and the star motif that would continue in later logos. 
This logo was placed in the top center of the cover from August 1975 to October 1976._NEWLINE_When Jenette Kahn became DC's publisher in late 1976, she commissioned graphic designer Milton Glaser to design a new logo. Popularly referred to as the "DC bullet", this logo premiered on the February 1977 titles. Although it varied in size and colour and was at times cropped by the edges of the cover, or briefly rotated 4 degrees, it remained essentially unchanged for nearly three decades. Despite logo changes since 2005, the old "DC bullet" continues to be used only on the DC Archive Editions series._NEWLINE_In July 1987, DC released variant editions of Justice League #3 and The Fury of Firestorm #61 with a new DC logo. It featured a picture of Superman in a circle surrounded by the words "SUPERMAN COMICS". The company released these variants to newsstands in certain markets as a marketing test._NEWLINE_On May 8, 2005, a new logo (dubbed the "DC spin") was unveiled, debuting on DC titles in June 2005 with DC Special: The Return of Donna Troy #1 and the rest of the titles the following week. In addition to comics, it was designed for DC properties in other media; it was used for movies beginning with Batman Begins, with Superman Returns showing the logo's normal variant, and for the TV series Smallville, the animated series Justice League Unlimited and others, as well as for collectibles and other merchandise. The logo was designed by Josh Beatman of Brainchild Studios and DC executive Richard Bruning._NEWLINE_In March 2012, DC unveiled a new logo consisting of the letter "D" flipping back to reveal the letter "C" and "DC ENTERTAINMENT". The Dark Knight Rises was the first film to use the new logo, while the TV series Arrow was the first series to feature it._NEWLINE_DC Entertainment announced a new identity and logo for another iconic DC Comics universe brand on May 17, 2016. The new logo was first used on May 25, 2016, in conjunction with the release of DC Universe: Rebirth Special #1 by Geoff Johns. _START_SECTION_ DC Universe _START_PARAGRAPH_ DC Universe is a video on demand service operated by DC Entertainment. It was announced in April 2017, with the title and service formally announced in May 2018. DC Universe is expected to offer more than video content through the inclusion of an immersive experience with fan interaction that encompasses comics in addition to television. _START_SECTION_ Digital distribution _START_PARAGRAPH_ DC Comics are available in digital form through several sources._NEWLINE_Free services: In 2015, Hoopla Digital became the first library-based digital system to distribute DC Comics._NEWLINE_Paid services: Google Play, Comixology
 + _START_ARTICLE_ Prelude in A minor, Op. 51, No. 2 (Scriabin) _START_PARAGRAPH_ Alexander Scriabin's Prelude Opus 51 No. 2 is the second of his Quatre Morceaux (Four Pieces) op. 51, published in 1906. It is notated in A minor. It is written in 6/8 time over 30 measures (plus an upbeat) and is marked Lugubre (mournful)._NEWLINE_This is one of several pieces Scriabin never played in public (together with the Sonata No. 6 (op. 62)). He called it "Shattered Strings" (German "Zersprungene Saiten") when Leonid Sabaneyev reminded him of the piece during a discussion about minor and major. Sabaneyev quotes him as saying: "Oh, let's not talk about this! This is a ghastly piece! [...] I was in an appalling situation back then. This Prelude, and also the Marche funebre in the First Sonata, formed in moments of disheartenment... But only these two!"
(referring to his allegation that he had abandoned the minor tonality a long time ago).
 + _START_ARTICLE_ Teté _START_SECTION_ Biography _START_PARAGRAPH_ Teté was a major football coach in Rio Grande do Sul. He became known as the "Marshal of Victories" because he was an officer of the Army Reserve._NEWLINE_As a player, he served in the 9º Regimento. He then trained Farroupilha (after the change of the club's name), Pelotas, Brazil, Guarany of Bagé, General Osorio, Cruzeiro-RS, Nacional-RS and Internacional._NEWLINE_At Internacional, Teté did well. He coached the team from 1951 to 1957 and was a four-time Gaucho champion (1951, 1952, 1953 and 1955). He also coached the Brazil national team, becoming champion of the 1956 Pan American Championship in Mexico.
 + _START_ARTICLE_ Lauryn Hill _START_SECTION_ 1975–1993: Early life and career beginnings _START_PARAGRAPH_ Lauryn Noelle Hill was born on May 26, 1975, in East Orange, New Jersey. Her mother, Valerie Hill, was an English teacher and her father, Mal Hill, a computer and management consultant. She has one older brother named Malaney who was born in 1972. Her Baptist family moved to New York and Newark for short periods before settling in South Orange, New Jersey._NEWLINE_Hill has said of her musically oriented family: "there were so many records, so much music constantly being played. My mother played the piano, my father sang, and we were always surrounded by music." Her father sang in local nightclubs and at weddings. While growing up, Hill frequently listened to Curtis Mayfield, Stevie Wonder, Aretha Franklin, and Gladys Knight; years later she recalled playing Marvin Gaye's What's Going On repeatedly until she fell asleep to it._NEWLINE_In middle school, Lauryn Hill performed "The Star-Spangled Banner" before a basketball game. Due to its popularity, subsequent games featured a recording of her rendition. In 1988, Hill appeared as an Amateur Night contestant on It's Showtime at the Apollo. She sang her version of the Smokey Robinson track "Who's Lovin' You", garnering an initially harsh reaction from the crowd. She persevered through the performance; however, she later cried off-stage._NEWLINE_Hill attended Columbia High School, where she was a member of the track team and cheerleading squad, and was a classmate of actor Zach Braff. She also took violin lessons, went to dance class, and founded the school's gospel choir. Academically, she took advanced placement classes and received primarily 'A' grades. School officials recognized her as a leader among the student body. Later recalling her education, Hill commented, "I had a love for—I don't know if it was necessarily for academics, more than it just was for achieving, period. If it was academics, if it was sports, if it was music, if it was dance, whatever it was, I was always driven to do a lot in whatever field or whatever area I was focusing on at the moment."_NEWLINE_While Hill was a freshman in high school, Prakazrel "Pras" Michel approached her through mutual friends about a music group he was creating. Hill and Pras began under the name Translator Crew. They came up with this name because they wanted to rhyme in different languages. Another female vocalist was soon replaced by Michel's cousin, multi-instrumentalist Wyclef Jean. The group began performing in local showcases and high school talent shows. Hill was initially only a singer, but then learned to rap too; instead of modeling herself on female rappers like Salt-N-Pepa and MC Lyte, she preferred male rappers like Ice Cube and developed her flow from listening to them.
Hill later said, "I remember doing my homework in the bathroom stalls of hip-hop clubs."_NEWLINE_While growing up, Hill took acting lessons in Manhattan. She began her acting career in 1991, appearing with Jean in Club XII, MC Lyte's Off-Broadway hip-hop rendering of Shakespeare's Twelfth Night. While the play was not a success, an agent noticed her. Later that year, Hill began appearing on the soap opera As the World Turns in a recurring role as troubled teenager Kira Johnson. She subsequently co-starred alongside Whoopi Goldberg in the 1993 release Sister Act 2: Back in the Habit, playing Rita Louise Watson, an inner-city Catholic school teenager with a surly, rebellious attitude. In it, she performed the songs "His Eye Is on the Sparrow" (a duet with Tanya Blount) and "Joyful, Joyful". Director Bill Duke credited Hill with improvising a rap in a scene: "None of that was scripted. That was all Lauryn. She was amazing." Critic Roger Ebert called her "the girl with the big joyful voice", although he thought her talent was wasted, while Rolling Stone said she "performed marvelously against type ... in the otherwise perfunctory [film]." Hill also appeared in Steven Soderbergh's 1993 motion picture King of the Hill, in a minor but pivotal role as a 1930s gum-popping elevator operator. Soderbergh biographer Jason Wood described her as supplying one of the warmest scenes in the film. Hill graduated from Columbia High School in 1993. _START_SECTION_ 1994–1996: The Fugees _START_PARAGRAPH_ Pras, Hill and Jean renamed their group the Fugees, a derivative of the word "refugee", which was a derogatory term for Haitian Americans. Hill began a romantic relationship with Jean. The Fugees, who signed a contract with Columbia/Ruffhouse Records in 1993, became known for their genre blending, particularly of reggae, rock and soul, which they first experimented with on their debut album, Blunted on Reality, released in 1994. It reached number 62 on the Billboard Top R&B/Hip-Hop Albums chart but overall sold poorly and was met with poor critical reviews, due in part to their management's insistence that they adopt gangsta rap attitudes. Although the album made little impact, Hill's rapping on "Some Seek Stardom" was seen as a highlight. Within the group, she was frequently referred to by the nickname "L. Boogie". Hill's image and artistry, as well as her full, rich, raspy alto voice, placed her at the forefront of the band, with some fans urging her to begin a solo career._NEWLINE_The Fugees' second album, The Score (1996), peaked at number one on the U.S. Billboard 200 and stayed in the top ten of that chart for over half a year. It sold about six million copies in the United States and more than 17 million copies worldwide. In the 1996 Pazz & Jop Critics Poll, The Score came second in the list of best albums and three of its tracks placed within the top twenty best singles. It won the Grammy Award for Best Rap Album, and was later included on Rolling Stone's list of the 500 greatest albums of all time. Almost all of the writing and producing for it was done by Jean. The Score garnered praise for being a strong alternative to the gangsta idiom, and Hill stated, "We're trying to do something positive with the music because it seems like only the negative is rising to the top these days. It only takes a drop of purity to clean a cesspool."_NEWLINE_Singles from The Score included "Fu-Gee-La" and "Ready or Not", which highlighted Hill's singing and rapping abilities, and "No Woman, No Cry". 
Her rendition of "Killing Me Softly" became her breakout hit. Buttressed by what Rolling Stone publications later called Hill's "evocative" vocal line and her "amazing pipes", the track became pervasive on pop, R&B, hip hop, and adult contemporary radio formats. It won the Grammy Award for Best R&B Performance by a Duo or Group with Vocals. On the album, Hill combined African-American music and Caribbean music influences with socially conscious lyrics. Newsweek mentioned Hill's "irresistibly cute looks" and proclaimed her "the most powerful new voice in rap."_NEWLINE_At 21 years old, the now-famous Hill was still living at home with her parents. She had been enrolled at Columbia University during this period, and considered majoring in history as she became a sophomore, but left after about a year of total studies once sales of The Score went into the millions. In 1996, Hill responded to a false rumor on The Howard Stern Show that she had made a racist comment on MTV, saying "How can I possibly be a racist? My music is universal. And I believe in God. If I believe in God, then I have to love all of God's creations. There can be no segregation."_NEWLINE_In 1996, Hill founded the Refugee Project, a non-profit outreach organization that sought to transform the attitudes and behavior of at-risk urban youth. Part of this was Camp Hill, which offered stays in the Catskill Mountains for such youngsters; another was production of an annual Halloween haunted house in East Orange. Hill also raised money for Haitian refugees, supported clean water well-building projects in Kenya and Uganda, and staged a rap concert in Harlem to promote voter registration. A 1997 benefit event for the Refugee Project introduced a Board of Trustees for the organization that included Sean Combs, Mariah Carey, Busta Rhymes, Spike Lee, and others as members._NEWLINE_In 1997, the Fugees split to work on solo projects, which Jean later blamed on his tumultuous relationship with Hill and the fact he married his wife Claudinette while still involved with Hill. Meanwhile, in the summer of 1996 Hill had met Rohan Marley, a son of Bob Marley and a former University of Miami football player. Hill subsequently began a relationship with him, while still also involved with Jean. Hill became pregnant in late 1996, and on August 3, 1997, Marley and Hill's first child, Zion David, was born. The couple lived in Hill's childhood house in South Orange after she bought her parents a new house down the street._NEWLINE_Hill had a cameo appearance in the 1997 film Hav Plenty. In 1998, Hill took up another small, but important role in the film Restaurant; Entertainment Weekly praised her portrayal of the protagonist's pregnant former girlfriend as bringing vigor to the film. _START_SECTION_ 1997–1999: The Miseducation of Lauryn Hill _START_PARAGRAPH_ Hill recorded her solo record The Miseducation of Lauryn Hill from late 1997 through June 1998 at Tuff Gong Studios in Jamaica. The title was inspired by the book The Mis-Education of the Negro (1933) by Carter G. Woodson and The Education of Sonny Carson, a film and autobiographical novel. The album featured contributions from D'Angelo, Carlos Santana, Mary J. Blige and the then-unknown John Legend. Wyclef Jean initially did not support Hill recording a solo album, but eventually offered his production help; Hill turned him down. Several songs on the album concerned her frustration with the Fugees; "I Used to Love Him" dealt with the breakdown of the relationship between Hill and Wyclef Jean. 
Other songs such as "To Zion" spoke about her decision to have her first baby, even though many at the time encouraged her to have an abortion so to not interfere with her blossoming career. Indeed, Hill's pregnancy revived her from a period of writer's block._NEWLINE_In terms of production, Hill collaborated with a group of musicians known as New Ark, consisting of Vada Nobles, Rasheem Pugh, Tejumold Newton, and Johari Newton. Hill later said that she wanted to "write songs that lyrically move me and have the integrity of reggae and the knock of hip-hop and the instrumentation of classic soul" and that the production on the album was intended to make the music sound raw and not computer-aided. Hill spoke of pressure from her label to emulate Prince, wherein all tracks would be credited as written and produced by the artist with little outside help. She also wanted to be appreciated as an auteur as much as Jean had within the Fugees. (She also saw a feminist cause: "But step out and try and control things and there are doubts. This is a very sexist industry. They'll never throw the 'genius' title to a sister.") While recording the album, when Hill was asked about providing contracts or documentation to the musicians, she replied, "We all love each other. This ain't about documents. This is blessed."_NEWLINE_Released on August 25, 1998, the album received rave reviews from contemporary music critics, and was the most acclaimed album of 1998. Critics lauded the album's blending of the R&B, doo-wop, pop, hip-hop, and reggae genres and its honest representation of a woman's life and relationships. David Browne, writing in Entertainment Weekly, called it "an album of often-astonishing power, strength, and feeling", and praised Hill for "easily flowing from singing to rapping, evoking the past while forging a future of her own". Robert Christgau quipped, "PC record of the year—songs soft, singing ordinary, rapping skilled, rhymes up and down, skits de trop, production subtle and terrific". In 2017 NPR rated the album as the 2nd best album of all time created by a woman._NEWLINE_It sold over 423,000 copies in its first week (boosted by advance radio play of two non-label-sanctioned singles, "Lost Ones" and "Can't Take My Eyes Off You") and topped the Billboard 200 for four weeks and the Billboard R&B Albums chart for six weeks. It went on to sell about 8 million copies in the U.S. and 12 million copies worldwide. During 1998 and 1999, Hill earned $25 million from record sales and touring. Hill, along with Blige, Missy Elliott, Meshell Ndegeocello, Erykah Badu, and others, found a voice with the neo soul genre._NEWLINE_The first single released from the album was "Lost Ones", which reached number 27 in Spring 1998. The second was "Doo Wop (That Thing)", which debuted at number one on the Billboard Hot 100 chart. It exemplified Hill's appeal, combining feelings of self-empowerment with self-defense. Other charted singles from the album were "Ex-Factor", which has been interpolated by Drake and Cardi B, "Everything Is Everything" and "To Zion". In the 1998 Pazz & Jop Critics Poll, Miseducation came second in the list of best albums and "Doo Wop (That Thing)" second in best singles._NEWLINE_In November 1998, Marley and Hill's second child, Selah Louise, was born. Of being a young mother of two, Hill said, "It's not an easy situation at all. 
You have to really pray and be honest with yourself."_NEWLINE_In the run-up to the 1999 Grammy Awards, Hill became the first woman to be nominated in ten categories in a single year. In addition to Miseducation works, the nominations included her rendition of "Can't Take My Eyes Off You" for the 1997 film Conspiracy Theory, which had appeared on Billboard charts, and Hill's writing and producing of "A Rose Is Still a Rose", which became a late-in-career hit for Aretha Franklin. She appeared on several magazine covers, including Time, Esquire, Rolling Stone, Teen People, and The New York Times Fashion Magazine. During the ceremony, Hill broke another record by becoming the first woman to win five times in one night, taking home the awards for Album of the Year, Best R&B Album, Best R&B Song, Best Female R&B Vocal Performance, and Best New Artist. During an acceptance speech, she said, "This is crazy. This is hip-hop!" Hill had brought forth a new, mainstream acceptance of the genre._NEWLINE_In February 1999, Hill received four awards at the 30th Annual NAACP Image Awards. In May 1999, she became the youngest woman ever named to Ebony magazine's 100+ Most Influential Black Americans list; in November of that year, the same publication named her as one of "10 For Tomorrow" in the "Ebony 2000: Special Millennium Issue". In May 1999, she made People magazine's 50 Most Beautiful People list. The publication, which has called her "model-gorgeous", praised the 5-foot-4-inch (1.63 m) Hill for her idiosyncratic sense of personal style. In June 1999, she received an Essence Award, but her acceptance speech, where she said there was no contradiction in religious love and servitude and "[being] who you are, as fly and as hot and as whatever," drew reaction from those in the public who thought she was not a good role model as a young, unwed mother of two. This was a repetition of criticism she had received after the birth of her first child, and she had said that she and Marley would soon be married. In early 2000, Hill was one of many artists and producers to share the Grammy Award for Album of the Year for Santana's 1999 multi-million-selling Supernatural, for which she had written, produced, and rapped on the track "Do You Like the Way" (a rumination on the direction the world was headed, it also featured the singing of CeeLo Green and the signature guitar runs of Carlos Santana). She was also nominated for Best R&B Song for "All That I Can Say", which she had written and produced for Mary J. Blige. Also, her concocted duet with Bob Marley on "Turn Your Lights Down Low" for the 1999 remix tribute album Chant Down Babylon additionally appeared in the 1999 film The Best Man and later received a Grammy nomination for Best Pop Collaboration with Vocals._NEWLINE_In November 1998, New Ark filed a fifty-page lawsuit against Hill, her management, and record label, claiming that Hill "used their songs and production skills, but failed to properly credit them for the work" on Miseducation. The musicians claimed to be the primary songwriters on two tracks, and major contributors on several others, though Gordon Williams, a prominent recorder, engineer, and mixer on Miseducation, described the album as a "powerfully personal effort by Hill" and said "It was definitely her vision." Hill responded that New Ark had been appropriately credited and now were seeking to take advantage of her success. New Ark requested partial writing credits on most of the tracks on the album as well as monetary reimbursement. 
After many delays, depositions took place during the latter part of 2000. In part, the case illustrated the difficult boundaries between songwriting and all other aspects that went into contemporary arranging, sampling, and recording. The suit would eventually be settled out of court in February 2001, with Hill paying New Ark a reported $5 million. A friend of Hill's later said of the suit, "That was the beginning of a chain effect that would turn everything a little crazy." _START_SECTION_ 2000–2003: Self-imposed exile and MTV Unplugged No. 2.0 _START_PARAGRAPH_ Hill began writing a screenplay about the life of Bob Marley, in which she planned to act as his wife Rita. She also began producing a romantic comedy about soul food with a working title of Sauce, and accepted a starring role in the film adaptation of Toni Morrison's novel Beloved; she later dropped out of both projects due to pregnancy. She also reportedly turned down roles in Charlie's Angels (the part that went to Lucy Liu), The Bourne Identity, The Mexican, The Matrix Reloaded, and The Matrix Revolutions._NEWLINE_During 2000, Hill dropped out of the public eye. The pressures of fame began to overwhelm her. She disliked not being able to go out of her house to do simple errands without having to worry about her physical appearance. She fired her management team and began attending Bible study classes five days a week; she also stopped doing interviews, watching television and listening to music. She started associating with a "spiritual advisor" named Brother Anthony. Some familiar with Hill believe Anthony more resembled a cult leader than a spiritual advisor, and thought his guidance probably inspired much of Hill's more controversial public behavior._NEWLINE_She later described this period of her life to Essence saying "People need to understand that the Lauryn Hill they were exposed to in the beginning was all that was allowed in that arena at that time … I had to step away when I realized that for the sake of the machine, I was being way too compromised. I felt uncomfortable about having to smile in someone's face when I really didn't like them or even know them well enough to like them." She also spoke about her emotional crisis, saying, "For two or three years I was away from all social interaction. It was a very introspective time because I had to confront my fears and master every demonic thought about inferiority, about insecurity or the fear of being black, young and gifted in this western culture." She went on to say that she had to fight to retain her identity, and was forced "to deal with folks who weren't happy about that."_NEWLINE_In July 2001, while pregnant with her third child, Hill unveiled her new material to a small crowd, for a taping of an MTV Unplugged special. An album of the concert, titled MTV Unplugged No. 2.0, was released in May 2002 and featured only her singing and playing an acoustic guitar. Unlike the near-unanimous praise of Miseducation, 2.0 sharply divided critics. AllMusic gave the album 4 out of 5 stars, saying that the recording "is the unfinished, unflinching presentation of ideas and of a person. It may not be a proper follow-up to her first album, but it is fascinating." Rolling Stone called the album "a public breakdown" and Robert Hilburn of the Los Angeles Times said the album's title opened Hill up for jokes that she had become unhinged. 
NME wrote that "Unplugged 2.0 is a sparse and often gruelling listen, but there is enough genius shading these rough sketches to suggest that all might not yet be lost." With the mixed reviews and no significant radio airplay, 2.0 debuted at number three on the Billboard 200. The album has been certified Platinum in the US by the RIAA._NEWLINE_Her song "Mystery of Iniquity" was nominated for a Grammy Award for Best Female Rap Solo Performance and used as an interpolation by hip-hop producer/songwriter Kanye West for his single "All Falls Down", which was co-written with Lauryn Hill, as sung by Syleena Johnson._NEWLINE_Around 2001, Marley and Hill's third child, Joshua Omaru, was born. He was followed a year later by their fourth, John Nesta. While Hill sometimes had spoken of Marley as her husband, they never married, and along the way she was informed that Marley had been previously married at a young age. Furthermore, according to a 2003 Rolling Stone report, he had never secured a divorce; but Marley later disputed this and made public to a blog a 1996 divorce document from Haiti. The two had been living in a high-end Miami hotel, but around 2003 she moved out into her own place in that city. Hill later said that she and Marley "have had long periods of separation over the years". Hill slowly worked on a new album and it was reported that by 2003, Columbia Records had spent more than $2.5 million funding it, including installing a recording studio in the singer's Miami apartment and flying different musicians around the country._NEWLINE_By 2002, Hill had shut down her non-profit Refugee Project. She said, "I had a nonprofit organization and I had to shut all that down. You know, smiling with big checks, obligatory things, not having things come from a place of passion. That's slavery. Everything we do should be a result of our gratitude for what God has done for us. It should be passionate."_NEWLINE_In December 2003, Hill, during a performance in Vatican City, spoke of the "corruption, exploitation, and abuses" in reference to the molestation of boys by Catholic priests in the United States and the cover-up of offenses by Catholic Church officials. High-ranking church officials were in attendance, but Pope John Paul II was not present. The Catholic League called Hill "pathologically miserable" and claimed her career was "in decline". The following day, several reporters suggested that Hill's comments at the Vatican may have been influenced by her spiritual advisor, Brother Anthony. _START_SECTION_ 2004–2009: Sporadic touring and recording _START_PARAGRAPH_ In 2004, Hill contributed a new song, "The Passion", to The Passion of the Christ: Songs. A remix version with John Legend of his "So High" ended up receiving a Grammy Award nomination for Best R&B Performance by a Duo or Group with Vocals. Around this time, Hill began selling a pay-per-view music video of the song "Social Drugs" through her website. Those who purchase the $15 video would only be able to view it three times before it expired. In addition to the video, Hill began selling autographed posters and Polaroids through her website, with some items listed at upwards of $500._NEWLINE_For the first time since 1997, the Fugees performed in September 2004 at Dave Chappelle's Block Party in the Bedford–Stuyvesant neighborhood of Brooklyn. The concert featured Hill's nearly a cappella rendition of "Killing Me Softly". The event was recorded by director Michel Gondry and was released on March 3, 2006, to universal acclaim. 
The Fugees also appeared at BET Awards 2005 during June 2005, where they opened the show with a 12-minute set. One track, "Take It Easy", was leaked online and thereafter was released as an Internet single in late September. It peaked at number forty on the Billboard R&B Chart. In 2005, she told USA Today, "If I make music now, it will only be to provide information to my own children. If other people benefit from it, then so be it." When asked how she now felt about the songs on 2.0, she stated "a lot of the songs were transitional. The music was about how I was feeling at the time, even though I was documenting my distress as well as my bursts of joy."_NEWLINE_The Fugees embarked on a European tour in late 2005. Old tensions between Hill and the other members of the group soon resurfaced, and the reunion ended before an album could be recorded; Jean and Michel both blamed Hill for the split. Hill reportedly demanded to be addressed by everyone, including her bandmates, as "Ms. Hill"; she also considered changing her moniker to "Empress". Hill's tardiness was also cited as a contributing factor._NEWLINE_Hill began touring on her own, although to mixed reviews; often arriving late to concerts (sometimes by over two hours), performing unpopular reconfigurations of her songs and sporting an exaggerated appearance. On some occasions, fans have booed her and left early. In June 2007, Sony Records said Hill had been recording through the past decade, had accumulated considerable unreleased material and had re-entered the studio with the goal of making a new album. Later that same year, an album titled Ms. Hill, which featured cuts from Miseducation, various soundtracks contributions and other "unreleased" songs, was released. It features guest appearances from D'Angelo, Rah Digga and John Forté. Also in June 2007, Hill released a new song, "Lose Myself", on the soundtrack to the film Surf's Up._NEWLINE_In early 2008, Marley and Hill's fifth child, Sarah, was born. The couple were not living together, although Marley considered them "spiritually together" even while listing himself as single on social media. Hill later said that she and Marley "have [had] a long and complex history about which many inaccuracies have been reported since the beginning" and that they both valued their privacy. By August 2008, Hill was living with her mother and children in her hometown of South Orange, New Jersey._NEWLINE_Reports in mid-2008 claimed that Columbia Records then believed Hill to be on hiatus. Marley disputed these claims, telling an interviewer that Hill has enough material for several albums: "She writes music in the bathroom, on toilet paper, on the wall. She writes it in the mirror if the mirror smokes up. She writes constantly. This woman does not sleep". One of the few public appearances Hill made in 2008 was at a Martha Stewart book-signing in New Jersey, perplexing some in the press. In April 2009, it was reported that Hill would engage in a 10-day tour of European summer festivals during mid-July of that year. She performed two shows for the tour and passed out on stage during the start of her second performance and left the stage. She refused to give refunds to angry consumers for the show. On June 10, Hill's management informed the promoters of the Stockholm Jazz Festival, which she was scheduled to headline, that she would not be performing due to unspecified "health reasons." Shortly afterward, the rest of the tour was canceled as well. 
_START_SECTION_ 2010–present: Further activities and imprisonment _START_PARAGRAPH_ In January 2010, Hill returned to the live stage and performed in stops across New Zealand and Australia on the Raggamuffin Music Festival. Many of the songs that Hill had performed and recorded over the past six years were included on an April 2010 unofficial compilation album titled Khulami Phase. The album also features a range of other material found on the Ms. Hill compilation. Hill appeared at the Harmony Festival in Santa Rosa, California, in June 2010, her first live American performance in several years. An unreleased song called "Repercussions" was leaked via the Internet in late July 2010, debuting at number 94 on Billboard's Hot R&B/Hip-Hop Songs (and peaked at number 83 the following week), making it her first Billboard chart appearance as a lead artist since 1999._NEWLINE_Hill joined the Rock the Bells hip-hop festival series in the U.S. during August 2010, and as part of that year's theme of rendering classic albums, she performed The Miseducation of Lauryn Hill in its entirety for the first time. She increased the tempo and urgency from the original recording, but at times had difficulty in communicating with her band. Hill continued touring, including a set at the 6th Annual Jazz in the Gardens, in Miami Gardens, Florida in December. In Spring 2011, Hill performed at the Coachella Valley Music Festival, New Orleans Jazz Fest, and at the Cosmopolitan of Las Vegas. In July 2011, Hill gave birth to her sixth child, Micah, her first not with Rohan Marley; the father remains publicly unknown._NEWLINE_In February 2012, Hill performed a new song titled "Fearless Vampire Killer", during a sold-out performance at the Warner Theater in Washington, D.C. In late 2012, Hill toured with rapper Nas; her portion of the tour, titled Black Rage, is named after her song, released October 30. Hill has described the song as being "about the derivative effects of racial inequity and abuse" and "a juxtaposition to the statement 'life is good,' which she believes can only be so when these long standing issues are addressed and resolved."_NEWLINE_In June 2012, Hill was charged with three counts of tax fraud or failing to file taxes (Title 26 USC § 7202 Willful failure to collect or pay over tax) not tax evasion on $1.8 million of income earned between 2005 and 2007. During this time she had toured as a musical artist, earned royalties from both her records and from films she had appeared in, and had owned and been in charge of multiple corporations. In a long post to her Tumblr, Hill said that she had gone "underground" and had rejected pop culture's "climate of hostility, false entitlement, manipulation, racial prejudice, sexism and ageism." She added that, "When I was working consistently without being affected by the interferences mentioned above, I filed and paid my taxes. This only stopped when it was necessary to withdraw from society, in order to guarantee the safety and well-being of myself and my family." On June 29, 2012, Hill appeared in the United States District Court for the District of New Jersey in Newark and pleaded guilty to the charges; her attorney said she would make restitution for the back taxes she owed. By April 22, 2013, Hill had paid back only $50,000 of the $554,000 she owed immediately; U.S. Magistrate Judge Madeline Cox Arleo criticized Hill, saying "This is not someone who stands before the court penniless. This is a criminal matter. 
Actions speak louder than words, and there has been no effort here to pay these taxes." Hill also faced possible eviction from her rented home in South Orange as well as a civil lawsuit from the town for running a business out of a home without a zoning permit._NEWLINE_On May 4, 2013, Hill released her first official single in over a decade, "Neurotic Society (Compulsory Mix)". She later published a message on her Tumblr describing how she was "required to release [it] immediately, by virtue of the impending legal deadline." The release received some criticism for lyrics that appeared to tie societal decay to certain LGBT social movements. Hill responded that the song was not targeted at any particular group but was instead focused on anyone hiding behind neurotic behavior. Following a deal with Sony Music, which involves Hill creating a new record label within the company, Hill was said to be scheduled to release her first album in fifteen years during 2013._NEWLINE_On May 6, 2013, Hill was sentenced by Judge Arleo to serve three months in prison for failing to file taxes/tax fraud and three months' house arrest afterwards as part of a year of supervised probation. She had faced a possible sentence of as long as 36 months, and the sentence given took into account her lack of a prior criminal record and her six minor-aged children. By this point Hill had fully paid back $970,000 in back taxes and penalties she owed, which also took into account an additional $500,000 that Hill had in unreported income for 2008 and 2009. In the courtroom, Hill said that she had lived "very modestly" considering how much money she had made for others, and that "I am a child of former slaves who had a system imposed on them. I had an economic system imposed on me." Hill reported to the minimum-security Federal Correctional Institution, Danbury on July 8, 2013, to begin serving her sentence._NEWLINE_Hill was released from prison on October 4, 2013, a few days early for good behavior, and began her home confinement and probationary periods. She put out a single called "Consumerism" that she had finished, via verbal and e-mailed instructions, while incarcerated. Judge Arleo allowed her to postpone part of her confinement in order to tour in late 2013 under strict conditions._NEWLINE_During 2014, Hill was heard as the narrator of Concerning Violence, an award-winning Swedish documentary on the African liberation struggles of the 1960s and 1970s. She also continued to draw media attention for her erratic behavior, appearing late twice in the same day for sets at Voodoo Fest in November 2014._NEWLINE_In May 2015, Hill canceled her scheduled concert outside Tel Aviv in Israel following a social media campaign from activists promoting the Boycott, Divestment and Sanctions campaign. She said she had wanted to also perform a show in Ramallah in the West Bank but logistical problems had proved too great. Hill stated: "It is very important to me that my presence or message not be misconstrued, or a source of alienation to either my Israeli or my Palestinian fans."_NEWLINE_Hill contributed her voice to the soundtrack for What Happened, Miss Simone?, a 2015 documentary about the life of Nina Simone, an American singer, pianist, and civil rights activist. Hill was originally supposed to record only two songs for the record, but ended up recording six. She also served as a producer on the compilation alongside Robert Glasper. Hill said of her connection to Simone: "Because I fed on this music ... 
I believed I always had a right to have a voice. Her example is clearly a form of sustenance to a generation needing to find theirs. What a gift." NPR praised Hill's performance on the soundtrack, stating: "This album mainly showcases Lauryn Hill's breadth and dexterity. Not formally marketed as Hill's comeback album, her six tracks here make this her most comprehensive set of studio recordings since The Miseducation of Lauryn Hill in 1998."_NEWLINE_In April 2016, Hill hosted and headlined what was billed as the inaugural Diaspora Calling! festival at the Kings Theatre in Brooklyn. The festival's purpose was to showcase the efforts of musicians and artists from around the African diaspora, such as the Brooklyn Haitian Rara band Brother High Full tempo. The following month, Hill was approximately 2 hours and 20 minutes late for her show at the Chastain Park Amphitheatre in Atlanta, though members of Hill's team claimed it was only an hour after their scheduled start time. Moments after the less-than-40-minute show ended due to the venue's strict 11:00 p.m. closing time, Hill said her driver had gotten lost and she could not help that. Less than 48 hours later, after a large backlash from her fans on Twitter, she took to her Facebook page and stated she was late for the concert because of certain needs, including her need to "align her energy with the time." _START_SECTION_ Legacy and sampling _START_PARAGRAPH_ Lauryn Hill's work continues to inspire rappers and can still be heard sampled in hip hop today. In 2018, Hill was sampled on Cardi B's "Be Careful", Drake's "Nice for What", and A$AP Rocky's "Purity". Kanye West has mentioned Lauryn Hill in several of his songs. In his 2007 song "Champion", he raps, "Lauryn Hill said her heart was in Zion, I wish her heart still was in rhymin". Zion is Hill's first-born son, whom she sings about in "To Zion". West also refers to Hill in his 2016 song "No More Parties in LA", in which he states, "I was uninspired since Lauryn Hill retired", a reference to Hill not having released any new music since 2013._NEWLINE_Other samples of Lauryn Hill's work come from artists such as J Cole, A Boogie Wit Da Hoodie, and Kanye West. J Cole samples Hill's "Nothing Even Matters" and "To Zion" in his song "Cole Summer". A Boogie samples several Hill songs, including "Ex-Factor" in his song of the same name as well as in a remix of Drake's "Nice for What". West famously samples Hill's "Mystery of Iniquity" in "All Falls Down". + _START_ARTICLE_ James Edward McManus _START_PARAGRAPH_ James Edward McManus, C.Ss.R. (October 10, 1900 – July 3, 1976) was an American prelate of the Roman Catholic Church. A Redemptorist, he served as Bishop of Ponce in Puerto Rico (1947–63) and as an auxiliary bishop of the Archdiocese of New York (1963–70). _START_SECTION_ Early life and education _START_PARAGRAPH_ James McManus was born in Brooklyn, New York, the eighth of nine children of William and Elizabeth (née O'Loughlin) McManus. He received his early education at the parochial school of Our Lady of Perpetual Help Church in Brooklyn from 1906 to 1914. In 1915, he enrolled at St. Mary's College, a preparatory school run by the Redemptorists in North East, Pennsylvania. He then studied at Mount St. Alphonsus Seminary at Esopus from 1922 to 1928. He made his profession as a Redemptorist in Ilchester, Maryland, on August 2, 1922. _START_SECTION_ Priesthood _START_PARAGRAPH_ On June 19, 1927, McManus was ordained to the priesthood in Esopus. 
He was assigned to the Puerto Rican mission in Caguas in 1929. He later returned to the continental United States to study at the Catholic University of America in Washington, D.C., where he earned a Doctor of Canon Law degree in 1937. He then served as professor of canon law at Mount St. Alphonsus Seminary until 1940, when he returned to Puerto Rico. He served as a pastor in Aguadilla (1940–45) and then in Mayagüez (1945–47). _START_SECTION_ Ponce _START_PARAGRAPH_ On May 10, 1947, McManus was appointed Bishop of Ponce by Pope Pius XII. He received his episcopal consecration on the following July 1 from Bishop William Tibertus McCarty, with Bishops Aloysius Joseph Willinger and William David O'Brien serving as co-consecrators, at Our Lady of Perpetual Help Church in Brooklyn. His biggest contribution as Bishop of Ponce was the founding of the Pontifical Catholic University of Puerto Rico in 1948. He also oversaw the move of a seminary from the Archdiocese of San Juan to his diocese in Aibonito._NEWLINE_During his tenure in Ponce, McManus became an outspoken critic of Luis Muñoz Marín, who served as Governor of Puerto Rico from 1949 to 1965. In the 1952 and 1956 elections, he opposed Muñoz Marín and supported the Republican Statehood Party, which demanded statehood for the island and proposed an economic plan similar to that of the continental Republican Party. In 1958, he feuded with Muñoz Marín over his program to crack down on gambling, including bingo games for the support of parish churches. He denounced the legalization of birth control measures and a law that would divorce couples who had been separated for more than three years. He also opposed the administration's measure to cut the tax-exempt donations to charity by corporations from 15% of gross income to 5% of surplus._NEWLINE_In 1960, after the Legislative Assembly failed to pass a law allowing religious instruction for schoolchildren, McManus said that the administration of Muñoz Marín was "responsible for the moral evils that cloud and de-Christianize our society." In August of that year, he helped organize the Christian Action Party, which he urged all Catholics to support. The party nominated Salvador Perea, a professor at the Pontifical Catholic University, as its candidate for governor, but was caught in a controversy over the validity of the signatures it collected to get on the ballot._NEWLINE_A month before the election, McManus and two other bishops issued a pastoral letter that prohibited Catholics from voting for Muñoz Marín's Popular Democratic Party, which they claimed "accepts as its own the morality of a 'regime of license,' denying Christian morality..." The letter also stated, "It is evident that the philosophy of the Popular Democratic Party is anti-Christian and anti-Catholic, and that it is based on the modern heresy that popular will and not divine law decides what is moral and immoral. This philosophy destroys the Ten Commandments of God and permits that they be substituted by popular and human criteria." McManus insisted that Catholics who disobeyed the injunction by voting for the Popular Democrats would commit a sin. The letter resulted in widespread protests in Puerto Rico and sparked open controversy within the Church. Cardinal Francis Spellman of New York declared that Puerto Rican voters would not be penalized by the Church while Archbishop James P. Davis of San Juan defended the bishops. 
Muñoz Marín denounced the letter as an "incredible medieval interference in a political campaign."_NEWLINE_Between 1962 and 1965, McManus attended all four sessions of the Second Vatican Council. _START_SECTION_ New York _START_PARAGRAPH_ McManus resigned as Bishop of Ponce for reasons of health on November 18, 1963. On the same date, he was appointed Auxiliary Bishop of New York and Titular Bishop of Benda by Pope Paul VI. He denied that his transfer to New York had anything to do with his opposition to Governor Muñoz Marín, calling his appointment "routine." As an auxiliary bishop, he served as pastor of St. Cecilia's Church in Manhattan (1964–66) and episcopal vicar of Sullivan and Ulster Counties, a post which he held until his retirement in 1970._NEWLINE_McManus died at the Monmouth Medical Center in Long Branch, New Jersey, at age 75. + _START_ARTICLE_ Kate Webb _START_PARAGRAPH_ Kate Webb (24 March 1943 – 13 May 2007) was a New Zealand-born Australian war correspondent for UPI and Agence France-Presse. She earned a reputation for dogged and fearless reporting throughout the Vietnam War, and at one point she was held prisoner for weeks by North Vietnamese troops. After the war, she continued to report from global hotspots including Iraq during the Gulf War. _START_SECTION_ Biography _START_PARAGRAPH_ Born Catherine Merrial Webb in Christchurch, New Zealand, Webb moved to Canberra, Australia, with her family while she was still a child. Her father, Leicester Chisholm Webb, was professor of political science at the Australian National University, and her mother, Caroline Webb, was active in women's organisations. Both her parents were killed when Kate was 18._NEWLINE_On 30 March 1958, at the age of 15, Catherine Webb was charged with the murder of Victoria Fenner, the adopted daughter of Frank Fenner, in Canberra. She supplied a rifle and bullets to Fenner and was present when Fenner shot herself. After a Children's Court hearing the charge was dropped._NEWLINE_She graduated from the University of Melbourne, then left to work for the Sydney Daily Mirror. In 1967 she quit the paper and travelled to Vietnam to cover the escalating war. Webb was soon hired by UPI and earned a reputation as a hard-drinking, chain-smoking war correspondent: She was the first wire correspondent to reach the U.S. Embassy in Saigon during the Tet offensive. With the death of Phnom Penh bureau chief Frank Frosch in 1970, Webb was selected to fill his position—she later claimed it was because she spoke French. In 1971 she made news herself when she was captured by North Vietnamese troops operating in Cambodia. Premature official reports claimed that a body discovered was Webb's, and The New York Times published an obituary. She emerged from captivity 23 days after she was captured, after having endured forced marches, interrogations, and malaria. She described her experiences in a book called On the Other Side, and in War Torn, a collection of reminiscences by women correspondents in the Vietnam War._NEWLINE_After her release from captivity and because of her sudden fame, UPI sent her to Washington DC as their show piece. Soon thereafter she threatened to resign if she did not get a "real job". She was reassigned to the Philippines as the UPI bureau chief in Manila._NEWLINE_After the war, she continued to work as a foreign correspondent for UPI and Agence France-Presse (AFP). 
She served as a correspondent in Iraq during the Gulf War, in Indonesia as Timor-Leste gained independence, and in South Korea, where she was the first to report the death of Kim Il-sung. She also reported from Afghanistan, and later described an incident in Kabul as the most frightening in her career. Following the collapse of Mohammad Najibullah's communist regime, she was captured by a local warlord and brought to a hotel, where she was brutally beaten and dragged up a flight of stairs by her hair. She finally escaped with the help of two fellow journalists, and hid out on a window ledge in the freezing Afghan winter while the warlord and his men searched the building for her._NEWLINE_Webb retired to the Hunter Region in 2001. She died of bowel cancer on 13 May 2007. In 2008, AFP established the Kate Webb Prize, worth €3,000 to €5,000, awarded annually to an Asian correspondent or agency that best exemplifies the spirit of Kate Webb. Webb was commemorated on an Australian postage stamp in 2017._NEWLINE_She is survived by a brother, Jeremy Webb, and a sister, Rachel Miller. + _START_ARTICLE_ Golden State (schooner) _START_SECTION_ Earlier 19th century schooner _START_PARAGRAPH_ The schooner "'Golden State,' of San Francisco" was involved in the 1858 lawsuit Wetherbee vs. Schooner "Golden State" (and Captain W. S. Tuttle). + _START_ARTICLE_ Blooming, Oregon _START_PARAGRAPH_ Blooming is an unincorporated community in Washington County, Oregon, United States, near the Tualatin River, about two miles south of Cornelius. Its elevation is 190 feet (58 m). There are several plant nurseries in the area._NEWLINE_The Blooming area was originally known as "the German Settlement" for a group of German immigrants who had settled there. Rev. Paul of St. Peter's Lutheran Church renamed the community "Blooming", for the area's flowers and as an indicator of the town's good prospects. Blooming post office ran from 1895 to 1904._NEWLINE_The center of the community was St. Peter's Evangelical Lutheran Church, which is the second-oldest congregation of the Lutheran Church–Missouri Synod in the Pacific Northwest. The church was founded and the original church building constructed in 1882; a portion of the current church was built in 1923._NEWLINE_Blooming has a Cornelius ZIP code. + _START_ARTICLE_ Maggi Kelly _START_SECTION_ Life _START_PARAGRAPH_ She played for the University of California, Berkeley. She graduated from the University of North Carolina at Chapel Hill, and from the University of Colorado._NEWLINE_She was a member of the United States women's national water polo team._NEWLINE_She participated in the 1994 and 1998 World Aquatic Championships. + _START_ARTICLE_ BoxChilli _START_SECTION_ History _START_PARAGRAPH_ Launched as a 2-person start-up in 2003, boxChilli has grown to a team of 14 with offices in Southampton and Portsmouth. BoxChilli doubled in size between 2012 and 2014 and has continued to grow. The company has won several awards: in 2014 it was a finalist in the Wirehive 100 award for the fastest-growing agency, and it was listed among the Wirehive top 100 digital agencies to work with outside London in 2014, 2015 and 2016. _START_SECTION_ Youth development _START_PARAGRAPH_ In 2014, boxChilli team member Kirsty Mallan was a finalist at the NatWest Venus Award (Portsmouth), recognising achievement by women under the age of 25. 
As well as hiring, training and supporting graduates and young managers, boxChilli works with Portsmouth-based training agency PETA Ltd to provide apprenticeships in web design, search and related areas. In 2016, Danielle Ellis was a semi-finalist in the young manager of the year category. _START_SECTION_ Charity work _START_PARAGRAPH_ boxChilli supports charities with in-kind donations of services. The company has made websites for a number of Hampshire-based charities, including Hampshire Wildlife Trust and CancerWise. The company also supports team members' individual fundraising efforts, including growing moustaches for Movember and entering cycling and running events. + _START_ARTICLE_ Tanmay Jahagirdar _START_SECTION_ Biography _START_PARAGRAPH_ He began his acting career at a very young age. He believes in being highly educated and holds an engineering diploma as well as a degree in mass media. His father, Vilas Jahagirdar, is a well-known figure in the Indian pharmaceutical industry. He is well versed in languages such as Marathi, Hindi and Urdu, and wishes to learn several others. He married Payal Wachasundar on 25 March 2019. _START_SECTION_ Career _START_PARAGRAPH_ His career spans work as a child artist in television, commercial advertisements and films. In 2002, he appeared in The truth - Yathharth, a feature film based on the darker side of life in rural India. The film's cast includes Milind Gunaji, Raghuveer Yadav and Shraddha Nigam. He played Raghu, a character who supports his friend Bijuria (Shraddha Nigam) through the ups and downs of his life. This challenging project proved momentous for him, as he got the chance to work with versatile actors of the generation and was appreciated for his on-screen presence. He belongs to the era of popular kids' series on Indian television. He appeared in Aryamaan - Brahmaand Ka Yodha, a sci-fi series in which he worked with TV legend Mukesh Khanna and earned a lot of appreciation for the character he played. He also appeared in Shaka Laka Boom Boom, a series based on a magical pencil that aired on Star Plus. Since then, there was no looking back for him in the television industry. Next, he appeared in a lead role in the serial Kya Mujhse Dosti Karoge on Hungama TV. He has also done advertising commercials for some well-known brands in India. In the Siyaram Suitings ad featuring Diya Mirza and international tennis player Boris Becker, he played an adorable and naughty Rajasthani kid. He also acted in commercials for BSA-ibike, Parry's Lacto King and others; these advertisements represent a nostalgic era of TV._NEWLINE_The milestone in his career came with Ram Gopal Verma's film Ab Tak Chappan (2004), in which he played Aman, the son of Nana Patekar's character Sadhu. He was highly praised for his pivotal character in the film, which went on to become a commercial box office hit and a critics' favorite of the year. After this, he appeared in the film Phir Kabhi (2009). He portrayed a young schoolboy representing the childhood of the character played by Gulshan Grover, a comic role in which he showed a commendable comic sense and was recognized and admired for his acting. He was also seen in an episode of the popular series Adalat on Sony TV in 2011. He has played roles in different genres such as thriller, comedy, drama and sci-fi superhero. 
He has worked with recognized banners such as UTV Motion Pictures and RGV Productions._NEWLINE_Recently, he made his comeback with Ab Tak Chhappan 2 (2015), in which he played Sadhu Agashe's grown-up son Aman. He portrayed a young but mature person who still believes in the beauty of life in spite of the odds. Aman has a passion and love for music, and this element of his character added an altogether different angle to the film. His role complemented the story, helping it progress with interesting twists. He has worked with stalwarts of Indian cinema such as Nana Patekar, Mithun Chakraborty, Ronit Roy and Mukesh Khanna. _NEWLINE_He continues to pursue his passion for acting and direction. + _START_ARTICLE_ Spring Creek, Madison County, North Carolina _START_PARAGRAPH_ Spring Creek is a tributary stream of the French Broad River in Madison County, North Carolina, with a length of approximately 17 miles. It flows in much of its lower course through a section of the Pisgah National Forest and passes the communities of Trust, Luck, and Joe. It joins the French Broad River in Hot Springs, North Carolina. + _START_ARTICLE_ Acanthocladium _START_SECTION_ Description _START_PARAGRAPH_ The spiny everlasting is a woody perennial shrub with spines at branch ends, covered in short white hair. It bears oblong, bumpy fruit._NEWLINE_Spiny everlasting was presumed extinct in 1992, having suffered habitat loss from clearance for winter crops, but various colonies of it have been found around Laura, near the Spencer Gulf. _START_SECTION_ Homonym _START_PARAGRAPH_ In 1883, William Mitten used the same name, Acanthocladium, to refer to a group of mosses, now in the family Sematophyllaceae. Several dozen species of mosses were described and placed in this genus before it was realized that Mitten's name represented an illegitimate homonym. The moss genus has since been renamed Wijkia H.A. Crum. + _START_ARTICLE_ Eddie O'Sullivan _START_SECTION_ Early career _START_PARAGRAPH_ O'Sullivan was born in Youghal, Cork, Ireland. After attending the Christian Brothers school in the town, he graduated from Thomond College, which a decade later became part of the University of Limerick._NEWLINE_O'Sullivan played for the Garryowen Football Club during the 1970s and 1980s at fly-half and wing, while teaching physical education, maths, and science in Mountbellew, County Galway. He played for Munster between 1983 and 1986 on the wing and was capped for Ireland A in 1984. He also played Gaelic football: in 1982, he played corner forward on the Mountbellew Moylough Gaelic football team. He was fitness advisor to the Galway senior football team, managed by John O'Mahony, which won two All-Ireland Senior Football titles in 1998 and 2001. _START_SECTION_ Coaching career _START_PARAGRAPH_ He started his coaching career at Monivea Rugby Club in north-east Galway in the early 1980s while still a teacher. He worked as a Development Officer for the Irish Rugby Football Union between 1988 and 1991. During that time and up until 1992, he was the fitness advisor to the Irish rugby team under head coach Ciaran Fitzgerald. He followed this with spells coaching at Blackrock College (first as assistant, then as head coach), at Connacht as assistant coach and head coach between 1992 and 1996, and with the Irish Under-21 side. The Under-21 side won the 1996 Triple Crown, beating Clive Woodward's England. 
Between 1997 and 1999, while working in the USA, he continued to coach the Buccaneers Rugby Club in Connacht, who won promotion from Division 3 to Division 1 of the All-Ireland League and reached the top four of the tournament in their first season in Division 1._NEWLINE_After failing to secure a high-profile coaching position in Ireland, O'Sullivan moved to America to coach the US Eagles, where he worked as forwards coach at the 1999 Rugby World Cup. He also worked as Technical Director to USA Rugby between 1997 and 1999. As Technical Director he developed and delivered the USA Rugby Coach Education Programme, which certifies coaches at Foundation, Level I, Level II and Level III. He was then appointed assistant coach of the Irish national side in 1999, serving as backs coach. During his time as assistant coach he was credited with advancing Irish back play considerably while working with players such as Brian O'Driscoll, Ronan O'Gara, Peter Stringer, Rob Henderson, Shane Horgan and Denis Hickie. In 2001 he was appointed head coach following the controversial departure of Warren Gatland. _START_SECTION_ Ireland _START_PARAGRAPH_ In his first year, Ireland finished in third place in the 2002 Six Nations Championship. O'Sullivan's Ireland went on to achieve second place in 2003, only losing the Grand Slam in the final match against England. At the 2003 Rugby World Cup his team lost to France in the quarter-finals._NEWLINE_Ireland again missed out in the 2004 Six Nations Championship, losing the Grand Slam to France this time, but went on to win Ireland's first Triple Crown in 19 years. While transitioning the team during the 2005 championship, O'Sullivan's side finished in third place with defeats by France and Wales. In 2006, defeat to France cost Ireland the Championship. In 2007, Ireland again lost the championship to France on points difference: on the final day of the tournament, despite Ireland defeating Italy heavily in Rome (51-24), France defeated Scotland with a controversial try in the final minute of the game to again deny Ireland a Six Nations Championship. The fact that the French played later in the day than Ireland gave them the advantage of knowing exactly what score they needed to secure the Championship. This fuelled the discussion about games not kicking off at the same time on the final day of the tournament._NEWLINE_During his tenure as head coach of Ireland, O'Sullivan won three Triple Crowns, in 2004, 2006 and 2007. Ireland also defeated Australia twice (2002 and 2006) and South Africa twice (2004 and 2006). Ireland defeated England in the Six Nations Championship in four consecutive years (2004-2007), including a record victory (43-13) at the iconic Croke Park stadium in 2007. O'Sullivan also coached the Barbarians R.F.C. to victory over the 2007 World Cup champions South Africa in November 2007. _START_SECTION_ 2007 World Cup campaign _START_PARAGRAPH_ In August 2007, O'Sullivan's contract with the IRFU was extended for a further four years, which meant that he was contracted to be in charge of the Irish rugby team until 2012. Part of the terms of the contract allowed him to leave the position temporarily to coach the 2009 Lions squad, were he to be offered that role. Soon, however, he was the subject of press criticism after a run of poor results. Ireland turned in poor performances in the opening matches of the World Cup against the lower-rated Georgia and Namibia. 
They had previously also struggled in pre-tournament games against Italy and Scotland._NEWLINE_Criticisms included a failure to inspire passion in the team and a lack of depth in the squad, which has been said to have caused complacency in the first team players. Many began to see the signing of his contract as a premature move. Rumours have abounded of conflict in the Irish during the tournament, and even that Geordan Murphy might leave the squad as a result of being dropped from the bench for the French game._NEWLINE_The poor performances continued with the failure of Ireland, for the second time in its history, to qualify for the quarter-final stages of the World Cup, finishing third in its Group with two wins and two losses._NEWLINE_After an extensive post Rugby World Cup review it was established that, despite rumours during the tournament, there was no discord among the playing squad. The failure to perform was identified as an over emphasis on strength and conditioning prior to the tournament along with only 2 pre-tournament warm-up games, leaving the squad short of match preparation. _START_SECTION_ End of Irish career _START_PARAGRAPH_ On 19 March 2008, O'Sullivan resigned from his job after a disappointing 6 Nations campaign. O'Sullivan finished with an overall success rate of 64% during his tenure with Ireland. During his time Ireland reached 3rd in the World Rugby ranking on 2 occasions in 2003 and 2006._NEWLINE_O'Sullivan released his autobiography, Never Die Wondering ISBN 978-1-84605-399-3, in autumn 2009. It was written with sports writer Vincent Hogan. _START_SECTION_ United States _START_PARAGRAPH_ After coaching the Irish Under-21 team to their first ever Triple Crown in 1996, O’Sullivan joined USA Rugby as Assistant Coach to the Eagles (Forwards Coach) and Assistant National Technical Director to USA Rugby. In 1998, he took over as National Technical Director while still retaining his position as Assistant Coach on the Eagles coaching staff. He worked with the Eagles at the 1999 Rugby World Cup before returning to Ireland as Assistant Coach to the Irish Rugby Team (Backs Coach)._NEWLINE_In 2009 the position of Head Coach to the USA Eagles Men's team came open when Scott Johnson announced he would leave the team at the end of the 2008–09 season to move to Ospreys of the Celtic League. O'Sullivan's agent reported on 16 February that O'Sullivan had signed a deal that would see him coach the United States through the 2011 World Cup. His hiring was officially announced on 4 March._NEWLINE_O'Sullivan coached the USA Eagles from 2009 up until the end of the 2011 Rugby World Cup. It was O'Sullivan's 5th appearance at a Rugby World Cup in a coaching capacity (RWC 1991 S&C Coach Ireland, RWC 1999 Assistant Coach USA, 2003 Head Coach Ireland, 2007 Head Coach Ireland, 2011 Head Coach USA) _START_SECTION_ Biarritz Olympique _START_PARAGRAPH_ In May 2014 O'Sullivan was confirmed as head coach of Biarritz Olympique. Relegated from the Top 14 following the 2013-14 season, Biarritz played in the Pro D2 league in 2014-15. O'Sullivan left Biarritz Olympique in October 2015. diff --git a/examples/language_model/bigbird/run_classifier.py b/examples/language_model/bigbird/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..bcddaa80c89e4128ed6536ffd17108638df5334e --- /dev/null +++ b/examples/language_model/bigbird/run_classifier.py @@ -0,0 +1,202 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import string +import time + +import numpy as np +import paddle +import paddle.nn as nn + +from paddlenlp.data import Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BigBirdForSequenceClassification, + BigBirdTokenizer, + create_bigbird_rand_mask_idx_list, +) +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--model_name_or_path", type=str, default="bigbird-base-uncased-finetune", help="pretraining model name or path" +) +parser.add_argument( + "--max_encoder_length", + type=int, + default=3072, + help="The maximum total input sequence length after SentencePiece tokenization.", +) +parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate used to train.") +parser.add_argument("--max_steps", default=10000, type=int, help="Max training steps to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument("--attn_dropout", type=float, default=0.0, help="Attention ffn model dropout.") +parser.add_argument( + "--hidden_dropout_prob", type=float, default=0.0, help="The dropout rate for the embedding pooler." +) +parser.add_argument( + "--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model." 
+) +parser.add_argument("--seed", type=int, default=8, help="Random seed for initialization.") + +args = parser.parse_args() +TRANSLATOR = str.maketrans("", "", string.punctuation) + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def create_dataloader(batch_size, max_encoder_length, tokenizer, config, pad_val=0): + def _tokenize(text): + input_ids = [tokenizer.cls_id] + input_ids.extend(tokenizer.convert_tokens_to_ids(tokenizer._tokenize(text)[: max_encoder_length - 2])) + input_ids.append(tokenizer.sep_id) + input_len = len(input_ids) + if input_len < max_encoder_length: + input_ids.extend([pad_val] * (max_encoder_length - input_len)) + input_ids = np.array(input_ids).astype("int64") + return input_ids + + def _collate_data(data, stack_fn=Stack(dtype="int64")): + num_fields = len(data[0]) + out = [None] * num_fields + out[0] = stack_fn([_tokenize(x["text"].translate(TRANSLATOR)) for x in data]) + out[1] = stack_fn([x["label"] for x in data]) + seq_len = len(out[0][0]) + # Construct the random attention mask for the random attention + rand_mask_idx_list = create_bigbird_rand_mask_idx_list( + config["num_layers"], + seq_len, + seq_len, + config["nhead"], + config["block_size"], + config["window_size"], + config["num_global_blocks"], + config["num_rand_blocks"], + config["seed"], + ) + out.extend(rand_mask_idx_list) + return out + + def _create_dataloader(mode, tokenizer, max_encoder_length, pad_val=0): + dataset = load_dataset("imdb", splits=mode) + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=(mode == "train")) + data_loader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=_collate_data, return_list=True + ) + return data_loader + + train_data_loader = _create_dataloader("train", tokenizer, max_encoder_length, 0) + test_data_loader = _create_dataloader("test", tokenizer, max_encoder_length, 0) + return train_data_loader, test_data_loader + + +def main(): + # Initialization for the parallel environment + paddle.set_device(args.device) + set_seed(args) + # Define the model and metric + # In finetune task, bigbird performs better when setting dropout to zero. 
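+    # The attn_dropout / hidden_dropout_prob keyword arguments below override the values
+    # stored in the pretrained config, so no dropout is applied during finetuning.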
+ model = BigBirdForSequenceClassification.from_pretrained( + args.model_name_or_path, attn_dropout=args.attn_dropout, hidden_dropout_prob=args.hidden_dropout_prob + ) + + criterion = nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + # Define the tokenizer and dataloader + tokenizer = BigBirdTokenizer.from_pretrained(args.model_name_or_path) + config = getattr(model, BigBirdForSequenceClassification.base_model_prefix).config + train_data_loader, test_data_loader = create_dataloader( + args.batch_size, args.max_encoder_length, tokenizer, config + ) + + # Define the Adam optimizer + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.learning_rate, epsilon=1e-6) + + # Finetune the classification model + do_train(model, criterion, metric, optimizer, train_data_loader, tokenizer) + + # Evaluate the finetune model + do_evalute(model, criterion, metric, test_data_loader) + + +def do_train(model, criterion, metric, optimizer, train_data_loader, tokenizer): + model.train() + global_steps = 0 + tic_train = time.time() + for epoch in range(args.epochs): + for step, batch in enumerate(train_data_loader): + global_steps += 1 + input_ids, labels = batch[:2] + rand_mask_idx_list = batch[2:] + + output = model(input_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + correct = metric.compute(output, labels) + metric.update(correct) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, speed: %.2f step/s" + % (global_steps, epoch, loss, metric.accumulate(), args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + output_dir = os.path.join(args.output_dir, "model_%d.pdparams" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + if global_steps >= args.max_steps: + return + + +@paddle.no_grad() +def do_evalute(model, criterion, metric, test_data_loader): + model.eval() + global_steps = 0 + for step, batch in enumerate(test_data_loader): + global_steps += 1 + input_ids, labels = batch[:2] + rand_mask_idx_list = batch[2:] + output = model(input_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = criterion(output, labels) + correct = metric.compute(output, labels) + metric.update(correct) + if global_steps % args.logging_steps == 0: + logger.info("eval: global step %d, loss: %f, acc %f" % (global_steps, loss, metric.accumulate())) + logger.info("final eval: loss: %f, acc %f" % (loss, metric.accumulate())) + metric.reset() + model.train() + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/bigbird/run_glue.py b/examples/language_model/bigbird/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..fa0234e01c2359676bd64e8175caf9c339530230 --- /dev/null +++ b/examples/language_model/bigbird/run_glue.py @@ -0,0 +1,333 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time +from functools import partial + +import args +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BigBirdForSequenceClassification, + BigBirdTokenizer, + LinearDecayWithWarmup, + create_bigbird_rand_mask_idx_list, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bigbird": (BigBirdForSequenceClassification, BigBirdTokenizer), +} + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + input_ids = [tokenizer.cls_id] + token_type_ids = None + + if (int(is_test) + len(example)) == 2: + input_ids.extend(tokenizer(example["sentence"])["input_ids"][: max_seq_length - 2]) + input_ids.append(tokenizer.sep_id) + input_len = len(input_ids) + token_type_ids = input_len * [0] + else: + input_ids1 = tokenizer(example["sentence1"])["input_ids"] + input_ids2 = tokenizer(example["sentence2"])["input_ids"] + total_len = len(input_ids1) + len(input_ids2) + tokenizer.num_special_tokens_to_add(pair=True) + 1 + if total_len > max_seq_length: + input_ids1, input_ids2, _ = tokenizer.truncate_sequences( + input_ids1, input_ids2, total_len - max_seq_length + ) + + input_ids.extend(input_ids1) + input_ids.append(tokenizer.sep_id) + input_len1 = len(input_ids) + + input_ids.extend(input_ids2) + input_ids.append(tokenizer.sep_id) + input_len2 = len(input_ids) - input_len1 + + token_type_ids = input_len1 * [0] + input_len2 * [1] + + input_len = len(input_ids) + if input_len < max_seq_length: + input_ids.extend([tokenizer.pad_id] * (max_seq_length - input_len)) + token_type_ids.extend([tokenizer.pad_token_type_id] * (max_seq_length - input_len)) + + if not is_test: + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def collect_data(samples, dataset, config): + stack_fn = Stack(dtype="int64" if dataset.label_list else "float32") + stack_fn1 = Stack() + + num_fields = len(samples[0]) + out = [None] * num_fields + out[0] = stack_fn1([x[0] for x in samples]) # input_ids + out[1] = stack_fn1([x[1] for x in 
samples]) # token_type_ids + if num_fields >= 2: + out[2] = stack_fn(x[2] for x in samples) # labels + seq_len = len(out[0][0]) + # Construct the random attention mask for the random attention + rand_mask_idx_list = create_bigbird_rand_mask_idx_list( + config["num_layers"], + seq_len, + seq_len, + config["nhead"], + config["block_size"], + config["window_size"], + config["num_global_blocks"], + config["num_rand_blocks"], + config["seed"], + ) + out.extend(rand_mask_idx_list) + return out + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch[:3] + rand_mask_idx_list = batch[3:] + # run forward + logits = model(input_ids, segment_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + logger.info( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ) + ) + elif isinstance(metric, Mcc): + logger.info("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0])) + elif isinstance(metric, PearsonAndSpearman): + logger.info( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]) + ) + else: + logger.info("eval loss: %f, acc: %s, " % (loss.numpy(), res)) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + worker_num = paddle.distributed.get_world_size() + if worker_num > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + # In finetune task, bigbird performs better when setting dropout to zero. 
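+    # num_classes is 1 only for the regression task (STS-B), which pairs with the MSELoss
+    # chosen below; passing attn_dropout / hidden_dropout_prob of 0.0 disables dropout.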
+ model = model_class.from_pretrained( + args.model_name_or_path, num_classes=num_classes, attn_dropout=0.0, hidden_dropout_prob=0.0 + ) + if worker_num > 1: + model = paddle.DataParallel(model) + config = getattr(model, model_class.base_model_prefix).config + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_encoder_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = partial(collect_data, dataset=train_ds, config=config) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
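+    # The match is on parameter-name substrings, so any parameter whose name contains
+    # "bias" or "norm" keeps a weight decay of zero via apply_decay_param_fun.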
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.CrossEntropyLoss() if train_ds.label_list else paddle.nn.MSELoss() + + metric = metric_class() + global_step = 0 + tic_train = time.time() + for epoch in range(args.epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch[:3] + rand_mask_idx_list = batch[3:] + # run forward + logits = model(input_ids, segment_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = loss_fct(logits, labels) + # run backward and update params + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = args.parse_args() + print_arguments(args) + assert args.device in [ + "cpu", + "gpu", + "xpu", + "npu", + ], "Invalid device! Available device should be cpu, gpu, xpu or npu." + do_train(args) diff --git a/examples/language_model/bigbird/run_pretrain.py b/examples/language_model/bigbird/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..81545f3fb0ad55c5b9ac2cd01167526d71fd24c3 --- /dev/null +++ b/examples/language_model/bigbird/run_pretrain.py @@ -0,0 +1,259 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time + +import args +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Stack +from paddlenlp.transformers import ( + BigBirdForPretraining, + BigBirdPretrainingCriterion, + BigBirdTokenizer, + LinearDecayWithWarmup, + create_bigbird_rand_mask_idx_list, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "bigbird": (BigBirdForPretraining, BigBirdTokenizer), +} + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, tokenizer, max_encoder_length=512, max_pred_length=75): + self.tokenizer = tokenizer + self.max_encoder_length = max_encoder_length + self.max_pred_length = max_pred_length + input_file = open(input_file, "r", encoding="utf-8") + self.lines = input_file.readlines() + + self.vocab_size = tokenizer.vocab_size + + def __getitem__(self, index): + line = self.lines[index].rstrip() + subtokens, masked_lm_positions, masked_lm_ids, masked_lm_weights = self.tokenizer._encode( + line, max_seq_len=self.max_encoder_length, max_pred_len=self.max_pred_length + ) + return [ + subtokens, + np.zeros_like(subtokens), + masked_lm_positions, + masked_lm_ids, + masked_lm_weights, + np.zeros([1], dtype="int64"), + ] + + def __len__(self): + return len(self.lines) + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +def create_dataloader(input_file, tokenizer, worker_init, batch_size, max_encoder_length, max_pred_length, config): + pretrain_dataset = PretrainingDataset(input_file, tokenizer, max_encoder_length, max_pred_length) + train_batch_sampler = paddle.io.DistributedBatchSampler( + pretrain_dataset, batch_size=batch_size, shuffle=True, drop_last=True + ) + + # make masked_lm_positions can be gathered + def _collate_data(data, stack_fn=Stack()): + # Data Fields: input_ids, segment_ids, masked_lm_positions, masked_lm_ids, masked_lm_weights, next_sentence_labels + num_fields = len(data[0]) + out = [None] * num_fields + + for i in [0, 1, 5]: + out[i] = stack_fn([x[i] for x in data]) + batch_size, seq_length = out[0].shape + size = sum(len(x[2]) for x in data) + out[2] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[3] = np.full([size, 1], -1, dtype=np.int64) + # masked weight + out[4] = np.full([size], 0, dtype="float32") + # # Organize as a 1D tensor for gather or use gather_nd + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[2]): + out[2][mask_token_num] = i * seq_length + pos + out[3][mask_token_num] = x[3][j] + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + out.append(np.asarray([mask_token_num], dtype=np.float32)) + seq_len = len(out[0][0]) + rand_mask_idx_list = create_bigbird_rand_mask_idx_list( + config["num_layers"], + seq_len, + seq_len, + config["nhead"], + config["block_size"], + config["window_size"], + config["num_global_blocks"], + config["num_rand_blocks"], + config["seed"], + ) + out.extend(rand_mask_idx_list) + return out + + dataloader = DataLoader( + dataset=pretrain_dataset, + batch_sampler=train_batch_sampler, + collate_fn=_collate_data, + worker_init_fn=worker_init, + 
return_list=True, + ) + return dataloader + + +def do_train(args): + # Initialization for the parallel environment + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + worker_index = paddle.distributed.get_rank() + worker_num = paddle.distributed.get_world_size() + + # Set the random seed for the training process + set_seed(args) + worker_init = WorkerInitObj(args.seed + worker_index) + + # Get the model class and tokenizer class + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Define the pretrain model and metric + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + if args.model_name_or_path in pretrained_models_list: + config = model_class.config_class.from_pretrained(args.model_name_or_path) + model = model_class(config) + else: + model = BigBirdForPretraining.from_pretrained(args.model_name_or_path) + # Get bigbird config for generate random attention mask + config = getattr(model, BigBirdForPretraining.base_model_prefix).config + criterion = BigBirdPretrainingCriterion(config, args.use_nsp) + if worker_num > 1: + model = paddle.DataParallel(model) + + # Define learing_rate scheduler and optimizer + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, args.max_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.epochs): + files = [os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)] + files.sort() + num_files = len(files) + for f_id in range(num_files): + train_data_loader = create_dataloader( + files[f_id], + tokenizer, + worker_init, + args.batch_size, + args.max_encoder_length, + args.max_pred_length, + config, + ) + for step, batch in enumerate(train_data_loader): + global_step += 1 + ( + input_ids, + segment_ids, + masked_lm_positions, + masked_lm_ids, + masked_lm_weights, + next_sentence_labels, + masked_lm_scale, + ) = batch[:7] + rand_mask_idx_list = batch[7:] + + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + rand_mask_idx_list=rand_mask_idx_list, + masked_positions=masked_lm_positions, + ) + loss = criterion( + prediction_scores, + seq_relationship_score, + masked_lm_ids, + next_sentence_labels, + masked_lm_scale, + masked_lm_weights, + ) + if global_step % args.logging_steps == 0 and worker_index == 0: + logger.info( + "global step %d, epoch: %d, lr: %.10f, loss: %f, speed: %.2f step/s" + % ( + global_step, + epoch, + optimizer.get_lr(), + loss, + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0: + if worker_index == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = 
model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + del train_data_loader + + +if __name__ == "__main__": + args = args.parse_args() + do_train(args) diff --git a/examples/language_model/bloom b/examples/language_model/bloom new file mode 100644 index 0000000000000000000000000000000000000000..69ade8179eef55acb2d9c2ac71978e1448293106 --- /dev/null +++ b/examples/language_model/bloom @@ -0,0 +1 @@ +../../llm/bloom \ No newline at end of file diff --git a/examples/language_model/chatglm b/examples/language_model/chatglm new file mode 100644 index 0000000000000000000000000000000000000000..b3d6f16d3004c13997e19135ff470c135ccd419f --- /dev/null +++ b/examples/language_model/chatglm @@ -0,0 +1 @@ +../../llm/chatglm \ No newline at end of file diff --git a/examples/language_model/chinesebert/README.md b/examples/language_model/chinesebert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..383f68b2149d37979fe6665d6008c7f1f125f9f4 --- /dev/null +++ b/examples/language_model/chinesebert/README.md @@ -0,0 +1,203 @@ +# ChineseBert with PaddleNLP + +[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/pdf/2106.16038.pdf) + +**摘要:** +最近的汉语预训练模型忽略了汉语特有的两个重要方面:字形和拼音,它们对语言理解具有重要的语法和语义信息。在本研究中,我们提出了汉语预训练,它将汉字的字形和拼音信息纳入语言模型预训练中。字形嵌入是基于汉字的不同字体获得的,能够从视觉特征中捕捉汉字语义,拼音嵌入代表汉字的发音,处理汉语中高度流行的异义现象(同一汉字具有不同的发音和不同的含义)。在大规模的未标记中文语料库上进行预训练后,所提出的ChineseBERT模型在训练步骤较少的基线模型上产生了显著的性能提高。该模型在广泛的中国自然语言处理任务上实现了新的SOTA性能,包括机器阅读理解、自然语言推理、文本分类、句子对匹配和命名实体识别方面的竞争性能。 + +本项目是 ChineseBert 在 Paddle 2.x上的开源实现。 + +## **数据准备** +涉及到的ChnSentiCorp,crmc2018,XNLI数据 +部分Paddle已提供,其他可参考https://github.com/27182812/ChineseBERT_paddle, +在data目录下。 + + +## **模型预训练** +模型预训练过程可参考[Electra的README](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/electra/README.md) + +## **Fine-tuning** + +### 运行Fine-tuning + +#### **使用Paddle提供的预训练模型运行 Fine-tuning** + +#### 1、ChnSentiCorp +以ChnSentiCorp数据集为例 + +#### (1)模型微调: +```shell +# 运行训练 +python -m paddle.distributed.launch --gpus 0,1 python train_chn.py \ +--data_path './data/ChnSentiCorp' \ +--device 'gpu' \ +--num_train_epochs 10 \ +--max_seq_length 512 \ +--per_device_train_batch_size 8 \ +--per_device_eval_batch_size 8 \ +--learning_rate 2e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.0001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir 'outputs/chn' | tee outputs/train_chn.log +``` +其中参数释义如下: +- `data_path` 表示微调数据路径 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示训练轮数。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示每次迭代**每张卡**上的训练样本数目。 +- `per_device_eval_batch_size` 表示每次迭代**每张卡**上的验证样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `adam_beta2` 表示优化器中使用的beta2的系数。 +- `weight_decay` 表示优化器中使用的weight_decay的系数。 +- `warmup_ratio` 表示动态学习率热启动的比例。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示验证并保存模型间隔。 +- `seed` 指定随机种子。 +- `do_train` 表示是否进行训练。 +- `do_eval` 表示是否进行验证。 +- `output_dir` 表示模型保存路径。 + +#### (2) 评估 + +在dev和test数据集上acc分别为95.8和96.08,达到论文精度要求。 + +#### 2、XNLI + +#### (1)训练 + +```bash +python -m paddle.distributed.launch --gpus 0,1 python train_xnli.py \ +--data_path './data/XNLI' 
\ +--device 'gpu' \ +--num_train_epochs 5 \ +--max_seq_length 256 \ +--per_device_train_batch_size 16 \ +--per_device_eval_batch_size 16 \ +--learning_rate 1.3e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir "outputs/xnli" | tee outputs/train_xnli.log +``` +其中参数释义如下: +- `data_path` 表示微调数据路径 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示训练轮数。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示每次迭代**每张卡**上的训练样本数目。 +- `per_device_eval_batch_size` 表示每次迭代**每张卡**上的验证样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `adam_beta2` 表示优化器中使用的beta2的系数。 +- `weight_decay` 表示优化器中使用的weight_decay的系数。 +- `warmup_ratio` 表示动态学习率热启动的比例。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示验证并保存模型间隔。 +- `seed` 指定随机种子。 +- `do_train` 表示是否进行训练。 +- `do_eval` 表示是否进行验证。 +- `output_dir` 表示模型保存路径。 + +#### (2)评估 + +test数据集 acc最好结果为81.657,达到论文精度要求。 + +#### 3、cmrc2018 + +#### (1) 训练 + +```shell +# 开始训练 +python -m paddle.distributed.launch --gpus 0,1 python train_cmrc2018.py \ + --data_dir "./data/cmrc2018" \ + --model_name_or_path ChineseBERT-large \ + --max_seq_length 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --gradient_accumulation_steps 8 \ + --learning_rate 4e-5 \ + --max_grad_norm 1.0 \ + --adam_beta2 0.98 \ + --num_train_epochs 3 \ + --logging_steps 2 \ + --save_steps 20 \ + --warmup_ratio 0.1 \ + --weight_decay 0.01 \ + --seed 1111 \ + --do_train \ + --do_eval \ + --dataloader_num_workers 0 \ + --fp16 True \ + --output_dir "outputs/cmrc2018" +``` +其中参数释义如下: +- `data_path` 表示微调数据路径。 +- `model_name_or_path` 模型名称或者路径,支持ChineseBERT-base、ChineseBERT-large两种种规格。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示训练过程中每次迭代**每张卡**上的样本数目。 +- `per_device_eval_batch_size` 表示验证过程中每次迭代**每张卡**上的样本数目。 +- `gradient_accumulation_steps` 梯度累加步数。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `max_grad_norm` 梯度裁剪。 +- `adam_beta2` 表示优化器中使用的beta2的系数。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示验证并保存模型间隔。 +- `warmup_ratio` 表示动态学习率热启动的比例。 +- `weight_decay` 表示优化器中使用的weight_decay的系数。 +- `seed` 指定随机种子。 +- `do_train` 表示是否进行训练。 +- `do_eval` 表示是否进行验证。 +- `dataloader_num_workers` 表示同时工作进程。 +- `fp16` 表示是否使用混合精度fp16。 +- `output_dir` 表示模型保存路径。 + +训练过程中模型会在dev数据集进行评估,其中最好的结果如下所示: + +```python + +{ + AVERAGE = 82.791 + F1 = 91.055 + EM = 74.526 + TOTAL = 3219 + SKIP = 0 +} + +``` + +#### (2)运行eval_cmrc.py,生成test数据集预测答案 + +```bash +python eval_cmrc.py --model_name_or_path outputs/step-340 --n_best_size 35 --max_answer_length 65 +``` + +其中,model_name_or_path为模型路径 + +#### (3)提交CLUE + +test数据集 EM为78.55,达到论文精度要求 + + +## Reference + +```bibtex +@article{sun2021chinesebert, + title={ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information}, + author={Sun, Zijun and Li, Xiaoya and Sun, Xiaofei and Meng, Yuxian and Ao, Xiang and He, Qing and Wu, Fei and Li, Jiwei}, + journal={arXiv preprint arXiv:2106.16038}, + year={2021} +} + +``` diff --git a/examples/language_model/chinesebert/cmrc_eval.sh b/examples/language_model/chinesebert/cmrc_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..79155340b1c7b6f46a08545a6b7dc640d2bb5e45 --- /dev/null +++ b/examples/language_model/chinesebert/cmrc_eval.sh @@ -0,0 +1 @@ +python eval.py --model_name_or_path 
outputs/cmrc2018/step-140 --n_best_size 35 --max_answer_length 65 diff --git a/examples/language_model/chinesebert/cmrc_evaluate.py b/examples/language_model/chinesebert/cmrc_evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ceb5257ff9f1ae907adee3733547ed68ecff097a --- /dev/null +++ b/examples/language_model/chinesebert/cmrc_evaluate.py @@ -0,0 +1,241 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Evaluation script for CMRC 2018 +version: v5 - special +Note: +v5 - special: Evaluate on SQuAD-style CMRC 2018 Datasets +v5: formatted output, add usage description +v4: fixed segmentation issues +""" + +import argparse +import json +import re +import sys +from collections import OrderedDict + +import nltk + + +# split Chinese with English +def mixed_segmentation(in_str, rm_punc=False): + in_str = str(in_str).lower().strip() + segs_out = [] + temp_str = "" + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + for char in in_str: + if rm_punc and char in sp_char: + continue + if re.search(r"[\u4e00-\u9fa5]", char) or char in sp_char: + if temp_str != "": + ss = nltk.word_tokenize(temp_str) + segs_out.extend(ss) + temp_str = "" + segs_out.append(char) + else: + temp_str += char + + # handling last part + if temp_str != "": + ss = nltk.word_tokenize(temp_str) + segs_out.extend(ss) + + return segs_out + + +# remove punctuation +def remove_punctuation(in_str): + in_str = str(in_str).lower().strip() + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + +# find longest common string +def find_lcs(s1, s2): + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + mmax = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > mmax: + mmax = m[i + 1][j + 1] + p = i + 1 + return s1[p - mmax : p], mmax + + +def evaluate(ground_truth_file, prediction_file): + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for instance in ground_truth_file["data"]: + # context_id = instance['context_id'].strip() + # context_text = instance['context_text'].strip() + for para in instance["paragraphs"]: + for qas in para["qas"]: + total_count += 1 + query_id = qas["id"].strip() + answers = [x["text"] for x in qas["answers"]] + + if query_id not in prediction_file: + sys.stderr.write("Unanswered question: {}\n".format(query_id)) + skip_count += 1 + continue + + prediction = 
str(prediction_file[query_id]) + f1 += calc_f1_score(answers, prediction) + em += calc_em_score(answers, prediction) + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + return f1_score, em_score, total_count, skip_count + + +def calc_f1_score(answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = mixed_segmentation(ans, rm_punc=True) + prediction_segs = mixed_segmentation(prediction, rm_punc=True) + lcs, lcs_len = find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + precision = 1.0 * lcs_len / len(prediction_segs) + recall = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * precision * recall) / (precision + recall) + f1_scores.append(f1) + return max(f1_scores) + + +def calc_em_score(answers, prediction): + em = 0 + for ans in answers: + ans_ = remove_punctuation(ans) + prediction_ = remove_punctuation(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + +def get_result(ground_truth_file, prediction_file): + ground_truth_file = json.load(open(ground_truth_file, "rb")) + prediction_file = json.load(open(prediction_file, "rb")) + F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file) + AVG = (EM + F1) * 0.5 + output_result = OrderedDict() + output_result["AVERAGE"] = "%.3f" % AVG + output_result["F1"] = "%.3f" % F1 + output_result["EM"] = "%.3f" % EM + output_result["TOTAL"] = TOTAL + output_result["SKIP"] = SKIP + print(json.dumps(output_result)) + return output_result + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Evaluation Script for CMRC 2018") + parser.add_argument("--dataset_file", default="cmrc2018_public/dev.json", help="Official dataset file") + parser.add_argument("--prediction_file", default="all_predictions.json", help="Your prediction File") + args = parser.parse_args() + ground_truth_file = json.load(open(args.dataset_file, "rb")) + prediction_file = json.load(open(args.prediction_file, "rb")) + F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file) + AVG = (EM + F1) * 0.5 + output_result = OrderedDict() + output_result["AVERAGE"] = "%.3f" % AVG + output_result["F1"] = "%.3f" % F1 + output_result["EM"] = "%.3f" % EM + output_result["TOTAL"] = TOTAL + output_result["SKIP"] = SKIP + output_result["FILE"] = args.prediction_file + print(json.dumps(output_result)) diff --git a/examples/language_model/chinesebert/dataset_cmrc2018.py b/examples/language_model/chinesebert/dataset_cmrc2018.py new file mode 100644 index 0000000000000000000000000000000000000000..7144d5e67a99acf2384cbabb1110a354d9e464df --- /dev/null +++ b/examples/language_model/chinesebert/dataset_cmrc2018.py @@ -0,0 +1,418 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
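+# Data preparation helpers (SQuAD-style feature conversion for CMRC 2018) and a custom
+# EvalTrainer with an evaluation loop tailored to span-prediction post-processing.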
+ +import os +from functools import partial +from typing import List, Optional + +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset +from utils import load_pickle, save_pickle + +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import Trainer +from paddlenlp.trainer.trainer_utils import ( + EvalLoopOutput, + EvalPrediction, + IterableDatasetShard, + find_batch_size, + has_length, +) +from paddlenlp.trainer.utils.helper import ( + nested_concat, + nested_numpify, + nested_truncate, +) +from paddlenlp.utils.batch_sampler import ( + DistributedBatchSampler as NlpDistributedBatchSampler, +) +from paddlenlp.utils.log import logger + + +# this right +def prepare_train_features_paddlenlp(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, + contexts, + stride=args["model_args"].doc_stride, + max_length=args["model_args"].max_seq_length, + return_token_type_ids=True, + ) + + # Let's label those examples! + for i, tokenized_example in enumerate(tokenized_examples): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_example["input_ids"] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offsets = tokenized_example["offset_mapping"] + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + answers = examples[sample_index]["answers"] + answer_starts = examples[sample_index]["answer_starts"] + + # If no answers are given, set the cls_index as answer. + if len(answer_starts) == 0: + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Start/end character index of the answer in the text. + start_char = answer_starts[0] + end_char = start_char + len(answers[0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + # Minus one more to reach actual text + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). 
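+            # offsets[k] holds the (char_start, char_end) of token k in the original context,
+            # so the answer fits this chunk only when its character span is covered by the
+            # chunk's first and last context tokens.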
+ if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples[i]["start_positions"] = token_start_index - 1 + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples[i]["end_positions"] = token_end_index + 1 + + return tokenized_examples + + +# this right +def prepare_dev_features_paddlenlp(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, + contexts, + stride=args["model_args"].doc_stride, + max_length=args["model_args"].max_seq_length, + return_token_type_ids=True, + ) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
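+        # token_type_ids double as the sequence indicator: 0 marks question tokens and
+        # 1 marks context tokens, so only context offsets are kept.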
+ tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + + return tokenized_examples + + +def get_train_dataset(tokenizer, args, splits="train"): + + data_dir = args["data_args"].data_dir + filename = os.path.join(data_dir, "cmrc2018_" + splits + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("cmrc2018", splits=splits) + ds.map( + partial(prepare_train_features_paddlenlp, tokenizer=tokenizer, args=args), + batched=True, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def get_dev_dataset(tokenizer, args, splits="dev"): + + data_dir = args["data_args"].data_dir + filename = os.path.join(data_dir, "cmrc2018_" + splits + ".pkl") + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("cmrc2018", splits=splits) + ds.map( + partial(prepare_dev_features_paddlenlp, tokenizer=tokenizer, args=args), + batched=True, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def is_datasets_available(): + import importlib + + return importlib.util.find_spec("datasets") is not None + + +if is_datasets_available(): + import datasets + + +class EvalTrainer(Trainer): + def set_eval_collator(self, collator): + self.eval_collate_fn = collator + + def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader: + """ + Returns the evaluation [`~paddle.io.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + eval_dataset (`paddle.io.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not accepted by + the `model.forward()` method are automatically removed. It must implement `__len__`. + """ + if eval_dataset is None and self.eval_dataset is None: + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + + if is_datasets_available() and isinstance(eval_dataset, datasets.Dataset): + eval_dataset = self._remove_unused_columns(eval_dataset, description="evaluation") + + if self._is_iterable_dataset(eval_dataset): + if self.args.world_size > 1: + eval_dataset = IterableDatasetShard( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + + return DataLoader( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + eval_sampler = self._get_eval_sampler(eval_dataset) + + return DataLoader( + eval_dataset, + batch_sampler=eval_sampler, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + max_eval_iters: Optional[int] = -1, + ) -> EvalLoopOutput: + """ + Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`. + + Works both with or without labels. 
+ """ + args = self.args + + prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else args.prediction_loss_only + + model = self.model + + if isinstance(dataloader, paddle.io.DataLoader): + batch_size = dataloader.batch_sampler.batch_size + elif isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase): + # support for inner dataloader + batch_size = dataloader._batch_sampler.batch_size + # alias for inner dataloader + dataloader.dataset = dataloader._dataset + else: + raise ValueError("Only support for paddle.io.DataLoader") + + num_samples = None + if max_eval_iters > 0: + # on eval limit steps + num_samples = batch_size * self.args.world_size * max_eval_iters + if isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase) and isinstance( + dataloader._batch_sampler, NlpDistributedBatchSampler + ): + consumed_samples = ( + ((self.state.global_step) // args.eval_steps) + * max_eval_iters + * args.per_device_eval_batch_size + * args.world_size + ) + dataloader._batch_sampler.set_epoch(consumed_samples=consumed_samples) + + logger.info(f"***** Running {description} *****") + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + else: + logger.info(f" Total prediction steps = {len(dataloader)}") + else: + logger.info(" Num examples: Unknown") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + + logger.info(f" Pre device batch size = {batch_size}") + logger.info(f" Total Batch size = {batch_size * self.args.world_size}") + + model.eval() + + self.callback_handler.eval_dataloader = dataloader + # Do this before wrapping. + eval_dataset = dataloader.dataset + + if args.past_index >= 0: + self._past = None + + # Initialize containers + # losses/preds/labels on GPU (accumulated for eval_accumulation_steps) + losses_host = None + preds_host = None + labels_host = None + # losses/preds/labels on CPU (final containers) + all_losses = None + all_preds = None + all_labels = None + # Will be useful when we have an iterable dataset so don't know its length. + + observed_num_examples = 0 + # Main evaluation loop + losses = [] + for step, inputs in enumerate(dataloader): + # Update the observed num examples + observed_batch_size = find_batch_size(inputs) + if observed_batch_size is not None: + observed_num_examples += observed_batch_size + # For batch samplers, batch_size is not known by the dataloader in advance. 
+ if batch_size is None: + batch_size = observed_batch_size + + # Prediction step + loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) + + # Update containers on host + if loss is not None: + # losses = self._nested_gather(loss.repeat(batch_size)) + # losses = self._nested_gather(loss) + losses = self._nested_gather(paddle.tile(loss, repeat_times=[batch_size, 1])) + losses_host = losses if losses_host is None else paddle.concat((losses_host, losses), axis=0) + + if labels is not None: + labels = self._pad_across_processes(labels) + labels = self._nested_gather(labels) + labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100) + if logits is not None: + logits = self._pad_across_processes(logits) + logits = self._nested_gather(logits) + if self.preprocess_logits_for_metrics is not None: + logits = self.preprocess_logits_for_metrics(logits, labels) + preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100) + self.control = self.callback_handler.on_prediction_step(args, self.state, self.control) + if max_eval_iters > 0 and step >= max_eval_iters - 1: + break + + # Gather all remaining tensors and put them back on the CPU + if losses_host is not None: + losses = nested_numpify(losses_host) + all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0) + if preds_host is not None: + logits = nested_numpify(preds_host) + all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100) + if labels_host is not None: + labels = nested_numpify(labels_host) + all_labels = labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100) + + # Number of samples + if num_samples is not None: + pass + elif has_length(eval_dataset): + num_samples = len(eval_dataset) + # The instance check is weird and does not actually check for the type, but whether the dataset has the right + # methods. Therefore we need to make sure it also has the attribute. + elif isinstance(eval_dataset, IterableDatasetShard) and hasattr(eval_dataset, "num_examples"): + num_samples = eval_dataset.num_examples + else: + if has_length(dataloader): + num_samples = self.num_examples(dataloader) + else: # both len(dataloader.dataset) and len(dataloader) fail + num_samples = observed_num_examples + + # Number of losses has been rounded to a multiple of batch_size and in a distributed training, the number of + # samplers has been rounded to a multiple of batch_size, so we truncate. 
+ if all_losses is not None: + all_losses = all_losses[:num_samples] + if all_preds is not None: + all_preds = nested_truncate(all_preds, num_samples) + if all_labels is not None: + all_labels = nested_truncate(all_labels, num_samples) + + model.train() + + if self.compute_metrics is not None and all_preds is not None: + metrics = self.compute_metrics( + EvalPrediction(predictions=all_preds, label_ids=all_labels), dataloader, args + ) + else: + metrics = {} + + if all_losses is not None: + metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item() + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples) diff --git a/examples/language_model/chinesebert/eval_cmrc.py b/examples/language_model/chinesebert/eval_cmrc.py new file mode 100644 index 0000000000000000000000000000000000000000..9cd8d8d8d50f2c632c62d3527b0dd5573bf731f6 --- /dev/null +++ b/examples/language_model/chinesebert/eval_cmrc.py @@ -0,0 +1,220 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
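+# Runs a finetuned ChineseBERT model over the CMRC 2018 test (or dev) split and writes
+# all_predictions.json (plus an optional n-best list) for scoring, e.g.
+#   python eval_cmrc.py --model_name_or_path outputs/step-340 --n_best_size 35 --max_answer_length 65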
+ +import argparse +from tqdm.auto import tqdm +import os + +import paddle + +from dataset_cmrc2018 import get_dev_dataloader +from train_cmrc2018 import MODEL_CLASSES +from metric import compute_prediction +from utils import save_json + + +@paddle.no_grad() +def evaluate(model, data_loader, args, output_dir="./"): + model.eval() + all_start_logits = [] + all_end_logits = [] + + for batch in tqdm(data_loader): + input_ids, token_type_ids, pinyin_ids = batch + start_logits_tensor, end_logits_tensor = model(input_ids, token_type_ids=token_type_ids, pinyin_ids=pinyin_ids) + all_start_logits.extend(start_logits_tensor.numpy().tolist()) + all_end_logits.extend(end_logits_tensor.numpy().tolist()) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + data_loader.dataset.data, + data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + False, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + save_json(all_predictions, os.path.join(output_dir, "all_predictions.json")) + if args.save_nbest_json: + save_json(all_nbest_json, os.path.join(output_dir, "all_nbest_json.json")) + + +def main(args): + print(args) + paddle.set_device(args.device) + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + model = model_class.from_pretrained(args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + splits = "test" + dev_data_loader = get_dev_dataloader(tokenizer, args, splits=splits) + evaluate(model, dev_data_loader, args, output_dir=args.output_dir) + + data_dir = args.data_dir + dev_ground_truth_file_path = os.path.join(data_dir, "dev.json") + dev_predict_file_path = os.path.join(args.output_dir, "all_predictions.json") + if splits == "dev": + from cmrc_evaluate import get_result + + get_result(dev_ground_truth_file_path, dev_predict_file_path) + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default="chinesebert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="ChineseBERT-large", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="outputs/cmrc2018", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--train_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for evaluating.", + ) + + parser.add_argument( + "--gradient_accumulation_steps", + default=1, + type=int, + help="gradient_accumulation_steps.", + ) + parser.add_argument( + "--learning_rate", + default=4e-5, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=2, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_train_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_radio", + default=0.1, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--warmup_steps", type=int, default=-1, help="warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=250, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--writer_type", + choices=["visualdl", "tensorboard"], + default="visualdl", + help="writer_type.", + ) + parser.add_argument( + "--device", + choices=["cpu", "gpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", + ) + parser.add_argument( + "--scheduler_type", + choices=["linear", "cosine", "poly"], + default="linear", + type=str, + help="scheduler_type.", + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=35, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=65, help="Max answer length.") + parser.add_argument("--use_amp", action="store_true", help="Enable mixed precision training.") + + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + parser.add_argument( + "--num_workers", + type=int, + default=0, + help="num_workers.", + ) + parser.add_argument("--save_nbest_json", action="store_true", help="Enable save nbest json.") + + args = parser.parse_args() + + args.model_type = args.model_type.lower() + args.logdir = os.path.join(args.output_dir, "logs") + os.makedirs("caches", exist_ok=True) + os.makedirs(args.logdir, exist_ok=True) + + return args + + +if __name__ == "__main__": + args = 
parse_args() + main(args) diff --git a/examples/language_model/chinesebert/metric_cmrc.py b/examples/language_model/chinesebert/metric_cmrc.py new file mode 100644 index 0000000000000000000000000000000000000000..13208fb9773a158e963881bd335adb3e79732ba1 --- /dev/null +++ b/examples/language_model/chinesebert/metric_cmrc.py @@ -0,0 +1,387 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import json +import re +import string +import numpy as np + + +def compute_prediction( + examples, + features, + predictions, + version_2_with_negative=False, + n_best_size=20, + max_answer_length=30, + null_score_diff_threshold=0.0, +): + """ + Post-processes the predictions of a question-answering model to convert + them to answers that are substrings of the original contexts. This is + the base postprocessing functions for models that only return start and + end logits. + + Args: + examples (list): List of raw squad-style data (see `run_squad.py + `__ for more + information). + features (list): List of processed squad-style features (see + `run_squad.py `__ + for more information). + predictions (tuple): The predictions of the model. Should be a tuple + of two list containing the start logits and the end logits. + version_2_with_negative (bool, optional): Whether the dataset contains + examples with no answers. Defaults to False. + n_best_size (int, optional): The total number of candidate predictions + to generate. Defaults to 20. + max_answer_length (int, optional): The maximum length of predicted answer. + Defaults to 20. + null_score_diff_threshold (float, optional): The threshold used to select + the null answer. Only useful when `version_2_with_negative` is True. + Defaults to 0.0. + + Returns: + A tuple of three dictionaries containing final selected answer, all n_best + answers along with their probability and scores, and the score_diff of each + example. + """ + assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)." + all_start_logits, all_end_logits = predictions + + assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." + + # Build a map example to its corresponding features. + features_per_example = collections.defaultdict(list) + for i, feature in enumerate(features): + features_per_example[feature["example_id"]].append(i) + + # The dictionaries we have to fill. + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + + scores_diff_json = collections.OrderedDict() + + # Let's loop over all the examples! + for example_index, example in enumerate(examples): + # Those are the indices of the features associated to the current example. + feature_indices = features_per_example[example["id"]] + + min_null_prediction = None + prelim_predictions = [] + + # Looping through all the features associated to the current example. 
+ for feature_index in feature_indices: + # We grab the predictions of the model for this feature. + start_logits = all_start_logits[feature_index] + end_logits = all_end_logits[feature_index] + # This is what will allow us to map some the positions in our logits to span of texts in the original + # context. + offset_mapping = features[feature_index]["offset_mapping"] + # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context + # available in the current feature. + token_is_max_context = features[feature_index].get("token_is_max_context", None) + + # Update minimum null prediction. + feature_null_score = start_logits[0] + end_logits[0] + if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: + min_null_prediction = { + "offsets": (0, 0), + "score": feature_null_score, + "start_logit": start_logits[0], + "end_logit": end_logits[0], + } + + # Go through all possibilities for the `n_best_size` greater start and end logits. + start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() + end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() + for start_index in start_indexes: + for end_index in end_indexes: + # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond + # to part of the input_ids that are not in the context. + if ( + start_index >= len(offset_mapping) + or end_index >= len(offset_mapping) + or offset_mapping[start_index] is None + or offset_mapping[end_index] is None + or offset_mapping[start_index] == (0, 0) + or offset_mapping[end_index] == (0, 0) + ): + continue + # Don't consider answers with a length that is either < 0 or > max_answer_length. + if end_index < start_index or end_index - start_index + 1 > max_answer_length: + continue + # Don't consider answer that don't have the maximum context available (if such information is + # provided). + if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): + continue + prelim_predictions.append( + { + "offsets": ( + offset_mapping[start_index][0], + offset_mapping[end_index][1], + ), + "score": start_logits[start_index] + end_logits[end_index], + "start_logit": start_logits[start_index], + "end_logit": end_logits[end_index], + } + ) + if version_2_with_negative: + # Add the minimum null prediction + prelim_predictions.append(min_null_prediction) + null_score = min_null_prediction["score"] + + # Only keep the best `n_best_size` predictions. + predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] + + # Add back the minimum null prediction if it was removed because of its low score. + if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): + predictions.append(min_null_prediction) + + # Use the offsets to gather the answer text in the original context. + context = example["context"] + for pred in predictions: + offsets = pred.pop("offsets") + pred["text"] = context[offsets[0] : offsets[1]] + + # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid + # failure. + if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""): + predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0}) + + # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using + # the LogSumExp trick). 
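+ # Subtracting the maximum score before exponentiating is the standard log-sum-exp
+ # stabilization: it prevents overflow in np.exp while leaving the normalized
+ # probabilities unchanged.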
+ scores = np.array([pred.pop("score") for pred in predictions]) + exp_scores = np.exp(scores - np.max(scores)) + probs = exp_scores / exp_scores.sum() + + # Include the probabilities in our predictions. + for prob, pred in zip(probs, predictions): + pred["probability"] = prob + + # Pick the best prediction. If the null answer is not possible, this is easy. + if not version_2_with_negative: + all_predictions[example["id"]] = predictions[0]["text"] + else: + # Otherwise we first need to find the best non-empty prediction. + i = 0 + while predictions[i]["text"] == "": + i += 1 + best_non_null_pred = predictions[i] + + # Then we compare to the null prediction using the threshold. + score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"] + scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable. + if score_diff > null_score_diff_threshold: + all_predictions[example["id"]] = "" + else: + all_predictions[example["id"]] = best_non_null_pred["text"] + + # Make `predictions` JSON-serializable by casting np.float back to float. + all_nbest_json[example["id"]] = [ + {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} + for pred in predictions + ] + + return all_predictions, all_nbest_json, scores_diff_json + + +def make_qid_to_has_ans(examples): + qid_to_has_ans = {} + for example in examples: + qid_to_has_ans[example["id"]] = not example.get("is_impossible", False) + return qid_to_has_ans + + +def normalize_answer(s): + # Lower text and remove punctuation, articles and extra whitespace. + def remove_articles(text): + regex = re.compile(r"\b(a|an|the)\b", re.UNICODE) + return re.sub(regex, " ", text) + + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + if not s: + return "" + else: + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def compute_exact(a_gold, a_pred): + return int(normalize_answer(a_gold) == normalize_answer(a_pred)) + + +def compute_f1(a_gold, a_pred, is_whitespace_splited=True): + gold_toks = normalize_answer(a_gold).split() + pred_toks = normalize_answer(a_pred).split() + + if not is_whitespace_splited: + gold_toks = gold_toks[0] if gold_toks else "" + pred_toks = pred_toks[0] if pred_toks else "" + + common = collections.Counter(gold_toks) & collections.Counter(pred_toks) + num_same = sum(common.values()) + if len(gold_toks) == 0 or len(pred_toks) == 0: + # If either is no-answer, then F1 is 1 if they agree, 0 otherwise + return int(gold_toks == pred_toks) + if num_same == 0: + return 0 + precision = 1.0 * num_same / len(pred_toks) + recall = 1.0 * num_same / len(gold_toks) + f1 = (2 * precision * recall) / (precision + recall) + return f1 + + +def get_raw_scores(examples, preds, is_whitespace_splited=True): + exact_scores = {} + f1_scores = {} + for example in examples: + qid = example["id"] + gold_answers = [text for text in example["answers"] if normalize_answer(text)] + if not gold_answers: + # For unanswerable questions, only correct answer is empty string + gold_answers = [""] + if qid not in preds: + print("Missing prediction for %s" % qid) + continue + a_pred = preds[qid] + # Take max over all gold answers + exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers) + f1_scores[qid] = max(compute_f1(a, a_pred, is_whitespace_splited) for a in gold_answers) + + 
return exact_scores, f1_scores + + +def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh): + new_scores = {} + for qid, s in scores.items(): + pred_na = na_probs[qid] > na_prob_thresh + if pred_na: + new_scores[qid] = float(not qid_to_has_ans[qid]) + else: + new_scores[qid] = s + return new_scores + + +def make_eval_dict(exact_scores, f1_scores, qid_list=None): + if not qid_list: + total = len(exact_scores) + return collections.OrderedDict( + [ + ("exact", 100.0 * sum(exact_scores.values()) / total), + ("f1", 100.0 * sum(f1_scores.values()) / total), + ("total", total), + ] + ) + else: + total = len(qid_list) + return collections.OrderedDict( + [ + ("exact", 100.0 * sum(exact_scores[k] for k in qid_list) / total), + ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total), + ("total", total), + ] + ) + + +def merge_eval(main_eval, new_eval, prefix): + for k in new_eval: + main_eval["%s_%s" % (prefix, k)] = new_eval[k] + + +def find_best_thresh(preds, scores, na_probs, qid_to_has_ans): + num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k]) + cur_score = num_no_ans + best_score = cur_score + best_thresh = 0.0 + qid_list = sorted(na_probs, key=lambda k: na_probs[k]) + for i, qid in enumerate(qid_list): + if qid not in scores: + continue + if qid_to_has_ans[qid]: + diff = scores[qid] + else: + if preds[qid]: + diff = -1 + else: + diff = 0 + cur_score += diff + if cur_score > best_score: + best_score = cur_score + best_thresh = na_probs[qid] + return 100.0 * best_score / len(scores), best_thresh + + +def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans): + best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans) + best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans) + main_eval["best_exact"] = best_exact + main_eval["best_exact_thresh"] = exact_thresh + main_eval["best_f1"] = best_f1 + main_eval["best_f1_thresh"] = f1_thresh + + +def squad_evaluate(examples, preds, na_probs=None, na_prob_thresh=1.0, is_whitespace_splited=True): + """ + Computes and prints the f1 score and em score of input prediction. + + Args: + examples (list): List of raw squad-style data (see `run_squad.py + `__ for more + information). + preds (dict): Dictionary of final predictions. Usually generated by + `compute_prediction`. + na_probs (dict, optional): Dictionary of score_diffs of each example. + Used to decide if answer exits and compute best score_diff + threshold of null. Defaults to None. + na_prob_thresh (float, optional): The threshold used to select the + null answer. Defaults to 1.0. + is_whitespace_splited (bool, optional): Whether the predictions and references + can be tokenized by whitespace. Usually set True for English and + False for Chinese. Defaults to True. 
+ """ + + if not na_probs: + na_probs = {k: 0.0 for k in preds} + + qid_to_has_ans = make_qid_to_has_ans(examples) # maps qid to True/False + has_ans_qids = [k for k, v in qid_to_has_ans.items() if v] + no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v] + exact_raw, f1_raw = get_raw_scores(examples, preds, is_whitespace_splited) + exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, na_prob_thresh) + f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, na_prob_thresh) + out_eval = make_eval_dict(exact_thresh, f1_thresh) + if has_ans_qids: + has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids) + merge_eval(out_eval, has_ans_eval, "HasAns") + if no_ans_qids: + no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids) + merge_eval(out_eval, no_ans_eval, "NoAns") + find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans) + + print(json.dumps(out_eval, indent=2)) + return out_eval diff --git a/examples/language_model/chinesebert/run_chn.sh b/examples/language_model/chinesebert/run_chn.sh new file mode 100644 index 0000000000000000000000000000000000000000..c1810b0eb1af0631d32e90a29d193f0246d57187 --- /dev/null +++ b/examples/language_model/chinesebert/run_chn.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle.distributed.launch --gpus 0,1 python train_chn.py \ +--data_path './data/ChnSentiCorp' \ +--device 'gpu' \ +--num_train_epochs 10 \ +--max_seq_length 512 \ +--per_device_train_batch_size 8 \ +--per_device_eval_batch_size 8 \ +--learning_rate 2e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.0001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir 'outputs/chn' | tee outputs/train_chn.log diff --git a/examples/language_model/chinesebert/run_cmrc2018.sh b/examples/language_model/chinesebert/run_cmrc2018.sh new file mode 100644 index 0000000000000000000000000000000000000000..1541877949f5bf7d9b83faf3a5072e94d21642d9 --- /dev/null +++ b/examples/language_model/chinesebert/run_cmrc2018.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +python -m paddle.distributed.launch --gpus 0,1 python train_cmrc2018.py \ + --data_dir "./data/cmrc2018" \ + --model_name_or_path ChineseBERT-large \ + --max_seq_length 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --gradient_accumulation_steps 8 \ + --learning_rate 4e-5 \ + --max_grad_norm 1.0 \ + --adam_beta2 0.98 \ + --num_train_epochs 3 \ + --logging_steps 2 \ + --save_steps 20 \ + --warmup_ratio 0.1 \ + --weight_decay 0.01 \ + --seed 1111 \ + --do_train \ + --do_eval \ + --dataloader_num_workers 0 \ + --fp16 True \ + --output_dir "outputs/cmrc2018" + diff --git a/examples/language_model/chinesebert/run_xnli.sh b/examples/language_model/chinesebert/run_xnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..de0e6945deedfdf2fb41c2e4e1cb6b596fd50fce --- /dev/null +++ b/examples/language_model/chinesebert/run_xnli.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle.distributed.launch --gpus 0,1 python train_xnli.py \ +--data_path './data/XNLI' \ +--device 'gpu' \ +--num_train_epochs 5 \ +--max_seq_length 256 \ +--per_device_train_batch_size 16 \ +--per_device_eval_batch_size 16 \ +--learning_rate 1.3e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir "outputs/xnli" | tee outputs/train_xnli.log diff --git a/examples/language_model/chinesebert/train_chn.py b/examples/language_model/chinesebert/train_chn.py new file mode 100644 index 0000000000000000000000000000000000000000..75059fcd4fc1a60ff044556265967c1da476c474 --- /dev/null +++ b/examples/language_model/chinesebert/train_chn.py @@ -0,0 +1,165 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
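+# Fine-tunes ChineseBERT-large for binary sentiment classification on ChnSentiCorp
+# with the PaddleNLP Trainer. Train/dev/test splits are read as TSV files from
+# --data_path, and dev-set accuracy is reported through compute_metrics.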
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from paddle.metric import Accuracy +from utils import load_ds + +from paddlenlp.data import Pad, Stack +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers import ( + ChineseBertForSequenceClassification, + ChineseBertTokenizer, +) + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + + +@dataclass +class DataArguments: + data_path: Optional[str] = field( + default="./data", + metadata={"help": "The path of datasets to be loaded."}, + ) + + +def convert_example(example, tokenizer, max_length=512, is_test=False): + # The original data is processed into a format that can be read in by the model, + # enocded_ Inputs is a dict that contains inputs_ids、token_type_ids、etc. + encoded_inputs = tokenizer(text=example["text"], max_length=max_length) + + # input_ids:After the text is segmented into tokens, the corresponding token id in the vocabulary. + input_ids = encoded_inputs["input_ids"] + # # token_type_ids:Does the current token belong to sentence 1 or sentence 2, that is, the segment ids. + pinyin_ids = encoded_inputs["pinyin_ids"] + + label = np.array([example["label"]], dtype="int64") + # return encoded_inputs + return input_ids, pinyin_ids, label + + +@dataclass +class DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + pinyin_ids = [] + labels = [] + batch = {} + + for feature in features: + input_idx, pinyin_idx, label = feature + input_ids.append(input_idx) + pinyin_ids.append(pinyin_idx) + labels.append(label) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + pinyin_ids = (Pad(axis=0, pad_val=0)(pinyin_ids),) # pinyin_ids + labels = (Stack()(labels),) # labels + + batch["input_ids"] = input_ids[0] + batch["pinyin_ids"] = pinyin_ids[0] + batch["labels"] = labels[0] + + return batch + + +def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + data_dir = data_args.data_path + train_path = os.path.join(data_dir, "train.tsv") + dev_path = os.path.join(data_dir, "dev.tsv") + test_path = os.path.join(data_dir, "test.tsv") + + train_ds, dev_ds, test_ds = load_ds(datafiles=[train_path, dev_path, test_path]) + + model = ChineseBertForSequenceClassification.from_pretrained("ChineseBERT-large", num_classes=2) + tokenizer = ChineseBertTokenizer.from_pretrained("ChineseBERT-large") + + # Process the data into a data format that the model can read in. 
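+ # convert_example tokenizes each text into an (input_ids, pinyin_ids, label) tuple;
+ # padding to the longest sequence in a batch is deferred to the DataCollator.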
+ trans_func = partial(convert_example, tokenizer=tokenizer, max_length=model_args.max_seq_length) + train_ds = train_ds.map(trans_func, lazy=False) + dev_ds = dev_ds.map(trans_func, lazy=False) + test_ds = test_ds.map(trans_func, lazy=False) + + # Form data into batch data, such as padding text sequences of different lengths into the maximum length of batch data, + # and stack each data label together + batchify_fn = DataCollator(tokenizer) + criterion = paddle.nn.loss.CrossEntropyLoss() + + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + criterion=criterion, + compute_metrics=compute_metrics, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/chinesebert/train_cmrc2018.py b/examples/language_model/chinesebert/train_cmrc2018.py new file mode 100644 index 0000000000000000000000000000000000000000..62b70af02e3345f2f42a75d29b205535c5d30bec --- /dev/null +++ b/examples/language_model/chinesebert/train_cmrc2018.py @@ -0,0 +1,259 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from cmrc_evaluate import get_result +from dataset_cmrc2018 import EvalTrainer, get_dev_dataset, get_train_dataset +from metric_cmrc import compute_prediction, squad_evaluate +from utils import CrossEntropyLossForSQuAD, save_json + +from paddlenlp.data import Pad, Stack +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, set_seed +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ChineseBertForQuestionAnswering, + ChineseBertTokenizer, + ErnieForQuestionAnswering, + ErnieTokenizer, +) + +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "chinesebert": (ChineseBertForQuestionAnswering, ChineseBertTokenizer), +} + + +@dataclass +class ModelArguments: + model_type: Optional[str] = field( + default="chinesebert", + metadata={"help": ("Type of pre-trained model.")}, + ) + model_name_or_path: Optional[str] = field( + default="ChineseBERT-large", + metadata={"help": ("Path to pre-trained model or shortcut name of model.")}, + ) + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + doc_stride: Optional[int] = field( + default=128, + metadata={"help": ("When splitting up a long document into chunks, how much stride to take between chunks.")}, + ) + n_best_size: Optional[int] = field( + default=35, + metadata={ + "help": ("The total number of n-best predictions to generate in the nbest_predictions.json output file.") + }, + ) + null_score_diff_threshold: Optional[float] = field( + default=0.0, + metadata={"help": ("If null_score - best_non_null is greater than the threshold predict null.")}, + ) + max_query_length: Optional[int] = field( + default=64, + metadata={"help": ("Max query length.")}, + ) + max_answer_length: Optional[int] = field( + default=65, + metadata={"help": ("Max answer length.")}, + ) + use_amp: Optional[bool] = field( + default=False, + metadata={"help": ("Enable mixed precision training.")}, + ) + + +@dataclass +class DataArguments: + data_dir: Optional[str] = field( + default="./data/cmrc2018", + metadata={"help": ("the path of cmrc2018 data.")}, + ) + save_nbest_json: Optional[bool] = field( + default=False, + metadata={"help": ("Enable save nbest json.")}, + ) + + +parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) +model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + +@dataclass +class Train_DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + token_type_ids = [] + pinyin_ids = [] + start_positions = [] + end_positions = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + token_type_ids.append(feature["token_type_ids"]) + pinyin_ids.append(feature["pinyin_ids"]) + start_positions.append(feature["start_positions"]) + end_positions.append(feature["end_positions"]) + + input_ids = Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids) # input_ids + token_type_ids = Pad(axis=0, pad_val=0)(token_type_ids) + pinyin_ids = Pad(axis=0, pad_val=0)(pinyin_ids) # pinyin_ids + start_positions = Stack(dtype="int64")(start_positions) + end_positions = Stack(dtype="int64")(end_positions) + + batch["input_ids"] = input_ids + batch["token_type_ids"] = token_type_ids + batch["pinyin_ids"] = pinyin_ids + batch["start_positions"] = start_positions + batch["end_positions"] = end_positions + + return batch + + +@dataclass +class Eval_DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + token_type_ids = [] + pinyin_ids = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + token_type_ids.append(feature["token_type_ids"]) + pinyin_ids.append(feature["pinyin_ids"]) + + input_ids = Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids) # input_ids + token_type_ids = Pad(axis=0, pad_val=0)(token_type_ids) + pinyin_ids = Pad(axis=0, pad_val=0)(pinyin_ids) # pinyin_ids + + batch["input_ids"] = input_ids + batch["token_type_ids"] = token_type_ids + batch["pinyin_ids"] = pinyin_ids + + return batch + + +def compute_metrics(eval_preds, dataloader, args): + all_start_logits, all_end_logits = eval_preds.predictions + all_start_logits = all_start_logits.tolist() + all_end_logits = all_end_logits.tolist() + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + dataloader.dataset.data, + dataloader.dataset.new_data, + (all_start_logits, all_end_logits), + False, + model_args.n_best_size, + model_args.max_answer_length, + 
model_args.null_score_diff_threshold, + ) + + save_json(all_predictions, os.path.join(args.output_dir, "all_predictions.json")) + if data_args.save_nbest_json: + save_json(all_nbest_json, os.path.join(args.output_dir, "all_nbest_json.json")) + + ground_truth_file = os.path.join(data_args.data_dir, "dev.json") + + eval_results = get_result( + ground_truth_file=ground_truth_file, prediction_file=os.path.join(args.output_dir, "all_predictions.json") + ) + print("CMRC2018 EVALUATE.") + print(eval_results) + print("SQUAD EVALUATE.") + squad_evaluate( + examples=dataloader.dataset.data, + preds=all_predictions, + na_probs=scores_diff_json, + ) + return eval_results + + +def train(): + model_args.model_type = model_args.model_type.lower() + training_args.logdir = os.path.join(training_args.output_dir, "logs") + os.makedirs("caches", exist_ok=True) + os.makedirs(training_args.logdir, exist_ok=True) + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + # get model and tokenizer + model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + model = model_class.from_pretrained(model_args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + + # get dataloader + args = {} + args["training_args"] = training_args + args["data_args"] = data_args + args["model_args"] = model_args + train_ds = get_train_dataset(tokenizer, args) + dev_ds = get_dev_dataset(tokenizer, args) + train_collator = Train_DataCollator(tokenizer) + dev_collator = Eval_DataCollator(tokenizer) + + criterion = CrossEntropyLossForSQuAD() + + trainer = EvalTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=train_collator, + criterion=criterion, + compute_metrics=compute_metrics, + ) + trainer.set_eval_collator(dev_collator) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + train() diff --git a/examples/language_model/chinesebert/train_xnli.py b/examples/language_model/chinesebert/train_xnli.py new file mode 100644 index 0000000000000000000000000000000000000000..c2e0d5c45c93a364b163da4fcce1d497fd041757 --- /dev/null +++ b/examples/language_model/chinesebert/train_xnli.py @@ -0,0 +1,166 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
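+# Fine-tunes ChineseBERT-large on XNLI (3-way natural language inference) with the
+# PaddleNLP Trainer. Sentence pairs from train/dev/test TSV files are encoded together
+# by the tokenizer, and labels are mapped in convert_example as
+# contradiction/contradictory -> 0, neutral -> 1, entailment -> 2.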
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from paddle.metric import Accuracy +from utils import load_ds_xnli + +from paddlenlp.data import Pad, Stack +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers import ( + ChineseBertForSequenceClassification, + ChineseBertTokenizer, +) + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + + +@dataclass +class DataArguments: + data_path: Optional[str] = field( + default="./data", + metadata={"help": "The path of datasets to be loaded."}, + ) + + +def convert_example(example, tokenizer, max_length=512, is_test=False): + + label_map = {"contradictory": 0, "contradiction": 0, "entailment": 2, "neutral": 1} + first, second, third = example["sentence1"], example["sentence2"], example["label"] + + encoded_inputs = tokenizer(first, second, max_length=max_length) + input_ids = encoded_inputs["input_ids"] + pinyin_ids = encoded_inputs["pinyin_ids"] + + label = np.array([label_map[third]], dtype="int64") + assert len(input_ids) <= max_length + return input_ids, pinyin_ids, label + + +@dataclass +class DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + pinyin_ids = [] + labels = [] + batch = {} + + for feature in features: + input_idx, pinyin_idx, label = feature + input_ids.append(input_idx) + pinyin_ids.append(pinyin_idx) + labels.append(label) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + pinyin_ids = (Pad(axis=0, pad_val=0)(pinyin_ids),) # pinyin_ids + labels = (Stack()(labels),) # labels + + batch["input_ids"] = input_ids[0] + batch["pinyin_ids"] = pinyin_ids[0] + batch["labels"] = labels[0] + + return batch + + +def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + data_dir = data_args.data_path + train_path = os.path.join(data_dir, "train.tsv") + dev_path = os.path.join(data_dir, "dev.tsv") + test_path = os.path.join(data_dir, "test.tsv") + + train_ds, dev_ds, test_ds = load_ds_xnli(datafiles=[train_path, dev_path, test_path]) + + model = ChineseBertForSequenceClassification.from_pretrained("ChineseBERT-large", num_classes=3) + tokenizer = ChineseBertTokenizer.from_pretrained("ChineseBERT-large") + + print(" | load pretrained model state sucessfully.") + + # Process the data into a data format that the model can read in. 
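+ # Each example is encoded as a sentence pair; convert_example returns
+ # (input_ids, pinyin_ids, label) and asserts the encoded pair fits within max_seq_length.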
+ trans_func = partial(convert_example, tokenizer=tokenizer, max_length=model_args.max_seq_length) + train_ds = train_ds.map(trans_func, lazy=False) + dev_ds = dev_ds.map(trans_func, lazy=False) + test_ds = test_ds.map(trans_func, lazy=False) + + # Form data into batch data, such as padding text sequences of different lengths into the maximum length of batch data, + # and stack each data label together + batchify_fn = DataCollator(tokenizer) + criterion = paddle.nn.loss.CrossEntropyLoss() + + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + criterion=criterion, + compute_metrics=compute_metrics, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/chinesebert/utils.py b/examples/language_model/chinesebert/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a58fc7edb21463646e9789620d551570ec3ab539 --- /dev/null +++ b/examples/language_model/chinesebert/utils.py @@ -0,0 +1,235 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
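+# Shared helpers for the ChineseBERT examples: learning-rate scheduler construction
+# (linear/cosine/poly with warmup), a SQuAD-style loss that averages the start- and
+# end-position cross entropies, JSON/pickle I/O utilities, and TSV dataset loaders
+# for the ChnSentiCorp and XNLI tasks.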
+ +import json +import pickle +import random +from collections import OrderedDict + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "poly": PolyDecayWithWarmup, +} + + +def get_layer_lr_radios(layer_decay=0.8, n_layers=12): + """Have lower learning rates for layers closer to the input.""" + key_to_depths = OrderedDict( + { + "mpnet.embeddings.": 0, + "mpnet.encoder.relative_attention_bias.": 0, + "qa_outputs.": n_layers + 2, + } + ) + for layer in range(n_layers): + key_to_depths[f"mpnet.encoder.layer.{str(layer)}."] = layer + 1 + return {key: (layer_decay ** (n_layers + 2 - depth)) for key, depth in key_to_depths.items()} + + +def set_seed(seed): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def get_writer(args): + if args.writer_type == "visualdl": + from visualdl import LogWriter + + writer = LogWriter(logdir=args.logdir) + elif args.writer_type == "tensorboard": + from tensorboardX import SummaryWriter + + writer = SummaryWriter(logdir=args.logdir) + else: + raise ValueError("writer_type must be in ['visualdl', 'tensorboard']") + return writer + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def save_json(data, file_name): + with open(file_name, "w", encoding="utf-8") as w: + w.write(json.dumps(data, ensure_ascii=False, indent=4) + "\n") + + +class CrossEntropyLossForSQuAD(nn.Layer): + def forward(self, logits, labels): + start_logits, end_logits = logits + start_position, end_position = labels + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = F.cross_entropy(input=start_logits, label=start_position) + end_loss = F.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + + return loss + + +def save_pickle(data, file_path): + with open(str(file_path), "wb") as f: + pickle.dump(data, f) + + +def load_pickle(input_file): + with open(str(input_file), "rb") as f: + data = pickle.load(f) + return data + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn, lazy=False) + + # shuffle = True if mode == 'train' else False + shuffle = False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +def convert_example(example, tokenizer, is_test=False): + 
""" + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + input_ids = tokenizer.encode(example["text"]) + input_ids = np.array(input_ids, dtype="int64") + + if not is_test: + label = np.array(example["label"], dtype="int64") + return input_ids, label + else: + return input_ids + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu)) + model.train() + metric.reset() + return accu + + +def load_ds(datafiles): + """ + intput: + datafiles -- str or list[str] -- the path of train or dev sets + split_train -- Boolean -- split from train or not + dev_size -- int -- split how much data from train + + output: + MapDataset + """ + + def read(ds_file): + with open(ds_file, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + data = line[:-1].split("\t") + if len(data) == 2: + yield ({"text": data[1], "label": int(data[0])}) + elif len(data) == 3: + yield ({"text": data[2], "label": int(data[1])}) + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] + + +def load_ds_xnli(datafiles): + def read(ds_file): + with open(ds_file, "r", encoding="utf-8") as fp: + # next(fp) # Skip header + for line in fp.readlines(): + data = line.strip().split("\t", 2) + first, second, third = data + yield ({"sentence1": first, "sentence2": second, "label": third}) + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] diff --git a/examples/language_model/convbert/README.md b/examples/language_model/convbert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b7014db2abdf8029f62b04272e4e87f461cae893 --- /dev/null +++ b/examples/language_model/convbert/README.md @@ -0,0 +1,99 @@ +# ConvBert with PaddleNLP + +[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) + +**摘要:** +像BERT及其变体这样的预训练语言模型最近在各种自然语言理解任务中取得了令人印象深刻的表现。然而,BERT严重依赖全局自注意力块,因此需要大量内存占用和计算成本。 
+虽然它的所有注意力头从全局角度查询整个输入序列以生成注意力图,但我们观察到一些头只需要学习局部依赖,这意味着存在计算冗余。 +因此,我们提出了一种新颖的基于跨度的动态卷积来代替这些自注意力头,以直接对局部依赖性进行建模。新的卷积头与其余的自注意力头一起形成了一个新的混合注意力块,在全局和局部上下文学习中都更有效。 +我们为 BERT 配备了这种混合注意力设计并构建了一个ConvBERT模型。实验表明,ConvBERT 在各种下游任务中明显优于BERT及其变体,具有更低的训练成本和更少的模型参数。 +值得注意的是,ConvBERT-base 模型达到86.4GLUE分数,比ELECTRA-base高0.7,同时使用不到1/4的训练成本。 + +本项目是 ConvBert 在 Paddle 2.x上的开源实现。 + +## **数据准备** + +### Fine-tuning数据 +Fine-tuning 使用GLUE数据,这部分Paddle已提供,在执行Fine-tuning 命令时会自动下载并加载 + + +## **模型预训练** +模型预训练过程可参考[Electra的README](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/electra/README.md) + +## **Fine-tuning** + +### 运行Fine-tuning + +#### **使用Paddle提供的预训练模型运行 Fine-tuning** + +以 GLUE/SST-2 任务为例,启动 Fine-tuning 的方式如下: +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u examples/language_model/convbert/run_glue.py \ + --model_type convbert \ + --model_name_or_path convbert-small \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 256 \ + --learning_rate 1e-4 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./glue/$TASK_NAME/ \ + --device gpu +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持BERT、ELECTRA、ERNIE、CONVBERT模型。 +- `model_name_or_path` 模型名称或者路径,其中convbert模型当前仅支持convbert-small、convbert-medium-small、convbert-base几种规格。 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU、NPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +Fine-tuning过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下格式的日志: + +``` +global step 100/792, epoch: 0, batch: 99, rank_id: 0, loss: 0.333723, lr: 0.0000970547, speed: 3.6162 step/s +eval loss: 0.295912, acc: 0.8623853211009175, eval done total : 0.5295147895812988 s +global step 200/792, epoch: 0, batch: 199, rank_id: 0, loss: 0.243273, lr: 0.0000830295, speed: 3.6822 step/s +eval loss: 0.249330, acc: 0.8899082568807339, eval done total : 0.508596658706665 s +global step 300/792, epoch: 1, batch: 35, rank_id: 0, loss: 0.166950, lr: 0.0000690042, speed: 3.7250 step/s +eval loss: 0.307219, acc: 0.8956422018348624, eval done total : 0.5816614627838135 s +global step 400/792, epoch: 1, batch: 135, rank_id: 0, loss: 0.185729, lr: 0.0000549790, speed: 3.6896 step/s +eval loss: 0.201950, acc: 0.9025229357798165, eval done total : 0.5364704132080078 s +global step 500/792, epoch: 1, batch: 235, rank_id: 0, loss: 0.132817, lr: 0.0000409537, speed: 3.7708 step/s +eval loss: 0.239518, acc: 0.9094036697247706, eval done total : 0.5128316879272461 s +global step 600/792, epoch: 2, batch: 71, rank_id: 0, loss: 0.163107, lr: 0.0000269285, speed: 3.7303 step/s +eval loss: 0.199408, acc: 0.9139908256880734, eval done total : 0.5226929187774658 s +global step 700/792, epoch: 2, batch: 171, rank_id: 0, loss: 0.082950, lr: 0.0000129032, speed: 3.7664 step/s +eval loss: 0.236055, acc: 0.9025229357798165, eval done total : 0.5140993595123291 s +global step 792/792, epoch: 2, batch: 263, rank_id: 0, loss: 0.025735, lr: 0.0000000000, speed: 4.1180 step/s +eval loss: 0.226449, acc: 0.9013761467889908, eval done total : 0.5103530883789062 s +``` + +使用convbert-small预训练模型进行单卡Fine-tuning ,在验证集上有如下结果(这里各类任务的结果是运行1次的结果): + +| Task | Metric | Result | 
+|-------|------------------------------|-------------| +| CoLA | Matthews corr | 56.22 | +| SST-2 | acc. | 91.39 | +| MRPC | acc./F1 | 87.70 | +| STS-B | Pearson/Spearman corr | 86.34 | +| QQP | acc./F1 | 85.47 | +| MNLI | matched acc./mismatched acc. | 81.87 | +| QNLI | acc. | 87.71 | +| RTE | acc. | 66.06 | + +注:acc.是Accuracy的简称,表中Metric字段名词取自[GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7) + + + +## Reference +[Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. ConvBERT: Improving BERT with Span-based Dynamic Convolution. In NeurIPS 2020](https://arxiv.org/abs/2008.02496) diff --git a/examples/language_model/convbert/convert.py b/examples/language_model/convbert/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..96208861028aa10bd69184aeb70b78922f54281d --- /dev/null +++ b/examples/language_model/convbert/convert.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict +import argparse + +huggingface_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query.": "self_attn.q_proj.", + "attention.self.key.": "self_attn.k_proj.", + "attention.self.value.": "self_attn.v_proj.", + "attention.output.dense.": "self_attn.out_proj.", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "attention.self.key_conv_attn_layer": "self_attn.key_conv_attn_layer", + "attention.self.conv_kernel_layer": "self_attn.conv_kernel_layer", + "attention.self.conv_out_layer": "self_attn.conv_out_layer", +} + +skip_weights = ["embeddings.position_ids"] +dont_transpose = ["attention.self.key_conv_attn_layer", "_embeddings.weight", "LayerNorm."] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + if k in skip_weights: + continue + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + if "self.key_conv_attn_layer.bias" in k: + v = v.squeeze(-1) + + oldk = k + for huggingface_name, paddle_name in huggingface_to_paddle.items(): + k = k.replace(huggingface_name, paddle_name) + + print(f"Converting: {oldk} => {k}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="./conv-bert-base/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="./convbert-base/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle 
model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/convbert/run_glue.py b/examples/language_model/convbert/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..559cec890c23c8b1f715fb5cd2ddb6484ba579fb --- /dev/null +++ b/examples/language_model/convbert/run_glue.py @@ -0,0 +1,372 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ConvBertForSequenceClassification, + ConvBertTokenizer, + ElectraForSequenceClassification, + ElectraTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "electra": (ElectraForSequenceClassification, ElectraTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), + "convbert": (ConvBertForSequenceClassification, ConvBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/npu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + 
return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0 or global_step == num_training_steps: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + 
"""print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/examples/language_model/convbert/run_pretrain.py b/examples/language_model/convbert/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..7ebbe898da0aa86002343a1207f4b37372945e54 --- /dev/null +++ b/examples/language_model/convbert/run_pretrain.py @@ -0,0 +1,507 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import io +import logging +import os +import random +import time + +import numpy as np +import paddle + +from paddlenlp.transformers import ( + ConvBertForTotalPretraining, + ConvBertGenerator, + ConvBertPretrainingCriterion, + ConvBertTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "convbert": (ConvBertForTotalPretraining, ConvBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default="convbert", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="convbert-small", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--max_seq_length", default=128, type=int, help="max length of each sequence") + parser.add_argument("--mask_prob", default=0.15, type=float, help="the probability of one word to be mask") + parser.add_argument( + "--train_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon 
for Adam optimizer.") + parser.add_argument( + "--num_train_epochs", + default=4, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--use_amp", action="store_true", help="Whether to use float16(Automatic Mixed Precision) to train." + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument("--eager_run", type=bool, default=True, help="Use dygraph mode.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +class BookCorpus(paddle.io.Dataset): + """ + https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html + Args: + data_path (:obj:`str`) : The dataset file path, which contains train.tsv, dev.tsv and test.tsv. + tokenizer (:obj:`class PretrainedTokenizer`) : The tokenizer to split word and convert word to id. + max_seq_length (:obj:`int`) : max length for each sequence. + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train, test or dev). + """ + + def __init__( + self, + data_path, + tokenizer, + max_seq_length, + mode="train", + ): + if mode == "train": + data_file = "train.data" + elif mode == "test": + data_file = "test.data" + else: + data_file = "dev.data" + + self.data_file = os.path.join(data_path, data_file) + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.raw_examples = self._read_file(self.data_file) + + def _read_file(self, input_file): + """ + Reads a text file. + + Args: + input_file (:obj:`str`) : The file to be read. + + Returns: + examples (:obj:`list`): All the input data. 
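+
+        Raises:
+            RuntimeError: If `input_file` does not exist.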
+ """ + if not os.path.exists(input_file): + raise RuntimeError("The file {} is not found.".format(input_file)) + else: + with io.open(input_file, "r", encoding="UTF-8") as f: + examples = [] + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + example = self.tokenizer(line, max_seq_len=self.max_seq_length)["input_ids"] + examples.append(example) + else: + break + return examples + + def truncation_ids(self, ids, max_seq_length): + if len(ids) <= (max_seq_length - 2): + return ids + else: + return ids[: (max_seq_length - 2)] + + def __getitem__(self, idx): + return self.raw_examples[idx] + + def __len__(self): + return len(self.raw_examples) + + +class DataCollatorForConvBert(object): + """ + pads, gets batch of tensors and preprocesses batches for masked language modeling + when dataloader num_worker > 0, this collator may trigger some bugs, for safe, be sure dataloader num_worker=0 + """ + + def __init__(self, tokenizer, max_seq_length, mlm=True, mlm_probability=0.15): + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.mlm = True + self.mlm_probability = mlm_probability + + def __call__(self, examples): + if self.mlm: + inputs, raw_inputs, labels = self.mask_tokens(examples) + return inputs, raw_inputs, labels + else: + raw_inputs, _ = self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + inputs = raw_inputs.clone().detach() + labels = raw_inputs.clone().detach() + if self.tokenizer.pad_token is not None: + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + labels[labels == pad_token_id] = -100 + return inputs, raw_inputs, labels + + def tensorize_batch(self, examples, dtype): + if isinstance(examples[0], (list, tuple)): + examples = [paddle.to_tensor(e, dtype=dtype) for e in examples] + length_of_first = examples[0].shape[0] + are_tensors_same_length = all(x.shape[0] == length_of_first for x in examples) + if are_tensors_same_length: + return paddle.stack(examples, axis=0) + else: + raise ValueError("the tensor in examples not have same shape, please check input examples") + + def add_special_tokens_and_set_maskprob(self, inputs, truncation, max_seq_length): + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + full_inputs = [] + full_maskprob = [] + max_length = 0 + for ids in inputs: + if len(ids) > max_length: + max_length = len(ids) + max_length = min(max_length, max_seq_length) + + for ids in inputs: + if len(ids) <= max_length: + padding_num = max_length - len(ids) + full_inputs.append(ids + ([pad_token_id] * padding_num)) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0] + ([0] * padding_num)) + else: + if truncation: + full_inputs.append(ids[:max_length]) + full_maskprob.append([0] + ([self.mlm_probability] * (max_length - 2)) + [0]) + else: + full_inputs.append(ids) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0]) + return full_inputs, full_maskprob + + def mask_tokens(self, examples): + if self.tokenizer.mask_token is None: + raise ValueError("the tokenizer does not have mask_token, please check!") + mask_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token) + + raw_inputs, probability_matrix = self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + probability_matrix = self.tensorize_batch(probability_matrix, "float32") + 
inputs = raw_inputs.clone() + labels = raw_inputs.clone() + + total_indices = paddle.bernoulli(probability_matrix).astype("bool").numpy() + labels[~total_indices] = -100 + + # 80% MASK + indices_mask = paddle.bernoulli(paddle.full(labels.shape, 0.8)).astype("bool").numpy() & total_indices + inputs[indices_mask] = mask_token_id + + # 10% Random + indices_random = ( + paddle.bernoulli(paddle.full(labels.shape, 0.5)).astype("bool").numpy() & total_indices & ~indices_mask + ) + random_words = paddle.randint(low=0, high=self.tokenizer.vocab_size, shape=labels.shape, dtype="int64") + inputs = paddle.where(paddle.to_tensor(indices_random), random_words, inputs) + + # 10% Original + return inputs, raw_inputs, labels + + +def create_dataloader(dataset, mode="train", batch_size=1, use_gpu=True, data_collator=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): + Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): + If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): + The sample number of a mini-batch. + use_gpu(obj:`bool`, optional, defaults to obj:`True`): + Whether to use gpu to run. + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + + if mode == "train" and use_gpu: + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + else: + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + + return dataloader + + +def do_train(args): + paddle.enable_static() if not args.eager_run else None + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + WorkerInitObj(args.seed + paddle.distributed.get_rank()) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Loads or initializes a model. + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if args.model_name_or_path in pretrained_models_list: + config = model_class.config_class.from_pretrained(args.model_name_or_path) + model = model_class(config) + else: + model = model_class.from_pretrained(args.model_name_or_path) + + criterion = ConvBertPretrainingCriterion( + getattr(model.generator, ConvBertGenerator.base_model_prefix).config.vocab_size, + model.gen_weight, + model.disc_weight, + ) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + # Loads dataset. + tic_load_data = time.time() + print("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + train_dataset = BookCorpus( + data_path=args.input_dir, tokenizer=tokenizer, max_seq_length=args.max_seq_length, mode="train" + ) + print("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. 
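+    # The collator below applies dynamic masking to every batch: each non-special,
+    # non-padding token is selected with probability `mask_prob`; of the selected
+    # tokens, roughly 80% are replaced with the mask token, 10% with a random
+    # vocabulary token and 10% are left unchanged. It returns the masked input ids,
+    # the original (raw) input ids and the generator labels, with unselected
+    # positions in the labels set to -100.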
+ data_collator = DataCollatorForConvBert( + tokenizer=tokenizer, max_seq_length=args.max_seq_length, mlm=True, mlm_probability=args.mask_prob + ) + + train_data_loader = create_dataloader( + train_dataset, + batch_size=args.train_batch_size, + mode="train", + use_gpu=True if args.device in "gpu" else False, + data_collator=data_collator, + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + + print("start train : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + trained_global_step = global_step = 0 + t_loss = paddle.to_tensor([0.0]) + log_loss = paddle.to_tensor([0.0]) + loss_list = [] + log_list = [] + tic_train = time.time() + + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + if trained_global_step > 0: + trained_global_step -= 1 + continue + global_step += 1 + input_ids, raw_input_ids, generator_labels = batch + if args.use_amp: + with paddle.amp.auto_cast(): + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=generator_labels + ) + loss = criterion(gen_logits, disc_logits, generator_labels, disc_labels, attention_mask) + scaled = scaler.scale(loss) + scaled.backward() + t_loss += loss.detach() + scaler.minimize(optimizer, scaled) + else: + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=generator_labels + ) + loss = criterion(gen_logits, disc_logits, generator_labels, disc_labels, attention_mask) + loss.backward() + t_loss += loss.detach() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + local_loss = (t_loss - log_loss) / args.logging_steps + if paddle.distributed.get_world_size() > 1: + paddle.distributed.all_gather(loss_list, local_loss) + if paddle.distributed.get_rank() == 0: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float((paddle.stack(loss_list).sum() / len(loss_list)).numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + loss_list = [] + else: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float(local_loss.numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + log_loss = t_loss + tic_train = time.time() + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir 
= os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if len(log_list) > 0: + with open(os.path.join(output_dir, "train.log"), "w") as f: + for log in log_list: + if len(log.strip()) > 0: + f.write(log.strip() + "\n") + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/examples/language_model/elmo/README.md b/examples/language_model/elmo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bccce85996b855464440c99f263bc125096df78b --- /dev/null +++ b/examples/language_model/elmo/README.md @@ -0,0 +1,142 @@ +# ELMo + +## 模型简介 + +ELMo(Embeddings from Language Models)是重要的通用语义表示模型之一,以双向LSTM为网络基本组件,以Language Model为训练目标,通过预训练得到通用的语义表示,ELMo能够学习到复杂的特征,比如语法、语义,并且能够学习在不同上下文情况下的词汇多义性。将ELMo得到的语义表示作为Feature迁移到下游NLP任务中,会显著提升下游任务的模型性能,比如问答、文本蕴含和情感分析等。ELMo模型的细节可以[参阅论文](https://arxiv.org/abs/1802.05365)。 + +本项目是ELMo在Paddle上的开源实现, 基于1 Billion Word Language Model Benchmark进行预训练,并接入了简单的下游任务作为示例程序。 + +接入的下游任务是在sentence polarity dataset v1数据集上构建的文本二分类任务,采用ELMo + BoW的简单网络结构。与base模型(Word2Vec + BoW)进行精度对比。 + +| 模型 | test acc | +| ---- | -------- | +| word2vec + BoW | 0.7769 | +| ELMo + BoW | 0.7760 | + +## 环境依赖 + +- sklearn +- gensim + +安装方式:`pip install sklearn gensim` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. +├── args.py # 运行参数配置文件 +├── dataset.py # 数据读取 +├── elmo.py # 模型组网 +├── run_pretrain.py # 训练模型主程序入口 +├── run_eval.py # 评估模型主程序入口 +├── word2vec_base.py # 下游二分类任务base模型训练测试主程序入口 +├── run_finetune.py # 下游二分类任务训练测试主程序入口 +├── download_data.sh # 数据下载脚本 +└── README.md # 文档说明 +``` + +### 数据准备 + +运行下载数据的脚本后,会生成两个文件,1-billion-word目录下会存在训练数据目录(training-tokenized-shuffled)、测试集数据(heldout-tokenized-shuffled)以及对应的词典(vocab-15w.txt),sentence-polarity-dataset-v1目录下会存在未切分的正向样本(rt-polarity.pos)、负向样本(rt-polarity.neg)以及Google预训练好的Word2Vec向量文件GoogleNews-vectors-negative300.bin.gz。 + +```shell +sh download_data.sh +``` + +1-billion-word目录结构: + +```text +. +├── training-tokenized-shuffled # 训练集 +├── heldout-tokenized-shuffled # 测试集 +└── vocab-15w.txt # 词典 +``` + +sentence-polarity-dataset-v1目录结构: + +```text +. 
+├── rt-polarity.pos # 正向样本 +├── rt-polarity.neg # 负向样本 +└── GoogleNews-vectors-negative300.bin.gz # 预训练好的Word2Vec向量 +``` + +### 模型训练 + +基于1-billion-word数据集,可以运行下面的命令,在训练集上进行模型训练 +```shell +# GPU启动, 支持单卡和多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' run_pretrain.py --train_data_path='./1-billion-word/training-tokenized-shuffled/*' --vocab_file='./1-billion-word/vocab-15w.txt' --save_dir='./checkpoints' --device='gpu' +``` + +其他可选参数和参数的默认值请参考`args.py`。 + +程序运行时将会自动开始训练,同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 10000.pdopt +├── 10000.pdparams +├── 20000.pdopt +├── 20000.pdparams +├── ... +├── final.pdopt +└── final.pdparams +``` + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/10000`即可,程序会自动加载模型参数`checkpoints/10000.pdparams`,也会自动加载优化器状态`checkpoints/10000.pdopt`。 + +### 模型评估 + +基于1-billion-word数据集,可以运行下面的命令,在评测集上进行模型评估 +```shell +# GPU启动,仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python run_eval.py --dev_data_path='./1-billion-word/heldout-tokenized-shuffled/*' --vocab_file='./1-billion-word/vocab-15w.txt' --init_from_ckpt='./checkpoints/10000' --device='gpu' +``` + +### 下游任务 + +下游任务是基于sentence polarity dataset v1数据集的二分类任务,base模型采用Word2Vec + BoW的模型结构,其中Word2Vec采用Google预训练好的GoogleNews-vectors-negative300.bin.gz。 + +#### base模型 + +base模型可以运行下面的命令,在训练集上进行模型训练评估 +```shell +# GPU启动, 支持单卡和多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' word2vec_base.py --data_dir='./sentence-polarity-dataset-v1/' --pretrained_word2vec_file='./sentence-polarity-dataset-v1/GoogleNews-vectors-negative300.bin' --device='gpu' +``` + +#### ELMo finetune + +ELMo finetune可以运行下面的命令,在训练集上进行模型训练评估 +```shell +# GPU启动, 支持单卡和多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' run_finetune.py --data_dir='./sentence-polarity-dataset-v1/' --init_from_ckpt='./checkpoints/10000' --device='gpu' +``` + +**NOTE:** 可以通过构建模型时的trainable参数设置ELMo参与或不参与下游任务的训练。ELMo接入下游任务的具体用法请参考`run_finetune.py`。 + +另外,预训练的ELMo也可以作为文本词向量编码器单独使用,即输入文本内容,输出每个词对应的词向量。用法示例如下: + +```python +from elmo import ELMoEmbedder + +embedder = ELMoEmbedder(params_file) +sentences = [['The', 'first', 'sentence', '.'], ['Second', 'one', '.']] + +embeddings = embedder.encode(sentences) +for i, (text, emb) in enumerate(zip(sentences, embeddings)): + print(text) + print(emb.shape) + print() +``` + +## Reference + +- [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) diff --git a/examples/language_model/elmo/args.py b/examples/language_model/elmo/args.py new file mode 100644 index 0000000000000000000000000000000000000000..6deaf83d53b977a84723bfe433aa1887f00b87c4 --- /dev/null +++ b/examples/language_model/elmo/args.py @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
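+
+# Command-line argument definitions for the ELMo example scripts; see README.md in
+# this directory for example launch commands.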
+ +import argparse + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--train_data_path", type=str, default="./1-billion-word/training-tokenized-shuffled/*", help="Specify the path to load train data.") + parser.add_argument("--dev_data_path", type=str, default="./1-billion-word/heldout-tokenized-shuffled/*", help="Specify the path to load dev data.") + parser.add_argument("--vocab_file", type=str, default="./1-billion-word/vocab-15w.txt", help="Specify the path to load vocab file.") + parser.add_argument("--save_dir", type=str, default="./checkpoint/", help="Specify the path to save the checkpoints.") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument("--save_freq", type=int, default=100, help="The frequency, in number of steps, to save checkpoint. (default: %(default)d)") + parser.add_argument("--log_freq", type=int, default=100, help="The frequency, in number of steps, the training logs are printed. (default: %(default)d)") + parser.add_argument("--epochs", type=int, default=10, help="Total number of training epochs to perform.") + parser.add_argument("--batch_size", type=int, default=128, help="Batch size per GPU/CPU for training.") + parser.add_argument("--dropout", type=float, default=0.1, help="The dropout rate.") + parser.add_argument("--lr", type=float, default=0.2, help="The initial learning rate.") + parser.add_argument("--seed", type=int, default=2020, help="Random seed.") + parser.add_argument("--max_grad_norm", type=float, default=10.0, help='The max grad norm.') + parser.add_argument("--max_characters_per_token", type=int, default=50, help="The maximum characters number of token in sequence. (default: %(default)d)") + parser.add_argument("--unroll_steps", type=int, default=20, help="The sentence length after re-cutting in dataset. (default: %(default)d)") + parser.add_argument("--char_embed_dim", type=int, default=16, help="The dimension of char_embedding table. (default: %(default)d)") + parser.add_argument("--projection_dim", type=int, default=512, help="The size of rnn hidden unit. (default: %(default)d)") + parser.add_argument("--num_layers", type=int, default=2, help="The num of rnn layers. (default: %(default)d)") + parser.add_argument("--num_highways", type=int, default=2, help="The num of highways in CharEncoder. (default: %(default)d)") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") diff --git a/examples/language_model/elmo/dataset.py b/examples/language_model/elmo/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..adacebe7041e0d9b3e24abda22cb4c86347c23b1 --- /dev/null +++ b/examples/language_model/elmo/dataset.py @@ -0,0 +1,443 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import glob +import random +from copy import deepcopy +from typing import List + +import numpy as np +import paddle +from paddle.io import IterableDataset + + +class Vocabulary(object): + """ + A token vocabulary. Holds a map from token to ids and provides a method for + encoding text to a sequence of ids. + + Parameters: + filename (str): The vocabulary file. It is a flat text file with + one (normalized) token per line. + """ + + def __init__(self, filename): + self._word_to_id = {} + for word in ["UNK", "", "", ""]: + self._word_to_id[word] = len(self._word_to_id) + with open(filename, "r") as fin: + for line in fin: + word = line.strip() + if word in self._word_to_id: + raise ValueError("There has repeated token in the vocabulary file: %s" % word) + self._word_to_id[word] = len(self._word_to_id) + + @property + def bos(self): + return self._word_to_id[""] + + @property + def eos(self): + return self._word_to_id[""] + + @property + def unk(self): + return self._word_to_id["UNK"] + + @property + def pad(self): + return self._word_to_id[""] + + @property + def size(self): + return len(self._word_to_id) + + def word_to_id(self, word): + if word in self._word_to_id: + return self._word_to_id[word] + return self.unk + + def encode(self, sentence, split=True): + """ + Convert a sentence to a list of ids, with special tokens added. + Sentence is a single string with tokens separated by whitespace. + """ + if split: + word_ids = [self.word_to_id(cur_word) for cur_word in sentence.split()] + else: + word_ids = [self.word_to_id(cur_word) for cur_word in sentence] + + word_ids = [self.bos] + word_ids + [self.eos] + word_ids_reverse = deepcopy(word_ids) + word_ids_reverse.reverse() + return np.array(word_ids, dtype=np.int64), np.array(word_ids_reverse, dtype=np.int64) + + +class UnicodeCharsVocabulary(Vocabulary): + """ + Vocabulary containing character-level and word level information. + + Has a word vocabulary that is used to lookup word ids and a character id + that is used to map words to arrays of character ids. + + The character ids are defined by ord(c) for c in word.encode('utf-8'). + This limits the total number of possible char ids to 256. + To this we add 5 additional special ids: begin sentence, end sentence, + begin word, end word and char padding. + + Parameters: + filename (str): The vocabulary file. It is a flat text file with + one (normalized) token per line. + max_word_length (int): The maximum characters number of token in sequence. 
+ """ + + def __init__(self, filename, max_word_length, **kwargs): + super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs) + self._max_word_length = max_word_length + + self.bos_char = 256 # + self.eos_char = 257 # + self.bow_char = 258 # + self.eow_char = 259 # + self.pad_char = 260 # + + self._word_char_ids = {} + + # the charcter representation of the begin/end of sentence characters + def _make_bos_eos(c): + r = np.zeros([self.max_word_length], dtype=np.int64) + r[:] = self.pad_char + r[0] = self.bow_char + r[1] = c + r[2] = self.eow_char + return r + + self.bos_chars = _make_bos_eos(self.bos_char) + self.eos_chars = _make_bos_eos(self.eos_char) + + for word in self._word_to_id: + self._word_char_ids[word] = self._convert_word_to_char_ids(word) + + self._word_char_ids[""] = self.bos_chars + self._word_char_ids[""] = self.eos_chars + + @property + def char_size(self): + # char ids 0-255 come from utf-8 encoding bytes. + # assign 256-300 to special chars. + # all +1, the id=0 is for token padding and mask. + return 262 + + @property + def max_word_length(self): + return self._max_word_length + + def _convert_word_to_char_ids(self, word): + code = np.zeros([self.max_word_length], dtype=np.int64) + code[:] = self.pad_char + + word_encoded = word.encode("utf-8", "ignore")[: (self.max_word_length - 2)] + code[0] = self.bow_char + for k, chr_id in enumerate(word_encoded, start=1): + code[k] = chr_id + code[len(word_encoded) + 1] = self.eow_char + + return code + + def word_to_char_ids(self, word): + if word in self._word_to_id: + return self._word_char_ids[word] + else: + return self._convert_word_to_char_ids(word) + + def encode_chars(self, sentence, split=True): + """ + Encode the sentence as a white space delimited string of tokens. + """ + if split: + chars_ids = [self.word_to_char_ids(cur_word) for cur_word in sentence.split()] + else: + chars_ids = [self.word_to_char_ids(cur_word) for cur_word in sentence] + + chars_ids = [self.bos_chars] + chars_ids + [self.eos_chars] + chars_ids_reverse = deepcopy(chars_ids) + chars_ids_reverse.reverse() + + # +1 for token padding and mask + chars_ids = np.vstack(chars_ids) + 1 + chars_ids_reverse = np.vstack(chars_ids_reverse) + 1 + return chars_ids, chars_ids_reverse + + +class CharsVocabulary(object): + def __init__(self, max_word_length): + self._max_word_length = max_word_length + + self.bos_char = 256 # + self.eos_char = 257 # + self.bow_char = 258 # + self.eow_char = 259 # + self.pad_char = 260 # + + # the charcter representation of the begin/end of sentence characters + def _make_bos_eos(c): + r = np.zeros([self.max_word_length], dtype=np.int64) + r[:] = self.pad_char + r[0] = self.bow_char + r[1] = c + r[2] = self.eow_char + return r + + self.bos_chars = _make_bos_eos(self.bos_char) + self.eos_chars = _make_bos_eos(self.eos_char) + + @property + def char_size(self): + # char ids 0-255 come from utf-8 encoding bytes. + # assign 256-300 to special chars. + # all +1, the id=0 is for token padding and mask. 
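+        # In total: 256 byte values + 5 special markers (bos/eos/bow/eow/pad) give
+        # ids 0-260; the +1 shift reserves id 0, so the character vocabulary size is 262.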
+ return 262 + + @property + def max_word_length(self): + return self._max_word_length + + def convert_word_to_char_ids(self, word): + code = np.zeros([self.max_word_length], dtype=np.int64) + code[:] = self.pad_char + + word_encoded = word.encode("utf-8", "ignore")[: (self.max_word_length - 2)] + code[0] = self.bow_char + for k, chr_id in enumerate(word_encoded, start=1): + code[k] = chr_id + code[len(word_encoded) + 1] = self.eow_char + + return code + + def encode_chars(self, sentence, split=True): + """ + Encode the sentence as a white space delimited string of tokens. + """ + if split: + chars_ids = [self.convert_word_to_char_ids(cur_word) for cur_word in sentence.split()] + else: + chars_ids = [self.convert_word_to_char_ids(cur_word) for cur_word in sentence] + + chars_ids = [self.bos_chars] + chars_ids + [self.eos_chars] + chars_ids_reverse = deepcopy(chars_ids) + chars_ids_reverse.reverse() + + # +1 for token padding and mask + chars_ids = np.vstack(chars_ids) + 1 + chars_ids_reverse = np.vstack(chars_ids_reverse) + 1 + return chars_ids, chars_ids_reverse + + +def load_vocab(vocab_file=None, max_word_length=50): + if vocab_file is None: + return CharsVocabulary(max_word_length) + elif max_word_length: + return UnicodeCharsVocabulary(vocab_file, max_word_length) + else: + return Vocabulary(vocab_file) + + +class OneBillionWordDataset(IterableDataset): + """ + Hold the one billion word dataset, consisting of 1B Words which is used for + benchmarking of Language Modeling. The training/held-out data was produced + from the WMT 2011 News Crawl data. + + The dataset is a list of tokenized files. Each file contains one sentence + per line. Each sentence is pre-tokenized and white space joined. + + Parameters: + filepattern (str): A glob string that specifies the list of files. + vocab (Vocabulary): An instance of Vocabulary or UnicodeCharsVocabulary. + batch_size (int): The batch_size. + num_steps (int): The sentence length after re-cutting in dataset. + n_procs (int): The number of GPUs. + mode (str, optional): The dataset mode. It can be "train" and "test". + When "test", the dataset iterate through all data once then stop. + When "train", it will iterate forever. Default: "test". + shuffle (bool, optional): Whether shuffle the data. Default: False. + seed (int, optional): The random seed. Default: 0. 
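+        rank (int, optional): The rank of the current process among `n_procs`
+            processes; used to select this process's shard of each batch. Default: 0.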
+ """ + + def __init__( + self, filepattern, vocab, batch_size, num_steps, n_procs=1, rank=0, mode="test", shuffle=False, seed=0 + ): + super(OneBillionWordDataset, self).__init__() + + self._all_files = glob.glob(filepattern) + print("\nFound %d files at %s\n" % (len(self._all_files), filepattern)) + self._vocab = vocab + self._max_word_length = vocab.max_word_length + self._use_char_inputs = hasattr(vocab, "encode_chars") + self._batch_size = batch_size + self._num_steps = num_steps + self._n_procs = n_procs + self._rank = rank + self._mode = mode + self._shuffle = shuffle + self._seed = abs(seed) + self._file_seed = self._get_file_random_seed() + + def _get_file_random_seed(self): + file_seed = {} + np.random.seed(self._seed) + seed_list = list(np.random.random(len(self._all_files))) + for file_path, seed in zip(list(self._all_files), seed_list): + file_seed[file_path] = seed + return file_seed + + def _load_file(self, file_path): + print("\nLoading data from: %s\n" % file_path) + with open(file_path) as f: + sentences_raw = f.readlines() + sentences = sentences_raw + + if self._shuffle: + if self._n_procs > 1: + seed = self._file_seed[file_path] * self._seed + random.seed(seed) + random.shuffle(sentences) + + for sentence in sentences: + ids, ids_reverse = self._vocab.encode(sentence) + if self._use_char_inputs: + char_ids, char_ids_reverse = self._vocab.encode_chars(sentence) + else: + char_ids, char_ids_reverse = None, None + yield (ids, char_ids, ids_reverse, char_ids_reverse) + + def get_sentence(self): + while True: + self._seed += 1 + all_files = list(self._all_files) + if self._shuffle: + if self._n_procs > 1: + random.seed(self._seed) + random.shuffle(all_files) + for file_path in all_files: + for ret in self._load_file(file_path): + yield ret + if self._mode == "test": + break + + @property + def number_of_tokens(self): + # number of tokens in training data (1B Word Benchmark) + return 768648884 + + def __iter__(self): + sentence_generator = self.get_sentence() + n_batch_size = self._batch_size * self._n_procs + cur_stream = [None] * n_batch_size + + while True: + inputs = np.zeros([n_batch_size, self._num_steps], np.int64) + inputs_reverse = np.zeros([n_batch_size, self._num_steps], np.int64) + if self._max_word_length is not None: + char_inputs = np.zeros([n_batch_size, self._num_steps, self._max_word_length], np.int64) + char_inputs_reverse = np.zeros([n_batch_size, self._num_steps, self._max_word_length], np.int64) + else: + char_inputs = None + char_inputs_reverse = None + targets = np.zeros([n_batch_size, self._num_steps], np.int64) + targets_reverse = np.zeros([n_batch_size, self._num_steps], np.int64) + + for i in range(n_batch_size): + cur_pos = 0 + while cur_pos < self._num_steps: + if cur_stream[i] is None or len(cur_stream[i][0]) <= 1: + try: + cur_stream[i] = list(next(sentence_generator)) + except StopIteration: + return + + how_many = min(len(cur_stream[i][0]) - 1, self._num_steps - cur_pos) + next_pos = cur_pos + how_many + + inputs[i, cur_pos:next_pos] = cur_stream[i][0][:how_many] + inputs_reverse[i, cur_pos:next_pos] = cur_stream[i][2][:how_many] + if self._max_word_length is not None: + char_inputs[i, cur_pos:next_pos] = cur_stream[i][1][:how_many] + char_inputs_reverse[i, cur_pos:next_pos] = cur_stream[i][3][:how_many] + targets[i, cur_pos:next_pos] = cur_stream[i][0][1 : how_many + 1] + targets_reverse[i, cur_pos:next_pos] = cur_stream[i][2][1 : how_many + 1] + + cur_pos = next_pos + + cur_stream[i][0] = cur_stream[i][0][how_many:] + cur_stream[i][2] = 
cur_stream[i][2][how_many:] + if self._max_word_length is not None: + cur_stream[i][1] = cur_stream[i][1][how_many:] + cur_stream[i][3] = cur_stream[i][3][how_many:] + + # token_ids: (n_batch_size, self._num_steps) + # char_inputs: character ids (n_batch_size, self._num_steps, 50) + # targets: word ID of next word (n_batch_size, self._num_steps) + batch_data = { + "token_ids": inputs, + "tokens_characters": char_inputs, + "next_token_ids": targets, + "token_ids_reverse": inputs_reverse, + "tokens_characters_reverse": char_inputs_reverse, + "next_token_ids_reverse": targets_reverse, + } + if self._n_procs > 1: + start = self._rank * self._batch_size + end = start + self._batch_size + for key in batch_data: + batch_data[key] = batch_data[key][start:end] + + yield ( + batch_data["tokens_characters"], + batch_data["next_token_ids"], + batch_data["tokens_characters_reverse"], + batch_data["next_token_ids_reverse"], + ) + + +def create_one_batch(sentences, vocab, max_seq_len): + # Add , for every sentence + max_len = max([len(sentence) for sentence in sentences]) + 2 + max_len = min(max_len, max_seq_len) + batch_ids = np.zeros([len(sentences), max_len, vocab.max_word_length], dtype=np.int64) + batch_ids_reverse = np.zeros([len(sentences), max_len, vocab.max_word_length], dtype=np.int64) + batch_lens = [] + for i, sentence in enumerate(sentences): + sentence = sentence[: max_len - 2] + seq_len = len(sentence) + 2 + ids, ids_reverse = vocab.encode_chars(sentence, split=False) + batch_ids[i, :seq_len, :] = ids + batch_ids_reverse[i, :seq_len, :] = ids_reverse + batch_lens.append(seq_len) + return batch_ids, batch_ids_reverse, batch_lens + + +def create_batches(sentences: List[List[str]], batch_size, vocab, max_seq_len): + """ + Batch the sentences as character ids + Each sentence is a list of tokens without or , e.g. + [['The', 'first', 'sentence', '.'], ['Second', '.']] + """ + n_batch = (len(sentences) - 1) // batch_size + 1 + for i in range(n_batch): + start, end = i * batch_size, (i + 1) * batch_size + ids, ids_reverse, seq_lens = create_one_batch(sentences[start:end], vocab, max_seq_len) + ids = paddle.to_tensor(ids) + ids_reverse = paddle.to_tensor(ids_reverse) + yield ids, ids_reverse, seq_lens diff --git a/examples/language_model/elmo/download_data.sh b/examples/language_model/elmo/download_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..385df158a073b65527441304c1d79ea5225273a2 --- /dev/null +++ b/examples/language_model/elmo/download_data.sh @@ -0,0 +1,25 @@ +set -eux + +rm 1-billion-word* -rf +wget https://bj.bcebos.com/paddlenlp/datasets/1-billion-word.tar.gz +src_md5="5f079a9b88ea27585e0539f502ca9327" +md5=`md5sum 1-billion-word.tar.gz | cut -d ' ' -f1` +if [ $md5 != $src_md5 ] +then + echo "The MD5 values of 1-billion-word.tar.gz are inconsistent. Please download again!" + exit 1 +fi +tar -zxf 1-billion-word.tar.gz + +rm sentence-polarity-dataset-v1* -rf +wget https://bj.bcebos.com/paddlenlp/datasets/movie-review/sentence-polarity-dataset-v1.tar.gz +src_md5="0464239d7b14b18d941f54a948c6cb26" +md5=`md5sum sentence-polarity-dataset-v1.tar.gz | cut -d ' ' -f1` +if [ $md5 != $src_md5 ] +then + echo "The MD5 values of sentence-polarity-dataset-v1.tar.gz are inconsistent. Please download again!" 
+ exit 1 +fi +tar -zxf sentence-polarity-dataset-v1.tar.gz +cd sentence-polarity-dataset-v1 +gunzip GoogleNews-vectors-negative300.bin.gz diff --git a/examples/language_model/elmo/elmo.py b/examples/language_model/elmo/elmo.py new file mode 100644 index 0000000000000000000000000000000000000000..04c79e76423233f53ea652c375301c50a3519c31 --- /dev/null +++ b/examples/language_model/elmo/elmo.py @@ -0,0 +1,335 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import List + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.nn.initializer as I +from dataset import create_batches, load_vocab + + +def reverse_sequence(x, sequence_lengths): + batch_size = x.shape[0] + sequence_lengths = sequence_lengths.numpy().data + y = paddle.zeros(x.shape, x.dtype) + for i in range(batch_size): + lens = sequence_lengths[i] + z = x[i, :lens, :] + z = paddle.reverse(z, axis=[0]) + y[i, :lens, :] = z + return y + + +class ELMo(nn.Layer): + def __init__( + self, + batch_size=None, + char_embed_dim=16, + projection_dim=512, + vocab_size=None, + cnn_filters=[[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], + char_vocab_size=262, + max_characters_per_token=50, + num_highways=2, + num_layers=2, + dropout=0.1, + task="pre-train", + ): + super(ELMo, self).__init__() + + if task == "pre-train": + if vocab_size is None or batch_size is None: + raise ValueError('vocab_size and batch_size should be set when task="pre-train"') + elif task == "fine-tune": + if batch_size is None: + batch_size = 128 + else: + raise ValueError('task should be "pre-train" or "fine-tune"') + + self._projection_dim = projection_dim + self._task = task + + self._token_embding_layer = ELMoCharacterEncoderLayer( + char_vocab_size, char_embed_dim, projection_dim, num_highways, cnn_filters, max_characters_per_token + ) + self._elmobilm = ELMoBiLM(batch_size, projection_dim, projection_dim, num_layers, dropout, task) + if task == "pre-train": + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(projection_dim))) + self._linear_layer = nn.Linear(projection_dim, vocab_size, weight_attr=paramAttr) + + @property + def embedding_dim(self): + return self._projection_dim * 2 + + def forward(self, inputs): + # [batch_size, seq_len, max_characters_per_token] + ids, ids_reverse = inputs + # [batch_size, seq_len, projection_dim] + token_embedding = self._token_embding_layer(ids) + token_embedding_reverse = self._token_embding_layer(ids_reverse) + + outs = self._elmobilm(token_embedding, token_embedding_reverse) + + if self._task == "pre-train": + # [batch_size, seq_len, projection_dim] + fw_out, bw_out = outs + + # [batch_size, max_seq_len, vocab_size] + fw_logits = self._linear_layer(fw_out) + bw_logits = self._linear_layer(bw_out) + return [fw_logits, bw_logits] + else: + mask = paddle.any(ids > 0, axis=2) + seq_lens = paddle.sum(paddle.cast(mask, dtype=ids.dtype), axis=1) + outputs = 
[paddle.concat([token_embedding, token_embedding], axis=2)] + for fw_h, bw_h in zip(outs[0], outs[1]): + bw_h = reverse_sequence(bw_h, seq_lens) + outputs.append(paddle.concat([fw_h, bw_h], axis=2)) + # [batch_size, num_lstm_layers + 1, max_seq_len, projection_dim * 2] + outputs = paddle.concat([paddle.unsqueeze(emb, axis=1) for emb in outputs], axis=1) + return outputs + + +class ELMoBiLM(nn.Layer): + def __init__(self, batch_size, input_size, hidden_size, num_layers, dropout, task="pre-train"): + super(ELMoBiLM, self).__init__() + + self._num_layers = num_layers + self._dropout = dropout + self._task = task + + self._lstm_layers = [] + for direction in ["forward", "backward"]: + layers = [] + for i in range(num_layers): + lstm = nn.LSTM( + input_size=input_size, + hidden_size=hidden_size, + num_layers=1, + direction="forward", + weight_hh_attr=paddle.ParamAttr(initializer=I.XavierUniform()), + weight_ih_attr=paddle.ParamAttr(initializer=I.XavierUniform()), + bias_hh_attr=False, + bias_ih_attr=paddle.ParamAttr(initializer=I.Constant(value=0.0)), + ) + self.add_sublayer("{}_lstm_layer_{}".format(direction, i), lstm) + + hidden_state = paddle.zeros(shape=[1, batch_size, hidden_size], dtype="float32") + cell_state = paddle.zeros(shape=[1, batch_size, hidden_size], dtype="float32") + layers.append({"lstm": lstm, "hidden_state": hidden_state, "cell_state": cell_state}) + self._lstm_layers.append(layers) + + if dropout: + self._dropout_layer = nn.Dropout(p=dropout) + + def forward(self, fw_x, bw_x): + final_outs = [] + lstm_outs = [] + for x, layers in zip([fw_x, bw_x], self._lstm_layers): + batch_size = x.shape[0] + outs = [] + for i, dic in enumerate(layers): + lstm = dic["lstm"] + hidden_state = dic["hidden_state"][:, :batch_size, :] + cell_state = dic["cell_state"][:, :batch_size, :] + if self._dropout: + x = self._dropout_layer(x) + x, (hidden_state, cell_state) = lstm(x, (hidden_state, cell_state)) + hidden_state = hidden_state.detach() + cell_state = cell_state.detach() + dic["hidden_state"][:, :batch_size, :] = hidden_state + dic["cell_state"][:, :batch_size, :] = cell_state + outs.append(x) + lstm_outs.append(outs) + + if self._dropout: + x = self._dropout_layer(x) + final_outs.append(x) + if self._task == "pre-train": + return final_outs + else: + return lstm_outs + + +class ELMoCharacterEncoderLayer(nn.Layer): + def __init__( + self, char_vocab_size, char_embed_dim, projection_dim, num_highways, cnn_filters, max_characters_per_token + ): + super(ELMoCharacterEncoderLayer, self).__init__() + + self._use_highway = num_highways > 0 + self._n_filters = sum(f[1] for f in cnn_filters) + self._use_proj = self._n_filters != projection_dim + + paramAttr = paddle.ParamAttr(initializer=I.Uniform(low=-1.0, high=1.0)) + self._char_embedding_layer = nn.Embedding( + num_embeddings=char_vocab_size, embedding_dim=char_embed_dim, weight_attr=paramAttr + ) + + with paddle.no_grad(): + self._char_embedding_layer.weight[0, :] = 0 + + self._convolution_layers = [] + for i, (width, num) in enumerate(cnn_filters): + paramAttr = paddle.ParamAttr(initializer=I.Uniform(low=-0.05, high=0.05)) + conv2d = nn.Conv2D( + in_channels=char_embed_dim, + out_channels=num, + kernel_size=(1, width), + padding="Valid", + data_format="NHWC", + weight_attr=paramAttr, + ) + max_pool = nn.MaxPool2D( + kernel_size=(1, max_characters_per_token - width + 1), + stride=(1, 1), + padding="Valid", + data_format="NHWC", + ) + self.add_sublayer("cnn_layer_{}".format(i), conv2d) + self.add_sublayer("maxpool_layer_{}".format(i), 
max_pool) + self._convolution_layers.append([width, conv2d, max_pool]) + + self._relu = nn.ReLU() + if self._use_highway: + self._highway_layer = Highway(self._n_filters, num_highways) + if self._use_proj: + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(self._n_filters))) + self._linear_layer = nn.Linear(self._n_filters, projection_dim, weight_attr=paramAttr) + + def forward(self, x): + # [batch_size, seq_len, max_characters_per_token, embed_dim] + char_embedding = self._char_embedding_layer(x) + + cnn_outs = [] + for width, conv2d, max_pool in self._convolution_layers: + # [batch_size, seq_len, max_characters_per_token - kerner_width, out_channel] + conv_out = conv2d(char_embedding) + # [batch_size, seq_len, 1, out_channel] + pool_out = max_pool(conv_out) + # [batch_size, seq_len, 1, out_channel] + out = self._relu(pool_out) + # [batch_size, seq_len, out_channel] + out = paddle.squeeze(out, axis=2) + cnn_outs.append(out) + + # [batch_size, seq_len, n_filters] + token_embedding = paddle.concat(cnn_outs, axis=-1) + + if self._use_highway: + # [batch_size, seq_len, n_filters] + token_embedding = self._highway_layer(token_embedding) + + if self._use_proj: + # [batch_size, seq_len, projection_dim] + token_embedding = self._linear_layer(token_embedding) + + return token_embedding + + +class Highway(nn.Layer): + def __init__(self, input_dim, num_layers): + super(Highway, self).__init__() + + self._num_layers = num_layers + + self._highway_layers = [] + for i in range(num_layers): + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(input_dim))) + paramAttr_b = paddle.ParamAttr(initializer=I.Constant(value=-2.0)) + carry_linear = nn.Linear(input_dim, input_dim, weight_attr=paramAttr, bias_attr=paramAttr_b) + self.add_sublayer("carry_linear_{}".format(i), carry_linear) + + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(input_dim))) + transform_linear = nn.Linear(input_dim, input_dim, weight_attr=paramAttr) + self.add_sublayer("transform_linear_{}".format(i), transform_linear) + + self._highway_layers.append([carry_linear, transform_linear]) + + self._relu = nn.ReLU() + self._sigmoid = nn.Sigmoid() + + def forward(self, x): + for i in range(self._num_layers): + carry_linear, transform_linear = self._highway_layers[i] + carry_gate = self._sigmoid(carry_linear(x)) + transform_gate = self._relu(transform_linear(x)) + x = carry_gate * transform_gate + (1.0 - carry_gate) * x + return x + + +class ELMoLoss(nn.Layer): + def __init__(self): + super(ELMoLoss, self).__init__() + + def forward(self, x, y): + # [batch_size, seq_len, vocab_size] + fw_logits, bw_logits = x + # [batch_size, seq_len] + fw_label, bw_label = y + # [batch_size, seq_len, 1] + fw_label = paddle.unsqueeze(fw_label, axis=2) + bw_label = paddle.unsqueeze(bw_label, axis=2) + + # [batch_size, seq_len, 1] + fw_loss = F.cross_entropy(input=fw_logits, label=fw_label) + bw_loss = F.cross_entropy(input=bw_logits, label=bw_label) + + avg_loss = 0.5 * (fw_loss + bw_loss) + return avg_loss + + +def get_elmo_layer(params_file, batch_size, trainable=False): + if trainable: + elmo = ELMo(batch_size=batch_size, task="fine-tune") + else: + elmo = ELMo(batch_size=batch_size, dropout=None, task="fine-tune") + weight_state_dict = paddle.load(params_file + ".pdparams") + elmo.set_state_dict(weight_state_dict) + if trainable: + elmo.train() + else: + for params in elmo.parameters(): + params.trainable = False + elmo.eval() + return elmo + + +class ELMoEmbedder(object): 
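+    # A thin inference wrapper around a frozen, pre-trained ELMo (loaded via
+    # get_elmo_layer with trainable=False). encode() batches tokenized sentences and
+    # returns one numpy array per sentence with shape
+    # [num_lstm_layers + 1, sentence_len, projection_dim * 2], boundary positions stripped.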
+ def __init__(self, params_file, batch_size=128, max_seq_len=256): + self._max_seq_len = max_seq_len + self._batch_size = batch_size + + self._elmo = get_elmo_layer(params_file, batch_size, trainable=False) + self._vocab = load_vocab() + + def encode(self, sentences: List[List[str]]): + """ + Each sentence is a list of tokens without or , e.g. + [['The', 'first', 'sentence', '.'], ['Second', '.']] + """ + batch_data = create_batches(sentences, self._batch_size, self._vocab, self._max_seq_len) + embeddings = [] + for data in batch_data: + ids, ids_reverse, seq_lens = data + # [batch_size, num_lstm_layers + 1, max_seq_len, projection_dim * 2] + outputs = self._elmo([ids, ids_reverse]) + outputs = outputs.numpy() + for i, lens in enumerate(seq_lens): + arr = outputs[i, :, 1 : lens - 1, :] + embeddings.append(arr) + return embeddings diff --git a/examples/language_model/elmo/run_eval.py b/examples/language_model/elmo/run_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..ea330d62eaa61369ecef51c2b92e9505cf3b0fe7 --- /dev/null +++ b/examples/language_model/elmo/run_eval.py @@ -0,0 +1,87 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +import time + +import paddle +from args import parse_args, print_args +from dataset import OneBillionWordDataset, load_vocab +from elmo import ELMo, ELMoLoss +from paddle.io import DataLoader + + +@paddle.no_grad() +def eval(args): + paddle.set_device(args.device) + + if not args.init_from_ckpt: + raise ValueError("init_from_ckpt should be set when eval.") + vocab = load_vocab(args.vocab_file, args.max_characters_per_token) + + elmo = ELMo( + args.batch_size, + args.char_embed_dim, + args.projection_dim, + vocab.size, + dropout=args.dropout, + num_layers=args.num_layers, + num_highways=args.num_highways, + char_vocab_size=vocab.char_size, + ) + elmo.eval() + + elmo_loss = ELMoLoss() + + # Loads pre-trained parameters. 
+ weight_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") + elmo.set_state_dict(weight_state_dict) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + dev_dataset = OneBillionWordDataset( + args.dev_data_path, vocab, args.batch_size, args.unroll_steps, mode="test", shuffle=False, seed=args.seed + ) + + dev_dataloader = DataLoader(dev_dataset, return_list=True, batch_size=None) + + total_step = total_loss = 0 + total_time = 0.0 + batch_start_time = time.time() + for step, inputs in enumerate(dev_dataloader, start=1): + ids, next_ids, ids_reverse, next_ids_reverse = inputs + outputs = elmo([ids, ids_reverse]) + loss = elmo_loss(outputs, [next_ids, next_ids_reverse]) + ppl = paddle.exp(loss) + + total_loss += float(loss) + total_step += 1 + + total_time += time.time() - batch_start_time + if step % args.log_freq == 0: + print( + "Eval step %d - loss: %.4f - Perplexity: %.4f - %.3fs/step" + % (step, float(loss) * args.unroll_steps, float(ppl), total_time / args.log_freq) + ) + total_time = 0.0 + batch_start_time = time.time() + + avg_loss = total_loss / total_step + avg_ppl = math.exp(avg_loss) + print("Eval - average loss: %.4f - average Perplexity: %.4f" % (avg_loss * args.unroll_steps, avg_ppl)) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + eval(args) diff --git a/examples/language_model/elmo/run_finetune.py b/examples/language_model/elmo/run_finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..a129c6e87d0c57315a8d19015c5278d1eb902f28 --- /dev/null +++ b/examples/language_model/elmo/run_finetune.py @@ -0,0 +1,275 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import re + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from dataset import load_vocab +from elmo import get_elmo_layer +from paddle.io import DataLoader, Dataset +from sklearn.model_selection import train_test_split + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--data_dir", type=str, default="./sentence-polarity-dataset-v1/", help="Specify the data dir.") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument("--logging_step", type=int, default=10, help="The frequency, in number of steps, the training logs are printed. 
(default: %(default)d)") + parser.add_argument("--epochs", type=int, default=20, help="Total number of training epochs to perform.") + parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.") + parser.add_argument("--dropout", type=float, default=0.5, help="The dropout rate.") + parser.add_argument("--lr", type=float, default=0.001, help="The initial learning rate.") + parser.add_argument("--weight_decay", type=float, default=0.0001, help="The weight decay for optimizer.") + parser.add_argument("--seed", type=int, default=2020, help="Random seed.") + parser.add_argument("--max_seq_len", type=int, default=256, help='max grad norm') + parser.add_argument("--sent_embedding_dim", type=int, default=64, help="The size of sentence embedding.") + parser.add_argument("--num_classes", type=int, default=2, help="The num of classification classes.") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def clean_str(string): + """ + Tokenization/string cleaning for all datasets except for SST. + Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py + """ + string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) + string = re.sub(r"\'s", " 's", string) + string = re.sub(r"\'ve", " 've", string) + string = re.sub(r"n\'t", " n't", string) + string = re.sub(r"\'re", " 're", string) + string = re.sub(r"\'d", " 'd", string) + string = re.sub(r"\'ll", " 'll", string) + string = re.sub(r",", " , ", string) + string = re.sub(r"!", " ! ", string) + string = re.sub(r"\(", " \( ", string) + string = re.sub(r"\)", " \) ", string) + string = re.sub(r"\?", " \? ", string) + string = re.sub(r"\s{2,}", " ", string) + return string.strip().lower() + + +def load_data_and_labels(positive_data_file, negative_data_file): + """ + Loads MR polarity data from files, splits the data into words and generates labels. + Returns split sentences and labels. + """ + # Load data from files + positive_examples = list(open(positive_data_file, "r", encoding="latin-1").readlines()) + positive_examples = [s.strip() for s in positive_examples] + negative_examples = list(open(negative_data_file, "r", encoding="latin-1").readlines()) + negative_examples = [s.strip() for s in negative_examples] + # Split by words + x_text = positive_examples + negative_examples + x_text = [clean_str(sent) for sent in x_text] + x_text = list(map(lambda x: x.split(), x_text)) + # Generate labels + positive_labels = [1 for _ in positive_examples] + negative_labels = [0 for _ in negative_examples] + y = np.array(positive_labels + negative_labels) + return [x_text, y] + + +class ELMoBowTextClassification(nn.Layer): + def __init__(self, params_file, batch_size, sent_embedding_dim, dropout, num_labels): + super(ELMoBowTextClassification, self).__init__() + + self._elmo = get_elmo_layer(params_file, batch_size, trainable=True) + word_embedding_dim = self._elmo.embedding_dim + self._fc1 = nn.Linear(word_embedding_dim, sent_embedding_dim) + self._fc2 = nn.Linear(sent_embedding_dim, num_labels) + self._dropout = nn.Dropout(p=dropout) + + def forward(self, inputs): + """ + Parameters: + inputs (Tuple): It is a Tuple contains 2 tensor with shape + `[batch_size, max_seq_len, max_characters_per_token]`. 
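+
+        Returns:
+            Tensor: Classification logits with shape `[batch_size, num_labels]`.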
+ """ + mask = paddle.any(inputs[0] > 0, axis=2) + # [batch_size, 3, max_seq_len, word_embedding_dim] + elmo_out = self._elmo(inputs) + # [batch_size, max_seq_len, word_embedding_dim] + word_emb = self.mix_elmo_outputs(elmo_out) + + # [batch_size, word_embedding_dim] + sent_emb = self.average_word_embedding(word_emb, mask) + + # [batch_size, sent_embedding_dim] + dense = self._fc1(sent_emb) + dense = self._dropout(dense) + + # [batch_size, num_labels] + out = self._fc2(dense) + return out + + def mix_elmo_outputs(self, elmo_out): + """ + Computes a mixture of elmo_out. At present, we simply take the last one. + Parameters: + elmo_out (Tensor): It is a Tensor with shape + `[batch_size, 3, max_seq_len, word_embedding_dim]`. + """ + # [batch_size, max_seq_len, word_embedding_dim] + word_emb = elmo_out[:, 2, :, :] + return word_emb + + def average_word_embedding(self, word_emb, mask): + """ + Parameters: + word_emb: It is a Tensor with shape `[batch_size, max_seq_len, word_embedding_dim]`. + mask: It is a Tensor with shape `[batch_size, max_seq_len]`. + """ + mask = paddle.unsqueeze(mask, axis=-1) + # [batch_size, 1] + seq_lens = paddle.sum(paddle.cast(mask, dtype=word_emb.dtype), axis=1) + + # [batch_size, max_seq_len, word_embedding_dim] + word_emb = word_emb * mask + # [batch_size, word_embedding_dim] + sent_emb = paddle.sum(word_emb, axis=1) + # [batch_size, word_embedding_dim] + sent_emb = sent_emb / seq_lens + return sent_emb + + +class SentencePolarityDatasetV1(Dataset): + def __init__(self, x, y, vocab, max_seq_len): + super(SentencePolarityDatasetV1, self).__init__() + + self._text = list(zip(x, y)) + self._vocab = vocab + self._max_seq_len = max_seq_len + self._data = self.convert_to_ids() + + def convert_to_ids(self): + data = [] + for sentence, label in self._text: + ids, ids_reverse = self._vocab.encode_chars(sentence[: self._max_seq_len - 2], split=False) + data.append([ids, ids_reverse, label]) + return data + + def __getitem__(self, idx): + ids = np.copy(self._data[idx][0]) + ids_reverse = np.copy(self._data[idx][1]) + label = self._data[idx][2] + return (ids, ids_reverse, label) + + def __len__(self): + return len(self._data) + + +def generate_batch(batch): + batch_ids, batch_ids_reverse, batch_label = zip(*batch) + max_len = max([ids.shape[0] for ids in batch_ids]) + new_batch_ids = np.zeros([len(batch_ids), max_len, batch_ids[0].shape[1]], dtype=np.int64) + new_batch_ids_reverse = np.zeros([len(batch_ids), max_len, batch_ids[0].shape[1]], dtype=np.int64) + new_batch_label = [] + for i, (ids, ids_reverse, label) in enumerate(zip(batch_ids, batch_ids_reverse, batch_label)): + seq_len = ids.shape[0] + new_batch_ids[i, :seq_len, :] = ids + new_batch_ids_reverse[i, :seq_len, :] = ids_reverse + new_batch_label.append(label) + return new_batch_ids, new_batch_ids_reverse, new_batch_label + + +def finetune(args): + paddle.set_device(args.device) + if dist.get_world_size() > 1: + dist.init_parallel_env() + + pos_file = os.path.join(args.data_dir, "rt-polarity.pos") + neg_file = os.path.join(args.data_dir, "rt-polarity.neg") + x_text, y = load_data_and_labels(pos_file, neg_file) + x_train, x_test, y_train, y_test = train_test_split(x_text, y, test_size=0.1, random_state=args.seed) + + if not args.init_from_ckpt: + raise ValueError("`init_from_ckpt` should be set.") + model = ELMoBowTextClassification( + args.init_from_ckpt, args.batch_size, args.sent_embedding_dim, args.dropout, args.num_classes + ) + if dist.get_world_size() > 1: + model = paddle.DataParallel(model) + model.train() + 
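+    # Adam with the configured learning rate / weight decay; nn.CrossEntropyLoss below
+    # consumes the raw [batch_size, num_classes] logits, so the model applies no softmax.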
+ adam = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr, weight_decay=args.weight_decay) + criterion = nn.CrossEntropyLoss() + + vocab = load_vocab() + + train_dataset = SentencePolarityDatasetV1(x_train, y_train, vocab, args.max_seq_len) + test_dataset = SentencePolarityDatasetV1(x_test, y_test, vocab, args.max_seq_len) + train_loader = DataLoader( + train_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=True, + collate_fn=lambda batch: generate_batch(batch), + ) + test_loader = DataLoader( + test_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=False, + collate_fn=lambda batch: generate_batch(batch), + ) + + for epoch in range(args.epochs): + print("Epoch {}/{}".format(epoch + 1, args.epochs)) + for step, batch_data in enumerate(train_loader, start=1): + ids, ids_reverse, label = batch_data + + output = model((ids, ids_reverse)) + loss = criterion(output, label) + loss.backward() + adam.step() + adam.clear_grad() + + if step % args.logging_step == 0: + print("step {}, loss {}".format(step, float(loss))) + + acc = test(model, test_loader) + print("\ntest acc {}\n".format(acc)) + + +@paddle.no_grad() +def test(model, test_loader): + correct = num = 0 + model.eval() + for batch_data in test_loader: + ids, ids_reverse, label = batch_data + + # [batch_size, 2] + output = model((ids, ids_reverse)) + + num += label.shape[0] + predict = paddle.argmax(output, axis=1) + label = paddle.cast(label, dtype=predict.dtype) + correct += int(paddle.sum(paddle.cast(predict == label, dtype="int64"))) + model.train() + return correct * 1.0 / num + + +if __name__ == "__main__": + args = parse_args() + finetune(args) diff --git a/examples/language_model/elmo/run_pretrain.py b/examples/language_model/elmo/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..7b22f97f8df385aa62fa8fe603334170f634a038 --- /dev/null +++ b/examples/language_model/elmo/run_pretrain.py @@ -0,0 +1,123 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
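+
+# Pretrains the bidirectional ELMo language model on the One Billion Word dataset
+# with Adagrad and global-norm gradient clipping; rank 0 saves a checkpoint every
+# `save_freq` steps and a final one once `epochs * steps_per_epoch` steps are done.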
+ +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from args import parse_args, print_args +from dataset import OneBillionWordDataset, load_vocab +from elmo import ELMo, ELMoLoss +from paddle.io import DataLoader + + +def save_params(elmo, optimizer, save_dir, name): + elmo_ckpt = os.path.join(save_dir, "{}.pdparams".format(name)) + opt_ckpt = os.path.join(save_dir, "{}.pdopt".format(name)) + paddle.save(elmo.state_dict(), elmo_ckpt) + paddle.save(optimizer.state_dict(), opt_ckpt) + + +def train(args): + paddle.set_device(args.device) + n_procs = dist.get_world_size() + rank = dist.get_rank() + + if n_procs > 1: + dist.init_parallel_env() + + vocab = load_vocab(args.vocab_file, args.max_characters_per_token) + + elmo = ELMo( + args.batch_size, + args.char_embed_dim, + args.projection_dim, + vocab.size, + dropout=args.dropout, + num_layers=args.num_layers, + num_highways=args.num_highways, + char_vocab_size=vocab.char_size, + ) + if n_procs > 1: + elmo = paddle.DataParallel(elmo) + elmo.train() + + gloabl_norm_clip = nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.Adagrad( + learning_rate=args.lr, parameters=elmo.parameters(), initial_accumulator_value=1.0, grad_clip=gloabl_norm_clip + ) + elmo_loss = ELMoLoss() + + # Loads pre-trained parameters. + if args.init_from_ckpt: + weight_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") + opt_state_dict = paddle.load(args.init_from_ckpt + ".pdopt") + elmo.set_state_dict(weight_state_dict) + optimizer.set_state_dict(opt_state_dict) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + train_dataset = OneBillionWordDataset( + args.train_data_path, + vocab, + args.batch_size, + args.unroll_steps, + n_procs=n_procs, + rank=rank, + mode="train", + shuffle=True, + seed=args.seed, + ) + + train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None) + + n_tokens_per_batch = args.batch_size * args.unroll_steps * n_procs + n_steps_per_epoch = int(train_dataset.number_of_tokens / n_tokens_per_batch) + n_steps_total = args.epochs * n_steps_per_epoch + print("Training for %s epochs and %s steps" % (args.epochs, n_steps_total)) + + total_time = 0.0 + batch_start_time = time.time() + for step, inputs in enumerate(train_dataloader, start=1): + ids, next_ids, ids_reverse, next_ids_reverse = inputs + outputs = elmo([ids, ids_reverse]) + loss = elmo_loss(outputs, [next_ids, next_ids_reverse]) + ppl = paddle.exp(loss) + loss *= args.unroll_steps + loss.backward() + optimizer.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.log_freq == 0: + print( + "step %d/%d - loss: %.4f - Perplexity: %.4f - %.3fs/step" + % (step, n_steps_total, float(loss), float(ppl), total_time / args.log_freq) + ) + total_time = 0.0 + if rank == 0 and step % args.save_freq == 0: + save_params(elmo, optimizer, args.save_dir, step) + if step == n_steps_total: + # training done + if rank == 0: + save_params(elmo, optimizer, args.save_dir, "final") + break + batch_start_time = time.time() + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + train(args) diff --git a/examples/language_model/elmo/word2vec_base.py b/examples/language_model/elmo/word2vec_base.py new file mode 100644 index 0000000000000000000000000000000000000000..1401268b8f0efc049afe79fe340dad6f2a9380eb --- /dev/null +++ b/examples/language_model/elmo/word2vec_base.py @@ -0,0 +1,255 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import re + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from gensim.models.keyedvectors import KeyedVectors +from paddle.io import DataLoader, Dataset +from sklearn.model_selection import train_test_split + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--data_dir", type=str, default="./sentence-polarity-dataset-v1/", help="Specify the data dir.") + parser.add_argument("--pretrained_word2vec_file", type=str, default="./sentence-polarity-dataset-v1/GoogleNews-vectors-negative300.bin", help="Specify the pretrained word2vec model path.") + parser.add_argument("--logging_step", type=int, default=10, help="The frequency, in number of steps, the training logs are printed. (default: %(default)d)") + parser.add_argument("--epochs", type=int, default=20, help="Total number of training epochs to perform.") + parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.") + parser.add_argument("--dropout", type=float, default=0.5, help="The dropout rate.") + parser.add_argument("--lr", type=float, default=0.001, help="The initial learning rate.") + parser.add_argument("--weight_decay", type=float, default=0.0001, help="The weight decay for optimizer.") + parser.add_argument("--seed", type=int, default=2020, help="Random seed.") + parser.add_argument("--max_seq_len", type=int, default=256, help='max grad norm') + parser.add_argument("--sent_embedding_dim", type=int, default=64, help="The size of sentence embedding.") + parser.add_argument("--num_classes", type=int, default=2, help="The num of classification classes.") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def clean_str(string): + """ + Tokenization/string cleaning for all datasets except for SST. + Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py + """ + string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) + string = re.sub(r"\'s", " 's", string) + string = re.sub(r"\'ve", " 've", string) + string = re.sub(r"n\'t", " n't", string) + string = re.sub(r"\'re", " 're", string) + string = re.sub(r"\'d", " 'd", string) + string = re.sub(r"\'ll", " 'll", string) + string = re.sub(r",", " , ", string) + string = re.sub(r"!", " ! ", string) + string = re.sub(r"\(", " \( ", string) + string = re.sub(r"\)", " \) ", string) + string = re.sub(r"\?", " \? ", string) + string = re.sub(r"\s{2,}", " ", string) + return string.strip().lower() + + +def load_data_and_labels(positive_data_file, negative_data_file): + """ + Loads MR polarity data from files, splits the data into words and generates labels. + Returns split sentences and labels. 
+ """ + # Load data from files + positive_examples = list(open(positive_data_file, "r", encoding="latin-1").readlines()) + positive_examples = [s.strip() for s in positive_examples] + negative_examples = list(open(negative_data_file, "r", encoding="latin-1").readlines()) + negative_examples = [s.strip() for s in negative_examples] + # Split by words + x_text = positive_examples + negative_examples + x_text = [clean_str(sent) for sent in x_text] + x_text = list(map(lambda x: x.split(), x_text)) + # Generate labels + positive_labels = [1 for _ in positive_examples] + negative_labels = [0 for _ in negative_examples] + y = np.array(positive_labels + negative_labels) + return [x_text, y] + + +class Word2VecBoWTextClassification(nn.Layer): + def __init__(self, word_embedding_dim, sent_embedding_dim, dropout, num_classes): + super(Word2VecBoWTextClassification, self).__init__() + + self._fc1 = nn.Linear(word_embedding_dim, sent_embedding_dim) + self._fc2 = nn.Linear(sent_embedding_dim, num_classes) + self._dropout = nn.Dropout(p=dropout) + + def forward(self, inputs): + word_emb, seq_lens = inputs + + # [batch_size, word_embedding_dim] + sent_emb = self.average_word_embedding(word_emb, seq_lens) + + # [batch_size, sent_embedding_dim] + dense = self._fc1(sent_emb) + dense = self._dropout(dense) + + # [batch_size, num_classes] + out = self._fc2(dense) + return out + + def average_word_embedding(self, word_emb, seq_lens): + """ + Parameters: + word_emb: It is a Tensor with shape `[batch_size, max_seq_len, word_embedding_dim]`. + seq_lens: It is a Tensor with shape `[batch_size]`. + """ + seq_lens = paddle.unsqueeze(seq_lens, axis=-1) + seq_lens = paddle.cast(seq_lens, dtype=word_emb.dtype) + + # [batch_size, word_embedding_dim] + sent_emb = paddle.sum(word_emb, axis=1) + # [batch_size, word_embedding_dim] + sent_emb = sent_emb / seq_lens + return sent_emb + + +class SentencePolarityDatasetV1(Dataset): + def __init__(self, x, y, gensim_model, max_seq_len): + super(SentencePolarityDatasetV1, self).__init__() + + self._text = list(zip(x, y)) + self._gensim_model = gensim_model + self._vector_size = gensim_model.vector_size + self._max_seq_len = max_seq_len + self._data = self.convert_to_ids() + + def convert_to_ids(self): + data = [] + for sentence, label in self._text: + sentence = sentence[: self._max_seq_len] + ids = np.zeros([len(sentence), self._vector_size], dtype=np.float32) + for i, word in enumerate(sentence): + if word in self._gensim_model: + ids[i] = self._gensim_model[word] + else: + ids[i] = np.random.uniform(-0.25, 0.25, self._vector_size) + data.append([ids, label]) + return data + + def __getitem__(self, idx): + ids = np.copy(self._data[idx][0]) + label = self._data[idx][1] + return (ids, label) + + def __len__(self): + return len(self._data) + + +def generate_batch(batch): + batch_ids, batch_label = zip(*batch) + max_len = max([ids.shape[0] for ids in batch_ids]) + new_batch_ids = np.zeros([len(batch_ids), max_len, batch_ids[0].shape[1]], dtype=np.float32) + new_batch_label = [] + new_batch_seq_len = [] + for i, (ids, label) in enumerate(zip(batch_ids, batch_label)): + seq_len = ids.shape[0] + new_batch_ids[i, :seq_len, :] = ids + new_batch_label.append(label) + new_batch_seq_len.append(seq_len) + return new_batch_ids, new_batch_label, new_batch_seq_len + + +def train(args): + paddle.set_device(args.device) + if dist.get_world_size() > 1: + dist.init_parallel_env() + + pos_file = os.path.join(args.data_dir, "rt-polarity.pos") + neg_file = os.path.join(args.data_dir, "rt-polarity.neg") 
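+    # rt-polarity.pos / rt-polarity.neg hold one latin-1 encoded sentence per line
+    # (MR polarity data); positive examples are labeled 1, negative examples 0.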
+ x_text, y = load_data_and_labels(pos_file, neg_file) + x_train, x_test, y_train, y_test = train_test_split(x_text, y, test_size=0.1, random_state=args.seed) + + # gensim_model = KeyedVectors.load_word2vec_format(args.pretrained_word2vec_file, binary=True, limit=300000) + gensim_model = KeyedVectors.load_word2vec_format(args.pretrained_word2vec_file, binary=True) + print("\nLoaded word2vec from %s\n" % args.pretrained_word2vec_file) + + train_dataset = SentencePolarityDatasetV1(x_train, y_train, gensim_model, args.max_seq_len) + test_dataset = SentencePolarityDatasetV1(x_test, y_test, gensim_model, args.max_seq_len) + train_loader = DataLoader( + train_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=True, + collate_fn=lambda batch: generate_batch(batch), + ) + test_loader = DataLoader( + test_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=False, + collate_fn=lambda batch: generate_batch(batch), + ) + + model = Word2VecBoWTextClassification( + gensim_model.vector_size, args.sent_embedding_dim, args.dropout, args.num_classes + ) + if dist.get_world_size() > 1: + model = paddle.DataParallel(model) + model.train() + + adam = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr, weight_decay=args.weight_decay) + criterion = nn.CrossEntropyLoss() + + for epoch in range(args.epochs): + print("Epoch %d/%d" % (epoch + 1, args.epochs)) + for step, batch_data in enumerate(train_loader, start=1): + ids, label, seq_lens = batch_data + + output = model((ids, seq_lens)) + loss = criterion(output, label) + loss.backward() + adam.step() + adam.clear_grad() + + if step % args.logging_step == 0: + print("step %d, loss %.4f" % (step, float(loss))) + + acc = test(model, test_loader) + print("\ntest acc %.4f\n" % acc) + + +@paddle.no_grad() +def test(model, test_loader): + correct = num = 0 + model.eval() + for batch_data in test_loader: + ids, label, seq_lens = batch_data + + # [batch_size, 2] + output = model((ids, seq_lens)) + + num += label.shape[0] + predict = paddle.argmax(output, axis=1) + label = paddle.cast(label, dtype=predict.dtype) + correct += int(paddle.sum(paddle.cast(predict == label, dtype="int64"))) + model.train() + return correct * 1.0 / num + + +if __name__ == "__main__": + args = parse_args() + train(args) diff --git a/examples/language_model/end_to_end_memory_networks/README.md b/examples/language_model/end_to_end_memory_networks/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d3a2dc040356b9c96c70cef629582db137127dbb --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/README.md @@ -0,0 +1,212 @@ +# End-To-End-Memory-Networks-in-Paddle +## 一、简介 + +用Paddle来复现论文End-To-End Memory Networks + +![模型简介](http://paddle.yulan.net.cn/model_introduction.png) + +本模型是Facebook AI在Memory networks之后提出的一个更加完善的记忆网络模型,在问答系统以及语言模型中均有良好的应用。论文中使用了多个单层单元堆叠而成的多层架构。 + +单层架构如上图a所示,主要的参数包括A,B,C,W四个矩阵,其中A,B,C三个矩阵就是embedding矩阵,主要是将输入文本和Question编码成词向量,W是最终的输出矩阵。从上图可以看出,对于输入的句子s分别会使用A和C进行编码得到Input和Output的记忆模块,Input用来跟Question编码得到的向量相乘得到每句话跟q的相关性,Output则与该相关性进行加权求和得到输出向量。然后再加上q并传入最终的输出层。 + +多层网络如上图b所示,实际上是将多个单层堆叠到一起形成的网络,这里将每一层称为一个hop。 +为了减少参数,模型提出了两种让各个hop之间共享Embedding参数(A与C)的方法: +* Adjacent:这种方法让相邻层之间的$A=C$。也就是说$A_{k+1}=C_{k}$,此外W等于顶层的C,B等于底层的A,这样就减少了一半的参数量。 +* Layer-wise(RNN-like):与RNN相似,采用完全共享参数的方法,即各层之间参数均相等。$A_{1}=A_{2}=...=A_{k}$,$C_{1}=C_{2}=...=C_{k}$。但这样模型的参数太少,性能会受到影响,故提出一种改进方法,在每一层之间加一个线性映射矩阵H,即令$u^{k+1}=H u^{k}+o^{k}$。 + +具体到语言模型,模型做出了一下调整: +1. 
由于输入是单个句子,编码级别是单词级的,所以可以直接将每个单词的词向量存入memory即可,也就是说A与C现在都是单词的Embedding矩阵,mi与ci中都是单个单词的词向量。 +2. 输出W矩阵的output为下一个单词的概率,即输出维度为vocab size。 +3. 不同于QA任务,这里不存在Question,所以直接将q向量设置为全0.1的常量,也不需要再进行Embedding操作。 +4. 采用Layer-wise的参数缩减策略。 +5. 文中提出,对于每一层的u向量中一半的神经元进行ReLU操作,以帮助模型训练。 + +## 二、数据集 + +* Penn Treetank: + + * [Penn Treebank](http://paddle.yulan.net.cn/ptb.zip) + + NLP中常用的PTB语料库,语料来源为1989年华尔街日报,并做以下切分 + + train:887k words + + valid:70k words + + test:78k words + + vocabulary size:10k + + * [text8](http://paddle.yulan.net.cn/text8.zip) + + 来源于enwiki8,总共100M个字符,划分为93.3M/5.7M/1M字符(train/valid/test),将出现次数少于10次的单词替换为 + +## 三、环境依赖 + +* 硬件:GPU +* 框架:Paddle >= 2.0.0,progress库 + +## 四、快速开始 + +下载数据集和已训练好的模型 +```bash +mkdir data +mkdir models +cd data +wget http://paddle.yulan.net.cn/ptb.zip +wget http://paddle.yulan.net.cn/text8.zip +unzip -d ptb ptb.zip +unzip -d text8 text8.zip +cd .. +cd models +wget http://paddle.yulan.net.cn/model_ptb +wget http://paddle.yulan.net.cn/model_text8 +cd .. +``` + +### 训练 + +训练参数可在`config.yaml`文件中调整。 + +Note: 由于本模型受随机因素影响较大,故每次训练的结果差异较大,即使固定随机种子,由于GPU的原因训练结果仍然无法完全一致。 + +#### 在ptb数据集上训练 + +```bash +cp config/config_ptb.yaml config.yaml +python train.py +``` + +#### 寻找最佳模型 + +由于模型受随机因素影响较大,故要进行多次训练来找到最优模型,原论文中在ptb数据集上进行了10次训练,并保留了在test集上表现最好的模型。本复现提供了一个脚本,来进行多次训练以获得能达到足够精度的模型。 + +```bash +cp config/config_ptb.yaml config.yaml +python train_until.py --target 111.0 +``` + +以下是在ptb数据集上进行多次训练以达到目标精度的[log](http://paddle.yulan.net.cn/ptb_train_until.log),可以计算出20轮的平均ppl为113,方差为5.68 + +#### 在text8数据集上训练 + +```bash +cp config/config_text8.yaml config.yaml +python train.py +``` + +### 测试 + +保持`config.yaml`文件与训练时相同 + +``` +python eval.py +``` + +### 使用预训练模型 + +#### ptb数据集上 + +```bash +cp config/config_ptb_test.yaml config.yaml +python eval.py +``` + +将得到以下结果 + +![](http://paddle.yulan.net.cn/test_ptb.png) + +#### text8数据集上 + +```bash +cp config/config_text8_test.yaml config.yaml +python eval.py +``` + +结果如下 + +![](http://paddle.yulan.net.cn/test_text8.png) + +## 五、复现精度 + +相应模型已包含在本repo中,分别位于目录`models_ptb`与`models_text8`下 + +| Dataset | Paper Perplexity | Our Perplexity | +| :-----: | :--------------: | :------------: | +| ptb | 111 | 110.75 | +| text8 | 147 | 145.62 | + +## 六、代码结构详细说明 + +### 6.1 代码结构 + +``` +├── checkpoints +├── config # 配置文件模板 +├── config.yaml +├── README.md +├── requirements.txt +├── config.py +├── model.py +├── data.py +├── train.py # 训练脚本 +├── eval.py # 测试脚本 +├── train_until.py +└── utils.py +``` + +### 6.2 参数说明 + +可以在`config.yaml`中设置以下参数 + +``` +# internal state dimension +edim: 150 +# linear part of the state +lindim: 75 +# number of hops +nhop: 7 +# memory size +mem_size: 200 +# initial internal state value +init_hid: 0.1 +# initial learning rate +init_lr: 0.01 +# weight initialization std +init_std: 0.05 +# clip gradients to this norm +max_grad_norm: 50 + +# batch size to use during training +batch_size: 128 +# number of epoch to use during training +nepoch: 100 + +# data directory +data_dir: "data/ptb" +# checkpoint directory +checkpoint_dir: "checkpoints" +# model name for test and recover train +model_name: "model" +# if True, load model [model_name] before train +recover_train: False +# data set name +data_name: "ptb" +# print progress, need progress module +show: True +# initial random seed +srand: 17814 +# How many epochs output log once +log_epoch: 5 +# Desired ppl +target_ppl: 147 +``` + +### 七、reference +原论文地址:[Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus: “End-To-End Memory Networks”, 2015.](https://arxiv.org/pdf/1503.08895v5.pdf) + 
+复现repo:[yulangz/End-to-End-Memory-Networks-in-Paddle](https://github.com/yulangz/End-to-End-Memory-Networks-in-Paddle) + +参考repo:[https://github.com/facebookarchive/MemNN](https://github.com/facebookarchive/MemNN) + +项目AiStudio地址:[https://aistudio.baidu.com/aistudio/projectdetail/2381004](https://aistudio.baidu.com/aistudio/projectdetail/2381004) diff --git a/examples/language_model/end_to_end_memory_networks/config.py b/examples/language_model/end_to_end_memory_networks/config.py new file mode 100644 index 0000000000000000000000000000000000000000..4e1fcc0eafd2f36619190251dcec075d918638e7 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import yaml + + +class Config(object): + """ + A simple waper for configs + """ + + def __init__(self, config_path: str): + with open(config_path, "r") as f: + self.d = yaml.load(f.read(), Loader=yaml.SafeLoader) + + def __getattribute__(self, key): + d = super(Config, self).__getattribute__("d") + if key in d: + return d[key] + else: + return super(Config, self).__getattribute__(key) diff --git a/examples/language_model/end_to_end_memory_networks/config.yaml b/examples/language_model/end_to_end_memory_networks/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..99ccce66ab526afcae1a507290c17f06c5c80b83 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config.yaml @@ -0,0 +1,40 @@ +# internal state dimension +edim: 150 +# linear part of the state +lindim: 75 +# number of hops +nhop: 7 +# memory size +mem_size: 200 +# initial internal state value +init_hid: 0.1 +# initial learning rate +init_lr: 0.01 +# weight initialization std +init_std: 0.05 +# clip gradients to this norm +max_grad_norm: 50 + +# batch size to use during training +batch_size: 128 +# number of epoch to use during training +nepoch: 100 + +# data directory +data_dir: "data/ptb" +# checkpoint directory +checkpoint_dir: "checkpoints" +# model name for test and recover train +model_name: "model" +# if True, load model [model_name] before train +recover_train: False +# data set name +data_name: "ptb" +# print progress, need progress module +show: True +# initial random seed +srand: 17814 +# How many epochs output log once +log_epoch: 5 +# Desired ppl +target_ppl: 147 \ No newline at end of file diff --git a/examples/language_model/end_to_end_memory_networks/config/config_ptb.yaml b/examples/language_model/end_to_end_memory_networks/config/config_ptb.yaml new file mode 100644 index 0000000000000000000000000000000000000000..620877cbced56072fc57b0401d441619b578e1ee --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_ptb.yaml @@ -0,0 +1,19 @@ +edim: 150 +lindim: 75 +nhop: 7 +mem_size: 200 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/ptb" +checkpoint_dir: "checkpoints" +model_name: "model" 
+recover_train: False +data_name: "ptb" +show: True +srand: 17814 +log_epoch: 5 +target_ppl: 147 diff --git a/examples/language_model/end_to_end_memory_networks/config/config_ptb_test.yaml b/examples/language_model/end_to_end_memory_networks/config/config_ptb_test.yaml new file mode 100644 index 0000000000000000000000000000000000000000..3be5d7c58a1f47e72e65f9d4421efff66e42bb5f --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_ptb_test.yaml @@ -0,0 +1,18 @@ +edim: 150 +lindim: 75 +nhop: 7 +mem_size: 200 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/ptb" +checkpoint_dir: "models" +model_name: "model_ptb" +recover_train: False +data_name: "ptb" +show: True +log_epoch: 5 +target_ppl: 147 diff --git a/examples/language_model/end_to_end_memory_networks/config/config_text8.yaml b/examples/language_model/end_to_end_memory_networks/config/config_text8.yaml new file mode 100644 index 0000000000000000000000000000000000000000..ce9d2b5fb3fbd00dc24d9356f7af5408fade75c7 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_text8.yaml @@ -0,0 +1,19 @@ +edim: 500 +lindim: 250 +nhop: 7 +mem_size: 100 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/text8" +checkpoint_dir: "checkpoints" +model_name: "model" +recover_train: False +data_name: "text8" +show: True +srand: 12345 +log_epoch: 5 +target_ppl: 111 diff --git a/examples/language_model/end_to_end_memory_networks/config/config_text8_test.yaml b/examples/language_model/end_to_end_memory_networks/config/config_text8_test.yaml new file mode 100644 index 0000000000000000000000000000000000000000..04751bfa6803650b6d9b38e6f2d677f14df3f63e --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_text8_test.yaml @@ -0,0 +1,18 @@ +edim: 500 +lindim: 250 +nhop: 7 +mem_size: 100 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/text8" +checkpoint_dir: "models" +model_name: "model_text8" +recover_train: False +data_name: "text8" +show: True +log_epoch: 5 +target_ppl: 147 diff --git a/examples/language_model/end_to_end_memory_networks/data.py b/examples/language_model/end_to_end_memory_networks/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3083cb996348c60a4921a073d8e6fababa58638b --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/data.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + + +def read_data(fname, word2idx): + """ + Data is processed into a one-dimensional vector, and each value is the code corresponding to a word. + The two sentences are separated by special characters < EOS >. 
+ + Args: + fname (str): + data filename + word2idx (dict): + word dict + + Returns: + list: return word vectors + """ + if os.path.isfile(fname): + with open(fname) as f: + lines = f.readlines() + else: + raise (Exception("[!] Data %s not found" % fname)) + + words = [] + for line in lines: + words.extend(line.split()) + + print("Read %s words from %s" % (len(words), fname)) + + data = list() + for line in lines: + for word in line.split(): + index = word2idx[word] + data.append(index) + data.append(word2idx[""]) + return data + + +def load_vocab(fname): + """ + load word dict + + Args: + fname (str): filename of the vocav file + + Returns: + dict: word dict + """ + word2idx = {} + with open(fname, "r") as f: + for line in f: + pair = line.split() + word2idx[pair[0]] = int(pair[1]) + return word2idx + + +def load_data(config): + """ + load data + + Args: + config: config + + Returns: + word dict, and train, valid, test data + """ + vocab_path = os.path.join(config.data_dir, "%s.vocab.txt" % config.data_name) + word2idx = load_vocab(vocab_path) + + train_data = read_data(os.path.join(config.data_dir, "%s.train.txt" % config.data_name), word2idx) + valid_data = read_data(os.path.join(config.data_dir, "%s.valid.txt" % config.data_name), word2idx) + test_data = read_data(os.path.join(config.data_dir, "%s.test.txt" % config.data_name), word2idx) + + return word2idx, train_data, valid_data, test_data diff --git a/examples/language_model/end_to_end_memory_networks/eval.py b/examples/language_model/end_to_end_memory_networks/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..bc3dede1d357c78f6f8eb6d416049df795eb7ce8 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/eval.py @@ -0,0 +1,107 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
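+
+# Evaluation entry point: restores the MemN2N weights named by `checkpoint_dir` and
+# `model_name` in config.yaml and reports test-set perplexity (exp of the average
+# cross-entropy per predicted word).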
+ +import math +import os +from importlib import import_module + +import numpy as np +import paddle +from config import Config +from data import load_data +from model import MemN2N +from paddle import nn + + +@paddle.no_grad() +def eval(model: MemN2N, data, config, mode="Test"): + """ + evaluate the model performance + + Args: + model (MemN2N): the model to be evaluate + data: evaluation data + config: model and eval configs + mode: Valid or Test + + Returns: + average loss + """ + model.eval() + lossfn = nn.CrossEntropyLoss(reduction="sum") + N = int(math.ceil(len(data) / config.batch_size)) + total_loss = 0 + + context = np.ndarray([config.batch_size, config.mem_size], dtype=np.int64) + target = np.ndarray([config.batch_size], dtype=np.int64) + + if config.show: + ProgressBar = getattr(import_module("utils"), "ProgressBar") + bar = ProgressBar(mode, max=N - 1) + + m = config.mem_size + for batch in range(N): + if config.show: + bar.next() + + for i in range(config.batch_size): + if m >= len(data): + break + target[i] = data[m] + context[i, :] = data[m - config.mem_size : m] + m += 1 + if m >= len(data): + break + + batch_data = paddle.to_tensor(context) + batch_label = paddle.to_tensor(target) + + preict = model(batch_data) + loss = lossfn(preict, batch_label) + + total_loss += loss + + if config.show: + bar.finish() + + return total_loss / N / config.batch_size + + +def test(model: MemN2N, test_data, config): + """ + test the model performance + """ + test_loss = eval(model, test_data, config, "Test") + test_perplexity = math.exp(test_loss) + print("Perplexity on Test: %f" % test_perplexity) + + +if __name__ == "__main__": + config = Config("config.yaml") + + if not os.path.exists(config.checkpoint_dir): + os.makedirs(config.checkpoint_dir) + + word2idx, train_data, valid_data, test_data = load_data(config) + idx2word = dict(zip(word2idx.values(), word2idx.keys())) + config.nwords = len(word2idx) + + print("vacab size is %d" % config.nwords) + + model = MemN2N(config) + + model_path = os.path.join(config.checkpoint_dir, config.model_name) + state_dict = paddle.load(model_path) + model.set_dict(state_dict) + test(model, test_data, config) diff --git a/examples/language_model/end_to_end_memory_networks/model.py b/examples/language_model/end_to_end_memory_networks/model.py new file mode 100644 index 0000000000000000000000000000000000000000..8897cbc700026e5e7f6be82c5c176c08e5817b0a --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/model.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
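+
+# End-To-End Memory Network for language modeling. Each hop roughly computes
+#     p      = softmax(q . (A(x) + T_A))   # attention over the mem_size context words
+#     o      = p . (C(x) + T_C)            # weighted sum of the output memory
+#     q_next = H(q) + o                    # ReLU on the last (edim - lindim) units
+# and the final state q is projected by W to vocabulary logits.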
+ +import paddle +from paddle import nn +import numpy as np + + +class MemN2N(nn.Layer): + """ + End to End Memory Networks model + + reference paper: https://arxiv.org/pdf/1503.08895v5.pdf + """ + + def __init__(self, config): + """ + Model initialization + + Args: + config: model configuration, see config.yaml for more detail + """ + super(MemN2N, self).__init__() + self.nwords = config.nwords + self.init_hid = config.init_hid + self.init_std = config.init_std + self.nhop = config.nhop + self.edim = config.edim + self.mem_size = config.mem_size + self.lindim = config.lindim + self.max_grad_norm = config.max_grad_norm + self.batch_size = config.batch_size + + self.checkpoint_dir = config.checkpoint_dir + + normal_attr = paddle.framework.ParamAttr(initializer=paddle.nn.initializer.Normal(std=self.init_std)) + self.A = nn.Embedding(self.nwords, self.edim, weight_attr=normal_attr) + self.C = nn.Embedding(self.nwords, self.edim, weight_attr=normal_attr) + + # Temporal Encoding + self.T_A = nn.Embedding(self.mem_size, self.edim, weight_attr=normal_attr) + self.T_C = nn.Embedding(self.mem_size, self.edim, weight_attr=normal_attr) + + # Linear mapping for q + self.H = nn.Linear(self.edim, self.edim, weight_attr=normal_attr, bias_attr=False) + + # output mapping + self.W = nn.Linear(self.edim, self.nwords, weight_attr=normal_attr, bias_attr=False) + + def forward(self, data): + """ + The shape of data is [batch_size, mem_size], and the content is the id of each word + """ + q = np.ndarray([self.batch_size, self.edim], dtype=np.float32) + q.fill(self.init_hid) + q = paddle.to_tensor(q) + + time = np.ndarray([self.batch_size, self.mem_size], dtype=np.int64) + for i in range(self.mem_size): + time[:, i] = i + time = paddle.to_tensor(time) + + for hop in range(self.nhop): + A_in_c = self.A(data) # [batch_size, mem_size, edim] + A_in_t = self.T_A(time) # [batch_size, mem_size, edim] + A_in = paddle.add(A_in_c, A_in_t) # [batch_size, mem_size, edim] + + q_in = q.reshape([-1, 1, self.edim]) # [batch, 1, edim] + A_out3d = paddle.matmul(q_in, A_in, transpose_y=True) # [batch, 1, mem_size] + A_out2d = A_out3d.reshape([-1, self.mem_size]) + p = nn.functional.softmax(A_out2d) # [batch, mem_size] + + C_in_c = self.C(data) + C_in_t = self.T_C(time) + C_in = paddle.add(C_in_c, C_in_t) # [batch_size, mem_size, edim] + + p_3d = p.reshape([-1, 1, self.mem_size]) # [batch, 1, mem_size] + C_out3d = paddle.matmul(p_3d, C_in) # [batch, 1, edim] + + C_out2d = C_out3d.reshape([-1, self.edim]) # [batch, edim] + + # Linear mapping and addition + q_mapped = self.H(q) + q_out = paddle.add(C_out2d, q_mapped) + + if self.lindim == self.edim: + q = q_out + elif self.lindim == 0: + q = nn.functional.relu(q_out) + else: + F = q_out[:, : self.lindim] + G = q_out[:, self.lindim :] + K = nn.functional.relu(G) + q = paddle.concat([F, K], axis=-1) + + predict = self.W(q) + return predict diff --git a/examples/language_model/end_to_end_memory_networks/requirements.txt b/examples/language_model/end_to_end_memory_networks/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..a5c04145b7381f412fae119d93190da0297f2a22 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/requirements.txt @@ -0,0 +1,2 @@ +progress==1.6 + diff --git a/examples/language_model/end_to_end_memory_networks/train.py b/examples/language_model/end_to_end_memory_networks/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ef1c6e893b0e5d9a8df48fd20163766356118da0 --- /dev/null +++ 
b/examples/language_model/end_to_end_memory_networks/train.py
@@ -0,0 +1,164 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import math
+import os
+import random
+from importlib import import_module
+
+import numpy as np
+import paddle
+from config import Config
+from data import load_data
+from eval import eval
+from model import MemN2N
+from paddle import nn
+
+
+def train_single_epoch(model: MemN2N, lr, data, config):
+    """
+    Train one epoch.
+
+    Args:
+        model (MemN2N): model to be trained
+        lr (float): the learning rate of this epoch
+        data: training data
+        config: configs
+
+    Returns:
+        float: average loss
+    """
+    model.train()
+    N = int(math.ceil(len(data) / config.batch_size))  # total number of training batches
+
+    clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=config.max_grad_norm)
+    optimizer = paddle.optimizer.SGD(learning_rate=lr, parameters=model.parameters(), grad_clip=clip)
+    lossfn = nn.CrossEntropyLoss(reduction="sum")
+
+    total_loss = 0
+
+    if config.show:
+        ProgressBar = getattr(import_module("utils"), "ProgressBar")
+        bar = ProgressBar("Train", max=N)
+
+    for batch in range(N):
+        if config.show:
+            bar.next()
+
+        optimizer.clear_grad()
+        context = np.ndarray([config.batch_size, config.mem_size], dtype=np.int64)
+        target = np.ndarray([config.batch_size], dtype=np.int64)
+        for i in range(config.batch_size):
+            m = random.randrange(config.mem_size, len(data))
+            target[i] = data[m]
+            context[i, :] = data[m - config.mem_size : m]
+
+        batch_data = paddle.to_tensor(context)
+        batch_label = paddle.to_tensor(target)
+
+        predict = model(batch_data)
+        loss = lossfn(predict, batch_label)
+        loss.backward()
+        optimizer.step()
+        total_loss += loss
+
+    if config.show:
+        bar.finish()
+
+    return total_loss / N / config.batch_size
+
+
+def train(model: MemN2N, train_data, valid_data, config):
+    """
+    Train the model.
+
+    Args:
+        model (MemN2N): the model to be trained
+        train_data: training data
+        valid_data: validation data
+        config: model and training configs
+
+    Returns:
+        None
+    """
+    lr = config.init_lr
+
+    train_losses = []
+    train_perplexities = []
+
+    valid_losses = []
+    valid_perplexities = []
+
+    for epoch in range(1, config.nepoch + 1):
+        train_loss = train_single_epoch(model, lr, train_data, config)
+        valid_loss = eval(model, valid_data, config, "Validation")
+
+        info = {"epoch": epoch, "learning_rate": lr}
+
+        # When the loss on the validation set no longer drops, the learning rate is divided by 1.5
+        if len(valid_losses) > 0 and valid_loss > valid_losses[-1] * 0.9999:
+            lr /= 1.5
+
+        train_losses.append(train_loss)
+        train_perplexities.append(math.exp(train_loss))
+
+        valid_losses.append(valid_loss)
+        valid_perplexities.append(math.exp(valid_loss))
+
+        info["train_perplexity"] = train_perplexities[-1]
+        info["validate_perplexity"] = valid_perplexities[-1]
+
+        print(info)
+
+        if epoch % config.log_epoch == 0:
+            save_dir = os.path.join(config.checkpoint_dir, "model_%d" % epoch)
+            paddle.save(model.state_dict(), save_dir)
+            lr_path = os.path.join(config.checkpoint_dir, "lr_%d" % epoch)
+            with open(lr_path, "w") as f:
+                f.write(f"{lr}")
+
+        # to get the target ppl
+        if info["validate_perplexity"] < config.target_ppl:
+            save_dir = os.path.join(config.checkpoint_dir, "model_good")
+            paddle.save(model.state_dict(), save_dir)
+            break
+
+        if lr < 1e-5:
+            break
+
+    save_dir = os.path.join(config.checkpoint_dir, "model")
+    paddle.save(model.state_dict(), save_dir)
+
+
+if __name__ == "__main__":
+    config = Config("config.yaml")
+
+    if not os.path.exists(config.checkpoint_dir):
+        os.makedirs(config.checkpoint_dir)
+
+    word2idx, train_data, valid_data, test_data = load_data(config)
+    idx2word = dict(zip(word2idx.values(), word2idx.keys()))
+    config.nwords = len(word2idx)
+    print("vocab size is %d" % config.nwords)
+
+    np.random.seed(config.srand)
+    random.seed(config.srand)
+    paddle.seed(config.srand)
+
+    model = MemN2N(config)
+    if config.recover_train:
+        model_path = os.path.join(config.checkpoint_dir, config.model_name)
+        state_dict = paddle.load(model_path)
+        model.set_dict(state_dict)
+    train(model, train_data, valid_data, config)
diff --git a/examples/language_model/end_to_end_memory_networks/train_until.py b/examples/language_model/end_to_end_memory_networks/train_until.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebb94a2455b5e3625e69629455e1be8b861242ed
--- /dev/null
+++ b/examples/language_model/end_to_end_memory_networks/train_until.py
@@ -0,0 +1,57 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
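+"""
+Keeps retraining the MemN2N model with a freshly drawn random seed until the
+perplexity on the test set drops below the requested target, then saves that
+run's checkpoint under ``config.checkpoint_dir``. Illustrative invocation
+(``--target`` is the only command-line flag defined below; everything else
+comes from ``config.yaml``)::
+
+    python train_until.py --target 111
+"""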
+
+import argparse
+import os
+import random
+import time
+
+import numpy as np
+import paddle
+from config import Config
+from data import load_data
+from eval import test
+from model import MemN2N
+from train import train
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--target", default=111.0, type=float, help="target perplexity")
+target = parser.parse_args().target
+
+if __name__ == "__main__":
+    config = Config("config.yaml")
+    if not os.path.exists(config.checkpoint_dir):
+        os.makedirs(config.checkpoint_dir)
+
+    word2idx, train_data, valid_data, test_data = load_data(config)
+    idx2word = dict(zip(word2idx.values(), word2idx.keys()))
+    config.nwords = len(word2idx)
+    print("vocab size is %d" % config.nwords)
+
+    while True:
+        random.seed(time.time())
+        config.srand = random.randint(0, 100000)
+
+        np.random.seed(config.srand)
+        random.seed(config.srand)
+        paddle.seed(config.srand)
+
+        model = MemN2N(config)
+        train(model, train_data, valid_data, config)
+
+        test_ppl = test(model, test_data, config)
+        if test_ppl < target:
+            model_path = os.path.join(config.checkpoint_dir, config.model_name + "_" + str(config.srand) + "_good")
+            paddle.save(model.state_dict(), model_path)
+            break
diff --git a/examples/language_model/end_to_end_memory_networks/utils.py b/examples/language_model/end_to_end_memory_networks/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..632c790093d73278d62186c8be06021170e34247
--- /dev/null
+++ b/examples/language_model/end_to_end_memory_networks/utils.py
@@ -0,0 +1,21 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from progress.bar import Bar
+
+
+class ProgressBar(Bar):
+    message = "Loading"
+    fill = "#"
+    suffix = "%(percent).1f%% | ETA: %(eta)ds"
diff --git a/examples/language_model/glm b/examples/language_model/glm
new file mode 100644
index 0000000000000000000000000000000000000000..aa651eb577cf772492f4a9f901237d505fbfa514
--- /dev/null
+++ b/examples/language_model/glm
@@ -0,0 +1 @@
+../../llm/glm
\ No newline at end of file
diff --git a/examples/language_model/gpt b/examples/language_model/gpt
new file mode 100644
index 0000000000000000000000000000000000000000..6ca9896375c102a161bbd5b811122ab1d45949bb
--- /dev/null
+++ b/examples/language_model/gpt
@@ -0,0 +1 @@
+../../model_zoo/gpt/
\ No newline at end of file
diff --git a/examples/language_model/gpt-3/dygraph/build_optimizer.py b/examples/language_model/gpt-3/dygraph/build_optimizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..72445ea975291cee558e67b5a8b9cc6cf7c38876
--- /dev/null
+++ b/examples/language_model/gpt-3/dygraph/build_optimizer.py
@@ -0,0 +1,62 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import inspect + +import paddle +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_optimizers.dygraph_optimizer import ( + DygraphShardingOptimizer, +) + + +def is_new_version_sharding_stage1_optimizer(): + signature_keys = set(inspect.signature(DygraphShardingOptimizer).parameters.keys()) + return "inner_optimizer_class" not in signature_keys + + +def apply(model, args, lr_scheduler, clip, decay_params, strategy): + if args.sharding_stage == 1 and args.sharding_degree > 1 and not is_new_version_sharding_stage1_optimizer(): + # for backward compatibility. + # this call will raise, if sharding stage1 is handled in HybridParallelOptimizer, + # in which case, the logic follows will handle it + optimizer = DygraphShardingOptimizer( + hcg=fleet.get_hybrid_communicate_group(), + user_defined_strategy=strategy, + params=model.parameters(), + inner_optimizer_class=paddle.optimizer.AdamW, + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + multi_precision=args.use_pure_fp16, + ) + else: + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + # TODO: remove 'multi_precision' in definition of optimizer + # and add it to 'paddle.amp.decorate' + multi_precision=args.use_pure_fp16, + ) + return optimizer diff --git a/examples/language_model/gpt-3/dygraph/run_finetune.py b/examples/language_model/gpt-3/dygraph/run_finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..04b41bdf619cf238327c348a9ca629fc840fb787 --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_finetune.py @@ -0,0 +1,561 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
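+"""
+Fine-tunes GPT-3 in dygraph mode with hybrid parallelism: data parallelism,
+tensor (model) parallelism and sharding, optional pure-FP16 (AMP O2) training,
+automatic resumption from the last checkpoint in ``output_dir``, periodic
+evaluation and checkpoint rotation, and a final export of the trained model to
+a static graph via ``paddle.jit``.
+
+Illustrative multi-GPU launch (flag names are indicative only; see ``args.py``
+for the actual arguments)::
+
+    python -m paddle.distributed.launch --gpus "0,1,2,3" run_finetune.py \
+        --mp_degree 2 --dp_degree 2 --sharding_degree 1 --output_dir ./checkpoints
+"""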
+ +import os +import random +import sys +import time +from functools import partial + +import build_optimizer +import numpy as np +import paddle +from args import parse_args +from configuration import GPTConfig +from modeling import GPTLMHeadModel +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from utils import ( + _rotate_checkpoints, + all_gather, + convert_example, + is_dp_group_support_in_group_sharded_parallel, + left_padding, + optimizer_name_suffix, + weight_name_suffix, + wrap_sharding_2_3, +) +from visualdl import LogWriter + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import get_last_checkpoint +from paddlenlp.trainer.training_args import default_logdir +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, + PretrainedModel, +) +from paddlenlp.transformers.model_utils import _add_variant, paddlenlp_load +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTLMHeadModel, GPTTokenizer), + "gpt-cn": (GPTLMHeadModel, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank=0): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 59999 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + 100003 + data_world_rank + tracker = get_rng_state_tracker() + + if "global_seed" not in tracker.states_: + tracker.add("global_seed", global_seed) + if "local_seed" not in tracker.states_: + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, iter_steps, log_writer, global_step, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + iter_step = 0 + iter_steps = sys.maxsize + for eval_step, batch in enumerate(data_loader): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch) + + all_loss.append(float(loss)) + + if (eval_step + 1) % args.accumulate_steps == 0: + iter_step += 1 + else: + continue + + if iter_step >= iter_steps: + break + + average_loss = sum(all_loss) / len(all_loss) + v = paddle.to_tensor(average_loss).detach() + average_loss = all_gather(v) + + if log_writer is not None: + logger.info("--" * 30) + logger.info( + "%s step %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, iter_step, average_loss, iter_step / (time.time() - local_time)) + ) + logger.info("--" * 30) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + + model.train() + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": 1, + "sharding_degree": args.sharding_degree, + } + + # set control in tensor parallel + strategy.tensor_parallel_configs = 
{"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + # global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + # local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + + # Detecting last checkpoint. + last_checkpoint = None + training_args = args + training_args.overwrite_output_dir = False + training_args.resume_from_checkpoint = True + if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + global_step = 0 + if training_args.resume_from_checkpoint and last_checkpoint is not None: + global_step = int(str(last_checkpoint).split("-")[-1]) + + log_writer = None + if dp_rank == 0 and mp_rank == 0 and sharding_rank == 0: + log_writer_path = os.path.join(args.output_dir, default_logdir()) + log_writer = LogWriter(logdir=log_writer_path) + + WEIGHTS_NAME = "model_state.pdparams" + OPTIMIZER_NAME = "optimizer.pdopt" + + if args.mp_degree > 1 or args.sharding_degree > 1: + WEIGHTS_NAME = _add_variant(WEIGHTS_NAME, weight_name_suffix()) + OPTIMIZER_NAME = _add_variant(OPTIMIZER_NAME, optimizer_name_suffix()) + # GPTLMHeadModel using old style save_pretrained + # remove if CLASS using save_pretrained_v2 + logger.info(f"{WEIGHTS_NAME}, {OPTIMIZER_NAME}, {optimizer_name_suffix()}") + if not GPTLMHeadModel.constructed_from_pretrained_config(): + GPTLMHeadModel.resource_files_names = {"model_state": WEIGHTS_NAME} + + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = False + model = GPTLMHeadModel(GPTConfig(**model_config)) + # Create the critrion for the gpt model + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + 
lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + model.set_state_dict( + paddle.load(os.path.join(last_checkpoint, model.resource_files_names["model_state"]), return_numpy=True) + ) + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3] and args.sharding_degree > 1: + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + # decoder_start_token_id=model.config.bos_token_id, + decoder_start_token_id=tokenizer.bos_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + ) + + logger.info("Loading train and dev dataset: %s" % args.dataset_name) + train_set, dev_set = load_dataset(args.dataset_name, splits=["train_v1", "dev_v1"]) + logger.info("Loaded train and dev dataset: %s" % args.dataset_name) + train_set = train_set.map(trans_func, lazy=True) + + # print(train_set[0]) + # exit() + + train_batch_sampler = DistributedBatchSampler( + train_set, + batch_size=args.micro_batch_size, + shuffle=True, + drop_last=True, + num_replicas=data_world_size, + rank=data_world_rank, + ) + + train_data_loader = paddle.io.DataLoader( + dataset=train_set, + batch_sampler=train_batch_sampler, + num_workers=0, + collate_fn=DataCollatorForSeq2Seq( + tokenizer=tokenizer, + padding=True, + max_length=args.max_seq_length, + label_pad_token_id=tokenizer.pad_token_id, + ), + return_list=True, + ) + dev_set = dev_set.map(trans_func, lazy=True) + valid_batch_sampler = paddle.io.BatchSampler(dev_set, batch_size=args.micro_batch_size, shuffle=False) + valid_data_loader = paddle.io.DataLoader( + dataset=dev_set, + batch_sampler=valid_batch_sampler, + num_workers=0, + collate_fn=DataCollatorForSeq2Seq( + tokenizer=tokenizer, + padding=True, + max_length=args.max_seq_length, + label_pad_token_id=tokenizer.pad_token_id, + ), + return_list=True, + ) + + global_step = 0 + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + + if training_args.resume_from_checkpoint and 
last_checkpoint is not None: + optimizer.set_state_dict( + paddlenlp_load( + os.path.join(last_checkpoint, OPTIMIZER_NAME), + map_location="cpu", + ) + ) + global_step = int(str(last_checkpoint).split("-")[-1]) + + _globalstep_last_logged = global_step + if isinstance(train_data_loader.batch_sampler, DistributedBatchSampler): + _globalstep_last_logged = 0 + + tr_loss = paddle.to_tensor(0.0) + loss_global = paddle.to_tensor(0.0) + + for epoch in range(sys.maxsize): + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + if global_step >= args.max_steps: + return + + if _globalstep_last_logged > 0: + _globalstep_last_logged -= 1 + continue + + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. + + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch) + # loss = criterion(preds, labels, loss_mask) + + if args.accumulate_steps > 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_pure_fp16: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss_step = tr_loss_step.detach() + + tr_loss += tr_loss_step + loss_global += loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if args.sharding_degree > 1 and args.sharding_stage in [2, 3]: + if args.dp_degree > 1 and not is_dp_group_support_in_group_sharded_parallel(): + fused_allreduce_gradients(model.parameters(), fleet.get_hybrid_communicate_group()) + + if args.use_pure_fp16: + # scaler.minimize(optimizer, tr_loss) + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + + optimizer.clear_grad() + tr_loss.subtract_(tr_loss) + + global_step += 1 + + # Sync for profile time, delete it may be a little faster + # paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + + if global_step % args.logging_freq == 0: + avg_loss = all_gather(loss_global) / args.logging_freq / args.accumulate_steps + loss_global.subtract_(loss_global) + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step %d, epoch: %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + if log_writer is not None: + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate(args, valid_data_loader, model, args.eval_iters, log_writer, global_step, "valid") + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=True) + + while hasattr(model_to_save, "_layers") or hasattr(model_to_save, "_layer"): + if hasattr(model_to_save, "_layers"): + model_to_save = model_to_save._layers + else: + model_to_save = model_to_save._layer + + output_dir = os.path.join(args.output_dir, "checkpoint-%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + logger.info("Save model to %s" % output_dir) + + # tokenizer only need to save on one node + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + tokenizer.save_pretrained(output_dir) + + # paramerters is the same in sharding group + if sharding_rank == 0 and dp_rank == 0: + if isinstance(model_to_save, PretrainedModel): + model_to_save.save_pretrained(output_dir) + else: + logger.info("Trainer.model is not a `PretrainedModel`, only saving its state dict.") + state_dict = model_to_save.state_dict() + paddle.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + + # ckpt optimizer weight should save on echo sharding rank + if dp_rank == 0: + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + OPTIMIZER_NAME, + ), + ) + + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + _rotate_checkpoints(args.save_total_limit, output_dir=args.output_dir) + + if global_step >= args.max_steps: + return + + reader_start = time.time() + + +def do_export(args): + + if args.do_export: + from utils import merge_model_parallel + + last_checkpoint = get_last_checkpoint(args.output_dir) + from modeling import GPTForGeneration + + from paddlenlp.transformers import GPTConfig + + _, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + config = GPTConfig.from_pretrained(last_checkpoint) + config.fuse_attention_qkv = True + # config.max_predict_len = 8 + config.max_dec_len = 20 + config.eos_token_id = tokenizer.eos_token_id + config.eol_token_id = tokenizer.eol_token_id + config.pad_token_id = tokenizer.eos_token_id + config.use_cache = True + config.top_k = 1 + + model = GPTForGeneration(config) + missing_keys, unexpected_keys = model.set_state_dict(merge_model_parallel(last_checkpoint, config)) + print("missing_keys", missing_keys) + print("unexpected_keys", unexpected_keys) + + # Switch to eval model + model.eval() + # Convert to static graph with specific input description + input_text = ["Nice to meet", "Hello "] + inputs = tokenizer(input_text) + + # input_ids = tokenizer.encode(input_text)['input_ids'] + inputs = tokenizer(input_text) + inputs = left_padding(inputs, tokenizer.bos_token_id) + input_ids = inputs["input_ids"] + + input_ids = paddle.to_tensor(input_ids, dtype="int64") + ret = model(input_ids=input_ids) + + # ret = model.generate(input_ids = data["input_ids"]) + for out_ids, in_txt in zip(ret[0].tolist(), input_text): + print("==" * 30) + print(in_txt + tokenizer.convert_ids_to_string(out_ids)) + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + ], + ) + infer_path = os.path.join(args.output_dir, "infer", f"{args.dataset_name}") 
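+        # The static graph saved below, together with the tokenizer saved alongside it, can be
+        # reloaded for inference, e.g. with ``paddle.jit.load(infer_path)`` (illustrative).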
+ + # Save converted static graph model + paddle.jit.save(model, infer_path) + # Also save tokenizer for inference usage + tokenizer.save_pretrained(os.path.dirname(infer_path)) + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + args.do_export = True + os.environ["softmax_mask_fuse_upper_triangle"] = "False" + do_train(args) + do_export(args) diff --git a/examples/language_model/gpt-3/dygraph/run_glue_mp.py b/examples/language_model/gpt-3/dygraph/run_glue_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..396896166e868c0e33ad278b1affdca00e360cfd --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_glue_mp.py @@ -0,0 +1,585 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import sys +import time +from functools import partial + +import build_optimizer +import numpy as np +import paddle +from args import parse_args +from configuration import GPTConfig +from modeling import GPTForSequenceClassification +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from paddle.metric import Accuracy +from utils import ( + _rotate_checkpoints, + all_gather, + is_dp_group_support_in_group_sharded_parallel, + optimizer_name_suffix, + weight_name_suffix, + wrap_sharding_2_3, +) +from visualdl import LogWriter + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer import get_last_checkpoint +from paddlenlp.trainer.training_args import default_logdir +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, + PretrainedModel, +) +from paddlenlp.transformers.model_utils import _add_variant, paddlenlp_load +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "gpt": (GPTForSequenceClassification, GPTTokenizer), + "gpt-cn": (GPTForSequenceClassification, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank=0): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 59999 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + 100003 + data_world_rank + tracker = get_rng_state_tracker() + + if "global_seed" not in tracker.states_: + tracker.add("global_seed", global_seed) + if "local_seed" not in tracker.states_: + 
tracker.add("local_seed", local_seed) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer( + example["sentence"], padding="max_length", max_length=max_seq_length, return_token_type_ids=False + ) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + padding=True, + max_length=max_seq_length, + return_token_type_ids=False, + ) + + if not is_test: + example["labels"] = label + + return example + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, log_writer, global_step, metric, task_name="valid"): + model.eval() + metric.reset() + local_time = time.time() + iter_steps = sys.maxsize + all_loss = [] + for eval_step, batch in enumerate(data_loader): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch, return_dict=True) + if isinstance(loss, dict): + logits = loss["logits"] + loss = loss["loss"] + correct = metric.compute(logits.detach(), batch["labels"].detach()) + metric.update(correct) + + all_loss.append(float(loss)) + + if eval_step >= iter_steps - 1: + break + + res = metric.accumulate() + + average_loss = sum(all_loss) / len(all_loss) + + logger.info("--" * 30) + if isinstance(metric, AccuracyAndF1): + logger.info( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % (average_loss, res[0], res[1], res[2], res[3], res[4]), + ) + elif isinstance(metric, Mcc): + logger.info( + "eval loss: %f, mcc: %s, " % (average_loss, res[0]), + ) + elif isinstance(metric, PearsonAndSpearman): + logger.info( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (average_loss, res[0], res[1], res[2]), + ) + else: + logger.info("eval loss: %f, acc: %s, " % (average_loss, res)) + + logger.info("--" * 30) + logger.info( + "%s step %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, eval_step + 1, average_loss, (eval_step + 1) / (time.time() - local_time)) + ) + logger.info("--" * 30) + if log_writer is not None: + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + + model.train() + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": 1, + "sharding_degree": args.sharding_degree, + } + + # set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + # global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + # local_rank = 
int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank) + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = GPTTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, + batch_size=args.micro_batch_size, + shuffle=True, + num_replicas=data_world_size, + rank=data_world_rank, + ) + + if args.task_name == "mnli": + dev_ds = load_dataset("glue", args.task_name, splits=["dev_matched"]) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + + dev_ds = dev_ds.map(trans_func, lazy=True) + valid_batch_sampler = paddle.io.BatchSampler( + dev_ds, + batch_size=args.micro_batch_size, + shuffle=False, + ) + + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + num_workers=0, + return_list=True, + collate_fn=DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=args.max_seq_length), + ) + + valid_data_loader = paddle.io.DataLoader( + dataset=dev_ds, + batch_sampler=valid_batch_sampler, + num_workers=0, + return_list=True, + collate_fn=DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=args.max_seq_length), + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + + # Detecting last checkpoint. + last_checkpoint = None + training_args = args + training_args.overwrite_output_dir = False + training_args.resume_from_checkpoint = True + if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + global_step = 0 + if training_args.resume_from_checkpoint and last_checkpoint is not None: + global_step = int(str(last_checkpoint).split("-")[-1]) + # Define log writer + log_writer = None + if dp_rank == 0 and mp_rank == 0 and sharding_rank == 0: + log_writer_path = os.path.join(args.output_dir, default_logdir()) + log_writer = LogWriter(log_writer_path) + + WEIGHTS_NAME = "model_state.pdparams" + OPTIMIZER_NAME = "optimizer.pdopt" + + if args.mp_degree > 1 or args.sharding_degree > 1: + WEIGHTS_NAME = _add_variant(WEIGHTS_NAME, weight_name_suffix()) + OPTIMIZER_NAME = _add_variant(OPTIMIZER_NAME, optimizer_name_suffix()) + # GPTForSequenceClassification using old style save_pretrained + # remove if CLASS using save_pretrained_v2 + logger.info(f"{WEIGHTS_NAME}, {OPTIMIZER_NAME}, {optimizer_name_suffix()}") + if not GPTForSequenceClassification.constructed_from_pretrained_config(): + GPTForSequenceClassification.resource_files_names = {"model_state": WEIGHTS_NAME} + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = args.fuse_transformer + model = GPTForSequenceClassification(GPTConfig(**model_config)) + + else: + model = GPTForSequenceClassification.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + num_partitions=args.mp_degree, + use_recompute=args.use_recompute, + enable_fuse_transformer=False, + num_labels=num_classes, + ) + + metric = metric_class() + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + model.set_state_dict( + paddle.load(os.path.join(last_checkpoint, model.resource_files_names["model_state"]), return_numpy=True) + ) + + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3] and args.sharding_degree > 1: + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + optimizer.set_state_dict( + paddlenlp_load( + os.path.join(last_checkpoint, OPTIMIZER_NAME), + map_location="cpu", + ) + ) + + _globalstep_last_logged = global_step + tr_loss = paddle.to_tensor(0.0) + loss_global = paddle.to_tensor(0.0) + + if _globalstep_last_logged > args.max_steps: + return + for epoch in range(sys.maxsize): + train_data_loader.batch_sampler.set_epoch(epoch) + for step, batch in enumerate(train_data_loader): + train_reader_cost += time.time() - reader_start + train_start = time.time() + if _globalstep_last_logged > 0: + _globalstep_last_logged -= 1 + continue + + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. 
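+            # Under AMP O2, ops in custom_black_list (softmax with cross entropy, elementwise_div)
+            # are kept in FP32 for numerical stability, while the fused attention/feed-forward
+            # kernels in custom_white_list are allowed to run in FP16.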
+ with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch) + if isinstance(loss, tuple): + loss = loss[0] + + if args.accumulate_steps > 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_pure_fp16: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss_step = tr_loss_step.detach() + + tr_loss += tr_loss_step + loss_global += loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if args.sharding_degree > 1 and args.sharding_stage in [2, 3]: + if args.dp_degree > 1 and not is_dp_group_support_in_group_sharded_parallel(): + fused_allreduce_gradients(model.parameters(), fleet.get_hybrid_communicate_group()) + + if args.use_pure_fp16: + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + + optimizer.clear_grad() + tr_loss.subtract_(tr_loss) + global_step += 1 + + # Sync for profile time, delete it may be a little faster + paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + + if global_step % args.logging_freq == 0: + avg_loss = all_gather(loss_global) / args.logging_freq / args.accumulate_steps + loss_global.subtract_(loss_global) + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step: %d, epoch: %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + if log_writer is not None: + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate(args, valid_data_loader, model, log_writer, global_step, metric, "valid") + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=True) + + while hasattr(model_to_save, "_layers") or hasattr(model_to_save, "_layer"): + if hasattr(model_to_save, "_layers"): + model_to_save = model_to_save._layers + else: + model_to_save = model_to_save._layer + + output_dir = os.path.join(args.output_dir, "checkpoint-%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + + logger.info("Save model to %s" % output_dir) + + # tokenizer only need to save on one node + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + tokenizer.save_pretrained(output_dir) + + # paramerters is the same in sharding group + if sharding_rank == 0 and dp_rank == 0: + if isinstance(model_to_save, PretrainedModel): + model_to_save.save_pretrained(output_dir) + else: + logger.info("Trainer.model is not a `PretrainedModel`, only saving its state dict.") + state_dict = model_to_save.state_dict() + paddle.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + + # ckpt optimizer weight should save on echo sharding rank + if dp_rank == 0: + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + OPTIMIZER_NAME, + ), + ) + + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + _rotate_checkpoints(args.save_total_limit, output_dir=args.output_dir) + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step >= args.max_steps: + return + + reader_start = time.time() + + +def do_export(args): + if args.do_export: + from utils import merge_model_parallel + + last_checkpoint = get_last_checkpoint(args.output_dir) + from paddlenlp.transformers import GPTConfig, GPTForSequenceClassification + + _, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + config = GPTConfig.from_pretrained(last_checkpoint) + config.fuse_attention_qkv = True + model = GPTForSequenceClassification(config) + missing_keys, unexpected_keys = model.set_state_dict(merge_model_parallel(last_checkpoint, config)) + print("missing_keys", missing_keys) + print("unexpected_keys", unexpected_keys) + # print(train_ds[0]) + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + ], + ) + infer_path = os.path.join(args.output_dir, "infer", f"{args.task_name}") + + # Save converted static graph model + paddle.jit.save(model, infer_path) + # # Also save tokenizer for inference usage + tokenizer.save_pretrained(os.path.dirname(infer_path)) + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + args.do_export = True + do_train(args) + do_export(args) diff --git a/examples/language_model/gpt-3/dygraph/run_pretrain.py b/examples/language_model/gpt-3/dygraph/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..4562f415018f61a88ac383b399e00c6a8c50a8d0 --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_pretrain.py @@ -0,0 +1,515 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import sys +import time + +import build_optimizer +import numpy as np +import paddle +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from paddle.distributed.sharding import group_sharded_parallel +from visualdl import LogWriter + +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + +# to import data_tools +filepath = os.path.abspath(os.path.dirname(__file__)) +sys.path.insert(0, os.path.join(filepath, "../")) +# import lr # noqa e402 +from args import parse_args # noqa e402 +from configuration import GPTConfig +from dataset import create_pretrained_dataset # noqa e402 +from modeling import ( # noqa e402 + GPTForPretraining, + GPTForPretrainingPipe, + GPTPretrainingCriterion, +) + +MODEL_CLASSES = { + "gpt": (GPTForPretraining, GPTTokenizer), + "gpt-cn": (GPTForPretraining, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank): + assert args.device != "cpu" + + basic_seed = basic_seed * 1000 + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 123 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + data_world_rank + tracker = get_rng_state_tracker() + tracker.add("global_seed", global_seed) + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, criterion, iter_steps, log_writer, global_step, epoch, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + for eval_step, batch in enumerate(data_loader): + tokens, loss_mask, position_ids, labels = batch + if args.pp_degree < 2: + preds = model(tokens, position_ids) + loss = criterion(preds, labels, loss_mask) + else: + data = [(tokens, position_ids), (labels, loss_mask)] + loss = model.eval_batch(data, compute_loss=True) + + all_loss.append(float(loss)) + if eval_step >= iter_steps - 1: + break + + average_loss = sum(all_loss) / len(all_loss) + logger.info( + "%s step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, epoch, eval_step, average_loss, iter_steps / (time.time() - local_time)) + ) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + model.train() + + +def get_train_data_file(args): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and str(f).endswith("_idx.npz")) + ] + files = [x.replace("_idx.npz", "") for x in files] + if len(files) == 0: + logger.warning( + "Not found 
dataset with name of xxx_ids.npy and xxx_idx.npz! Try to found old compatible xxx_ids.npz file." + ) + else: + return files + + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and str(f).endswith("_ids.npz")) + ] + + files = [x.replace("_ids.npz", "") for x in files] + return files + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "sharding_degree": args.sharding_degree, + } + + accumulate_steps = args.local_batch_size // args.micro_batch_size + strategy.pipeline_configs = {"accumulate_steps": accumulate_steps, "micro_batch_size": args.micro_batch_size} + + # set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + pp_rank = hcg.get_stage_id() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + # sharding stage2/3 not support hybrid parallel now + if args.sharding_stage in [2, 3]: + assert args.mp_degree == args.pp_degree == 1, "sharding stage2/3 will support tensor/pipeline parallel later" + dp_group = hcg.get_data_parallel_group() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank, pp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Define log writer + log_writer_path = os.path.join( + args.output_dir, + "train_log", + "{}_globalbsz_{}_pure_fp16_{}_recompute_{}_card_{}".format( + args.model_name_or_path, args.global_batch_size, args.use_pure_fp16, False, global_rank + ).lower(), + ) + + if os.path.exists(log_writer_path): + import shutil + + shutil.rmtree(log_writer_path) + + log_writer = LogWriter(log_writer_path) + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if args.mp_degree > 1: + GPTForPretraining.resource_files_names = {"model_state": "model_state_mp_{:0>2d}.pdparams".format(mp_rank)} + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = args.fuse_transformer + if args.pp_degree == 1: + model = GPTForPretraining(GPTConfig(**model_config)) + else: + topology = hcg.topology() + model = GPTForPretrainingPipe(GPTConfig(**model_config), topology) + else: + model = GPTForPretraining.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + 
attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the critrion for the gpt model + criterion = GPTPretrainingCriterion() + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + + # decorate @to_static for benchmark, skip it by default. + if args.to_static: + specs = None + paddle.jit.ignore_module([os]) + model = paddle.jit.to_static(model, input_spec=specs) + logger.info("Successfully to apply @to_static with specs: {}".format(specs)) + + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3]: + if args.dp_degree > 1: + from paddle.distributed.parallel import sync_params_buffers + + sync_params_buffers(model, comm_group=dp_group, src_rank=dp_group.ranks[0]) + + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args.sharding_offload) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + opt_path = os.path.join(args.model_name_or_path, "model_state.pdopt") + if os.path.exists(opt_path): + opt_dict = paddle.load(opt_path) + optimizer.set_state_dict(opt_dict) + else: + logger.warning("No optimizer checkpoint file found in %s." 
% opt_path) + + global_step = 0 + # tic_train = time.time() + for epoch in range(args.num_train_epochs): + files = get_train_data_file(args) + files.sort() + num_files = len(files) + for f_id in range(num_files): + data_file = files[f_id] + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + [data_file], + local_rank=local_rank, + data_world_size=data_world_size, + data_world_rank=data_world_rank, + max_seq_len=args.max_seq_len, + eos_id=tokenizer.eos_token_id, + old_version_accumulate_compatible=True, + ) + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + global_step += 1 + tokens, loss_mask, position_ids, labels = batch + + loss_mask.stop_gradient = True + labels.stop_gradient = True + position_ids.stop_gradient = True + + if args.pp_degree == 1: + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. + loss = 0.0 + for i in range(accumulate_steps): + start_index = i * args.micro_batch_size + end_index = start_index + args.micro_batch_size + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["reduce_sum", "c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + preds = model(tokens[start_index:end_index, :], position_ids[start_index:end_index, :]) + loss_mbs = criterion( + preds, labels[start_index:end_index, :], loss_mask[start_index:end_index, :] + ) + loss_mbs = loss_mbs / accumulate_steps + if args.use_pure_fp16: + scaler.scale(loss_mbs).backward() + else: + loss_mbs.backward() + loss = loss + loss_mbs + + if args.sharding_stage in [2, 3] and args.dp_degree > 1: + fused_allreduce_gradients(model.parameters(), hcg) + if args.sharding_stage == 3: + for p in model.parameters(): + if hasattr(p, "bw_storage"): + assert p.grad is None, "This case shouldn't happen." 
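+                                    # Average the stage-3 sharded gradient storage across the
+                                    # data-parallel group: scale by 1/nranks, then sum via all_reduce.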
+ p.bw_storage.scale_(1.0 / dp_group.nranks) + paddle.distributed.all_reduce(p.bw_storage, group=dp_group) + + if args.use_pure_fp16: + if args.sharding_stage in [2, 3]: + scaler.step(optimizer) + scaler.update() + else: + scaler.minimize(optimizer, loss) + else: + optimizer.step() + + if lr_scheduler is not None: + lr_scheduler.step() + + optimizer.clear_grad() + + else: + data = [(tokens, position_ids), (labels, loss_mask)] + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["reduce_sum", "c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model.train_batch( + data, + optimizer=optimizer, + lr_scheduler=lr_scheduler, + scaler=scaler if args.use_pure_fp16 else None, + ) + + # Sync for profile time, delete it may be a little faster + paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + # Profile for model benchmark + profiler.add_profiler_step(args.profiler_options) + + if global_step % args.logging_freq == 0: + avg_loss = loss.numpy() + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + step, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if args.check_accuracy: + if global_step >= args.max_steps: + return + else: + continue + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate( + args, + valid_data_loader, + model, + criterion, + args.eval_iters, + log_writer, + global_step, + epoch, + "valid", + ) + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + output_dir = os.path.join(args.output_dir, "step_%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + + logger.info("Save model to %s" % output_dir) + + if args.pp_degree > 1: + if mp_rank == 0 and sharding_rank == 0 and pp_rank == 0: + tokenizer.save_pretrained(output_dir) + model_to_save.save_state_dict(output_dir) + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + "model_state_mp_{:0>2d}_sharding_{:0>2d}_pp_{:0>2d}.pdopt".format( + mp_rank, sharding_rank, pp_rank + ), + ), + ) + else: + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=False) + if mp_rank == 0 and sharding_rank == 0: + tokenizer.save_pretrained(output_dir) + model_to_save.save_pretrained(output_dir) + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + "model_state_mp_{:0>2d}_sharding_{:0>2d}.pdopt".format(mp_rank, sharding_rank), + ), + ) + + if global_step >= args.max_steps: + run_evaluate( + args, + test_data_loader, + model, + criterion, + args.test_iters, + log_writer, + global_step, + epoch, + "test", + ) + logger.info("The training process is complete.") + del train_data_loader + return + + reader_start = time.time() + + del train_data_loader + + +def wrap_sharding_2_3(model, optimizer, scaler, sharding_offload): + group = fleet.get_hybrid_communicate_group().get_sharding_parallel_group() + level = "p_g_os" if args.sharding_stage == 3 else "os_g" + return group_sharded_parallel( + model=model, optimizer=optimizer, level=level, scaler=scaler, group=group, offload=sharding_offload + ) + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + do_train(args) diff --git a/examples/language_model/gpt-3/dygraph/run_pretrain_mp.py b/examples/language_model/gpt-3/dygraph/run_pretrain_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..5124d5a6e236a477cab112c6ef4e210c4e41612a --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_pretrain_mp.py @@ -0,0 +1,456 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
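+#
+# run_pretrain_mp.py pretrains GPT in dygraph mode with hybrid data / tensor
+# (model) / sharding parallelism (pp_degree is fixed to 1), and can resume
+# training from the last checkpoint found under --output_dir.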
+ +import os +import random +import time + +import build_optimizer +import numpy as np +import paddle +from args import parse_args +from configuration import GPTConfig +from dataset import create_pretrained_dataset +from modeling import GPTForPretraining, GPTPretrainingCriterion +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from run_pretrain import get_train_data_file +from utils import ( + _rotate_checkpoints, + all_gather, + is_dp_group_support_in_group_sharded_parallel, + optimizer_name_suffix, + weight_name_suffix, + wrap_sharding_2_3, +) +from visualdl import LogWriter + +from paddlenlp.trainer import get_last_checkpoint +from paddlenlp.trainer.training_args import default_logdir +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, + PretrainedModel, +) +from paddlenlp.transformers.model_utils import _add_variant, paddlenlp_load +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTForPretraining, GPTTokenizer), + "gpt-cn": (GPTForPretraining, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank=0): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 59999 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + 100003 + data_world_rank + tracker = get_rng_state_tracker() + + if "global_seed" not in tracker.states_: + tracker.add("global_seed", global_seed) + if "local_seed" not in tracker.states_: + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, criterion, iter_steps, log_writer, global_step, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + iter_step = 0 + for eval_step, batch in enumerate(data_loader): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + tokens, loss_mask, position_ids, labels = batch + preds = model(tokens, position_ids) + loss = criterion(preds, labels, loss_mask) + + all_loss.append(float(loss)) + + if (eval_step + 1) % args.accumulate_steps == 0: + iter_step += 1 + else: + continue + + if iter_step >= iter_steps: + break + + average_loss = sum(all_loss) / len(all_loss) + v = paddle.to_tensor(average_loss).detach() + average_loss = all_gather(v) + + if log_writer is not None: + logger.info("--" * 30) + logger.info( + "%s step %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, iter_steps, average_loss, iter_steps / (time.time() - local_time)) + ) + logger.info("--" * 30) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + + model.train() + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": 1, + "sharding_degree": args.sharding_degree, + } + + # set control in tensor parallel + 
strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + # global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + + # Detecting last checkpoint. + last_checkpoint = None + training_args = args + training_args.overwrite_output_dir = False + training_args.resume_from_checkpoint = True + if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + global_step = 0 + if training_args.resume_from_checkpoint and last_checkpoint is not None: + global_step = int(str(last_checkpoint).split("-")[-1]) + + log_writer = None + if dp_rank == 0 and mp_rank == 0 and sharding_rank == 0: + log_writer_path = os.path.join(args.output_dir, default_logdir()) + log_writer = LogWriter(logdir=log_writer_path) + + WEIGHTS_NAME = "model_state.pdparams" + OPTIMIZER_NAME = "optimizer.pdopt" + + if args.mp_degree > 1 or args.sharding_degree > 1: + WEIGHTS_NAME = _add_variant(WEIGHTS_NAME, weight_name_suffix()) + OPTIMIZER_NAME = _add_variant(OPTIMIZER_NAME, optimizer_name_suffix()) + # GPTForPretraining using old style save_pretrained + # remove if CLASS using save_pretrained_v2 + logger.info(f"{WEIGHTS_NAME}, {OPTIMIZER_NAME}, {optimizer_name_suffix()}") + if not GPTForPretraining.constructed_from_pretrained_config(): + GPTForPretraining.resource_files_names = {"model_state": WEIGHTS_NAME} + + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = False + model = GPTForPretraining(GPTConfig(**model_config)) + # Create the critrion for the gpt model + criterion = GPTPretrainingCriterion() + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if 
args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + model.set_state_dict( + paddle.load(os.path.join(last_checkpoint, model.resource_files_names["model_state"]), return_numpy=True) + ) + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3] and args.sharding_degree > 1: + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + files = get_train_data_file(args) + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + files, + local_rank=local_rank, + data_world_size=data_world_size, + data_world_rank=data_world_rank, + max_seq_len=args.max_seq_len, + eos_id=tokenizer.eos_token_id, + current_step=global_step, + ) + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + global_step = 0 + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + optimizer.set_state_dict( + paddlenlp_load( + os.path.join(last_checkpoint, OPTIMIZER_NAME), + map_location="cpu", + ) + ) + global_step = int(str(last_checkpoint).split("-")[-1]) + + _globalstep_last_logged = global_step + if isinstance(train_data_loader.batch_sampler, DistributedBatchSampler): + _globalstep_last_logged = 0 + + tr_loss = paddle.to_tensor(0.0) + loss_global = paddle.to_tensor(0.0) + + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + if _globalstep_last_logged > 0: + _globalstep_last_logged -= 1 + continue + + tokens, loss_mask, position_ids, labels = batch + + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. 
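+        # Each loop iteration processes one micro batch: the loss is divided by
+        # accumulate_steps (when it is > 1), and optimizer.step() only runs once
+        # every accumulate_steps micro batches (see the (step + 1) % accumulate_steps check below).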
+ + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + preds = model(tokens, position_ids) + loss = criterion(preds, labels, loss_mask) + + if args.accumulate_steps > 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_pure_fp16: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss_step = tr_loss_step.detach() + + tr_loss += tr_loss_step + loss_global += loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if args.sharding_degree > 1 and args.sharding_stage in [2, 3]: + if args.dp_degree > 1 and not is_dp_group_support_in_group_sharded_parallel(): + fused_allreduce_gradients(model.parameters(), fleet.get_hybrid_communicate_group()) + + if args.use_pure_fp16: + # scaler.minimize(optimizer, tr_loss) + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + + optimizer.clear_grad() + tr_loss.subtract_(tr_loss) + + global_step += 1 + + # Sync for profile time, delete it may be a little faster + # paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + + if global_step % args.logging_freq == 0: + avg_loss = all_gather(loss_global) / args.logging_freq / args.accumulate_steps + loss_global.subtract_(loss_global) + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + if log_writer is not None: + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate(args, valid_data_loader, model, criterion, args.eval_iters, log_writer, global_step, "valid") + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=True) + + while hasattr(model_to_save, "_layers") or hasattr(model_to_save, "_layer"): + if hasattr(model_to_save, "_layers"): + model_to_save = model_to_save._layers + else: + model_to_save = model_to_save._layer + + output_dir = os.path.join(args.output_dir, "checkpoint-%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + logger.info("Save model to %s" % output_dir) + + # tokenizer only need to save on one node + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + tokenizer.save_pretrained(output_dir) + + # paramerters is the same in sharding group + if sharding_rank == 0 and dp_rank == 0: + if isinstance(model_to_save, PretrainedModel): + model_to_save.save_pretrained(output_dir) + else: + logger.info("Trainer.model is not a `PretrainedModel`, only saving its state dict.") + state_dict = model_to_save.state_dict() + paddle.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + + # ckpt optimizer weight should save on echo sharding rank + if dp_rank == 0: + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + OPTIMIZER_NAME, + ), + ) + + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + _rotate_checkpoints(args.save_total_limit, output_dir=args.output_dir) + + if global_step >= args.max_steps: + return + + reader_start = time.time() + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + do_train(args) diff --git a/examples/language_model/llama b/examples/language_model/llama new file mode 100644 index 0000000000000000000000000000000000000000..00841636eed14207239457be205fd4787349653d --- /dev/null +++ b/examples/language_model/llama @@ -0,0 +1 @@ +../../llm/llama \ No newline at end of file diff --git a/examples/language_model/luke/README.md b/examples/language_model/luke/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a66f93035b528413d337c263bbfd299c14267b8e --- /dev/null +++ b/examples/language_model/luke/README.md @@ -0,0 +1,91 @@ +# LUKE with PaddleNLP + +[LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) + +**模型简介:** +许多NLP任务都涉及实体,例如:关系分类、实体类型、命名实体识别(NER)和问答(QA)。解决此类实体相关任务的关键是学习实体有效表示。传统的实体表示为每个实体分配一个固定的Embedding向量,该向量将有关实体的信息存储在知识库(KB)中。它们需要实体链接(entity linking)来表示文本中的实体,而不能表示KB中不存在的实体。 + +相比之下,基于contextualized word representations(CWRs) transformer的大型预训练模型,如BERT和RoBERTa,提供了基于语言建模的有效通用词语表征。然而,由于以下两个原因,CWRs的体系结构不适合表示实体: + +- 由于CWR不输出实体的跨级(span-level)表示,因此它们通常需要学习如何基于通常较小的下游数据集计算此类表征。 + +- 许多与实体相关的任务,如关系分类和问答(QA)涉及实体之间关系的推理。尽管transformer可以通过使用self-attention机制将单词相互关联来捕捉单词之间的复杂关系。在实体之间执行关系推理是困难的,因为许多实体在模型中被分割成多个词。此外,基于单词的CWRs预训练任务不适合学习实体的表征,因为在实体中预测一个被MASK的单词,例如预测“Rings”, 给予句子“The Lord of the [MASK]”,一个完整的实体就这样被拆分。 + +LUKE和现有CWRs之间的一个重要区别在于,它不仅将单词视为独立的token,还将实体视为独立的token,并使用transformer计算所有token的中间表征和输出表征。由于实体被视为token,LUKE可以直接建模实体之间的关系。 +本项目是 LUKE 在 Paddle 2.x上的开源实现。 + +## 快速开始 + +### 下游任务微调 + +数据集 +下载Open Entity数据集 +[下载地址](https://cloud.tsinghua.edu.cn/f/6ec98dbd931b4da9a7f0/) +把下载好的文件解压,并把解压后的Open Entity目录下的`train.json`、`test.json`和`dev.json`分别为训练集、验证集和测试集 + 
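+下面给出一个可供参考的数据准备示例(仅为示意:假设压缩包保存为 `OpenEntity.zip`、解压后的目录名为 `Open Entity`,目标目录 `data/` 与下文 Open Entity 微调命令中的 `--data_dir data/` 保持一致,请按实际下载内容调整):
+
+```shell
+# 解压并整理 Open Entity 数据集(压缩包名与目录名仅为示例)
+unzip OpenEntity.zip
+mkdir -p data
+cp "Open Entity"/train.json "Open Entity"/dev.json "Open Entity"/test.json data/
+```
+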
+下载SQuAD1.1数据集,主流机器阅读理解数据集 +[下载地址](https://data.deepai.org/squad1.1.zip) + +#### 1、SQuAD1.1 +以SQuAD1.1数据集为例 + +运行以下两个命令即可训练并评估LUKE在SQuAD1.1数据集的精度 + +```shell +python -m paddle.distributed.launch examples/language_model/luke/run_squad.py + --model_type luke \ + --device gpu \ + --learning_rate 15e-6 \ + --num_train_epochs 2 \ + --batch_size 8 \ + --do_predict \ + --do_train \ + --model_name_or_path luke-large +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持`luke` +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示需要训练的epoch数量 +- `do_train` 表示是否开启训练 +- `do_predict` 表示是否开启评估 +- `model_name_or_path` 模型的名称和路径,支持`luke-base` 和 `luke-large` + +训练结束后模型会对模型进行评估,其评估在验证集上完成, 训练完成后你将看到如下结果: +```text +{"exact_match": 89.75691579943235, "f1": 94.95702001984502} +``` + +#### 2、Open Entity + +```shell +python -m paddle.distributed.launch examples/language_model/luke/run_open_entity.py \ + --model_type luke-large \ + --data_dir data/ \ + --output_dir output/ \ + --device gpu \ + --learning_rate 1e-5 \ + --num_train_epochs 3 \ + --train_batch_size 2 +``` +训练结束后模型会对模型进行评估,其评估在测试集上完成, 训练完成后你将看到如下结果: +```text +Results: { + "test_f1": 0.7815726767275616, + "test_precision": 0.7880405766150561, + "test_recall": 0.7752100840336135 +} +``` + + +# Reference + +```bibtex +@inproceedings{yamada2020luke, + title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention}, + author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto}, + booktitle={EMNLP}, + year={2020} +} +``` diff --git a/examples/language_model/luke/args.py b/examples/language_model/luke/args.py new file mode 100644 index 0000000000000000000000000000000000000000..f736916e938d3827384b20b41e64e2cc9b1df801 --- /dev/null +++ b/examples/language_model/luke/args.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default="bert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument( + "--version_2_with_negative", + action="store_true", + help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + args = parser.parse_args() + return args diff --git a/examples/language_model/luke/open_entity_processor.py b/examples/language_model/luke/open_entity_processor.py new file mode 100644 index 0000000000000000000000000000000000000000..a0f4739058e428cdea96b4c5a39ff174bde2ca7b --- /dev/null +++ b/examples/language_model/luke/open_entity_processor.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os + +from tqdm import tqdm + +ENTITY_TOKEN = "[ENTITY]" + + +class InputExample(object): + def __init__(self, id_, text, span, labels): + self.id = id_ + self.text = text + self.span = span + self.labels = labels + + +class InputFeatures(object): + def __init__( + self, + word_ids, + word_segment_ids, + word_attention_mask, + entity_ids, + entity_position_ids, + entity_segment_ids, + entity_attention_mask, + labels, + ): + self.word_ids = word_ids + self.word_segment_ids = word_segment_ids + self.word_attention_mask = word_attention_mask + self.entity_ids = entity_ids + self.entity_position_ids = entity_position_ids + self.entity_segment_ids = entity_segment_ids + self.entity_attention_mask = entity_attention_mask + self.labels = labels + + +class DatasetProcessor(object): + def get_train_examples(self, data_dir): + return self._create_examples(data_dir, "train") + + def get_dev_examples(self, data_dir): + return self._create_examples(data_dir, "dev") + + def get_test_examples(self, data_dir): + return self._create_examples(data_dir, "test") + + def get_label_list(self, data_dir): + labels = set() + for example in self.get_train_examples(data_dir): + labels.update(example.labels) + return sorted(labels) + + def _create_examples(self, data_dir, set_type): + with open(os.path.join(data_dir, set_type + ".json"), "r") as f: + data = json.load(f) + return [ + InputExample(i, item["sent"], (item["start"], item["end"]), item["labels"]) for i, item in enumerate(data) + ] + + +def convert_examples_to_features(examples, label_list, tokenizer, max_mention_length): + label_map = {label: i for i, label in enumerate(label_list)} + + conv_tables = ( + ("-LRB-", "("), + ("-LCB-", "("), + ("-LSB-", "("), + ("-RRB-", ")"), + ("-RCB-", ")"), + ("-RSB-", ")"), + ) + features = [] + for example in tqdm(examples): + + def preprocess_and_tokenize(text, start, end=None): + target_text = text[start:end].rstrip() + for a, b in conv_tables: + target_text = target_text.replace(a, b) + + return tokenizer.tokenize(target_text, add_prefix_space=True) + + tokens = [tokenizer.cls_token] + tokens += preprocess_and_tokenize(example.text, 0, example.span[0]) + mention_start = len(tokens) + tokens.append(ENTITY_TOKEN) + tokens += preprocess_and_tokenize(example.text, example.span[0], example.span[1]) + tokens.append(ENTITY_TOKEN) + mention_end = len(tokens) + + tokens += preprocess_and_tokenize(example.text, example.span[1]) + tokens.append(tokenizer.sep_token) + + word_ids = tokenizer.convert_tokens_to_ids(tokens) + word_attention_mask = [1] * len(tokens) + word_segment_ids = [0] * len(tokens) + + entity_ids = [2, 0] + entity_attention_mask = [1, 0] + entity_segment_ids = [0, 0] + entity_position_ids = list(range(mention_start, mention_end))[:max_mention_length] + entity_position_ids += [-1] * (max_mention_length - mention_end + mention_start) + entity_position_ids = [entity_position_ids, [-1] * max_mention_length] + + labels = [0] * len(label_map) + + for label in example.labels: + labels[label_map[label]] = 1 + + features.append( + InputFeatures( + word_ids=word_ids, + 
word_segment_ids=word_segment_ids, + word_attention_mask=word_attention_mask, + entity_ids=entity_ids, + entity_position_ids=entity_position_ids, + entity_segment_ids=entity_segment_ids, + entity_attention_mask=entity_attention_mask, + labels=labels, + ) + ) + + return features diff --git a/examples/language_model/luke/run_open_entity.py b/examples/language_model/luke/run_open_entity.py new file mode 100644 index 0000000000000000000000000000000000000000..edd06060341b9abb7389a9f9941ecadcafb2be73 --- /dev/null +++ b/examples/language_model/luke/run_open_entity.py @@ -0,0 +1,294 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import logging +import os + +import numpy as np +import paddle +import paddle.nn.functional as F +from open_entity_processor import DatasetProcessor, convert_examples_to_features +from paddle.io import DataLoader, Dataset +from paddle.optimizer import AdamW +from tqdm import tqdm + +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + LukeForEntityClassification, + LukeTokenizer, +) + +ENTITY_TOKEN = "[ENTITY]" + +parser = argparse.ArgumentParser(description="LUKE FOR OPEN ENTITY") + +parser.add_argument( + "--output_dir", type=str, required=True, help="Use to store all outputs during training and evaluation." +) +parser.add_argument("--data_dir", type=str, required=True, help="Dataset folder") +parser.add_argument("--num_train_epochs", type=int, default=2, help="Number of training cycles") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--batch_size", type=int, default=8, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", type=str, default="gpu", help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--gradient_accumulation_steps", type=int, default=3, help="Gradient accumulated before each parameter update." 
+) +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay if we apply some") +parser.add_argument( + "--warmup_proportion", + type=float, + default=0.06, + help="Proportion of training steps to perform linear learning rate warmup for.", +) +parser.add_argument("--learning_rate", type=float, default=1e-5, help="The initial learning rate for Adam.") +parser.add_argument("--model_type", type=str, default="luke-base", help="Type of pre-trained model.") +parser.add_argument("--max_mention_length", type=int, default=30, help="Max entity position's length") + +args = parser.parse_args() + + +class Trainer(object): + def __init__(self, args, model, dataloader, num_train_steps, step_callback=None): + self.args = args + self.model = model + self.dataloader = dataloader + self.num_train_steps = num_train_steps + self.step_callback = step_callback + + self.optimizer, self.scheduler = self._create_optimizer(model) + self.wd_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + def train(self): + model = self.model + + epoch = 0 + global_step = 0 + + model.train() + + with tqdm(total=self.num_train_steps) as pbar: + while True: + for step, batch in enumerate(self.dataloader): + outputs = model( + input_ids=batch[0], + token_type_ids=batch[1], + attention_mask=batch[2], + entity_ids=batch[3], + entity_position_ids=batch[4], + entity_token_type_ids=batch[5], + entity_attention_mask=batch[6], + ) + + loss = F.binary_cross_entropy_with_logits( + outputs.reshape([-1]), batch[7].reshape([-1]).astype("float32") + ) + + if self.args.gradient_accumulation_steps > 1: + loss = loss / self.args.gradient_accumulation_steps + loss.backward() + if (step + 1) % self.args.gradient_accumulation_steps == 0: + self.optimizer.step() + self.scheduler.step() + self.optimizer.clear_grad() + pbar.set_description("epoch: %d loss: %.7f" % (epoch, loss)) + pbar.update() + global_step += 1 + + if global_step == self.num_train_steps: + break + output_dir = self.args.output_dir + + model.save_pretrained(output_dir) + if global_step == self.num_train_steps: + break + epoch += 1 + + return model, global_step + + def _create_optimizer(self, model): + scheduler = self._create_scheduler() + clip = paddle.nn.ClipGradByNorm(clip_norm=1.0) + return ( + AdamW( + parameters=model.parameters(), + grad_clip=clip, + learning_rate=scheduler, + apply_decay_param_fun=lambda x: x in self.wd_params, + weight_decay=self.args.weight_decay, + ), + scheduler, + ) + + def _create_scheduler(self): + return LinearDecayWithWarmup( + learning_rate=self.args.learning_rate, total_steps=self.num_train_steps, warmup=self.args.warmup_proportion + ) + + +class DataGenerator(Dataset): + def __init__(self, features): + super(DataGenerator, self).__init__() + self.features = features + + def __getitem__(self, item): + word_ids = self.features[item].word_segment_ids + word_segment_ids = self.features[item].word_segment_ids + word_attention_mask = self.features[item].word_attention_mask + entity_ids = self.features[item].entity_ids + entity_position_ids = self.features[item].entity_position_ids + entity_segment_ids = self.features[item].entity_segment_ids + entity_attention_mask = self.features[item].entity_attention_mask + labels = self.features[item].labels + + return ( + word_ids, + word_segment_ids, + word_attention_mask, + entity_ids, + entity_position_ids, + entity_segment_ids, + entity_attention_mask, + labels, + ) + + def __len__(self): + return len(self.features) + + 
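+# A label is treated as predicted when its logit is greater than 0; precision,
+# recall and F1 below are micro-averaged over all predicted / gold entity types.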
+@paddle.no_grad() +def evaluate(args, model, mode="dev", output_file=None): + dataloader, _, _, label_list = load_examples(args, mode=mode) + model.eval() + + all_logits = [] + all_labels = [] + + for batch in tqdm(dataloader, desc=mode): + logits = model( + input_ids=batch[0], + token_type_ids=batch[1], + attention_mask=batch[2], + entity_ids=batch[3], + entity_position_ids=batch[4], + entity_token_type_ids=batch[5], + entity_attention_mask=batch[6], + ) + + logits = logits.tolist() + labels = batch[7].tolist() + + all_logits.extend(logits) + all_labels.extend(labels) + + all_predicted_indexes = [] + all_label_indexes = [] + for logits, labels in zip(all_logits, all_labels): + all_predicted_indexes.append([i for i, v in enumerate(logits) if v > 0]) + all_label_indexes.append([i for i, v in enumerate(labels) if v > 0]) + + if output_file: + with open(output_file, "w") as f: + for predicted_indexes, label_indexes in zip(all_predicted_indexes, all_label_indexes): + data = dict( + predictions=[label_list[ind] for ind in predicted_indexes], + labels=[label_list[ind] for ind in label_indexes], + ) + f.write(json.dumps(data) + "\n") + + num_predicted_labels = 0 + num_gold_labels = 0 + num_correct_labels = 0 + + for predicted_indexes, label_indexes in zip(all_predicted_indexes, all_label_indexes): + num_predicted_labels += len(predicted_indexes) + num_gold_labels += len(label_indexes) + num_correct_labels += len(frozenset(predicted_indexes).intersection(frozenset(label_indexes))) + + if num_predicted_labels > 0: + precision = num_correct_labels / num_predicted_labels + else: + precision = 0.0 + + recall = num_correct_labels / num_gold_labels + if precision + recall == 0.0: + f1 = 0.0 + else: + f1 = 2 * precision * recall / (precision + recall) + + return dict(precision=precision, recall=recall, f1=f1) + + +def load_examples(args, mode="train"): + tokenizer = LukeTokenizer.from_pretrained(args.model_type) + tokenizer.add_special_tokens(dict(additional_special_tokens=[ENTITY_TOKEN])) + processor = DatasetProcessor() + if mode == "train": + examples = processor.get_train_examples(args.data_dir) + elif mode == "dev": + examples = processor.get_dev_examples(args.data_dir) + else: + examples = processor.get_test_examples(args.data_dir) + + label_list = processor.get_label_list(args.data_dir) + + logging.info("Creating features from the dataset...") + features = convert_examples_to_features(examples, label_list, tokenizer, args.max_mention_length) + + dataset = DataGenerator(features) + + def collate_fn(batch): + def create_padded_sequence(k, padding_value): + """Pad sequence to maximum length""" + new_data = [] + max_len = 0 + for each_batch in batch: + if len(each_batch[k]) > max_len: + max_len = len(each_batch[k]) + for each_batch in batch: + new_data.append(each_batch[k] + [padding_value] * (max_len - len(each_batch[k]))) + return np.array(new_data, dtype="int64") + + return ( + create_padded_sequence(0, 1), # pad word_ids + create_padded_sequence(1, 0), # pad word_segment_ids + create_padded_sequence(2, 0), # pad word_attention_mask + create_padded_sequence(3, 0), # pad entity_ids + create_padded_sequence(4, 0), # pad entity_position_ids + create_padded_sequence(5, 0), # pad entity_segment_ids + create_padded_sequence(6, 0), # pad entity_attention_mask + create_padded_sequence(7, 0), + ) # convert to numpy array + + dataloader = DataLoader(dataset, shuffle="train" in mode, batch_size=args.batch_size, collate_fn=collate_fn) + + return dataloader, examples, features, label_list + + +if __name__ == 
"__main__": + results = {} + train_dataloader, _, features, _ = load_examples(args, mode="train") + num_labels = len(features[0].labels) + num_train_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps + num_train_steps = int(num_train_steps_per_epoch * args.num_train_epochs) + model = LukeForEntityClassification.from_pretrained(args.model_type, num_classes=num_labels) + trainer = Trainer(args, model=model, dataloader=train_dataloader, num_train_steps=num_train_steps) + trainer.train() + output_file = os.path.join(args.output_dir, "test_predictions.jsonl") + results.update({f"test_{k}": v for k, v in evaluate(args, model, "test", output_file).items()}) + + logging.info("Results: %s", json.dumps(results, indent=2, sort_keys=True)) + with open(os.path.join(args.output_dir, "results.json"), "w") as f: + json.dump(results, f) diff --git a/examples/language_model/luke/run_squad.py b/examples/language_model/luke/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..494eb07d9d55865925d09cec374d5130a2a4769a --- /dev/null +++ b/examples/language_model/luke/run_squad.py @@ -0,0 +1,353 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + LukeForQuestionAnswering, + LukeTokenizer, +) + +MODEL_CLASSES = {"luke": (LukeForQuestionAnswering, LukeTokenizer)} + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, + contexts, + add_prefix_space=True, + return_token_type_ids=True, + max_seq_len=args.max_seq_length, + stride=args.doc_stride, + return_attention_mask=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. 
This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + # Minus one more to reach actual text + token_end_index -= 1 + if token_end_index >= len(offsets): + token_end_index = len(offsets) - 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, + contexts, + add_prefix_space=True, + return_token_type_ids=True, + stride=args.doc_stride, + max_seq_len=args.max_seq_length, + return_attention_mask=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. 
+ sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, raw_dataset, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, _ = batch + start_logits_tensor, end_logits_tensor = model(input_ids) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.version_2_with_negative: + train_examples = load_dataset("squad_v2", split="train") + dev_examples = load_dataset("squad_v2", split="validation") + 
else: + train_examples = load_dataset("squad", split="train") + dev_examples = load_dataset("squad", split="validation") + + column_names = train_examples.column_names + set_seed(args) + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("Loads checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds = train_examples.map( + partial(prepare_train_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + global_step = 0 + tic_train = time.time() + + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, _, start_positions, end_positions = batch + logits = model(input_ids) + loss = criterion(logits, (start_positions, end_positions)) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if args.do_predict and rank == 0: + dev_ds = dev_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, 
pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_data_loader, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/language_model/megatronbert/README.md b/examples/language_model/megatronbert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e9795358bb40a232db43afa05a06853098b5af80 --- /dev/null +++ b/examples/language_model/megatronbert/README.md @@ -0,0 +1,103 @@ +# MegatronBert with PaddleNLP + +[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) + +**模型简介:** +近期在语言建模方面的工作表明,训练大型transformers模型提高了自然语言处理应用的技术水平。然而,由于内存限制,非常大的模型可能难以训练。在这项工作中, +作者提出了训练大型transformers模型的技术,并实现了一种简单、高效的模型运算并行方法,该方法能够训练具有数十亿个参数的transformers模型。 + +本项目是 MegatronBert 在 Paddle 2.x上的开源实现。 + +## 快速开始 + +### 下游任务微调 + +#### 1、SQuAD1.1 & SQuAD2.0 +SQuAD1.1数据集 + +```shell +python -m paddle.distributed.launch run_squad.py \ + --do_train \ + --do_predict \ + --batch_size=8 \ + --model_name_or_path=megatronbert-cased + --learning_rate=1e-5 \ + --output_dir=output/ \ + --device=gpu \ + --num_train_epochs=2 +``` +其中参数释义如下: +- `model_name_or_path` 指示了模型类型,当前支持`megatronbert-cased`和`megatronbert-uncased`模型。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `output_dir` 表示模型保存路径。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示需要训练的epoch数量 + +训练结束后模型会对模型进行评估,其评估在验证集上完成, 训练完成后你将看到如下结果: +```text +{ + "exact": 88.78902554399243, + "f1": 94.4082803514958, + "total": 10570, + "HasAns_exact": 88.78902554399244, + "HasAns_f1": 94.4082803514958, + "HasAns_total": 10570 +} +``` + +SQuAD2.0数据集 +```shell +python -m paddle.distributed.launch run_squad.py \ + --do_train \ + --version_2_with_negative \ + --do_predict \ + --batch_size=8 \ + --model_name_or_path=megatronbert-cased + --learning_rate=1e-5 \ + --output_dir=output/ \ + --device=gpu \ + --num_train_epochs=2 +``` + +其中参数释义如下: +- `version_2_with_negative` 是否使用SQuAD2.0数据集 + +训练结束后模型会对模型进行评估,其评估在验证集上完成, 训练完成后你将看到如下结果: +```text +{ + "exact": 85.85867093405206, + "f1": 88.70579950475263, + "total": 11873, + "HasAns_exact": 82.47300944669365, + "HasAns_f1": 88.17543143048748, + "HasAns_total": 5928, + "NoAns_exact": 89.23465096719933, + "NoAns_f1": 89.23465096719933, + "NoAns_total": 5945, + "best_exact": 85.99343047250063, + "best_exact_thresh": -1.6154582500457764, + "best_f1": 88.75296534320918, + "best_f1_thresh": -0.20494508743286133 +} +``` + +#### 2、mnli数据集 + +```shell +python -m paddle.distributed.launch run_glue.py \ + --task_name=mnli \ + --output_dir=output/ \ + --model_name_or_path=megatronbert-cased \ + --learning_rate=1e-5 \ + --device=gpu \ + --num_train_epochs=2 +``` +训练结束后模型会对模型进行评估,其评估在测试集上完成, 训练完成后你将看到如下结果: +```text +eval loss: 0.186327, acc: 0.8992358634742741, eval loss: 0.332409, acc: 0.8968673718470301, eval done total : 118.65499472618103 s +``` + +# Reference + +* [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) diff --git a/examples/language_model/megatronbert/args.py b/examples/language_model/megatronbert/args.py new file mode 100644 index 
0000000000000000000000000000000000000000..7b9bf24492b770baa82a2a68e1a9244b6a50e9e7 --- /dev/null +++ b/examples/language_model/megatronbert/args.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument("--model_type", default="megatronbert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="megatronbert-cased", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=2, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.06, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=5000, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+ ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument( + "--version_2_with_negative", + action="store_true", + help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + args = parser.parse_args() + return args diff --git a/examples/language_model/megatronbert/run_glue.py b/examples/language_model/megatronbert/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..cbb78c94abd9e3a8450fda83663bb4a2e3125a19 --- /dev/null +++ b/examples/language_model/megatronbert/run_glue.py @@ -0,0 +1,362 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + MegatronBertForSequenceClassification, + MegatronBertTokenizer, +) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "megatronbert": (MegatronBertForSequenceClassification, MegatronBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default="megatronbert", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="megatronbert-cased", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=2, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=10000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.06, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu/npu.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, 
pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_labels=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
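+    # Concretely: `decay_params` collects the framework names of parameters whose structured
+    # name does not contain "bias" or "norm", and `apply_decay_param_fun` below applies
+    # weight decay only to those parameters.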
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, labels) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/megatronbert/run_squad.py b/examples/language_model/megatronbert/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..5d3f12b5e090f572e5ed351fa5cf8fb2ffc14f02 --- /dev/null +++ b/examples/language_model/megatronbert/run_squad.py @@ -0,0 +1,313 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + MegatronBertForQuestionAnswering, + MegatronBertTokenizer, +) + +MODEL_CLASSES = {"megatronbert": (MegatronBertForQuestionAnswering, MegatronBertTokenizer)} + + +def prepare_train_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Let's label those examples! + for i, tokenized_example in enumerate(tokenized_examples): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_example["input_ids"] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offsets = tokenized_example["offset_mapping"] + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + answers = examples[sample_index]["answers"] + answer_starts = examples[sample_index]["answer_starts"] + + # If no answers are given, set the cls_index as answer. + if len(answer_starts) == 0: + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Start/end character index of the answer in the text. + start_char = answer_starts[0] + end_char = start_char + len(answers[0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. 
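+            # Here `sequence_ids` are the token_type_ids: 1 marks context tokens (including the
+            # trailing [SEP]), 0 marks the question and special tokens before it. The backward scan
+            # stops at that trailing [SEP]; the extra decrement below moves onto the last real
+            # context token.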
+ token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + # Minus one more to reach actual text + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples[i]["start_positions"] = token_start_index - 1 + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples[i]["end_positions"] = token_end_index + 1 + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
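+        # Positions whose offset is None are therefore outside the context, and the
+        # compute_prediction post-processing ignores them when mapping start/end logits
+        # back to character-level answer spans.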
+ tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + + return tokenized_examples + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + start_logits_tensor, end_logits_tensor = model(input_ids, token_type_ids=token_type_ids) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + data_loader.dataset.data, + data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=data_loader.dataset.data, preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.version_2_with_negative: + train_ds = load_dataset("squad", splits="train_v2", data_files=args.train_file) + dev_ds = load_dataset("squad", splits="dev_v2", data_files=args.predict_file) + else: + train_ds = load_dataset("squad", splits="train_v1", data_files=args.train_file) + dev_ds = load_dataset("squad", splits="dev_v1", data_files=args.predict_file) + set_seed(args) + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("init checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds.map(partial(prepare_train_features, tokenizer=tokenizer, args=args), batched=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, 
pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + global_step = 0 + tic_train = time.time() + + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, start_positions, end_positions = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, (start_positions, end_positions)) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if args.do_predict and rank == 0: + dev_ds.map(partial(prepare_validation_features, tokenizer=tokenizer, args=args), batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_data_loader, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/language_model/moe/data_tools b/examples/language_model/moe/data_tools new file mode 100644 index 0000000000000000000000000000000000000000..8841a30c30fdcf5aef96137c98938765471bb759 --- /dev/null +++ b/examples/language_model/moe/data_tools @@ -0,0 +1 @@ +../../../model_zoo/ernie-1.0/data_tools \ No newline at end of file diff --git a/examples/language_model/moe/dygraph/args.py b/examples/language_model/moe/dygraph/args.py new file mode 100644 index 0000000000000000000000000000000000000000..700fc911889a19eefc57433ef819e6dd856f8f97 --- /dev/null +++ 
b/examples/language_model/moe/dygraph/args.py @@ -0,0 +1,274 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def process_batch_size(args): + if args.global_batch_size is None and args.local_batch_size is None: + raise ValueError("global_batch_size or local_batch_size should be set.") + elif args.global_batch_size is not None and args.local_batch_size is not None: + assert args.global_batch_size // args.local_batch_size == (args.dp_degree * args.sharding_degree), ( + "global_batch_size[{}] should be " + "divided by local_batch_size[{}] when dp_degree is [{}], sharding_degree is [{}]. ".format( + args.global_batch_size, args.local_batch_size, args.dp_degree, args.sharding_degree + ) + ) + elif args.global_batch_size is not None and args.local_batch_size is None: + assert ( + args.global_batch_size % (args.dp_degree * args.sharding_degree) == 0 + ), "global_batch_size[{}] should be divided by dp_degree[{}] times sharding_degree[{}].".format( + args.global_batch_size, args.dp_degree, args.sharding_degree + ) + args.local_batch_size = args.global_batch_size // (args.dp_degree * args.sharding_degree) + else: + args.global_batch_size = args.local_batch_size * args.dp_degree * args.sharding_degree + assert args.local_batch_size % args.micro_batch_size == 0 + + +def parse_args(MODEL_CLASSES): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + + # Train I/O config + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the training logs and checkpoints will be written.", + ) + parser.add_argument("--split", type=str, default="949,50,1", help="Train/valid/test data split.") + + parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.") + + parser.add_argument( + "--global_batch_size", + default=None, + type=int, + help="Global batch size for all training process. None for not check the size is valid. If we only use data parallelism, it should be device_num * micro_batch_size.", + ) + + parser.add_argument( + "--local_batch_size", + default=None, + type=int, + help="Global batch size for all training process. None for not check the size is valid. 
If we only use data parallelism, it should be device_num * micro_batch_size.", + ) + + parser.add_argument( + "--micro_batch_size", + default=8, + type=int, + help="Batch size per device for one step training.", + ) + + # Default training config + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--grad_clip", default=0.0, type=float, help="Grad clip for the parameter.") + parser.add_argument("--max_lr", default=0.00015, type=float, help="The initial max learning rate for Adam.") + parser.add_argument("--min_lr", default=1e-5, type=float, help="The initial min learning rate for Adam.") + parser.add_argument( + "--warmup_rate", default=0.01, type=float, help="Linear warmup over warmup_steps for learing rate." + ) + + # Adam optimizer config + parser.add_argument( + "--adam_beta1", + default=0.9, + type=float, + help="The beta1 for Adam optimizer. The exponential decay rate for the 1st moment estimates.", + ) + parser.add_argument( + "--adam_beta2", + default=0.999, + type=float, + help="The bate2 for Adam optimizer. The exponential decay rate for the 2nd moment estimates.", + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + + # Training steps config + parser.add_argument( + "--num_train_epochs", + default=1, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=500000, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--decay_steps", + default=360000, + type=int, + help="The steps use to control the learing rate. If the step > decay_steps, will use the min_lr.", + ) + parser.add_argument("--logging_freq", type=int, default=1, help="Log every X updates steps.") + parser.add_argument("--eval_freq", type=int, default=500, help="Evaluate for every X updates steps.") + parser.add_argument("--eval_iters", type=int, default=10, help="Evaluate the model use X steps data.") + + # Config for 4D Parallelism + + parser.add_argument( + "--sharding_degree", + type=int, + default=1, + help="Group Sharded degree. Spliting the model parameters to many cards.", + ) + + parser.add_argument("--dp_degree", type=int, default=1, help="Data Parallelism degree.") + parser.add_argument( + "--mp_degree", type=int, default=1, help="Model Parallelism degree. Spliting the linear layers to many cards." + ) + parser.add_argument( + "--pp_degree", + type=int, + default=1, + help="Pipeline Parallelism degree. Spliting the model layers to different parts.", + ) + parser.add_argument( + "--use_recompute", type=strtobool, nargs="?", const=False, help="Using the recompute to save the memory." 
+ ) + + parser.add_argument( + "--recompute_partition", + type=strtobool, + nargs="?", + const=False, + help="use recompute_partition to support mp partition when use_recompute is True .", + ) + + parser.add_argument( + "--recompute_offload", + type=strtobool, + nargs="?", + const=False, + help="use recompute_offload to save the memory by offload when use_recompute is True .", + ) + + parser.add_argument( + "--resume_dir", + default="", + type=str, + required=False, + help="The resume directory where the checkpoint will be resume.", + ) + + # Pure FP16 config + parser.add_argument( + "--use_pure_fp16", type=strtobool, nargs="?", const=False, help="Enable pure fp16 precision training." + ) + + parser.add_argument( + "--scale_loss", + type=float, + default=32768, + help="The value of scale_loss for fp16. This is only used for AMP training.", + ) + + parser.add_argument( + "--sharding_offload", type=strtobool, nargs="?", const=False, help="use sharding stage2 cpu offload strategy." + ) + + parser.add_argument("--hidden_dropout_prob", type=float, default=0.1, help="The hidden dropout prob.") + + parser.add_argument( + "--attention_probs_dropout_prob", type=float, default=0.1, help="The attention probs dropout prob." + ) + + # MOE config + parser.add_argument("--num_experts", type=int, default=1, help="number of experts per worker") + + parser.add_argument("--top_k", type=int, default=2, help="top_k for moe gate") + + parser.add_argument("--expert_mode", type=strtobool, nargs="?", const=False, help="Enable Moe mode.") + + parser.add_argument( + "--balance_loss_weight", + default=1.0, + type=float, + help="The auxiliary loss generated by gate strategy to help balance experts.", + ) + + parser.add_argument( + "--gate", + type=str, + default="gshard", + choices=["naive", "gshard", "switch"], + help="select naive, gshard, switch gate strategy.", + ) + + # Other config + parser.add_argument("--seed", type=int, default=1234, help="Random seed for initialization") + parser.add_argument( + "--check_accuracy", type=strtobool, nargs="?", const=False, help="Check accuracy for training process." + ) + parser.add_argument( + "--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu"], help="select cpu, gpu, xpu devices." + ) + parser.add_argument( + "--lr_decay_style", type=str, default="cosine", choices=["cosine", "none"], help="Learning rate decay style." + ) + + args = parser.parse_args() + args.test_iters = args.eval_iters * 10 + + # process batch size + process_batch_size(args) + + if args.check_accuracy: + if args.hidden_dropout_prob != 0: + args.hidden_dropout_prob = 0.0 + logger.warning("The hidden_dropout_prob should set to 0 for accuracy checking.") + if args.attention_probs_dropout_prob != 0: + args.attention_probs_dropout_prob = 0.0 + logger.warning("The attention_probs_dropout_prob should set to 0 for accuracy checking.") + + logger.info("{:20}:{}".format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + return args diff --git a/examples/language_model/moe/dygraph/checkpointing.py b/examples/language_model/moe/dygraph/checkpointing.py new file mode 100644 index 0000000000000000000000000000000000000000..e9ebb7eae0888ec6223410f510587c3957c8b297 --- /dev/null +++ b/examples/language_model/moe/dygraph/checkpointing.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import paddle + + +def save_checkpoint( + args, + global_step, + model, + optimizer, + lr_scheduler, + tokenizer, + loss_scale, + dp_rank, + mp_rank, + pp_rank, + pass_num, + file_id, + epoch, +): + """save some state for each rank.""" + + assert args.output_dir is not None, "output_dir is not valid." + output_dir = os.path.join(args.output_dir, "step_{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + + state_dict = {} + state_dict["args"] = args + state_dict["global_step"] = global_step + state_dict["loss_scale"] = loss_scale + state_dict["data_meta"] = {"pass_num": pass_num, "file_id": file_id, "start_epoch": epoch} + + if optimizer is not None: + state_dict["optimizer"] = optimizer.state_dict() + + if lr_scheduler is not None: + state_dict["lr_scheduler"] = lr_scheduler.state_dict() + + if args.pp_degree > 1: + path = os.path.join(output_dir, "dp_{}_mp_{}_pp_{}".format(dp_rank, mp_rank, pp_rank)) + # model.save_state_dict(path) + paddle.save(model.state_dict(), os.path.join(path, "model_state.pdparams")) + tokenizer.save_pretrained(path) + else: + path = os.path.join(output_dir, "dp_{}_mp_{}".format(dp_rank, mp_rank)) + tokenizer.save_pretrained(path) + paddle.save(model.state_dict(), os.path.join(path, "model_state.pdparams")) + + state_save_path = os.path.join(path, "meta_state.pdopt") + paddle.save(state_dict, state_save_path) + + +def load_checkpoint(args, model, optimizer, lr_scheduler, tokenizer, dp_rank, mp_rank, pp_rank): + """load checkpoint for all rank.""" + + assert args.resume_dir is not None and len(args.resume_dir) > 0, "resume_dir is not valid." + assert os.path.exists(args.resume_dir) and os.path.isdir(args.resume_dir), "resume_dir not exists or not a dir." 
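+    # `resume_dir` is expected to be a `<output_dir>/step_<global_step>` directory written by
+    # save_checkpoint() above, containing one `dp_<dp_rank>_mp_<mp_rank>[_pp_<pp_rank>]`
+    # subdirectory per rank.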
+ + load_path = None + if args.pp_degree > 1: + load_path = os.path.join(args.resume_dir, "dp_{}_mp_{}_pp_{}".format(dp_rank, mp_rank, pp_rank)) + # model.set_state_dir(load_path) + model.set_state_dict(paddle.load(os.path.join(load_path, "model_state.pdparams"))) + else: + load_path = os.path.join(args.resume_dir, "dp_{}_mp_{}".format(dp_rank, mp_rank)) + model.set_state_dict(paddle.load(os.path.join(load_path, "model_state.pdparams"))) + + tokenizer.from_pretrained(load_path) + state_dict = paddle.load(os.path.join(load_path, "meta_state.pdopt")) + + if optimizer is not None: + optimizer.set_state_dict(state_dict["optimizer"]) + + if lr_scheduler is not None: + lr_scheduler.set_state_dict(state_dict["lr_scheduler"]) + + global_step = state_dict["global_step"] + args.seed = state_dict["args"].seed + loss_scale = state_dict["loss_scale"] + + resume_step = int(args.resume_dir.strip("/").split("_")[-1]) + if resume_step != global_step: + print("Warning: resume_step is {}, but the step of checkpoint is {}.".format(resume_step, global_step)) + + return global_step, loss_scale, state_dict["data_meta"] diff --git a/examples/language_model/moe/dygraph/dataset.py b/examples/language_model/moe/dygraph/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..bf8df24e7f431b5db1f02d5f0247d82c0408956e --- /dev/null +++ b/examples/language_model/moe/dygraph/dataset.py @@ -0,0 +1,399 @@ +# Copyright (c) 2020, NVIDIA CORPORATION. +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import numpy as np +import paddle +from paddle.io import DataLoader + +from paddlenlp.data import Stack, Tuple +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + + +def construct_samples_and_shuffle_data( + name, data_prefix, documents, sizes, num_samples, seq_length, seed, build_data_file +): + """ + documents: document index from 0 to len(docs) + sizes: the length list of all docs. + num_samples: total step*bs iterations of data. + seq_length: the sequence length. + sum(sizes) = tokens_per_epoch + data_nums = num_samples * micro_batch_size + num_epochs = (data_nums + 1) // sum(sizes) + len(doc_idx) = num_epochs * sum(sizes) + """ + # Number of tokens in each epoch and number of required epochs. + tokens_per_epoch = _num_tokens(documents, sizes) + num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples) + # Rng state + np_rng = np.random.RandomState(seed=seed) + + # Filename of the index mappings. + _filename = data_prefix + _filename += "_{}_indexmap".format(name) + _filename += "_{}ns".format(num_samples) + _filename += "_{}sl".format(seq_length) + doc_idx_filename = _filename + "_doc_idx.npy" + sample_idx_filename = _filename + "_sample_idx.npy" + shuffle_idx_filename = _filename + "_shuffle_idx.npy" + + # Sava random state + savedState = np_rng.get_state() + # Build the indexed mapping if not exist. 
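+    # Three index arrays are cached on disk: doc_idx (the document order repeated for every
+    # epoch), sample_idx (for each sample, its starting position as a doc_idx index plus a
+    # token offset), and shuffle_idx (a permutation that shuffles the samples).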
+ if build_data_file: + if ( + (not os.path.isfile(doc_idx_filename)) + or (not os.path.isfile(sample_idx_filename)) + or (not os.path.isfile(shuffle_idx_filename)) + ): + if num_epochs == 1: + separate_last_epoch = False + else: + num_samples_from_epochs_minus_one = ((num_epochs - 1) * tokens_per_epoch - 1) // seq_length + last_epoch_num_samples = num_samples - num_samples_from_epochs_minus_one + assert last_epoch_num_samples >= 0, "last epoch number of samples should be non-negative." + num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length + assert last_epoch_num_samples < ( + num_samples_per_epoch + 1 + ), "last epoch number of samples exceeded max value." + separate_last_epoch = last_epoch_num_samples < int(0.80 * num_samples_per_epoch) + # Note. len(doc_idx) = num_epochs * len(doc) + doc_idx = _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch) + np.save(doc_idx_filename, doc_idx, allow_pickle=True) + + # sample-idx. pos of each seq_len of data. + assert doc_idx.dtype == np.int32 + sample_idx = _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch) + np.save(sample_idx_filename, sample_idx, allow_pickle=True) + + if separate_last_epoch: + num_samples_ = num_samples_from_epochs_minus_one + else: + num_samples_ = sample_idx.shape[0] - 1 + + # Shuffle all seq len data. + shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1, np_rng) + np.save(shuffle_idx_filename, shuffle_idx, allow_pickle=True) + else: + while True: + if ( + (not os.path.isfile(doc_idx_filename)) + or (not os.path.isfile(sample_idx_filename)) + or (not os.path.isfile(shuffle_idx_filename)) + ): + time.sleep(3) + else: + break + + # Restore random state + np_rng.set_state(savedState) + + if paddle.distributed.get_world_size() > 1: + if paddle.in_dynamic_mode(): + paddle.distributed.barrier() + + # Load mappings. + doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r") + sample_idx = np.load(sample_idx_filename, allow_pickle=True, mmap_mode="r") + shuffle_idx = np.load(shuffle_idx_filename, allow_pickle=True, mmap_mode="r") + return doc_idx, sample_idx, shuffle_idx + + +def _num_tokens(documents, lens): + """Total number of tokens in the dataset.""" + return np.sum(lens[documents]) + + +def _num_epochs(tokens_per_epoch, seq_length, num_samples): + """Based on number of samples and sequence lenght, calculate how many + epochs will be needed.""" + num_epochs = 0 + total_tokens = 0 + while True: + num_epochs += 1 + total_tokens += tokens_per_epoch + if ((total_tokens - 1) // seq_length) >= num_samples: + return num_epochs + + +def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch): + """ + Build an array with length = number-of-epochs * number-of-documents. + Each index is mapped to a corresponding document. + """ + if not separate_last_epoch or num_epochs == 1: + doc_idx = np.mgrid[0:num_epochs, 0 : len(documents)][1] + doc_idx[:] = documents + # The documents repeat num_epochs times. + doc_idx = doc_idx.reshape(-1) + doc_idx = doc_idx.astype(np.int32) + return doc_idx + + doc_idx_first = _build_doc_idx(documents, num_epochs - 1, np_rng, False) + doc_idx_last = _build_doc_idx(documents, 1, np_rng, False) + return np.concatenate((doc_idx_first, doc_idx_last)) + + +def _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch): + """ + num_samples + 1, pos of bs data + the distance between two points for sample idx is bs tokens. 
+ """ + num_samples = (num_epochs * tokens_per_epoch - 1) // seq_length + sample_idx = np.zeros([int(num_samples) + 1, 2], dtype=np.int32) + + sample_index = 0 + doc_idx_index = 0 + doc_offset = 0 + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + while sample_index <= num_samples: + remaining_seq_length = seq_length + 1 + while remaining_seq_length != 0: + doc_id = doc_idx[doc_idx_index] + doc_length = sizes[doc_id] - doc_offset + remaining_seq_length -= doc_length + if remaining_seq_length <= 0: + doc_offset += remaining_seq_length + doc_length - 1 + remaining_seq_length = 0 + else: + doc_idx_index += 1 + doc_offset = 0 + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + + return sample_idx + + +def _build_shuffle_idx(num_samples, total_size, np_rng): + dtype_ = np.uint32 + if total_size >= (np.iinfo(np.uint32).max - 1): + dtype_ = np.int64 + + shuffle_idx_first = np.arange(start=0, stop=num_samples, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_first) + if num_samples == total_size: + return shuffle_idx_first + + shuffle_idx_last = np.arange(start=num_samples, stop=total_size, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_last) + + return np.concatenate((shuffle_idx_first, shuffle_idx_last)) + + +def get_train_valid_test_split_(splits_string, size): + """Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(",") != -1: + splits = [float(s) for s in splits_string.split(",")] + elif splits_string.find("/") != -1: + splits = [float(s) for s in splits_string.split("/")] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.0) + splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def create_pretrained_dataset( + args, + input_path, + local_rank, + data_world_rank, + data_world_size, + eos_id, + worker_init=None, + max_seq_len=1024, + places=None, + data_holders=None, +): + device_world_size = paddle.distributed.get_world_size() + + logger.info( + "The distributed run, total device num:{}, distinct dataflow num:{}.".format( + device_world_size, data_world_size + ) + ) + + process_data = np.load(input_path, mmap_mode="r+", allow_pickle=True) + # All documment ids, extend as 1-D array. 
+ sample_ids = process_data["ids"] + # The len(sample_lens) num of docs + # The sum(sample_lens) should equal len(sample_ids) + sample_lens = process_data["lens"] + + splits = get_train_valid_test_split_(args.split, len(sample_lens)) + assert len(sample_lens) >= splits[-1], "The document nums should larger than max of splits, but %s < %s" % ( + len(sample_lens), + splits[-1], + ) + + def build_dataset(index, name, num_samples): + dataset = GPTDataset( + file_path=input_path, + build_data_file=local_rank == 0, + name="gpt_" + name, + max_seq_len=max_seq_len, + num_samples=num_samples, + documents=np.arange(splits[index], splits[index + 1]), + sample_ids=sample_ids, + sample_lens=sample_lens, + eos_id=eos_id, + seed=args.seed, + ) + + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.local_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + ) + + data_loader = DataLoader( + dataset=dataset, + places=places, + feed_list=data_holders, + batch_sampler=batch_sampler, + num_workers=1, + worker_init_fn=worker_init, + # collate_fn=Tuple(Stack(), Stack(), Stack(), Stack(), Stack()), + collate_fn=Tuple(Stack(), Stack(), Stack()), + return_list=False, + ) + return data_loader + + # Note, data should be broardcast to all devices. + # for train, valid, test, the distinct data num is data_world_size + train_data_loader = build_dataset(0, "train", args.local_batch_size * args.max_steps * data_world_size) + + valid_data_loader = build_dataset( + 1, "valid", args.local_batch_size * (args.max_steps // args.eval_freq + 1) * args.eval_iters * data_world_size + ) + test_data_loader = build_dataset(2, "test", args.local_batch_size * args.test_iters * data_world_size) + + return train_data_loader, valid_data_loader, test_data_loader + + +class GPTDataset(paddle.io.Dataset): + def __init__( + self, + file_path, + num_samples, + eos_id, + sample_ids, + sample_lens, + documents=None, + build_data_file=False, + name="gpt", + max_seq_len=1024, + seed=1234, + ): + self.file_path = file_path + self.max_seq_len = max_seq_len + self.name = name + self.eos_id = eos_id + self.sample_ids = sample_ids + self.sample_lens = sample_lens + if documents is None: + document_ids = np.arange(0, self.sample_lens.shape[0]) + else: + document_ids = documents + + self.doc_idx, self.sample_idx, self.shuffle_idx = construct_samples_and_shuffle_data( + self.name, self.file_path, document_ids, self.sample_lens, num_samples, max_seq_len, seed, build_data_file + ) + + # The doc cumsum start pos + self.start_pos = [0] + np.cumsum(self.sample_lens).tolist() + + def _construct_sample(self, tokens): + tokens = np.array(tokens).astype("int64").tolist() + labels = tokens[1:] + tokens = tokens[:-1] + seq_length = len(tokens) + # Attention mask for the attention calulate + # attention_mask = np.tri(seq_length, seq_length).reshape((1, seq_length, + # seq_length)) + + # The pad and eos tokens do not contribute the loss + loss_mask = np.ones(seq_length, dtype="float32") + loss_mask[np.where(np.array(tokens) == self.eos_id)] = 0.0 + # position_ids = np.arange(0, seq_length, dtype="int64") + + # attention_mask = (attention_mask - 1.0) * 1e9 + # attention_mask = attention_mask.astype("float32") + # return [tokens, loss_mask, attention_mask, position_ids, labels] + return [tokens, loss_mask, labels] + + def _get_single_sample_from_idx(self, doc_index_f, doc_index_l, offset_f, offset_l): + """ + The input means: + doc_index_f: data from the first doc. 
+ doc_index_l: data from the last doc. + offset_f: offset of the first doc. + offset_l: offset of the last doc. + """ + # Data from the sample doc. just select the needed ids. + if doc_index_f == doc_index_l: + current_start_pos = self.start_pos[self.doc_idx[doc_index_f]] + return self.sample_ids[current_start_pos + offset_f : current_start_pos + offset_l + 1].tolist() + + # Data from multi docs. + else: + current_start_pos = self.start_pos[self.doc_idx[doc_index_f]] + next_start_pos = self.start_pos[self.doc_idx[doc_index_f] + 1] + tokens = self.sample_ids[current_start_pos + offset_f : next_start_pos].tolist() + for i in range(doc_index_f + 1, doc_index_l): + current_start_pos = self.start_pos[self.doc_idx[i]] + next_start_pos = self.start_pos[self.doc_idx[i] + 1] + tokens.extend(self.sample_ids[current_start_pos:next_start_pos].tolist()) + last_start_pos = self.start_pos[self.doc_idx[doc_index_l]] + tokens.extend(self.sample_ids[last_start_pos : last_start_pos + offset_l + 1].tolist()) + + return tokens + + def __getitem__(self, index): + idx = self.shuffle_idx[index] + # Start and end documents and offsets. + doc_index_f = self.sample_idx[idx][0] + doc_index_l = self.sample_idx[idx + 1][0] + offset_f = self.sample_idx[idx][1] + offset_l = self.sample_idx[idx + 1][1] + tokens = self._get_single_sample_from_idx(doc_index_f, doc_index_l, offset_f, offset_l) + return self._construct_sample(tokens) + + def __len__(self): + return self.sample_idx.shape[0] - 1 diff --git a/examples/language_model/moe/dygraph/framework/__init__.py b/examples/language_model/moe/dygraph/framework/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e10badb568d2b8019fe5e661943d2ea15ff05154 --- /dev/null +++ b/examples/language_model/moe/dygraph/framework/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .storage_process import assign_group_by_size +from .storage_process import flatten_dense_tensors +from .storage_process import obtain_storage +from paddle.optimizer import AdamW +from .group_sharded import group_sharded_parallel diff --git a/examples/language_model/moe/dygraph/framework/group_sharded.py b/examples/language_model/moe/dygraph/framework/group_sharded.py new file mode 100644 index 0000000000000000000000000000000000000000..27e753abf3f35056f31ca9a9e98d900b565b554e --- /dev/null +++ b/examples/language_model/moe/dygraph/framework/group_sharded.py @@ -0,0 +1,150 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from types import MethodType
+
+import paddle
+
+# New version
+from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_optimizer_stage2 import (
+    GroupShardedOptimizerStage2,
+)
+from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_stage2 import (
+    GroupShardedStage2,
+)
+from paddle.framework import core
+from paddle.incubate.distributed.models.moe.grad_clip import ClipGradForMOEByGlobalNorm
+from paddle.optimizer import Optimizer
+
+
+class ClipGradForShardedMOEByGlobalNorm(ClipGradForMOEByGlobalNorm):
+    @paddle.no_grad()
+    def _dygraph_clip(self, params_grads):
+        normal_params_grads = []
+        moe_params_grads = []
+
+        # separate moe params from normal params
+        if self.moe_group is not None and self.moe_group.nranks > 1:
+            for p, g in params_grads:
+                if self.is_expert_param_func(p):
+                    moe_params_grads.append((p, g))
+                else:
+                    normal_params_grads.append((p, g))
+        else:
+            normal_params_grads = params_grads
+
+        # why to return sum_dtype?
+        # we will call `get_l2_norm_pow` twice and the precisions may be different.
+        # For convenience and simplification, we use sum_dtype directly instead of global_norm_var_normal.dtype
+        global_norm_var_normal, sum_dtype = self.get_l2_norm_pow(normal_params_grads)
+        if global_norm_var_normal is not None:
+            paddle.distributed.all_reduce(
+                global_norm_var_normal, op=paddle.distributed.ReduceOp.SUM, group=self.moe_group
+            )
+
+        global_norm_var_moe = None
+        if len(moe_params_grads) > 0:
+            global_norm_var_moe, _ = self.get_l2_norm_pow(moe_params_grads, sum_dtype)
+            if global_norm_var_moe is not None:
+                paddle.distributed.all_reduce(
+                    global_norm_var_moe, op=paddle.distributed.ReduceOp.SUM, group=self.moe_group
+                )
+
+        if global_norm_var_normal is None and global_norm_var_moe is None:
+            return params_grads
+        elif global_norm_var_normal is None:
+            global_norm_var = global_norm_var_moe
+        elif global_norm_var_moe is None:
+            global_norm_var = global_norm_var_normal
+        else:
+            if global_norm_var_normal.dtype != global_norm_var_moe.dtype:
+                # compared with normal norm, moe norm is the later one,
+                # so its precision is no lower than normal norm
+                global_norm_var_normal = global_norm_var_normal.astype(global_norm_var_moe.dtype)
+            global_norm_var = global_norm_var_normal + global_norm_var_moe
+
+        params_and_grads = []
+        global_norm_var = paddle.sqrt(global_norm_var)
+        max_global_norm = paddle.full(shape=[1], dtype=global_norm_var.dtype, fill_value=self.clip_norm)
+        # scale gradients by clip_norm / max(global_norm, clip_norm)
+        clip_var = paddle.divide(x=max_global_norm, y=paddle.maximum(x=global_norm_var, y=max_global_norm))
+        for p, g in params_grads:
+            if g is None:
+                continue
+            if getattr(p, "need_clip", True) is False:
+                params_and_grads.append((p, g))
+                continue
+            # TODO(wangxi): use inplace elementwise_mul
+            clip_input = clip_var.astype("float16") if g.dtype == core.VarDesc.VarType.FP16 else clip_var
+            new_grad = paddle.multiply(x=g, y=clip_input)
+            params_and_grads.append((p, new_grad))
+        return params_and_grads
+
+
+def group_sharded_parallel(
+    model, optimizer, group=None, offload=False, sync_buffers=False, buffer_max_size=2**23, segment_size=2**20
+):
+
+    # check option types
+    assert isinstance(model, paddle.nn.Layer), "The model must be the instance of paddle.nn.Layer."
+    assert isinstance(optimizer, Optimizer), "The optimizer must be the instance of paddle.optimizer.Optimizer."
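+
+    # Parameters whose names contain "expert" or "gate" are excluded from sharding and
+    # stay replicated on every rank; only the remaining dense parameters are handed to
+    # GroupShardedOptimizerStage2 / GroupShardedStage2 below. clear_grad is patched at
+    # the end so that gradients of the excluded expert/gate parameters are cleared as well.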
+ + def check_dtype(param): + return param.dtype == paddle.float16 + + sharded_params = [] + pretreated_params = [] + for p in optimizer._parameter_list: + if "expert" not in p.name and "gate" not in p.name: + sharded_params.append(p) + else: + pretreated_params.append(p) + + opt_gc = optimizer._grad_clip + if opt_gc is not None: + optimizer._grad_clip = ClipGradForShardedMOEByGlobalNorm( + opt_gc.clip_norm, opt_gc.is_expert_param_func, opt_gc.moe_group, opt_gc.group_name + ) + + # convert model/optimizer + optimizer = GroupShardedOptimizerStage2(params=sharded_params, optim=optimizer, group=group, offload=offload) + model = GroupShardedStage2( + model, optimizer, group=group, sync_buffers=sync_buffers, buffer_max_size=buffer_max_size + ) + + clear_func = model._clear_gradients + for opt in model._sharding_optimizers: + + def _opt_clear(self): + clear_func() + for p in pretreated_params: + if p.grad is not None: + p.grad.zero_() + + opt.clear_grad = MethodType(_opt_clear, opt) + + return model, optimizer diff --git a/examples/language_model/moe/dygraph/framework/storage_process.py b/examples/language_model/moe/dygraph/framework/storage_process.py new file mode 100644 index 0000000000000000000000000000000000000000..11bb790d14e1b1c5807db81d7e8791177d8155d9 --- /dev/null +++ b/examples/language_model/moe/dygraph/framework/storage_process.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict + +import numpy as np +from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_storage import ( + GradStorage, + ParamStorage, +) +from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_utils import Type +from paddle.framework import core + +alignment = { + "gpu": 256, +} +align = { + Type.fp16.value: 2, + Type.fp32.value: 4, +} + + +def assign_group_by_size(parameters, group_size=256 * 1024 * 1024): + is_sparse_gradient = [False] * len(parameters) + + group_indices = core.eager_assign_group_by_size(parameters, is_sparse_gradient, [group_size, group_size]) + + var_groups = OrderedDict() + for group_idx, indices in enumerate(group_indices): + for index in indices: + var_groups.setdefault(group_idx, []).append(parameters[index]) + return var_groups + + +def flatten_dense_tensors(parameters): + _buffer_size = 0 + _param2align = {} + dtype = parameters[0].dtype + + for param in parameters: + assert param.trainable, "param must be trainable..." 
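+        # Pad each parameter up to the 256-byte GPU alignment boundary: `size` is the
+        # parameter size in bytes, `ali` is the padding in bytes, and `align_` is that
+        # padding converted back to an element count, which is added to the flat buffer
+        # size and recorded per parameter for add_rank_params / add_grad below.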
+ size = np.prod(param.shape) * align[dtype] + remaining = size % alignment["gpu"] + ali = 0 if remaining == 0 else alignment["gpu"] - remaining + align_ = ali // align[dtype] + _buffer_size += np.prod(param.shape) + align_ + _param2align[param.name] = align_ + + param_storage = ParamStorage(size=_buffer_size, dtype=dtype, device="gpu") + + param_storage.add_rank_params(parameters, _param2align) + + # process gradient + grad_storage = GradStorage(size=_buffer_size, dtype=dtype, device="gpu", destination="0", parm2align=_param2align) + + for param in parameters: + grad_storage.add_grad(param, _param2align[param.name]) + + # param_storage --> grad_storage + param_storage.buffer._copy_gradient_from(grad_storage.buffer) + param_storage.buffer.stop_gradient = False + return param_storage, grad_storage + + +def obtain_storage(parameters): + if len(parameters) < 1: + return [] + + var_groups = assign_group_by_size(parameters) + storage = [] + for group_idx, parameters in var_groups.items(): + param_storage, grad_storage = flatten_dense_tensors(parameters) + storage.append(param_storage.buffer) + return storage diff --git a/examples/language_model/moe/dygraph/lr.py b/examples/language_model/moe/dygraph/lr.py new file mode 100644 index 0000000000000000000000000000000000000000..09070bccaa3331de13d53229f225626fcc714509 --- /dev/null +++ b/examples/language_model/moe/dygraph/lr.py @@ -0,0 +1 @@ +../../gpt/lr.py \ No newline at end of file diff --git a/examples/language_model/moe/dygraph/modeling.py b/examples/language_model/moe/dygraph/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..b45c9a465e70e0a4d51c2959de4948fc82528232 --- /dev/null +++ b/examples/language_model/moe/dygraph/modeling.py @@ -0,0 +1,1070 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
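+
+# GPT components for the MoE example: the standard GPT decoder stack, where each
+# TransformerDecoderLayer can optionally replace its feed-forward block with a MoELayer
+# built from per-rank ExpertLayer experts (expert_mode=True).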
+ +import collections + +import paddle +import paddle.incubate as incubate +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.tensor as tensor +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import ( + LayerDesc, + PipelineLayer, + SharedLayerDesc, + get_rng_state_tracker, +) +from paddle.incubate.distributed.models import moe +from paddle.nn.layer.transformer import _convert_param_attr_to_list + +from paddlenlp.transformers import PretrainedModel, register_base_model + +MoeLayer = moe.MoELayer + +__all__ = [ + "GPTModel", + "GPTPretrainedModel", + "GPTForPretraining", + "GPTPretrainingCriterion", + "GPTForGreedyGeneration", + "GPTLMHeadModel", +] + + +class ExpertLayer(nn.Layer): + def __init__(self, d_model, d_hidden, name=None, rank=0, windex=0, num_expert=1): + super(ExpertLayer, self).__init__() + + self.htoh4 = nn.Linear( + d_model, + d_hidden, + weight_attr=nn.initializer.KaimingUniform(), + bias_attr=nn.initializer.Constant(value=0.0), + ) + self.h4toh = nn.Linear( + d_hidden, + d_model, + weight_attr=nn.initializer.KaimingUniform(), + bias_attr=nn.initializer.Constant(value=0.0), + ) + self.htoh4.weight.name = "expert_" + self.htoh4.weight.name + self.h4toh.weight.name = "expert_" + self.h4toh.weight.name + self.htoh4.bias.name = "expert_" + self.htoh4.bias.name + self.h4toh.bias.name = "expert_" + self.h4toh.bias.name + + def forward(self, x): + x = self.htoh4(x) + x = F.gelu(x, approximate=True) + x = self.h4toh(x) + return x + + +def parallel_matmul(lm_output, logit_weights, parallel_output): + hcg = fleet.get_hybrid_communicate_group() + model_parallel_group = hcg.get_model_parallel_group() + world_size = hcg.get_model_parallel_world_size() + + if world_size > 1: + input_parallel = paddle.distributed.collective._c_identity(lm_output, group=model_parallel_group) + + logits = paddle.matmul(input_parallel, logit_weights, transpose_y=True) + + if parallel_output: + return logits + + return paddle.distributed.collective._c_concat(logits, group=model_parallel_group) + else: + logits = paddle.matmul(lm_output, logit_weights, transpose_y=True) + return logits + + +class MultiHeadAttention(nn.Layer): + """ + Attention mapps queries and a set of key-value pairs to outputs, and + Multi-Head Attention performs multiple parallel attention to jointly attending + to information from different representation subspaces. 
+ + """ + + Cache = collections.namedtuple("Cache", ["k", "v"]) + StaticCache = collections.namedtuple("StaticCache", ["k", "v"]) + + def __init__( + self, + embed_dim, + num_heads, + dropout=0.0, + kdim=None, + vdim=None, + need_weights=False, + weight_attr=None, + bias_attr=None, + fuse=True, + num_partitions=1, + ): + super(MultiHeadAttention, self).__init__() + self.embed_dim = embed_dim + self.kdim = kdim if kdim is not None else embed_dim + self.vdim = vdim if vdim is not None else embed_dim + self.num_heads = num_heads + self.dropout = dropout + self.need_weights = need_weights + self.fuse = fuse + + self.head_dim = embed_dim // num_heads + assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads" + + assert self.num_heads % num_partitions == 0 + self.num_heads = self.num_heads // num_partitions + + if self.fuse: + assert self.kdim == embed_dim, "embed_dim should be equal to kdim" + assert self.vdim == embed_dim, "embed_dim should be equal to vidm" + + self.qkv_proj = fleet.meta_parallel.ColumnParallelLinear( + embed_dim, 3 * embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + else: + self.q_proj = fleet.meta_parallel.ColumnParallelLinear( + embed_dim, embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + + self.k_proj = fleet.meta_parallel.ColumnParallelLinear( + self.kdim, embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + + self.v_proj = fleet.meta_parallel.ColumnParallelLinear( + self.vdim, embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + + self.out_proj = fleet.meta_parallel.RowParallelLinear( + embed_dim, embed_dim, weight_attr=weight_attr, has_bias=True, input_is_parallel=True + ) + + def _fuse_prepare_qkv(self, query): + mix_layer = self.qkv_proj(query) + mix_layer = paddle.reshape_(mix_layer, [0, 0, self.num_heads, 3 * self.head_dim]) + mix_layer = paddle.transpose(mix_layer, [0, 2, 1, 3]) + q, k, v = paddle.split(mix_layer, num_or_sections=3, axis=-1) + return q, k, v + + def _prepare_qkv(self, query, key, value, use_cache=False, cache=None): + r""" + Prapares linear projected queries, keys and values for usage of subsequnt + multiple parallel attention. If `cache` is not None, using cached results + to reduce redundant calculations. + + """ + q = self.q_proj(query) + q = tensor.reshape(x=q, shape=[0, 0, self.num_heads, self.head_dim]) + q = tensor.transpose(x=q, perm=[0, 2, 1, 3]) + + if isinstance(cache, self.StaticCache): + # for encoder-decoder attention in inference and has cached + k, v = cache.k, cache.v + else: + k, v = self.compute_kv(key, value) + + if isinstance(cache, self.Cache): + # for decoder self-attention in inference + k = tensor.concat([cache.k, k], axis=2) + v = tensor.concat([cache.v, v], axis=2) + if use_cache is True: + cache = self.Cache(k, v) + + return (q, k, v) if use_cache is False else (q, k, v, cache) + + def compute_kv(self, key, value): + r""" + Applies linear projection on input keys and values, then splits heads + (reshape and transpose) to get keys and values from different representation + subspaces. The results are used as key-values pairs for subsequent multiple + parallel attention. + + It is part of calculations in multi-head attention, and is provided as + a method to pre-compute and prefetch these results, thus we can use them + to construct cache for inference. 
+ + """ + k = self.k_proj(key) + v = self.v_proj(value) + k = tensor.reshape(x=k, shape=[0, 0, self.num_heads, self.head_dim]) + k = tensor.transpose(x=k, perm=[0, 2, 1, 3]) + v = tensor.reshape(x=v, shape=[0, 0, self.num_heads, self.head_dim]) + v = tensor.transpose(x=v, perm=[0, 2, 1, 3]) + return k, v + + def gen_cache(self, key, value=None, type=Cache): + """ + Generates cache for `forward` usage in inference accroding to arguments. + The generated cache is an instance of `MultiHeadAttention.Cache` or an + instance of `MultiHeadAttention.StaticCache`. + """ + if type == MultiHeadAttention.StaticCache: # static_kv + k, v = self.compute_kv(key, value) + return self.StaticCache(k, v) + elif value is None: # incremental_state + k = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + v = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + return self.Cache(k, v) + else: + # incremental_state with initial value, mainly for usage like UniLM + return self.Cache(key, value) + + def forward(self, query, key, value, attn_mask=None, use_cache=False, cache=None): + r""" + Applies multi-head attention to map queries and a set of key-value pairs + to outputs. + """ + key = query if key is None else key + value = query if value is None else value + # compute q ,k ,v + if use_cache is False: + if self.fuse: + q, k, v = self._fuse_prepare_qkv(query) + else: + q, k, v = self._prepare_qkv(query, key, value, use_cache, cache) + else: + q, k, v, cache = self._prepare_qkv(query, key, value, use_cache, cache) + # scale dot product attention + product = paddle.matmul(x=q, y=k, transpose_y=True) * (self.head_dim**-0.5) + + # if attn_mask is not None: + # product = product + attn_mask + # weights = F.softmax(product) + + weights = incubate.softmax_mask_fuse_upper_triangle(product) + + if self.dropout: + with get_rng_state_tracker().rng_state("local_seed"): + weights = F.dropout(weights, self.dropout, training=self.training, mode="upscale_in_train") + + out = tensor.matmul(weights, v) + + # combine heads + out = tensor.transpose(out, perm=[0, 2, 1, 3]) + out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]]) + + # project to output + out = self.out_proj(out) + + outs = [out] + if self.need_weights: + outs.append(weights) + if use_cache: + outs.append(cache) + return out if len(outs) == 1 else tuple(outs) + + +class TransformerDecoder(nn.Layer): + """ + TransformerDecoder is a stack of N decoder layers. + """ + + def __init__(self, decoder_layers, num_layers, norm=None, hidden_size=None): + super(TransformerDecoder, self).__init__() + + self.num_layers = num_layers + self.layers = decoder_layers + self.norm = norm + if norm == "LayerNorm": + self.norm = nn.LayerNorm(hidden_size) + elif norm is not None: + raise ValueError("Only support LayerNorm") + self.checkpoints = [] + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, use_cache=False, cache=None): + r""" + Applies a stack of N Transformer decoder layers on inputs. If `norm` is + provided, also applies layer normalization on the output of last decoder + layer. 
+ """ + output = tgt + new_caches = [] + self.checkpoints = [] + + for i, mod in enumerate(self.layers): + if cache is None: + if use_cache: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, use_cache=use_cache, cache=cache) + new_caches.append(new_cache) + else: + output = mod(output, memory, tgt_mask=tgt_mask, use_cache=use_cache, cache=cache) + else: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, use_cache=use_cache, cache=cache[i]) + new_caches.append(new_cache) + self.checkpoints.append(output.name) + + if self.norm is not None: + output = self.norm(output) + return output if use_cache is False else (output, new_caches) + + def gen_cache(self, memory, do_zip=False): + r""" + Generates cache for `forward` usage. The generated cache is a list, and + each element in it is a tuple( :code:`(incremental_cache, static_cache)` ) + produced by `TransformerDecoderLayer.gen_cache`. See `TransformerDecoderLayer.gen_cache` + for more details. If `do_zip` is True, apply `zip` on these tuples to get + a list with two elements. + """ + cache = [layer.gen_cache(memory) for layer in self.layers] + if do_zip: + cache = list(zip(*cache)) + return cache + + +class TransformerDecoderLayer(nn.Layer): + """ + The transformer decoder layer. + + It contains multiheadattention and some linear layers. + """ + + def __init__( + self, + d_model, + nhead, + dim_feedforward, + dropout=0.1, + activation="gelu", + attn_dropout=None, + act_dropout=None, + normalize_before=True, + weight_attr=None, + bias_attr=None, + num_partitions=1, + expert_mode=False, + num_experts=1, + top_k=2, + hcg=None, + gate=None, + recompute_interval=0, + recompute_partition=False, + recompute_offload=False, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + + super(TransformerDecoderLayer, self).__init__() + attn_dropout = dropout if attn_dropout is None else attn_dropout + act_dropout = dropout if act_dropout is None else act_dropout + self.normalize_before = normalize_before + self.recompute_interval = recompute_interval + + # moe config + self.top_k = top_k + self.num_experts = num_experts + self.expert_mode = expert_mode + self.hcg = hcg + + weight_attrs = _convert_param_attr_to_list(weight_attr, 3) + bias_attrs = _convert_param_attr_to_list(bias_attr, 3) + + self.self_attn = MultiHeadAttention( + d_model, + nhead, + dropout=attn_dropout, + weight_attr=weight_attrs[0], + bias_attr=bias_attrs[0], + num_partitions=num_partitions, + ) + + if expert_mode: + experts_list = nn.LayerList() + for expi in range(num_experts): + exp_layer = ExpertLayer(d_model, dim_feedforward // top_k, windex=expi, num_expert=num_experts) + experts_list.append(exp_layer) + + moe_group = hcg.get_expert_parallel_group() + mp_group = hcg.get_model_parallel_group() + gate_config = { + "type": "gshard", + "top_k": top_k, + } + + recompute_ctx = {"mp_group": mp_group, "offload": recompute_offload, "partition": recompute_partition} + self.moe_mlp = MoeLayer( + d_model=d_model, + experts=experts_list, + gate=gate_config, + moe_group=moe_group, + mp_group=mp_group, + recompute_interval=self.recompute_interval, + recompute_ctx=recompute_ctx, + ) + else: + self.linear1 = fleet.meta_parallel.ColumnParallelLinear( + d_model, dim_feedforward, weight_attr=weight_attrs[2], gather_output=False, has_bias=True + ) + + self.linear2 = fleet.meta_parallel.RowParallelLinear( + dim_feedforward, d_model, weight_attr=weight_attrs[2], input_is_parallel=True, has_bias=True + ) + + self.norm1 = 
nn.LayerNorm(d_model, epsilon=1e-5) + self.norm2 = nn.LayerNorm(d_model, epsilon=1e-5) + self.dropout1 = nn.Dropout(dropout, mode="upscale_in_train") + self.dropout2 = nn.Dropout(act_dropout, mode="upscale_in_train") + self.activation = getattr(F, activation) + + def forward(self, tgt, memory=None, tgt_mask=None, use_cache=False, cache=None): + residual = tgt + + if self.normalize_before: + tgt = self.norm1(tgt) + + if use_cache is False: + tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, use_cache, cache) + else: + tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, use_cache, cache) + + with get_rng_state_tracker().rng_state("global_seed"): + tgt = residual + self.dropout1(tgt) + + if not self.normalize_before: + tgt = self.norm1(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm2(tgt) + + if self.expert_mode: + tgt = self.moe_mlp(tgt) + else: + with get_rng_state_tracker().rng_state("global_seed"): + tgt = self.dropout2(self.linear2(F.gelu(self.linear1(tgt), approximate=True))) + + tgt = residual + tgt + + if not self.normalize_before: + tgt = self.norm2(tgt) + + return tgt if use_cache is False else (tgt, incremental_cache) + + def gen_cache(self, memory): + incremental_cache = self.self_attn.gen_cache(memory, type=self.self_attn.Cache) + return incremental_cache + + +class GPTEmbeddings(nn.Layer): + """ + Include embeddings from word, position and token_type embeddings + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + ): + super(GPTEmbeddings, self).__init__() + + self.word_embeddings = fleet.meta_parallel.VocabParallelEmbedding( + vocab_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Normal(mean=0.0, std=initializer_range)), + ) + + self.position_embeddings = nn.Embedding( + max_position_embeddings, + hidden_size, + weight_attr=paddle.ParamAttr( + name="pos_embeddings", initializer=nn.initializer.Normal(mean=0.0, std=initializer_range) + ), + ) + + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, position_ids=None): + if position_ids is None: + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + embeddings = input_embedings + position_embeddings + + with get_rng_state_tracker().rng_state("global_seed"): + embeddings = self.dropout(embeddings) + + return embeddings + + +class GPTPretrainedModel(PretrainedModel): + """ + An abstract class for pretrained GPT models. It provides GPT related + `model_config_file`, `resource_files_names`, `pretrained_resource_files_map`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. See `PretrainedModel` for more details. 
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "gpt-cpm-large-cn": { # 2.6B + "vocab_size": 30000, + "hidden_size": 2560, + "num_hidden_layers": 32, + "num_attention_heads": 32, + "intermediate_size": 10240, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "pad_token_id": 0, + "eos_token_id": 7, + "bos_token_id": 0, + "eol_token_id": 3, + "num_partitions": 1, + }, + "gpt-cpm-small-cn-distill": { # 109M + "vocab_size": 30000, + "hidden_size": 768, + "num_hidden_layers": 12, + "num_attention_heads": 12, + "intermediate_size": 3072, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "pad_token_id": 0, + "eos_token_id": 7, + "bos_token_id": 0, + "eol_token_id": 3, + "num_partitions": 1, + }, + "gpt3-13B-en": { # 13B + "vocab_size": 50304, + "hidden_size": 5120, + "num_hidden_layers": 40, + "num_attention_heads": 128, + "intermediate_size": 20480, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt3-1.3B-en": { # 1.3B + "vocab_size": 50304, + "hidden_size": 2048, + "num_hidden_layers": 24, + "num_attention_heads": 16, + "intermediate_size": 8192, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt2-medium-en": { # 345M + "vocab_size": 50304, + "hidden_size": 1024, + "num_hidden_layers": 24, + "num_attention_heads": 16, + "intermediate_size": 4096, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt2-en": { # 117M + "vocab_size": 50304, + "hidden_size": 768, + "num_hidden_layers": 12, + "num_attention_heads": 12, + "intermediate_size": 3072, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt2-small-en": { # config for CE + "vocab_size": 50304, + "hidden_size": 1024, # 1024 + "num_hidden_layers": 8, # 4 + "num_attention_heads": 16, + "intermediate_size": 1024 * 4, # 4096, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "gpt-cpm-large-cn": "https://paddlenlp.bj.bcebos.com/models/transformers/gpt/gpt-cpm-large-cn.pdparams", + "gpt-cpm-small-cn-distill": "https://paddlenlp.bj.bcebos.com/models/transformers/gpt/gpt-cpm-small-cn-distill.pdparams", + "gpt2-medium-en": 
"https://paddlenlp.bj.bcebos.com/models/transformers/gpt/gpt2-medium-en.pdparams", + } + } + base_model_prefix = "gpt" + + def _init_weights(self, layer): + """Initialization hook""" + # no hook + return + if isinstance(layer, (nn.Linear, nn.Embedding)): + # In the dygraph mode, use the `set_value` to reset the parameter directly, + # and reset the `state_dict` to update parameter in static mode. + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.gpt.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + + +@register_base_model +class GPTModel(GPTPretrainedModel): + """ + The base model of gpt. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + pad_token_id=0, + eos_token_id=7, + bos_token_id=0, + eol_token_id=3, + num_partitions=1, + expert_mode=False, + num_experts=1, + top_k=2, + hcg=None, + gate=None, + recompute_interval=0, + recompute_partition=False, + recompute_offload=False, + ): + super(GPTModel, self).__init__() + + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.hidden_size = hidden_size + self.vocab_size = vocab_size + + self.embeddings = GPTEmbeddings( + vocab_size, + hidden_size, + hidden_dropout_prob, + max_position_embeddings, + type_vocab_size, + self.initializer_range, + ) + + decoder_layers = nn.LayerList() + for i in range(num_hidden_layers): + decoder_layers.append( + TransformerDecoderLayer( + d_model=hidden_size, + nhead=num_attention_heads, + dim_feedforward=intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=hidden_dropout_prob, + weight_attr=paddle.ParamAttr( + initializer=nn.initializer.Normal(mean=0.0, std=self.initializer_range) + ), + bias_attr=None, + num_partitions=num_partitions, + expert_mode=expert_mode, + num_experts=num_experts, + top_k=top_k, + hcg=hcg, + gate=gate, + recompute_interval=recompute_interval, + recompute_partition=recompute_partition, + recompute_offload=recompute_offload, + ) + ) + + self.decoder = TransformerDecoder(decoder_layers, num_hidden_layers, norm="LayerNorm", hidden_size=hidden_size) + + self.checkpoints = [] + + def forward(self, input_ids, position_ids=None, attention_mask=None, use_cache=False, cache=None): + self.checkpoints = [] + if position_ids is None: + past_length = 0 + if cache is not None: + past_length = paddle.shape(cache[0].k)[-2] + position_ids = paddle.arange(past_length, paddle.shape(input_ids)[-1] + past_length, dtype="int64") + position_ids = position_ids.unsqueeze(0) + # .expand_as(input_ids) + position_ids = paddle.expand_as(position_ids, input_ids) + embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids) + + encoder_outputs = self.decoder( + embedding_output, + memory=None, + # tgt_mask=attention_mask, + tgt_mask=None, + use_cache=use_cache, + cache=cache, + ) + self.checkpoints.extend(self.decoder.checkpoints) + return encoder_outputs + + +class GPTForPretraining(GPTPretrainedModel): + """ + The pretraining model of GPT. + + It returns some logits and cached_kvs. 
+ """ + + def __init__(self, gpt): + super(GPTForPretraining, self).__init__() + self.gpt = gpt + + def forward( + self, input_ids, position_ids=None, attention_mask=None, masked_positions=None, use_cache=False, cache=None + ): + outputs = self.gpt( + input_ids, position_ids=position_ids, attention_mask=attention_mask, use_cache=use_cache, cache=cache + ) + if use_cache: + encoder_outputs, cached_kvs = outputs[:2] + else: + encoder_outputs = outputs + + logits = parallel_matmul(encoder_outputs, self.gpt.embeddings.word_embeddings.weight, True) + + if use_cache: + return logits, cached_kvs + else: + return logits + + +class GPTPretrainingCriterion(paddle.nn.Layer): + """ + Criterion for GPT. + + It calculates the final loss. + """ + + def __init__(self): + super(GPTPretrainingCriterion, self).__init__() + self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none") + self.parallel_loss_func = fleet.meta_parallel.ParallelCrossEntropy() + + def forward(self, prediction_scores, masked_lm_labels, loss_mask): + + hcg = fleet.get_hybrid_communicate_group() + mp_size = hcg.get_model_parallel_world_size() + if mp_size > 1: + masked_lm_loss = self.parallel_loss_func(prediction_scores, masked_lm_labels.unsqueeze(2)) + else: + masked_lm_loss = self.loss_func(prediction_scores, masked_lm_labels.unsqueeze(2)) + + loss_mask = loss_mask.reshape([-1]) + masked_lm_loss = paddle.sum(masked_lm_loss.reshape([-1]) * loss_mask) + loss = masked_lm_loss / loss_mask.sum() + return loss + + +class GPTForGreedyGeneration(GPTPretrainedModel): + """ + The generate model for GPT-2. + It use the greedy stategy and generate the next word with highest probablity. + """ + + def __init__(self, gpt, max_predict_len): + super(GPTForGreedyGeneration, self).__init__() + self.gpt = gpt + self.max_predict_len = paddle.to_tensor(max_predict_len, dtype="int32") + + def model( + self, input_ids, position_ids=None, attention_mask=None, masked_positions=None, use_cache=False, cache=None + ): + outputs = self.gpt( + input_ids, position_ids=position_ids, attention_mask=attention_mask, use_cache=use_cache, cache=cache + ) + if use_cache: + encoder_outputs, cached_kvs = outputs[:2] + else: + encoder_outputs = outputs + logits = paddle.matmul(encoder_outputs, self.gpt.embeddings.word_embeddings.weight, transpose_y=True) + + if use_cache: + return logits, cached_kvs + else: + return logits + + def forward(self, input_ids, end_id): + output, cached_kvs = self.model(input_ids, use_cache=True, cache=None) + src_ids = input_ids + nid = paddle.argmax(output[:, -1, :], axis=-1).reshape([-1, 1]) + src_ids = paddle.concat([src_ids, nid], axis=1) + cur_len = 0 + while cur_len < self.max_predict_len: + output, cached_kvs = self.model(nid, use_cache=True, cache=cached_kvs) + + nid = paddle.argmax(output[:, -1, :], axis=-1).reshape([-1, 1]) + src_ids = paddle.concat([src_ids, nid], axis=1) + cur_len += 1 + if paddle.max(nid) == end_id: + break + return src_ids + + +class GPTLMHead(nn.Layer): + def __init__(self, hidden_size, vocab_size, embedding_weights=None): + super(GPTLMHead, self).__init__() + self.decoder_weight = ( + self.create_parameter(shape=[vocab_size, hidden_size], dtype=paddle.get_default_dtype(), is_bias=True) + if embedding_weights is None + else embedding_weights + ) + + def forward(self, hidden_states): + logits = paddle.tensor.matmul(hidden_states, self.decoder_weight, transpose_y=True) + return logits + + +class GPTLMHeadModel(GPTPretrainedModel): + def __init__(self, gpt): + super(GPTLMHeadModel, self).__init__() + self.gpt = gpt + 
self.lm_head = GPTLMHead( + self.gpt.config["hidden_size"], self.gpt.config["vocab_size"], self.gpt.embeddings.word_embeddings.weight + ) + + def forward(self, input_ids, position_ids=None, attention_mask=None, use_cache=False, cache=None): + outputs = self.gpt( + input_ids, position_ids=position_ids, attention_mask=attention_mask, use_cache=use_cache, cache=cache + ) + + if use_cache: + encoder_outputs, cached_kvs = outputs[:2] + else: + encoder_outputs = outputs + + logits = self.lm_head(encoder_outputs) + + if use_cache: + return logits, cached_kvs + else: + return logits + + def prepare_inputs_for_generation(self, input_ids, use_cache=False, cache=None, **kwargs): + # only last token for inputs_ids if cache is defined in kwargs + position_ids = kwargs.get("position_ids", None) + attention_mask = kwargs.get("attention_mask", None) + if cache is not None: + input_ids = input_ids[:, -1].unsqueeze(-1) + if position_ids is not None: + position_ids = position_ids[:, -1].unsqueeze(-1) + if attention_mask is not None: + attention_mask = attention_mask[:, :, -1, :].unsqueeze(2) + + return { + "input_ids": input_ids, + "position_ids": position_ids, + "attention_mask": attention_mask, + "use_cache": use_cache, + "cache": cache, + } + + def __getattr__(self, name): + try: + return super().__getattr__(name) + except AttributeError as e: + try: + return getattr(getattr(self, self.base_model_prefix), name) + except AttributeError: + try: + return getattr(self, self.base_model_prefix).config[name] + except KeyError: + raise e + + +# these Layers is just for PipelineParallel + + +class GPTPretrainingCriterionPipe(GPTPretrainingCriterion): + """Extends GPTPretrainingCriterion to meet the input standard.""" + + def forward(self, prediction_scores, args): + masked_lm_labels = args[0] + loss_mask = args[1] + loss = super().forward(prediction_scores, masked_lm_labels, loss_mask) + return loss + + +class EmbeddingPipe(GPTEmbeddings): + """Extends GPTEmbeddings to forward attention_mask through the pipeline.""" + + @property + def embedding_weight(self): + return self.word_embeddings.weight + + def forward(self, input_ids): + embeddings = super().forward(input_ids=input_ids, position_ids=None) + return embeddings + + +class GPTForPretrainingPipe(PipelineLayer): + """GPTForPretraining adapted for pipeline parallelism. + + The largest change is flattening the GPTModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. 
+ """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + pad_token_id=0, + eos_token_id=7, + bos_token_id=0, + eol_token_id=3, + num_partitions=1, + topology=None, + recompute_interval=0, + expert_mode=False, + num_experts=1, + top_k=2, + hcg=None, + ): + + # forward desc + self.descs = [] + + self.descs.append( + SharedLayerDesc( + "embed", + EmbeddingPipe, + shared_weight_attr="embedding_weight", + vocab_size=vocab_size, + hidden_size=hidden_size, + hidden_dropout_prob=hidden_dropout_prob, + max_position_embeddings=max_position_embeddings, + type_vocab_size=type_vocab_size, + initializer_range=0.02, + ) + ) + + for _ in range(num_hidden_layers): + self.descs.append( + LayerDesc( + TransformerDecoderLayer, + d_model=hidden_size, + nhead=num_attention_heads, + dim_feedforward=intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=hidden_dropout_prob, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Normal(mean=0.0, std=initializer_range)), + bias_attr=None, + num_partitions=num_partitions, + expert_mode=expert_mode, + num_experts=num_experts, + top_k=top_k, + hcg=hcg, + ) + ) + + self.descs.append(LayerDesc(nn.LayerNorm, normalized_shape=hidden_size)) + + def _logits_helper(embedding, output): + return parallel_matmul(output, embedding.embedding_weight, True) + + self.descs.append( + SharedLayerDesc( + "embed", + EmbeddingPipe, + forward_func=_logits_helper, + shared_weight_attr="embedding_weight", + vocab_size=vocab_size, + hidden_size=hidden_size, + hidden_dropout_prob=hidden_dropout_prob, + max_position_embeddings=max_position_embeddings, + type_vocab_size=type_vocab_size, + initializer_range=0.02, + ) + ) + + super().__init__( + layers=self.descs, + loss_fn=GPTPretrainingCriterionPipe(), + topology=topology, + seg_method="layer:TransformerDecoderLayer", + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": fleet.fleet._hcg.get_model_parallel_group(), + "offload": False, + "partition": False, + }, + ) diff --git a/examples/language_model/moe/dygraph/run.sh b/examples/language_model/moe/dygraph/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..2f513281b60d6bf9b415065bfa275fbd39405355 --- /dev/null +++ b/examples/language_model/moe/dygraph/run.sh @@ -0,0 +1,36 @@ +export PYTHONPATH=$PYTHONPATH:../../../../ + +log_dir=dp8 +rm -rf $log_dir + +python -m paddle.distributed.launch --log_dir $log_dir --gpus "0,1,2,3,4,5,6,7" run_moe_pretrain.py \ + --model_type gpt \ + --model_name_or_path gpt2-small-en \ + --input_dir "./data"\ + --output_dir "output"\ + --weight_decay 0.01\ + --grad_clip 1.0\ + --max_steps 50000\ + --save_steps 100000\ + --decay_steps 320000\ + --device gpu\ + --eval_freq 1000\ + --warmup_rate 0.01\ + --local_batch_size 8\ + --dp_degree 8\ + --mp_degree 1\ + --pp_degree 1\ + --sharding_degree 1\ + --sharding_offload False\ + --expert_mode True\ + --logging_freq 1 \ + --num_experts 8\ + --use_pure_fp16 True\ + --use_recompute True\ + --recompute_partition False\ + --recompute_offload False\ + --resume_dir ""\ + --scale_loss 32768 \ + --gate gshard \ + --balance_loss_weight 1.0 + diff --git a/examples/language_model/moe/dygraph/run_moe_pretrain.py 
b/examples/language_model/moe/dygraph/run_moe_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..0649ec771a08e8eab862934d1dc276492802d2d4 --- /dev/null +++ b/examples/language_model/moe/dygraph/run_moe_pretrain.py @@ -0,0 +1,705 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time +import types +from types import MethodType + +import lr +import numpy as np +import paddle +import paddle.distributed as dist +from args import parse_args +from checkpointing import load_checkpoint, save_checkpoint +from dataset import create_pretrained_dataset +from framework import AdamW, group_sharded_parallel, obtain_storage +from modeling import ( + GPTForPretraining, + GPTForPretrainingPipe, + GPTModel, + GPTPretrainingCriterion, +) +from paddle import _legacy_C_ops +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_utils import ( + GroupShardedScaler, +) +from paddle.framework import core +from paddle.incubate.distributed.models import moe +from utils import get_timers, set_timers +from visualdl import LogWriter + +from paddlenlp.transformers import GPTChineseTokenizer, GPTTokenizer +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTForPretraining, GPTTokenizer), + "gpt-cn": (GPTForPretraining, GPTChineseTokenizer), +} + +set_timers() + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + from paddle.distributed.fleet import meta_parallel + + meta_parallel.model_parallel_random_seed(basic_seed + data_world_rank + 1000 * mp_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 123 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + data_world_rank + tracker = get_rng_state_tracker() + tracker.add("global_seed", global_seed) + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, criterion, iter_steps, log_writer, global_step, epoch, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + for eval_step, batch in enumerate(data_loader): + tokens, loss_mask, labels = batch + # paddle version >= 2.5.0 or develop + paddle_version = float(paddle.__version__[:3]) + if (paddle_version == 0.0) or (paddle_version >= 2.5): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + use_promote=False, + ): + preds = model(tokens) + else: + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + ): + 
preds = model(tokens) + preds = paddle.cast(preds, dtype="float32") + loss = criterion(preds, labels, loss_mask) + + all_loss.append(float(loss)) + if eval_step >= iter_steps - 1: + break + + average_loss = sum(all_loss) / len(all_loss) + logger.info( + "%s step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, epoch, eval_step, average_loss, iter_steps / (time.time() - local_time)) + ) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + model.train() + + +def initialize_model_and_expert_group(hcg): + def get_expert_parallel_world_size(self): + return self.get_data_parallel_world_size() * self.get_model_parallel_world_size() + + hcg.get_expert_parallel_world_size = types.MethodType(get_expert_parallel_world_size, hcg) + + # need create mp_dp group for expert parallel group in advance + _, mp_dp_comm_group = hcg._set_check_group(parallel_method="pipe") + + def get_expert_parallel_group(self): + return mp_dp_comm_group + + hcg.get_expert_parallel_group = types.MethodType(get_expert_parallel_group, hcg) + + +def initialize_mp_dp_parameters(model, hcg): + mp_group = hcg.get_model_parallel_group() + mp_src_rank = hcg.get_model_parallel_group_src_rank() + + dp_group = hcg.get_data_parallel_group() + dp_src_rank = hcg.get_data_parallel_group_src_rank() + + for param in model.parameters(): + if "expert_" in param.name: + continue + if not param.is_distributed: + paddle.distributed.broadcast(param.detach(), src=mp_src_rank, group=mp_group, sync_op=True) + + paddle.distributed.broadcast(param.detach(), src=dp_src_rank, group=dp_group, sync_op=True) + + +def unscale_method(self, optimizer): + if not self._enable: + return + + if getattr(optimizer, "_param_groups", None) and isinstance(optimizer._param_groups[0], dict): + param_grads_fp16 = [] + param_grads_fp32 = [] + for group in optimizer._param_groups: + for param in group["params"]: + if param._grad_ivar() is not None: + if param._grad_ivar().dtype == core.VarDesc.VarType.FP16: + param_grads_fp16.append(param._grad_ivar()) + else: + param_grads_fp32.append(param._grad_ivar()) + else: + param_grads_fp16 = [ + param._grad_ivar() + for param in optimizer._parameter_list + if (param._grad_ivar() is not None) and (param._grad_ivar().dtype == core.VarDesc.VarType.FP16) + ] + param_grads_fp32 = [ + param._grad_ivar() + for param in optimizer._parameter_list + if (param._grad_ivar() is not None) and (param._grad_ivar().dtype == core.VarDesc.VarType.FP32) + ] + temp_found_inf_fp16 = paddle.to_tensor(np.array([0]).astype(np.bool_)) + temp_found_inf_fp32 = paddle.to_tensor(np.array([0]).astype(np.bool_)) + + if len(param_grads_fp16): + _legacy_C_ops.check_finite_and_unscale(param_grads_fp16, self._scale, param_grads_fp16, temp_found_inf_fp16) + if len(param_grads_fp32): + _legacy_C_ops.check_finite_and_unscale(param_grads_fp32, self._scale, param_grads_fp32, temp_found_inf_fp32) + self._found_inf = 1 if temp_found_inf_fp16 or temp_found_inf_fp32 else 0 + + if dist.get_world_size() > 1: + is_found_inf = paddle.to_tensor([self._found_inf], dtype="int32") + paddle.distributed.all_reduce(is_found_inf, op=paddle.distributed.ReduceOp.MAX, group=None) + self._found_inf = int(is_found_inf) + + +def all_reduce_parameters(params, group): + if group.nranks < 2: + return + + div_factor = 1.0 / group.nranks + with paddle.framework.no_grad(): + for p in params: + grad = p.grad.scale_(div_factor) + paddle.distributed.all_reduce(grad, sync_op=True) + + +def parameters_classify(model, use_sharding=False): + 
decay_gate_params = [] + decay_expert_params = [] + decay_other_params = [] + + gate_params = [] + expert_params = [] + other_params = [] + + for param in model.parameters(): + # param_name = param.name + if "expert_" in param.name: + if not any(nd in param.name for nd in ["bias", "norm"]): + decay_expert_params.append(param) + else: + expert_params.append(param) + elif "gate_" in param.name: + if not any(nd in param.name for nd in ["bias", "norm"]): + decay_gate_params.append(param) + else: + gate_params.append(param) + else: + if not any(nd in param.name for nd in ["bias", "norm"]): + decay_other_params.append(param) + else: + other_params.append(param) + + print("all parameters length:", len(model.parameters())) + print( + "decay_gate_params len: {}, decay_expert_params len: {}, decay_other_params len: {}".format( + len(decay_gate_params), len(decay_expert_params), len(decay_other_params) + ) + ) + print( + "gate_params len: {}, expert_params len: {}, other_params len: {}".format( + len(gate_params), len(expert_params), len(other_params) + ) + ) + + d_gate = obtain_storage(decay_gate_params) + gate = obtain_storage(gate_params) + + d_expert = obtain_storage(decay_expert_params) + expert = obtain_storage(expert_params) + + d_other = decay_other_params if use_sharding else obtain_storage(decay_other_params) + other = other_params if use_sharding else obtain_storage(other_params) + + opt_fused_tensors = [] + decay_fused_tensors = [] + reduce_fused_tensors = [] + gate_fused_tensors = [] + + decay_fused_tensors = d_gate + d_other + d_expert + opt_fused_tensors = decay_fused_tensors + gate + other + expert + reduce_fused_tensors = d_other + other + gate_fused_tensors = d_gate + gate + + expert_fusion_names = [] + for i, p in enumerate(d_expert + expert): + p.name = "fused_expert_tensor_{}".format(i) + expert_fusion_names.append(p.name) + + for i, p in enumerate(d_gate + gate): + p.name = "fused_gate_tensor_{}".format(i) + + return opt_fused_tensors, decay_fused_tensors, reduce_fused_tensors, gate_fused_tensors, expert_fusion_names + + +def timer_log(log_freq): + timers = get_timers() + # Logging + timers_to_log = [] + + def add_to_logging(name): + if name in timers.timers: + timers_to_log.append(name) + + add_to_logging("forward-compute") + add_to_logging("forward-recv") + add_to_logging("forward-send") + add_to_logging("forward-send-backward-recv") + add_to_logging("backward-compute") + add_to_logging("backward-recv") + add_to_logging("backward-send") + add_to_logging("backward-send-forward-recv") + add_to_logging("backward-params-all-reduce") + add_to_logging("backward-embedding-all-reduce") + add_to_logging("optimizer-copy-to-main-grad") + add_to_logging("optimizer-unscale-and-check-inf") + add_to_logging("optimizer-clip-main-grad") + add_to_logging("optimizer-copy-main-to-model-params") + add_to_logging("optimizer") + add_to_logging("batch-generator") + add_to_logging("Prepare Forward") + add_to_logging("Gate Computation") + add_to_logging("Limit_By_Capacity") + add_to_logging("Prune_Gate_By_Cap") + add_to_logging("Random Routing") + add_to_logging("Base Operation") + add_to_logging("AllGather in Limit") + add_to_logging("MOEScatter") + add_to_logging("Expert Computation") + add_to_logging("MOEGather") + add_to_logging("Score BMM") + add_to_logging("AllReduce") + add_to_logging("AllGather") + add_to_logging("lec reduce") + add_to_logging("lec reduce2") + + timers.log(timers_to_log, normalizer=log_freq) + + +def do_train(args): + paddle.set_device(args.device) + strategy = 
fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "sharding_degree": args.sharding_degree, + } + + accumulate_steps = args.local_batch_size // args.micro_batch_size + strategy.pipeline_configs = {"accumulate_steps": accumulate_steps, "micro_batch_size": args.micro_batch_size} + + fleet.init(is_collective=True, strategy=strategy) + + nranks = paddle.distributed.get_world_size() + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + pp_rank = hcg.get_stage_id() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + sharding_group = hcg.get_sharding_parallel_group() + + if args.sharding_degree > 1: + assert ( + args.dp_degree == args.mp_degree == args.pp_degree == 1 + ), "sharding stage2 will support hybrid parallel later" + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank, pp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Define log writer + log_writer_path = os.path.join( + args.output_dir, + "train_log", + "{}_globalbsz_{}_pure_fp16_{}_recompute_{}_card_{}".format( + args.model_name_or_path, args.global_batch_size, args.use_pure_fp16, False, global_rank + ).lower(), + ) + + if os.path.exists(log_writer_path): + import shutil + + shutil.rmtree(log_writer_path) + + log_writer = LogWriter(log_writer_path) + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + + model_config["num_partitions"] = args.mp_degree + + # MOE config + initialize_model_and_expert_group(hcg) + + model_config["expert_mode"] = args.expert_mode + model_config["hcg"] = hcg + model_config["num_experts"] = args.num_experts + model_config["top_k"] = args.top_k + if args.expert_mode: + model_config["gate"] = args.gate + + if args.pp_degree == 1: + model_config["recompute_interval"] = 1 if args.use_recompute else 0 + model_config["recompute_partition"] = args.recompute_partition + model_config["recompute_offload"] = args.recompute_offload + if args.use_recompute and args.recompute_partition: + raise Exception("when use_recompute is True, recompute_partition must be False in MoE.") + + model = GPTForPretraining(GPTModel(**model_config)) + else: + model_config["topology"] = hcg.topology() + model_config["recompute_interval"] = 1 if args.use_recompute else 0 + model = GPTForPretrainingPipe(**model_config) + else: + model = GPTForPretraining.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the critrion for the gpt model + criterion = GPTPretrainingCriterion() + + if args.decay_steps is 
None: + args.decay_steps = args.max_steps + warmup_step = args.warmup_rate * args.decay_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = lr.CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, min_lr=args.min_lr, warmup_step=warmup_step, decay_step=args.decay_steps + ) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + if args.sharding_degree == 1: + scaler = fleet.distributed_scaler(scaler) + scaler._unscale = MethodType(unscale_method, scaler) + else: + scaler = GroupShardedScaler(scaler) + + model = paddle.amp.decorate(models=model, optimizers=None, level="O2", save_dtype="float32") + + ( + opt_fused_tensors, + decay_fused_tensors, + reduce_fused_tensors, + gate_fused_tensors, + expert_fusion_names, + ) = parameters_classify(model, use_sharding=(args.sharding_degree > 1)) + decay_params = [p.name for p in decay_fused_tensors] + + clip = None + if args.grad_clip > 0: + is_expert_param_fun = lambda param: param.name in expert_fusion_names # noqa: E731 + clip = moe.ClipGradByGlobalNorm( + clip_norm=args.grad_clip, + is_expert_param_func=is_expert_param_fun, + moe_group=hcg.get_expert_parallel_group(), + ) + + optimizer = AdamW( + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=opt_fused_tensors, + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, # decay_params, + multi_precision=args.use_pure_fp16, + ) + + # in order to restore reader. + pass_num = 0 + file_id = 0 + start_epoch = 0 + args.resume_dir = None if len(args.resume_dir) <= 0 else args.resume_dir + + if paddle.distributed.get_world_size() > 1 and args.resume_dir is None: + print(">> initialize....") + if args.sharding_degree > 1: + model, optimizer = group_sharded_parallel(model, optimizer, sharding_group, args.sharding_offload) + for p in gate_fused_tensors: + dist.broadcast(p, src=sharding_group.ranks[0], group=sharding_group, sync_op=True) + # Multi stream operation will be supported later + dist.wait(tensor=p, group=sharding_group, use_calc_stream=True) + else: + initialize_mp_dp_parameters(model, hcg) + + if args.resume_dir is not None: + global_step, loss_scale, data_meta = load_checkpoint( + args, model, optimizer, lr_scheduler, tokenizer, dp_rank, mp_rank, pp_rank + ) + pass_num = data_meta["pass_num"] + file_id = data_meta["file_id"] + start_epoch = data_meta["start_epoch"] + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + opt_path = os.path.join(args.model_name_or_path, "model_state.pdopt") + if os.path.exists(opt_path): + opt_dict = paddle.load(opt_path) + optimizer.set_state_dict(opt_dict) + else: + logger.warning("No optimizer checkpoint file found in %s." 
% opt_path) + + global_step = 0 if args.resume_dir is None else global_step + timers = get_timers() + tic_train = time.time() + for epoch in range(start_epoch, args.num_train_epochs): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "npz_" not in str(f)) + ] + files.sort() + num_files = len(files) + for f_id in range(file_id, num_files): + data_file = files[f_id] + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + local_rank=local_rank, + data_world_size=data_world_size, + data_world_rank=data_world_rank, + eos_id=tokenizer.eos_token_id, + ) + + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + for step, batch in enumerate(train_data_loader()): + # to remove the train data that has been studyed. + if step < global_step - pass_num: + continue + + global_step += 1 + tokens, loss_mask, labels = batch + + loss_mask.stop_gradient = True + labels.stop_gradient = True + + loss = 0.0 + for i in range(accumulate_steps): + start_index = i * args.micro_batch_size + end_index = start_index + args.micro_batch_size + timers("forward-compute").start() + # paddle version >= 2.5.0 or develop + paddle_version = float(paddle.__version__[:3]) + if (paddle_version == 0.0) or (paddle_version >= 2.5): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + use_promote=False, + ): + preds = model(tokens[start_index:end_index, :]) + loss_mbs = criterion( + preds, labels[start_index:end_index, :], loss_mask[start_index:end_index, :] + ) + else: + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + ): + preds = model(tokens[start_index:end_index, :]) + loss_mbs = criterion( + preds, labels[start_index:end_index, :], loss_mask[start_index:end_index, :] + ) + timers("forward-compute").stop() + + if args.gate != "naive" and args.balance_loss_weight: + aux_loss_list = [ + l.moe_mlp.gate.get_loss(clear=False).reshape([-1]) + for l in model.gpt.decoder.layers + if hasattr(l.moe_mlp, "gate") + ] + bal_loss = paddle.concat(aux_loss_list) + if bal_loss.dtype == paddle.float16: + bal_loss = paddle.cast(bal_loss, dtype=paddle.float32) + bal_loss = bal_loss.mean() + loss_mbs += bal_loss * args.balance_loss_weight + loss_mbs = loss_mbs / accumulate_steps + + timers("backward-compute").start() + if args.use_pure_fp16: + scaler.scale(loss_mbs).backward() + else: + loss_mbs.backward() + timers("backward-compute").stop() + loss = loss + loss_mbs + + timers("backward-params-all-reduce").start() + all_reduce_parameters(gate_fused_tensors, hcg.get_expert_parallel_group()) + if args.sharding_degree == 1: + all_reduce_parameters(reduce_fused_tensors, hcg.get_data_parallel_group()) + timers("backward-params-all-reduce").stop() + + if args.use_pure_fp16: + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + learning_rate = optimizer.get_lr() + if lr_scheduler is not None: + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_freq == 0: + avg_loss = loss.numpy() + speed = args.logging_freq / (time.time() - tic_train) + if args.gate != "naive" and 
args.balance_loss_weight: + bal_loss = bal_loss.numpy() + avg_loss -= bal_loss + else: + bal_loss = -1 + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.9f, bal_loss: %.9f, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + step, + avg_loss, + bal_loss, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + learning_rate, + ) + ) + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", learning_rate, global_step) + + tic_train = time.time() + timer_log(args.logging_freq) + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + loss_scale = scaler._scale if args.use_pure_fp16 else None + save_checkpoint( + args, + global_step, + model, + optimizer, + lr_scheduler, + tokenizer, + loss_scale, + dp_rank, + mp_rank, + pp_rank, + pass_num, + file_id, + epoch, + ) + print("save checkpoint for step_{} successfully...loss_scale = {}".format(global_step, loss_scale)) + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate( + args, + valid_data_loader, + model, + criterion, + args.eval_iters, + log_writer, + global_step, + epoch, + "valid", + ) + + if global_step >= args.max_steps: + run_evaluate( + args, + test_data_loader, + model, + criterion, + args.test_iters, + log_writer, + global_step, + epoch, + "test", + ) + logger.info("The training process is complete.") + del train_data_loader + return + + # to record sum of the length of train_data_loader that has been read. + pass_num += len(train_data_loader()) + del train_data_loader + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + do_train(args) diff --git a/examples/language_model/moe/dygraph/sync_files.sh b/examples/language_model/moe/dygraph/sync_files.sh new file mode 100644 index 0000000000000000000000000000000000000000..0e297e908f0d31f3ae9c0d762c2702b9aa679d0d --- /dev/null +++ b/examples/language_model/moe/dygraph/sync_files.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +# get sshd port +sshport=$(lsof -i | grep sshd | awk '{print $9}' | sed s/\*://) + +hostfile=${TRAIN_WORKSPACE}/hostfile +hostlist=$(cat $hostfile | awk '{print $1}' | xargs) +for host in ${hostlist[@]}; do + #ssh $host "ls $PWD" + echo "scp $1 to $host" + scp -r $1 ${host}:${PWD} +done diff --git a/examples/language_model/moe/dygraph/utils.py b/examples/language_model/moe/dygraph/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..49d94ea06806ff457c34ffec772d24bb2f66ecff --- /dev/null +++ b/examples/language_model/moe/dygraph/utils.py @@ -0,0 +1,118 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
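+
+# CUDA-synchronized timing helpers used by the MoE pretraining loop above
+# ("forward-compute", "backward-compute", "backward-params-all-reduce", ...).
+# A minimal usage sketch, assuming the global timers have not been created yet
+# in this process:
+#
+#   set_timers()                        # create the global Timers registry
+#   timers = get_timers()
+#   timers("forward-compute").start()
+#   ...                                 # run the forward pass
+#   timers("forward-compute").stop()
+#   timers.log(["forward-compute"])     # prints the elapsed time in ms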
+ +import time + +import paddle + +_GLOBAL_TIMERS = None + + +def _ensure_var_is_not_initialized(var, name): + """Make sure the input variable is not None.""" + assert var is None, "{} is not initialized.".format(name) + + +def _ensure_var_is_initialized(var, name): + """Make sure the input variable is not None.""" + assert var is not None, "{} is not initialized.".format(name) + + +def get_timers(): + _ensure_var_is_initialized(_GLOBAL_TIMERS, "timers") + return _GLOBAL_TIMERS + + +def set_timers(): + """Initialize timers.""" + global _GLOBAL_TIMERS + _ensure_var_is_not_initialized(_GLOBAL_TIMERS, "timers") + _GLOBAL_TIMERS = Timers() + + +class _Timer: + """Timer.""" + + def __init__(self, name): + self.name = name + self.elapsed_ = 0.0 + self.started_ = False + self.start_time = time.time() + + def start(self): + """Start the timer.""" + assert not self.started_, "timer has already started" + paddle.device.cuda.synchronize() + self.start_time = time.time() + self.started_ = True + + def stop(self): + """Stop the timers.""" + assert self.started_, "timer is not started." + paddle.device.cuda.synchronize() + self.elapsed_ += time.time() - self.start_time + self.started_ = False + + def reset(self): + """Reset timer.""" + self.elapsed_ = 0.0 + self.started_ = False + + def elapsed(self, reset=True): + """Calculate the elapsed time.""" + started_ = self.started_ + # If the timing in progress, end it first. + if self.started_: + self.stop() + # Get the elapsed time. + elapsed_ = self.elapsed_ + # Reset the elapsed time + if reset: + self.reset() + # If timing was in progress, set it back. + if started_: + self.start() + return elapsed_ + + +class Timers: + """Group of timers.""" + + def __init__(self): + self.timers = {} + + def __call__(self, name): + if name not in self.timers: + self.timers[name] = _Timer(name) + return self.timers[name] + + def write(self, names, writer, iteration, normalizer=1.0, reset=False): + """Write timers to a tensorboard writer""" + assert normalizer > 0.0 + for name in names: + value = self.timers[name].elapsed(reset=reset) / normalizer + writer.add_scalar(name + "-time", value, iteration) + + def log(self, names, normalizer=1.0, reset=True): + """Log a group of timers.""" + assert normalizer > 0.0 + string = "time (ms)" + for name in names: + elapsed_time = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer + string += " | {}: {:.2f}".format(name, elapsed_time) + + if paddle.distributed.get_rank() == (paddle.distributed.get_world_size() - 1): + print(string, flush=True) + else: + print(string, flush=True) diff --git a/examples/language_model/mpnet/README.md b/examples/language_model/mpnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f959121c69634f433b1b6a1197a40b11e9cdd03e --- /dev/null +++ b/examples/language_model/mpnet/README.md @@ -0,0 +1,181 @@ +# MPNet with PaddleNLP + +[MPNet: Masked and Permuted Pre-training for Language Understanding - Microsoft Research](https://www.microsoft.com/en-us/research/publication/mpnet-masked-and-permuted-pre-training-for-language-understanding/) + +**摘要:** +BERT采用掩码语言建模(MLM)进行预训练,是最成功的预训练模型之一。由于BERT忽略了预测标记之间的依赖关系,XLNet引入了置换语言建模(PLM)进行预训练来解决这个问题。然而,XLNet没有利用句子的完整位置信息,因此会受到预训练和微调之间的位置差异的影响。在本文中,我们提出了MPNet,这是一种新的预训练方法,它继承了BERT和XLNet的优点并避免了它们的局限性。MPNet通过置换语言建模(相对于BERT中的MLM)利用预测标记之间的依赖性,并以辅助位置信息作为输入,使模型能够看到完整的句子,从而减少位置差异(相对于XLNet中的PLM)。我们在大规模数据集(超过160GB的文本语料库)上预训练了MPNet模型,并对各种下游任务(GLUE、SQuAD 等)进行微调。实验结果表明,在相同的模型设置下,MPNet大大优于MLM和PLM,并且与之前最先进的预训练方法(例如 
BERT、XLNet、RoBERTa)相比,在这些任务上取得了更好的结果。原始代码和预训练模型可从 https://github.com/microsoft/MPNet 下载得到。 + +本项目是 MPNet 在 Paddle 2.x上的开源实现。 + +## 快速开始 + +### 下游任务微调 + +#### 1、GLUE +以QQP数据集为例,运行其他glue数据集,请参考`train.sh`文件。(超参数遵循原论文的仓库的[README](https://github.com/microsoft/MPNet/blob/master/MPNet/README.glue.md)) + +##### (1)模型微调: +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name qqp \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --weight_decay 0.1 \ + --warmup_steps 5666 \ + --max_steps 113272 \ + --logging_steps 500 \ + --save_steps 2000 \ + --seed 42 \ + --output_dir qqp/ \ + --do_train \ + --do_eval \ + --device gpu +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持BERT、ELECTRA、ERNIE、CONVBERT、MPNET模型。 +- `model_name_or_path` 模型名称或者路径,其中mpnet模型当前仅支持mpnet-base几种规格。 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE和WNLI。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `lr_scheduler_type` scheduler类型,可选linear和cosine。 +- `weight_decay` 权重衰减比例。 +- `warmup_steps` warmup步数。 +- `max_steps` 表示最大训练步数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `do_train` 表示是否需要训练。 +- `do_eval` 表示是否需要评测。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +##### (2)模型预测: +```bash +cd glue +python run_predict.py --task_name qqp --ckpt_path qqp/best-qqp_ft_model_106000.pdparams +``` + +##### (3)压缩template文件夹为zip文件,然后提交到[GLUE排行榜](https://gluebenchmark.com/leaderboard): + + +###### GLUE开发集结果: + +| task | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | avg | +|--------------------------------|-------|-------|-------------|------------------|-------------|------|-------|-------|-------| +| **metric** | **mcc** | **acc** | **acc/f1** | **pearson/spearman** | **acc/f1** | **acc(m/mm)** | **acc** | **acc** | | +| Paper | **65.0** | **95.5** | **91.8**/空 | 91.1/空 | **91.9**/空 | **88.5**/空 | 93.3 | 85.8 | **87.9** | +| Mine | 64.4 | 95.4 | 90.4/93.1 | **91.6**/91.3 | **91.9**/89.0 | 87.7/88.2 | **93.6** | **86.6** | 87.7 | + +###### GLUE测试集结果对比: + +| task | cola | sst-2 | mrpc | sts-b | qqp | mnli-m | qnli | rte | avg | +|--------------------------------|-------|-------|-------|-------|-----|-------|-------|-------|----------| +| **metric** | **mcc** | **acc** | **acc/f1** | **pearson/spearman** | **acc/f1** | **acc(m/mm)** | **acc** | **acc** | | +| Paper | **64.0** | **96.0** | 89.1/空 | 90.7/空 | **89.9**/空 | **88\.5**/空 | 93\.1 | 81.0 | **86.5** | +| Mine | 60.5 | 95.9 | **91.6**/88.9 | **90.8**/90.3 | 89.7/72.5 | 87.6/86.6 | **93.3** | **82.4** | **86.5** | + +#### 2、SQuAD v1.1 + +使用Paddle提供的预训练模型运行SQuAD v1.1数据集的Fine-tuning + +```bash +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --max_seq_length 512 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --num_train_epochs 4 \ + --lr_scheduler_type linear \ + --logging_steps 25 \ + --save_steps 25 \ + --warmup_ratio 0.1 \ + --weight_decay 0.1 \ + --output_dir squad1.1/ \ + --device gpu \ + --do_train \ + --do_eval \ + --seed 42 +``` + +训练过程中模型会自动对结果进行评估,其中最好的结果如下所示: + +```python +{ + "exact": 86.84957426679281, + "f1": 92.82031917884066, 
+ "total": 10570, + "HasAns_exact": 86.84957426679281, + "HasAns_f1": 92.82031917884066, + "HasAns_total": 10570 +} +``` + +#### 3、SQuAD v2.0 +对于 SQuAD v2.0,按如下方式启动 Fine-tuning: + +```bash +unset CUDA_VISIBLE_DEVICES +cd squad +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --max_seq_length 512 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --num_train_epochs 4 \ + --lr_scheduler_type linear \ + --logging_steps 200 \ + --save_steps 200 \ + --warmup_ratio 0.1 \ + --weight_decay 0.1 \ + --output_dir squad2/ \ + --device gpu \ + --do_train \ + --do_eval \ + --seed 42 \ + --version_2_with_negative +``` + +* `version_2_with_negative`: 使用squad2.0数据集和评价指标的标志。 + +训练过程中模型会自动对结果进行评估,其中最好的结果如下所示: + +```python +{ + "exact": 82.27912069401162, + "f1": 85.2774124891565, + "total": 11873, + "HasAns_exact": 80.34750337381917, + "HasAns_f1": 86.35268530427743, + "HasAns_total": 5928, + "NoAns_exact": 84.20521446593776, + "NoAns_f1": 84.20521446593776, + "NoAns_total": 5945, + "best_exact": 82.86869367472417, + "best_exact_thresh": -2.450321674346924, + "best_f1": 85.67634263296013, + "best_f1_thresh": -2.450321674346924 +} +``` + +# Tips: +- 对于SQUAD任务:根据这个[issues](https://github.com/microsoft/MPNet/issues/3)所说,论文中汇报的是`best_exact`和`best_f1`。 +- 对于GLUE任务:根据这个[issues](https://github.com/microsoft/MPNet/issues/7)所说,部分任务采用了热启动初始化的方法。 + +# Reference + +```bibtex +@article{song2020mpnet, + title={MPNet: Masked and Permuted Pre-training for Language Understanding}, + author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan}, + journal={arXiv preprint arXiv:2004.09297}, + year={2020} +} +``` diff --git a/examples/language_model/mpnet/convert.py b/examples/language_model/mpnet/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..c9f3fb8fd14329e437b2759b5f0a06df84533d11 --- /dev/null +++ b/examples/language_model/mpnet/convert.py @@ -0,0 +1,78 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
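+
+# Converts a HuggingFace PyTorch MPNet checkpoint into a PaddleNLP state dict:
+# 2-D weight matrices are transposed (except those matched by dont_transpose),
+# parameter names are remapped via the huggingface_to_paddle table, and the
+# lm_head weights listed in skip_weights are skipped. Example invocation, using
+# the argparse defaults below as purely illustrative paths:
+#
+#   python convert.py \
+#     --pytorch_checkpoint_path weights/hg/mpnet-base/pytorch_model.bin \
+#     --paddle_dump_path weights/pd/mpnet-base/model_state.pdparams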
+ +from collections import OrderedDict +import argparse + +huggingface_to_paddle = { + ".attn.": ".", + "intermediate.dense": "ffn", + "output.dense": "ffn_output", + ".output.LayerNorm.": ".layer_norm.", + ".LayerNorm.": ".layer_norm.", + "lm_head.decoder.bias": "lm_head.decoder_bias", +} + +skip_weights = ["lm_head.decoder.weight", "lm_head.bias"] +dont_transpose = [ + "_embeddings.weight", + ".LayerNorm.weight", + ".layer_norm.weight", + "relative_attention_bias.weight", +] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + transpose = False + if k in skip_weights: + continue + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + transpose = True + oldk = k + for huggingface_name, paddle_name in huggingface_to_paddle.items(): + k = k.replace(huggingface_name, paddle_name) + + print(f"Converting: {oldk} => {k} | is_transpose {transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="weights/hg/mpnet-base/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="weights/pd/mpnet-base/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/mpnet/glue/predict.sh b/examples/language_model/mpnet/glue/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..c2396372fb6e9c67ee8271de4ed9c138a61f3030 --- /dev/null +++ b/examples/language_model/mpnet/glue/predict.sh @@ -0,0 +1,3 @@ +# task name ["cola","sst-2","mrpc","sts-b","qqp","mnli", "rte", "qnli"] + +python run_predict.py --task_name qqp --ckpt_path qqp/best-qqp_ft_model_106000.pdparams \ No newline at end of file diff --git a/examples/language_model/mpnet/glue/run_glue.py b/examples/language_model/mpnet/glue/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..742c43b7dba7397c381da10be0e6c20b83b36b16 --- /dev/null +++ b/examples/language_model/mpnet/glue/run_glue.py @@ -0,0 +1,454 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
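+
+# Fine-tunes MPNet (or BERT/ELECTRA/ERNIE) on a single GLUE task with the
+# PaddleNLP Trainer. MPNetTrainer below overrides evaluate() to rebuild the dev
+# dataloader(s) on the fly, including the MNLI matched/mismatched splits, and
+# _get_layer_lr_radios() optionally applies layer-wise learning-rate decay.
+# See the MPNet README and glue/train.sh for the per-task launch commands.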
+ +import math +import time +from collections import OrderedDict +from dataclasses import dataclass, field +from functools import partial +from typing import Dict, List, Optional + +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.trainer.trainer_utils import speed_metrics +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ElectraForSequenceClassification, + ElectraTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + MPNetForSequenceClassification, + MPNetTokenizer, +) + + +class MPNetTrainer(Trainer): + def evaluate( + self, + eval_dataset: Optional[Dataset] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> Dict[str, float]: + """ + Run evaluation and returns metrics. + + The calling script will be responsible for providing a method to compute metrics, as they are task-dependent + (pass it to the init `compute_metrics` argument). + + You can also subclass and override this method to inject custom behavior. + + Args: + eval_dataset (`Dataset`, *optional*): + Pass a dataset if you wish to override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not + accepted by the `model.forward()` method are automatically removed. It must implement the `__len__` + method. + ignore_keys (`Lst[str]`, *optional*): + A list of keys in the output of your model (if it is a dictionary) that should be ignored when + gathering predictions. + metric_key_prefix (`str`, *optional*, defaults to `"eval"`): + An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named + "eval_bleu" if the prefix is "eval" (default) + + Returns: + A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The + dictionary also contains the epoch number which comes from the training state. 
+ """ + # memory metrics - must set up as early as possible + self._memory_tracker.start() + + trans_func = partial( + convert_example, + tokenizer=self.tokenizer, + label_list=self.args.label_list, + max_seq_length=self.args.max_seq_length, + ) + if self.args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", self.args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler( + dev_ds_matched, batch_size=self.args.per_device_eval_batch_size * 2, shuffle=False + ) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=self.data_collator, + num_workers=2, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=self.args.per_device_eval_batch_size * 2, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=self.data_collator, + num_workers=2, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", self.args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler( + dev_ds, batch_size=self.args.per_device_eval_batch_size * 2, shuffle=False + ) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=self.data_collator, + num_workers=2, + return_list=True, + ) + + start_time = time.time() + + if self.args.task_name == "mnli": + output = self.evaluation_loop( + dev_data_loader_matched, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = self.args.eval_batch_size * self.args.dataset_world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + output = self.evaluation_loop( + dev_data_loader_mismatched, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = self.args.eval_batch_size * self.args.dataset_world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + + self._memory_tracker.stop_and_update_metrics(output.metrics) + + return output.metrics + else: + output = self.evaluation_loop( + dev_data_loader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = 
self.args.eval_batch_size * self.args.dataset_world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + + self._memory_tracker.stop_and_update_metrics(output.metrics) + + return output.metrics + + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "qnli": Accuracy, + "mnli": Accuracy, + "rte": Accuracy, + "wnli": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "electra": (ElectraForSequenceClassification, ElectraTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), + "mpnet": (MPNetForSequenceClassification, MPNetTokenizer), +} + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + task_name: Optional[str] = field( + default=None, + metadata={"help": ("The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()))}, + ) + model_type: Optional[str] = field( + default="convbert", + metadata={"help": ("Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))}, + ) + model_name_or_path: Optional[str] = field( + default="convbert-base", + metadata={ + "help": ( + "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum( + [list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], + [], + ) + ), + ) + }, + ) + layer_lr_decay: Optional[float] = field( + default=1.0, + metadata={"help": ("layer_lr_decay")}, + ) + + +@dataclass +class DataArguments: + data_path: Optional[str] = field( + default="./data", + metadata={"help": "The path of datasets to be loaded."}, + ) + + +def compute_metrics(eval_preds, metric): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + correct = metric.compute(preds, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + return { + "acc": res[0], + "precision": res[1], + "recall": res[2], + "f1": res[3], + "acc and f1": res[4], + } + elif isinstance(metric, Mcc): + return { + "mcc": res[0], + } + elif isinstance(metric, PearsonAndSpearman): + return { + "pearson": res[0], + "spearman": res[1], + "pearson and spearman": res[2], + } + else: + return { + "acc": res, + } + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_length=max_seq_length, return_token_type_ids=True) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_seq_length, + 
return_token_type_ids=True, + ) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +@dataclass +class DataCollator: + def __init__(self, tokenizer, train_ds): + self.tokenizer = (tokenizer,) + self.train_ds = (train_ds,) + + def __call__(self, features): + input_ids = [] + labels = [] + batch = {} + + for feature in features: + input_idx, _, label = feature + input_ids.append(input_idx) + labels.append(label) + + if not isinstance(self.tokenizer, MPNetTokenizer): + self.tokenizer = self.tokenizer[0] + self.train_ds = self.train_ds[0] + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + labels = (Stack(dtype="int64" if self.train_ds.label_list else "float32")(labels),) # labels + + batch["input_ids"] = input_ids[0] + batch["labels"] = labels[0] + + return batch + + +def _get_layer_lr_radios(layer_decay=0.8, n_layers=12): + """Have lower learning rates for layers closer to the input.""" + key_to_depths = OrderedDict( + { + "mpnet.embeddings.": 0, + "mpnet.encoder.relative_attention_bias.": 0, + "mpnet.pooler.": n_layers + 2, + "mpnet.classifier.": n_layers + 2, + } + ) + for layer in range(n_layers): + key_to_depths[f"mpnet.encoder.layer.{str(layer)}."] = layer + 1 + return {key: (layer_decay ** (n_layers + 2 - depth)) for key, depth in key_to_depths.items()} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if training_args.output_dir is None: + training_args.output_dir = model_args.task_name.lower() + if model_args.task_name is not None: + training_args.task_name = model_args.task_name + if model_args.max_seq_length is not None: + training_args.max_seq_length = model_args.max_seq_length + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + model_args.task_name = model_args.task_name.lower() + metric_class = METRIC_CLASSES[model_args.task_name] + model_args.model_type = model_args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + train_ds = load_dataset("glue", model_args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + training_args.label_list = train_ds.label_list + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=training_args.label_list, + max_seq_length=model_args.max_seq_length, + ) + train_ds = train_ds.map(trans_func, lazy=True) + batchify_fn = DataCollator(tokenizer, train_ds) + + num_classes = 1 if training_args.label_list is None else len(training_args.label_list) + model = model_class.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + if model_args.layer_lr_decay != 1.0: + layer_lr_radios_map = _get_layer_lr_radios(model_args.layer_lr_decay, n_layers=12) + for name, parameter in model.named_parameters(): + layer_lr_radio = 1.0 + for k, radio in layer_lr_radios_map.items(): + if k in name: + layer_lr_radio = radio + break + parameter.optimize_attr["learning_rate"] *= layer_lr_radio + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if training_args.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + compute_metrics_func = partial( + compute_metrics, + metric=metric, + ) + + trainer = MPNetTrainer( + model=model, + 
args=training_args, + train_dataset=train_ds if training_args.do_train else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + criterion=loss_fct, + compute_metrics=compute_metrics_func, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/mpnet/glue/run_predict.py b/examples/language_model/mpnet/glue/run_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5f8c3233c90cf5d1a497d76908ab006400ebc699 --- /dev/null +++ b/examples/language_model/mpnet/glue/run_predict.py @@ -0,0 +1,174 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial +import os +import paddle +from paddle.io import DataLoader +import pandas as pd +from tqdm import tqdm +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Tuple, Pad +from paddlenlp.transformers import MPNetForSequenceClassification, MPNetTokenizer +from run_glue import convert_example + +task2filename = { + "cola": "CoLA.tsv", + "sst-2": "SST-2.tsv", + "mrpc": "MRPC.tsv", + "sts-b": "STS-B.tsv", + "qqp": "QQP.tsv", + "mnli": ["MNLI-m.tsv", "MNLI-mm.tsv"], + "rte": "RTE.tsv", + "qnli": "QNLI.tsv", + "wnli": "WNLI.tsv", +} + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--ckpt_path", + default=None, + type=str, + required=True, + ) + parser.add_argument( + "--task_name", + type=str, + choices=["cola", "sst-2", "mrpc", "sts-b", "qqp", "mnli", "rte", "qnli", "wnli"], + default="cola", + required=True, + help="task_name.", + ) + + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + args = parser.parse_args() + args.task_name = args.task_name.lower() + return args + + +def predict(data_loader, model, id2label=None): + outputs = [] + progress_bar = tqdm( + range(len(data_loader)), + desc="Predition Iteration", + ) + with paddle.no_grad(): + for batch in data_loader: + input_ids, segment_ids = batch + logits = model(input_ids) + if id2label is not None: + pred = paddle.argmax(logits, axis=-1).cpu().tolist() + outputs.extend(list(map(lambda x: id2label[x], pred))) + else: + pred = logits.squeeze(-1).cpu().tolist() + outputs.extend(pred) + progress_bar.update(1) + return outputs + + +def writetsv(outputs, file): + d = {"index": list(range(len(outputs))), "prediction": outputs} + pd.DataFrame(d).to_csv(file, sep="\t", index=False) + print(f"Save to {file}.") + + +def predict2file(args): + if args.task_name == "mnli": + test_ds_matched, test_ds_mismatched = load_dataset("glue", "mnli", splits=["test_matched", "test_mismatched"]) + id2label = dict(zip(range(len(test_ds_matched.label_list)), test_ds_matched.label_list)) + else: + test_ds = load_dataset("glue", args.task_name, splits="test") + if test_ds.label_list is not None: + id2label = dict(zip(range(len(test_ds.label_list)), test_ds.label_list)) + else: + id2label = None + + model = MPNetForSequenceClassification.from_pretrained(args.ckpt_path) + model.eval() + tokenizer = MPNetTokenizer.from_pretrained(args.ckpt_path) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + ): fn(samples) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=None, + max_seq_length=args.max_seq_length, + is_test=True, + ) + + if args.task_name == "mnli": + test_ds_matched = test_ds_matched.map(trans_func, lazy=True) + test_ds_mismatched = test_ds_mismatched.map(trans_func, lazy=True) + test_batch_sampler_matched = paddle.io.BatchSampler(test_ds_matched, batch_size=args.batch_size, shuffle=False) + test_data_loader_matched = DataLoader( + dataset=test_ds_matched, + batch_sampler=test_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=2, + return_list=True, + ) + test_batch_sampler_mismatched = paddle.io.BatchSampler( + test_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + test_data_loader_mismatched = DataLoader( + dataset=test_ds_mismatched, + batch_sampler=test_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=2, + return_list=True, + ) + file_m = os.path.join("template", task2filename[args.task_name][0]) + file_mm = os.path.join("template", task2filename[args.task_name][1]) + matched_outputs = predict(test_data_loader_matched, model, id2label) + mismatched_outputs = predict(test_data_loader_mismatched, model, id2label) + writetsv(matched_outputs, file_m) + writetsv(mismatched_outputs, file_mm) + else: + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, + batch_sampler=test_batch_sampler, + collate_fn=batchify_fn, + num_workers=2, + return_list=True, + ) + predict_outputs = predict(test_data_loader, model, id2label) + + file = os.path.join("template", task2filename[args.task_name]) + writetsv(predict_outputs, file) + + +if __name__ == 
"__main__": + args = get_args() + os.makedirs("template", exist_ok=True) + predict2file(args) diff --git a/examples/language_model/mpnet/glue/train.sh b/examples/language_model/mpnet/glue/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..e068d234c8c148417c72d4cd4813faa6e5fc9217 --- /dev/null +++ b/examples/language_model/mpnet/glue/train.sh @@ -0,0 +1,194 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ["cola","sst-2","mrpc","sts-b","qqp","mnli", "rte", "qnli"] +unset CUDA_VISIBLE_DEVICES +# QQP +# 运行训练 +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name qqp \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_steps 5666 \ + --max_steps 113272 \ + --logging_steps 1 \ + --save_steps 3 \ + --seed 42 \ + --output_dir qqp \ + --do_train \ + --do_eval \ + --device gpu + +# COLA +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name cola \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 200 \ + --save_steps 200 \ + --seed 42 \ + --output_dir cola \ + --do_train \ + --do_eval \ + --device gpu + +# QNLI +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name qnli \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --seed 42 \ + --output_dir qnli \ + --do_train \ + --do_eval \ + --device gpu + +# SST2 +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name sst-2 \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 400 \ + --save_steps 400 \ + --seed 42 \ + --output_dir sst-2 \ + --do_train \ + --do_eval \ + --device gpu + + +############################################################################################################################################ +# 先训练这个模型,之后需要使用这个权重!(RTE,MRPC和STS-B用了MNLI做初始化,与roberta一致) +# MNLI +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name mnli \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + 
--layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --seed 42 \ + --output_dir mnli \ + --do_train \ + --do_eval \ + --device gpu + +######################################################## +# RTE +export MNLI_BEST_CKPT=/path/to/mnli/best/ckpt +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path $MNLI_BEST_CKPT \ + --task_name rte \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 13 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --output_dir rte \ + --do_train \ + --do_eval \ + --device gpu + +############################################################ +# MRPC +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path $MNLI_BEST_CKPT \ + --task_name mrpc \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --output_dir mrpc \ + --do_train \ + --do_eval \ + --device gpu + +############################################################ +# STSB +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path $MNLI_BEST_CKPT \ + --task_name rte \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --output_dir rte \ + --do_train \ + --do_eval \ + --device gpu + +############################################################ + diff --git a/examples/language_model/mpnet/squad/run_squad.py b/examples/language_model/mpnet/squad/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..e60e208634805640d2ef2ff4d76fa687c4f65587 --- /dev/null +++ b/examples/language_model/mpnet/squad/run_squad.py @@ -0,0 +1,709 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
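+
+# Fine-tunes MPNet (or BERT/ERNIE) on SQuAD v1.1 / v2.0. MPNetTrainer below
+# overrides get_eval_dataloader() and evaluation_loop() so that predictions can
+# be post-processed with compute_prediction()/squad_evaluate() against the raw
+# validation examples. Pass --version_2_with_negative for SQuAD v2.0; see the
+# MPNet README for the full launch commands.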
+ +import json +from collections import OrderedDict +from dataclasses import dataclass, field +from functools import partial +from typing import List, Optional + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Pad, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.trainer.trainer_utils import ( + EvalLoopOutput, + EvalPrediction, + IterableDatasetShard, + find_batch_size, + has_length, +) +from paddlenlp.trainer.utils.helper import ( + nested_concat, + nested_numpify, + nested_truncate, +) +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ErnieForQuestionAnswering, + ErnieTokenizer, + MPNetForQuestionAnswering, + MPNetTokenizer, +) +from paddlenlp.utils.batch_sampler import ( + DistributedBatchSampler as NlpDistributedBatchSampler, +) +from paddlenlp.utils.log import logger + + +def is_datasets_available(): + import importlib + + return importlib.util.find_spec("datasets") is not None + + +class MPNetTrainer(Trainer): + def set_eval_collator(self, collator): + self.eval_collate_fn = collator + + def set_eval_raw_dataset(self, raw_dataset): + self.eval_raw_dataset = raw_dataset + + def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader: + """ + Returns the evaluation [`~paddle.io.DataLoader`]. + Subclass and override this method if you want to inject some custom behavior. + Args: + eval_dataset (`paddle.io.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not accepted by + the `model.forward()` method are automatically removed. It must implement `__len__`. + """ + if eval_dataset is None and self.eval_dataset is None: + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + + if self._is_iterable_dataset(eval_dataset): + if self.args.world_size > 1: + eval_dataset = IterableDatasetShard( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + + return DataLoader( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + eval_sampler = self._get_eval_sampler(eval_dataset) + + return DataLoader( + eval_dataset, + batch_sampler=eval_sampler, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + max_eval_iters: Optional[int] = -1, + ) -> EvalLoopOutput: + """ + Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`. + Works both with or without labels. 
+ """ + args = self.args + + prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else args.prediction_loss_only + + model = self.model + + if isinstance(dataloader, paddle.io.DataLoader): + batch_size = dataloader.batch_sampler.batch_size + elif isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase): + # support for inner dataloader + batch_size = dataloader._batch_sampler.batch_size + # alias for inner dataloader + dataloader.dataset = dataloader._dataset + else: + raise ValueError("Only support for paddle.io.DataLoader") + + num_samples = None + if max_eval_iters > 0: + # on eval limit steps + num_samples = batch_size * self.args.world_size * max_eval_iters + if isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase) and isinstance( + dataloader._batch_sampler, NlpDistributedBatchSampler + ): + consumed_samples = ( + ((self.state.global_step) // args.eval_steps) + * max_eval_iters + * args.per_device_eval_batch_size + * args.world_size + ) + dataloader._batch_sampler.set_epoch(consumed_samples=consumed_samples) + + logger.info(f"***** Running {description} *****") + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + else: + logger.info(f" Total prediction steps = {len(dataloader)}") + else: + logger.info(" Num examples: Unknown") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + + logger.info(f" Pre device batch size = {batch_size}") + logger.info(f" Total Batch size = {batch_size * self.args.world_size}") + + model.eval() + + self.callback_handler.eval_dataloader = dataloader + # Do this before wrapping. + eval_dataset = dataloader.dataset + + if args.past_index >= 0: + self._past = None + + # Initialize containers + # losses/preds/labels on GPU (accumulated for eval_accumulation_steps) + losses_host = None + preds_host = None + labels_host = None + # losses/preds/labels on CPU (final containers) + all_losses = None + all_preds = None + all_labels = None + # Will be useful when we have an iterable dataset so don't know its length. + + observed_num_examples = 0 + # Main evaluation loop + losses = [] + for step, inputs in enumerate(dataloader): + # Update the observed num examples + observed_batch_size = find_batch_size(inputs) + if observed_batch_size is not None: + observed_num_examples += observed_batch_size + # For batch samplers, batch_size is not known by the dataloader in advance. 
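+                # Fall back to the size of the first observed batch; it is reused
+                # when tiling the per-batch loss into `losses` further below.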
+ if batch_size is None: + batch_size = observed_batch_size + + # Prediction step + loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) + + # Update containers on host + if loss is not None: + # losses = self._nested_gather(loss.repeat(batch_size)) + # losses = self._nested_gather(loss) + losses = self._nested_gather(paddle.tile(loss, repeat_times=[batch_size, 1])) + losses_host = losses if losses_host is None else paddle.concat((losses_host, losses), axis=0) + + if labels is not None: + labels = self._pad_across_processes(labels) + labels = self._nested_gather(labels) + labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100) + if logits is not None: + logits = self._pad_across_processes(logits) + logits = self._nested_gather(logits) + if self.preprocess_logits_for_metrics is not None: + logits = self.preprocess_logits_for_metrics(logits, labels) + preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100) + self.control = self.callback_handler.on_prediction_step(args, self.state, self.control) + if max_eval_iters > 0 and step >= max_eval_iters - 1: + break + + # Gather all remaining tensors and put them back on the CPU + if losses_host is not None: + losses = nested_numpify(losses_host) + all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0) + if preds_host is not None: + logits = nested_numpify(preds_host) + all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100) + if labels_host is not None: + labels = nested_numpify(labels_host) + all_labels = labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100) + + # Number of samples + if num_samples is not None: + pass + elif has_length(eval_dataset): + num_samples = len(eval_dataset) + # The instance check is weird and does not actually check for the type, but whether the dataset has the right + # methods. Therefore we need to make sure it also has the attribute. + elif isinstance(eval_dataset, IterableDatasetShard) and hasattr(eval_dataset, "num_examples"): + num_samples = eval_dataset.num_examples + else: + if has_length(dataloader): + num_samples = self.num_examples(dataloader) + else: # both len(dataloader.dataset) and len(dataloader) fail + num_samples = observed_num_examples + + # Number of losses has been rounded to a multiple of batch_size and in a distributed training, the number of + # samplers has been rounded to a multiple of batch_size, so we truncate. 
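+        # Drop the padded/duplicated tail so metrics are computed on exactly
+        # num_samples predictions.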
+ if all_losses is not None: + all_losses = all_losses[:num_samples] + if all_preds is not None: + all_preds = nested_truncate(all_preds, num_samples) + if all_labels is not None: + all_labels = nested_truncate(all_labels, num_samples) + + model.train() + + if self.compute_metrics is not None and all_preds is not None: + metrics = self.compute_metrics( + EvalPrediction(predictions=all_preds, label_ids=all_labels), + data_loader=dataloader, + raw_dataset=self.eval_raw_dataset, + ) + else: + metrics = {} + + if all_losses is not None: + metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item() + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples) + + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "mpnet": (MPNetForQuestionAnswering, MPNetTokenizer), +} + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + model_type: Optional[str] = field( + default="convbert", + metadata={"help": ("Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))}, + ) + model_name_or_path: Optional[str] = field( + default="convbert-base", + metadata={ + "help": ( + "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum( + [list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], + [], + ) + ), + ) + }, + ) + layer_lr_decay: Optional[float] = field( + default=1.0, + metadata={"help": ("layer_lr_decay")}, + ) + doc_stride: Optional[int] = field( + default=128, + metadata={"help": ("When splitting up a long document into chunks, how much stride to take between chunks.")}, + ) + n_best_size: Optional[int] = field( + default=20, + metadata={ + "help": ("The total number of n-best predictions to generate in the nbest_predictions.json output file.") + }, + ) + null_score_diff_threshold: Optional[float] = field( + default=0.0, + metadata={"help": ("If null_score - best_non_null is greater than the threshold predict null.")}, + ) + max_query_length: Optional[int] = field( + default=64, + metadata={"help": ("Max query length.")}, + ) + max_answer_length: Optional[int] = field( + default=30, + metadata={"help": ("Max answer length.")}, + ) + do_lower_case: Optional[bool] = field( + default=False, + metadata={ + "help": ( + "Whether to lower case the input text. Should be True for uncased models and False for cased models." 
+ ) + }, + ) + verbose: Optional[bool] = field( + default=False, + metadata={"help": ("Whether to output verbose log.")}, + ) + version_2_with_negative: Optional[bool] = field( + default=False, + metadata={ + "help": ( + "If true, the SQuAD examples contain some that do not have an answer.", + "If using squad v2.0, it should be set true.", + ) + }, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default=None, + metadata={"help": "Train data path."}, + ) + predict_file: Optional[str] = field( + default=None, + metadata={"help": "Predict data path."}, + ) + + +def _get_layer_lr_radios(layer_decay=0.8, n_layers=12): + """Have lower learning rates for layers closer to the input.""" + key_to_depths = OrderedDict( + { + "mpnet.embeddings.": 0, + "mpnet.encoder.relative_attention_bias.": 0, + "qa_outputs.": n_layers + 2, + } + ) + for layer in range(n_layers): + key_to_depths[f"mpnet.encoder.layer.{str(layer)}."] = layer + 1 + return {key: (layer_decay ** (n_layers + 2 - depth)) for key, depth in key_to_depths.items()} + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, contexts, max_length=args.max_seq_length, stride=args.doc_stride, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_A_lengths = input_ids.index(tokenizer.sep_token_id) + 2 + sequence_B_lengths = len(input_ids) - sequence_A_lengths + sequence_ids = [0] * sequence_A_lengths + [1] * sequence_B_lengths + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. 
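+                # Illustrative example: answer_start = [41] and text = ["Denver Broncos"]
+                # give start_char = 41 and end_char = 41 + 14 = 55.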
+ start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, return_attention_mask=True + ) + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + input_ids = tokenized_examples["input_ids"][i] + sequence_A_lengths = input_ids.index(tokenizer.sep_token_id) + 2 + sequence_B_lengths = len(input_ids) - sequence_A_lengths + sequence_ids = [0] * sequence_A_lengths + [1] * sequence_B_lengths + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
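+ # For example, an offset entry such as (0, 5) is kept for a context token but replaced with
+ # None for question and special tokens (illustrative values).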
+ tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +@dataclass +class TrainDataCollator: + def __init__(self, tokenizer): + self.tokenizer = tokenizer + + def __call__(self, features): + input_ids = [] + start_positions = [] + end_positions = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + start_positions.append(feature["start_positions"]) + end_positions.append(feature["end_positions"]) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + start_positions = (Stack(dtype="int64")(start_positions),) # start_positions + end_positions = (Stack(dtype="int64")(end_positions),) # end_positions + + batch["input_ids"] = input_ids[0] + batch["start_positions"] = start_positions[0] + batch["end_positions"] = end_positions[0] + + return batch + + +@dataclass +class EvalDataCollator: + def __init__(self, tokenizer): + self.tokenizer = tokenizer + + def __call__(self, features): + input_ids = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + + batch["input_ids"] = input_ids[0] + + return batch + + +def compute_metrics(eval_preds, data_loader=None, raw_dataset=None, model_args=None): + start_logits, end_logits = eval_preds.predictions + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (start_logits, end_logits), + model_args.version_2_with_negative, + model_args.n_best_size, + model_args.max_answer_length, + model_args.null_score_diff_threshold, + ) + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + return {} + + +@paddle.no_grad() +def evaluate(model, data_loader, raw_dataset, args, global_step, write_predictions=False): + model.eval() + + all_start_logits = [] + all_end_logits = [] + + for batch in data_loader: + input_ids = batch[0] + start_logits_tensor, end_logits_tensor = model(input_ids) + + for idx in range(start_logits_tensor.shape[0]): + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + if write_predictions: + with open(f"{str(global_step)}_prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, 
label=end_position) + loss = (start_loss + end_loss) / 2 + + return loss + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + model_args.model_type = model_args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + model = model_class.from_pretrained(model_args.model_name_or_path) + + if training_args.do_train: + # layer_lr for base + if model_args.layer_lr_decay != 1.0: + layer_lr_radios_map = _get_layer_lr_radios(model_args.layer_lr_decay, n_layers=12) + for name, parameter in model.named_parameters(): + layer_lr_radio = 1.0 + for k, radio in layer_lr_radios_map.items(): + if k in name: + layer_lr_radio = radio + break + parameter.optimize_attr["learning_rate"] *= layer_lr_radio + + if model_args.version_2_with_negative: + train_examples = load_dataset("squad_v2", split="train") + else: + train_examples = load_dataset("squad", split="train") + column_names = train_examples.column_names + train_ds = train_examples.map( + partial(prepare_train_features, tokenizer=tokenizer, args=model_args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + + if training_args.do_eval: + if model_args.version_2_with_negative: + dev_examples = load_dataset("squad_v2", split="validation") + else: + dev_examples = load_dataset("squad", split="validation") + column_names = dev_examples.column_names + dev_ds = dev_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=model_args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + + batchify_fn_train = TrainDataCollator(tokenizer) + batchify_fn_eval = EvalDataCollator(tokenizer) + criterion = CrossEntropyLossForSQuAD() + + compute_metrics_func = partial( + compute_metrics, + model_args=model_args, + ) + + trainer = MPNetTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn_train, + criterion=criterion, + compute_metrics=compute_metrics_func, + ) + trainer.set_eval_collator(batchify_fn_eval) + trainer.set_eval_raw_dataset(dev_examples) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/opt b/examples/language_model/opt new file mode 100644 index 0000000000000000000000000000000000000000..d095192093bdd412fce64ee16050e2dca4aca81c --- /dev/null +++ b/examples/language_model/opt @@ -0,0 +1 @@ +../../llm/opt \ No newline at end of file diff --git a/examples/language_model/rembert/README.md b/examples/language_model/rembert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5a3bf3bc25497359d6e060d3b5d1d19a41c3c812 --- /dev/null +++ b/examples/language_model/rembert/README.md @@ -0,0 +1,95 @@ +# RemBert with 
PaddleNLP
+
+[RemBERT: Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821v1.pdf)
+
+**模型简介:**
+作者发现,分离词嵌入为语言模型建模提供了更好的灵活性,使我们能够显著提高多语言模型输入词嵌入中参数分
+配的效率。通过在transformers层中重新分配输入词嵌入参数,在微调过程中,相比于具有相同参数量的
+模型,在自然语言理解任务上获得了更好的性能。作者还发现,增大输出词嵌入维度可以提升模型的性能,
+即使在预训练结束后丢弃输出词嵌入,这一优势在微调阶段仍能保持。作者分析表明,增大输出词嵌入维度
+可以防止模型在预训练数据集上过拟合,并让模型在其他NLP数据集上有更强的泛化能力。利用这些发现,我们能够
+训练性能更强大的模型,而无需在微调阶段增加参数。
+
+## 快速开始
+
+### 下游任务微调
+
+#### 数据集
+下载XTREME-XNLI数据集:
+训练集:[下载地址](https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip)
+测试集:[下载地址](https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip)
+其中训练集位于`XNLI-MT-1.0/multinli/multinli.train.en.tsv`, 测试集位于`XNLI-1.0/xnli.test.tsv`
+
+下载XTREME-PAWS-X数据集:
+[下载地址](https://storage.googleapis.com/paws/pawsx/x-final.tar.gz)
+每个训练集、验证集和测试集分别为`train`、`dev`和`test`开头的`tsv`文件, 将所有语言的数据集解压后,请合并所有语言的测试集到一个文件(此任务需要在多种语言上进行测试)
+
+#### 1、XTREME-XNLI
+以XTREME-XNLI数据集为例,运行以下命令即可训练并评估RemBert在该数据集上的精度:
+
+```shell
+python -m paddle.distributed.launch examples/language_model/rembert/main.py \
+ --model_type rembert \
+ --data_dir data/ \
+ --output_dir output/ \
+ --device gpu \
+ --learning_rate 1e-5 \
+ --num_train_epochs 3 \
+ --train_batch_size 16 \
+ --do_train \
+ --do_eval \
+ --task xnli \
+ --eval_step 500
+```
+其中参数释义如下:
+- `model_type` 指示了模型类型,当前支持`rembert`。
+- `data_dir` 数据集路径。
+- `train_batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
+- `output_dir` 表示模型保存路径。
+- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时通过环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。
+- `num_train_epochs` 表示需要训练的epoch数量。
+- `do_train` 表示是否开启训练。
+- `do_eval` 表示是否开启评估。
+- `task` 表示训练的任务。
+- `eval_step` 表示训练多少步评估一次模型。
+
+训练结束后会自动对模型进行评估,你将看到如下结果:
+```bash
+Accuracy 0.8089
+```
+
+#### 2、XTREME-PAWS-X
+在此数据集上训练使用如下命令:
+
+```shell
+python -m paddle.distributed.launch examples/language_model/rembert/main.py \
+ --model_type rembert \
+ --data_dir data/ \
+ --output_dir output/ \
+ --device gpu \
+ --learning_rate 8e-6 \
+ --num_train_epochs 3 \
+ --train_batch_size 16 \
+ --do_train \
+ --do_eval \
+ --task paws \
+ --eval_step 500
+```
+训练结束后会自动在测试集上对模型进行评估,你将看到如下结果:
+```bash
+Accuracy 0.8778
+```
+
+
+# Reference
+
+```bibtex
+@article{chung2020rethinking,
+ title={Rethinking embedding coupling in pre-trained language models},
+ author={Chung, Hyung Won and Fevry, Thibault and Tsai, Henry and Johnson, Melvin and Ruder, Sebastian},
+ journal={arXiv preprint arXiv:2010.12821},
+ year={2020}
+}
+```
diff --git a/examples/language_model/rembert/data_processor.py b/examples/language_model/rembert/data_processor.py
new file mode 100644
index 0000000000000000000000000000000000000000..f5f6568d7a68e7da0e792312bdc252e7a4a7e8e5
--- /dev/null
+++ b/examples/language_model/rembert/data_processor.py
@@ -0,0 +1,146 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
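+
+# NOTE: the processors below read TSV files and eagerly convert each text pair to token ids with a
+# module-level RemBertTokenizer, so downstream batching only needs to pad the precomputed ids.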
+ +import os +from paddlenlp.transformers import RemBertTokenizer +import csv +from paddle.io import Dataset + +tokenization = RemBertTokenizer.from_pretrained("rembert") + + +class InputExample(object): + """ + Use classes to store each example + """ + + def __init__(self, guid, text_a, text_b=None, label=None): + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class MrpcProcessor(object): + """Load the dataset and convert each example text to ids""" + + def get_train_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_2k.tsv")), "dev") + + def get_test_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "test_2k.tsv")), "test") + + def get_labels(self): + return ["0", "1"] + + def _create_examples(self, lines, set_type): + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = tokenization(line[1])["input_ids"] + text_b = tokenization(line[2])["input_ids"] + label = int(line[3]) + examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + +class XNLIProcessor(object): + """Load the dataset and convert each example text to ids""" + + def get_train_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "multinli.train.en.tsv")), "train") + + def get_dev_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")), "dev") + + def get_test_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "xnli.test.tsv")), "test") + + def get_labels(self): + return ["neutral", "entailment", "contradictory"] + + def _create_examples(self, lines, set_type): + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + if set_type == "train": + text_a = " ".join(line[0].strip().split(" ")) + text_b = " ".join(line[1].strip().split(" ")) + text_a = tokenization(text_a)["input_ids"] + text_b = tokenization(text_b)["input_ids"] + label = self.get_labels().index(line[2].strip()) + examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + else: + text_a = " ".join(line[6].strip().split(" ")) + text_b = " ".join(line[7].strip().split(" ")) + if line[1] == "contradiction": + line[1] = "contradictory" + label = self.get_labels().index(line[1].strip()) + text_a = tokenization(text_a)["input_ids"] + text_b = tokenization(text_b)["input_ids"] + examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + +class DataGenerator(Dataset): + """Data generator is used to feed features into dataloader.""" + + def 
__init__(self, features): + super(DataGenerator, self).__init__() + self.features = features + + def __getitem__(self, item): + text_a = self.features[item].text_a + text_b = self.features[item].text_b + text_a_token_type_ids = [0] * len(text_a) + text_b_token_type_ids = [1] * len(text_b) + label = [self.features[item].label] + + return dict( + text_a=text_a, + text_b=text_b, + text_a_token_type_ids=text_a_token_type_ids, + text_b_token_type_ids=text_b_token_type_ids, + label=label, + ) + + def __len__(self): + return len(self.features) diff --git a/examples/language_model/rembert/main.py b/examples/language_model/rembert/main.py new file mode 100644 index 0000000000000000000000000000000000000000..f48e5751080294b5520c15a2740e7ea47bd949ad --- /dev/null +++ b/examples/language_model/rembert/main.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import random + +import numpy as np +import paddle +import paddle.distributed as dist +from data_processor import DataGenerator, MrpcProcessor, XNLIProcessor +from paddle.io import DataLoader, DistributedBatchSampler +from paddle.metric import Accuracy +from tqdm import tqdm +from trainer import Trainer + +from paddlenlp.transformers import RemBertForSequenceClassification + +logger = logging.getLogger(__name__) + +parser = argparse.ArgumentParser(description="RemBert For Sequence Classification") +parser.add_argument("--data_dir", type=str, default=None, help="Data path.") +parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") +parser.add_argument("--do_eval", action="store_true", help="Whether to predict.") +parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--train_batch_size", type=int, default=16, help="per gpu batch size during thr training.") +parser.add_argument("--eval_batch_size", type=int, default=16, help="per gpu batch size during thr evaluating.") +parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " "Default as `outputs`", +) +parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+) +parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=2, + help="Proportion of training steps to perform linear learning rate warmup for.", +) +parser.add_argument( + "--warmup_proportion", + type=float, + default=0.02, + help="Proportion of training steps to perform linear learning rate warmup for.", +) +parser.add_argument("--learning_rate", type=float, default=8e-6, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay if we apply some.") +parser.add_argument("--task", type=str, required=True, help="Training task") +parser.add_argument("--model_type", default="rembert", type=str, help="Type of pre-trained model.") +parser.add_argument("--eval_step", type=int, default=2000, help="Eavlate the model once after training step X.") +args = parser.parse_args() + +nranks = paddle.distributed.ParallelEnv().nranks + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def load_example(args, mode="train"): + """Load data to DataLoader""" + if args.task == "paws": + processor = MrpcProcessor() + if args.task == "xnli": + processor = XNLIProcessor() + if mode == "train": + examples = processor.get_train_examples(args.data_dir) + elif mode == "dev": + examples = processor.get_dev_examples(args.data_dir) + else: + examples = processor.get_test_examples(args.data_dir) + + datagenerator = DataGenerator(examples) + + def collate_fn(batch): + def create_padded_sequence(key, padding_value): + """Pad sequence to max length""" + pad_sequence = [] + max_len = 0 + for example in batch: + if len(example[key]) > max_len: + max_len = len(example[key]) + for example in batch: + pad_sequence.append(example[key] + [padding_value] * (max_len - len(example[key]))) + return np.array(pad_sequence, dtype="int64") + + text_a = create_padded_sequence("text_a", 0) # pad text_a input_ids + text_b = create_padded_sequence("text_b", 0) # pad text_b input_ids + text_a_token_type_ids = create_padded_sequence("text_a_token_type_ids", 0) # pad text_a_token_type_ids + text_b_token_type_ids = create_padded_sequence("text_b_token_type_ids", 1) # pad text_b_token_type_ids + label = create_padded_sequence("label", 0) # label will not pad, just convert to numpy array + + input_ids = np.concatenate([text_a, text_b], axis=-1)[:, : args.max_seq_length] + token_type_ids = np.concatenate([text_a_token_type_ids, text_b_token_type_ids], axis=-1)[ + :, : args.max_seq_length + ] + + return input_ids, token_type_ids, label + + if mode in ("dev", "test"): + dataloader = DataLoader(datagenerator, batch_size=args.eval_batch_size, shuffle=False, collate_fn=collate_fn) + else: + sampler = DistributedBatchSampler( + datagenerator, batch_size=args.train_batch_size, shuffle=True, drop_last=False + ) + dataloader = DataLoader(datagenerator, batch_sampler=sampler, collate_fn=collate_fn) + + return dataloader, processor + + +def run(args): + if args.do_train: + train_dataloader, processor = load_example(args, "train") + num_label = len(processor.get_labels()) + model = RemBertForSequenceClassification.from_pretrained(args.model_type, num_classes=num_label) + if nranks > 1: + dist.init_parallel_env() + model = paddle.DataParallel(model) + + num_train_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps + num_train_steps = int(num_train_steps_per_epoch * args.num_train_epochs) + trainer = Trainer( + args, model=model, dataloader=train_dataloader, num_train_steps=num_train_steps, 
step_callback=evaluate + ) + trainer.train() + + if args.do_eval: + model = RemBertForSequenceClassification.from_pretrained(args.output_dir) + evaluate(model, args) + + +def evaluate(model, args, mode="test"): + """evaluate the model""" + model.eval() + metric = Accuracy() + eval_dataloader, processor = load_example(args, mode) + for batch in tqdm(eval_dataloader, total=len(eval_dataloader)): + logits = model(input_ids=batch[0], token_type_ids=batch[1]) + labels = batch[2].reshape( + ( + -1, + 1, + ) + ) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("Accuracy:", res) + model.train() + return res + + +if __name__ == "__main__": + set_seed(args.seed) + paddle.set_device(args.device) + run(args) diff --git a/examples/language_model/rembert/trainer.py b/examples/language_model/rembert/trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..314cb118d8392e5ccf9a3205164416136983491e --- /dev/null +++ b/examples/language_model/rembert/trainer.py @@ -0,0 +1,102 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.optimizer import AdamW +from tqdm import tqdm + +from paddlenlp.transformers import LinearDecayWithWarmup + + +def _create_model_arguments(batch): + return batch + + +class Trainer(object): + def __init__(self, args, model, dataloader, num_train_steps, step_callback=None): + self.args = args + self.model = model + self.dataloader = dataloader + self.num_train_steps = num_train_steps + self.step_callback = step_callback + + self.optimizer, self.scheduler = self._create_optimizer(model) + self.scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + self.wd_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + def train(self): + model = self.model + + epoch = 0 + global_step = 0 + tr_loss = 0.0 + acc = 0.0 + + model.train() + model, self.optimizer = paddle.amp.decorate( + models=model, optimizers=self.optimizer, level="O2", master_weight=None, save_dtype="float32" + ) + + with tqdm(total=self.num_train_steps) as pbar: + while True: + for step, batch in enumerate(self.dataloader): + with paddle.amp.auto_cast(enable=True, custom_white_list=None, custom_black_list=None, level="O2"): + logits = model(input_ids=batch[0], token_type_ids=batch[1]) + + loss = paddle.nn.CrossEntropyLoss()(logits, batch[2].reshape((-1,))) + + if self.args.gradient_accumulation_steps > 1: + loss = loss / self.args.gradient_accumulation_steps + scaled = self.scaler.scale(loss) + scaled.backward() + if (step + 1) % self.args.gradient_accumulation_steps == 0: + self.scaler.minimize(self.optimizer, scaled) + self.scheduler.step() + self.optimizer.clear_grad() + pbar.set_description("epoch: {} loss: {} acc: {}".format(epoch, loss.numpy(), acc)) + pbar.update() + global_step += 1 + + if global_step == self.num_train_steps: + break + if (step + 1) % self.args.eval_step == 0: + ac = self.step_callback(model, 
self.args) + if ac > acc: + acc = ac + model.save_pretrained(self.args.output_dir) + + if global_step == self.num_train_steps: + break + epoch += 1 + + return model, global_step, tr_loss / global_step + + def _create_optimizer(self, model): + scheduler = self._create_scheduler() + clip = paddle.nn.ClipGradByNorm(clip_norm=1.0) + return ( + AdamW( + parameters=model.parameters(), + grad_clip=clip, + learning_rate=scheduler, + beta1=0.9, + apply_decay_param_fun=lambda x: x in self.wd_params, + weight_decay=self.args.weight_decay, + beta2=0.99, + ), + scheduler, + ) + + def _create_scheduler(self): + return LinearDecayWithWarmup(self.args.learning_rate, self.num_train_steps, self.args.warmup_proportion) diff --git a/examples/language_model/rnnlm/README.md b/examples/language_model/rnnlm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..efd6459bd911c12c3841a74b9fae37bf09f4a394 --- /dev/null +++ b/examples/language_model/rnnlm/README.md @@ -0,0 +1,68 @@ +# 语言模型 + +# 简介 + +## 1. 任务说明 +本文主要介绍基于lstm的语言的模型的实现,给定一个输入词序列(中文分词、英文tokenize),计算其ppl(语言模型困惑度,用户表示句子的流利程度),基于循环神经网络语言模型的介绍可以[参阅论文](https://arxiv.org/abs/1409.2329)。相对于传统的方法,基于循环神经网络的方法能够更好的解决稀疏词的问题。 + + +## 2. 效果说明 + +| | train | valid | test | +| :------------- | :---------: | :--------: | :----------: | +| PaddlePaddle | 47.234 | 86.801 | 83.159 | +| Tensorflow | 45.594 | 87.363 | 84.015 | + + + +## 3. 数据集 + +此任务的数据集合是采用ptb dataset,下载地址为: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz + + +# 快速开始 + +### 数据准备 +为了方便开发者进行测试,我们内置了数据下载脚本,默认自动下载PTB数据集。 + +### 训练或Fine-tune + +任务训练启动命令如下: + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ +``` + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型到checkpoint、中。 +还可以在启动命令后以--的形式修改网络参数或数据位置,具体可修改的参数和参数的默认值参考`args.py`。 + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/test`即可,程序会自动加载模型参数`checkpoints/test.pdparams`,也会自动加载优化器状态`checkpoints/test.pdopt`。 + +# 进阶使用 + +## 任务定义与建模 +此任务目的是给定一个输入的词序列,预测下一个词出现的概率。 + +## 模型原理介绍 +此任务采用了序列任务常用的rnn网络,实现了一个两层的lstm网络,然后lstm的结果去预测下一个词出现的概率。 + +由于数据的特殊性,每一个batch的last hidden和last cell会被作为下一个batch 的init hidden 和 init cell。 + + +## 数据格式说明 +此任务的数据格式比较简单,每一行为一个已经分好词(英文的tokenize)的词序列。 + +目前的句子示例如下图所示: +``` +aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter +pierre N years old will join the board as a nonexecutive director nov. N +mr. is chairman of n.v. 
the dutch publishing group +``` + +特殊说明:ptb的数据比较特殊,ptb的数据来源于一些文章,相邻的句子可能来源于一个段落或者相邻的段落,ptb 数据不能做shuffle。 + + +## 如何组建自己的模型 ++ **自定义数据:** 关于数据,如果可以把自己的数据先进行分词(或者tokenize),通过`--data_path`来指定本地数据集所在文件夹,并需要在`train.py`中修改对应的文件名称。 ++ **网络结构更改:** 网络只实现了基于lstm的语言模型,用户可以自己的需求更换为gru等网络结构,这些实现都是在`model.py`中定义。 diff --git a/examples/language_model/rnnlm/args.py b/examples/language_model/rnnlm/args.py new file mode 100644 index 0000000000000000000000000000000000000000..21b586b83e5c5306d82f368e045fa78dbd9f0495 --- /dev/null +++ b/examples/language_model/rnnlm/args.py @@ -0,0 +1,23 @@ +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--data_path", type=str, default=None, help="all the data for train,valid,test") + parser.add_argument("--batch_size", type=int, default=20, help="batch size") + parser.add_argument("--hidden_size", type=int, default=650, help="hidden_size") + parser.add_argument("--num_steps", type=int, default=35, help="num steps") + parser.add_argument("--num_layers", type=int, default=2, help="num_layers") + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm") + parser.add_argument("--dropout", type=float, default=0.5, help="dropout") + parser.add_argument("--epoch_start_decay", type=int, default=6, help="epoch_start_decay") + parser.add_argument("--max_epoch", type=int, default=39, help="max_epoch") + parser.add_argument("--lr_decay", type=float, default=0.8, help="lr_decay") + parser.add_argument("--base_lr", type=float, default=1.0, help="base_lr") + parser.add_argument("--init_scale", type=float, default=0.05, help="init_scale") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + args = parser.parse_args() + return args diff --git a/examples/language_model/rnnlm/model.py b/examples/language_model/rnnlm/model.py new file mode 100644 index 0000000000000000000000000000000000000000..99c04c43d32a87fc669b834b312a86e54f1565be --- /dev/null +++ b/examples/language_model/rnnlm/model.py @@ -0,0 +1,85 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
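+
+# NOTE: RnnLm keeps `self.hidden`/`self.cell` between forward calls, so each batch starts from the
+# (detached) final states of the previous batch; UpdateModel resets them at the start of every epoch.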
+ +import paddle +import paddle.nn as nn +import paddle.nn.initializer as I + + +class RnnLm(nn.Layer): + def __init__(self, vocab_size, hidden_size, batch_size, num_layers=1, init_scale=0.1, dropout=0.0): + super(RnnLm, self).__init__() + self.hidden_size = hidden_size + self.num_layers = num_layers + self.init_scale = init_scale + self.batch_size = batch_size + self.reset_states() + + self.embedder = nn.Embedding( + vocab_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.lstm = nn.LSTM( + input_size=hidden_size, + hidden_size=hidden_size, + num_layers=num_layers, + dropout=dropout, + weight_ih_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + weight_hh_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.fc = nn.Linear( + hidden_size, + vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.dropout = nn.Dropout(p=dropout) + + def forward(self, inputs): + x = inputs + x_emb = self.embedder(x) + x_emb = self.dropout(x_emb) + + y, (self.hidden, self.cell) = self.lstm(x_emb, (self.hidden, self.cell)) + (self.hidden, self.cell) = tuple([item.detach() for item in (self.hidden, self.cell)]) + y = self.dropout(y) + y = self.fc(y) + return y + + def reset_states(self): + self.hidden = paddle.zeros(shape=[self.num_layers, self.batch_size, self.hidden_size], dtype="float32") + self.cell = paddle.zeros(shape=[self.num_layers, self.batch_size, self.hidden_size], dtype="float32") + + +class CrossEntropyLossForLm(nn.Layer): + def __init__(self): + super(CrossEntropyLossForLm, self).__init__() + + def forward(self, y, label): + label = paddle.unsqueeze(label, axis=2) + loss = paddle.nn.functional.cross_entropy(input=y, label=label, reduction="none") + loss = paddle.squeeze(loss, axis=[2]) + loss = paddle.mean(loss, axis=[0]) + loss = paddle.sum(loss) + return loss + + +class UpdateModel(paddle.callbacks.Callback): + # This callback reset model hidden states and update learning rate before each epoch begins + def on_epoch_begin(self, epoch=None, logs=None): + self.model.network.reset_states() diff --git a/examples/language_model/rnnlm/reader.py b/examples/language_model/rnnlm/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..065ea409076d51762219e4958ffd1dc5c7ca649f --- /dev/null +++ b/examples/language_model/rnnlm/reader.py @@ -0,0 +1,58 @@ +import numpy as np + +import paddle + +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Vocab + + +def create_data_loader(batch_size, num_steps, data_path=None): + train_ds, valid_ds, test_ds = load_dataset("ptb", splits=("train", "valid", "test")) + + train_examples = [train_ds[i]["sentence"].split() for i in range(len(train_ds))] + vocab = Vocab.build_vocab(train_examples, eos_token="") + + # Because the sentences in PTB dataset might be consecutive, we need to concatenate + # all texts from our dataset and fold them into chunks while the number of rows is + # equal to batch size. For example: + # + # Sentence1: we're talking about years ago before anyone heard of asbestos having + # any questionable properties. + # Sentence2: there is no asbestos in our products now. 
+ # Batch_size: 5 + # Grouped_text: [["we're", "talking", "about", "years"], + # ["ago", "before", "anyone", "heard"], + # ["of", "asbestos", "having", "any"], + # ["questionable", "properties", "there", "is"], + # ["no", "asbestos", "in", "our"]] + # + def group_texts(examples): + concat_examples = [] + for example in examples: + concat_examples += example["sentence"].split() + ["
"] + + concat_examples = vocab.to_indices(concat_examples) + + max_seq_len = len(concat_examples) // batch_size + reshaped_examples = np.asarray(concat_examples[0 : batch_size * max_seq_len], dtype="int64").reshape( + (batch_size, max_seq_len) + ) + encoded_examples = [] + for i in range(max_seq_len // num_steps): + encoded_examples.append( + ( + np.copy(reshaped_examples[:, i * num_steps : (i + 1) * num_steps]), + np.copy(reshaped_examples[:, i * num_steps + 1 : (i + 1) * num_steps + 1]), + ) + ) + + return encoded_examples + + train_ds.map(group_texts, batched=True) + valid_ds.map(group_texts, batched=True) + test_ds.map(group_texts, batched=True) + + train_loader = paddle.io.DataLoader(train_ds, return_list=True, batch_size=None) + valid_loader = paddle.io.DataLoader(valid_ds, return_list=True, batch_size=None) + test_loader = paddle.io.DataLoader(test_ds, return_list=True, batch_size=None) + return train_loader, valid_loader, test_loader, len(vocab) diff --git a/examples/language_model/rnnlm/train.py b/examples/language_model/rnnlm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..bb69ebc37aeee5a97546a33a8626fb35a98a617e --- /dev/null +++ b/examples/language_model/rnnlm/train.py @@ -0,0 +1,79 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +from args import parse_args +from model import CrossEntropyLossForLm, RnnLm, UpdateModel +from reader import create_data_loader + +from paddlenlp.metrics import Perplexity + +paddle.seed(102) + + +def train(args): + paddle.set_device(args.device) + data_path = args.data_path + train_loader, valid_loader, test_loader, vocab_size = create_data_loader( + batch_size=args.batch_size, num_steps=args.num_steps, data_path=data_path + ) + + network = RnnLm( + vocab_size=vocab_size, + hidden_size=args.hidden_size, + batch_size=args.batch_size, + num_layers=args.num_layers, + init_scale=args.init_scale, + dropout=args.dropout, + ) + gloabl_norm_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + cross_entropy = CrossEntropyLossForLm() + ppl_metric = Perplexity() + callback = UpdateModel() + scheduler = paddle.callbacks.LRScheduler(by_step=False, by_epoch=True) + model = paddle.Model(network) + + learning_rate = paddle.optimizer.lr.LambdaDecay( + learning_rate=args.base_lr, + lr_lambda=lambda x: args.lr_decay ** max(x + 1 - args.epoch_start_decay, 0.0), + verbose=True, + ) + optimizer = paddle.optimizer.SGD( + learning_rate=learning_rate, parameters=model.parameters(), grad_clip=gloabl_norm_clip + ) + + model.prepare(optimizer=optimizer, loss=cross_entropy, metrics=ppl_metric) + + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + benchmark_logger = paddle.callbacks.ProgBarLogger(log_freq=(len(train_loader) // 10), verbose=3) + model.fit( + train_data=train_loader, + eval_data=valid_loader, + epochs=args.max_epoch, + shuffle=False, + callbacks=[callback, scheduler, benchmark_logger], + ) + + model.save(path="checkpoint/test") # save for training + + print("Start to evaluate on test dataset...") + model.evaluate(test_loader, log_freq=len(test_loader)) + + +if __name__ == "__main__": + args = parse_args() + train(args) diff --git a/examples/language_model/roberta/README.md b/examples/language_model/roberta/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b6e948386801d82f54b2f90ba7a675c408823428 --- /dev/null +++ b/examples/language_model/roberta/README.md @@ -0,0 +1,110 @@ +# RoBERTa预训练(Masked Language Modeling) +本项目是RoBERTa模型在 Paddle 2.0上的开源实现,包含了数据tokenization和预训练代码。本项目旨在用简练清晰的代码完成基本预训练任务(仅Masked Language Modeling)。该代码易于理解,便于修改和定制。 +## 简介 +本目录下包含: + +utils.py: 数据采样函数DataCollatorMLM + +create_data.py: tokenize数据(使用HF datasets导入和预处理wikipedia数据) + +run_pretrain.py: 预训练代码 + +## 数据准备 +运行create_data.py,默认使用wikipedia corpus数据,自动下载(约34GB) + +``` +python create_data.py \ +--output_dir wiki \ +--dataset_name wikipedia \ +--dataset_config_name 20200501.en \ +--tokenizer_name roberta-base \ +--max_seq_length 512 \ +--line_by_line False \ +--preprocessing_num_workers 20 +``` + +其中参数释义如下: +- `output_dir` 指示数据tokenize后保存的目录。 +- `dataset_name` 表示数据名称,默认使用wikipedia。 +- `dataset_config_name` 表示数据参数,默认使用wikipedia英文数据。 +- `tokenizer_name` 表示tokenizer名。 +- `max_seq_length` 表示最大序列长度。 +- `line_by_line` 表示是否将数据group到max_seq_length,True则不进行grouping。 +- `preprocessing_num_workers` 表示worker数量,亦为multi-processing数量。 + +## 预训练 + +``` +python -m paddle.distributed.launch --gpus "0,1" run_pretrain.py \ +--model_name_or_path roberta-en-base \ +--batch_size 16 \ +--learning_rate 1e-4 \ +--weight_decay 1e-2 \ +--warmup_steps 10000 \ +--num_train_epochs 3 \ +--input_file wiki \ +--output_dir ckp/ \ +--logging_steps 100 \ +--save_steps 10000 \ +--max_steps -1 \ +--device gpu \ +--max_seq_length 512 \ +--amp True 
+``` + +其中参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_file` 表示输入数据的目录,由create_data.py创建。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `max_seq_length` 训练数据最大长度。 +- `amp` 指示是否启用自动混合精度训练。 + +注: +paddle.Dataloader需2.3rc版本才支持HF datasets类,现行版本可以直接在python paddle库中的reader.py中注释掉: +``` +assert isinstance(dataset, Dataset) +``` +https://github.com/PaddlePaddle/Paddle/blob/0ee230a7d3177f791d2a5388ab4dffdccc03f4aa/python/paddle/fluid/reader.py#L335 + +## fine-tune + +finetune代码请参考[benchmark_glue](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/glue) + +运行如下: + +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type roberta \ + --model_name_or_path ROBERTA_CKP_PATH \ + --tokenizer_name_or_path roberta-en-base \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu + +``` + + +总训练tokens:512(seq_len)* 32(batch_size) * 780000(iteration),约RoBERTa训练量10%,在GLUE validation set表现: + +| Model GLUE Score | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | +|--------------------|-------|--------|--------|--------|--------|--------|--------|--------| +| RoBERTa paper | 68.0 | 96.4 | 90.9 | 92.4 | 92.2 | 90.2 | 94.7 | 86.6 | +| PaddleNLP 6-epoch | 36.9 | 89.5 | 84.3 | 86.2 | 88.6 | 80.5 | 88.4 | 58.1 | diff --git a/examples/language_model/roberta/create_data.py b/examples/language_model/roberta/create_data.py new file mode 100644 index 0000000000000000000000000000000000000000..70b7aee100387b8ffd115d42f9116aa2a9776a78 --- /dev/null +++ b/examples/language_model/roberta/create_data.py @@ -0,0 +1,151 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os + +from datasets import load_dataset +from transformers import AutoTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--output_dir", + default="wiki", + type=str, + required=False, + help="The output directory where the model predictions and checkpoints will be written.", +) +parser.add_argument("--dataset_name", default="wikipedia", type=str, required=False, help="dataset name") +parser.add_argument( + "--dataset_config_name", default="20200501.en", type=str, required=False, help="dataset config name" +) +parser.add_argument( + "--use_slow_tokenizer", + action="store_true", + help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", +) +parser.add_argument("--tokenizer_name", default="roberta-base", type=str, required=False, help="tokenizer name") +parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument( + "--line_by_line", + type=bool, + default=False, + help="Whether distinct lines of text in the dataset are to be handled as distinct sequences.", +) +parser.add_argument("--preprocessing_num_workers", default=20, type=int, help="multi-processing number.") +parser.add_argument( + "--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets" +) + + +def main(args): + if args.output_dir is not None: + os.makedirs(args.output_dir, exist_ok=True) + + # Get the datasets: + if args.dataset_name is not None: + # Downloading and loading a dataset from the hub. + raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name) + + # Load pretrained tokenizer + if args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + + # First we tokenize all the texts. + column_names = raw_datasets["train"].column_names + text_column_name = "text" if "text" in column_names else column_names[0] + + if args.line_by_line: + # When using line_by_line, we just tokenize each nonempty line. + padding = False + + def tokenize_function(examples): + # Remove empty lines + examples[text_column_name] = [ + line for line in examples[text_column_name] if len(line) > 0 and not line.isspace() + ] + return tokenizer( + examples[text_column_name], + padding=padding, + truncation=True, + max_length=args.max_seq_length, + # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it + # receives the `special_tokens_mask`. + return_special_tokens_mask=True, + ) + + tokenized_datasets = raw_datasets.map( + tokenize_function, + batched=True, + num_proc=args.preprocessing_num_workers, + remove_columns=[text_column_name], + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on dataset line_by_line", + ) + else: + # Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts. + # We use `return_special_tokens_mask=True` because DataCollatorForLanguageModeling (see below) is more + # efficient when it receives the `special_tokens_mask`. 
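+ # For example, with max_seq_length=512 every 512 tokens of the concatenated corpus become one
+ # training chunk; only the partial chunk at the end of each mapped batch is dropped by group_texts below.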
+ def tokenize_function(examples): + return tokenizer(examples[text_column_name], return_special_tokens_mask=True) + + tokenized_datasets = raw_datasets.map( + tokenize_function, + batched=True, + num_proc=args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on every text in dataset", + ) + + # Main data processing function that will concatenate all texts from our dataset and generate chunks of + # max_seq_length. + def group_texts(examples): + # Concatenate all texts. + concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can + # customize this part to your needs. + if total_length >= args.max_seq_length: + total_length = (total_length // args.max_seq_length) * args.max_seq_length + # Split by chunks of max_len. + result = { + k: [t[i : i + args.max_seq_length] for i in range(0, total_length, args.max_seq_length)] + for k, t in concatenated_examples.items() + } + return result + + # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a + # remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value + # might be slower to preprocess. + + tokenized_datasets = tokenized_datasets.map( + group_texts, + batched=True, + num_proc=args.preprocessing_num_workers, + load_from_cache_file=not args.overwrite_cache, + desc=f"Grouping texts in chunks of {args.max_seq_length}", + ) + tokenized_datasets.save_to_disk(args.output_dir) + + +if __name__ == "__main__": + args = parser.parse_args() + main(args) diff --git a/examples/language_model/roberta/run_pretrain.py b/examples/language_model/roberta/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..d96b0458a4df5cf9fd5938d274d44843ab9a789d --- /dev/null +++ b/examples/language_model/roberta/run_pretrain.py @@ -0,0 +1,176 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import time + +import numpy as np +import paddle +from paddle.io import DataLoader +from utils import DataCollatorMLM + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + RobertaConfig, + RobertaForMaskedLM, +) + +parser = argparse.ArgumentParser() +IGNORE = -100 + +# yapf: disable +parser.add_argument("--model_name_or_path", default='roberta-en-base', type=str, required=False, help="Path to pre-trained model") +parser.add_argument("--input_file", default='wiki', type=str, required=False, help="The input directory where the model predictions and checkpoints will be written.") +parser.add_argument("--output_dir", default='ckp/', type=str, required=False, help="The output directory where the model predictions and checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=2e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") +parser.add_argument("--num_train_epochs", default=10, type=int, help="Total number of training epochs to perform.", ) +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") +parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=10000, help="Save checkpoint every X updates steps.") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") +parser.add_argument("--amp", type=strtobool, default=True, help="use mix precision.") + +roberta_arch = { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 514, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 1, + "vocab_size": 50265, + "layer_norm_eps": 1e-05, + "pad_token_id": 1, + "cls_token_id": 0 +} + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(seed) + + +def do_train(args): + paddle.set_device(args.device) + set_seed(args.seed) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # Load model and train from scratch + config = RobertaConfig(**roberta_arch) + model = RobertaForMaskedLM(config) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + ignore_label = IGNORE + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + # Load wikipedia dataset via Hugging face datasets + # TO DO: paddle datasets + import datasets + tokenized_datasets = datasets.load_from_disk(args.input_file) + train_ds = tokenized_datasets["train"] + from transformers import AutoTokenizer + tokenizer = AutoTokenizer.from_pretrained('roberta-base') + + # Prepare data for training + collator_func = DataCollatorMLM(tokenizer=tokenizer) # data collator + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True) + train_data_loader = DataLoader( + dataset=train_ds, + collate_fn=collator_func, + num_workers=0, + batch_sampler=train_batch_sampler, + return_list=True) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_steps) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + if args.amp: # mixed precision (fp16) + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + # Start training + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + input_ids, _, labels = batch + with paddle.amp.auto_cast(args.amp): + logits = model(input_ids=input_ids) + loss = loss_fct(logits, labels) + if args.amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0: + + print( + "global step %d/%d, loss: %f, lr: %.10f, speed: %.4f step/s" + % (global_step, num_training_steps, loss, optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train))) + tic_train = time.time() + + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "paddle_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + + model_to_save = model._layers if isinstance( + model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + args = parser.parse_args() + do_train(args) diff --git a/examples/language_model/roberta/utils.py b/examples/language_model/roberta/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8d2a566a2f3ab354fc505a5e45bda73e1ba34f0b --- /dev/null +++ b/examples/language_model/roberta/utils.py @@ -0,0 +1,73 @@ +# Copyright (c) 2020 
PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.data import Dict, Pad + + +class DataCollatorMLM: + def __init__(self, tokenizer, batch_pad=None): + self.batch_pad = batch_pad + self.mask_token_id = tokenizer.mask_token_id + self.pad_token_id = tokenizer.pad_token_id + self.token_len = tokenizer.vocab_size + if batch_pad is None: + self.batch_pad = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=self.pad_token_id, dtype="int64"), # input + # 'token_type_ids': Pad(axis=0, pad_val=0, dtype='int64'), # segment + "special_tokens_mask": Pad(axis=0, pad_val=True, dtype="int64"), # segment + } + ): fn(samples) + else: + self.batch_pad = batch_pad + + def __call__(self, examples): + examples = self.batch_pad(examples) + examples = [paddle.to_tensor(e) for e in examples] + examples[0], labels = self._mask_tokens( + examples[0], paddle.cast(examples[1], dtype=bool), self.mask_token_id, self.token_len + ) + examples.append(labels) + return examples + + def _mask_tokens(self, inputs, special_tokens_mask, mask_token_id, token_len, mlm_prob=0.15, ignore_label=-100): + """ + Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
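+        Implementation note: a Bernoulli draw with p=mlm_prob selects the positions to
+        mask (special tokens are excluded by zeroing their probability), and labels at all
+        other positions are set to `ignore_label` so only masked tokens contribute to the
+        loss. Of the masked positions, a Bernoulli(0.8) draw is replaced with
+        `mask_token_id`; a Bernoulli(0.5) draw over the remaining 20% becomes a random
+        token id, and the rest are left unchanged.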
+ """ + labels = inputs.clone() + probability_matrix = paddle.full(labels.shape, mlm_prob) + probability_matrix[special_tokens_mask] = 0 + + masked_indices = paddle.cast(paddle.bernoulli(probability_matrix), dtype=bool) + labels[~masked_indices] = ignore_label # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = paddle.cast(paddle.bernoulli(paddle.full(labels.shape, 0.8)), dtype=bool) & masked_indices + inputs[indices_replaced] = mask_token_id + + # 10% of the time, we replace masked input tokens with random word + + indices_random = ( + paddle.cast(paddle.bernoulli(paddle.full(labels.shape, 0.5)), dtype=bool) + & masked_indices + & ~indices_replaced + ) + random_words = paddle.randint(low=0, high=token_len, shape=labels.shape) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels diff --git a/examples/language_model/roformer/README.md b/examples/language_model/roformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e06ee0d6ca947ffdc89e9cabdaa70bab5ed029e2 --- /dev/null +++ b/examples/language_model/roformer/README.md @@ -0,0 +1,112 @@ +# RoFormer + +## 模型简介 + +[RoFormer](https://arxiv.org/pdf/2104.09864.pdf) (RoFormer: Enhanced Transformer with Rotary Position Embedding)是一个带有旋转位置嵌入(RoPE)的MLM预训练语言模型。 RoPE是一种相对位置编码方法,具有良好的理论特性。其主要思想是根据绝对位置将上下文嵌入(transformer中的 q,k)乘以旋转矩阵。可以证明上下文嵌入的内积将仅取决于相对位置。 +RoPE 是唯一可用于线性注意力的相对位置嵌入。更多详情请参考[论文](https://arxiv.org/pdf/2104.09864.pdf)或[原博客](https://kexue.fm/archives/8265)。EleutherAI还发布了一篇[博客](https://blog.eleuther.ai/rotary-embeddings/),其中包含有关 RoPE 的直观解释和实验。 + +本项目是RoFormer在 Paddle 2.x上的开源实现,包含了`THUCNews分类任务`和`Cail2019 Scm任务`的微调代码。 + +## 快速开始 + + +### 预训练MLM测试 + ```bash + python test_mlm.py --model_name roformer-chinese-base --text 今天[MASK]很好,我想去公园玩! + # paddle: 今天[天气||天||阳光||太阳||空气]很好,我想去公园玩! + python test_mlm.py --model_name roformer-chinese-base --text 北京是[MASK]的首都! + # paddle: 北京是[中国||谁||中华人民共和国||我们||中华民族]的首都! + python test_mlm.py --model_name roformer-chinese-char-base --text 今天[MASK]很好,我想去公园玩! + # paddle: 今天[天||气||都||风||人]很好,我想去公园玩! + python test_mlm.py --model_name roformer-chinese-char-base --text 北京是[MASK]的首都! + # paddle: 北京是[谁||我||你||他||国]的首都! 
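+ # (可选)若模型已下载到本地目录,--model_name 也可以指向该目录,例如:
+ # python test_mlm.py --model_name ./roformer-chinese-base --text 北京是[MASK]的首都!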
+ ``` + +### THUCNews分类任务数据 + +THUCNews分类任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_thucnews.py`执行微调时将会自动下载。 + +### 执行Fine-tunning + +启动thucnews分类任务的Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" examples/language_model/roformer/run_thucnews.py \ + --model_type roformer \ + --model_name_or_path roformer-chinese-base \ + --max_seq_length 256 \ + --batch_size 64 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./thucnews/ \ + --device gpu \ + --use_amp False +``` +其中参数释义如下: +- `model_type` 指示了模型类型,可以选择roformer。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录的地址。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 +- `use_amp` 指示是否启用自动混合精度训练。 + +基于`roformer-chinese-base`在THUCNews分类任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:-----------------:| +| THUCNews | Accuracy | 0.98 | + + + +### Cail2019_Scm任务数据 + +Cail2019_Scm分类任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`cail2019_scm.py`执行微调时将会自动下载。 + +### 执行Fine-tunning + +启动cail2019_scm任务的Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" examples/language_model/roformer/run_cail2019_scm.py \ + --model_type roformer_mean_pooling \ + --model_name_or_path roformer-chinese-base \ + --max_seq_length 512 \ + --batch_size 16 \ + --learning_rate 6e-6 \ + --num_train_epochs 20 \ + --logging_steps 60 \ + --save_steps 600 \ + --output_dir ./cail2019_scm/ \ + --device gpu \ + --use_amp False +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,可以选择roformer_cls_pooling和roformer_mean_pooling两种类型。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录的地址。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示学习率大小,本代码并未使用学习率衰减。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 +- `use_amp` 指示是否启用自动混合精度训练。 + +基于`roformer-chinese-base`在Cail2019_Scm任务上Fine-tuning后,有如下结果: + +| Model | Dev Accuracy | Test Accuracy | +|:-------------:|:-----------------:|:------------------:| +| RoFormer-512 | 0.6307 | 0.6947 | + +注: `run_cail2019_scm.py`参考了[原论文微调的代码](https://github.com/ZhuiyiTechnology/roformer/blob/main/finetune_scm.py),原代码未使用学习率衰减,而是使用了固定学习率6e-6。 diff --git a/examples/language_model/roformer/convert.py b/examples/language_model/roformer/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..c33830e633dde755f15ae16da0de9bdb9c3eca08 --- /dev/null +++ b/examples/language_model/roformer/convert.py @@ -0,0 +1,78 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict +import argparse + +huggingface_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", +} + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + if k == "cls.predictions.bias" or "encoder.embed_positions." in k: + continue + if k[-7:] == ".weight": + if ".embeddings." not in k and ".LayerNorm." not in k: + v = v.transpose(0, 1) + oldk = k + for huggingface_name, paddle_name in huggingface_to_paddle.items(): + k = k.replace(huggingface_name, paddle_name) + + if "roformer." not in k and "cls." not in k: + k = "roformer." + k + + print(f"Converting: {oldk} => {k}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="roformer_chinese_base/pytorch_model.bin", + type=str, + required=True, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="roformer_chinese_base/model_state.pdparams", + type=str, + required=True, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/roformer/run_cail2019_scm.py b/examples/language_model/roformer/run_cail2019_scm.py new file mode 100644 index 0000000000000000000000000000000000000000..68b7e95b5fc48f88ed285c8ca95eb2ba2526fe49 --- /dev/null +++ b/examples/language_model/roformer/run_cail2019_scm.py @@ -0,0 +1,363 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
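+
+# Fine-tunes RoFormer on the CAIL2019-SCM similar-case-matching task. Each (A, B, C)
+# triplet is expanded into two scored pairs, with the pair the gold label marks as more
+# similar labeled 1 and the other 0; training uses BCEWithLogitsLoss, and accuracy checks
+# that the positive pair of each triplet receives the higher score (see
+# Cail2019_SCM_Accuracy below).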
+ +import argparse +import logging +import os +import random +import time +from functools import partial + +import jieba +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import RoFormerTokenizer +from paddlenlp.transformers.roformer.modeling import ( + RoFormerForSequenceClassification, + RoFormerPretrainedModel, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) +jieba.setLogLevel(logging.INFO) + + +class RoFormerMeanPoolingForSequenceClassification(RoFormerPretrainedModel): + def __init__(self, roformer, num_classes): + super(RoFormerMeanPoolingForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roformer = roformer + self.classifier = nn.Linear(self.roformer.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, attention_mask=None): + last_hidden_state = self.roformer(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0] + + mask = (input_ids != self.roformer.pad_token_id).astype(self.classifier.weight.dtype).unsqueeze(-1) + mean_pooling = paddle.sum(last_hidden_state * mask, axis=1) / paddle.sum(mask, axis=1) + logits = self.classifier(mean_pooling) + return logits + + +MODEL_CLASSES = { + "roformer_cls_pooling": (RoFormerForSequenceClassification, RoFormerTokenizer), + "roformer_mean_pooling": (RoFormerMeanPoolingForSequenceClassification, RoFormerTokenizer), +} + + +class Cail2019_SCM_Accuracy(Accuracy): + def compute(self, pred, label, *args): + pred = paddle.cast(pred[::2] > pred[1::2], dtype="int64") + correct = (pred == 1).unsqueeze(-1) + return paddle.cast(correct, dtype="float32") + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum( + [list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], + [], + ) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--learning_rate", + default=1e-4, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=100, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/xpu/npu.", + ) + parser.add_argument( + "--use_amp", + type=strtobool, + default=False, + help="Enable mixed precision training.", + ) + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids).squeeze(-1) + loss = loss_fct(logits, labels) + correct = metric.compute(F.sigmoid(logits), labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s" % (loss.numpy(), res)) + + +def convert_example(example, tokenizer, max_seq_length=512): + if example["label"] == 0: + text1 = example["text_a"] + text2 = example["text_b"] + text3 = example["text_c"] + else: + text1 = example["text_a"] + text2 = example["text_c"] + text3 = example["text_b"] + + data1 = tokenizer(text1, text_pair=text2, max_length=max_seq_length) + data2 = tokenizer(text1, text_pair=text3, max_length=max_seq_length) + + return [data1["input_ids"], data1["token_type_ids"], 1], [data2["input_ids"], data2["token_type_ids"], 0] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + args.model_type = args.model_type.lower() + args.batch_size = args.batch_size // 2 + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("cail2019_scm", splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="float32"), + ), + ): # label + new_samples = [] + for sample in samples: + new_samples.extend(sample) + return fn(new_samples) + + train_data_loader = DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + dev_ds = load_dataset("cail2019_scm", splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size * 4, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + test_ds = load_dataset("cail2019_scm", splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size * 4, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, + batch_sampler=test_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=1) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
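+    # Note: the attribute name (n) is matched against the "bias"/"norm" substrings, while
+    # the framework-assigned parameter name (p.name) is what apply_decay_param_fun is
+    # called with, so weight decay is applied only to the parameters collected here.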
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.BCEWithLogitsLoss() + + metric = Cail2019_SCM_Accuracy() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + model.train() + global_step += 1 + + input_ids, segment_ids, labels = batch + + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, segment_ids).squeeze(-1) + loss = loss_fct(logits, labels) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + print("============Dev Dataset============") + evaluate(model, loss_fct, metric, dev_data_loader) + print("============Test Dataset============") + evaluate(model, loss_fct, metric, test_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "ft_model_%d.pdparams" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/roformer/run_thucnews.py b/examples/language_model/roformer/run_thucnews.py new file mode 100644 index 0000000000000000000000000000000000000000..7a014e57b88a983b8d0dda5f7f7d277069eed844 --- /dev/null +++ b/examples/language_model/roformer/run_thucnews.py @@ -0,0 +1,290 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + RoFormerForSequenceClassification, + RoFormerTokenizer, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "roformer": (RoFormerForSequenceClassification, RoFormerTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/xpu/npu.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a thucnews example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["label"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + example = tokenizer(example["text"], max_length=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("thucnews", splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_ds = load_dataset("thucnews", splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, 
num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = Accuracy() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "ft_model_%d.pdparams" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/roformer/test_mlm.py b/examples/language_model/roformer/test_mlm.py new file mode 100644 index 0000000000000000000000000000000000000000..31bb14c550df21b3757b8314cca8c600b576940d --- /dev/null +++ b/examples/language_model/roformer/test_mlm.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import argparse +from paddlenlp.transformers import RoFormerForMaskedLM, RoFormerTokenizer + + +def test_mlm(text, model_name): + model = RoFormerForMaskedLM.from_pretrained(model_name) + model.eval() + tokenizer = RoFormerTokenizer.from_pretrained(model_name) + tokens = ["[CLS]"] + text_list = text.split("[MASK]") + for i, t in enumerate(text_list): + tokens.extend(tokenizer.tokenize(t)) + if i == len(text_list) - 1: + tokens.extend(["[SEP]"]) + else: + tokens.extend(["[MASK]"]) + + input_ids_list = tokenizer.convert_tokens_to_ids(tokens) + input_ids = paddle.to_tensor([input_ids_list]) + + with paddle.no_grad(): + pd_outputs = model(input_ids)[0] + pd_outputs_sentence = "paddle: " + for i, id in enumerate(input_ids_list): + if id == tokenizer.convert_tokens_to_ids(["[MASK]"])[0]: + tokens = tokenizer.convert_ids_to_tokens(pd_outputs[i].topk(5)[1].tolist()) + pd_outputs_sentence += "[" + "||".join(tokens) + "]" + else: + pd_outputs_sentence += "".join(tokenizer.convert_ids_to_tokens([id], skip_special_tokens=True)) + + print(pd_outputs_sentence) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name", default="roformer-chinese-base", type=str, help="Pretrained roformer name or path." + ) + parser.add_argument("--text", default="今天[MASK]很好,我想去公园玩!", type=str, help="MLM text.") + args = parser.parse_args() + test_mlm(text=args.text, model_name=args.model_name) diff --git a/examples/language_model/rw/README.md b/examples/language_model/rw/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1436353774786c042eb18aa82eb68d8aa17179de --- /dev/null +++ b/examples/language_model/rw/README.md @@ -0,0 +1,13 @@ +# Falcon + +## 介绍 + +Falcon是由[TII](https://www.tii.ae/)构建的Causal decoder-only模型,基于含有 1,500B个tokens的[RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)数据集训练得来。 +Falcon 引入了[FlashAttention](https://github.com/HazyResearch/flash-attention)和[Multi-Query Attention]等新特性。更详细的模型介绍见[论文](https://arxiv.org/abs/2306.01116) + +## 推理 + +``` +python predict_generation.py \ + --model_name_or_path tiiuae/falcon-7b +``` diff --git a/examples/language_model/rw/predict_generation.py b/examples/language_model/rw/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..256feb80585a11927a183231a4fb8d23f72e9ff6 --- /dev/null +++ b/examples/language_model/rw/predict_generation.py @@ -0,0 +1,118 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
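+
+# Batch text-generation demo for Falcon (RW): prompts are tokenized with left-side
+# truncation to --src_length, generation runs greedily (sampling with top_k=1) for up to
+# --tgt_length new tokens, and outputs are decoded with special tokens skipped.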
+ +import paddle + +from paddlenlp.transformers import RWConfig, RWForCausalLM, RWTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default="tiiuae/falcon-7b", help="The directory of model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=128, help="The batch size of data.") + parser.add_argument("--tgt_length", type=int, default=128, help="The batch size of data.") + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args=None, tokenizer=None, model=None, **kwargs): + if args is None: + self.tokenizer = tokenizer + self.model = model + self.src_length = kwargs["src_length"] + self.tgt_length = kwargs["tgt_length"] + else: + self.tokenizer = RWTokenizer.from_pretrained(args.model_name_or_path) + self.batch_size = args.batch_size + self.args = args + self.src_length = self.args.src_length + self.tgt_length = self.args.tgt_length + + config = RWConfig.from_pretrained(args.model_name_or_path) + dtype = config.dtype if config.dtype is not None else config.paddle_dtype + + self.model = RWForCausalLM.from_pretrained( + args.model_name_or_path, + dtype=dtype, + ) + self.model.eval() + + def preprocess(self, input_text): + inputs = self.tokenizer( + input_text, + return_tensors="np", + padding=True, + max_length=self.src_length, + truncation=True, + truncation_side="left", + ) + inputs_tensor = {} + for key in inputs: + inputs_tensor[key] = paddle.to_tensor(inputs[key]) + return inputs_tensor + + def infer(self, inputs): + result = self.model.generate( + **inputs, + decode_strategy="sampling", + top_k=1, + max_length=self.tgt_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + use_cache=True, + ) + result = result[0] + return result + + def postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.decode(x, skip_special_tokens=True) + res = res.strip("\n") + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + all_texts = [ + "Hello!", + "Please introduce yourself, ", + ] + batch_texts = batchfy_text(all_texts, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print("{}\n{}".format(text, result)) diff --git a/examples/language_model/t5/README.md b/examples/language_model/t5/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7f743228c4123f9f51815b1d45d5e7670dd51ca3 --- /dev/null +++ b/examples/language_model/t5/README.md @@ -0,0 +1,331 @@ +# T5 + +[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683v3.pdf) + +## 摘要 + 
+迁移学习在自然语言处理(NLP)中已经成为一种强大的技术。迁移学习首先在数据丰富的任务上进行预训练,然后在下游任务上进行调整。迁移学习的有效性引起了不同的方法、方法和实践。在本文中,我们通过引入一个统一的框架,将所有基于文本的语言问题转换为文本到文本的格式,来探索自然语言处理的迁移学习技术。我们的系统研究比较了数十项语言理解任务的训练前目标、架构、未标记数据集、迁移方法和其他因素。通过将我们的探索与规模和我们的新"Colossal Clean Crawled Corpus"数据集相结合,我们在摘要、问答、文本分类等许多基准测试中取得了最先进的结果。为了促进NLP迁移学习的未来工作,我们发布了我们的数据集、预训练模型和代码。 + +本项目是T5在 Paddle 2.x上的开源实现,包含了`模型权重`转换代码和`GLUE任务`的微调代码。 + +## 快速开始 + +### 预训练 + +本项目致力于t5模型的预训练,从数据下载,数据转化,模型训练,流程开源开放,可复现。 + +接下来将从下面几个方面,详细介绍整个数据制作全流程,从零开始,构建一个预训练模型。 + +#### 1. 数据准备 + +数据流是预训练的非常重要的,[预处理文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md)提供了整体的数据变动的流程示意,用户可以查看数据制作的细节文档。 + +在数据ID化步骤中,我们需要配置tokenzer_name,选择t5模型对应的tokenizer;通过下面脚本转化,我们可以得到处理好的预训练数据,token ids:[`t5_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.bin), 文章索引信息[`t5_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.idx).(这里提供了一个处理好的预训练数据,可点击链接下载) + +```shell +python -u create_pretraining_data.py \ + --model_name t5-small \ + --tokenizer_name T5Tokenizer \ + --data_format JSON \ + --input_path openwebtext/2020-04.jsonl.zst \ + --split_sentences \ + --output_prefix t5_openwebtext \ + --workers 1 \ + --log_interval 5 \ + --data_impl mmap +``` + +#### 2. 开始训练 + +**路径配置** + +- 主要配置输入输出目录 +- 这里的`tokenizer_name_or_path`请设置为内置的tokenizer,如`t5-small`等。 +- 这里的 `input_dir` 设置输入数据集路径,例如配置`input_dir "./data"`即可。 + +**启动训练**:这里启动的是单机8卡任务,整体全局的batch_size 512 (64*8)。如果指定ips参数,进行多机运行,如 `python3 -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" --ips 192.168.1.101,192.168.1.101 ` + +```shell +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "./log" \ + t5_run_pretrain_trainer.py \ + --model_type "t5" \ + --model_name_or_path "t5-small" \ + --tokenizer_name_or_path "${vocab_dir}" \ + --input_dir "${data_dir}" \ + --output_dir "${base_dir}" \ + --split 10,5,1 \ + --max_seq_length 512 \ + --max_seq_length_dec 128 \ + --per_device_train_batch_size 64 \ + --per_device_eval_batch_size 64 \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 20000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --decay_steps 9900 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 10\ + --dataloader_num_workers 4 \ + --eval_steps 100 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --do_train \ + --do_eval \ + --seed 1234 \ + --device "gpu" \ + --data_impl "mmap" +``` + +其中参数释义如下: + +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹(对于ernie,词表文件名一般命名为vocab.txt),或者PaddleNLP内置tokenizer的名字。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认`split=949,50,1`, 使用1/1000的数据为test,当样本数太少时,增大测试的样本数目。 +- `max_seq_len` 输入文本序列的长度,默认值`512`。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_steps` 最大训练步数。训练不支持通过`epoch`控制,第一次制造数据index时候,日志会显示数据会被计算的epoch数,请注意查看。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `max_grad_norm` 梯度裁剪范围。 +- `logging_steps` 日志输出间隔。 +- `dataloader_num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_steps` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `data_impl` 指定输入文件数据制作类型,默认为mmap,可指定mmap或lazy。mmap格式在读入数据时会建立内存映射,lazy格式在读入数据时直接从文件读取。 + +### GLUE任务 + +### 执行Fine-tunning + +启动rte分类任务的Fine-tuning的方式如下: + +```shell +python run_glue.py \ + --model_name_or_path t5-base \ + --task_name rte \ + --max_seq_length 256 \ 
+ --train_batch_size 16 \ + --eval_batch_size 64 \ + --learning_rate 1e-4 \ + --weight_decay 0.01 \ + --warmup_radio 0.1 \ + --num_train_epochs 10 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --scheduler_type linear \ + --output_dir outputs/rte/ \ + --device gpu +``` + +其中参数释义如下: + +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录的地址。 +- `task_name` GLUE任务名称,可从选["cola","sst-2","mrpc","sts-b","qqp","mnli", "rte", "qnli"]选择。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `train_batch_size` 表示训练时的样本数目。 +- `eval_batch_size` 表示验证时的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `warmup_radio` warmup比率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机种子。 +- `scheduler_type` scheduler类型,可选linear和cosine,默认linear。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备,可选cpu、gpu或npu。 + +使用trainer进行Fine-tuning: + +```shell +python -m paddle.distributed.launch --gpus "0,1,2,3" run_glue_trainer.py \ + --model_name_or_path t5-base \ + --task_name rte \ + --max_seq_length 256 \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 64 \ + --learning_rate 1e-4 \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --num_train_epochs 10 \ + --eval_steps 200 \ + --logging_steps 20 \ + --save_steps 200 \ + --save_total_limit 3 \ + --metric_for_best_model "eval_accuracy" \ + --fp16 false \ + --fp16_opt_level "O1" \ + --recompute true \ + --sharding "stage1" \ + --overwrite_output_dir \ + --disable_tqdm true \ + --output_dir outputs/rte/ +``` + +具体参数含义请参见: https://paddlenlp.readthedocs.io/zh/latest/trainer.html + +###### t5-base模型在GLUE开发集上的结果: + +| Model | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | mean | +| -------------- | ----- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | ------- | +| | mcc | acc | acc | pearson | acc | acc | acc | acc | | +| T5-base-Paddle | 61.74 | 95.18 | 90.44 | 90.09 | 91.60 | 87.18 | 93.56 | 81.95 | 86.4675 | + +###### t5_v1_1-base模型在GLUE开发集上的结果: + +使用`run_glue_trainer.py`运行,由于`t5_v1_1-base`没有在glue任务上进行训练过,直接生成label的策略需要的训练时间需要更长。 + +| Model | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | +| ------------------- | ------- | ----- | ----- | ------- | ----- | ----- | ------ | ----- | +| | mcc | acc | acc | pearson | acc | acc | acc | acc | +| T5-v1_1-base Paddle | 47.6845 | 94.38 | 84.31 | 87.74 | 88.05 | 85.39 | 90.518 | 65.70 | +| epoch | 100 | 10 | 100 | 100 | 3 | 3 | 10 | 100 | + +注: + +- 直接生成label的finetune方式难度较大,前期基本学习如何正确生成label标签,后期才学习分类任务。 +- 生成的label标签设计,标签差异大一些,效果会更好一些。 +- `qqp`,`mnli`数据集适当增大训练epoch数,可以取得更好效果。 + +### CLUE任务 + +使用trainer进行Fine-tuning: + +```shell +python -m paddle.distributed.launch --gpus "0,1,2,3" run_clue_trainer.py \ + --model_name_or_path Langboat/mengzi-t5-base-mt \ + --task_name cluewsc2020 \ + --max_seq_length 512 \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 64 \ + --learning_rate 1e-4 \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --num_train_epochs 100 \ + --eval_steps 200 \ + --logging_steps 20 \ + --save_steps 200 \ + --save_total_limit 3 \ + --metric_for_best_model "eval_accuracy" \ + --fp16 false \ + --fp16_opt_level "O1" \ + --recompute true \ + --sharding "stage1" \ + --overwrite_output_dir \ + --disable_tqdm true \ + --output_dir outputs/clue/cluewsc2020 +``` + +###### Langboat/mengzi-t5-base-mt模型在CLUE开发集上的结果: + +| Model | afqmc | tnews | iflytek | cmnli | ocnli | cluewsc2020 | csl | +| 
:------------------------- | :---- | :---- | :------ | :---- | :---- | :---------- | :---- | +| | acc | acc | acc | acc | acc | acc | acc | +| Langboat/mengzi-t5-base-mt | 74.44 | 58.47 | 61.14 | 80.97 | 75.76 | 79.61 | 84.47 | +| epoch | 10 | 10 | 10 | 10 | 10 | 100 | 10 | + + + +### GLUE Demo测试 + +```sh +python glue_demo.py +input text: sst2 sentence: contains no wit , only labored gags +label: negative +================================================== +input text: sst2 sentence: that loves its characters and communicates something rather beautiful about human nature +label: positive +================================================== +input text: cola sentence: Mickey looked it up. +label: acceptable +================================================== +input text: sst2 sentence: remains utterly satisfied to remain the same throughout +label: positive +================================================== +input text: sst2 sentence: a well-made and often lovely depiction of the mysteries of friendship +label: positive +================================================== +``` + +### Zero shot Demo测试 [参考自Langboat/mengzi-zero-shot](https://github.com/Langboat/mengzi-zero-shot) + +```sh +python zero_shot_demo.py +``` + +当前**zero shot**时输入的构造方法如下表所示。 + +| **任务类型** | **prompt构造(其中{s}代表句子输**入) | +| -------------------- | ------------------------------------------------------------ | +| **实体抽取** | “{s}”找出上述句子中的实体和他们对应的类别 | +| **语义相似度** | “{s1}”和“{s2}”这两句话是在说同一件事吗? | +| **金融关系抽取** | “{s}”中的“{e1}”和“{e2}”是什么关系?答: | +| **广告文案生成** | 请根据以下产品信息设计广告文案。商品信息:{s} | +| **医学领域意图分类** | 问题:“{s}”。此问题的医学意图是什么?选项:病情诊断,病因分析,治疗方案,就医建议,指标解读,疾病描述,后果表述,注意事项,功效作用,医疗费用。 | +| **评论情感分类** | 评论:{s}。请判断该条评论所属类别(积极或消极)并填至空格处。回答: | +| **评论对象抽取** | 评论:{s}.这条评论的评价对象是谁? | +| **新闻分类** | “{s}”是什么新闻频道写的?选项:故事,文化,娱乐,体育,财经,房产,汽车,教育,科技,军事,旅游,国际,股票,农业,电竞。答: | + +``` +input_text: “导致泗水的砭石受到追捧,价格突然上涨。而泗水县文化市场综合执法局颜鲲表示,根据监控”找出上述句子中的实体和他们对应的类别 +output: 泗水县文化市场综合执法局:政府,颜鲲:姓名 +================================================== +input_text: “你好,我还款银行怎么更换”和“怎么更换绑定还款的卡”这两句话是在说同一件事吗? +output: 是 +================================================== +input_text: “为打消市场顾虑,工行两位洋股东——美国运通和安联集团昨晚做出承诺,近期不会减持工行H股。”中的“工行”和“美国运通”是什么关系?答: +output: 被持股 +================================================== +input_text: 请根据以下产品信息设计广告文案。商品信息:类型-裤,版型-宽松,风格-潮,风格-复古,风格-文艺,图案-复古,裤型-直筒裤,裤腰型-高腰,裤口-毛边 +output: 这款牛仔裤采用高腰直筒的版型设计,搭配宽松的裤型,穿着舒适又显潮流感。而裤脚的毛边设计,增添几分复古文艺的气息。 +================================================== +input_text: 问题:“呼气试验阳性什么意思”。此问题的医学意图是什么?选项:病情诊断,病因分析,治疗方案,就医建议,指标解读,疾病描述,后果表述,注意事项,功效作用,医疗费用。 +output: 指标解读 +================================================== +input_text: 评论:房间很一般,小,且让人感觉脏,隔音效果差,能听到走廊的人讲话,走廊光线昏暗,旁边没有什么可吃。请判断该条评论所属类别(积极或消极)并填至空格处。回答: +output: 消极 +================================================== +input_text: 评论:灵水的水质清澈,建议带个浮潜装备,可以看清湖里的小鱼。.这条评论的评价对象是谁? +output: 灵水 +================================================== +input_text: “懒人适合种的果树:长得多、好打理,果子多得都得送邻居吃”是什么新闻频道写的?选项:故事,文化,娱乐,体育,财经,房产,汽车,教育,科技,军事,旅游,国际,股票,农业,电竞。答: +output: 农业 +================================================== +``` + +# Reference + +```bibtex +@article{2020t5, + author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. 
Liu}, + title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}, + journal = {Journal of Machine Learning Research}, + year = {2020}, + volume = {21}, + number = {140}, + pages = {1-67}, + url = {http://jmlr.org/papers/v21/20-074.html} +} +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/examples/language_model/t5/convert.py b/examples/language_model/t5/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..8c1fa30ea14481fc6510292ba96797236d454e09 --- /dev/null +++ b/examples/language_model/t5/convert.py @@ -0,0 +1,68 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict +import argparse + +dont_transpose = [ + "shared.weight", + "layer_norm.weight", + ".layer_norm.weight", + "relative_attention_bias.weight", + "embed_tokens.weight", +] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + transpose = False + + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + transpose = True + + print(f"Converting: {k} | is_transpose {transpose}") + + if k != "lm_head.weight": + k = "t5." 
+ k + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="google/t5-large/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="paddle/t5-large/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/t5/data.py b/examples/language_model/t5/data.py new file mode 100644 index 0000000000000000000000000000000000000000..8630ffbd64a6c3ff890e748f201c4e87dca4e23c --- /dev/null +++ b/examples/language_model/t5/data.py @@ -0,0 +1,383 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +from paddle.io import BatchSampler, DataLoader +from utils import load_pickle, save_pickle + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset + +CLUE_PROCESSED = collections.OrderedDict( + [ + ("afqmc", (["afqmc sentence1: ", "afqmc sentence2: "], ["不同", "类似"])), + ( + "tnews", + ( + ["tnews sentence: "], + ["故事", "文化", "娱乐", "体育", "财经", "房产", "汽车", "教育", "科技", "军事", "旅游", "国际", "股票", "农业", "电竞"], + ), + ), + ( + "iflytek", + ( + ["iflytek sentence: "], + [ + "打车", + "地图导航", + "免费WIFI", + "租车", + "同城服务", + "快递物流", + "婚庆", + "家政", + "公共交通", + "政务", + "社区服务", + "薅羊毛", + "魔幻", + "仙侠", + "卡牌", + "飞行空战", + "射击游戏", + "休闲益智", + "动作类", + "体育竞技", + "棋牌中心", + "经营养成", + "策略", + "MOBA", + "辅助工具", + "约会社交", + "即时通讯", + "工作社交", + "论坛圈子", + "婚恋社交", + "情侣社交", + "社交工具", + "生活社交", + "微博博客", + "新闻", + "漫画", + "小说", + "技术", + "教辅", + "问答交流", + "搞笑", + "杂志", + "百科", + "影视娱乐", + "求职", + "兼职", + "视频", + "短视频", + "音乐", + "直播", + "电台", + "K歌", + "成人", + "中小学", + "职考", + "公务员", + "英语", + "视频教育", + "高等教育", + "成人教育", + "艺术", + "语言(非英语)", + "旅游资讯", + "综合预定", + "民航", + "铁路", + "酒店", + "行程管理", + "民宿短租", + "出国", + "工具", + "亲子儿童", + "母婴", + "驾校", + "违章", + "汽车咨询", + "汽车交易", + "日常养车", + "行车辅助", + "租房", + "买房", + "装修家居", + "电子产品", + "问诊挂号", + "养生保健", + "医疗服务", + "减肥瘦身", + "美妆美业", + "菜谱", + "餐饮店", + "体育咨讯", + "运动健身", + "支付", + "保险", + "股票", + "借贷", + "理财", + "彩票", + "记账", + "银行", + "美颜", + "影像剪辑", + "摄影修图", + "相机", + "绘画", + "二手", + "电商", + "团购", + "外卖", + "电影票务", + "社区超市", + "购物咨询", + "笔记", + "办公", + "日程管理", + "女性", + "经营", + "收款", + "其他", + ], + ), + ), + ("cmnli", (["cmnli sentence1: ", "cmnli sentence2: "], ["矛盾", "中立", "蕴涵"])), + ("ocnli", (["ocnli sentence1: ", "ocnli sentence2: "], ["蕴涵", "矛盾", "中立"])), + ("cluewsc2020", (["cluewsc2020 sentence: "], ["同义", "歧义"])), + ("csl", ((["csl sentence1: ", "csl sentence2: "], ["伪造", "真实"]))), + ] +) +GLUE_PROCESSED = collections.OrderedDict( + [ + ("cola", (["cola sentence: "], 
["not_acceptable", "acceptable"])), + ("sst-2", (["sst2 sentence: "], ["negative", "positive"])), + ( + "mrpc", + (["mrpc sentence1: ", " sentence2: "], ["not_equivalent", "equivalent"]), + ), + ("sts-b", (["stsb sentence1: ", " sentence2: "], None)), + ("qqp", (["qqp question1: ", " question2: "], ["not_duplicate", "duplicate"])), + ( + "mnli", + ( + ["mnli hypothesis: ", " premise: "], + ["contradiction", "entailment", "neutral"], + ), + ), + ( + "qnli", + (["qnli question: ", " sentence: "], ["entailment", "not_entailment"]), + ), + ( + "rte", + (["rte sentence1: ", " rte sentence2: "], ["entailment", "not_entailment"]), + ), + ] +) + +GLUE_1_1_PROCESSED = collections.OrderedDict( + [ + ("cola", (["cola sentence: "], ["outrageous", "acceptable"])), + ("sst-2", (["sst2 sentence: "], ["negative", "positive"])), + ( + "mrpc", + (["mrpc sentence1: ", " sentence2: "], ["nonidentical", "equivalent"]), + ), + ("sts-b", (["stsb sentence1: ", " sentence2: "], None)), + ("qqp", (["qqp question1: ", " question2: "], ["inequable", "duplicate"])), + ( + "mnli", + ( + ["mnli hypothesis: ", " premise: "], + ["contradiction", "entailment", "neutral"], + ), + ), + ( + "qnli", + (["qnli question: ", " sentence: "], ["entailment", "contradiction"]), + ), + ( + "rte", + (["rte sentence1: ", " rte sentence2: "], ["entailment", "contradiction"]), + ), + ] +) + + +def trans_func(example, tokenizer, args): + task_name = args.task_name + processed, label = GLUE_PROCESSED[task_name] + if label: + id2label = dict(zip(range(len(label)), label)) + else: + id2label = None + + if not args.is_test: + if id2label: + label_text = id2label[example["labels"]] + else: + label_text = str(example["labels"]) + target = tokenizer(label_text, return_token_type_ids=False, return_attention_mask=True) + + if len(processed) == 1: + text = processed[0] + example["sentence"] + else: + text = processed[0] + example["sentence1"] + processed[1] + example["sentence2"] + + source = tokenizer( + text, + max_seq_len=args.max_seq_length, + return_token_type_ids=False, + return_attention_mask=True, + ) + + if not args.is_test: + return ( + source["input_ids"], + source["attention_mask"], + target["input_ids"], + target["attention_mask"], + ) + else: + return source["input_ids"], source["attention_mask"] + + +def get_train_dataloader(tokenizer, args): + filename = os.path.join("caches", args.task_name + "_train" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="train") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + batch_sampler = BatchSampler(ds, batch_size=args.train_batch_size, shuffle=True) + + # batchify_fn = lambda samples, fn=Tuple( + # Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + # Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + # Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + # Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + # ): fn(samples) + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + ), + ): + return fn(samples) + + data_loader = DataLoader( + dataset=ds, + 
batch_sampler=batch_sampler, + collate_fn=batchify_fn, + num_workers=args.num_workers, + return_list=True, + ) + + return data_loader + + +def get_dev_dataloader(tokenizer, args): + filename = os.path.join("caches", args.task_name + "_dev" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="dev") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + batch_sampler = BatchSampler(ds, batch_size=args.train_batch_size, shuffle=False) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + ), + ): + return fn(samples) + + data_loader = DataLoader( + dataset=ds, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + num_workers=args.num_workers, + return_list=True, + ) + + return data_loader + + +def get_mnli_dev_dataloader(tokenizer, args, matched=True): + if matched: + split = "dev_matched" + else: + split = "dev_mismatched" + filename = os.path.join("caches", args.task_name + f"_{split}" + ".pkl") + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits=split) + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + batch_sampler = BatchSampler(ds, batch_size=args.train_batch_size, shuffle=False) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + ), + ): + return fn(samples) + + data_loader = DataLoader( + dataset=ds, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + num_workers=args.num_workers, + return_list=True, + ) + + return data_loader diff --git a/examples/language_model/t5/dataset_utils.py b/examples/language_model/t5/dataset_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c7149da377ec972ff6980e9a9c8e73ea42d117b5 --- /dev/null +++ b/examples/language_model/t5/dataset_utils.py @@ -0,0 +1 @@ +../../../model_zoo/ernie-1.0/data_tools/dataset_utils.py \ No newline at end of file diff --git a/examples/language_model/t5/glue_demo.py b/examples/language_model/t5/glue_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..61ada574d22c2f6609f17c3e526f95254a495d83 --- /dev/null +++ b/examples/language_model/t5/glue_demo.py @@ -0,0 +1,79 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
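+
+# Usage note (illustrative only, not executed by this script): the Demo class below wraps a T5
+# checkpoint and prints the generated label text for a GLUE-style prompt. A hypothetical
+# interactive session, with an invented input sentence, could look like:
+#
+#   from glue_demo import Demo
+#   demo = Demo(model_name_or_path="t5-base", max_predict_len=5)
+#   demo.generate("sst2 sentence: a gorgeous , witty film ")   # expected to print a label such as "positive"
+#   demo.generate("cola sentence: Mickey looked it up.", max_predict_len=4)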
+ +import paddle +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +class Demo: + def __init__(self, model_name_or_path="t5-base", max_predict_len=5): + self.tokenizer = T5Tokenizer.from_pretrained(model_name_or_path) + print("Loading the model parameters, please wait...") + self.model = T5ForConditionalGeneration.from_pretrained(model_name_or_path) + self.model.eval() + self.max_predict_len = max_predict_len + print("Model loaded.") + + # prediction function + @paddle.no_grad() + def generate(self, inputs, max_predict_len=None): + max_predict_len = max_predict_len if max_predict_len is not None else self.max_predict_len + + ids = self.tokenizer(inputs)["input_ids"] + input_ids = paddle.to_tensor([ids], dtype="int64") + outputs = self.model.generate(input_ids, max_length=max_predict_len)[0][0] + decode_outputs = self.tokenizer.decode(outputs, skip_special_tokens=True).strip() + print(f"input text: {inputs}") + print(f"label: {decode_outputs}") + print("=" * 50) + + +if __name__ == "__main__": + label_length_map = { + "cola": 4, + "sst2": 1, + "mrpc": 5, + "stsb": 5, + "qqp": 5, + "mnli": 4, + "qnli": 5, + "rte": 5, + } + demo = Demo(model_name_or_path="t5-base") + input_text_list = [ + "sst2 sentence: contains no wit , only labored gags ", + "sst2 sentence: that loves its characters and communicates something rather beautiful about human nature ", + "cola sentence: Mickey looked it up.", + "sst2 sentence: remains utterly satisfied to remain the same throughout ", + "sst2 sentence: a well-made and often lovely depiction of the mysteries of friendship ", + ] + for text in input_text_list: + max_predict_len = label_length_map[text.split()[0]] + demo.generate(text, max_predict_len=max_predict_len) + + # input text: sst2 sentence: contains no wit , only labored gags + # label: negative + # ================================================== + # input text: sst2 sentence: that loves its characters and communicates something rather beautiful about human nature + # label: positive + # ================================================== + # input text: cola sentence: Mickey looked it up. + # label: acceptable + # ================================================== + # input text: sst2 sentence: remains utterly satisfied to remain the same throughout + # label: positive + # ================================================== + # input text: sst2 sentence: a well-made and often lovely depiction of the mysteries of friendship + # label: positive + # ================================================== diff --git a/examples/language_model/t5/run_clue_trainer.py b/examples/language_model/t5/run_clue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..d573aba663402b19f4cb63bf1c81153f579c6ae7 --- /dev/null +++ b/examples/language_model/t5/run_clue_trainer.py @@ -0,0 +1,307 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
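+
+# Launch sketch (illustrative only): the flag names come from the DataArguments/ModelArguments
+# dataclasses defined in this file plus the standard PaddleNLP Seq2SeqTrainingArguments; adjust
+# the task name and paths to your setup.
+#
+#   python run_clue_trainer.py \
+#       --model_name_or_path Langboat/mengzi-t5-base \
+#       --task_name tnews \
+#       --max_seq_length 128 \
+#       --output_dir ./clue_tnews \
+#       --do_train --do_eval \
+#       --per_device_train_batch_size 32 \
+#       --learning_rate 1e-4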
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import paddle +from data import CLUE_PROCESSED +from utils import CLUE_METRICS, load_pickle, save_pickle + +from paddlenlp.data.data_collator import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Seq2SeqTrainer, + Seq2SeqTrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer +from paddlenlp.utils.log import logger + + +def trans_func(example, tokenizer, args): + task_name = args.task_name + PROCESSED = CLUE_PROCESSED + processed, label = PROCESSED[task_name] + if label: + id2label = dict(zip(range(len(label)), label)) + else: + id2label = None + + is_test = "label" not in example + # Convert raw text to feature + if "keyword" in example and task_name == "csl": # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example and task_name == "cluewsc2020": # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if not is_test: + if id2label: + label_text = id2label[example["label"]] + else: + label_text = str(example["label"]) + target = tokenizer(label_text, return_token_type_ids=False, return_attention_mask=True) + + if len(processed) == 1: + text = processed[0] + example["sentence"] + else: + text = processed[0] + example["sentence1"] + processed[1] + example["sentence2"] + + source = tokenizer( + text, + max_seq_len=args.max_seq_length, + padding="max_length", + return_token_type_ids=False, + return_attention_mask=True, + ) + + if not is_test: + return { + "input_ids": source["input_ids"], + "attention_mask": source["attention_mask"], + "labels": target["input_ids"], + "decoder_attention_mask": target["attention_mask"], + } + else: + return {"input_ids": source["input_ids"], "attention_mask": source["attention_mask"]} + + +def get_train_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_train" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("clue", args.task_name, splits="train") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + return ds + + +def get_dev_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_dev" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("clue", args.task_name, splits="dev") + ds.map( + partial(trans_func, 
tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + task_name: str = field(default=None, metadata={"help": "The name of the task to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + cache_dir: str = field(default="./caches", metadata={"help": "cache dir for datasets."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + default="Langboat/mengzi-t5-base", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, Seq2SeqTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if not os.path.exists(data_args.cache_dir): + os.mkdir(data_args.cache_dir) + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + PROCESSED = CLUE_PROCESSED + label_name = PROCESSED[data_args.task_name][1] + if label_name: + label2id = dict(zip(label_name, range(len(label_name)))) + else: + label2id = None + metric_list = CLUE_METRICS[data_args.task_name] + # generate_max_length = label_length_map[data_args.task_name] + + # get model and tokenizer + model = T5ForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path) + print(model) + # get dataloader + train_dataset = get_train_dataset(tokenizer, data_args) + eval_dataset = get_dev_dataset(tokenizer, data_args) + + data_collator = DataCollatorForSeq2Seq( + tokenizer=tokenizer, model=model, pad_to_multiple_of=8 if training_args.fp16 else None + ) + + # Define the metrics of tasks. + def compute_metrics(p, tokenizer=tokenizer, label2id=label2id): + all_preds = [] + all_labels = [] + # source_ids, source_mask, labels, target_mask = batch + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + labels = p.label_ids + for p, l in zip(preds, labels): + pred = tokenizer.decode(p, skip_special_tokens=True).strip() + label = tokenizer.decode(l, skip_special_tokens=True).strip() + if label2id: + # for classifaction task. + label = label2id[label] + if pred not in label2id: + # set to wrong label if the generated text not in the labal set. + pred = 0 + if label == 0: + pred = 1 + else: + pred = label2id[pred] + else: + # for regression task. + label = float(label.replace(" ", "")) + try: + pred = float(pred.replace(" ", "")) + except Exception as e: + # set to zero if the generated text can not convert to float + pred = 0.0 + print(e) + + all_preds.append(pred) + all_labels.append(label) + + all_preds = paddle.to_tensor(all_preds).detach() + all_labels = paddle.to_tensor(all_labels).detach() + + results = {} + for metric in metric_list: + results.update(metric(all_labels, all_preds)) + + return results + + training_args.predict_with_generate = True + trainer = Seq2SeqTrainer( + model=model, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=data_collator, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + # trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/run_glue.py b/examples/language_model/t5/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..94de77f467ed43981b357ea7e7e6fa3af9e92e74 --- /dev/null +++ b/examples/language_model/t5/run_glue.py @@ -0,0 +1,439 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os + +import paddle +from data import ( + GLUE_PROCESSED, + get_dev_dataloader, + get_mnli_dev_dataloader, + get_train_dataloader, +) +from paddle.amp import GradScaler, auto_cast +from paddle.optimizer import AdamW +from tqdm import tqdm +from utils import GLUE_METRICS, get_scheduler, get_writer, set_seed + +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument( + "--model_name_or_path", + default="t5-small", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--task_name", + default="sst-2", + type=str, + help="task_name.", + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=256, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--train_batch_size", + default=4, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for evaluating.", + ) + + parser.add_argument( + "--gradient_accumulation_steps", + default=1, + type=int, + help="gradient_accumulation_steps.", + ) + parser.add_argument( + "--learning_rate", + default=2e-5, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=4, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_train_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_radio", + default=0.1, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--warmup_steps", type=int, default=-1, help="warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=50, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--writer_type", + choices=["visualdl", "tensorboard"], + default="visualdl", + help="writer_type.", + ) + parser.add_argument( + "--scheduler_type", + choices=["linear", "cosine", "poly"], + default="linear", + type=str, + help="scheduler_type.", + ) + parser.add_argument("--use_amp", action="store_true", help="Enable mixed precision training.") + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + parser.add_argument( + "--num_workers", + type=int, + default=0, + help="num_workers.", + ) + parser.add_argument("--is_test", action="store_true", help="is_test.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/npu.", + ) + args = parser.parse_args() + args.task_name = args.task_name.lower() + args.logdir = os.path.join(args.output_dir, "logs") + os.makedirs("caches", exist_ok=True) + os.makedirs(args.logdir, exist_ok=True) + + return args + + +label_length_map = { + "cola": 4, + "sst-2": 1, + "mrpc": 5, + "sts-b": 5, + "qqp": 5, + "mnli": 4, + "qnli": 5, + "rte": 5, +} + +logger = logging.getLogger(__name__) + + +@paddle.no_grad() +def evaluate(model, data_loader, tokenizer, label2id, metric_list, generate_max_length=5): + model.eval() + all_preds = [] + all_labels = [] + + for batch in data_loader: + source_ids, source_mask, labels, target_mask = batch + outputs = model.generate( + input_ids=source_ids, + attention_mask=source_mask, + max_length=generate_max_length, + )[0] + + for p, l, m in zip(outputs.numpy(), labels.numpy(), target_mask.numpy()): + pred = tokenizer.decode(p, skip_special_tokens=True).strip() + label = tokenizer.decode(l[m.astype("bool")], skip_special_tokens=True).strip() + if label2id: + pred = label2id[pred] + label = label2id[label] + else: + pred = float(pred.replace(" ", "")) + label = float(label.replace(" ", "")) + + all_preds.append(pred) + all_labels.append(label) + + results = {} + for metric in metric_list: + results.update(metric(all_labels, all_preds)) + print(results) + return results + + +def main(args): + paddle.set_device(args.device) + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO, + handlers=[ + logging.FileHandler( + os.path.join(args.output_dir, "run.log"), + mode="w", + encoding="utf-8", + ) + ], + ) + logger.info("********** Configuration Arguments **********") + for arg, value in sorted(vars(args).items()): + logger.info(f"{arg}: {value}") + logger.info("**************************************************") + set_seed(args) + + # metric and label + label_name = GLUE_PROCESSED[args.task_name][1] + if label_name: + label2id = dict(zip(label_name, range(len(label_name)))) + else: + label2id = None + metric_list = GLUE_METRICS[args.task_name] + generate_max_length = 
label_length_map[args.task_name] + + writer = get_writer(args) + + # get model and tokenizer + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + tokenizer = T5Tokenizer.from_pretrained(args.model_name_or_path) + + # get dataloader + train_dataloader = get_train_dataloader(tokenizer, args) + if args.task_name == "mnli": + dev_dataloader_match = get_mnli_dev_dataloader(tokenizer, args, matched=True) + dev_dataloader_mismatch = get_mnli_dev_dataloader(tokenizer, args, matched=False) + else: + dev_dataloader = get_dev_dataloader(tokenizer, args) + + num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) + if args.max_train_steps > 0: + args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch) + else: + args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch + + # get lr_scheduler + lr_scheduler = get_scheduler( + learning_rate=args.learning_rate, + scheduler_type=args.scheduler_type, + num_warmup_steps=args.warmup_steps if args.warmup_steps > 0 else args.warmup_radio, + num_training_steps=args.max_train_steps, + ) + + total_batch_size = args.train_batch_size * args.gradient_accumulation_steps + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = GradScaler(init_loss_scaling=args.scale_loss) + + logger.info("********** Running training **********") + logger.info(f" Num examples = {len(train_dataloader.dataset)}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Instantaneous train batch size = {args.train_batch_size}") + logger.info(f" Instantaneous eval batch size = {args.eval_batch_size}") + logger.info(f" Total train batch size (w. 
accumulation) = {total_batch_size}") + logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") + logger.info(f" Total optimization steps = {args.max_train_steps}") + + progress_bar = tqdm(range(args.max_train_steps)) + + global_steps = 0 + tr_loss, logging_loss = 0.0, 0.0 + + for _ in range(args.num_train_epochs): + for step, batch in enumerate(train_dataloader): + model.train() + with auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax"]): + source_ids, source_mask, labels, target_mask = batch + outputs = model( + input_ids=source_ids, + attention_mask=source_mask, + labels=labels, + decoder_attention_mask=target_mask, + ) + loss = outputs[0] / args.gradient_accumulation_steps + tr_loss += loss.item() + + if args.use_amp: + scaler.scale(loss).backward() + else: + loss.backward() + + if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: + if args.use_amp: + scaler.minimize(optimizer, loss) + else: + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + progress_bar.update(1) + global_steps += 1 + + if args.logging_steps > 0 and global_steps % args.logging_steps == 0: + writer.add_scalar("lr", lr_scheduler.get_lr(), global_steps) + writer.add_scalar( + "loss", + (tr_loss - logging_loss) / args.logging_steps, + global_steps, + ) + logger.info( + "global_steps {} - lr: {:.10f} loss: {:.10f}".format( + global_steps, + lr_scheduler.get_lr(), + (tr_loss - logging_loss) / args.logging_steps, + ) + ) + logging_loss = tr_loss + + if args.save_steps > 0 and global_steps % args.save_steps == 0: + logger.info("********** Running evaluating **********") + logger.info(f"********** Step {global_steps} **********") + output_dir = os.path.join(args.output_dir, f"step-{global_steps}") + os.makedirs(output_dir, exist_ok=True) + + if args.task_name == "mnli": + matched_results = evaluate( + model, + dev_dataloader_match, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in matched_results.items(): + writer.add_scalar(f"eval/matched_{k}", v, global_steps) + logger.info(f" {k} = {v}") + mismatched_results = evaluate( + model, + dev_dataloader_mismatch, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in mismatched_results.items(): + writer.add_scalar(f"eval/mismatched_{k}", v, global_steps) + logger.info(f" {k} = {v}") + else: + eval_results = evaluate( + model, + dev_dataloader, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in eval_results.items(): + writer.add_scalar(f"eval/{k}", v, global_steps) + logger.info(f" {k} = {v}") + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + logger.info("********** Evaluating Done **********") + + if global_steps >= args.max_train_steps: + logger.info("********** Running evaluating **********") + logger.info(f"********** Step {global_steps} **********") + output_dir = os.path.join(args.output_dir, f"step-{global_steps}") + os.makedirs(output_dir, exist_ok=True) + + if args.task_name == "mnli": + matched_results = evaluate( + model, + dev_dataloader_match, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in matched_results.items(): + writer.add_scalar(f"eval/matched_{k}", v, global_steps) + logger.info(f" {k} = {v}") + mismatched_results = evaluate( + model, + dev_dataloader_mismatch, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in mismatched_results.items(): + writer.add_scalar(f"eval/mismatched_{k}", v, 
global_steps) + logger.info(f" {k} = {v}") + else: + eval_results = evaluate( + model, + dev_dataloader, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in eval_results.items(): + writer.add_scalar(f"eval/{k}", v, global_steps) + logger.info(f" {k} = {v}") + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + logger.info("********** Evaluating Done **********") + logger.info("********** Training Done **********") + return + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/examples/language_model/t5/run_glue_trainer.py b/examples/language_model/t5/run_glue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..4998aea7c553212d5d818d8e75fd8bf4eff074ab --- /dev/null +++ b/examples/language_model/t5/run_glue_trainer.py @@ -0,0 +1,394 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Any, Dict, List, Optional, Tuple, Union + +import paddle +import paddle.nn as nn +from data import GLUE_1_1_PROCESSED, GLUE_PROCESSED +from utils import GLUE_METRICS, load_pickle, save_pickle + +from paddlenlp.data import Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer +from paddlenlp.utils.log import logger + +label_length_map = { + "cola": 4, + "sst-2": 1, + "mrpc": 5, + "sts-b": 5, + "qqp": 5, + "mnli": 4, + "qnli": 5, + "rte": 5, +} + + +def trans_func(example, tokenizer, args): + task_name = args.task_name + PROCESSED = GLUE_PROCESSED + if "v1_1" in args.cache_dir: + PROCESSED = GLUE_1_1_PROCESSED + processed, label = PROCESSED[task_name] + if label: + id2label = dict(zip(range(len(label)), label)) + else: + id2label = None + + is_test = "labels" not in example + + if not is_test: + if id2label: + label_text = id2label[example["labels"]] + else: + label_text = str(example["labels"]) + target = tokenizer(label_text, return_token_type_ids=False, return_attention_mask=True) + + if len(processed) == 1: + text = processed[0] + example["sentence"] + else: + text = processed[0] + example["sentence1"] + processed[1] + example["sentence2"] + + source = tokenizer( + text, + max_seq_len=args.max_seq_length, + padding="max_length", + return_token_type_ids=False, + return_attention_mask=True, + ) + + if not is_test: + return { + "input_ids": source["input_ids"], + "attention_mask": source["attention_mask"], + "labels": target["input_ids"], + "decoder_attention_mask": target["attention_mask"], + } + else: + return {"input_ids": source["input_ids"], "attention_mask": source["attention_mask"]} + + +class BatchDict(object): + def __init__(self, fn): + assert isinstance(fn, (dict)), ( + "Input pattern not understood. 
The input of Dict must be a dict with key of input column name and value of collate_fn " + "Received fn=%s" % (str(fn)) + ) + + self._fn = fn + + for col_name, ele_fn in self._fn.items(): + assert callable(ele_fn), "Batchify functions must be callable! type(fn[%d]) = %s" % ( + col_name, + str(type(ele_fn)), + ) + + def __call__(self, data): + + ret = {} + if len(data) <= 0: + return ret + + for col_name, ele_fn in self._fn.items(): + # skip unused col_name, such as labels in test mode. + if col_name not in data[0].keys(): + continue + result = ele_fn([ele[col_name] for ele in data]) + ret[col_name] = result + + return ret + + +def get_train_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_train" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="train") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def get_dev_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_dev" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="dev") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def get_mnli_dev_dataset(tokenizer, args, matched=True): + if matched: + split = "dev_matched" + else: + split = "dev_mismatched" + filename = os.path.join(args.cache_dir, args.task_name + f"_{split}" + ".pkl") + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits=split) + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + task_name: str = field(default=None, metadata={"help": "The name of the task to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + cache_dir: str = field(default="./caches", metadata={"help": "cache dir for datasets."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
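+
+ Note: `model_name_or_path` may be a local checkpoint directory or a built-in identifier such
+ as "t5-small"; when the path contains "v1_1", `main` switches the cache dir and uses the
+ GLUE_1_1_PROCESSED label words instead. `export_model_dir`, when set, is the directory for
+ the exported inference model.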
+ """ + + model_name_or_path: str = field( + default="t5-small", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +class T5GlueTrainer(Trainer): + def __init__(self, do_generation: bool, label2id, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + self.label2id = label2id + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if not self.do_generation: + return super().prediction_step( + model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys + ) + + all_preds = [] + all_labels = [] + # source_ids, source_mask, labels, target_mask = batch + labels = inputs["labels"] + target_mask = inputs["decoder_attention_mask"] + + with paddle.no_grad(): + outputs = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=5, + )[0] + + for p, l, m in zip(outputs.numpy(), labels.numpy(), target_mask.numpy()): + pred = self.tokenizer.decode(p, skip_special_tokens=True).strip() + label = self.tokenizer.decode(l[m.astype("bool")], skip_special_tokens=True).strip() + + if self.label2id: + # for classifaction task. + label = self.label2id[label] + if pred not in self.label2id: + # set to wrong label if the generated text not in the labal set. + pred = 0 + if label == 0: + pred = 1 + else: + pred = self.label2id[pred] + else: + # for regression task. + label = float(label.replace(" ", "")) + try: + pred = float(pred.replace(" ", "")) + except Exception: + # set to zero if the generated text can not convert to float + pred = 0.0 + + all_preds.append(pred) + all_labels.append(label) + + all_preds = paddle.to_tensor(all_preds).detach() + all_labels = paddle.to_tensor(all_labels).detach() + + return (None, all_preds, all_labels) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if "v1_1" in model_args.model_name_or_path: + data_args.cache_dir = "./caches_v1_1" + if not os.path.exists(data_args.cache_dir): + os.mkdir(data_args.cache_dir) + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + PROCESSED = GLUE_PROCESSED + if "v1_1" in data_args.cache_dir: + PROCESSED = GLUE_1_1_PROCESSED + label_name = PROCESSED[data_args.task_name][1] + if label_name: + label2id = dict(zip(label_name, range(len(label_name)))) + else: + label2id = None + metric_list = GLUE_METRICS[data_args.task_name] + + # get model and tokenizer + model = T5ForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path) + + # get dataloader + train_dataset = get_train_dataset(tokenizer, data_args) + if data_args.task_name == "mnli": + eval_dataset = get_mnli_dev_dataset(tokenizer, data_args, matched=True) + else: + eval_dataset = get_dev_dataset(tokenizer, data_args) + + batchify_fn = lambda samples, fn=BatchDict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + "attention_mask": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + "labels": Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + "decoder_attention_mask": Pad( + axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # decoder_attention_mask + } + ): fn(samples) + data_collator = batchify_fn + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + results = {} + for metric in metric_list: + results.update(metric(p.label_ids, preds)) + + return results + + trainer = T5GlueTrainer( + model=model, + criterion=None, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + do_generation=True, + label2id=label2id, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + # trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/t5_dataset.py b/examples/language_model/t5/t5_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..e173bd35cd736a3018cc05e64abd6de3e9678a51 --- /dev/null +++ b/examples/language_model/t5/t5_dataset.py @@ -0,0 +1,340 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""T5 Style dataset.""" + +import collections +import copy + +import numpy as np +import paddle +from dataset_utils import create_masked_lm_predictions, get_samples_mapping + + +class T5Dataset(paddle.io.Dataset): + def __init__( + self, + name, + tokenizer, + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + masked_lm_prob, + max_seq_length, + max_seq_length_dec, + short_seq_prob, + seed, + binary_head=False, + share_folder=False, + args=None, + ): + + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + self.max_seq_length_dec = max_seq_length_dec + self.binary_head = binary_head + self.share_folder = share_folder + self.args = args + # Dataset. + self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping( + self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + self.max_seq_length - 2, # account for added tokens + short_seq_prob, + self.seed, + self.name, + self.binary_head, + self.share_folder, + ) + # Vocab stuff. + self.vocab_id_list = list(tokenizer.get_vocab().values()) + self.vocab_id_to_token_dict = copy.deepcopy( + {tokenizer.convert_tokens_to_ids(key): key for key, _ in tokenizer.get_vocab().items()} + ) + self.vocab_token_to_id_dict = copy.deepcopy(tokenizer.get_vocab()) + + # T5 is chinese char level model, sometime is need + # add ## chinse char to encode and decode. + # Here we extend the vocab dict. + self.vocab_id_to_token_dict.update(tokenizer.added_tokens_decoder) + self.vocab_token_to_id_dict.update(tokenizer.added_tokens_encoder) + + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + + self.bos_id = tokenizer.bos_token_id + self.eos_id = tokenizer.eos_token_id + + self.sentinel_tokens = tokenizer.additional_special_tokens_ids + assert len(self.sentinel_tokens) > 0, "Provide the argument --vocab-extra-ids 100 to the script" + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + + start_index, end_index, seq_length = self.samples_mapping[idx] + sample = [] + for index in range(start_index, end_index): + sample.append(self.indexed_dataset[index]) + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return build_training_sample( + sample, + seq_length, + self.max_seq_length, # needed for padding + self.max_seq_length_dec, + self.vocab_id_list, + self.vocab_id_to_token_dict, + self.cls_id, + self.sep_id, + self.mask_id, + self.pad_id, + self.masked_lm_prob, + np_rng, + self.bos_id, + self.eos_id, + self.sentinel_tokens, + ) + + +def build_training_sample( + sample, + target_seq_length, + max_seq_length, + max_seq_length_dec, + vocab_id_list, + vocab_id_to_token_dict, + cls_id, + sep_id, + mask_id, + pad_id, + masked_lm_prob, + np_rng, + bos_id=None, + eos_id=None, + sentinel_tokens=None, +): + """Build training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + target_seq_length: Desired sequence length. + max_seq_length: Maximum length of the sequence. All values are padded to + this length. + vocab_id_list: List of vocabulary ids. 
Used to pick a random id. + vocab_id_to_token_dict: A dictionary from vocab ids to text tokens. + cls_id: Start of example id. + sep_id: Separator id. + mask_id: Mask token id. + pad_id: Padding token id. + masked_lm_prob: Probability to mask tokens. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. + bos_id: start of decoder example id + eos_id: end of generation id + sentinel_tokens: unique value to be substituted for every replaced span + """ + + assert target_seq_length <= max_seq_length + + # flatten sentences into one list + tokens = [token for sentence in sample for token in sentence] + + # Truncate to `target_sequence_length`. + max_num_tokens = target_seq_length + truncated = len(tokens) > max_num_tokens + tokens = tokens[:max_num_tokens] + + # Masking. + max_predictions_per_seq = masked_lm_prob * max_num_tokens + (tokens, masked_positions, masked_labels, _, masked_spans) = create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + max_ngrams=10, + geometric_dist=True, + masking_style="t5", + ) + + # Padding. + tokens_enc, tokens_dec_in, labels, enc_mask, dec_mask, enc_dec_mask, loss_mask = pad_and_convert_to_numpy( + tokens, + masked_positions, + masked_labels, + pad_id, + max_seq_length, + max_seq_length_dec, + masked_spans, + bos_id, + eos_id, + sentinel_tokens, + ) + + train_sample = { + "text_enc": tokens_enc, + "text_dec": tokens_dec_in, + "labels": labels, + "loss_mask": loss_mask, + "truncated": int(truncated), + "enc_mask": enc_mask, + "dec_mask": dec_mask, + "enc_dec_mask": enc_dec_mask, + } + return train_sample + + +def pad_and_convert_to_numpy( + tokens, + masked_positions, + masked_labels, + pad_id, + max_seq_length, + max_seq_length_dec, + masked_spans=None, + bos_id=None, + eos_id=None, + sentinel_tokens=None, +): + """Pad sequences and convert them to numpy.""" + + sentinel_tokens = collections.deque(sentinel_tokens) + t5_input = [] + (t5_decoder_in, t5_decoder_out) = ([bos_id], []) + (start_index, end_index) = (0, None) + for span in masked_spans: + flag = sentinel_tokens.popleft() + + # Append the same tokens in decoder input and output + t5_decoder_in.append(flag) + t5_decoder_in.extend(span.label) + t5_decoder_out.append(flag) + t5_decoder_out.extend(span.label) + + end_index = span.index[0] + t5_input.extend(tokens[start_index:end_index]) + t5_input.append(flag) + + # the next start index is the token after the last span token + start_index = span.index[-1] + 1 + + # Add token to the t5_decoder_out + t5_decoder_out.append(eos_id) + + # Add the remaining tokens to the t5 input + t5_input.extend(tokens[start_index:]) + + # assert (len(t5_input) - len(masked_spans)) + \ + # (len(t5_decoder_in) - (len(masked_spans) + 1)) == len(tokens) + + # Some checks. + + # Encoder-side padding mask. + num_tokens = len(t5_input) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(masked_positions) == len(masked_labels) + + # Tokens.. + filler = [pad_id] * padding_length + tokens_enc = np.array(t5_input + filler, dtype=np.int64) + + # Decoder-side padding mask. 
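+ # Layout reminder: t5_decoder_in is [bos_id, sentinel_0, span_0 tokens, sentinel_1, span_1 tokens, ...],
+ # while t5_decoder_out is the same sentinel/span sequence without bos_id and with eos_id appended.
+ # Both are padded to max_seq_length_dec below; labels are padded with -100 (the conventional
+ # ignore index), and loss_mask marks the real decoder positions.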
+ num_tokens_dec = len(t5_decoder_in) + + padding_length_dec = max_seq_length_dec - num_tokens_dec + assert padding_length_dec >= 0 + filler_dec = [pad_id] * padding_length_dec + # print(t5_decoder_in, filler_dec, pad_id) + tokens_dec_in = np.array(t5_decoder_in + filler_dec, dtype=np.int64) + + # Create attention masks + enc_mask = make_attention_mask(tokens_enc, tokens_enc) + enc_dec_mask = make_attention_mask(tokens_dec_in, tokens_enc) + dec_mask = make_attention_mask(tokens_dec_in, tokens_dec_in) + dec_mask = dec_mask * make_history_mask(tokens_dec_in) + + # Labels mask. + labels = t5_decoder_out + ([-100] * padding_length_dec) + labels = np.array(labels, dtype=np.int64) + + # Loss mask + loss_mask = ([1] * num_tokens_dec) + ([pad_id] * padding_length_dec) + loss_mask = np.array(loss_mask, dtype=np.int64) + + return tokens_enc, tokens_dec_in, labels, enc_mask, dec_mask, enc_dec_mask, loss_mask + + +def make_attention_mask(source_block, target_block): + """ + Returns a 2-dimensional (2-D) attention mask + :param source_block: 1-D array + :param target_block: 1-D array + """ + mask = (target_block[None, :] >= 1) * (source_block[:, None] >= 1) + mask = mask.astype(np.int64) + # (source_length, target_length) + return mask + + +def make_attention_mask_3d(source_block, target_block): + """ + Returns a 3-dimensional (3-D) attention mask + :param source_block: 1-D array + :param target_block: 1-D array + """ + mask = (target_block[:, None, :] >= 1) * (source_block[:, :, None] >= 1) + # (batch, source_length, target_length) + # mask = mask.astype(np.int64) + return mask + + +def make_history_mask(block): + length = block.shape[0] + arange = np.arange(length) + history_mask = ( + arange[ + None, + ] + <= arange[:, None] + ) + history_mask = history_mask.astype(np.int64) + return history_mask + + +def make_history_mask_3d(block): + batch, length = block.shape + arange = paddle.arange(length, device=block.device) + history_mask = (arange[None, :] <= arange[:, None])[None, :] + history_mask = history_mask.expand(batch, length, length) + return history_mask diff --git a/examples/language_model/t5/t5_run_pretrain_trainer.py b/examples/language_model/t5/t5_run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..01e2abfa5c0d2710ca3ef1e3bca42118f4df2d32 --- /dev/null +++ b/examples/language_model/t5/t5_run_pretrain_trainer.py @@ -0,0 +1,436 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +T5 pretraining scripts. 
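+
+An illustrative launch (flag names come from the DataArguments/ModelArguments/PreTrainingArguments
+dataclasses below plus the standard PaddleNLP TrainingArguments; paths are placeholders):
+
+    python t5_run_pretrain_trainer.py \
+        --model_name_or_path t5-small \
+        --input_dir ./preprocessed_data \
+        --output_dir ./t5_pretrain_ckpts \
+        --max_seq_length 512 \
+        --max_seq_length_dec 128 \
+        --max_steps 100000 \
+        --do_train --do_eval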
+""" +import math +import os +import random +import time +from dataclasses import dataclass, field + +# from turtle import shape +from typing import Optional + +import numpy as np +import paddle +from dataset_utils import build_train_valid_test_datasets + +from paddlenlp.data import Stack +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( + LinearAnnealingWithWarmupDecay, + T5Config, + T5ForConditionalGeneration, + T5Tokenizer, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "t5": (T5Config, T5ForConditionalGeneration, T5Tokenizer), +} + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + max_seq_length_dec: int = field( + default=128, + metadata={ + "help": "The maximum total output sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + masked_lm_prob: float = field( + default=0.15, + metadata={"help": "Mask token prob."}, + ) + short_seq_prob: float = field( + default=0.1, + metadata={"help": "Short sequence prob."}, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + favor_longer_ngram: bool = field( + default=False, + metadata={"help": "Whether to favor long ngrams"}, + ) + max_ngrams: int = field( + default=3, + metadata={"help": "Max N Grams"}, + ) + data_impl: str = field( + default="mmap", + metadata={"help": "mmap/lazy format converted from preprocessed data."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. 
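+
+ Note: `model_type` selects an entry of MODEL_CLASSES (only "t5" is registered here), and
+ `tokenizer_name_or_path` falls back to `model_name_or_path` when left unset (see `main`);
+ the pretraining datasets in this script are always built with `binary_head=False`.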
+ """ + + model_type: Optional[str] = field(default="t5", metadata={"help": "Only support for ernie pre-training for now."}) + model_name_or_path: str = field( + default="t5-small", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + binary_head: Optional[bool] = field(default=False, metadata={"help": "True for NSP task."}) + hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) + attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention probs dropout prob."}) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + binary_head=False, +): + + train_valid_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.world_size * training_args.test_iters, + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=data_args, + tokenizer=tokenizer, + splits_string=data_args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=data_args.max_seq_length, + masked_lm_prob=data_args.masked_lm_prob, + short_seq_prob=data_args.short_seq_prob, + max_seq_length_dec=data_args.max_seq_length_dec, + seed=training_args.seed, + skip_warmup=True, + binary_head=False, + dataset_type="t5", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + text_enc, text_dec = data["text_enc"], data["text_dec"] + if tokenizer.pad_token_id in text_enc: + text_enc = text_enc[0 : list(text_enc).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(text_enc)) + if tokenizer.pad_token_id in text_dec: + text_dec = text_dec[0 : list(text_dec).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(text_dec)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + # print("Line 200", data[0]) + # num_fields = len(data[0]) + num_fields = len(data[0].keys()) + out = [None] * num_fields + + # text_enc, text_dec, labels, loss_mask, truncated, enc_mask, dec_mask, enc_dec_mask + + for i in range(num_fields): + out[i] = stack_fn([list(x.values())[i] for x in data]) + + return { + "input_ids": out[0], + "decoder_input_ids": out[1], + "labels": out[2], + "attention_mask": out[5], + "decoder_attention_mask": out[6], + } + + return train_ds, valid_ds, test_ds, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... 
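+ # e.g. --input_dir "0.3 /path/to/wiki_t5 0.7 /path/to/books_t5" (paths are placeholders) is
+ # interpreted as alternating (weight, data-prefix) pairs and passed to the dataset builder as-is.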
+ return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, 
"Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + # if model_args.binary_head is False: + # model_class = ErnieForMaskedLM + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if model_args.model_name_or_path in pretrained_models_list: + logger.warning(f"Your model {model_args.model_name_or_path} is training from scratch !!!") + model_config = model_class.pretrained_init_configuration[model_args.model_name_or_path] + model_config["hidden_dropout_prob"] = model_args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = model_args.attention_probs_dropout_prob + model = model_class(config_class(**model_config)) + # model_config["enable_recompute"] = args.use_recompute + else: + logger.warning(f"Your model is continue training from {model_args.model_name_or_path}") + model = model_class.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.hidden_dropout_prob, + attention_probs_dropout_prob=model_args.attention_probs_dropout_prob, + ) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + training_args.learning_rate, + training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + ) + + data_file = get_train_data_file(data_args) + tokenizer = tokenizer_class.from_pretrained( + model_args.tokenizer_name_or_path, cls_token="[CLS]", bos_token="", mask_token="[MASK]", sep_token="[SEP]" + ) + + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer, False + ) + + trainer = PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + 
trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/tests/t5_mp.py b/examples/language_model/t5/tests/t5_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..aab854c3475ed2137bf25b9de9a479dff55e825e --- /dev/null +++ b/examples/language_model/t5/tests/t5_mp.py @@ -0,0 +1,101 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import tempfile + +import numpy as np +import paddle + +from paddlenlp.transformers import T5Model + +T5Model._init_weights = lambda *_: None + + +def main(): + world_size = paddle.distributed.get_world_size() + dp_degree = 2 if world_size >= 4 else 1 + tensor_parallel_degree = world_size // dp_degree + + strategy = paddle.distributed.fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": dp_degree, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + paddle.distributed.fleet.init(is_collective=True, strategy=strategy) + + hcg = paddle.distributed.fleet.get_hybrid_communicate_group() + mp_group = hcg.get_model_parallel_group() + tensor_parallel_rank = mp_group.rank + model = T5Model.from_pretrained( + "t5-small", + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + dtype="float32", + low_cpu_mem_usage=True, + ) + model.eval() + loss = model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.136544, rtol=1e-4) + + with tempfile.TemporaryDirectory() as tempdir: + model.save_pretrained(save_dir=tempdir, merge_tensor_parallel=False) + paddle.distributed.barrier() + load_model = T5Model.from_pretrained( + tempdir, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + dtype="float32", + low_cpu_mem_usage=True, + ) + load_model.eval() + loss = load_model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.136544, rtol=1e-4) + + with tempfile.TemporaryDirectory() as tempdir: + object_list = [] + paddle.distributed.all_gather_object(object_list, tempdir, group=mp_group) + tempdir = object_list[0] + model.save_pretrained(save_dir=tempdir, merge_tensor_parallel=True) + paddle.distributed.barrier() + load_model = T5Model.from_pretrained( + tempdir, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + dtype="float32", + low_cpu_mem_usage=True, 
+ ) + load_model.eval() + loss = load_model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.136544, rtol=1e-4) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/tests/test_parallel_dygraph_dataparallel.py b/examples/language_model/t5/tests/test_parallel_dygraph_dataparallel.py new file mode 100644 index 0000000000000000000000000000000000000000..3cff3233848f3eaec506c346525b0bbe60ac16c2 --- /dev/null +++ b/examples/language_model/t5/tests/test_parallel_dygraph_dataparallel.py @@ -0,0 +1,227 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# + +import copy +import os +import subprocess +import time +import unittest + +import paddle +from paddle.distributed.utils.launch_utils import ( + TrainerProc, + find_free_ports, + get_cluster, + watch_local_trainers, +) + + +def get_cluster_from_args(selected_gpus): + cluster_node_ips = "127.0.0.1" + node_ip = "127.0.0.1" + + node_ips = [x.strip() for x in cluster_node_ips.split(",")] + + node_ips.index(node_ip) + + free_ports = None + + free_ports = find_free_ports(len(selected_gpus)) + if free_ports is not None: + free_ports = list(free_ports) + + trainer_endpoints = [] + for ip in node_ips: + trainer_endpoints.append(["%s:%d" % (ip, port) for port in free_ports]) + return get_cluster(node_ips, node_ip, trainer_endpoints, selected_gpus) + + +def get_gpus(selected_gpus): + selected_gpus = [x.strip() for x in selected_gpus.split(",")] + return selected_gpus + + +def start_local_trainers_cpu(trainer_endpoints, training_script, training_script_args, log_dir=None): + current_env = copy.copy(os.environ.copy()) + current_env.pop("http_proxy", None) + current_env.pop("https_proxy", None) + + procs = [] + n_rank = len(trainer_endpoints) + print(trainer_endpoints) + for rank_id, endpoint in enumerate(trainer_endpoints): + proc_env = { + "PADDLE_DISTRI_BACKEND": "gloo", + "PADDLE_TRAINER_ID": "%d" % rank_id, + "PADDLE_CURRENT_ENDPOINT": "%s" % endpoint, + "PADDLE_TRAINERS_NUM": "%d" % n_rank, + "PADDLE_TRAINER_ENDPOINTS": ",".join(trainer_endpoints), + } + + current_env.update(proc_env) + + print("trainer proc env:{}".format(current_env)) + + assert os.getenv("WITH_COVERAGE", "OFF") == "OFF", "Gloo don't support WITH_COVERAGE." 
+ cmd = "python -u " + training_script + + print("start trainer proc:{} env:{}".format(cmd, proc_env)) + + fn = None + + proc = subprocess.Popen(cmd.split(" "), env=current_env) + + tp = TrainerProc() + tp.proc = proc + tp.rank = rank_id + tp.log_fn = fn + tp.cmd = cmd + + procs.append(tp) + + return procs + + +def start_local_trainers( + cluster, + pod, + training_script, + training_script_args, + eager_mode=True, + allocator_strategy="auto_growth", + log_dir=None, + without_http_proxy=True, +): + current_env = copy.copy(os.environ.copy()) + # paddle broadcast ncclUniqueId use socket, and + # proxy maybe make trainers unreachable, so delete them. + # if we set them to "", grpc will log error message "bad uri" + # so just delete them. + + # current_env.pop("http_proxy", None) + # current_env.pop("https_proxy", None) + + procs = [] + for t in pod.trainers: + proc_env = { + "FLAGS_selected_gpus": "%s" % ",".join([str(g) for g in t.gpus]), + "PADDLE_TRAINER_ID": "%d" % t.rank, + "PADDLE_CURRENT_ENDPOINT": "%s" % t.endpoint, + "PADDLE_TRAINERS_NUM": "%d" % cluster.trainers_nranks(), + "PADDLE_TRAINER_ENDPOINTS": ",".join(cluster.trainers_endpoints()), + } + + proc_env["FLAGS_allocator_strategy"] = allocator_strategy + if allocator_strategy == "auto_growth": + proc_env["FLAGS_fraction_of_gpu_memory_to_use"] = "0.1" + + current_env.update(proc_env) + + print("trainer proc env:{}".format(current_env)) + + if os.getenv("WITH_COVERAGE", "OFF") == "ON": + cmd = "python -m coverage run --branch -p " + training_script + else: + cmd = "python -u " + training_script + + print("start trainer proc:{} env:{}".format(cmd, proc_env)) + + fn = None + + proc = subprocess.Popen(cmd.split(" "), env=current_env) + + tp = TrainerProc() + tp.proc = proc + tp.rank = t.rank + tp.log_fn = fn + tp.cmd = cmd + + procs.append(tp) + + return procs + + +class TestMultipleGpus(unittest.TestCase): + def setUp(self): + self.selected_gpus = get_gpus("0,1") + + def run_1gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0") + self.run_n_gpu(*args, **kwargs) + + def run_2gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0,1") + self.run_n_gpu(*args, **kwargs) + + def run_4gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0,1,2,3") + self.run_n_gpu(*args, **kwargs) + + def run_8gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0,1,2,3,4,5,6,7") + self.run_n_gpu(*args, **kwargs) + + def run_n_gpu( + self, + target_file_name, + eager_mode=True, + allocator_strategy="auto_growth", + ): + if not paddle.framework.core.is_compiled_with_cuda() or paddle.framework.core.get_cuda_device_count() == 0: + return + + # selected_gpus = get_gpus("0,1") + cluster = None + pod = None + + cluster, pod = get_cluster_from_args(self.selected_gpus) + + procs = start_local_trainers( + cluster, + pod, + eager_mode=eager_mode, + allocator_strategy=allocator_strategy, + training_script=target_file_name, + training_script_args=[], + ) + + while True: + alive = watch_local_trainers(procs, cluster.trainers_endpoints()) + + if not alive: + print("Local procs complete, POD info:{}".format(pod)) + break + time.sleep(3) + + +class TestMultipleWithGloo(unittest.TestCase): + def run_2cpu(self, target_file_name): + + cluster, pod = get_cluster_from_args([0, 1]) # tmp use. 
for getting trainer_nranks() + + procs = start_local_trainers_cpu( + cluster.trainers_endpoints(), + training_script=target_file_name, + training_script_args=[], + ) + + while True: + alive = watch_local_trainers(procs, cluster.trainers_nranks()) + + if not alive: + print("Local procs complete, POD info:{}".format(pod)) + break + time.sleep(3) diff --git a/examples/language_model/t5/tests/test_t5_mp.py b/examples/language_model/t5/tests/test_t5_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..00777f916126ec81553220cc49a2d138c9b62f17 --- /dev/null +++ b/examples/language_model/t5/tests/test_t5_mp.py @@ -0,0 +1,122 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# import sys +import unittest + +import numpy as np +import paddle +import torch +from test_parallel_dygraph_dataparallel import TestMultipleGpus + +import paddlenlp + + +def load_torch(path, *args, **kwargs): + import torch + + state = torch.load(path, map_location="cpu") + for key in list(state.keys()): + v = state.pop(key) + state[key] = v.numpy() + return state + + +# hack load torch, it has problem to load torch ckpt. +paddlenlp.utils.serialization.load_torch = load_torch +paddlenlp.transformers.conversion_utils.load_torch = load_torch + + +class TestT5(unittest.TestCase): + def testTorchT5(self): + from transformers import AutoModel + + model = AutoModel.from_pretrained("t5-small", trust_remote_code=True) + model.eval() + loss = model( + input_ids=torch.arange(100, 110, dtype=torch.long).reshape(1, -1), + decoder_input_ids=torch.arange(100, 105, dtype=torch.long).reshape(1, -1), + ) + ret = loss.last_hidden_state.abs().mean().item() + # Torch T5 has bug in GELU activation + np.testing.assert_allclose(ret, 0.1365441530942917, rtol=1e-7) + + def testConvertedPaddleT5(self): + from paddlenlp.transformers import AutoModel + + model = AutoModel.from_pretrained("t5-small", from_hf_hub=True) + model.eval() + loss = model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.1365441381931305, rtol=1e-7) + + @unittest.skip("Skip export!") + def testPaddleT5(self): + from paddlenlp.transformers import T5Model + + model = T5Model.from_pretrained("t5-small", dtype="float32") + model.eval() + loss = model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.1365441381931305, rtol=1e-7) + + # # dy2static + # input_spec = [ + # paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + # paddle.static.InputSpec(shape=[None, 2, None], dtype="int64"), # pos_ids + # paddle.static.InputSpec(shape=[None, None, None, None], dtype="int64"), # 
attn_ids + # ] + # with tempfile.TemporaryDirectory() as tempdir: + # paddlenlp.transformers.export_model( + # model=model, + # input_spec=input_spec, + # path=tempdir, + # ) + + # TODO: support @ decorate for multi-gpus tests + @unittest.skip("Skip for reuqired multi-gpus!") + def testPaddleTensorParallelT5(self): + """_summary_""" + from modeling import T5Model as AutoModel + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + strategy = paddle.distributed.fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + paddle.distributed.fleet.init(is_collective=True, strategy=strategy) + model = AutoModel.from_pretrained( + "t5-small", + from_hf=True, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + model.eval() + + +class TestT5TensorParallel(TestMultipleGpus): + def testPaddleTensorParallelT5(self): + self.run_4gpu("t5_mp.py") diff --git a/examples/language_model/t5/utils.py b/examples/language_model/t5/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..303f70af2a71d0aad8a773ea0a961034fe19e126 --- /dev/null +++ b/examples/language_model/t5/utils.py @@ -0,0 +1,162 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
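# --- Editorial note (not part of the original patch) ---
# A minimal, commented-out usage sketch of the helpers defined below in this
# module; it is kept as comments because the definitions only appear further
# down the file. The task name "mrpc", the toy labels and the step counts are
# illustrative assumptions.
#
#   from utils import GLUE_METRICS, get_scheduler
#
#   # Each GLUE task maps to a list of metric functions taking (targets, predictions).
#   results = {}
#   for metric_fn in GLUE_METRICS["mrpc"]:          # [f1_score_with_invalid, accuracy]
#       results.update(metric_fn([1, 0, 1, 1], [1, 0, 0, 1]))
#   # -> {"f1": ..., "accuracy": ...}, both reported on a 0-100 scale
#
#   # Build a warmup + decay LR scheduler by name ("linear", "cosine" or "poly").
#   lr_scheduler = get_scheduler(
#       learning_rate=3e-5,
#       scheduler_type="linear",
#       num_warmup_steps=100,
#       num_training_steps=1000,
#   )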
+ +import collections +import json +import pickle +import random + +import numpy as np +import paddle +import sklearn +from scipy.stats import pearsonr, spearmanr +from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef + +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + + +def accuracy(targets, predictions): + return {"accuracy": 100 * accuracy_score(targets, predictions)} + + +def sklearn_metrics_wrapper(metric_str, metric_dict_str=None, metric_post_process_fn=None, **metric_fn_kwargs): + def fn(targets, predictions): + if metric_str == "matthews_corrcoef": + metric_fn = matthews_corrcoef + else: + metric_fn = getattr(sklearn.metrics, metric_str) + metric_val = metric_fn(targets, predictions, **metric_fn_kwargs) + if metric_post_process_fn is not None: + metric_val = metric_post_process_fn(metric_val) + return {metric_dict_str or metric_str: metric_val} + + return fn + + +def f1_score_with_invalid(targets, predictions): + targets, predictions = np.asarray(targets), np.asarray(predictions) + invalid_idx_mask = np.logical_and(predictions != 0, predictions != 1) + predictions[invalid_idx_mask] = 1 - targets[invalid_idx_mask] + return {"f1": 100 * f1_score(targets, predictions)} + + +def pearson_corrcoef(targets, predictions): + return {"pearson_corrcoef": 100 * pearsonr(targets, predictions)[0]} + + +def spearman_corrcoef(targets, predictions): + return {"spearman_corrcoef": 100 * spearmanr(targets, predictions)[0]} + + +CLUE_METRICS = collections.OrderedDict( + [ + ("afqmc", [accuracy]), + ("tnews", [accuracy]), + ("iflytek", [accuracy]), + ("cmnli", [accuracy]), + ("ocnli", [accuracy]), + ("cluewsc2020", [accuracy]), + ("csl", [accuracy]), + ("ax", []), # Only test set available. + ] +) + +GLUE_METRICS = collections.OrderedDict( + [ + ( + "cola", + [sklearn_metrics_wrapper("matthews_corrcoef", metric_post_process_fn=lambda x: 100 * x)], + ), + ("sst-2", [accuracy]), + ("mrpc", [f1_score_with_invalid, accuracy]), + ("sts-b", [pearson_corrcoef, spearman_corrcoef]), + ("qqp", [f1_score_with_invalid, accuracy]), + ("mnli", [accuracy]), + ("qnli", [accuracy]), + ("rte", [accuracy]), + ("wnli", [accuracy]), + ("ax", []), # Only test set available. 
+ ] +) + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "poly": PolyDecayWithWarmup, +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_writer(args): + if args.writer_type == "visualdl": + from visualdl import LogWriter + + writer = LogWriter(logdir=args.logdir) + elif args.writer_type == "tensorboard": + from tensorboardX import SummaryWriter + + writer = SummaryWriter(logdir=args.logdir) + else: + raise ValueError("writer_type must be in ['visualdl', 'tensorboard']") + return writer + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def save_json(data, file_name): + with open(file_name, "w", encoding="utf-8") as w: + w.write(json.dumps(data, ensure_ascii=False, indent=4) + "\n") + + +def save_pickle(data, file_path): + with open(str(file_path), "wb") as f: + pickle.dump(data, f) + + +def load_pickle(input_file): + with open(str(input_file), "rb") as f: + data = pickle.load(f) + return data diff --git a/examples/language_model/t5/zero_shot_demo.py b/examples/language_model/t5/zero_shot_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..1f255c2585290e715e60d13ad13c0baf1df30a66 --- /dev/null +++ b/examples/language_model/t5/zero_shot_demo.py @@ -0,0 +1,210 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2022 Langboat Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
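# --- Editorial note (not part of the original patch) ---
# The prompt builders defined below simply wrap the raw input into a
# task-specific Chinese template. A commented-out illustration, using the same
# sentences as the __main__ demo further down:
#
#   create_input_with_prompt("text_similarity", "你好，我还款银行怎么更换", "怎么更换绑定还款的卡")
#   # -> ["“你好，我还款银行怎么更换”和“怎么更换绑定还款的卡”这两句话是在说同一件事吗？"]
#
# Demo.generate() then tokenizes the prompt, calls model.generate(), decodes
# every returned sequence and keeps the most frequent answer via
# collections.Counter.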
+""" +https://github.com/Langboat/mengzi-zero-shot +""" + +import paddle +from collections import Counter +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +def task_type_map(task_type): + task_map = { + "sentiment_classifier": sentiment_cls, + "news_classifier": news_cls, + "medical_domain_intent_classifier": domain_cls, + "entity_extraction": entity_extr, + "text_similarity": text_sim, + "financial_relationship_extraction": finance_extr, + "ad_generation": ad_gen, + "comment_object_extraction": com_obj_extr, + } + + return task_map[task_type] + + +def create_input_with_prompt(task_type, input_string, input_string2=None, entity1=None, entity2=None): + prompt_map = task_type_map(task_type) + + if task_type == "text_similarity": + return prompt_map(input_string, input_string2) + elif task_type == "financial_relationship_extraction": + return prompt_map(input_string, entity1, entity2) + return prompt_map(input_string) + + +def entity_extr( + s, +): + """ + dataset: CLUENER + task: 实体抽取 + output: + """ + prompts = [f"“{s}”找出上述句子中的实体和他们对应的类别"] + return prompts + + +def text_sim(s1, s2): + """ + dataset: + task: 语义相似度 + output: + """ + prompts = [f"“{s1}”和“{s2}”这两句话是在说同一件事吗?"] + return prompts + + +def finance_extr(s, e1, e2): + """ + dataset: + task: 金融关系抽取 + output: + """ + prompts = [f"“{s}”中的“{e1}”和“{e2}”是什么关系?答:"] + return prompts + + +def ad_gen(s): + """ + dataset: + task: 广告文案生成 + output: + """ + prompts = [f"请根据以下产品信息设计广告文案。商品信息:{s}"] + return prompts + + +def domain_cls(s): + """ + dataset: + task: 医学领域意图分类 + output: + """ + # dataset: quake-qic + prompts = [f"问题:“{s}”。此问题的医学意图是什么?选项:病情诊断,病因分析,治疗方案,就医建议,指标解读,疾病描述,后果表述,注意事项,功效作用,医疗费用。"] + return prompts + + +def sentiment_cls(s): + """ + dataset: eprstmt + task: 评论情感分类 + output: 消极/积极 + """ + prompts = [f"评论:{s}。请判断该条评论所属类别(积极或消极)并填至空格处。回答:"] + # f'"{s}"。 如果这个评论的作者是客观的,那么请问这个评论的内容是什么态度的回答?答:', + # f'现有机器人能判断句子是消极评论还是积极评论。已知句子:“{s}”。这个机器人将给出的答案是:' + return prompts + + +def com_obj_extr(s): + """ + dataset: + task: 评论对象抽取 + output: + """ + prompts = [f"评论:{s}.这条评论的评价对象是谁?"] + return prompts + + +def news_cls(s): + """ + dataset: tnews + task: 新闻分类 + output: + """ + label_list = ["故事", "文化", "娱乐", "体育", "财经", "房产", "汽车", "教育", "科技", "军事", "旅游", "国际", "股票", "农业", "电竞"] + + prompts = [ + f'“{s}”是什么新闻频道写的?选项:{",".join(label_list)}。答:', + ] + # f'这条新闻是关于什么主题的?新闻:{s}。选项:{",".join(label_list)}。答:', + # f'这是关于“{",".join(label_list)}”中哪个选项的文章?文章:{s}。 答:'] + return prompts + + +class Demo: + def __init__(self, model_name_or_path="Langboat/mengzi-t5-base-mt", max_predict_len=512): + self.tokenizer = T5Tokenizer.from_pretrained(model_name_or_path) + print("Loading the model parameters, please wait...") + self.model = T5ForConditionalGeneration.from_pretrained(model_name_or_path) + self.model.eval() + self.max_predict_len = max_predict_len + print("Model loaded.") + + def token_decode(self, s): + return self.tokenizer.decode(s, skip_special_tokens=True) + + def pick_most_common(self, x): + return Counter(x).most_common(1)[0][0] + + @paddle.no_grad() + def generate(self, task_type, input_string, input_string2=None, entity1=None, entity2=None, max_predict_len=None): + max_predict_len = max_predict_len if max_predict_len is not None else self.max_predict_len + + input_text = create_input_with_prompt(task_type, input_string, input_string2, entity1, entity2) + # tokenize + encodings = self.tokenizer(input_text, max_seq_len=512) + encodings = {k: paddle.to_tensor(v) for k, v in encodings.items()} + outputs = 
self.model.generate(**encodings, max_length=max_predict_len)[0] + dec_out = list(map(self.token_decode, outputs)) + output = self.pick_most_common(dec_out) + print("input_text:", input_text[0]) + print("output:", output) + print("=" * 50) + return output + + +if __name__ == "__main__": + + demo = Demo(model_name_or_path="Langboat/mengzi-t5-base-mt") + # (1) 实体抽取 + demo.generate(task_type="entity_extraction", input_string="导致泗水的砭石受到追捧,价格突然上涨。而泗水县文化市场综合执法局颜鲲表示,根据监控") + # 泗水:地址,泗水县文化市场综合执法局:政府,颜鲲:姓名 + + # (2) 语义相似度 + demo.generate(task_type="text_similarity", input_string="你好,我还款银行怎么更换", input_string2="怎么更换绑定还款的卡") + # 是 + + # (3) 金融关系抽取 + demo.generate( + task_type="financial_relationship_extraction", + input_string="为打消市场顾虑,工行两位洋股东——美国运通和安联集团昨晚做出承诺,近期不会减持工行H股。", + entity1="工行", + entity2="美国运通", + ) + # 被持股 + + # (4) 广告文案生成 + demo.generate(task_type="ad_generation", input_string="类型-裤,版型-宽松,风格-潮,风格-复古,风格-文艺,图案-复古,裤型-直筒裤,裤腰型-高腰,裤口-毛边") + # 这款牛仔裤采用高腰直筒的版型设计,搭配宽松的裤型,穿着舒适又显潮流感。而裤脚的毛边设计,增添几分复古文艺的气息。 + + # (5) 医学领域意图分类 + demo.generate(task_type="medical_domain_intent_classifier", input_string="呼气试验阳性什么意思") + # 指标解读 + + # (6) 情感分类 + demo.generate(task_type="sentiment_classifier", input_string="房间很一般,小,且让人感觉脏,隔音效果差,能听到走廊的人讲话,走廊光线昏暗,旁边没有什么可吃") + # 消极 + + # (7) 评论对象抽取 + demo.generate(task_type="comment_object_extraction", input_string="灵水的水质清澈,建议带个浮潜装备,可以看清湖里的小鱼。") + # 灵水 + + # (8) 新闻分类 + demo.generate(task_type="news_classifier", input_string="懒人适合种的果树:长得多、好打理,果子多得都得送邻居吃") + # 农业 diff --git a/examples/language_model/transformer-xl/README.md b/examples/language_model/transformer-xl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9f6f7dec8da57b23670c4b085fb7c29b9df41aed --- /dev/null +++ b/examples/language_model/transformer-xl/README.md @@ -0,0 +1,114 @@ +# Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context + +以下是本例的简要目录结构及说明: + +```text +. +├── configs/ # 配置文件 +├── eval.py # 预测脚本 +├── gen_data.sh # 数据下载脚本 +├── mem_transformer.py # 模型组网 +├── reader.py # 数据读取接口 +├── README.md # 文档 +├── train.py # 训练脚本 +└── utils/ # 数据处理工具 +``` + +## 模型简介 + +本项目是语言模型 Transformer-XL 的 PaddlePaddle 实现, 包含模型训练,预测等内容。 + + +## 快速开始 + +### 环境依赖 + +- attrdict +- pyyaml + +安装命令 `pip install attrdict pyyaml` + +### 数据准备 + +公开数据集:enwik8、text8、wt103 多用于语言模型的 benchmark 测试。输出获取与处理方式如下: + +```shell +bash gen_data.sh +``` + +会在当前路径下的 ./gen_data/ 路径下生成我们需要的数据。 + +### 单机训练 + +#### 单机单卡 + +以提供的 enwik8 数据为例,可以执行以下命令进行模型训练: + +``` sh +# setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ./configs/enwik8.yaml +``` + +可以在 enwik8.yaml 文件中设置相应的参数,比如 `batch_size`、`epoch` 等。 + +如果要更换成 wt103 数据集进行训练,可以在执行的时候通过 `--config` 指定对应的配置文件即可。 + +``` sh +# setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ./configs/wt103.yaml +``` + +#### 使用 CPU 进行训练 + +如果要使用 CPU 进行训练,可以修改 `configs/` 路径下,对应的配置文件中的 `use_gpu` 配置为 `False`,用相同的方式启动训练即可使用 CPU 进行训练。 + +``` sh +python train.py --config ./configs/enwik8.yaml +``` + +### 单机多卡 + +同样,可以执行如下命令实现八卡训练: + +``` sh +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py --config ./configs/enwik8.yaml +``` + +### 恢复训练 + +若需要从之前的 checkpoint 开始继续训练,可以设置 `configs/` 路径中对应的配置文件中的参数 `init_from_checkpoint` 可载入之前的 checkpoint(包括 optimizer 的信息)继续训练。指定的方式是,指定到模型的 checkpoint 保存的路径。比如,指定成 `./trained_models/step_final/`,该路径下的目录结构如下: + +```text +. 
+├── mem_transformer.pdopt # 存储的优化器相关信息 +└── mem_transformer.pdparams # 存储模型参数相关信息 +``` + +若只是从之前训练的参数开始重新训练,无需载入 optimizer 信息,可以设置对应的配置文件中的参数 `init_from_pretrain_model` 可载入指定的参数,从头开始训练。指定的方式也是类似,指定到模型保存的参数文件 `mem_transformer.pdparams` 的路径,比如 `./trained_models/step_final/`。 + +### 模型推断 + +以 enwik8 数据为例,模型训练完成后可以执行以下命令可以进行预测: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +python eval.py --config ./configs/enwik8.yaml +``` + +同理,可以通过指定 `--config` 选项来选择需要的数据集对应的配置文件。 + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +python eval.py --config ./configs/wt103.yaml +``` + +完成推断之后,会将显示在验证集和测试集上的 loss 的结果。 + +## 参考文献 + +[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](http://arxiv.org/abs/1901.02860) diff --git a/examples/language_model/transformer-xl/configs/enwik8.yaml b/examples/language_model/transformer-xl/configs/enwik8.yaml new file mode 100644 index 0000000000000000000000000000000000000000..12b3ef5cff007ef47dd4fe75e713915b0a81187a --- /dev/null +++ b/examples/language_model/transformer-xl/configs/enwik8.yaml @@ -0,0 +1,112 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The path to data files +data: "./gen_data/enwik8/" +# The name of dataset +dataset: "enwik8" + +# Whether to use cuda +use_gpu: True + +# Args for reader, see reader.py for details +token_delimiter: None +batch_size: 16 +eval_batch_size: 10 + +# Hyparams for training: +# The number of epoches for training +epoch: 200 +# Max step for training. +max_step: 400000 + +# The hyper parameters for optimizer. +# Type of ptimizer. +optim: adam +# Learning rate schedule. +scheduler: cosine +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 0.00025 +# The hyper parameters for Adam optimizer. +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The hyper parameters for Momentum optimizer. +mom: 0.0 +# Global gradient clip. +clip: 0.25 +# The parameters for learning rate scheduling. +warmup_steps: 0 +# The parameters for CosineAnnealingDecay. Minimum learning rate. +eta_min: 0.0 +# The parameters for ReduceLROnPlateau. +# The Ratio that the learning rate will be reduced. +decay_rate: 0.5 +# When loss doesn’t improve for this number of epochs, learing rate will be reduced. +patience: 0 +# The lower bound of the learning rate after reduction. +min_lr: 0.0 + +# Hyparams for model: +# Whe use adaptive softmax. +adaptive: False +# Size of dictionary. This can be obtained automatically. +ntokens: 10000 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# Dimension of heads. +d_head: 64 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# Number of head used in multi-head attention. 
+n_head: 8 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 12 +# Dropout rates. +dropout: 0.1 +# Attention dropout +attn_dropout: 0.0 +# Attention type for decoder. +# 0 for relative partial MHA (in Transformer-XL). +# 1 for relative MHA (in Shaw et al). +attn_type: 0 +# Apply layer normalization before or after sublayers. +normalize_before: False +# Whether to tie weight or not. +tie_weight: True +# The length of the extended context. +ext_len: 0 +# The divident value for softmax and adapative input. +div_val: 1 +# Target length. The number of tokens to predict. +tgt_len: 512 +# Memory length. The length of the retained previous heads. +mem_len: 512 +# Use the same attention length for all tokens. +same_length: False +# Use the same positional encoding after clamp len. +clamp_len: -1 +# The number of samples in sample softmax. -1 means do not use sampled softmax. +sample_softmax: -1 +# Target length for evaluation. That is, the number of tokens to predict for evaluation. +eval_tgt_len: 128 +# What kind of mode for evaluation. valid, test or both("all"). +mode: "all" +# Maximum evaluation step. +max_eval_steps: -1 diff --git a/examples/language_model/transformer-xl/configs/text8.yaml b/examples/language_model/transformer-xl/configs/text8.yaml new file mode 100644 index 0000000000000000000000000000000000000000..5e1353a1d05040e7fe23ee60734510e92b3bb2e6 --- /dev/null +++ b/examples/language_model/transformer-xl/configs/text8.yaml @@ -0,0 +1,112 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The path to data files +data: "./gen_data/text8/" +# The name of dataset +dataset: "text8" + +# Whether to use cuda +use_gpu: True + +# Args for reader, see reader.py for details +token_delimiter: None +batch_size: 15 +eval_batch_size: 5 + +# Hyparams for training: +# The number of epoches for training +epoch: 200 +# Max step for training. +max_step: 400000 + +# The hyper parameters for optimizer. +# Type of ptimizer. +optim: adam +# Learning rate schedule. +scheduler: cosine +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 0.00025 +# The hyper parameters for Adam optimizer. +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The hyper parameters for Momentum optimizer. +mom: 0.0 +# Global gradient clip. +clip: 0.25 +# The parameters for learning rate scheduling. +warmup_steps: 0 +# The parameters for CosineAnnealingDecay. Minimum learning rate. +eta_min: 0.0 +# The parameters for ReduceLROnPlateau. +# The Ratio that the learning rate will be reduced. +decay_rate: 0.5 +# When loss doesn’t improve for this number of epochs, learing rate will be reduced. +patience: 0 +# The lower bound of the learning rate after reduction. +min_lr: 0.0 + +# Hyparams for model: +# Whe use adaptive softmax. +adaptive: False +# Size of dictionary. This can be obtained automatically. 
+ntokens: 10000 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# Dimension of heads. +d_head: 64 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# Number of head used in multi-head attention. +n_head: 8 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 12 +# Dropout rates. +dropout: 0.1 +# Attention dropout +attn_dropout: 0.0 +# Attention type for decoder. +# 0 for relative partial MHA (in Transformer-XL). +# 1 for relative MHA (in Shaw et al). +attn_type: 0 +# Apply layer normalization before or after sublayers. +normalize_before: False +# Whether to tie weight or not. +tie_weight: True +# The length of the extended context. +ext_len: 0 +# The divident value for softmax and adapative input. +div_val: 1 +# Target length. The number of tokens to predict. +tgt_len: 512 +# Memory length. The length of the retained previous heads. +mem_len: 512 +# Use the same attention length for all tokens. +same_length: False +# Use the same positional encoding after clamp len. +clamp_len: -1 +# The number of samples in sample softmax. -1 means do not use sampled softmax. +sample_softmax: -1 +# Target length for evaluation. That is, the number of tokens to predict for evaluation. +eval_tgt_len: 128 +# What kind of mode for evaluation. valid, test or both("all"). +mode: "all" +# Maximum evaluation step. +max_eval_steps: -1 diff --git a/examples/language_model/transformer-xl/configs/wt103.yaml b/examples/language_model/transformer-xl/configs/wt103.yaml new file mode 100644 index 0000000000000000000000000000000000000000..99fec78d1494686ffc93ee4b123dbaafbd881625 --- /dev/null +++ b/examples/language_model/transformer-xl/configs/wt103.yaml @@ -0,0 +1,112 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The path to data files +data: "./gen_data/wikitext-103/" +# The name of dataset +dataset: "wt103" + +# Whether to use cuda +use_gpu: True + +# Args for reader, see reader.py for details +token_delimiter: None +batch_size: 32 +eval_batch_size: 10 + +# Hyparams for training: +# The number of epoches for training +epoch: 200 +# Max step for training. +max_step: 200000 + +# The hyper parameters for optimizer. +# Type of ptimizer. +optim: adam +# Learning rate schedule. +scheduler: cosine +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 0.00025 +# The hyper parameters for Adam optimizer. +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The hyper parameters for Momentum optimizer. +mom: 0.0 +# Global gradient clip. +clip: 0.25 +# The parameters for learning rate scheduling. +warmup_steps: 0 +# The parameters for CosineAnnealingDecay. Minimum learning rate. +eta_min: 0.0 +# The parameters for ReduceLROnPlateau. 
+# The Ratio that the learning rate will be reduced. +decay_rate: 0.5 +# When loss doesn’t improve for this number of epochs, learing rate will be reduced. +patience: 0 +# The lower bound of the learning rate after reduction. +min_lr: 0.0 + +# Hyparams for model: +# Whe use adaptive softmax. +adaptive: True +# Size of dictionary. This can be obtained automatically. +ntokens: 10000 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 410 +# Dimension of heads. +d_head: 41 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2100 +# Number of head used in multi-head attention. +n_head: 10 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 16 +# Dropout rates. +dropout: 0.1 +# Attention dropout +attn_dropout: 0.0 +# Attention type for decoder. +# 0 for relative partial MHA (in Transformer-XL). +# 1 for relative MHA (in Shaw et al). +attn_type: 0 +# Apply layer normalization before or after sublayers. +normalize_before: False +# Whether to tie weight or not. +tie_weight: True +# The length of the extended context. +ext_len: 0 +# The divident value for softmax and adapative input. +div_val: 1 +# Target length. The number of tokens to predict. +tgt_len: 150 +# Memory length. The length of the retained previous heads. +mem_len: 150 +# Target length for evaluation. That is, the number of tokens to predict for evaluation. +eval_tgt_len: 150 +# Use the same attention length for all tokens. +same_length: False +# Use the same positional encoding after clamp len. +clamp_len: -1 +# The number of samples in sample softmax. -1 means do not use sampled softmax. +sample_softmax: -1 +# What kind of mode for evaluation. valid, test or both("all"). +mode: "all" +# Maximum evaluation step. +max_eval_steps: -1 diff --git a/examples/language_model/transformer-xl/eval.py b/examples/language_model/transformer-xl/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..de13d17fd9dcc0469b85164f70f22333192ecf52 --- /dev/null +++ b/examples/language_model/transformer-xl/eval.py @@ -0,0 +1,142 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +from pprint import pprint + +import numpy as np +import paddle +import yaml +from attrdict import AttrDict +from mem_transformer import MemTransformerLM +from reader import get_lm_data_loader, get_lm_vocab + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./configs/enwik8.yaml", type=str, help="Path of the config file. 
") + args = parser.parse_args() + return args + + +def do_eval(args): + assert args.ext_len >= 0, "Extended context length must be no less than 0" + + def _evaluate(loader): + total_len, total_loss = 0, 0.0 + + eval_mems = tuple() + for i, (src, target, seq_len) in enumerate(loader): + if args.max_eval_steps > 0 and i >= args.max_eval_steps: + break + ret = mem_transformer(src, target, *eval_mems) + loss, eval_mems = ret[0], ret[1:] + eval_cur_loss = seq_len * loss.numpy() + total_loss += eval_cur_loss + total_len += seq_len + return total_loss / total_len + + def _logger(loss): + if args.dataset in ["enwik8", "text8"]: + logger_info = "loss: %f, bpc: %f" % (loss, loss / np.log(2)) + else: + logger_info = "loss: %f, ppl: %.2f" % (loss, np.exp(loss)) + return logger_info + + if not args.use_gpu: + paddle.set_device("cpu") + + vocab = get_lm_vocab(args) + eval_loader = get_lm_data_loader(args, vocab, "valid") + test_loader = get_lm_data_loader(args, vocab, "test") + + cutoffs, tie_projs = [], [False] + if args.adaptive: + assert args.dataset in ["wt103", "lm1b"] + if args.dataset == "wt103": + cutoffs = [20000, 40000, 200000] + tie_projs += [True] * len(cutoffs) + elif args.dataset == "lm1b": + cutoffs = [60000, 100000, 640000] + tie_projs += [False] * len(cutoffs) + + mem_transformer = MemTransformerLM( + args.ntokens, + args.n_layer, + args.n_head, + args.d_model, + args.d_head, + args.d_inner_hid, + args.dropout, + args.attn_dropout, + tie_weight=args.tie_weight, + d_embed=args.d_model, + div_val=args.div_val, + tie_projs=tie_projs, + normalize_before=args.normalize_before, + tgt_len=args.tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len, + cutoffs=cutoffs, + same_length=args.same_length, + attn_type=args.attn_type, + clamp_len=args.clamp_len, + sample_softmax=args.sample_softmax, + ) + + assert args.init_from_params, "Please set init_from_params to load the infer model." + + model_dict = paddle.load(os.path.join(args.init_from_params, "mem_transformer.pdparams")) + mem_transformer.load_dict(model_dict) + + logger.info( + "Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format( + args.eval_batch_size, args.tgt_len, args.ext_len, args.mem_len, args.clamp_len + ) + ) + + mem_transformer.reset_length(args.tgt_len, args.ext_len, args.mem_len) + + test_loss = None + valid_loss = None + if args.mode == "all": + test_loss = _evaluate(test_loader) + valid_loss = _evaluate(eval_loader) + elif args.mode == "valid": + valid_loss = _evaluate(eval_loader) + elif args.mode == "test": + test_loss = _evaluate(test_loader) + + logger_info = "" + if valid_loss is not None: + logger_info = logger_info + "validation loss: " + _logger(valid_loss) + " | " + if test_loss is not None: + logger_info = logger_info + "test loss: " + _logger(test_loss) + " | " + logger.info(logger_info) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + + do_eval(args) diff --git a/examples/language_model/transformer-xl/gen_data.sh b/examples/language_model/transformer-xl/gen_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..865a8a5835893a8626fee58a9324d5e94c30f1fd --- /dev/null +++ b/examples/language_model/transformer-xl/gen_data.sh @@ -0,0 +1,55 @@ +echo "Downloading dataset..." + +CUR_DIR=$PWD + +mkdir -p gen_data +cd ./gen_data/ + +if [ ! -d "wikitext-103" ]; then + echo "Downloading wikitext-103..." 
+ wget -O wikitext-103-v1.zip https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip + echo "Unzip wikitext-103..." + unzip wikitext-103-v1.zip + cd wikitext-103 + # Rename + mv wiki.train.tokens train.txt + mv wiki.valid.tokens valid.txt + mv wiki.test.tokens test.txt + cd - +fi + +if [ ! -d 'enwik8' ]; then + mkdir -p enwik8 + cd enwik8 + echo "Downloading enwik8..." + wget -O enwik8.zip http://mattmahoney.net/dc/enwik8.zip + wget -O prep_enwik8.py https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py + python3 prep_enwik8.py + rm -f prep_enwik8.py + cd - +fi + +if [ ! -d 'text8' ]; then + mkdir -p text8 + cd text8 + echo "Downloading text8..." + wget -O text8.zip http://mattmahoney.net/dc/text8.zip + python ${CUR_DIR}/utils/preprocess_text8.py 5000000 + cd - +fi + +if [ ! -d 'one-billion-words' ]; then + mkdir -p one-billion-words + cd one-billion-words + echo "Downloading one-billion-words..." + wget -O 1-billion-word-language-modeling-benchmark-r13output.tar.gz http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz + tar xzf 1-billion-word-language-modeling-benchmark-r13output.tar.gz + + dir="./1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/" + cat ${dir}/news.en.heldout-00000-of-00050 > valid.txt + cat ${dir}/news.en.heldout-00000-of-00050 > test.txt + wget -O 1b_word_vocab.txt https://github.com/rafaljozefowicz/lm/raw/master/1b_word_vocab.txt + cd - +fi + +echo "All done. " diff --git a/examples/language_model/transformer-xl/mem_transformer.py b/examples/language_model/transformer-xl/mem_transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..122d4e1cc3f23226eaa66f9f13982ebd3328ae6a --- /dev/null +++ b/examples/language_model/transformer-xl/mem_transformer.py @@ -0,0 +1,1031 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
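# --- Editorial note (not part of the original patch) ---
# A commented-out sketch of how the sinusoidal PositionEmbedding defined below
# produces the position encodings consumed by the attention layers. The sizes
# are taken from configs/enwik8.yaml in this change (d_model=512, tgt_len=512,
# mem_len=512, batch_size=16); the descending position sequence is the usual
# Transformer-XL convention for relative positions and is an assumption here,
# since the full model wiring appears later in this file.
#
#   import paddle
#   d_model, tgt_len, mem_len, bsz = 512, 512, 512, 16
#   klen = tgt_len + mem_len
#   pos_seq = paddle.arange(klen - 1, -1, -1.0, dtype="float32")  # klen-1, ..., 1, 0
#   pos_emb = PositionEmbedding(d_model)(pos_seq, bsz=bsz)        # shape: [bsz, klen, d_model]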
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +global_dtype = paddle.get_default_dtype() + + +def sample_logits(embedding, bias, labels, inputs, sampler): + true_log_probs, samp_log_probs, neg_samples = sampler.sample(labels) + n_sample = neg_samples.shape[0] + b1, b2 = labels.shape[0], labels.shape[1] + all_ids = paddle.concat([paddle.reshape(labels, shape=[-1]), neg_samples]) + all_w = embedding(all_ids) + true_w = paddle.reshape(all_w[:-n_sample], shape=[b1, b2, -1]) + sample_w = paddle.reshape(all_w[-n_sample:], shape=[n_sample, -1]) + + all_b = paddle.gather(bias, all_ids) + true_b = paddle.reshape(all_b[:-n_sample], shape=[b1, b2]) + sample_b = all_b[-n_sample:] + + hit = paddle.cast((labels.unsqueeze([2]) == neg_samples), dtype=global_dtype).detach() + true_logits = paddle.sum(true_w * inputs, axis=-1) + true_b - true_log_probs + sample_logits = ( + paddle.transpose(paddle.matmul(sample_w, paddle.transpose(inputs, [0, 2, 1])), [0, 2, 1]) + + sample_b + - samp_log_probs + ) + sample_logits = sample_logits - 1e30 * hit + logits = paddle.concat([true_logits.unsqueeze([2]), sample_logits], -1) + + return logits + + +class ProjAdaptiveSoftmax(nn.Layer): + """ + Combine projection and logsoftmax. + """ + + def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, keep_order=False): + super(ProjAdaptiveSoftmax, self).__init__() + + self.n_token = n_token + self.d_embed = d_embed + self.d_proj = d_proj + + self.cutoffs = cutoffs + [n_token] + self.cutoff_ends = [0] + self.cutoffs + self.div_val = div_val + + self.shortlist_size = self.cutoffs[0] + self.num_clusters = len(self.cutoffs) - 1 + self.head_size = self.shortlist_size + self.num_clusters + + if self.num_clusters > 0: + self.cluster_weight = paddle.create_parameter( + shape=[self.num_clusters, self.d_embed], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.cluster_bias = paddle.create_parameter( + shape=[self.num_clusters], + dtype=global_dtype, + is_bias=True, + default_initializer=paddle.nn.initializer.Constant(0.0), + ) + + self.out_layers_weight = nn.ParameterList() + self.out_layers_bias = nn.ParameterList() + self.out_projs = nn.ParameterList() + + if div_val == 1: + for i in range(len(self.cutoffs)): + if d_proj != d_embed: + self.out_projs.append( + paddle.create_parameter( + shape=[d_proj, d_embed], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + else: + self.out_projs.append(None) + + self.out_layers_weight.append( + paddle.create_parameter( + shape=[n_token, d_embed], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Constant(0.0), + ) + ) + self.out_layers_bias.append( + paddle.create_parameter( + shape=[n_token], + dtype=global_dtype, + is_bias=True, + default_initializer=paddle.nn.initializer.Constant(0.0), + ) + ) + else: + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + d_emb_i = d_embed // (div_val**i) + + self.out_projs.append( + paddle.create_parameter( + shape=[d_proj, d_emb_i], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + + self.out_layers_weight.append( + paddle.create_parameter( + shape=[r_idx - l_idx, d_emb_i], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Uniform( + low=-((r_idx - l_idx) ** (-1.0 / 2.0)), high=(r_idx - l_idx) ** (-1.0 / 2.0) + ), + ) + ) + self.out_layers_bias.append( + paddle.create_parameter( + 
shape=[r_idx - l_idx], + dtype=global_dtype, + is_bias=True, + default_initializer=paddle.nn.initializer.Uniform( + low=-((r_idx - l_idx) ** (-1.0 / 2.0)), high=(r_idx - l_idx) ** (-1.0 / 2.0) + ), + ) + ) + + self.keep_order = keep_order + + def _compute_logits(self, hidden, weight, bias, proj=None): + if proj is None: + logit = F.linear(hidden, weight.t(), bias=bias) + else: + proj_hid = F.linear(hidden, proj) + logit = F.linear(proj_hid, weight.t(), bias=bias) + + return logit + + def forward(self, hidden, target, keep_order=False): + assert hidden.shape[0] == target.shape[0] + + if self.num_clusters == 0: + logit = self._compute_logits(hidden, self.out_layers_weight[0], self.out_layers_bias[0], self.out_projs[0]) + nll = -paddle.log(F.softmax(logit, axis=-1)) + idx = paddle.concat([paddle.arange(0, nll.shape[0]).unsqueeze([1]), target.unsqueeze(1)], axis=1) + nll = paddle.gather_nd(nll, idx) + else: + weights, biases = [], [] + for i in range(len(self.cutoffs)): + if self.div_val == 1: + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + weight_i = self.out_layers_weight[0][l_idx:r_idx] + bias_i = self.out_layers_bias[0][l_idx:r_idx] + else: + weight_i = self.out_layers_weight[i] + bias_i = self.out_layers_bias[i] + + if i == 0: + weight_i = paddle.concat([weight_i, self.cluster_weight], axis=0) + bias_i = paddle.concat([bias_i, self.cluster_bias], axis=0) + + weights.append(weight_i) + biases.append(bias_i) + + head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0] + + head_logit = self._compute_logits(hidden, head_weight, head_bias, head_proj) + head_logprob = paddle.log(F.softmax(head_logit, axis=-1)) + + nll = paddle.zeros_like(target, dtype=hidden.dtype) + + offset = 0 + cutoff_values = [0] + self.cutoffs + for i in range(len(cutoff_values) - 1): + l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1] + + mask_i = paddle.cast(target >= l_idx, dtype=paddle.get_default_dtype()) * paddle.cast( + target < r_idx, dtype="int64" + ) + indices_i = paddle.nonzero(mask_i).squeeze([1]) + + if paddle.numel(indices_i) == 0: + continue + target_i = paddle.gather(target, indices_i, axis=0) - l_idx + head_logprob_i = paddle.gather(head_logprob, indices_i, axis=0) + if i == 0: + target_i_idx = paddle.concat( + [paddle.arange(0, head_logprob_i.shape[0]).unsqueeze([1]), target_i.unsqueeze([1])], axis=1 + ) + logprob_i = head_logprob_i.gather_nd(target_i_idx) + else: + weight_i, bias_i, proj_i = ( + weights[i], + biases[i], + self.out_projs[i].weight if self.out_projs[i] is not None else None, + ) + + hidden_i = paddle.gather(hidden, indices_i, axis=0) + + tail_logit_i = self._compute_logits(hidden_i, weight_i, bias_i, proj_i) + tail_logprob_i = paddle.log(F.softmax(tail_logit_i, axis=-1)) + + target_i_idx = paddle.concat( + [paddle.arange(0, tail_logprob_i.shape[0]).unsqueeze([1]), target_i.unsqueeze([1])], axis=1 + ) + logprob_i = tail_logprob_i.gather_nd(target_i_idx) + + logprob_i = head_logprob_i[:, -i] + logprob_i + + if self.keep_order or keep_order: + nll = paddle.scatter(nll, indices_i, -logprob_i) + else: + index = paddle.arange(offset, offset + logprob_i.shape[0], 1) + nll = paddle.scatter(nll, index, -logprob_i) + + offset += logprob_i.shape[0] + + return nll + + +class LogUniformSampler(object): + def __init__(self, range_max, n_sample): + with paddle.no_grad(): + self.range_max = range_max + log_indices = paddle.log(paddle.arange(1.0, range_max + 2.0, 1.0, dtype=global_dtype)) + self.dist = (log_indices[1:] - log_indices[:-1]) / log_indices[-1] + + 
self.log_q = paddle.cast( + paddle.log( + paddle.exp(-(paddle.log1p(-paddle.cast(self.dist, dtype=global_dtype)) * 2 * n_sample)) - 1 + ), + dtype=global_dtype, + ) + + self.n_sample = n_sample + + def sample(self, labels): + n_sample = self.n_sample + n_tries = 2 * n_sample + batch_size = labels.shape[0] + + with paddle.no_grad(): + neg_samples = paddle.unique(paddle.multinomial(self.dist, n_tries, replacement=True)) + true_log_probs = paddle.gather(self.log_q, labels.flatten()) + true_log_probs = paddle.reshape(true_log_probs, shape=[batch_size, -1]) + samp_log_probs = paddle.gather(self.log_q, neg_samples) + return true_log_probs, samp_log_probs, neg_samples + + +class PositionEmbedding(nn.Layer): + def __init__(self, emb_dim): + super(PositionEmbedding, self).__init__() + self.emb_dim = emb_dim + self.inv_freq = 1.0 / (10000.0 ** (paddle.arange(0.0, emb_dim, 2.0, dtype=global_dtype) / emb_dim)) + + def forward(self, pos_seq, bsz=None): + sinusoid_inp = paddle.matmul(pos_seq.unsqueeze([1]), self.inv_freq.unsqueeze([0])) + pos_emb = paddle.concat([paddle.sin(sinusoid_inp), paddle.cos(sinusoid_inp)], axis=-1) + + if bsz is not None: + pos_emb = pos_emb.unsqueeze([0]).expand([bsz, -1, -1]) + pos_emb.stop_gradient = True + return pos_emb + else: + pos_emb = pos_emb.unsqueeze([0]) + pos_emb.stop_gradient = True + return pos_emb + + +class PositionwiseFFN(nn.Layer): + def __init__(self, d_model, d_inner, dropout, normalize_before=False): + super(PositionwiseFFN, self).__init__() + + self.d_model = d_model + self.d_inner = d_inner + + self.CoreNet = nn.Sequential( + nn.Linear( + d_model, + d_inner, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear( + d_inner, + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ), + nn.Dropout(dropout), + ) + self.layer_norm = nn.LayerNorm( + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=1.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + self.normalize_before = normalize_before + + def forward(self, inp): + if self.normalize_before: + core_out = self.CoreNet(self.layer_norm(inp)) + output = core_out + inp + else: + core_out = self.CoreNet(inp) + output = self.layer_norm(inp + core_out) + return output + + +class MultiHeadAttn(nn.Layer): + def __init__(self, n_head, d_model, d_head, dropout, attn_dropout=0, normalize_before=False): + super(MultiHeadAttn, self).__init__() + self.n_head = n_head + self.d_model = d_model + self.d_head = d_head + + self.q_proj = nn.Linear( + d_model, n_head * d_head, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + self.kv_proj = nn.Linear( + d_model, 2 * n_head * d_head, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + self.drop = nn.Dropout(p=dropout) + self.attn_drop = nn.Dropout(p=attn_dropout) + self.o_proj = nn.Linear( + n_head * d_head, d_model, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + self.layer_norm = nn.LayerNorm( + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=1.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + + self.scale = 1 / (d_head**0.5) + self.normalize_before = normalize_before + + def forward(self, h, attn_mask=None, mems=None): + if mems is not None: + c = paddle.concat([mems, h], axis=1) + else: + c = h + + if self.normalize_before: + c = 
self.layer_norm(c) + + head_q = self.q_proj(h) + head_k, head_v = paddle.chunk(self.kv_proj(c), chunks=2, axis=-1) + + head_q = paddle.reshape(head_q, shape=[h.shape[0], h.shape[1], self.n_head, self.d_head]) + head_k = paddle.reshape(head_k, shape=[c.shape[0], c.shape[1], self.n_head, self.d_head]) + head_v = paddle.reshape(head_v, shape=[c.shape[0], c.shape[1], self.n_head, self.d_head]) + + attn_score = paddle.einsum("bind,bjnd->bnij", head_q, head_k) + attn_score = attn_score * self.scale + if attn_mask is not None: + attn_score = attn_score - float("inf") * attn_mask + + attn_prob = F.softmax(attn_score, dim=-1) + attn_prob = self.attn_drop(attn_prob) + + attn_vec = paddle.einsum("bnij,bjnd->bind", attn_prob, head_v) + attn_vec = paddle.reshape(attn_vec, shape=[attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head]) + + attn_out = self.o_proj(attn_vec) + attn_out = self.drop(attn_out) + if self.normalize_before: + output = h + attn_out + else: + output = self.layer_norm(h + attn_out) + + return output + + +class RelMultiHeadAttn(nn.Layer): + def __init__( + self, + n_head, + d_model, + d_head, + dropout, + attn_dropout=0, + tgt_len=None, + ext_len=None, + mem_len=None, + normalize_before=False, + ): + super(RelMultiHeadAttn, self).__init__() + + self.n_head = n_head + self.d_model = d_model + self.d_head = d_head + self.dropout = dropout + + self.qkv_proj = nn.Linear( + d_model, 3 * n_head * d_head, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + + self.drop = nn.Dropout(dropout) + self.attn_drop = nn.Dropout(attn_dropout) + self.o_proj = nn.Linear( + n_head * d_head, d_model, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + + self.layer_norm = nn.LayerNorm( + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=1.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + + self.scale = 1 / (d_head**0.5) + + self.normalize_before = normalize_before + + def _rel_shift(self, x, zero_triu=False): + x_shape = x.shape + zero_pad = paddle.zeros([x_shape[0], x_shape[1], x_shape[2], 1], dtype=x.dtype) + x_padded = paddle.concat([zero_pad, x], axis=-1) + + x_padded = paddle.reshape(x_padded, shape=[x_shape[0], x_shape[1], x_shape[3] + 1, x_shape[2]]) + + x = paddle.reshape(x_padded[:, :, 1:, :], shape=x_shape) + + if zero_triu: + ones = paddle.ones([x_shape[2], x_shape[3]]) + x = x * paddle.tril(ones, diagonal=x_shape[3] - x_shape[2]).unsqueeze([2, 3]) + + return x + + def forward(self, w, r, attn_mask=None, mems=None): + raise NotImplementedError + + +class RelPartialLearnableMultiHeadAttn(RelMultiHeadAttn): + def __init__(self, *args, **kwargs): + super(RelPartialLearnableMultiHeadAttn, self).__init__(*args, **kwargs) + + self.r_proj = nn.Linear( + self.d_model, + self.n_head * self.d_head, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=False, + ) + + def forward(self, w, r, r_w_bias, r_r_bias, attn_mask=None, mems=None): + qlen, rlen, bsz = w.shape[1], r.shape[1], w.shape[0] + + if mems is not None: + cat = paddle.concat([mems, w], axis=1) + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(cat)) + else: + w_heads = self.qkv_proj(cat) + r_head_k = self.r_proj(r) + + w_head_q, w_head_k, w_head_v = paddle.chunk(w_heads, chunks=3, axis=-1) + + w_head_q = w_head_q[:, -qlen:, :] + else: + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(w)) + else: + w_heads = self.qkv_proj(w) + r_head_k = self.r_proj(r) + + w_head_q, w_head_k, w_head_v 
= paddle.chunk(w_heads, chunks=3, axis=-1) + + klen = w_head_k.shape[1] + + w_head_q = paddle.reshape(w_head_q, shape=[bsz, qlen, self.n_head, self.d_head]) + w_head_k = paddle.reshape(w_head_k, shape=[bsz, klen, self.n_head, self.d_head]) + w_head_v = paddle.reshape(w_head_v, shape=[bsz, klen, self.n_head, self.d_head]) + + r_head_k = paddle.reshape(r_head_k, shape=[bsz, rlen, self.n_head, self.d_head]) + + rw_head_q = w_head_q + r_w_bias + + AC = paddle.einsum("bind,bjnd->bnij", rw_head_q, w_head_k) + rr_head_q = w_head_q + r_r_bias + + BD = paddle.einsum("bind,bjnd->bnij", rr_head_q, r_head_k) + BD = self._rel_shift(BD) + + attn_score = AC + BD + attn_score = attn_score * self.scale + + if attn_mask is not None: + attn_score = attn_score - 1e30 * attn_mask + + attn_prob = F.softmax(attn_score, axis=-1) + attn_prob = self.attn_drop(attn_prob) + + attn_vec = paddle.einsum("bnij,bjnd->bind", attn_prob, w_head_v) + + attn_vec = paddle.reshape(attn_vec, shape=[attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head]) + + attn_out = self.o_proj(attn_vec) + attn_out = self.drop(attn_out) + + if self.normalize_before: + output = w + attn_out + else: + output = self.layer_norm(w + attn_out) + + return output + + +class RelLearnableMultiHeadAttn(RelMultiHeadAttn): + def __init__(self, *args, **kwargs): + super(RelLearnableMultiHeadAttn, self).__init__(*args, **kwargs) + + def forward(self, w, r_emb, r_w_bias, r_bias, attn_mask=None, mems=None): + qlen, bsz = w.shape[1], w.shape[0] + + if mems is not None: + cat = paddle.concat([mems, w], 1) + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(cat)) + else: + w_heads = self.qkv_proj(cat) + w_head_q, w_head_k, w_head_v = paddle.chunk(w_heads, chunks=3, axis=-1) + + w_head_q = w_head_q[-qlen:] + else: + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(w)) + else: + w_heads = self.qkv_proj(w) + w_head_q, w_head_k, w_head_v = paddle.chunk(w_heads, chunks=3, axis=-1) + + klen = w_head_k.shape[1] + + w_head_q = paddle.reshape(w_head_q, shape=[w_head_q.shape[0], w_head_q.shape[1], self.n_head, self.d_head]) + w_head_k = paddle.reshape(w_head_k, shape=[w_head_k.shape[0], w_head_k.shape[1], self.n_head, self.d_head]) + w_head_v = paddle.reshape(w_head_v, shape=[w_head_v.shape[0], w_head_v.shape[1], self.n_head, self.d_head]) + + if klen > r_emb.shape[0]: + r_emb_pad = r_emb[0:1].expand(klen - r_emb.shape[0], -1, -1) + r_emb = paddle.concat([r_emb_pad, r_emb], 0) + r_bias_pad = r_bias[0:1].expand(klen - r_bias.shape[0], -1) + r_bias = paddle.concat([r_bias_pad, r_bias], 0) + else: + r_emb = r_emb[-klen:] + r_bias = r_bias[-klen:] + + rw_head_q = w_head_q + r_w_bias.unsqueeze([0]) + + AC = paddle.einsum("bind,bjnd->bnij", rw_head_q, w_head_k) + r_emb = r_emb.unsqueeze([0]).expand([bsz, -1, -1, -1]) + B_ = paddle.einsum("bind,bjnd->bnij", w_head_q, r_emb) + D_ = r_bias.unsqueeze([0, 2]) + BD = self._rel_shift(B_ + D_) + + attn_score = AC + BD + attn_score = attn_score * self.scale + + if attn_mask is not None: + attn_score = attn_score - float("inf") * attn_mask + + attn_prob = F.softmax(attn_score, dim=-1) + attn_prob = self.attn_drop(attn_prob) + + attn_vec = paddle.einsum("bnij,bjnd->bind", attn_prob, w_head_v) + + attn_vec = paddle.reshape(attn_vec, shape=[attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head]) + + attn_out = self.o_net(attn_vec) + attn_out = self.drop(attn_out) + + if self.normalize_before: + output = w + attn_out + else: + output = self.layer_norm(w + attn_out) + + return output + + 
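+# Decoder layers pair one attention variant with a position-wise FFN. MemTransformerLM
+# selects the variant through `attn_type`: 0 -> RelPartialLearnableDecoderLayer
+# (Transformer-XL relative attention), 1 -> RelLearnableDecoderLayer (learnable
+# relative embeddings), 2/3 -> the plain MultiHeadAttn-based DecoderLayer.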
+class DecoderLayer(nn.Layer): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(DecoderLayer, self).__init__() + + self.dec_attn = MultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFFN(d_model, d_inner, dropout, normalize_before=kwargs.get("normalize_before")) + + def forward(self, dec_inp, dec_attn_mask=None, mems=None): + + output = self.dec_attn(dec_inp, attn_mask=dec_attn_mask, mems=mems) + output = self.pos_ff(output) + + return output + + +class RelLearnableDecoderLayer(nn.Layer): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(RelLearnableDecoderLayer, self).__init__() + + self.dec_attn = RelLearnableMultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFFN(d_model, d_inner, dropout, normalize_before=kwargs.get("normalize_before")) + + def forward(self, dec_inp, r_emb, r_w_bias, r_bias, dec_attn_mask=None, mems=None): + + output = self.dec_attn(dec_inp, r_emb, r_w_bias, r_bias, attn_mask=dec_attn_mask, mems=mems) + output = self.pos_ff(output) + + return output + + +class RelPartialLearnableDecoderLayer(nn.Layer): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(RelPartialLearnableDecoderLayer, self).__init__() + + self.dec_attn = RelPartialLearnableMultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFFN(d_model, d_inner, dropout, normalize_before=kwargs.get("normalize_before")) + + def forward(self, dec_inp, r, r_w_bias, r_r_bias, dec_attn_mask=None, mems=None): + output = self.dec_attn(dec_inp, r, r_w_bias, r_r_bias, attn_mask=dec_attn_mask, mems=mems) + output = self.pos_ff(output) + + return output + + +class AdaptiveEmbedding(nn.Layer): + def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, sample_softmax=False): + super(AdaptiveEmbedding, self).__init__() + + self.n_token = n_token + self.d_embed = d_embed + + self.cutoffs = cutoffs + [n_token] + self.div_val = div_val + self.d_proj = d_proj + + self.emb_scale = d_proj**0.5 + + self.cutoff_ends = [0] + self.cutoffs + + self.emb_layers = nn.LayerList() + self.emb_projs = nn.ParameterList() + if div_val == 1: + self.emb_layers.append( + nn.Embedding( + n_token, + d_embed, + sparse=sample_softmax > 0, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + if d_proj != d_embed: + self.emb_projs.append( + paddle.create_parameter( + shape=[d_embed, d_proj], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + else: + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + d_emb_i = d_embed // (div_val**i) + self.emb_layers.append( + nn.Embedding(r_idx - l_idx, d_emb_i, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01)) + ) + self.emb_projs.append( + paddle.create_parameter( + shape=[d_emb_i, d_proj], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + + def forward(self, inp): + if self.div_val == 1: + embed = self.emb_layers[0](inp) + if self.d_proj != self.d_embed: + embed = F.linear(embed, self.emb_projs[0]) + else: + inp_flat = paddle.reshape(inp, shape=[-1]) + emb_flat = paddle.zeros([inp_flat.shape[0], self.d_proj], dtype=global_dtype) + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + + mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx) + indices_i = paddle.nonzero(mask_i).squeeze([1]) + + if 
indices_i.numel() == 0: + continue + + inp_i = paddle.gather(inp_flat, indices_i, axis=0) - l_idx + emb_i = self.emb_layers[i](inp_i) + emb_i = F.linear(emb_i, self.emb_projs[i]) + + emb_flat = paddle.scatter(emb_flat, indices_i, emb_i) + + embed = paddle.reshape(emb_flat, shape=inp.shape.append(self.d_proj)) + + embed = embed * self.emb_scale + + return embed + + +class MemTransformerLM(nn.Layer): + def __init__( + self, + n_token, + n_layer, + n_head, + d_model, + d_head, + d_inner, + dropout, + attn_dropout, + tie_weight=True, + d_embed=None, + div_val=1, + tie_projs=[False], + normalize_before=False, + tgt_len=None, + ext_len=None, + mem_len=None, + cutoffs=[], + adapt_inp=False, + same_length=False, + attn_type=0, + clamp_len=-1, + sample_softmax=-1, + ): + super(MemTransformerLM, self).__init__() + self.n_token = n_token + + d_embed = d_model if d_embed is None else d_embed + self.d_embed = d_embed + self.d_model = d_model + self.n_head = n_head + self.d_head = d_head + + self.word_emb = AdaptiveEmbedding(n_token, d_embed, d_model, cutoffs, div_val=div_val) + + self.drop = nn.Dropout(dropout) + + self.n_layer = n_layer + + self.tgt_len = tgt_len + self.mem_len = mem_len + self.ext_len = ext_len + self.max_klen = tgt_len + ext_len + mem_len + + self.attn_type = attn_type + + self.layers = nn.LayerList() + if attn_type == 0: + for i in range(n_layer): + self.layers.append( + RelPartialLearnableDecoderLayer( + n_head, + d_model, + d_head, + d_inner, + dropout, + tgt_len=tgt_len, + ext_len=ext_len, + mem_len=mem_len, + attn_dropout=attn_dropout, + normalize_before=normalize_before, + ) + ) + elif attn_type == 1: + for i in range(n_layer): + self.layers.append( + RelLearnableDecoderLayer( + n_head, + d_model, + d_head, + d_inner, + dropout, + tgt_len=tgt_len, + ext_len=ext_len, + mem_len=mem_len, + attn_dropout=attn_dropout, + normalize_before=normalize_before, + ) + ) + elif attn_type in [2, 3]: + for i in range(n_layer): + self.layers.append( + DecoderLayer( + n_head, + d_model, + d_head, + d_inner, + dropout, + attn_dropout=attn_dropout, + normalize_before=normalize_before, + ) + ) + + self.sample_softmax = sample_softmax + if sample_softmax > 0: + self.out_layer = nn.Linear( + d_model, + n_token, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + self.tie_weight = tie_weight + self.sampler = LogUniformSampler(n_token, sample_softmax) + else: + self.crit = ProjAdaptiveSoftmax(n_token, d_embed, d_model, cutoffs, div_val=div_val) + + if tie_weight: + for i in range(len(self.crit.out_layers_weight)): + self.crit.out_layers_weight[i] = self.word_emb.emb_layers[i].weight + + if tie_projs: + for i, tie_proj in enumerate(tie_projs): + if tie_proj and div_val == 1 and d_model != d_embed: + self.crit.out_projs[i] = self.word_emb.emb_projs[0] + elif tie_proj and div_val != 1: + self.crit.out_projs[i] = self.word_emb.emb_projs[i] + + self.same_length = same_length + self.clamp_len = clamp_len + + self._create_params() + + def backward_compatible(self): + self.sample_softmax = -1 + + def _create_params(self): + if self.attn_type == 0: + self.pos_emb = PositionEmbedding(self.d_model) + self.r_w_bias = paddle.create_parameter( + shape=[self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.r_r_bias = paddle.create_parameter( + shape=[self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) 
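+        # attn_type 1 instead learns per-layer relative position embeddings (r_emb)
+        # together with per-layer r_w_bias and r_bias terms.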
+ elif self.attn_type == 1: + self.r_emb = paddle.create_parameter( + shape=[self.n_layer, self.max_klen, self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.r_w_bias = paddle.create_parameter( + shape=[self.n_layer, self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.r_bias = paddle.create_parameter( + shape=[self.n_layer, self.max_klen, self.n_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + elif self.attn_type == 2: + self.pos_emb = PositionEmbedding(self.d_model) + elif self.attn_type == 3: + self.r_emb = paddle.create_parameter( + shape=[self.n_layer, self.max_klen, self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + + def reset_length(self, tgt_len, ext_len, mem_len): + self.tgt_len = tgt_len + self.mem_len = mem_len + self.ext_len = ext_len + + def init_mems(self, batch_size, d_model): + if self.mem_len > 0: + mems = [] + for _ in range(self.n_layer + 1): + empty = paddle.empty(shape=[batch_size, 0, d_model], dtype=global_dtype) + mems.append(empty) + + return mems + else: + return None + + def _update_mems(self, hids, mems, qlen, mlen): + if mems is None: + return None + + assert len(hids) == len(mems), "length of hids and length of mems must be the same. " + + with paddle.no_grad(): + new_mems = [] + end_idx = mlen + max(0, qlen - 0 - self.ext_len) + beg_idx = max(0, end_idx - self.mem_len) + for i in range(len(hids)): + cat = paddle.concat([mems[i], hids[i]], axis=1) + new_mems.append(cat[:, beg_idx:end_idx].detach()) + + return new_mems + + def _forward(self, dec_inputs, mems=None): + bsz, qlen = dec_inputs.shape + + word_emb = self.word_emb(dec_inputs) + + mlen = mems[0].shape[1] if mems is not None else 0 + klen = mlen + qlen + if self.same_length: + all_ones = paddle.ones(shape=[qlen, klen], dtype=word_emb.dtype) + mask_len = klen - self.mem_len + if mask_len > 0: + mask_shift_len = qlen - mask_len + else: + mask_shift_len = qlen + dec_attn_mask = ( + paddle.triu(all_ones, diagonal=1 + mlen) + paddle.tril(all_ones, -mask_shift_len) + ).unsqueeze([0, 1]) + else: + dec_attn_mask = paddle.ones(shape=[qlen, klen], dtype=word_emb.dtype) + dec_attn_mask = paddle.triu(dec_attn_mask, diagonal=1 + mlen).unsqueeze([0, 1]) + + hids = [] + if self.attn_type == 0: + pos_seq = paddle.arange(klen - 1, -1, -1.0, dtype=word_emb.dtype) + if self.clamp_len > 0: + # TODO: clamp and clip + pos_seq = paddle.clip(pos_seq, max=self.clamp_len) + pos_emb = self.pos_emb(pos_seq, bsz) + + core_out = self.drop(word_emb) + pos_emb = self.drop(pos_emb) + + hids.append(core_out) + for i, layer in enumerate(self.layers): + mems_i = None if mems is None else mems[i] + core_out = layer( + core_out, pos_emb, self.r_w_bias, self.r_r_bias, dec_attn_mask=dec_attn_mask, mems=mems_i + ) + hids.append(core_out) + elif self.attn_type == 1: + core_out = self.drop(word_emb) + hids.append(core_out) + for i, layer in enumerate(self.layers): + if self.clamp_len > 0: + r_emb = self.r_emb[i][-self.clamp_len :] + r_bias = self.r_bias[i][-self.clamp_len :] + else: + r_emb, r_bias = self.r_emb[i], self.r_bias[i] + + mems_i = None if mems is None else mems[i] + core_out = layer(core_out, r_emb, self.r_w_bias[i], r_bias, dec_attn_mask=dec_attn_mask, mems=mems_i) + hids.append(core_out) + elif self.attn_type == 2: + pos_seq = 
paddle.arange(klen - 1, -1, -1.0, dtype=word_emb.dtype) + if self.clamp_len > 0: + pos_seq = paddle.clip(pos_seq, max=self.clamp_len) + pos_emb = self.pos_emb(pos_seq, bsz) + + core_out = self.drop(word_emb + pos_emb[-qlen:]) + + hids.append(core_out) + for i, layer in enumerate(self.layers): + mems_i = None if mems is None else mems[i] + if mems_i is not None and i == 0: + mems_i += pos_emb[:mlen] + core_out = layer(core_out, dec_attn_mask=dec_attn_mask, mems=mems_i) + hids.append(core_out) + elif self.attn_type == 3: + core_out = self.drop(word_emb) + + hids.append(core_out) + for i, layer in enumerate(self.layers): + mems_i = None if mems is None else mems[i] + if mems_i is not None and mlen > 0: + cur_emb = self.r_emb[i][:-qlen] + cur_size = cur_emb.size(0) + if cur_size < mlen: + cur_emb_pad = cur_emb[0:1].expand(mlen - cur_size, -1, -1) + cur_emb = paddle.concat([cur_emb_pad, cur_emb], 0) + else: + cur_emb = cur_emb[-mlen:] + mems_i += cur_emb.view(mlen, 1, -1) + core_out += self.r_emb[i][-qlen:].view(qlen, 1, -1) + + core_out = layer(core_out, dec_attn_mask=dec_attn_mask, mems=mems_i) + hids.append(core_out) + + core_out = self.drop(core_out) + + new_mems = self._update_mems(hids, mems, mlen, qlen) + + return core_out, new_mems + + def forward(self, data, target, *mems): + if not mems: + batch_size = data.shape[0] + mems = self.init_mems(batch_size, self.d_model) + + hidden, new_mems = self._forward(data, mems=mems) + + # TODO(FrostML): use getitem. + tgt_len = target.shape[1] + pred_hid = paddle.slice(hidden, [1], [-tgt_len], [hidden.shape[1]]) + if self.sample_softmax > 0 and self.training: + assert self.tie_weight, "tie_weight must be True if sample_softmax > 0" + logit = sample_logits(self.word_emb, self.out_layer.bias, target, pred_hid, self.sampler) + loss = -paddle.log(F.softmax(logit, axis=-1))[:, :, 0] + else: + loss = self.crit( + paddle.reshape(pred_hid, shape=[-1, pred_hid.shape[-1]]), paddle.reshape(target, shape=[-1]) + ) + + if new_mems is None: + return [loss.mean()] + else: + return [loss.mean()] + new_mems diff --git a/examples/language_model/transformer-xl/reader.py b/examples/language_model/transformer-xl/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..390392e9046a5a78110d36b57f0bd03c2812d271 --- /dev/null +++ b/examples/language_model/transformer-xl/reader.py @@ -0,0 +1,193 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import paddle.distributed as dist +from paddle.io import DataLoader, IterableDataset + +from paddlenlp.data import Vocab + + +class LMDataset(IterableDataset): + def __init__(self, mode, vocab, path, dataset_name, batch_size, bptt, ext_len, nranks, rank): + assert mode in ["train", "valid", "test"], "Parameter mode must be one of [train, valid, test]." 
+ + super(LMDataset, self).__init__() + self.vocab = vocab + self.dataset_name = dataset_name + + if self.dataset_name in ["wt103"]: + self.data = self.read_raw_data(filename=os.path.join(path, mode + ".txt"), ordered=True, lower_case=False) + elif self.dataset_name in ["enwik8", "text8"]: + self.data = self.read_raw_data(filename=os.path.join(path, mode + ".txt"), ordered=True, add_eos=False) + else: + raise ValueError("Not supported dataset yet. ") + self.rank = rank + self.batch_size = batch_size + batch_size *= nranks + + self.bptt = bptt + self.ext_len = ext_len if ext_len is not None else 0 + + self.num_step = len(self.data) // batch_size + data = self.data[: self.num_step * batch_size] + self.data = data.reshape([batch_size, -1]) + + # Number of samples + self.num_samples = (self.num_step + self.bptt - 1) // self.bptt + + def __len__(self): + return self.num_samples + + def __iter__(self): + for i in range(0, self.data.shape[1] - 1, self.bptt): + seq_len = min(self.bptt, self.data.shape[1] - 1 - i) + end_idx = i + seq_len + beg_idx = max(0, i - self.ext_len) + src = self.data[:, beg_idx:end_idx] + target = self.data[:, i + 1 : i + 1 + seq_len] + + # NOTE: For now, DataLoader can yield `int`. It's not necessary + # to transfer `seq_len` after DataLoader. + # However, if it's necessary to use `seq_len` as input for some + # PaddlePaddle op, then it must be yielded by `[seq_len]` whose + # shape is [1], cause some op cannot use shape [] as input. + yield [ + src[self.rank * self.batch_size : (self.rank + 1) * self.batch_size], + target[self.rank * self.batch_size : (self.rank + 1) * self.batch_size], + seq_len, + ] + + def read_raw_data( + self, filename, ordered=False, lower_case=True, delimiter=None, add_eos=True, add_double_eos=False + ): + assert os.path.exists(filename), "%s is not exist. " % filename + + data = [] + with open(filename, "r", encoding="utf-8") as f: + for line in f: + tokens = LMDataset.tokenize(line=line, delimiter=delimiter, lower_case=lower_case) + if add_double_eos: # for lm1b + tokens = ( + [self.vocab._identifiers_to_tokens["bos_token"]] + + tokens + + [self.vocab._identifiers_to_tokens["bos_token"]] + ) + elif add_eos: + tokens = tokens + [self.vocab._identifiers_to_tokens["eos_token"]] + data.append(np.asarray(self.get_indices(tokens)).astype("int64")) + + if ordered: + data = np.concatenate(data) + + return data + + def get_indices(self, tokens): + return self.vocab.to_indices(tokens) + + @classmethod + def get_vocab( + cls, + files, + max_size=None, + min_freq=0, + lower_case=True, + delimiter=None, + unk_token=None, + pad_token=None, + bos_token=None, + eos_token=None, + **kwargs + ): + return Vocab.build_vocab( + cls.data_iterator(files=files, delimiter=delimiter, lower_case=lower_case), + max_size=max_size, + min_freq=min_freq, + unk_token=unk_token, + pad_token=pad_token, + bos_token=bos_token, + eos_token=eos_token, + ) + + @classmethod + def tokenize(cls, line, delimiter=None, lower_case=True): + line = line.strip() + if lower_case: + line = line.lower() + tokens = list(line) if delimiter == "" else line.split(delimiter) + return tokens + + @classmethod + def data_iterator(cls, files, delimiter=None, lower_case=True): + if isinstance(files, str): + files = [files] + elif not isinstance(files, (list, tuple)): + raise ValueError("The parameter files must be a str or a list/tuple.") + + for fl in files: + assert os.path.exists(fl), "%s is not exist. 
" % fl + + with open(fl, "r", encoding="utf-8") as f: + for line in f: + tokens = cls.tokenize(line=line, delimiter=delimiter, lower_case=lower_case) + yield tokens + + +def get_lm_data_loader(args, vocab, mode="train"): + lm_dataset = LMDataset( + mode=mode, + vocab=vocab, + path=args.data, + dataset_name=args.dataset, + batch_size=args.batch_size if mode == "train" else args.eval_batch_size, + bptt=args.tgt_len, + ext_len=args.ext_len, + nranks=dist.get_world_size() if mode == "train" else 1, + rank=dist.get_rank() if mode == "train" else 0, + ) + + data_loader = DataLoader(dataset=lm_dataset, batch_size=None, num_workers=0, return_list=True) + + return data_loader + + +def get_lm_vocab(args): + kwargs = {"unk_token": ""} + if args.token_delimiter == "None": + kwargs["delimiter"] = None + else: + kwargs["delimiter"] = args.token_delimiter + + if args.dataset == "wt103": + kwargs["eos_token"] = "" + kwargs["lower_case"] = False + + if args.dataset in ["enwik8", "text8"]: + files = [ + os.path.join(args.data, "train.txt"), + os.path.join(args.data, "valid.txt"), + os.path.join(args.data, "test.txt"), + ] + elif args.dataset == "wt103": + files = [os.path.join(args.data, "train.txt")] + else: + raise ValueError("Not supported dataset yet. ") + + vocab = LMDataset.get_vocab(files, **kwargs) + args.ntokens = len(vocab) + print("Finish processing vocabulary, and the size of vocabulary is {}".format(args.ntokens)) + + return vocab diff --git a/examples/language_model/transformer-xl/train.py b/examples/language_model/transformer-xl/train.py new file mode 100644 index 0000000000000000000000000000000000000000..579a9114efd9cbc9f554499037f0379aa896b4b7 --- /dev/null +++ b/examples/language_model/transformer-xl/train.py @@ -0,0 +1,297 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +import time +from pprint import pprint + +import numpy as np +import paddle +import paddle.distributed as dist +import yaml +from attrdict import AttrDict +from mem_transformer import MemTransformerLM +from reader import get_lm_data_loader, get_lm_vocab + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./configs/enwik8.yaml", type=str, help="Path of the config file. 
") + args = parser.parse_args() + return args + + +def do_train(args): + if args.use_gpu: + rank = dist.get_rank() + trainer_count = dist.get_world_size() + else: + rank = 0 + trainer_count = 1 + paddle.set_device("cpu") + + if trainer_count > 1: + dist.init_parallel_env() + + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + vocab = get_lm_vocab(args) + train_loader = get_lm_data_loader(args, vocab, "train") + eval_loader = get_lm_data_loader(args, vocab, "valid") + + cutoffs, tie_projs = [], [False] + if args.adaptive: + assert args.dataset in ["wt103", "lm1b"] + if args.dataset == "wt103": + cutoffs = [20000, 40000, 200000] + tie_projs += [True] * len(cutoffs) + elif args.dataset == "lm1b": + cutoffs = [60000, 100000, 640000] + tie_projs += [False] * len(cutoffs) + + mem_transformer = MemTransformerLM( + args.ntokens, + args.n_layer, + args.n_head, + args.d_model, + args.d_head, + args.d_inner_hid, + args.dropout, + args.attn_dropout, + tie_weight=args.tie_weight, + d_embed=args.d_model, + div_val=args.div_val, + tie_projs=tie_projs, + normalize_before=args.normalize_before, + tgt_len=args.tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len, + cutoffs=cutoffs, + same_length=args.same_length, + attn_type=args.attn_type, + clamp_len=args.clamp_len, + sample_softmax=args.sample_softmax, + ) + + if args.scheduler == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay( + learning_rate=args.learning_rate, T_max=args.max_step, eta_min=args.eta_min + ) + elif args.scheduler == "noam": + scheduler = paddle.optimizer.lr.NoamDecay( + d_model=args.d_model, warmup_steps=args.warmup_steps, learning_rate=args.learning_rate + ) + elif args.scheduler == "dev_perf": + paddle.optimizer.lr.ReduceOnPlateau( + learning_rate=args.learning_rate, factor=args.decay_rate, patience=args.patience, min_lr=args.lr_min + ) + elif args.scheduler == "constant": + scheduler = args.learning_rate + + clip = paddle.nn.ClipGradByGlobalNorm(args.clip) + if args.optim.lower() == "momentum": + optimizer = paddle.optimizer.Momentum( + learning_rate=scheduler, parameters=mem_transformer.parameters(), momentum=args.mom, grad_clip=clip + ) + elif args.optim.lower() == "adam": + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + parameters=mem_transformer.parameters(), + beta1=args.beta1, + beta2=args.beta2, + epsilon=eval(args.eps), + grad_clip=clip, + ) + elif args.optim.lower() == "adagrad": + optimizer = paddle.optimizer.Adagrad( + learning_rate=scheduler, parameters=mem_transformer.parameters(), grad_clip=clip + ) + + # Init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict = paddle.load(os.path.join(args.init_from_checkpoint, "mem_transformer.pdparams")) + opt_dict = paddle.load(os.path.join(args.init_from_checkpoint, "mem_transformer.pdopt")) + mem_transformer.set_state_dict(model_dict) + optimizer.set_state_dict(opt_dict) + print("loaded from checkpoint.") + # Init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict = paddle.load(os.path.join(args.init_from_pretrain_model, "mem_transformer.pdparams")) + mem_transformer.set_state_dict(model_dict) + print("loaded from pre-trained model.") + + if trainer_count > 1: + mem_transformer = paddle.DataParallel(mem_transformer) + + step_idx = 0 + train_loss = 0.0 + + log_start_time = time.time() + + for pass_id in range(args.epoch): + batch_id = 0 + + mems = tuple() + for input_data in train_loader: 
+ (src, target, seq_len) = input_data + ret = mem_transformer(src, target, *mems) + loss = ret[0] + mems = ret[1:] + train_loss += loss.numpy() + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + if step_idx > 0 and step_idx % args.print_step == 0 and rank == 0: + cur_loss = train_loss / args.print_step + elapsed = time.time() - log_start_time + if args.scheduler == "constant": + lr = optimizer.get_lr() + else: + lr = scheduler.get_lr() + logger_info = ( + "step_idx: %d, epoch: %d, batch: %d, learning rate: %.8f, " + "speed: %f ms/batch, loss: %f" + % (step_idx, pass_id, batch_id, lr, elapsed * 1000.0 / args.print_step, cur_loss) + ) + if args.dataset in ["enwik8", "text8"]: + logger_info = logger_info + ", bpc: %f" % (cur_loss / np.log(2)) + else: + logger_info = logger_info + ", ppl: %f" % (np.exp(cur_loss)) + + logger.info(logger_info) + train_loss = 0.0 + log_start_time = time.time() + + if step_idx % args.save_step == 0 and step_idx != 0: + # Do validation. + mem_transformer.eval() + + # TODO(FrostML): simplify this. + if args.mem_len == 0: + if dist.get_world_size() == 1: + mem_transformer.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len + args.tgt_len - args.eval_tgt_len, + mem_len=args.mem_len, + ) + else: + mem_transformer._layers.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len + args.tgt_len - args.eval_tgt_len, + mem_len=args.mem_len, + ) + else: + if dist.get_world_size() == 1: + mem_transformer.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len + args.tgt_len - args.eval_tgt_len, + ) + else: + mem_transformer._layers.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len + args.tgt_len - args.eval_tgt_len, + ) + + total_len, total_loss = 0, 0.0 + + eval_mems = tuple() + with paddle.no_grad(): + for i, (src, target, seq_len) in enumerate(eval_loader): + if args.max_eval_steps > 0 and i >= args.max_eval_steps: + break + ret = mem_transformer(src, target, *eval_mems) + loss, eval_mems = ret[0], ret[1:] + eval_cur_loss = seq_len * loss.numpy() + total_loss += eval_cur_loss + total_len += seq_len + eval_loss = total_loss / total_len + + logger_info = "Validation, step_idx: %d, validation loss: %f" % (step_idx, eval_loss) + if args.dataset in ["enwik8", "text8"]: + logger_info = logger_info + ", bpc: %f" % (eval_loss / np.log(2)) + else: + logger_info = logger_info + ", ppl: %f" % (np.exp(eval_loss)) + logger.info(logger_info) + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(mem_transformer.state_dict(), os.path.join(model_dir, "mem_transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "mem_transformer.pdopt")) + f = open( + os.path.join(args.save_model, "step_" + str(step_idx), "evaluation_loss_" + str(eval_loss)), + "w", + ) + f.close() + + if args.scheduler == "dev_perf": + scheduler.step(eval_loss) + + # TODO(FrostML): simplify this. 
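+                    # Validation above ran with eval_tgt_len; restore the training
+                    # tgt_len/ext_len/mem_len before training continues.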
+ if dist.get_world_size() == 1: + mem_transformer.reset_length(tgt_len=args.tgt_len, ext_len=args.ext_len, mem_len=args.mem_len) + else: + mem_transformer._layers.reset_length( + tgt_len=args.tgt_len, ext_len=args.ext_len, mem_len=args.mem_len + ) + + mem_transformer.train() + + if step_idx >= args.max_step: + return + step_idx += 1 + batch_id += 1 + if args.scheduler in ["cosine", "dev_perf"]: + if step_idx < args.warmup_steps: + curr_lr = args.learning_rate * step_idx / args.warmup_steps + scheduler.base_lr = curr_lr + else: + if args.scheduler == "cosine": + scheduler.step() + elif args.scheduler == "constant": + if step_idx < args.warmup_steps: + curr_lr = args.learning_rate * step_idx / args.warmup_steps + optimizer.set_lr(curr_lr) + elif args.scheduler == "noam": + scheduler.step() + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(mem_transformer.state_dict(), os.path.join(model_dir, "mem_transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "mem_transformer.pdopt")) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + + do_train(args) diff --git a/examples/language_model/transformer-xl/utils/preprocess_text8.py b/examples/language_model/transformer-xl/utils/preprocess_text8.py new file mode 100644 index 0000000000000000000000000000000000000000..ad70ab65ccc2f5c818905bea3ad26af1dab67eb0 --- /dev/null +++ b/examples/language_model/transformer-xl/utils/preprocess_text8.py @@ -0,0 +1,33 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
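+
+# Split the raw text8 corpus into character-level train/valid/test files. The single
+# command-line argument is the number of characters held out for each of valid and
+# test; spaces are rewritten as "_" so every character becomes a space-separated token.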
+
+import sys
+import zipfile
+
+if __name__ == "__main__":
+    zipfile.ZipFile("text8.zip").extractall()
+    data = open("text8", "r", encoding="utf-8").read()
+
+    num_test_char = int(sys.argv[1])
+
+    train_data = data[: -2 * num_test_char]
+    valid_data = data[-2 * num_test_char : -num_test_char]
+    test_data = data[-num_test_char:]
+
+    for files, data in [("train.txt", train_data), ("valid.txt", valid_data), ("test.txt", test_data)]:
+        data_str = " ".join(["_" if c == " " else c for c in data.strip()])
+        with open(files, "w") as f:
+            f.write(data_str)
+        with open(files + ".raw", "w", encoding="utf-8") as fw:
+            fw.write(data)
diff --git a/examples/language_model/xlm/README.md b/examples/language_model/xlm/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..31a707f9f3ebcfa412d9caeffb6e9f53cc08b535
--- /dev/null
+++ b/examples/language_model/xlm/README.md
@@ -0,0 +1,128 @@
+# XLM—Enhancing BERT for Cross-lingual Language Model
+
+## 目录
+* [模型简介](#模型简介)
+* [模型实现的注意点](#模型实现的注意点)
+* [快速开始](#快速开始)
+  * [通用参数释义](#通用参数释义)
+  * [自然语言推断任务](#自然语言推断任务)
+* [参考资料](#参考资料)
+
+## 模型简介
+
+[XLM—Enhancing BERT for Cross-lingual Language Model](https://arxiv.org/abs/1901.07291) 是 facebook 团队提出的一个跨语言预训练模型。
+
+在这项工作中,他们将预训练方法扩展到多种语言,并展示了跨语言预训练的有效性。论文提出了两种学习跨语言语言模型 (XLM) 的方法:一种是**仅依赖单语数据的无监督方法**,另一种是**利用具有新的跨语言语言模型目标的并行数据的监督方法**。该方法在跨语言分类、无监督和有监督机器翻译方面获得了最先进的结果。在 XNLI 上,该方法以 4.9% 的绝对精度提升刷新了最新技术水平。在无监督机器翻译上,该方法在 WMT'16 German-English 上获得 34.3 BLEU,将之前的最新技术提高了 9 BLEU 以上。在有监督机器翻译上,该方法在 WMT'16 罗马尼亚语-英语上获得了 38.5 BLEU 的最新成绩,比之前的最佳方法高出 4 BLEU 以上。
+
+XLM 论文中一共提出了三种预训练任务:**CLM**、**MLM** 和 **TLM**。
+- **CLM:Causal Language Model**,无监督单语单向 LM 训练任务,即用 `Transformer` 进行单向的语言模型训练。
+- **MLM:Masked Language Model**,无监督单语双向 LM 训练任务,与 `BERT` 一样。
+- **TLM:Translation Language Model**,有监督翻译 LM 训练,拼接平行双语语料,然后执行 MLM,以期学到翻译的对齐信息。
+
+![framework](./framework.jpg)
+
+## 模型实现的注意点
+本仓库的模型在复现过程中主要参考了 huggingface 的实现,故在实现过程中与 facebook 团队的官方实现相比存在一定的不同。
+- 对于 `token_pair` 任务,`huggingface` 的 `tokenizer` 会额外添加 `<s> A </s></s> B </s>` 的标记,而 `facebook` 的 `tokenizer` 会添加 `</s> A </s></s> B </s>` 的标记。本仓库的实现遵循了 `huggingface` 的实现,主要区别在于第一个特殊标记使用了 `<s>` 而不是 `</s>`。
+- facebook 的 XLM 模型并未使用 `token_type_id` 参数,因此实际使用 `tokenizer` 时需要人工传入 `return_token_type_ids=False`,如:`tokenizer(text, return_token_type_ids=False)`,这样就不会返回 `token_type_id` 了。
+- 考虑到现有已开源预训练权重的 XLM 模型在 `XLMPredLayer` 处并未使用 `adaptive_softmax`,因此本仓库仅实现了带有 `cross_entropy` 的 `XLMPredLayer`。
+
+本文件夹内包含了 `XLM模型` 在 `xnli任务` 上的训练和验证内容。以下是本例的简要目录结构及说明:
+
+```text
+.
+├── README.md # README文档 +├── xnli_train.py # 自然语言推断训练代码 +├── xnli_eval.py # 自然语言推断评估代码 +``` + +## 快速开始 + +### xlm tokenizer依赖安装 + +```shell +# sacremoses +pip install sacremoses +# Thai tokenizer +pip install pythainlp +# Japanese tokenizer +git clone https://github.com/neubig/kytea.git +cd kytea +autoreconf -i +./configure --prefix=$HOME/local +make && make install +pip install kytea +# Chinese tokenizer +pip install jieba +``` + +### 通用参数释义 +- `model_name_or_path` 指示了 Fine-tuning 使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"xlm-mlm-tlm-xnli15-1024"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `output_dir` 表示模型保存路径。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断,不足该长度的将会进行 padding。 +- `learning_rate` 表示基础学习率大小,本代码并未使用学习率warmup和衰减。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `batch_size` 表示每次迭代**每张**卡上的样本数目。 +- `adam_epsilon` 表示Adam优化器的epsilon。 +- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +- `seed` 表示随机数种子。 +- `device` 表示训练使用的设备, `'gpu'`表示使用 GPU, `'xpu'`表示使用百度昆仑卡, `'cpu'`表示使用 CPU。 +- `use_amp` 表示是否启用自动混合精度训练。 +- `scale_loss` 表示自动混合精度训练的参数。 + +### 自然语言推断任务 + +#### 数据集介绍 +XNLI 是 MNLI 的子集,并且已被翻译成14种不同的语言(包含一些较低资源语言)。与 MNLI 一样,目标是预测文本蕴含(句子 A 是否暗示/矛盾/都不是句子 B )。 + +#### 单卡训练 + +```shell +python xnli_train.py \ + --batch_size 8 \ + --model_name_or_path xlm-mlm-tlm-xnli15-1024 \ + --save_steps 24544 \ + --output_dir outputs +``` + +#### 单卡评估 + +```shell +python xnli_eval.py \ + --batch_size 8 \ + --model_name_or_path outputs/best_model +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus 0,1 --log_dir outputs xnli_train.py \ + --batch_size 8 \ + --model_name_or_path xlm-mlm-tlm-xnli15-1024 \ + --save_steps 24544 \ + --output_dir outputs +``` + +在XNLI数据集上微调 cross-lingual-transfer 类型的自然语言推断任务后,在测试集上有如下结果 +| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | Avg | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| XLM | 84.6 | 79.2 | 79.8 | 76.9 | 76.6 | 77.6 | 76.2 | 71.7 | 73.8 | 74.5 | 71.1 | 74.8 | 68.8 | 69.2 | 65.8 | 74.7 | + + +## 参考资料 +- https://github.com/facebookresearch/XLM +- https://github.com/huggingface/transformers/tree/main/src/transformers/models/xlm + +## 引用 + +Bibtex: +```tex +@article{lample2019cross, + title={Cross-lingual Language Model Pretraining}, + author={Lample, Guillaume and Conneau, Alexis}, + journal={Advances in Neural Information Processing Systems (NeurIPS)}, + year={2019} +} +``` diff --git a/examples/language_model/xlm/framework.jpg b/examples/language_model/xlm/framework.jpg new file mode 100644 index 0000000000000000000000000000000000000000..96d613f945621029f5451d6acc273ca84a513600 Binary files /dev/null and b/examples/language_model/xlm/framework.jpg differ diff --git a/examples/language_model/xlm/xnli_eval.py b/examples/language_model/xlm/xnli_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..54e65cfdc65c0a3d90867815a5d28684b6cad6c2 --- /dev/null +++ b/examples/language_model/xlm/xnli_eval.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial +import numpy as np + +import paddle +from paddle.io import BatchSampler, DataLoader +from paddlenlp.transformers import XLMForSequenceClassification, XLMTokenizer +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Stack, Tuple, Pad +from paddle.metric import Accuracy + +all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU/XPU for training.", + ) + parser.add_argument( + "--max_seq_length", + default=256, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + args = parser.parse_args() + return args + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, language, tokenizer): + metric.reset() + for batch in data_loader: + input_ids, attention_mask, labels = batch + # add lang_ids + lang_ids = paddle.ones_like(input_ids) * tokenizer.lang2id[language] + logits = model(input_ids, langs=lang_ids, attention_mask=attention_mask) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("[%s] acc: %s " % (language.upper(), res)) + return res + + +def convert_example(example, tokenizer, max_seq_length=256, language="en"): + """convert a example into necessary features""" + # Get the label + label = example["label"] + premise = example["premise"] + hypothesis = example["hypothesis"] + # Convert raw text to feature + example = tokenizer( + premise, + text_pair=hypothesis, + max_length=max_seq_length, + return_attention_mask=True, + return_token_type_ids=False, + lang=language, + ) + return example["input_ids"], example["attention_mask"], label + + +def get_test_dataloader(args, language, tokenizer): + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=0, dtype="int64"), # attention_mask + Stack(dtype="int64"), # labels + ): fn(samples) + # make sure language is `language`` + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=language) + test_ds = load_dataset("xnli", language, splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = BatchSampler(test_ds, batch_size=args.batch_size * 4, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + return test_data_loader + + +def do_eval(args): + paddle.set_device(args.device) + tokenizer = 
XLMTokenizer.from_pretrained(args.model_name_or_path) + model = XLMForSequenceClassification.from_pretrained(args.model_name_or_path) + model.eval() + metric = Accuracy() + all_languages_acc = [] + for language in all_languages: + test_dataloader = get_test_dataloader(args, language, tokenizer) + acc = evaluate(model, metric, test_dataloader, language, tokenizer) + all_languages_acc.append(acc) + print("test mean acc: %.4f" % np.mean(all_languages_acc)) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_eval(args) diff --git a/examples/language_model/xlm/xnli_train.py b/examples/language_model/xlm/xnli_train.py new file mode 100644 index 0000000000000000000000000000000000000000..d19d40ec6f14dd65710053f407a78aa0145dc2c1 --- /dev/null +++ b/examples/language_model/xlm/xnli_train.py @@ -0,0 +1,279 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from paddle.metric import Accuracy +from paddle.optimizer import Adam + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import XLMForSequenceClassification, XLMTokenizer + +all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=256, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=2e-6, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--dropout", default=0.1, type=float, help="Dropout rate.") + parser.add_argument( + "--num_train_epochs", + default=5, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=200, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=24544, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU/XPU for training.", + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, language, tokenizer): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, attention_mask, labels = batch + # add lang_ids + lang_ids = paddle.ones_like(input_ids) * tokenizer.lang2id[language] + logits = model(input_ids, langs=lang_ids, attention_mask=attention_mask) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("[%s] acc: %s " % (language.upper(), res)) + model.train() + return res + + +def convert_example(example, tokenizer, max_seq_length=256, language="en"): + """convert a example into necessary features""" + # Get the label + label = example["label"] + premise = example["premise"] + hypothesis = example["hypothesis"] + # Convert raw text to feature + example = tokenizer( + premise, + text_pair=hypothesis, + max_length=max_seq_length, + return_attention_mask=True, + return_token_type_ids=False, + lang=language, + ) + return example["input_ids"], example["attention_mask"], label + + +def get_test_dataloader(args, language, batchify_fn, tokenizer): + # make sure language is `language`` + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=language) + test_ds = load_dataset("xnli", language, splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = BatchSampler(test_ds, batch_size=args.batch_size * 4, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + return test_data_loader + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + tokenizer = XLMTokenizer.from_pretrained(args.model_name_or_path) + + # define train dataset language + language = "en" + train_ds = load_dataset("xnli", language, splits="train") + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=language) + + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=0, dtype="int64"), # attention_mask + Stack(dtype="int64"), # labels + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + model = XLMForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=3, dropout=args.dropout) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + optimizer = Adam( + learning_rate=args.learning_rate, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + ) + + loss_fct = nn.CrossEntropyLoss() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + metric = Accuracy() + + global_step = 0 + tic_train = time.time() + max_test_acc = 0.0 + 
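+    # NOTE: max_test_acc tracks the best mean accuracy over all XNLI test languages seen so far;
+    # only a checkpoint that matches or improves on it is saved to "best_model" below.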
print(f"num_training_steps {num_training_steps}") + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, attention_mask, labels = batch + lang_ids = paddle.ones_like(input_ids) * tokenizer.lang2id[language] + + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, langs=lang_ids, attention_mask=attention_mask) + loss = loss_fct(logits, labels) + + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.minimize(optimizer, scaled_loss) + else: + loss.backward() + optimizer.step() + + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + all_languages_acc = [] + for language in all_languages: + test_data_loader = get_test_dataloader(args, language, batchify_fn, tokenizer) + acc = evaluate(model, metric, test_data_loader, language, tokenizer) + all_languages_acc.append(acc) + test_mean_acc = np.mean(all_languages_acc) + print("test mean acc: %.4f" % test_mean_acc) + + if paddle.distributed.get_rank() == 0: + if test_mean_acc >= max_test_acc: + max_test_acc = test_mean_acc + output_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("best test mean acc: %.4f" % max_test_acc) + print("Save model and tokenizer to %s" % output_dir) + + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/xlnet/README.md b/examples/language_model/xlnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bb17af1529bb4864e1c270c39e687594b13b12d9 --- /dev/null +++ b/examples/language_model/xlnet/README.md @@ -0,0 +1,67 @@ +# XLNet + +## 模型简介 + +[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) 是一款无监督的自回归预训练语言模型。 有别于传统的单向自回归模型,XLNet通过最大化输入序列所有排列的期望来进行语言建模,这使得它可以同时关注到上下文的信息。 另外,XLNet在预训练阶段集成了 [Transformer-XL](https://arxiv.org/abs/1901.02860) 模型,Transformer-XL中的片段循环机制(Segment Recurrent Mechanism)和 相对位置编码(Relative Positional Encoding)机制能够支持XLNet接受更长的输入序列,这使得XLNet在长文本序列的语言任务上有着优秀的表现。 + +本项目是XLNet在 Paddle 2.0上的开源实现,包含了在 [GLUE评测任务](https://gluebenchmark.com/tasks) 上的微调代码。 + +## 快速开始 + +### 环境依赖 + +- sentencepiece + +安装命令:`pip install sentencepiece` + +### 数据准备 + +GLUE评测任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_glue.py`执行时将会自动下载。 + +### 执行Fine-tuning + +以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" ./run_glue.py \ + --model_name_or_path 
xlnet-base-cased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 500 \ + --output_dir ./tmp/ +``` + +其中参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `task_name` 表示Fine-tuning的任务。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 + +基于`xlnet-base-cased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:------------------:| +| SST-2 | Accuracy | 94.266 | +| QNLI | Accuracy | 91.708 | +| CoLA | Mattehew's corr | 50.264 | +| MRPC | F1/Accuracy | 91.071/87.745 | +| STS-B | Person/Spearman corr | 86.243/85.973 | +| QQP | Accuracy/F1 | 90.838/87.644 | +| MNLI | Matched acc/MisMatched acc | 87.468/86.859 | +| RTE | Accuracy | 70.036 | +| WNLI | Accuracy | 56.338 | + +## Reference + +- [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) +- [zihangdai/xlnet](https://github.com/zihangdai/xlnet) diff --git a/examples/language_model/xlnet/run_glue.py b/examples/language_model/xlnet/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..a59e3acb39d7ef04115e41fb85bbf8153b036b80 --- /dev/null +++ b/examples/language_model/xlnet/run_glue.py @@ -0,0 +1,377 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial +from math import ceil + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.transformers.xlnet.modeling import ( + XLNetForSequenceClassification, + XLNetPretrainedModel, +) +from paddlenlp.transformers.xlnet.tokenizer import XLNetTokenizer +from paddlenlp.utils import profiler + +final_res = "Not evaluated yet!" 
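+# NOTE: final_res is a module-level summary string; evaluate() overwrites it with the latest
+# metrics so that do_train() can print the final result once training finishes.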
+ +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, + "wnli": Accuracy, +} + + +def parse_args(): + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--task_name", default=None, type=str, required=True, help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()),) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(XLNetPretrainedModel.pretrained_init_configuration.keys()),) + parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.",) + parser.add_argument("--pad_to_max_seq_len", default=False, type=bool, help="Whether to pad all sequences to max length for sequences shorter than max length.",) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per device for training.",) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) + parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.",) + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.",) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu", "npu"], help="Select cpu, gpu, xpu, npu devices.",) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup_steps. 
If > 0: Override warmup_proportion",) + parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps.",) + parser.add_argument('-p', '--profiler_options', type=str, default=None, help='The option of profiler, which should be in format \"key1=value1;key2=value2;key3=value3\".',) + # yapf: enable + + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + losses = [] + global final_res + for batch in data_loader: + input_ids, token_type_ids, attention_mask, labels = batch + logits = model(input_ids, token_type_ids, attention_mask) + loss = loss_fct(logits, labels) + losses.append(loss.detach().numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s" + % (np.average(losses), res[0], res[1], res[2], res[3], res[4]) + ) + + final_res = "final: acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s" % ( + res[0], + res[1], + res[2], + res[3], + res[4], + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s" % (np.average(losses), res[0])) + final_res = "final: mcc: %s" % (res[0]) + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s" + % (np.average(losses), res[0], res[1], res[2]) + ) + final_res = "final: pearson: %s, spearman: %s, pearson and spearman: %s" % (res[0], res[1], res[2]) + else: + print("eval loss: %f, acc: %s" % (np.average(losses), res)) + final_res = "final: acc: %s" % res + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, pad_to_max_seq_len=False, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer( + example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + return_attention_mask=True, + ) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + return_attention_mask=True, + ) + + if not is_test: + return example["input_ids"], example["token_type_ids"], example["attention_mask"], label + else: + return example["input_ids"], example["token_type_ids"], example["attention_mask"] + + +def create_data_loader(args, tokenizer): + train_ds = load_dataset("glue", args.task_name, splits="train") + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, pad_right=False), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, pad_right=False), # token_type + Pad(axis=0, pad_val=0, pad_right=False), # 
attention_mask + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + return ( + train_data_loader, + dev_data_loader_matched, + dev_data_loader_mismatched, + train_ds, + dev_ds_matched, + dev_ds_mismatched, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + return train_data_loader, dev_data_loader, train_ds, dev_ds + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + global final_res + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + tokenizer_class = XLNetTokenizer + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.task_name == "mnli": + ( + train_data_loader, + dev_data_loader_matched, + dev_data_loader_mismatched, + train_ds, + dev_ds_matched, + dev_ds_mismatched, + ) = create_data_loader(args, tokenizer) + else: + train_data_loader, dev_data_loader, train_ds, dev_ds = create_data_loader(args, tokenizer) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = XLNetForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.max_grad_norm) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
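+    # NOTE: the filter below matches by substring on the parameter name, so any parameter whose
+    # name contains "bias" or "layer_norm" is excluded from weight decay.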
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "layer_norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + grad_clip=clip, + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + model.train() + + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + global_step += 1 + input_ids, token_type_ids, attention_mask, labels = batch + logits = model(input_ids, token_type_ids, attention_mask) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + train_run_cost += time.time() - train_start + # Profile for model benchmark + profiler.add_profiler_step(args.profiler_options) + + if global_step % args.logging_steps == 0: + speed = args.logging_steps / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_steps + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s, avg_reader_cost: %.4f sec, avg_batch_cost: %.4f sec, avg_samples: %d, avg_ips: %.4f sequences/sec" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + speed, + avg_reader_cost, + 1.0 / speed, + args.batch_size, + speed * args.batch_size, + ) + ) + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + print("matched ", end="") + evaluate(model, loss_fct, metric, dev_data_loader_matched) + final_res1 = "matched " + final_res + print("mismatched ", end="") + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + final_res2 = "mismatched " + final_res + final_res = final_res1 + "\r\n" + final_res2 + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if (not paddle.distributed.get_world_size() > 1) or paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "%s_ft_model_%d" % (args.task_name, global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step == num_training_steps: + print(final_res) + exit(0) + + reader_start = time.time() + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/lexical_analysis/README.md b/examples/lexical_analysis/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..421a04d6aa0a613b061b6d105ac889effcda7d08 --- /dev/null +++ b/examples/lexical_analysis/README.md @@ -0,0 +1,185 @@ +# 词法分析 + +## 1. 简介 + +词法分析任务的输入是一个字符串(我们后面使用『句子』来指代它),而输出是句子中的词边界和词性、实体类别。序列标注是词法分析的经典建模方式,我们使用基于 GRU 的网络结构学习特征,将学习到的特征接入 CRF 解码层完成序列标注。模型结构如下所示:
+ +![GRU-CRF-MODEL](https://bj.bcebos.com/paddlenlp/imgs/gru-crf-model.png) + +1. 输入采用 one-hot 方式表示,每个字以一个 id 表示 +2. one-hot 序列通过字表,转换为实向量表示的字向量序列; +3. 字向量序列作为双向 GRU 的输入,学习输入序列的特征表示,得到新的特性表示序列,我们堆叠了两层双向 GRU 以增加学习能力; +4. CRF 以 GRU 学习到的特征为输入,以标记序列为监督信号,实现序列标注。 + + +## 快速开始 + +### 数据准备 + +我们提供了少数样本用以示例输入数据格式。执行以下命令,下载并解压示例数据集: + +```bash +python download.py --data_dir ./ +``` + +训练使用的数据可以由用户根据实际的应用场景,自己组织数据。除了第一行是 `text_a\tlabel` 固定的开头,后面的每行数据都是由两列组成,以制表符分隔,第一列是 utf-8 编码的中文文本,以 `\002` 分割,第二列是对应每个字的标注,以 `\002` 分隔。我们采用 IOB2 标注体系,即以 X-B 作为类型为 X 的词的开始,以 X-I 作为类型为 X 的词的持续,以 O 表示不关注的字(实际上,在词性、专名联合标注中,不存在 O )。示例如下: + +```text +除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员\002,\002马\002化\002腾\002,\002雷\002军\002,\002李\002彦\002宏\002也\002被\002推\002选\002为\002新\002一\002届\002全\002国\002人\002大\002代\002表\002或\002全\002国\002政\002协\002委\002员 p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002w-B\002PER-B\002PER-I\002PER-I\002w-B\002PER-B\002PER-I\002w-B\002PER-B\002PER-I\002PER-I\002d-B\002p-B\002v-B\002v-I\002v-B\002a-B\002m-B\002m-I\002ORG-B\002ORG-I\002ORG-I\002ORG-I\002n-B\002n-I\002c-B\002n-B\002n-I\002ORG-B\002ORG-I\002n-B\002n-I +``` + +其中词性和专名类别标签集合如下表,包含词性标签 24 个(小写字母),专名类别标签 4 个(大写字母)。这里需要说明的是,人名、地名、机构名和时间四个类别,存在(PER / LOC / ORG / TIME 和 nr / ns / nt / t)两套标签,被标注为第二套标签的词,是模型判断为低置信度的人名、地名、机构名和时间词。开发者可以基于这两套标签,在四个类别的准确、召回之间做出自己的权衡。 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +### 模型训练 + +#### 单卡训练 + +启动方式如下: + +```bash +python train.py \ + --data_dir ./lexical_analysis_dataset_tiny \ + --model_save_dir ./save_dir \ + --epochs 10 \ + --batch_size 32 \ + --device gpu \ + # --init_checkpoint ./save_dir/final +``` + +其中参数释义如下: +- `data_dir`: 数据集所在文件夹路径. 
+- `model_save_dir`: 训练期间模型保存路径。 +- `epochs`: 模型训练迭代轮数。 +- `batch_size`: 表示每次迭代**每张卡**上的样本数目。 +- `device`: 训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `init_checkpoint`: 模型加载路径,通过设置init_checkpoint可以启动增量训练。 + +#### 多卡训练 + +启动方式如下: + +```bash +python -m paddle.distributed.launch --gpus "0,1" train.py \ + --data_dir ./lexical_analysis_dataset_tiny \ + --model_save_dir ./save_dir \ + --epochs 10 \ + --batch_size 32 \ + --device gpu \ + # --init_checkpoint ./save_dir/final +``` + +### 模型评估 + +通过加载训练保存的模型,可以对测试集数据进行验证,启动方式如下: + +```bash +python eval.py --data_dir ./lexical_analysis_dataset_tiny \ + --init_checkpoint ./save_dir/model_100.pdparams \ + --batch_size 32 \ + --device gpu +``` + +其中`./save_dir/model_100.pdparams`是训练过程中保存的参数文件,请更换为实际得到的训练保存路径。 + +### 模型导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + +运行方式: + +```shell +python export_model.py --data_dir=./lexical_analysis_dataset_tiny --params_path=./save_dir/model_100.pdparams --output_path=./infer_model/static_graph_params +``` + +其中`./save_dir/model_100.pdparams`是训练过程中保存的参数文件,请更换为实际得到的训练保存路径。 + +* `params_path`是指动态图训练保存的参数路径 +* `output_path`是指静态图参数导出路径。 + +导出模型之后,可以用于部署,deploy/predict.py文件提供了python部署预测示例。运行方式: + +```shell +python deploy/predict.py --model_file=infer_model/static_graph_params.pdmodel --params_file=infer_model/static_graph_params.pdiparams --data_dir lexical_analysis_dataset_tiny +``` + +### 模型预测 + +对无标签数据可以启动模型预测: + +```bash +python predict.py --data_dir ./lexical_analysis_dataset_tiny \ + --init_checkpoint ./save_dir/model_100.pdparams \ + --batch_size 32 \ + --device gpu +``` + +得到类似以下输出: + +```txt +(大学, n)(学籍, n)(证明, n)(怎么, r)(开, v) +(电车, n)(的, u)(英文, nz) +(什么, r)(是, v)(司法, n)(鉴定人, vn) +``` + +### Taskflow一键预测 +可以使用PaddleNLP提供的Taskflow工具来对输入的文本进行一键分词,具体使用方法如下: + +```python +from paddlenlp import Taskflow + +lac = Taskflow("lexical_analysis") +lac("LAC是个优秀的分词工具") +''' +[{'text': 'LAC是个优秀的分词工具', 'segs': ['LAC', '是', '个', '优秀', '的', '分词', '工具'], 'tags': ['nz', 'v', 'q', 'a', 'u', 'n', 'n']}] +''' + +lac(["LAC是个优秀的分词工具", "三亚是一个美丽的城市"]) +''' +[{'text': 'LAC是个优秀的分词工具', 'segs': ['LAC', '是', '个', '优秀', '的', '分词', '工具'], 'tags': ['nz', 'v', 'q', 'a', 'u', 'n', 'n']}, + {'text': '三亚是一个美丽的城市', 'segs': ['三亚', '是', '一个', '美丽', '的', '城市'], 'tags': ['LOC', 'v', 'm', 'a', 'u', 'n']} +] +''' +``` + +任务的默认路径为`$HOME/.paddlenlp/taskflow/lexical_analysis/lac/`,默认路径下包含了执行该任务需要的所有文件。 + +如果希望得到定制化的分词及标注结果,用户也可以通过Taskflow来加载自定义的词法分析模型并进行预测。 + +通过`task_path`指定用户自定义路径,自定义路径下的文件需要和默认路径的文件一致。 + +自定义路径包含如下文件(用户自己的模型权重、标签字典): +```text +custom_task_path/ +├── model.pdparams +├── word.dic +├── tag.dic +└── q2b.dic +``` + +使用Taskflow加载自定义模型进行一键预测: + +```python +from paddlenlp import Taskflow + +my_lac = Taskflow("lexical_analysis", model_path="./custom_task_path/") +``` + +更多使用方法请参考[Taskflow文档](../../docs/model_zoo/taskflow.md)。 + +## 预训练模型 + +如果您希望使用已经预训练好了的LAC模型完成词法分析任务,请参考: + +[Lexical Analysis of Chinese](https://github.com/baidu/lac) + +[PaddleHub分词模型](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis) diff --git a/examples/lexical_analysis/data.py b/examples/lexical_analysis/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d1d1a59aa9bd687deacd9ab7dfcdf441b3caea97 --- /dev/null +++ b/examples/lexical_analysis/data.py @@ -0,0 +1,142 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +The file_reader converts raw corpus to input. +""" + +from paddlenlp.datasets import MapDataset + +# We use "\002" to separate sentence characters and sequence labels, +# for example: 除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员 +# p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002 +CHAR_DELIMITER = "\002" + + +def load_dataset(datafiles): + def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + if "infer" in data_path: + next(fp) + for line in fp: + line = line.strip() + if "infer" in data_path: + words = list(line) + yield [words] + else: + words, labels = line.split("\t") + words = words.split(CHAR_DELIMITER) + labels = labels.split(CHAR_DELIMITER) + assert len(words) == len(labels), "The word %s is not match with the label %s" % (words, labels) + yield [words, labels] + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] + + +def load_vocab(dict_path): + """ + Load vocab from file + """ + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def normalize_token(token, normlize_vocab): + """Normalize text from DBC case to SBC case""" + if normlize_vocab: + token = normlize_vocab.get(token, token) + return token + + +def convert_tokens_to_ids(tokens, vocab, oov_replace_token=None, normlize_vocab=None): + """convert tokens to token indexs""" + token_ids = [] + oov_replace_token = vocab.get(oov_replace_token) if oov_replace_token else None + for token in tokens: + token = normalize_token(token, normlize_vocab) + token_id = vocab.get(token, oov_replace_token) + token_ids.append(token_id) + + return token_ids + + +def convert_example(example, max_seq_len, word_vocab, label_vocab=None, normlize_vocab=None): + if len(example) == 2: + tokens, labels = example + else: + tokens, labels = example[0], None + tokens = tokens[:max_seq_len] + + token_ids = convert_tokens_to_ids(tokens, word_vocab, oov_replace_token="OOV", normlize_vocab=normlize_vocab) + length = len(token_ids) + if labels is not None: + labels = labels[:max_seq_len] + label_ids = convert_tokens_to_ids(labels, label_vocab, oov_replace_token="O") + return token_ids, length, label_ids + else: + return token_ids, length + + +def parse_result(words, preds, lengths, word_vocab, label_vocab): + """parse padding result""" + batch_out = [] + id2word_dict = dict(zip(word_vocab.values(), word_vocab.keys())) + id2label_dict = dict(zip(label_vocab.values(), label_vocab.keys())) + for sent_index in range(len(lengths)): + sent = [id2word_dict[index] for index in words[sent_index][: 
lengths[sent_index]]] + tags = [id2label_dict[index] for index in preds[sent_index][: lengths[sent_index]]] + + sent_out = [] + tags_out = [] + parital_word = "" + for ind, tag in enumerate(tags): + # for the first word + if parital_word == "": + parital_word = sent[ind] + tags_out.append(tag.split("-")[0]) + continue + + # for the beginning of word + if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"): + sent_out.append(parital_word) + tags_out.append(tag.split("-")[0]) + parital_word = sent[ind] + continue + + parital_word += sent[ind] + + # append the last word, except for len(tags)=0 + if len(sent_out) < len(tags_out): + sent_out.append(parital_word) + + batch_out.append([sent_out, tags_out]) + return batch_out diff --git a/examples/lexical_analysis/deploy/predict.py b/examples/lexical_analysis/deploy/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9d50e0b49033bad95a2f25f27eb68a6cebb07ccf --- /dev/null +++ b/examples/lexical_analysis/deploy/predict.py @@ -0,0 +1,206 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_file", type=str, required=True, default='./static_graph_params.pdmodel', help="The path to model info in static graph.") +parser.add_argument("--params_file", type=str, required=True, default='./static_graph_params.pdiparams', help="The path to parameters in static graph.") +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--batch_size", type=int, default=2, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--epochs", default=1, type=int, help="The number of epochs when running benchmark.") + +args = parser.parse_args() +# yapf: enable + + +def normalize_token(token, normlize_vocab): + """Normalize text from DBC case to SBC case""" + if normlize_vocab: + token = normlize_vocab.get(token, token) + return token + + +def convert_tokens_to_ids(tokens, vocab, oov_replace_token=None, normlize_vocab=None): + """Convert tokens to token indexs""" + token_ids = [] + oov_replace_token = vocab.get(oov_replace_token) if oov_replace_token else None + for token in tokens: + token = normalize_token(token, normlize_vocab) + token_id = vocab.get(token, oov_replace_token) + token_ids.append(token_id) + + return token_ids + + +def convert_example(tokens, max_seq_len, word_vocab, normlize_vocab=None): + """Convert tokens of sequences to 
token ids""" + tokens = tokens[:max_seq_len] + + token_ids = convert_tokens_to_ids(tokens, word_vocab, oov_replace_token="OOV", normlize_vocab=normlize_vocab) + length = len(token_ids) + return token_ids, length + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_result(words, preds, lengths, word_vocab, label_vocab): + """Parse padding result""" + batch_out = [] + id2word_dict = dict(zip(word_vocab.values(), word_vocab.keys())) + id2label_dict = dict(zip(label_vocab.values(), label_vocab.keys())) + for sent_index in range(len(lengths)): + sent = [id2word_dict[index] for index in words[sent_index][: lengths[sent_index]]] + tags = [id2label_dict[index] for index in preds[sent_index][: lengths[sent_index]]] + + sent_out = [] + tags_out = [] + parital_word = "" + for ind, tag in enumerate(tags): + # for the first word + if parital_word == "": + parital_word = sent[ind] + tags_out.append(tag.split("-")[0]) + continue + + # for the beginning of word + if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"): + sent_out.append(parital_word) + tags_out.append(tag.split("-")[0]) + parital_word = sent[ind] + continue + + parital_word += sent[ind] + + # append the last word, except for len(tags)=0 + if len(sent_out) < len(tags_out): + sent_out.append(parital_word) + + batch_out.append([sent_out, tags_out]) + return batch_out + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, word_vocab, label_vocab, normlize_vocab, batch_size=1): + """ + Predicts the data labels. + + Args: + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + word_vocab(obj:`dict`): The word id (key) to word str (value) map. + label_vocab(obj:`dict`): The label id (key) to label str (value) map. + normlize_vocab(obj:`dict`): The fullwidth char (key) to halfwidth char (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + examples = [] + + for text in data: + tokens = list(text.strip()) + token_ids, length = convert_example( + tokens, self.max_seq_length, word_vocab=word_vocab, normlize_vocab=normlize_vocab + ) + examples.append((token_ids, length)) + + def batchify_fn(samples): + fn = Tuple(Pad(axis=0, pad_val=0, dtype="int64"), Stack(axis=0, dtype="int64")) + + return fn(samples) + + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + results = [] + + for batch in batches: + token_ids, length = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(token_ids) + self.input_handles[1].copy_from_cpu(length) + self.predictor.run() + preds = self.output_handle.copy_to_cpu() + result = parse_result(token_ids, preds, length, word_vocab, label_vocab) + results.extend(result) + return results + + +if __name__ == "__main__": + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + infer_ds = [] + with open(os.path.join(args.data_dir, "infer.tsv"), "r", encoding="utf-8") as fp: + for line in fp.readlines(): + infer_ds += [line.strip()] + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_len) + start = time.time() + for _ in range(args.epochs): + results = predictor.predict(infer_ds, word_vocab, label_vocab, normlize_vocab, batch_size=args.batch_size) + end = time.time() + for idx, result in enumerate(results): + print("Text: {}".format(infer_ds[idx])) + sent_tags = [] + sent, tags = result + sent_tag = ["(%s, %s)" % (ch, tag) for ch, tag in zip(sent, tags)] + print("Result: {}\n".format(sent_tag)) + print("Total predict time: {:.4f} s".format(end - start)) diff --git a/examples/lexical_analysis/download.py b/examples/lexical_analysis/download.py new file mode 100644 index 0000000000000000000000000000000000000000..f1f0236b3cf0977dbde881aa97bd3a6a9d8cff95 --- /dev/null +++ b/examples/lexical_analysis/download.py @@ -0,0 +1,32 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +from paddle.utils.download import get_path_from_url + +URL = "https://bj.bcebos.com/paddlenlp/datasets/lexical_analysis_dataset_tiny.tar.gz" + + +def main(arguments): + parser = argparse.ArgumentParser() + parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="data") + args = parser.parse_args(arguments) + get_path_from_url(URL, args.data_dir) + + +if __name__ == "__main__": + sys.exit(main(sys.argv[1:])) diff --git a/examples/lexical_analysis/eval.py b/examples/lexical_analysis/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..0bd51480b5ad707fbe2d2662d4a85135045fa692 --- /dev/null +++ b/examples/lexical_analysis/eval.py @@ -0,0 +1,92 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from data import convert_example, load_dataset, load_vocab +from model import BiGruCrf + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--batch_size", type=int, default=300, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +args = parser.parse_args() +# fmt: on + + +def evaluate(args): + paddle.set_device(args.device) + + # create dataset. 
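+    # test.tsv shares the training data format: "text\tlabels", where characters and
+    # per-character tags are separated by "\002" (see data.py and the README).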
+ test_ds = load_dataset(datafiles=(os.path.join(args.data_dir, "test.tsv"))) + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + # q2b.dic is used to replace DBC case to SBC case + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + + trans_func = partial( + convert_example, + max_seq_len=args.max_seq_len, + word_vocab=word_vocab, + label_vocab=label_vocab, + normlize_vocab=normlize_vocab, + ) + test_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=0, dtype="int64"), # word_ids + Stack(dtype="int64"), # length + Pad(axis=0, pad_val=0, dtype="int64"), # label_ids + ): fn(samples) + + # Create sampler for dataloader + test_sampler = paddle.io.BatchSampler(dataset=test_ds, batch_size=args.batch_size, shuffle=False, drop_last=False) + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_sampler, return_list=True, collate_fn=batchify_fn + ) + + # Define the model network and metric evaluator + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab)) + chunk_evaluator = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + + # Load the model and start predicting + model_dict = paddle.load(args.init_checkpoint) + model.load_dict(model_dict) + + model.eval() + chunk_evaluator.reset() + for batch in test_loader: + token_ids, length, labels = batch + preds = model(token_ids, length) + num_infer_chunks, num_label_chunks, num_correct_chunks = chunk_evaluator.compute(length, preds, labels) + chunk_evaluator.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = chunk_evaluator.accumulate() + print("eval precision: %f, recall: %f, f1: %f" % (precision, recall, f1_score)) + + +if __name__ == "__main__": + args = parser.parse_args() + evaluate(args) diff --git a/examples/lexical_analysis/export_model.py b/examples/lexical_analysis/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..d2939fcad6c8974790437af1716e91a8fbe4a929 --- /dev/null +++ b/examples/lexical_analysis/export_model.py @@ -0,0 +1,57 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
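+# Export a trained dygraph BiGruCrf checkpoint to a static graph model via paddle.jit.to_static
+# so it can be deployed with the inference code in deploy/predict.py.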
+ +import argparse +import os + +import paddle +from data import load_vocab +from model import BiGruCrf +from paddle.static import InputSpec + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./infer_model/static_graph_params', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +args = parser.parse_args() +# fmt: on + + +def main(): + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab)) + + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + InputSpec(shape=[None, None], dtype="int64", name="token_ids"), + InputSpec(shape=[None], dtype="int64", name="length"), + ], + ) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/lexical_analysis/model.py b/examples/lexical_analysis/model.py new file mode 100644 index 0000000000000000000000000000000000000000..20a4a5ebe4b1412b87bb8bec1d8660f001b22203 --- /dev/null +++ b/examples/lexical_analysis/model.py @@ -0,0 +1,96 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +from paddlenlp.layers.crf import LinearChainCrf, LinearChainCrfLoss + +if hasattr(paddle, "text") and hasattr(paddle.text, "ViterbiDecoder"): + from paddle.text import ViterbiDecoder +else: + from paddlenlp.layers.crf import ViterbiDecoder + + +class BiGruCrf(nn.Layer): + """The network for lexical analysis, based on two layers of BiGRU and one layer of CRF. More details see https://arxiv.org/abs/1807.01882 + + Args: + word_emb_dim (int): The dimension in which a word is embedded. + hidden_size (int): The number of hidden nodes in the GRU layer. + vocab_size (int): the word vocab size. + num_labels (int): the labels amount. + emb_lr (float, optional): The scaling of the learning rate of the embedding layer. Defaults to 2.0. + crf_lr (float, optional): The scaling of the learning rate of the crf layer. Defaults to 0.2. 
+ """ + + def __init__( + self, word_emb_dim, hidden_size, vocab_size, num_labels, emb_lr=2.0, crf_lr=0.2, with_start_stop_tag=True + ): + super(BiGruCrf, self).__init__() + self.word_emb_dim = word_emb_dim + self.vocab_size = vocab_size + self.num_labels = num_labels + self.hidden_size = hidden_size + self.emb_lr = emb_lr + self.crf_lr = crf_lr + self.init_bound = 0.1 + + self.word_embedding = nn.Embedding( + num_embeddings=self.vocab_size, + embedding_dim=self.word_emb_dim, + weight_attr=paddle.ParamAttr( + learning_rate=self.emb_lr, + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + ), + ) + + self.gru = nn.GRU( + input_size=self.word_emb_dim, + hidden_size=self.hidden_size, + num_layers=2, + direction="bidirectional", + weight_ih_attr=paddle.ParamAttr( + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + regularizer=paddle.regularizer.L2Decay(coeff=1e-4), + ), + weight_hh_attr=paddle.ParamAttr( + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + regularizer=paddle.regularizer.L2Decay(coeff=1e-4), + ), + ) + + self.fc = nn.Linear( + in_features=self.hidden_size * 2, + out_features=self.num_labels + 2 if with_start_stop_tag else self.num_labels, + weight_attr=paddle.ParamAttr( + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + regularizer=paddle.regularizer.L2Decay(coeff=1e-4), + ), + ) + + self.crf = LinearChainCrf(self.num_labels, self.crf_lr, with_start_stop_tag) + self.crf_loss = LinearChainCrfLoss(self.crf) + self.viterbi_decoder = ViterbiDecoder(self.crf.transitions, with_start_stop_tag) + + def forward(self, inputs, lengths, labels=None): + word_embed = self.word_embedding(inputs) + bigru_output, _ = self.gru(word_embed, sequence_length=lengths) + emission = self.fc(bigru_output) + if labels is not None: + loss = self.crf_loss(emission, lengths, labels) + return loss + else: + _, prediction = self.viterbi_decoder(emission, lengths) + return prediction diff --git a/examples/lexical_analysis/predict.py b/examples/lexical_analysis/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..0b7a7b80cdd85514920503f4d9f0fc1453683a57 --- /dev/null +++ b/examples/lexical_analysis/predict.py @@ -0,0 +1,101 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +from functools import partial + +import paddle +from data import convert_example, load_dataset, load_vocab, parse_result +from model import BiGruCrf + +from paddlenlp.data import Pad, Stack, Tuple + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--batch_size", type=int, default=300, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +args = parser.parse_args() +# fmt: on + + +def infer(args): + paddle.set_device(args.device) + + # create dataset. + infer_ds = load_dataset(datafiles=(os.path.join(args.data_dir, "infer.tsv"))) + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + # q2b.dic is used to replace DBC case to SBC case + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + + trans_func = partial( + convert_example, + max_seq_len=args.max_seq_len, + word_vocab=word_vocab, + label_vocab=label_vocab, + normlize_vocab=normlize_vocab, + ) + infer_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=0, dtype="int64"), # word_ids + Stack(dtype="int64"), # length + ): fn(samples) + + # Create sampler for dataloader + infer_sampler = paddle.io.BatchSampler( + dataset=infer_ds, batch_size=args.batch_size, shuffle=False, drop_last=False + ) + infer_loader = paddle.io.DataLoader( + dataset=infer_ds, batch_sampler=infer_sampler, return_list=True, collate_fn=batchify_fn + ) + + # Define the model network + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab)) + + # Load the model and start predicting + model_dict = paddle.load(args.init_checkpoint) + model.load_dict(model_dict) + + model.eval() + results = [] + for batch in infer_loader: + token_ids, length = batch + preds = model(token_ids, length) + result = parse_result(token_ids.numpy(), preds.numpy(), length.numpy(), word_vocab, label_vocab) + results += result + + sent_tags = [] + for sent, tags in results: + sent_tag = ["(%s, %s)" % (ch, tag) for ch, tag in zip(sent, tags)] + sent_tags.append("".join(sent_tag)) + + file_path = "results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(sent_tags)) + + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(sent_tags[:10])) + + +if __name__ == "__main__": + infer(args) diff --git a/examples/lexical_analysis/train.py b/examples/lexical_analysis/train.py new file mode 100644 index 0000000000000000000000000000000000000000..480721f5eaf7e4f56edb42b4d495b573b556df03 --- /dev/null +++ b/examples/lexical_analysis/train.py @@ -0,0 +1,178 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from data import convert_example, load_dataset, load_vocab +from model import BiGruCrf + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--model_save_dir", type=str, default=None, help="The model will be saved in this path.") +parser.add_argument("--epochs", type=int, default=10, help="Corpus iteration num.") +parser.add_argument("--batch_size", type=int, default=300, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--base_lr", type=float, default=0.001, help="The basic learning rate that affects the entire network.") +parser.add_argument("--crf_lr", type=float, default=0.2, help="The learning rate ratio that affects CRF layers.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") +parser.add_argument("--do_eval", type=strtobool, default=True, help="To evaluate the model if True.") +# yapf: enable + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + token_ids, length, labels = batch + preds = model(token_ids, length) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(length, preds, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + logger.info("eval precision: %f, recall: %f, f1: %f" % (precision, recall, f1_score)) + model.train() + return precision, recall, f1_score + + +def train(args): + paddle.set_device(args.device) + + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + # Create dataset. 
+ train_ds, test_ds = load_dataset( + datafiles=(os.path.join(args.data_dir, "train.tsv"), os.path.join(args.data_dir, "test.tsv")) + ) + + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + # q2b.dic is used to replace DBC case to SBC case + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + + trans_func = partial( + convert_example, + max_seq_len=args.max_seq_len, + word_vocab=word_vocab, + label_vocab=label_vocab, + normlize_vocab=normlize_vocab, + ) + train_ds.map(trans_func) + test_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=word_vocab.get("[PAD]", 0), dtype="int64"), # word_ids + Stack(dtype="int64"), # length + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # label_ids + ): fn(samples) + + # Create sampler for dataloader + train_sampler = paddle.io.DistributedBatchSampler( + dataset=train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True + ) + train_loader = paddle.io.DataLoader( + dataset=train_ds, batch_sampler=train_sampler, return_list=True, collate_fn=batchify_fn + ) + + test_sampler = paddle.io.BatchSampler(dataset=test_ds, batch_size=args.batch_size, shuffle=False, drop_last=False) + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_sampler, return_list=True, collate_fn=batchify_fn + ) + + # Define the model netword and its loss + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab), crf_lr=args.crf_lr) + # Prepare optimizer, loss and metric evaluator + optimizer = paddle.optimizer.Adam(learning_rate=args.base_lr, parameters=model.parameters()) + chunk_evaluator = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + + if args.init_checkpoint: + if os.path.exists(args.init_checkpoint): + logger.info("Init checkpoint from %s" % args.init_checkpoint) + model_dict = paddle.load(args.init_checkpoint) + model.load_dict(model_dict) + else: + logger.info("Cannot init checkpoint from %s which doesn't exist" % args.init_checkpoint) + logger.info("Start training") + # Start training + global_step = 0 + last_step = args.epochs * len(train_loader) + train_reader_cost = 0.0 + train_run_cost = 0.0 + total_samples = 0 + reader_start = time.time() + max_f1_score = -1 + for epoch in range(args.epochs): + for step, batch in enumerate(train_loader): + train_reader_cost += time.time() - reader_start + global_step += 1 + token_ids, length, label_ids = batch + train_start = time.time() + loss = model(token_ids, length, label_ids) + avg_loss = paddle.mean(loss) + train_run_cost += time.time() - train_start + total_samples += args.batch_size + if global_step % args.logging_steps == 0: + logger.info( + "global step %d / %d, loss: %f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + last_step, + avg_loss, + train_reader_cost / args.logging_steps, + (train_reader_cost + train_run_cost) / args.logging_steps, + total_samples / args.logging_steps, + total_samples / (train_reader_cost + train_run_cost), + ) + ) + train_reader_cost = 0.0 + train_run_cost = 0.0 + total_samples = 0 + avg_loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 or global_step == last_step: + if rank == 0: + paddle.save( + model.state_dict(), os.path.join(args.model_save_dir, "model_%d.pdparams" % global_step) + ) + logger.info("Save %d steps model." 
% (global_step)) + if args.do_eval: + precision, recall, f1_score = evaluate(model, chunk_evaluator, test_loader) + if f1_score > max_f1_score: + max_f1_score = f1_score + paddle.save(model.state_dict(), os.path.join(args.model_save_dir, "best_model.pdparams")) + logger.info("Save best model.") + + reader_start = time.time() + + +if __name__ == "__main__": + args = parser.parse_args() + train(args) diff --git a/examples/machine_reading_comprehension/DuReader-robust/README.md b/examples/machine_reading_comprehension/DuReader-robust/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7d74fcf6c8ba09e7b35503fae72ae191e98c2ed8 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-robust/README.md @@ -0,0 +1,73 @@ +# 阅读理解 DuReader-robust + +# 简介 + +## 任务说明 +阅读理解模型的鲁棒性是衡量该技术能否在实际应用中大规模落地的重要指标之一。随着当前技术的进步,模型虽然能够在一些阅读理解测试集上取得较好的性能,但在实际应用中,这些模型所表现出的鲁棒性仍然难以令人满意。DuReader-robust数据集作为首个关注阅读理解模型鲁棒性的中文数据集,旨在考察模型在真实应用场景中的过敏感性、过稳定性以及泛化能力等问题。 + +## 数据集 + +DuReader-robust数据集是单篇章、抽取式阅读理解数据集,具体的任务定义为: +对于一个给定的问题q和一个篇章p,参赛系统需要根据篇章内容,给出该问题的答案a。数据集中的每个样本,是一个三元组,例如: + +**问题 q**: 乔丹打了多少个赛季 + +**篇章 p**: 迈克尔.乔丹在NBA打了15个赛季。他在84年进入nba,期间在1993年10月6日第一次退役改打棒球,95年3月18日重新回归,在99年1月13日第二次退役,后于2001年10月31日复出,在03年最终退役… + +**参考答案 a**: [‘15个’,‘15个赛季’] + +关于该数据集的详细内容,可参考数据集[论文](https://arxiv.org/abs/2004.11142)。 + +## 快速开始 + +### 数据准备 + +为了方便开发者进行测试,我们已将数据集上传至HuggingFace。 + + +### Fine-tune + +按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_du.py \ + --model_type ernie_gram \ + --model_name_or_path ernie-gram-zh \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 1 \ + --logging_steps 10 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/dureader-robust/ \ + --do_train \ + --do_predict \ + --device gpu \ + ``` + +* `model_type`: 预训练模型的种类。如bert,ernie,roberta等。 +* `model_name_or_path`: 预训练模型的具体名称。如bert-base-chinese,roberta-wwm-ext等。或者是模型文件的本地路径。 +* `output_dir`: 保存模型checkpoint的路径。 +* `do_train`: 是否进行训练。 +* `do_predict`: 是否进行预测。 + +训练结束后模型会自动对结果进行评估,得到类似如下的输出: + +```text +{ + "exact": 72.90049400141143, + "f1": 86.95957173352133, + "total": 1417, + "HasAns_exact": 72.90049400141143, + "HasAns_f1": 86.95957173352133, + "HasAns_total": 1417 +} +``` + +评估结束后模型会自动对测试集进行预测,并将可提交的结果生成在`prediction.json`中。 + + +**NOTE:** 如需恢复模型训练,则model_name_or_path只需指定到文件夹名即可。如`--model_name_or_path=./tmp/dureader-robust/model_19000/`,程序会自动加载模型参数`/model_state.pdparams`,也会自动加载词表,模型config和tokenizer的config。 diff --git a/examples/machine_reading_comprehension/DuReader-robust/args.py b/examples/machine_reading_comprehension/DuReader-robust/args.py new file mode 100644 index 0000000000000000000000000000000000000000..7d3a2927eabfccee53c204d88136f98fe4b96318 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-robust/args.py @@ -0,0 +1,92 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
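+
+"""Command-line arguments for DuReader-robust fine-tuning and prediction (run_du.py).
+
+This module only builds the argparse parser; run_du.py imports parse_args() from here.
+Because the parser is created with argparse.ArgumentParser(description=__doc__), this
+docstring also appears as the --help description. A minimal sketch of how it is consumed
+(the flag values shown are illustrative):
+
+    from args import parse_args
+
+    args = parse_args()  # e.g. --model_type ernie_gram --model_name_or_path ernie-gram-zh
+"""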
+ +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default=None, type=str, required=True, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + choices=["cpu", "gpu", "npu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. 
Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + args = parser.parse_args() + return args diff --git a/examples/machine_reading_comprehension/DuReader-robust/run_du.py b/examples/machine_reading_comprehension/DuReader-robust/run_du.py new file mode 100644 index 0000000000000000000000000000000000000000..2e7a76043c8e835e05e8f3092adfc2ec3cfdfb47 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-robust/run_du.py @@ -0,0 +1,328 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time + +import numpy as np +import paddle +from args import parse_args +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ErnieForQuestionAnswering, + ErnieGramForQuestionAnswering, + ErnieGramTokenizer, + ErnieTokenizer, + LinearDecayWithWarmup, + RobertaForQuestionAnswering, + RobertaTokenizer, +) + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "ernie_gram": (ErnieGramForQuestionAnswering, ErnieGramTokenizer), + "roberta": (RobertaForQuestionAnswering, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, raw_dataset, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + start_logits_tensor, end_logits_tensor = model(input_ids, token_type_ids) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, _, _ = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + False, + args.n_best_size, + args.max_answer_length, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, 
is_whitespace_splited=False) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + set_seed(args) + + train_examples = load_dataset("PaddlePaddle/dureader_robust", split="train") + dev_examples = load_dataset("PaddlePaddle/dureader_robust", split="validation") + + column_names = train_examples.column_names + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("init checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + def prepare_train_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. 
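+            # Using the CLS position as both start and end labels the feature as having no
+            # extractable answer; the same fallback is applied further below when the gold
+            # span lies outside the current doc_stride chunk.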
+ if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + if args.do_train: + train_ds = train_examples.map(prepare_train_features, batched=True, remove_columns=column_names, num_proc=4) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
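+        # decay_params holds the unique parameter names (p.name) of every parameter whose
+        # structured name does not contain "bias" or "norm"; AdamW calls
+        # apply_decay_param_fun with each parameter name and applies weight decay only to
+        # names found in this list.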
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, start_positions, end_positions = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, (start_positions, end_positions)) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + def prepare_validation_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
+ tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + if args.do_predict and rank == 0: + dev_ds = dev_examples.map(prepare_validation_features, batched=True, remove_columns=column_names, num_proc=4) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_examples, dev_data_loader, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/machine_reading_comprehension/DuReader-yesno/README.md b/examples/machine_reading_comprehension/DuReader-yesno/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ac95144f7cec93eae3df1ceec784aaafadaa0383 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-yesno/README.md @@ -0,0 +1,70 @@ +# 阅读理解 DuReader-yesno + +## 简介 + +### 任务说明 +机器阅读理解评测中常用的F1、EM等指标虽然能够很好的衡量抽取式模型所预测的答案和真实答案的匹配程度,但在处理观点类问题时,该类指标难以衡量模型是否真正理解答案所代表的含义,例如答案中包含的观点极性。DuReader-yesno是一个以观点极性判断为目标任务的数据集,通过引入该数据集,可以弥补抽取类数据集的不足,从而更好地评价模型的自然语言理解能力。 + + +### 数据集 + +该数据集的任务定义如下: +对于一个给定的问题q、一系列相关文档D=d1, d2, …, dn,以及人工抽取答案段落摘要a,要求参评系统自动对问题q、候选文档D以及答案段落摘要a进行分析,输出每个答案段落摘要所表述的是非观点极性。其中,极性分为三类 {Yes, No, Depends}。其中: + +* Yes:肯定观点,肯定观点指的是答案给出了较为明确的肯定态度。有客观事实的从客观事实的角度出发,主观态度类的从答案的整体态度来判断。 +* No:否定观点,否定观点通常指的是答案较为明确的给出了与问题相反的态度。 +* Depends:无法确定/分情况,主要指的是事情本身存在多种情况,不同情况下对应的观点不一致;或者答案本身对问题表示不确定,要具体具体情况才能判断。 + +例如: +```text +{ + "documents":[ + { + "title":"香蕉能放冰箱吗 香蕉剥皮冷冻保存_健康贴士_保健_99健康网", + "paragraphs":[ + "本文导读:............." 
+ ] + } + ], + "yesno_answer":"No", + "question":"香蕉能放冰箱吗", + "answer":"香蕉不能放冰箱,香蕉如果放冰箱里,会更容易变坏,会发黑腐烂。", + "id":293 +} +``` + +## 快速开始 + +### Fine-tune + +按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_du.py \ + --model_type ernie_gram \ + --model_name_or_path ernie-gram-zh \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 200 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/dureader-yesno/ \ + --device gpu \ + ``` + +* `model_type`: 预训练模型的种类。如bert,ernie,roberta等。 +* `model_name_or_path`: 预训练模型的具体名称。如bert-base-uncased,bert-large-cased等。或者是模型文件的本地路径。 +* `output_dir`: 保存模型checkpoint的路径。 + +训练结束后模型会自动对结果进行评估,得到类似如下的输出: + +```text +accu: 0.874954 +``` +评估结束后模型会自动对测试集进行预测,并将可提交的结果生成在`prediction.json`中。 + +**NOTE:** 如需恢复模型训练,则model_name_or_path只需指定到文件夹名即可。如`--model_name_or_path=./tmp/dureader-yesno/model_19000/`,程序会自动加载模型参数`/model_state.pdparams`,也会自动加载词表,模型config和tokenizer的config。 diff --git a/examples/machine_reading_comprehension/DuReader-yesno/args.py b/examples/machine_reading_comprehension/DuReader-yesno/args.py new file mode 100644 index 0000000000000000000000000000000000000000..e460d925a1a08aef5455eec3257a27f3fef4cfba --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-yesno/args.py @@ -0,0 +1,60 @@ +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default=None, type=str, required=True, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + + args = parser.parse_args() + return args diff --git a/examples/machine_reading_comprehension/DuReader-yesno/run_du.py b/examples/machine_reading_comprehension/DuReader-yesno/run_du.py new file mode 100644 index 0000000000000000000000000000000000000000..434b94c28df35508586efd821abf9c0c39e0a29f --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-yesno/run_du.py @@ -0,0 +1,208 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
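+
+"""Fine-tune a sequence-classification model on the DuReader-yesno polarity task.
+
+Each (question, answer) pair is tokenized as a text pair, the model is trained with AdamW
+under a linear warmup/decay schedule, dev-set accuracy is reported at every checkpoint,
+and test-set predictions are written to prediction.json. A typical single-GPU launch
+(argument values are illustrative; the README in this directory lists the full set):
+
+    python -m paddle.distributed.launch --gpus "0" run_du.py \
+        --model_type ernie_gram \
+        --model_name_or_path ernie-gram-zh \
+        --batch_size 12 \
+        --learning_rate 3e-5 \
+        --output_dir ./tmp/dureader-yesno/
+"""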
+ +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieGramForSequenceClassification, + ErnieGramTokenizer, + ErnieTokenizer, + LinearDecayWithWarmup, + RobertaForSequenceClassification, + RobertaTokenizer, +) + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), + "ernie_gram": (ErnieGramForSequenceClassification, ErnieGramTokenizer), + "roberta": (RobertaForSequenceClassification, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def convert_example(example, tokenizer): + """convert a Dureader-yesno example into necessary features""" + + feature = tokenizer(text=example["question"], text_pair=example["answer"], max_seq_len=args.max_seq_length) + feature["labels"] = example["labels"] + feature["id"] = example["id"] + + return feature + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("accu: %f" % (accu)) + model.train() # Switch the model to training mode after evaluation + + +@paddle.no_grad() +def predict(model, data_loader): + model.eval() + res = {} + for batch in data_loader: + input_ids, segment_ids, qas_id = batch + logits = model(input_ids, segment_ids) + qas_id = qas_id.numpy() + preds = paddle.argmax(logits, axis=1).numpy() + for i in range(len(preds)): + res[str(qas_id[i])] = data_loader.dataset.label_list[preds[i]] + model.train() + return res + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + set_seed(args) + + train_ds, dev_ds, test_ds = load_dataset("dureader_yesno", splits=["train", "dev", "test"]) + + trans_func = partial(convert_example, tokenizer=tokenizer) + + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "labels": Stack(dtype="int64"), + } + ): fn(samples) + + test_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "id": Stack(), + } + ): fn(samples) + + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + 
dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=test_batchify_fn, return_list=True + ) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=len(train_ds.label_list)) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, label = batch + + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + loss = criterion(logits, label) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + evaluate(model, metric, dev_data_loader) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if rank == 0: + predictions = predict(model, test_data_loader) + with open("prediction.json", "w") as writer: + writer.write(json.dumps(predictions, ensure_ascii=False, indent=4) + "\n") + + +if __name__ == "__main__": + args = parse_args() + do_train(args) diff --git a/examples/machine_reading_comprehension/SQuAD/README.md b/examples/machine_reading_comprehension/SQuAD/README.md new file mode 100644 index 0000000000000000000000000000000000000000..537453318acd41980c0e54be6901536a00b6f67a --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/README.md @@ -0,0 +1,195 @@ +# 阅读理解 SQuAD + +## 简介 + +### 任务说明 +本文主要介绍基于Bert预训练模型的SQuAD(Stanford Question Answering Dataset)数据集的阅读理解任务,给定一篇文章和一个问题,计算答案在文章中的起始位置和结束位置。对于SQuAD2.0数据集,还可以返回答案在文章中不存在的概率。 + +### 数据集 + +此任务的数据集包括以下数据集: + +SQuAD v1.1 +- [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) +- 
[dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) + +SQuAD v2.0 +- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) +- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) + + +## 快速开始 + +### 数据准备 + +为了方便开发者进行测试,我们使用了HuggingFace的数据集,用户可以通过命令行传入`--version_2_with_negative`控制所需要的SQuAD数据集版本。 + +### Fine-tune + +对于 SQuAD v1.1,按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_train \ + --do_predict + ``` + +* `model_type`: 预训练模型的种类。如bert,ernie,roberta等。 +* `model_name_or_path`: 预训练模型的具体名称。如bert-base-uncased,bert-large-cased等。或者是模型文件的本地路径。 +* `output_dir`: 保存模型checkpoint的路径。 +* `do_train`: 是否进行训练。 +* `do_predict`: 是否进行预测。 + +训练结束后模型会自动对结果进行评估,得到类似如下的输出: + +```text +{ + "exact": 81.18259224219489, + "f1": 88.68817481234801, + "total": 10570, + "HasAns_exact": 81.18259224219489, + "HasAns_f1": 88.68817481234801, + "HasAns_total": 10570 +} +``` + +对于 SQuAD v2.0,按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_train \ + --do_predict \ + --version_2_with_negative + ``` + +* `version_2_with_negative`: 使用squad2.0数据集和评价指标的标志。 + +训练结束后会在模型会自动对结果进行评估,得到类似如下的输出: + +```text +{ + "exact": 73.25865408910974, + "f1": 76.63096554166046, + "total": 11873, + "HasAns_exact": 73.22874493927125, + "HasAns_f1": 79.98303877802545, + "HasAns_total": 5928, + "NoAns_exact": 73.28847771236333, + "NoAns_f1": 73.28847771236333, + "NoAns_total": 5945, + "best_exact": 74.31988545439232, + "best_exact_thresh": -2.5820093154907227, + "best_f1": 77.20521797731851, + "best_f1_thresh": -1.559523582458496 +} +``` + +其中会输出 `best_f1_thresh` 是最佳阈值,可以使用这个阈值重新训练,或者从 `all_nbest_json`变量中获取最终 `prediction`。 +训练方法与前面大体相同,只需要设定 `--null_score_diff_threshold` 参数的值为测评时输出的 `best_f1_thresh` ,通常这个值在 -1.0 到 -5.0 之间。 + +**NOTE:** 如需恢复模型训练,则model_name_or_path只需指定到文件夹名即可。如`--model_name_or_path=./tmp/squad/model_19000/`,程序会自动加载模型参数`/model_state.pdparams`,也会自动加载词表,模型config和tokenizer的config。 + +### 预测 + +如需使用训练好的模型预测并输出结果,需将自己的数据集改成SQuAD格式(以下示例为SQuAD2.0)。 + +```text +{"data": [{'title': 'Beyoncé', + 'paragraphs': [ + {'qas': [{'question': 'When did Beyonce start becoming popular?', + 'id': '56be85543aeaaa14008c9063', + 'answers': [], + 'is_impossible': False}]], + 'context':'Beyoncé Giselle Knowles-Carter(biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. 
Born and raised in Houston, Texas, she.'} + }] +``` + +并参考[以内置数据集格式读取本地数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#id4)中的方法创建自己的数据集并修改`run_squad.py`中对应的数据集读取代码。再运行以下脚本: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type bert \ + --model_name_or_path your-best-model \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_predict \ + --version_2_with_negative + ``` + +即可完成预测,预测的答案保存在`prediction.json`中。数据格式如下所示,左边的id与输入中的id对应。 + +```text +{ + "56be85543aeaaa14008c9063": "in the late 1990s", + ... +} +``` + +### 静态图预测 + +在Fine-tune完成后,我们可以使用如下方式导出希望用来预测的模型: + +```shell +python -u ./export_model.py \ + --model_type bert \ + --model_path bert-base-uncased \ + --output_path ./infer_model/model +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_path` 表示训练模型的保存路径,与训练时的`output_dir`一致。 +- `output_path` 表示导出预测模型文件的前缀。保存时会添加后缀(`pdiparams`,`pdiparams.info`,`pdmodel`);除此之外,还会在`output_path`包含的目录下保存tokenizer相关内容。 + +然后按照如下的方式对阅读理解任务进行预测: + +```shell +python -u deploy/python/predict.py \ + --model_type bert \ + --model_name_or_path ./infer_model/model \ + --batch_size 4 \ + --max_seq_length 384 +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 表示预测模型文件的前缀,和上一步导出预测模型中的`output_path`一致。 +- `batch_size` 表示每个预测批次的样本数目。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断,和训练时一致。 + +以上命令将在SQuAD v1.1的验证集上进行预测。此外,同训练时一样,用户可以通过命令行传入`--version_2_with_negative`控制所需要的SQuAD数据集版本。 diff --git a/examples/machine_reading_comprehension/SQuAD/args.py b/examples/machine_reading_comprehension/SQuAD/args.py new file mode 100644 index 0000000000000000000000000000000000000000..2fa0e6e8ab8fa6222f2a75451522236c67f19f7b --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/args.py @@ -0,0 +1,104 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default="bert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + choices=["cpu", "gpu", "mlu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument( + "--version_2_with_negative", + action="store_true", + help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args diff --git a/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py b/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3a8281d70d96d4fe5fbf64361b55aa97970d7956 --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py @@ -0,0 +1,124 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +from functools import partial + +import paddle +from datasets import load_dataset + +from paddlenlp.data import Dict, Pad +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))) +from args import parse_args # noqa: E402 +from run_squad import MODEL_CLASSES, prepare_validation_features # noqa: E402 + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_name_or_path + ".pdmodel", args.model_name_or_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataset, raw_dataset, collate_fn, args, do_eval=True): + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, num_workers=0, return_list=True + ) + outputs = [] + all_start_logits = [] + all_end_logits = [] + for data in data_loader: + output = self.predict_batch(data) + outputs.append(output) + if do_eval: + all_start_logits.extend(list(output[0])) + all_end_logits.extend(list(output[1])) + if do_eval: + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + squad_evaluate( + examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json + ) + return outputs + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = 
tokenizer_class.from_pretrained(os.path.dirname(args.model_name_or_path)) + + if args.version_2_with_negative: + raw_dataset = load_dataset("squad_v2", split="validation") + else: + raw_dataset = load_dataset("squad", split="validation") + column_names = raw_dataset.column_names + dataset = raw_dataset.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + predictor = Predictor.create_predictor(args) + predictor.predict(dataset, raw_dataset, args=args, collate_fn=batchify_fn) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_reading_comprehension/SQuAD/export_model.py b/examples/machine_reading_comprehension/SQuAD/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..7f1ad38135c94021c1d0bc2eb9a6adaec69e6210 --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/export_model.py @@ -0,0 +1,78 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle + +from run_squad import MODEL_CLASSES + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="Path of the trained model to be exported.", + ) + parser.add_argument( + "--output_path", + default=None, + type=str, + required=True, + help="The output file prefix used to save the exported inference model.", + ) + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # build model and load trained parameters + model = model_class.from_pretrained(args.model_path) + # switch to eval model + model.eval() + # convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # save converted static graph model + paddle.jit.save(model, args.output_path) + # also save tokenizer for inference usage + tokenizer = tokenizer_class.from_pretrained(args.model_path) + tokenizer.save_pretrained(os.path.dirname(args.output_path)) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_reading_comprehension/SQuAD/run_squad.py b/examples/machine_reading_comprehension/SQuAD/run_squad.py new file mode 100644 index 
0000000000000000000000000000000000000000..9a4c9357050f70b73cb2086e11317597c97cc643 --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/run_squad.py @@ -0,0 +1,350 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ErnieForQuestionAnswering, + ErnieTokenizer, + FunnelForQuestionAnswering, + FunnelTokenizer, + LinearDecayWithWarmup, +) + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "funnel": (FunnelForQuestionAnswering, FunnelTokenizer), +} + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, contexts, max_seq_len=args.max_seq_length, stride=args.doc_stride, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. 
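+        # Illustrative note (hypothetical numbers): with max_seq_length=384 and doc_stride=128, a context
+        # that tokenizes to roughly 600 tokens is split into several overlapping features, so sample_mapping
+        # might look like [0, 0, 1, ...], meaning the first two features both come from example 0.
+        # offsets[k] holds the (start_char, end_char) span of token k inside the original context string.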
+ sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
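+        # Offsets left as None here are skipped later by compute_prediction, so only tokens that truly
+        # belong to the context (and not the question, padding or the final special token) can be chosen
+        # as answer-span boundaries during post-processing.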
+ tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, raw_dataset, features, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + start_logits_tensor, end_logits_tensor = model( + batch["input_ids"], token_type_ids=batch["token_type_ids"], attention_mask=batch["attention_mask"] + ) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + features, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.version_2_with_negative: + train_examples = load_dataset("squad_v2", split="train") + dev_examples = load_dataset("squad_v2", split="validation") + else: + train_examples = load_dataset("squad", split="train") + dev_examples = load_dataset("squad", split="validation") + set_seed(args) + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("init checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + column_names = train_examples.column_names + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds = train_examples.map( + partial(prepare_train_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = 
DataCollatorWithPadding(tokenizer) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + if args.use_amp: + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model( + input_ids=batch["input_ids"], + token_type_ids=batch["token_type_ids"], + attention_mask=batch["attention_mask"], + ) + loss = criterion(logits, (batch["start_positions"], batch["end_positions"])) + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + logits = model( + input_ids=batch["input_ids"], + token_type_ids=batch["token_type_ids"], + attention_mask=batch["attention_mask"], + ) + loss = criterion(logits, (batch["start_positions"], batch["end_positions"])) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if args.do_predict and rank == 0: + dev_ds = dev_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_ds_for_model = dev_ds.remove_columns(["example_id", "offset_mapping"]) + dev_batchify_fn = DataCollatorWithPadding(tokenizer) + + dev_data_loader = DataLoader( + dataset=dev_ds_for_model, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_data_loader, dev_examples, dev_ds, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/machine_translation/README.md b/examples/machine_translation/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..6d58e14e2015083bce66f9f9ff8cd66cb7af9c37 --- /dev/null +++ b/examples/machine_translation/README.md @@ -0,0 +1,116 @@ +# 机器翻译 + +机器翻译(Machine Translation)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 + +## 快速开始 + +### 环境依赖 + +使用当前机器翻译示例,需要额外安装配置以下环境: + +* attrdict +* pyyaml +* subword_nmt +* fastBPE (可选,若不使用 preprocessor.py 的 bpe 分词功能可以不需要) + +### 数据准备 + +数据准备部分提供两种模式:一种是使用 PaddleNLP 内置的、已经处理好的 WMT14 EN-DE 翻译数据集;另一种是为当前 Transformer demo 使用自定义数据集。以下分别展开介绍。 + +#### 使用内置的已经处理好的数据集 + +内置的处理好的数据集基于公开的 WMT 数据集。 + +WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,该数据集被较多论文使用,也是 Transformer 论文中用到的数据集之一。我们将 [WMT'14 EN-DE 数据集](http://www.statmt.org/wmt14/translation-task.html) 作为示例提供。 + +编写如下代码,即可自动载入处理好的上述数据,对应的 WMT14 EN-DE 数据集会自动下载并解压到 `~/.paddlenlp/datasets/WMT14ende/`。 + +``` python +from paddlenlp.datasets import load_dataset + +datasets = load_dataset('wmt14ende', splits=('train', 'dev')) +``` + +如果使用内置的处理好的数据,到这里即可完成数据准备。接下来可以直接移步 [Transformer 翻译模型](transformer/README.md),其中详细介绍了如何使用内置数据集训练一个英德翻译的 Transformer 模型。 + +#### 使用自定义翻译数据集 + +本示例同时提供了使用自定义数据集的方法。可参考以下方式执行数据处理: + +``` bash +# 数据下载、处理,包括 bpe 的训练 +bash preprocessor/prepare-wmt14en2de.sh --icml17 + +# 数据预处理 +DATA_DIR=preprocessor/wmt14_en_de + +python preprocessor/preprocessor.py \ + --src_lang en \ + --trg_lang de \ + --train_pref $DATA_DIR/train \ + --dev_pref $DATA_DIR/dev \ + --test_pref $DATA_DIR/test \ + --dest_dir data/wmt14_en_de \ + --threshold_trg 0 \ + --threshold_src 0 \ + --joined_dictionary +``` + +`preprocessor/preprocessor.py` 支持机器翻译中常见的数据预处理方式,提供词表构建、数据集文件整理以及可选的 bpe 分词功能。最后获取的处理完成的 train,dev,test 数据可以直接用于后面 Transformer 模型的训练、评估和推理中。具体的参数说明如下: + +* `--src_lang`(`-s`): 指明数据处理对应的源语言类型,比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。 +* `--trg_lang`(`-t`): 指明数据处理对应的目标语言的类型,比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。 +* `--train_pref`: 指明前序步骤中,下载的训练数据的路径,以及对应的文件名前缀,比如 `preprocessor/wmt14_en_de/train` 结合 `--src_lang de` 和 `--trg_lang en`,表示在 `preprocessor/wmt14_en_de/` 路径下,源语言是 `preprocessor/wmt14_en_de/train.en`,目标语言是 `preprocessor/wmt14_en_de/train.de`。 +* `--dev_pref`: 指明前序步骤中,下载的验证数据的路径,以及对应的文件名前缀。在验证集语料中,如果有的 token 在训练集中从未出现过,那么将会被 `<unk>` 替换。 +* `--test_pref`: 指明前序步骤中,下载的测试数据的路径,以及对应的文件名前缀。在测试集语料中,如果有的 token 在训练集中从未出现过,那么将会被 `<unk>` 替换。 +* `--dest_dir`: 完成数据处理之后,保存处理完成数据以及词表的路径。 +* `--threshold_src`: 在源语言中,出现频次小于 `--threshold_src` 指定的频次的 token 将会被替换成 `<unk>`。默认为 0,表示不会根据 token 出现的频次忽略 token 本身。 +* `--threshold_trg`: 在目标语言中,出现频次小于 `--threshold_trg` 指定的频次的 token 将会被替换成 `<unk>`。默认为 0,表示不会根据 token 出现的频次忽略 token 本身。 +* `--src_vocab`: 源语言词表,默认为 None,表示需要预处理步骤根据训练集语料重新生成一份词表。如果源语言与目标语言共用同一份词表,那么将使用 `--src_vocab` 指定的词表。 +* `--trg_vocab`: 目标语言词表,默认为 None,表示需要预处理步骤根据训练集语料重新生成一份词表。如果源语言与目标语言共用同一份词表,那么将使用 `--src_vocab` 指定的词表。 +* `--nwords_src`: 源语言词表最大的大小,不包括 special token。默认为 None,表示不限制。若源语言和目标语言共用同一份词表,那么将使用 `--nwords_src` 指定的大小。 +* `--nwords_trg`: 目标语言词表最大的大小,不包括 special token。默认为 None,表示不限制。若源语言和目标语言共用同一份词表,那么将使用 `--nwords_src` 指定的大小。 +* `--align_file`: 可选,指定词对齐文件的路径;若指定,将结合训练语料统计词对齐频次,并在 `--dest_dir` 下生成 `alignment.源语言-目标语言.txt` 对齐词典。 +* `--joined_dictionary`: 源语言和目标语言是否使用同一份词表。若不共用同一份词表,无需指定。 +* `--only_source`: 是否仅处理源语言。 +* `--dict_only`: 是否仅处理词表。若指定,则仅完成词表处理。 +* `--bos_token`: 指明翻译所用的 `bos_token`,表示一个句子开始。 +* `--eos_token`: 指明翻译所用的 `eos_token`,表示一个句子的结束。 +* `--pad_token`: 指明 `pad_token`,用于将一个 batch 内不同长度的句子 pad 到合适长度。 +* `--unk_token`: 指明 `unk_token`,用于当一个 token 在词表中未曾出现的情况,将使用 `--unk_token` 指明的字符替换。 +* `--apply_bpe`: 是否需要对数据作 bpe 分词。若指定则会在 preprocessor.py 脚本开始执行 bpe 分词。如果是使用提供的 shell 脚本完成的数据下载,则无需设置,在 shell 脚本中会作 bpe 分词处理。 +* 
`--bpe_code`: 若指明 `--apply_bpe` 使用 bpe 分词,则需同时提供训练好的 bpe code 文件。 + +除了 WMT14 德英翻译数据集外,我们也提供了其他的 shell 脚本完成数据下载处理,比如 WMT14 英法翻译数据。 + +``` bash +# WMT14 英法翻译的数据下载、处理 +bash prepare-wmt14en2fr.sh +``` + +完成数据处理之后,同样也可以采用上文提到的预处理方式获取词表,完成预处理。 + +如果有或者需要使用其他的平行语料,可以自行完成下载和简单的处理。 + +在下载部分,即在 shell 脚本中,处理需要用到 [mosesdecoder](https://github.com/moses-smt/mosesdecoder) 和 [subword-nmt](https://github.com/rsennrich/subword-nmt) 这两个工具。包括: + +* 使用 `mosesdecoder/scripts/tokenizer/tokenizer.perl` 完成对词做一个初步的切分; +* 基于 `mosesdecoder/scripts/training/clean-corpus-n.perl` 完成数据的清洗; +* 使用 `subword-nmt/subword_nmt/learn_bpe.py` 完成 bpe 的学习; + +此外,基于学到的 bpe code 进行分词的操作目前提供了两种选项,其一是,可以在以上的 shell 脚本中处理完成,使用以下的工具: + +* 使用 `subword-nmt/subword_nmt/apply_bpe.py` 完成分词工作。 + +其二,也可以直接在后面的 `preprocessor/preprocessor.py` 脚本中,指明 `--apply_bpe` 完成分词操作。 + + +### 如何训一个翻译模型 + +前文介绍了如何快速开始完成翻译训练所需平行语料的准备,关于进一步的,模型训练、评估和推理部分,可以根据需要,参考对应的模型的文档: + +* [Transformer 翻译模型](transformer/README.md) + +## Acknowledge + +我们借鉴了 facebookresearch 的 [fairseq](https://github.com/facebookresearch/fairseq) 在翻译数据的预处理上优秀的设计,在此对 fairseq 作者以及其开源社区表示感谢。 diff --git a/examples/machine_translation/preprocessor/prepare-iwslt14.sh b/examples/machine_translation/preprocessor/prepare-iwslt14.sh new file mode 100644 index 0000000000000000000000000000000000000000..4746b27c82e79e085f8fba4e1c593f38bfb77ac0 --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-iwslt14.sh @@ -0,0 +1,135 @@ +#!/usr/bin/env bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' +git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +LC=$SCRIPTS/tokenizer/lowercase.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=10000 + +URL="http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz" +GZ=de-en.tgz + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=de +tgt=en +lang=de-en +prep=iwslt14.tokenized.de-en +tmp=$prep/tmp +origin=origin + +mkdir -p $origin $tmp $prep + +echo "Downloading data from ${URL}..." +cd $origin +wget "$URL" + +if [ -f $GZ ]; then + echo "Data successfully downloaded." +else + echo "Data not successfully downloaded." + exit +fi + +tar zxvf $GZ +cd .. + +echo "pre-processing train data..." 
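+# The loop below strips the TED talk XML metadata (<url>, <talkid>, <keywords> lines and the
+# <title>/<description> markup) from the raw files and then tokenizes each side with the Moses tokenizer.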
+for l in $src $tgt; do + f=train.tags.$lang.$l + tok=train.tags.$lang.tok.$l + + cat $origin/$lang/$f | \ + grep -v '<url>' | \ + grep -v '<talkid>' | \ + grep -v '<keywords>' | \ + sed -e 's/<title>//g' | \ + sed -e 's/<\/title>//g' | \ + sed -e 's/<description>//g' | \ + sed -e 's/<\/description>//g' | \ + perl $TOKENIZER -threads 8 -l $l > $tmp/$tok + echo "" +done +perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175 +for l in $src $tgt; do + perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l +done + +echo "pre-processing dev/test data..." +for l in $src $tgt; do + for o in `ls $origin/$lang/IWSLT14.TED*.$l.xml`; do + fname=${o##*/} + f=$tmp/${fname%.*} + echo $o $f + grep '<seg id' $o | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -l $l | \ + perl $LC > $f + echo "" + done +done + + +echo "creating train, dev, test..." +for l in $src $tgt; do + awk '{if (NR%23 == 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/dev.$l + awk '{if (NR%23 != 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l + + cat $tmp/IWSLT14.TED.dev2010.de-en.$l \ + $tmp/IWSLT14.TEDX.dev2012.de-en.$l \ + $tmp/IWSLT14.TED.tst2010.de-en.$l \ + $tmp/IWSLT14.TED.tst2011.de-en.$l \ + $tmp/IWSLT14.TED.tst2012.de-en.$l \ + > $tmp/test.$l +done + +TRAIN=$tmp/train.en-de +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." + python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f + done +done + +cd - diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh b/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh new file mode 100644 index 0000000000000000000000000000000000000000..32926c4aa96b27ceda377e314102c3f15f122568 --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh @@ -0,0 +1,163 @@ +#!/bin/bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' 
+git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl +REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=40000 + +URLS=( + "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz" + "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz" + "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz" + "http://data.statmt.org/wmt17/translation-task/dev.tgz" + "http://statmt.org/wmt14/test-full.tgz" +) +FILES=( + "training-parallel-europarl-v7.tgz" + "training-parallel-commoncrawl.tgz" + "training-parallel-nc-v12.tgz" + "dev.tgz" + "test-full.tgz" +) +CORPORA=( + "training/europarl-v7.de-en" + "commoncrawl.de-en" + "training/news-commentary-v12.de-en" +) + +# This will make the dataset compatible to the one used in "Convolutional Sequence to Sequence Learning" +# https://arxiv.org/abs/1705.03122 +if [ "$1" == "--icml17" ]; then + URLS[2]="http://statmt.org/wmt14/training-parallel-nc-v9.tgz" + FILES[2]="training-parallel-nc-v9.tgz" + CORPORA[2]="training/news-commentary-v9.de-en" + OUTDIR=wmt14_en_de +else + OUTDIR=wmt17_en_de +fi + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=en +tgt=de +lang=en-de +prep=$OUTDIR +tmp=$prep/tmp +origin=origin +dev=dev/newstest2013 + +mkdir -p $origin $tmp $prep + +cd $origin + +for ((i=0;i<${#URLS[@]};++i)); do + file=${FILES[i]} + if [ -f $file ]; then + echo "$file already exists, skipping download" + else + url=${URLS[i]} + wget "$url" --no-check-certificate + if [ -f $file ]; then + echo "$url successfully downloaded." + else + echo "$url not successfully downloaded." + exit -1 + fi + if [ ${file: -4} == ".tgz" ]; then + tar zxvf $file + elif [ ${file: -4} == ".tar" ]; then + tar xvf $file + fi + fi +done +cd .. + +echo "pre-processing train data..." +for l in $src $tgt; do + rm $tmp/train.tags.$lang.tok.$l + for f in "${CORPORA[@]}"; do + cat $origin/$f.$l | \ + perl $NORM_PUNC $l | \ + perl $REM_NON_PRINT_CHAR | \ + perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l + done +done + +echo "pre-processing test data..." +for l in $src $tgt; do + if [ "$l" == "$src" ]; then + t="src" + else + t="ref" + fi + grep '<seg id' $origin/test-full/newstest2014-deen-$t.$l.sgm | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l + echo "" +done + +echo "splitting train and dev..." +for l in $src $tgt; do + awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/dev.$l + awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l +done + +TRAIN=$tmp/train.de-en +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." 
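+        # Encode every split with the merge operations learned above; the BPE output goes to $tmp/bpe.*
+        # and is length-ratio cleaned into $prep further below (the test set is copied as is).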
+ python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f + done +done + +perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250 +perl $CLEAN -ratio 1.5 $tmp/bpe.dev $src $tgt $prep/dev 1 250 + +for L in $src $tgt; do + cp $tmp/bpe.test.$L $prep/test.$L +done + +cd - diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh b/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh new file mode 100644 index 0000000000000000000000000000000000000000..3fc3bc10f6324875a5401688140ac0985f39600d --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh @@ -0,0 +1,157 @@ +#!/bin/bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' +git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl +REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=40000 + +URLS=( + "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz" + "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz" + "http://statmt.org/wmt13/training-parallel-un.tgz" + "http://statmt.org/wmt14/training-parallel-nc-v9.tgz" + "http://statmt.org/wmt10/training-giga-fren.tar" + "http://statmt.org/wmt14/test-full.tgz" +) +FILES=( + "training-parallel-europarl-v7.tgz" + "training-parallel-commoncrawl.tgz" + "training-parallel-un.tgz" + "training-parallel-nc-v9.tgz" + "training-giga-fren.tar" + "test-full.tgz" +) +CORPORA=( + "training/europarl-v7.fr-en" + "commoncrawl.fr-en" + "un/undoc.2000.fr-en" + "training/news-commentary-v9.fr-en" + "giga-fren.release2.fixed" +) + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=en +tgt=fr +lang=en-fr +prep=wmt14_en_fr +tmp=$prep/tmp +origin=origin + +mkdir -p $origin $tmp $prep + +cd $origin + +for ((i=0;i<${#URLS[@]};++i)); do + file=${FILES[i]} + if [ -f $file ]; then + echo "$file already exists, skipping download" + else + url=${URLS[i]} + wget "$url" --no-check-certificate + if [ -f $file ]; then + echo "$url successfully downloaded." + else + echo "$url not successfully downloaded." + exit -1 + fi + if [ ${file: -4} == ".tgz" ]; then + tar zxvf $file + elif [ ${file: -4} == ".tar" ]; then + tar xvf $file + fi + fi +done + +gunzip giga-fren.release2.fixed.*.gz +cd .. 
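+# Note: the EN-FR corpora (UN, news-commentary and especially giga-fren) are considerably larger than
+# the EN-DE ones, so the normalization/tokenization below needs a sizeable amount of disk space and time.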
+ +echo "pre-processing train data..." +for l in $src $tgt; do + rm $tmp/train.tags.$lang.tok.$l + for f in "${CORPORA[@]}"; do + cat $origin/$f.$l | \ + perl $NORM_PUNC $l | \ + perl $REM_NON_PRINT_CHAR | \ + perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l + done +done + +echo "pre-processing test data..." +for l in $src $tgt; do + if [ "$l" == "$src" ]; then + t="src" + else + t="ref" + fi + grep '<seg id' $origin/test-full/newstest2014-fren-$t.$l.sgm | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l + echo "" +done + +echo "splitting train and dev..." +for l in $src $tgt; do + awk '{if (NR%1333 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/dev.$l + awk '{if (NR%1333 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l +done + +TRAIN=$tmp/train.fr-en +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." + python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f + done +done + +perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250 +perl $CLEAN -ratio 1.5 $tmp/bpe.dev $src $tgt $prep/dev 1 250 + +for L in $src $tgt; do + cp $tmp/bpe.test.$L $prep/test.$L +done + +cd - diff --git a/examples/machine_translation/preprocessor/preprocessor.py b/examples/machine_translation/preprocessor/preprocessor.py new file mode 100644 index 0000000000000000000000000000000000000000..bc434a76478796478c79c06f39cb25e360b998a0 --- /dev/null +++ b/examples/machine_translation/preprocessor/preprocessor.py @@ -0,0 +1,326 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + +import argparse +import os +import shutil +from itertools import zip_longest +from pprint import pprint + +from paddlenlp.data import Vocab +from paddlenlp.utils.log import logger + + +def get_preprocessing_parser(): + parser = argparse.ArgumentParser() + + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--train_pref", default=None, type=str, help="The prefix for train file and also used to save dict. " + ) + parser.add_argument( + "--dev_pref", + default=None, + type=str, + help="The prefixes for dev file and use comma to separate. " + "(words missing from train set are replaced with <unk>)", + ) + parser.add_argument( + "--test_pref", + default=None, + type=str, + help="The prefixes for test file and use comma to separate. 
" + "(words missing from train set are replaced with <unk>)", + ) + parser.add_argument( + "--dest_dir", + default="./data/", + type=str, + help="The destination dir to save processed train, dev and test file. ", + ) + parser.add_argument( + "--threshold_trg", default=0, type=int, help="Map words appearing less than threshold times to unknown. " + ) + parser.add_argument( + "--threshold_src", default=0, type=int, help="Map words appearing less than threshold times to unknown. " + ) + parser.add_argument("--src_vocab", default=None, type=str, help="Reuse given source dictionary. ") + parser.add_argument("--trg_vocab", default=None, type=str, help="Reuse given target dictionary. ") + parser.add_argument("--nwords_trg", default=None, type=int, help="The number of target words to retain. ") + parser.add_argument("--nwords_src", default=None, type=int, help="The number of source words to retain. ") + parser.add_argument("--align_file", default=None, help="An alignment file (optional). ") + parser.add_argument("--joined_dictionary", action="store_true", help="Generate joined dictionary. ") + parser.add_argument("--only_source", action="store_true", help="Only process the source language. ") + parser.add_argument( + "--dict_only", action="store_true", help="Only builds a dictionary and then exits if it's set." + ) + parser.add_argument("--bos_token", default="<s>", type=str, help="bos_token. ") + parser.add_argument("--eos_token", default="</s>", type=str, help="eos_token. ") + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The token used for padding. If it's None, the bos_token will be used. Defaults to None. ", + ) + parser.add_argument("--unk_token", default="<unk>", type=str, help="Unk token. ") + parser.add_argument("--apply_bpe", action="store_true", help="Whether to apply bpe to the files. ") + parser.add_argument( + "--bpe_code", default=None, type=str, help="The code used for bpe. Must be provided when --apply_bpe is set. " + ) + + args = parser.parse_args() + return args + + +def _train_path(lang, train_pref): + return "{}{}".format(train_pref, ("." + lang) if lang else "") + + +def _dev_path(lang, dev_pref): + return "{}{}".format(dev_pref, ("." + lang) if lang else "") + + +def _test_path(lang, test_pref): + return "{}{}".format(test_pref, ("." + lang) if lang else "") + + +def _file_name(prefix, lang): + fname = prefix + if lang is not None: + fname += ".{lang}".format(lang=lang) + return fname + + +def _dest_path(prefix, lang, dest_dir): + return os.path.join(dest_dir, _file_name(prefix, lang)) + + +def _dict_path(lang, dest_dir): + return _dest_path("dict", lang, dest_dir) + ".txt" + + +def _build_dictionary(filenames, args, src=False, trg=False): + assert src ^ trg, "src and trg cannot be both True or both False. 
" + + if not isinstance(filenames, (list, tuple)): + filenames = [filenames] + + tokens = [] + for file in filenames: + with open(file, "r") as f: + lines = f.readlines() + for line in lines: + tokens.append(line.strip().split()) + + return Vocab.build_vocab( + tokens, + max_size=args.nwords_src if src else args.nwords_trg, + min_freq=args.threshold_src if src else args.threshold_trg, + unk_token=args.unk_token, + pad_token=args.pad_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + ) + + +def _make_dataset(vocab, input_prefix, output_prefix, lang, args): + # Copy original text file to destination folder + output_text_file = _dest_path( + output_prefix + ".{}-{}".format(args.src_lang, args.trg_lang), + lang, + args.dest_dir, + ) + + shutil.copyfile(_file_name(input_prefix, lang), output_text_file) + + +def _make_all(lang, vocab, args): + if args.train_pref: + _make_dataset(vocab, args.train_pref, "train", lang, args=args) + + if args.dev_pref: + for k, dev_pref in enumerate(args.dev_pref.split(",")): + out_prefix = "dev{}".format(k) if k > 0 else "dev" + _make_dataset(vocab, dev_pref, out_prefix, lang, args=args) + + if args.test_pref: + for k, test_pref in enumerate(args.test_pref.split(",")): + out_prefix = "test{}".format(k) if k > 0 else "test" + _make_dataset(vocab, test_pref, out_prefix, lang, args=args) + + +def _align_files(args, src_vocab, trg_vocab): + assert args.train_pref, "--train_pref must be set if --align_file is specified" + src_file_name = _train_path(args.src_lang, args.train_pref) + trg_file_name = _train_path(args.trg_lang, args.train_pref) + freq_map = {} + + with open(args.align_file, "r", encoding="utf-8") as align_file: + with open(src_file_name, "r", encoding="utf-8") as src_file: + with open(trg_file_name, "r", encoding="utf-8") as trg_file: + for a, s, t in zip_longest(align_file, src_file, trg_file): + si = src_vocab.to_indices(s) + ti = trg_vocab.to_indices(t) + ai = list(map(lambda x: tuple(x.split("\t")), a.split())) + for sai, tai in ai: + src_idx = si[int(sai)] + trg_idx = ti[int(tai)] + if src_idx != src_vocab.get_unk_token_id() and trg_idx != trg_vocab.get_unk_token_id(): + assert src_idx != src_vocab.get_pad_token_id() + assert src_idx != src_vocab.get_eos_token_id() + assert trg_idx != trg_vocab.get_pad_token_id() + assert trg_idx != trg_vocab.get_eos_token_id() + if src_idx not in freq_map: + freq_map[src_idx] = {} + if trg_idx not in freq_map[src_idx]: + freq_map[src_idx][trg_idx] = 1 + else: + freq_map[src_idx][trg_idx] += 1 + + align_dict = {} + for src_idx in freq_map.keys(): + align_dict[src_idx] = max(freq_map[src_idx], key=freq_map[src_idx].get) + + with open( + os.path.join( + args.dest_dir, + "alignment.{}-{}.txt".format(args.src_lang, args.trg_lang), + ), + "w", + encoding="utf-8", + ) as f: + for k, v in align_dict.items(): + print("{} {}".format(src_vocab[k], trg_vocab[v]), file=f) + + +def main(args): + os.makedirs(args.dest_dir, exist_ok=True) + pprint(args) + + if args.apply_bpe: + import fastBPE + + bpe = fastBPE.fastBPE(args.bpe_code) + filenames = [_train_path(lang, args.train_pref) for lang in [args.src_lang, args.trg_lang]] + for k, dev_pref in enumerate(args.dev_pref.split(",")): + filenames.extend([_dev_path(lang, args.dev_pref) for lang in [args.src_lang, args.trg_lang]]) + for k, test_pref in enumerate(args.test_pref.split(",")): + filenames.extend([_test_path(lang, args.test_pref) for lang in [args.src_lang, args.trg_lang]]) + + for file in filenames: + sequences = [] + with open(file, "r") as f: + lines = 
f.readlines() + for seq in lines: + sequences.append(seq.strip()) + + bpe_sequences = bpe.apply(sequences) + os.makedirs(os.path.join(args.train_pref, "tmp_bpe"), exist_ok=True) + shutil.copyfile(file, os.path.join(args.train_pref, "tmp_bpe", os.path.split(file)[-1])) + + with open(file, "w") as f: + for bpe_seq in bpe_sequences: + f.write(bpe_seq + "\n") + + # build dictionaries + target = not args.only_source + + if not args.src_vocab and os.path.exists(_dict_path(args.src_lang, args.dest_dir)): + raise FileExistsError(_dict_path(args.src_lang, args.dest_dir)) + + if target and not args.trg_vocab and os.path.exists(_dict_path(args.trg_lang, args.dest_dir)): + raise FileExistsError(_dict_path(args.trg_lang, args.dest_dir)) + + if args.joined_dictionary: + assert ( + not args.src_vocab or not args.trg_vocab + ), "Cannot use both --src_vocab and --trg_vocab with --joined_dictionary" + + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif args.trg_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --src_vocab is not specified. " + src_vocab = _build_dictionary( + [_train_path(lang, args.train_pref) for lang in [args.src_lang, args.trg_lang]], args=args, src=True + ) + + trg_vocab = src_vocab + else: + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --src_vocab is not specified" + src_vocab = _build_dictionary([_train_path(args.src_lang, args.train_pref)], args=args, src=True) + + if target: + if args.trg_vocab: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --trg_vocab is not specified" + trg_vocab = _build_dictionary([_train_path(args.trg_lang, args.train_pref)], args=args, trg=True) + else: + trg_vocab = None + + # save dictionaries + src_vocab.save_vocabulary(_dict_path(args.src_lang, args.dest_dir)) + if target and trg_vocab is not None: + trg_vocab.save_vocabulary(_dict_path(args.trg_lang, args.dest_dir)) + + if args.dict_only: + return + + _make_all(args.src_lang, src_vocab, args) + if target: + _make_all(args.trg_lang, trg_vocab, args) + + logger.info("Wrote preprocessed data to {}".format(args.dest_dir)) + + if args.align_file: + _align_files(args, src_vocab=src_vocab, trg_vocab=trg_vocab) + + +if __name__ == "__main__": + args = get_preprocessing_parser() + main(args) diff --git a/examples/machine_translation/requirements.txt b/examples/machine_translation/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..59ee8dc7f381d36665221bba3384844c881013e0 --- /dev/null +++ b/examples/machine_translation/requirements.txt @@ -0,0 +1,4 @@ +attrdict +easydict +pyyaml +subword_nmt diff --git a/examples/machine_translation/seq2seq/README.md b/examples/machine_translation/seq2seq/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2f271dfb6b4f26aff0cb2eada4a29c818c00a7ce --- /dev/null +++ 
b/examples/machine_translation/seq2seq/README.md @@ -0,0 +1,104 @@ +# Machine Translation using Seq2Seq with Attention + +以下是本范例模型的简要目录结构及说明: + +``` +. +├── deploy # 预测部署目录 +│ └── python +│ └── infer.py # 用预测模型进行推理的程序 +├── README.md # 文档,本文件 +├── args.py # 训练、预测、导出模型以及模型参数配置程序 +├── data.py # 数据读入程序 +├── train.py # 训练主程序 +├── predict.py # 预测主程序 +├── export_model.py # 导出预测模型的程序 +└── seq2seq_attn.py # 带注意力机制的翻译模型程序 +``` + +## 简介 + +Sequence to Sequence (Seq2Seq),使用编码器-解码器(Encoder-Decoder)结构,用编码器将源序列编码成vector,再用解码器将该vector解码为目标序列。Seq2Seq 广泛应用于机器翻译,自动对话机器人,文档摘要自动生成,图片描述自动生成等任务中。 + +本目录包含Seq2Seq的一个经典样例:机器翻译,带Attention机制的翻译模型。Seq2Seq翻译模型,模拟了人类在进行翻译类任务时的行为:先解析源语言,理解其含义,再根据该含义来写出目标语言的语句。更多关于机器翻译的具体原理和数学表达式,我们推荐参考飞桨官网[机器翻译案例](https://www.paddlepaddle.org.cn/documentation/docs/zh/user_guides/nlp_case/machine_translation/README.cn.html)。 + +## 模型概览 + +本模型中,在编码器方面,我们采用了基于LSTM的多层的RNN encoder;在解码器方面,我们使用了带注意力(Attention)机制的RNN decoder,在预测时我们使用柱搜索(beam search)算法来生成翻译的目标语句。 + +## 数据介绍 + +本教程使用[IWSLT'15 English-Vietnamese data ](https://nlp.stanford.edu/projects/nmt/)数据集中的英语到越南语的数据作为训练语料,tst2012的数据作为开发集,tst2013的数据作为测试集。 + +### 数据获取 +如果用户在初始化数据集时没有提供路径,数据集会自动下载到`paddlenlp.utils.env.DATA_HOME`的`IWSLT15/`路径下,例如在linux系统下,默认存储路径是`~/.paddlenlp/datasets/IWSLT15`。 + +## 模型训练 + +执行以下命令即可训练带有注意力机制的Seq2Seq机器翻译模型: + +```sh +python train.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --device gpu \ + --model_path ./attention_models +``` + +各参数的具体说明请参阅 `args.py` 。训练程序会在每个epoch训练结束之后,save一次模型。 + +**NOTE:** 如需恢复模型训练,则`init_from_ckpt`只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=attention_models/5`即可,程序会自动加载模型参数`attention_models/5.pdparams`,也会自动加载优化器状态`attention_models/5.pdopt`。 + +## 模型预测 + +训练完成之后,可以使用保存的模型(由 `--init_from_ckpt` 指定)对测试集的数据集进行beam search解码。生成的翻译结果位于`--infer_output_file`指定的路径,预测命令如下: + +```sh +python predict.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --init_from_ckpt attention_models/9 \ + --infer_output_file infer_output.txt \ + --beam_size 10 \ + --device gpu +``` + +各参数的具体说明请参阅 `args.py` ,注意预测时所用模型超参数需和训练时一致。 + +## 预测效果评价 +取第10个epoch的结果,用取beam_size为10的beam search解码,`predict.py`脚本在生成翻译结果之后,会调用`paddlenlp.metrics.BLEU`计算翻译结果的BLEU指标,最终计算出的BLEU分数为0.24329954822714048 + +## 保存预测模型 +这里指定的参数`export_path` 表示导出预测模型文件的前缀。保存时会添加后缀(`pdiparams`,`pdiparams.info`,`pdmodel`)。 +```shell +python export_model.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --init_from_ckpt attention_models/9.pdparams \ + --beam_size 10 \ + --export_path ./infer_model/model +``` + +## 基于预测引擎推理 +然后按照如下的方式对IWSLT15数据集中的测试集(有标注的)进行预测(基于Paddle的[Python预测API](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/python_infer_cn.html)): + +```shell +cd deploy/python +python infer.py \ + --export_path ../../infer_model/model \ + --device gpu \ + --batch_size 128 \ + --infer_output_file infer_output.txt +``` diff --git a/examples/machine_translation/seq2seq/args.py b/examples/machine_translation/seq2seq/args.py new file mode 100644 index 0000000000000000000000000000000000000000..317917ab1189cb523acdef7a110752ee38db810d --- /dev/null +++ b/examples/machine_translation/seq2seq/args.py @@ -0,0 +1,61 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--learning_rate", type=float, default=0.001, help="learning rate for optimizer") + + parser.add_argument("--num_layers", type=int, default=1, help="layers number of encoder and decoder") + + parser.add_argument("--hidden_size", type=int, default=100, help="hidden size of encoder and decoder") + + parser.add_argument("--batch_size", type=int, help="batch size of each step") + + parser.add_argument("--max_epoch", type=int, default=12, help="max epoch for the training") + + parser.add_argument("--max_len", type=int, default=50, help="max length for source and target sentence") + + parser.add_argument("--dropout", type=float, default=0.2, help="drop probability") + + parser.add_argument("--init_scale", type=float, default=0.0, help="init scale for parameter") + + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip") + + parser.add_argument("--log_freq", type=int, default=100, help="The frequency to print training logs") + + parser.add_argument("--model_path", type=str, default="model", help="model path for model to save") + + parser.add_argument("--infer_output_file", type=str, default="infer_output", help="file name for inference output") + + parser.add_argument("--beam_size", type=int, default=10, help="file name for inference") + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + + parser.add_argument( + "--export_path", + type=str, + default=None, + help="The output file prefix used to save the exported inference model.", + ) + + args = parser.parse_args() + return args diff --git a/examples/machine_translation/seq2seq/data.py b/examples/machine_translation/seq2seq/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3e4f44901a42e3f23c11fe2d1a7cc065736ea94f --- /dev/null +++ b/examples/machine_translation/seq2seq/data.py @@ -0,0 +1,113 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
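+# data.py builds the IWSLT'15 en-vi dataloaders for this seq2seq example. A rough usage sketch
+# (argument values here are only illustrative):
+#
+#     from args import parse_args
+#     args = parse_args()  # e.g. --batch_size 128 --max_len 50
+#     train_loader, dev_loader, src_vocab_size, tgt_vocab_size, pad_id = create_train_loader(args)
+#     for src, src_length, tgt_in, tgt_out, tgt_mask in train_loader:
+#         ...  # feed the batch to the model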
+ +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import Pad, SamplerHelper, Vocab +from paddlenlp.datasets import load_dataset + + +def create_train_loader(args): + batch_size = args.batch_size + max_len = args.max_len + + train_ds, dev_ds = load_dataset("iwslt15", splits=("train", "dev")) + src_vocab = Vocab.load_vocabulary(**train_ds.vocab_info["en"]) + tgt_vocab = Vocab.load_vocabulary(**train_ds.vocab_info["vi"]) + bos_id = src_vocab[src_vocab.bos_token] + eos_id = src_vocab[src_vocab.eos_token] + pad_id = eos_id + + def convert_example(example): + source = example["en"].split()[:max_len] + target = example["vi"].split()[:max_len] + + source = src_vocab.to_indices(source) + target = tgt_vocab.to_indices(target) + + return source, target + + key = lambda x, data_source: len(data_source[x][0]) + + # Truncate and convert example to ids + train_ds = train_ds.map(convert_example, lazy=False) + dev_ds = dev_ds.map(convert_example, lazy=False) + + train_batch_sampler = ( + SamplerHelper(train_ds).shuffle().sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + ) + + dev_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + + train_loader = paddle.io.DataLoader( + train_ds, + batch_sampler=train_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + dev_loader = paddle.io.DataLoader( + dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + return train_loader, dev_loader, len(src_vocab), len(tgt_vocab), pad_id + + +def create_infer_loader(args): + batch_size = args.batch_size + test_ds = load_dataset("iwslt15", splits="test") + src_vocab = Vocab.load_vocabulary(**test_ds.vocab_info["en"]) + tgt_vocab = Vocab.load_vocabulary(**test_ds.vocab_info["vi"]) + bos_id = src_vocab[src_vocab.bos_token] + eos_id = src_vocab[src_vocab.eos_token] + pad_id = eos_id + + def convert_example(example): + source = example["en"].split() + target = example["vi"].split() + + source = src_vocab.to_indices(source) + target = tgt_vocab.to_indices(target) + + return source, target + + test_ds.map(convert_example) + test_batch_sampler = SamplerHelper(test_ds).batch(batch_size=batch_size) + + test_loader = paddle.io.DataLoader( + test_ds, + batch_sampler=test_batch_sampler, + collate_fn=partial(prepare_infer_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + return test_loader, len(src_vocab), len(tgt_vocab), bos_id, eos_id + + +def prepare_infer_input(insts, bos_id, eos_id, pad_id): + insts = [([bos_id] + inst[0] + [eos_id], [bos_id] + inst[1] + [eos_id]) for inst in insts] + src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) + return src, src_length + + +def prepare_train_input(insts, bos_id, eos_id, pad_id): + # Add eos token id and bos token id. + insts = [([bos_id] + inst[0] + [eos_id], [bos_id] + inst[1] + [eos_id]) for inst in insts] + # Pad sequence using eos id. 
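+    # The batch returned below is (src, src_length, decoder_input, labels, tgt_mask): decoder_input is the
+    # target shifted right (tgt[:, :-1]) and labels are the next tokens (tgt[:, 1:]), the usual
+    # teacher-forcing layout, while tgt_mask marks the non-padding positions so they can be masked in the loss.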
+ src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) + tgt, tgt_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([inst[1] for inst in insts]) + tgt_mask = (tgt[:, :-1] != pad_id).astype("float32") + return src, src_length, tgt[:, :-1], tgt[:, 1:, np.newaxis], tgt_mask diff --git a/examples/machine_translation/seq2seq/deploy/python/infer.py b/examples/machine_translation/seq2seq/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..de6f80e0778590208c81142faa663d115c85cde4 --- /dev/null +++ b/examples/machine_translation/seq2seq/deploy/python/infer.py @@ -0,0 +1,99 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io +import sys + +sys.path.append("../../") + +import numpy as np # noqa: E402 +import paddle # noqa: E402 +from args import parse_args # noqa: E402 +from data import create_infer_loader # noqa: E402 +from predict import post_process_seq # noqa: E402 + +from paddlenlp.data import Vocab # noqa: E402 +from paddlenlp.datasets import load_dataset # noqa: E402 +from paddlenlp.metrics import BLEU # noqa: E402 + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.export_path + ".pdmodel", args.export_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataloader, infer_output_file, trg_idx2word, bos_id, eos_id): + cand_list = [] + with io.open(infer_output_file, "w", encoding="utf-8") as f: + for data in dataloader(): + finished_seq = self.predict_batch(data)[0] + finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq + finished_seq = np.transpose(finished_seq, [0, 2, 1]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + id_list = post_process_seq(beam, bos_id, eos_id) + 
word_list = [trg_idx2word[id] for id in id_list] + sequence = " ".join(word_list) + "\n" + f.write(sequence) + cand_list.append(word_list) + break + + test_ds = load_dataset("iwslt15", splits="test") + bleu = BLEU() + for i, data in enumerate(test_ds): + ref = data["vi"].split() + bleu.add_inst(cand_list[i], [ref]) + print("BLEU score is %s." % bleu.score()) + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + test_loader, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args) + tgt_vocab = Vocab.load_vocabulary(**test_loader.dataset.vocab_info["vi"]) + trg_idx2word = tgt_vocab.idx_to_token + + predictor.predict(test_loader, args.infer_output_file, trg_idx2word, bos_id, eos_id) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_translation/seq2seq/export_model.py b/examples/machine_translation/seq2seq/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..79b05c1dccb5a54a13437b66cffa4c63c1aa8a99 --- /dev/null +++ b/examples/machine_translation/seq2seq/export_model.py @@ -0,0 +1,57 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from args import parse_args +from data import create_infer_loader +from seq2seq_attn import Seq2SeqAttnInferModel + + +def main(): + args = parse_args() + _, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args) + + # Build model and load trained parameters + model = Seq2SeqAttnInferModel( + src_vocab_size, + tgt_vocab_size, + args.hidden_size, + args.hidden_size, + args.num_layers, + args.dropout, + bos_id=bos_id, + eos_id=eos_id, + beam_size=args.beam_size, + max_out_len=256, + ) + + # Load the trained model + model.set_state_dict(paddle.load(args.init_from_ckpt)) + + # Wwitch to eval model + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # src + paddle.static.InputSpec(shape=[None], dtype="int64"), # src length + ], + ) + # Save converted static graph model + paddle.jit.save(model, args.export_path) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_translation/seq2seq/predict.py b/examples/machine_translation/seq2seq/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..0da32d69d057141597a8d183331b3d8b34e1d1bd --- /dev/null +++ b/examples/machine_translation/seq2seq/predict.py @@ -0,0 +1,92 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io + +import numpy as np +import paddle +from args import parse_args +from data import create_infer_loader +from seq2seq_attn import Seq2SeqAttnInferModel + +from paddlenlp.data import Vocab +from paddlenlp.metrics import BLEU + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + paddle.set_device(args.device) + + test_loader, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args) + tgt_vocab = Vocab.load_vocabulary(**test_loader.dataset.vocab_info["vi"]) + + model = paddle.Model( + Seq2SeqAttnInferModel( + src_vocab_size, + tgt_vocab_size, + args.hidden_size, + args.hidden_size, + args.num_layers, + args.dropout, + bos_id=bos_id, + eos_id=eos_id, + beam_size=args.beam_size, + max_out_len=256, + ) + ) + + model.prepare() + + # Load the trained model + assert args.init_from_ckpt, "Please set reload_model to load the infer model." + model.load(args.init_from_ckpt) + + cand_list = [] + with io.open(args.infer_output_file, "w", encoding="utf-8") as f: + for data in test_loader(): + with paddle.no_grad(): + finished_seq = model.predict_batch(inputs=data)[0] + finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq + finished_seq = np.transpose(finished_seq, [0, 2, 1]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + id_list = post_process_seq(beam, bos_id, eos_id) + word_list = [tgt_vocab.to_tokens(id) for id in id_list] + sequence = " ".join(word_list) + "\n" + f.write(sequence) + cand_list.append(word_list) + break + + bleu = BLEU() + for i, data in enumerate(test_loader.dataset.data): + ref = data["vi"].split() + bleu.add_inst(cand_list[i], [ref]) + print("BLEU score is %s." % bleu.score()) + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/examples/machine_translation/seq2seq/seq2seq_attn.py b/examples/machine_translation/seq2seq/seq2seq_attn.py new file mode 100644 index 0000000000000000000000000000000000000000..5bbcf62b77c5e899d9baec8187c139749c14bac6 --- /dev/null +++ b/examples/machine_translation/seq2seq/seq2seq_attn.py @@ -0,0 +1,254 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
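+
+# LSTM encoder-decoder with input-feeding attention. Seq2SeqAttnModel is used
+# for training with teacher forcing; Seq2SeqAttnInferModel reuses its layers
+# and decodes with beam search for inference.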
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.nn.initializer as I + + +class CrossEntropyCriterion(nn.Layer): + def __init__(self): + super(CrossEntropyCriterion, self).__init__() + + def forward(self, predict, label, trg_mask): + cost = F.cross_entropy(input=predict, label=label, soft_label=False, reduction="none") + cost = paddle.squeeze(cost, axis=[2]) + masked_cost = cost * trg_mask + batch_mean_cost = paddle.mean(masked_cost, axis=[0]) + seq_cost = paddle.sum(batch_mean_cost) + + return seq_cost + + +class Seq2SeqEncoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout_prob=0.0, init_scale=0.1): + super(Seq2SeqEncoder, self).__init__() + self.embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.lstm = nn.LSTM( + input_size=embed_dim, + hidden_size=hidden_size, + num_layers=num_layers, + direction="forward", + dropout=dropout_prob if num_layers > 1 else 0.0, + ) + + def forward(self, sequence, sequence_length): + inputs = self.embedder(sequence) + encoder_output, encoder_state = self.lstm(inputs, sequence_length=sequence_length) + + return encoder_output, encoder_state + + +class AttentionLayer(nn.Layer): + def __init__(self, hidden_size, bias=False, init_scale=0.1): + super(AttentionLayer, self).__init__() + self.input_proj = nn.Linear( + hidden_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=bias, + ) + self.output_proj = nn.Linear( + hidden_size + hidden_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=bias, + ) + + def forward(self, hidden, encoder_output, encoder_padding_mask): + encoder_output = self.input_proj(encoder_output) + attn_scores = paddle.matmul(paddle.unsqueeze(hidden, [1]), encoder_output, transpose_y=True) + + if encoder_padding_mask is not None: + attn_scores = paddle.add(attn_scores, encoder_padding_mask) + + attn_scores = F.softmax(attn_scores) + attn_out = paddle.squeeze(paddle.matmul(attn_scores, encoder_output), [1]) + attn_out = paddle.concat([attn_out, hidden], 1) + attn_out = self.output_proj(attn_out) + return attn_out + + +class Seq2SeqDecoderCell(nn.RNNCellBase): + def __init__(self, num_layers, input_size, hidden_size, dropout_prob=0.0): + super(Seq2SeqDecoderCell, self).__init__() + if dropout_prob > 0.0: + self.dropout = nn.Dropout(dropout_prob) + else: + self.dropout = None + + self.lstm_cells = nn.LayerList( + [ + nn.LSTMCell(input_size=input_size + hidden_size if i == 0 else hidden_size, hidden_size=hidden_size) + for i in range(num_layers) + ] + ) + + self.attention_layer = AttentionLayer(hidden_size) + + def forward(self, step_input, states, encoder_output, encoder_padding_mask=None): + lstm_states, input_feed = states + new_lstm_states = [] + step_input = paddle.concat([step_input, input_feed], 1) + for i, lstm_cell in enumerate(self.lstm_cells): + out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) + if self.dropout: + step_input = self.dropout(out) + else: + step_input = out + + new_lstm_states.append(new_lstm_state) + out = self.attention_layer(step_input, encoder_output, encoder_padding_mask) + return out, [new_lstm_states, out] + + +class Seq2SeqDecoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout_prob=0.0, init_scale=0.1): + super(Seq2SeqDecoder, self).__init__() + 
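+        # The target-side embedding uses the same uniform initialization
+        # range (init_scale) as the encoder embedding.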
self.embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + self.lstm_attention = nn.RNN( + Seq2SeqDecoderCell(num_layers, embed_dim, hidden_size, dropout_prob), is_reverse=False, time_major=False + ) + self.output_layer = nn.Linear( + hidden_size, + vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=False, + ) + + def forward(self, trg, decoder_initial_states, encoder_output, encoder_padding_mask): + inputs = self.embedder(trg) + + decoder_output, _ = self.lstm_attention( + inputs, + initial_states=decoder_initial_states, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + predict = self.output_layer(decoder_output) + + return predict + + +class Seq2SeqAttnModel(nn.Layer): + def __init__( + self, + src_vocab_size, + trg_vocab_size, + embed_dim, + hidden_size, + num_layers, + dropout_prob=0.0, + eos_id=1, + init_scale=0.1, + ): + super(Seq2SeqAttnModel, self).__init__() + self.hidden_size = hidden_size + self.eos_id = eos_id + self.num_layers = num_layers + self.INF = 1e9 + self.encoder = Seq2SeqEncoder(src_vocab_size, embed_dim, hidden_size, num_layers, dropout_prob, init_scale) + self.decoder = Seq2SeqDecoder(trg_vocab_size, embed_dim, hidden_size, num_layers, dropout_prob, init_scale) + + def forward(self, src, src_length, trg): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + # Transfer shape of encoder_final_states to [num_layers, 2, batch_size, hidden_size] + encoder_final_states = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + # Construct decoder initial states: use input_feed and the shape is + # [[h,c] * num_layers, input_feed], consistent with Seq2SeqDecoderCell.states + decoder_initial_states = [ + encoder_final_states, + self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on padddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + predict = self.decoder(trg, decoder_initial_states, encoder_output, encoder_padding_mask) + return predict + + +class Seq2SeqAttnInferModel(Seq2SeqAttnModel): + def __init__( + self, + src_vocab_size, + trg_vocab_size, + embed_dim, + hidden_size, + num_layers, + dropout_prob=0.0, + bos_id=0, + eos_id=1, + beam_size=4, + max_out_len=256, + ): + args = dict(locals()) + args.pop("self") + args.pop("__class__", None) + self.bos_id = args.pop("bos_id") + self.beam_size = args.pop("beam_size") + self.max_out_len = args.pop("max_out_len") + self.num_layers = num_layers + super(Seq2SeqAttnInferModel, self).__init__(**args) + # Dynamic decoder for inference + self.beam_search_decoder = nn.BeamSearchDecoder( + self.decoder.lstm_attention.cell, + start_token=bos_id, + end_token=eos_id, + beam_size=beam_size, + embedding_fn=self.decoder.embedder, + output_fn=self.decoder.output_layer, + ) + + def forward(self, src, src_length): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + encoder_final_state = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + + # Initial decoder initial states + decoder_initial_states = [ + encoder_final_state, + 
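+            # The second state is the input feed, initialized by
+            # get_initial_states to zeros of shape [batch_size, hidden_size].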
self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on paddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + # Tile the batch dimension with beam_size + encoder_output = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_output, self.beam_size) + encoder_padding_mask = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_padding_mask, self.beam_size) + + # Dynamic decoding with beam search + seq_output, _ = nn.dynamic_decode( + decoder=self.beam_search_decoder, + inits=decoder_initial_states, + max_step_num=self.max_out_len, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + return seq_output diff --git a/examples/machine_translation/seq2seq/train.py b/examples/machine_translation/seq2seq/train.py new file mode 100644 index 0000000000000000000000000000000000000000..fec0708040f5e3f28f3cbc713ed225eb0b86751b --- /dev/null +++ b/examples/machine_translation/seq2seq/train.py @@ -0,0 +1,64 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
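+
+# Trains the attention-based Seq2Seq model on the IWSLT'15 en-vi dataset with
+# the paddle.Model high-level API, reporting perplexity on the dev set.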
+ +import paddle +import paddle.nn as nn +from args import parse_args +from data import create_train_loader +from seq2seq_attn import CrossEntropyCriterion, Seq2SeqAttnModel + +from paddlenlp.metrics import Perplexity + + +def do_train(args): + paddle.set_device(args.device) + + # Define dataloader + train_loader, eval_loader, src_vocab_size, tgt_vocab_size, eos_id = create_train_loader(args) + + model = paddle.Model( + Seq2SeqAttnModel( + src_vocab_size, tgt_vocab_size, args.hidden_size, args.hidden_size, args.num_layers, args.dropout, eos_id + ) + ) + + grad_clip = nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.Adam( + learning_rate=args.learning_rate, parameters=model.parameters(), grad_clip=grad_clip + ) + + ppl_metric = Perplexity() + model.prepare(optimizer, CrossEntropyCriterion(), ppl_metric) + + print(args) + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + benchmark_logger = paddle.callbacks.ProgBarLogger(log_freq=args.log_freq, verbose=3) + + model.fit( + train_data=train_loader, + eval_data=eval_loader, + epochs=args.max_epoch, + eval_freq=1, + save_freq=1, + save_dir=args.model_path, + callbacks=[benchmark_logger], + ) + + +if __name__ == "__main__": + args = parse_args() + do_train(args) diff --git a/examples/machine_translation/transformer/README.md b/examples/machine_translation/transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e653a539a99d8047b3d4c643537c99001614a1ca --- /dev/null +++ b/examples/machine_translation/transformer/README.md @@ -0,0 +1,445 @@ +# Machine Translation using Transformer + +机器翻译(Machine Translation)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 + +本项目是机器翻译领域主流模型 Transformer 的 PaddlePaddle 实现,包含模型训练,预测以及使用自定义数据等内容。用户可以基于发布的内容搭建自己的翻译模型。 + +## 模型介绍 +Transformer 是论文 [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 中提出的用以完成机器翻译(Machine Translation)等序列到序列(Seq2Seq)学习任务的一种全新网络结构,其完全使用注意力(Attention)机制来实现序列到序列的建模[1]。 + +<p align="center"> +<img src="images/transformer_network.png" height=400 hspace='10'/> <br /> +图 1. Transformer 网络结构图 +</p> + +相较于此前 Seq2Seq 模型中广泛使用的循环神经网络(Recurrent Neural Network, RNN),使用Self Attention进行输入序列到输出序列的变换主要具有以下优势: + +- 计算复杂度小 + - 特征维度为 d 、长度为 n 的序列,在 RNN 中计算复杂度为 `O(n * d * d)` (n 个时间步,每个时间步计算 d 维的矩阵向量乘法),在 Self-Attention 中计算复杂度为 `O(n * n * d)` (n 个时间步两两计算 d 维的向量点积或其他相关度函数),n 通常要小于 d 。 +- 计算并行度高 + - RNN 中当前时间步的计算要依赖前一个时间步的计算结果;Self-Attention 中各时间步的计算只依赖输入不依赖之前时间步输出,各时间步可以完全并行。 +- 容易学习长程依赖(long-range dependencies) + - RNN 中相距为 n 的两个位置间的关联需要 n 步才能建立;Self-Attention 中任何两个位置都直接相连;路径越短信号传播越容易。 + +Transformer 中引入使用的基于 Self-Attention 的序列建模模块结构,已被广泛应用在 Bert [2]等语义表示模型中,取得了显著效果。 + +### 模型特点 + +Transformer 中的 Encoder 由若干相同的 layer 堆叠组成,每个 layer 主要由多头注意力(Multi-Head Attention)和全连接的前馈(Feed-Forward)网络这两个 sub-layer 构成。 +- Multi-Head Attention 在这里用于实现 Self-Attention,相比于简单的 Attention 机制,其将输入进行多路线性变换后分别计算 Attention 的结果,并将所有结果拼接后再次进行线性变换作为输出。参见图2,其中 Attention 使用的是点积(Dot-Product),并在点积后进行了 scale 的处理以避免因点积结果过大进入 softmax 的饱和区域。 +- Feed-Forward 网络会对序列中的每个位置进行相同的计算(Position-wise),其采用的是两次线性变换中间加以 ReLU 激活的结构。 + +此外,每个 sub-layer 后还施以 Residual Connection [3] 和 Layer Normalization [4] 来促进梯度传播和模型收敛。 + +<p align="center"> +<img src="images/multi_head_attention.png" height=300 hspace='10'/> <br /> +图 2. 
Multi-Head Attention +</p> + +Decoder 具有和 Encoder 类似的结构,只是相比于组成 Encoder 的 layer ,在组成 Decoder 的 layer 中还多了一个 Multi-Head Attention 的 sub-layer 来实现对 Encoder 输出的 Attention,这个 Encoder-Decoder Attention 在其他 Seq2Seq 模型中也是存在的。 + +## 数据准备 + +本示例可以使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译的数据进行训练、预测,也可以使用自定义数据集。数据准备部分可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +## 动态图 + +### 使用内置数据集进行训练 + +以下文档,介绍了使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译数据集的训练方式。 + +#### 单机单卡 + +以提供的英德翻译数据为例,可以执行以下命令进行模型训练: + +``` sh +# Setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ./configs/transformer.base.yaml +``` + +可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中设置相应的参数。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +如果是在单卡下进行训练,可能需要适当调整下参数,比如考虑增大 `warmup_steps` 参数为 `16000`,相关的设置可以参考 `configs/transformer.big.yaml` 或是 `configs/transformer.base.yaml` 配置文件中各个选项。 + +#### 单机多卡 + +同样,可以执行如下命令实现八卡训练: + +``` sh +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py --config ./configs/transformer.base.yaml +``` + +与上面的情况相似,可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中设置相应的参数。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +### 使用自定义数据集进行训练 + +自定义数据集与内置数据集训练的方式基本上是一致的,不过需要额外提供数据文件的路径。可以参照以下文档。 + +#### 单机单卡 + +本示例这里略去自定义数据下载、处理的步骤,如果需要,可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +本示例以处理好的 WMT14 数据为例。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --dev_file ${DATA_DEST_DIR}/dev.de-en.en ${DATA_DEST_DIR}/dev.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" \ + --pad_token "<s>" +``` + +`train.py` 脚本中,各个参数的含义如下: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--train_file`: 指明训练所需要的 `train` 训练集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--train_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en`。 +* `--dev_file`: 指明训练所需要的 `dev` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--dev_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--dev_file ${DATA_DEST_DIR}/dev.de-en.de ${DATA_DEST_DIR}/dev.de-en.en`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 
若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--batch_size`: 指明训练时,一个 batch 里面,最多的 token 的数目。默认为 config 中设置的 4096。 +* `--max_iter`: 指明训练时,需要训练的最大的 step 的数目,默认为 None。表示使用 config 中指定的 `epoch: 30` 来作为最大的迭代的 epoch 的数量,而不是 step。 +* `--use_amp`: 是否使用混合精度训练。设置的类型是一个 `str`,可以是 `['true', 'false', 'True', 'False']` 中任意一个。默认不使用混合精度训练。 +* `--amp_level`: 若使用混合精度,则指明混合精度的级别。可以是 `['O1', 'O2']` 中任意一个。默认是 `O1`。 + +#### 单机多卡 + +单机多卡的执行方式与单机打卡差别不大,需要额外加上单机多卡的启动命令,如下所示: + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --dev_file ${DATA_DEST_DIR}/dev.de-en.en ${DATA_DEST_DIR}/dev.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +其余启动参数与单机单卡相同,这里不再累述。 + +### 模型推断 + +#### 使用内置数据集进行预测 + +如果是基于内置的数据集训练得到的英德翻译的模型,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +python predict.py --config ./configs/transformer.base.yaml +``` + +翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +另外 `predict.py` 中使用的 `TransformerGenerator` 接口对于GPU预测将在适配的条件下自动切换到 `FastGeneration` 预测加速版本(期间会进行jit编译), `FastGeneration` 的更多内容可以参考 `fast_transformer/README.md`。 + +#### 基于自定义数据集进行预测 + +本示例同样支持自定义数据集进行预测。可以参照以下文档。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python predict.py \ + --config configs/transformer.base.yaml \ + --test_file ${DATA_DEST_DIR}/test.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +以下是各个参数的含义: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 
token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--without_ft`: 本示例在预测时,支持了 GPU 的翻译预测的加速,如果不使用加速特性,可以设置 `--without_ft` 即会执行普通的 PaddlePaddle 动态图预测。 + +翻译结果会输出到 config 文件中 `output_file` 条目指定的文件中。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。 + +#### 导出静态图预测模型与预测引擎预测 + +Transformer 同时提供了将训练的动态图的 checkpoint 转成静态图模型功能,并提供了对应的使用预测引擎进行预测推理的方法。具体的使用方式如下: + +首先是进行动转静,使用 `export_model.py` 脚本完成将动态图的 checkpoint 转成静态图的模型,并保存成 inference 的模型。 + +``` sh +python export_model.py --config ./configs/transformer.base.yaml +``` + +模型默认保存在 `infer_model/` 路径下面。可以在 `configs/` 路径下的配置文件中更改 `inference_model_dir` 配置,从而保存至自定义的路径。 + +同样,因为模型导出会用到模型的词表等信息,所以如果是**自定义数据集**,仍需要传入所使用的词表。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python export_model.py \ + --config ./configs/transformer.base.yaml \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" +``` + +其中: + +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 + +#### 使用 Paddle Inference API 进行推理 + +准备好以上模型之后,可以使用预测引擎 Paddle Inference API 进行推理。 + +如果使用 Paddle Inference Python API,可以参考[使用 Paddle Inference Python API 推理](./deploy/python/README.md)。 + +如果使用 Paddle Inference C++ API,可以参考[使用 Paddle Inference C++ API 推理](./deploy/cpp/README.md)。 + +#### 使用 Paddle Serving 进行推理 + +除了使用 Paddle Inference API 进行本地推理外,还可以使用 Paddle Serving 实现在服务器上部署推理模型,客户端发送数据进行推理。可以参考[使用 Paddle Serving 推理](./deploy/serving/README.md)。 + +## 静态图 + +在静态图中,本示例仍然可以选择内置数据集进行训练或是使用自定义数据集进行训练。 + +### 使用内置数据集进行训练 + +#### 单机单卡 + +如果是需要单机单卡训练,则使用下面的命令进行训练: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ../configs/transformer.base.yaml +``` + +建议可以在单卡执行的时候,尝试增大 `warmup_steps`。可以修改 `configs/transformer.big.yaml` 或是 `configs/transformer.base.yaml` 中对应参数。 + +#### 单机多卡 + +如果是需要单机多卡训练,则使用下面的命令进行训练: + +##### PE 的方式启动单机多卡: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python train.py --config ../configs/transformer.base.yaml +``` + +##### fleet 的方式启动单机多卡: +``` shell +cd static/ +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" train.py --config ../configs/transformer.base.yaml --distributed +``` + +需要注意的是,使用 fleet 的方式启动单机多卡务必设置 `--distributed`。 + +### 使用自定义数据集进行训练 + +静态图和动态图在训练脚本启动上差别不大,仍然需要指明对应的文件的位置。可以参照以下文档。 + +#### 单机单卡 + +本示例这里略去自定义数据下载、处理的步骤,如果需要,可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +本示例以处理好的 WMT14 数据为例。 + +``` bash +cd static/ +export CUDA_VISIBLE_DEVICES=0 + 
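+# DATA_DEST_DIR 指向前文自定义数据处理步骤生成的 WMT14 en-de 数据目录。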
+DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +`train.py` 脚本中,各个参数的含义如下: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--train_file`: 指明训练所需要的 `train` 训练集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--train_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--batch_size`: 指明训练时,一个 batch 里面,最多的 token 的数目。默认为 config 中设置的 4096。 +* `--max_iter`: 指明训练时,需要训练的最大的 step 的数目,默认为 None。表示使用 config 中指定的 `epoch: 30` 来作为最大的迭代的 epoch 的数量,而不是 step。 + +#### 单机多卡 + +单机多卡下,执行方式与上文所述单机单卡传入自定义数据集方式相同。因静态图多卡有两种方式执行,所以这里会多一个参数: + +* `--distributed`:(**多卡训练需要**)指明是否是使用 fleet 来启动多卡。若设置,则使用 fleet 启动多卡。具体使用方式如下。 + +##### PE 的方式启动单机多卡: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python train.py \ + --config ../configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +##### fleet 的方式启动单机多卡: +``` shell +cd static/ +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" train.py \ + --config ../configs/transformer.base.yaml \ + --distributed \ + --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +需要注意的是,使用 fleet 的方式启动单机多卡务必设置 `--distributed`。 + +#### 使用内置数据集进行预测 + +如果是基于内置的数据集训练得到的英德翻译的模型,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: + +``` sh +# setting visible devices for prediction +cd static/ +export CUDA_VISIBLE_DEVICES=0 +python predict.py --config ../configs/transformer.base.yaml +``` + +由 `predict_file` 
指定的文件中文本的翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +#### 基于自定义数据集进行预测 + +本示例同样支持自定义数据集进行预测。可以参照以下文档。 + +``` bash +cd static/ +export CUDA_VISIBLE_DEVICES=0 + +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ +python predict.py \ + --config configs/transformer.base.yaml \ + --test_file ${DATA_DEST_DIR}/test.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +以下是各个参数的含义: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--without_ft`: 本示例在预测时,支持了 GPU 的翻译预测的加速,如果不使用加速特性,可以设置 `--without_ft` 即会执行普通的 PaddlePaddle 动态图预测。 + +翻译结果会输出到 config 文件中 `output_file` 条目指定的文件中。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。 + +## 使用 FastGeneration 实现预测 + +具体的说明可以参考 `fast_transformer/README.md`。`cd fast_transformer/` 即可查看。 + +## 模型评估 + +预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506) +``` + +## FAQ + +**Q:** 预测结果中样本数少于输入的样本数是什么原因 +**A:** 若样本中最大长度超过 `transformer.base.yaml` 或是 `transformer.big.yaml` 中 `max_length` 的默认设置,请注意运行时增大 `max_length` 的设置,否则超长样本将被过滤。 + +**Q:** 预测时最大长度超过了训练时的最大长度怎么办 
+**A:** 由于训练时 `max_length` 的设置决定了保存模型 position encoding 的大小,若预测时长度超过 `max_length`,请调大该值,会重新生成更大的 position encoding 表。 + + +## 参考文献 +1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. +2. Devlin J, Chang M W, Lee K, et al. [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805)[J]. arXiv preprint arXiv:1810.04805, 2018. +3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. +4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. +5. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015. diff --git a/examples/machine_translation/transformer/configs/transformer.base.yaml b/examples/machine_translation/transformer/configs/transformer.base.yaml new file mode 100644 index 0000000000000000000000000000000000000000..faf1c65374d81abdba865221fe537141b454e5a6 --- /dev/null +++ b/examples/machine_translation/transformer/configs/transformer.base.yaml @@ -0,0 +1,138 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: ["<s>", "<e>", "<unk>"] +# The data type of input ids. +input_dtype: "int64" + +# Device to use. +device: "gpu" + +# Args for reader, see reader.py for details +# The translation task to process. +task_name: "de-en" +src_lang: "en" +trg_lang: "de" +pool_size: 200000 +sort_type: "global" +batch_size: 4096 +infer_batch_size: 8 +shuffle_batch: True +# Data shuffle only works when sort_type is pool or none +shuffle: True +# shuffle_seed must be set when shuffle is True and using multi-cards to train. +# Otherwise, the number of batches cannot be guaranteed. +shuffle_seed: 128 +# For Dataloader num_workers +num_workers: 0 + +# Hyparams for training: +# The number of epoches for training +epoch: 30 + +# The hyper parameters for Adam optimizer. +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The parameters for learning rate scheduling. +warmup_steps: 4000 +# The weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. 
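+# With eps, the target distribution becomes (1 - eps) * one_hot + eps / vocab_size;
+# 0.1 is the value used in the original Transformer paper.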
+label_smooth_eps: 0.1 + +# Hyparams for generation: +# The parameters for beam search. +# Indicating the strategy of beam search. It can be 'v1' or 'v2'. 'v2' would +# select the top `beam_size * 2` beams and process the top `beam_size` alive +# and finish beams in them separately, while 'v1' would only select the top +# `beam_size` beams and mix up the alive and finish beams. 'v2' always +# searchs more and get better results, since the alive beams would +# always be `beam_size` while the number of alive beams in `v1` might +# decrease when meeting the end token. However, 'v2' always generates +# longer results thus might do more calculation and be slower. +beam_search_version: "v1" +beam_size: 4 +max_out_len: 256 +# Indicating whether max_out_len in configurations is the length relative to +# that of source text. Only works in `v2` temporarily. +use_rel_len: False +# The power number in length penalty calculation. Only works in `v2` temporarily. +# Please refer to GNMT <https://arxiv.org/pdf/1609.08144.pdf>. +alpha: 0.6 +# Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation +# <https://arxiv.org/abs/1611.08562>`_ for details. Bigger `diversity_rate` +# would lead to more diversity. if `diversity_rate == 0` is equivalent to naive +# BeamSearch. **NOTE**: Only works when using FastGeneration temporarily. +diversity_rate: 0.0 +# The number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# Size of source word dictionary. +src_vocab_size: 10000 +# Size of target word dictionay +trg_vocab_size: 10000 +# Used to pad vocab size to be multiple of pad_factor. +pad_factor: 8 +# Used to pad sequence length to be multiple of pad_seq. +pad_seq: 1 +# Used to make batch size to be multiple of bsz_multi. +bsz_multi: 8 +# Index for <bos> token +bos_idx: 0 +# Index for <eos> token +eos_idx: 1 +# Index for <unk> token +unk_idx: 2 +# Max length of sequences deciding the size of position encoding table. +max_length: 256 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# Number of head used in multi-head attention. +n_head: 8 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# Dropout rates. +dropout: 0.1 +# The flag indicating whether to share embedding and softmax weights. +# Vocabularies in source and target should be same for weight sharing. +weight_sharing: True +# Whether to apply pre-normalization or not. +normalize_before: True + +# Mixed precision training +use_amp: False +use_pure_fp16: False +scale_loss: 128.0 + +# Maximum iteration for training. +max_iter: None + +# enable to static ? +to_static: False diff --git a/examples/machine_translation/transformer/configs/transformer.big.yaml b/examples/machine_translation/transformer/configs/transformer.big.yaml new file mode 100644 index 0000000000000000000000000000000000000000..299637bc64a2043be762322da3480bb91b6124ba --- /dev/null +++ b/examples/machine_translation/transformer/configs/transformer.big.yaml @@ -0,0 +1,135 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. 
+print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: ["<s>", "<e>", "<unk>"] +# The data type of input ids. +input_dtype: "int64" + +# Device to use. +device: "gpu" + +# Args for reader, see reader.py for details +# The translation task to process. +task_name: "de-en" +src_lang: "en" +trg_lang: "de" +pool_size: 200000 +sort_type: "global" +batch_size: 4096 +infer_batch_size: 8 +shuffle_batch: True +# Data shuffle only works when sort_type is pool or none +shuffle: True +# shuffle_seed must be set when shuffle is True and using multi-cards to train. +# Otherwise, the number of batches cannot be guaranteed. +shuffle_seed: 128 +# For Dataloader num_workers +num_workers: 0 + +# Hyparams for training: +# The number of epoches for training +epoch: 30 + +# The hyper parameters for Adam optimizer. +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The parameters for learning rate scheduling. +warmup_steps: 4000 +# The weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. +label_smooth_eps: 0.1 + +# Hyparams for generation: +# The parameters for beam search. +# Indicating the strategy of beam search. It can be 'v1' or 'v2'. 'v2' would +# select the top `beam_size * 2` beams and process the top `beam_size` alive +# and finish beams in them separately, while 'v1' would only select the top +# `beam_size` beams and mix up the alive and finish beams. 'v2' always +# searchs more and get better results, since the alive beams would +# always be `beam_size` while the number of alive beams in `v1` might +# decrease when meeting the end token. However, 'v2' always generates +# longer results thus might do more calculation and be slower. +beam_search_version: "v1" +beam_size: 4 +max_out_len: 1024 +# Indicating whether max_out_len in configurations is the length relative to +# that of source text. Only works in `v2` temporarily. +use_rel_len: False +# The power number in length penalty calculation. Only works in `v2` temporarily. +# Please refer to GNMT <https://arxiv.org/pdf/1609.08144.pdf>. +alpha: 0.6 +# Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation +# <https://arxiv.org/abs/1611.08562>`_ for details. Bigger `diversity_rate` +# would lead to more diversity. if `diversity_rate == 0` is equivalent to naive +# BeamSearch. **NOTE**: Only works when using FastGeneration temporarily. +diversity_rate: 0.0 +# The number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# Size of source word dictionary. 
+src_vocab_size: 10000 +# Size of target word dictionay +trg_vocab_size: 10000 +# Used to pad vocab size to be multiple of pad_factor. +pad_factor: 8 +# Used to pad sequence length to be multiple of pad_seq. +pad_seq: 1 +# Used to make batch size to be multiple of bsz_multi. +bsz_multi: 8 +# Index for <bos> token +bos_idx: 0 +# Index for <eos> token +eos_idx: 1 +# Index for <unk> token +unk_idx: 2 +# Max length of sequences deciding the size of position encoding table. +max_length: 1024 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 1024 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 4096 +# Number of head used in multi-head attention. +n_head: 16 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# Dropout rates. +dropout: 0.1 +# The flag indicating whether to share embedding and softmax weights. +# Vocabularies in source and target should be same for weight sharing. +weight_sharing: True +# Whether to apply pre-normalization or not. +normalize_before: True + +# Mixed precision training +use_amp: False +use_pure_fp16: False +scale_loss: 128.0 + +# Maximum iteration for training. +max_iter: None diff --git a/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt b/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..cade70b6388bbdedd623093adc568c9602bc076b --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt @@ -0,0 +1,82 @@ +project(cpp_inference_demo CXX C) +option(WITH_MKL "Compile demo with MKL/OpenBlas support, default use MKL." ON) +option(WITH_GPU "Compile demo with GPU/CPU, default use CPU." OFF) +option(WITH_STATIC_LIB "Compile demo with static/shared library, default use static." ON) +option(USE_TENSORRT "Compile demo with TensorRT." 
OFF) + +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -g") +set(CMAKE_STATIC_LIBRARY_PREFIX "") +message("flags" ${CMAKE_CXX_FLAGS}) + +if(NOT DEFINED PADDLE_LIB) + message(FATAL_ERROR "please set PADDLE_LIB with -DPADDLE_LIB=/path/paddle/lib") +endif() +if(NOT DEFINED DEMO_NAME) + message(FATAL_ERROR "please set DEMO_NAME with -DDEMO_NAME=demo_name") +endif() + + +include_directories("${PADDLE_LIB}") +include_directories("${PADDLE_LIB}/third_party/install/protobuf/include") +include_directories("${PADDLE_LIB}/third_party/install/glog/include") +include_directories("${PADDLE_LIB}/third_party/install/gflags/include") +include_directories("${PADDLE_LIB}/third_party/install/zlib/include") +include_directories("${PADDLE_LIB}/third_party/boost") +include_directories("${PADDLE_LIB}/third_party/eigen3") + +if (USE_TENSORRT AND WITH_GPU) + include_directories("${TENSORRT_ROOT}/include") + link_directories("${TENSORRT_ROOT}/lib") +endif() + +link_directories("${PADDLE_LIB}/third_party/install/zlib/lib") + +link_directories("${PADDLE_LIB}/third_party/install/protobuf/lib") +link_directories("${PADDLE_LIB}/third_party/install/glog/lib") +link_directories("${PADDLE_LIB}/third_party/install/gflags/lib") +link_directories("${PADDLE_LIB}/paddle/lib") + +add_executable(${DEMO_NAME} ${DEMO_NAME}.cc) + +if(WITH_MKL) + include_directories("${PADDLE_LIB}/third_party/install/mklml/include") + set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml_intel${CMAKE_SHARED_LIBRARY_SUFFIX} + ${PADDLE_LIB}/third_party/install/mklml/lib/libiomp5${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(MKLDNN_PATH "${PADDLE_LIB}/third_party/install/mkldnn") + if(EXISTS ${MKLDNN_PATH}) + include_directories("${MKLDNN_PATH}/include") + set(MKLDNN_LIB ${MKLDNN_PATH}/lib/libmkldnn.so.0) + endif() +else() + set(MATH_LIB ${PADDLE_LIB}/third_party/install/openblas/lib/libopenblas${CMAKE_STATIC_LIBRARY_SUFFIX}) +endif() + +# Note: libpaddle_inference_api.so/a must put before libpaddle_fluid.so/a +if(WITH_STATIC_LIB) + set(DEPS + ${PADDLE_LIB}/paddle/lib/libpaddle_inference${CMAKE_STATIC_LIBRARY_SUFFIX}) +else() + set(DEPS + ${PADDLE_LIB}/paddle/lib/libpaddle_inference${CMAKE_SHARED_LIBRARY_SUFFIX}) +endif() + +set(EXTERNAL_LIB "-lrt -ldl -lpthread") +set(DEPS ${DEPS} + ${MATH_LIB} ${MKLDNN_LIB} + glog gflags protobuf z + ${EXTERNAL_LIB}) + +if(WITH_GPU) + if (USE_TENSORRT) + set(DEPS ${DEPS} + ${TENSORRT_ROOT}/lib/libnvinfer${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(DEPS ${DEPS} + ${TENSORRT_ROOT}/lib/libnvinfer_plugin${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() + set(DEPS ${DEPS} ${CUDA_LIB}/libcudart${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(DEPS ${DEPS} ${CUDA_LIB}/libcudart${CMAKE_SHARED_LIBRARY_SUFFIX} ) + set(DEPS ${DEPS} /usr/lib/x86_64-linux-gnu/libcublas${CMAKE_SHARED_LIBRARY_SUFFIX} ) + set(DEPS ${DEPS} ${CUDNN_LIB}/libcudnn${CMAKE_SHARED_LIBRARY_SUFFIX} ) +endif() + +target_link_libraries(${DEMO_NAME} ${DEPS}) diff --git a/examples/machine_translation/transformer/deploy/cpp/README.md b/examples/machine_translation/transformer/deploy/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..67cf8b9b6997477258da4ac81b87cfc4e9f775a0 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/README.md @@ -0,0 +1,96 @@ +# 使用 Paddle Inference C++ API 推理 + +## 模型推理 + +通过前文介绍,我们可以获取导出后的预测模型。模型导出后,`infer_model/` 下的目录结构如下: + +``` text +. 
+├── transformer.pdiparams +├── transformer.pdiparams.info +└── transformer.pdmodel +``` + +可以将存有导出后模型的目录拷贝到当前路径下: + +``` sh +cp -rf ../../infer_model/ ./ +``` + +使用 C++ 进行推理需要提前先编译出可执行文件。编译的方式可以直接使用 `run.sh`,不过需要做一些指定。 + +首先打开 run.sh: + +``` sh +LIB_DIR=YOUR_LIB_DIR +CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR +CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR +MODEL_DIR=YOUR_MODEL_DIR +VOCAB_DIR=/root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 +DATA_DIR=/root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en +``` + +需要依次指定: +* `LIB_DIR`: 所使用的 Paddle Inference 的库,即 `libpaddle_inference.so` 的位置。预测库的组织结构满足: + ```text + . + ├── CMakeCache.txt + ├── paddle/ + ├── include/ + └── lib/ + ├── third_party/ + ├── cudaerror/ + ├── install/ + └── threadpool/ + └── version.txt + ``` +* `CUDA_LIB_DIR`: 所使用的 CUDA 的库的位置。 +* `CUDNN_LIB_DIR`: 所使用的 CUDNN 的库的位置。 +* `MODEL_DIR`: 导出的模型的路径。 +* `VOCAB_DIR`: 词表的位置。 +* `DATA_DIR`: 需要推理的数据的位置,当前数据是经过 tokenize 以及 bpe 处理之后的序列用空格连接成的句子,并非原始数据。 + +可以简单执行如下语句完成编译以及推理整个过程。 + +``` sh +bash run.sh +``` + +以上步骤,如果全部正确执行,将会依次完成编译、预测全部过程。不过,如果需要自行执行可执行文件,编译完成后,其实,在 `build/bin/` 路径下会生成 `transformer_e2e` 的可执行文件,也可以直接执行这个可执行文件进行推理。 + +执行的参数及解释如下: + +``` sh +export CUDA_VISIBLE_DEVICES=0 +./build/bin/transformer_e2e -batch_size 8 -device gpu -gpu_id 0 -model_dir ./infer_model/ -vocab_file /root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 -data_file /root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en +``` + +各个参数解释如下: +* `-batch_size`: 使用 Paddle Inference 的时候一个 batch 的句子数目。 +* `-device`: 使用的设备,可以是 gpu 或是 cpu。 +* `-gpu_id`: 若使用 gpu,则需要提供所使用的 gpu 的 id。 +* `-use_mkl`: 是否使用 mkl,设置代表使用 mkl,不设置则不使用 mkl。仅在使用 cpu 进行预测的时候有效。 +* `-threads`: 仅在使用 mkl 的时候起效,用于指定计算 math 库时的线程数。 +* `-model_dir`: 导出的模型的位置。 +* `-vocab_file`: 词表文件的位置。 +* `-data_file`: 推理用的数据的位置。 + +英德翻译的结果会保存到 `predict.txt` 文件中。 + +## 模型评估 + +推理结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506) +``` diff --git a/examples/machine_translation/transformer/deploy/cpp/helper.h b/examples/machine_translation/transformer/deploy/cpp/helper.h new file mode 100644 index 0000000000000000000000000000000000000000..6b4d41f82d8d25a34750fdc3ce10c2d010052b98 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/helper.h @@ -0,0 +1,51 @@ +#pragma once +#include <gflags/gflags.h> +#include <glog/logging.h> +#include <glog/raw_logging.h> +#include <sys/time.h> +#include <chrono> // NOLINT +#include <numeric> +#include <sstream> +#include <string> +#include <vector> +#include "paddle/include/paddle_inference_api.h" + +namespace paddle { +namespace inference { +// Timer for timer +class Timer { +public: + std::chrono::high_resolution_clock::time_point start; + std::chrono::high_resolution_clock::time_point startu; + void tic() { start = std::chrono::high_resolution_clock::now(); } + double toc() { + 
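+    // Elapsed time in milliseconds since the last call to tic().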
startu = std::chrono::high_resolution_clock::now(); + std::chrono::duration<double> time_span = + std::chrono::duration_cast<std::chrono::duration<double>>(startu - + start); + double used_time_ms = static_cast<double>(time_span.count()) * 1000.0; + return used_time_ms; + } +}; + +static void split(const std::string &str, + char sep, + std::vector<std::string> *pieces) { + pieces->clear(); + if (str.empty()) { + return; + } + size_t pos = 0; + size_t next = str.find(sep, pos); + while (next != std::string::npos) { + pieces->push_back(str.substr(pos, next - pos)); + pos = next + 1; + next = str.find(sep, pos); + } + if (!str.substr(pos).empty()) { + pieces->push_back(str.substr(pos)); + } +} + +} // namespace inference +} // namespace paddle diff --git a/examples/machine_translation/transformer/deploy/cpp/run.sh b/examples/machine_translation/transformer/deploy/cpp/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..6c59397b5cadf548b3102d49620f9e4af0b44cd0 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/run.sh @@ -0,0 +1,21 @@ +#!/bin/bash +# Whether to use mkl or gpu +WITH_MKL=ON +DEVICE='gpu' + +# Please set: +# * Corresponding PaddlePaddle inference lib +# * Corresponding CUDA lib +# * Corresponding CUDNN lib +# * Corresponding model directory +# * Corresponding vocab directory +# * Corresponding data directory +LIB_DIR=YOUR_LIB_DIR +CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR +CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR +MODEL_DIR=YOUR_MODEL_DIR +# DATA_HOME is where paddlenlp stores dataset and can be returned by paddlenlp.utils.env.DATA_HOME. +VOCAB_DIR=DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 +DATA_DIR=DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en + +bash run_impl.sh ${LIB_DIR} transformer_e2e ${MODEL_DIR} ${WITH_MKL} ${DEVICE} ${CUDNN_LIB_DIR} ${CUDA_LIB_DIR} ${VOCAB_DIR} ${DATA_DIR} diff --git a/examples/machine_translation/transformer/deploy/cpp/run_impl.sh b/examples/machine_translation/transformer/deploy/cpp/run_impl.sh new file mode 100644 index 0000000000000000000000000000000000000000..e8c5782af25e7d3643d104755106c27627086f2e --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/run_impl.sh @@ -0,0 +1,33 @@ +#!/bin/bash +mkdir -p build +cd build +rm -rf * + +LIB_DIR=$1 +DEMO_NAME=$2 +MODEL_FILE_DIR=$3 +WITH_MKL=$4 +DEVICE=$5 +CUDNN_LIB=$6 +CUDA_LIB=$7 +VOCAB_DIR=$8 +DATA_DIR=$9 + +WITH_GPU=OFF +if [[ $DEVICE="gpu" ]]; then + WITH_GPU=ON +fi + +cmake .. 
-DPADDLE_LIB=${LIB_DIR} \ + -DWITH_MKL=${WITH_MKL} \ + -DDEMO_NAME=${DEMO_NAME} \ + -DWITH_GPU=${WITH_GPU} \ + -DWITH_STATIC_LIB=OFF \ + -DUSE_TENSORRT=${USE_TENSORRT} \ + -DCUDNN_LIB=${CUDNN_LIB} \ + -DCUDA_LIB=${CUDA_LIB} \ + -DTENSORRT_ROOT=${TENSORRT_ROOT} + +make -j + +./${DEMO_NAME} -batch_size 8 -device ${DEVICE} -gpu_id 0 -model_dir ${MODEL_FILE_DIR} -vocab_file ${VOCAB_DIR} -data_file ${DATA_DIR} diff --git a/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc b/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc new file mode 100644 index 0000000000000000000000000000000000000000..2f21391f3cab25bde23302732836a2e334937ab3 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc @@ -0,0 +1,291 @@ +#include <pthread.h> +#include <algorithm> +#include <atomic> +#include <cstring> +#include <fstream> +#include <iostream> +#include <numeric> +#include <string> +#include <thread> +#include <unordered_map> + +#include "helper.h" + +#include <sys/time.h> +#include <unistd.h> +#include <cmath> +#include <cstdio> +#include <cstdlib> +#include <ctime> + +DEFINE_int32(batch_size, 1, "Batch size to do inference. "); +DEFINE_string(device, "gpu", "The device to do inference. Can be gpu or cpu. "); +DEFINE_int32(gpu_id, 0, "The gpu id to do inference. "); +DEFINE_bool(use_mkl, + false, + "Whether to use mkl when using cpu to do inference. "); +DEFINE_int32(threads, + 1, + "The number of threads to run math lib when using mkl. "); +DEFINE_string(model_dir, + "./infer_model/", + "The directory to the inference model. "); +DEFINE_string(vocab_file, + "./vocab_all.bpe.33708", + "The path to the vocabulary file. "); +DEFINE_string(data_file, + "./newstest2014.tok.bpe.33708.en", + "The path to the input data file. 
"); + +using namespace paddle_infer; + +std::string model_dir = ""; +std::string vocab_file = ""; +std::string data_file = ""; + +const int EOS_IDX = 1; +const int PAD_IDX = 0; +const int MAX_LENGTH = 256; +const int N_BEST = 1; + +int batch_size = 1; +int gpu_id = 0; + +namespace paddle { +namespace inference { + +struct DataInput { + std::vector<int64_t> src_data; +}; + +struct DataResult { + std::string result_q; +}; + +bool get_result_tensor(const std::unique_ptr<paddle_infer::Tensor>& seq_ids, + std::vector<DataResult>& dataresultvec, + std::unordered_map<int, std::string>& num2word_dict) { + std::vector<int> output_shape = seq_ids->shape(); + int batch_size = output_shape[0]; + int beam_num = output_shape[2]; + int out_num = std::accumulate( + output_shape.begin(), output_shape.end(), 1, std::multiplies<int>()); + std::vector<int64_t> seq_ids_out; + seq_ids_out.resize(out_num); + seq_ids->CopyToCpu(seq_ids_out.data()); + + dataresultvec.resize(batch_size * N_BEST); + auto max_output_length = output_shape[1]; + + for (int bsz = 0; bsz < batch_size; ++bsz) { + for (int k = 0; k < N_BEST; ++k) { + dataresultvec[bsz * N_BEST + k].result_q = ""; + for (int len = 0; len < max_output_length; ++len) { + if (seq_ids_out[bsz * max_output_length * beam_num + len * beam_num + + k] == EOS_IDX) { + break; + } + dataresultvec[bsz * N_BEST + k].result_q = + dataresultvec[bsz * N_BEST + k].result_q + + num2word_dict[seq_ids_out[bsz * max_output_length * beam_num + + len * beam_num + k]] + + " "; + } + } + } + return true; +} + +class DataReader { +public: + explicit DataReader(const std::string& path) + : file(new std::ifstream(path)) {} + + bool NextBatch(std::shared_ptr<paddle_infer::Predictor>& predictor, + const int& batch_size, + std::vector<std::string>& source_query_vec) { + std::string line; + std::vector<std::string> word_data; + std::vector<DataInput> data_input_vec; + int max_len = 0; + for (int i = 0; i < batch_size; i++) { + if (!std::getline(*file, line)) { + break; + } + DataInput data_input; + split(line, ' ', &word_data); + std::string query_str = ""; + for (int j = 0; j < word_data.size(); ++j) { + if (j >= MAX_LENGTH) { + break; + } + query_str += word_data[j]; + if (word2num_dict.find(word_data[j]) == word2num_dict.end()) { + data_input.src_data.push_back(word2num_dict["<unk>"]); + } else { + data_input.src_data.push_back(word2num_dict[word_data[j]]); + } + } + source_query_vec.push_back(query_str); + data_input.src_data.push_back(EOS_IDX); + max_len = std::max(max_len, static_cast<int>(data_input.src_data.size())); + max_len = std::min(max_len, MAX_LENGTH); + data_input_vec.push_back(data_input); + } + if (data_input_vec.empty()) { + return false; + } + return TensorMoreBatch( + predictor, data_input_vec, max_len, data_input_vec.size()); + } + + bool GetWordDict() { + std::ifstream fin(vocab_file); + std::string line; + int k = 0; + while (std::getline(fin, line)) { + word2num_dict[line] = k; + num2word_dict[k] = line; + k += 1; + } + + fin.close(); + + return true; + } + + std::unordered_map<std::string, int> word2num_dict; + std::unordered_map<int, std::string> num2word_dict; + std::unique_ptr<std::ifstream> file; + +private: + bool TensorMoreBatch(std::shared_ptr<paddle_infer::Predictor>& predictor, + std::vector<DataInput>& data_input_vec, + int max_len, + int batch_size) { + auto src_word_t = predictor->GetInputHandle("src_word"); + std::vector<int64_t> src_word_vec; + src_word_vec.resize(max_len * batch_size); + for (int i = 0; i < batch_size; ++i) { + for (int k = 0; k < 
max_len; ++k) { + if (k < data_input_vec[i].src_data.size()) { + src_word_vec[i * max_len + k] = data_input_vec[i].src_data[k]; + } else { + src_word_vec[i * max_len + k] = PAD_IDX; + } + } + } + src_word_t->Reshape({batch_size, max_len}); + src_word_t->CopyFromCpu(src_word_vec.data()); + + return true; + } +}; + + +template <typename... Args> +void SummaryConfig(const paddle_infer::Config& config, + double infer_time, + int num_batches, + int num_samples) { + LOG(INFO) << "----------------------- Model info ----------------------"; + LOG(INFO) << "model_name: " + << "transformer"; + LOG(INFO) << "model_type: " + << "FP32"; + LOG(INFO) << "----------------------- Data info -----------------------"; + LOG(INFO) << "batch_size: " << batch_size; + LOG(INFO) << "num_of_samples: " << num_samples; + LOG(INFO) << "----------------------- Conf info -----------------------"; + LOG(INFO) << "runtime_device: " << (config.use_gpu() ? "gpu" : "cpu"); + LOG(INFO) << "ir_optim: " << (config.ir_optim() ? "true" : "false"); + LOG(INFO) << "enable_memory_optim: " + << (config.enable_memory_optim() ? "true" : "false"); + if (config.use_gpu()) { + LOG(INFO) << "enable_tensorrt: " + << (config.tensorrt_engine_enabled() ? "true" : "false"); + } else { + LOG(INFO) << "enable_mkldnn: " + << (config.mkldnn_enabled() ? "true" : "false"); + LOG(INFO) << "cpu_math_library_num_threads: " + << config.cpu_math_library_num_threads(); + } + LOG(INFO) << "----------------------- Perf info -----------------------"; + LOG(INFO) << "average_latency(ms): " << infer_time / num_samples << ", " + << "QPS: " << num_samples / (infer_time / 1000.0); +} + + +void Main( + int batch_size, std::string device, int gpu_id, int use_mkl, int threads) { + Config config; + config.SetModel(model_dir + "/transformer.pdmodel", + model_dir + "/transformer.pdiparams"); + + if (device == "gpu") { + config.EnableUseGpu(100, gpu_id); + } else { + config.DisableGpu(); + if (use_mkl) { + config.EnableMKLDNN(); + config.SetCpuMathLibraryNumThreads(threads); + } + } + + config.SwitchUseFeedFetchOps(false); + config.SwitchSpecifyInputNames(true); + // When using fp16, fc_elementwise_layernorm_fuse_pass causes a little + // different translation results with original dygraph prediction, maybe you + // can turn off the IR optimization for same results as following: + // config.SwitchIrOptim(false); + auto predictor = CreatePredictor(config); + DataReader reader(data_file); + reader.GetWordDict(); + + double whole_time = 0; + Timer timer; + int num_batches = 0; + int num_samples = 0; + std::vector<std::string> source_query_vec; + std::ofstream out("predict.txt"); + + while (reader.NextBatch(predictor, batch_size, source_query_vec)) { + timer.tic(); + predictor->Run(); + std::vector<DataResult> dataresultvec; + auto output_names = predictor->GetOutputNames(); + get_result_tensor(predictor->GetOutputHandle(output_names[0]), + dataresultvec, + reader.num2word_dict); + + whole_time += timer.toc(); + num_batches++; + source_query_vec.clear(); + + if (out.is_open()) { + for (int i = 0; i < dataresultvec.size(); ++i) { + out << dataresultvec[i].result_q << "\n"; + } + } + num_samples += dataresultvec.size(); + } + SummaryConfig(config, whole_time, num_batches, num_samples); +} +} // namespace inference +} // namespace paddle + +int main(int argc, char** argv) { + gflags::ParseCommandLineFlags(&argc, &argv, true); + + batch_size = FLAGS_batch_size; + gpu_id = FLAGS_gpu_id; + + model_dir = FLAGS_model_dir; + vocab_file = FLAGS_vocab_file; + data_file = 
FLAGS_data_file;
+
+  paddle::inference::Main(
+      batch_size, FLAGS_device, gpu_id, FLAGS_use_mkl, FLAGS_threads);
+
+  return 0;
+}
diff --git a/examples/machine_translation/transformer/deploy/python/README.md b/examples/machine_translation/transformer/deploy/python/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6df683ecabde75bcfaf2aeaf7fc18a4148d44f62
--- /dev/null
+++ b/examples/machine_translation/transformer/deploy/python/README.md
@@ -0,0 +1,57 @@
+# 使用 Paddle Inference Python API 推理
+
+## 模型推理
+
+通过前文介绍,我们可以获取导出后的预测模型。模型导出后,`infer_model/` 下的目录结构如下:
+
+``` text
+.
+├── transformer.pdiparams
+├── transformer.pdiparams.info
+└── transformer.pdmodel
+```
+
+可以将存有导出后模型的目录拷贝到当前路径下:
+
+``` sh
+cp -rf ../../infer_model/ ./
+```
+
+执行如下命令可以使用 Paddle Inference Python API 进行推理:
+
+``` sh
+export CUDA_VISIBLE_DEVICES=0
+python inference.py \
+    --config ../../configs/transformer.base.yaml \
+    --batch_size 8 \
+    --device gpu \
+    --model_dir ./infer_model/
+```
+
+各个参数解释如下:
+* `--config`: yaml 配置文件,和训练时使用的相同,不过因为模型导出时已经固定了模型结构,因此,模型超参相关配置将不会再起作用,仅有 `reader` 相关配置、`infer_batch_size` 以及 `inference_model_dir` 仍会有效。
+* `--batch_size`: 与配置文件中 `infer_batch_size` 意义相同,是指的使用 Paddle Inference 的时候一个 batch 的句子数目。
+* `--device`: 使用的设备,可以是 gpu,xpu 或是 cpu。
+* `--use_mkl`: 是否使用 mkl,没有设定表示不使用 mkl。可以通过 `--use_mkl True` 指定。
+* `--threads`: 仅在使用 mkl 的时候起效,用于指定计算 math 库时的线程数。
+* `--model_dir`: 导出的 Paddle Inference 可用的模型路径,与配置文件中的 `inference_model_dir` 对应。
+
+英德翻译的结果会保存到 `predict.txt` 文件中。
+
+## 模型评估
+
+推理结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标):
+
+``` sh
+# 还原 predict.txt 中的预测结果为 tokenize 后的数据
+sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt
+# 若无 BLEU 评估工具,需先进行下载
+git clone https://github.com/moses-smt/mosesdecoder.git
+# 以英德翻译 newstest2014 测试数据为例
+perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt
+```
+
+执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果:
+```
+BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506)
+```
diff --git a/examples/machine_translation/transformer/deploy/python/benchmark.sh b/examples/machine_translation/transformer/deploy/python/benchmark.sh
new file mode 100644
index 0000000000000000000000000000000000000000..0b9b8c482995b219d2c701989db8ce0ecee4598d
--- /dev/null
+++ b/examples/machine_translation/transformer/deploy/python/benchmark.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+model_dir=${1}
+model=${2}
+mkdir -p output_pipeline
+log_path="output_pipeline"
+
+for batch_size in "1" "2" "4"; do
+    python inference.py \
+        --config="../../configs/transformer.${model}.yaml" \
+        --device cpu \
+        --model_dir=${model_dir} \
+        --batch_size=${batch_size} \
+        --profile > ${log_path}/transformer_${model}_cpu_nomkl_bs${batch_size}_inference.log 2>&1
+
+    for threads in "1" "6"; do
+        python inference.py \
+            --config="../../configs/transformer.${model}.yaml" \
+            --model_dir=${model_dir} \
+            --device cpu \
+            --use_mkl True \
+            --threads=${threads} \
+            --batch_size=${batch_size} \
+            --profile > ${log_path}/transformer_${model}_cpu_mkl_threads${threads}_bs${batch_size}_inference.log 2>&1
+    done
+
+    python inference.py \
+        --config="../../configs/transformer.${model}.yaml" \
+        --model_dir=${model_dir} \
+        --device gpu \
+        --batch_size=${batch_size} \
+        --profile > ${log_path}/transformer_${model}_gpu_bs${batch_size}_inference.log 2>&1
+done
diff --git 
a/examples/machine_translation/transformer/deploy/python/inference.py b/examples/machine_translation/transformer/deploy/python/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..d765fadcebd3ff6ed5e31aa209a19778ccac5b6f --- /dev/null +++ b/examples/machine_translation/transformer/deploy/python/inference.py @@ -0,0 +1,330 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import paddle +import yaml +from easydict import EasyDict as AttrDict +from paddle import inference + +from paddlenlp.utils.log import logger + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))) +import reader # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--batch_size", type=int, help="Batch size. ") + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "xpu", "cpu", "npu"], + help="Device to use during inference. ", + ) + parser.add_argument("--use_mkl", default=False, type=eval, choices=[True, False], help="Whether to use mkl. ") + parser.add_argument("--threads", default=1, type=int, help="The number of threads when enable mkl. ") + parser.add_argument("--model_dir", default="", type=str, help="Path of the model. ") + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--save_log_path", + default="./transformer/output/", + type=str, + help="The path to save logs when profile is enabled. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. 
", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles, autolog=None): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + self.autolog = autolog + self.use_auto_log = not isinstance(self.autolog, recorder.Recorder) + + @classmethod + def create_predictor(cls, args, config=None, profile=False, model_name=None): + if config is None: + config = inference.Config( + os.path.join(args.inference_model_dir, "transformer.pdmodel"), + os.path.join(args.inference_model_dir, "transformer.pdiparams"), + ) + if args.device == "gpu": + config.enable_use_gpu(100, 0) + elif args.device == "xpu": + config.enable_xpu() + elif args.device == "npu": + config.enable_custom_device("npu") + else: + # CPU + config.disable_gpu() + if args.use_mkl: + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.threads) + # Use ZeroCopy. 
+ config.switch_use_feed_fetch_ops(False) + + if profile: + if args.mod is recorder: + autolog = args.mod.Recorder(config, args.infer_batch_size, args.model_name) + else: + pid = os.getpid() + autolog = args.mod.AutoLogger( + model_name=args.model_name, + model_precision="fp32", + batch_size=args.infer_batch_size, + save_path=args.save_log_path, + inference_config=config, + data_shape="dynamic", + pids=pid, + process_name=None, + gpu_ids=0 if args.device == "gpu" else None, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + else: + autolog = None + + predictor = inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles, autolog) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, test_loader, to_tokens, n_best, bos_idx, eos_idx): + outputs = [] + samples = 0 + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.start() + else: + cpu_rss_mb, gpu_rss_mb = 0, 0 + gpu_id = 0 if self.autolog.use_gpu else None + gpu_util = 0 + + for data in test_loader: + samples = len(data[0]) + + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.stamp() + else: + self.autolog.tic() + + output = self.predict_batch(data) + + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.stamp() + else: + self.autolog.toc(samples) + gpu_util += recorder.Recorder.get_current_gputil(gpu_id) + cm, gm = recorder.Recorder.get_current_memory_mb(gpu_id) + cpu_rss_mb += cm + gpu_rss_mb += gm + + finished_sequence = output[0].transpose([0, 2, 1]) + for ins in finished_sequence: + n_best_seq = [] + for beam_idx, beam in enumerate(ins): + if beam_idx >= n_best: + break + id_list = post_process_seq(beam, bos_idx, eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + n_best_seq.append(sequence) + outputs.append(n_best_seq) + + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.end(stamp=True) + else: + self.autolog.get_device_info( + cpu_rss_mb=cpu_rss_mb / len(test_loader), + gpu_rss_mb=gpu_rss_mb / len(test_loader) if self.autolog.use_gpu else 0, + gpu_util=gpu_util / len(test_loader) if self.autolog.use_gpu else 0, + ) + + return outputs + + +def do_inference(args): + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + predictor = Predictor.create_predictor(args=args, profile=args.profile, model_name=args.model_name) + sequence_outputs = predictor.predict(test_loader, to_tokens, args.n_best, args.bos_idx, args.eos_idx) + + f = open(args.output_file, "w", encoding="utf-8") + for target in sequence_outputs: + for sequence in target: + f.write(sequence + "\n") + f.close() + + if args.profile: + predictor.autolog.report() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.device = ARGS.device + args.use_mkl = ARGS.use_mkl + args.threads = ARGS.threads + if ARGS.batch_size is not None: + 
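+        # A --batch_size given on the command line overrides infer_batch_size from the yaml config.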
args.infer_batch_size = ARGS.batch_size + args.profile = ARGS.profile + args.model_name = "transformer_base" if "base" in ARGS.config else "transformer_big" + if ARGS.model_dir != "": + args.inference_model_dir = ARGS.model_dir + args.save_log_path = ARGS.save_log_path + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + if args.profile: + import importlib + + import tls.recorder as recorder + + try: + mod = importlib.import_module("auto_log") + except ImportError: + mod = importlib.import_module("tls.recorder") + args.mod = mod + + do_inference(args) diff --git a/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py b/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c85f36913b59ed4d972fe8730afcf671d30d75c5 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py @@ -0,0 +1,247 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +from pathlib import Path + +import paddle +import paddle.inference as paddle_infer + +CUR_DIR = os.path.dirname(os.path.abspath(__file__)) +LOG_PATH_ROOT = f"{CUR_DIR}/../../output" + + +class PaddleInferBenchmark(object): + def __init__( + self, + config, + model_info: dict = {}, + data_info: dict = {}, + perf_info: dict = {}, + resource_info: dict = {}, + **kwargs + ): + """ + Construct PaddleInferBenchmark Class to format logs. 
+ args: + config(paddle.inference.Config): paddle inference config + model_info(dict): basic model info + {'model_name': 'resnet50' + 'precision': 'fp32'} + data_info(dict): input data info + {'batch_size': 1 + 'shape': '3,224,224' + 'data_num': 1000} + perf_info(dict): performance result + {'inference_time_s': 2.0} + resource_info(dict): + cpu and gpu resources + {'cpu_rss': 100 + 'gpu_rss': 100 + 'gpu_util': 60} + """ + # PaddleInferBenchmark Log Version + self.log_version = "1.0.3" + + # Paddle Version + self.paddle_version = paddle.__version__ + self.paddle_commit = paddle.__git_commit__ + paddle_infer_info = paddle_infer.get_version() + self.paddle_branch = paddle_infer_info.strip().split(": ")[-1] + + # model info + self.model_info = model_info + + # data info + self.data_info = data_info + + # perf info + self.perf_info = perf_info + + try: + # required value + self.model_name = model_info["model_name"] + self.precision = model_info["precision"] + + self.batch_size = data_info["batch_size"] + self.shape = data_info["shape"] + self.data_num = data_info["data_num"] + + self.inference_time_s = round(perf_info["inference_time_s"], 4) + except: + self.print_help() + raise ValueError("Set argument wrong, please check input argument and its type") + + self.inference_time_s_90 = perf_info.get("inference_time_s_90", "") + self.inference_time_s_99 = perf_info.get("inference_time_s_99", "") + self.succ_rate = perf_info.get("succ_rate", "") + self.qps = perf_info.get("qps", "") + + # conf info + self.config_status = self.parse_config(config) + + # mem info + if isinstance(resource_info, dict): + self.cpu_rss_mb = int(resource_info.get("cpu_rss_mb", 0)) + self.cpu_vms_mb = int(resource_info.get("cpu_vms_mb", 0)) + self.cpu_shared_mb = int(resource_info.get("cpu_shared_mb", 0)) + self.cpu_dirty_mb = int(resource_info.get("cpu_dirty_mb", 0)) + self.cpu_util = round(resource_info.get("cpu_util", 0), 2) + + self.gpu_rss_mb = int(resource_info.get("gpu_rss_mb", 0)) + self.gpu_util = round(resource_info.get("gpu_util", 0), 2) + else: + self.cpu_rss_mb = 0 + self.cpu_vms_mb = 0 + self.cpu_shared_mb = 0 + self.cpu_dirty_mb = 0 + self.cpu_util = 0 + + self.gpu_rss_mb = 0 + self.gpu_util = 0 + + # init benchmark logger + self.benchmark_logger() + + def benchmark_logger(self): + """ + benchmark logger + """ + # remove other logging handler + for handler in logging.root.handlers[:]: + logging.root.removeHandler(handler) + + # Init logger + FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s" + log_output = f"{LOG_PATH_ROOT}/{self.model_name}.log" + Path(f"{LOG_PATH_ROOT}").mkdir(parents=True, exist_ok=True) + logging.basicConfig( + level=logging.INFO, + format=FORMAT, + handlers=[ + logging.FileHandler(filename=log_output, mode="w"), + logging.StreamHandler(), + ], + ) + self.logger = logging.getLogger(__name__) + self.logger.info(f"Paddle Inference benchmark log will be saved to {log_output}") + + def parse_config(self, config) -> dict: + """ + parse paddle predictor config + args: + config(paddle.inference.Config): paddle inference config + return: + config_status(dict): dict style config info + """ + if isinstance(config, paddle_infer.Config): + config_status = {} + config_status["runtime_device"] = "cpu" + if config.use_gpu(): + config_status["runtime_device"] = "gpu" + if config.use_xpu(): + config_status["runtime_device"] = "xpu" + config_status["ir_optim"] = config.ir_optim() + config_status["enable_tensorrt"] = config.tensorrt_engine_enabled() + config_status["precision"] = self.precision 
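+            # The getters used below (mkldnn_enabled(), cpu_math_library_num_threads(), ...) report what was set on this Config.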
+ config_status["enable_mkldnn"] = config.mkldnn_enabled() + config_status["cpu_math_library_num_threads"] = config.cpu_math_library_num_threads() + elif isinstance(config, dict): + config_status["runtime_device"] = config.get("runtime_device", "") + config_status["ir_optim"] = config.get("ir_optim", "") + config_status["enable_tensorrt"] = config.get("enable_tensorrt", "") + config_status["precision"] = config.get("precision", "") + config_status["enable_mkldnn"] = config.get("enable_mkldnn", "") + config_status["cpu_math_library_num_threads"] = config.get("cpu_math_library_num_threads", "") + else: + self.print_help() + raise ValueError("Set argument config wrong, please check input argument and its type") + return config_status + + def report(self, identifier=None): + """ + print log report + args: + identifier(string): identify log + """ + if identifier: + identifier = f"[{identifier}]" + else: + identifier = "" + + self.logger.info("\n") + self.logger.info("---------------------- Paddle info ----------------------") + self.logger.info(f"{identifier} paddle_version: {self.paddle_version}") + self.logger.info(f"{identifier} paddle_commit: {self.paddle_commit}") + self.logger.info(f"{identifier} paddle_branch: {self.paddle_branch}") + self.logger.info(f"{identifier} log_api_version: {self.log_version}") + self.logger.info("----------------------- Conf info -----------------------") + self.logger.info(f"{identifier} runtime_device: {self.config_status['runtime_device']}") + self.logger.info(f"{identifier} ir_optim: {self.config_status['ir_optim']}") + self.logger.info(f"{identifier} enable_memory_optim: {True}") + self.logger.info(f"{identifier} enable_tensorrt: {self.config_status['enable_tensorrt']}") + self.logger.info(f"{identifier} enable_mkldnn: {self.config_status['enable_mkldnn']}") + self.logger.info( + f"{identifier} cpu_math_library_num_threads: {self.config_status['cpu_math_library_num_threads']}" + ) + self.logger.info("----------------------- Model info ----------------------") + self.logger.info(f"{identifier} model_name: {self.model_name}") + self.logger.info(f"{identifier} precision: {self.precision}") + self.logger.info("----------------------- Data info -----------------------") + self.logger.info(f"{identifier} batch_size: {self.batch_size}") + self.logger.info(f"{identifier} input_shape: {self.shape}") + self.logger.info(f"{identifier} data_num: {self.data_num}") + self.logger.info("----------------------- Perf info -----------------------") + self.logger.info( + f"{identifier} cpu_rss(MB): {self.cpu_rss_mb}, cpu_vms: {self.cpu_vms_mb}, cpu_shared_mb: {self.cpu_shared_mb}, cpu_dirty_mb: {self.cpu_dirty_mb}, cpu_util: {self.cpu_util}%" + ) + self.logger.info(f"{identifier} gpu_rss(MB): {self.gpu_rss_mb}, gpu_util: {self.gpu_util}%") + self.logger.info(f"{identifier} inference_time(ms): {round(self.inference_time_s*1000, 1)}") + if self.inference_time_s_90: + self.logger.info( + f"{identifier} 90%_cost: {self.inference_time_s_90}, 99%_cost: {self.inference_time_s_99}, succ_rate: {self.succ_rate}" + ) + if self.qps: + self.logger.info(f"{identifier} QPS: {self.qps}") + + def print_help(self): + """ + print function help + """ + print( + """Usage: + ==== Print inference benchmark logs. 
==== + config = paddle.inference.Config() + model_info = {'model_name': 'resnet50' + 'precision': 'fp32'} + data_info = {'batch_size': 1 + 'shape': '3,224,224' + 'data_num': 1000} + perf_info = {'inference_time_s': 2.0} + resource_info = {'cpu_rss_mb': 100 + 'gpu_rss_mb': 100 + 'gpu_util': 60} + log = PaddleInferBenchmark(config, model_info, data_info, perf_info, resource_info) + log('Test') + """ + ) + + def __call__(self, identifier=None): + """ + __call__ + args: + identifier(string): identify log + """ + self.report(identifier) diff --git a/examples/machine_translation/transformer/deploy/python/tls/recorder.py b/examples/machine_translation/transformer/deploy/python/tls/recorder.py new file mode 100644 index 0000000000000000000000000000000000000000..d33e8fac1a3eeac3ef2ecb60ea282a163e135d92 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/python/tls/recorder.py @@ -0,0 +1,96 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import paddle +import psutil +from tls.benchmark_utils import PaddleInferBenchmark + +if paddle.is_compiled_with_cuda(): + import GPUtil + import pynvml + + +class Recorder(object): + def __init__(self, config, batch_size, model_name, mem_info=None): + self.model_name = model_name + self.config = config + self.precision = "fp32" + self.batch_size = batch_size + + self.use_gpu = False + self.use_xpu = False + self.use_cpu = False + + if config.use_gpu(): + self.place = "gpu" + self.use_gpu = True + elif config.use_xpu(): + self.place = "xpu" + self.use_xpu = True + else: + self.place = "cpu" + self.use_cpu = True + + self.infer_time_s = 0 + self.samples = 0 + + self.start = 0 + + self.device_info = {"cpu_rss_mb": None, "gpu_rss_mb": None, "gpu_util": None} + if mem_info is not None: + self.mem_info = mem_info + + def tic(self): + self.start = time.time() + + def toc(self, samples=1): + self.infer_time_s += time.time() - self.start + self.samples += samples + + def get_device_info(self, cpu_rss_mb=None, gpu_rss_mb=None, gpu_util=None): + self.device_info["cpu_rss_mb"] = cpu_rss_mb + self.device_info["gpu_rss_mb"] = gpu_rss_mb + self.device_info["gpu_util"] = gpu_util + + def report(self): + model_info = {"model_name": self.model_name, "precision": self.precision} + data_info = {"batch_size": self.batch_size, "shape": "dynamic_shape", "data_num": self.samples} + perf_info = {"inference_time_s": self.infer_time_s} + log = PaddleInferBenchmark(self.config, model_info, data_info, perf_info, self.device_info) + log("Test") + + @staticmethod + def get_current_memory_mb(gpu_id=None): + pid = os.getpid() + p = psutil.Process(pid) + info = p.memory_full_info() + cpu_rss_mb = info.uss / 1024.0 / 1024.0 + gpu_rss_mb = 0 + if gpu_id is not None: + pynvml.nvmlInit() + handle = pynvml.nvmlDeviceGetHandleByIndex(0) + meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle) + gpu_rss_mb = meminfo.used / 1024.0 / 1024.0 + return cpu_rss_mb, gpu_rss_mb + + @staticmethod + def 
get_current_gputil(gpu_id=None): + gpu_load = 0 + if gpu_id is not None: + GPUs = GPUtil.getGPUs() + gpu_load = GPUs[gpu_id].load + return gpu_load diff --git a/examples/machine_translation/transformer/deploy/serving/README.md b/examples/machine_translation/transformer/deploy/serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2fc68c6794375e30941546a8574b541553817f39 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/README.md @@ -0,0 +1,104 @@ +# 使用 Paddle Serving 推理 + +## Paddle Serving 的使用 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖。 + +Serving 的执行包含两部分,其一是服务端的执行,其二是客户端的执行。接下来我们会一一说明如何使用 Paddle Serving 完成对 Transformer 的推理。 + +## 模型推理 + +通过前文介绍,我们可以获取导出后的预测模型。模型导出后,`infer_model/` 下的目录结构如下: + +``` text +. +└── infer_model/ + ├── transformer.pdiparams + ├── transformer.pdiparams.info + └── transformer.pdmodel +``` + +可以将存有导出后模型的目录拷贝到当前路径下: + +``` sh +cp -rf ../../infer_model/ ./ +``` + +### 导出 Serving 模型和配置 + +使用导出的 Paddle Inference 的模型,我们需要再做一次转换,将上面保存在 `infer_model/` 下面的模型重新转换成 Paddle Serving 使用的模型。具体操作方式如下: + +``` sh +python export_serving_model.py --model_dir ./infer_model/ +``` + +执行结束之后,会在 shell 上打印出 Transformer 模型输入、输出的变量的名称: + +``` sh +model feed_names : dict_keys(['src_word']) # 模型输入的变量的名称 +model fetch_names : dict_keys(['save_infer_model/scale_0.tmp_1']) # 模型输出的变量的名称 +``` + +导出后,可以在当前路径下得到两个新的目录 `transformer_client/` 和 `transformer_server/`。 + +``` text +. +├── transformer_client/ + ├── serving_client_conf.prototxt + └── serving_client_conf.stream.prototxt +└── transformer_server/ + ├── __model__ + ├── __params__ + ├── serving_server_conf.prototxt + └── serving_server_conf.stream.prototxt +``` + +脚本成功执行并打印出预期内的输入、输出的变量的名称即完成整个过程。 + +### 启动服务端 + +Transformer 的服务端使用的是 Paddle Serving 的 `WebService` 相关接口。执行的命令如下: + +``` sh +export CUDA_VISIBLE_DEVICES=0 +python transformer_web_server.py --config ../../configs/transformer.base.yaml --device gpu --model_dir ./transformer_server +``` + +各个参数的解释如下: +* `--config`: yaml 配置文件,和训练时使用的相同,不过因为模型导出时已经固定了模型结构,因此,模型超参相关配置将不会再起作用,仅有 `reader` 相关配置,比如词表以及 `inference_model_dir` 等仍会有效。 +* `--device`: 使用的设备,可以是 gpu 或是 cpu。 +* `--model_dir`: 导出的 Paddle Serving 可用的模型路径,与配置文件中的 `inference_model_dir` 对应。在这里,特指的 `transformer_server/` 的路径。 + +### 启动客户端完成推理 + +在英德翻译的例子里面,在客户端这侧,我们只需要传给服务端需要翻译的句子即可。这里的句子是经过了 tokenize 以及 bpe 切词的序列用空格连接而成的句子。 + +执行的方式如下: + +``` sh +python transformer_web_client.py --config ../../configs/transformer.base.yaml --batch_size 8 +``` + +各个参数的解释如下: +* `--config`: yaml 配置文件,和训练时使用的相同,不过因为模型导出时已经固定了模型结构,因此,模型超参相关配置将不会再起作用,仅有 `reader` 相关配置,比如使用的测试集以及 `infer_batch_size` 等仍会有效。 +* `--batch_size`: 与配置文件中 `infer_batch_size` 意义相同,是指的使用 Paddle Serving 的时候一个 batch 的句子数目。 + +执行完客户端的脚本,将会在本地生成一个 `predict.txt` 的文件,存有推理的结果。 + +## 模型评估 + +推理结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506) +``` diff --git 
a/examples/machine_translation/transformer/deploy/serving/benchmark.py b/examples/machine_translation/transformer/deploy/serving/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..973f5d2b3ecf9cb442525cb67d94acb1ee4d56b5 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/benchmark.py @@ -0,0 +1,36 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import yaml + + +def parse_benchmark(filein, fileout): + with open(filein, "r") as fin: + res = yaml.load(fin) + del_list = [] + for key in res["DAG"].keys(): + if "call" in key: + del_list.append(key) + for key in del_list: + del res["DAG"][key] + with open(fileout, "w") as fout: + yaml.dump(res, fout, default_flow_style=False) + + +if __name__ == "__main__": + filein = sys.argv[1] + fileout = sys.argv[2] + parse_benchmark(filein, fileout) diff --git a/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh b/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7f57ef5a0873f9bfa7902a04d914bb0fa12bce3c --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh @@ -0,0 +1,24 @@ +modelname="transformer" +export FLAGS_profile_pipeline=1 +# HTTP +ps -ef | grep web_service | awk '{print $2}' | xargs kill -9 +sleep 3 +rm -rf profile_log_$modelname +for thread_num in "1" "8" "16"; do + for batch_size in "1" "2" "4"; do + python transformer_web_server.py --config ../../configs/transformer.base.yaml --device gpu --model_dir ./transformer_server --profile & + sleep 3 + echo "----Transformer thread num: ${thread_num} batch size: ${batch_size} mode:http ----" >> profile_log_$modelname + nvidia-smi --id=2 --query-compute-apps=used_memory --format=csv -lms 100 > gpu_use.log 2>&1 & + nvidia-smi --id=2 --query-gpu=utilization.gpu --format=csv -lms 100 > gpu_utilization.log 2>&1 & + echo "import psutil\ncpu_utilization=psutil.cpu_percent(1,False)\nprint('CPU_UTILIZATION:', cpu_utilization)\n" > cpu_utilization.py + python transformer_web_client.py --config ../../configs/transformer.base.yaml --batch_size ${batch_size} --threads ${thread_num} --profile + python cpu_utilization.py >> profile_log_$modelname + ps -ef | grep web_server | awk '{print $2}' | xargs kill -9 + python benchmark.py benchmark.log benchmark.tmp + mv benchmark.tmp benchmark.log + awk 'BEGIN {max = 0} {if(NR>1){if ($modelname > max) max=$modelname}} END {print "MAX_GPU_MEMORY:", max}' gpu_use.log >> profile_log_$modelname + awk 'BEGIN {max = 0} {if(NR>1){if ($modelname > max) max=$modelname}} END {print "GPU_UTILIZATION:", max}' gpu_utilization.log >> profile_log_$modelname + cat benchmark.log >> profile_log_$modelname + done +done diff --git a/examples/machine_translation/transformer/deploy/serving/export_serving_model.py b/examples/machine_translation/transformer/deploy/serving/export_serving_model.py new file 
mode 100644 index 0000000000000000000000000000000000000000..97e0a526dc805f09eb0006f5fbfd015e11e28aac --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/export_serving_model.py @@ -0,0 +1,28 @@ +import argparse +import paddle +import paddle_serving_client.io as serving_io + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", type=str, required=True, help="input inference model dir") + return parser.parse_args() + + +def do_export(model_dir): + feed_names, fetch_names = serving_io.inference_model_to_serving( + dirname=model_dir, + serving_server="transformer_server", + serving_client="transformer_client", + model_filename="transformer.pdmodel", + params_filename="transformer.pdiparams", + ) + + print("model feed_names : %s" % feed_names) + print("model fetch_names : %s" % fetch_names) + + +if __name__ == "__main__": + paddle.enable_static() + args = parse_args() + do_export(args.model_dir) diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_reader.py b/examples/machine_translation/transformer/deploy/serving/transformer_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..2b295e7d6c37d174ced3c78ab9808a5868b1db39 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/transformer_reader.py @@ -0,0 +1,52 @@ +import numpy as np + +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Pad, Vocab + + +class TransformerReader(object): + def __init__(self, args={}): + super(TransformerReader, self).__init__() + + dataset = load_dataset("wmt14ende", splits=("test")) + if not args.benchmark: + self.vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + else: + self.vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) + self.src_vocab = self.trg_vocab = self.vocab + + def convert_samples(samples): + source = [] + for sample in samples: + src = sample.split() + source.append(self.src_vocab.to_indices(src)) + + return source + + self.tokenize = convert_samples + self.to_tokens = self.trg_vocab.to_tokens + self.feed_keys = ["src_word"] + self.bos_idx = args.bos_idx + self.eos_idx = args.eos_idx + self.pad_idx = args.bos_idx + self.pad_seq = args.pad_seq + self.word_pad = Pad(self.pad_idx) + + def set_feed_keys(self, keys): + self.feed_keys = keys + + def get_feed_keys(self): + return self.feed_keys + + def prepare_infer_input(self, insts): + """ + Put all padded data needed by beam search decoder into a list. + """ + insts = self.tokenize(insts) + + src_max_len = (max([len(inst) for inst in insts]) + self.pad_seq) // self.pad_seq * self.pad_seq + src_word = self.word_pad( + [inst + [self.eos_idx] + [self.pad_idx] * (src_max_len - 1 - len(inst)) for inst in insts] + ) + + return np.asarray(src_word) diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py b/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py new file mode 100644 index 0000000000000000000000000000000000000000..0407a1f8eb08b7fd17cd642ffaefafd815977699 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py @@ -0,0 +1,98 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +from pprint import pprint + +import requests +import yaml +from easydict import EasyDict as AttrDict +from paddle_serving_client.utils import MultiThreadRunner + +from paddlenlp.datasets import load_dataset + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument("--batch_size", type=int, help="Batch size. ") + parser.add_argument("--threads", default=1, type=int, help="Number of threads. ") + parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + args = parser.parse_args() + return args + + +def do_client(idx, args): + dataset = load_dataset("wmt14ende", splits=("test")) + + headers = {"Content-type": "application/json"} + url = "http://127.0.0.1:9292/transformer/prediction" + + batch = [] + sample = 0 + f = open(args.output_file, "w") + if args.profile: + recorder = Recorder(args.infer_batch_size, args.model_name) + recorder.tic() + + for sequence in dataset: + sample += 1 + batch.append(sequence[args.src_lang]) + if len(batch) < args.infer_batch_size and sample != len(dataset): + continue + data = {"feed": [{"src_word": batch}], "fetch": ["finished_sequence"]} + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + if r is not None: + print("Status: ", r) + + if args.profile: + recorder.toc(samples=len(batch)) + else: + for seq in r.json()["result"]["finished_sequence"]: + f.write(seq[0] + "\n") + batch = [] + if args.profile: + recorder.tic() + f.close() + if args.profile: + recorder.report() + return [[recorder.infer_time]] + + +def multithread_http(args): + multi_thread_runner = MultiThreadRunner() + multi_thread_runner.run(do_client, args.threads, args) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + if ARGS.batch_size is not None: + args.infer_batch_size = ARGS.batch_size + args.profile = ARGS.profile + args.threads = ARGS.threads + args.model_name = "transformer_base" if "base" in ARGS.config else "transformer_big" + + if args.profile: + from utils.recorder import Recorder + + multithread_http(args) + else: + do_client(0, args) diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py b/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py new file mode 100644 index 0000000000000000000000000000000000000000..c13a78fc1f2dc2775d72fb1f3aa54a2a9af5ef17 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py @@ -0,0 +1,130 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from pprint import pprint + +import numpy as np +import yaml +from easydict import EasyDict as AttrDict + +try: + from paddle_serving_server_gpu.web_service import WebService +except: + from paddle_serving_server.web_service import WebService + +from transformer_reader import TransformerReader + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--device", default="gpu", type=str, choices=["gpu", "cpu"], help="Device to use during inference. " + ) + parser.add_argument("--model_dir", default="", type=str, help="Path of the model. ") + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +class TransformerService(WebService): + def init_client(self, args): + self.args = args + self.transformer_reader = TransformerReader(args=args) + + def preprocess(self, feed=[], fetch=[]): + src_sequence = feed[0]["src_word"] + if isinstance(src_sequence, str): + src_sequence = [src_sequence] + src_word = self.transformer_reader.prepare_infer_input(src_sequence) + feed_batch = {"src_word": src_word} + fetch = ["save_infer_model/scale_0.tmp_1"] + + return feed_batch, fetch, True + + def postprocess(self, feed={}, fetch=[], fetch_map=None): + if fetch_map is not None: + finished_sequence = np.array(fetch_map["save_infer_model/scale_0.tmp_1"]).transpose([0, 2, 1]) + outputs = [] + for ins in finished_sequence: + n_best_seq = [] + for beam_idx, beam in enumerate(ins): + if beam_idx >= self.args.n_best: + break + id_list = post_process_seq(beam, self.args.bos_idx, self.args.eos_idx) + word_list = self.transformer_reader.to_tokens(id_list) + sequence = " ".join(word_list) + n_best_seq.append(sequence) + outputs.append(n_best_seq) + res = {"finished_sequence": outputs} + return res + + +def do_server(args): + service = TransformerService(name="transformer") + if args.profile: + try: + service.setup_profile(30) + except: + pass + service.load_model_config(args.inference_model_dir) + if args.device == "gpu": + service.set_gpus("0") + service.prepare_server(workdir="workdir", port=9292, device="gpu", gpuid=0) + else: + service.prepare_server(workdir="workdir", port=9292, device="cpu") + + service.init_client(args=args) + + if args.profile: + service.run_debugger_service() + else: + service.run_rpc_service() + service.run_web_service() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with 
open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + args.benchmark = ARGS.benchmark + args.profile = ARGS.profile + args.device = ARGS.device + if ARGS.model_dir != "": + args.inference_model_dir = ARGS.model_dir + + do_server(args) diff --git a/examples/machine_translation/transformer/deploy/serving/utils/recorder.py b/examples/machine_translation/transformer/deploy/serving/utils/recorder.py new file mode 100644 index 0000000000000000000000000000000000000000..70a156a5e4f86ce9eb72ad1643b04117b2ce46d4 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/utils/recorder.py @@ -0,0 +1,34 @@ +import time +import paddle + + +class Recorder(object): + def __init__(self, batch_size, model_name, mem_info=None): + self.model_name = model_name + self.precision = "fp32" + self.batch_size = batch_size + + self.infer_time = 0 + self.samples = 0 + + self.start = 0 + + def tic(self): + self.start = time.time() + + def toc(self, samples=1): + self.infer_time += (time.time() - self.start) * 1000 + self.samples += samples + + def report(self): + print("----------------------- Env info ------------------------") + print("paddle_version: {}".format(paddle.__version__)) + print("----------------------- Model info ----------------------") + print("model_name: {}".format(self.model_name)) + print("model_type: {}".format(self.precision)) + print("----------------------- Data info -----------------------") + print("batch_size: {}".format(self.batch_size)) + print("num_of_samples: {}".format(self.samples)) + print("----------------------- Perf info -----------------------") + print("average_latency(ms): {}".format(self.infer_time / (self.samples))) + print("QPS: {}".format((self.samples) / (self.infer_time / 1000.0))) diff --git a/examples/machine_translation/transformer/export_model.py b/examples/machine_translation/transformer/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fceddc5bf628edaa624ba23557f86cb7ac16a275 --- /dev/null +++ b/examples/machine_translation/transformer/export_model.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle +import reader +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.transformers import InferTransformerModel, position_encoding_init +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. 
Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def do_export(args): + # Adapt vocabulary size + reader.adapt_vocab_size(args) + # Define model + transformer = InferTransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + beam_search_version=args.beam_search_version, + normalize_before=args.get("normalize_before", True), + rel_len=args.use_rel_len, + alpha=args.alpha, + ) + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." 
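+    # init_from_params is expected to point at the directory that contains transformer.pdparams.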
+ + model_dict = paddle.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # To avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["encoder.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + model_dict["decoder.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + transformer.load_dict(model_dict) + # Set evaluate mode + transformer.eval() + + # Convert dygraph model to static graph model + transformer = paddle.jit.to_static( + transformer, + input_spec=[ + # src_word + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # trg_word + # paddle.static.InputSpec( + # shape=[None, None], dtype="int64") + ], + ) + + # Save converted static graph model + paddle.jit.save(transformer, os.path.join(args.inference_model_dir, "transformer")) + logger.info("Transformer has been saved to {}".format(args.inference_model_dir)) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. 
" + ) + + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_export(args) diff --git a/examples/machine_translation/transformer/fast_transformer/README.md b/examples/machine_translation/transformer/fast_transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cb760655e003bec63a55bdd177cbb0256b7f3fb3 --- /dev/null +++ b/examples/machine_translation/transformer/fast_transformer/README.md @@ -0,0 +1,220 @@ +# FastGeneration 预测 + +在这里我们集成了 NVIDIA [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/tree/v3.1) 用于预测加速。同时集成了 FasterTransformer float32 以及 float16 预测,打造了 FastGeneration 的能力。以下是使用 FastGeneration 的说明。 + +## 使用环境说明 + +* 本项目依赖于 PaddlePaddle 2.1.0 及以上版本或适当的 develop 版本 +* CMake >= 3.10 +* CUDA 10.1 或 10.2(需要 PaddlePaddle 框架一致) +* gcc 版本需要与编译 PaddlePaddle 版本一致,比如使用 gcc8.2 +* 推荐使用 Python3 +* [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/tree/v3.1#setup) 使用必要的环境 +* 环境依赖 + - attrdict + - pyyaml + ```shell + pip install attrdict pyyaml + ``` + +## 快速开始 + +我们实现了基于 FastGeneration 的自定义 op 的接入,用于加速当前机器翻译 example 在 GPU 上的预测性能。 + +## 使用 FastGeneration 完成预测 + +编写 python 脚本的时候,调用 [`FasterTransformer` API](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.html#paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer) 即可实现 Transformer 模型的高性能预测。 + +举例如下: + +``` python +from paddlenlp.ops import FasterTransformer + +transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + topk=args.topk, + topp=args.topp, + max_out_len=args.max_out_len, + use_fp16_decoding=args.use_fp16_decoding) +``` + +若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md)。编译好后,使用 `FasterTransformer(decoding_lib="/path/to/lib", ...)` 可以完成导入。 + +更详细的例子可以参考 `encoder_decoding_predict.py`,我们提供了更详细用例。 + + +#### 数据准备 + +本示例可以使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译的数据进行训练、预测,也可以使用自定义数据集。数据准备部分可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +#### 模型推断 + +使用模型推断前提是需要指定一个合适的 checkpoint,需要在对应的 `../configs/transformer.base.yaml` 中修改对应的模型载入的路径参数 `init_from_params`。 + +我们提供一个已经训练好的动态图的 base model 的 checkpoint 以供使用,可以通过[transformer-base-wmt_ende_bpe](https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz)下载。 + +``` sh +wget https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz +tar -zxf transformer-base-wmt_ende_bpe.tar.gz +``` + +然后,需要修改对应的 `../configs/transformer.base.yaml` 配置文件中的 `init_from_params` 的值为 `./base_trained_models/step_final/`。 + +#### 使用动态图预测(使用 float32 decoding 预测) + +以英德翻译数据为例,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 0 +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + 
--decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --decoding_strategy beam_search \ + --beam_size 5 +``` + +其中: +* `--config`: 选项用于指明配置文件的位置 +* `--decoding_lib`: 选项用于指明编译好的 FastGeneration decoding lib 的位置 +* `--decoding_strategy`: 选项用于指定解码使用的策略,可以选择是 `beam_search`,`topk_sampling`,`topp_sampling`。 + * 当使用 `beam_search` 的时候,需要指定 `--beam_size` 的值 + * 当使用 `topk_sampling` 的时候,需要指定 `--topk` 的值 + * 当使用 `topp_sampling` 的时候,需要指定 `topp` 的值,并且需要保证 `--topk` 的值为 0 +* `--beam_size`: 解码策略是 `beam_search` 的时候,beam size 的大小,数据类型是 `int` +* `--diversity_rate`: 解码策略是 `beam_search` 的时候,设置 diversity rate 的大小,数据类型是 `float`。当设置的 `diversity_rate` 大于 0 的时候,FasterTransformer 仅支持 beam size 为 1,4,16,64 +* `--topk`: 解码策略是 `topk_sampling` 的时候,topk 计算的 k 值的大小,数据类型是 `int` +* `--topp`: 解码策略是 `topp_sampling` 的时候,p 的大小,数据类型是 `float` + +翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `./sample/config/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 base model 的配置。 + +#### 使用动态图预测(使用 float16 decoding 预测) + +float16 与 float32 预测的基本流程相同,不过在使用 float16 的 decoding 进行预测的时候,需要再加上 `--use_fp16_decoding` 选项,表示使用 fp16 进行预测。后按照与之前相同的方式执行即可。具体执行方式如下: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 1 +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --use_fp16_decoding \ + --decoding_strategy beam_search \ + --beam_size 5 +``` + +其中,`--config` 选项用于指明配置文件的位置,而 `--decoding_lib` 选项用于指明编译好的 FastGeneration decoding lib 的位置。 + +翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `./sample/config/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 base model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +#### 使用自定义数据集进行预测 + +如果需要使用准备好的自定义数据集进行高性能推理,同样可以通过在执行 `encoder_decoding_predict.py` 脚本时指明以下参数,从而引入自定义数据集。 + +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 
`pad_token` 使用。 + +比如: + +``` bash +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/iwslt14.tokenized.de-en/ + +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 1 + +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --use_fp16_decoding \ + --decoding_strategy beam_search \ + --beam_size 5 \ + --test_file ${DATA_DEST_DIR}/test.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dev.de-en.de \ + --trg_vocab ${DATA_DEST_DIR}/dev.de-en.en \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +#### 导出基于 FastGeneration 的预测库使用模型文件 + +我们提供一个已经基于动态图训练好的 base model 的 checkpoint 以供使用,当前 checkpoint 是基于 WMT 英德翻译的任务训练。可以通过[transformer-base-wmt_ende_bpe](https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz)下载。 + +使用 C++ 预测库,首先,我们需要做的是将动态图的 checkpoint 导出成预测库能使用的模型文件和参数文件。可以执行 `export_model.py` 实现这个过程。 + +``` sh +python export_model.py \ + --config ../configs/transformer.base.yaml \ + --decoding_strategy beam_search \ + --beam_size 5 +``` + +若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md)。编译好后,可以在执行 `export_model.py` 时使用 `--decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so` 可以完成导入。 + +注意:如果是自行编译的话,这里的 `libdecoding_op.so` 的动态库是参照文档 [文本生成高性能加速](../../../../paddlenlp/ops/README.md) 中 **`Python 动态图使用自定义 op`** 编译出来的 lib,与相同文档中 **`C++ 预测库使用自定义 op`** 编译产出不同。因此,在使用预测库前,还需要额外导出模型: + * 一次用于获取 Python 动态图下的 lib,用到 Python 端进行模型导出。 + * 一次获取编译的基于预测库的可执行文件 + +执行 `export_model.py` 之后,可以在当前路径的 `infer_model/` 下面看到导出的模型文件: + ```text + └── infer_model/ + ├── transformer.pdiparams + └── transformer.pdmodel + ``` + +#### C++ 预测库使用高性能加速 + +C++ 预测库使用 FastGeneration 的高性能加速需要自行编译,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md) 文档完成基于 C++ 预测库的编译,同时也可以参考相同文档执行对应的 C++ 预测库的 demo 完成预测。 + +具体的使用 demo 可以参考 [Transformer 预测库 C++ demo](../../../../paddlenlp/ops/fast_transformer/src/demo/transformer_e2e.cc)。 + +## 模型评估 + +评估方式与动态图评估方式相同,预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 beam_size 为 5 时 base model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 26.89, 58.4/32.6/20.5/13.4 (BP=1.000, ratio=1.010, hyp_len=65166, ref_len=64506) +``` diff --git a/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py b/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..f27ea58ac929071875a05ea436f14b4090f54d49 --- /dev/null +++ b/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py @@ -0,0 +1,292 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import numpy as np +import paddle +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.ops import FasterTransformer +from paddlenlp.utils.log import logger + +sys.path.append("../") +import reader # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.base.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--decoding_lib", + default="../../../../paddlenlp/ops/build/lib/libdecoding_op.so", + type=str, + help="Path of libdecoding_op.so. ", + ) + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--enable_fast_encoder", + action="store_true", + help="Whether to use fast version encoder to predict. This is experimental option for now. ", + ) + parser.add_argument("--use_fp16_encoder", action="store_true", help="Whether to use fp16 encoder to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + type=str, + choices=["beam_search", "beam_search_v2", "topk_sampling", "topp_sampling"], + help="Decoding strategy. Can be one of ['beam_search', 'topk_sampling', 'topp_sampling']. ", + ) + parser.add_argument("--beam_size", default=4, type=int, help="Beam size. ") + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity rate for beam search. ") + parser.add_argument("--topk", default=4, type=int, help="The k value for topk_sampling. Default is 4. ") + parser.add_argument( + "--topp", + default=0.0, + type=float, + help="The probability threshold for topp_sampling. Default is 0.0 which means it won't go through topp_sampling. ", + ) + parser.add_argument("--batch_size", default=None, type=int, help="Batch size. ") + parser.add_argument( + "--profile", action="store_true", help="Whether to profile the performance using newstest2014 dataset. " + ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. 
Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + # Define data loader + # NOTE: Data yielded by DataLoader may be on CUDAPinnedPlace, + # but custom op doesn't support CUDAPinnedPlace. Hence, + # disable using CUDAPinnedPlace in DataLoader. + paddle.io.reader.use_pinned_memory(False) + test_loader, to_tokens = reader.create_infer_loader(args) + + # Define model + transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + diversity_rate=args.diversity_rate, + decoding_lib=args.decoding_lib, + use_fp16_decoding=args.use_fp16_decoding, + enable_fast_encoder=args.enable_fast_encoder, + use_fp16_encoder=args.use_fp16_encoder, + ) + + # Set evaluate mode + transformer.eval() + + # Load checkpoint. + transformer.load(init_from_params=os.path.join(args.init_from_params, "transformer.pdparams")) + + # Providing model_dict still works. 
+ # state_dict = paddle.load(os.path.join(args.init_from_params, + # "transformer.pdparams")) + # transformer.load(state_dict=state_dict) + + f = open(args.output_file, "w") + with paddle.no_grad(): + if args.profile: + import time + + start = time.time() + for (src_word,) in test_loader: + finished_seq = transformer(src_word=src_word) + if not args.profile: + if args.decoding_strategy == "beam_search" or args.decoding_strategy == "beam_search_v2": + finished_seq = finished_seq.numpy().transpose([1, 2, 0]) + elif args.decoding_strategy == "topk_sampling" or args.decoding_strategy == "topp_sampling": + finished_seq = np.expand_dims(finished_seq.numpy().transpose([1, 0]), axis=1) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + if args.profile: + if args.decoding_strategy == "beam_search" or args.decoding_strategy == "beam_search_v2": + logger.info( + "Setting info: batch size: {}, beam size: {}, use fp16: {}. ".format( + args.infer_batch_size, args.beam_size, args.use_fp16_decoding + ) + ) + elif args.decoding_strategy == "topk_sampling": + logger.info( + "Setting info: batch size: {}, topk: {}, use fp16: {}. ".format( + args.infer_batch_size, args.topk, args.use_fp16_decoding + ) + ) + elif args.decoding_strategy == "topp_sampling": + logger.info( + "Setting info: batch size: {}, topp: {}, use fp16: {}. ".format( + args.infer_batch_size, args.topp, args.use_fp16_decoding + ) + ) + paddle.device.cuda.synchronize(place) + logger.info( + "Average time latency is {} ms/batch. ".format((time.time() - start) / len(test_loader) * 1000) + ) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.decoding_lib = ARGS.decoding_lib + args.use_fp16_decoding = ARGS.use_fp16_decoding + args.enable_fast_encoder = ARGS.enable_fast_encoder + args.use_fp16_encoder = ARGS.use_fp16_encoder + args.decoding_strategy = ARGS.decoding_strategy + args.beam_size = ARGS.beam_size + args.diversity_rate = ARGS.diversity_rate + args.topk = ARGS.topk + args.topp = ARGS.topp + args.profile = ARGS.profile + args.benchmark = ARGS.benchmark + if ARGS.batch_size: + args.infer_batch_size = ARGS.batch_size + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. 
" + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/fast_transformer/export_model.py b/examples/machine_translation/transformer/fast_transformer/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..0f444985c36f32723d243ef9792dc026d21a256a --- /dev/null +++ b/examples/machine_translation/transformer/fast_transformer/export_model.py @@ -0,0 +1,203 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import paddle +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.ops import FasterTransformer +from paddlenlp.utils.log import logger + +sys.path.append("../") +import reader # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.base.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--decoding_lib", + default="../../../../paddlenlp/ops/build/lib/libdecoding_op.so", + type=str, + help="Path of libdecoding_op.so. ", + ) + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--enable_fast_encoder", + action="store_true", + help="Whether to use fast version encoder to predict. This is experimental option for now. ", + ) + parser.add_argument("--use_fp16_encoder", action="store_true", help="Whether to use fp16 encoder to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + type=str, + choices=["beam_search", "topk_sampling", "topp_sampling", "beam_search_v2"], + help="Decoding strategy. Can be one of ['beam_search', 'topk_sampling', 'topp_sampling', 'beam_search_v2']. ", + ) + parser.add_argument("--beam_size", default=4, type=int, help="Beam size. ") + parser.add_argument("--topk", default=4, type=int, help="The k value for topk_sampling. Default is 4. ") + parser.add_argument( + "--topp", + default=0.0, + type=float, + help="The probability threshold for topp_sampling. Default is 0.0 which means it won't go through topp_sampling. ", + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. 
", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + reader.adapt_vocab_size(args) + + # Define model + transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + decoding_lib=args.decoding_lib, + use_fp16_decoding=args.use_fp16_decoding, + enable_fast_encoder=args.enable_fast_encoder, + use_fp16_encoder=args.use_fp16_encoder, + rel_len=args.use_rel_len, + alpha=args.alpha, + ) + + # Set evaluate mode + transformer.eval() + + # Load checkpoint. + transformer.load(init_from_params=os.path.join(args.init_from_params, "transformer.pdparams")) + + # Convert dygraph model to static graph model + transformer = paddle.jit.to_static( + transformer, + input_spec=[ + # src_word + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # trg_word + # Support exporting model which support force decoding + # NOTE: Data type MUST be int32 ! 
+ # paddle.static.InputSpec( + # shape=[None, None], dtype="int32") + ], + ) + + # Save converted static graph model + paddle.jit.save(transformer, os.path.join(args.inference_model_dir, "transformer")) + logger.info("Transformer has been saved to {}".format(args.inference_model_dir)) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.decoding_lib = ARGS.decoding_lib + args.use_fp16_decoding = ARGS.use_fp16_decoding + args.enable_fast_encoder = ARGS.enable_fast_encoder + args.use_fp16_encoder = ARGS.use_fp16_encoder + args.decoding_strategy = ARGS.decoding_strategy + args.beam_size = ARGS.beam_size + args.topk = ARGS.topk + args.topp = ARGS.topp + args.benchmark = ARGS.benchmark + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/images/multi_head_attention.png b/examples/machine_translation/transformer/images/multi_head_attention.png new file mode 100644 index 0000000000000000000000000000000000000000..427fb6b32aaeb7013066a167aab4fb97c024c2d6 Binary files /dev/null and b/examples/machine_translation/transformer/images/multi_head_attention.png differ diff --git a/examples/machine_translation/transformer/images/transformer_network.png b/examples/machine_translation/transformer/images/transformer_network.png new file mode 100644 index 0000000000000000000000000000000000000000..34be0e5c7e2b08f858683d86353db5e81049c7ca Binary files /dev/null and b/examples/machine_translation/transformer/images/transformer_network.png differ diff --git a/examples/machine_translation/transformer/predict.py b/examples/machine_translation/transformer/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9226da595e65996c5e2a7ec93b471b647cfee62d --- /dev/null +++ b/examples/machine_translation/transformer/predict.py @@ -0,0 +1,234 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle +import reader +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.ops import TransformerGenerator + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument("--without_ft", action="store_true", help="Whether to use FastGeneration to do predict. ") + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu", "mlu"], help="Device selected for inference." + ) + + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + if args.device == "gpu": + place = "gpu" + elif args.device == "xpu": + place = "xpu" + elif args.device == "npu": + place = "npu" + elif args.device == "mlu": + place = "mlu" + else: + place = "cpu" + + paddle.set_device(place) + + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + # Define model + # `TransformerGenerator` automatically chioces using `FastGeneration` + # (with jit building) or the slower verison `InferTransformerModel`. + transformer = TransformerGenerator( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + use_ft=not args.without_ft, + beam_search_version=args.beam_search_version, + normalize_before=args.get("normalize_before", True), + rel_len=args.use_rel_len, # only works when using FT or beam search v2 + alpha=args.alpha, # only works when using beam search v2 + diversity_rate=args.diversity_rate, # only works when using FT + use_fp16_decoding=False, + ) # only works when using FT + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." + + transformer.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # Providing model_dict still works. + # state_dict = paddle.load(os.path.join(args.init_from_params, + # "transformer.pdparams")) + # transformer.load(state_dict=state_dict) + + # Set evaluate mode + transformer.eval() + + f = open(args.output_file, "w", encoding="utf-8") + with paddle.no_grad(): + for (src_word,) in test_loader: + # When `output_time_major` argument is `True` for TransformerGenerator, + # the shape of finished_seq is `[seq_len, batch_size, beam_size]` + # for beam search v1 or `[seq_len, batch_size, beam_size * 2]` for + # beam search v2. 
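+            # The transpose below rearranges it to [batch_size, beam_size, seq_len]
+            # before the ids are converted back to tokens and written out.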
+ finished_seq = transformer(src_word=src_word) + finished_seq = finished_seq.numpy().transpose([1, 2, 0]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.without_ft = ARGS.without_ft + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + + args.device = ARGS.device + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/reader.py b/examples/machine_translation/transformer/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..32687908ed8acea2f538b75b375e03cb2dbf89b5 --- /dev/null +++ b/examples/machine_translation/transformer/reader.py @@ -0,0 +1,545 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import itertools +import os +import sys +from functools import partial + +import numpy as np +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader + +from paddlenlp.data import Pad, Vocab + + +def min_max_filer(data, max_len, min_len=0): + # 1 for special tokens. 
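+    # Compare source/target lengths (plus one position for the appended
+    # bos/eos token) against the [min_len, max_len] bounds.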
+ data_min_len = min(len(data["source"]), len(data["target"])) + 1 + data_max_len = max(len(data["source"]), len(data["target"])) + 1 + return (data_min_len >= min_len) and (data_max_len <= max_len) + + +def padding_vocab(x, args): + return (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor + + +def create_data_loader(args, places=None): + use_custom_dataset = args.train_file is not None or args.dev_file is not None or args.data_dir is not None + map_kwargs = {} + if use_custom_dataset: + data_files = {} + if args.data_dir is not None: + if os.path.exist( + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["train"] = [ + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)), + ] + if os.path.exist( + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["dev"] = [ + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)), + ] + else: + # datasets.load_dataset doesn't support tuple + if args.train_file is not None: + data_files["train"] = list(args.train_file) + if args.dev_file is not None: + data_files["dev"] = list(args.dev_file) + + from datasets import load_dataset + + if len(data_files) > 0: + for split in data_files: + if isinstance(data_files[split], (list, tuple)): + for i, path in enumerate(data_files[split]): + data_files[split][i] = os.path.abspath(data_files[split][i]) + else: + data_files[split] = os.path.abspath(data_files[split]) + + datasets = load_dataset("language_pair", data_files=data_files, split=("train", "dev")) + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --src_vocab must be specified when using custom dataset. ") + + else: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("train", "dev")) + + map_kwargs["lazy"] = False + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif not args.benchmark: + src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["bpe"]) + else: + src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["benchmark"]) + + if use_custom_dataset and not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. 
") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx + + def convert_samples(sample): + source = sample["source"].split() + sample["source"] = src_vocab.to_indices(source) + + target = sample["target"].split() + sample["target"] = trg_vocab.to_indices(target) + + return sample + + data_loaders = [(None)] * 2 + for i, dataset in enumerate(datasets): + dataset = dataset.map(convert_samples, **map_kwargs).filter(partial(min_max_filer, max_len=args.max_length)) + batch_sampler = TransformerBatchSampler( + dataset=dataset, + batch_size=args.batch_size, + pool_size=args.pool_size, + sort_type=args.sort_type, + shuffle=args.shuffle, + shuffle_batch=args.shuffle_batch, + use_token_batch=True, + max_length=args.max_length, + distribute_mode=True if i == 0 else False, + world_size=dist.get_world_size(), + rank=dist.get_rank(), + pad_seq=args.pad_seq, + bsz_multi=args.bsz_multi, + ) + + data_loader = DataLoader( + dataset=dataset, + places=places, + batch_sampler=batch_sampler, + collate_fn=partial( + prepare_train_input, + bos_idx=args.bos_idx, + eos_idx=args.eos_idx, + pad_idx=args.pad_idx, + pad_seq=args.pad_seq, + dtype=args.input_dtype, + ), + num_workers=args.num_workers, + ) + data_loaders[i] = data_loader + return data_loaders + + +def create_infer_loader(args): + use_custom_dataset = args.test_file is not None or args.data_dir is not None + map_kwargs = {} + if use_custom_dataset: + data_files = {} + if args.data_dir is not None: + if os.path.exist( + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["test"] = [ + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)) + if os.path.exist( + os.path.join( + args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang) + ) + ) + else None, + ] + else: + if args.test_file is not None: + # datasets.load_dataset doesn't support tuple + data_files["test"] = list(args.test_file) if isinstance(args.test_file, tuple) else args.test_file + + from datasets import load_dataset + + if len(data_files) > 0: + for split in data_files: + if isinstance(data_files[split], (list, tuple)): + for i, path in enumerate(data_files[split]): + data_files[split][i] = os.path.abspath(data_files[split][i]) + else: + data_files[split] = os.path.abspath(data_files[split]) + + dataset = load_dataset("language_pair", data_files=data_files, split=("test")) + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --src_vocab must be specified when using custom dataset. 
") + + else: + from paddlenlp.datasets import load_dataset + + dataset = load_dataset("wmt14ende", splits=("test")) + + map_kwargs["lazy"] = False + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif not args.benchmark: + src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + else: + src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) + + if use_custom_dataset and not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. ") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx + + def convert_samples(sample): + source = sample["source"].split() + sample["source"] = src_vocab.to_indices(source) + + if "target" in sample.keys() and sample["target"] != "": + target = sample["target"].split() + sample["target"] = trg_vocab.to_indices(target) + + return sample + + dataset = dataset.map(convert_samples, **map_kwargs) + + data_loader = DataLoader( + dataset=dataset, + batch_size=args.infer_batch_size, + shuffle=False, + drop_last=False, + collate_fn=partial( + prepare_infer_input, + bos_idx=args.bos_idx, + eos_idx=args.eos_idx, + pad_idx=args.pad_idx, + pad_seq=args.pad_seq, + dtype=args.input_dtype, + ), + num_workers=args.num_workers, + return_list=True, + ) + return data_loader, trg_vocab.to_tokens + + +def adapt_vocab_size(args): + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, bos_token=args.bos_token, eos_token=args.eos_token, pad_token=args.pad_token + ) + elif not args.benchmark: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("test")) + src_vocab = Vocab.load_vocabulary(**datasets.vocab_info["bpe"]) + else: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("test")) + src_vocab = Vocab.load_vocabulary(**datasets.vocab_info["benchmark"]) + + if not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, bos_token=args.bos_token, eos_token=args.eos_token, pad_token=args.pad_token + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. ") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx + + +def prepare_train_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int64"): + """ + Put all padded data needed by training into a list. 
+ """ + word_pad = Pad(pad_idx, dtype=dtype) + + src_max_len = (max([len(inst["source"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + trg_max_len = (max([len(inst["target"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + src_word = word_pad( + [inst["source"] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst["source"])) for inst in insts] + ) + trg_word = word_pad( + [[bos_idx] + inst["target"] + [pad_idx] * (trg_max_len - 1 - len(inst["target"])) for inst in insts] + ) + lbl_word = np.expand_dims( + word_pad([inst["target"] + [eos_idx] + [pad_idx] * (trg_max_len - 1 - len(inst["target"])) for inst in insts]), + axis=2, + ) + + data_inputs = [src_word, trg_word, lbl_word] + + return data_inputs + + +def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int64"): + """ + Put all padded data needed by beam search decoder into a list. + """ + word_pad = Pad(pad_idx, dtype=dtype) + + src_max_len = (max([len(inst["source"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + src_word = word_pad( + [inst["source"] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst["source"])) for inst in insts] + ) + + return [ + src_word, + ] + + +class SortType(object): + GLOBAL = "global" + POOL = "pool" + NONE = "none" + + +class SentenceBatchCreator(object): + def __init__(self, batch_size): + self.batch = [] + self._batch_size = batch_size + + def append(self, info): + self.batch.append(info) + if len(self.batch) == self._batch_size: + tmp = self.batch + self.batch = [] + return tmp + + +class TokenBatchCreator(object): + def __init__(self, batch_size, bsz_multi=1): + self._batch = [] + self.max_len = -1 + self._batch_size = batch_size + self._bsz_multi = bsz_multi + + def append(self, info): + cur_len = info.max_len + max_len = max(self.max_len, cur_len) + if max_len * (len(self._batch) + 1) > self._batch_size: + # Make sure the batch size won't be empty. 
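+            # Emit the samples collected so far (kept to a multiple of
+            # _bsz_multi when possible) and carry the remainder plus the new
+            # sample over to the next batch.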
+ mode_len = max(len(self._batch) // self._bsz_multi * self._bsz_multi, len(self._batch) % self._bsz_multi) + result = self._batch[:mode_len] + self._batch = self._batch[mode_len:] + self._batch.append(info) + self.max_len = max([b.max_len for b in self._batch]) + return result + else: + self.max_len = max_len + self._batch.append(info) + + @property + def batch(self): + return self._batch + + +class SampleInfo(object): + def __init__(self, i, lens, pad_seq=1): + self.i = i + # Take bos and eos into account + self.min_len = min(lens[0], lens[1]) + 1 + self.max_len = (max(lens[0], lens[1]) + pad_seq) // pad_seq * pad_seq + self.seq_max_len = max(lens[0], lens[1]) + 1 + self.src_len = lens[0] + 1 + self.trg_len = lens[1] + 1 + + +class TransformerBatchSampler(BatchSampler): + def __init__( + self, + dataset, + batch_size, + pool_size=10000, + sort_type=SortType.NONE, + min_length=0, + max_length=100, + shuffle=False, + shuffle_batch=False, + use_token_batch=False, + clip_last_batch=False, + distribute_mode=True, + seed=0, + world_size=1, + rank=0, + pad_seq=1, + bsz_multi=8, + ): + for arg, value in locals().items(): + if arg != "self": + setattr(self, "_" + arg, value) + self._random = np.random + self._random.seed(seed) + # for multi-devices + self._distribute_mode = distribute_mode + self._nranks = world_size + self._local_rank = rank + self._sample_infos = [] + for i, data in enumerate(self._dataset): + lens = [len(data["source"]), len(data["target"])] + self._sample_infos.append(SampleInfo(i, lens, self._pad_seq)) + + def __iter__(self): + # global sort or global shuffle + if self._sort_type == SortType.GLOBAL: + infos = sorted(self._sample_infos, key=lambda x: x.trg_len) + infos = sorted(infos, key=lambda x: x.src_len) + else: + if self._shuffle: + infos = self._sample_infos + self._random.shuffle(infos) + else: + infos = self._sample_infos + + if self._sort_type == SortType.POOL: + reverse = True + for i in range(0, len(infos), self._pool_size): + # To avoid placing short next to long sentences + reverse = not reverse + infos[i : i + self._pool_size] = sorted( + infos[i : i + self._pool_size], key=lambda x: x.seq_max_len, reverse=reverse + ) + + batches = [] + batch_creator = ( + TokenBatchCreator(self._batch_size, self._bsz_multi) + if self._use_token_batch + else SentenceBatchCreator(self._batch_size * self._nranks) + ) + + for info in infos: + batch = batch_creator.append(info) + if batch is not None: + batches.append(batch) + + if not self._clip_last_batch and len(batch_creator.batch) != 0: + batches.append(batch_creator.batch) + + if self._shuffle_batch: + self._random.shuffle(batches) + + if not self._use_token_batch: + # When producing batches according to sequence number, to confirm + # neighbor batches which would be feed and run parallel have similar + # length (thus similar computational cost) after shuffle, we as take + # them as a whole when shuffling and split here + batches = [ + [batch[self._batch_size * i : self._batch_size * (i + 1)] for i in range(self._nranks)] + for batch in batches + ] + batches = list(itertools.chain.from_iterable(batches)) + self.batch_number = (len(batches) + self._nranks - 1) // self._nranks + + # for multi-device + for batch_id, batch in enumerate(batches): + if not self._distribute_mode or (batch_id % self._nranks == self._local_rank): + batch_indices = [info.i for info in batch] + yield batch_indices + if self._distribute_mode and len(batches) % self._nranks != 0: + if self._local_rank >= len(batches) % self._nranks: + # use previous data 
to pad + yield batch_indices + + def __len__(self): + if hasattr(self, "batch_number"): # + return self.batch_number + if not self._use_token_batch: + batch_number = (len(self._dataset) + self._batch_size * self._nranks - 1) // ( + self._batch_size * self._nranks + ) + else: + # For uncertain batch number, the actual value is self.batch_number + batch_number = sys.maxsize + return batch_number diff --git a/examples/machine_translation/transformer/static/predict.py b/examples/machine_translation/transformer/static/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..41bada379268d109fdc2dc30d76e0258b721d8b3 --- /dev/null +++ b/examples/machine_translation/transformer/static/predict.py @@ -0,0 +1,231 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import numpy as np +import paddle +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.transformers import InferTransformerModel + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) +import reader # noqa: E402 + + +def cast_parameters_to_fp32(place, program, scope=None): + all_parameters = [] + for block in program.blocks: + all_parameters.extend(block.all_parameters()) + + var_scope = scope if scope else paddle.static.global_scope() + for param in all_parameters: + tensor = var_scope.find_var(param.name).get_tensor() + if "fp16" in str(tensor._dtype()).lower() and "fp32" in str(param.dtype).lower(): + data = np.array(tensor) + tensor.set(np.float32(data), place) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. 
If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + paddle.enable_static() + if args.device == "gpu": + place = paddle.set_device("gpu") + else: + place = paddle.set_device("cpu") + + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + test_program = paddle.static.Program() + startup_program = paddle.static.Program() + with paddle.static.program_guard(test_program, startup_program): + src_word = paddle.static.data(name="src_word", shape=[None, None], dtype=args.input_dtype) + + # Define model + transformer = InferTransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + ) + + finished_seq = transformer(src_word=src_word) + + test_program = test_program.clone(for_test=True) + + exe = paddle.static.Executor(place) + exe.run(startup_program) + + assert args.init_from_params, "must set init_from_params to load parameters" + paddle.static.load(test_program, os.path.join(args.init_from_params, "transformer"), exe) + print("finish initing model from params from %s" % (args.init_from_params)) + + # cast weights from fp16 to fp32 after loading + if args.use_pure_fp16: + cast_parameters_to_fp32(place, test_program) + + f = open(args.output_file, "w") + for data in test_loader: + (finished_sequence,) = exe.run(test_program, feed={"src_word": data[0]}, fetch_list=finished_seq.name) + finished_sequence = finished_sequence.transpose([0, 2, 1]) + for ins in finished_sequence: + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + + paddle.disable_static() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.data_dir = 
ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/static/train.py b/examples/machine_translation/transformer/static/train.py new file mode 100644 index 0000000000000000000000000000000000000000..13ca011345aaba6f51c0130ad43d0b1d5c46a05e --- /dev/null +++ b/examples/machine_translation/transformer/static/train.py @@ -0,0 +1,414 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +import time +from pprint import pprint + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.transformers import CrossEntropyCriterion, TransformerModel +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) +import reader # noqa: E402 +from tls.record import AverageStatistical # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--distributed", action="store_true", help="Whether to use fleet to launch. ") + parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. 
If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--train_file", + nargs="+", + default=None, + type=str, + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--dev_file", + nargs="+", + default=None, + type=str, + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + parser.add_argument("--weight_decay", default=None, type=float, help="Weight Decay for optimizer. ") + + # For benchmark. 
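+    # The value of --profiler_options is passed unchanged to
+    # paddlenlp.utils.profiler.add_profiler_step() on each training step inside do_train().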
+ parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + args = parser.parse_args() + return args + + +def do_train(args): + paddle.enable_static() + if args.is_distributed: + fleet.init(is_collective=True) + assert args.device != "xpu", "xpu doesn't support distributed training" + places = [paddle.set_device("gpu")] if args.device == "gpu" else paddle.static.cpu_places() + trainer_count = len(places) + else: + if args.device == "gpu": + places = paddle.static.cuda_places() + elif args.device == "xpu": + places = paddle.static.xpu_places() + paddle.set_device("xpu") + else: + places = paddle.static.cpu_places() + paddle.set_device("cpu") + trainer_count = len(places) + + # Set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + # Define data loader + (train_loader), (eval_loader) = reader.create_data_loader(args, places=places) + + train_program = paddle.static.Program() + startup_program = paddle.static.Program() + with paddle.static.program_guard(train_program, startup_program): + src_word = paddle.static.data(name="src_word", shape=[None, None], dtype=args.input_dtype) + trg_word = paddle.static.data(name="trg_word", shape=[None, None], dtype=args.input_dtype) + lbl_word = paddle.static.data(name="lbl_word", shape=[None, None, 1], dtype=args.input_dtype) + + # Define model + transformer = TransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + ) + # Define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx) + + logits = transformer(src_word=src_word, trg_word=trg_word) + + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0) + + # Define optimizer + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + weight_decay=args.weight_decay, + ) + + if args.is_distributed: + build_strategy = paddle.static.BuildStrategy() + exec_strategy = paddle.static.ExecutionStrategy() + dist_strategy = fleet.DistributedStrategy() + dist_strategy.build_strategy = build_strategy + dist_strategy.execution_strategy = exec_strategy + dist_strategy.fuse_grad_size_in_MB = 16 + + if args.use_amp: + dist_strategy.amp = True + dist_strategy.amp_configs = { + "custom_white_list": ["softmax", "layer_norm"], + "init_loss_scaling": args.scale_loss, + "custom_black_list": ["lookup_table_v2"], + "use_pure_fp16": args.use_pure_fp16, + } + + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + else: + if args.use_amp: + amp_list = paddle.static.amp.AutoMixedPrecisionLists( + custom_white_list=["softmax", "layer_norm"], custom_black_list=["lookup_table_v2"] + ) + optimizer = paddle.static.amp.decorate( + optimizer, + amp_list, + init_loss_scaling=args.scale_loss, + use_dynamic_loss_scaling=True, + use_pure_fp16=args.use_pure_fp16, + ) + optimizer.minimize(avg_cost) + + if args.is_distributed: + exe = 
paddle.static.Executor(places[0]) + else: + exe = paddle.static.Executor() + build_strategy = paddle.static.BuildStrategy() + exec_strategy = paddle.static.ExecutionStrategy() + + compiled_train_program = paddle.static.CompiledProgram(train_program, build_strategy=build_strategy) + exe.run(startup_program) + + if args.use_amp: + optimizer.amp_init(places[0]) + + # the best cross-entropy value with label smoothing + loss_normalizer = -( + (1.0 - args.label_smooth_eps) * np.log((1.0 - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20) + ) + + step_idx = 0 + + # For benchmark + reader_cost_avg = AverageStatistical() + batch_cost_avg = AverageStatistical() + batch_ips_avg = AverageStatistical() + + for pass_id in range(args.epoch): + batch_id = 0 + batch_start = time.time() + for data in train_loader: + # NOTE: used for benchmark and use None as default. + if args.max_iter and step_idx == args.max_iter: + break + if trainer_count == 1: + data = [data] + train_reader_cost = time.time() - batch_start + + if args.is_distributed: + outs = exe.run( + train_program, + feed=[ + { + "src_word": data[i][0], + "trg_word": data[i][1], + "lbl_word": data[i][2], + } + for i in range(trainer_count) + ], + fetch_list=[sum_cost.name, token_num.name], + ) + train_batch_cost = time.time() - batch_start + batch_ips_avg.record(train_batch_cost, np.asarray(outs[1]).sum()) + else: + outs = exe.run( + compiled_train_program, + feed=[ + { + "src_word": data[i][0], + "trg_word": data[i][1], + "lbl_word": data[i][2], + } + for i in range(trainer_count) + ], + fetch_list=[sum_cost.name, token_num.name], + ) + train_batch_cost = time.time() - batch_start + batch_ips_avg.record(train_batch_cost, np.asarray(outs[1]).sum() / trainer_count) + scheduler.step() + + reader_cost_avg.record(train_reader_cost) + batch_cost_avg.record(train_batch_cost) + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + if step_idx % args.print_step == 0 and ( + args.benchmark or (args.is_distributed and dist.get_rank() == 0) or not args.is_distributed + ): + sum_cost_val, token_num_val = np.array(outs[0]), np.array(outs[1]) + # Sum the cost from multi-devices + total_sum_cost = sum_cost_val.sum() + total_token_num = token_num_val.sum() + total_avg_cost = total_sum_cost / total_token_num + + if step_idx == 0: + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + else: + train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time() + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, avg_speed: %.2f step/s, " + "batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, " + "ips: %.5f words/sec" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + train_avg_batch_cost, + batch_cost_avg.get_average(), + reader_cost_avg.get_average(), + batch_ips_avg.get_total_cnt(), + batch_ips_avg.get_average_per_sec(), + ) + ) + reader_cost_avg.reset() + batch_cost_avg.reset() + batch_ips_avg.reset() + + if step_idx % args.save_step == 0 and step_idx != 0: + if args.save_model and dist.get_rank() == 0: + model_path = os.path.join(args.save_model, "step_" + str(step_idx), "transformer") + 
paddle.static.save(train_program, model_path) + + batch_id += 1 + step_idx += 1 + batch_start = time.time() + + # NOTE: used for benchmark and use None as default. + if args.max_iter and step_idx == args.max_iter: + break + + if args.save_model and dist.get_rank() == 0: + model_path = os.path.join(args.save_model, "step_final", "transformer") + paddle.static.save(train_program, model_path) + + paddle.disable_static() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.is_distributed = ARGS.distributed + if ARGS.max_iter: + args.max_iter = ARGS.max_iter + args.weight_decay = ARGS.weight_decay + + args.data_dir = ARGS.data_dir + args.train_file = ARGS.train_file + args.dev_file = ARGS.dev_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + args.profiler_options = ARGS.profiler_options + + do_train(args) diff --git a/examples/machine_translation/transformer/tls/distributed_utils.py b/examples/machine_translation/transformer/tls/distributed_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..26d6c0ca8d90683eeb918f9c1bdf9c2e855c6b1f --- /dev/null +++ b/examples/machine_translation/transformer/tls/distributed_utils.py @@ -0,0 +1,19 @@ +import paddle +import paddle.distributed as dist + + +def all_gather_tokens(data): + """Gathers num of tokens from all nodes. + `data` should be a tensor of num of tokens. 
+ """ + if dist.get_world_size() < 2: + return data + if not hasattr(all_gather_tokens, "_in_buffer") or all_gather_tokens._in_buffer is None: + all_gather_tokens._in_buffer = data + all_gather_tokens._out_buffers = [] + in_buffer = all_gather_tokens._in_buffer + out_buffers = all_gather_tokens._out_buffers + + dist.all_gather(out_buffers, in_buffer) + + return paddle.add_n(out_buffers) diff --git a/examples/machine_translation/transformer/tls/record.py b/examples/machine_translation/transformer/tls/record.py new file mode 100644 index 0000000000000000000000000000000000000000..d1ddc738a5280978255d02eed4682adc59543794 --- /dev/null +++ b/examples/machine_translation/transformer/tls/record.py @@ -0,0 +1,29 @@ +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time diff --git a/examples/machine_translation/transformer/tls/to_static.py b/examples/machine_translation/transformer/tls/to_static.py new file mode 100644 index 0000000000000000000000000000000000000000..96a41a6159aa53861d20a0d1dfa9ce36bf8aee63 --- /dev/null +++ b/examples/machine_translation/transformer/tls/to_static.py @@ -0,0 +1,31 @@ +# copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.jit import to_static + + +def create_input_specs(): + src_word = paddle.static.InputSpec(name="src_word", shape=[None, None], dtype="int64") + trg_word = paddle.static.InputSpec(name="trg_word", shape=[None, None], dtype="int64") + return [src_word, trg_word] + + +def apply_to_static(config, model): + support_to_static = config.get("to_static", False) + if support_to_static: + specs = create_input_specs() + model = to_static(model, input_spec=specs) + print("Successfully to apply @to_static with specs: {}".format(specs)) + return model diff --git a/examples/machine_translation/transformer/train.py b/examples/machine_translation/transformer/train.py new file mode 100644 index 0000000000000000000000000000000000000000..67465e0b8bae42aec4b31f9579b7eebc40020309 --- /dev/null +++ b/examples/machine_translation/transformer/train.py @@ -0,0 +1,476 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import inspect +import os +import time +from pprint import pprint + +import numpy as np +import paddle +import paddle.distributed as dist +import reader +import yaml +from easydict import EasyDict as AttrDict +from tls.record import AverageStatistical +from tls.to_static import apply_to_static + +from paddlenlp.transformers import CrossEntropyCriterion, TransformerModel +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--train_file", + nargs="+", + default=None, + type=str, + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--dev_file", + nargs="+", + default=None, + type=str, + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. 
", + ) + parser.add_argument("--batch_size", default=None, type=int, help="The maximum tokens per batch. ") + parser.add_argument( + "--use_amp", + default=None, + type=str, + choices=["true", "false", "True", "False"], + help="Whether to use amp to train Transformer. ", + ) + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu", "mlu"], help="Device selected for inference." + ) + parser.add_argument( + "--amp_level", + default=None, + type=str, + choices=["O1", "O2"], + help="The amp level if --use_amp is on. Can be one of [O1, O2]. ", + ) + parser.add_argument("--weight_decay", default=None, type=float, help="Weight Decay for optimizer. ") + + # For benchmark. + parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + parser.add_argument("--to_static", action="store_true", help="Whether use to_static to train Transformer. ") + args = parser.parse_args() + return args + + +def do_train(args): + if args.device == "gpu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + elif args.device == "npu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + paddle.set_device("npu") + elif args.device == "xpu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + paddle.set_device("xpu") + elif args.device == "mlu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + paddle.set_device("mlu") + else: + rank = 0 + trainer_count = 1 + paddle.set_device("cpu") + + if trainer_count > 1: + dist.init_parallel_env() + + # Set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + # Define data loader + (train_loader), (eval_loader) = reader.create_data_loader(args) + + # Define model + transformer = TransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + normalize_before=args.get("normalize_before", True), + ) + + transformer = apply_to_static(args, transformer) + + # Define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx if args.pad_idx is None else args.pad_idx) + + scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0) + + # Define optimizer + if "use_multi_tensor" not in inspect.getfullargspec(paddle.optimizer.Adam.__init__).args: + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + weight_decay=args.weight_decay, + ) + else: + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + use_multi_tensor=True, + weight_decay=args.weight_decay, + ) + + # Init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdparams")) + opt_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdopt")) + transformer.set_state_dict(model_dict) + 
optimizer.set_state_dict(opt_dict) + print("loaded from checkpoint.") + # Init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict = paddle.load(os.path.join(args.init_from_pretrain_model, "transformer.pdparams")) + transformer.set_state_dict(model_dict) + print("loaded from pre-trained model.") + + # for amp training + if args.use_amp: + amp_level = "O2" if args.use_pure_fp16 else "O1" + scaler = paddle.amp.GradScaler(enable=True, init_loss_scaling=args.scale_loss) + transformer = paddle.amp.decorate(models=transformer, level=amp_level, save_dtype="float32") + + # for distributed training + if trainer_count > 1: + transformer = paddle.DataParallel(transformer) + + # The best cross-entropy value with label smoothing + loss_normalizer = -( + (1.0 - args.label_smooth_eps) * np.log((1.0 - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20) + ) + + step_idx = 0 + tokens_sum = 0 + + # For benchmark + reader_cost_avg = AverageStatistical() + batch_cost_avg = AverageStatistical() + batch_ips_avg = AverageStatistical() + + # Train loop + for pass_id in range(args.epoch): + epoch_start = time.time() + + batch_id = 0 + batch_start = time.time() + for input_data in train_loader: + train_reader_cost = time.time() - batch_start + (src_word, trg_word, lbl_word) = input_data + + if args.use_amp: + with paddle.amp.auto_cast( + custom_black_list={"scale", "reduce_sum", "elementwise_div"} if amp_level == "O2" else {}, + level=amp_level, + ): + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + scaled = scaler.scale(avg_cost) # scale the loss + scaled.backward() # do backward + + scaler.minimize(optimizer, scaled) # update parameters + if "set_to_zero" in inspect.getfullargspec(optimizer.clear_grad).args: + optimizer.clear_grad(set_to_zero=False) + else: + optimizer.clear_grad() + else: + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + avg_cost.backward() + + optimizer.step() + optimizer.clear_grad() + + train_batch_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + batch_cost_avg.record(train_batch_cost) + batch_ips_avg.record(train_batch_cost, 0) + + tokens_sum += token_num + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + # NOTE: For benchmark, loss infomation on all cards will be printed. 
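+            # Quantities logged below:
+            #   avg loss        - the criterion's avg_cost on this batch (label-smoothed cross entropy)
+            #   normalized loss - avg loss minus loss_normalizer, the best cross entropy
+            #                     achievable with label smoothing (computed above)
+            #   ppl             - exp(avg loss), with the exponent clipped at 100 to avoid overflow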
+ if step_idx % args.print_step == 0 and (args.benchmark or rank == 0): + total_avg_cost = avg_cost.numpy() + tokens_sum_val = tokens_sum.numpy() + batch_ips_avg.record(0, tokens_sum_val) + tokens_sum = 0 + + if step_idx == 0: + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f " + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + else: + train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time() + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, " + "batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, " + "ips: %.5f words/sec" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + train_avg_batch_cost, + batch_cost_avg.get_average(), + reader_cost_avg.get_average(), + batch_ips_avg.get_total_cnt(), + batch_ips_avg.get_average_per_sec(), + ) + ) + reader_cost_avg.reset() + batch_cost_avg.reset() + batch_ips_avg.reset() + + if step_idx % args.save_step == 0 and step_idx != 0: + # Validation + transformer.eval() + total_sum_cost = 0 + total_token_num = 0 + with paddle.no_grad(): + for input_data in eval_loader: + (src_word, trg_word, lbl_word) = input_data + if args.use_amp: + with paddle.amp.auto_cast( + custom_black_list={"scale", "reduce_sum", "elementwise_div"} + if amp_level == "O2" + else {}, + level=amp_level, + ): + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + else: + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + total_sum_cost += sum_cost.numpy() + total_token_num += token_num.numpy() + total_avg_cost = total_sum_cost / total_token_num + logger.info( + "validation, step_idx: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" + % ( + step_idx, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + transformer.train() + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + # NOTE: Used for benchmark and use None as default. + if args.max_iter and step_idx == args.max_iter: + break + batch_id += 1 + step_idx += 1 + scheduler.step() + batch_start = time.time() + + # NOTE: Used for benchmark and use None as default. 
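+        # Also leave the epoch loop once --max_iter training steps have been reached.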
+ if args.max_iter and step_idx == args.max_iter: + break + + train_epoch_cost = time.time() - epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (pass_id, train_epoch_cost)) + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + if ARGS.max_iter: + args.max_iter = ARGS.max_iter + if ARGS.batch_size: + args.batch_size = ARGS.batch_size + if ARGS.use_amp: + ARGS.use_amp = ARGS.use_amp.lower() + if ARGS.use_amp == "true": + args.use_amp = True + else: + args.use_amp = False + if ARGS.amp_level: + args.use_pure_fp16 = ARGS.amp_level == "O2" + args.weight_decay = ARGS.weight_decay + + args.data_dir = ARGS.data_dir + args.train_file = ARGS.train_file + args.dev_file = ARGS.dev_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + if ARGS.to_static: + args.to_static = ARGS.to_static + args.device = ARGS.device + pprint(args) + + args.profiler_options = ARGS.profiler_options + + do_train(args) diff --git a/examples/model_compression/distill_lstm/README.md b/examples/model_compression/distill_lstm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..56e63c435339821e886d2457b65fd9cc06698efe --- /dev/null +++ b/examples/model_compression/distill_lstm/README.md @@ -0,0 +1,194 @@ +# Distilling Knowledge From Fine-tuned BERT into Bi-LSTM + +以下是本例的简要目录结构及说明: +``` +. 
+├── small.py # 小模型结构以及对小模型单独训练的脚本 +├── bert_distill.py # 用教师模型BERT蒸馏学生模型的蒸馏脚本 +├── data.py # 定义了dataloader等数据读取接口 +├── utils.py # 定义了将样本转成id的转换接口 +├── args.py # 参数配置脚本 +└── README.md # 文档,本文件 +``` + +## 简介 +本目录下的实验是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)实现。 + +在模型蒸馏中,较大的模型(在本例中是BERT)通常被称为教师模型,较小的模型(在本例中是Bi-LSTM)通常被称为学生模型。知识的蒸馏通常是通过模型学习蒸馏相关的损失函数实现,在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。 + +在[论文](https://arxiv.org/abs/1903.12136)的模型蒸馏阶段,作者为了能让教师模型表达出更多的知识供学生模型学习,对训练数据进行了数据增强。作者使用了三种数据增强方式,分别是: + +1. Masking,即以一定的概率将原数据中的word token替换成`[MASK]`; + +2. POS—guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换; + +3. n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。 + +通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的“暗知识”,在更大的数据集上进行训练,产生更好的蒸馏效果。需要指出的是,实验只使用了第1和第3种数据增强方式。 +在英文数据集任务上,本文使用了Google News语料[预训练的Word Embedding](https://code.google.com/archive/p/word2vec/)初始化小模型的Embedding层。 + +本实验分为三个训练过程:在特定任务上对BERT的fine-tuning、在特定任务上对基于Bi-LSTM的小模型的训练(用于评价蒸馏效果)、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。 + +## 数据、预训练模型介绍及获取 + +本实验使用GLUE中的SST-2、QQP以及中文情感分类数据集ChnSentiCorp中的训练集作为训练语料,用数据集中的验证集评估模型的效果。运行本目录下的实验,数据集会被自动下载到`paddlenlp.utils.env.DATA_HOME` 路径下,例如在linux系统下,例如对于GLUE中的QQP数据集,默认存储路径是`~/.paddlenlp/datasets/glue/QQP`,对于ChnSentiCorp数据集,则会下载到 `~/.paddlenlp/datasets/chnsenticorp`。 + +对于BERT的fine-tuning任务,本实验中使用了预训练模型`bert-bas-uncased`、`bert-wwm-ext-chinese`、`bert-base-chinese`。同样,这几个模型在训练时会被自动下载到`paddlenlp.utils.env.MODEL_HOME`路径下。例如,对于`bert-base-uncased`模型,在linux系统下,会被下载到`~/.paddlenlp/models/bert-base-uncased`下。 + +在中文数据集上的小模型训练的输入利用jieba分词,其中词表同本repo下[文本分类项目](../../text_classification/rnn)的词表,可通过运行以下命令进行下载: + +```shell +wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt +``` + +为了节省显存和运行时间,可以对ChnSentiCorp中未出现的词先进行过滤,并将最后的词表文件名和词表大小配置在下面的参数中。 + + +## 蒸馏实验过程 +### 训练BERT fine-tuning模型 +训练BERT的fine-tuning模型,可以去本repo下example中的[glue目录](../../benchmark/glue)下。关于glue的更多详细说明,可见glue目录下的README文档。 + +以GLUE的SST-2任务为例,调用BERT fine-tune的训练脚本,配置如下的参数,训练SST-2任务: + +```shell +cd ../../benchmark/glue +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 +python -u ./run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 128 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 10 \ + --save_steps 10 \ + --output_dir ../model_compression/distill_lstm/pretrained_models/$TASK_NAME/ \ + --device gpu \ + +``` + +如果需要训练基于ChnSentiCorp数据集的BERT finetuning模型,可以进入[文本分类目录](../../text_classification/pretrained_models)下,将预训练模型改成BERT,并基于bert-base-chinese和bert-wwm-ext-chinese模型进行fine-tuning训练。 + +训练完成之后,可将训练效果最好的模型保存在本项目下的`pretrained_models/$TASK_NAME/`下。模型目录下有`model_config.json`, `model_state.pdparams`, `tokenizer_config.json`及`vocab.txt`这几个文件。 + + +### 训练小模型 + +尝试运行下面的脚本可以分别基于ChnSentiCorp、SST-2、QQP数据集对基于BiLSTM的小模型进行训练。 + + +```shell +CUDA_VISIBLE_DEVICES=0 python small.py \ + --task_name chnsenticorp \ + --max_epoch 20 \ + --vocab_size 1256608 \ + --batch_size 64 \ + --model_name bert-wwm-ext-chinese \ + --optimizer adam \ + --lr 3e-4 \ + --dropout_prob 0.2 \ + --vocab_path senta_word_dict.txt \ + --save_steps 10000 \ + --output_dir small_models/chnsenticorp/ + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python small.py \ + --task_name sst-2 \ + --vocab_size 30522 \ + --max_epoch 10 \ + --batch_size 64 \ + --lr 1.0 \ + --dropout_prob 0.4 \ + --output_dir small_models/SST-2 \ + --save_steps 10000 \ + --embedding_name 
w2v.google_news.target.word-word.dim300.en + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python small.py \ + --task_name qqp \ + --vocab_size 30522 \ + --max_epoch 35 \ + --batch_size 256 \ + --lr 2.0 \ + --dropout_prob 0.4 \ + --output_dir small_models/QQP \ + --save_steps 10000 \ + --embedding_name w2v.google_news.target.word-word.dim300.en + +``` + +### 蒸馏模型 +这一步是将教师模型BERT的知识蒸馏到基于BiLSTM的学生模型中,可以运行下面的命令分别基于ChnSentiCorp、SST-2、QQP数据集对基于BiLSTM的学生模型进行蒸馏。 + +```shell +CUDA_VISIBLE_DEVICES=0 python bert_distill.py \ + --task_name chnsenticorp \ + --vocab_size 1256608 \ + --max_epoch 6 \ + --lr 1.0 \ + --dropout_prob 0.1 \ + --batch_size 64 \ + --model_name bert-wwm-ext-chinese \ + --teacher_dir pretrained_models/chnsenticorp/best_bert_wwm_ext_model_880 \ + --vocab_path senta_word_dict.txt \ + --output_dir distilled_models/chnsenticorp \ + --save_steps 10000 \ + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python bert_distill.py \ + --task_name sst-2 \ + --vocab_size 30522 \ + --max_epoch 6 \ + --lr 1.0 \ + --task_name sst-2 \ + --dropout_prob 0.2 \ + --batch_size 128 \ + --model_name bert-base-uncased \ + --output_dir distilled_models/SST-2 \ + --teacher_dir pretrained_models/SST-2/best_model_610 \ + --save_steps 10000 \ + --embedding_name w2v.google_news.target.word-word.dim300.en \ + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python bert_distill.py \ + --task_name qqp \ + --vocab_size 30522 \ + --max_epoch 6 \ + --lr 1.0 \ + --dropout_prob 0.2 \ + --batch_size 256 \ + --model_name bert-base-uncased \ + --n_iter 10 \ + --output_dir distilled_models/QQP \ + --teacher_dir pretrained_models/QQP/best_model_17000 \ + --save_steps 10000 \ + --embedding_name w2v.google_news.target.word-word.dim300.en \ + +``` + +各参数的具体说明请参阅 `args.py` ,注意在训练不同任务时,需要调整对应的超参数。 + + +## 蒸馏实验结果 +本蒸馏实验基于GLUE的SST-2、QQP、中文情感分类ChnSentiCorp数据集。实验效果均使用每个数据集的验证集(dev)进行评价,评价指标是准确率(acc),其中QQP中包含f1值。利用基于BERT的教师模型去蒸馏基于Bi-LSTM的学生模型,对比Bi-LSTM小模型单独训练,在SST-2、QQP、ChnSentiCorp(中文情感分类)任务上分别有3.3%、1.9%、1.4%的提升。 + +| Model | SST-2(dev acc) | QQP(dev acc/f1) | ChnSentiCorp(dev acc) | ChnSentiCorp(dev acc) | +| ----------------- | ----------------- | -------------------------- | --------------------- | --------------------- | +| Teacher model | bert-base-uncased | bert-base-uncased | bert-base-chinese | bert-wwm-ext-chinese | +| BERT-base | 0.930046 | 0.905813(acc)/0.873472(f1) | 0.951667 | 0.955000 | +| Bi-LSTM | 0.854358 | 0.856616(acc)/0.799682(f1) | 0.920000 | 0.920000 | +| Distilled Bi-LSTM | 0.887615 | 0.875216(acc)/0.831254(f1) | 0.932500 | 0.934167 | + +## 参考文献 + +Tang R, Lu Y, Liu L, Mou L, Vechtomova O, Lin J. [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)[J]. arXiv preprint arXiv:1903.12136, 2019. diff --git a/examples/model_compression/distill_lstm/args.py b/examples/model_compression/distill_lstm/args.py new file mode 100644 index 0000000000000000000000000000000000000000..07fd4b1bb1914d386f6e353774ff85ad7a939b3b --- /dev/null +++ b/examples/model_compression/distill_lstm/args.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse + +from paddlenlp.utils.env import MODEL_HOME + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--task_name", type=str, default="sst-2", help="Task name.") + + parser.add_argument( + "--optimizer", type=str, default="adadelta", help="Optimizer to use, only support[adam|adadelta]." + ) + + parser.add_argument("--lr", type=float, default=1.0, help="Learning rate for optimizer.") + + parser.add_argument("--num_layers", type=int, default=1, help="Layers number of LSTM.") + + parser.add_argument("--emb_dim", type=int, default=300, help="Embedding dim.") + + parser.add_argument("--output_dim", type=int, default=2, help="Number of classifications.") + + parser.add_argument("--hidden_size", type=int, default=300, help="Hidden size of LSTM") + + parser.add_argument("--batch_size", type=int, default=64, help="Batch size of training.") + + parser.add_argument("--max_epoch", type=int, default=12, help="Max number of epochs for training.") + + parser.add_argument("--max_seq_length", type=int, default=128, help="Max length for sentence.") + + parser.add_argument( + "--n_iter", type=int, default=20, help="Number of iterations for one sample in data augmentation." + ) + + parser.add_argument("--dropout_prob", type=float, default=0.0, help="Drop probability.") + + parser.add_argument("--init_scale", type=float, default=0.1, help="Init scale for parameter") + + parser.add_argument("--log_freq", type=int, default=10, help="The frequency to print evaluation logs.") + + parser.add_argument("--save_steps", type=int, default=100, help="The frequency to print evaluation logs.") + + parser.add_argument("--padding_idx", type=int, default=0, help="The padding index of embedding.") + + parser.add_argument( + "--model_name", + type=str, + default="bert-base-uncased", + help="Teacher model's name. Maybe its tokenizer would be loaded and used by small model.", + ) + + parser.add_argument("--teacher_dir", type=str, help="Teacher model's directory.") + + parser.add_argument( + "--vocab_path", + type=str, + default=os.path.join(MODEL_HOME, "bert-base-uncased", "bert-base-uncased-vocab.txt"), + help="Student model's vocab path.", + ) + + parser.add_argument("--output_dir", type=str, default="models", help="Directory to save models .") + + parser.add_argument( + "--init_from_ckpt", type=str, default=None, help="The path of layer and optimizer to be loaded." + ) + + parser.add_argument( + "--whole_word_mask", + action="store_true", + help="If True, use whole word masking method in data augmentation in distilling.", + ) + + parser.add_argument("--embedding_name", type=str, default=None, help="The name of pretrained word embedding.") + + parser.add_argument("--vocab_size", type=int, default=10000, help="Student model's vocab size.") + + parser.add_argument( + "--alpha", type=float, default=0.0, help="Weight balance between cross entropy loss and mean square loss." 
+ ) + + parser.add_argument( + "--seed", + type=int, + default=2021, + help="Random seed for model parameter initialization, data augmentation and so on.", + ) + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + args = parser.parse_args() + return args diff --git a/examples/model_compression/distill_lstm/bert_distill.py b/examples/model_compression/distill_lstm/bert_distill.py new file mode 100644 index 0000000000000000000000000000000000000000..9f253a31b8f568e884da5c8185eb9817ca052d9a --- /dev/null +++ b/examples/model_compression/distill_lstm/bert_distill.py @@ -0,0 +1,172 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import paddle +import paddle.nn as nn +from args import parse_args +from data import create_distill_loader +from paddle.metric import Accuracy +from small import BiLSTM + +from paddlenlp.metrics import AccuracyAndF1 +from paddlenlp.transformers import BertForSequenceClassification + +METRIC_CLASSES = {"sst-2": Accuracy, "qqp": AccuracyAndF1, "chnsenticorp": Accuracy} + + +class TeacherModel(object): + def __init__(self, teacher_dir): + self.model = BertForSequenceClassification.from_pretrained(teacher_dir) + self.model.eval() + + +def evaluate(task_name, model, metric, data_loader): + model.eval() + metric.reset() + for i, batch in enumerate(data_loader): + if task_name == "qqp": + _, _, student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2, labels = batch + logits = model(student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2) + else: + _, _, student_input_ids, seq_len, labels = batch + logits = model(student_input_ids, seq_len) + + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + else: + print("acc: %s, " % (res), end="") + model.train() + + +def do_train(agrs): + paddle.set_device(args.device) + train_data_loader, dev_data_loader = create_distill_loader( + args.task_name, + model_name=args.model_name, + vocab_path=args.vocab_path, + batch_size=args.batch_size, + max_seq_length=args.max_seq_length, + n_iter=args.n_iter, + whole_word_mask=args.whole_word_mask, + seed=args.seed, + ) + + model = BiLSTM( + args.emb_dim, + args.hidden_size, + args.vocab_size, + args.output_dim, + args.vocab_path, + args.padding_idx, + args.num_layers, + args.dropout_prob, + args.init_scale, + args.embedding_name, + ) + + if args.optimizer == "adadelta": + optimizer = paddle.optimizer.Adadelta(learning_rate=args.lr, rho=0.95, parameters=model.parameters()) + else: + optimizer = paddle.optimizer.Adam(learning_rate=args.lr, parameters=model.parameters()) + + ce_loss = nn.CrossEntropyLoss() + mse_loss = nn.MSELoss() + + metric_class = METRIC_CLASSES[args.task_name] + metric = 
metric_class() + + teacher = TeacherModel(args.teacher_dir) + + print("Start to distill student model.") + + if args.init_from_ckpt: + model.set_state_dict(paddle.load(args.init_from_ckpt + ".pdparams")) + optimizer.set_state_dict(paddle.load(args.init_from_ckpt + ".pdopt")) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.max_epoch): + model.train() + for i, batch in enumerate(train_data_loader): + global_step += 1 + if args.task_name == "qqp": + ( + bert_input_ids, + bert_segment_ids, + student_input_ids_1, + seq_len_1, + student_input_ids_2, + seq_len_2, + labels, + ) = batch + else: + bert_input_ids, bert_segment_ids, student_input_ids, seq_len, labels = batch + + # Calculate teacher model's forward. + with paddle.no_grad(): + teacher_logits = teacher.model(bert_input_ids, bert_segment_ids) + + # Calculate student model's forward. + if args.task_name == "qqp": + logits = model(student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2) + else: + logits = model(student_input_ids, seq_len) + + loss = args.alpha * ce_loss(logits, labels) + (1 - args.alpha) * mse_loss(logits, teacher_logits) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + if global_step % args.log_freq == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.4f step/s" + % (global_step, epoch, i, loss, args.log_freq / (time.time() - tic_train)) + ) + tic_eval = time.time() + evaluate(args.task_name, model, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + tic_train = time.time() + + if global_step % args.save_steps == 0: + paddle.save( + model.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdparams") + ) + paddle.save( + optimizer.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdopt") + ) + + +if __name__ == "__main__": + args = parse_args() + print(args) + paddle.seed(args.seed) + do_train(args) diff --git a/examples/model_compression/distill_lstm/data.py b/examples/model_compression/distill_lstm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..dec2b358260bc02e9047cee95b294873f8237256 --- /dev/null +++ b/examples/model_compression/distill_lstm/data.py @@ -0,0 +1,322 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
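+#
+# This module builds the data pipelines for the distillation example:
+#   * load_vocab / ngram_sampling: vocabulary loading and n-gram sampling helpers;
+#   * apply_data_augmentation / apply_data_augmentation_for_cn: token masking plus
+#     n-gram sampling, generating extra unlabeled samples for distillation;
+#   * create_data_loader_for_small_model / create_distill_loader /
+#     create_pair_loader_for_small_model: DataLoaders feeding the Bi-LSTM student
+#     and, for distillation, the BERT teacher as well.
+#
+# Illustrative call (hypothetical values, mirroring the README's chnsenticorp setup):
+#   train_loader, dev_loader = create_distill_loader(
+#       "chnsenticorp", model_name="bert-wwm-ext-chinese",
+#       vocab_path="senta_word_dict.txt", batch_size=64)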
+ +from functools import partial + +import jieba +import numpy as np +import paddle +from utils import ( + convert_example_for_distill, + convert_example_for_lstm, + convert_pair_example, +) + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import BertTokenizer + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = {} + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n").split("\t")[0] + vocab[token] = index + return vocab + + +def ngram_sampling(words, words_2=None, p_ng=0.25, ngram_range=(2, 6)): + if np.random.rand() < p_ng: + ngram_len = np.random.randint(ngram_range[0], ngram_range[1] + 1) + ngram_len = min(ngram_len, len(words)) + start = np.random.randint(0, len(words) - ngram_len + 1) + words = words[start : start + ngram_len] + if words_2: + words_2 = words_2[start : start + ngram_len] + return words if not words_2 else (words, words_2) + + +def flatten(list_of_list): + final_list = [] + for each_list in list_of_list: + final_list += each_list + return final_list + + +def apply_data_augmentation( + data, task_name, tokenizer, n_iter=20, p_mask=0.1, p_ng=0.25, ngram_range=(2, 6), whole_word_mask=False, seed=0 +): + """ + Data Augmentation contains Masking and n-gram sampling. Tokenization and + Masking are performed at the same time, so that the masked token can be + directly replaced by `mask_token`, after what sampling is performed. + """ + + def _data_augmentation(data, tokenized_list, whole_word_mask=whole_word_mask): + # 1. Masking + words = [] + if not whole_word_mask: + words = [tokenizer.mask_token if np.random.rand() < p_mask else word for word in tokenized_list] + else: + for word in data.split(): + words += [[tokenizer.mask_token]] if np.random.rand() < p_mask else [tokenizer.tokenize(word)] + # 2. N-gram sampling + words = ngram_sampling(words, p_ng=p_ng, ngram_range=ngram_range) + words = flatten(words) if isinstance(words[0], list) else words + return words + + np.random.seed(seed) + new_data = [] + for example in data: + if task_name == "qqp": + data_list = tokenizer.tokenize(example["sentence1"]) + data_list_2 = tokenizer.tokenize(example["sentence2"]) + new_data.append({"sentence1": data_list, "sentence2": data_list_2, "labels": example["labels"]}) + else: + data_list = tokenizer.tokenize(example["sentence"]) + new_data.append({"sentence": data_list, "labels": example["labels"]}) + + for example in data: + for _ in range(n_iter): + if task_name == "qqp": + words = _data_augmentation(example["sentence1"], data_list) + words_2 = _data_augmentation(example["sentence2"], data_list_2) + new_data.append({"sentence1": words, "sentence2": words_2, "labels": example["labels"]}) + else: + words = _data_augmentation(example["sentence"], data_list) + new_data.append({"sentence": words, "labels": example["labels"]}) + return new_data + + +def apply_data_augmentation_for_cn( + data, tokenizer, vocab, n_iter=20, p_mask=0.1, p_ng=0.25, ngram_range=(2, 10), seed=0 +): + """ + Because BERT and jieba have different `tokenize` function, it returns + jieba_tokenizer(example['text'], bert_tokenizer(example['text']) and + example['label]) for each example in data. 
+ jieba tokenization and Masking are performed at the same time, so that the + masked token can be directly replaced by `mask_token`, and other tokens + could be tokenized by BERT's tokenizer, from which tokenized example for + student model and teacher model would get at the same time. + """ + np.random.seed(seed) + new_data = [] + + for example in data: + if not example["text"]: + continue + text_tokenized = list(jieba.cut(example["text"])) + lstm_tokens = text_tokenized + bert_tokens = tokenizer.tokenize(example["text"]) + new_data.append({"lstm_tokens": lstm_tokens, "bert_tokens": bert_tokens, "label": example["label"]}) + for _ in range(n_iter): + # 1. Masking + lstm_tokens, bert_tokens = [], [] + for word in text_tokenized: + if np.random.rand() < p_mask: + lstm_tokens.append([vocab.unk_token]) + bert_tokens.append([tokenizer.unk_token]) + else: + lstm_tokens.append([word]) + bert_tokens.append(tokenizer.tokenize(word)) + # 2. N-gram sampling + lstm_tokens, bert_tokens = ngram_sampling(lstm_tokens, bert_tokens, p_ng, ngram_range) + lstm_tokens, bert_tokens = flatten(lstm_tokens), flatten(bert_tokens) + if lstm_tokens and bert_tokens: + new_data.append({"lstm_tokens": lstm_tokens, "bert_tokens": bert_tokens, "label": example["label"]}) + return new_data + + +def create_data_loader_for_small_model( + task_name, vocab_path, model_name=None, batch_size=64, max_seq_length=128, shuffle=True +): + """Data loader for bi-lstm, not bert.""" + if task_name == "chnsenticorp": + train_ds, dev_ds = load_dataset(task_name, splits=["train", "dev"]) + else: + train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) + if task_name == "chnsenticorp": + vocab = Vocab.load_vocabulary( + vocab_path, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token=None, + eos_token=None, + ) + pad_val = vocab["[PAD]"] + + else: + vocab = BertTokenizer.from_pretrained(model_name) + pad_val = vocab.pad_token_id + + trans_fn = partial( + convert_example_for_lstm, task_name=task_name, vocab=vocab, max_seq_length=max_seq_length, is_test=False + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_val), Stack(dtype="int64"), Stack(dtype="int64") # input_ids # seq len # label + ): fn(samples) + + train_ds = train_ds.map(trans_fn, lazy=True) + dev_ds = dev_ds.map(trans_fn, lazy=True) + + train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) + + return train_data_loader, dev_data_loader + + +def create_distill_loader( + task_name, + model_name, + vocab_path, + batch_size=64, + max_seq_length=128, + shuffle=True, + n_iter=20, + whole_word_mask=False, + seed=0, +): + """ + Returns batch data for bert and small model. + Bert and small model have different input representations. 
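+    For single-sentence tasks each batch is (bert_input_ids, bert_segment_ids,
+    small_input_ids, seq_len, label); for QQP the small-model input_ids and
+    seq_len appear twice, once per sentence.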
+ """ + tokenizer = BertTokenizer.from_pretrained(model_name) + if task_name == "chnsenticorp": + train_ds, dev_ds = load_dataset(task_name, splits=["train", "dev"]) + vocab = Vocab.load_vocabulary( + vocab_path, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token=None, + eos_token=None, + ) + pad_val = vocab["[PAD]"] + data_aug_fn = partial( + apply_data_augmentation_for_cn, tokenizer=tokenizer, vocab=vocab, n_iter=n_iter, seed=seed + ) + else: + train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) + vocab = tokenizer + pad_val = tokenizer.pad_token_id + data_aug_fn = partial( + apply_data_augmentation, + task_name=task_name, + tokenizer=tokenizer, + n_iter=n_iter, + whole_word_mask=whole_word_mask, + seed=seed, + ) + train_ds = train_ds.map(data_aug_fn, batched=True) + print("Data augmentation has been applied.") + + trans_fn = partial( + convert_example_for_distill, + task_name=task_name, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=max_seq_length, + vocab=vocab, + ) + + trans_fn_dev = partial( + convert_example_for_distill, + task_name=task_name, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=max_seq_length, + vocab=vocab, + is_tokenized=False, + ) + + if task_name == "qqp": + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # bert input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # bert segment + Pad(axis=0, pad_val=pad_val), # small input_ids + Stack(dtype="int64"), # small seq len + Pad(axis=0, pad_val=pad_val), # small input_ids + Stack(dtype="int64"), # small seq len + Stack(dtype="int64"), # small label + ): fn(samples) + else: + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # bert input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # bert segment + Pad(axis=0, pad_val=pad_val), # small input_ids + Stack(dtype="int64"), # small seq len + Stack(dtype="int64"), # small label + ): fn(samples) + + train_ds = train_ds.map(trans_fn, lazy=True) + dev_ds = dev_ds.map(trans_fn_dev, lazy=True) + train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) + return train_data_loader, dev_data_loader + + +def create_pair_loader_for_small_model( + task_name, model_name, vocab_path, batch_size=64, max_seq_length=128, shuffle=True, is_test=False +): + """Only support QQP now.""" + tokenizer = BertTokenizer.from_pretrained(model_name) + train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) + vocab = Vocab.load_vocabulary( + vocab_path, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token=None, + eos_token=None, + ) + + trans_func = partial( + convert_pair_example, + task_name=task_name, + vocab=tokenizer, + is_tokenized=False, + max_seq_length=max_seq_length, + is_test=is_test, + ) + train_ds = train_ds.map(trans_func, lazy=True) + dev_ds = dev_ds.map(trans_func, lazy=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab["[PAD]"]), # input + Stack(), # length + Pad(axis=0, pad_val=vocab["[PAD]"]), # input + Stack(), # length + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) + return train_data_loader, dev_data_loader + + +def create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle=True): + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, 
batch_size=batch_size, shuffle=shuffle) + + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False) + + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_data_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + return train_data_loader, dev_data_loader diff --git a/examples/model_compression/distill_lstm/small.py b/examples/model_compression/distill_lstm/small.py new file mode 100644 index 0000000000000000000000000000000000000000..92681bd039912f1b2daba0aff5ddea1103e381c7 --- /dev/null +++ b/examples/model_compression/distill_lstm/small.py @@ -0,0 +1,211 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import paddle +import paddle.nn as nn +import paddle.nn.initializer as I +from args import parse_args +from data import create_data_loader_for_small_model, create_pair_loader_for_small_model +from paddle.metric import Accuracy + +from paddlenlp.embeddings import TokenEmbedding +from paddlenlp.metrics import AccuracyAndF1 + +METRIC_CLASSES = {"sst-2": Accuracy, "qqp": AccuracyAndF1, "chnsenticorp": Accuracy} + + +class BiLSTM(nn.Layer): + def __init__( + self, + embed_dim, + hidden_size, + vocab_size, + output_dim, + vocab_path, + padding_idx=0, + num_layers=1, + dropout_prob=0.0, + init_scale=0.1, + embedding_name=None, + ): + super(BiLSTM, self).__init__() + if embedding_name is not None: + self.embedder = TokenEmbedding( + embedding_name, extended_vocab_path=vocab_path, keep_extended_vocab_only=True + ) + embed_dim = self.embedder.embedding_dim + else: + self.embedder = nn.Embedding(vocab_size, embed_dim, padding_idx) + + self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, "bidirectional", dropout=dropout_prob) + + self.fc = nn.Linear( + hidden_size * 2, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.fc_1 = nn.Linear( + hidden_size * 8, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.output_layer = nn.Linear( + hidden_size, + output_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + def forward(self, x_1, seq_len_1, x_2=None, seq_len_2=None): + x_embed_1 = self.embedder(x_1) + lstm_out_1, (hidden_1, _) = self.lstm(x_embed_1, sequence_length=seq_len_1) + out_1 = paddle.concat((hidden_1[-2, :, :], hidden_1[-1, :, :]), axis=1) + if x_2 is not None: + x_embed_2 = self.embedder(x_2) + lstm_out_2, (hidden_2, _) = self.lstm(x_embed_2, sequence_length=seq_len_2) + out_2 = paddle.concat((hidden_2[-2, :, :], hidden_2[-1, :, :]), axis=1) + out = paddle.concat(x=[out_1, out_2, out_1 + out_2, paddle.abs(out_1 - out_2)], axis=1) + out = paddle.tanh(self.fc_1(out)) + else: + out = 
paddle.tanh(self.fc(out_1)) + logits = self.output_layer(out) + + return logits + + +def evaluate(task_name, model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + if task_name == "qqp": + input_ids_1, seq_len_1, input_ids_2, seq_len_2, labels = batch + logits = model(input_ids_1, seq_len_1, input_ids_2, seq_len_2) + else: + input_ids, seq_len, labels = batch + logits = model(input_ids, seq_len) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + return res[0] if isinstance(metric, AccuracyAndF1) else res + + +def do_train(args): + paddle.set_device(args.device) + metric_class = METRIC_CLASSES[args.task_name] + metric = metric_class() + if args.task_name == "qqp": + train_data_loader, dev_data_loader = create_pair_loader_for_small_model( + task_name=args.task_name, + vocab_path=args.vocab_path, + model_name=args.model_name, + batch_size=args.batch_size, + ) + else: + train_data_loader, dev_data_loader = create_data_loader_for_small_model( + task_name=args.task_name, + vocab_path=args.vocab_path, + model_name=args.model_name if args.task_name == "sst-2" else None, + batch_size=args.batch_size, + ) + + model = BiLSTM( + args.emb_dim, + args.hidden_size, + args.vocab_size, + args.output_dim, + args.vocab_path, + args.padding_idx, + args.num_layers, + args.dropout_prob, + args.init_scale, + args.embedding_name, + ) + + loss_fct = nn.CrossEntropyLoss() + + if args.optimizer == "adadelta": + optimizer = paddle.optimizer.Adadelta(learning_rate=args.lr, rho=0.95, parameters=model.parameters()) + else: + optimizer = paddle.optimizer.Adam(learning_rate=args.lr, parameters=model.parameters()) + + if args.init_from_ckpt: + model.set_state_dict(paddle.load(args.init_from_ckpt + ".pdparams")) + optimizer.set_state_dict(paddle.load(args.init_from_ckpt + ".pdopt")) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.max_epoch): + for i, batch in enumerate(train_data_loader): + global_step += 1 + if args.task_name == "qqp": + input_ids_1, seq_len_1, input_ids_2, seq_len_2, labels = batch + logits = model(input_ids_1, seq_len_1, input_ids_2, seq_len_2) + else: + input_ids, seq_len, labels = batch + logits = model(input_ids, seq_len) + + loss = loss_fct(logits, labels) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + if global_step % args.log_freq == 0: + with paddle.no_grad(): + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.4f step/s" + % (global_step, epoch, i, loss, args.log_freq / (time.time() - tic_train)) + ) + tic_eval = time.time() + + evaluate(args.task_name, model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + tic_train = time.time() + + if global_step % args.save_steps == 0: + paddle.save( + model.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdparams") + ) + paddle.save( + optimizer.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdopt") + ) + + +if __name__ == "__main__": + args = parse_args() + print(args) + paddle.seed(args.seed) + 
do_train(args) diff --git a/examples/model_compression/distill_lstm/utils.py b/examples/model_compression/distill_lstm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0243d97a5a64d2dea77711a5702e62f930f4358c --- /dev/null +++ b/examples/model_compression/distill_lstm/utils.py @@ -0,0 +1,117 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import jieba + +import numpy as np + + +def convert_example_for_lstm(example, task_name, vocab, is_tokenized=False, max_seq_length=128, is_test=False): + """convert a example for lstm's input""" + input_ids = [] + if task_name == "chnsenticorp": + if is_tokenized: + lstm_tokens = example["lstm_tokens"][:max_seq_length] + input_ids = [vocab[token] for token in lstm_tokens] + else: + tokenized_text = list(jieba.cut(example["text"]))[:max_seq_length] + input_ids = vocab[tokenized_text] + else: + if is_tokenized: + tokens = example["sentence"][:max_seq_length] + else: + tokens = vocab.tokenize(example["sentence"])[:max_seq_length] + input_ids = vocab.convert_tokens_to_ids(tokens) + + valid_length = np.array(len(input_ids), dtype="int64") + if not is_test: + label = ( + np.array(example["label"], dtype="int64") + if task_name == "chnsenticorp" + else np.array(example["labels"], dtype="int64") + ) + return input_ids, valid_length, label + return input_ids, valid_length + + +def convert_pair_example(example, task_name, vocab, is_tokenized=True, max_seq_length=128, is_test=False): + seq1 = convert_example_for_lstm( + {"sentence": example["sentence1"], "labels": example["labels"]}, + task_name, + vocab, + is_tokenized, + max_seq_length, + is_test, + )[:2] + + seq2 = convert_example_for_lstm( + {"sentence": example["sentence2"], "labels": example["labels"]}, + task_name, + vocab, + is_tokenized, + max_seq_length, + is_test, + ) + pair_features = seq1 + seq2 + + return pair_features + + +def convert_example_for_distill( + example, task_name, tokenizer, label_list, max_seq_length, vocab, is_tokenized=True, is_test=False +): + bert_features = convert_example_for_bert( + example, + tokenizer=tokenizer, + label_list=label_list, + is_tokenized=is_tokenized, + max_seq_length=max_seq_length, + is_test=is_test, + ) + if task_name == "qqp": + small_features = convert_pair_example(example, task_name, vocab, is_tokenized, max_seq_length, is_test) + else: + small_features = convert_example_for_lstm(example, task_name, vocab, is_tokenized, max_seq_length, is_test) + return bert_features[:2] + small_features + + +def convert_example_for_bert(example, tokenizer, label_list, is_tokenized=False, max_seq_length=512, is_test=False): + """convert a example for bert's input""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] if "labels" in example else example["label"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if "sentence1" in example: + 
example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_seq_len=max_seq_length, + is_split_into_words=is_tokenized, + ) + else: + if "sentence" in example: + text = example["sentence"] + elif "text" in example: + text = example["text"] + else: + text = example["bert_tokens"] + example = tokenizer(text, max_seq_len=max_seq_length, is_split_into_words=is_tokenized) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] diff --git a/examples/model_compression/minilmv2/README.md b/examples/model_compression/minilmv2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..62bfc39fa1d48b8fb862bb5fffc691a602744d6f --- /dev/null +++ b/examples/model_compression/minilmv2/README.md @@ -0,0 +1,124 @@ +# MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers + +以下是本例的简要目录结构及说明: +``` +. +├── general_distill.py # 通用蒸馏脚本 +├── run_clue.py # 在下游任务上的微调脚本 +└── README.md # 文档,本文件 +``` +## 简介 +本目录下的实验主要参考论文[《MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers》](https://arxiv.org/abs/2012.15828)实现。 + +MiniLMv2也是从层数深的Transformer类模型到层数较浅的Transformer类模型的蒸馏策略。它的优势是只需要取教师模型和学生模型中的各一层进行蒸馏训练,而不像其他方法需要蒸馏更多的层,避免面对更加复杂的layer mapping问题,并且效果优于TinyBert的蒸馏策略。 + +MiniLMv2蒸馏的目标是教师模型某层的q与q, k与k, v与v的矩阵乘结果和学生模型最后一层的q与q, k与k, v与v的矩阵乘之间的kl散度loss。其中教师模型是large size时,选择实验并选取倒数某一层,当教师模型是base size时,选择最后一层进行蒸馏即可。 + +为了防止教师模型是large size时,head size与学生模型不同,蒸馏目标的shape无法匹配,MiniLMv2还需要对head进行重组,先合并再按relation_head_num重新分割head_num和head_size。 + +## 数据、预训练模型介绍及获取 + +### 数据获取 +由于本实验是通用场景下的蒸馏,因此数据和预训练类似。可以参考[NLP Chinese Corpus](https://github.com/brightmart/nlp_chinese_corpus)中提供的数据。 +数据下载完成后,需要将所有数据集整理成每行一条文本数据,再将数据切分成多个小文件,并放在一个目录下,以便使用多卡并行训练。 + +### 训练启动方式 + +假设我们把切分好的预训练数据文件都放在`${dataset}`下,那么我们可以运行如下命令用单机八卡进行预训练蒸馏: +```shell + +dataset=/PaddleNLP/dataset +output_dir=./pretrain + +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" general_distill.py \ + --student_model_type tinybert \ + --num_relation_heads 48 \ + --student_model_name_or_path tinybert-6l-768d-zh \ + --init_from_student False \ + --teacher_model_type bert \ + --teacher_model_name_or_path bert-base-chinese \ + --max_seq_length 128 \ + --batch_size 256 \ + --learning_rate 6e-4 \ + --logging_steps 20 \ + --max_steps 100000 \ + --warmup_steps 4000 \ + --save_steps 5000 \ + --teacher_layer_index 11 \ + --student_layer_index 5 \ + --weight_decay 1e-2 \ + --output_dir ${output_dir} \ + --device gpu \ + --input_dir ${dataset} \ + +``` + +其中参数释义如下: + +- `student_model_type` 学生模型的类型 +- `num_relation_heads` head重新组合之后的head数 +- `student_model_name_or_path` 学生模型的名字(需要与学生模型类型对应),或者是学生模型的路径 +- `init_from_student` 本次蒸馏的学生模型是否用`student_model_name_or_path`中的参数进行初始化,是个bool类型的参数。默认是False +- `teacher_model_type bert` 教师模型的类型 +- `teacher_model_name_or_path` 教师模型的名字 +- `max_seq_length 128` 表示最大句子长度,超过该长度将被截断。 +- `warmup_steps` 学习率warmup up的步数 +- `save_steps` 保存模型的频率 +- `teacher_layer_index`表示学生模型从教师模型学习的教师层 +- `student_layer_index` 表示学生模型从教师模型学习的学生层 +- `output_dir` 模型输出的目录 +- `device gpu` 表示运行该程序的设备,默认是gpu +- `input_dir` 预训练数据的存放地址 + + + +### 评价方法 + +假设预训练完成后的模型存储在`${pretrained_models}`下,这里也提供了我们已经预训练完成的一版[模型](https://bj.bcebos.com/paddlenlp/models/general_distill/minilmv2_6l_768d_ch.tar.gz)可供参考,模型与`tinybert-6l-768d-zh`结构相同,因此可以使用`TinyBertForSequenceClassification.from_pretrained()`对模型直接进行加载。 +本示例训练出的通用模型需要在下游任务上Fine-tuning,利用下游任务上的指标进行评价。 
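+
+下面给出一个加载该通用模型的最小示意(仅为示例:`./minilmv2_6l_768d_ch` 为假设的本地解压目录,`num_classes` 需按具体下游任务设置):
+
+```python
+from paddlenlp.transformers import TinyBertForSequenceClassification, TinyBertTokenizer
+
+# 假设模型压缩包已下载并解压到该目录(示例路径,请替换为实际路径)
+model_dir = "./minilmv2_6l_768d_ch"
+# 结构与 tinybert-6l-768d-zh 相同,因此可以直接用 TinyBert 系列接口加载
+model = TinyBertForSequenceClassification.from_pretrained(model_dir, num_classes=2)
+# 若该目录中不含 tokenizer 文件,可改用 TinyBertTokenizer.from_pretrained("tinybert-6l-768d-zh")
+tokenizer = TinyBertTokenizer.from_pretrained(model_dir)
+```
+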
+我们可以运行如下脚本在单卡上进行Fine-tuning: + +```shell + +export CUDA_VISIBLE_DEVICES="0" + +python -u ./run_clue.py \ + --model_type tinybert \ + --model_name_or_path ${pretrained_models} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${max_seq_len} \ + --batch_size 16 \ + --learning_rate ${learning_rate} \ + --num_train_epochs ${num_train_epochs} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --device gpu \ + +``` + + +其中不同的任务下,`${learning_rate}`、`${num_train_epochs}`、`${max_seq_len}`,我们推荐不同的Fine-tuning的超参数,可以参考以下配置: + +| TASK_NAME | AFQMC | TNEWS | IFLYTEK | OCNLI | CMNLI | CLUEWSC2020 | CSL | +| ---------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ---- | +| learning_rate | 2e-5 | 2e-5 | 2e-5 | 3e-5 | 3e-5 | 1e-5 | 1e-5 | +| num_train_epochs | 3 | 3 | 6 | 6 | 3 | 50 | 8 | +| max_seq_len | 128 | 128 | 128 | 128 | 128 | 128 | 256 | + + +### 蒸馏实验结果 + +本示例选择的是CLUE中的分类任务,以`bert-base-chinese`作教师模型,利用MiniLMv2策略对6层模型进行蒸馏,可以得到的通用模型在CLUE上的指标为: + +| CLUE | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | +| ------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | +| Acc (%) | 71.38 | 56.46 | 58.87 | 79.01 | 73.02 | 68.42 | 77.73 | + + +## 参考文献 + +Wang W, Bao H, Huang S, Dong L, Wei F. [MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers](https://arxiv.org/abs/2012.15828)[J]. arXiv preprint arXiv:2012.15828v2, 2021. diff --git a/examples/model_compression/minilmv2/general_distill.py b/examples/model_compression/minilmv2/general_distill.py new file mode 100644 index 0000000000000000000000000000000000000000..bc5636ac926fecf09fb0159c5daa8bb1c8379671 --- /dev/null +++ b/examples/model_compression/minilmv2/general_distill.py @@ -0,0 +1,407 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
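+
+# Rough sketch of the MiniLMv2 relation distillation performed below. The actual
+# computation lives in paddlenlp.transformers.distill_utils.calc_minilm_loss; this
+# comment is only illustrative, not the implementation itself:
+#   1. Take the per-head Q, K and V tensors from one chosen teacher layer
+#      (--teacher_layer_index) and one student layer (--student_layer_index,
+#      by default the student's last layer).
+#   2. Merge and re-split the head axis into `num_relation_heads` heads so that a
+#      teacher with a different head size still matches the student's shapes.
+#   3. Form self-attention relations such as softmax(Q @ Q^T / sqrt(head_size)),
+#      and likewise for K-K and V-V.
+#   4. Sum the KL divergences between teacher and student relations and normalize
+#      by num_relation_heads * seq_len * batch_size (see the training loop below).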
+ +import argparse +import os +import random +import time +from concurrent.futures import ThreadPoolExecutor + +import numpy as np +import paddle +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Tuple +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + LinearDecayWithWarmup, + TinyBertForPretraining, + TinyBertModel, + TinyBertTokenizer, +) +from paddlenlp.transformers.distill_utils import calc_minilm_loss, to_distill +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = { + "tinybert": (TinyBertForPretraining, TinyBertTokenizer), + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--student_model_type", + default="tinybert", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--teacher_model_type", + default="bert", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--student_model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--init_from_student", + type=strtobool, + default=False, + help="Whether to use the parameters of student model to initialize.", + ) + parser.add_argument( + "--teacher_model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=6e-4, type=float, help="The initial learning rate for AdamW.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=512, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--num_relation_heads", + default=64, + type=int, + help="The number of relation heads is 48 and 64 for base and large-size teacher model.", + ) + parser.add_argument( + "--teacher_layer_index", + default=11, + type=int, + help="The transformer layer index of teacher model to distill.", + ) + parser.add_argument( + "--student_layer_index", + default=5, + type=int, + help="The transformer layer index of student model to distill.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=-1, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.01, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for AdamW optimizer.") + parser.add_argument( + "--max_steps", + default=400000, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def create_pretraining_dataset(input_file, args, worker_init, tokenizer): + train_data = PretrainingDataset(input_file=input_file, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + # files have been sharded, no need to dispatch again + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + # DataLoader cannot be pickled because of its place. + # If it can be pickled, use global function instead of lambda and use + # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch. 
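+    # Each sample is one tokenized line of text; Pad() pads every batch to its
+    # longest sequence using the tokenizer's pad token id.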
+ batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_data, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + worker_init_fn=worker_init, + return_list=True, + ) + return train_data_loader, input_file + + +class PretrainingDataset(paddle.io.Dataset): + def __init__(self, input_file, tokenizer, max_seq_length): + self.input_file = input_file + f = open(input_file, "r") + input_ids = [] + for i, line in enumerate(f): + line = line[:max_seq_length] + tokenized_example = tokenizer(line, max_seq_len=max_seq_length) + input_ids.append(tokenized_example["input_ids"]) + self.inputs = np.asarray(input_ids) + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs) + + def __getitem__(self, index): + input_ids = [np.asarray(self.inputs[index])] + return input_ids + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + args.student_model_type = args.student_model_type.lower() + + # For student + model_class, tokenizer_class = MODEL_CLASSES[args.student_model_type] + tokenizer = tokenizer_class.from_pretrained(args.student_model_name_or_path) + if args.init_from_student: + student = model_class.from_pretrained(args.student_model_name_or_path) + else: + tinybert = TinyBertModel(vocab_size=21128, num_hidden_layers=6) + student = model_class(tinybert) + + # For teacher + teacher_model_class, _ = MODEL_CLASSES[args.teacher_model_type] + teacher = teacher_model_class.from_pretrained(args.teacher_model_name_or_path) + pad_token_id = student.pretrained_init_configuration[args.student_model_name_or_path]["pad_token_id"] + if paddle.distributed.get_world_size() > 1: + student = paddle.DataParallel(student, find_unused_parameters=True) + teacher = paddle.DataParallel(teacher, find_unused_parameters=True) + + num_training_steps = args.max_steps + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.max_grad_norm) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
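+    # Only the student's parameters are optimized here; the teacher is run under
+    # paddle.no_grad() in the training loop and stays frozen.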
+ decay_params = [p.name for n, p in student.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.98, + epsilon=args.adam_epsilon, + parameters=student.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=clip, + ) + + pool = ThreadPoolExecutor(1) + + teacher = to_distill(teacher, return_qkv=True, layer_index=args.teacher_layer_index) + student = to_distill(student, return_qkv=True, layer_index=args.student_layer_index) + + global_step = 0 + for epoch in range(args.num_train_epochs): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) + ] + files.sort() + num_files = len(files) + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + + data_file = files[ + ( + f_start_id * paddle.distributed.get_world_size() + + paddle.distributed.get_rank() + + remainder * f_start_id + ) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + train_data_loader, _ = create_pretraining_dataset(data_file, args, worker_init, tokenizer) + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + dataset_future = pool.submit(create_pretraining_dataset, data_file, args, worker_init, tokenizer) + + kl_loss_func = paddle.nn.KLDivLoss("sum") + train_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids = batch[0] + attention_mask = paddle.unsqueeze( + (input_ids == pad_token_id).astype(paddle.get_default_dtype()) * -1e9, axis=[1, 2] + ) + student(input_ids) + with paddle.no_grad(): + teacher(input_ids) + # Q-Q relation + q_t, q_s = teacher.outputs.q, student.outputs.q + batch_size = q_t.shape[0] + pad_seq_len = q_t.shape[2] + loss_qr = calc_minilm_loss(kl_loss_func, q_s, q_t, attention_mask, args.num_relation_heads) + del q_t, q_s + # K-K relation + k_t, k_s = teacher.outputs.k, student.outputs.k + loss_kr = calc_minilm_loss(kl_loss_func, k_s, k_t, attention_mask, args.num_relation_heads) + del k_t, k_s + # V-V relation + v_t, v_s = teacher.outputs.v, student.outputs.v + loss_vr = calc_minilm_loss(kl_loss_func, v_s, v_t, attention_mask, args.num_relation_heads) + + del v_t, v_s + + loss = loss_qr + loss_kr + loss_vr + loss /= args.num_relation_heads * pad_seq_len * batch_size + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_samples += args.batch_size + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + if global_step % args.logging_steps == 0: + logger.info( + "global step: %d, epoch: %d, batch: %d, loss: %f, " + "lr: %f, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % 
( + global_step, + epoch, + step, + loss, + optimizer.get_lr(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + total_samples / (args.logging_steps * train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = student._layers if isinstance(student, paddle.DataParallel) else student + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + + del train_data_loader + train_data_loader, data_file = dataset_future.result(timeout=None) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/minilmv2/run_clue.py b/examples/model_compression/minilmv2/run_clue.py new file mode 100644 index 0000000000000000000000000000000000000000..8ab58327a6b818fc71007959e0562baa9ae5274d --- /dev/null +++ b/examples/model_compression/minilmv2/run_clue.py @@ -0,0 +1,327 @@ +# Copyright (c) 2021s PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
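+
+# Fine-tunes a (distilled) TinyBERT / BERT sequence classification model on the
+# CLUE classification tasks and reports dev-set accuracy; this is how the general
+# model produced by general_distill.py is evaluated on downstream tasks.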
+ +import argparse +import logging +import math +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + LinearDecayWithWarmup, + TinyBertForSequenceClassification, + TinyBertTokenizer, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "tinybert": (TinyBertForSequenceClassification, TinyBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." 
+ ) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + return res + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["label"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + elif "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = tokenizer(sentence1, text_pair=example["abst"], max_seq_len=max_seq_length) + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + # print(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example = tokenizer(text, max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("clue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + 
train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + best_acc = 0.0 + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + acc = evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if acc > best_acc: + best_acc = acc + if global_step >= num_training_steps: + print("best_acc: ", best_acc) + return + print("best_acc: ", best_acc) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + 
print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/ofa/README.md b/examples/model_compression/ofa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..79eccdf150869ab8956304e475586fbf87db6a2b --- /dev/null +++ b/examples/model_compression/ofa/README.md @@ -0,0 +1,329 @@ +# BERT Compression Based on PaddleSlim + +BERT-base模型是一个迁移能力很强的通用语义表示模型,但是模型中也有一些参数冗余。本教程将介绍如何使用PaddleSlim对BERT-base模型进行压缩。 + +## 压缩原理 + +1. 对Fine-tuning得到模型通过计算参数及其梯度的乘积得到参数的重要性,把模型参数根据重要性进行重排序。 +2. 超网络中最大的子网络选择和Bert-base模型网络结构一致的网络结构,其他小的子网络是对最大网络的进行不同的宽度选择来得到的,宽度选择具体指的是网络中的参数进行裁剪,所有子网络在整个训练过程中都是参数共享的。 +2. 用重排序之后的模型参数作为超网络模型的初始化参数。 +3. Fine-tuning之后的模型作为教师网络,超网络作为学生网络,进行知识蒸馏。 + +<p align="center"> +<img src="./imgs/ofa_bert.jpg" width="950"/><br /> +整体流程图 +</p> + + +## 压缩结果 + +利用`bert-base-uncased`模型首先在GLUE数据集上进行finetune,得到需要压缩的模型,之后基于此模型进行压缩。压缩后模型参数大小减小26%(从110M减少到81M),压缩后模型在GLUE dev数据集上的精度和压缩前模型在GLUE dev数据集上的精度对比如下表所示: + +| Task | Metric | Result | Result with PaddleSlim | +|:-----:|:----------------------------:|:-----------------:|:----------------------:| +| SST-2 | Accuracy | 0.93005 | 0.931193 | +| QNLI | Accuracy | 0.91781 | 0.920740 | +| CoLA | Mattehew's corr | 0.59557 | 0.601244 | +| MRPC | F1/Accuracy | 0.91667/0.88235 | 0.91740/0.88480 | +| STS-B | Person/Spearman corr | 0.88847/0.88350 | 0.89271/0.88958 | +| QQP | Accuracy/F1 | 0.90581/0.87347 | 0.90994/0.87947 | +| MNLI | Matched acc/MisMatched acc | 0.84422/0.84825 | 0.84687/0.85242 | +| RTE | Accuracy | 0.711191 | 0.718412 | + +<p align="center"> +<strong>表1-1: GLUE数据集精度对比</strong> +</p> + +压缩前后模型的耗时如下表所示: + +<table style="width:100%;" cellpadding="2" cellspacing="0" border="1" bordercolor="#000000"> + <tbody> + <tr> + <td style="text-align:center"> + <span style="font-size:18px;">Device</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px;">Batch Size</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px;">Model</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px;">TRT(FP16)</span> + </td> + <td style="text-align:center;"> + <span style="font-size:18px;">Latency(ms)</span> + </td> + </tr> + <tr> + <td rowspan=8 align=center> T4 </td> + <td rowspan=4 align=center> 16 </td> + <td rowspan=2 align=center> BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">110.71</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">22.0</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center>Compressed BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">69.62</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">14.93</span> + </td> + </tr> + <tr> + <td rowspan=4 align=center> 40 </td> + <td rowspan=2 align=center> BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">252.78</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td 
style="text-align:center"> + <span style="font-size:18px">53.67</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center>Compressed BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">168.71</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">37.22</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center> V100 </td> + <td rowspan=2 align=center> 16 </td> + <td style="text-align:center"> + <span style="font-size:18px;" align=center>BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">33.28</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px;">Compressed BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">21.83</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center> Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz </td> + <td rowspan=2 align=center> 16 </td> + <td style="text-align:center"> + <span style="font-size:18px;" align=center>BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">10831.73</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px;">Compressed BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">7682.93</span> + </td> + </tr> + </tbody> +</table> +<br /> +<p align="center"> +<strong>表1-2: 模型速度对比</strong> +</p> + +压缩后模型在T4机器上相比原始模型在FP32的情况下加速59%,在TensorRT FP16的情况下加速47.3%。 +压缩后模型在V100机器上相比原始模型在FP32的情况下加速52.5%。 +压缩后模型在Intel(R) Xeon(R) Gold 5117 CPU上相比原始模型在FP32的情况下加速41%。 + +## 快速开始 +本教程示例以GLUE/SST-2 数据集为例。 + +### 环境依赖 + +模型压缩功能依赖最新版本的PaddleSlim +```shell +git clone https://github.com/PaddlePaddle/PaddleSlim +python setup.py build && python setup.py install +``` + +### Fine-tuing +首先需要对Pretrain-Model在实际的下游任务上进行Finetuning,得到需要压缩的模型。Fine-tuning流程参考[Fine-tuning教程](../../benchmark/glue) + +```shell +cd ../../benchmark/glue/ +``` + +```python +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu \ +``` +参数详细含义参考[README.md](../../benchmark/glue/README.md) +Fine-tuning 在dev上的结果如压缩结果表1-1中Result那一列所示。 + + +### 压缩训练 + +单卡训练 +```shell +python -u ./run_glue_ofa.py --model_type bert \ + --model_name_or_path ${task_pretrained_model_dir} \ + --task_name $TASK_NAME --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 6 \ + --logging_steps 10 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME \ + --device gpu \ + --width_mult_list 1.0 0.8333333333333334 0.6666666666666666 0.5 +``` + +多卡训练 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1" run_glue_ofa.py \ + --model_type bert \ + --model_name_or_path ${task_pretrained_model_dir} \ + --task_name $TASK_NAME --max_seq_length 
128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 6 \ + --logging_steps 10 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME \ + --device gpu \ + --width_mult_list 1.0 0.8333333333333334 0.6666666666666666 0.5 +``` + + +其中参数释义如下: +- `model_type` 指示了模型类型,当前仅支持BERT模型。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `task_name` 表示 Fine-tuning 的任务。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `width_mult_list` 表示压缩训练过程中,对每层Transformer Block的宽度选择的范围。 + +压缩训练之后在dev上的结果如压缩结果表格中Result with PaddleSlim那一列所示,延时情况如表1-2所示。 + + +### 导出子模型 +根据传入的config导出相应的子模型并转为静态图模型。 + +启动命令: + +```shell +python -u ./export_model.py --model_type bert \ + --model_name_or_path ${PATH_OF_MODEL_AFTER_OFA} \ + --max_seq_length 128 \ + --sub_model_output_dir ./tmp/$TASK_NAME/dynamic_model \ + --static_sub_model ./tmp/$TASK_NAME/static_model \ + --device gpu \ + --width_mult 0.6666666666666666 +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,当前仅支持BERT模型。 +- `model_name_or_path` 指示了某种特定配置的经过OFA训练后保存的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。默认:128. +- `sub_model_output_dir` 指示了导出子模型动态图参数的目录。 +- `static_sub_model` 指示了导出子模型静态图模型及参数的目录,设置为None,则表示不导出静态图模型。默认:None。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `width_mult` 表示导出子模型的宽度。默认:1.0. + + +### OFA接口介绍 + +OFA API介绍参考[API](https://github.com/PaddlePaddle/PaddleSlim/blob/release/2.0.0/docs/zh_cn/api_cn/dygraph/ofa/ofa_api.rst) + +## 另附:基于本代码对TinyBERT(L=4, D=312)进行压缩 +下游任务模型是从TinyBERT官方repo转换得到。 + +### 压缩结果 + +| Task | Metric | TinyBERT(L=4, D=312) | Result with OFA | +|:-----:|:----------------------------:|:--------------------:|:----------------------:| +| SST-2 | Accuracy | [0.9234]() | [0.9220]() | +| QNLI | Accuracy | [0.8746]() | [0.8720]() | +| CoLA | Mattehew's corr | [0.4961]() | [0.5048]() | +| MRPC | F1/Accuracy | [0.8998/0.8554]() | [0.9003/0.8578]() | +| STS-B | Person/Spearman corr | [0.8635/0.8631]() | [0.8717/0.8706]() | +| QQP | Accuracy/F1 | [0.9047/0.8751]() | [0.9034/0.8733]() | +| MNLI | Matched acc/MisMatched acc | [0.8256/0.8294]() | [0.8211/0.8261]() | +| RTE | Accuracy | [0.6534]() | [0.6787]() | + + +## 参考论文 + +1. Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu. DynaBERT: Dynamic BERT with Adaptive Width and Depth. +2. H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. Once for all: Train one network and specialize it for efficient deployment. diff --git a/examples/model_compression/ofa/export_model.py b/examples/model_compression/ofa/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..9f0254faea089dd4a45abafa5e7a5fd55d040b46 --- /dev/null +++ b/examples/model_compression/ofa/export_model.py @@ -0,0 +1,205 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os + +import paddle +from paddleslim.nas.ofa import OFA, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertModel, + BertTokenizer, +) + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def bert_forward( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, output_hidden_states=False +): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask is None: + attention_mask = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + else: + if attention_mask.ndim == 2: + # attention_mask [batch_size, sequence_length] -> [batch_size, 1, 1, sequence_length] + attention_mask = attention_mask.unsqueeze(axis=[1, 2]) + + embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids) + if output_hidden_states: + output = embedding_output + encoder_outputs = [] + for mod in self.encoder.layers: + output = mod(output, src_mask=attention_mask) + encoder_outputs.append(output) + if self.encoder.norm is not None: + encoder_outputs[-1] = self.encoder.norm(encoder_outputs[-1]) + pooled_output = self.pooler(encoder_outputs[-1]) + else: + sequence_output = self.encoder(embedding_output, attention_mask) + pooled_output = self.pooler(sequence_output) + if output_hidden_states: + return encoder_outputs, pooled_output + else: + return sequence_output, pooled_output + + +BertModel.forward = bert_forward + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--sub_model_output_dir", + default=None, + type=str, + required=True, + help="The output directory where the sub model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--static_sub_model", + default=None, + type=str, + help="The output directory where the sub static model will be written. If set to None, not export static model", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--n_gpu", type=int, default=1, help="number of gpus to use, 0 for cpu.") + parser.add_argument("--width_mult", type=float, default=1.0, help="width mult you want to export") + parser.add_argument("--depth_mult", type=float, default=1.0, help="depth mult you want to export") + args = parser.parse_args() + return args + + +def export_static_model(model, model_path, max_seq_length): + input_shape = [ + paddle.static.InputSpec(shape=[None, max_seq_length], dtype="int64"), + paddle.static.InputSpec(shape=[None, max_seq_length], dtype="int64"), + ] + net = paddle.jit.to_static(model, input_spec=input_shape) + paddle.jit.save(net, model_path) + + +def do_train(args): + paddle.set_device("gpu" if args.n_gpu else "cpu") + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config_path = os.path.join(args.model_name_or_path, "model_config.json") + cfg_dict = dict(json.loads(open(config_path).read())) + + kept_layers_index = {} + if args.depth_mult < 1.0: + depth = round(cfg_dict["init_args"][0]["num_hidden_layers"] * args.depth_mult) + cfg_dict["init_args"][0]["num_hidden_layers"] = depth + for idx, i in enumerate(range(1, depth + 1)): + kept_layers_index[idx] = math.floor(i / args.depth_mult) - 1 + + os.rename(config_path, config_path + "_bak") + with open(config_path, "w", encoding="utf-8") as f: + f.write(json.dumps(cfg_dict, ensure_ascii=False)) + + num_labels = cfg_dict["num_classes"] + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + origin_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + os.rename(config_path + "_bak", config_path) + + sp_config = supernet(expand_ratio=[1.0, args.width_mult]) + model = Convert(sp_config).convert(model) + + ofa_model = OFA(model) + + sd = paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams")) + + if len(kept_layers_index) == 0: + ofa_model.model.set_state_dict(sd) + else: + for name, params in ofa_model.model.named_parameters(): + if "encoder" not in name: + params.set_value(sd[name]) + else: + idx = int(name.strip().split(".")[3]) + mapping_name = name.replace("." + str(idx) + ".", "." 
+ str(kept_layers_index[idx]) + ".") + params.set_value(sd[mapping_name]) + + best_config = utils.dynabert_config(ofa_model, args.width_mult) + for name, sublayer in ofa_model.model.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + ofa_model.export( + best_config, + input_shapes=[[1, args.max_seq_length], [1, args.max_seq_length]], + input_dtypes=["int64", "int64"], + origin_model=origin_model, + ) + for name, sublayer in origin_model.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + output_dir = os.path.join(args.sub_model_output_dir, "model_width_%.5f" % args.width_mult) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = origin_model + model_to_save.save_pretrained(output_dir) + + if args.static_sub_model is not None: + export_static_model(origin_model, args.static_sub_model, args.max_seq_length) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/ofa/imgs/ofa_bert.jpg b/examples/model_compression/ofa/imgs/ofa_bert.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8683c7d8bc21803e13654231763a3680940d63cd Binary files /dev/null and b/examples/model_compression/ofa/imgs/ofa_bert.jpg differ diff --git a/examples/model_compression/ofa/run_glue_ofa.py b/examples/model_compression/ofa/run_glue_ofa.py new file mode 100644 index 0000000000000000000000000000000000000000..4ed7725cddf611acb31ab3c9be4b1b63d44a8a87 --- /dev/null +++ b/examples/model_compression/ofa/run_glue_ofa.py @@ -0,0 +1,488 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
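# Overview of the width-adaptive (DynaBERT-style) compression implemented below,
# mirroring the numbered "Step" comments inside do_train():
#   Step1-2: load the fine-tuned BERT model and convert it into a width-elastic
#            supernet (paddleslim Convert/supernet), re-using the original weights.
#   Step3-4: keep an unconverted copy as the distillation teacher and register the
#            embedding/encoder layers used for hidden-state distillation.
#   Step5-6: wrap the supernet with OFA, then reorder attention heads and FFN
#            neurons by importance so the most useful units survive narrowing.
#   Step7-8: for every batch, iterate over --width_mult_list and train each
#            sub-network with the distillation loss plus a soft cross-entropy
#            loss against the teacher logits.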
+ +import argparse +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertModel, + BertTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# paddleslim needs to be imported last for some overrides to kick in +from paddleslim.nas.ofa import OFA, DistillConfig, utils # isort: skip +from paddleslim.nas.ofa.convert_super import Convert, supernet # isort: skip +from paddleslim.nas.ofa.utils import nlp_utils # isort: skip + + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--lambda_logit", default=1.0, type=float, help="lambda for logit loss.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--width_mult_list", nargs="+", type=float, default=[1.0, 5 / 6, 2 / 3, 0.5], help="width mult in compress" + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader, width_mult=1.0): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids, attention_mask=[None, None]) + if isinstance(logits, tuple): + logits = logits[0] + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + # Teacher model's evaluation + if width_mult == 100: + if isinstance(metric, AccuracyAndF1): + print( + "teacher model, eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("teacher model, eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "teacher model, eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("teacher model, eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + else: + if isinstance(metric, AccuracyAndF1): + print( + "width_mult: %s, eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + width_mult, + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("width_mult: %s, eval loss: %f, mcc: %s, " % (str(width_mult), loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "width_mult: %s, eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (str(width_mult), loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("width_mult: %s, eval loss: %f, acc: %s, " % (str(width_mult), loss.numpy(), res), end="") + model.train() + + +# monkey patch for bert forward to accept [attention_mask, head_mask] as attention_mask +def bert_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None]): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids=input_ids, 
position_ids=position_ids, token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +BertModel.forward = bert_forward + + +# reorder weights according head importance and neuron importance +def reorder_neuron_head(model, head_importance, neuron_importance): + # reorder heads and ffn neurons + for layer, current_importance in enumerate(neuron_importance): + # reorder heads + idx = paddle.argsort(head_importance[layer], descending=True) + nlp_utils.reorder_head(model.bert.encoder.layers[layer].self_attn, idx) + # reorder neurons + idx = paddle.argsort(paddle.to_tensor(current_importance), descending=True) + nlp_utils.reorder_neuron(model.bert.encoder.layers[layer].linear1.fn, idx, dim=1) + nlp_utils.reorder_neuron(model.bert.encoder.layers[layer].linear2.fn, idx, dim=0) + + +def soft_cross_entropy(inp, target): + inp_likelihood = F.log_softmax(inp, axis=-1) + target_prob = F.softmax(target, axis=-1) + return -1.0 * paddle.mean(paddle.sum(inp_likelihood * target_prob, axis=-1)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + 
dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_labels = 1 if train_ds.label_list is None else len(train_ds.label_list) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step1: Initialize a dictionary to save the weights from the origin BERT model. + origin_weights = model.state_dict() + + # Step2: Convert origin model to supernet. + sp_config = supernet(expand_ratio=args.width_mult_list) + model = Convert(sp_config).convert(model) + # Use weights saved in the dictionary to initialize supernet. + utils.set_state_dict(model, origin_weights) + del origin_weights + + # Step3: Define teacher model. + teacher_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step4: Config about distillation. + mapping_layers = ["bert.embeddings"] + for idx in range(model.bert.config["num_hidden_layers"]): + mapping_layers.append("bert.encoder.layers.{}".format(idx)) + + default_distill_config = { + "lambda_distill": 0.1, + "teacher_model": teacher_model, + "mapping_layers": mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + # Step5: Config in supernet training. + ofa_model = OFA(model, distill_config=distill_config, elastic_order=["width"]) + + criterion = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + if args.task_name == "mnli": + dev_data_loader = (dev_data_loader_matched, dev_data_loader_mismatched) + + # Step6: Calculate the importance of neurons and head, + # and then reorder them according to the importance. + head_importance, neuron_importance = nlp_utils.compute_neuron_head_importance( + args.task_name, + ofa_model.model, + dev_data_loader, + loss_fct=criterion, + num_layers=model.bert.config["num_hidden_layers"], + num_heads=model.bert.config["num_attention_heads"], + ) + reorder_neuron_head(ofa_model.model, head_importance, neuron_importance) + + if paddle.distributed.get_world_size() > 1: + ofa_model.model = paddle.DataParallel(ofa_model.model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
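    # The names collected below are handed to AdamW via apply_decay_param_fun, so
    # weight decay is applied only to parameters whose names are in decay_params
    # (i.e. everything except *bias* and *norm* parameters).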
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=ofa_model.model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + # Step7: Set current epoch and task. + ofa_model.set_epoch(epoch) + ofa_model.set_task("width") + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + + for width_mult in args.width_mult_list: + # Step8: Broadcast supernet config from width_mult, + # and use this config in supernet training. + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + if args.task_name == "sts-b": + logit_loss = paddle.zeros(shape=[1], dtype="float32") + else: + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + args.lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(teacher_model, criterion, metric, dev_data_loader_matched, width_mult=100) + evaluate(teacher_model, criterion, metric, dev_data_loader_mismatched, width_mult=100) + else: + evaluate(teacher_model, criterion, metric, dev_data_loader, width_mult=100) + print("eval done total : %s s" % (time.time() - tic_eval)) + for idx, width_mult in enumerate(args.width_mult_list): + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(ofa_model, criterion, metric, dev_data_loader_matched, width_mult) + evaluate(ofa_model, criterion, metric, dev_data_loader_mismatched, width_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(ofa_model, criterion, metric, dev_data_loader, width_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/ofa/run_glue_ofa_depth.py b/examples/model_compression/ofa/run_glue_ofa_depth.py new file mode 100644 index 
0000000000000000000000000000000000000000..09f8e8d9d1c995368e4a2b0ea09bc36cb61a9a13 --- /dev/null +++ b/examples/model_compression/ofa/run_glue_ofa_depth.py @@ -0,0 +1,459 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddle.metric import Accuracy +from paddleslim.nas.ofa import OFA, DistillConfig, RunConfig, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertModel, + BertTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--lambda_logit", default=1.0, type=float, help="lambda for logit loss.") + parser.add_argument("--lambda_rep", default=0.1, type=float, help="lambda for hidden state distillation loss.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--width_mult_list", nargs="+", type=float, default=[1.0, 5 / 6, 2 / 3, 0.5], help="width mult in compress" + ) + parser.add_argument( + "--depth_mult_list", nargs="+", type=float, default=[1.0, 0.75, 0.5], help="width mult in compress" + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +def evaluate(model, criterion, metric, data_loader, width_mult=1.0, depth_mult=1.0): + with paddle.no_grad(): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids, attention_mask=[None, None]) + if isinstance(logits, tuple): + logits = logits[0] + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + results = metric.accumulate() + # Teacher model's evaluation + if width_mult == 100: + print("teacher_model, eval loss: %f, %s: %s\n" % (loss.numpy(), metric.name(), results), end="") + else: + print( + "depth_mult: %f, width_mult: %f, eval loss: %f, %s: %s\n" + % (depth_mult, width_mult, loss.numpy(), metric.name(), results), + end="", + ) + model.train() + + +# monkey patch for bert forward to accept [attention_mask, head_mask] as attention_mask +def bert_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None], depth_mult=1.0): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids=input_ids, 
position_ids=position_ids, token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, attention_mask, depth_mult=depth_mult) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +BertModel.forward = bert_forward + + +def transformer_encoder_forward(self, src, src_mask=None, depth_mult=1.0): + output = src + + depth = round(self.num_layers * depth_mult) + kept_layers_index = [] + for i in range(1, depth + 1): + kept_layers_index.append(math.floor(i / depth_mult) - 1) + + for i in kept_layers_index: + output = self.layers[i](output, src_mask=src_mask) + + if self.norm is not None: + output = self.norm(output) + + return output + + +paddle.nn.TransformerEncoder.forward = transformer_encoder_forward + + +def sequence_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None], depth=1.0): + _, pooled_output = self.bert( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + depth_mult=depth, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + +BertForSequenceClassification.forward = sequence_forward + + +def soft_cross_entropy(inp, target): + inp_likelihood = F.log_softmax(inp, axis=-1) + target_prob = F.softmax(target, axis=-1) + return -1.0 * paddle.mean(paddle.sum(inp_likelihood * target_prob, axis=-1)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, 
splits=["dev_matched", "dev_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_labels = 1 if train_ds.label_list is None else len(train_ds.label_list) + + # Step1: Initialize the origin BERT model. + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + origin_weights = model.state_dict() + + # Step2: Convert origin model to supernet. + sp_config = supernet(expand_ratio=args.width_mult_list) + model = Convert(sp_config).convert(model) + # Use weights saved in the dictionary to initialize supernet. + utils.set_state_dict(model, origin_weights) + + # Step3: Define teacher model. + teacher_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + new_dict = utils.utils.remove_model_fn(teacher_model, origin_weights) + teacher_model.set_state_dict(new_dict) + del origin_weights, new_dict + + default_run_config = {"elastic_depth": args.depth_mult_list} + run_config = RunConfig(**default_run_config) + + # Step4: Config about distillation. + mapping_layers = ["bert.embeddings"] + for idx in range(model.bert.config["num_hidden_layers"]): + mapping_layers.append("bert.encoder.layers.{}".format(idx)) + + default_distill_config = { + "lambda_distill": args.lambda_rep, + "teacher_model": teacher_model, + "mapping_layers": mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + # Step5: Config in supernet training. + ofa_model = OFA(model, run_config=run_config, distill_config=distill_config, elastic_order=["depth"]) + # elastic_order=['width']) + + criterion = paddle.nn.CrossEntropyLoss() if train_ds.label_list else paddle.nn.MSELoss() + + metric = metric_class() + + if args.task_name == "mnli": + dev_data_loader = (dev_data_loader_matched, dev_data_loader_mismatched) + + if paddle.distributed.get_world_size() > 1: + ofa_model.model = paddle.DataParallel(ofa_model.model, find_unused_parameters=True) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=ofa_model.model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + # Step6: Set current epoch and task. + ofa_model.set_epoch(epoch) + ofa_model.set_task("depth") + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + + for depth_mult in args.depth_mult_list: + for width_mult in args.width_mult_list: + # Step7: Broadcast supernet config from width_mult, + # and use this config in supernet training. + net_config = utils.dynabert_config(ofa_model, width_mult, depth_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + if args.task_name == "sts-b": + logit_loss = 0.0 + else: + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + args.lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + ofa_model.model.clear_gradients() + + if global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0: + if args.task_name == "mnli": + evaluate(teacher_model, criterion, metric, dev_data_loader_matched, width_mult=100) + evaluate(teacher_model, criterion, metric, dev_data_loader_mismatched, width_mult=100) + else: + evaluate(teacher_model, criterion, metric, dev_data_loader, width_mult=100) + for depth_mult in args.depth_mult_list: + for width_mult in args.width_mult_list: + net_config = utils.dynabert_config(ofa_model, width_mult, depth_mult) + ofa_model.set_net_config(net_config) + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(ofa_model, criterion, metric, dev_data_loader_matched, width_mult, depth_mult) + evaluate(ofa_model, criterion, metric, dev_data_loader_mismatched, width_mult, depth_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(ofa_model, criterion, metric, dev_data_loader, width_mult, depth_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/pp-minilm/README.md b/examples/model_compression/pp-minilm/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..bb3e9598ef025e1a33992362c13ce4cd86e1529b --- /dev/null +++ b/examples/model_compression/pp-minilm/README.md @@ -0,0 +1,430 @@ + **目录** + +* [PP-MiniLM 中文小模型](#PP-MiniLM中文小模型) + * [导入 PP-MiniLM](#导入PP-MiniLM) + * [在下游任务上使用 PP-MiniLM](#在下游任务上使用PP-MiniLM) + * [数据介绍](#数据介绍) + * [环境依赖](#环境依赖) + * [微调](#微调) + * [运行方式](#运行方式) + * [微调后模型精度](#微调后模型精度) + * [导出微调后模型](#导出微调后模型) + * [裁剪](#裁剪) + * [原理简介](#原理简介) + * [运行方式](#运行方式) + * [裁剪后模型精度](#裁剪后模型精度) + * [导出裁剪后的模型](#导出裁剪后的模型) + * [量化](#量化) + * [原理简介](#原理简介) + * [运行方式](#运行方式) + * [量化后模型精度](#量化后模型精度) + * [使用 Paddle Inference 进行推理部署](#使用PaddleInference推理部署) + * [环境要求](#环境要求) + * [运行方式](#运行方式) + * [性能测试](#性能测试) + * [使用 Paddle Serving 进行服务化部署](#使用PaddleServing服务化部署) + * [参考文献](#参考文献) + +<a name="PP-MiniLM中文小模型"></a> + +# PP-MiniLM 中文小模型 +[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) 联合 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 通过模型蒸馏、剪裁、量化等级联模型压缩技术发布中文特色小模型 PP-MiniLM(6L768H) 及压缩方案,保证模型精度的同时模型推理速度达 BERT(12L768H) 的 8.88 倍,参数量相比减少 52%,模型精度在中文语言理解评测基准 CLUE 高 0.62。 + +PP-MiniLM 压缩方案以面向预训练模型的任务无关知识蒸馏(Task-agnostic Distillation)技术、裁剪(Pruning)技术、量化(Quantization)技术为核心,使得 PP-MiniLM **又快**、**又准**、**又小**。 + +1. **推理速度快**: 依托 PaddleSlim 的裁剪、量化技术对 PP-MiniLM 小模型进行压缩、加速, 使得 PP-MiniLM 量化后模型 GPU 推理速度相比 BERT base 加速比高达 8.88; + +2. **精度高**: 我们以 [MiniLMv2](https://arxiv.org/abs/2012.15828) 提出的 Multi-Head Self-Attention Relation Distillation 技术为基础,通过引入样本间关系知识蒸馏做了进一步算法优化,6 层 PP-MiniLM 模型在 CLUE 数据集上比 12 层 `bert-base-chinese` 高 0.62%,比同等规模的 TinyBERT<sub>6,</sub>、UER-py RoBERTa 分别高 2.57%、2.24%; + +3. **参数规模小**:依托 Task-agnostic Distillation 技术和 PaddleSlim 裁剪技术,模型参数量相比 BERT 减少 52%。 + +**整体效果** + + +| Model | #Params | #FLOPs | Speedup (w/o FasterTokenizer) | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ----------------------------- | --------- | --------- | ---------------- | --------- | --------- | --------- | --------- | --------- | ----------- | --------- | --------- | +| BERT-base, Chinese | 102.3M | 10.87B | 1.00x | 74.14 | 56.81 | 61.10 | 81.19 | 74.85 | 79.93 | 81.47 | 72.78 | +| TinyBERT<sub>6,</sub> Chinese | 59.7M | 5.44B | 1.90x | 72.59 | 55.70 | 57.64 | 79.57 | 73.97 | 76.32 | 80.00 | 70.83 | +| UER-py RoBERTa L6-H768 | 59.7M | 5.44B | 1.90x | 69.62 | **66.45** | 59.91 | 76.89 | 71.36 | 71.05 | **82.87** | 71.16 | +| RBT6, Chinese | 59.7M | 5.44B | 1.90x | 73.93 | 56.63 | 59.79 | 79.28 | 73.12 | 77.30 | 80.80 | 71.55 | +| ERNIE-Tiny | 90.7M | 4.83B | 2.22x | 71.55 | 58.34 | 61.41 | 76.81 | 71.46 | 72.04 | 79.13 | 70.11 | +| PP-MiniLM | 59.7M | 5.44B | 2.15x (1.90x) | 74.14 | 57.43 | **61.75** | 81.01 | **76.17** | 86.18 | 79.17 | **73.69** | +| PP-MiniLM + 裁剪 | **49.1M** | **4.08B** | 2.74x (2.48x) | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | **85.86** | 78.53 | 73.44 | +| PP-MiniLM + 量化 | 59.8M | - | 7.34x (4.63x) | **74.19** | 57.13 | 61.10 | **81.20** | 76.10 | 85.20 | 78.03 | 73.28 | +| PP-MiniLM + 裁剪 + 量化 | **49.2M** | - | **8.88x** (5.36x) | 74.00 | 57.37 | 61.33 | 81.09 | 75.56 | 85.85 | 78.57 | 73.40 | + + +**NOTE:** + +1.上表所有模型的精度测试均是基于下方超参数范围进行的 Grid Search 超参寻优。在每个配置下训练时,每隔 100 个 steps 在验证集上评估一次,取验证集上最佳准确率作为当前超参数配置下的准确率; +- batch sizes: 16, 32, 64; +- learning rates: 3e-5, 5e-5, 1e-4 + +2.量化后比量化前模型参数量多了 0.1M 是因为保存了 scale 值; + +3.性能测试的环境: + +- 硬件:NVIDIA Tesla T4 单卡; +- 软件:CUDA 11.1, cuDNN 8.1, TensorRT 7.2, PaddlePaddle 2.2.2; +- 实验配置:batch_size: 32, max_seq_len: 128; + +其中,除上表最后两行的模型是对 INT8 模型进行预测,其余模型均基于 FP32 精度测试。 + +4.PP-MiniLM 的加速比(见表中 Speedup 列)均测试了接入与未接入 
FasterTokenizer 的数据,其中括号内的加速比代表未接入 FasterTokenizer 的加速比数据。接入 FasterTokenizer 对模型的精度没有影响,裁剪、量化后的模型相对 BERT-base 的加速比从 5.36 倍增加到 8.88 倍。 + +**方案流程** + +<p align="center"> +<img src="./pp-minilm.png" width="950"/><br /> +方案流程图 +</p> + +如上流程图所示,完整的中文小模型方案分为:导入 PP-MiniLM 中文预训练小模型、下游任务微调、裁剪、离线量化、预测部署五大步。下面会对这里的每一个步骤进行介绍。除了下游任务微调步骤,其余步骤均可以省略,但我们建议保留下面的每一个步骤。 + +以下是本范例模型的简要目录结构及说明: + +```shell +. +├── general_distill # 任务无关知识蒸馏目录 +│ └── general_distill.py # 任务无关知识蒸馏脚本 +│ └── run.sh # 任务无关知识蒸馏启动脚本 +│ └── README.md # 任务无关知识蒸馏文档 +├── finetuning # 下游任务训练目录 +│ └── run_clue.py # CLUE 上的微调脚本 +│ └── run_clue.sh # CLUE 上的微调启动脚本 +│ └── run_one_search.sh # 单数据集下精调脚本 +│ └── run_all_search.sh # CLUE数据集下精调脚本 +│ └── export_model.py # 导出 fine-tuned 部署模型脚本 +├── pruning # 裁剪、蒸馏目录 +│ └── prune.py # 裁剪、蒸馏脚本 +│ └── prune.sh # 裁剪、蒸馏启动脚本 +│ └── export_model.py # 导出裁剪训练得到的子模型(动、静态图模型) +├── quantization # 离线量化目录 +│ └── quant_post.py # 离线量化脚本 +│ └── quant.sh # 离线量化启动脚本 +├── deploy # 部署目录 +│ └── python # Paddle Inference 预测目录 +│ └── infer.py # Paddle Inference 预测脚本 +│ └── infer_all.sh # 批量预测量化模型启动脚本 +│ └── infer_perf.sh # 量化模型性能测试启动脚本 +│ └── serving # Paddle Serving 预测目录 +│ └── export_to_serving.py # 导出 Paddle Serving 预测模型脚本 +│ └── web_service.py # Paddle Serving 服务端启动脚本 +│ └── rpc_client.py # Paddle Serving 客户端启动脚本 +│ └── config_nlp.yml # Paddle Serving 预测配置文件 +│ └── README.md # Paddle Serving 预测文档 +├── data.py # 数据处理脚本 +├── pp-minilm.png # PP-MiniLM 方案流程图 +└── README.md # 文档,本文件 + +``` + +<a name="导入PP-MiniLM"></a> + +## 导入 PP-MiniLM + +PP-MiniLM 是使用任务无关蒸馏方法,以 `roberta-wwm-ext-large` 做教师模型蒸馏产出的含 6 层 Transformer Encoder Layer、Hidden Size 为 768 的预训练小模型,在 CLUE 上 7 个分类任务上的模型精度超过 BERT<sub>base</sub>、TinyBERT<sub>6</sub>、UER-py RoBERTa L6-H768、RBT6。 + +可以这样导入 PP-MiniLM: + +```python + +from paddlenlp.transformers import PPMiniLMModel, PPMiniLMForSequenceClassification + +model = PPMiniLMModel.from_pretrained('ppminilm-6l-768h') +model = PPMiniLMForSequenceClassification.from_pretrained('ppminilm-6l-768h') # 用于分类任务 +``` + +PP-MiniLM 是一个 6 层的预训练模型,使用 `from_pretrained`导入 PP-MiniLM 之后,就可以在自己的数据集上进行 fine-tuning。接下来会介绍如何用下游任务数据在导入的 PP-MiniLM 上进行微调、进一步压缩及推理部署。 + +**NOTE:** 如果对 PP-MiniLM 的训练过程感兴趣,可以查看[任务无关蒸馏文档](general_distill/README.md)了解相关细节。 + +<a name="在下游任务上使用PP-MiniLM"></a> + +## 在下游任务上使用 PP-MiniLM + +PP-MiniLM 预训练小模型在 CLUE 中的 7 个分类数据集的平均精度上比 12 层 `bert-base-chinese` 高 0.62%,比同等规模的 TinyBERT、UER-py RoBERTa 分别高 2.57%、2.24%,因此我们推荐将 PP-MiniLM 运用在中文下游任务上。当然,如果想对已有模型进一步压缩,也可以参考这里的压缩方案,因为压缩方案是通用的。 + +本案例中会以 CLUE 中 7 个分类数据集为例介绍如何在下游任务上使用 PP-MiniLM。首先用 CLUE 中的数据集对预训练小模型 PP-MiniLM 进行微调,然后提供了一套压缩方案,即借助 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 进行裁剪和量化,进一步对模型规模进行压缩,最终使用基于 TensorRT 的 [Paddle Inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/05_inference_deployment/inference/inference_cn.html) 预测库对量化后的模型进行预测部署。裁剪、量化前,6 层 PP-MiniLM 的推理速度达 `bert-base-chinese` 的 2.15 倍,在下游任务上压缩完成后,模型推理速度高达`bert-base-chinese`的 8.88 倍。 + +<a name="数据介绍"></a> + +### 数据介绍 + +本案例中下游任务使用的数据是 CLUE 中 7 个分类数据集,包括 AFQMC、TNEWS、IFLYTEK、OCNLI、CMNLI、CSL、CLUEWSC2020。在 Linux 环境下,运行 `run_clue.py` 这个 fine-tuning 脚本会将该数据集自动下载到`~/.paddlenlp/datasets/Clue/`目录下。 + +<a name="环境依赖"></a> + +### 环境依赖 + +压缩方案依赖 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 提供的裁剪、量化功能,在本案例中需要安装 paddleslim 2.2.2 及之后的版本(使用命令 `pip install paddleslim>=2.2.2` 安装即可)。PaddleSlim 是个专注于深度学习模型压缩的工具库,提供剪裁、量化、蒸馏、和模型结构搜索等模型压缩策略,帮助用户快速实现模型的小型化。安装命令如下: + +```shell +pip install paddleslim>=2.2.2 +``` + +<a name="微调"></a> + +### 微调 + +基于如下超参范围对 PP-MiniLM 在各个任务上进行 Grid 
Search 超参寻优: + +- batch sizes: 16, 32, 64 +- learning rates: 3e-5, 5e-5, 1e-4 + +<a name="运行方式"></a> + +#### 运行方式 + +```shell +cd finetuning +mkdir ppminilm-6l-768h +sh run_all_search.sh ppminilm-6l-768h +``` + +如果只是在单个数据集上用特定 `batch_size`、`learning_rate` 微调,可以使用如下命令: + +``` +sh run_clue.sh CLUEWSC2020 1e-4 32 50 128 0 ppminilm-6l-768h +``` + +其中每个参数依次表示:CLUE 中的任务名称、学习率、batch size、epoch 数、最大序列长度、gpu id、模型名称(模型保存目录)。 + +<a name="微调后模型精度"></a> + +#### 微调后模型精度 + +经过超参寻优后,我们可以得到在 CLUE 每个任务上验证集上有最高准确率的模型,CLUE 上各个任务上的最高准确率如下表: + +| AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- | +| 74.14 | 57.43 | 61.75 | 81.01 | 76.17 | 86.18 | 79.17 | 73.69 | + + + +超参寻优完成后,保存下每个数据集下有最高准确率的模型,以及其对应的超参数,因裁剪、量化等后续步骤需要用到最好的模型和超参数。 + +<a name="导出微调后模型"></a> + +#### 导出微调后模型 + +模型在训练完成之后,可以选择每个数据集下效果最好的模型进行导出: + +```shell +export TASK_NAME=CLUEWSC2020 +export MODEL_PATH=ppminilm-6l-768h +export LR=1e-4 +export BS=32 + +python export_model.py --task_name ${TASK_NAME} --model_path ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ +``` + +静态图(部署)模型路径与动态图模型的路径相同,文件名为 `inference.pdmodel` , `inference.pdiparams` 和 `inference.pdiparams.info` 。 + +<a name="裁剪"></a> + +### 裁剪 + +这一步主要使用 PaddleSlim 对下游任务上的模型宽度进行裁剪,进一步压缩模型的大小。 + +该过程会以上一步的模型(即 fine-tuning 后得到的最好模型)当作教师模型,蒸馏宽度为 3/4 的学生模型。经过我们的实验,在 6L768H 条件下,模型宽度压缩为原来的 3/4,平均精度下降 0.25。 + +<a name="原理简介"></a> + +#### 原理简介 + +本方案采取的裁剪方法参考了 [DynaBERT-Dynamic BERT with Adaptive Width and Depth](https://arxiv.org/pdf/2004.04037) 中的策略。首先对预训练模型和 Head 进行重要性排序,保证更重要的 Head 不容易被裁掉,然后用原模型作为蒸馏过程中的教师模型,宽度更小的(本案例是 3/4 宽度)模型作为学生模型,蒸馏得到的学生模型就是我们裁剪得到的模型。 + +<a name="运行方式"></a> + +#### 运行方式 + +假设需要对上一步 fine-tuned 模型 `../finetuning/ppminilm-6l-768h/models/CLUEWSC2020/1e-4_32` 进行裁剪,其中 `learning_rate`、`batch_size` 可以继续使用 fine-tuning 时的参数,这里执行的是宽度 `0.75` 的裁剪,可以使用如下命令运行: + +```shell +cd pruning +export FT_MODELS=../finetuning/ppminilm-6l-768h/models/CLUEWSC2020/1e-4_32 + +sh prune.sh CLUEWSC2020 1e-4 32 50 128 0 ${FT_MODELS} 0.75 +``` +其中每个参数依次表示:CLUE 中的任务名称、学习率、batch size、epoch 数、最大序列长度、gpu id、学生模型的地址、裁剪后宽度比例列表。执行完成后,模型保存的路径位于 `pruned_models/CLUEWSC2020/0.75/best_model/`。 + +<a name="裁剪后模型精度"></a> + +#### 裁剪后模型精度 + +经过裁剪后,CLUE 上各个任务上的精度如下表所示。相比起裁剪前,CLUE 数据集上平均值下降 0.25。模型的参数量由 59.7M 下降到 49.1M。 + +| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ---------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- | +| PP-MiniLM 裁剪后 | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | 85.86 | 78.53 | 73.44 | + + +<a name="导出裁剪后的模型"></a> + +#### 导出裁剪后的模型 + +这一步可以同时导出经过裁剪后特定宽度下模型的动、静态图的模型和参数等文件。 + +以 CLUEWSC2020 数据集为例,导出模型: + +```shell + +export MODEL_PATH=pruned_models +export TASK_NAME=CLUEWSC2020 +sh export.sh ${MODEL_PATH} ${TASK_NAME} +``` + +或者可以批量导出 CLUE 各个任务上的模型: + +```shell + +sh export_all.sh +cd .. 
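# 批量导出完成后,各任务的静态图子模型位于 pruned_models/${TASK_NAME}/0.75/sub_static/ 目录下(详见下文说明)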
+``` + +导出的静态图模型、参数等文件位于 `${MODEL_PATH}/${TASK_NAME}/0.75/sub_static/` 目录下,有 `float.pdmodel`、`float.pdiparams`、`float.pdiparams.info` 三个文件。 + +导出的动态图参数等文件位于 `${MODEL_PATH}/${TASK_NAME}/0.75/sub/model_width_0.75000/` 目录下,有 `model_state.pdparams` 和 `model_config.json` 两个文件。需要注意的是,此动态图模型不能通过原始 Transformer API 将参数正确载入,因为裁剪后模型不再符合 Transformer 的组网,例如 q、k、v 的 weight 的 shape 是 `[hidden_size, hidden_size * width_mul]` ,不再是 `[hidden_size, hidden_size]`。 + +<a name="量化"></a> + +### 量化 + +```shell +cd quantization +``` + +<a name="原理简介"></a> + +#### 原理简介 + +这里的量化采用的是静态离线量化方法,即不需要训练,只使用少量校准数据计算量化因子,就可快速得到量化模型。这一步需要有训练好的预测(静态图)模型。因此,需要对前序步骤产出的模型进行导出(参考上方导出模型的运行方式)。 + +量化我们可以借助 PaddleSlim 提供的离线量化 API `paddleslim.quant.quant_post_static` 实现,我们这一步使用了 `mse`、`avg`、`abs_max`、`hist` 四种对于激活 Tensor 的量化方法,以及 `channel_wise_abs_max` 对权重 Tensor 的量化方法,并使用 4、8 两种校准集数量,对 `matmul`、`matmul_v2` 算子进行量化。 + +<a name="运行方式"></a> + +#### 运行方式 + +运行如下脚本可以对导出的静态图模型进行量化: + +```shell +export MODEL_DIR=../pruning/pruned_models/ +python quant_post.py --task_name $TASK_NAME --input_dir ${MODEL_DIR}/${TASK_NAME}/0.75/sub_static +``` + +可以批量对所有数据集下的 FP32 模型进行量化: + +```shell +sh quant_all.sh +cd .. +``` + +<a name="量化后模型精度"></a> + +#### 量化后模型精度 + +经过量化后,CLUE 上各个任务上的精度如下表,对 PP-MiniLM 进行量化后,精度比原 FP32 模型下降 0.19;对裁剪后的模型进行量化,精度几乎无损(-0.04): + +| NO | Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ---- | ----------------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- | +| 1 | PP-MiniLM | 74.24 | 57.21 | 61.1 | 81.16 | 76.17 | 85.53 | 78.90 | 73.47 | +| 1 | PP-MiniLM + 量化 | 74.19 | 57.13 | 61.10 | 81.20 | 76.10 | 85.20 | 78.03 | 73.28 | +| 2 | PP-MiniLM + 裁剪 | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | 85.86 | 78.53 | 73.44 | +| 2 | PP-MiniLM + 裁剪 + 量化 | 74.00 | 57.37 | 61.33 | 81.09 | 75.56 | 85.85 | 78.57 | 73.40 | + + +**NOTE:** 实验 1 是补充实验,PP-MiniLM 和 实验 2 中裁剪前的 PP-MiniLM 模型精度不同。 + +最后,值得注意的是,PP-MiniLM 是基于 `roberta-wwm-ext-large` 做教师模型蒸馏得到的学生模型,如果你有更好的 24 层中文预训练模型,可以基于[任务无关蒸馏文档](general_distill/README.md)中介绍的蒸馏过程,训练出一个比 PP-MiniLM 精度更高,在下游任务上表现更好的 6 层小模型。 + +<a name="使用PaddleInference进行推理部署"></a> + +### 使用 Paddle Inference 进行推理部署 + +预测部署借助 PaddlePaddle 安装包中自带的 [Paddle Inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/05_inference_deployment/inference/inference_cn.html) 进行预测。 + +<a name="环境要求"></a> + +#### 环境要求 + +这一步依赖安装有预测库的 PaddlePaddle 2.2.2。可以在 [PaddlePaddle 官网](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html#python) 根据机器环境选择合适的 Python 预测库进行安装。 + +想要得到更明显的加速效果,推荐在 NVIDA Tensor Core GPU(如 T4、A10、A100)上进行测试,本案例基于 T4 测试。若在 V 系列 GPU 卡上测试,由于其不支持 Int8 Tensor Core,加速效果将达不到本文档表格中的效果。 + +本案例是在 NVIDIA Tesla T4 单卡上,使用 CUDA 11.1、cuDNN 8.1、TensorRT 7.2 进行预测。 + +<a name="运行方式"></a> + +#### 运行方式 + +这里使用了动态 shape 功能,因此需要设置 TensorRT 子图输入shape 的范围。用户需要事先根据自己的模型结构和数据 shape 的范围,设置 TensorRT 子图输入的 shape 的最大、最小、以及最优的范围,其中最优范围可以按照数据分布选择最常见的来设置。动态 shape 的设置可以参考[官方文档](https://paddleinference.paddlepaddle.org.cn/optimize/paddle_trt.html#dynamic-shape)中的教程,以及本案例中 infer.py 脚本中的 160 行 - 206 行)。 + +INT8 预测运行脚本: + +```shell + +cd deploy/python + +export task=tnews +export algo=mse +export bs=4 +python infer.py --task_name ${task} --model_path ../../quantization/${task}_quant_models/${algo}${bs}/int8 --int8 --use_trt +``` +如果想要批量对量化模型进行预测并输出不同量化策略产出模型的精度,可以使用如下的脚本批量预测: + +```shell +sh infer_all.sh +``` + +FP32 预测运行脚本: + +```shell +python infer.py --task_name ${task} --model_path $MODEL_PATH --use_trt +``` + +<a name="性能测试"></a> + +#### 性能测试 + 
测试性能环境同上。本案例测试采用的是 CLUE TNEWS 数据集下量化方法为 `mse`、校准集数量为 4 得到的量化模型,在 TNEWS 的验证集上统计 5 次端到端预测的总耗时(前 20 个 steps 作为 warmup steps)并求平均。其中 `batch_size` 为 32,`max_seq_len` 为 128。

启动性能测试需要对 `infer.py` 脚本传入参数 `--perf`,运行性能测试脚本可以得到 PP-MiniLM、PP-MiniLM 裁剪后、PP-MiniLM 量化后模型预测的耗时:

```shell
bash infer_perf.sh
cd ../../
```

下表展示了微调后、裁剪后、量化后模型相对 BERT<sub>base</sub> 的加速情况。

对 5 次测试耗时取平均并计算加速比,可以发现裁剪、量化后的模型是原 BERT<sub>base</sub> 模型推理速度的 8.88 倍,其中只经过裁剪的模型是 BERT<sub>base</sub> 推理速度的 2.74 倍,只经过量化的模型是 BERT<sub>base</sub> 推理速度的 7.34 倍;接入 FasterTokenizer 前,裁剪、量化后的推理速度是原 BERT<sub>base</sub> 模型推理速度的 5.36 倍。

|                          | 加速比    | 加速比(w/o FasterTokenizer) |
| ------------------------ | --------- | ----------------------------- |
| BERT<sub>base</sub>      | 1.00x     | 1.00x                         |
| PP-MiniLM                | 2.15x     | 1.90x                         |
| PP-MiniLM + 裁剪         | 2.74x     | 2.48x                         |
| PP-MiniLM + 量化         | 7.34x     | 4.63x                         |
| PP-MiniLM + 裁剪 + 量化  | **8.88x** | **5.36x**                     |


<a name="使用PaddleServing服务化部署"></a>

### 使用 Paddle Serving 进行服务化部署

上面介绍的 Paddle Inference 为使用本地模型推理,Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过 RPC/HTTP 方式发送数据进行推理,实现模型推理的服务化。准备好静态图(推理)模型后,可参考 [Paddle Serving](deploy/serving/README.md) 部署步骤。

<a name="参考文献"></a>

## 参考文献

1. Wang W, Bao H, Huang S, Dong L, Wei F. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers[J]. arXiv preprint arXiv:2012.15828v2, 2021.

2. Hou L, Huang Z, Shang L, Jiang X, Chen X and Liu Q. DynaBERT: Dynamic BERT with Adaptive Width and Depth[J]. arXiv preprint arXiv:2004.04037, 2020.

3. Cai H, Gan C, Wang T, Zhang Z, and Han S. Once for all: Train one network and specialize it for efficient deployment[J]. arXiv preprint arXiv:1908.09791, 2020.

4. Wu H, Judd P, Zhang X, Isaev M and Micikevicius P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation[J]. arXiv preprint arXiv:2004.09602v1, 2020.
diff --git a/examples/model_compression/pp-minilm/data.py b/examples/model_compression/pp-minilm/data.py
new file mode 100644
index 0000000000000000000000000000000000000000..819fc29a91a234e7e08812612b6d1158a2cf618c
--- /dev/null
+++ b/examples/model_compression/pp-minilm/data.py
@@ -0,0 +1,84 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
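# data.py provides the shared CLUE data helpers used by the PP-MiniLM examples.
# convert_example() turns a raw CLUE sample into (input_ids, token_type_ids[, label]):
#   - CSL: the keyword list is joined with spaces as sentence1 and paired with the abstract;
#   - CLUEWSC2020: the pronoun span is wrapped with "[ ]" and the query span with "_"
#     before the sentence is tokenized;
#   - the remaining tasks are tokenized directly as single sentences or sentence pairs.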
+ +import numpy as np +from paddle.metric import Accuracy + +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + PPMiniLMForSequenceClassification, + PPMiniLMTokenizer, +) + +MODEL_CLASSES = { + "ppminilm": (PPMiniLMForSequenceClassification, PPMiniLMTokenizer), + "bert": (BertForSequenceClassification, BertTokenizer), +} + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + + +def convert_example(example, label_list, tokenizer=None, is_test=False, max_seq_length=512, **kwargs): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + # Get the label + example["label"] = np.array(example["label"], dtype="int64") + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if tokenizer is None: + return example + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] diff --git a/examples/model_compression/pp-minilm/deploy/python/infer.py b/examples/model_compression/pp-minilm/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..15b497df8b24e3814da0bf06aab43ff61b342467 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/python/infer.py @@ -0,0 +1,344 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
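+
+# Paddle Inference deployment script: builds a predictor (optionally with
+# TensorRT / INT8 and dynamic-shape settings), runs the CLUE dev set and
+# reports accuracy, or end-to-end prediction time when --perf is passed.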
+ +import argparse +import sys +import time +from functools import partial + +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool + +sys.path.append("../../") +from data import METRIC_CLASSES, MODEL_CLASSES, convert_example # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default="afqmc", + type=str, + help="The name of the task to perform predict, selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default="ppminilm", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="ppminilm-6l-768h", + type=str, + help="The directory or name of model.", + ) + parser.add_argument( + "--model_path", + default="./quant_models/model", + type=str, + required=True, + help="The path prefix of inference model to be used.", + ) + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu", "xpu"], + help="Device selected for inference.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size for predict.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--perf_warmup_steps", + default=20, + type=int, + help="Warmup steps for performance test.", + ) + parser.add_argument( + "--use_trt", + action="store_true", + help="Whether to use inference engin TensorRT.", + ) + parser.add_argument( + "--perf", + action="store_true", + help="Whether to test performance.", + ) + parser.add_argument( + "--collect_shape", + action="store_true", + help="Whether collect shape range info.", + ) + parser.add_argument( + "--use_faster_tokenizer", + type=strtobool, + default=True, + help="Whether to use FasterTokenizer to accelerate training or further inference.", + ) + parser.add_argument( + "--int8", + action="store_true", + help="Whether to use int8 inference.", + ) + args = parser.parse_args() + return args + + +@paddle.no_grad() +def evaluate(outputs, metric, data_loader): + metric.reset() + for i, batch in enumerate(data_loader): + input_ids, segment_ids, labels = batch + logits = paddle.to_tensor(outputs[i][0]) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("acc: %s, " % res, end="") + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + cls.device = paddle.set_device("gpu") + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + cls.device = paddle.set_device("cpu") + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + if args.use_trt: + if args.int8: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + 
precision_mode=inference.PrecisionType.Int8, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + else: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=inference.PrecisionType.Float32, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled())) + # Set min/max/opt tensor shape of each trt subgraph input according + # to dataset. + # For example, the config of TNEWS data should be 16, 32, 32, 31, 128, 32. + min_batch_size, max_batch_size, opt_batch_size = 1, 32, 32 + min_seq_len, max_seq_len, opt_seq_len = 1, 128, 32 + if args.use_faster_tokenizer: + min_input_shape = { + "faster_tokenizer_1.tmp_0": [min_batch_size, min_seq_len], + "faster_tokenizer_1.tmp_1": [min_batch_size, min_seq_len], + "tmp_4": [min_batch_size, min_seq_len], + "unsqueeze2_0.tmp_0": [min_batch_size, 1, 1, min_seq_len], + } + max_input_shape = { + "faster_tokenizer_1.tmp_0": [max_batch_size, max_seq_len], + "faster_tokenizer_1.tmp_1": [max_batch_size, max_seq_len], + "tmp_4": [max_batch_size, max_seq_len], + "unsqueeze2_0.tmp_0": [max_batch_size, 1, 1, max_seq_len], + } + opt_input_shape = { + "faster_tokenizer_1.tmp_0": [opt_batch_size, opt_seq_len], + "faster_tokenizer_1.tmp_1": [opt_batch_size, opt_seq_len], + "tmp_4": [opt_batch_size, opt_seq_len], + "unsqueeze2_0.tmp_0": [opt_batch_size, 1, 1, opt_seq_len], + } + else: + min_input_shape = { + "input_ids": [min_batch_size, min_seq_len], + "token_type_ids": [min_batch_size, min_seq_len], + "tmp_4": [min_batch_size, min_seq_len], + "unsqueeze2_0.tmp_0": [min_batch_size, 1, 1, min_seq_len], + } + max_input_shape = { + "input_ids": [max_batch_size, max_seq_len], + "token_type_ids": [max_batch_size, max_seq_len], + "tmp_4": [max_batch_size, max_seq_len], + "unsqueeze2_0.tmp_0": [max_batch_size, 1, 1, max_seq_len], + } + opt_input_shape = { + "input_ids": [opt_batch_size, opt_seq_len], + "token_type_ids": [opt_batch_size, opt_seq_len], + "tmp_4": [opt_batch_size, opt_seq_len], + "unsqueeze2_0.tmp_0": [opt_batch_size, 1, 1, opt_seq_len], + } + config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape, opt_input_shape) + + predictor = paddle.inference.create_predictor(config) + + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def faster_predict(self, dataset, args): + batch_num = 0 + if "sentence" in dataset[0]: + data = [example["sentence"] for example in dataset] + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + batch_num = len(batches) + else: + data1 = [example["sentence1"] for example in dataset] + data2 = [example["sentence2"] for example in dataset] + batches1 = [data1[idx : idx + args.batch_size] for idx in range(0, len(data1), args.batch_size)] + batches2 = [data2[idx : idx + args.batch_size] for idx in range(0, len(data1), args.batch_size)] + batch_num = len(batches1) + if args.perf: + for i in range(batch_num): + if "sentence" in dataset[0]: + output = 
self.predict_batch([batches[i]]) + else: + output = self.predict_batch([batches1[i], batches2[i]]) + if i > args.perf_warmup_steps: + break + time1 = time.time() + if "sentence" in dataset[0]: + for i in range(batch_num): + output = self.predict_batch([batches[i]]) + else: + for i in range(batch_num): + output = self.predict_batch([batches1[i], batches2[i]]) + print("task name: %s, time: %s, " % (args.task_name, time.time() - time1)) + return output + + else: + labels = [example["label"] for example in dataset] + + batched_labels = [labels[idx : idx + args.batch_size] for idx in range(0, len(labels), args.batch_size)] + metric = METRIC_CLASSES[args.task_name]() + metric.reset() + + for i in range(batch_num): + if "sentence" in dataset[0]: + logits = self.predict_batch([batches[i]]) + else: + logits = self.predict_batch([batches1[i], batches2[i]]) + correct = metric.compute(paddle.to_tensor(logits), paddle.to_tensor(batched_labels[i])) + metric.update(correct) + + res = metric.accumulate() + print("task name: %s, acc: %s, " % (args.task_name, res), end="") + + def convert_predict_batch(self, args, data, tokenizer, batchify_fn, label_list): + examples = [] + for example in data: + example = convert_example(example, label_list, tokenizer, max_seq_length=args.max_seq_length) + examples.append(example) + + return examples + + def predict(self, dataset, tokenizer, batchify_fn, args): + batches = [dataset[idx : idx + args.batch_size] for idx in range(0, len(dataset), args.batch_size)] + if args.perf: + for i, batch in enumerate(batches): + examples = self.convert_predict_batch(args, batch, tokenizer, batchify_fn, dataset.label_list) + input_ids, segment_ids, label = batchify_fn(examples) + output = self.predict_batch([input_ids, segment_ids]) + if i > args.perf_warmup_steps: + break + time1 = time.time() + for batch in batches: + examples = self.convert_predict_batch(args, batch, tokenizer, batchify_fn, dataset.label_list) + input_ids, segment_ids, _ = batchify_fn(examples) + output = self.predict_batch([input_ids, segment_ids]) + + print("task name: %s, time: %s, " % (args.task_name, time.time() - time1)) + + else: + metric = METRIC_CLASSES[args.task_name]() + metric.reset() + for i, batch in enumerate(batches): + examples = self.convert_predict_batch(args, batch, tokenizer, batchify_fn, dataset.label_list) + input_ids, segment_ids, label = batchify_fn(examples) + output = self.predict_batch([input_ids, segment_ids]) + correct = metric.compute(paddle.to_tensor(output), paddle.to_tensor(label)) + metric.update(correct) + + res = metric.accumulate() + print("task name: %s, acc: %s, " % (args.task_name, res), end="") + + +def main(): + paddle.seed(42) + args = parse_args() + + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + + predictor = Predictor.create_predictor(args) + + _, tokenizer_class = MODEL_CLASSES[args.model_type] + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + + if not args.use_faster_tokenizer: + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + else: + trans_func = partial(convert_example, label_list=dev_ds.label_list, is_test=False) + dev_ds = dev_ds.map(trans_func, lazy=True) + if not args.use_faster_tokenizer: + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Stack(dtype="int64" if dev_ds.label_list else "float32"), # label + ): fn(samples) + predictor.predict(dev_ds, tokenizer, batchify_fn, args) + 
else: + predictor.faster_predict(dev_ds, args=args) + + +if __name__ == "__main__": + main() diff --git a/examples/model_compression/pp-minilm/deploy/python/infer_all.sh b/examples/model_compression/pp-minilm/deploy/python/infer_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..9a069793d8778457ce7a33d914769ddbd2b8a26e --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/python/infer_all.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +for task in AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL +do + for bs in 4 8 + do + for algo in abs_max avg hist mse + do + python infer.py --task_name ${task} --model_path ../quantization/${task}_quant_models/${algo}${bs}/int8 --int8 --use_trt + echo this is ${task}, ${algo}, ${bs} + done + done +done diff --git a/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh b/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh new file mode 100644 index 0000000000000000000000000000000000000000..c8469d9d9117359df4588d7d6e015a8c42f0904c --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+export task=TNEWS
+
+echo Inference of original FP32 model
+for ((i=0;i<=4;i++));
+do
+    python infer.py --task_name ${task} --model_path ../finetuning/ppminilm-6l-768h/models/${task}/1e-4_64/inference --use_trt --perf
+done
+
+echo After pruning
+for ((i=0;i<=4;i++));
+do
+    python infer.py --task_name ${task} --model_path ../pruning/pruned_models/${task}/0.75/sub_static/float --use_trt --perf
+done
+
+echo After quantization
+for ((i=0;i<=4;i++));
+do
+    python infer.py --task_name ${task} --model_path ../quantization/${task}_quant_models/mse4/int8 --int8 --use_trt --perf
+done
+
diff --git a/examples/model_compression/pp-minilm/deploy/serving/README.md b/examples/model_compression/pp-minilm/deploy/serving/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..3fed44ed9d92f1152630b954eb0a5a6f89ca426d
--- /dev/null
+++ b/examples/model_compression/pp-minilm/deploy/serving/README.md
@@ -0,0 +1,82 @@
+# PP-MiniLM 使用 Paddle Serving 进行服务化部署
+
+Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过 RPC/HTTP 方式发送数据进行推理,实现模型推理的服务化,下面以 RPC 方式为例进行说明。
+
+## 前提条件
+准备好 Inference 模型,需要 2 个文件:
+| 文件 | 说明 |
+|-------------------------------|----------------------------------------|
+| ppminilm.pdiparams | 模型权重文件,供推理时加载使用 |
+| ppminilm.pdmodel | 模型结构文件,供推理时加载使用 |
+
+假设这 2 个文件已生成,并放在目录 `$MODEL_DIR` 下。
+
+## 环境要求
+
+使用 Paddle Serving 需要在服务器端安装相关模块,需要 v0.8.0 之后的版本:
+```shell
+pip install paddle-serving-app paddle-serving-client paddle-serving-server
+```
+
+如果服务器端可以使用 GPU 进行推理,则安装 server 的 gpu 版本,安装时要注意参考服务器当前 CUDA、TensorRT 的版本来安装对应的版本:[Serving readme](https://github.com/PaddlePaddle/Serving/tree/v0.8.0)
+
+```shell
+pip install paddle-serving-app paddle-serving-client paddle-serving-server-gpu
+```
+
+还需要在客户端安装相关模块,也需要 v0.8.0 之后的版本:
+```shell
+pip install paddle-serving-app paddle-serving-client
+```
+
+## 从 Inference 模型生成 Serving 模型和配置
+
+以前提条件中准备好的 Inference 模型 `ppminilm.pdmodel`、`ppminilm.pdiparams` 为例:
+
+```shell
+python export_to_serving.py \
+    --dirname ${MODEL_DIR} \
+    --model_filename ppminilm.pdmodel \
+    --params_filename ppminilm.pdiparams \
+    --server_path serving_server \
+    --client_path serving_client \
+    --fetch_alias_names logits
+```
+
+其中参数释义如下:
+- `dirname` : 表示 Inference 推理模型所在目录,这里是位于 `${MODEL_DIR}`。
+- `model_filename` : 表示推理需要加载的模型结构文件。例如前提中得到的 `ppminilm.pdmodel`。如果设置为 `None` ,则使用 `__model__` 作为默认的文件名。
+- `params_filename` : 表示推理需要加载的模型权重文件。例如前提中得到的 `ppminilm.pdiparams`。
+- `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server。
+- `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client。
+- `fetch_alias_names`: 模型输出的别名设置,比如输出的 logits 等,都可以指定成其他名字,默认不指定。
+- `feed_alias_names`: 模型输入的别名设置,比如输入的 input_ids 等,都可以重新指定成其他名字,默认不指定。
+
+执行命令后,会在当前目录下生成 2 个目录:serving_server 和 serving_client。serving_server 目录包含服务器端所需的模型和配置,需将其拷贝到服务器端;serving_client 目录包含客户端所需的配置,需将其拷贝到客户端。
+
+
+## 配置 config 文件
+
+在启动预测之前,需要按照自己的情况修改 config 文件中的配置,主要需要修改的配置释义如下:
+
+- `rpc_port` : rpc 端口。
+- `device_type` : 0 代表 CPU, 1 代表 GPU, 2 代表 TensorRT, 3 代表 Arm CPU, 4 代表 Kunlun XPU。
+- `devices` : 计算硬件 ID,当 devices 为 "" 或不写时,为 CPU 预测;当 devices 为 "0"、"0,1,2" 时为 GPU 预测。
+- `fetch_list` : fetch 结果列表,以 client_config 中 fetch_var 的 alias_name 为准, 如果没有设置则全部返回。
+- `model_config` : 模型路径。
+
+## 启动 server
+
+在服务器端容器中,使用上一步得到的 serving_server 目录启动 server:
+
+```shell
+python web_service.py
+```
+
+## 启动 client 发起推理请求
+在客户端容器中,使用前面得到的 serving_client 目录启动 client 发起 RPC 推理请求。从命令行读取输入数据发起推理请求:
+
+```shell
+python rpc_client.py
+```
diff --git a/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml 
b/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..6356840acb25554e6320f21523c9a9ba20b91850 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 18083 +# 18082 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8091 +op: + ppminilm: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '0' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['logits'] + # 模型路径 + model_config: ./serving_server/ + diff --git a/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py b/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..79786ee23942a9ffce5ce8eb01aa0a8282e72219 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +import paddle_serving_client.io as serving_io + +parser = argparse.ArgumentParser() +parser.add_argument( + "--dirname", + type=str, + required=True, + default="./output", + help="Path of saved model files. Program file and parameter files are saved in this directory.", +) +parser.add_argument( + "--model_filename", + type=str, + required=True, + default="inference.get_pooled_embedding.pdmodel", + help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.", +) +parser.add_argument( + "--params_filename", + type=str, + required=True, + default="inference.get_pooled_embedding.pdiparams", + help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.", +) +parser.add_argument( + "--server_path", + type=str, + default="./serving_server", + help="The path of server parameter in static graph to be saved.", +) +parser.add_argument( + "--client_path", + type=str, + default="./serving_client", + help="The path of client parameter in static graph to be saved.", +) +parser.add_argument( + "--feed_alias_names", + type=str, + default=None, + help="set alias names for feed vars, split by comma ',', you should run --show_proto to check the number of feed vars", +) +parser.add_argument( + "--fetch_alias_names", + type=str, + default=None, + help="set alias names for feed vars, split by comma ',', you should run --show_proto to check the number of fetch vars", +) +parser.add_argument( + "--show_proto", + type=bool, + default=False, + help="If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.", +) + +if __name__ == "__main__": + paddle.enable_static() + args = parser.parse_args() + feed_names, fetch_names = serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py b/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..c975d13265e7f90d0f78af753a46be38930412f2 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddle_serving_server.pipeline import PipelineClient +import numpy as np + +client = PipelineClient() +client.connect(["127.0.0.1:8091"]) + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据", "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +ret = client.predict(feed_dict=feed) +print(ret) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/examples/model_compression/pp-minilm/deploy/serving/web_service.py b/examples/model_compression/pp-minilm/deploy/serving/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..01da8bc3b0f3c9999442f0bdec17a6923e6eb8eb --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/web_service.py @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + +from paddle_serving_server.web_service import Op, WebService + +_LOGGER = logging.getLogger() + + +class PPMiniLMOp(Op): + def init_op(self): + pass + + def preprocess(self, input_dicts, data_id, log_id): + ((_, input_dict),) = input_dicts.items() + feed_dict = {} + feed_dict["text"] = list(input_dict.values()) + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["logits"] = str(fetch_dict["logits"].tolist()) + return new_dict, None, "" + + +class PPMiniLMService(WebService): + def get_pipeline_response(self, read_op): + ppminilm_op = PPMiniLMOp(name="ppminilm", input_ops=[read_op]) + return ppminilm_op + + +ppminilm_service = PPMiniLMService(name="ppminilm") +ppminilm_service.prepare_pipeline_config("config_nlp.yml") +ppminilm_service.run_service() diff --git a/examples/model_compression/pp-minilm/finetuning/export_model.py b/examples/model_compression/pp-minilm/finetuning/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b7e01ca67bab3c5900c07bebd754770ebfd3de00 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/export_model.py @@ -0,0 +1,80 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
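+
+# Exports a fine-tuned PPMiniLMForSequenceClassification checkpoint to a
+# static-graph inference model via paddle.jit.to_static and paddle.jit.save.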
+import argparse +import os +import sys + +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import PPMiniLMForSequenceClassification + +sys.path.append("../") +from data import METRIC_CLASSES # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default="best_clue_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--save_inference_model_with_tokenizer", + type=strtobool, + default=True, + help="Whether to save inference model with tokenizer.", + ) + + args = parser.parse_args() + return args + + +def do_export(args): + save_path = os.path.join(os.path.dirname(args.model_path), "inference") + model = PPMiniLMForSequenceClassification.from_pretrained(args.model_path) + args.task_name = args.task_name.lower() + + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + ] + model = paddle.jit.to_static(model, input_spec=input_spec) + + paddle.jit.save(model, save_path) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_export(args) diff --git a/examples/model_compression/pp-minilm/finetuning/run_all_search.sh b/examples/model_compression/pp-minilm/finetuning/run_all_search.sh new file mode 100644 index 0000000000000000000000000000000000000000..c09a288a2fadee454a3bda186724fd391db0db69 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_all_search.sh @@ -0,0 +1,42 @@ +# $1 means GENERAL_DIR +mkdir -p $1/afqmc +mkdir -p $1/tnews +mkdir -p $1/ifly +mkdir -p $1/ocnli +mkdir -p $1/cmnli +mkdir -p $1/wsc +mkdir -p $1/csl + +# The penultimate parameter is the card id, this script can be changed if necessary +bash run_one_search.sh $1 afqmc 0 & +bash run_one_search.sh $1 tnews 1 & +bash run_one_search.sh $1 ifly 2 & +bash run_one_search.sh $1 ocnli 3 & +bash run_one_search.sh $1 csl 4 & +bash run_one_search.sh $1 wsc 5 & + +# Because the CMNLI data set is significantly larger than other data sets, +# It needs to be placed on different cards. 
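+# Each run below covers one (learning rate, batch size) combination on its own
+# card; the positional arguments follow run_clue.sh:
+# TASK_NAME LR BATCH_SIZE EPOCHS MAX_SEQ_LEN CARD_ID MODEL_PATH.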
+lr=1e-4 +bs=16 +sh run_clue.sh CMNLI $lr $bs 3 128 0 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=32 +sh run_clue.sh CMNLI $lr $bs 3 128 1 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=64 +sh run_clue.sh CMNLI $lr $bs 3 128 2 $1 > $1/cmnli/${lr}_${bs}_3_128.log & + +lr=5e-5 +bs=16 +sh run_clue.sh CMNLI $lr $bs 3 128 3 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=32 +sh run_clue.sh CMNLI $lr $bs 3 128 4 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=64 +sh run_clue.sh CMNLI $lr $bs 3 128 5 $1 > $1/cmnli/${lr}_${bs}_3_128.log & + +lr=3e-5 +bs=16 +sh run_clue.sh CMNLI $lr $bs 3 128 6 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=32 +sh run_clue.sh CMNLI $lr $bs 3 128 5 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=64 +sh run_clue.sh CMNLI $lr $bs 3 128 7 $1 > $1/cmnli/${lr}_${bs}_3_128.log & diff --git a/examples/model_compression/pp-minilm/finetuning/run_clue.py b/examples/model_compression/pp-minilm/finetuning/run_clue.py new file mode 100644 index 0000000000000000000000000000000000000000..897edc3089f6c434293cc0df9962ed33c9663d87 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_clue.py @@ -0,0 +1,335 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import LinearDecayWithWarmup + +sys.path.append("../") +from data import METRIC_CLASSES, MODEL_CLASSES, convert_example # noqa: E402 + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default="best_clue_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--do_train", type=strtobool, default=True, help="Whether do train.") + parser.add_argument("--do_eval", type=strtobool, default=False, help="Whether do train.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + return res + + +def do_eval(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, label_list=dev_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if dev_ds.label_list else "float32"), # label + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if dev_ds.label_list is None else len(dev_ds.label_list) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + metric = metric_class() + model.eval() + metric.reset() + for batch in dev_data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("acc: %s\n, " % (res), end="") + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds, dev_ds = load_dataset("clue", args.task_name, splits=("train", "dev")) + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, label_list=train_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, 
batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + best_acc = 0.0 + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + acc = evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if acc > best_acc: + best_acc = acc + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + print("best_acc: ", best_acc) + return + print("best_acc: ", best_acc) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + if args.do_train: + do_train(args) + if args.do_eval: + do_eval(args) diff --git a/examples/model_compression/pp-minilm/finetuning/run_clue.sh b/examples/model_compression/pp-minilm/finetuning/run_clue.sh new 
file mode 100644 index 0000000000000000000000000000000000000000..de7f577ee6af7e3b3effefdc632478c3c493e460 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_clue.sh @@ -0,0 +1,25 @@ + +export TASK_NAME=$1 +export LR=$2 +export BS=$3 +export EPOCH=$4 +export MAX_SEQ_LEN=$5 +export CUDA_VISIBLE_DEVICES=$6 +export MODEL_PATH=$7 + +python -u ./run_clue.py \ + --model_type ppminilm \ + --model_name_or_path ${MODEL_PATH} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --batch_size ${BS} \ + --learning_rate ${LR} \ + --num_train_epochs ${EPOCH} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \ + --device gpu \ diff --git a/examples/model_compression/pp-minilm/finetuning/run_one_search.sh b/examples/model_compression/pp-minilm/finetuning/run_one_search.sh new file mode 100644 index 0000000000000000000000000000000000000000..fbb5261d2f31136ec988b7d7e922440ff40ee1e9 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_one_search.sh @@ -0,0 +1,55 @@ +OUTPUT_DIR=$1 +TASK_NAME=$2 + +mkdir ${OUTPUT_DIR}/afqmc +mkdir ${OUTPUT_DIR}/tnews +mkdir ${OUTPUT_DIR}/ifly +mkdir ${OUTPUT_DIR}/ocnli +mkdir ${OUTPUT_DIR}/wsc +mkdir ${OUTPUT_DIR}/csl +mkdir ${OUTPUT_DIR}/cmnli + + +for lr in 1e-4 5e-5 3e-5 +do + for bs in 16 32 64 + do + echo bs: $bs, lr: $lr + + if [ $TASK_NAME == afqmc ] + then + sh run_clue.sh AFQMC $lr $bs 3 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/afqmc/${lr}_${bs}_3_128.log + fi + + if [ $TASK_NAME == tnews ] + then + sh run_clue.sh TNEWS $lr $bs 3 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/tnews/${lr}_${bs}_3_128.log + fi + + if [ $TASK_NAME == ifly ] + then + sh run_clue.sh IFLYTEK $lr $bs 6 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/ifly/${lr}_${bs}_6_128.log + fi + + if [ $TASK_NAME == ocnli ] + then + sh run_clue.sh OCNLI $lr $bs 6 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/ocnli/${lr}_${bs}_6_128.log + fi + + if [ $TASK_NAME == wsc ] + then + sh run_clue.sh CLUEWSC2020 $lr $bs 50 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/wsc/${lr}_${bs}_50_128.log + fi + + if [ $TASK_NAME == csl ] + then + sh run_clue.sh CSL $lr $bs 8 256 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/csl/${lr}_${bs}_8_256.log + fi + + if [ $TASK_NAME == cmnli ] + then + sh run_clue.sh CMNLI $lr $bs 3 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/cmnli/${lr}_${bs}_3_128.log + fi + done +done + diff --git a/examples/model_compression/pp-minilm/general_distill/README.md b/examples/model_compression/pp-minilm/general_distill/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c4be10fb4d562cd37554da93d2eef547ae5a70e3 --- /dev/null +++ b/examples/model_compression/pp-minilm/general_distill/README.md @@ -0,0 +1,64 @@ +# PP-MiniLM 任务无关蒸馏 + +## 环境要求 + +本实验基于 NVIDIA Tesla V100 32G 8 卡进行,训练周期约为 2-3 天。 + +## 原理介绍 + +任务无关知识蒸馏是用较大(层数更多、宽度更宽的)的基于 Transformer Layer 的预训练模型对较小(层数更少、宽度更窄的)的基于 Transformer Layer 的预训练模型进行蒸馏,从而得到更小、效果与较大模型更接近的预训练模型。 + +PP-MiniLM 参考了 MiniLMv2 提出的 Multi-Head Self-Attention Relation Distillation 蒸馏策略。MiniLMv2 算法是用 24 层 large-size 的教师模型倒数几层的 Q-Q、K-K、V-V 之间的relation对6层学生模型最后一层 Q-Q、K-K、V-V 之间的 relation 进行蒸馏。具体的做法是,首先将学生、教师用于蒸馏的层上的 Q、K、V 的 Head 数进行统一,然后计算各自 Q-Q、K-K、V-V 的点积,最后对教师和学生的点积计算KL散度损失。由于 relation 的 shape 是 `[batch_size, head_num, seq_len, seq_len]`,因此可以认为这里的relation是一种Token与Token之间的关系。 + +本方案在 MiniLMv2 策略的基础上,做了进一步优化: 通过引入多视角的注意力关系知识来进一步提升模型效果。MiniLMv2 的自注意力关系知识仅建模了 Token 与 Token 之间的关系,PP-MiniLM 
在此基础上额外引入了样本与样本间的自注意力关系知识,也就是挖掘出更多教师模型所蕴含的知识,从而进一步优化模型效果。
+
+具体来说,PP-MiniLM 利用了 `roberta-wwm-ext-large` 第 20 层的 Q-Q、K-K、V-V 之间的 Sample 与 Sample 之间的关系,对 6 层学生模型 PP-MiniLM 第 6 层的 Q-Q、K-K、V-V 之间的 Sample 与 Sample 之间的关系进行蒸馏。与 MiniLMv2 不同的是,PP-MiniLM 的策略需要在统一 Q、K、V 的 Head 数之后,将 Q、K、V 转置为 `[seq_len, head_num, batch_size, head_dim]`,这样 Q-Q、K-K、V-V 的点积就可以表达样本间的关系。经过我们的实验,这种方法在 CLUE 上的平均准确率比使用原始 MiniLMv2 算法高 0.36。
+
+### 数据介绍
+
+任务无关知识蒸馏的训练数据一般是预训练语料,可以使用公开的预训练语料 [CLUECorpus2020](https://github.com/CLUEbenchmark/CLUECorpus2020/)。需要将数据处理成一行一个句子的格式,再将数据文件分割成多个子文件(例如 64 个),放在同一个目录下。
+
+### 运行方式
+
+```shell
+sh run.sh # 包含 general_distill.py 的运行配置
+cd ..
+```
+
+其中 `general_distill.py` 参数释义如下:
+
+- `model_type` 指示了学生模型类型,当前仅支持 'ppminilm'、'roberta'。
+- `num_relation_heads` relation head 的个数,一般对于 large-size 的教师模型是 64,对于 base-size 的教师模型是 48。
+- `teacher_model_type` 指示了教师模型类型,当前仅支持 'roberta'。
+- `teacher_layer_index` 蒸馏时使用的教师模型的层。
+- `student_layer_index` 蒸馏时使用的学生模型的层。
+- `teacher_model_name_or_path` 教师模型的名称,例如 `'roberta-wwm-ext-large'`。
+- `max_seq_length` 最大的样本长度。
+- `num_layers` 学生模型的层数,目前仅支持 2、4、6。
+- `logging_steps` 日志间隔。
+- `max_steps` 最大迭代次数。
+- `warmup_steps` 学习率增长到 `learning_rate` 所需要的步数。
+- `save_steps` 保存模型的间隔步数。
+- `weight_decay` 表示 AdamW 优化器中使用的 weight_decay 的系数。
+- `output_dir` 训练相关文件以及模型保存的输出路径。
+- `device` 设备选择,推荐使用 gpu。
+- `input_dir` 训练数据目录。
+- `use_amp` 是否使用混合精度训练,默认 False。
+- `alpha` head 间关系的权重,默认 0.0。
+- `beta` 样本间关系的权重,默认 0.0。
+
+将最终得到的模型绝对路径保存至 `$GENERAL_MODEL_DIR`,例如:
+
+```shell
+GENERAL_MODEL_DIR=PaddleNLP/examples/model_compression/PP-MiniLM/general_distill/pretrain/model_400000
+```
+
+## 模型精度
+
+在 CLUE 各数据集上经过超参寻优后,各个任务上的最高准确率如下表:
+
+| AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL   | Avg   |
+| ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- |
+| 74.14 | 57.43 | 61.75   | 81.01 | 76.17 | 86.18       | 79.17 | 73.69 |
diff --git a/examples/model_compression/pp-minilm/general_distill/general_distill.py b/examples/model_compression/pp-minilm/general_distill/general_distill.py
new file mode 100644
index 0000000000000000000000000000000000000000..83ceed266944e64e92d65fdd6783de77d0666c67
--- /dev/null
+++ b/examples/model_compression/pp-minilm/general_distill/general_distill.py
@@ -0,0 +1,455 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
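+
+# Task-agnostic distillation script: a PP-MiniLM student is trained to match
+# the Q-Q / K-K / V-V self-attention relations of one layer of a RoBERTa
+# teacher (see calc_multi_relation_loss) on plain pre-training text.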
+ +import argparse +import os +import random +import time +from concurrent.futures import ThreadPoolExecutor + +import numpy as np +import paddle +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Tuple +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + PPMiniLMForSequenceClassification, + PPMiniLMModel, + PPMiniLMTokenizer, + RobertaModel, + RobertaTokenizer, +) +from paddlenlp.transformers.distill_utils import calc_multi_relation_loss, to_distill +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = { + "roberta": (RobertaModel, RobertaTokenizer), + "ppminilm": (PPMiniLMForSequenceClassification, PPMiniLMTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default="ppminilm", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--teacher_model_type", + default="roberta", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--student_model_name_or_path", + default=None, + type=str, + required=False, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--teacher_model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=6e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_layers", + default=6, + type=int, + help="Number layers of student model.", + ) + parser.add_argument( + "--teacher_layer_index", + default=19, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--student_layer_index", + default=5, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=512, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--num_relation_heads", + default=64, + type=int, + help="The number of relation heads is 48 and 64 for base and large-size teacher model.", + ) + parser.add_argument("--beta", default=0.0, type=float, help="0.0 usually") + parser.add_argument("--alpha", default=0.0, type=float, help="0.0 usually") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=-1, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.01, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=400000, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def create_pretraining_dataset(input_file, shared_list, args, worker_init, tokenizer): + train_data = PretrainingDataset(input_file=input_file, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + # files have been sharded, no need to dispatch again + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + # DataLoader cannot be pickled because of its place. + # If it can be pickled, use global function instead of lambda and use + # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch. 
+ batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_data, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + worker_init_fn=worker_init, + return_list=True, + ) + return train_data_loader, input_file + + +class PretrainingDataset(paddle.io.Dataset): + def __init__(self, input_file, tokenizer, max_seq_length): + self.input_file = input_file + f = open(input_file, "r") + input_ids = [] + for i, line in enumerate(f): + line = line[:max_seq_length] + tokenized_example = tokenizer(line, max_seq_len=max_seq_length) + input_ids.append(tokenized_example["input_ids"]) + + self.inputs = np.asarray(input_ids) + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs) + + def __getitem__(self, index): + input_ids = [np.asarray(self.inputs[index])] + return input_ids + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + args.model_type = args.model_type.lower() + + # For teacher + teacher_model_class, tokenizer_class = MODEL_CLASSES[args.teacher_model_type] + tokenizer = tokenizer_class.from_pretrained(args.teacher_model_name_or_path) + + # For student + model_class, _ = MODEL_CLASSES[args.model_type] + if args.num_layers == 6: + ppminilm = PPMiniLMModel( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=6, + hidden_act="relu", + intermediate_size=3072, + hidden_size=768, + ) # layer: 6 + elif args.num_layers == 4: + ppminilm = PPMiniLMModel( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=4, + hidden_act="relu", + intermediate_size=1024, + hidden_size=256, + num_attention_heads=16, + ) # layer: 4 + else: + ppminilm = PPMiniLMModel( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=2, + hidden_act="relu", + hidden_size=128, + intermediate_size=512, + ) # layer: 2 + student = model_class(ppminilm) + + teacher = teacher_model_class.from_pretrained(args.teacher_model_name_or_path) + pad_token_id = 0 + + if paddle.distributed.get_world_size() > 1: + student = paddle.DataParallel(student, find_unused_parameters=True) + teacher = paddle.DataParallel(teacher, find_unused_parameters=True) + + num_training_steps = args.max_steps + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in student.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=student.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + pool = ThreadPoolExecutor(1) + + teacher = to_distill(teacher, return_qkv=True, layer_index=args.teacher_layer_index) + student = to_distill(student, return_qkv=True, layer_index=args.student_layer_index) + + global_step = 0 + for epoch in range(args.num_train_epochs): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) + ] + files.sort() + num_files = len(files) + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + shared_file_list = {} + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + + data_file = files[ + ( + f_start_id * paddle.distributed.get_world_size() + + paddle.distributed.get_rank() + + remainder * f_start_id + ) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + train_data_loader, _ = create_pretraining_dataset(data_file, shared_file_list, args, worker_init, tokenizer) + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + dataset_future = pool.submit( + create_pretraining_dataset, data_file, shared_file_list, args, worker_init, tokenizer + ) + + kl_loss_fct = paddle.nn.KLDivLoss("sum") + train_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids = batch[0] + attention_mask = paddle.unsqueeze( + (input_ids == pad_token_id).astype(paddle.get_default_dtype()) * -1e4, axis=[1, 2] + ) + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "gelu", "softmax"]): + student(input_ids) + with paddle.no_grad(): + teacher(input_ids) + # Q-Q relation + q_t, q_s = teacher.outputs.q, student.outputs.q + batch_size = q_t.shape[0] + pad_seq_len = q_t.shape[2] + loss_q = calc_multi_relation_loss( + kl_loss_fct, q_s, q_t, attention_mask, args.num_relation_heads, args.alpha, args.beta + ) + del q_t, q_s + # K-K relation + k_t, k_s = teacher.outputs.k, student.outputs.k + loss_k = calc_multi_relation_loss( + kl_loss_fct, k_s, k_t, attention_mask, args.num_relation_heads, args.alpha, args.beta + ) + del k_t, k_s + + # V-V relation + v_t, v_s = teacher.outputs.v, student.outputs.v + loss_v = calc_multi_relation_loss( + kl_loss_fct, v_s, v_t, attention_mask, args.num_relation_heads, args.alpha, args.beta + ) + + del v_t, v_s + + loss = loss_q + loss_k + loss_v + loss /= args.num_relation_heads * pad_seq_len * batch_size + + 
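The three losses above compare teacher and student self-attention relations (Q·Qᵀ, K·Kᵀ, V·Vᵀ) after re-splitting the hidden states into `num_relation_heads` relation heads, following the MiniLMv2 recipe. Below is a simplified, standalone sketch of that idea; the actual `calc_multi_relation_loss` in `paddlenlp.transformers.distill_utils` additionally applies the padding mask and the `alpha`/`beta` weighting, and the shapes here are assumptions for illustration only:

```python
# Editor's simplified sketch of MiniLMv2-style relation distillation (not the
# original implementation). Relation matrices are seq x seq per relation head,
# so teacher and student may have different hidden sizes.
import paddle
import paddle.nn.functional as F

def relation_kl(kl_loss_fct, student_x, teacher_x, num_relation_heads):
    def to_relation(x):
        b, s, h = x.shape                                   # [batch, seq_len, hidden]
        x = x.reshape([b, s, num_relation_heads, h // num_relation_heads])
        x = x.transpose([0, 2, 1, 3])                       # [batch, heads, seq, dim]
        return paddle.matmul(x, x, transpose_y=True) / (h // num_relation_heads) ** 0.5

    s_rel, t_rel = to_relation(student_x), to_relation(teacher_x)   # [b, heads, seq, seq]
    return kl_loss_fct(F.log_softmax(s_rel, axis=-1), F.softmax(t_rel, axis=-1))

# toy usage with random hidden states
kl = paddle.nn.KLDivLoss(reduction="sum")
student_x = paddle.randn([2, 8, 64])
teacher_x = paddle.randn([2, 8, 128])
print(relation_kl(kl, student_x, teacher_x, num_relation_heads=4))
```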
if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_samples += args.batch_size + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + if global_step % args.logging_steps == 0: + logger.info( + "global step: %d, epoch: %d, batch: %d, loss: %f, " + "lr: %f, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + epoch, + step, + loss, + optimizer.get_lr(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + total_samples / (args.logging_steps * train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = student._layers if isinstance(student, paddle.DataParallel) else student + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + + del train_data_loader + train_data_loader, data_file = dataset_future.result(timeout=None) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/pp-minilm/general_distill/run.sh b/examples/model_compression/pp-minilm/general_distill/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..be940e7c6d8bdd54f78989c1ee148abb93eda339 --- /dev/null +++ b/examples/model_compression/pp-minilm/general_distill/run.sh @@ -0,0 +1,70 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
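One detail from the training loop in `general_distill.py` above is worth a quick standalone check: the additive attention mask maps padding positions to a large negative value and real tokens to zero, so padded positions are suppressed when the mask is added to the attention scores. An editor's toy illustration (here with `float32` standing in for `paddle.get_default_dtype()`):

```python
# Editor's sketch of the additive padding mask built in the loop above.
import paddle

pad_token_id = 0
input_ids = paddle.to_tensor([[5, 7, 9, 0, 0]])   # last two positions are padding
attention_mask = paddle.unsqueeze(
    (input_ids == pad_token_id).astype("float32") * -1e4, axis=[1, 2]
)
print(attention_mask.shape)    # [1, 1, 1, 5]
print(attention_mask.numpy())  # real tokens -> 0, pad positions -> -10000
```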
+ + +set -eux + +unset CUDA_VISIBLE_DEVICES + +bs=128 +maxlen=128 +numH=64 +lr=6e-4 +maxStep=400000 +warmStep=4000 +wd=1e-2 + +teacher=roberta +teacherModel=roberta-wwm-ext-large + +alpha=0 +beta=1.0 +mode=hardest +use_amp=True +teacher_layer_index=19 +student_layer_index=5 +num_layers=6 + +hp_config=bs${bs}_maxlen${maxlen}_lr${lr}_wd${wd}_numH${numH}_maxStep${maxStep}_warmStep${warmStep}_adamW_maxnorm1p0_teacher_${teacherModel}_coldboot_teacher_vocab_index${teacher_layer_index}_4l-312d-batchbatch + +export PYTHONPATH=../../../../:$PYTHONPATH +output_dir="./pretrain_${hp_config}" + +mkdir -p ${output_dir} +cp ./general_distill.py ${output_dir}/ +cp ../../../../paddlenlp/transformers/distill_utils.py ${output_dir}/ + + +python3 -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" general_distill.py \ + --model_type ppminilm \ + --num_relation_heads ${numH} \ + --teacher_model_type ${teacher} \ + --teacher_layer_index ${teacher_layer_index} \ + --student_layer_index ${student_layer_index} \ + --teacher_model_name_or_path ${teacherModel} \ + --max_seq_length ${maxlen} \ + --num_layers ${num_layers} \ + --batch_size ${bs} \ + --learning_rate ${lr} \ + --logging_steps 20 \ + --max_steps ${maxStep} \ + --warmup_steps ${warmStep} \ + --save_steps 20000 \ + --weight_decay ${wd} \ + --output_dir ${output_dir} \ + --device gpu \ + --input_dir dataset/ \ + --use_amp ${use_amp} \ + --alpha ${alpha} \ + --beta ${beta} \ diff --git a/examples/model_compression/pp-minilm/pp-minilm.png b/examples/model_compression/pp-minilm/pp-minilm.png new file mode 100644 index 0000000000000000000000000000000000000000..8fc8431697883e968f4602150f217a80a0c28f0e Binary files /dev/null and b/examples/model_compression/pp-minilm/pp-minilm.png differ diff --git a/examples/model_compression/pp-minilm/pruning/export.sh b/examples/model_compression/pp-minilm/pruning/export.sh new file mode 100644 index 0000000000000000000000000000000000000000..ee7e5a5867755dbb458646d84f58336889befe8b --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/export.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MODEL_PATH=$1 +TASK_NAME=$2 +python export_model.py --model_type ppminilm \ + --task_name ${TASK_NAME} \ + --model_name_or_path ${MODEL_PATH}/${TASK_NAME}/0.75/best_model \ + --sub_model_output_dir ${MODEL_PATH}/${TASK_NAME}/0.75/sub/ \ + --static_sub_model ${MODEL_PATH}/${TASK_NAME}/0.75/sub_static/float \ + --n_gpu 1 --width_mult 0.75 diff --git a/examples/model_compression/pp-minilm/pruning/export_all.sh b/examples/model_compression/pp-minilm/pruning/export_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..78a782a74c563768cb35ac3cc6004209f21429f9 --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/export_all.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MODEL_PATH=pruned_models + +for TASK_NAME in AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL + +do + python export_model.py --model_type ppminilm \ + --model_name_or_path ${MODEL_PATH}/${TASK_NAME}/0.75/best_model \ + --sub_model_output_dir ${MODEL_PATH}/${TASK_NAME}/0.75/sub/ \ + --static_sub_model ${MODEL_PATH}/${TASK_NAME}/0.75/sub_static/float \ + --n_gpu 1 --width_mult 0.75 + +done diff --git a/examples/model_compression/pp-minilm/pruning/export_model.py b/examples/model_compression/pp-minilm/pruning/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..e2088d1547f588cd516a16168b8a58616a78b908 --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/export_model.py @@ -0,0 +1,187 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
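The `--width_mult 0.75` passed by the export scripts above keeps roughly three quarters of the attention heads and FFN neurons of each layer of the fine-tuned supernet. For the 6-layer, 768-hidden PP-MiniLM student configured in `general_distill.py` above, an editor's illustration of the arithmetic (12 attention heads per layer is the model's default and an assumption here):

```python
# Editor's illustration of what width_mult 0.75 keeps per transformer layer,
# assuming the 6L-768H student config (12 attention heads, 3072 FFN dim).
num_attention_heads, intermediate_size, width_mult = 12, 3072, 0.75
kept_heads = int(num_attention_heads * width_mult)   # 9 heads kept
kept_ffn = int(intermediate_size * width_mult)       # 2304 FFN neurons kept
print(kept_heads, kept_ffn)
```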
+ +import argparse +import json +import math +import os +import sys + +import paddle +from paddleslim.nas.ofa import OFA, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.transformers import PPMiniLMModel + +sys.path.append("../") +from data import METRIC_CLASSES, MODEL_CLASSES # noqa: E402 + + +def ppminilm_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + + if attention_mask is None: + attention_mask = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids) + + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +PPMiniLMModel.forward = ppminilm_forward + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--sub_model_output_dir", + default=None, + type=str, + required=True, + help="The output directory where the sub model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--static_sub_model", + default=None, + type=str, + help="The output directory where the sub static model will be written. If set to None, not export static model", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--n_gpu", type=int, default=1, help="number of gpus to use, 0 for cpu.") + parser.add_argument("--width_mult", type=float, default=1.0, help="width mult you want to export") + parser.add_argument("--depth_mult", type=float, default=1.0, help="depth mult you want to export") + args = parser.parse_args() + return args + + +def do_export(args): + paddle.set_device("gpu" if args.n_gpu else "cpu") + args.model_type = args.model_type.lower() + args.task_name = args.task_name.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config_path = os.path.join(args.model_name_or_path, "config.json") + cfg_dict = dict(json.loads(open(config_path).read())) + + kept_layers_index = {} + if args.depth_mult < 1.0: + depth = round(cfg_dict["init_args"][0]["num_hidden_layers"] * args.depth_mult) + cfg_dict["init_args"][0]["num_hidden_layers"] = depth + for idx, i in enumerate(range(1, depth + 1)): + kept_layers_index[idx] = math.floor(i / args.depth_mult) - 1 + + os.rename(config_path, config_path + "_bak") + with open(config_path, "w", encoding="utf-8") as f: + f.write(json.dumps(cfg_dict, ensure_ascii=False)) + + model = model_class.from_pretrained(args.model_name_or_path) + + origin_model = model_class.from_pretrained(args.model_name_or_path) + + os.rename(config_path + "_bak", config_path) + + sp_config = supernet(expand_ratio=[1.0, args.width_mult]) + model = Convert(sp_config).convert(model) + + ofa_model = OFA(model) + + sd = paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams")) + + if len(kept_layers_index) == 0: + ofa_model.model.set_state_dict(sd) + else: + for name, params in ofa_model.model.named_parameters(): + if "encoder" not in name: + params.set_value(sd[name]) + else: + idx = int(name.strip().split(".")[3]) + mapping_name = name.replace("." + str(idx) + ".", "." 
+ str(kept_layers_index[idx]) + ".") + params.set_value(sd[mapping_name]) + + best_config = utils.dynabert_config(ofa_model, args.width_mult) + for name, sublayer in ofa_model.model.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + origin_model_new = ofa_model.export( + best_config, + input_shapes=[[1, args.max_seq_length], [1, args.max_seq_length]], + input_dtypes=["int64", "int64"], + origin_model=origin_model, + ) + + for name, sublayer in origin_model_new.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + output_dir = os.path.join(args.sub_model_output_dir, "model_width_%.5f" % args.width_mult) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = origin_model_new + model_to_save.save_pretrained(output_dir) + + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + ] + origin_model_new = paddle.jit.to_static(origin_model_new, input_spec=input_spec) + + paddle.jit.save(origin_model_new, args.static_sub_model) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_export(args) diff --git a/examples/model_compression/pp-minilm/pruning/prune.py b/examples/model_compression/pp-minilm/pruning/prune.py new file mode 100644 index 0000000000000000000000000000000000000000..19427a8e4ea5e0ce6e30cfaae96bf49f1b96e565 --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/prune.py @@ -0,0 +1,384 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
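Once `export_model.py` above has written the static sub-model (e.g. `.../sub_static/float.pdmodel`), it can be smoke-tested by loading it back in dygraph mode. A hypothetical quick check (editor's sketch, not part of the original scripts; the path prefix follows the layout used by `export_all.sh`):

```python
# Editor's sketch: load the exported static sub-model and run a dummy batch.
import numpy as np
import paddle

model = paddle.jit.load("pruned_models/AFQMC/0.75/sub_static/float")
model.eval()
input_ids = paddle.to_tensor(np.zeros([1, 128], dtype="int64"))
token_type_ids = paddle.to_tensor(np.zeros([1, 128], dtype="int64"))
outputs = model(input_ids, token_type_ids)   # classification output for the dummy batch
print(outputs)
```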
+ +import argparse +import math +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddleslim.nas.ofa import OFA, DistillConfig, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup, PPMiniLMModel +from paddlenlp.transformers.ofa_utils import ( + compute_neuron_head_importance, + encoder_layer_ofa_forward, + encoder_ofa_forward, + mha_ofa_forward, + prepare_qkv_ofa, + reorder_neuron_head, +) +from paddlenlp.utils.log import logger + +sys.path.append("../") +from data import METRIC_CLASSES, MODEL_CLASSES, convert_example # noqa: E402 + +paddle.nn.MultiHeadAttention.forward = mha_ofa_forward +paddle.nn.MultiHeadAttention._prepare_qkv = prepare_qkv_ofa +paddle.nn.TransformerEncoder.forward = encoder_ofa_forward +paddle.nn.TransformerEncoderLayer.forward = encoder_layer_ofa_forward + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--glue_dir", + default="/root/.paddlenlp/datasets/Clue/", + type=str, + required=False, + help="The Glue directory.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--lambda_logit", default=1.0, type=float, help="lambda for logit loss.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--width_mult_list", + nargs="+", + type=str, + default=["1.0", "5 / 6", "2 / 3", "0.5"], + help="width mult of compression", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, width_mult, student=False): + model.eval() + metric.reset() + for i, batch in enumerate(data_loader): + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids, attention_mask=[None, None]) + if isinstance(logits, tuple): + logits = logits[0] + correct = metric.compute(logits, labels) + metric.update(correct) + + res = metric.accumulate() + print("width_mult: %s, acc: %s, " % (str(width_mult), res), end="") + model.train() + return res + + +# monkey patch for ppminilm forward to accept [attention_mask, head_mask] as attention_mask +def ppminilm_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None]): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids, token_type_ids, position_ids) + encoded_layer = self.encoder(embedding_output, attention_mask) + pooled_output = self.pooler(encoded_layer) + + return encoded_layer, pooled_output + + +PPMiniLMModel.forward = ppminilm_forward + + +def soft_cross_entropy(inp, target): + inp_likelihood = F.log_softmax(inp, axis=-1) + target_prob = F.softmax(target, axis=-1) + return -1.0 * paddle.mean(paddle.sum(inp_likelihood * target_prob, axis=-1)) + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + train_ds = load_dataset("clue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, label_list=train_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, 
shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + num_labels = 1 if train_ds.label_list is None else len(train_ds.label_list) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step1: Initialize a dictionary to save the weights from the origin PPMiniLM model. + origin_weights = model.state_dict() + + # Step2: Convert origin model to supernet. + sp_config = supernet(expand_ratio=[1.0]) + model = Convert(sp_config).convert(model) + # Use weights saved in the dictionary to initialize supernet. + utils.set_state_dict(model, origin_weights) + del origin_weights + + super_sd = paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams")) + model.set_state_dict(super_sd) + + # Step3: Define teacher model. + teacher_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step4: Config about distillation. + mapping_layers = ["ppminilm.embeddings"] + for idx in range(model.ppminilm.config["num_hidden_layers"]): + mapping_layers.append("ppminilm.encoder.layers.{}".format(idx)) + + default_distill_config = { + "lambda_distill": 0.1, + "teacher_model": teacher_model, + "mapping_layers": mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + # Step5: Config in supernet training. + ofa_model = OFA(model, distill_config=distill_config, elastic_order=["width"]) + + criterion = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + # Step6: Calculate the importance of neurons and head, + # and then reorder them according to the importance. + head_importance, neuron_importance = compute_neuron_head_importance( + ofa_model.model, + dev_data_loader, + loss_fct=criterion, + num_layers=model.ppminilm.config["num_hidden_layers"], + num_heads=model.ppminilm.config["num_attention_heads"], + ) + reorder_neuron_head(ofa_model.model, head_importance, neuron_importance) + + if paddle.distributed.get_world_size() > 1: + ofa_model.model = paddle.DataParallel(ofa_model.model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + global_step = 0 + tic_train = time.time() + best_res = 0.0 + args.width_mult_list = [eval(width_mult) for width_mult in args.width_mult_list] + for epoch in range(num_train_epochs): + # Step7: Set current epoch and task. + ofa_model.set_epoch(epoch) + ofa_model.set_task("width") + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, _ = batch + + for width_mult in args.width_mult_list: + # Step8: Broadcast supernet config from width_mult, + # and use this config in supernet training. + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + args.lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate(teacher_model, metric, dev_data_loader, width_mult=100) + print("eval done total : %s s" % (time.time() - tic_eval)) + for idx, width_mult in enumerate(args.width_mult_list): + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + tic_eval = time.time() + res = evaluate(ofa_model, metric, dev_data_loader, width_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + + if best_res < res: + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + best_res = res + if global_step >= num_training_steps: + print("best_res: ", best_res) + return + print("best_res: ", best_res) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/pp-minilm/pruning/prune.sh b/examples/model_compression/pp-minilm/pruning/prune.sh new file mode 100644 index 0000000000000000000000000000000000000000..f760b429bed57e6c24abcc9adc59fd17d6f0fb2c --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/prune.sh @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export TASK_NAME=$1 +export LR=$2 +export BATCH_SIZE=$3 +export PRE_EPOCHS=$4 +export SEQ_LEN=$5 +export CUDA_VISIBLE_DEVICES=$6 +export STUDENT_DIR=$7 +export WIDTH_LIST=$8 + +python -u ./prune.py --model_type ppminilm \ + --model_name_or_path ${STUDENT_DIR} \ + --task_name $TASK_NAME --max_seq_length ${SEQ_LEN} \ + --batch_size ${BATCH_SIZE} \ + --learning_rate ${LR} \ + --num_train_epochs ${PRE_EPOCHS} \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./pruned_models/$TASK_NAME/0.75/best_model \ + --device gpu \ + --width_mult_list ${WIDTH_LIST} + diff --git a/examples/model_compression/pp-minilm/quantization/quant_all.sh b/examples/model_compression/pp-minilm/quantization/quant_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..1b39c8ca0e5dd7c46119f3eb89053c1763448e1c --- /dev/null +++ b/examples/model_compression/pp-minilm/quantization/quant_all.sh @@ -0,0 +1,20 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MODEL_DIR=../pruning/pruned_models/ + +for task in AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL +do + python quant_post.py --task_name ${task} --input_dir ${MODEL_DIR}/${task}/0.75/sub_static +done diff --git a/examples/model_compression/pp-minilm/quantization/quant_post.py b/examples/model_compression/pp-minilm/quantization/quant_post.py new file mode 100644 index 0000000000000000000000000000000000000000..7436f4f9f5aa2e9093fea640cccc05047d61257d --- /dev/null +++ b/examples/model_compression/pp-minilm/quantization/quant_post.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
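The `quant_post.py` below calls `paddleslim.quant.quant_post_static` with `weight_quantize_type="channel_wise_abs_max"` and 8-bit weights for the matmul ops. As a toy illustration of what abs-max int8 quantization does to a single weight channel (editor's sketch, pure NumPy):

```python
# Editor's toy illustration of abs-max int8 weight quantization
# ("channel_wise_abs_max" applies this per output channel).
import numpy as np

w = np.array([0.02, -0.5, 0.31, -0.07], dtype=np.float32)   # one weight channel
scale = np.abs(w).max() / 127.0                             # abs-max scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale               # values used at inference time
print(w_int8)
print(np.abs(w - w_dequant).max())                          # small quantization error
```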
+ +import argparse +import os +import sys +from functools import partial + +import paddle +import paddleslim + +from paddlenlp.data import Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import PPMiniLMTokenizer + +sys.path.append("../") +from data import convert_example # noqa: E402 + +parser = argparse.ArgumentParser() + +parser.add_argument("--task_name", type=str, required=True, help="task_name") +parser.add_argument( + "--input_dir", type=str, default="../pruning/pruned_models/", required=True, help="Input task model directory." +) +parser.add_argument("--output_dir", type=str, default="./", required=False, help="Output model directory.") + +parser.add_argument( + "--save_model_filename", type=str, default="int8.pdmodel", required=False, help="File name of quantified model." +) + +parser.add_argument( + "--save_params_filename", + type=str, + default="int8.pdiparams", + required=False, + help="File name of quantified model's parameters.", +) + +parser.add_argument( + "--input_model_filename", type=str, default="float.pdmodel", required=False, help="File name of float model." +) + +parser.add_argument( + "--input_param_filename", + type=str, + default="float.pdiparams", + required=False, + help="File name of float model's parameters.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", +) + +parser.add_argument( + "--use_faster_tokenizer", + type=strtobool, + default=False, + help="Whether to use FasterTokenizer to accelerate training or further inference.", +) + +parser.add_argument( + "--model_name_or_path", + default="ppminilm-6l-768h", + type=str, + help="Model name or the directory of model directory.", +) + +args = parser.parse_args() + + +def quant_post(args, batch_size=8, algo="avg"): + place = paddle.set_device("gpu") + exe = paddle.static.Executor(place) + args.task_name = args.task_name.lower() + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + if args.use_faster_tokenizer: + trans_func = partial(convert_example, label_list=dev_ds.label_list) + else: + tokenizer = PPMiniLMTokenizer.from_pretrained("ppminilm-6l-768h") + trans_func = partial( + convert_example, label_list=dev_ds.label_list, tokenizer=tokenizer, max_seq_length=128, is_test=True + ) + dev_ds = dev_ds.map(trans_func, lazy=True) + + def batch_generator_func(): + batch_data = [[], []] + for data in dev_ds: + batch_data[0].append(data[0]) + batch_data[1].append(data[1]) + if len(batch_data[0]) == batch_size: + input_ids = Pad(axis=0, pad_val=0)(batch_data[0]) + segment_ids = Pad(axis=0, pad_val=0)(batch_data[1]) + yield [input_ids, segment_ids] + batch_data = [[], []] + + def batch_generator_func_using_faster_tokenizer(): + if "sentence" in dev_ds[0]: + batch_data = [] + else: + batch_data = [[], []] + for data in dev_ds: + if "sentence" in data: + batch_data.append(data["sentence"]) + if len(batch_data) == batch_size: + yield {"text": batch_data} + batch_data = [] + else: + batch_data[0].append(data["sentence1"]) + batch_data[1].append(data["sentence2"]) + if len(batch_data[0]) == batch_size: + yield {"text": batch_data[0], "text_pair": batch_data[1]} + batch_data = [[], []] + + paddleslim.quant.quant_post_static( + exe, + args.input_dir, + os.path.join(args.output_dir, args.task_name + "_quant_models", algo + str(batch_size)), + 
save_model_filename=args.save_model_filename, + save_params_filename=args.save_params_filename, + algo=algo, + hist_percent=0.9999, + batch_generator=batch_generator_func if not args.use_faster_tokenizer else None, + data_loader=batch_generator_func_using_faster_tokenizer if args.use_faster_tokenizer else None, + model_filename=args.input_model_filename, + params_filename=args.input_param_filename, + quantizable_op_type=["matmul", "matmul_v2"], + weight_bits=8, + weight_quantize_type="channel_wise_abs_max", + batch_nums=1, + ) + + +if __name__ == "__main__": + paddle.enable_static() + for batch_size in [4, 8]: + for algo in ["abs_max", "avg", "mse", "hist"]: + quant_post(args, batch_size, algo) diff --git a/examples/model_interpretation/README.md b/examples/model_interpretation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ab3adf4eddabf6a9e60f10cbaf583934703efaf1 --- /dev/null +++ b/examples/model_interpretation/README.md @@ -0,0 +1,255 @@ +NLP可解释评估 +=== +深度学习模型在很多NLP任务上已经取得巨大成功,但其常被当作一个黑盒使用,内部预测机制对使用者是不透明的。这使得深度学习模型结果不被人信任,增加落地难度,尤其是在医疗、法律等特殊领域。同时,当模型出现效果不好或鲁棒性差等问题时,由于不了解其内部机制,导致很难对模型进行优化。近期,深度学习模型的可解释性被越来越多的人关注。但模型的可解释性评估还不够完善,本模块提供了3个NLP任务的评测数据和相关评测指标,旨在评估模型的可解释性。模块包含以下功能: + + 1. 完善可解释性评估体系,提供了评测数据和对应的评测指标 + 2. 提供了3种典型的证据抽取方法,分别是基于注意力(attention-based)、梯度(gradient-based)和线性模型(LIME)的证据抽取方法,并在LSTM、Transformer(RoBERTa-base和RoBERTa-large)等常用模型网络结构上完成实验验证,分别验证模型结构复杂度、模型参数规模对模型可解释的影响 + 3. 提供模型较全面的评估报告,含模型本身准确率等效果、以及在3个可解释评测指标上的结果 + +<p align="center"> +<img src="imgs/structure.png" /> <br> +</p> + +可解释评估体系 +--- +### 评测数据 +我们提供了情感分析、相似度计算、阅读理解等三个NLP任务上的中英文数据集。对于每一个数据集,人工标注了证据数据和扰动数据。 + + 证据数据:给出模型预测依赖的证据(从人类认知角度),其由输入中的若干词构成。我们的标注标准包含3个维度:充分性(sufficiency)、简洁性(concision)、可理解性(understandability)。 + 扰动数据:旨在评估模型在扰动下的证据一致性。我们从抗干扰性、敏感性和泛化性等角度构建了扰动数据,其中,“敏感性”和“泛化性”维度下构建的数据可能会改变证据。 + +#### 样例数据(来自中文情感分析任务): + +<p align="center"> +<img src="imgs/example1.png" /> <br> +</p> + +#### 数据规模 +<table> + <tr> + <td rowspan="2">任务</td> + <td colspan="3">英文模型</td> + <td colspan="3">中文模型</td> + </tr> + <tr> + <td>规模</td> + <td>证据平均长度比例</td> + <td>证据平均数量</td> + <td>规模</td> + <td>证据平均长度比例</td> + <td>证据平均数量</td> + </tr> + <tr> + <td>情感分析</td> + <td>1,499</td> + <td>19.20%</td> + <td>2.1</td> + <td>1,646</td> + <td>30.10%</td> + <td>1.4</td> + </tr> + <tr> + <td>相似度任务</td> + <td>1,659</td> + <td>52.20%</td> + <td>1.0</td> + <td>1,629</td> + <td>70.50%</td> + <td>1.0</td> + </tr> + <tr> + <td>阅读理解</td> + <td>1,507</td> + <td>10.20%</td> + <td>1.0</td> + <td>1,762</td> + <td>9.60%</td> + <td>1.0</td> + </tr> +</table> + +### 评估指标 +__合理性__:评估模型预测依赖的证据与人工标注证据的拟合度,我们这里使用macro-F1作为评估指标,其中模型预测依赖证据可以由本模块提供的证据分析方法(位于/model_interpretation/task/目录下)给出。<br> + +<p align="center"> +<img src="imgs/equation1.png" /> <br> +</p> +其中S<sub>i</sub><sup>p</sup>和S<sub>i</sub><sup>g</sup>分别代表针对第i条输入模型预测证据和人工标注证据,N代表数据集中数据的数量<br> + +__一致性__:评估(原始输入,对应扰动输入)对中词重要度排序的一致性。证据分析方法对输入中每个词赋予一个重要度,基于该重要度对输入中所有词进行排序。我们使用搜索排序中的MAP(mean average precision)指标来计算两个排序的一致性。这里给出了MAP的两种计算方式,分别见以下两个公式:<br> +公式一(正在使用):<br> +<p align="center"> +<img src="imgs/equation5.png" /> <br> +</p> +公式二:<br> +<p align="center"> +<img src="imgs/equation2.png" /> <br> +</p> +其中X<sup>o</sup>和X<sup>d</sup>分别代表原始输入和扰动输入的词重要度排序序列。|X<sup>d</sup>|代表X<sup>d</sup>中词的个数,X<sup>o</sup><sub>1:j</sub>表示X<sup>o</sup>中前j最重要的词。函数G(x, Y)检查词x是否存在于列表Y中,如果存在则G(x, Y)=1。MAP越高表示两个序列排序一致性越高<br> + +__忠诚性__:评估模型给出的证据的忠诚性,即模型是否真的基于给出的证据进行预测的。这里从充分性和完备性两个角度进行评估。充分性,即模型给出的证据是否包含了预测需要的全部信息(即y<sub>r<sub>i</sub></sub> = 
y<sub>x<sub>i</sub></sub>,其中r<sub>i</sub>表示输入x<sub>i</sub>的证据,y<sub>x</sub>表示模型对输入x的预测结果);完备性,即模型对输入x的预测结果(即y<sub>x<sub>i</sub>\r<sub>i</sub></sub> ≠ y<sub>x<sub>i</sub></sub>,其中x<sub>i</sub>\r<sub>i</sub>表示从输入x<sub>i</sub>中去除证据r<sub>i</sub>)。基于这两个维度,我们提出了一个新的指标New-P,计算方式如下:<br> + +<p align="center"> +<img src="imgs/equation3.png" /> <br> +</p> +<p align="center"> +<img src="imgs/equation4.png" /> <br> +</p> + +### 证据抽取方法 +证据抽取方法(rationale-extraction),顾名思义,就是从输入中抽取对模型预测至关重要的词,又被称为后验解释方法(post-hoc explanation methods)。 +该平台提供了3种典型的证据抽取方法,分别是:基于注意力机制(attention-based)的解释方法、基于梯度(gradient-based)的解释方法,和基于线性模型(linear-based)的解释方法:<br> + +Attention-based([Jain and Wallace, 2019](https://arxiv.org/pdf/1902.10186.pdf)): + + 将注意力分数作为词重要度。注意力分数的获取取决于具体模型架构,我们提供了基于LSTM和transformer框架的提取方法,见每个具体任务下的saliency_map目录。 + +Gradient-based([Sundararajan et al., 2017](https://arxiv.org/pdf/1703.01365.pdf)): + + 基于梯度给出每个词重要度。我们这里给出了integrated gradient计算方式,具体见saliency_map目录或论文[Axiomatic attribution for deep networks](https://arxiv.org/pdf/1703.01365.pdf)。 + +Linear-based([Ribeiro et al.. 2016](https://arxiv.org/pdf/1602.04938.pdf)): + + 使用线性模型局部模拟待验证模型,线性模型学习到的词的权重作为该词对预测结果的重要度,详细见论文[" why should i trust you?" explaining the predictions of any classifier](https://arxiv.org/pdf/1602.04938.pdf)。 + +### 三个任务的被评估模型 +为验证模型复杂度、参数规模对可解释的影响,针对每个任务,我们分别提供了基于LSTM(简单结构)的模型、及Transformer-based预训练模型(复杂结构),其中,对于预训练模型,提供了base版本和large版本。<br> +模型代码位置:/model_interpretation/task/{task}/,({task}可取值为["senti","similarity","mrc"],其中senti代表情感分析,similarity代表相似度计算,mrc代表阅读理解)<br> +模型运行及依赖环境请参考下方的“平台使用”。 + + +## 平台使用 +### 环境准备 +代码运行需要 Linux 主机,Python 3.8(推荐,其他低版本未测试过) 和 PaddlePaddle 2.1 以上版本。 + +### 推荐的环境 + +* 操作系统 CentOS 7.5 +* Python 3.8.12 +* PaddlePaddle 2.1.0 +* PaddleNLP 2.2.4 + +除此之外,需要使用支持 GPU 的硬件环境。 + +### PaddlePaddle + +需要安装GPU版的PaddlePaddle。 + +``` +# GPU 版本 +pip3 install paddlepaddle-gpu +``` + +更多关于 PaddlePaddle 的安装教程、使用方法等请参考[官方文档](https://www.paddlepaddle.org.cn/#quick-start). 
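Stepping back to the plausibility metric defined in the 评估指标 section above: in practice the macro-F1 is a token-level F1 between each predicted rationale S<sub>i</sub><sup>p</sup> and the gold rationale S<sub>i</sub><sup>g</sup>, averaged over all N instances. A minimal, standalone sketch of that computation (editor's illustration only; the official scorer is `model_interpretation/evaluation/plausibility/run_f1.sh`):

```python
# Editor's sketch of the plausibility metric: token-level F1 between predicted
# and gold rationale tokens, macro-averaged over the dataset.
def token_f1(pred_tokens, gold_tokens):
    pred, gold = set(pred_tokens), set(gold_tokens)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pred_rationales, gold_rationales):
    scores = [token_f1(p, g) for p, g in zip(pred_rationales, gold_rationales)]
    return sum(scores) / len(scores)

# toy example with two instances
print(macro_f1([["不", "是"], ["很", "严", "峻"]], [["不", "是", "红", "薯"], ["严", "峻"]]))
```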
+ +### 第三方 Python 库 +除 PaddlePaddle 及其依赖之外,还依赖其它第三方 Python 库,位于代码根目录的 requirements.txt 文件中。 + +可使用 pip 一键安装 + +```pip3 install -r requirements.txt``` + +## 数据准备 +### 模型训练数据 +#### 情感分析任务: + +中文推荐使用ChnSentiCorp,英文推荐使用SST-2。本模块提供的中英文情感分析模型就是基于这两个数据集的。若修改训练数据集,请修改/model_interpretation/task/senti/pretrained_models/train.py (RoBERTa) 以及 /model_interpretation/task/senti/rnn/train.py (LSTM)。 + +[//]:数据集会被缓存到/home/work/.paddlenlp/datasets/目录下 + +#### 相似度计算: + +中文推荐使用LCQMC,英文推荐使用QQP。本模块提供的中英文相似度计算模型就是基于这两个数据集的,若修改训练数据集,请修改/model_interpretation/task/similarity/pretrained_models/train_pointwise.py(RoBERTa)以及/model_interpretation/task/similarity/simnet/train.py(LSTM)。 + +#### 阅读理解中英文: + +中文推荐使用[DuReader_Checklist](https://dataset-bj.cdn.bcebos.com/lic2021/dureader_checklist.dataset.tar.gz),英文推荐使用[SQUDA2](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)。请将阅读理解训练数据放置在/model_interpretation/task/mrc/data目录下。 + +### 下载预训练模型 + +使用paddlenlp框架自动缓存模型文件。 + +### 其他数据下载 +请运行download.sh自动下载 + +### 评测数据 +评测数据样例位于/model_interpretation/data/目录下,每一行为一条JSON格式的数据。 +#### 情感分析数据格式: + id: 数据的编号,作为该条数据识别key; + context:原文本数据; + sent_token:原文本数据的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + sample_type: 数据的类性,分为原始数据(ori)和扰动数据(disturb); + rel_ids:与原始数据关联的扰动数据的id列表(只有原始数据有); + +#### 相似度数据格式: + id:数据的编号,作为该条数据识别key; + query(英文中为sentence1):句子1的原文本数据; + title(英文中为sentence2):句子2的原文本数据; + text_q_seg:句子1的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + text_t_seg:句子2的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + sample_type: 数据的类性,分为原始数据(ori)和扰动数据(disturb); + rel_ids:与原始数据关联的扰动数据的id列表(只有原始数据有); + +#### 阅读理解数据格式: + id:数据的编号,作为该条数据识别key; + title:文章标题; + context:文章主体; + question:文章的问题; + sent_token:原文本数据的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + sample_type: 数据的类性,分为原始数据(ori)和扰动数据(disturb); + rel_ids:与原始数据关联的扰动数据的id列表(只有原始数据有); +## 模型运行 +### 模型预测: + + model_interpretation/task/{task}/run_inter_all.sh (生成所有结果) + model_interpretation/task/{task}/run_inter.sh (生成单个配置的结果,配置可以选择不同的评估模型,以及不同的证据抽取方法、语言) + +(注:{task}可取值为["senti","similarity","mrc"],其中senti代表情感分析,similarity代表相似度计算,mrc代表阅读理解) + +### 证据抽取: + cd model_interpretation/rationale_extraction + ./generate.sh + +### 可解释评估: +#### 合理性(plausibility): + model_interpretation/evaluation/plausibility/run_f1.sh +#### 一致性(consistency): + model_interpretation/evaluation/consistency/run_map.sh +#### 忠诚性(faithfulness): + model_interpretation/evaluation/faithfulness/run_newp.sh + +### 评估报告 +中文情感分析评估报告样例: +<table> + <tr> + <td rowspan="2">模型 + 证据抽取方法</td> + <td colspan="4">情感分析</td> + </tr> + <tr> + <td>Acc</td> + <td>Macro-F1</td> + <td>MAP</td> + <td>New_P</td> + </tr> + <tr> + <td>LSTM + IG</td> + <td>56.8</td> + <td>36.8</td> + <td>59.8</td> + <td>91.4</td> + </tr> + <tr> + <td>RoBERTa-base + IG</td> + <td>62.4</td> + <td>36.4</td> + <td>48.7</td> + <td>48.9</td> + </tr> + <tr> + <td>RoBERTa-large + IG</td> + <td>65.3</td> + <td>38.3</td> + <td>41.9</td> + <td>37.8</td> + </tr> +</table> diff --git a/examples/model_interpretation/data/mrc_ch b/examples/model_interpretation/data/mrc_ch new file mode 100644 index 0000000000000000000000000000000000000000..09a10298751f398085cafadd46d7b111eb79e0ad --- /dev/null +++ b/examples/model_interpretation/data/mrc_ch @@ -0,0 +1,100 @@ +{"id": 1, "title": "地瓜是红薯吗", "context": "地瓜不是红薯。地瓜一般生吃或者凉拌,外形是纺锤型的,有明显的瓣状结构,内里的肉是白色的,有清淡的药香味,生吃又脆又甜,常食用可以预防肝癌、胃癌,营养价值非常高。红薯是粗粮,也叫番薯山芋。它是一种属管状花目,旋花科一年生的草本植物,富含丰富的矿物质和维生素,而且非常耐饱。", "question": "地瓜和红薯一样吗", "sent_token": ["地", "瓜", "不", "是", "红", "薯", "。", "地", "瓜", "一", "般", "生", "吃", "或", "者", "凉", "拌", ",", "外", "形", "是", "纺", "锤", 
"型", "的", ",", "有", "明", "显", "的", "瓣", "状", "结", "构", ",", "内", "里", "的", "肉", "是", "白", "色", "的", ",", "有", "清", "淡", "的", "药", "香", "味", ",", "生", "吃", "又", "脆", "又", "甜", ",", "常", "食", "用", "可", "以", "预", "防", "肝", "癌", "、", "胃", "癌", ",", "营", "养", "价", "值", "非", "常", "高", "。", "红", "薯", "是", "粗", "粮", ",", "也", "叫", "番", "薯", "山", "芋", "。", "它", "是", "一", "种", "属", "管", "状", "花", "目", ",", "旋", "花", "科", "一", "年", "生", "的", "草", "本", "植", "物", ",", "富", "含", "丰", "富", "的", "矿", "物", "质", "和", "维", "生", "素", ",", "而", "且", "非", "常", "耐", "饱", "。", "地", "瓜", "是", "红", "薯", "吗"], "sample_type": "ori", "rel_ids": [1763]} +{"id": 5, "title": "已满多少岁的人犯贩卖毒品罪应负刑事责任", "context": "根据《刑法》第十七条:已满十六周岁的人犯罪,应当负刑事责任。已满十四周岁不满十六周岁的人,犯故意杀人、故意伤害致人重伤或者死亡、强奸、抢劫、贩卖毒品、放火、爆炸、投放危险物质罪的,应当负刑事责任。", "question": "已满几周岁的人贩卖毒品罪应当负刑事责任", "sent_token": ["根", "据", "《", "刑", "法", "》", "第", "十", "七", "条", ":", "已", "满", "十", "六", "周", "岁", "的", "人", "犯", "罪", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "十", "四", "周", "岁", "不", "满", "十", "六", "周", "岁", "的", "人", ",", "犯", "故", "意", "杀", "人", "、", "故", "意", "伤", "害", "致", "人", "重", "伤", "或", "者", "死", "亡", "、", "强", "奸", "、", "抢", "劫", "、", "贩", "卖", "毒", "品", "、", "放", "火", "、", "爆", "炸", "、", "投", "放", "危", "险", "物", "质", "罪", "的", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "多", "少", "岁", "的", "人", "犯", "贩", "卖", "毒", "品", "罪", "应", "负", "刑", "事", "责", "任"], "sample_type": "ori", "rel_ids": [1767]} +{"id": 10, "title": "读研跟考研有什么区别", "context": "考研和读研的区别在于概念和意义不同。考研是指考生通过考试来得到研究生的入学资格,而考生并不是硕士研究生;而读研是指学生在高校攻读硕士研究生的过程,学生身份已经是硕士研究生。这二者并不等同,而是有先后关系,也就是说考生只有通过考研,才能成为硕士研究生,然后在规定的学习时间内读研。", "question": "考研跟读研有什么区别", "sent_token": ["考", "研", "和", "读", "研", "的", "区", "别", "在", "于", "概", "念", "和", "意", "义", "不", "同", "。", "考", "研", "是", "指", "考", "生", "通", "过", "考", "试", "来", "得", "到", "研", "究", "生", "的", "入", "学", "资", "格", ",", "而", "考", "生", "并", "不", "是", "硕", "士", "研", "究", "生", ";", "而", "读", "研", "是", "指", "学", "生", "在", "高", "校", "攻", "读", "硕", "士", "研", "究", "生", "的", "过", "程", ",", "学", "生", "身", "份", "已", "经", "是", "硕", "士", "研", "究", "生", "。", "这", "二", "者", "并", "不", "等", "同", ",", "而", "是", "有", "先", "后", "关", "系", ",", "也", "就", "是", "说", "考", "生", "只", "有", "通", "过", "考", "研", ",", "才", "能", "成", "为", "硕", "士", "研", "究", "生", ",", "然", "后", "在", "规", "定", "的", "学", "习", "时", "间", "内", "读", "研", "。", "读", "研", "跟", "考", "研", "有", "什", "么", "区", "别"], "sample_type": "ori", "rel_ids": [1772]} +{"id": 12, "title": "多效唑能和磷酸二氢钾一起用吗", "context": "多效唑能和磷酸二氢钾一起用。多效唑是植物的生长调节剂,主要是控制作物疯长的。而磷酸二氢钾属于叶面肥,施用后可促使作物的叶色更加浓绿,根系发达,药效完全不同,也并不排斥,可以混合使用。不过要注意施用时要严格按照说明施加,不可过量,否则会阻碍生长。", "question": "磷酸二氢钾能和多效唑一起用吗", "sent_token": ["多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "。", "多", "效", "唑", "是", "植", "物", "的", "生", "长", "调", "节", "剂", ",", "主", "要", "是", "控", "制", "作", "物", "疯", "长", "的", "。", "而", "磷", "酸", "二", "氢", "钾", "属", "于", "叶", "面", "肥", ",", "施", "用", "后", "可", "促", "使", "作", "物", "的", "叶", "色", "更", "加", "浓", "绿", ",", "根", "系", "发", "达", ",", "药", "效", "完", "全", "不", "同", ",", "也", "并", "不", "排", "斥", ",", "可", "以", "混", "合", "使", "用", "。", "不", "过", "要", "注", "意", "施", "用", "时", "要", "严", "格", "按", "照", "说", "明", "施", "加", ",", "不", "可", "过", "量", ",", "否", "则", "会", "阻", "碍", "生", "长", "。", "多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "吗"], "sample_type": "ori", "rel_ids": [1774]} +{"id": 14, "title": "猫能吃蛋黄吗", "context": 
"猫咪是可以吃蛋黄的。这里特定煮熟的白水蛋,猫咪不能吃生鸡蛋,因为生鸡蛋中有细菌,常见的是沙门氏菌,容易引起猫腹泻脱水,而且饲喂猫咪最好的只饲喂蛋黄。虽然可以吃蛋黄,但是需要掌握好量,一般一周最多吃两三次就可了。蛋黄中也含有丰富的胆固醇,易引发猫咪患脂肪肝和高脂血病。", "question": "猫咪可以吃生蛋黄吗", "sent_token": ["猫", "咪", "是", "可", "以", "吃", "蛋", "黄", "的", "。", "这", "里", "特", "定", "煮", "熟", "的", "白", "水", "蛋", ",", "猫", "咪", "不", "能", "吃", "生", "鸡", "蛋", ",", "因", "为", "生", "鸡", "蛋", "中", "有", "细", "菌", ",", "常", "见", "的", "是", "沙", "门", "氏", "菌", ",", "容", "易", "引", "起", "猫", "腹", "泻", "脱", "水", ",", "而", "且", "饲", "喂", "猫", "咪", "最", "好", "的", "只", "饲", "喂", "蛋", "黄", "。", "虽", "然", "可", "以", "吃", "蛋", "黄", ",", "但", "是", "需", "要", "掌", "握", "好", "量", ",", "一", "般", "一", "周", "最", "多", "吃", "两", "三", "次", "就", "可", "了", "。", "蛋", "黄", "中", "也", "含", "有", "丰", "富", "的", "胆", "固", "醇", ",", "易", "引", "发", "猫", "咪", "患", "脂", "肪", "肝", "和", "高", "脂", "血", "病", "。", "猫", "能", "吃", "蛋", "黄", "吗"], "sample_type": "ori", "rel_ids": [1776]} +{"id": 18, "title": "最近深圳限行吗", "context": "现在由于疫情的影响,深圳市不限行的了,但是没有必要尽量还是少出门,出门也要做好一系列的防护措施才可以。因为虽然目前国内疫情形势有所缓和,但是这并不意味着疫情的结束,国外疫情形势还是很严峻的,境外输入案例较多。", "question": "最近深圳没有限行吗", "sent_token": ["现", "在", "由", "于", "疫", "情", "的", "影", "响", ",", "深", "圳", "市", "不", "限", "行", "的", "了", ",", "但", "是", "没", "有", "必", "要", "尽", "量", "还", "是", "少", "出", "门", ",", "出", "门", "也", "要", "做", "好", "一", "系", "列", "的", "防", "护", "措", "施", "才", "可", "以", "。", "因", "为", "虽", "然", "目", "前", "国", "内", "疫", "情", "形", "势", "有", "所", "缓", "和", ",", "但", "是", "这", "并", "不", "意", "味", "着", "疫", "情", "的", "结", "束", ",", "国", "外", "疫", "情", "形", "势", "还", "是", "很", "严", "峻", "的", ",", "境", "外", "输", "入", "案", "例", "较", "多", "。", "最", "近", "深", "圳", "限", "行", "吗"], "sample_type": "ori", "rel_ids": [1780]} +{"id": 19, "title": "合同签字不盖章有效吗", "context": "可能有效可能无效。只有签字没有公章的合同是否有法律效力要根据具体情况分析:如果合同是由单位的委托代理人在其权限范围内、或单位的法定代表人签的字,则合同有效。", "question": "合同不签字不盖章有效吗", "sent_token": ["可", "能", "有", "效", "可", "能", "无", "效", "。", "只", "有", "签", "字", "没", "有", "公", "章", "的", "合", "同", "是", "否", "有", "法", "律", "效", "力", "要", "根", "据", "具", "体", "情", "况", "分", "析", ":", "如", "果", "合", "同", "是", "由", "单", "位", "的", "委", "托", "代", "理", "人", "在", "其", "权", "限", "范", "围", "内", "、", "或", "单", "位", "的", "法", "定", "代", "表", "人", "签", "的", "字", ",", "则", "合", "同", "有", "效", "。", "合", "同", "签", "字", "不", "盖", "章", "有", "效", "吗"], "sample_type": "ori", "rel_ids": [1781]} +{"id": 27, "title": "", "context": "吴三桂(1612年-1678年10月2日),字长伯,一字月所,明朝辽东人,明末清初著名政治军事人物,吴周政权建立者吴周太祖。", "question": "吴三贵什么朝代", "sent_token": ["吴", "三", "桂", "(", "1612", "年", "-", "1678", "年", "10", "月", "2", "日", ")", ",", "字", "长", "伯", ",", "一", "字", "月", "所", ",", "明", "朝", "辽", "东", "人", ",", "明", "末", "清", "初", "著", "名", "政", "治", "军", "事", "人", "物", ",", "吴", "周", "政", "权", "建", "立", "者", "吴", "周", "太", "祖", "。"], "sample_type": "ori", "rel_ids": [1789]} +{"id": 34, "title": "狗狗为什么互相闻屁股", "context": "相互闻屁股是狗狗打招呼的一种方式。狗狗的嗅觉很敏感,它们可以用相互闻屁股来了解狗狗的配偶状况、饮食习惯等,因为狗狗的屁股后面有两个肛门腺,在肛门腺里面涵盖了很多的信息素。处在发情期的狗狗也会通过闻屁股来挑选自己的配偶。", "question": "狗狗为什么总是闻屁股", "sent_token": ["相", "互", "闻", "屁", "股", "是", "狗", "狗", "打", "招", "呼", "的", "一", "种", "方", "式", "。", "狗", "狗", "的", "嗅", "觉", "很", "敏", "感", ",", "它", "们", "可", "以", "用", "相", "互", "闻", "屁", "股", "来", "了", "解", "狗", "狗", "的", "配", "偶", "状", "况", "、", "饮", "食", "习", "惯", "等", ",", "因", "为", "狗", "狗", "的", "屁", "股", "后", "面", "有", "两", "个", "肛", "门", "腺", ",", "在", "肛", "门", "腺", "里", "面", "涵", "盖", "了", "很", "多", "的", "信", "息", "素", "。", "处", "在", "发", "情", "期", "的", "狗", "狗", "也", "会", "通", "过", "闻", "屁", "股", "来", "挑", "选", "自", "己", "的", "配", 
"偶", "。", "狗", "狗", "为", "什", "么", "互", "相", "闻", "屁", "股"], "sample_type": "ori", "rel_ids": [1796]} +{"id": 36, "title": "出租房隔音差怎么解决", "context": "可以在窗户上贴一层隔音膜,在粘贴过程中要注意,不要出现气泡,以免影响隔音效果。若想要隔音效果更好点,还可以购买一些密封条安装在窗户缝隙处,这也能起到更好的隔音效果。另外,室内使用的家具可以更换成木质的,这样同样能起到一定的吸音效果。", "question": "出租房隔音不好怎么解决", "sent_token": ["可", "以", "在", "窗", "户", "上", "贴", "一", "层", "隔", "音", "膜", ",", "在", "粘", "贴", "过", "程", "中", "要", "注", "意", ",", "不", "要", "出", "现", "气", "泡", ",", "以", "免", "影", "响", "隔", "音", "效", "果", "。", "若", "想", "要", "隔", "音", "效", "果", "更", "好", "点", ",", "还", "可", "以", "购", "买", "一", "些", "密", "封", "条", "安", "装", "在", "窗", "户", "缝", "隙", "处", ",", "这", "也", "能", "起", "到", "更", "好", "的", "隔", "音", "效", "果", "。", "另", "外", ",", "室", "内", "使", "用", "的", "家", "具", "可", "以", "更", "换", "成", "木", "质", "的", ",", "这", "样", "同", "样", "能", "起", "到", "一", "定", "的", "吸", "音", "效", "果", "。", "出", "租", "房", "隔", "音", "差", "怎", "么", "解", "决"], "sample_type": "ori", "rel_ids": [1798]} +{"id": 40, "title": "鬼迷心窍(李宗盛演唱歌曲)_百度百科", "context": "《鬼迷心窍》是1992年黄日华、周海媚主演台湾电视剧《末代皇孙》的主题曲,是由李宗盛作词、作曲、演唱,收录于1992年影视剧音乐合辑《滚石九大天王之十二出好戏》当中。", "question": "鬼迷心窍原唱", "sent_token": ["《", "鬼", "迷", "心", "窍", "》", "是", "1992", "年", "黄", "日", "华", "、", "周", "海", "媚", "主", "演", "台", "湾", "电", "视", "剧", "《", "末", "代", "皇", "孙", "》", "的", "主", "题", "曲", ",", "是", "由", "李", "宗", "盛", "作", "词", "、", "作", "曲", "、", "演", "唱", ",", "收", "录", "于", "1992", "年", "影", "视", "剧", "音", "乐", "合", "辑", "《", "滚", "石", "九", "大", "天", "王", "之", "十", "二", "出", "好", "戏", "》", "当", "中", "。", "鬼", "迷", "心", "窍", "(", "李", "宗", "盛", "演", "唱", "歌", "曲", ")", "_", "百", "度", "百", "科"], "sample_type": "ori", "rel_ids": [1802]} +{"id": 41, "title": "", "context": "白龙马,名著小说《西游记》中的重要角色。本是西海龙王三太子,因纵火烧毁玉帝赏赐的明珠而被西海龙王上天告忤逆,要被斩首。后因南海观世菩萨出面才免于死罪,被贬到蛇盘山鹰愁涧等待唐僧取经。之后又误吃唐僧所骑的白马,被菩萨点化,变身为白龙。", "question": "白龙马的真正身份", "sent_token": ["白", "龙", "马", ",", "名", "著", "小", "说", "《", "西", "游", "记", "》", "中", "的", "重", "要", "角", "色", "。", "本", "是", "西", "海", "龙", "王", "三", "太", "子", ",", "因", "纵", "火", "烧", "毁", "玉", "帝", "赏", "赐", "的", "明", "珠", "而", "被", "西", "海", "龙", "王", "上", "天", "告", "忤", "逆", ",", "要", "被", "斩", "首", "。", "后", "因", "南", "海", "观", "世", "菩", "萨", "出", "面", "才", "免", "于", "死", "罪", ",", "被", "贬", "到", "蛇", "盘", "山", "鹰", "愁", "涧", "等", "待", "唐", "僧", "取", "经", "。", "之", "后", "又", "误", "吃", "唐", "僧", "所", "骑", "的", "白", "马", ",", "被", "菩", "萨", "点", "化", ",", "变", "身", "为", "白", "龙", "。"], "sample_type": "ori", "rel_ids": [1803]} +{"id": 43, "title": "", "context": "《湮灭》是由派拉蒙影业出品的科幻惊悚片,由亚历克斯·加兰执导,娜塔莉·波特曼、詹妮弗·杰森·李、吉娜·罗德里格兹、泰莎·汤普森联合主演。该片于2018年2月23日在美国上映。影片根据杰夫·梵德米尔所著《遗落的南境》三部曲的首部同名小说改编,讲述了生物学家莉娜为了自己的丈夫,她自愿加入了科学考察探险小队,去研究美国领土一块被检疫隔离的生态灾害区域的故事。", "question": "湮灭什么类型", "sent_token": ["《", "湮", "灭", "》", "是", "由", "派", "拉", "蒙", "影", "业", "出", "品", "的", "科", "幻", "惊", "悚", "片", ",", "由", "亚", "历", "克", "斯", "·", "加", "兰", "执", "导", ",", "娜", "塔", "莉", "·", "波", "特", "曼", "、", "詹", "妮", "弗", "·", "杰", "森", "·", "李", "、", "吉", "娜", "·", "罗", "德", "里", "格", "兹", "、", "泰", "莎", "·", "汤", "普", "森", "联", "合", "主", "演", "。", "该", "片", "于", "2018", "年", "2", "月", "23", "日", "在", "美", "国", "上", "映", "。", "影", "片", "根", "据", "杰", "夫", "·", "梵", "德", "米", "尔", "所", "著", "《", "遗", "落", "的", "南", "境", "》", "三", "部", "曲", "的", "首", "部", "同", "名", "小", "说", "改", "编", ",", "讲", "述", "了", "生", "物", "学", "家", "莉", "娜", "为", "了", "自", "己", "的", "丈", "夫", ",", "她", "自", "愿", "加", "入", "了", "科", "学", "考", "察", "探", "险", "小", "队", ",", "去", "研", "究", "美", "国", "领", "土", "一", "块", "被", "检", 
"疫", "隔", "离", "的", "生", "态", "灾", "害", "区", "域", "的", "故", "事", "。"], "sample_type": "ori", "rel_ids": [1805]} +{"id": 45, "title": "", "context": "网球运动的起源及演变可以用四句话来概括:网球孕育在法国,诞生在英国,开始普及和形成高潮在美国,现盛行全世界。", "question": "网球起源于哪国?", "sent_token": ["网", "球", "运", "动", "的", "起", "源", "及", "演", "变", "可", "以", "用", "四", "句", "话", "来", "概", "括", ":", "网", "球", "孕", "育", "在", "法", "国", ",", "诞", "生", "在", "英", "国", ",", "开", "始", "普", "及", "和", "形", "成", "高", "潮", "在", "美", "国", ",", "现", "盛", "行", "全", "世", "界", "。"], "sample_type": "ori", "rel_ids": [1807]} +{"id": 48, "title": "单人挑战巫女大蛇悲鸣需要多少体力_单人挑战巫女大蛇悲鸣需要体力", "context": "阴阳师巫女大蛇悲鸣单人通关需要12点体力组队通关的话只需要8点体力,挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。奖励掉落5星与6星御魂,经验强化狗粮4星青吉鬼。在御魂副本1-10层原本掉落的基础上,巫女大蛇·悲鸣新增了蚌精、幽谷响、轮入道、蝠翼、狂骨这5种御魂的掉落,每日掉落御魂种类增加到5。", "question": "阴阳师 组队挑战大蛇悲鸣需要多少体力", "sent_token": ["阴", "阳", "师", "巫", "女", "大", "蛇", "悲", "鸣", "单", "人", "通", "关", "需", "要", "12", "点", "体", "力", "组", "队", "通", "关", "的", "话", "只", "需", "要", "8", "点", "体", "力", ",", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "奖", "励", "掉", "落", "5", "星", "与", "6", "星", "御", "魂", ",", "经", "验", "强", "化", "狗", "粮", "4", "星", "青", "吉", "鬼", "。", "在", "御", "魂", "副", "本", "1", "-", "10", "层", "原", "本", "掉", "落", "的", "基", "础", "上", ",", "巫", "女", "大", "蛇", "·", "悲", "鸣", "新", "增", "了", "蚌", "精", "、", "幽", "谷", "响", "、", "轮", "入", "道", "、", "蝠", "翼", "、", "狂", "骨", "这", "5", "种", "御", "魂", "的", "掉", "落", ",", "每", "日", "掉", "落", "御", "魂", "种", "类", "增", "加", "到", "5", "。", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "多", "少", "体", "力", "_", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "体", "力"], "sample_type": "ori", "rel_ids": [1810]} +{"id": 53, "title": "", "context": "人类的心脏位于胸腔中部偏左,体积约相当于一个拳头大小,重量约350克。女性的心脏通常要比男性的体积小且重量轻。人的心脏外形像桃子,位于横膈之上,两肺间而偏左。", "question": "人类心脏多少斤", "sent_token": ["人", "类", "的", "心", "脏", "位", "于", "胸", "腔", "中", "部", "偏", "左", ",", "体", "积", "约", "相", "当", "于", "一", "个", "拳", "头", "大", "小", ",", "重", "量", "约", "350", "克", "。", "女", "性", "的", "心", "脏", "通", "常", "要", "比", "男", "性", "的", "体", "积", "小", "且", "重", "量", "轻", "。", "人", "的", "心", "脏", "外", "形", "像", "桃", "子", ",", "位", "于", "横", "膈", "之", "上", ",", "两", "肺", "间", "而", "偏", "左", "。"], "sample_type": "ori", "rel_ids": [1815]} +{"id": 54, "title": "紫菜变成紫色还能吃吗-有来医生", "context": "如果紫菜变成紫色的情况下,主要考虑还是紫菜受潮引起的,紫菜受潮以后容易滋生细菌,营养物质也会丧失,口感也会变差,一般情况下,建议不要食用,以免导致消化道的不良反应。紫菜中含有的营养物质是很丰富的,含有丰富的锌元素和铁元素,每天适当的吃一点,可以预防缺铁性贫血,可以预防缺锌引起的反复性口腔溃疡,可以增进食欲。", "question": "海苔回潮了还能吃吗", "sent_token": ["如", "果", "紫", "菜", "变", "成", "紫", "色", "的", "情", "况", "下", ",", "主", "要", "考", "虑", "还", "是", "紫", "菜", "受", "潮", "引", "起", "的", ",", "紫", "菜", "受", "潮", "以", "后", "容", "易", "滋", "生", "细", "菌", ",", "营", "养", "物", "质", "也", "会", "丧", "失", ",", "口", "感", "也", "会", "变", "差", ",", "一", "般", "情", "况", "下", ",", "建", "议", "不", "要", "食", "用", ",", "以", "免", "导", "致", "消", "化", "道", "的", "不", "良", "反", "应", "。", "紫", "菜", "中", "含", "有", "的", "营", "养", "物", "质", "是", "很", "丰", "富", "的", ",", "含", "有", "丰", "富", "的", "锌", "元", "素", "和", "铁", "元", "素", ",", "每", "天", "适", "当", "的", "吃", "一", "点", ",", "可", "以", "预", "防", "缺", "铁", "性", "贫", "血", ",", "可", "以", "预", "防", "缺", "锌", "引", "起", "的", "反", "复", "性", "口", "腔", "溃", "疡", ",", "可", "以", "增", "进", "食", "欲", "。", "紫", "菜", "变", "成", "紫", "色", "还", "能", "吃", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1816]} +{"id": 68, "title": "", "context": 
"穿上盔甲后,托尼变身成了复仇者联盟中惩恶扬善的钢铁侠。复仇者联盟2:奥创纪元钢铁侠是美国演员小罗伯特·唐尼演的。小罗伯特唐尼的电影钢铁侠扮演者小罗伯特·。", "question": "谁演过钢铁侠", "sent_token": ["穿", "上", "盔", "甲", "后", ",", "托", "尼", "变", "身", "成", "了", "复", "仇", "者", "联", "盟", "中", "惩", "恶", "扬", "善", "的", "钢", "铁", "侠", "。", "复", "仇", "者", "联", "盟", "2", ":", "奥", "创", "纪", "元", "钢", "铁", "侠", "是", "美", "国", "演", "员", "小", "罗", "伯", "特", "·", "唐", "尼", "演", "的", "。", "小", "罗", "伯", "特", "唐", "尼", "的", "电", "影", "钢", "铁", "侠", "扮", "演", "者", "小", "罗", "伯", "特", "·", "。"], "sample_type": "ori", "rel_ids": [1830]} +{"id": 69, "title": "人间正道是沧桑是什么意思_酷知经验网", "context": "天若有情天亦老,人间正道是沧桑:上句借用李贺《金铜仙人辞汉歌》中诗句,原诗说的是汉武帝时制作的极贵重的宝物金铜仙人像,在三国时被魏明帝由长安迁往洛阳的传说。原句的意思是,对于这样的人间恨事,天若有情,也要因悲伤而衰老。", "question": "人间正道是沧桑上一句", "sent_token": ["天", "若", "有", "情", "天", "亦", "老", ",", "人", "间", "正", "道", "是", "沧", "桑", ":", "上", "句", "借", "用", "李", "贺", "《", "金", "铜", "仙", "人", "辞", "汉", "歌", "》", "中", "诗", "句", ",", "原", "诗", "说", "的", "是", "汉", "武", "帝", "时", "制", "作", "的", "极", "贵", "重", "的", "宝", "物", "金", "铜", "仙", "人", "像", ",", "在", "三", "国", "时", "被", "魏", "明", "帝", "由", "长", "安", "迁", "往", "洛", "阳", "的", "传", "说", "。", "原", "句", "的", "意", "思", "是", ",", "对", "于", "这", "样", "的", "人", "间", "恨", "事", ",", "天", "若", "有", "情", ",", "也", "要", "因", "悲", "伤", "而", "衰", "老", "。", "人", "间", "正", "道", "是", "沧", "桑", "是", "什", "么", "意", "思", "_", "酷", "知", "经", "验", "网"], "sample_type": "ori", "rel_ids": [1831]} +{"id": 72, "title": "", "context": "《艺妓回忆录》根据美国作家阿瑟-高顿的同名小说改编。于2005年12月1日上映,由章子怡·巩俐·杨紫琼等共同演绎。是一部时长约140分钟的电影。全篇充满着古典美,时代背景从1929年开始延续到二战结束,女主人公回忆了自己从小拼命挣扎、历尽荣辱的人生经历。", "question": "艺妓回忆录多长时间", "sent_token": ["《", "艺", "妓", "回", "忆", "录", "》", "根", "据", "美", "国", "作", "家", "阿", "瑟", "-", "高", "顿", "的", "同", "名", "小", "说", "改", "编", "。", "于", "2005", "年", "12", "月", "1", "日", "上", "映", ",", "由", "章", "子", "怡", "·", "巩", "俐", "·", "杨", "紫", "琼", "等", "共", "同", "演", "绎", "。", "是", "一", "部", "时", "长", "约", "140", "分", "钟", "的", "电", "影", "。", "全", "篇", "充", "满", "着", "古", "典", "美", ",", "时", "代", "背", "景", "从", "1929", "年", "开", "始", "延", "续", "到", "二", "战", "结", "束", ",", "女", "主", "人", "公", "回", "忆", "了", "自", "己", "从", "小", "拼", "命", "挣", "扎", "、", "历", "尽", "荣", "辱", "的", "人", "生", "经", "历", "。"], "sample_type": "ori", "rel_ids": [1834]} +{"id": 77, "title": "痛风挂哪个科室比较好?_39健康问答_39健康网", "context": "痛风属于代谢风湿性疾病,目前主要是在风湿免疫科治疗,所以患者需要挂风湿免疫科。风湿免疫科在绝大多数三级甲等医院都有独立的科室。由于这个科是一个新兴学科,在很多县级医院还没有成立,患者可以到内分泌科就诊,挂内分泌科。如果这两个科都没有患者,可以到骨科就诊,因为痛风首发表现是急性痛风性关节炎,骨科大夫对痛风也有一定的了解。", "question": "痛风属于什么类型疾病", "sent_token": ["痛", "风", "属", "于", "代", "谢", "风", "湿", "性", "疾", "病", ",", "目", "前", "主", "要", "是", "在", "风", "湿", "免", "疫", "科", "治", "疗", ",", "所", "以", "患", "者", "需", "要", "挂", "风", "湿", "免", "疫", "科", "。", "风", "湿", "免", "疫", "科", "在", "绝", "大", "多", "数", "三", "级", "甲", "等", "医", "院", "都", "有", "独", "立", "的", "科", "室", "。", "由", "于", "这", "个", "科", "是", "一", "个", "新", "兴", "学", "科", ",", "在", "很", "多", "县", "级", "医", "院", "还", "没", "有", "成", "立", ",", "患", "者", "可", "以", "到", "内", "分", "泌", "科", "就", "诊", ",", "挂", "内", "分", "泌", "科", "。", "如", "果", "这", "两", "个", "科", "都", "没", "有", "患", "者", ",", "可", "以", "到", "骨", "科", "就", "诊", ",", "因", "为", "痛", "风", "首", "发", "表", "现", "是", "急", "性", "痛", "风", "性", "关", "节", "炎", ",", "骨", "科", "大", "夫", "对", "痛", "风", "也", "有", "一", "定", "的", "了", "解", "。", "痛", "风", "挂", "哪", "个", "科", "室", "比", "较", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1839]} +{"id": 82, "title": "阴阳师武士之灵生前被谁所杀_游侠网", "context": 
"从武士之灵的传记中可以得知,武士之灵生前是被茨木童子所击杀。该问题来自游戏内的逢魔密信,正确回答问题之后就有机会获得包括金币、体力、勾玉和结界卡在内的多种游戏内道具物资奖励。", "question": "武士之灵生前被谁所杀", "sent_token": ["从", "武", "士", "之", "灵", "的", "传", "记", "中", "可", "以", "得", "知", ",", "武", "士", "之", "灵", "生", "前", "是", "被", "茨", "木", "童", "子", "所", "击", "杀", "。", "该", "问", "题", "来", "自", "游", "戏", "内", "的", "逢", "魔", "密", "信", ",", "正", "确", "回", "答", "问", "题", "之", "后", "就", "有", "机", "会", "获", "得", "包", "括", "金", "币", "、", "体", "力", "、", "勾", "玉", "和", "结", "界", "卡", "在", "内", "的", "多", "种", "游", "戏", "内", "道", "具", "物", "资", "奖", "励", "。", "阴", "阳", "师", "武", "士", "之", "灵", "生", "前", "被", "谁", "所", "杀", "_", "游", "侠", "网"], "sample_type": "ori", "rel_ids": [1844]} +{"id": 88, "title": "中医肾主什么-有来医生", "context": "根据中医基础理论,肾主水、主纳气、主二便、主藏精。肾主水,是指全身的水液代谢都是在肾阳的气化温煦作用下,从而分布到全身,然后再通过呼吸、二便将代谢废物排除体外。肾主纳气,是指肾能够使人体维持正常的呼吸深度。肾主二便,人的大小便需要在肾的作用下,才能够正常的排泄,否则就会出现异常的改变,比如大小便失禁、大便稀薄等情况。肾主藏精,是指五脏六腑化生的精气,最后都是储存在肾脏,反过来肾脏所藏的精气,又能够推动各脏腑的功能。", "question": "肾主什么", "sent_token": ["根", "据", "中", "医", "基", "础", "理", "论", ",", "肾", "主", "水", "、", "主", "纳", "气", "、", "主", "二", "便", "、", "主", "藏", "精", "。", "肾", "主", "水", ",", "是", "指", "全", "身", "的", "水", "液", "代", "谢", "都", "是", "在", "肾", "阳", "的", "气", "化", "温", "煦", "作", "用", "下", ",", "从", "而", "分", "布", "到", "全", "身", ",", "然", "后", "再", "通", "过", "呼", "吸", "、", "二", "便", "将", "代", "谢", "废", "物", "排", "除", "体", "外", "。", "肾", "主", "纳", "气", ",", "是", "指", "肾", "能", "够", "使", "人", "体", "维", "持", "正", "常", "的", "呼", "吸", "深", "度", "。", "肾", "主", "二", "便", ",", "人", "的", "大", "小", "便", "需", "要", "在", "肾", "的", "作", "用", "下", ",", "才", "能", "够", "正", "常", "的", "排", "泄", ",", "否", "则", "就", "会", "出", "现", "异", "常", "的", "改", "变", ",", "比", "如", "大", "小", "便", "失", "禁", "、", "大", "便", "稀", "薄", "等", "情", "况", "。", "肾", "主", "藏", "精", ",", "是", "指", "五", "脏", "六", "腑", "化", "生", "的", "精", "气", ",", "最", "后", "都", "是", "储", "存", "在", "肾", "脏", ",", "反", "过", "来", "肾", "脏", "所", "藏", "的", "精", "气", ",", "又", "能", "够", "推", "动", "各", "脏", "腑", "的", "功", "能", "。", "中", "医", "肾", "主", "什", "么", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1850]} +{"id": 91, "title": "1963年属什么生肖年_十二生肖_卜易居", "context": "1963年属什么生肖年,葵卯兔年,属兔之人举止文雅,谈吐随和,为人恭良谦逊,与人交往如慕春风,学习能力超群,敏捷果断,安贫乐道。虽性子柔弱,但韧性极强,绝境之中能力惊人,缺点则是难以坚持原则,随波逐流。", "question": "1963年属什么生肖", "sent_token": ["1963", "年", "属", "什", "么", "生", "肖", "年", ",", "葵", "卯", "兔", "年", ",", "属", "兔", "之", "人", "举", "止", "文", "雅", ",", "谈", "吐", "随", "和", ",", "为", "人", "恭", "良", "谦", "逊", ",", "与", "人", "交", "往", "如", "慕", "春", "风", ",", "学", "习", "能", "力", "超", "群", ",", "敏", "捷", "果", "断", ",", "安", "贫", "乐", "道", "。", "虽", "性", "子", "柔", "弱", ",", "但", "韧", "性", "极", "强", ",", "绝", "境", "之", "中", "能", "力", "惊", "人", ",", "缺", "点", "则", "是", "难", "以", "坚", "持", "原", "则", ",", "随", "波", "逐", "流", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", "_", "十", "二", "生", "肖", "_", "卜", "易", "居"], "sample_type": "ori", "rel_ids": [1853]} +{"id": 92, "title": "食管和食道一样吗-有来医生", "context": "食管和食道是没有区别的,食管是医学上的称谓,而食道是民间的一种说法。两者都指从咽喉部到胃贲门之间的管道。食管可以分为颈段和胸段,而胸段又分为胸上段、胸中段和胸下段。食管本身有3个生理性的狭窄,这也是某些食管疾病发生的基础。常见的食管疾病包括食管炎、食管息肉、食管癌、食管狭窄、胃食管反流症、巴雷特食管等。可以通过消化道造影以及胃镜来进一步明确。", "question": "食管跟食道一样吗", "sent_token": ["食", "管", "和", "食", "道", "是", "没", "有", "区", "别", "的", ",", "食", "管", "是", "医", "学", "上", "的", "称", "谓", ",", "而", "食", "道", "是", "民", "间", "的", "一", "种", "说", "法", "。", "两", "者", "都", "指", "从", "咽", "喉", "部", "到", "胃", "贲", "门", "之", "间", "的", "管", "道", "。", "食", "管", "可", "以", "分", "为", "颈", "段", "和", "胸", "段", ",", "而", "胸", "段", 
"又", "分", "为", "胸", "上", "段", "、", "胸", "中", "段", "和", "胸", "下", "段", "。", "食", "管", "本", "身", "有", "3", "个", "生", "理", "性", "的", "狭", "窄", ",", "这", "也", "是", "某", "些", "食", "管", "疾", "病", "发", "生", "的", "基", "础", "。", "常", "见", "的", "食", "管", "疾", "病", "包", "括", "食", "管", "炎", "、", "食", "管", "息", "肉", "、", "食", "管", "癌", "、", "食", "管", "狭", "窄", "、", "胃", "食", "管", "反", "流", "症", "、", "巴", "雷", "特", "食", "管", "等", "。", "可", "以", "通", "过", "消", "化", "道", "造", "影", "以", "及", "胃", "镜", "来", "进", "一", "步", "明", "确", "。", "食", "管", "和", "食", "道", "一", "样", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1854]} +{"id": 101, "title": "农历六月二十四是什么星座-星座乐", "context": "农历六月二十四是狮子座。狮子座,火象星座,位于黄道十二宫之第五宫,出生日期为阳历7月23日-8月22日。狮子座是英雄主义者,他们乐观,乐于助人,喜欢帮助弱势群体。他们天生自带光环,特立独行,做事豪爽大气,讲话淡定从容,从不扭扭捏捏畏畏缩缩。而且心思细腻,做事完整准确,善于将自己的优点发挥到极致。", "question": "农历六月二十四是什么星座", "sent_token": ["农", "历", "六", "月", "二", "十", "四", "是", "狮", "子", "座", "。", "狮", "子", "座", ",", "火", "象", "星", "座", ",", "位", "于", "黄", "道", "十", "二", "宫", "之", "第", "五", "宫", ",", "出", "生", "日", "期", "为", "阳", "历", "7", "月", "23", "日", "-", "8", "月", "22", "日", "。", "狮", "子", "座", "是", "英", "雄", "主", "义", "者", ",", "他", "们", "乐", "观", ",", "乐", "于", "助", "人", ",", "喜", "欢", "帮", "助", "弱", "势", "群", "体", "。", "他", "们", "天", "生", "自", "带", "光", "环", ",", "特", "立", "独", "行", ",", "做", "事", "豪", "爽", "大", "气", ",", "讲", "话", "淡", "定", "从", "容", ",", "从", "不", "扭", "扭", "捏", "捏", "畏", "畏", "缩", "缩", "。", "而", "且", "心", "思", "细", "腻", ",", "做", "事", "完", "整", "准", "确", ",", "善", "于", "将", "自", "己", "的", "优", "点", "发", "挥", "到", "极", "致", "。", "农", "历", "六", "月", "二", "十", "四", "是", "什", "么", "星", "座", "-", "星", "座", "乐"], "sample_type": "ori", "rel_ids": [1863]} +{"id": 105, "title": "", "context": "非法持有海洛因10克以上就构成非法持有毒品罪非法持有毒品罪,是指明知是鸦片、海洛因、甲基苯丙胺或者其他毒品,而非法持有且数量较大的行为。非法持有毒品达到一定数量才构成犯罪。", "question": "海洛因几克属于犯罪", "sent_token": ["非", "法", "持", "有", "海", "洛", "因", "10", "克", "以", "上", "就", "构", "成", "非", "法", "持", "有", "毒", "品", "罪", "非", "法", "持", "有", "毒", "品", "罪", ",", "是", "指", "明", "知", "是", "鸦", "片", "、", "海", "洛", "因", "、", "甲", "基", "苯", "丙", "胺", "或", "者", "其", "他", "毒", "品", ",", "而", "非", "法", "持", "有", "且", "数", "量", "较", "大", "的", "行", "为", "。", "非", "法", "持", "有", "毒", "品", "达", "到", "一", "定", "数", "量", "才", "构", "成", "犯", "罪", "。"], "sample_type": "ori", "rel_ids": [1867]} +{"id": 115, "title": "地方志书每几年左右编修一次_高三网", "context": "地方志书每20年左右编修一次。每一轮地方志书编修工作完成后,负责地方志工作的机构在编纂地方综合年鉴、搜集资料以及向社会提供咨询服务的同时,启动新一轮地方志书的续修工作。", "question": "地方质数没几年编修一次", "sent_token": ["地", "方", "志", "书", "每", "20", "年", "左", "右", "编", "修", "一", "次", "。", "每", "一", "轮", "地", "方", "志", "书", "编", "修", "工", "作", "完", "成", "后", ",", "负", "责", "地", "方", "志", "工", "作", "的", "机", "构", "在", "编", "纂", "地", "方", "综", "合", "年", "鉴", "、", "搜", "集", "资", "料", "以", "及", "向", "社", "会", "提", "供", "咨", "询", "服", "务", "的", "同", "时", ",", "启", "动", "新", "一", "轮", "地", "方", "志", "书", "的", "续", "修", "工", "作", "。", "地", "方", "志", "书", "每", "几", "年", "左", "右", "编", "修", "一", "次", "_", "高", "三", "网"], "sample_type": "ori", "rel_ids": [1877]} +{"id": 117, "title": "", "context": "《正气歌》是南宋诗人文天祥在狱中写的一首五言古诗。诗的开头即点出浩然正气存乎天地之间,至时穷之际,必然会显示出来。随后连用十二个典故,都是历史上有名的人物,他们的所作所为凛然显示出浩然正气的力量。接下来八句说明浩然正气贯日月,立天地,为三纲之命,道义之根。最后联系到自己的命运,自己虽然兵败被俘,处在极其恶劣的牢狱之中,但是由于自己一身正气,各种邪气和疾病都不能侵犯自己,因此自己能够坦然面对自己的命运。全诗感情深沉、气壮山河、直抒胸臆、毫无雕饰,充分体现了作者崇高的民族气节和强烈的爱国主义精神。", "question": "正气歌》的作者是", "sent_token": ["《", "正", "气", "歌", "》", "是", "南", "宋", "诗", "人", "文", "天", "祥", "在", "狱", "中", "写", "的", "一", "首", "五", "言", "古", "诗", "。", "诗", 
"的", "开", "头", "即", "点", "出", "浩", "然", "正", "气", "存", "乎", "天", "地", "之", "间", ",", "至", "时", "穷", "之", "际", ",", "必", "然", "会", "显", "示", "出", "来", "。", "随", "后", "连", "用", "十", "二", "个", "典", "故", ",", "都", "是", "历", "史", "上", "有", "名", "的", "人", "物", ",", "他", "们", "的", "所", "作", "所", "为", "凛", "然", "显", "示", "出", "浩", "然", "正", "气", "的", "力", "量", "。", "接", "下", "来", "八", "句", "说", "明", "浩", "然", "正", "气", "贯", "日", "月", ",", "立", "天", "地", ",", "为", "三", "纲", "之", "命", ",", "道", "义", "之", "根", "。", "最", "后", "联", "系", "到", "自", "己", "的", "命", "运", ",", "自", "己", "虽", "然", "兵", "败", "被", "俘", ",", "处", "在", "极", "其", "恶", "劣", "的", "牢", "狱", "之", "中", ",", "但", "是", "由", "于", "自", "己", "一", "身", "正", "气", ",", "各", "种", "邪", "气", "和", "疾", "病", "都", "不", "能", "侵", "犯", "自", "己", ",", "因", "此", "自", "己", "能", "够", "坦", "然", "面", "对", "自", "己", "的", "命", "运", "。", "全", "诗", "感", "情", "深", "沉", "、", "气", "壮", "山", "河", "、", "直", "抒", "胸", "臆", "、", "毫", "无", "雕", "饰", ",", "充", "分", "体", "现", "了", "作", "者", "崇", "高", "的", "民", "族", "气", "节", "和", "强", "烈", "的", "爱", "国", "主", "义", "精", "神", "。"], "sample_type": "ori", "rel_ids": [1879]} +{"id": 121, "title": "狗狗皮肤上长小脓包怎么回事", "context": "狗狗身上长脓包,是因为真菌感染或是寄生虫感染所致。如不及时处理脓包,会导致扩散全身,甚至溃烂。建议方法:戴上手套,把狗狗身上长脓包的地方挤一挤;然后用碘伏直接喷在患处;如有脓血可用医用纱布给它包在患处,等药效吸收后,取掉纱布;碘伏具有抗菌、消炎的作用,一天可以喷两三次;处理完狗狗伤口后用肥皂洗手。狗狗洗澡要用狗狗专门的沐浴露;洗后立即做吹干处理;定时用狗狗专用梳子,清理身上多余的杂毛;尽量带狗狗去干净的地方玩,回家后把狗狗的脚用抹布抹一次;多注意狗舍卫生,定时做消毒处理。", "question": "狗狗身上长小脓包是怎么回事", "sent_token": ["狗", "狗", "身", "上", "长", "脓", "包", ",", "是", "因", "为", "真", "菌", "感", "染", "或", "是", "寄", "生", "虫", "感", "染", "所", "致", "。", "如", "不", "及", "时", "处", "理", "脓", "包", ",", "会", "导", "致", "扩", "散", "全", "身", ",", "甚", "至", "溃", "烂", "。", "建", "议", "方", "法", ":", "戴", "上", "手", "套", ",", "把", "狗", "狗", "身", "上", "长", "脓", "包", "的", "地", "方", "挤", "一", "挤", ";", "然", "后", "用", "碘", "伏", "直", "接", "喷", "在", "患", "处", ";", "如", "有", "脓", "血", "可", "用", "医", "用", "纱", "布", "给", "它", "包", "在", "患", "处", ",", "等", "药", "效", "吸", "收", "后", ",", "取", "掉", "纱", "布", ";", "碘", "伏", "具", "有", "抗", "菌", "、", "消", "炎", "的", "作", "用", ",", "一", "天", "可", "以", "喷", "两", "三", "次", ";", "处", "理", "完", "狗", "狗", "伤", "口", "后", "用", "肥", "皂", "洗", "手", "。", "狗", "狗", "洗", "澡", "要", "用", "狗", "狗", "专", "门", "的", "沐", "浴", "露", ";", "洗", "后", "立", "即", "做", "吹", "干", "处", "理", ";", "定", "时", "用", "狗", "狗", "专", "用", "梳", "子", ",", "清", "理", "身", "上", "多", "余", "的", "杂", "毛", ";", "尽", "量", "带", "狗", "狗", "去", "干", "净", "的", "地", "方", "玩", ",", "回", "家", "后", "把", "狗", "狗", "的", "脚", "用", "抹", "布", "抹", "一", "次", ";", "多", "注", "意", "狗", "舍", "卫", "生", ",", "定", "时", "做", "消", "毒", "处", "理", "。", "狗", "狗", "皮", "肤", "上", "长", "小", "脓", "包", "怎", "么", "回", "事"], "sample_type": "ori", "rel_ids": [1883]} +{"id": 123, "title": "", "context": "新梓学校成立于2007年9月,是一所公办九年一贯制学校,座落在龙岗街道新生社区,紧邻水岸新都花园,交通十分便利。校园占地27500平方米,建筑面积16285平方米。", "question": "新梓学校地址", "sent_token": ["新", "梓", "学", "校", "成", "立", "于", "2007", "年", "9", "月", ",", "是", "一", "所", "公", "办", "九", "年", "一", "贯", "制", "学", "校", ",", "座", "落", "在", "龙", "岗", "街", "道", "新", "生", "社", "区", ",", "紧", "邻", "水", "岸", "新", "都", "花", "园", ",", "交", "通", "十", "分", "便", "利", "。", "校", "园", "占", "地", "27500", "平", "方", "米", ",", "建", "筑", "面", "积", "16285", "平", "方", "米", "。"], "sample_type": "ori", "rel_ids": [1885]} +{"id": 124, "title": "敷面膜脸痒是缺水吗?教你正确的认识_皮肤", "context": 
"当我们在洗完澡的时候,或者是敷面膜发现皮肤有一种痒痒的感觉,如果你确定面膜的质量是没有问题的,并且也确定你对这款面膜的物质没有过敏的情况下,皮肤出现痒的感觉,那可能的原因就是由于皮肤缺水。因为你的皮肤太缺水了,在给皮肤补水的时候就会出现一种痒的情况严重的时候,甚至会有刺痛的感觉。会让人觉得很不舒服,水分充足后会缓解。", "question": "脸痒是缺水吗", "sent_token": ["当", "我", "们", "在", "洗", "完", "澡", "的", "时", "候", ",", "或", "者", "是", "敷", "面", "膜", "发", "现", "皮", "肤", "有", "一", "种", "痒", "痒", "的", "感", "觉", ",", "如", "果", "你", "确", "定", "面", "膜", "的", "质", "量", "是", "没", "有", "问", "题", "的", ",", "并", "且", "也", "确", "定", "你", "对", "这", "款", "面", "膜", "的", "物", "质", "没", "有", "过", "敏", "的", "情", "况", "下", ",", "皮", "肤", "出", "现", "痒", "的", "感", "觉", ",", "那", "可", "能", "的", "原", "因", "就", "是", "由", "于", "皮", "肤", "缺", "水", "。", "因", "为", "你", "的", "皮", "肤", "太", "缺", "水", "了", ",", "在", "给", "皮", "肤", "补", "水", "的", "时", "候", "就", "会", "出", "现", "一", "种", "痒", "的", "情", "况", "严", "重", "的", "时", "候", ",", "甚", "至", "会", "有", "刺", "痛", "的", "感", "觉", "。", "会", "让", "人", "觉", "得", "很", "不", "舒", "服", ",", "水", "分", "充", "足", "后", "会", "缓", "解", "。", "敷", "面", "膜", "脸", "痒", "是", "缺", "水", "吗", "?", "教", "你", "正", "确", "的", "认", "识", "_", "皮", "肤"], "sample_type": "ori", "rel_ids": [1886]} +{"id": 126, "title": "无痛人流和药流哪个伤害比较小-有来医生", "context": "无痛人工流产手术和药物流产手术,相对比来说,还是药物流产伤害比较大。因为药物流产,阴道流血时间会比人工流产的阴道流血时间要长,一般人工流产,阴道流血时间不超过7天,而药物流产阴道流血的时间往往在15-20天左右才会干净。一直在有流血的状况下,宫口就是开放的,阴道又跟外界相通,跟宫颈又相通,这样造成细菌侵入感染的机会就会增加,所以容易导致生殖道的感染。另外,药物流产造成不全流产的可能性会大一些,需要做清宫手术。这样就可以想象出药物流产会比无痛人流伤害更大一些。", "question": "无痛人流和药流哪个伤害比较小", "sent_token": ["无", "痛", "人", "工", "流", "产", "手", "术", "和", "药", "物", "流", "产", "手", "术", ",", "相", "对", "比", "来", "说", ",", "还", "是", "药", "物", "流", "产", "伤", "害", "比", "较", "大", "。", "因", "为", "药", "物", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "会", "比", "人", "工", "流", "产", "的", "阴", "道", "流", "血", "时", "间", "要", "长", ",", "一", "般", "人", "工", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "不", "超", "过", "7", "天", ",", "而", "药", "物", "流", "产", "阴", "道", "流", "血", "的", "时", "间", "往", "往", "在", "15", "-", "20", "天", "左", "右", "才", "会", "干", "净", "。", "一", "直", "在", "有", "流", "血", "的", "状", "况", "下", ",", "宫", "口", "就", "是", "开", "放", "的", ",", "阴", "道", "又", "跟", "外", "界", "相", "通", ",", "跟", "宫", "颈", "又", "相", "通", ",", "这", "样", "造", "成", "细", "菌", "侵", "入", "感", "染", "的", "机", "会", "就", "会", "增", "加", ",", "所", "以", "容", "易", "导", "致", "生", "殖", "道", "的", "感", "染", "。", "另", "外", ",", "药", "物", "流", "产", "造", "成", "不", "全", "流", "产", "的", "可", "能", "性", "会", "大", "一", "些", ",", "需", "要", "做", "清", "宫", "手", "术", "。", "这", "样", "就", "可", "以", "想", "象", "出", "药", "物", "流", "产", "会", "比", "无", "痛", "人", "流", "伤", "害", "更", "大", "一", "些", "。", "无", "痛", "人", "流", "和", "药", "流", "哪", "个", "伤", "害", "比", "较", "小", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1888]} +{"id": 128, "title": "长期吃葡萄籽的副作用?_39健康问答_39健康网", "context": "长期吃葡萄籽不会有副作用,不用担心,葡萄籽中含有丰富的花青素,有美容养颜的功效。葡萄籽含有丰富的多种氨基酸、维生素及矿物质等,原花青素含量最高,有促进血液循环、保护视力、抗氧化去除自由基、降低血、保护心血管的作用,可以用于保健、美容。", "question": "葡萄籽能长期吃吗?有什么副作用?", "sent_token": ["长", "期", "吃", "葡", "萄", "籽", "不", "会", "有", "副", "作", "用", ",", "不", "用", "担", "心", ",", "葡", "萄", "籽", "中", "含", "有", "丰", "富", "的", "花", "青", "素", ",", "有", "美", "容", "养", "颜", "的", "功", "效", "。", "葡", "萄", "籽", "含", "有", "丰", "富", "的", "多", "种", "氨", "基", "酸", "、", "维", "生", "素", "及", "矿", "物", "质", "等", ",", "原", "花", "青", "素", "含", "量", "最", "高", ",", "有", "促", "进", "血", "液", "循", "环", "、", "保", "护", "视", "力", "、", "抗", "氧", "化", "去", "除", "自", "由", "基", "、", "降", "低", "血", "、", "保", "护", "心", "血", "管", "的", "作", "用", ",", "可", "以", "用", "于", "保", "健", 
"、", "美", "容", "。", "长", "期", "吃", "葡", "萄", "籽", "的", "副", "作", "用", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1890]} +{"id": 132, "title": "红花哪里产的最好?_39健康问答_39健康网", "context": "红花在中国很多地方都是有种植的,比如河南,江苏,四川,河北等等。但是在众多产地中河南的商丘生产的红花应该是最好的了。红花有一种特殊的气味,特别香,味道稍微有点苦。红花是一种很好的植物,对人体有很好的保健作用。高血压患者可以服用一些,红花是有一定的降压作用的,另外还可以促进人体血液的循环,降低血脂。", "question": "世界上哪里的红花最好", "sent_token": ["红", "花", "在", "中", "国", "很", "多", "地", "方", "都", "是", "有", "种", "植", "的", ",", "比", "如", "河", "南", ",", "江", "苏", ",", "四", "川", ",", "河", "北", "等", "等", "。", "但", "是", "在", "众", "多", "产", "地", "中", "河", "南", "的", "商", "丘", "生", "产", "的", "红", "花", "应", "该", "是", "最", "好", "的", "了", "。", "红", "花", "有", "一", "种", "特", "殊", "的", "气", "味", ",", "特", "别", "香", ",", "味", "道", "稍", "微", "有", "点", "苦", "。", "红", "花", "是", "一", "种", "很", "好", "的", "植", "物", ",", "对", "人", "体", "有", "很", "好", "的", "保", "健", "作", "用", "。", "高", "血", "压", "患", "者", "可", "以", "服", "用", "一", "些", ",", "红", "花", "是", "有", "一", "定", "的", "降", "压", "作", "用", "的", ",", "另", "外", "还", "可", "以", "促", "进", "人", "体", "血", "液", "的", "循", "环", ",", "降", "低", "血", "脂", "。", "红", "花", "哪", "里", "产", "的", "最", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1894]} +{"id": 135, "title": "", "context": "梳妆台指用来化妆的家具装饰。梳妆台一词,在现代家居中,已经被业主、客户、家居设计师广泛用到,现在泛指家具梳妆台。梳妆台尺寸标准的是总高度为1500mm左右,宽为700mm到1200mm,这样的梳妆台尺寸是大小正合适的,在家庭装修之前的前期准备时,就应该确定好梳妆台尺寸大小,同时梳妆台尺寸也要和房间的格调和风格统一起来。", "question": "梳妆台整体高度一般是多少", "sent_token": ["梳", "妆", "台", "指", "用", "来", "化", "妆", "的", "家", "具", "装", "饰", "。", "梳", "妆", "台", "一", "词", ",", "在", "现", "代", "家", "居", "中", ",", "已", "经", "被", "业", "主", "、", "客", "户", "、", "家", "居", "设", "计", "师", "广", "泛", "用", "到", ",", "现", "在", "泛", "指", "家", "具", "梳", "妆", "台", "。", "梳", "妆", "台", "尺", "寸", "标", "准", "的", "是", "总", "高", "度", "为", "1500mm", "左", "右", ",", "宽", "为", "700mm", "到", "1200mm", ",", "这", "样", "的", "梳", "妆", "台", "尺", "寸", "是", "大", "小", "正", "合", "适", "的", ",", "在", "家", "庭", "装", "修", "之", "前", "的", "前", "期", "准", "备", "时", ",", "就", "应", "该", "确", "定", "好", "梳", "妆", "台", "尺", "寸", "大", "小", ",", "同", "时", "梳", "妆", "台", "尺", "寸", "也", "要", "和", "房", "间", "的", "格", "调", "和", "风", "格", "统", "一", "起", "来", "。"], "sample_type": "ori", "rel_ids": [1897]} +{"id": 137, "title": "感冒能不能吃燕窝_妈妈网小百科", "context": "在感冒的时候尽量不要吃燕窝,虽然燕窝比较滋补,但是在感冒期间吃燕窝的话,并不利于感冒的恢复。在感冒期间应该吃得清淡一些,补充身体需要的水分,如果没有食欲的话可以多喝一些粥。在感冒期间可能吃药物的话,也不能够起到很好的效果,但是也要坚持吃药。", "question": "感冒可以吃燕窝吗?有效果吗?", "sent_token": ["在", "感", "冒", "的", "时", "候", "尽", "量", "不", "要", "吃", "燕", "窝", ",", "虽", "然", "燕", "窝", "比", "较", "滋", "补", ",", "但", "是", "在", "感", "冒", "期", "间", "吃", "燕", "窝", "的", "话", ",", "并", "不", "利", "于", "感", "冒", "的", "恢", "复", "。", "在", "感", "冒", "期", "间", "应", "该", "吃", "得", "清", "淡", "一", "些", ",", "补", "充", "身", "体", "需", "要", "的", "水", "分", ",", "如", "果", "没", "有", "食", "欲", "的", "话", "可", "以", "多", "喝", "一", "些", "粥", "。", "在", "感", "冒", "期", "间", "可", "能", "吃", "药", "物", "的", "话", ",", "也", "不", "能", "够", "起", "到", "很", "好", "的", "效", "果", ",", "但", "是", "也", "要", "坚", "持", "吃", "药", "。", "感", "冒", "能", "不", "能", "吃", "燕", "窝", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1899]} +{"id": 138, "title": "房颤会引起脑梗吗-有来医生", "context": "房颤会引起脑血管疾病,在医学上不叫脑梗叫脑栓塞,脑梗是脑血管本身病变引起的脑供血不足的情况,而脑栓塞是由于房颤心脏上形成了附壁血栓,当血栓的栓子脱落之后,就有可能堵塞在脑血管形成了脑拴塞,也是一种脑缺血的表现。治疗方法可以应用改善循环和营养神经的药物治疗,必须应用阿司匹林和氯吡格雷口服抗血小板聚集治疗,对于心房纤颤的患者,要控制心室率,应用阿司匹林和氯吡格雷等口服抗血小板聚集治疗,预防心脏附壁血栓的形成。", "question": 
"房颤会引起脑梗吗", "sent_token": ["房", "颤", "会", "引", "起", "脑", "血", "管", "疾", "病", ",", "在", "医", "学", "上", "不", "叫", "脑", "梗", "叫", "脑", "栓", "塞", ",", "脑", "梗", "是", "脑", "血", "管", "本", "身", "病", "变", "引", "起", "的", "脑", "供", "血", "不", "足", "的", "情", "况", ",", "而", "脑", "栓", "塞", "是", "由", "于", "房", "颤", "心", "脏", "上", "形", "成", "了", "附", "壁", "血", "栓", ",", "当", "血", "栓", "的", "栓", "子", "脱", "落", "之", "后", ",", "就", "有", "可", "能", "堵", "塞", "在", "脑", "血", "管", "形", "成", "了", "脑", "拴", "塞", ",", "也", "是", "一", "种", "脑", "缺", "血", "的", "表", "现", "。", "治", "疗", "方", "法", "可", "以", "应", "用", "改", "善", "循", "环", "和", "营", "养", "神", "经", "的", "药", "物", "治", "疗", ",", "必", "须", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "对", "于", "心", "房", "纤", "颤", "的", "患", "者", ",", "要", "控", "制", "心", "室", "率", ",", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "等", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "预", "防", "心", "脏", "附", "壁", "血", "栓", "的", "形", "成", "。", "房", "颤", "会", "引", "起", "脑", "梗", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1900]} +{"id": 144, "title": "二十天的婴儿能看多远_妈妈网小百科", "context": "20天的宝宝能够看到的距离大概是15厘米-20厘米左右,一般能够看到18厘米左右的事物。宝宝刚出生的时候视力极其差,有的甚至没有睁开眼,可以说基本什么都看不清楚,视力比较好的新生儿,也只能感受到光和影或大致的轮廓。", "question": "二十天的宝宝能看多远?", "sent_token": ["20", "天", "的", "宝", "宝", "能", "够", "看", "到", "的", "距", "离", "大", "概", "是", "15", "厘", "米", "-", "20", "厘", "米", "左", "右", ",", "一", "般", "能", "够", "看", "到", "18", "厘", "米", "左", "右", "的", "事", "物", "。", "宝", "宝", "刚", "出", "生", "的", "时", "候", "视", "力", "极", "其", "差", ",", "有", "的", "甚", "至", "没", "有", "睁", "开", "眼", ",", "可", "以", "说", "基", "本", "什", "么", "都", "看", "不", "清", "楚", ",", "视", "力", "比", "较", "好", "的", "新", "生", "儿", ",", "也", "只", "能", "感", "受", "到", "光", "和", "影", "或", "大", "致", "的", "轮", "廓", "。", "二", "十", "天", "的", "婴", "儿", "能", "看", "多", "远", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1906]} +{"id": 156, "title": "4价宫颈疫苗多少钱-有来医生", "context": "4价宫颈癌疫苗有国产疫苗和进口疫苗,国产疫苗价格比较便宜,预防宫颈癌的疫苗只有4价疫苗,具体价格不同地区以及不同生产厂家生产的疫苗,所定价格也不一样。在北京4价宫颈癌疫苗,价格大概是800元左右,总共需要接种三针,需要在半年内接种完,分别在第一个月,第2个月和第6个月各接种一针次,接种年龄是20-45周岁,建议咨询当地疾病预防控制机构,所进疫苗的具体价格比较准确。比如江苏省从2019年开始,所有有价疫苗都是零差价出售,每接种一针次,收取20元材料费和注射费,目前接种宫颈癌疫苗,应该先预约才可以接种。", "question": "国产宫颈疫苗有几价", "sent_token": ["4", "价", "宫", "颈", "癌", "疫", "苗", "有", "国", "产", "疫", "苗", "和", "进", "口", "疫", "苗", ",", "国", "产", "疫", "苗", "价", "格", "比", "较", "便", "宜", ",", "预", "防", "宫", "颈", "癌", "的", "疫", "苗", "只", "有", "4", "价", "疫", "苗", ",", "具", "体", "价", "格", "不", "同", "地", "区", "以", "及", "不", "同", "生", "产", "厂", "家", "生", "产", "的", "疫", "苗", ",", "所", "定", "价", "格", "也", "不", "一", "样", "。", "在", "北", "京", "4", "价", "宫", "颈", "癌", "疫", "苗", ",", "价", "格", "大", "概", "是", "800", "元", "左", "右", ",", "总", "共", "需", "要", "接", "种", "三", "针", ",", "需", "要", "在", "半", "年", "内", "接", "种", "完", ",", "分", "别", "在", "第", "一", "个", "月", ",", "第", "2", "个", "月", "和", "第", "6", "个", "月", "各", "接", "种", "一", "针", "次", ",", "接", "种", "年", "龄", "是", "20", "-", "45", "周", "岁", ",", "建", "议", "咨", "询", "当", "地", "疾", "病", "预", "防", "控", "制", "机", "构", ",", "所", "进", "疫", "苗", "的", "具", "体", "价", "格", "比", "较", "准", "确", "。", "比", "如", "江", "苏", "省", "从", "2019", "年", "开", "始", ",", "所", "有", "有", "价", "疫", "苗", "都", "是", "零", "差", "价", "出", "售", ",", "每", "接", "种", "一", "针", "次", ",", "收", "取", "20", "元", "材", "料", "费", "和", "注", "射", "费", ",", "目", "前", "接", "种", "宫", "颈", "癌", "疫", "苗", ",", "应", "该", "先", "预", "约", 
"才", "可", "以", "接", "种", "。", "4", "价", "宫", "颈", "疫", "苗", "多", "少", "钱", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1918]} +{"id": 183, "title": "hiit是什么", "context": "hiit是高强度间歇训练,主要是通过进行多组高强度的间隙,和低强度的动作组合训练,这种训练方式能够在短时间内高速燃烧脂肪,非常适合锻炼时间较少或无法长时间坚持锻炼的人。", "question": "什么是HIIT", "sent_token": ["hiit", "是", "高", "强", "度", "间", "歇", "训", "练", ",", "主", "要", "是", "通", "过", "进", "行", "多", "组", "高", "强", "度", "的", "间", "隙", ",", "和", "低", "强", "度", "的", "动", "作", "组", "合", "训", "练", ",", "这", "种", "训", "练", "方", "式", "能", "够", "在", "短", "时", "间", "内", "高", "速", "燃", "烧", "脂", "肪", ",", "非", "常", "适", "合", "锻", "炼", "时", "间", "较", "少", "或", "无", "法", "长", "时", "间", "坚", "持", "锻", "炼", "的", "人", "。", "hiit", "是", "什", "么"], "sample_type": "ori", "rel_ids": [1945]} +{"id": 187, "title": "民生信用卡的客服电话多少?-其他问题知识问答-我爱卡", "context": "民生银行的信用卡的24小时客服电话为400-669-5568,持卡人在办卡或用卡的过程中,有任何疑问,都可以拨打民生银行信用卡客服电话,通过人工客服,来进行咨询。同时,持卡人也可以通过客服电话,办理信用卡激活、修改密码、更改账单日等业务。", "question": "民生信用卡客服", "sent_token": ["民", "生", "银", "行", "的", "信", "用", "卡", "的", "24", "小", "时", "客", "服", "电", "话", "为", "400", "-", "669", "-", "5568", ",", "持", "卡", "人", "在", "办", "卡", "或", "用", "卡", "的", "过", "程", "中", ",", "有", "任", "何", "疑", "问", ",", "都", "可", "以", "拨", "打", "民", "生", "银", "行", "信", "用", "卡", "客", "服", "电", "话", ",", "通", "过", "人", "工", "客", "服", ",", "来", "进", "行", "咨", "询", "。", "同", "时", ",", "持", "卡", "人", "也", "可", "以", "通", "过", "客", "服", "电", "话", ",", "办", "理", "信", "用", "卡", "激", "活", "、", "修", "改", "密", "码", "、", "更", "改", "账", "单", "日", "等", "业", "务", "。", "民", "生", "信", "用", "卡", "的", "客", "服", "电", "话", "多", "少", "?", "-", "其", "他", "问", "题", "知", "识", "问", "答", "-", "我", "爱", "卡"], "sample_type": "ori", "rel_ids": [1949]} +{"id": 194, "title": "", "context": "法令纹位於鼻翼两侧往下延伸至嘴的附近,也称寿带,法令若垂长,亦为长寿之象徵。不过女性多半不喜欢脸上出现法令纹,因为这意味脸部皮肤松弛,是老化的迹象。", "question": "哪里是法令纹?", "sent_token": ["法", "令", "纹", "位", "於", "鼻", "翼", "两", "侧", "往", "下", "延", "伸", "至", "嘴", "的", "附", "近", ",", "也", "称", "寿", "带", ",", "法", "令", "若", "垂", "长", ",", "亦", "为", "长", "寿", "之", "象", "徵", "。", "不", "过", "女", "性", "多", "半", "不", "喜", "欢", "脸", "上", "出", "现", "法", "令", "纹", ",", "因", "为", "这", "意", "味", "脸", "部", "皮", "肤", "松", "弛", ",", "是", "老", "化", "的", "迹", "象", "。"], "sample_type": "ori", "rel_ids": [1956]} +{"id": 204, "title": "婴儿轻微肠炎能自愈吗_妈妈网小百科", "context": "婴儿轻微肠炎不能自愈。肠炎是一种炎症,其发病的原因与胃肠道失调有关联。婴儿胃肠道菌群出现了失调的异常,就会引发肠炎的出现。尽管是比较轻微的肠炎,但还是有炎症的存在。婴儿轻微肠炎需要就医进行治疗,需要吃药促使炎症的消除。", "question": "婴儿轻度肠炎能自愈吗", "sent_token": ["婴", "儿", "轻", "微", "肠", "炎", "不", "能", "自", "愈", "。", "肠", "炎", "是", "一", "种", "炎", "症", ",", "其", "发", "病", "的", "原", "因", "与", "胃", "肠", "道", "失", "调", "有", "关", "联", "。", "婴", "儿", "胃", "肠", "道", "菌", "群", "出", "现", "了", "失", "调", "的", "异", "常", ",", "就", "会", "引", "发", "肠", "炎", "的", "出", "现", "。", "尽", "管", "是", "比", "较", "轻", "微", "的", "肠", "炎", ",", "但", "还", "是", "有", "炎", "症", "的", "存", "在", "。", "婴", "儿", "轻", "微", "肠", "炎", "需", "要", "就", "医", "进", "行", "治", "疗", ",", "需", "要", "吃", "药", "促", "使", "炎", "症", "的", "消", "除", "。", "婴", "儿", "轻", "微", "肠", "炎", "能", "自", "愈", "吗", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1966]} +{"id": 215, "title": "", "context": "珍珠鸟作者简介冯骥才,当代作家,1942年生于天津,原籍浙江慈溪市人。从小喜爱美术、文学和球类活动。曾当过专业篮球运动员,从事过绘画。", "question": "冯骥才什么时候出生", "sent_token": ["珍", "珠", "鸟", "作", "者", "简", "介", "冯", "骥", "才", ",", "当", "代", "作", "家", ",", "1942", "年", "生", "于", "天", "津", ",", "原", "籍", "浙", "江", "慈", "溪", "市", "人", "。", "从", "小", "喜", "爱", "美", "术", "、", "文", "学", "和", "球", 
"类", "活", "动", "。", "曾", "当", "过", "专", "业", "篮", "球", "运", "动", "员", ",", "从", "事", "过", "绘", "画", "。"], "sample_type": "ori", "rel_ids": [1977]} +{"id": 221, "title": "哺乳期可以吃维生素b2吗_有问必答_快速问医生", "context": "你好,口腔溃疡一般都是由于维生素缺乏导致的,与口腔炎症和上火也有关,可以服用维生素b2和维生素c治疗。用西瓜皮煮水喝,可以清热去火。局部用口腔溃疡散或者用维生素c研磨成粉末涂抹,都可以有效缓解疼痛。孕妇正常也要补充维生素的,服用维生素b2没有问题的。平时一定要多吃新鲜蔬菜水果,补充维生素,注意口腔卫生,早晚刷牙,饭后用温水漱口,每天早上起床用淡盐水漱口。", "question": "哺乳期能吃维生素b2片吗", "sent_token": ["你", "好", ",", "口", "腔", "溃", "疡", "一", "般", "都", "是", "由", "于", "维", "生", "素", "缺", "乏", "导", "致", "的", ",", "与", "口", "腔", "炎", "症", "和", "上", "火", "也", "有", "关", ",", "可", "以", "服", "用", "维", "生", "素", "b2", "和", "维", "生", "素", "c", "治", "疗", "。", "用", "西", "瓜", "皮", "煮", "水", "喝", ",", "可", "以", "清", "热", "去", "火", "。", "局", "部", "用", "口", "腔", "溃", "疡", "散", "或", "者", "用", "维", "生", "素", "c", "研", "磨", "成", "粉", "末", "涂", "抹", ",", "都", "可", "以", "有", "效", "缓", "解", "疼", "痛", "。", "孕", "妇", "正", "常", "也", "要", "补", "充", "维", "生", "素", "的", ",", "服", "用", "维", "生", "素", "b2", "没", "有", "问", "题", "的", "。", "平", "时", "一", "定", "要", "多", "吃", "新", "鲜", "蔬", "菜", "水", "果", ",", "补", "充", "维", "生", "素", ",", "注", "意", "口", "腔", "卫", "生", ",", "早", "晚", "刷", "牙", ",", "饭", "后", "用", "温", "水", "漱", "口", ",", "每", "天", "早", "上", "起", "床", "用", "淡", "盐", "水", "漱", "口", "。", "哺", "乳", "期", "可", "以", "吃", "维", "生", "素", "b2", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "ori", "rel_ids": [1983]} +{"id": 231, "title": "6岁儿童吃几颗肠虫清,吃肠虫清需要忌口吗_孕育常识_亲子宝典库_", "context": "肠虫清是六岁儿童就可以服用的一次吃两片,是吃饱饭后吃,肠虫清的主要是驱虫的药物,一般在晚上睡前服用的是比较好的,服药期间要多喝开水,多吃清淡易消化的食物,忌辛辣刺激性食物和油腻煎炸的食物,注意保暖避免着凉。", "question": "6岁儿童吃几颗肠虫清", "sent_token": ["肠", "虫", "清", "是", "六", "岁", "儿", "童", "就", "可", "以", "服", "用", "的", "一", "次", "吃", "两", "片", ",", "是", "吃", "饱", "饭", "后", "吃", ",", "肠", "虫", "清", "的", "主", "要", "是", "驱", "虫", "的", "药", "物", ",", "一", "般", "在", "晚", "上", "睡", "前", "服", "用", "的", "是", "比", "较", "好", "的", ",", "服", "药", "期", "间", "要", "多", "喝", "开", "水", ",", "多", "吃", "清", "淡", "易", "消", "化", "的", "食", "物", ",", "忌", "辛", "辣", "刺", "激", "性", "食", "物", "和", "油", "腻", "煎", "炸", "的", "食", "物", ",", "注", "意", "保", "暖", "避", "免", "着", "凉", "。", "6", "岁", "儿", "童", "吃", "几", "颗", "肠", "虫", "清", ",", "吃", "肠", "虫", "清", "需", "要", "忌", "口", "吗", "_", "孕", "育", "常", "识", "_", "亲", "子", "宝", "典", "库", "_"], "sample_type": "ori", "rel_ids": [1993]} +{"id": 241, "title": "隔阂意味着是什么意思", "context": "隔阂意味着很多意思,通常隔阂就意味着可能双方之间沟通有问题,比如有些夫妻或者是男女朋友之间吵架,两个人一起冷战,两个人由于没有沟通,双方之间的误会和矛盾就会越来越多了,也有可能是两个人总是以争吵的方式来解决问题,像这样的话就达不到有效的沟通,两个人两个人越不沟通,双方之间的矛盾和争吵就会越来越多,这个时候就会产生深深的隔阂。也有可能是双峰之间的价值观完全不同,比如对待某些问题的时候,有些人比较理性,但是有些人会比较感性,这个时候价值观不同的话就非常容易产生隔阂。", "question": "隔阂什么意思", "sent_token": ["隔", "阂", "意", "味", "着", "很", "多", "意", "思", ",", "通", "常", "隔", "阂", "就", "意", "味", "着", "可", "能", "双", "方", "之", "间", "沟", "通", "有", "问", "题", ",", "比", "如", "有", "些", "夫", "妻", "或", "者", "是", "男", "女", "朋", "友", "之", "间", "吵", "架", ",", "两", "个", "人", "一", "起", "冷", "战", ",", "两", "个", "人", "由", "于", "没", "有", "沟", "通", ",", "双", "方", "之", "间", "的", "误", "会", "和", "矛", "盾", "就", "会", "越", "来", "越", "多", "了", ",", "也", "有", "可", "能", "是", "两", "个", "人", "总", "是", "以", "争", "吵", "的", "方", "式", "来", "解", "决", "问", "题", ",", "像", "这", "样", "的", "话", "就", "达", "不", "到", "有", "效", "的", "沟", "通", ",", "两", "个", "人", "两", "个", "人", "越", "不", "沟", "通", ",", "双", "方", "之", "间", "的", "矛", "盾", "和", "争", "吵", "就", "会", "越", "来", "越", "多", ",", "这", "个", "时", "候", "就", "会", "产", "生", "深", "深", "的", "隔", "阂", "。", "也", "有", "可", 
"能", "是", "双", "峰", "之", "间", "的", "价", "值", "观", "完", "全", "不", "同", ",", "比", "如", "对", "待", "某", "些", "问", "题", "的", "时", "候", ",", "有", "些", "人", "比", "较", "理", "性", ",", "但", "是", "有", "些", "人", "会", "比", "较", "感", "性", ",", "这", "个", "时", "候", "价", "值", "观", "不", "同", "的", "话", "就", "非", "常", "容", "易", "产", "生", "隔", "阂", "。", "隔", "阂", "意", "味", "着", "是", "什", "么", "意", "思"], "sample_type": "ori", "rel_ids": [2003]} +{"id": 242, "title": "小儿癫痫病能彻底治愈的吗_有问必答_快速问医生", "context": "你好,很高兴为你服务,目前小儿癫痫是可以治愈的,不同的癫痫类型以及患者的实际病情不同,其适合的治疗方法也是不尽相同的。现在常见的小儿癫痫治疗都是采用中医为基础的治疗方法,这样对患儿的伤害较小,而西医则有很大的副作用,好吧", "question": "小儿癫痫能治愈吗", "sent_token": ["你", "好", ",", "很", "高", "兴", "为", "你", "服", "务", ",", "目", "前", "小", "儿", "癫", "痫", "是", "可", "以", "治", "愈", "的", ",", "不", "同", "的", "癫", "痫", "类", "型", "以", "及", "患", "者", "的", "实", "际", "病", "情", "不", "同", ",", "其", "适", "合", "的", "治", "疗", "方", "法", "也", "是", "不", "尽", "相", "同", "的", "。", "现", "在", "常", "见", "的", "小", "儿", "癫", "痫", "治", "疗", "都", "是", "采", "用", "中", "医", "为", "基", "础", "的", "治", "疗", "方", "法", ",", "这", "样", "对", "患", "儿", "的", "伤", "害", "较", "小", ",", "而", "西", "医", "则", "有", "很", "大", "的", "副", "作", "用", ",", "好", "吧", "小", "儿", "癫", "痫", "病", "能", "彻", "底", "治", "愈", "的", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "ori", "rel_ids": [2004]} +{"id": 250, "title": "脑内多发腔隙性脑梗死严重吗_39健康问答_39健康网", "context": "脑内多发腔隙性脑梗死,部分软化灶形成,一般不严重,是细枝血管梗塞,引起小灶脑组织坏死,脑组织软化灶,其他部位的脑组织会替代坏死部位的脑组织功能,所以一般没有不适的症状。注意控制血压,清淡饮食,控制血脂,血粘度,精神放松,解除思想顾虑,多做室外文娱体育活动,精神愉快,多接受紫外线照射,多喝开水,会有利于康复。可以根据情况使用疏通血管的药物。", "question": "多发腔隙性脑梗死吃什么中药", "sent_token": ["脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", ",", "部", "分", "软", "化", "灶", "形", "成", ",", "一", "般", "不", "严", "重", ",", "是", "细", "枝", "血", "管", "梗", "塞", ",", "引", "起", "小", "灶", "脑", "组", "织", "坏", "死", ",", "脑", "组", "织", "软", "化", "灶", ",", "其", "他", "部", "位", "的", "脑", "组", "织", "会", "替", "代", "坏", "死", "部", "位", "的", "脑", "组", "织", "功", "能", ",", "所", "以", "一", "般", "没", "有", "不", "适", "的", "症", "状", "。", "注", "意", "控", "制", "血", "压", ",", "清", "淡", "饮", "食", ",", "控", "制", "血", "脂", ",", "血", "粘", "度", ",", "精", "神", "放", "松", ",", "解", "除", "思", "想", "顾", "虑", ",", "多", "做", "室", "外", "文", "娱", "体", "育", "活", "动", ",", "精", "神", "愉", "快", ",", "多", "接", "受", "紫", "外", "线", "照", "射", ",", "多", "喝", "开", "水", ",", "会", "有", "利", "于", "康", "复", "。", "可", "以", "根", "据", "情", "况", "使", "用", "疏", "通", "血", "管", "的", "药", "物", "。", "脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", "严", "重", "吗", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [2012]} +{"id": 1763, "title": "地瓜是红薯吗", "context": "地瓜不是红薯。地瓜一般生吃或者凉拌,外形是纺锤型的,有明显的瓣状结构,内里的肉是白色的,有清淡的药香味,生吃又脆又甜,常食用可以预防肝癌、胃癌,营养价值非常高。红薯是粗粮,也叫番薯山芋。它是一种属管状花目,旋花科一年生的草本植物,富含丰富的矿物质和维生素,而且非常耐饱。", "question": "马铃薯和红苕指的是同一个物种吗", "sent_token": ["地", "瓜", "不", "是", "红", "薯", "。", "地", "瓜", "一", "般", "生", "吃", "或", "者", "凉", "拌", ",", "外", "形", "是", "纺", "锤", "型", "的", ",", "有", "明", "显", "的", "瓣", "状", "结", "构", ",", "内", "里", "的", "肉", "是", "白", "色", "的", ",", "有", "清", "淡", "的", "药", "香", "味", ",", "生", "吃", "又", "脆", "又", "甜", ",", "常", "食", "用", "可", "以", "预", "防", "肝", "癌", "、", "胃", "癌", ",", "营", "养", "价", "值", "非", "常", "高", "。", "红", "薯", "是", "粗", "粮", ",", "也", "叫", "番", "薯", "山", "芋", "。", "它", "是", "一", "种", "属", "管", "状", "花", "目", ",", "旋", "花", "科", "一", "年", "生", "的", "草", "本", "植", "物", ",", "富", "含", "丰", "富", "的", "矿", "物", "质", "和", "维", "生", "素", ",", "而", "且", "非", "常", "耐", "饱", "。", "地", 
"瓜", "是", "红", "薯", "吗"], "sample_type": "disturb"} +{"id": 1767, "title": "已满多少岁的人犯贩卖毒品罪应负刑事责任", "context": "根据《刑法》第十七条:已满十六周岁的人犯罪,应当负刑事责任。已满十四周岁不满十六周岁的人,犯故意杀人、故意伤害致人重伤或者死亡、强奸、抢劫、贩卖毒品、放火、爆炸、投放危险物质罪的,应当负刑事责任。", "question": "贩卖毒品需要负刑事责任的人要满几周岁", "sent_token": ["根", "据", "《", "刑", "法", "》", "第", "十", "七", "条", ":", "已", "满", "十", "六", "周", "岁", "的", "人", "犯", "罪", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "十", "四", "周", "岁", "不", "满", "十", "六", "周", "岁", "的", "人", ",", "犯", "故", "意", "杀", "人", "、", "故", "意", "伤", "害", "致", "人", "重", "伤", "或", "者", "死", "亡", "、", "强", "奸", "、", "抢", "劫", "、", "贩", "卖", "毒", "品", "、", "放", "火", "、", "爆", "炸", "、", "投", "放", "危", "险", "物", "质", "罪", "的", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "多", "少", "岁", "的", "人", "犯", "贩", "卖", "毒", "品", "罪", "应", "负", "刑", "事", "责", "任"], "sample_type": "disturb"} +{"id": 1772, "title": "读研跟考研有什么区别", "context": "考研和读研的区别在于概念和意义不同。考研是指考生通过考试来得到研究生的入学资格,而考生并不是硕士研究生;而读研是指学生在高校攻读硕士研究生的过程,学生身份已经是硕士研究生。这二者并不等同,而是有先后关系,也就是说考生只有通过考研,才能成为硕士研究生,然后在规定的学习时间内读研。", "question": "考取研究生跟攻读研究生,具体什么区别?", "sent_token": ["考", "研", "和", "读", "研", "的", "区", "别", "在", "于", "概", "念", "和", "意", "义", "不", "同", "。", "考", "研", "是", "指", "考", "生", "通", "过", "考", "试", "来", "得", "到", "研", "究", "生", "的", "入", "学", "资", "格", ",", "而", "考", "生", "并", "不", "是", "硕", "士", "研", "究", "生", ";", "而", "读", "研", "是", "指", "学", "生", "在", "高", "校", "攻", "读", "硕", "士", "研", "究", "生", "的", "过", "程", ",", "学", "生", "身", "份", "已", "经", "是", "硕", "士", "研", "究", "生", "。", "这", "二", "者", "并", "不", "等", "同", ",", "而", "是", "有", "先", "后", "关", "系", ",", "也", "就", "是", "说", "考", "生", "只", "有", "通", "过", "考", "研", ",", "才", "能", "成", "为", "硕", "士", "研", "究", "生", ",", "然", "后", "在", "规", "定", "的", "学", "习", "时", "间", "内", "读", "研", "。", "读", "研", "跟", "考", "研", "有", "什", "么", "区", "别"], "sample_type": "disturb"} +{"id": 1774, "title": "多效唑能和磷酸二氢钾一起用吗", "context": "多效唑能和磷酸二氢钾一起用。多效唑是植物的生长调节剂,主要是控制作物疯长的。而磷酸二氢钾属于叶面肥,施用后可促使作物的叶色更加浓绿,根系发达,药效完全不同,也并不排斥,可以混合使用。不过要注意施用时要严格按照说明施加,不可过量,否则会阻碍生长。", "question": "磷酸一钾能和氯丁唑一起用OK吗", "sent_token": ["多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "。", "多", "效", "唑", "是", "植", "物", "的", "生", "长", "调", "节", "剂", ",", "主", "要", "是", "控", "制", "作", "物", "疯", "长", "的", "。", "而", "磷", "酸", "二", "氢", "钾", "属", "于", "叶", "面", "肥", ",", "施", "用", "后", "可", "促", "使", "作", "物", "的", "叶", "色", "更", "加", "浓", "绿", ",", "根", "系", "发", "达", ",", "药", "效", "完", "全", "不", "同", ",", "也", "并", "不", "排", "斥", ",", "可", "以", "混", "合", "使", "用", "。", "不", "过", "要", "注", "意", "施", "用", "时", "要", "严", "格", "按", "照", "说", "明", "施", "加", ",", "不", "可", "过", "量", ",", "否", "则", "会", "阻", "碍", "生", "长", "。", "多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "吗"], "sample_type": "disturb"} +{"id": 1776, "title": "猫能吃蛋黄吗", "context": "猫咪是可以吃蛋黄的。这里特定煮熟的白水蛋,猫咪不能吃生鸡蛋,因为生鸡蛋中有细菌,常见的是沙门氏菌,容易引起猫腹泻脱水,而且饲喂猫咪最好的只饲喂蛋黄。虽然可以吃蛋黄,但是需要掌握好量,一般一周最多吃两三次就可了。蛋黄中也含有丰富的胆固醇,易引发猫咪患脂肪肝和高脂血病。", "question": "小猫咪可以吃蛋黄吗,生的", "sent_token": ["猫", "咪", "是", "可", "以", "吃", "蛋", "黄", "的", "。", "这", "里", "特", "定", "煮", "熟", "的", "白", "水", "蛋", ",", "猫", "咪", "不", "能", "吃", "生", "鸡", "蛋", ",", "因", "为", "生", "鸡", "蛋", "中", "有", "细", "菌", ",", "常", "见", "的", "是", "沙", "门", "氏", "菌", ",", "容", "易", "引", "起", "猫", "腹", "泻", "脱", "水", ",", "而", "且", "饲", "喂", "猫", "咪", "最", "好", "的", "只", "饲", "喂", "蛋", "黄", "。", "虽", "然", "可", "以", "吃", "蛋", "黄", ",", "但", "是", "需", "要", "掌", "握", "好", "量", ",", "一", "般", "一", "周", "最", "多", "吃", "两", "三", "次", "就", "可", 
"了", "。", "蛋", "黄", "中", "也", "含", "有", "丰", "富", "的", "胆", "固", "醇", ",", "易", "引", "发", "猫", "咪", "患", "脂", "肪", "肝", "和", "高", "脂", "血", "病", "。", "猫", "能", "吃", "蛋", "黄", "吗"], "sample_type": "disturb"} +{"id": 1780, "title": "最近深圳限行吗", "context": "现在由于疫情的影响,深圳市不限行的了,但是没有必要尽量还是少出门,出门也要做好一系列的防护措施才可以。因为虽然目前国内疫情形势有所缓和,但是这并不意味着疫情的结束,国外疫情形势还是很严峻的,境外输入案例较多。", "question": "近期深圳没有限行吗", "sent_token": ["现", "在", "由", "于", "疫", "情", "的", "影", "响", ",", "深", "圳", "市", "不", "限", "行", "的", "了", ",", "但", "是", "没", "有", "必", "要", "尽", "量", "还", "是", "少", "出", "门", ",", "出", "门", "也", "要", "做", "好", "一", "系", "列", "的", "防", "护", "措", "施", "才", "可", "以", "。", "因", "为", "虽", "然", "目", "前", "国", "内", "疫", "情", "形", "势", "有", "所", "缓", "和", ",", "但", "是", "这", "并", "不", "意", "味", "着", "疫", "情", "的", "结", "束", ",", "国", "外", "疫", "情", "形", "势", "还", "是", "很", "严", "峻", "的", ",", "境", "外", "输", "入", "案", "例", "较", "多", "。", "最", "近", "深", "圳", "限", "行", "吗"], "sample_type": "disturb"} +{"id": 1781, "title": "合同签字不盖章有效吗", "context": "可能有效可能无效。只有签字没有公章的合同是否有法律效力要根据具体情况分析:如果合同是由单位的委托代理人在其权限范围内、或单位的法定代表人签的字,则合同有效。", "question": "一没有签字,二没有盖章的合同,还有法律效用吗", "sent_token": ["可", "能", "有", "效", "可", "能", "无", "效", "。", "只", "有", "签", "字", "没", "有", "公", "章", "的", "合", "同", "是", "否", "有", "法", "律", "效", "力", "要", "根", "据", "具", "体", "情", "况", "分", "析", ":", "如", "果", "合", "同", "是", "由", "单", "位", "的", "委", "托", "代", "理", "人", "在", "其", "权", "限", "范", "围", "内", "、", "或", "单", "位", "的", "法", "定", "代", "表", "人", "签", "的", "字", ",", "则", "合", "同", "有", "效", "。", "合", "同", "签", "字", "不", "盖", "章", "有", "效", "吗"], "sample_type": "disturb"} +{"id": 1789, "title": "", "context": "吴三桂(1612年-1678年10月2日),字长伯,一字月所,明朝辽东人,明末清初著名政治军事人物,吴周政权建立者吴周太祖。", "question": "平西王吴三贵什么朝代", "sent_token": ["吴", "三", "桂", "(", "1612", "年", "-", "1678", "年", "10", "月", "2", "日", ")", ",", "字", "长", "伯", ",", "一", "字", "月", "所", ",", "明", "朝", "辽", "东", "人", ",", "明", "末", "清", "初", "著", "名", "政", "治", "军", "事", "人", "物", ",", "吴", "周", "政", "权", "建", "立", "者", "吴", "周", "太", "祖", "。"], "sample_type": "disturb"} +{"id": 1796, "title": "狗狗为什么互相闻屁股", "context": "相互闻屁股是狗狗打招呼的一种方式。狗狗的嗅觉很敏感,它们可以用相互闻屁股来了解狗狗的配偶状况、饮食习惯等,因为狗狗的屁股后面有两个肛门腺,在肛门腺里面涵盖了很多的信息素。处在发情期的狗狗也会通过闻屁股来挑选自己的配偶。", "question": "狗狗为何总是闻屁股", "sent_token": ["相", "互", "闻", "屁", "股", "是", "狗", "狗", "打", "招", "呼", "的", "一", "种", "方", "式", "。", "狗", "狗", "的", "嗅", "觉", "很", "敏", "感", ",", "它", "们", "可", "以", "用", "相", "互", "闻", "屁", "股", "来", "了", "解", "狗", "狗", "的", "配", "偶", "状", "况", "、", "饮", "食", "习", "惯", "等", ",", "因", "为", "狗", "狗", "的", "屁", "股", "后", "面", "有", "两", "个", "肛", "门", "腺", ",", "在", "肛", "门", "腺", "里", "面", "涵", "盖", "了", "很", "多", "的", "信", "息", "素", "。", "处", "在", "发", "情", "期", "的", "狗", "狗", "也", "会", "通", "过", "闻", "屁", "股", "来", "挑", "选", "自", "己", "的", "配", "偶", "。", "狗", "狗", "为", "什", "么", "互", "相", "闻", "屁", "股"], "sample_type": "disturb"} +{"id": 1798, "title": "出租房隔音差怎么解决", "context": "可以在窗户上贴一层隔音膜,在粘贴过程中要注意,不要出现气泡,以免影响隔音效果。若想要隔音效果更好点,还可以购买一些密封条安装在窗户缝隙处,这也能起到更好的隔音效果。另外,室内使用的家具可以更换成木质的,这样同样能起到一定的吸音效果。", "question": "出租房隔音不好如何解决", "sent_token": ["可", "以", "在", "窗", "户", "上", "贴", "一", "层", "隔", "音", "膜", ",", "在", "粘", "贴", "过", "程", "中", "要", "注", "意", ",", "不", "要", "出", "现", "气", "泡", ",", "以", "免", "影", "响", "隔", "音", "效", "果", "。", "若", "想", "要", "隔", "音", "效", "果", "更", "好", "点", ",", "还", "可", "以", "购", "买", "一", "些", "密", "封", "条", "安", "装", "在", "窗", "户", "缝", "隙", "处", ",", "这", "也", "能", "起", "到", "更", "好", "的", "隔", "音", "效", "果", "。", "另", "外", ",", "室", "内", 
"使", "用", "的", "家", "具", "可", "以", "更", "换", "成", "木", "质", "的", ",", "这", "样", "同", "样", "能", "起", "到", "一", "定", "的", "吸", "音", "效", "果", "。", "出", "租", "房", "隔", "音", "差", "怎", "么", "解", "决"], "sample_type": "disturb"} +{"id": 1802, "title": "鬼迷心窍(李宗盛演唱歌曲)_百度百科", "context": "《鬼迷心窍》是1992年黄日华、周海媚主演台湾电视剧《末代皇孙》的主题曲,是由李宗盛作词、作曲、演唱,收录于1992年影视剧音乐合辑《滚石九大天王之十二出好戏》当中。1993年,李宗盛凭借该曲获得第一届新加坡醉心金曲奖最佳作词奖", "question": "谁是鬼迷心窍的原唱", "sent_token": ["《", "鬼", "迷", "心", "窍", "》", "是", "1992", "年", "黄", "日", "华", "、", "周", "海", "媚", "主", "演", "台", "湾", "电", "视", "剧", "《", "末", "代", "皇", "孙", "》", "的", "主", "题", "曲", ",", "是", "由", "李", "宗", "盛", "作", "词", "、", "作", "曲", "、", "演", "唱", ",", "收", "录", "于", "1992", "年", "影", "视", "剧", "音", "乐", "合", "辑", "《", "滚", "石", "九", "大", "天", "王", "之", "十", "二", "出", "好", "戏", "》", "当", "中", "。", "1993", "年", ",", "李", "宗", "盛", "凭", "借", "该", "曲", "获", "得", "第", "一", "届", "新", "加", "坡", "醉", "心", "金", "曲", "奖", "最", "佳", "作", "词", "奖", "鬼", "迷", "心", "窍", "(", "李", "宗", "盛", "演", "唱", "歌", "曲", ")", "_", "百", "度", "百", "科"], "sample_type": "disturb"} +{"id": 1803, "title": "", "context": "白龙马,名著小说《西游记》中的重要角色。本是西海龙王三太子,因纵火烧毁玉帝赏赐的明珠而被西海龙王上天告忤逆,要被斩首。后因南海观世菩萨出面才免于死罪,被贬到蛇盘山鹰愁涧等待唐僧取经。之后又误吃唐僧所骑的白马,被菩萨点化,变身为白龙。", "question": "西游记中的白龙马,它的原始身份是什么", "sent_token": ["白", "龙", "马", ",", "名", "著", "小", "说", "《", "西", "游", "记", "》", "中", "的", "重", "要", "角", "色", "。", "本", "是", "西", "海", "龙", "王", "三", "太", "子", ",", "因", "纵", "火", "烧", "毁", "玉", "帝", "赏", "赐", "的", "明", "珠", "而", "被", "西", "海", "龙", "王", "上", "天", "告", "忤", "逆", ",", "要", "被", "斩", "首", "。", "后", "因", "南", "海", "观", "世", "菩", "萨", "出", "面", "才", "免", "于", "死", "罪", ",", "被", "贬", "到", "蛇", "盘", "山", "鹰", "愁", "涧", "等", "待", "唐", "僧", "取", "经", "。", "之", "后", "又", "误", "吃", "唐", "僧", "所", "骑", "的", "白", "马", ",", "被", "菩", "萨", "点", "化", ",", "变", "身", "为", "白", "龙", "。"], "sample_type": "disturb"} +{"id": 1805, "title": "", "context": "《湮灭》是由派拉蒙影业出品的科幻惊悚片,这部电影集合了科幻、悬疑、惊悚等时下流行的元素,由亚历克斯·加兰执导,娜塔莉·波特曼、詹妮弗·杰森·李、吉娜·罗德里格兹、泰莎·汤普森联合主演。该片于2018年2月23日在美国上映。影片根据杰夫·梵德米尔所著《遗落的南境》三部曲的首部同名小说改编,讲述了生物学家莉娜为了自己的丈夫,她自愿加入了科学考察探险小队,去研究美国领土一块被检疫隔离的生态灾害区域的故事。", "question": "湮灭是什么类型的电影", "sent_token": ["《", "湮", "灭", "》", "是", "由", "派", "拉", "蒙", "影", "业", "出", "品", "的", "科", "幻", "惊", "悚", "片", ",", "这", "部", "电", "影", "集", "合", "了", "科", "幻", "、", "悬", "疑", "、", "惊", "悚", "等", "时", "下", "流", "行", "的", "元", "素", ",", "由", "亚", "历", "克", "斯", "·", "加", "兰", "执", "导", ",", "娜", "塔", "莉", "·", "波", "特", "曼", "、", "詹", "妮", "弗", "·", "杰", "森", "·", "李", "、", "吉", "娜", "·", "罗", "德", "里", "格", "兹", "、", "泰", "莎", "·", "汤", "普", "森", "联", "合", "主", "演", "。", "该", "片", "于", "2018", "年", "2", "月", "23", "日", "在", "美", "国", "上", "映", "。", "影", "片", "根", "据", "杰", "夫", "·", "梵", "德", "米", "尔", "所", "著", "《", "遗", "落", "的", "南", "境", "》", "三", "部", "曲", "的", "首", "部", "同", "名", "小", "说", "改", "编", ",", "讲", "述", "了", "生", "物", "学", "家", "莉", "娜", "为", "了", "自", "己", "的", "丈", "夫", ",", "她", "自", "愿", "加", "入", "了", "科", "学", "考", "察", "探", "险", "小", "队", ",", "去", "研", "究", "美", "国", "领", "土", "一", "块", "被", "检", "疫", "隔", "离", "的", "生", "态", "灾", "害", "区", "域", "的", "故", "事", "。"], "sample_type": "disturb"} +{"id": 1807, "title": "", "context": "网球与高尔夫、保龄球、桌球并成为世界四大绅士运动,他的起源可以追溯到12-13世纪的法国。网球运动的起源及演变可以用四句话来概括:网球孕育在法国,诞生在英国,开始普及和形成高潮在美国,现盛行全世界。", "question": "网球发源于哪国", "sent_token": ["网", "球", "与", "高", "尔", "夫", "、", "保", "龄", "球", "、", "桌", "球", "并", "成", "为", "世", "界", "四", "大", "绅", "士", "运", "动", ",", "他", "的", "起", "源", "可", "以", "追", "溯", "到", 
"12", "-", "13", "世", "纪", "的", "法", "国", "。", "网", "球", "运", "动", "的", "起", "源", "及", "演", "变", "可", "以", "用", "四", "句", "话", "来", "概", "括", ":", "网", "球", "孕", "育", "在", "法", "国", ",", "诞", "生", "在", "英", "国", ",", "开", "始", "普", "及", "和", "形", "成", "高", "潮", "在", "美", "国", ",", "现", "盛", "行", "全", "世", "界", "。"], "sample_type": "disturb"} +{"id": 1810, "title": "单人挑战巫女大蛇悲鸣需要多少体力_单人挑战巫女大蛇悲鸣需要体力", "context": "玩家挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。阴阳师巫女大蛇悲鸣单人通关需要12点体力组队通关的话只需要8点体力,挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。奖励掉落5星与6星御魂,经验强化狗粮4星青吉鬼。在御魂副本1-10层原本掉落的基础上,巫女大蛇·悲鸣新增了蚌精、幽谷响、轮入道、蝠翼、狂骨这5种御魂的掉落,每日掉落御魂种类增加到5。", "question": "阴阳师 组队挑战大蛇悲鸣需要多少体力", "sent_token": ["玩", "家", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "阴", "阳", "师", "巫", "女", "大", "蛇", "悲", "鸣", "单", "人", "通", "关", "需", "要", "12", "点", "体", "力", "组", "队", "通", "关", "的", "话", "只", "需", "要", "8", "点", "体", "力", ",", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "奖", "励", "掉", "落", "5", "星", "与", "6", "星", "御", "魂", ",", "经", "验", "强", "化", "狗", "粮", "4", "星", "青", "吉", "鬼", "。", "在", "御", "魂", "副", "本", "1", "-", "10", "层", "原", "本", "掉", "落", "的", "基", "础", "上", ",", "巫", "女", "大", "蛇", "·", "悲", "鸣", "新", "增", "了", "蚌", "精", "、", "幽", "谷", "响", "、", "轮", "入", "道", "、", "蝠", "翼", "、", "狂", "骨", "这", "5", "种", "御", "魂", "的", "掉", "落", ",", "每", "日", "掉", "落", "御", "魂", "种", "类", "增", "加", "到", "5", "。", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "多", "少", "体", "力", "_", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "体", "力"], "sample_type": "disturb"} +{"id": 1815, "title": "", "context": "心脏是脊椎动物身体中最重要的一个器官,人类的心脏位于胸腔中部偏左,体积约相当于一个拳头大小,重量约350克。女性的心脏通常要比男性的体积小且重量轻。人的心脏外形像桃子,位于横膈之上,两肺间而偏左。", "question": "人类心脏有多重", "sent_token": ["心", "脏", "是", "脊", "椎", "动", "物", "身", "体", "中", "最", "重", "要", "的", "一", "个", "器", "官", ",", "人", "类", "的", "心", "脏", "位", "于", "胸", "腔", "中", "部", "偏", "左", ",", "体", "积", "约", "相", "当", "于", "一", "个", "拳", "头", "大", "小", ",", "重", "量", "约", "350", "克", "。", "女", "性", "的", "心", "脏", "通", "常", "要", "比", "男", "性", "的", "体", "积", "小", "且", "重", "量", "轻", "。", "人", "的", "心", "脏", "外", "形", "像", "桃", "子", ",", "位", "于", "横", "膈", "之", "上", ",", "两", "肺", "间", "而", "偏", "左", "。"], "sample_type": "disturb"} +{"id": 1816, "title": "紫菜变成紫色还能吃吗-有来医生", "context": "如果紫菜变成紫色的情况下,主要考虑还是紫菜受潮引起的,紫菜受潮以后容易滋生细菌,营养物质也会丧失,口感也会变差,一般情况下,建议不要食用,以免导致消化道的不良反应。紫菜中含有的营养物质是很丰富的,含有丰富的锌元素和铁元素,每天适当的吃一点,可以预防缺铁性贫血,可以预防缺锌引起的反复性口腔溃疡,可以增进食欲。", "question": "海苔回潮了还能吃不", "sent_token": ["如", "果", "紫", "菜", "变", "成", "紫", "色", "的", "情", "况", "下", ",", "主", "要", "考", "虑", "还", "是", "紫", "菜", "受", "潮", "引", "起", "的", ",", "紫", "菜", "受", "潮", "以", "后", "容", "易", "滋", "生", "细", "菌", ",", "营", "养", "物", "质", "也", "会", "丧", "失", ",", "口", "感", "也", "会", "变", "差", ",", "一", "般", "情", "况", "下", ",", "建", "议", "不", "要", "食", "用", ",", "以", "免", "导", "致", "消", "化", "道", "的", "不", "良", "反", "应", "。", "紫", "菜", "中", "含", "有", "的", "营", "养", "物", "质", "是", "很", "丰", "富", "的", ",", "含", "有", "丰", "富", "的", "锌", "元", "素", "和", "铁", "元", "素", ",", "每", "天", "适", "当", "的", "吃", "一", "点", ",", "可", "以", "预", "防", "缺", "铁", "性", "贫", "血", ",", "可", "以", "预", "防", "缺", "锌", "引", "起", "的", "反", "复", "性", "口", "腔", "溃", "疡", ",", "可", "以", "增", "进", "食", "欲", "。", "紫", "菜", "变", "成", "紫", "色", "还", "能", "吃", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1830, "title": "", "context": 
"钢铁侠是由美国漫威电影工作室出品的一部科幻冒险电影,改编自同名系列漫画。穿上盔甲后,托尼变身成了复仇者联盟中惩恶扬善的钢铁侠。复仇者联盟2:奥创纪元钢铁侠是美国演员小罗伯特·唐尼演的。小罗伯特唐尼的电影钢铁侠扮演者小罗伯特·。", "question": "谁演过钢铁侠", "sent_token": ["钢", "铁", "侠", "是", "由", "美", "国", "漫", "威", "电", "影", "工", "作", "室", "出", "品", "的", "一", "部", "科", "幻", "冒", "险", "电", "影", ",", "改", "编", "自", "同", "名", "系", "列", "漫", "画", "。", "穿", "上", "盔", "甲", "后", ",", "托", "尼", "变", "身", "成", "了", "复", "仇", "者", "联", "盟", "中", "惩", "恶", "扬", "善", "的", "钢", "铁", "侠", "。", "复", "仇", "者", "联", "盟", "2", ":", "奥", "创", "纪", "元", "钢", "铁", "侠", "是", "美", "国", "演", "员", "小", "罗", "伯", "特", "·", "唐", "尼", "演", "的", "。", "小", "罗", "伯", "特", "唐", "尼", "的", "电", "影", "钢", "铁", "侠", "扮", "演", "者", "小", "罗", "伯", "特", "·", "。"], "sample_type": "disturb"} +{"id": 1831, "title": "人间正道是沧桑是什么意思_酷知经验网", "context": "天若有情天亦老,人间正道是沧桑:上句借用李贺《金铜仙人辞汉歌》中诗句,原诗说的是汉武帝时制作的极贵重的宝物金铜仙人像,在三国时被魏明帝由长安迁往洛阳的传说。原句的意思是,对于这样的人间恨事,天若有情,也要因悲伤而衰老。人间正道,社会发展的正常规律。沧桑,沧海(大海)变为桑田,多指巨大的变化,这里比喻的是革命的道路艰难曲折。", "question": "人间正道是沧桑前面是什么", "sent_token": ["天", "若", "有", "情", "天", "亦", "老", ",", "人", "间", "正", "道", "是", "沧", "桑", ":", "上", "句", "借", "用", "李", "贺", "《", "金", "铜", "仙", "人", "辞", "汉", "歌", "》", "中", "诗", "句", ",", "原", "诗", "说", "的", "是", "汉", "武", "帝", "时", "制", "作", "的", "极", "贵", "重", "的", "宝", "物", "金", "铜", "仙", "人", "像", ",", "在", "三", "国", "时", "被", "魏", "明", "帝", "由", "长", "安", "迁", "往", "洛", "阳", "的", "传", "说", "。", "原", "句", "的", "意", "思", "是", ",", "对", "于", "这", "样", "的", "人", "间", "恨", "事", ",", "天", "若", "有", "情", ",", "也", "要", "因", "悲", "伤", "而", "衰", "老", "。", "人", "间", "正", "道", ",", "社", "会", "发", "展", "的", "正", "常", "规", "律", "。", "沧", "桑", ",", "沧", "海", "(", "大", "海", ")", "变", "为", "桑", "田", ",", "多", "指", "巨", "大", "的", "变", "化", ",", "这", "里", "比", "喻", "的", "是", "革", "命", "的", "道", "路", "艰", "难", "曲", "折", "。", "人", "间", "正", "道", "是", "沧", "桑", "是", "什", "么", "意", "思", "_", "酷", "知", "经", "验", "网"], "sample_type": "disturb"} +{"id": 1834, "title": "", "context": "《艺妓回忆录》根据美国作家阿瑟-高顿的同名小说改编。于2005年12月1日上映,由章子怡·巩俐·杨紫琼等共同演绎。是一部时长约140分钟的电影。全篇充满着古典美,时代背景从1929年开始延续到二战结束,女主人公回忆了自己从小拼命挣扎、历尽荣辱的人生经历。该片获得2006年第78届奥斯卡金像奖最佳摄影、最佳艺术指导、最佳服装设计三项奖项。", "question": "艺伎回忆录片长有多久", "sent_token": ["《", "艺", "妓", "回", "忆", "录", "》", "根", "据", "美", "国", "作", "家", "阿", "瑟", "-", "高", "顿", "的", "同", "名", "小", "说", "改", "编", "。", "于", "2005", "年", "12", "月", "1", "日", "上", "映", ",", "由", "章", "子", "怡", "·", "巩", "俐", "·", "杨", "紫", "琼", "等", "共", "同", "演", "绎", "。", "是", "一", "部", "时", "长", "约", "140", "分", "钟", "的", "电", "影", "。", "全", "篇", "充", "满", "着", "古", "典", "美", ",", "时", "代", "背", "景", "从", "1929", "年", "开", "始", "延", "续", "到", "二", "战", "结", "束", ",", "女", "主", "人", "公", "回", "忆", "了", "自", "己", "从", "小", "拼", "命", "挣", "扎", "、", "历", "尽", "荣", "辱", "的", "人", "生", "经", "历", "。", "该", "片", "获", "得", "2006", "年", "第", "78", "届", "奥", "斯", "卡", "金", "像", "奖", "最", "佳", "摄", "影", "、", "最", "佳", "艺", "术", "指", "导", "、", "最", "佳", "服", "装", "设", "计", "三", "项", "奖", "项", "。"], "sample_type": "disturb"} +{"id": 1839, "title": "痛风挂哪个科室比较好?_39健康问答_39健康网", "context": "痛风属于代谢风湿性疾病,目前主要是在风湿免疫科治疗,所以患者需要挂风湿免疫科。风湿免疫科在绝大多数三级甲等医院都有独立的科室。由于这个科是一个新兴学科,在很多县级医院还没有成立,患者可以到内分泌科就诊,挂内分泌科。如果这两个科都没有患者,可以到骨科就诊,因为痛风首发表现是急性痛风性关节炎,骨科大夫对痛风也有一定的了解。", "question": "痛风属于什么类型疾病呢", "sent_token": ["痛", "风", "属", "于", "代", "谢", "风", "湿", "性", "疾", "病", ",", "目", "前", "主", "要", "是", "在", "风", "湿", "免", "疫", "科", "治", "疗", ",", "所", "以", "患", "者", "需", "要", "挂", "风", "湿", "免", "疫", "科", "。", "风", "湿", "免", "疫", "科", "在", "绝", "大", "多", "数", "三", "级", "甲", "等", "医", "院", 
"都", "有", "独", "立", "的", "科", "室", "。", "由", "于", "这", "个", "科", "是", "一", "个", "新", "兴", "学", "科", ",", "在", "很", "多", "县", "级", "医", "院", "还", "没", "有", "成", "立", ",", "患", "者", "可", "以", "到", "内", "分", "泌", "科", "就", "诊", ",", "挂", "内", "分", "泌", "科", "。", "如", "果", "这", "两", "个", "科", "都", "没", "有", "患", "者", ",", "可", "以", "到", "骨", "科", "就", "诊", ",", "因", "为", "痛", "风", "首", "发", "表", "现", "是", "急", "性", "痛", "风", "性", "关", "节", "炎", ",", "骨", "科", "大", "夫", "对", "痛", "风", "也", "有", "一", "定", "的", "了", "解", "。", "痛", "风", "挂", "哪", "个", "科", "室", "比", "较", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} +{"id": 1844, "title": "阴阳师武士之灵生前被谁所杀_游侠网", "context": "由武士死后的灵魂化成。生前一直为主人效忠,最后献出了生命。从武士之灵的传记中可以得知,武士之灵生前是被茨木童子所击杀。该问题来自游戏内的逢魔密信,正确回答问题之后就有机会获得包括金币、体力、勾玉和结界卡在内的多种游戏内道具物资奖励。", "question": "武士之灵生前被谁所杀", "sent_token": ["由", "武", "士", "死", "后", "的", "灵", "魂", "化", "成", "。", "生", "前", "一", "直", "为", "主", "人", "效", "忠", ",", "最", "后", "献", "出", "了", "生", "命", "。", "从", "武", "士", "之", "灵", "的", "传", "记", "中", "可", "以", "得", "知", ",", "武", "士", "之", "灵", "生", "前", "是", "被", "茨", "木", "童", "子", "所", "击", "杀", "。", "该", "问", "题", "来", "自", "游", "戏", "内", "的", "逢", "魔", "密", "信", ",", "正", "确", "回", "答", "问", "题", "之", "后", "就", "有", "机", "会", "获", "得", "包", "括", "金", "币", "、", "体", "力", "、", "勾", "玉", "和", "结", "界", "卡", "在", "内", "的", "多", "种", "游", "戏", "内", "道", "具", "物", "资", "奖", "励", "。", "阴", "阳", "师", "武", "士", "之", "灵", "生", "前", "被", "谁", "所", "杀", "_", "游", "侠", "网"], "sample_type": "disturb"} +{"id": 1850, "title": "中医肾主什么-有来医生", "context": "根据中医基础理论,肾主水、主纳气、主二便、主藏精。人体的生长的生命过程与肾中精气的盛衰有着密切的关系,肾主水,是指全身的水液代谢都是在肾阳的气化温煦作用下,从而分布到全身,然后再通过呼吸、二便将代谢废物排除体外。肾主纳气,是指肾能够使人体维持正常的呼吸深度。肾主二便,人的大小便需要在肾的作用下,才能够正常的排泄,否则就会出现异常的改变,比如大小便失禁、大便稀薄等情况。肾主藏精,是指五脏六腑化生的精气,最后都是储存在肾脏,反过来肾脏所藏的精气,又能够推动各脏腑的功能。", "question": "肾主什么", "sent_token": ["根", "据", "中", "医", "基", "础", "理", "论", ",", "肾", "主", "水", "、", "主", "纳", "气", "、", "主", "二", "便", "、", "主", "藏", "精", "。", "人", "体", "的", "生", "长", "的", "生", "命", "过", "程", "与", "肾", "中", "精", "气", "的", "盛", "衰", "有", "着", "密", "切", "的", "关", "系", ",", "肾", "主", "水", ",", "是", "指", "全", "身", "的", "水", "液", "代", "谢", "都", "是", "在", "肾", "阳", "的", "气", "化", "温", "煦", "作", "用", "下", ",", "从", "而", "分", "布", "到", "全", "身", ",", "然", "后", "再", "通", "过", "呼", "吸", "、", "二", "便", "将", "代", "谢", "废", "物", "排", "除", "体", "外", "。", "肾", "主", "纳", "气", ",", "是", "指", "肾", "能", "够", "使", "人", "体", "维", "持", "正", "常", "的", "呼", "吸", "深", "度", "。", "肾", "主", "二", "便", ",", "人", "的", "大", "小", "便", "需", "要", "在", "肾", "的", "作", "用", "下", ",", "才", "能", "够", "正", "常", "的", "排", "泄", ",", "否", "则", "就", "会", "出", "现", "异", "常", "的", "改", "变", ",", "比", "如", "大", "小", "便", "失", "禁", "、", "大", "便", "稀", "薄", "等", "情", "况", "。", "肾", "主", "藏", "精", ",", "是", "指", "五", "脏", "六", "腑", "化", "生", "的", "精", "气", ",", "最", "后", "都", "是", "储", "存", "在", "肾", "脏", ",", "反", "过", "来", "肾", "脏", "所", "藏", "的", "精", "气", ",", "又", "能", "够", "推", "动", "各", "脏", "腑", "的", "功", "能", "。", "中", "医", "肾", "主", "什", "么", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1853, "title": "1963年属什么生肖年_十二生肖_卜易居", "context": "1963年中苏公开论战、美国黑人民权运动兴起、肯尼迪遇刺等事件震动世界。1963年属什么生肖年,葵卯兔年,属兔之人举止文雅,谈吐随和,为人恭良谦逊,与人交往如慕春风,学习能力超群,敏捷果断,安贫乐道。虽性子柔弱,但韧性极强,绝境之中能力惊人,缺点则是难以坚持原则,随波逐流。", "question": "1963年属什么生肖", "sent_token": ["1963", "年", "中", "苏", "公", "开", "论", "战", "、", "美", "国", "黑", "人", "民", "权", "运", "动", "兴", "起", "、", "肯", "尼", "迪", "遇", "刺", "等", "事", "件", "震", "动", "世", "界", 
"。", "1963", "年", "属", "什", "么", "生", "肖", "年", ",", "葵", "卯", "兔", "年", ",", "属", "兔", "之", "人", "举", "止", "文", "雅", ",", "谈", "吐", "随", "和", ",", "为", "人", "恭", "良", "谦", "逊", ",", "与", "人", "交", "往", "如", "慕", "春", "风", ",", "学", "习", "能", "力", "超", "群", ",", "敏", "捷", "果", "断", ",", "安", "贫", "乐", "道", "。", "虽", "性", "子", "柔", "弱", ",", "但", "韧", "性", "极", "强", ",", "绝", "境", "之", "中", "能", "力", "惊", "人", ",", "缺", "点", "则", "是", "难", "以", "坚", "持", "原", "则", ",", "随", "波", "逐", "流", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", "_", "十", "二", "生", "肖", "_", "卜", "易", "居"], "sample_type": "disturb"} +{"id": 1854, "title": "食管和食道一样吗-有来医生", "context": "食管和食道是没有区别的,食管是医学上的称谓,而食道是民间的一种说法。两者都指从咽喉部到胃贲门之间的管道。食管是距门齿15cm处为食管的入口处,经过胸腔之后通过贲门口也就是膈肌孔与胃相连。食管可以分为颈段和胸段,而胸段又分为胸上段、胸中段和胸下段。食管本身有3个生理性的狭窄,这也是某些食管疾病发生的基础。常见的食管疾病包括食管炎、食管息肉、食管癌、食管狭窄、胃食管反流症、巴雷特食管等。可以通过消化道造影以及胃镜来进一步明确。", "question": "食管跟食道有什么不同", "sent_token": ["食", "管", "和", "食", "道", "是", "没", "有", "区", "别", "的", ",", "食", "管", "是", "医", "学", "上", "的", "称", "谓", ",", "而", "食", "道", "是", "民", "间", "的", "一", "种", "说", "法", "。", "两", "者", "都", "指", "从", "咽", "喉", "部", "到", "胃", "贲", "门", "之", "间", "的", "管", "道", "。", "食", "管", "是", "距", "门", "齿", "15cm", "处", "为", "食", "管", "的", "入", "口", "处", ",", "经", "过", "胸", "腔", "之", "后", "通", "过", "贲", "门", "口", "也", "就", "是", "膈", "肌", "孔", "与", "胃", "相", "连", "。", "食", "管", "可", "以", "分", "为", "颈", "段", "和", "胸", "段", ",", "而", "胸", "段", "又", "分", "为", "胸", "上", "段", "、", "胸", "中", "段", "和", "胸", "下", "段", "。", "食", "管", "本", "身", "有", "3", "个", "生", "理", "性", "的", "狭", "窄", ",", "这", "也", "是", "某", "些", "食", "管", "疾", "病", "发", "生", "的", "基", "础", "。", "常", "见", "的", "食", "管", "疾", "病", "包", "括", "食", "管", "炎", "、", "食", "管", "息", "肉", "、", "食", "管", "癌", "、", "食", "管", "狭", "窄", "、", "胃", "食", "管", "反", "流", "症", "、", "巴", "雷", "特", "食", "管", "等", "。", "可", "以", "通", "过", "消", "化", "道", "造", "影", "以", "及", "胃", "镜", "来", "进", "一", "步", "明", "确", "。", "食", "管", "和", "食", "道", "一", "样", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1863, "title": "农历六月二十四是什么星座-星座乐", "context": "农历六月二十四是狮子座。狮子座,火象星座,位于黄道十二宫之第五宫,出生日期为阳历7月23日-8月22日。狮子座是英雄主义者,他们乐观,乐于助人,喜欢帮助弱势群体。他们天生自带光环,特立独行,做事豪爽大气,讲话淡定从容,从不扭扭捏捏畏畏缩缩。而且心思细腻,做事完整准确,善于将自己的优点发挥到极致。", "question": "星座查询:中国阴历六月二十四", "sent_token": ["农", "历", "六", "月", "二", "十", "四", "是", "狮", "子", "座", "。", "狮", "子", "座", ",", "火", "象", "星", "座", ",", "位", "于", "黄", "道", "十", "二", "宫", "之", "第", "五", "宫", ",", "出", "生", "日", "期", "为", "阳", "历", "7", "月", "23", "日", "-", "8", "月", "22", "日", "。", "狮", "子", "座", "是", "英", "雄", "主", "义", "者", ",", "他", "们", "乐", "观", ",", "乐", "于", "助", "人", ",", "喜", "欢", "帮", "助", "弱", "势", "群", "体", "。", "他", "们", "天", "生", "自", "带", "光", "环", ",", "特", "立", "独", "行", ",", "做", "事", "豪", "爽", "大", "气", ",", "讲", "话", "淡", "定", "从", "容", ",", "从", "不", "扭", "扭", "捏", "捏", "畏", "畏", "缩", "缩", "。", "而", "且", "心", "思", "细", "腻", ",", "做", "事", "完", "整", "准", "确", ",", "善", "于", "将", "自", "己", "的", "优", "点", "发", "挥", "到", "极", "致", "。", "农", "历", "六", "月", "二", "十", "四", "是", "什", "么", "星", "座", "-", "星", "座", "乐"], "sample_type": "disturb"} +{"id": 1867, "title": "", "context": "非法持有海洛因10克以上就构成非法持有毒品罪非法持有毒品罪,是指明知是鸦片、海洛因、甲基苯丙胺或者其他毒品,而非法持有且数量较大的行为。非法持有毒品达到一定数量才构成犯罪。", "question": "携带多少克吗啡类毒品,就已经算犯罪了", "sent_token": ["非", "法", "持", "有", "海", "洛", "因", "10", "克", "以", "上", "就", "构", "成", "非", "法", "持", "有", "毒", "品", "罪", "非", "法", "持", "有", "毒", "品", "罪", ",", "是", "指", "明", "知", "是", "鸦", "片", "、", "海", "洛", "因", "、", "甲", "基", "苯", 
"丙", "胺", "或", "者", "其", "他", "毒", "品", ",", "而", "非", "法", "持", "有", "且", "数", "量", "较", "大", "的", "行", "为", "。", "非", "法", "持", "有", "毒", "品", "达", "到", "一", "定", "数", "量", "才", "构", "成", "犯", "罪", "。"], "sample_type": "disturb"} +{"id": 1877, "title": "地方志书每几年左右编修一次_高三网", "context": "地方志书每20年左右编修一次。每一轮地方志书编修工作完成后,负责地方志工作的机构在编纂地方综合年鉴、搜集资料以及向社会提供咨询服务的同时,启动新一轮地方志书的续修工作。", "question": "那种用来记述地方情况的史志,一般都是多少年修一次", "sent_token": ["地", "方", "志", "书", "每", "20", "年", "左", "右", "编", "修", "一", "次", "。", "每", "一", "轮", "地", "方", "志", "书", "编", "修", "工", "作", "完", "成", "后", ",", "负", "责", "地", "方", "志", "工", "作", "的", "机", "构", "在", "编", "纂", "地", "方", "综", "合", "年", "鉴", "、", "搜", "集", "资", "料", "以", "及", "向", "社", "会", "提", "供", "咨", "询", "服", "务", "的", "同", "时", ",", "启", "动", "新", "一", "轮", "地", "方", "志", "书", "的", "续", "修", "工", "作", "。", "地", "方", "志", "书", "每", "几", "年", "左", "右", "编", "修", "一", "次", "_", "高", "三", "网"], "sample_type": "disturb"} +{"id": 1879, "title": "", "context": "《正气歌》是南宋诗人文天祥在狱中写的一首五言古诗。表达了作者忠君爱国、为国捐躯,忧国之痛和愿意以死明志、为国捐躯的豪情壮志的思想感情。诗的开头即点出浩然正气存乎天地之间,至时穷之际,必然会显示出来。随后连用十二个典故,都是历史上有名的人物,他们的所作所为凛然显示出浩然正气的力量。接下来八句说明浩然正气贯日月,立天地,为三纲之命,道义之根。最后联系到自己的命运,自己虽然兵败被俘,处在极其恶劣的牢狱之中,但是由于自己一身正气,各种邪气和疾病都不能侵犯自己,因此自己能够坦然面对自己的命运。全诗感情深沉、气壮山河、直抒胸臆、毫无雕饰,充分体现了作者崇高的民族气节和强烈的爱国主义精神。", "question": "正气歌》的作者是", "sent_token": ["《", "正", "气", "歌", "》", "是", "南", "宋", "诗", "人", "文", "天", "祥", "在", "狱", "中", "写", "的", "一", "首", "五", "言", "古", "诗", "。", "表", "达", "了", "作", "者", "忠", "君", "爱", "国", "、", "为", "国", "捐", "躯", ",", "忧", "国", "之", "痛", "和", "愿", "意", "以", "死", "明", "志", "、", "为", "国", "捐", "躯", "的", "豪", "情", "壮", "志", "的", "思", "想", "感", "情", "。", "诗", "的", "开", "头", "即", "点", "出", "浩", "然", "正", "气", "存", "乎", "天", "地", "之", "间", ",", "至", "时", "穷", "之", "际", ",", "必", "然", "会", "显", "示", "出", "来", "。", "随", "后", "连", "用", "十", "二", "个", "典", "故", ",", "都", "是", "历", "史", "上", "有", "名", "的", "人", "物", ",", "他", "们", "的", "所", "作", "所", "为", "凛", "然", "显", "示", "出", "浩", "然", "正", "气", "的", "力", "量", "。", "接", "下", "来", "八", "句", "说", "明", "浩", "然", "正", "气", "贯", "日", "月", ",", "立", "天", "地", ",", "为", "三", "纲", "之", "命", ",", "道", "义", "之", "根", "。", "最", "后", "联", "系", "到", "自", "己", "的", "命", "运", ",", "自", "己", "虽", "然", "兵", "败", "被", "俘", ",", "处", "在", "极", "其", "恶", "劣", "的", "牢", "狱", "之", "中", ",", "但", "是", "由", "于", "自", "己", "一", "身", "正", "气", ",", "各", "种", "邪", "气", "和", "疾", "病", "都", "不", "能", "侵", "犯", "自", "己", ",", "因", "此", "自", "己", "能", "够", "坦", "然", "面", "对", "自", "己", "的", "命", "运", "。", "全", "诗", "感", "情", "深", "沉", "、", "气", "壮", "山", "河", "、", "直", "抒", "胸", "臆", "、", "毫", "无", "雕", "饰", ",", "充", "分", "体", "现", "了", "作", "者", "崇", "高", "的", "民", "族", "气", "节", "和", "强", "烈", "的", "爱", "国", "主", "义", "精", "神", "。"], "sample_type": "disturb"} +{"id": 1883, "title": "狗狗皮肤上长小脓包怎么回事", "context": "狗狗身上长脓包,是因为真菌感染或是寄生虫感染所致。如不及时处理脓包,会导致扩散全身,甚至溃烂。建议方法:戴上手套,把狗狗身上长脓包的地方挤一挤;然后用碘伏直接喷在患处;如有脓血可用医用纱布给它包在患处,等药效吸收后,取掉纱布;碘伏具有抗菌、消炎的作用,一天可以喷两三次;处理完狗狗伤口后用肥皂洗手。狗狗洗澡要用狗狗专门的沐浴露;洗后立即做吹干处理;定时用狗狗专用梳子,清理身上多余的杂毛;尽量带狗狗去干净的地方玩,回家后把狗狗的脚用抹布抹一次;多注意狗舍卫生,定时做消毒处理。宠物皮肤疾病也是会有一定传染性的,所以一定要进行定期消毒,选用专门的宠物消毒液,每周消毒1-2次,能有效预防传染", "question": "狗狗身上长小脓包是怎么回事", "sent_token": ["狗", "狗", "身", "上", "长", "脓", "包", ",", "是", "因", "为", "真", "菌", "感", "染", "或", "是", "寄", "生", "虫", "感", "染", "所", "致", "。", "如", "不", "及", "时", "处", "理", "脓", "包", ",", "会", "导", "致", "扩", "散", "全", "身", ",", "甚", "至", "溃", "烂", "。", "建", "议", "方", "法", ":", "戴", "上", "手", "套", ",", "把", "狗", "狗", "身", "上", "长", "脓", "包", "的", "地", "方", "挤", "一", 
"挤", ";", "然", "后", "用", "碘", "伏", "直", "接", "喷", "在", "患", "处", ";", "如", "有", "脓", "血", "可", "用", "医", "用", "纱", "布", "给", "它", "包", "在", "患", "处", ",", "等", "药", "效", "吸", "收", "后", ",", "取", "掉", "纱", "布", ";", "碘", "伏", "具", "有", "抗", "菌", "、", "消", "炎", "的", "作", "用", ",", "一", "天", "可", "以", "喷", "两", "三", "次", ";", "处", "理", "完", "狗", "狗", "伤", "口", "后", "用", "肥", "皂", "洗", "手", "。", "狗", "狗", "洗", "澡", "要", "用", "狗", "狗", "专", "门", "的", "沐", "浴", "露", ";", "洗", "后", "立", "即", "做", "吹", "干", "处", "理", ";", "定", "时", "用", "狗", "狗", "专", "用", "梳", "子", ",", "清", "理", "身", "上", "多", "余", "的", "杂", "毛", ";", "尽", "量", "带", "狗", "狗", "去", "干", "净", "的", "地", "方", "玩", ",", "回", "家", "后", "把", "狗", "狗", "的", "脚", "用", "抹", "布", "抹", "一", "次", ";", "多", "注", "意", "狗", "舍", "卫", "生", ",", "定", "时", "做", "消", "毒", "处", "理", "。", "宠", "物", "皮", "肤", "疾", "病", "也", "是", "会", "有", "一", "定", "传", "染", "性", "的", ",", "所", "以", "一", "定", "要", "进", "行", "定", "期", "消", "毒", ",", "选", "用", "专", "门", "的", "宠", "物", "消", "毒", "液", ",", "每", "周", "消", "毒", "1", "-", "2", "次", ",", "能", "有", "效", "预", "防", "传", "染", "狗", "狗", "皮", "肤", "上", "长", "小", "脓", "包", "怎", "么", "回", "事"], "sample_type": "disturb"} +{"id": 1885, "title": "", "context": "新梓学校成立于2007年9月,是一所公办九年一贯制学校,座落在龙岗街道新生社区,紧邻水岸新都花园,交通十分便利。校园占地27500平方米,建筑面积16285平方米。学校设计办学规模36班,学生人数1800人", "question": "新梓学校地址", "sent_token": ["新", "梓", "学", "校", "成", "立", "于", "2007", "年", "9", "月", ",", "是", "一", "所", "公", "办", "九", "年", "一", "贯", "制", "学", "校", ",", "座", "落", "在", "龙", "岗", "街", "道", "新", "生", "社", "区", ",", "紧", "邻", "水", "岸", "新", "都", "花", "园", ",", "交", "通", "十", "分", "便", "利", "。", "校", "园", "占", "地", "27500", "平", "方", "米", ",", "建", "筑", "面", "积", "16285", "平", "方", "米", "。", "学", "校", "设", "计", "办", "学", "规", "模", "36", "班", ",", "学", "生", "人", "数", "1800", "人"], "sample_type": "disturb"} +{"id": 1886, "title": "敷面膜脸痒是缺水吗?教你正确的认识_皮肤", "context": "当我们在洗完澡的时候,或者是敷面膜发现皮肤有一种痒痒的感觉,如果你确定面膜的质量是没有问题的,并且也确定你对这款面膜的物质没有过敏的情况下,皮肤出现痒的感觉,那可能的原因就是由于皮肤缺水。因为你的皮肤太缺水了,在给皮肤补水的时候就会出现一种痒的情况严重的时候,甚至会有刺痛的感觉。会让人觉得很不舒服,水分充足后会缓解。", "question": "脸痒是缺水么", "sent_token": ["当", "我", "们", "在", "洗", "完", "澡", "的", "时", "候", ",", "或", "者", "是", "敷", "面", "膜", "发", "现", "皮", "肤", "有", "一", "种", "痒", "痒", "的", "感", "觉", ",", "如", "果", "你", "确", "定", "面", "膜", "的", "质", "量", "是", "没", "有", "问", "题", "的", ",", "并", "且", "也", "确", "定", "你", "对", "这", "款", "面", "膜", "的", "物", "质", "没", "有", "过", "敏", "的", "情", "况", "下", ",", "皮", "肤", "出", "现", "痒", "的", "感", "觉", ",", "那", "可", "能", "的", "原", "因", "就", "是", "由", "于", "皮", "肤", "缺", "水", "。", "因", "为", "你", "的", "皮", "肤", "太", "缺", "水", "了", ",", "在", "给", "皮", "肤", "补", "水", "的", "时", "候", "就", "会", "出", "现", "一", "种", "痒", "的", "情", "况", "严", "重", "的", "时", "候", ",", "甚", "至", "会", "有", "刺", "痛", "的", "感", "觉", "。", "会", "让", "人", "觉", "得", "很", "不", "舒", "服", ",", "水", "分", "充", "足", "后", "会", "缓", "解", "。", "敷", "面", "膜", "脸", "痒", "是", "缺", "水", "吗", "?", "教", "你", "正", "确", "的", "认", "识", "_", "皮", "肤"], "sample_type": "disturb"} +{"id": 1888, "title": "无痛人流和药流哪个伤害比较小-有来医生", "context": "无痛人工流产手术和药物流产手术,相对比来说,还是药物流产伤害比较大。因为药物流产,阴道流血时间会比人工流产的阴道流血时间要长,一般人工流产,阴道流血时间不超过7天,而药物流产阴道流血的时间往往在15-20天左右才会干净。一直在有流血的状况下,宫口就是开放的,阴道又跟外界相通,跟宫颈又相通,这样造成细菌侵入感染的机会就会增加,所以容易导致生殖道的感染。另外,药物流产造成不全流产的可能性会大一些,需要做清宫手术。这样就可以想象出药物流产会比无痛人流伤害更大一些。人流手术都是属于微创无痛性质的,具有无痛、创伤极小,出血少、手术时间短,无需住院,手术后即可回家,不影响工作和生活等优势。", "question": "无痛人流和药流哪个伤害比较小", "sent_token": ["无", "痛", "人", "工", "流", "产", "手", "术", "和", "药", "物", "流", "产", "手", "术", ",", "相", "对", "比", "来", 
"说", ",", "还", "是", "药", "物", "流", "产", "伤", "害", "比", "较", "大", "。", "因", "为", "药", "物", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "会", "比", "人", "工", "流", "产", "的", "阴", "道", "流", "血", "时", "间", "要", "长", ",", "一", "般", "人", "工", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "不", "超", "过", "7", "天", ",", "而", "药", "物", "流", "产", "阴", "道", "流", "血", "的", "时", "间", "往", "往", "在", "15", "-", "20", "天", "左", "右", "才", "会", "干", "净", "。", "一", "直", "在", "有", "流", "血", "的", "状", "况", "下", ",", "宫", "口", "就", "是", "开", "放", "的", ",", "阴", "道", "又", "跟", "外", "界", "相", "通", ",", "跟", "宫", "颈", "又", "相", "通", ",", "这", "样", "造", "成", "细", "菌", "侵", "入", "感", "染", "的", "机", "会", "就", "会", "增", "加", ",", "所", "以", "容", "易", "导", "致", "生", "殖", "道", "的", "感", "染", "。", "另", "外", ",", "药", "物", "流", "产", "造", "成", "不", "全", "流", "产", "的", "可", "能", "性", "会", "大", "一", "些", ",", "需", "要", "做", "清", "宫", "手", "术", "。", "这", "样", "就", "可", "以", "想", "象", "出", "药", "物", "流", "产", "会", "比", "无", "痛", "人", "流", "伤", "害", "更", "大", "一", "些", "。", "人", "流", "手", "术", "都", "是", "属", "于", "微", "创", "无", "痛", "性", "质", "的", ",", "具", "有", "无", "痛", "、", "创", "伤", "极", "小", ",", "出", "血", "少", "、", "手", "术", "时", "间", "短", ",", "无", "需", "住", "院", ",", "手", "术", "后", "即", "可", "回", "家", ",", "不", "影", "响", "工", "作", "和", "生", "活", "等", "优", "势", "。", "无", "痛", "人", "流", "和", "药", "流", "哪", "个", "伤", "害", "比", "较", "小", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1890, "title": "长期吃葡萄籽的副作用?_39健康问答_39健康网", "context": "长期吃葡萄籽不会有副作用,不用担心,葡萄籽中含有丰富的花青素,有美容养颜的功效。葡萄籽含有丰富的多种氨基酸、维生素及矿物质等,原花青素含量最高,有促进血液循环、保护视力、抗氧化去除自由基、降低血、保护心血管的作用,可以用于保健、美容。", "question": "葡萄籽能长期吃么?有什么副作用呀?", "sent_token": ["长", "期", "吃", "葡", "萄", "籽", "不", "会", "有", "副", "作", "用", ",", "不", "用", "担", "心", ",", "葡", "萄", "籽", "中", "含", "有", "丰", "富", "的", "花", "青", "素", ",", "有", "美", "容", "养", "颜", "的", "功", "效", "。", "葡", "萄", "籽", "含", "有", "丰", "富", "的", "多", "种", "氨", "基", "酸", "、", "维", "生", "素", "及", "矿", "物", "质", "等", ",", "原", "花", "青", "素", "含", "量", "最", "高", ",", "有", "促", "进", "血", "液", "循", "环", "、", "保", "护", "视", "力", "、", "抗", "氧", "化", "去", "除", "自", "由", "基", "、", "降", "低", "血", "、", "保", "护", "心", "血", "管", "的", "作", "用", ",", "可", "以", "用", "于", "保", "健", "、", "美", "容", "。", "长", "期", "吃", "葡", "萄", "籽", "的", "副", "作", "用", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} +{"id": 1894, "title": "红花哪里产的最好?_39健康问答_39健康网", "context": "红花在中国很多地方都是有种植的,比如河南,江苏,四川,河北等等。但是在众多产地中河南的商丘生产的红花应该是最好的了。红花有一种特殊的气味,特别香,味道稍微有点苦。红花是一种很好的植物,对人体有很好的保健作用。高血压患者可以服用一些,红花是有一定的降压作用的,另外还可以促进人体血液的循环,降低血脂。", "question": "最好的刺红花生产自哪里", "sent_token": ["红", "花", "在", "中", "国", "很", "多", "地", "方", "都", "是", "有", "种", "植", "的", ",", "比", "如", "河", "南", ",", "江", "苏", ",", "四", "川", ",", "河", "北", "等", "等", "。", "但", "是", "在", "众", "多", "产", "地", "中", "河", "南", "的", "商", "丘", "生", "产", "的", "红", "花", "应", "该", "是", "最", "好", "的", "了", "。", "红", "花", "有", "一", "种", "特", "殊", "的", "气", "味", ",", "特", "别", "香", ",", "味", "道", "稍", "微", "有", "点", "苦", "。", "红", "花", "是", "一", "种", "很", "好", "的", "植", "物", ",", "对", "人", "体", "有", "很", "好", "的", "保", "健", "作", "用", "。", "高", "血", "压", "患", "者", "可", "以", "服", "用", "一", "些", ",", "红", "花", "是", "有", "一", "定", "的", "降", "压", "作", "用", "的", ",", "另", "外", "还", "可", "以", "促", "进", "人", "体", "血", "液", "的", "循", "环", ",", "降", "低", "血", "脂", "。", "红", "花", "哪", "里", "产", "的", "最", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": 
"disturb"} +{"id": 1897, "title": "", "context": "梳妆台指用来化妆的家具装饰。梳妆台一词,在现代家居中,已经被业主、客户、家居设计师广泛用到,现在泛指家具梳妆台。梳妆台尺寸标准的是总高度为1500mm左右,宽为700mm到1200mm,这样的梳妆台尺寸是大小正合适的,在家庭装修之前的前期准备时,就应该确定好梳妆台尺寸大小,同时梳妆台尺寸也要和房间的格调和风格统一起来。每个人都有自己不同的审美眼光,所以在外观选择上只要是个人喜欢就行,但梳妆台的外表最好选择用油漆刷过的,这样容易清理,不至于化妆品渗透到梳妆台内,影响梳妆台的外观", "question": "梳妆台整体高度一般是多少", "sent_token": ["梳", "妆", "台", "指", "用", "来", "化", "妆", "的", "家", "具", "装", "饰", "。", "梳", "妆", "台", "一", "词", ",", "在", "现", "代", "家", "居", "中", ",", "已", "经", "被", "业", "主", "、", "客", "户", "、", "家", "居", "设", "计", "师", "广", "泛", "用", "到", ",", "现", "在", "泛", "指", "家", "具", "梳", "妆", "台", "。", "梳", "妆", "台", "尺", "寸", "标", "准", "的", "是", "总", "高", "度", "为", "1500mm", "左", "右", ",", "宽", "为", "700mm", "到", "1200mm", ",", "这", "样", "的", "梳", "妆", "台", "尺", "寸", "是", "大", "小", "正", "合", "适", "的", ",", "在", "家", "庭", "装", "修", "之", "前", "的", "前", "期", "准", "备", "时", ",", "就", "应", "该", "确", "定", "好", "梳", "妆", "台", "尺", "寸", "大", "小", ",", "同", "时", "梳", "妆", "台", "尺", "寸", "也", "要", "和", "房", "间", "的", "格", "调", "和", "风", "格", "统", "一", "起", "来", "。", "每", "个", "人", "都", "有", "自", "己", "不", "同", "的", "审", "美", "眼", "光", ",", "所", "以", "在", "外", "观", "选", "择", "上", "只", "要", "是", "个", "人", "喜", "欢", "就", "行", ",", "但", "梳", "妆", "台", "的", "外", "表", "最", "好", "选", "择", "用", "油", "漆", "刷", "过", "的", ",", "这", "样", "容", "易", "清", "理", ",", "不", "至", "于", "化", "妆", "品", "渗", "透", "到", "梳", "妆", "台", "内", ",", "影", "响", "梳", "妆", "台", "的", "外", "观"], "sample_type": "disturb"} +{"id": 1899, "title": "感冒能不能吃燕窝_妈妈网小百科", "context": "在感冒的时候尽量不要吃燕窝,燕窝性平味甘,归肺胃肾经,功能养阴润燥,益气补中,填精补髓。虽然燕窝比较滋补,但是在感冒期间吃燕窝的话,并不利于感冒的恢复。在感冒期间应该吃得清淡一些,补充身体需要的水分,如果没有食欲的话可以多喝一些粥。在感冒期间可能吃药物的话,也不能够起到很好的效果,但是也要坚持吃药。", "question": "感冒可以吃燕窝吗?有效果吗?", "sent_token": ["在", "感", "冒", "的", "时", "候", "尽", "量", "不", "要", "吃", "燕", "窝", ",", "燕", "窝", "性", "平", "味", "甘", ",", "归", "肺", "胃", "肾", "经", ",", "功", "能", "养", "阴", "润", "燥", ",", "益", "气", "补", "中", ",", "填", "精", "补", "髓", "。", "虽", "然", "燕", "窝", "比", "较", "滋", "补", ",", "但", "是", "在", "感", "冒", "期", "间", "吃", "燕", "窝", "的", "话", ",", "并", "不", "利", "于", "感", "冒", "的", "恢", "复", "。", "在", "感", "冒", "期", "间", "应", "该", "吃", "得", "清", "淡", "一", "些", ",", "补", "充", "身", "体", "需", "要", "的", "水", "分", ",", "如", "果", "没", "有", "食", "欲", "的", "话", "可", "以", "多", "喝", "一", "些", "粥", "。", "在", "感", "冒", "期", "间", "可", "能", "吃", "药", "物", "的", "话", ",", "也", "不", "能", "够", "起", "到", "很", "好", "的", "效", "果", ",", "但", "是", "也", "要", "坚", "持", "吃", "药", "。", "感", "冒", "能", "不", "能", "吃", "燕", "窝", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} +{"id": 1900, "title": "房颤会引起脑梗吗-有来医生", "context": "房颤会引起脑血管疾病,在医学上不叫脑梗叫脑栓塞,脑梗是脑血管本身病变引起的脑供血不足的情况,而脑栓塞是由于房颤心脏上形成了附壁血栓,当血栓的栓子脱落之后,就有可能堵塞在脑血管形成了脑拴塞,也是一种脑缺血的表现。治疗方法可以应用改善循环和营养神经的药物治疗,必须应用阿司匹林和氯吡格雷口服抗血小板聚集治疗,对于心房纤颤的患者,要控制心室率,应用阿司匹林和氯吡格雷等口服抗血小板聚集治疗,预防心脏附壁血栓的形成。", "question": "房颤不会引起脑梗吗", "sent_token": ["房", "颤", "会", "引", "起", "脑", "血", "管", "疾", "病", ",", "在", "医", "学", "上", "不", "叫", "脑", "梗", "叫", "脑", "栓", "塞", ",", "脑", "梗", "是", "脑", "血", "管", "本", "身", "病", "变", "引", "起", "的", "脑", "供", "血", "不", "足", "的", "情", "况", ",", "而", "脑", "栓", "塞", "是", "由", "于", "房", "颤", "心", "脏", "上", "形", "成", "了", "附", "壁", "血", "栓", ",", "当", "血", "栓", "的", "栓", "子", "脱", "落", "之", "后", ",", "就", "有", "可", "能", "堵", "塞", "在", "脑", "血", "管", "形", "成", "了", "脑", "拴", "塞", ",", "也", "是", "一", "种", "脑", "缺", "血", "的", "表", "现", "。", "治", "疗", "方", "法", "可", "以", "应", "用", "改", "善", "循", "环", "和", "营", "养", "神", "经", "的", "药", "物", "治", "疗", ",", "必", 
"须", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "对", "于", "心", "房", "纤", "颤", "的", "患", "者", ",", "要", "控", "制", "心", "室", "率", ",", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "等", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "预", "防", "心", "脏", "附", "壁", "血", "栓", "的", "形", "成", "。", "房", "颤", "会", "引", "起", "脑", "梗", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1906, "title": "二十天的婴儿能看多远_妈妈网小百科", "context": "20天的宝宝能够看到的距离大概是15厘米-20厘米左右,一般能够看到18厘米左右的事物。宝宝刚出生的时候视力极其差,有的甚至没有睁开眼,可以说基本什么都看不清楚,视力比较好的新生儿,也只能感受到光和影或大致的轮廓。随着宝宝的眼球、视神经和大脑的不断发育,他们看到的景物会越来越清楚,视野也会不断扩大,在出生6-8个月后,宝宝眼中的世界,就基本和成人一样了。", "question": "二十天的宝宝能看多远?", "sent_token": ["20", "天", "的", "宝", "宝", "能", "够", "看", "到", "的", "距", "离", "大", "概", "是", "15", "厘", "米", "-", "20", "厘", "米", "左", "右", ",", "一", "般", "能", "够", "看", "到", "18", "厘", "米", "左", "右", "的", "事", "物", "。", "宝", "宝", "刚", "出", "生", "的", "时", "候", "视", "力", "极", "其", "差", ",", "有", "的", "甚", "至", "没", "有", "睁", "开", "眼", ",", "可", "以", "说", "基", "本", "什", "么", "都", "看", "不", "清", "楚", ",", "视", "力", "比", "较", "好", "的", "新", "生", "儿", ",", "也", "只", "能", "感", "受", "到", "光", "和", "影", "或", "大", "致", "的", "轮", "廓", "。", "随", "着", "宝", "宝", "的", "眼", "球", "、", "视", "神", "经", "和", "大", "脑", "的", "不", "断", "发", "育", ",", "他", "们", "看", "到", "的", "景", "物", "会", "越", "来", "越", "清", "楚", ",", "视", "野", "也", "会", "不", "断", "扩", "大", ",", "在", "出", "生", "6", "-", "8", "个", "月", "后", ",", "宝", "宝", "眼", "中", "的", "世", "界", ",", "就", "基", "本", "和", "成", "人", "一", "样", "了", "。", "二", "十", "天", "的", "婴", "儿", "能", "看", "多", "远", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} +{"id": 1918, "title": "4价宫颈疫苗多少钱-有来医生", "context": "4价宫颈癌疫苗有国产疫苗和进口疫苗,国产疫苗价格比较便宜,预防宫颈癌的疫苗只有4价疫苗,具体价格不同地区以及不同生产厂家生产的疫苗,所定价格也不一样。在北京4价宫颈癌疫苗,价格大概是800元左右,总共需要接种三针,需要在半年内接种完,分别在第一个月,第2个月和第6个月各接种一针次,接种年龄是20-45周岁,建议咨询当地疾病预防控制机构,所进疫苗的具体价格比较准确。比如江苏省从2019年开始,所有有价疫苗都是零差价出售,每接种一针次,收取20元材料费和注射费,目前接种宫颈癌疫苗,应该先预约才可以接种。", "question": "中国自己生产的HPV疫苗都有哪些", "sent_token": ["4", "价", "宫", "颈", "癌", "疫", "苗", "有", "国", "产", "疫", "苗", "和", "进", "口", "疫", "苗", ",", "国", "产", "疫", "苗", "价", "格", "比", "较", "便", "宜", ",", "预", "防", "宫", "颈", "癌", "的", "疫", "苗", "只", "有", "4", "价", "疫", "苗", ",", "具", "体", "价", "格", "不", "同", "地", "区", "以", "及", "不", "同", "生", "产", "厂", "家", "生", "产", "的", "疫", "苗", ",", "所", "定", "价", "格", "也", "不", "一", "样", "。", "在", "北", "京", "4", "价", "宫", "颈", "癌", "疫", "苗", ",", "价", "格", "大", "概", "是", "800", "元", "左", "右", ",", "总", "共", "需", "要", "接", "种", "三", "针", ",", "需", "要", "在", "半", "年", "内", "接", "种", "完", ",", "分", "别", "在", "第", "一", "个", "月", ",", "第", "2", "个", "月", "和", "第", "6", "个", "月", "各", "接", "种", "一", "针", "次", ",", "接", "种", "年", "龄", "是", "20", "-", "45", "周", "岁", ",", "建", "议", "咨", "询", "当", "地", "疾", "病", "预", "防", "控", "制", "机", "构", ",", "所", "进", "疫", "苗", "的", "具", "体", "价", "格", "比", "较", "准", "确", "。", "比", "如", "江", "苏", "省", "从", "2019", "年", "开", "始", ",", "所", "有", "有", "价", "疫", "苗", "都", "是", "零", "差", "价", "出", "售", ",", "每", "接", "种", "一", "针", "次", ",", "收", "取", "20", "元", "材", "料", "费", "和", "注", "射", "费", ",", "目", "前", "接", "种", "宫", "颈", "癌", "疫", "苗", ",", "应", "该", "先", "预", "约", "才", "可", "以", "接", "种", "。", "4", "价", "宫", "颈", "疫", "苗", "多", "少", "钱", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1945, "title": "hiit是什么", "context": 
"hiit是高强度间歇训练,主要是通过进行多组高强度的间隙,和低强度的动作组合训练,这种训练方式能够在短时间内高速燃烧脂肪,简单说就是中间有休息的高强度训练,非常适合锻炼时间较少或无法长时间坚持锻炼的人。", "question": "什么是HIIT", "sent_token": ["hiit", "是", "高", "强", "度", "间", "歇", "训", "练", ",", "主", "要", "是", "通", "过", "进", "行", "多", "组", "高", "强", "度", "的", "间", "隙", ",", "和", "低", "强", "度", "的", "动", "作", "组", "合", "训", "练", ",", "这", "种", "训", "练", "方", "式", "能", "够", "在", "短", "时", "间", "内", "高", "速", "燃", "烧", "脂", "肪", ",", "简", "单", "说", "就", "是", "中", "间", "有", "休", "息", "的", "高", "强", "度", "训", "练", ",", "非", "常", "适", "合", "锻", "炼", "时", "间", "较", "少", "或", "无", "法", "长", "时", "间", "坚", "持", "锻", "炼", "的", "人", "。", "hiit", "是", "什", "么"], "sample_type": "disturb"} +{"id": 1949, "title": "民生信用卡的客服电话多少?-其他问题知识问答-我爱卡", "context": "民生银行是中国大陆第一家由民间资本设立的全国性商业银行,成立于1996年1月12日。民生银行的信用卡的24小时客服电话为400-669-5568,持卡人在办卡或用卡的过程中,有任何疑问,都可以拨打民生银行信用卡客服电话,通过人工客服,来进行咨询。同时,持卡人也可以通过客服电话,办理信用卡激活、修改密码、更改账单日等业务。", "question": "民生信用卡客服", "sent_token": ["民", "生", "银", "行", "是", "中", "国", "大", "陆", "第", "一", "家", "由", "民", "间", "资", "本", "设", "立", "的", "全", "国", "性", "商", "业", "银", "行", ",", "成", "立", "于", "1996", "年", "1", "月", "12", "日", "。", "民", "生", "银", "行", "的", "信", "用", "卡", "的", "24", "小", "时", "客", "服", "电", "话", "为", "400", "-", "669", "-", "5568", ",", "持", "卡", "人", "在", "办", "卡", "或", "用", "卡", "的", "过", "程", "中", ",", "有", "任", "何", "疑", "问", ",", "都", "可", "以", "拨", "打", "民", "生", "银", "行", "信", "用", "卡", "客", "服", "电", "话", ",", "通", "过", "人", "工", "客", "服", ",", "来", "进", "行", "咨", "询", "。", "同", "时", ",", "持", "卡", "人", "也", "可", "以", "通", "过", "客", "服", "电", "话", ",", "办", "理", "信", "用", "卡", "激", "活", "、", "修", "改", "密", "码", "、", "更", "改", "账", "单", "日", "等", "业", "务", "。", "民", "生", "信", "用", "卡", "的", "客", "服", "电", "话", "多", "少", "?", "-", "其", "他", "问", "题", "知", "识", "问", "答", "-", "我", "爱", "卡"], "sample_type": "disturb"} +{"id": 1956, "title": "", "context": "法令纹位於鼻翼两侧往下延伸至嘴的附近,也称寿带,是典型的皮肤组织老化,造成肌肤表面凹陷的现象。法令若垂长,亦为长寿之象徵。不过女性多半不喜欢脸上出现法令纹,因为这意味脸部皮肤松弛,是老化的迹象。", "question": "哪里是法令纹?", "sent_token": ["法", "令", "纹", "位", "於", "鼻", "翼", "两", "侧", "往", "下", "延", "伸", "至", "嘴", "的", "附", "近", ",", "也", "称", "寿", "带", ",", "是", "典", "型", "的", "皮", "肤", "组", "织", "老", "化", ",", "造", "成", "肌", "肤", "表", "面", "凹", "陷", "的", "现", "象", "。", "法", "令", "若", "垂", "长", ",", "亦", "为", "长", "寿", "之", "象", "徵", "。", "不", "过", "女", "性", "多", "半", "不", "喜", "欢", "脸", "上", "出", "现", "法", "令", "纹", ",", "因", "为", "这", "意", "味", "脸", "部", "皮", "肤", "松", "弛", ",", "是", "老", "化", "的", "迹", "象", "。"], "sample_type": "disturb"} +{"id": 1966, "title": "婴儿轻微肠炎能自愈吗_妈妈网小百科", "context": "婴儿轻微肠炎不能自愈。肠炎是一种炎症,其发病的原因与胃肠道失调有关联。临床表现主要有腹痛、腹泻、稀水便或黏液脓血便。婴儿胃肠道菌群出现了失调的异常,就会引发肠炎的出现。尽管是比较轻微的肠炎,但还是有炎症的存在。婴儿轻微肠炎需要就医进行治疗,需要吃药促使炎症的消除。", "question": "婴儿轻度肠炎能自愈吗", "sent_token": ["婴", "儿", "轻", "微", "肠", "炎", "不", "能", "自", "愈", "。", "肠", "炎", "是", "一", "种", "炎", "症", ",", "其", "发", "病", "的", "原", "因", "与", "胃", "肠", "道", "失", "调", "有", "关", "联", "。", "临", "床", "表", "现", "主", "要", "有", "腹", "痛", "、", "腹", "泻", "、", "稀", "水", "便", "或", "黏", "液", "脓", "血", "便", "。", "婴", "儿", "胃", "肠", "道", "菌", "群", "出", "现", "了", "失", "调", "的", "异", "常", ",", "就", "会", "引", "发", "肠", "炎", "的", "出", "现", "。", "尽", "管", "是", "比", "较", "轻", "微", "的", "肠", "炎", ",", "但", "还", "是", "有", "炎", "症", "的", "存", "在", "。", "婴", "儿", "轻", "微", "肠", "炎", "需", "要", "就", "医", "进", "行", "治", "疗", ",", "需", "要", "吃", "药", "促", "使", "炎", "症", "的", "消", "除", "。", "婴", "儿", "轻", "微", "肠", "炎", "能", "自", "愈", "吗", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": 
"disturb"} +{"id": 1977, "title": "", "context": "珍珠鸟作者简介冯骥才,当代作家,1942年生于天津,兄妹六人,排行第三,为长子。原籍浙江慈溪市人。从小喜爱美术、文学和球类活动。曾当过专业篮球运动员,从事过绘画。", "question": "冯骥才什么时候出生", "sent_token": ["珍", "珠", "鸟", "作", "者", "简", "介", "冯", "骥", "才", ",", "当", "代", "作", "家", ",", "1942", "年", "生", "于", "天", "津", ",", "兄", "妹", "六", "人", ",", "排", "行", "第", "三", ",", "为", "长", "子", "。", "原", "籍", "浙", "江", "慈", "溪", "市", "人", "。", "从", "小", "喜", "爱", "美", "术", "、", "文", "学", "和", "球", "类", "活", "动", "。", "曾", "当", "过", "专", "业", "篮", "球", "运", "动", "员", ",", "从", "事", "过", "绘", "画", "。"], "sample_type": "disturb"} +{"id": 1983, "title": "哺乳期可以吃维生素b2吗_有问必答_快速问医生", "context": "你好,口腔溃疡一般都是由于维生素缺乏导致的,与口腔炎症和上火也有关,可以服用维生素b2和维生素c治疗。用西瓜皮煮水喝,可以清热去火。局部用口腔溃疡散或者用维生素c研磨成粉末涂抹,都可以有效缓解疼痛。孕妇正常也要补充维生素的,服用维生素b2没有问题的。平时一定要多吃新鲜蔬菜水果,补充维生素,注意口腔卫生,早晚刷牙,饭后用温水漱口,每天早上起床用淡盐水漱口。", "question": "哺乳期间,能吃维生素b2吗", "sent_token": ["你", "好", ",", "口", "腔", "溃", "疡", "一", "般", "都", "是", "由", "于", "维", "生", "素", "缺", "乏", "导", "致", "的", ",", "与", "口", "腔", "炎", "症", "和", "上", "火", "也", "有", "关", ",", "可", "以", "服", "用", "维", "生", "素", "b2", "和", "维", "生", "素", "c", "治", "疗", "。", "用", "西", "瓜", "皮", "煮", "水", "喝", ",", "可", "以", "清", "热", "去", "火", "。", "局", "部", "用", "口", "腔", "溃", "疡", "散", "或", "者", "用", "维", "生", "素", "c", "研", "磨", "成", "粉", "末", "涂", "抹", ",", "都", "可", "以", "有", "效", "缓", "解", "疼", "痛", "。", "孕", "妇", "正", "常", "也", "要", "补", "充", "维", "生", "素", "的", ",", "服", "用", "维", "生", "素", "b2", "没", "有", "问", "题", "的", "。", "平", "时", "一", "定", "要", "多", "吃", "新", "鲜", "蔬", "菜", "水", "果", ",", "补", "充", "维", "生", "素", ",", "注", "意", "口", "腔", "卫", "生", ",", "早", "晚", "刷", "牙", ",", "饭", "后", "用", "温", "水", "漱", "口", ",", "每", "天", "早", "上", "起", "床", "用", "淡", "盐", "水", "漱", "口", "。", "哺", "乳", "期", "可", "以", "吃", "维", "生", "素", "b2", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "disturb"} +{"id": 1993, "title": "6岁儿童吃几颗肠虫清,吃肠虫清需要忌口吗_孕育常识_亲子宝典库_", "context": "肠虫清一般指阿苯达唑。阿苯达唑是一种咪唑衍生物类广谱驱肠虫药物。是六岁儿童就可以服用的一次吃两片,是吃饱饭后吃,肠虫清的主要是驱虫的药物,一般在晚上睡前服用的是比较好的,服药期间要多喝开水,多吃清淡易消化的食物,忌辛辣刺激性食物和油腻煎炸的食物,注意保暖避免着凉。", "question": "6岁儿童吃几颗肠虫清", "sent_token": ["肠", "虫", "清", "一", "般", "指", "阿", "苯", "达", "唑", "。", "阿", "苯", "达", "唑", "是", "一", "种", "咪", "唑", "衍", "生", "物", "类", "广", "谱", "驱", "肠", "虫", "药", "物", "。", "是", "六", "岁", "儿", "童", "就", "可", "以", "服", "用", "的", "一", "次", "吃", "两", "片", ",", "是", "吃", "饱", "饭", "后", "吃", ",", "肠", "虫", "清", "的", "主", "要", "是", "驱", "虫", "的", "药", "物", ",", "一", "般", "在", "晚", "上", "睡", "前", "服", "用", "的", "是", "比", "较", "好", "的", ",", "服", "药", "期", "间", "要", "多", "喝", "开", "水", ",", "多", "吃", "清", "淡", "易", "消", "化", "的", "食", "物", ",", "忌", "辛", "辣", "刺", "激", "性", "食", "物", "和", "油", "腻", "煎", "炸", "的", "食", "物", ",", "注", "意", "保", "暖", "避", "免", "着", "凉", "。", "6", "岁", "儿", "童", "吃", "几", "颗", "肠", "虫", "清", ",", "吃", "肠", "虫", "清", "需", "要", "忌", "口", "吗", "_", "孕", "育", "常", "识", "_", "亲", "子", "宝", "典", "库", "_"], "sample_type": "disturb"} +{"id": 2003, "title": "隔阂意味着是什么意思", "context": "隔阂是一个汉语词汇,一指彼此情意沟通的障碍或是情意不通,思想有距离,彼此之间有间隔,又指阻隔、隔绝。隔阂意味着很多意思,通常隔阂就意味着可能双方之间沟通有问题,比如有些夫妻或者是男女朋友之间吵架,两个人一起冷战,两个人由于没有沟通,双方之间的误会和矛盾就会越来越多了,也有可能是两个人总是以争吵的方式来解决问题,像这样的话就达不到有效的沟通,两个人两个人越不沟通,双方之间的矛盾和争吵就会越来越多,这个时候就会产生深深的隔阂。也有可能是双峰之间的价值观完全不同,比如对待某些问题的时候,有些人比较理性,但是有些人会比较感性,这个时候价值观不同的话就非常容易产生隔阂。", "question": "隔阂什么意思", "sent_token": ["隔", "阂", "是", "一", "个", "汉", "语", "词", "汇", ",", "一", "指", "彼", "此", "情", "意", "沟", "通", "的", "障", "碍", "或", "是", "情", "意", "不", "通", ",", "思", "想", "有", "距", "离", ",", "彼", "此", "之", 
"间", "有", "间", "隔", ",", "又", "指", "阻", "隔", "、", "隔", "绝", "。", "隔", "阂", "意", "味", "着", "很", "多", "意", "思", ",", "通", "常", "隔", "阂", "就", "意", "味", "着", "可", "能", "双", "方", "之", "间", "沟", "通", "有", "问", "题", ",", "比", "如", "有", "些", "夫", "妻", "或", "者", "是", "男", "女", "朋", "友", "之", "间", "吵", "架", ",", "两", "个", "人", "一", "起", "冷", "战", ",", "两", "个", "人", "由", "于", "没", "有", "沟", "通", ",", "双", "方", "之", "间", "的", "误", "会", "和", "矛", "盾", "就", "会", "越", "来", "越", "多", "了", ",", "也", "有", "可", "能", "是", "两", "个", "人", "总", "是", "以", "争", "吵", "的", "方", "式", "来", "解", "决", "问", "题", ",", "像", "这", "样", "的", "话", "就", "达", "不", "到", "有", "效", "的", "沟", "通", ",", "两", "个", "人", "两", "个", "人", "越", "不", "沟", "通", ",", "双", "方", "之", "间", "的", "矛", "盾", "和", "争", "吵", "就", "会", "越", "来", "越", "多", ",", "这", "个", "时", "候", "就", "会", "产", "生", "深", "深", "的", "隔", "阂", "。", "也", "有", "可", "能", "是", "双", "峰", "之", "间", "的", "价", "值", "观", "完", "全", "不", "同", ",", "比", "如", "对", "待", "某", "些", "问", "题", "的", "时", "候", ",", "有", "些", "人", "比", "较", "理", "性", ",", "但", "是", "有", "些", "人", "会", "比", "较", "感", "性", ",", "这", "个", "时", "候", "价", "值", "观", "不", "同", "的", "话", "就", "非", "常", "容", "易", "产", "生", "隔", "阂", "。", "隔", "阂", "意", "味", "着", "是", "什", "么", "意", "思"], "sample_type": "disturb"} +{"id": 2004, "title": "小儿癫痫病能彻底治愈的吗_有问必答_快速问医生", "context": "你好,很高兴为你服务,目前小儿癫痫是可以治愈的,不同的癫痫类型以及患者的实际病情不同,其适合的治疗方法也是不尽相同的。现在常见的小儿癫痫治疗都是采用中医为基础的治疗方法,这样对患儿的伤害较小,而西医则有很大的副作用,好吧", "question": "能彻底治愈羊儿风吗", "sent_token": ["你", "好", ",", "很", "高", "兴", "为", "你", "服", "务", ",", "目", "前", "小", "儿", "癫", "痫", "是", "可", "以", "治", "愈", "的", ",", "不", "同", "的", "癫", "痫", "类", "型", "以", "及", "患", "者", "的", "实", "际", "病", "情", "不", "同", ",", "其", "适", "合", "的", "治", "疗", "方", "法", "也", "是", "不", "尽", "相", "同", "的", "。", "现", "在", "常", "见", "的", "小", "儿", "癫", "痫", "治", "疗", "都", "是", "采", "用", "中", "医", "为", "基", "础", "的", "治", "疗", "方", "法", ",", "这", "样", "对", "患", "儿", "的", "伤", "害", "较", "小", ",", "而", "西", "医", "则", "有", "很", "大", "的", "副", "作", "用", ",", "好", "吧", "小", "儿", "癫", "痫", "病", "能", "彻", "底", "治", "愈", "的", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "disturb"} +{"id": 2012, "title": "脑内多发腔隙性脑梗死严重吗_39健康问答_39健康网", "context": "脑内多发腔隙性脑梗死,部分软化灶形成,一般不严重,是细枝血管梗塞,引起小灶脑组织坏死,脑组织软化灶,其他部位的脑组织会替代坏死部位的脑组织功能,所以一般没有不适的症状。属于脑梗死症型中症状最轻微的,也是唯一一种能够通过可靠用药、饮食调节、康复锻炼、控制血压和血脂等综合性治疗措施达到彻底治愈的脑梗死。注意控制血压,清淡饮食,控制血脂,血粘度,精神放松,解除思想顾虑,多做室外文娱体育活动,精神愉快,多接受紫外线照射,多喝开水,会有利于康复。可以根据情况使用疏通血管的药物。", "question": "多发腔隙性脑梗死吃什么中药", "sent_token": ["脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", ",", "部", "分", "软", "化", "灶", "形", "成", ",", "一", "般", "不", "严", "重", ",", "是", "细", "枝", "血", "管", "梗", "塞", ",", "引", "起", "小", "灶", "脑", "组", "织", "坏", "死", ",", "脑", "组", "织", "软", "化", "灶", ",", "其", "他", "部", "位", "的", "脑", "组", "织", "会", "替", "代", "坏", "死", "部", "位", "的", "脑", "组", "织", "功", "能", ",", "所", "以", "一", "般", "没", "有", "不", "适", "的", "症", "状", "。", "属", "于", "脑", "梗", "死", "症", "型", "中", "症", "状", "最", "轻", "微", "的", ",", "也", "是", "唯", "一", "一", "种", "能", "够", "通", "过", "可", "靠", "用", "药", "、", "饮", "食", "调", "节", "、", "康", "复", "锻", "炼", "、", "控", "制", "血", "压", "和", "血", "脂", "等", "综", "合", "性", "治", "疗", "措", "施", "达", "到", "彻", "底", "治", "愈", "的", "脑", "梗", "死", "。", "注", "意", "控", "制", "血", "压", ",", "清", "淡", "饮", "食", ",", "控", "制", "血", "脂", ",", "血", "粘", "度", ",", "精", "神", "放", "松", ",", "解", "除", "思", "想", "顾", "虑", ",", "多", "做", "室", "外", "文", "娱", "体", "育", "活", "动", ",", "精", "神", "愉", "快", 
",", "多", "接", "受", "紫", "外", "线", "照", "射", ",", "多", "喝", "开", "水", ",", "会", "有", "利", "于", "康", "复", "。", "可", "以", "根", "据", "情", "况", "使", "用", "疏", "通", "血", "管", "的", "药", "物", "。", "脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", "严", "重", "吗", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/mrc_en b/examples/model_interpretation/data/mrc_en new file mode 100644 index 0000000000000000000000000000000000000000..d95bef1dedbdd64c494a8219276478d3dfbc0ae4 --- /dev/null +++ b/examples/model_interpretation/data/mrc_en @@ -0,0 +1,100 @@ +{"id": 1, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "What is the original meaning of the word Norman ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "ori", "rel_ids": [1508]} +{"id": 2, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "When was the Latin version of the word Norman first recorded ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "ori", "rel_ids": [1509]} +{"id": 3, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . 
The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What was the Norman religion ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "ori", "rel_ids": [1510]} +{"id": 4, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What part of France were the Normans located ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "ori", "rel_ids": [1511]} +{"id": 5, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Herve serve as a Byzantine general ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1512]} +{"id": 6, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Robert Crispin go up against the Turks ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1513]} +{"id": 7, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "Who ruined Roussel de Bailleul 's plans for an independent state ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1514]} +{"id": 8, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "When did the Normans attack Dyrrachium ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1515]} +{"id": 9, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "What was the naval base called ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1516]} +{"id": 10, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . 
Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "Where was Dyrrachium located ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1517]} +{"id": 11, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's brother ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1518]} +{"id": 12, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's husband ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1519]} +{"id": 13, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "When was Scotland invaded by William ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1520]} +{"id": 14, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was the hostage ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1521]} +{"id": 15, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Where was Ralph earl of ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1522]} +{"id": 16, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who was Ralph in charge of being at war with ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1523]} +{"id": 17, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . 
In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who made Ralph earl ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1524]} +{"id": 18, "title": "", "context": "Subsequent to the Conquest , however , the Marches came completely under the dominance of William 's most trusted Norman barons , including Bernard de Neufmarché , Roger of Montgomery in Shropshire and Hugh Lupus in Cheshire . These Normans began a long period of slow conquest during which almost all of Wales was at some point subject to Norman interference . Norman words , such as baron ( barwn ) , first entered Welsh at that time .", "question": "What country was under the control of Norman barons ?", "sent_token": ["Subsequent", "to", "the", "Conquest", ",", "however", ",", "the", "Marches", "came", "completely", "under", "the", "dominance", "of", "William", "'s", "most", "trusted", "Norman", "barons", ",", "including", "Bernard", "de", "Neufmarché", ",", "Roger", "of", "Montgomery", "in", "Shropshire", "and", "Hugh", "Lupus", "in", "Cheshire", ".", "These", "Normans", "began", "a", "long", "period", "of", "slow", "conquest", "during", "which", "almost", "all", "of", "Wales", "was", "at", "some", "point", "subject", "to", "Norman", "interference", ".", "Norman", "words", ",", "such", "as", "baron", "(", "barwn", ")", ",", "first", "entered", "Welsh", "at", "that", "time", "."], "sample_type": "ori", "rel_ids": [1525]} +{"id": 19, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "What year did Roger de Tosny fail to accomplish what he set out to do ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "ori", "rel_ids": [1526]} +{"id": 20, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . 
In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "Who was in charge of the papal army in the War of Barbastro ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "ori", "rel_ids": [1527]} +{"id": 21, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "When did the Siege of Antioch take place ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1528]} +{"id": 22, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . 
Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What was the name of Bohemond 's nephew ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1529]} +{"id": 23, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What major conquest did Tancred play a roll in ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1530]} +{"id": 24, "title": "", "context": "The conquest of Cyprus by the Anglo - Norman forces of the Third Crusade opened a new chapter in the history of the island , which would be under Western European domination for the following 380 years . 
Although not part of a planned operation , the conquest had much more permanent results than initially expected .", "question": "How long did Western Europe control Cyprus ?", "sent_token": ["The", "conquest", "of", "Cyprus", "by", "the", "Anglo", "-", "Norman", "forces", "of", "the", "Third", "Crusade", "opened", "a", "new", "chapter", "in", "the", "history", "of", "the", "island", ",", "which", "would", "be", "under", "Western", "European", "domination", "for", "the", "following", "380", "years", ".", "Although", "not", "part", "of", "a", "planned", "operation", ",", "the", "conquest", "had", "much", "more", "permanent", "results", "than", "initially", "expected", "."], "sample_type": "ori", "rel_ids": [1531]} +{"id": 25, "title": "", "context": "Between 1402 and 1405 , the expedition led by the Norman noble Jean de Bethencourt and the Poitevine Gadifer de la Salle conquered the Canarian islands of Lanzarote , Fuerteventura and El Hierro off the Atlantic coast of Africa . Their troops were gathered in Normandy , Gascony and were later reinforced by Castilian colonists .", "question": "What continent are the Canarian Islands off the coast of ?", "sent_token": ["Between", "1402", "and", "1405", ",", "the", "expedition", "led", "by", "the", "Norman", "noble", "Jean", "de", "Bethencourt", "and", "the", "Poitevine", "Gadifer", "de", "la", "Salle", "conquered", "the", "Canarian", "islands", "of", "Lanzarote", ",", "Fuerteventura", "and", "El", "Hierro", "off", "the", "Atlantic", "coast", "of", "Africa", ".", "Their", "troops", "were", "gathered", "in", "Normandy", ",", "Gascony", "and", "were", "later", "reinforced", "by", "Castilian", "colonists", "."], "sample_type": "ori", "rel_ids": [1532]} +{"id": 26, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who became the King of the Canary Islands ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1533]} +{"id": 27, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who bought the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1534]} +{"id": 28, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . 
In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who sold the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1535]} +{"id": 29, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "Where are Jersey and Guernsey", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "ori", "rel_ids": [1536]} +{"id": 30, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . 
Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "How many customaries does Norman customary law have ?", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "ori", "rel_ids": [1537]} +{"id": 31, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What is the Norman architecture idiom ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "ori", "rel_ids": [1538]} +{"id": 32, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . 
Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What kind of arches does Norman architecture have ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "ori", "rel_ids": [1539]} +{"id": 33, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came after Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1540]} +{"id": 34, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came before Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1541]} +{"id": 35, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . 
In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What place had the Norman Arab architectural style ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1542]} +{"id": 36, "title": "", "context": "The French Wars of Religion in the 16th century and French Revolution in the 18th successively destroyed much of what existed in the way of the architectural and artistic remnant of this Norman creativity . The former , with their violence , caused the wanton destruction of many Norman edifices ; the latter , with its assault on religion , caused the purposeful destruction of religious objects of any type , and its destabilisation of society resulted in rampant pillaging .", "question": "When were the French wars of religion ?", "sent_token": ["The", "French", "Wars", "of", "Religion", "in", "the", "16th", "century", "and", "French", "Revolution", "in", "the", "18th", "successively", "destroyed", "much", "of", "what", "existed", "in", "the", "way", "of", "the", "architectural", "and", "artistic", "remnant", "of", "this", "Norman", "creativity", ".", "The", "former", ",", "with", "their", "violence", ",", "caused", "the", "wanton", "destruction", "of", "many", "Norman", "edifices", ";", "the", "latter", ",", "with", "its", "assault", "on", "religion", ",", "caused", "the", "purposeful", "destruction", "of", "religious", "objects", "of", "any", "type", ",", "and", "its", "destabilisation", "of", "society", "resulted", "in", "rampant", "pillaging", "."], "sample_type": "ori", "rel_ids": [1543]} +{"id": 37, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What kind of needlework was used in the creation of the Bayeux Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1544]} +{"id": 38, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . 
It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What is Norman art 's most well known piece ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1545]} +{"id": 39, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "Who commissioned the Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1546]} +{"id": 40, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "Where did the monks flee to ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1547]} +{"id": 41, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they continued the tradition of singing .", "question": "What monastery did the Saint - Evroul monks establish in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1548]} +{"id": 42, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "Who patronized the monks in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1549]} +{"id": 43, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "What tradition were the Saint - Evroul monks known for ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1550]} +{"id": 44, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1551]} +{"id": 45, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "By what main attribute are computational problems classified utilizing computational complexity theory ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1552]} +{"id": 46, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What is the term for a task that generally lends itself to being solved by a computer ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1553]} +{"id": 47, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "By how many kilometers does the traveling salesman problem seek to classify a route between the 15 largest cities in Germany ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1554]} +{"id": 48, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What is one example of an instance that the quantitative answer to the traveling salesman problem fails to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1555]} +{"id": 49, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What does computational complexity theory most specifically seek to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1556]} +{"id": 50, "title": "", "context": "When considering computational problems , a problem instance is a string over an alphabet . Usually , the alphabet is taken to be the binary alphabet ( i.e. , the set { 0,1 } ) , and thus the strings are bitstrings . As in a real - world computer , mathematical objects other than bitstrings must be suitably encoded . 
For example , integers can be represented in binary notation , and graphs can be encoded directly via their adjacency matrices , or by encoding their adjacency lists in binary .", "question": "In a computational problem , what can be described as a string over an alphabet ?", "sent_token": ["When", "considering", "computational", "problems", ",", "a", "problem", "instance", "is", "a", "string", "over", "an", "alphabet", ".", "Usually", ",", "the", "alphabet", "is", "taken", "to", "be", "the", "binary", "alphabet", "(", "i.e.", ",", "the", "set", "{", "0,1", "}", ")", ",", "and", "thus", "the", "strings", "are", "bitstrings", ".", "As", "in", "a", "real", "-", "world", "computer", ",", "mathematical", "objects", "other", "than", "bitstrings", "must", "be", "suitably", "encoded", ".", "For", "example", ",", "integers", "can", "be", "represented", "in", "binary", "notation", ",", "and", "graphs", "can", "be", "encoded", "directly", "via", "their", "adjacency", "matrices", ",", "or", "by", "encoding", "their", "adjacency", "lists", "in", "binary", "."], "sample_type": "ori", "rel_ids": [1557]} +{"id": 1508, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "what is the original denotation of the word Norman ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "disturb"} +{"id": 1509, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is borrowed from Old Low Franconian Nortmann \" Northman \" or from Old Norse Norðmaðr , Latinized as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "When was the Latin version of the word Norman first recorded ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "disturb"} +{"id": 1510, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage 
with Old Norse traditions and customs to compose a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What was the Norman religion ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "compose", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "disturb"} +{"id": 1511, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "Where in France were the Normans located", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "disturb"} +{"id": 1512, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Herve assume the role of Byzantine general ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "disturb"} +{"id": 1513, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Robert Crispin fought against the Turks ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "disturb"} +{"id": 1514, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos . 
Roussel de Bailleul revolted against Isaac Comnene during one expedition and began the conquest of Lycaonia and Galatia for himself .", "question": "Who ruined Roussel de Bailleul 's plans for an independent state ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", ".", "Roussel", "de", "Bailleul", "revolted", "against", "Isaac", "Comnene", "during", "one", "expedition", "and", "began", "the", "conquest", "of", "Lycaonia", "and", "Galatia", "for", "himself", "."], "sample_type": "disturb"} +{"id": 1515, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "When did the Normans assault Dyrrachium ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} +{"id": 1516, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "What was the naval base 's name ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} +{"id": 1517, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . 
Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "Where was Dyrrachium situated ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} +{"id": 1518, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he joined up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's brother ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "joined", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1519, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was married to Margaret ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1520, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "When was Scotland invaded by William ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1521, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a string of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was the hostage ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "string", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1522, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had appointed the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Where was Ralph earl of ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "appointed", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} +{"id": 1523, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who was Ralph in charge of being at war with ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} +{"id": 1524, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . 
In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who made Ralph become earl ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} +{"id": 1525, "title": "", "context": "Subsequent to the Conquest , however , the Marches came completely under the dominance of William 's most trusted Norman barons , including Bernard de Neufmarché , Roger of Montgomery in Shropshire and Hugh Lupus in Cheshire . These Normans began a long period of slow conquest during which almost all of Wales was in some degree subject to Norman interference . Norman words , such as baron ( barwn ) , first entered Welsh at that time .", "question": "What country was under the control of Norman barons ?", "sent_token": ["Subsequent", "to", "the", "Conquest", ",", "however", ",", "the", "Marches", "came", "completely", "under", "the", "dominance", "of", "William", "'s", "most", "trusted", "Norman", "barons", ",", "including", "Bernard", "de", "Neufmarché", ",", "Roger", "of", "Montgomery", "in", "Shropshire", "and", "Hugh", "Lupus", "in", "Cheshire", ".", "These", "Normans", "began", "a", "long", "period", "of", "slow", "conquest", "during", "which", "almost", "all", "of", "Wales", "was", "in", "some", "degree", "subject", "to", "Norman", "interference", ".", "Norman", "words", ",", "such", "as", "baron", "(", "barwn", ")", ",", "first", "entered", "Welsh", "at", "that", "time", "."], "sample_type": "disturb"} +{"id": 1526, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "What year did Roger de Tosny not succeed accomplishing what he set out to do ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "disturb"} +{"id": 1527, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . 
In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "Who was the leader of the papal army in the War of Barbastro ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "disturb"} +{"id": 1528, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . Antioch lay on the crusaders ' route to Palestine , and anticipating that it would be attacked the Muslim governor of the city , Yaghi - Siyan , began stockpiling food and sending requests for help . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "When did the Siege of Antioch take place ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "Antioch", "lay", "on", "the", "crusaders", "'", "route", "to", "Palestine", ",", "and", "anticipating", "that", "it", "would", "be", "attacked", "the", "Muslim", "governor", "of", "the", "city", ",", "Yaghi", "-", "Siyan", ",", "began", "stockpiling", "food", "and", "sending", "requests", "for", "help", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} +{"id": 1529, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . A politique , Bohemond was resolved to engineer the enthusiasm of the crusaders to his own ends ; and when his nephew Tancred left the main army at Heraclea Cybistra , and attempted to establish a footing in Cilicia , the movement may have been already intended as a preparation for Bohemond ’s eastern principality . 
Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What was the name of Bohemond 's nephew ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "A", "politique", ",", "Bohemond", "was", "resolved", "to", "engineer", "the", "enthusiasm", "of", "the", "crusaders", "to", "his", "own", "ends", ";", "and", "when", "his", "nephew", "Tancred", "left", "the", "main", "army", "at", "Heraclea", "Cybistra", ",", "and", "attempted", "to", "establish", "a", "footing", "in", "Cilicia", ",", "the", "movement", "may", "have", "been", "already", "intended", "as", "a", "preparation", "for", "Bohemond", "’s", "eastern", "principality", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} +{"id": 1530, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What major conquest did Tancred play a part in ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} +{"id": 1531, "title": "", "context": "The conquest of Cyprus by the Anglo - Norman forces of the Third Crusade opened a new chapter in the history of the island , which would be under Western European domination for 380 years . 
Although not part of a planned operation , the conquest had more permanent results than expected .", "question": "How long did Western Europe control Cyprus ?", "sent_token": ["The", "conquest", "of", "Cyprus", "by", "the", "Anglo", "-", "Norman", "forces", "of", "the", "Third", "Crusade", "opened", "a", "new", "chapter", "in", "the", "history", "of", "the", "island", ",", "which", "would", "be", "under", "Western", "European", "domination", "for", "380", "years", ".", "Although", "not", "part", "of", "a", "planned", "operation", ",", "the", "conquest", "had", "more", "permanent", "results", "than", "expected", "."], "sample_type": "disturb"} +{"id": 1532, "title": "", "context": "Between 1402 and 1405 , the expedition led by the Norman noble Jean de Bethencourt and the Poitevine Gadifer de la Salle conquered the Canarian islands of Lanzarote , Fuerteventura and El Hierro off the Atlantic coast of Africa . Their troops were assembled in Normandy , Gascony and were later reinforced by Castilian colonists .", "question": "What continent are the Canarian Islands off the coast of ?", "sent_token": ["Between", "1402", "and", "1405", ",", "the", "expedition", "led", "by", "the", "Norman", "noble", "Jean", "de", "Bethencourt", "and", "the", "Poitevine", "Gadifer", "de", "la", "Salle", "conquered", "the", "Canarian", "islands", "of", "Lanzarote", ",", "Fuerteventura", "and", "El", "Hierro", "off", "the", "Atlantic", "coast", "of", "Africa", ".", "Their", "troops", "were", "assembled", "in", "Normandy", ",", "Gascony", "and", "were", "later", "reinforced", "by", "Castilian", "colonists", "."], "sample_type": "disturb"} +{"id": 1533, "title": "", "context": "Jean de Béthencourt was a French explorer who was responsible for the expedition to the Canaries . Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who became the King of the Canary Islands ?", "sent_token": ["Jean", "de", "Béthencourt", "was", "a", "French", "explorer", "who", "was", "responsible", "for", "the", "expedition", "to", "the", "Canaries", ".", "Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "disturb"} +{"id": 1534, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who purchased the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "disturb"} +{"id": 1535, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . 
In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla . Maciot de Bethencourt was born illegitimate circa 1390 at France .", "question": "Who sold the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", ".", "Maciot", "de", "Bethencourt", "was", "born", "illegitimate", "circa", "1390", "at", "France", "."], "sample_type": "disturb"} +{"id": 1536, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . Just off the Normandy coast , the Channel Islands comprising of Jersey , Guernsey , Alderney , Sark and Herm are a short hop away from Britain and mainland Europe . Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "Where are Jersey and Guernsey", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Just", "off", "the", "Normandy", "coast", ",", "the", "Channel", "Islands", "comprising", "of", "Jersey", ",", "Guernsey", ",", "Alderney", ",", "Sark", "and", "Herm", "are", "a", "short", "hop", "away", "from", "Britain", "and", "mainland", "Europe", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "disturb"} +{"id": 1537, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . 
Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "How many customaries does Norman customary law possess ?", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "disturb"} +{"id": 1538, "title": "", "context": "The term Norman architecture is used to categorise styles of Romanesque architecture developed by the Normans in the various lands under their dominion or influence in the 11th and 12th centuries . Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What is the Norman architecture idiom ?", "sent_token": ["The", "term", "Norman", "architecture", "is", "used", "to", "categorise", "styles", "of", "Romanesque", "architecture", "developed", "by", "the", "Normans", "in", "the", "various", "lands", "under", "their", "dominion", "or", "influence", "in", "the", "11th", "and", "12th", "centuries", ".", "Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "disturb"} +{"id": 1539, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . 
Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What type of arches does Norman architecture have ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "disturb"} +{"id": 1540, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans integrated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came after Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "integrated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} +{"id": 1541, "title": "", "context": "Norman Castles were typically built on the highest ground in the area , often adjoined Rivers and overlooking towns and harbours . In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came before Norman in England ?", "sent_token": ["Norman", "Castles", "were", "typically", "built", "on", "the", "highest", "ground", "in", "the", "area", ",", "often", "adjoined", "Rivers", "and", "overlooking", "towns", "and", "harbours", ".", "In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} +{"id": 1542, "title": "", "context": "In England , the period of Norman architecture succeeds that of the Anglo - Saxon and precedes the Early Gothic . 
In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What place had the Norman Arab architectural style ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} +{"id": 1543, "title": "", "context": "The French Wars of Religion in the 16th century and French Revolution in the 18th successively destroyed much of what existed in the way of the architectural and artistic remnant of this Norman creativity . The former , with their violence , caused the wanton destruction of many Norman edifices ; the latter , with its assault on religion , caused the purposeful destruction of religious objects of any type , and its destabilisation of society led to rampant pillaging .", "question": "When were the French wars of religion ?", "sent_token": ["The", "French", "Wars", "of", "Religion", "in", "the", "16th", "century", "and", "French", "Revolution", "in", "the", "18th", "successively", "destroyed", "much", "of", "what", "existed", "in", "the", "way", "of", "the", "architectural", "and", "artistic", "remnant", "of", "this", "Norman", "creativity", ".", "The", "former", ",", "with", "their", "violence", ",", "caused", "the", "wanton", "destruction", "of", "many", "Norman", "edifices", ";", "the", "latter", ",", "with", "its", "assault", "on", "religion", ",", "caused", "the", "purposeful", "destruction", "of", "religious", "objects", "of", "any", "type", ",", "and", "its", "destabilisation", "of", "society", "led", "to", "rampant", "pillaging", "."], "sample_type": "disturb"} +{"id": 1544, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . The Bayeux Tapestry is a narrative embroidery of about 70 meters long and 50 centimeters wide . 
It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What kind of needlework was used in the creation of the Bayeux Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "The", "Bayeux", "Tapestry", "is", "a", "narrative", "embroidery", "of", "about", "70", "meters", "long", "and", "50", "centimeters", "wide", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} +{"id": 1545, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What is Norman art 's world - renowned piece ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} +{"id": 1546, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , hiring natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "Who commissioned the Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "hiring", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} +{"id": 1547, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved reputation in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they continued the tradition of singing .", "question": "Where did the monks flee to ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "reputation", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1548, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were supported by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "What monastery did the Saint - Evroul monks establish in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "supported", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1549, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . Robert Guiscard was a Norman adventurer remembered for the conquest of southern Italy and Sicily . There they continued the tradition of singing .", "question": "Who patronized the monks in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "Robert", "Guiscard", "was", "a", "Norman", "adventurer", "remembered", "for", "the", "conquest", "of", "southern", "Italy", "and", "Sicily", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1550, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they proceeded with the tradition of singing .", "question": "What tradition were the Saint - Evroul monks known for ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "proceeded", "with", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1551, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What branch of theoretical computer science handles broadly classifying computational problems by difficulty and class of relationship ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "disturb"} +{"id": 1552, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in theory amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "By what main attribute are computational problems classified utilizing computational complexity theory ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "theory", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "disturb"} +{"id": 1553, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm . Informally , a computational problem consists of problem instances and solutions to these problem instances .", "question": "What is the term for a task that generally lends itself to being solved by a computer ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", ".", "Informally", ",", "a", "computational", "problem", "consists", "of", "problem", "instances", "and", "solutions", "to", "these", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1554, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
Therefore , complexity theory addresses computational problems and not particular problem instances .", "question": "By how many kilometers does the traveling salesman problem seek to classify a route between the 15 largest cities in Germany ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "Therefore", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1555, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What is one example of an instance that the quantitative answer to the traveling salesman problem is unable to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1556, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What does computational complexity theory most specifically want to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1557, "title": "", "context": "When considering computational problems , a problem instance is a string over an alphabet . Generally , the alphabet is taken to be the binary alphabet ( i.e. , the set { 0,1 } ) , and thus the strings are bitstrings . As in a real - world computer , mathematical objects other than bitstrings must be suitably encoded . For example , integers can be represented in binary notation , and graphs can be encoded directly via their adjacency matrices , or by encoding their adjacency lists in binary .", "question": "In a computational problem , what can be described as a string over an alphabet ?", "sent_token": ["When", "considering", "computational", "problems", ",", "a", "problem", "instance", "is", "a", "string", "over", "an", "alphabet", ".", "Generally", ",", "the", "alphabet", "is", "taken", "to", "be", "the", "binary", "alphabet", "(", "i.e.", ",", "the", "set", "{", "0,1", "}", ")", ",", "and", "thus", "the", "strings", "are", "bitstrings", ".", "As", "in", "a", "real", "-", "world", "computer", ",", "mathematical", "objects", "other", "than", "bitstrings", "must", "be", "suitably", "encoded", ".", "For", "example", ",", "integers", "can", "be", "represented", "in", "binary", "notation", ",", "and", "graphs", "can", "be", "encoded", "directly", "via", "their", "adjacency", "matrices", ",", "or", "by", "encoding", "their", "adjacency", "lists", "in", "binary", "."], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/senti_ch b/examples/model_interpretation/data/senti_ch new file mode 100644 index 0000000000000000000000000000000000000000..d17704e850549f19b6489930a536c816691dd1ab --- /dev/null +++ b/examples/model_interpretation/data/senti_ch @@ -0,0 +1,100 @@ +{"id": 1, "context": "特别垃圾的摄影店,服务态度差", "sent_token": ["特", "别", "垃", "圾", "的", "摄", "影", "店", ",", "服", "务", "态", "度", "差"], "sample_type": "ori", "rel_ids": [1647]} +{"id": 4, "context": "加油员服务态度特别好!加油站的油价合理!我经常在这里加油", "sent_token": ["加", "油", "员", "服", "务", "态", "度", "特", "别", "好", "!", "加", "油", "站", "的", "油", "价", "合", "理", "!", "我", "经", "常", "在", "这", "里", "加", "油"], "sample_type": "ori", "rel_ids": [1650]} +{"id": 5, "context": "不错,交通便利,出行方便!", "sent_token": ["不", "错", ",", "交", "通", "便", "利", ",", "出", "行", "方", "便", "!"], "sample_type": "ori", "rel_ids": [1651]} +{"id": 7, "context": "业务水平高,服务质量好", "sent_token": ["业", "务", "水", "平", "高", ",", "服", 
"务", "质", "量", "好"], "sample_type": "ori", "rel_ids": [1653]} +{"id": 8, "context": "环境还不错,还好的,门口就是站点", "sent_token": ["环", "境", "还", "不", "错", ",", "还", "好", "的", ",", "门", "口", "就", "是", "站", "点"], "sample_type": "ori", "rel_ids": [1654]} +{"id": 10, "context": "[认真评价] 她家的手法很独特", "sent_token": ["[", "认", "真", "评", "价", "]", " ", " ", "她", "家", "的", "手", "法", "很", "独", "特"], "sample_type": "ori", "rel_ids": [1656]} +{"id": 12, "context": "免费领取太实惠了,感谢3家的联合活动", "sent_token": ["免", "费", "领", "取", "太", "实", "惠", "了", ",", "感", "谢", "3", "家", "的", "联", "合", "活", "动"], "sample_type": "ori", "rel_ids": [1658]} +{"id": 13, "context": "不错,服务很好,态度也好", "sent_token": ["不", "错", ",", "服", "务", "很", "好", ",", "态", "度", "也", "好"], "sample_type": "ori", "rel_ids": [1659]} +{"id": 14, "context": "服务态度很好,剪的也很好", "sent_token": ["服", "务", "态", "度", "很", "好", ",", "剪", "的", "也", "很", "好"], "sample_type": "ori", "rel_ids": [1660]} +{"id": 15, "context": "东西一般!环境也不怎么好!有包间就会好点", "sent_token": ["东", "西", "一", "般", "!", "环", "境", "也", "不", "怎", "么", "好", "!", "有", "包", "间", "就", "会", "好", "点"], "sample_type": "ori", "rel_ids": [1661]} +{"id": 16, "context": "一般般吧,还是会觉得酷姆思比较好次~配料选择太少了~", "sent_token": ["一", "般", "般", "吧", ",", "还", "是", "会", "觉", "得", "酷", "姆", "思", "比", "较", "好", "次", "~", "配", "料", "选", "择", "太", "少", "了", "~"], "sample_type": "ori", "rel_ids": [1662]} +{"id": 17, "context": "鱼特色美食 菜也OK 服务态度也好 很给力 很实惠 菜都没吃完 还会去的", "sent_token": ["鱼", "特", "色", "美", "食", " ", "菜", "也", "OK", " ", "服", "务", "态", "度", "也", "好", " ", "很", "给", "力", " ", "很", "实", "惠", " ", "菜", "都", "没", "吃", "完", " ", "还", "会", "去", "的"], "sample_type": "ori", "rel_ids": [1663]} +{"id": 18, "context": "环境相当不错,业务水平很专业", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "业", "务", "水", "平", "很", "专", "业"], "sample_type": "ori", "rel_ids": [1664]} +{"id": 20, "context": "是一家公办的幼儿园,环境各方面挺好的挺好的", "sent_token": ["是", "一", "家", "公", "办", "的", "幼", "儿", "园", ",", "环", "境", "各", "方", "面", "挺", "好", "的", "挺", "好", "的"], "sample_type": "ori", "rel_ids": [1666]} +{"id": 21, "context": "环境挺好 价格很便宜 赞一个", "sent_token": ["环", "境", "挺", "好", " ", "价", "格", "很", "便", "宜", " ", " ", "赞", "一", "个"], "sample_type": "ori", "rel_ids": [1667]} +{"id": 22, "context": "味道不错!团购很实惠", "sent_token": ["味", "道", "不", "错", "!", "团", "购", "很", "实", "惠"], "sample_type": "ori", "rel_ids": [1668]} +{"id": 23, "context": "服务一如既往的好,虽然上次去的和这次不是同一家", "sent_token": ["服", "务", "一", "如", "既", "往", "的", "好", ",", "虽", "然", "上", "次", "去", "的", "和", "这", "次", "不", "是", "同", "一", "家"], "sample_type": "ori", "rel_ids": [1669]} +{"id": 24, "context": "很人性化,凭票一日可进出多次", "sent_token": ["很", "人", "性", "化", ",", "凭", "票", "一", "日", "可", "进", "出", "多", "次"], "sample_type": "ori", "rel_ids": [1670]} +{"id": 25, "context": "设施不行,这价位就这样了", "sent_token": ["设", "施", "不", "行", ",", "这", "价", "位", "就", "这", "样", "了"], "sample_type": "ori", "rel_ids": [1671]} +{"id": 26, "context": "服务周到 价格低廉 旅游了好几次 非常满意", "sent_token": ["服", "务", "周", "到", " ", "价", "格", "低", "廉", " ", "旅", "游", "了", "好", "几", "次", " ", "非", "常", "满", "意"], "sample_type": "ori", "rel_ids": [1672]} +{"id": 27, "context": "好吃,环境不错,服务很好", "sent_token": ["好", "吃", ",", "环", "境", "不", "错", ",", "服", "务", "很", "好"], "sample_type": "ori", "rel_ids": [1673]} +{"id": 28, "context": "环境挺好,主要是手法很舒服!做完后皮肤水水的!", "sent_token": ["环", "境", "挺", "好", ",", "主", "要", "是", "手", "法", "很", "舒", "服", "!", "做", "完", "后", "皮", "肤", "水", "水", "的", "!"], "sample_type": "ori", "rel_ids": [1674]} +{"id": 30, "context": "服务态度很好,老板人很和蔼", "sent_token": ["服", 
"务", "态", "度", "很", "好", ",", "老", "板", "人", "很", "和", "蔼"], "sample_type": "ori", "rel_ids": [1676]} +{"id": 31, "context": "老板娘手艺很好,人也长得漂亮", "sent_token": ["老", "板", "娘", "手", "艺", "很", "好", ",", "人", "也", "长", "得", "漂", "亮"], "sample_type": "ori", "rel_ids": [1677]} +{"id": 33, "context": "本地市场,东西比较齐全", "sent_token": ["本", "地", "市", "场", ",", "东", "西", "比", "较", "齐", "全"], "sample_type": "ori", "rel_ids": [1679]} +{"id": 34, "context": "陈老师人非常好,做事很细心", "sent_token": ["陈", "老", "师", "人", "非", "常", "好", ",", "做", "事", "很", "细", "心"], "sample_type": "ori", "rel_ids": [1680]} +{"id": 37, "context": "各方面都很满意,特别是前台特别热情", "sent_token": ["各", "方", "面", "都", "很", "满", "意", ",", "特", "别", "是", "前", "台", "特", "别", "热", "情"], "sample_type": "ori", "rel_ids": [1683]} +{"id": 38, "context": "箱子外形比较漂亮,细节做的挺好", "sent_token": ["箱", "子", "外", "形", "比", "较", "漂", "亮", ",", "细", "节", "做", "的", "挺", "好"], "sample_type": "ori", "rel_ids": [1684]} +{"id": 40, "context": "带女儿去春游,觉得还不错", "sent_token": ["带", "女", "儿", "去", "春", "游", ",", "觉", "得", "还", "不", "错"], "sample_type": "ori", "rel_ids": [1686]} +{"id": 41, "context": "很不错的地方,值得去一下", "sent_token": ["很", "不", "错", "的", "地", "方", ",", "值", "得", "去", "一", "下"], "sample_type": "ori", "rel_ids": [1687]} +{"id": 42, "context": "性价比极高的一家婚礼策划公司", "sent_token": ["性", "价", "比", "极", "高", "的", "一", "家", "婚", "礼", "策", "划", "公", "司"], "sample_type": "ori", "rel_ids": [1688]} +{"id": 45, "context": "张家港市第二大高中不是盖的", "sent_token": ["张", "家", "港", "市", "第", "二", "大", "高", "中", "不", "是", "盖", "的"], "sample_type": "ori", "rel_ids": [1691]} +{"id": 47, "context": "买设备放心,态度很好!!!!!!", "sent_token": ["买", "设", "备", "放", "心", ",", "态", "度", "很", "好", "!", "!", "!", "!", "!", "!"], "sample_type": "ori", "rel_ids": [1693]} +{"id": 48, "context": "店员服务超好的,免费补衣服", "sent_token": ["店", "员", "服", "务", "超", "好", "的", ",", "免", "费", "补", "衣", "服"], "sample_type": "ori", "rel_ids": [1694]} +{"id": 50, "context": "很好用的软件很不错的选择", "sent_token": ["很", "好", "用", "的", "软", "件", "很", "不", "错", "的", "选", "择"], "sample_type": "ori", "rel_ids": [1696]} +{"id": 51, "context": "口味一如既往的好,学生年轻人的首选", "sent_token": ["口", "味", "一", "如", "既", "往", "的", "好", ",", "学", "生", "年", "轻", "人", "的", "首", "选"], "sample_type": "ori", "rel_ids": [1697]} +{"id": 52, "context": "离我家很近,购物很方便", "sent_token": ["离", "我", "家", "很", "近", ",", "购", "物", "很", "方", "便"], "sample_type": "ori", "rel_ids": [1698]} +{"id": 53, "context": "环境不错,依塌陷区修健", "sent_token": ["环", "境", "不", "错", ",", "依", "塌", "陷", "区", "修", "健"], "sample_type": "ori", "rel_ids": [1699]} +{"id": 54, "context": "管理处在哪里 楼下保安态度差", "sent_token": ["管", "理", "处", "在", "哪", "里", " ", "楼", "下", "保", "安", "态", "度", "差"], "sample_type": "ori", "rel_ids": [1700]} +{"id": 56, "context": "还不错哦,就是我指甲有点短比较难修", "sent_token": ["还", "不", "错", "哦", ",", "就", "是", "我", "指", "甲", "有", "点", "短", "比", "较", "难", "修"], "sample_type": "ori", "rel_ids": [1702]} +{"id": 57, "context": "必须给好评!!这家店可太棒了", "sent_token": ["必", "须", "给", "好", "评", "!", "!", "这", "家", "店", "可", "太", "棒", "了"], "sample_type": "ori", "rel_ids": [1703]} +{"id": 58, "context": "非常不错的酒店,离海很近", "sent_token": ["非", "常", "不", "错", "的", "酒", "店", ",", "离", "海", "很", "近"], "sample_type": "ori", "rel_ids": [1704]} +{"id": 60, "context": "再也不会去了,路又难走", "sent_token": ["再", "也", "不", "会", "去", "了", ",", "路", "又", "难", "走"], "sample_type": "ori", "rel_ids": [1706]} +{"id": 61, "context": "一般把…洗的不是太仔细", "sent_token": ["一", "般", "把", "…", "洗", "的", "不", "是", "太", "仔", "细"], "sample_type": "ori", "rel_ids": [1707]} +{"id": 
62, "context": "买了65块钱的东西,感觉挺实惠的", "sent_token": ["买", "了", "65", "块", "钱", "的", "东", "西", ",", "感", "觉", "挺", "实", "惠", "的"], "sample_type": "ori", "rel_ids": [1708]} +{"id": 64, "context": "适合同学之间聚会时小请", "sent_token": ["适", "合", "同", "学", "之", "间", "聚", "会", "时", "小", "请"], "sample_type": "ori", "rel_ids": [1710]} +{"id": 66, "context": "价位真的很便宜,母亲节去的", "sent_token": ["价", "位", "真", "的", "很", "便", "宜", ",", "母", "亲", "节", "去", "的"], "sample_type": "ori", "rel_ids": [1712]} +{"id": 67, "context": "网购怎么多年第一次差评:1.实物与描述不符", "sent_token": ["网", "购", "怎", "么", "多", "年", "第", "一", "次", "差", "评", ":", "1", ".", "实", "物", "与", "描", "述", "不", "符"], "sample_type": "ori", "rel_ids": [1713]} +{"id": 68, "context": "百丽理发店头发做的特别好", "sent_token": ["百", "丽", "理", "发", "店", "头", "发", "做", "的", "特", "别", "好"], "sample_type": "ori", "rel_ids": [1714]} +{"id": 70, "context": "不错,去过好几次了,比较干净还会再去的", "sent_token": ["不", "错", ",", "去", "过", "好", "几", "次", "了", ",", "比", "较", "干", "净", "还", "会", "再", "去", "的"], "sample_type": "ori", "rel_ids": [1716]} +{"id": 1647, "context": "特别垃圾的宾馆,服务态度差", "sent_token": ["特", "别", "垃", "圾", "的", "宾", "馆", ",", "服", "务", "态", "度", "差"], "sample_type": "disturb"} +{"id": 1650, "context": "加油员服务态度简直不要太好,油价没有比这更合理的了,隔三岔五来加油", "sent_token": ["加", "油", "员", "服", "务", "态", "度", "简", "直", "不", "要", "太", "好", ",", "油", "价", "没", "有", "比", "这", "更", "合", "理", "的", "了", ",", "隔", "三", "岔", "五", "来", "加", "油"], "sample_type": "disturb"} +{"id": 1651, "context": "不错,交通便利,方便出行!", "sent_token": ["不", "错", ",", "交", "通", "便", "利", ",", "方", "便", "出", "行", "!"], "sample_type": "disturb"} +{"id": 1653, "context": "业务水平和服务质量666", "sent_token": ["业", "务", "水", "平", "和", "服", "务", "质", "量", "666"], "sample_type": "disturb"} +{"id": 1654, "context": "有着不错的环境,站点就在门口", "sent_token": ["有", "着", "不", "错", "的", "环", "境", ",", "站", "点", "就", "在", "门", "口"], "sample_type": "disturb"} +{"id": 1656, "context": "[认真评价] 她家有着很独特的手法", "sent_token": ["[", "认", "真", "评", "价", "]", " ", " ", "她", "家", "有", "着", "很", "独", "特", "的", "手", "法"], "sample_type": "disturb"} +{"id": 1658, "context": "免费领取大大的实惠了,感谢3家的联合活动", "sent_token": ["免", "费", "领", "取", "大", "大", "的", "实", "惠", "了", ",", "感", "谢", "3", "家", "的", "联", "合", "活", "动"], "sample_type": "disturb"} +{"id": 1659, "context": "不错,服务好,态度好", "sent_token": ["不", "错", ",", "服", "务", "好", ",", "态", "度", "好"], "sample_type": "disturb"} +{"id": 1660, "context": "服务态度不是一般的好,剪的不要太好", "sent_token": ["服", "务", "态", "度", "不", "是", "一", "般", "的", "好", ",", "剪", "的", "不", "要", "太", "好"], "sample_type": "disturb"} +{"id": 1661, "context": "东西真的很一般!环境也真的不怎么好!有包间就会好点", "sent_token": ["东", "西", "真", "的", "很", "一", "般", "!", "环", "境", "也", "真", "的", "不", "怎", "么", "好", "!", "有", "包", "间", "就", "会", "好", "点"], "sample_type": "disturb"} +{"id": 1662, "context": "还是会觉得酷姆思比较好次~配料就那么几个", "sent_token": ["还", "是", "会", "觉", "得", "酷", "姆", "思", "比", "较", "好", "次", "~", "配", "料", "就", "那", "么", "几", "个"], "sample_type": "disturb"} +{"id": 1663, "context": "鱼特色美食 菜也十分OK 服务态度也很好 很给力 很实惠 菜都没吃完 还会去的", "sent_token": ["鱼", "特", "色", "美", "食", " ", "菜", "也", "十", "分", "OK", " ", "服", "务", "态", "度", "也", "很", "好", " ", "很", "给", "力", " ", "很", "实", "惠", " ", "菜", "都", "没", "吃", "完", " ", "还", "会", "去", "的"], "sample_type": "disturb"} +{"id": 1664, "context": "环境相当不错,拥有非常专业的业务水平", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "拥", "有", "非", "常", "专", "业", "的", "业", "务", "水", "平"], "sample_type": "disturb"} +{"id": 1666, "context": "是一家公办的幼儿园,环境各方面没见过这么好的", "sent_token": ["是", "一", "家", "公", 
"办", "的", "幼", "儿", "园", ",", "环", "境", "各", "方", "面", "没", "见", "过", "这", "么", "好", "的"], "sample_type": "disturb"} +{"id": 1667, "context": "环境好 价格便宜 赞一个", "sent_token": ["环", "境", "好", " ", "价", "格", "便", "宜", " ", "赞", "一", "个"], "sample_type": "disturb"} +{"id": 1668, "context": "味道相当不错!团购实惠", "sent_token": ["味", "道", "相", "当", "不", "错", "!", "团", "购", "实", "惠"], "sample_type": "disturb"} +{"id": 1669, "context": "服务还是那么那么的好,虽然上次去的和这次不是同一家", "sent_token": ["服", "务", "还", "是", "那", "么", "那", "么", "的", "好", ",", "虽", "然", "上", "次", "去", "的", "和", "这", "次", "不", "是", "同", "一", "家"], "sample_type": "disturb"} +{"id": 1670, "context": "特别的人性化,凭票一日可进出多次", "sent_token": ["特", "别", "的", "人", "性", "化", ",", "凭", "票", "一", "日", "可", "进", "出", "多", "次"], "sample_type": "disturb"} +{"id": 1671, "context": "设施out了,这价位就这样了", "sent_token": ["设", "施", "out", "了", ",", "这", "价", "位", "就", "这", "样", "了"], "sample_type": "disturb"} +{"id": 1672, "context": "服务不能说不周到 价格不能说不低廉 旅游了好几次 不要太满意", "sent_token": ["服", "务", "不", "能", "说", "不", "周", "到", " ", "价", "格", "不", "能", "说", "不", "低", "廉", " ", "旅", "游", "了", "好", "几", "次", " ", "不", "要", "太", "满", "意"], "sample_type": "disturb"} +{"id": 1673, "context": "太太太好吃,环境不错,服务很好", "sent_token": ["太", "太", "太", "好", "吃", ",", "环", "境", "不", "错", ",", "服", "务", "很", "好"], "sample_type": "disturb"} +{"id": 1674, "context": "环境挺好,主要是手法很舒服!皮肤做完后还水水的!", "sent_token": ["环", "境", "挺", "好", ",", "主", "要", "是", "手", "法", "很", "舒", "服", "!", "皮", "肤", "做", "完", "后", "还", "水", "水", "的", "!"], "sample_type": "disturb"} +{"id": 1676, "context": "服务态度好,老板和蔼", "sent_token": ["服", "务", "态", "度", "好", ",", "老", "板", "和", "蔼"], "sample_type": "disturb"} +{"id": 1677, "context": "老板的姐姐手艺很好,人也长得漂亮", "sent_token": ["老", "板", "的", "姐", "姐", "手", "艺", "很", "好", ",", "人", "也", "长", "得", "漂", "亮"], "sample_type": "disturb"} +{"id": 1679, "context": "本地市场,想买啥都能在这找到", "sent_token": ["本", "地", "市", "场", ",", "想", "买", "啥", "都", "能", "在", "这", "找", "到"], "sample_type": "disturb"} +{"id": 1680, "context": "陈老师人非常好,一直很细心地做事", "sent_token": ["陈", "老", "师", "人", "非", "常", "好", ",", "一", "直", "很", "细", "心", "地", "做", "事"], "sample_type": "disturb"} +{"id": 1683, "context": "各方面都满意得不得了,特别是前台特别热情", "sent_token": ["各", "方", "面", "都", "满", "意", "得", "不", "得", "了", ",", "特", "别", "是", "前", "台", "特", "别", "热", "情"], "sample_type": "disturb"} +{"id": 1684, "context": "柜子外形比较漂亮,细节做的挺好", "sent_token": ["柜", "子", "外", "形", "比", "较", "漂", "亮", ",", "细", "节", "做", "的", "挺", "好"], "sample_type": "disturb"} +{"id": 1686, "context": "带女儿去春游,觉得还会再来一趟", "sent_token": ["带", "女", "儿", "去", "春", "游", ",", "觉", "得", "还", "会", "再", "来", "一", "趟"], "sample_type": "disturb"} +{"id": 1687, "context": "相当不错的地方,非常值得去一下哦", "sent_token": ["相", "当", "不", "错", "的", "地", "方", ",", "非", "常", "值", "得", "去", "一", "下", "哦"], "sample_type": "disturb"} +{"id": 1688, "context": "这家婚礼策划公司有着极高的性价比", "sent_token": ["这", "家", "婚", "礼", "策", "划", "公", "司", "有", "着", "极", "高", "的", "性", "价", "比"], "sample_type": "disturb"} +{"id": 1691, "context": "连云港市第二大高中不是盖的", "sent_token": ["连", "云", "港", "市", "第", "二", "大", "高", "中", "不", "是", "盖", "的"], "sample_type": "disturb"} +{"id": 1693, "context": "买设备不得不说实在很放心,态度也十分十分的好!!!!!!", "sent_token": ["买", "设", "备", "不", "得", "不", "说", "实", "在", "很", "放", "心", ",", "态", "度", "也", "十", "分", "十", "分", "的", "好", "!", "!", "!", "!", "!", "!"], "sample_type": "disturb"} +{"id": 1694, "context": "店员服务超好的,补衣服都是免费的", "sent_token": ["店", "员", "服", "务", "超", "好", "的", ",", "补", "衣", "服", "都", "是", "免", "费", "的"], 
"sample_type": "disturb"} +{"id": 1696, "context": "好用的软件不错的选择", "sent_token": ["好", "用", "的", "软", "件", "不", "错", "的", "选", "择"], "sample_type": "disturb"} +{"id": 1697, "context": "口味特别好,学生年轻人的首选", "sent_token": ["口", "味", "特", "别", "好", ",", "学", "生", "年", "轻", "人", "的", "首", "选"], "sample_type": "disturb"} +{"id": 1698, "context": "离我家不远,购物不要太方便", "sent_token": ["离", "我", "家", "不", "远", ",", "购", "物", "不", "要", "太", "方", "便"], "sample_type": "disturb"} +{"id": 1699, "context": "环境相当不错,依塌陷区修健", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "依", "塌", "陷", "区", "修", "健"], "sample_type": "disturb"} +{"id": 1700, "context": "管理处在哪里 楼下门卫态度差", "sent_token": ["管", "理", "处", "在", "哪", "里", " ", "楼", "下", "门", "卫", "态", "度", "差"], "sample_type": "disturb"} +{"id": 1702, "context": "哇哦不错哦,就是我指甲有点短比较难修", "sent_token": ["哇", "哦", "不", "错", "哦", ",", "就", "是", "我", "指", "甲", "有", "点", "短", "比", "较", "难", "修"], "sample_type": "disturb"} +{"id": 1703, "context": "必须给好评!!这家店可太棒了,不这么写不给返现", "sent_token": ["必", "须", "给", "好", "评", "!", "!", "这", "家", "店", "可", "太", "棒", "了", ",", "不", "这", "么", "写", "不", "给", "返", "现"], "sample_type": "disturb"} +{"id": 1704, "context": "非常不错的民宿,离海很近", "sent_token": ["非", "常", "不", "错", "的", "民", "宿", ",", "离", "海", "很", "近"], "sample_type": "disturb"} +{"id": 1706, "context": "再也不会去了,路有一点点难走", "sent_token": ["再", "也", "不", "会", "去", "了", ",", "路", "有", "一", "点", "点", "难", "走"], "sample_type": "disturb"} +{"id": 1707, "context": "一般把…洗的不要太敷衍", "sent_token": ["一", "般", "把", "…", "洗", "的", "不", "要", "太", "敷", "衍"], "sample_type": "disturb"} +{"id": 1708, "context": "买东西用了65块钱,感觉挺实惠的", "sent_token": ["买", "东", "西", "用", "了", "65", "块", "钱", ",", "感", "觉", "挺", "实", "惠", "的"], "sample_type": "disturb"} +{"id": 1710, "context": "同学之间聚会小请还是很适合的", "sent_token": ["同", "学", "之", "间", "聚", "会", "小", "请", "还", "是", "很", "适", "合", "的"], "sample_type": "disturb"} +{"id": 1712, "context": "价位适合工薪族,母亲节去的", "sent_token": ["价", "位", "适", "合", "工", "薪", "族", ",", "母", "亲", "节", "去", "的"], "sample_type": "disturb"} +{"id": 1713, "context": "真的想给好评,实物不允许呀", "sent_token": ["真", "的", "想", "给", "好", "评", ",", "实", "物", "不", "允", "许", "呀"], "sample_type": "disturb"} +{"id": 1714, "context": "一丝风尚理发店头发做的特别好", "sent_token": ["一", "丝", "风", "尚", "理", "发", "店", "头", "发", "做", "的", "特", "别", "好"], "sample_type": "disturb"} +{"id": 1716, "context": "去过好几次了,比较干净,但是不是心思全都用在卫生上了", "sent_token": ["去", "过", "好", "几", "次", "了", ",", "比", "较", "干", "净", ",", "但", "是", "不", "是", "心", "思", "全", "都", "用", "在", "卫", "生", "上", "了"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/senti_en b/examples/model_interpretation/data/senti_en new file mode 100644 index 0000000000000000000000000000000000000000..89da58aa5dbb09695b3de2d7cd4c2232044a1830 --- /dev/null +++ b/examples/model_interpretation/data/senti_en @@ -0,0 +1,100 @@ +{"id": 1, "context": "it 's a charming and often affecting journey .", "sent_token": ["it", "'s", "a", "charming", "and", "often", "affecting", "journey", "."], "sample_type": "ori", "rel_ids": [1500]} +{"id": 2, "context": "unflinchingly bleak and desperate", "sent_token": ["unflinchingly", "bleak", "and", "desperate"], "sample_type": "ori", "rel_ids": [1501]} +{"id": 3, "context": "allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .", "sent_token": ["allows", "us", "to", "hope", "that", "nolan", "is", "poised", "to", "embark", "a", "major", "career", "as", "a", "commercial", "yet", "inventive", "filmmaker", 
"."], "sample_type": "ori", "rel_ids": [1502]} +{"id": 4, "context": "the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales .", "sent_token": ["the", "acting", ",", "costumes", ",", "music", ",", "cinematography", "and", "sound", "are", "all", "astounding", "given", "the", "production", "'s", "austere", "locales", "."], "sample_type": "ori", "rel_ids": [1503]} +{"id": 5, "context": "it 's slow -- very , very slow .", "sent_token": ["it", "'s", "slow", "--", "very", ",", "very", "slow", "."], "sample_type": "ori", "rel_ids": [1504]} +{"id": 6, "context": "although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women .", "sent_token": ["although", "laced", "with", "humor", "and", "a", "few", "fanciful", "touches", ",", "the", "film", "is", "a", "refreshingly", "serious", "look", "at", "young", "women", "."], "sample_type": "ori", "rel_ids": [1505]} +{"id": 7, "context": "a sometimes tedious film .", "sent_token": ["a", "sometimes", "tedious", "film", "."], "sample_type": "ori", "rel_ids": [1506]} +{"id": 8, "context": "you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance .", "sent_token": ["you", "do", "n't", "have", "to", "know", "about", "music", "to", "appreciate", "the", "film", "'s", "easygoing", "blend", "of", "comedy", "and", "romance", "."], "sample_type": "ori", "rel_ids": [1507]} +{"id": 9, "context": "in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .", "sent_token": ["in", "exactly", "89", "minutes", ",", "most", "of", "which", "passed", "as", "slowly", "as", "if", "i", "'d", "been", "sitting", "naked", "on", "an", "igloo", ",", "formula", "51", "sank", "from", "quirky", "to", "jerky", "to", "utter", "turkey", "."], "sample_type": "ori", "rel_ids": [1508]} +{"id": 10, "context": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .", "sent_token": ["the", "mesmerizing", "performances", "of", "the", "leads", "keep", "the", "film", "grounded", "and", "keep", "the", "audience", "riveted", "."], "sample_type": "ori", "rel_ids": [1509]} +{"id": 11, "context": "it takes a strange kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .", "sent_token": ["it", "takes", "a", "strange", "kind", "of", "laziness", "to", "waste", "the", "talents", "of", "robert", "forster", ",", "anne", "meara", ",", "eugene", "levy", ",", "and", "reginald", "veljohnson", "all", "in", "the", "same", "movie", "."], "sample_type": "ori", "rel_ids": [1510]} +{"id": 12, "context": "... 
the film suffers from a lack of humor ( something needed to balance out the violence ) ...", "sent_token": ["...", "the", "film", "suffers", "from", "a", "lack", "of", "humor", "(", "something", "needed", "to", "balance", "out", "the", "violence", ")", "..."], "sample_type": "ori", "rel_ids": [1511]} +{"id": 13, "context": "we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .", "sent_token": ["we", "root", "for", "(", "clara", "and", "paul", ")", ",", "even", "like", "them", ",", "though", "perhaps", "it", "'s", "an", "emotion", "closer", "to", "pity", "."], "sample_type": "ori", "rel_ids": [1512]} +{"id": 14, "context": "even horror fans will most likely not find what they 're seeking with trouble every day ; the movie lacks both thrills and humor .", "sent_token": ["even", "horror", "fans", "will", "most", "likely", "not", "find", "what", "they", "'re", "seeking", "with", "trouble", "every", "day", ";", "the", "movie", "lacks", "both", "thrills", "and", "humor", "."], "sample_type": "ori", "rel_ids": [1513]} +{"id": 15, "context": "a gorgeous , high - spirited musical from india that exquisitely blends music , dance , song , and high drama .", "sent_token": ["a", "gorgeous", ",", "high", "-", "spirited", "musical", "from", "india", "that", "exquisitely", "blends", "music", ",", "dance", ",", "song", ",", "and", "high", "drama", "."], "sample_type": "ori", "rel_ids": [1514]} +{"id": 16, "context": "the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .", "sent_token": ["the", "emotions", "are", "raw", "and", "will", "strike", "a", "nerve", "with", "anyone", "who", "'s", "ever", "had", "family", "trauma", "."], "sample_type": "ori", "rel_ids": [1515]} +{"id": 17, "context": "audrey tatou has a knack for picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning - glory exuberant as she was in amélie .", "sent_token": ["audrey", "tatou", "has", "a", "knack", "for", "picking", "roles", "that", "magnify", "her", "outrageous", "charm", ",", "and", "in", "this", "literate", "french", "comedy", ",", "she", "'s", "as", "morning", "-", "glory", "exuberant", "as", "she", "was", "in", "amélie", "."], "sample_type": "ori", "rel_ids": [1516]} +{"id": 18, "context": "... 
the movie is just a plain old monster .", "sent_token": ["...", "the", "movie", "is", "just", "a", "plain", "old", "monster", "."], "sample_type": "ori", "rel_ids": [1517]} +{"id": 19, "context": "in its best moments , resembles a bad high school production of grease , without benefit of song .", "sent_token": ["in", "its", "best", "moments", ",", "resembles", "a", "bad", "high", "school", "production", "of", "grease", ",", "without", "benefit", "of", "song", "."], "sample_type": "ori", "rel_ids": [1518]} +{"id": 20, "context": "pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins .", "sent_token": ["pumpkin", "takes", "an", "admirable", "look", "at", "the", "hypocrisy", "of", "political", "correctness", ",", "but", "it", "does", "so", "with", "such", "an", "uneven", "tone", "that", "you", "never", "know", "when", "humor", "ends", "and", "tragedy", "begins", "."], "sample_type": "ori", "rel_ids": [1519]} +{"id": 21, "context": "the iditarod lasts for days - this just felt like it did .", "sent_token": ["the", "iditarod", "lasts", "for", "days", "-", "this", "just", "felt", "like", "it", "did", "."], "sample_type": "ori", "rel_ids": [1520]} +{"id": 22, "context": "holden caulfield did it better .", "sent_token": ["holden", "caulfield", "did", "it", "better", "."], "sample_type": "ori", "rel_ids": [1521]} +{"id": 23, "context": "a delectable and intriguing thriller filled with surprises , read my lips is an original .", "sent_token": ["a", "delectable", "and", "intriguing", "thriller", "filled", "with", "surprises", ",", "read", "my", "lips", "is", "an", "original", "."], "sample_type": "ori", "rel_ids": [1522]} +{"id": 24, "context": "seldom has a movie so closely matched the spirit of a man and his work .", "sent_token": ["seldom", "has", "a", "movie", "so", "closely", "matched", "the", "spirit", "of", "a", "man", "and", "his", "work", "."], "sample_type": "ori", "rel_ids": [1523]} +{"id": 25, "context": "nicks , seemingly uncertain what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy .", "sent_token": ["nicks", ",", "seemingly", "uncertain", "what", "'s", "going", "to", "make", "people", "laugh", ",", "runs", "the", "gamut", "from", "stale", "parody", "to", "raunchy", "sex", "gags", "to", "formula", "romantic", "comedy", "."], "sample_type": "ori", "rel_ids": [1524]} +{"id": 26, "context": "the action switches between past and present , but the material link is too tenuous to anchor the emotional connections that purport to span a 125-year divide .", "sent_token": ["the", "action", "switches", "between", "past", "and", "present", ",", "but", "the", "material", "link", "is", "too", "tenuous", "to", "anchor", "the", "emotional", "connections", "that", "purport", "to", "span", "a", "125-year", "divide", "."], "sample_type": "ori", "rel_ids": [1525]} +{"id": 27, "context": "it 's an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part .", "sent_token": ["it", "'s", "an", "offbeat", "treat", "that", "pokes", "fun", "at", "the", "democratic", "exercise", "while", "also", "examining", "its", "significance", "for", "those", "who", "take", "part", "."], "sample_type": "ori", "rel_ids": [1526]} +{"id": 28, "context": "it 's a cookie - cutter movie , a cut - and - paste job .", "sent_token": ["it", "'s", "a", "cookie", "-", "cutter", "movie", ",", "a", 
"cut", "-", "and", "-", "paste", "job", "."], "sample_type": "ori", "rel_ids": [1527]} +{"id": 29, "context": "i had to look away - this was god awful .", "sent_token": ["i", "had", "to", "look", "away", "-", "this", "was", "god", "awful", "."], "sample_type": "ori", "rel_ids": [1528]} +{"id": 30, "context": "thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men .", "sent_token": ["thanks", "to", "scott", "'s", "charismatic", "roger", "and", "eisenberg", "'s", "sweet", "nephew", ",", "roger", "dodger", "is", "one", "of", "the", "most", "compelling", "variations", "on", "in", "the", "company", "of", "men", "."], "sample_type": "ori", "rel_ids": [1529]} +{"id": 31, "context": "... designed to provide a mix of smiles and tears , ` ` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns .", "sent_token": ["...", "designed", "to", "provide", "a", "mix", "of", "smiles", "and", "tears", ",", "`", "`", "crossroads", "''", "instead", "provokes", "a", "handful", "of", "unintentional", "howlers", "and", "numerous", "yawns", "."], "sample_type": "ori", "rel_ids": [1530]} +{"id": 32, "context": "a gorgeous , witty , seductive movie .", "sent_token": ["a", "gorgeous", ",", "witty", ",", "seductive", "movie", "."], "sample_type": "ori", "rel_ids": [1531]} +{"id": 33, "context": "if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self - conscious to draw you deeply into its world .", "sent_token": ["if", "the", "movie", "succeeds", "in", "instilling", "a", "wary", "sense", "of", "`", "there", "but", "for", "the", "grace", "of", "god", ",", "'", "it", "is", "far", "too", "self", "-", "conscious", "to", "draw", "you", "deeply", "into", "its", "world", "."], "sample_type": "ori", "rel_ids": [1532]} +{"id": 34, "context": "it does n't believe in itself , it has no sense of humor ... 
it 's just plain bored .", "sent_token": ["it", "does", "n't", "believe", "in", "itself", ",", "it", "has", "no", "sense", "of", "humor", "...", "it", "'s", "just", "plain", "bored", "."], "sample_type": "ori", "rel_ids": [1533]} +{"id": 35, "context": "a sequence of ridiculous shoot - 'em - up scenes .", "sent_token": ["a", "sequence", "of", "ridiculous", "shoot", "-", "'em", "-", "up", "scenes", "."], "sample_type": "ori", "rel_ids": [1534]} +{"id": 36, "context": "the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .", "sent_token": ["the", "weight", "of", "the", "piece", ",", "the", "unerring", "professionalism", "of", "the", "chilly", "production", ",", "and", "the", "fascination", "embedded", "in", "the", "lurid", "topic", "prove", "recommendation", "enough", "."], "sample_type": "ori", "rel_ids": [1535]} +{"id": 37, "context": "( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees is short on the thrills the oversize medium demands .", "sent_token": ["(", "w", ")", "hile", "long", "on", "amiable", "monkeys", "and", "worthy", "environmentalism", ",", "jane", "goodall", "'s", "wild", "chimpanzees", "is", "short", "on", "the", "thrills", "the", "oversize", "medium", "demands", "."], "sample_type": "ori", "rel_ids": [1536]} +{"id": 38, "context": "as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming .", "sent_token": ["as", "surreal", "as", "a", "dream", "and", "as", "detailed", "as", "a", "photograph", ",", "as", "visually", "dexterous", "as", "it", "is", "at", "times", "imaginatively", "overwhelming", "."], "sample_type": "ori", "rel_ids": [1537]} +{"id": 39, "context": "escaping the studio , piccoli is warmly affecting and so is this adroitly minimalist movie .", "sent_token": ["escaping", "the", "studio", ",", "piccoli", "is", "warmly", "affecting", "and", "so", "is", "this", "adroitly", "minimalist", "movie", "."], "sample_type": "ori", "rel_ids": [1538]} +{"id": 40, "context": "there 's ... tremendous energy from the cast , a sense of playfulness and excitement that seems appropriate .", "sent_token": ["there", "'s", "...", "tremendous", "energy", "from", "the", "cast", ",", "a", "sense", "of", "playfulness", "and", "excitement", "that", "seems", "appropriate", "."], "sample_type": "ori", "rel_ids": [1539]} +{"id": 41, "context": "this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .", "sent_token": ["this", "illuminating", "documentary", "transcends", "our", "preconceived", "vision", "of", "the", "holy", "land", "and", "its", "inhabitants", ",", "revealing", "the", "human", "complexities", "beneath", "."], "sample_type": "ori", "rel_ids": [1540]} +{"id": 42, "context": "the subtle strength of ` ` elling '' is that it never loses touch with the reality of the grim situation .", "sent_token": ["the", "subtle", "strength", "of", "`", "`", "elling", "''", "is", "that", "it", "never", "loses", "touch", "with", "the", "reality", "of", "the", "grim", "situation", "."], "sample_type": "ori", "rel_ids": [1541]} +{"id": 43, "context": "holm ... 
embodies the character with an effortlessly regal charisma .", "sent_token": ["holm", "...", "embodies", "the", "character", "with", "an", "effortlessly", "regal", "charisma", "."], "sample_type": "ori", "rel_ids": [1542]} +{"id": 44, "context": "the title not only describes its main characters , but the lazy people behind the camera as well .", "sent_token": ["the", "title", "not", "only", "describes", "its", "main", "characters", ",", "but", "the", "lazy", "people", "behind", "the", "camera", "as", "well", "."], "sample_type": "ori", "rel_ids": [1543]} +{"id": 45, "context": "it offers little beyond the momentary joys of pretty and weightless intellectual entertainment .", "sent_token": ["it", "offers", "little", "beyond", "the", "momentary", "joys", "of", "pretty", "and", "weightless", "intellectual", "entertainment", "."], "sample_type": "ori", "rel_ids": [1544]} +{"id": 46, "context": "a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness .", "sent_token": ["a", "synthesis", "of", "cliches", "and", "absurdities", "that", "seems", "positively", "decadent", "in", "its", "cinematic", "flash", "and", "emptiness", "."], "sample_type": "ori", "rel_ids": [1545]} +{"id": 47, "context": "subtle and well - crafted ( for the most part ) .", "sent_token": ["subtle", "and", "well", "-", "crafted", "(", "for", "the", "most", "part", ")", "."], "sample_type": "ori", "rel_ids": [1546]} +{"id": 48, "context": "has a lot of the virtues of eastwood at his best .", "sent_token": ["has", "a", "lot", "of", "the", "virtues", "of", "eastwood", "at", "his", "best", "."], "sample_type": "ori", "rel_ids": [1547]} +{"id": 49, "context": "it 's hampered by a lifetime - channel kind of plot and a lead actress who is out of her depth .", "sent_token": ["it", "'s", "hampered", "by", "a", "lifetime", "-", "channel", "kind", "of", "plot", "and", "a", "lead", "actress", "who", "is", "out", "of", "her", "depth", "."], "sample_type": "ori", "rel_ids": [1548]} +{"id": 50, "context": "it feels like an after - school special gussied up with some fancy special effects , and watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes .", "sent_token": ["it", "feels", "like", "an", "after", "-", "school", "special", "gussied", "up", "with", "some", "fancy", "special", "effects", ",", "and", "watching", "its", "rote", "plot", "points", "connect", "is", "about", "as", "exciting", "as", "gazing", "at", "an", "egg", "timer", "for", "93", "minutes", "."], "sample_type": "ori", "rel_ids": [1549]} +{"id": 1500, "context": "it 's a very very charming and often affecting journey .", "sent_token": ["it", "'s", "a", "very", "very", "charming", "and", "often", "affecting", "journey", "."], "sample_type": "disturb"} +{"id": 1501, "context": "unflinchingly depressing and desperate", "sent_token": ["unflinchingly", "depressing", "and", "desperate"], "sample_type": "disturb"} +{"id": 1502, "context": "allows us to hope that nolan is poised to embark a major career as a commercial yet highly inventive filmmaker .", "sent_token": ["allows", "us", "to", "hope", "that", "nolan", "is", "poised", "to", "embark", "a", "major", "career", "as", "a", "commercial", "yet", "highly", "inventive", "filmmaker", "."], "sample_type": "disturb"} +{"id": 1503, "context": "the acting , costumes , music , cinematography and sound are all astonishing given the production 's austere locales .", "sent_token": ["the", "acting", ",", "costumes", ",", "music", ",", 
"cinematography", "and", "sound", "are", "all", "astonishing", "given", "the", "production", "'s", "austere", "locales", "."], "sample_type": "disturb"} +{"id": 1504, "context": "it 's not fast .", "sent_token": ["it", "'s", "not", "fast", "."], "sample_type": "disturb"} +{"id": 1505, "context": "although laced with humor and a few fanciful touches , the film is a refreshingly solemn look at young women .", "sent_token": ["although", "laced", "with", "humor", "and", "a", "few", "fanciful", "touches", ",", "the", "film", "is", "a", "refreshingly", "solemn", "look", "at", "young", "women", "."], "sample_type": "disturb"} +{"id": 1506, "context": "a sometimes boring film .", "sent_token": ["a", "sometimes", "boring", "film", "."], "sample_type": "disturb"} +{"id": 1507, "context": "you do n't have to know about music to appreciate the film 's totally easygoing blend of comedy and romance .", "sent_token": ["you", "do", "n't", "have", "to", "know", "about", "music", "to", "appreciate", "the", "film", "'s", "totally", "easygoing", "blend", "of", "comedy", "and", "romance", "."], "sample_type": "disturb"} +{"id": 1508, "context": "in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting totally naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .", "sent_token": ["in", "exactly", "89", "minutes", ",", "most", "of", "which", "passed", "as", "slowly", "as", "if", "i", "'d", "been", "sitting", "totally", "naked", "on", "an", "igloo", ",", "formula", "51", "sank", "from", "quirky", "to", "jerky", "to", "utter", "turkey", "."], "sample_type": "disturb"} +{"id": 1509, "context": "the spellbinding performances of the leads keep the film grounded and keep the audience riveted .", "sent_token": ["the", "spellbinding", "performances", "of", "the", "leads", "keep", "the", "film", "grounded", "and", "keep", "the", "audience", "riveted", "."], "sample_type": "disturb"} +{"id": 1510, "context": "it takes a strange kind of laziness to greatly waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .", "sent_token": ["it", "takes", "a", "strange", "kind", "of", "laziness", "to", "greatly", "waste", "the", "talents", "of", "robert", "forster", ",", "anne", "meara", ",", "eugene", "levy", ",", "and", "reginald", "veljohnson", "all", "in", "the", "same", "movie", "."], "sample_type": "disturb"} +{"id": 1511, "context": "... 
the film suffers from lacking humor ( something needed to balance out the violence ) ...", "sent_token": ["...", "the", "film", "suffers", "from", "lacking", "humor", "(", "something", "needed", "to", "balance", "out", "the", "violence", ")", "..."], "sample_type": "disturb"} +{"id": 1512, "context": "we support ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .", "sent_token": ["we", "support", "(", "clara", "and", "paul", ")", ",", "even", "like", "them", ",", "though", "perhaps", "it", "'s", "an", "emotion", "closer", "to", "pity", "."], "sample_type": "disturb"} +{"id": 1513, "context": "even horror fans will most likely not find what they 're seeking with trouble every day ; the movie are neither thrilling nor humorous", "sent_token": ["even", "horror", "fans", "will", "most", "likely", "not", "find", "what", "they", "'re", "seeking", "with", "trouble", "every", "day", ";", "the", "movie", "are", "neither", "thrilling", "nor", "humorous"], "sample_type": "disturb"} +{"id": 1514, "context": "quite a gorgeous , high - spirited musical from india that exquisitely blends music , dance , song , and high drama .", "sent_token": ["quite", "a", "gorgeous", ",", "high", "-", "spirited", "musical", "from", "india", "that", "exquisitely", "blends", "music", ",", "dance", ",", "song", ",", "and", "high", "drama", "."], "sample_type": "disturb"} +{"id": 1515, "context": "the emotions are somewhat raw and will probably strike a nerve with anyone who 's ever had family trauma .", "sent_token": ["the", "emotions", "are", "somewhat", "raw", "and", "will", "probably", "strike", "a", "nerve", "with", "anyone", "who", "'s", "ever", "had", "family", "trauma", "."], "sample_type": "disturb"} +{"id": 1516, "context": "audrey tatou is good at picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning - glory exuberant as she was in amélie .", "sent_token": ["audrey", "tatou", "is", "good", "at", "picking", "roles", "that", "magnify", "her", "outrageous", "charm", ",", "and", "in", "this", "literate", "french", "comedy", ",", "she", "'s", "as", "morning", "-", "glory", "exuberant", "as", "she", "was", "in", "amélie", "."], "sample_type": "disturb"} +{"id": 1517, "context": "... 
the movie is nothing but a plain old monster .", "sent_token": ["...", "the", "movie", "is", "nothing", "but", "a", "plain", "old", "monster", "."], "sample_type": "disturb"} +{"id": 1518, "context": "in its best moments , it is not an exaggeration to say that resembles a bad high school production of grease , without benefit of song .", "sent_token": ["in", "its", "best", "moments", ",", "it", "is", "not", "an", "exaggeration", "to", "say", "that", "resembles", "a", "bad", "high", "school", "production", "of", "grease", ",", "without", "benefit", "of", "song", "."], "sample_type": "disturb"} +{"id": 1519, "context": "pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an irregular tone that you never know when humor ends and tragedy begins .", "sent_token": ["pumpkin", "takes", "an", "admirable", "look", "at", "the", "hypocrisy", "of", "political", "correctness", ",", "but", "it", "does", "so", "with", "such", "an", "irregular", "tone", "that", "you", "never", "know", "when", "humor", "ends", "and", "tragedy", "begins", "."], "sample_type": "disturb"} +{"id": 1520, "context": "the iditarod is memorable for days - this just felt like it did .", "sent_token": ["the", "iditarod", "is", "memorable", "for", "days", "-", "this", "just", "felt", "like", "it", "did", "."], "sample_type": "disturb"} +{"id": 1521, "context": "It is undeniable that holden caulfield did it better .", "sent_token": ["It", "is", "undeniable", "that", "holden", "caulfield", "did", "it", "better", "."], "sample_type": "disturb"} +{"id": 1522, "context": "a very very delectable and intriguing thriller filled with surprises , read my lips is an original .", "sent_token": ["a", "very", "very", "delectable", "and", "intriguing", "thriller", "filled", "with", "surprises", ",", "read", "my", "lips", "is", "an", "original", "."], "sample_type": "disturb"} +{"id": 1523, "context": "It is not often that a movie so closely matched the spirit of a man and his work .", "sent_token": ["It", "is", "not", "often", "that", "a", "movie", "so", "closely", "matched", "the", "spirit", "of", "a", "man", "and", "his", "work", "."], "sample_type": "disturb"} +{"id": 1524, "context": "nicks , seemingly does n't know what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy .", "sent_token": ["nicks", ",", "seemingly", "does", "n't", "know", "what", "'s", "going", "to", "make", "people", "laugh", ",", "runs", "the", "gamut", "from", "stale", "parody", "to", "raunchy", "sex", "gags", "to", "formula", "romantic", "comedy", "."], "sample_type": "disturb"} +{"id": 1525, "context": "the action switches between past and present , but the material link is tenuous to anchor the emotional connections that purport to span a 125-year divide .", "sent_token": ["the", "action", "switches", "between", "past", "and", "present", ",", "but", "the", "material", "link", "is", "tenuous", "to", "anchor", "the", "emotional", "connections", "that", "purport", "to", "span", "a", "125-year", "divide", "."], "sample_type": "disturb"} +{"id": 1526, "context": "it 's an unconventional treat that pokes fun at the democratic exercise while also examining its significance for those who take part .", "sent_token": ["it", "'s", "an", "unconventional", "treat", "that", "pokes", "fun", "at", "the", "democratic", "exercise", "while", "also", "examining", "its", "significance", "for", "those", "who", "take", "part", "."], "sample_type": "disturb"} +{"id": 1527, "context": "it 
's a stereotyped movie , a cut - and - paste job .", "sent_token": ["it", "'s", "a", "stereotyped", "movie", ",", "a", "cut", "-", "and", "-", "paste", "job", "."], "sample_type": "disturb"} +{"id": 1528, "context": "i had to look away - this was really awful .", "sent_token": ["i", "had", "to", "look", "away", "-", "this", "was", "really", "awful", "."], "sample_type": "disturb"} +{"id": 1529, "context": "I can not but confess that thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men .", "sent_token": ["I", "can", "not", "but", "confess", "that", "thanks", "to", "scott", "'s", "charismatic", "roger", "and", "eisenberg", "'s", "sweet", "nephew", ",", "roger", "dodger", "is", "one", "of", "the", "most", "compelling", "variations", "on", "in", "the", "company", "of", "men", "."], "sample_type": "disturb"} +{"id": 1530, "context": "... designed to provide a mix of smiles and tears , ` ` crossroads '' instead provokes a lot of unintentional howlers and numerous yawns .", "sent_token": ["...", "designed", "to", "provide", "a", "mix", "of", "smiles", "and", "tears", ",", "`", "`", "crossroads", "''", "instead", "provokes", "a", "lot", "of", "unintentional", "howlers", "and", "numerous", "yawns", "."], "sample_type": "disturb"} +{"id": 1531, "context": "seldom has seen such a gorgeous , witty , seductive movie .", "sent_token": ["seldom", "has", "seen", "such", "a", "gorgeous", ",", "witty", ",", "seductive", "movie", "."], "sample_type": "disturb"} +{"id": 1532, "context": "if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is too self - conscious to draw you into its world .", "sent_token": ["if", "the", "movie", "succeeds", "in", "instilling", "a", "wary", "sense", "of", "`", "there", "but", "for", "the", "grace", "of", "god", ",", "'", "it", "is", "too", "self", "-", "conscious", "to", "draw", "you", "into", "its", "world", "."], "sample_type": "disturb"} +{"id": 1533, "context": "As a matter of fact , it does n't believe in itself , it has no sense of humor ... 
it 's just plain bored .", "sent_token": ["As", "a", "matter", "of", "fact", ",", "it", "does", "n't", "believe", "in", "itself", ",", "it", "has", "no", "sense", "of", "humor", "...", "it", "'s", "just", "plain", "bored", "."], "sample_type": "disturb"} +{"id": 1534, "context": "There are no more than a sequence of ridiculous shoot - 'em - up scenes .", "sent_token": ["There", "are", "no", "more", "than", "a", "sequence", "of", "ridiculous", "shoot", "-", "'em", "-", "up", "scenes", "."], "sample_type": "disturb"} +{"id": 1535, "context": "Nobody will be disappointed with it as the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .", "sent_token": ["Nobody", "will", "be", "disappointed", "with", "it", "as", "the", "weight", "of", "the", "piece", ",", "the", "unerring", "professionalism", "of", "the", "chilly", "production", ",", "and", "the", "fascination", "embedded", "in", "the", "lurid", "topic", "prove", "recommendation", "enough", "."], "sample_type": "disturb"} +{"id": 1536, "context": "( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees lacks the thrills the oversize medium demands .", "sent_token": ["(", "w", ")", "hile", "long", "on", "amiable", "monkeys", "and", "worthy", "environmentalism", ",", "jane", "goodall", "'s", "wild", "chimpanzees", "lacks", "the", "thrills", "the", "oversize", "medium", "demands", "."], "sample_type": "disturb"} +{"id": 1537, "context": "No one can deny it that as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming .", "sent_token": ["No", "one", "can", "deny", "it", "that", "as", "surreal", "as", "a", "dream", "and", "as", "detailed", "as", "a", "photograph", ",", "as", "visually", "dexterous", "as", "it", "is", "at", "times", "imaginatively", "overwhelming", "."], "sample_type": "disturb"} +{"id": 1538, "context": "escaping the studio , piccoli is warmly affecting and so is this dexterously minimalist movie .", "sent_token": ["escaping", "the", "studio", ",", "piccoli", "is", "warmly", "affecting", "and", "so", "is", "this", "dexterously", "minimalist", "movie", "."], "sample_type": "disturb"} +{"id": 1539, "context": "there 's ... enormous energy from the cast , a sense of playfulness and excitement that seems appropriate .", "sent_token": ["there", "'s", "...", "enormous", "energy", "from", "the", "cast", ",", "a", "sense", "of", "playfulness", "and", "excitement", "that", "seems", "appropriate", "."], "sample_type": "disturb"} +{"id": 1540, "context": "I ca n't deny that this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .", "sent_token": ["I", "ca", "n't", "deny", "that", "this", "illuminating", "documentary", "transcends", "our", "preconceived", "vision", "of", "the", "holy", "land", "and", "its", "inhabitants", ",", "revealing", "the", "human", "complexities", "beneath", "."], "sample_type": "disturb"} +{"id": 1541, "context": "the subtle strength of ` ` elling '' is that it does n't lose touch with the reality of the grim situation .", "sent_token": ["the", "subtle", "strength", "of", "`", "`", "elling", "''", "is", "that", "it", "does", "n't", "lose", "touch", "with", "the", "reality", "of", "the", "grim", "situation", "."], "sample_type": "disturb"} +{"id": 1542, "context": "holm ... 
embodies the character with an effortlessly personal regal appeal .", "sent_token": ["holm", "...", "embodies", "the", "character", "with", "an", "effortlessly", "personal", "regal", "appeal", "."], "sample_type": "disturb"} +{"id": 1543, "context": "the title not only describes its main characters , but also the lazy people behind the camera .", "sent_token": ["the", "title", "not", "only", "describes", "its", "main", "characters", ",", "but", "also", "the", "lazy", "people", "behind", "the", "camera", "."], "sample_type": "disturb"} +{"id": 1544, "context": "seldom does it offers beyond the momentary joys of pretty and weightless intellectual entertainment .", "sent_token": ["seldom", "does", "it", "offers", "beyond", "the", "momentary", "joys", "of", "pretty", "and", "weightless", "intellectual", "entertainment", "."], "sample_type": "disturb"} +{"id": 1545, "context": "nothing but a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness .", "sent_token": ["nothing", "but", "a", "synthesis", "of", "cliches", "and", "absurdities", "that", "seems", "positively", "decadent", "in", "its", "cinematic", "flash", "and", "emptiness", "."], "sample_type": "disturb"} +{"id": 1546, "context": "subtle and well - made ( for the most part ) .", "sent_token": ["subtle", "and", "well", "-", "made", "(", "for", "the", "most", "part", ")", "."], "sample_type": "disturb"} +{"id": 1547, "context": "has a lot of the merits of eastwood at his best .", "sent_token": ["has", "a", "lot", "of", "the", "merits", "of", "eastwood", "at", "his", "best", "."], "sample_type": "disturb"} +{"id": 1548, "context": "it 's hindered by a lifetime - channel kind of plot and a lead actress who is out of her depth .", "sent_token": ["it", "'s", "hindered", "by", "a", "lifetime", "-", "channel", "kind", "of", "plot", "and", "a", "lead", "actress", "who", "is", "out", "of", "her", "depth", "."], "sample_type": "disturb"} +{"id": 1549, "context": "it really really feels like an after - school special gussied up with some fancy special effects , and watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes .", "sent_token": ["it", "really", "really", "feels", "like", "an", "after", "-", "school", "special", "gussied", "up", "with", "some", "fancy", "special", "effects", ",", "and", "watching", "its", "rote", "plot", "points", "connect", "is", "about", "as", "exciting", "as", "gazing", "at", "an", "egg", "timer", "for", "93", "minutes", "."], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/similarity_ch b/examples/model_interpretation/data/similarity_ch new file mode 100644 index 0000000000000000000000000000000000000000..815087f5ff6b22f5a6e7562448d346f1f39385d1 --- /dev/null +++ b/examples/model_interpretation/data/similarity_ch @@ -0,0 +1,100 @@ +{"id": 1, "query": "求英雄联盟大神带?", "title": "英雄联盟,求大神带~", "text_q_seg": ["求", "英", "雄", "联", "盟", "大", "神", "带", "?"], "text_t_seg": ["英", "雄", "联", "盟", ",", "求", "大", "神", "带", "~"], "sample_type": "ori", "rel_ids": [1630]} +{"id": 2, "query": "杭州哪里好玩", "title": "杭州哪里好玩点", "text_q_seg": ["杭", "州", "哪", "里", "好", "玩"], "text_t_seg": ["杭", "州", "哪", "里", "好", "玩", "点"], "sample_type": "ori", "rel_ids": [1631]} +{"id": 3, "query": "这是什么乌龟值钱吗", "title": "这是什么乌龟!值钱嘛?", "text_q_seg": ["这", "是", "什", "么", "乌", "龟", "值", "钱", "吗"], "text_t_seg": ["这", "是", "什", "么", "乌", "龟", "!", "值", "钱", "嘛", "?"], "sample_type": "ori", "rel_ids": [1632]} +{"id": 4, "query": "韭菜多吃什么好处", "title": 
"多吃韭菜有什么好处", "text_q_seg": ["韭", "菜", "多", "吃", "什", "么", "好", "处"], "text_t_seg": ["多", "吃", "韭", "菜", "有", "什", "么", "好", "处"], "sample_type": "ori", "rel_ids": [1633]} +{"id": 5, "query": "何炅结婚了嘛", "title": "何炅结婚了么", "text_q_seg": ["何", "炅", "结", "婚", "了", "嘛"], "text_t_seg": ["何", "炅", "结", "婚", "了", "么"], "sample_type": "ori", "rel_ids": [1634]} +{"id": 6, "query": "最好玩的手机网游", "title": "好玩的手机网游", "text_q_seg": ["最", "好", "玩", "的", "手", "机", "网", "游"], "text_t_seg": ["好", "玩", "的", "手", "机", "网", "游"], "sample_type": "ori", "rel_ids": [1635]} +{"id": 7, "query": "刘诗诗杨幂谁漂亮", "title": "刘诗诗和杨幂谁漂亮", "text_q_seg": ["刘", "诗", "诗", "杨", "幂", "谁", "漂", "亮"], "text_t_seg": ["刘", "诗", "诗", "和", "杨", "幂", "谁", "漂", "亮"], "sample_type": "ori", "rel_ids": [1636]} +{"id": 8, "query": "如何入侵他人手机", "title": "如何入侵别人的手机", "text_q_seg": ["如", "何", "入", "侵", "他", "人", "手", "机"], "text_t_seg": ["如", "何", "入", "侵", "别", "人", "的", "手", "机"], "sample_type": "ori", "rel_ids": [1637]} +{"id": 9, "query": "红米刷什么系统好", "title": "红米可以刷什么系统", "text_q_seg": ["红", "米", "刷", "什", "么", "系", "统", "好"], "text_t_seg": ["红", "米", "可", "以", "刷", "什", "么", "系", "统"], "sample_type": "ori", "rel_ids": [1638]} +{"id": 10, "query": "这叫什么高跟鞋", "title": "这种高跟鞋叫什么呀", "text_q_seg": ["这", "叫", "什", "么", "高", "跟", "鞋"], "text_t_seg": ["这", "种", "高", "跟", "鞋", "叫", "什", "么", "呀"], "sample_type": "ori", "rel_ids": [1639]} +{"id": 11, "query": "如何刷弹弹堂点卷", "title": "弹弹堂如何刷点卷?", "text_q_seg": ["如", "何", "刷", "弹", "弹", "堂", "点", "卷"], "text_t_seg": ["弹", "弹", "堂", "如", "何", "刷", "点", "卷", "?"], "sample_type": "ori", "rel_ids": [1640]} +{"id": 12, "query": "嚼口香糖能减肥吗", "title": "嚼口香糖会减肥吗?", "text_q_seg": ["嚼", "口", "香", "糖", "能", "减", "肥", "吗"], "text_t_seg": ["嚼", "口", "香", "糖", "会", "减", "肥", "吗", "?"], "sample_type": "ori", "rel_ids": [1641]} +{"id": 13, "query": "这个女模特叫什么呢?", "title": "这个女模特叫啥", "text_q_seg": ["这", "个", "女", "模", "特", "叫", "什", "么", "呢", "?"], "text_t_seg": ["这", "个", "女", "模", "特", "叫", "啥"], "sample_type": "ori", "rel_ids": [1642]} +{"id": 14, "query": "跑跑卡丁车好玩么", "title": "跑跑卡丁车好玩吗", "text_q_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "么"], "text_t_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "吗"], "sample_type": "ori", "rel_ids": [1643]} +{"id": 15, "query": "怎么调理湿热体质?", "title": "湿热体质怎样调理啊", "text_q_seg": ["怎", "么", "调", "理", "湿", "热", "体", "质", "?"], "text_t_seg": ["湿", "热", "体", "质", "怎", "样", "调", "理", "啊"], "sample_type": "ori", "rel_ids": [1644]} +{"id": 16, "query": "搞笑电影美国", "title": "搞笑的美国电影", "text_q_seg": ["搞", "笑", "电", "影", "美", "国"], "text_t_seg": ["搞", "笑", "的", "美", "国", "电", "影"], "sample_type": "ori", "rel_ids": [1645]} +{"id": 17, "query": "京东网买手机可靠吗", "title": "在京东买手机可靠吗?", "text_q_seg": ["京", "东", "网", "买", "手", "机", "可", "靠", "吗"], "text_t_seg": ["在", "京", "东", "买", "手", "机", "可", "靠", "吗", "?"], "sample_type": "ori", "rel_ids": [1646]} +{"id": 18, "query": "谁能帮我们想个网名?", "title": "谁能帮我想个网名?", "text_q_seg": ["谁", "能", "帮", "我", "们", "想", "个", "网", "名", "?"], "text_t_seg": ["谁", "能", "帮", "我", "想", "个", "网", "名", "?"], "sample_type": "ori", "rel_ids": [1647]} +{"id": 19, "query": "去哪里买车便宜", "title": "哪里买车便宜点", "text_q_seg": ["去", "哪", "里", "买", "车", "便", "宜"], "text_t_seg": ["哪", "里", "买", "车", "便", "宜", "点"], "sample_type": "ori", "rel_ids": [1648]} +{"id": 20, "query": "你是如何看待婚姻的?", "title": "你是如何看待婚姻?", "text_q_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "的", "?"], "text_t_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "?"], "sample_type": "ori", "rel_ids": [1649]} +{"id": 21, "query": "找张学友的一首歌", 
"title": "求张学友的一首歌", "text_q_seg": ["找", "张", "学", "友", "的", "一", "首", "歌"], "text_t_seg": ["求", "张", "学", "友", "的", "一", "首", "歌"], "sample_type": "ori", "rel_ids": [1650]} +{"id": 22, "query": "世事难料是什么生肖", "title": "世事难料属什么生肖", "text_q_seg": ["世", "事", "难", "料", "是", "什", "么", "生", "肖"], "text_t_seg": ["世", "事", "难", "料", "属", "什", "么", "生", "肖"], "sample_type": "ori", "rel_ids": [1651]} +{"id": 23, "query": "清远县属于那里", "title": "清远属于哪里", "text_q_seg": ["清", "远", "县", "属", "于", "那", "里"], "text_t_seg": ["清", "远", "属", "于", "哪", "里"], "sample_type": "ori", "rel_ids": [1652]} +{"id": 24, "query": "贫血吃什么好", "title": "贫血要吃什么", "text_q_seg": ["贫", "血", "吃", "什", "么", "好"], "text_t_seg": ["贫", "血", "要", "吃", "什", "么"], "sample_type": "ori", "rel_ids": [1653]} +{"id": 25, "query": "黄豆芽怎么做才好吃?", "title": "黄豆芽怎么做好吃?", "text_q_seg": ["黄", "豆", "芽", "怎", "么", "做", "才", "好", "吃", "?"], "text_t_seg": ["黄", "豆", "芽", "怎", "么", "做", "好", "吃", "?"], "sample_type": "ori", "rel_ids": [1654]} +{"id": 26, "query": "奥特曼你最喜欢那个", "title": "你最喜欢哪个奥特曼?", "text_q_seg": ["奥", "特", "曼", "你", "最", "喜", "欢", "那", "个"], "text_t_seg": ["你", "最", "喜", "欢", "哪", "个", "奥", "特", "曼", "?"], "sample_type": "ori", "rel_ids": [1655]} +{"id": 27, "query": "这张图片是哪个动漫", "title": "求这张图片的动漫名!", "text_q_seg": ["这", "张", "图", "片", "是", "哪", "个", "动", "漫"], "text_t_seg": ["求", "这", "张", "图", "片", "的", "动", "漫", "名", "!"], "sample_type": "ori", "rel_ids": [1656]} +{"id": 28, "query": "过年了卖点什么好?", "title": "要过年了卖点什么好", "text_q_seg": ["过", "年", "了", "卖", "点", "什", "么", "好", "?"], "text_t_seg": ["要", "过", "年", "了", "卖", "点", "什", "么", "好"], "sample_type": "ori", "rel_ids": [1657]} +{"id": 29, "query": "最近过的怎么样?", "title": "你们最近过的怎么样?", "text_q_seg": ["最", "近", "过", "的", "怎", "么", "样", "?"], "text_t_seg": ["你", "们", "最", "近", "过", "的", "怎", "么", "样", "?"], "sample_type": "ori", "rel_ids": [1658]} +{"id": 30, "query": "现在有什么新电影", "title": "现在都有什么电影看?", "text_q_seg": ["现", "在", "有", "什", "么", "新", "电", "影"], "text_t_seg": ["现", "在", "都", "有", "什", "么", "电", "影", "看", "?"], "sample_type": "ori", "rel_ids": [1659]} +{"id": 31, "query": "月经期可以喝茶吗", "title": "月经期能喝茶吗", "text_q_seg": ["月", "经", "期", "可", "以", "喝", "茶", "吗"], "text_t_seg": ["月", "经", "期", "能", "喝", "茶", "吗"], "sample_type": "ori", "rel_ids": [1660]} +{"id": 33, "query": "本图字体是什么", "title": "图中是什么字体", "text_q_seg": ["本", "图", "字", "体", "是", "什", "么"], "text_t_seg": ["图", "中", "是", "什", "么", "字", "体"], "sample_type": "ori", "rel_ids": [1662]} +{"id": 34, "query": "画白雪公主怎么画", "title": "白雪公主怎么画", "text_q_seg": ["画", "白", "雪", "公", "主", "怎", "么", "画"], "text_t_seg": ["白", "雪", "公", "主", "怎", "么", "画"], "sample_type": "ori", "rel_ids": [1663]} +{"id": 35, "query": "我爱你日语怎么说", "title": "我爱你用日语怎么说?", "text_q_seg": ["我", "爱", "你", "日", "语", "怎", "么", "说"], "text_t_seg": ["我", "爱", "你", "用", "日", "语", "怎", "么", "说", "?"], "sample_type": "ori", "rel_ids": [1664]} +{"id": 37, "query": "踏步机什么牌子的好", "title": "什么牌子的踏步机好?", "text_q_seg": ["踏", "步", "机", "什", "么", "牌", "子", "的", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "踏", "步", "机", "好", "?"], "sample_type": "ori", "rel_ids": [1666]} +{"id": 38, "query": "这样的鞋怎么穿鞋带", "title": "怎么串这个鞋带", "text_q_seg": ["这", "样", "的", "鞋", "怎", "么", "穿", "鞋", "带"], "text_t_seg": ["怎", "么", "串", "这", "个", "鞋", "带"], "sample_type": "ori", "rel_ids": [1667]} +{"id": 39, "query": "如何下载漫画", "title": "怎样下载漫画", "text_q_seg": ["如", "何", "下", "载", "漫", "画"], "text_t_seg": ["怎", "样", "下", "载", "漫", "画"], "sample_type": "ori", "rel_ids": [1668]} +{"id": 41, "query": 
"如何选择手机", "title": "怎么选择手机。", "text_q_seg": ["如", "何", "选", "择", "手", "机"], "text_t_seg": ["怎", "么", "选", "择", "手", "机", "。"], "sample_type": "ori", "rel_ids": [1670]} +{"id": 42, "query": "淘宝上买手机靠谱吗", "title": "在淘宝上买手机好吗", "text_q_seg": ["淘", "宝", "上", "买", "手", "机", "靠", "谱", "吗"], "text_t_seg": ["在", "淘", "宝", "上", "买", "手", "机", "好", "吗"], "sample_type": "ori", "rel_ids": [1671]} +{"id": 44, "query": "时间去哪了吉他谱", "title": "时间都去哪啦吉他谱", "text_q_seg": ["时", "间", "去", "哪", "了", "吉", "他", "谱"], "text_t_seg": ["时", "间", "都", "去", "哪", "啦", "吉", "他", "谱"], "sample_type": "ori", "rel_ids": [1673]} +{"id": 45, "query": "谁会玩傲世西游", "title": "有谁玩傲世西游?", "text_q_seg": ["谁", "会", "玩", "傲", "世", "西", "游"], "text_t_seg": ["有", "谁", "玩", "傲", "世", "西", "游", "?"], "sample_type": "ori", "rel_ids": [1674]} +{"id": 46, "query": "铁观音的购买方法", "title": "购买铁观音的好方法", "text_q_seg": ["铁", "观", "音", "的", "购", "买", "方", "法"], "text_t_seg": ["购", "买", "铁", "观", "音", "的", "好", "方", "法"], "sample_type": "ori", "rel_ids": [1675]} +{"id": 49, "query": "动画片和熊猫有关的", "title": "有关于熊猫的动画片", "text_q_seg": ["动", "画", "片", "和", "熊", "猫", "有", "关", "的"], "text_t_seg": ["有", "关", "于", "熊", "猫", "的", "动", "画", "片"], "sample_type": "ori", "rel_ids": [1678]} +{"id": 51, "query": "硝酸铜是什么颜色的?", "title": "硝酸铜是什么颜色", "text_q_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色", "的", "?"], "text_t_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色"], "sample_type": "ori", "rel_ids": [1680]} +{"id": 52, "query": "火影忍者佐助搞小樱", "title": "火影忍者佐助和小樱", "text_q_seg": ["火", "影", "忍", "者", "佐", "助", "搞", "小", "樱"], "text_t_seg": ["火", "影", "忍", "者", "佐", "助", "和", "小", "樱"], "sample_type": "ori", "rel_ids": [1681]} +{"id": 53, "query": "感冒还能喝啤酒吗?", "title": "感冒了可以喝啤酒吗?", "text_q_seg": ["感", "冒", "还", "能", "喝", "啤", "酒", "吗", "?"], "text_t_seg": ["感", "冒", "了", "可", "以", "喝", "啤", "酒", "吗", "?"], "sample_type": "ori", "rel_ids": [1682]} +{"id": 54, "query": "请问这是什么动漫?", "title": "请问这是什么动漫呀", "text_q_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "?"], "text_t_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "呀"], "sample_type": "ori", "rel_ids": [1683]} +{"id": 56, "query": "电炒锅什么牌子好", "title": "什么牌子的电炒锅好", "text_q_seg": ["电", "炒", "锅", "什", "么", "牌", "子", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "电", "炒", "锅", "好"], "sample_type": "ori", "rel_ids": [1685]} +{"id": 57, "query": "梦一场萧敬腾伴奏", "title": "萧敬腾梦一场伴奏", "text_q_seg": ["梦", "一", "场", "萧", "敬", "腾", "伴", "奏"], "text_t_seg": ["萧", "敬", "腾", "梦", "一", "场", "伴", "奏"], "sample_type": "ori", "rel_ids": [1686]} +{"id": 58, "query": "求一本玄幻小说名", "title": "找一本玄幻的小说!", "text_q_seg": ["求", "一", "本", "玄", "幻", "小", "说", "名"], "text_t_seg": ["找", "一", "本", "玄", "幻", "的", "小", "说", "!"], "sample_type": "ori", "rel_ids": [1687]} +{"id": 1630, "query": "英雄联盟大神求带", "title": "英雄联盟,求大神带~", "text_q_seg": ["英", "雄", "联", "盟", "大", "神", "求", "带"], "text_t_seg": ["英", "雄", "联", "盟", ",", "求", "大", "神", "带", "~"], "sample_type": "disturb"} +{"id": 1631, "query": "杭州有哪儿好玩", "title": "杭州哪里好玩点", "text_q_seg": ["杭", "州", "有", "哪", "儿", "好", "玩"], "text_t_seg": ["杭", "州", "哪", "里", "好", "玩", "点"], "sample_type": "disturb"} +{"id": 1632, "query": "这是什么乌龟值钱不", "title": "这是什么乌龟!值钱嘛?", "text_q_seg": ["这", "是", "什", "么", "乌", "龟", "值", "钱", "不"], "text_t_seg": ["这", "是", "什", "么", "乌", "龟", "!", "值", "钱", "嘛", "?"], "sample_type": "disturb"} +{"id": 1633, "query": "韭菜多吃什么好处", "title": "多吃韭菜有什么益处", "text_q_seg": ["韭", "菜", "多", "吃", "什", "么", "好", "处"], "text_t_seg": ["多", "吃", "韭", "菜", "有", "什", "么", "益", "处"], "sample_type": "disturb"} 
+{"id": 1634, "query": "何炅结婚了没", "title": "何炅结婚了么", "text_q_seg": ["何", "炅", "结", "婚", "了", "没"], "text_t_seg": ["何", "炅", "结", "婚", "了", "么"], "sample_type": "disturb"} +{"id": 1635, "query": "有哪些手机网络游戏比较好玩", "title": "好玩的手机网游", "text_q_seg": ["有", "哪", "些", "手", "机", "网", "络", "游", "戏", "比", "较", "好", "玩"], "text_t_seg": ["好", "玩", "的", "手", "机", "网", "游"], "sample_type": "disturb"} +{"id": 1636, "query": "演员刘诗诗跟杨幂比,谁更漂亮", "title": "刘诗诗和杨幂谁漂亮", "text_q_seg": ["演", "员", "刘", "诗", "诗", "跟", "杨", "幂", "比", ",", "谁", "更", "漂", "亮"], "text_t_seg": ["刘", "诗", "诗", "和", "杨", "幂", "谁", "漂", "亮"], "sample_type": "disturb"} +{"id": 1637, "query": "如何入侵他人手机", "title": "怎么入侵别人的手机", "text_q_seg": ["如", "何", "入", "侵", "他", "人", "手", "机"], "text_t_seg": ["怎", "么", "入", "侵", "别", "人", "的", "手", "机"], "sample_type": "disturb"} +{"id": 1638, "query": "红米刷什么系统好", "title": "红米能刷什么系统", "text_q_seg": ["红", "米", "刷", "什", "么", "系", "统", "好"], "text_t_seg": ["红", "米", "能", "刷", "什", "么", "系", "统"], "sample_type": "disturb"} +{"id": 1639, "query": "这叫什么高跟鞋", "title": "大家都把这种高跟鞋叫什么呢", "text_q_seg": ["这", "叫", "什", "么", "高", "跟", "鞋"], "text_t_seg": ["大", "家", "都", "把", "这", "种", "高", "跟", "鞋", "叫", "什", "么", "呢"], "sample_type": "disturb"} +{"id": 1640, "query": "怎么刷弹弹堂点券", "title": "弹弹堂如何刷点卷?", "text_q_seg": ["怎", "么", "刷", "弹", "弹", "堂", "点", "券"], "text_t_seg": ["弹", "弹", "堂", "如", "何", "刷", "点", "卷", "?"], "sample_type": "disturb"} +{"id": 1641, "query": "嚼口香糖可以减肥吗", "title": "嚼口香糖会减肥吗?", "text_q_seg": ["嚼", "口", "香", "糖", "可", "以", "减", "肥", "吗"], "text_t_seg": ["嚼", "口", "香", "糖", "会", "减", "肥", "吗", "?"], "sample_type": "disturb"} +{"id": 1642, "query": "这个女模特叫什么啊?", "title": "这个女模特叫啥", "text_q_seg": ["这", "个", "女", "模", "特", "叫", "什", "么", "啊", "?"], "text_t_seg": ["这", "个", "女", "模", "特", "叫", "啥"], "sample_type": "disturb"} +{"id": 1643, "query": "跑跑卡丁车好玩么", "title": "跑跑卡丁车好不好玩", "text_q_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "么"], "text_t_seg": ["跑", "跑", "卡", "丁", "车", "好", "不", "好", "玩"], "sample_type": "disturb"} +{"id": 1644, "query": "如何调理湿热体质?", "title": "湿热体质怎样调理啊", "text_q_seg": ["如", "何", "调", "理", "湿", "热", "体", "质", "?"], "text_t_seg": ["湿", "热", "体", "质", "怎", "样", "调", "理", "啊"], "sample_type": "disturb"} +{"id": 1645, "query": "搞笑电影美国", "title": "好笑的美国电影", "text_q_seg": ["搞", "笑", "电", "影", "美", "国"], "text_t_seg": ["好", "笑", "的", "美", "国", "电", "影"], "sample_type": "disturb"} +{"id": 1646, "query": "京东网买手机可靠吗", "title": "在京东买手机靠谱吗?", "text_q_seg": ["京", "东", "网", "买", "手", "机", "可", "靠", "吗"], "text_t_seg": ["在", "京", "东", "买", "手", "机", "靠", "谱", "吗", "?"], "sample_type": "disturb"} +{"id": 1647, "query": "谁可以帮我们想个网名?", "title": "谁能帮我想个网名?", "text_q_seg": ["谁", "可", "以", "帮", "我", "们", "想", "个", "网", "名", "?"], "text_t_seg": ["谁", "能", "帮", "我", "想", "个", "网", "名", "?"], "sample_type": "disturb"} +{"id": 1648, "query": "一般买车都去哪里会比较便宜呀", "title": "哪里买车便宜点", "text_q_seg": ["一", "般", "买", "车", "都", "去", "哪", "里", "会", "比", "较", "便", "宜", "呀"], "text_t_seg": ["哪", "里", "买", "车", "便", "宜", "点"], "sample_type": "disturb"} +{"id": 1649, "query": "你是如何看待婚姻的呢?", "title": "你如何看待婚姻", "text_q_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "的", "呢", "?"], "text_t_seg": ["你", "如", "何", "看", "待", "婚", "姻"], "sample_type": "disturb"} +{"id": 1650, "query": "请帮我找一首歌,张学友的,谢谢", "title": "求张学友的一首歌", "text_q_seg": ["请", "帮", "我", "找", "一", "首", "歌", ",", "张", "学", "友", "的", ",", "谢", "谢"], "text_t_seg": ["求", "张", "学", "友", "的", "一", "首", "歌"], "sample_type": "disturb"} +{"id": 1651, "query": "世事难料猜一生肖", 
"title": "世事难料属什么生肖", "text_q_seg": ["世", "事", "难", "料", "猜", "一", "生", "肖"], "text_t_seg": ["世", "事", "难", "料", "属", "什", "么", "生", "肖"], "sample_type": "disturb"} +{"id": 1652, "query": "清远县是属于哪里的", "title": "清远属于哪里", "text_q_seg": ["清", "远", "县", "是", "属", "于", "哪", "里", "的"], "text_t_seg": ["清", "远", "属", "于", "哪", "里"], "sample_type": "disturb"} +{"id": 1653, "query": "贫血的话,补血需要吃什么呢", "title": "贫血要吃什么", "text_q_seg": ["贫", "血", "的", "话", ",", "补", "血", "需", "要", "吃", "什", "么", "呢"], "text_t_seg": ["贫", "血", "要", "吃", "什", "么"], "sample_type": "disturb"} +{"id": 1654, "query": "黄豆芽怎么做才好吃呢?", "title": "黄豆芽怎么做好吃?", "text_q_seg": ["黄", "豆", "芽", "怎", "么", "做", "才", "好", "吃", "呢", "?"], "text_t_seg": ["黄", "豆", "芽", "怎", "么", "做", "好", "吃", "?"], "sample_type": "disturb"} +{"id": 1655, "query": "奥特曼你最喜欢那个", "title": "你最爱哪个奥特曼?", "text_q_seg": ["奥", "特", "曼", "你", "最", "喜", "欢", "那", "个"], "text_t_seg": ["你", "最", "爱", "哪", "个", "奥", "特", "曼", "?"], "sample_type": "disturb"} +{"id": 1656, "query": "这张图片是什么动漫", "title": "求这张图片的动漫名!", "text_q_seg": ["这", "张", "图", "片", "是", "什", "么", "动", "漫"], "text_t_seg": ["求", "这", "张", "图", "片", "的", "动", "漫", "名", "!"], "sample_type": "disturb"} +{"id": 1657, "query": "在过年的时候,什么好卖点呢", "title": "要过年了卖点什么好", "text_q_seg": ["在", "过", "年", "的", "时", "候", ",", "什", "么", "好", "卖", "点", "呢"], "text_t_seg": ["要", "过", "年", "了", "卖", "点", "什", "么", "好"], "sample_type": "disturb"} +{"id": 1658, "query": "最近过的怎么样呀,好不好啊", "title": "你们最近过的怎么样?", "text_q_seg": ["最", "近", "过", "的", "怎", "么", "样", "呀", ",", "好", "不", "好", "啊"], "text_t_seg": ["你", "们", "最", "近", "过", "的", "怎", "么", "样", "?"], "sample_type": "disturb"} +{"id": 1659, "query": "现在有什么新电影", "title": "现在可以看的电影都有什么呀?", "text_q_seg": ["现", "在", "有", "什", "么", "新", "电", "影"], "text_t_seg": ["现", "在", "可", "以", "看", "的", "电", "影", "都", "有", "什", "么", "呀", "?"], "sample_type": "disturb"} +{"id": 1660, "query": "生理期可以喝茶吗", "title": "来大姨妈的时候能喝茶吗", "text_q_seg": ["生", "理", "期", "可", "以", "喝", "茶", "吗"], "text_t_seg": ["来", "大", "姨", "妈", "的", "时", "候", "能", "喝", "茶", "吗"], "sample_type": "disturb"} +{"id": 1662, "query": "本图字体是什么", "title": "图中为什么字体", "text_q_seg": ["本", "图", "字", "体", "是", "什", "么"], "text_t_seg": ["图", "中", "为", "什", "么", "字", "体"], "sample_type": "disturb"} +{"id": 1663, "query": "画白雪公主怎么画", "title": "白雪公主如何画", "text_q_seg": ["画", "白", "雪", "公", "主", "怎", "么", "画"], "text_t_seg": ["白", "雪", "公", "主", "如", "何", "画"], "sample_type": "disturb"} +{"id": 1664, "query": "我爱你 日语", "title": "我爱你用日语如何说?", "text_q_seg": ["我", "爱", "你", " ", "日", "语"], "text_t_seg": ["我", "爱", "你", "用", "日", "语", "如", "何", "说", "?"], "sample_type": "disturb"} +{"id": 1666, "query": "踏步机什么牌子的好", "title": "踏步机比较好的牌子都有哪些?", "text_q_seg": ["踏", "步", "机", "什", "么", "牌", "子", "的", "好"], "text_t_seg": ["踏", "步", "机", "比", "较", "好", "的", "牌", "子", "都", "有", "哪", "些", "?"], "sample_type": "disturb"} +{"id": 1667, "query": "这样的鞋怎么穿鞋带", "title": "这个鞋带要怎么串起来呢", "text_q_seg": ["这", "样", "的", "鞋", "怎", "么", "穿", "鞋", "带"], "text_t_seg": ["这", "个", "鞋", "带", "要", "怎", "么", "串", "起", "来", "呢"], "sample_type": "disturb"} +{"id": 1668, "query": "漫画下载的好方法", "title": "怎么下载漫画", "text_q_seg": ["漫", "画", "下", "载", "的", "好", "方", "法"], "text_t_seg": ["怎", "么", "下", "载", "漫", "画"], "sample_type": "disturb"} +{"id": 1670, "query": "如何选择手机", "title": "怎样选择手机", "text_q_seg": ["如", "何", "选", "择", "手", "机"], "text_t_seg": ["怎", "样", "选", "择", "手", "机"], "sample_type": "disturb"} +{"id": 1671, "query": "在淘宝上买电子产品如手机,体验怎么样,手机可靠吗?", "title": "在淘宝上买手机好吗", 
"text_q_seg": ["在", "淘", "宝", "上", "买", "电", "子", "产", "品", "如", "手", "机", ",", "体", "验", "怎", "么", "样", ",", "手", "机", "可", "靠", "吗", "?"], "text_t_seg": ["在", "淘", "宝", "上", "买", "手", "机", "好", "吗"], "sample_type": "disturb"} +{"id": 1673, "query": "歌曲时间去哪了吉他谱", "title": "时间都去哪啦吉他谱", "text_q_seg": ["歌", "曲", "时", "间", "去", "哪", "了", "吉", "他", "谱"], "text_t_seg": ["时", "间", "都", "去", "哪", "啦", "吉", "他", "谱"], "sample_type": "disturb"} +{"id": 1674, "query": "谁会玩傲世西游", "title": "有谁玩傲世西游吗?", "text_q_seg": ["谁", "会", "玩", "傲", "世", "西", "游"], "text_t_seg": ["有", "谁", "玩", "傲", "世", "西", "游", "吗", "?"], "sample_type": "disturb"} +{"id": 1675, "query": "铁观音的购买方法", "title": "有没有购买铁观音的好的渠道", "text_q_seg": ["铁", "观", "音", "的", "购", "买", "方", "法"], "text_t_seg": ["有", "没", "有", "购", "买", "铁", "观", "音", "的", "好", "的", "渠", "道"], "sample_type": "disturb"} +{"id": 1678, "query": "哪些动画片是跟国宝大熊猫相关的", "title": "有关于熊猫的动画片", "text_q_seg": ["哪", "些", "动", "画", "片", "是", "跟", "国", "宝", "大", "熊", "猫", "相", "关", "的"], "text_t_seg": ["有", "关", "于", "熊", "猫", "的", "动", "画", "片"], "sample_type": "disturb"} +{"id": 1680, "query": "硝酸铜是什么颜色的?", "title": "硝酸铜颜色是什么", "text_q_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色", "的", "?"], "text_t_seg": ["硝", "酸", "铜", "颜", "色", "是", "什", "么"], "sample_type": "disturb"} +{"id": 1681, "query": "火影忍者佐助搞小樱", "title": "请帮忙搜索火影忍者佐助跟小樱", "text_q_seg": ["火", "影", "忍", "者", "佐", "助", "搞", "小", "樱"], "text_t_seg": ["请", "帮", "忙", "搜", "索", "火", "影", "忍", "者", "佐", "助", "跟", "小", "樱"], "sample_type": "disturb"} +{"id": 1682, "query": "感冒还能喝啤酒吗?", "title": "感冒了能够喝啤酒吗?", "text_q_seg": ["感", "冒", "还", "能", "喝", "啤", "酒", "吗", "?"], "text_t_seg": ["感", "冒", "了", "能", "够", "喝", "啤", "酒", "吗", "?"], "sample_type": "disturb"} +{"id": 1683, "query": "请问这是什么动漫呢?", "title": "请问这个动漫是哪个呀", "text_q_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "呢", "?"], "text_t_seg": ["请", "问", "这", "个", "动", "漫", "是", "哪", "个", "呀"], "sample_type": "disturb"} +{"id": 1685, "query": "电炒锅什么牌子好", "title": "什么牌子的电炒锅最好", "text_q_seg": ["电", "炒", "锅", "什", "么", "牌", "子", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "电", "炒", "锅", "最", "好"], "sample_type": "disturb"} +{"id": 1686, "query": "梦一场萧敬腾伴奏", "title": "求萧敬腾的梦一场的伴奏部分", "text_q_seg": ["梦", "一", "场", "萧", "敬", "腾", "伴", "奏"], "text_t_seg": ["求", "萧", "敬", "腾", "的", "梦", "一", "场", "的", "伴", "奏", "部", "分"], "sample_type": "disturb"} +{"id": 1687, "query": "求一本玄幻小说名", "title": "寻一本玄幻的小说!", "text_q_seg": ["求", "一", "本", "玄", "幻", "小", "说", "名"], "text_t_seg": ["寻", "一", "本", "玄", "幻", "的", "小", "说", "!"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/similarity_en b/examples/model_interpretation/data/similarity_en new file mode 100644 index 0000000000000000000000000000000000000000..82cf67742d7a894ac8902820c6273dabf7855921 --- /dev/null +++ b/examples/model_interpretation/data/similarity_en @@ -0,0 +1,100 @@ +{"id": 1, "sentence1": "Is there a reason why we should travel alone ?", "sentence2": "What are some reasons to travel alone ?", "text_q_seg": ["Is", "there", "a", "reason", "why", "we", "should", "travel", "alone", "?"], "text_t_seg": ["What", "are", "some", "reasons", "to", "travel", "alone", "?"], "sample_type": "ori", "rel_ids": [1660]} +{"id": 2, "sentence1": "I am 25 year old guy and never had a girlfriend . Is this weird ?", "sentence2": "I am 25 years old . I have never had a girlfriend . 
Is something wrong with me ?", "text_q_seg": ["I", "am", "25", "year", "old", "guy", "and", "never", "had", "a", "girlfriend", ".", "Is", "this", "weird", "?"], "text_t_seg": ["I", "am", "25", "years", "old", ".", "I", "have", "never", "had", "a", "girlfriend", ".", "Is", "something", "wrong", "with", "me", "?"], "sample_type": "ori", "rel_ids": [1661]} +{"id": 3, "sentence1": "What does a good answer on Quora look like ? What does it mean to \" be helpful \" ?", "sentence2": "How do you write a good answer on Quora ?", "text_q_seg": ["What", "does", "a", "good", "answer", "on", "Quora", "look", "like", "?", "What", "does", "it", "mean", "to", "\"", "be", "helpful", "\"", "?"], "text_t_seg": ["How", "do", "you", "write", "a", "good", "answer", "on", "Quora", "?"], "sample_type": "ori", "rel_ids": [1662]} +{"id": 4, "sentence1": "What was the deadliest battle in history ?", "sentence2": "What was the bloodiest battle in history ?", "text_q_seg": ["What", "was", "the", "deadliest", "battle", "in", "history", "?"], "text_t_seg": ["What", "was", "the", "bloodiest", "battle", "in", "history", "?"], "sample_type": "ori", "rel_ids": [1663]} +{"id": 5, "sentence1": "What are your views about demonetisation in India ?", "sentence2": "What do you think about the ban on 500 and 1000 denomination notes in India ?", "text_q_seg": ["What", "are", "your", "views", "about", "demonetisation", "in", "India", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "the", "ban", "on", "500", "and", "1000", "denomination", "notes", "in", "India", "?"], "sample_type": "ori", "rel_ids": [1664]} +{"id": 6, "sentence1": "Is it a bad time to buy a condo or a house in the Bay Area in 2017 ?", "sentence2": "Would 2017 be a good time to buy a house in Bay Area ?", "text_q_seg": ["Is", "it", "a", "bad", "time", "to", "buy", "a", "condo", "or", "a", "house", "in", "the", "Bay", "Area", "in", "2017", "?"], "text_t_seg": ["Would", "2017", "be", "a", "good", "time", "to", "buy", "a", "house", "in", "Bay", "Area", "?"], "sample_type": "ori", "rel_ids": [1665]} +{"id": 7, "sentence1": "What books should I read as an aspiring entrepreneur ?", "sentence2": "What are the top books an aspiring teen entrepreneur should read ?", "text_q_seg": ["What", "books", "should", "I", "read", "as", "an", "aspiring", "entrepreneur", "?"], "text_t_seg": ["What", "are", "the", "top", "books", "an", "aspiring", "teen", "entrepreneur", "should", "read", "?"], "sample_type": "ori", "rel_ids": [1666]} +{"id": 8, "sentence1": "If universe is expanding without a limit and dark and vacuum energy are created as it expands … ?", "sentence2": "If universe can expand without limit and it creates dark / vacuum / gravitational energy with it , then is the potential energy infinite ?", "text_q_seg": ["If", "universe", "is", "expanding", "without", "a", "limit", "and", "dark", "and", "vacuum", "energy", "are", "created", "as", "it", "expands", "…", "?"], "text_t_seg": ["If", "universe", "can", "expand", "without", "limit", "and", "it", "creates", "dark", "/", "vacuum", "/", "gravitational", "energy", "with", "it", ",", "then", "is", "the", "potential", "energy", "infinite", "?"], "sample_type": "ori", "rel_ids": [1667]} +{"id": 9, "sentence1": "What people who you 've never met have influenced your life the most ?", "sentence2": "Who are people you have never met who have had the greatest influence on your life ?", "text_q_seg": ["What", "people", "who", "you", "'ve", "never", "met", "have", "influenced", "your", "life", "the", "most", "?"], 
"text_t_seg": ["Who", "are", "people", "you", "have", "never", "met", "who", "have", "had", "the", "greatest", "influence", "on", "your", "life", "?"], "sample_type": "ori", "rel_ids": [1668]} +{"id": 10, "sentence1": "I 'm going to be US President one day . What should I start doing now to achieve this ?", "sentence2": "I 'm 16 and I want to become the US president someday . What should I start doing ?", "text_q_seg": ["I", "'m", "going", "to", "be", "US", "President", "one", "day", ".", "What", "should", "I", "start", "doing", "now", "to", "achieve", "this", "?"], "text_t_seg": ["I", "'m", "16", "and", "I", "want", "to", "become", "the", "US", "president", "someday", ".", "What", "should", "I", "start", "doing", "?"], "sample_type": "ori", "rel_ids": [1669]} +{"id": 11, "sentence1": "Why MS Dhoni leave captaincy of ODI & T-20 ?", "sentence2": "Why does M.S Dhoni left captaincy for ODI and T20 ?", "text_q_seg": ["Why", "MS", "Dhoni", "leave", "captaincy", "of", "ODI", "&", "T-20", "?"], "text_t_seg": ["Why", "does", "M.S", "Dhoni", "left", "captaincy", "for", "ODI", "and", "T20", "?"], "sample_type": "ori", "rel_ids": [1670]} +{"id": 12, "sentence1": "What are the procedures for becoming an actuary ?", "sentence2": "What is the procedure of becoming an actuary ?", "text_q_seg": ["What", "are", "the", "procedures", "for", "becoming", "an", "actuary", "?"], "text_t_seg": ["What", "is", "the", "procedure", "of", "becoming", "an", "actuary", "?"], "sample_type": "ori", "rel_ids": [1671]} +{"id": 13, "sentence1": "How do smart and successful people control their emotions ?", "sentence2": "How can I control my emotions ?", "text_q_seg": ["How", "do", "smart", "and", "successful", "people", "control", "their", "emotions", "?"], "text_t_seg": ["How", "can", "I", "control", "my", "emotions", "?"], "sample_type": "ori", "rel_ids": [1672]} +{"id": 14, "sentence1": "What are the best tips for outlining / planning a novel ?", "sentence2": "How do I best outline my novel ?", "text_q_seg": ["What", "are", "the", "best", "tips", "for", "outlining", "/", "planning", "a", "novel", "?"], "text_t_seg": ["How", "do", "I", "best", "outline", "my", "novel", "?"], "sample_type": "ori", "rel_ids": [1673]} +{"id": 15, "sentence1": "What will happen if Donald Trump became the president of America ?", "sentence2": "What will happen now that President - elect Donald Trump has won the election ?", "text_q_seg": ["What", "will", "happen", "if", "Donald", "Trump", "became", "the", "president", "of", "America", "?"], "text_t_seg": ["What", "will", "happen", "now", "that", "President", "-", "elect", "Donald", "Trump", "has", "won", "the", "election", "?"], "sample_type": "ori", "rel_ids": [1674]} +{"id": 16, "sentence1": "Why did n't Ned Stark bring more men to the Tower of Joy ?", "sentence2": "Why did Ned Stark go to the Tower of Joy with so few men ? 
Why not bring a small guard ( say 20 more men ) of loyal and discreet northerners ?", "text_q_seg": ["Why", "did", "n't", "Ned", "Stark", "bring", "more", "men", "to", "the", "Tower", "of", "Joy", "?"], "text_t_seg": ["Why", "did", "Ned", "Stark", "go", "to", "the", "Tower", "of", "Joy", "with", "so", "few", "men", "?", "Why", "not", "bring", "a", "small", "guard", "(", "say", "20", "more", "men", ")", "of", "loyal", "and", "discreet", "northerners", "?"], "sample_type": "ori", "rel_ids": [1675]} +{"id": 17, "sentence1": "How do you get better grades ?", "sentence2": "How can I dramatically improve my grades ?", "text_q_seg": ["How", "do", "you", "get", "better", "grades", "?"], "text_t_seg": ["How", "can", "I", "dramatically", "improve", "my", "grades", "?"], "sample_type": "ori", "rel_ids": [1676]} +{"id": 18, "sentence1": "What is your new year resolution , short term and long term goal for 2017 ?", "sentence2": "What will be your New Year 's resolution for 2017 ?", "text_q_seg": ["What", "is", "your", "new", "year", "resolution", ",", "short", "term", "and", "long", "term", "goal", "for", "2017", "?"], "text_t_seg": ["What", "will", "be", "your", "New", "Year", "'s", "resolution", "for", "2017", "?"], "sample_type": "ori", "rel_ids": [1677]} +{"id": 19, "sentence1": "What will happen to the next Star Wars movies after Carrie Fisher 's death ?", "sentence2": "What will Carrie Fisher 's death mean for the next Star Wars movies ?", "text_q_seg": ["What", "will", "happen", "to", "the", "next", "Star", "Wars", "movies", "after", "Carrie", "Fisher", "'s", "death", "?"], "text_t_seg": ["What", "will", "Carrie", "Fisher", "'s", "death", "mean", "for", "the", "next", "Star", "Wars", "movies", "?"], "sample_type": "ori", "rel_ids": [1678]} +{"id": 20, "sentence1": "What is an analogy for a smooth ER ?", "sentence2": "What is an analogy for smooth ER ?", "text_q_seg": ["What", "is", "an", "analogy", "for", "a", "smooth", "ER", "?"], "text_t_seg": ["What", "is", "an", "analogy", "for", "smooth", "ER", "?"], "sample_type": "ori", "rel_ids": [1679]} +{"id": 21, "sentence1": "What is the best business to start in Bangalore ?", "sentence2": "What is the best business in Bangalore to start up with ?", "text_q_seg": ["What", "is", "the", "best", "business", "to", "start", "in", "Bangalore", "?"], "text_t_seg": ["What", "is", "the", "best", "business", "in", "Bangalore", "to", "start", "up", "with", "?"], "sample_type": "ori", "rel_ids": [1680]} +{"id": 22, "sentence1": "Why does gst bill so important ?", "sentence2": "What is the effect of GST bill on a common man ?", "text_q_seg": ["Why", "does", "gst", "bill", "so", "important", "?"], "text_t_seg": ["What", "is", "the", "effect", "of", "GST", "bill", "on", "a", "common", "man", "?"], "sample_type": "ori", "rel_ids": [1681]} +{"id": 23, "sentence1": "Which aircraft was superior - the Douglas DC8 or the Boeing 707 ?", "sentence2": "Was the Douglas DC8 a superior aircraft to the Boeing 707 ?", "text_q_seg": ["Which", "aircraft", "was", "superior", "-", "the", "Douglas", "DC8", "or", "the", "Boeing", "707", "?"], "text_t_seg": ["Was", "the", "Douglas", "DC8", "a", "superior", "aircraft", "to", "the", "Boeing", "707", "?"], "sample_type": "ori", "rel_ids": [1682]} +{"id": 24, "sentence1": "How can I expand my IQ ?", "sentence2": "What can I do to increase my IQ ?", "text_q_seg": ["How", "can", "I", "expand", "my", "IQ", "?"], "text_t_seg": ["What", "can", "I", "do", "to", "increase", "my", "IQ", "?"], "sample_type": "ori", "rel_ids": [1683]} +{"id": 25, 
"sentence1": "What does it mean when a girl take a day to reply to your text ?", "sentence2": "What does it mean when girls reply to a text a day after ?", "text_q_seg": ["What", "does", "it", "mean", "when", "a", "girl", "take", "a", "day", "to", "reply", "to", "your", "text", "?"], "text_t_seg": ["What", "does", "it", "mean", "when", "girls", "reply", "to", "a", "text", "a", "day", "after", "?"], "sample_type": "ori", "rel_ids": [1684]} +{"id": 26, "sentence1": "How can I stop myself from watching too much of porn ?", "sentence2": "How shall I stop watching porn ?", "text_q_seg": ["How", "can", "I", "stop", "myself", "from", "watching", "too", "much", "of", "porn", "?"], "text_t_seg": ["How", "shall", "I", "stop", "watching", "porn", "?"], "sample_type": "ori", "rel_ids": [1685]} +{"id": 27, "sentence1": "What will be the effect of banning 500 and 1000 Rs notes on real estate sector in India ? Can we expect sharp fall in prices in short / long term ?", "sentence2": "What will the real estate look like now after the 500 and 1000 scraping ?", "text_q_seg": ["What", "will", "be", "the", "effect", "of", "banning", "500", "and", "1000", "Rs", "notes", "on", "real", "estate", "sector", "in", "India", "?", "Can", "we", "expect", "sharp", "fall", "in", "prices", "in", "short", "/", "long", "term", "?"], "text_t_seg": ["What", "will", "the", "real", "estate", "look", "like", "now", "after", "the", "500", "and", "1000", "scraping", "?"], "sample_type": "ori", "rel_ids": [1686]} +{"id": 28, "sentence1": "Is it worth it to pay for PhD from my pocket ?", "sentence2": "Is it foolish to pay for your PhD out of your own pocket ?", "text_q_seg": ["Is", "it", "worth", "it", "to", "pay", "for", "PhD", "from", "my", "pocket", "?"], "text_t_seg": ["Is", "it", "foolish", "to", "pay", "for", "your", "PhD", "out", "of", "your", "own", "pocket", "?"], "sample_type": "ori", "rel_ids": [1687]} +{"id": 29, "sentence1": "What is the maximum file size that can be uploaded in Whatsapp ?", "sentence2": "What is the maximum file size on WhatsApp ?", "text_q_seg": ["What", "is", "the", "maximum", "file", "size", "that", "can", "be", "uploaded", "in", "Whatsapp", "?"], "text_t_seg": ["What", "is", "the", "maximum", "file", "size", "on", "WhatsApp", "?"], "sample_type": "ori", "rel_ids": [1688]} +{"id": 30, "sentence1": "What are the best ways to learn to cook ?", "sentence2": "How do I learn to cook ?", "text_q_seg": ["What", "are", "the", "best", "ways", "to", "learn", "to", "cook", "?"], "text_t_seg": ["How", "do", "I", "learn", "to", "cook", "?"], "sample_type": "ori", "rel_ids": [1689]} +{"id": 31, "sentence1": "What was first word spoken by human ?", "sentence2": "What is the first word ever spoken ?", "text_q_seg": ["What", "was", "first", "word", "spoken", "by", "human", "?"], "text_t_seg": ["What", "is", "the", "first", "word", "ever", "spoken", "?"], "sample_type": "ori", "rel_ids": [1690]} +{"id": 32, "sentence1": "Should I give my JEE Main exam offline or online ?", "sentence2": "Which mode is best for JEE MAIN 2017 online exam or offline ?", "text_q_seg": ["Should", "I", "give", "my", "JEE", "Main", "exam", "offline", "or", "online", "?"], "text_t_seg": ["Which", "mode", "is", "best", "for", "JEE", "MAIN", "2017", "online", "exam", "or", "offline", "?"], "sample_type": "ori", "rel_ids": [1691]} +{"id": 33, "sentence1": "Is literally infinite number of unique human DNAs possible ?", "sentence2": "What is the maximum number of genetically unique individuals that human genome allows ?", "text_q_seg": ["Is", 
"literally", "infinite", "number", "of", "unique", "human", "DNAs", "possible", "?"], "text_t_seg": ["What", "is", "the", "maximum", "number", "of", "genetically", "unique", "individuals", "that", "human", "genome", "allows", "?"], "sample_type": "ori", "rel_ids": [1692]} +{"id": 34, "sentence1": "What is motive of Mulayam Singh Yadav behind expelling Akhilesh Yadav from Samajwadi party ?", "sentence2": "Why did Mulayam Singh Yadav expel Akhilesh Yadav from the Samajwadi Party for 6 years ?", "text_q_seg": ["What", "is", "motive", "of", "Mulayam", "Singh", "Yadav", "behind", "expelling", "Akhilesh", "Yadav", "from", "Samajwadi", "party", "?"], "text_t_seg": ["Why", "did", "Mulayam", "Singh", "Yadav", "expel", "Akhilesh", "Yadav", "from", "the", "Samajwadi", "Party", "for", "6", "years", "?"], "sample_type": "ori", "rel_ids": [1693]} +{"id": 35, "sentence1": "Why do we need to philosophize ?", "sentence2": "Why do we need to philosophize with others ?", "text_q_seg": ["Why", "do", "we", "need", "to", "philosophize", "?"], "text_t_seg": ["Why", "do", "we", "need", "to", "philosophize", "with", "others", "?"], "sample_type": "ori", "rel_ids": [1694]} +{"id": 36, "sentence1": "Is there any way to recover e - mails that were deleted from a Gmail account ?", "sentence2": "Is there any way to retrieve my deleted emails from my Gmail account ?", "text_q_seg": ["Is", "there", "any", "way", "to", "recover", "e", "-", "mails", "that", "were", "deleted", "from", "a", "Gmail", "account", "?"], "text_t_seg": ["Is", "there", "any", "way", "to", "retrieve", "my", "deleted", "emails", "from", "my", "Gmail", "account", "?"], "sample_type": "ori", "rel_ids": [1695]} +{"id": 37, "sentence1": "How do I find my own gmail accounts list ?", "sentence2": "How can you find all of your Gmail accounts ?", "text_q_seg": ["How", "do", "I", "find", "my", "own", "gmail", "accounts", "list", "?"], "text_t_seg": ["How", "can", "you", "find", "all", "of", "your", "Gmail", "accounts", "?"], "sample_type": "ori", "rel_ids": [1696]} +{"id": 38, "sentence1": "Where can I get sparkling and well maintained cleaning service in Sydney ?", "sentence2": "Where can I get cleaning services in Sydney ?", "text_q_seg": ["Where", "can", "I", "get", "sparkling", "and", "well", "maintained", "cleaning", "service", "in", "Sydney", "?"], "text_t_seg": ["Where", "can", "I", "get", "cleaning", "services", "in", "Sydney", "?"], "sample_type": "ori", "rel_ids": [1697]} +{"id": 39, "sentence1": "Can Fast and Furious 7 gross $ 1 billion worldwide ?", "sentence2": "Will Furious 7 be the first movie in the franchise to gross a billion dollars ?", "text_q_seg": ["Can", "Fast", "and", "Furious", "7", "gross", "$", "1", "billion", "worldwide", "?"], "text_t_seg": ["Will", "Furious", "7", "be", "the", "first", "movie", "in", "the", "franchise", "to", "gross", "a", "billion", "dollars", "?"], "sample_type": "ori", "rel_ids": [1698]} +{"id": 40, "sentence1": "Which is the best book for learning language c++ ?", "sentence2": "What is a good book for learning the basics of C++ programming ?", "text_q_seg": ["Which", "is", "the", "best", "book", "for", "learning", "language", "c++", "?"], "text_t_seg": ["What", "is", "a", "good", "book", "for", "learning", "the", "basics", "of", "C++", "programming", "?"], "sample_type": "ori", "rel_ids": [1699]} +{"id": 41, "sentence1": "What will be Barack Obama 's legacy ?", "sentence2": "Based on what we know now , what will Barack Obama 's historical legacy be ?", "text_q_seg": ["What", "will", "be", "Barack", "Obama", 
"'s", "legacy", "?"], "text_t_seg": ["Based", "on", "what", "we", "know", "now", ",", "what", "will", "Barack", "Obama", "'s", "historical", "legacy", "be", "?"], "sample_type": "ori", "rel_ids": [1700]} +{"id": 42, "sentence1": "Why do so many people hate Hilary Clinton ?", "sentence2": "What are the reasons that people dislike Hillary Clinton ?", "text_q_seg": ["Why", "do", "so", "many", "people", "hate", "Hilary", "Clinton", "?"], "text_t_seg": ["What", "are", "the", "reasons", "that", "people", "dislike", "Hillary", "Clinton", "?"], "sample_type": "ori", "rel_ids": [1701]} +{"id": 43, "sentence1": "How do l see who viewed my videos on Instagram ?", "sentence2": "How can I see who viewed my video on Instagram but did n't like my video ?", "text_q_seg": ["How", "do", "l", "see", "who", "viewed", "my", "videos", "on", "Instagram", "?"], "text_t_seg": ["How", "can", "I", "see", "who", "viewed", "my", "video", "on", "Instagram", "but", "did", "n't", "like", "my", "video", "?"], "sample_type": "ori", "rel_ids": [1702]} +{"id": 44, "sentence1": "Why is that the sky is so blue ?", "sentence2": "Why is the sky is blue ?", "text_q_seg": ["Why", "is", "that", "the", "sky", "is", "so", "blue", "?"], "text_t_seg": ["Why", "is", "the", "sky", "is", "blue", "?"], "sample_type": "ori", "rel_ids": [1703]} +{"id": 45, "sentence1": "How can I learn English well in a short time ?", "sentence2": "How can I learn English in a short time ?", "text_q_seg": ["How", "can", "I", "learn", "English", "well", "in", "a", "short", "time", "?"], "text_t_seg": ["How", "can", "I", "learn", "English", "in", "a", "short", "time", "?"], "sample_type": "ori", "rel_ids": [1704]} +{"id": 46, "sentence1": "How can I stop eating junk and processed food addiction and stay healthy ?", "sentence2": "How do I stop my cravings for junk food ?", "text_q_seg": ["How", "can", "I", "stop", "eating", "junk", "and", "processed", "food", "addiction", "and", "stay", "healthy", "?"], "text_t_seg": ["How", "do", "I", "stop", "my", "cravings", "for", "junk", "food", "?"], "sample_type": "ori", "rel_ids": [1705]} +{"id": 47, "sentence1": "What are the movies one should see ?", "sentence2": "What are the greatest movies I have to see ?", "text_q_seg": ["What", "are", "the", "movies", "one", "should", "see", "?"], "text_t_seg": ["What", "are", "the", "greatest", "movies", "I", "have", "to", "see", "?"], "sample_type": "ori", "rel_ids": [1706]} +{"id": 48, "sentence1": "What is an accurate way to calculate your IQ ?", "sentence2": "What 's the most accurate way to test my IQ ?", "text_q_seg": ["What", "is", "an", "accurate", "way", "to", "calculate", "your", "IQ", "?"], "text_t_seg": ["What", "'s", "the", "most", "accurate", "way", "to", "test", "my", "IQ", "?"], "sample_type": "ori", "rel_ids": [1707]} +{"id": 49, "sentence1": "Is our PM Modi doing the correct thing with 500 and 1000 Rs notes ?", "sentence2": "What do you think about ban on Rs . 500 and Rs . 
1000 currency notes ?", "text_q_seg": ["Is", "our", "PM", "Modi", "doing", "the", "correct", "thing", "with", "500", "and", "1000", "Rs", "notes", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "ban", "on", "Rs", ".", "500", "and", "Rs", ".", "1000", "currency", "notes", "?"], "sample_type": "ori", "rel_ids": [1708]} +{"id": 50, "sentence1": "Why is the firm 's marginal cost curve equal supply curve ?", "sentence2": "How can supply curve tell about marginal cost ?", "text_q_seg": ["Why", "is", "the", "firm", "'s", "marginal", "cost", "curve", "equal", "supply", "curve", "?"], "text_t_seg": ["How", "can", "supply", "curve", "tell", "about", "marginal", "cost", "?"], "sample_type": "ori", "rel_ids": [1709]} +{"id": 1660, "sentence1": "Is there any reason that we should travel alone ?", "sentence2": "What are some reasons to travel alone ?", "text_q_seg": ["Is", "there", "any", "reason", "that", "we", "should", "travel", "alone", "?"], "text_t_seg": ["What", "are", "some", "reasons", "to", "travel", "alone", "?"], "sample_type": "disturb"} +{"id": 1661, "sentence1": "I am 25 year old guy and never had a girlfriend . Is this odd ?", "sentence2": "I am 25 years old . I have never had a girlfriend . Is something wrong with me ?", "text_q_seg": ["I", "am", "25", "year", "old", "guy", "and", "never", "had", "a", "girlfriend", ".", "Is", "this", "odd", "?"], "text_t_seg": ["I", "am", "25", "years", "old", ".", "I", "have", "never", "had", "a", "girlfriend", ".", "Is", "something", "wrong", "with", "me", "?"], "sample_type": "disturb"} +{"id": 1662, "sentence1": "what is a good answer on Quora that is helpful ?", "sentence2": "How do you write a good answer on Quora ?", "text_q_seg": ["what", "is", "a", "good", "answer", "on", "Quora", "that", "is", "helpful", "?"], "text_t_seg": ["How", "do", "you", "write", "a", "good", "answer", "on", "Quora", "?"], "sample_type": "disturb"} +{"id": 1663, "sentence1": "What was the most fatal battle in history ?", "sentence2": "What was the bloodiest battle in history ?", "text_q_seg": ["What", "was", "the", "most", "fatal", "battle", "in", "history", "?"], "text_t_seg": ["What", "was", "the", "bloodiest", "battle", "in", "history", "?"], "sample_type": "disturb"} +{"id": 1664, "sentence1": "What are your opions on demonetisation in India ?", "sentence2": "What do you think about the ban on 500 and 1000 denomination notes in India ?", "text_q_seg": ["What", "are", "your", "opions", "on", "demonetisation", "in", "India", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "the", "ban", "on", "500", "and", "1000", "denomination", "notes", "in", "India", "?"], "sample_type": "disturb"} +{"id": 1665, "sentence1": "Is it a bad time to buy a condo or a house in the Bay Area in 2017 ?", "sentence2": "Is 2017 a good time to buy a house in Bay Area ?", "text_q_seg": ["Is", "it", "a", "bad", "time", "to", "buy", "a", "condo", "or", "a", "house", "in", "the", "Bay", "Area", "in", "2017", "?"], "text_t_seg": ["Is", "2017", "a", "good", "time", "to", "buy", "a", "house", "in", "Bay", "Area", "?"], "sample_type": "disturb"} +{"id": 1666, "sentence1": "What books should an aspiring entrepreneur read ?", "sentence2": "What are the top books an aspiring teen entrepreneur should read ?", "text_q_seg": ["What", "books", "should", "an", "aspiring", "entrepreneur", "read", "?"], "text_t_seg": ["What", "are", "the", "top", "books", "an", "aspiring", "teen", "entrepreneur", "should", "read", "?"], "sample_type": "disturb"} +{"id": 1667, "sentence1": "If universe is 
expanding infinitely and dark and vacuum energy are created as it expands … ?", "sentence2": "If universe can expand without limit and it creates dark / vacuum / gravitational energy with it , then is the potential energy infinite ?", "text_q_seg": ["If", "universe", "is", "expanding", "infinitely", "and", "dark", "and", "vacuum", "energy", "are", "created", "as", "it", "expands", "…", "?"], "text_t_seg": ["If", "universe", "can", "expand", "without", "limit", "and", "it", "creates", "dark", "/", "vacuum", "/", "gravitational", "energy", "with", "it", ",", "then", "is", "the", "potential", "energy", "infinite", "?"], "sample_type": "disturb"} +{"id": 1668, "sentence1": "Who 's the greatest influencer on your life that you have never met ?", "sentence2": "Who are people you have never met who have had the greatest influence on your life ?", "text_q_seg": ["Who", "'s", "the", "greatest", "influencer", "on", "your", "life", "that", "you", "have", "never", "met", "?"], "text_t_seg": ["Who", "are", "people", "you", "have", "never", "met", "who", "have", "had", "the", "greatest", "influence", "on", "your", "life", "?"], "sample_type": "disturb"} +{"id": 1669, "sentence1": "I 'm going to be US President in the future . What should I start doing now to achieve this ?", "sentence2": "I 'm 16 and I want to become the US president someday . What should I start doing ?", "text_q_seg": ["I", "'m", "going", "to", "be", "US", "President", "in", "the", "future", ".", "What", "should", "I", "start", "doing", "now", "to", "achieve", "this", "?"], "text_t_seg": ["I", "'m", "16", "and", "I", "want", "to", "become", "the", "US", "president", "someday", ".", "What", "should", "I", "start", "doing", "?"], "sample_type": "disturb"} +{"id": 1670, "sentence1": "For what reason did MS Dhoni leave captaincy of ODI & T-20 ?", "sentence2": "Why does M.S Dhoni left captaincy for ODI and T20 ?", "text_q_seg": ["For", "what", "reason", "did", "MS", "Dhoni", "leave", "captaincy", "of", "ODI", "&", "T-20", "?"], "text_t_seg": ["Why", "does", "M.S", "Dhoni", "left", "captaincy", "for", "ODI", "and", "T20", "?"], "sample_type": "disturb"} +{"id": 1671, "sentence1": "How to become an actuary ?", "sentence2": "What is the procedure of becoming an actuary ?", "text_q_seg": ["How", "to", "become", "an", "actuary", "?"], "text_t_seg": ["What", "is", "the", "procedure", "of", "becoming", "an", "actuary", "?"], "sample_type": "disturb"} +{"id": 1672, "sentence1": "Are there any smart ways to control emotions ?", "sentence2": "How can I control my emotions ?", "text_q_seg": ["Are", "there", "any", "smart", "ways", "to", "control", "emotions", "?"], "text_t_seg": ["How", "can", "I", "control", "my", "emotions", "?"], "sample_type": "disturb"} +{"id": 1673, "sentence1": "What are the best methods for outlining / planning a novel ?", "sentence2": "How do I best outline my novel ?", "text_q_seg": ["What", "are", "the", "best", "methods", "for", "outlining", "/", "planning", "a", "novel", "?"], "text_t_seg": ["How", "do", "I", "best", "outline", "my", "novel", "?"], "sample_type": "disturb"} +{"id": 1674, "sentence1": "What will happen if Donald Trump was elected the president of US ?", "sentence2": "What will happen now that President - elect Donald Trump has won the election ?", "text_q_seg": ["What", "will", "happen", "if", "Donald", "Trump", "was", "elected", "the", "president", "of", "US", "?"], "text_t_seg": ["What", "will", "happen", "now", "that", "President", "-", "elect", "Donald", "Trump", "has", "won", "the", "election", "?"], 
"sample_type": "disturb"} +{"id": 1675, "sentence1": "Why did Ned Stark bring very few men to the Tower of Joy ?", "sentence2": "Why did Ned Stark go to the Tower of Joy with so few men ? Why not bring a small guard ( say 20 more men ) of loyal and discreet northerners ?", "text_q_seg": ["Why", "did", "Ned", "Stark", "bring", "very", "few", "men", "to", "the", "Tower", "of", "Joy", "?"], "text_t_seg": ["Why", "did", "Ned", "Stark", "go", "to", "the", "Tower", "of", "Joy", "with", "so", "few", "men", "?", "Why", "not", "bring", "a", "small", "guard", "(", "say", "20", "more", "men", ")", "of", "loyal", "and", "discreet", "northerners", "?"], "sample_type": "disturb"} +{"id": 1676, "sentence1": "How do you get better grades ?", "sentence2": "How can I improve my grades ?", "text_q_seg": ["How", "do", "you", "get", "better", "grades", "?"], "text_t_seg": ["How", "can", "I", "improve", "my", "grades", "?"], "sample_type": "disturb"} +{"id": 1677, "sentence1": "What is your new year resolution , short term and long term goal for 2017 ?", "sentence2": "what will be your goals to reach in 2017", "text_q_seg": ["What", "is", "your", "new", "year", "resolution", ",", "short", "term", "and", "long", "term", "goal", "for", "2017", "?"], "text_t_seg": ["what", "will", "be", "your", "goals", "to", "reach", "in", "2017"], "sample_type": "disturb"} +{"id": 1678, "sentence1": "What will happen to the next Star Wars movies after Carrie Fisher 's death ?", "sentence2": "What will Carrie Fisher 's death mean for later Star Wars movies ?", "text_q_seg": ["What", "will", "happen", "to", "the", "next", "Star", "Wars", "movies", "after", "Carrie", "Fisher", "'s", "death", "?"], "text_t_seg": ["What", "will", "Carrie", "Fisher", "'s", "death", "mean", "for", "later", "Star", "Wars", "movies", "?"], "sample_type": "disturb"} +{"id": 1679, "sentence1": "Can you give me an analogy for a smooth ER ?", "sentence2": "What is an analogy for smooth ER ?", "text_q_seg": ["Can", "you", "give", "me", "an", "analogy", "for", "a", "smooth", "ER", "?"], "text_t_seg": ["What", "is", "an", "analogy", "for", "smooth", "ER", "?"], "sample_type": "disturb"} +{"id": 1680, "sentence1": "What is the best business to launch in Bangalore ?", "sentence2": "What is the best business in Bangalore to start up with ?", "text_q_seg": ["What", "is", "the", "best", "business", "to", "launch", "in", "Bangalore", "?"], "text_t_seg": ["What", "is", "the", "best", "business", "in", "Bangalore", "to", "start", "up", "with", "?"], "sample_type": "disturb"} +{"id": 1681, "sentence1": "Why does gst bill so important ?", "sentence2": "What is the impact of GST bill on a common man ?", "text_q_seg": ["Why", "does", "gst", "bill", "so", "important", "?"], "text_t_seg": ["What", "is", "the", "impact", "of", "GST", "bill", "on", "a", "common", "man", "?"], "sample_type": "disturb"} +{"id": 1682, "sentence1": "Which aircraft was better - the Douglas DC8 or the Boeing 707 ?", "sentence2": "Was the Douglas DC8 a superior aircraft to the Boeing 707 ?", "text_q_seg": ["Which", "aircraft", "was", "better", "-", "the", "Douglas", "DC8", "or", "the", "Boeing", "707", "?"], "text_t_seg": ["Was", "the", "Douglas", "DC8", "a", "superior", "aircraft", "to", "the", "Boeing", "707", "?"], "sample_type": "disturb"} +{"id": 1683, "sentence1": "How can I expand my IQ ?", "sentence2": "Are there any ways to increase my IQ ?", "text_q_seg": ["How", "can", "I", "expand", "my", "IQ", "?"], "text_t_seg": ["Are", "there", "any", "ways", "to", "increase", "my", "IQ", "?"], 
"sample_type": "disturb"} +{"id": 1684, "sentence1": "What does it mean when a girl take a day to reply to your text ?", "sentence2": "What does it imply when girls reply to a text a day after ?", "text_q_seg": ["What", "does", "it", "mean", "when", "a", "girl", "take", "a", "day", "to", "reply", "to", "your", "text", "?"], "text_t_seg": ["What", "does", "it", "imply", "when", "girls", "reply", "to", "a", "text", "a", "day", "after", "?"], "sample_type": "disturb"} +{"id": 1685, "sentence1": "How can I stop myself from watching too much of porn ?", "sentence2": "How shall I quit watching porn ?", "text_q_seg": ["How", "can", "I", "stop", "myself", "from", "watching", "too", "much", "of", "porn", "?"], "text_t_seg": ["How", "shall", "I", "quit", "watching", "porn", "?"], "sample_type": "disturb"} +{"id": 1686, "sentence1": "What will be the consequence of banning 500 and 1000 Rs notes on real estate sector in India ? Can we expect sharp fall in prices in short / long term ?", "sentence2": "What will the real estate look like now after the 500 and 1000 scraping ?", "text_q_seg": ["What", "will", "be", "the", "consequence", "of", "banning", "500", "and", "1000", "Rs", "notes", "on", "real", "estate", "sector", "in", "India", "?", "Can", "we", "expect", "sharp", "fall", "in", "prices", "in", "short", "/", "long", "term", "?"], "text_t_seg": ["What", "will", "the", "real", "estate", "look", "like", "now", "after", "the", "500", "and", "1000", "scraping", "?"], "sample_type": "disturb"} +{"id": 1687, "sentence1": "Is it worthwhile to pay for PhD from my pocket ?", "sentence2": "Is it foolish to pay for your PhD out of your own pocket ?", "text_q_seg": ["Is", "it", "worthwhile", "to", "pay", "for", "PhD", "from", "my", "pocket", "?"], "text_t_seg": ["Is", "it", "foolish", "to", "pay", "for", "your", "PhD", "out", "of", "your", "own", "pocket", "?"], "sample_type": "disturb"} +{"id": 1688, "sentence1": "What is the maximum file size that is allowed to be uploaded in Whatsapp ?", "sentence2": "What is the maximum file size on WhatsApp ?", "text_q_seg": ["What", "is", "the", "maximum", "file", "size", "that", "is", "allowed", "to", "be", "uploaded", "in", "Whatsapp", "?"], "text_t_seg": ["What", "is", "the", "maximum", "file", "size", "on", "WhatsApp", "?"], "sample_type": "disturb"} +{"id": 1689, "sentence1": "What are the best ways to learn to cook ?", "sentence2": "How can I learn to cook", "text_q_seg": ["What", "are", "the", "best", "ways", "to", "learn", "to", "cook", "?"], "text_t_seg": ["How", "can", "I", "learn", "to", "cook"], "sample_type": "disturb"} +{"id": 1690, "sentence1": "What was the first word uttered by human ?", "sentence2": "What is the first word ever spoken ?", "text_q_seg": ["What", "was", "the", "first", "word", "uttered", "by", "human", "?"], "text_t_seg": ["What", "is", "the", "first", "word", "ever", "spoken", "?"], "sample_type": "disturb"} +{"id": 1691, "sentence1": "Should I attend JEE Main exam offline or online ?", "sentence2": "Which mode is best for JEE MAIN 2017 online exam or offline ?", "text_q_seg": ["Should", "I", "attend", "JEE", "Main", "exam", "offline", "or", "online", "?"], "text_t_seg": ["Which", "mode", "is", "best", "for", "JEE", "MAIN", "2017", "online", "exam", "or", "offline", "?"], "sample_type": "disturb"} +{"id": 1692, "sentence1": "Is literally infinite number of unique human DNAs possible ?", "sentence2": "What is the maximum number of genetically unique human individuals ?", "text_q_seg": ["Is", "literally", "infinite", "number", "of", 
"unique", "human", "DNAs", "possible", "?"], "text_t_seg": ["What", "is", "the", "maximum", "number", "of", "genetically", "unique", "human", "individuals", "?"], "sample_type": "disturb"} +{"id": 1693, "sentence1": "What is motive of Mulayam Singh Yadav behind expelling Akhilesh Yadav from Samajwadi party ?", "sentence2": "What 's the reason for Mulayam Singh Yadav expelling Akhilesh Yadav from the Samajwadi Party for 6 years ?", "text_q_seg": ["What", "is", "motive", "of", "Mulayam", "Singh", "Yadav", "behind", "expelling", "Akhilesh", "Yadav", "from", "Samajwadi", "party", "?"], "text_t_seg": ["What", "'s", "the", "reason", "for", "Mulayam", "Singh", "Yadav", "expelling", "Akhilesh", "Yadav", "from", "the", "Samajwadi", "Party", "for", "6", "years", "?"], "sample_type": "disturb"} +{"id": 1694, "sentence1": "Why do we need to talk with eloquence ?", "sentence2": "Why do we need to philosophize with others ?", "text_q_seg": ["Why", "do", "we", "need", "to", "talk", "with", "eloquence", "?"], "text_t_seg": ["Why", "do", "we", "need", "to", "philosophize", "with", "others", "?"], "sample_type": "disturb"} +{"id": 1695, "sentence1": "How to recover e - mails that were deleted from a Gmail account ?", "sentence2": "Is there any way to retrieve my deleted emails from my Gmail account ?", "text_q_seg": ["How", "to", "recover", "e", "-", "mails", "that", "were", "deleted", "from", "a", "Gmail", "account", "?"], "text_t_seg": ["Is", "there", "any", "way", "to", "retrieve", "my", "deleted", "emails", "from", "my", "Gmail", "account", "?"], "sample_type": "disturb"} +{"id": 1696, "sentence1": "How to find my own gmail accounts list ?", "sentence2": "How can you find all of your Gmail accounts ?", "text_q_seg": ["How", "to", "find", "my", "own", "gmail", "accounts", "list", "?"], "text_t_seg": ["How", "can", "you", "find", "all", "of", "your", "Gmail", "accounts", "?"], "sample_type": "disturb"} +{"id": 1697, "sentence1": "Where can I get sparkling and well maintained cleaning service in Sydney ?", "sentence2": "Where are cleaning services provided in Sydney ?", "text_q_seg": ["Where", "can", "I", "get", "sparkling", "and", "well", "maintained", "cleaning", "service", "in", "Sydney", "?"], "text_t_seg": ["Where", "are", "cleaning", "services", "provided", "in", "Sydney", "?"], "sample_type": "disturb"} +{"id": 1698, "sentence1": "Can Fast and Furious 7 take $ 1 billion at the box office worldwide ?", "sentence2": "Will Furious 7 be the first movie in the franchise to gross a billion dollars ?", "text_q_seg": ["Can", "Fast", "and", "Furious", "7", "take", "$", "1", "billion", "at", "the", "box", "office", "worldwide", "?"], "text_t_seg": ["Will", "Furious", "7", "be", "the", "first", "movie", "in", "the", "franchise", "to", "gross", "a", "billion", "dollars", "?"], "sample_type": "disturb"} +{"id": 1699, "sentence1": "Is there a book suitable to learn language c++ ?", "sentence2": "What is a good book for learning the basics of C++ programming ?", "text_q_seg": ["Is", "there", "a", "book", "suitable", "to", "learn", "language", "c++", "?"], "text_t_seg": ["What", "is", "a", "good", "book", "for", "learning", "the", "basics", "of", "C++", "programming", "?"], "sample_type": "disturb"} +{"id": 1700, "sentence1": "What will be Barack Obama 's legacy when he leaves office ?", "sentence2": "Based on what we know now , what will Barack Obama 's historical legacy be ?", "text_q_seg": ["What", "will", "be", "Barack", "Obama", "'s", "legacy", "when", "he", "leaves", "office", "?"], "text_t_seg": ["Based", 
"on", "what", "we", "know", "now", ",", "what", "will", "Barack", "Obama", "'s", "historical", "legacy", "be", "?"], "sample_type": "disturb"} +{"id": 1701, "sentence1": "Why do n't people like Hilary Clinton ?", "sentence2": "What are the reasons that people dislike Hillary Clinton ?", "text_q_seg": ["Why", "do", "n't", "people", "like", "Hilary", "Clinton", "?"], "text_t_seg": ["What", "are", "the", "reasons", "that", "people", "dislike", "Hillary", "Clinton", "?"], "sample_type": "disturb"} +{"id": 1702, "sentence1": "How to see who viewed my videos on Instagram ?", "sentence2": "How can I see who viewed my video on Instagram but did n't like my video ?", "text_q_seg": ["How", "to", "see", "who", "viewed", "my", "videos", "on", "Instagram", "?"], "text_t_seg": ["How", "can", "I", "see", "who", "viewed", "my", "video", "on", "Instagram", "but", "did", "n't", "like", "my", "video", "?"], "sample_type": "disturb"} +{"id": 1703, "sentence1": "why is the sky so blue ?", "sentence2": "Why is the sky is blue ?", "text_q_seg": ["why", "is", "the", "sky", "so", "blue", "?"], "text_t_seg": ["Why", "is", "the", "sky", "is", "blue", "?"], "sample_type": "disturb"} +{"id": 1704, "sentence1": "How can I learn English well in a short time ?", "sentence2": "How can I learn English efficiently ?", "text_q_seg": ["How", "can", "I", "learn", "English", "well", "in", "a", "short", "time", "?"], "text_t_seg": ["How", "can", "I", "learn", "English", "efficiently", "?"], "sample_type": "disturb"} +{"id": 1705, "sentence1": "How can I stop eating junk and processed food addiction and stay healthy ?", "sentence2": "How to quit junk food ?", "text_q_seg": ["How", "can", "I", "stop", "eating", "junk", "and", "processed", "food", "addiction", "and", "stay", "healthy", "?"], "text_t_seg": ["How", "to", "quit", "junk", "food", "?"], "sample_type": "disturb"} +{"id": 1706, "sentence1": "What are the movies one should see ?", "sentence2": "What are the greatest movies I must see ?", "text_q_seg": ["What", "are", "the", "movies", "one", "should", "see", "?"], "text_t_seg": ["What", "are", "the", "greatest", "movies", "I", "must", "see", "?"], "sample_type": "disturb"} +{"id": 1707, "sentence1": "What is an accurate way to calculate your IQ ?", "sentence2": "How to test my IQ accurately ?", "text_q_seg": ["What", "is", "an", "accurate", "way", "to", "calculate", "your", "IQ", "?"], "text_t_seg": ["How", "to", "test", "my", "IQ", "accurately", "?"], "sample_type": "disturb"} +{"id": 1708, "sentence1": "Is our PM Modi doing the correct thing with 500 and 1000 Rs notes ?", "sentence2": "What is your view on the ban on Rs . 500 and Rs . 
1000 currency notes ?", "text_q_seg": ["Is", "our", "PM", "Modi", "doing", "the", "correct", "thing", "with", "500", "and", "1000", "Rs", "notes", "?"], "text_t_seg": ["What", "is", "your", "view", "on", "the", "ban", "on", "Rs", ".", "500", "and", "Rs", ".", "1000", "currency", "notes", "?"], "sample_type": "disturb"} +{"id": 1709, "sentence1": "Why is the firm 's marginal cost curve equal supply curve ?", "sentence2": "How can supply curve reflect marginal cost ?", "text_q_seg": ["Why", "is", "the", "firm", "'s", "marginal", "cost", "curve", "equal", "supply", "curve", "?"], "text_t_seg": ["How", "can", "supply", "curve", "reflect", "marginal", "cost", "?"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/download.sh b/examples/model_interpretation/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..7d98bfaceecc0fe0c685353f2d9cb57607c4d860 --- /dev/null +++ b/examples/model_interpretation/download.sh @@ -0,0 +1,10 @@ +wget https://paddlenlp.bj.bcebos.com/data/model_interpretation.tar +wait +tar -xvf model_interpretation.tar +wait +mv ./model_interpretation/vocab.char ./task/similarity/simnet/ +mv ./model_interpretation/vocab_QQP ./task/similarity/simnet/ +mv ./model_interpretation/simnet_vocab.txt ./task/similarity/simnet/ + +mv ./model_interpretation/vocab.sst2_train ./task/senti/rnn/ +mv ./model_interpretation/vocab.txt ./task/senti/rnn \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/accuracy/cal_acc.py b/examples/model_interpretation/evaluation/accuracy/cal_acc.py new file mode 100644 index 0000000000000000000000000000000000000000..93c32b46568d9b214c9aae7af95251faa453205d --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/cal_acc.py @@ -0,0 +1,92 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + This script includes code to calculating accuracy for results form textual similarity task +""" +import argparse +import json + + +def get_args(): + """ + get args + """ + parser = argparse.ArgumentParser("Acc eval") + parser.add_argument("--golden_path", required=True) + parser.add_argument("--pred_path", required=True) + parser.add_argument("--language", required=True, choices=["ch", "en"]) + + args = parser.parse_args() + return args + + +def load_from_file(args): + """ + load golden and pred data form file + :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, + golden_label: {sent_id, label}, pred_label: {sent_id, label} + """ + golden_f = open(args.golden_path, "r") + pred_f = open(args.pred_path, "r") + + golden_labels, pred_labels = {}, {} + + for golden_line in golden_f.readlines(): + golden_dict = json.loads(golden_line) + id = golden_dict["sent_id"] + golden_labels[id] = int(golden_dict["sent_label"]) + + for pred_line in pred_f.readlines(): + pred_dict = json.loads(pred_line) + id = pred_dict["id"] + pred_labels[id] = int(pred_dict["pred_label"]) + + result = {} + result["golden_labels"] = golden_labels + result["pred_labels"] = pred_labels + + return result + + +def cal_acc(golden_label, pred_label): + """ + The function actually calculate the accuracy. + """ + acc = 0.0 + for ids in pred_label: + if ids not in golden_label: + continue + if pred_label[ids] == golden_label[ids]: + acc += 1 + if len(golden_label): + acc /= len(golden_label) + return acc + + +def main(args): + """ + main function + """ + result = load_from_file(args) + golden_label = result["golden_labels"] + pred_label = result["pred_labels"] + + acc = cal_acc(golden_label, pred_label) + return acc, len(pred_label) + + +if __name__ == "__main__": + args = get_args() + acc, num = main(args) + print("total\tnum: %d\tacc: %.1f" % (num, acc * 100)) diff --git a/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py b/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..21ae6808c94af5909da6cc7726e38aa117ac6a92 --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py @@ -0,0 +1,265 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + This script is used to evaluate the performance of the mrc model (F1) +""" +from __future__ import print_function + +import argparse +import json +from collections import OrderedDict + +from paddlenlp.metrics.squad import squad_evaluate + + +def _tokenize_chinese_chars(text): + """ + :param text: input text, unicode string + :return: + tokenized text, list + """ + + def _is_chinese_char(cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + output = [] + buff = "" + for char in text: + cp = ord(char) + if _is_chinese_char(cp) or char == "=": + if buff != "": + output.append(buff) + buff = "" + output.append(char) + else: + buff += char + + if buff != "": + output.append(buff) + + return output + + +def _normalize(in_str): + """ + normalize the input unicode string + """ + in_str = in_str.lower() + sp_char = [ + ":", + "_", + "`", + ",", + "。", + ":", + "?", + "!", + "(", + ")", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + ",", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + "|", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + +def find_lcs(s1, s2): + """find the longest common subsequence between s1 ans s2""" + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + max_len = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > max_len: + max_len = m[i + 1][j + 1] + p = i + 1 + return s1[p - max_len : p], max_len + + +def evaluate_ch(ref_ans, pred_ans): + """ + ref_ans: reference answers, dict + pred_ans: predicted answer, dict + return: + f1_score: averaged F1 score + em_score: averaged EM score + total_count: number of samples in the reference dataset + skip_count: number of samples skipped in the calculation due to unknown errors + """ + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for query_id in ref_ans: + sample = ref_ans[query_id] + total_count += 1 + answers = sample["sent_label"] + try: + prediction = pred_ans[query_id]["pred_label"] + except: + skip_count += 1 + continue + if prediction == "": + _f1 = 1.0 + _em = 1.0 + else: + _f1 = calc_f1_score([answers], prediction) + _em = calc_em_score([answers], prediction) + f1 += _f1 + em += _em + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + return f1_score, em_score, total_count, skip_count + + +def calc_f1_score(answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = _tokenize_chinese_chars(_normalize(ans)) + prediction_segs = _tokenize_chinese_chars(_normalize(prediction)) + if args.debug: + 
print(json.dumps(ans_segs, ensure_ascii=False)) + print(json.dumps(prediction_segs, ensure_ascii=False)) + lcs, lcs_len = find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + prec = 1.0 * lcs_len / len(prediction_segs) + rec = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * prec * rec) / (prec + rec) + f1_scores.append(f1) + return max(f1_scores) + + +def calc_em_score(answers, prediction): + em = 0 + for ans in answers: + ans_ = _normalize(ans) + prediction_ = _normalize(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + +def read_dataset(file_path): + f = open(file_path, "r") + golden = {} + for l in f.readlines(): + ins = json.loads(l) + golden[ins["sent_id"]] = ins + f.close() + return golden + + +def read_model_prediction(file_path): + f = open(file_path, "r") + predict = {} + for l in f.readlines(): + ins = json.loads(l) + predict[ins["id"]] = ins + f.close() + return predict + + +def read_temp(file_path): + with open(file_path) as f1: + result = json.loads(f1.read()) + return result + + +def get_args(): + parser = argparse.ArgumentParser("mrc baseline performance eval") + parser.add_argument("--golden_path", help="dataset file") + parser.add_argument("--pred_file", help="model prediction file") + parser.add_argument("--language", help="the language of the model") + parser.add_argument("--debug", action="store_true", help="debug mode") + args = parser.parse_args() + return args + + +if __name__ == "__main__": + args = get_args() + + if args.language == "ch": + ref_ans = read_dataset(args.golden_path) + pred_ans = read_model_prediction(args.pred_file) + F1, EM, TOTAL, SKIP = evaluate_ch(ref_ans, pred_ans) + + output_result = OrderedDict() + output_result["F1"] = "%.3f" % F1 + output_result["EM"] = "%.3f" % EM + output_result["TOTAL"] = TOTAL + output_result["SKIP"] = SKIP + print(json.dumps(output_result)) + else: + ref_ans = read_dataset(args.golden_path) + pred_ans = read_temp(args.pred_file) + res = [] + for i in ref_ans: + ins = ref_ans[i] + ins["id"] = str(ins["sent_id"]) + ins["answers"] = [ins["sent_label"]] + if ins["answers"] == [""]: + ins["is_impossible"] = True + else: + ins["is_impossible"] = False + res.append(ins) + squad_evaluate(examples=res, preds=pred_ans) diff --git a/examples/model_interpretation/evaluation/accuracy/run_acc.sh b/examples/model_interpretation/evaluation/accuracy/run_acc.sh new file mode 100644 index 0000000000000000000000000000000000000000..cfa26fa204f0ce9ad0f0e8c3de9c3bcb889846b1 --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/run_acc.sh @@ -0,0 +1,31 @@ +### + # This script evaluates plausibility of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + echo $BASE_MODEL$'_'$INTER_MODE$'_'$LANGUAGE + + python3 ./cal_acc.py \ + --language $LANGUAGE \ + --golden_path $GOLDEN_PATH \ + --pred_path $PRED_PATH + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh b/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh 
new file mode 100644 index 0000000000000000000000000000000000000000..204bc6b4c20705e550583fa4d32656fcb8f2ff1c --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh @@ -0,0 +1,29 @@ +### + # This script is used to evaluate the performance of the mrc model (F1) +### +MODELS=("roberta_base" "roberta_large") +MODES=("attention" "integrated_gradient") + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "en" "ch"; + do + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + GOLDEN_PATH=../golden/mrc_${LANGUAGE}.tsv + if [[ $LANGUAGE == "ch" ]]; then + PRED_FILE=../../rationale_extraction/evaluation_data/mrc/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + else + PRED_FILE=../../task/mrc/output/mrc_en.${BASE_MODEL}/predict_ans + fi + + python3 mrc_f1_evaluate.py \ + --golden_path $GOLDEN_PATH \ + --pred_file $PRED_FILE \ + --language $LANGUAGE + done + done +done + diff --git a/examples/model_interpretation/evaluation/consistency/cal_map.py b/examples/model_interpretation/evaluation/consistency/cal_map.py new file mode 100644 index 0000000000000000000000000000000000000000..a6ed80d8058ade4e756f48e1655133f3981ce41c --- /dev/null +++ b/examples/model_interpretation/evaluation/consistency/cal_map.py @@ -0,0 +1,141 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +This script includes code to calculating MAP score for results form +sentiment analysis, textual similarity, and mrc task +""" +import argparse +import json +import math +import os + + +def get_args(): + parser = argparse.ArgumentParser("map eval") + parser.add_argument("--pred_path", required=True) + parser.add_argument("--golden_path", required=True) + parser.add_argument("--language", type=str, required=True, help="language that the model is built for") + args = parser.parse_args() + return args + + +def evids_load(args, path): + golden_f = open(args.golden_path, "r") + golden = {} + ins_num = 0 + for golden_line in golden_f.readlines(): + line = json.loads(golden_line) + if line["sample_type"] == "disturb": + ins_num += 1 + golden[line["sent_id"]] = line + + evids = {} + with open(path, "r") as f: + for line in f.readlines(): + dic = json.loads(line) + dic["sample_type"] = golden[dic["id"]]["sample_type"] + if "rel_ids" in golden[dic["id"]]: + dic["rel_ids"] = golden[dic["id"]]["rel_ids"] + evids[dic["id"]] = dic + return evids, ins_num + + +def _calc_MAP_by_bin(top_p, length_adv, adv_attriRank_list, ori_attriRank_list): + """ + This is our old way to calculate MAP, + which follows equation two in consistency section of README + """ + hits = 0 + sum_precs = 0.0 + length_t = math.ceil(length_adv * top_p) + adv_t = adv_attriRank_list[:length_t] + for char_idx, char in enumerate(adv_t): + if char in ori_attriRank_list[: char_idx + 1]: + hits += 1 + sum_precs += hits / (char_idx + 1) + if length_t > 0: + sum_precs /= length_t + return sum_precs + + +def _calc_MAP_by_bin_paper(top_p, length_adv, adv_attriRank_list, ori_attriRank_list): + """ + This function calculates MAP using the equation in our paper, + which follows equation one in consistency section of README + """ + total_precs = 0.0 + for i in range(length_adv): + hits = 0.0 + i += 1 + adv_t = adv_attriRank_list[:i] + for char_idx, char in enumerate(adv_t): + if char in ori_attriRank_list[:i]: + hits += 1 + hits = hits / i + total_precs += hits + if length_adv == 0: + return 0 + return total_precs / length_adv + + +def _calc_map(evids, key, ins_num): + t_map = 0.0 + + adv_num = 0 + ori_num = 0 + for ori_idx in evids: + if evids[ori_idx]["sample_type"] == "ori": + ori = evids[ori_idx] + ori_num += 1 + # One original instance can be related to several disturbed instance + for adv_idx in evids[ori_idx]["rel_ids"]: + if adv_idx in evids: + adv_num += 1 + adv = evids[adv_idx] + ori_attriRank_list = list(ori["rationale_token"][key]) + adv_attriRank_list = list(adv["rationale_token"][key]) + length_adv = len(adv_attriRank_list) + + sum_precs = _calc_MAP_by_bin_paper(1, length_adv, adv_attriRank_list, ori_attriRank_list) + t_map += sum_precs + + return t_map / ins_num, ori_num + adv_num + + +def cal_MAP(args, pred_path, la): + evids, ins_num = evids_load(args, pred_path) + if not evids: + print(pred_path + " file empty!") + return 0 + first_key = list(evids.keys())[0] + t_map = 0 + num = 0 + for i in range(len(evids[first_key]["rationale"])): + t_map_tmp, num_tmp = _calc_map(evids, i, ins_num) + t_map += t_map_tmp + num += num_tmp + t_map /= len(evids[first_key]["rationale"]) + num /= len(evids[first_key]["rationale"]) + print("total\t%d\t%.1f" % (num, 100 * t_map)) + return 0 + + +if __name__ == "__main__": + args = get_args() + la = args.language + pred_path = args.pred_path + if os.path.exists(pred_path): + cal_MAP(args, pred_path, la) + else: + print("Prediction file does not exists!") diff --git 
a/examples/model_interpretation/evaluation/consistency/run_map.sh b/examples/model_interpretation/evaluation/consistency/run_map.sh new file mode 100644 index 0000000000000000000000000000000000000000..8ed9f114c5a21483a685241debd84b7d6983e371 --- /dev/null +++ b/examples/model_interpretation/evaluation/consistency/run_map.sh @@ -0,0 +1,31 @@ +### + # This script evaluates consistency of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + python3 ./cal_map.py \ + --golden_path $GOLDEN_PATH \ + --pred_path $PRED_PATH \ + --language $LANGUAGE + + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py b/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py new file mode 100644 index 0000000000000000000000000000000000000000..f4ad0e56f236718a6ad4f9b235b0f9a0fd3dba69 --- /dev/null +++ b/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
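The faithfulness score computed by `newp_analysis.py` below counts a prediction as faithful when the model keeps its original label on the rationale-only input but loses it on the complement (non-rationale) input. A minimal sketch of the criterion with hypothetical records:

```python
def newp(records, num_golden):
    """Fraction of instances whose rationale preserves the prediction while its complement does not."""
    faithful = sum(
        1
        for r in records
        if r["rationale_pred"] == r["pred_label"] and r["no_rationale_pred"] != r["pred_label"]
    )
    return faithful / num_golden if num_golden else 0.0


records = [
    {"pred_label": 1, "rationale_pred": 1, "no_rationale_pred": 0},  # counts as faithful
    {"pred_label": 0, "rationale_pred": 0, "no_rationale_pred": 0},  # complement still predicts 0, not faithful
]
print("NewP = %.2f" % newp(records, num_golden=2))  # NewP = 0.50
```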
+""" + This script includes code to calculating NewP score for results form + sentiment analysis, textual similarity, and mrc task +""" +import argparse +import json + +import numpy as np + + +def get_args(): + """ + get args + """ + parser = argparse.ArgumentParser("NewP eval") + + parser.add_argument("--pred_path", required=True) + parser.add_argument("--golden_path", required=True) + + args = parser.parse_args() + return args + + +def data_load(args): + """ + load result data from file + """ + pred_path = args.pred_path + golden_path = args.golden_path + + with open(pred_path, "r") as f_text: + pred_list = [] + for line in f_text.readlines(): + line_dict = json.loads(line) + pred_list.append(line_dict) + + with open(golden_path, "r") as f_text: + gold_list = {} + for line in f_text.readlines(): + line_dict = json.loads(line) + gold_list[line_dict["sent_id"]] = line_dict + return pred_list, gold_list + + +def analysis(args, instance, gold_list): + """ + Analysis result according to result data + """ + New_P_list = [] + for ins in instance: + golden_label = ins["pred_label"] + text_correct = 1 if ins["rationale_pred"] == golden_label else 0 + text_exclusive_correct = 1 if ins["no_rationale_pred"] == golden_label else 0 + New_P_correct = 1 if (text_correct == 1 and text_exclusive_correct == 0) else 0 + New_P_list.append(New_P_correct) + + total_New_P = np.sum(New_P_list) / len(gold_list) if len(gold_list) else 0 + + print("total\t%d\t%.1f" % (len(New_P_list), 100 * total_New_P)) + + +if __name__ == "__main__": + args = get_args() + pred_list, gold_list = data_load(args) + analysis(args, pred_list, gold_list) diff --git a/examples/model_interpretation/evaluation/faithfulness/run_newp.sh b/examples/model_interpretation/evaluation/faithfulness/run_newp.sh new file mode 100644 index 0000000000000000000000000000000000000000..5110ea61ff715b366102c89f5d01c365b96db5db --- /dev/null +++ b/examples/model_interpretation/evaluation/faithfulness/run_newp.sh @@ -0,0 +1,30 @@ +### + # This script evaluates faithfulness of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + python3 ./newp_analysis.py \ + --pred_path $PRED_PATH \ + --golden_path $GOLDEN_PATH + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/plausibility/eval_mrc.py b/examples/model_interpretation/evaluation/plausibility/eval_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..b3bc04a5ba5bf5a3ba9ed45da90e0ba076f241b8 --- /dev/null +++ b/examples/model_interpretation/evaluation/plausibility/eval_mrc.py @@ -0,0 +1,112 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+    This script includes code to calculate the F1 score for results from the mrc task
+"""
+import argparse
+import json
+
+
+def get_args():
+    parser = argparse.ArgumentParser("F1 eval")
+
+    parser.add_argument("--golden_path", required=True)
+    parser.add_argument("--pred_path", required=True)
+    parser.add_argument("--language", required=True, choices=["ch", "en"])
+
+    args = parser.parse_args()
+    return args
+
+
+def load_from_file(args):
+    """
+    Load golden and pred data from file
+    :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list},
+             golden_label: {sent_id, label}, pred_label: {sent_id, label}
+    """
+    golden_f = open(args.golden_path, "r")
+    pred_f = open(args.pred_path, "r")
+
+    golden_raw_rationale, pred_rationale = {}, {}
+
+    for golden_line in golden_f.readlines():
+        golden_dict = json.loads(golden_line)
+        sent_id = golden_dict["sent_id"]
+        golden_raw_rationale[sent_id] = [int(x) for x in golden_dict["rationales"]]
+
+    for pred_line in pred_f.readlines():
+        pred_dict = json.loads(pred_line)
+        senti_id = pred_dict["id"]
+        pred_rationale[senti_id] = pred_dict["rationale"][0]
+
+    return golden_raw_rationale, pred_rationale
+
+
+def _f1(_p, _r):
+    if _p == 0 or _r == 0:
+        return 0
+    return 2 * _p * _r / (_p + _r)
+
+
+def calc_f1(golden_evid, pred_evid):
+    tp = set(pred_evid) & set(golden_evid)
+    prec = len(tp) / len(pred_evid) if len(pred_evid) else 0
+    rec = len(tp) / len(golden_evid) if len(golden_evid) else 0
+    f1 = _f1(prec, rec)
+    return f1
+
+
+def calc_model_f1(golden_dict, pred_dict):
+    """
+    :param golden_dict: dict
+    :param pred_dict: dict
+    :return: macro-f1, per-instance scores
+    """
+
+    scores = {}
+
+    for s_id in pred_dict.keys():
+        if s_id not in golden_dict:
+            continue
+        golden_evid = golden_dict[s_id]
+        pred_evid = pred_dict[s_id]
+
+        tp = set(golden_evid) & set(pred_evid)
+        prec = len(tp) / len(pred_evid) if len(pred_evid) else 0
+        rec = len(tp) / len(golden_evid) if len(golden_evid) else 0
+        f1 = _f1(prec, rec)
+        scores[s_id] = {
+            "tp_count": len(tp),
+            "pred_count": len(pred_evid),
+            "golden_count": len(golden_evid),
+            "prec": prec,
+            "rec": rec,
+            "f1": f1,
+        }
+
+    macro_f1 = sum(score["f1"] for score in scores.values()) / len(golden_dict) if len(golden_dict) else 0
+
+    return macro_f1, scores
+
+
+def main(args):
+    golden_raw, pred_raw = load_from_file(args)
+    macro_f1, scores = calc_model_f1(golden_raw, pred_raw)
+    return macro_f1, len(golden_raw), scores
+
+
+if __name__ == "__main__":
+    args = get_args()
+    macro_f1, num, scores = main(args)
+    print("total\tnum: %d\tmacro_f1: %.1f" % (num, macro_f1 * 100))
diff --git a/examples/model_interpretation/evaluation/plausibility/eval_senti.py b/examples/model_interpretation/evaluation/plausibility/eval_senti.py
new file mode 100644
index 0000000000000000000000000000000000000000..449755cf972c701b89c0ac968290914ffa4352e0
--- /dev/null
+++ b/examples/model_interpretation/evaluation/plausibility/eval_senti.py
@@ -0,0 +1,178 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
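Both `eval_mrc.py` above and `eval_senti.py`, whose diff begins just above, rate plausibility with the same token-level F1 over rationale token indices (`calc_f1`): precision over the predicted indices, recall over the golden ones. A toy example with hypothetical indices:

```python
def token_f1(golden, pred):
    tp = set(golden) & set(pred)
    prec = len(tp) / len(pred) if pred else 0
    rec = len(tp) / len(golden) if golden else 0
    return 2 * prec * rec / (prec + rec) if prec and rec else 0


# three overlapping tokens out of 4 golden / 5 predicted indices
print(round(token_f1(golden=[3, 4, 5, 6], pred=[4, 5, 6, 7, 8]), 3))  # 0.667
```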
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + This script includes code to calculating F1 score for results form sentiment analysis task +""" +import argparse +import json + + +def get_args(): + parser = argparse.ArgumentParser("F1 eval") + + parser.add_argument("--language", required=True, choices=["en", "ch"]) + parser.add_argument("--golden_path", required=True) + parser.add_argument("--pred_path", required=True) + + args = parser.parse_args() + return args + + +def load_from_file(args): + """ + Load golden and pred data form file + :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, + golden_label: {sent_id, label}, pred_label: {sent_id, label} + """ + golden_f = open(args.golden_path, "r") + pred_f = open(args.pred_path, "r") + + golden_raw_rationale, golden_label, pred_rationale, pred_label = {}, {}, {}, {} + + for golden_line in golden_f.readlines(): + golden_dict = json.loads(golden_line) + sent_id = golden_dict["sent_id"] + golden_raw_rationale[sent_id] = [] + for x in golden_dict["rationales"]: + temp = [int(y) for y in x] + golden_raw_rationale[sent_id].append(temp) + golden_label[sent_id] = int(golden_dict["sent_label"]) + + for pred_line in pred_f.readlines(): + pred_dict = json.loads(pred_line) + senti_id = pred_dict["id"] + pred_rationale[senti_id] = pred_dict["rationale"][0] + pred_label[senti_id] = int(pred_dict["pred_label"]) + + golden_f.close() + pred_f.close() + return golden_raw_rationale, pred_rationale, golden_label, pred_label + + +def _f1(_p, _r): + if _p == 0 or _r == 0: + return 0 + return 2 * _p * _r / (_p + _r) + + +def calc_f1(golden_evid, pred_evid): + tp = set(pred_evid) & set(golden_evid) + prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 + rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 + f1 = _f1(prec, rec) + return f1 + + +def combine(cur_max_f1, union_set, golden_evid, pred_evid): + """ + Args: + cur_max_f1 float: 当前最大f1 + union_set set(): 已合并集合 + golden_evid list(): 标注证据 + pred_evid list(): 预测证据 + """ + if len(union_set & set(golden_evid)) < len(golden_evid) and calc_f1(golden_evid, pred_evid) > 0: + new_union_set = union_set | set(golden_evid) + new_f1 = calc_f1(new_union_set, pred_evid) + if new_f1 > cur_max_f1: # 若union_set合并golden_evid后f1未超过cur_max_f1,则不更新union_set + cur_max_f1 = new_f1 + union_set = new_union_set + + return cur_max_f1, union_set + + +def pick_max_golden_evid(golden_raw, pred_raw): + """ + 从golden_evids中找出与pred_evid f1最大的golden_evid + """ + golden_dict = {} + err_rationale = [] + + for s_id in pred_raw.keys(): + if s_id not in golden_raw: + continue + golden_evids = golden_raw[s_id] + pred_evid = pred_raw[s_id] + max_f1 = 0 + + # 找f1最大的单条golden_evid + for golden_evid in golden_evids: + f1 = calc_f1(golden_evid, pred_evid) + if f1 > max_f1: + max_f1 = f1 + golden_dict[s_id] = golden_evid + + # 找f1最大的组合golden_evid + for start_id in range(len(golden_evids) - 1): + union_set = set() + cur_max_f1 = 0 + for id in range(start_id, len(golden_evids)): + golden_evid = 
golden_evids[id] + cur_max_f1, union_set = combine(cur_max_f1, union_set, golden_evid, pred_evid) + + if cur_max_f1 > max_f1: + max_f1 = cur_max_f1 + golden_dict[s_id] = list(union_set) + + if max_f1 == 0: + golden_dict[s_id] = [] + err_rationale.append(s_id) + + return golden_dict + + +def calc_model_f1(golden_dict, pred_dict, golden_len): + """ + :param golden_dict: dict + :param pred_dict: dict + :return: macro-f1, micro-f1 + """ + + scores = {} + + for s_id in pred_dict.keys(): + if s_id not in golden_dict: + continue + golden_evid = golden_dict[s_id] + pred_evid = pred_dict[s_id] + + tp = set(golden_evid) & set(pred_evid) + prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 + rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 + f1 = _f1(prec, rec) + scores[s_id] = { + "tp_count": len(tp), + "pred_count": len(pred_evid), + "golden_count": len(golden_evid), + "prec": prec, + "rec": rec, + "f1": f1, + } + + macro_f1 = (sum(score["f1"] for score in scores.values()) / golden_len) if golden_len else 0 + + return macro_f1, scores + + +def main(args): + golden_raw, pred_raw, golden_label, pred_label = load_from_file(args) + golden_dict = pick_max_golden_evid(golden_raw, pred_raw) + macro_f1, scores = calc_model_f1(golden_dict, pred_raw, len(golden_raw)) + return macro_f1, len(golden_raw) + + +if __name__ == "__main__": + args = get_args() + macro_f1, num = main(args) + print("num\t%.2f\tmacor_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/eval_similarity.py b/examples/model_interpretation/evaluation/plausibility/eval_similarity.py new file mode 100644 index 0000000000000000000000000000000000000000..0307248514bd3eae5c453e1a1c77b42af60d79cc --- /dev/null +++ b/examples/model_interpretation/evaluation/plausibility/eval_similarity.py @@ -0,0 +1,133 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
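Sentiment instances can carry several annotated rationales, so `pick_max_golden_evid` in `eval_senti.py` above scores each prediction against the single annotation, or union of overlapping annotations, that yields the highest F1. A simplified sketch of the selection idea (indices are hypothetical, and only the per-annotation maximum is shown, not the full union search):

```python
def token_f1(golden, pred):
    tp = set(golden) & set(pred)
    prec = len(tp) / len(pred) if pred else 0
    rec = len(tp) / len(golden) if golden else 0
    return 2 * prec * rec / (prec + rec) if prec and rec else 0


golden_annotations = [[0, 1, 2], [5, 6], [0, 1, 2, 5, 6]]  # alternative golden rationales
pred = [1, 2, 5]
best = max(golden_annotations, key=lambda g: token_f1(g, pred))
print(best, round(token_f1(best, pred), 3))  # [0, 1, 2, 5, 6] 0.75
```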
+""" + This script includes code to calculating F1 score for results form textual similarity task +""" +import argparse +import json + + +def get_args(): + """ + get args + """ + parser = argparse.ArgumentParser("F1 eval") + parser.add_argument("--golden_path", required=True) + parser.add_argument("--pred_path", required=True) + parser.add_argument("--language", required=True, choices=["ch", "en"]) + + args = parser.parse_args() + return args + + +def load_from_file(args): + """ + Load golden and pred data form file + :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, + golden_label: {sent_id, label}, pred_label: {sent_id, label} + """ + golden_f = open(args.golden_path, "r") + pred_f = open(args.pred_path, "r") + + golden_q_rationales, golden_t_rationales = {}, {} + pred_q_rationales, pred_t_rationales = {}, {} + golden_labels, pred_labels = {}, {} + + for golden_line in golden_f.readlines(): + golden_dict = json.loads(golden_line) + id = golden_dict["sent_id"] + # golden_rationale id + golden_q_rationales[id] = [int(x) for x in golden_dict["rationale_q_idx"]] + golden_t_rationales[id] = [int(x) for x in golden_dict["rationale_t_idx"]] + golden_labels[id] = int(golden_dict["sent_label"]) + + for pred_line in pred_f.readlines(): + pred_dict = json.loads(pred_line) + id = pred_dict["id"] + pred_q_rationales[id] = pred_dict["rationale"][0] + pred_t_rationales[id] = pred_dict["rationale"][1] + pred_labels[id] = int(pred_dict["pred_label"]) + + result = {} + result["golden_q_rationales"] = golden_q_rationales + result["golden_t_rationales"] = golden_t_rationales + result["pred_q_rationales"] = pred_q_rationales + result["pred_t_rationales"] = pred_t_rationales + result["golden_labels"] = golden_labels + result["pred_labels"] = pred_labels + + return result + + +def _f1(_p, _r): + if _p == 0 or _r == 0: + return 0 + return 2 * _p * _r / (_p + _r) + + +def calc_model_f1(golden_a_rationales, golden_b_rationales, pred_a_rationales, pred_b_rationales): + """ + :param golden_dict: dict + :param pred_dict: dict + :return: macro-f1, micro-f1 + """ + + scores = {} + + for id in pred_a_rationales.keys(): + golden_a_ratioanl = golden_a_rationales[id] + pred_a_rationale = pred_a_rationales[id] + tp_a = set(golden_a_ratioanl) & set(pred_a_rationale) + prec_a = len(tp_a) / len(pred_a_rationale) if len(pred_a_rationale) else 0 + rec_a = len(tp_a) / len(golden_a_ratioanl) if len(golden_a_ratioanl) else 0 + f1_a = _f1(prec_a, rec_a) + + golden_b_rationale = golden_b_rationales[id] + pred_b_rationale = pred_b_rationales[id] + tp_b = set(golden_b_rationale) & set(pred_b_rationale) + prec_b = len(tp_b) / len(pred_b_rationale) if len(pred_b_rationale) else 0 + rec_b = len(tp_b) / len(golden_b_rationale) if len(golden_b_rationale) else 0 + f1_b = _f1(prec_b, rec_b) + + scores[id] = { + "tp_count": (len(tp_a) + len(tp_b)) / 2, + "pred_count": (len(pred_a_rationale) + len(pred_b_rationale)) / 2, + "golden_count": (len(golden_a_ratioanl) + len(golden_b_rationale)) / 2, + "prec": (prec_a + prec_b) / 2, + "rec": (rec_a + rec_b) / 2, + "f1": (f1_a + f1_b) / 2, + } + + macro_f1 = ( + sum(score["f1"] for score in scores.values()) / len(golden_a_rationales) if len(golden_a_rationales) else 0 + ) + + return macro_f1, scores + + +def main(args): + result = load_from_file(args) + golden_a_rationales = result["golden_q_rationales"] + golden_b_rationales = result["golden_t_rationales"] + pred_a_rationales = result["pred_q_rationales"] + pred_b_rationales = result["pred_t_rationales"] + + 
macro_f1, scores = calc_model_f1(golden_a_rationales, golden_b_rationales, pred_a_rationales, pred_b_rationales) + return macro_f1, len(scores) + + +if __name__ == "__main__": + args = get_args() + macro_f1, num = main(args) + print("total\tnum: %d\tmacor_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/run_f1.sh b/examples/model_interpretation/evaluation/plausibility/run_f1.sh new file mode 100644 index 0000000000000000000000000000000000000000..8d5bd2e7a9f2e9b4169e1175da2475838ee38755 --- /dev/null +++ b/examples/model_interpretation/evaluation/plausibility/run_f1.sh @@ -0,0 +1,34 @@ +### + # This script evaluates plausibility of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + SAVE_PATH=res/ + [ -d $SAVE_PATH ] || mkdir -p $SAVE_PATH + + echo $BASE_MODEL$'_'$INTER_MODE$'_'$LANGUAGE + + python3 ./eval_${TASK}.py \ + --language $LANGUAGE \ + --golden_path $GOLDEN_PATH \ + --pred_path $PRED_PATH + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/imgs/equation1.png b/examples/model_interpretation/imgs/equation1.png new file mode 100644 index 0000000000000000000000000000000000000000..e1db9780248dd0a955a1a67bef66b262af988c18 Binary files /dev/null and b/examples/model_interpretation/imgs/equation1.png differ diff --git a/examples/model_interpretation/imgs/equation2.png b/examples/model_interpretation/imgs/equation2.png new file mode 100644 index 0000000000000000000000000000000000000000..fbb26c60e35a85814d6a25277fdca6ee17325fcd Binary files /dev/null and b/examples/model_interpretation/imgs/equation2.png differ diff --git a/examples/model_interpretation/imgs/equation3.png b/examples/model_interpretation/imgs/equation3.png new file mode 100644 index 0000000000000000000000000000000000000000..bf4f28f7c48a9a64b403bfc5db1f64249bc602bc Binary files /dev/null and b/examples/model_interpretation/imgs/equation3.png differ diff --git a/examples/model_interpretation/imgs/equation4.png b/examples/model_interpretation/imgs/equation4.png new file mode 100644 index 0000000000000000000000000000000000000000..a4743a67deb4365704a4a350047cbb57b221811a Binary files /dev/null and b/examples/model_interpretation/imgs/equation4.png differ diff --git a/examples/model_interpretation/imgs/equation5.png b/examples/model_interpretation/imgs/equation5.png new file mode 100644 index 0000000000000000000000000000000000000000..75bbe3be4ad581485696e6a12d88bf537a33f0f6 Binary files /dev/null and b/examples/model_interpretation/imgs/equation5.png differ diff --git a/examples/model_interpretation/imgs/example1.png b/examples/model_interpretation/imgs/example1.png new file mode 100644 index 0000000000000000000000000000000000000000..f0b7dda4dfef00b563601d590df3a54d547343c1 Binary files /dev/null and b/examples/model_interpretation/imgs/example1.png differ diff --git a/examples/model_interpretation/imgs/structure.png b/examples/model_interpretation/imgs/structure.png new file mode 100644 index 
0000000000000000000000000000000000000000..b7573e09ba02e16fa4d7a80755aac059aada6394 Binary files /dev/null and b/examples/model_interpretation/imgs/structure.png differ diff --git a/examples/model_interpretation/punctuations b/examples/model_interpretation/punctuations new file mode 100644 index 0000000000000000000000000000000000000000..11d057b89103d3c9efff3c0c3bd6021309d25d68 --- /dev/null +++ b/examples/model_interpretation/punctuations @@ -0,0 +1,82 @@ +” +。 +, +∈ +] +√ + +! +( +≥ +【 +“ +「 +÷ +《 +】 +! +ˊ +」 +. +_ +@ +~ +– +〕 +∶ +) +’ +℃ +》 +〈 +→ +、 ++ +| +; +: +∠ +' +‘ +, +? +× +△ +- +• +· +— +° +> +′ +● +; +… +" +Ⅱ +/ +< ++ += +^ +Ⅰ +? +[ +﹑ +﹐ +* +〔 +~ +: +( +) +〉 +◎ += +- +\ +% +% +& +≠ +. \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/available_gpu.py b/examples/model_interpretation/rationale_extraction/available_gpu.py new file mode 100644 index 0000000000000000000000000000000000000000..e05ecd3c666a240a5a6362126cb9151e7d27fb13 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/available_gpu.py @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific l +"""print available_gpu id, using nvgpu +""" + +import logging +import traceback + +import nvgpu + +logging.basicConfig( + level=logging.DEBUG, + format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s", + datefmt="%m-%d %H:%M:%S", + filename=None, + filemode="a", +) + +if __name__ == "__main__": + from argparse import ArgumentParser + + try: + arg_parser = ArgumentParser(description="print available_gpu id, using nvgpu") + arg_parser.add_argument("-b", "--best", default=None, type=int, help="output best N") + args = arg_parser.parse_args() + + if args.best is not None: + gpus = sorted(nvgpu.gpu_info(), key=lambda x: (x["mem_used"], x["index"])) + ids = [x["index"] for x in gpus] + print(",".join(ids[: args.best])) + else: + print(",".join(nvgpu.available_gpus())) + + except Exception: + traceback.print_exc() + exit(-1) diff --git a/examples/model_interpretation/rationale_extraction/generate.sh b/examples/model_interpretation/rationale_extraction/generate.sh new file mode 100644 index 0000000000000000000000000000000000000000..d72b20bda984ae91cf2f13fc79427f51dd0adc20 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/generate.sh @@ -0,0 +1,57 @@ +TASK=similarity + +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("roberta_large" "roberta_base" "lstm") + MODES=("lime" "attention" "integrated_gradient") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $TASK == "senti" ]]; then + RATIO_DIC="[0.311]" + elif [[ $TASK == "similarity" ]]; then + RATIO_DIC="[0.701,0.709]" + elif [[ $TASK == "mrc" ]]; then + RATIO_DIC="[0.096]" + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $TASK == "senti" ]]; then + RATIO_DIC="[0.192]" + 
elif [[ $TASK == "similarity" ]]; then + RATIO_DIC="[0.511,0.505]" + elif [[ $TASK == "mrc" ]]; then + RATIO_DIC="[0.102]" + fi + fi + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + PRED_PATH=../task/${TASK}/output/${TASK}_${LANGUAGE}.${BASE_MODEL}/interpret.${INTER_MODE} + SAVE_PATH=./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + [ -d $SAVE_PATH ] || mkdir -p $SAVE_PATH + + python3 ./newp_text_generate.py \ + --pred_path $PRED_PATH \ + --save_path $SAVE_PATH \ + --task $TASK \ + --language $LANGUAGE \ + --ratio $RATIO_DIC + wait + + sh ./run_2_pred_${TASK}_per.sh $BASE_MODEL $INTER_MODE $LANGUAGE + wait + + sh ./generate_evaluation_data.sh $BASE_MODEL $INTER_MODE $LANGUAGE $TASK + wait + + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}_finished + done + done +done diff --git a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py new file mode 100644 index 0000000000000000000000000000000000000000..162b7fb00f70fa6914293af0e664b5d0689b0435 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py @@ -0,0 +1,113 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + + +def get_args(): + parser = argparse.ArgumentParser("generate data") + + parser.add_argument("--pred_path", required=True) + parser.add_argument("--data_dir", required=True) + parser.add_argument("--data_dir2", required=True) + parser.add_argument("--save_path", required=True) + parser.add_argument("--inter_mode", required=True) + parser.add_argument("--base_model", required=True) + parser.add_argument("--language", required=True) + + args = parser.parse_args() + return args + + +def evids_load(path): + evids = [] + with open(path, "r") as f: + for line in f.readlines(): + dic = json.loads(line) + evids.append(dic) + return evids + + +def dataLoad(args): + base_path = args.data_dir + "/" + text_path = base_path + "rationale_text/dev/dev" + text_exclusive_path = base_path + "rationale_exclusive_text/dev/dev" + + with open(text_path, "r") as f_text: + text_dict_list = {} + for line in f_text.readlines(): + line_dict = json.loads(line) + text_dict_list[line_dict["id"]] = line_dict + + with open(text_exclusive_path, "r") as f_exclusive_text: + text_exclusive_dict_list = {} + for line in f_exclusive_text.readlines(): + line_dict = json.loads(line) + text_exclusive_dict_list[line_dict["id"]] = line_dict + + base_path = args.data_dir2 + "/" + text_path = base_path + "rationale_text/dev/dev" + text_exclusive_path = base_path + "rationale_exclusive_text/dev/dev" + + with open(text_path, "r") as f_text: + text_dict_list2 = {} + for line in f_text.readlines(): + line_dict = json.loads(line) + text_dict_list2[line_dict["id"]] = line_dict + + with open(text_exclusive_path, "r") as f_exclusive_text: + text_exclusive_dict_list2 = {} + for line in f_exclusive_text.readlines(): + line_dict = json.loads(line) + text_exclusive_dict_list2[line_dict["id"]] = line_dict + + return text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 + + +def r_data_generation( + args, evids, text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 +): + save_path = args.save_path + f_save = open(save_path, "w") + + res_data = [] + for ins in evids: + temp = {} + temp["id"] = ins["id"] + temp["pred_label"] = ins["pred_label"] + temp["rationale"] = text_dict_list2[ins["id"]]["context_idx"] + temp["no_rationale"] = text_exclusive_dict_list2[ins["id"]]["context_idx"] + if len(temp["rationale"]) > 1 and args.inter_mode != "lime" and not (args.base_model.startswith("roberta")): + for i in range(len(temp["rationale"][1])): + temp["rationale"][1][i] -= len(temp["rationale"][0]) + len(temp["no_rationale"][0]) + for i in range(len(temp["no_rationale"][1])): + temp["no_rationale"][1][i] -= len(temp["rationale"][0]) + len(temp["no_rationale"][0]) + temp["rationale_pred"] = text_dict_list[ins["id"]]["pred_label"] + temp["no_rationale_pred"] = text_exclusive_dict_list[ins["id"]]["pred_label"] + temp["rationale_token"] = text_dict_list2[ins["id"]]["context_token"] + + res_data.append(temp) + + f_save.write(json.dumps(temp, ensure_ascii=False) + "\n") + f_save.close() + + +if __name__ == "__main__": + args = get_args() + text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 = dataLoad(args) + evids = evids_load(args.pred_path) + r_data_generation( + args, evids, text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 + ) diff --git a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh 
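One non-obvious step in `r_data_generation` above: for two-segment inputs scored by the non-RoBERTa, non-LIME settings, the second segment's token indices appear to be counted over the whole concatenated input, so they are re-based by subtracting the total token count of the first segment (its rationale plus non-rationale parts). A hedged sketch of that shift with made-up indices:

```python
rationale = [[0, 2], [9, 11]]    # per-segment rationale token indices
no_rationale = [[1, 3], [10]]    # per-segment non-rationale token indices

offset = len(rationale[0]) + len(no_rationale[0])   # tokens in segment one
rationale[1] = [i - offset for i in rationale[1]]
no_rationale[1] = [i - offset for i in no_rationale[1]]
print(rationale, no_rationale)   # [[0, 2], [5, 7]] [[1, 3], [6]]
```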
new file mode 100644 index 0000000000000000000000000000000000000000..fa26d3beb8f9a7962159e7edf34e0fac26270399 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh @@ -0,0 +1,23 @@ +### + # This script concatenates results from previous running to generate a formated result for evaluation use +### + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=$4 + +PRED_PATH=../task/${TASK}/output/${TASK}_${LANGUAGE}.${BASE_MODEL}/interpret.${INTER_MODE} +SAVE_PATH=./evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + +SAVE_DIR=./evaluation_data/${TASK}/ +[ -d $SAVE_DIR ] || mkdir -p $SAVE_DIR + +python3 generate_evaluation_data.py \ + --data_dir ./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} \ + --data_dir2 ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} \ + --pred_path $PRED_PATH \ + --save_path $SAVE_PATH \ + --inter_mode $INTER_MODE \ + --base_model $BASE_MODEL \ + --language $LANGUAGE \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/mrc_pred.py b/examples/model_interpretation/rationale_extraction/mrc_pred.py new file mode 100644 index 0000000000000000000000000000000000000000..2868c86b12408267128795fd57c87291b858793d --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/mrc_pred.py @@ -0,0 +1,207 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import functools +import json +import os +import sys +import time +from pathlib import Path + +import paddle + +from paddlenlp.data import Dict, Pad +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../task/mrc") +from saliency_map.squad import RCInterpret, compute_prediction # noqa: E402 + +sys.path.append("..") +from roberta.modeling import RobertaForQuestionAnswering # noqa: E402 + +sys.path.remove("..") +sys.path.remove("../task/mrc") +sys.path.append("../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../..") + + +def get_args(): + parser = argparse.ArgumentParser("mrc predict with roberta") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=32, help="batchsize") + parser.add_argument("--epoch", type=int, default=3, help="epoch") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--warmup_proportion", type=float, default=0.1) + parser.add_argument("--lr", type=float, default=5e-5, help="learning rate") + parser.add_argument("--eval", action="store_true") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + parser.add_argument("--input_data", type=str, required=True) + args = parser.parse_args() + return args + + +def map_fn_DuCheckList(examples, args, tokenizer): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. 
+ contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + if args.language == "ch": + tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + else: + n = tokenized_example["offset_mapping"].index((0, 0), 1) + 2 # context start position + m = len(tokenized_example["offset_mapping"]) - 1 # context end position + 1 + tokenized_examples[i]["offset_mapping"] = [ + (o if n <= k <= m else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + + return tokenized_examples + + +def load_data(path): + data = {} + f = open(path, "r") + for line in f.readlines(): + line_split = json.loads(line) + data[line_split["id"]] = line_split + f.close() + return data + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained) + map_fn = functools.partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) + dev_ds = RCInterpret().read(os.path.join(args.data_dir, "dev")) + # dev_ds = load_dataset('squad', splits='dev_v2', data_files=None) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dev_dataloader, dev_ds + + +@paddle.no_grad() +def evaluate(model, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + loss, start_logits_tensor, end_logits_tensor, cls_logits = model(input_ids, token_type_ids) + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json, all_feature_index = compute_prediction( + data_loader.dataset.data, + 
data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + True, + 20, + args.max_seq_len, + 0.0, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open(os.path.join(args.output_dir, "dev"), "w") as f: + for id in all_predictions: + temp = {} + temp["id"] = int(id) + temp["pred_label"] = all_predictions[id] + temp["pred_feature"] = all_feature_index[id] + f.write(json.dumps(temp, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp): + + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + print("load model from %s" % args.init_checkpoint) + + evaluate(model, dataloader, args) diff --git a/examples/model_interpretation/rationale_extraction/newp_text_generate.py b/examples/model_interpretation/rationale_extraction/newp_text_generate.py new file mode 100644 index 0000000000000000000000000000000000000000..28e8e98157d8c498fa7139715a0e0b5533d4969e --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/newp_text_generate.py @@ -0,0 +1,269 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
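`newp_text_generate.py` below turns ranked attributions into a rationale / complement text pair: tokens arrive ordered by importance, the top `ratio` fraction (rounded up) becomes the rationale and the remainder the complement, and each side is re-sorted by token position before joining (space-joined for English, concatenated for Chinese). A minimal sketch, using a hypothetical attribution dict that maps token position to token (the real files store a list per position):

```python
import math


def split_by_ratio(char_attri, ratio, language="en"):
    ranked = list(char_attri.keys())            # keys already ordered by attribution
    cut = math.ceil(len(ranked) * ratio)
    sep = " " if language == "en" else ""
    rationale = sep.join(char_attri[k] for k in sorted(ranked[:cut], key=int))
    complement = sep.join(char_attri[k] for k in sorted(ranked[cut:], key=int))
    return rationale or "['UNK']", complement or "['UNK']"


attri = {"3": "terrible", "0": "the", "2": "was", "1": "movie"}  # most to least important
print(split_by_ratio(attri, ratio=0.311))  # ('the terrible', 'movie was')
```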
+ +import argparse +import json +import math +import os + + +def get_args(): + parser = argparse.ArgumentParser("generate data") + + parser.add_argument("--pred_path", required=True) + parser.add_argument("--save_path", required=True) + parser.add_argument("--language", required=True) + parser.add_argument("--task", required=True) + parser.add_argument("--ratio", type=str, required=True) + + args = parser.parse_args() + return args + + +def evids_load(path): + evids = [] + with open(path, "r") as f: + for line in f.readlines(): + dic = json.loads(line) + evids.append(dic) + return evids + + +def generate_for_senti(args, evid_dict, ratio): + r = {} + ex_r = {} + + label = evid_dict["pred_label"] + char_attri = list(evid_dict["char_attri"].keys()) + length = len(char_attri) + + rationale_ratio = ratio[0] + toprationale_text, toprationale_exclusive_text = [], [] + + keys = [int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]] + keys.sort() + for key in keys: + toprationale_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + keys = [int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]] + keys.sort() + for key in keys: + toprationale_exclusive_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + if args.language == "en": + toprationale_text = " ".join(toprationale_text) + toprationale_exclusive_text = " ".join(toprationale_exclusive_text) + else: + toprationale_text = "".join(toprationale_text) + toprationale_exclusive_text = "".join(toprationale_exclusive_text) + + if len(toprationale_text) == 0: + toprationale_text = "['UNK']" + if len(toprationale_exclusive_text) == 0: + toprationale_exclusive_text = "['UNK']" + + r["id"] = evid_dict["id"] + r["context"] = toprationale_text + r["context_idx"] = [[int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]]] + r["context_token"] = [[evid_dict["char_attri"][x][0] for x in char_attri[: math.ceil(length * rationale_ratio)]]] + r["label"] = label + ex_r["id"] = evid_dict["id"] + ex_r["context"] = toprationale_exclusive_text + ex_r["context_idx"] = [[int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]]] + ex_r["context_token"] = [ + [evid_dict["char_attri"][x][0] for x in char_attri[math.ceil(length * rationale_ratio) :]] + ] + ex_r["label"] = label + return r, ex_r + + +def generate_for_similarity(args, evid_dict, ratio): + r = {} + ex_r = {} + q_rationale_ratio = ratio[0] + t_rationale_ratio = ratio[1] + + label = evid_dict["pred_label"] + # query + q_char_attri = list(evid_dict["query_char_attri"].keys()) + q_length = len(q_char_attri) + + q_topR_Rtext, q_topR_noRtext = [], [] + keys = [int(x) for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]] + keys.sort() + for key in keys: + q_topR_Rtext.append(evid_dict["query_char_attri"][str(key)][0].strip()) + + keys = [int(x) for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]] + keys.sort() + for key in keys: + q_topR_noRtext.append(evid_dict["query_char_attri"][str(key)][0].strip()) + + if args.language == "ch": + q_topR_Rtext = "".join(q_topR_Rtext) + q_topR_noRtext = "".join(q_topR_noRtext) + else: + q_topR_Rtext = " ".join(q_topR_Rtext) + q_topR_noRtext = " ".join(q_topR_noRtext) + + if len(q_topR_Rtext) == 0: + q_topR_Rtext = "['UNK']" + if len(q_topR_noRtext) == 0: + q_topR_noRtext = "['UNK']" + + # title + t_char_attri = list(evid_dict["title_char_attri"].keys()) + t_length = len(t_char_attri) + + t_topR_Rtext, t_topR_noRtext = [], [] + keys = [int(x) for x in t_char_attri[: math.ceil(t_length * 
t_rationale_ratio)]] + keys.sort() + for key in keys: + t_topR_Rtext.append(evid_dict["title_char_attri"][str(key)][0]) + + keys = [int(x) for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]] + keys.sort() + for key in keys: + t_topR_noRtext.append(evid_dict["title_char_attri"][str(key)][0]) + + if args.language == "ch": + t_topR_Rtext = "".join(t_topR_Rtext) + t_topR_noRtext = "".join(t_topR_noRtext) + else: + t_topR_Rtext = " ".join(t_topR_Rtext) + t_topR_noRtext = " ".join(t_topR_noRtext) + + if len(t_topR_Rtext) == 0: + t_topR_Rtext = "['UNK']" + if len(t_topR_noRtext) == 0: + t_topR_noRtext = "['UNK']" + + r["id"] = evid_dict["id"] + r["context"] = [q_topR_Rtext, t_topR_Rtext] + r["context_idx"] = [ + [int(x) for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]], + [int(x) for x in t_char_attri[: math.ceil(t_length * t_rationale_ratio)]], + ] + r["context_token"] = [ + [evid_dict["query_char_attri"][x][0] for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]], + [evid_dict["title_char_attri"][x][0] for x in t_char_attri[: math.ceil(t_length * t_rationale_ratio)]], + ] + r["label"] = label + ex_r["id"] = evid_dict["id"] + ex_r["context"] = [q_topR_noRtext, t_topR_noRtext] + ex_r["context_idx"] = [ + [int(x) for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]], + [int(x) for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]], + ] + ex_r["context_token"] = [ + [evid_dict["query_char_attri"][x][0] for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]], + [evid_dict["title_char_attri"][x][0] for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]], + ] + ex_r["label"] = label + return r, ex_r + + +def generate_for_MRC(args, evid_dict, ratio): + id = evid_dict["id"] + question = evid_dict["question"] + char_attri = list(evid_dict["char_attri"].keys()) + length = len(char_attri) + + rationale_ratio = ratio[0] + toprationale_text, toprationale_exclusive_text = [], [] + keys = [int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]] + keys.sort() + for key in keys: + toprationale_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + keys = [int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]] + keys.sort() + for key in keys: + toprationale_exclusive_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + if args.language == "en": + toprationale_text = " ".join(toprationale_text) + toprationale_exclusive_text = " ".join(toprationale_exclusive_text) + else: + toprationale_text = "".join(toprationale_text) + toprationale_exclusive_text = "".join(toprationale_exclusive_text) + + if len(toprationale_text) == 0: + toprationale_text = "['UNK']" + if len(toprationale_exclusive_text) == 0: + toprationale_exclusive_text = "['UNK']" + + data_R_dict, Rdata_noR_dict = {}, {} + + data_R_dict["id"] = id + data_R_dict["title"] = "" + data_R_dict["context"] = toprationale_text + data_R_dict["question"] = question + data_R_dict["answers"] = [""] + data_R_dict["answer_starts"] = [-1] + data_R_dict["is_impossible"] = False + data_R_dict["context_idx"] = [[int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]]] + data_R_dict["context_token"] = [ + [evid_dict["char_attri"][x][0] for x in char_attri[: math.ceil(length * rationale_ratio)]] + ] + + Rdata_noR_dict["id"] = id + Rdata_noR_dict["title"] = "" + Rdata_noR_dict["context"] = toprationale_exclusive_text + Rdata_noR_dict["question"] = question + Rdata_noR_dict["answers"] = [""] + Rdata_noR_dict["answer_starts"] = [-1] + 
Rdata_noR_dict["is_impossible"] = False + Rdata_noR_dict["context_idx"] = [[int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]]] + Rdata_noR_dict["context_token"] = [ + [evid_dict["char_attri"][x][0] for x in char_attri[math.ceil(length * rationale_ratio) :]] + ] + + return data_R_dict, Rdata_noR_dict + + +def r_text_generation(evids, args): + print("num: {}".format(len(evids))) + + f_rationale_path = os.path.join(args.save_path, "rationale_text/dev") + f_rationale_exclusive_path = os.path.join(args.save_path, "rationale_exclusive_text/dev") + + if not os.path.exists(f_rationale_path): + os.makedirs(f_rationale_path) + if not os.path.exists(f_rationale_exclusive_path): + os.makedirs(f_rationale_exclusive_path) + + f_rationale = open(os.path.join(f_rationale_path, "dev"), "w") + f_rationale_exclusive = open(os.path.join(f_rationale_exclusive_path, "dev"), "w") + + rationale_ratio = json.loads(args.ratio) + for id, evid_dict in enumerate(evids): + if args.task == "senti": + data_R_dict, Rdata_noR_dict = generate_for_senti(args, evid_dict, rationale_ratio) + elif args.task == "similarity": + data_R_dict, Rdata_noR_dict = generate_for_similarity(args, evid_dict, rationale_ratio) + elif args.task == "mrc": + data_R_dict, Rdata_noR_dict = generate_for_MRC(args, evid_dict, rationale_ratio) + f_rationale.write(json.dumps(data_R_dict, ensure_ascii=False) + "\n") + f_rationale_exclusive.write(json.dumps(Rdata_noR_dict, ensure_ascii=False) + "\n") + + f_rationale.close() + f_rationale_exclusive.close() + + +if __name__ == "__main__": + args = get_args() + + evids = evids_load(args.pred_path) + r_text_generation(evids, args) diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh new file mode 100644 index 0000000000000000000000000000000000000000..c672ca7a1b60e5e7912d7589587d1b81aed225ca --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh @@ -0,0 +1,48 @@ +### + # This script generates mrc predictions for texts contains rationales only and contains non-rationales only +### +export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` +export PYTHONPATH=./:$PYTHONPATH + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=mrc + +for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; +do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=../task/${TASK}/models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + CKPT=../task/${TASK}/models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=../task/${TASK}/models/roberta_base_squad2_20211113_104225/ckpt.bin + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=../task/${TASK}/models/roberta_large_squad2_20211113_111300/ckpt.bin + fi + fi + + OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + python3 ./mrc_pred.py \ + --input_data ../data/${TASK}_${LANGUAGE} \ + --base_model $BASE_MODEL \ + --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ + --output_dir $OUTPUT \ + --from_pretrained $FROM_PRETRAIN \ + 
--batch_size 1 \ + --init_checkpoint $CKPT \ + --n-samples 300 \ + --doc_stride 128 \ + --language $LANGUAGE +done diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh new file mode 100644 index 0000000000000000000000000000000000000000..06dfea7790d8d47e5260b8d26b4c77036ead7648 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh @@ -0,0 +1,62 @@ +### + # This script generates sentiment predictions for texts contains rationales only and contains non-rationales only +### + +export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` +export PYTHONPATH=./:$PYTHONPATH + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=senti + +FROM_PRETRAIN='test' +VOCAB_PATH='test' +for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; +do + if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=../task/${TASK}/pretrained_models/saved_model_en/roberta_base_20220318_185322/model_10000/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=../task/${TASK}/pretrained_models/saved_model_en/roberta_large_20220318_183813/model_4000/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH=../task/${TASK}/rnn/vocab.sst2_train + CKPT=../task/${TASK}/rnn/checkpoints_en/final.pdparams + fi + + elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=../task/${TASK}/pretrained_models/saved_model_ch/roberta_base_20220318_155933/model_900/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_ch/roberta_base_20211206_180737/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=../task/${TASK}/pretrained_models/saved_model_ch/roberta_large_20220318_170123/model_900/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_ch/roberta_large_20211207_143351/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH=../task/${TASK}/rnn/vocab.txt + CKPT=../task/${TASK}/rnn/checkpoints_ch/final.pdparams + fi + fi + + OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + python3 ./sentiment_pred.py \ + --base_model $BASE_MODEL \ + --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ + --output_dir $OUTPUT \ + --vocab_path $VOCAB_PATH \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --n-samples 200 \ + --language $LANGUAGE +done \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh new file mode 100644 index 0000000000000000000000000000000000000000..9f0fecd865b7e054e62f835edaf569b562e2bc62 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh @@ -0,0 +1,54 @@ +### + # This script generates textual similarity 
predictions for texts contains rationales only and contains non-rationales only +### +export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` +export PYTHONPATH=./:$PYTHONPATH + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=similarity + +for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; +do + if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_base_20211109_205245/model_54000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_large_20211109_205649/model_46000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN=../task/${TASK}/skep_ernie_1.0_large_ch + CKPT=../task/${TASK}/simnet/checkpoints_${LANGUAGE}/final.pdparams + fi + + elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_base_20211018_104038/model_11400/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_large_20211018_152833/model_22000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='skep_ernie_1.0_large_ch' + CKPT=../task/${TASK}/simnet/checkpoints_${LANGUAGE}/final.pdparams + fi + fi + + OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + python3 similarity_pred.py \ + --base_model $BASE_MODEL \ + --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ + --output_dir $OUTPUT \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --max_seq_len 256 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --language $LANGUAGE +done \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/sentiment_pred.py b/examples/model_interpretation/rationale_extraction/sentiment_pred.py new file mode 100644 index 0000000000000000000000000000000000000000..4ab1397ed30439463063f7e136042f3f7d4a2b97 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/sentiment_pred.py @@ -0,0 +1,255 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
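The predictors in this patch (`mrc_pred.py` above, `sentiment_pred.py` below) collate batches the same way: samples are dicts of token-id lists, and `paddlenlp.data.Dict` plus `Pad` pad each field to the longest sequence in the batch. A small usage sketch; the pad ids here are placeholders where the real scripts use the tokenizer's values:

```python
from paddlenlp.data import Dict, Pad

# hypothetical pad ids; the scripts pass tokenizer.pad_token_id / pad_token_type_id
batchify_fn = Dict(
    {
        "input_ids": Pad(axis=0, pad_val=0),
        "token_type_ids": Pad(axis=0, pad_val=0),
    }
)

samples = [
    {"input_ids": [1, 5, 7, 2], "token_type_ids": [0, 0, 0, 0]},
    {"input_ids": [1, 9, 2], "token_type_ids": [0, 0, 0]},
]
input_ids, token_type_ids = batchify_fn(samples)
print(input_ids.shape)  # (2, 4): the shorter sequence is padded to length 4
```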
+ +import argparse +import json +import os +import sys +from functools import partial +from pathlib import Path + +import paddle +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../task/senti") +from rnn.model import BiLSTMAttentionModel, SelfInteractiveAttention # noqa: E402 +from rnn.utils import CharTokenizer, convert_example # noqa: E402 + +sys.path.append("..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("..") +sys.path.remove("../task/senti") +sys.path.append("../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../..") + + +def get_args(): + parser = argparse.ArgumentParser("sentiment analysis prediction") + + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--eval", action="store_true") + + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--start_id", type=int, default=0) + parser.add_argument("--vocab_path", type=str) + parser.add_argument("--language", type=str, required=True, help="Language that the model is built for") + args = parser.parse_args() + return args + + +class SentiData(DatasetBuilder): + def _read(self, filename, language): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + yield {"id": line_split["id"], "context": line_split["context"]} + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. 
+ batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +def map_fn_senti(examples, tokenizer, language): + print("load data %d" % len(examples)) + + contexts = [example["context"] for example in examples] + tokenized_examples = tokenizer(contexts, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + return tokenized_examples + + +def truncate_offset(seg, start_offset, end_offset): + seg_len = len(seg) + for n in range(len(start_offset) - 1, -1, -1): + if start_offset[n] < seg_len: + end_offset[n] = seg_len + break + start_offset.pop(n) + end_offset.pop(n) + + +def init_lstm_var(args): + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language, "../punctuations") + padding_idx = vocab.token_to_idx.get("[PAD]", 0) + + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=True, language=args.language) + + # init attention layer + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=len(tokenizer.vocab), + lstm_hidden_size=lstm_hidden_size, + num_classes=2, + padding_idx=padding_idx, + ) + + # Reads data and generates mini-batches. 
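+ # The dev split is read from <data_dir>/dev; mode="validation" keeps the sampler
+ # unshuffled, so the step index in the main loop stays aligned with
+ # dataloader.dataset.data when each example's id is recovered.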
+ dev_ds = SentiData().read(os.path.join(args.data_dir, "dev"), args.language) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=padding_idx), # input_ids + Stack(dtype="int64"), # seq len + ): [data for data in fn(samples)] + + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + return model, tokenizer, dev_loader + + +def init_roberta_var(args): + tokenizer = None + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer, language=args.language) + + dev_ds = SentiData().read(os.path.join(args.data_dir, "dev"), args.language) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader = init_roberta_var(args) + + elif args.base_model == "lstm": + model, tokenizer, dataloader = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp), open(str(args.output_dir) + "/dev", "w") as out_handle: + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # 为了取梯度,加载模型时dropout设为0 + print("load model from %s" % args.init_checkpoint) + + get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + + for step, d in tqdm(enumerate(dataloader)): + if step + 1 < args.start_id: + continue + + result = {} + if args.base_model.startswith("roberta"): + input_ids, token_type_ids = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + + tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:-1].tolist()) # list + + elif args.base_model == "lstm": + input_ids, seq_lens = d + fwd_args = [input_ids, seq_lens] + fwd_kwargs = {} + tokens = [tokenizer.vocab.idx_to_token[input_id] for input_id in input_ids.tolist()[0]] + + result["id"] = dataloader.dataset.data[step]["id"] + + probs, atts, embedded = model.forward_interpet(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + if args.language == "en": + result["context"] = tokenizer.convert_tokens_to_string(tokens) + else: + result["context"] = "".join(tokens) + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") diff --git a/examples/model_interpretation/rationale_extraction/similarity_pred.py b/examples/model_interpretation/rationale_extraction/similarity_pred.py new file mode 100644 index 0000000000000000000000000000000000000000..c6771189b1eeab1ebbae0ed060817cad06648b78 --- /dev/null +++ 
b/examples/model_interpretation/rationale_extraction/similarity_pred.py @@ -0,0 +1,229 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import sys +from functools import partial +from pathlib import Path + +import paddle +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("..") +from simnet.model import SimNet # noqa: E402 +from simnet.utils import CharTokenizer, preprocess_data # noqa: E402 + +sys.path.remove("../task/similarity") +sys.path.append("../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../..") + + +def get_args(): + parser = argparse.ArgumentParser("textual similarity prediction") + + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--language", type=str, required=True) + args = parser.parse_args() + return args + + +class SimilarityData(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + if args.language == "ch": + yield { + "id": line_split["id"], + "query": line_split["context"][0], + "title": line_split["context"][1], + } + else: + yield { + "id": line_split["id"], + "sentence1": line_split["context"][0], + "sentence2": line_split["context"][1], + } + + +def map_fn_senti(examples, tokenizer): + print("load data %d" % len(examples)) + if args.language == "ch": + query = "query" + title = "title" + else: + query = "sentence1" + 
title = "sentence2" + queries = [example[query] for example in examples] + titles = [example[title] for example in examples] + tokenized_examples = tokenizer(queries, titles, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer) + + dev_ds = SimilarityData().read(os.path.join(args.data_dir, "dev")) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader, dev_ds + + +def init_lstm_var(args): + if args.language == "ch": + vocab = Vocab.load_vocabulary("../task/similarity/simnet/vocab.char", unk_token="[UNK]", pad_token="[PAD]") + else: + vocab = Vocab.load_vocabulary("../task/similarity/simnet/vocab_QQP", unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = CharTokenizer(vocab, args.language, "../punctuations") + model = SimNet(network="lstm", vocab_size=len(vocab), num_classes=2) + + dev_ds = SimilarityData().read(os.path.join(args.data_dir, "dev")) + dev_examples = preprocess_data(dev_ds.data, tokenizer, language=args.language) + batches = [dev_examples[idx : idx + args.batch_size] for idx in range(0, len(dev_examples), args.batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + return model, tokenizer, batches, batchify_fn, vocab, dev_ds + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + + elif args.base_model == "lstm": + model, tokenizer, dataloader, batchify_fn, vocab, dev_ds = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp), open(str(args.output_dir) + "/dev", "w") as out_handle: + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # 为了取梯度,加载模型时dropout设为0 + print("load model from %s" % args.init_checkpoint) + + for step, d in tqdm(enumerate(dataloader)): + + result = {} + if args.base_model.startswith("roberta"): + input_ids, token_type_ids = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + + SEP_idx = input_ids.tolist()[0].index(tokenizer.sep_token_id) + q_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:SEP_idx].tolist()) # list + if args.language == "ch": + t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + 1 : -1].tolist()) # list + else: + t_tokens = 
tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + 2 : -1].tolist()) # list + + elif args.base_model == "lstm": + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(d) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + + fwd_args = [query_ids, title_ids, query_seq_lens, title_seq_lens] + fwd_kwargs = {} + q_tokens = [vocab._idx_to_token[idx] for idx in query_ids.tolist()[0]] + t_tokens = [vocab._idx_to_token[idx] for idx in title_ids.tolist()[0]] + + result["id"] = dev_ds.data[step]["id"] + + probs, atts, embedded = model.forward_interpret(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + if args.language == "ch": + result["query"] = "".join(q_tokens) + result["title"] = "".join(t_tokens) + else: + result["query"] = tokenizer.convert_tokens_to_string(q_tokens) + result["title"] = tokenizer.convert_tokens_to_string(t_tokens) + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") diff --git a/examples/model_interpretation/requirements.txt b/examples/model_interpretation/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6a6e0abed45786316dbcd88e99c32272e8f8a87e --- /dev/null +++ b/examples/model_interpretation/requirements.txt @@ -0,0 +1,5 @@ +nvgpu>=0.9.0 +regex>=2021.11.10 +spacy>=2.3.7 +tqdm>=4.62.3 +visualdl>=2.2.2 diff --git a/examples/model_interpretation/task/README.md b/examples/model_interpretation/task/README.md new file mode 100644 index 0000000000000000000000000000000000000000..03f1edca0dc88d58fcd18a95ffd165d9dde3c3ee --- /dev/null +++ b/examples/model_interpretation/task/README.md @@ -0,0 +1,19 @@ +### 基线模型预测 +#### 情感分析: + 预测:model_interpretation/rationale_extraction/sentiment_pred.py + 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_senti_per.sh (参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) +#### 文本相似度: + 预测:model_interpretation/rationale_extraction/similarity_pred.py + 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh(参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) +#### 阅读理解: + 预测:model_interpretation/rationale_extraction/mrc_pred.py + 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh(参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) +### 三个任务的基线模型训练 +#### 情感分析 + RoBERTa:model_interpretation/task/senti/pretrained_models/run_train.sh + LSTM:model_interpretation/task/senti/rnn/lstm_train.sh +#### 文本相似度 + RoBERTa:model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh + LSTM:model_interpretation/task/similarity/simnet/lstm_train.sh +#### 阅读理解 + RoBERTa:model_interpretation/task/mrc/run_train_rc.sh diff --git a/examples/model_interpretation/task/mrc/roberta/modeling.py b/examples/model_interpretation/task/mrc/roberta/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..4b376e43dabc0ecc0581de6a46de92f99857db85 --- /dev/null +++ b/examples/model_interpretation/task/mrc/roberta/modeling.py @@ -0,0 +1,719 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model + +sys.path.append("../..") +from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 + +sys.path.remove("../..") + +__all__ = [ + "RobertaModel", + "RobertaPretrainedModel", + "RobertaForSequenceClassification", + "RobertaForTokenClassification", + "RobertaForQuestionAnswering", +] + + +class RobertaEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + pad_token_id=0, + ): + super(RobertaEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RobertaPooler(nn.Layer): + def __init__(self, hidden_size): + super(RobertaPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class RobertaPretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained RoBerta models. It provides RoBerta related + `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ + """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "roberta-wwm-ext": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "roberta-wwm-ext-large": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbt3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbtl3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", + "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", + "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", + "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", + } + } + base_model_prefix = "roberta" + + def _init_weights(self, layer): + """Initialization hook""" + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.roberta.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class RobertaModel(RobertaPretrainedModel): + r""" + The bare Roberta Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. 
+ num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to ``"gelu"``. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer. Defaults to 0.02. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. + + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.01, + layer_norm_eps=1e-12, + pad_token_id=0, + ): + super(RobertaModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.embeddings = RobertaEmbeddings( + vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id + ) + encoder_layer = TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + ) + self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = RobertaPooler(hidden_size) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
+ Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, max_position_embeddings - 1]``. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple: Returns tuple (`sequence_output`, `pooled_output`). + + With the fields: + + - sequence_output (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - pooled_output (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaModel, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaModel.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] + ) + # CLS: 101; SEP: 102; PAD: 0 + baseline_ids = paddle.to_tensor( + [101] + [0] * (input_ids.shape[1] - 2) + [102], + dtype=input_ids.dtype, + place=input_ids.place, + stop_gradient=input_ids.stop_gradient, + ) + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + baseline_embedding_output = self.embeddings( + input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + pass + # stdev_spread = 0.15 + # stdev = stdev_spread * (orig_embedded.max() - orig_embedded.min()).numpy() + # noise = paddle.to_tensor(np.random.normal(0, stdev, orig_embedded.shape).astype(np.float32), + # stop_gradient=False) + # orig_embedded = orig_embedded + noise + if noise.upper() == "INTEGRATED": + embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( + embedding_output - baseline_embedding_output + ) + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + # encoder_outputs = self.encoder(embedding_output, attention_mask) + encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output, att_weights_list, embedding_output + + +class RobertaForQuestionAnswering(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output to + compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. + + Args: + roberta (:class:`RobertaModel`): + An instance of RobertaModel. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` of `RobertaModel` + instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, dropout=None): + super(RobertaForQuestionAnswering, self).__init__() + self.roberta = roberta # allow roberta to be config + self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) + self.classifier_cls = nn.Linear(self.roberta.config["hidden_size"], 2) + self.criterion = CrossEntropyLossForChecklist() + + # def forward(self, input_ids, token_type_ids=None): + def forward(self, *args, **kwargs): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. 
+ + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + start_pos = kwargs.pop("start_pos", None) + end_pos = kwargs.pop("end_pos", None) + cls_label = kwargs.pop("labels", None) + + # sequence_output, pooled_output, _, _ = self.roberta( + # input_ids, + # token_type_ids=token_type_ids, + # position_ids=None, + # attention_mask=None) + # print(kwargs) + sequence_output, pooled_output, _, _ = self.roberta(*args, **kwargs) + + logits = self.classifier(sequence_output) # (bsz, seq, 2) + logits = paddle.transpose(logits, perm=[2, 0, 1]) # (2, bsz, seq) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + cls_logits = self.classifier_cls(pooled_output) + + if start_pos is not None and end_pos is not None: + if len(start_pos.shape) != 1: + start_pos = start_pos.squeeze() + if len(end_pos.shape) != 1: + end_pos = end_pos.squeeze() + loss = self.criterion((start_logits, end_logits, cls_logits), (start_pos, end_pos, cls_label)) + else: + loss = None + + # return start_logit, end_logits + return loss, start_logits, end_logits, cls_logits + + def forward_interpret(self, *args, **kwargs): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + start_pos = kwargs.pop("start_pos", None) + end_pos = kwargs.pop("end_pos", None) + cls_label = kwargs.pop("labels", None) + + # sequence_output, pooled_output, _, _ = self.roberta( + # input_ids, + # token_type_ids=token_type_ids, + # position_ids=None, + # attention_mask=None) + # print(kwargs) + sequence_output, pooled_output, att_weights_list, embedding_output = self.roberta(*args, **kwargs) + + logits = self.classifier(sequence_output) # (bsz, seq, 2) + logits = paddle.transpose(logits, perm=[2, 0, 1]) # (2, bsz, seq) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + cls_logits = self.classifier_cls(pooled_output) + + if start_pos is not None and end_pos is not None: + if len(start_pos.shape) != 1: + start_pos = start_pos.squeeze() + if len(end_pos.shape) != 1: + end_pos = end_pos.squeeze() + loss = self.criterion((start_logits, end_logits, cls_logits), (start_pos, end_pos, cls_label)) + else: + loss = None + + # return start_logit, end_logits + return loss, start_logits, end_logits, cls_logits, att_weights_list, embedding_output + + +class CrossEntropyLossForChecklist(nn.Layer): + def __init__(self): + super(CrossEntropyLossForChecklist, self).__init__() + + def forward(self, y, label): + start_logits, end_logits, cls_logits = y # [(bsz, seq), (bsz, seq), (bsz, 2)] + start_position, end_position, answerable_label = label # [(bsz), (bsz), (bsz)] + + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + answerable_label = paddle.unsqueeze(answerable_label, axis=-1) + + start_loss = nn.functional.cross_entropy(input=start_logits, label=start_position, soft_label=False) + end_loss = nn.functional.cross_entropy(input=end_logits, label=end_position, soft_label=False) + cls_loss = nn.functional.cross_entropy(input=cls_logits, label=answerable_label, soft_label=False) + + mrc_loss = (start_loss + end_loss) / 2 + loss = (mrc_loss + cls_loss) / 2 + return loss + + +class RobertaForSequenceClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + self.softmax = nn.Softmax() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. 
+ token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Its data type should be float32 and it has a shape of [batch_size, num_classes]. + + Example: + .. code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + _, pooled_output, _, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + def forward_interpet( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + _, pooled_output, att_weights_list, embedding_output = self.roberta( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + noise=noise, + i=i, + n_samples=n_samples, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + probs = self.softmax(logits) + + return probs, att_weights_list, embedding_output + + +class RobertaForTokenClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForTokenClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits diff --git a/examples/model_interpretation/task/mrc/run_1_predict_rc.sh b/examples/model_interpretation/task/mrc/run_1_predict_rc.sh new file mode 100644 index 0000000000000000000000000000000000000000..1039a2e085464223f256782601cbe4abc181d0ff --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_1_predict_rc.sh @@ -0,0 +1,51 @@ +### + # This file contains script to run prediction of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=7 +export PYTHONPATH=./:$PYTHONPATH + +LANGUAGE=ch # LANGUAGE choose in [en, ch] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large] + +if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3epoch + #CKPT=models/roberta_base_ch_20211220_202953/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + # CKPT=models/ernie_large_DuReader-Checklist_20211007_163424/ckpt.bin # 3 epoch F1: 63.465 EM: 52.832 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_115837/ckpt.bin # 4 epoch F1: 63.323 EM: 52.920 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_142730/ckpt.bin # 3 epoch F1: 66.613 EM: 57.168 + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin + #CKPT=models/roberta_large_ch_20211220_203809/ckpt.bin #new fine_tune + fi +elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + #CKPT=models/roberta_base_en_20211221_201720/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + #CKPT=models/roberta_large_en_20211223_114421/ckpt.bin #new fine_tune + fi +fi + +OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x +python3 ./saliency_map/rc_prediction.py \ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --init_checkpoint $CKPT \ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --language $LANGUAGE \ + --max_seq_len 384 \ + --batch_size 32 \ + --epoch 2 \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh b/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..b504072d49bd9f997201b2ff5d699d70fdfadc43 --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh @@ -0,0 +1,57 @@ +### + # This file contains script to run predictions 
of all baseline models and languages on given input data + # The result of this script will be used to evaluate the performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +for BASE_MODEL in "roberta_base" "roberta_large"; +do + for LANGUAGE in "ch" "en"; + do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3epoch + #CKPT=models/roberta_base_ch_20211220_202953/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + # CKPT=models/ernie_large_DuReader-Checklist_20211007_163424/ckpt.bin # 3 epoch F1: 63.465 EM: 52.832 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_115837/ckpt.bin # 4 epoch F1: 63.323 EM: 52.920 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_142730/ckpt.bin # 3 epoch F1: 66.613 EM: 57.168 + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin + #CKPT=models/roberta_large_ch_20211220_203809/ckpt.bin #new fine_tune + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + #CKPT=models/roberta_base_en_20211221_201720/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + #CKPT=models/roberta_large_en_20211223_114421/ckpt.bin #new fine_tune + fi + fi + + OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + + if [[ ! -f ${OUTPUT}/predict_feature_index ]]; then + python3 ./saliency_map/rc_prediction.py \ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --init_checkpoint $CKPT \ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --language $LANGUAGE \ + --max_seq_len 384 \ + --batch_size 32 \ + --epoch 2 + fi + done +done \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_2_inter_rc.sh b/examples/model_interpretation/task/mrc/run_2_inter_rc.sh new file mode 100644 index 0000000000000000000000000000000000000000..5f038bfcaa987a894298c1f91d98380aed7e699c --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_2_inter_rc.sh @@ -0,0 +1,53 @@ +### + # This file contains script to generate saliency map of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +TASK=mrc +LANGUAGE=en # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large] +INTER_MODE=integrated_gradient # INTER_MODE choice in [attention, integrated_gradient] +START=0 + +if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch + fi +elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + + elif [[ $BASE_MODEL == 
"roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + fi +fi + + +OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x +python3 ./saliency_map/rc_interpretable.py \ + --ans_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_ans\ + --ans_idx_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_feature_index\ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --start_step $START \ + --language $LANGUAGE \ + --num_classes 2 \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh b/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..5908512f7ba94872fbc54ca5601c79629e38b95a --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh @@ -0,0 +1,61 @@ +### + # This file contains script to generate saliency map of all baseline models and languages on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=6 +export PYTHONPATH=./:$PYTHONPATH + +START=0 +TASK=mrc +for BASE_MODEL in "roberta_base" "roberta_large"; +do + for INTER_MODE in "attention" "integrated_gradient"; + do + for LANGUAGE in "ch" "en"; + do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + fi + fi + + + OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + + if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then + python3 ./saliency_map/rc_interpretable.py \ + --ans_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_ans\ + --ans_idx_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_feature_index\ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --start_step $START \ + --language $LANGUAGE\ + --num_classes 2 + fi + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_train_rc.sh b/examples/model_interpretation/task/mrc/run_train_rc.sh new file mode 100644 index 0000000000000000000000000000000000000000..ff7d95db9342b5bdd4519042b1cf377cbfd445ca --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_train_rc.sh @@ -0,0 +1,51 @@ +### + # This script is used to run fine-tunning of mrc roberta models. 
+### + +export CUDA_VISIBLE_DEVICES=7 +export PYTHONPATH=.:$PYTHONPATH + +LANGUAGE=ch # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # chooices [roberta_base, roberta_large] + +[ -d "logs" ] || mkdir -p "logs" +set -x + +if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + fi + EPOCH=3 + BSZ=2 + LR=3e-5 + MAX_SEQLEN=512 + DATA=DuReader-Checklist +elif [[ $LANGUAGE == 'en' ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + fi + EPOCH=2 + BSZ=16 + LR=5e-6 + MAX_SEQLEN=384 + DATA=squad2 +fi + +timestamp=`date +"%Y%m%d_%H%M%S"` +python3 saliency_map/rc_finetune.py \ + --train_data_dir ./data/$DATA/train/train.json \ + --dev_data_dir ./data/$DATA/dev/dev.json \ + --max_steps -1 \ + --from_pretrained $FROM_PRETRAIN \ + --epoch $EPOCH \ + --bsz $BSZ \ + --lr $LR \ + --max_seq_len $MAX_SEQLEN \ + --save_dir models/${BASE_MODEL}_${LANGUAGE}_${timestamp} \ + --language $LANGUAGE \ + --init_checkpoint models/${BASE_MODEL}_${LANGUAGE}_${timestamp}/ckpt.bin >> logs/log_${BASE_MODEL}_$timestamp 2>&1 + \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py b/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..f676df12d781da7e11ddf1af132da4748cd9ba54 --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py @@ -0,0 +1,280 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
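+
+# Fine-tunes the MRC baseline: RobertaForQuestionAnswering with span start/end heads plus
+# an answerable/unanswerable head trained via CrossEntropyLossForChecklist. Normally
+# launched through run_train_rc.sh, which selects DuReader-Checklist (ch) or squad2 (en)
+# data and the matching per-language hyper-parameters.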
+ +import argparse +import logging +import os +import re +import sys +import time +from pathlib import Path + +import paddle +from paddle.io import DataLoader +from roberta.modeling import RobertaForQuestionAnswering +from saliency_map.utils import create_if_not_exists, get_warmup_and_linear_decay +from squad import DuReaderChecklist +from visualdl import LogWriter + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("mrc task with roberta") + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--bsz", type=int, default=32, help="batchsize") + parser.add_argument("--epoch", type=int, default=3, help="epoch") + parser.add_argument("--train_data_dir", type=str, required=True, help="train data file") + parser.add_argument("--dev_data_dir", type=str, required=True, help="develop data file") + parser.add_argument( + "--max_steps", type=int, required=True, help="max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE" + ) + parser.add_argument("--warmup_proportion", type=float, default=0.1) + parser.add_argument("--lr", type=float, default=5e-5, help="learning rate") + parser.add_argument("--save_dir", type=Path, required=True, help="model output directory") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + args = parser.parse_args() + return args + + +def map_fn_DuCheckList_finetune(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + questions = [examples[i]["question"] for i in range(len(examples))] + contexts = [examples[i]["context"] + examples[i]["title"] for i in range(len(examples))] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + for i, tokenized_example in enumerate(tokenized_examples): + + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_example["input_ids"] # list(seq) + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). 
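+        # Here token_type_ids double as "sequence ids": with the Chinese (BERT-style) tokenizer they are
+        # 0 over the question segment and 1 over the context, which is what the `ch` branch below relies on.
+        # The English BPE tokenizer does not provide that split, so the `en` branch instead finds the context
+        # boundary from the (0, 0) offsets of the special tokens.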
+ sequence_ids = tokenized_example["token_type_ids"] # list(seq) + + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offsets = tokenized_example["offset_mapping"] # list(seq) + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] # int + if args.language == "ch": + answers = examples[sample_index]["answers"] # list + answer_starts = examples[sample_index]["answer_starts"] # list + else: + example = examples[sample_index] + example["question_len"] = len(example["question"].split()) + example["context_len"] = len(example["context"].split()) + + answers = example["answers"] # list + answer_starts = example["answer_starts"] # list + + # If no answers are given, set the cls_index as answer. + if len(answer_starts) == 0: + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + tokenized_examples[i]["answerable_label"] = 0 + else: + # Start/end character index of the answer in the text. + start_char = answer_starts[0] + end_char = start_char + len(answers[0]) + if args.language == "en": + # Start token index of the current span in the text. + token_start_index = 0 + while not (offsets[token_start_index] == (0, 0) and offsets[token_start_index + 1] == (0, 0)): + token_start_index += 1 + token_start_index += 2 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 2 + else: + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 2 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + tokenized_examples[i]["answerable_label"] = 0 + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). 
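+                    # Walk token_start_index right until its token starts after start_char, then step back one
+                    # token; walk token_end_index left until its token ends before end_char, then step forward one
+                    # token. The recorded span is thus the tightest token range that covers the answer characters.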
+ while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples[i]["start_positions"] = token_start_index - 1 + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples[i]["end_positions"] = token_end_index + 1 + tokenized_examples[i]["answerable_label"] = 1 + + return tokenized_examples + + +if __name__ == "__main__": + args = get_args() + + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained, num_classes=2) + + train_ds = DuReaderChecklist().read(args.train_data_dir) + dev_ds = DuReaderChecklist().read(args.dev_data_dir) + + train_ds.map(map_fn_DuCheckList_finetune, batched=True) + dev_ds.map(map_fn_DuCheckList_finetune, batched=True) + + log.debug("train set: %d" % len(train_ds)) + log.debug("dev set: %d" % len(dev_ds)) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.bsz, shuffle=True) + dev_batch_sample = paddle.io.DistributedBatchSampler(dev_ds, batch_size=args.bsz, shuffle=False) + + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + "answerable_label": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sample, collate_fn=batchify_fn, return_list=True + ) + + max_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epoch + lr_scheduler = paddle.optimizer.lr.LambdaDecay( + args.lr, get_warmup_and_linear_decay(max_steps, int(args.warmup_proportion * max_steps)) + ) + + param_name_to_exclue_from_weight_decay = re.compile(r".*layer_norm_scale|.*layer_norm_bias|.*b_0") + + opt = paddle.optimizer.AdamW( + lr_scheduler, + parameters=model.parameters(), + weight_decay=args.wd, + apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n), + grad_clip=paddle.nn.ClipGradByGlobalNorm(1.0) if args.language == "ch" else None, + ) + + scaler = paddle.amp.GradScaler(enable=args.use_amp) + + with LogWriter(logdir=str(create_if_not_exists(args.save_dir / "vdl"))) as log_writer: + with paddle.amp.auto_cast(enable=args.use_amp): + max_acc = 0.0 + log.debug("start training...") + for epoch in range(args.epoch): + s_time = time.time() + for step, d in enumerate(train_data_loader, start=1): + # input_ids: paddle.Tensor(bsz, seq) + # token_type_ids: paddle.Tensor(bsz, seq) + # start_positions: paddle.Tensor(bsz) + # end_positions: paddle.Tensor(bsz) + # answerable_label: paddle.Tensor(bsz) + input_ids, token_type_ids, start_positions, end_positions, answerable_label = d + loss, _, _, _ = model( + input_ids=input_ids, + token_type_ids=token_type_ids, + start_pos=start_positions, + end_pos=end_positions, + labels=answerable_label, + ) + loss = scaler.scale(loss) + loss.backward() + scaler.minimize(opt, loss) + opt.clear_grad() + lr_scheduler.step() + + if step % 100 == 0: + _lr = lr_scheduler.get_lr() + time_cost = time.time() - s_time + s_time = time.time() + if args.use_amp: + _l = (loss / scaler._scale).numpy() + msg 
= "[epoch-%d step-%d] train loss %.5f lr %.3e scaling %.3e" % ( + epoch, + step, + _l, + _lr, + scaler._scale.numpy(), + ) + else: + _l = loss.numpy() + msg = "[epoch-%d step-%d] train loss %.5f lr %.3e time_cost: %.1fs" % ( + epoch, + step, + _l, + _lr, + time_cost, + ) + log.debug(msg) + log_writer.add_scalar("loss", _l, step=step) + log_writer.add_scalar("lr", _lr, step=step) + + if step % 1000 == 0: + if args.save_dir is not None: + paddle.save(model.state_dict(), os.path.join(args.save_dir, "ckpt.bin")) + log.debug("save model!") + + if args.save_dir is not None: + paddle.save(model.state_dict(), os.path.join(args.save_dir, "ckpt.bin")) + log.debug("save model!") diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py b/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py new file mode 100644 index 0000000000000000000000000000000000000000..7df2bc45d51f23ca1ba38ec9149621aa624cfb24 --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py @@ -0,0 +1,497 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import collections +import json +import logging +import os +import sys +from functools import partial +from pathlib import Path + +import paddle +from roberta.modeling import RobertaForQuestionAnswering +from squad import RCInterpret +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, + match, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("mrc task with roberta") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=512, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=32, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, 
help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--start_step", type=int, default=0, help="start from which instance") + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + parser.add_argument( + "--ans_path", + type=str, + required=True, + help="the path of the file which stores the predicted answer from last step", + ) + parser.add_argument( + "--ans_idx_path", + type=str, + required=True, + help="the path of the file which stores the predicted answer index from last step", + ) + parser.add_argument("--num_classes", type=int, required=True, help="number of class") + args = parser.parse_args() + return args + + +def truncate_offset(seg, start_offset, end_offset): + seg_len = len(seg) + for n in range(len(start_offset) - 1, -1, -1): + if start_offset[n] < seg_len: + end_offset[n] = seg_len + break + start_offset.pop(n) + end_offset.pop(n) + + +def map_fn_DuCheckList(examples, args, tokenizer): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + if args.language == "en": + questions = [ + examples[i]["question"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + contexts = [ + examples[i]["context"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + else: + questions = [examples[i]["question"] for i in range(len(examples))] + contexts = [examples[i]["context"] for i in range(len(examples))] + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + log.debug("\nexample: %d" % len(examples)) + log.debug("feature: %d\n" % len(tokenized_examples)) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + # One example can give several spans, this is the index of the example containing this span of text. 
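+        # overflow_to_sample links every chunked feature back to its source example, so the example-level
+        # fields (id, question, context, sent_token) are copied onto each feature for the evidence matching
+        # performed after interpretation.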
+ sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + tokenized_examples[i]["question"] = examples[sample_index]["question"] + tokenized_examples[i]["context"] = examples[sample_index]["context"] + tokenized_examples[i]["sent_token"] = examples[sample_index]["sent_token"] + + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained, num_classes=args.num_classes) + map_fn = partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) + dev_ds = RCInterpret().read(args.data_dir) + + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "overflow_to_sample": Stack(dtype="int32"), + } + ): fn(samples) + + dev_dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dev_dataloader, dev_ds + + +def ch_per_example( + args, + scores_in_one_example, + prev_context_tokens, + dev_ds, + prev_example_idx, + ans_dic, + ans_idx_dic, + offset, + out_handle, +): + total_score = scores_in_one_example[-1] + assert len(prev_context_tokens) == len(total_score) + token_score_dict = [] + for idx in range(len(total_score)): + token_score_dict.append([idx, offset[idx], total_score[idx]]) + + prev_example = dev_ds.data[prev_example_idx] + char_attribution_dict = match( + prev_example["context"] + prev_example["title"], prev_example["sent_token"], token_score_dict + ) + result["id"] = prev_example["id"] + result["question"] = prev_example["question"] + result["title"] = prev_example["title"] + result["context"] = prev_example["context"] + prev_example["title"] + result["pred_label"] = ans_dic[str(result["id"])] + result["pred_feature"] = ans_idx_dic[str(result["id"])] + + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def en_per_example(inter_score, result, ans_dic, ans_idx_dic, offset, out_handle): + sorted_token = [] + for i in range(len(inter_score)): + sorted_token.append([i, offset[i], inter_score[i]]) + char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) + + result["pred_label"] = ans_dic[str(result["id"])] + result["pred_feature"] = ans_idx_dic[str(result["id"])] + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("sent_token") + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def load_pred_data(ans_path, ans_idx_path): + f = open(ans_path, "r") + ans_dic = json.loads(f.read()) + f.close() + f = open(ans_idx_path, "r") + ans_idx_dic = json.loads(f.read()) + f.close() + return ans_dic, 
ans_idx_dic + + +def extract_attention_scores( + args, + model, + result, + fwd_args, + fwd_kwargs, + prev_example_idx, + example_idx, + prev_context_tokens, + scores_in_one_example, + dev_ds, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, +): + with paddle.no_grad(): + # start_logits: (bsz, seq); end_logits: (bsz, seq); cls_logits: (bsz, 2) + # attention: list((bsz, head, seq, seq) * 12); embedded: (bsz, seq, emb) + _, start_logits, end_logits, cls_logits, attentions, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs + ) + + # Attention score equals to the mean of attention of each token in the question + attentions = attentions[-1][:, :, 1:SEP_idx, :].mean(2).mean(1) # attentions: (bsz, seq_len) + context_score = attentions[0, SEP_idx + add_idx : -1] # context_score: Tensor(context) + context_norm_score = context_score / context_score.sum(-1) + + if args.language == "ch": + if prev_example_idx is None or prev_example_idx == example_idx: + scores_in_one_example.append(context_norm_score.numpy().tolist()) + else: + ch_per_example( + args, + scores_in_one_example, + prev_context_tokens, + dev_ds, + prev_example_idx, + ans_dic, + ans_idx_dic, + prev_offset, + out_handle, + ) + scores_in_one_example = [context_norm_score.numpy().tolist()] + prev_example_idx = example_idx + prev_context_tokens = context_tokens + prev_offset = offset + else: + en_per_example(context_norm_score, result, ans_dic, ans_idx_dic, offset, out_handle) + return prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset + + +def extract_integrated_gradient_scores( + args, + dev_ds, + model, + result, + fwd_args, + fwd_kwargs, + SEP_idx, + add_idx, + prev_example_idx, + example_idx, + scores_in_one_example, + prev_context_tokens, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, +): + embedded_grads_list = [] # [Tensor(1, seq_len, embed_size)] + with open(os.path.join(args.output_dir, "predict_feature_index"), "r") as f_feature_index: + feature_index_dict = json.load(f_feature_index) + example = dev_ds.data[example_idx] + example_id = example["id"] + start_index, end_index = feature_index_dict[str(example_id)] + + for i in range(args.n_samples): + # embedded_start_grad + # start_logits: (bsz, seq); embedded: (bsz, seq, emb) + _, start_logits, _, _, _, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + + start_logit = start_logits[:, start_index].sum() + start_logit.backward(retain_graph=False) + embedded_start_grad = embedded.grad + model.clear_gradients() + # embedded_end_grad + # end_logits: (bsz, seq); embedded: (bsz, seq, emb) + _, _, end_logits, _, _, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + end_logit = end_logits[:, end_index].sum() + end_logit.backward(retain_graph=False) + embedded_end_grad = embedded.grad + model.clear_gradients() + + embedded_grad = (embedded_start_grad + embedded_end_grad) / 2 + embedded_grads_list.append(embedded_grad) + + if i == 0: + baseline_embedded = embedded # Tensor(1, seq_len, embed_size) + elif i == args.n_samples - 1: + pred_embedded = embedded # Tensor(1, seq_len, embed_size) + + embedded_grads_tensor = paddle.to_tensor( + embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + + trapezoidal_grads = ( + embedded_grads_tensor[1:] + embedded_grads_tensor[:-1] + ) / 2 # Tensor(n_samples-1, 1, seq_len, embed_size) + 
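+        # Integrated Gradients: the gradients collected at n_samples points along the straight-line path from
+        # the baseline embedding to the actual embedding are averaged with the trapezoidal rule to approximate
+        # the path integral; multiplying by (pred_embedded - baseline_embedded) below distributes the score
+        # difference over the embedding dimensions.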
integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size)xw + + inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) + inter_score = inter_score.sum(-1) # Tensor(1, seq_len) + inter_score.stop_gradient = True + + context_score = inter_score[0, SEP_idx + add_idx : -1] + context_norm_score = context_score / context_score.sum(-1) + if args.language == "ch": + if prev_example_idx is None or prev_example_idx == example_idx: + scores_in_one_example.append(context_norm_score.numpy().tolist()) + else: + ch_per_example( + args, + scores_in_one_example, + prev_context_tokens, + dev_ds, + prev_example_idx, + ans_dic, + ans_idx_dic, + prev_offset, + out_handle, + ) + scores_in_one_example = [context_norm_score.numpy().tolist()] + prev_example_idx = example_idx + prev_context_tokens = context_tokens + prev_offset = offset + else: + en_per_example(context_norm_score, result, ans_dic, ans_idx_dic, offset, out_handle) + return prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset + + +if __name__ == "__main__": + args = get_args() + if args.language == "ch": + add_idx = 1 + else: + add_idx = 2 + + ans_dic, ans_idx_dic = load_pred_data(args.ans_path, args.ans_idx_path) + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp), open( + os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" + ) as out_handle: + + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + log.debug("load model from %s" % args.init_checkpoint) + + err_total = [] + lime_score_total = [] + lime_relative_err_total = [] + lime_err_total = [] + + # Second forward: evidence extraction + scores_in_one_example = [] + prev_example_idx = None + prev_context_tokens = None + prev_offset = None + + get_subword_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + for step, d in tqdm(enumerate(dataloader)): + if step < args.start_step: + continue + + model.train() + + result = {} + input_ids, segment_ids, offset_map, example_idx = d + fwd_args = [input_ids, segment_ids] + fwd_kwargs = {} + + SEP_idx = input_ids.numpy()[0].tolist().index(tokenizer.sep_token_id) + context_ids = input_ids[0, SEP_idx + add_idx : -1] + offset = offset_map[0, SEP_idx + add_idx : -1] + context_tokens = tokenizer.convert_ids_to_tokens(context_ids.numpy().tolist()) + + if args.language == "en": + example = dev_ds.data[step] + result["id"] = example["id"] + result["question"] = example["question"] + result["title"] = example["title"] + result["context"] = example["context"] + example["title"] + result["sent_token"] = example["sent_token"] + + if args.inter_mode == "attention": + prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset = extract_attention_scores( + args, + model, + result, + fwd_args, + fwd_kwargs, + prev_example_idx, + example_idx, + prev_context_tokens, + scores_in_one_example, + dev_ds, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, + ) + + elif args.inter_mode == "integrated_gradient": + ( + prev_example_idx, + prev_context_tokens, + scores_in_one_example, + prev_offset, + ) = extract_integrated_gradient_scores( + args, + dev_ds, + model, + result, + fwd_args, + fwd_kwargs, + SEP_idx, + add_idx, + prev_example_idx, + example_idx, + scores_in_one_example, + 
prev_context_tokens, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, + ) + else: + raise KeyError(f"Unkonwn interpretable mode: {args.inter_mode}") + + # Deal with last example + if args.language == "ch": + + feature = dev_ds.new_data[-1] + input_ids = feature["input_ids"] + SEP_idx = input_ids.index(tokenizer.sep_token_id) + context_ids = input_ids[SEP_idx + 1 : -1] + offset = feature["offset_mapping"][SEP_idx + 1 : -1] + context_tokens = tokenizer.convert_ids_to_tokens(context_ids) + + ch_per_example( + args, scores_in_one_example, context_tokens, dev_ds, -1, ans_dic, ans_idx_dic, offset, out_handle + ) diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py b/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py new file mode 100644 index 0000000000000000000000000000000000000000..c1557de19d109b47884993a40c26d569670559a2 --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py @@ -0,0 +1,195 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import logging +import os +import sys +import time +from functools import partial +from pathlib import Path + +import paddle +from roberta.modeling import RobertaForQuestionAnswering +from squad import RCInterpret, compute_prediction + +from paddlenlp.data import Dict, Pad +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("mrc task with roberta") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=32, help="batchsize") + parser.add_argument("--epoch", type=int, default=3, help="epoch") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When 
splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + args = parser.parse_args() + return args + + +def map_fn_DuCheckList(examples, args, tokenizer): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + if args.language == "en": + contexts = [ + examples[i]["context"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + questions = [ + examples[i]["question"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + else: + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
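+        # For Chinese, token_type_ids reliably mark the context (== 1), so non-context offsets are masked with
+        # None directly. For English, the BPE tokenizer's token_type_ids are not informative, so the context
+        # window is located positionally: it starts two tokens after the first (0, 0) special-token offset and
+        # ends just before the final special token; everything outside that window is masked with None.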
+ if args.language == "ch": + tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + else: + n = tokenized_example["offset_mapping"].index((0, 0), 1) + 2 # context start position + m = len(tokenized_example["offset_mapping"]) - 1 # context end position + 1 + tokenized_examples[i]["offset_mapping"] = [ + (o if n <= k <= m else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained) + map_fn = partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) + dev_ds = RCInterpret().read(args.data_dir) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dev_dataloader, dev_ds + + +@paddle.no_grad() +def evaluate(model, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + loss, start_logits_tensor, end_logits_tensor, cls_logits = model(input_ids, token_type_ids) + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + log.debug("Processing example: %d" % len(all_start_logits)) + log.debug("time per 1000:%.1f" % (time.time() - tic_eval)) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json, all_feature_index = compute_prediction( + data_loader.dataset.data, + data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + True, + 20, + args.max_seq_len, + 0.0, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open(os.path.join(args.output_dir, "predict_ans"), "w") as f_ans_pred: + f_ans_pred.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + with open(os.path.join(args.output_dir, "predict_feature_index"), "w") as f_feature_index: + f_feature_index.write(json.dumps(all_feature_index, ensure_ascii=False, indent=4) + "\n") + + # squad_evaluate(examples=data_loader.dataset.data, preds=all_predictions, na_probs=scores_diff_json) + # model.train() + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp): + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + log.debug("load model from %s" % args.init_checkpoint) + evaluate(model, dataloader, args) diff --git a/examples/model_interpretation/task/mrc/saliency_map/squad.py b/examples/model_interpretation/task/mrc/saliency_map/squad.py new file mode 100644 index 
0000000000000000000000000000000000000000..3ae811de5e5bddc666c36274dbb25c7a5224623a --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/squad.py @@ -0,0 +1,476 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# !/usr/bin/env python3 +import collections +import json + +import numpy as np + +from paddlenlp.datasets import DatasetBuilder + + +class Similarity(DatasetBuilder): + # similarity test 21.10.3 + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = line.strip().split("\t") + assert len(line_split) == 3 + yield {"text_a": line_split[0], "text_b": line_split[1], "label": line_split[2]} + + +class RCInterpret(DatasetBuilder): + # interpret 21.9.24 + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + example_dic = json.loads(line) + id = example_dic["id"] + title = example_dic["title"] + context = example_dic["context"] + question = example_dic["question"] + if "sent_token" in example_dic: + sent_token = example_dic["sent_token"] + yield { + "id": id, + "title": title, + "context": context, + "question": question, + "sent_token": sent_token, + } + else: + yield {"id": id, "title": title, "context": context, "question": question} + + +class DuReaderChecklist(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + input_data = json.load(f)["data"] + + for entry in input_data: + # title = entry.get("title", "").strip() + for paragraph in entry["paragraphs"]: + context = paragraph["context"].strip() + title = paragraph.get("title", "").strip() + for qa in paragraph["qas"]: + qas_id = qa["id"] + question = qa["question"].strip() + answer_starts = [] + answers = [] + is_impossible = False + + if "is_impossible" in qa.keys(): + is_impossible = qa["is_impossible"] + + answer_starts = [answer["answer_start"] for answer in qa.get("answers", [])] + answers = [answer["text"].strip() for answer in qa.get("answers", [])] + + yield { + "id": qas_id, + "title": title, + "context": context, + "question": question, + "answers": answers, + "answer_starts": answer_starts, + "is_impossible": is_impossible, + } + + +def compute_prediction_checklist( + examples, + features, + predictions, + version_2_with_negative: bool = False, + n_best_size: int = 20, + max_answer_length: int = 30, + cls_threshold: float = 0.5, +): + """ + Post-processes the predictions of a question-answering model to convert them to answers that are substrings of the + original contexts. This is the base postprocessing functions for models that only return start and end logits. + + Args: + examples: The non-preprocessed dataset (see the main script for more information). + features: The processed dataset (see the main script for more information). 
+ predictions (:obj:`Tuple[np.ndarray, np.ndarray]`): + The predictions of the model: two arrays containing the start logits and the end logits respectively. Its + first dimension must match the number of elements of :obj:`features`. + version_2_with_negative (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the underlying dataset contains examples with no answers. + n_best_size (:obj:`int`, `optional`, defaults to 20): + The total number of n-best predictions to generate when looking for an answer. + max_answer_length (:obj:`int`, `optional`, defaults to 30): + The maximum length of an answer that can be generated. This is needed because the start and end predictions + are not conditioned on one another. + null_score_diff_threshold (:obj:`float`, `optional`, defaults to 0): + The threshold used to select the null answer: if the best answer has a score that is less than the score of + the null answer minus this threshold, the null answer is selected for this example (note that the score of + the null answer for an example giving several features is the minimum of the scores for the null answer on + each feature: all features must be aligned on the fact they `want` to predict a null answer). + + Only useful when :obj:`version_2_with_negative` is :obj:`True`. + """ + + assert ( + len(predictions) == 3 + ), "`predictions` should be a tuple with two elements (start_logits, end_logits, cls_logits)." + all_start_logits, all_end_logits, all_cls_logits = predictions + + assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." # 样本数 + + # Build a map example to its corresponding features. + features_per_example = collections.defaultdict(list) + for i, feature in enumerate(features): + features_per_example[feature["example_id"]].append( + i + ) # feature: dict(keys: 'input_ids', 'token_type_ids', 'offset_mapping', 'overflow_to_sample', 'example_id') + + # The dictionaries we have to fill. + all_predictions = collections.OrderedDict() + all_feature_index = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + all_cls_predictions = [] + + # Let's loop over all the examples! + for example_index, example in enumerate(examples): + # Those are the indices of the features associated to the current example. + feature_indices = features_per_example[example["id"]] + + # if len(feature_indices) > 1: + # print('example_index: %s' % example_index) + + min_null_prediction = None + prelim_predictions = [] + score_answerable = -1 + # Looping through all the features associated to the current example. + for feature_index in feature_indices: + # We grab the predictions of the model for this feature. + start_logits = all_start_logits[feature_index] + end_logits = all_end_logits[feature_index] + cls_logits = all_cls_logits[feature_index] + # This is what will allow us to map some the positions in our logits to span of texts in the original context. + offset_mapping = features[feature_index][ + "offset_mapping" + ] # list[tuple(2)], list长度与input_ids, start_logits, end_logits相同 + + # if len(feature_indices) > 1: + # print('offset_mapping: %s' % offset_mapping) + + # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context + # available in the current feature. 
+ token_is_max_context = features[feature_index].get("token_is_max_context", None) + + exp_answerable_scores = np.exp(cls_logits - np.max(cls_logits)) + feature_answerable_score = exp_answerable_scores / exp_answerable_scores.sum() + if feature_answerable_score[-1] > score_answerable: + score_answerable = feature_answerable_score[-1] + answerable_probs = feature_answerable_score + + # Update minimum null prediction. + feature_null_score = start_logits[0] + end_logits[0] + if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: + min_null_prediction = { + "feature_index": (0, 0), + "offsets": (0, 0), + "score": feature_null_score, + "start_logit": start_logits[0], + "end_logit": end_logits[0], + } + + # Go through all possibilities for the `n_best_size` greater start and end logits. + start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() # list(n_best_size) 从大到小 + end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() # list(n_best_size) 从大到小 + for start_index in start_indexes: + for end_index in end_indexes: + # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond + # to part of the input_ids that are not in the context. + if ( + start_index >= len(offset_mapping) + or end_index >= len(offset_mapping) + or offset_mapping[start_index] is None + or offset_mapping[end_index] is None # CLS、Question和第一个SEP的位置 + or offset_mapping[start_index] == (0, 0) + or offset_mapping[end_index] == (0, 0) # 第二个SEP的位置 + ): + continue + # Don't consider answers with a length that is either < 0 or > max_answer_length. + if end_index < start_index or end_index - start_index + 1 > max_answer_length: + continue + # Don't consider answer that don't have the maximum context available (if such information is + # provided). + if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): + continue + prelim_predictions.append( + { + "feature_index": (start_index, end_index), + "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]), + "score": start_logits[start_index] + end_logits[end_index], + "start_logit": start_logits[start_index], + "end_logit": end_logits[end_index], + } + ) + if version_2_with_negative: + # Add the minimum null prediction + prelim_predictions.append(min_null_prediction) + pred_cls_label = np.argmax(np.array(answerable_probs)) + all_cls_predictions.append([example["id"], pred_cls_label, answerable_probs[0], answerable_probs[1]]) + + # Only keep the best `n_best_size` predictions. + predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] + + # Add back the minimum null prediction if it was removed because of its low score. + if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): + predictions.append(min_null_prediction) + + # Use the offsets to gather the answer text in the original context. + context = example["context"] + for pred in predictions: + # offsets = pred.pop("offsets") + offsets = pred["offsets"] + pred["text"] = context[offsets[0] : offsets[1]] if context[offsets[0] : offsets[1]] != "" else "no answer" + + # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid + # failure. 
+ if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == "no answer"): + predictions.insert( + 0, + { + "feature_index": (0, 0), + "offsets": (0, 0), + "text": "no answer", + "start_logit": 0.0, + "end_logit": 0.0, + "score": 0.0, + }, + ) + + # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using + # the LogSumExp trick). + scores = np.array([pred.pop("score") for pred in predictions]) + exp_scores = np.exp(scores - np.max(scores)) + probs = exp_scores / exp_scores.sum() + + # Include the probabilities in our predictions. + for prob, pred in zip(probs, predictions): + pred["probability"] = prob + + # Pick the best prediction. If the null answer is not possible, this is easy. + if not version_2_with_negative: + all_predictions[example["id"]] = predictions[0]["text"] + all_feature_index[example["id"]] = predictions[0]["feature_index"] + else: + # Otherwise we first need to find the best non-empty prediction. + i = 0 + while predictions[i]["text"] == "no answer": + i += 1 + best_non_null_pred = predictions[i] + + if answerable_probs[1] < cls_threshold: + all_predictions[example["id"]] = "no answer" + else: + all_predictions[example["id"]] = best_non_null_pred["text"] + all_feature_index[example["id"]] = predictions[i]["feature_index"] + + # Make `predictions` JSON-serializable by casting np.float back to float. + all_nbest_json[example["id"]] = [ + {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} + for pred in predictions + ] + + return all_predictions, all_nbest_json, all_cls_predictions, all_feature_index + + +def compute_prediction( + examples, + features, + predictions, + version_2_with_negative=False, + n_best_size=20, + max_answer_length=30, + null_score_diff_threshold=0.0, +): + """ + Post-processes the predictions of a question-answering model to convert + them to answers that are substrings of the original contexts. This is + the base postprocessing functions for models that only return start and + end logits. + + Args: + examples (list): List of raw squad-style data (see `run_squad.py + <https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/ + machine_reading_comprehension/SQuAD/run_squad.py>`__ for more + information). + features (list): List of processed squad-style features (see + `run_squad.py <https://github.com/PaddlePaddle/PaddleNLP/blob/ + develop/examples/machine_reading_comprehension/SQuAD/run_squad.py>`__ + for more information). + predictions (tuple): The predictions of the model. Should be a tuple + of two list containing the start logits and the end logits. + version_2_with_negative (bool, optional): Whether the dataset contains + examples with no answers. Defaults to False. + n_best_size (int, optional): The total number of candidate predictions + to generate. Defaults to 20. + max_answer_length (int, optional): The maximum length of predicted answer. + Defaults to 20. + null_score_diff_threshold (float, optional): The threshold used to select + the null answer. Only useful when `version_2_with_negative` is True. + Defaults to 0.0. + + Returns: + A tuple of three dictionaries containing final selected answer, all n_best + answers along with their probability and scores, and the score_diff of each + example. + """ + assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)." 
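+    # Unlike compute_prediction_checklist above, this variant has no answerable-classification logits: when
+    # version_2_with_negative is set, the null answer is selected purely by comparing the score difference
+    # against null_score_diff_threshold.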
+ all_start_logits, all_end_logits = predictions + + assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." + + # Build a map example to its corresponding features. + features_per_example = collections.defaultdict(list) + for i, feature in enumerate(features): + features_per_example[feature["example_id"]].append(i) + + # The dictionaries we have to fill. + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + all_feature_index = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + # Let's loop over all the examples! + for example_index, example in enumerate(examples): + # Those are the indices of the features associated to the current example. + feature_indices = features_per_example[example["id"]] + + min_null_prediction = None + prelim_predictions = [] + + # Looping through all the features associated to the current example. + for feature_index in feature_indices: + # We grab the predictions of the model for this feature. + start_logits = all_start_logits[feature_index] + end_logits = all_end_logits[feature_index] + # This is what will allow us to map some the positions in our logits to span of texts in the original + # context. + offset_mapping = features[feature_index]["offset_mapping"] + # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context + # available in the current feature. + token_is_max_context = features[feature_index].get("token_is_max_context", None) + + # Update minimum null prediction. + feature_null_score = start_logits[0] + end_logits[0] + if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: + min_null_prediction = { + "feature_index": (0, 0), + "offsets": (0, 0), + "score": feature_null_score, + "start_logit": start_logits[0], + "end_logit": end_logits[0], + } + + # Go through all possibilities for the `n_best_size` greater start and end logits. + start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() + end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() + for start_index in start_indexes: + for end_index in end_indexes: + # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond + # to part of the input_ids that are not in the context. + if ( + start_index >= len(offset_mapping) + or end_index >= len(offset_mapping) + or offset_mapping[start_index] is None + or offset_mapping[end_index] is None + or offset_mapping[start_index] == (0, 0) + or offset_mapping[end_index] == (0, 0) + ): + continue + # Don't consider answers with a length that is either < 0 or > max_answer_length. + if end_index < start_index or end_index - start_index + 1 > max_answer_length: + continue + # Don't consider answer that don't have the maximum context available (if such information is + # provided). + if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): + continue + prelim_predictions.append( + { + "feature_index": (start_index, end_index), + "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]), + "score": start_logits[start_index] + end_logits[end_index], + "start_logit": start_logits[start_index], + "end_logit": end_logits[end_index], + } + ) + if version_2_with_negative: + # Add the minimum null prediction + prelim_predictions.append(min_null_prediction) + null_score = min_null_prediction["score"] + + # Only keep the best `n_best_size` predictions. 
+ predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] + + # Add back the minimum null prediction if it was removed because of its low score. + if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): + predictions.append(min_null_prediction) + + # Use the offsets to gather the answer text in the original context. + context = example["context"] + for pred in predictions: + offsets = pred.pop("offsets") + pred["text"] = context[offsets[0] : offsets[1]] + + # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid + # failure. + if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""): + predictions.insert( + 0, {"feature_index": (0, 0), "text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0} + ) + + # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using + # the LogSumExp trick). + scores = np.array([pred.pop("score") for pred in predictions]) + exp_scores = np.exp(scores - np.max(scores)) + probs = exp_scores / exp_scores.sum() + + # Include the probabilities in our predictions. + for prob, pred in zip(probs, predictions): + pred["probability"] = prob + + # Pick the best prediction. If the null answer is not possible, this is easy. + if not version_2_with_negative: + all_predictions[example["id"]] = predictions[0]["text"] + all_feature_index[example["id"]] = predictions[0]["feature_index"] + else: + # Otherwise we first need to find the best non-empty prediction. + i = 0 + while predictions[i]["text"] == "": + i += 1 + best_non_null_pred = predictions[i] + + # Then we compare to the null prediction using the threshold. + score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"] + scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable. + if score_diff > null_score_diff_threshold: + all_predictions[example["id"]] = "" + else: + all_predictions[example["id"]] = best_non_null_pred["text"] + all_feature_index[example["id"]] = predictions[i]["feature_index"] + + # Make `predictions` JSON-serializable by casting np.float back to float. + all_nbest_json[example["id"]] = [ + {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} + for pred in predictions + ] + + return all_predictions, all_nbest_json, scores_diff_json, all_feature_index diff --git a/examples/model_interpretation/task/mrc/saliency_map/utils.py b/examples/model_interpretation/task/mrc/saliency_map/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..88c1619769ee079c715bf84044ca22e2aec95e0d --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/utils.py @@ -0,0 +1,37 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
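The `get_warmup_and_linear_decay` helper defined below returns the multiplier that `rc_finetune.py` wraps in `paddle.optimizer.lr.LambdaDecay`. A small, self-contained sketch of the schedule it produces, with made-up step counts chosen only for illustration:

```python
# Illustrative numbers only: 1000 total steps with 100 warmup steps.
def lr_factor(step, max_steps=1000, warmup_steps=100):
    # Linear warmup from 0 to 1 over warmup_steps, then linear decay back to 0 at max_steps.
    return min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps))

for step in (0, 50, 100, 550, 1000):
    print(step, lr_factor(step))
# 0 -> 0.0, 50 -> 0.5, 100 -> 1.0, 550 -> 0.5, 1000 -> 0.0
# The effective learning rate at each step is args.lr multiplied by this factor.
```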
+ +from __future__ import absolute_import, division, print_function, unicode_literals + +import paddle + + +class UnpackDataLoader(paddle.io.DataLoader): + def __init__(self, *args, **kwargs): + super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) + + def __iter__(self): + return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) + + +def create_if_not_exists(dir): + try: + dir.mkdir(parents=True) + except: + pass + return dir + + +def get_warmup_and_linear_decay(max_steps, warmup_steps): + return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/senti/LIME/exceptions.py b/examples/model_interpretation/task/senti/LIME/exceptions.py new file mode 100644 index 0000000000000000000000000000000000000000..c5fa1a29924ad795104c6ce7c124a58d1fa06dfe --- /dev/null +++ b/examples/model_interpretation/task/senti/LIME/exceptions.py @@ -0,0 +1,2 @@ +class LimeError(Exception): + """Raise for errors""" diff --git a/examples/model_interpretation/task/senti/LIME/explanation.py b/examples/model_interpretation/task/senti/LIME/explanation.py new file mode 100644 index 0000000000000000000000000000000000000000..6e212b1613ca84438ad37222b5ea09dd234d6a7b --- /dev/null +++ b/examples/model_interpretation/task/senti/LIME/explanation.py @@ -0,0 +1,344 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Explanation class, with visualization functions. +""" +from io import open +import os +import os.path +import json +import string +import numpy as np + +# from .exceptions import LimeError +from LIME.exceptions import LimeError + +from sklearn.utils import check_random_state + + +def id_generator(size=15, random_state=None): + """Helper function to generate random div ids. This is useful for embedding + HTML into ipython notebooks.""" + chars = list(string.ascii_uppercase + string.digits) + return "".join(random_state.choice(chars, size, replace=True)) + + +class DomainMapper(object): + """Class for mapping features to the specific domain. + + The idea is that there would be a subclass for each domain (text, tables, + images, etc), so that we can have a general Explanation class, and separate + out the specifics of visualizing features in here. + """ + + def __init__(self): + pass + + def map_exp_ids(self, exp, **kwargs): + """Maps the feature ids to concrete names. + + Default behaviour is the identity function. Subclasses can implement + this as they see fit. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + kwargs: optional keyword arguments + + Returns: + exp: list of tuples [(name, weight), (name, weight)...] + """ + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, **kwargs): + """Produces html for visualizing the instance. + + Default behaviour does nothing. Subclasses can implement this as they + see fit. 
+ + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + kwargs: optional keyword arguments + + Returns: + js code for visualizing the instance + """ + return "" + + +class Explanation(object): + """Object returned by explainers.""" + + def __init__(self, domain_mapper, mode="classification", class_names=None, random_state=None): + """ + + Initializer. + + Args: + domain_mapper: must inherit from DomainMapper class + type: "classification" or "regression" + class_names: list of class names (only used for classification) + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.random_state = random_state + self.mode = mode + self.domain_mapper = domain_mapper + self.local_exp = {} + self.intercept = {} + self.score = {} + self.local_pred = {} + if mode == "classification": + self.class_names = class_names + self.top_labels = None + self.predict_proba = None + elif mode == "regression": + self.class_names = ["negative", "positive"] + self.predicted_value = None + self.min_value = 0.0 + self.max_value = 1.0 + self.dummy_label = 1 + else: + raise LimeError( + 'Invalid explanation mode "{}". ' 'Should be either "classification" ' 'or "regression".'.format(mode) + ) + + def available_labels(self): + """ + Returns the list of classification labels for which we have any explanations. + """ + try: + assert self.mode == "classification" + except AssertionError: + raise NotImplementedError("Not supported for regression explanations.") + else: + ans = self.top_labels if self.top_labels else self.local_exp.keys() + return list(ans) + + def as_list(self, label=1, **kwargs): + """Returns the explanation as a list. + + Args: + label: desired label. If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + list of tuples (representation, weight), where representation is + given by domain_mapper. Weight is a float. + """ + label_to_use = label if self.mode == "classification" else self.dummy_label + ans = self.domain_mapper.map_exp_ids(self.local_exp[label_to_use], **kwargs) + ans = [(x[0], float(x[1])) for x in ans] + return ans + + def as_map(self): + """Returns the map of explanations. + + Returns: + Map from label to list of tuples (feature_id, weight). + """ + return self.local_exp + + def as_pyplot_figure(self, label=1, **kwargs): + """Returns the explanation as a pyplot figure. + + Will throw an error if you don't have matplotlib installed + Args: + label: desired label. If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + pyplot figure (barchart). 
+ """ + import matplotlib.pyplot as plt + + exp = self.as_list(label=label, **kwargs) + fig = plt.figure() + vals = [x[1] for x in exp] + names = [x[0] for x in exp] + vals.reverse() + names.reverse() + colors = ["green" if x > 0 else "red" for x in vals] + pos = np.arange(len(exp)) + 0.5 + plt.barh(pos, vals, align="center", color=colors) + plt.yticks(pos, names) + if self.mode == "classification": + title = "Local explanation for class %s" % self.class_names[label] + else: + title = "Local explanation" + plt.title(title) + return fig + + def show_in_notebook(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Shows html explanation in ipython notebook. + + See as_html() for parameters. + This will throw an error if you don't have IPython installed""" + + from IPython.core.display import display, HTML + + display( + HTML( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + ) + + def save_to_file(self, file_path, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Saves html explanation to file. . + + Params: + file_path: file to save explanations to + + See as_html() for additional parameters. + + """ + file_ = open(file_path, "w", encoding="utf8") + file_.write( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + file_.close() + + def as_html(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Returns the explanation as an html page. + + Args: + labels: desired labels to show explanations for (as barcharts). + If you ask for a label for which an explanation wasn't + computed, will throw an exception. If None, will show + explanations for all available labels. (only used for classification) + predict_proba: if true, add barchart with prediction probabilities + for the top classes. (only used for classification) + show_predicted_value: if true, add barchart with expected value + (only used for regression) + kwargs: keyword arguments, passed to domain_mapper + + Returns: + code for an html page, including javascript includes. 
+ """ + + def jsonize(x): + return json.dumps(x, ensure_ascii=False) + + if labels is None and self.mode == "classification": + labels = self.available_labels() + + this_dir, _ = os.path.split(__file__) + bundle = open(os.path.join(this_dir, "bundle.js"), encoding="utf8").read() + + out = ( + """<html> + <meta http-equiv="content-type" content="text/html; charset=UTF8"> + <head><script>%s </script></head><body>""" + % bundle + ) + random_id = id_generator(size=15, random_state=check_random_state(self.random_state)) + out += ( + """ + <div class="lime top_div" id="top_div%s"></div> + """ + % random_id + ) + + predict_proba_js = "" + if self.mode == "classification" and predict_proba: + predict_proba_js = """ + var pp_div = top_div.append('div') + .classed('lime predict_proba', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictProba(pp_svg, %s, %s); + """ % ( + jsonize([str(x) for x in self.class_names]), + jsonize(list(self.predict_proba.astype(float))), + ) + + predict_value_js = "" + if self.mode == "regression" and show_predicted_value: + # reference self.predicted_value + # (svg, predicted_value, min_value, max_value) + predict_value_js = """ + var pp_div = top_div.append('div') + .classed('lime predicted_value', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictedValue(pp_svg, %s, %s, %s); + """ % ( + jsonize(float(self.predicted_value)), + jsonize(float(self.min_value)), + jsonize(float(self.max_value)), + ) + + exp_js = """var exp_div; + var exp = new lime.Explanation(%s); + """ % ( + jsonize([str(x) for x in self.class_names]) + ) + + if self.mode == "classification": + for label in labels: + exp = jsonize(self.as_list(label)) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %d, exp_div); + """ % ( + exp, + label, + ) + else: + exp = jsonize(self.as_list()) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %s, exp_div); + """ % ( + exp, + self.dummy_label, + ) + + raw_js = """var raw_div = top_div.append('div');""" + + if self.mode == "classification": + html_data = self.local_exp[labels[0]] + else: + html_data = self.local_exp[self.dummy_label] + + raw_js += self.domain_mapper.visualize_instance_html( + html_data, labels[0] if self.mode == "classification" else self.dummy_label, "raw_div", "exp", **kwargs + ) + out += """ + <script> + var top_div = d3.select('#top_div%s').classed('lime top_div', true); + %s + %s + %s + %s + </script> + """ % ( + random_id, + predict_proba_js, + predict_value_js, + exp_js, + raw_js, + ) + out += "</body></html>" + + return out diff --git a/examples/model_interpretation/task/senti/LIME/lime_base.py b/examples/model_interpretation/task/senti/LIME/lime_base.py new file mode 100644 index 0000000000000000000000000000000000000000..2c9104f69b54343b7c71db6defc59650c749fc9a --- /dev/null +++ b/examples/model_interpretation/task/senti/LIME/lime_base.py @@ -0,0 +1,226 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Contains abstract functionality for learning locally linear sparse model. +""" +import numpy as np +import scipy as sp +from sklearn.linear_model import Ridge, lars_path +from sklearn.utils import check_random_state + + +class LimeBase(object): + """Class for learning a locally linear sparse model from perturbed data""" + + def __init__(self, kernel_fn, verbose=False, random_state=None): + """Init function + + Args: + kernel_fn: function that transforms an array of distances into an + array of proximity values (floats). + verbose: if true, print local prediction values from linear model. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.kernel_fn = kernel_fn + self.verbose = verbose + self.random_state = check_random_state(random_state) + + @staticmethod + def generate_lars_path(weighted_data, weighted_labels): + """Generates the lars path for weighted data. + + Args: + weighted_data: data that has been weighted by kernel + weighted_label: labels, weighted by kernel + + Returns: + (alphas, coefs), both are arrays corresponding to the + regularization parameter and coefficients, respectively + """ + x_vector = weighted_data + alphas, _, coefs = lars_path(x_vector, weighted_labels, method="lasso", verbose=False) + return alphas, coefs + + def forward_selection(self, data, labels, weights, num_features): + """Iteratively adds features to the model""" + clf = Ridge(alpha=0, fit_intercept=True, random_state=self.random_state) + used_features = [] + for _ in range(min(num_features, data.shape[1])): + max_ = -100000000 + best = 0 + for feature in range(data.shape[1]): + if feature in used_features: + continue + clf.fit(data[:, used_features + [feature]], labels, sample_weight=weights) + score = clf.score(data[:, used_features + [feature]], labels, sample_weight=weights) + if score > max_: + best = feature + max_ = score + used_features.append(best) + return np.array(used_features) + + def feature_selection(self, data, labels, weights, num_features, method): + """Selects features for the model. 
see explain_instance_with_data to + understand the parameters.""" + if method == "none": + return np.array(range(data.shape[1])) + + elif method == "forward_selection": + return self.forward_selection(data, labels, weights, num_features) + + elif method == "highest_weights": + clf = Ridge(alpha=0.01, fit_intercept=True, random_state=self.random_state) + clf.fit(data, labels, sample_weight=weights) + + coef = clf.coef_ + if sp.sparse.issparse(data): + coef = sp.sparse.csr_matrix(clf.coef_) + weighted_data = coef.multiply(data[0]) + # Note: most efficient to slice the data before reversing + sdata = len(weighted_data.data) + argsort_data = np.abs(weighted_data.data).argsort() + # Edge case where data is more sparse than requested number of feature importances + # In that case, we just pad with zero-valued features + if sdata < num_features: + nnz_indexes = argsort_data[::-1] + indices = weighted_data.indices[nnz_indexes] + num_to_pad = num_features - sdata + indices = np.concatenate((indices, np.zeros(num_to_pad, dtype=indices.dtype))) + indices_set = set(indices) + pad_counter = 0 + for i in range(data.shape[1]): + if i not in indices_set: + indices[pad_counter + sdata] = i + pad_counter += 1 + if pad_counter >= num_to_pad: + break + else: + nnz_indexes = argsort_data[sdata - num_features : sdata][::-1] + indices = weighted_data.indices[nnz_indexes] + return indices + else: + weighted_data = coef * data[0] + feature_weights = sorted( + zip(range(data.shape[1]), weighted_data), # zip(特征的编号, Ridge的w值) + key=lambda x: np.abs(x[1]), + reverse=True, + ) + return np.array([x[0] for x in feature_weights[:num_features]]) # 返回Ridge的前num_features大的w的值对应的特征编号 + + elif method == "lasso_path": + weighted_data = (data - np.average(data, axis=0, weights=weights)) * np.sqrt(weights[:, np.newaxis]) + weighted_labels = (labels - np.average(labels, weights=weights)) * np.sqrt(weights) + nonzero = range(weighted_data.shape[1]) + _, coefs = self.generate_lars_path(weighted_data, weighted_labels) + for i in range(len(coefs.T) - 1, 0, -1): + nonzero = coefs.T[i].nonzero()[0] + if len(nonzero) <= num_features: + break + used_features = nonzero + return used_features + + elif method == "auto": + if num_features <= 6: + n_method = "forward_selection" + else: + n_method = "highest_weights" + return self.feature_selection(data, labels, weights, num_features, n_method) + + def explain_instance_with_data( + self, + neighborhood_data, + neighborhood_labels, + distances, + label, + num_features, + feature_selection="auto", + model_regressor=None, + ): + """Takes perturbed data, labels and distances, returns explanation. + + Args: + neighborhood_data: perturbed data, 2d array. first element is + assumed to be the original data point. + neighborhood_labels: corresponding perturbed labels. should have as + many columns as the number of possible labels. + distances: distances to original data point. + label: label for which we want an explanation + num_features: maximum number of features in explanation + feature_selection: how to select num_features. options are: + 'forward_selection': iteratively add features to the model. 
+                    This is costly when num_features is high
+                'highest_weights': selects the features that have the highest
+                    product of absolute weight * original data point when
+                    learning with all the features
+                'lasso_path': chooses features based on the lasso
+                    regularization path
+                'none': uses all features, ignores num_features
+                'auto': uses forward_selection if num_features <= 6, and
+                    'highest_weights' otherwise.
+            model_regressor: sklearn regressor to use in explanation.
+                Defaults to Ridge regression if None. Must have
+                model_regressor.coef_ and 'sample_weight' as a parameter
+                to model_regressor.fit()
+
+        Returns:
+            (intercept, exp, score, local_pred):
+            intercept is a float.
+            exp is a sorted list of tuples, where each tuple (x,y) corresponds to the feature id (x)
+            and the local weight (y). The list is sorted by decreasing absolute value of y.
+            score is the R^2 value of the returned explanation
+            local_pred is the prediction of the explanation model on the original instance
+        """
+
+        weights = self.kernel_fn(distances)  # weights of the perturbed samples
+        labels_column = neighborhood_labels[:, label]  # softmax scores of class `label`
+
+        used_features = self.feature_selection(
+            neighborhood_data, labels_column, weights, num_features, feature_selection
+        )
+        if model_regressor is None:
+            # alpha: L2 regularization coefficient; fit_intercept: whether to fit the
+            # bias term b; random_state: seed used by the regressor.
+            model_regressor = Ridge(alpha=1, fit_intercept=True, random_state=self.random_state)
+        easy_model = model_regressor
+        easy_model.fit(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+        prediction_score = easy_model.score(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+
+        local_pred = easy_model.predict(neighborhood_data[0, used_features].reshape(1, -1))
+
+        ridge_pred = easy_model.predict(neighborhood_data[:, used_features])
+        err_np = np.abs(labels_column - ridge_pred)
+        # relative_err_np = err_np / labels_column
+        relative_err_np = err_np / ridge_pred
+        err = np.average(err_np, weights=weights)
+        relative_err = np.average(relative_err_np, weights=weights)
+
+        if self.verbose:
+            print("Intercept", easy_model.intercept_)
+            print(
+                "Prediction_local",
+                local_pred,
+            )
+            print("Right:", neighborhood_labels[0, label])
+        return (
+            easy_model.intercept_,
+            sorted(
+                zip(used_features, easy_model.coef_), key=lambda x: np.abs(x[1]), reverse=True
+            ),  # (feature_id, weight) pairs sorted by decreasing absolute weight
+            prediction_score,  # R^2 of easy_model on the perturbed set; closer to 1 means a better local fit
+            local_pred,  # easy_model's predicted probability for the original instance
+            relative_err,
+            err,
+        )
diff --git a/examples/model_interpretation/task/senti/LIME/lime_text.py b/examples/model_interpretation/task/senti/LIME/lime_text.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ef6d3bc40decf94ee5c30a461583b0119d05122
--- /dev/null
+++ b/examples/model_interpretation/task/senti/LIME/lime_text.py
@@ -0,0 +1,664 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# !/usr/bin/env python3
+"""
+Functions for explaining text classifiers.
+""" +import itertools +import json +import re +import time +import math +import paddle +from functools import partial + +import numpy as np +import scipy as sp +import sklearn +from sklearn.utils import check_random_state + +import LIME.explanation as explanation +import LIME.lime_base as lime_base + + +class TextDomainMapper(explanation.DomainMapper): + """Maps feature ids to words or word-positions""" + + def __init__(self, indexed_string, language): + """Initializer. + + Args: + indexed_string: lime_text.IndexedString, original string + """ + self.indexed_string = indexed_string + self.language = language + + def map_exp_ids(self, exp, positions=False): + """Maps ids to words or word-position strings. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + positions: if True, also return word positions + + Returns: + list of tuples (word, weight), or (word_positions, weight) if + examples: ('bad', 1) or ('bad_3-6-12', 1) + """ + if positions: + exp = [ + ( + "%s_%s" + % (self.indexed_string.word(x[0]), "-".join(map(str, self.indexed_string.string_position(x[0])))), + x[1], + ) + for x in exp + ] + else: + exp = [(self.indexed_string.word(x[0]), x[1]) for x in exp] + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, text=True, opacity=True): + """Adds text with highlighted words to visualization. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + text: if False, return empty + opacity: if True, fade colors according to weight + """ + if not text: + return "" + text = self.indexed_string.raw_string().encode("utf-8", "xmlcharrefreplace").decode("utf-8") + text = re.sub(r"[<>&]", "|", text) + exp = [(self.indexed_string.word(x[0]), self.indexed_string.string_position(x[0]), x[1]) for x in exp] + all_occurrences = list(itertools.chain.from_iterable([itertools.product([x[0]], x[1], [x[2]]) for x in exp])) + all_occurrences = [(x[0], int(x[1]), x[2]) for x in all_occurrences] + ret = """ + %s.show_raw_text(%s, %d, %s, %s, %s); + """ % ( + exp_object_name, + json.dumps(all_occurrences), + label, + json.dumps(text), + div_name, + json.dumps(opacity), + ) + return ret + + +class IndexedString(object): + """String with various indexes.""" + + def __init__(self, raw_string, split_expression=r"\W+", bow=True, mask_string=None, language="en"): + """Initializer. + + Args: + raw_string: string with raw text in it + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. + bow: if True, a word is the same everywhere in the text - i.e. we + will index multiple occurrences of the same word. If False, + order matters, so that the same word will have different ids + according to position. + mask_string: If not None, replace words with this if bow=False + if None, default value is UNKWORDZ + """ + self.raw = raw_string + self.mask_string = "UNKWORDZ" if mask_string is None else mask_string + self.language = language + + if callable(split_expression): + tokens = split_expression(self.raw) + self.as_list = self._segment_with_tokens(self.raw, tokens) + tokens = set(tokens) + + def non_word(string): + return string not in tokens + + else: + # with the split_expression as a non-capturing group (?:), we don't need to filter out + # the separator character from the split results. 
+ # splitter = re.compile(r'(%s)|$' % split_expression) + # self.as_list = [s for s in splitter.split(self.raw) if s] + if self.language == "ch": + splitter = re.compile(r"([\u4e00-\u9fa5])") + self.as_list = [w for w in splitter.split(self.raw) if len(w.strip()) > 0] + else: + splitter = re.compile(split_expression) + self.as_list = [w for w in self.raw.strip().split() if len(w.strip()) > 0] + valid_word = splitter.match + + self.as_np = np.array(self.as_list) + self.string_start = np.hstack(([0], np.cumsum([len(x) for x in self.as_np[:-1]]))) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, word in enumerate(self.as_np): + if word in non_vocab: + continue + if (valid_word(word) and self.language == "en") or (not valid_word(word) and self.language == "ch"): + non_vocab.add(word) + continue + if bow: + if word not in vocab: + vocab[word] = len(vocab) + self.inverse_vocab.append(word) + self.positions.append([]) + idx_word = vocab[word] + self.positions[idx_word].append(i) + else: + self.inverse_vocab.append(word) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if self.language == "ch": + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + else: + if not self.bow: + return " ".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return " ".join([self.as_list[v] for v in mask.nonzero()[0]]) + + @staticmethod + def _segment_with_tokens(text, tokens): + """Segment a string around the tokens created by a passed-in tokenizer""" + list_form = [] + text_ptr = 0 + for token in tokens: + inter_token_string = [] + while not text[text_ptr:].startswith(token): + inter_token_string.append(text[text_ptr]) + text_ptr += 1 + if text_ptr >= len(text): + raise ValueError("Tokenization produced tokens that do not belong in string!") + text_ptr += len(token) + if inter_token_string: + list_form.append("".join(inter_token_string)) + list_form.append(token) + if text_ptr < len(text): + list_form.append(text[text_ptr:]) + return list_form + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class IndexedCharacters(object): + """String with various indexes.""" + + def __init__(self, raw_string, bow=True, mask_string=None): + """Initializer. 
+ + Args: + raw_string: string with raw text in it + bow: if True, a char is the same everywhere in the text - i.e. we + will index multiple occurrences of the same character. If False, + order matters, so that the same word will have different ids + according to position. + mask_string: If not None, replace characters with this if bow=False + if None, default value is chr(0) + """ + self.raw = raw_string + self.as_list = list(self.raw) + self.as_np = np.array(self.as_list) + self.mask_string = chr(0) if mask_string is None else mask_string + self.string_start = np.arange(len(self.raw)) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, char in enumerate(self.as_np): + if char in non_vocab: + continue + if bow: + if char not in vocab: + vocab[char] = len(vocab) + self.inverse_vocab.append(char) + self.positions.append([]) + idx_char = vocab[char] + self.positions[idx_char].append(i) + else: + self.inverse_vocab.append(char) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing + it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class LimeTextExplainer(object): + """Explains text classifiers. + Currently, we are using an exponential kernel on cosine distance, and + restricting explanations to words that are present in documents.""" + + def __init__( + self, + kernel_width=25, + kernel=None, + verbose=False, + class_names=None, + feature_selection="auto", + split_expression=r"\W+", + bow=True, + mask_string=None, + random_state=None, + char_level=False, + language="en", + ): + """Init function. + + Args: + kernel_width: kernel width for the exponential kernel. + kernel: similarity kernel that takes euclidean distances and kernel + width as input and outputs weights in (0,1). If None, defaults to + an exponential kernel. + verbose: if true, print local prediction values from linear model + class_names: list of class names, ordered according to whatever the + classifier is using. If not present, class names will be '0', + '1', ... + feature_selection: feature selection method. can be + 'forward_selection', 'lasso_path', 'none' or 'auto'. 
+ See function 'explain_instance_with_data' in lime_base.py for + details on what each of the options does. + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. + bow: if True (bag of words), will perturb input data by removing + all occurrences of individual words or characters. + Explanations will be in terms of these words. Otherwise, will + explain in terms of word-positions, so that a word may be + important the first time it appears and unimportant the second. + Only set to false if the classifier uses word order in some way + (bigrams, etc), or if you set char_level=True. + mask_string: String used to mask tokens or characters if bow=False + if None, will be 'UNKWORDZ' if char_level=False, chr(0) + otherwise. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + char_level: an boolean identifying that we treat each character + as an independent occurence in the string + """ + + if kernel is None: + + def kernel(d, kernel_width): + return np.sqrt(np.exp(-(d**2) / kernel_width**2)) + + kernel_fn = partial(kernel, kernel_width=kernel_width) + + self.random_state = check_random_state(random_state) + self.base = lime_base.LimeBase(kernel_fn, verbose, random_state=self.random_state) + self.class_names = class_names + self.vocabulary = None + self.feature_selection = feature_selection + self.bow = bow + self.mask_string = mask_string + self.split_expression = split_expression + self.char_level = char_level + self.language = language + + def explain_instance( + self, + text_instance: str, + tokenizer, + pred_label: int, + classifier_fn, + labels=(0, 1), + top_labels=None, + num_features=10, + num_samples=5000, + distance_metric="cosine", + model_regressor=None, + if_lstm=False, + ): + """Generates explanations for a prediction. + + First, we generate neighborhood data by randomly hiding features from + the instance (see __data_labels_distance_mapping). We then learn + locally weighted linear models on this neighborhood data to explain + each of the classes in an interpretable way (see lime_base.py). + + Args: + text_instance: raw text string to be explained. + classifier_fn: classifier prediction probability function, which + takes a list of d strings and outputs a (d, k) numpy array with + prediction probabilities, where k is the number of classes. + For ScikitClassifiers , this is classifier.predict_proba. + labels: iterable with labels to be explained. + top_labels: if not None, ignore labels and produce explanations for + the K labels with highest prediction probabilities, where K is + this parameter. + num_features: maximum number of features present in explanation + num_samples: size of the neighborhood to learn the linear model + distance_metric: the distance metric to use for sample weighting, + defaults to cosine similarity + model_regressor: sklearn regressor to use in explanation. Defaults + to Ridge regression in LimeBase. Must have model_regressor.coef_ + and 'sample_weight' as a parameter to model_regressor.fit() + Returns: + An Explanation object (see explanation.py) with the corresponding + explanations. 
+ """ + indexed_string = ( + IndexedCharacters(text_instance, bow=self.bow, mask_string=self.mask_string) + if self.char_level + else IndexedString( + text_instance, + bow=self.bow, + split_expression=self.split_expression, + mask_string=self.mask_string, + language=self.language, + ) + ) + domain_mapper = TextDomainMapper(indexed_string, self.language) + + # 产生扰动数据集 第一条是原始数据 + # data: 解释器训练特征 list (num_samples, doc_size) + # yss: 解释器训练标签 list (num_samples, class_num(2)) + # distances: 扰动样本到原始样本的距离 np.array(float) (num_samples, ) + data, yss, distances = self.__data_labels_distances( + indexed_string, tokenizer, classifier_fn, num_samples, distance_metric=distance_metric, if_lstm=if_lstm + ) + + if self.class_names is None: + self.class_names = [str(x) for x in range(yss[0].shape[0])] + ret_exp = explanation.Explanation( + domain_mapper=domain_mapper, class_names=self.class_names, random_state=self.random_state + ) + ret_exp.predict_proba = yss[0] + if top_labels: + labels = np.argsort(yss[0])[-top_labels:] + ret_exp.top_labels = list(labels) + ret_exp.top_labels.reverse() + + num_features = indexed_string.num_words() # 特征数量跟word_num相同 + + ( + ret_exp.intercept[pred_label], + ret_exp.local_exp[pred_label], + ret_exp.score[pred_label], + ret_exp.local_pred[pred_label], + relative_err, + err, + ) = self.base.explain_instance_with_data( + data, + yss, + distances, + pred_label, + num_features, + model_regressor=model_regressor, + feature_selection=self.feature_selection, + ) + + return ret_exp, indexed_string, relative_err, err + + def __data_labels_distances( + self, indexed_string, tokenizer, classifier_fn, num_samples, distance_metric="cosine", if_lstm=False + ): + """Generates a neighborhood around a prediction. + + Generates neighborhood data by randomly removing words from + the instance, and predicting with the classifier. Uses cosine distance + to compute distances between original and perturbed instances. + Args: + indexed_string: document (IndexedString) to be explained, + classifier_fn: classifier prediction probability function, which + takes a string and outputs prediction probabilities. For + ScikitClassifier, this is classifier.predict_proba. + num_samples: size of the neighborhood to learn the linear model + distance_metric: the distance metric to use for sample weighting, + defaults to cosine similarity. + + Returns: + A tuple (data, labels, distances), where: + data: dense num_samples * K binary matrix, where K is the + number of tokens in indexed_string. The first row is the + original instance, and thus a row of ones. + labels: num_samples * L matrix, where L is the number of target + labels + distances: cosine distance between the original instance and + each perturbed instance (computed in the binary 'data' + matrix), times 100. 
+ """ + + def distance_fn(x): + return sklearn.metrics.pairwise.pairwise_distances(x, x[0], metric=distance_metric).ravel() * 100 + + doc_size = indexed_string.num_words() + + if doc_size > 1: + sample = self.random_state.randint( + 1, doc_size, num_samples - 1 + ) # sample: [int(1 ~ doc_size-1) * num_samples-1] + else: + sample = [0 for i in range(num_samples - 1)] + data = np.ones((num_samples, doc_size)) + data[0] = np.ones(doc_size) + features_range = range(doc_size) + perturb_text = [indexed_string.raw_string()] # [文本 * num_samples] + + for i, size in enumerate(sample, start=1): + # inactive: 从range(0, doc_size)中随机取出的size个数组成的list, 要去掉的字的id + inactive = self.random_state.choice( + features_range, size, replace=False # [0, doc_size) # int: 该扰动样本中remove token的数量 + ) + + text = indexed_string.inverse_removing(inactive) # 原文本去掉了inactive中的字后的文本 + + data[i, inactive] = 0 + perturb_text.append(text) + + prev_time = time.time() + # inverse_data: 扰动数据集 [扰动样本 str] * num_samples + labels = [] + token_ids_list, s_ids_list, seq_len_list = [], [], [] + token_ids_max_len = 0 + + valid_idxs = [] + + for idx, text in enumerate(perturb_text): + if self.language == "en": + if if_lstm: + pad_id = [tokenizer.vocab.token_to_idx.get("[PAD]", 0)] + + token_ids = tokenizer.encode(text) + token_ids_max_len = max(token_ids_max_len, len(token_ids)) + seq_len = len(token_ids) + if seq_len == 0: + continue + else: + valid_idxs.append(idx) + seq_len_list.append(seq_len) + pad_id = [tokenizer.vocab.token_to_idx.get("[PAD]", 0)] + + else: + pad_id = tokenizer.convert_tokens_to_ids(["[PAD]"]) + + tokens = tokenizer.tokenize(text) + token_ids = tokenizer.convert_tokens_to_ids(tokens) + token_ids = ( + tokenizer.convert_tokens_to_ids(["[CLS]"]) + + token_ids + + tokenizer.convert_tokens_to_ids(["[SEP]"]) + ) + token_ids_max_len = max(token_ids_max_len, len(token_ids)) + + token_ids_list.append(token_ids) + else: + if len(text) == 0: # TODO + text = perturb_text[0] + tokens = tokenizer.tokenize(text) + token_ids = tokenizer.convert_tokens_to_ids(tokens) + + if if_lstm: + seq_len = len(token_ids) + if seq_len == 0: + continue + else: + valid_idxs.append(idx) + seq_len_list.append(seq_len) + else: + token_ids = ( + tokenizer.convert_tokens_to_ids(["[CLS]"]) + + token_ids + + tokenizer.convert_tokens_to_ids(["[SEP]"]) + ) + + # padding + token_ids = token_ids + tokenizer.convert_tokens_to_ids(["[PAD]"]) * ( + len(perturb_text[0]) + 2 - len(token_ids) + ) + token_ids_list.append(token_ids) + s_ids = [0 for _ in range(len(token_ids))] + s_ids_list.append(s_ids) + + if self.language == "en": + for token_ids in token_ids_list: + while len(token_ids) < token_ids_max_len: + token_ids += pad_id + + s_ids = [0 for _ in range(len(token_ids))] + s_ids_list.append(s_ids) + + token_ids_np = np.array(token_ids_list) + s_ids_np = np.array(s_ids_list) + seq_len_np = np.array(seq_len_list) + + prev_time = time.time() + + batch = 0 + if self.language == "ch": + length = len(perturb_text[0]) + + if if_lstm: + batch = 128 + else: + batch = 64 if length < 130 else 50 + else: + batch = 32 + + epoch_num = math.ceil(len(token_ids_np) / batch) + for idx in range(epoch_num): + token_ids_tensor = paddle.Tensor( + value=token_ids_np[idx * batch : (idx + 1) * batch], place=paddle.CUDAPlace(0), stop_gradient=True + ) + if if_lstm: + seq_len_tensor = paddle.Tensor( + value=seq_len_np[idx * batch : (idx + 1) * batch], + place=token_ids_tensor.place, + stop_gradient=token_ids_tensor.stop_gradient, + ) + label = classifier_fn(token_ids_tensor, 
seq_len_tensor)[0] # label: Tensor[num_samples, 2] + else: + s_ids_tensor = paddle.Tensor( + value=s_ids_np[idx * batch : (idx + 1) * batch], + place=token_ids_tensor.place, + stop_gradient=token_ids_tensor.stop_gradient, + ) + label = classifier_fn(token_ids_tensor, s_ids_tensor)[0] # label: Tensor[num_samples, 2] + + labels.extend(label.numpy().tolist()) + + labels = np.array(labels) # labels: nsp.array(num_samples, 2) + + print("mode forward time: %.5f" % (time.time() - prev_time)) + + distances = distance_fn(sp.sparse.csr_matrix(data)) + + return data, labels, distances diff --git a/examples/model_interpretation/task/senti/pretrained_models/run_train.sh b/examples/model_interpretation/task/senti/pretrained_models/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..b13d03c6486ef7a79f2718484cd6b976e767851a --- /dev/null +++ b/examples/model_interpretation/task/senti/pretrained_models/run_train.sh @@ -0,0 +1,30 @@ +### + # This script is used to finetune pretrained models +### + +export CUDA_VISIBLE_DEVICES=5 + +LANGUAGE=en +BASE_MODEL=roberta_base # [roberta_base, roberta_large] +timestamp=`date +"%Y%m%d_%H%M%S"` + +if [[ $LANGUAGE == "ch" ]]; then + LEARNING_RATE=2e-5 + MAX_SEQ_LENGTH=128 +elif [[ $LANGUAGE == "en" ]]; then + LEARNING_RATE=5e-6 + MAX_SEQ_LENGTH=512 +fi + +[ -d "logs" ] || mkdir -p "logs" +set -x + +python3 ./train.py \ + --learning_rate ${LEARNING_RATE} \ + --max_seq_length ${MAX_SEQ_LENGTH} \ + --batch_size 32 \ + --epochs 5 \ + --base_model $BASE_MODEL \ + --save_dir saved_model_${LANGUAGE}/${BASE_MODEL}_${timestamp} \ + --language $LANGUAGE >> logs/log_${BASE_MODEL}_${timestamp} + diff --git a/examples/model_interpretation/task/senti/pretrained_models/train.py b/examples/model_interpretation/task/senti/pretrained_models/train.py new file mode 100644 index 0000000000000000000000000000000000000000..61dcb01ada0890de7cb44d2e13e07292fc13a8da --- /dev/null +++ b/examples/model_interpretation/task/senti/pretrained_models/train.py @@ -0,0 +1,230 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
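+#
+# This trainer is normally launched through run_train.sh in the same directory;
+# for reference, the English RoBERTa-base configuration from that script is:
+#
+#     python3 ./train.py --learning_rate 5e-6 --max_seq_length 512 \
+#         --batch_size 32 --epochs 5 --base_model roberta_base \
+#         --save_dir saved_model_en/roberta_base_${timestamp} --language en
+#
+# where ${timestamp} is generated by the shell script.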
+""" + This file is used to fine-tune pretrained models +""" +import argparse +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("..") +sys.path.append("../../..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("../../..") +sys.path.remove("..") +from utils import convert_example # noqa: E402 + +parser = argparse.ArgumentParser() +parser.add_argument("--base_model", type=str, choices=["roberta_base", "roberta_large"]) +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument( + "--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process." +) +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +parser.add_argument( + "--language", choices=["ch", "en"], required=True, default=None, help="Language that the model is built for" +) +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+    """
+    model.eval()
+    metric.reset()
+    losses = []
+    for batch in data_loader:
+        input_ids, token_type_ids, labels = batch
+        logits = model(input_ids, token_type_ids)
+        loss = criterion(logits, labels)
+        losses.append(loss.numpy())
+        correct = metric.compute(logits, labels)
+        metric.update(correct)
+        accu = metric.accumulate()
+    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
+    model.train()
+    metric.reset()
+
+
+def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None):
+    """
+    This function creates the dataloader which feeds data into the model
+    """
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+
+    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
+
+
+def do_train():
+    """
+    This function is the main part of the fine-tuning process
+    """
+    paddle.set_device(args.device)
+    rank = paddle.distributed.get_rank()
+    if paddle.distributed.get_world_size() > 1:
+        paddle.distributed.init_parallel_env()
+
+    set_seed(args.seed)
+    if args.language == "ch":
+        train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])
+        if args.base_model == "roberta_base":
+            tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext", num_classes=2)
+        elif args.base_model == "roberta_large":
+            tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext-large", num_classes=2)
+    else:
+        train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"])
+        # for English version, we load models from local machine
+        if args.base_model == "roberta_base":
+            tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_classes=2)
+        elif args.base_model == "roberta_large":
+            tokenizer = RobertaBPETokenizer.from_pretrained("roberta-large")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_classes=2)
+
+    trans_func = partial(
+        convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=args.language
+    )
+    batchify_fn = lambda samples, fn=Tuple(
+        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
+        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
+        Stack(dtype="int64"),  # label
+    ): [data for data in fn(samples)]
+    train_data_loader = create_dataloader(
+        train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func
+    )
+    dev_data_loader = create_dataloader(
+        dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func
+    )
+
+    if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
+        state_dict = paddle.load(args.init_from_ckpt)
+        model.set_dict(state_dict)
+    model = paddle.DataParallel(model)
+
+    num_training_steps = len(train_data_loader) * args.epochs
+
+    lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion)
+
+    # Generate parameter names needed to perform weight decay.
+    # All bias and LayerNorm parameters are excluded.
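+    # AdamW calls `apply_decay_param_fun` with each parameter name and applies
+    # weight decay only when it returns True, i.e. only to the names collected
+    # in `decay_params` below (parameters whose names contain no "bias"/"norm").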
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + log_per_step = 100 if args.language == "en" else 10 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % log_per_step == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, log_per_step / (time.time() - tic_train)), + flush=True, + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % (log_per_step * 10) == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + evaluate(model, criterion, metric, dev_data_loader) + model._layers.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/model_interpretation/task/senti/pretrained_models/utils.py b/examples/model_interpretation/task/senti/pretrained_models/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d8c0bad17bd678f7e6bec12ea55abea6a2be6c27 --- /dev/null +++ b/examples/model_interpretation/task/senti/pretrained_models/utils.py @@ -0,0 +1,59 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + This file contains some public function +""" +import numpy as np + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False, language="ch"): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + It returns the first portion of the mask (0's). + + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. 
+ max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): List of sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + if language == "ch": + text = "text" + label = "label" + else: + text = "sentence" + label = "labels" + encoded_inputs = tokenizer(text=example[text], max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return input_ids, token_type_ids + label = np.array([example[label]], dtype="int64") + return input_ids, token_type_ids, label diff --git a/examples/model_interpretation/task/senti/rnn/lstm_train.sh b/examples/model_interpretation/task/senti/rnn/lstm_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..fd3b4a4cc2b836f1c85f4b88f7e9baddaa204d20 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/lstm_train.sh @@ -0,0 +1,20 @@ +### + # This script is used to train lstm models +### + +unset CUDA_VISIBLE_DEVICES +LANGUAGE=en + +if [[ $LANGUAGE == 'ch' ]]; then + VOCAB_PATH='./vocab.txt' +else + VOCAB_PATH='vocab.sst2_train' +fi +python -m paddle.distributed.launch --gpus "5" train.py \ + --device=gpu \ + --lr=4e-4 \ + --batch_size=64 \ + --epochs=12 \ + --vocab_path=$VOCAB_PATH \ + --language=$LANGUAGE \ + --save_dir="./checkpoints_"${LANGUAGE} diff --git a/examples/model_interpretation/task/senti/rnn/model.py b/examples/model_interpretation/task/senti/rnn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..9c509e72432e33d4501158b8e439110b1ad1ac22 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/model.py @@ -0,0 +1,267 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
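+#
+# This module implements the RNN baselines used by the sentiment-interpretation
+# scripts: `LSTMModel` and `BiLSTMAttentionModel`, the latter with either a
+# `SelfAttention` or a `SelfInteractiveAttention` layer. Besides the usual
+# `forward` (logits), each model exposes `forward_interpet`, which additionally
+# returns intermediate representations (attention weights or the sequence
+# representation) and the token embeddings consumed by the interpretation methods.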
+ +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +INF = 1.0 * 1e12 + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=198, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + + self.direction = direction + + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + + # self.lstm_encoder = nlp.seq2vec.LSTMEncoder(emb_dim, + # lstm_hidden_size, + # num_layers=lstm_layers, + # direction=direction, + # dropout=dropout_rate, + # pooling_type=pooling_type) + + self.lstm_layer = nn.LSTM( + input_size=emb_dim, + hidden_size=lstm_hidden_size, + num_layers=lstm_layers, + direction=direction, + dropout=dropout_rate, + ) + + self.fc = nn.Linear(lstm_hidden_size * (2 if direction == "bidirect" else 1), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + self.softmax = nn.Softmax(axis=1) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*lstm_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + + # text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) + + encoded_text, (last_hidden, last_cell) = self.lstm_layer(embedded_text, sequence_length=seq_len) + if self.direction == "bidirect": + text_repr = paddle.concat((last_hidden[-2, :, :], last_hidden[-1, :, :]), axis=1) + else: + text_repr = last_hidden[-1, :, :] + + fc_out = paddle.tanh(self.fc(text_repr)) # Shape: (batch_size, fc_hidden_size) + logits = self.output_layer(fc_out) # Shape: (batch_size, num_classes) + return logits + + def forward_interpet(self, text, seq_len): + embedded_text = self.embedder(text) # Shape: (batch_size, num_tokens, embedding_dim) + + # text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) # Shape: (batch_size, num_tokens, num_directions * hidden) + + # encoded_text: tensor[batch, seq_len, num_directions * hidden] + # last_hidden: tensor[2, batch, hiddens] + encoded_text, (last_hidden, last_cell) = self.lstm_layer(embedded_text, sequence_length=seq_len) + if self.direction == "bidirect": + text_repr = paddle.concat( + (last_hidden[-2, :, :], last_hidden[-1, :, :]), axis=1 + ) # text_repr: tensor[batch, 2 * hidden] 双向 + else: + text_repr = last_hidden[-1, :, :] # text_repr: tensor[1, hidden_size] 单向 + + fc_out = paddle.tanh(self.fc(text_repr)) # Shape: (batch_size, fc_hidden_size) + logits = self.output_layer(fc_out) # Shape: (batch_size, num_classes) + probs = self.softmax(logits) + + return probs, text_repr, embedded_text + + +class BiLSTMAttentionModel(nn.Layer): + def __init__( + self, + attention_layer, + vocab_size, + num_classes, + emb_dim=128, + lstm_hidden_size=196, + fc_hidden_size=96, + lstm_layers=1, + dropout_rate=0.0, + padding_idx=0, + ): + super().__init__() + self.padding_idx = padding_idx + + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.bilstm = nn.LSTM( + input_size=emb_dim, + hidden_size=lstm_hidden_size, + num_layers=lstm_layers, + dropout=dropout_rate, + direction="bidirect", + ) + self.attention = attention_layer + if isinstance(attention_layer, SelfAttention): + self.fc = nn.Linear(lstm_hidden_size, fc_hidden_size) + elif isinstance(attention_layer, 
SelfInteractiveAttention): + self.fc = nn.Linear(lstm_hidden_size * 2, fc_hidden_size) + else: + raise RuntimeError("Unknown attention type %s." % attention_layer.__class__.__name__) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + self.softmax = nn.Softmax(axis=1) + + def forward(self, text, seq_len): + mask = text != self.padding_idx + embedded_text = self.embedder(text) + # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) + encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, lstm_hidden_size) + hidden, att_weights = self.attention(encoded_text, mask) # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(hidden)) # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + def forward_interpet(self, text, seq_len, noise=None, i=None, n_samples=None): + mask = text != self.padding_idx + + baseline_text = paddle.to_tensor( + [[0] * text.shape[1]], dtype=text.dtype, place=text.place, stop_gradient=text.stop_gradient + ) + + embedded_text = self.embedder(text) + baseline_embedded = self.embedder(baseline_text) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + stdev_spread = 0.15 + stdev = stdev_spread * (embedded_text.max() - embedded_text.min()).numpy() + noise = paddle.to_tensor( + np.random.normal(0, stdev, embedded_text.shape).astype(np.float32), stop_gradient=False + ) + embedded_text = embedded_text + noise + + elif noise.upper() == "INTEGRATED": + embedded_text = baseline_embedded + (i / (n_samples - 1)) * (embedded_text - baseline_embedded) + + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) + encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, lstm_hidden_size) + hidden, att_weights = self.attention(encoded_text, mask) # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(hidden)) # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + probs = self.softmax(logits) + return probs, att_weights.squeeze(axis=-1), embedded_text + + +class SelfAttention(nn.Layer): + """ + A close implementation of attention network of ACL 2016 paper, + Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016). + ref: https://www.aclweb.org/anthology/P16-2034/ + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.hidden_size = hidden_size + self.att_weight = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. + mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. + Defaults to `None`. 
+ """ + forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2) + # elementwise-sum forward_x and backward_x + # Shape: (batch_size, max_seq_len, hidden_size) + h = paddle.add_n([forward_input, backward_input]) + # Shape: (batch_size, hidden_size, 1) + att_weight = self.att_weight.tile(repeat_times=(paddle.shape(h)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, 1) + att_score = paddle.bmm(paddle.tanh(h), att_weight) + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=mask.shape, dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + # Shape: (batch_size, max_seq_len, 1) + att_weight = F.softmax(att_score, axis=1) + # Shape: (batch_size, lstm_hidden_size) + reps = paddle.bmm(h.transpose(perm=(0, 2, 1)), att_weight).squeeze(axis=-1) + reps = paddle.tanh(reps) + return reps, att_weight + + +class SelfInteractiveAttention(nn.Layer): + """ + A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016). + ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.input_weight = self.create_parameter(shape=[1, hidden_size, hidden_size], dtype="float32") + self.bias = self.create_parameter(shape=[1, 1, hidden_size], dtype="float32") + self.att_context_vector = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, hidden_size): Tensor containing the features of the input sequence. + mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. 
+ Defaults to `None + """ + weight = self.input_weight.tile( + repeat_times=(paddle.shape(input)[0], 1, 1) + ) # tensor[batch, hidden_size, hidden_size] + bias = self.bias.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) # tensor[batch, 1, hidden_size] + word_squish = paddle.bmm(input, weight) + bias # Shape: (batch_size, seq_len, hidden_size) + att_context_vector = self.att_context_vector.tile( + repeat_times=(paddle.shape(input)[0], 1, 1) + ) # Shape: (batch_size, hidden_size, 1) + att_score = paddle.bmm(word_squish, att_context_vector) # tensor[batch_size, seq_len, 1] + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=paddle.shape(mask), dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + att_weight = F.softmax(att_score, axis=1) # tensor[batch_size, seq_len, 1] + + reps = paddle.bmm(input.transpose(perm=(0, 2, 1)), att_weight).squeeze(-1) # Shape: (batch_size, hidden_size) + return reps, att_weight diff --git a/examples/model_interpretation/task/senti/rnn/tokenizer_config.json b/examples/model_interpretation/task/senti/rnn/tokenizer_config.json new file mode 100644 index 0000000000000000000000000000000000000000..1b15a346024173ce58a2770fe03c27f2e0db3c32 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/tokenizer_config.json @@ -0,0 +1 @@ +{"model":"LSTM"} \ No newline at end of file diff --git a/examples/model_interpretation/task/senti/rnn/train.py b/examples/model_interpretation/task/senti/rnn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..570334a5d94e9ac5d17fb088a85821a8d37c1610 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/train.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
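+"""
+Trains the BiLSTM-Attention sentiment baseline that the interpretation scripts
+in this task later load as the LSTM model.
+
+Illustrative invocation from this directory (paths and values are examples only,
+not the required ones):
+
+    python train.py --device gpu --language ch --vocab_path vocab.txt \
+        --epochs 10 --batch_size 64 --lr 5e-5 --save_dir checkpoints/
+"""
+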
+import argparse +import os +import random +from functools import partial + +import numpy as np +import paddle +from model import BiLSTMAttentionModel, SelfInteractiveAttention +from utils import CharTokenizer, convert_example + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--vocab_path", type=str, default=None) +parser.add_argument("--language", choices=["ch", "en"], default=None, help="Language that the model is built for") +args = parser.parse_args() + + +def set_seed(seed=1000): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + set_seed() + + if args.language == "ch": + train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) + else: + train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"]) + + # Loads vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = CharTokenizer(vocab, args.language, "../../../punctuations") + + # Constructs the newtork. 
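+    # The encoder below runs in "bidirect" mode, so the attention layer is built
+    # with hidden_size = 2 * lstm_hidden_size to match the concatenated
+    # forward/backward hidden states (see SelfInteractiveAttention in model.py).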
+ vocab_size = len(vocab) + num_classes = len(train_ds.label_list) + pad_token_id = 0 + pad_value = vocab.token_to_idx.get("[PAD]", 0) + + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + + model = paddle.Model(model) + + # Reads data and generates mini-batches. + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False, language=args.language) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_value), Stack(dtype="int64"), Stack(dtype="int64") # input_ids # seq len # label + ): [data for data in fn(samples)] + + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) diff --git a/examples/model_interpretation/task/senti/rnn/utils.py b/examples/model_interpretation/task/senti/rnn/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4d574423d48a793a1948c7178e77bcf60af0f4a9 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/utils.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def convert_example(example, tokenizer, is_test=False, language="en"): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. 
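+
+    Note:
+        English examples are expected to provide `sentence`/`labels` fields
+        (SST-2 style), while Chinese examples provide `text`/`label`
+        (ChnSentiCorp style). When `is_test` is True, only the `context` field
+        is read and no label is returned.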
+ """ + if is_test: + input_ids = tokenizer.encode(example["context"]) + valid_length = np.array(len(input_ids), dtype="int64") + input_ids = np.array(input_ids, dtype="int64") + return input_ids, valid_length + else: + if language == "en": + input_ids = tokenizer.encode(example["sentence"]) + label = np.array(example["labels"], dtype="int64") + else: + input_ids = tokenizer.encode(example["text"]) + label = np.array(example["label"], dtype="int64") + valid_length = np.array(len(input_ids), dtype="int64") + input_ids = np.array(input_ids, dtype="int64") + return input_ids, valid_length, label + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[str]`): The prediction data whose each element is a tokenized text. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + + """ + examples = [] + for text in data: + # ids = tokenizer.encode(text) # JiebaTokenizer + ids = tokenizer.encode(text)[0].tolist()[1:-1] # ErnieTokenizer list[ids] + examples.append([ids, len(ids)]) + + return examples + + +def get_idx_from_word(word, word_to_idx, unk_word): + if word in word_to_idx: + return word_to_idx[word] + return word_to_idx[unk_word] + + +class CharTokenizer: + def __init__(self, vocab, language, vocab_path): + self.tokenizer = list + self.vocab = vocab + self.language = language + self.vocab_path = vocab_path + self.unk_token = [] + + def encode(self, sentence): + if self.language == "ch": + words = tokenizer_punc(sentence, self.vocab_path) + else: + words = sentence.strip().split() + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in words] + + def tokenize(self, sentence, wo_unk=True): + if self.language == "ch": + return tokenizer_punc(sentence, self.vocab_path) + else: + return sentence.strip().split() + + def convert_tokens_to_string(self, tokens): + return " ".join(tokens) + + def convert_tokens_to_ids(self, tokens): + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in tokens] + + +def tokenizer_lac(string, lac): + temp = "" + res = [] + for c in string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + res.extend(lac.run(temp)) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + res.extend(lac.run(temp)) + return res + + +def tokenizer_punc(string, vocab_path): + res = [] + sub_string_list = string.strip().split("[MASK]") + for idx, sub_string in enumerate(sub_string_list): + temp = "" + for c in sub_string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + if idx < len(sub_string_list) - 1: + res.append("[MASK]") + return res + + +def punc_split(string, vocab_path): + punc_set = set() + with open(vocab_path, "r") as f: + for token in f: + punc_set.add(token.strip()) + punc_set.add(" ") + for ascii_num in range(65296, 65306): + punc_set.add(chr(ascii_num)) + for ascii_num in range(48, 58): + punc_set.add(chr(ascii_num)) + + res = [] + temp = "" + for c in string: + if c in punc_set: + if temp != "": + res.append(temp) + temp = "" + res.append(c) + else: + temp += c + if temp != "": 
+ res.append(temp) + return res diff --git a/examples/model_interpretation/task/senti/roberta/modeling.py b/examples/model_interpretation/task/senti/roberta/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..02f2bec87d8554295771910cc55fabd261ba91b2 --- /dev/null +++ b/examples/model_interpretation/task/senti/roberta/modeling.py @@ -0,0 +1,608 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model + +sys.path.append("../..") +from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 + +sys.path.remove("../..") + +__all__ = [ + "RobertaModel", + "RobertaPretrainedModel", + "RobertaForSequenceClassification", + "RobertaForTokenClassification", + "RobertaForQuestionAnswering", +] + + +class RobertaEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + pad_token_id=0, + ): + super(RobertaEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RobertaPooler(nn.Layer): + def __init__(self, hidden_size): + super(RobertaPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class RobertaPretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained RoBerta models. 
It provides RoBerta related + `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. + + """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "roberta-wwm-ext": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "roberta-wwm-ext-large": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbt3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbtl3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", + "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", + "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", + "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", + } + } + base_model_prefix = "roberta" + + def _init_weights(self, layer): + """Initialization hook""" + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.roberta.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class RobertaModel(RobertaPretrainedModel): + r""" + The bare Roberta Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `RobertaModel`. 
Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. + num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to ``"gelu"``. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer. Defaults to 0.02. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. + + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + ): + super(RobertaModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.embeddings = RobertaEmbeddings( + vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id + ) + encoder_layer = TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + ) + self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = RobertaPooler(hidden_size) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
+ token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, max_position_embeddings - 1]``. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple: Returns tuple (`sequence_output`, `pooled_output`). + + With the fields: + + - sequence_output (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - pooled_output (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaModel, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaModel.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] + ) + # CLS: 101; SEP: 102; PAD: 0 + baseline_ids = paddle.to_tensor( + [101] + [0] * (input_ids.shape[1] - 2) + [102], + dtype=input_ids.dtype, + place=input_ids.place, + stop_gradient=input_ids.stop_gradient, + ) + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + baseline_embedding_output = self.embeddings( + input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + pass + # stdev_spread = 0.15 + # stdev = stdev_spread * (orig_embedded.max() - orig_embedded.min()).numpy() + # noise = paddle.to_tensor(np.random.normal(0, stdev, orig_embedded.shape).astype(np.float32), + # stop_gradient=False) + # orig_embedded = orig_embedded + noise + if noise.upper() == "INTEGRATED": + embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( + embedding_output - baseline_embedding_output + ) + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + # encoder_outputs = self.encoder(embedding_output, attention_mask) + encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output, att_weights_list, embedding_output + + +class RobertaForQuestionAnswering(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output to + compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. + + Args: + roberta (:class:`RobertaModel`): + An instance of RobertaModel. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` of `RobertaModel` + instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, dropout=None): + super(RobertaForQuestionAnswering, self).__init__() + self.roberta = roberta # allow roberta to be config + self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=None, attention_mask=None + ) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class RobertaForSequenceClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + self.softmax = nn.Softmax() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Its data type should be float32 and it has a shape of [batch_size, num_classes]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + _, pooled_output, _, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + def forward_interpet( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + _, pooled_output, att_weights_list, embedding_output = self.roberta( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + noise=noise, + i=i, + n_samples=n_samples, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + probs = self.softmax(logits) + + return probs, att_weights_list, embedding_output + + +class RobertaForTokenClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForTokenClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits diff --git a/examples/model_interpretation/task/senti/run_inter.sh b/examples/model_interpretation/task/senti/run_inter.sh new file mode 100644 index 0000000000000000000000000000000000000000..c7b71e78d21213174c6fdc21ff97cc34688729a8 --- /dev/null +++ b/examples/model_interpretation/task/senti/run_inter.sh @@ -0,0 +1,65 @@ +### + # This file contains script to generate saliency map of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +LANGUAGE=en # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large, lstm] +INTER_MODE=attention # INTER_MODE choice in [attention, integrated_gradient, lime] +TASK=senti_${LANGUAGE} +DATA=../../data/${TASK} +START_ID=0 +FROM_PRETRAIN='test' +VOCAB_PATH='test' + +if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-base' + CKPT=pretrained_models/saved_model_en/roberta_base_20211105_135732/model_10000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-large' + CKPT=pretrained_models/saved_model_en/roberta_large_20211105_160323/model_4000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.sst2_train' + CKPT=rnn/checkpoints_en/final.pdparams + fi + +elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211229_101252/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211014_192021/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211229_105019/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.txt' + CKPT=rnn/checkpoints_ch/final.pdparams + fi +fi + +OUTPUT=./output/${TASK}.${BASE_MODEL} +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x + +python3 ./saliency_map/sentiment_interpretable.py \ + --language $LANGUAGE \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --vocab_path $VOCAB_PATH \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 200 \ + --start_id $START_ID \ + --eval $@ diff --git 
a/examples/model_interpretation/task/senti/run_inter_all.sh b/examples/model_interpretation/task/senti/run_inter_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..8b0a1d98bf0113b413d7d4698733dd0dc8551359 --- /dev/null +++ b/examples/model_interpretation/task/senti/run_inter_all.sh @@ -0,0 +1,75 @@ +### + # This file contains script to generate saliency map of all baseline models and languages on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=1 +export PYTHONPATH=./:$PYTHONPATH +START_ID=0 +FROM_PRETRAIN='test' +VOCAB_PATH='test' + +for BASE_MODEL in "lstm" "roberta_base" "roberta_large"; +do + for INTER_MODE in "attention" "integrated_gradient" "lime"; + do + for LANGUAGE in "ch" "en"; + do + TASK=senti_${LANGUAGE} + DATA=../../data/${TASK} + + if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-base' + CKPT=pretrained_models/saved_model_en/roberta_base_20211105_135732/model_10000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-large' + CKPT=pretrained_models/saved_model_en/roberta_large_20211105_160323/model_4000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.sst2_train' + CKPT=rnn/checkpoints_en/final.pdparams + #CKPT=rnn/checkpoints_en/final.pdparams + fi + + elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211229_101252/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211014_192021/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211229_105019/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.txt' + CKPT=rnn/checkpoints_ch/final.pdparams + #CKPT=rnn/checkpoints_ch/final.pdparams + fi + fi + + OUTPUT=./output/${TASK}.${BASE_MODEL} + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + + if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then + python3 ./saliency_map/sentiment_interpretable.py \ + --language $LANGUAGE \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --vocab_path $VOCAB_PATH \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 200 \ + --start_id $START_ID \ + --eval $@ + fi + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py b/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py new file mode 100644 index 0000000000000000000000000000000000000000..61afefc70ec598d94f0e52ad66d345379a594131 --- /dev/null +++ b/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py @@ -0,0 +1,502 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import collections +import json +import logging +import os +import sys +from functools import partial +from pathlib import Path + +import numpy as np +import paddle +from LIME.lime_text import LimeTextExplainer +from rnn.model import BiLSTMAttentionModel, SelfInteractiveAttention +from rnn.utils import CharTokenizer, convert_example +from roberta.modeling import RobertaForSequenceClassification +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, + match, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("interpret sentiment analysis task") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--eval", action="store_true") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--start_id", type=int, default=0) + parser.add_argument("--vocab_path", type=str, required=True) + parser.add_argument("--language", type=str, required=True, help="language that the model is built for") + args = parser.parse_args() + return args + + +class Senti_data(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + yield { + "id": line_split["id"], + "context": line_split["context"], + "sent_token": line_split["sent_token"], + } + + +def 
create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +def map_fn_senti(examples, tokenizer, args): + log.debug("load data %d" % len(examples)) + if args.language == "en": + contexts = [example["context"].encode("ascii", errors="replace").decode("UTF-8") for example in examples] + else: + contexts = [example["context"] for example in examples] + tokenized_examples = tokenizer(contexts, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + for i in range(len(tokenized_examples)): + tokenized_examples[i]["offset_mapping"] = ( + [(0, 0)] + tokenizer.get_offset_mapping(contexts[i])[: args.max_seq_len - 2] + [(0, 0)] + ) + return tokenized_examples + + +def init_lstm_var(args): + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language, "../../punctuations") + padding_idx = vocab.token_to_idx.get("[PAD]", 0) + + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=True, language=args.language) + + # Init attention layer + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=len(tokenizer.vocab), + lstm_hidden_size=lstm_hidden_size, + num_classes=2, + padding_idx=padding_idx, + ) + + # Reads data and generates mini-batches. 
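+    # convert_example is called with is_test=True above, so each mini-batch is
+    # only (padded input_ids, sequence lengths); no labels are produced here.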
+ dev_ds = Senti_data().read(args.data_dir) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=padding_idx), # input_ids + Stack(dtype="int64"), # seq len + ): [data for data in fn(samples)] + + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + return model, tokenizer, dev_loader + + +def init_roberta_var(args): + tokenizer = None + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer, args=args) + + dev_ds = Senti_data().read(args.data_dir) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader + + +def extract_attention_scores(args, atts, input_ids, tokens, sub_word_id_dict, result, offset, out_handle): + if args.base_model.startswith("roberta"): + inter_score = atts[-1][:, :, 0, :].mean(1) # (bsz, seq) + inter_score = inter_score[0][1:-1] # remove CLS and SEP + input_ids = input_ids[0][1:-1] + + elif args.base_model == "lstm": + inter_score = atts[0] + input_ids = input_ids[0] + + length = (inter_score > 0).cast("int32").sum(-1).tolist()[0] + assert len(tokens) == length, f"%s: {len(tokens)} != {length}" % (step + 1) + + char_attribution_dict = {} + # Collect scores in different situation + if args.base_model.startswith("roberta"): + assert len(inter_score) == len(offset), str(len(inter_score)) + "not equal to" + str(len(offset)) + sorted_token = [] + for i in range(len(inter_score)): + sorted_token.append([i, offset[i], inter_score[i]]) + + char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) + + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("sent_token") + else: + if args.language == "ch": + idx = 0 + for token, score in zip(tokens, inter_score.numpy().tolist()): + char_attribution_dict[idx] = (token, score) + idx += 1 + else: + idx = 0 + for word, sub_word_score in zip(tokens, inter_score.tolist()): + char_attribution_dict[idx] = (word, sub_word_score) + idx += 1 + + result["char_attri"] = collections.OrderedDict() + for token_id, token_info in sorted(char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["char_attri"][token_id] = token_info + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def extract_integrated_gradient_scores( + args, + atts, + input_ids, + tokens, + sub_word_id_dict, + fwd_args, + fwd_kwargs, + model, + result, + pred_label, + err_total, + offset, + out_handle, +): + embedded_grads_list = [] + for i in 
range(args.n_samples): + probs, _, embedded = model.forward_interpet( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + predicted_class_prob = probs[0][pred_label] + predicted_class_prob.backward(retain_graph=False) + embedded_grad = embedded.grad + model.clear_gradients() + embedded_grads_list.append(embedded_grad) + + if i == 0: + baseline_pred_confidence = probs.tolist()[0][pred_label] # scalar + baseline_embedded = embedded # Tensor(1, seq_len, embed_size) + elif i == args.n_samples - 1: + pred_confidence = probs.tolist()[0][pred_label] # scalar + pred_embedded = embedded # Tensor(1, seq_len, embed_size) + + embedded_grads_tensor = paddle.to_tensor( + embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + + trapezoidal_grads = (embedded_grads_tensor[1:] + embedded_grads_tensor[:-1]) / 2 + integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) + + inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) + inter_score = inter_score.sum(-1) # Tensor(1, seq_len) + + # eval err + delta_pred_confidence = pred_confidence - baseline_pred_confidence + sum_gradient = inter_score.sum().tolist()[0] + err = (delta_pred_confidence - sum_gradient + 1e-12) / (delta_pred_confidence + 1e-12) + err_total.append(np.abs(err)) + + print_str = "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f" + print_vals = (result["id"], args.n_samples, delta_pred_confidence, sum_gradient, err, np.average(err_total)) + log.debug(print_str % print_vals) + + inter_score.stop_gradient = True + + char_attribution_dict = {} + if args.base_model.startswith("roberta"): + inter_score = inter_score[0][1:-1] + sorted_token = [] + for i in range(len(inter_score)): + sorted_token.append([i, offset[i], inter_score[i]]) + char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) + + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("sent_token") + + elif args.base_model == "lstm": + inter_score = inter_score[0] + idx = 0 + for word, sub_word_score in zip(tokens, inter_score.tolist()): + char_attribution_dict[idx] = (word, sub_word_score) + idx += 1 + + result["char_attri"] = collections.OrderedDict() + for token_id, token_info in sorted(char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["char_attri"][token_id] = token_info + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + return err_total + + +def extract_LIME_scores( + args, + tokenizer, + tokens, + pred_label, + model, + probs, + result, + lime_err_total, + lime_score_total, + lime_relative_err_total, + out_handle, +): + explainer = LimeTextExplainer(class_names=["neg", "pos"], verbose=False, language=args.language) + + if_lstm = args.base_model == "lstm" + explain_res = None + + text_instance = result["context"] + + explain_res = explainer.explain_instance( + text_instance=text_instance, + tokenizer=tokenizer, + pred_label=pred_label, + classifier_fn=model.forward_interpet, + num_samples=5000, + if_lstm=if_lstm, + ) + + exp, indexed_string, relative_err, err = explain_res + + score = exp.score[pred_label] + local_exps = exp.local_exp + ridge_pred = exp.local_pred[pred_label] + model_pred = probs.numpy().tolist()[0][pred_label] + + lime_score_total.append(score) + 
lime_relative_err_total.append(relative_err) + lime_err_total.append(err) + log.debug("score: %.2f" % score) + log.debug("relative_err: %.2f" % relative_err) + log.debug("err: %.2f" % err) + log.debug("ridge_pred: %.2f\tpred: %.2f\tdelta: %.2f" % (ridge_pred, model_pred, ridge_pred - model_pred)) + + for kind, local_exp in local_exps.items(): # only have one iteration here + char_attribution_dict = [] + + for idx in range(len(result["sent_token"])): + t = result["sent_token"][idx] # .replace('Ġ', '') + got_score = False + for word_id, attribution in local_exp: + if indexed_string.inverse_vocab[word_id] == t: + char_attribution_dict.append((idx, t, attribution)) + got_score = True + break + if not got_score: + char_attribution_dict.append((idx, t, 0)) + char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) + + result["char_attri"] = collections.OrderedDict() + for s in char_attribution_dict: + result["char_attri"][s[0]] = (s[1], s[2]) + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + return lime_err_total, lime_score_total, lime_relative_err_total + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader = init_roberta_var(args) + elif args.base_model == "lstm": + model, tokenizer, dataloader = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + assert args.eval, "INTERPRETER must be run in eval mode" + with paddle.amp.auto_cast(enable=args.use_amp), open( + os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" + ) as out_handle: + + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # set dropout to 0 in order to get the gradient + log.debug("load model from %s" % args.init_checkpoint) + + get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + for step, d in tqdm(enumerate(dataloader)): + if step + 1 < args.start_id: # start from the step's instance + continue + # Initialize input_ids, fwd_args, tokens + result = {} + offset = None + if args.base_model.startswith("roberta"): + input_ids, token_type_ids, offset_map = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:-1].tolist()) # list + offset = offset_map[0, 1:-1] + + elif args.base_model == "lstm": + input_ids, seq_lens = d + fwd_args = [input_ids, seq_lens] + fwd_kwargs = {} + tokens = [tokenizer.vocab.idx_to_token[input_id] for input_id in input_ids.tolist()[0]] + + result["id"] = dataloader.dataset.data[step]["id"] + + probs, atts, embedded = model.forward_interpet(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + sub_word_id_dict = [] + err_total = [] + lime_err_total, lime_score_total, lime_relative_err_total = [], [], [] + + result["context"] = dataloader.dataset.data[step]["context"] + result["sent_token"] = dataloader.dataset.data[step]["sent_token"] + + # Attention + if args.inter_mode == "attention": + # extract attention scores and write resutls to file + extract_attention_scores(args, atts, input_ids, tokens, sub_word_id_dict, result, offset, out_handle) + + # Integrated_gradient + elif args.inter_mode == "integrated_gradient": + err_total = extract_integrated_gradient_scores( + args, + atts, + input_ids, + tokens, + sub_word_id_dict, + fwd_args, 
+                fwd_kwargs,
+                model,
+                result,
+                pred_label,
+                err_total,
+                offset,
+                out_handle,
+            )
+
+        # LIME
+        elif args.inter_mode == "lime":
+            lime_err_total, lime_score_total, lime_relative_err_total = extract_LIME_scores(
+                args,
+                tokenizer,
+                tokens,
+                pred_label,
+                model,
+                probs,
+                result,
+                lime_err_total,
+                lime_score_total,
+                lime_relative_err_total,
+                out_handle,
+            )
+
+        else:
+            raise KeyError(f"Unknown interpretable mode: {args.inter_mode}")
+
+        if args.inter_mode == "lime":
+            log.debug(np.average(np.array(lime_relative_err_total)))
diff --git a/examples/model_interpretation/task/senti/saliency_map/utils.py b/examples/model_interpretation/task/senti/saliency_map/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..da76e25bfa59af4140d2068880c6ce5aade8ee7f
--- /dev/null
+++ b/examples/model_interpretation/task/senti/saliency_map/utils.py
@@ -0,0 +1,38 @@
+# !/usr/bin/env python3
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import paddle
+
+
+class UnpackDataLoader(paddle.io.DataLoader):
+    def __init__(self, *args, **kwargs):
+        super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs)
+
+    def __iter__(self):
+        return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__())
+
+
+def create_if_not_exists(dir):
+    try:
+        dir.mkdir(parents=True)
+    except:
+        pass
+    return dir
+
+
+def get_warmup_and_linear_decay(max_steps, warmup_steps):
+    return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps))
diff --git a/examples/model_interpretation/task/similarity/LIME/exceptions.py b/examples/model_interpretation/task/similarity/LIME/exceptions.py
new file mode 100644
index 0000000000000000000000000000000000000000..c5fa1a29924ad795104c6ce7c124a58d1fa06dfe
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/LIME/exceptions.py
@@ -0,0 +1,2 @@
+class LimeError(Exception):
+    """Raise for errors"""
diff --git a/examples/model_interpretation/task/similarity/LIME/explanation.py b/examples/model_interpretation/task/similarity/LIME/explanation.py
new file mode 100644
index 0000000000000000000000000000000000000000..46b3f0463fa6fb6b48271cbbc2652d3a5cec1b18
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/LIME/explanation.py
@@ -0,0 +1,343 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. +""" +Explanation class, with visualization functions. +""" +from io import open +import os +import os.path +import json +import string +import numpy as np + +from sklearn.utils import check_random_state + +from LIME.exceptions import LimeError + + +def id_generator(size=15, random_state=None): + """Helper function to generate random div ids. This is useful for embedding + HTML into ipython notebooks.""" + chars = list(string.ascii_uppercase + string.digits) + return "".join(random_state.choice(chars, size, replace=True)) + + +class DomainMapper(object): + """Class for mapping features to the specific domain. + + The idea is that there would be a subclass for each domain (text, tables, + images, etc), so that we can have a general Explanation class, and separate + out the specifics of visualizing features in here. + """ + + def __init__(self): + pass + + def map_exp_ids(self, exp, **kwargs): + """Maps the feature ids to concrete names. + + Default behaviour is the identity function. Subclasses can implement + this as they see fit. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + kwargs: optional keyword arguments + + Returns: + exp: list of tuples [(name, weight), (name, weight)...] + """ + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, **kwargs): + """Produces html for visualizing the instance. + + Default behaviour does nothing. Subclasses can implement this as they + see fit. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + kwargs: optional keyword arguments + + Returns: + js code for visualizing the instance + """ + return "" + + +class Explanation(object): + """Object returned by explainers.""" + + def __init__(self, domain_mapper, mode="classification", class_names=None, random_state=None): + """ + + Initializer. + + Args: + domain_mapper: must inherit from DomainMapper class + type: "classification" or "regression" + class_names: list of class names (only used for classification) + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.random_state = random_state + self.mode = mode + self.domain_mapper = domain_mapper + self.local_exp = {} + self.intercept = {} + self.score = {} + self.local_pred = {} + if mode == "classification": + self.class_names = class_names + self.top_labels = None + self.predict_proba = None + elif mode == "regression": + self.class_names = ["negative", "positive"] + self.predicted_value = None + self.min_value = 0.0 + self.max_value = 1.0 + self.dummy_label = 1 + else: + raise LimeError( + 'Invalid explanation mode "{}". ' 'Should be either "classification" ' 'or "regression".'.format(mode) + ) + + def available_labels(self): + """ + Returns the list of classification labels for which we have any explanations. + """ + try: + assert self.mode == "classification" + except AssertionError: + raise NotImplementedError("Not supported for regression explanations.") + else: + ans = self.top_labels if self.top_labels else self.local_exp.keys() + return list(ans) + + def as_list(self, label=1, **kwargs): + """Returns the explanation as a list. + + Args: + label: desired label. 
If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + list of tuples (representation, weight), where representation is + given by domain_mapper. Weight is a float. + """ + label_to_use = label if self.mode == "classification" else self.dummy_label + ans = self.domain_mapper.map_exp_ids(self.local_exp[label_to_use], **kwargs) + ans = [(x[0], float(x[1])) for x in ans] + return ans + + def as_map(self): + """Returns the map of explanations. + + Returns: + Map from label to list of tuples (feature_id, weight). + """ + return self.local_exp + + def as_pyplot_figure(self, label=1, **kwargs): + """Returns the explanation as a pyplot figure. + + Will throw an error if you don't have matplotlib installed + Args: + label: desired label. If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + pyplot figure (barchart). + """ + import matplotlib.pyplot as plt + + exp = self.as_list(label=label, **kwargs) + fig = plt.figure() + vals = [x[1] for x in exp] + names = [x[0] for x in exp] + vals.reverse() + names.reverse() + colors = ["green" if x > 0 else "red" for x in vals] + pos = np.arange(len(exp)) + 0.5 + plt.barh(pos, vals, align="center", color=colors) + plt.yticks(pos, names) + if self.mode == "classification": + title = "Local explanation for class %s" % self.class_names[label] + else: + title = "Local explanation" + plt.title(title) + return fig + + def show_in_notebook(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Shows html explanation in ipython notebook. + + See as_html() for parameters. + This will throw an error if you don't have IPython installed""" + + from IPython.core.display import display, HTML + + display( + HTML( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + ) + + def save_to_file(self, file_path, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Saves html explanation to file. . + + Params: + file_path: file to save explanations to + + See as_html() for additional parameters. + + """ + file_ = open(file_path, "w", encoding="utf8") + file_.write( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + file_.close() + + def as_html(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Returns the explanation as an html page. + + Args: + labels: desired labels to show explanations for (as barcharts). + If you ask for a label for which an explanation wasn't + computed, will throw an exception. If None, will show + explanations for all available labels. (only used for classification) + predict_proba: if true, add barchart with prediction probabilities + for the top classes. (only used for classification) + show_predicted_value: if true, add barchart with expected value + (only used for regression) + kwargs: keyword arguments, passed to domain_mapper + + Returns: + code for an html page, including javascript includes. 
+ """ + + def jsonize(x): + return json.dumps(x, ensure_ascii=False) + + if labels is None and self.mode == "classification": + labels = self.available_labels() + + this_dir, _ = os.path.split(__file__) + bundle = open(os.path.join(this_dir, "bundle.js"), encoding="utf8").read() + + out = ( + """<html> + <meta http-equiv="content-type" content="text/html; charset=UTF8"> + <head><script>%s </script></head><body>""" + % bundle + ) + random_id = id_generator(size=15, random_state=check_random_state(self.random_state)) + out += ( + """ + <div class="lime top_div" id="top_div%s"></div> + """ + % random_id + ) + + predict_proba_js = "" + if self.mode == "classification" and predict_proba: + predict_proba_js = """ + var pp_div = top_div.append('div') + .classed('lime predict_proba', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictProba(pp_svg, %s, %s); + """ % ( + jsonize([str(x) for x in self.class_names]), + jsonize(list(self.predict_proba.astype(float))), + ) + + predict_value_js = "" + if self.mode == "regression" and show_predicted_value: + # reference self.predicted_value + # (svg, predicted_value, min_value, max_value) + predict_value_js = """ + var pp_div = top_div.append('div') + .classed('lime predicted_value', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictedValue(pp_svg, %s, %s, %s); + """ % ( + jsonize(float(self.predicted_value)), + jsonize(float(self.min_value)), + jsonize(float(self.max_value)), + ) + + exp_js = """var exp_div; + var exp = new lime.Explanation(%s); + """ % ( + jsonize([str(x) for x in self.class_names]) + ) + + if self.mode == "classification": + for label in labels: + exp = jsonize(self.as_list(label)) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %d, exp_div); + """ % ( + exp, + label, + ) + else: + exp = jsonize(self.as_list()) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %s, exp_div); + """ % ( + exp, + self.dummy_label, + ) + + raw_js = """var raw_div = top_div.append('div');""" + + if self.mode == "classification": + html_data = self.local_exp[labels[0]] + else: + html_data = self.local_exp[self.dummy_label] + + raw_js += self.domain_mapper.visualize_instance_html( + html_data, labels[0] if self.mode == "classification" else self.dummy_label, "raw_div", "exp", **kwargs + ) + out += """ + <script> + var top_div = d3.select('#top_div%s').classed('lime top_div', true); + %s + %s + %s + %s + </script> + """ % ( + random_id, + predict_proba_js, + predict_value_js, + exp_js, + raw_js, + ) + out += "</body></html>" + + return out diff --git a/examples/model_interpretation/task/similarity/LIME/lime_base.py b/examples/model_interpretation/task/similarity/LIME/lime_base.py new file mode 100644 index 0000000000000000000000000000000000000000..ca9ce28389191ac81364be6c07f33518d4a5f3f5 --- /dev/null +++ b/examples/model_interpretation/task/similarity/LIME/lime_base.py @@ -0,0 +1,225 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Contains abstract functionality for learning locally linear sparse model. +""" +import numpy as np +import scipy as sp +from sklearn.linear_model import Ridge, lars_path +from sklearn.utils import check_random_state + + +class LimeBase(object): + """Class for learning a locally linear sparse model from perturbed data""" + + def __init__(self, kernel_fn, verbose=False, random_state=None): + """Init function + + Args: + kernel_fn: function that transforms an array of distances into an + array of proximity values (floats). + verbose: if true, print local prediction values from linear model. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.kernel_fn = kernel_fn + self.verbose = verbose + self.random_state = check_random_state(random_state) + + @staticmethod + def generate_lars_path(weighted_data, weighted_labels): + """Generates the lars path for weighted data. + + Args: + weighted_data: data that has been weighted by kernel + weighted_label: labels, weighted by kernel + + Returns: + (alphas, coefs), both are arrays corresponding to the + regularization parameter and coefficients, respectively + """ + x_vector = weighted_data + alphas, _, coefs = lars_path(x_vector, weighted_labels, method="lasso", verbose=False) + return alphas, coefs + + def forward_selection(self, data, labels, weights, num_features): + """Iteratively adds features to the model""" + clf = Ridge(alpha=0, fit_intercept=True, random_state=self.random_state) + used_features = [] + for _ in range(min(num_features, data.shape[1])): + max_ = -100000000 + best = 0 + for feature in range(data.shape[1]): + if feature in used_features: + continue + clf.fit(data[:, used_features + [feature]], labels, sample_weight=weights) + score = clf.score(data[:, used_features + [feature]], labels, sample_weight=weights) + if score > max_: + best = feature + max_ = score + used_features.append(best) + return np.array(used_features) + + def feature_selection(self, data, labels, weights, num_features, method): + """Selects features for the model. 
see explain_instance_with_data to
+           understand the parameters."""
+        if method == "none":
+            return np.array(range(data.shape[1]))
+
+        elif method == "forward_selection":
+            return self.forward_selection(data, labels, weights, num_features)
+
+        elif method == "highest_weights":
+            clf = Ridge(alpha=0.01, fit_intercept=True, random_state=self.random_state)
+            clf.fit(data, labels, sample_weight=weights)
+
+            coef = clf.coef_
+            if sp.sparse.issparse(data):
+                coef = sp.sparse.csr_matrix(clf.coef_)
+                weighted_data = coef.multiply(data[0])
+                # Note: most efficient to slice the data before reversing
+                sdata = len(weighted_data.data)
+                argsort_data = np.abs(weighted_data.data).argsort()
+                # Edge case where data is more sparse than requested number of feature importances
+                # In that case, we just pad with zero-valued features
+                if sdata < num_features:
+                    nnz_indexes = argsort_data[::-1]
+                    indices = weighted_data.indices[nnz_indexes]
+                    num_to_pad = num_features - sdata
+                    indices = np.concatenate((indices, np.zeros(num_to_pad, dtype=indices.dtype)))
+                    indices_set = set(indices)
+                    pad_counter = 0
+                    for i in range(data.shape[1]):
+                        if i not in indices_set:
+                            indices[pad_counter + sdata] = i
+                            pad_counter += 1
+                            if pad_counter >= num_to_pad:
+                                break
+                else:
+                    nnz_indexes = argsort_data[sdata - num_features : sdata][::-1]
+                    indices = weighted_data.indices[nnz_indexes]
+                return indices
+            else:
+                weighted_data = coef * data[0]
+                feature_weights = sorted(
+                    zip(range(data.shape[1]), weighted_data),  # zip(feature index, Ridge coefficient)
+                    key=lambda x: np.abs(x[1]),
+                    reverse=True,
+                )
+                # return the indices of the num_features features with the largest absolute Ridge coefficients
+                return np.array([x[0] for x in feature_weights[:num_features]])
+
+        elif method == "lasso_path":
+            weighted_data = (data - np.average(data, axis=0, weights=weights)) * np.sqrt(weights[:, np.newaxis])
+            weighted_labels = (labels - np.average(labels, weights=weights)) * np.sqrt(weights)
+            nonzero = range(weighted_data.shape[1])
+            _, coefs = self.generate_lars_path(weighted_data, weighted_labels)
+            for i in range(len(coefs.T) - 1, 0, -1):
+                nonzero = coefs.T[i].nonzero()[0]
+                if len(nonzero) <= num_features:
+                    break
+            used_features = nonzero
+            return used_features
+
+        elif method == "auto":
+            if num_features <= 6:
+                n_method = "forward_selection"
+            else:
+                n_method = "highest_weights"
+            return self.feature_selection(data, labels, weights, num_features, n_method)
+
+    def explain_instance_with_data(
+        self,
+        neighborhood_data,
+        neighborhood_labels,
+        distances,
+        label,
+        num_features,
+        feature_selection="auto",
+        model_regressor=None,
+    ):
+        """Takes perturbed data, labels and distances, returns explanation.
+
+        Args:
+            neighborhood_data: perturbed data, 2d array. first element is
+                assumed to be the original data point.
+            neighborhood_labels: corresponding perturbed labels. should have as
+                many columns as the number of possible labels.
+            distances: distances to original data point.
+            label: label for which we want an explanation
+            num_features: maximum number of features in explanation
+            feature_selection: how to select num_features. options are:
+                'forward_selection': iteratively add features to the model.
+                    This is costly when num_features is high
+                'highest_weights': selects the features that have the highest
+                    product of absolute weight * original data point when
+                    learning with all the features
+                'lasso_path': chooses features based on the lasso
+                    regularization path
+                'none': uses all features, ignores num_features
+                'auto': uses forward_selection if num_features <= 6, and
+                    'highest_weights' otherwise.
+            model_regressor: sklearn regressor to use in explanation.
+                Defaults to Ridge regression if None. Must have
+                model_regressor.coef_ and 'sample_weight' as a parameter
+                to model_regressor.fit()
+
+        Returns:
+            (intercept, exp, score, local_pred):
+            intercept is a float.
+            exp is a sorted list of tuples, where each tuple (x,y) corresponds to the feature id (x)
+            and the local weight (y). The list is sorted by decreasing absolute value of y.
+            score is the R^2 value of the returned explanation
+            local_pred is the prediction of the explanation model on the original instance
+        """
+
+        weights = self.kernel_fn(distances)  # weights of the perturbed samples
+        labels_column = neighborhood_labels[:, label]  # softmax score of class `label` for each perturbed sample
+
+        used_features = self.feature_selection(
+            neighborhood_data, labels_column, weights, num_features, feature_selection
+        )
+        if model_regressor is None:
+            model_regressor = Ridge(
+                alpha=1,  # L2 regularization strength
+                fit_intercept=True,  # whether to fit the intercept term b
+                random_state=self.random_state,  # pseudo-random seed
+            )
+        easy_model = model_regressor
+        easy_model.fit(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+        prediction_score = easy_model.score(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+
+        local_pred = easy_model.predict(neighborhood_data[0, used_features].reshape(1, -1))
+
+        ridge_pred = easy_model.predict(neighborhood_data[:, used_features])
+        err_np = np.abs(labels_column - ridge_pred)
+        relative_err_np = err_np / ridge_pred
+        err = np.average(err_np, weights=weights)
+        relative_err = np.average(relative_err_np, weights=weights)
+
+        if self.verbose:
+            print("Intercept", easy_model.intercept_)
+            print(
+                "Prediction_local",
+                local_pred,
+            )
+            print("Right:", neighborhood_labels[0, label])
+        return (
+            easy_model.intercept_,
+            sorted(
+                zip(used_features, easy_model.coef_), key=lambda x: np.abs(x[1]), reverse=True
+            ),  # feature ids sorted by decreasing absolute weight
+            prediction_score,  # R^2 of easy_model against the labels: the higher (i.e. the smaller the error) the better, 1.0 at most
+            local_pred,  # easy_model's prediction for the original instance
+            relative_err,
+            err,
+        )
diff --git a/examples/model_interpretation/task/similarity/LIME/lime_text.py b/examples/model_interpretation/task/similarity/LIME/lime_text.py
new file mode 100644
index 0000000000000000000000000000000000000000..b702a68d8de17fd3d4eb7c5390a709448b82b708
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/LIME/lime_text.py
@@ -0,0 +1,660 @@
+# !/usr/bin/env python3
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Functions for explaining text classifiers.
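+
+The main entry point is LimeTextExplainer.explain_instance, which perturbs the
+input text, queries the classifier on the perturbed samples, and fits a locally
+weighted linear model (LIME.lime_base.LimeBase) to estimate per-token attributions.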
+""" +from functools import partial +import itertools +import json +import re +import time +import math +import paddle + +import numpy as np +import scipy as sp +import sklearn +from sklearn.utils import check_random_state + +import LIME.explanation as explanation +import LIME.lime_base as lime_base + + +class TextDomainMapper(explanation.DomainMapper): + """Maps feature ids to words or word-positions""" + + def __init__(self, indexed_string): + """Initializer. + + Args: + indexed_string: lime_text.IndexedString, original string + """ + self.indexed_string = indexed_string + + def map_exp_ids(self, exp, positions=False): + """Maps ids to words or word-position strings. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + positions: if True, also return word positions + + Returns: + list of tuples (word, weight), or (word_positions, weight) if + examples: ('bad', 1) or ('bad_3-6-12', 1) + """ + if positions: + exp = [ + ( + "%s_%s" + % (self.indexed_string.word(x[0]), "-".join(map(str, self.indexed_string.string_position(x[0])))), + x[1], + ) + for x in exp + ] + else: + exp = [(self.indexed_string.word(x[0]), x[1]) for x in exp] + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, text=True, opacity=True): + """Adds text with highlighted words to visualization. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + text: if False, return empty + opacity: if True, fade colors according to weight + """ + if not text: + return "" + text = self.indexed_string.raw_string().encode("utf-8", "xmlcharrefreplace").decode("utf-8") + text = re.sub(r"[<>&]", "|", text) + exp = [(self.indexed_string.word(x[0]), self.indexed_string.string_position(x[0]), x[1]) for x in exp] + all_occurrences = list(itertools.chain.from_iterable([itertools.product([x[0]], x[1], [x[2]]) for x in exp])) + all_occurrences = [(x[0], int(x[1]), x[2]) for x in all_occurrences] + ret = """ + %s.show_raw_text(%s, %d, %s, %s, %s); + """ % ( + exp_object_name, + json.dumps(all_occurrences), + label, + json.dumps(text), + div_name, + json.dumps(opacity), + ) + return ret + + +class IndexedString(object): + """String with various indexes.""" + + def __init__(self, raw_string, split_expression=r"\W+", bow=True, mask_string=None, language="ch"): + """Initializer. + + Args: + raw_string: string with raw text in it + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. + bow: if True, a word is the same everywhere in the text - i.e. we + will index multiple occurrences of the same word. If False, + order matters, so that the same word will have different ids + according to position. + mask_string: If not None, replace words with this if bow=False + if None, default value is UNKWORDZ + """ + self.raw = raw_string + self.mask_string = "UNKWORDZ" if mask_string is None else mask_string + self.language = language + + if callable(split_expression): + tokens = split_expression(self.raw) + self.as_list = self._segment_with_tokens(self.raw, tokens) + tokens = set(tokens) + + def non_word(string): + return string not in tokens + + else: + # with the split_expression as a non-capturing group (?:), we don't need to filter out + # the separator character from the split results. 
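+            # For Chinese ("ch") the raw string is split so that every CJK character
+            # in the range \u4e00-\u9fa5 becomes its own token; otherwise the
+            # user-supplied split_expression regex is used.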
+ if self.language == "ch": + splitter = re.compile(r"([\u4e00-\u9fa5])") + else: + splitter = re.compile(split_expression) + self.as_list = [w for w in splitter.split(self.raw) if len(w.strip()) > 0] + valid_word = splitter.match + + self.as_np = np.array(self.as_list) + self.string_start = np.hstack(([0], np.cumsum([len(x) for x in self.as_np[:-1]]))) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, word in enumerate(self.as_np): + if word in non_vocab: + continue + if (self.language == "ch" and not valid_word(word)) or (self.language == "en" and valid_word(word)): + non_vocab.add(word) + continue + if bow: + if word not in vocab: + vocab[word] = len(vocab) + self.inverse_vocab.append(word) + self.positions.append([]) + idx_word = vocab[word] + self.positions[idx_word].append(i) + else: + self.inverse_vocab.append(word) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + + @staticmethod + def _segment_with_tokens(text, tokens): + """Segment a string around the tokens created by a passed-in tokenizer""" + list_form = [] + text_ptr = 0 + for token in tokens: + inter_token_string = [] + while not text[text_ptr:].startswith(token): + inter_token_string.append(text[text_ptr]) + text_ptr += 1 + if text_ptr >= len(text): + raise ValueError("Tokenization produced tokens that do not belong in string!") + text_ptr += len(token) + if inter_token_string: + list_form.append("".join(inter_token_string)) + list_form.append(token) + if text_ptr < len(text): + list_form.append(text[text_ptr:]) + return list_form + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class IndexedCharacters(object): + """String with various indexes.""" + + def __init__(self, raw_string, bow=True, mask_string=None): + """Initializer. + + Args: + raw_string: string with raw text in it + bow: if True, a char is the same everywhere in the text - i.e. we + will index multiple occurrences of the same character. If False, + order matters, so that the same word will have different ids + according to position. 
+ mask_string: If not None, replace characters with this if bow=False + if None, default value is chr(0) + """ + self.raw = raw_string + self.as_list = list(self.raw) + self.as_np = np.array(self.as_list) + self.mask_string = chr(0) if mask_string is None else mask_string + self.string_start = np.arange(len(self.raw)) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, char in enumerate(self.as_np): + if char in non_vocab: + continue + if bow: + if char not in vocab: + vocab[char] = len(vocab) + self.inverse_vocab.append(char) + self.positions.append([]) + idx_char = vocab[char] + self.positions[idx_char].append(i) + else: + self.inverse_vocab.append(char) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing + it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class LimeTextExplainer(object): + """Explains text classifiers. + Currently, we are using an exponential kernel on cosine distance, and + restricting explanations to words that are present in documents.""" + + def __init__( + self, + kernel_width=25, + kernel=None, + verbose=False, + class_names=None, + feature_selection="auto", + split_expression=r"\W+", + bow=True, + mask_string=None, + random_state=None, + char_level=False, + language="ch", + ): + """Init function. + + Args: + kernel_width: kernel width for the exponential kernel. + kernel: similarity kernel that takes euclidean distances and kernel + width as input and outputs weights in (0,1). If None, defaults to + an exponential kernel. + verbose: if true, print local prediction values from linear model + class_names: list of class names, ordered according to whatever the + classifier is using. If not present, class names will be '0', + '1', ... + feature_selection: feature selection method. can be + 'forward_selection', 'lasso_path', 'none' or 'auto'. + See function 'explain_instance_with_data' in lime_base.py for + details on what each of the options does. + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. 
+ bow: if True (bag of words), will perturb input data by removing + all occurrences of individual words or characters. + Explanations will be in terms of these words. Otherwise, will + explain in terms of word-positions, so that a word may be + important the first time it appears and unimportant the second. + Only set to false if the classifier uses word order in some way + (bigrams, etc), or if you set char_level=True. + mask_string: String used to mask tokens or characters if bow=False + if None, will be 'UNKWORDZ' if char_level=False, chr(0) + otherwise. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + char_level: an boolean identifying that we treat each character + as an independent occurence in the string + """ + + if kernel is None: + + def kernel(d, kernel_width): + return np.sqrt(np.exp(-(d**2) / kernel_width**2)) + + kernel_fn = partial(kernel, kernel_width=kernel_width) + + self.random_state = check_random_state(random_state) + self.base = lime_base.LimeBase(kernel_fn, verbose, random_state=self.random_state) + self.class_names = class_names + self.vocabulary = None + self.feature_selection = feature_selection + self.bow = bow + self.mask_string = mask_string + self.split_expression = split_expression + self.char_level = char_level + self.language = language + + def explain_instance( + self, + text_instance_q: str, + text_instance_t: str, + analysis_query, + tokenizer, + pred_label: int, + classifier_fn, + labels=(0, 1), + top_labels=None, + num_features=10, + num_samples=5000, + distance_metric="cosine", + model_regressor=None, + if_lstm=False, + ): + """Generates explanations for a prediction. + + First, we generate neighborhood data by randomly hiding features from + the instance (see __data_labels_distance_mapping). We then learn + locally weighted linear models on this neighborhood data to explain + each of the classes in an interpretable way (see lime_base.py). + + Args: + text_instance: raw text string to be explained. + classifier_fn: classifier prediction probability function, which + takes a list of d strings and outputs a (d, k) numpy array with + prediction probabilities, where k is the number of classes. + For ScikitClassifiers , this is classifier.predict_proba. + labels: iterable with labels to be explained. + top_labels: if not None, ignore labels and produce explanations for + the K labels with highest prediction probabilities, where K is + this parameter. + num_features: maximum number of features present in explanation + num_samples: size of the neighborhood to learn the linear model + distance_metric: the distance metric to use for sample weighting, + defaults to cosine similarity + model_regressor: sklearn regressor to use in explanation. Defaults + to Ridge regression in LimeBase. Must have model_regressor.coef_ + and 'sample_weight' as a parameter to model_regressor.fit() + Returns: + An Explanation object (see explanation.py) with the corresponding + explanations. 
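+
+        Example (illustrative sketch only; `explainer`, `query`, `title`, `tokenizer`,
+        `pred` and `predict_fn` are placeholders for objects created by the caller):
+
+            exp, indexed_string, relative_err, err = explainer.explain_instance(
+                query, title, analysis_query=True, tokenizer=tokenizer,
+                pred_label=pred, classifier_fn=predict_fn, num_samples=5000)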
+        """
+        # prev_time = time.time()
+
+        text_instance = text_instance_q if analysis_query else text_instance_t
+        text_support = text_instance_t if analysis_query else text_instance_q
+
+        indexed_string = (
+            IndexedCharacters(text_instance, bow=self.bow, mask_string=self.mask_string)
+            if self.char_level
+            else IndexedString(
+                text_instance,
+                bow=self.bow,
+                split_expression=self.split_expression,
+                mask_string=self.mask_string,
+                language=self.language,
+            )
+        )
+        domain_mapper = TextDomainMapper(indexed_string)
+
+        # Build the perturbed dataset; the first entry is the original instance
+        # data: training features for the interpreter, list (num_samples, doc_size)
+        # yss: training labels for the interpreter, list (num_samples, class_num(2))
+        # distances: distance from each perturbed sample to the original one, np.array(float) (num_samples, )
+        data, yss, distances = self.__data_labels_distances(
+            indexed_string,
+            text_support,
+            analysis_query,
+            tokenizer,
+            classifier_fn,
+            num_samples,
+            distance_metric=distance_metric,
+            if_lstm=if_lstm,
+        )
+
+        if self.class_names is None:
+            self.class_names = [str(x) for x in range(yss[0].shape[0])]
+        ret_exp = explanation.Explanation(
+            domain_mapper=domain_mapper, class_names=self.class_names, random_state=self.random_state
+        )
+        ret_exp.predict_proba = yss[0]
+        if top_labels:
+            labels = np.argsort(yss[0])[-top_labels:]
+            ret_exp.top_labels = list(labels)
+            ret_exp.top_labels.reverse()
+
+        num_features = indexed_string.num_words()  # the number of features equals the number of words
+
+        (
+            ret_exp.intercept[pred_label],
+            ret_exp.local_exp[pred_label],
+            ret_exp.score[pred_label],
+            ret_exp.local_pred[pred_label],
+            relative_err,
+            err,
+        ) = self.base.explain_instance_with_data(
+            data,
+            yss,
+            distances,
+            pred_label,
+            num_features,
+            model_regressor=model_regressor,
+            feature_selection=self.feature_selection,
+        )
+
+        return ret_exp, indexed_string, relative_err, err
+
+    def __data_labels_distances(
+        self,
+        indexed_string,
+        text_support,
+        analysis_query,
+        tokenizer,
+        classifier_fn,
+        num_samples,
+        distance_metric="cosine",
+        if_lstm=False,
+    ):
+        """Generates a neighborhood around a prediction.
+
+        Generates neighborhood data by randomly removing words from
+        the instance, and predicting with the classifier. Uses cosine distance
+        to compute distances between original and perturbed instances.
+        Args:
+            indexed_string: document (IndexedString) to be explained,
+            classifier_fn: classifier prediction probability function, which
+                takes a string and outputs prediction probabilities. For
+                ScikitClassifier, this is classifier.predict_proba.
+            num_samples: size of the neighborhood to learn the linear model
+            distance_metric: the distance metric to use for sample weighting,
+                defaults to cosine similarity.
+
+        Returns:
+            A tuple (data, labels, distances), where:
+                data: dense num_samples * K binary matrix, where K is the
+                    number of tokens in indexed_string. The first row is the
+                    original instance, and thus a row of ones.
+                labels: num_samples * L matrix, where L is the number of target
+                    labels
+                distances: cosine distance between the original instance and
+                    each perturbed instance (computed in the binary 'data'
+                    matrix), times 100.
+        """
+
+        def distance_fn(x):
+            return sklearn.metrics.pairwise.pairwise_distances(x, x[0], metric=distance_metric).ravel() * 100
+
+        doc_size = indexed_string.num_words()
+
+        sample = self.random_state.randint(
+            1, doc_size, num_samples - 1
+        )  # sample: [int(1 ~ doc_size-1) * num_samples-1]
+        data = np.ones((num_samples, doc_size))
+        data[0] = np.ones(doc_size)
+        features_range = range(doc_size)
+        perturb_text = [indexed_string.raw_string()]  # [text str] * num_samples
+
+        for i, size in enumerate(sample, start=1):
+            # inactive: a list of `size` ids drawn at random from range(0, doc_size); the ids of the tokens to remove
+            inactive = self.random_state.choice(
+                features_range,  # [0, doc_size)
+                size,  # int: number of tokens removed in this perturbed sample
+                replace=False,
+            )
+
+            text = indexed_string.inverse_removing(inactive)  # the original text with the tokens in `inactive` removed
+
+            data[i, inactive] = 0
+            perturb_text.append(text)
+
+        # print('doc size: %d' % doc_size)
+
+        prev_time = time.time()
+        # inverse_data: the perturbed dataset, [perturbed sample str] * num_samples
+        labels = []
+        query_list, title_list, query_len_list, title_len_list = [], [], [], []  # for lstm
+        token_ids_list, s_ids_list = [], []  # for roberta
+        max_len = 0
+
+        support_token_ids = tokenizer.encode(text_support)  # for lstm
+        support_len = len(support_token_ids)  # for lstm
+        for idx, text in enumerate(perturb_text):
+            if if_lstm:
+                text_token_ids = tokenizer.encode(text)
+                text_len = len(text_token_ids)
+                if idx == 0:
+                    max_len = len(text_token_ids)
+                while len(text_token_ids) < max_len:
+                    text_token_ids.append(0)
+
+                query_token_ids = text_token_ids if analysis_query else support_token_ids
+                title_token_ids = support_token_ids if analysis_query else text_token_ids
+                query_len = text_len if analysis_query else support_len
+                title_len = support_len if analysis_query else text_len
+
+                query_list.append(query_token_ids)
+                title_list.append(title_token_ids)
+                query_len_list.append(query_len)
+                title_len_list.append(title_len)
+
+            else:
+                text_tokens = tokenizer.tokenize(text)
+                text_token_ids = tokenizer.convert_tokens_to_ids(text_tokens)
+                support_tokens = tokenizer.tokenize(text_support)
+                support_ids = tokenizer.convert_tokens_to_ids(support_tokens)
+                if analysis_query:
+                    token_ids = (
+                        [tokenizer.cls_token_id]
+                        + text_token_ids
+                        + [tokenizer.sep_token_id]
+                        + support_ids
+                        + [tokenizer.sep_token_id]
+                    )
+                else:
+                    token_ids = (
+                        [tokenizer.cls_token_id]
+                        + support_ids
+                        + [tokenizer.sep_token_id]
+                        + text_token_ids
+                        + [tokenizer.sep_token_id]
+                    )
+                if len(token_ids) > max_len:
+                    max_len = len(token_ids)
+                token_ids_list.append(token_ids)
+
+        token_ids_np = []
+        if not if_lstm:
+            for token_ids in token_ids_list:
+                # token_ids = token_ids[:max_len]
+                token_ids = token_ids + [tokenizer.pad_token_id] * (max_len - len(token_ids))
+                token_ids_np.append(token_ids)
+                s_ids = [0 for _ in range(len(token_ids))]
+                s_ids_list.append(s_ids)
+
+            token_ids_np = np.array(token_ids_np)
+            s_ids_np = np.array(s_ids_list)
+
+        length = len(perturb_text[0])
+        if if_lstm:
+            batch = 128
+        else:
+            batch = 64 if length < 130 else 50
+
+        prev_time = time.time()
+        epoch_num = math.ceil(len(perturb_text) / batch)
+        for idx in range(epoch_num):
+            if if_lstm:
+                query_list_tensor = paddle.to_tensor(query_list[idx * batch : (idx + 1) * batch])
+                title_list_tensor = paddle.to_tensor(title_list[idx * batch : (idx + 1) * batch])
+                query_len_list_tensor = paddle.to_tensor(query_len_list[idx * batch : (idx + 1) * batch])
+                title_len_list_tensor = paddle.to_tensor(title_len_list[idx * batch : (idx + 1) * batch])
+                label = classifier_fn(
+                    query_list_tensor, title_list_tensor,
query_len_list_tensor, title_len_list_tensor + )[ + 0 + ] # label: Tensor[num_samples, 2] + else: + token_ids_tensor = paddle.Tensor( + value=token_ids_np[idx * batch : (idx + 1) * batch], place=paddle.CUDAPlace(0), stop_gradient=True + ) + s_ids_tensor = paddle.Tensor( + value=s_ids_np[idx * batch : (idx + 1) * batch], + place=token_ids_tensor.place, + stop_gradient=token_ids_tensor.stop_gradient, + ) + label = classifier_fn(token_ids_tensor, s_ids_tensor)[0] # label: Tensor[num_samples, 2] + + labels.extend(label.numpy().tolist()) + + labels = np.array(labels) # labels: nsp.array(num_samples, 2) + print("mode forward time: %.5f" % (time.time() - prev_time)) + distances = distance_fn(sp.sparse.csr_matrix(data)) + + return data, labels, distances diff --git a/examples/model_interpretation/task/similarity/pretrained_models/data.py b/examples/model_interpretation/task/similarity/pretrained_models/data.py new file mode 100644 index 0000000000000000000000000000000000000000..37f2c12781f47ea1151427bb9f2df9c708d75e81 --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/data.py @@ -0,0 +1,138 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False, language="en"): + if language == "ch": + q_name = "query" + t_name = "title" + l_name = "label" + else: + q_name = "sentence1" + t_name = "sentence2" + l_name = "labels" + + query, title = example[q_name], example[t_name] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example[l_name]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] + + pos_inputs = tokenizer(text=query, text_pair=pos_title, 
max_seq_len=max_seq_length) + neg_inputs = tokenizer(text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_examples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_examples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_examples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_examples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/examples/model_interpretation/task/similarity/pretrained_models/model.py b/examples/model_interpretation/task/similarity/pretrained_models/model.py new file mode 100644 index 0000000000000000000000000000000000000000..cf886ba69a85b2e656aa0d6a6c483a73abe875fa --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/model.py @@ -0,0 +1,89 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
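+# Matching heads used by the similarity interpretation task:
+#   * PointwiseMatching feeds the pooled [CLS] representation of a (query, title)
+#     pair through a 2-way classifier and returns softmax probabilities.
+#   * PairwiseMatching scores positive and negative pairs with a sigmoid similarity
+#     head and is trained with a margin ranking loss.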
+
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+
+class PointwiseMatching(nn.Layer):
+    def __init__(self, pretrained_model, dropout=None):
+        super().__init__()
+        self.ptm = pretrained_model
+        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
+
+        # num_labels = 2 (similar or dissimilar)
+        self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+
+        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)
+
+        cls_embedding = self.dropout(cls_embedding)
+        logits = self.classifier(cls_embedding)
+        probs = F.softmax(logits)
+
+        return probs
+
+
+class PairwiseMatching(nn.Layer):
+    def __init__(self, pretrained_model, dropout=None, margin=0.1):
+        super().__init__()
+        self.ptm = pretrained_model
+        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
+        self.margin = margin
+
+        # hidden_size -> 1, calculate similarity
+        self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1)
+
+    def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+
+        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)
+
+        cls_embedding = self.dropout(cls_embedding)
+        sim_score = self.similarity(cls_embedding)
+        sim_score = F.sigmoid(sim_score)
+
+        return sim_score
+
+    def forward(
+        self,
+        pos_input_ids,
+        neg_input_ids,
+        pos_token_type_ids=None,
+        neg_token_type_ids=None,
+        pos_position_ids=None,
+        neg_position_ids=None,
+        pos_attention_mask=None,
+        neg_attention_mask=None,
+    ):
+
+        _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask)
+
+        _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask)
+
+        pos_embedding = self.dropout(pos_cls_embedding)
+        neg_embedding = self.dropout(neg_cls_embedding)
+
+        pos_sim = self.similarity(pos_embedding)
+        neg_sim = self.similarity(neg_embedding)
+
+        pos_sim = F.sigmoid(pos_sim)
+        neg_sim = F.sigmoid(neg_sim)
+
+        labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32")
+
+        loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin)
+
+        return loss
diff --git a/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py b/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py
new file mode 100644
index 0000000000000000000000000000000000000000..39c56528c9ec62b81dc71a35f9a31cd9a08870cb
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py
@@ -0,0 +1,115 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" + This script is used for predicting results +""" +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PointwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument( + "--max_seq_length", + default=64, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--language", choices=["ch", "en"], required=True, help="Language that the model is built for") +args = parser.parse_args() + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: + [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial( + convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True, language=args.language + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PointwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + y_preds = np.argmax(y_probs, axis=1) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + for idx, y_pred in enumerate(y_preds): + text_pair = valid_ds[idx] + text_pair["pred_label"] = y_pred + 
print(text_pair) diff --git a/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh b/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh new file mode 100644 index 0000000000000000000000000000000000000000..13771c1837edee94e630c6806d20fc908669e5d3 --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh @@ -0,0 +1,32 @@ +### + # This script is used to finetune pretrained models +### + +export CUDA_VISIBLE_DEVICES=7 + +LANGUAGE="ch" # ['ch', 'en'] +BASE_MODEL=roberta_large # [roberta_base, roberta_large] +timestamp=`date +"%Y%m%d_%H%M%S"` + +if [[ $LANGUAGE == "ch" ]]; then + LEARNING_RATE=3e-5 + MAX_SEQ_LENGTH=256 +elif [[ $LANGUAGE == "en" ]]; then + LEARNING_RATE=5e-6 + MAX_SEQ_LENGTH=128 +fi + +[ -d "logs" ] || mkdir -p "logs" +set -x + +python3 ./train_pointwise.py \ + --learning_rate $LEARNING_RATE \ + --max_seq_length $MAX_SEQ_LENGTH \ + --batch_size 32 \ + --epochs 5 \ + --save_step 1000 \ + --warmup_proportion 0.1 \ + --base_model $BASE_MODEL \ + --language $LANGUAGE \ + --save_dir saved_model_${LANGUAGE}/${BASE_MODEL}_${timestamp} >> logs/log_${BASE_MODEL}_${timestamp} + diff --git a/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py b/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py new file mode 100644 index 0000000000000000000000000000000000000000..7e7bbc190efcfc6f883ee8b9e5f7eb28f61b2a61 --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py @@ -0,0 +1,215 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("..") +sys.path.append("../../..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("../../..") +sys.path.remove("..") + +parser = argparse.ArgumentParser() +parser.add_argument("--base_model", type=str, choices=["roberta_base", "roberta_large"]) +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=1000, type=int, help="Step interval for evaluation.") +parser.add_argument("--save_step", default=1000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument( + "--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process." +) +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--language", choices=["ch", "en"], required=True, help="Language that the model is built for") +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(probs, labels) + losses.append(loss.numpy()) + correct = metric.compute(probs, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval {} loss: {:.5}, accu: {:.5}".format(phase, np.mean(losses), accu)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + if args.language == "ch": + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + if args.base_model == "roberta_base": + tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext", num_classes=2) + elif args.base_model == "roberta_large": + tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext-large", num_classes=2) + else: + train_ds, dev_ds = load_dataset("glue", "qqp", splits=["train", "dev"]) + + if args.base_model == "roberta_base": + tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_classes=2) + elif args.base_model == "roberta_large": + tokenizer = RobertaBPETokenizer.from_pretrained("roberta-large") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_classes=2) + + trans_func = partial( + convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=args.language + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = pretrained_model + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(probs, labels) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 100 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 100 / (time.time() - tic_train)), + flush=True, + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/model_interpretation/task/similarity/roberta/modeling.py b/examples/model_interpretation/task/similarity/roberta/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..c5824a443f0a81cdeab7879e00ef6be7633fca3e --- /dev/null +++ b/examples/model_interpretation/task/similarity/roberta/modeling.py @@ -0,0 +1,618 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + This script defines the model structure of roberta +""" +import sys + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model + +sys.path.append("../..") +from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 + +sys.path.remove("../..") + +__all__ = [ + "RobertaModel", + "RobertaPretrainedModel", + "RobertaForSequenceClassification", + "RobertaForTokenClassification", + "RobertaForQuestionAnswering", +] + + +class RobertaEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. 
+ """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + pad_token_id=0, + ): + super(RobertaEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + """ + forward function + """ + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RobertaPooler(nn.Layer): + """ + An abstract class for RobertaPooler + """ + + def __init__(self, hidden_size): + super(RobertaPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + """ + We "pool" the model by simply taking the hidden state corresponding + to the first token. + """ + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class RobertaPretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained RoBerta models. It provides RoBerta related + `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ + """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "roberta-wwm-ext": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "roberta-wwm-ext-large": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbt3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbtl3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", + "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", + "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", + "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", + } + } + base_model_prefix = "roberta" + + def _init_weights(self, layer): + """Initialization hook""" + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.roberta.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class RobertaModel(RobertaPretrainedModel): + r""" + The bare Roberta Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. 
+ num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to ``"gelu"``. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer. Defaults to 0.02. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. + + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + ): + super(RobertaModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.embeddings = RobertaEmbeddings( + vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id + ) + encoder_layer = TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + ) + self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = RobertaPooler(hidden_size) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
+ Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, max_position_embeddings - 1]``. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple: Returns tuple (`sequence_output`, `pooled_output`). + + With the fields: + + - sequence_output (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - pooled_output (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaModel, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaModel.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] + ) + # CLS: 101; SEP: 102; PAD: 0 + baseline_ids = paddle.to_tensor( + [101] + [0] * (input_ids.shape[1] - 2) + [102], + dtype=input_ids.dtype, + place=input_ids.place, + stop_gradient=input_ids.stop_gradient, + ) + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + baseline_embedding_output = self.embeddings( + input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + pass + if noise.upper() == "INTEGRATED": + embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( + embedding_output - baseline_embedding_output + ) + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + result = [sequence_output, pooled_output, att_weights_list] + result.append(embedding_output) + return result + + +class RobertaForQuestionAnswering(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output to + compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. + + Args: + roberta (:class:`RobertaModel`): + An instance of RobertaModel. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` of `RobertaModel` + instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, dropout=None): + super(RobertaForQuestionAnswering, self).__init__() + self.roberta = roberta # allow roberta to be config + self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=None, attention_mask=None + ) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class RobertaForSequenceClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + self.softmax = nn.Softmax() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Its data type should be float32 and it has a shape of [batch_size, num_classes]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + _, pooled_output, _, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + def forward_interpret( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + """ + The forward function used when we are interpreting the model + """ + _, pooled_output, att_weights_list, embedding_output = self.roberta( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + noise=noise, + i=i, + n_samples=n_samples, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + probs = self.softmax(logits) + + return probs, att_weights_list, embedding_output + + +class RobertaForTokenClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForTokenClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits diff --git a/examples/model_interpretation/task/similarity/run_inter.sh b/examples/model_interpretation/task/similarity/run_inter.sh new file mode 100644 index 0000000000000000000000000000000000000000..e9de8e11df879b1965cba7d888d6fb05a4869e38 --- /dev/null +++ b/examples/model_interpretation/task/similarity/run_inter.sh @@ -0,0 +1,61 @@ +### + # This file contains script to generate saliency map of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### +export CUDA_VISIBLE_DEVICES=7 +export PYTHONPATH=./:$PYTHONPATH + +LANGUAGE=ch # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large, lstm] +INTER_MODE=lime # INTER_MODE choice in [attention, integrated_gradient, lime] +TASK=similarity_${LANGUAGE} +DATA=../../data/${TASK} +START_ID=0 + +if [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base_20211018_104038/model_11400/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211208_121026/model_12000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211018_152833/model_22000/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211208_131546/model_22000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_ch/final.pdparams + fi + +elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=pretrained_models/saved_model_en/roberta_base_20211109_205245/model_54000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211208_121339/model_54000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=pretrained_models/saved_model_en/roberta_large_20211109_205649/model_46000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211208_131440/model_42000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_en/final.pdparams + fi +fi + +OUTPUT=./output/$TASK.$BASE_MODEL +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x + +python3 ./saliency_map/similarity_interpretable.py \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --max_seq_len 256 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --start_id $START_ID \ + --output_dir $OUTPUT \ + --n-samples 500 \ + --language $LANGUAGE \ + --eval $@ diff --git 
a/examples/model_interpretation/task/similarity/run_inter_all.sh b/examples/model_interpretation/task/similarity/run_inter_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..edabd07d6f41596e5eecac84e49e4e0a5b300cee --- /dev/null +++ b/examples/model_interpretation/task/similarity/run_inter_all.sh @@ -0,0 +1,69 @@ +### + # This file contains script to generate saliency map of all baseline models and languages on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +START_ID=0 + +for BASE_MODEL in "lstm" "roberta_base" "roberta_large"; +do + for INTER_MODE in "attention" "integrated_gradient" "lime"; + do + for LANGUAGE in "ch" "en"; + do + TASK=similarity_${LANGUAGE} + DATA=../../data/${TASK} + + if [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base_20211018_104038/model_11400/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211208_121026/model_12000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211018_152833/model_22000/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211208_131546/model_22000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_ch/final.pdparams + fi + + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=pretrained_models/saved_model_en/roberta_base_20211109_205245/model_54000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211208_121339/model_54000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=pretrained_models/saved_model_en/roberta_large_20211109_205649/model_46000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211208_131440/model_42000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_en/final.pdparams + fi + fi + + OUTPUT=./output/$TASK.$BASE_MODEL + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then + python3 ./saliency_map/similarity_interpretable.py \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --max_seq_len 256 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --start_id $START_ID \ + --output_dir $OUTPUT \ + --n-samples 500 \ + --language $LANGUAGE \ + --eval $@ + fi + done + done +done diff --git a/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py b/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py new file mode 100644 index 0000000000000000000000000000000000000000..73064096219034b1254dfee10f7bf32fed222d45 --- /dev/null +++ b/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py @@ -0,0 +1,646 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import collections +import json +import logging +import os +import re +import sys +from functools import partial +from pathlib import Path + +import numpy as np +import paddle +from LIME.lime_text import LimeTextExplainer +from roberta.modeling import RobertaForSequenceClassification +from simnet.model import SimNet +from simnet.utils import CharTokenizer, preprocess_data +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, + match, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("interpret textual similarity task") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--eval", action="store_true") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--start_id", type=int, default=0) + parser.add_argument("--language", type=str, required=True, help="Language that the model is based on") + args = parser.parse_args() + return args + + +class Similarity_data(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + if args.language == "ch": + yield { + "id": line_split["id"], + "query": line_split["query"], + "title": line_split["title"], + "text_q_seg": line_split["text_q_seg"], + "text_t_seg": line_split["text_t_seg"], + } + else: + yield { + "id": line_split["id"], + "sentence1": line_split["sentence1"], + "sentence2": line_split["sentence2"], + "text_q_seg": 
line_split["text_q_seg"], + "text_t_seg": line_split["text_t_seg"], + } + + +def map_fn_senti(examples, tokenizer, language): + print("load data %d" % len(examples)) + if language == "ch": + q_name = "query" + t_name = "title" + queries = [example[q_name] for example in examples] + titles = [example[t_name] for example in examples] + else: + q_name = "sentence1" + t_name = "sentence2" + queries = [example[q_name].encode("ascii", errors="replace").decode("UTF-8") for example in examples] + titles = [example[t_name].encode("ascii", errors="replace").decode("UTF-8") for example in examples] + tokenized_examples = tokenizer(queries, titles, max_seq_len=args.max_seq_len) + + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + for i in range(len(tokenized_examples)): + tokenized_examples[i]["query_offset_mapping"] = ( + [(0, 0)] + tokenizer.get_offset_mapping(queries[i])[: args.max_seq_len - 2] + [(0, 0)] + ) + tokenized_examples[i]["title_offset_mapping"] = ( + [(0, 0)] + tokenizer.get_offset_mapping(titles[i])[: args.max_seq_len - 2] + [(0, 0)] + ) + + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer, language=args.language) + + dev_ds = Similarity_data().read(args.data_dir) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "query_offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "title_offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader, dev_ds + + +def init_lstm_var(args): + if args.language == "ch": + vocab = Vocab.load_vocabulary("simnet/vocab.char", unk_token="[UNK]", pad_token="[PAD]") + else: + vocab = Vocab.load_vocabulary("simnet/vocab_QQP", unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = CharTokenizer(vocab, args.language, "../../punctuations") + model = SimNet(network="lstm", vocab_size=len(vocab), num_classes=2) + + dev_ds = Similarity_data().read(args.data_dir) + dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) + batches = [dev_examples[idx : idx + args.batch_size] for idx in range(0, len(dev_examples), args.batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + return model, tokenizer, batches, batchify_fn, vocab, dev_ds + + +def get_seq_token_num(language): + if language == "ch": + add_idx = 1 + else: + add_idx = 2 + return add_idx + + +def get_qt_tokens(base_model, d, add_idx=None, tokenizer=None, batchify_fn=None, vocab=None): + SEP_idx = 0 + 
if base_model == "roberta": + input_ids, token_type_ids, query_offset_map, title_offset_map = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + + SEP_idx = input_ids.tolist()[0].index(tokenizer.sep_token_id) + q_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:SEP_idx].tolist()) # list + t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + add_idx : -1].tolist()) # list + q_offset = query_offset_map[0, 1:-1].tolist() + t_offset = title_offset_map[0, 1:-1].tolist() + return q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs, q_offset, t_offset + + if base_model == "lstm": + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(d) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + + fwd_args = [query_ids, title_ids, query_seq_lens, title_seq_lens] + fwd_kwargs = {} + q_tokens = [vocab._idx_to_token[idx] for idx in query_ids.tolist()[0]] + t_tokens = [vocab._idx_to_token[idx] for idx in title_ids.tolist()[0]] + return q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs + + +def extract_attention_scores(args, result, atts, q_tokens, t_tokens, out_handle, SEP_idx, q_offset, t_offset, add_idx): + if args.base_model.startswith("roberta"): + inter_score = atts[-1][:, :, 0, :].mean(1) # (bsz, seq) + q_inter_score = inter_score[0][1:SEP_idx] # remove CLS and SEP + t_inter_score = inter_score[0][SEP_idx + add_idx : -1] # remove CLS and SEP + elif args.base_model == "lstm": + q_inter_score = atts[0][0] + t_inter_score = atts[1][0] + + q_length = (q_inter_score > 0).cast("int32").sum(-1)[0] + t_length = (t_inter_score > 0).cast("int32").sum(-1)[0] + assert len(q_tokens) == q_length, f"{len(q_tokens)} != {q_length}" + assert len(t_tokens) == t_length, f"{len(t_tokens)} != {t_length}" + + q_char_attribution_dict, t_char_attribution_dict = {}, {} + if args.base_model.startswith("roberta"): + # Query + sorted_token = [] + for i in range(len(q_inter_score)): + sorted_token.append([i, q_offset[i], q_inter_score[i]]) + q_char_attribution_dict = match(result["query"], result["text_q_seg"], sorted_token) + result["query_char_attri"] = collections.OrderedDict() + for token_info in sorted(q_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["query_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_q_seg") + + # Title + sorted_token = [] + for i in range(len(t_inter_score)): + sorted_token.append([i, t_offset[i], t_inter_score[i]]) + t_char_attribution_dict = match(result["title"], result["text_t_seg"], sorted_token) + result["title_char_attri"] = collections.OrderedDict() + for token_info in sorted(t_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["title_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_t_seg") + + else: + idx = 0 + for token, score in zip(q_tokens, q_inter_score.tolist()): + q_char_attribution_dict[idx] = (token, score) + idx += 1 + for token, score in zip(t_tokens, t_inter_score.tolist()): + t_char_attribution_dict[idx] = (token, score) + idx += 1 + + result["query_char_attri"], result["title_char_attri"] = collections.OrderedDict(), collections.OrderedDict() + for token, attri in sorted(q_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["query_char_attri"][token] = attri + for token, attri in sorted(t_char_attribution_dict.items(), key=lambda x: x[1][1], 
reverse=True): + result["title_char_attri"][token] = attri + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def IG_roberta_inter_score( + args, + embedded_grads_list, + pred_embedded, + baseline_embedded, + pred_confidence, + baseline_pred_confidence, + SEP_idx, + add_idx, + err_total, +): + embedded_grads_tensor = paddle.to_tensor( + embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + + # Tensor(n_samples-1, 1, seq_len, embed_size) + trapezoidal_grads = (embedded_grads_tensor[1:] + embedded_grads_tensor[:-1]) / 2 + integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) + inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) + inter_score = inter_score.sum(-1) # Tensor(1, seq_len) + + # eval err + delta_pred_confidence = pred_confidence - baseline_pred_confidence + sum_gradient = inter_score.sum().tolist()[0] + err = (delta_pred_confidence - sum_gradient + 1e-12) / (delta_pred_confidence + 1e-12) + err_total.append(np.abs(err)) + + print_str = "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f" + print_vals = (result["id"], args.n_samples, delta_pred_confidence, sum_gradient, err, np.average(err_total)) + print(print_str % print_vals) + + inter_score.stop_gradient = True + q_inter_score = inter_score[0][1:SEP_idx] # remove CLS and SEP + t_inter_score = inter_score[0][SEP_idx + add_idx : -1] # remove CLS and SEP + + return q_inter_score, t_inter_score + + +def IG_lstm_inter_score(q_embedded_grads_list, pred_embedded, baseline_embedded, idx): + # query + q_embedded_grads_tensor = paddle.to_tensor( + q_embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + q_trapezoidal_grads = ( + q_embedded_grads_tensor[1:] + q_embedded_grads_tensor[:-1] + ) / 2 # Tensor(n_samples-1, 1, seq_len, embed_size) + q_integral_grads = q_trapezoidal_grads.sum(0) / q_trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) + q_inter_score = (pred_embedded[idx] - baseline_embedded[idx]) * q_integral_grads # Tensor(1, seq_len, embed_size) + q_inter_score = q_inter_score.sum(-1) # Tensor(1, seq_len) + q_inter_score.stop_gradient = True + q_inter_score = q_inter_score[0] + + return q_inter_score + + +def extract_integrated_gradient_scores( + args, + result, + fwd_args, + fwd_kwargs, + model, + q_tokens, + t_tokens, + out_handle, + SEP_idx, + add_idx, + q_offset, + t_offset, + err_total, +): + embedded_grads_list = [] + q_embedded_grads_list, t_embedded_grads_list = [], [] + for i in range(args.n_samples): + probs, _, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + predicted_class_prob = probs[0][pred_label] + predicted_class_prob.backward(retain_graph=False) + + if args.base_model.startswith("roberta"): + embedded_grad = embedded.grad + embedded_grads_list.append(embedded_grad) + elif args.base_model == "lstm": + q_embedded, t_embedded = embedded + q_embedded_grad = q_embedded.grad + t_embedded_grad = t_embedded.grad + q_embedded_grads_list.append(q_embedded_grad) + t_embedded_grads_list.append(t_embedded_grad) + model.clear_gradients() + if i == 0: + baseline_pred_confidence = probs.tolist()[0][pred_label] # scalar + baseline_embedded = embedded # Tensor(1, seq_len, embed_size) + elif i == args.n_samples - 1: + pred_confidence = probs.tolist()[0][pred_label] # scalar + pred_embedded = embedded # Tensor(1, seq_len, embed_size) + + if args.base_model.startswith("roberta"): + 
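+        # For roberta models the gradients are taken over the joint [CLS] query [SEP] title sequence; IG_roberta_inter_score then splits the resulting scores back into query and title parts.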
q_inter_score, t_inter_score = IG_roberta_inter_score( + args, + embedded_grads_list, + pred_embedded, + baseline_embedded, + pred_confidence, + baseline_pred_confidence, + SEP_idx, + add_idx, + err_total, + ) + elif args.base_model == "lstm": + q_inter_score = IG_lstm_inter_score(q_embedded_grads_list, pred_embedded, baseline_embedded, 0) + t_inter_score = IG_lstm_inter_score(t_embedded_grads_list, pred_embedded, baseline_embedded, 1) + + q_char_attribution_dict, t_char_attribution_dict = {}, {} + if args.base_model.startswith("roberta"): + # Query + sorted_token = [] + for i in range(len(q_inter_score)): + sorted_token.append([i, q_offset[i], q_inter_score[i]]) + q_char_attribution_dict = match(result["query"], result["text_q_seg"], sorted_token) + result["query_char_attri"] = collections.OrderedDict() + for token_info in sorted(q_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["query_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_q_seg") + + # Title + sorted_token = [] + for i in range(len(t_inter_score)): + sorted_token.append([i, t_offset[i], t_inter_score[i]]) + t_char_attribution_dict = match(result["title"], result["text_t_seg"], sorted_token) + result["title_char_attri"] = collections.OrderedDict() + for token_info in sorted(t_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["title_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_t_seg") + else: + idx = 0 + for token, score in zip(q_tokens, q_inter_score.tolist()): + q_char_attribution_dict[idx] = (token, score) + idx += 1 + for token, score in zip(t_tokens, t_inter_score.tolist()): + t_char_attribution_dict[idx] = (token, score) + idx += 1 + + result["query_char_attri"], result["title_char_attri"] = collections.OrderedDict(), collections.OrderedDict() + for token, attri in sorted(q_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["query_char_attri"][token] = attri + for token, attri in sorted(t_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["title_char_attri"][token] = attri + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def extract_LIME_scores( + args, q_tokens, t_tokens, result, tokenizer, pred_label, fwd_args, fwd_kwargs, model, probs, out_handle +): + explainer = LimeTextExplainer(class_names=["neg", "pos"], verbose=False, language=args.language) + if_lstm = args.base_model == "lstm" + + explain_res_q = explainer.explain_instance( + text_instance_q=result["query"], + text_instance_t=result["title"], + analysis_query=True, + tokenizer=tokenizer, + pred_label=pred_label, + classifier_fn=model.forward_interpret, + num_samples=5000, + if_lstm=if_lstm, + ) + exp_q, indexed_string_q, relative_err, err = explain_res_q + local_exps_q = exp_q.local_exp + + explain_res_t = explainer.explain_instance( + text_instance_q=result["query"], + text_instance_t=result["title"], + analysis_query=False, + tokenizer=tokenizer, + pred_label=pred_label, + classifier_fn=model.forward_interpret, + num_samples=5000, + if_lstm=if_lstm, + ) + exp_t, indexed_string_t, _, _ = explain_res_t + local_exps_t = exp_t.local_exp + + # query + char_attribution_dict = [] + for kind, local_exp in local_exps_q.items(): + for idx in range(len(result["text_q_seg"])): + t = result["text_q_seg"][idx] # .replace('Ġ', '') + got_score = False + for word_id, attribution in local_exp: + if indexed_string_q.inverse_vocab[word_id] == t: + 
char_attribution_dict.append((idx, t, attribution)) + got_score = True + break + if not got_score: + char_attribution_dict.append((idx, t, 0)) + char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) + result["query_char_attri"] = collections.OrderedDict() + for s in char_attribution_dict: + result["query_char_attri"][s[0]] = (s[1], s[2]) + + # title + char_attribution_dict = [] + for kind, local_exp in local_exps_t.items(): + for idx in range(len(result["text_t_seg"])): + t = result["text_t_seg"][idx] # .replace('Ġ', '') + got_score = False + for word_id, attribution in local_exp: + if indexed_string_t.inverse_vocab[word_id] == t: + char_attribution_dict.append((idx, t, attribution)) + got_score = True + break + if not got_score: + char_attribution_dict.append((idx, t, 0)) + char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) + result["title_char_attri"] = collections.OrderedDict() + for s in char_attribution_dict: + result["title_char_attri"][s[0]] = (s[1], s[2]) + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + return exp_q, exp_t, relative_err, err + + +def LIME_error_evaluation( + exp_q, pred_label, probs, lime_score_total, lime_relative_err_total, lime_err_total, relative_err, err +): + # err evaluation + score = exp_q.score[pred_label] + ridge_pred = exp_q.local_pred[pred_label] + model_pred = probs.numpy().tolist()[0][pred_label] + + lime_score_total.append(score) + lime_relative_err_total.append(relative_err) + lime_err_total.append(err) + print("score: %.2f" % score) + print("relative_err: %.2f" % relative_err) + print("err: %.2f" % err) + print("ridge_pred: %.2f\tpred: %.2f\tdelta: %.2f" % (ridge_pred, model_pred, ridge_pred - model_pred)) + return lime_score_total, lime_relative_err_total, lime_err_total + + +g_splitter = re.compile(r"([\u4e00-\u9fa5])") + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + elif args.base_model == "lstm": + model, tokenizer, dataloader, batchify_fn, vocab, dev_ds = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + assert args.eval, "INTERPRETER must be run in eval mode" + with paddle.amp.auto_cast(enable=args.use_amp), open( + os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" + ) as out_handle: + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # Set dropout to 0 when init the model to collect the gradient + print("load model from %s" % args.init_checkpoint) + + # For IG + err_total = [] + # For LIME + lime_score_total = [] + lime_relative_err_total = [] + lime_err_total = [] + # For Roberta + sub_word_id_dict_query = [] + sub_word_id_dict_title = [] + # For LSTM + q_offset, t_offset = None, None + + get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + for step, d in tqdm(enumerate(dataloader)): + if step + 1 < args.start_id: + continue + + result = {} + # English and Chinese models have different numbers of [SEQ] tokens between query and title + add_idx = get_seq_token_num(args.language) + + if args.base_model.startswith("roberta"): + q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs, q_offset, t_offset = get_qt_tokens( + base_model="roberta", d=d, add_idx=add_idx, tokenizer=tokenizer + ) + elif args.base_model == "lstm": + q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs = get_qt_tokens( + 
base_model="lstm", d=d, batchify_fn=batchify_fn, vocab=vocab + ) + + result["id"] = dev_ds.data[step]["id"] + result["text_q_seg"] = dev_ds.data[step]["text_q_seg"] + result["text_t_seg"] = dev_ds.data[step]["text_t_seg"] + + probs, atts, embedded = model.forward_interpret(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + + if args.language == "ch": + result["query"] = dev_ds.data[step]["query"] + result["title"] = dev_ds.data[step]["title"] + else: + result["query"] = dev_ds.data[step]["sentence1"] + result["title"] = dev_ds.data[step]["sentence2"] + + # Attention + if args.inter_mode == "attention": + extract_attention_scores( + args, result, atts, q_tokens, t_tokens, out_handle, SEP_idx, q_offset, t_offset, add_idx + ) + + elif args.inter_mode == "integrated_gradient": + extract_integrated_gradient_scores( + args, + result, + fwd_args, + fwd_kwargs, + model, + q_tokens, + t_tokens, + out_handle, + SEP_idx, + add_idx, + q_offset, + t_offset, + err_total, + ) + + elif args.inter_mode == "lime": + exp_q, exp_t, relative_err, err = extract_LIME_scores( + args, + q_tokens, + t_tokens, + result, + tokenizer, + pred_label, + fwd_args, + fwd_kwargs, + model, + probs, + out_handle, + ) + lime_score_total, lime_relative_err_total, lime_err_total = LIME_error_evaluation( + exp_q, + pred_label, + probs, + lime_score_total, + lime_relative_err_total, + lime_err_total, + relative_err, + err, + ) + + else: + raise KeyError(f"Unkonwn interpretable mode: {args.inter_mode}") + + if args.inter_mode == "lime": + print(np.average(np.array(lime_relative_err_total))) diff --git a/examples/model_interpretation/task/similarity/saliency_map/utils.py b/examples/model_interpretation/task/similarity/saliency_map/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9e6dd7e1a61b2c79cb4a968866a11bb3a9c90c51 --- /dev/null +++ b/examples/model_interpretation/task/similarity/saliency_map/utils.py @@ -0,0 +1,38 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import absolute_import, division, print_function, unicode_literals + +import paddle + + +class UnpackDataLoader(paddle.io.DataLoader): + def __init__(self, *args, **kwargs): + super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) + + def __iter__(self): + return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) + + +def create_if_not_exists(dir): + try: + dir.mkdir(parents=True) + except FileExistsError: + pass + return dir + + +def get_warmup_and_linear_decay(max_steps, warmup_steps): + return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/similarity/simnet/gen_vocab.py b/examples/model_interpretation/task/similarity/simnet/gen_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..4359902825317f3ce3c7886536ba60cb2f74e56a --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/gen_vocab.py @@ -0,0 +1,60 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# !/usr/bin/env python +# coding=utf-8 + +import sys +from collections import defaultdict + +import spacy + +from paddlenlp.datasets import load_dataset + +if sys.argv[1] == "ch": + train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) + + vocab = defaultdict(int) + for example in train_ds.data: + query = example["query"] + title = example["title"] + for c in query: + vocab[c] += 1 + for c in title: + vocab[c] += 1 + with open("vocab.char", "w") as f: + for k, v in vocab.items(): + if v > 3: + f.write(k + "\n") + +else: + tokenizer = spacy.load("en_core_web_sm") + vocab = defaultdict(int) + + with open("../data/QQP/train/train.tsv", "r") as f_dataset: + for idx, line in enumerate(f_dataset.readlines()): + if idx == 0: + continue + line_split = line.strip().split("\t") + query = [token.text for token in tokenizer(line_split[0])] + title = [token.text for token in tokenizer(line_split[1])] + + for word in query: + vocab[word] += 1 + for word in title: + vocab[word] += 1 + + with open("vocab_QQP", "w") as f: + for k, v in vocab.items(): + if v > 3: + f.write(k + "\n") diff --git a/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py b/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py new file mode 100644 index 0000000000000000000000000000000000000000..e2ed642e836b5bc8651a12c1b3ed81f9452f4884 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py @@ -0,0 +1,121 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +sys.path.append("../../..") +from model import SimNet # noqa: E402 +from utils import CharTokenizer, preprocess_data # noqa: E402 + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.char", help="The path to vocabulary.") +parser.add_argument( + "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" +) +parser.add_argument( + "--params_path", type=str, default="./checkpoints/final.pdparams", help="The path of model parameter to be loaded." +) +parser.add_argument("--language", type=str, required=True, help="Language that this model based on") +args = parser.parse_args() + + +def interpret(model, data, label_map, batch_size=1, pad_token_id=0, vocab=None): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + model.eval() + results = [] + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + + logits, attention, _ = model.forward_interpret(query_ids, title_ids, query_seq_lens, title_seq_lens) + query_att = attention[0] + title_att = attention[1] + + model.clear_gradients() + for query_id, title_id in zip(query_ids.numpy().tolist(), title_ids.numpy().tolist()): + query = [vocab._idx_to_token[idx] for idx in query_id] + title = [vocab._idx_to_token[idx] for idx in title_id] + results.append([query_att, query, title_att, title]) + + print("query_att: %s" % query_att.shape) + print("title_att: %s" % title_att.shape) + + return results + + +if __name__ == "__main__": + paddle.set_device(args.device + ":2") + # Loads vocab. 
+ vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. + dev_ds, test_ds = load_dataset("lcqmc", splits=["dev", "test"]) + + dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) + test_examples = preprocess_data(test_ds.data, tokenizer, args.language) + results = interpret( + model, + dev_examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + vocab=vocab, + ) diff --git a/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py b/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py new file mode 100644 index 0000000000000000000000000000000000000000..8da2733bee65e439fc0d99e078f9b95a911a1dc3 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py @@ -0,0 +1,131 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +sys.path.append("../../..") +from model import SimNet # noqa: E402 +from utils import CharTokenizer, preprocess_data # noqa: E402 + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.char", help="The path to vocabulary.") +parser.add_argument( + "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" +) +parser.add_argument( + "--params_path", type=str, default="./checkpoints/final.pdparams", help="The path of model parameter to be loaded." +) +parser.add_argument("--language", type=str, required=True, help="Language that this model based on") +args = parser.parse_args() + + +def interpret(model, data, label_map, batch_size=1, pad_token_id=0, vocab=None): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. 
+ pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + Stack(dtype="int64"), + ): [data for data in fn(samples)] + + model.train() + results = [] + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + probs, addiational_info = model.forward_interpreter(query_ids, title_ids, query_seq_lens, title_seq_lens) + query_emb = addiational_info["embedded"][0] + title_emb = addiational_info["embedded"][1] + + predicted_class_probs = paddle.max(probs, axis=-1) + predicted_class_probs = predicted_class_probs.sum() + paddle.autograd.backward([predicted_class_probs]) + q_gradients = ((query_emb * query_emb.grad).sum(-1).detach()).abs() # gradients: (1, seq_len) + q_grad_output = q_gradients / q_gradients.sum(-1, keepdim=True) + t_gradients = ((title_emb * title_emb.grad).sum(-1).detach()).abs() # gradients: (1, seq_len) + t_grad_output = t_gradients / t_gradients.sum(-1, keepdim=True) + + model.clear_gradients() + for query_id, title_id in zip(query_ids.numpy().tolist(), title_ids.numpy().tolist()): + query = [vocab._idx_to_token[idx] for idx in query_id] + title = [vocab._idx_to_token[idx] for idx in title_id] + results.append([q_grad_output, query, t_grad_output, title]) + print([q_grad_output, query, t_grad_output, title]) + + return results + + +if __name__ == "__main__": + paddle.set_device(args.device + ":1") + # Loads vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
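+    # Note on interpret() (called below): for each example it backpropagates the
+    # predicted-class probability and scores every token by |sum_d(emb_d * grad_d)|,
+    # normalized over the sequence, i.e. an input-x-gradient saliency for both the
+    # query and the title.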
+ + dev_ds, test_ds = load_dataset("lcqmc", splits=["dev", "test"]) + + dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) + test_examples = preprocess_data(test_ds.data, tokenizer, args.language) + results = interpret( + model, + dev_examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + vocab=vocab, + ) + + # for idx, text in enumerate(data): + # print('Data: {} \t Label: {}'.format(text, results[idx])) diff --git a/examples/model_interpretation/task/similarity/simnet/lstm_train.sh b/examples/model_interpretation/task/similarity/simnet/lstm_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c1b671f09308f434bae7b085a3ee68d74e12f45 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/lstm_train.sh @@ -0,0 +1,21 @@ +### + # This script is used to train lstm models +### + +unset CUDA_VISIBLE_DEVICES +LANGUAGE=en + +if [[ $LANGUAGE == "ch" ]]; then + VOCAB_PATH=vocab.char +elif [[ $LANGUAGE == "en" ]]; then + VOCAB_PATH=vocab_QQP +fi + +python -m paddle.distributed.launch --gpus "5" train.py \ + --device=gpu \ + --lr=4e-4 \ + --batch_size=64 \ + --epochs=12 \ + --vocab_path=$VOCAB_PATH \ + --language=$LANGUAGE \ + --save_dir="./checkpoints_"${LANGUAGE} diff --git a/examples/model_interpretation/task/similarity/simnet/model.py b/examples/model_interpretation/task/similarity/simnet/model.py new file mode 100644 index 0000000000000000000000000000000000000000..e3c86ad21c4ecaae914e0d0d1b63775393b59efc --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/model.py @@ -0,0 +1,270 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import paddlenlp as nlp + + +class SimNet(nn.Layer): + def __init__(self, network, vocab_size, num_classes, emb_dim=128, pad_token_id=0): + super().__init__() + + network = network.lower() + if network == "bow": + self.model = BoWModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "cnn": + self.model = CNNModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "gru": + self.model = GRUModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + elif network == "lstm": + self.model = LSTMModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + else: + raise ValueError("Unknown network: %s, it must be one of bow, cnn, lstm or gru." 
% network) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + logits = self.model(query, title, query_seq_len, title_seq_len) + return logits + + def forward_interpret( + self, query, title, query_seq_len=None, title_seq_len=None, noise=None, i=None, n_samples=None + ): + + logits, addiational_info = self.model.forward_interpreter( + query, title, query_seq_len, title_seq_len, noise=noise, i=i, n_samples=n_samples + ) + + return logits, addiational_info["attention"], addiational_info["embedded"] + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. + fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, fc_hidden_size=128): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) + self.fc = nn.Linear(self.bow_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, embedding_dim) + summed_query = self.bow_encoder(embedded_query) + summed_title = self.bow_encoder(embedded_title) + encoded_query = paddle.tanh(summed_query) + encoded_title = paddle.tanh(summed_title) + # Shape: (batch_size, embedding_dim*2) + contacted = paddle.concat([encoded_query, encoded_title], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=128, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=128, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.lstm_encoder = nlp.seq2vec.LSTMEncoder( + emb_dim, lstm_hidden_size, num_layers=lstm_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.lstm_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + self.pad_token_id = padding_idx + + def forward(self, query, title, query_seq_len, title_seq_len): + assert query_seq_len is not None and title_seq_len is not None + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = 
self.embedder(title) + # Shape: (batch_size, lstm_hidden_size) + query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*lstm_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + def forward_interpreter(self, query, title, query_seq_len, title_seq_len, noise=None, i=None, n_samples=None): + assert query_seq_len is not None and title_seq_len is not None + # Shape: (batch_size, num_tokens, embedding_dim) + + query_baseline = paddle.to_tensor([self.pad_token_id] * query.shape[1]).unsqueeze(0) + title_baseline = paddle.to_tensor([self.pad_token_id] * title.shape[1]).unsqueeze(0) + + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + embedded_query_baseline = self.embedder(query_baseline) + embedded_title_baseline = self.embedder(title_baseline) + + if noise is not None and noise.upper() == "INTEGRATED": + embedded_query = embedded_query_baseline + i / (n_samples - 1) * (embedded_query - embedded_query_baseline) + embedded_title = embedded_title_baseline + i / (n_samples - 1) * (embedded_title - embedded_title_baseline) + + # Shape: (batch_size, lstm_hidden_size) + query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*lstm_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + probs = F.softmax(logits, axis=-1) + + q_att = paddle.matmul(fc_out, embedded_query, transpose_y=True).squeeze(axis=[1]) # (bsz, query_len) + q_att = F.softmax(q_att, axis=-1) + t_att = paddle.matmul(fc_out, embedded_title, transpose_y=True).squeeze(axis=[1]) # (bsz, title_len) + t_att = F.softmax(t_att, axis=-1) + + addiational_info = { + "embedded": [embedded_query, embedded_title], + "attention": [q_att, t_att], + } + # return logits, addiational_info + return probs, addiational_info + + +class GRUModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + gru_hidden_size=128, + direction="forward", + gru_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.gru_encoder = nlp.seq2vec.GRUEncoder( + emb_dim, gru_hidden_size, num_layers=gru_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.gru_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len, title_seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, gru_hidden_size) + query_repr = self.gru_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.gru_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*gru_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # 
Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + +class CNNModel(nn.Layer): + """ + This class implements the + + + Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=256, + ngram_filter_sizes=(3,), + fc_hidden_size=128, + ): + super().__init__() + self.padding_idx = padding_idx + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, num_filter) + query_repr = self.encoder(embedded_query) + title_repr = self.encoder(embedded_title) + # Shape: (batch_size, 2*num_filter) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits diff --git a/examples/model_interpretation/task/similarity/simnet/predict.py b/examples/model_interpretation/task/similarity/simnet/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..dec464bf413007dff271ca2865ac4ffe39ecf69c --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/predict.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
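+
+# Example invocation (paths are illustrative; point them at your own vocab file
+# and trained checkpoint):
+#   python predict.py --device gpu --network lstm \
+#       --vocab_path ./simnet_vocab.txt --params_path ./checkpoints/final.pdparams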
+ +import argparse + +import paddle +import paddle.nn.functional as F +from model import SimNet +from utils import preprocess_prediction_data + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The path to vocabulary.") +parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + logits = model(query_ids, title_ids, query_seq_lens, title_seq_lens) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + # Loads vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
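+    # `data` below is a small set of hard-coded [query, title] pairs; each pair is
+    # tokenized by preprocess_prediction_data with the Jieba-based tokenizer, and
+    # predict() maps the resulting ids to a "similar"/"dissimilar" label.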
+ data = [ + ["世界上什么东西最小", "世界上什么东西最小?"], + ["光眼睛大就好看吗", "眼睛好看吗?"], + ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], + ] + examples = preprocess_prediction_data(data, tokenizer) + results = predict( + model, + examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + ) + + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/model_interpretation/task/similarity/simnet/train.py b/examples/model_interpretation/task/similarity/simnet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ec36090726ccba08f0891d4ca21d0296f284c236 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/train.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import sys +from functools import partial + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +sys.path.append("../../../") +from model import SimNet # noqa: E402 +from utils import CharTokenizer, convert_example # noqa: E402 + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument( + "--vocab_path", + type=str, + default="./vocab.char", + help="The directory to dataset. Chinese version uses vocab.char while English version uses vocab_QQP", +) +parser.add_argument( + "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" +) +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--language", type=str, required=True, help="Language that this model based on") +args = parser.parse_args() + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. 
+ batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Loads vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + # Loads dataset. + if args.language == "ch": + train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) + else: + train_ds, dev_ds, test_ds = load_dataset("glue", "qqp", splits=["train", "dev", "test"]) + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(train_ds.label_list)) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + tokenizer = CharTokenizer(vocab, args.language, "../../../punctuations") + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False, language=args.language) + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + test_loader = create_dataloader( + test_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + model.fit( + train_loader, + dev_loader, + epochs=args.epochs, + save_dir=args.save_dir, + ) diff --git a/examples/model_interpretation/task/similarity/simnet/utils.py b/examples/model_interpretation/task/similarity/simnet/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b2161cd48ce24c8794db827dd8fac7d4ec6d5ce4 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/utils.py @@ -0,0 +1,211 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def convert_example(example, tokenizer, is_test=False, language="en"): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + query_ids(obj:`list[int]`): The list of query ids. + title_ids(obj:`list[int]`): The list of title ids. + query_seq_len(obj:`int`): The input sequence query length. + title_seq_len(obj:`int`): The input sequence title length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + if language == "ch": + q_name = "query" + t_name = "title" + label = "label" + else: + q_name = "sentence1" + t_name = "sentence2" + label = "labels" + + query, title = example[q_name], example[t_name] + query_ids = np.array(tokenizer.encode(query), dtype="int64") + query_seq_len = np.array(len(query_ids), dtype="int64") + title_ids = np.array(tokenizer.encode(title), dtype="int64") + title_seq_len = np.array(len(title_ids), dtype="int64") + result = [query_ids, title_ids, query_seq_len, title_seq_len] + if not is_test: + label = np.array(example[label], dtype="int64") + result.append(label) + return result + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[List[str, str]]`): + The prediction data whose each element is a text pair. + Each text will be tokenized by jieba.lcut() function. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - query_ids(obj:`list[int]`): The list of query ids. + - title_ids(obj:`list[int]`): The list of title ids. + - query_seq_len(obj:`int`): The input sequence query length. + - title_seq_len(obj:`int`): The input sequence title length. + + """ + examples = [] + for query, title in data: + query_ids = tokenizer.encode(query) + title_ids = tokenizer.encode(title) + examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) + return examples + + +def preprocess_data(data, tokenizer, language): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[List[str, str]]`): + The prediction data whose each element is a text pair. + Each text will be tokenized by jieba.lcut() function. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - query_ids(obj:`list[int]`): The list of query ids. + - title_ids(obj:`list[int]`): The list of title ids. + - query_seq_len(obj:`int`): The input sequence query length. + - title_seq_len(obj:`int`): The input sequence title length. 
+ + """ + if language == "ch": + q_name = "query" + t_name = "title" + else: + q_name = "sentence1" + t_name = "sentence2" + examples = [] + for example in data: + query_ids = tokenizer.encode(example[q_name]) + title_ids = tokenizer.encode(example[t_name]) + examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) + return examples + + +def get_idx_from_word(word, word_to_idx, unk_word): + if word in word_to_idx: + return word_to_idx[word] + return word_to_idx[unk_word] + + +class CharTokenizer: + def __init__(self, vocab, language, vocab_path): + self.vocab = vocab + self.language = language + self.vocab_path = vocab_path + self.unk_token = [] + + def encode(self, sentence): + if self.language == "ch": + words = tokenizer_punc(sentence, self.vocab_path) + else: + words = sentence.strip().split() + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in words] + + def tokenize(self, sentence, wo_unk=True): + if self.language == "ch": + return tokenizer_punc(sentence, self.vocab_path) + else: + return sentence.strip().split() + + def convert_tokens_to_string(self, tokens): + return " ".join(tokens) + + def convert_tokens_to_ids(self, tokens): + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in tokens] + + +def tokenizer_lac(string, lac): + temp = "" + res = [] + for c in string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + res.extend(lac.run(temp)) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + res.extend(lac.run(temp)) + return res + + +def tokenizer_punc(string, vocab_path): + res = [] + sub_string_list = string.strip().split("[MASK]") + for idx, sub_string in enumerate(sub_string_list): + temp = "" + for c in sub_string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + if idx < len(sub_string_list) - 1: + res.append("[MASK]") + return res + + +def punc_split(string, vocab_path): + punc_set = set() + with open(vocab_path, "r") as f: + for token in f: + punc_set.add(token.strip()) + punc_set.add(" ") + for ascii_num in range(65296, 65306): + punc_set.add(chr(ascii_num)) + for ascii_num in range(48, 58): + punc_set.add(chr(ascii_num)) + + res = [] + temp = "" + for c in string: + if c in punc_set: + if temp != "": + res.append(temp) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + res.append(temp) + return res diff --git a/examples/model_interpretation/task/transformer.py b/examples/model_interpretation/task/transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..2504503739b054ca3eb24c137a5828dc857ac6e9 --- /dev/null +++ b/examples/model_interpretation/task/transformer.py @@ -0,0 +1,1329 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
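+
+# Note on attention masks used throughout this file: layers accept bool/int masks
+# (True/1 = attend, False/0 = block) as well as float additive masks (0 = attend,
+# -INF = block). _convert_attention_mask normalizes the former into the latter,
+# e.g. a bool mask [True, True, False] becomes roughly [0, 0, -1e9] before being
+# added to the attention logits.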
+ +# TODO: define the classes of Transformer neural network + +import collections +import copy + +import numpy as np +import paddle +from paddle import ParamAttr, tensor +from paddle.common_ops_import import convert_dtype +from paddle.nn import Layer, LayerList +from paddle.nn import functional as F +from paddle.nn.layer.common import Dropout, Linear +from paddle.nn.layer.norm import LayerNorm + +__all__ = [] + + +def _convert_param_attr_to_list(param_attr, n): + """ + If `param_attr` is a list or tuple, convert every element in it to a + ParamAttr instance. Otherwise, repeat `param_attr` `n` times to + construct a list, and rename every one by appending a increasing index + suffix to avoid having same names when `param_attr` contains a name. + + Parameters: + param_attr (list|tuple|ParamAttr): A list, tuple or something can be + converted to a ParamAttr instance by `ParamAttr._to_attr`. + n (int): The times to repeat to construct a list when `param_attr` + is not a list or tuple. + + Returns: + list: A list composed of each including cell's `param_attr`. + """ + if isinstance(param_attr, (list, tuple)): + assert len(param_attr) == n, "length of param_attr should be %d when it is a list/tuple" % n + param_attrs = [] + for attr in param_attr: + if isinstance(attr, bool): + if attr: + param_attrs.append(ParamAttr._to_attr(None)) + else: + param_attrs.append(False) + else: + param_attrs.append(ParamAttr._to_attr(attr)) + # param_attrs = [ParamAttr._to_attr(attr) for attr in param_attr] + elif isinstance(param_attr, bool): + param_attrs = [] + if param_attr: + param_attrs = [ParamAttr._to_attr(None) for i in range(n)] + else: + param_attrs = [False] * n + else: + param_attrs = [] + attr = ParamAttr._to_attr(param_attr) + for i in range(n): + attr_i = copy.deepcopy(attr) + if attr.name: + attr_i.name = attr_i.name + "_" + str(i) + param_attrs.append(attr_i) + return param_attrs + + +def _convert_attention_mask(attn_mask, dtype): + """ + Convert the attention mask to the target dtype we expect. + + Parameters: + attn_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + dtype (VarType): The target type of `attn_mask` we expect. + + Returns: + Tensor: A Tensor with shape same as input `attn_mask`, with data type `dtype`. + """ + if attn_mask is not None and attn_mask.dtype != dtype: + attn_mask_dtype = convert_dtype(attn_mask.dtype) + if attn_mask_dtype == "bool" or "int" in attn_mask_dtype: + attn_mask = (paddle.cast(attn_mask, dtype) - 1.0) * 1e9 + else: + attn_mask = paddle.cast(attn_mask, dtype) + return attn_mask + + +class MultiHeadAttention(Layer): + """ + Attention mapps queries and a set of key-value pairs to outputs, and + Multi-Head Attention performs multiple parallel attention to jointly attending + to information from different representation subspaces. + + Please refer to `Attention Is All You Need <https://arxiv.org/pdf/1706.03762.pdf>`_ + for more details. 
+ + Parameters: + embed_dim (int): The expected feature size in the input and output. + num_heads (int): The number of heads in multi-head attention. + dropout (float, optional): The dropout probability used on attention + weights to drop some attention targets. 0 for no dropout. Default 0 + kdim (int, optional): The feature size in key. If None, assumed equal to + `embed_dim`. Default None. + vdim (int, optional): The feature size in value. If None, assumed equal to + `embed_dim`. Default None. + need_weights (bool, optional): Indicate whether to return the attention + weights. Default False. + weight_attr(ParamAttr, optional): To specify the weight parameter property. + Default: None, which means the default weight parameter property is used. + See usage for details in :code:`ParamAttr` . + bias_attr (ParamAttr|bool, optional): To specify the bias parameter property. + Default: None, which means the default bias parameter property is used. + If it is set to False, this layer will not have trainable bias parameter. + See usage for details in :code:`ParamAttr` . + + Examples: + + .. code-block:: python + + import paddle + + # encoder input: [batch_size, sequence_length, d_model] + query = paddle.rand((2, 4, 128)) + # self attention mask: [batch_size, num_heads, query_len, query_len] + attn_mask = paddle.rand((2, 2, 4, 4)) + multi_head_attn = paddle.nn.MultiHeadAttention(128, 2) + output = multi_head_attn(query, None, None, attn_mask=attn_mask) # [2, 4, 128] + """ + + Cache = collections.namedtuple("Cache", ["k", "v"]) + StaticCache = collections.namedtuple("StaticCache", ["k", "v"]) + + def __init__( + self, + embed_dim, + num_heads, + dropout=0.0, + kdim=None, + vdim=None, + need_weights=False, + weight_attr=None, + bias_attr=None, + ): + super(MultiHeadAttention, self).__init__() + self.embed_dim = embed_dim + self.kdim = kdim if kdim is not None else embed_dim + self.vdim = vdim if vdim is not None else embed_dim + self.num_heads = num_heads + self.dropout = dropout + self.need_weights = need_weights + + self.head_dim = embed_dim // num_heads + assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads" + + self.q_proj = Linear(embed_dim, embed_dim, weight_attr, bias_attr=bias_attr) + self.k_proj = Linear(self.kdim, embed_dim, weight_attr, bias_attr=bias_attr) + self.v_proj = Linear(self.vdim, embed_dim, weight_attr, bias_attr=bias_attr) + self.out_proj = Linear(embed_dim, embed_dim, weight_attr, bias_attr=bias_attr) + + def _prepare_qkv(self, query, key, value, cache=None): + r""" + Prapares linear projected queries, keys and values for usage of subsequnt + multiple parallel attention. If `cache` is not None, using cached results + to reduce redundant calculations. + + Parameters: + query (Tensor): The queries for multi-head attention. It is a + tensor with shape `[batch_size, query_length, embed_dim]`. The + data type should be float32 or float64. + key (Tensor): The keys for multi-head attention. It is + a tensor with shape `[batch_size, key_length, kdim]`. The + data type should be float32 or float64. If None, use `query` as + `key`. + value (Tensor): The values for multi-head attention. It + is a tensor with shape `[batch_size, value_length, vdim]`. + The data type should be float32 or float64. If None, use `query` as + `value`. 
+ cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional): + It is a namedtuple with `k` and `v` as fields, and stores tensors + shaped `[batch_size, num_heads, length, embed_dim]` which are results + of linear projection, reshape and transpose calculations in + MultiHeadAttention. If is an instance of `Cache`, `k` and `v` + fields reserve intermediate results of previous positions, which + mostly used for decoder self attention. If it is an instance of + `StaticCache`, `key` and `value` args would be ignored, `k` and + `v` fields would be used as calculated results on `key` and + `value`, which mostly used for decoder-encoder cross attention. + It is only used for inference and should be None for training. + Default None. + + Returns: + tuple: A tuple including linear projected keys and values. These two \ + tensors have shapes `[batch_size, n_head, sequence_length, d_key]` \ + and `[batch_size, n_head, sequence_length, d_value]` separately, \ + and their data types are same as inputs. + """ + q = self.q_proj(query) + q = tensor.reshape(x=q, shape=[0, 0, self.num_heads, self.head_dim]) + q = tensor.transpose(x=q, perm=[0, 2, 1, 3]) + + if isinstance(cache, self.StaticCache): + # for encoder-decoder attention in inference and has cached + k, v = cache.k, cache.v + else: + k, v = self.compute_kv(key, value) + + if isinstance(cache, self.Cache): + # for decoder self-attention in inference + k = tensor.concat([cache.k, k], axis=2) + v = tensor.concat([cache.v, v], axis=2) + cache = self.Cache(k, v) + + return (q, k, v) if cache is None else (q, k, v, cache) + + def compute_kv(self, key, value): + r""" + Applies linear projection on input keys and values, then splits heads + (reshape and transpose) to get keys and values from different representation + subspaces. The results are used as key-values pairs for subsequent multiple + parallel attention. + + It is part of calculations in multi-head attention, and is provided as + a method to pre-compute and prefetch these results, thus we can use them + to construct cache for inference. + + Parameters: + key (Tensor): The keys for multi-head attention. It is a tensor + with shape `[batch_size, sequence_length, kdim]`. The data type + should be float32 or float64. + value (Tensor): The values for multi-head attention. It is a tensor + with shape `[batch_size, sequence_length, vdim]`. The data type + should be float32 or float64. + + Returns: + tuple: A tuple including transformed keys and values. Their shapes \ + both are `[batch_size, num_heads, sequence_length, embed_dim // num_heads]`, \ + and their data types are same as inputs. + """ + k = self.k_proj(key) + v = self.v_proj(value) + k = tensor.reshape(x=k, shape=[0, 0, self.num_heads, self.head_dim]) + k = tensor.transpose(x=k, perm=[0, 2, 1, 3]) + v = tensor.reshape(x=v, shape=[0, 0, self.num_heads, self.head_dim]) + v = tensor.transpose(x=v, perm=[0, 2, 1, 3]) + return k, v + + def gen_cache(self, key, value=None, type=Cache): + """ + Generates cache for `forward` usage in inference accroding to arguments. + The generated cache is an instance of `MultiHeadAttention.Cache` or an + instance of `MultiHeadAttention.StaticCache`. + + `Cache` or `StaticCache` is namedtuple with `k` and `v` as fields, + and it stores tensors shaped `[batch_size, num_heads, length, embed_dim]` + which are results of linear projection, reshape and transpose calculations + in MultiHeadAttention. 
+ + If the generated cache is an instance of `Cache`, `k` and `v` fields + reserve intermediate result tensors of previous positions, and the tensors + are incremental among decoding steps, which mostly are used for decoder + decoder self attention. + + If the generated cache is an instance of `StaticCache`, `k` and `v` fields + would be used as calculated result tensors on keys an values in `forward`, + and the tensors keep unchanged among decoding steps, which are mostly used + for decoder-encoder cross attention. + + The cache is generated as follows: + + 1. If `type` is `StaticCache`, apply `compute_kv(key, value)` and use the + results to create an instance of `StaticCache`. + + 2. If `type` is `Cache` and `value` is None, generate empty tensors shaped + `[batch_size, num_heads, 0, embed_dim // num_heads]` and use the results + to create an instance of `Cache`, where `batch_size` is from the first + dimension of `key`. + + 3. If `type` is `Cache` and `value` is not None, use `key`, `value` to create + an instance of `Cache`. + + Parameters: + key (Tensor): The keys for multi-head attention. It is + a tensor with shape `[batch_size, key_length, kdim]`. The + data type should be float32 or float64. If `value` is None, + it is only for batch size and data type reference. + value (Tensor, optional): The values for multi-head attention. It + is a tensor with shape `[batch_size, value_length, vdim]`. + The data type should be float32 or float64. If None, `key` is only + for batch size reference. Default None. + type (type): It should be `MultiHeadAttention.StaticCache` or + `MultiHeadAttention.Cache` to indicate the cache type to generate. + + Returns: + namedtuple: an instance of `Cache` or `StaticCache` accordingly. + """ + if type == MultiHeadAttention.StaticCache: # static_kv + k, v = self.compute_kv(key, value) + return self.StaticCache(k, v) + elif value is None: # incremental_state + k = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + v = paddle.full(shape=[key.shape[0], 2, self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + return self.Cache(k, v) + else: + # incremental_state with initial value, mainly for usage like UniLM + return self.Cache(key, value) + + def forward(self, query, key=None, value=None, attn_mask=None, cache=None): + r""" + Applies multi-head attention to map queries and a set of key-value pairs + to outputs. + + Parameters: + query (Tensor): The queries for multi-head attention. It is a + tensor with shape `[batch_size, query_length, embed_dim]`. The + data type should be float32 or float64. + key (Tensor, optional): The keys for multi-head attention. It is + a tensor with shape `[batch_size, key_length, kdim]`. The + data type should be float32 or float64. If None, use `query` as + `key`. Default None. + value (Tensor, optional): The values for multi-head attention. It + is a tensor with shape `[batch_size, value_length, vdim]`. + The data type should be float32 or float64. If None, use `query` as + `value`. Default None. + attn_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. 
When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional): + It is a namedtuple with `k` and `v` as fields, and stores tensors + shaped `[batch_size, num_heads, length, embed_dim]` which are results + of linear projection, reshape and transpose calculations in + MultiHeadAttention. If it is an instance of `Cache`, `k` and `v` + fields reserve intermediate results of previous positions, which + mostly used for decoder self attention. If it is an instance of + `StaticCache`, `key` and `value` args would be ignored, `k` and + `v` fields would be used as calculated results on `key` and + `value`, which mostly used for decoder-encoder cross attention. + It is only used for inference and should be None for training. + Default None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `query`, representing attention output. Or a tuple if \ + `need_weights` is True or `cache` is not None. If `need_weights` \ + is True, except for attention output, the tuple also includes \ + the attention weights tensor shaped `[batch_size, num_heads, query_length, key_length]`. \ + If `cache` is not None, the tuple then includes the new cache \ + having the same type as `cache`, and if it is `StaticCache`, it \ + is same as the input `cache`, if it is `Cache`, the new cache \ + reserves tensors concatanating raw tensors with intermediate \ + results of current query. + """ + key = query if key is None else key + value = query if value is None else value + # compute q ,k ,v + if cache is None: + q, k, v = self._prepare_qkv(query, key, value, cache) + else: + q, k, v, cache = self._prepare_qkv(query, key, value, cache) + + # scale dot product attention + product = paddle.matmul(x=q * (self.head_dim**-0.5), y=k, transpose_y=True) + if attn_mask is not None: + # Support bool or int mask + attn_mask = _convert_attention_mask(attn_mask, product.dtype) + product = product + attn_mask + weights = F.softmax(product) + if self.dropout: + weights = F.dropout(weights, self.dropout, training=self.training, mode="upscale_in_train") + + out = tensor.matmul(weights, v) + + # combine heads + out = tensor.transpose(out, perm=[0, 2, 1, 3]) + out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]]) + + # project to output + out = self.out_proj(out) + + outs = [out] + if self.need_weights: + outs.append(weights) + if cache is not None: + outs.append(cache) + return out if len(outs) == 1 else tuple(outs) + + +class TransformerEncoderLayer(Layer): + """ + TransformerEncoderLayer is composed of two sub-layers which are self (multi-head) + attention and feedforward network. Before and after each sub-layer, pre-process + and post-precess would be applied on the input and output accordingly. If + `normalize_before` is True, pre-process is layer normalization and post-precess + includes dropout, residual connection. Otherwise, no pre-process and post-precess + includes dropout, residual connection, layer normalization. + + Parameters: + d_model (int): The expected feature size in the input and output. + nhead (int): The number of heads in multi-head attention(MHA). + dim_feedforward (int): The hidden layer size in the feedforward network(FFN). 
+ dropout (float, optional): The dropout probability used in pre-process + and post-precess of MHA and FFN sub-layer. Default 0.1 + activation (str, optional): The activation function in the feedforward + network. Default relu. + attn_dropout (float, optional): The dropout probability used + in MHA to drop some attention target. If None, use the value of + `dropout`. Default None + act_dropout (float, optional): The dropout probability used after FFN + activition. If None, use the value of `dropout`. Default None + normalize_before (bool, optional): Indicate whether to put layer normalization + into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer + normalization and post-precess includes dropout, residual connection. + Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. Default False + weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property. + If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for + MHA, and `weight_attr[1]` would be used as `weight_attr` for linear in FFN. + Otherwise, MHA and FFN both use it as `weight_attr` to create parameters. + Default: None, which means the default weight parameter property is used. + See usage for details in :code:`ParamAttr` . + bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property. + If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for + MHA, and `bias_attr[1]` would be used as `bias_attr` for linear in FFN. + Otherwise, MHA and FFN both use it as `bias_attr` to create parameters. + The `False` value means the corresponding layer would not have trainable + bias parameter. See usage for details in :code:`ParamAttr` . Default: None, + which means the default bias parameter property is used. + + + Examples: + + .. 
code-block:: python + + import paddle + from paddle.nn import TransformerEncoderLayer + + # encoder input: [batch_size, src_len, d_model] + enc_input = paddle.rand((2, 4, 128)) + # self attention mask: [batch_size, n_head, src_len, src_len] + attn_mask = paddle.rand((2, 2, 4, 4)) + encoder_layer = TransformerEncoderLayer(128, 2, 512) + enc_output = encoder_layer(enc_input, attn_mask) # [2, 4, 128] + """ + + def __init__( + self, + d_model, + nhead, + dim_feedforward, + dropout=0.1, + activation="relu", + attn_dropout=None, + act_dropout=None, + normalize_before=False, + weight_attr=None, + bias_attr=None, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + + super(TransformerEncoderLayer, self).__init__() + attn_dropout = dropout if attn_dropout is None else attn_dropout + act_dropout = dropout if act_dropout is None else act_dropout + self.normalize_before = normalize_before + + weight_attrs = _convert_param_attr_to_list(weight_attr, 2) + bias_attrs = _convert_param_attr_to_list(bias_attr, 2) + + self.self_attn = MultiHeadAttention( + d_model, + nhead, + dropout=attn_dropout, + need_weights=True, # interpret + weight_attr=weight_attrs[0], + bias_attr=bias_attrs[0], + ) + self.linear1 = Linear(d_model, dim_feedforward, weight_attrs[1], bias_attr=bias_attrs[1]) + self.dropout = Dropout(act_dropout, mode="upscale_in_train") + self.linear2 = Linear(dim_feedforward, d_model, weight_attrs[1], bias_attr=bias_attrs[1]) + self.norm1 = LayerNorm(d_model) + self.norm2 = LayerNorm(d_model) + self.dropout1 = Dropout(dropout, mode="upscale_in_train") + self.dropout2 = Dropout(dropout, mode="upscale_in_train") + self.activation = getattr(F, activation) + + def forward(self, src, src_mask=None, cache=None): + r""" + Applies a Transformer encoder layer on the input. + + Parameters: + src (Tensor): The input of Transformer encoder layer. It is + a tensor with shape `[batch_size, sequence_length, d_model]`. + The data type should be float32 or float64. + src_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (Tensor, optional): It is an instance of `MultiHeadAttention.Cache`. + See `TransformerEncoderLayer.gen_cache` for more details. It is + only used for inference and should be None for training. Default + None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `enc_input`, representing the output of Transformer encoder \ + layer. Or a tuple if `cache` is not None, except for encoder \ + layer output, the tuple includes the new cache which is same \ + as input `cache` argument but `incremental_cache` has an \ + incremental length. See `MultiHeadAttention.gen_cache` and \ + `MultiHeadAttention.forward` for more details. 
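+
+        Note that this layer is modified for model interpretation (see the
+        `# interpret` comments in the implementation): its `MultiHeadAttention`
+        is created with `need_weights=True`, so the attention weights are always
+        returned together with the layer output.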
+ """ + src_mask = _convert_attention_mask(src_mask, src.dtype) + + residual = src + if self.normalize_before: + src = self.norm1(src) + # Add cache for encoder for the usage like UniLM + if cache is None: + # src = self.self_attn(src, src, src, src_mask) + src, att_weights = self.self_attn(src, src, src, src_mask) # interpret + else: + # src, incremental_cache = self.self_attn(src, src, src, src_mask, cache) + src, att_weights, incremental_cache = self.self_attn(src, src, src, src_mask, cache) # interpret + + src = residual + self.dropout1(src) + if not self.normalize_before: + src = self.norm1(src) + + residual = src + if self.normalize_before: + src = self.norm2(src) + src = self.linear2(self.dropout(self.activation(self.linear1(src)))) + src = residual + self.dropout2(src) + if not self.normalize_before: + src = self.norm2(src) + # return src if cache is None else (src, incremental_cache) + return (src, att_weights) if cache is None else (src, att_weights, incremental_cache) # interpret + + def gen_cache(self, src): + r""" + Generates cache for `forward` usage. The generated cache is an + instance of `MultiHeadAttention.Cache`. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data + type should be float32 or float64. + + Returns: + incremental_cache: It is an instance of `MultiHeadAttention.Cache` \ + produced by `self_attn.gen_cache`, it reserves two tensors + shaped `[batch_size, nhead, 0, d_model // nhead]`. See \ + `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + incremental_cache = self.self_attn.gen_cache(src, type=self.self_attn.Cache) + return incremental_cache + + +class TransformerEncoder(Layer): + """ + TransformerEncoder is a stack of N encoder layers. + + Parameters: + encoder_layer (Layer): an instance of the `TransformerEncoderLayer`. It + would be used as the first layer, and the other layers would be created + according to the configurations of it. + num_layers (int): The number of encoder layers to be stacked. + norm (LayerNorm, optional): the layer normalization component. If provided, + apply layer normalization on the output of last encoder layer. + + Examples: + + .. code-block:: python + + import paddle + from paddle.nn import TransformerEncoderLayer, TransformerEncoder + + # encoder input: [batch_size, src_len, d_model] + enc_input = paddle.rand((2, 4, 128)) + # self attention mask: [batch_size, n_head, src_len, src_len] + attn_mask = paddle.rand((2, 2, 4, 4)) + encoder_layer = TransformerEncoderLayer(128, 2, 512) + encoder = TransformerEncoder(encoder_layer, 2) + enc_output = encoder(enc_input, attn_mask) # [2, 4, 128] + """ + + def __init__(self, encoder_layer, num_layers, norm=None): + super(TransformerEncoder, self).__init__() + self.layers = LayerList( + [(encoder_layer if i == 0 else type(encoder_layer)(**encoder_layer._config)) for i in range(num_layers)] + ) + self.num_layers = num_layers + self.norm = norm + + def forward(self, src, src_mask=None, cache=None): + r""" + Applies a stack of N Transformer encoder layers on inputs. If `norm` is + provided, also applies layer normalization on the output of last encoder + layer. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, sequence_length, d_model]`. The data + type should be float32 or float64. 
+ src_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (list, optional): It is a list, and each element in the list + is `incremental_cache` produced by `TransformerEncoderLayer.gen_cache`. + See `TransformerEncoder.gen_cache` for more details. It is only + used for inference and should be None for training. Default None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `src`, representing the output of Transformer encoder. \ + Or a tuple if `cache` is not None, except for encoder output, \ + the tuple includes the new cache which is same as input `cache` \ + argument but `incremental_cache` in it has an incremental length. \ + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + src_mask = _convert_attention_mask(src_mask, src.dtype) + + output = src + att_weights_list = [] # interpret + new_caches = [] + for i, mod in enumerate(self.layers): + if cache is None: + # output = mod(output, src_mask=src_mask) + output, att_weights = mod(output, src_mask=src_mask) # interpret + att_weights_list.append(att_weights) + else: + # output, new_cache = mod(output, src_mask=src_mask, cache=cache[i]) + output, att_weights, new_cache = mod(output, src_mask=src_mask, cache=cache[i]) # interpret + att_weights_list.append(att_weights) + new_caches.append(new_cache) + + if self.norm is not None: + output = self.norm(output) + + # return output if cache is None else (output, new_caches) + return (output, att_weights_list) if cache is None else (output, att_weights_list, new_caches) # interpret + + def gen_cache(self, src): + r""" + Generates cache for `forward` usage. The generated cache is a list, and + each element in it is `incremental_cache` produced by + `TransformerEncoderLayer.gen_cache`. See `TransformerEncoderLayer.gen_cache` + for more details. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + + Returns: + list: It is a list, and each element in the list is `incremental_cache` + produced by `TransformerEncoderLayer.gen_cache`. See + `TransformerEncoderLayer.gen_cache` for more details. + """ + cache = [layer.gen_cache(src) for layer in self.layers] + return cache + + +class TransformerDecoderLayer(Layer): + """ + TransformerDecoderLayer is composed of three sub-layers which are decoder + self (multi-head) attention, decoder-encoder cross attention and feedforward + network. Before and after each sub-layer, pre-process and post-precess would + be applied on the input and output accordingly. If `normalize_before` is True, + pre-process is layer normalization and post-precess includes dropout, residual + connection. Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. 
+ + Parameters: + d_model (int): The expected feature size in the input and output. + nhead (int): The number of heads in multi-head attention(MHA). + dim_feedforward (int): The hidden layer size in the feedforward network(FFN). + dropout (float, optional): The dropout probability used in pre-process + and post-precess of MHA and FFN sub-layer. Default 0.1 + activation (str, optional): The activation function in the feedforward + network. Default relu. + attn_dropout (float, optional): The dropout probability used + in MHA to drop some attention target. If None, use the value of + `dropout`. Default None + act_dropout (float, optional): The dropout probability used after FFN + activition. If None, use the value of `dropout`. Default None + normalize_before (bool, optional): Indicate whether to put layer normalization + into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer + normalization and post-precess includes dropout, residual connection. + Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. Default False + weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property. + If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for + self attention, `weight_attr[1]` would be used as `weight_attr` for + cross attention, and `weight_attr[2]` would be used as `weight_attr` + for linear in FFN. Otherwise, the three sub-layers all uses it as + `weight_attr` to create parameters. Default: None, which means the + default weight parameter property is used. See usage for details + in :ref:`api_paddle_ParamAttr` . + bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property. + If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for + self attention, `bias_attr[1]` would be used as `bias_attr` for + cross attention, and `bias_attr[2]` would be used as `bias_attr` + for linear in FFN. Otherwise, the three sub-layers all uses it as + `bias_attr` to create parameters. The `False` value means the + corresponding layer would not have trainable bias parameter. See + usage for details in :code:`ParamAttr` . Default: None,which means + the default bias parameter property is used. + + Examples: + + .. 
code-block:: python + + import paddle + from paddle.nn import TransformerDecoderLayer + + # decoder input: [batch_size, tgt_len, d_model] + dec_input = paddle.rand((2, 4, 128)) + # encoder output: [batch_size, src_len, d_model] + enc_output = paddle.rand((2, 6, 128)) + # self attention mask: [batch_size, n_head, tgt_len, tgt_len] + self_attn_mask = paddle.rand((2, 2, 4, 4)) + # cross attention mask: [batch_size, n_head, tgt_len, src_len] + cross_attn_mask = paddle.rand((2, 2, 4, 6)) + decoder_layer = TransformerDecoderLayer(128, 2, 512) + output = decoder_layer(dec_input, + enc_output, + self_attn_mask, + cross_attn_mask) # [2, 4, 128] + """ + + def __init__( + self, + d_model, + nhead, + dim_feedforward, + dropout=0.1, + activation="relu", + attn_dropout=None, + act_dropout=None, + normalize_before=False, + weight_attr=None, + bias_attr=None, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + + super(TransformerDecoderLayer, self).__init__() + attn_dropout = dropout if attn_dropout is None else attn_dropout + act_dropout = dropout if act_dropout is None else act_dropout + self.normalize_before = normalize_before + + weight_attrs = _convert_param_attr_to_list(weight_attr, 3) + bias_attrs = _convert_param_attr_to_list(bias_attr, 3) + + self.self_attn = MultiHeadAttention( + d_model, nhead, dropout=attn_dropout, weight_attr=weight_attrs[0], bias_attr=bias_attrs[0] + ) + self.cross_attn = MultiHeadAttention( + d_model, nhead, dropout=attn_dropout, weight_attr=weight_attrs[1], bias_attr=bias_attrs[1] + ) + self.linear1 = Linear(d_model, dim_feedforward, weight_attrs[2], bias_attr=bias_attrs[2]) + self.dropout = Dropout(act_dropout, mode="upscale_in_train") + self.linear2 = Linear(dim_feedforward, d_model, weight_attrs[2], bias_attr=bias_attrs[2]) + self.norm1 = LayerNorm(d_model) + self.norm2 = LayerNorm(d_model) + self.norm3 = LayerNorm(d_model) + self.dropout1 = Dropout(dropout, mode="upscale_in_train") + self.dropout2 = Dropout(dropout, mode="upscale_in_train") + self.dropout3 = Dropout(dropout, mode="upscale_in_train") + self.activation = getattr(F, activation) + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + r""" + Applies a Transformer decoder layer on the input. + + Parameters: + tgt (Tensor): The input of Transformer decoder layer. It is a tensor + with shape `[batch_size, target_length, d_model]`. The data type + should be float32 or float64. + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + tgt_mask (Tensor, optional): A tensor used in self attention + to prevents attention to some unwanted positions, usually the + the subsequent positions. It is a tensor with shape broadcasted + to `[batch_size, n_head, target_length, target_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + memory_mask (Tensor, optional): A tensor used in decoder-encoder + cross attention to prevents attention to some unwanted positions, + usually the paddings. It is a tensor with shape broadcasted to + `[batch_size, n_head, target_length, source_length]`. 
When the + data type is bool, the unwanted positions have `False` values + and the others have `True` values. When the data type is int, + the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (tuple, optional): It is a tuple( :code:`(incremental_cache, static_cache)` ), + `incremental_cache` is an instance of `MultiHeadAttention.Cache`, + `static_cache` is an instance of `MultiHeadAttention.StaticCache. + See `TransformerDecoderLayer.gen_cache` for more details. It is + only used for inference and should be None for training. Default + None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `tgt`, representing the output of Transformer decoder layer. \ + Or a tuple if `cache` is not None, except for decoder layer output, \ + the tuple includes the new cache which is same as input `cache` \ + argument but `incremental_cache` in it has an incremental length. \ + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) + memory_mask = _convert_attention_mask(memory_mask, memory.dtype) + + residual = tgt + if self.normalize_before: + tgt = self.norm1(tgt) + if cache is None: + tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None) + else: + tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, cache[0]) + tgt = residual + self.dropout1(tgt) + if not self.normalize_before: + tgt = self.norm1(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm2(tgt) + if cache is None: + tgt = self.cross_attn(tgt, memory, memory, memory_mask, None) + else: + tgt, static_cache = self.cross_attn(tgt, memory, memory, memory_mask, cache[1]) + tgt = residual + self.dropout2(tgt) + if not self.normalize_before: + tgt = self.norm2(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm3(tgt) + tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) + tgt = residual + self.dropout3(tgt) + if not self.normalize_before: + tgt = self.norm3(tgt) + return tgt if cache is None else (tgt, (incremental_cache, static_cache)) + + def gen_cache(self, memory): + r""" + Generates cache for `forward` usage. The generated cache is a tuple + composed of an instance of `MultiHeadAttention.Cache` and an instance + of `MultiHeadAttention.StaticCache`. + + Parameters: + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + + Returns: + tuple: It is a tuple( :code:`(incremental_cache, static_cache)` ). \ + `incremental_cache` is an instance of `MultiHeadAttention.Cache` \ + produced by `self_attn.gen_cache(memory, MultiHeadAttention.Cache)`, \ + it reserves two tensors shaped `[batch_size, nhead, 0, d_model // nhead]`. \ + `static_cache` is an instance of `MultiHeadAttention.StaticCache` \ + produced by `cross_attn.gen_cache(memory, MultiHeadAttention.StaticCache)`, \ + it reserves two tensors shaped `[batch_size, nhead, source_length, d_model // nhead]`. + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. 
+ """ + incremental_cache = self.self_attn.gen_cache(memory, type=self.self_attn.Cache) + static_cache = self.cross_attn.gen_cache(memory, memory, type=self.cross_attn.StaticCache) + return incremental_cache, static_cache + + +class TransformerDecoder(Layer): + """ + TransformerDecoder is a stack of N decoder layers. + + Parameters: + decoder_layer (Layer): an instance of the `TransformerDecoderLayer`. It + would be used as the first layer, and the other layers would be created + according to the configurations of it. + num_layers (int): The number of decoder layers to be stacked. + norm (LayerNorm, optional): the layer normalization component. If provided, + apply layer normalization on the output of last encoder layer. + + Examples: + + .. code-block:: python + + import paddle + from paddle.nn import TransformerDecoderLayer, TransformerDecoder + + # decoder input: [batch_size, tgt_len, d_model] + dec_input = paddle.rand((2, 4, 128)) + # encoder output: [batch_size, src_len, d_model] + enc_output = paddle.rand((2, 6, 128)) + # self attention mask: [batch_size, n_head, tgt_len, tgt_len] + self_attn_mask = paddle.rand((2, 2, 4, 4)) + # cross attention mask: [batch_size, n_head, tgt_len, src_len] + cross_attn_mask = paddle.rand((2, 2, 4, 6)) + decoder_layer = TransformerDecoderLayer(128, 2, 512) + decoder = TransformerDecoder(decoder_layer, 2) + output = decoder(dec_input, + enc_output, + self_attn_mask, + cross_attn_mask) # [2, 4, 128] + """ + + def __init__(self, decoder_layer, num_layers, norm=None): + super(TransformerDecoder, self).__init__() + self.layers = LayerList( + [(decoder_layer if i == 0 else type(decoder_layer)(**decoder_layer._config)) for i in range(num_layers)] + ) + self.num_layers = num_layers + self.norm = norm + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + r""" + Applies a stack of N Transformer decoder layers on inputs. If `norm` is + provided, also applies layer normalization on the output of last decoder + layer. + + Parameters: + tgt (Tensor): The input of Transformer decoder. It is a tensor + with shape `[batch_size, target_length, d_model]`. The data type + should be float32 or float64. + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + tgt_mask (Tensor, optional): A tensor used in self attention + to prevents attention to some unwanted positions, usually the + the subsequent positions. It is a tensor with shape broadcasted + to `[batch_size, n_head, target_length, target_length]`. When + the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + memory_mask (Tensor, optional): A tensor used in decoder-encoder + cross attention to prevents attention to some unwanted positions, + usually the paddings. It is a tensor with shape broadcasted to + `[batch_size, n_head, target_length, source_length]`. When the + data type is bool, the unwanted positions have `False` values + and the others have `True` values. When the data type is int, + the unwanted positions have 0 values and the others have 1 + values. 
When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (list, optional): It is a list, and each element in the list + is a tuple( :code:`(incremental_cache, static_cache)` ). See + `TransformerDecoder.gen_cache` for more details. It is only + used for inference and should be None for training. Default None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `tgt`, representing the output of Transformer decoder. \ + Or a tuple if `cache` is not None, except for decoder output, \ + the tuple includes the new cache which is same as input `cache` \ + argument but `incremental_cache` in it has an incremental length. \ + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) + memory_mask = _convert_attention_mask(memory_mask, memory.dtype) + + output = tgt + new_caches = [] + for i, mod in enumerate(self.layers): + if cache is None: + output = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=None) + else: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=cache[i]) + new_caches.append(new_cache) + + if self.norm is not None: + output = self.norm(output) + + return output if cache is None else (output, new_caches) + + def gen_cache(self, memory, do_zip=False): + r""" + Generates cache for `forward` usage. The generated cache is a list, and + each element in it is a tuple( :code:`(incremental_cache, static_cache)` ) + produced by `TransformerDecoderLayer.gen_cache`. See `TransformerDecoderLayer.gen_cache` + for more details. If `do_zip` is True, apply `zip` on these tuples to get + a list with two elements. + + + Parameters: + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + do_zip (bool, optional): Indicate whether to apply `zip` on the tuples. + If True, return a list with two elements. Default False + + Returns: + list: It is a list, and each element in the list is a tuple produced \ + by `TransformerDecoderLayer.gen_cache(memory)`. See `TransformerDecoderLayer.gen_cache` \ + for more details. If `do_zip` is True, apply `zip` on these tuples \ + and return a list with two elements. + """ + cache = [layer.gen_cache(memory) for layer in self.layers] + if do_zip: + cache = list(zip(*cache)) + return cache + + +class Transformer(Layer): + """ + A Transformer model composed of an instance of `TransformerEncoder` and an + instance of `TransformerDecoder`. While the embedding layer and output layer + are not included. + + Please refer to `Attention is all you need <http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>`_ , + and see `TransformerEncoder` and `TransformerDecoder` for more details. + + Users can configurate the model architecture with corresponding parameters. + Note the usage of `normalize_before` representing where to apply layer + normalization (in pre-process or post-precess of multi-head attention or FFN), + and some transformer like models are different on this, such as + `BERT <https://arxiv.org/abs/1810.04805>`_ and `GPT2 <https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf>`_ . 
+ The default architecture here places layer normalization in post-process and + applies another layer normalization on the output of last encoder/decoder layer. + + Parameters: + d_model (int, optional): The expected feature size in the encoder/decoder input + and output. Default 512 + nhead (int, optional): The number of heads in multi-head attention(MHA). Default 8 + num_encoder_layers (int, optional): The number of layers in encoder. Default 6 + num_decoder_layers (int, optional): The number of layers in decoder. Default 6 + dim_feedforward (int, optional): The hidden layer size in the feedforward network(FFN). Default 2048 + dropout (float, optional): The dropout probability used in pre-process + and post-precess of MHA and FFN sub-layer. Default 0.1 + activation (str, optional): The activation function in the feedforward + network. Default relu. + attn_dropout (float, optional): The dropout probability used + in MHA to drop some attention target. If None, use the value of + `dropout`. Default None + act_dropout (float, optional): The dropout probability used after FFN + activition. If None, use the value of `dropout`. Default None + normalize_before (bool, optional): Indicate whether to put layer normalization + into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer + normalization and post-precess includes dropout, residual connection. + Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. Default False + weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property. + If it is a list/tuple, the length of `weight_attr` could be 1, 2 or 3. If it is 3, + `weight_attr[0]` would be used as `weight_attr` for self attention, `weight_attr[1]` + would be used as `weight_attr` for cross attention of `TransformerDecoder`, + and `weight_attr[2]` would be used as `weight_attr` for linear in FFN. + If it is 2, `weight_attr[0]` would be used as `weight_attr` both for self attention + and cross attntion and `weight_attr[1]` would be used as `weight_attr` for + linear in FFN. If it is 1, `weight_attr[0]` would be used as `weight_attr` + for self attention, cross attention and linear in FFN. Otherwise, + the three sub-layers all uses it as `weight_attr` to create parameters. + Default: None, which means the default weight parameter property is used. + See usage for details + in :code:`ParamAttr` . + bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property. + If it is a list/tuple, the length of `bias_attr` could be 1, 2 or 3. If it is 3, + `bias_attr[0]` would be used as `bias_attr` for self attention, `bias_attr[1]` + would be used as `bias_attr` for cross attention of `TransformerDecoder`, + and `bias_attr[2]` would be used as `bias_attr` for linear in FFN. + If it is 2, `bias_attr[0]` would be used as `bias_attr` both for self attention + and cross attntion and `bias_attr[1]` would be used as `bias_attr` for + linear in FFN. If it is 1, `bias_attr[0]` would be used as `bias_attr` + for self attention, cross attention and linear in FFN. Otherwise, + the three sub-layers all uses it as `bias_attr` to create parameters. + The `False` value means the corresponding layer would not have trainable + bias parameter. See usage for details in :code:`ParamAttr` . + Default: None,which means the default bias parameter property is used. + custom_encoder (Layer, optional): If custom encoder is provided, use it as the encoder. 
+ Default None + custom_decoder (Layer, optional): If custom decoder is provided, use it as the decoder. + Default None + + Examples: + + .. code-block:: python + + import paddle + from paddle.nn import Transformer + + # src: [batch_size, tgt_len, d_model] + enc_input = paddle.rand((2, 4, 128)) + # tgt: [batch_size, src_len, d_model] + dec_input = paddle.rand((2, 6, 128)) + # src_mask: [batch_size, n_head, src_len, src_len] + enc_self_attn_mask = paddle.rand((2, 2, 4, 4)) + # tgt_mask: [batch_size, n_head, tgt_len, tgt_len] + dec_self_attn_mask = paddle.rand((2, 2, 6, 6)) + # memory_mask: [batch_size, n_head, tgt_len, src_len] + cross_attn_mask = paddle.rand((2, 2, 6, 4)) + transformer = Transformer(128, 2, 4, 4, 512) + output = transformer(enc_input, + dec_input, + enc_self_attn_mask, + dec_self_attn_mask, + cross_attn_mask) # [2, 6, 128] + """ + + def __init__( + self, + d_model=512, + nhead=8, + num_encoder_layers=6, + num_decoder_layers=6, + dim_feedforward=2048, + dropout=0.1, + activation="relu", + attn_dropout=None, + act_dropout=None, + normalize_before=False, + weight_attr=None, + bias_attr=None, + custom_encoder=None, + custom_decoder=None, + ): + super(Transformer, self).__init__() + + if isinstance(bias_attr, (list, tuple)): + if len(bias_attr) == 1: + encoder_bias_attr = [bias_attr[0]] * 2 + decoder_bias_attr = [bias_attr[0]] * 3 + elif len(bias_attr) == 2: + encoder_bias_attr = bias_attr + decoder_bias_attr = [bias_attr[0], bias_attr[0], bias_attr[-1]] + elif len(bias_attr) == 3: + encoder_bias_attr = [bias_attr[0], bias_attr[-1]] + decoder_bias_attr = bias_attr + else: + assert False, "length of bias_attr should be 1 or 2 or 3 when it is a list/tuple" + else: + encoder_bias_attr = bias_attr + decoder_bias_attr = bias_attr + + if isinstance(weight_attr, (list, tuple)): + if len(weight_attr) == 1: + encoder_weight_attr = [weight_attr[0]] * 2 + decoder_weight_attr = [weight_attr[0]] * 3 + elif len(weight_attr) == 2: + encoder_weight_attr = weight_attr + decoder_weight_attr = [weight_attr[0], weight_attr[0], weight_attr[-1]] + elif len(weight_attr) == 3: + encoder_weight_attr = [weight_attr[0], weight_attr[-1]] + decoder_weight_attr = weight_attr + else: + assert False, "length of weight_attr should be 1 or 2 or 3 when it is a list/tuple" + else: + encoder_weight_attr = weight_attr + decoder_weight_attr = weight_attr + + if custom_encoder is not None: + self.encoder = custom_encoder + else: + encoder_layer = TransformerEncoderLayer( + d_model, + nhead, + dim_feedforward, + dropout, + activation, + attn_dropout, + act_dropout, + normalize_before, + encoder_weight_attr, + encoder_bias_attr, + ) + encoder_norm = LayerNorm(d_model) + self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm) + + if custom_decoder is not None: + self.decoder = custom_decoder + else: + decoder_layer = TransformerDecoderLayer( + d_model, + nhead, + dim_feedforward, + dropout, + activation, + attn_dropout, + act_dropout, + normalize_before, + decoder_weight_attr, + decoder_bias_attr, + ) + decoder_norm = LayerNorm(d_model) + self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm) + + self.d_model = d_model + self.nhead = nhead + + def forward(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None): + r""" + Applies a Transformer model on the inputs. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. 
+ tgt (Tensor): The input of Transformer decoder. It is a tensor + with shape `[batch_size, target_length, d_model]`. The data type + should be float32 or float64. + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + src_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + tgt_mask (Tensor, optional): A tensor used in self attention + to prevents attention to some unwanted positions, usually the + the subsequent positions. It is a tensor with shape broadcasted + to `[batch_size, n_head, target_length, target_length]`. When + the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + memory_mask (Tensor, optional): A tensor used in decoder-encoder + cross attention to prevents attention to some unwanted positions, + usually the paddings. It is a tensor with shape broadcasted to + `[batch_size, n_head, target_length, source_length]`. When the + data type is bool, the unwanted positions have `False` values + and the others have `True` values. When the data type is int, + the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + + Returns: + Tensor: It is a tensor that has the same shape and data type \ + as `tgt`, representing the output of Transformer decoder. + """ + src_mask = _convert_attention_mask(src_mask, src.dtype) + memory = self.encoder(src, src_mask=src_mask) + + tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) + memory_mask = _convert_attention_mask(memory_mask, memory.dtype) + output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask) + return output + + def generate_square_subsequent_mask(self, length): + """ + Generate a square mask for the sequence. The mask ensures that the + predictions for position i can depend only on the known outputs at + positions less than i. + + Parameters: + length (int|Tensor): The length of sequence. + + Returns: + Tensor: Generated square mask according to the given length. + + Examples: + .. code-block:: python + + import paddle + from paddle.nn.layer.transformer import Transformer + length = 5 + d_model, n_head, dim_feedforward = 8, 4, 64 + transformer_paddle = Transformer( + d_model, n_head, dim_feedforward=dim_feedforward) + mask = transformer_paddle.generate_square_subsequent_mask(length) + print(mask) + + # [[ 0. -inf -inf -inf -inf] + # [ 0. 0. 
-inf -inf -inf] + # [ 0. 0. 0. -inf -inf] + # [ 0. 0. 0. 0. -inf] + # [ 0. 0. 0. 0. 0.]] + + """ + return paddle.tensor.triu((paddle.ones((length, length), dtype=paddle.get_default_dtype()) * -np.inf), 1) diff --git a/examples/model_interpretation/utils.py b/examples/model_interpretation/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..469dc6f797f1c915749a842089a6322f0c666401 --- /dev/null +++ b/examples/model_interpretation/utils.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""This file contains some public functions +""" + + +def convert_tokenizer_res_to_old_version(tokenized_res): + if isinstance(tokenized_res, list): + return tokenized_res + if isinstance(tokenized_res, dict): + if len(tokenized_res["input_ids"]) == 0 or not isinstance(tokenized_res["input_ids"][0], list): + return tokenized_res + else: + res = [] + for idx in range(len(tokenized_res["input_ids"])): + temp_dict = {} + key_list = list(tokenized_res.keys()) + for key in key_list: + temp_dict[key] = tokenized_res[key][idx] + res.append(temp_dict) + return res + else: + raise ValueError("unsupported result type") + + +def cal_score(match_list, sorted_token): + over_all = [] + miss = 0 + for i in match_list: + over_all.extend(i[0]) + + score_dic = {} + for i in sorted_token: + split_time = over_all.count(i[0]) + if split_time: + score_dic[i[0]] = i[2] / split_time + else: + score_dic[i[0]] = 0.0 + if miss != 0: + print(miss) + + score = [] + for i in range(len(match_list)): + cur_score = 0.0 + for j in match_list[i][0]: + if j == -1: + continue + cur_score += score_dic[j] + score.append([str(match_list[i][1]), match_list[i][2], cur_score]) + return score + + +def match(context, context_seg, sorted_token): + result = [] + pointer1 = 0 # point at the context + pointer2 = 0 # point at the sorted_token array + for i in range(len(context_seg)): + seg_start_idx = context.find(context_seg[i], pointer1) + if seg_start_idx < 0: + print("Error: token not in context") + seg_end_idx = seg_start_idx + len(context_seg[i]) + + cur_set = [] + while pointer2 < len(sorted_token): + while pointer2 < len(sorted_token) and sorted_token[pointer2][1][1] <= seg_start_idx: + pointer2 += 1 + if pointer2 >= len(sorted_token): + break + if sorted_token[pointer2][1][0] >= seg_end_idx: + break + cur_set.append(sorted_token[pointer2][0]) + pointer2 += 1 + result.append([cur_set, i, context_seg[i]]) + pointer2 -= 1 + pointer1 = seg_end_idx + score = cal_score(result, sorted_token) + return score diff --git a/examples/multimodal/layoutlm/README.md b/examples/multimodal/layoutlm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f1f46392d4cdb8ea561d543a2952d8088cd965f0 --- /dev/null +++ b/examples/multimodal/layoutlm/README.md @@ -0,0 +1,44 @@ +# LayoutLM + +## 模型简介 +本项目是 [LayoutLM:Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf) 在 Paddle 2.2上的开源实现, +包含了在 
[FUNSD数据集](https://guillaumejaume.github.io/FUNSD/) 上的微调代码。
+
+## 快速开始
+### 配置环境
+环境依赖:
+- cv2
+- sentencepiece
+- yacs
+
+安装命令:
+```shell
+pip install opencv-python
+pip install sentencepiece
+pip install yacs
+```
+
+### 数据准备
+处理好的FUNSD数据集下载地址:https://bj.bcebos.com/v1/paddlenlp/datasets/FUNSD.zip 。
+
+下载并解压该数据集,将解压后的数据放置在当前目录下。
+
+### 执行Fine-tuning
+1. ``Sequence Labeling`` 任务启动Fine-tuning的方式如下:
+    ```shell
+    bash train_funsd.sh
+
+    # 结果如下:
+    # best metrics: {'precision': 0.7642124883504194, 'recall': 0.8204102051025512, 'f1': 0.7913148371531967}
+    ```
+
+### 数据处理
+FUNSD数据集是常用的表格理解数据集,原始数据集下载地址:https://guillaumejaume.github.io/FUNSD/dataset.zip 。
+该数据集包含 training_data 和 testing_data 两个子文件夹,分别包含149个训练样本和50个测试样本。数据预处理方式如下:
+```shell
+bash preprocess.sh
+```
+
+## Reference
+- [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf)
+- [microsoft/unilm/layoutlm](https://github.com/microsoft/unilm/tree/master/layoutlm)
diff --git a/examples/multimodal/layoutlm/funsd.py b/examples/multimodal/layoutlm/funsd.py
new file mode 100644
index 0000000000000000000000000000000000000000..4421cd3710b4f11bb2fa19e99971b6df7d546725
--- /dev/null
+++ b/examples/multimodal/layoutlm/funsd.py
@@ -0,0 +1,317 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
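+
+# `FunsdDataset` below reads the preprocessed FUNSD files produced by
+# `preprocess.py` / `preprocess.sh` ({mode}.txt, {mode}_box.txt and
+# {mode}_image.txt under `args.data_dir`) and yields
+# (input_ids, input_mask, segment_ids, label_ids, bboxes) tensors per example.
+# A minimal usage sketch (the argument values here are illustrative only):
+#
+#     dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id=-100, mode="train")
+#     loader = paddle.io.DataLoader(dataset, batch_size=8)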
+ +import logging +import os + +import paddle +from paddle.io import Dataset + +logger = logging.getLogger(__name__) + + +class FunsdDataset(Dataset): + def __init__(self, args, tokenizer, labels, pad_token_label_id, mode): + logger.info("Creating features from dataset file at %s", args.data_dir) + examples = read_examples_from_file(args.data_dir, mode) + features = convert_examples_to_features( + examples, + labels, + args.max_seq_length, + tokenizer, + cls_token_at_end=False, + cls_token=tokenizer.cls_token, + cls_token_segment_id=0, + sep_token=tokenizer.sep_token, + sep_token_extra=False, + pad_on_left=False, + pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], + pad_token_segment_id=0, + pad_token_label_id=pad_token_label_id, + ) + + self.features = features + # Convert to Tensors and build dataset + self.all_input_ids = paddle.to_tensor([f.input_ids for f in features], dtype="int64") + self.all_input_mask = paddle.to_tensor([f.input_mask for f in features], dtype="int64") + self.all_segment_ids = paddle.to_tensor([f.segment_ids for f in features], dtype="int64") + self.all_label_ids = paddle.to_tensor([f.label_ids for f in features], dtype="int64") + self.all_bboxes = paddle.to_tensor([f.boxes for f in features], dtype="int64") + + def __len__(self): + return len(self.features) + + def __getitem__(self, index): + return ( + self.all_input_ids[index], + self.all_input_mask[index], + self.all_segment_ids[index], + self.all_label_ids[index], + self.all_bboxes[index], + ) + + +class InputExample(object): + """A single training/test example for token classification.""" + + def __init__(self, guid, words, labels, boxes, actual_bboxes, file_name, page_size): + """Constructs a InputExample. + Args: + guid: Unique id for the example. + words: list. The words of the sequence. + labels: (Optional) list. The labels for each word of the sequence. This should be + specified for train and dev examples, but not for test examples. 
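+            boxes: list. The 0-1000 normalized bounding box of each word.
+            actual_bboxes: list. The original bounding box of each word in image
+                pixel coordinates.
+            file_name: The name of the image file this example comes from.
+            page_size: The (width, height) of the page image.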
+ """ + self.guid = guid + self.words = words + self.labels = labels + self.boxes = boxes + self.actual_bboxes = actual_bboxes + self.file_name = file_name + self.page_size = page_size + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__( + self, + input_ids, + input_mask, + segment_ids, + label_ids, + boxes, + actual_bboxes, + file_name, + page_size, + ): + assert ( + 0 <= all(boxes) <= 1000 + ), "Error with input bbox ({}): the coordinate value is not between 0 and 1000".format(boxes) + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_ids = label_ids + self.boxes = boxes + self.actual_bboxes = actual_bboxes + self.file_name = file_name + self.page_size = page_size + + +def read_examples_from_file(data_dir, mode): + file_path = os.path.join(data_dir, "{}.txt".format(mode)) + box_file_path = os.path.join(data_dir, "{}_box.txt".format(mode)) + image_file_path = os.path.join(data_dir, "{}_image.txt".format(mode)) + guid_index = 1 + examples = [] + with open(file_path, encoding="utf-8") as f, open(box_file_path, encoding="utf-8") as fb, open( + image_file_path, encoding="utf-8" + ) as fi: + words = [] + boxes = [] + actual_bboxes = [] + file_name = None + page_size = None + labels = [] + for line, bline, iline in zip(f, fb, fi): + if line.startswith("-DOCSTART-") or line == "" or line == "\n": + if words: + examples.append( + InputExample( + guid="{}-{}".format(mode, guid_index), + words=words, + labels=labels, + boxes=boxes, + actual_bboxes=actual_bboxes, + file_name=file_name, + page_size=page_size, + ) + ) + guid_index += 1 + words = [] + boxes = [] + actual_bboxes = [] + file_name = None + page_size = None + labels = [] + else: + splits = line.split("\t") + bsplits = bline.split("\t") + isplits = iline.split("\t") + assert len(splits) == 2 + assert len(bsplits) == 2 + assert len(isplits) == 4 + assert splits[0] == bsplits[0] + words.append(splits[0]) + if len(splits) > 1: + labels.append(splits[-1].replace("\n", "")) + box = bsplits[-1].replace("\n", "") + box = [int(b) for b in box.split()] + boxes.append(box) + actual_bbox = [int(b) for b in isplits[1].split()] + actual_bboxes.append(actual_bbox) + page_size = [int(i) for i in isplits[2].split()] + file_name = isplits[3].strip() + else: + # Examples could have no label for mode = "test" + labels.append("O") + if words: + examples.append( + InputExample( + guid=f"{mode}-{guid_index}", + words=words, + labels=labels, + boxes=boxes, + actual_bboxes=actual_bboxes, + file_name=file_name, + page_size=page_size, + ) + ) + return examples + + +def convert_examples_to_features( + examples, + label_list, + max_seq_length, + tokenizer, + cls_token_at_end=False, + cls_token="[CLS]", + cls_token_segment_id=1, + sep_token="[SEP]", + sep_token_extra=False, + pad_on_left=False, + pad_token=0, + cls_token_box=[0, 0, 0, 0], + sep_token_box=[1000, 1000, 1000, 1000], + pad_token_box=[0, 0, 0, 0], + pad_token_segment_id=0, + pad_token_label_id=-1, + sequence_a_segment_id=0, + mask_padding_with_zero=True, +): + + label_map = {label: i for i, label in enumerate(label_list)} + + features = [] + for (ex_index, example) in enumerate(examples): + file_name = example.file_name + page_size = example.page_size + width, height = page_size + if ex_index % 10000 == 0: + logger.info("Writing example %d of %d", ex_index, len(examples)) + + tokens = [] + token_boxes = [] + actual_bboxes = [] + label_ids = [] + for word, label, box, actual_bbox in zip(example.words, 
example.labels, example.boxes, example.actual_bboxes): + word_tokens = tokenizer.tokenize(word) + tokens.extend(word_tokens) + token_boxes.extend([box] * len(word_tokens)) + actual_bboxes.extend([actual_bbox] * len(word_tokens)) + # Use the real label id for the first token of the word, and padding ids for the remaining tokens + label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)) + + # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa. + special_tokens_count = 3 if sep_token_extra else 2 + if len(tokens) > max_seq_length - special_tokens_count: + tokens = tokens[: (max_seq_length - special_tokens_count)] + token_boxes = token_boxes[: (max_seq_length - special_tokens_count)] + actual_bboxes = actual_bboxes[: (max_seq_length - special_tokens_count)] + label_ids = label_ids[: (max_seq_length - special_tokens_count)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens += [sep_token] + token_boxes += [sep_token_box] + actual_bboxes += [[0, 0, width, height]] + label_ids += [pad_token_label_id] + if sep_token_extra: + # roberta uses an extra separator b/w pairs of sentences + tokens += [sep_token] + token_boxes += [sep_token_box] + actual_bboxes += [[0, 0, width, height]] + label_ids += [pad_token_label_id] + segment_ids = [sequence_a_segment_id] * len(tokens) + + if cls_token_at_end: + tokens += [cls_token] + token_boxes += [cls_token_box] + actual_bboxes += [[0, 0, width, height]] + label_ids += [pad_token_label_id] + segment_ids += [cls_token_segment_id] + else: + tokens = [cls_token] + tokens + token_boxes = [cls_token_box] + token_boxes + actual_bboxes = [[0, 0, width, height]] + actual_bboxes + label_ids = [pad_token_label_id] + label_ids + segment_ids = [cls_token_segment_id] + segment_ids + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) + + # Zero-pad up to the sequence length. 
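+        # For example, with max_seq_length = 512 and 100 real tokens, 412 pad
+        # entries are appended (or prepended when `pad_on_left` is True) to
+        # input_ids, input_mask, segment_ids, label_ids and token_boxes, so that
+        # every feature has the same length.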
+ padding_length = max_seq_length - len(input_ids) + if pad_on_left: + input_ids = ([pad_token] * padding_length) + input_ids + input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask + segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids + label_ids = ([pad_token_label_id] * padding_length) + label_ids + token_boxes = ([pad_token_box] * padding_length) + token_boxes + else: + input_ids += [pad_token] * padding_length + input_mask += [0 if mask_padding_with_zero else 1] * padding_length + segment_ids += [pad_token_segment_id] * padding_length + label_ids += [pad_token_label_id] * padding_length + token_boxes += [pad_token_box] * padding_length + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(label_ids) == max_seq_length + assert len(token_boxes) == max_seq_length + + features.append( + InputFeatures( + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_ids=label_ids, + boxes=token_boxes, + actual_bboxes=actual_bboxes, + file_name=file_name, + page_size=page_size, + ) + ) + return features diff --git a/examples/multimodal/layoutlm/preprocess.py b/examples/multimodal/layoutlm/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..28b07e5ca5d842e555ab4b485cab82653f2ea1a6 --- /dev/null +++ b/examples/multimodal/layoutlm/preprocess.py @@ -0,0 +1,166 @@ +import argparse +import json +import os + +from PIL import Image +from paddlenlp.transformers import AutoTokenizer + + +def bbox_string(box, width, length): + return ( + str(int(1000 * (box[0] / width))) + + " " + + str(int(1000 * (box[1] / length))) + + " " + + str(int(1000 * (box[2] / width))) + + " " + + str(int(1000 * (box[3] / length))) + ) + + +def actual_bbox_string(box, width, length): + return ( + str(box[0]) + " " + str(box[1]) + " " + str(box[2]) + " " + str(box[3]) + "\t" + str(width) + " " + str(length) + ) + + +def convert(args): + with open(os.path.join(args.output_dir, args.data_split + ".txt.tmp"), "w", encoding="utf8",) as fw, open( + os.path.join(args.output_dir, args.data_split + "_box.txt.tmp"), + "w", + encoding="utf8", + ) as fbw, open( + os.path.join(args.output_dir, args.data_split + "_image.txt.tmp"), + "w", + encoding="utf8", + ) as fiw: + for file in os.listdir(args.data_dir): + file_path = os.path.join(args.data_dir, file) + with open(file_path, "r", encoding="utf8") as f: + data = json.load(f) + image_path = file_path.replace("annotations", "images") + image_path = image_path.replace("json", "png") + file_name = os.path.basename(image_path) + image = Image.open(image_path) + width, length = image.size + for item in data["form"]: + words, label = item["words"], item["label"] + words = [w for w in words if w["text"].strip() != ""] + if len(words) == 0: + continue + if label == "other": + for w in words: + fw.write(w["text"] + "\tO\n") + fbw.write(w["text"] + "\t" + bbox_string(w["box"], width, length) + "\n") + fiw.write( + w["text"] + "\t" + actual_bbox_string(w["box"], width, length) + "\t" + file_name + "\n" + ) + else: + if len(words) == 1: + fw.write(words[0]["text"] + "\tS-" + label.upper() + "\n") + fbw.write(words[0]["text"] + "\t" + bbox_string(words[0]["box"], width, length) + "\n") + fiw.write( + words[0]["text"] + + "\t" + + actual_bbox_string(words[0]["box"], width, length) + + "\t" + + file_name + + "\n" + ) + else: + fw.write(words[0]["text"] + "\tB-" + label.upper() + "\n") + fbw.write(words[0]["text"] + "\t" 
+ bbox_string(words[0]["box"], width, length) + "\n") + fiw.write( + words[0]["text"] + + "\t" + + actual_bbox_string(words[0]["box"], width, length) + + "\t" + + file_name + + "\n" + ) + for w in words[1:-1]: + fw.write(w["text"] + "\tI-" + label.upper() + "\n") + fbw.write(w["text"] + "\t" + bbox_string(w["box"], width, length) + "\n") + fiw.write( + w["text"] + + "\t" + + actual_bbox_string(w["box"], width, length) + + "\t" + + file_name + + "\n" + ) + fw.write(words[-1]["text"] + "\tE-" + label.upper() + "\n") + fbw.write(words[-1]["text"] + "\t" + bbox_string(words[-1]["box"], width, length) + "\n") + fiw.write( + words[-1]["text"] + + "\t" + + actual_bbox_string(words[-1]["box"], width, length) + + "\t" + + file_name + + "\n" + ) + fw.write("\n") + fbw.write("\n") + fiw.write("\n") + + +def seg_file(file_path, tokenizer, max_len): + subword_len_counter = 0 + output_path = file_path[:-4] + with open(file_path, "r", encoding="utf8") as f_p, open(output_path, "w", encoding="utf8") as fw_p: + for line in f_p: + line = line.rstrip() + + if not line: + fw_p.write(line + "\n") + subword_len_counter = 0 + continue + token = line.split("\t")[0] + + current_subwords_len = len(tokenizer.tokenize(token)) + + # Token contains strange control characters like \x96 or \x95 + # Just filter out the complete line + if current_subwords_len == 0: + continue + + if (subword_len_counter + current_subwords_len) > max_len: + fw_p.write("\n" + line + "\n") + subword_len_counter = current_subwords_len + continue + + subword_len_counter += current_subwords_len + + fw_p.write(line + "\n") + + +def seg(args): + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, do_lower_case=True) + seg_file( + os.path.join(args.output_dir, args.data_split + ".txt.tmp"), + tokenizer, + args.max_len, + ) + seg_file( + os.path.join(args.output_dir, args.data_split + "_box.txt.tmp"), + tokenizer, + args.max_len, + ) + seg_file( + os.path.join(args.output_dir, args.data_split + "_image.txt.tmp"), + tokenizer, + args.max_len, + ) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--data_dir", type=str, default="data/training_data/annotations") + parser.add_argument("--data_split", type=str, default="train") + parser.add_argument("--output_dir", type=str, default="data") + parser.add_argument("--model_name_or_path", type=str, default="bert-base-uncased") + parser.add_argument("--max_len", type=int, default=510) + args = parser.parse_args() + + convert(args) + seg(args) diff --git a/examples/multimodal/layoutlm/preprocess.sh b/examples/multimodal/layoutlm/preprocess.sh new file mode 100644 index 0000000000000000000000000000000000000000..2ff8dc4e317aa87c5a176de5af9b5571415239c5 --- /dev/null +++ b/examples/multimodal/layoutlm/preprocess.sh @@ -0,0 +1,13 @@ +python preprocess.py --data_dir data/training_data/annotations \ + --data_split train \ + --output_dir data \ + --model_name_or_path bert-base-uncased \ + --max_len 510 + +python preprocess.py --data_dir data/testing_data/annotations \ + --data_split test \ + --output_dir data \ + --model_name_or_path bert-base-uncased \ + --max_len 510 + +cat data/train.txt | cut -d$'\t' -f 2 | grep -v "^$"| sort | uniq > data/labels.txt \ No newline at end of file diff --git a/examples/multimodal/layoutlm/train_funsd.py b/examples/multimodal/layoutlm/train_funsd.py new file mode 100644 index 0000000000000000000000000000000000000000..8021e0f752f12406a3937eeec78ed85253d0b255 --- /dev/null +++ b/examples/multimodal/layoutlm/train_funsd.py @@ -0,0 
+1,282 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +import random + +import numpy as np +import paddle +from funsd import FunsdDataset +from seqeval.metrics import ( + classification_report, + f1_score, + precision_score, + recall_score, +) +from tqdm import tqdm, trange + +# relative reference +from utils import parse_args + +from paddlenlp.transformers import ( + LayoutLMForTokenClassification, + LayoutLMModel, + LayoutLMTokenizer, +) + +logger = logging.getLogger(__name__) + + +def get_labels(path): + with open(path, "r") as f: + labels = f.read().splitlines() + if "O" not in labels: + labels = ["O"] + labels + return labels + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def train(args): + logging.basicConfig( + filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, + ) + + all_labels = get_labels(args.labels) + + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + tokenizer = LayoutLMTokenizer.from_pretrained(args.model_name_or_path) + + # for training process, model is needed for the bert class + # else it can directly loaded for the downstream task + if not args.do_train: + model = LayoutLMForTokenClassification.from_pretrained(args.model_name_or_path) + else: + model = LayoutLMModel.from_pretrained(args.model_name_or_path) + model = LayoutLMForTokenClassification(model, num_classes=len(all_labels), dropout=None) + + train_dataset = FunsdDataset(args, tokenizer, all_labels, pad_token_label_id, mode="train") + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True + ) + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + train_dataloader = paddle.io.DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=None, + ) + + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, + args.warmup_steps, + start_lr=0, + end_lr=args.learning_rate, + ) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=pad_token_label_id) + + # Train + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + 
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info( + " Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * paddle.distributed.get_world_size(), + ) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss = 0.0 + model.clear_gradients() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + inputs = { + "input_ids": batch[0], + "attention_mask": batch[1], + "token_type_ids": batch[2], + "bbox": batch[4], + } + labels = batch[3] + logits = model(**inputs) + loss = loss_fct( + logits.reshape([-1, len(all_labels)]), + labels.reshape( + [ + -1, + ] + ), + ) + + loss = loss.mean() + logger.info("train loss: {}".format(loss.numpy())) + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + optimizer.step() + lr_scheduler.step() # Update learning rate schedule + model.clear_gradients() + global_step += 1 + + if ( + paddle.distributed.get_rank() == 0 + and args.logging_steps > 0 + and global_step % args.logging_steps == 0 + ): + # Log metrics + if ( + paddle.distributed.get_rank() == 0 and args.evaluate_during_training + ): # Only evaluate when single GPU otherwise metrics may not average well + results, _ = evaluate( + args, + model, + tokenizer, + all_labels, + loss_fct, + pad_token_label_id, + mode="test", + ) + logger.info("results: {}".format(results)) + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, all_labels, loss_fct, pad_token_label_id, mode, prefix=""): + eval_dataset = FunsdDataset(args, tokenizer, all_labels, pad_token_label_id, mode=mode) + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) + eval_dataloader = paddle.io.DataLoader( + eval_dataset, + batch_size=args.eval_batch_size, + collate_fn=None, + ) + + # Eval + logger.info("***** Running evaluation %s *****", prefix) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + model.eval() + for batch in tqdm(eval_dataloader, desc="Evaluating"): + with paddle.no_grad(): + inputs = { + "input_ids": batch[0], + "attention_mask": batch[1], + "token_type_ids": batch[2], + "bbox": batch[4], + } + labels = batch[3] + logits = model(**inputs) + tmp_eval_loss = loss_fct( + logits.reshape([-1, len(all_labels)]), + labels.reshape( + [ + -1, + ] + ), + ) + 
tmp_eval_loss = tmp_eval_loss.mean() + eval_loss += tmp_eval_loss.item() + + nb_eval_steps += 1 + if preds is None: + preds = logits.numpy() + out_label_ids = labels.numpy() + else: + preds = np.append(preds, logits.numpy(), axis=0) + out_label_ids = np.append(out_label_ids, labels.numpy(), axis=0) + + eval_loss = eval_loss / nb_eval_steps + preds = np.argmax(preds, axis=2) + + label_map = {i: label for i, label in enumerate(all_labels)} + out_label_list = [[] for _ in range(out_label_ids.shape[0])] + preds_list = [[] for _ in range(out_label_ids.shape[0])] + + for i in range(out_label_ids.shape[0]): + for j in range(out_label_ids.shape[1]): + if out_label_ids[i, j] != pad_token_label_id: + out_label_list[i].append(label_map[out_label_ids[i][j]]) + preds_list[i].append(label_map[preds[i][j]]) + + results = { + "loss": eval_loss, + "precision": precision_score(out_label_list, preds_list), + "recall": recall_score(out_label_list, preds_list), + "f1": f1_score(out_label_list, preds_list), + } + + report = classification_report(out_label_list, preds_list) + logger.info("\n" + report) + + logger.info("***** Eval results %s *****", prefix) + for key in sorted(results.keys()): + logger.info(" %s = %s", key, str(results[key])) + + return results, preds + + +if __name__ == "__main__": + args = parse_args() + os.makedirs(args.output_dir, exist_ok=True) + train(args) diff --git a/examples/multimodal/layoutlm/train_funsd.sh b/examples/multimodal/layoutlm/train_funsd.sh new file mode 100644 index 0000000000000000000000000000000000000000..cfd65d6c3ba1710349f1fc94c2a262eab35f6caa --- /dev/null +++ b/examples/multimodal/layoutlm/train_funsd.sh @@ -0,0 +1,17 @@ +export CUDA_VISIBLE_DEVICES=7 + +python3.7 train_funsd.py \ + --data_dir "./data/" \ + --model_name_or_path "layoutlm-base-uncased" \ + --do_lower_case \ + --max_seq_length 512 \ + --do_train \ + --do_eval \ + --num_train_epochs 100 \ + --logging_steps 10 \ + --save_steps 500 \ + --output_dir "output/" \ + --labels "./data/labels.txt" \ + --per_gpu_train_batch_size 16 \ + --per_gpu_eval_batch_size 16 \ + --evaluate_during_training diff --git a/examples/multimodal/layoutlm/utils.py b/examples/multimodal/layoutlm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..6e3c4bce74049b1a402687edfd7ca60a054b7e98 --- /dev/null +++ b/examples/multimodal/layoutlm/utils.py @@ -0,0 +1,188 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import, division, print_function + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--data_dir", + default=None, + type=str, + required=True, + help="The input data dir. 
Should contain the training files for the CoNLL-2003 NER task.", + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + ) + parser.add_argument( + "--weights_path", + default=None, + type=str, + required=False, + ) + + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + + # Other parameters + parser.add_argument( + "--labels", + default="", + type=str, + help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.", + ) + parser.add_argument( + "--config_name", + default="", + type=str, + help="Pretrained config name or path if not the same as model_name", + ) + parser.add_argument( + "--tokenizer_name", + default="", + type=str, + help="Pretrained tokenizer name or path if not the same as model_name", + ) + parser.add_argument( + "--cache_dir", + default="", + type=str, + help="Where do you want to store the pre-trained models downloaded from s3", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to run training.") + parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.") + parser.add_argument( + "--do_predict", + action="store_true", + help="Whether to run predictions on the test set.", + ) + parser.add_argument( + "--evaluate_during_training", + action="store_true", + help="Whether to run evaluation during training at each logging step.", + ) + parser.add_argument( + "--do_lower_case", + action="store_true", + help="Set this flag if you are using an uncased model.", + ) + + parser.add_argument( + "--per_gpu_train_batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--per_gpu_eval_batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for evaluation.", + ) + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.", + ) + parser.add_argument( + "--learning_rate", + default=5e-5, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=50, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument( + "--eval_all_checkpoints", + action="store_true", + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number", + ) + parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available") + parser.add_argument( + "--overwrite_output_dir", + action="store_true", + help="Overwrite the content of the output directory", + ) + parser.add_argument( + "--overwrite_cache", + action="store_true", + help="Overwrite the cached training and evaluation sets", + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--fp16", + action="store_true", + help="Whether to use 16-bit (mixed) precision instead of 32-bit", + ) + parser.add_argument( + "--fp16_opt_level", + type=str, + default="O1", + help="For fp16: AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/amp/auto_cast_cn.html", + ) + parser.add_argument( + "--local_rank", + type=int, + default=-1, + help="For distributed training: local_rank", + ) + parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.") + parser.add_argument("--server_port", type=str, default="", help="For distant debugging.") + args = parser.parse_args() + return args diff --git a/examples/multimodal/layoutxlm/README.md b/examples/multimodal/layoutxlm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..03c0a93c6a20e6a67a83283d2001a3a7e05d4b0a --- /dev/null +++ b/examples/multimodal/layoutxlm/README.md @@ -0,0 +1,45 @@ +# LayoutXLM + +## 模型简介 +本项目是 [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) 在 Paddle 2.2上的开源实现, +包含了在 [XFUND数据集](https://github.com/doc-analysis/XFUND) 上的微调代码。 + +## 快速开始 +### 配置环境 +环境依赖 +- cv2 +- sentencepiece +- yacs + +安装命令: +```shell +pip install opencv-python +pip install sentencepiece +pip install yacs +``` + +### 数据准备 +处理好的XFUND中文数据集下载地址:https://bj.bcebos.com/v1/paddlenlp/datasets/XFUND.zip 。 + +下载并解压该数据集,解压后将数据集放置在当前目录下。 + +### 执行Fine-tuning +1. ``Semantic Entity Recognition`` 任务启动Fine-tuning的方式如下: + ```shell + bash run_xfun_ser.sh + + # 结果如下: + # best metrics: {'precision': 0.8514686248331108, 'recall': 0.9354602126879354, 'f1': 0.8914904770225406} + ``` + +2. 
``Relation Extraction`` 任务启动Fine-tuning的方式如下: + ```shell + bash run_xfun_re.sh + + # 结果如下: + # best metrics: {'precision': 0.6788935658448587, 'recall': 0.7743484224965707, 'f1': 0.7234860621595642} + ``` + +## Reference +- [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) +- [microsoft/unilm/layoutxlm](https://github.com/microsoft/unilm/tree/master/layoutxlm) diff --git a/examples/multimodal/layoutxlm/compare.py b/examples/multimodal/layoutxlm/compare.py new file mode 100644 index 0000000000000000000000000000000000000000..120f651c177a11c30b5b21d99cce213578eaa83d --- /dev/null +++ b/examples/multimodal/layoutxlm/compare.py @@ -0,0 +1,105 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import numpy as np +import paddle +import torch + +sys.path.insert(0, "../../../") + + +def get_input_demo(platform="paddle", device="cpu"): + info = paddle.load("fake_input_paddle_xlm.data") + # imgs = np.random.rand(info["input_ids"].shape[0], 3, 224, 224).astype(np.float32) + # info["image"] = paddle.to_tensor(imgs) + if platform == "torch": + info = {key: torch.tensor(info[key].numpy()) for key in info} + if device == "gpu": + info = {key: info[key].cuda() for key in info} + return info + + +def test_layoutlm_paddle(): + from paddlenlp.transformers import LayoutXLMModel + + model = LayoutXLMModel.from_pretrained("layoutxlm-base-uncased") + model.eval() + + paddle.save(model.state_dict(), "v2.pdparams") + + batch_input = get_input_demo(platform="paddle", device="gpu") + with paddle.no_grad(): + outputs = model( + input_ids=batch_input["input_ids"], + bbox=batch_input["bbox"], + image=batch_input["image"], + attention_mask=batch_input["attention_mask"], + ) + sequence_output = outputs[0] + pooled_output = outputs[1] + return sequence_output, pooled_output + + +def test_layoutlm_torch(): + # import pytorch models + from layoutlmft.models.layoutxlm import LayoutXLMModel + + model = LayoutXLMModel.from_pretrained("microsoft/layoutxlm-base") + model.eval() + model = model.cuda() + + batch_input = get_input_demo(platform="torch", device="gpu") + + outputs = model( + input_ids=batch_input["input_ids"], + bbox=batch_input["bbox"], + image=batch_input["image"], + attention_mask=batch_input["attention_mask"], + ) + sequence_output = outputs[0] + pooled_output = outputs[1] + return sequence_output, pooled_output + + +def get_statistic_info(x, y): + mean_abs_diff = np.mean(np.abs(x - y)) + max_abs_diff = np.max(np.abs(x - y)) + return mean_abs_diff, max_abs_diff + + +if __name__ == "__main__": + + print("\n====test_layoutxlm_torch=====") + torch_hidden_out, torch_pool_out = test_layoutlm_torch() + torch_hidden_out = torch_hidden_out.cpu().detach().numpy() + torch_pool_out = torch_pool_out.cpu().detach().numpy() + print(torch_hidden_out.shape, torch_pool_out.shape) + + print("\n====test_layoutxlm_paddle=====") + paddle_hidden_out, paddle_pool_out = 
test_layoutlm_paddle() + paddle_hidden_out = paddle_hidden_out.numpy() + paddle_pool_out = paddle_pool_out.numpy() + print(paddle_hidden_out.shape, paddle_pool_out.shape) + + mean_abs_diff, max_abs_diff = get_statistic_info(torch_hidden_out, paddle_hidden_out) + print("======hidden_out diff info====") + print("\t mean_abs_diff: {}".format(mean_abs_diff)) + print("\t max_abs_diff: {}".format(max_abs_diff)) + + mean_abs_diff, max_abs_diff = get_statistic_info(torch_pool_out, paddle_pool_out) + print("======pool_out diff info====") + print("\t mean_abs_diff: {}".format(mean_abs_diff)) + print("\t max_abs_diff: {}".format(max_abs_diff)) diff --git a/examples/multimodal/layoutxlm/run_xfun_re.py b/examples/multimodal/layoutxlm/run_xfun_re.py new file mode 100644 index 0000000000000000000000000000000000000000..13e31b27b99c2d66c4482a870af187f7f506bd4c --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_re.py @@ -0,0 +1,406 @@ +import sys +import os +import random +import numbers +import logging + +import argparse +import paddle +import numpy as np +from paddlenlp.transformers import LayoutXLMModel, LayoutXLMTokenizer, LayoutXLMForRelationExtraction +from xfun import XFUN + +# Todo: delete the following line after the release of v2.2 +sys.path.insert(0, "../../../") +logger = logging.getLogger(__name__) + + +class DataCollator: + def __call__(self, batch): + data_dict = {} + to_tensor_keys = [] + for sample in batch: + for k, v in sample.items(): + if k not in data_dict: + data_dict[k] = [] + if isinstance(v, (np.ndarray, paddle.Tensor, numbers.Number)): + if k not in to_tensor_keys: + to_tensor_keys.append(k) + data_dict[k].append(v) + for k in to_tensor_keys: + data_dict[k] = paddle.to_tensor(data_dict[k]) + return data_dict + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + # yapf: disable + parser.add_argument("--model_name_or_path", default=None, type=str, required=True,) + parser.add_argument("--train_data_dir", default=None, type=str, required=False,) + parser.add_argument("--train_label_path", default=None, type=str, required=False,) + parser.add_argument("--eval_data_dir", default=None, type=str, required=False,) + parser.add_argument("--eval_label_path", default=None, type=str, required=False,) + parser.add_argument("--use_vdl", default=False, type=bool, required=False,) + parser.add_argument("--output_dir", default=None, type=str, required=True,) + parser.add_argument("--max_seq_length", default=512, type=int,) + parser.add_argument("--evaluate_during_training", action="store_true",) + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",) + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.",) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.",) + parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.",) + 
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.",) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) + # yapf: enable + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_label_maps(): + labels = ["O", "B-QUESTION", "B-ANSWER", "B-HEADER", "I-ANSWER", "I-QUESTION", "I-HEADER"] + label2id_map = {label: idx for idx, label in enumerate(labels)} + id2label_map = {idx: label for idx, label in enumerate(labels)} + return label2id_map, id2label_map + + +def cal_metric(re_preds, re_labels, entities): + gt_relations = [] + for b in range(len(re_labels)): + rel_sent = [] + for head, tail in zip(re_labels[b]["head"], re_labels[b]["tail"]): + rel = {} + rel["head_id"] = head + rel["head"] = (entities[b]["start"][rel["head_id"]], entities[b]["end"][rel["head_id"]]) + rel["head_type"] = entities[b]["label"][rel["head_id"]] + + rel["tail_id"] = tail + rel["tail"] = (entities[b]["start"][rel["tail_id"]], entities[b]["end"][rel["tail_id"]]) + rel["tail_type"] = entities[b]["label"][rel["tail_id"]] + + rel["type"] = 1 + rel_sent.append(rel) + gt_relations.append(rel_sent) + re_metrics = re_score(re_preds, gt_relations, mode="boundaries") + return re_metrics + + +def re_score(pred_relations, gt_relations, mode="strict"): + """Evaluate RE predictions + + Args: + pred_relations (list) : list of list of predicted relations (several relations in each sentence) + gt_relations (list) : list of list of ground truth relations + + rel = { "head": (start_idx (inclusive), end_idx (exclusive)), + "tail": (start_idx (inclusive), end_idx (exclusive)), + "head_type": ent_type, + "tail_type": ent_type, + "type": rel_type} + + vocab (Vocab) : dataset vocabulary + mode (str) : in 'strict' or 'boundaries'""" + + assert mode in ["strict", "boundaries"] + + relation_types = [v for v in [0, 1] if not v == 0] + scores = {rel: {"tp": 0, "fp": 0, "fn": 0} for rel in relation_types + ["ALL"]} + + # Count GT relations and Predicted relations + n_sents = len(gt_relations) + n_rels = sum([len([rel for rel in sent]) for sent in gt_relations]) + n_found = sum([len([rel for rel in sent]) for sent in pred_relations]) + + # Count TP, FP and FN per type + for pred_sent, gt_sent in zip(pred_relations, gt_relations): + for rel_type in relation_types: + # strict mode takes argument types into account + if mode == "strict": + pred_rels = { + (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"]) + for rel in pred_sent + if rel["type"] == rel_type + } + gt_rels = { + (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"]) + for rel in gt_sent + if rel["type"] == rel_type + } + + # boundaries mode only takes argument spans into account + elif mode == "boundaries": + pred_rels = {(rel["head"], rel["tail"]) for rel in pred_sent if rel["type"] == rel_type} + gt_rels = {(rel["head"], rel["tail"]) for rel in gt_sent if rel["type"] == rel_type} + + scores[rel_type]["tp"] += len(pred_rels & gt_rels) + scores[rel_type]["fp"] += len(pred_rels - gt_rels) + scores[rel_type]["fn"] += len(gt_rels - pred_rels) + + # Compute per entity Precision / Recall / F1 + for rel_type in scores.keys(): + if scores[rel_type]["tp"]: + scores[rel_type]["p"] = scores[rel_type]["tp"] / (scores[rel_type]["fp"] + scores[rel_type]["tp"]) + scores[rel_type]["r"] = scores[rel_type]["tp"] / (scores[rel_type]["fn"] + scores[rel_type]["tp"]) + else: + 
scores[rel_type]["p"], scores[rel_type]["r"] = 0, 0 + + if not scores[rel_type]["p"] + scores[rel_type]["r"] == 0: + scores[rel_type]["f1"] = ( + 2 * scores[rel_type]["p"] * scores[rel_type]["r"] / (scores[rel_type]["p"] + scores[rel_type]["r"]) + ) + else: + scores[rel_type]["f1"] = 0 + + # Compute micro F1 Scores + tp = sum([scores[rel_type]["tp"] for rel_type in relation_types]) + fp = sum([scores[rel_type]["fp"] for rel_type in relation_types]) + fn = sum([scores[rel_type]["fn"] for rel_type in relation_types]) + + if tp: + precision = tp / (tp + fp) + recall = tp / (tp + fn) + f1 = 2 * precision * recall / (precision + recall) + + else: + precision, recall, f1 = 0, 0, 0 + + scores["ALL"]["p"] = precision + scores["ALL"]["r"] = recall + scores["ALL"]["f1"] = f1 + scores["ALL"]["tp"] = tp + scores["ALL"]["fp"] = fp + scores["ALL"]["fn"] = fn + + # Compute Macro F1 Scores + scores["ALL"]["Macro_f1"] = np.mean([scores[ent_type]["f1"] for ent_type in relation_types]) + scores["ALL"]["Macro_p"] = np.mean([scores[ent_type]["p"] for ent_type in relation_types]) + scores["ALL"]["Macro_r"] = np.mean([scores[ent_type]["r"] for ent_type in relation_types]) + + logger.info(f"RE Evaluation in *** {mode.upper()} *** mode") + + logger.info( + "processed {} sentences with {} relations; found: {} relations; correct: {}.".format( + n_sents, n_rels, n_found, tp + ) + ) + logger.info( + "\tALL\t TP: {};\tFP: {};\tFN: {}".format(scores["ALL"]["tp"], scores["ALL"]["fp"], scores["ALL"]["fn"]) + ) + logger.info("\t\t(m avg): precision: {:.2f};\trecall: {:.2f};\tf1: {:.2f} (micro)".format(precision, recall, f1)) + logger.info( + "\t\t(M avg): precision: {:.2f};\trecall: {:.2f};\tf1: {:.2f} (Macro)\n".format( + scores["ALL"]["Macro_p"], scores["ALL"]["Macro_r"], scores["ALL"]["Macro_f1"] + ) + ) + + for rel_type in relation_types: + logger.info( + "\t{}: \tTP: {};\tFP: {};\tFN: {};\tprecision: {:.2f};\trecall: {:.2f};\tf1: {:.2f};\t{}".format( + rel_type, + scores[rel_type]["tp"], + scores[rel_type]["fp"], + scores[rel_type]["fn"], + scores[rel_type]["p"], + scores[rel_type]["r"], + scores[rel_type]["f1"], + scores[rel_type]["tp"] + scores[rel_type]["fp"], + ) + ) + + return scores + + +def evaluate(model, eval_dataloader, logger, prefix=""): + # Eval! 
+ logger.info(f"***** Running evaluation {prefix} *****") + logger.info(f" Num examples = {len(eval_dataloader.dataset)}") + + re_preds = [] + re_labels = [] + entities = [] + eval_loss = 0.0 + model.eval() + for idx, batch in enumerate(eval_dataloader): + with paddle.no_grad(): + outputs = model(**batch) + loss = outputs["loss"].mean().item() + if paddle.distributed.get_rank() == 0: + logger.info(f"[Eval] process: {idx}/{len(eval_dataloader)}, loss: {loss:.5f}") + + eval_loss += loss + re_preds.extend(outputs["pred_relations"]) + re_labels.extend(batch["relations"]) + entities.extend(outputs["entities"]) + re_metrics = cal_metric(re_preds, re_labels, entities) + re_metrics = { + "precision": re_metrics["ALL"]["p"], + "recall": re_metrics["ALL"]["r"], + "f1": re_metrics["ALL"]["f1"], + } + model.train() + return re_metrics + + +def train(args): + os.makedirs(args.output_dir, exist_ok=True) + set_seed(args) + + label2id_map, id2label_map = get_label_maps() + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + # dist mode + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) + base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) + model = LayoutXLMForRelationExtraction(base_model, dropout=None) + + # dist mode + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_dataset = XFUN( + tokenizer, + data_dir=args.train_data_dir, + label_path=args.train_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + max_seq_len=args.max_seq_length, + pad_token_label_id=pad_token_label_id, + contains_re=True, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + eval_dataset = XFUN( + tokenizer, + data_dir=args.eval_data_dir, + label_path=args.eval_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + max_seq_len=args.max_seq_length, + pad_token_label_id=pad_token_label_id, + contains_re=True, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True + ) + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + train_dataloader = paddle.io.DataLoader( + train_dataset, batch_sampler=train_sampler, num_workers=8, use_shared_memory=True, collate_fn=DataCollator() + ) + + eval_dataloader = paddle.io.DataLoader( + eval_dataset, batch_size=args.per_gpu_eval_batch_size, num_workers=8, shuffle=False, collate_fn=DataCollator() + ) + + t_total = len(train_dataloader) * args.num_train_epochs + + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, + args.warmup_steps, + start_lr=0, + end_lr=args.learning_rate, + ) + grad_clip = paddle.nn.ClipGradByNorm(clip_norm=10) + optimizer = paddle.optimizer.Adam( + learning_rate=args.learning_rate, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + grad_clip=grad_clip, + weight_decay=args.weight_decay, + ) + + # Train! 
+ logger.info("***** Running training *****") + logger.info(f" Num examples = {len(train_dataset)}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Instantaneous batch size per GPU = {args.per_gpu_train_batch_size}") + logger.info( + f" Total train batch size (w. parallel, distributed & accumulation) = {args.train_batch_size * paddle.distributed.get_world_size()}" + ) + logger.info(f" Total optimization steps = {t_total}") + + global_step = 0 + train_dataloader_len = len(train_dataloader) + best_metirc = {"f1": 0} + model.train() + + for epoch in range(int(args.num_train_epochs)): + for step, batch in enumerate(train_dataloader): + outputs = model(**batch) + # model outputs are always tuple in ppnlp (see doc) + loss = outputs["loss"] + loss = loss.mean() + + logger.info( + f"epoch: [{epoch}/{args.num_train_epochs}], iter: [{step}/{train_dataloader_len}], global_step:{global_step}, train loss: {np.mean(loss.numpy())}, lr: {optimizer.get_lr()}" + ) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + # lr_scheduler.step() # Update learning rate schedule + + global_step += 1 + + if paddle.distributed.get_rank() == 0 and args.eval_steps > 0 and global_step % args.eval_steps == 0: + # Log metrics + if paddle.distributed.get_rank() == 0 and args.evaluate_during_training: + results = evaluate(model, eval_dataloader, logger) + if results["f1"] > best_metirc["f1"]: + best_metirc = results + output_dir = os.path.join(args.output_dir, "checkpoint-best") + os.makedirs(output_dir, exist_ok=True) + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info(f"Saving model checkpoint to {output_dir}") + logger.info(f"eval results: {results}") + logger.info(f"best_metirc: {best_metirc}") + + if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-latest") + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info(f"Saving model checkpoint to {output_dir}") + logger.info(f"best_metirc: {best_metirc}") + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + train(args) diff --git a/examples/multimodal/layoutxlm/run_xfun_re.sh b/examples/multimodal/layoutxlm/run_xfun_re.sh new file mode 100644 index 0000000000000000000000000000000000000000..4aeea52f5dc960cbc464242cd9bd818ca4ae58ff --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_re.sh @@ -0,0 +1,19 @@ +export CUDA_VISIBLE_DEVICES=0 + +python ./run_xfun_re.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_length 512 \ + --train_data_dir "XFUND/zh_train/image" \ + --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ + --eval_data_dir "XFUND/zh_val/image" \ + --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ + --num_train_epochs 200 \ + --eval_steps 50 \ + --save_steps 500 \ + --output_dir "./output/re/" \ + --learning_rate 5e-5 \ + --warmup_steps 50 \ + --per_gpu_train_batch_size 8 \ + 
--per_gpu_eval_batch_size 8 \ + --evaluate_during_training \ + --seed 2048 diff --git a/examples/multimodal/layoutxlm/run_xfun_ser.py b/examples/multimodal/layoutxlm/run_xfun_ser.py new file mode 100644 index 0000000000000000000000000000000000000000..36b0b988822d2b22e3a3764ccb99c9fbf68be9bd --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_ser.py @@ -0,0 +1,353 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import logging +import os +import random +import sys + +import numpy as np +import paddle +from seqeval.metrics import ( + classification_report, + f1_score, + precision_score, + recall_score, +) +from xfun import XFUN + +from paddlenlp.transformers import ( + LayoutXLMForTokenClassification, + LayoutXLMModel, + LayoutXLMTokenizer, +) + +# Todo: delete the following line after the release of v2.2 +sys.path.insert(0, "../../../") +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + # yapf: disable + parser.add_argument("--model_name_or_path", default=None, type=str, required=True,) + parser.add_argument("--train_data_dir", default=None, type=str, required=False,) + parser.add_argument("--train_label_path", default=None, type=str, required=False,) + parser.add_argument("--eval_data_dir", default=None, type=str, required=False,) + parser.add_argument("--eval_label_path", default=None, type=str, required=False,) + parser.add_argument("--use_vdl", default=False, type=bool, required=False,) + parser.add_argument("--output_dir", default=None, type=str, required=True,) + parser.add_argument("--max_seq_length", default=512, type=int,) + parser.add_argument("--evaluate_during_training", action="store_true",) + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",) + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.",) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.",) + parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.",) + parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.",) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) + # yapf: enable + args = parser.parse_args() + return args + + +def 
set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_label_maps(): + labels = ["O", "B-QUESTION", "B-ANSWER", "B-HEADER", "I-ANSWER", "I-QUESTION", "I-HEADER"] + label2id_map = {label: idx for idx, label in enumerate(labels)} + id2label_map = {idx: label for idx, label in enumerate(labels)} + return label2id_map, id2label_map + + +def train(args): + os.makedirs(args.output_dir, exist_ok=True) + logging.basicConfig( + filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, + ) + + ch = logging.StreamHandler() + ch.setLevel(logging.DEBUG) + logger.addHandler(ch) + + label2id_map, id2label_map = get_label_maps() + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + # dist mode + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) + base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) + model = LayoutXLMForTokenClassification(base_model, num_classes=len(label2id_map), dropout=None) + + # dist mode + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_dataset = XFUN( + tokenizer, + data_dir=args.train_data_dir, + label_path=args.train_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + pad_token_label_id=pad_token_label_id, + contains_re=False, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True + ) + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + + train_dataloader = paddle.io.DataLoader( + train_dataset, + batch_sampler=train_sampler, + num_workers=0, + use_shared_memory=True, + collate_fn=None, + ) + + t_total = len(train_dataloader) * args.num_train_epochs + + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, + args.warmup_steps, + start_lr=0, + end_lr=args.learning_rate, + ) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + ) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info( + " Total train batch size (w. 
parallel, distributed) = %d", + args.train_batch_size * paddle.distributed.get_world_size(), + ) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss = 0.0 + set_seed(args) + best_metrics = None + + for epoch_id in range(args.num_train_epochs): + for step, batch in enumerate(train_dataloader): + model.train() + outputs = model(**batch) + # model outputs are always tuple in ppnlp (see doc) + loss = outputs[0] + loss = loss.mean() + logger.info( + "[epoch {}/{}][iter: {}/{}] lr: {:.5f}, train loss: {:.5f}, ".format( + epoch_id, + args.num_train_epochs, + step, + len(train_dataloader), + lr_scheduler.get_lr(), + float(loss), + ) + ) + + loss.backward() + tr_loss += loss.item() + optimizer.step() + lr_scheduler.step() # Update learning rate schedule + optimizer.clear_grad() + global_step += 1 + + if paddle.distributed.get_rank() == 0 and args.eval_steps > 0 and global_step % args.eval_steps == 0: + # Log metrics + # Only evaluate when single GPU otherwise metrics may not average well + if paddle.distributed.get_rank() == 0 and args.evaluate_during_training: + results, _ = evaluate( + args, + model, + tokenizer, + label2id_map, + id2label_map, + pad_token_label_id, + ) + + if best_metrics is None or results["f1"] >= best_metrics["f1"]: + best_metrics = copy.deepcopy(results) + output_dir = os.path.join(args.output_dir, "best_model") + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + logger.info( + "[epoch {}/{}][iter: {}/{}] results: {}".format( + epoch_id, args.num_train_epochs, step, len(train_dataloader), results + ) + ) + if best_metrics is not None: + logger.info("best metrics: {}".format(best_metrics)) + + if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, prefix=""): + eval_dataset = XFUN( + tokenizer, + data_dir=args.eval_data_dir, + label_path=args.eval_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + pad_token_label_id=pad_token_label_id, + contains_re=False, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) + + eval_dataloader = paddle.io.DataLoader( + eval_dataset, + batch_size=args.eval_batch_size, + num_workers=0, + use_shared_memory=True, + collate_fn=None, + ) + + # Eval! 
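+ # Accumulate logits and gold label ids over the whole eval set, take the argmax, and map
+ # ids back to label strings (skipping pad positions) before computing the seqeval
+ # precision / recall / f1 and the classification report.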
+ logger.info("***** Running evaluation %s *****", prefix) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + model.eval() + for idx, batch in enumerate(eval_dataloader): + with paddle.no_grad(): + outputs = model(**batch) + tmp_eval_loss, logits = outputs[:2] + + tmp_eval_loss = tmp_eval_loss.mean() + + if paddle.distributed.get_rank() == 0: + logger.info( + "[Eval]process: {}/{}, loss: {:.5f}".format(idx, len(eval_dataloader), float(tmp_eval_loss)) + ) + + eval_loss += tmp_eval_loss.item() + nb_eval_steps += 1 + if preds is None: + preds = logits.numpy() + out_label_ids = batch["labels"].numpy() + else: + preds = np.append(preds, logits.numpy(), axis=0) + out_label_ids = np.append(out_label_ids, batch["labels"].numpy(), axis=0) + + eval_loss = eval_loss / nb_eval_steps + preds = np.argmax(preds, axis=2) + + # label_map = {i: label.upper() for i, label in enumerate(labels)} + + out_label_list = [[] for _ in range(out_label_ids.shape[0])] + preds_list = [[] for _ in range(out_label_ids.shape[0])] + + for i in range(out_label_ids.shape[0]): + for j in range(out_label_ids.shape[1]): + if out_label_ids[i, j] != pad_token_label_id: + out_label_list[i].append(id2label_map[out_label_ids[i][j]]) + preds_list[i].append(id2label_map[preds[i][j]]) + + results = { + "loss": eval_loss, + "precision": precision_score(out_label_list, preds_list), + "recall": recall_score(out_label_list, preds_list), + "f1": f1_score(out_label_list, preds_list), + } + + with open(os.path.join(args.output_dir, "test_gt.txt"), "w") as fout: + for lbl in out_label_list: + for l in lbl: + fout.write(l + "\t") + fout.write("\n") + with open(os.path.join(args.output_dir, "test_pred.txt"), "w") as fout: + for lbl in preds_list: + for l in lbl: + fout.write(l + "\t") + fout.write("\n") + + report = classification_report(out_label_list, preds_list) + logger.info("\n" + report) + + logger.info("***** Eval results %s *****", prefix) + for key in sorted(results.keys()): + logger.info(" %s = %s", key, str(results[key])) + + return results, preds_list + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + train(args) diff --git a/examples/multimodal/layoutxlm/run_xfun_ser.sh b/examples/multimodal/layoutxlm/run_xfun_ser.sh new file mode 100644 index 0000000000000000000000000000000000000000..43454abfc26491ef463a7138468a15587fb9fa7f --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_ser.sh @@ -0,0 +1,19 @@ +export CUDA_VISIBLE_DEVICES=0 + +python ./run_xfun_ser.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_length 512 \ + --train_data_dir "XFUND/zh_train/image" \ + --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ + --eval_data_dir "XFUND/zh_val/image" \ + --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ + --num_train_epochs 200 \ + --eval_steps 10 \ + --save_steps 500 \ + --output_dir "./output/ser/" \ + --learning_rate 5e-5 \ + --warmup_steps 50 \ + --per_gpu_train_batch_size 8 \ + --per_gpu_eval_batch_size 8 \ + --evaluate_during_training \ + --seed 2048 diff --git a/examples/multimodal/layoutxlm/xfun.py b/examples/multimodal/layoutxlm/xfun.py new file mode 100644 index 
0000000000000000000000000000000000000000..3bb5be92e91346b2e98fcad3b6ef1ecb2d56346c --- /dev/null +++ b/examples/multimodal/layoutxlm/xfun.py @@ -0,0 +1,410 @@ +import json +import os +import cv2 +import numpy as np +import paddle +import copy +from paddle.io import Dataset + +__all__ = ["XFUN"] + + +class XFUN(Dataset): + """ + Example: + print("=====begin to build dataset=====") + from paddlenlp.transformers import LayoutXLMTokenizer + tokenizer = LayoutXLMTokenizer.from_pretrained("/paddle/models/transformers/layoutxlm-base-paddle/") + tok_res = tokenizer.tokenize("Maribyrnong") + # res = tokenizer.convert_ids_to_tokens(val_data["input_ids"][0]) + dataset = XfunDatasetForSer( + tokenizer, + data_dir="./zh.val/", + label_path="zh.val/xfun_normalize_val.json", + img_size=(224,224)) + print(len(dataset)) + + data = dataset[0] + print(data.keys()) + print("input_ids: ", data["input_ids"]) + print("labels: ", data["labels"]) + print("token_type_ids: ", data["token_type_ids"]) + print("words_list: ", data["words_list"]) + print("image shape: ", data["image"].shape) + """ + + def __init__( + self, + tokenizer, + data_dir, + label_path, + contains_re=False, + label2id_map=None, + img_size=(224, 224), + pad_token_label_id=None, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + max_seq_len=512, + ): + super(XFUN, self).__init__() + self.tokenizer = tokenizer + self.data_dir = data_dir + self.label_path = label_path + self.contains_re = contains_re + self.label2id_map = label2id_map + self.img_size = img_size + self.pad_token_label_id = pad_token_label_id + self.add_special_ids = add_special_ids + self.return_attention_mask = return_attention_mask + self.load_mode = load_mode + self.max_seq_len = max_seq_len + + if self.pad_token_label_id is None: + self.pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + self.all_lines = self.read_all_lines() + + self.entities_labels = {"HEADER": 0, "QUESTION": 1, "ANSWER": 2} + self.return_keys = { + "bbox": "np", + "input_ids": "np", + "labels": "np", + "attention_mask": "np", + "image": "np", + "token_type_ids": "np", + "entities": "dict", + "relations": "dict", + } + + if load_mode == "all": + self.encoded_inputs_all = self._parse_label_file_all() + + def pad_sentences( + self, + encoded_inputs, + max_seq_len=512, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_token_type_ids=True, + truncation_strategy="longest_first", + return_overflowing_tokens=False, + return_special_tokens_mask=False, + ): + # Padding + needs_to_be_padded = pad_to_max_seq_len and max_seq_len and len(encoded_inputs["input_ids"]) < max_seq_len + + if needs_to_be_padded: + difference = max_seq_len - len(encoded_inputs["input_ids"]) + if self.tokenizer.padding_side == "right": + if return_attention_mask: + encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference + if return_token_type_ids: + encoded_inputs["token_type_ids"] = ( + encoded_inputs["token_type_ids"] + [self.tokenizer.pad_token_type_id] * difference + ) + if return_special_tokens_mask: + encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference + encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.tokenizer.pad_token_id] * difference + encoded_inputs["labels"] = encoded_inputs["labels"] + [self.pad_token_label_id] * difference + encoded_inputs["bbox"] = encoded_inputs["bbox"] + [[0, 0, 0, 0]] * difference + elif self.tokenizer.padding_side == "left": + if return_attention_mask: + 
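+ # Left padding: prepend zeros for the padded prefix, ones for the real tokens.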
encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"]) + if return_token_type_ids: + encoded_inputs["token_type_ids"] = [ + self.tokenizer.pad_token_type_id + ] * difference + encoded_inputs["token_type_ids"] + if return_special_tokens_mask: + encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"] + encoded_inputs["input_ids"] = [self.tokenizer.pad_token_id] * difference + encoded_inputs["input_ids"] + encoded_inputs["labels"] = [self.pad_token_label_id] * difference + encoded_inputs["labels"] + encoded_inputs["bbox"] = [[0, 0, 0, 0]] * difference + encoded_inputs["bbox"] + else: + if return_attention_mask: + encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + + return encoded_inputs + + def truncate_inputs(self, encoded_inputs, max_seq_len=512): + for key in encoded_inputs: + if key == "sample_id": + continue + length = min(len(encoded_inputs[key]), max_seq_len) + encoded_inputs[key] = encoded_inputs[key][:length] + return encoded_inputs + + def read_all_lines( + self, + ): + with open(self.label_path, "r") as fin: + lines = fin.readlines() + return lines + + def _parse_label_file_all(self): + """ + parse all samples + """ + encoded_inputs_all = [] + for line in self.all_lines: + encoded_inputs_all.extend(self._parse_label_file(line)) + return encoded_inputs_all + + def _parse_label_file(self, line): + """ + parse single sample + """ + + image_name, info_str = line.split("\t") + image_path = os.path.join(self.data_dir, image_name) + + def add_imgge_path(x): + x["image_path"] = image_path + return x + + encoded_inputs = self._read_encoded_inputs_sample(info_str) + if self.contains_re: + encoded_inputs = self._chunk_re(encoded_inputs) + else: + encoded_inputs = self._chunk_ser(encoded_inputs) + encoded_inputs = list(map(add_imgge_path, encoded_inputs)) + return encoded_inputs + + def _read_encoded_inputs_sample(self, info_str): + """ + parse label info + """ + # read text info + info_dict = json.loads(info_str) + height = info_dict["height"] + width = info_dict["width"] + + words_list = [] + bbox_list = [] + input_ids_list = [] + token_type_ids_list = [] + gt_label_list = [] + + if self.contains_re: + # for re + entities = [] + relations = [] + id2label = {} + entity_id_to_index_map = {} + empty_entity = set() + for info in info_dict["ocr_info"]: + if self.contains_re: + # for re + if len(info["text"]) == 0: + empty_entity.add(info["id"]) + continue + id2label[info["id"]] = info["label"] + relations.extend([tuple(sorted(l)) for l in info["linking"]]) + + # x1, y1, x2, y2 + bbox = info["bbox"] + label = info["label"] + bbox[0] = int(bbox[0] * 1000.0 / width) + bbox[2] = int(bbox[2] * 1000.0 / width) + bbox[1] = int(bbox[1] * 1000.0 / height) + bbox[3] = int(bbox[3] * 1000.0 / height) + + text = info["text"] + encode_res = self.tokenizer.encode( + text, pad_to_max_seq_len=False, return_token_type_ids=True, return_attention_mask=True + ) + + gt_label = [] + if not self.add_special_ids: + # TODO: use tok.all_special_ids to remove + encode_res["input_ids"] = encode_res["input_ids"][1:-1] + encode_res["token_type_ids"] = encode_res["token_type_ids"][1:-1] + encode_res["attention_mask"] = encode_res["attention_mask"][1:-1] + if label.lower() == "other": + gt_label.extend([0] * len(encode_res["input_ids"])) + else: + gt_label.append(self.label2id_map[("b-" + label).upper()]) + gt_label.extend([self.label2id_map[("i-" + label).upper()]] * (len(encode_res["input_ids"]) - 1)) + if self.contains_re: 
+ if gt_label[0] != self.label2id_map["O"]: + entity_id_to_index_map[info["id"]] = len(entities) + entities.append( + { + "start": len(input_ids_list), + "end": len(input_ids_list) + len(encode_res["input_ids"]), + "label": label.upper(), + } + ) + input_ids_list.extend(encode_res["input_ids"]) + token_type_ids_list.extend(encode_res["token_type_ids"]) + bbox_list.extend([bbox] * len(encode_res["input_ids"])) + gt_label_list.extend(gt_label) + words_list.append(text) + + encoded_inputs = { + "input_ids": input_ids_list, + "labels": gt_label_list, + "token_type_ids": token_type_ids_list, + "bbox": bbox_list, + "attention_mask": [1] * len(input_ids_list), + # "words_list": words_list, + } + encoded_inputs = self.pad_sentences( + encoded_inputs, max_seq_len=self.max_seq_len, return_attention_mask=self.return_attention_mask + ) + encoded_inputs = self.truncate_inputs(encoded_inputs) + + if self.contains_re: + relations = self._relations(entities, relations, id2label, empty_entity, entity_id_to_index_map) + encoded_inputs["relations"] = relations + encoded_inputs["entities"] = entities + return encoded_inputs + + def _chunk_ser(self, encoded_inputs): + encoded_inputs_all = [] + seq_len = len(encoded_inputs["input_ids"]) + chunk_size = 512 + for chunk_id, index in enumerate(range(0, seq_len, chunk_size)): + chunk_beg = index + chunk_end = min(index + chunk_size, seq_len) + encoded_inputs_example = {} + for key in encoded_inputs: + encoded_inputs_example[key] = encoded_inputs[key][chunk_beg:chunk_end] + + encoded_inputs_all.append(encoded_inputs_example) + return encoded_inputs_all + + def _chunk_re(self, encoded_inputs): + # prepare data + entities = encoded_inputs.pop("entities") + relations = encoded_inputs.pop("relations") + encoded_inputs_all = [] + chunk_size = 512 + for chunk_id, index in enumerate(range(0, len(encoded_inputs["input_ids"]), chunk_size)): + item = {} + for k in encoded_inputs: + item[k] = encoded_inputs[k][index : index + chunk_size] + + # select entity in current chunk + entities_in_this_span = [] + global_to_local_map = {} # + for entity_id, entity in enumerate(entities): + if index <= entity["start"] < index + chunk_size and index <= entity["end"] < index + chunk_size: + entity["start"] = entity["start"] - index + entity["end"] = entity["end"] - index + global_to_local_map[entity_id] = len(entities_in_this_span) + entities_in_this_span.append(entity) + + # select relations in current chunk + relations_in_this_span = [] + for relation in relations: + if ( + index <= relation["start_index"] < index + chunk_size + and index <= relation["end_index"] < index + chunk_size + ): + relations_in_this_span.append( + { + "head": global_to_local_map[relation["head"]], + "tail": global_to_local_map[relation["tail"]], + "start_index": relation["start_index"] - index, + "end_index": relation["end_index"] - index, + } + ) + item.update( + { + "entities": reformat(entities_in_this_span), + "relations": reformat(relations_in_this_span), + } + ) + item["entities"]["label"] = [self.entities_labels[x] for x in item["entities"]["label"]] + encoded_inputs_all.append(item) + return encoded_inputs_all + + def _relations(self, entities, relations, id2label, empty_entity, entity_id_to_index_map): + """ + build relations + """ + relations = list(set(relations)) + relations = [rel for rel in relations if rel[0] not in empty_entity and rel[1] not in empty_entity] + kv_relations = [] + for rel in relations: + pair = [id2label[rel[0]], id2label[rel[1]]] + if pair == ["question", "answer"]: + 
kv_relations.append({"head": entity_id_to_index_map[rel[0]], "tail": entity_id_to_index_map[rel[1]]}) + elif pair == ["answer", "question"]: + kv_relations.append({"head": entity_id_to_index_map[rel[1]], "tail": entity_id_to_index_map[rel[0]]}) + else: + continue + relations = sorted( + [ + { + "head": rel["head"], + "tail": rel["tail"], + "start_index": get_relation_span(rel, entities)[0], + "end_index": get_relation_span(rel, entities)[1], + } + for rel in kv_relations + ], + key=lambda x: x["head"], + ) + return relations + + def load_img(self, image_path): + # read img + img = cv2.imread(image_path) + img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) + resize_h, resize_w = self.img_size + im_shape = img.shape[0:2] + im_scale_y = resize_h / im_shape[0] + im_scale_x = resize_w / im_shape[1] + img_new = cv2.resize(img, None, None, fx=im_scale_x, fy=im_scale_y, interpolation=2) + mean = np.array([0.485, 0.456, 0.406])[np.newaxis, np.newaxis, :] + std = np.array([0.229, 0.224, 0.225])[np.newaxis, np.newaxis, :] + img_new = img_new / 255.0 + img_new -= mean + img_new /= std + img = img_new.transpose((2, 0, 1)) + return img + + def __getitem__(self, idx): + if self.load_mode == "all": + data = copy.deepcopy(self.encoded_inputs_all[idx]) + else: + data = self._parse_label_file(self.all_lines[idx])[0] + + image_path = data.pop("image_path") + data["image"] = self.load_img(image_path) + + return_data = {} + for k, v in data.items(): + if k in self.return_keys: + if self.return_keys[k] == "np": + v = np.array(v) + return_data[k] = v + return return_data + + def __len__( + self, + ): + if self.load_mode == "all": + return len(self.encoded_inputs_all) + else: + return len(self.all_lines) + + +def get_relation_span(rel, entities): + bound = [] + for entity_index in [rel["head"], rel["tail"]]: + bound.append(entities[entity_index]["start"]) + bound.append(entities[entity_index]["end"]) + return min(bound), max(bound) + + +def reformat(data): + new_data = {} + for item in data: + for k, v in item.items(): + if k not in new_data: + new_data[k] = [] + new_data[k].append(v) + return new_data diff --git a/examples/multimodal/minigpt4/README.md b/examples/multimodal/minigpt4/README.md new file mode 100644 index 0000000000000000000000000000000000000000..48c9f73840762b0575b2a52bd537278abac45c43 --- /dev/null +++ b/examples/multimodal/minigpt4/README.md @@ -0,0 +1,47 @@ +# MiniGPT4 + +## 1. 模型简介 + +MiniGPT4 是一个具有图像理解能力的开源模型,其基于 Vicuna 大语言模型 以及 BLIP-2 中的VIT和Qformer模块进行训练,使得MiniGPT4 拥有类似于GPT4的非凡能力,例如详细的图像描述生成和从手写草稿创建网站。 此外 MiniGPT4 还具备一些的其他新的功能,包括根据给定图像写故事和诗歌,为图像中显示的问题提供解决方案,教用户如何根据食物照片做饭等。下图展示了MiniGPT4的模型结构, 更多信息请参考[MiniGPT4](https://arxiv.org/abs/2304.10592)。 + +<center><img src="https://github.com/PaddlePaddle/Paddle/assets/35913314/f0306cb6-4837-4f52-8f57-a0e7e35238f6" /></center> + + +## 2. 获取MiniGPT4 权重以及相关配置 +这里可以分两步:1. 获取MiniGPT4权重;2. 获取相关配置,包括模型参数说明以及tokenizer相关文件等。 +### 2.1 获取MiniGPT4权重 +目前需要用户手动下载MiniGPT4权重和并转换为相应的 Paddle 版权重,为方便转换,本项目提供了相应的操作说明和转换脚本,详情请参考[MiniGPT4 权重下载和转换说明](./paddle_minigpt4_instrction.md)。 + +### 2.2 获取相关配置 +下载相关的配置文件,这里提供了两版配置文件,请根据你的需要,点击下载即可。 +| files Aligned with MiniGPT4-7B | files Aligned with MiniGPT4-13B | +:-------------------------------------:|:-----------------------------------: + [Download](https://paddlenlp.bj.bcebos.com/models/community/minigpt4-7b/minigpt4_7b.tar.gz)|[Download](https://paddlenlp.bj.bcebos.com/models/community/minigpt4-13b/minigpt4_13b.tar.gz) | + + +下载之后进行解压,请将其中相关文件放至 与 MiniGPT4 权重相同的目录中。 + + +## 3. 
模型预测 +在下载和转换好上述模型权重之后,可执行以下命令进行模型预测。其中参数 `pretrained_name_or_path` 用于指定 MiniGPT4 的保存目录。 + +``` +python run_predict.py \ + -- pretrained_name_or_path "your minigpt4 path" + +``` + +下图这个示例展示了在使用MiniGPT-7b时的效果: + +输入图片:<center><img src="https://github.com/PaddlePaddle/Paddle/assets/35913314/d8070644-4713-465d-9c7e-9585024c1819" /></center> + +输入文本:“describe this image” + +输出: +``` +The image shows two mugs with cats on them, one is black and white and the other is blue and white. The mugs are sitting on a table with a book in the background. The mugs have a whimsical, cartoon-like appearance. The cats on the mugs are looking at each other with a playful expression. The overall mood of the image is lighthearted and fun.### +``` + + +## Reference +- [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://minigpt-4.github.io/) diff --git a/examples/multimodal/minigpt4/merge_weight.py b/examples/multimodal/minigpt4/merge_weight.py new file mode 100644 index 0000000000000000000000000000000000000000..8f74d7c6a960520be922ee00ca69d88b3fcc3fe0 --- /dev/null +++ b/examples/multimodal/minigpt4/merge_weight.py @@ -0,0 +1,88 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +os.environ["CUDA_VISIBLE_DEVICES"] = "0" +os.environ["FLAGS_use_cuda_managed_memory"] = "true" + +import paddle +import torch + +from paddlenlp.transformers import LlamaForCausalLM + + +def merge(args): + model_dict = {} + # load the first item: blip2-flan-t5-xxl + state_dict = paddle.load(args.blip2_path) + for n, p in state_dict.items(): + if n.startswith("vision_model") or n.startswith("qformer") or n == "query_tokens": + model_dict[n] = p + print("[1/3] load ViT, qformer and query_tokens from blip2-flan-t5-xxl done!") + + # load the second item: vicuna + llama_model = LlamaForCausalLM.from_pretrained(args.vicuna_path) + + for n, p in llama_model.named_parameters(): + new_name = "language_model." 
+ n + model_dict[new_name] = p + print("[2/3] load vicuna(llama typel) done!") + + # load the third item: minigpt4 + minigpt4_state_dict = torch.load(args.minigpt4_path) + for n, p in minigpt4_state_dict["model"].items(): + if n.startswith("llama_model.model"): + new_name = n.replace("llama_model.model", "language_model.llama") + new_p = paddle.to_tensor(p.cpu().numpy()) + model_dict[new_name] = new_p + + if n.startswith("llama_proj"): + new_name = n.replace("llama_proj", "language_projection") + if n.endswith("weight"): + new_p = paddle.to_tensor(p.cpu().numpy()).transpose([1, 0]) + else: + new_p = paddle.to_tensor(p.cpu().numpy()) + model_dict[new_name] = new_p + + print("[3/3] load language_projection, some llama weights from minigpt4 done!") + + save_path = os.path.join(args.save_path, "model_state.pdparams") + paddle.save(model_dict, save_path) + print("The checkpoint of minigpt4 has been saved to :{}".format(save_path)) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + parser.add_argument("--blip2_path", default="/blip2/dirname", type=str, help="The dir name of blip2-flan-t5-xxl.") + parser.add_argument("--vicuna_path", default="/vicuna/dirname", type=str, help="The dir name of vicuna.") + parser.add_argument( + "--minigpt4_path", default="/minigpt4/prerained_minigpt4.pth", type=str, help="The checkpoint path of vicuna." + ) + parser.add_argument("--save_path", default="/save/to/dirname", type=str, help="The saving path of minigpt4.") + args = parser.parse_args() + + args.blip2_path = os.path.join(args.blip2_path, "model_state.pdparams") + if not os.path.exists(args.blip2_path): + raise ValueError("Not found the file: {}".format(args.blip2_path)) + if not os.path.isdir(args.vicuna_path): + raise ValueError("It is not a directory: {}".format(args.vicuna_path)) + if not os.path.exists(args.minigpt4_path): + raise ValueError("Not found the file: {}".format(args.minigpt4_path)) + if not os.path.exists(args.save_path): + os.makedirs(args.save_path) + + merge(args) diff --git a/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md b/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md new file mode 100644 index 0000000000000000000000000000000000000000..7b84aea48bd7c6e1b6c5c55d10d77ef0e1509500 --- /dev/null +++ b/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md @@ -0,0 +1,117 @@ +# 获取和转换 Paddle 版 MiniGPT4 权重 + +## 1. 
准备 MiniGPT4 中所有模块的权重 + +你需要下载3个权重,以获取最终 MiniGPT4的权重,分别是: +- Pretrained MiniGPT-4 +- Vicuna Weight +- Blip2 Weight + +### 1.1 下载 MiniGPT4 的预训练权重 + +根据你准备的Vicuna模型版本,下载预训练的MiniGPT4 权重。 + +| Checkpoint Aligned with Vicuna 7B | Checkpoint Aligned with Vicuna 13B | +:-------------------------------------:|:-----------------------------------: +[Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) | [Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) + +### 1.2准备 ViT and Qformer 权重 +MiniGPT4中使用的ViT和Qformer Weight来自blip2-flan-t5-xxl,这个weight在PaddleNLP中进行了转换。 所以你可以从 PaddleNLP 下载它,你有两种下载方式进行下载: + +#### 1.2.1 通过 paddlenlp 方式加载 +直接通过paddlenlp的模型加载方法进行下载,下载后一般会存入 `PPNLP_HOME` 指定的目录。 + +```python +import os +os.environ["CUDA_VISIBLE_DEVICES"]="0" + +import paddle +from paddlenlp.transformers import Blip2Model, Blip2VisionModel, Blip2VisionConfig, Blip2QFormerConfig, Blip2QFormerModel + +Blip2Model.from_pretrained("Salesforce/blip2-flan-t5-xxl") +``` + +#### 1.2.2 直接点击下载 +可以直接进行点击下载: + +| blip2-flan-t5-xxl 权重 | 点击下载 | +:-------------------------------------:|:-----------------------------------: +| model_state.pdparams | [Download](https://paddlenlp.bj.bcebos.com/models/community/Salesforce/blip2-flan-t5-xxl/model_state.pdparams) | + +### 1.3 准备 Vicuna 权重 + +这里需要下载两个权重:Vicuna delta Weight和huggingface-formated Llama Weight。 然后你应该结合这两个重量来获得可以使用的Vicuna 权重。 + +#### 1.3.1 下载 Vicuna delta 权重 + +这里展示两种Vicuna delta 权重,请根据需要选择一种并点击下载。 + +| vicuna-7b-delta-v0 | vicuna-13b-delta-v0 | +:-------------------------------------:|:-----------------------------------: + [Download](https://huggingface.co/lmsys/vicuna-7b-delta-v0/tree/main) | [Download](https://huggingface.co/lmsys/vicuna-13b-delta-v0g) + +#### 1.3.2 根据以上选择的vicuna delta 权重,下载 相应的 llama 权重。 + +| llama-7b | llama-13b | +:-------------------------------------:|:-----------------------------------: + [Download](https://huggingface.co/decapoda-research/llama-7b-hf/tree/main) | [Download](https://huggingface.co/decapoda-research/llama-13b-hf) + + +#### 1.3.3 结合上面的两个权重,得到可以使用的 vicuna 权重 +- 为组合如上两个权重,请安装以下工具: + +```shell +pip install git+https://github.com/lm-sys/FastChat.git@v0.1.10 +``` +- 运行以下命令,获取最终可用的vicuna 权重 + +```shell +python -m fastchat.model.apply_delta --base /path/to/llama-13bOR7b-hf/ --target /path/to/save/working/vicuna-13b/weight/ --delta /path/to/vicuna-13bOR7b-delta-v0/ +``` + +## 2. 将多个 pytorch 子权重文件合并为一个权重文件 + +Pytorch版的权重文件可能是由多个子权重文件组合而成,为使用PaddleNLP进行加载并自动转换为Paddle版,需要将其合并为一个文件: + +### 2.1 下载MiniGPT库 +在开始之前,请确保已经下载了 [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4.git) 库: + +``` +git clone https://github.com/Vision-CAIR/MiniGPT-4.git +``` + +### 2.2 获取完整的 vicuna 权重 +进入到MiniGPT4文件夹,执行以下代码,获取完整的 vicuna 权重文件: +```python +import argparse +import os +os.environ["CUDA_VISIBLE_DEVICES"]="0" +os.environ["FLAGS_use_cuda_managed_memory"]="true" + +import torch +from minigpt4.models.modeling_llama import LlamaForCausalLM + +llama_model = LlamaForCausalLM.from_pretrained("/path/to/save/working/vicuna-13b/") +torch.save(llama_model.state_dict(), "/path/to/save/working/vicuna-13b/pytorch_model.bin") +``` + +## 3. 
合并以上所有权重,获取最终的 Paddle 版 MiniGPT4 权重 +这里提供了一个合并以上权重的脚本,你可以通过设置相关权重路径 以获取最终的 MiniGPT4 权重。 + +```shell +python merge_weight.py \ + --blip2_path "your dir name of blip2" \ + --vicuna_path "your dir name of vicuna" \ + --minigpt4_path "your ckpt path of minigpt4" \ + --save_path "your dir name saving the final minigpt4" +``` + +**参数说明**: +- `blip2_path`: 存放 blip2 权重的目录名 +- `vicuna_path`: 存放 vicuna_path 权重的目录名 +- `minigpt4_path`: 存放 blip2 权重的文件地址,比如./prerained_minigpt4_7b.pth +- `save_path`: 保存 Paddle 版 MiniGPT3 权重的目录名 + +## 3. More Reference + +- [MiniGPT Official Site](https://github.com/Vision-CAIR/MiniGPT-4) diff --git a/examples/multimodal/minigpt4/run_predict.py b/examples/multimodal/minigpt4/run_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..4b36089f3c91a8fdf1340b966b86150dab110c9a --- /dev/null +++ b/examples/multimodal/minigpt4/run_predict.py @@ -0,0 +1,68 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +os.environ["CUDA_VISIBLE_DEVICES"] = "0" +os.environ["FLAGS_use_cuda_managed_memory"] = "true" +import requests +from PIL import Image + +from paddlenlp.transformers import MiniGPT4ForConditionalGeneration, MiniGPT4Processor + + +def predict(args): + # load MiniGPT4 moel and processor + model = MiniGPT4ForConditionalGeneration.from_pretrained(args.pretrained_name_or_path) + model.eval() + processor = MiniGPT4Processor.from_pretrained(args.pretrained_name_or_path) + print("load processor and model done!") + + # prepare model inputs for MiniGPT4 + url = "https://paddlenlp.bj.bcebos.com/data/images/mugs.png" + image = Image.open(requests.get(url, stream=True).raw) + + text = "describe this image" + prompt = "Give the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. 
Please answer my questions.###Human: <Img><ImageHere></Img> <TextHere>###Assistant:" + inputs = processor([image], text, prompt) + + # generate with MiniGPT4 + # breakpoint + generate_kwargs = { + "max_length": 300, + "num_beams": 1, + "top_p": 1.0, + "repetition_penalty": 1.0, + "length_penalty": 0, + "temperature": 1, + "decode_strategy": "greedy_search", + "eos_token_id": [[835], [2277, 29937]], + } + outputs = model.generate(**inputs, **generate_kwargs) + msg = processor.batch_decode(outputs[0]) + print("Inference result: ", msg) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pretrained_name_or_path", + default="your directory of minigpt4", + type=str, + help="The dir name of minigpt4 checkpoint.", + ) + args = parser.parse_args() + + predict(args) diff --git a/examples/question_generation/README.md b/examples/question_generation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..796b415797b2005bc94c152fba609f49dad28937 --- /dev/null +++ b/examples/question_generation/README.md @@ -0,0 +1,20 @@ +# 问题生成 + +Question Generation(QG),即问题生成,指的是给定一段上下文和答案,自动生成一个流畅且符合上下文主题的问句。问题生成技术在教育、咨询、搜索、问答等多个领域均有着巨大的应用价值。 + +PaddleNLP提供英文和中文问题生成任务示例,分别基于英文预训练语言模型[t5](./t5)和中文预训练语言模型[unimo-text](./unimo-text)。 + + +## 英文 + +[t5](./t5) 展示了如何使用英文预训练模型T5完成问题生成任务,支持模型微调预测评估,并提供相关预训练模型。 + +## 中文 + +[unimo-text](./unimo-text) 展示了如何使用中文预训练模型UNIMO-Text完成问题生成任务,提供数据准备、训练、预测、推理部署全流程定制化训练,并提供相关预训练模型。 + +# 参考文献 + +1. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), pp.1-67. + +2. Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). diff --git a/examples/question_generation/t5/README.md b/examples/question_generation/t5/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0b0578d3cb58bd26b528912bca5d6f2fc8309801 --- /dev/null +++ b/examples/question_generation/t5/README.md @@ -0,0 +1,208 @@ +# 问题生成(Question Generation) + +## 简介 + +Question Generation(QG),即问题生成,指的是给定一段上下文(passage或sentence),自动生成一个流畅且符合上下文主题的问句。问题生成通常可以分为两个分支,即无答案问题生成(answer-agnostic question generation)和有答案问题生成(answer-aware question generation)。 + +本项目是T5在 PaddlePaddle上开源实现的有答案问题生成的例子,包含了在SQuAD数据集上微调和生成的代码。 + +## 快速开始 + +### 环境依赖 + +- nltk +- evaluate + + +安装方式:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. +├── finetune.py # 模型微调主程序入口 +├── generate.py # 模型生成主程序入口 +├── utils.py # 定义参数及一些工具函数 +├── requirements.txt # 环境依赖文件 +└── README.md # 文档说明 +``` + +### 数据准备 + +#### 数据加载 +**SQuAD**(Stanford Question Answering Dataset)数据集是一个英文问答数据集,现有的问题生成研究主要在该数据集上进行评价。**SQuAD**中的数据由段落、问题、答案3个主要部分组成,其中段落从维基百科中获取,问题和答案通过众包的方式由人工标注。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了Squad数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_set, dev_set, test_set = load_dataset("squad", splits=["train_v1", "dev_v1"]) +``` + +#### 数据处理 +针对**SQuAD**数据集,我们需要将QA任务格式的数据进行转换从而得到text2text形式的数据,默认构造方式如下,其他形式输入数据用户可以在convert_example函数中自行定义 +```text +answer: {answer_text} context: {context_text} +question: {question_text} +``` +具体案例如下, +```text +answer: the Miller–Rabin primality test context: The property of being prime (or not) is called primality. A simple but slow method of verifying the primality of a given number n is known as trial division. 
It consists of testing whether n is a multiple of any integer between 2 and . Algorithms much more efficient than trial division have been devised to test the primality of large numbers. These include the Miller–Rabin primality test, which is fast but has a small probability of error, and the AKS primality test, which always produces the correct answer in polynomial time but is too slow to be practical. Particularly fast methods are available for numbers of special forms, such as Mersenne numbers. As of January 2016[update], the largest known prime number has 22,338,618 decimal digits. + +question: What is the name of the process which confirms the primality of a number n? +``` + +### 模型训练 + +运行如下命令即可在训练集上进行finetune,并在验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus 1,2 train.py \ + --model_name_or_path=t5-base \ + --dataset_name=squad \ + --output_dir=output \ + --max_source_length=1024 \ + --max_target_length=142 \ + --learning_rate=1e-4 \ + --num_train_epochs=6 \ + --logging_steps=100 \ + --save_steps=1000 \ + --seed=42 \ + --train_batch_size=4 \ + --eval_batch_size=64 \ + --warmup_proportion=0.1 \ + --ignore_pad_token_for_loss \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU + +- `model_name_or_path` 指示了finetune使用的预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | t5-base | + | t5-large | + +- `dataset_name` 表示训练的数据集。 + +- `output_dir` 表示模型的保存路径。 + +- `max_source_length` 表示输入序列的长度,超过该长度将被截断。 + +- `max_target_length` 表示输出的最大长度。 + +- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 + +- `num_train_epochs` 表示训练轮数。 + +- `epochs` 表示训练轮数。 + +- `logging_steps` 表示日志打印间隔。 + +- `save_steps` 表示模型保存及评估间隔。 + +- `seed` 表示随机数生成器的种子。 + +- `train_batch_size` 表示训练每张卡上的样本数目。 + +- `eval_batch_size` 表示预测单卡上的样本数目。 + +- `warmup_proportion` 表示warmup_steps所占总步数的比例。学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数。 + +- `device` 表示使用的设备。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。如: + +```text +./output/ +├── t5_model_1000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── special_tokens_map.json +│ ├── spiece.model +│ └── tokenizer_config.json +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,只需指定`model_name_or_path`为本地微调模型的路径即可。 + +### 模型预测 + +运行如下命令即可在验证集上进行测试 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python predict.py \ + --model_name_or_path=./checkpoints/model_xx/ \ + --dataset_name=squad \ + --output_path=generate.txt \ + --max_source_length=1024 \ + --max_target_length=142 \ + --decode_strategy=greedy_search \ + --top_k=2 \ + --top_p=1.0 \ + --num_beams=1 \ + --length_penalty=0.0 \ + --batch_size=64 \ + --seed=42 \ + --ignore_pad_token_for_loss \ + --logging_steps=20 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | t5-base | + | t5-large | + | mrm8488/t5-base-finetuned-question-generation-ap | + +- `dataset_name` 表示预测的数据集。 + +- `output_path` 表示预测结果的保存路径。 + +- `max_source_length` 表示输入序列的长度,超过该长度将被截断。 + +- `max_target_length` 表示输出的最大长度。 + +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 + +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 + +- `top_p` 表示采用"sampling"解码策略时,从词表中采样并选择概率之和大于给定阈值`top_p`的token。 + +- `num_beams` 表示besm search的beam size。 + +- `length_penalty` 表示besm search生成长度的指数惩罚。 + +- `batch_size` 表示每次迭代**单卡**上的样本数目。 + +- `seed` 表示随机数生成器的种子。 + +- `logging_steps` 表示日志打印间隔。 + +- `device` 表示使用的设备。 + +程序运行结束后会将预测生成的问题保存在`output_path`中。同时终端中会输出评估结果。 + +采用社区微调模型mrm8488/t5-base-finetuned-question-generation-ap在验证集上有如下结果: + +| model_name_or_path | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | +| :----------------------: | :-------------: | :-------------: |:-------------: |:-------------: | +| [mrm8488/t5-base-finetuned-question-generation-ap](https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap ) | 50.11 | 35.83 | 27.68 | 22.03 | + + + + +## 参考文献 +1. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), pp.1-67. diff --git a/examples/question_generation/t5/predict.py b/examples/question_generation/t5/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..7931059b887c38865489602f060bb5ba15894c94 --- /dev/null +++ b/examples/question_generation/t5/predict.py @@ -0,0 +1,160 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
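+# predict.py: generate questions on the SQuAD dev set with a fine-tuned T5 model.
+# Predictions are written to --output_path, references to <output_path>.reference.txt,
+# and BLEU-1..4 scores are printed to stdout.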
+import argparse +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from utils import compute_metrics, convert_example + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument("--model_name_or_path", default="t5-base", type=str, required=True, help="Path to pre-trained model. ") + parser.add_argument("--dataset_name", default="squad", type=str, required=True, help="The name of the dataset to use. Selected in the list: " + "squad") + parser.add_argument('--output_path', type=str, default='generate.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--max_source_length", default=1024, type=int, help="The maximum total input sequence length after tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.",) + parser.add_argument("--min_target_length", default=0, type=int, help="The minimum total sequence length for target text when generating. ") + parser.add_argument("--max_target_length", default=142, type=int, help="The maximum total sequence length for target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded during ``evaluate`` and ``predict``.",) + parser.add_argument('--decode_strategy', default='greedy_search', type=str, help='The decode strategy in generation.') + parser.add_argument('--top_k', default=2, type=int, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--top_p', default=1.0, type=float, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', default=1, type=int, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', default=0.6, type=float, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', default=False, type=eval, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument('--faster', action='store_true', help='Whether to process inference using FastGeneration. ') + parser.add_argument('--use_fp16_decoding', action='store_true', help='Whether to use fp16 when using FastGeneration. Only works when using FastGeneration. 
') + parser.add_argument("--batch_size", default=64, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--is_debug", action='store_true', help="Whether to debug.") + parser.add_argument("--ignore_pad_token_for_loss", action='store_true', help="Whether to ignore the tokens corresponding to padded labels in the loss computation or not.") + + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + set_seed(args) + tokenizer = T5Tokenizer.from_pretrained(args.model_name_or_path) + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + dataset = load_dataset(args.dataset_name, splits=["dev_v1"]) + # dataset = load_dataset(args.dataset_name, splits=["dev_v2"]) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + decoder_start_token_id=model.t5.bos_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + is_train=False) + + def batchify_fn(samples, tokenizer): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # mem_seq_lens + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # decoder_input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # labels + ) + return fn(samples) + + dataset = dataset.map(trans_func, lazy=True) + + # debug + if args.is_debug: + dataset.data = dataset.data[:20] + dataset.new_data = dataset.new_data[:20] + + batch_sampler = BatchSampler(dataset, + batch_size=args.batch_size, + shuffle=False) + data_loader = DataLoader(dataset=dataset, + batch_sampler=batch_sampler, + num_workers=0, + collate_fn=batchify_fn, + return_list=True) + data_loader.pin_memory = False + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in enumerate(data_loader): + input_ids, _, mem_seq_lens, _, labels = batch + preds, _ = model.generate(input_ids=input_ids, + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + use_fast=args.faster) + total_time += (time.time() - start_time) + if step % args.logging_steps == 0: + print('step %d - %.3fs/step' % + (step, total_time / args.logging_steps)) + total_time = 0.0 + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + start_time = time.time() + + bleu_result, decoded_preds, decoded_labels = compute_metrics( + 
all_preds, all_labels, tokenizer, args.ignore_pad_token_for_loss) + print("BLEU result: ", bleu_result) + with open(args.output_path, 'w', encoding='utf-8') as fout: + for decoded_pred in decoded_preds: + fout.write(' '.join(decoded_pred) + '\n') + print('Save generated result into: %s' % args.output_path) + with open(args.output_path + '.reference.txt', 'w', + encoding='utf-8') as fout: + for decoded_label in decoded_labels: + fout.write(' '.join(decoded_label) + '\n') + print('Save referenced labels into: {}.reference.txt'.format(args.output_path)) + + +if __name__ == '__main__': + args = parse_args() + pprint(args) + generate(args) diff --git a/examples/question_generation/t5/requirements.txt b/examples/question_generation/t5/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..40abc64257a3f4627085af9e6c851de3553f4aef --- /dev/null +++ b/examples/question_generation/t5/requirements.txt @@ -0,0 +1,2 @@ +nltk==3.6.2 +evaluate==0.2.2 \ No newline at end of file diff --git a/examples/question_generation/t5/train.py b/examples/question_generation/t5/train.py new file mode 100644 index 0000000000000000000000000000000000000000..cdd124370fff788b624a039897d60278425f4551 --- /dev/null +++ b/examples/question_generation/t5/train.py @@ -0,0 +1,236 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from tqdm import tqdm +from utils import compute_metrics, convert_example + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + T5ForConditionalGeneration, + T5Tokenizer, +) +from paddlenlp.utils.log import logger + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument("--model_name_or_path", default="t5-base", type=str, required=True, help="Path to pre-trained model. ") + parser.add_argument("--dataset_name", default="squad", type=str, required=True, help="The name of the dataset to use. Selected in the list: " + "squad") + parser.add_argument("--output_dir", default="output", type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_source_length", default=1024, type=int, help="The maximum total input sequence length after tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.",) + parser.add_argument("--min_target_length", default=0, type=int, help="The minimum total sequence length for target text when generating. 
") + parser.add_argument("--max_target_length", default=142, type=int, help="The maximum total sequence length for target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. during ``evaluate`` and ``predict``.",) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument("--train_batch_size", default=20, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--eval_batch_size", default=12, type=int, help="Batch size per GPU/CPU for evaluation.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion") + parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + parser.add_argument("--use_amp", default=False, type=strtobool, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", default=2**15, type=float, help="The value of scale_loss for fp16.") + parser.add_argument("--ignore_pad_token_for_loss", action='store_true', help="Whether to ignore the tokens corresponding to padded labels in the loss computation or not.") + + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, tokenizer, ignore_pad_token_for_loss, + min_target_length, max_target_length): + model.eval() + all_preds = [] + all_labels = [] + model = model._layers if isinstance(model, paddle.DataParallel) else model + for batch in tqdm(data_loader, total=len(data_loader), desc="Eval step"): + input_ids, _, _, labels = batch + preds = model.generate(input_ids=input_ids, + min_length=min_target_length, + max_length=max_target_length, + use_cache=True)[0] + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + bleu_result, decoded_preds, decoded_labels = compute_metrics( + all_preds, all_labels, tokenizer, ignore_pad_token_for_loss) + logger.info(bleu_result) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + tokenizer = T5Tokenizer.from_pretrained(args.model_name_or_path) + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + decoder_start_token_id=model.t5.bos_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss) + logger.info("Loading train and dev dataset: %s" % args.dataset_name) + train_set, dev_set = load_dataset(args.dataset_name, + splits=["train_v1", "dev_v1"]) + logger.info("Loaded train and dev dataset: %s" % args.dataset_name) + train_set = train_set.map(trans_func, lazy=True) + train_batch_sampler = DistributedBatchSampler( + train_set, batch_size=args.train_batch_size, shuffle=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # decoder_input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # labels + ): fn(samples) + train_data_loader = DataLoader(dataset=train_set, + batch_sampler=train_batch_sampler, + num_workers=0, + collate_fn=batchify_fn, + return_list=True) + dev_set = dev_set.map(trans_func, lazy=True) + dev_batch_sampler = BatchSampler(dev_set, + batch_size=args.eval_batch_size, + shuffle=False) + dev_data_loader = DataLoader(dataset=dev_set, + batch_sampler=dev_batch_sampler, + num_workers=0, + collate_fn=batchify_fn, + return_list=True) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else ( + len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
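+    # `apply_decay_param_fun` below receives each parameter's name and applies weight
+    # decay only when the name is in this list (i.e. not a bias/LayerNorm parameter).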
+ decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + for epoch in tqdm(range(args.num_train_epochs), desc="Epoch"): + for step, batch in tqdm(enumerate(train_data_loader), + desc="Train step", + total=len(train_data_loader)): + global_step += 1 + input_ids, attention_mask, decoder_input_ids, labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu"]): + output = model(input_ids, + attention_mask, + decoder_input_ids, + labels=labels) + loss = output[0] + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.minimize(optimizer, scaled_loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % (global_step, num_training_steps, epoch, step, + paddle.distributed.get_rank(), loss, optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train))) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate(model, dev_data_loader, tokenizer, + args.ignore_pad_token_for_loss, args.min_target_length, + args.max_target_length) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "t5_model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance( + model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, + "t5_model_final_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance( + model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + do_train(args) diff --git a/examples/question_generation/t5/utils.py b/examples/question_generation/t5/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b5ee191b669319b392e2fdaac30c44a91945e9c2 --- /dev/null +++ b/examples/question_generation/t5/utils.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import evaluate +import nltk +import numpy as np + +from paddlenlp.metrics import BLEU + + +def convert_example( + example, + tokenizer, + decoder_start_token_id, + max_source_length, + max_target_length, + ignore_pad_token_for_loss=True, + is_train=True, +): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + context = example["context"] + question = example["question"] + try: + answer = example["answers"][0] + except: + print(example["context"]) + print(example["question"]) + print(example["answers"]) + print(example["answer_starts"]) + print(example["is_impossible"]) + + input_seq = f"answer: {answer} context: {context} </s>" + output_seq = f"question: {question} </s>" + + outputs = tokenizer( + output_seq, + max_seq_len=max_target_length, + pad_to_max_seq_len=True, + truncation_strategy="longest_first", + ) + + output_ids = [decoder_start_token_id] + outputs["input_ids"][:-1] + + if ignore_pad_token_for_loss: + # Replace all tokenizer.pad_token_id in the outputs by -100 when we want to ignore padding in the loss. 
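+        # (-100 is the conventional ignore index; compute_metrics maps it back to
+        # pad_token_id before decoding.)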
+ outputs["input_ids"] = [(l if l != tokenizer.pad_token_id else -100) for l in outputs["input_ids"]] + + if is_train: + inputs = tokenizer( + input_seq, + max_seq_len=max_source_length, + pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=True, + return_length=False, + ) + return inputs["input_ids"], inputs["attention_mask"], output_ids, outputs["input_ids"] + else: + inputs = tokenizer( + input_seq, + max_seq_len=max_source_length, + pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=True, + return_length=True, + ) + return inputs["input_ids"], inputs["attention_mask"], inputs["length"], output_ids, outputs["input_ids"] + + +def compute_metrics(preds, labels, tokenizer, ignore_pad_token_for_loss=True): + def compute_bleu(predictions, references, rouge_types=None, use_stemmer=True): + bleu1 = BLEU(n_size=1) + bleu2 = BLEU(n_size=2) + bleu3 = BLEU(n_size=3) + bleu4 = BLEU(n_size=4) + assert len(predictions) == len(references) + for i in range(len(predictions)): + bleu1.add_inst(predictions[i], [references[i]]) + bleu2.add_inst(predictions[i], [references[i]]) + bleu3.add_inst(predictions[i], [references[i]]) + bleu4.add_inst(predictions[i], [references[i]]) + result = { + "BLEU-1": bleu1.score() * 100, + "BLEU-2": bleu2.score() * 100, + "BLEU-3": bleu3.score() * 100, + "BLEU-4": bleu4.score() * 100, + } + return result + + def compute_bleu_hf(predictions, references, rouge_types=None, use_stemmer=True): + predictions = [" ".join(prediction) for prediction in predictions] + references = [[" ".join(reference)] for reference in references] + + bleu = evaluate.load("bleu") + assert len(predictions) == len(references) + bleu1_results = bleu.compute(predictions=predictions, references=references, max_order=1) + bleu2_results = bleu.compute(predictions=predictions, references=references, max_order=2) + bleu3_results = bleu.compute(predictions=predictions, references=references, max_order=3) + bleu4_results = bleu.compute(predictions=predictions, references=references, max_order=4) + + result = { + "BLEU-1": bleu1_results["bleu"] * 100, + "BLEU-2": bleu2_results["bleu"] * 100, + "BLEU-3": bleu3_results["bleu"] * 100, + "BLEU-4": bleu4_results["bleu"] * 100, + } + return result + + def post_process_text(preds, labels): + preds = [pred.strip() for pred in preds] + labels = [label.strip() for label in labels] + preds = [pred.strip("question:") for pred in preds] + labels = [label.strip("question:") for label in labels] + labels = [label.strip() for label in labels] + + # expects newline after each sentence + preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds] + labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels] + + preds = [pred.split() for pred in preds] + labels = [label.split() for label in labels] + + return preds, labels + + def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + if ignore_pad_token_for_loss: + labels = np.asarray(labels) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + decoded_preds, decoded_labels = [], [] + for pred, label in zip(preds, labels): + pred_id = post_process_seq(pred, tokenizer.bos_token_id, tokenizer.eos_token_id) + label_id = post_process_seq(label, tokenizer.bos_token_id, tokenizer.eos_token_id) + decoded_preds.append(tokenizer.decode(pred_id)) + decoded_labels.append(tokenizer.decode(label_id)) + decoded_preds, decoded_labels = post_process_text(decoded_preds, decoded_labels) + # bleu_result = compute_bleu(decoded_preds, decoded_labels) + bleu_result = compute_bleu_hf(decoded_preds, decoded_labels) + return bleu_result, decoded_preds, decoded_labels diff --git a/examples/question_generation/unimo-text/README.md b/examples/question_generation/unimo-text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fdf6ba023b194042e90fd17943c1a153be62cee5 --- /dev/null +++ b/examples/question_generation/unimo-text/README.md @@ -0,0 +1,343 @@ +# 问题生成 + + +**目录** +- [问题生成](#问题生成) + - [简介](#简介) + <!-- - [基于预训练语言模型的问题生成](#基于预训练语言模型的问题生成) --> + <!-- - [效果展示](#效果展示) --> + - [开箱即用](#开箱即用) + - [训练定制](#训练定制) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [问题生成应用定制训练全流程介绍](#问题生成定制训练全流程介绍) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [数据处理](#数据处理) + - [从本地文件创建数据集-可选](#从本地文件创建数据集-可选) + - [模型训练](#模型训练) + - [模型预测](#模型预测) + - [模型转换部署](#模型转换部署) + - [FasterTransformer加速及模型静态图导出](#fastertransformer加速及模型静态图导出) + - [模型部署](#模型部署) + - [References](#references) + +## 简介 +Question Generation(QG),即问题生成,指的是给定一段上下文,自动生成一个流畅且符合上下文主题的问句。问题生成通常可以分为,无答案问题生成和有答案问题生成,这里只关注应用更广的有答案问题生成。 + +问题生成技术在教育、咨询、搜索、推荐等多个领域均有着巨大的应用价值。具体来说,问题生成可广泛应用于问答系统语料库构建,事实性问题生成,教育行业题库生成,对话提问,聊天机器人意图理解,对话式搜索意图提问,闲聊机器人主动提问等等场景。 + +本项目是基于预训练语言模型UNIMO-Text的问题生成,具有以下优势: + +- 效果领先。基于百度自研中文预训练语言模型UNIMO-Text,并提供基于模版策略和大规模多领域问题生成数据集训练的通用问题生成预训练模型`unimo-text-1.0-question-generation`。 +- 开箱即用。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 +- 高性能推理。本项目基于FasterTransformer进行推理加速,能够提供更高性能的推理体验,优化后的推理模型在dureader_qg开发集的推理耗时缩短为优化前的1/5。 +- 训练推理部署全流程打通。本项目提供了全面的定制训练流程,从数据准备、模型训练预测,到模型推理部署,一应俱全。 + +<!-- ### 基于预训练语言模型的问题生成 + +基于预训练语言模型(Pretrained Language Models, PLMs)范式的问题生成是目前最常用、效果最好(SOTA)的方式。 +预训练模型是在超大规模的语料采用无监督或者弱监督的方式进行预训练,能够学习如何准确地理解自然语言并以自然语言的形式流畅表达,这两项都是完成文本生成任务的重要能力。 + +PaddleNLP提供了方便易用的接口,可指定模型名或模型参数文件路径通过from_pretrained()方法加载不同网络结构的预训练模型,且相应预训练模型权重下载速度快速、稳定。 +Transformer预训练模型汇总包含了如 ERNIE、BERT、T5、UNIMO等主流预训练模型。下面以中文unimo-text-1.0模型为例,演示如何加载预训练模型和分词器: +``` +from paddlenlp.transformers import ErnieForGeneration, ErnieTokenizer +model_name = "ernie-1.0" +model = UNIMOLMHeadModel.from_pretrained(model_name) +tokenizer = UNIMOTokenizer.from_pretrained(model_name) +``` --> + +## 开箱即用 +PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。 +#### 支持单条、批量预测 +```python +>>> from paddlenlp import Taskflow +# 默认模型为 unimo-text-1.0-dureader_qg +>>> question_generator = Taskflow("question_generation") +# 单条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"} + ]) +''' + ['黄山最高峰是什么'] +''' +# 多条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"}, + {"context": 
"弗朗索瓦·韦达外文名:franciscusvieta国籍:法国出生地:普瓦图出生日期:1540年逝世日期:1603年12月13日职业:数学家主要成就:为近代数学的发展奠定了基础。", "answer": "法国"} + ]) +''' + ['黄山最高峰是什么', '弗朗索瓦是哪里人'] +''' +``` +关键配置参数说明: +* `model`:可选模型,默认为unimo-text-1.0-dureader_qg,支持的模型有["unimo-text-1.0", "unimo-text-1.0-dureader_qg", "unimo-text-1.0-question-generation", "unimo-text-1.0-question-generation-dureader_qg"]。 + +具体参数配置可参考[Taskflow文档](../../../docs/model_zoo/taskflow.md)。 + +## 训练定制 + +### 环境依赖 +- nltk +- evaluate +- tqdm + +安装方式:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +├── deploy # 部署 +│ ├── paddle_inference # PaddleInference高性能推理部署 +│ │ ├── inference_unimo_text.py # 推理部署脚本 +│ │ └── README.md # 说明文档 +│ └── paddle_serving +│ ├── config.yml # 配置文件 +│ ├── pipeline_client.py # 客户端程序 +│ ├── pipeline_service.py # 服务器程序 +│ └── README.md # 说明文档 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── train.py # 训练脚本 +├── predict.py # 预测评估脚本 +├── utils.py # 工具函数脚本 +└── README.md # 说明文档 +``` + +### 问题生成定制训练全流程介绍 +接下来,我们将按数据准备、训练、预测、推理部署等四个阶段对问题生成应用的全流程进行介绍。 +1. **数据准备** +- 默认使用中文问题生成数据集DuReader_QG进行实验,该数据集已集成到PaddleNLP。 +- 如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集(可选)](#从本地文件创建数据集(可选))。 + +2. **模型训练** + +- 数据准备完成后,可以开始使用我们的数据集对预训练模型进行微调训练。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。中文任务默认使用`unimo-text-1.0`模型,unimo-text-1.0还支持large模型。此外本项目还提供基于大规模多领域问题生成数据集训练的通用问题生成预训练模型`unimo-text-1.0-question-generation`,详见[UNIMO模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/UNIMO/contents.html),用户可以根据任务和设备需求进行选择。 + + +3. **模型预测** + +- 训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型预测结果。 + +4. **模型转换部署** +- 在现实部署场景中,我们通常不仅对模型的精度表现有要求,也需要考虑模型性能上的表现。我们可以使用模型裁剪进一步压缩模型体积,问题生成应用已提供裁剪API对上一步微调后的模型进行裁剪,模型裁剪之后会默认导出静态图模型。 + +- 模型部署需要将保存的最佳模型参数(动态图)导出成静态图参数,用于后续的推理部署。 + +- 问题生成应用提供了基于Paddle Serving的本地部署predictor,并且支持在GPU设备使用Faster Generation进行加速。 + +- 问题生成应用提供了基于Paddle Serving的服务端部署方案。 + +### 数据准备 +#### 数据加载 +[**DuReader_QG**数据集](https://www.luge.ai/#/luge/dataDetail?id=8)是一个中文问题生成数据集,我们使用该数据集作为应用案例进行实验。**DuReader_QG**中的数据主要由由上下文、问题、答案3个主要部分组成,其任务描述为给定上下文p和答案a,生成自然语言表述的问题q,且该问题符合段落和上下文的限制。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了DuReader_QG数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds = load_dataset('dureader_qg', splits=('train', 'dev')) +``` + +#### 数据处理 +针对**DuReader_QG**数据集,我们需要将QA任务格式的数据进行转换从而得到text2text形式的数据,我们默认使用模版的方式构造输入数据,默认模版如下,其他形式输入数据用户可以在convert_example函数中自行定义。 +```text +答案: <answer_text> 上下文: <context_text> +问题: <question_text> +``` + +#### 从本地文件创建数据集-可选 +在许多情况下,我们需要使用本地数据集来训练我们的问题生成模型,本项目支持使用固定格式本地数据集文件进行训练。 +使用本地文件,只需要在模型训练时指定`train_file` 为本地训练数据地址,`predict_file` 为本地测试数据地址即可。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +├── dev.json # 开发数据集文件 +└── test.json # 可选,待预测数据文件 +``` +本地数据集文件格式如下: +- train.json/dev.json/test.json 文件格式: +```text +{ + "context": <context_text>, + "answer": <answer_text>, + "question": <question_text>, +} +... +``` +- train.json/dev.json/test.json 文件样例: +```text +{ + "context": "欠条是永久有效的,未约定还款期限的借款合同纠纷,诉讼时效自债权人主张债权之日起计算,时效为2年。 根据《中华人民共和国民法通则》第一百三十五条:向人民法院请求保护民事权利的诉讼时效期间为二年,法律另有规定的除外。 第一百三十七条:诉讼时效期间从知道或者应当知道权利被侵害时起计算。但是,从权利被侵害之日起超过二十年的,人民法院不予保护。有特殊情况的,人民法院可以延长诉讼时效期间。 第六十二条第(四)项:履行期限不明确的,债务人可以随时履行,债权人也可以随时要求履行,但应当给对方必要的准备时间。", + "answer": "永久有效", + "question": "欠条的有效期是多久" +} +... 
+``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + +### 模型训练 +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "1,2" --log_dir ./unimo/finetune/log train.py \ + --dataset_name=dureader_qg \ + --model_name_or_path="unimo-text-1.0" \ + --save_dir=./unimo/finetune/checkpoints \ + --output_path ./unimo/finetune/predict.txt \ + --logging_steps=100 \ + --save_steps=500 \ + --epochs=20 \ + --batch_size=16 \ + --learning_rate=1e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_train \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --num_return_sequences=1 \ + --template=1 \ + --device=gpu +``` + + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU,使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。 +- `dataset_name` 数据集名称,当`train_file`和`predict_file`为None时将加载`dataset_name`的训练集和开发集,默认为`dureader_qg`。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `predict_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | unimo-text-1.0 | + | unimo-text-1.0-large | + | unimo-text-1.0-question-generation | + + <!-- | T5-PEGASUS | + | ernie-1.0 | + | ernie-gen-base-en | + | ernie-gen-large-en | + | ernie-gen-large-en-430g | --> + +- `save_dir` 表示模型的保存路径。 +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_predict` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 +- `template` 表示使用的模版,从[0, 1, 2, 3, 4]中选择,0表示不选择模版,1表示使用默认模版。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。如: + +```text +./unimo/finetune/checkpoints +├── model_1000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── special_tokens_map.json +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +微调的模型在dureader_qg验证集上有如下结果(指标为BLEU-4),其中`unimo-text-1.0-dureader_qg-w/o-template`表示不使用模版策略微调的结果,`unimo-text-1.0-large-dureader_qg`表示使用large模型微调的结果,`unimo-text-1.0-question-generation-dureader_qg`表示在通用问题生成预训练模型`unimo-text-1.0-question-generation`上微调的结果: + +| model_name | DuReaderQG | +| :-----------------------------: | :-----------: | +| unimo-text-1.0-dureader_qg-w/o-template | 39.61 | +| unimo-text-1.0-dureader_qg | 41.08 | +| unimo-text-1.0-large-dureader_qg | 41.51 | +| unimo-text-1.0-question-generation-dureader_qg | 44.02 | + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -u predict.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=your_model_path \ + --output_path=./predict.txt \ + --logging_steps=100 \ + --batch_size=16 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --template=1 \ + --device=gpu +``` +关键参数释义如下: +- `output_path` 表示预测输出结果保存的文件路径,默认为./predict.txt。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的微调好的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。 + +### 模型转换部署 + +#### FasterTransformer加速及模型静态图导出 + +使用动态图训练结束之后,可以通过[静态图导出脚本](export_model.py)实现基于FasterTransformer的高性能预测加速,并将动态图参数导出成静态图参数,静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py \ + --model_name_or_path ./checkpoint \ + --inference_model_dir ./export_checkpoint \ + --max_dec_len 50 \ + --use_fp16_decoding +``` +关键参数释义如下: + +* `model_name_or_path`:动态图训练保存的参数路径;默认为"./checkpoint"。 +* `inference_model_dir`:静态图图保存的参数路径;默认为"./export_checkpoint"。 +* `max_dec_len`:最大输出长度。 +* `use_fp16_decoding`:是否使用fp16解码进行预测。 + +执行命令后将会自动导出模型到指定的 `inference_model_dir` 中,保存模型文件结构如下所示: + +```text +├── unimo_text.pdiparams +├── unimo_text.pdiparams.info +└── unimo_text.pdmodel +``` + +#### 模型部署 +本项目提供多种不同场景的部署方案,请根据实际情况进行选择: +|部署方案|特色|场景|硬件| +|-|-|-|-| +|Paddle Inference<br>服务端/云端|通用性|模型算法复杂<br>硬件高性能|X86 CPU<br>NVIDIA 全系列 GPU<br>龙芯/飞腾等国产CPU<br>昆仑/昇腾/海光DCU等AI加速芯片 +|Paddle Serving<br>服务化|高并发|大流量、高并发、低延时、高吞吐<br>资源弹性调控应对服务流量变化<br>支持模型组合、加密、热更新等|X86/Arm CPU<br>NVIDIA GPU<br>昆仑/昇腾等 + + +问题生成应用已打通多种场景部署方案,点击链接获取具体的使用教程。 +- [Paddle Inference 推理 (Python)](./deploy/paddle_inference/README.md) +- [Paddle Serving 服务化部署(Python)](./deploy/paddle_serving/README.md) + + +## References +Zheng, Chujie, and Minlie Huang. "Exploring prompt-based few-shot learning for grounded dialog generation." arXiv preprint arXiv:2109.06513 (2021). +Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). 
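+
+## 附录:本地数据集构造示例
+
+前文「从本地文件创建数据集-可选」一节给出了 train.json / dev.json / test.json 的字段格式。下面是一个仅作示意的最小脚本,演示如何用 Python 标准库把已有的(上下文、答案、问题)三元组逐行写成该格式(每行一个 JSON 对象,与 predict.py 中按行 json.loads 的读取方式一致);其中的样例文本和输出路径 data/train.json 均为演示用的假设值,请按实际数据替换。
+
+```python
+import json
+
+# 演示用的假设样例:每条样本包含 context / answer / question 三个字段
+samples = [
+    {
+        "context": "欠条是永久有效的,未约定还款期限的借款合同纠纷……",
+        "answer": "永久有效",
+        "question": "欠条的有效期是多久",
+    },
+]
+
+# 每行写入一个 JSON 对象,对应 train.json / dev.json / test.json 中的一条样本
+with open("data/train.json", "w", encoding="utf-8") as f:
+    for sample in samples:
+        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
+```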
diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/README.md b/examples/question_generation/unimo-text/deploy/paddle_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..93f1eaf349407215a8a0eaf1f37719a13e488a45 --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_inference/README.md @@ -0,0 +1,54 @@ +# Paddle Inference部署 +本文档将介绍如何使用[Paddle Inference](https://paddle-inference.readthedocs.io/en/latest/guides/introduction/index_intro.html#paddle-inference)工具进行问题生成应用高性能推理推理部署。 + +**目录** + * [背景介绍](#背景介绍) + * [导出预测部署模型](#导出预测部署模型) + * [基于Python预测](#基于Python预测) + + +## 背景介绍 +Paddle inference和主框架的Model.predict均可实现推理预测,Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力,主框架的Model 对象是一个具备训练、测试、推理的神经网络。相比于Model.predict,inference可使用MKLDNN、CUDNN、TensorRT进行预测加速。Model.predict适用于训练好的模型直接进行预测,paddle inference适用于对推理性能、通用性有要求的用户,针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。由于 Paddle Inference 能力直接基于飞桨的训练算子,因此它支持飞桨训练出的所有模型的推理。 + + + +Paddle Inference Python端预测部署主要包含两个步骤: +- 导出预测部署模型 +- 基于Python预测 + + +## 导出预测部署模型 +部署时需要使用预测格式的模型(即动态图转静态图操作)。预测格式模型相对训练格式模型而言,在拓扑上裁剪掉了预测不需要的算子,并且会做特定部署优化。具体操作详见[FasterTransformer加速及模型静态图导出](../../README.md)。 + +## 基于Python预测 +<!-- 同上,高性能预测的默认输入和输出形式也为文件,可分别通过 test_path 和 save_path 进行指定,通过如下命令便可以基于Paddle Inference 进行高性能预测: --> + +在终端输入以下命令可在GPU上进行预测: +```shell +python deploy/paddle_inference/inference.py \ + --inference_model_dir ./export_checkpoint \ + --model_name_or_path "unimo-text-1.0" \ + --predict_file predict_file_name \ + --output_path output_path_name \ + --device gpu \ +``` + +<!-- 在终端输入以下命令可在CPU上进行预测: +```shell +python deploy/paddle_inference/inference_unimo_text.py --inference_model_dir ./export_checkpoint --device cpu +``` --> +经静态图转换,FastTransformer性能优化,Paddle Inference加速后的部署模型在dureader_qg devset的预测时间为27.74秒,相较于未优化前169.24秒,耗时缩减为原来的16.39%。 +关键参数释义如下: +* `inference_model_dir`:用于高性能推理的静态图模型参数路径,默认为"./export_checkpoint"。 +* `model_name_or_path`:tokenizer对应模型或路径,默认为"unimo-text-1.0"。 +* `dataset_name`:数据集名称,默认为`dureader_qg`。 +* `predict_file`:本地预测数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None,当为None时默认加载`dataset_name`的dev集。 +* `output_path`:表示预测结果的保存路径。 +* `device`:推理时使用的设备,可选项["gpu"],默认为"gpu"。 +* `batch_size`:进行推理时的批大小,默认为16。 +* `precision`:当使用TensorRT进行加速推理时,所使用的TensorRT精度,可选项["fp32", "fp16"],默认为"fp32"。 +<!-- * `precision`:当使用TensorRT进行加速推理时,所使用的TensorRT精度,可选项["fp32", "fp16", "int8"],默认为"fp32"。 --> +<!-- * `device`:推理时使用的设备,可选项["gpu", "cpu", "xpu"],默认为"gpu"。 --> +<!-- * `enable_mkldnn`:当使用cpu时,选择是否使用MKL-DNN(oneDNN)进行加速推理,默认为False。 --> +<!-- * `cpu_threads`:当使用cpu时,推理所用的进程数,默认为10。 --> +<!-- * `use_tensorrt`:当使用gpu时,选择是否使用TensorRT进行加速推理,默认为False。 --> diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py b/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f043e3f932dcb7292f5811d4721a611cc5564f6c --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py @@ -0,0 +1,260 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler + +from paddlenlp.data import Pad + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="test", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + if "source" in example and "title" in example: + source = example["source"] + title = None + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + title = None + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." 
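+        # At this point `source` holds the context passage and `title` holds the answer
+        # text; the reference question (if provided via "target") is picked up below as
+        # the generation target for training or kept for later evaluation in test mode.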
+ if "target" in example.keys(): + target = example["target"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + return_length=True, + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode="test"): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). 
+ attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask, seq_len + + +def create_data_loader(dataset, tokenizer, args, mode="test"): + trans_func = partial(convert_example, tokenizer=tokenizer, mode="test", template=1) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def remove_template(instr): + """Remove template prefix of decoded sequence.""" + outstr = instr.strip("问题:") + outstr = instr.strip("在已知答案的前提下,问题:") + return outstr + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + target = remove_template(target) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + response = remove_template(response) + + # TODO: Support return scores in FT. + tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py b/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..5e15b4a81441913eb6a2071b45f9e5cbbb8f5fa3 --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py @@ -0,0 +1,223 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from pprint import pprint + +import numpy as np +import paddle +from infer_utils import create_data_loader, postprocess_response, select_sum +from paddle import inference + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--inference_model_dir", default="./infer_model", type=str, help="Path to save inference model of UNIMOText. 
" + ) + parser.add_argument( + "--model_name_or_path", type=str, default="unimo-text-1.0", help="The path or shortcut name of the tokenizer." + ) + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + parser.add_argument( + "--use_tensorrt", + default=False, + type=eval, + choices=[True, False], + help="Whether to use inference engin TensorRT when using gpu.", + ) + parser.add_argument( + "--enable_mkldnn", + default=False, + type=eval, + choices=[True, False], + help="Enable to use mkldnn to speed up when using cpu.", + ) + parser.add_argument("--cpu_threads", default=10, type=int, help="Number of threads to predict when using cpu.") + parser.add_argument( + "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The tensorrt precision." + ) + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--output_path", type=str, default="./predict.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--dataset_name", type=str, default="dureader_qg", help="The name of the dataset to load.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument("--max_dec_len", type=int, default=20, help="The maximum sequence length of decoding.") + parser.add_argument( + "--num_return_sequences", + type=int, + default=1, + help="The numbers of returned sequences for one input in generation.", + ) + + args = parser.parse_args() + return args + + +def setup_predictor(args): + """Setup inference predictor.""" + # Load FastGeneration lib. 
+ load("FastGeneration", verbose=True) + model_file = os.path.join(args.inference_model_dir, "unimo_text.pdmodel") + params_file = os.path.join(args.inference_model_dir, "unimo_text.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = inference.Config(model_file, params_file) + if args.device == "gpu": + config.enable_use_gpu(100, 0) + config.switch_ir_optim() + config.enable_memory_optim() + config.disable_glog_info() + + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[args.precision] + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=args.batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif args.device == "cpu": + config.disable_gpu() + if args.enable_mkldnn: + config.enable_mkldnn() + config.set_mkldnn_cache_capacity(10) + + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif args.device == "xpu": + config.enable_xpu(100) + predictor = inference.create_predictor(config) + return predictor + + +@paddle.no_grad() +def infer_one(args, predictor, inputs=None): + """Use predictor to inference.""" + tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0") + + if not inputs: + inputs = { + "context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", + "answer": "莲花峰", + } + + inputs = "答案:" + inputs["answer"] + tokenizer.sep_token + "上下文:" + inputs["context"] + data = tokenizer.gen_encode( + inputs, add_start_token_for_decoding=True, return_length=True, is_split_into_words=False + ) + + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + if name == "attention_mask": + input_handles[name].copy_from_cpu(np.expand_dims(np.asarray(data[name], dtype="float32"), axis=(0, 1))) + else: + input_handles[name].copy_from_cpu(np.asarray(data[name], dtype="int32").reshape([1, -1])) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + for sample in output[0][:, :, 0].tolist(): + print("".join(postprocess_response(sample, tokenizer))) + + +@paddle.no_grad() +def infer(args, predictor, data_loader, tokenizer): + print("Infer begin...") + pred_ref = [] + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask, seq_len = inputs + data = { + "input_ids": input_ids, + "token_type_ids": token_type_ids, + "position_ids": position_ids, + "attention_mask": attention_mask, + "seq_len": seq_len, + } + + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + if name == "attention_mask": + input_handles[name].copy_from_cpu(np.asarray(data[name], dtype="float32")) + else: + input_handles[name].copy_from_cpu(np.asarray(data[name], dtype="int32")) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + ids = output[0] + scores = output[1] + + ids = paddle.to_tensor(ids, dtype="int32")[:, 0, :] + scores = 
paddle.to_tensor(scores, dtype="float32") + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + + pred_ref.extend(results) + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + with open(args.output_path + ".reference.txt", "w", encoding="utf-8") as fout: + targets = [example["target"] for example in data_loader.dataset] + for target in targets: + fout.write(target + "\n") + + +if __name__ == "__main__": + args = setup_args() + pprint(args) + + predictor = setup_predictor(args) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + ds, data_loader = create_data_loader(ds, tokenizer, args, "test") + + time_begin = time.time() + infer(args, predictor, data_loader, tokenizer) + print("inference cost time:", time.time() - time_begin) diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/README.md b/examples/question_generation/unimo-text/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ad375abab6ed0ebe9ebe6f158adcab9414af314e --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/README.md @@ -0,0 +1,150 @@ +# Paddle Serving服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署问题生成在线服务。 + +## 目录 +- [Paddle Serving服务化部署](#paddle-serving服务化部署) + - [目录](#目录) + - [背景介绍](#背景介绍) + - [环境准备](#环境准备) + - [安装Paddle Serving](#安装paddle-serving) + <!-- - [安装FastTokenizer文本处理加速库(可选)](#安装fastertokenizer文本处理加速库可选) --> + - [模型转换](#模型转换) + - [pipeline部署](#pipeline部署) + - [修改配置文件](#修改配置文件) + - [server启动服务](#server启动服务) + - [client发送服务请求](#client发送服务请求) + +## 背景介绍 +Paddle Serving 依托深度学习框架 PaddlePaddle 旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving 支持 RESTful、gRPC、bRPC 等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。集成高性能服务端推理引擎 Paddle Inference 和端侧引擎 Paddle Lite。设计并实现基于有向无环图(DAG) 的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理、请求缓存等特性。 + +Paddle Serving Python端预测部署主要包含以下步骤: +- 环境准备 +- 模型转换 +- 部署模型 + +## 环境准备 +### 安装Paddle Serving +安装client和serving app,用于向服务发送请求: +```shell +pip install paddle_serving_app paddle_serving_client +``` +安装server,用于启动服务,根据服务器设备选择安装CPU server或GPU server: + +- 安装CPU server +```shell +pip install paddle_serving_server +``` +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 # -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 可以开启国内清华镜像源来加速下载 +- 如果要安装最新版本的PaddleServing参考[链接](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md)。 + + +<!-- ### 安装FastTokenizer文本处理加速库(可选) +如果部署环境是Linux,推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。目前暂不支持Windows设备安装,将会在下个版本支持。 +```shell +pip install fast-tokenizer-python +``` --> + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 
+ +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。关于如何使用将训练后的动态图模型转为静态图模型详见[FasterTransformer加速及模型静态图导出](../../README.md)。 + +模型转换命令如下: +```shell +python -m paddle_serving_client.convert --dirname ./export_checkpoint \ + --model_filename unimo_text.pdmodel \ + --params_filename unimo_text.pdiparams \ + --serving_server ./deploy/paddle_serving/export_checkpoint_server \ + --serving_client ./deploy/paddle_serving/export_checkpoint_client +``` +关键参数释义如下: +* `dirname`:静态图模型文件夹地址。 +* `model_filename`:模型文件名。 +* `params_filename`:模型参数名。 +* `serving_server`:server的模型文件和配置文件路径,默认"serving_server"。 +* `serving_client`:client的配置文件路径,默认"serving_client"。 + +更多参数可通过以下命令查询: +```shell +python -m paddle_serving_client.convert --help +``` +模型转换完成后,会在./delopy/paddle_serving文件夹多出export_checkpoint_server和export_checkpoint_client的文件夹,文件夹目录格式如下: +``` +export_checkpoint_server/ +├── unimo_text.pdiparams +├── unimo_text.pdmodel +├── serving_server_conf.prototxt +└── serving_server_conf.stream.prototxt +export_checkpoint_server/ +├── serving_client_conf.prototxt +└── serving_client_conf.stream.prototxt +``` + +## pipeline部署 + +paddle_serving目录包含启动pipeline服务和发送预测请求的代码,包括: +``` +paddle_serving/ +├──config.yml # 启动服务端的配置文件 +├──pipeline_client.py # 发送pipeline预测请求的脚本 +└──pipeline_service.py # 启动pipeline服务端的脚本 +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。 + +### server启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +cd deploy/paddle_serving +# 启动服务,运行日志保存在log.txt +python pipeline_service.py &> log.txt & +``` +成功启动服务后,log.txt中会打印类似如下日志 +``` +--- Running analysis [ir_graph_to_program_pass] +I0901 12:09:27.248943 12190 analysis_predictor.cc:1035] ======= optimize end ======= +I0901 12:09:27.249596 12190 naive_executor.cc:102] --- skip [feed], feed -> seq_len +I0901 12:09:27.249608 12190 naive_executor.cc:102] --- skip [feed], feed -> attention_mask +I0901 12:09:27.249614 12190 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0901 12:09:27.249617 12190 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0901 12:09:27.250080 12190 naive_executor.cc:102] --- skip [_generated_var_3], fetch -> fetch +I0901 12:09:27.250090 12190 naive_executor.cc:102] --- skip [transpose_0.tmp_0], fetch -> fetch +[2022-09-01 12:09:27,251] [ INFO] - Already cached /root/.paddlenlp/models/unimo-text-1.0/unimo-text-1.0-vocab.txt +[2022-09-01 12:09:27,269] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/unimo-text-1.0/tokenizer_config.json +[2022-09-01 12:09:27,269] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/unimo-text-1.0/special_tokens_map.json +[PipelineServicer] succ init +[OP Object] init success +2022/09/01 12:09:27 start proxy service +``` + +### client发送服务请求 +执行以下命令发送文本摘要服务请求: +```shell +cd deploy/paddle_serving +python pipeline_client.py +``` +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) + +成功运行后,输出打印如下: +``` +time cost :0.03429532051086426 seconds +-------------------- +input: {'context': '平安银行95511电话按9转报案人工服务。 1.寿险 :95511转1 2.信用卡 95511转2 3.平安银行 95511转3 4.一账通 95511转4转8 5.产险 95511转5 6.养老险团体险 95511转6 7.健康险 95511转7 8.证券 95511转8 9.车险报案95511转9 0.重听', 'answer': '95511'} +output: 问题:平安银行人工服务电话 +-------------------- +``` diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml b/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..1cc918e1ba0c293aa9717aecda580a2c9d871c0a --- /dev/null +++ 
b/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml @@ -0,0 +1,59 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18011 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9999 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 10 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: True + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + question_generation: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 11 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: ../../unimo/serving/export_checkpoint_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准,不设置默认取全部输出变量 + # fetch_list: ["_generated_var_3", "slice_0.tmp_0"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #开启MKLDNN加速 + use_mkldnn: False + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: False + + #开启tensorrt后,进行优化的子图包含的最少节点数 + #min_subgraph_size: 10 \ No newline at end of file diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py b/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f043e3f932dcb7292f5811d4721a611cc5564f6c --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py @@ -0,0 +1,260 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler + +from paddlenlp.data import Pad + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="test", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + if "source" in example and "title" in example: + source = example["source"] + title = None + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + title = None + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." + if "target" in example.keys(): + target = example["target"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + return_length=True, + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode="test"): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, 
max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask, seq_len + + +def create_data_loader(dataset, tokenizer, args, mode="test"): + trans_func = partial(convert_example, tokenizer=tokenizer, mode="test", template=1) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def remove_template(instr): + """Remove template prefix of decoded sequence.""" + outstr = instr.strip("问题:") + outstr = instr.strip("在已知答案的前提下,问题:") + return outstr + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + target = remove_template(target) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + response = remove_template(response) + + # TODO: Support return scores in FT. + tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py new file mode 100644 index 0000000000000000000000000000000000000000..9172d68cfdda147cf4b6988cb1af0db7ec52cf1e --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py @@ -0,0 +1,50 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
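+
+# Minimal Paddle Serving pipeline client for the question generation service.
+# It connects to the gRPC endpoint started by pipeline_service.py, sends a batch of
+# {"context": ..., "answer": ...} dicts and prints the questions returned by the server.
+# Run it with `python pipeline_client.py`, adjusting `server_url` below if the service
+# runs on another machine or port.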
+import time + +from paddle_serving_server.pipeline import PipelineClient + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data): + inputs = data + start_time = time.time() + ret = self.client.predict(feed_dict={"inputs": inputs}) + end_time = time.time() + print("time cost :{} seconds".format(end_time - start_time)) + if not ret.value: + print("Fail to fetch summary.") + # ret is special class but a dict + for d, s in zip(data, eval(ret.value[0])): + print("--------------------") + print("input: ", d) + print("output: ", s) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "127.0.0.1:18011" + runner = Runner(server_url) + requests = [ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"} + ] + runner.Run(requests) diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py new file mode 100644 index 0000000000000000000000000000000000000000..e9b6af9fe5d4744df5b6ea7bf2526bfe88cda9a2 --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py @@ -0,0 +1,74 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
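+
+# Paddle Serving pipeline service for question generation.
+# UnimoTextOp tokenizes incoming {"context": ..., "answer": ...} dicts with the UNIMO
+# tokenizer, runs the exported static graph model configured in config.yml through the
+# local predictor, and decodes the returned token ids back into question text.
+# Start the service with `python pipeline_service.py` (see config.yml for ports/devices).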
+ +import logging + +from infer_utils import batchify_fn, convert_example, postprocess_response +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer + +_LOGGER = logging.getLogger(__name__) + + +class UnimoTextOp(Op): + """Op for unimo_text.""" + + def init_op(self): + self.tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0") + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["inputs"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + _LOGGER.error("input value {}is not supported.".format(data)) + examples = [convert_example(i, self.tokenizer) for i in data] + input_ids, token_type_ids, position_ids, attention_mask, seq_len = batchify_fn( + examples, self.tokenizer.pad_token_id + ) + new_dict = {} + new_dict["input_ids"] = input_ids + new_dict["token_type_ids"] = token_type_ids + new_dict["attention_mask"] = attention_mask + new_dict["seq_len"] = seq_len + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model input + return new_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + # keyname refer to export_checkpoint_client/serving_client_conf.prototxt + ids = fetch_dict["transpose_0.tmp_0"][:, 0, :].tolist() + # scores = fetch_dict["_generated_var_3"][:, 0].tolist() + + results = ["".join(postprocess_response(sample, self.tokenizer)) for sample in ids] + new_dict = {} + new_dict["outputs"] = str(results) + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model output + return new_dict, None, "" + + +class UnimoTextService(WebService): + def get_pipeline_response(self, read_op): + return UnimoTextOp(name="question_generation", input_ops=[read_op]) + + +if __name__ == "__main__": + # Load FastGeneration lib. + load("FastGeneration", verbose=True) + service = UnimoTextService(name="question_generation") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/examples/question_generation/unimo-text/export_model.py b/examples/question_generation/unimo-text/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..ea8d9c1a2e4013a2092f00eee036f21271c2126f --- /dev/null +++ b/examples/question_generation/unimo-text/export_model.py @@ -0,0 +1,99 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle + +from paddlenlp.ops import FasterUNIMOText +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default="checkpoint", type=str, help="The model name to specify the UNIMOText to use. 
") + parser.add_argument("--inference_model_dir", default="./export_checkpoint", type=str, help="Path to save inference model of UNIMOText. ") + parser.add_argument("--topk", default=4, type=int, help="The number of candidate to procedure top_k sampling. ") + parser.add_argument("--topp", default=1.0, type=float, help="The probability threshold to procedure top_p sampling. ") + parser.add_argument("--max_dec_len", default=20, type=int, help="Maximum output length. ") + parser.add_argument("--min_dec_len", default=3, type=int, help="Minimum output length. ") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set. ") + parser.add_argument("--num_return_sequences", default=1, type=int, help="The number of returned sequences. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument("--decoding_strategy", default="beam_search", choices=["sampling", "beam_search"], type=str, help="The main strategy to decode. ") + parser.add_argument("--num_beams", default=6, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. ") + parser.add_argument("--length_penalty", default=1.2, type=float, help="The diversity rate to procedure beam search. ") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + model_name_or_path = args.model_name_or_path + model = UNIMOLMHeadModel.from_pretrained(model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(model_name_or_path) + + unimo_text = FasterUNIMOText(model=model, + use_fp16_decoding=args.use_fp16_decoding, + trans_out=True) + + # Set evaluate mode + unimo_text.eval() + + # Convert dygraph model to static graph model + unimo_text = paddle.jit.to_static( + unimo_text, + input_spec=[ + # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # token_type_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # attention_mask + paddle.static.InputSpec(shape=[None, 1, None, None], + dtype="float32"), + # seq_len + paddle.static.InputSpec(shape=[None], dtype="int64"), + args.max_dec_len, + args.min_dec_len, + args.topk, + args.topp, + args.num_beams, # num_beams. Used for beam_search. + args.decoding_strategy, + tokenizer.cls_token_id, # cls/bos + tokenizer.mask_token_id, # mask/eos + tokenizer.pad_token_id, # pad + args.diversity_rate, # diversity rate. Used for beam search. + args.temperature, + args.num_return_sequences, + args.length_penalty, + ]) + + # Save converted static graph model + paddle.jit.save(unimo_text, + os.path.join(args.inference_model_dir, "unimo_text")) + logger.info("UNIMOText has been saved to {}.".format( + args.inference_model_dir)) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + do_predict(args) diff --git a/examples/question_generation/unimo-text/gen_utils.py b/examples/question_generation/unimo-text/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ecc75584d89fa1cd3ed5c621ea7302302363e8d3 --- /dev/null +++ b/examples/question_generation/unimo-text/gen_utils.py @@ -0,0 +1,316 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np + +import paddle +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="train", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + title = None + if "source" in example and "title" in example: + source = example["source"] + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." 
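+        # Here `source` holds the context passage and `title` holds the answer text; the
+        # reference question (from the "target" or "question" field, if present) becomes
+        # the generation target for training or is kept for evaluation at test time.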
+ if "target" in example.keys(): + target = example["target"] + elif "question" in example.keys(): + target = example["question"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 4: + prompt_common = example["prompt_common"] + prompt_domain = example["prompt_domain"] + source = ( + prompt_common + + " " + + tokenizer.sep_token + + " " + + "".join( + [" " + tokenizer.cls_token + " " + one + " " + tokenizer.sep_token + " " for one in prompt_domain] + ) + + " " + + tokenizer.cls_token + + " " + + "答案:" + + title + + " " + + tokenizer.sep_token + + " " + + tokenizer.cls_token + + "上下文:" + + source + ) + + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example["input_ids"]) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + # If template==4, count must be equal to 7, otherwise count must be equal to 2 + assert count == 7 or count == 2, ( + str(count) + " is not in [2, 7], temp_tokens: " + " ".join(temp_tokens) + "source: " + source + ) + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + if template == 4: + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + target_start = index_list[-1] + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + if template == 4: + tokenized_example["token_type_ids"] + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + ) + + if template == 4: + # temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example['input_ids']) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + assert count == 7, str(count) + " is not in [7]" + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + elif 
"question" in example and example["question"]: + tokenized_example["target"] = example["question"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + max_title_len=args.max_title_len, + mode=mode, + template=args.template, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>."""
+    eos_pos = len(token_ids)
+    for i, tok_id in enumerate(token_ids):
+        if tok_id == tokenizer.mask_token_id:
+            eos_pos = i
+            break
+    token_ids = token_ids[:eos_pos]
+    tokens = tokenizer.convert_ids_to_tokens(token_ids)
+    tokens = tokenizer.merge_subword(tokens)
+    special_tokens = ["[UNK]"]
+    tokens = [token for token in tokens if token not in special_tokens]
+    return token_ids, tokens
+
+
+def remove_template(instr):
+    """Remove the template prefix from a decoded sequence."""
+    # Check the longer prefix first so the shorter one does not strip only part of it.
+    for prefix in ("在已知答案的前提下,问题:", "问题:"):
+        if instr.startswith(prefix):
+            return instr[len(prefix) :]
+    return instr
+
+
+def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1):
+    results = []
+    group = []
+    tmp = []
+    if scores is not None:
+        ids = ids.numpy()
+        scores = scores.numpy()
+
+        if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0:
+            raise ValueError(
+                "the length of `ids` ({}) must equal the length of `scores` ({}) and "
+                "be divisible by `num_return_sequences` ({})".format(len(ids), len(scores), num_return_sequences)
+            )
+
+        for pred, score in zip(ids, scores):
+            pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer)
+            num_token = len(pred_token_ids)
+
+            target = "".join(pred_tokens)
+            target = remove_template(target)
+
+            # Penalize sequences that reach max_dec_len without generating the end token.
+            if max_dec_len is not None and num_token >= max_dec_len:
+                score -= 1e3
+
+            tmp.append([target, score])
+            if len(tmp) == num_return_sequences:
+                group.append(tmp)
+                tmp = []
+
+        for preds in group:
+            preds = sorted(preds, key=lambda x: -x[1])
+            results.append(preds[0][0])
+    else:
+        ids = ids.numpy()
+
+        for pred in ids:
+            pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer)
+            num_token = len(pred_token_ids)
+            response = "".join(pred_tokens)
+            response = remove_template(response)
+
+            # TODO: Support returning scores in FT.
+            tmp.append([response])
+            if len(tmp) == num_return_sequences:
+                group.append(tmp)
+                tmp = []
+
+        for preds in group:
+            results.append(preds[0][0])
+
+    return results
diff --git a/examples/question_generation/unimo-text/predict.py b/examples/question_generation/unimo-text/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ec590f45b08016a2642fbb37253aae48e33ee09
--- /dev/null
+++ b/examples/question_generation/unimo-text/predict.py
@@ -0,0 +1,141 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
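+
+"""Prediction script for UNIMO-text question generation.
+
+Loads a fine-tuned UNIMO-text model and tokenizer, builds a test data loader
+from ``--predict_file`` (or the dev split of ``--dataset_name``), generates
+questions with ``model.generate`` and writes one result per line to
+``--output_path``.
+"""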
+ +import argparse +import json +import time + +import paddle +import paddle.distributed as dist +from gen_utils import create_data_loader, print_args, select_sum, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = 
UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + prediction(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def prediction(model, data_loader, args, tokenizer): + print("\nPred begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/question_generation/unimo-text/requirements.txt b/examples/question_generation/unimo-text/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..48ff8faab77ad3c024f841a75915a17ff475880d --- /dev/null +++ b/examples/question_generation/unimo-text/requirements.txt @@ -0,0 +1,3 @@ +nltk==3.6.2 +evaluate==0.2.2 +tqdm==4.64.0 \ No newline at end of file diff --git a/examples/question_generation/unimo-text/train.py b/examples/question_generation/unimo-text/train.py new file mode 100644 index 0000000000000000000000000000000000000000..73e2c1544328e42275af2d0886e44e66a61df7cf --- /dev/null +++ b/examples/question_generation/unimo-text/train.py @@ -0,0 +1,281 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
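+
+"""Fine-tuning script for UNIMO-text question generation.
+
+Builds train/dev data loaders, trains with AdamW, linear warmup/decay and
+label smoothing, saves checkpoints every ``--save_steps`` steps and, when
+``--do_predict`` is set, keeps the checkpoint with the best BLEU-4 score on
+the dev set.
+"""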
+ +import argparse +import json +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from gen_utils import create_data_loader, print_args, select_sum, set_seed +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import ( + BasicTokenizer, + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--learning_rate', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_proportion', type=float, default=0.02, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The max value of grad norm.') + parser.add_argument('--beta1', type=float, default=0.9, help='beta1') + parser.add_argument('--beta2', type=float, default=0.98, help='beta2') + parser.add_argument('--epsilon', type=float, default=1e-6, help='epsilon') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, 
default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_train", action='store_true', help="Whether to train the model.") + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_n(preds, targets, n_size=4): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu = BLEU(n_size=n_size) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu.add_inst(pred_tokens, [target_token]) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-" + str(n_size) + ":", bleu.score()) + return bleu.score() + + +def calc_bleu(preds, targets): + calc_bleu_n(preds, targets, 1) + calc_bleu_n(preds, targets, 2) + calc_bleu_n(preds, targets, 3) + bleu4_score = calc_bleu_n(preds, targets, 4) + return bleu4_score + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.train_file: + train_ds = load_dataset(read_file, file=args.train_file, lazy=False) + else: + train_ds = load_dataset(args.dataset_name, splits="train", data_files=args.train_file) + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_train: + num_training_steps = args.epochs * len(train_data_loader) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
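+        # ``apply_decay_param_fun`` below is called with each parameter name and
+        # applies weight decay only when that name appears in ``decay_params``.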
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + best_bleu4 = 0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step >= num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + bleu4 = evaluation(model_eval, dev_data_loader, args, tokenizer) + if bleu4 > best_bleu4: + print("best BLEU-4 performence has been updated: %.5f --> %.5f" % (best_bleu4, bleu4)) + best_bleu4 = bleu4 + save_ckpt(model, tokenizer, args.save_dir, "best") + + batch_start_time = time.time() + + print("\nTraining completed.") + elif args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + with open(args.output_path + ".reference.txt", "w", encoding="utf-8") as fout: + targets = [example["target"] for example in data_loader.dataset] + for target in targets: + 
fout.write(target + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + targets = [example["target"] for example in data_loader.dataset] + bleu4_score = calc_bleu(pred_ref, targets) + + model.train() + return bleu4_score + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/semantic_indexing/NQdataset.py b/examples/semantic_indexing/NQdataset.py new file mode 100644 index 0000000000000000000000000000000000000000..58efe8156ce12ba266d5c45ecfa40a39521b20ad --- /dev/null +++ b/examples/semantic_indexing/NQdataset.py @@ -0,0 +1,251 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import json +import random +from typing import List + +import numpy as np +import paddle +from paddle.io import Dataset + +from paddlenlp.transformers.bert.tokenizer import BertTokenizer + +BiEncoderPassage = collections.namedtuple("BiEncoderPassage", ["text", "title"]) + +BiENcoderBatch = collections.namedtuple( + "BiEncoderInput", + [ + "questions_ids", + "question_segments", + "context_ids", + "ctx_segments", + "is_positive", + "hard_negatives", + "encoder_type", + ], +) + + +def normalize_question(question: str) -> str: + question = question.replace("’", "'") + return question + + +def normalize_passage(ctx_text: str): + ctx_text = ctx_text.replace("\n", " ").replace("’", "'") + if ctx_text.startswith('"'): + ctx_text = ctx_text[1:] + if ctx_text.endswith('"'): + ctx_text = ctx_text[:-1] + return ctx_text + + +class BiEncoderSample(object): + query: str + positive_passages: List[BiEncoderPassage] + negative_passages: List[BiEncoderPassage] + hard_negative_passages: List[BiEncoderPassage] + + +class NQdataSetForDPR(Dataset): + """ + class for managing dataset + """ + + def __init__(self, dataPath, query_special_suffix=None): + super(NQdataSetForDPR, self).__init__() + self.data = self._read_json_data(dataPath) + self.tokenizer = BertTokenizer + self.query_special_suffix = query_special_suffix + self.new_data = [] + for i in range(0, self.__len__()): + self.new_data.append(self.__getitem__(i)) + + def _read_json_data(self, dataPath): + results = [] + with open(dataPath, "r", encoding="utf-8") as f: + print("Reading file %s" % dataPath) + data = json.load(f) + results.extend(data) + print("Aggregated data size: {}".format(len(results))) + return results + + def __getitem__(self, index): + json_sample_data = self.data[index] + r = BiEncoderSample() + r.query = self._porcess_query(json_sample_data["question"]) + + positive_ctxs = json_sample_data["positive_ctxs"] + + negative_ctxs = json_sample_data["negative_ctxs"] if "negative_ctxs" in json_sample_data else [] + hard_negative_ctxs = json_sample_data["hard_negative_ctxs"] if "hard_negative_ctxs" in json_sample_data else [] + + for ctx in positive_ctxs + negative_ctxs + hard_negative_ctxs: + if "title" not in ctx: + ctx["title"] = None + + def create_passage(ctx): 
+ return BiEncoderPassage(normalize_passage(ctx["text"]), ctx["title"]) + + r.positive_passages = [create_passage(ctx) for ctx in positive_ctxs] + r.negative_passages = [create_passage(ctx) for ctx in negative_ctxs] + r.hard_negative_passages = [create_passage(ctx) for ctx in hard_negative_ctxs] + + return r + + def _porcess_query(self, query): + query = normalize_question(query) + + if self.query_special_suffix and not query.endswith(self.query_special_suffix): + query += self.query_special_suffix + + return query + + def __len__(self): + return len(self.data) + + +class DataUtil: + """ + Class for working with datasets + """ + + def __init__(self): + self.tensorizer = BertTensorizer() + + def create_biencoder_input( + self, + samples: List[BiEncoderSample], + inserted_title, + num_hard_negatives=0, + num_other_negatives=0, + shuffle=True, + shuffle_positives=False, + hard_neg_positives=False, + hard_neg_fallback=True, + query_token=None, + ): + + question_tensors = [] + ctx_tensors = [] + positive_ctx_indices = [] + hard_neg_ctx_indices = [] + + for sample in samples: + + if shuffle and shuffle_positives: + positive_ctxs = sample.positive_passages + positive_ctx = positive_ctxs[np.random.choice(len(positive_ctxs))] + else: + positive_ctx = sample.positive_passages[0] + + neg_ctxs = sample.negative_passages + hard_neg_ctxs = sample.hard_negative_passages + question = sample.query + + if shuffle: + random.shuffle(neg_ctxs) + random.shuffle(hard_neg_ctxs) + + if hard_neg_fallback and len(hard_neg_ctxs) == 0: + hard_neg_ctxs = neg_ctxs[0:num_hard_negatives] + + neg_ctxs = neg_ctxs[0:num_other_negatives] + hard_neg_ctxs = hard_neg_ctxs[0:num_hard_negatives] + + all_ctxs = [positive_ctx] + neg_ctxs + hard_neg_ctxs + hard_negative_start_idx = 1 + hard_negative_end_idx = 1 + len(hard_neg_ctxs) + + current_ctxs_len = len(ctx_tensors) + + sample_ctxs_tensors = [ + self.tensorizer.text_to_tensor(ctx.text, title=ctx.title if (inserted_title and ctx.title) else None) + for ctx in all_ctxs + ] + + ctx_tensors.extend(sample_ctxs_tensors) + positive_ctx_indices.append(current_ctxs_len) + hard_neg_ctx_indices.append( + i + for i in range( + current_ctxs_len + hard_negative_start_idx, + current_ctxs_len + hard_negative_end_idx, + ) + ) + """if query_token: + if query_token == "[START_END]": + query_span = _select_span + else: + question_tensors.append(self.tensorizer.text_to_tensor(" ".join([query_token, question]))) + else:""" + + question_tensors.append(self.tensorizer.text_to_tensor(question)) + + ctxs_tensor = paddle.concat([paddle.reshape(ctx, [1, -1]) for ctx in ctx_tensors], axis=0) + questions_tensor = paddle.concat([paddle.reshape(q, [1, -1]) for q in question_tensors], axis=0) + + ctx_segments = paddle.zeros_like(ctxs_tensor) + question_segments = paddle.zeros_like(questions_tensor) + + return BiENcoderBatch( + questions_tensor, + question_segments, + ctxs_tensor, + ctx_segments, + positive_ctx_indices, + hard_neg_ctx_indices, + "question", + ) + + +class BertTensorizer: + def __init__(self, pad_to_max=True, max_length=256): + self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + self.max_length = max_length + self.pad_to_max = pad_to_max + + def text_to_tensor( + self, + text: str, + title=None, + ): + text = text.strip() + + if title: + token_ids = self.tokenizer.encode( + text, + text_pair=title, + max_seq_len=self.max_length, + pad_to_max_seq_len=False, + truncation_strategy="longest_first", + )["input_ids"] + else: + token_ids = self.tokenizer.encode( + text, + 
max_seq_len=self.max_length, + pad_to_max_seq_len=False, + truncation_strategy="longest_first", + )["input_ids"] + + seq_len = self.max_length + if self.pad_to_max and len(token_ids) < seq_len: + token_ids = token_ids + [self.tokenizer.pad_token_type_id] * (seq_len - len(token_ids)) + if len(token_ids) >= seq_len: + token_ids = token_ids[0:seq_len] + token_ids[-1] = 102 + + return paddle.to_tensor(token_ids) diff --git a/examples/semantic_indexing/README.md b/examples/semantic_indexing/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9b37dd24b7372a81d70b08ae5c3def72669bad8d --- /dev/null +++ b/examples/semantic_indexing/README.md @@ -0,0 +1,297 @@ +# 语义索引 + +语义索引技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一, 语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +语义索引库提供了前沿语义索引策略的训练、语义索引模型的效果评估方案、支持用户基于我们开源的语义索引模型进行文本 Pair 的相似度计算或者 Embedding 语义表示抽取。 + +我们基于 ERNIE1.0 热启,分别采用 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略和 HardestNeg 策略开源了 [batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar) 和 [hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar) 模型,相比 Baseline 模型效果有显著提升: + +## 效果评估 +| 模型 | Recall@10 | Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- | +| Baseline | 46.99 | 60.84 | 标准 pair-wise 训练范式,通过随机采样产生负样本| +| [In-batch negatives](https://arxiv.org/abs/2004.04906) | 51.20(**+4.21**) | 67.24(**+6.4**) | 在 Batch 内同时使用 batch_size 个负样本进行训练| +| HardestNeg | 50.22(**+3.23**) | 65.17(**+4.33**) |<div style="width: 340pt"> 在 Batch 内先挖掘最难负样本,然后进行 pair-wise 训练</div>| + + +## 语义索引预训练模型下载 +以下模型结构参数为: +`TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar)|<div style="width: 150pt">margin:0.2 scale:30 epoch:3 lr:5E-5 bs:128 max_len:64 </div>|<div style="width: 100pt">单卡v100-16g</div>|da1bb1487bd3fd6a53b8ef95c278f3e6| +|[hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar)|margin:0.2 epoch:3 lr:5E-5 bs:128 max_len:64 |单卡v100-16g|b535d890110ea608c8562c525a0b84b5| + + +## 数据准备 +### 数据生成 +我们基于开源语义匹配数据集构造生成了面向语义索引的训练集、评估集、召回库。 +#### 构造训练集 +从开源语义相似度任务评测数据集([LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)、[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html)、[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx))的训练集和测试集中抽取出所有语义相似的文本 Pair 作为训练集 [semantic_pair_train.tsv](https://bj.bcebos.com/paddlenlp/models/semantic_index/semantic_pair_train.tsv)。 + +[In-batch negatives](https://arxiv.org/abs/2004.04906) 策略和 HardestNeg 策略训练数据每一行由 `tab` 分隔的语义相似的文本 Pair 对,样例数据如下: +``` +欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +``` + + +#### 构造评估集 +从开源语义相似度数据集([LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)、[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html)、[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx)) 的验证集中抽取出正例文本 Pair 生成评估集 [same_semantic.tsv](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv),其中第 1 列文本作为输入模型的源文本 *Source Text*、第 2 列文本作为语义相似的目标文本 *Target Text*。 + +#### 构造召回库 +抽取出开源语义相似度数据集([LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)、[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html)、[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx))训练集中的所有文本和验证集中所有文本 Pair 的第 2 列 
*Target Text* 生成召回库 [corpus_file](https://bj.bcebos.com/paddlenlp/models/semantic_index/corpus_file) + + +### 数据下载 +|数据|描述|数量|MD5| +| ------------ | ------------ | ------------ | -------- | +|<div style="width: 180pt">[训练集(semantic_pair_train.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/semantic_pair_train.tsv)</div>|<div style="width: 220pt">每行为语义相似的文本 Pair 构成的训练集</div>|222546|590286f695200160350cc5838cb34f00| +|[评估集(same_semantic.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv)|每行为语义相似文本 Pair 构成的评估集|10255|86ec1fd5234d944177574372dcf780c5| +|[召回库(corpus_file)](https://bj.bcebos.com/paddlenlp/models/semantic_index/corpus_file)|每行为单条文本构成的召回库|313714|a3fbc3421b5aeb939809876fc7beeaa8| + + +## 项目依赖: +- [hnswlib](https://github.com/nmslib/hnswlib) + +## 代码结构及说明 +``` +|—— train_batch_neg.py # In-batch negatives 策略的训练主脚本 +|—— train_hardest_neg.py # HardestNeg 策略的训练主脚本 +|—— batch_negative + |—— model.py # In-batch negatives 策略核心网络结构 +|——hardest_negative + |—— model.py # HardestNeg 策略核心网络结构 +|—— ann_util.py # Ann 建索引库相关函数 +|—— base_model.py # 语义索引模型基类 +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +``` + +## 模型训练 +### 基于 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略训练 +以我们提供的语义相似度训练数据为例,通过如下命令,指定 GPU 0,1,2,3 卡, 基于 In-batch negatives 策略开始训练模型 + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 500 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file semantic_pair_train.tsv \ +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 + + +### 基于 HardestNeg 策略训练 +以我们提供的语义相似度训练集为例子,通过如下命令,指定 GPU 0,1,2,3 卡, 开始模型训练 + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_hardest_neg.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 500 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file semantic_pair_train.tsv \ + +``` + +## 效果评估 +语义索引模型的目标是: 给定输入文本,模型可以从海量候选召回库中快速、准确地召回一批语义相关文本。 + +### 评估指标 +采用 Recall@10 和 Recall@50 指标来评估语义索引模型的召回效果 + +### 开始评估 +效果评估分为 3 个步骤: +1. ANN 建库 +首先基于语义索引模型抽取出召回库的文本向量,然后使用 ANN 引擎建索引库(当前基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +2. 召回 +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 1 步中建立的索引库中进行 ANN 查询召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件 + +3. 
评估: 基于评估集 [same_semantic.tsv](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv) 和召回结果 `recall_result` 计算评估指标 R@10 和 R@50 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${checkpoints_params_file}" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "semantic_similar_pair.tsv" \ + --corpus_file "corpus_file" \ +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `corpus_file`: 召回库数据 corpus_file + +成功运行结束后,会在 `./recall_result_dir/` 目录下产出 `recall_result.txt` 文件,部分召回示例结果如下: +``` +开初婚未育证明怎么弄? 初婚未育证明怎么开? 0.9878678917884827 +开初婚未育证明怎么弄? 初婚未育情况证明怎么开? 0.955365777015686 +开初婚未育证明怎么弄? 初婚未育证明在哪里办理 0.9508345723152161 +开初婚未育证明怎么弄? 到哪里开初婚未育证明? 0.949864387512207 +``` + +接下来,运行如下命令进行效果评估,产出 R@10 和 R@50 指标: +``` + python -u evaluate.py \ + --similar_pair_file "semantic_similar_pair.tsv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标, 分别为 R@10 和 R@50 +``` +51.2 67.242 +``` + +## 开始预测 +我们可以基于语义索引模型抽取文本的语义向量或者计算文本 Pair 的语义相似度,我们以计算文本 Pair 的语义相似度为例: + +### 准备预测数据 +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +西安下雪了?是不是很冷啊? 西安的天气怎么样啊?还在下雪吗? +第一次去见女朋友父母该如何表现? 第一次去见家长该怎么做 +猪的护心肉怎么切 猪的护心肉怎么吃 +显卡驱动安装不了,为什么? 
显卡驱动安装不了怎么回事 +``` + +### 开始预测 +以上述 demo 数据为例,运行如下命令基于我们开源的 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略语义索引模型开始计算文本 Pair 的语义相似度: +``` +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/batch_neg_v1.0.0/model_state.pdparams" \ + --output_emb_size 256 + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file ${your_input_file} +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +产出如下结果 +``` +0.8121148943901062 +0.6034126281738281 +0.968634843826294 +0.9800204038619995 +``` + +## 使用 FastGeneration 加速预测 + +我们基于 Paddle 自定义算子功能集成了[NVIDIA FasterTransformer](https://github.com/NVIDIA/FasterTransformer) 的高性能加速能力,通过简单易用的 Python API 即可得到 GPU 上更高性能预测能力。 +- FT FP32 相比 Paddle 前向加速比为 1.13 ~ 4.36 +- FT FP16 相比 Paddle 前向加速比为 3.65 ~ 5.42 +- 支持 Post-Normalization 和 Pre-Normalizaiton 2 种 Transformer 结构 +- 支持 GELU 和 RELU 2 个激活函数 + +详细性能评测数据如下表: + +| batch size | max_seq_len | Paddle 前向(ms)|FT FP32(ms) | FT FP16(ms) |Speedup(FT FP32/Paddle)|Speedup(FT FP16/Paddle)| +| ---------- | ----------- | ------------------- | ------------------- |------------------ |------------------ |------------------ | +| 16 | 16 | 23.56 | 5.40 | 5.38 | 4.36| 4.38| +| 16 | 32 | 22.34 | 8.11 | 5.57|2.75|4.01| +| 16 | 64 | 22.79 | 14.84 |5.39|1.54|4.23| +| 32 | 16 | 23.41 | 8.16 |5.30|2.87|4.42| +| 32 | 32 | 22.67 | 14.84 |6.21|1.53|3.65| +| 32 | 64 | 33.49 | 28.53 |6.05|1.17|5.54| +| 64 | 16 | 22.60 | 14.81 |5.59|1.53|4.04| +| 64 | 32 | 33.52 | 28.22 |6.24|1.19|5.37| +| 64 | 64 | 62.62 | 55.25 |11.55|1.13|5.42| + +Note: 测试环境如下 +``` +硬件: NVIDIA Tesla V100 16G 单卡 +Paddle Version: 2.2.1 +CUDA: 10.1 +cuDNN: 7.6 +``` + +可参考如下命令使用高性能预测能力 +```shell +python -u -m paddle.distributed.launch --gpus "0" faster_predict.py \ + --params_path "batch_neg_v1.0/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 32 \ + --max_seq_length 64 \ + --use_fp16 \ + --text_pair_file ${your_input_file} \ +``` + +## 模型介绍 +简要介绍 In-batch negatives 策略和 HardestNeg 策略思路 + +### [In-batch negatives](https://arxiv.org/abs/2004.04906) 核心思路 + +In-batch negatives 策略的训练数据为语义相似的 Pair 对,如下所示为 Batch size = 4 的训练数据样例: +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` +In-batch negatives 策略核心是在 1 个 Batch 内同时基于 N 个负例进行梯度更新,将Batch 内除自身之外其它所有 *Source Text* 的相似文本 *Target Text* 作为负例,例如: 上例中 `我手机丢了,我想换个手机` 有 1 个正例(`1.我想买个新手机,求推荐`),3 个负例(`1.求秋色之空全集漫画`,`2.手机学日语的软件`,`3.侠盗飞车罪恶都市怎么改车`)。 + +### HardestNeg 核心思路 +HardestNeg 策略核心是在 1 个 Batch 内的所有负样本中先挖掘出最难区分的负样本,基于最难负样本进行梯度更新。例如: 上例中 *Source Text*: `我手机丢了,我想换个手机` 有 3 个负例(`1.求秋色之空全集漫画`,`2.手机学日语的软件`,`3.侠盗飞车罪恶都市怎么改车`),其中最难区分的负例是 `手机学日语的软件`,模型训练过程中不断挖掘出类似这样的最难负样本,然后基于最难负样本进行梯度更新。 + +## Reference +[1] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, Buzhou Tang, LCQMC: A Large-scale Chinese Question Matching Corpus,COLING2018. +[2] Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, Buzhou Tang, The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification EMNLP2018. +[3] Yang, Y., Zhang, Y., Tar, C., and Baldridge, J., “PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification”, <i>arXiv e-prints</i>, 2019. +[4] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. 
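+
+## 附录: In-batch negatives loss (minimal sketch)
+
+The snippet below is a minimal, self-contained sketch of the In-batch negatives loss described above. It is for illustration only, not the training script itself: the batch size and embedding size are toy values, and ``margin=0.2`` / ``scale=30`` follow the default settings listed in this README; the computation mirrors the idea implemented in `batch_negative/model.py`.
+
+```python
+import paddle
+import paddle.nn.functional as F
+
+# Toy batch: 4 source texts and their semantically similar target texts,
+# each represented by an L2-normalized 256-dim embedding.
+query_emb = F.normalize(paddle.randn([4, 256]), axis=-1)
+title_emb = F.normalize(paddle.randn([4, 256]), axis=-1)
+
+# Row i holds the cosine similarity of source i to every target in the batch;
+# only the diagonal entries correspond to positive pairs.
+cosine_sim = paddle.matmul(query_emb, title_emb, transpose_y=True)
+
+# Subtract the margin from the positive (diagonal) similarities, scale the
+# logits, and treat training as a batch_size-way classification problem.
+margin, scale = 0.2, 30
+cosine_sim = (cosine_sim - margin * paddle.eye(4)) * scale
+labels = paddle.arange(0, 4, dtype="int64").reshape([-1, 1])
+loss = F.cross_entropy(input=cosine_sim, label=labels)
+print(float(loss))
+```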
diff --git a/examples/semantic_indexing/README_gradient_cache.md b/examples/semantic_indexing/README_gradient_cache.md new file mode 100644 index 0000000000000000000000000000000000000000..1f5223c6dfb7c64458a7552a69cc4b86875570cb --- /dev/null +++ b/examples/semantic_indexing/README_gradient_cache.md @@ -0,0 +1,129 @@ +# Gradient Cache策略 [DPR](https://arxiv.org/abs/2004.04906) + + +### 实验结果 + +`Gradient Cache` 的实验结果如下,使用的评估指标是`Accuracy`: + +| DPR method | TOP-5 | TOP-10 | TOP-50| 说明 | +| :-----: | :----: | :----: | :----: | :---- | +| Gradient_cache | 68.1 | 79.4| 86.2 | DPR结合GC策略训练 +| GC_Batch_size_512 | 67.3 | 79.6| 86.3| DPR结合GC策略训练,且batch_size设置为512| + +实验对应的超参数如下: + +| Hyper Parameter | batch_size| learning_rate| warmup_steps| epoches| chunk_size|max_grad_norm | +| :----: | :----: | :----: | :----: | :---: | :----: | :----: | +| \ | 128/512| 2e-05 | 1237 | 40 | 2| 16/8 | + +## 数据准备 +我们使用Dense Passage Retrieval的[原始仓库](https://github.com/Elvisambition/DPR) +中提供的数据集进行训练和评估。可以使用[download_data.py](https://github.com/Elvisambition/DPR/blob/main/dpr/data/download_data.py) +脚本下载所需数据集。 数据集详细介绍见[原仓库](https://github.com/Elvisambition/DPR) 。 + +### 数据格式 +``` +[ + { + "question": "....", + "answers": ["...", "...", "..."], + "positive_ctxs": [{ + "title": "...", + "text": "...." + }], + "negative_ctxs": ["..."], + "hard_negative_ctxs": ["..."] + }, + ... +] +``` + +### 数据下载 +在[原始仓库](https://github.com/Elvisambition/DPR) +下使用命令 +``` +python data/download_data.py --resource data.wikipedia_split.psgs_w100 +python data/download_data.py --resource data.retriever.nq +python data/download_data.py --resource data.retriever.qas.nq +``` +### 单独下载链接 +[data.retriever.nq-train](https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz) +[data.retriever.nq-dev](https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz) +[data.retriever.qas.nq-dev](https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-dev.qa.csv) +[data.retriever.qas.nq-test](https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-test.qa.csv) +[data.retriever.qas.nq-train](https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv) +[psgs_w100.tsv](https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz) + + +## 代码结构及说明 +``` +|—— train_gradient_cache_DPR.py # gradient_cache实现dense passage retrieval训练脚本 +|—— train_gradient_cache.py # gradient_cache算法简单实现 +|—— NQdataset.py # NQ数据集封装 +|—— generate_dense_embeddings.py # 生成文本的稠密表示 +|—— faiss_indexer.py # faiss相关indexer封装 +|—— dense_retriever.py # 召回,指标检测 +|—— qa_validation.py # 相关计算匹配函数 +|—— tokenizers.py # tokenizer封装 +``` + +## 模型训练 +### 基于 [Dense Passage Retriever](https://arxiv.org/abs/2004.04906) 策略训练 +``` +python train_gradient_cache_DPR.py \ + --batch_size 128 \ + --learning_rate 2e-05 \ + --save_dir save_biencoder + --warmup_steps 1237 \ + --epoches 40 \ + --max_grad_norm 2 \ + --train_data_path ./dataset_dir/biencoder-nq-train.json \ + --chunk_size 16 \ +``` + +参数含义说明 +* `batch_size`: 批次大小 +* `learning_rate`: 学习率 +* `save_dir`: 模型保存位置 +* `warmupsteps`: 预热学习率参数 +* `epoches`: 训练批次大小 +* `max_grad_norm`: 详见ClipGradByGlobalNorm +* `train_data_path`: 训练数据存放地址 +* `chunk_size`: chunk的大小 + +## 生成文章稠密向量表示 + +``` +python generate_dense_embeddings.py \ + --ctx_file ./dataset_dir/psgs_w100.tsv \ + --out_file test_generate \ + --que_model_path ./save_dir/question_model_40 \ + --con_model_path ./save_dir/context_model_40 +``` + + +参数含义说明 +* `ctx_file`: ctx文件读取地址 +* `out_file`: 生成后的文件输出地址 +* `que_model_path`: question model path +* `con_model_path`: context 
model path + + +## 针对全部文档的检索器验证 +``` +python dense_retriever.py --hnsw_index \ + --out_file out_file \ + --encoded_ctx_file ./test_generate \ + --ctx_file ./dataset_dir/psgs_w100.tsv \ + --qa_file ./dataset_dir/nq.qa.csv \ + --que_model_path ./save_dir/question_model_40 \ + --con_model_path ./save_dir/context_model_40 +``` +参数含义说明 +* `hnsw_index`:使用hnsw_index +* `outfile`: 输出文件地址 +* `encoded_ctx_file`: 编码后的ctx文件 +* `ctx_file`: ctx文件 +* `qa_file`: qa_file文件 +* `que_model_path`: question encoder model +* `con_model_path`: context encoder model diff --git a/examples/semantic_indexing/ance/model.py b/examples/semantic_indexing/ance/model.py new file mode 100644 index 0000000000000000000000000000000000000000..bfee796e7d0789b0e02da0a0663805c0220a3d46 --- /dev/null +++ b/examples/semantic_indexing/ance/model.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexANCE(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + + def forward( + self, + text_input_ids, + pos_sample_input_ids, + neg_sample_input_ids, + text_token_type_ids=None, + text_position_ids=None, + text_attention_mask=None, + pos_sample_token_type_ids=None, + pos_sample_position_ids=None, + pos_sample_attention_mask=None, + neg_sample_token_type_ids=None, + neg_sample_position_ids=None, + neg_sample_attention_mask=None, + ): + + text_cls_embedding = self.get_pooled_embedding( + text_input_ids, text_token_type_ids, text_position_ids, text_attention_mask + ) + + pos_sample_cls_embedding = self.get_pooled_embedding( + pos_sample_input_ids, pos_sample_token_type_ids, pos_sample_position_ids, pos_sample_attention_mask + ) + + neg_sample_cls_embedding = self.get_pooled_embedding( + neg_sample_input_ids, neg_sample_token_type_ids, neg_sample_position_ids, neg_sample_attention_mask + ) + + pos_sample_sim = paddle.sum(text_cls_embedding * pos_sample_cls_embedding, axis=-1) + + # Note: The negatives samples is sampled by ANN engine in global corpus + # Please refer to run_ann_data_gen.py + global_neg_sample_sim = paddle.sum(text_cls_embedding * neg_sample_cls_embedding, axis=-1) + + labels = paddle.full(shape=[text_cls_embedding.shape[0]], fill_value=1.0, dtype=paddle.get_default_dtype()) + + loss = F.margin_ranking_loss(pos_sample_sim, global_neg_sample_sim, labels, margin=self.margin) + + return loss diff --git a/examples/semantic_indexing/ann_util.py b/examples/semantic_indexing/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..55c608d3e58c37c0d9baf884b270178d3ac5da7f --- /dev/null +++ b/examples/semantic_indexing/ann_util.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/examples/semantic_indexing/base_model.py b/examples/semantic_indexing/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b7b1dcd27449a0a1a02c0de455ca3ff6fecc406c --- /dev/null +++ b/examples/semantic_indexing/base_model.py @@ -0,0 +1,106 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
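+
+"""Base class shared by the semantic indexing models.
+
+``SemanticIndexBase`` pools the [CLS] representation of a pretrained encoder,
+optionally projects it to ``output_emb_size`` dimensions, L2-normalizes it,
+and provides the ``cosine_sim`` and ``get_semantic_embedding`` helpers;
+subclasses implement ``forward`` to define their training loss.
+"""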
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None, use_fp16=False): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.use_fp16 = use_fp16 + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + if self.use_fp16: + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype) * -1e4, axis=[1, 2] + ) + + embedding_output = self.ptm.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + embedding_output = paddle.cast(embedding_output, "float16") + attention_mask = paddle.cast(attention_mask, "float16") + + encoder_outputs = self.ptm.encoder(embedding_output, attention_mask) + + if self.use_fp16: + encoder_outputs = paddle.cast(encoder_outputs, "float32") + cls_embedding = self.ptm.pooler(encoder_outputs) + else: + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass diff --git a/examples/semantic_indexing/batch_negative/model.py b/examples/semantic_indexing/batch_negative/model.py new file mode 100644 index 0000000000000000000000000000000000000000..fd87c6d8363efc4f54db6c6bd5d7b623ea68ab59 --- /dev/null +++ b/examples/semantic_indexing/batch_negative/model.py @@ -0,0 +1,65 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle
+import paddle.nn.functional as F
+from base_model import SemanticIndexBase
+
+
+class SemanticIndexBatchNeg(SemanticIndexBase):
+    def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None):
+        super().__init__(pretrained_model, dropout, output_emb_size)
+
+        self.margin = margin
+        # Scale the cosine similarities to ease convergence during training.
+        self.scale = scale
+
+    def forward(
+        self,
+        query_input_ids,
+        title_input_ids,
+        query_token_type_ids=None,
+        query_position_ids=None,
+        query_attention_mask=None,
+        title_token_type_ids=None,
+        title_position_ids=None,
+        title_attention_mask=None,
+    ):
+
+        query_cls_embedding = self.get_pooled_embedding(
+            query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask
+        )
+
+        title_cls_embedding = self.get_pooled_embedding(
+            title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask
+        )
+
+        cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)
+
+        # Subtract the margin from the positive-pair (diagonal) cosine similarities.
+        margin_diag = paddle.full(
+            shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
+        )
+
+        cosine_sim = cosine_sim - paddle.diag(margin_diag)
+
+        # Scale the similarities to ease training convergence.
+        cosine_sim *= self.scale
+
+        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
+        labels = paddle.reshape(labels, shape=[-1, 1])
+
+        loss = F.cross_entropy(input=cosine_sim, label=labels)
+
+        return loss
diff --git a/examples/semantic_indexing/biencoder_base_model.py b/examples/semantic_indexing/biencoder_base_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..a3649db7c2be67688e0b6b359b7fb6155bb4c299
--- /dev/null
+++ b/examples/semantic_indexing/biencoder_base_model.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
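+
+"""Dual-encoder (DPR-style) model definition.
+
+``BiEncoder`` wraps a question encoder and a context encoder and returns their
+pooled [CLS] embeddings; ``BiEncoderNllLoss`` scores every question against all
+passages in the batch and computes the negative log-likelihood of the positive
+passage for each question.
+"""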
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class BiEncoder(nn.Layer): + """dual-encoder model + + Attributes: + state: for question or for context + question_encoder: used to code the problem + context_encoder: used to code the context + + """ + + def __init__(self, question_encoder, context_encoder, state=None): + super(BiEncoder, self).__init__() + self.state = state + if self.state is None: + self.question_encoder = question_encoder + self.context_encoder = context_encoder + elif self.state == "FORQUESTION": + self.question_encoder = question_encoder + elif self.state == "FORCONTEXT": + self.context_encoder = context_encoder + + def get_question_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.question_encoder(input_ids, token_type_ids, position_ids, attention_mask) + """cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)""" + + return cls_embedding + + def get_context_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.context_encoder(input_ids, token_type_ids, position_ids, attention_mask) + """cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)""" + + return cls_embedding + + def forward( + self, + question_id, + question_segments, + question_attn_mask, + context_ids, + context_segments, + context_attn_mask, + ): + + question_pooled_out = self.get_question_pooled_embedding(question_id, question_segments, question_attn_mask) + context_pooled_out = self.get_context_pooled_embedding(context_ids, context_segments, context_attn_mask) + + return question_pooled_out, context_pooled_out + + +class BiEncoderNllLoss(object): + """ + calculate the nll loss for dual-encoder model + """ + + def calc(self, q_vectors, ctx_vectors, positive_idx_per_question, loss_scale=None): + + scorces = paddle.matmul(q_vectors, paddle.transpose(ctx_vectors, [1, 0])) + + # if len(q_vectors.shape()) > 1: + q_num = q_vectors.shape[0] + scores = scorces.reshape([q_num, -1]) + + softmax_scorces = F.log_softmax(scores, axis=1) + + loss = F.nll_loss(softmax_scorces, paddle.to_tensor(positive_idx_per_question)) + + correct_predictions_count = None + + if loss_scale: + loss.mul_(loss_scale) + + return loss, correct_predictions_count diff --git a/examples/semantic_indexing/data.py b/examples/semantic_indexing/data.py new file mode 100644 index 0000000000000000000000000000000000000000..c8e2e232f370bc986751bc56eda2e195f7fc3a36 --- /dev/null +++ b/examples/semantic_indexing/data.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
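+
+"""Data utilities for the semantic indexing examples.
+
+Includes dataloader construction, tokenization of text pairs and triplets,
+readers for TSV pair/triplet files, and helpers used by the ANN (active
+learning) loop to locate the latest checkpoint and generated ann data.
+"""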
+ +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text": data[0], "pos_sample": data[1], "neg_sample": data[2]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, 
"new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/examples/semantic_indexing/dense_retriever.py b/examples/semantic_indexing/dense_retriever.py new file mode 100644 index 0000000000000000000000000000000000000000..030e3726c48884add5f8f9e191033b1fd34dade7 --- /dev/null +++ b/examples/semantic_indexing/dense_retriever.py @@ -0,0 +1,288 @@ +#!/usr/bin/env python3 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright GC-DPR authors. +# Copyright (c) Facebook, Inc. and its affiliates. +# All rights reserved. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
+""" + Command line tool to get dense results and validate them +""" + +import argparse +import csv +import glob +import gzip +import json +import logging +import pickle +import time +from typing import Dict, Iterator, List, Tuple + +import numpy as np +import paddle +from biencoder_base_model import BiEncoder +from faiss_indexer import DenseFlatIndexer, DenseHNSWFlatIndexer, DenseIndexer +from NQdataset import BertTensorizer +from paddle import Tensor as T +from paddle import nn +from qa_validation import calculate_matches + +from paddlenlp.transformers.bert.modeling import BertModel + +logger = logging.getLogger() +logger.setLevel(logging.INFO) +if logger.hasHandlers(): + logger.handlers.clear() +console = logging.StreamHandler() +logger.addHandler(console) + + +class DenseRetriever(object): + """ + Does passage retrieving over the provided index and question encoder + """ + + def __init__(self, question_encoder: nn.Layer, batch_size: int, tensorizer: BertTensorizer, index: DenseIndexer): + self.question_encoder = question_encoder + self.batch_size = batch_size + self.tensorizer = tensorizer + self.index = index + + def generate_question_vectors(self, questions: List[str]) -> T: + n = len(questions) + bsz = self.batch_size + query_vectors = [] + + self.question_encoder.eval() + + with paddle.no_grad(): + for j, batch_start in enumerate(range(0, n, bsz)): + + batch_token_tensors = [ + self.tensorizer.text_to_tensor(q) for q in questions[batch_start : batch_start + bsz] + ] + q_ids_batch = paddle.stack(batch_token_tensors, axis=0) + q_seg_batch = paddle.zeros_like(q_ids_batch) + out = self.question_encoder.get_question_pooled_embedding(q_ids_batch, q_seg_batch) + query_vectors.extend(out) + if len(query_vectors) % 100 == 0: + logger.info("Encoded queries %d", len(query_vectors)) + + query_tensor = paddle.to_tensor(query_vectors) + logger.info("Total encoded queries tensor %s", query_tensor.shape[0]) + assert query_tensor.shape[0] == len(questions) + return query_tensor + + def get_top_docs(self, query_vectors: np.array, top_docs: int = 100) -> List[Tuple[List[object], List[float]]]: + """ + Does the retrieval of the best matching passages given the query vectors batch + :param query_vectors: + :param top_docs: + :return: + """ + time0 = time.time() + results = self.index.search_knn(query_vectors, top_docs) + logger.info("index search time: %f sec.", time.time() - time0) + return results + + +def parse_qa_csv_file(location) -> Iterator[Tuple[str, List[str]]]: + with open(location) as ifile: + reader = csv.reader(ifile, delimiter="\t") + for row in reader: + question = row[0] + answers = eval(row[1]) + yield question, answers + + +def validate( + passages: Dict[object, Tuple[str, str]], + answers: List[List[str]], + result_ctx_ids: List[Tuple[List[object], List[float]]], + workers_num: int, + match_type: str, +) -> List[List[bool]]: + match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type) + top_k_hits = match_stats.top_k_hits + + logger.info("Validation results: top k documents hits %s", top_k_hits) + top_k_hits = [v / len(result_ctx_ids) for v in top_k_hits] + logger.info("Validation results: top k documents hits accuracy %s", top_k_hits) + return match_stats.questions_doc_hits + + +def load_passages(ctx_file: str) -> Dict[object, Tuple[str, str]]: + docs = {} + logger.info("Reading data from: %s", ctx_file) + if ctx_file.endswith(".gz"): + with gzip.open(ctx_file, "rt") as tsvfile: + reader = csv.reader( + tsvfile, + delimiter="\t", + ) + # file format: 
doc_id, doc_text, title + for row in reader: + if row[0] != "id": + docs[row[0]] = (row[1], row[2]) + else: + with open(ctx_file) as tsvfile: + reader = csv.reader( + tsvfile, + delimiter="\t", + ) + # file format: doc_id, doc_text, title + for row in reader: + if row[0] != "id": + docs[row[0]] = (row[1], row[2]) + return docs + + +def save_results( + passages: Dict[object, Tuple[str, str]], + questions: List[str], + answers: List[List[str]], + top_passages_and_scores: List[Tuple[List[object], List[float]]], + per_question_hits: List[List[bool]], + out_file: str, +): + # join passages text with the result ids, their questions and assigning has|no answer labels + merged_data = [] + assert len(per_question_hits) == len(questions) == len(answers) + for i, q in enumerate(questions): + q_answers = answers[i] + results_and_scores = top_passages_and_scores[i] + hits = per_question_hits[i] + docs = [passages[doc_id] for doc_id in results_and_scores[0]] + scores = [str(score) for score in results_and_scores[1]] + ctxs_num = len(hits) + + merged_data.append( + { + "question": q, + "answers": q_answers, + "ctxs": [ + { + "id": results_and_scores[0][c], + "title": docs[c][1], + "text": docs[c][0], + "score": scores[c], + "has_answer": hits[c], + } + for c in range(ctxs_num) + ], + } + ) + + with open(out_file, "w") as writer: + writer.write(json.dumps(merged_data, indent=4) + "\n") + logger.info("Saved results * scores to %s", out_file) + + +def iterate_encoded_files(vector_files: list) -> Iterator[Tuple[object, np.array]]: + for i, file in enumerate(vector_files): + logger.info("Reading file %s", file) + with open(file, "rb") as reader: + doc_vectors = pickle.load(reader) + for doc in doc_vectors: + db_id, doc_vector = doc + yield db_id, doc_vector + + +def main(args): + + tensorizer = BertTensorizer() + question_model = BertModel.from_pretrained(args.que_model_path) + context_model = BertModel.from_pretrained(args.con_model_path) + model = BiEncoder(question_encoder=question_model, context_encoder=context_model) + model.eval() + if args.hnsw_index: + index = DenseHNSWFlatIndexer(768, args.index_buffer) + else: + index = DenseFlatIndexer(768, args.index_buffer) + + retriever = DenseRetriever(model, args.batch_size, tensorizer, index) + # get questions & answers + questions = [] + question_answers = [] + for ds_item in parse_qa_csv_file(args.qa_file): + question, answers = ds_item + questions.append(question) + question_answers.append(answers) + questions_tensor = retriever.generate_question_vectors(questions) + # index all passages + ctx_files_pattern = args.encoded_ctx_file + input_paths = glob.glob(ctx_files_pattern) + + logger.info("Reading all passages data from files: %s", input_paths) + retriever.index.index_data(input_paths) + + # get top k results + top_ids_and_scores = retriever.get_top_docs(questions_tensor.numpy(), args.n_docs) + all_passages = load_passages(args.ctx_file) + if len(all_passages) == 0: + raise RuntimeError("No passages data found. 
Please specify ctx_file param properly.") + questions_doc_hits = validate( + all_passages, question_answers, top_ids_and_scores, args.validation_workers, args.match + ) + if args.out_file: + save_results(all_passages, questions, question_answers, top_ids_and_scores, questions_doc_hits, args.out_file) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--qa_file", + required=True, + type=str, + default=None, + help="Question and answers file of the format: question \\t ['answer1','answer2', ...]", + ) + parser.add_argument( + "--ctx_file", + required=True, + type=str, + default=None, + help="All passages file in the tsv format: id \\t passage_text \\t title", + ) + parser.add_argument( + "--encoded_ctx_file", + type=str, + default=None, + help="Glob path to encoded passages (from generate_dense_embeddings tool)", + ) + parser.add_argument("--out_file", type=str, default=None, help="output .json file path to write results to ") + parser.add_argument( + "--match", type=str, default="string", choices=["regex", "string"], help="Answer matching logic type" + ) + parser.add_argument("--n-docs", type=int, default=200, help="Amount of top docs to return") + parser.add_argument( + "--validation_workers", type=int, default=16, help="Number of parallel processes to validate results" + ) + parser.add_argument("--batch_size", type=int, default=32, help="Batch size for question encoder forward pass") + parser.add_argument( + "--index_buffer", type=int, default=50000, help="Temporal memory data buffer size (in samples) for indexer" + ) + parser.add_argument( + "--hnsw_index", action="store_true", help="If enabled, use inference time efficient HNSW index" + ) + parser.add_argument("--que_model_path", required=True, type=str) + parser.add_argument("--con_model_path", required=True, type=str) + args = parser.parse_args() + + main(args) diff --git a/examples/semantic_indexing/evaluate.py b/examples/semantic_indexing/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..7107c08961aa61f529f78044d5337924f0a83735 --- /dev/null +++ b/examples/semantic_indexing/evaluate.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
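+
+# NOTE: illustrative summary added for readers; not part of the original example code.
+# This script reports Recall@N for the output of recall.py. Each line of --recall_result_file
+# is "query \t recalled_text \t distance", and every --recall_num consecutive lines belong to
+# one query; a line counts as a hit when recalled_text equals the gold paraphrase recorded for
+# that query in --similar_text_pair.
+#
+# Worked example of the metric itself: with per-query hit flags
+#     rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+# recall(rs, N=1) averages [0, 0, 1] -> 0.333..., while recall(rs, N=3) averages [1, 1, 1] -> 1.0.
+# The script prints Recall@10 and Recall@50 (as percentages) separated by a tab.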
+ +import argparse + +import numpy as np + +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default="", help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default="", help="The full path of recall result file") +parser.add_argument( + "--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query" +) +args = parser.parse_args() + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text == recalled_text: + continue + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + for topN in (10, 50): + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + print("\t".join(recall_N)) diff --git a/examples/semantic_indexing/faiss_indexer.py b/examples/semantic_indexing/faiss_indexer.py new file mode 100644 index 0000000000000000000000000000000000000000..a0a2eb9aa7dbaad477e2469dff8bb887ffe03734 --- /dev/null +++ b/examples/semantic_indexing/faiss_indexer.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python3 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright (c) Facebook, Inc. and its affiliates. +# All rights reserved. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
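+
+# NOTE: illustrative summary added for readers; not part of the upstream DPR source.
+# Two indexers are provided below:
+#   * DenseFlatIndexer: exact inner-product search via faiss.IndexFlatIP.
+#   * DenseHNSWFlatIndexer: approximate search via faiss.IndexHNSWFlat, which supports only L2
+#     distance. Inner-product search is reduced to L2 search by appending one auxiliary
+#     dimension: each document vector d is stored as [d, sqrt(phi - ||d||^2)] with
+#     phi = max_d ||d||^2, and each query q is searched as [q, 0]. Then
+#         ||[q, 0] - [d, aux]||^2 = ||q||^2 + phi - 2 * <q, d>,
+#     so ranking by ascending L2 distance is equivalent to ranking by descending inner product.
+#     This is why _set_phi() must scan all vectors before any batch is indexed.
+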
+""" + FAISS-based index components for dense retriver +""" + +import os +import time +import logging +import pickle +from typing import List, Tuple, Iterator + +import faiss +import numpy as np + +logger = logging.getLogger() + + +class DenseIndexer(object): + """ + Class for building, saving, and finding indexes + """ + + def __init__(self, buffer_size: int = 50000): + self.buffer_size = buffer_size + self.index_id_to_db_id = [] + self.index = None + + def index_data(self, vector_files: List[str]): + start_time = time.time() + buffer = [] + for i, item in enumerate(iterate_encoded_files(vector_files)): + db_id, doc_vector = item + buffer.append((db_id, doc_vector)) + if 0 < self.buffer_size == len(buffer): + # indexing in batches is beneficial for many faiss index types + self._index_batch(buffer) + logger.info( + "data indexed %d, used_time: %f sec.", len(self.index_id_to_db_id), time.time() - start_time + ) + buffer = [] + self._index_batch(buffer) + + indexed_cnt = len(self.index_id_to_db_id) + logger.info("Total data indexed %d", indexed_cnt) + logger.info("Data indexing completed.") + + def _index_batch(self, data: List[Tuple[object, np.array]]): + raise NotImplementedError + + def search_knn(self, query_vectors: np.array, top_docs: int) -> List[Tuple[List[object], List[float]]]: + raise NotImplementedError + + def serialize(self, file: str): + logger.info("Serializing index to %s", file) + + if os.path.isdir(file): + index_file = os.path.join(file, "index.dpr") + meta_file = os.path.join(file, "index_meta.dpr") + else: + index_file = file + ".index.dpr" + meta_file = file + ".index_meta.dpr" + + faiss.write_index(self.index, index_file) + with open(meta_file, mode="wb") as f: + pickle.dump(self.index_id_to_db_id, f) + + def deserialize_from(self, file: str): + logger.info("Loading index from %s", file) + + if os.path.isdir(file): + index_file = os.path.join(file, "index.dpr") + meta_file = os.path.join(file, "index_meta.dpr") + else: + index_file = file + ".index.dpr" + meta_file = file + ".index_meta.dpr" + + self.index = faiss.read_index(index_file) + logger.info("Loaded index of type %s and size %d", type(self.index), self.index.ntotal) + + with open(meta_file, "rb") as reader: + self.index_id_to_db_id = pickle.load(reader) + assert ( + len(self.index_id_to_db_id) == self.index.ntotal + ), "Deserialized index_id_to_db_id should match faiss index size" + + def _update_id_mapping(self, db_ids: List): + self.index_id_to_db_id.extend(db_ids) + + +class DenseFlatIndexer(DenseIndexer): + def __init__(self, vector_sz: int, buffer_size: int = 50000): + super(DenseFlatIndexer, self).__init__(buffer_size=buffer_size) + self.index = faiss.IndexFlatIP(vector_sz) + + def _index_batch(self, data: List[Tuple[object, np.array]]): + db_ids = [t[0] for t in data] + vectors = [np.reshape(t[1], (1, -1)) for t in data] + vectors = np.concatenate(vectors, axis=0) + self._update_id_mapping(db_ids) + self.index.add(vectors) + + def search_knn(self, query_vectors: np.array, top_docs: int) -> List[Tuple[List[object], List[float]]]: + scores, indexes = self.index.search(query_vectors, top_docs) + # convert to external ids + db_ids = [[self.index_id_to_db_id[i] for i in query_top_idxs] for query_top_idxs in indexes] + result = [(db_ids[i], scores[i]) for i in range(len(db_ids))] + return result + + +class DenseHNSWFlatIndexer(DenseIndexer): + """ + Efficient index for retrieval. 
Note: default settings are for hugh accuracy but also high RAM usage + """ + + def __init__( + self, + vector_sz: int, + buffer_size: int = 50000, + store_n: int = 512, + ef_search: int = 128, + ef_construction: int = 200, + ): + super(DenseHNSWFlatIndexer, self).__init__(buffer_size=buffer_size) + + # IndexHNSWFlat supports L2 similarity only + # so we have to apply DOT -> L2 similairy space conversion with the help of an extra dimension + index = faiss.IndexHNSWFlat(vector_sz + 1, store_n) + index.hnsw.efSearch = ef_search + index.hnsw.efConstruction = ef_construction + self.index = index + self.phi = None + + def index_data(self, vector_files: List[str]): + self._set_phi(vector_files) + + super(DenseHNSWFlatIndexer, self).index_data(vector_files) + + def _set_phi(self, vector_files: List[str]): + """ + Calculates the max norm from the whole data and assign it to self.phi: necessary to transform IP -> L2 space + :param vector_files: file names to get passages vectors from + :return: + """ + phi = 0 + for i, item in enumerate(iterate_encoded_files(vector_files)): + id, doc_vector = item + norms = (doc_vector**2).sum() + phi = max(phi, norms) + logger.info("HNSWF DotProduct -> L2 space phi={}".format(phi)) + self.phi = phi + + def _index_batch(self, data: List[Tuple[object, np.array]]): + # max norm is required before putting all vectors in the index to convert inner product similarity to L2 + if self.phi is None: + raise RuntimeError( + "Max norm needs to be calculated from all data at once," + "results will be unpredictable otherwise." + "Run `_set_phi()` before calling this method." + ) + + db_ids = [t[0] for t in data] + vectors = [np.reshape(t[1], (1, -1)) for t in data] + + norms = [(doc_vector**2).sum() for doc_vector in vectors] + aux_dims = [np.sqrt(self.phi - norm) for norm in norms] + hnsw_vectors = [np.hstack((doc_vector, aux_dims[i].reshape(-1, 1))) for i, doc_vector in enumerate(vectors)] + hnsw_vectors = np.concatenate(hnsw_vectors, axis=0) + + self._update_id_mapping(db_ids) + self.index.add(hnsw_vectors) + + def search_knn(self, query_vectors: np.array, top_docs: int) -> List[Tuple[List[object], List[float]]]: + + aux_dim = np.zeros(len(query_vectors), dtype="float32") + query_nhsw_vectors = np.hstack((query_vectors, aux_dim.reshape(-1, 1))) + logger.info("query_hnsw_vectors %s", query_nhsw_vectors.shape) + scores, indexes = self.index.search(query_nhsw_vectors, top_docs) + # convert to external ids + db_ids = [[self.index_id_to_db_id[i] for i in query_top_idxs] for query_top_idxs in indexes] + result = [(db_ids[i], scores[i]) for i in range(len(db_ids))] + return result + + def deserialize_from(self, file: str): + super(DenseHNSWFlatIndexer, self).deserialize_from(file) + # to trigger warning on subsequent indexing + self.phi = None + + +def iterate_encoded_files(vector_files: list) -> Iterator[Tuple[object, np.array]]: + for i, file in enumerate(vector_files): + logger.info("Reading file %s", file) + with open(file, "rb") as reader: + doc_vectors = pickle.load(reader) + for doc in doc_vectors: + db_id, doc_vector = doc + yield db_id, doc_vector diff --git a/examples/semantic_indexing/fast_predict.py b/examples/semantic_indexing/fast_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..fb63539b208406390ed96a87daf38d6d11a92e6b --- /dev/null +++ b/examples/semantic_indexing/fast_predict.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.ops import disable_fast_encoder, enable_fast_encoder +from paddlenlp.transformers import ErnieModel, ErnieTokenizer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") + parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") + parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") + parser.add_argument( + "--max_seq_length", + default=64, + type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--dropout", default=0.0, type=float, help="Dropout probability.") + parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--seed", default=42, type=int, help="Random seed.") + parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max_seq_len.") + parser.add_argument("--use_fp16", action="store_true", help="Whether to use fp16.") + + args = parser.parse_args() + return args + + +class SemanticIndexingPredictor(nn.Layer): + def __init__(self, pretrained_model, output_emb_size, bos_id=0, dropout=0, use_fp16=False): + super(SemanticIndexingPredictor, self).__init__() + self.bos_id = bos_id + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.0) + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.use_fp16 = use_fp16 + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None): + src_mask = input_ids == self.bos_id + src_mask = paddle.cast(src_mask, "float32") + # [bs, 1, 1, max_len] + src_mask = paddle.unsqueeze(src_mask, axis=[1, 2]) + src_mask.stop_gradient = True + + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + + embedding_output = self.ptm.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if self.use_fp16: + embedding_output = paddle.cast(embedding_output, "float16") + + sequence_output = self.ptm.encoder(embedding_output, src_mask) + + if self.use_fp16: + sequence_output = paddle.cast(sequence_output, "float32") + + 
cls_embedding = self.ptm.pooler(sequence_output) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + title_token_type_ids=None, + title_position_ids=None, + ): + query_cls_embedding = self.get_pooled_embedding(query_input_ids, query_token_type_ids, query_position_ids) + title_cls_embedding = self.get_pooled_embedding(title_input_ids, title_token_type_ids, title_position_ids) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def load(self, init_from_params): + if init_from_params and os.path.isfile(init_from_params): + state_dict = paddle.load(init_from_params) + self.set_state_dict(state_dict) + print("Loaded parameters from %s" % init_from_params) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + +def do_predict(args): + paddle.set_device("gpu") + paddle.seed(args.seed) + tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0") + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + + def batchify_fn(samples): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ) + return [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = ErnieModel.from_pretrained("ernie-1.0") + + model = SemanticIndexingPredictor( + pretrained_model, args.output_emb_size, dropout=args.dropout, use_fp16=args.use_fp16 + ) + model.eval() + model.load(args.params_path) + model = enable_fast_encoder(model, use_fp16=args.use_fp16) + + cosine_sims = [] + for batch_data in valid_data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + batch_cosine_sim = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + for cosine in cosine_sims: + print("{}".format(cosine)) + model = disable_fast_encoder(model) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + do_predict(args) diff --git a/examples/semantic_indexing/generate_dense_embeddings.py b/examples/semantic_indexing/generate_dense_embeddings.py new file mode 100644 index 0000000000000000000000000000000000000000..b28d5a4e24e5546add92c1a70eede3abddae04a4 --- /dev/null +++ b/examples/semantic_indexing/generate_dense_embeddings.py @@ -0,0 +1,143 @@ +#!/usr/bin/env python3 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright GC-DPR authors. +# Copyright (c) Facebook, Inc. and its affiliates. +# All rights reserved. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. +""" + Command line tool that produces embeddings for a large documents base based on the pretrained ctx & question encoders + Supposed to be used in a 'sharded' way to speed up the process. +""" +import argparse +import csv +import logging +import os +import pathlib +import pickle +from typing import List, Tuple + +import numpy as np +import paddle +from biencoder_base_model import BiEncoder +from NQdataset import BertTensorizer +from paddle import nn +from paddle.io import DataLoader, Dataset +from tqdm import tqdm + +from paddlenlp.transformers.bert.modeling import BertModel + +logger = logging.getLogger() +logger.setLevel(logging.INFO) +if logger.hasHandlers(): + logger.handlers.clear() +console = logging.StreamHandler() +logger.addHandler(console) + + +class CtxDataset(Dataset): + def __init__(self, ctx_rows: List[Tuple[object, str, str]], tensorizer: BertTensorizer, insert_title: bool = True): + self.rows = ctx_rows + self.tensorizer = tensorizer + self.insert_title = insert_title + + def __len__(self): + return len(self.rows) + + def __getitem__(self, item): + ctx = self.rows[item] + + return self.tensorizer.text_to_tensor(ctx[1], title=ctx[2] if self.insert_title else None) + + +def no_op_collate(xx: List[object]): + return xx + + +def gen_ctx_vectors( + ctx_rows: List[Tuple[object, str, str]], model: nn.Layer, tensorizer: BertTensorizer, insert_title: bool = True +) -> List[Tuple[object, np.array]]: + bsz = args.batch_size + total = 0 + results = [] + + dataset = CtxDataset(ctx_rows, tensorizer, insert_title) + loader = DataLoader( + dataset, shuffle=False, num_workers=2, collate_fn=no_op_collate, drop_last=False, batch_size=bsz + ) + + for batch_id, batch_token_tensors in enumerate(tqdm(loader)): + ctx_ids_batch = paddle.stack(batch_token_tensors, axis=0) + ctx_seg_batch = paddle.zeros_like(ctx_ids_batch) + with paddle.no_grad(): + out = model.get_context_pooled_embedding(ctx_ids_batch, ctx_seg_batch) + + out = out.astype("float32").cpu() + batch_start = batch_id * bsz + ctx_ids = [r[0] for r in ctx_rows[batch_start : batch_start + bsz]] + assert len(ctx_ids) == out.shape[0] + total += len(ctx_ids) + results.extend([(ctx_ids[i], out[i].reshape([-1]).numpy()) for i in range(out.shape[0])]) + + return results + + +def main(args): + + tensorizer = BertTensorizer() + question_model = BertModel.from_pretrained(args.que_model_path) + context_model = BertModel.from_pretrained(args.con_model_path) + model = BiEncoder(question_encoder=question_model, context_encoder=context_model) + + rows = [] + with open(args.ctx_file) as tsvfile: + reader = csv.reader(tsvfile, delimiter="\t") + # file format: doc_id, doc_text, title + rows.extend([(row[0], row[1], row[2]) for row in reader if row[0] != "id"]) + + shard_size = 
int(len(rows) / args.num_shards) + start_idx = args.shard_id * shard_size + end_idx = start_idx + shard_size + + logger.info("Producing encodings for passages range: %d to %d (out of total %d)", start_idx, end_idx, len(rows)) + rows = rows[start_idx:end_idx] + data = gen_ctx_vectors(rows, model, tensorizer, True) + file = args.out_file + "_" + str(args.shard_id) + ".pkl" + pathlib.Path(os.path.dirname(file)).mkdir(parents=True, exist_ok=True) + logger.info("Writing results to %s" % file) + with open(file, mode="wb") as f: + pickle.dump(data, f) + + logger.info("Total passages processed %d. Written to %s", len(data), file) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + parser.add_argument("--ctx_file", type=str, default=None, help="Path to passages set .tsv file") + parser.add_argument( + "--out_file", required=True, type=str, default=None, help="output file path to write results to" + ) + parser.add_argument("--shard_id", type=int, default=0, help="Number(0-based) of data shard to process") + parser.add_argument("--num_shards", type=int, default=1, help="Total amount of data shards") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size for the passage encoder forward pass") + parser.add_argument("--que_model_path", type=str) + parser.add_argument("--con_model_path", type=str) + args = parser.parse_args() + + main(args) diff --git a/examples/semantic_indexing/gradient_cache/model.py b/examples/semantic_indexing/gradient_cache/model.py new file mode 100644 index 0000000000000000000000000000000000000000..04745d09788948faa9fbaf69b3d9dc17b752f0e2 --- /dev/null +++ b/examples/semantic_indexing/gradient_cache/model.py @@ -0,0 +1,93 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
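+
+# NOTE: illustrative summary added for readers; not part of the original example code.
+# This variant of the in-batch-negative model supports gradient-cache training (GC-DPR style):
+# instead of returning a loss, forward() returns the margin-adjusted, scaled similarity matrix
+# together with the labels and both embedding tensors. A training loop can then
+#   1) run a memory-cheap, no-grad forward pass over the full batch
+#      (get_pooled_embedding_with_no_grad),
+#   2) compute the contrastive loss and the gradients with respect to the cached embeddings, and
+#   3) replay the encoder over small sub-batches, back-propagating through each chunk,
+# which keeps the effective batch size large while bounding peak GPU memory.
+# Attributes such as self.ptm, self.use_fp16 and self.emb_reduce_linear are inherited from
+# base_model.SemanticIndexBase.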
+ +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexCacheNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def get_pooled_embedding_with_no_grad( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None + ): + if self.use_fp16: + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype) * -1e4, axis=[1, 2] + ) + + with paddle.no_grad(): + embedding_output = self.ptm.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + embedding_output = paddle.cast(embedding_output, "float16") + attention_mask = paddle.cast(attention_mask, "float16") + + with paddle.no_grad(): + encoder_outputs = self.ptm.encoder(embedding_output, attention_mask) + if self.use_fp16: + encoder_outputs = paddle.cast(encoder_outputs, "float32") + cls_embedding = self.ptm.pooler(encoder_outputs) + else: + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + return cosine_sim, labels, query_cls_embedding, title_cls_embedding diff --git a/examples/semantic_indexing/hardest_negative/model.py b/examples/semantic_indexing/hardest_negative/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ce4db41341c2df248be7e9f52fd4dc4733a70ab2 --- /dev/null +++ b/examples/semantic_indexing/hardest_negative/model.py @@ -0,0 +1,59 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexHardestNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + pos_sim = paddle.max(cosine_sim, axis=-1) + + # subtract 10000 from all diagnal elements of cosine_sim + mask_socre = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=10000, dtype=paddle.get_default_dtype() + ) + tmp_cosin_sim = cosine_sim - paddle.diag(mask_socre) + hardest_negative_sim = paddle.max(tmp_cosin_sim, axis=-1) + + labels = paddle.full(shape=[query_cls_embedding.shape[0]], fill_value=1.0, dtype="float32") + + loss = F.margin_ranking_loss(pos_sim, hardest_negative_sim, labels, margin=self.margin) + return loss diff --git a/examples/semantic_indexing/predict.py b/examples/semantic_indexing/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..741bb4ffdf45e9a043ab19be6b5e08bed94b00cd --- /dev/null +++ b/examples/semantic_indexing/predict.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.ops import convert_to_fp16 +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--use_fp16", action="store_true", help="Whether to use_fp16") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size, use_fp16=args.use_fp16) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + if args.use_fp16: + convert_to_fp16(model.ptm.encoder) + + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/examples/semantic_indexing/qa_validation.py b/examples/semantic_indexing/qa_validation.py new file mode 100644 index 0000000000000000000000000000000000000000..e4be203ec57a8acfd9391e9bbdf3f50107ad67ad --- /dev/null 
+++ b/examples/semantic_indexing/qa_validation.py @@ -0,0 +1,158 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + Set of utilities for Q&A results validation tasks - Retriver passage validation and Reader predicted answer validation +""" + +import collections +import logging +import string +import unicodedata +from functools import partial +from multiprocessing import Pool as ProcessPool +from typing import Tuple, List, Dict +import regex as re +from tokenizers import SimpleTokenizer + +logger = logging.getLogger(__name__) +QAMatchStats = collections.namedtuple("QAMatchStats", ["top_k_hits", "questions_doc_hits"]) + + +def calculate_matches( + all_docs: Dict[object, Tuple[str, str]], + answers: List[List[str]], + closest_docs: List[Tuple[List[object], List[float]]], + workers_num: int, + match_type: str, +) -> QAMatchStats: + """ + Evaluates answers presence in the set of documents. This function is supposed to be used with a large collection of + documents and results. It internally forks multiple sub-processes for evaluation and then merges results + :param all_docs: dictionary of the entire documents database. doc_id -> (doc_text, title) + :param answers: list of answers's list. One list per question + :param closest_docs: document ids of the top results along with their scores + :param workers_num: amount of parallel threads to process data + :param match_type: type of answer matching. Refer to has_answer code for available options + :return: matching information tuple. + top_k_hits - a list where the index is the amount of top documents retrieved and the value is the total amount of + valid matches across an entire dataset. 
+ questions_doc_hits - more detailed info with answer matches for every question and every retrieved document + """ + global dpr_all_documents + dpr_all_documents = all_docs + tok_opts = {} + tokenizer = SimpleTokenizer(**tok_opts) + processes = ProcessPool( + processes=workers_num, + ) + logger.info("Matching answers in top docs...") + get_score_partial = partial(check_answer, match_type=match_type, tokenizer=tokenizer) + + questions_answers_docs = zip(answers, closest_docs) + scores = processes.map(get_score_partial, questions_answers_docs) + logger.info("Per question validation results len=%d", len(scores)) + n_docs = len(closest_docs[0][0]) + top_k_hits = [0] * n_docs + for question_hits in scores: + best_hit = next((i for i, x in enumerate(question_hits) if x), None) + if best_hit is not None: + top_k_hits[best_hit:] = [v + 1 for v in top_k_hits[best_hit:]] + + return QAMatchStats(top_k_hits, scores) + + +def check_answer(questions_answers_docs, tokenizer, match_type) -> List[bool]: + """Search through all the top docs to see if they have any of the answers.""" + answers, (doc_ids, doc_scores) = questions_answers_docs + global dpr_all_documents + hits = [] + for i, doc_id in enumerate(doc_ids): + doc = dpr_all_documents[doc_id] + text = doc[0] + + answer_found = False + if text is None: # cannot find the document for some reason + logger.warning("no doc in db") + hits.append(False) + continue + if has_answer(answers, text, tokenizer, match_type): + answer_found = True + hits.append(answer_found) + return hits + + +def has_answer(answers, text, tokenizer, match_type) -> bool: + """Check if a document contains an answer string. + If `match_type` is string, token matching is done between the text and answer. + If `match_type` is regex, we search the whole text with the regex. 
+ """ + text = _normalize(text) + if match_type == "string": + # Answer is a list of possible strings + text = tokenizer.tokenize(text).words(uncased=True) + + for single_answer in answers: + single_answer = _normalize(single_answer) + single_answer = tokenizer.tokenize(single_answer) + single_answer = single_answer.words(uncased=True) + + for i in range(0, len(text) - len(single_answer) + 1): + if single_answer == text[i : i + len(single_answer)]: + return True + + elif match_type == "regex": + # Answer is a regex + for single_answer in answers: + single_answer = _normalize(single_answer) + if regex_match(text, single_answer): + return True + return False + + +def regex_match(text, pattern): + """Test if a regex pattern is contained within a text.""" + try: + pattern = re.compile( + pattern, + flags=re.IGNORECASE + re.UNICODE + re.MULTILINE, + ) + except BaseException: + return False + return pattern.search(text) is not None + + +# function for the reader model answer validation +def exact_match_score(prediction, ground_truth): + return _normalize_answer(prediction) == _normalize_answer(ground_truth) + + +def _normalize_answer(s): + def remove_articles(text): + return re.sub(r"\b(a|an|the)\b", " ", text) + + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def _normalize(text): + return unicodedata.normalize("NFD", text) diff --git a/examples/semantic_indexing/recall.py b/examples/semantic_indexing/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..79a3d8b8a0ee9399f876fb48e6f7fef7a67aec2a --- /dev/null +++ b/examples/semantic_indexing/recall.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, gen_id2corpus, gen_text_file + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + 
) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/examples/semantic_indexing/requirements.txt b/examples/semantic_indexing/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..4c314ba430a859acd10457d796ff21ee97602eb8 --- /dev/null +++ b/examples/semantic_indexing/requirements.txt @@ -0,0 +1,9 @@ +faiss==1.5.3 +hnswlib==0.6.2 +numpy==1.22.4 +paddle==1.0.2 +paddlenlp==2.3.4 +paddlepaddle==2.3.1 +regex==2022.7.25 +spacy==3.4.1 +tqdm==4.64.0 diff --git a/examples/semantic_indexing/run_ann_data_gen.py b/examples/semantic_indexing/run_ann_data_gen.py new file mode 100644 index 0000000000000000000000000000000000000000..503a49be1eb2ebd5f5bd9b1bf6e0a60314c9010c --- /dev/null +++ b/examples/semantic_indexing/run_ann_data_gen.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
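+
+# run_ann_data_gen.py: iteratively generates ANCE-style training data. It polls
+# --save_dir for the newest SemanticIndexANCE checkpoint, embeds the corpus and the
+# query side of --similar_text_pair_file with it, builds an HNSW ANN index, takes hard
+# negatives from the tail of each query's top --topk_training recalls, and writes
+# (query, similar_text, hard_negative) triples to a new folder under --ann_data_dir.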
+ +import argparse +import os +import time +from functools import partial + +import paddle +from ance.model import SemanticIndexANCE +from ann_util import build_index +from data import ( + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + get_latest_ann_data, + get_latest_checkpoint, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() + +# Required parameters +parser.add_argument("--similar_text_pair_file", default=None, type=str, required=True, help="The train_set tsv file that each line is simialr text pair") +parser.add_argument("--corpus_file", default=None, type=str, required=True, help="The corpus file that each line is a text for buinding indexing") +parser.add_argument("--save_dir", default=None, type=str, required=True, help="Saved model dir, will look for latest checkpoint dir in here") +parser.add_argument("--ann_data_dir", default=None, type=str, required=True, help="The output directory where the training data will be written") + +parser.add_argument("--init_from_ckpt", default=None, type=str, help="Initial model dir, will use this if no checkpoint is found in model_dir") +parser.add_argument("--end_ann_step", default=1000000, type=int, help="Stop after this number of data versions has been generated, default run forever") +parser.add_argument("--batch_size", default=128, type=int, help="Batch size for predicting embedding of texts") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") + +parser.add_argument("--max_seq_length", default=128, type=int, help="Batch size for predicting embedding of texts") +parser.add_argument("--topk_training", default=500, type=int, help="top k from which negative samples are collected") +parser.add_argument("--num_negative_sample", default=5, type=int, help="at each resample, how many negative samples per query do I use") + +# hnsw argument +parser.add_argument("--hnsw_m", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +args = parser.parse_args() +# yapf: enable + + +def generate_new_ann(args, data_loader_dict, checkpoint_path, latest_step_num): + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SemanticIndexANCE(pretrained_model, output_emb_size=args.output_emb_size) + + logger.info("checkpoint_path:{}".format(checkpoint_path)) + state_dict = paddle.load(checkpoint_path) + + model.set_dict(state_dict) + logger.info("load params from:{}".format(checkpoint_path)) + + logger.info("***** inference of corpus *****") + final_index = build_index(args, data_loader_dict["corpus_data_loader"], model) + + logger.info("***** inference of query *****") + query_embedding = model.get_semantic_embedding(data_loader_dict["text_data_loader"]) + + text_list = data_loader_dict["text_list"] + id2corpus = data_loader_dict["id2corpus"] + text2similar_text = data_loader_dict["text2similar_text"] + + new_ann_data_path = os.path.join(args.ann_data_dir, str(latest_step_num)) + if not os.path.exists(new_ann_data_path): + os.mkdir(new_ann_data_path) + + with open(os.path.join(new_ann_data_path, "new_ann_data"), "w") as f: + for 
batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding, args.topk_training) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + + hard_neg_samples = recalled_idx[row_index][-1 * args.num_negative_sample :] + + for idx, hard_neg_doc_idx in enumerate(hard_neg_samples): + text = text_list[text_index]["text"] + similar_text = text2similar_text[text] + hard_neg_sample = id2corpus[hard_neg_doc_idx] + f.write("{}\t{}\t{}\n".format(text, similar_text, hard_neg_sample)) + + succeed_flag_file = os.path.join(new_ann_data_path, "succeed_flag_file") + open(succeed_flag_file, "a").close() + logger.info("finish generate ann data step:{}".format(latest_step_num)) + + +def build_data_loader(args, tokenizer): + """build corpus_data_loader and text_data_loader""" + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # build text data_loader + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + text_ds = MapDataset(text_list) + + text_data_loader = create_dataloader( + text_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + d = { + "text_data_loader": text_data_loader, + "corpus_data_loader": corpus_data_loader, + "id2corpus": id2corpus, + "text2similar_text": text2similar_text, + "text_list": text_list, + } + + return d + + +def ann_data_gen(args): + # use init_from_ckpt as last_checkpoint + last_checkpoint = args.init_from_ckpt + + # get latest_ann_data_step to decide when stop gen_ann_data + _, latest_ann_data_step = get_latest_ann_data(args.ann_data_dir) + + rank = paddle.distributed.get_rank() + if rank == 0: + if not os.path.exists(args.ann_data_dir): + os.makedirs(args.ann_data_dir) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + data_load_dict = build_data_loader(args, tokenizer) + + while latest_ann_data_step <= args.end_ann_step: + next_checkpoint, latest_step_num = get_latest_checkpoint(args) + logger.info("next_checkpoint:{}".format(next_checkpoint)) + + if next_checkpoint == last_checkpoint: + logger.info("next_checkpoint == lase_checkpoint:{}".format(next_checkpoint)) + logger.info("sleep 10s") + time.sleep(10) + else: + logger.info("start generate ann data using checkpoint:{}".format(next_checkpoint)) + + generate_new_ann(args, data_load_dict, next_checkpoint, latest_step_num) + + logger.info("finished generating ann data step {}".format(latest_step_num)) + + last_checkpoint = next_checkpoint + + +def main(): + ann_data_gen(args) + + +if __name__ == "__main__": + main() diff --git a/examples/semantic_indexing/tokenizers.py b/examples/semantic_indexing/tokenizers.py new file mode 100644 index 0000000000000000000000000000000000000000..c7bd48a09bd79bfba995a1798147cd335e4bee3a --- /dev/null +++ b/examples/semantic_indexing/tokenizers.py @@ 
-0,0 +1,252 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Most of the tokenizers code here is copied from DrQA codebase to avoid adding extra dependency +""" + +import copy +import logging + +import regex +import spacy + +logger = logging.getLogger(__name__) + + +class Tokens(object): + """A class to represent a list of tokenized text.""" + + TEXT = 0 + TEXT_WS = 1 + SPAN = 2 + POS = 3 + LEMMA = 4 + NER = 5 + + def __init__(self, data, annotators, opts=None): + self.data = data + self.annotators = annotators + self.opts = opts or {} + + def __len__(self): + """The number of tokens.""" + return len(self.data) + + def slice(self, i=None, j=None): + """Return a view of the list of tokens from [i, j).""" + new_tokens = copy.copy(self) + new_tokens.data = self.data[i:j] + return new_tokens + + def untokenize(self): + """Returns the original text (with whitespace reinserted).""" + return "".join([t[self.TEXT_WS] for t in self.data]).strip() + + def words(self, uncased=False): + """Returns a list of the text of each token + + Args: + uncased: lower cases text + """ + if uncased: + return [t[self.TEXT].lower() for t in self.data] + else: + return [t[self.TEXT] for t in self.data] + + def offsets(self): + """Returns a list of [start, end) character offsets of each token.""" + return [t[self.SPAN] for t in self.data] + + def pos(self): + """Returns a list of part-of-speech tags of each token. + Returns None if this annotation was not included. + """ + if "pos" not in self.annotators: + return None + return [t[self.POS] for t in self.data] + + def lemmas(self): + """Returns a list of the lemmatized text of each token. + Returns None if this annotation was not included. + """ + if "lemma" not in self.annotators: + return None + return [t[self.LEMMA] for t in self.data] + + def entities(self): + """Returns a list of named-entity-recognition tags of each token. + Returns None if this annotation was not included. + """ + if "ner" not in self.annotators: + return None + return [t[self.NER] for t in self.data] + + def ngrams(self, n=1, uncased=False, filter_fn=None, as_strings=True): + """Returns a list of all ngrams from length 1 to n. 
+ + Args: + n: upper limit of ngram length + uncased: lower cases text + filter_fn: user function that takes in an ngram list and returns + True or False to keep or not keep the ngram + as_string: return the ngram as a string vs list + """ + + def _skip(gram): + if not filter_fn: + return False + return filter_fn(gram) + + words = self.words(uncased) + ngrams = [ + (s, e + 1) + for s in range(len(words)) + for e in range(s, min(s + n, len(words))) + if not _skip(words[s : e + 1]) + ] + + # Concatenate into strings + if as_strings: + ngrams = ["{}".format(" ".join(words[s:e])) for (s, e) in ngrams] + + return ngrams + + def entity_groups(self): + """Group consecutive entity tokens with the same NER tag.""" + entities = self.entities() + if not entities: + return None + non_ent = self.opts.get("non_ent", "O") + groups = [] + idx = 0 + while idx < len(entities): + ner_tag = entities[idx] + # Check for entity tag + if ner_tag != non_ent: + # Chomp the sequence + start = idx + while idx < len(entities) and entities[idx] == ner_tag: + idx += 1 + groups.append((self.slice(start, idx).untokenize(), ner_tag)) + else: + idx += 1 + return groups + + +class Tokenizer(object): + """Base tokenizer class. + Tokenizers implement tokenize, which should return a Tokens class. + """ + + def tokenize(self, text): + raise NotImplementedError + + def shutdown(self): + pass + + def __del__(self): + self.shutdown() + + +class SimpleTokenizer(Tokenizer): + ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+" + NON_WS = r"[^\p{Z}\p{C}]" + + def __init__(self, **kwargs): + """ + Args: + annotators: None or empty set (only tokenizes). + """ + self._regexp = regex.compile( + "(%s)|(%s)" % (self.ALPHA_NUM, self.NON_WS), flags=regex.IGNORECASE + regex.UNICODE + regex.MULTILINE + ) + if len(kwargs.get("annotators", {})) > 0: + logger.warning( + "%s only tokenizes! Skipping annotators: %s" % (type(self).__name__, kwargs.get("annotators")) + ) + self.annotators = set() + + def tokenize(self, text): + data = [] + matches = [m for m in self._regexp.finditer(text)] + for i in range(len(matches)): + # Get text + token = matches[i].group() + + # Get whitespace + span = matches[i].span() + start_ws = span[0] + if i + 1 < len(matches): + end_ws = matches[i + 1].span()[0] + else: + end_ws = span[1] + + # Format data + data.append( + ( + token, + text[start_ws:end_ws], + span, + ) + ) + return Tokens(data, self.annotators) + + +class SpacyTokenizer(Tokenizer): + def __init__(self, **kwargs): + """ + Args: + annotators: set that can include pos, lemma, and ner. + model: spaCy model to use (either path, or keyword like 'en'). + """ + model = kwargs.get("model", "en") + self.annotators = copy.deepcopy(kwargs.get("annotators", set())) + nlp_kwargs = {"parser": False} + if not any([p in self.annotators for p in ["lemma", "pos", "ner"]]): + nlp_kwargs["tagger"] = False + if "ner" not in self.annotators: + nlp_kwargs["entity"] = False + self.nlp = spacy.load(model, **nlp_kwargs) + + def tokenize(self, text): + # We don't treat new lines as tokens. 
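+        # Each token is emitted as a 6-tuple (text, text_with_ws, span, pos, lemma, ner)
+        # whose order must match the Tokens.TEXT ... Tokens.NER column indices above.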
+ clean_text = text.replace("\n", " ") + tokens = self.nlp.tokenizer(clean_text) + if any([p in self.annotators for p in ["lemma", "pos", "ner"]]): + self.nlp.tagger(tokens) + if "ner" in self.annotators: + self.nlp.entity(tokens) + + data = [] + for i in range(len(tokens)): + # Get whitespace + start_ws = tokens[i].idx + if i + 1 < len(tokens): + end_ws = tokens[i + 1].idx + else: + end_ws = tokens[i].idx + len(tokens[i].text) + + data.append( + ( + tokens[i].text, + text[start_ws:end_ws], + (tokens[i].idx, tokens[i].idx + len(tokens[i].text)), + tokens[i].tag_, + tokens[i].lemma_, + tokens[i].ent_type_, + ) + ) + + # Set special option for non-entity tag: '' vs 'O' in spaCy + return Tokens(data, self.annotators, opts={"non_ent": ""}) diff --git a/examples/semantic_indexing/train_ance.py b/examples/semantic_indexing/train_ance.py new file mode 100644 index 0000000000000000000000000000000000000000..a06b5847429f1b8fb73edc1721dfa32b5270134c --- /dev/null +++ b/examples/semantic_indexing/train_ance.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from ance.model import SemanticIndexANCE +from data import ( + convert_example, + create_dataloader, + get_latest_ann_data, + get_latest_checkpoint, + read_text_triplet, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoints', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--ann_data_dir", default='./ann_data', type=str, help="The output directory where the ann generated training data will be saved.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--max_training_steps", default=1000000, type=int, help="The maximum total steps for training") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--margin", default=0.3, type=float, help="Margin for pair-wise margin_rank_loss") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + latest_checkpoint, latest_global_step = get_latest_checkpoint(args) + logger.info("get latest_checkpoint:{}".format(latest_checkpoint)) + + model = SemanticIndexANCE(pretrained_model, margin=args.margin, output_emb_size=args.output_emb_size) + + if latest_checkpoint: + state_dict = paddle.load(latest_checkpoint) + model.set_dict(state_dict) + print("warmup from:{}".format(latest_checkpoint)) + + model = paddle.DataParallel(model) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pos_sample_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pos_sample_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # neg_sample_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # neg_sample_segment + ): [data for data in fn(samples)] + + global_step = 0 + + while global_step < args.max_training_steps: + latest_ann_data, latest_ann_data_step = get_latest_ann_data(args.ann_data_dir) + + if latest_ann_data_step == -1: + # No ann_data generated yet + latest_ann_data = args.train_set_file + logger.info("No ann_data generated yet, Use training_set:{}".format(args.train_set_file)) + else: + # Using ann_data to training model + logger.info("Latest ann_data is ready for training: [{}]".format(latest_ann_data)) + + train_ds = load_dataset(read_text_triplet, 
data_path=latest_ann_data, lazy=False) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=clip, + ) + + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + ( + text_input_ids, + text_token_type_ids, + pos_sample_input_ids, + pos_sample_token_type_ids, + neg_sample_input_ids, + neg_sample_token_type_ids, + ) = batch + + loss = model( + text_input_ids=text_input_ids, + pos_sample_input_ids=pos_sample_input_ids, + neg_sample_input_ids=neg_sample_input_ids, + text_token_type_ids=text_token_type_ids, + pos_sample_token_type_ids=pos_sample_token_type_ids, + neg_sample_token_type_ids=neg_sample_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s, trainning_file: %s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train), latest_ann_data) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, str(global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + # Flag to indicate succeefully save model + succeed_flag_file = os.path.join(save_dir, "succeed_flag_file") + open(succeed_flag_file, "a").close() + + +if __name__ == "__main__": + do_train() diff --git a/examples/semantic_indexing/train_batch_neg.py b/examples/semantic_indexing/train_batch_neg.py new file mode 100644 index 0000000000000000000000000000000000000000..760e67101309280b18652fc103ab9fd89445a6e2 --- /dev/null +++ b/examples/semantic_indexing/train_batch_neg.py @@ -0,0 +1,158 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
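+
+# train_batch_neg.py: trains a SemanticIndexBatchNeg model on (query, title) text pairs,
+# treating the other titles in the same batch as negatives (in-batch negatives);
+# --margin and --scale control the contrastive loss, and --use_amp enables
+# mixed-precision training.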
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from batch_negative.model import SemanticIndexBatchNeg +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.3, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") +parser.add_argument("--amp_loss_scale", default=32768, type=float, help="The value of scale_loss for fp16. 
This is only used for AMP training.") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
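+    # apply_decay_param_fun below receives each parameter name and returns True only for
+    # names collected in decay_params, so bias and LayerNorm weights skip weight decay.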
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.amp_loss_scale) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + if args.use_amp: + scaled = scaler.scale(loss) + scaled.backward() + scaler.minimize(optimizer, scaled) + else: + loss.backward() + optimizer.step() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/semantic_indexing/train_gradient_cache.py b/examples/semantic_indexing/train_gradient_cache.py new file mode 100644 index 0000000000000000000000000000000000000000..838054f7c0188cb5fc0aa8dd71c5bd946dcb6686 --- /dev/null +++ b/examples/semantic_indexing/train_gradient_cache.py @@ -0,0 +1,239 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_text_pair +from gradient_cache.model import SemanticIndexCacheNeg + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.3, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") +parser.add_argument("--amp_loss_scale", default=32768, type=float, help="The value of scale_loss for fp16. This is only used for AMP training.") +parser.add_argument("--chunk_numbers", type=int, default=50, help="The number of the chunks for model") + +args = parser.parse_args() + + +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + # If you wanna use bert/roberta pretrained model, + # pretrained_model = ppnlp.transformers.BertModel.from_pretrained('bert-base-chinese') + # pretrained_model = ppnlp.transformers.RobertaModel.from_pretrained('roberta-wwm-ext') + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + # If you wanna use bert/roberta pretrained model, + # tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese') + # tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext') + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-1.0") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_# query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_# title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", 
batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SemanticIndexCacheNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.amp_loss_scale) + + if args.batch_size % args.chunk_numbers == 0: + chunk_numbers = args.chunk_numbers + + def split(inputs, chunk_numbers, axis=0): + if inputs.shape[0] % chunk_numbers == 0: + return paddle.split(inputs, chunk_numbers, axis=0) + else: + return paddle.split(inputs, inputs.shape[0], axis=0) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + chunked_x = [split(t, chunk_numbers, axis=0) for t in batch] + sub_batchs = [list(s) for s in zip(*chunked_x)] + + all_reps = [] + all_grads = [] + all_labels = [] + all_CUDA_rnd_state = [] + all_query = [] + all_title = [] + + for sub_batch in sub_batchs: + all_reps = [] + all_labels = [] + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state.append(sub_CUDA_rnd_state) + sub_cosine_sim, sub_label, query_embedding, title_embedding = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, + ) + all_reps.append(sub_cosine_sim) + all_labels.append(sub_label) + all_title.append(title_embedding) + all_query.append(query_embedding) + + model_reps = paddle.concat(all_reps, axis=0) + model_title = paddle.concat(all_title) + model_query = paddle.concat(all_query) + + model_title = model_title.detach() + model_query = model_query.detach() + + model_query.stop_gtadient = False + model_title.stop_gradient = False + model_reps.stop_gradient = False + + model_label = paddle.concat(all_labels, axis=0) + loss = F.cross_entropy(input=model_reps, label=model_label) + loss.backward() + all_grads.append(model_reps.grad) + + for sub_batch, CUDA_state, grad in zip(sub_batchs, all_CUDA_rnd_state, all_grads): + + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + paddle.framework.random.set_cuda_rng_state(CUDA_state) + cosine_sim, _ = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, 
+ ) + surrogate = paddle.dot(cosine_sim, grad) + + if args.use_amp: + scaled = scaler.scale(surrogate) + scaled.backward() + else: + surrogate.backward() + + if args.use_amp: + scaler.minimize(optimizer, scaled) + else: + optimizer.step() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/semantic_indexing/train_gradient_cache_DPR.py b/examples/semantic_indexing/train_gradient_cache_DPR.py new file mode 100644 index 0000000000000000000000000000000000000000..59ef25024a9fa739df029fe930aadaa8776e041a --- /dev/null +++ b/examples/semantic_indexing/train_gradient_cache_DPR.py @@ -0,0 +1,214 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
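+
+# train_gradient_cache_DPR.py: trains a DPR-style BiEncoder (question/context BertModel
+# pair) with gradient caching. Every batch is split into --chunk_size sized chunks: a
+# first pass under paddle.no_grad() embeds each chunk while saving the CUDA RNG state,
+# BiEncoderNllLoss is computed on the concatenated (detached) embeddings, and a second
+# pass replays each chunk with its saved RNG state so the cached embedding gradients can
+# be back-propagated through the encoders one chunk at a time.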
+ +import argparse + +import numpy as np +import paddle +from biencoder_base_model import BiEncoder, BiEncoderNllLoss +from NQdataset import DataUtil, NQdataSetForDPR +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.transformers.bert.modeling import BertModel + +parser = argparse.ArgumentParser() + +parser.add_argument("--batch_size", required=True, type=int, default=None) +parser.add_argument("--learning_rate", required=True, type=float, default=None) +parser.add_argument("--save_dir", required=True, type=str, default=None) +parser.add_argument("--warmup_steps", required=True, type=int) +parser.add_argument("--epoches", required=True, type=int) +parser.add_argument("--max_grad_norm", required=True, type=int) +parser.add_argument("--train_data_path", required=True, type=str) +parser.add_argument("--chunk_size", required=True, type=int) +args = parser.parse_args() + +chunk_nums = args.batch_size // args.chunk_size +data_path = args.train_data_path +batch_size = args.batch_size +learning_rate = args.learning_rate +epoches = args.epoches + + +def dataLoader_for_DPR(batch_size, source_data: list, epochs): + index = np.arange(0, len(source_data)) + np.random.shuffle(index) + batch_data = [] + for i in index: + try: + batch_data.append(source_data[i]) + + if len(batch_data) == batch_size: + yield batch_data + batch_data = [] + + except Exception: + import traceback + + traceback.print_exc() + continue + + +def get_model(model_name: str): + question_model = BertModel.from_pretrained(model_name) + context_model = BertModel.from_pretrained(model_name) + model = BiEncoder(question_model, context_model) + return model + + +model = get_model("bert-base-uncased") + + +def get_linear_scheduler(warmup_steps, training_steps): + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, float(training_steps - current_step) / float(max(1, training_steps - warmup_steps))) + + return LambdaDecay(learning_rate=args.learning_rate, lr_lambda=lr_lambda, last_epoch=-1, verbose=False) + + +training_steps = 58880 * args.epoches / args.batch_size +scheduler = get_linear_scheduler(args.warmup_steps, training_steps) +optimizer = paddle.optimizer.AdamW(learning_rate=scheduler, parameters=model.parameters()) + + +def get_dataset(data_path: str): + data = NQdataSetForDPR(data_path) + dataset = data.new_data + return dataset + + +util = DataUtil() +LOSS = BiEncoderNllLoss() +batch_data = [] +dataset = get_dataset(data_path) + + +def train(): + + for epoch in range(epoches): + + index = np.arange(0, len(dataset)) + np.random.shuffle(index) + + batch_data = [] + + for i in index: + # dataLoader + batch_data.append(dataset[i]) + if len(batch_data) == batch_size: + all_questions = [] + all_contexts = [] + all_batch_input = util.create_biencoder_input(batch_data, inserted_title=True) + + all_positions = all_batch_input.is_positive + + all_inputs_questions_id = all_batch_input.questions_ids + all_inputs_questions_segment = all_batch_input.question_segments + + all_inputs_contexts_id = all_batch_input.context_ids + all_inputs_contexts_segment = all_batch_input.ctx_segments + + sub_q_ids = paddle.split(all_inputs_questions_id, chunk_nums, axis=0) + sub_c_ids = paddle.split(all_inputs_contexts_id, chunk_nums, axis=0) + sub_q_segments = paddle.split(all_inputs_questions_segment, chunk_nums, axis=0) + sub_c_segments = paddle.split(all_inputs_contexts_segment, chunk_nums, axis=0) + + all_questions = [] + all_contexts = [] + 
all_CUDA_rnd_state_question = [] + all_CUDA_rnd_state_context = [] + + for sub_q_id, sub_q_segment in zip(sub_q_ids, sub_q_segments): + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state_question.append(sub_CUDA_rnd_state) + sub_question_output = model.get_question_pooled_embedding(sub_q_id, sub_q_segment) + all_questions.append(sub_question_output) + for sub_c_id, sub_c_segment in zip(sub_c_ids, sub_c_segments): + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state_context.append(sub_CUDA_rnd_state) + sub_context_output = model.get_context_pooled_embedding(sub_c_id, sub_c_segment) + all_contexts.append(sub_context_output) + + model_questions = paddle.concat(all_questions, axis=0) + all_questions = [] + + model_questions = model_questions.detach() + + model_questions.stop_gradient = False + + model_contexts = paddle.concat(all_contexts, axis=0) + + model_contexts = model_contexts.detach() + + model_contexts.stop_gradient = False + + all_contexts = [] + + model_positions = all_positions + + loss, _ = LOSS.calc(model_questions, model_contexts, model_positions) + + print("loss is:") + print(loss.item()) + + loss.backward() + + grads_for_questions = paddle.split(model_questions.grad, chunk_nums, axis=0) + grads_for_contexts = paddle.split(model_contexts.grad, chunk_nums, axis=0) + + for sub_q_id, sub_q_segment, CUDA_state, grad_for_each_question in zip( + sub_q_ids, sub_q_segments, all_CUDA_rnd_state_question, grads_for_questions + ): + + paddle.framework.random.set_cuda_rng_state(CUDA_state) + + sub_question_output = model.get_question_pooled_embedding(sub_q_id, sub_q_segment) + + finally_question_res_for_backward = paddle.dot(sub_question_output, grad_for_each_question) + finally_question_res_for_backward = finally_question_res_for_backward * (1 / 8.0) + + finally_question_res_for_backward.backward(retain_graph=True) + + for sub_c_id, sub_c_segment, CUDA_state, grad_for_each_context in zip( + sub_c_ids, sub_c_segments, all_CUDA_rnd_state_context, grads_for_contexts + ): + paddle.framework.random.set_cuda_rng_state(CUDA_state) + + sub_context_output = model.get_context_pooled_embedding(sub_c_id, sub_q_segment) + + finally_context_res_for_backward = paddle.dot(sub_question_output, grad_for_each_context) + finally_context_res_for_backward = finally_context_res_for_backward * (1 / 8.0) + + finally_context_res_for_backward.backward(retain_graph=True) + + paddle.nn.ClipGradByGlobalNorm(clip_norm=args.max_grad_norm, group_name=model.parameters()) + optimizer.step() + scheduler.step() + optimizer.clear_grad() + + batch_data = [] + + EPOCH = str(epoch) + save_path_que = args.save_dir + "/question_model_" + EPOCH + save_path_con = args.save_dir + "/context_model_" + EPOCH + model.question_encoder.save_pretrained(save_path_que) + model.context_encoder.save_pretrained(save_path_con) + + +if __name__ == "__main__": + train() diff --git a/examples/semantic_indexing/train_hardest_neg.py b/examples/semantic_indexing/train_hardest_neg.py new file mode 100644 index 0000000000000000000000000000000000000000..e4e9522d303e6157376503050278f9e9d904b20a --- /dev/null +++ b/examples/semantic_indexing/train_hardest_neg.py @@ -0,0 +1,141 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from hardest_negative.model import SemanticIndexHardestNeg + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--margin", default=0.3, type=float, help="Margin for pair-wise margin_rank_loss") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in 
fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SemanticIndexHardestNeg(pretrained_model, margin=args.margin, output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/sentiment_analysis/skep/README.md b/examples/sentiment_analysis/skep/README.md new file mode 100644 index 0000000000000000000000000000000000000000..de57627169db8a2d89bf7c11b03306fe6f5a180a --- /dev/null +++ b/examples/sentiment_analysis/skep/README.md @@ -0,0 +1,271 @@ +# SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis + +情感分析旨在自动识别和提取文本中的倾向、立场、评价、观点等主观信息。它包含各式各样的任务,比如句子级情感分类、评价对象级情感分类、观点抽取、情绪分类等。情感分析是人工智能的重要研究方向,具有很高的学术价值。同时,情感分析在消费决策、舆情分析、个性化推荐等领域均有重要的应用,具有很高的商业价值。 + +情感预训练模型SKEP(Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis)。SKEP利用情感知识增强预训练模型, 在14项中英情感分析典型任务上全面超越SOTA,此工作已经被ACL 2020录用。SKEP是百度研究团队提出的基于情感知识增强的情感预训练算法,此算法采用无监督方法自动挖掘情感知识,然后利用情感知识构建预训练目标,从而让机器学会理解情感语义。SKEP为各类情感分析任务提供统一且强大的情感语义表示。 + +论文地址:https://arxiv.org/abs/2005.05635 + +<p align="center"> +<img src="https://bj.bcebos.com/paddlenlp/models/transformers/skep/skep.png" width="80%" height="60%"> <br /> +</p> + +百度研究团队在三个典型情感分析任务,语句级情感分类(Sentence-level Sentiment Classification),评价对象级情感分类(Aspect-level Sentiment Classification)、观点抽取(Opinion Role Labeling),共计14个中英文数据上进一步验证了情感预训练模型SKEP的效果。实验表明,下表展示了在模型分别在数据集SST-2、ChnSentiCorp、SE-ABSA16_PHNS、COTE_DP上的实验结果,同时标明了各项数据集对应的任务类型、语言类别、下载地址等信息。 + +<table> + <tr> + <td><strong><center>任务</strong></td> + <td><strong><center>数据集合</strong></td> + 
<td><strong><center>语言</strong></td> + <td><strong><center>指标</strong></td> + <td><strong><center>SKEP</strong></td> + <td><strong><center>数据集地址</strong></td> + </tr> + <tr> + <td rowspan="2"><center>语句级情感分类<br /><center>分类</td> + <td><center>SST-2</td> + <td><center>英文</td> + <td><center>ACC</td> + <td><center>97.60</td> + <td><center><a href="https://gluebenchmark.com/tasks" >下载地址</a></td> + </tr> + <tr> + <td><center>ChnSentiCorp</td> + <td><center>中文</td> + <td><center>ACC</td> + <td><center>96.08</td> + <td><center><a href="https://dataset-bj.cdn.bcebos.com/qianyan/ChnSentiCorp.zip" >下载地址</a></td> + </tr> + <tr> + <td rowspan="1"><center>评价对象级<br /><center>情感分类</td> + <td><center>SE-ABSA16_PHNS</td> + <td><center>中文</td> + <td><center>ACC</td> + <td><center>65.22</td> + <td><center><a href="http://alt.qcri.org/semeval2016/task5/" >下载地址</a></td> + </tr> + <tr> + <td rowspan="1"><center>观点<br /><center>抽取</td> + <td><center>COTE_DP</td> + <td><center>中文</td> + <td><center>F1</td> + <td><center>86.30</td> + <td><center><a href="https://github.com/lsvih/chinese-customer-review" >下载地址</a></td> + </tr> +</table> + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +skep/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── predict_aspect.py # 对象级的情感分类任务预测脚本 +├── predict_opinion.py # 观点抽取任务预测脚本 +├── predict_sentence.py # 句子级情感分类任务预测脚本 +├── README.md # 使用说明 +├── train_aspect.py # 对象级的情感分类任务训练脚本 +├── train_opinion.py # 观点抽取任务训练脚本 +└── train_sentence.py # 句子级情感分类任务训练脚本 +``` + +下面以语句级情感分类、评价对象级情感分类,观点抽取等任务类型为例,分别说明相应的训练和测试方式。 + +### 语句级情感分类 +#### 数据下载 +本示例采用常用开源数据集ChnSenticorp中文数据集、GLUE-SST2英文数据集作为语句级情感分类数据集。这两项数据集已经内置于PaddleNLP。可以通过以下方式进行加载。 + +```python +from paddlenlp.datasets import load_dataset + +train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) +train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"]) +``` + +#### 模型训练 + +可以通过如下命令开启语句级情感分析任务训练,需要特别说明的是,如果想要基于数据集ChnSentiCorp训练中文情感分析模型,请指定model_name为:`skep_ernie_1.0_large_ch`; 基于数据集GLUE-SST2训练英文情感分析模型请指定model_name为:`skep_ernie_2.0_large_en`。下面以中文情感分析为例进行说明。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train_sentence.py \ + --model_name "skep_ernie_1.0_large_ch" \ + --device "gpu" \ + --save_dir "./checkpoints" \ + --epochs 3 \ + --max_seq_len 128 \ + --batch_size 16 \ + --learning_rate 5e-5 +``` + +可支持配置的参数: + +* `model_name`: 使用预训练模型的名称,可选skep_ernie_1.0_large_ch和skep_ernie_2.0_large_en。 + skep_ernie_1.0_large_ch:是SKEP模型在预训练ernie_1.0_large_ch基础之上在海量中文数据上继续预训练得到的中文预训练模型; + skep_ernie_2.0_large_en:是SKEP模型在预训练ernie_2.0_large_en基础之上在海量英文数据上继续预训练得到的中文预训练模型。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_len`:可选,ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为16。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.00。 +* `epochs`: 训练轮次,默认为3。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. 
+* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练,则可通过参数gpus指定GPU卡号。
+
+程序运行时将会自动进行训练和评估,同时训练过程中会自动将模型保存在指定的`save_dir`中。
+
+#### 模型预测
+使用如下命令进行模型预测:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict_sentence.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --ckpt_dir "checkpoints/model_100" \
+    --batch_size 16 \
+    --max_seq_len 128 \
+    --device "gpu"
+```
+
+下面展示了模型的预测示例结果:
+
+```text
+Data: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般      Label: negative
+Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片      Label: negative
+Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。      Label: positive
+```
+
+#### 基于Taskflow一键预测
+当前PaddleNLP已将训练好的SKEP中文语句级情感分析模型集成至Taskflow中,可以使用Taskflow对输入的文本进行一键式情感分析,使用方法如下:
+
+```python
+from paddlenlp import Taskflow
+
+senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch")
+senta("这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般")
+'''
+[{'text': '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般', 'label': 'negative', 'score': 0.9894790053367615}]
+'''
+```
+
+如果想将自己训练好的模型加载进Taskflow进行预测,可以通过参数`task_path`指定模型路径。需要注意的是,该路径下需要同时存放模型文件以及相应的Tokenizer文件(训练过程中已自动保存这两类文件)。
+
+```python
+from paddlenlp import Taskflow
+
+senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch", task_path="./checkpoints/model_100")
+senta("这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般")
+'''
+[{'text': '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般', 'label': 'negative', 'score': 0.9686369299888611}]
+'''
+```
+
+#### 模型部署
+
+使用动态图训练结束之后,还可以将动态图参数导出成静态图参数。在进行模型转换时,需要通过参数`ckpt_dir`指定训练好的模型存放目录,通过`output_path`指定静态图模型参数保存路径,详情请参考export_model.py。模型转换命令如下:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python export_model.py \
+    --model_name="skep_ernie_1.0_large_ch" \
+    --ckpt_dir="./checkpoints/model_100" \
+    --output_path="./static/static_graph_params"
+```
+
+导出的静态图模型可用于部署,deploy/python/predict.py展示了Python部署预测示例。运行方式如下:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python deploy/python/predict.py \
+    --model_name="skep_ernie_1.0_large_ch" \
+    --model_file="./static/static_graph_params.pdmodel" \
+    --params_file="./static/static_graph_params.pdiparams"
+```
+
+### 评价对象级情感分类
+本节将以数据集SE-ABSA16_PHNS为例展示评价对象级情感分类模型的训练和测试。该数据集已内置于PaddleNLP中,可以通过与语句级情感分类类似的方式进行加载,这里不再赘述。下面展示了SE-ABSA16_PHNS数据集中的一条数据。
+
+```text
+label   text_a  text_b
+1       phone#design_features   今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/4s保持一致没有变化,长度多了大概一厘米,也就是之前所说的多了一排的图标。2. 真机重量比上一代轻了很多,个人感觉跟i9100的重量差不多。(用惯上一代的朋友可能需要一段时间适应了)3. 由于目前还没有版的SIM卡,无法插卡使用,有购买的朋友要注意了,并非简单的剪卡就可以用,而是需要去运营商更换新一代的SIM卡。4. 屏幕显示效果确实比上一代有进步,不论是从清晰度还是不同角度的视角,iPhone 5绝对要更上一层,我想这也许是相对上一代最有意义的升级了。5. 新的数据接口更小,比上一代更好用更方便,使用的过程会有这样的体会。6. 从简单的几个操作来讲速度比4s要快,这个不用测试软件也能感受出来,比如程序的调用以及照片的拍摄和浏览。不过,目前水货市场上坑爹的价格,最好大家可以再观望一下,不要急着出手。
+```
+
+#### 模型训练
+
+可以通过如下命令开启评价对象级情感分类任务训练。
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" train_aspect.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --save_dir "./checkpoints" \
+    --epochs 50 \
+    --max_seq_len 128 \
+    --batch_size 16 \
+    --learning_rate 5e-5 \
+    --device "gpu"
+```
+
+#### 模型预测
+使用如下命令进行模型预测:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict_aspect.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --ckpt_dir "./checkpoints/model_100" \
+    --batch_size 16 \
+    --max_seq_len 128 \
+    --device "gpu"
+```
+
+### 观点抽取
+本节将以数据集COTE_DP为例展示观点抽取模型的训练和测试。该数据集已内置于PaddleNLP中,可以通过与语句级情感分类类似的方式进行加载,这里不再赘述。下面展示了COTE_DP数据中的前3条数据。
+
+```text
+label   text_a
+重庆老灶火锅    重庆老灶火锅还是很赞的,有机会可以尝试一下!
+炉鱼来了        一入店内,就看到招牌特别大的炉鱼来了,餐桌上还摆了五颜六色的小蜡烛,挺有调调的。
+外婆家  只能说是聚餐圣地外婆家一个需要提前来取号的地方。
+```
+
+#### 模型训练
+
+可以通过如下命令开启观点抽取任务训练。
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" train_opinion.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --save_dir "./checkpoints" \
+    --epochs 10 \
+    --max_seq_len 128 \
+    --batch_size 16 \
+    --learning_rate 5e-5 \
+    --device "gpu"
+```
+
+#### 模型预测
+使用如下命令进行模型预测:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict_opinion.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --ckpt_dir "./checkpoints/model_100" \
+    --batch_size 16 \
+    --max_seq_len 128 \
+    --device "gpu"
+```
+
+**备注**:
+1. 评价对象级情感分类和观点抽取两类任务的模型部署方式可参考语句级情感分类,这里不再赘述。
+2. 评价对象级情感分类以及观点抽取任务,暂不支持SKEP模型的Taskflow离线模型加载。如需使用此类功能,请参考:[unified_sentiment_analysis](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/unified_sentiment_extraction)。
diff --git a/examples/sentiment_analysis/skep/deploy/python/predict.py b/examples/sentiment_analysis/skep/deploy/python/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..2acba4857bd28d86bfd2393f9114f9bfcbcc4d18
--- /dev/null
+++ b/examples/sentiment_analysis/skep/deploy/python/predict.py
@@ -0,0 +1,151 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+import numpy as np
+import paddle
+from scipy.special import softmax
+
+from paddlenlp.data import DataCollatorWithPadding
+from paddlenlp.transformers import SkepTokenizer
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    "--model_name",
+    choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"],
+    default="skep_ernie_1.0_large_ch",
+    help="Select which model to train, defaults to skep_ernie_1.0_large_ch.",
+)
+parser.add_argument(
+    "--model_file",
+    type=str,
+    required=True,
+    default="./static_graph_params.pdmodel",
+    help="The path to model info in static graph.",
+)
+parser.add_argument(
+    "--params_file",
+    type=str,
+    required=True,
+    default="./static_graph_params.pdiparams",
+    help="The path to parameters in static graph.",
+)
+parser.add_argument(
+    "--max_seq_len",
+    default=128,
+    type=int,
+    help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def convert_example(example, tokenizer, label_list, max_seq_len=512, is_test=False): + text = example + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_len): + self.max_seq_len = max_seq_len + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, label_map, batch_size=1): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text in data: + encoded_inputs = convert_example( + text, tokenizer, label_list=label_map.values(), max_seq_len=self.max_seq_len, is_test=True + ) + examples.append(encoded_inputs) + + # Separates data into some batches. + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + data_collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors="np") + + results = [] + for raw_batch in batches: + batch = data_collator(raw_batch) + input_ids, token_type_ids = batch["input_ids"], batch["token_type_ids"] + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(token_type_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + probs = softmax(logits, axis=1) + idx = np.argmax(probs, axis=1) + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_len) + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + + # These data samples is in Chinese. 
+ # If you use the english model, you should change the test data in English. + data = [ + "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + label_map = {0: "negative", 1: "positive"} + + results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/skep/export_model.py b/examples/sentiment_analysis/skep/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..4164c2e603ffb560c1138d5bf009fb9c4ce3ca8f --- /dev/null +++ b/examples/sentiment_analysis/skep/export_model.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.transformers import SkepForSequenceClassification + +parser = argparse.ArgumentParser() +parser.add_argument( + "--ckpt_dir", + type=str, + required=True, + default="./checkpoint/model_100", + help="The directory of saved model checkpoint.", +) +parser.add_argument( + "--output_path", + type=str, + default="./static_graph_params", + help="The path of model parameter in static graph to be saved.", +) +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +args = parser.parse_args() + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_map = {0: "negative", 1: "positive"} + model = SkepForSequenceClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_map)) + print("Loaded model from %s" % args.ckpt_dir) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + print("Static Model has been saved to: {}".format(args.output_path)) diff --git a/examples/sentiment_analysis/skep/predict_aspect.py b/examples/sentiment_analysis/skep/predict_aspect.py new file mode 100644 index 0000000000000000000000000000000000000000..7ce7c19496a677a3c86ca5f9b6b5dc540bda4ecd --- /dev/null +++ b/examples/sentiment_analysis/skep/predict_aspect.py @@ -0,0 +1,133 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +import paddle.nn.functional as F +from tqdm import tqdm + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument("--ckpt_dir", type=str, default=None, help="The directory of saved model checkpoint.") +parser.add_argument( + "--max_seq_len", + default=400, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for prediction.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +@paddle.no_grad() +def predict(model, data_loader, label_map): + """ + Given a prediction dataset, it gives the prediction results. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + label_map(obj:`dict`): The label id (key) to label str (value) map. + """ + model.eval() + results = [] + for batch in tqdm(data_loader): + input_ids, token_type_ids = batch["input_ids"], batch["token_type_ids"] + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. 
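+
+    Note: for the SE-ABSA16_PHNS data used by this script, `example["text"]` is assumed to hold the
+    aspect description (e.g. "phone#design_features") and `example["text_pair"]` the full review text,
+    matching the sample shown in the README; the two are encoded together as a text pair.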
+ """ + encoded_inputs = tokenizer(text=example["text"], text_pair=example["text_pair"], max_seq_len=max_seq_len) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + else: + label = example["label"] + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": label} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + + test_ds = load_dataset("seabsa16", "phns", splits=["test"]) + label_map = {0: "negative", 1: "positive"} + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_map)) + + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_len, is_test=True) + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + test_data_loader = create_dataloader( + test_ds, mode="test", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + results = predict(model, test_data_loader, label_map) + for idx, text in enumerate(test_ds.data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/skep/predict_opinion.py b/examples/sentiment_analysis/skep/predict_opinion.py new file mode 100644 index 0000000000000000000000000000000000000000..5c172d26ff86510bcc35e7974cd6a490ecc58eaf --- /dev/null +++ b/examples/sentiment_analysis/skep/predict_opinion.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from tqdm import tqdm + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument("--ckpt_dir", type=str, default=None, help="The directory of saved model checkpoint.") +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +@paddle.no_grad() +def predict(model, data_loader, label_map): + """ + Given a prediction dataset, it gives the prediction results. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + label_map(obj:`dict`): The label id (key) to label str (value) map. + """ + model.eval() + results = [] + for batch in tqdm(data_loader): + input_ids, token_type_ids, seq_lens = batch["input_ids"], batch["token_type_ids"], batch["seq_lens"] + preds = model(input_ids, token_type_ids, seq_lens=seq_lens) + tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map) + results.extend(tags) + return results + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + """ + tokens = example["tokens"] + new_tokens = [tokenizer.cls_token] + + for index, token in enumerate(tokens): + sub_tokens = tokenizer.tokenize(token) + if not sub_tokens: + sub_tokens = [tokenizer.unk_token] + new_tokens.extend(sub_tokens) + + new_tokens = new_tokens[: max_seq_len - 1] + new_tokens.append(tokenizer.sep_token) + + input_ids = [tokenizer.convert_tokens_to_ids(token) for token in new_tokens] + token_type_ids = [0] * len(input_ids) + seq_len = len(input_ids) + + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "seq_lens": seq_len} + + +def parse_predict_result(predictions, seq_lens, label_map): + """ + Parses the prediction results to the label tag. 
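+
+    For example, with label_map = {0: "B", 1: "I", 2: "O"}, a prediction row [2, 0, 1, 2, 2] with
+    seq_len 5 is parsed to ["B", "I", "O"]: the first and last positions, which correspond to the
+    "[CLS]" and "[SEP]" tokens, are dropped and the remaining ids are mapped to their tags.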
+ """ + pred_tag = [] + for idx, pred in enumerate(predictions): + seq_len = seq_lens[idx] + # drop the "[CLS]" and "[SEP]" token + tag = [label_map[i] for i in pred[1 : seq_len - 1]] + pred_tag.append(tag) + return pred_tag + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + paddle.set_device(args.device) + + test_ds = load_dataset("cote", "dp", splits=["test"]) + label_list = test_ds.label_list + # The COTE_DP dataset labels with "BIO" schema. + label_map = {0: "B", 1: "I", 2: "O"} + # `no_entity_label` represents that the token isn't an entity. + no_entity_label_idx = 2 + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepCrfForTokenClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_list)) + print("Loaded model from %s" % args.ckpt_dir) + + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=no_entity_label_idx) + + test_data_loader = create_dataloader( + test_ds, mode="test", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + results = predict(model, test_data_loader, label_map) + for idx, example in enumerate(test_ds.data): + print(len(example["tokens"]), len(results[idx])) + print("Data: {} \t Label: {}".format(example, results[idx])) diff --git a/examples/sentiment_analysis/skep/predict_sentence.py b/examples/sentiment_analysis/skep/predict_sentence.py new file mode 100644 index 0000000000000000000000000000000000000000..a4dbbc69d1b25eaff26ddde51fcff62459912c05 --- /dev/null +++ b/examples/sentiment_analysis/skep/predict_sentence.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--ckpt_dir", type=str, default=None, help="The directory of saved model checkpoint.") +parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +args = parser.parse_args() + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`str`): The input text to sentiment analysis. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2". + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. + """ + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + + +@paddle.no_grad() +def predict(model, data, tokenizer, label_map, batch_size=1): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`list`): All the predictions labels. + """ + examples = [] + for text in data: + encoded_inputs = convert_example_to_feature(text, tokenizer, max_seq_len=args.max_seq_len) + examples.append(encoded_inputs) + + # Separates data into some batches. 
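+    # DataCollatorWithPadding (padding=True) later pads each batch to the length of its longest sequence.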
+ batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + results = [] + model.eval() + for raw_batch in batches: + batch = data_collator(raw_batch) + input_ids, token_type_ids = batch["input_ids"], batch["token_type_ids"] + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy().tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # These data samples is in Chinese. + # If you use the english model, you should change the test data in English. + data = [ + "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + label_map = {0: "negative", 1: "positive"} + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_map)) + print("Loaded model from %s" % args.ckpt_dir) + + results = predict(model, data, tokenizer, label_map, batch_size=args.batch_size) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/skep/train_aspect.py b/examples/sentiment_analysis/skep/train_aspect.py new file mode 100644 index 0000000000000000000000000000000000000000..f5f6b7720b99fad0c57566f1898250abfc5e5374 --- /dev/null +++ b/examples/sentiment_analysis/skep/train_aspect.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_len", + default=400, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=3e-6, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=50, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def set_seed(seed): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def convert_example(example, tokenizer, max_seq_len=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. 
+ """ + + encoded_inputs = tokenizer(text=example["text"], text_pair=example["text_pair"], max_seq_len=max_seq_len) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + else: + label = example["label"] + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": label} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + set_seed(args.seed) + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + train_ds = load_dataset("seabsa16", "phns", splits=["train"]) + label_list = train_ds.label_list + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.model_name, num_labels=len(label_list)) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch["input_ids"], batch["token_type_ids"], batch["labels"] + loss, logits = model(input_ids, token_type_ids, labels=labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + # Need better way to get inner model of DataParallel + model._layers.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) diff --git a/examples/sentiment_analysis/skep/train_opinion.py b/examples/sentiment_analysis/skep/train_opinion.py new file mode 100644 index 0000000000000000000000000000000000000000..e61f394ac50f6036be96f30002326dc69630840b --- /dev/null +++ b/examples/sentiment_analysis/skep/train_opinion.py @@ -0,0 +1,211 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument( + "--save_dir", + default="./checkpoints", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-7, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def set_seed(seed): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512, no_entity_label="O", is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + no_entity_label(obj:`int`): The label to pad label sequence by default. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`List[int]`, optional): The input label if not is_test. + """ + tokens = example["tokens"] + labels = example["labels"] + assert len(tokens) == len(labels) + + # 1. tokenize the tokens into sub-tokens, and align the length of tokens and labels + new_labels, new_tokens = [no_entity_label], [tokenizer.cls_token] + for index, token in enumerate(tokens): + sub_tokens = tokenizer.tokenize(token) + if not sub_tokens: + sub_tokens = [tokenizer.unk_token] + + # repeate the labels n-times + new_labels.extend([labels[index]] * len(sub_tokens)) + new_tokens.extend(sub_tokens) + + # 2. check the max-length of tokens and labels + new_tokens = new_tokens[: max_seq_len - 1] + new_labels = new_labels[: max_seq_len - 1] + + # 3. 
construct the input data + new_labels.append(no_entity_label) + new_tokens.append(tokenizer.sep_token) + input_ids = [tokenizer.convert_tokens_to_ids(token) for token in new_tokens] + token_type_ids = [0] * len(input_ids) + seq_len = len(input_ids) + + if is_test: + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "seq_lens": seq_len} + else: + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "seq_lens": seq_len, "labels": new_labels} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + set_seed(args.seed) + + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + train_ds = load_dataset("cote", "dp", splits=["train"]) + label_list = train_ds.label_list + # The COTE_DP dataset labels with "BIO" schema. + label_map = {label: idx for idx, label in enumerate(label_list)} + # `no_entity_label` represents that the token isn't an entity. + no_entity_label_idx = label_map.get("O", 2) + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepCrfForTokenClassification.from_pretrained(args.model_name, num_labels=len(label_list)) + + trans_func = partial( + convert_example_to_feature, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + no_entity_label=no_entity_label_idx, + is_test=False, + ) + + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=no_entity_label_idx) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + # print(batch) + input_ids, token_type_ids, seq_lens, labels = ( + batch["input_ids"], + batch["token_type_ids"], + batch["seq_lens"], + batch["labels"], + ) + loss = model(input_ids, token_type_ids, seq_lens=seq_lens, labels=labels) + avg_loss = paddle.mean(loss) + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, avg_loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + # Need better way to get inner model of DataParallel + model._layers.save_pretrained(save_dir) + print("Model saved to: {}.".format(save_dir)) diff --git a/examples/sentiment_analysis/skep/train_sentence.py b/examples/sentiment_analysis/skep/train_sentence.py new file mode 100644 index 0000000000000000000000000000000000000000..5c05f8948b8cf33c97d9495717dc552555cb003a --- /dev/null +++ b/examples/sentiment_analysis/skep/train_sentence.py @@ -0,0 +1,229 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument( + "--save_dir", + default="./checkpoints", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def set_seed(seed): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch["input_ids"], batch["token_type_ids"], batch["labels"] + loss, logits = model(input_ids, token_type_ids, labels=labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + print("eval loss: %.5f, accuracy: %.5f" % (np.mean(losses), acc)) + model.train() + metric.reset() + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512, is_test=False, dataset_name="chnsenticorp"): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2". + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. 
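+
+    For "chnsenticorp" the text is read from example["text"] and the label from example["label"];
+    for "sst-2" the text is read from example["sentence"] and the label from example["labels"].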
+ """ + + if dataset_name == "sst-2": + encoded_inputs = tokenizer(text=example["sentence"], max_seq_len=max_seq_len) + elif dataset_name == "chnsenticorp": + encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_len) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + if dataset_name == "sst-2": + label = example["labels"] + elif dataset_name == "chnsenticorp": + label = example["label"] + else: + raise RuntimeError(f"Got unkown datatset name {dataset_name}, it must be processed on your own.") + + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "label": label} + else: + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + if args.model_name == "skep_ernie_1.0_large_ch": + dataset_name = "chnsenticorp" + train_ds, dev_ds = load_dataset(dataset_name, splits=["train", "dev"]) + + else: + dataset_name = "sst-2" + train_ds, dev_ds = load_dataset("glue", dataset_name, splits=["train", "dev"]) + label_map = {0: "negative", 1: "positive"} + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.model_name, num_labels=len(label_map)) + + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_len, dataset_name=dataset_name + ) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
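+    # The exclusion below is a substring match on parameter names: any parameter whose name contains
+    # "bias" or "norm" (e.g. LayerNorm weights) is left out of decay_params and gets no weight decay.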
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + metric = paddle.metric.Accuracy() + + # start to train model + model.train() + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch["input_ids"], batch["token_type_ids"], batch["labels"] + loss, logits = model(input_ids, token_type_ids, labels=labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accuracy: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + evaluate(model, metric, dev_data_loader) + # Need better way to get inner model of DataParallel + model._layers.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) diff --git a/examples/sentiment_analysis/textcnn/README.md b/examples/sentiment_analysis/textcnn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d4cd1599322a0aa12d18a561bd07686bc77b1f91 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/README.md @@ -0,0 +1,192 @@ +# 使用TextCNN模型完成中文对话情绪识别任务 + +情感分析旨在自动识别和提取文本中的倾向、立场、评价、观点等主观信息。情感分析其中的一个任务就是对话情绪识别,针对智能对话中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极(positive)、消极(negative)和中性(neutral)。 + +本示例展示了如何用TextCNN预训练模型在机器人聊天数据集上进行Finetune完成中文对话情绪识别任务。 + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +textcnn/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── data.py # 数据处理脚本 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # 模型组网脚本 +├── predict.py # 模型预测脚本 +├── README.md # 文档说明 +└── train.py # 对话情绪识别任务训练脚本 +``` + +### 数据准备 + +这里我们提供一份已标注的机器人聊天数据集,包括训练集(train.tsv),开发集(dev.tsv)和测试集(test.tsv)。 +完整数据集可以通过以下命令下载并解压: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/RobotChat.tar.gz +tar xvf RobotChat.tar.gz +``` + +### 词表下载 + +在模型训练之前,需要先下载词汇表文件word_dict.txt,用于构造词-id映射关系。 + +```shell +wget https://bj.bcebos.com/paddlenlp/robot_chat_word_dict.txt +``` + +**NOTE:** 词表的选择和实际应用数据相关,需根据实际数据选择词表。 + +### 预训练模型下载 + +这里我们提供了一个百度基于海量数据训练好的TextCNN模型,用户通过以下方式下载预训练模型。 + +```shell +wget https://bj.bcebos.com/paddlenlp/models/textcnn.pdparams +``` + +### 模型训练 + +在下载好词表和预训练模型后就可以在机器人聊天数据集上进行finetune,通过运行以下命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证,这里通过`--init_from_ckpt=./textcnn.pdparams`指定TextCNN预训练模型。 + +CPU 启动: + +```shell +python train.py --vocab_path=./robot_chat_word_dict.txt \ + --init_from_ckpt=./textcnn.pdparams \ + --device=cpu \ + --lr=5e-5 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir=./checkpoints \ + --data_path=./RobotChat +``` + +GPU 启动: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --vocab_path=./robot_chat_word_dict.txt \ + --init_from_ckpt=./textcnn.pdparams \ + --device=gpu \ + --lr=5e-5 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir=./checkpoints 
\ + --data_path=./RobotChat +``` + +XPU启动: + +```shell +python train.py --vocab_path=./robot_chat_word_dict.txt \ + --init_from_ckpt=./textcnn.pdparams \ + --device=xpu \ + --lr=5e-5 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir=./checkpoints \ + --data_path=./RobotChat +``` + +以上参数表示: + +* `vocab_path`: 词汇表文件路径。 +* `init_from_ckpt`: 恢复模型训练的断点路径。 +* `device`: 选用什么设备进行训练,可选cpu、gpu或xpu。如使用gpu训练则参数gpus指定GPU卡号。 +* `lr`: 学习率, 默认为5e-5。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为10。 +* `save_dir`: 训练保存模型的文件路径。 +* `data_path`: 数据集文件路径。 + + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... +└── final.pdparams +``` + +**NOTE:** + +* 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 +* 使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + 运行方式: + +```shell +python export_model.py --vocab_path=./robot_chat_word_dict.txt --params_path=./checkpoints/final.pdparams --output_path=./static_graph_params +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +导出模型之后,可以用于部署,deploy/python/predict.py文件提供了python部署预测示例。运行方式: + +```shell +python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams +``` + +### 模型预测 + +启动预测: + +CPU启动: + +```shell +python predict.py --vocab_path=./robot_chat_word_dict.txt \ + --device=cpu \ + --params_path=./checkpoints/final.pdparams +``` + +GPU启动: + +```shell +export CUDA_VISIBLE_DEVICES=0 +python predict.py --vocab_path=./robot_chat_word_dict.txt \ + --device=gpu \ + --params_path=./checkpoints/final.pdparams +``` + +XPU启动: + +```shell +python predict.py --vocab_path=./robot_chat_word_dict.txt \ + --device=xpu \ + --params_path=./checkpoints/final.pdparams +``` + +待预测数据如以下示例: + +```text +你再骂我我真的不跟你聊了 +你看看我附近有什么好吃的 +我喜欢画画也喜欢唱歌 +``` + +经过`preprocess_prediction_data`函数处理后,调用`predict`函数即可输出预测结果。 + +如 + +```text +Data: 你再骂我我真的不跟你聊了 Label: negative +Data: 你看看我附近有什么好吃的 Label: neutral +Data: 我喜欢画画也喜欢唱歌 Label: positive +``` + +## Reference + +TextCNN参考论文: + +- [EMNLP2014-Convolutional Neural Networks for Sentence Classification](https://aclanthology.org/D14-1181.pdf) diff --git a/examples/sentiment_analysis/textcnn/data.py b/examples/sentiment_analysis/textcnn/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3426f065afec8c5ae15762d32ac24aa5a0c6ccb9 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/data.py @@ -0,0 +1,93 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + """ + Create dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. 
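+            Typically built with `paddlenlp.datasets.load_dataset` (as done in train.py of this example).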
+        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
+        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
+        batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
+            the sample list; `None` means only stacking each field of the samples along axis 0
+            (same as :attr:`np.stack(..., axis=0)`).
+        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
+
+    Returns:
+        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
+    """
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
+    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn)
+    return dataloader
+
+
+def preprocess_prediction_data(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3):
+    """
+    It processes the prediction data into the same format as used in training.
+
+    Args:
+        data (obj:`list[str]`): The prediction data, each element of which is a tokenized text.
+        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string into tokens.
+        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
+        max_ngram_filter_size (obj:`int`, optional, defaults to 3): Max n-gram size in the TextCNN model.
+            Users should refer to the ngram_filter_sizes setting in TextCNN; if ngram_filter_sizes=(1, 2, 3),
+            then max_ngram_filter_size=3.
+
+    Returns:
+        examples (obj:`list`): The processed data, each element of which
+            is a `list` object containing
+
+            - word_ids(obj:`list[int]`): The list of word ids.
+    """
+    examples = []
+    for text in data:
+        ids = tokenizer.encode(text)
+        seq_len = len(ids)
+        # The sequence length should be larger than or equal to the maximum ngram_filter_size in the TextCNN model
+        if seq_len < max_ngram_filter_size:
+            ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len))
+        examples.append(ids)
+    return examples
+
+
+def convert_example(example, tokenizer):
+    """Convert a raw example into (input_ids, label) arrays."""
+    input_ids = tokenizer.encode(example["text"])
+    input_ids = np.array(input_ids, dtype="int64")
+
+    label = np.array(example["label"], dtype="int64")
+    return input_ids, label
+
+
+def read_custom_data(filename):
+    """Reads data."""
+    with open(filename, "r", encoding="utf-8") as f:
+        # Skip the header line
+        next(f)
+        for line in f:
+            data = line.strip().split("\t")
+            label, text = data
+            yield {"text": text, "label": label}
diff --git a/examples/sentiment_analysis/textcnn/deploy/python/predict.py b/examples/sentiment_analysis/textcnn/deploy/python/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..35d6e15ecf2cb205284cb0a08f500b3df3744dcb
--- /dev/null
+++ b/examples/sentiment_analysis/textcnn/deploy/python/predict.py
@@ -0,0 +1,141 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import JiebaTokenizer, Pad, Vocab + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_file", + type=str, + required=True, + default="./static_graph_params.pdmodel", + help="The path to model info in static graph.", +) +parser.add_argument( + "--params_file", + type=str, + required=True, + default="./static_graph_params.pdiparams", + help="The path to parameters in static graph.", +) +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def convert_example(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3): + """convert_example""" + input_ids = tokenizer.encode(data) + seq_len = len(input_ids) + # Sequence length should larger or equal than the maximum ngram_filter_size in TextCNN model + if seq_len < max_ngram_filter_size: + input_ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len)) + input_ids = np.array(input_ids, dtype="int64") + return input_ids + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + data (obj:`list(str)`): Data to be predicted. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text in data: + input_ids = convert_example(text, tokenizer) + examples.append(input_ids) + + # Separates data into some batches. 
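+        # The texts are grouped into fixed-size batches below; `batchify_fn` (a `Pad`
+        # collator) then pads every sample in a batch to the batch's longest length
+        # with `pad_token_id`, so each batch becomes one rectangular id array.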
+ batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + batchify_fn = lambda samples, fn=Pad(axis=0, pad_val=pad_token_id): fn(samples) # input + + results = [] + for batch in batches: + input_ids = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(input_ids) + self.predictor.run() + logits = paddle.to_tensor(self.output_handle.copy_to_cpu()) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_length) + + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + pad_token_id = vocab.to_indices("[PAD]") + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "negative", 1: "neutral", 2: "positive"} + + # Firstly pre-processing prediction data and then do predict. + data = ["你再骂我我真的不跟你聊了", "你看看我附近有什么好吃的", "我喜欢画画也喜欢唱歌"] + + results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size, pad_token_id=pad_token_id) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/textcnn/export_model.py b/examples/sentiment_analysis/textcnn/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..0953a40020532f6f263d6fbcc2dbde375dbaad32 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/export_model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from paddlenlp.data import Vocab +from model import TextCNNModel + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + + +def main(): + # Load vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + + vocab = Vocab.load_vocabulary(args.vocab_path) + label_map = {0: "negative", 1: "neutral", 2: "positive"} + + # Construct the newtork. + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + + model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) + + # Load model parameters. 
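+    # The dygraph parameters saved by train.py are restored into the model,
+    # which is switched to eval mode before being traced to a static graph below.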
+ state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + model.eval() + + inputs = [paddle.static.InputSpec(shape=[None, None], dtype="int64")] + + model = paddle.jit.to_static(model, input_spec=inputs) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/sentiment_analysis/textcnn/model.py b/examples/sentiment_analysis/textcnn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..655f1e2b8492a7748ed581c1413ebe73880433ce --- /dev/null +++ b/examples/sentiment_analysis/textcnn/model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +from paddlenlp.seq2vec import CNNEncoder + + +class TextCNNModel(nn.Layer): + """ + This class implements the Text Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=128, + ngram_filter_sizes=(1, 2, 3), + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = CNNEncoder(emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes) + self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, len(ngram_filter_sizes) * num_filter) + encoder_out = paddle.tanh(self.encoder(embedded_text)) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(encoder_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits diff --git a/examples/sentiment_analysis/textcnn/predict.py b/examples/sentiment_analysis/textcnn/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..ba3dd295814933e05f57ed7fa6e47e53ed54263b --- /dev/null +++ b/examples/sentiment_analysis/textcnn/predict.py @@ -0,0 +1,94 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import paddle +import paddle.nn.functional as F +from paddlenlp.data import JiebaTokenizer, Pad, Vocab + +from model import TextCNNModel +from data import preprocess_prediction_data + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - word_ids(obj:`list[int]`): The list of word ids. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + batchify_fn = lambda samples, fn=Pad(axis=0, pad_val=pad_token_id): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + texts = paddle.to_tensor(batchify_fn(batch)) + logits = model(texts) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Load vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + label_map = {0: "negative", 1: "neutral", 2: "positive"} + + # Construct the newtork. + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + + model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) + + # Load model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
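+    # preprocess_prediction_data (see data.py) cuts each text with the Jieba tokenizer,
+    # maps tokens to vocabulary ids, and pads inputs shorter than the largest n-gram size.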
+ data = ["你再骂我我真的不跟你聊了", "你看看我附近有什么好吃的", "我喜欢画画也喜欢唱歌"] + tokenizer = JiebaTokenizer(vocab) + examples = preprocess_prediction_data(data, tokenizer, pad_token_id) + + results = predict(model, examples, label_map=label_map, batch_size=args.batch_size, pad_token_id=pad_token_id) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/textcnn/train.py b/examples/sentiment_analysis/textcnn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e80d2180af759c725d9f158910d5248ecfb12a30 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/train.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import os +import random + +import numpy as np +import paddle +from paddlenlp.datasets import load_dataset +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +from data import create_dataloader, convert_example, read_custom_data +from model import TextCNNModel + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--data_path", type=str, default='./RobotChat', help="The path of datasets to be loaded") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The directory to dataset.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed=1000): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +if __name__ == "__main__": + paddle.set_device(args.device) + set_seed() + + # Load vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + # Load datasets. 
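+    # Each *.tsv file contains a header line followed by `label<TAB>text` rows,
+    # which is the format expected by read_custom_data in data.py.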
+ dataset_names = ["train.tsv", "dev.tsv", "test.tsv"] + train_ds, dev_ds, test_ds = [ + load_dataset(read_custom_data, filename=os.path.join(args.data_path, dataset_name), lazy=False) + for dataset_name in dataset_names + ] + + tokenizer = JiebaTokenizer(vocab) + trans_fn = partial(convert_example, tokenizer=tokenizer) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), Stack(dtype="int64") # label + ): [data for data in fn(samples)] + train_loader = create_dataloader( + train_ds, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn, trans_fn=trans_fn + ) + dev_loader = create_dataloader( + dev_ds, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn, trans_fn=trans_fn + ) + test_loader = create_dataloader( + test_ds, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn, trans_fn=trans_fn + ) + + label_map = {0: "negative", 1: "neutral", 2: "positive"} + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + + model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.Model(model) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Define loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Start training and evaluating. + callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) + + # Evaluate on test dataset + print("Start to evaluate on test dataset...") + model.evaluate(test_loader, log_freq=len(test_loader)) diff --git a/examples/simultaneous_translation/stacl/README.md b/examples/simultaneous_translation/stacl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..554b251a4b6177e7fb4d7b6761dc82f6af1c6688 --- /dev/null +++ b/examples/simultaneous_translation/stacl/README.md @@ -0,0 +1,160 @@ +# Text Simultaneous Translation using Prefix-to-Prefix Framework: STACL + +同声传译(Simultaneous Translation),即在句子完成之前进行翻译,同声传译的目标是实现同声传译的自动化,它可以与源语言同时翻译,延迟时间只有几秒钟。 + +同声传译的难点在于源语言和目标语言之间词序的差异带来的翻译延迟。 例如,考虑将SOV(主宾谓)语言(如日语或德语)翻译为SVO(主谓宾)语言(如英语或汉语),必须等到源语言动词出现才可以准确翻译。因此,翻译系统必须求助于传统的全句翻译,因此造成至少一句话的延迟。 + +本项目是基于机器翻译领域主流模型 Transformer[1]网络结构的同传模型STACL的PaddlePaddle 实现,包含模型训练,预测以及使用自定义数据等内容。用户可以基于发布的内容搭建自己的同传翻译模型。 + +## 模型介绍 + +### 模型特点 + +STACL 是论文 [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) 中针对同传提出的适用于所有同传场景的翻译架构[2],该架构基于Transformer实现,可参考PaddleNLP的[Transformer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer)。 + +STACL 主要具有以下优势: + +- Prefix-to-Prefix架构拥有预测能力,即在未看到源词的情况下仍然可以翻译出对应的目标词,克服了SOV→SVO等词序差异; + +- Wait-k策略可以不需要全句的源句,直接预测目标句,可以实现任意的字级延迟,同时保持较高的翻译质量。 + +#### Prefix-to-Prefix架构 +<p align="center"> +<img src="images/STACL_architecture.png" height=300 hspace='10'/> <br /> +图 1. Seq2Seq vs. 
STACL
+</p>
+
+STACL与传统的机器翻译模型的主要区别在于翻译时是否需要利用全句的源句。上图中,Seq2Seq模型需要等到全句的源句(1-5)全部输入Encoder后,Decoder才开始解码进行翻译;而STACL架构采用了Wait-k(图中Wait-2)的策略,当源句只有两个词(1和2)输入到Encoder后,Decoder即可开始解码预测目标句的第一个词。
+
+#### Wait-k 策略
+Wait-k策略首先等待k个源句单词,然后与源句的其余部分同步翻译,即输出总是落后于输入k个词。这是受到同声传译人员的启发:同声传译人员通常会在演讲开始几秒钟后开始翻译,并在演讲者结束几秒钟后完成翻译。例如,如果k=2,第一个目标词使用前2个源词预测,第二个目标词使用前3个源词预测,以此类推。下图2中,(a) simultaneous: our wait-2 等到"布什"和"总统"输入后就开始解码预测"pres.",而(b) non-simultaneous baseline 为传统的翻译模型,需要等到整句"布什 总统 在 莫斯科 与 普京 会晤"输入后才开始解码预测。
+<p align="center">
+<img src="images/example.png" height=100 hspace='10'/> <br />
+图 2. Wait-k 例子
+</p>
+
+## 环境依赖
+ - attrdict==2.0.1
+ - PyYAML==5.4.1
+ - subword_nmt==0.3.7
+ - jieba==0.42.1
+
+安装命令:`pip install -r requirements.txt`
+
+## 数据准备
+
+### 数据分词
+中文需要首先经过jieba分词,然后经过BPE分词(Byte Pair Encoding);英文仅需要经过BPE分词。
+BPE分词需要对应的BPE词典,这里提供下载链接:[中文BPE词典](https://bj.bcebos.com/paddlenlp/models/stacl/2M.zh2en.dict4bpe.zh) ,[英文BPE词典](https://bj.bcebos.com/paddlenlp/models/stacl/2M.zh2en.dict4bpe.en) 。
+
+我们提供分词的接口,下面给出分词的具体操作:
+```python
+from utils.tokenizer import STACLTokenizer
+
+tokenizer_zh = STACLTokenizer('2M.zh2en.dict4bpe.zh', is_chinese=True)
+# 处理中文字符串
+print(tokenizer_zh.tokenize('玻利维亚举行总统与国会选举'))
+# 输出是: 玻@@ 利@@ 维亚 举行 总统 与 国会 选举
+
+# 处理英文字符串
+tokenizer_en = STACLTokenizer('2M.zh2en.dict4bpe.en', is_chinese=False)
+print(tokenizer_en.tokenize('bolivia holds presidential and parliament elections'))
+# 输出是:bol@@ i@@ via holds presidential and parliament elections
+```
+
+### 数据格式
+每行数据为分词后的中英文,用制表符分割。
+```
+兵营 是 双@@ 枪 老@@ 大@@ 爷 的 前提 建筑 之一 。 it serves as a prerequisite for Re@@ apers to be built at the Bar@@ rac@@ ks .
+```
+
+## 单机训练
+
+### 单机单卡/单机多卡
+可以执行以下命令进行模型训练:
+``` sh
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" train.py --config ./config/transformer.yaml
+```
+
+可以在`config/transformer.yaml` 文件中设置相应的参数。如果执行时不提供 `--config` 选项,程序将默认使用`config/transformer.yaml` 的配置。
+
+建议:为了取得更好的效果,可先训练整句模型(即`waitk=-1`)作为预训练模型,然后在此基础上根据不同的wait-k进行微调,得到不同的wait-k模型,训练的命令都同上。下面给出具体的流程以及主要的参数配置:
+- Pretrain
+  用来训练整句模型(即`waitk=-1`),可在`config/transformer.yaml`文件中配置参数:
+  - `waitk`表示wait-k策略,这里设置为-1
+  - `training_file`表示训练集,数据格式同上文
+  - `validation_file`表示验证集,数据格式同上文
+  - `init_from_checkpoint`表示模型目录,从该checkpoint恢复训练,这里设置为空
+  - `init_from_pretrain_model`表示模型目录,从该checkpoint开始finetune下游任务,这里设置为空
+  - `device`选择训练用的设备,支持cpu/gpu/xpu,默认为gpu
+  - `use_amp`表示是否使用混合精度训练,示例设置为False
+- Finetune
+  用来训练wait-k模型(即`waitk=1,2,3,4...`),可在`config/transformer.yaml`文件中配置参数:
+  - `waitk`表示wait-k策略,这里设置为3(以wait-3模型为例)
+  - `training_file`表示训练集,数据格式同上文
+  - `validation_file`表示验证集,数据格式同上文
+  - `init_from_checkpoint`表示模型目录,从该checkpoint恢复训练,这里设置为`waitk=-1`模型的checkpoint
+  - `init_from_pretrain_model`表示模型目录,从该checkpoint开始finetune下游任务,这里设置为空
+  - `device`选择训练用的设备,支持cpu/gpu/xpu,默认为gpu
+  - `use_amp`表示是否使用混合精度训练,示例设置为False
+
+## 模型推理
+
+模型训练完成后可以执行以下命令对指定文件中的文本进行翻译:
+
+``` sh
+# setting visible devices for prediction
+export CUDA_VISIBLE_DEVICES=0
+python predict.py --config ./config/transformer.yaml
+```
+- Predict
+  根据具体的wait-k策略来进行翻译,可在`config/transformer.yaml`文件中配置参数,预测的命令同上。下面给出主要的参数说明:
+  - `waitk`表示wait-k策略,这里设置为3(以wait-3模型为例)
+  - `predict_file`表示测试集,数据格式是BPE分词后的源语言(中文为Jieba+BPE分词),按行区分
+  - `output_file`表示输出文件,翻译结果会输出到该参数指定的文件
+  - `init_from_params`表示模型所在的目录,根据具体的`waitk`来设置,这里设置为wait-3模型的目录
+  - 更多参数的使用可以在 `config/transformer.yaml`文件中查阅注释说明并进行更改设置。如果执行时不提供 `--config` 选项,程序将默认使用 `config/transformer.yaml` 的配置。
+
+需要注意的是,目前预测仅支持单卡:后续的模型评估依赖于预测结果写入文件的顺序,多卡情况下暂未支持将结果按照指定顺序写入文件。
+
+
+## 模型评估
+
+预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE
表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以中英翻译 newstest2017 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl newstest2017.tok.en < predict.tok.txt +``` + +## 模型下载(更新中) +我们提供基于NIST(中->英,共2M中英句对)预训练模型,供大家下载,下载后需解压使用。 +| Wait-k策略 | 模型连接 | 4-ref BLEU on NIST 2008| +| ------------ | --------------- |---------| +| Wait-1 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w1.tar.gz) |30.94| +| Wait-3 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w3.tar.gz) |34.24 | +| Wait-5 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w5.tar.gz) |36.30 | +| Wait-7 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w7.tar.gz) |37.84 | +| Wait_-1(整句模型) |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_sent.tar.gz) |41.41 | +词表下载:[source vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.20k.zh.vocab) ,[target vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.10k.en.vocab) + +## Demo展示 +通过GUI界面的Demo来模拟STACL实时翻译的效果,下图为Demo示例,实现细节可查看[demo](./demo) +<p align="center"> +<img src="demo/images/text_demo_show.gif" height=350 hspace='10'/> <br /> +图 3. 文本同传 +</p> +<p align="center"> +<img src="demo/images/speech_demo_show.gif" height=350 hspace='10'/> <br /> +图 4. 语音同传 +</p> + +## 参考文献 +1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. +2. Ma M , Huang L , Xiong H , et al. [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/)[J]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2018: 3025–3036. +3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. +4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. diff --git a/examples/simultaneous_translation/stacl/config/transformer.yaml b/examples/simultaneous_translation/stacl/config/transformer.yaml new file mode 100644 index 0000000000000000000000000000000000000000..2bdb04cdd2037d9b157120715f4af232d5c47019 --- /dev/null +++ b/examples/simultaneous_translation/stacl/config/transformer.yaml @@ -0,0 +1,99 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# path of trained parameter, to make prediction +init_from_params: "trained_models/step_final/" +# the directory for saving model +save_model: "trained_models" +# Set seed for CE or debug +random_seed: 42 +# The pattern to match training data files. +training_file: "data/nist2m/train.zh-en.bpe" +# The pattern to match validation data files. +validation_file: "data/nist2m/dev.zhen.bpe" +# The pattern to match test data files. 
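+# Each line of this file should be one BPE-tokenized source sentence
+# (for Chinese: Jieba segmentation followed by BPE), see the README.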
+predict_file: "data/nist2m/test_08.zh.bpe" +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The path of vocabulary file of source language. +src_vocab_fpath: "data/nist2m/nist.20k.zh.vocab" +# The path of vocabulary file of target language. +trg_vocab_fpath: "data/nist2m/nist.10k.en.vocab" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: ["<s>", "<e>", "<unk>"] + +# Use which device to train or predict(cpu,gpu,xpu) +device: gpu + +# Args for reader, see reader.py for details +pool_size: 200000 +sort_type: "pool" +shuffle: True +shuffle_batch: True +batch_size: 4096 + +# Hyparams for training: +# the number of epoches for training +epoch: 30 +# the hyper parameters for Adam optimizer. +# This static learning_rate will be multiplied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# the parameters for learning rate scheduling. +warmup_steps: 8000 +# the weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. +label_smooth_eps: 0.1 + +# Hyparams for generation: +# the parameters for beam search. +beam_size: 5 +max_out_len: 256 +# the number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# size of source word dictionary. +src_vocab_size: 10000 +# size of target word dictionay +trg_vocab_size: 10000 +# index for <bos> token +bos_idx: 0 +# index for <eos> token +eos_idx: 1 +# index for <unk> token +unk_idx: 2 +# max length of sequences deciding the size of position encoding table. +max_length: 256 +# the dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# number of head used in multi-head attention. +n_head: 8 +# number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# dropout rates. +dropout: 0.1 +# the flag indicating whether to share embedding and softmax weights. +# vocabularies in source and target should be same for weight sharing. +weight_sharing: False +# Wait-k policy +waitk: -1 +# Mixed precision training +use_amp: False +# Maximum iteration for training. +max_iter: None \ No newline at end of file diff --git a/examples/simultaneous_translation/stacl/demo/README.md b/examples/simultaneous_translation/stacl/demo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..84c2ff473ecf488f6f44d37f9cd33847f37f1775 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/README.md @@ -0,0 +1,92 @@ +# Demo for STACL + +该Demo模拟同传模型STACL实时翻译的效果。 +<p align="center"> +<img src="images/text_demo_show.gif" height=350 hspace='10'/> <br /> +图 1. 文本同传 +</p> +<p align="center"> +<img src="images/speech_demo_show.gif" height=350 hspace='10'/> <br /> +图 2. 
语音同传 +</p> + +用户通过Chinese input文本框**打字输入**或者**语音输入即本地麦克风收音**,然后通过Jieba和BPE得到分词结果。 + +- Simultaneous Translation (wait 1)是读取1个token(分词后)后开始实时翻译; +- Simultaneous Translation (wait 3)是读取3个token(分词后)后开始实时翻译; +- Simultaneous Translation (wait 5)是读取5个token(分词后)后开始实时翻译; +- Full Sentence Translation(wait -1)是读取所有的token(分词后)即整句后开始翻译。 + +一般来说,waitk越大(waitk=-1可看作waitk=∞),读入的信息越多,实时翻译效果越好。由上图可见,STACL具有较好的预测性,较小的waitk也能得到较好的翻译结果。 + +### 目录结构 +```text +. +├── README.md # 说明文档,本文件 +├── const.py # 语音识别应用鉴权信息 +├── demo.py # 启动demo的主程序文件 +├── images +│ ├── speech_demo_show.gif # 语音同传Demo效果图 +│ ├── paddlenlp.png # Demo界面logo +│ └── text_demo_show.gif # 文本同传Demo效果图 +├── model_demo.py # STACL模型文件 +├── models # 预训练模型路径 +│ ├── nist_wait_1 # waitk=1模型 +│ ├── nist_wait_3 # waitk=3模型 +│ ├── nist_wait_5 # waitk=5模型 +│ └── nist_wait_-1 # waitk=-1(整句模型) +├── requirements.txt # 环境依赖文件 +└── transformer_demo.yaml # 参数配置文件 + +``` + +上述models目录下的模型可以在这里[下载](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/simultaneous_translation/stacl/README.md#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD%E6%9B%B4%E6%96%B0%E4%B8%AD) ,下载完后将解压后的`transformer.pdparams`分别放在不同的waitk策略对应的子目录下面。 + +### 参数说明与配置 + +##### 1. 模型参数配置 +可以在`transformer_demo.yaml` 文件中设置相应的参数,下面给出主要的参数配置: + +- `src_bpe_dict`配置源语言(这里是中文)的BPE词表,[中文BPE词表下载](https://bj.bcebos.com/paddlenlp/models/stacl/2M.zh2en.dict4bpe.zh) +- `src_vocab_fpath`配置源语言(这里是中文)词表,[source vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.20k.zh.vocab) +- `trg_vocab_fpath`配置目标语言(这里是英文)词表,[target vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.10k.en.vocab) +- `device`选择预测用的设备,支持cpu/gpu/xpu,默认为cpu + +##### 2. 语音同传参数配置 +需要配置`const.py`里面语音识别的应用鉴权信息,只需要将`APPID`和`APPKEY`设置为自己所申请的。 +申请教程:[教程](./README_ai.md) + +### 环境依赖 +##### 1. 基本环境 +- attrdict==2.0.1 +- PyYAML==5.4.1 +- subword_nmt==0.3.7 +- jieba==0.42.1 +- websocket-client==1.0.1 + +可通过安装命令:`pip install -r requirements.txt`来进行安装。 + +注意:本项目依赖于Python内置包`tkinter >= 8.6` +- 查看`tkinter`的版本: + ```python + python -c "import tkinter; print(tkinter.TkVersion)" +- [`tkinter`官方文档](https://tkdocs.com/tutorial/index.html) + +##### 2. 语音同传环境 +需要安装`pyaudio==0.2.11`来调用本地麦克风,安装教程参考[官网](http://people.csail.mit.edu/hubert/pyaudio/) +安装失败,则只会启动文本同传。 + + +### 使用说明 + +1. 下载[预训练模型](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/simultaneous_translation/stacl/README.md#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD%E6%9B%B4%E6%96%B0%E4%B8%AD) ,并放在models目录下对应的子目录里; +2. 下载词表(源语言词表,目标语言词表,BPE词表),并在配置文件`transformer_demo.yaml`中修改相应的参数; +3. 运行`demo.py`; +4. 出现界面,在Chinese input文本框中输入中文,按【回车键】开始实时翻译,或者按【REC】开始录音并开始实时翻译,遇到【。!?】结束整句,按【CLEAR】清空所有的输入和输出。 + +### 常见问题 +**Q:** 出现`_tkinter.TclError: couldn't recognize data in image file`错误 +**A:** 升级`tkinter`,确保`tkinter >= 8.6` + +**Q:** 出现Chinese input文本框无法输入中文 +**A:** 升级`tkinter`,确保`tkinter >= 8.6` diff --git a/examples/simultaneous_translation/stacl/demo/README_ai.md b/examples/simultaneous_translation/stacl/demo/README_ai.md new file mode 100644 index 0000000000000000000000000000000000000000..20ded3903b7b075959049e0c5be8730b8c34b314 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/README_ai.md @@ -0,0 +1,42 @@ +# AI接入指南 +### 1. 
成为开发者 +- 1.1 参考[AI接入指南](https://ai.baidu.com/ai-doc/REFERENCE/Ck3dwjgn3) 完成第一步 + +### 2.创建应用 +- 2.1 进入[控制台](https://console.bce.baidu.com/?fromai=1#/index/overview_v3) ,选择【语音技术】 +<p align="center"> +<img src="images/step1.png"/> <br /> +</p> + +- 2.2 [创建应用](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/app/create) ,参数配置 +<p align="center"> +<img src="images/step2.png"/> <br /> +</p> + +<p align="center"> +<img src="images/step3.png"/> <br /> +</p> +<p align="center"> +<img src="images/step4.png"/> <br /> +</p> + +### 3.领取免费资源 +- 3.1 [领取免费资源](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/resource/getFree) ,需要实名认证 +<p align="center"> +<img src="images/step5.png"/> <br /> +</p> + +<p align="center"> +<img src="images/step6.png"/> <br /> +</p> + +- 3.2 开通付费,这里因为有10个小时的免费资源,所以不收费 +<p align="center"> +<img src="images/step7.png"/> <br /> +</p> + +### 4.获取密钥 +- 4.1 本项目主要用到AppID和API Key +<p align="center"> +<img src="images/step8.png"/> <br /> +</p> diff --git a/examples/simultaneous_translation/stacl/demo/const.py b/examples/simultaneous_translation/stacl/demo/const.py new file mode 100644 index 0000000000000000000000000000000000000000..06e89a1ae55ae497a9488cd90e5163310979b8c5 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/const.py @@ -0,0 +1,25 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Your APPID (type: int) +APPID = None + +# Your APPKEY(type: str) +APPKEY = None + +# Do not modify: Chinese Putonghua PID +DEV_PID = 15372 + +# Do not modify: wss link +URI = "wss://vop.baidu.com/realtime_asr" diff --git a/examples/simultaneous_translation/stacl/demo/demo.py b/examples/simultaneous_translation/stacl/demo/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..caa1b317c4029bb9b9a70111fcd9be900305ac60 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/demo.py @@ -0,0 +1,650 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
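+
+# GUI demo: each Chinese input (typed, or recognized from the local microphone)
+# is segmented with Jieba + BPE and then translated in parallel by four STACL
+# models (wait-1, wait-3, wait-5 and the full-sentence wait -1 model).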
+import argparse +import json +import os +import threading +import time +import uuid +from tkinter import END, LEFT, Button, E, Entry, Label, PhotoImage, Tk, W + +import _locale +import jieba +import paddle +import websocket +import yaml +from attrdict import AttrDict +from subword_nmt import subword_nmt + +from paddlenlp.data import Vocab +from paddlenlp.transformers import position_encoding_init +from paddlenlp.utils.log import logger + +open_speech = True +try: + from pyaudio import PyAudio, paInt16 +except ImportError: + open_speech = False + logger.warning("No module named 'pyaudio', so no audio demo.") + +import const # noqa: E402 +from model_demo import SimultaneousTransformerDemo # noqa: E402 + +# By default, the Windows system opens the file with GBK code, +# and the subword_nmt package does not support setting open encoding, +# so it is set to UTF-8 uniformly. +_locale._getdefaultlocale = lambda *args: ["en_US", "utf8"] + +is_win = False +if os.name == "nt": + is_win = True + + +class STACLTokenizer: + """ + Jieba+BPE, and convert tokens to ids. + """ + + def __init__(self, args, is_chinese): + bpe_parser = subword_nmt.create_apply_bpe_parser() + bpe_args = bpe_parser.parse_args(args=["-c", args.src_bpe_dict]) + self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges, bpe_args.separator, None, bpe_args.glossaries) + self.is_chinese = is_chinese + + self.src_vocab = Vocab.load_vocabulary( + args.src_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + self.trg_vocab = Vocab.load_vocabulary( + args.trg_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + args.src_vocab_size = len(self.src_vocab) + args.trg_vocab_size = len(self.trg_vocab) + self.args = args + + def tokenize(self, raw_string): + raw_string = raw_string.strip("\n") + if not raw_string: + return raw_string, raw_string + if self.is_chinese: + raw_string = " ".join(jieba.cut(raw_string)) + bpe_str = self.bpe.process_line(raw_string) + ids = self.src_vocab.to_indices(bpe_str.split()) + return bpe_str.split(), ids + + +def init_model(args, init_from_params): + # Define model + args.init_from_params = init_from_params + transformer = SimultaneousTransformerDemo( + args.src_vocab_size, + args.trg_vocab_size, + args.max_length + 1, + args.n_layer, + args.n_head, + args.d_model, + args.d_inner_hid, + args.dropout, + args.weight_sharing, + args.bos_idx, + args.eos_idx, + args.waitk, + ) + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." + + model_dict = paddle.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # To avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + + transformer.load_dict(model_dict) + return transformer + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
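+    Everything after the first <eos> id is dropped; the <bos>/<eos> ids themselves
+    are kept only when output_bos/output_eos are set to True.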
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def translate( + args, tokenizer, tokenized_src, transformers, waitks, decoder_max_length, is_last, caches, bos_id, all_result +): + # Set evaluate mode + for transformer in transformers: + transformer.eval() + + for idx, (waitk, transformer) in enumerate(zip(waitks, transformers)): + if len(tokenized_src) < waitk or (waitk == -1 and not is_last): + continue + with paddle.no_grad(): + input_src = tokenized_src + if is_last: + decoder_max_length[idx] = args.max_out_len + input_src += [args.eos_idx] + src_word = paddle.to_tensor(input_src).unsqueeze(axis=0) + finished_seq, finished_scores, cache = transformer.greedy_search( + src_word, max_len=decoder_max_length[idx], waitk=waitk, caches=caches[idx], bos_id=bos_id[idx] + ) + caches[idx] = cache + finished_seq = finished_seq.numpy() + for beam_idx, beam in enumerate(finished_seq[0]): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + if len(id_list) == 0: + continue + bos_id[idx] = id_list[-1] + word_list = tokenizer.trg_vocab.to_tokens(id_list) + for word in word_list: + all_result[idx].append(word) + res = " ".join(word_list).replace("@@ ", "") + logger.debug("[waitk={}] {}".format(waitk, res)) + + +def cut_line(str, line_len): + """ + Wrap output + """ + result = [] + temp = [] + for idx, item in enumerate(str.split()): + temp.append(item) + if (idx + 1) % line_len == 0: + result.append(" ".join(temp)) + temp = [] + if len(temp) != 0: + result.append(" ".join(temp)) + return "\n".join(result) + + +def process(args, tokenizer, transformers, waitks): + """ + GUI and main waitk program + :param args: + :param tokenizer: + :param transformers: + :param waitks: + :return: + """ + font_align = ("Courier", 20) + font_label = ("Times", 14) + + if is_win: + font_align = ("Courier", 15) + font_label = ("Times", 11) + + window = Tk() + + window.title("Welcome to Simultaneous Translation") + window.geometry("1200x600") + + logo = PhotoImage(file="images/paddlenlp.png") + button = Label(window, image=logo, compound="center") + button.place(x=0, y=0) + + # for chinese input + lbl1 = Label(window, text="Chinese input:", fg="green", font=font_label, anchor=E, width=28) + lbl1.place(x=0, y=60) + txt = Entry(window, font=font_align) + txt.place(x=250, y=50, width=800, height=50) + + button_on = Button(window, text="REC", relief="raised", cursor="hand2") + if open_speech: + button_on.place(x=1090, y=52) + + s_x, s_y = 0, 130 + x, y = 250, 120 + + # for jieba+BPE + lbl2_s = Label(window, text="Jieba+BPE:", fg="black", font=font_label, anchor=E, width=28) + lbl2_s.place(x=s_x, y=s_y) + lbl2 = Label(window, text="", font=font_align, background="pale green", anchor=E) + lbl2.place(x=x, y=y, width=800, height=50) + + # for wait-1 + waitnum = "1" + lbl3_s = Label( + window, text="Simultaneous\nTranslation (wait " + waitnum + "):", fg="red", font=font_label, anchor=E, width=28 + ) + lbl3_s.place(x=s_x, y=s_y + 70) + + lbl3 = Label(window, text="", font=font_align, background="linen") + lbl3.place(x=x, y=y + 75, width=800, height=50) + + # for wait-3 + waitnum = "3" + lbl4_s = Label( + window, text="Simultaneous\nTranslation (wait " + waitnum + "):", fg="red", font=font_label, anchor=E, width=28 + ) + lbl4_s.place(x=s_x, y=s_y + 140) + lbl4 = Label(window, text="", font=font_align, 
background="linen") + lbl4.place(x=x, y=y + 145, width=800, height=50) + + # for wait-5 + waitnum = "5" + lbl5_s = Label( + window, text="Simultaneous\nTranslation (wait " + waitnum + "):", fg="red", font=font_label, anchor=E, width=28 + ) + lbl5_s.place(x=s_x, y=s_y + 210) + lbl5 = Label(window, text="", font=font_align, background="linen") + lbl5.place(x=x, y=y + 215, width=800, height=50) + + # for wait--1 + lbl6_s = Label( + window, text="Full Sentence\nTranslation (wait -1):", fg="blue", font=font_label, anchor=E, width=28 + ) + lbl6_s.place(x=s_x, y=s_y + 280) + + lbl6 = Label(window, text="", font=font_align, background="sky blue") + lbl6.place(x=x, y=y + 285, width=800, height=50) + + def set_val(event=None): + """ + Start translating + """ + global i + global caches + global bos_id + global decoder_max_length + global all_result + global is_last + global user_input_bpe + global user_input_tokenized + bpe_str, tokenized_src = tokenizer.tokenize(txt.get()) + while i < len(tokenized_src): + user_input_bpe.append(bpe_str[i]) + user_input_tokenized.append(tokenized_src[i]) + lbl2.configure( + text=cut_line((lbl2.cget("text") + " " + bpe_str[i]).strip(), 20), fg="black", anchor=W, justify=LEFT + ) + window.update() + if bpe_str[i] in ["。", "?", "!"]: + is_last = True + translate( + args, + tokenizer, + user_input_tokenized, + transformers, + waitks, + decoder_max_length, + is_last, + caches, + bos_id, + all_result, + ) + lbl3.configure( + text=cut_line(" ".join(all_result[0]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl4.configure( + text=cut_line(" ".join(all_result[1]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl5.configure( + text=cut_line(" ".join(all_result[2]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl6.configure( + text=cut_line(" ".join(all_result[3]).replace("@@ ", ""), 11), fg="blue", anchor=W, justify=LEFT + ) + window.update() + if is_last: + caches = [None] * len(waitks) + bos_id = [None] * len(waitks) + decoder_max_length = [1] * len(waitks) + is_last = False + user_input_bpe = [] + user_input_tokenized = [] + i += 1 + + def set_val_voice(event=None): + """ + Start translating + """ + + def send_start_params(ws): + """ + Send start frame + :param websocket.WebSocket ws: + :return: + """ + req = { + "type": "START", + "data": { + "appid": const.APPID, + "appkey": const.APPKEY, + "dev_pid": const.DEV_PID, + "cuid": "yourself_defined_user_id", + "sample": 16000, + "format": "pcm", + }, + } + body = json.dumps(req) + ws.send(body, websocket.ABNF.OPCODE_TEXT) + logger.info("send START frame with params:" + body) + + def send_audio(ws): + """ + Send audio + :param websocket.WebSocket ws: + :return: + """ + # 160ms record + chunk_ms = 160 + + # 160ms * 16000 * 2bytes / 1000ms = 5120bytes + chunk_len = int(16000 * 2 / 1000 * chunk_ms) + + pa = PyAudio() + stream = pa.open(format=paInt16, channels=1, rate=16000, input=True, frames_per_buffer=chunk_len // 2) + + while True: + frames = [] + frame = stream.read(chunk_len // 2, exception_on_overflow=False) + frames.append(frame) + body = b"".join(frames) + if len(body) == 0: + logger.info("empty body") + continue + logger.debug("try to send audio length {}".format(len(body))) + ws.send(body, websocket.ABNF.OPCODE_BINARY) + + def send_finish(ws): + """ + Send finished frame + :param websocket.WebSocket ws: + :return: + """ + req = {"type": "FINISH"} + body = json.dumps(req) + ws.send(body, websocket.ABNF.OPCODE_TEXT) + logger.info("send FINISH frame") + + def 
close_websocket(ws_app): + if ws_app: + logger.info("close ws_app.") + send_finish(ws_app) + ws_app.close() + logger.info("ws_app closed.") + + def on_open(ws): + """ + Send data frame after connected + :param websocket.WebSocket ws: + :return: + """ + + def run(*args): + """ + Send data frame + :param args: + :return: + """ + send_start_params(ws) + send_audio(ws) + send_finish(ws) + logger.debug("thread terminating") + + threading.Thread(target=run).start() + + def on_error(ws, error): + """ + For error + :param ws: + :param error: json + :return: + """ + logger.error("error: " + str(error)) + + def on_close(ws): + """ + Close websocket + :param websocket.WebSocket ws: + :return: + """ + logger.info("ws close ...") + # ws.close() + + def on_message(ws, message): + """ + Response from server + :param ws: + :param message: json + :return: + """ + global i + global text + global caches + global bos_id + global decoder_max_length + global all_result + global is_last + global user_input_bpe + global user_input_tokenized + global ws_app + global start_time + + logger.info("Response: " + message) + message = json.loads(message) + if is_last and ws_app: + close_websocket(ws_app) + end_time = time.time() + if end_time - start_time > 10 and ws_app: + close_websocket(ws_app) + logger.info( + "ws_app started at: {} closed at: {}, cost {}s.".format( + start_time, end_time, end_time - start_time + ) + ) + if "result" in message: + start_time = time.time() + text = message["result"] + txt.delete(0, END) + txt.insert(0, text) + bpe_str, tokenized_src = tokenizer.tokenize(txt.get()) + while i < len(tokenized_src): + user_input_bpe.append(bpe_str[i]) + user_input_tokenized.append(tokenized_src[i]) + lbl2.configure( + text=cut_line((lbl2.cget("text") + " " + bpe_str[i]).strip(), 20), + fg="black", + anchor=W, + justify=LEFT, + ) + window.update() + if bpe_str[i] in ["。", "?", "!"]: + is_last = True + translate( + args, + tokenizer, + user_input_tokenized, + transformers, + waitks, + decoder_max_length, + is_last, + caches, + bos_id, + all_result, + ) + lbl3.configure( + text=cut_line(" ".join(all_result[0]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl4.configure( + text=cut_line(" ".join(all_result[1]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl5.configure( + text=cut_line(" ".join(all_result[2]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl6.configure( + text=cut_line(" ".join(all_result[3]).replace("@@ ", ""), 11), + fg="blue", + anchor=W, + justify=LEFT, + ) + window.update() + if is_last: + caches = [None] * len(waitks) + bos_id = [None] * len(waitks) + decoder_max_length = [1] * len(waitks) + is_last = False + user_input_bpe = [] + user_input_tokenized = [] + if ws_app: + close_websocket(ws_app) + i += 1 + + logger.info("begin") + uri = const.URI + "?sn=" + str(uuid.uuid1()) + logger.info("uri is " + uri) + global start_time + start_time = time.time() + global ws_app + ws_app = websocket.WebSocketApp( + uri, on_open=on_open, on_message=on_message, on_error=on_error, on_close=on_close + ) + ws_app.run_forever() + + def clear(): + """ + Clear input and output + """ + txt.delete(0, END) + global i + global text + global caches + global bos_id + global decoder_max_length + global all_result + global is_last + global user_input_bpe + global user_input_tokenized + global ws_app + global start_time + if ws_app: + ws_app.close() + decoder_max_length = [1] * len(waitks) + caches = [None] * len(waitks) + bos_id = [None] * len(waitks) + 
all_result = [[], [], [], []] + i = 0 + is_last = False + user_input_bpe = [] + user_input_tokenized = [] + start_time = 0 + logger.info("CLEAR") + logger.info(f"i: {i}") + logger.info(f"caches: {caches}") + logger.info(f"bos_id: {bos_id}") + logger.info(f"decoder_max_length: {decoder_max_length}") + logger.info(f"all_result: {all_result}") + logger.info(f"is_last: {is_last}") + lbl2.configure(text="", fg="black", anchor=W, justify=LEFT) + lbl3.configure(text="", fg="red", anchor=W, justify=LEFT) + lbl4.configure(text="", fg="red", anchor=W, justify=LEFT) + lbl5.configure(text="", fg="red", anchor=W, justify=LEFT) + lbl6.configure(text="", fg="blue", anchor=W, justify=LEFT) + window.update() + + txt.bind("<Return>", set_val) + button_on.bind("<Button-1>", set_val_voice) + + desc1 = Label(window, text="使用说明:1. 在Chinese input输入中文,按【回车键】开始实时翻译," "遇到【。!?】结束整句,按【CLEAR】清空所有的输入和输出;", anchor=E) + desc1.place(x=s_x + 100, y=s_y + 380) + + backspace_cnt = 19 + if is_win: + backspace_cnt = 15 + + desc2 = Label( + window, text=" " * backspace_cnt + "2. 按【REC】开始录音并开始实时翻译,遇到【。!?】结束整句," "按【CLEAR】清空所有的输入和输出。", anchor=E + ) + if open_speech: + desc2.place(x=s_x + 100, y=s_y + 410) + + button_clear = Button(window, text="CLEAR", relief="raised", cursor="hand2", command=clear) + + button_clear.place(x=x + 840, y=y + 380) + + window.mainloop() + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./transformer_demo.yaml", type=str, help="Path of the config file. ") + args = parser.parse_args() + return args + + +if __name__ == "__main__": + args = parse_args() + yaml_file = args.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + + if args.device == "gpu": + place = "gpu:0" + elif args.device == "xpu": + place = "xpu:0" + elif args.device == "cpu": + place = "cpu" + paddle.set_device(place) + + tokenizer = STACLTokenizer(args, is_chinese=True) + waitks = [1, 3, 5, -1] + + transformers = [] + for waitk in waitks: + transformers.append(init_model(args, f"models/nist_wait_{waitk}")) + logger.info(f"Loaded wait_{waitk} model.") + + # for decoding max length + decoder_max_length = [1] * len(waitks) + # for decoding cache + caches = [None] * len(waitks) + # for decoding start token id + bos_id = [None] * len(waitks) + # for result + all_result = [[], [], [], []] + # current source word index + i = 0 + # for decoding: is_last=True, max_len=256 + is_last = False + # subword after bpe + user_input_bpe = [] + # tokenized id + user_input_tokenized = [] + # for stream input + text = "" + # websocket app + ws_app = None + # start time + start_time = 0 + + process(args, tokenizer, transformers, waitks) diff --git a/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png b/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png new file mode 100644 index 0000000000000000000000000000000000000000..9b6f7c53332c7ded62890d020263216c259be4b0 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif b/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif new file mode 100644 index 0000000000000000000000000000000000000000..01f946ed776a7360fbb08a7dd9c8277812c240c0 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step1.png 
b/examples/simultaneous_translation/stacl/demo/images/step1.png new file mode 100644 index 0000000000000000000000000000000000000000..8d293f7627385c1968f4c498468b3f0db039f9c2 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step1.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step2.png b/examples/simultaneous_translation/stacl/demo/images/step2.png new file mode 100644 index 0000000000000000000000000000000000000000..c922d2c8869220239ebaf43603a1b267514ec746 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step2.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step3.png b/examples/simultaneous_translation/stacl/demo/images/step3.png new file mode 100644 index 0000000000000000000000000000000000000000..89a34a03109122146cf204ef79dbfbd7649b37b7 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step3.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step4.png b/examples/simultaneous_translation/stacl/demo/images/step4.png new file mode 100644 index 0000000000000000000000000000000000000000..b128f390a100050bbcb2445bc9d71c394ece8078 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step4.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step5.png b/examples/simultaneous_translation/stacl/demo/images/step5.png new file mode 100644 index 0000000000000000000000000000000000000000..3f14ae13da41bc19a4cf35c3b71be55a10f09aca Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step5.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step6.png b/examples/simultaneous_translation/stacl/demo/images/step6.png new file mode 100644 index 0000000000000000000000000000000000000000..adad2ff27f59db8db6dcff0b6c2003f1bc5fcd76 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step6.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step7.png b/examples/simultaneous_translation/stacl/demo/images/step7.png new file mode 100644 index 0000000000000000000000000000000000000000..a545dbfd8642508445faccd98bb11a8cfd71d487 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step7.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step8.png b/examples/simultaneous_translation/stacl/demo/images/step8.png new file mode 100644 index 0000000000000000000000000000000000000000..1b97efcc67c8a019ed123034facdebb4d103e04e Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step8.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif b/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif new file mode 100644 index 0000000000000000000000000000000000000000..ecbfccf8ffc18d715c0ef6615847ccb01eace24d Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif differ diff --git a/examples/simultaneous_translation/stacl/demo/model_demo.py b/examples/simultaneous_translation/stacl/demo/model_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..6f7a2cc7dfb40226e1c337a53a1dfd9107ecda9e --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/model_demo.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys + +import paddle +import paddle.nn.functional as F + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) +from model import SimultaneousTransformer # noqa: E402 + + +class SimultaneousTransformerDemo(SimultaneousTransformer): + """ + model + """ + + def greedy_search(self, src_word, max_len=256, waitk=-1, caches=None, bos_id=None): + """ + greedy_search uses streaming reader. It doesn't need calling + encoder many times, an a sub-sentence just needs calling encoder once. + So, it needsprevious state(caches) and last one of generated + tokens id last time. + """ + src_max_len = paddle.shape(src_word)[-1] + base_attn_bias = ( + paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + ) + src_slf_attn_bias = base_attn_bias + src_slf_attn_bias.stop_gradient = True + trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1]) + src_pos = paddle.cast(src_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=src_max_len) + src_emb = self.src_word_embedding(src_word) + src_pos_emb = self.src_pos_embedding(src_pos) + src_emb = src_emb + src_pos_emb + enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb + enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)] + + # constant number + batch_size = enc_outputs[-1].shape[0] + max_len = (enc_outputs[-1].shape[1] + 20) if max_len is None else max_len + end_token_tensor = paddle.full(shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64") + + predict_ids = [] + log_probs = paddle.full(shape=[batch_size, 1], fill_value=0, dtype="float32") + if not bos_id: + trg_word = paddle.full(shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64") + else: + trg_word = paddle.full(shape=[batch_size, 1], fill_value=bos_id, dtype="int64") + + # init states (caches) for transformer + if not caches: + caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False) + + for i in range(max_len): + trg_pos = paddle.full(shape=trg_word.shape, fill_value=i, dtype="int64") + trg_emb = self.trg_word_embedding(trg_word) + trg_pos_emb = self.trg_pos_embedding(trg_pos) + trg_emb = trg_emb + trg_pos_emb + dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb + + if waitk < 0 or i >= len(enc_outputs): + # if the decoder step is full sent or longer than all source + # step, then read the whole src + _e = enc_outputs[-1] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + else: + _e = enc_outputs[i] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + + dec_output = paddle.reshape(dec_output, shape=[-1, dec_output.shape[-1]]) + + logits = self.linear(dec_output) + step_log_probs = paddle.log(F.softmax(logits, axis=-1)) + log_probs = paddle.add(x=step_log_probs, y=log_probs) + scores = 
log_probs + topk_scores, topk_indices = paddle.topk(x=scores, k=1) + + finished = paddle.equal(topk_indices, end_token_tensor) + trg_word = topk_indices + log_probs = topk_scores + + predict_ids.append(topk_indices) + + if paddle.all(finished).numpy(): + break + + predict_ids = paddle.stack(predict_ids, axis=0) + finished_seq = paddle.transpose(predict_ids, [1, 2, 0]) + finished_scores = topk_scores + + return finished_seq, finished_scores, caches diff --git a/examples/simultaneous_translation/stacl/demo/requirements.txt b/examples/simultaneous_translation/stacl/demo/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..c29fc68a8e962a6e40c426e110ba5498b9f0063e --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/requirements.txt @@ -0,0 +1,5 @@ +attrdict==2.0.1 +PyYAML==5.4.1 +subword-nmt==0.3.7 +jieba==0.42.1 +websocket-client==1.0.1 diff --git a/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml b/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b162c57d3cabf5968e71ae62553df99a51a9db7d --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml @@ -0,0 +1,51 @@ +# path of trained parameter, to make prediction +init_from_params: "" +# The path of vocabulary file of source language. +src_vocab_fpath: "nist.20k.zh.vocab" +# The path of vocabulary file of target language. +trg_vocab_fpath: "nist.10k.en.vocab" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: [ "<s>", "<e>", "<unk>" ] + +# Use which device to train or predict(cpu,gpu,xpu) +device: cpu + +# Hyparams for generation: +max_out_len: 256 +# the number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# size of source word dictionary. +src_vocab_size: 10000 +# size of target word dictionay +trg_vocab_size: 10000 +# index for <bos> token +bos_idx: 0 +# index for <eos> token +eos_idx: 1 +# index for <unk> token +unk_idx: 2 +# max length of sequences deciding the size of position encoding table. +max_length: 256 +# the dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# number of head used in multi-head attention. +n_head: 8 +# number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# dropout rates. +dropout: 0.1 +# the flag indicating whether to share embedding and softmax weights. +# vocabularies in source and target should be same for weight sharing. 
+weight_sharing: False +# Wait-k policy +waitk: -1 +# Source bpe dict for tokenizer +src_bpe_dict: 2M.zh2en.dict4bpe.zh diff --git a/examples/simultaneous_translation/stacl/images/STACL_architecture.png b/examples/simultaneous_translation/stacl/images/STACL_architecture.png new file mode 100644 index 0000000000000000000000000000000000000000..4246bb20af4f553086edcb66be40d8d2e1bd4175 Binary files /dev/null and b/examples/simultaneous_translation/stacl/images/STACL_architecture.png differ diff --git a/examples/simultaneous_translation/stacl/images/example.png b/examples/simultaneous_translation/stacl/images/example.png new file mode 100644 index 0000000000000000000000000000000000000000..1438cf96dba443384f39f373a070038bbee2f011 Binary files /dev/null and b/examples/simultaneous_translation/stacl/images/example.png differ diff --git a/examples/simultaneous_translation/stacl/model.py b/examples/simultaneous_translation/stacl/model.py new file mode 100644 index 0000000000000000000000000000000000000000..e987178dd87e9ca529b658dad884b50e614ed3b7 --- /dev/null +++ b/examples/simultaneous_translation/stacl/model.py @@ -0,0 +1,314 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.transformers import WordEmbedding, PositionalEmbedding + + +class CrossEntropyCriterion(nn.Layer): + def __init__(self, label_smooth_eps, pad_idx=0): + super(CrossEntropyCriterion, self).__init__() + self.label_smooth_eps = label_smooth_eps + self.pad_idx = pad_idx + + def forward(self, predict, label): + weights = paddle.cast(label != self.pad_idx, dtype=paddle.get_default_dtype()) + if self.label_smooth_eps: + label = F.label_smooth( + label=F.one_hot(x=label, num_classes=predict.shape[-1]), epsilon=self.label_smooth_eps + ) + + cost = F.cross_entropy( + input=predict, label=label, reduction="none", soft_label=True if self.label_smooth_eps else False + ).squeeze() + weighted_cost = cost * weights + sum_cost = paddle.sum(weighted_cost) + token_num = paddle.sum(weights) + token_num.stop_gradient = True + avg_cost = sum_cost / token_num + return sum_cost, avg_cost, token_num + + +class DecoderLayer(nn.TransformerDecoderLayer): + def __init__(self, *args, **kwargs): + super(DecoderLayer, self).__init__(*args, **kwargs) + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + residual = tgt + if self.normalize_before: + tgt = self.norm1(tgt) + if cache is None: + tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None) + else: + tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, cache[0]) + tgt = residual + self.dropout1(tgt) + if not self.normalize_before: + tgt = self.norm1(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm2(tgt) + if len(memory) == 1: + # Full sent + tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None) + else: + # Wait-k policy + 
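+                # When called from SimultaneousTransformer.forward, memory[j] holds the encoder
+                # states of the source prefix of length waitk + j, so target step i attends only
+                # to the first waitk + i source tokens and falls back to the full-sentence
+                # states once i exceeds the number of cached prefixes.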
cross_attn_outputs = [] + for i in range(tgt.shape[1]): + q = tgt[:, i : i + 1, :] + if i >= len(memory): + e = memory[-1] + else: + e = memory[i] + cross_attn_outputs.append(self.cross_attn(q, e, e, memory_mask[:, :, i : i + 1, : e.shape[1]], None)) + tgt = paddle.concat(cross_attn_outputs, axis=1) + tgt = residual + self.dropout2(tgt) + if not self.normalize_before: + tgt = self.norm2(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm3(tgt) + tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) + tgt = residual + self.dropout3(tgt) + if not self.normalize_before: + tgt = self.norm3(tgt) + return tgt if cache is None else (tgt, (incremental_cache,)) + + +class Decoder(nn.TransformerDecoder): + """ + PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL, + type of memory is list, having no dtype attribute. + """ + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + output = tgt + new_caches = [] + for i, mod in enumerate(self.layers): + if cache is None: + output = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=None) + else: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=cache[i]) + new_caches.append(new_cache) + + if self.norm is not None: + output = self.norm(output) + + return output if cache is None else (output, new_caches) + + +class SimultaneousTransformer(nn.Layer): + """ + model + """ + + def __init__( + self, + src_vocab_size, + trg_vocab_size, + max_length, + n_layer, + n_head, + d_model, + d_inner_hid, + dropout, + weight_sharing, + bos_id=0, + eos_id=1, + waitk=-1, + ): + super(SimultaneousTransformer, self).__init__() + self.trg_vocab_size = trg_vocab_size + self.emb_dim = d_model + self.bos_id = bos_id + self.eos_id = eos_id + self.dropout = dropout + self.waitk = waitk + self.n_layer = n_layer + self.n_head = n_head + self.d_model = d_model + + self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id) + self.src_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length) + if weight_sharing: + assert ( + src_vocab_size == trg_vocab_size + ), "Vocabularies in source and target should be same for weight sharing." 
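+            # Tie the target-side embeddings to the source-side ones; the output projection
+            # below reuses the same weight matrix through a transposed matmul.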
+ self.trg_word_embedding = self.src_word_embedding + self.trg_pos_embedding = self.src_pos_embedding + else: + self.trg_word_embedding = WordEmbedding(vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id) + self.trg_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length) + + encoder_layer = nn.TransformerEncoderLayer( + d_model=d_model, + nhead=n_head, + dim_feedforward=d_inner_hid, + dropout=dropout, + activation="relu", + normalize_before=True, + bias_attr=[False, True], + ) + encoder_norm = nn.LayerNorm(d_model) + self.encoder = nn.TransformerEncoder(encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm) + + decoder_layer = DecoderLayer( + d_model=d_model, + nhead=n_head, + dim_feedforward=d_inner_hid, + dropout=dropout, + activation="relu", + normalize_before=True, + bias_attr=[False, False, True], + ) + decoder_norm = nn.LayerNorm(d_model) + self.decoder = Decoder(decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm) + + if weight_sharing: + self.linear = lambda x: paddle.matmul( + x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True + ) + else: + self.linear = nn.Linear(in_features=d_model, out_features=trg_vocab_size, bias_attr=False) + + def forward(self, src_word, trg_word): + src_max_len = paddle.shape(src_word)[-1] + trg_max_len = paddle.shape(trg_word)[-1] + base_attn_bias = ( + paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + ) + src_slf_attn_bias = base_attn_bias + src_slf_attn_bias.stop_gradient = True + trg_slf_attn_bias = paddle.tensor.triu( + (paddle.ones((trg_max_len, trg_max_len), dtype=paddle.get_default_dtype()) * -np.inf), 1 + ) + trg_slf_attn_bias.stop_gradient = True + trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1]) + src_pos = paddle.cast(src_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=src_max_len) + trg_pos = paddle.cast(trg_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=trg_max_len) + src_emb = self.src_word_embedding(src_word) + src_pos_emb = self.src_pos_embedding(src_pos) + src_emb = src_emb + src_pos_emb + enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb + with paddle.static.amp.fp16_guard(): + if self.waitk >= src_max_len or self.waitk == -1: + # Full sentence + enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)] + else: + # Wait-k policy + enc_outputs = [] + for i in range(self.waitk, src_max_len + 1): + enc_output = self.encoder(enc_input[:, :i, :], src_mask=src_slf_attn_bias[:, :, :, :i]) + enc_outputs.append(enc_output) + + trg_emb = self.trg_word_embedding(trg_word) + trg_pos_emb = self.trg_pos_embedding(trg_pos) + trg_emb = trg_emb + trg_pos_emb + dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb + dec_output = self.decoder( + dec_input, enc_outputs, tgt_mask=trg_slf_attn_bias, memory_mask=trg_src_attn_bias + ) + + predict = self.linear(dec_output) + + return predict + + def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1): + # TODO: "Speculative Beam Search for Simultaneous Translation" + raise NotImplementedError + + def greedy_search(self, src_word, max_len=256, waitk=-1): + src_max_len = paddle.shape(src_word)[-1] + base_attn_bias = ( + paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + ) + src_slf_attn_bias = base_attn_bias + src_slf_attn_bias.stop_gradient = True + trg_src_attn_bias = 
paddle.tile(base_attn_bias, [1, 1, 1, 1]) + src_pos = paddle.cast(src_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=src_max_len) + src_emb = self.src_word_embedding(src_word) + src_pos_emb = self.src_pos_embedding(src_pos) + src_emb = src_emb + src_pos_emb + enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb + if waitk < 0 or waitk > src_max_len: + enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)] + else: + enc_outputs = [] + for i in range(waitk, src_max_len + 1): + enc_output = self.encoder(enc_input[:, :i, :], src_mask=src_slf_attn_bias[:, :, :, :i]) + enc_outputs.append(enc_output) + + # constant number + batch_size = enc_outputs[-1].shape[0] + max_len = (enc_outputs[-1].shape[1] + 20) if max_len is None else max_len + end_token_tensor = paddle.full(shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64") + + predict_ids = [] + log_probs = paddle.full(shape=[batch_size, 1], fill_value=0, dtype="float32") + trg_word = paddle.full(shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64") + + # init states (caches) for transformer + caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False) + + for i in range(max_len): + trg_pos = paddle.full(shape=trg_word.shape, fill_value=i, dtype="int64") + trg_emb = self.trg_word_embedding(trg_word) + trg_pos_emb = self.trg_pos_embedding(trg_pos) + trg_emb = trg_emb + trg_pos_emb + dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb + + if waitk < 0 or i >= len(enc_outputs): + # Avoid getting the whole source in advance, a diff from: + # https://github.com/autosimtrans/SimulTransBaseline/blob/master/model.py#L1207 + # if the decoder step is full sent or longer than all source + # step, then read the whole src + _e = enc_outputs[-1] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + else: + _e = enc_outputs[i] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + + dec_output = paddle.reshape(dec_output, shape=[-1, dec_output.shape[-1]]) + + logits = self.linear(dec_output) + step_log_probs = paddle.log(F.softmax(logits, axis=-1)) + log_probs = paddle.add(x=step_log_probs, y=log_probs) + scores = log_probs + topk_scores, topk_indices = paddle.topk(x=scores, k=1) + + finished = paddle.equal(topk_indices, end_token_tensor) + trg_word = topk_indices + log_probs = topk_scores + + predict_ids.append(topk_indices) + + if paddle.all(finished).numpy(): + break + + predict_ids = paddle.stack(predict_ids, axis=0) + finished_seq = paddle.transpose(predict_ids, [1, 2, 0]) + finished_scores = topk_scores + + return finished_seq, finished_scores diff --git a/examples/simultaneous_translation/stacl/predict.py b/examples/simultaneous_translation/stacl/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..8f2e3da9e404389ac00ee3bae635a26a7f056012 --- /dev/null +++ b/examples/simultaneous_translation/stacl/predict.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +from pprint import pprint +import yaml +from attrdict import AttrDict + +import paddle +from paddlenlp.transformers import position_encoding_init +import reader +from model import SimultaneousTransformer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./config/transformer.yaml", type=str, help="Path of the config file. ") + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + if args.device == "gpu": + place = "gpu:0" + elif args.device == "xpu": + place = "xpu:0" + elif args.device == "cpu": + place = "cpu" + + paddle.set_device(place) + + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + # Define model + transformer = SimultaneousTransformer( + args.src_vocab_size, + args.trg_vocab_size, + args.max_length + 1, + args.n_layer, + args.n_head, + args.d_model, + args.d_inner_hid, + args.dropout, + args.weight_sharing, + args.bos_idx, + args.eos_idx, + args.waitk, + ) + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." 
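+    # The checkpoint directory is expected to contain transformer.pdparams, e.g. one of the
+    # step_<N> directories that train.py writes under save_model.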
+ + model_dict = paddle.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # To avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + + transformer.load_dict(model_dict) + + # Set evaluate mode + transformer.eval() + + f = open(args.output_file, "w", encoding="utf8") + + with paddle.no_grad(): + for input_data in test_loader: + (src_word,) = input_data + + finished_seq, finished_scores = transformer.greedy_search( + src_word, max_len=args.max_out_len, waitk=args.waitk + ) + finished_seq = finished_seq.numpy() + finished_scores = finished_scores.numpy() + for idx, ins in enumerate(finished_seq): + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + f.close() + + +if __name__ == "__main__": + args = parse_args() + yaml_file = args.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + + do_predict(args) diff --git a/examples/simultaneous_translation/stacl/reader.py b/examples/simultaneous_translation/stacl/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..cb71b2bfb212c202311626d5755baba7f8746835 --- /dev/null +++ b/examples/simultaneous_translation/stacl/reader.py @@ -0,0 +1,194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +from paddle.io import DataLoader +from paddlenlp.data import Vocab, Pad +from paddlenlp.data.sampler import SamplerHelper +from paddlenlp.datasets import load_dataset + + +def read(src_tgt_file, only_src=False): + with open(src_tgt_file, "r", encoding="utf8") as src_tgt_f: + for line in src_tgt_f: + line = line.strip("\n") + if not line: + continue + line_split = line.split("\t") + if only_src: + yield {"src": line_split[0]} + else: + if len(line_split) != 2: + continue + yield {"src": line_split[0], "trg": line_split[1]} + + +def min_max_filer(data, max_len, min_len=0): + # 1 for special tokens. 
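+    # Keep a (src, trg) pair only if both lengths, counting the appended special token,
+    # fall inside [min_len, max_len].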
+ data_min_len = min(len(data[0]), len(data[1])) + 1 + data_max_len = max(len(data[0]), len(data[1])) + 1 + return (data_min_len >= min_len) and (data_max_len <= max_len) + + +def create_data_loader(args, places=None): + data_files = {"train": args.training_file, "dev": args.validation_file} + + datasets = [load_dataset(read, src_tgt_file=filename, lazy=False) for split, filename in data_files.items()] + + src_vocab = Vocab.load_vocabulary( + args.src_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + trg_vocab = Vocab.load_vocabulary( + args.trg_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + args.src_vocab_size = len(src_vocab) + args.trg_vocab_size = len(trg_vocab) + + def convert_samples(sample): + source = [item.strip() for item in sample["src"].split()] + target = [item.strip() for item in sample["trg"].split()] + + source = src_vocab.to_indices(source) + [args.eos_idx] + target = [args.bos_idx] + trg_vocab.to_indices(target) + [args.eos_idx] + + return source, target + + data_loaders = [(None)] * 2 + for i, dataset in enumerate(datasets): + dataset = dataset.map(convert_samples, lazy=False).filter(partial(min_max_filer, max_len=args.max_length)) + + sampler = SamplerHelper(dataset) + + if args.sort_type == SortType.GLOBAL: + src_key = lambda x, data_source: len(data_source[x][0]) + trg_key = lambda x, data_source: len(data_source[x][1]) + # Sort twice + sampler = sampler.sort(key=trg_key).sort(key=src_key) + else: + if args.shuffle: + sampler = sampler.shuffle(seed=args.random_seed) + max_key = lambda x, data_source: max(len(data_source[x][0]), len(data_source[x][1])) + if args.sort_type == SortType.POOL: + sampler = sampler.sort(key=max_key, buffer_size=args.pool_size) + + batch_size_fn = lambda new, count, sofar, data_source: max( + sofar, len(data_source[new][0]), len(data_source[new][1]) + ) + batch_sampler = sampler.batch( + batch_size=args.batch_size, + drop_last=False, + batch_size_fn=batch_size_fn, + key=lambda size_so_far, minibatch_len: size_so_far * minibatch_len, + ) + + if args.shuffle_batch: + batch_sampler = batch_sampler.shuffle(seed=args.random_seed) + + if i == 0: + batch_sampler = batch_sampler.shard() + + data_loader = DataLoader( + dataset=dataset, + places=places, + batch_sampler=batch_sampler, + collate_fn=partial(prepare_train_input, pad_idx=args.bos_idx), + num_workers=0, + ) + + data_loaders[i] = data_loader + + return data_loaders + + +def create_infer_loader(args, places=None): + data_files = { + "test": args.predict_file, + } + dataset = load_dataset(read, src_tgt_file=data_files["test"], only_src=True, lazy=False) + + src_vocab = Vocab.load_vocabulary( + args.src_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + trg_vocab = Vocab.load_vocabulary( + args.trg_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + args.src_vocab_size = len(src_vocab) + args.trg_vocab_size = len(trg_vocab) + + def convert_samples(sample): + source = [item.strip() for item in sample["src"].split()] + source = src_vocab.to_indices(source) + [args.eos_idx] + target = [args.bos_idx] + return source, target + + dataset = dataset.map(convert_samples, lazy=False) + + batch_sampler = SamplerHelper(dataset).batch(batch_size=args.batch_size, drop_last=False) + + data_loader = DataLoader( + 
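+        # prepare_infer_input pads each batch with bos_idx; the matching trg_vocab.to_tokens
+        # callable is returned alongside the loader for turning predicted ids back into tokens.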
dataset=dataset, + places=places, + batch_sampler=batch_sampler, + collate_fn=partial(prepare_infer_input, pad_idx=args.bos_idx), + num_workers=0, + return_list=True, + ) + + return data_loader, trg_vocab.to_tokens + + +def prepare_train_input(insts, pad_idx): + """ + Put all padded data needed by training into a list. + """ + word_pad = Pad(pad_idx) + src_word = word_pad([inst[0] for inst in insts]) + trg_word = word_pad([inst[1][:-1] for inst in insts]) + lbl_word = word_pad([inst[1][1:] for inst in insts]) + data_inputs = [src_word, trg_word, lbl_word] + + return data_inputs + + +def prepare_infer_input(insts, pad_idx): + """ + Put all padded data needed by beam search decoder into a list. + """ + word_pad = Pad(pad_idx) + src_word = word_pad([inst[0] for inst in insts]) + + return [ + src_word, + ] + + +class SortType(object): + GLOBAL = "global" + POOL = "pool" + NONE = "none" diff --git a/examples/simultaneous_translation/stacl/requirements.txt b/examples/simultaneous_translation/stacl/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..d4fa621ca5572b3e00ffd1bda3266a546a580225 --- /dev/null +++ b/examples/simultaneous_translation/stacl/requirements.txt @@ -0,0 +1,4 @@ +attrdict==2.0.1 +PyYAML==5.4.1 +subword_nmt==0.3.7 +jieba==0.42.1 diff --git a/examples/simultaneous_translation/stacl/train.py b/examples/simultaneous_translation/stacl/train.py new file mode 100644 index 0000000000000000000000000000000000000000..09ecb03001a9ce332ed520b609758f137a562d29 --- /dev/null +++ b/examples/simultaneous_translation/stacl/train.py @@ -0,0 +1,249 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import argparse +from pprint import pprint +import numpy as np +import yaml +from attrdict import AttrDict + +import paddle +import paddle.distributed as dist +from paddlenlp.utils.log import logger + +import reader +from model import SimultaneousTransformer, CrossEntropyCriterion +from utils.record import AverageStatistical + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./config/transformer.yaml", type=str, help="Path of the config file. 
") + args = parser.parse_args() + return args + + +def do_train(args): + paddle.set_device(args.device) + trainer_count = dist.get_world_size() + rank = dist.get_rank() + + if trainer_count > 1: + dist.init_parallel_env() + + # Set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + # Define data loader + (train_loader), (eval_loader) = reader.create_data_loader(args, places=paddle.get_device()) + + # Define model + transformer = SimultaneousTransformer( + args.src_vocab_size, + args.trg_vocab_size, + args.max_length + 1, + args.n_layer, + args.n_head, + args.d_model, + args.d_inner_hid, + args.dropout, + args.weight_sharing, + args.bos_idx, + args.eos_idx, + args.waitk, + ) + + print("waitk=", args.waitk) + + # Define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx) + + # Define optimizer + scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate) + + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + ) + + # Init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdparams")) + opt_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdopt")) + transformer.set_state_dict(model_dict) + optimizer.set_state_dict(opt_dict) + print("loaded from checkpoint.") + # Init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict = paddle.load(os.path.join(args.init_from_pretrain_model, "transformer.pdparams")) + transformer.set_state_dict(model_dict) + print("loaded from pre-trained model.") + + if trainer_count > 1: + transformer = paddle.DataParallel(transformer) + + # The best cross-entropy value with label smoothing + loss_normalizer = -( + (1.0 - args.label_smooth_eps) * np.log((1.0 - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20) + ) + + step_idx = 0 + + # For logging + reader_cost_avg = AverageStatistical() + batch_cost_avg = AverageStatistical() + batch_ips_avg = AverageStatistical() + + # Train loop + for pass_id in range(args.epoch): + epoch_start = time.time() + batch_id = 0 + batch_start = time.time() + for input_data in train_loader: + train_reader_cost = time.time() - batch_start + (src_word, trg_word, lbl_word) = input_data + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + with paddle.amp.auto_cast(): + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + scaled_loss = scaler.scale(avg_cost) # scale the loss + scaled_loss.backward() # do backward + + scaler.minimize(optimizer, scaled_loss) # update parameters + optimizer.clear_grad() + else: + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + avg_cost.backward() + + optimizer.step() + optimizer.clear_grad() + + if args.max_iter and step_idx + 1 == args.max_iter: + return + + tokens_per_cards = token_num.numpy() + + train_batch_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + batch_cost_avg.record(train_batch_cost) + batch_ips_avg.record(train_batch_cost, tokens_per_cards) + + if step_idx % args.print_step == 0: + 
total_avg_cost = avg_cost.numpy() + if step_idx == 0: + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f " + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + else: + train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time() + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, " + "batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, " + "ips: %.5f words/sec" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + train_avg_batch_cost, + batch_cost_avg.get_average(), + reader_cost_avg.get_average(), + batch_ips_avg.get_total_cnt(), + batch_ips_avg.get_average_per_sec(), + ) + ) + reader_cost_avg.reset() + batch_cost_avg.reset() + batch_ips_avg.reset() + + if step_idx % args.save_step == 0 and step_idx != 0: + # Validation + transformer.eval() + total_sum_cost = 0 + total_token_num = 0 + with paddle.no_grad(): + for input_data in eval_loader: + (src_word, trg_word, lbl_word) = input_data + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + total_sum_cost += sum_cost.numpy() + total_token_num += token_num.numpy() + total_avg_cost = total_sum_cost / total_token_num + logger.info( + "validation, step_idx: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" + % ( + step_idx, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + transformer.train() + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + batch_id += 1 + step_idx += 1 + scheduler.step() + batch_start = time.time() + + train_epoch_cost = time.time() - epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (pass_id, train_epoch_cost)) + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + +if __name__ == "__main__": + args = parse_args() + yaml_file = args.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + do_train(args) diff --git a/examples/simultaneous_translation/stacl/utils/__init__.py b/examples/simultaneous_translation/stacl/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/examples/simultaneous_translation/stacl/utils/record.py b/examples/simultaneous_translation/stacl/utils/record.py new file mode 100644 index 0000000000000000000000000000000000000000..1147f0d433f246a3a07503f2e7dd29801677a538 --- /dev/null +++ b/examples/simultaneous_translation/stacl/utils/record.py @@ -0,0 +1,44 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time diff --git a/examples/simultaneous_translation/stacl/utils/tokenizer.py b/examples/simultaneous_translation/stacl/utils/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..010d91dea922ecd7b1797283cb5384b1798e8dd9 --- /dev/null +++ b/examples/simultaneous_translation/stacl/utils/tokenizer.py @@ -0,0 +1,47 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import _locale +import jieba +from subword_nmt import subword_nmt + +# By default, the Windows system opens the file with GBK code, +# and the subword_nmt package does not support setting open encoding, +# so it is set to UTF-8 uniformly. 
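+# Note that this monkey-patch changes the default encoding seen by open() for the whole
+# process, not only for this module.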
+_locale._getdefaultlocale = lambda *args: ["en_US", "utf8"]
+
+
+class STACLTokenizer:
+    def __init__(self, bpe_dict, is_chinese):
+        bpe_parser = subword_nmt.create_apply_bpe_parser()
+        bpe_args = bpe_parser.parse_args(args=["-c", bpe_dict])
+        self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges, bpe_args.separator, None, bpe_args.glossaries)
+        self.is_chinese = is_chinese
+
+    def tokenize(self, raw_string):
+        """
+        Tokenize string (jieba + BPE for Chinese, BPE otherwise)
+        """
+        raw_string = raw_string.strip("\n")
+        if not raw_string:
+            return raw_string
+        if self.is_chinese:
+            raw_string = " ".join(jieba.cut(raw_string))
+        bpe_str = self.bpe.process_line(raw_string)
+        return " ".join(bpe_str.split())
+
+
+if __name__ == "__main__":
+    tokenizer_zh = STACLTokenizer("data/nist2m/2M.zh2en.dict4bpe.zh", is_chinese=True)
+    print(tokenizer_zh.tokenize("玻利维亚举行总统与国会选举"))
diff --git a/examples/text_classification/README.md b/examples/text_classification/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f02dae18865f80e7671de0bd305c02540d901d11
--- /dev/null
+++ b/examples/text_classification/README.md
@@ -0,0 +1,15 @@
+# 文本分类
+
+本目录提供了多个文本分类任务示例,涵盖基于ERNIE 3.0预训练模型、传统序列模型以及ERNIE-Doc超长文本预训练模型的文本分类。
+
+## Pretrained Models (PTMs)
+
+[Pretrained Models](./pretrained_models) 展示了如何使用以ERNIE 3.0 为代表的预训练模型,在多分类、多标签、层次分类场景下,基于预训练模型微调、提示学习(小样本)、语义索引等三种不同方案进行文本分类。预训练模型文本分类打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,旨在解决细分场景应用的痛点和难点,快速实现文本分类产品落地。
+
+## RNN Models
+
+[Recurrent Neural Networks](./rnn) 展示了如何使用传统序列模型RNN、LSTM、GRU等网络完成文本分类任务。
+
+## ERNIE-Doc Text Classification
+
+[ERNIE-Doc Text Classification](./ernie_doc) 展示了如何使用预训练模型ERNIE-Doc完成**超长文本**分类任务。
diff --git a/examples/text_classification/ernie_doc/README.md b/examples/text_classification/ernie_doc/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7f16c92c585f93e4f48ee1a4da4d222b5a205ce3
--- /dev/null
+++ b/examples/text_classification/ernie_doc/README.md
@@ -0,0 +1,86 @@
+# ERNIE-Doc 在iflytek数据集上的使用
+
+## 简介
+
+本示例将使用ERNIE-Doc模型,演示如何在长文本数据集(如iflytek)上完成分类任务的训练、预测以及动转静过程。以下是本例的简要目录结构及说明:
+
+```shell
+.
+├── LICENSE +├── README.md #文档 +├── data.py #数据处理 +├── export_model.py #将动态图参数导出成静态图参数 +├── metrics.py #ERNIE-Doc下游任务指标 +├── modeling.py #ERNIE-Doc模型实现(针对实现静态图修改) +├── predict.py #分类任务预测脚本(包括动态图预测和动转静) +└── train.py #分类任务训练脚本(包括数据下载,模型导出和测试集结果导出) +``` + +## 快速开始 + +### 通用参数释义 + +除[ERNIE_DOC](../../../model_zoo/ernie-doc/README.md) +展示的通用参数之外,本例还有如下参数: + +- `static_mode` 在 `predict.py` 表示是否使用静态图进行预测。 +- `test_results_file` 在`train.py`和`predict.py`中表示测试集预测结果所存储的地址,默认为`./test_restuls.json`。 +- `static_path` 在`export_model.py`和`predict.py`中表示要将转化完成的静态图存储的地址,如果改地址已经有静态图模型参数,`predict.py` + 会直接读取该模型参数,而`export_model.py`会覆盖掉该模型参数。默认路径为`{HOME}/.paddlenlp/static/inference`。 + +### 分类任务训练 + +iflytek的数据示例如下: + +```shell +{"label": "110", "label_des": "社区超市", "sentence": "朴朴快送超市创立于2016年,专注于打造移动端30分钟即时配送一站式购物平台,商品品类包含水果、蔬菜、肉禽蛋奶、海鲜水产、粮油调味、酒水饮料、休闲食品、日用品、外卖等。朴朴公司希望能以全新的商业模式,更高效快捷的仓储配送模式,致力于成为更快、更好、更多、更省的在线零售平台,带给消费者更好的消费体验,同时推动中国食品安全进程,成为一家让社会尊敬的互联网公司。,朴朴一下,又好又快,1.配送时间提示更加清晰友好2.保障用户隐私的一些优化3.其他提高使用体验的调整4.修复了一些已知bug"} +``` + +该数据集共有1.7万多条关于app应用描述的长文本标注数据,包含和日常生活相关的各类应用主题,共119个类别。 使用训练脚本 + +```shell +python train.py --batch_size 4 \ + --model_name_or_path ernie-doc-base-zh \ + --epoch 5 \ + --output_dir ./checkpoints/ +``` + +根据通用参数释义可自行更改训练超参数和模型保存地址。 + +### 模型导出和预测 + +可以使用模型导出脚本将动态图模型转化成静态图: + +```shell +python export_model.py --batch_size 16 \ + --model_name_or_path finetuned_model \ + --max_seq_lenght 512 \ + --memory_length 128 \ + --static_path ./my_static_model/ +``` + +也可以直接使用预测脚本将`static_mode`设为True (设置成False则使用动态图预测),直接完成转化静态图和使用静态图预测的步骤: + +```shell +python predict.py --static_mode True \ + --dataset iflytek \ + --batch_size 16 \ + --model_name_or_path finetuned_model \ + --max_seq_lenght 512 \ + --memory_length 128 \ + --static_path ./my_static_model/ \ + --test_results_file ./test_results.json +``` + +模型输出的`test_results_file`示例: + +```shell +{"id": "2590", "label": "70"} +{"id": "2591", "label": "91"} +{"id": "2592", "label": "20"} +{"id": "2593", "label": "28"} +{"id": "2594", "label": "95"} +{"id": "2595", "label": "116"} +{"id": "2596", "label": "59"} +{"id": "2597", "label": "22"} +``` diff --git a/examples/text_classification/ernie_doc/__init__.py b/examples/text_classification/ernie_doc/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/examples/text_classification/ernie_doc/data.py b/examples/text_classification/ernie_doc/data.py new file mode 100644 index 0000000000000000000000000000000000000000..85409d9e84e06da03633bbb8b9840e35bc8f82e6 --- /dev/null +++ b/examples/text_classification/ernie_doc/data.py @@ -0,0 +1,1208 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
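+# This module prepares the ERNIE-Doc classification inputs: get_related_pos builds the
+# relative position ids used together with the recurrence memory, pad_batch_data pads a
+# batch and produces its attention mask, and ClassifierIterator slices every long document
+# into spans of at most max_seq_length tokens with a memory_len stride so the spans can be
+# fed to the model in reading order.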
+import itertools +import json +from collections import namedtuple + +import numpy as np +from paddle.utils import try_import + +from paddlenlp.transformers import tokenize_chinese_chars +from paddlenlp.utils.log import logger + + +def get_related_pos(insts, seq_len, memory_len=128): + """generate relative postion ids""" + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype("int64").reshape([len(insts), beg, 1]) + + +def pad_batch_data( + insts, + insts_data_type="int64", + pad_idx=0, + final_cls=False, + pad_max_len=None, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False, + return_seq_lens=False, +): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + if pad_max_len: + max_len = pad_max_len + else: + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + # Input id + if final_cls: + inst_data = np.array([inst[:-1] + list([pad_idx] * (max_len - len(inst))) + [inst[-1]] for inst in insts]) + else: + inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts]) + return_list += [inst_data.astype(insts_data_type).reshape([-1, max_len, 1])] + + # Position id + if return_pos: + inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + + if return_input_mask: + # This is used to avoid attention on paddings. 
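+        # With final_cls=True the last id of every instance is the trailing [CLS] token, so it
+        # keeps mask 1 while the padding inserted before it gets mask 0.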
+ if final_cls: + input_mask_data = np.array([[1] * len(inst[:-1]) + [0] * (max_len - len(inst)) + [1] for inst in insts]) + else: + input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens_type = [-1] + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape(seq_lens_type)] + + return return_list if len(return_list) > 1 else return_list[0] + + +class TextPreprocessor(object): + def __call__(self, text): + raise NotImplementedError("TextPreprocessor object can't be called") + + +class ImdbTextPreprocessor(TextPreprocessor): + def __call__(self, text): + text = text.strip().replace("<br /><br />", " ") + text = text.replace("\t", "") + return text + + +class HYPTextPreprocessor(TextPreprocessor): + def __init__(self): + self.bs4 = try_import("bs4") + + def __call__(self, text): + text = self.bs4.BeautifulSoup(text, "html.parser").get_text() + text = text.strip().replace("\n", "").replace("\t", "") + return text + + +class ClassifierIterator(object): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + preprocess_text_fn=None, + ): + self.batch_size = batch_size + self.tokenizer = tokenizer + self.trainer_num = trainer_num + self.trainer_id = trainer_id + self.max_seq_length = max_seq_length + self.memory_len = memory_len + self.repeat_input = repeat_input + self.in_tokens = in_tokens + self.dataset = [data for data in dataset] + self.num_examples = None + self.mode = mode + self.shuffle = True if mode == "train" else False + if random_seed is None: + random_seed = 12345 + self.random_seed = random_seed + self.preprocess_text_fn = preprocess_text_fn + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.dataset) + + def _cnt_list(self, inp): + """Cnt_list""" + cnt = 0 + for lit in inp: + if lit: + cnt += 1 + return cnt + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + if "text" in example: # imdb + text = example["text"] + elif "sentence" in example: # iflytek + text = example["sentence"] + + if self.preprocess_text_fn: + text = self.preprocess_text_fn(text) + if "label" in example: + label = example["label"] + else: + label = "-1" + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + tokens_a = self.tokenizer.tokenize(text) + while start_offset < len(tokens_a): + length = len(tokens_a) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens_a): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_id", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = tokens_a[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + ["[CLS]"] + token_ids = 
self.tokenizer.convert_tokens_to_ids(tokens) + features.append(Feature(src_ids=token_ids, label_id=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat + return features + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [len(doc) for doc in pre_batch_list] + max_len_idx = len_doc.index(max(len_doc)) + dirty_sample = pre_batch_list[max_len_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + sample_list.extend([dirty_sample] * (max(len_doc) - len(sample_list))) + + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [record.src_ids for record in batch_records] + if batch_records[0].label_id is not None: + batch_labels = [record.label_id for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + if batch_records[-1].qid is not None: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + final_cls=True, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples): + batch_records, max_len, gather_idx = [], 0, [] + for index, example in enumerate(examples): + max_len = max(max_len, len(example.src_ids)) + if self.in_tokens: + to_append = (len(batch_records) + 1) * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(example) + if example.cal_loss == 1: + gather_idx.append(index % self.batch_size) + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [example], len(example.src_ids) + gather_idx = [index % self.batch_size] if example.cal_loss == 1 else [] + yield self._pad_batch_records(batch_records, gather_idx) + + def _create_instances(self): + examples = self.dataset + pre_batch_list = [] + insert_idx = [] + for qid, example in enumerate(examples): + features = self._convert_to_features(example, qid) + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + 
pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + def __call__(self): + curr_id = 0 + for batch_records in self._create_instances(): + if curr_id == self.trainer_id or self.mode != "train": + yield batch_records + curr_id = (curr_id + 1) % self.trainer_num + + def get_num_examples(self): + if self.num_examples is None: + self.num_examples = 0 + for qid, example in enumerate(self.dataset): + self.num_examples += len(self._convert_to_features(example, qid)) + return self.num_examples + + +class MRCIterator(ClassifierIterator): + """ + Machine Reading Comprehension iterator. Only for answer extraction. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + ): + super(MRCIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.doc_stride = doc_stride + self.max_query_length = max_query_length + self.examples = [] + self.features = [] + self.features_all = [] + self._preprocess_data() + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.features_all) + + def _convert_qa_to_examples(self): + Example = namedtuple( + "Example", ["qas_id", "question_text", "doc_tokens", "orig_answer_text", "start_position", "end_position"] + ) + examples = [] + for qa in self.dataset: + qas_id = qa["id"] + question_text = qa["question"] + context = qa["context"] + start_pos = None + end_pos = None + orig_answer_text = None + if self.mode == "train": + if len(qa["answers"]) != 1: + raise ValueError("For training, each question should have exactly 1 answer.") + orig_answer_text = qa["answers"][0] + answer_offset = qa["answer_starts"][0] + answer_length = len(orig_answer_text) + doc_tokens = [ + context[:answer_offset], + context[answer_offset : answer_offset + answer_length], + context[answer_offset + answer_length :], + ] + + start_pos = 1 + end_pos = 1 + + actual_text = " ".join(doc_tokens[start_pos : (end_pos + 1)]) + if orig_answer_text.islower(): + actual_text = actual_text.lower() + if actual_text.find(orig_answer_text) == -1: + logger.info("Could not find answer: '%s' vs. 
'%s'" % (actual_text, orig_answer_text)) + continue + + else: + doc_tokens = tokenize_chinese_chars(context) + + example = Example( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_pos, + end_position=end_pos, + ) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", + [ + "qid", + "example_index", + "doc_span_index", + "tokens", + "token_to_orig_map", + "token_is_max_context", + "src_ids", + "start_position", + "end_position", + "cal_loss", + ], + ) + features = [] + self.features_all = [] + unique_id = 1000 + is_training = self.mode == "train" + print("total {} examples".format(len(examples)), flush=True) + for (example_index, example) in enumerate(examples): + query_tokens = self.tokenizer.tokenize(example.question_text) + if len(query_tokens) > self.max_query_length: + query_tokens = query_tokens[0 : self.max_query_length] + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = self.tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = self._improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, example.orig_answer_text + ) + + max_tokens_for_doc = self.max_seq_length - len(query_tokens) - 3 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + tokens.append("[CLS]") + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[i + 1] = tok_to_orig_index[split_token_index] + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[i + 1] = is_max_context + tokens += all_doc_tokens[doc_span.start : doc_span.start + doc_span.length] + tokens.append("[SEP]") + + for token in query_tokens: + tokens.append(token) + tokens.append("[SEP]") + + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + start_position = None + end_position = None + if is_training: + doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = 1 # len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + feature = Feature( + qid=unique_id, + example_index=example_index, + 
doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + src_ids=token_ids, + start_position=start_position, + end_position=end_position, + cal_loss=1, + ) + features.append(feature) + features_each.append(feature) + if example_index % 1000 == 0: + print("processing {} examples".format(example_index), flush=True) + + unique_id += 1 + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _preprocess_data(self): + # Construct examples + self.examples = self._convert_qa_to_examples() + # Construct features + self.features = self._convert_example_to_feature(self.examples) + + def get_num_examples(self): + if not self.features_all: + self._preprocess_data() + return len(sum(self.features_all, [])) + + def _improve_answer_span(self, doc_tokens, input_start, input_end, orig_answer_text): + """Improve answer span""" + tok_answer_text = " ".join(self.tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start : (new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check is max context""" + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + break + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + if best_span_index > cur_span_index: + return False + + return cur_span_index == best_span_index + + def _pad_batch_records(self, batch_records, gather_idx=[]): + """Pad batch data""" + batch_token_ids = [record.src_ids for record in batch_records] + + if self.mode == "train": + batch_start_position = [record.start_position for record in batch_records] + batch_end_position = [record.end_position for record in batch_records] + batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1, 1]) + batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1, 1]) + else: + batch_size = len(batch_token_ids) + batch_start_position = np.zeros(shape=[batch_size, 1], dtype="int64") + batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64") + + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_task_ids, 
self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_start_position, + batch_end_position, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + + return return_list + + def _create_instances(self): + """Generate batch records""" + pre_batch_list = [] + insert_idx = [] + for qid, features in enumerate(self.features_all): + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + +class MCQIterator(MRCIterator): + """ + Multiple choice question iterator. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + choice_num=4, + ): + self.choice_num = choice_num + super(MCQIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + ) + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
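+        # Work on copies so that the caller's token lists are not modified in place.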
+ tokens_a = list(tokens_a) + tokens_b = list(tokens_b) + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + return tokens_a, tokens_b + + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qas_id", "context", "question", "choice", "label"]) + examples = [] + for qas_id, qa in enumerate(self.dataset): + context = "\n".join(qa["context"]).lower() + question = qa["question"].lower() + choice = [c.lower() for c in qa["choice"]] + # pad empty choice + for k in range(len(choice), self.choice_num): + choice.append("") + label = qa["label"] + + example = Example(qas_id=qas_id, context=context, question=question, choice=choice, label=label) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple("Feature", ["qid", "src_ids", "segment_ids", "label", "cal_loss"]) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + context_tokens = self.tokenizer.tokenize(example.context) + question_tokens = self.tokenizer.tokenize(example.question) + choice_tokens_lst = [self.tokenizer.tokenize(choice) for choice in example.choice] + # nums = 4 + question_choice_pairs = [ + self._truncate_seq_pair(question_tokens, choice_tokens, self.max_query_length - 2) + for choice_tokens in choice_tokens_lst + ] + total_qc_num = sum([(len(q) + len(c)) for q, c in question_choice_pairs]) + max_tokens_for_doc = self.max_seq_length - total_qc_num - 4 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + while start_offset < len(context_tokens): + length = len(context_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(context_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + qa_features = [] + for q_tokens, c_tokens in question_choice_pairs: + segment_tokens = ["[CLS]"] + token_type_ids = [0] + + segment_tokens += context_tokens[doc_span.start : doc_span.start + doc_span.length] + token_type_ids += [0] * doc_span.length + + segment_tokens += ["[SEP]"] + token_type_ids += [0] + + segment_tokens += q_tokens + token_type_ids += [1] * len(q_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + segment_tokens += c_tokens + token_type_ids += [1] * len(c_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + input_ids = self.tokenizer.convert_tokens_to_ids(segment_tokens) + feature = Feature( + qid=example.qas_id, + label=example.label, + src_ids=input_ids, + segment_ids=token_type_ids, + cal_loss=1, + ) + qa_features.append(feature) + + features.append(qa_features) + features_each.append(qa_features) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [[record.src_ids for record in records] for records in batch_records] + if batch_records[0][0].label is not None: + batch_labels = [[record.label for record in records] for records in batch_records] + batch_labels = 
np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [[record.qid for record in records] for records in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + batch_task_ids = [[record.segment_ids for record in records] for records in batch_records] + + # Padding + batch_padded_token_ids = [] + batch_input_mask = [] + batch_padded_task_ids = [] + batch_padded_position_ids = [] + batch_size = len(batch_token_ids) + for i in range(batch_size): + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids[i], + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids[i], pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + batch_padded_token_ids.append(padded_token_ids) + batch_input_mask.append(input_mask) + batch_padded_task_ids.append(padded_task_ids) + batch_padded_position_ids.append(padded_position_ids) + + batch_padded_token_ids = ( + np.array(batch_padded_token_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_position_ids = ( + np.array(batch_padded_position_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_task_ids = ( + np.array(batch_padded_task_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_input_mask = np.array(batch_input_mask).astype("float32").reshape([batch_size * self.choice_num, -1, 1]) + + return_list = [ + batch_padded_token_ids, + batch_padded_position_ids, + batch_padded_task_ids, + batch_input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples_list): + batch_records, max_len, gather_idx = [], 0, [] + real_batch_size = self.batch_size * self.choice_num + index = 0 + for examples in examples_list: + records = [] + gather_idx_candidate = [] + for example in examples: + if example.cal_loss == 1: + gather_idx_candidate.append(index % real_batch_size) + max_len = max(max_len, len(example.src_ids)) + records.append(example) + index += 1 + + if self.in_tokens: + to_append = (len(batch_records) + 1) * self.choice_num * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(records) + gather_idx += gather_idx_candidate + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [records], max(len(record.src_ids) for record in records) + gather_idx = gather_idx_candidate + if len(batch_records) > 0: + yield self._pad_batch_records(batch_records, gather_idx) + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [[len(doc) for doc in doc_list] for doc_list in pre_batch_list] + len_doc = list(itertools.chain(*len_doc)) + max_len_idx = len_doc.index(max(len_doc)) + doc_idx = max_len_idx % self.choice_num + doc_list_idx = max_len_idx // self.choice_num + dirty_sample = 
pre_batch_list[doc_list_idx][doc_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + for samples in sample_list: + samples.extend([dirty_sample] * (max(len_doc) - len(samples))) + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + +class SemanticMatchingIterator(MRCIterator): + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qid", "text_a", "text_b", "text_c", "label"]) + examples = [] + for qid, qa in enumerate(self.dataset): + text_a, text_b, text_c = list( + map(lambda x: x.replace("\n", "").strip(), [qa["text_a"], qa["text_b"], qa["text_c"]]) + ) + + example = Example(qid=qid, text_a=text_a, text_b=text_b, text_c=text_c, label=qa["label"]) + examples += [example] + return examples + + def _create_tokens_and_type_id(self, text_a_tokens, text_b_tokens, start, length): + tokens = ( + ["[CLS]"] + + text_a_tokens[start : start + length] + + ["[SEP]"] + + text_b_tokens[start : start + length] + + ["[SEP]"] + ) + token_type_ids = [0] + [0] * (length + 1) + [1] * (length + 1) + return tokens, token_type_ids + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", ["qid", "src_ids", "segment_ids", "pair_src_ids", "pair_segment_ids", "label", "cal_loss"] + ) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + text_a_tokens = self.tokenizer.tokenize(example.text_a) + text_b_tokens = self.tokenizer.tokenize(example.text_b) + text_c_tokens = self.tokenizer.tokenize(example.text_c) + a_len, b_len, c_len = list(map(lambda x: len(x), [text_a_tokens, text_b_tokens, text_c_tokens])) + + # Align 3 text + min_text_len = min([a_len, b_len, c_len]) + text_a_tokens = text_a_tokens[:min_text_len] + text_b_tokens = text_b_tokens[:min_text_len] + text_c_tokens = text_c_tokens[:min_text_len] + + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + max_tokens_for_doc = (self.max_seq_length - 3) // 2 + + while start_offset < len(text_a_tokens): + length = len(text_a_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(text_a_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens1, token_type_ids1 = self._create_tokens_and_type_id( + text_a_tokens, text_b_tokens, doc_span.start, doc_span.length + ) + tokens2, token_type_ids2 = self._create_tokens_and_type_id( + text_a_tokens, text_c_tokens, doc_span.start, doc_span.length + ) + + input_ids1 = self.tokenizer.convert_tokens_to_ids(tokens1) + input_ids2 = self.tokenizer.convert_tokens_to_ids(tokens2) + feature = Feature( + qid=example.qid, + label=example.label, + src_ids=input_ids1, + segment_ids=token_type_ids1, + pair_src_ids=input_ids2, + pair_segment_ids=token_type_ids2, + cal_loss=1, + ) + + features.append(feature) + features_each.append(feature) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return 
features + + def _create_pad_ids(self, batch_records, prefix=""): + src_ids = prefix + "src_ids" + segment_ids = prefix + "segment_ids" + batch_token_ids = [getattr(record, src_ids) for record in batch_records] + batch_task_ids = [getattr(record, segment_ids) for record in batch_records] + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids, pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + return [padded_token_ids, padded_position_ids, padded_task_ids, input_mask] + + def _pad_batch_records(self, batch_records, gather_idx=[]): + if batch_records[0].label is not None: + batch_labels = [record.label for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + return_list = ( + self._create_pad_ids(batch_records) + + self._create_pad_ids(batch_records, "pair_") + + [batch_labels, batch_qids, batch_gather_idx, need_cal_loss] + ) + return return_list + + +class SequenceLabelingIterator(ClassifierIterator): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + no_entity_id=-1, + ): + super(SequenceLabelingIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.no_entity_id = no_entity_id + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + tokens = example["tokens"] + label = example["labels"] + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + while start_offset < len(tokens): + length = len(tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_ids", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + curr_tokens = ["[CLS]"] + tokens[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + token_ids = self.tokenizer.convert_tokens_to_ids(curr_tokens) + label = ( + [self.no_entity_id] + label[doc_span.start : doc_span.start + doc_span.length] + [self.no_entity_id] + ) + + features.append(Feature(src_ids=token_ids, label_ids=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat 
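+            # features now holds every span twice: the leading copies carry cal_loss=0
+            # (excluded from the loss), the trailing originals carry cal_loss=1.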
+ return features + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [record.src_ids for record in batch_records] + batch_length = [len(record.src_ids) for record in batch_records] + batch_length = np.array(batch_length).astype("int64").reshape([-1, 1]) + + if batch_records[0].label_ids is not None: + batch_labels = [record.label_ids for record in batch_records] + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + if batch_records[-1].qid is not None: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + if batch_records[0].label_ids is not None: + padded_batch_labels = pad_batch_data( + batch_labels, pad_idx=self.no_entity_id, pad_max_len=self.max_seq_length + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + padded_batch_labels, + batch_length, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + +def to_json_file(task, label_dict, file_path): + if task == "iflytek": + filename = file_path + + with open(filename, "w+") as f_obj: + for i, j in label_dict.items(): + tmp = dict() + tmp["id"] = str(i) + tmp["label"] = str(j) + json.dump(tmp, f_obj) + f_obj.write("\n") diff --git a/examples/text_classification/ernie_doc/export_model.py b/examples/text_classification/ernie_doc/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..187f8efe2dd8153623d6861573e125af1c54a4ed --- /dev/null +++ b/examples/text_classification/ernie_doc/export_model.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
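+
+# Convert a fine-tuned ERNIE-Doc classifier into a static-graph inference model:
+# LongDocClassifier from predict.py is built with static_mode=True and saves the
+# converted model under --static_path (an existing directory there is removed first).
+# A typical invocation (checkpoint and output paths below are only illustrative):
+#   python export_model.py --model_name_or_path ./checkpoints/best --dataset iflytek --static_path ./export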
+ +import argparse +import os +import paddle +import shutil +from paddlenlp.utils.log import logger +from predict import LongDocClassifier + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=16, type=int, + help="Batch size per GPU/CPU for predicting (In static mode, it should be the same as in model training process.)") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", + help="Pretraining or finetuned model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, + help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--device", type=str, default="cpu", choices=["cpu", "gpu"], + help="Select cpu, gpu devices to train model.") +parser.add_argument("--dataset", default="iflytek", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, + help="The training dataset") +parser.add_argument("--static_path", default=None, type=str, + help="The path which your static model is at or where you want to save after converting.") + +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + if args.static_path and os.path.exists(args.static_path): + logger.info("will remove the old model") + shutil.rmtree(args.static_path) + + predictor = LongDocClassifier( + model_name_or_path=args.model_name_or_path, + batch_size=args.batch_size, + max_seq_length=args.max_seq_length, + memory_len=args.memory_length, + static_mode=True, + static_path=args.static_path, + ) diff --git a/examples/text_classification/ernie_doc/metrics.py b/examples/text_classification/ernie_doc/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..a60509380b5fd06c3b9936868dc1938fc4acc8cc --- /dev/null +++ b/examples/text_classification/ernie_doc/metrics.py @@ -0,0 +1,367 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
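+
+# Metrics used by the ERNIE-Doc examples: an F1 metric over a designated positive
+# label for classification, EM/F1 scoring for Chinese machine reading comprehension,
+# and n-best answer span extraction for QA predictions.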
+ +import collections +import sys + +import numpy as np +import paddle +from paddle.utils import try_import + +from paddlenlp.metrics.dureader import ( + _compute_softmax, + _get_best_indexes, + get_final_text, +) + +# Metric for ERNIE-DOCs + + +class F1(object): + def __init__(self, positive_label=1): + self.positive_label = positive_label + self.reset() + + def compute(self, preds, labels): + if isinstance(preds, paddle.Tensor): + preds = preds.numpy() + elif isinstance(preds, list): + preds = np.array(preds, dtype="float32") + if isinstance(labels, list): + labels = np.array(labels, dtype="int64") + elif isinstance(labels, paddle.Tensor): + labels = labels.numpy() + preds = np.argmax(preds, axis=1) + tp = ((preds == labels) & (labels == self.positive_label)).sum() + fn = ((preds != labels) & (labels == self.positive_label)).sum() + fp = ((preds != labels) & (preds == self.positive_label)).sum() + return tp, fp, fn + + def update(self, statistic): + tp, fp, fn = statistic + self.tp += tp + self.fp += fp + self.fn += fn + + def accumulate(self): + recall = self.tp / (self.tp + self.fn) + precision = self.tp / (self.tp + self.fp) + f1 = 2 * recall * precision / (recall + precision) + return f1 + + def reset(self): + self.tp = 0 + self.fp = 0 + self.fn = 0 + + +class EM_AND_F1(object): + def __init__(self): + self.nltk = try_import("nltk") + self.re = try_import("re") + + def _mixed_segmentation(self, in_str, rm_punc=False): + """mixed_segmentation""" + in_str = in_str.lower().strip() + segs_out = [] + temp_str = "" + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + for char in in_str: + if rm_punc and char in sp_char: + continue + pattern = "[\\u4e00-\\u9fa5]" + if self.re.search(pattern, char) or char in sp_char: + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + temp_str = "" + segs_out.append(char) + else: + temp_str += char + + # Handling last part + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + + return segs_out + + # Remove punctuation + def _remove_punctuation(self, in_str): + """remove_punctuation""" + in_str = in_str.lower().strip() + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + # Find longest common string + def _find_lcs(self, s1, s2): + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + mmax = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > mmax: + mmax = m[i + 1][j + 1] + p = i + 1 + return s1[p - mmax : p], mmax + + def _calc_f1_score(self, answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = self._mixed_segmentation(ans, rm_punc=True) + prediction_segs = self._mixed_segmentation(prediction, rm_punc=True) + lcs, lcs_len = self._find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + precision = 1.0 * lcs_len / len(prediction_segs) + recall = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * precision * recall) / 
(precision + recall) + f1_scores.append(f1) + return max(f1_scores) + + def _calc_em_score(self, answers, prediction): + em = 0 + for ans in answers: + ans_ = self._remove_punctuation(ans) + prediction_ = self._remove_punctuation(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + def __call__(self, prediction, ground_truth): + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for instance in ground_truth: + total_count += 1 + query_id = instance["id"] + answers = instance["answers"] + if query_id not in prediction: + sys.stderr.write("Unanswered question: {}\n".format(query_id)) + skip_count += 1 + continue + preds = str(prediction[query_id]) + f1 += self._calc_f1_score(answers, preds) + em += self._calc_em_score(answers, preds) + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + + avg_score = (f1_score + em_score) * 0.5 + return em_score, f1_score, avg_score, total_count + + +def compute_qa_predictions( + all_examples, all_features, all_results, n_best_size, max_answer_length, do_lower_case, tokenizer, verbose +): + """Write final predictions to the json file and log-odds of null if needed.""" + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # Keep track of the minimum score of null start+end of position 0 + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.qid] + start_indexes = _get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. 
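+                    # A candidate span is kept only if both indices fall inside this
+                    # feature's tokens, both map back to original document tokens, the
+                    # start token has maximum context in this feature, start <= end,
+                    # and the span is no longer than max_answer_length.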
+ if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index], + ) + ) + + prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"] + ) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = "".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, tokenizer, verbose) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append(_NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + total_scores = [] + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + all_predictions[example.qas_id] = nbest_json[0]["text"] + all_nbest_json[example.qas_id] = nbest_json + return all_predictions, all_nbest_json diff --git a/examples/text_classification/ernie_doc/modeling.py b/examples/text_classification/ernie_doc/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..05b725767e8fa042b8613d721bba652e6616f09d --- /dev/null +++ b/examples/text_classification/ernie_doc/modeling.py @@ -0,0 +1,940 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.transformers import PretrainedModel, register_base_model +from paddlenlp.transformers.attention_utils import _convert_param_attr_to_list + +__all__ = [ + "ErnieDocModel", + "ErnieDocPretrainedModel", + "ErnieDocForSequenceClassification", + "ErnieDocForTokenClassification", + "ErnieDocForQuestionAnswering", +] + + +class PointwiseFFN(nn.Layer): + def __init__(self, d_inner_hid, d_hid, dropout_rate, hidden_act, weight_attr=None, bias_attr=None): + super(PointwiseFFN, self).__init__() + self.linear1 = nn.Linear(d_hid, d_inner_hid, weight_attr, bias_attr=bias_attr) + self.dropout = nn.Dropout(dropout_rate, mode="upscale_in_train") + self.linear2 = nn.Linear(d_inner_hid, d_hid, weight_attr, bias_attr=bias_attr) + self.activation = getattr(F, hidden_act) + + def forward(self, x): + return self.linear2(self.dropout(self.activation(self.linear1(x)))) + + +class MultiHeadAttention(nn.Layer): + def __init__( + self, + d_key, + d_value, + d_model, + n_head=1, + r_w_bias=None, + r_r_bias=None, + r_t_bias=None, + dropout_rate=0.0, + weight_attr=None, + bias_attr=None, + ): + super(MultiHeadAttention, self).__init__() + self.d_key = d_key + self.d_value = d_value + self.d_model = d_model + self.n_head = n_head + + assert d_key * n_head == d_model, "d_model must be divisible by n_head" + + self.q_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.k_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.v_proj = nn.Linear(d_model, d_value * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.r_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.t_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.out_proj = nn.Linear(d_model, d_model, weight_attr=weight_attr, bias_attr=bias_attr) + self.r_w_bias = r_w_bias + self.r_r_bias = r_r_bias + self.r_t_bias = r_t_bias + self.dropout = nn.Dropout(dropout_rate, mode="upscale_in_train") if dropout_rate else None + + def _compute_qkv(self, queries, keys, values, rel_pos, rel_task): + q = self.q_proj(queries) + k = self.k_proj(keys) + v = self.v_proj(values) + r = self.r_proj(rel_pos) + t = self.t_proj(rel_task) + return q, k, v, r, t + + def _split_heads(self, x, d_model, n_head): + # x shape: [B, T, H] + x = x.reshape(shape=[0, 0, n_head, d_model // n_head]) + # shape: [B, N, T, HH] + return paddle.transpose(x=x, perm=[0, 2, 1, 3]) + + def _rel_shift(self, x, klen=-1): + """ + To perform relative attention, it should relatively shift the attention score matrix + See more details on: https://github.com/kimiyoung/transformer-xl/issues/8#issuecomment-454458852 + """ + # input shape: [B, N, T, 2 * T + M] + x_shape = x.shape + x = x.reshape([x_shape[0], x_shape[1], x_shape[3], x_shape[2]]) + x = x[:, :, 1:, :] + x = x.reshape([x_shape[0], x_shape[1], x_shape[2], x_shape[3] - 1]) + # output shape: [B, N, T, T + M] + return x[:, :, :, :klen] + + def _scaled_dot_product_attention(self, q, k, v, r, 
t, attn_mask): + q_w, q_r, q_t = q + score_w = paddle.matmul(q_w, k, transpose_y=True) + score_r = paddle.matmul(q_r, r, transpose_y=True) + score_r = self._rel_shift(score_r, k.shape[2]) + + score_t = paddle.matmul(q_t, t, transpose_y=True) + score = score_w + score_r + score_t + score = score * (self.d_key**-0.5) + + if attn_mask is not None: + score += attn_mask + weights = F.softmax(score) + if self.dropout: + weights = self.dropout(weights) + out = paddle.matmul(weights, v) + return out + + def _combine_heads(self, x): + sign = len(x.shape) == 3 + # Directly using len(tensor.shape) as an if condition + # would not act functionally when applying paddle.jit.save api to save static graph. + if sign: + return x + sign = len(x.shape) != 4 + if sign: + raise ValueError("Input(x) should be a 4-D Tensor.") + # x shape: [B, N, T, HH] + x = paddle.transpose(x, [0, 2, 1, 3]) + # target shape:[B, T, H] + return x.reshape([0, 0, x.shape[2] * x.shape[3]]) + + def forward(self, queries, keys, values, rel_pos, rel_task, memory, attn_mask): + sign = memory is not None and len(memory.shape) > 1 + if sign: + cat = paddle.concat([memory, queries], 1) + else: + cat = queries + keys, values = cat, cat + + sign = ( + len(queries.shape) + == len(keys.shape) + == len(values.shape) + == len(rel_pos.shape) + == len(rel_task.shape) + == 3 + ) + + if not sign: + raise ValueError("Inputs: quries, keys, values, rel_pos and rel_task should all be 3-D tensors.") + + q, k, v, r, t = self._compute_qkv(queries, keys, values, rel_pos, rel_task) + q_w, q_r, q_t = list(map(lambda x: q + x.unsqueeze([0, 1]), [self.r_w_bias, self.r_r_bias, self.r_t_bias])) + q_w, q_r, q_t = list(map(lambda x: self._split_heads(x, self.d_model, self.n_head), [q_w, q_r, q_t])) + k, v, r, t = list(map(lambda x: self._split_heads(x, self.d_model, self.n_head), [k, v, r, t])) + ctx_multiheads = self._scaled_dot_product_attention([q_w, q_r, q_t], k, v, r, t, attn_mask) + out = self._combine_heads(ctx_multiheads) + out = self.out_proj(out) + return out + + +class ErnieDocEncoderLayer(nn.Layer): + def __init__( + self, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + normalize_before=False, + epsilon=1e-5, + rel_pos_params_sharing=False, + r_w_bias=None, + r_r_bias=None, + r_t_bias=None, + weight_attr=None, + bias_attr=None, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + super(ErnieDocEncoderLayer, self).__init__() + if not rel_pos_params_sharing: + r_w_bias, r_r_bias, r_t_bias = list( + map( + lambda x: self.create_parameter(shape=[n_head * d_key], dtype="float32"), + ["r_w_bias", "r_r_bias", "r_t_bias"], + ) + ) + + weight_attrs = _convert_param_attr_to_list(weight_attr, 2) + bias_attrs = _convert_param_attr_to_list(bias_attr, 2) + self.attn = MultiHeadAttention( + d_key, + d_value, + d_model, + n_head, + r_w_bias, + r_r_bias, + r_t_bias, + attention_dropout, + weight_attr=weight_attrs[0], + bias_attr=bias_attrs[0], + ) + self.ffn = PointwiseFFN( + d_inner_hid, d_model, relu_dropout, hidden_act, weight_attr=weight_attrs[1], bias_attr=bias_attrs[1] + ) + self.norm1 = nn.LayerNorm(d_model, epsilon=epsilon) + self.norm2 = nn.LayerNorm(d_model, epsilon=epsilon) + self.dropout1 = nn.Dropout(prepostprocess_dropout, mode="upscale_in_train") + self.dropout2 = nn.Dropout(prepostprocess_dropout, mode="upscale_in_train") + self.d_model = d_model + self.epsilon = epsilon + self.normalize_before = normalize_before + + 
def forward(self, enc_input, memory, rel_pos, rel_task, attn_mask): + residual = enc_input + if self.normalize_before: + enc_input = self.norm1(enc_input) + attn_output = self.attn(enc_input, enc_input, enc_input, rel_pos, rel_task, memory, attn_mask) + attn_output = residual + self.dropout1(attn_output) + if not self.normalize_before: + attn_output = self.norm1(attn_output) + residual = attn_output + if self.normalize_before: + attn_output = self.norm2(attn_output) + ffn_output = self.ffn(attn_output) + output = residual + self.dropout2(ffn_output) + if not self.normalize_before: + output = self.norm2(output) + return output + + +class ErnieDocEncoder(nn.Layer): + def __init__(self, num_layers, encoder_layer, mem_len): + super(ErnieDocEncoder, self).__init__() + self.layers = nn.LayerList( + [(encoder_layer if i == 0 else type(encoder_layer)(**encoder_layer._config)) for i in range(num_layers)] + ) + self.num_layers = num_layers + self.normalize_before = self.layers[0].normalize_before + self.mem_len = mem_len + + def _cache_mem(self, curr_out, prev_mem): + if self.mem_len is None or self.mem_len == 0: + return None + if prev_mem is None: + new_mem = curr_out[:, -self.mem_len :, :] + else: + new_mem = paddle.concat([prev_mem, curr_out], 1)[:, -self.mem_len :, :] + new_mem.stop_gradient = True + return new_mem + + def forward(self, enc_input, memories, rel_pos, rel_task, attn_mask): + # memories shape: [N, B, M, H] + # no need to normalize enc_input, cause it's already normalized outside. + new_mem = None + for _, encoder_layer in enumerate(self.layers): + # Since in static mode, the memories should be set as tensor, + # so we use paddle.slice to free the old memories explicitly to save gpu memory. + enc_input = encoder_layer(enc_input, memories[0], rel_pos, rel_task, attn_mask) + if new_mem is None: + new_mem = paddle.unsqueeze(self._cache_mem(enc_input, memories[0]), axis=0) + else: + new_mem = paddle.concat( + [new_mem, paddle.unsqueeze(self._cache_mem(enc_input, memories[0]), axis=0)], axis=0 + ) + sign = memories.shape[0] + if sign > 1: + axis = [0] + start = [1] + end = [memories.shape[0]] + memories = paddle.slice(memories, axes=axis, starts=start, ends=end) + else: + memories = None + return enc_input, new_mem + + +class ErnieDocPretrainedModel(PretrainedModel): + """ + An abstract class for pretrained ErnieDoc models. It provides ErnieDoc related + `model_config_file`, `pretrained_init_configuration`, `resource_files_names`, + `pretrained_resource_files_map`, `base_model_prefix` for downloading + and loading pretrained models. + See :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "ernie-doc-base-en": { + "attention_dropout_prob": 0.0, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.0, + "relu_dropout": 0.0, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "vocab_size": 50265, + "memory_len": 128, + "epsilon": 1e-12, + "pad_token_id": 1, + }, + "ernie-doc-base-zh": { + "attention_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "relu_dropout": 0.0, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "vocab_size": 28000, + "memory_len": 128, + "epsilon": 1e-12, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "ernie-doc-base-en": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie-doc-base-en/ernie-doc-base-en.pdparams", + "ernie-doc-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie-doc-base-zh/ernie-doc-base-zh.pdparams", + } + } + base_model_prefix = "ernie_doc" + + def _init_weights(self, layer): + # Initialization hook + if isinstance(layer, (nn.Linear, nn.Embedding)): + # In the dygraph mode, use the `set_value` to reset the parameter directly, + # and reset the `state_dict` to update parameter in static mode. + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.ernie_doc.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + + +class ErnieDocEmbeddings(nn.Layer): + def __init__( + self, + vocab_size, + d_model, + hidden_dropout_prob, + memory_len, + max_position_embeddings=512, + type_vocab_size=3, + padding_idx=0, + ): + super(ErnieDocEmbeddings, self).__init__() + self.word_emb = nn.Embedding(vocab_size, d_model) + self.pos_emb = nn.Embedding(max_position_embeddings * 2 + memory_len, d_model) + self.token_type_emb = nn.Embedding(type_vocab_size, d_model) + self.memory_len = memory_len + self.dropouts = nn.LayerList([nn.Dropout(hidden_dropout_prob) for i in range(3)]) + self.norms = nn.LayerList([nn.LayerNorm(d_model) for i in range(3)]) + + def forward(self, input_ids, token_type_ids, position_ids): + # input_embeddings: [B, T, H] + input_embeddings = self.word_emb(input_ids.squeeze(-1)) + # position_embeddings: [B, 2 * T + M, H] + position_embeddings = self.pos_emb(position_ids.squeeze(-1)) + batch_size = input_ids.shape[0] + token_type_ids = paddle.concat( + [ + paddle.zeros(shape=[batch_size, self.memory_len, 1], dtype="int64") + token_type_ids[0, 0, 0], + token_type_ids, + ], + axis=1, + ) + token_type_ids.stop_gradient = True + # token_type_embeddings: [B, M + T, H] + token_type_embeddings = self.token_type_emb(token_type_ids.squeeze(-1)) + embs = [input_embeddings, position_embeddings, token_type_embeddings] + for i in range(len(embs)): + embs[i] = self.dropouts[i](self.norms[i](embs[i])) + return embs + + +class ErnieDocPooler(nn.Layer): + """ + get pool output + """ + + def __init__(self, hidden_size, cls_token_idx=-1): + super(ErnieDocPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + self.cls_token_idx = cls_token_idx + + def forward(self, hidden_states): + # We 
"pool" the model by simply taking the hidden state corresponding + # to the last token. + cls_token_tensor = hidden_states[:, self.cls_token_idx] + pooled_output = self.dense(cls_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +@register_base_model +class ErnieDocModel(ErnieDocPretrainedModel): + """ + The bare ERNIE-Doc Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + num_hidden_layers (int): + The number of hidden layers in the Transformer encoder. + num_attention_heads (int): + Number of attention heads for each attention layer in the Transformer encoder. + hidden_size (int): + Dimensionality of the embedding layers, encoder layers and pooler layer. + hidden_dropout_prob (int): + The dropout probability for all fully connected layers in the embeddings and encoder. + attention_dropout_prob (int): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + relu_dropout (int): + The dropout probability of FFN. + hidden_act (str): + The non-linear activation function of FFN. + memory_len (int): + The number of tokens to cache. If not 0, the last `memory_len` hidden states + in each layer will be cached into memory. + vocab_size (int): + Vocabulary size of `inputs_ids` in `ErnieDocModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `ErnieDocModel`. + max_position_embeddings (int): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + task_type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids`. Defaults to `3`. + normalize_before (bool, optional): + Indicate whether to put layer normalization into preprocessing of MHA and FFN sub-layers. + If True, pre-process is layer normalization and post-precess includes dropout, + residual connection. Otherwise, no pre-process and post-precess includes dropout, + residual connection, layer normalization. Defaults to `False`. + epsilon (float, optional): + The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for + initializing layer normalization layers. Defaults to `1e-5`. + rel_pos_params_sharing (bool, optional): + Whether to share the relative position parameters. + Defaults to `False`. + initializer_range (float, optional): + The standard deviation of the normal initializer for initializing all weight matrices. + Defaults to `0.02`. + pad_token_id (int, optional): + The token id of [PAD] token whose parameters won't be updated when training. + Defaults to `0`. + cls_token_idx (int, optional): + The token id of [CLS] token. Defaults to `-1`. 
+ """ + + def __init__( + self, + num_hidden_layers, + num_attention_heads, + hidden_size, + hidden_dropout_prob, + attention_dropout_prob, + relu_dropout, + hidden_act, + memory_len, + vocab_size, + max_position_embeddings, + task_type_vocab_size=3, + normalize_before=False, + epsilon=1e-5, + rel_pos_params_sharing=False, + initializer_range=0.02, + pad_token_id=0, + cls_token_idx=-1, + ): + super(ErnieDocModel, self).__init__() + + r_w_bias, r_r_bias, r_t_bias = None, None, None + if rel_pos_params_sharing: + r_w_bias, r_r_bias, r_t_bias = list( + map( + lambda x: self.create_parameter(shape=[num_attention_heads * d_key], dtype="float32"), + ["r_w_bias", "r_r_bias", "r_t_bias"], + ) + ) + d_key = hidden_size // num_attention_heads + d_value = hidden_size // num_attention_heads + d_inner_hid = hidden_size * 4 + encoder_layer = ErnieDocEncoderLayer( + num_attention_heads, + d_key, + d_value, + hidden_size, + d_inner_hid, + hidden_dropout_prob, + attention_dropout_prob, + relu_dropout, + hidden_act, + normalize_before=normalize_before, + epsilon=epsilon, + rel_pos_params_sharing=rel_pos_params_sharing, + r_w_bias=r_w_bias, + r_r_bias=r_r_bias, + r_t_bias=r_t_bias, + ) + self.n_head = num_attention_heads + self.d_model = hidden_size + self.memory_len = memory_len + self.encoder = ErnieDocEncoder(num_hidden_layers, encoder_layer, memory_len) + self.pad_token_id = pad_token_id + self.embeddings = ErnieDocEmbeddings( + vocab_size, + hidden_size, + hidden_dropout_prob, + memory_len, + max_position_embeddings, + task_type_vocab_size, + pad_token_id, + ) + self.pooler = ErnieDocPooler(hidden_size, cls_token_idx) + + def _create_n_head_attn_mask(self, attn_mask, batch_size): + # attn_mask shape: [B, T, 1] + # concat an data_mask, shape: [B, M + T, 1] + data_mask = paddle.concat( + [paddle.ones(shape=[batch_size, self.memory_len, 1], dtype=attn_mask.dtype), attn_mask], axis=1 + ) + data_mask.stop_gradient = True + # create a self_attn_mask, shape: [B, T, M + T] + self_attn_mask = paddle.matmul(attn_mask, data_mask, transpose_y=True) + self_attn_mask = (self_attn_mask - 1) * 1e8 + n_head_self_attn_mask = paddle.stack([self_attn_mask] * self.n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + return n_head_self_attn_mask + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocModel forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length, 1]. + memories (Tensor): + Pre-computed hidden-states for each layer. + It's data type should be `float32` and has a shape of [num_hidden_layers, batch_size, memory_len, hidden_size]. + token_type_ids (Tensor): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length, 1]. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + config.max_position_embeddings - 1]``. Shape as `(batch_sie, num_tokens)` and dtype as `int32` or `int64`. 
+ attn_mask (Tensor): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + We use whole-word-mask in ERNIE, so the whole word will have the same value. For example, "使用" as a word, + "使" and "用" will have the same value. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple : Returns tuple (``encoder_output``, ``pooled_output``, ``new_mem``). + + With the fields: + + - `encoder_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - `pooled_output` (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + - `new_mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and shape as [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocModel + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocModel.from_pretrained('ernie-doc-base-zh') + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + encoder_output = outputs[0] + pooled_output = outputs[1] + new_mem = outputs[2] + + """ + input_embeddings, position_embeddings, token_embeddings = self.embeddings( + input_ids, token_type_ids, position_ids + ) + + batch_size = input_embeddings.shape[0] + # [B, N, T, M + T] + n_head_self_attn_mask = self._create_n_head_attn_mask(attn_mask, batch_size) + # memories contain n_layer memory whose shape is [B, M, H] + encoder_output, new_mem = self.encoder( + enc_input=input_embeddings, + memories=memories, + rel_pos=position_embeddings, + rel_task=token_embeddings, + attn_mask=n_head_self_attn_mask, + ) + pooled_output = self.pooler(encoder_output) + return encoder_output, pooled_output, new_mem + + +class ErnieDocForSequenceClassification(ErnieDocPretrainedModel): + """ + ErnieDoc Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + ernie_doc (:class:`ErnieDocModel`): + An instance of :class:`ErnieDocModel`. + num_classes (int): + The number of classes. + dropout (float, optional) + The dropout ratio of last output. Default to `0.1`. + """ + + def __init__(self, ernie_doc, num_classes, dropout=0.1): + super(ErnieDocForSequenceClassification, self).__init__() + self.ernie_doc = ernie_doc + self.linear = nn.Linear(self.ernie_doc.config["hidden_size"], num_classes) + self.dropout = nn.Dropout(dropout, mode="upscale_in_train") + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocForSequenceClassification forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + See :class:`ErnieDocModel`. + memories (Tensor): + See :class:`ErnieDocModel`. + token_type_ids (Tensor): + See :class:`ErnieDocModel`. + position_ids (Tensor): + See :class:`ErnieDocModel`. + attn_mask (Tensor): + See :class:`ErnieDocModel`. + + Returns: + tuple : Returns tuple (`logits`, `mem`). + + With the fields: + + - `logits` (Tensor): + A tensor containing the [CLS] of hidden-states of the model at the output of last layer. + Each Tensor has a data type of `float32` and has a shape of [batch_size, num_classes]. + + - `mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and has a shape of + [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocForSequenceClassification + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocForSequenceClassification.from_pretrained('ernie-doc-base-zh', num_classes=2) + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + logits = outputs[0] + mem = outputs[1] + + """ + _, pooled_output, mem = self.ernie_doc(input_ids, memories, token_type_ids, position_ids, attn_mask) + pooled_output = self.dropout(pooled_output) + logits = self.linear(pooled_output) + return logits, mem + + +class ErnieDocForTokenClassification(ErnieDocPretrainedModel): + """ + ErnieDoc Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + ernie_doc (:class:`ErnieDocModel`): + An instance of :class:`ErnieDocModel`. + num_classes (int): + The number of classes. + dropout (float, optional) + The dropout ratio of last output. Default to 0.1. + """ + + def __init__(self, ernie_doc, num_classes, dropout=0.1): + super(ErnieDocForTokenClassification, self).__init__() + self.num_classes = num_classes + self.ernie_doc = ernie_doc # allow ernie_doc to be config + self.dropout = nn.Dropout(dropout, mode="upscale_in_train") + self.linear = nn.Linear(self.ernie_doc.config["hidden_size"], num_classes) + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocForTokenClassification forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + See :class:`ErnieDocModel`. + memories (Tensor): + See :class:`ErnieDocModel`. + token_type_ids (Tensor): + See :class:`ErnieDocModel`. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor): + See :class:`ErnieDocModel`. + attn_mask (Tensor): + See :class:`ErnieDocModel`. + + Returns: + tuple : Returns tuple (`logits`, `mem`). + + With the fields: + + - `logits` (Tensor): + A tensor containing the hidden-states of the model at the output of last layer. + Each Tensor has a data type of `float32` and has a shape of [batch_size, sequence_length, num_classes]. + + - `mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and has a shape of + [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocForTokenClassification + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocForTokenClassification.from_pretrained('ernie-doc-base-zh', num_classes=2) + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + logits = outputs[0] + mem = outputs[1] + + """ + sequence_output, _, mem = self.ernie_doc(input_ids, memories, token_type_ids, position_ids, attn_mask) + sequence_output = self.dropout(sequence_output) + logits = self.linear(sequence_output) + return logits, mem + + +class ErnieDocForQuestionAnswering(ErnieDocPretrainedModel): + """ + ErnieDoc Model with a linear layer on top of the hidden-states + output to compute `span_start_logits` and `span_end_logits`, + designed for question-answering tasks like SQuAD. + + Args: + ernie_doc (:class:`ErnieDocModel`): + An instance of :class:`ErnieDocModel`. + dropout (float, optional) + The dropout ratio of last output. Default to 0.1. + """ + + def __init__(self, ernie_doc, dropout=0.1): + super(ErnieDocForQuestionAnswering, self).__init__() + self.ernie_doc = ernie_doc # allow ernie_doc to be config + self.dropout = nn.Dropout(dropout, mode="upscale_in_train") + self.linear = nn.Linear(self.ernie_doc.config["hidden_size"], 2) + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocForQuestionAnswering forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + See :class:`ErnieDocModel`. + memories (Tensor): + See :class:`ErnieDocModel`. + token_type_ids (Tensor): + See :class:`ErnieDocModel`. + position_ids (Tensor): + See :class:`ErnieDocModel`. + attn_mask (Tensor): + See :class:`ErnieDocModel`. + + Returns: + tuple : Returns tuple (`start_logits`, `end_logits`, `mem`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and has a shape of + [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocForQuestionAnswering + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocForQuestionAnswering.from_pretrained('ernie-doc-base-zh') + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + start_logits = outputs[0] + end_logits = outputs[1] + mem = outputs[2] + + """ + sequence_output, _, mem = self.ernie_doc(input_ids, memories, token_type_ids, position_ids, attn_mask) + sequence_output = self.dropout(sequence_output) + logits = self.linear(sequence_output) + start_logits, end_logits = paddle.transpose(logits, perm=[2, 0, 1]) + return start_logits, end_logits, mem diff --git a/examples/text_classification/ernie_doc/predict.py b/examples/text_classification/ernie_doc/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9bfa6b626ac7db9f377b7bd75a9d2782556ba91c --- /dev/null +++ b/examples/text_classification/ernie_doc/predict.py @@ -0,0 +1,301 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
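+
+# predict.py runs long-document text classification inference with a pretrained or
+# fine-tuned ERNIE-Doc model. A long document is consumed as a sequence of fixed-length
+# segments: the `memories` returned for one segment are fed back in together with the
+# next segment, so information can flow across segment boundaries. Inference can run on
+# the dygraph model directly, or on a static-graph model exported via --static_mode.
+#
+# Typical recurrence over segments (sketch):
+#     memories = init_memory(batch_size, memory_len, hidden_size, num_hidden_layers)
+#     for segment in segments:
+#         logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask)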
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + ClassifierIterator, + HYPTextPreprocessor, + ImdbTextPreprocessor, + to_json_file, +) +from modeling import ErnieDocForSequenceClassification +from train import init_memory + +from paddlenlp.datasets import load_dataset +from paddlenlp.taskflow.utils import dygraph_mode_guard +from paddlenlp.transformers import ErnieDocBPETokenizer, ErnieDocTokenizer +from paddlenlp.utils.env import PPNLP_HOME +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=16, type=int, + help="Batch size per GPU/CPU for predicting (In static mode, it should be the same as in model training process.)") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", + help="Pretraining or finetuned model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, + help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], + help="Select cpu, gpu devices to train model.") +parser.add_argument("--test_results_file", default="./test_restuls.json", type=str, + help="The file path you would like to save the model outputs on test dataset.") +parser.add_argument("--static_mode", default=False, type=bool, + help="Whether you would like to perform predicting by static model or dynamic model.") +parser.add_argument("--dataset", default="iflytek", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, + help="The training dataset") +parser.add_argument("--static_path", default=None, type=str, + help="The path which your static model is at or where you want to save after converting.") + +args = parser.parse_args() +# yapf: enable + +DATASET_INFO = { + "imdb": (ErnieDocBPETokenizer, "test", ImdbTextPreprocessor()), + "hyp": (ErnieDocBPETokenizer, "test", HYPTextPreprocessor()), + "iflytek": (ErnieDocTokenizer, "test", None), + "thucnews": (ErnieDocTokenizer, "test", None), +} + + +def predict( + model, test_dataloader, file_path, memories, label_list, static_mode, input_handles=None, output_handles=None +): + label_dict = dict() + if not static_mode: + model.eval() + for _, batch in enumerate(test_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, _, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, qids])) + probs = nn.functional.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_list[i] for i in idx] + for i, qid in enumerate(qids.numpy().flatten()): + label_dict[str(qid)] = labels[i] + else: + for _, batch in enumerate(test_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, _, qids, gather_idxs, need_cal_loss = batch + input_handles[0].copy_from_cpu(input_ids.numpy()) + input_handles[1].copy_from_cpu(memories) + input_handles[2].copy_from_cpu(token_type_ids.numpy()) + input_handles[3].copy_from_cpu(position_ids.numpy()) + input_handles[4].copy_from_cpu(attn_mask.numpy()) + model.run() + logits = paddle.to_tensor(output_handles[0].copy_to_cpu()) + memories = 
paddle.to_tensor(output_handles[1].copy_to_cpu()) + logits, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, qids])) + probs = nn.functional.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_list[i] for i in idx] + for i, qid in enumerate(qids.numpy().flatten()): + label_dict[str(qid)] = labels[i] + to_json_file("iflytek", label_dict, file_path) + + +class LongDocClassifier: + def __init__( + self, + model_name_or_path, + trainer_num=1, + rank=0, + batch_size=16, + max_seq_length=512, + memory_len=128, + static_mode=False, + dataset="iflytek", + static_path=None, + ): + self.model_name_or_path = model_name_or_path + self.batch_size = batch_size + self.trainer_num = trainer_num + self.rank = rank + self.max_seq_length = max_seq_length + self.memory_len = memory_len + self.static_mode = static_mode + self.static_path = static_path if static_path else PPNLP_HOME + + tokenizer_class, test_name, preprocess_text_fn = DATASET_INFO[dataset] + self._construct_tokenizer(tokenizer_class) + self._input_preparation(args.dataset, test_name, preprocess_text_fn) + self._construct_model() + if static_mode: + logger.info("Loading the static model from {}".format(self.static_path)) + self._load_static_model() + + def _input_preparation(self, dataset="iflytek", test_name="test", preprocess_text_fn=None): + test_ds = load_dataset("clue", name=dataset, splits=[test_name]) + self.label_list = test_ds.label_list + self.num_classes = len(test_ds.label_list) + self.test_ds_iter = ClassifierIterator( + test_ds, + self.batch_size, + self._tokenizer, + self.trainer_num, + trainer_id=self.rank, + memory_len=self.memory_len, + max_seq_length=self.max_seq_length, + mode="eval", + preprocess_text_fn=preprocess_text_fn, + ) + self.test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + self.test_dataloader.set_batch_generator(self.test_ds_iter, paddle.get_device()) + + def _construct_tokenizer(self, tokenizer_class): + """ + Construct the tokenizer for the predictor. + :return: + """ + tokenizer_instance = tokenizer_class.from_pretrained(self.model_name_or_path) + self._tokenizer = tokenizer_instance + + def _construct_model(self): + """ + Construct the inference model for the predictor + :param model_name_or_path: str + :return: model instance + """ + model_instance = ErnieDocForSequenceClassification.from_pretrained( + self.model_name_or_path, num_classes=self.num_classes + ) + self.model_config = model_instance.ernie_doc.config + self._model = model_instance + + def _load_static_model(self, params_path=None): + """Load static model""" + inference_model_path = os.path.join(self.static_path, "static", "inference") + with dygraph_mode_guard(): + self._construct_model() + if params_path: + state_dict = paddle.load(params_path) + self._model.set_dict(state_dict) + self._construct_input_spec() + self._convert_dygraph_to_static() + + model_file = inference_model_path + ".pdmodel" + params_file = inference_model_path + ".pdiparams" + self._config = paddle.inference.Config(model_file, params_file) + + def _prepare_static_mode(self): + """ + Construct the input data and predictor in the PaddlePaddele static mode. 
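+        Assumes `_load_static_model` has already exported the static model and set
+        `self._config` to point at the resulting `inference.pdmodel` / `inference.pdiparams` files.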
+ """ + place = paddle.get_device() + if place == "cpu": + self._config.disable_gpu() + else: + self._config.enable_use_gpu(100) + self._config.switch_use_feed_fetch_ops(False) + self._config.disable_glog_info() + self.predictor = paddle.inference.create_predictor(self._config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = [self.predictor.get_output_handle(name) for name in self.predictor.get_output_names()] + + def _construct_input_spec(self): + """ + Construct the input spec for the convert dygraph model to static model. + """ + B, T, H, M, N = ( + self.batch_size, + self.max_seq_length, + self.model_config["hidden_size"], + self.memory_len, + self.model_config["num_hidden_layers"], + ) + self._input_spec = [ + paddle.static.InputSpec(shape=[B, T, 1], dtype="int64", name="input_ids"), # input_ids + paddle.static.InputSpec(shape=[N, B, M, H], dtype="float32", name="memories"), # memories + paddle.static.InputSpec(shape=[B, T, 1], dtype="int64", name="token_type_ids"), # token_type_ids + paddle.static.InputSpec(shape=[B, 2 * T + M, 1], dtype="int64", name="position_ids"), # position_ids + paddle.static.InputSpec(shape=[B, T, 1], dtype="float32", name="attn_mask"), # attn_mask + ] + + def _convert_dygraph_to_static(self): + """ + Convert the dygraph model to static model. + """ + assert ( + self._model is not None + ), "The dygraph model must be created before converting the dygraph model to static model." + assert ( + self._input_spec is not None + ), "The input spec must be created before converting the dygraph model to static model." + logger.info("Converting to the inference model cost a little time.") + static_model = paddle.jit.to_static(self._model, input_spec=self._input_spec) + save_path = os.path.join(self.static_path, "static", "inference") + paddle.jit.save(static_model, save_path) + logger.info("The inference model save in the path:{}".format(save_path)) + + def run_model(self, saved_path): + if not self.static_mode: + create_memory = partial( + init_memory, + self.batch_size, + self.memory_len, + self.model_config["hidden_size"], + self.model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + else: + memories = np.zeros( + [ + self.model_config["num_hidden_layers"], + self.batch_size, + self.memory_len, + self.model_config["hidden_size"], + ], + dtype="float32", + ) + file_path = saved_path + if not self.static_mode: + self.input_handles, self.output_handle, self.predictor = None, None, self._model + else: + self._prepare_static_mode() + predict( + self.predictor, + self.test_dataloader, + file_path, + memories, + self.label_list, + self.static_mode, + self.input_handles, + self.output_handle, + ) + + +def do_predict(args): + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + predictor = LongDocClassifier( + model_name_or_path=args.model_name_or_path, + rank=rank, + trainer_num=trainer_num, + batch_size=args.batch_size, + max_seq_length=args.max_seq_length, + memory_len=args.memory_length, + static_mode=args.static_mode, + static_path=args.static_path, + ) + predictor.run_model(saved_path=args.test_results_file) + + +if __name__ == "__main__": + do_predict(args) diff --git 
a/examples/text_classification/ernie_doc/train.py b/examples/text_classification/ernie_doc/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ea527016a9978312ef47771e508e55705337d871 --- /dev/null +++ b/examples/text_classification/ernie_doc/train.py @@ -0,0 +1,345 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + ClassifierIterator, + HYPTextPreprocessor, + ImdbTextPreprocessor, + to_json_file, +) +from metrics import F1 +from modeling import ErnieDocForSequenceClassification +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocBPETokenizer, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", + help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, + help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=1.5e-4, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], + help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, + help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="iflytek", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, + help="The training dataset") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", ) +parser.add_argument("--test_results_file", default="./test_restuls.json", type=str, + help="The file path you would like to save the model outputs on test dataset.") + +args = parser.parse_args() +# yapf: enable + +DATASET_INFO = { + "imdb": (ErnieDocBPETokenizer, "test", "test", ImdbTextPreprocessor(), Accuracy()), + "hyp": (ErnieDocBPETokenizer, "dev", "test", HYPTextPreprocessor(), F1()), + "iflytek": (ErnieDocTokenizer, "dev", "test", None, Accuracy()), + "thucnews": (ErnieDocTokenizer, "dev", "test", None, Accuracy()), +} + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return paddle.zeros([n_layers, batch_size, memory_length, d_model], dtype="float32") + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories): + model.eval() + losses = [] + # copy the memory + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids])) + # Need to collect probs for each qid, so use softmax_with_cross_entropy + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids.flatten()): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. 
+ + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + metric.reset() + model.train() + return acc_or_f1 + + +def predict(model, test_dataloader, file_path, memories, label_list): + label_dict = dict() + model.eval() + for _, batch in enumerate(test_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, _, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, qids])) + probs = nn.functional.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_list[i] for i in idx] + for i, qid in enumerate(qids.numpy().flatten()): + label_dict[str(qid)] = labels[i] + to_json_file("iflytek", label_dict, file_path) + + +def do_train(args): + set_seed(args) + + tokenizer_class, eval_name, test_name, preprocess_text_fn, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + train_ds, eval_ds, test_ds = load_dataset("clue", name=args.dataset, splits=["train", eval_name, test_name]) + num_classes = len(train_ds.label_list) + + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = ClassifierIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + preprocess_text_fn=preprocess_text_fn, + ) + eval_ds_iter = ClassifierIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + preprocess_text_fn=preprocess_text_fn, + ) + test_ds_iter = ClassifierIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + preprocess_text_fn=preprocess_text_fn, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = 
paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 0 + best_acc = -1 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idx, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + loss = criterion(logits, labels) * need_cal_loss + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps 
>= args.max_steps: + stop_training = True + break + if stop_training: + break + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory()) + logger.info("start predict the test data") + + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + predict(model, test_dataloader, args.file_path, memories, test_ds.label_list) + logger.info("Done Predicting the results has been saved in file: {}".format(args.file_path)) + + +if __name__ == "__main__": + do_train(args) diff --git a/examples/text_classification/pretrained_models/README.md b/examples/text_classification/pretrained_models/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d1351d7803884389302f6a06c4a3af3b2cb22683 --- /dev/null +++ b/examples/text_classification/pretrained_models/README.md @@ -0,0 +1 @@ +[pretrained_models](../../../applications/text_classification) diff --git a/examples/text_classification/rnn/README.md b/examples/text_classification/rnn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e87390a5b1c911e9cfb218470336b7e77be1d3bd --- /dev/null +++ b/examples/text_classification/rnn/README.md @@ -0,0 +1,316 @@ +# 使用传统Recurrent Neural Networks完成中文文本分类任务 + +文本分类是NLP应用最广的任务之一,可以被应用到多个领域中,包括但不仅限于:情感分析、垃圾邮件识别、商品评价分类... + +情感分析是一个自然语言处理中老生常谈的任务。情感分析的目的是为了找出说话者/作者在某些话题上,或者针对一个文本两极的观点的态度。这个态度或许是他或她的个人判断或是评估,也许是他当时的情感状态(就是说,作者在做出这个言论时的情绪状态),或是作者有意向的情感交流(就是作者想要读者所体验的情绪)。其可以用于数据挖掘、Web 挖掘、文本挖掘和信息检索方面得到了广泛的研究。可通过 [AI开放平台-情感倾向分析](http://ai.baidu.com/tech/nlp_apply/sentiment_classify) 线上体验。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/febb8a1478e34258953e56611ddc76cd20b412fec89845b0a4a2e6b9f8aae774" hspace='10'/> <br /> +</p> + +本项目开源了一系列模型用于进行文本建模,用户可通过参数配置灵活使用。效果上,我们基于开源情感倾向分类数据集ChnSentiCorp对多个模型进行评测。 + +## paddlenlp.seq2vec + +情感分析任务中关键技术是如何将文本表示成一个**携带语义的文本向量**。随着深度学习技术的快速发展,目前常用的文本表示技术有LSTM,GRU,RNN等方法。 +PaddleNLP提供了一系列的文本表示技术,如`seq2vec`模块。 + +[`paddlenlp.seq2vec`](../../../paddlenlp/seq2vec) 模块作用为将输入的序列文本表征成一个语义向量。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/bbf00931c7534ab48a5e7dff5fbc2ba3ff8d459940434628ad21e9195da5d4c6" width = "500" height = "200" hspace='10'/> <br /> +</p> + + +## 模型简介 + +本项目通过调用[seq2vec](../../../paddlenlp/seq2vec/)中内置的模型进行序列建模,完成句子的向量表示。包含最简单的词袋模型和一系列经典的RNN类模型。 + +`seq2vec`模块 + +* 功能是将序列Embedding Tensor(shape是(batch_size, num_token, emb_dim) )转化成文本语义表征Enocded Texts Tensor(shape 是(batch_sie,encoding_size)) +* 提供了`BoWEncoder`,`CNNEncoder`,`GRUEncoder`,`LSTMEncoder`,`RNNEncoder`等模型 + - `BoWEncoder` 是将输入序列Embedding Tensor在num_token维度上叠加,得到文本语义表征Enocded Texts Tensor。 + - `CNNEncoder` 是将输入序列Embedding Tensor进行卷积操作,在对卷积结果进行max_pooling,得到文本语义表征Enocded Texts Tensor。 + - `GRUEncoder` 是对输入序列Embedding Tensor进行GRU运算,在运算结果上进行pooling或者取最后一个step的隐表示,得到文本语义表征Enocded Texts Tensor。 + - `LSTMEncoder` 是对输入序列Embedding Tensor进行LSTM运算,在运算结果上进行pooling或者取最后一个step的隐表示,得到文本语义表征Enocded Texts Tensor。 + - `RNNEncoder` 是对输入序列Embedding Tensor进行RNN运算,在运算结果上进行pooling或者取最后一个step的隐表示,得到文本语义表征Enocded Texts Tensor。 + + +`seq2vec`提供了许多语义表征方法,那么这些方法在什么时候更加适合呢? 
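+在逐一比较之前,先给出一个最小的组网示例,展示 Encoder 在分类网络中的位置(以下词表大小、向量维度等均为演示用的假设值,完整实现请参考本目录下的 `model.py`):
+
+```python
+import paddle
+import paddle.nn as nn
+import paddlenlp as nlp
+
+class BoWTextClassifier(nn.Layer):
+    """演示用的最小网络:Embedding -> BoWEncoder -> 线性分类层。"""
+    def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0):
+        super().__init__()
+        self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx)
+        # BoWEncoder 在 num_tokens 维度上将词向量求和,得到句子表示
+        self.encoder = nlp.seq2vec.BoWEncoder(emb_dim)
+        self.fc = nn.Linear(self.encoder.get_output_dim(), num_classes)
+
+    def forward(self, token_ids):
+        embedded = self.embedder(token_ids)  # [batch_size, num_tokens, emb_dim]
+        text_repr = self.encoder(embedded)   # [batch_size, emb_dim]
+        return self.fc(text_repr)            # [batch_size, num_classes]
+
+model = BoWTextClassifier(vocab_size=1000, num_classes=2)
+logits = model(paddle.randint(low=0, high=1000, shape=[4, 16]))  # 随机 token id,仅作演示
+```
+
+把示例中的 `BoWEncoder` 替换为 `LSTMEncoder`、`GRUEncoder` 等即可得到其余模型(RNN 类 Encoder 还需要额外传入 `sequence_length`)。下面分别来看各个 Encoder 的特点与适用场景: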
+ +* `BoWEncoder`采用Bag of Word Embedding方法,其特点是简单。但其缺点是没有考虑文本的语境,所以对文本语义的表征不足以表意。 + +* `CNNEncoder`采用卷积操作,提取局部特征,其特点是可以共享权重。但其缺点同样只考虑了局部语义,上下文信息没有充分利用。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/2b2498edd83e49d3b017c4a14e1be68506349249b8a24cdaa214755fb51eadcd" width = "300" height = "150" hspace='10'/> <br /> +</p> + +* `RNNEnocder`采用RNN方法,在计算下一个token语义信息时,利用上一个token语义信息作为其输入。但其缺点容易产生梯度消失和梯度爆炸。 + +<p align="center"> +<img src="http://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png" width = "50%" height = "30%" hspace='10'/> <br /> +</p> + +* `LSTMEnocder`采用LSTM方法,LSTM是RNN的一种变种。为了学到长期依赖关系,LSTM 中引入了门控机制来控制信息的累计速度, + 包括有选择地加入新的信息,并有选择地遗忘之前累计的信息。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/a5af1d93c69f422d963e094397a2f6ce978c30a26ab6480ab70d688dd1929de0" width = "50%" height = "30%" hspace='10'/> <br /> +</p> + +* `GRUEncoder`采用GRU方法,GRU也是RNN的一种变种。一个LSTM单元有四个输入 ,因而参数是RNN的四倍,带来的结果是训练速度慢。 + GRU对LSTM进行了简化,在不影响效果的前提下加快了训练速度。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/fc848bc2cb494b40ae42af892b756f5888770320a1fa42348cec10d3df64ee2f" width = "40%" height = "25%" hspace='10'/> <br /> +</p> + + +| 模型 | 模型介绍 | +| ------------------------------------------------ | ------------------------------------------------------------ | +| BOW(Bag Of Words) | 非序列模型,将句子表示为其所包含词的向量的加和 | +| RNN (Recurrent Neural Network) | 序列模型,能够有效地处理序列信息 | +| GRU(Gated Recurrent Unit) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | +| LSTM(Long Short Term Memory) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | +| Bi-LSTM(Bidirectional Long Short Term Memory) | 序列模型,采用双向LSTM结构,更好地捕获句子中的语义特征 | +| Bi-GRU(Bidirectional Gated Recurrent Unit) | 序列模型,采用双向GRU结构,更好地捕获句子中的语义特征 | +| Bi-RNN(Bidirectional Recurrent Neural Network) | 序列模型,采用双向RNN结构,更好地捕获句子中的语义特征 | +| Bi-LSTM Attention | 序列模型,在双向LSTM结构之上加入Attention机制,结合上下文更好地表征句子语义特征 | +| TextCNN | 序列模型,使用多种卷积核大小,提取局部区域地特征 | + + +| 模型 | dev acc | test acc | +| ---- | ------- | -------- | +| BoW | 0.8970 | 0.8908 | +| Bi-LSTM | 0.9098 | 0.8983 | +| Bi-GRU | 0.9014 | 0.8785 | +| Bi-RNN | 0.8649 | 0.8504 | +| Bi-LSTM Attention | 0.8992 | 0.8856 | +| TextCNN | 0.9102 | 0.9107 | + + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/ecf309c20e5347399c55f1e067821daa088842fa46ad49be90de4933753cd3cf" width = "600" height = "200" hspace='10'/> <br /> +</p> + + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +rnn/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # 模型组网脚本 +├── predict.py # 模型预测 +├── utils.py # 数据处理工具 +├── train.py # 训练模型主程序入口,包括训练、评估 +└── README.md # 文档说明 +``` + +### 数据准备 + +#### 使用PaddleNLP内置数据集 + +```python +from paddlenlp.datasets import load_dataset + +train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"]) +``` + + +### 模型训练 + +我们以中文情感分类公开数据集ChnSentiCorp为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证 + +CPU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=cpu \ + --network=bilstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +GPU 启动: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --vocab_path='./vocab.json' \ + --device=gpu \ + --network=bilstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +XPU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=xpu \ + --network=lstm \ + 
--lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +MLU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=mlu \ + --network=lstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +Ascend NPU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=npu \ + --network=bow \ + --lr=5e-4 \ + --batch_size=32 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +以上参数表示: + +* `vocab_path`: 用于保存根据语料库构建的词汇表的文件路径。 +* `device`: 选用什么设备进行训练,可选cpu、gpu、xpu、mlu或者npu。如使用gpu训练则参数gpus指定GPU卡号。目前xpu只支持模型网络设置为lstm,npu只支持模型网络设置为bow。 +* `network`: 模型网络名称,默认为`bilstm`, 可更换为bilstm,bigru,birnn,bow,lstm,rnn,gru,bilstm_attn,cnn等。 +* `lr`: 学习率, 默认为5e-5。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为10。 +* `save_dir`: 训练保存模型的文件路径。 +* `init_from_ckpt`: 恢复模型训练的断点路径。 + + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... +└── final.pdparams +``` + +**NOTE:** + +* 训练脚本中停用词`stopwords`仅仅是示例作用,具体停用词使用需要根据实际应用数据进行选择。 + +* 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 +* 使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + 运行方式: + +```shell +python export_model.py --vocab_path=./vocab.json --network=bilstm --params_path=./checkpoints/final.pdparams --output_path=./static_graph_params +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +导出模型之后,可以用于部署,deploy/python/predict.py文件提供了python部署预测示例。运行方式: + +```shell +python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams --network=bilstm +``` + +### 模型预测 + +启动预测: + +CPU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=cpu \ + --network=bilstm \ + --params_path=checkpoints/final.pdparams +``` + +GPU启动: + +```shell +export CUDA_VISIBLE_DEVICES=0 +python predict.py --vocab_path='./vocab.json' \ + --device=gpu \ + --network=bilstm \ + --params_path='./checkpoints/final.pdparams' +``` + +XPU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=xpu \ + --network=lstm \ + --params_path=checkpoints/final.pdparams +``` + +MLU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=mlu \ + --network=lstm \ + --params_path=checkpoints/final.pdparams +``` + +Ascend NPU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=npu \ + --network=bow \ + --params_path=checkpoints/final.pdparams +``` + +将待预测数据分词完毕后,如以下示例: + +```text +非常不错,服务很好,位于市中心区,交通方便,不过价格也高! +怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 +作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 +``` + +处理成模型所需的`Tensor`,如可以直接调用`preprocess_prediction_data`函数既可处理完毕。之后传入`predict`函数即可输出预测结果。 + +如 + +```text +Data: 非常不错,服务很好,位于市中心区,交通方便,不过价格也高! 
Label: negative +Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 Label: negative +Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 Label: positive +``` + +## Reference + +关于LSTM、GRU、CNN更多信息参考: + +- https://canvas.stanford.edu/files/1090785/download +- https://colah.github.io/posts/2015-08-Understanding-LSTMs/ +- https://arxiv.org/abs/1412.3555 +- https://arxiv.org/pdf/1506.00019 +- https://arxiv.org/abs/1404.2188 diff --git a/examples/text_classification/rnn/deploy/python/predict.py b/examples/text_classification/rnn/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..bc14155c44b53f2945e9533e4e36a86c7a9774d0 --- /dev/null +++ b/examples/text_classification/rnn/deploy/python/predict.py @@ -0,0 +1,137 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +import paddle +from scipy.special import softmax + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_file", type=str, required=True, default='./static_graph_params.pdmodel', help="The path to model info in static graph.") +parser.add_argument("--params_file", type=str, required=True, default='./static_graph_params.pdiparams', help="The path to parameters in static graph.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn', 'textcnn'], default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to save vocabulary.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def preprocess_prediction_data(text, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + text (obj:`str`): The input text. + tokenizer(obj: `paddlenlp.data.JiebaTokenizer`): It use jieba to cut the chinese string. + + Returns: + input_ids (obj: `list[int]`): The word ids of the `text`. + seq_len (obj: `int`): The length of words. 
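+        Example:
+            input_id, seq_len = preprocess_prediction_data("这家酒店的位置很方便", tokenizer)
+            # `input_id` is the list of token ids produced by the Jieba tokenizer,
+            # and `seq_len` is the number of tokens.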
+ """ + input_id = tokenizer.encode(text) + seq_len = len(input_id) + return input_id, seq_len + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, label_map, batch_size=1, network="bilstm"): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from + :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. + Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text in data: + input_id, seq_len = preprocess_prediction_data(text, tokenizer) + examples.append((input_id, seq_len)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab.token_to_idx.get("[PAD]", 0)), Stack() # input_id # seq_len + ): fn(samples) + + # Separates data into some batches. + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + results = [] + for batch in batches: + input_ids, seq_lens = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(input_ids) + if network in ["lstm", "bilstm", "gru", "bigru", "rnn", "birnn", "bilstm_attn"]: + self.input_handles[1].copy_from_cpu(seq_lens) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + probs = softmax(logits, axis=1) + print(probs) + idx = np.argmax(probs, axis=1) + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_length) + + # Firstly pre-processing prediction data and then do predict. 
+ data = [ + "非常不错,服务很好,位于市中心区,交通方便,不过价格也高!", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + vocab = Vocab.from_json(args.vocab_path) + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "negative", 1: "positive"} + + results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size, network=args.network) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_classification/rnn/export_model.py b/examples/text_classification/rnn/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..37fbb37ddcb3eb99b9d360c96ed70127759f6e75 --- /dev/null +++ b/examples/text_classification/rnn/export_model.py @@ -0,0 +1,99 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from model import ( + BiLSTMAttentionModel, + BoWModel, + CNNModel, + GRUModel, + LSTMModel, + RNNModel, + SelfInteractiveAttention, +) + +from paddlenlp.data import Vocab + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to vocabulary.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'], default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# fmt: on + + +def main(): + # Load vocab. + vocab = Vocab.from_json(args.vocab_path) + label_map = {0: "negative", 1: "positive"} + + # Constructs the newtork. 
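+    # The --network choice is mapped onto the model classes imported from model.py below;
+    # recurrent variants (the lstm/gru/rnn families and bilstm_attn) later receive an
+    # extra seq_len InputSpec when the model is converted to static graph.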
+ network = args.network.lower() + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + if network == "bow": + model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "bigru": + model = GRUModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm": + model = LSTMModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm_attn": + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + elif network == "birnn": + model = RNNModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "cnn": + model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "gru": + model = GRUModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "lstm": + model = LSTMModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "rnn": + model = RNNModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + else: + raise ValueError( + "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn." + % network + ) + + # Load model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + model.eval() + + inputs = [paddle.static.InputSpec(shape=[None, None], dtype="int64")] + # Convert to static graph with specific input description + if args.network in ["lstm", "bilstm", "gru", "bigru", "rnn", "birnn", "bilstm_attn"]: + inputs.append(paddle.static.InputSpec(shape=[None], dtype="int64")) # seq_len + + model = paddle.jit.to_static(model, input_spec=inputs) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/text_classification/rnn/model.py b/examples/text_classification/rnn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..7d2e4950db0bb6efefd247ed36c9b042bdd1811c --- /dev/null +++ b/examples/text_classification/rnn/model.py @@ -0,0 +1,403 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import paddlenlp as nlp + +INF = 1.0 * 1e12 + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. 
+ Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, hidden_size=128, fc_hidden_size=96): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) + self.fc1 = nn.Linear(self.bow_encoder.get_output_dim(), hidden_size) + self.fc2 = nn.Linear(hidden_size, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + + # Shape: (batch_size, embedding_dim) + summed = self.bow_encoder(embedded_text) + encoded_text = paddle.tanh(summed) + + # Shape: (batch_size, hidden_size) + fc1_out = paddle.tanh(self.fc1(encoded_text)) + # Shape: (batch_size, fc_hidden_size) + fc2_out = paddle.tanh(self.fc2(fc1_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc2_out) + return logits + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=198, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.lstm_encoder = nlp.seq2vec.LSTMEncoder( + emb_dim, + lstm_hidden_size, + num_layers=lstm_layers, + direction=direction, + dropout=dropout_rate, + pooling_type=pooling_type, + ) + self.fc = nn.Linear(self.lstm_encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*lstm_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(text_repr)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class GRUModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + gru_hidden_size=198, + direction="forward", + gru_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.gru_encoder = nlp.seq2vec.GRUEncoder( + emb_dim, + gru_hidden_size, + num_layers=gru_layers, + direction=direction, + dropout=dropout_rate, + pooling_type=pooling_type, + ) + self.fc = nn.Linear(self.gru_encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*gru_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + text_repr = self.gru_encoder(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(text_repr)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return 
logits + + +class RNNModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + rnn_hidden_size=198, + direction="forward", + rnn_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.rnn_encoder = nlp.seq2vec.RNNEncoder( + emb_dim, + rnn_hidden_size, + num_layers=rnn_layers, + direction=direction, + dropout=dropout_rate, + pooling_type=pooling_type, + ) + self.fc = nn.Linear(self.rnn_encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*rnn_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + text_repr = self.rnn_encoder(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(text_repr)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class BiLSTMAttentionModel(nn.Layer): + def __init__( + self, + attention_layer, + vocab_size, + num_classes, + emb_dim=128, + lstm_hidden_size=196, + fc_hidden_size=96, + lstm_layers=1, + dropout_rate=0.0, + padding_idx=0, + ): + super().__init__() + self.padding_idx = padding_idx + + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.bilstm = nn.LSTM( + input_size=emb_dim, + hidden_size=lstm_hidden_size, + num_layers=lstm_layers, + dropout=dropout_rate, + direction="bidirect", + ) + self.attention = attention_layer + if isinstance(attention_layer, SelfAttention): + self.fc = nn.Linear(lstm_hidden_size, fc_hidden_size) + elif isinstance(attention_layer, SelfInteractiveAttention): + self.fc = nn.Linear(lstm_hidden_size * 2, fc_hidden_size) + else: + raise RuntimeError("Unknown attention type %s." % attention_layer.__class__.__name__) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + mask = text != self.padding_idx + embedded_text = self.embedder(text) + # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) + encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, lstm_hidden_size) + hidden, att_weights = self.attention(encoded_text, mask) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(hidden)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class SelfAttention(nn.Layer): + """ + A close implementation of attention network of ACL 2016 paper, + Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016). + ref: https://www.aclweb.org/anthology/P16-2034/ + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.hidden_size = hidden_size + self.att_weight = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. 
+ mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. + Defaults to `None`. + """ + forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2) + # elementwise-sum forward_x and backward_x + # Shape: (batch_size, max_seq_len, hidden_size) + h = paddle.add_n([forward_input, backward_input]) + # Shape: (batch_size, hidden_size, 1) + att_weight = self.att_weight.tile(repeat_times=(paddle.shape(h)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, 1) + att_score = paddle.bmm(paddle.tanh(h), att_weight) + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=mask.shape, dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + # Shape: (batch_size, max_seq_len, 1) + att_weight = F.softmax(att_score, axis=1) + # Shape: (batch_size, lstm_hidden_size) + reps = paddle.bmm(h.transpose(perm=(0, 2, 1)), att_weight).squeeze(axis=-1) + reps = paddle.tanh(reps) + return reps, att_weight + + +class SelfInteractiveAttention(nn.Layer): + """ + A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016). + ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.input_weight = self.create_parameter(shape=[1, hidden_size, hidden_size], dtype="float32") + self.bias = self.create_parameter(shape=[1, 1, hidden_size], dtype="float32") + self.att_context_vector = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. + mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. + Defaults to `None + """ + weight = self.input_weight.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) + bias = self.bias.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, hidden_size) + word_squish = paddle.bmm(input, weight) + bias + + att_context_vector = self.att_context_vector.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, 1) + att_score = paddle.bmm(word_squish, att_context_vector) + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=paddle.shape(mask), dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + att_weight = F.softmax(att_score, axis=1) + + # Shape: (batch_size, hidden_size) + reps = paddle.bmm(input.transpose(perm=(0, 2, 1)), att_weight).squeeze(-1) + return reps, att_weight + + +class CNNModel(nn.Layer): + """ + This class implements the Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. 
Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=128, + ngram_filter_sizes=(3,), + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, len(ngram_filter_sizes)*num_filter) + encoder_out = self.encoder(embedded_text) + encoder_out = paddle.tanh(encoder_out) + # Shape: (batch_size, fc_hidden_size) + fc_out = self.fc(encoder_out) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class TextCNNModel(nn.Layer): + """ + This class implements the Text Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=128, + ngram_filter_sizes=(1, 2, 3), + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, len(ngram_filter_sizes)*num_filter) + encoder_out = self.encoder(embedded_text) + encoder_out = paddle.tanh(encoder_out) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(encoder_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits diff --git a/examples/text_classification/rnn/predict.py b/examples/text_classification/rnn/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..843884fd19b30ff7c73b8332bbffcae9659f6dd3 --- /dev/null +++ b/examples/text_classification/rnn/predict.py @@ -0,0 +1,147 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import paddle +import paddle.nn.functional as F +from model import ( + BiLSTMAttentionModel, + BoWModel, + CNNModel, + GRUModel, + LSTMModel, + RNNModel, + SelfInteractiveAttention, +) +from utils import preprocess_prediction_data + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu', 'mlu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to vocabulary.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'], + default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # input_ids + Stack(dtype="int64"), # seq len + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + texts, seq_lens = batchify_fn(batch) + texts = paddle.to_tensor(texts) + seq_lens = paddle.to_tensor(seq_lens) + logits = model(texts, seq_lens) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device.lower()) + + # Loads vocab. + vocab = Vocab.from_json(args.vocab_path) + label_map = {0: "negative", 1: "positive"} + + # Constructs the newtork. 
+ network = args.network.lower() + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + if network == "bow": + model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "bigru": + model = GRUModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm": + model = LSTMModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm_attn": + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + elif network == "birnn": + model = RNNModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "cnn": + model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "gru": + model = GRUModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "lstm": + model = LSTMModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "rnn": + model = RNNModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + else: + raise ValueError( + "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn." + % network + ) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. + data = [ + "非常不错,服务很好,位于市中心区,交通方便,不过价格也高!", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + tokenizer = JiebaTokenizer(vocab) + examples = preprocess_prediction_data(data, tokenizer) + + results = predict( + model, + examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + ) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_classification/rnn/train.py b/examples/text_classification/rnn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..1c4c64890131682e03ce237f990dd70a14935398 --- /dev/null +++ b/examples/text_classification/rnn/train.py @@ -0,0 +1,174 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
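+# Typical single-card invocation (flag names follow the argparse definitions below;
+# the values shown are only illustrative defaults):
+#   python train.py --device gpu --network bilstm --lr 5e-5 --batch_size 64 --epochs 15 \
+#       --vocab_path ./vocab.json --save_dir ./checkpoints/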
+import argparse +import random +from functools import partial + +import numpy as np +import paddle +from model import ( + BiLSTMAttentionModel, + BoWModel, + CNNModel, + GRUModel, + LSTMModel, + RNNModel, + SelfInteractiveAttention, +) +from utils import build_vocab, convert_example + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=15, help="Number of epoches for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'mlu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to save vocabulary.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'], + default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed=1000): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + set_seed(1000) + + # Loads dataset. + train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) + texts = [] + for data in train_ds: + texts.append(data["text"]) + for data in dev_ds: + texts.append(data["text"]) + + # Reads stop words. + # Stopwords are just for example. + # It should be updated according to the corpus. + stopwords = set(["的", "吗", "吧", "呀", "呜", "呢", "呗"]) + # Builds vocab. + word2idx = build_vocab(texts, stopwords, min_freq=5, unk_token="[UNK]", pad_token="[PAD]") + vocab = Vocab.from_dict(word2idx, unk_token="[UNK]", pad_token="[PAD]") + # Saves vocab. 
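+    # The JSON vocabulary written here (default ./vocab.json) is reloaded later by
+    # predict.py and export_model.py through their --vocab_path arguments.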
+ vocab.to_json(args.vocab_path) + + # Constructs the network. + network = args.network.lower() + vocab_size = len(vocab) + num_classes = len(train_ds.label_list) + pad_token_id = vocab.to_indices("[PAD]") + if network == "bow": + model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "bigru": + model = GRUModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm": + model = LSTMModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm_attn": + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + elif network == "birnn": + model = RNNModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "cnn": + model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "gru": + model = GRUModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "lstm": + model = LSTMModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "rnn": + model = RNNModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + else: + raise ValueError( + "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn." + % network + ) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + tokenizer = JiebaTokenizer(vocab) + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # input_ids + Stack(dtype="int64"), # seq len + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) diff --git a/examples/text_classification/rnn/utils.py b/examples/text_classification/rnn/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c33d521e4e11f8bcf2ba474881661f3dbf299da7 --- /dev/null +++ b/examples/text_classification/rnn/utils.py @@ -0,0 +1,109 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from collections import defaultdict + +import numpy as np + +from paddlenlp import Taskflow + +word_segmenter = Taskflow("word_segmentation", mode="fast") + + +def convert_example(example, tokenizer, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + input_ids = tokenizer.encode(example["text"]) + valid_length = np.array(len(input_ids), dtype="int64") + input_ids = np.array(input_ids, dtype="int64") + + if not is_test: + label = np.array(example["label"], dtype="int64") + return input_ids, valid_length, label + else: + return input_ids, valid_length + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[str]`): The prediction data whose each element is a tokenized text. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + + """ + examples = [] + for text in data: + ids = tokenizer.encode(text) + examples.append([ids, len(ids)]) + return examples + + +def build_vocab(texts, stopwords=[], num_words=None, min_freq=10, unk_token="[UNK]", pad_token="[PAD]"): + """ + According to the texts, it is to build vocabulary. + + Args: + texts (obj:`List[str]`): The raw corpus data. + num_words (obj:`int`): the maximum size of vocabulary. + stopwords (obj:`List[str]`): The list where each element is a word that will be + filtered from the texts. + min_freq (obj:`int`): the minimum word frequency of words to be kept. + unk_token (obj:`str`): Special token for unknow token. + pad_token (obj:`str`): Special token for padding token. + + Returns: + word_index (obj:`Dict`): The vocabulary from the corpus data. + + """ + word_counts = defaultdict(int) + for text in texts: + if not text: + continue + for word in word_segmenter(text): + if word in stopwords: + continue + word_counts[word] += 1 + + wcounts = [] + for word, count in word_counts.items(): + if count < min_freq: + continue + wcounts.append((word, count)) + wcounts.sort(key=lambda x: x[1], reverse=True) + # -2 for the pad_token and unk_token which will be added to vocab. 
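+    # For example, with num_words=50002 the 50000 most frequent remaining words are kept,
+    # leaving two slots for pad_token and unk_token below.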
+ if num_words is not None and len(wcounts) > (num_words - 2): + wcounts = wcounts[: (num_words - 2)] + # add the special pad_token and unk_token to the vocabulary + sorted_voc = [pad_token, unk_token] + sorted_voc.extend(wc[0] for wc in wcounts) + word_index = dict(zip(sorted_voc, list(range(len(sorted_voc))))) + return word_index diff --git a/examples/text_correction/ernie-csc/README.md b/examples/text_correction/ernie-csc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4ff3a8da14417d847c816f0e5d3b85f94de08dde --- /dev/null +++ b/examples/text_correction/ernie-csc/README.md @@ -0,0 +1,163 @@ +# ERNIE for Chinese Spelling Correction + +## 简介 + +中文文本纠错任务是一项NLP基础任务,其输入是一个可能含有语法错误的中文句子,输出是一个正确的中文句子。语法错误类型很多,有多字、少字、错别字等,目前最常见的错误类型是`错别字`。大部分研究工作围绕错别字这一类型进行研究。本文实现了百度在ACL 2021上提出结合拼音特征的Softmask策略的中文错别字纠错的下游任务网络,并提供预训练模型,模型结构如下: + +![image](https://user-images.githubusercontent.com/10826371/131974040-fc84ec04-566f-4310-9839-862bfb27172e.png) + +以下是本项目的简要目录结构及说明: + +```text +. +├── README.md # 文档 +├── download.py # 下载SIGHAN测试集 +├── pinyin_vocab.txt # 拼音字表 +├── predict.py # 预测标准输入的句子 +├── predict_sighan.py # 生成SIGHAN测试集的预测结果 +├── model.py # 纠错模型实现 +├── requirements.txt # 本项目的Python依赖项 +├── run_sighan_predict.sh # 生成训练后模型在SIGHAN测试集的预测结果并输出预测效果 +├── sighan_evaluate.py # 评估模型在SIGHAN测试集上预测效果 +├── train.py # 训练脚本 +└── utils.py # 通用函数工具 +``` + +* 注:论文中暂未开源融合字音特征的预训练模型参数(即MLM-phonetics),所以本文提供的纠错模型是在ERNIE-1.0的参数上进行Finetune,纠错模型结构与论文保持一致。 + +## 安装依赖项 +``` +pip install -r requirements.txt +``` + +## 模型训练 + +### 参数 +- `model_name_or_path` 目前支持的预训练模型有:"ernie-1.0-base-zh"。 +- `max_seq_length` 表示最大句子长度,超过该长度的部分将被切分成下一个样本。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `seed` 表示随机数种子。 +- `weight_decay` 表示AdamW的权重衰减系数。 +- `warmup_proportion` 表示学习率warmup系数。 +- `pinyin_vocab_file_path` 拼音字表路径。默认为当前目录下的`pinyin_vocab.txt`文件。 +- `extra_train_ds_dir` 额外纠错训练集目录。用户可在该目录下提供文件名以`txt`为后缀的纠错数据集文件,以增大训练样本。默认为None。 + +### 训练数据 + +该模型在SIGHAN简体版数据集以及[Automatic Corpus Generation生成的中文纠错数据集](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml)上进行Finetune训练。PaddleNLP已经集成SIGHAN简体版数据集,以下将介绍如何使用Automatic Corpus Generation生成的中文纠错数据集。 + +#### 下载数据集 + +Automatic Corpus Generation生成的中文纠错数据集比较大,下载时间比较长,请耐心等候。运行以下命令完成数据集下载: + +``` +python download.py --data_dir ./extra_train_ds/ --url https://github.com/wdimmy/Automatic-Corpus-Generation/raw/master/corpus/train.sgml +``` + +#### 预处理数据集 + +训练脚本要求训练集文件内容以句子对形式呈现,这里提供一个转换脚本,将Automatic Corpus Generation提供的XML文件转换成句子对形式的文件,运行以下命令: + +``` +python change_sgml_to_txt.py -i extra_train_ds/train.sgml -o extra_train_ds/train.txt +``` + +### 单卡训练 + +```python +python train.py --batch_size 32 --logging_steps 100 --epochs 10 --learning_rate 5e-5 --model_name_or_path ernie-1.0-base-zh --output_dir ./checkpoints/ --extra_train_ds_dir ./extra_train_ds/ --max_seq_length 192 +``` + +### 多卡训练 + +```python +python -m paddle.distributed.launch --gpus "0,1" train.py --batch_size 32 --logging_steps 100 --epochs 10 --learning_rate 5e-5 --model_name_or_path ernie-1.0-base-zh --output_dir ./checkpoints/ --extra_train_ds_dir ./extra_train_ds/ --max_seq_length 192 +``` + +## 模型预测 + +### 预测SIGHAN测试集 + +SIGHAN 13,SIGHAN 14,SIGHAN 15是目前中文错别字纠错任务常用的benchmark数据。由于SIGHAN官方提供的是繁体字数据集,PaddleNLP将提供简体版本的SIGHAN测试数据。以下运行SIGHAN预测脚本: + 
+```shell +sh run_sighan_predict.sh +``` + +该脚本会下载SIGHAN数据集,加载checkpoint的模型参数运行模型,输出SIGHAN测试集的预测结果到predict_sighan文件,并输出预测效果。 + +**预测效果** + +| Metric | SIGHAN 13 | SIGHAN 14 | SIGHAN 15 | +| -------------| --------- | --------- |--------- | +| Detection F1 | 0.8348 | 0.6534 | 0.7464 | +| Correction F1| 0.8217 | 0.6302 | 0.7296 | + +### 预测部署 + +#### 模型导出 + +使用动态图训练结束之后,预测部署需要导出静态图参数,具体做法需要运行模型导出脚本`export_model.py`。以下是脚本参数介绍以及运行方式: + +**参数** +- `params_path` 是指动态图训练保存的参数路径。 +- `output_path` 是指静态图参数导出路径。 +- `pinyin_vocab_file_path` 指拼音表路径。 +- `model_name_or_path` 目前支持的预训练模型有:"ernie-1.0-base-zh"。 + +**运行方式** + +```shell +python export_model.py --params_path checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params +``` + +其中`checkpoints/best_model.pdparams`是训练过程中保存的参数文件,请更换为实际得到的训练保存路径。 + +#### 预测 + +导出模型之后,可以用于预测部署,predict.py文件提供了python预测部署示例。运行方式: + +```python +python predict.py --model_file infer_model/static_graph_params.pdmodel --params_file infer_model/static_graph_params.pdiparams +``` + +输出如下: +``` +Source: 遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。 +Target: 遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。 +Source: 人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。 +Target: 人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。 +``` + +### Taskflow一键预测 +可以使用PaddleNLP提供的Taskflow工具来对输入的文本进行一键纠错,具体使用方法如下: + +```python +from paddlenlp import Taskflow +text_correction = Taskflow("text_correction") +text_correction('遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。') +''' +[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。', + 'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。', + 'errors': [{'position': 3, 'correction': {'竟': '境'}}]}] +''' + +text_correction('人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。') +''' +[{'source': '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。', + 'target': '人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。', + 'errors': [{'position': 18, 'correction': {'拙': '茁'}}]}] +''' + +``` + + +## 参考文献 +* Ruiqing Zhang, Chao Pang et al. "Correcting Chinese Spelling Errors with Phonetic Pre-training", ACL, 2021 +* DingminWang et al. "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check", EMNLP, 2018 diff --git a/examples/text_correction/ernie-csc/change_sgml_to_txt.py b/examples/text_correction/ernie-csc/change_sgml_to_txt.py new file mode 100644 index 0000000000000000000000000000000000000000..ae9c063bc084c243779ec87f38313e37afce6a08 --- /dev/null +++ b/examples/text_correction/ernie-csc/change_sgml_to_txt.py @@ -0,0 +1,47 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
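+# main() below rewrites the Automatic Corpus Generation SGML file into the sentence-pair
+# text format expected by the training script (per the README): one example per line,
+# with the raw sentence and its corrected sentence separated by a tab.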
+ +import argparse +import xml.dom.minidom + +parser = argparse.ArgumentParser() +parser.add_argument("--input", "-i", default="train.sgml", type=str) +parser.add_argument("--output", "-o", default="train.txt", type=str) + +args = parser.parse_args() + + +def main(): + with open(args.output, "w", encoding="utf-8") as fw: + with open(args.input, "r", encoding="utf-8") as f: + input_str = f.read() + # Add fake root node <SENTENCES> + input_str = "<SENTENCES>" + input_str + "</SENTENCES>" + dom = xml.dom.minidom.parseString(input_str) + example_nodes = dom.documentElement.getElementsByTagName("SENTENCE") + for example in example_nodes: + raw_text = example.getElementsByTagName("TEXT")[0].childNodes[0].data + correct_text = list(raw_text) + mistakes = example.getElementsByTagName("MISTAKE") + for mistake in mistakes: + loc = int(mistake.getElementsByTagName("LOCATION")[0].childNodes[0].data) - 1 + correction = mistake.getElementsByTagName("CORRECTION")[0].childNodes[0].data + correct_text[loc] = correction + + correct_text = "".join(correct_text) + fw.write("{}\t{}\n".format(raw_text, correct_text)) + + +if __name__ == "__main__": + main() diff --git a/examples/text_correction/ernie-csc/download.py b/examples/text_correction/ernie-csc/download.py new file mode 100644 index 0000000000000000000000000000000000000000..051947db728f5a1905060bb5495ae461328ec3fb --- /dev/null +++ b/examples/text_correction/ernie-csc/download.py @@ -0,0 +1,33 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +from paddle.utils.download import get_path_from_url + +parser = argparse.ArgumentParser() +parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="./") +parser.add_argument( + "-u", "--url", help="URL of target", type=str, default="https://bj.bcebos.com/paddlenlp/datasets/sighan_test.zip" +) +args = parser.parse_args() + + +def main(): + get_path_from_url(args.url, args.data_dir) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/text_correction/ernie-csc/export_model.py b/examples/text_correction/ernie-csc/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..81aa3fd06ac37ad5edb01a5e4f51639378c3e928 --- /dev/null +++ b/examples/text_correction/ernie-csc/export_model.py @@ -0,0 +1,57 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
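+# Example invocation from this example's README (replace the checkpoint path with the
+# parameters actually saved during training):
+#   python export_model.py --params_path checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params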
+import argparse + +import paddle +from model import ErnieForCSC +from paddle.static import InputSpec + +from paddlenlp.data import Vocab +from paddlenlp.transformers import ErnieModel + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./infer_model/static_graph_params', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-1.0", choices=["ernie-1.0"], help="Pretraining model name or path") +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") +args = parser.parse_args() +# yapf: enable + + +def main(): + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + + ernie = ErnieModel.from_pretrained(args.model_name_or_path) + + model = ErnieForCSC(ernie, pinyin_vocab_size=len(pinyin_vocab), pad_pinyin_id=pinyin_vocab[pinyin_vocab.pad_token]) + + model_dict = paddle.load(args.params_path) + model.set_dict(model_dict) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + InputSpec(shape=[None, None], dtype="int64", name="pinyin_ids"), + ], + ) + + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/text_correction/ernie-csc/model.py b/examples/text_correction/ernie-csc/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d303be144158aafc394955d746360db2503cb508 --- /dev/null +++ b/examples/text_correction/ernie-csc/model.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + + +class ErnieForCSC(nn.Layer): + r""" + ErnieForCSC is a model specified for Chinese Spelling Correction task. + + It integrates phonetic features into language model by leveraging the powerful + pre-training and fine-tuning method. + + See more details on https://aclanthology.org/2021.findings-acl.198.pdf. + Args: + ernie (ErnieModel): + An instance of `paddlenlp.transformers.ErnieModel`. + pinyin_vocab_size (int): + The vocab size of pinyin vocab. + pad_pinyin_id (int, optional): + The pad token id of pinyin vocab. Defaults to 0. 
+ """ + + def __init__(self, ernie, pinyin_vocab_size, pad_pinyin_id=0): + super(ErnieForCSC, self).__init__() + self.ernie = ernie + emb_size = self.ernie.config["hidden_size"] + hidden_size = self.ernie.config["hidden_size"] + vocab_size = self.ernie.config["vocab_size"] + + self.pad_token_id = self.ernie.config["pad_token_id"] + self.pinyin_vocab_size = pinyin_vocab_size + self.pad_pinyin_id = pad_pinyin_id + self.pinyin_embeddings = nn.Embedding(self.pinyin_vocab_size, emb_size, padding_idx=pad_pinyin_id) + self.detection_layer = nn.Linear(hidden_size, 2) + self.correction_layer = nn.Linear(hidden_size, vocab_size) + self.softmax = nn.Softmax() + + def forward(self, input_ids, pinyin_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + pinyin_ids (Tensor): + Indices of pinyin tokens of input sequence in the pinyin vocabulary. They are + numerical representations of tokens that build the pinyin input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + config.max_position_embeddings - 1]``. + Defaults to `None`. Shape as `(batch_sie, num_tokens)` and dtype as `int32` or `int64`. + attention_mask (Tensor, optional): + Mask to indicate whether to perform attention on each input token or not. + The values should be either 0 or 1. The attention scores will be set + to **-infinity** for any positions in the mask that are **0**, and will be + **unchanged** for positions that are **1**. + + - **1** for tokens that are **not masked**, + - **0** for tokens that are **masked**. + + It's data type should be `float32` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + + + Returns: + detection_error_probs (Tensor): + A Tensor of the detection probablity of each tokens. + Shape as `(batch_size, sequence_length, 2)` and dtype as `int`. + + correction_logits (Tensor): + A Tensor of the correction logits of each tokens. + Shape as `(batch_size, sequence_length, vocab_size)` and dtype as `int`. + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.detection_layer.weight.dtype) * -1e9, axis=[1, 2] + ) + + embedding_output = self.ernie.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + pinyin_embedding_output = self.pinyin_embeddings(pinyin_ids) + + # Detection module aims to detect whether each Chinese charater has spelling error. + detection_outputs = self.ernie.encoder(embedding_output, attention_mask) + # detection_error_probs shape: [B, T, 2]. It indicates the erroneous probablity of each + # word in the sequence from 0 to 1. 
+ detection_error_probs = self.softmax(self.detection_layer(detection_outputs)) + # Correction module aims to correct each potential wrong charater to right charater. + word_pinyin_embedding_output = ( + detection_error_probs[:, :, 0:1] * embedding_output + + detection_error_probs[:, :, 1:2] * pinyin_embedding_output + ) + + correction_outputs = self.ernie.encoder(word_pinyin_embedding_output, attention_mask) + # correction_logits shape: [B, T, V]. It indicates the correct score of each token in vocab + # according to each word in the sequence. + correction_logits = self.correction_layer(correction_outputs) + return detection_error_probs, correction_logits diff --git a/examples/text_correction/ernie-csc/pinyin_vocab.txt b/examples/text_correction/ernie-csc/pinyin_vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..7cae38b4f7d4b4c04da9206b46aa54bba3d75fad --- /dev/null +++ b/examples/text_correction/ernie-csc/pinyin_vocab.txt @@ -0,0 +1,417 @@ +[PAD] +[UNK] +a +ai +an +ang +ao +ba +bai +ban +bang +bao +bei +ben +beng +bi +bian +biang +biao +bie +bin +bing +bo +bu +ca +cai +can +cang +cao +ce +cei +cen +ceng +cha +chai +chan +chang +chao +che +chen +cheng +chi +chong +chou +chu +chua +chuai +chuan +chuang +chui +chun +chuo +ci +cong +cou +cu +cuan +cui +cun +cuo +da +dai +dan +dang +dao +de +den +deng +di +dian +diao +die +din +ding +diu +dong +dou +du +duan +dui +dun +duo +e +ei +en +eng +er +fa +fan +fang +fei +fen +feng +fiao +fo +fou +fu +ga +gai +gan +gang +gao +ge +gei +gen +geng +gong +gou +gu +gua +guai +guan +guang +gui +gun +guo +ha +hai +han +hang +hao +he +hei +hen +heng +hm +hong +hou +hu +hua +huai +huan +huang +hui +hun +huo +ji +jia +jian +jiang +jiao +jie +jin +jing +jiong +jiu +ju +juan +jue +jun +ka +kai +kan +kang +kao +ke +ken +keng +kong +kou +ku +kua +kuai +kuan +kuang +kui +kun +kuo +la +lai +lan +lang +lao +le +lei +leng +li +lia +lian +liang +liao +lie +lin +ling +liu +lo +long +lou +lu +luan +lun +luo +lv +lve +m +ma +mai +man +mang +mao +me +mei +men +meng +mi +mian +miao +mie +min +ming +miu +mo +mou +mu +n +na +nai +nan +nang +nao +ne +nei +nen +neng +ni +nian +niang +niao +nie +nin +ning +niu +nong +nou +nu +nuan +nun +nuo +nv +nve +o +ou +pa +pai +pan +pang +pao +pei +pen +peng +pi +pian +piao +pie +pin +ping +po +pou +pu +qi +qia +qian +qiang +qiao +qie +qin +qing +qiong +qiu +qu +quan +que +qun +ran +rang +rao +re +ren +reng +ri +rong +rou +ru +rua +ruan +rui +run +ruo +sa +sai +san +sang +sao +se +sen +seng +sha +shai +shan +shang +shao +she +shei +shen +sheng +shi +shou +shu +shua +shuai +shuan +shuang +shui +shun +shuo +si +song +sou +su +suan +sui +sun +suo +ta +tai +tan +tang +tao +te +teng +ti +tian +tiao +tie +ting +tong +tou +tu +tuan +tui +tun +tuo +wa +wai +wan +wang +wei +wen +weng +wo +wong +wu +xi +xia +xian +xiang +xiao +xie +xin +xing +xiong +xiu +xu +xuan +xue +xun +ya +yan +yang +yao +ye +yi +yin +ying +yo +yong +you +yu +yuan +yue +yun +za +zai +zan +zang +zao +ze +zei +zen +zeng +zha +zhai +zhan +zhang +zhao +zhe +zhen +zheng +zhi +zhong +zhou +zhu +zhua +zhuai +zhuan +zhuang +zhui +zhun +zhuo +zi +zong +zou +zu +zuan +zui +zun +zuo diff --git a/examples/text_correction/ernie-csc/predict.py b/examples/text_correction/ernie-csc/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9e5d7588240203a21f984397879a4e36de924fdd --- /dev/null +++ b/examples/text_correction/ernie-csc/predict.py @@ -0,0 +1,148 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from functools import partial + +import paddle +from utils import convert_example, parse_decode + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.transformers import ErnieTokenizer + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument( + "--model_file", + type=str, + required=True, + default="./static_graph_params.pdmodel", + help="The path to model info in static graph.", +) +parser.add_argument( + "--params_file", + type=str, + required=True, + default="./static_graph_params.pdiparams", + help="The path to parameters in static graph.", +) +parser.add_argument("--batch_size", type=int, default=2, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", +) +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") + +args = parser.parse_args() + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length, tokenizer, pinyin_vocab): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + config.switch_use_feed_fetch_ops(False) + config.delete_pass("fused_multi_transformer_encoder_pass") + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.det_error_probs_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + self.corr_logits_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[1]) + self.tokenizer = tokenizer + self.pinyin_vocab = pinyin_vocab + + def predict(self, data, batch_size=1): + """ + Predicts the data labels. + + Args: + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + examples = [] + texts = [] + trans_func = partial( + convert_example, + tokenizer=self.tokenizer, + pinyin_vocab=self.pinyin_vocab, + max_seq_length=self.max_seq_length, + is_test=True, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=self.pinyin_vocab.token_to_idx[self.pinyin_vocab.pad_token], dtype="int64"), # pinyin + Stack(axis=0, dtype="int64"), # length + ): [data for data in fn(samples)] + + for text in data: + example = {"source": text.strip()} + input_ids, token_type_ids, pinyin_ids, length = trans_func(example) + examples.append((input_ids, token_type_ids, pinyin_ids, length)) + texts.append(example["source"]) + + batch_examples = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + batch_texts = [texts[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + results = [] + + for examples, texts in zip(batch_examples, batch_texts): + token_ids, token_type_ids, pinyin_ids, length = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(token_ids) + self.input_handles[1].copy_from_cpu(pinyin_ids) + self.predictor.run() + det_error_probs = self.det_error_probs_handle.copy_to_cpu() + corr_logits = self.corr_logits_handle.copy_to_cpu() + + det_pred = det_error_probs.argmax(axis=-1) + char_preds = corr_logits.argmax(axis=-1) + + for i in range(len(length)): + pred_result = parse_decode( + texts[i], char_preds[i], det_pred[i], length[i], self.tokenizer, self.max_seq_length + ) + + results.append("".join(pred_result)) + return results + + +if __name__ == "__main__": + tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0") + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_len, tokenizer, pinyin_vocab) + + samples = [ + "遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。", + "人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。", + ] + + results = predictor.predict(samples, batch_size=args.batch_size) + for source, target in zip(samples, results): + print("Source:", source) + print("Target:", target) diff --git a/examples/text_correction/ernie-csc/predict_sighan.py b/examples/text_correction/ernie-csc/predict_sighan.py new file mode 100644 index 0000000000000000000000000000000000000000..3816a17533d0168ea7ef2465934571c06d0045a2 --- /dev/null +++ b/examples/text_correction/ernie-csc/predict_sighan.py @@ -0,0 +1,122 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +from functools import partial + +import paddle +from model import ErnieForCSC +from utils import convert_example, create_dataloader, parse_decode, read_test_ds + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ErnieModel, ErnieTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_name_or_path", type=str, default="ernie-1.0", choices=["ernie-1.0"], help="Pretraining model name or path") +parser.add_argument("--ckpt_path", default=None, type=str, help="The model checkpoint path.", ) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded.", ) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") +parser.add_argument("--test_file", type=str, default="test.txt", help="test set file") +parser.add_argument("--predict_file", type=str, default="predict.txt", help="predict result file") + +# yapf: enable +args = parser.parse_args() + + +def write_sighan_result_to_file(args, corr_preds, det_preds, lengths, tokenizer): + with open(args.test_file, "r", encoding="utf-8") as fin: + with open(args.predict_file, "w", encoding="utf-8") as fout: + for i, line in enumerate(fin.readlines()): + ids, words = line.strip("\n").split("\t")[0:2] + ids = ids.split("=")[1][:-1] + pred_result = parse_decode( + words, corr_preds[i], det_preds[i], lengths[i], tokenizer, args.max_seq_length + ) + words = list(words) + pred_result = list(pred_result) + result = ids + if pred_result == words: + result += ", 0" + else: + assert len(pred_result) == len(words), "pred_result: {}, words: {}".format(pred_result, words) + for i, word in enumerate(pred_result): + if word != words[i]: + result += ", {}, {}".format(i + 1, word) + fout.write("{}\n".format(result)) + + +@paddle.no_grad() +def do_predict(args): + paddle.set_device(args.device) + + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + ernie = ErnieModel.from_pretrained(args.model_name_or_path) + + model = ErnieForCSC(ernie, pinyin_vocab_size=len(pinyin_vocab), pad_pinyin_id=pinyin_vocab[pinyin_vocab.pad_token]) + + eval_ds = load_dataset(read_test_ds, data_path=args.test_file, lazy=False) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + pinyin_vocab=pinyin_vocab, + max_seq_length=args.max_seq_length, + is_test=True, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=pinyin_vocab.token_to_idx[pinyin_vocab.pad_token], dtype="int64"), # pinyin + Stack(axis=0, dtype="int64"), # length + ): [data for data in fn(samples)] + + test_data_loader = create_dataloader( + eval_ds, mode="test", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.ckpt_path: + model_dict = 
paddle.load(args.ckpt_path) + model.set_dict(model_dict) + logger.info("Load model from checkpoints: {}".format(args.ckpt_path)) + + model.eval() + corr_preds = [] + det_preds = [] + lengths = [] + for step, batch in enumerate(test_data_loader): + input_ids, token_type_ids, pinyin_ids, length = batch + det_error_probs, corr_logits = model(input_ids, pinyin_ids, token_type_ids) + # corr_logits shape: [B, T, V] + det_pred = det_error_probs.argmax(axis=-1) + det_pred = det_pred.numpy() + + char_preds = corr_logits.argmax(axis=-1) + char_preds = char_preds.numpy() + + length = length.numpy() + + corr_preds += [pred for pred in char_preds] + det_preds += [prob for prob in det_pred] + lengths += [l for l in length] + + write_sighan_result_to_file(args, corr_preds, det_preds, lengths, tokenizer) + + +if __name__ == "__main__": + do_predict(args) diff --git a/examples/text_correction/ernie-csc/requirements.txt b/examples/text_correction/ernie-csc/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..97c0c2339ce30743d0a6f2c5a68c163512b5f351 --- /dev/null +++ b/examples/text_correction/ernie-csc/requirements.txt @@ -0,0 +1 @@ +pypinyin \ No newline at end of file diff --git a/examples/text_correction/ernie-csc/run_sighan_predict.sh b/examples/text_correction/ernie-csc/run_sighan_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..ef33a20b294779c1bd786dd9203abfa38eee7bd8 --- /dev/null +++ b/examples/text_correction/ernie-csc/run_sighan_predict.sh @@ -0,0 +1,26 @@ +export CUDA_VISIBLE_DEVICES=0 + +model_name_or_path=ernie-1.0 +checkpoints_path=checkpoints +model_file=best_model.pdparams + +# Download SIGHAN test dataset +if [ ! -d "./sighan_test" ]; then + python download.py +fi + +# Predict the test input from sighan13, sighan14, sighan15 +for version in 13 14 15 +do +python predict_sighan.py --model_name_or_path $model_name_or_path \ + --test_file sighan_test/sighan$version/input.txt --batch_size 32 \ + --ckpt_path $checkpoints_path/$model_file \ + --predict_file predict_sighan$version.txt +done + +# Evaluate the prediction result of the model +for version in 13 14 15 +do +echo -e "Sighan$version Performace\n" +python sighan_evaluate.py -p predict_sighan$version.txt -t sighan_test/sighan$version/truth.txt +done \ No newline at end of file diff --git a/examples/text_correction/ernie-csc/sighan_evaluate.py b/examples/text_correction/ernie-csc/sighan_evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..981f83e2279961608cae87317d784edd540bcc31 --- /dev/null +++ b/examples/text_correction/ernie-csc/sighan_evaluate.py @@ -0,0 +1,116 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
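sighan_evaluate.py below compares a prediction file against a truth file line by line. Both use the comma-separated format that predict_sighan.py writes: a sentence id followed by `0` when no error is reported, or by alternating 1-based positions and corrected characters. A small parsing sketch (the sentence id and characters are illustrative, not taken from the real data):

```python
# Format sketch (id is made up for illustration):
#   "<pid>, 0"                   -> sentence judged error-free
#   "<pid>, <pos>, <char>, ..."  -> alternating 1-based positions / corrections
truth_line = "A2-0011-1, 2, 拜"
pred_line = "A2-0011-1, 2, 拜"  # identical to the truth -> detection and correction both hit

def parse(line):
    tokens = line.strip().split(" ")
    pid = tokens[0].strip(",")
    rest = [t.strip(",") for t in tokens[1:]]
    positions = [int(t) for t in rest[0::2]]
    chars = rest[1::2]
    return pid, positions, chars

print(parse(truth_line))  # ('A2-0011-1', [2], ['拜'])
```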
+ +import argparse + +parser = argparse.ArgumentParser() +parser.add_argument("--pred_file", "-p", required=True, type=str, help="") +parser.add_argument("--truth_file", "-t", required=True, type=str, help="") +args = parser.parse_args() + + +def main(args): + detect_tp, correct_tp, pos, neg, fp = 0, 0, 0, 0, 0 + + pred_dict = dict() + truth_dict = dict() + fpred = open(args.pred_file, "r", encoding="utf-8") + ftruth = open(args.truth_file, "r", encoding="utf-8") + for idx, (pred, truth) in enumerate(zip(fpred, ftruth)): + pred_tokens = pred.strip().split(" ") + truth_tokens = truth.strip().split(" ") + + pred_id = pred_tokens[0] + truth_id = truth_tokens[0] + + pred_tokens = pred_tokens[1:] + truth_tokens = truth_tokens[1:] + + detect_truth_positions = [ + int(truth_token.strip(",")) for i, truth_token in enumerate(truth_tokens) if i % 2 == 0 + ] + correct_truth_tokens = [truth_token.strip(",") for i, truth_token in enumerate(truth_tokens) if i % 2 == 1] + detect_pred_positions = [int(pred_token.strip(",")) for i, pred_token in enumerate(pred_tokens) if i % 2 == 0] + correct_pred_tokens = [pred_token.strip(",") for i, pred_token in enumerate(pred_tokens) if i % 2 == 1] + + pred_dict[pred_id] = (detect_pred_positions, correct_pred_tokens) + truth_dict[truth_id] = (detect_truth_positions, correct_truth_tokens) + + assert sorted(pred_dict.keys()) == sorted( + truth_dict.keys() + ), "Prediction file should have all prediction result in truth file" + + for pid, predition in pred_dict.items(): + truth = truth_dict[pid] + if predition[0][0] != 0: + pos += 1 + if sorted(zip(*predition)) == sorted(zip(*truth)): + correct_tp += 1 + if truth[0][0] == 0: + fp += 1 + + if truth[0][0] != 0: + if sorted(predition[0]) == sorted(truth[0]): + detect_tp += 1 + neg += 1 + + eps = 1e-9 + + # Detection level + detect_pos = detect_tp + fp + if detect_pos > 0 and neg > 0: + detect_precision = detect_tp * 1.0 / detect_pos + detect_recall = detect_tp * 1.0 / neg + detect_f1 = 2.0 * detect_precision * detect_recall / (detect_precision + detect_recall + eps) + else: + detect_precision = 0 + detect_recall = 0 + detect_f1 = 0 + + # Correction level + correct_pos = correct_tp + fp + if correct_pos > 0 and neg > 0: + correct_precision = correct_tp * 1.0 / correct_pos + correct_recall = correct_tp * 1.0 / neg + correct_f1 = 2.0 * correct_precision * correct_recall / (correct_precision + correct_recall + eps) + else: + correct_precision = 0 + correct_recall = 0 + correct_f1 = 0 + + print("==========================================================") + print("Overall Performance") + print("==========================================================") + print("\nDetection Level") + print("\tPrecision = {:.4f} ({}/{})".format(detect_precision, detect_tp, detect_pos)) + print("\tRecall = {:.4f} ({}/{})".format(detect_recall, detect_tp, neg)) + print( + "\tF1-Score = {:.4f} ((2*{:.4f}*{:.4f})/({:.4f}+{:.4f}))".format( + detect_f1, detect_precision, detect_recall, detect_precision, detect_recall + ) + ) + + print("\nCorrection Level") + print("\tPrecision = {:.4f} ({}/{})".format(correct_precision, correct_tp, correct_pos)) + print("\tRecall = {:.4f} ({}/{})".format(correct_recall, correct_tp, neg)) + print( + "\tF1-Score = {:.4f} ((2*{:.4f}*{:.4f})/({:.4f}+{:.4f}))".format( + correct_f1, correct_precision, correct_recall, correct_precision, correct_recall + ) + ) + print("==========================================================\n") + + +if __name__ == "__main__": + main(args) diff --git 
a/examples/text_correction/ernie-csc/train.py b/examples/text_correction/ernie-csc/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e554ef8d2327774914afa4371aca1052721c7437 --- /dev/null +++ b/examples/text_correction/ernie-csc/train.py @@ -0,0 +1,207 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from model import ErnieForCSC +from utils import convert_example, create_dataloader, read_train_ds + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import CorrectionF1, DetectionF1 +from paddlenlp.transformers import ErnieModel, ErnieTokenizer, LinearDecayWithWarmup +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-1.0-base-zh", choices=["ernie-1.0-base-zh"], help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=128, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") +parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument("--ignore_label", default=-1, type=int, help="Ignore label for CrossEntropyLoss") +parser.add_argument("--extra_train_ds_dir", default=None, type=str, help="The directory of extra train dataset.") + +# yapf: enable +args = parser.parse_args() + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, eval_data_loader): + model.eval() + det_metric = DetectionF1() + corr_metric = CorrectionF1() + for step, batch in enumerate(eval_data_loader, start=1): + input_ids, token_type_ids, pinyin_ids, det_labels, corr_labels, length = batch + # det_error_probs shape: [B, T, 2] + # corr_logits shape: [B, T, V] + det_error_probs, corr_logits = model(input_ids, pinyin_ids, token_type_ids) + det_metric.update(det_error_probs, det_labels, length) + corr_metric.update(det_error_probs, det_labels, corr_logits, corr_labels, length) + + det_f1, det_precision, det_recall = det_metric.accumulate() + corr_f1, corr_precision, corr_recall = corr_metric.accumulate() + logger.info("Sentence-Level Performance:") + logger.info( + "Detection metric: F1={:.4f}, Recall={:.4f}, Precision={:.4f}".format(det_f1, det_recall, det_precision) + ) + logger.info( + "Correction metric: F1={:.4f}, Recall={:.4f}, Precision={:.4f}".format(corr_f1, corr_recall, corr_precision) + ) + model.train() + return det_f1, corr_f1 + + +def do_train(args): + set_seed(args) + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + ernie = ErnieModel.from_pretrained(args.model_name_or_path) + + model = ErnieForCSC(ernie, pinyin_vocab_size=len(pinyin_vocab), pad_pinyin_id=pinyin_vocab[pinyin_vocab.pad_token]) + + train_ds, eval_ds = load_dataset("sighan-cn", splits=["train", "dev"]) + + # Extend current training dataset by providing extra training + # datasets directory. The suffix of dataset file name in extra + # dataset directory has to be ".txt". 
The data format of + # dataset need to be a couple of senteces at every line, such as: + # "城府宫员表示,这是过去三十六小时内第三期强烈的余震。\t政府官员表示,这是过去三十六小时内第三起强烈的余震。\n" + if args.extra_train_ds_dir is not None and os.path.exists(args.extra_train_ds_dir): + data = train_ds.data + data_files = [ + os.path.join(args.extra_train_ds_dir, data_file) + for data_file in os.listdir(args.extra_train_ds_dir) + if data_file.endswith(".txt") + ] + for data_file in data_files: + ds = load_dataset(read_train_ds, data_path=data_file, splits=["train"], lazy=False) + data += ds.data + train_ds = MapDataset(data) + + det_loss_act = paddle.nn.CrossEntropyLoss(ignore_index=args.ignore_label, use_softmax=False) + corr_loss_act = paddle.nn.CrossEntropyLoss(ignore_index=args.ignore_label, reduction="none") + + trans_func = partial( + convert_example, tokenizer=tokenizer, pinyin_vocab=pinyin_vocab, max_seq_length=args.max_seq_length + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Pad(axis=0, pad_val=pinyin_vocab.token_to_idx[pinyin_vocab.pad_token]), # pinyin + Pad(axis=0, dtype="int64"), # detection label + Pad(axis=0, dtype="int64"), # correction label + Stack(axis=0, dtype="int64"), # length + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + eval_data_loader = create_dataloader( + eval_ds, mode="eval", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + logger.info("Total training step: {}".format(num_training_steps)) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_steps = 1 + best_f1 = -1 + tic_train = time.time() + for epoch in range(args.epochs): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, pinyin_ids, det_labels, corr_labels, length = batch + det_error_probs, corr_logits = model(input_ids, pinyin_ids, token_type_ids) + # Chinese Spelling Correction has 2 tasks: detection task and correction task. + # Detection task aims to detect whether each Chinese charater has spelling error. + # Correction task aims to correct each potential wrong charater to right charater. + # So we need to minimize detection loss and correction loss simultaneously. 
+ # See more loss design details on https://aclanthology.org/2021.findings-acl.198.pdf + det_loss = det_loss_act(det_error_probs, det_labels) + corr_loss = corr_loss_act(corr_logits, corr_labels) * det_error_probs.max(axis=-1) + loss = (det_loss + corr_loss).mean() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_steps % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_steps, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + if global_steps % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info("Eval:") + det_f1, corr_f1 = evaluate(model, eval_data_loader) + f1 = (det_f1 + corr_f1) / 2 + model_file = "model_%d" % global_steps + if f1 > best_f1: + # save best model + paddle.save(model.state_dict(), os.path.join(args.output_dir, "best_model.pdparams")) + logger.info("Save best model at {} step.".format(global_steps)) + best_f1 = f1 + model_file = model_file + "_best" + model_file = model_file + ".pdparams" + paddle.save(model.state_dict(), os.path.join(args.output_dir, model_file)) + logger.info("Save model at {} step.".format(global_steps)) + if args.max_steps > 0 and global_steps >= args.max_steps: + return + global_steps += 1 + + +if __name__ == "__main__": + do_train(args) diff --git a/examples/text_correction/ernie-csc/utils.py b/examples/text_correction/ernie-csc/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..83273deff34b6ccaa712092344ac985111cf4427 --- /dev/null +++ b/examples/text_correction/ernie-csc/utils.py @@ -0,0 +1,116 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
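utils.py below aligns every Chinese character with a pinyin id via pypinyin (the example's only extra dependency, see requirements.txt). A quick sketch of what `convert_example` does with the pinyin strings; the printed values are what `lazy_pinyin` typically returns for this input:

```python
from pypinyin import Style, lazy_pinyin

text = "人生就是如此"
pinyins = lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
print(pinyins)  # e.g. ['ren2', 'sheng1', 'jiu4', 'shi4', 'ru2', 'ci3']

# convert_example drops the trailing tone digit ([:-1]) and looks the syllable
# up in pinyin_vocab; non-Chinese characters fall back to [UNK].
print([p[:-1] for p in pinyins])  # ['ren', 'sheng', 'jiu', 'shi', 'ru', 'ci']
```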
+ +from pypinyin import lazy_pinyin, Style +import paddle + +from paddlenlp.transformers import is_chinese_char + + +def read_train_ds(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + source, target = line.strip("\n").split("\t")[0:2] + yield {"source": source, "target": target} + + +def read_test_ds(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + ids, words = line.strip("\n").split("\t")[0:2] + yield {"source": words} + + +def convert_example(example, tokenizer, pinyin_vocab, max_seq_length=128, ignore_label=-1, is_test=False): + source = example["source"] + words = list(source) + if len(words) > max_seq_length - 2: + words = words[: max_seq_length - 2] + length = len(words) + words = ["[CLS]"] + words + ["[SEP]"] + input_ids = tokenizer.convert_tokens_to_ids(words) + token_type_ids = [0] * len(input_ids) + + # Use pad token in pinyin emb to map word emb [CLS], [SEP] + pinyins = lazy_pinyin(source, style=Style.TONE3, neutral_tone_with_five=True) + pinyin_ids = [0] + # Align pinyin and chinese char + pinyin_offset = 0 + for i, word in enumerate(words[1:-1]): + pinyin = "[UNK]" if word != "[PAD]" else "[PAD]" + if len(word) == 1 and is_chinese_char(ord(word)): + while pinyin_offset < len(pinyins): + current_pinyin = pinyins[pinyin_offset][:-1] + pinyin_offset += 1 + if current_pinyin in pinyin_vocab: + pinyin = current_pinyin + break + pinyin_ids.append(pinyin_vocab[pinyin]) + + pinyin_ids.append(0) + assert len(input_ids) == len(pinyin_ids), "length of input_ids must be equal to length of pinyin_ids" + + if not is_test: + target = example["target"] + correction_labels = list(target) + if len(correction_labels) > max_seq_length - 2: + correction_labels = correction_labels[: max_seq_length - 2] + correction_labels = tokenizer.convert_tokens_to_ids(correction_labels) + correction_labels = [ignore_label] + correction_labels + [ignore_label] + + detection_labels = [] + for input_id, label in zip(input_ids[1:-1], correction_labels[1:-1]): + detection_label = 0 if input_id == label else 1 + detection_labels += [detection_label] + detection_labels = [ignore_label] + detection_labels + [ignore_label] + return input_ids, token_type_ids, pinyin_ids, detection_labels, correction_labels, length + else: + return input_ids, token_type_ids, pinyin_ids, length + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def parse_decode(words, corr_preds, det_preds, lengths, tokenizer, max_seq_length): + UNK = tokenizer.unk_token + UNK_id = tokenizer.convert_tokens_to_ids(UNK) + + corr_pred = corr_preds[1 : 1 + lengths].tolist() + det_pred = det_preds[1 : 1 + lengths].tolist() + words = list(words) + rest_words = [] + if len(words) > max_seq_length - 2: + rest_words = words[max_seq_length - 2 :] + words = words[: max_seq_length - 2] + pred_result = "" + for j, word in enumerate(words): + candidates = tokenizer.convert_ids_to_tokens(corr_pred[j] if corr_pred[j] < tokenizer.vocab_size else UNK_id) + word_icc = is_chinese_char(ord(word)) + cand_icc = 
is_chinese_char(ord(candidates)) if len(candidates) == 1 else False + if not word_icc or det_pred[j] == 0 or candidates in [UNK, "[PAD]"] or (word_icc and not cand_icc): + pred_result += word + else: + pred_result += candidates.lstrip("##") + pred_result += "".join(rest_words) + return pred_result diff --git a/examples/text_generation/couplet/README.md b/examples/text_generation/couplet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f9f18d0d4f4763b8731a3824051571667d74e92f --- /dev/null +++ b/examples/text_generation/couplet/README.md @@ -0,0 +1,92 @@ +# 使用Seq2Seq模型完成自动对联 + +以下是本范例模型的简要目录结构及说明: + +``` +. +├── README.md # 文档,本文件 +├── args.py # 训练、预测以及模型参数配置程序 +├── data.py # 数据读入程序 +├── train.py # 训练主程序 +├── predict.py # 预测主程序 +└── model.py # 带注意力机制的对联生成程序 +``` + +## 简介 + +Sequence to Sequence (Seq2Seq),使用编码器-解码器(Encoder-Decoder)结构,用编码器将源序列编码成vector,再用解码器将该vector解码为目标序列。Seq2Seq 广泛应用于机器翻译,自动对话机器人,文档摘要自动生成,图片描述自动生成等任务中。 + +本目录包含Seq2Seq的一个经典样例:自动对联生成,带attention机制的文本生成模型。 + + +## 模型概览 + +本模型中,在编码器方面,我们采用了基于LSTM的多层的RNN encoder;在解码器方面,我们使用了带注意力(Attention)机制的RNN decoder,在预测时我们使用Beam Search算法来生对联的下联。 + +## 数据介绍 + +本教程使用[couplet数据集](https://bj.bcebos.com/paddlenlp/datasets/couplet.tar.gz)作为训练语料,该数据集来源于[这个github repo](https://github.com/v-zich/couplet-clean-dataset),其中train_src.tsv及train_tgt.tsv为训练集,dev_src.tsv及dev_tgt.tsv为开发集,test_src.tsv及test_tgt.tsv为测试集。 + +数据集会在调用`paddlenlp.datasets.load_dataset`时自动下载,在linux系统下,数据集会自动下载到`~/.paddlenlp/datasets/Couplet/`目录下 + + +## 模型训练 + +执行以下命令即可训练带有注意力机制的Seq2Seq模型: + +```sh +python train.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --device gpu \ + --model_path ./couplet_models \ + --max_epoch 20 + +``` + +各参数的具体说明请参阅 `args.py` 。训练程序会在每个epoch训练结束之后,保存一次模型。 + +**NOTE:** 如需恢复模型训练,则`init_from_ckpt`只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=couplet_models/19`即可,程序会自动加载模型参数`couplet_models/19.pdparams`,也会自动加载优化器状态`couplet_models/19.pdopt`。 + +## 模型预测 + +训练完成之后,可以使用保存的模型(由 `--init_from_ckpt` 指定)对测试集进行beam search解码,命令如下: + +```sh +python predict.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --init_from_ckpt couplet_models/19 \ + --infer_output_file infer_output.txt \ + --beam_size 10 \ + --device gpu + +``` + +各参数的具体说明请参阅 `args.py` ,注意预测时所用模型超参数需和训练时一致。 + +## 生成对联样例 + +上联:崖悬风雨骤 下联:月落水云寒 + +上联:约春章柳下 下联:邀月醉花间 + +上联:箬笠红尘外 下联:扁舟明月中 + +上联:书香醉倒窗前月 下联:烛影摇红梦里人 + +上联:踏雪寻梅求雅趣 下联:临风把酒觅知音 + +上联:未出南阳天下论 下联:先登北斗汉中书 + +上联:朱联妙语千秋颂 下联:赤胆忠心万代传 + +上联:月半举杯圆月下 下联:花间对酒醉花间 + +上联:挥笔如剑倚麓山豪气干云揽月去 下联:落笔似龙飞沧海龙吟破浪乘风来 + +## 参考的开源数据集 + +我们的数据集采用了开源对联数据集[couplet-clean-dataset](https://github.com/v-zich/couplet-clean-dataset),地址:https://github.com/v-zich/couplet-clean-dataset ,该数据集过滤了[couplet-dataset](https://github.com/wb14123/couplet-dataset)(地址:https://github.com/wb14123/couplet-dataset )中的低俗、敏感内容。 diff --git a/examples/text_generation/couplet/args.py b/examples/text_generation/couplet/args.py new file mode 100644 index 0000000000000000000000000000000000000000..179f76f05339b23bab06b12774be39ddc46c7e4a --- /dev/null +++ b/examples/text_generation/couplet/args.py @@ -0,0 +1,50 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--learning_rate", type=float, default=0.001, help="learning rate for optimizer") + + parser.add_argument("--num_layers", type=int, default=1, help="layers number of encoder and decoder") + + parser.add_argument("--hidden_size", type=int, default=100, help="hidden size of encoder and decoder") + + parser.add_argument("--batch_size", type=int, default=128, help="Batch size of each step") + + parser.add_argument("--max_epoch", type=int, default=50, help="max epoch for the training") + + parser.add_argument("--max_len", type=int, default=50, help="max length for source and target sentence") + + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip") + + parser.add_argument("--log_freq", type=int, default=200, help="The frequency to print training logs") + + parser.add_argument("--model_path", type=str, default="model", help="model path for model to save") + + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + + parser.add_argument("--infer_output_file", type=str, default="infer_output", help="file name for inference output") + + parser.add_argument("--beam_size", type=int, default=10, help="file name for inference") + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + args = parser.parse_args() + return args diff --git a/examples/text_generation/couplet/data.py b/examples/text_generation/couplet/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d8ac9b1ce842688a2573e26c8cd1cab5a8e3cc67 --- /dev/null +++ b/examples/text_generation/couplet/data.py @@ -0,0 +1,65 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
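data.py below reads the couplet dataset, whose `first`/`second` fields store each half of a couplet as `'\x02'`-separated characters; `convert_example` splits on that separator and wraps the resulting token ids with the bos/eos ids. A tiny sketch of that input format (the couplet text is invented):

```python
# Invented example in the couplet dataset's field format.
example = {"first": "上\x02联\x02示\x02例", "second": "下\x02联\x02示\x02例"}

tokens = example["first"].split("\x02")
print(tokens)  # ['上', '联', '示', '例']
# convert_example then maps these tokens to vocabulary ids and prepends/appends
# the bos/eos ids, mirroring the code in data.py below.
```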
+ +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import Pad, SamplerHelper, Vocab +from paddlenlp.datasets import load_dataset + + +def convert_example(example, vocab): + bos_id = vocab[vocab.bos_token] + eos_id = vocab[vocab.eos_token] + + source = [bos_id] + vocab.to_indices(example["first"].split("\x02")) + [eos_id] + target = [bos_id] + vocab.to_indices(example["second"].split("\x02")) + [eos_id] + return source, target + + +def create_train_loader(batch_size=128): + train_ds = load_dataset("couplet", splits="train") + vocab = Vocab.load_vocabulary(**train_ds.vocab_info) + pad_id = vocab[vocab.eos_token] + trans_func = partial(convert_example, vocab=vocab) + train_ds = train_ds.map(trans_func, lazy=False) + train_batch_sampler = SamplerHelper(train_ds).shuffle().batch(batch_size=batch_size) + + train_loader = paddle.io.DataLoader( + train_ds, batch_sampler=train_batch_sampler, collate_fn=partial(prepare_input, pad_id=pad_id) + ) + return train_loader, vocab + + +def create_infer_loader(batch_size=128): + test_ds = load_dataset("couplet", splits="test") + vocab = Vocab.load_vocabulary(**test_ds.vocab_info) + pad_id = vocab[vocab.eos_token] + trans_func = partial(convert_example, vocab=vocab) + test_ds = test_ds.map(trans_func, lazy=False) + test_batch_sampler = SamplerHelper(test_ds).batch(batch_size=batch_size) + + test_loader = paddle.io.DataLoader( + test_ds, batch_sampler=test_batch_sampler, collate_fn=partial(prepare_input, pad_id=pad_id) + ) + return test_loader, vocab + + +def prepare_input(insts, pad_id): + src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) + tgt, tgt_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([inst[1] for inst in insts]) + tgt_mask = (tgt[:, :-1] != pad_id).astype(paddle.get_default_dtype()) + return src, src_length, tgt[:, :-1], tgt[:, 1:, np.newaxis], tgt_mask diff --git a/examples/text_generation/couplet/model.py b/examples/text_generation/couplet/model.py new file mode 100644 index 0000000000000000000000000000000000000000..533477b1dfc98e610f27e532834bc9c5b3bb2952 --- /dev/null +++ b/examples/text_generation/couplet/model.py @@ -0,0 +1,198 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
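model.py below defines the attention-based Seq2Seq model together with `CrossEntropyCriterion`, which masks out padded target positions before reducing the loss. A NumPy sketch of that masked reduction with invented per-token losses:

```python
import numpy as np

token_loss = np.array([[0.5, 0.7, 0.9],
                       [0.4, 0.6, 0.8]])   # invented per-token cross-entropy, [batch, time]
trg_mask = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0]])     # first sample has one padded target position

# Same reduction order as CrossEntropyCriterion below: mask, mean over the
# batch dimension, then sum over time steps.
masked = token_loss * trg_mask
print(masked.mean(axis=0).sum())  # ≈ 1.5
```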
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class CrossEntropyCriterion(nn.Layer): + def __init__(self): + super(CrossEntropyCriterion, self).__init__() + + def forward(self, predict, label, trg_mask): + cost = F.cross_entropy(input=predict, label=label, reduction="none", soft_label=False) + cost = paddle.squeeze(cost, axis=[2]) + masked_cost = cost * trg_mask + batch_mean_cost = paddle.mean(masked_cost, axis=[0]) + seq_cost = paddle.sum(batch_mean_cost) + + return seq_cost + + +class Seq2SeqEncoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers): + super(Seq2SeqEncoder, self).__init__() + self.embedder = nn.Embedding(vocab_size, embed_dim) + + self.lstm = nn.LSTM( + input_size=embed_dim, + hidden_size=hidden_size, + num_layers=num_layers, + dropout=0.2 if num_layers > 1 else 0.0, + ) + + def forward(self, sequence, sequence_length): + inputs = self.embedder(sequence) + encoder_output, encoder_state = self.lstm(inputs, sequence_length=sequence_length) + + return encoder_output, encoder_state + + +class AttentionLayer(nn.Layer): + def __init__(self, hidden_size): + super(AttentionLayer, self).__init__() + self.input_proj = nn.Linear(hidden_size, hidden_size) + self.output_proj = nn.Linear(hidden_size + hidden_size, hidden_size) + + def forward(self, hidden, encoder_output, encoder_padding_mask): + encoder_output = self.input_proj(encoder_output) + attn_scores = paddle.matmul(paddle.unsqueeze(hidden, [1]), encoder_output, transpose_y=True) + + if encoder_padding_mask is not None: + attn_scores = paddle.add(attn_scores, encoder_padding_mask) + + attn_scores = F.softmax(attn_scores) + attn_out = paddle.squeeze(paddle.matmul(attn_scores, encoder_output), [1]) + attn_out = paddle.concat([attn_out, hidden], 1) + attn_out = self.output_proj(attn_out) + return attn_out + + +class Seq2SeqDecoderCell(nn.RNNCellBase): + def __init__(self, num_layers, input_size, hidden_size): + super(Seq2SeqDecoderCell, self).__init__() + self.dropout = nn.Dropout(0.2) + self.lstm_cells = nn.LayerList( + [ + nn.LSTMCell(input_size=input_size + hidden_size if i == 0 else hidden_size, hidden_size=hidden_size) + for i in range(num_layers) + ] + ) + + self.attention_layer = AttentionLayer(hidden_size) + + def forward(self, step_input, states, encoder_output, encoder_padding_mask=None): + lstm_states, input_feed = states + new_lstm_states = [] + step_input = paddle.concat([step_input, input_feed], 1) + for i, lstm_cell in enumerate(self.lstm_cells): + out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) + step_input = self.dropout(out) + new_lstm_states.append(new_lstm_state) + out = self.attention_layer(step_input, encoder_output, encoder_padding_mask) + return out, [new_lstm_states, out] + + +class Seq2SeqDecoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers): + super(Seq2SeqDecoder, self).__init__() + self.embedder = nn.Embedding(vocab_size, embed_dim) + self.lstm_attention = nn.RNN(Seq2SeqDecoderCell(num_layers, embed_dim, hidden_size)) + self.output_layer = nn.Linear(hidden_size, vocab_size) + + def forward(self, trg, decoder_initial_states, encoder_output, encoder_padding_mask): + inputs = self.embedder(trg) + + decoder_output, _ = self.lstm_attention( + inputs, + initial_states=decoder_initial_states, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + predict = self.output_layer(decoder_output) + + return predict + + +class Seq2SeqAttnModel(nn.Layer): + def __init__(self, 
vocab_size, embed_dim, hidden_size, num_layers, eos_id=1): + super(Seq2SeqAttnModel, self).__init__() + self.hidden_size = hidden_size + self.eos_id = eos_id + self.num_layers = num_layers + self.INF = 1e9 + self.encoder = Seq2SeqEncoder(vocab_size, embed_dim, hidden_size, num_layers) + self.decoder = Seq2SeqDecoder(vocab_size, embed_dim, hidden_size, num_layers) + + def forward(self, src, src_length, trg): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + # Transfer shape of encoder_final_states to [num_layers, 2, batch_size, hidden_size] + encoder_final_states = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + # Construct decoder initial states: use input_feed and the shape is + # [[h,c] * num_layers, input_feed], consistent with Seq2SeqDecoderCell.states + decoder_initial_states = [ + encoder_final_states, + self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on padddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + predict = self.decoder(trg, decoder_initial_states, encoder_output, encoder_padding_mask) + + return predict + + +class Seq2SeqAttnInferModel(Seq2SeqAttnModel): + def __init__( + self, vocab_size, embed_dim, hidden_size, num_layers, bos_id=0, eos_id=1, beam_size=4, max_out_len=256 + ): + self.bos_id = bos_id + self.beam_size = beam_size + self.max_out_len = max_out_len + self.num_layers = num_layers + super(Seq2SeqAttnInferModel, self).__init__(vocab_size, embed_dim, hidden_size, num_layers, eos_id) + + # Dynamic decoder for inference + self.beam_search_decoder = nn.BeamSearchDecoder( + self.decoder.lstm_attention.cell, + start_token=bos_id, + end_token=eos_id, + beam_size=beam_size, + embedding_fn=self.decoder.embedder, + output_fn=self.decoder.output_layer, + ) + + def forward(self, src, src_length): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + encoder_final_state = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + + # Initial decoder initial states + decoder_initial_states = [ + encoder_final_state, + self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on paddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + # Tile the batch dimension with beam_size + encoder_output = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_output, self.beam_size) + encoder_padding_mask = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_padding_mask, self.beam_size) + + # Dynamic decoding with beam search + seq_output, _ = nn.dynamic_decode( + decoder=self.beam_search_decoder, + inits=decoder_initial_states, + max_step_num=self.max_out_len, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + return seq_output diff --git a/examples/text_generation/couplet/predict.py b/examples/text_generation/couplet/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..307af6f1233df1055f792c9605b2316fef08ce9c --- /dev/null +++ b/examples/text_generation/couplet/predict.py @@ -0,0 +1,83 @@ +# Copyright 
(c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io + +import numpy as np +import paddle +from args import parse_args +from data import create_infer_loader +from model import Seq2SeqAttnInferModel + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + paddle.set_device(args.device) + + test_loader, vocab = create_infer_loader(args.batch_size) + vocab_size = len(vocab) + bos_id = vocab[vocab.bos_token] + eos_id = vocab[vocab.eos_token] + trg_idx2word = vocab.idx_to_token + + model = paddle.Model( + Seq2SeqAttnInferModel( + vocab_size, + args.hidden_size, + args.hidden_size, + args.num_layers, + bos_id=bos_id, + eos_id=eos_id, + beam_size=args.beam_size, + max_out_len=256, + ) + ) + + model.prepare() + + # Load the trained model + assert args.init_from_ckpt, "Please set reload_model to load the infer model." + model.load(args.init_from_ckpt) + + # TODO(guosheng): use model.predict when support variant length + with io.open(args.infer_output_file, "w", encoding="utf-8") as f: + for data in test_loader(): + inputs = data[:2] + finished_seq = model.predict_batch(inputs=list(inputs))[0] + finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq + finished_seq = np.transpose(finished_seq, [0, 2, 1]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + id_list = post_process_seq(beam, bos_id, eos_id) + word_list = [trg_idx2word[id] for id in id_list] + sequence = "\x02".join(word_list) + "\n" + f.write(sequence) + break + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/examples/text_generation/couplet/train.py b/examples/text_generation/couplet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..5e1914ea247d92e5d6d3182c5f3aaeecd3c54043 --- /dev/null +++ b/examples/text_generation/couplet/train.py @@ -0,0 +1,50 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
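train.py below trains the model through the high-level `paddle.Model` API and tracks `Perplexity`, which is simply the exponential of the average per-token cross-entropy. A two-line sketch with invented numbers:

```python
import math

token_nll = [2.1, 1.7, 2.4, 1.9]  # invented per-token negative log-likelihoods
print(math.exp(sum(token_nll) / len(token_nll)))  # ≈ 7.58
```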
+ +import paddle +from args import parse_args +from data import create_train_loader +from model import CrossEntropyCriterion, Seq2SeqAttnModel + +from paddlenlp.metrics import Perplexity + + +def do_train(args): + paddle.set_device(args.device) + + # Define dataloader + train_loader, vocab = create_train_loader(args.batch_size) + vocab_size = len(vocab) + pad_id = vocab[vocab.eos_token] + + model = paddle.Model(Seq2SeqAttnModel(vocab_size, args.hidden_size, args.hidden_size, args.num_layers, pad_id)) + + optimizer = paddle.optimizer.Adam(learning_rate=args.learning_rate, parameters=model.parameters()) + ppl_metric = Perplexity() + model.prepare(optimizer, CrossEntropyCriterion(), ppl_metric) + + print(args) + model.fit( + train_data=train_loader, + epochs=args.max_epoch, + eval_freq=1, + save_freq=1, + save_dir=args.model_path, + log_freq=args.log_freq, + ) + + +if __name__ == "__main__": + args = parse_args() + do_train(args) diff --git a/examples/text_generation/ctrl/README.md b/examples/text_generation/ctrl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e68bed08e0776fc1a928b2c9c54713bde9dc77bb --- /dev/null +++ b/examples/text_generation/ctrl/README.md @@ -0,0 +1,59 @@ +# [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/pdf/1909.05858.pdf) + +## 摘要 +大规模语言模型显示出很有前景的文本生成能力,但用户无法轻松控制生成文本的特定方面。我们发布了CTRL,一个包含 16.3 亿个参数的条件转换器语言模型,经过训练以调节控制样式、内容和特定任务行为的控制代码。 控制代码源自与原始文本自然共同出现的结构,保留了无监督学习的优势,同时对文本生成提供了更明确的控制。 这些代码还允许CTRL预测训练数据的哪些部分最有可能给定序列。 这提供了一种通过基于模型的来源归因分析大量数据的潜在方法。 我们在 https://github.com/salesforce/ctrl 上发布了多个全尺寸、预训练版本的CTRL。 + +## 文本生成测试 +```sh +python demo.py +``` +模型生成使用到的参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。 +- `max_predict_len` 表示最大生成的句子长度。 +- `repetition_penalty` 表示生成重复token的惩罚参数,详细信息可查看[这篇论文](https://arxiv.org/pdf/1909.05858.pdf)。 + +## 生成结果样例 + +``` +input text: Diet English : I lost 10 kgs! ; German : +output text: Ich habe zehn Kilogramm abgenommen! + + Als ich das erste Mal mit meinem Smartphone war, war es ein wenig schwierig zu finden, wo man die App herunterladen kann. Aber jetzt ist sie da. + + Das Smartphone hat mich auch sehr beeindruckt. Es machte mir viel Spaß. Und so funktioniert mein Leben heute ganz einfach und ohne große Probleme. + + Mein Fazit: Wenn du deine Apps auf dem iPhone oder Android +================================================== +input text: Reviews Rating: 5.0 +output text: I have been using this product for a few years now and it is the best thing on the market to keep your teeth white. It does not taste bad at all like some of these other products do. The only problem with this product is that you need to use it every day or else they will start coming back in after about 2 weeks. But if you do that, then it's worth it. You can also buy them from Amazon but shipping takes forever. So just make sure you order enough so you don't run out. + Rating: 5.0 + This stuff works great. My dentist recommended it, and I'm glad he did. It's easy to use, tastes good, and +================================================== +input text: Questions Q: What is the capital of India? +output text: A: mumbai. + Q: Who was a British politician who served as Prime Minister from 1922 to 1924? + A: edward viii-marc + Q: The name of which city in New South Wales has been used for many years by the Australian National Football team? + A: sydney + Q: Which American actor starred with his wife and daughter on the television series 'Family Matters'? 
+ A: james coburn + Q: In what year did the first edition of this book appear? + A: 1962 + Q: How long does it take to make one pound of sausage? +================================================== +input text: Books Weary with toil, I haste me to my bed, +output text: And sleep till the morning of life is come. + The sun has risen and his beams are bright, + But still he shines upon a world forlorn; + He sees no more its joys or griefs below, + Nor hears their murmur as they pass below. + My heart grows weary for the world's delight, + For all that makes it dear in human eyes; + It feels like one who wanders through an empty land, + With nothing left but desolation there. + O God! how long shall this be mine abode, + Where every joy hath passed away from me? + How long, O God, must I thus wander here, + In sorrow +================================================== +``` diff --git a/examples/text_generation/ctrl/demo.py b/examples/text_generation/ctrl/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..534531936c2a486c7db80832597fb148fae35d04 --- /dev/null +++ b/examples/text_generation/ctrl/demo.py @@ -0,0 +1,43 @@ +import paddle +from paddlenlp.transformers import CTRLLMHeadModel, CTRLTokenizer + + +class Demo: + def __init__(self, model_name_or_path="ctrl", max_predict_len=128, repetition_penalty=1.2): + self.tokenizer = CTRLTokenizer.from_pretrained(model_name_or_path) + print("Loading the model parameters, please wait...") + self.model = CTRLLMHeadModel.from_pretrained(model_name_or_path) + self.model.eval() + self.max_predict_len = max_predict_len + self.repetition_penalty = repetition_penalty + print("Model loaded.") + + # prediction function + @paddle.no_grad() + def generate(self, inputs, max_predict_len=None, repetition_penalty=None): + max_predict_len = max_predict_len if max_predict_len is not None else self.max_predict_len + repetition_penalty = repetition_penalty if repetition_penalty is not None else self.repetition_penalty + + ids = self.tokenizer(inputs)["input_ids"] + input_ids = paddle.to_tensor([ids], dtype="int64") + max_length = max(self.max_predict_len - input_ids.shape[1], 20) + outputs = self.model.generate(input_ids, max_length=max_length, repetition_penalty=self.repetition_penalty)[0][ + 0 + ] + decode_outputs = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(outputs.cpu())) + + print(f"input text: {inputs}") + print(f"output text: {decode_outputs}") + print("=" * 50) + + +if __name__ == "__main__": + demo = Demo(model_name_or_path="ctrl", max_predict_len=128, repetition_penalty=1.2) + input_text_list = [ + "Diet English : I lost 10 kgs! 
; German : ", + "Reviews Rating: 5.0", + "Questions Q: What is the capital of India?", + "Books Weary with toil, I haste me to my bed,", + ] + for text in input_text_list: + demo.generate(text) diff --git a/examples/text_generation/ernie-gen b/examples/text_generation/ernie-gen new file mode 100644 index 0000000000000000000000000000000000000000..9fcf590439dda626e45d037afc967fc0f80b98d6 --- /dev/null +++ b/examples/text_generation/ernie-gen @@ -0,0 +1 @@ +../../model_zoo/ernie-gen/ \ No newline at end of file diff --git a/examples/text_generation/opt/README.md b/examples/text_generation/opt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5066072849447e63d3350057156d6cddf751a151 --- /dev/null +++ b/examples/text_generation/opt/README.md @@ -0,0 +1,62 @@ +# [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/pdf/1909.05858.pdf) + +## 摘要 + +Meta AI 实验室高调宣布,将开放自己的 OPT(Open Pretrained Transformer,预训练变换模型)预训练模型,并贡献出所有代码,此模型对标GPT3,从模型性能、多个下有任务以及小样本中都取得了与GPT-3可比的成绩,PaddleNLP也是及时接入此模型,各位开发者只需要简单的调用即可使用此大模型。 + +## 文本生成测试 +```sh +python demo.py +``` +模型生成使用到的参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。 +- `max_predict_len` 表示最大生成的句子长度。 + +## 生成结果样例 + +``` +input text: +Question:If x is 2 and y is 5, what is x+y? +Answer: 7 + +Question: if x is 12 and y is 9, what is x+y? +Answer:21 + +Question: if x is 3 and y is 4, what is x+y? + +output text: +Answer:7 + +Question: if x is +================================================== +input text: +a chat between a curious human and Statue of Liberty. +Human: What is your name? +Statue: I am statue of liberty. + +Human: where do you live? +Statue: New york city. + +Human: how long have you lived there? + +output text: +Statue: I have lived here for a long +================================================== +input text: +Chinese: 我想回家。 +English: I want to go home. + +Chinese: 我不知道。 +English: I don't know. + +Chinese: 我饿了。 +English: I am hungry. + +Chinese: 我累了。 + +output text: +English: I am tired. + +Chinese: +================================================== +``` diff --git a/examples/text_generation/opt/demo.py b/examples/text_generation/opt/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..2ab891ed6153b81547444a2a7a2bd937330f2fcb --- /dev/null +++ b/examples/text_generation/opt/demo.py @@ -0,0 +1,67 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle + +from paddlenlp.transformers.gpt.tokenizer import GPTTokenizer +from paddlenlp.transformers.opt.modeling import OPTForCausalLM +from paddlenlp.utils.log import logger + + +class Demo: + def __init__(self, model_name_or_path, max_predict_len=128): + self.tokenizer = GPTTokenizer.from_pretrained(model_name_or_path) + logger.info("Loading the model parameters, please wait...") + self.model = OPTForCausalLM.from_pretrained(model_name_or_path, load_state_as_np=True) + self.model.eval() + self.max_predict_len = max_predict_len + logger.info("Model loaded.") + + @paddle.no_grad() + def generate(self, inputs): + ids = self.tokenizer(inputs)["input_ids"] + input_ids = paddle.to_tensor([ids], dtype="int64") + outputs = self.model.generate(input_ids, max_length=self.max_predict_len)[0][0] + decode_outputs = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(outputs.cpu())) + + print(f"input text: \n{inputs}") + print(f"output text: \n{decode_outputs}") + print("=" * 50) + + +if __name__ == "__main__": + + demo = Demo(model_name_or_path="facebook/opt-1.3b", max_predict_len=10) + input_text_list = [ + "Question:If x is 2 and y is 5, what is x+y?\n" + "Answer: 7\n\n" + "Question: if x is 12 and y is 9, what is x+y?\n" + "Answer:21\n\n" + "Question: if x is 3 and y is 4, what is x+y?\n", + "a chat between a curious human and Statue of Liberty.\n" + "Human: What is your name?\n" + "Statue: I am statue of liberty.\n\n" + "Human: where do you live?\n" + "Statue: New york city.\n\n" + "Human: how long have you lived there?\n", + "Chinese: 我想回家。\n" + "English: I want to go home.\n\n" + "Chinese: 我不知道。\n" + "English: I don't know.\n\n" + "Chinese: 我饿了。\n" + "English: I am hungry.\n\n" + "Chinese: 我累了。\n", + ] + for text in input_text_list: + demo.generate(text) diff --git a/examples/text_generation/reformer/README.md b/examples/text_generation/reformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8b7d8260a450a2cc58edb1ce2ec5f2bebf228823 --- /dev/null +++ b/examples/text_generation/reformer/README.md @@ -0,0 +1,19 @@ +# [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) + +## 摘要 +大型 Transformer 模型通常会在许多任务上获得最先进的结果,但训练这些模型的成本可能高得惊人,尤其是在长序列上。 我们介绍了两种技术来提高 Transformer 的效率。 一方面,我们将点积注意力替换为使用局部敏感哈希的注意力,将其复杂度从 O(L²) 降为 O(LlogL),其中 L 是序列的长度。 此外,我们使用可逆残差层而不是标准残差,这允许在训练过程中仅存储一次激活而不是 N 次,其中 N 是层数。 生成的模型,Reformer,与 Transformer 模型的性能相当,同时在长序列上的内存效率更高,速度更快。 + +## 文本生成测试 +```sh +python demo.py +``` +模型生成使用到的参数释义如下: +- `decode_strategy` 解码策略,可选择`greedy_search`和`sampling`。 +- `max_predict_len` 表示最大生成的句子长度。 +- `repetition_penalty` 表示生成重复token的惩罚参数,详细信息可查看[这篇论文](https://arxiv.org/pdf/1909.05858.pdf)。 + +## 生成结果样例 + +``` +In 1965, Brooks left IBM to found the Department of Defense. The Department was able to convince the Department to resign from the Department's constitutional amendments to the Department of Defense.\n\n +``` diff --git a/examples/text_generation/reformer/demo.py b/examples/text_generation/reformer/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..3af5cad2f50e11b4d2e38a9503b7a207bf2f18a4 --- /dev/null +++ b/examples/text_generation/reformer/demo.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddlenlp.transformers import ReformerModelWithLMHead + + +# Encoding +def encode(list_of_strings, pad_token_id=0): + max_length = max([len(string) for string in list_of_strings]) + + # create emtpy tensors + attention_masks = paddle.zeros((len(list_of_strings), max_length), dtype="int64") + input_ids = paddle.full((len(list_of_strings), max_length), pad_token_id, dtype="int64") + + for idx, string in enumerate(list_of_strings): + # make sure string is in byte format + if not isinstance(string, bytes): + string = str.encode(string) + + input_ids[idx, : len(string)] = paddle.to_tensor([x + 2 for x in string], dtype="int64") + attention_masks[idx, : len(string)] = 1 + + return input_ids, attention_masks + + +# Decoding +def decode(outputs_ids): + decoded_outputs = [] + for output_ids in outputs_ids.tolist(): + # transform id back to char IDs < 2 are simply transformed to "" + decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids])) + return decoded_outputs + + +if __name__ == "__main__": + model = ReformerModelWithLMHead.from_pretrained("reformer-enwik8") + model.eval() + encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"]) + output = decode( + model.generate(encoded, decode_strategy="greedy_search", max_length=150, repetition_penalty=1.2)[0] + ) + print(output) + # expected: + # [" Defense. 
The Department was able to convince the Department to resign from the Department's constitutional amendments to the Department of Defense.\n\n"] diff --git a/examples/text_generation/unimo-text/README.md b/examples/text_generation/unimo-text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..135d9957dcc07a76d10b22333b7f82194606730d --- /dev/null +++ b/examples/text_generation/unimo-text/README.md @@ -0,0 +1,144 @@ +# 千言:面向事实一致性的生成评测比赛基线 + +## 比赛简介 + +自然语言生成旨在让机器能够像人一样使用自然语言进行表达和交互,它是人工智能领域重要的前沿课题,也是全球热点技术AIGC(AI Generated Content,人工智能内容生成)的核心问题之一。 + +随着神经网络生成模型特别是预训练语言模型的迅速发展,机器生成文本的可读性和流畅性不断提升。然而,自动生成的文本中依然经常出现不符合原文或背景的错误事实描述,这种生成的事实一致性问题是自然语言生成进行落地应用的主要障碍之一,并逐渐受到研究学者的关注。鉴于当前国内外关于事实一致性的生成评测比赛十分匮乏,为了促进自然语言生成的技术发展和实际应用,[千言](https://www.luge.ai/#/)组织了面向事实一致性的生成评测比赛。 + +第一届面向事实一致性的生成评测比赛,一共吸引了577名高校、企业的参赛者,其中有57支参赛队提交了有效的正式赛结果,30支参赛队自动评测指标超过基线系统,在排名Top10的队伍中,收到9份参赛系统总结报告。在正式赛的人工评估过程中,我们进一步确认了事实一致性问题的广泛存在性,并且通过与参赛队伍的深入交流,也积累了更多对于事实一致性自动和人工评测的宝贵经验。 + +2023年,千言举办[第二届面向事实一致性的生成评测比赛](https://aistudio.baidu.com/aistudio/competition/detail/726/0/introduction),在数据集、自动评测指标等方面均有升级。在此比赛中,将提供三个对事实一致性有较高要求的生成任务,包括文案生成、摘要生成和对话生成。同时,在系统评价中,将结合文本流畅性和事实一致性两项指标综合评估参赛生成系统的水平,同时进一步提升事实一致性评测指标的先进性和丰富性。通过这样的任务设定和评价方式,此评测将有助于研究者和开发者更多关注自然语言生成的事实一致性难题,并为大家提供学术交流平台,从而进一步提升自然语言生成的研究水平,推动相关技术的应用发展。 + +本比赛得到中国中文信息学会自然语言生成与智能写作专业委员会(筹)支持,将在2023年7月16日第二届中国自然语言生成与智能写作大会(NLGIW 2023)召开评测研讨会,并在大会上对获奖团队颁奖。 + +## 模型简介 +本次比赛提供的基线系统,基于百度提出的ERNIE-UNIMO统一模态预训练框架。在本次比赛的三个文本生成任务中,我们基于本基线使用的模型是UNIMO-text,是基于[ERNIE-UNIMO](https://arxiv.org/pdf/2012.15409.pdf)框架在文本数据上预训练得到模型。 + +## 快速开始 + +本基线基于 **PaddleNLP 2.0.8** 版本,该版本包含了基线使用的最新版UNIMO-text模型以及升级后的生成API。更多详细升级信息请查看[Release Note](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.0.8)。请选手们**升级PaddleNLP后使用**。 + +### 数据准备 + +比赛使用三个任务数据集测试参赛系统的生成能力,包括文案生成(AdvertiseGen)、摘要生成(LCSTS_new)和问题生成(DuReaderQG): + +- 文案生成根据结构化的商品信息生成合适的广告文案; +- 摘要生成是为输入文档生成简洁且包含关键信息的简洁文本; +- 问题生成则是根据给定段落以及答案生成适合的问题。 + +为了方便用户快速使用基线,PaddleNLP Dataset API内置了数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds = load_dataset('dureader_qg', splits=('train', 'dev')) +``` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. 
+├── run_gen.py # 模型finetune主程序入口 +├── gen_utils.py # 定义参数及一些工具函数 +├── scripts # 三个任务的基线训练脚本 +└── README.md # 文档说明 +``` + +### 模型训练 + +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。也可以使用./scripts目录下面的训练脚本分别启动三个任务的训练。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir ./log run_gen.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=./unimo/checkpoints \ + --logging_steps=100 \ + --save_steps=100000 \ + --epochs=6 \ + --batch_size=16 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_train \ + --do_predict \ + --device=gpu +``` + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `dataset_name` 数据集名称,`dureader_qg`、`advertisegen`和`lcsts_new`分别对应问题生成、文案生成和摘要生成三个任务。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `predict_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unimo-text-1.0 | + | unimo-text-1.0-large | + +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_predict` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +更多参数详情和参数的默认值请参考`args.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./checkpoints/ +├── model_8000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python run_gen.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=your_model_path \ + --logging_steps=100 \ + --batch_size=16 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --device=gpu +``` + +程序运行结束后会将预测结果保存在`output_path`中。将预测结果准备成比赛官网要求的格式,提交评估即可得评估结果。 + +Finetuned baseline的模型在各任务验证集上有如下结果(指标为BLEU-4): + +| model_name | LCSTS_new | DuLeMon | AdvertiseGen | +| :-----------------------------: | :---: | :-----------: | :-------------------: | +| finetuned unimo-text-1.0 | 18.82 | 5.52 | 10.03 | diff --git a/examples/text_generation/unimo-text/gen_utils.py b/examples/text_generation/unimo-text/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b69f1830754d5d977808bbc2c733bcf59f0619b3 --- /dev/null +++ b/examples/text_generation/unimo-text/gen_utils.py @@ -0,0 +1,187 @@ +import random +from functools import partial + +import numpy as np + +import paddle +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example(example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="train"): + """Convert all examples into necessary features.""" + source = example["source"] + title = None + if "title" in example.keys(): + title = example["title"] + + if mode != "test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=example["target"], + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + else: + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # 
dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode != "test": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + else: + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + max_title_len=args.max_title_len, + mode=mode, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + + # TODO: Support return scores in FT. 
+ tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/text_generation/unimo-text/run_gen.py b/examples/text_generation/unimo-text/run_gen.py new file mode 100644 index 0000000000000000000000000000000000000000..ad172403ce5990be3ccc41790f86755fa4ff393c --- /dev/null +++ b/examples/text_generation/unimo-text/run_gen.py @@ -0,0 +1,242 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from gen_utils import create_data_loader, print_args, select_sum, set_seed +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import ( + BasicTokenizer, + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--learning_rate', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_proportion', type=float, default=0.02, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The max value of grad norm.') + parser.add_argument('--beta1', type=float, default=0.9, help='beta1') + parser.add_argument('--beta2', type=float, default=0.98, help='beta2') + parser.add_argument('--epsilon', type=float, default=1e-6, help='epsilon') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The 
maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_train", action='store_true', help="Whether to train the model.") + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + bleu4 = BLEU(n_size=4) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu4.add_inst(pred_tokens, [target_token]) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-4:", bleu4.score()) + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + train_ds = load_dataset(args.dataset_name, splits="train", data_files=args.train_file) + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_train: + num_training_steps = args.epochs * len(train_data_loader) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
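        # Note: the name list built below is passed to AdamW through `apply_decay_param_fun`,
        # so weight decay is applied only to parameters whose names contain neither
        # "bias" nor "norm"; all other parameters are left undecayed.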
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step >= num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + batch_start_time = time.time() + + print("\nTraining completed.") + elif args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + targets = [example["target"] for example in data_loader.dataset] + calc_bleu(pred_ref, targets) + + model.train() + return + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/text_generation/unimo-text/scripts/lcsts_train.sh b/examples/text_generation/unimo-text/scripts/lcsts_train.sh new file mode 100644 index 
0000000000000000000000000000000000000000..16472c28edec66d8ac3ba515822e99f4e15ffb7f --- /dev/null +++ b/examples/text_generation/unimo-text/scripts/lcsts_train.sh @@ -0,0 +1,39 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=./lcsts-log +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} run_gen.py \ + --dataset_name=lcsts_new \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=10000 \ + --epochs=6 \ + --batch_size=64 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=360 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_predict \ + --device=gpu >> ${log_dir}/lanch.log 2>&1 diff --git a/examples/text_generation/unimo-text/scripts/qg_train.sh b/examples/text_generation/unimo-text/scripts/qg_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..88e8c4ec8725599ce211f380843e86f36d4de427 --- /dev/null +++ b/examples/text_generation/unimo-text/scripts/qg_train.sh @@ -0,0 +1,39 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=./qg-log +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} run_gen.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=10 \ + --save_steps=1000 \ + --epochs=6 \ + --batch_size=8 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=360 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_predict \ + --device=gpu >> ${log_dir}/lanch.log 2>&1 diff --git a/examples/text_generation/unimo-text/scripts/table_train.sh b/examples/text_generation/unimo-text/scripts/table_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..4865358949d550a273be7410f794f641397c7269 --- /dev/null +++ b/examples/text_generation/unimo-text/scripts/table_train.sh @@ -0,0 +1,39 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=./table-log +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} run_gen.py \ + --dataset_name=advertisegen \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=1000 \ + --epochs=6 \ + --batch_size=8 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=200 \ + --max_dec_len=200 \ + --min_dec_len=10 \ + --do_train \ + --do_predict \ + --device=gpu >> ${log_dir}/lanch.log 2>&1 diff --git a/examples/text_generation/vae-seq2seq/README.md b/examples/text_generation/vae-seq2seq/README.md new file mode 100644 index 0000000000000000000000000000000000000000..11e26d67ace3adb0c24fe86e6c9fe8555cc88f63 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/README.md @@ -0,0 +1,132 @@ +# Variational Autoencoder (VAE) for Text Generation +以下是本范例模型的简要目录结构及说明: + +```text +. +├── README.md # 文档 +├── args.py # 训练、预测以及模型参数配置程序 +├── data.py # 数据读入程序 +├── train.py # 训练主程序 +├── predict.py # 预测主程序 +└── model.py # VAE模型组网部分,以及Metric等 +``` + +## 简介 + +本目录下此范例模型的实现,旨在展示如何用Paddle构建用于文本生成的VAE示例,其中LSTM作为编码器和解码器。分别对PTB数据集和Yahoo Answer(采样100k)数据集进行训练。 + +关于VAE的详细介绍参照: [(Bowman et al., 2015) Generating Sentences from a Continuous Space](https://arxiv.org/pdf/1511.06349.pdf) + +## 数据介绍 + +本教程使用了两个文本数据集: + +PTB数据集由华尔街日报的文章组成,包含929k个训练tokens,词汇量为10k。下载地址为: [PTB](https://dataset.bj.bcebos.com/imikolov%2Fsimple-examples.tgz)。 + +Yahoo数据集来自[(Yang et al., 2017) Improved Variational Autoencoders for Text Modeling using Dilated Convolutions](https://arxiv.org/pdf/1702.08139.pdf),该数据集从原始Yahoo Answer数据中采样100k个文档,数据集的平均文档长度为78,词汇量为200k。下载地址为:[YahooAnswer100k](https://bj.bcebos.com/paddlenlp/datasets/yahoo-answer-100k.tar.gz),运行本例程序后,数据集会自动下载到`~/.paddlenlp/datasets/YahooAnswer100k`目录下。 + + +## 模型训练 + +如果使用ptb数据集训练,可以通过下面命令配置: + +``` +export CUDA_VISIBLE_DEVICES=0 +python train.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset ptb \ + --model_path ptb_model\ + --device gpu \ + --max_epoch 50 \ + +``` + +如果需要多卡运行,可以运行如下命令: + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" train.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset ptb \ + --model_path ptb_model \ + --device gpu \ + --max_epoch 50 \ + +``` + +如果需要使用yahoo数据集进行多卡运行,可以将参数配置如下: + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" train.py \ + --batch_size 32 \ + --embed_dim 512 \ + --hidden_size 550 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset yahoo \ + --model_path yahoo_model \ + --device gpu \ + --max_epoch 50 \ + +``` + +**NOTE:** 如需恢复模型训练,则`init_from_ckpt`只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt ptb_model/49`即可,程序会自动加载模型参数`ptb_model/49.pdparams`,也会自动加载优化器状态`ptb_model/49.pdopt`。 + + +## 模型预测 + 
+当模型训练完成之后,可以选择加载模型保存目录下的第 50 个epoch的模型进行预测,生成batch_size条短文本。生成的文本位于参数`infer_output_file`指定的路径下。如果使用ptb数据集,可以通过下面命令配置: + +``` +export CUDA_VISIBLE_DEVICES=0 +python predict.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset ptb \ + --device gpu \ + --infer_output_file infer_output.txt \ + --init_from_ckpt ptb_model/49 \ + +``` + +使用yahoo数据集,需要配置embed_dim和hidden_size: + +``` +python predict.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --embed_dim 512 \ + --hidden_size 550 \ + --max_grad_norm 5.0 \ + --dataset yahoo \ + --device gpu \ + --infer_output_file infer_output.txt \ + --init_from_ckpt yahoo_model/49 \ + +``` + +## 效果评价 + +||Test PPL|Test NLL| +|:-|:-:|:-:| +|ptb dataset|108.71|102.76| +|yahoo dataset|78.38|349.48| + + +## 生成样例 + +shareholders were spent about N shares to spend $ N million to ual sell this trust stock last week + +new york stock exchange composite trading trading outnumbered closed at $ N a share down N cents + +the company cited pressure to pursue up existing facilities in the third quarter was for <unk> and four N million briefly stocks for so-called unusual liability + +people had <unk> down out the kind of and much why your relationship are anyway + +there are a historic investment giant chips which ran the <unk> benefit the attempting to original maker diff --git a/examples/text_generation/vae-seq2seq/args.py b/examples/text_generation/vae-seq2seq/args.py new file mode 100644 index 0000000000000000000000000000000000000000..aa26a23dad7137dece13ff12af4c85a13d34ff7c --- /dev/null +++ b/examples/text_generation/vae-seq2seq/args.py @@ -0,0 +1,68 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--dataset", type=str, help="Dataset name. 
Now ptb|yahoo is supported.") + + parser.add_argument("--learning_rate", type=float, default=0.001, help="Learning rate of optimizer.") + + parser.add_argument("--num_layers", type=int, default=1, help="The number of layers of encoder and decoder.") + + parser.add_argument("--embed_dim", type=int, default=256, help="Embedding dim of encoder and decoder.") + + parser.add_argument("--hidden_size", type=int, default=256, help="Hidden size of encoder and decoder.") + + parser.add_argument("--latent_size", type=int, default=32, help="Latent size of Variational Auto Encoder.") + + parser.add_argument("--batch_size", type=int, help="Batch size.") + + parser.add_argument("--max_epoch", type=int, default=20, help="Max epoch of training.") + + parser.add_argument("--max_len", type=int, default=1280, help="Max length of source and target sentence.") + + parser.add_argument("--log_freq", type=int, default=200, help="Log frequency") + + parser.add_argument("--dec_dropout", type=float, default=0.5, help="Drop probability of decoder") + + parser.add_argument("--enc_dropout", type=float, default=0.0, help="Drop probability of encoder.") + + parser.add_argument("--init_scale", type=float, default=0.0, help="Init scale for parameter.") + + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="Max grad norm of global norm clip.") + + parser.add_argument("--model_path", type=str, default="model", help="Model path for model to save.") + + parser.add_argument( + "--infer_output_file", type=str, default="infer_output.txt", help="File name to save inference output." + ) + + parser.add_argument("--beam_size", type=int, default=1, help="Beam size for Beam search.") + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + parser.add_argument("--warm_up", type=int, default=10, help="The number of warm up epoch for KL.") + + parser.add_argument("--kl_start", type=float, default=0.1, help="KL start value, up to 1.0.") + + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + + args = parser.parse_args() + return args diff --git a/examples/text_generation/vae-seq2seq/data.py b/examples/text_generation/vae-seq2seq/data.py new file mode 100644 index 0000000000000000000000000000000000000000..8fd17ca1c211cbb0eba16d9dec9978e151f3a306 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/data.py @@ -0,0 +1,92 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
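上面 args.py 中的 `--kl_start` 与 `--warm_up` 共同决定 KL 项权重的退火节奏:train.py 会据此计算 `anneal_r`,model.py 中的 `CrossEntropyWithKL.update_kl_weight` 再逐步累加。下面是一个独立的小示意(其中 `steps_per_epoch` 等具体数值仅作说明,并非项目配置):

```python
# Standalone sketch of the KL-weight schedule: anneal_r = 1 / (warm_up * steps_per_epoch),
# and the weight grows by anneal_r every step until it reaches 1.0
# (mirroring train.py and CrossEntropyWithKL.update_kl_weight shown later in this diff).
def kl_weight_schedule(kl_start, warm_up, steps_per_epoch):
    anneal_r = 1.0 / (warm_up * steps_per_epoch)
    weight = kl_start
    while weight < 1.0:
        weight = min(1.0, weight + anneal_r)
        yield weight

weights = list(kl_weight_schedule(kl_start=0.1, warm_up=10, steps_per_epoch=100))
print(len(weights), weights[-1])  # about 900 steps to ramp from 0.1 up to 1.0
```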
+ +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import Pad, SamplerHelper, Vocab +from paddlenlp.datasets import load_dataset + + +def create_data_loader(args): + batch_size = args.batch_size + max_len = args.max_len + if args.dataset == "yahoo": + train_ds, dev_ds, test_ds = load_dataset("yahoo_answer_100k", splits=("train", "valid", "test")) + vocab = Vocab.load_vocabulary(**train_ds.vocab_info) + else: + train_ds, dev_ds, test_ds = load_dataset("ptb", splits=("train", "valid", "test")) + examples = [train_ds[i]["sentence"].split() for i in range(len(train_ds))] + vocab = Vocab.build_vocab(examples) + + vocab_size = len(vocab) + bos_id = vocab_size + eos_id = vocab_size + 1 + pad_id = vocab_size + 1 + + def convert_example(example): + features = vocab.to_indices(example["sentence"].split()[:max_len]) + return features + + key = lambda x, data_source: len(data_source[x]) + # Truncate and convert example to ids + train_ds = train_ds.map(convert_example, lazy=False) + dev_ds = dev_ds.map(convert_example, lazy=False) + test_ds = test_ds.map(convert_example, lazy=False) + + train_batch_sampler = ( + SamplerHelper(train_ds).shuffle().sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + ) + + dev_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + + # test_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + + train_loader = paddle.io.DataLoader( + train_ds, + batch_sampler=train_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + dev_loader = paddle.io.DataLoader( + dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + test_loader = paddle.io.DataLoader( + test_ds, + batch_sampler=dev_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + return train_loader, dev_loader, test_loader, vocab, bos_id, pad_id, len(train_ds) + + +def prepare_train_input(insts, bos_id, eos_id, pad_id): + # Add eos token id and bos token id. + src = [[bos_id] + inst + [eos_id] for inst in insts] + trg = [inst[:-1] for inst in insts] + label = [inst[1:] for inst in insts] + + # Pad sequence using eos id. + src, src_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([ids for ids in src]) + trg, trg_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([ids for ids in trg]) + label, _ = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([ids for ids in label]) + + label = np.array(label) + label = label.reshape((label.shape[0], label.shape[1], 1)) + return src, src_length, trg, trg_length, label diff --git a/examples/text_generation/vae-seq2seq/model.py b/examples/text_generation/vae-seq2seq/model.py new file mode 100644 index 0000000000000000000000000000000000000000..fabed6164a9e67cc4c8249dc6da39bcf34d22926 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/model.py @@ -0,0 +1,356 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.nn.initializer as I + + +class CrossEntropyWithKL(nn.Layer): + """ + backward_loss = kl_loss * kl_weight + cross_entropy_loss + """ + + def __init__(self, base_kl_weight, anneal_r): + super(CrossEntropyWithKL, self).__init__() + self.kl_weight = base_kl_weight + self.anneal_r = anneal_r + self.loss = 0.0 + self.kl_loss = 0.0 + self.rec_loss = 0.0 + + def update_kl_weight(self): + self.kl_weight = min(1.0, self.kl_weight + self.anneal_r) + + def forward(self, kl_loss, dec_output, trg_mask, label): + self.update_kl_weight() + self.kl_loss = kl_loss + + rec_loss = F.cross_entropy(input=dec_output, label=label, reduction="none", soft_label=False) + + rec_loss = paddle.squeeze(rec_loss, axis=[2]) + rec_loss = rec_loss * trg_mask + rec_loss = paddle.mean(rec_loss, axis=[0]) + rec_loss = paddle.sum(rec_loss) + self.rec_loss = rec_loss + + self.loss = self.kl_loss * self.kl_weight + self.rec_loss + return self.loss + + +class Perplexity(paddle.metric.Metric): + def __init__(self, name="ppl", reset_freq=100, *args, **kwargs): + self.cross_entropy = kwargs.pop("loss") + super(Perplexity, self).__init__(*args, **kwargs) + self._name = name + self.total_ce = 0 + self.word_count = 0 + self.reset_freq = reset_freq + self.batch_size = 0 + + def update(self, kl_loss, dec_output, trg_mask, label, *args): + # Perplexity is calculated using cross entropy + self.batch_size = dec_output.shape[0] + loss = self.cross_entropy.loss.numpy() + self.total_ce += loss[0] * self.batch_size + self.word_count += np.sum(trg_mask) + + def reset(self): + self.total_ce = 0 + self.word_count = 0 + + def accumulate(self): + return np.exp(self.total_ce / self.word_count) + + def name(self): + return self._name + + +class NegativeLogLoss(paddle.metric.Metric): + def __init__(self, name="nll", reset_freq=100, *args, **kwargs): + self.cross_entropy = kwargs.pop("loss") + super(NegativeLogLoss, self).__init__(*args, **kwargs) + self._name = name + self.total_ce = 0 + self.batch_count = 0 + self.reset_freq = reset_freq + self.batch_size = 0 + self.sample_count = 0 + + def update(self, kl_loss, dec_output, trg_mask, label, *args): + self.batch_size = dec_output.shape[0] + loss = self.cross_entropy.loss.numpy() + self.total_ce += loss[0] * self.batch_size + self.sample_count += self.batch_size + + def reset(self): + self.total_ce = 0 + self.sample_count = 0 + + def accumulate(self): + return self.total_ce / self.sample_count + + def name(self): + return self._name + + +class TrainCallback(paddle.callbacks.ProgBarLogger): + def __init__(self, ppl, nll, log_freq=200, verbose=2): + super(TrainCallback, self).__init__(log_freq, verbose) + self.ppl = ppl + self.nll = nll + + def on_train_begin(self, logs=None): + super(TrainCallback, self).on_train_begin(logs) + self.train_metrics = ["loss", "ppl", "nll", "kl weight", "kl loss", "rec loss"] + + def on_epoch_begin(self, epoch=None, logs=None): + super(TrainCallback, self).on_epoch_begin(epoch, logs) + self.ppl.reset() + self.nll.reset() + + def on_train_batch_end(self, step, 
logs=None): + # loss and kl weight are not accumulated + logs["kl weight"] = self.ppl.cross_entropy.kl_weight + logs["kl loss"] = float(self.ppl.cross_entropy.kl_loss) + logs["rec loss"] = float(self.ppl.cross_entropy.rec_loss) + super(TrainCallback, self).on_train_batch_end(step, logs) + + def on_eval_begin(self, logs=None): + super(TrainCallback, self).on_eval_begin(logs) + self.eval_metrics = ["loss", "ppl", "nll"] + + def on_eval_batch_end(self, step, logs=None): + super(TrainCallback, self).on_eval_batch_end(step, logs) + + +class LSTMEncoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, init_scale=0.1, enc_dropout=0.0): + super(LSTMEncoder, self).__init__() + self.src_embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size, num_layers=num_layers, dropout=enc_dropout) + if enc_dropout > 0.0: + self.dropout = nn.Dropout(enc_dropout) + else: + self.dropout = None + + def forward(self, src, src_length): + src_emb = self.src_embedder(src) + + if self.dropout: + src_emb = self.dropout(src_emb) + enc_output, enc_final_state = self.lstm(src_emb, sequence_length=src_length) + if self.dropout: + enc_output = self.dropout(enc_output) + + enc_final_state = [[h, c] for h, c in zip(enc_final_state[0], enc_final_state[1])] + return enc_output, enc_final_state + + +class LSTMDecoderCell(nn.Layer): + def __init__(self, num_layers, embed_dim, hidden_size, latent_size, dropout=None): + super(LSTMDecoderCell, self).__init__() + self.dropout = dropout + self.lstm_cells = nn.LayerList( + [nn.LSTMCell(input_size=embed_dim + latent_size, hidden_size=hidden_size) for i in range(num_layers)] + ) + + def forward(self, step_input, lstm_states, latent_z): + new_lstm_states = [] + step_input = paddle.concat([step_input, latent_z], 1) + for i, lstm_cell in enumerate(self.lstm_cells): + out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) + if self.dropout: + step_input = self.dropout(out) + else: + step_input = out + new_lstm_states.append(new_lstm_state) + if self.dropout: + step_input = self.dropout(step_input) + out = step_input + return out, new_lstm_states + + +class LSTMDecoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, latent_size, num_layers, init_scale=0.1, dec_dropout=0.0): + super(LSTMDecoder, self).__init__() + self.num_layers = num_layers + self.embed_dim = embed_dim + self.hidden_size = hidden_size + self.latent_size = latent_size + self.trg_embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.output_fc = nn.Linear( + hidden_size, + vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + if dec_dropout > 0.0: + self.dropout = nn.Dropout(dec_dropout) + else: + self.dropout = None + + self.lstm = nn.RNN( + LSTMDecoderCell(self.num_layers, self.embed_dim, self.hidden_size, self.latent_size, self.dropout) + ) + + def forward(self, trg, dec_initial_states, latent_z): + trg_emb = self.trg_embedder(trg) + if self.dropout: + trg_emb = self.dropout(trg_emb) + lstm_output, _ = self.lstm(inputs=trg_emb, initial_states=dec_initial_states, latent_z=latent_z) + dec_output = self.output_fc(lstm_output) + return dec_output + + +class VAESeq2SeqModel(nn.Layer): + def __init__( + self, + embed_dim, + hidden_size, + latent_size, + vocab_size, + 
num_layers=1, + init_scale=0.1, + PAD_ID=0, + enc_dropout=0.0, + dec_dropout=0.0, + ): + super(VAESeq2SeqModel, self).__init__() + self.PAD_ID = PAD_ID + self.latent_size = latent_size + self.vocab_size = vocab_size + self.num_layers = num_layers + self.hidden_size = hidden_size + self.encoder = LSTMEncoder(vocab_size, embed_dim, hidden_size, num_layers, init_scale, enc_dropout) + self.decoder = LSTMDecoder( + vocab_size, embed_dim, hidden_size, latent_size, num_layers, init_scale, dec_dropout + ) + self.distributed_fc = nn.Linear( + hidden_size * 2, + latent_size * 2, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + self.fc = nn.Linear( + latent_size, + 2 * hidden_size * num_layers, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + def sampling(self, z_mean, z_log_var): + """ + Reparameterization trick + """ + # By default, random_normal has mean=0 and std=1.0 + epsilon = paddle.normal(shape=(z_mean.shape[0], self.latent_size)) + epsilon.stop_gradient = True + return z_mean + paddle.exp(0.5 * z_log_var) * epsilon + + def build_distribution(self, enc_final_state=None): + enc_hidden = [paddle.concat(state, axis=-1) for state in enc_final_state] + + enc_hidden = paddle.concat(enc_hidden, axis=-1) + z_mean_log_var = self.distributed_fc(enc_hidden) + z_mean, z_log_var = paddle.split(z_mean_log_var, 2, -1) + return z_mean, z_log_var + + def calc_kl_dvg(self, means, logvars): + """ + Compute the KL divergence between Gaussian distribution + """ + kl_cost = -0.5 * (logvars - paddle.square(means) - paddle.exp(logvars) + 1.0) + kl_cost = paddle.mean(kl_cost, 0) + + return paddle.sum(kl_cost) + + def forward(self, src, src_length, trg, trg_length): + # Encoder + _, enc_final_state = self.encoder(src, src_length) + + # Build distribution + z_mean, z_log_var = self.build_distribution(enc_final_state) + + # Decoder + latent_z = self.sampling(z_mean, z_log_var) + + dec_first_hidden_cell = self.fc(latent_z) + dec_first_hidden, dec_first_cell = paddle.split(dec_first_hidden_cell, 2, axis=-1) + if self.num_layers > 1: + dec_first_hidden = paddle.split(dec_first_hidden, self.num_layers) + dec_first_cell = paddle.split(dec_first_cell, self.num_layers) + else: + dec_first_hidden = [dec_first_hidden] + dec_first_cell = [dec_first_cell] + dec_initial_states = [[h, c] for h, c in zip(dec_first_hidden, dec_first_cell)] + + dec_output = self.decoder(trg, dec_initial_states, latent_z) + + kl_loss = self.calc_kl_dvg(z_mean, z_log_var) + trg_mask = (self.PAD_ID != trg).astype(paddle.get_default_dtype()) + return kl_loss, dec_output, trg_mask + + +class VAESeq2SeqInferModel(VAESeq2SeqModel): + def __init__( + self, embed_dim, hidden_size, latent_size, vocab_size, start_token=1, end_token=2, beam_size=1, max_out_len=100 + ): + self.start_token = start_token + self.end_token = end_token + self.beam_size = beam_size + self.max_out_len = max_out_len + super(VAESeq2SeqInferModel, self).__init__(embed_dim, hidden_size, latent_size, vocab_size) + + def forward(self, trg): + # Encoder + latent_z = paddle.normal(shape=(trg.shape[0], self.latent_size)) + dec_first_hidden_cell = self.fc(latent_z) + dec_first_hidden, dec_first_cell = paddle.split(dec_first_hidden_cell, 2, axis=-1) + if self.num_layers > 1: + dec_first_hidden = paddle.split(dec_first_hidden, self.num_layers) + dec_first_cell = paddle.split(dec_first_cell, self.num_layers) + else: + dec_first_hidden = [dec_first_hidden] + dec_first_cell = [dec_first_cell] + 
dec_initial_states = [[h, c] for h, c in zip(dec_first_hidden, dec_first_cell)] + + output_fc = lambda x: F.one_hot( # noqa: E731 + paddle.multinomial(F.softmax(paddle.squeeze(self.decoder.output_fc(x), [1]))), num_classes=self.vocab_size + ) + + latent_z = nn.BeamSearchDecoder.tile_beam_merge_with_batch(latent_z, self.beam_size) + + decoder = nn.BeamSearchDecoder( + cell=self.decoder.lstm.cell, + start_token=self.start_token, + end_token=self.end_token, + beam_size=self.beam_size, + embedding_fn=self.decoder.trg_embedder, + output_fn=output_fc, + ) + + outputs, _ = nn.dynamic_decode( + decoder, inits=dec_initial_states, max_step_num=self.max_out_len, latent_z=latent_z + ) + return outputs diff --git a/examples/text_generation/vae-seq2seq/predict.py b/examples/text_generation/vae-seq2seq/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..0c22b8abed7b1f603ce49b5fc6a40edafff98e47 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/predict.py @@ -0,0 +1,52 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io + +import numpy as np +import paddle +from args import parse_args +from data import create_data_loader +from model import VAESeq2SeqInferModel + + +def infer(args): + print(args) + paddle.set_device(args.device) + _, _, _, vocab, bos_id, eos_id, _ = create_data_loader(args) + + net = VAESeq2SeqInferModel(args.embed_dim, args.hidden_size, args.latent_size, len(vocab) + 2) + + model = paddle.Model(net) + model.prepare() + model.load(args.init_from_ckpt) + + infer_output = paddle.ones((args.batch_size, 1), dtype="int64") * bos_id + + space_token = " " + line_token = "\n" + with io.open(args.infer_output_file, "w", encoding="utf-8") as out_file: + predict_lines = model.predict_batch(infer_output)[0] + for line in predict_lines: + end_id = -1 + if eos_id in line: + end_id = np.where(line == eos_id)[0][0] + new_line = [vocab.to_tokens(e[0]) for e in line[:end_id]] + out_file.write(space_token.join(new_line)) + out_file.write(line_token) + + +if __name__ == "__main__": + args = parse_args() + infer(args) diff --git a/examples/text_generation/vae-seq2seq/train.py b/examples/text_generation/vae-seq2seq/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ec3acb3f3cbc6640e210486e186dd52d56f01e7c --- /dev/null +++ b/examples/text_generation/vae-seq2seq/train.py @@ -0,0 +1,77 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from args import parse_args +from data import create_data_loader +from model import ( + CrossEntropyWithKL, + NegativeLogLoss, + Perplexity, + TrainCallback, + VAESeq2SeqModel, +) + + +def train(args): + print(args) + paddle.set_device(args.device) + train_loader, dev_loader, test_loader, vocab, bos_id, pad_id, train_data_len = create_data_loader(args) + + net = VAESeq2SeqModel( + embed_dim=args.embed_dim, + hidden_size=args.hidden_size, + latent_size=args.latent_size, + vocab_size=len(vocab) + 2, + num_layers=args.num_layers, + init_scale=args.init_scale, + enc_dropout=args.enc_dropout, + dec_dropout=args.dec_dropout, + ) + + gloabl_norm_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + + anneal_r = 1.0 / (args.warm_up * train_data_len / args.batch_size) + cross_entropy = CrossEntropyWithKL(base_kl_weight=args.kl_start, anneal_r=anneal_r) + model = paddle.Model(net) + + optimizer = paddle.optimizer.Adam(args.learning_rate, parameters=model.parameters(), grad_clip=gloabl_norm_clip) + + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + ppl_metric = Perplexity(loss=cross_entropy) + nll_metric = NegativeLogLoss(loss=cross_entropy) + + model.prepare(optimizer=optimizer, loss=cross_entropy, metrics=[ppl_metric, nll_metric]) + + model.fit( + train_data=train_loader, + eval_data=dev_loader, + epochs=args.max_epoch, + save_dir=args.model_path, + shuffle=False, + callbacks=[TrainCallback(ppl_metric, nll_metric, args.log_freq)], + log_freq=args.log_freq, + ) + + # Evaluation + print("Start to evaluate on test dataset...") + model.evaluate(test_loader, log_freq=len(test_loader)) + + +if __name__ == "__main__": + args = parse_args() + train(args) diff --git a/examples/text_graph/erniesage/README.md b/examples/text_graph/erniesage/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a25780be05860d776182e0be89b7a3be13086284 --- /dev/null +++ b/examples/text_graph/erniesage/README.md @@ -0,0 +1,59 @@ +# 基于PaddleNLP的ErnieSage模型介绍 + +## 背景介绍 + +在很多工业应用中,往往出现如下图所示的一种特殊的图:Text Graph。顾名思义,图的节点属性由文本构成,而边的构建提供了结构信息。如搜索场景下的Text Graph,节点可由搜索词、网页标题、网页正文来表达,用户反馈和超链信息则可构成边关系。 +<img src="https://raw.githubusercontent.com/PaddlePaddle/PGL/static_stable/examples/erniesage/docs/source/_static/text_graph.png" alt="Text Graph" width="800"> + +**ErnieSage** 由飞桨PGL团队提出,是ERNIE SAmple aggreGatE的简称,该模型可以同时建模文本语义与图结构信息,有效提升 Text Graph 的应用效果。其中 [**ERNIE**](https://github.com/PaddlePaddle/ERNIE) 是百度推出的基于知识增强的持续学习语义理解框架。 + +**ErnieSage** 是 ERNIE 与 GraphSAGE 碰撞的结果,是 ERNIE SAmple aggreGatE 的简称,它的结构如下图所示,主要思想是通过 ERNIE 作为聚合函数(Aggregators),建模自身节点和邻居节点的语义与结构关系。ErnieSage 对于文本的建模是构建在邻居聚合的阶段,中心节点文本会与所有邻居节点文本进行拼接;然后通过预训练的 ERNIE 模型进行消息汇聚,捕捉中心节点以及邻居节点之间的相互关系;最后使用 ErnieSage 搭配独特的邻居互相看不见的 Attention Mask 和独立的 Position Embedding 体系,就可以轻松构建 TextGraph 中句子之间以及词之间的关系。 + +<img src="https://raw.githubusercontent.com/PaddlePaddle/PGL/static_stable/examples/erniesage/docs/source/_static/ernie_aggregator.png" alt="ERNIESage" width="800"> + +使用ID特征的GraphSAGE只能够建模图的结构信息,而单独的ERNIE只能处理文本信息。通过飞桨PGL搭建的图与文本的桥梁,**ErnieSage**能够很简单的把GraphSAGE以及ERNIE的优点结合一起。以下面TextGraph的场景,**ErnieSage**的效果能够比单独的ERNIE以及GraphSAGE模型都要好。 + +**ErnieSage**可以很轻松地在基于PaddleNLP构建基于Ernie的图神经网络,目前PaddleNLP提供了V2版本的ErnieSage模型: + +- **ErnieSage V2**: ERNIE 作用在text graph的边上; + +<img 
src="https://raw.githubusercontent.com/PaddlePaddle/PGL/static_stable/examples/erniesage/docs/source/_static/ERNIESage_v1_4.png" alt="ERNIESage_v1_4" width="800"> + +## 环境依赖 + +- pgl >= 2.1 +安装命令 `pip install pgl\>=2.1` + +## 数据准备 +示例数据```data.txt```中使用了NLPCC2016-DBQA的部分数据,格式为每行"query \t answer"。 +```text +NLPCC2016-DBQA 是由国际自然语言处理和中文计算会议 NLPCC 于 2016 年举办的评测任务,其目标是从候选中找到合适的文档作为问题的答案。[链接: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf] +``` + +## 如何运行 + +我们采用了[PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet)作为我们的分布式训练框架,在```config/*.yaml```中,目前支持的[ERNIE](https://github.com/PaddlePaddle/ERNIE)预训练语义模型包括**ernie**以及**ernie_tiny**,通过config/erniesage_link_prediction.yaml中的ernie_name指定。 + + +```sh +# 数据预处理,建图 +python ./preprocessing/dump_graph.py --conf ./config/erniesage_link_prediction.yaml +# GPU多卡或单卡模式ErnieSage +python -m paddle.distributed.launch --gpus "0" link_prediction.py --conf ./config/erniesage_link_prediction.yaml +# 对图节点的embeding进行预测, 单卡或多卡 +python -m paddle.distributed.launch --gpus "0" link_prediction.py --conf ./config/erniesage_link_prediction.yaml --do_predict +``` + +## 超参数设置 + +- epochs: 训练的轮数 +- graph_data: 训练模型时用到的图结构数据,使用“text1 \t text"格式。 +- train_data: 训练时的边,与graph_data格式相同,一般可以直接用graph_data。 +- graph_work_path: 临时存储graph数据中间文件的目录。 +- samples: 采样邻居数 +- model_type: 模型类型,包括ErnieSageV2。 +- ernie_name: 热启模型类型,支持“ernie”和"ernie_tiny",后者速度更快,指定该参数后会自动从服务器下载预训练模型文件。 +- num_layers: 图神经网络层数。 +- hidden_size: 隐藏层大小。 +- batch_size: 训练时的batchsize。 +- infer_batch_size: 预测时batchsize。 diff --git a/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml b/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml new file mode 100644 index 0000000000000000000000000000000000000000..970f5de365b7338142d276248a4bb3ccfbf8f354 --- /dev/null +++ b/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml @@ -0,0 +1,40 @@ +# Global Environment Settings + +# trainer config ------ +device: "gpu" # use cpu or gpu devices to train. +seed: 2020 + +task: "link_prediction" +model_name_or_path: "ernie-tiny" # ernie-tiny or ernie-1.0 avaiable +sample_workers: 1 +optimizer_type: "adam" +lr: 0.00005 +batch_size: 32 +CPU_NUM: 10 +epoch: 30 +log_per_step: 10 +save_per_step: 200 +output_path: "./output" + +# data config ------ +train_data: "./example_data/graph_data.txt" +graph_data: "./example_data/train_data.txt" +graph_work_path: "./graph_workdir" +input_type: "text" +encoding: "utf8" + +# model config ------ +samples: [10] +model_type: "ErnieSageV2" +max_seqlen: 40 +num_layers: 1 +hidden_size: 128 +final_fc: true +final_l2_norm: true +loss_type: "hinge" +margin: 0.1 +neg_type: "batch_neg" + +# infer config ------ +infer_model: "./output/last" +infer_batch_size: 128 diff --git a/examples/text_graph/erniesage/data/__init__.py b/examples/text_graph/erniesage/data/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..bd14d69548398963093d935c4c5bd7b6683716ec --- /dev/null +++ b/examples/text_graph/erniesage/data/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from data import dataset, graph_reader + +__all__ = [] +__all__ += dataset.__all__ +__all__ += graph_reader.__all__ diff --git a/examples/text_graph/erniesage/data/dataset.py b/examples/text_graph/erniesage/data/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..2a3733851e63b63d82773e28725dd8f98b4bf791 --- /dev/null +++ b/examples/text_graph/erniesage/data/dataset.py @@ -0,0 +1,115 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import paddle +import pgl +from paddle.io import Dataset +from pgl.sampling import graphsage_sample + +__all__ = [ + "TrainData", + "PredictData", + "batch_fn", +] + + +class TrainData(Dataset): + def __init__(self, graph_work_path): + trainer_id = paddle.distributed.get_rank() + trainer_count = paddle.distributed.get_world_size() + print("trainer_id: %s, trainer_count: %s." % (trainer_id, trainer_count)) + + edges = np.load(os.path.join(graph_work_path, "train_data.npy"), allow_pickle=True) + # edges is bidirectional. 
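+        # Shard the edges across distributed workers with a strided slice:
+        # trainer k keeps rows k, k + trainer_count, k + 2 * trainer_count, ...,
+        # so every worker trains on a disjoint, evenly sized subset.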
+ train_src = edges[trainer_id::trainer_count, 0] + train_dst = edges[trainer_id::trainer_count, 1] + returns = {"train_data": [train_src, train_dst]} + + if os.path.exists(os.path.join(graph_work_path, "neg_samples.npy")): + neg_samples = np.load(os.path.join(graph_work_path, "neg_samples.npy"), allow_pickle=True) + if neg_samples.size != 0: + train_negs = neg_samples[trainer_id::trainer_count] + returns["train_data"].append(train_negs) + print("Load train_data done.") + self.data = returns + + def __getitem__(self, index): + return [data[index] for data in self.data["train_data"]] + + def __len__(self): + return len(self.data["train_data"][0]) + + +class PredictData(Dataset): + def __init__(self, num_nodes): + trainer_id = paddle.distributed.get_rank() + trainer_count = paddle.distributed.get_world_size() + self.data = np.arange(trainer_id, num_nodes, trainer_count) + + def __getitem__(self, index): + return [self.data[index], self.data[index]] + + def __len__(self): + return len(self.data) + + +def batch_fn(batch_ex, samples, base_graph, term_ids): + batch_src = [] + batch_dst = [] + batch_neg = [] + for batch in batch_ex: + batch_src.append(batch[0]) + batch_dst.append(batch[1]) + if len(batch) == 3: # default neg samples + batch_neg.append(batch[2]) + + batch_src = np.array(batch_src, dtype="int64") + batch_dst = np.array(batch_dst, dtype="int64") + if len(batch_neg) > 0: + batch_neg = np.unique(np.concatenate(batch_neg)) + else: + batch_neg = batch_dst + + nodes = np.unique(np.concatenate([batch_src, batch_dst, batch_neg], 0)) + subgraphs = graphsage_sample(base_graph, nodes, samples) + + subgraph, sample_index, node_index = subgraphs[0] + from_reindex = {int(x): i for i, x in enumerate(sample_index)} + + term_ids = term_ids[sample_index].astype(np.int64) + + sub_src_idx = pgl.graph_kernel.map_nodes(batch_src, from_reindex) + sub_dst_idx = pgl.graph_kernel.map_nodes(batch_dst, from_reindex) + sub_neg_idx = pgl.graph_kernel.map_nodes(batch_neg, from_reindex) + + user_index = np.array(sub_src_idx, dtype="int64") + pos_item_index = np.array(sub_dst_idx, dtype="int64") + neg_item_index = np.array(sub_neg_idx, dtype="int64") + + user_real_index = np.array(batch_src, dtype="int64") + pos_item_real_index = np.array(batch_dst, dtype="int64") + + return ( + np.array([subgraph.num_nodes], dtype="int32"), + subgraph.edges.astype("int32"), + term_ids, + user_index, + pos_item_index, + neg_item_index, + user_real_index, + pos_item_real_index, + ) diff --git a/examples/text_graph/erniesage/data/graph_reader.py b/examples/text_graph/erniesage/data/graph_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..ca82d5c78f66bad525f5d37314c7a1d020e9a2ee --- /dev/null +++ b/examples/text_graph/erniesage/data/graph_reader.py @@ -0,0 +1,59 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
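+
+# GraphDataLoader below wraps paddle.io.DataLoader; its construct() callback
+# takes the first two tensors of each batch produced by batch_fn (num_nodes,
+# edges), rebuilds them into a pgl.Graph, and returns that graph together with
+# the remaining term_ids / index tensors for the model.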
+ +import pgl +from paddle.io import DataLoader + +__all__ = ["GraphDataLoader"] + + +class GraphDataLoader(object): + def __init__(self, dataset, batch_size=1, shuffle=True, num_workers=1, collate_fn=None, **kwargs): + self.loader = DataLoader( + dataset=dataset, + batch_size=batch_size, + shuffle=shuffle, + num_workers=num_workers, + collate_fn=collate_fn, + **kwargs, + ) + + def __iter__(self): + func = self.__callback__() + for data in self.loader(): + yield func(data) + + def __call__(self): + return self.__iter__() + + def __callback__(self): + """callback function, for recontruct a dict or graph.""" + + def construct(tensors): + """tensor list to ([graph_tensor, graph_tensor, ...], + other tensor) + """ + graph_num = 1 + start_len = 0 + data = [] + graph_list = [] + for graph in range(graph_num): + graph_list.append(pgl.Graph(num_nodes=tensors[start_len], edges=tensors[start_len + 1])) + start_len += 2 + + for i in range(start_len, len(tensors)): + data.append(tensors[i]) + return graph_list, data + + return construct diff --git a/examples/text_graph/erniesage/example_data/graph_data.txt b/examples/text_graph/erniesage/example_data/graph_data.txt new file mode 100644 index 0000000000000000000000000000000000000000..e9aead6c89fa2fdbed64e5dada352106a8deb349 --- /dev/null +++ b/examples/text_graph/erniesage/example_data/graph_data.txt @@ -0,0 +1,1000 @@ +黑缘粗角肖叶甲触角有多大? 体长卵形,棕红色;鞘翅棕黄或淡棕色,外缘和中缝黑色或黑褐色;触角基部3、4节棕黄,余节棕色。 +黑缘粗角肖叶甲触角有多大? 头部刻点粗大,分布不均匀,头顶刻点十分稀疏;触角基部的内侧有一个三角形光瘤,唇基前缘呈半圆形凹切。 +黑缘粗角肖叶甲触角有多大? 触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。 +黑缘粗角肖叶甲触角有多大? 前胸背板横宽,宽约为长的两倍,侧缘敞出较宽,圆形,敞边与盘区之间有一条细纵沟;盘区刻点相当密,前半部刻点较大于后半部。 +黑缘粗角肖叶甲触角有多大? 小盾片舌形,光亮,末端圆钝。 +黑缘粗角肖叶甲触角有多大? 鞘翅刻点粗大,不规则排列,肩部之后的刻点更为粗大,具皱褶,近中缝的刻点较小,略呈纵行排列。 +黑缘粗角肖叶甲触角有多大? 前胸前侧片前缘直;前胸后侧片具粗大刻点。 +黑缘粗角肖叶甲触角有多大? 足粗壮;胫节具纵脊,外端角向外延伸,呈弯角状;爪具附齿。 +暮光闪闪的姐姐是谁? 暮光闪闪是一匹雌性独角兽,后来在神秘魔法的影响下变成了空角兽(公主),她是《我的小马驹:友情是魔法》(英文名:My Little Pony:Friendship is Magic)中的主角之一。 +暮光闪闪的姐姐是谁? 她是银甲闪闪(Shining Armor)的妹妹,同时也是韵律公主(Princess Cadance)的小姑子。 +暮光闪闪的姐姐是谁? 在该系列中,她与最好的朋友与助手斯派克(Spike)一起生活在小马镇(Ponyville)的金橡图书馆(Golden Oak Library),研究友谊的魔法。 +暮光闪闪的姐姐是谁? 在暮光闪闪成为天角兽之前(即S3E13前),常常给塞拉丝蒂娅公主(Princess Celestia)关于友谊的报告。[1] +暮光闪闪的姐姐是谁? 《我的小马驹:友谊是魔法》(英文名称:My Little Pony:Friendship is Magic)(简称MLP) +暮光闪闪的姐姐是谁? 动画讲述了一只名叫做暮光闪闪(Twilight Sparkle)的独角兽(在SE3E13 +暮光闪闪的姐姐是谁? My Little Pony:Friendship is Magic[2] +暮光闪闪的姐姐是谁? 后成为了天角兽),执行她的导师塞拉斯蒂娅公主(PrincessCelestia)的任务,在小马镇(Ponyville)学习关于友谊的知识。 +暮光闪闪的姐姐是谁? 她与另外五只小马,苹果杰克(Applejack)、瑞瑞(Rarity)、云宝黛西(Rainbow Dash)、小蝶(Fluttershy)与萍琪派(Pinkie Pie),成为了最要好的朋友。 +暮光闪闪的姐姐是谁? 每匹小马都分别代表了协律精华的6个元素:诚实,慷慨,忠诚,善良,欢笑,魔法,各自扮演着属于自己的重要角色。 +暮光闪闪的姐姐是谁? 此后,暮光闪闪(Twilight Sparkle)便与她认识的新朋友们开始了有趣的日常生活。 +暮光闪闪的姐姐是谁? 在动画中,随时可见她们在小马镇(Ponyville)的种种冒险、奇遇、日常等等。 +暮光闪闪的姐姐是谁? 同时,也在她们之间的互动和冲突中,寻找着最适合最合理的完美解决方案。 +暮光闪闪的姐姐是谁? “尽管小马国并不太平,六位主角之间也常常有这样那样的问题,但是他们之间的真情对待,使得这个童话世界已经成为不少人心中理想的世外桃源。” +暮光闪闪的姐姐是谁? 暮光闪闪在剧情刚开始的时候生活在中心城(Canterlot),后来在夏日 +暮光闪闪的姐姐是谁? 暮光闪闪与斯派克(Spike) +暮光闪闪的姐姐是谁? 庆典的时候被塞拉丝蒂娅公主派遣到小马镇执行检查夏日庆典的准备工作的任务。 +暮光闪闪的姐姐是谁? 在小马镇交到了朋友(即其余5个主角),并和她们一起使用协律精华(Elements of harmony)击败了梦魇之月。 +暮光闪闪的姐姐是谁? 并在塞拉丝蒂亚公主的许可下,留在小马镇继续研究友谊的魔法。 +暮光闪闪的姐姐是谁? 暮光闪闪的知识基本来自于书本,并且她相当不相信书本以外的“迷信”,因为这样她在S1E15里吃足了苦头。 +暮光闪闪的姐姐是谁? 在这之后,她也开始慢慢学会相信一些书本以外的东西。 +暮光闪闪的姐姐是谁? 暮光闪闪热爱学习,并且学习成绩相当好(从她可以立刻算出 +暮光闪闪的姐姐是谁? 的结果可以看 +暮光闪闪的姐姐是谁? 暮光闪闪的原型 +暮光闪闪的姐姐是谁? 出)。 +暮光闪闪的姐姐是谁? 相当敬爱自己的老师塞拉丝蒂亚公主甚至到了精神失常的地步。 +暮光闪闪的姐姐是谁? 在第二季中,曾因为无法交出关于友谊的报告而做出了疯狂的行为,后来被塞拉丝蒂亚公主制止,在这之后,暮光闪闪得到了塞拉丝蒂亚公主“不用定期交友谊报告”的许可。 +暮光闪闪的姐姐是谁? 于是暮光闪闪在后面的剧情中的主角地位越来越得不到明显的体现。 +暮光闪闪的姐姐是谁? 在SE3E13中,因为破解了白胡子星璇留下的神秘魔法而被加冕成为了天角兽(公主),被尊称为“闪闪公主”。 +暮光闪闪的姐姐是谁? 
当小星座熊在小马镇引起恐慌的时候,暮光闪闪运用了自身强大的魔法将水库举起后装满牛奶,用牛奶将小星座熊安抚后,连着巨型奶瓶和小星座熊一起送回了小星座熊居住的山洞。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭是由汪氏集团旗下子公司江西尤金房地产开发有限公司携手城发投资共同开发的精品社区,项目占地面积约380亩,总建筑面积约41万平方米。 +我想知道红谷十二庭有哪些金融机构? 项目以建设人文型、生态型居住环境为规划目标;创造一个布局合理、功能齐全、交通便捷、绿意盎然、生活方便,有文化内涵的居住区。 +我想知道红谷十二庭有哪些金融机构? 金融机构:工商银行、建设银行、农业银行、中国银行红谷滩支行、商业银行红谷滩支行等 +我想知道红谷十二庭有哪些金融机构? 周边公园:沿乌砂河50米宽绿化带、乌砂河水岸公园、秋水广场、赣江市民公园 +我想知道红谷十二庭有哪些金融机构? 周边医院:新建县人民医院、开心人药店、中寰医院 +我想知道红谷十二庭有哪些金融机构? 周边学校:育新小学红谷滩校区、南师附小红谷滩校区、实验小学红谷滩校区中学:南昌二中红谷滩校区、南昌五中、新建二中、竞秀贵族学校 +我想知道红谷十二庭有哪些金融机构? 周边公共交通:112、204、211、219、222、227、238、501等20多辆公交车在本项目社区门前停靠 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭处在南昌一江两城中的西城中心,位属红谷滩CBD文化公园中心——马兰圩中心组团,红谷滩中心区、红角洲、新建县三区交汇处,南临南期友好路、东接红谷滩中心区、西靠乌砂河水岸公园(50米宽,1000米长)。 +我想知道红谷十二庭有哪些金融机构? 交通便捷,景观资源丰富,生活配套设施齐全,出则繁华,入则幽静,是现代人居的理想地段。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭户型图 +苏琳最开始进入智通实业是担任什么职位? 现任广东智通人才连锁股份有限公司总裁,清华大学高级工商管理硕士。 +苏琳最开始进入智通实业是担任什么职位? 1994年,加入智通实业,从总经理秘书做起。 +苏琳最开始进入智通实业是担任什么职位? 1995年,智通实业决定进入人才服务行业,被启用去负责新公司的筹建及运营工作,在苏琳的努力下,智通人才智力开发有限公司成立。 +苏琳最开始进入智通实业是担任什么职位? 2003年,面对同城对手的激烈竞争,苏琳冷静对待,领导智通先后接管、并购了同城的腾龙、安达盛人才市场,,“品牌运作,连锁经营,差异制胜”成为苏琳屡屡制胜的法宝。 +苏琳最开始进入智通实业是担任什么职位? 2006年,苏琳先是将智通人才升级为“东莞市智通人才连锁有限公司”,一举成为广东省人才市场目前惟一的连锁机构,随后在东莞同时开设长安、松山湖、清溪等镇区分部,至此智通在东莞共有6个分部。 +苏琳最开始进入智通实业是担任什么职位? 一番大刀阔斧完成东莞布局后,苏琳确定下一个更为高远的目标——进军珠三角,向全国发展连锁机构。 +苏琳最开始进入智通实业是担任什么职位? 到2011年末,苏琳领导的智通人才已在珠三角的东莞、佛山、江门、中山等地,长三角的南京、宁波、合肥等地,中西部的南昌、长沙、武汉、重庆、西安等地设立了20多家连锁经营网点。 +苏琳最开始进入智通实业是担任什么职位? 除了财务副总裁之外,苏琳是智通人才核心管理高层当中唯一的女性,不管是要约采访的记者还是刚刚加入智通的员工,见到苏琳的第一面,都会有一种惊艳的感觉,“一位女企业家居然非常美丽和时尚?!” +苏琳最开始进入智通实业是担任什么职位? 智通管理高层的另外6位男性成员,有一次同时接受一家知名媒体采访时,共同表达了对自己老板的“爱慕”之情,苏琳听后莞尔一笑,指着在座的这几位高层说道“其实,我更爱他们!” +苏琳最开始进入智通实业是担任什么职位? 这种具有独特领导魅力的表述让这位记者唏嘘不已,同时由这样的一个细节让他感受到了智通管理团队的协作力量。 +谁知道黄沙中心小学的邮政编码是多少? 学校于1954年始建于棕树湾村,当时借用一间民房做教室,取名为“黄沙小学”,只有教师1人,学生8人。 +谁知道黄沙中心小学的邮政编码是多少? 1958年在大跃进精神的指导下,实行大集体,全乡集中办学,发展到12个班,300多学生,20名教职工。 +谁知道黄沙中心小学的邮政编码是多少? 1959年解散。 +谁知道黄沙中心小学的邮政编码是多少? 1959年下半年,在上级的扶持下,建了6间木房,搬到1960年学校所在地,有6名教师,3个班,60名学生。 +谁知道黄沙中心小学的邮政编码是多少? 1968年,开始招收一个初中班,“黄沙小学”改名为 “附小”。 +谁知道黄沙中心小学的邮政编码是多少? 当时已发展到5个班,8名教师,110多名学生。 +谁知道黄沙中心小学的邮政编码是多少? 增建土木结构教室两间。 +谁知道黄沙中心小学的邮政编码是多少? 1986年,初中、小学分开办学。 +谁知道黄沙中心小学的邮政编码是多少? 增建部分教师宿舍和教室,办学条件稍有改善,学校初具规模。 +谁知道黄沙中心小学的邮政编码是多少? 1996年,我校在市、县领导及希望工程主管部门的关怀下,决定改为“黄沙希望小学”并拨款32万元,新建一栋4层,12间教室的教学楼,教学条件大有改善。 +谁知道黄沙中心小学的邮政编码是多少? 当时发展到10个班,学生300多人,教职工19人,小学高级教师3人,一级教师7人,二级教师9人。 +谁知道黄沙中心小学的邮政编码是多少? 2003年下半年由于农村教育体制改革,撤销教育组,更名为“黄沙中心小学”。 +谁知道黄沙中心小学的邮政编码是多少? 学校现有在校生177人(含学前42人),设有学前至六年级共7个教学班。 +谁知道黄沙中心小学的邮政编码是多少? 有教师19人,其中大专以上学历11人,中师6人;小学高级教师14人,一级教师5人。 +谁知道黄沙中心小学的邮政编码是多少? 学校校园占地面积2050平方米,生均达15.29平方米,校舍建筑面积1645平方米,生均12.27平方米;设有教师办公室、自然实验、电教室(合二为一)、微机室、图书阅览室(合二为一)、体育室、广播室、少先队活动室。 +谁知道黄沙中心小学的邮政编码是多少? 广西壮族自治区桂林市临桂县黄沙瑶族乡黄沙街 邮编:541113[1] +伊藤实华的职业是什么? 伊藤实华(1984年3月25日-)是日本的女性声优。 +伊藤实华的职业是什么? THREE TREE所属,东京都出身,身长149cm,体重39kg,血型AB型。 +伊藤实华的职业是什么? ポルノグラフィティのLION(森男) +伊藤实华的职业是什么? 2000年 +伊藤实华的职业是什么? 犬夜叉(枫(少女时代)) +伊藤实华的职业是什么? 幻影死神(西亚梨沙) +伊藤实华的职业是什么? 2001年 +伊藤实华的职业是什么? NOIR(ロザリー) +伊藤实华的职业是什么? 2002年 +伊藤实华的职业是什么? 水瓶战记(柠檬) +伊藤实华的职业是什么? 返乡战士(エイファ) +伊藤实华的职业是什么? 2003年 +伊藤实华的职业是什么? 奇诺之旅(女子A(悲しい国)) +伊藤实华的职业是什么? 2004年 +伊藤实华的职业是什么? 爱你宝贝(坂下ミキ) +伊藤实华的职业是什么? Get Ride! アムドライバー(イヴァン・ニルギース幼少期) +伊藤实华的职业是什么? スクールランブル(花井春树(幼少时代)) +伊藤实华的职业是什么? 2005年 +伊藤实华的职业是什么? 光速蒙面侠21(虎吉) +伊藤实华的职业是什么? 搞笑漫画日和(男子トイレの精、パン美先生) +伊藤实华的职业是什么? 银牙伝说WEED(テル) +伊藤实华的职业是什么? 魔女的考验(真部カレン、守山太郎) +伊藤实华的职业是什么? BUZZER BEATER(レニー) +伊藤实华的职业是什么? 虫师(“眼福眼祸”さき、“草を踏む音”沢(幼少时代)) +伊藤实华的职业是什么? 2006年 +伊藤实华的职业是什么? 魔女之刃(娜梅) +伊藤实华的职业是什么? 反斗小王子(远藤レイラ) +伊藤实华的职业是什么? 搞笑漫画日和2(パン美先生、フグ子、ダンサー、ヤマトの妹、女性) +伊藤实华的职业是什么? 人造昆虫カブトボーグ V×V(ベネチアンの弟、东ルリ、园儿A) +伊藤实华的职业是什么? 2007年 +爆胎监测与安全控制系统英文是什么? 
爆胎监测与安全控制系统(Blow-out Monitoring and Brake System),是吉利全球首创,并拥有自主知识产权及专利的一项安全技术。 +爆胎监测与安全控制系统英文是什么? 这项技术主要是出于防止高速爆胎所导致的车辆失控而设计。 +爆胎监测与安全控制系统英文是什么? BMBS爆胎监测与安全控制系统技术于2004年1月28日正式获得中国发明专利授权。 +爆胎监测与安全控制系统英文是什么? 2008年第一代BMBS系统正式与世人见面,BMBS汇集国内外汽车力学、控制学、人体生理学、电子信息学等方面的专家和工程技术人员经过一百余辆试验车累计行程超过五百万公里的可靠性验证,以确保产品的可靠性。 +爆胎监测与安全控制系统英文是什么? BMBS技术方案的核心即是采用智能化自动控制系统,弥补驾驶员生理局限,在爆胎后反应时间为0.5秒,替代驾驶员实施行车制动,保障行车安全。 +爆胎监测与安全控制系统英文是什么? BMBS系统由控制系统和显示系统两大部分组成,控制系统由BMBS开关、BMBS主机、BMBS分机、BMBS真空助力器四部分组成;显示系统由GPS显示、仪表指示灯、语言提示、制动双闪灯组成。 +爆胎监测与安全控制系统英文是什么? 当轮胎气压高于或低于限值时,控制器声光提示胎压异常。 +爆胎监测与安全控制系统英文是什么? 轮胎温度过高时,控制器发出信号提示轮胎温度过高。 +爆胎监测与安全控制系统英文是什么? 发射器电量不足时,控制器显示低电压报警。 +爆胎监测与安全控制系统英文是什么? 发射器受到干扰长期不发射信号时,控制器显示无信号报警。 +爆胎监测与安全控制系统英文是什么? 当汽车电门钥匙接通时,BMBS首先进入自检程序,检测系统各部分功能是否正常,如不正常,BMBS报警灯常亮。 +走读干部现象在哪里比较多? 走读干部一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么晚出早归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 截至2014年10月,共有6484名“走读干部”在专项整治中被查处。 +走读干部现象在哪里比较多? 这是中央首次大规模集中处理这一长期遭诟病的干部作风问题。 +走读干部现象在哪里比较多? 干部“走读”问题主要在乡镇地区比较突出,城市地区则较少。 +走读干部现象在哪里比较多? 从历史成因和各地反映的情况来看,产生“走读”现象的主要原因大致有四种: +走读干部现象在哪里比较多? 现今绝大多数乡村都有通往乡镇和县城的石子公路甚至柏油公路,这无疑为农村干部的出行创造了便利条件,为“干部像候鸟,频往家里跑”创造了客观条件。 +走读干部现象在哪里比较多? 选调生、公务员队伍大多是学历较高的大学毕业生,曾在高校所在地的城市生活,不少人向往城市生活,他们不安心长期扎根基层,而是将基层当作跳板,因此他们往往成为“走读”的主力军。 +走读干部现象在哪里比较多? 公仆意识、服务意识淡化,是“走读”现象滋生的主观原因。 +走读干部现象在哪里比较多? 有些党员干部感到自己长期在基层工作,该为自己和家庭想想了。 +走读干部现象在哪里比较多? 于是,不深入群众认真调查研究、认真听取群众意见、认真解决群众的实际困难,也就不难理解了。 +走读干部现象在哪里比较多? 县级党政组织对乡镇领导干部管理的弱化和为基层服务不到位,导致“走读”问题得不到应有的制度约束,是“走读”问题滋长的组织原因。[2] +走读干部现象在哪里比较多? 近些年来,我国一些地方的“干部走读”现象较为普遍,社会上对此议走读干部论颇多。 +走读干部现象在哪里比较多? 所谓“干部走读”,一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么早出晚归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 干部走读之所以成为“千夫所指”,是因为这种行为增加了行政成本。 +走读干部现象在哪里比较多? 从根子上说,干部走读是城乡发展不平衡的产物,“人往高处走,水往低处流”,有了更加舒适的生活环境,不管是为了自己生活条件改善也好,还是因为子女教育也好,农村人口向城镇转移,这是必然结果。 +走读干部现象在哪里比较多? “干部走读”的另一个重要原因,是干部人事制度改革。 +走读干部现象在哪里比较多? 目前公务员队伍“凡进必考”,考上公务员的大多是学历较高的大学毕业生,这些大学毕业生来自各个全国各地,一部分在本地结婚生子,沉淀下来;一部分把公务员作为跳板,到基层后或考研,或再参加省考、国考,或想办法调回原籍。 +走读干部现象在哪里比较多? 再加上一些下派干部、异地交流任职干部,构成了看似庞大的“走读”队伍。 +走读干部现象在哪里比较多? 那么,“干部走读”有哪些弊端呢? +走读干部现象在哪里比较多? 一是这些干部人在基层,心在城市,缺乏长期作战的思想,工作不安心。 +走读干部现象在哪里比较多? 周一来上班,周五回家转,对基层工作缺乏热情和感情;二是长期在省市直机关工作,对基层工作不熟悉不了解,工作不热心;三是长期走读,基层干群有工作难汇报,有困难难解决,群众不开心;四是干部来回走读,公车私驾,私费公报,把大量的经济负担转嫁给基层;五是对这些走读干部,基层管不了,上级监督难,节假日期间到哪里去、做什么事,基本处于失控和真空状态,各级组织和基层干群不放心。 +走读干部现象在哪里比较多? 特别需要引起警觉的是,由于少数走读干部有临时思想,满足于“当维持会长”,得过且过混日子,热衷于做一些急功近利、砸锅求铁的短期行为和政绩工程,不愿做打基础、管长远的实事好事,甚至怠政、疏政和懒于理政,影响了党和政府各项方针政策措施的落实,导致基层无政府主义、自由主义抬头,削弱了党和政府的领导,等到矛盾激化甚至不可收拾的时候,处理已是来之不及。 +走读干部现象在哪里比较多? 权利要与义务相等,不能只有义务而没有权利,或是只有权利没有义务。 +走读干部现象在哪里比较多? 如何真正彻底解决乡镇干部“走读”的现象呢? +走读干部现象在哪里比较多? 那就必须让乡镇基层干部义务与权利相等。 +走读干部现象在哪里比较多? 如果不能解决基层干部待遇等问题,即使干部住村,工作上也不会有什么进展的。 +走读干部现象在哪里比较多? 所以,在政治上关心,在生活上照顾,在待遇上提高。 +走读干部现象在哪里比较多? 如,提高基层干部的工资待遇,增加通讯、交通补助;帮助解决子女入学及老人赡养问题;提拔干部优先考虑基层干部;干部退休时的待遇至少不低于机关干部等等。 +化州市良光镇东岸小学学风是什么? 学校全体教职工爱岗敬业,团结拼搏,勇于开拓,大胆创新,进行教育教学改革,努力开辟第二课堂的教学路子,并开通了网络校校通的交流合作方式。 +化州市良光镇东岸小学学风是什么? 现学校教师正在为创建安全文明校园而努力。 +化州市良光镇东岸小学学风是什么? 东岸小学位置偏僻,地处贫穷落后,是良光镇最偏远的学校,学校,下辖分教点——东心埇小学,[1]?。 +化州市良光镇东岸小学学风是什么? 学校2011年有教师22人,学生231人。 +化州市良光镇东岸小学学风是什么? 小学高级教师8人,小学一级教师10人,未定级教师4人,大专学历的教师6人,其余的都具有中师学历。 +化州市良光镇东岸小学学风是什么? 全校共设12个班,学校课程按标准开设。 +化州市良光镇东岸小学学风是什么? 东岸小学原来是一所破旧不堪,教学质量非常差的薄弱学校。 +化州市良光镇东岸小学学风是什么? 近几年来,在各级政府、教育部门及社会各界热心人士鼎力支持下,学校领导大胆改革创新,致力提高教学质量和教师水平,并加大经费投入,大大改善了办学条件,使学校由差变好,实现了大跨越。 +化州市良光镇东岸小学学风是什么? 学校建设性方面。 +化州市良光镇东岸小学学风是什么? 东岸小学属于革命老区学校,始建于1980年,从东心埇村祠堂搬到这个校址,1990年建造一幢建筑面积为800平方米的南面教学楼, 1998年老促会支持从北面建造一幢1800平方米的教学大楼。 +化州市良光镇东岸小学学风是什么? 学校在管理方面表现方面颇具特色,实现了各项制度的日常化和规范化。 +化州市良光镇东岸小学学风是什么? 
学校领导有较强的事业心和责任感,讲求民主与合作,勤政廉政,依法治校,树立了服务意识。 +化州市良光镇东岸小学学风是什么? 学校一贯实施“德育为先,以人为本”的教育方针,制定了“团结,律已,拼搏,创新”的校训。 +化州市良光镇东岸小学学风是什么? 教育风为“爱岗敬业,乐于奉献”,学风为“乐学,勤学,巧学,会学”。 +化州市良光镇东岸小学学风是什么? 校内营造了尊师重教的氛围,形成了良好的校风和学风。 +化州市良光镇东岸小学学风是什么? 教师们爱岗敬业,师德高尚,治学严谨,教研教改气氛浓厚,获得喜人的教研成果。 +化州市良光镇东岸小学学风是什么? 近几年来,教师撰写的教育教学论文共10篇获得县市级以上奖励,获了镇级以上奖励的有100人次。 +化州市良光镇东岸小学学风是什么? 学校德育工作成绩显著,多年被评为“安全事故为零”的学校,良光镇先进学校。 +化州市良光镇东岸小学学风是什么? 特别是教学质量大大提高了。 +化州市良光镇东岸小学学风是什么? 这些成绩得到了上级及群众的充分肯定。 +化州市良光镇东岸小学学风是什么? 1.学校环境欠美观有序,学校大门口及校道有待改造。 +化州市良光镇东岸小学学风是什么? 2.学校管理制度有待改进,部分教师业务水平有待提高。 +化州市良光镇东岸小学学风是什么? 3.教师宿舍、教室及学生宿舍欠缺。 +化州市良光镇东岸小学学风是什么? 4.运动场不够规范,各类体育器材及设施需要增加。 +化州市良光镇东岸小学学风是什么? 5.学生活动空间少,见识面窄,视野不够开阔。 +化州市良光镇东岸小学学风是什么? 1.努力营造和谐的教育教学新气氛。 +化州市良光镇东岸小学学风是什么? 建立科学的管理制度,坚持“与时俱进,以人为本”,真正实现领导对教师,教师对学生之间进行“德治与情治”;学校的人文环境做到“文明,和谐,清新”;德育环境做到“自尊,律已,律人”;心理环境做到“安全,谦虚,奋发”;交际环境做到“团结合作,真诚助人”;景物环境做到“宜人,有序。” +化州市良光镇东岸小学学风是什么? 营造学校与育人的新特色。 +我很好奇发射管的输出功率怎么样? 产生或放大高频功率的静电控制电子管,有时也称振荡管。 +我很好奇发射管的输出功率怎么样? 用于音频或开关电路中的发射管称调制管。 +我很好奇发射管的输出功率怎么样? 发射管是无线电广播、通信、电视发射设备和工业高频设备中的主要电子器件。 +我很好奇发射管的输出功率怎么样? 输出功率和工作频率是发射管的基本技术指标。 +我很好奇发射管的输出功率怎么样? 广播、通信和工业设备的发射管,工作频率一般在30兆赫以下,输出功率在1919年为2千瓦以下,1930年达300千瓦,70年代初已超过1000千瓦,效率高达80%以上。 +我很好奇发射管的输出功率怎么样? 发射管工作频率提高时,输出功率和效率都会降低,因此1936年首次实用的脉冲雷达工作频率仅28兆赫,80年代则已达 400兆赫以上。 +我很好奇发射管的输出功率怎么样? 40年代电视发射管的工作频率为数十兆赫,而80年代初,优良的电视发射管可在1000兆赫下工作,输出功率达20千瓦,效率为40%。 +我很好奇发射管的输出功率怎么样? 平面电极结构的小功率发射三极管可在更高的频率下工作。 +我很好奇发射管的输出功率怎么样? 发射管多采用同心圆筒电极结构。 +我很好奇发射管的输出功率怎么样? 阴极在最内层,向外依次为各个栅极和阳极。 +我很好奇发射管的输出功率怎么样? 图中,自左至右为阴极、第一栅、第二栅、栅极阴极组装件和装入阳极后的整个管子。 +我很好奇发射管的输出功率怎么样? 发射管 +我很好奇发射管的输出功率怎么样? 中小功率发射管多采用间热式氧化物阴极。 +我很好奇发射管的输出功率怎么样? 大功率发射管一般采用碳化钍钨丝阴极,有螺旋、直条或网笼等结构形式。 +我很好奇发射管的输出功率怎么样? 图为网笼式阴极。 +我很好奇发射管的输出功率怎么样? 栅极多用钼丝或钨丝绕制,或用钼片经电加工等方法制造。 +我很好奇发射管的输出功率怎么样? 栅极表面经镀金(或铂)或涂敷锆粉等处理,以降低栅极电子发射,使发射管稳定工作。 +我很好奇发射管的输出功率怎么样? 用气相沉积方法制造的石墨栅极,具有良好的性能。 +我很好奇发射管的输出功率怎么样? 发射管阳极直流输入功率转化为高频输出功率的部分约为75%,其余25%成为阳极热损耗,因此对发射管的阳极必须进行冷却。 +我很好奇发射管的输出功率怎么样? 中小功率发射管的阳极采取自然冷却方式,用镍、钼或石墨等材料制造,装在管壳之内,工作温度可达 600℃。 +我很好奇发射管的输出功率怎么样? 大功率发射管的阳极都用铜制成,并作为真空密封管壳的一部分,采用各种强制冷却方式。 +我很好奇发射管的输出功率怎么样? 各种冷却方式下每平方厘米阳极内表面的散热能力为:水冷100瓦;风冷30瓦;蒸发冷却250瓦;超蒸发冷却1000瓦以上,80年代已制成阳极损耗功率为1250千瓦的超蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管也常以冷却方式命名,如风冷发射管、水冷发射管和蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管管壳用玻璃或陶瓷制造。 +我很好奇发射管的输出功率怎么样? 小功率发射管内使用含钡的吸气剂;大功率发射管则采用锆、钛、钽等吸气材料,管内压强约为10帕量级。 +我很好奇发射管的输出功率怎么样? 发射管寿命取决于阴极发射电子的能力。 +我很好奇发射管的输出功率怎么样? 大功率发射管寿命最高记录可达8万小时。 +我很好奇发射管的输出功率怎么样? 发射四极管的放大作用和输出输入电路间的隔离效果优于三极管,应用最广。 +我很好奇发射管的输出功率怎么样? 工业高频振荡器普遍采用三极管。 +我很好奇发射管的输出功率怎么样? 五极管多用在小功率范围中。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 鲁能领秀城中央公园位于鲁能领秀城景观中轴之上,总占地161.55亩,总建筑面积约40万平米,容积率为2.70,由22栋小高层、高层组成;其绿地率高达35.2%,环境优美,产品更加注重品质化、人性化和自然生态化,是鲁能领秀城的生态人居典范。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 中央公园[1] 学区准现房,坐享鲁能领秀城成熟配套,成熟生活一步到位。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 经典板式小高层,103㎡2+1房仅22席,稀市推出,错过再无;92㎡经典两房、137㎡舒适三房压轴登场! +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 物业公司: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 济南凯瑞物业公司;深圳长城物业公司;北京盛世物业有限公司 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 绿化率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 42% +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 容积率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 2.70 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 暖气: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 集中供暖 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 楼座展示:中央公园由22栋小高层、高层组成,3、16、17号楼分别是11层小高层,18层和28层的高层。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 4号楼是23层,2梯3户。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 项目位置: +鬼青蛙在哪里有收录详情? 鬼青蛙这张卡可以从手卡把这张卡以外的1只水属性怪兽丢弃,从手卡特殊召唤。 +鬼青蛙在哪里有收录详情? 这张卡召唤·反转召唤·特殊召唤成功时,可以从自己的卡组·场上选1只水族·水属性·2星以下的怪兽送去墓地。 +鬼青蛙在哪里有收录详情? 此外,1回合1次,可以通过让自己场上1只怪兽回到手卡,这个回合通常召唤外加上只有1次,自己可以把「鬼青蛙」以外的1只名字带有「青蛙」的怪兽召唤。[1] +鬼青蛙在哪里有收录详情? 游戏王卡包收录详情 +鬼青蛙在哪里有收录详情? [09/09/18] +西湖区有多大? 西湖区是江西省南昌市市辖区。 +西湖区有多大? 为南昌市中心城区之一,有着2200多年历史,是一个物华天宝、人杰地灵的古老城区。 +西湖区有多大? 
2004年南昌市老城区区划调整后,西湖区东起京九铁路线与青山湖区毗邻,南以洪城路东段、抚河路南段、象湖以及南隔堤为界与青云谱区、南昌县接壤,西凭赣江中心线与红谷滩新区交界,北沿中山路、北京西路与东湖区相连,所辖面积34.5平方公里,常住人口43万,管辖1个镇、10个街道办事处,设12个行政村、100个社区。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖原为汉代豫章群古太湖的一部分,唐贞元15年(公元799年)洪恩桥的架设将东太湖分隔成东西两部分,洪恩桥以西谓之西湖,西湖区由此而得名。 +西湖区有多大? 西湖区在1926年南昌设市后分别称第四、五部分,六、七部分。 +西湖区有多大? 1949年解放初期分别称第三、四区。 +西湖区有多大? 1955年分别称抚河区、西湖区。 +西湖区有多大? 1980年两区合并称西湖区。[1] +西湖区有多大? 辖:西湖街道、丁公路街道、广外街道、系马桩街道、绳金塔街道、朝阳洲街道、禾草街街道、十字街街道、瓦子角街道、三眼井街道、上海路街道、筷子巷街道、南站街道。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。 +西湖区有多大? 2004年9月7日,国务院批准(国函[2004]70号)调整南昌市市辖区部分行政区划:将西湖区朝阳洲街道的西船居委会划归东湖区管辖。 +西湖区有多大? 将青山湖区的桃花镇和湖坊镇的同盟村划归西湖区管辖。 +西湖区有多大? 将西湖区十字街街道的谷市街、洪城路、南关口、九四、新丰5个居委会,上海路街道的草珊瑚集团、南昌肠衣厂、电子计算机厂、江西涤纶厂、江地基础公司、曙光、商标彩印厂、南昌市染整厂、江南蓄电池厂、四机床厂、二进、国乐新村12个居委会,南站街道的解放西路东居委会划归青云谱区管辖。 +西湖区有多大? 将西湖区上海路街道的轻化所、洪钢、省人民检察院、电信城东分局、安康、省机械施工公司、省水利设计院、省安装公司、南方电动工具厂、江西橡胶厂、上海路北、南昌电池厂、东华计量所、南昌搪瓷厂、上海路新村、华安针织总厂、江西五金厂、三波电机厂、水文地质大队、二六○厂、省卫生学校、新世纪、上海路住宅区北、塔子桥北、南航、上海路住宅区南、沿河、南昌阀门厂28个居委会,丁公路街道的新魏路、半边街、师大南路、顺化门、岔道口东路、师大、广电厅、手表厂、鸿顺9个居委会,南站街道的工人新村北、工人新村南、商苑、洪都中大道、铁路第三、铁路第四、铁路第六7个居委会划归青山湖区管辖。 +西湖区有多大? 调整后,西湖区辖绳金塔、桃源、朝阳洲、广润门、南浦、西湖、系马桩、十字街、丁公路、南站10个街道和桃花镇,区人民政府驻孺子路。 +西湖区有多大? 调整前,西湖区面积31平方千米,人口52万。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区位于江西省省会南昌市的中心地带,具有广阔的发展空间和庞大的消费群体,商贸旅游、娱乐服务业等到各个行业都蕴藏着无限商机,投资前景十分广阔。 +西湖区有多大? 不仅水、电价格低廉,劳动力资源丰富,人均工资和房产价格都比沿海城市低,城区拥有良好的人居环境、低廉的投资成本,巨大的发展潜力。 +西湖区有多大? 105、316、320国道和京九铁路贯穿全境,把南北东西交通连成一线;民航可与上海、北京、广州、深圳、厦门、温州等到地通航,并开通了南昌-新加坡第一条国际航线;水运依托赣江可直达长江各港口;邮电通讯便捷,程控电话、数字微波、图文传真进入国际通讯网络;商检、海关、口岸等涉外机构齐全;水、电、气供应充足。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区,是江西省省会南昌市的中心城区,面积34.8平方公里,常住人口51.9万人,辖桃花镇、朝农管理处及10个街道,设13个行政村,116个社区居委会,20个家委会。[2] +西湖区有多大? 2005年11月16日,南昌市《关于同意西湖区桃花镇、桃源、十字街街道办事处行政区划进行调整的批复》 +西湖区有多大? 1、同意将桃花镇的三道闸居委会划归桃源街道办事处管辖。 +青藏虎耳草花期什么时候? 青藏虎耳草多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 花期7-8月。 +青藏虎耳草花期什么时候? 分布于甘肃(祁连山地)、青海(黄南、海南、海北)和西藏(加查)。 +青藏虎耳草花期什么时候? 生于海拔3 700-4 250米的林下、高山草甸和高山碎石隙。[1] +青藏虎耳草花期什么时候? 多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 茎不分枝,具褐色卷曲柔毛。 +青藏虎耳草花期什么时候? 基生叶具柄,叶片卵形、椭圆形至长圆形,长15-25毫米,宽4-8毫米,腹面无毛,背面和边缘具褐色卷曲柔毛,叶柄长1-3厘米,基部扩大,边缘具褐色卷曲柔毛;茎生叶卵形至椭圆形,长1.5-2厘米,向上渐变小。 +青藏虎耳草花期什么时候? 聚伞花序伞房状,具2-6花;花梗长5-19毫米,密被褐色卷曲柔毛;萼片在花期反曲,卵形至狭卵形,长2.5-4.2毫米,宽1.5-2毫米,先端钝,两面无毛,边缘具褐色卷曲柔毛,3-5脉于先端不汇合;花瓣腹面淡黄色且其中下部具红色斑点,背面紫红色,卵形、狭卵形至近长圆形,长2.5-5.2毫米,宽1.5-2.1毫米,先端钝,基部具长0.5-1毫米之爪,3-5(-7)脉,具2痂体;雄蕊长2-3.6毫米,花丝钻形;子房半下位,周围具环状花盘,花柱长1-1.5毫米。 +青藏虎耳草花期什么时候? 生于高山草甸、碎石间。 +青藏虎耳草花期什么时候? 分布青海、西藏、甘肃、四川等地。 +青藏虎耳草花期什么时候? [1] +青藏虎耳草花期什么时候? 顶峰虎耳草Saxifraga cacuminum Harry Sm. +青藏虎耳草花期什么时候? 对叶虎耳Saxifraga contraria Harry Sm. +青藏虎耳草花期什么时候? 狭瓣虎耳草Saxifraga pseudohirculus Engl. +青藏虎耳草花期什么时候? 唐古特虎耳草Saxifraga tangutica Engl. +青藏虎耳草花期什么时候? 宽叶虎耳草(变种)Saxifraga tangutica Engl. var. platyphylla (Harry Sm.) J. T. Pan +青藏虎耳草花期什么时候? 唐古特虎耳草(原变种)Saxifraga tangutica Engl. var. tangutica +青藏虎耳草花期什么时候? 西藏虎耳草Saxifraga tibetica Losinsk.[1] +青藏虎耳草花期什么时候? Saxifraga przewalskii Engl. in Bull. Acad. Sci. St. -Petersb. 29:115. 1883: Engl et Irmsch. in Bot. Jahrb. 48:580. f. 5E-H. 1912 et in Engl. Pflanzenr. 67(IV. 117): 107. f. 21 E-H. 1916; J. T. Pan in Acta Phytotax. Sin. 16(2): 16. 1978;中国高等植物图鉴补编2: 30. 1983; 西藏植物志 2: 483. 1985. [1] +生产一支欧文冲锋枪需要多少钱? 
欧文冲锋枪 Owen Gun 1945年,在新不列颠手持欧文冲锋枪的澳大利亚士兵 类型 冲锋枪 原产国 ?澳大利亚 服役记录 服役期间 1941年-1960年代 用户 参见使用国 参与战役 第二次世界大战 马来亚紧急状态 朝鲜战争 越南战争 1964年罗德西亚布什战争 生产历史 研发者 伊夫林·欧文(Evelyn Owen) 研发日期 1931年-1939年 生产商 约翰·莱萨特工厂 利特高轻武器工厂 单位制造费用 $ 30/枝 生产日期 1941年-1945年 制造数量 45,000-50,000 枝 衍生型 Mk 1/42 Mk 1/43 Mk 2/43 基本规格 总重 空枪: Mk 1/42:4.24 千克(9.35 磅) Mk 1/43:3.99 千克(8.8 磅) Mk 2/43:3.47 千克(7.65 磅) 全长 806 毫米(31.73 英吋) 枪管长度 247 毫米(9.72 英吋) 弹药 制式:9 × 19 毫米 原型:.38/200 原型:.45 ACP 口径 9 × 19 毫米:9 毫米(.357 英吋) .38/200:9.65 毫米(.38 英吋) .45 ACP:11.43 毫米(.45 英吋) 枪管 1 根,膛线7 条,右旋 枪机种类 直接反冲作用 开放式枪机 发射速率 理论射速: Mk 1/42:700 发/分钟 Mk 1/43:680 发/分钟 Mk 2/43:600 发/分钟 实际射速:120 发/分钟 枪口初速 380-420 米/秒(1,246.72-1,377.95 英尺/秒) 有效射程 瞄具装定射程:91.44 米(100 码) 最大有效射程:123 米(134.51 码) 最大射程 200 米(218.72 码) 供弹方式 32/33 发可拆卸式弹匣 瞄准具型式 机械瞄具:向右偏置的觇孔式照门和片状准星 欧文冲锋枪(英语:Owen Gun,正式名称:Owen Machine Carbine,以下简称为“欧文枪”)是一枝由伊夫林·(埃沃)·欧文(英语:Evelyn (Evo) Owen)于1939年研制、澳大利亚的首枝冲锋枪,制式型发射9 × 19 毫米鲁格手枪子弹。 +生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪是澳大利亚唯一设计和主要服役的二战冲锋枪,并从1943年由澳大利亚陆军所使用,直到1960年代中期。 +生产一支欧文冲锋枪需要多少钱? 由新南威尔士州卧龙岗市出身的欧文枪发明者,伊夫林·欧文,在24岁时于1939年7月向悉尼维多利亚军营的澳大利亚陆军军械官员展示了他所设计的.22 LR口径“卡宾机枪”原型枪。 +生产一支欧文冲锋枪需要多少钱? 该枪却被澳大利亚陆军所拒绝,因为澳大利亚陆军在当时没有承认冲锋枪的价值。 +生产一支欧文冲锋枪需要多少钱? 随着战争的爆发,欧文加入了澳大利亚军队,并且成为一名列兵。 +生产一支欧文冲锋枪需要多少钱? 1940年9月,欧文的邻居,文森特·沃德尔(英语:Vincent Wardell),看到欧文家楼梯后面搁著一个麻布袋,里面放著一枝欧文枪的原型枪。 +生产一支欧文冲锋枪需要多少钱? 而文森特·沃德尔是坎布拉港的大型钢制品厂莱萨特公司的经理,他向欧文的父亲表明了他对其儿子的粗心大意感到痛心,但无论如何仍然解释了这款武器的历史。 +生产一支欧文冲锋枪需要多少钱? 沃德尔对欧文枪的简洁的设计留下了深刻的印象。 +生产一支欧文冲锋枪需要多少钱? 沃德尔安排欧文转调到陆军发明部(英语:Army Inventions Board),并重新开始在枪上的工作。 +生产一支欧文冲锋枪需要多少钱? 军队仍然持续地从负面角度查看该武器,但同时政府开始采取越来越有利的观点。 +生产一支欧文冲锋枪需要多少钱? 该欧文枪原型配备了装在顶部的弹鼓,后来让位给装在顶部的弹匣使用。 +生产一支欧文冲锋枪需要多少钱? 口径的选择亦花了一些时间去解决。 +生产一支欧文冲锋枪需要多少钱? 由于陆军有大批量的柯尔特.45 ACP子弹,它们决定欧文枪需要采用这种口径。 +生产一支欧文冲锋枪需要多少钱? 直到在1941年9月19日官方举办试验时,约翰·莱萨特工厂制成了9 毫米、.38/200和.45 ACP三种口径版本。 +生产一支欧文冲锋枪需要多少钱? 而从美、英进口的斯登冲锋枪和汤普森冲锋枪在试验中作为基准使用。 +生产一支欧文冲锋枪需要多少钱? 作为测试的一部分,所有的枪支都浸没在泥浆里,并以沙土覆盖,以模拟他们将会被使用时最恶劣的环境。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是唯一在这测试中这样对待以后仍可正常操作的冲锋枪。 +生产一支欧文冲锋枪需要多少钱? 虽然测试表现出欧文枪具有比汤普森冲锋枪和司登冲锋枪更优秀的可靠性,陆军没有对其口径作出决定。 +生产一支欧文冲锋枪需要多少钱? 结果它在上级政府干预以后,陆军才下令9 毫米的衍生型为正式口径,并在1941年11月20日正式被澳大利亚陆军采用。 +生产一支欧文冲锋枪需要多少钱? 在欧文枪的寿命期间,其可靠性在澳大利亚部队中赢得了“军人的至爱”(英语:Digger's Darling)的绰号,亦有人传言它受到美军高度青睐。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是在1942年开始正式由坎布拉港和纽卡斯尔的约翰·莱萨特工厂投入生产,在生产高峰期每个星期生产800 支。 +生产一支欧文冲锋枪需要多少钱? 1942年3月至1943年2月之间,莱萨特生产了28,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 然而,最初的一批弹药类型竟然是错误的,以至10,000 枝欧文枪无法提供弹药。 +生产一支欧文冲锋枪需要多少钱? 政府再一次推翻军方的官僚主义作风??,并让弹药通过其最后的生产阶段,以及运送到当时在新几内亚与日军战斗的澳大利亚部队的手中。 +生产一支欧文冲锋枪需要多少钱? 在1941年至1945年间生产了约50,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 在战争期间,欧文枪的平均生产成本为$ 30。[1] +生产一支欧文冲锋枪需要多少钱? 虽然它是有点笨重,因为其可靠性,欧文枪在士兵当中变得非常流行。 +生产一支欧文冲锋枪需要多少钱? 它是如此成功,它也被新西兰、英国和美国订购。[2] +生产一支欧文冲锋枪需要多少钱? 欧文枪后来也被澳大利亚部队在朝鲜战争和越南战争,[3]特别是步兵组的侦察兵。 +生产一支欧文冲锋枪需要多少钱? 这仍然是一枝制式的澳大利亚陆军武器,直到1960年代中期,它被F1冲锋枪所取代。 +第二届中国光伏摄影大赛因为什么政策而开始的? 光伏发电不仅是全球能源科技和产业发展的重要方向,也是我国具有国际竞争优势的战略性新兴产业,是我国保障能源安全、治理环境污染、应对气候变化的战略性选择。 +第二届中国光伏摄影大赛因为什么政策而开始的? 2013年7月以来,国家出台了《关于促进光伏产业健康发展的若干意见》等一系列政策,大力推进分布式光伏发电的应用,光伏发电有望走进千家万户,融入百姓民生。 +第二届中国光伏摄影大赛因为什么政策而开始的? 大赛主办方以此为契机,开启了“第二届中国光伏摄影大赛”的征程。 +悬赏任务有哪些类型? 悬赏任务,威客网站上一种任务模式,由雇主在威客网站发布任务,提供一定数额的赏金,以吸引威客们参与。 +悬赏任务有哪些类型? 悬赏任务数额一般在几十到几千不等,但也有几万甚至几十万的任务。 +悬赏任务有哪些类型? 主要以提交的作品的质量好坏作为中标标准,当然其中也带有雇主的主观喜好,中标人数较少,多为一个或几个,因此竞争激烈。 +悬赏任务有哪些类型? 大型悬赏任务赏金数额巨大,中标者也较多,但参与人也很多,对于身有一技之长的威客来讲,悬赏任务十分适合。 +悬赏任务有哪些类型? 悬赏任务的类型主要包括:设计类、文案类、取名类、网站类、编程类、推广类等等。 +悬赏任务有哪些类型? 每一类所适合的威客人群不同,报酬的多少也不同,比如设计类的报酬就比较高,一般都几百到几千,而推广类的计件任务报酬比较少,一般也就几块钱,但花费的时间很少,技术要求也很低。 +悬赏任务有哪些类型? 1.注册—登陆 +悬赏任务有哪些类型? 2.点击“我要发悬赏”—按照发布流程及提示提交任务要求。 +悬赏任务有哪些类型? 悬赏模式选择->网站托管赏金模式。 +悬赏任务有哪些类型? 威客网站客服稍后会跟发布者联系确认任务要求。 +悬赏任务有哪些类型? 3.没有问题之后就可以预付赏金进行任务发布。 +悬赏任务有哪些类型? 
4.会员参与并提交稿件。 +悬赏任务有哪些类型? 5.发布者需要跟会员互动(每个提交稿件的会员都可以),解决问题,完善稿件,初步筛选稿件。 +悬赏任务有哪些类型? 6.任务发布期结束,进入选稿期(在筛选的稿件中选择最后满意的) +悬赏任务有哪些类型? 7.发布者不满意现有稿件可选定一个会员修改至满意为止,或者加价延期重新开放任务进行征稿。 +悬赏任务有哪些类型? (重复第六步)没有问题后进入下一步。 +悬赏任务有哪些类型? 8:中标会员交源文件给发布者—发布者确认—任务结束—网站将赏金付给中标会员。 +悬赏任务有哪些类型? 1、任务发布者自由定价,自由确定悬赏时间,自由发布任务要求,自主确定中标会员和中标方案。 +悬赏任务有哪些类型? 2、任务发布者100%预付任务赏金,让竞标者坚信您的诚意和诚信。 +悬赏任务有哪些类型? 3、任务赏金分配原则:任务一经发布,网站收取20%发布费,中标会员获得赏金的80%。 +悬赏任务有哪些类型? 4、每个任务最终都会选定至少一个作品中标,至少一个竞标者获得赏金。 +悬赏任务有哪些类型? 5、任务发布者若未征集到满意作品,可以加价延期征集,也可让会员修改,会员也可以删除任务。 +悬赏任务有哪些类型? 6、任务发布者自己所在组织的任何人均不能以任何形式参加自己所发布的任务,一经发现则视为任务发布者委托威客网按照网站规则选稿。 +悬赏任务有哪些类型? 7、任务悬赏总金额低于100元(含100元)的任务,悬赏时间最多为7天。 +悬赏任务有哪些类型? 所有任务最长时间不超过30天(特殊任务除外),任务总金额不得低于50元。 +悬赏任务有哪些类型? 8、网赚类、注册类任务总金额不能低于300元人民币,计件任务每个稿件的平均单价不能低于1元人民币。 +悬赏任务有哪些类型? 9、延期任务只有3次加价机会,第1次加价不得低于任务金额的10%,第2次加价不得低于任务总金额的20%,第3次不得低于任务总金额的50%。 +悬赏任务有哪些类型? 每次延期不能超过15天,加价金额不低于50元,特殊任务可以适当加长。 +悬赏任务有哪些类型? 如果为计件任务,且不是网赚类任务,将免费延期,直至征集完规定数量的作品为止。 +悬赏任务有哪些类型? 10、如果威客以交接源文件要挟任务发布者,威客网将扣除威客相关信用值,并取消其中标资格,同时任务将免费延长相应的时间继续征集作品 。 +江湖令由哪些平台运营? 《江湖令》是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作。 +江湖令由哪些平台运营? 由ya247平台、91wan游戏平台、2918、4399游戏平台、37wan、6711、兄弟玩网页游戏平台,49you、Y8Y9平台、8090游戏等平台运营的,由07177游戏网发布媒体资讯的网页游戏。 +江湖令由哪些平台运营? 网页游戏《江湖令》由51游戏社区运营,是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作… +江湖令由哪些平台运营? 背景故事: +江湖令由哪些平台运营? 隋朝末年,隋炀帝暴政,天下民不聊生,义军四起。 +江湖令由哪些平台运营? 在这动荡的时代中,百姓生活苦不堪言,多少人流离失所,家破人亡。 +江湖令由哪些平台运营? 天下三大势力---飞羽营、上清宫、侠隐岛,也值此机会扩张势力,派出弟子出来行走江湖。 +江湖令由哪些平台运营? 你便是这些弟子中的普通一员,在这群雄并起的年代,你将如何选择自己的未来。 +江湖令由哪些平台运营? 所有的故事,便从瓦岗寨/江都大营开始…… +江湖令由哪些平台运营? 势力: +江湖令由哪些平台运营? ①、飞羽营:【外功、根骨】 +江湖令由哪些平台运营? 南北朝时期,由北方政权创立的一个民间军事团体,经过多年的发展,逐渐成为江湖一大势力。 +江湖令由哪些平台运营? ②、上清宫:【外功、身法】 +江湖令由哪些平台运营? 道家圣地,宫中弟子讲求清静无为,以一种隐世的方式修炼,但身在此乱世,亦也不能独善其身。 +江湖令由哪些平台运营? ③、侠隐岛:【根骨、内力】 +江湖令由哪些平台运营? 位于偏远海岛上的一个世家,岛内弟子大多武功高强,但从不进入江湖行走,适逢乱世,现今岛主也决意作一翻作为。 +江湖令由哪些平台运营? 两大阵营: +江湖令由哪些平台运营? 义军:隋唐末期,百姓生活苦不堪言,有多个有志之士组成义军,对抗当朝暴君,希望建立一个适合百姓安居乐业的天地。 +江湖令由哪些平台运营? 隋军:战争一起即天下打乱,隋军首先要镇压四起的义军,同时在内部慢慢改变现有的朝廷,让天下再次恢复到昔日的安定。 +江湖令由哪些平台运营? 一、宠物品质 +江湖令由哪些平台运营? 宠物的品质分为:灵兽,妖兽,仙兽,圣兽,神兽 +江湖令由哪些平台运营? 二、宠物获取途径 +江湖令由哪些平台运营? 完成任务奖励宠物(其他途径待定)。 +江湖令由哪些平台运营? 三、宠物融合 +江湖令由哪些平台运营? 1、在主界面下方的【宠/骑】按钮进入宠物界面,再点击【融合】即可进入融合界面进行融合,在融合界面可选择要融合的宠物进行融合 +江湖令由哪些平台运营? 2、融合后主宠的形态不变; +江湖令由哪些平台运营? 3、融合后宠物的成长,品质,技能,经验,成长经验,等级都继承成长高的宠物; +江湖令由哪些平台运营? 4、融合宠物技能冲突,则保留成长值高的宠物技能,如果不冲突则叠加在空余的技能位置。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(土耳其文:Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称“土超”,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 目前,土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛 +请问土耳其足球超级联赛是什么时候成立的? 运动项目 足球 +请问土耳其足球超级联赛是什么时候成立的? 成立年份 1959年 +请问土耳其足球超级联赛是什么时候成立的? 参赛队数 18队 +请问土耳其足球超级联赛是什么时候成立的? 国家 土耳其 +请问土耳其足球超级联赛是什么时候成立的? 现任冠军 费内巴切足球俱乐部(2010-2011) +请问土耳其足球超级联赛是什么时候成立的? 夺冠最多队伍 费内巴切足球俱乐部(18次) +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称「土超」,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立于1959年,成立之前土耳其国有多个地区性联赛。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立后便把各地方联赛制度统一起来。 +请问土耳其足球超级联赛是什么时候成立的? 一般土超联赛由八月开始至五月结束,12月至1月会有歇冬期。 +请问土耳其足球超级联赛是什么时候成立的? 十八支球队会互相对叠,各有主场和作客两部分,采计分制。 +请问土耳其足球超级联赛是什么时候成立的? 联赛枋最底的三支球队会降到土耳其足球甲级联赛作赛。 +请问土耳其足球超级联赛是什么时候成立的? 由2005-06年球季起,土超联赛的冠、亚军会取得参加欧洲联赛冠军杯的资格。 +请问土耳其足球超级联赛是什么时候成立的? 成立至今土超联赛乃由两支著名球会所垄断──加拉塔萨雷足球俱乐部和费内巴切足球俱乐部,截至2009-2010赛季,双方各赢得冠军均为17次。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛共有18支球队,采取双循环得分制,每场比赛胜方得3分,负方0分,平局双方各得1分。 +请问土耳其足球超级联赛是什么时候成立的? 如果两支球队积分相同,对战成绩好的排名靠前,其次按照净胜球来决定;如果有三支以上的球队分数相同,则按照以下标准来确定排名:1、几支队伍间对战的得分,2、几支队伍间对战的净胜球数,3、总净胜球数。 +请问土耳其足球超级联赛是什么时候成立的? 
联赛第1名直接参加下个赛季冠军杯小组赛,第2名参加下个赛季冠军杯资格赛第三轮,第3名进入下个赛季欧洲联赛资格赛第三轮,第4名进入下个赛季欧洲联赛资格赛第二轮,最后三名降入下个赛季的土甲联赛。 +请问土耳其足球超级联赛是什么时候成立的? 该赛季的土耳其杯冠军可参加下个赛季欧洲联赛资格赛第四轮,如果冠军已获得冠军杯资格,则亚军可参加下个赛季欧洲联赛资格赛第四轮,否则名额递补给联赛。 +请问土耳其足球超级联赛是什么时候成立的? 2010年/2011年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2009年/2010年 布尔萨体育(又译贝莎) +请问土耳其足球超级联赛是什么时候成立的? 2008年/2009年 贝西克塔斯 +请问土耳其足球超级联赛是什么时候成立的? 2007年/2008年 加拉塔萨雷 +请问土耳其足球超级联赛是什么时候成立的? 2006年/2007年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2005年/2006年 加拉塔沙雷 +请问土耳其足球超级联赛是什么时候成立的? 2004年/2005年 费内巴切(又译费伦巴治) +请问土耳其足球超级联赛是什么时候成立的? 2003年/2004年 费内巴切 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? (英)刑事调查局,香港警察的重案组 +cid 作Customer IDentity解时是什么意思? ? Criminal Investigation Department +cid 作Customer IDentity解时是什么意思? ? 佩枪: +cid 作Customer IDentity解时是什么意思? ? 香港警察的CID(刑事侦缉队),各区重案组的探员装备短管点38左轮手枪,其特点是便于收藏,而且不容易卡壳,重量轻,其缺点是装弹量少,只有6发,而且换子弹较慢,威力也一般,如果碰上54式手枪或者M9手枪明显处于下风。 +cid 作Customer IDentity解时是什么意思? ? 香港警察的“刑事侦查”(Criminal Investigation Department)部门,早于1983年起已经不叫做C.I.D.的了,1983年香港警察队的重整架构,撤销了C.I.D. ( Criminal Investigation Dept.) “刑事侦缉处”,将“刑事侦查”部门归入去“行动处”内,是“行动处”内的一个分支部门,叫“刑事部”( Crime Wing )。 +cid 作Customer IDentity解时是什么意思? ? 再于90年代的一次警队重整架构,香港警队成立了新的「刑事及保安处」,再将“刑事侦查”部门归入目前的「刑事及保安处」的“处”级单位,是归入这个“处”下的一个部门,亦叫“刑事部” ( Crime Wing ),由一个助理警务处长(刑事)领导。 +cid 作Customer IDentity解时是什么意思? ? 但是时至今天,CID虽已经是一个老旧的名称,香港市民、甚至香港警察都是习惯性的沿用这个历史上的叫法 . +cid 作Customer IDentity解时是什么意思? ? CID格式是美国Adobe公司发表的最新字库格式,它具有易扩充、速度快、兼容性好、简便、灵活等特点,已成为国内开发中文字库的热点,也为用户使用字库提供质量更好,数量更多的字体。 +cid 作Customer IDentity解时是什么意思? ? CID (Character identifier)就是字符识别码,在组成方式上分成CIDFont,CMap表两部分。 +cid 作Customer IDentity解时是什么意思? ? CIDFont文件即总字符集,包括了一种特定语言中所有常用的字符,把这些字符排序,它们在总字符集中排列的次序号就是各个字符的CID标识码(Index);CMap(Character Map)表即字符映像文件,将字符的编码(Code)映像到字符的CID标识码(Index)。 +cid 作Customer IDentity解时是什么意思? ? CID字库完全针对大字符集市场设计,其基本过程为:先根据Code,在CMap表查到Index,然后在CIDFont文件找到相应的字形数据。 +本町位于什么地方? 本条目记述台湾日治时期,各都市之本町。 +本町位于什么地方? 为台湾日治时期台北市之行政区,共分一~四丁目,在表町之西。 +本町位于什么地方? 以现在的位置来看,本町位于现台北市中正区的西北角,约位于忠孝西路一段往西至台北邮局东侧。 +本町位于什么地方? 再向南至开封街一段,沿此路线向西至开封街一段60号,顺60号到汉口街一段向东到现在华南银行总行附近画一条直线到衡阳路。 +本町位于什么地方? 再向东至重庆南路一段,由重庆南路一段回到原点这个范围内。 +本町位于什么地方? 另外,重庆南路一段在当时名为“本町通”。 +本町位于什么地方? 此地方自日治时期起,就是繁华的商业地区,当时也有三和银行、台北专卖分局、日本石油等重要商业机构。 +本町位于什么地方? 其中,专卖分局是战后二二八事件的主要起始点。 +本町位于什么地方? 台湾贮蓄银行(一丁目) +本町位于什么地方? 三和银行(二丁目) +本町位于什么地方? 专卖局台北分局(三丁目) +本町位于什么地方? 日本石油(四丁目) +本町位于什么地方? 为台湾日治时期台南市之行政区。 +本町位于什么地方? 范围包括清代旧街名枋桥头前、枋桥头后、鞋、草花、天公埕、竹仔、下大埕、帽仔、武馆、统领巷、大井头、内宫后、内南町。 +本町位于什么地方? 为清代台南城最繁华的区域。 +本町位于什么地方? 台南公会堂 +本町位于什么地方? 北极殿 +本町位于什么地方? 开基武庙 +本町位于什么地方? 町名改正 +本町位于什么地方? 这是一个与台湾相关的小作品。 +本町位于什么地方? 你可以通过编辑或修订扩充其内容。 +《行走的观点:埃及》的条形码是多少? 出版社: 上海社会科学院出版社; 第1版 (2006年5月1日) +《行走的观点:埃及》的条形码是多少? 丛书名: 时代建筑视觉旅行丛书 +《行走的观点:埃及》的条形码是多少? 条形码: 9787806818640 +《行走的观点:埃及》的条形码是多少? 尺寸: 18 x 13.1 x 0.7 cm +《行走的观点:埃及》的条形码是多少? 重量: 181 g +《行走的观点:埃及》的条形码是多少? 漂浮在沙与海市蜃楼之上的金字塔曾经是否是你的一个梦。 +《行走的观点:埃及》的条形码是多少? 埃及,这片蕴蓄了5000年文明的土地,本书为你撩开它神秘的纱。 +《行走的观点:埃及》的条形码是多少? 诸神、金字塔、神庙、狮身人面像、法老、艳后吸引着我们的注意力;缠绵悱恻的象形文字、医学、雕刻等留给我们的文明,不断引发我们对古代文明的惊喜和赞叹。 +《行走的观点:埃及》的条形码是多少? 尼罗河畔的奇异之旅,数千年的古老文明,尽收在你的眼底…… +《行走的观点:埃及》的条形码是多少? 本书集历史、文化、地理等知识于一体,并以优美、流畅文笔,简明扼要地阐述了埃及的地理环境、政治经济、历史沿革、文化艺术,以大量富有艺术感染力的彩色照片,生动形象地展示了埃及最具特色的名胜古迹、风土人情和自然风光。 +《行走的观点:埃及》的条形码是多少? 古埃及历史 +老挝人民军的工兵部队有几个营? 
老挝人民军前身为老挝爱国战线领导的“寮国战斗部队”(即“巴特寮”),始建于1949年1月20日,1965年10月改名为老挝人民解放军,1982年7月改称现名。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会,朱马里·赛雅颂任主席,隆再·皮吉任国防部长。 +老挝人民军的工兵部队有几个营? 实行义务兵役制,服役期最少18个月。[1] +老挝人民军的工兵部队有几个营? ?老挝军队在老挝社会中有较好的地位和保障,工资待遇比地方政府工作人员略高。 +老挝人民军的工兵部队有几个营? 武装部队总兵力约6万人,其中陆军约5万人,主力部队编为5个步兵师;空军2000多人;海军(内河巡逻部队)1000多人;部队机关院校5000人。[1] +老挝人民军的工兵部队有几个营? 老挝人民军军旗 +老挝人民军的工兵部队有几个营? 1991年8月14日通过的《老挝人民民主共和国宪法》第11条规定:国家执行保卫国防和维护社会安宁的政策。 +老挝人民军的工兵部队有几个营? 全体公民和国防力量、治安力量必须发扬忠于祖国、忠于人民的精神,履行保卫革命成果、保卫人民生命财产及和平劳动的任务,积极参加国家建设事业。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会。 +老挝人民军的工兵部队有几个营? 主席由老挝人民革命党中央委员会总书记兼任。 +老挝人民军的工兵部队有几个营? 老挝陆军成立最早,兵力最多,约有5万人。 +老挝人民军的工兵部队有几个营? 其中主力部队步兵师5个、7个独立团、30多个营、65个独立连。 +老挝人民军的工兵部队有几个营? 地方部队30余个营及县属部队。 +老挝人民军的工兵部队有几个营? 地面炮兵2个团,10多个营。 +老挝人民军的工兵部队有几个营? 高射炮兵1个团9个营。 +老挝人民军的工兵部队有几个营? 导弹部队2个营。 +老挝人民军的工兵部队有几个营? 装甲兵7个营。 +老挝人民军的工兵部队有几个营? 特工部队6个营。 +老挝人民军的工兵部队有几个营? 通讯部队9个营。 +老挝人民军的工兵部队有几个营? 工兵部队6个营。 +老挝人民军的工兵部队有几个营? 基建工程兵2个团13个营。 +老挝人民军的工兵部队有几个营? 运输部队7个营。 +老挝人民军的工兵部队有几个营? 陆军的装备基本是中国和前苏联援助的装备和部分从抗美战争中缴获的美式装备。 +老挝人民军的工兵部队有几个营? 老挝内河部队总兵力约1700人,装备有内河船艇110多艘,编成4个艇队。 +老挝人民军的工兵部队有几个营? 有芒宽、巴能、纳坎、他曲、南盖、巴色等8个基地。 +老挝人民军的工兵部队有几个营? 空军于1975年8月组建,现有2个团、11个飞行大队,总兵力约2000人。 +老挝人民军的工兵部队有几个营? 装备有各种飞机140架,其中主要由前苏联提供和从万象政权的皇家空军手中接管。 +老挝人民军的工兵部队有几个营? 随着军队建设质量的提高,老挝人民军对外军事合作步伐也日益扩大,近年来先后与俄罗斯、印度、马来西亚、越南、菲律宾等国拓展了军事交流与合作的内容。 +老挝人民军的工兵部队有几个营? 2003年1月,印度决定向老挝援助一批军事装备和物资,并承诺提供技术帮助。 +老挝人民军的工兵部队有几个营? 2003年6月,老挝向俄罗斯订购了一批新式防空武器;2003年4月,老挝与越南签署了越南帮助老挝培训军事指挥干部和特种部队以及完成军队通信系统改造等多项协议。 +《焚心之城》的主角是谁? 《焚心之城》[1] 为网络作家老子扛过枪创作的一部都市类小说,目前正在创世中文网连载中。 +《焚心之城》的主角是谁? 乡下大男孩薛城,是一个不甘于生活现状的混混,他混过、爱过、也深深地被伤害过。 +《焚心之城》的主角是谁? 本料此生当浑浑噩噩,拼搏街头。 +《焚心之城》的主角是谁? 高考的成绩却给了他一点渺茫的希望,二月后,大学如期吹响了他进城的号角。 +《焚心之城》的主角是谁? 繁华的都市,热血的人生,冷眼嘲笑中,他发誓再不做一个平常人! +《焚心之城》的主角是谁? 江北小城,黑河大地,他要行走过的每一个角落都有他的传说。 +《焚心之城》的主角是谁? 扯出一面旗,拉一帮兄弟,做男人,就要多一份担当,活一口傲气。 +《焚心之城》的主角是谁? (日期截止到2014年10月23日凌晨) +请问香港利丰集团是什么时候成立的? 香港利丰集团前身是广州的华资贸易 (1906 - 1949) ,利丰是香港历史最悠久的出口贸易商号之一。 +请问香港利丰集团是什么时候成立的? 于1906年,冯柏燎先生和李道明先生在广州创立了利丰贸易公司;是当时中国第一家华资的对外贸易出口商。 +请问香港利丰集团是什么时候成立的? 利丰于1906年创立,初时只从事瓷器及丝绸生意;一年之后,增添了其它的货品,包括竹器、藤器、玉石、象牙及其它手工艺品,包括烟花爆竹类别。 +请问香港利丰集团是什么时候成立的? 在早期的对外贸易,中国南方内河港因水深不足不能行驶远洋船,反之香港港口水深岸阔,占尽地利。 +请问香港利丰集团是什么时候成立的? 因此,在香港成立分公司的责任,落在冯柏燎先生的三子冯汉柱先生身上。 +请问香港利丰集团是什么时候成立的? 1937年12月28日,利丰(1937)有限公司正式在香港创立。 +请问香港利丰集团是什么时候成立的? 第二次世界大战期间,利丰暂停贸易业务。 +请问香港利丰集团是什么时候成立的? 1943年,随着创办人冯柏燎先生去世后,业务移交给冯氏家族第二代。 +请问香港利丰集团是什么时候成立的? 之后,向来不参与业务管理的合伙人李道明先生宣布退休,将所拥有的利丰股权全部卖给冯氏家族。 +请问香港利丰集团是什么时候成立的? 目前由哈佛冯家两兄弟William Fung , Victor Fung和CEO Bruce Rockowitz 管理。 +请问香港利丰集团是什么时候成立的? 截止到2012年,集团旗下有利亚﹝零售﹞有限公司、利和集团、利邦时装有限公司、利越时装有限公司、利丰贸易有限公司。 +请问香港利丰集团是什么时候成立的? 利亚(零售)连锁,业务包括大家所熟悉的:OK便利店、玩具〝反〞斗城和圣安娜饼屋;范围包括香港、台湾、新加坡、马来西亚、至中国大陆及东南亚其它市场逾600多家店 +请问香港利丰集团是什么时候成立的? 利和集团,IDS以专业物流服务为根基,为客户提供经销,物流,制造服务领域内的一系列服务项目。 +请问香港利丰集团是什么时候成立的? 业务网络覆盖大中华区,东盟,美国及英国,经营着90多个经销中心,在中国设有18个经销公司,10,000家现代经销门店。 +请问香港利丰集团是什么时候成立的? 利邦(上海)时装贸易有限公司为大中华区其中一家大型男士服装零售集团。 +请问香港利丰集团是什么时候成立的? 现在在中国大陆、香港、台湾和澳门收购经营11个包括Cerruti 1881,Gieves & Hawkes,Kent & curwen和D’urban 等中档到高档的男士服装品牌,全国有超过350间门店设于各一线城市之高级商场及百货公司。 +请问香港利丰集团是什么时候成立的? 利越(上海)服装商贸有限公司隶属于Branded Lifestyle,负责中国大陆地区LEO里奥(意大利)、GIBO捷宝(意大利)、UFFIZI古杰师(意大利)、OVVIO奥维路(意大利)、Roots绿适(加拿大,全球服装排名第四)品牌销售业务 +请问香港利丰集团是什么时候成立的? 利丰(贸易)1995年收购了英之杰采购服务,1999年收购太古贸易有限公司(Swire & Maclain) 和金巴莉有限公司(Camberley),2000年和2002年分别收购香港采购出口集团Colby Group及Janco Oversea Limited,大大扩张了在美国及欧洲的顾客群,自2008年经济危机起一直到现在,收购多家欧、美、印、非等地区的时尚品牌,如英国品牌Visage,仅2011年上半年6个月就完成26个品牌的收购。 +请问香港利丰集团是什么时候成立的? 2004年利丰与Levi Strauss & Co.签订特许经营协议 +请问香港利丰集团是什么时候成立的? 2005年利丰伙拍Daymon Worldwide为全球供应私有品牌和特许品牌 +请问香港利丰集团是什么时候成立的? 2006年收购Rossetti手袋业务及Oxford Womenswear Group 强化美国批发业务 +请问香港利丰集团是什么时候成立的? 
2007年收购Tommy Hilfiher全球采购业务,收购CGroup、Peter Black International LTD、Regetta USA LLC和American Marketing Enterprice +请问香港利丰集团是什么时候成立的? 2008年收购Kent&Curwen全球特许经营权,收购Van Zeeland,Inc和Miles Fashion Group +请问香港利丰集团是什么时候成立的? 2009年收购加拿大休闲品牌Roots ,收购Wear Me Appearl,LLC。 +请问香港利丰集团是什么时候成立的? 与Hudson's Bay、Wolverine Worldwide Inc、Talbots、Liz Claiborne达成了采购协议 +请问香港利丰集团是什么时候成立的? 2010年收购Oxford apparel Visage Group LTD +请问香港利丰集团是什么时候成立的? 2011年一月收购土耳其Modium、美国女性时尚Beyond Productions,三月收购贸易公司Celissa 、玩具公司Techno Source USA, Inc.、卡通品牌产品TVMania和法国著名时装一线品牌Cerruti 1881,五月收购Loyaltex Apparel Ltd.、女装Hampshire Designers和英国彩妆Collection 2000,六月收购家私贸易Exim Designs Co., Ltd.,七月收购家庭旅行产业Union Rich USA, LLC和设计公司Lloyd Textile Fashion Company Limited,八月收购童装Fishman & Tobin和Crimzon Rose,九月收购家私贸易True Innovations, LLC、日用品企业Midway Enterprises和Wonderful World。 +请问香港利丰集团是什么时候成立的? 十二月与USPA – U.S. Polo Association签署授权协议。 +请问香港利丰集团是什么时候成立的? 利丰的精神:积极进取,不断认识并争取有利于客户和自身进步的机会;以行动为主导,对客户、供应商及职工的需求作出快速的决定。 +请问香港利丰集团是什么时候成立的? 利丰的最终目标:在产品采购、销售、流转的各环节建立全球性队伍提供多元化服务,利丰成员有效合作,共达目标。 +如何使魔兽变种akt不被查杀? Trojan/PSW.Moshou.akt“魔兽”变种akt是“魔兽”木马家族的最新成员之一,采用Delphi 6.0-7.0编写,并经过加壳处理。 +如何使魔兽变种akt不被查杀? “魔兽”变种akt运行后,自我复制到被感染计算机的指定目录下。 +如何使魔兽变种akt不被查杀? 修改注册表,实现木马开机自动运行。 +如何使魔兽变种akt不被查杀? 自我注入到被感染计算机的“explorer.exe”、“notepad.exe”等用户级权限的进程中加载运行,隐藏自我,防止被查杀。 +如何使魔兽变种akt不被查杀? 在后台秘密监视用户打开的窗口标题,盗取网络游戏《魔兽世界》玩家的游戏帐号、游戏密码、角色等级、装备信息、金钱数量等信息,并在后台将窃取到的玩家信息发送到骇客指定的远程服务器上,致使玩家游戏帐号、装备物品、金钱等丢失,给游戏玩家造成非常大的损失。 +丙种球蛋白能预防什么病情? 丙种球蛋白预防传染性肝炎,预防麻疹等病毒性疾病感染,治疗先天性丙种球蛋白缺乏症 ,与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 中文简称:“丙球” +丙种球蛋白能预防什么病情? 英文名称:γ-globulin、gamma globulin +丙种球蛋白能预防什么病情? 【别名】 免疫血清球蛋白,普通免疫球蛋白,人血丙种球蛋白,丙种球蛋白,静脉注射用人免疫球蛋白(pH4) +丙种球蛋白能预防什么病情? 注:由于人血中的免疫球蛋白大多数为丙种球蛋白(γ-球蛋白),有时丙种球蛋白也被混称为“免疫球蛋白”(immunoglobulin) 。 +丙种球蛋白能预防什么病情? 冻干制剂应为白色或灰白色的疏松体,液体制剂和冻干制剂溶解后,溶液应为接近无色或淡黄色的澄明液体,微带乳光。 +丙种球蛋白能预防什么病情? 但不应含有异物或摇不散的沉淀。 +丙种球蛋白能预防什么病情? 注射丙种球蛋白是一种被动免疫疗法。 +丙种球蛋白能预防什么病情? 它是把免疫球蛋白内含有的大量抗体输给受者,使之从低或无免疫状态很快达到暂时免疫保护状态。 +丙种球蛋白能预防什么病情? 由于抗体与抗原相互作用起到直接中和毒素与杀死细菌和病毒。 +丙种球蛋白能预防什么病情? 因此免疫球蛋白制品对预防细菌、病毒性感染有一定的作用[1]。 +丙种球蛋白能预防什么病情? 人免疫球蛋白的生物半衰期为16~24天。 +丙种球蛋白能预防什么病情? 1、丙种球蛋白[2]含有健康人群血清所具有的各种抗体,因而有增强机体抵抗力以预防感染的作用。 +丙种球蛋白能预防什么病情? 2、主要治疗先天性丙种球蛋白缺乏症和免疫缺陷病 +丙种球蛋白能预防什么病情? 3、预防传染性肝炎,如甲型肝炎和乙型肝炎等。 +丙种球蛋白能预防什么病情? 4、用于麻疹、水痘、腮腺炎、带状疱疹等病毒感染和细菌感染的防治 +丙种球蛋白能预防什么病情? 5、也可用于哮喘、过敏性鼻炎、湿疹等内源性过敏性疾病。 +丙种球蛋白能预防什么病情? 6、与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 7、川崎病,又称皮肤粘膜淋巴结综合征,常见于儿童,丙种球蛋白是主要的治疗药物。 +丙种球蛋白能预防什么病情? 1、对免疫球蛋白过敏或有其他严重过敏史者。 +丙种球蛋白能预防什么病情? 2、有IgA抗体的选择性IgA缺乏者。 +丙种球蛋白能预防什么病情? 3、发烧患者禁用或慎用。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (1997年9月1日浙江省第八届人民代表大会常务委员会第三十九次会议通过 1997年9月9日浙江省第八届人民代表大会常务委员会公告第六十九号公布自公布之日起施行) +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 为了保护人的生命和健康,发扬人道主义精神,促进社会发展与和平进步事业,根据《中华人民共和国红十字会法》,结合本省实际,制定本办法。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省县级以上按行政区域建立的红十字会,是中国红十字会的地方组织,是从事人道主义工作的社会救助团体,依法取得社会团体法人资格,设置工作机构,配备专职工作人员,依照《中国红十字会章程》独立自主地开展工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全省性行业根据需要可以建立行业红十字会,配备专职或兼职工作人员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 街道、乡(镇)、机关、团体、学校、企业、事业单位根据需要,可以依照《中国红十字会章程》建立红十字会的基层组织。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 上级红十字会指导下级红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上地方红十字会指导所在行政区域行业红十字会和基层红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 人民政府对红十字会给予支持和资助,保障红十字会依法履行职责,并对其活动进行监督;红十字会协助人民政府开展与其职责有关的活动。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全社会都应当关心和支持红十字事业。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 
本省公民和单位承认《中国红十字会章程》并缴纳会费的,可以自愿参加红十字会,成为红十字会的个人会员或团体会员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员由本人申请,基层红十字会批准,发给会员证;团体会员由单位申请,县级以上红十字会批准,发给团体会员证。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员和团体会员应当遵守《中华人民共和国红十字会法》和《中国红十字会章程》,热心红十字事业,履行会员的义务,并享有会员的权利。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会理事会由会员代表大会民主选举产生。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 理事会民主选举产生会长和副会长;根据会长提名,决定秘书长、副秘书长人选。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会可以设名誉会长、名誉副会长和名誉理事,由同级红十字会理事会聘请。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 省、市(地)红十字会根据独立、平等、互相尊重的原则,发展同境外、国外地方红十字会和红新月会的友好往来和合作关系。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 红十字会履行下列职责: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)宣传、贯彻《中华人民共和国红十字会法》和本办法; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)开展救灾的准备工作,筹措救灾款物;在自然灾害和突发事件中,对伤病人员和其他受害者进行救助; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)普及卫生救护和防病知识,进行初级卫生救护培训,对交通、电力、建筑、矿山等容易发生意外伤害的单位进行现场救护培训; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (四)组织群众参加现场救护; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (五)参与输血献血工作,推动无偿献血; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (六)开展红十字青少年活动; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (七)根据中国红十字会总会部署,参加国际人道主义救援工作; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (八)依照国际红十字和红新月运动的基本原则,完成同级人民政府和上级红十字会委托的有关事宜; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (九)《中华人民共和国红十宇会法》和《中国红十字会章程》规定的其他职责。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 第八条 红十字会经费的主要来源: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)红十字会会员缴纳的会费; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)接受国内外组织和个人捐赠的款物; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)红十字会的动产、不动产以及兴办社会福利事业和经济实体的收入; +宝湖庭院绿化率多少? 建发·宝湖庭院位于银川市金凤区核心地带—正源南街与长城中路交汇处向东500米。 +宝湖庭院绿化率多少? 项目已于2012年4月开工建设,总占地约4.2万平方米,总建筑面积约11.2万平方米,容积率2.14,绿化率35%,预计可入住630户。 +宝湖庭院绿化率多少? “建发·宝湖庭院”是银川建发集团股份有限公司继“建发·宝湖湾”之后,在宝湖湖区的又一力作。 +宝湖庭院绿化率多少? 项目周边发展成熟,东有唐徕渠景观水道,西临银川市交通主干道正源街;南侧与宝湖湿地公园遥相呼应。 +宝湖庭院绿化率多少? “宝湖庭院”项目公共交通资源丰富:15路、21路、35路、38路、43路公交车贯穿银川市各地,出行便利。 +宝湖庭院绿化率多少? 距离新百良田购物广场约1公里,工人疗养院600米,宝湖公园1公里,唐徕渠景观水道500米。 +宝湖庭院绿化率多少? 项目位置优越,购物、餐饮、医疗、交通、休闲等生活资源丰富。[1] +宝湖庭院绿化率多少? 建发·宝湖庭院建筑及景观设置传承建发一贯“简约、大气”的风格:搂间距宽广,确保每一座楼宇视野开阔通透。 +宝湖庭院绿化率多少? 楼宇位置错落有置,外立面设计大气沉稳别致。 +宝湖庭院绿化率多少? 项目内部休闲绿地、景观小品点缀其中,道路及停车系统设计合理,停车及通行条件便利。 +宝湖庭院绿化率多少? 社区会所、幼儿园、活动室、医疗服务中心等生活配套一应俱全。 +宝湖庭院绿化率多少? 行政区域:金凤区 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔是荷兰“大黄鸭”之父弗洛伦泰因·霍夫曼打造的大型装置艺术作品,该作品首次亮相于台湾桃园大园乡海军基地,为了迎接中秋节的到来;在展览期间,海军基地也首次对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼觉得中国神话中捣杵的玉兔很有想象力,于是特别创作了“月兔”,这也是“月兔”新作第一次展出。[1] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?2014年9月15日因工人施工不慎,遭火烧毁。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? “大月兔”外表采用的杜邦防水纸、会随风飘动,内部以木材加保丽龙框架支撑做成。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 兔毛用防水纸做成,材质完全防水,不怕日晒雨淋。[3 +大月兔(中秋艺术作品)的作者还有哪些代表作? -4] +大月兔(中秋艺术作品)的作者还有哪些代表作? 25米的“月兔”倚靠在机 +大月兔(中秋艺术作品)的作者还有哪些代表作? 堡上望着天空,像在思考又像赏月。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 月兔斜躺在机堡上,意在思考生命、边做白日梦,编织自己的故事。[3] +大月兔(中秋艺术作品)的作者还有哪些代表作? 台湾桃园大园乡海军基地也首度对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 428公顷的海军基地中,地景艺术节使用约40公顷,展场包括过去军机机堡、跑道等,由于这处基地过去警备森严,不对外开放,这次结合地景艺术展出,也可一窥过去是黑猫中队基地的神秘面纱。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月2日,桃园县政府文化局举行“踩线团”,让 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔 +大月兔(中秋艺术作品)的作者还有哪些代表作? 各项地景艺术作品呈现在媒体眼中,虽然“月兔”仍在进行最后的细节赶工,但横躺在机堡上的“月兔”雏形已经完工。[5] +大月兔(中秋艺术作品)的作者还有哪些代表作? “这么大”、“好可爱呦”是不少踩线团成员对“月兔”的直觉;尤其在蓝天的衬托及前方绿草的组合下,呈现犹如真实版的爱丽丝梦游仙境。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼的作品大月兔,“从平凡中,创作出不平凡的视觉”,创造出观赏者打从心中油然而生的幸福感,拉近观赏者的距离。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月15日早 +大月兔(中秋艺术作品)的作者还有哪些代表作? 上,施工人员要将月兔拆解,搬离海军基地草皮时,疑施工拆除的卡车,在拆除过程,故障起火,起火的卡车不慎延烧到兔子,造成兔子起火燃烧,消防队员即刻抢救,白色的大月兔立即变成焦黑的火烧兔。[7] +大月兔(中秋艺术作品)的作者还有哪些代表作? 
桃园县府表示相当遗憾及难过,也不排除向包商求偿,也已将此事告知霍夫曼。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?[8] +大月兔(中秋艺术作品)的作者还有哪些代表作? 弗洛伦泰因·霍夫曼,荷兰艺术家,以在公共空间创作巨大造型 +大月兔(中秋艺术作品)的作者还有哪些代表作? 物的艺术项目见长。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 代表作品包括“胖猴子”(2010年在巴西圣保罗展出)、“大黄兔”(2011年在瑞典厄勒布鲁展出)、粉红猫(2014年5月在上海亮相)、大黄鸭(Rubber Duck)、月兔等。 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual plc)成立于1845年,一直在伦敦证券交易所(伦敦证券交易所:OML)作第一上市,也是全球排名第32位(按营业收入排名)的保险公司(人寿/健康)。 +英国耆卫保险公司有多少保险客户? 公司是全球财富500强公司之一,也是被列入英国金融时报100指数的金融服务集团之一。 +英国耆卫保险公司有多少保险客户? Old Mutual 是一家国际金融服务公司,拥有近320万个保险客户,240万个银行储户,270,000个短期保险客户以及700,000个信托客户 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual)是一家国际金融服务公司,总部设在伦敦,主要为全球客户提供长期储蓄的解决方案、资产管理、短期保险和金融服务等,目前业务遍及全球34个国家。[1] +英国耆卫保险公司有多少保险客户? 主要包括人寿保险,资产管理,银行等。 +英国耆卫保险公司有多少保险客户? 1845年,Old Mutual在好望角成立。 +英国耆卫保险公司有多少保险客户? 1870年,董事长Charles Bell设计了Old Mutual公司的标记。 +英国耆卫保险公司有多少保险客户? 1910年,南非从英联邦独立出来。 +英国耆卫保险公司有多少保险客户? Old Mutual的董事长John X. Merriman被选为国家总理。 +英国耆卫保险公司有多少保险客户? 1927年,Old Mutual在Harare成立它的第一个事务所。 +英国耆卫保险公司有多少保险客户? 1960年,Old Mutual在南非成立了Mutual Unit信托公司,用来管理公司的信托业务。 +英国耆卫保险公司有多少保险客户? 1970年,Old Mutual的收入超过100百万R。 +英国耆卫保险公司有多少保险客户? 1980年,Old Mutual成为南非第一大人寿保险公司,年收入达10亿R。 +英国耆卫保险公司有多少保险客户? 1991年,Old Mutual在美国财富周刊上评选的全球保险公司中名列第38位。 +英国耆卫保险公司有多少保险客户? 1995年,Old Mutual在美国波士顿建立投资顾问公司,同年、又在香港和Guernsey建立事务所。 +英国耆卫保险公司有多少保险客户? 作为一项加强与其母公司联系的举措,OMNIA公司(百慕大)荣幸的更名为Old Mutual 公司(百慕大) 。 +英国耆卫保险公司有多少保险客户? 这一新的名称和企业识别清晰地展示出公司成为其世界金融机构合作伙伴强有力支持的决心。 +英国耆卫保险公司有多少保险客户? 2003 年4月,该公司被Old Mutual plc公司收购,更名为Sage Life(百慕大)公司并闻名于世,公司为Old Mutual公司提供了一个新的销售渠道,补充了其现有的以美元计价的产品线和分销系统。 +英国耆卫保险公司有多少保险客户? 达到了一个重要里程碑是公司成功的一个例证: 2005年6月3日公司资产超过10亿美元成为公司的一个主要里程碑,也是公司成功的一个例证。 +英国耆卫保险公司有多少保险客户? Old Mutual (百慕大)为客户提供一系列的投资产品。 +英国耆卫保险公司有多少保险客户? 在其开放的结构下,客户除了能够参与由Old Mutual会员管理的方案外,还能够参与由一些世界顶尖投资机构提供的投资选择。 +英国耆卫保险公司有多少保险客户? 首席执行官John Clifford对此发表评论说:“过去的两年对于Old Mutual家族来说是稳固发展的两年,更名是迫在眉睫的事情。 +英国耆卫保险公司有多少保险客户? 通过采用其名字和形象上的相似,Old Mutual (百慕大)进一步强化了与母公司的联系。” +英国耆卫保险公司有多少保险客户? Clifford补充道:“我相信Old Mutual全球品牌认可度和Old Mutual(百慕大)产品专业知识的结合将在未来的日子里进一步推动公司的成功。” +英国耆卫保险公司有多少保险客户? 随着公司更名而来的是公司网站的全新改版,设计投资选择信息、陈述、销售方案、营销材料和公告板块。 +英国耆卫保险公司有多少保险客户? 在美国购买不到OMNIA投资产品,该产品也不向美国公民或居民以及百慕大居民提供。 +英国耆卫保险公司有多少保险客户? 这些产品不对任何要约未得到批准的区域中的任何人,以及进行此要约或询价为非法行为的个人构成要约或询价。 +英国耆卫保险公司有多少保险客户? 关于Old Mutual(百慕大)公司 +英国耆卫保险公司有多少保险客户? Old Mutual(百慕大)公司总部位于百慕大,公司面向非美国居民及公民以及非百慕大居民,通过遍布世界的各个市场的金融机构开发和销售保险和投资方案。 +英国耆卫保险公司有多少保险客户? 这些方案由Old Mutual(百慕大)公司直接做出,向投资者提供各种投资选择和战略,同时提供死亡和其他受益保证。 +谁知道北京的淡定哥做了什么? 尼日利亚足球队守门员恩耶马被封淡定哥,原因是2010年南非世界杯上1:2落后希腊队时,对方前锋已经突破到禁区,其仍头依门柱发呆,其从容淡定令人吃惊。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 在2010年6月17日的世界杯赛场上,尼日利亚1比2不敌希腊队,但尼日利亚门将恩耶马(英文名:Vincent Enyeama)在赛场上的“淡定”表现令人惊奇。 +谁知道北京的淡定哥做了什么? 随后,网友将赛场照片发布于各大论坛,恩耶马迅速窜红,并被网友称为“淡定哥”。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 从网友上传得照片中可以看到,“淡定哥”在面临对方前锋突袭至小禁区之时,还靠在球门柱上发呆,其“淡定”程度的确非一般人所能及。 +谁知道北京的淡定哥做了什么? 恩耶马是尼日利亚国家队的主力守门员,目前效力于以色列的特拉维夫哈普尔队。 +谁知道北京的淡定哥做了什么? 1999年,恩耶马在尼日利亚国内的伊波姆星队开始职业生涯,后辗转恩伊姆巴、Iwuanyanwu民族等队,从07年开始,他为特拉维夫效力。 +谁知道北京的淡定哥做了什么? 恩耶马的尼日利亚国脚生涯始于2002年,截至2010年1月底,他为国家队出场已超过50次。 +谁知道北京的淡定哥做了什么? 当地时间2011年1月4日,国际足球历史与统计协会(IFFHS)公布了2010年度世界最佳门将,恩耶马(尼日利亚,特拉维夫夏普尔)10票排第十一 +谁知道北京的淡定哥做了什么? 此词经国家语言资源监测与研究中心等机构专家审定入选2010年年度新词语,并收录到《中国语言生活状况报告》中。 +谁知道北京的淡定哥做了什么? 提示性释义:对遇事从容镇定、处变不惊的男性的戏称。 +谁知道北京的淡定哥做了什么? 例句:上海现“淡定哥”:百米外爆炸他仍专注垂钓(2010年10月20日腾讯网http://news.qq.com/a/20101020/000646.htm) +谁知道北京的淡定哥做了什么? 2011年度新人物 +谁知道北京的淡定哥做了什么? 1、淡定哥(北京) +谁知道北京的淡定哥做了什么? 7月24日傍晚,北京市出现大范围降雨天气,位于通州北苑路出现积水,公交车也难逃被淹。 +谁知道北京的淡定哥做了什么? 李欣摄图片来源:新华网一辆私家车深陷积水,车主索性盘坐在自己的汽车上抽烟等待救援。 +谁知道北京的淡定哥做了什么? 私家车主索性盘坐在自己的车上抽烟等待救援,被网友称“淡定哥” +谁知道北京的淡定哥做了什么? 2、淡定哥——林峰 +谁知道北京的淡定哥做了什么? 
在2011年7月23日的动车追尾事故中,绍兴人杨峰(@杨峰特快)在事故中失去了5位亲人:怀孕7个月的妻子、未出世的孩子、岳母、妻姐和外甥女,他的岳父也在事故中受伤正在治疗。 +谁知道北京的淡定哥做了什么? 他披麻戴孝出现在事故现场,要求将家人的死因弄个明白。 +谁知道北京的淡定哥做了什么? 但在第一轮谈判过后,表示:“请原谅我,如果我再坚持,我将失去我最后的第六个亲人。” +谁知道北京的淡定哥做了什么? 如果他继续“纠缠”铁道部,他治疗中的岳父将会“被死亡”。 +谁知道北京的淡定哥做了什么? 很多博友就此批评杨峰,并讽刺其为“淡定哥”。 +071型船坞登陆舰的北约代号是什么? 071型船坞登陆舰(英语:Type 071 Amphibious Transport Dock,北约代号:Yuzhao-class,中文:玉昭级,或以首舰昆仑山号称之为昆仑山级船坞登陆舰),是中国人民解放军海军隶下的大型多功能两栖船坞登陆舰,可作为登陆艇的母舰,用以运送士兵、步兵战车、主战坦克等展开登陆作战,也可搭载两栖车辆,具备大型直升机起降甲板及操作设施。 +071型船坞登陆舰的北约代号是什么? 071型两栖登陆舰是中国首次建造的万吨级作战舰艇,亦为中国大型多功能两栖舰船的开山之作,也可以说是中国万吨级以上大型作战舰艇的试验之作,该舰的建造使中国海军的两栖舰船实力有了质的提升。 +071型船坞登陆舰的北约代号是什么? 在本世纪以前中国海军原有的两栖舰队以一 +071型船坞登陆舰的北约代号是什么? 早期071模型 +071型船坞登陆舰的北约代号是什么? 千至四千吨级登陆舰为主要骨干,这些舰艇吨位小、筹载量有限,直升机操作能力非常欠缺,舰上自卫武装普遍老旧,对于现代化两栖登陆作战可说有很多不足。 +071型船坞登陆舰的北约代号是什么? 为了应对新时期的国际国内形势,中国在本世纪初期紧急强化两栖作战能力,包括短时间内密集建造072、074系列登陆舰,同时也首度设计一种新型船坞登陆舰,型号为071。[1] +071型船坞登陆舰的北约代号是什么? 在两栖作战行动中,这些舰只不得不采取最危险的 +071型船坞登陆舰的北约代号是什么? 舾装中的昆仑山号 +071型船坞登陆舰的北约代号是什么? 敌前登陆方式实施两栖作战行动,必须与敌人预定阻击力量进行面对面的战斗,在台湾地区或者亚洲其他国家的沿海,几乎没有可用而不设防的海滩登陆地带,并且各国或者地区的陆军在战时,可能会很快控制这些易于登陆的海难和港口,这样就限制住了中国海军两栖登陆部队的实际登陆作战能力。 +071型船坞登陆舰的北约代号是什么? 071型登陆舰正是为了更快和更多样化的登陆作战而开发的新型登陆舰艇。[2] +071型船坞登陆舰的北约代号是什么? 071型两栖船坞登陆舰具有十分良好的整体隐身能力, +071型船坞登陆舰的北约代号是什么? 071型概念图 +071型船坞登陆舰的北约代号是什么? 该舰外部线条简洁干练,而且舰体外形下部外倾、上部带有一定角度的内倾,从而形成雷达隐身性能良好的菱形横剖面。 +071型船坞登陆舰的北约代号是什么? 舰体为高干舷平甲板型,长宽比较小,舰身宽满,采用大飞剪型舰首及楔形舰尾,舰的上层建筑位于舰体中间部位,后部是大型直升机甲板,适航性能非常突出。 +071型船坞登陆舰的北约代号是什么? 顶甲板上各类电子设备和武器系统布局十分简洁干净,各系统的突出物很少。 +071型船坞登陆舰的北约代号是什么? 该舰的两座烟囱实行左右分布式设置在舰体两侧,既考虑了隐身特点,也十分新颖。[3] +071型船坞登陆舰的北约代号是什么? 1号甲板及上层建筑物主要设置有指挥室、控 +071型船坞登陆舰的北约代号是什么? 舰尾俯视 +071型船坞登陆舰的北约代号是什么? 制舱、医疗救护舱及一些居住舱,其中医疗救护舱设置有完备的战场救护设施,可以在舰上为伤病员提供紧急手术和野战救护能力。 +071型船坞登陆舰的北约代号是什么? 2号甲板主要是舰员和部分登陆人员的居住舱、办公室及厨房。 +071型船坞登陆舰的北约代号是什么? 主甲板以下则是登陆舱,分前后两段,前段是装甲车辆储存舱,共两层,可以储存登陆装甲车辆和一些其它物资,在进出口处还设有一小型升降机,用于两层之间的移动装卸用。 +071型船坞登陆舰的北约代号是什么? 前段车辆储存舱外壁左右各设有一折叠式装载舱门,所有装载车辆在码头可通过该门直接装载或者登陆上岸。 +071型船坞登陆舰的北约代号是什么? 后段是一个巨型船坞登陆舱,总长约70米,主要用来停泊大小型气垫登陆艇、机械登陆艇或车辆人员登陆艇。[4] +071型船坞登陆舰的北约代号是什么? 自卫武装方面,舰艏设有一门PJ-26型76mm舰炮( +071型船坞登陆舰的北约代号是什么? 井冈山号舰首主炮 +071型船坞登陆舰的北约代号是什么? 俄罗斯AK-176M的中国仿制版,亦被054A采用) , 四具与052B/C相同的726-4 18联装干扰弹发射器分置于舰首两侧以及上层结构两侧,近迫防御则依赖四座布置于上层结构的AK-630 30mm防空机炮 。 +071型船坞登陆舰的北约代号是什么? 原本071模型的舰桥前方设有一座八联装海红-7短程防空导弹发射器,不过071首舰直到出海试航与2009年4月下旬的海上阅兵式中,都未装上此一武器。 +071型船坞登陆舰的北约代号是什么? 电子装备方面, 舰桥后方主桅杆顶配置一具363S型E/F频2D对空/平面搜索雷达 、一具Racal Decca RM-1290 I频导航雷达,后桅杆顶装备一具拥有球型外罩的364型(SR-64)X频2D对空/对海搜索雷达,此外还有一具LR-66C舰炮射控雷达、一具负责导引AK-630机炮的TR-47C型火炮射控雷达等。[5] +071型船坞登陆舰的北约代号是什么? 071型自卫武装布置 +071型船坞登陆舰的北约代号是什么? 071首舰昆仑山号于2006年6月开 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 竹溪县人大常委会办公室:承担人民代表大会会议、常委会会议、主任会议和常委会党组会议(简称“四会”)的筹备和服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员视察活动的联系服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 受主任会议委托,拟定有关议案草案。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担常委会人事任免的具体工作,负责机关人事管理和离退休干部的管理与服务。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大机关的行政事务和后勤保障工作,负责机关的安全保卫、文电处理、档案、保密、文印工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大常委会同市人大常委会及乡镇人大的工作联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责信息反馈工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 了解宪法、法律、法规和本级人大及其常委会的决议、决定实施情况及常委会成员提出建议办理情况,及时向常委会和主任会议报告。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大宣传工作,负责人大常委会会议宣传的组织和联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 组织协调各专门工作委员会开展工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 办公室下设五个科,即秘书科、调研科、人事任免科、综合科、老干部科。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 教科文卫工作委员会:负责人大教科文卫工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责教科文卫方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大常委会教科文卫方面会议议题调查的组织联系和调研材料的起草工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担教科文卫方面规范性备案文件的初审工作,侧重对教科文卫行政执法个案监督业务承办工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 
负责常委会组成人员和人大代表对教科文卫工作方面检查、视察的组织联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 代表工作委员会:负责与县人大代表和上级人大代表的联系、情况收集交流工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责《代表法》的宣传贯彻和贯彻实施情况的调查研究工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责县人大代表法律法规和人民代表大会制度知识学习的组织和指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会主任、副主任和委员走访联系人大代表的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责组织人大系统的干部培训。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责乡镇人大主席团工作的联系和指导。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表建议、批评和意见办理工作的联系和督办落实。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表开展活动的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 财政经济工作委员会:负责人大财政经济工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责财政经济方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 对国民经济计划和财政预算编制情况进行初审。 +我想知道武汉常住人口有多少? 武汉,简称“汉”,湖北省省会。 +我想知道武汉常住人口有多少? 它是武昌、汉口、汉阳三镇统称。 +我想知道武汉常住人口有多少? 世界第三大河长江及其最长支流汉江横贯市区,将武汉一分为三,形成武昌、汉口、汉阳,三镇跨江鼎立的格局。 +我想知道武汉常住人口有多少? 唐朝诗人李白在此写下“黄鹤楼中吹玉笛,江城五月落梅花”,因此武汉自古又称“江城”。 +我想知道武汉常住人口有多少? 武汉是中国15个副省级城市之一,全国七大中心城市之一,全市常住人口858万人。 +我想知道武汉常住人口有多少? 华中地区最大都市,华中金融中心、交通中心、文化中心,长江中下游特大城市。 +我想知道武汉常住人口有多少? 武汉城市圈的中心城市。 +我想知道武汉常住人口有多少? [3]武昌、汉口、汉阳三地被俗称武汉三镇。 +我想知道武汉常住人口有多少? 武汉西与仙桃市、洪湖市相接,东与鄂州市、黄石市接壤,南与咸宁市相连,北与孝感市相接,形似一只自西向东的蝴蝶形状。 +我想知道武汉常住人口有多少? 在中国经济地理圈内,武汉处于优越的中心位置是中国地理上的“心脏”,故被称为“九省通衢”之地。 +我想知道武汉常住人口有多少? 武汉市历史悠久,古有夏汭、鄂渚之名。 +我想知道武汉常住人口有多少? 武汉地区考古发现的历史可以上溯距今6000年的新石器时代,其考古发现有东湖放鹰台遗址的含有稻壳的红烧土、石斧、石锛以及鱼叉。 +我想知道武汉常住人口有多少? 市郊黄陂区境内的盘龙城遗址是距今约3500年前的商朝方国宫城,是迄今中国发现及保存最完整的商代古城之一。 +我想知道武汉常住人口有多少? 现代武汉的城市起源,是东汉末年的位于今汉阳的卻月城、鲁山城,和在今武昌蛇山的夏口城。 +我想知道武汉常住人口有多少? 东汉末年,地方军阀刘表派黄祖为江夏太守,将郡治设在位于今汉阳龟山的卻月城中。 +我想知道武汉常住人口有多少? 卻月城是武汉市区内已知的最早城堡。 +我想知道武汉常住人口有多少? 223年,东吴孙权在武昌蛇山修筑夏口城,同时在城内的黄鹄矶上修筑了一座瞭望塔——黄鹤楼。 +我想知道武汉常住人口有多少? 苏轼在《前赤壁赋》中说的“西望夏口,东望武昌”中的夏口就是指武汉(而当时的武昌则是今天的鄂州)。 +我想知道武汉常住人口有多少? 南朝时,夏口扩建为郢州,成为郢州的治所。 +我想知道武汉常住人口有多少? 隋置江夏县和汉阳县,分别以武昌,汉阳为治所。 +我想知道武汉常住人口有多少? 唐时江夏和汉阳分别升为鄂州和沔州的州治,成为长江沿岸的商业重镇。 +我想知道武汉常住人口有多少? 江城之称亦始于隋唐。 +我想知道武汉常住人口有多少? 两宋时武昌属鄂州,汉阳汉口属汉阳郡。 +我想知道武汉常住人口有多少? 经过发掘,武汉出土了大量唐朝墓葬,在武昌马房山和岳家咀出土了灰陶四神砖以及灰陶十二生肖俑等。 +我想知道武汉常住人口有多少? 宋代武汉的制瓷业发达。 +我想知道武汉常住人口有多少? 在市郊江夏区梁子湖旁发现了宋代瓷窑群100多座,烧制的瓷器品种很多,釉色以青白瓷为主。 +我想知道武汉常住人口有多少? 南宋诗人陆游在经过武昌时,写下“市邑雄富,列肆繁错,城外南市亦数里,虽钱塘、建康不能过,隐然一大都会也”来描写武昌的繁华。 +我想知道武汉常住人口有多少? 南宋抗金将领岳飞驻防鄂州(今武昌)8年,在此兴师北伐。 +我想知道武汉常住人口有多少? 元世祖至元十八年(1281年),武昌成为湖广行省的省治。 +我想知道武汉常住人口有多少? 这是武汉第一次成为一级行政单位(相当于现代的省一级)的治所。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇,托洛茨基是联共(布)党内和第三国际时期反对派的领导人,托派"第四国际"的创始人和领导人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基(俄国与国际历史上最重要的无产阶级革命家之一,二十世纪国际共产主义运动中最具争议的、也是备受污蔑的左翼反对派领袖,他以对古典马克思主义“不断革命论”的独创性发展闻名于世,第三共产国际和第四国际的主要缔造者之一(第三国际前三次代表大会的宣言执笔人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在1905年俄国革命中被工人群众推举为彼得堡苏维埃主席(而当时布尔什维克多数干部却还在讨论是否支持苏维埃,这些干部后来被赶回俄国的列宁痛击)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1917年革命托洛茨基率领“区联派”与列宁派联合,并再次被工人推举为彼得格勒苏维埃主席。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 对于十月革命这场20世纪最重大的社会革命,托洛茨基赢得了不朽的历史地位。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后来成了托洛茨基死敌的斯大林,当时作为革命组织领导者之一却写道:“起义的一切实际组织工作是在彼得格勒苏维埃主席托洛茨基同志直接指挥之下完成的。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 我们可以确切地说,卫戍部队之迅速站在苏维埃方面来,革命军事委员会的工作之所以搞得这样好,党认为这首先要归功于托洛茨基同志。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (值得一提的是,若干年后,当反托成为政治需要时,此类评价都从斯大林文章中删掉了。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )甚至连后来狂热的斯大林派雅克·沙杜尔,当时却也写道:“托洛茨基在十月起义中居支配地位,是起义的钢铁灵魂。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (苏汉诺夫《革命札记》第6卷P76。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )不仅在起义中,而且在无产阶级政权的捍卫、巩固方面和国际共产主义革命方面,托洛茨基也作出了极其卓越的贡献(外交官-苏联国际革命政策的负责人、苏联红军缔造者以及共产国际缔造者)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 革命后若干年里,托洛茨基与列宁的画像时常双双并列挂在一起;十月革命之后到列宁病逝之前,布尔什维克历次全国代表大会上,代表大会发言结束均高呼口号:“我们的领袖列宁和托洛茨基万岁!” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在欧美共运中托洛茨基的威望非常高。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 
后人常常认为托洛茨基只是一个知识分子文人,实际上他文武双全,而且谙熟军事指挥艺术,并且亲临战场。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 正是他作为十月革命的最高军事领袖(在十月革命期间他与士兵一起在战壕里作战),并且在1918年缔造并指挥苏联红军,是一个杰出的军事家(列宁曾对朋友说,除了托洛茨基,谁还能给我迅速地造成一支上百万人的强大军队? +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在内战期间,他甚至坐装甲列车冒着枪林弹雨亲临战场指挥作战,差点挨炸死;当反革命军队进攻彼得堡时,当时的彼得堡领导人季诺维也夫吓得半死,托洛茨基却从容不迫指挥作战。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 同时托洛茨基又是一个高明的外交家,他曾强硬地要求英国政府释放因反战宣传被囚禁在英国的俄国流亡革命者,否则就不许英国公民离开俄国,连英国政府方面都觉得此举无懈可击;他并且把居高临下的法国到访者当场轰出他的办公室(革命前法国一直是俄国的头号债主与政治操纵者),却彬彬有礼地欢迎前来缓和冲突的法国大使;而在十月革命前夕,他对工人代表议会质询的答复既保守了即将起义的军事秘密,又鼓舞了革命者的战斗意志,同时严格遵循现代民主与公开原则,这些政治答复被波兰人多伊彻誉为“外交辞令的杰作”(伊·多伊彻的托氏传记<先知三部曲·武装的先知>第九章P335,第十一章P390)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基在国民经济管理与研究工作中颇有创造:是苏俄新经济政策的首先提议者以及社会主义计划经济的首先实践者。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1928年斯大林迟迟开始的计划经济实验,是对1923年以托洛茨基为首的左翼反对派经济纲领的拙劣剽窃和粗暴翻版。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 因为统治者的政策迟到,使得新经济政策到1928年已产生了一个威胁政权生存的农村资产阶级,而苏俄工人阶级国家不得不强力解决——而且是不得不借助已蜕化为官僚集团的强力来解决冲突——结果导致了1929年到30年代初的大饥荒和对农民的大量冤枉错杀。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 另外,他还对文学理论有很高的造诣,其著作<文学与革命>甚至影响了整整一代的国际左翼知识分子(包括中国的鲁迅、王实味等人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 他在哈佛大学图书馆留下了100多卷的<托洛茨基全集>,其生动而真诚的自传和大量私人日记、信件,给人留下了研究人类生活各个方面的宝贵财富,更是追求社会进步与解放的历史道路上的重要知识库之一。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基1879年10月26日生于乌克兰赫尔松县富裕农民家庭,祖籍是犹太人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 原姓布隆施泰因。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1896年开始参加工人运动。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1897年 ,参加建立南俄工人协会 ,反对沙皇专制制度。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1898年 在尼古拉也夫组织工人团体,被流放至西伯利亚。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1902年秋以署名托洛茨基之假护照逃到伦敦,参加V.I.列宁、G.V.普列汉诺夫等人主编的<火星报>的工作。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥,位于洞庭湖与长江交汇处,东接岳阳市区洞庭大道和107国道、京珠高速公路,西连省道306线,是国内目前最长的内河公路桥。 +谁知道洞庭湖大桥有多长? 路桥全长10173.82m,其中桥长5747.82m,桥宽20m,西双向四车道,是我国第一座三塔双索面斜拉大桥,亚洲首座不等高三塔双斜索面预应力混凝土漂浮体系斜拉桥。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是我国最长的内河公路桥,大桥横跨东洞庭湖区,全长10174.2米,主桥梁长5747.8米。 +谁知道洞庭湖大桥有多长? 大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区运输抗洪抢险物资提供了一条快速通道该桥设计先进,新颖,造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是湖区人民的造福桥,装点湘北门户的形象桥,对优化交通网络绪构,发展区域经济,保障防汛救灾,缩短鄂、豫、陕等省、市西部车辆南下的运距,拓展岳阳城区的主骨架,提升岳阳城市品位,增强城市辐射力,有着十分重要的意义。 +谁知道洞庭湖大桥有多长? 自1996年12月开工以来,共有10支施工队伍和两支监理队伍参与了大桥的建设。 +谁知道洞庭湖大桥有多长? 主桥桥面高52米(黄海),设计通航等级Ⅲ级。 +谁知道洞庭湖大桥有多长? 主桥桥型为不等高三塔、双索面空间索、全飘浮体系的预应力钢筋混凝土肋板梁式结构的斜拉桥,跨径为130+310+310+130米。 +谁知道洞庭湖大桥有多长? 索塔为双室宝石型断面,中塔高为125.684米,两边塔高为99.311米。 +谁知道洞庭湖大桥有多长? 三塔基础为3米和3.2米大直径钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 引桥为连续梁桥,跨径20至50米,基础直径为1.8和2.5米钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 该桥设计先进、新颖、造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系,岳阳洞庭湖大桥是我国首次采用不等高三塔斜拉桥桥型的特大桥,设计先进,施工难度大位居亚洲之首,是湖南省桥梁界的一大科研项目。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥设计为三塔斜拉桥,空间双斜面索,主梁采用前支点挂篮施工,并按各种工况模拟挂篮受力进行现场试验,获得了大量有关挂篮受力性能和实际刚度的计算参数,作为施工控制参数。 +谁知道洞庭湖大桥有多长? 利用组合式模型单元,推导了斜拉桥分离式双肋平板主梁的单元刚度矩阵,并进行了岳阳洞庭湖大桥的空间受力分析,结果表明此种单元精度满足工程要求,同时在施工工艺方面也积累了成功经验。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区抗洪抢险物资运输提供了一条快速通道。 +谁知道洞庭湖大桥有多长? 湖大桥设计先进,造型美丽,科技含量高。 +谁知道洞庭湖大桥有多长? 洞庭大桥还是一道美丽的风景线,大桥沿岸风景与岳阳楼,君山岛、洞庭湖等风景名胜融为一体,交相辉映,成为世人了解岳阳的又一崭新窗口,也具有特别旅游资源。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥多塔斜拉桥新技术研究荣获国家科学技术进步二等奖、湖南省科学技术进步一等奖,并获第五届詹天佑大奖。 +谁知道洞庭湖大桥有多长? 大桥在中国土木工程学会2004年第16届年会上入选首届《中国十佳桥梁》,名列斜拉桥第二位。 +谁知道洞庭湖大桥有多长? 2001年荣获湖南省建设厅优秀设计一等奖,省优秀勘察一等奖。 +谁知道洞庭湖大桥有多长? 2003年荣获国家优秀工程设计金奖, "十佳学术活动"奖。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? ?不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? 不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 在电视节目上,大卫永远微笑,自信而光鲜,就像每一个成功的电视人一样,说起收入,他也绝对不落人后。 +天气预报员的布景师是谁? 不过,大卫的个人生活可就不那么如意了。 +天气预报员的布景师是谁? 
与妻子劳伦(霍普·戴维斯)的离婚一直让他痛苦;儿子迈克吸大麻上瘾,正在进行戒毒,可戒毒顾问却对迈克有着异样的感情;女儿雪莉则体重惊人,总是愁眉苦脸、孤独寂寞;大卫的父亲罗伯特(迈克尔·凯恩),一个世界著名的小说家,虽然罗伯特不想再让大卫觉得负担过重,可正是他的名声让大卫的一生都仿佛处在他的阴影之下,更何况,罗伯特就快重病死了。 +天气预报员的布景师是谁? 和妻子的离婚、父亲的疾病、和孩子之间完全不和谐的关系,都让大卫每天头疼,而每次当他越想控制局面,一切就越加复杂。 +天气预报员的布景师是谁? 然而就在最后人们再也不会向他扔快餐,或许是因为他总是背着弓箭在大街上走。 +天气预报员的布景师是谁? 最后,面对那份高额工作的接受意味着又一个新生活的开始。 +天气预报员的布景师是谁? 也许,生活就像天气,想怎么样就怎么样,完全不可预料。 +天气预报员的布景师是谁? 导 演:戈尔·维宾斯基 Gore Verbinski +天气预报员的布景师是谁? 编 剧:Steve Conrad .....(written by) +天气预报员的布景师是谁? 演 员:尼古拉斯·凯奇 Nicolas Cage .....David Spritz +天气预报员的布景师是谁? 尼古拉斯·霍尔特 Nicholas Hoult .....Mike +天气预报员的布景师是谁? 迈克尔·凯恩 Michael Caine .....Robert Spritzel +天气预报员的布景师是谁? 杰蒙妮·德拉佩纳 Gemmenne de la Peña .....Shelly +天气预报员的布景师是谁? 霍普·戴维斯 Hope Davis .....Noreen +天气预报员的布景师是谁? 迈克尔·瑞斯玻利 Michael Rispoli .....Russ +天气预报员的布景师是谁? 原创音乐:James S. Levine .....(co-composer) (as James Levine) +天气预报员的布景师是谁? 汉斯·兹米尔 Hans Zimmer +天气预报员的布景师是谁? 摄 影:Phedon Papamichael +天气预报员的布景师是谁? 剪 辑:Craig Wood +天气预报员的布景师是谁? 选角导演:Denise Chamian +天气预报员的布景师是谁? 艺术指导:Tom Duffield +天气预报员的布景师是谁? 美术设计:Patrick M. Sullivan Jr. .....(as Patrick Sullivan) +天气预报员的布景师是谁? 布景师 :Rosemary Brandenburg +天气预报员的布景师是谁? 服装设计:Penny Rose +天气预报员的布景师是谁? 视觉特效:Charles Gibson +天气预报员的布景师是谁? David Sosalla .....Pacific Title & Art Studio +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足球协会。 +韩国国家男子足球队教练是谁? 韩国队自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 北京时间2014年6月27日3时,巴西世界杯小组赛H组最后一轮赛事韩国对阵比利时,韩国队0-1不敌比利时,3场1平2负积1分垫底出局。 +韩国国家男子足球队教练是谁? 球队教练:洪明甫 +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(韩国国家男子足球队???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足联。 +韩国国家男子足球队教练是谁? 韩国队是众多亚洲球队中,在世界杯表现最好,他们自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 2014年世界杯外围赛,韩国在首轮分组赛以首名出线次轮分组赛,与伊朗、卡塔尔、乌兹别克以及黎巴嫩争逐两个直接出线决赛周资格,最后韩国仅以较佳的得失球差压倒乌兹别克,以小组次名取得2014年世界杯决赛周参赛资格,也是韩国连续八次晋身世界杯决赛周。 +韩国国家男子足球队教练是谁? 虽然韩国队在世界杯成绩为亚洲之冠,但在亚洲杯足球赛的成绩却远不及世界杯。 +韩国国家男子足球队教练是谁? 韩国只在首两届亚洲杯(1956年及1960年)夺冠,之后五十多年未能再度称霸亚洲杯,而自1992年更从未打入过决赛,与另一支东亚强队日本近二十年来四度在亚洲杯夺冠成强烈对比。[1] +韩国国家男子足球队教练是谁? 人物简介 +韩国国家男子足球队教练是谁? 车范根(1953年5月22日-)曾是大韩民国有名的锋线选手,他被欧洲媒体喻为亚洲最佳输出球员之一,他也被认为是世界最佳足球员之一。 +韩国国家男子足球队教练是谁? 他被国际足球史料与数据协会评选为20世纪亚洲最佳球员。 +韩国国家男子足球队教练是谁? 他在85-86赛季是德甲的最有价值球员,直到1999年为止他都是德甲外国球员入球纪录保持者。 +韩国国家男子足球队教练是谁? 德国的球迷一直没办法正确说出他名字的发音,所以球车范根(左)迷都以炸弹车(Cha Boom)称呼他。 +韩国国家男子足球队教练是谁? 这也代表了他强大的禁区得分能力。 +韩国国家男子足球队教练是谁? 职业生涯 +韩国国家男子足球队教练是谁? 车范根生于大韩民国京畿道的华城市,他在1971年于韩国空军俱乐部开始了他的足球员生涯;同年他入选了韩国19岁以下国家足球队(U-19)。 +韩国国家男子足球队教练是谁? 隔年他就加入了韩国国家足球队,他是有史以来加入国家队最年轻的球员。 +韩国国家男子足球队教练是谁? 车范根在27岁时前往德国发展,当时德甲被认为是世界上最好的足球联赛。 +韩国国家男子足球队教练是谁? 他在1978年12月加入了达姆施塔特,不过他在那里只待了不到一年就转到当时的德甲巨人法兰克福。 +韩国国家男子足球队教练是谁? 车范根很快在新俱乐部立足,他帮助球队赢得79-80赛季的欧洲足协杯。 +韩国国家男子足球队教练是谁? 在那个赛季过后,他成为德甲薪水第三高的球员,不过在1981年对上勒沃库森的一场比赛上,他的膝盖严重受伤,几乎毁了他的足球生涯。 +韩国国家男子足球队教练是谁? 在1983年车范根转投勒沃库森;他在这取得很高的成就,他成为85-86赛季德甲的最有价值球员,并且在1988年帮助球队拿下欧洲足协杯,也是他个人第二个欧洲足协杯。 +韩国国家男子足球队教练是谁? 他在决赛对垒西班牙人扮演追平比分的关键角色,而球会才在点球大战上胜出。 +韩国国家男子足球队教练是谁? 车范根在1989年退休,他在308场的德甲比赛中进了98球,一度是德甲外国球员的入球纪录。 +韩国国家男子足球队教练是谁? 执教生涯 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学,简称台湾科大、台科大或台科,是位于台湾台北市大安区的台湾第一所高等技职体系大专院校,现为台湾最知名的科技大学,校本部比邻国立台湾大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 该校已于2005年、2008年持续入选教育部的“发展国际一流大学及顶尖研究中心计划”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 
“国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本校校地约44.5公顷,校本部位于台北市基隆路四段四十三号,。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 民国68年成立硕士班,民国71年成立博士班,现有大学部学生5,664人,研究生4,458人,专任教师451位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2001年在台湾地区教育部筹划之研究型大学(“国立”大学研究所基础教育重点改善计画)中,成为全台首批之9所大学之一 。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 自2005年更在“教育部”所推动“五年五百亿 顶尖大学”计划下,遴选为适合发展成“顶尖研究中心”的11所研究型大学之一。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学部设有二年制、四年制及工程在职人员进修班等三种学制;凡二专、三专及五专等专科学校以上之毕业生,皆可以报考本校大学部二年制,而高职、高中毕业生,可以报考本校大学部四年制。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工业管理、电子工程、机械工程、营建工程及应用外语系等,则设有在职人员进修班学制,其招生对象为在职人员,利用夜间及暑假期间上课。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 凡在本校大学部修毕应修学分且成绩及格者皆授予学士学位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学目前设有工程、电资、管理、设计、人文社会及精诚荣誉等六个学院,分别有机械、材料科学与工程、营建、化工、电子、电机、资工、工管、企管、资管、建筑、工商业设计、应用外语等13个系及校内招生之财务金融学士学位学程、科技管理学士学位学程;全校、工程、电资、管理、创意设计等五个不分系菁英班及光电研究所、管理研究所、财务金融研究所、科技管理研究所、管理学院MBA、数位学习教育研究所、医学工程研究所、自动化及控制研究所、工程技术研究所、专利研究所等独立研究所,此外尚有人文学科负责人文及社会类等课程之教学,通识学科负责法律、音乐、环保类等课程之教学,以及师资培育中心专以培养学生未来担任中等学校工、商、管理、设计等科之合格教师,合计23个独立系所、师资培育中心、人文学科及通识学科等教学单位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学至今各系所毕业校友已达约56,456位,毕业生出路包含出国继续深造、在台深造以及投身于产业界。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 由于实作经验丰富,理论基础完备,工作态度认真,毕业校友担任政府要职、大学教授、大学校长及企业主管者众多,深受各界的肯定。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工商业设计系副教授孙春望与硕一生全明远耗时两个月自制之三分钟动画短片“立体悲剧”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本片入选有“动画奥斯卡”之称的“ACM SIGGRAPH”国际动画展,并获得观众票选第一名,这也是台湾首次入选及获奖的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 击败了好莱坞知名导演史蒂芬·史匹柏的“世界大战”、乔治卢卡斯的“星际大战三部曲”、梦工厂出品的动画“马达加斯加”、军机缠斗片“机战未来”及美国太空总署、柏克莱加州大学等好莱坞名片及顶尖学术单位制作的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2009年荣获有工业设计界奥斯卡奖之称的“德国iF设计大奖”国立台湾科技大学设计学院获得大学排名的全球第二,仅次于韩国三星美术设计学院“SADI”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 总体排名 依据《泰晤士高等教育》(THES-QS)在2009年的世界大学排名调查,台科大排名全世界第351名,在台湾所有大学中排名第五,仅次于台大,清大,成大及阳明,并且是台湾唯一进入世界四百大名校的科技大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 依据在欧洲拥有广大声誉的“Eduniversal商学院排名网”2008年的资料,台湾有七所大学的商管学院被分别列入世界1000大商学院,其中台科大位在“卓越商学院”(EXCELLENT Business Schools,国内主要)之列,“推荐程度”(Recommendation Rate)为全台第四,仅次于台大、政大、中山,与交大并列。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 目前设有工程、电资、管理、设计、人文社会及精诚荣誉学院等六个学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 预计于竹北新校区设立产学合作学院及应用理学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾建筑科技中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●智慧型机械人研究中心科技成果展示(15张) +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾彩卷与博彩研究中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●电力电子技术研发中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●NCP-Taiwan办公室 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●资通安全研究与教学中心 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 
宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 谓鬼神赐福降灾神妙莫测之道。 +在日本,神道最初属于什么信仰? 《易·观》:“观天之神道,而四时不忒,圣人以神道设教,而天下服矣。” +在日本,神道最初属于什么信仰? 孔颖达 疏:“微妙无方,理不可知,目不可见,不知所以然而然,谓之神道。” +在日本,神道最初属于什么信仰? 《文选·王延寿<鲁灵光殿赋>》:“敷皇极以创业,协神道而大宁。” +在日本,神道最初属于什么信仰? 张载 注:“协和神明之道,而天下大宁。” +在日本,神道最初属于什么信仰? 南朝 梁 刘勰 《文心雕龙·正纬》:“夫神道阐幽,天命微显。” +在日本,神道最初属于什么信仰? 鲁迅 《中国小说史略》第五篇:“﹝ 干宝 ﹞尝感於其父婢死而再生,及其兄气绝复苏,自言见天神事,乃撰《搜神记》二十卷,以‘发明神道之不诬’。” +在日本,神道最初属于什么信仰? 神道设教 观卦里面蕴含着《易经》固有的诸如神道设教、用舍行藏、以德化民等思想,是孔子把这些思想发掘出来。 +在日本,神道最初属于什么信仰? 「据此是孔子见当时之人,惑于吉凶祸福,而卜筮之史,加以穿凿傅会,故演易系辞,明义理,切人事,借卜筮以教后人,所谓以神道设教,其所发明者,实即羲文之义理,而非别有义理,亦非羲文并无义理,至孔子始言义理也,当即朱子之言而小变之曰,易为卜筮作,实为义理作,伏羲文王之易,有占而无文,与今人用火珠林起课者相似,孔子加卦爻辞如签辞,纯以理言,实即羲文本意,则其说分明无误矣。」 +在日本,神道最初属于什么信仰? 孔子所发掘的《易经》思想与孔子在《论语》书中表现出来的思想完全一致。 +在日本,神道最初属于什么信仰? 《易传》的思想反映了孔子的思想,这个思想是《周易》的,也是孔子的。 +在日本,神道最初属于什么信仰? 在《周易》和孔子看来,神不是有意识的人格化的上帝。 +奥林匹克里昂获得了几连霸? 里昂 Lyon 全名 Olympique lyonnais 绰号 Les Gones、OL 成立 1950年 城市 法国,里昂 主场 热尔兰球场(Stade Gerland) 容纳人数 41,044人 主席 奥拉斯 主教练 雷米·加尔德 联赛 法国足球甲级联赛 2013–14 法甲,第 5 位 网站 官方网站 主场球衣 客场球衣 第三球衣 日尔兰体育场 奥林匹克里昂(Olympique lyonnais,简称:OL及Lyon,中文简称里昂)是一间位于法国东南部罗纳-阿尔卑斯区的里昂市的足球会,成立于1950年8月3日,前身为里昂·奥林匹克(Lyon Olympique)体育俱乐部其中一个分支的足球队,1889年离开体育俱乐部自立门户成立新俱乐部,但官方网站表示俱乐部于1950年正式成立。 +奥林匹克里昂获得了几连霸? 现时在法国足球甲级联赛比赛,俱乐部同时设立男子及女子足球队。 +奥林匹克里昂获得了几连霸? 里昂是首届法国足球甲级联赛成员之一,可惜名列第十五位而降落乙组,1951年以乙级联赛冠军获得创会后首次锦标。 +奥林匹克里昂获得了几连霸? 球队在法国足球史上没有取得辉煌成绩,比较优异的算是六十年代曾杀入欧洲杯赛冠军杯四强,及3度晋身法国杯决赛并2次成功获冠。 +奥林匹克里昂获得了几连霸? 直至九十年代末里昂由辛天尼带领,先连续取得联赛头三名,到2002年终于首次登上法国顶级联赛冠军宝座,同年勒冈(Paul Le Guen)接替执教法国国家足球队的辛天尼,他其后继续带领里昂保持气势,加上队中球员小儒尼尼奧、迪亚拉、克里斯蒂亞諾·馬克斯·戈麥斯、迈克尔·埃辛、西德尼·戈武及门将格雷戈里·库佩表现突出,2003年至2005年横扫3届联赛冠军,创下连续四年夺得联赛锦标,平了1960年代末圣艾蒂安及1990年代初马赛的四连冠纪录。 +奥林匹克里昂获得了几连霸? 2005年前利物浦主教练热拉尔·霍利尔重返法国担任新任主教练,并加入葡萄牙中场蒂亚戈,和前巴伦西亚前锋约翰·卡鲁。 +奥林匹克里昂获得了几连霸? 他亦成功带领里昂赢得一届法甲冠军。 +奥林匹克里昂获得了几连霸? 2007年里昂成为首支上市的法国足球俱乐部,招股价21至24.4欧元,发行370万股,集资8400万欧元[1]。 +奥林匹克里昂获得了几连霸? 2007年4月21日,联赛次名图卢兹二比三不敌雷恩,令处于榜首的里昂领先次席多达17分距离,里昂因此提前六轮联赛庆祝俱乐部连续第六年夺得联赛冠军,亦是欧洲五大联赛(英格兰、德国、西班牙、意大利及法国)历史上首支联赛六连冠队伍[2]。 +奥林匹克里昂获得了几连霸? 在2007-08年赛季,里昂再一次成功卫冕联赛锦标,达成七连霸伟业。 +奥林匹克里昂获得了几连霸? 不过在2008-09赛季,里昂排名法甲第三位,联赛冠军被波尔多所获得。 +奥林匹克里昂获得了几连霸? 于2010年4月,里昂以两回合3比2的比分于欧洲冠军联赛击败波尔多跻身四强,此乃里昂首次晋级此项顶级杯赛的四强阶段。 +奥林匹克里昂获得了几连霸? 粗体字为新加盟球员 +奥林匹克里昂获得了几连霸? 以下球员名单更新于2014年8月27日,球员编号参照 官方网站,夏季转会窗为6月9日至8月31日 +火柴人刺杀行动怎么才能过关? 移动鼠标控制瞄准,点击鼠标左键进行射击。 +火柴人刺杀行动怎么才能过关? 游戏加载完成后点击STARTGAME-然后点击STARTMISSION即可开始游戏。 +火柴人刺杀行动怎么才能过关? 这里不仅仅考验的是你的枪法而且最重要的是你的智慧,喜欢火柴人类型游戏的玩家可以进来小试身手。 +火柴人刺杀行动怎么才能过关? 控制瞄准,刺杀游戏中的目标人物即可过关哦。 +你知道2月14日西方情人节是因何起源的吗? 情人节(英语:Valentine's Day),情人节的起源有多个版本,其中一个说法是在公元三世纪,古罗马暴君为了征召更多士兵,禁止婚礼,一名叫瓦伦丁Valentine的修士不理禁令,秘密替人主持婚礼,结果被收监,最后处死。 +你知道2月14日西方情人节是因何起源的吗? 而他死的那天就是2月14日,为纪念Valentine的勇敢精神,人们将每年的2月14日定为Valentine的纪念日。 +你知道2月14日西方情人节是因何起源的吗? 因此成了后来的“情人节”。 +你知道2月14日西方情人节是因何起源的吗? 另外,据记载,教宗在公元496年废除牧神节,把2月14日定为圣瓦伦丁日,即是St.Valentine's Day,后来成为是西方的节日之一。 +你知道2月14日西方情人节是因何起源的吗? 中文名称:情人节 +你知道2月14日西方情人节是因何起源的吗? 外文名称:Valentine‘s Day +你知道2月14日西方情人节是因何起源的吗? 别名:情人节圣瓦伦丁节 +你知道2月14日西方情人节是因何起源的吗? 公历日期:2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源时间:公元270年2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源事件:人们为了纪念为情人做主而牺牲的瓦伦丁神父,把他遇害的那一天(2月14日)称为情人节。 +你知道2月14日西方情人节是因何起源的吗? 地区:欧美地区 +你知道2月14日西方情人节是因何起源的吗? 宗教:基督教 +你知道2月14日西方情人节是因何起源的吗? 其他信息:西方的传统节日之一。 +你知道2月14日西方情人节是因何起源的吗? 男女在这一天互送礼物(如贺卡和玫瑰花等)用以表达爱意或友好。 +你知道2月14日西方情人节是因何起源的吗? 
据台湾“今日台湾人讨厌情人节新闻网”报道,西洋情人节即将来到,求职网进行“办公室恋情及情人节调查”发现,在目前全台上班族的感情状态中,有情人相伴的比率约5成5,4成5的上班族单身;较出乎意料的结果是,情人节以近3成(28%)的占比,登上最讨厌的节日第一名,端午节以24.3%居第二;农历年则以18.2%居第三;第四名是圣诞节,占12.4%。 +你知道2月14日西方情人节是因何起源的吗? 调查指出,情人节对单身族来说,不仅成为压力,也显得更加孤单,在情人节当天,单身的上班族有将近4成(39.1%)的人在家看电视度过,近两成(18.7%)上网聊天,有1成4(14.8%)的人,不畏满街闪光,勇气十足出门看电影,近1成(9.7%)的上班族选择留在公司加班;另外有 5.4%的人,会在情人节当天积极参加联谊,希望能改变自己的感情状态。 +你知道2月14日西方情人节是因何起源的吗? 情侣们在情人节当天,庆祝方式以吃浪漫大餐最多(37.1%),不过有近3成(27%)的情侣,在情人节当天不会特别庆祝情人节,且这个比率远比第三名的旅游(占比11.5%)高出1倍以上。 +你知道2月14日西方情人节是因何起源的吗? 在情人节当天庆祝的开销上,可以说是小资男女当道,选择1000元(新台币,下同)以内的上班族最多占33.1%,情人节当天的花费上班族的平均花费是2473元,大手笔花费上万元以上庆祝情人节的,占比只有2.5%。 +你知道2月14日西方情人节是因何起源的吗? 情人节的起源众说纷纭,而为纪念罗马教士瓦伦丁是其中一个普遍的说法。 +你知道2月14日西方情人节是因何起源的吗? 据《世界图书百科全书》(World Book Encyclopedia)数据指出:“在公元200年时期,罗马皇帝克劳狄二世禁止年轻男子结婚。 +你知道2月14日西方情人节是因何起源的吗? 他认为未婚男子可以成为更优良的士兵。 +你知道2月14日西方情人节是因何起源的吗? 一位名叫瓦伦丁的教士违反了皇帝的命令,秘密为年轻男子主持婚礼,引起皇帝不满,结果被收监,据说瓦伦丁于公元269年2月14日被处决。 +你知道2月14日西方情人节是因何起源的吗? 另外,据《天主教百科全书》(The Catholic情人节 Encyclopedia)指出,公元496年,教宗圣基拉西乌斯一世在公元第五世纪末叶废除了牧神节,把2月14日定为圣瓦伦丁日。” +你知道2月14日西方情人节是因何起源的吗? 这个节日现今以“圣瓦伦丁节”——亦即情人节的姿态盛行起来。 +你知道2月14日西方情人节是因何起源的吗? 但是在第2次梵蒂冈大公会议后,1969年的典礼改革上,整理了一堆在史实上不确定是否真实存在的人物以后,圣瓦伦丁日就被废除了。 +你知道2月14日西方情人节是因何起源的吗? 现在天主教圣人历已经没有圣瓦伦丁日(St. Valentine's Day)。 +你知道2月14日西方情人节是因何起源的吗? 根据《布卢姆尔的警句与寓言辞典》记载:“圣瓦伦丁是个罗马教士,由于援助受逼害的基督徒而身陷险境,后来他归信基督教,最后被处死,卒于二月十四日”古代庆祝情人节的习俗与瓦伦丁拉上关系,可能是纯属巧合而已。 +你知道2月14日西方情人节是因何起源的吗? 事实上,这个节日很可能与古罗马的牧神节或雀鸟交配的季节有关。 +你知道2月14日西方情人节是因何起源的吗? 情人节的特色是情侣互相馈赠礼物。 +你知道2月14日西方情人节是因何起源的吗? 时至今日,人们则喜欢以情人卡向爱人表达情意。 +防卫大学每年招收多少学生? 防卫大学的前身是保安大学。 +防卫大学每年招收多少学生? 防卫大学是日本自卫队培养陆、海、空三军初级军官的学校,被称为日军"军官的摇篮"。 +防卫大学每年招收多少学生? 防卫大学是日军的重点院校。 +防卫大学每年招收多少学生? 日本历届内阁首相都要到防卫大学视察"训示",并亲自向学生颁发毕业证书。 +防卫大学每年招收多少学生? 日军四分之一的军官、三分之一的将官从这里走出。 +防卫大学每年招收多少学生? 防卫大学毕业生已成为日军军官的中坚力量。 +防卫大学每年招收多少学生? 防卫大学每年从地方招收18岁至21岁的应届高中毕业生和同等学历的青年。 +防卫大学每年招收多少学生? 每年招生名额为530名。 +防卫大学每年招收多少学生? 1950年 8月,日本组建警察预备队,1952年改为保安队。 +防卫大学每年招收多少学生? 为了充实保安队干部队伍,提高干部军政素质,1953年4月成立了保安大学,校址设在三浦半岛的久里滨。 +防卫大学每年招收多少学生? 1954年7月1日保安厅改为防卫厅。 +防卫大学每年招收多少学生? 在保安队基础上,日本建立了陆、海、空三军自卫队,保安大学遂改名为防卫大学,1955年迁至三浦半岛东南方的小原台。 +防卫大学每年招收多少学生? 学校直属防卫厅领导。 +防卫大学每年招收多少学生? 防卫大学的教育方针是:要求学生德智体全面发展,倡导学生崇尚知识和正义,培养学生具有指挥各种部队的能力。 +防卫大学每年招收多少学生? 防卫大学每年招生名额为530名,其中陆军300名,海军100名,空军130名。 +防卫大学每年招收多少学生? 根据自卫队向妇女敞开军官大门的决定,防卫大学1992年首次招收女学员35名。 +防卫大学每年招收多少学生? 考试分两次进行。 +防卫大学每年招收多少学生? 第一次,每年11月份进行学科考试;第二次,12月份进行口试和体检。 +防卫大学每年招收多少学生? 学校按陆、海、空三军分别设大学本科班和理工科研究生班。 +防卫大学每年招收多少学生? 本科班学制4年,又分为理工和人文社会学两大科。 +防卫大学每年招收多少学生? 学员入学后先分科,530人中有460人专攻理科,70人专攻文科。 +防卫大学每年招收多少学生? 第1学年按专科学习一般大学课程和一般军事知识。 +防卫大学每年招收多少学生? 第2学年以后在军事上开始区分军种,学员分别学习陆、海、空军的专门课程。 +防卫大学每年招收多少学生? 文化课和军事课的比例是6:l。 +防卫大学每年招收多少学生? 文化课程有人文、社会、自然、外语、电气工程、机械工程、土木建筑工程、应用化学、应用物理、航空、航海等。 +防卫大学每年招收多少学生? 军事训练课每学年6周,按一年四季有比例地安排教学内容,对学生进行军事技术和体能训练。 +防卫大学每年招收多少学生? 理工科研究生班,每年招生1期,学制2年,每期招收90人,设电子工程、航空工程、兵器制造等7个专业,课程按一般大学硕士课程标准设置。 +防卫大学每年招收多少学生? 防卫大学的课程和训练都十分紧张。 +防卫大学每年招收多少学生? 近年来,为了增强防卫大学的吸引力,克服考生逐年减少的倾向广泛征集优秀人才,学校进行了一些改革,改变入学考试办法,各高中校长以内部呈报的形式向防卫大学推荐品学兼优的学生;减少学生入学考试科目,放宽对报考防卫大学的学生的视力要求;降低学分数(大约降低30学分);改善学生宿舍条件。 +防卫大学每年招收多少学生? 防卫大学的学生生活紧张而愉快。 +《威鲁贝鲁的物语》官网是什么? 10年前大战后,威鲁贝鲁国一致辛勤的保护着得来不易的和平,但是与邻国圣卡特拉斯国的关系却不断的紧张,战争即将爆发。 +《威鲁贝鲁的物语》官网是什么? 为了避免战争,威鲁贝鲁国王海特鲁王决定将自己最大的女儿公主莉塔嫁给圣卡特拉斯国的王子格鲁尼亚。 +《威鲁贝鲁的物语》官网是什么? 但是莉塔却刺伤了政治婚姻的对象格鲁尼亚王子逃了出去,这事激怒了圣卡特拉斯国的国王兰帕诺夫王,并下令14天之内抓到王女并执行公开处刑来谢罪,不然两国就要开战。 +《威鲁贝鲁的物语》官网是什么? 《威鲁贝鲁的物语~Sisters of Wellber~》 +《威鲁贝鲁的物语》官网是什么? (Sisters of Wellber) +《威鲁贝鲁的物语》官网是什么? 日文名 ウエルベールの物语 +《威鲁贝鲁的物语》官网是什么? 官方网站 http://www.avexmovie.jp/lineup/wellber/ +《威鲁贝鲁的物语》官网是什么? 
为了回避发生战争这个最坏的结果,莉塔下定决心去中立国古利达姆。 diff --git a/examples/text_graph/erniesage/example_data/train_data.txt b/examples/text_graph/erniesage/example_data/train_data.txt new file mode 100644 index 0000000000000000000000000000000000000000..e9aead6c89fa2fdbed64e5dada352106a8deb349 --- /dev/null +++ b/examples/text_graph/erniesage/example_data/train_data.txt @@ -0,0 +1,1000 @@ +黑缘粗角肖叶甲触角有多大? 体长卵形,棕红色;鞘翅棕黄或淡棕色,外缘和中缝黑色或黑褐色;触角基部3、4节棕黄,余节棕色。 +黑缘粗角肖叶甲触角有多大? 头部刻点粗大,分布不均匀,头顶刻点十分稀疏;触角基部的内侧有一个三角形光瘤,唇基前缘呈半圆形凹切。 +黑缘粗角肖叶甲触角有多大? 触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。 +黑缘粗角肖叶甲触角有多大? 前胸背板横宽,宽约为长的两倍,侧缘敞出较宽,圆形,敞边与盘区之间有一条细纵沟;盘区刻点相当密,前半部刻点较大于后半部。 +黑缘粗角肖叶甲触角有多大? 小盾片舌形,光亮,末端圆钝。 +黑缘粗角肖叶甲触角有多大? 鞘翅刻点粗大,不规则排列,肩部之后的刻点更为粗大,具皱褶,近中缝的刻点较小,略呈纵行排列。 +黑缘粗角肖叶甲触角有多大? 前胸前侧片前缘直;前胸后侧片具粗大刻点。 +黑缘粗角肖叶甲触角有多大? 足粗壮;胫节具纵脊,外端角向外延伸,呈弯角状;爪具附齿。 +暮光闪闪的姐姐是谁? 暮光闪闪是一匹雌性独角兽,后来在神秘魔法的影响下变成了空角兽(公主),她是《我的小马驹:友情是魔法》(英文名:My Little Pony:Friendship is Magic)中的主角之一。 +暮光闪闪的姐姐是谁? 她是银甲闪闪(Shining Armor)的妹妹,同时也是韵律公主(Princess Cadance)的小姑子。 +暮光闪闪的姐姐是谁? 在该系列中,她与最好的朋友与助手斯派克(Spike)一起生活在小马镇(Ponyville)的金橡图书馆(Golden Oak Library),研究友谊的魔法。 +暮光闪闪的姐姐是谁? 在暮光闪闪成为天角兽之前(即S3E13前),常常给塞拉丝蒂娅公主(Princess Celestia)关于友谊的报告。[1] +暮光闪闪的姐姐是谁? 《我的小马驹:友谊是魔法》(英文名称:My Little Pony:Friendship is Magic)(简称MLP) +暮光闪闪的姐姐是谁? 动画讲述了一只名叫做暮光闪闪(Twilight Sparkle)的独角兽(在SE3E13 +暮光闪闪的姐姐是谁? My Little Pony:Friendship is Magic[2] +暮光闪闪的姐姐是谁? 后成为了天角兽),执行她的导师塞拉斯蒂娅公主(PrincessCelestia)的任务,在小马镇(Ponyville)学习关于友谊的知识。 +暮光闪闪的姐姐是谁? 她与另外五只小马,苹果杰克(Applejack)、瑞瑞(Rarity)、云宝黛西(Rainbow Dash)、小蝶(Fluttershy)与萍琪派(Pinkie Pie),成为了最要好的朋友。 +暮光闪闪的姐姐是谁? 每匹小马都分别代表了协律精华的6个元素:诚实,慷慨,忠诚,善良,欢笑,魔法,各自扮演着属于自己的重要角色。 +暮光闪闪的姐姐是谁? 此后,暮光闪闪(Twilight Sparkle)便与她认识的新朋友们开始了有趣的日常生活。 +暮光闪闪的姐姐是谁? 在动画中,随时可见她们在小马镇(Ponyville)的种种冒险、奇遇、日常等等。 +暮光闪闪的姐姐是谁? 同时,也在她们之间的互动和冲突中,寻找着最适合最合理的完美解决方案。 +暮光闪闪的姐姐是谁? “尽管小马国并不太平,六位主角之间也常常有这样那样的问题,但是他们之间的真情对待,使得这个童话世界已经成为不少人心中理想的世外桃源。” +暮光闪闪的姐姐是谁? 暮光闪闪在剧情刚开始的时候生活在中心城(Canterlot),后来在夏日 +暮光闪闪的姐姐是谁? 暮光闪闪与斯派克(Spike) +暮光闪闪的姐姐是谁? 庆典的时候被塞拉丝蒂娅公主派遣到小马镇执行检查夏日庆典的准备工作的任务。 +暮光闪闪的姐姐是谁? 在小马镇交到了朋友(即其余5个主角),并和她们一起使用协律精华(Elements of harmony)击败了梦魇之月。 +暮光闪闪的姐姐是谁? 并在塞拉丝蒂亚公主的许可下,留在小马镇继续研究友谊的魔法。 +暮光闪闪的姐姐是谁? 暮光闪闪的知识基本来自于书本,并且她相当不相信书本以外的“迷信”,因为这样她在S1E15里吃足了苦头。 +暮光闪闪的姐姐是谁? 在这之后,她也开始慢慢学会相信一些书本以外的东西。 +暮光闪闪的姐姐是谁? 暮光闪闪热爱学习,并且学习成绩相当好(从她可以立刻算出 +暮光闪闪的姐姐是谁? 的结果可以看 +暮光闪闪的姐姐是谁? 暮光闪闪的原型 +暮光闪闪的姐姐是谁? 出)。 +暮光闪闪的姐姐是谁? 相当敬爱自己的老师塞拉丝蒂亚公主甚至到了精神失常的地步。 +暮光闪闪的姐姐是谁? 在第二季中,曾因为无法交出关于友谊的报告而做出了疯狂的行为,后来被塞拉丝蒂亚公主制止,在这之后,暮光闪闪得到了塞拉丝蒂亚公主“不用定期交友谊报告”的许可。 +暮光闪闪的姐姐是谁? 于是暮光闪闪在后面的剧情中的主角地位越来越得不到明显的体现。 +暮光闪闪的姐姐是谁? 在SE3E13中,因为破解了白胡子星璇留下的神秘魔法而被加冕成为了天角兽(公主),被尊称为“闪闪公主”。 +暮光闪闪的姐姐是谁? 当小星座熊在小马镇引起恐慌的时候,暮光闪闪运用了自身强大的魔法将水库举起后装满牛奶,用牛奶将小星座熊安抚后,连着巨型奶瓶和小星座熊一起送回了小星座熊居住的山洞。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭是由汪氏集团旗下子公司江西尤金房地产开发有限公司携手城发投资共同开发的精品社区,项目占地面积约380亩,总建筑面积约41万平方米。 +我想知道红谷十二庭有哪些金融机构? 项目以建设人文型、生态型居住环境为规划目标;创造一个布局合理、功能齐全、交通便捷、绿意盎然、生活方便,有文化内涵的居住区。 +我想知道红谷十二庭有哪些金融机构? 金融机构:工商银行、建设银行、农业银行、中国银行红谷滩支行、商业银行红谷滩支行等 +我想知道红谷十二庭有哪些金融机构? 周边公园:沿乌砂河50米宽绿化带、乌砂河水岸公园、秋水广场、赣江市民公园 +我想知道红谷十二庭有哪些金融机构? 周边医院:新建县人民医院、开心人药店、中寰医院 +我想知道红谷十二庭有哪些金融机构? 周边学校:育新小学红谷滩校区、南师附小红谷滩校区、实验小学红谷滩校区中学:南昌二中红谷滩校区、南昌五中、新建二中、竞秀贵族学校 +我想知道红谷十二庭有哪些金融机构? 周边公共交通:112、204、211、219、222、227、238、501等20多辆公交车在本项目社区门前停靠 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭处在南昌一江两城中的西城中心,位属红谷滩CBD文化公园中心——马兰圩中心组团,红谷滩中心区、红角洲、新建县三区交汇处,南临南期友好路、东接红谷滩中心区、西靠乌砂河水岸公园(50米宽,1000米长)。 +我想知道红谷十二庭有哪些金融机构? 交通便捷,景观资源丰富,生活配套设施齐全,出则繁华,入则幽静,是现代人居的理想地段。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭户型图 +苏琳最开始进入智通实业是担任什么职位? 现任广东智通人才连锁股份有限公司总裁,清华大学高级工商管理硕士。 +苏琳最开始进入智通实业是担任什么职位? 1994年,加入智通实业,从总经理秘书做起。 +苏琳最开始进入智通实业是担任什么职位? 1995年,智通实业决定进入人才服务行业,被启用去负责新公司的筹建及运营工作,在苏琳的努力下,智通人才智力开发有限公司成立。 +苏琳最开始进入智通实业是担任什么职位? 
2003年,面对同城对手的激烈竞争,苏琳冷静对待,领导智通先后接管、并购了同城的腾龙、安达盛人才市场,,“品牌运作,连锁经营,差异制胜”成为苏琳屡屡制胜的法宝。 +苏琳最开始进入智通实业是担任什么职位? 2006年,苏琳先是将智通人才升级为“东莞市智通人才连锁有限公司”,一举成为广东省人才市场目前惟一的连锁机构,随后在东莞同时开设长安、松山湖、清溪等镇区分部,至此智通在东莞共有6个分部。 +苏琳最开始进入智通实业是担任什么职位? 一番大刀阔斧完成东莞布局后,苏琳确定下一个更为高远的目标——进军珠三角,向全国发展连锁机构。 +苏琳最开始进入智通实业是担任什么职位? 到2011年末,苏琳领导的智通人才已在珠三角的东莞、佛山、江门、中山等地,长三角的南京、宁波、合肥等地,中西部的南昌、长沙、武汉、重庆、西安等地设立了20多家连锁经营网点。 +苏琳最开始进入智通实业是担任什么职位? 除了财务副总裁之外,苏琳是智通人才核心管理高层当中唯一的女性,不管是要约采访的记者还是刚刚加入智通的员工,见到苏琳的第一面,都会有一种惊艳的感觉,“一位女企业家居然非常美丽和时尚?!” +苏琳最开始进入智通实业是担任什么职位? 智通管理高层的另外6位男性成员,有一次同时接受一家知名媒体采访时,共同表达了对自己老板的“爱慕”之情,苏琳听后莞尔一笑,指着在座的这几位高层说道“其实,我更爱他们!” +苏琳最开始进入智通实业是担任什么职位? 这种具有独特领导魅力的表述让这位记者唏嘘不已,同时由这样的一个细节让他感受到了智通管理团队的协作力量。 +谁知道黄沙中心小学的邮政编码是多少? 学校于1954年始建于棕树湾村,当时借用一间民房做教室,取名为“黄沙小学”,只有教师1人,学生8人。 +谁知道黄沙中心小学的邮政编码是多少? 1958年在大跃进精神的指导下,实行大集体,全乡集中办学,发展到12个班,300多学生,20名教职工。 +谁知道黄沙中心小学的邮政编码是多少? 1959年解散。 +谁知道黄沙中心小学的邮政编码是多少? 1959年下半年,在上级的扶持下,建了6间木房,搬到1960年学校所在地,有6名教师,3个班,60名学生。 +谁知道黄沙中心小学的邮政编码是多少? 1968年,开始招收一个初中班,“黄沙小学”改名为 “附小”。 +谁知道黄沙中心小学的邮政编码是多少? 当时已发展到5个班,8名教师,110多名学生。 +谁知道黄沙中心小学的邮政编码是多少? 增建土木结构教室两间。 +谁知道黄沙中心小学的邮政编码是多少? 1986年,初中、小学分开办学。 +谁知道黄沙中心小学的邮政编码是多少? 增建部分教师宿舍和教室,办学条件稍有改善,学校初具规模。 +谁知道黄沙中心小学的邮政编码是多少? 1996年,我校在市、县领导及希望工程主管部门的关怀下,决定改为“黄沙希望小学”并拨款32万元,新建一栋4层,12间教室的教学楼,教学条件大有改善。 +谁知道黄沙中心小学的邮政编码是多少? 当时发展到10个班,学生300多人,教职工19人,小学高级教师3人,一级教师7人,二级教师9人。 +谁知道黄沙中心小学的邮政编码是多少? 2003年下半年由于农村教育体制改革,撤销教育组,更名为“黄沙中心小学”。 +谁知道黄沙中心小学的邮政编码是多少? 学校现有在校生177人(含学前42人),设有学前至六年级共7个教学班。 +谁知道黄沙中心小学的邮政编码是多少? 有教师19人,其中大专以上学历11人,中师6人;小学高级教师14人,一级教师5人。 +谁知道黄沙中心小学的邮政编码是多少? 学校校园占地面积2050平方米,生均达15.29平方米,校舍建筑面积1645平方米,生均12.27平方米;设有教师办公室、自然实验、电教室(合二为一)、微机室、图书阅览室(合二为一)、体育室、广播室、少先队活动室。 +谁知道黄沙中心小学的邮政编码是多少? 广西壮族自治区桂林市临桂县黄沙瑶族乡黄沙街 邮编:541113[1] +伊藤实华的职业是什么? 伊藤实华(1984年3月25日-)是日本的女性声优。 +伊藤实华的职业是什么? THREE TREE所属,东京都出身,身长149cm,体重39kg,血型AB型。 +伊藤实华的职业是什么? ポルノグラフィティのLION(森男) +伊藤实华的职业是什么? 2000年 +伊藤实华的职业是什么? 犬夜叉(枫(少女时代)) +伊藤实华的职业是什么? 幻影死神(西亚梨沙) +伊藤实华的职业是什么? 2001年 +伊藤实华的职业是什么? NOIR(ロザリー) +伊藤实华的职业是什么? 2002年 +伊藤实华的职业是什么? 水瓶战记(柠檬) +伊藤实华的职业是什么? 返乡战士(エイファ) +伊藤实华的职业是什么? 2003年 +伊藤实华的职业是什么? 奇诺之旅(女子A(悲しい国)) +伊藤实华的职业是什么? 2004年 +伊藤实华的职业是什么? 爱你宝贝(坂下ミキ) +伊藤实华的职业是什么? Get Ride! アムドライバー(イヴァン・ニルギース幼少期) +伊藤实华的职业是什么? スクールランブル(花井春树(幼少时代)) +伊藤实华的职业是什么? 2005年 +伊藤实华的职业是什么? 光速蒙面侠21(虎吉) +伊藤实华的职业是什么? 搞笑漫画日和(男子トイレの精、パン美先生) +伊藤实华的职业是什么? 银牙伝说WEED(テル) +伊藤实华的职业是什么? 魔女的考验(真部カレン、守山太郎) +伊藤实华的职业是什么? BUZZER BEATER(レニー) +伊藤实华的职业是什么? 虫师(“眼福眼祸”さき、“草を踏む音”沢(幼少时代)) +伊藤实华的职业是什么? 2006年 +伊藤实华的职业是什么? 魔女之刃(娜梅) +伊藤实华的职业是什么? 反斗小王子(远藤レイラ) +伊藤实华的职业是什么? 搞笑漫画日和2(パン美先生、フグ子、ダンサー、ヤマトの妹、女性) +伊藤实华的职业是什么? 人造昆虫カブトボーグ V×V(ベネチアンの弟、东ルリ、园儿A) +伊藤实华的职业是什么? 2007年 +爆胎监测与安全控制系统英文是什么? 爆胎监测与安全控制系统(Blow-out Monitoring and Brake System),是吉利全球首创,并拥有自主知识产权及专利的一项安全技术。 +爆胎监测与安全控制系统英文是什么? 这项技术主要是出于防止高速爆胎所导致的车辆失控而设计。 +爆胎监测与安全控制系统英文是什么? BMBS爆胎监测与安全控制系统技术于2004年1月28日正式获得中国发明专利授权。 +爆胎监测与安全控制系统英文是什么? 2008年第一代BMBS系统正式与世人见面,BMBS汇集国内外汽车力学、控制学、人体生理学、电子信息学等方面的专家和工程技术人员经过一百余辆试验车累计行程超过五百万公里的可靠性验证,以确保产品的可靠性。 +爆胎监测与安全控制系统英文是什么? BMBS技术方案的核心即是采用智能化自动控制系统,弥补驾驶员生理局限,在爆胎后反应时间为0.5秒,替代驾驶员实施行车制动,保障行车安全。 +爆胎监测与安全控制系统英文是什么? BMBS系统由控制系统和显示系统两大部分组成,控制系统由BMBS开关、BMBS主机、BMBS分机、BMBS真空助力器四部分组成;显示系统由GPS显示、仪表指示灯、语言提示、制动双闪灯组成。 +爆胎监测与安全控制系统英文是什么? 当轮胎气压高于或低于限值时,控制器声光提示胎压异常。 +爆胎监测与安全控制系统英文是什么? 轮胎温度过高时,控制器发出信号提示轮胎温度过高。 +爆胎监测与安全控制系统英文是什么? 发射器电量不足时,控制器显示低电压报警。 +爆胎监测与安全控制系统英文是什么? 发射器受到干扰长期不发射信号时,控制器显示无信号报警。 +爆胎监测与安全控制系统英文是什么? 当汽车电门钥匙接通时,BMBS首先进入自检程序,检测系统各部分功能是否正常,如不正常,BMBS报警灯常亮。 +走读干部现象在哪里比较多? 走读干部一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么晚出早归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 截至2014年10月,共有6484名“走读干部”在专项整治中被查处。 +走读干部现象在哪里比较多? 这是中央首次大规模集中处理这一长期遭诟病的干部作风问题。 +走读干部现象在哪里比较多? 
干部“走读”问题主要在乡镇地区比较突出,城市地区则较少。 +走读干部现象在哪里比较多? 从历史成因和各地反映的情况来看,产生“走读”现象的主要原因大致有四种: +走读干部现象在哪里比较多? 现今绝大多数乡村都有通往乡镇和县城的石子公路甚至柏油公路,这无疑为农村干部的出行创造了便利条件,为“干部像候鸟,频往家里跑”创造了客观条件。 +走读干部现象在哪里比较多? 选调生、公务员队伍大多是学历较高的大学毕业生,曾在高校所在地的城市生活,不少人向往城市生活,他们不安心长期扎根基层,而是将基层当作跳板,因此他们往往成为“走读”的主力军。 +走读干部现象在哪里比较多? 公仆意识、服务意识淡化,是“走读”现象滋生的主观原因。 +走读干部现象在哪里比较多? 有些党员干部感到自己长期在基层工作,该为自己和家庭想想了。 +走读干部现象在哪里比较多? 于是,不深入群众认真调查研究、认真听取群众意见、认真解决群众的实际困难,也就不难理解了。 +走读干部现象在哪里比较多? 县级党政组织对乡镇领导干部管理的弱化和为基层服务不到位,导致“走读”问题得不到应有的制度约束,是“走读”问题滋长的组织原因。[2] +走读干部现象在哪里比较多? 近些年来,我国一些地方的“干部走读”现象较为普遍,社会上对此议走读干部论颇多。 +走读干部现象在哪里比较多? 所谓“干部走读”,一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么早出晚归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 干部走读之所以成为“千夫所指”,是因为这种行为增加了行政成本。 +走读干部现象在哪里比较多? 从根子上说,干部走读是城乡发展不平衡的产物,“人往高处走,水往低处流”,有了更加舒适的生活环境,不管是为了自己生活条件改善也好,还是因为子女教育也好,农村人口向城镇转移,这是必然结果。 +走读干部现象在哪里比较多? “干部走读”的另一个重要原因,是干部人事制度改革。 +走读干部现象在哪里比较多? 目前公务员队伍“凡进必考”,考上公务员的大多是学历较高的大学毕业生,这些大学毕业生来自各个全国各地,一部分在本地结婚生子,沉淀下来;一部分把公务员作为跳板,到基层后或考研,或再参加省考、国考,或想办法调回原籍。 +走读干部现象在哪里比较多? 再加上一些下派干部、异地交流任职干部,构成了看似庞大的“走读”队伍。 +走读干部现象在哪里比较多? 那么,“干部走读”有哪些弊端呢? +走读干部现象在哪里比较多? 一是这些干部人在基层,心在城市,缺乏长期作战的思想,工作不安心。 +走读干部现象在哪里比较多? 周一来上班,周五回家转,对基层工作缺乏热情和感情;二是长期在省市直机关工作,对基层工作不熟悉不了解,工作不热心;三是长期走读,基层干群有工作难汇报,有困难难解决,群众不开心;四是干部来回走读,公车私驾,私费公报,把大量的经济负担转嫁给基层;五是对这些走读干部,基层管不了,上级监督难,节假日期间到哪里去、做什么事,基本处于失控和真空状态,各级组织和基层干群不放心。 +走读干部现象在哪里比较多? 特别需要引起警觉的是,由于少数走读干部有临时思想,满足于“当维持会长”,得过且过混日子,热衷于做一些急功近利、砸锅求铁的短期行为和政绩工程,不愿做打基础、管长远的实事好事,甚至怠政、疏政和懒于理政,影响了党和政府各项方针政策措施的落实,导致基层无政府主义、自由主义抬头,削弱了党和政府的领导,等到矛盾激化甚至不可收拾的时候,处理已是来之不及。 +走读干部现象在哪里比较多? 权利要与义务相等,不能只有义务而没有权利,或是只有权利没有义务。 +走读干部现象在哪里比较多? 如何真正彻底解决乡镇干部“走读”的现象呢? +走读干部现象在哪里比较多? 那就必须让乡镇基层干部义务与权利相等。 +走读干部现象在哪里比较多? 如果不能解决基层干部待遇等问题,即使干部住村,工作上也不会有什么进展的。 +走读干部现象在哪里比较多? 所以,在政治上关心,在生活上照顾,在待遇上提高。 +走读干部现象在哪里比较多? 如,提高基层干部的工资待遇,增加通讯、交通补助;帮助解决子女入学及老人赡养问题;提拔干部优先考虑基层干部;干部退休时的待遇至少不低于机关干部等等。 +化州市良光镇东岸小学学风是什么? 学校全体教职工爱岗敬业,团结拼搏,勇于开拓,大胆创新,进行教育教学改革,努力开辟第二课堂的教学路子,并开通了网络校校通的交流合作方式。 +化州市良光镇东岸小学学风是什么? 现学校教师正在为创建安全文明校园而努力。 +化州市良光镇东岸小学学风是什么? 东岸小学位置偏僻,地处贫穷落后,是良光镇最偏远的学校,学校,下辖分教点——东心埇小学,[1]?。 +化州市良光镇东岸小学学风是什么? 学校2011年有教师22人,学生231人。 +化州市良光镇东岸小学学风是什么? 小学高级教师8人,小学一级教师10人,未定级教师4人,大专学历的教师6人,其余的都具有中师学历。 +化州市良光镇东岸小学学风是什么? 全校共设12个班,学校课程按标准开设。 +化州市良光镇东岸小学学风是什么? 东岸小学原来是一所破旧不堪,教学质量非常差的薄弱学校。 +化州市良光镇东岸小学学风是什么? 近几年来,在各级政府、教育部门及社会各界热心人士鼎力支持下,学校领导大胆改革创新,致力提高教学质量和教师水平,并加大经费投入,大大改善了办学条件,使学校由差变好,实现了大跨越。 +化州市良光镇东岸小学学风是什么? 学校建设性方面。 +化州市良光镇东岸小学学风是什么? 东岸小学属于革命老区学校,始建于1980年,从东心埇村祠堂搬到这个校址,1990年建造一幢建筑面积为800平方米的南面教学楼, 1998年老促会支持从北面建造一幢1800平方米的教学大楼。 +化州市良光镇东岸小学学风是什么? 学校在管理方面表现方面颇具特色,实现了各项制度的日常化和规范化。 +化州市良光镇东岸小学学风是什么? 学校领导有较强的事业心和责任感,讲求民主与合作,勤政廉政,依法治校,树立了服务意识。 +化州市良光镇东岸小学学风是什么? 学校一贯实施“德育为先,以人为本”的教育方针,制定了“团结,律已,拼搏,创新”的校训。 +化州市良光镇东岸小学学风是什么? 教育风为“爱岗敬业,乐于奉献”,学风为“乐学,勤学,巧学,会学”。 +化州市良光镇东岸小学学风是什么? 校内营造了尊师重教的氛围,形成了良好的校风和学风。 +化州市良光镇东岸小学学风是什么? 教师们爱岗敬业,师德高尚,治学严谨,教研教改气氛浓厚,获得喜人的教研成果。 +化州市良光镇东岸小学学风是什么? 近几年来,教师撰写的教育教学论文共10篇获得县市级以上奖励,获了镇级以上奖励的有100人次。 +化州市良光镇东岸小学学风是什么? 学校德育工作成绩显著,多年被评为“安全事故为零”的学校,良光镇先进学校。 +化州市良光镇东岸小学学风是什么? 特别是教学质量大大提高了。 +化州市良光镇东岸小学学风是什么? 这些成绩得到了上级及群众的充分肯定。 +化州市良光镇东岸小学学风是什么? 1.学校环境欠美观有序,学校大门口及校道有待改造。 +化州市良光镇东岸小学学风是什么? 2.学校管理制度有待改进,部分教师业务水平有待提高。 +化州市良光镇东岸小学学风是什么? 3.教师宿舍、教室及学生宿舍欠缺。 +化州市良光镇东岸小学学风是什么? 4.运动场不够规范,各类体育器材及设施需要增加。 +化州市良光镇东岸小学学风是什么? 5.学生活动空间少,见识面窄,视野不够开阔。 +化州市良光镇东岸小学学风是什么? 1.努力营造和谐的教育教学新气氛。 +化州市良光镇东岸小学学风是什么? 建立科学的管理制度,坚持“与时俱进,以人为本”,真正实现领导对教师,教师对学生之间进行“德治与情治”;学校的人文环境做到“文明,和谐,清新”;德育环境做到“自尊,律已,律人”;心理环境做到“安全,谦虚,奋发”;交际环境做到“团结合作,真诚助人”;景物环境做到“宜人,有序。” +化州市良光镇东岸小学学风是什么? 营造学校与育人的新特色。 +我很好奇发射管的输出功率怎么样? 产生或放大高频功率的静电控制电子管,有时也称振荡管。 +我很好奇发射管的输出功率怎么样? 用于音频或开关电路中的发射管称调制管。 +我很好奇发射管的输出功率怎么样? 
发射管是无线电广播、通信、电视发射设备和工业高频设备中的主要电子器件。 +我很好奇发射管的输出功率怎么样? 输出功率和工作频率是发射管的基本技术指标。 +我很好奇发射管的输出功率怎么样? 广播、通信和工业设备的发射管,工作频率一般在30兆赫以下,输出功率在1919年为2千瓦以下,1930年达300千瓦,70年代初已超过1000千瓦,效率高达80%以上。 +我很好奇发射管的输出功率怎么样? 发射管工作频率提高时,输出功率和效率都会降低,因此1936年首次实用的脉冲雷达工作频率仅28兆赫,80年代则已达 400兆赫以上。 +我很好奇发射管的输出功率怎么样? 40年代电视发射管的工作频率为数十兆赫,而80年代初,优良的电视发射管可在1000兆赫下工作,输出功率达20千瓦,效率为40%。 +我很好奇发射管的输出功率怎么样? 平面电极结构的小功率发射三极管可在更高的频率下工作。 +我很好奇发射管的输出功率怎么样? 发射管多采用同心圆筒电极结构。 +我很好奇发射管的输出功率怎么样? 阴极在最内层,向外依次为各个栅极和阳极。 +我很好奇发射管的输出功率怎么样? 图中,自左至右为阴极、第一栅、第二栅、栅极阴极组装件和装入阳极后的整个管子。 +我很好奇发射管的输出功率怎么样? 发射管 +我很好奇发射管的输出功率怎么样? 中小功率发射管多采用间热式氧化物阴极。 +我很好奇发射管的输出功率怎么样? 大功率发射管一般采用碳化钍钨丝阴极,有螺旋、直条或网笼等结构形式。 +我很好奇发射管的输出功率怎么样? 图为网笼式阴极。 +我很好奇发射管的输出功率怎么样? 栅极多用钼丝或钨丝绕制,或用钼片经电加工等方法制造。 +我很好奇发射管的输出功率怎么样? 栅极表面经镀金(或铂)或涂敷锆粉等处理,以降低栅极电子发射,使发射管稳定工作。 +我很好奇发射管的输出功率怎么样? 用气相沉积方法制造的石墨栅极,具有良好的性能。 +我很好奇发射管的输出功率怎么样? 发射管阳极直流输入功率转化为高频输出功率的部分约为75%,其余25%成为阳极热损耗,因此对发射管的阳极必须进行冷却。 +我很好奇发射管的输出功率怎么样? 中小功率发射管的阳极采取自然冷却方式,用镍、钼或石墨等材料制造,装在管壳之内,工作温度可达 600℃。 +我很好奇发射管的输出功率怎么样? 大功率发射管的阳极都用铜制成,并作为真空密封管壳的一部分,采用各种强制冷却方式。 +我很好奇发射管的输出功率怎么样? 各种冷却方式下每平方厘米阳极内表面的散热能力为:水冷100瓦;风冷30瓦;蒸发冷却250瓦;超蒸发冷却1000瓦以上,80年代已制成阳极损耗功率为1250千瓦的超蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管也常以冷却方式命名,如风冷发射管、水冷发射管和蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管管壳用玻璃或陶瓷制造。 +我很好奇发射管的输出功率怎么样? 小功率发射管内使用含钡的吸气剂;大功率发射管则采用锆、钛、钽等吸气材料,管内压强约为10帕量级。 +我很好奇发射管的输出功率怎么样? 发射管寿命取决于阴极发射电子的能力。 +我很好奇发射管的输出功率怎么样? 大功率发射管寿命最高记录可达8万小时。 +我很好奇发射管的输出功率怎么样? 发射四极管的放大作用和输出输入电路间的隔离效果优于三极管,应用最广。 +我很好奇发射管的输出功率怎么样? 工业高频振荡器普遍采用三极管。 +我很好奇发射管的输出功率怎么样? 五极管多用在小功率范围中。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 鲁能领秀城中央公园位于鲁能领秀城景观中轴之上,总占地161.55亩,总建筑面积约40万平米,容积率为2.70,由22栋小高层、高层组成;其绿地率高达35.2%,环境优美,产品更加注重品质化、人性化和自然生态化,是鲁能领秀城的生态人居典范。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 中央公园[1] 学区准现房,坐享鲁能领秀城成熟配套,成熟生活一步到位。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 经典板式小高层,103㎡2+1房仅22席,稀市推出,错过再无;92㎡经典两房、137㎡舒适三房压轴登场! +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 物业公司: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 济南凯瑞物业公司;深圳长城物业公司;北京盛世物业有限公司 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 绿化率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 42% +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 容积率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 2.70 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 暖气: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 集中供暖 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 楼座展示:中央公园由22栋小高层、高层组成,3、16、17号楼分别是11层小高层,18层和28层的高层。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 4号楼是23层,2梯3户。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 项目位置: +鬼青蛙在哪里有收录详情? 鬼青蛙这张卡可以从手卡把这张卡以外的1只水属性怪兽丢弃,从手卡特殊召唤。 +鬼青蛙在哪里有收录详情? 这张卡召唤·反转召唤·特殊召唤成功时,可以从自己的卡组·场上选1只水族·水属性·2星以下的怪兽送去墓地。 +鬼青蛙在哪里有收录详情? 此外,1回合1次,可以通过让自己场上1只怪兽回到手卡,这个回合通常召唤外加上只有1次,自己可以把「鬼青蛙」以外的1只名字带有「青蛙」的怪兽召唤。[1] +鬼青蛙在哪里有收录详情? 游戏王卡包收录详情 +鬼青蛙在哪里有收录详情? [09/09/18] +西湖区有多大? 西湖区是江西省南昌市市辖区。 +西湖区有多大? 为南昌市中心城区之一,有着2200多年历史,是一个物华天宝、人杰地灵的古老城区。 +西湖区有多大? 2004年南昌市老城区区划调整后,西湖区东起京九铁路线与青山湖区毗邻,南以洪城路东段、抚河路南段、象湖以及南隔堤为界与青云谱区、南昌县接壤,西凭赣江中心线与红谷滩新区交界,北沿中山路、北京西路与东湖区相连,所辖面积34.5平方公里,常住人口43万,管辖1个镇、10个街道办事处,设12个行政村、100个社区。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖原为汉代豫章群古太湖的一部分,唐贞元15年(公元799年)洪恩桥的架设将东太湖分隔成东西两部分,洪恩桥以西谓之西湖,西湖区由此而得名。 +西湖区有多大? 西湖区在1926年南昌设市后分别称第四、五部分,六、七部分。 +西湖区有多大? 1949年解放初期分别称第三、四区。 +西湖区有多大? 1955年分别称抚河区、西湖区。 +西湖区有多大? 1980年两区合并称西湖区。[1] +西湖区有多大? 辖:西湖街道、丁公路街道、广外街道、系马桩街道、绳金塔街道、朝阳洲街道、禾草街街道、十字街街道、瓦子角街道、三眼井街道、上海路街道、筷子巷街道、南站街道。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。 +西湖区有多大? 2004年9月7日,国务院批准(国函[2004]70号)调整南昌市市辖区部分行政区划:将西湖区朝阳洲街道的西船居委会划归东湖区管辖。 +西湖区有多大? 将青山湖区的桃花镇和湖坊镇的同盟村划归西湖区管辖。 +西湖区有多大? 
将西湖区十字街街道的谷市街、洪城路、南关口、九四、新丰5个居委会,上海路街道的草珊瑚集团、南昌肠衣厂、电子计算机厂、江西涤纶厂、江地基础公司、曙光、商标彩印厂、南昌市染整厂、江南蓄电池厂、四机床厂、二进、国乐新村12个居委会,南站街道的解放西路东居委会划归青云谱区管辖。 +西湖区有多大? 将西湖区上海路街道的轻化所、洪钢、省人民检察院、电信城东分局、安康、省机械施工公司、省水利设计院、省安装公司、南方电动工具厂、江西橡胶厂、上海路北、南昌电池厂、东华计量所、南昌搪瓷厂、上海路新村、华安针织总厂、江西五金厂、三波电机厂、水文地质大队、二六○厂、省卫生学校、新世纪、上海路住宅区北、塔子桥北、南航、上海路住宅区南、沿河、南昌阀门厂28个居委会,丁公路街道的新魏路、半边街、师大南路、顺化门、岔道口东路、师大、广电厅、手表厂、鸿顺9个居委会,南站街道的工人新村北、工人新村南、商苑、洪都中大道、铁路第三、铁路第四、铁路第六7个居委会划归青山湖区管辖。 +西湖区有多大? 调整后,西湖区辖绳金塔、桃源、朝阳洲、广润门、南浦、西湖、系马桩、十字街、丁公路、南站10个街道和桃花镇,区人民政府驻孺子路。 +西湖区有多大? 调整前,西湖区面积31平方千米,人口52万。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区位于江西省省会南昌市的中心地带,具有广阔的发展空间和庞大的消费群体,商贸旅游、娱乐服务业等到各个行业都蕴藏着无限商机,投资前景十分广阔。 +西湖区有多大? 不仅水、电价格低廉,劳动力资源丰富,人均工资和房产价格都比沿海城市低,城区拥有良好的人居环境、低廉的投资成本,巨大的发展潜力。 +西湖区有多大? 105、316、320国道和京九铁路贯穿全境,把南北东西交通连成一线;民航可与上海、北京、广州、深圳、厦门、温州等到地通航,并开通了南昌-新加坡第一条国际航线;水运依托赣江可直达长江各港口;邮电通讯便捷,程控电话、数字微波、图文传真进入国际通讯网络;商检、海关、口岸等涉外机构齐全;水、电、气供应充足。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区,是江西省省会南昌市的中心城区,面积34.8平方公里,常住人口51.9万人,辖桃花镇、朝农管理处及10个街道,设13个行政村,116个社区居委会,20个家委会。[2] +西湖区有多大? 2005年11月16日,南昌市《关于同意西湖区桃花镇、桃源、十字街街道办事处行政区划进行调整的批复》 +西湖区有多大? 1、同意将桃花镇的三道闸居委会划归桃源街道办事处管辖。 +青藏虎耳草花期什么时候? 青藏虎耳草多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 花期7-8月。 +青藏虎耳草花期什么时候? 分布于甘肃(祁连山地)、青海(黄南、海南、海北)和西藏(加查)。 +青藏虎耳草花期什么时候? 生于海拔3 700-4 250米的林下、高山草甸和高山碎石隙。[1] +青藏虎耳草花期什么时候? 多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 茎不分枝,具褐色卷曲柔毛。 +青藏虎耳草花期什么时候? 基生叶具柄,叶片卵形、椭圆形至长圆形,长15-25毫米,宽4-8毫米,腹面无毛,背面和边缘具褐色卷曲柔毛,叶柄长1-3厘米,基部扩大,边缘具褐色卷曲柔毛;茎生叶卵形至椭圆形,长1.5-2厘米,向上渐变小。 +青藏虎耳草花期什么时候? 聚伞花序伞房状,具2-6花;花梗长5-19毫米,密被褐色卷曲柔毛;萼片在花期反曲,卵形至狭卵形,长2.5-4.2毫米,宽1.5-2毫米,先端钝,两面无毛,边缘具褐色卷曲柔毛,3-5脉于先端不汇合;花瓣腹面淡黄色且其中下部具红色斑点,背面紫红色,卵形、狭卵形至近长圆形,长2.5-5.2毫米,宽1.5-2.1毫米,先端钝,基部具长0.5-1毫米之爪,3-5(-7)脉,具2痂体;雄蕊长2-3.6毫米,花丝钻形;子房半下位,周围具环状花盘,花柱长1-1.5毫米。 +青藏虎耳草花期什么时候? 生于高山草甸、碎石间。 +青藏虎耳草花期什么时候? 分布青海、西藏、甘肃、四川等地。 +青藏虎耳草花期什么时候? [1] +青藏虎耳草花期什么时候? 顶峰虎耳草Saxifraga cacuminum Harry Sm. +青藏虎耳草花期什么时候? 对叶虎耳Saxifraga contraria Harry Sm. +青藏虎耳草花期什么时候? 狭瓣虎耳草Saxifraga pseudohirculus Engl. +青藏虎耳草花期什么时候? 唐古特虎耳草Saxifraga tangutica Engl. +青藏虎耳草花期什么时候? 宽叶虎耳草(变种)Saxifraga tangutica Engl. var. platyphylla (Harry Sm.) J. T. Pan +青藏虎耳草花期什么时候? 唐古特虎耳草(原变种)Saxifraga tangutica Engl. var. tangutica +青藏虎耳草花期什么时候? 西藏虎耳草Saxifraga tibetica Losinsk.[1] +青藏虎耳草花期什么时候? Saxifraga przewalskii Engl. in Bull. Acad. Sci. St. -Petersb. 29:115. 1883: Engl et Irmsch. in Bot. Jahrb. 48:580. f. 5E-H. 1912 et in Engl. Pflanzenr. 67(IV. 117): 107. f. 21 E-H. 1916; J. T. Pan in Acta Phytotax. Sin. 16(2): 16. 1978;中国高等植物图鉴补编2: 30. 1983; 西藏植物志 2: 483. 1985. [1] +生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪 Owen Gun 1945年,在新不列颠手持欧文冲锋枪的澳大利亚士兵 类型 冲锋枪 原产国 ?澳大利亚 服役记录 服役期间 1941年-1960年代 用户 参见使用国 参与战役 第二次世界大战 马来亚紧急状态 朝鲜战争 越南战争 1964年罗德西亚布什战争 生产历史 研发者 伊夫林·欧文(Evelyn Owen) 研发日期 1931年-1939年 生产商 约翰·莱萨特工厂 利特高轻武器工厂 单位制造费用 $ 30/枝 生产日期 1941年-1945年 制造数量 45,000-50,000 枝 衍生型 Mk 1/42 Mk 1/43 Mk 2/43 基本规格 总重 空枪: Mk 1/42:4.24 千克(9.35 磅) Mk 1/43:3.99 千克(8.8 磅) Mk 2/43:3.47 千克(7.65 磅) 全长 806 毫米(31.73 英吋) 枪管长度 247 毫米(9.72 英吋) 弹药 制式:9 × 19 毫米 原型:.38/200 原型:.45 ACP 口径 9 × 19 毫米:9 毫米(.357 英吋) .38/200:9.65 毫米(.38 英吋) .45 ACP:11.43 毫米(.45 英吋) 枪管 1 根,膛线7 条,右旋 枪机种类 直接反冲作用 开放式枪机 发射速率 理论射速: Mk 1/42:700 发/分钟 Mk 1/43:680 发/分钟 Mk 2/43:600 发/分钟 实际射速:120 发/分钟 枪口初速 380-420 米/秒(1,246.72-1,377.95 英尺/秒) 有效射程 瞄具装定射程:91.44 米(100 码) 最大有效射程:123 米(134.51 码) 最大射程 200 米(218.72 码) 供弹方式 32/33 发可拆卸式弹匣 瞄准具型式 机械瞄具:向右偏置的觇孔式照门和片状准星 欧文冲锋枪(英语:Owen Gun,正式名称:Owen Machine Carbine,以下简称为“欧文枪”)是一枝由伊夫林·(埃沃)·欧文(英语:Evelyn (Evo) Owen)于1939年研制、澳大利亚的首枝冲锋枪,制式型发射9 × 19 毫米鲁格手枪子弹。 +生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪是澳大利亚唯一设计和主要服役的二战冲锋枪,并从1943年由澳大利亚陆军所使用,直到1960年代中期。 +生产一支欧文冲锋枪需要多少钱? 
由新南威尔士州卧龙岗市出身的欧文枪发明者,伊夫林·欧文,在24岁时于1939年7月向悉尼维多利亚军营的澳大利亚陆军军械官员展示了他所设计的.22 LR口径“卡宾机枪”原型枪。 +生产一支欧文冲锋枪需要多少钱? 该枪却被澳大利亚陆军所拒绝,因为澳大利亚陆军在当时没有承认冲锋枪的价值。 +生产一支欧文冲锋枪需要多少钱? 随着战争的爆发,欧文加入了澳大利亚军队,并且成为一名列兵。 +生产一支欧文冲锋枪需要多少钱? 1940年9月,欧文的邻居,文森特·沃德尔(英语:Vincent Wardell),看到欧文家楼梯后面搁著一个麻布袋,里面放著一枝欧文枪的原型枪。 +生产一支欧文冲锋枪需要多少钱? 而文森特·沃德尔是坎布拉港的大型钢制品厂莱萨特公司的经理,他向欧文的父亲表明了他对其儿子的粗心大意感到痛心,但无论如何仍然解释了这款武器的历史。 +生产一支欧文冲锋枪需要多少钱? 沃德尔对欧文枪的简洁的设计留下了深刻的印象。 +生产一支欧文冲锋枪需要多少钱? 沃德尔安排欧文转调到陆军发明部(英语:Army Inventions Board),并重新开始在枪上的工作。 +生产一支欧文冲锋枪需要多少钱? 军队仍然持续地从负面角度查看该武器,但同时政府开始采取越来越有利的观点。 +生产一支欧文冲锋枪需要多少钱? 该欧文枪原型配备了装在顶部的弹鼓,后来让位给装在顶部的弹匣使用。 +生产一支欧文冲锋枪需要多少钱? 口径的选择亦花了一些时间去解决。 +生产一支欧文冲锋枪需要多少钱? 由于陆军有大批量的柯尔特.45 ACP子弹,它们决定欧文枪需要采用这种口径。 +生产一支欧文冲锋枪需要多少钱? 直到在1941年9月19日官方举办试验时,约翰·莱萨特工厂制成了9 毫米、.38/200和.45 ACP三种口径版本。 +生产一支欧文冲锋枪需要多少钱? 而从美、英进口的斯登冲锋枪和汤普森冲锋枪在试验中作为基准使用。 +生产一支欧文冲锋枪需要多少钱? 作为测试的一部分,所有的枪支都浸没在泥浆里,并以沙土覆盖,以模拟他们将会被使用时最恶劣的环境。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是唯一在这测试中这样对待以后仍可正常操作的冲锋枪。 +生产一支欧文冲锋枪需要多少钱? 虽然测试表现出欧文枪具有比汤普森冲锋枪和司登冲锋枪更优秀的可靠性,陆军没有对其口径作出决定。 +生产一支欧文冲锋枪需要多少钱? 结果它在上级政府干预以后,陆军才下令9 毫米的衍生型为正式口径,并在1941年11月20日正式被澳大利亚陆军采用。 +生产一支欧文冲锋枪需要多少钱? 在欧文枪的寿命期间,其可靠性在澳大利亚部队中赢得了“军人的至爱”(英语:Digger's Darling)的绰号,亦有人传言它受到美军高度青睐。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是在1942年开始正式由坎布拉港和纽卡斯尔的约翰·莱萨特工厂投入生产,在生产高峰期每个星期生产800 支。 +生产一支欧文冲锋枪需要多少钱? 1942年3月至1943年2月之间,莱萨特生产了28,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 然而,最初的一批弹药类型竟然是错误的,以至10,000 枝欧文枪无法提供弹药。 +生产一支欧文冲锋枪需要多少钱? 政府再一次推翻军方的官僚主义作风??,并让弹药通过其最后的生产阶段,以及运送到当时在新几内亚与日军战斗的澳大利亚部队的手中。 +生产一支欧文冲锋枪需要多少钱? 在1941年至1945年间生产了约50,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 在战争期间,欧文枪的平均生产成本为$ 30。[1] +生产一支欧文冲锋枪需要多少钱? 虽然它是有点笨重,因为其可靠性,欧文枪在士兵当中变得非常流行。 +生产一支欧文冲锋枪需要多少钱? 它是如此成功,它也被新西兰、英国和美国订购。[2] +生产一支欧文冲锋枪需要多少钱? 欧文枪后来也被澳大利亚部队在朝鲜战争和越南战争,[3]特别是步兵组的侦察兵。 +生产一支欧文冲锋枪需要多少钱? 这仍然是一枝制式的澳大利亚陆军武器,直到1960年代中期,它被F1冲锋枪所取代。 +第二届中国光伏摄影大赛因为什么政策而开始的? 光伏发电不仅是全球能源科技和产业发展的重要方向,也是我国具有国际竞争优势的战略性新兴产业,是我国保障能源安全、治理环境污染、应对气候变化的战略性选择。 +第二届中国光伏摄影大赛因为什么政策而开始的? 2013年7月以来,国家出台了《关于促进光伏产业健康发展的若干意见》等一系列政策,大力推进分布式光伏发电的应用,光伏发电有望走进千家万户,融入百姓民生。 +第二届中国光伏摄影大赛因为什么政策而开始的? 大赛主办方以此为契机,开启了“第二届中国光伏摄影大赛”的征程。 +悬赏任务有哪些类型? 悬赏任务,威客网站上一种任务模式,由雇主在威客网站发布任务,提供一定数额的赏金,以吸引威客们参与。 +悬赏任务有哪些类型? 悬赏任务数额一般在几十到几千不等,但也有几万甚至几十万的任务。 +悬赏任务有哪些类型? 主要以提交的作品的质量好坏作为中标标准,当然其中也带有雇主的主观喜好,中标人数较少,多为一个或几个,因此竞争激烈。 +悬赏任务有哪些类型? 大型悬赏任务赏金数额巨大,中标者也较多,但参与人也很多,对于身有一技之长的威客来讲,悬赏任务十分适合。 +悬赏任务有哪些类型? 悬赏任务的类型主要包括:设计类、文案类、取名类、网站类、编程类、推广类等等。 +悬赏任务有哪些类型? 每一类所适合的威客人群不同,报酬的多少也不同,比如设计类的报酬就比较高,一般都几百到几千,而推广类的计件任务报酬比较少,一般也就几块钱,但花费的时间很少,技术要求也很低。 +悬赏任务有哪些类型? 1.注册—登陆 +悬赏任务有哪些类型? 2.点击“我要发悬赏”—按照发布流程及提示提交任务要求。 +悬赏任务有哪些类型? 悬赏模式选择->网站托管赏金模式。 +悬赏任务有哪些类型? 威客网站客服稍后会跟发布者联系确认任务要求。 +悬赏任务有哪些类型? 3.没有问题之后就可以预付赏金进行任务发布。 +悬赏任务有哪些类型? 4.会员参与并提交稿件。 +悬赏任务有哪些类型? 5.发布者需要跟会员互动(每个提交稿件的会员都可以),解决问题,完善稿件,初步筛选稿件。 +悬赏任务有哪些类型? 6.任务发布期结束,进入选稿期(在筛选的稿件中选择最后满意的) +悬赏任务有哪些类型? 7.发布者不满意现有稿件可选定一个会员修改至满意为止,或者加价延期重新开放任务进行征稿。 +悬赏任务有哪些类型? (重复第六步)没有问题后进入下一步。 +悬赏任务有哪些类型? 8:中标会员交源文件给发布者—发布者确认—任务结束—网站将赏金付给中标会员。 +悬赏任务有哪些类型? 1、任务发布者自由定价,自由确定悬赏时间,自由发布任务要求,自主确定中标会员和中标方案。 +悬赏任务有哪些类型? 2、任务发布者100%预付任务赏金,让竞标者坚信您的诚意和诚信。 +悬赏任务有哪些类型? 3、任务赏金分配原则:任务一经发布,网站收取20%发布费,中标会员获得赏金的80%。 +悬赏任务有哪些类型? 4、每个任务最终都会选定至少一个作品中标,至少一个竞标者获得赏金。 +悬赏任务有哪些类型? 5、任务发布者若未征集到满意作品,可以加价延期征集,也可让会员修改,会员也可以删除任务。 +悬赏任务有哪些类型? 6、任务发布者自己所在组织的任何人均不能以任何形式参加自己所发布的任务,一经发现则视为任务发布者委托威客网按照网站规则选稿。 +悬赏任务有哪些类型? 7、任务悬赏总金额低于100元(含100元)的任务,悬赏时间最多为7天。 +悬赏任务有哪些类型? 所有任务最长时间不超过30天(特殊任务除外),任务总金额不得低于50元。 +悬赏任务有哪些类型? 8、网赚类、注册类任务总金额不能低于300元人民币,计件任务每个稿件的平均单价不能低于1元人民币。 +悬赏任务有哪些类型? 9、延期任务只有3次加价机会,第1次加价不得低于任务金额的10%,第2次加价不得低于任务总金额的20%,第3次不得低于任务总金额的50%。 +悬赏任务有哪些类型? 每次延期不能超过15天,加价金额不低于50元,特殊任务可以适当加长。 +悬赏任务有哪些类型? 如果为计件任务,且不是网赚类任务,将免费延期,直至征集完规定数量的作品为止。 +悬赏任务有哪些类型? 10、如果威客以交接源文件要挟任务发布者,威客网将扣除威客相关信用值,并取消其中标资格,同时任务将免费延长相应的时间继续征集作品 。 +江湖令由哪些平台运营? 
《江湖令》是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作。 +江湖令由哪些平台运营? 由ya247平台、91wan游戏平台、2918、4399游戏平台、37wan、6711、兄弟玩网页游戏平台,49you、Y8Y9平台、8090游戏等平台运营的,由07177游戏网发布媒体资讯的网页游戏。 +江湖令由哪些平台运营? 网页游戏《江湖令》由51游戏社区运营,是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作… +江湖令由哪些平台运营? 背景故事: +江湖令由哪些平台运营? 隋朝末年,隋炀帝暴政,天下民不聊生,义军四起。 +江湖令由哪些平台运营? 在这动荡的时代中,百姓生活苦不堪言,多少人流离失所,家破人亡。 +江湖令由哪些平台运营? 天下三大势力---飞羽营、上清宫、侠隐岛,也值此机会扩张势力,派出弟子出来行走江湖。 +江湖令由哪些平台运营? 你便是这些弟子中的普通一员,在这群雄并起的年代,你将如何选择自己的未来。 +江湖令由哪些平台运营? 所有的故事,便从瓦岗寨/江都大营开始…… +江湖令由哪些平台运营? 势力: +江湖令由哪些平台运营? ①、飞羽营:【外功、根骨】 +江湖令由哪些平台运营? 南北朝时期,由北方政权创立的一个民间军事团体,经过多年的发展,逐渐成为江湖一大势力。 +江湖令由哪些平台运营? ②、上清宫:【外功、身法】 +江湖令由哪些平台运营? 道家圣地,宫中弟子讲求清静无为,以一种隐世的方式修炼,但身在此乱世,亦也不能独善其身。 +江湖令由哪些平台运营? ③、侠隐岛:【根骨、内力】 +江湖令由哪些平台运营? 位于偏远海岛上的一个世家,岛内弟子大多武功高强,但从不进入江湖行走,适逢乱世,现今岛主也决意作一翻作为。 +江湖令由哪些平台运营? 两大阵营: +江湖令由哪些平台运营? 义军:隋唐末期,百姓生活苦不堪言,有多个有志之士组成义军,对抗当朝暴君,希望建立一个适合百姓安居乐业的天地。 +江湖令由哪些平台运营? 隋军:战争一起即天下打乱,隋军首先要镇压四起的义军,同时在内部慢慢改变现有的朝廷,让天下再次恢复到昔日的安定。 +江湖令由哪些平台运营? 一、宠物品质 +江湖令由哪些平台运营? 宠物的品质分为:灵兽,妖兽,仙兽,圣兽,神兽 +江湖令由哪些平台运营? 二、宠物获取途径 +江湖令由哪些平台运营? 完成任务奖励宠物(其他途径待定)。 +江湖令由哪些平台运营? 三、宠物融合 +江湖令由哪些平台运营? 1、在主界面下方的【宠/骑】按钮进入宠物界面,再点击【融合】即可进入融合界面进行融合,在融合界面可选择要融合的宠物进行融合 +江湖令由哪些平台运营? 2、融合后主宠的形态不变; +江湖令由哪些平台运营? 3、融合后宠物的成长,品质,技能,经验,成长经验,等级都继承成长高的宠物; +江湖令由哪些平台运营? 4、融合宠物技能冲突,则保留成长值高的宠物技能,如果不冲突则叠加在空余的技能位置。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(土耳其文:Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称“土超”,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 目前,土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛 +请问土耳其足球超级联赛是什么时候成立的? 运动项目 足球 +请问土耳其足球超级联赛是什么时候成立的? 成立年份 1959年 +请问土耳其足球超级联赛是什么时候成立的? 参赛队数 18队 +请问土耳其足球超级联赛是什么时候成立的? 国家 土耳其 +请问土耳其足球超级联赛是什么时候成立的? 现任冠军 费内巴切足球俱乐部(2010-2011) +请问土耳其足球超级联赛是什么时候成立的? 夺冠最多队伍 费内巴切足球俱乐部(18次) +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称「土超」,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立于1959年,成立之前土耳其国有多个地区性联赛。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立后便把各地方联赛制度统一起来。 +请问土耳其足球超级联赛是什么时候成立的? 一般土超联赛由八月开始至五月结束,12月至1月会有歇冬期。 +请问土耳其足球超级联赛是什么时候成立的? 十八支球队会互相对叠,各有主场和作客两部分,采计分制。 +请问土耳其足球超级联赛是什么时候成立的? 联赛枋最底的三支球队会降到土耳其足球甲级联赛作赛。 +请问土耳其足球超级联赛是什么时候成立的? 由2005-06年球季起,土超联赛的冠、亚军会取得参加欧洲联赛冠军杯的资格。 +请问土耳其足球超级联赛是什么时候成立的? 成立至今土超联赛乃由两支著名球会所垄断──加拉塔萨雷足球俱乐部和费内巴切足球俱乐部,截至2009-2010赛季,双方各赢得冠军均为17次。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛共有18支球队,采取双循环得分制,每场比赛胜方得3分,负方0分,平局双方各得1分。 +请问土耳其足球超级联赛是什么时候成立的? 如果两支球队积分相同,对战成绩好的排名靠前,其次按照净胜球来决定;如果有三支以上的球队分数相同,则按照以下标准来确定排名:1、几支队伍间对战的得分,2、几支队伍间对战的净胜球数,3、总净胜球数。 +请问土耳其足球超级联赛是什么时候成立的? 联赛第1名直接参加下个赛季冠军杯小组赛,第2名参加下个赛季冠军杯资格赛第三轮,第3名进入下个赛季欧洲联赛资格赛第三轮,第4名进入下个赛季欧洲联赛资格赛第二轮,最后三名降入下个赛季的土甲联赛。 +请问土耳其足球超级联赛是什么时候成立的? 该赛季的土耳其杯冠军可参加下个赛季欧洲联赛资格赛第四轮,如果冠军已获得冠军杯资格,则亚军可参加下个赛季欧洲联赛资格赛第四轮,否则名额递补给联赛。 +请问土耳其足球超级联赛是什么时候成立的? 2010年/2011年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2009年/2010年 布尔萨体育(又译贝莎) +请问土耳其足球超级联赛是什么时候成立的? 2008年/2009年 贝西克塔斯 +请问土耳其足球超级联赛是什么时候成立的? 2007年/2008年 加拉塔萨雷 +请问土耳其足球超级联赛是什么时候成立的? 2006年/2007年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2005年/2006年 加拉塔沙雷 +请问土耳其足球超级联赛是什么时候成立的? 2004年/2005年 费内巴切(又译费伦巴治) +请问土耳其足球超级联赛是什么时候成立的? 2003年/2004年 费内巴切 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? (英)刑事调查局,香港警察的重案组 +cid 作Customer IDentity解时是什么意思? ? 
Criminal Investigation Department +cid 作Customer IDentity解时是什么意思? ? 佩枪: +cid 作Customer IDentity解时是什么意思? ? 香港警察的CID(刑事侦缉队),各区重案组的探员装备短管点38左轮手枪,其特点是便于收藏,而且不容易卡壳,重量轻,其缺点是装弹量少,只有6发,而且换子弹较慢,威力也一般,如果碰上54式手枪或者M9手枪明显处于下风。 +cid 作Customer IDentity解时是什么意思? ? 香港警察的“刑事侦查”(Criminal Investigation Department)部门,早于1983年起已经不叫做C.I.D.的了,1983年香港警察队的重整架构,撤销了C.I.D. ( Criminal Investigation Dept.) “刑事侦缉处”,将“刑事侦查”部门归入去“行动处”内,是“行动处”内的一个分支部门,叫“刑事部”( Crime Wing )。 +cid 作Customer IDentity解时是什么意思? ? 再于90年代的一次警队重整架构,香港警队成立了新的「刑事及保安处」,再将“刑事侦查”部门归入目前的「刑事及保安处」的“处”级单位,是归入这个“处”下的一个部门,亦叫“刑事部” ( Crime Wing ),由一个助理警务处长(刑事)领导。 +cid 作Customer IDentity解时是什么意思? ? 但是时至今天,CID虽已经是一个老旧的名称,香港市民、甚至香港警察都是习惯性的沿用这个历史上的叫法 . +cid 作Customer IDentity解时是什么意思? ? CID格式是美国Adobe公司发表的最新字库格式,它具有易扩充、速度快、兼容性好、简便、灵活等特点,已成为国内开发中文字库的热点,也为用户使用字库提供质量更好,数量更多的字体。 +cid 作Customer IDentity解时是什么意思? ? CID (Character identifier)就是字符识别码,在组成方式上分成CIDFont,CMap表两部分。 +cid 作Customer IDentity解时是什么意思? ? CIDFont文件即总字符集,包括了一种特定语言中所有常用的字符,把这些字符排序,它们在总字符集中排列的次序号就是各个字符的CID标识码(Index);CMap(Character Map)表即字符映像文件,将字符的编码(Code)映像到字符的CID标识码(Index)。 +cid 作Customer IDentity解时是什么意思? ? CID字库完全针对大字符集市场设计,其基本过程为:先根据Code,在CMap表查到Index,然后在CIDFont文件找到相应的字形数据。 +本町位于什么地方? 本条目记述台湾日治时期,各都市之本町。 +本町位于什么地方? 为台湾日治时期台北市之行政区,共分一~四丁目,在表町之西。 +本町位于什么地方? 以现在的位置来看,本町位于现台北市中正区的西北角,约位于忠孝西路一段往西至台北邮局东侧。 +本町位于什么地方? 再向南至开封街一段,沿此路线向西至开封街一段60号,顺60号到汉口街一段向东到现在华南银行总行附近画一条直线到衡阳路。 +本町位于什么地方? 再向东至重庆南路一段,由重庆南路一段回到原点这个范围内。 +本町位于什么地方? 另外,重庆南路一段在当时名为“本町通”。 +本町位于什么地方? 此地方自日治时期起,就是繁华的商业地区,当时也有三和银行、台北专卖分局、日本石油等重要商业机构。 +本町位于什么地方? 其中,专卖分局是战后二二八事件的主要起始点。 +本町位于什么地方? 台湾贮蓄银行(一丁目) +本町位于什么地方? 三和银行(二丁目) +本町位于什么地方? 专卖局台北分局(三丁目) +本町位于什么地方? 日本石油(四丁目) +本町位于什么地方? 为台湾日治时期台南市之行政区。 +本町位于什么地方? 范围包括清代旧街名枋桥头前、枋桥头后、鞋、草花、天公埕、竹仔、下大埕、帽仔、武馆、统领巷、大井头、内宫后、内南町。 +本町位于什么地方? 为清代台南城最繁华的区域。 +本町位于什么地方? 台南公会堂 +本町位于什么地方? 北极殿 +本町位于什么地方? 开基武庙 +本町位于什么地方? 町名改正 +本町位于什么地方? 这是一个与台湾相关的小作品。 +本町位于什么地方? 你可以通过编辑或修订扩充其内容。 +《行走的观点:埃及》的条形码是多少? 出版社: 上海社会科学院出版社; 第1版 (2006年5月1日) +《行走的观点:埃及》的条形码是多少? 丛书名: 时代建筑视觉旅行丛书 +《行走的观点:埃及》的条形码是多少? 条形码: 9787806818640 +《行走的观点:埃及》的条形码是多少? 尺寸: 18 x 13.1 x 0.7 cm +《行走的观点:埃及》的条形码是多少? 重量: 181 g +《行走的观点:埃及》的条形码是多少? 漂浮在沙与海市蜃楼之上的金字塔曾经是否是你的一个梦。 +《行走的观点:埃及》的条形码是多少? 埃及,这片蕴蓄了5000年文明的土地,本书为你撩开它神秘的纱。 +《行走的观点:埃及》的条形码是多少? 诸神、金字塔、神庙、狮身人面像、法老、艳后吸引着我们的注意力;缠绵悱恻的象形文字、医学、雕刻等留给我们的文明,不断引发我们对古代文明的惊喜和赞叹。 +《行走的观点:埃及》的条形码是多少? 尼罗河畔的奇异之旅,数千年的古老文明,尽收在你的眼底…… +《行走的观点:埃及》的条形码是多少? 本书集历史、文化、地理等知识于一体,并以优美、流畅文笔,简明扼要地阐述了埃及的地理环境、政治经济、历史沿革、文化艺术,以大量富有艺术感染力的彩色照片,生动形象地展示了埃及最具特色的名胜古迹、风土人情和自然风光。 +《行走的观点:埃及》的条形码是多少? 古埃及历史 +老挝人民军的工兵部队有几个营? 老挝人民军前身为老挝爱国战线领导的“寮国战斗部队”(即“巴特寮”),始建于1949年1月20日,1965年10月改名为老挝人民解放军,1982年7月改称现名。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会,朱马里·赛雅颂任主席,隆再·皮吉任国防部长。 +老挝人民军的工兵部队有几个营? 实行义务兵役制,服役期最少18个月。[1] +老挝人民军的工兵部队有几个营? ?老挝军队在老挝社会中有较好的地位和保障,工资待遇比地方政府工作人员略高。 +老挝人民军的工兵部队有几个营? 武装部队总兵力约6万人,其中陆军约5万人,主力部队编为5个步兵师;空军2000多人;海军(内河巡逻部队)1000多人;部队机关院校5000人。[1] +老挝人民军的工兵部队有几个营? 老挝人民军军旗 +老挝人民军的工兵部队有几个营? 1991年8月14日通过的《老挝人民民主共和国宪法》第11条规定:国家执行保卫国防和维护社会安宁的政策。 +老挝人民军的工兵部队有几个营? 全体公民和国防力量、治安力量必须发扬忠于祖国、忠于人民的精神,履行保卫革命成果、保卫人民生命财产及和平劳动的任务,积极参加国家建设事业。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会。 +老挝人民军的工兵部队有几个营? 主席由老挝人民革命党中央委员会总书记兼任。 +老挝人民军的工兵部队有几个营? 老挝陆军成立最早,兵力最多,约有5万人。 +老挝人民军的工兵部队有几个营? 其中主力部队步兵师5个、7个独立团、30多个营、65个独立连。 +老挝人民军的工兵部队有几个营? 地方部队30余个营及县属部队。 +老挝人民军的工兵部队有几个营? 地面炮兵2个团,10多个营。 +老挝人民军的工兵部队有几个营? 高射炮兵1个团9个营。 +老挝人民军的工兵部队有几个营? 导弹部队2个营。 +老挝人民军的工兵部队有几个营? 装甲兵7个营。 +老挝人民军的工兵部队有几个营? 特工部队6个营。 +老挝人民军的工兵部队有几个营? 通讯部队9个营。 +老挝人民军的工兵部队有几个营? 工兵部队6个营。 +老挝人民军的工兵部队有几个营? 基建工程兵2个团13个营。 +老挝人民军的工兵部队有几个营? 运输部队7个营。 +老挝人民军的工兵部队有几个营? 陆军的装备基本是中国和前苏联援助的装备和部分从抗美战争中缴获的美式装备。 +老挝人民军的工兵部队有几个营? 老挝内河部队总兵力约1700人,装备有内河船艇110多艘,编成4个艇队。 +老挝人民军的工兵部队有几个营? 
有芒宽、巴能、纳坎、他曲、南盖、巴色等8个基地。 +老挝人民军的工兵部队有几个营? 空军于1975年8月组建,现有2个团、11个飞行大队,总兵力约2000人。 +老挝人民军的工兵部队有几个营? 装备有各种飞机140架,其中主要由前苏联提供和从万象政权的皇家空军手中接管。 +老挝人民军的工兵部队有几个营? 随着军队建设质量的提高,老挝人民军对外军事合作步伐也日益扩大,近年来先后与俄罗斯、印度、马来西亚、越南、菲律宾等国拓展了军事交流与合作的内容。 +老挝人民军的工兵部队有几个营? 2003年1月,印度决定向老挝援助一批军事装备和物资,并承诺提供技术帮助。 +老挝人民军的工兵部队有几个营? 2003年6月,老挝向俄罗斯订购了一批新式防空武器;2003年4月,老挝与越南签署了越南帮助老挝培训军事指挥干部和特种部队以及完成军队通信系统改造等多项协议。 +《焚心之城》的主角是谁? 《焚心之城》[1] 为网络作家老子扛过枪创作的一部都市类小说,目前正在创世中文网连载中。 +《焚心之城》的主角是谁? 乡下大男孩薛城,是一个不甘于生活现状的混混,他混过、爱过、也深深地被伤害过。 +《焚心之城》的主角是谁? 本料此生当浑浑噩噩,拼搏街头。 +《焚心之城》的主角是谁? 高考的成绩却给了他一点渺茫的希望,二月后,大学如期吹响了他进城的号角。 +《焚心之城》的主角是谁? 繁华的都市,热血的人生,冷眼嘲笑中,他发誓再不做一个平常人! +《焚心之城》的主角是谁? 江北小城,黑河大地,他要行走过的每一个角落都有他的传说。 +《焚心之城》的主角是谁? 扯出一面旗,拉一帮兄弟,做男人,就要多一份担当,活一口傲气。 +《焚心之城》的主角是谁? (日期截止到2014年10月23日凌晨) +请问香港利丰集团是什么时候成立的? 香港利丰集团前身是广州的华资贸易 (1906 - 1949) ,利丰是香港历史最悠久的出口贸易商号之一。 +请问香港利丰集团是什么时候成立的? 于1906年,冯柏燎先生和李道明先生在广州创立了利丰贸易公司;是当时中国第一家华资的对外贸易出口商。 +请问香港利丰集团是什么时候成立的? 利丰于1906年创立,初时只从事瓷器及丝绸生意;一年之后,增添了其它的货品,包括竹器、藤器、玉石、象牙及其它手工艺品,包括烟花爆竹类别。 +请问香港利丰集团是什么时候成立的? 在早期的对外贸易,中国南方内河港因水深不足不能行驶远洋船,反之香港港口水深岸阔,占尽地利。 +请问香港利丰集团是什么时候成立的? 因此,在香港成立分公司的责任,落在冯柏燎先生的三子冯汉柱先生身上。 +请问香港利丰集团是什么时候成立的? 1937年12月28日,利丰(1937)有限公司正式在香港创立。 +请问香港利丰集团是什么时候成立的? 第二次世界大战期间,利丰暂停贸易业务。 +请问香港利丰集团是什么时候成立的? 1943年,随着创办人冯柏燎先生去世后,业务移交给冯氏家族第二代。 +请问香港利丰集团是什么时候成立的? 之后,向来不参与业务管理的合伙人李道明先生宣布退休,将所拥有的利丰股权全部卖给冯氏家族。 +请问香港利丰集团是什么时候成立的? 目前由哈佛冯家两兄弟William Fung , Victor Fung和CEO Bruce Rockowitz 管理。 +请问香港利丰集团是什么时候成立的? 截止到2012年,集团旗下有利亚﹝零售﹞有限公司、利和集团、利邦时装有限公司、利越时装有限公司、利丰贸易有限公司。 +请问香港利丰集团是什么时候成立的? 利亚(零售)连锁,业务包括大家所熟悉的:OK便利店、玩具〝反〞斗城和圣安娜饼屋;范围包括香港、台湾、新加坡、马来西亚、至中国大陆及东南亚其它市场逾600多家店 +请问香港利丰集团是什么时候成立的? 利和集团,IDS以专业物流服务为根基,为客户提供经销,物流,制造服务领域内的一系列服务项目。 +请问香港利丰集团是什么时候成立的? 业务网络覆盖大中华区,东盟,美国及英国,经营着90多个经销中心,在中国设有18个经销公司,10,000家现代经销门店。 +请问香港利丰集团是什么时候成立的? 利邦(上海)时装贸易有限公司为大中华区其中一家大型男士服装零售集团。 +请问香港利丰集团是什么时候成立的? 现在在中国大陆、香港、台湾和澳门收购经营11个包括Cerruti 1881,Gieves & Hawkes,Kent & curwen和D’urban 等中档到高档的男士服装品牌,全国有超过350间门店设于各一线城市之高级商场及百货公司。 +请问香港利丰集团是什么时候成立的? 利越(上海)服装商贸有限公司隶属于Branded Lifestyle,负责中国大陆地区LEO里奥(意大利)、GIBO捷宝(意大利)、UFFIZI古杰师(意大利)、OVVIO奥维路(意大利)、Roots绿适(加拿大,全球服装排名第四)品牌销售业务 +请问香港利丰集团是什么时候成立的? 利丰(贸易)1995年收购了英之杰采购服务,1999年收购太古贸易有限公司(Swire & Maclain) 和金巴莉有限公司(Camberley),2000年和2002年分别收购香港采购出口集团Colby Group及Janco Oversea Limited,大大扩张了在美国及欧洲的顾客群,自2008年经济危机起一直到现在,收购多家欧、美、印、非等地区的时尚品牌,如英国品牌Visage,仅2011年上半年6个月就完成26个品牌的收购。 +请问香港利丰集团是什么时候成立的? 2004年利丰与Levi Strauss & Co.签订特许经营协议 +请问香港利丰集团是什么时候成立的? 2005年利丰伙拍Daymon Worldwide为全球供应私有品牌和特许品牌 +请问香港利丰集团是什么时候成立的? 2006年收购Rossetti手袋业务及Oxford Womenswear Group 强化美国批发业务 +请问香港利丰集团是什么时候成立的? 2007年收购Tommy Hilfiher全球采购业务,收购CGroup、Peter Black International LTD、Regetta USA LLC和American Marketing Enterprice +请问香港利丰集团是什么时候成立的? 2008年收购Kent&Curwen全球特许经营权,收购Van Zeeland,Inc和Miles Fashion Group +请问香港利丰集团是什么时候成立的? 2009年收购加拿大休闲品牌Roots ,收购Wear Me Appearl,LLC。 +请问香港利丰集团是什么时候成立的? 与Hudson's Bay、Wolverine Worldwide Inc、Talbots、Liz Claiborne达成了采购协议 +请问香港利丰集团是什么时候成立的? 2010年收购Oxford apparel Visage Group LTD +请问香港利丰集团是什么时候成立的? 2011年一月收购土耳其Modium、美国女性时尚Beyond Productions,三月收购贸易公司Celissa 、玩具公司Techno Source USA, Inc.、卡通品牌产品TVMania和法国著名时装一线品牌Cerruti 1881,五月收购Loyaltex Apparel Ltd.、女装Hampshire Designers和英国彩妆Collection 2000,六月收购家私贸易Exim Designs Co., Ltd.,七月收购家庭旅行产业Union Rich USA, LLC和设计公司Lloyd Textile Fashion Company Limited,八月收购童装Fishman & Tobin和Crimzon Rose,九月收购家私贸易True Innovations, LLC、日用品企业Midway Enterprises和Wonderful World。 +请问香港利丰集团是什么时候成立的? 十二月与USPA – U.S. Polo Association签署授权协议。 +请问香港利丰集团是什么时候成立的? 利丰的精神:积极进取,不断认识并争取有利于客户和自身进步的机会;以行动为主导,对客户、供应商及职工的需求作出快速的决定。 +请问香港利丰集团是什么时候成立的? 利丰的最终目标:在产品采购、销售、流转的各环节建立全球性队伍提供多元化服务,利丰成员有效合作,共达目标。 +如何使魔兽变种akt不被查杀? 
Trojan/PSW.Moshou.akt“魔兽”变种akt是“魔兽”木马家族的最新成员之一,采用Delphi 6.0-7.0编写,并经过加壳处理。 +如何使魔兽变种akt不被查杀? “魔兽”变种akt运行后,自我复制到被感染计算机的指定目录下。 +如何使魔兽变种akt不被查杀? 修改注册表,实现木马开机自动运行。 +如何使魔兽变种akt不被查杀? 自我注入到被感染计算机的“explorer.exe”、“notepad.exe”等用户级权限的进程中加载运行,隐藏自我,防止被查杀。 +如何使魔兽变种akt不被查杀? 在后台秘密监视用户打开的窗口标题,盗取网络游戏《魔兽世界》玩家的游戏帐号、游戏密码、角色等级、装备信息、金钱数量等信息,并在后台将窃取到的玩家信息发送到骇客指定的远程服务器上,致使玩家游戏帐号、装备物品、金钱等丢失,给游戏玩家造成非常大的损失。 +丙种球蛋白能预防什么病情? 丙种球蛋白预防传染性肝炎,预防麻疹等病毒性疾病感染,治疗先天性丙种球蛋白缺乏症 ,与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 中文简称:“丙球” +丙种球蛋白能预防什么病情? 英文名称:γ-globulin、gamma globulin +丙种球蛋白能预防什么病情? 【别名】 免疫血清球蛋白,普通免疫球蛋白,人血丙种球蛋白,丙种球蛋白,静脉注射用人免疫球蛋白(pH4) +丙种球蛋白能预防什么病情? 注:由于人血中的免疫球蛋白大多数为丙种球蛋白(γ-球蛋白),有时丙种球蛋白也被混称为“免疫球蛋白”(immunoglobulin) 。 +丙种球蛋白能预防什么病情? 冻干制剂应为白色或灰白色的疏松体,液体制剂和冻干制剂溶解后,溶液应为接近无色或淡黄色的澄明液体,微带乳光。 +丙种球蛋白能预防什么病情? 但不应含有异物或摇不散的沉淀。 +丙种球蛋白能预防什么病情? 注射丙种球蛋白是一种被动免疫疗法。 +丙种球蛋白能预防什么病情? 它是把免疫球蛋白内含有的大量抗体输给受者,使之从低或无免疫状态很快达到暂时免疫保护状态。 +丙种球蛋白能预防什么病情? 由于抗体与抗原相互作用起到直接中和毒素与杀死细菌和病毒。 +丙种球蛋白能预防什么病情? 因此免疫球蛋白制品对预防细菌、病毒性感染有一定的作用[1]。 +丙种球蛋白能预防什么病情? 人免疫球蛋白的生物半衰期为16~24天。 +丙种球蛋白能预防什么病情? 1、丙种球蛋白[2]含有健康人群血清所具有的各种抗体,因而有增强机体抵抗力以预防感染的作用。 +丙种球蛋白能预防什么病情? 2、主要治疗先天性丙种球蛋白缺乏症和免疫缺陷病 +丙种球蛋白能预防什么病情? 3、预防传染性肝炎,如甲型肝炎和乙型肝炎等。 +丙种球蛋白能预防什么病情? 4、用于麻疹、水痘、腮腺炎、带状疱疹等病毒感染和细菌感染的防治 +丙种球蛋白能预防什么病情? 5、也可用于哮喘、过敏性鼻炎、湿疹等内源性过敏性疾病。 +丙种球蛋白能预防什么病情? 6、与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 7、川崎病,又称皮肤粘膜淋巴结综合征,常见于儿童,丙种球蛋白是主要的治疗药物。 +丙种球蛋白能预防什么病情? 1、对免疫球蛋白过敏或有其他严重过敏史者。 +丙种球蛋白能预防什么病情? 2、有IgA抗体的选择性IgA缺乏者。 +丙种球蛋白能预防什么病情? 3、发烧患者禁用或慎用。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (1997年9月1日浙江省第八届人民代表大会常务委员会第三十九次会议通过 1997年9月9日浙江省第八届人民代表大会常务委员会公告第六十九号公布自公布之日起施行) +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 为了保护人的生命和健康,发扬人道主义精神,促进社会发展与和平进步事业,根据《中华人民共和国红十字会法》,结合本省实际,制定本办法。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省县级以上按行政区域建立的红十字会,是中国红十字会的地方组织,是从事人道主义工作的社会救助团体,依法取得社会团体法人资格,设置工作机构,配备专职工作人员,依照《中国红十字会章程》独立自主地开展工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全省性行业根据需要可以建立行业红十字会,配备专职或兼职工作人员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 街道、乡(镇)、机关、团体、学校、企业、事业单位根据需要,可以依照《中国红十字会章程》建立红十字会的基层组织。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 上级红十字会指导下级红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上地方红十字会指导所在行政区域行业红十字会和基层红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 人民政府对红十字会给予支持和资助,保障红十字会依法履行职责,并对其活动进行监督;红十字会协助人民政府开展与其职责有关的活动。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全社会都应当关心和支持红十字事业。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省公民和单位承认《中国红十字会章程》并缴纳会费的,可以自愿参加红十字会,成为红十字会的个人会员或团体会员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员由本人申请,基层红十字会批准,发给会员证;团体会员由单位申请,县级以上红十字会批准,发给团体会员证。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员和团体会员应当遵守《中华人民共和国红十字会法》和《中国红十字会章程》,热心红十字事业,履行会员的义务,并享有会员的权利。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会理事会由会员代表大会民主选举产生。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 理事会民主选举产生会长和副会长;根据会长提名,决定秘书长、副秘书长人选。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会可以设名誉会长、名誉副会长和名誉理事,由同级红十字会理事会聘请。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 省、市(地)红十字会根据独立、平等、互相尊重的原则,发展同境外、国外地方红十字会和红新月会的友好往来和合作关系。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 红十字会履行下列职责: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)宣传、贯彻《中华人民共和国红十字会法》和本办法; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)开展救灾的准备工作,筹措救灾款物;在自然灾害和突发事件中,对伤病人员和其他受害者进行救助; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)普及卫生救护和防病知识,进行初级卫生救护培训,对交通、电力、建筑、矿山等容易发生意外伤害的单位进行现场救护培训; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (四)组织群众参加现场救护; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 
(五)参与输血献血工作,推动无偿献血; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (六)开展红十字青少年活动; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (七)根据中国红十字会总会部署,参加国际人道主义救援工作; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (八)依照国际红十字和红新月运动的基本原则,完成同级人民政府和上级红十字会委托的有关事宜; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (九)《中华人民共和国红十宇会法》和《中国红十字会章程》规定的其他职责。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 第八条 红十字会经费的主要来源: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)红十字会会员缴纳的会费; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)接受国内外组织和个人捐赠的款物; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)红十字会的动产、不动产以及兴办社会福利事业和经济实体的收入; +宝湖庭院绿化率多少? 建发·宝湖庭院位于银川市金凤区核心地带—正源南街与长城中路交汇处向东500米。 +宝湖庭院绿化率多少? 项目已于2012年4月开工建设,总占地约4.2万平方米,总建筑面积约11.2万平方米,容积率2.14,绿化率35%,预计可入住630户。 +宝湖庭院绿化率多少? “建发·宝湖庭院”是银川建发集团股份有限公司继“建发·宝湖湾”之后,在宝湖湖区的又一力作。 +宝湖庭院绿化率多少? 项目周边发展成熟,东有唐徕渠景观水道,西临银川市交通主干道正源街;南侧与宝湖湿地公园遥相呼应。 +宝湖庭院绿化率多少? “宝湖庭院”项目公共交通资源丰富:15路、21路、35路、38路、43路公交车贯穿银川市各地,出行便利。 +宝湖庭院绿化率多少? 距离新百良田购物广场约1公里,工人疗养院600米,宝湖公园1公里,唐徕渠景观水道500米。 +宝湖庭院绿化率多少? 项目位置优越,购物、餐饮、医疗、交通、休闲等生活资源丰富。[1] +宝湖庭院绿化率多少? 建发·宝湖庭院建筑及景观设置传承建发一贯“简约、大气”的风格:搂间距宽广,确保每一座楼宇视野开阔通透。 +宝湖庭院绿化率多少? 楼宇位置错落有置,外立面设计大气沉稳别致。 +宝湖庭院绿化率多少? 项目内部休闲绿地、景观小品点缀其中,道路及停车系统设计合理,停车及通行条件便利。 +宝湖庭院绿化率多少? 社区会所、幼儿园、活动室、医疗服务中心等生活配套一应俱全。 +宝湖庭院绿化率多少? 行政区域:金凤区 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔是荷兰“大黄鸭”之父弗洛伦泰因·霍夫曼打造的大型装置艺术作品,该作品首次亮相于台湾桃园大园乡海军基地,为了迎接中秋节的到来;在展览期间,海军基地也首次对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼觉得中国神话中捣杵的玉兔很有想象力,于是特别创作了“月兔”,这也是“月兔”新作第一次展出。[1] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?2014年9月15日因工人施工不慎,遭火烧毁。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? “大月兔”外表采用的杜邦防水纸、会随风飘动,内部以木材加保丽龙框架支撑做成。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 兔毛用防水纸做成,材质完全防水,不怕日晒雨淋。[3 +大月兔(中秋艺术作品)的作者还有哪些代表作? -4] +大月兔(中秋艺术作品)的作者还有哪些代表作? 25米的“月兔”倚靠在机 +大月兔(中秋艺术作品)的作者还有哪些代表作? 堡上望着天空,像在思考又像赏月。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 月兔斜躺在机堡上,意在思考生命、边做白日梦,编织自己的故事。[3] +大月兔(中秋艺术作品)的作者还有哪些代表作? 台湾桃园大园乡海军基地也首度对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 428公顷的海军基地中,地景艺术节使用约40公顷,展场包括过去军机机堡、跑道等,由于这处基地过去警备森严,不对外开放,这次结合地景艺术展出,也可一窥过去是黑猫中队基地的神秘面纱。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月2日,桃园县政府文化局举行“踩线团”,让 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔 +大月兔(中秋艺术作品)的作者还有哪些代表作? 各项地景艺术作品呈现在媒体眼中,虽然“月兔”仍在进行最后的细节赶工,但横躺在机堡上的“月兔”雏形已经完工。[5] +大月兔(中秋艺术作品)的作者还有哪些代表作? “这么大”、“好可爱呦”是不少踩线团成员对“月兔”的直觉;尤其在蓝天的衬托及前方绿草的组合下,呈现犹如真实版的爱丽丝梦游仙境。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼的作品大月兔,“从平凡中,创作出不平凡的视觉”,创造出观赏者打从心中油然而生的幸福感,拉近观赏者的距离。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月15日早 +大月兔(中秋艺术作品)的作者还有哪些代表作? 上,施工人员要将月兔拆解,搬离海军基地草皮时,疑施工拆除的卡车,在拆除过程,故障起火,起火的卡车不慎延烧到兔子,造成兔子起火燃烧,消防队员即刻抢救,白色的大月兔立即变成焦黑的火烧兔。[7] +大月兔(中秋艺术作品)的作者还有哪些代表作? 桃园县府表示相当遗憾及难过,也不排除向包商求偿,也已将此事告知霍夫曼。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?[8] +大月兔(中秋艺术作品)的作者还有哪些代表作? 弗洛伦泰因·霍夫曼,荷兰艺术家,以在公共空间创作巨大造型 +大月兔(中秋艺术作品)的作者还有哪些代表作? 物的艺术项目见长。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 代表作品包括“胖猴子”(2010年在巴西圣保罗展出)、“大黄兔”(2011年在瑞典厄勒布鲁展出)、粉红猫(2014年5月在上海亮相)、大黄鸭(Rubber Duck)、月兔等。 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual plc)成立于1845年,一直在伦敦证券交易所(伦敦证券交易所:OML)作第一上市,也是全球排名第32位(按营业收入排名)的保险公司(人寿/健康)。 +英国耆卫保险公司有多少保险客户? 公司是全球财富500强公司之一,也是被列入英国金融时报100指数的金融服务集团之一。 +英国耆卫保险公司有多少保险客户? Old Mutual 是一家国际金融服务公司,拥有近320万个保险客户,240万个银行储户,270,000个短期保险客户以及700,000个信托客户 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual)是一家国际金融服务公司,总部设在伦敦,主要为全球客户提供长期储蓄的解决方案、资产管理、短期保险和金融服务等,目前业务遍及全球34个国家。[1] +英国耆卫保险公司有多少保险客户? 主要包括人寿保险,资产管理,银行等。 +英国耆卫保险公司有多少保险客户? 1845年,Old Mutual在好望角成立。 +英国耆卫保险公司有多少保险客户? 1870年,董事长Charles Bell设计了Old Mutual公司的标记。 +英国耆卫保险公司有多少保险客户? 1910年,南非从英联邦独立出来。 +英国耆卫保险公司有多少保险客户? Old Mutual的董事长John X. Merriman被选为国家总理。 +英国耆卫保险公司有多少保险客户? 1927年,Old Mutual在Harare成立它的第一个事务所。 +英国耆卫保险公司有多少保险客户? 1960年,Old Mutual在南非成立了Mutual Unit信托公司,用来管理公司的信托业务。 +英国耆卫保险公司有多少保险客户? 1970年,Old Mutual的收入超过100百万R。 +英国耆卫保险公司有多少保险客户? 
1980年,Old Mutual成为南非第一大人寿保险公司,年收入达10亿R。 +英国耆卫保险公司有多少保险客户? 1991年,Old Mutual在美国财富周刊上评选的全球保险公司中名列第38位。 +英国耆卫保险公司有多少保险客户? 1995年,Old Mutual在美国波士顿建立投资顾问公司,同年、又在香港和Guernsey建立事务所。 +英国耆卫保险公司有多少保险客户? 作为一项加强与其母公司联系的举措,OMNIA公司(百慕大)荣幸的更名为Old Mutual 公司(百慕大) 。 +英国耆卫保险公司有多少保险客户? 这一新的名称和企业识别清晰地展示出公司成为其世界金融机构合作伙伴强有力支持的决心。 +英国耆卫保险公司有多少保险客户? 2003 年4月,该公司被Old Mutual plc公司收购,更名为Sage Life(百慕大)公司并闻名于世,公司为Old Mutual公司提供了一个新的销售渠道,补充了其现有的以美元计价的产品线和分销系统。 +英国耆卫保险公司有多少保险客户? 达到了一个重要里程碑是公司成功的一个例证: 2005年6月3日公司资产超过10亿美元成为公司的一个主要里程碑,也是公司成功的一个例证。 +英国耆卫保险公司有多少保险客户? Old Mutual (百慕大)为客户提供一系列的投资产品。 +英国耆卫保险公司有多少保险客户? 在其开放的结构下,客户除了能够参与由Old Mutual会员管理的方案外,还能够参与由一些世界顶尖投资机构提供的投资选择。 +英国耆卫保险公司有多少保险客户? 首席执行官John Clifford对此发表评论说:“过去的两年对于Old Mutual家族来说是稳固发展的两年,更名是迫在眉睫的事情。 +英国耆卫保险公司有多少保险客户? 通过采用其名字和形象上的相似,Old Mutual (百慕大)进一步强化了与母公司的联系。” +英国耆卫保险公司有多少保险客户? Clifford补充道:“我相信Old Mutual全球品牌认可度和Old Mutual(百慕大)产品专业知识的结合将在未来的日子里进一步推动公司的成功。” +英国耆卫保险公司有多少保险客户? 随着公司更名而来的是公司网站的全新改版,设计投资选择信息、陈述、销售方案、营销材料和公告板块。 +英国耆卫保险公司有多少保险客户? 在美国购买不到OMNIA投资产品,该产品也不向美国公民或居民以及百慕大居民提供。 +英国耆卫保险公司有多少保险客户? 这些产品不对任何要约未得到批准的区域中的任何人,以及进行此要约或询价为非法行为的个人构成要约或询价。 +英国耆卫保险公司有多少保险客户? 关于Old Mutual(百慕大)公司 +英国耆卫保险公司有多少保险客户? Old Mutual(百慕大)公司总部位于百慕大,公司面向非美国居民及公民以及非百慕大居民,通过遍布世界的各个市场的金融机构开发和销售保险和投资方案。 +英国耆卫保险公司有多少保险客户? 这些方案由Old Mutual(百慕大)公司直接做出,向投资者提供各种投资选择和战略,同时提供死亡和其他受益保证。 +谁知道北京的淡定哥做了什么? 尼日利亚足球队守门员恩耶马被封淡定哥,原因是2010年南非世界杯上1:2落后希腊队时,对方前锋已经突破到禁区,其仍头依门柱发呆,其从容淡定令人吃惊。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 在2010年6月17日的世界杯赛场上,尼日利亚1比2不敌希腊队,但尼日利亚门将恩耶马(英文名:Vincent Enyeama)在赛场上的“淡定”表现令人惊奇。 +谁知道北京的淡定哥做了什么? 随后,网友将赛场照片发布于各大论坛,恩耶马迅速窜红,并被网友称为“淡定哥”。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 从网友上传得照片中可以看到,“淡定哥”在面临对方前锋突袭至小禁区之时,还靠在球门柱上发呆,其“淡定”程度的确非一般人所能及。 +谁知道北京的淡定哥做了什么? 恩耶马是尼日利亚国家队的主力守门员,目前效力于以色列的特拉维夫哈普尔队。 +谁知道北京的淡定哥做了什么? 1999年,恩耶马在尼日利亚国内的伊波姆星队开始职业生涯,后辗转恩伊姆巴、Iwuanyanwu民族等队,从07年开始,他为特拉维夫效力。 +谁知道北京的淡定哥做了什么? 恩耶马的尼日利亚国脚生涯始于2002年,截至2010年1月底,他为国家队出场已超过50次。 +谁知道北京的淡定哥做了什么? 当地时间2011年1月4日,国际足球历史与统计协会(IFFHS)公布了2010年度世界最佳门将,恩耶马(尼日利亚,特拉维夫夏普尔)10票排第十一 +谁知道北京的淡定哥做了什么? 此词经国家语言资源监测与研究中心等机构专家审定入选2010年年度新词语,并收录到《中国语言生活状况报告》中。 +谁知道北京的淡定哥做了什么? 提示性释义:对遇事从容镇定、处变不惊的男性的戏称。 +谁知道北京的淡定哥做了什么? 例句:上海现“淡定哥”:百米外爆炸他仍专注垂钓(2010年10月20日腾讯网http://news.qq.com/a/20101020/000646.htm) +谁知道北京的淡定哥做了什么? 2011年度新人物 +谁知道北京的淡定哥做了什么? 1、淡定哥(北京) +谁知道北京的淡定哥做了什么? 7月24日傍晚,北京市出现大范围降雨天气,位于通州北苑路出现积水,公交车也难逃被淹。 +谁知道北京的淡定哥做了什么? 李欣摄图片来源:新华网一辆私家车深陷积水,车主索性盘坐在自己的汽车上抽烟等待救援。 +谁知道北京的淡定哥做了什么? 私家车主索性盘坐在自己的车上抽烟等待救援,被网友称“淡定哥” +谁知道北京的淡定哥做了什么? 2、淡定哥——林峰 +谁知道北京的淡定哥做了什么? 在2011年7月23日的动车追尾事故中,绍兴人杨峰(@杨峰特快)在事故中失去了5位亲人:怀孕7个月的妻子、未出世的孩子、岳母、妻姐和外甥女,他的岳父也在事故中受伤正在治疗。 +谁知道北京的淡定哥做了什么? 他披麻戴孝出现在事故现场,要求将家人的死因弄个明白。 +谁知道北京的淡定哥做了什么? 但在第一轮谈判过后,表示:“请原谅我,如果我再坚持,我将失去我最后的第六个亲人。” +谁知道北京的淡定哥做了什么? 如果他继续“纠缠”铁道部,他治疗中的岳父将会“被死亡”。 +谁知道北京的淡定哥做了什么? 很多博友就此批评杨峰,并讽刺其为“淡定哥”。 +071型船坞登陆舰的北约代号是什么? 071型船坞登陆舰(英语:Type 071 Amphibious Transport Dock,北约代号:Yuzhao-class,中文:玉昭级,或以首舰昆仑山号称之为昆仑山级船坞登陆舰),是中国人民解放军海军隶下的大型多功能两栖船坞登陆舰,可作为登陆艇的母舰,用以运送士兵、步兵战车、主战坦克等展开登陆作战,也可搭载两栖车辆,具备大型直升机起降甲板及操作设施。 +071型船坞登陆舰的北约代号是什么? 071型两栖登陆舰是中国首次建造的万吨级作战舰艇,亦为中国大型多功能两栖舰船的开山之作,也可以说是中国万吨级以上大型作战舰艇的试验之作,该舰的建造使中国海军的两栖舰船实力有了质的提升。 +071型船坞登陆舰的北约代号是什么? 在本世纪以前中国海军原有的两栖舰队以一 +071型船坞登陆舰的北约代号是什么? 早期071模型 +071型船坞登陆舰的北约代号是什么? 千至四千吨级登陆舰为主要骨干,这些舰艇吨位小、筹载量有限,直升机操作能力非常欠缺,舰上自卫武装普遍老旧,对于现代化两栖登陆作战可说有很多不足。 +071型船坞登陆舰的北约代号是什么? 为了应对新时期的国际国内形势,中国在本世纪初期紧急强化两栖作战能力,包括短时间内密集建造072、074系列登陆舰,同时也首度设计一种新型船坞登陆舰,型号为071。[1] +071型船坞登陆舰的北约代号是什么? 在两栖作战行动中,这些舰只不得不采取最危险的 +071型船坞登陆舰的北约代号是什么? 舾装中的昆仑山号 +071型船坞登陆舰的北约代号是什么? 敌前登陆方式实施两栖作战行动,必须与敌人预定阻击力量进行面对面的战斗,在台湾地区或者亚洲其他国家的沿海,几乎没有可用而不设防的海滩登陆地带,并且各国或者地区的陆军在战时,可能会很快控制这些易于登陆的海难和港口,这样就限制住了中国海军两栖登陆部队的实际登陆作战能力。 +071型船坞登陆舰的北约代号是什么? 
071型登陆舰正是为了更快和更多样化的登陆作战而开发的新型登陆舰艇。[2] +071型船坞登陆舰的北约代号是什么? 071型两栖船坞登陆舰具有十分良好的整体隐身能力, +071型船坞登陆舰的北约代号是什么? 071型概念图 +071型船坞登陆舰的北约代号是什么? 该舰外部线条简洁干练,而且舰体外形下部外倾、上部带有一定角度的内倾,从而形成雷达隐身性能良好的菱形横剖面。 +071型船坞登陆舰的北约代号是什么? 舰体为高干舷平甲板型,长宽比较小,舰身宽满,采用大飞剪型舰首及楔形舰尾,舰的上层建筑位于舰体中间部位,后部是大型直升机甲板,适航性能非常突出。 +071型船坞登陆舰的北约代号是什么? 顶甲板上各类电子设备和武器系统布局十分简洁干净,各系统的突出物很少。 +071型船坞登陆舰的北约代号是什么? 该舰的两座烟囱实行左右分布式设置在舰体两侧,既考虑了隐身特点,也十分新颖。[3] +071型船坞登陆舰的北约代号是什么? 1号甲板及上层建筑物主要设置有指挥室、控 +071型船坞登陆舰的北约代号是什么? 舰尾俯视 +071型船坞登陆舰的北约代号是什么? 制舱、医疗救护舱及一些居住舱,其中医疗救护舱设置有完备的战场救护设施,可以在舰上为伤病员提供紧急手术和野战救护能力。 +071型船坞登陆舰的北约代号是什么? 2号甲板主要是舰员和部分登陆人员的居住舱、办公室及厨房。 +071型船坞登陆舰的北约代号是什么? 主甲板以下则是登陆舱,分前后两段,前段是装甲车辆储存舱,共两层,可以储存登陆装甲车辆和一些其它物资,在进出口处还设有一小型升降机,用于两层之间的移动装卸用。 +071型船坞登陆舰的北约代号是什么? 前段车辆储存舱外壁左右各设有一折叠式装载舱门,所有装载车辆在码头可通过该门直接装载或者登陆上岸。 +071型船坞登陆舰的北约代号是什么? 后段是一个巨型船坞登陆舱,总长约70米,主要用来停泊大小型气垫登陆艇、机械登陆艇或车辆人员登陆艇。[4] +071型船坞登陆舰的北约代号是什么? 自卫武装方面,舰艏设有一门PJ-26型76mm舰炮( +071型船坞登陆舰的北约代号是什么? 井冈山号舰首主炮 +071型船坞登陆舰的北约代号是什么? 俄罗斯AK-176M的中国仿制版,亦被054A采用) , 四具与052B/C相同的726-4 18联装干扰弹发射器分置于舰首两侧以及上层结构两侧,近迫防御则依赖四座布置于上层结构的AK-630 30mm防空机炮 。 +071型船坞登陆舰的北约代号是什么? 原本071模型的舰桥前方设有一座八联装海红-7短程防空导弹发射器,不过071首舰直到出海试航与2009年4月下旬的海上阅兵式中,都未装上此一武器。 +071型船坞登陆舰的北约代号是什么? 电子装备方面, 舰桥后方主桅杆顶配置一具363S型E/F频2D对空/平面搜索雷达 、一具Racal Decca RM-1290 I频导航雷达,后桅杆顶装备一具拥有球型外罩的364型(SR-64)X频2D对空/对海搜索雷达,此外还有一具LR-66C舰炮射控雷达、一具负责导引AK-630机炮的TR-47C型火炮射控雷达等。[5] +071型船坞登陆舰的北约代号是什么? 071型自卫武装布置 +071型船坞登陆舰的北约代号是什么? 071首舰昆仑山号于2006年6月开 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 竹溪县人大常委会办公室:承担人民代表大会会议、常委会会议、主任会议和常委会党组会议(简称“四会”)的筹备和服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员视察活动的联系服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 受主任会议委托,拟定有关议案草案。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担常委会人事任免的具体工作,负责机关人事管理和离退休干部的管理与服务。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大机关的行政事务和后勤保障工作,负责机关的安全保卫、文电处理、档案、保密、文印工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大常委会同市人大常委会及乡镇人大的工作联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责信息反馈工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 了解宪法、法律、法规和本级人大及其常委会的决议、决定实施情况及常委会成员提出建议办理情况,及时向常委会和主任会议报告。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大宣传工作,负责人大常委会会议宣传的组织和联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 组织协调各专门工作委员会开展工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 办公室下设五个科,即秘书科、调研科、人事任免科、综合科、老干部科。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 教科文卫工作委员会:负责人大教科文卫工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责教科文卫方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大常委会教科文卫方面会议议题调查的组织联系和调研材料的起草工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担教科文卫方面规范性备案文件的初审工作,侧重对教科文卫行政执法个案监督业务承办工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员和人大代表对教科文卫工作方面检查、视察的组织联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 代表工作委员会:负责与县人大代表和上级人大代表的联系、情况收集交流工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责《代表法》的宣传贯彻和贯彻实施情况的调查研究工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责县人大代表法律法规和人民代表大会制度知识学习的组织和指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会主任、副主任和委员走访联系人大代表的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责组织人大系统的干部培训。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责乡镇人大主席团工作的联系和指导。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表建议、批评和意见办理工作的联系和督办落实。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表开展活动的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 财政经济工作委员会:负责人大财政经济工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责财政经济方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 对国民经济计划和财政预算编制情况进行初审。 +我想知道武汉常住人口有多少? 武汉,简称“汉”,湖北省省会。 +我想知道武汉常住人口有多少? 它是武昌、汉口、汉阳三镇统称。 +我想知道武汉常住人口有多少? 世界第三大河长江及其最长支流汉江横贯市区,将武汉一分为三,形成武昌、汉口、汉阳,三镇跨江鼎立的格局。 +我想知道武汉常住人口有多少? 唐朝诗人李白在此写下“黄鹤楼中吹玉笛,江城五月落梅花”,因此武汉自古又称“江城”。 +我想知道武汉常住人口有多少? 武汉是中国15个副省级城市之一,全国七大中心城市之一,全市常住人口858万人。 +我想知道武汉常住人口有多少? 华中地区最大都市,华中金融中心、交通中心、文化中心,长江中下游特大城市。 +我想知道武汉常住人口有多少? 武汉城市圈的中心城市。 +我想知道武汉常住人口有多少? 
[3]武昌、汉口、汉阳三地被俗称武汉三镇。 +我想知道武汉常住人口有多少? 武汉西与仙桃市、洪湖市相接,东与鄂州市、黄石市接壤,南与咸宁市相连,北与孝感市相接,形似一只自西向东的蝴蝶形状。 +我想知道武汉常住人口有多少? 在中国经济地理圈内,武汉处于优越的中心位置是中国地理上的“心脏”,故被称为“九省通衢”之地。 +我想知道武汉常住人口有多少? 武汉市历史悠久,古有夏汭、鄂渚之名。 +我想知道武汉常住人口有多少? 武汉地区考古发现的历史可以上溯距今6000年的新石器时代,其考古发现有东湖放鹰台遗址的含有稻壳的红烧土、石斧、石锛以及鱼叉。 +我想知道武汉常住人口有多少? 市郊黄陂区境内的盘龙城遗址是距今约3500年前的商朝方国宫城,是迄今中国发现及保存最完整的商代古城之一。 +我想知道武汉常住人口有多少? 现代武汉的城市起源,是东汉末年的位于今汉阳的卻月城、鲁山城,和在今武昌蛇山的夏口城。 +我想知道武汉常住人口有多少? 东汉末年,地方军阀刘表派黄祖为江夏太守,将郡治设在位于今汉阳龟山的卻月城中。 +我想知道武汉常住人口有多少? 卻月城是武汉市区内已知的最早城堡。 +我想知道武汉常住人口有多少? 223年,东吴孙权在武昌蛇山修筑夏口城,同时在城内的黄鹄矶上修筑了一座瞭望塔——黄鹤楼。 +我想知道武汉常住人口有多少? 苏轼在《前赤壁赋》中说的“西望夏口,东望武昌”中的夏口就是指武汉(而当时的武昌则是今天的鄂州)。 +我想知道武汉常住人口有多少? 南朝时,夏口扩建为郢州,成为郢州的治所。 +我想知道武汉常住人口有多少? 隋置江夏县和汉阳县,分别以武昌,汉阳为治所。 +我想知道武汉常住人口有多少? 唐时江夏和汉阳分别升为鄂州和沔州的州治,成为长江沿岸的商业重镇。 +我想知道武汉常住人口有多少? 江城之称亦始于隋唐。 +我想知道武汉常住人口有多少? 两宋时武昌属鄂州,汉阳汉口属汉阳郡。 +我想知道武汉常住人口有多少? 经过发掘,武汉出土了大量唐朝墓葬,在武昌马房山和岳家咀出土了灰陶四神砖以及灰陶十二生肖俑等。 +我想知道武汉常住人口有多少? 宋代武汉的制瓷业发达。 +我想知道武汉常住人口有多少? 在市郊江夏区梁子湖旁发现了宋代瓷窑群100多座,烧制的瓷器品种很多,釉色以青白瓷为主。 +我想知道武汉常住人口有多少? 南宋诗人陆游在经过武昌时,写下“市邑雄富,列肆繁错,城外南市亦数里,虽钱塘、建康不能过,隐然一大都会也”来描写武昌的繁华。 +我想知道武汉常住人口有多少? 南宋抗金将领岳飞驻防鄂州(今武昌)8年,在此兴师北伐。 +我想知道武汉常住人口有多少? 元世祖至元十八年(1281年),武昌成为湖广行省的省治。 +我想知道武汉常住人口有多少? 这是武汉第一次成为一级行政单位(相当于现代的省一级)的治所。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇,托洛茨基是联共(布)党内和第三国际时期反对派的领导人,托派"第四国际"的创始人和领导人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基(俄国与国际历史上最重要的无产阶级革命家之一,二十世纪国际共产主义运动中最具争议的、也是备受污蔑的左翼反对派领袖,他以对古典马克思主义“不断革命论”的独创性发展闻名于世,第三共产国际和第四国际的主要缔造者之一(第三国际前三次代表大会的宣言执笔人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在1905年俄国革命中被工人群众推举为彼得堡苏维埃主席(而当时布尔什维克多数干部却还在讨论是否支持苏维埃,这些干部后来被赶回俄国的列宁痛击)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1917年革命托洛茨基率领“区联派”与列宁派联合,并再次被工人推举为彼得格勒苏维埃主席。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 对于十月革命这场20世纪最重大的社会革命,托洛茨基赢得了不朽的历史地位。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后来成了托洛茨基死敌的斯大林,当时作为革命组织领导者之一却写道:“起义的一切实际组织工作是在彼得格勒苏维埃主席托洛茨基同志直接指挥之下完成的。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 我们可以确切地说,卫戍部队之迅速站在苏维埃方面来,革命军事委员会的工作之所以搞得这样好,党认为这首先要归功于托洛茨基同志。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (值得一提的是,若干年后,当反托成为政治需要时,此类评价都从斯大林文章中删掉了。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )甚至连后来狂热的斯大林派雅克·沙杜尔,当时却也写道:“托洛茨基在十月起义中居支配地位,是起义的钢铁灵魂。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (苏汉诺夫《革命札记》第6卷P76。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )不仅在起义中,而且在无产阶级政权的捍卫、巩固方面和国际共产主义革命方面,托洛茨基也作出了极其卓越的贡献(外交官-苏联国际革命政策的负责人、苏联红军缔造者以及共产国际缔造者)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 革命后若干年里,托洛茨基与列宁的画像时常双双并列挂在一起;十月革命之后到列宁病逝之前,布尔什维克历次全国代表大会上,代表大会发言结束均高呼口号:“我们的领袖列宁和托洛茨基万岁!” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在欧美共运中托洛茨基的威望非常高。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后人常常认为托洛茨基只是一个知识分子文人,实际上他文武双全,而且谙熟军事指挥艺术,并且亲临战场。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 正是他作为十月革命的最高军事领袖(在十月革命期间他与士兵一起在战壕里作战),并且在1918年缔造并指挥苏联红军,是一个杰出的军事家(列宁曾对朋友说,除了托洛茨基,谁还能给我迅速地造成一支上百万人的强大军队? +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在内战期间,他甚至坐装甲列车冒着枪林弹雨亲临战场指挥作战,差点挨炸死;当反革命军队进攻彼得堡时,当时的彼得堡领导人季诺维也夫吓得半死,托洛茨基却从容不迫指挥作战。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 同时托洛茨基又是一个高明的外交家,他曾强硬地要求英国政府释放因反战宣传被囚禁在英国的俄国流亡革命者,否则就不许英国公民离开俄国,连英国政府方面都觉得此举无懈可击;他并且把居高临下的法国到访者当场轰出他的办公室(革命前法国一直是俄国的头号债主与政治操纵者),却彬彬有礼地欢迎前来缓和冲突的法国大使;而在十月革命前夕,他对工人代表议会质询的答复既保守了即将起义的军事秘密,又鼓舞了革命者的战斗意志,同时严格遵循现代民主与公开原则,这些政治答复被波兰人多伊彻誉为“外交辞令的杰作”(伊·多伊彻的托氏传记<先知三部曲·武装的先知>第九章P335,第十一章P390)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基在国民经济管理与研究工作中颇有创造:是苏俄新经济政策的首先提议者以及社会主义计划经济的首先实践者。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1928年斯大林迟迟开始的计划经济实验,是对1923年以托洛茨基为首的左翼反对派经济纲领的拙劣剽窃和粗暴翻版。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 因为统治者的政策迟到,使得新经济政策到1928年已产生了一个威胁政权生存的农村资产阶级,而苏俄工人阶级国家不得不强力解决——而且是不得不借助已蜕化为官僚集团的强力来解决冲突——结果导致了1929年到30年代初的大饥荒和对农民的大量冤枉错杀。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 另外,他还对文学理论有很高的造诣,其著作<文学与革命>甚至影响了整整一代的国际左翼知识分子(包括中国的鲁迅、王实味等人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 
他在哈佛大学图书馆留下了100多卷的<托洛茨基全集>,其生动而真诚的自传和大量私人日记、信件,给人留下了研究人类生活各个方面的宝贵财富,更是追求社会进步与解放的历史道路上的重要知识库之一。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基1879年10月26日生于乌克兰赫尔松县富裕农民家庭,祖籍是犹太人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 原姓布隆施泰因。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1896年开始参加工人运动。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1897年 ,参加建立南俄工人协会 ,反对沙皇专制制度。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1898年 在尼古拉也夫组织工人团体,被流放至西伯利亚。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1902年秋以署名托洛茨基之假护照逃到伦敦,参加V.I.列宁、G.V.普列汉诺夫等人主编的<火星报>的工作。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥,位于洞庭湖与长江交汇处,东接岳阳市区洞庭大道和107国道、京珠高速公路,西连省道306线,是国内目前最长的内河公路桥。 +谁知道洞庭湖大桥有多长? 路桥全长10173.82m,其中桥长5747.82m,桥宽20m,西双向四车道,是我国第一座三塔双索面斜拉大桥,亚洲首座不等高三塔双斜索面预应力混凝土漂浮体系斜拉桥。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是我国最长的内河公路桥,大桥横跨东洞庭湖区,全长10174.2米,主桥梁长5747.8米。 +谁知道洞庭湖大桥有多长? 大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区运输抗洪抢险物资提供了一条快速通道该桥设计先进,新颖,造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是湖区人民的造福桥,装点湘北门户的形象桥,对优化交通网络绪构,发展区域经济,保障防汛救灾,缩短鄂、豫、陕等省、市西部车辆南下的运距,拓展岳阳城区的主骨架,提升岳阳城市品位,增强城市辐射力,有着十分重要的意义。 +谁知道洞庭湖大桥有多长? 自1996年12月开工以来,共有10支施工队伍和两支监理队伍参与了大桥的建设。 +谁知道洞庭湖大桥有多长? 主桥桥面高52米(黄海),设计通航等级Ⅲ级。 +谁知道洞庭湖大桥有多长? 主桥桥型为不等高三塔、双索面空间索、全飘浮体系的预应力钢筋混凝土肋板梁式结构的斜拉桥,跨径为130+310+310+130米。 +谁知道洞庭湖大桥有多长? 索塔为双室宝石型断面,中塔高为125.684米,两边塔高为99.311米。 +谁知道洞庭湖大桥有多长? 三塔基础为3米和3.2米大直径钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 引桥为连续梁桥,跨径20至50米,基础直径为1.8和2.5米钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 该桥设计先进、新颖、造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系,岳阳洞庭湖大桥是我国首次采用不等高三塔斜拉桥桥型的特大桥,设计先进,施工难度大位居亚洲之首,是湖南省桥梁界的一大科研项目。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥设计为三塔斜拉桥,空间双斜面索,主梁采用前支点挂篮施工,并按各种工况模拟挂篮受力进行现场试验,获得了大量有关挂篮受力性能和实际刚度的计算参数,作为施工控制参数。 +谁知道洞庭湖大桥有多长? 利用组合式模型单元,推导了斜拉桥分离式双肋平板主梁的单元刚度矩阵,并进行了岳阳洞庭湖大桥的空间受力分析,结果表明此种单元精度满足工程要求,同时在施工工艺方面也积累了成功经验。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区抗洪抢险物资运输提供了一条快速通道。 +谁知道洞庭湖大桥有多长? 湖大桥设计先进,造型美丽,科技含量高。 +谁知道洞庭湖大桥有多长? 洞庭大桥还是一道美丽的风景线,大桥沿岸风景与岳阳楼,君山岛、洞庭湖等风景名胜融为一体,交相辉映,成为世人了解岳阳的又一崭新窗口,也具有特别旅游资源。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥多塔斜拉桥新技术研究荣获国家科学技术进步二等奖、湖南省科学技术进步一等奖,并获第五届詹天佑大奖。 +谁知道洞庭湖大桥有多长? 大桥在中国土木工程学会2004年第16届年会上入选首届《中国十佳桥梁》,名列斜拉桥第二位。 +谁知道洞庭湖大桥有多长? 2001年荣获湖南省建设厅优秀设计一等奖,省优秀勘察一等奖。 +谁知道洞庭湖大桥有多长? 2003年荣获国家优秀工程设计金奖, "十佳学术活动"奖。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? ?不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? 不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 在电视节目上,大卫永远微笑,自信而光鲜,就像每一个成功的电视人一样,说起收入,他也绝对不落人后。 +天气预报员的布景师是谁? 不过,大卫的个人生活可就不那么如意了。 +天气预报员的布景师是谁? 与妻子劳伦(霍普·戴维斯)的离婚一直让他痛苦;儿子迈克吸大麻上瘾,正在进行戒毒,可戒毒顾问却对迈克有着异样的感情;女儿雪莉则体重惊人,总是愁眉苦脸、孤独寂寞;大卫的父亲罗伯特(迈克尔·凯恩),一个世界著名的小说家,虽然罗伯特不想再让大卫觉得负担过重,可正是他的名声让大卫的一生都仿佛处在他的阴影之下,更何况,罗伯特就快重病死了。 +天气预报员的布景师是谁? 和妻子的离婚、父亲的疾病、和孩子之间完全不和谐的关系,都让大卫每天头疼,而每次当他越想控制局面,一切就越加复杂。 +天气预报员的布景师是谁? 然而就在最后人们再也不会向他扔快餐,或许是因为他总是背着弓箭在大街上走。 +天气预报员的布景师是谁? 最后,面对那份高额工作的接受意味着又一个新生活的开始。 +天气预报员的布景师是谁? 也许,生活就像天气,想怎么样就怎么样,完全不可预料。 +天气预报员的布景师是谁? 导 演:戈尔·维宾斯基 Gore Verbinski +天气预报员的布景师是谁? 编 剧:Steve Conrad .....(written by) +天气预报员的布景师是谁? 演 员:尼古拉斯·凯奇 Nicolas Cage .....David Spritz +天气预报员的布景师是谁? 尼古拉斯·霍尔特 Nicholas Hoult .....Mike +天气预报员的布景师是谁? 迈克尔·凯恩 Michael Caine .....Robert Spritzel +天气预报员的布景师是谁? 杰蒙妮·德拉佩纳 Gemmenne de la Peña .....Shelly +天气预报员的布景师是谁? 霍普·戴维斯 Hope Davis .....Noreen +天气预报员的布景师是谁? 迈克尔·瑞斯玻利 Michael Rispoli .....Russ +天气预报员的布景师是谁? 原创音乐:James S. Levine .....(co-composer) (as James Levine) +天气预报员的布景师是谁? 汉斯·兹米尔 Hans Zimmer +天气预报员的布景师是谁? 摄 影:Phedon Papamichael +天气预报员的布景师是谁? 剪 辑:Craig Wood +天气预报员的布景师是谁? 选角导演:Denise Chamian +天气预报员的布景师是谁? 艺术指导:Tom Duffield +天气预报员的布景师是谁? 美术设计:Patrick M. Sullivan Jr. .....(as Patrick Sullivan) +天气预报员的布景师是谁? 布景师 :Rosemary Brandenburg +天气预报员的布景师是谁? 服装设计:Penny Rose +天气预报员的布景师是谁? 视觉特效:Charles Gibson +天气预报员的布景师是谁? 
David Sosalla .....Pacific Title & Art Studio +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足球协会。 +韩国国家男子足球队教练是谁? 韩国队自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 北京时间2014年6月27日3时,巴西世界杯小组赛H组最后一轮赛事韩国对阵比利时,韩国队0-1不敌比利时,3场1平2负积1分垫底出局。 +韩国国家男子足球队教练是谁? 球队教练:洪明甫 +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(韩国国家男子足球队???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足联。 +韩国国家男子足球队教练是谁? 韩国队是众多亚洲球队中,在世界杯表现最好,他们自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 2014年世界杯外围赛,韩国在首轮分组赛以首名出线次轮分组赛,与伊朗、卡塔尔、乌兹别克以及黎巴嫩争逐两个直接出线决赛周资格,最后韩国仅以较佳的得失球差压倒乌兹别克,以小组次名取得2014年世界杯决赛周参赛资格,也是韩国连续八次晋身世界杯决赛周。 +韩国国家男子足球队教练是谁? 虽然韩国队在世界杯成绩为亚洲之冠,但在亚洲杯足球赛的成绩却远不及世界杯。 +韩国国家男子足球队教练是谁? 韩国只在首两届亚洲杯(1956年及1960年)夺冠,之后五十多年未能再度称霸亚洲杯,而自1992年更从未打入过决赛,与另一支东亚强队日本近二十年来四度在亚洲杯夺冠成强烈对比。[1] +韩国国家男子足球队教练是谁? 人物简介 +韩国国家男子足球队教练是谁? 车范根(1953年5月22日-)曾是大韩民国有名的锋线选手,他被欧洲媒体喻为亚洲最佳输出球员之一,他也被认为是世界最佳足球员之一。 +韩国国家男子足球队教练是谁? 他被国际足球史料与数据协会评选为20世纪亚洲最佳球员。 +韩国国家男子足球队教练是谁? 他在85-86赛季是德甲的最有价值球员,直到1999年为止他都是德甲外国球员入球纪录保持者。 +韩国国家男子足球队教练是谁? 德国的球迷一直没办法正确说出他名字的发音,所以球车范根(左)迷都以炸弹车(Cha Boom)称呼他。 +韩国国家男子足球队教练是谁? 这也代表了他强大的禁区得分能力。 +韩国国家男子足球队教练是谁? 职业生涯 +韩国国家男子足球队教练是谁? 车范根生于大韩民国京畿道的华城市,他在1971年于韩国空军俱乐部开始了他的足球员生涯;同年他入选了韩国19岁以下国家足球队(U-19)。 +韩国国家男子足球队教练是谁? 隔年他就加入了韩国国家足球队,他是有史以来加入国家队最年轻的球员。 +韩国国家男子足球队教练是谁? 车范根在27岁时前往德国发展,当时德甲被认为是世界上最好的足球联赛。 +韩国国家男子足球队教练是谁? 他在1978年12月加入了达姆施塔特,不过他在那里只待了不到一年就转到当时的德甲巨人法兰克福。 +韩国国家男子足球队教练是谁? 车范根很快在新俱乐部立足,他帮助球队赢得79-80赛季的欧洲足协杯。 +韩国国家男子足球队教练是谁? 在那个赛季过后,他成为德甲薪水第三高的球员,不过在1981年对上勒沃库森的一场比赛上,他的膝盖严重受伤,几乎毁了他的足球生涯。 +韩国国家男子足球队教练是谁? 在1983年车范根转投勒沃库森;他在这取得很高的成就,他成为85-86赛季德甲的最有价值球员,并且在1988年帮助球队拿下欧洲足协杯,也是他个人第二个欧洲足协杯。 +韩国国家男子足球队教练是谁? 他在决赛对垒西班牙人扮演追平比分的关键角色,而球会才在点球大战上胜出。 +韩国国家男子足球队教练是谁? 车范根在1989年退休,他在308场的德甲比赛中进了98球,一度是德甲外国球员的入球纪录。 +韩国国家男子足球队教练是谁? 执教生涯 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学,简称台湾科大、台科大或台科,是位于台湾台北市大安区的台湾第一所高等技职体系大专院校,现为台湾最知名的科技大学,校本部比邻国立台湾大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 该校已于2005年、2008年持续入选教育部的“发展国际一流大学及顶尖研究中心计划”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本校校地约44.5公顷,校本部位于台北市基隆路四段四十三号,。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 民国68年成立硕士班,民国71年成立博士班,现有大学部学生5,664人,研究生4,458人,专任教师451位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2001年在台湾地区教育部筹划之研究型大学(“国立”大学研究所基础教育重点改善计画)中,成为全台首批之9所大学之一 。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 自2005年更在“教育部”所推动“五年五百亿 顶尖大学”计划下,遴选为适合发展成“顶尖研究中心”的11所研究型大学之一。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学部设有二年制、四年制及工程在职人员进修班等三种学制;凡二专、三专及五专等专科学校以上之毕业生,皆可以报考本校大学部二年制,而高职、高中毕业生,可以报考本校大学部四年制。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工业管理、电子工程、机械工程、营建工程及应用外语系等,则设有在职人员进修班学制,其招生对象为在职人员,利用夜间及暑假期间上课。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 凡在本校大学部修毕应修学分且成绩及格者皆授予学士学位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 
国立台湾科技大学目前设有工程、电资、管理、设计、人文社会及精诚荣誉等六个学院,分别有机械、材料科学与工程、营建、化工、电子、电机、资工、工管、企管、资管、建筑、工商业设计、应用外语等13个系及校内招生之财务金融学士学位学程、科技管理学士学位学程;全校、工程、电资、管理、创意设计等五个不分系菁英班及光电研究所、管理研究所、财务金融研究所、科技管理研究所、管理学院MBA、数位学习教育研究所、医学工程研究所、自动化及控制研究所、工程技术研究所、专利研究所等独立研究所,此外尚有人文学科负责人文及社会类等课程之教学,通识学科负责法律、音乐、环保类等课程之教学,以及师资培育中心专以培养学生未来担任中等学校工、商、管理、设计等科之合格教师,合计23个独立系所、师资培育中心、人文学科及通识学科等教学单位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学至今各系所毕业校友已达约56,456位,毕业生出路包含出国继续深造、在台深造以及投身于产业界。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 由于实作经验丰富,理论基础完备,工作态度认真,毕业校友担任政府要职、大学教授、大学校长及企业主管者众多,深受各界的肯定。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工商业设计系副教授孙春望与硕一生全明远耗时两个月自制之三分钟动画短片“立体悲剧”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本片入选有“动画奥斯卡”之称的“ACM SIGGRAPH”国际动画展,并获得观众票选第一名,这也是台湾首次入选及获奖的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 击败了好莱坞知名导演史蒂芬·史匹柏的“世界大战”、乔治卢卡斯的“星际大战三部曲”、梦工厂出品的动画“马达加斯加”、军机缠斗片“机战未来”及美国太空总署、柏克莱加州大学等好莱坞名片及顶尖学术单位制作的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2009年荣获有工业设计界奥斯卡奖之称的“德国iF设计大奖”国立台湾科技大学设计学院获得大学排名的全球第二,仅次于韩国三星美术设计学院“SADI”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 总体排名 依据《泰晤士高等教育》(THES-QS)在2009年的世界大学排名调查,台科大排名全世界第351名,在台湾所有大学中排名第五,仅次于台大,清大,成大及阳明,并且是台湾唯一进入世界四百大名校的科技大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 依据在欧洲拥有广大声誉的“Eduniversal商学院排名网”2008年的资料,台湾有七所大学的商管学院被分别列入世界1000大商学院,其中台科大位在“卓越商学院”(EXCELLENT Business Schools,国内主要)之列,“推荐程度”(Recommendation Rate)为全台第四,仅次于台大、政大、中山,与交大并列。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 目前设有工程、电资、管理、设计、人文社会及精诚荣誉学院等六个学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 预计于竹北新校区设立产学合作学院及应用理学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾建筑科技中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●智慧型机械人研究中心科技成果展示(15张) +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾彩卷与博彩研究中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●电力电子技术研发中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●NCP-Taiwan办公室 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●资通安全研究与教学中心 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 谓鬼神赐福降灾神妙莫测之道。 +在日本,神道最初属于什么信仰? 《易·观》:“观天之神道,而四时不忒,圣人以神道设教,而天下服矣。” +在日本,神道最初属于什么信仰? 孔颖达 疏:“微妙无方,理不可知,目不可见,不知所以然而然,谓之神道。” +在日本,神道最初属于什么信仰? 《文选·王延寿<鲁灵光殿赋>》:“敷皇极以创业,协神道而大宁。” +在日本,神道最初属于什么信仰? 张载 注:“协和神明之道,而天下大宁。” +在日本,神道最初属于什么信仰? 南朝 梁 刘勰 《文心雕龙·正纬》:“夫神道阐幽,天命微显。” +在日本,神道最初属于什么信仰? 鲁迅 《中国小说史略》第五篇:“﹝ 干宝 ﹞尝感於其父婢死而再生,及其兄气绝复苏,自言见天神事,乃撰《搜神记》二十卷,以‘发明神道之不诬’。” +在日本,神道最初属于什么信仰? 神道设教 观卦里面蕴含着《易经》固有的诸如神道设教、用舍行藏、以德化民等思想,是孔子把这些思想发掘出来。 +在日本,神道最初属于什么信仰? 「据此是孔子见当时之人,惑于吉凶祸福,而卜筮之史,加以穿凿傅会,故演易系辞,明义理,切人事,借卜筮以教后人,所谓以神道设教,其所发明者,实即羲文之义理,而非别有义理,亦非羲文并无义理,至孔子始言义理也,当即朱子之言而小变之曰,易为卜筮作,实为义理作,伏羲文王之易,有占而无文,与今人用火珠林起课者相似,孔子加卦爻辞如签辞,纯以理言,实即羲文本意,则其说分明无误矣。」 +在日本,神道最初属于什么信仰? 孔子所发掘的《易经》思想与孔子在《论语》书中表现出来的思想完全一致。 +在日本,神道最初属于什么信仰? 《易传》的思想反映了孔子的思想,这个思想是《周易》的,也是孔子的。 +在日本,神道最初属于什么信仰? 在《周易》和孔子看来,神不是有意识的人格化的上帝。 +奥林匹克里昂获得了几连霸? 
里昂 Lyon 全名 Olympique lyonnais 绰号 Les Gones、OL 成立 1950年 城市 法国,里昂 主场 热尔兰球场(Stade Gerland) 容纳人数 41,044人 主席 奥拉斯 主教练 雷米·加尔德 联赛 法国足球甲级联赛 2013–14 法甲,第 5 位 网站 官方网站 主场球衣 客场球衣 第三球衣 日尔兰体育场 奥林匹克里昂(Olympique lyonnais,简称:OL及Lyon,中文简称里昂)是一间位于法国东南部罗纳-阿尔卑斯区的里昂市的足球会,成立于1950年8月3日,前身为里昂·奥林匹克(Lyon Olympique)体育俱乐部其中一个分支的足球队,1889年离开体育俱乐部自立门户成立新俱乐部,但官方网站表示俱乐部于1950年正式成立。 +奥林匹克里昂获得了几连霸? 现时在法国足球甲级联赛比赛,俱乐部同时设立男子及女子足球队。 +奥林匹克里昂获得了几连霸? 里昂是首届法国足球甲级联赛成员之一,可惜名列第十五位而降落乙组,1951年以乙级联赛冠军获得创会后首次锦标。 +奥林匹克里昂获得了几连霸? 球队在法国足球史上没有取得辉煌成绩,比较优异的算是六十年代曾杀入欧洲杯赛冠军杯四强,及3度晋身法国杯决赛并2次成功获冠。 +奥林匹克里昂获得了几连霸? 直至九十年代末里昂由辛天尼带领,先连续取得联赛头三名,到2002年终于首次登上法国顶级联赛冠军宝座,同年勒冈(Paul Le Guen)接替执教法国国家足球队的辛天尼,他其后继续带领里昂保持气势,加上队中球员小儒尼尼奧、迪亚拉、克里斯蒂亞諾·馬克斯·戈麥斯、迈克尔·埃辛、西德尼·戈武及门将格雷戈里·库佩表现突出,2003年至2005年横扫3届联赛冠军,创下连续四年夺得联赛锦标,平了1960年代末圣艾蒂安及1990年代初马赛的四连冠纪录。 +奥林匹克里昂获得了几连霸? 2005年前利物浦主教练热拉尔·霍利尔重返法国担任新任主教练,并加入葡萄牙中场蒂亚戈,和前巴伦西亚前锋约翰·卡鲁。 +奥林匹克里昂获得了几连霸? 他亦成功带领里昂赢得一届法甲冠军。 +奥林匹克里昂获得了几连霸? 2007年里昂成为首支上市的法国足球俱乐部,招股价21至24.4欧元,发行370万股,集资8400万欧元[1]。 +奥林匹克里昂获得了几连霸? 2007年4月21日,联赛次名图卢兹二比三不敌雷恩,令处于榜首的里昂领先次席多达17分距离,里昂因此提前六轮联赛庆祝俱乐部连续第六年夺得联赛冠军,亦是欧洲五大联赛(英格兰、德国、西班牙、意大利及法国)历史上首支联赛六连冠队伍[2]。 +奥林匹克里昂获得了几连霸? 在2007-08年赛季,里昂再一次成功卫冕联赛锦标,达成七连霸伟业。 +奥林匹克里昂获得了几连霸? 不过在2008-09赛季,里昂排名法甲第三位,联赛冠军被波尔多所获得。 +奥林匹克里昂获得了几连霸? 于2010年4月,里昂以两回合3比2的比分于欧洲冠军联赛击败波尔多跻身四强,此乃里昂首次晋级此项顶级杯赛的四强阶段。 +奥林匹克里昂获得了几连霸? 粗体字为新加盟球员 +奥林匹克里昂获得了几连霸? 以下球员名单更新于2014年8月27日,球员编号参照 官方网站,夏季转会窗为6月9日至8月31日 +火柴人刺杀行动怎么才能过关? 移动鼠标控制瞄准,点击鼠标左键进行射击。 +火柴人刺杀行动怎么才能过关? 游戏加载完成后点击STARTGAME-然后点击STARTMISSION即可开始游戏。 +火柴人刺杀行动怎么才能过关? 这里不仅仅考验的是你的枪法而且最重要的是你的智慧,喜欢火柴人类型游戏的玩家可以进来小试身手。 +火柴人刺杀行动怎么才能过关? 控制瞄准,刺杀游戏中的目标人物即可过关哦。 +你知道2月14日西方情人节是因何起源的吗? 情人节(英语:Valentine's Day),情人节的起源有多个版本,其中一个说法是在公元三世纪,古罗马暴君为了征召更多士兵,禁止婚礼,一名叫瓦伦丁Valentine的修士不理禁令,秘密替人主持婚礼,结果被收监,最后处死。 +你知道2月14日西方情人节是因何起源的吗? 而他死的那天就是2月14日,为纪念Valentine的勇敢精神,人们将每年的2月14日定为Valentine的纪念日。 +你知道2月14日西方情人节是因何起源的吗? 因此成了后来的“情人节”。 +你知道2月14日西方情人节是因何起源的吗? 另外,据记载,教宗在公元496年废除牧神节,把2月14日定为圣瓦伦丁日,即是St.Valentine's Day,后来成为是西方的节日之一。 +你知道2月14日西方情人节是因何起源的吗? 中文名称:情人节 +你知道2月14日西方情人节是因何起源的吗? 外文名称:Valentine‘s Day +你知道2月14日西方情人节是因何起源的吗? 别名:情人节圣瓦伦丁节 +你知道2月14日西方情人节是因何起源的吗? 公历日期:2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源时间:公元270年2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源事件:人们为了纪念为情人做主而牺牲的瓦伦丁神父,把他遇害的那一天(2月14日)称为情人节。 +你知道2月14日西方情人节是因何起源的吗? 地区:欧美地区 +你知道2月14日西方情人节是因何起源的吗? 宗教:基督教 +你知道2月14日西方情人节是因何起源的吗? 其他信息:西方的传统节日之一。 +你知道2月14日西方情人节是因何起源的吗? 男女在这一天互送礼物(如贺卡和玫瑰花等)用以表达爱意或友好。 +你知道2月14日西方情人节是因何起源的吗? 据台湾“今日台湾人讨厌情人节新闻网”报道,西洋情人节即将来到,求职网进行“办公室恋情及情人节调查”发现,在目前全台上班族的感情状态中,有情人相伴的比率约5成5,4成5的上班族单身;较出乎意料的结果是,情人节以近3成(28%)的占比,登上最讨厌的节日第一名,端午节以24.3%居第二;农历年则以18.2%居第三;第四名是圣诞节,占12.4%。 +你知道2月14日西方情人节是因何起源的吗? 调查指出,情人节对单身族来说,不仅成为压力,也显得更加孤单,在情人节当天,单身的上班族有将近4成(39.1%)的人在家看电视度过,近两成(18.7%)上网聊天,有1成4(14.8%)的人,不畏满街闪光,勇气十足出门看电影,近1成(9.7%)的上班族选择留在公司加班;另外有 5.4%的人,会在情人节当天积极参加联谊,希望能改变自己的感情状态。 +你知道2月14日西方情人节是因何起源的吗? 情侣们在情人节当天,庆祝方式以吃浪漫大餐最多(37.1%),不过有近3成(27%)的情侣,在情人节当天不会特别庆祝情人节,且这个比率远比第三名的旅游(占比11.5%)高出1倍以上。 +你知道2月14日西方情人节是因何起源的吗? 在情人节当天庆祝的开销上,可以说是小资男女当道,选择1000元(新台币,下同)以内的上班族最多占33.1%,情人节当天的花费上班族的平均花费是2473元,大手笔花费上万元以上庆祝情人节的,占比只有2.5%。 +你知道2月14日西方情人节是因何起源的吗? 情人节的起源众说纷纭,而为纪念罗马教士瓦伦丁是其中一个普遍的说法。 +你知道2月14日西方情人节是因何起源的吗? 据《世界图书百科全书》(World Book Encyclopedia)数据指出:“在公元200年时期,罗马皇帝克劳狄二世禁止年轻男子结婚。 +你知道2月14日西方情人节是因何起源的吗? 他认为未婚男子可以成为更优良的士兵。 +你知道2月14日西方情人节是因何起源的吗? 一位名叫瓦伦丁的教士违反了皇帝的命令,秘密为年轻男子主持婚礼,引起皇帝不满,结果被收监,据说瓦伦丁于公元269年2月14日被处决。 +你知道2月14日西方情人节是因何起源的吗? 另外,据《天主教百科全书》(The Catholic情人节 Encyclopedia)指出,公元496年,教宗圣基拉西乌斯一世在公元第五世纪末叶废除了牧神节,把2月14日定为圣瓦伦丁日。” +你知道2月14日西方情人节是因何起源的吗? 这个节日现今以“圣瓦伦丁节”——亦即情人节的姿态盛行起来。 +你知道2月14日西方情人节是因何起源的吗? 但是在第2次梵蒂冈大公会议后,1969年的典礼改革上,整理了一堆在史实上不确定是否真实存在的人物以后,圣瓦伦丁日就被废除了。 +你知道2月14日西方情人节是因何起源的吗? 现在天主教圣人历已经没有圣瓦伦丁日(St. 
Valentine's Day)。 +你知道2月14日西方情人节是因何起源的吗? 根据《布卢姆尔的警句与寓言辞典》记载:“圣瓦伦丁是个罗马教士,由于援助受逼害的基督徒而身陷险境,后来他归信基督教,最后被处死,卒于二月十四日”古代庆祝情人节的习俗与瓦伦丁拉上关系,可能是纯属巧合而已。 +你知道2月14日西方情人节是因何起源的吗? 事实上,这个节日很可能与古罗马的牧神节或雀鸟交配的季节有关。 +你知道2月14日西方情人节是因何起源的吗? 情人节的特色是情侣互相馈赠礼物。 +你知道2月14日西方情人节是因何起源的吗? 时至今日,人们则喜欢以情人卡向爱人表达情意。 +防卫大学每年招收多少学生? 防卫大学的前身是保安大学。 +防卫大学每年招收多少学生? 防卫大学是日本自卫队培养陆、海、空三军初级军官的学校,被称为日军"军官的摇篮"。 +防卫大学每年招收多少学生? 防卫大学是日军的重点院校。 +防卫大学每年招收多少学生? 日本历届内阁首相都要到防卫大学视察"训示",并亲自向学生颁发毕业证书。 +防卫大学每年招收多少学生? 日军四分之一的军官、三分之一的将官从这里走出。 +防卫大学每年招收多少学生? 防卫大学毕业生已成为日军军官的中坚力量。 +防卫大学每年招收多少学生? 防卫大学每年从地方招收18岁至21岁的应届高中毕业生和同等学历的青年。 +防卫大学每年招收多少学生? 每年招生名额为530名。 +防卫大学每年招收多少学生? 1950年 8月,日本组建警察预备队,1952年改为保安队。 +防卫大学每年招收多少学生? 为了充实保安队干部队伍,提高干部军政素质,1953年4月成立了保安大学,校址设在三浦半岛的久里滨。 +防卫大学每年招收多少学生? 1954年7月1日保安厅改为防卫厅。 +防卫大学每年招收多少学生? 在保安队基础上,日本建立了陆、海、空三军自卫队,保安大学遂改名为防卫大学,1955年迁至三浦半岛东南方的小原台。 +防卫大学每年招收多少学生? 学校直属防卫厅领导。 +防卫大学每年招收多少学生? 防卫大学的教育方针是:要求学生德智体全面发展,倡导学生崇尚知识和正义,培养学生具有指挥各种部队的能力。 +防卫大学每年招收多少学生? 防卫大学每年招生名额为530名,其中陆军300名,海军100名,空军130名。 +防卫大学每年招收多少学生? 根据自卫队向妇女敞开军官大门的决定,防卫大学1992年首次招收女学员35名。 +防卫大学每年招收多少学生? 考试分两次进行。 +防卫大学每年招收多少学生? 第一次,每年11月份进行学科考试;第二次,12月份进行口试和体检。 +防卫大学每年招收多少学生? 学校按陆、海、空三军分别设大学本科班和理工科研究生班。 +防卫大学每年招收多少学生? 本科班学制4年,又分为理工和人文社会学两大科。 +防卫大学每年招收多少学生? 学员入学后先分科,530人中有460人专攻理科,70人专攻文科。 +防卫大学每年招收多少学生? 第1学年按专科学习一般大学课程和一般军事知识。 +防卫大学每年招收多少学生? 第2学年以后在军事上开始区分军种,学员分别学习陆、海、空军的专门课程。 +防卫大学每年招收多少学生? 文化课和军事课的比例是6:l。 +防卫大学每年招收多少学生? 文化课程有人文、社会、自然、外语、电气工程、机械工程、土木建筑工程、应用化学、应用物理、航空、航海等。 +防卫大学每年招收多少学生? 军事训练课每学年6周,按一年四季有比例地安排教学内容,对学生进行军事技术和体能训练。 +防卫大学每年招收多少学生? 理工科研究生班,每年招生1期,学制2年,每期招收90人,设电子工程、航空工程、兵器制造等7个专业,课程按一般大学硕士课程标准设置。 +防卫大学每年招收多少学生? 防卫大学的课程和训练都十分紧张。 +防卫大学每年招收多少学生? 近年来,为了增强防卫大学的吸引力,克服考生逐年减少的倾向广泛征集优秀人才,学校进行了一些改革,改变入学考试办法,各高中校长以内部呈报的形式向防卫大学推荐品学兼优的学生;减少学生入学考试科目,放宽对报考防卫大学的学生的视力要求;降低学分数(大约降低30学分);改善学生宿舍条件。 +防卫大学每年招收多少学生? 防卫大学的学生生活紧张而愉快。 +《威鲁贝鲁的物语》官网是什么? 10年前大战后,威鲁贝鲁国一致辛勤的保护着得来不易的和平,但是与邻国圣卡特拉斯国的关系却不断的紧张,战争即将爆发。 +《威鲁贝鲁的物语》官网是什么? 为了避免战争,威鲁贝鲁国王海特鲁王决定将自己最大的女儿公主莉塔嫁给圣卡特拉斯国的王子格鲁尼亚。 +《威鲁贝鲁的物语》官网是什么? 但是莉塔却刺伤了政治婚姻的对象格鲁尼亚王子逃了出去,这事激怒了圣卡特拉斯国的国王兰帕诺夫王,并下令14天之内抓到王女并执行公开处刑来谢罪,不然两国就要开战。 +《威鲁贝鲁的物语》官网是什么? 《威鲁贝鲁的物语~Sisters of Wellber~》 +《威鲁贝鲁的物语》官网是什么? (Sisters of Wellber) +《威鲁贝鲁的物语》官网是什么? 日文名 ウエルベールの物语 +《威鲁贝鲁的物语》官网是什么? 官方网站 http://www.avexmovie.jp/lineup/wellber/ +《威鲁贝鲁的物语》官网是什么? 为了回避发生战争这个最坏的结果,莉塔下定决心去中立国古利达姆。 diff --git a/examples/text_graph/erniesage/link_prediction.py b/examples/text_graph/erniesage/link_prediction.py new file mode 100644 index 0000000000000000000000000000000000000000..2ad8b2faecfad6b294703e8f9271905d887de245 --- /dev/null +++ b/examples/text_graph/erniesage/link_prediction.py @@ -0,0 +1,177 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
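+# Entry script for the ErnieSage link prediction example. It reads a YAML config passed
+# via --conf, then either trains the model (do_train) or, when --do_predict is given,
+# dumps node embeddings from a saved checkpoint (do_predict). The graph workspace
+# (config.graph_work_path) is expected to be prepared beforehand, e.g. by
+# preprocessing/dump_graph.py.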
+ +import argparse +import io +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import pgl +import yaml +from data import GraphDataLoader, PredictData, TrainData, batch_fn +from easydict import EasyDict as edict +from models import ErnieSageForLinkPrediction + +from paddlenlp.transformers import ErnieTinyTokenizer, ErnieTokenizer +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie-tiny": (ErnieSageForLinkPrediction, ErnieTinyTokenizer), + "ernie-1.0": (ErnieSageForLinkPrediction, ErnieTokenizer), +} + + +def set_seed(config): + random.seed(config.seed) + np.random.seed(config.seed) + paddle.seed(config.seed) + + +def load_data(graph_data_path): + base_graph = pgl.Graph.load(graph_data_path) + term_ids = np.load(os.path.join(graph_data_path, "term_ids.npy"), mmap_mode="r") + return base_graph, term_ids + + +def do_train(config): + paddle.set_device(config.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(config) + + base_graph, term_ids = load_data(config.graph_work_path) + collate_fn = partial(batch_fn, samples=config.samples, base_graph=base_graph, term_ids=term_ids) + + # mode = "train" + train_ds = TrainData(config.graph_work_path) + + model_class, tokenizer_class = MODEL_CLASSES[config.model_name_or_path] + tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) + config.cls_token_id = tokenizer.cls_token_id + + model = model_class.from_pretrained(config.model_name_or_path, config_file=config) + model = paddle.DataParallel(model) + + train_loader = GraphDataLoader( + train_ds, batch_size=config.batch_size, shuffle=True, num_workers=config.sample_workers, collate_fn=collate_fn + ) + + optimizer = paddle.optimizer.Adam(learning_rate=config.lr, parameters=model.parameters()) + + rank = paddle.distributed.get_rank() + global_step = 0 + tic_train = time.time() + for epoch in range(config.epoch): + for step, (graphs, datas) in enumerate(train_loader): + global_step += 1 + loss, outputs = model(graphs, datas) + if global_step % config.log_per_step == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, config.log_per_step / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % config.save_per_step == 0: + if rank == 0: + output_dir = os.path.join(config.output_path, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model._layers.save_pretrained(output_dir) + if rank == 0: + output_dir = os.path.join(config.output_path, "last") + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model._layers.save_pretrained(output_dir) + + +def tostr(data_array): + return " ".join(["%.5lf" % d for d in data_array]) + + +@paddle.no_grad() +def do_predict(config): + paddle.set_device(config.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(config) + + # mode = "predict" + num_nodes = int(np.load(os.path.join(config.graph_work_path, "num_nodes.npy"))) + + base_graph, term_ids = load_data(config.graph_work_path) + collate_fn = partial(batch_fn, samples=config.samples, base_graph=base_graph, term_ids=term_ids) + + model_class, tokenizer_class = MODEL_CLASSES[config.model_name_or_path] + tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) + config.cls_token_id = tokenizer.cls_token_id + + 
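# config.infer_model is expected to be a checkpoint directory saved by do_train (for example the "last" directory under config.output_path); it is loaded here in place of the base pretrained weights. +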
model = model_class.from_pretrained(config.infer_model, config_file=config) + + model = paddle.DataParallel(model) + predict_ds = PredictData(num_nodes) + + predict_loader = GraphDataLoader( + predict_ds, + batch_size=config.infer_batch_size, + shuffle=True, + num_workers=config.sample_workers, + collate_fn=collate_fn, + ) + + trainer_id = paddle.distributed.get_rank() + id2str = io.open(os.path.join(config.graph_work_path, "terms.txt"), encoding=config.encoding).readlines() + if not os.path.exists(config.output_path): + os.mkdir(config.output_path) + fout = io.open("%s/part-%s" % (config.output_path, trainer_id), "w", encoding="utf8") + + global_step = 0 + epoch = 0 + tic_train = time.time() + model.eval() + for step, (graphs, datas) in enumerate(predict_loader): + global_step += 1 + loss, outputs = model(graphs, datas) + for user_feat, user_real_index in zip(outputs[0].numpy(), outputs[3].numpy()): + sri = id2str[int(user_real_index)].strip("\n") + line = "{}\t{}\n".format(sri, tostr(user_feat)) + fout.write(line) + if global_step % config.log_per_step == 0: + logger.info( + "predict step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, config.log_per_step / (time.time() - tic_train)) + ) + tic_train = time.time() + fout.close() + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="main") + parser.add_argument("--conf", type=str, default="./config.yaml") + parser.add_argument("--do_predict", action="store_true", default=False) + args = parser.parse_args() + config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader)) + + assert config.device in ["gpu", "cpu"], "Device should be gpu/cpu, but got %s." % config.device + logger.info(config) + if args.do_predict: + do_predict(config) + else: + do_train(config) diff --git a/examples/text_graph/erniesage/models/__init__.py b/examples/text_graph/erniesage/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..4b02ff01793be5a8840cc144dabc13beaff989b8 --- /dev/null +++ b/examples/text_graph/erniesage/models/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from models import model + +__all__ = [] +__all__ += model.__all__ diff --git a/examples/text_graph/erniesage/models/conv.py b/examples/text_graph/erniesage/models/conv.py new file mode 100644 index 0000000000000000000000000000000000000000..8ec0c61d7b0a4bbcdf2e76761d928fe7f48e09fe --- /dev/null +++ b/examples/text_graph/erniesage/models/conv.py @@ -0,0 +1,174 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class GraphSageConv(nn.Layer): + """GraphSAGE is a general inductive framework that leverages node feature + information (e.g., text attributes) to efficiently generate node embeddings + for previously unseen data. + + Paper reference: + Hamilton, Will, Zhitao Ying, and Jure Leskovec. + "Inductive representation learning on large graphs." + Advances in neural information processing systems. 2017. + """ + + def __init__(self, input_size, hidden_size, learning_rate, aggr_func="sum"): + super(GraphSageConv, self).__init__() + assert aggr_func in [ + "sum", + "mean", + "max", + "min", + ], "Only support 'sum', 'mean', 'max', 'min' built-in receive function." + self.aggr_func = "reduce_%s" % aggr_func + + self.self_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + self.neigh_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + + def forward(self, graph, feature, act=None): + def _send_func(src_feat, dst_feat, edge_feat): + return {"msg": src_feat["h"]} + + def _recv_func(message): + return getattr(message, self.aggr_func)(message["msg"]) + + msg = graph.send(_send_func, src_feat={"h": feature}) + neigh_feature = graph.recv(reduce_func=_recv_func, msg=msg) + + self_feature = self.self_linear(feature) + neigh_feature = self.neigh_linear(neigh_feature) + output = self_feature + neigh_feature + if act is not None: + output = getattr(F, act)(output) + + output = F.normalize(output, axis=1) + return output + + +class ErnieSageV2Conv(nn.Layer): + """ErnieSage (abbreviation of ERNIE SAmple aggreGatE), a model proposed by the PGL team. + ErnieSageV2: Ernie is applied to the EDGE of the text graph. + """ + + def __init__(self, ernie, input_size, hidden_size, learning_rate, cls_token_id=1, aggr_func="sum"): + """ErnieSageV2: Ernie is applied to the EDGE of the text graph. + + Args: + ernie (nn.Layer): the ernie model. + input_size (int): input size of feature tensor. + hidden_size (int): hidden size of the Conv layers. + learning_rate (float): learning rate. + aggr_func (str): aggregate function. 'sum', 'mean', 'max' avaliable. + """ + super(ErnieSageV2Conv, self).__init__() + assert aggr_func in [ + "sum", + "mean", + "max", + "min", + ], "Only support 'sum', 'mean', 'max', 'min' built-in receive function." + self.aggr_func = "reduce_%s" % aggr_func + self.cls_token_id = cls_token_id + self.self_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + self.neigh_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + + self.ernie = ernie + + def ernie_send(self, src_feat, dst_feat, edge_feat): + """Apply ernie model on the edge. + + Args: + src_feat (Tensor Dict): src feature tensor dict. + dst_feat (Tensor Dict): dst feature tensor dict. + edge_feat (Tensor Dict): edge feature tensor dict. + + Returns: + Tensor Dict: tensor dict which use 'msg' as the key. 
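+ Note: the message for each edge is the ERNIE pooled output over the concatenated pair ([CLS] + src term_ids, dst term_ids), with segment ids 0 for the src part and 1 for the dst part, and position ids derived from the non-padding mask.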
+ """ + # input_ids + cls = paddle.full(shape=[src_feat["term_ids"].shape[0], 1], dtype="int64", fill_value=self.cls_token_id) + src_ids = paddle.concat([cls, src_feat["term_ids"]], 1) + + dst_ids = dst_feat["term_ids"] + + # sent_ids + sent_ids = paddle.concat([paddle.zeros_like(src_ids), paddle.ones_like(dst_ids)], 1) + term_ids = paddle.concat([src_ids, dst_ids], 1) + + # build position_ids + input_mask = paddle.cast(term_ids > 0, "int64") + position_ids = paddle.cumsum(input_mask, axis=1) - 1 + + outputs = self.ernie(term_ids, sent_ids, position_ids) + feature = outputs[1] + return {"msg": feature} + + def send_recv(self, graph, term_ids): + """Message Passing of erniesage v2. + + Args: + graph (Graph): the Graph object. + feature (Tensor): the node feature tensor. + + Returns: + Tensor: the self and neighbor feature tensors. + """ + + def _recv_func(message): + return getattr(message, self.aggr_func)(message["msg"]) + + msg = graph.send(self.ernie_send, node_feat={"term_ids": term_ids}) + neigh_feature = graph.recv(reduce_func=_recv_func, msg=msg) + + cls = paddle.full(shape=[term_ids.shape[0], 1], dtype="int64", fill_value=self.cls_token_id) + term_ids = paddle.concat([cls, term_ids], 1) + term_ids.stop_gradient = True + outputs = self.ernie(term_ids, paddle.zeros_like(term_ids)) + self_feature = outputs[1] + + return self_feature, neigh_feature + + def forward(self, graph, term_ids, act="relu"): + """Forward funciton of Conv layer. + + Args: + graph (Graph): Graph object. + feature (Tensor): node feture. + act (str, optional): activation function. Defaults to 'relu'. + + Returns: + Tensor: feature after conv. + """ + + self_feature, neigh_feature = self.send_recv(graph, term_ids) + self_feature = self.self_linear(self_feature) + neigh_feature = self.neigh_linear(neigh_feature) + output = self_feature + neigh_feature + if act is not None: + output = getattr(F, act)(output) + output = F.normalize(output, axis=1) + return output diff --git a/examples/text_graph/erniesage/models/encoder.py b/examples/text_graph/erniesage/models/encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..9363beb43a4585d5406b82f90105a45deafd30bc --- /dev/null +++ b/examples/text_graph/erniesage/models/encoder.py @@ -0,0 +1,133 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from models.conv import ErnieSageV2Conv, GraphSageConv + + +class Encoder(nn.Layer): + """Base class + Chose different type ErnieSage class. + """ + + def __init__(self, config): + """init function + + Args: + config (Dict): all configs. + """ + super(Encoder, self).__init__() + self.config = config + # Don't add ernie to self, oterwise, there will be more copies of ernie weights + # self.ernie = ernie + + @classmethod + def factory(cls, config, ernie): + """Classmethod for ernie sage model. + + Args: + config (Dict): all configs. + ernie (nn.Layer): the ernie model. 
+ + Raises: + ValueError: Invalid ernie sage model type. + + Returns: + Class: real model class. + """ + model_type = config.model_type + if model_type == "ErnieSageV2": + return ErnieSageV2Encoder(config, ernie) + else: + raise ValueError("Invalid ernie sage model type") + + def forward(self, *args, **kwargs): + raise NotImplementedError + + +class ErnieSageV2Encoder(Encoder): + def __init__(self, config, ernie): + """Ernie sage v2 encoder + + Args: + config (Dict): all config. + ernie (nn.Layer): the ernie model. + """ + super(ErnieSageV2Encoder, self).__init__(config) + # Don't add ernie to self, oterwise, there will be more copies of ernie weights + # self.ernie = ernie + self.convs = nn.LayerList() + fc_lr = self.config.lr / 0.001 + erniesage_conv = ErnieSageV2Conv( + ernie, + ernie.config["hidden_size"], + self.config.hidden_size, + learning_rate=fc_lr, + cls_token_id=self.config.cls_token_id, + aggr_func="sum", + ) + self.convs.append(erniesage_conv) + for i in range(1, self.config.num_layers): + layer = GraphSageConv( + self.config.hidden_size, self.config.hidden_size, learning_rate=fc_lr, aggr_func="sum" + ) + self.convs.append(layer) + + if self.config.final_fc: + self.linear = nn.Linear( + self.config.hidden_size, self.config.hidden_size, weight_attr=paddle.ParamAttr(learning_rate=fc_lr) + ) + + def take_final_feature(self, feature, index): + """Gather the final feature. + + Args: + feature (Tensor): the total featue tensor. + index (Tensor): the index to gather. + + Returns: + Tensor: final result tensor. + """ + feat = paddle.gather(feature, index) + if self.config.final_fc: + feat = self.linear(feat) + if self.config.final_l2_norm: + feat = F.normalize(feat, axis=1) + return feat + + def forward(self, graphs, term_ids, inputs): + """forward train function of the model. + + Args: + graphs (Graph List): list of graph tensors. + inputs (Tensor List): list of input tensors. + + Returns: + Tensor List: list of final feature tensors. + """ + # term_ids for ErnieSageConv is the raw feature. + feature = term_ids + for i in range(len(graphs), self.config.num_layers): + graphs.append(graphs[0]) + for i in range(0, self.config.num_layers): + if i == self.config.num_layers - 1 and i != 0: + act = None + else: + act = "leaky_relu" + feature = self.convs[i](graphs[i], feature, act) + + final_feats = [self.take_final_feature(feature, x) for x in inputs] + return final_feats diff --git a/examples/text_graph/erniesage/models/loss.py b/examples/text_graph/erniesage/models/loss.py new file mode 100644 index 0000000000000000000000000000000000000000..3648c27821c185a973e983ee7fe4298356cba1ff --- /dev/null +++ b/examples/text_graph/erniesage/models/loss.py @@ -0,0 +1,69 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +def LossFactory(config): + """Choose different type of loss by config + + Args: + config (Dict): config file. 
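+ Supported config.loss_type values are "hinge" (which also reads config.margin) and "softmax_with_cross_entropy".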
+ + Raises: + ValueError: invalid loss type. + + Returns: + Class: the real class object. + """ + loss_type = config.loss_type + if loss_type == "hinge": + return HingeLoss(config.margin) + elif loss_type == "softmax_with_cross_entropy": + return SoftmaxWithCrossEntropy() + else: + raise ValueError("invalid loss type") + + +class SoftmaxWithCrossEntropy(nn.Layer): + """softmax with cross entropy loss""" + + def __init__(self, config): + super(SoftmaxWithCrossEntropy, self).__init__() + + def forward(self, logits, label): + return F.cross_entropy(logits, label, reduction="mean") + + +class HingeLoss(nn.Layer): + """Hinge Loss for the pos and neg.""" + + def __init__(self, margin): + super(HingeLoss, self).__init__() + self.margin = margin + + def forward(self, pos, neg): + """forward function + + Args: + pos (Tensor): pos score. + neg (Tensor): neg score. + + Returns: + Tensor: final hinge loss. + """ + loss = paddle.mean(F.relu(neg - pos + self.margin)) + return loss diff --git a/examples/text_graph/erniesage/models/model.py b/examples/text_graph/erniesage/models/model.py new file mode 100644 index 0000000000000000000000000000000000000000..4884baacc8609350d2871f06de1d5d724b5d3037 --- /dev/null +++ b/examples/text_graph/erniesage/models/model.py @@ -0,0 +1,68 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from models.encoder import Encoder +from models.loss import LossFactory + +from paddlenlp.transformers import ErnieModel, ErniePretrainedModel + +__all__ = ["ErnieSageForLinkPrediction"] + + +class ErnieSageForLinkPrediction(ErniePretrainedModel): + """ErnieSage for link prediction task.""" + + def __init__(self, config, config_file): + """Model which Based on the PaddleNLP PretrainedModel + + Note: + 1. the ernie must be the first argument. + 2. must set self.XX = ernie to load weights. + 3. the self.config keyword is taken by PretrainedModel class. + + Args: + ernie (nn.Layer): the submodule layer of ernie model. + config (Dict): the config file + """ + super(ErnieSageForLinkPrediction, self).__init__(config) + self.config_file = config_file + self.ernie = ErnieModel(config) + self.encoder = Encoder.factory(self.config_file, self.ernie) + self.loss_func = LossFactory(self.config_file) + + def forward(self, graphs, data): + """Forward function of link prediction task. + + Args: + graphs (Graph List): the Graph list. + data (Tensor List): other input of the model. + + Returns: + Tensor: loss and output tensors. 
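+ The outputs list returned together with the loss is [user_feat, pos_item_feat, neg_item_feat, user_real_index, pos_item_real_index].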
+ """ + term_ids, user_index, pos_item_index, neg_item_index, user_real_index, pos_item_real_index = data + # encoder model + outputs = self.encoder(graphs, term_ids, [user_index, pos_item_index, neg_item_index]) + user_feat, pos_item_feat, neg_item_feat = outputs + + # calc loss + if self.config_file.neg_type == "batch_neg": + neg_item_feat = pos_item_feat + + pos = paddle.sum(user_feat * pos_item_feat, -1, keepdim=True) # [B, 1] + neg = paddle.matmul(user_feat, neg_item_feat, transpose_y=True) # [B, B] + loss = self.loss_func(pos, neg) + # return loss, outputs + return loss, outputs + [user_real_index, pos_item_real_index] diff --git a/examples/text_graph/erniesage/preprocessing/dump_graph.py b/examples/text_graph/erniesage/preprocessing/dump_graph.py new file mode 100644 index 0000000000000000000000000000000000000000..d2de5674a63ffde9dec10f21dab352a5367bb36b --- /dev/null +++ b/examples/text_graph/erniesage/preprocessing/dump_graph.py @@ -0,0 +1,154 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import io +import os +from functools import partial +from io import open + +import numpy as np +import pgl +import yaml +from easydict import EasyDict as edict +from pgl.graph_kernel import alias_sample_build_table +from pgl.utils.logger import log + +from paddlenlp.transformers import ErnieTinyTokenizer, ErnieTokenizer + +TOKENIZER_CLASSES = { + "ernie-tiny": ErnieTinyTokenizer, + "ernie-1.0": ErnieTokenizer, +} + + +def term2id(string, tokenizer, max_seqlen): + # string = string.split("\t")[1] + tokens = tokenizer._tokenize(string) + ids = tokenizer.convert_tokens_to_ids(tokens) + ids = ids[: max_seqlen - 1] + ids = ids + [tokenizer.sep_token_id] + ids = ids + [tokenizer.pad_token_id] * (max_seqlen - len(ids)) + return ids + + +def load_graph(config, str2id, term_file, terms, item_distribution): + edges = [] + with io.open(config.graph_data, encoding=config.encoding) as f: + for idx, line in enumerate(f): + if idx % 100000 == 0: + log.info("%s readed %s lines" % (config.graph_data, idx)) + slots = [] + for col_idx, col in enumerate(line.strip("\n").split("\t")): + s = col[: config.max_seqlen] + if s not in str2id: + str2id[s] = len(str2id) + term_file.write(str(col_idx) + "\t" + col + "\n") + item_distribution.append(0) + slots.append(str2id[s]) + + src = slots[0] + dst = slots[1] + edges.append((src, dst)) + edges.append((dst, src)) + item_distribution[dst] += 1 + edges = np.array(edges, dtype="int64") + return edges + + +def load_link_prediction_train_data(config, str2id, term_file, terms, item_distribution): + train_data = [] + neg_samples = [] + with io.open(config.train_data, encoding=config.encoding) as f: + for idx, line in enumerate(f): + if idx % 100000 == 0: + log.info("%s readed %s lines" % (config.train_data, idx)) + slots = [] + for col_idx, col in enumerate(line.strip("\n").split("\t")): + s = col[: config.max_seqlen] + if s not in str2id: + str2id[s] = len(str2id) + term_file.write(str(col_idx) 
+ "\t" + col + "\n") + item_distribution.append(0) + slots.append(str2id[s]) + + src = slots[0] + dst = slots[1] + neg_samples.append(slots[2:]) + train_data.append((src, dst)) + train_data = np.array(train_data, dtype="int64") + np.save(os.path.join(config.graph_work_path, "train_data.npy"), train_data) + if len(neg_samples) != 0: + np.save(os.path.join(config.graph_work_path, "neg_samples.npy"), np.array(neg_samples)) + + +def dump_graph(config): + if not os.path.exists(config.graph_work_path): + os.makedirs(config.graph_work_path) + str2id = dict() + term_file = io.open(os.path.join(config.graph_work_path, "terms.txt"), "w", encoding=config.encoding) + terms = [] + item_distribution = [] + + edges = load_graph(config, str2id, term_file, terms, item_distribution) + if config.task == "link_prediction": + load_link_prediction_train_data(config, str2id, term_file, terms, item_distribution) + else: + raise ValueError + + term_file.close() + num_nodes = len(str2id) + str2id.clear() + + log.info("Building graph...") + graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges) + # indegree = graph.indegree() + graph.indegree() + graph.outdegree() + graph.dump(config.graph_work_path) + + # dump alias sample table + item_distribution = np.array(item_distribution) + item_distribution = np.sqrt(item_distribution) + distribution = 1.0 * item_distribution / item_distribution.sum() + alias, events = alias_sample_build_table(distribution) + np.save(os.path.join(config.graph_work_path, "alias.npy"), alias) + np.save(os.path.join(config.graph_work_path, "events.npy"), events) + log.info("End Build Graph") + + +def dump_node_feat(config): + log.info("Dump node feat starting...") + id2str = [ + line.strip("\n").split("\t")[-1] + for line in io.open(os.path.join(config.graph_work_path, "terms.txt"), encoding=config.encoding) + ] + # pool = multiprocessing.Pool() + + tokenizer_class = TOKENIZER_CLASSES[config.model_name_or_path] + tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) + fn = partial(term2id, tokenizer=tokenizer, max_seqlen=config.max_seqlen) + term_ids = [fn(x) for x in id2str] + + np.save(os.path.join(config.graph_work_path, "term_ids.npy"), np.array(term_ids, np.uint16)) + log.info("Dump node feat done.") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="main") + parser.add_argument("--conf", type=str, default="./config.yaml") + args = parser.parse_args() + config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader)) + dump_graph(config) + dump_node_feat(config) diff --git a/examples/text_matching/README.md b/examples/text_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5f82c98c009f2aa271e01ce75f61dcd051a5cc57 --- /dev/null +++ b/examples/text_matching/README.md @@ -0,0 +1,31 @@ +# 文本匹配 + +**文本匹配一直是自然语言处理(NLP)领域一个基础且重要的方向,一般研究两段文本之间的关系。文本相似度计算、自然语言推理、问答系统、信息检索等,都可以看作针对不同数据和场景的文本匹配应用。这些自然语言处理任务在很大程度上都可以抽象成文本匹配问题,比如信息检索可以归结为搜索词和文档资源的匹配,问答系统可以归结为问题和候选答案的匹配,复述问题可以归结为两个同义句的匹配。** + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/1d24ea95d560465995515f8a3040202b092b07c6d03e4501b64a16dce01a1bbe" hspace='10'/> <br /> +</p> + + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/ff58769b237444b89bde5fec9d7215e02825b7d1f2864269986f1daa01b9f497" hspace='10'/> <br /> +</p> + + +文本匹配任务数据每一个样本通常由两个文本组成(query,title)。类别形式为 0 或 1,0 表示 query 与 title 不匹配; 1 表示匹配。 + +本项目包含面向搜索、推荐系统排序模块、召回模块的常规解决方案,具体如下: +- 基于单塔 Point-wise 范式的语义匹配模型 
[ernie_matching](./ernie_matching/train_pointwise.py): 模型精度高、计算复杂度高, 适合直接进行语义匹配 2 分类的应用场景。 +- 基于单塔 Pair-wise 范式的语义匹配模型 [ernie_matching](./ernie_matching/train_pairwise.py): 模型精度高、计算复杂度高, 对文本相似度大小的`序关系`建模能力更强,适合将相似度特征作为上层排序模块输入特征的应用场景。 +- 基于双塔 Point-wise 范式的语义匹配模型 [SimNet](./simnet) 和 [Sentence Transformers](./sentence_transformers), 这 2 种方案计算效率更高,适合对延时要求高、根据语义相似度进行粗排的应用场景。 + +## ernie_matching +[ernie_matching](./ernie_matching) 展示了基于预训练模型 ERNIE-Gram 训练单塔 Point-wise & Pair-wise 语义匹配模型。 + +## SimNet + +[SimNet](./simnet) 展示了如何使用CNN、LSTM、GRU等网络完成文本匹配任务。 + +## Sentence Transformers + +[Sentence Transformers](./sentence_transformers) 展示了如何使用以 ERNIE 为代表的模型Fine-tune完成文本匹配任务。 diff --git a/examples/text_matching/diffcse/README.md b/examples/text_matching/diffcse/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d08bd3bbe6d01cacbe6f85b0ff7c1b9a89d2182c --- /dev/null +++ b/examples/text_matching/diffcse/README.md @@ -0,0 +1,169 @@ +# 无监督语义匹配模型 [DiffCSE](https://arxiv.org/pdf/2204.10298.pdf) + +借鉴 [DiffCSE](https://arxiv.org/pdf/2204.10298.pdf) 的思路,实现了 DiffCSE 模型。相比于 SimCSE 模型,DiffCSE模型会更关注语句之间的差异性,具有精确的向量表示能力。DiffCSE 模型同样适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + +## 快速开始 +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +DiffCSE/ +├── model.py # DiffCSE 模型组网代码 +├── custom_ernie.py # 为适配 DiffCSE 模型,对ERNIE模型进行了部分修改 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── run_diffcse.py # 模型训练、评估、预测的主脚本 +├── utils.py # 包括一些常用的工具式函数 +├── run_train.sh # 模型训练的脚本 +├── run_eval.sh # 模型评估的脚本 +└── run_infer.sh # 模型预测的脚本 +``` + +### 模型训练 +默认使用无监督模式进行训练 DiffCSE,模型训练数据的数据样例如下所示,每行表示一条训练样本: +```shell +全年地方财政总收入3686.81亿元,比上年增长12.3%。 +“我对案情并不十分清楚,所以没办法提出批评,建议,只能希望通过质询,要求检察院对此做出说明。”他说。 +据调查结果显示:2015年微商行业总体市场规模达到1819.5亿元,预计2016年将达到3607.3亿元,增长率为98.3%。 +前往冈仁波齐需要办理目的地包含日喀则和阿里地区的边防证,外转沿途有一些补给点,可购买到干粮和饮料。 +``` + +可以运行如下命令,开始模型训练并且进行模型测试。 + +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_train" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "train" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --generator_name "ernie-3.0-base-zh" \ + --discriminator_name "ernie-3.0-base-zh" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --train_set_file "your train_set path" \ + --eval_set_file "your dev_set path" \ + --save_dir "./checkpoints" \ + --log_dir ${log_dir} \ + --save_steps "50000" \ + --eval_steps "1000" \ + --epochs "3" \ + --batch_size "32" \ + --mlm_probability "0.15" \ + --lambda_weight "0.15" \ + --learning_rate "3e-5" \ + --weight_decay "0.01" \ + --warmup_proportion "0.01" \ + --seed "0" \ + --device "gpu" +``` + +可支持配置的参数: +* `mode`:可选,用于指明本次运行是模型训练、模型评估还是模型预测,仅支持[train, eval, infer]三种模式;默认为 infer。 +* `encoder_name`:可选,DiffCSE模型中用于向量抽取的模型名称;默认为 ernie-3.0-base-zh。 +* `generator_name`: 可选,DiffCSE模型中生成器的模型名称;默认为 ernie-3.0-base-zh。 +* `discriminator_name`: 可选,DiffCSE模型中判别器的模型名称;默认为 rocketqa-zh-dureader-query-encoder。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `output_emb_size`:可选,向量抽取模型输出向量的维度;默认为32。 +* `train_set_file`:可选,用于指定训练集的路径。 +* `eval_set_file`:可选,用于指定验证集的路径。 +* `save_dir`:可选,保存训练模型的目录; +* `log_dir`:可选,训练训练过程中日志的输出目录; +* `save_steps`:可选,用于指定模型训练过程中每隔多少 step 保存一次模型。 +* `eval_steps`:可选,用于指定模型训练过程中每隔多少 step,使用验证集评估一次模型。 +* `epochs`: 模型训练轮次,默认为3。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `mlm_probability`:可选,利用生成器预测时,控制单词掩码的比例,默认为0.15。 +* `lambda_weight`:可选,控制RTD任务loss的占比,默认为0.15。 +* `learning_rate`:可选,Fine-tune 的最大学习率;默认为5e-5。 +* 
`weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.01。 +* `warmup_proportion`:可选,学习率 warmup 策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到 learning_rate, 而后再缓慢衰减,默认为0.01。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选 cpu 或 gpu。如使用 gpu 训练则参数 gpus 指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── best +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   ├── special_tokens_map.json +│   └── vocab.txt +└── ... +``` + +### 模型评估 +在模型评估时,需要使用带有标签的数据,以下展示了几条模型评估数据样例,每行表示一条训练样本,每行共计包含3列,分别是query1, query2, label: +```shell +右键单击此电脑选择属性,如下图所示 右键单击此电脑选择属性,如下图所示 5 +好医生解密||是什么,让美洲大蠊能美容还能救命 解密美洲大蠊巨大药用价值 1 +蒜香蜜汁烤鸡翅的做法 外香里嫩一口爆汁蒜蓉蜜汁烤鸡翅的做法 3 +项目计划书 篇2 简易项目计划书(参考模板) 2 +夏天幼儿园如何正确使用空调? 老师们该如何正确使用空调,让孩子少生病呢? 3 +``` + + +可以运行如下命令,进行模型评估。 + +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_eval" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "eval" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --eval_set_file "your dev_set path" \ + --ckpt_dir "./checkpoints/best" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" +``` +可支持配置的参数: +* `ckpt_dir`: 用于指定进行模型评估的checkpoint路径。 + +其他参数解释同上。 + +### 基于动态图模型预测 +在模型预测时,需要给定待预测的两条文本,以下展示了几条模型预测的数据样例,每行表示一条训练样本,每行共计包含2列,分别是query1, query2: +```shell +韩国现代摩比斯2015招聘 韩国现代摩比斯2015校园招聘信息 +《DNF》封号减刑方法 被封一年怎么办? DNF封号减刑方法 封号一年怎么减刑 +原神手鞠游戏三个刷新位置一览 手鞠游戏三个刷新位置一览 +``` + +可以运行如下命令,进行模型预测: +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_infer" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "infer" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --infer_set_file "your test_set path \ + --ckpt_dir "./checkpoints/best" \ + --save_infer_path "./infer_result.txt" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" +``` + +可支持配置的参数: +* `infer_set_file`: 可选,用于指定测试集的路径。 +* `save_infer_path`: 可选,用于保存模型预测结果的文件路径。 + +其他参数解释同上。 待模型预测结束后,会将结果保存至save_infer_path参数指定的文件中。 + + +## Reference +[1] Chuang Y S , Dangovski R , Luo H , et al. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings[J]. arXiv e-prints, 2022. https://arxiv.org/pdf/2204.10298.pdf. diff --git a/examples/text_matching/diffcse/custom_ernie.py b/examples/text_matching/diffcse/custom_ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..2ba36f3bb3be08577a22c5e6b1f465df672b62d9 --- /dev/null +++ b/examples/text_matching/diffcse/custom_ernie.py @@ -0,0 +1,1082 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
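+# NOTE: this module is a local copy of the PaddleNLP ERNIE implementation
+# (paddlenlp.transformers), with partial modifications made to fit the DiffCSE
+# model, as noted in the example README.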
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.transformers import PretrainedModel, register_base_model + +__all__ = [ + "ErnieModel", + "ErniePretrainedModel", + "ErnieForSequenceClassification", + "ErnieForTokenClassification", + "ErnieForQuestionAnswering", + "ErnieForPretraining", + "ErniePretrainingCriterion", + "ErnieForMaskedLM", + "ErnieForMultipleChoice", +] + + +class ErnieEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + pad_token_id=0, + weight_attr=None, + task_type_vocab_size=3, + task_id=0, + use_task_id=False, + ): + super(ErnieEmbeddings, self).__init__() + + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id, weight_attr=weight_attr) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size, weight_attr=weight_attr) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size, weight_attr=weight_attr) + self.use_task_id = use_task_id + self.task_id = task_id + if self.use_task_id: + self.task_type_embeddings = nn.Embedding(task_type_vocab_size, hidden_size, weight_attr=weight_attr) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, task_type_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + # seq_length = input_ids.shape[1] + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + if self.use_task_id: + if task_type_ids is None: + task_type_ids = paddle.ones_like(input_ids, dtype="int64") * self.task_id + task_type_embeddings = self.task_type_embeddings(task_type_ids) + embeddings = embeddings + task_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class ErniePooler(nn.Layer): + def __init__(self, hidden_size, weight_attr=None): + super(ErniePooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size, weight_attr=weight_attr) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class ErniePretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained ERNIE models. It provides ERNIE related + `model_config_file`, `pretrained_init_configuration`, `resource_files_names`, + `pretrained_resource_files_map`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + # Deprecated, alias for ernie-1.0-base-zh + "ernie-1.0": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-1.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-1.0-large-zh-cw": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 3072, # it is 3072 instead of 4096 + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-tiny": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 600, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 50006, + "pad_token_id": 0, + }, + "ernie-2.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 18000, + }, + "ernie-2.0-large-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "intermediate_size": 4096, # special for large model + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 4, + "vocab_size": 12800, + }, + "ernie-2.0-base-en": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-2.0-base-en-finetuned-squad": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-2.0-large-en": { + "attention_probs_dropout_prob": 0.1, + "intermediate_size": 4096, # special for ernie-2.0-large-en + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-query-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + 
"num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-para-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-query-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-para-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-cross-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-cross-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-3.0-xbase-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "intermediate_size": 4096, # special for large model + "hidden_size": 1024, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 16, + "num_hidden_layers": 20, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-medium-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "intermediate_size": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 6, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-mini-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 384, + "intermediate_size": 1536, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 6, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-micro-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 384, + 
"intermediate_size": 1536, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 4, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-nano-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 312, + "intermediate_size": 1248, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 4, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + # Deprecated, alias for ernie-1.0-base-zh + "ernie-1.0": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_v1_chn_base.pdparams", + "ernie-1.0-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_v1_chn_base.pdparams", + "ernie-1.0-large-zh-cw": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_1.0_large_zh_cw.pdparams", + "ernie-tiny": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_tiny/ernie_tiny.pdparams", + "ernie-2.0-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_2.0/ernie_2.0_base_zh.pdparams", + "ernie-2.0-large-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_2.0/ernie_2.0_large_zh.pdparams", + "ernie-2.0-base-en": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_base/ernie_v2_eng_base.pdparams", + "ernie-2.0-base-en-finetuned-squad": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_base/ernie_v2_eng_base_finetuned_squad.pdparams", + "ernie-2.0-large-en": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams", + "rocketqa-zh-dureader-query-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_query_encoder.pdparams", + "rocketqa-zh-dureader-para-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_para_encoder.pdparams", + "rocketqa-v1-marco-query-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_query_encoder.pdparams", + "rocketqa-v1-marco-para-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_para_encoder.pdparams", + "rocketqa-zh-dureader-cross-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_cross_encoder.pdparams", + "rocketqa-v1-marco-cross-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_cross_encoder.pdparams", + "ernie-3.0-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams", + "ernie-3.0-xbase-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_xbase_zh.pdparams", + "ernie-3.0-medium-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh.pdparams", + "ernie-3.0-mini-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_mini_zh.pdparams", + "ernie-3.0-micro-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_micro_zh.pdparams", + "ernie-3.0-nano-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_nano_zh.pdparams", + } + } + base_model_prefix = "ernie" + + def _init_weights(self, layer): + """Initialization hook""" + if 
isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.ernie.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class ErnieModel(ErniePretrainedModel): + r""" + The bare ERNIE Model transformer outputting raw hidden-states. + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `ErnieModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. + num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to `"gelu"`. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer for initializing all weight matrices. + Defaults to `0.02`. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`ErniePretrainedModel._init_weights()` for how weights are initialized in `ErnieModel`. + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. 
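+        task_type_vocab_size (int, optional):
+            The vocabulary size of `task_type_ids` used by the ERNIE 3.0 style
+            task embeddings. Only takes effect when `use_task_id` is True.
+            Defaults to `3`.
+        task_id (int, optional):
+            The task id used to build the default `task_type_ids`.
+            Defaults to `0`.
+        use_task_id (bool, optional):
+            Whether to add task type embeddings on top of the word, position and
+            token type embeddings. Defaults to `False`.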
+ """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + pad_token_id=0, + task_type_vocab_size=3, + task_id=0, + use_task_id=False, + ): + super(ErnieModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal(mean=0.0, std=self.initializer_range) + ) + self.embeddings = ErnieEmbeddings( + vocab_size, + hidden_size, + hidden_dropout_prob, + max_position_embeddings, + type_vocab_size, + pad_token_id, + weight_attr, + task_type_vocab_size, + task_id, + use_task_id, + ) + encoder_layer = nn.TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + weight_attr=weight_attr, + normalize_before=False, + ) + self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = ErniePooler(hidden_size, weight_attr) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + task_type_ids=None, + cls_input=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate different portions of the inputs. + Selected in the range ``[0, type_vocab_size - 1]``. + If `type_vocab_size` is 2, which means the inputs have two portions. + Indices can either be 0 or 1: + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + Its data type should be `int64` and it has a shape of [batch_size, sequence_length]. + Defaults to `None`, which means we don't add segment embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + max_position_embeddings - 1]``. + Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + We use whole-word-mask in ERNIE, so the whole word will have the same value. For example, "使用" as a word, + "使" and "用" will have the same value. + Defaults to `None`, which means nothing needed to be prevented attention to. + Returns: + tuple: Returns tuple (``sequence_output``, ``pooled_output``). 
+ With the fields: + - `sequence_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + - `pooled_output` (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieModel, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieModel.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e4, axis=[1, 2] + ) + # For 2D attention_mask from tokenizer + elif attention_mask.ndim == 2: + attention_mask = paddle.unsqueeze(attention_mask, axis=[1, 2]).astype(paddle.get_default_dtype()) + attention_mask = (1.0 - attention_mask) * -1e4 + attention_mask.stop_gradient = True + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, task_type_ids=task_type_ids + ) + + if cls_input is not None: + embedding_output = paddle.concat([cls_input.unsqueeze(1), embedding_output[:, 1:, :]], axis=1) + + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +class ErnieForSequenceClassification(ErniePretrainedModel): + r""" + Ernie Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + Args: + ernie (ErnieModel): + An instance of `paddlenlp.transformers.ErnieModel`. + num_classes (int, optional): + The number of classes. Default to `2`. + dropout (float, optional): + The dropout probability for output of ERNIE. + If None, use the same value as `hidden_dropout_prob` + of `paddlenlp.transformers.ErnieModel` instance. Defaults to `None`. + """ + + def __init__(self, ernie, num_classes=2, dropout=None): + super(ErnieForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.ernie = ernie # allow ernie to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Shape as `[batch_size, num_classes]` and dtype as float32. + Example: + .. 
code-block:: + import paddle + from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForSequenceClassification.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + _, pooled_output = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + +class ErnieForQuestionAnswering(ErniePretrainedModel): + """ + Ernie Model with a linear layer on top of the hidden-states + output to compute `span_start_logits` and `span_end_logits`, + designed for question-answering tasks like SQuAD. + Args: + ernie (`ErnieModel`): + An instance of `ErnieModel`. + """ + + def __init__(self, ernie): + super(ErnieForQuestionAnswering, self).__init__() + self.ernie = ernie # allow ernie to be config + self.classifier = nn.Linear(self.ernie.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + With the fields: + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForQuestionAnswering.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + + sequence_output, _ = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class ErnieForTokenClassification(ErniePretrainedModel): + r""" + ERNIE Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + Args: + ernie (`ErnieModel`): + An instance of `ErnieModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of ERNIE. + If None, use the same value as `hidden_dropout_prob` + of `ErnieModel` instance `ernie`. Defaults to `None`. 
+ """ + + def __init__(self, ernie, num_classes=2, dropout=None): + super(ErnieForTokenClassification, self).__init__() + self.num_classes = num_classes + self.ernie = ernie # allow ernie to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForTokenClassification.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + sequence_output, _ = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits + + +class ErnieLMPredictionHead(nn.Layer): + r""" + Ernie Model with a `language modeling` head on top. + """ + + def __init__( + self, + hidden_size, + vocab_size, + activation, + embedding_weights=None, + weight_attr=None, + ): + super(ErnieLMPredictionHead, self).__init__() + + self.transform = nn.Linear(hidden_size, hidden_size, weight_attr=weight_attr) + self.activation = getattr(nn.functional, activation) + self.layer_norm = nn.LayerNorm(hidden_size) + self.decoder_weight = ( + self.create_parameter( + shape=[vocab_size, hidden_size], dtype=self.transform.weight.dtype, attr=weight_attr, is_bias=False + ) + if embedding_weights is None + else embedding_weights + ) + self.decoder_bias = self.create_parameter(shape=[vocab_size], dtype=self.decoder_weight.dtype, is_bias=True) + + def forward(self, hidden_states, masked_positions=None): + if masked_positions is not None: + hidden_states = paddle.reshape(hidden_states, [-1, hidden_states.shape[-1]]) + hidden_states = paddle.tensor.gather(hidden_states, masked_positions) + # gather masked tokens might be more quick + hidden_states = self.transform(hidden_states) + hidden_states = self.activation(hidden_states) + hidden_states = self.layer_norm(hidden_states) + hidden_states = paddle.tensor.matmul(hidden_states, self.decoder_weight, transpose_y=True) + self.decoder_bias + return hidden_states + + +class ErniePretrainingHeads(nn.Layer): + def __init__( + self, + hidden_size, + vocab_size, + activation, + embedding_weights=None, + weight_attr=None, + ): + super(ErniePretrainingHeads, self).__init__() + self.predictions = ErnieLMPredictionHead(hidden_size, vocab_size, activation, embedding_weights, weight_attr) + self.seq_relationship = nn.Linear(hidden_size, 2, weight_attr=weight_attr) + + def forward(self, sequence_output, pooled_output, masked_positions=None): + prediction_scores = self.predictions(sequence_output, masked_positions) + seq_relationship_score = 
self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class ErnieForPretraining(ErniePretrainedModel): + r""" + Ernie Model with a `masked language modeling` head and a `sentence order prediction` head + on top. + """ + + def __init__(self, ernie): + super(ErnieForPretraining, self).__init__() + self.ernie = ernie + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal(mean=0.0, std=self.ernie.initializer_range) + ) + self.cls = ErniePretrainingHeads( + self.ernie.config["hidden_size"], + self.ernie.config["vocab_size"], + self.ernie.config["hidden_act"], + embedding_weights=self.ernie.embeddings.word_embeddings.weight, + weight_attr=weight_attr, + ) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, masked_positions=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + tuple: Returns tuple (``prediction_scores``, ``seq_relationship_score``). + With the fields: + - `prediction_scores` (Tensor): + The scores of masked token prediction. Its data type should be float32. + If `masked_positions` is None, its shape is [batch_size, sequence_length, vocab_size]. + Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. + - `seq_relationship_score` (Tensor): + The scores of next sentence prediction. + Its data type should be float32 and its shape is [batch_size, 2]. + """ + with paddle.static.amp.fp16_guard(): + outputs = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + sequence_output, pooled_output = outputs[:2] + prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output, masked_positions) + return prediction_scores, seq_relationship_score + + +class ErniePretrainingCriterion(paddle.nn.Layer): + r""" + The loss output of Ernie Model during the pretraining: + a `masked language modeling` head and a `next sentence prediction (classification)` head. + """ + + def __init__(self, with_nsp_loss=True): + super(ErniePretrainingCriterion, self).__init__() + self.with_nsp_loss = with_nsp_loss + # self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=-1) + + def forward(self, prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels=None): + """ + Args: + prediction_scores(Tensor): + The scores of masked token prediction. Its data type should be float32. + If `masked_positions` is None, its shape is [batch_size, sequence_length, vocab_size]. + Otherwise, its shape is [batch_size, mask_token_num, vocab_size] + seq_relationship_score(Tensor): + The scores of next sentence prediction. Its data type should be float32 and + its shape is [batch_size, 2] + masked_lm_labels(Tensor): + The labels of the masked language modeling, its dimensionality is equal to `prediction_scores`. + Its data type should be int64. If `masked_positions` is None, its shape is [batch_size, sequence_length, 1]. + Otherwise, its shape is [batch_size, mask_token_num, 1] + next_sentence_labels(Tensor): + The labels of the next sentence prediction task, the dimensionality of `next_sentence_labels` + is equal to `seq_relation_labels`. 
Its data type should be int64 and + its shape is [batch_size, 1] + Returns: + Tensor: The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean of `next_sentence_loss`. + Its data type should be float32 and its shape is [1]. + """ + + with paddle.static.amp.fp16_guard(): + masked_lm_loss = F.cross_entropy(prediction_scores, masked_lm_labels, ignore_index=-1, reduction="none") + + if not self.with_nsp_loss: + return paddle.mean(masked_lm_loss) + + next_sentence_loss = F.cross_entropy(seq_relationship_score, next_sentence_labels, reduction="none") + return paddle.mean(masked_lm_loss), paddle.mean(next_sentence_loss) + + +class ErnieOnlyMLMHead(nn.Layer): + def __init__(self, hidden_size, vocab_size, activation, embedding_weights): + super().__init__() + self.predictions = ErnieLMPredictionHead( + hidden_size=hidden_size, vocab_size=vocab_size, activation=activation, embedding_weights=embedding_weights + ) + + def forward(self, sequence_output, masked_positions=None): + prediction_scores = self.predictions(sequence_output, masked_positions) + return prediction_scores + + +class ErnieForMaskedLM(ErniePretrainedModel): + """ + Ernie Model with a `masked language modeling` head on top. + Args: + ernie (:class:`ErnieModel`): + An instance of :class:`ErnieModel`. + """ + + def __init__(self, ernie): + super(ErnieForMaskedLM, self).__init__() + self.ernie = ernie + self.cls = ErnieOnlyMLMHead( + self.ernie.config["hidden_size"], + self.ernie.config["vocab_size"], + self.ernie.config["hidden_act"], + embedding_weights=self.ernie.embeddings.word_embeddings.weight, + ) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `prediction_scores`, The scores of masked token prediction. + Its data type should be float32 and shape is [batch_size, sequence_length, vocab_size]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForMaskedLM.from_pretrained('ernie-1.0') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + print(logits.shape) + # [1, 17, 18000] + """ + + outputs = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output, masked_positions=None) + return prediction_scores + + +class ErnieForMultipleChoice(ErniePretrainedModel): + """ + Ernie Model with a linear layer on top of the hidden-states output layer, + designed for multiple choice tasks like RocStories/SWAG tasks. + + Args: + ernie (:class:`ErnieModel`): + An instance of ErnieModel. + num_choices (int, optional): + The number of choices. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Ernie. + If None, use the same value as `hidden_dropout_prob` of `ErnieModel` + instance `ernie`. Defaults to None. 
+ """ + + def __init__(self, ernie, num_choices=2, dropout=None): + super(ErnieForMultipleChoice, self).__init__() + self.num_choices = num_choices + self.ernie = ernie + self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], 1) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + The ErnieForMultipleChoice forward method, overrides the __call__() special method. + Args: + input_ids (Tensor): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + token_type_ids(Tensor, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + position_ids(Tensor, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + attention_mask (list, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + Returns: + Tensor: Returns tensor `reshaped_logits`, a tensor of the multiple choice classification logits. + Shape as `[batch_size, num_choice]` and dtype as `float32`. + """ + # input_ids: [bs, num_choice, seq_l] + input_ids = input_ids.reshape(shape=(-1, input_ids.shape[-1])) # flat_input_ids: [bs*num_choice,seq_l] + + if position_ids is not None: + position_ids = position_ids.reshape(shape=(-1, position_ids.shape[-1])) + if token_type_ids is not None: + token_type_ids = token_type_ids.reshape(shape=(-1, token_type_ids.shape[-1])) + + if attention_mask is not None: + attention_mask = attention_mask.reshape(shape=(-1, attention_mask.shape[-1])) + + _, pooled_output = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + pooled_output = self.dropout(pooled_output) + + logits = self.classifier(pooled_output) # logits: (bs*num_choice,1) + reshaped_logits = logits.reshape(shape=(-1, self.num_choices)) # logits: (bs, num_choice) + + return reshaped_logits diff --git a/examples/text_matching/diffcse/data.py b/examples/text_matching/diffcse/data.py new file mode 100644 index 0000000000000000000000000000000000000000..f2ff0b1749dab43957a83735802b3514bc400e40 --- /dev/null +++ b/examples/text_matching/diffcse/data.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
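+
+# Helpers for building the DiffCSE dataloaders:
+# - read_text_single / read_text_pair yield unsupervised, labelled or inference
+#   examples from plain text files (one example per line);
+# - convert_example tokenizes an example into input_ids / token_type_ids /
+#   attention_mask (plus the label when one is present);
+# - mask_tokens applies the BERT-style MLM corruption (80% [MASK], 10% random
+#   token, 10% unchanged) that is fed to the DiffCSE generator.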
+ +import paddle + + +def get_special_tokens(): + return ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + + +def get_special_token_ids(tokenizer): + special_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + return tokenizer.convert_tokens_to_ids(special_tokens) + + +def get_special_token_dict(tokenizer): + special_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + special_token_dict = dict(zip(special_tokens, tokenizer.convert_tokens_to_ids(special_tokens))) + return special_token_dict + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + result = [] + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, return_attention_mask=True) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + attention_mask = encoded_inputs["attention_mask"] + result += [input_ids, token_type_ids, attention_mask] + return result + + +def read_text_single(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def mask_tokens(batch_inputs, tokenizer, mlm_probability=0.15): + """ + Description: Mask input_ids for masked language modeling: 80% MASK, 10% random, 10% original + """ + mlm_inputs = batch_inputs.clone() + mlm_labels = batch_inputs.clone() + + probability_matrix = paddle.full(mlm_inputs.shape, mlm_probability) + + special_tokens_mask = paddle.cast(paddle.zeros(mlm_inputs.shape), dtype=bool) + for special_token_id in get_special_token_ids(tokenizer): + special_tokens_mask |= mlm_inputs == special_token_id + + probability_matrix = masked_fill(probability_matrix, special_tokens_mask, 0.0) + + masked_indices = paddle.cast(paddle.bernoulli(probability_matrix), dtype=bool) + mlm_labels = masked_fill(mlm_labels, ~masked_indices, -100) + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = paddle.cast(paddle.bernoulli(paddle.full(mlm_inputs.shape, 0.8)), dtype=bool) & masked_indices + mlm_inputs = masked_fill(mlm_inputs, indices_replaced, tokenizer.mask_token_id) + + # 10% of the time, we replace masked input tokens with random word + indices_random = ( + paddle.cast(paddle.bernoulli(paddle.full(mlm_inputs.shape, 0.5)), dtype=bool) + & masked_indices + & ~indices_replaced + ) + random_words = paddle.randint(0, len(tokenizer), mlm_inputs.shape, dtype=mlm_inputs.dtype) + mlm_inputs = paddle.where(indices_random, random_words, mlm_inputs) + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return mlm_inputs, mlm_labels + + +def read_text_pair(data_path, is_infer=False): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if 
is_infer: + if len(data[0]) == 0 or len(data[1]) == 0: + continue + yield {"text_a": data[0], "text_b": data[1]} + else: + if len(data[0]) == 0 or len(data[1]) == 0 or len(data[2]) == 0: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} diff --git a/examples/text_matching/diffcse/model.py b/examples/text_matching/diffcse/model.py new file mode 100644 index 0000000000000000000000000000000000000000..0deb60ee4830adeb738b79a17322ef7a9b3da245 --- /dev/null +++ b/examples/text_matching/diffcse/model.py @@ -0,0 +1,313 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from custom_ernie import ErnieModel as CustomErnie +from data import mask_tokens + +from paddlenlp.transformers import AutoModel, ErnieForMaskedLM + + +class ProjectionMLP(nn.Layer): + def __init__(self, in_dim): + super(ProjectionMLP, self).__init__() + hidden_dim = in_dim * 2 + out_dim = in_dim + list_layers = [nn.Linear(in_dim, hidden_dim, bias_attr=False), nn.BatchNorm1D(hidden_dim), nn.ReLU()] + list_layers += [nn.Linear(hidden_dim, out_dim, bias_attr=False), nn.BatchNorm1D(out_dim)] + self.net = nn.Sequential(*list_layers) + + def forward(self, x): + return self.net(x) + + +class Similarity(nn.Layer): + """ + Dot product or cosine similarity + """ + + def __init__(self, temp): + super(Similarity, self).__init__() + self.temp = temp + self.cos = nn.CosineSimilarity(axis=-1) + self.record = None + self.pos_avg = 0.0 + self.neg_avg = 0.0 + + def forward(self, x, y, one_vs_one=False): + if one_vs_one: + sim = self.cos(x, y) + return sim + + x = x.unsqueeze(1) + y = y.unsqueeze(0) + sim = self.cos(x, y) + self.record = sim.detach() + min_size = min(self.record.shape[0], self.record.shape[1]) + num_item = self.record.shape[0] * self.record.shape[1] + self.pos_avg = paddle.diag(self.record).sum().item() / min_size + self.neg_avg = (self.record.sum().item() - paddle.diag(self.record).sum().item()) / (num_item - min_size) + return sim / self.temp + + +class Encoder(nn.Layer): + def __init__(self, pretrained_model_name, temp=0.05, output_emb_size=None): + super(Encoder, self).__init__() + self.ptm = AutoModel.from_pretrained(pretrained_model_name) + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size + self.output_emb_size = output_emb_size + self.mlp = ProjectionMLP(self.ptm.config["hidden_size"]) + + if output_emb_size is not None: + self.emb_reduce_linear = nn.Linear(self.ptm.config["hidden_size"], output_emb_size) + + self.temp = temp + self.sim = Similarity(temp) + + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=False + ): + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if not with_pooler: + ori_cls_embedding = sequence_output[:, 0, :] + else: + 
ori_cls_embedding = cls_embedding + + mlp_cls_embedding = self.mlp(ori_cls_embedding) + if self.output_emb_size is not None: + cls_embedding = self.emb_reduce_linear(mlp_cls_embedding) + + return cls_embedding, mlp_cls_embedding + + def cosine_sim( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + key_token_type_ids=None, + key_position_ids=None, + key_attention_mask=None, + with_pooler=False, + ): + query_cls_embedding, _ = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + key_cls_embedding, _ = self.get_pooled_embedding( + key_input_ids, key_token_type_ids, key_position_ids, key_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = self.sim(query_cls_embedding, key_cls_embedding, one_vs_one=True) + return cosine_sim + + def forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + key_token_type_ids=None, + key_position_ids=None, + key_attention_mask=None, + with_pooler=False, + ): + query_cls_embedding, mlp_query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + key_cls_embedding, mlp_key_cls_embedding = self.get_pooled_embedding( + key_input_ids, key_token_type_ids, key_position_ids, key_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = self.sim(query_cls_embedding, key_cls_embedding) + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + loss = F.cross_entropy(input=cosine_sim, label=labels) + + mlp_cls_embedding = paddle.concat([mlp_query_cls_embedding, mlp_key_cls_embedding], axis=0) + return loss, mlp_cls_embedding + + +class Discriminator(nn.Layer): + def __init__(self, ptm_model_name): + super(Discriminator, self).__init__() + self.ptm = CustomErnie.from_pretrained(ptm_model_name) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + + def forward(self, input_ids, labels, cls_input, token_type_ids=None, attention_mask=None): + sequence_output, _ = self.ptm( + input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, cls_input=cls_input + ) + pred_scores = self.classifier(sequence_output) + loss = F.cross_entropy(input=pred_scores, label=labels) + + return loss, pred_scores.argmax(-1) + + +class DiffCSE(nn.Layer): + def __init__( + self, + encoder_name, + generator_name, + discriminator_name, + enc_tokenizer, + gen_tokenizer, + dis_tokenizer, + temp=0.05, + output_emb_size=32, + mlm_probability=0.15, + lambda_weight=0.15, + ): + super(DiffCSE, self).__init__() + self.encoder_name = encoder_name + self.generator_name = generator_name + self.discriminator_name = discriminator_name + self.enc_tokenizer = enc_tokenizer + self.gen_tokenizer = gen_tokenizer + self.dis_tokenizer = dis_tokenizer + self.temp = temp + self.output_emb_size = output_emb_size + self.mlm_probability = mlm_probability + self.lambda_weight = lambda_weight + + self.encoder = Encoder(encoder_name, temp=temp, output_emb_size=output_emb_size) + self.generator = ErnieForMaskedLM.from_pretrained(generator_name) + self.discriminator = Discriminator(discriminator_name) + + self.rtd_acc = 0.0 + self.rtd_rep_acc = 0.0 + self.rtd_fix_acc = 0.0 + + def train_forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + 
key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + ): + + # extract senmantic vector with encoder and then comput CL loss + loss, mlp_cls_embedding = self.encoder( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + + with paddle.no_grad(): + # mask tokens for query and key input_ids and then predict mask token with generator + input_ids = paddle.concat([query_input_ids, key_input_ids], axis=0) + if self.encoder_name != self.generator_name: + input_ids = self.encode_by_generator(input_ids) + attention_mask = paddle.concat([query_attention_mask, key_attention_mask], axis=0) + mlm_input_ids, _ = mask_tokens(input_ids, self.gen_tokenizer, mlm_probability=self.mlm_probability) + # predict tokens using generator + pred_tokens = self.generator(mlm_input_ids, attention_mask=attention_mask).argmax(axis=-1) + + pred_tokens = pred_tokens.detach() + + if self.generator_name != self.discriminator_name: + pred_tokens = self.encode_by_discriminator(pred_tokens) + input_ids = self.encode_by_discriminator(input_ids) + + pred_tokens[:, 0] = self.dis_tokenizer.cls_token_id + e_inputs = pred_tokens * attention_mask + replaced = pred_tokens != input_ids + e_labels = paddle.cast(replaced, dtype="int64") * attention_mask + rtd_loss, prediction = self.discriminator(e_inputs, e_labels, cls_input=mlp_cls_embedding) + loss = loss + self.lambda_weight * rtd_loss + + rep = (e_labels == 1) * attention_mask + fix = (e_labels == 0) * attention_mask + self.rtd_rep_acc = float((prediction * rep).sum() / rep.sum()) + self.rtd_fix_acc = float(1.0 - (prediction * fix).sum() / fix.sum()) + self.rtd_acc = float(((prediction == e_labels) * attention_mask).sum() / attention_mask.sum()) + + return loss, rtd_loss + + def encode_by_generator(self, batch_tokens): + new_tokens = [] + for one_tokens in batch_tokens: + one_gen_tokens = self.enc_tokenizer.convert_ids_to_tokens(one_tokens.tolist()) + new_tokens.append(self.gen_tokenizer.convert_tokens_to_ids(one_gen_tokens)) + + return paddle.to_tensor(new_tokens) + + def encode_by_discriminator(self, batch_tokens): + new_tokens = [] + for one_tokens in batch_tokens: + one_gen_tokens = self.gen_tokenizer.convert_ids_to_tokens(one_tokens.tolist()) + new_tokens.append(self.dis_tokenizer.convert_tokens_to_ids(one_gen_tokens)) + + return paddle.to_tensor(new_tokens) + + def test_forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + ): + + # compute cosine similarity for query and key text + cos_sim = self.encoder.cosine_sim( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + + return cos_sim + + def forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + mode="train", + ): + if mode == "train": + return self.train_forward( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + else: + return self.test_forward( + query_input_ids, + key_input_ids, + 
query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) diff --git a/examples/text_matching/diffcse/run_diffcse.py b/examples/text_matching/diffcse/run_diffcse.py new file mode 100644 index 0000000000000000000000000000000000000000..c567308db3d744ca3d546ae78a7e7a349ecc7532 --- /dev/null +++ b/examples/text_matching/diffcse/run_diffcse.py @@ -0,0 +1,374 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair, read_text_single +from model import DiffCSE, Encoder +from utils import eval_metric, set_seed +from visualdl import LogWriter + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--mode", choices=["train", "eval", "infer"], default="infer", help="Select which mode to run model, defaults to infer.") +parser.add_argument("--encoder_name", type=str, help="The sentence_encoder name or path that you wanna train based on.") +parser.add_argument("--generator_name", type=str, help="The generator model name or path that you wanna train based on.") +parser.add_argument("--discriminator_name", type=str, help="The discriminator model name or path that you wanna train based on.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--train_set_file", type=str, help="The full path of train_set_file.") +parser.add_argument("--eval_set_file", type=str, help="The full path of eval_set_file.") +parser.add_argument("--infer_set_file", type=str, help="The full path of infer_set_file.") +parser.add_argument("--ckpt_dir", default=None, type=str, help="The ckpt directory where the model checkpoints will be loaded when doing evaluation/inference.") +parser.add_argument("--save_dir", default="./checkpoints", type=str, help="The directory where the model checkpoints will be written.") +parser.add_argument("--log_dir", default=None, type=str, help="The directory where log will be written.") +parser.add_argument("--save_infer_path", default="./infer_result.txt", type=str, help="The save directory where the inference result will be written.") +parser.add_argument("--save_steps", type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--eval_steps", type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number 
of training steps to perform. Overrides epochs.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--temp", default=0.05, type=float, help="Temperature for softmax.") +parser.add_argument("--mlm_probability", default=0.15, type=float, help="The ratio for masked language model.") +parser.add_argument("--lambda_weight", default=0.15, type=float, help="The weight for RTD loss.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def do_infer(model, tokenizer, data_loader): + assert isinstance(model, Encoder), "please make sure that model is an instance of Encoder." + sims = [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + ( + query_input_ids, + query_token_type_ids, + query_attention_mask, + key_input_ids, + key_token_type_ids, + key_attention_mask, + ) = batch + cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + key_input_ids=key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + sims.append(cosine_sim.numpy()) + sims = np.concatenate(sims, axis=0) + model.train() + return sims + + +def do_eval(model, tokenizer, data_loader): + assert isinstance(model, Encoder), "please make sure that model is an instance of Encoder."
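+ # score every dev batch with Encoder.cosine_sim, collect the gold labels, then report eval_metric (Spearman correlation, see utils.py) over the full set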
+ sims, labels = [], [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + ( + query_input_ids, + query_token_type_ids, + query_attention_mask, + key_input_ids, + key_token_type_ids, + key_attention_mask, + label, + ) = batch + cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + key_input_ids=key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + score = eval_metric(labels, sims) + model.train() + return score + + +def do_train(model, tokenizer, train_data_loader, dev_data_loader, writer=None): + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + best_score = 0.0 + tic_train = time.time() + model = paddle.DataParallel(model) + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + ( + query_input_ids, + query_token_type_ids, + query_attention_mask, + key_input_ids, + key_token_type_ids, + key_attention_mask, + ) = batch + + loss, rtd_loss = model( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + + global_step += 1 + if global_step % (args.eval_steps // 10) == 0 and rank == 0: + print( + "global step {}, epoch: {}, batch: {}, loss: {:.5f}, rtd_loss: {:.5f}, rtd_acc: {:.5f}, rtd_rep_acc: {:.5f}, rtd_fix_acc: {:.5f}, pos_avg: {:.5f}, neg_avg: {:.5f}, speed: {:.2f} step/s".format( + global_step, + epoch, + step, + loss.item(), + rtd_loss.item(), + model._layers.rtd_acc, + model._layers.rtd_rep_acc, + model._layers.rtd_fix_acc, + model._layers.encoder.sim.pos_avg, + model._layers.encoder.sim.neg_avg, + (args.eval_steps // 10) / (time.time() - tic_train), + ) + ) + writer.add_scalar(tag="train/loss", step=global_step, value=loss.item()) + writer.add_scalar(tag="train/rtd_loss", step=global_step, value=rtd_loss.item()) + writer.add_scalar(tag="train/rtd_acc", step=global_step, value=model._layers.rtd_acc) + writer.add_scalar(tag="train/rtd_rep_acc", step=global_step, value=model._layers.rtd_rep_acc) + writer.add_scalar(tag="train/rtd_fix_acc", step=global_step, value=model._layers.rtd_fix_acc) + + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + score = do_eval(model._layers.encoder, tokenizer, dev_data_loader) + print("Evaluation - score:{:.5f}".format(score)) + + if best_score < score: + print( + "best checkpoint has been updated: from last best_score {} --> new score {}.".format( + best_score, score + ) + ) + best_score = score + # save best model + save_dir = os.path.join(args.save_dir, "best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + 
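+ # note: only the encoder weights are saved here; the "eval" and "infer" modes rebuild a standalone Encoder and reload this file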
paddle.save(model._layers.encoder.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + writer.add_scalar(tag="eval/score", step=global_step, value=score) + model.train() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "checkpoint_{}".format(global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model._layers.encoder.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return model + + +if __name__ == "__main__": + # set running environment + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + # define tokenizer for processing data + tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained(args.encoder_name) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + if args.mode == "train": + start_time = time.time() + + # load data + train_ds = load_dataset(read_text_single, data_path=args.train_set_file, lazy=False) + dev_ds = load_dataset(read_text_pair, data_path=args.eval_set_file, lazy=False) + gen_tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained(args.generator_name) + dis_tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained(args.discriminator_name) + + # initializing DiffCSE model + model = DiffCSE( + encoder_name=args.encoder_name, + generator_name=args.generator_name, + discriminator_name=args.discriminator_name, + enc_tokenizer=tokenizer, + gen_tokenizer=gen_tokenizer, + dis_tokenizer=dis_tokenizer, + temp=args.temp, + output_emb_size=args.output_emb_size, + mlm_probability=args.mlm_probability, + lambda_weight=args.lambda_weight, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + ): [data for data in fn(samples)] + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="eval", batch_size=args.batch_size, batchify_fn=dev_batchify_fn, trans_fn=trans_func + ) + + with LogWriter(logdir=os.path.join(args.log_dir, "scalar")) as writer: + do_train(model, tokenizer, train_data_loader, dev_data_loader, writer=writer) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) + + if args.mode == "eval": + start_time = time.time() + # initalizing 
encoder model for eval + model = Encoder(args.encoder_name, temp=args.temp, output_emb_size=args.output_emb_size) + # load model from saved checkpoint + if args.ckpt_dir: + init_from_ckpt = os.path.join(args.ckpt_dir, "model_state.pdparams") + if os.path.isfile(init_from_ckpt): + print( + "*************************initializing model from {}*****************************".format( + init_from_ckpt + ) + ) + state_dict = paddle.load(init_from_ckpt) + model.set_dict(state_dict) + + dev_ds = load_dataset(read_text_pair, data_path=args.eval_set_file, lazy=False) + + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader( + dev_ds, mode="eval", batch_size=args.batch_size, batchify_fn=dev_batchify_fn, trans_fn=trans_func + ) + + score = do_eval(model, tokenizer, dev_data_loader) + print("Evaluation - score:{:.5f}".format(score)) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) + + if args.mode == "infer": + start_time = time.time() + # initalizing encoder model for eval + model = Encoder(args.encoder_name, temp=args.temp, output_emb_size=args.output_emb_size) + # load model from saved checkpoint + if args.ckpt_dir: + init_from_ckpt = os.path.join(args.ckpt_dir, "model_state.pdparams") + if os.path.isfile(init_from_ckpt): + print( + "*************************initializing model from {}*****************************".format( + init_from_ckpt + ) + ) + state_dict = paddle.load(init_from_ckpt) + model.set_dict(state_dict) + + infer_ds = load_dataset(read_text_pair, data_path=args.infer_set_file, lazy=False, is_infer=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + ): [data for data in fn(samples)] + + infer_data_loader = create_dataloader( + infer_ds, mode="infer", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + cosin_sim = do_infer(model, tokenizer, infer_data_loader) + + with open(args.save_infer_path, "w", encoding="utf-8") as f: + for idx, cos in enumerate(cosin_sim): + msg = "{} --> {}\n".format(idx, cos) + f.write(msg) + print("Inference result has been saved to : {}".format(args.save_infer_path)) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) diff --git a/examples/text_matching/diffcse/run_eval.sh b/examples/text_matching/diffcse/run_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..0c5512c2f4d3c6fd5ff0d1c1ae6a58da4454c2fd --- /dev/null +++ b/examples/text_matching/diffcse/run_eval.sh @@ -0,0 +1,15 @@ +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_eval" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "eval" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" 
\ + --eval_set_file "your dev_set path" \ + --ckpt_dir "./checkpoints/best" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" diff --git a/examples/text_matching/diffcse/run_infer.sh b/examples/text_matching/diffcse/run_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..7df8a573cd8a3cdd8ced45b4871e45a5f5169881 --- /dev/null +++ b/examples/text_matching/diffcse/run_infer.sh @@ -0,0 +1,16 @@ +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_infer" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "infer" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --infer_set_file "your test_set path" \ + --ckpt_dir "./checkpoints/best" \ + --save_infer_path "./infer_result.txt" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" diff --git a/examples/text_matching/diffcse/run_train.sh b/examples/text_matching/diffcse/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..9ed49a4b37e8db24b68e93a8c519c8552b7e4da7 --- /dev/null +++ b/examples/text_matching/diffcse/run_train.sh @@ -0,0 +1,27 @@ +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_train" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "train" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --generator_name "ernie-3.0-base-zh" \ + --discriminator_name "ernie-3.0-base-zh" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --train_set_file "your train_set path" \ + --eval_set_file "your dev_set path" \ + --save_dir "./checkpoints" \ + --log_dir ${log_dir} \ + --save_steps "50000" \ + --eval_steps "1000" \ + --batch_size "32" \ + --epochs "3" \ + --mlm_probability "0.15" \ + --lambda_weight "0.15" \ + --learning_rate "3e-5" \ + --weight_decay "0.01" \ + --warmup_proportion "0.01" \ + --seed "0" \ + --device "gpu" diff --git a/examples/text_matching/diffcse/utils.py b/examples/text_matching/diffcse/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d70b434f82869675883fa245c7e1d58e2ffbec1d --- /dev/null +++ b/examples/text_matching/diffcse/utils.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
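+# Shared helpers for the DiffCSE example: random-seed setup, a masked_fill utility, and the Spearman-correlation evaluation metric.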
+ +import random +import numpy as np + +import paddle +from scipy import stats + + +def set_seed(seed=0): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def eval_metric(labels, preds): + spearman_corr = stats.spearmanr(labels, preds).correlation + return spearman_corr diff --git a/examples/text_matching/ernie_matching/README.md b/examples/text_matching/ernie_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dd24b9a664b8ca02de02db6eac4a45c13e7b98d9 --- /dev/null +++ b/examples/text_matching/ernie_matching/README.md @@ -0,0 +1,175 @@ +# 基于预训练模型 ERNIE-Gram 的单塔文本匹配 + +我们基于预训练模型 ERNIE-Gram 给出了单塔文本匹配的 2 种训练范式: Point-wise 和 Pair-wise。其中单塔 Point-wise 匹配模型适合直接对文本对进行 2 分类的应用场景: 例如判断 2 个文本是否为语义相似;Pair-wise 匹配模型适合将文本对相似度作为特征之一输入到上层排序模块进行排序的应用场景。 + +## 模型下载 +本项目使用语义匹配数据集 LCQMC 作为训练集 , 基于 ERNIE-Gram 预训练模型热启训练并开源了单塔 Point-wise 语义匹配模型, 用户可以直接基于这个模型对文本对进行语义匹配的 2 分类任务。 + +| 模型 | dev acc | +| ---- | ------- | +| [ERNIE-1.0-Base](https://bj.bcebos.com/paddlenlp/models/text_matching/ernie1.0_zh_pointwise_matching_model.tar) | 89.43 | +| [ERNIE-Gram-Base](https://bj.bcebos.com/paddlenlp/models/text_matching/ernie_gram_zh_pointwise_matching_model.tar) | 90.60 | + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 +| └── python +| └── predict.py # python 预测部署示例 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # Point-wise & Pair-wise 匹配模型组网 +├── data.py # Point-wise & Pair-wise 训练样本的转换逻辑 、Pair-wise 生成随机负例的逻辑 +├── train_pointwise.py # Point-wise 单塔匹配模型训练脚本 +├── train_pairwise.py # Pair-wise 单塔匹配模型训练脚本 +├── predict_pointwise.py # Point-wise 单塔匹配模型预测脚本,输出文本对是否相似: 0、1 分类 +├── predict_pairwise.py # Pair-wise 单塔匹配模型预测脚本,输出文本对的相似度打分 +└── train.py # 模型训练评估 +``` + +### 模型训练 + +我们以中文文本匹配公开数据集 LCQMC 为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行单塔 Point-wise 模型训练,并在开发集(dev.tsv)验证。Pair-wise 匹配模型只需要采用 `train_pairwise.py` 脚本即可。 + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" train_pointwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 +``` + +可支持配置的参数: + +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为3。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. 
+* `device`: 选用什么设备进行训练,可选cpu、gpu或npu。如使用gpu训练则参数gpus指定GPU卡号。 + +代码示例中使用的预训练模型是 ERNIE-Gram,如果想要使用其他预训练模型如 ERNIE, BERT,RoBERTa,Electra等,只需更换`model` 和 `tokenizer`即可。 + +```python + +# 使用 ERNIE-3.0-medium-zh 预训练模型 +model = AutoModel.from_pretrained('ernie-3.0-medium-zh') +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') + +# 使用 ERNIE-Gram 预训练模型 +model = AutoModel.from_pretrained('ernie-gram-zh') +tokenizer = AutoTokenizer.from_pretrained('ernie-gram-zh') + +# 使用 ERNIE 预训练模型 +# ernie-1.0 +#model = AutoModel.from_pretrained('ernie-1.0-base-zh')) +#tokenizer = AutoTokenizer.from_pretrained('ernie-1.0-base-zh') + +# ernie-tiny +# model = AutoModel.from_pretrained('ernie-tiny')) +# tokenizer = AutoTokenizer.from_pretrained('ernie-tiny') + + +# 使用 BERT 预训练模型 +# bert-base-chinese +# model = AutoModel.from_pretrained('bert-base-chinese') +# tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese') + +# bert-wwm-chinese +# model = AutoModel.from_pretrained('bert-wwm-chinese') +# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-chinese') + +# bert-wwm-ext-chinese +# model = AutoModel.from_pretrained('bert-wwm-ext-chinese') +# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-ext-chinese') + + +# 使用 RoBERTa 预训练模型 +# roberta-wwm-ext +# model = AutoModel.from_pretrained('roberta-wwm-ext') +# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext') + +# roberta-wwm-ext +# model = AutoModel.from_pretrained('roberta-wwm-ext-large') +# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext-large') + +``` +更多预训练模型,参考[transformers](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + +**NOTE:** +* 如需恢复模型训练,则可以设置`init_from_ckpt`, 如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。 +* 如需使用ernie-tiny模型,则需要提前先安装sentencepiece依赖,如`pip install sentencepiece` + +### 基于动态图模型预测 + +我们用 LCQMC 的测试集作为预测数据, 测试数据示例如下,: +```text +谁有狂三这张高清的 这张高清图,谁有 +英雄联盟什么英雄最好 英雄联盟最好英雄是什么 +这是什么意思,被蹭网吗 我也是醉了,这是什么意思 +现在有什么动画片好看呢? 现在有什么好看的动画片吗? +请问晶达电子厂现在的工资待遇怎么样要求有哪些 三星电子厂工资待遇怎么样啊 +``` + +启动预测: + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pointwise.py \ + --device gpu \ + --params_path "./checkpoints/model_4400/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'test.tsv' +``` + +输出预测结果如下: +```text +{'query': '谁有狂三这张高清的', 'title': '这张高清图,谁有', 'pred_label': 1} +{'query': '英雄联盟什么英雄最好', 'title': '英雄联盟最好英雄是什么', 'pred_label': 1} +{'query': '这是什么意思,被蹭网吗', 'title': '我也是醉了,这是什么意思', 'pred_label': 1} +{'query': '现在有什么动画片好看呢?', 'title': '现在有什么好看的动画片吗?', 'pred_label': 1} +{'query': '请问晶达电子厂现在的工资待遇怎么样要求有哪些', 'title': '三星电子厂工资待遇怎么样啊', 'pred_label': 0} +``` + +### 基于静态图部署预测 +#### 模型导出 +使用动态图训练结束之后,可以使用静态图导出工具 `export_model.py` 将动态图参数导出成静态图参数。 执行如下命令: + +```shell +python export_model.py --params_path checkpoints/model_300/model_state.pdparams --output_path=./output +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +#### 预测部署 +导出静态图模型之后,可以基于静态图模型进行预测,`deploy/python/predict.py` 文件提供了静态图预测示例。执行如下命令: + +```shell +python deploy/python/predict.py --model_dir ./output +``` + +## Reference +[1] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, Buzhou Tang, LCQMC: A Large-scale Chinese Question Matching Corpus,COLING2018. 
+[2] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. diff --git a/examples/text_matching/ernie_matching/data.py b/examples/text_matching/ernie_matching/data.py new file mode 100644 index 0000000000000000000000000000000000000000..aabe8f71a29be91d6d05e8266a5990feb62b413d --- /dev/null +++ b/examples/text_matching/ernie_matching/data.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] + + pos_inputs = tokenizer(text=query, text_pair=pos_title, max_seq_len=max_seq_length) + neg_inputs = tokenizer(text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported 
phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_examples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_examples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_examples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_examples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/examples/text_matching/ernie_matching/deploy/python/predict.py b/examples/text_matching/ernie_matching/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..aa9be63b026b69419c096b40a51ecf7562f13ef7 --- /dev/null +++ b/examples/text_matching/ernie_matching/deploy/python/predict.py @@ -0,0 +1,210 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = 
auto_log.AutoLogger( + model_name="ernie-tiny", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, data, tokenizer, label_map): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=self.max_seq_length, is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + probs = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + # probs = softmax(logits, axis=1) + idx = np.argmax(probs, axis=1) + idx = idx.tolist() + labels = [label_map[i] for i in idx] + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return labels + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + test_ds = load_dataset("lcqmc", splits=["test"]) + + data = [{"query": d["query"], "title": d["title"]} for d in test_ds] + + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + label_map = {0: "dissimilar", 1: "similar"} + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer, label_map)) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/text_matching/ernie_matching/export_model.py b/examples/text_matching/ernie_matching/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..696a700a3f61029dca0569d69549f9f09bda4087 --- /dev/null +++ b/examples/text_matching/ernie_matching/export_model.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from model import PointwiseMatching + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + model = PointwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/text_matching/ernie_matching/model.py b/examples/text_matching/ernie_matching/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ba412b9f78d6ab3671e9b3ee5857d1987001d4dc --- /dev/null +++ b/examples/text_matching/ernie_matching/model.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class PointwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + cls_embedding = self.dropout(cls_embedding) + logits = self.classifier(cls_embedding) + + return logits + + +class PairwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.1): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.margin = margin + + # hidden_size -> 1, calculate similarity + self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1) + + def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + cls_embedding = self.dropout(cls_embedding) + sim_score = self.similarity(cls_embedding) + sim_score = F.sigmoid(sim_score) + + return sim_score + + def forward( + self, + pos_input_ids, + neg_input_ids, + pos_token_type_ids=None, + neg_token_type_ids=None, + pos_position_ids=None, + neg_position_ids=None, + pos_attention_mask=None, + neg_attention_mask=None, + ): + + _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask) + + _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask) + + pos_embedding = self.dropout(pos_cls_embedding) + neg_embedding = self.dropout(neg_cls_embedding) + + pos_sim = self.similarity(pos_embedding) + neg_sim = self.similarity(neg_embedding) + + pos_sim = F.sigmoid(pos_sim) + neg_sim = F.sigmoid(neg_sim) + + labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32") + + loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin) + + return loss diff --git a/examples/text_matching/ernie_matching/predict_pairwise.py b/examples/text_matching/ernie_matching/predict_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..042089eec670e0bc2ea385593ac0e5d402539d4b --- /dev/null +++ b/examples/text_matching/ernie_matching/predict_pairwise.py @@ -0,0 +1,105 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
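+# Pair-wise inference entry point: loads a PairwiseMatching checkpoint from --params_path and prints a sigmoid similarity score (pred_prob) for each query/title pair read from --input_file.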
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pairwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PairwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model.predict(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_probs): + text_pair = valid_ds[idx] + text_pair["pred_prob"] = prob[0] + print(text_pair) diff --git a/examples/text_matching/ernie_matching/predict_pointwise.py b/examples/text_matching/ernie_matching/predict_pointwise.py new file mode 100644 index 
0000000000000000000000000000000000000000..7931087bfb99586e7a20106d6f6560ee012201c5 --- /dev/null +++ b/examples/text_matching/ernie_matching/predict_pointwise.py @@ -0,0 +1,105 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PointwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PointwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + y_preds = np.argmax(y_probs, axis=1) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + for idx, y_pred in enumerate(y_preds): + text_pair = valid_ds[idx] + text_pair["pred_label"] = y_pred + print(text_pair) diff --git a/examples/text_matching/ernie_matching/train_pairwise.py b/examples/text_matching/ernie_matching/train_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..32d16eef867adb2fdd0b41bf4551218384092125 --- /dev/null +++ b/examples/text_matching/ernie_matching/train_pairwise.py @@ -0,0 +1,192 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
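+# Pair-wise training entry point: gen_pair builds random negatives from LCQMC, the PairwiseMatching model is optimized with a margin ranking loss, and the dev split is scored with AUC.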
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_pairwise_example as convert_example +from data import create_dataloader, gen_pair +from model import PairwiseMatching + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.2, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=100, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument('--max_step', default=10000, type=int, help="Max steps for training.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
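+ phase(obj:`str`): Name of the dataset split being evaluated; only used in the printed log message.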
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict(input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.2}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + train_ds = gen_pair(train_ds) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func_train = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="eval") + + batchify_fn_train = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pos_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pos_pair_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # neg_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # neg_pair_segment + ): [data for data in fn(samples)] + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn_train, trans_fn=trans_func_train + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + metric = paddle.metric.Auc() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids = batch + + loss = model( + pos_input_ids=pos_input_ids, + neg_input_ids=neg_input_ids, + pos_token_type_ids=pos_token_type_ids, + neg_token_type_ids=neg_token_type_ids, + ) + + global_step += 1 + + if global_step > args.max_step: + print("Training steps have achieved max_step, training is stopped.") + return + + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/ernie_matching/train_pointwise.py b/examples/text_matching/ernie_matching/train_pointwise.py new file mode 100644 index 0000000000000000000000000000000000000000..24cb9c78428e9cc919579d969e048def2b99a7cb --- /dev/null +++ b/examples/text_matching/ernie_matching/train_pointwise.py @@ -0,0 +1,187 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader +from model import PointwiseMatching + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=100, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument('--max_step', default=10000, type=int, help="Max steps for training.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu', "npu"], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval {} loss: {:.5}, accu: {:.5}".format(phase, np.mean(losses), accu)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PointwiseMatching(pretrained_model) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + + if global_step > args.max_step: + print("Training steps have achieved max_step, training is stopped.") + return + + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/question_matching/README.md b/examples/text_matching/question_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..24f037843a4dd4bf67b6437ba476ba684038e13f --- /dev/null +++ b/examples/text_matching/question_matching/README.md @@ -0,0 +1,144 @@ +# 千言-问题匹配鲁棒性评测基线 + +我们基于预训练模型 ERNIE-Gram 结合正则化策略 [R-Drop](https://arxiv.org/abs/2106.14448) 在 [2021 CCF BDCI 千言-问题匹配鲁棒性评测](https://aistudio.baidu.com/aistudio/competition/detail/116/0/introduction) 竞赛上建立了 Baseline 方案和评测结果。 + +## 赛题背景 + +问题匹配(Question Matching)任务旨在判断两个自然问句之间的语义是否等价,是自然语言处理领域一个重要研究方向。问题匹配同时也具有很高的商业价值,在信息检索、智能客服等领域发挥重要作用。 + +近年来,神经网络模型虽然在一些标准的问题匹配评测集合上已经取得与人类相仿甚至超越人类的准确性,但在处理真实应用场景问题时,性能大幅下降,在简单(人类很容易判断)的问题上无法做出正确判断(如下图),影响产品体验的同时也会造成相应的经济损失。 + +| 问题1 | 问题2 | 标签(Label) | Model | +| :----------------: | :------------------: | :---------: | :-----: | +| 婴儿吃什么蔬菜好 | 婴儿吃什么`绿色`蔬菜好 | 0 | 1 | +| 关于`牢房`的电视剧 | 关于`监狱`的电视剧 | 1 | 0 | +| 心率过`快`有什么问题 | 心率过`慢`有什么问题 | 0 | 1 | +| 黑色`裤子`配什么`上衣` | 黑色`上衣`配什么`裤子` | 0 | 1 | + +当前大多数问题匹配任务采用单一指标,在同分布的测试集上评测模型的好坏,这种评测方式可能夸大了模型能力,并且缺乏对模型鲁棒性的细粒度优劣势评估。本次评测关注问题匹配模型在真实应用场景中的鲁棒性,从词汇理解、句法结构、错别字、口语化、对话理解五个维度检测模型的能力,从而发现模型的不足之处,推动语义匹配技术的发展。本次竞赛主要基于[千言数据集](https://luge.ai),采用的数据集包括哈尔滨工业大学(深圳)的LCQMC和BQ数据集、OPPO的小布对话短文本数据集以及百度的DuQM数据集,期望从多维度、多领域出发,全面评价模型的鲁棒性,进一步提升问题匹配技术的研究水平。本次竞赛将在第九届“CCF大数据与计算智能大赛”举办技术交流论坛和颁奖仪式,诚邀学术界和工业界的研究者和开发者参加本次竞赛! 
+ +## 基线评测效果 +本项目分别基于ERNIE-1.0、Bert-base-chinese、ERNIE-Gram 3 个中文预训练模型训练了单塔 Point-wise 的匹配模型, 基于 ERNIE-Gram 的模型效果显著优于其它 2 个预训练模型。 + +此外,在 ERNIE-Gram 模型基础上我们也对最新的正则化策略 [R-Drop](https://arxiv.org/abs/2106.14448) 进行了相关评测, [R-Drop](https://arxiv.org/abs/2106.14448) 策略的核心思想是针对同 1 个训练样本过多次前向网络得到的输出加上正则化的 Loss 约束。 + +我们开源了效果最好的 2 个策略对应模型的 checkpoint 作为本次比赛的基线方案: 基于 ERNIE-Gram 预训练模型 R-Drop 系数分别为 0.0 和 0.1 的 2 个模型, 用户可以下载相应的模型来复现我们的评测结果。 + +| 模型 | rdrop_coef | dev acc | test-A acc | test-B acc| +| ---- | ---- |-----|--------|------- | +| ernie-1.0-base |0.0| 86.96 |76.20 | 77.50| +| bert-base-chinese |0.0| 86.93| 76.90 |77.60 | +| [ernie-gram-zh](https://bj.bcebos.com/paddlenlp/models/text_matching/question_matching_rdrop0p0_baseline_model.tar) | 0.0 |87.66 | **80.80** | **81.20** | +| [ernie-gram-zh](https://bj.bcebos.com/paddlenlp/models/text_matching/question_matching_rdrop0p1_baseline_model.tar) | 0.1 |87.91 | 80.20 | 80.80 | +| ernie-gram-zh | 0.2 |87.47 | 80.10 | 81.00 | + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: +``` +question_matching/ +├── model.py # 匹配模型组网 +├── data.py # 训练样本的数据读取、转换逻辑 +├── predict.py # 模型预测脚本,输出测试集的预测结果: 0,1 +└── train.py # 模型训练评估 +``` + +### 数据准备 +本项目使用竞赛提供的 LCQMC、BQ、OPPO 这 3 个数据集的训练集合集作为训练集,使用这 3 个数据集的验证集合集作为验证集。 + +运行如下命令生成本项目所使用的训练集和验证集,您在参赛过程中可以探索采取其它的训练集和验证集组合,不需要和基线方案完全一致。 +```shell +cat ./data/train/LCQMC/train ./data/train/BQ/train ./data/train/OPPO/train > train.txt +cat ./data/train/LCQMC/dev ./data/train/BQ/dev ./data/train/OPPO/dev > dev.txt +``` +训练集数据格式为 3 列: text_a \t text_b \t label, 样例数据如下: +```text +喜欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 1 +我手机丢了,我想换个手机 我想买个新手机,求推荐 1 +大家觉得她好看吗 大家觉得跑男好看吗? 0 +求秋色之空漫画全集 求秋色之空全集漫画 1 +晚上睡觉带着耳机听音乐有什么害处吗? 孕妇可以戴耳机听音乐吗? 0 +``` +验证集的数据格式和训练集相同,样例如下: +``` +开初婚未育证明怎么弄? 初婚未育情况证明怎么开? 1 +谁知道她是网络美女吗? 爱情这杯酒谁喝都会醉是什么歌 0 +男孩喝女孩的尿的故事 怎样才知道是生男孩还是女孩 0 +这种图片是用什么软件制作的? 这种图片制作是用什么软件呢? 1 +``` + +### 模型训练 +运行如下命令,即可复现本项目中基于 ERNIE-Gram 的基线模型: + +```shell +$unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0,1,2,3" train.py \ + --train_set train.txt \ + --dev_set dev.txt \ + --device gpu \ + --eval_step 100 \ + --save_dir ./checkpoints \ + --train_batch_size 32 \ + --learning_rate 2E-5 \ + --rdrop_coef 0.0 +``` + +可支持配置的参数: +* `train_set`: 训练集的文件。 +* `dev_set`:验证集数据文件。 +* `rdrop_coef`:可选,控制 R-Drop 策略正则化 KL-Loss 的系数;默认为 0.0, 即不使用 R-Drop 策略。 +* `train_batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为3。 +* `warmup_proption`:可选,学习率 warmup 策略的比例,如果 0.1,则学习率会在前 10% 训练 step 的过程中从 0 慢慢增长到 learning_rate, 而后再缓慢衰减,默认为 0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000。 +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 + +训练过程中每一次在验证集上进行评估之后,程序会根据验证集的评估指标是否优于之前最优的模型指标来决定是否存储当前模型,如果优于之前最优的验证集指标则会存储当前模型,否则则不存储,因此训练过程结束之后,模型存储路径下 step 数最大的模型则对应验证集指标最高的模型, 一般我们选择验证集指标最高的模型进行预测。 + +如: +```text +checkpoints/ +├── model_10000 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... 
+``` + +**NOTE:** +* 如需恢复模型训练,则可以设置`init_from_ckpt`, 如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。 + + +### 开始预测 +训练完成后,在指定的 checkpoints 路径下会自动存储在验证集评估指标最高的模型,运行如下命令开始生成预测结果: +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/model_10000/model_state.pdparams" \ + --batch_size 128 \ + --input_file "${test_set}" \ + --result_file "predict_result" +``` + +输出预测结果示例如下: +```text +0 +1 +0 +1 +``` +### 提交进行评测 +提交预测结果进行评测 + +## Reference +[1] Liang, Xiaobo, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. “R-Drop: Regularized Dropout for Neural Networks.” ArXiv:2106.14448 [Cs], June 28, 2021. http://arxiv.org/abs/2106.14448. diff --git a/examples/text_matching/question_matching/data.py b/examples/text_matching/question_matching/data.py new file mode 100644 index 0000000000000000000000000000000000000000..19560ab78a86b62a33d774c8863e47290e467be7 --- /dev/null +++ b/examples/text_matching/question_matching/data.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"query1": data[0], "query2": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"query1": data[0], "query2": data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query1"], example["query2"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids diff --git a/examples/text_matching/question_matching/model.py b/examples/text_matching/question_matching/model.py new file mode 100644 index 0000000000000000000000000000000000000000..8c28ac57b70d237dbe7369af58ed4906f8579850 --- /dev/null +++ b/examples/text_matching/question_matching/model.py @@ -0,0 +1,47 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle.nn as nn + +import paddlenlp as ppnlp + + +class QuestionMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, rdrop_coef=0.0): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + self.rdrop_coef = rdrop_coef + self.rdrop_loss = ppnlp.losses.RDropLoss() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, do_evaluate=False): + + _, cls_embedding1 = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + cls_embedding1 = self.dropout(cls_embedding1) + logits1 = self.classifier(cls_embedding1) + + # For more information about R-drop please refer to this paper: https://arxiv.org/abs/2106.14448 + # Original implementation please refer to this code: https://github.com/dropreg/R-Drop + if self.rdrop_coef > 0 and not do_evaluate: + _, cls_embedding2 = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + cls_embedding2 = self.dropout(cls_embedding2) + logits2 = self.classifier(cls_embedding2) + kl_loss = self.rdrop_loss(logits1, logits2) + else: + kl_loss = 0.0 + + return logits1, kl_loss diff --git a/examples/text_matching/question_matching/predict.py b/examples/text_matching/question_matching/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3d6f350a48f7bdb4603c23c18b7c4ebe122b4922 --- /dev/null +++ b/examples/text_matching/question_matching/predict.py @@ -0,0 +1,103 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
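+
+"""Generates 0/1 predictions for a test file of text pairs with a trained QuestionMatching checkpoint and writes them to the result file."""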
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import QuestionMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--result_file", type=str, required=True, help="The result file name") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=256, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`QuestionMatching`): A model to calculate whether the question pair is semantic similar or not. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_logits = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_logit, _ = model(input_ids=input_ids, token_type_ids=token_type_ids) + + batch_logits.append(batch_logit.numpy()) + + batch_logits = np.concatenate(batch_logits, axis=0) + + return batch_logits + + +if __name__ == "__main__": + paddle.set_device(args.device) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + test_ds = load_dataset(read_text_pair, data_path=args.input_file, is_test=True, lazy=False) + + test_data_loader = create_dataloader( + test_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = QuestionMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, test_data_loader) + y_preds = np.argmax(y_probs, axis=1) + + with open(args.result_file, "w", encoding="utf-8") as f: + for y_pred in y_preds: + f.write(str(y_pred) + "\n") diff --git a/examples/text_matching/question_matching/train.py b/examples/text_matching/question_matching/train.py new file mode 100644 index 
0000000000000000000000000000000000000000..336dc663666215da4318074180cfc83cdcbfd3d3 --- /dev/null +++ b/examples/text_matching/question_matching/train.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import QuestionMatching + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--train_set", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--dev_set", type=str, required=True, help="The full path of dev_set_file") +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=256, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--max_steps', default=-1, type=int, help="If > 0, set total number of training steps to perform.") +parser.add_argument("--train_batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--eval_batch_size", default=128, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=100, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--rdrop_coef", default=0.0, type=float, help="The coefficient of KL-Divergence loss in R-Drop paper, for more detail please refer to https://arxiv.org/abs/2106.14448), if rdrop_coef > 0 then R-Drop works") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, 
criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + total_num = 0 + + for batch in data_loader: + input_ids, token_type_ids, labels = batch + total_num += len(labels) + logits, _ = model(input_ids=input_ids, token_type_ids=token_type_ids, do_evaluate=True) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + + print("dev_loss: {:.5}, accuracy: {:.5}, total_num:{}".format(np.mean(losses), accu, total_num)) + model.train() + metric.reset() + return accu + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set, is_test=False, lazy=False) + + dev_ds = load_dataset(read_text_pair, data_path=args.dev_set, is_test=False, lazy=False) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.train_batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.eval_batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = QuestionMatching(pretrained_model, rdrop_coef=args.rdrop_coef) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + + metric = paddle.metric.Accuracy() + + global_step = 0 + best_accuracy = 0.0 + + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits1, kl_loss = model(input_ids=input_ids, token_type_ids=token_type_ids) + correct = metric.compute(logits1, labels) + metric.update(correct) + acc = metric.accumulate() + + ce_loss = criterion(logits1, labels) + if kl_loss > 0: + loss = ce_loss + kl_loss * args.rdrop_coef + else: + loss = ce_loss + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.4f, ce_loss: %.4f., kl_loss: %.4f, accu: %.4f, speed: %.2f step/s" + % (global_step, epoch, step, loss, ce_loss, kl_loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + accuracy = evaluate(model, criterion, metric, dev_data_loader) + if accuracy > best_accuracy: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + best_accuracy = accuracy + + if global_step == args.max_steps: + return + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/sentence_transformers/README.md b/examples/text_matching/sentence_transformers/README.md new file mode 100644 index 0000000000000000000000000000000000000000..53133e788470860ae9f7818befbfee10535ca0e4 --- /dev/null +++ b/examples/text_matching/sentence_transformers/README.md @@ -0,0 +1,182 @@ +# 使用预训练模型Fine-tune完成中文文本匹配任务 + +随着深度学习的发展,模型参数的数量飞速增长。为了训练这些参数,需要更大的数据集来避免过拟合。然而,对于大部分NLP任务来说,构建大规模的标注数据集非常困难(成本过高),特别是对于句法和语义相关的任务。相比之下,大规模的未标注语料库的构建则相对容易。为了利用这些数据,我们可以先从其中学习到一个好的表示,再将这些表示应用到其他任务中。最近的研究表明,基于大规模未标注语料库的预训练模型(Pretrained Models, PTM) 在NLP任务上取得了很好的表现。 + +近年来,大量的研究表明基于大型语料库的预训练模型(Pretrained Models, PTM)可以学习通用的语言表示,有利于下游NLP任务,同时能够避免从零开始训练模型。随着计算能力的发展,深度模型的出现(即 Transformer)和训练技巧的增强使得 PTM 不断发展,由浅变深。 + +百度的预训练模型ERNIE经过海量的数据训练后,其特征抽取的工作已经做的非常好。借鉴迁移学习的思想,我们可以利用其在海量数据中学习的语义信息辅助小数据集(如本示例中的医疗文本数据集)上的任务。 + +<center> <img width="600px" src="https://ai-studio-static-online.cdn.bcebos.com/d96c602338044ee8bcd4171f38ea6d49506d1f3253f3496b802ec56cb654ecf5" /> </center> + +使用预训练模型ERNIE完成文本匹配任务,大家可能会想到将query和title文本拼接,之后输入ERNIE中,取`CLS`特征(pooled_output),之后输出全连接层,进行二分类。如下图ERNIE用于句对分类任务的用法: + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/45440029c07240ad89d665c5b176e63297e9584e1da24e02b79dd54fb990f74a" width='30%'/> <br /> +</p> + +然而,以上用法的问题在于,**ERNIE的模型参数非常庞大,导致计算量非常大,预测的速度也不够理想**。从而达不到线上业务的要求。针对该问题,可以使用PaddleNLP工具搭建Sentence Transformer网络。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/103998703e134a7184883511a538620e16fed045e2614dcc8afacec446600438" width='30%'/> <br /> +</p> + +Sentence Transformer采用了双塔(Siamese)的网络结构。Query和Title分别输入ERNIE,共享一个ERNIE参数,得到各自的token embedding特征。之后对token 
embedding进行pooling(此处教程使用mean pooling操作),将输出分别记作u,v,再将三个表征(u,v,|u-v|)拼接起来进行二分类。网络结构如上图所示。
+
+更多关于Sentence Transformer的信息可以参考论文:https://arxiv.org/abs/1908.10084
+
+**同时,不仅可以使用ERNIE作为文本语义特征提取器,也可以使用BERT/RoBERTa/Electra等模型作为文本语义特征提取器。**
+
+**那么Sentence Transformer采用Siamese的网络结构,是如何提升预测速度的呢?**
+
+**Siamese网络结构的好处在于query和title分别输入同一套网络。例如在信息检索任务中,可以将数据库中的title文本提前计算好对应的sequence_output特征并保存在数据库中。当用户搜索query时,只需计算query的sequence_output特征,再与数据库中保存的title sequence_output特征一起,通过一个简单的mean pooling和全连接层完成二分类即可,从而大幅提升预测效率,同时也保障了模型效果。**
+
+关于匹配任务常用的Siamese网络结构可以参考:https://blog.csdn.net/thriving_fcl/article/details/73730552
+
+PaddleNLP提供了丰富的预训练模型,并且可以便捷地获取PaddlePaddle生态下的所有预训练模型。下面展示如何使用PaddleNLP一键加载ERNIE,优化文本匹配任务。
+
+## 模型简介
+
+本项目针对中文文本匹配问题,开源了一系列模型,供用户可配置地使用:
+
++ BERT([Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805))中文模型,简写`bert-base-chinese`,由12层Transformer网络组成。
++ ERNIE([Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223)),支持ERNIE 1.0中文模型(简写`ernie-1.0`)和ERNIE Tiny中文模型(简写`ernie-tiny`)。
+   其中`ernie`由12层Transformer网络组成,`ernie-tiny`由3层Transformer网络组成。
++ RoBERTa([A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)),支持12层Transformer网络的`roberta-wwm-ext`。
+
+| 模型 | dev acc | test acc |
+| ---- | ------- | -------- |
+| bert-base-chinese | 0.86537 | 0.84440 |
+| bert-wwm-chinese | 0.86333 | 0.84128 |
+| bert-wwm-ext-chinese | 0.86049 | 0.83848 |
+| ernie-1.0 | 0.87480 | 0.84760 |
+| ernie-tiny | 0.86071 | 0.83352 |
+| roberta-wwm-ext | 0.87526 | 0.84904 |
+| rbt3 | 0.85367 | 0.83464 |
+| rbtl3 | 0.85174 | 0.83744 |
+
+## 快速开始
+
+### 代码结构说明
+
+以下是本项目主要代码结构及说明:
+
+```text
+sentence_transformers/
+├── model.py # Sentence Transformer 组网文件
+├── README.md # 文本说明
+└── train.py # 模型训练评估
+```
+
+### 模型训练
+
+我们以中文文本匹配公开数据集LCQMC为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)上验证。
+```shell
+$ unset CUDA_VISIBLE_DEVICES
+$ python -m paddle.distributed.launch --gpus "0" train.py --device gpu --save_dir ./checkpoints
+```
+
+可支持配置的参数:
+
+* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。
+* `max_seq_length`:可选,ERNIE/BERT模型使用的最大序列长度,最大不能超过512,若出现显存不足,请适当调低这一参数;默认为128。
+* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。
+* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。
+* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。
+* `epochs`: 训练轮次,默认为3。
+* `warmup_proportion`:可选,学习率warmup策略的比例,如果设为0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate,而后再缓慢衰减;默认为0.0。
+* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。
+* `seed`:可选,随机种子,默认为1000。
+* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。
+
+代码示例中使用的预训练模型是ERNIE,如果想要使用其他预训练模型如BERT、RoBERTa、Electra等,只需更换`model`和`tokenizer`即可。
+
+```python
+# 使用 ERNIE 预训练模型
+# ernie-3.0-medium-zh
+model = AutoModel.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+
+# ernie-1.0
+# model = AutoModel.from_pretrained('ernie-1.0-base-zh')
+# tokenizer = AutoTokenizer.from_pretrained('ernie-1.0-base-zh')
+
+# ernie-tiny
+# model = AutoModel.from_pretrained('ernie-tiny')
+# tokenizer = AutoTokenizer.from_pretrained('ernie-tiny')
+
+
+# 使用 BERT 预训练模型
+# bert-base-chinese
+# model = AutoModel.from_pretrained('bert-base-chinese')
+# tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
+
+# bert-wwm-chinese
+# model = AutoModel.from_pretrained('bert-wwm-chinese')
+# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-chinese')
+
+# bert-wwm-ext-chinese
+# model = AutoModel.from_pretrained('bert-wwm-ext-chinese')
+# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-ext-chinese')
+
+
+# 使用 RoBERTa 预训练模型
+# roberta-wwm-ext
+# model = AutoModel.from_pretrained('roberta-wwm-ext')
+# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext')
+
+# roberta-wwm-ext-large
+# model = AutoModel.from_pretrained('roberta-wwm-ext-large')
+# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext-large')
+
+```
+更多预训练模型,参考[transformers](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer)。
+
+程序运行时将会自动进行训练、评估、测试。同时训练过程中会自动保存模型在指定的`save_dir`中。
+如:
+```text
+checkpoints/
+├── model_100
+│   ├── model_config.json
+│   ├── model_state.pdparams
+│   ├── tokenizer_config.json
+│   └── vocab.txt
+└── ...
+```
+
+**NOTE:**
+* 如需恢复模型训练,则可以设置`init_from_ckpt`,如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。
+* 如需使用ernie-tiny模型,则需要提前安装sentencepiece依赖,如`pip install sentencepiece`。
+
+### 模型预测
+
+启动预测:
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict.py --device gpu --params_path checkpoints/model_400/model_state.pdparams
+```
+
+待预测数据示例如下:
+
+```text
+世界上什么东西最小 世界上什么东西最小?
+光眼睛大就好看吗 眼睛好看吗?
+小蝌蚪找妈妈怎么样 小蝌蚪找妈妈是谁画的 +``` + +可以直接调用`predict`函数即可输出预测结果。 + +如 + +```text +Data: ['世界上什么东西最小', '世界上什么东西最小?'] Label: similar +Data: ['光眼睛大就好看吗', '眼睛好看吗?'] Label: dissimilar +Data: ['小蝌蚪找妈妈怎么样', '小蝌蚪找妈妈是谁画的'] Label: dissimilar +``` + + +## Reference + +关于Sentence Transformer更多信息参考[www.SBERT.net](https://www.sbert.net)以及论文: +- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019) +- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020) +- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (arXiv 2020) diff --git a/examples/text_matching/sentence_transformers/deploy/simple_serving/README.md b/examples/text_matching/sentence_transformers/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..37f76205b1e7b6386ecc4f23ae3a656b023c97ba --- /dev/null +++ b/examples/text_matching/sentence_transformers/deploy/simple_serving/README.md @@ -0,0 +1,39 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp >= 2.4.4 +``` +## Server服务启动 +### 分类任务启动 +#### 启动分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` + +#### 启动分类 Client 服务 +```bash +python client.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size`, `prob_limit` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs, + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'prob_limit': args.prob_limit + } + } +``` diff --git a/examples/text_matching/sentence_transformers/deploy/simple_serving/client.py b/examples/text_matching/sentence_transformers/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..08de26f80d8933bb8383e879a292a1eb1c2038fa --- /dev/null +++ b/examples/text_matching/sentence_transformers/deploy/simple_serving/client.py @@ -0,0 +1,44 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." 
+)
+parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.")
+parser.add_argument("--prob_limit", default=0.5, type=float, help="Probability threshold above which a pair is predicted as similar.")
+args = parser.parse_args()
+
+url = "http://0.0.0.0:8189/models/text_matching"
+headers = {"Content-Type": "application/json"}
+
+if __name__ == "__main__":
+    texts = ["三亚是一个美丽的城市", "北京烤鸭怎么样"]
+    text_pair = ["三亚是个漂亮的城市", "北京烤鸭多少钱"]
+
+    data = {
+        "data": {
+            "text": texts,
+            "text_pair": text_pair,
+        },
+        "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "prob_limit": args.prob_limit},
+    }
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    result_json = json.loads(r.text)
+    print(result_json)
diff --git a/examples/text_matching/sentence_transformers/deploy/simple_serving/server.py b/examples/text_matching/sentence_transformers/deploy/simple_serving/server.py
new file mode 100644
index 0000000000000000000000000000000000000000..61356d61f05e6b8878747c74e6c23c57385f84f7
--- /dev/null
+++ b/examples/text_matching/sentence_transformers/deploy/simple_serving/server.py
@@ -0,0 +1,135 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np
+from scipy.special import softmax
+
+from paddlenlp import SimpleServer
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.server import BaseModelHandler, BasePostHandler
+
+
+class TextMatchingModelHandler(BaseModelHandler):
+    def __init__(self):
+        super().__init__()
+
+    @classmethod
+    def process(cls, predictor, tokenizer, data, parameters):
+
+        max_seq_len = 128
+        batch_size = 1
+        if "max_seq_len" in parameters:
+            max_seq_len = parameters["max_seq_len"]
+        if "batch_size" in parameters:
+            batch_size = parameters["batch_size"]
+        text = None
+        if "text" in data:
+            text = data["text"]
+        if text is None:
+            return {}
+        if isinstance(text, str):
+            text = [text]
+        has_pair = False
+        if "text_pair" in data and data["text_pair"] is not None:
+            text_pair = data["text_pair"]
+            if isinstance(text_pair, str):
+                text_pair = [text_pair]
+            if len(text) != len(text_pair):
+                raise ValueError("The length of text and text_pair must be the same.")
+            has_pair = True
+
+        # Get the result of tokenizer
+        examples = []
+        for idx, _ in enumerate(text):
+            if has_pair:
+                text_a = tokenizer(text=text[idx], max_length=max_seq_len)
+                text_b = tokenizer(text=text_pair[idx], max_length=max_seq_len)
+
+                examples.append((text_a["input_ids"], text_b["input_ids"]))
+
+        # Separates data into some batches.
+ batches = [examples[i : i + batch_size] for i in range(0, len(examples), batch_size)] + + def batchify_fn(samples): + return Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + )(samples) + + results = [[]] * predictor._output_num + for batch in batches: + query_input_ids, title_input_ids = batchify_fn(batch) + if predictor._predictor_type == "paddle_inference": + predictor._input_handles[0].copy_from_cpu(query_input_ids) + predictor._input_handles[1].copy_from_cpu(title_input_ids) + predictor._predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in predictor._output_handles] + for i, out in enumerate(output): + results[i].append(out) + print(results) + + # Resolve the logits result and get the predict label and confidence + results_concat = [] + for i in range(0, len(results)): + results_concat.append(np.concatenate(results[i], axis=0)) + + out_dict = {"logits": results_concat[0].tolist(), "data": data} + + return out_dict + + +class TextMatchingPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + if "logits" not in data: + raise ValueError( + "The output of model handler do not include the 'logits', " + " please check the model handler output. The model handler output:\n{}".format(data) + ) + + prob_limit = 0.5 + if "prob_limit" in parameters: + prob_limit = parameters["prob_limit"] + logits = data["logits"] + # softmax for probs + logits = softmax(logits, axis=-1) + + print(logits) + + labels = [] + probs = [] + for logit in logits: + if logit[1] > prob_limit: + labels.append(1) + else: + labels.append(0) + probs.append(logit[1]) + + out_dict = {"label": labels, "similarity": probs} + return out_dict + + +app = SimpleServer() +app.register( + task_name="models/text_matching", + model_path="../../export_model", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=TextMatchingModelHandler, + post_handler=TextMatchingPostHandler, + precision="fp32", + device_id=0, +) diff --git a/examples/text_matching/sentence_transformers/export_model.py b/examples/text_matching/sentence_transformers/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..2f132929ee758fd9881db29e5c232237e0a0a696 --- /dev/null +++ b/examples/text_matching/sentence_transformers/export_model.py @@ -0,0 +1,101 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default="ernie-1.0", help="The path to model parameters to be loaded.") +parser.add_argument( + "--output_path", type=str, default="./export", help="The path of model parameter in static graph to be saved." 
+) +args = parser.parse_args() + + +class SentenceTransformer(nn.Layer): + def __init__(self, pretrained_model, dropout=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"] * 3, 2) + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + query_token_embedding, _ = self.ptm( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = self.ptm( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = paddle.concat([query_mean, title_mean, sub], axis=-1) + + logits = self.classifier(projection) + + return logits + + +if __name__ == "__main__": + + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + pretrained_model = AutoModel.from_pretrained(args.params_path) + + model = SentenceTransformer(pretrained_model) + model.eval() + + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="query_input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="title_input_ids"), + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "float32") + paddle.jit.save(model, save_path) diff --git a/examples/text_matching/sentence_transformers/model.py b/examples/text_matching/sentence_transformers/model.py new file mode 100644 index 0000000000000000000000000000000000000000..4e132e557b3237875c9c910a18f96954d88a0f20 --- /dev/null +++ b/examples/text_matching/sentence_transformers/model.py @@ -0,0 +1,69 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + + +class SentenceTransformer(nn.Layer): + def __init__(self, pretrained_model, dropout=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"] * 3, 2) + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + query_token_embedding, _ = self.ptm( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = self.ptm( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = paddle.concat([query_mean, title_mean, sub], axis=-1) + + logits = self.classifier(projection) + + return logits diff --git a/examples/text_matching/sentence_transformers/predict.py b/examples/text_matching/sentence_transformers/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..b9294ff2c0f86d37d4845508639dc0e41f17a92e --- /dev/null +++ b/examples/text_matching/sentence_transformers/predict.py @@ -0,0 +1,158 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
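+
+"""Loads a trained SentenceTransformer checkpoint and predicts similar/dissimilar labels for a few example text pairs."""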
+ +import argparse +import os + +import paddle +from model import SentenceTransformer + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default='./checkpoint/model_2700/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=50, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + - pair of sequences: ``[CLS] A [SEP] B [SEP]`` + + A BERT sequence pair mask has the following format: + :: + 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 + | first sequence | second sequence | + + If only one sequence, only returns the first portion of the mask (0's). + + + Args: + example(obj:`list[str]`): List of input data, containing query, title and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + + Returns: + query_input_ids(obj:`list[int]`): The list of query token ids. + query_token_type_ids(obj: `list[int]`): List of query sequence pair mask. + title_input_ids(obj:`list[int]`): The list of title token ids. + title_token_type_ids(obj: `list[int]`): List of title sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + query, title = example[0], example[1] + + query_encoded_inputs = tokenizer(text=query, max_seq_len=max_seq_length) + query_input_ids = query_encoded_inputs["input_ids"] + query_token_type_ids = query_encoded_inputs["token_type_ids"] + + title_encoded_inputs = tokenizer(text=title, max_seq_len=max_seq_length) + title_input_ids = title_encoded_inputs["input_ids"] + title_token_type_ids = title_encoded_inputs["token_type_ids"] + + return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids + + +def predict(model, data, tokenizer, label_map, batch_size=1): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. 
Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text_pair in data: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = convert_example( + text_pair, tokenizer, max_seq_length=args.max_seq_length + ) + examples.append((query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids)) + + # Separates data into some batches. + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batchify_fn(batch) + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + probs = model( + query_input_ids, + title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + data = [ + ["世界上什么东西最小", "世界上什么东西最小?"], + ["光眼睛大就好看吗", "眼睛好看吗?"], + ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], + ] + label_map = {0: "dissimilar", 1: "similar"} + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + model = SentenceTransformer(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + results = predict(model, data, tokenizer, label_map, batch_size=args.batch_size) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_matching/sentence_transformers/train.py b/examples/text_matching/sentence_transformers/train.py new file mode 100644 index 0000000000000000000000000000000000000000..d681bab0f91b38de06ac0bce5e450fb663d9e5fb --- /dev/null +++ b/examples/text_matching/sentence_transformers/train.py @@ -0,0 +1,239 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
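+
+"""Fine-tunes a Siamese (Sentence Transformer style) text matching model on the LCQMC dataset and evaluates it on the dev set."""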
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from model import SentenceTransformer + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, labels = batch + logits = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu)) + model.train() + metric.reset() + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + - pair of sequences: ``[CLS] A [SEP] B [SEP]`` + + A BERT sequence pair mask has the following format: + :: + 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 + | first sequence | second sequence | + + If only one sequence, only returns the first portion of the mask (0's). 
+ + + Args: + example(obj:`list[str]`): List of input data, containing query, title and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + query_input_ids(obj:`list[int]`): The list of query token ids. + query_token_type_ids(obj: `list[int]`): List of query sequence pair mask. + title_input_ids(obj:`list[int]`): The list of title token ids. + title_token_type_ids(obj: `list[int]`): List of title sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + query, title = example["query"], example["title"] + + query_encoded_inputs = tokenizer(text=query, max_seq_len=max_seq_length) + query_input_ids = query_encoded_inputs["input_ids"] + query_token_type_ids = query_encoded_inputs["token_type_ids"] + + title_encoded_inputs = tokenizer(text=title, max_seq_len=max_seq_length) + title_input_ids = title_encoded_inputs["input_ids"] + title_token_type_ids = title_encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label + else: + return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SentenceTransformer(pretrained_model) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = 
paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, labels = batch + logits = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + evaluate(model, criterion, metric, dev_data_loader) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/simbert/README.md b/examples/text_matching/simbert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..751cfaa17067643c71aa4f85fbf5cef45d301a31 --- /dev/null +++ b/examples/text_matching/simbert/README.md @@ -0,0 +1,50 @@ +# SimBERT模型 + +## 模型简介 +[SimBERT](https://github.com/ZhuiyiTechnology/simbert)的模型权重是以Google开源的BERT模型为基础,基于微软的UniLM思想设计了融检索与生成于一体的任务,来进一步微调后得到的模型,所以它同时具备相似问生成和相似句检索能力。 + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +simbert/ +├── data.py #训练样本的数据加载以及转换 +├── predict.py # 模型预测 +└── README.md # 文档说明 +``` + +### 模型预测 + +启动预测: +```shell +export CUDA_VISIBLE_DEVICES=0 +python predict.py --input_file ./datasets/lcqmc/dev.tsv +``` + +待预测数据如以下示例: + + +```text +世界上什么东西最小 世界上什么东西最小? +光眼睛大就好看吗 眼睛好看吗? 
+小蝌蚪找妈妈怎么样 小蝌蚪找妈妈是谁画的
+```
+
+按照 predict.py 进行预测,得到相似度结果。
+
+如:
+
+```text
+{'query': '世界上什么东西最小', 'title': '世界上什么东西最小?', 'similarity': 0.992725}
+{'query': '光眼睛大就好看吗', 'title': '眼睛好看吗?', 'similarity': 0.74502724}
+{'query': '小蝌蚪找妈妈怎么样', 'title': '小蝌蚪找妈妈是谁画的', 'similarity': 0.8192148}
+```
+
+## Reference
+
+关于 SimBERT 的更多信息,请参考[科学空间](https://spaces.ac.cn/archives/7427)。
+
+SimBERT 项目地址:https://github.com/ZhuiyiTechnology/simbert
diff --git a/examples/text_matching/simbert/data.py b/examples/text_matching/simbert/data.py
new file mode 100644
index 0000000000000000000000000000000000000000..82b104d282d864a648d06bbbec4313edd8bb6f12
--- /dev/null
+++ b/examples/text_matching/simbert/data.py
@@ -0,0 +1,53 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle
+
+
+def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None):
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+
+    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
+
+
+def read_text_pair(data_path):
+    """Reads data."""
+    with open(data_path, "r", encoding="utf-8") as f:
+        for line in f:
+            data = line.rstrip().split("\t")
+            if len(data) != 2:
+                continue
+            yield {"query": data[0], "title": data[1]}
+
+
+def convert_example(example, tokenizer, max_seq_length=512, phase="train"):
+    """Converts a query-title pair into input ids and token type ids with the given tokenizer."""
+    query, title = example["query"], example["title"]
+
+    query_encoded_inputs = tokenizer(text=query, max_seq_len=max_seq_length)
+    query_input_ids = query_encoded_inputs["input_ids"]
+    query_token_type_ids = query_encoded_inputs["token_type_ids"]
+
+    title_encoded_inputs = tokenizer(text=title, max_seq_len=max_seq_length)
+    title_input_ids = title_encoded_inputs["input_ids"]
+    title_token_type_ids = title_encoded_inputs["token_type_ids"]
+
+    return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids
diff --git a/examples/text_matching/simbert/predict.py b/examples/text_matching/simbert/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..4db2a487d1dd12b81d6063e7152f9425a7ddebe9
--- /dev/null
+++ b/examples/text_matching/simbert/predict.py
@@ -0,0 +1,100 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +# parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the similarity. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + results = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + vecs_query = model(input_ids=query_input_ids, token_type_ids=query_token_type_ids) + vecs_title = model(input_ids=title_input_ids, token_type_ids=title_token_type_ids) + vecs_query = vecs_query[1].numpy() + vecs_title = vecs_title[1].numpy() + + vecs_query = vecs_query / (vecs_query**2).sum(axis=1, keepdims=True) ** 0.5 + vecs_title = vecs_title / (vecs_title**2).sum(axis=1, keepdims=True) ** 0.5 + sims = (vecs_query * vecs_title).sum(axis=1) + + results.extend(sims) + + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + model = AutoModel.from_pretrained("simbert-base-chinese", pool_act="linear") + tokenizer = AutoTokenizer.from_pretrained("simbert-base-chinese") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + y_sims = predict(model, valid_data_loader) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_sims): + text_pair = 
valid_ds[idx] + text_pair["similarity"] = y_sims[idx] + print(text_pair) diff --git a/examples/text_matching/simcse/README.md b/examples/text_matching/simcse/README.md new file mode 100644 index 0000000000000000000000000000000000000000..10ed7c5f5f63407bfc384f9918b46225091ce434 --- /dev/null +++ b/examples/text_matching/simcse/README.md @@ -0,0 +1,126 @@ +# 无监督语义匹配模型 [SimCSE](https://arxiv.org/abs/2104.08821) + +我们实现了 SimCSE 模型,并借鉴 ESimCSE 论文思想,通过 Word Repetition(WR) 策略进一步提升了 SimCSE 模型效果,在 4 个权威中文语义匹配数据集上做了充分效果评测。SimCSE 模型适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + +## 效果评估 +本项目分别使用 LCQMC、BQ_Corpus、STS-B、ATEC 这 4 个中文语义匹配数据集的训练集作为无监督训练集(仅使用文本信息,不使用 Label),并且在各自数据集上的验证集上进行效果评估,评估指标采用 SimCSE 论文中采用的 Spearman 相关系数,Spearman 相关系数越高,表示模型效果越好。中文数据集的下载地址为:[下载地址](https://paddlenlp.bj.bcebos.com/datasets/senteval_cn.zip) + +### 中文语义匹配数据集效果 + +| 模型| LCQMC | BQ_Corpus|STS-B|ATEC| +|-------|-------|-----|------|-----| +|SimCSE| 57.01 | **51.72** | 74.76 | 33.56 | +| SimCSE + WR| **58.97** | 51.58 | **78.32** | **33.73** | + +SimCSE + WR 策略在中文数据集训练的超参数设置如下: + +| 数据集|epoch | learning rate | dropout|batch size| dup rate| +|-------|-------|-----|------|-----|-----| +|LCQMC|1| 5E-5 | 0.3 |64| 0.32 | +|BQ_Corpus|1| 1E-5 | 0.3 |64|0.32 | +|STS-B|8| 5E-5 | 0.1 |64| 0.32 | +|ATEC|1| 5E-5 | 0.3 | 64| 0.32 | + + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +simcse/ +├── model.py # SimCSE 模型组网代码 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── predict.py # 基于训练好的无监督语义匹配模型计算文本 Pair 相似度 +├── train.sh # 模型训练的脚本 +└── train.py # SimCSE 模型训练、评估逻辑 +``` + +### 模型训练 +我们以中文文本匹配公开数据集 LCQMC 为示例数据集, 仅使用 LCQMC 的文本数据构造生成了无监督的训练数据。可以运行如下命令,开始模型训练并且在 LCQMC 的验证集上进行 Spearman 相关系数评估。 + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus '0' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 1 \ + --save_steps 100 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --dropout 0.3 \ + --train_set_file "./senteval_cn/LCQMC/train.txt" \ + --test_set_file "./senteval_cn/LCQMC/dev.tsv" +``` + +可支持配置的参数: + +* `infer_with_fc_pooler`:可选,在预测阶段计算文本 embedding 表示的时候网络前向是否会过训练阶段最后一层的 fc; 建议关闭模型效果最好。 +* `dup_rate`: 可选,word reptition 的比例,默认是0.32,根据论文 Word Repetition 比例采用 0.32 效果最佳。 +* `scale`:可选,在计算 cross_entropy loss 之前对 cosine 相似度进行缩放的因子;默认为 20。 +* `dropout`:可选,SimCSE 网络前向使用的 dropout 取值;默认 0.1。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为1。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选cpu、gpu或npu。如使用gpu训练则参数gpus指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + +**NOTE:** +* 如需恢复模型训练,则可以设置`init_from_ckpt`, 如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。 + +### 基于动态图模型预测 + +我们用 LCQMC 的测试集作为预测数据, 测试数据示例如下,: +```text +谁有狂三这张高清的 这张高清图,谁有 +英雄联盟什么英雄最好 英雄联盟最好英雄是什么 +这是什么意思,被蹭网吗 我也是醉了,这是什么意思 +现在有什么动画片好看呢? 现在有什么好看的动画片吗? 
+请问晶达电子厂现在的工资待遇怎么样要求有哪些 三星电子厂工资待遇怎么样啊 +``` + +执行如下命令开始预测: +```shell +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/model_4400/model_state.pdparams"\ + --batch_size 64 \ + --max_seq_length 64 \ + --text_pair_file 'test.tsv' +``` + +输出预测结果如下: +```text +0.7201147675514221 +0.9010907411575317 +0.5393891334533691 +0.9698929786682129 +0.6056119203567505 +``` + +## Reference +[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” ArXiv:2104.08821 [Cs], April 18, 2021. http://arxiv.org/abs/2104.08821. + +[2] Wu, Xing, et al. "ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding." arXiv preprint arXiv:2109.04380 (2021). https://arxiv.org/abs/2109.04380. diff --git a/examples/text_matching/simcse/data.py b/examples/text_matching/simcse/data.py new file mode 100644 index 0000000000000000000000000000000000000000..6f1f49d93a8fd1f7288e508355219784c631a81c --- /dev/null +++ b/examples/text_matching/simcse/data.py @@ -0,0 +1,135 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random + +import numpy as np +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def word_repetition(input_ids, token_type_ids, dup_rate=0.32): + """Word Repetition strategy.""" + input_ids = input_ids.numpy().tolist() + token_type_ids = token_type_ids.numpy().tolist() + + batch_size, seq_len = len(input_ids), len(input_ids[0]) + repetitied_input_ids = [] + repetitied_token_type_ids = [] + rep_seq_len = seq_len + for batch_id in range(batch_size): + cur_input_id = input_ids[batch_id] + actual_len = np.count_nonzero(cur_input_id) + dup_word_index = [] + # If sequence length is less than 5, skip it + if actual_len > 5: + dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len))) + # Skip cls and sep position + dup_word_index = random.sample(list(range(1, actual_len - 1)), k=dup_len) + + r_input_id = [] + r_token_type_id = [] + for idx, word_id in enumerate(cur_input_id): + # Insert duplicate word + if idx in dup_word_index: + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + after_dup_len = len(r_input_id) + repetitied_input_ids.append(r_input_id) + repetitied_token_type_ids.append(r_token_type_id) + + if after_dup_len > rep_seq_len: + rep_seq_len = after_dup_len + # Padding the data to the same length + for batch_id in range(batch_size): + after_dup_len = len(repetitied_input_ids[batch_id]) + pad_len = rep_seq_len - after_dup_len + repetitied_input_ids[batch_id] += [0] * pad_len + repetitied_token_type_ids[batch_id] += [0] * pad_len + + return paddle.to_tensor(repetitied_input_ids), paddle.to_tensor(repetitied_token_type_ids) diff --git a/examples/text_matching/simcse/model.py b/examples/text_matching/simcse/model.py new file mode 100644 index 0000000000000000000000000000000000000000..59364f2035807c23d29a77d802f2bda23fb8ceac --- /dev/null +++ b/examples/text_matching/simcse/model.py @@ -0,0 +1,119 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
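上文 simcse/data.py 中的 `word_repetition` 实现的是 ESimCSE 提出的 Word Repetition(WR)增强:在每条序列里随机复制少量非 [CLS]/[SEP] 位置的 token,然后把整个 batch 重新补齐。下面是一个假设性的最小调用示意(token id 为占位数字,仅用于观察形状变化),调用方式与后文 train.py 中一致。

```python
import paddle

from data import word_repetition  # the function defined in simcse/data.py above

# A toy, already padded batch: 1/2 stand for [CLS]/[SEP], 0 is padding (placeholder ids).
input_ids = paddle.to_tensor([
    [1, 11, 12, 13, 14, 15, 2, 0],
    [1, 21, 22, 23, 24, 25, 2, 0],
])
token_type_ids = paddle.zeros_like(input_ids)

dup_input_ids, dup_token_type_ids = word_repetition(input_ids, token_type_ids, dup_rate=0.32)

# A few tokens get duplicated, so sequences grow and are re-padded
# to the new maximum length within the batch.
print(input_ids.shape, "->", dup_input_ids.shape)
```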
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/examples/text_matching/simcse/predict.py b/examples/text_matching/simcse/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..c94eaf0164bb85f112a0e3e4d2b102a1639314c2 --- /dev/null +++ b/examples/text_matching/simcse/predict.py @@ 
-0,0 +1,116 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") + +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SimCSE`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False, is_test=True) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SimCSE(pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/examples/text_matching/simcse/train.py b/examples/text_matching/simcse/train.py new file mode 100644 index 0000000000000000000000000000000000000000..d279be57b258daa5fe5b2d9608d153f6f83c1ab7 --- /dev/null +++ b/examples/text_matching/simcse/train.py @@ -0,0 +1,233 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
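下面的 train.py 优化的正是上文 model.py 中 `SimCSE.forward` 定义的目标:同一 batch 内,第 i 条 query 的正例是第 i 条 title,其余 title 充当负例(in-batch negatives),再对缩放后的余弦相似度矩阵做交叉熵。下面用随机向量给出该损失的等价计算示意,其中 `batch_size`、`emb_size`、`scale`、`margin` 均为示例取值,并非脚本的固定配置。

```python
import paddle
import paddle.nn.functional as F

batch_size, emb_size, scale, margin = 4, 256, 20.0, 0.0

# Stand-ins for the L2-normalized query/title [CLS] embeddings produced by the encoder.
query_emb = F.normalize(paddle.randn([batch_size, emb_size]), axis=-1)
title_emb = F.normalize(paddle.randn([batch_size, emb_size]), axis=-1)

# Row i holds the cosine similarities of query i against every title in the batch.
cosine_sim = paddle.matmul(query_emb, title_emb, transpose_y=True)

# Optionally subtract a margin from the positive (diagonal) entries, then scale.
cosine_sim -= paddle.diag(paddle.full([batch_size], margin, dtype=cosine_sim.dtype))
cosine_sim *= scale

# Query i's positive is title i, so the label of row i is simply i.
labels = paddle.arange(0, batch_size, dtype="int64").reshape([-1, 1])
loss = F.cross_entropy(input=cosine_sim, label=labels)
print(loss.numpy())
```

正例恰好落在相似度矩阵的对角线上,因此标签就是行号;`scale` 把余弦值放大,便于交叉熵损失更快收敛。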
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import ( + convert_example, + create_dataloader, + read_simcse_text, + read_text_pair, + word_repetition, +) +from model import SimCSE +from scipy import stats + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override ecpochs.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_set_file", type=str, required=True, help="The full path of test_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--dup_rate", default=0.32, type=float, help="duplicate rate for word repetition.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") + +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, sims).correlation + model.train() + return spearman_corr, total_num + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_simcse_text, data_path=args.train_set_file, lazy=False) + + dev_ds = load_dataset( + read_text_pair, data_path=args.test_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained( + 'ernie-3.0-medium-zh', + hidden_dropout_prob=args.dropout, + attention_probs_dropout_prob=args.dropout) + + tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + dev_data_loader = create_dataloader( + dev_ds, + 
mode='eval', + batch_size=args.batch_size, + batchify_fn=dev_batchify_fn, + trans_fn=trans_func) + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + if args.dup_rate > 0: + query_input_ids, query_token_type_ids = word_repetition(query_input_ids, query_token_type_ids, args.dup_rate) + title_input_ids, title_token_type_ids = word_repetition(title_input_ids, title_token_type_ids, args.dup_rate) + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + # need better way to get model Layers + spearman_corr, total_num = do_evaluate(model._layers, tokenizer, dev_data_loader, args.infer_with_fc_pooler) + print("global step: {}, spearman_corr: {:.4f}, total_num: {}".format(global_step, spearman_corr, total_num)) + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/simcse/train.sh b/examples/text_matching/simcse/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..145be555a3b1eec1132d0beadf2abafa7b3b144d --- /dev/null +++ b/examples/text_matching/simcse/train.sh @@ -0,0 +1,15 @@ +python -u -m paddle.distributed.launch --gpus '4' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 8 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --dropout 0.3 \ + --output_emb_size 256 \ + --dup_rate 0.32 \ + --train_set_file "./senteval_cn/STS-B/train.txt" \ + --test_set_file "./senteval_cn/STS-B/dev.tsv" \ No newline at 
end of file diff --git a/examples/text_matching/simnet/README.md b/examples/text_matching/simnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9b346888e8a40e659530762cb0a0fc3d7318bf03 --- /dev/null +++ b/examples/text_matching/simnet/README.md @@ -0,0 +1,168 @@ +# 使用SimNet完成文本匹配任务 + +短文本语义匹配(SimilarityNet, SimNet)是一个计算短文本相似度的框架,可以根据用户输入的两个文本,计算出相似度得分。 +SimNet框架在百度各产品上广泛应用,主要包括BOW、CNN、RNN、MMDNN等核心网络结构形式,提供语义相似度计算训练和预测框架, +适用于信息检索、新闻推荐、智能客服等多个应用场景,帮助企业解决语义匹配问题。 +可通过[AI开放平台-短文本相似度](https://ai.baidu.com/tech/nlp_basic/simnet)线上体验。 + +## 模型简介 + + +本项目通过调用[Seq2Vec](../../../paddlenlp/seq2vec/)中内置的模型进行序列建模,完成句子的向量表示。包含最简单的词袋模型和一系列经典的RNN类模型。 + +| 模型 | 模型介绍 | +| ------------------------------------------------ | ------------------------------------------------------------ | +| BOW(Bag Of Words) | 非序列模型,将句子表示为其所包含词的向量的加和 | +| CNN | 序列模型,使用卷积操作,提取局部区域地特征 | +| GRU(Gated Recurrent Unit) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | +| LSTM(Long Short Term Memory) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | + + +| 模型 | dev acc | test acc | +| ---- | ------- | -------- | +| BoW | 0.7290 | 0.75232 | +| CNN | 0.7042 | 0.73760 | +| GRU | 0.7781 | 0.77808 | +| LSTM | 0.73760 | 0.77320 | + + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +simnet/ +├── model.py # 模型组网 +├── predict.py # 模型预测 +├── utils.py # 数据处理工具 +├── train.py # 训练模型主程序入口,包括训练、评估 +└── README.md # 文档说明 +``` + +### 数据准备 + +#### 使用PaddleNLP内置数据集 + +```python +from paddlenlp.datasets import load_dataset + +train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) +``` + +部分样例数据如下: + +```text +query title label +最近有什么好看的电视剧,推荐一下 近期有什么好看的电视剧,求推荐? 1 +大学生验证仅针对在读学生,已毕业学生不能申请的哦。 通过了大学生验证的用户,可以在支付宝的合作商户,享受学生优惠 0 +如何在网上查户口 如何网上查户口 1 +关于故事的成语 来自故事的成语 1 + 湖北农村信用社手机银行客户端下载 湖北长阳农村商业银行手机银行客户端下载 0 +草泥马是什么动物 草泥马是一种什么动物 1 +``` + +### 模型训练 + +在模型训练之前,需要先下载词汇表文件simnet_vocab.txt,用于构造词-id映射关系。 + +```shell +wget https://bj.bcebos.com/paddlenlp/data/simnet_vocab.txt +``` + +**NOTE:** 词表的选择和实际应用数据相关,需根据实际数据选择词表。 + +我们以中文文本匹配数据集LCQMC为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证 + +CPU启动: + +```shell +python train.py --vocab_path='./simnet_vocab.txt' \ + --device=cpu \ + --network=lstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=5 \ + --save_dir='./checkpoints' +``` + +GPU启动: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py --vocab_path='./simnet_vocab.txt' \ + --device=gpu \ + --network=lstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=5 \ + --save_dir='./checkpoints' +``` + +以上参数表示: + +* `vocab_path`: 词汇表文件路径。 +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 +* `network`: 模型网络名称,默认为`lstm`, 可更换为lstm, gru, rnn,bow,cnn等。 +* `lr`: 学习率, 默认为5e-4。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为5。 +* `save_dir`: 训练保存模型的文件路径。 +* `init_from_ckpt`: 恢复模型训练的断点路径。 + + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... 
+└── final.pdparams +``` + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 + +### 模型预测 + +启动预测 + +CPU启动: + +```shell +python predict.py --vocab_path='./simnet_vocab.txt' \ + --device=cpu \ + --network=lstm \ + --params_path=checkpoints/final.pdparams +``` + +GPU启动: + +```shell +CUDA_VISIBLE_DEVICES=0 python predict.py --vocab_path='./simnet_vocab.txt' \ + --device=gpu \ + --network=lstm \ + --params_path='./checkpoints/final.pdparams' +``` + +将待预测数据分词完毕后,如以下示例: + +```text +世界上什么东西最小 世界上什么东西最小? +光眼睛大就好看吗 眼睛好看吗? +小蝌蚪找妈妈怎么样 小蝌蚪找妈妈是谁画的 +``` + +处理成模型所需的`Tensor`,如可以直接调用`preprocess_prediction_data`函数既可处理完毕。之后传入`predict`函数即可输出预测结果。 + +如 + +```text +Data: ['世界上什么东西最小', '世界上什么东西最小?'] Label: similar +Data: ['光眼睛大就好看吗', '眼睛好看吗?'] Label: dissimilar +Data: ['小蝌蚪找妈妈怎么样', '小蝌蚪找妈妈是谁画的'] Label: dissimilar +``` diff --git a/examples/text_matching/simnet/model.py b/examples/text_matching/simnet/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ae029705140be947cfa2e94f7e475863e3374716 --- /dev/null +++ b/examples/text_matching/simnet/model.py @@ -0,0 +1,219 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +import paddlenlp as nlp + + +class SimNet(nn.Layer): + def __init__(self, network, vocab_size, num_classes, emb_dim=128, pad_token_id=0): + super().__init__() + + network = network.lower() + if network == "bow": + self.model = BoWModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "cnn": + self.model = CNNModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "gru": + self.model = GRUModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + elif network == "lstm": + self.model = LSTMModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + else: + raise ValueError("Unknown network: %s, it must be one of bow, cnn, lstm or gru." % network) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + logits = self.model(query, title, query_seq_len, title_seq_len) + return logits + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. 
+ fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, fc_hidden_size=128): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) + self.fc = nn.Linear(self.bow_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, embedding_dim) + summed_query = self.bow_encoder(embedded_query) + summed_title = self.bow_encoder(embedded_title) + encoded_query = paddle.tanh(summed_query) + encoded_title = paddle.tanh(summed_title) + # Shape: (batch_size, embedding_dim*2) + contacted = paddle.concat([encoded_query, encoded_title], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=128, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=128, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.lstm_encoder = nlp.seq2vec.LSTMEncoder( + emb_dim, lstm_hidden_size, num_layers=lstm_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.lstm_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len, title_seq_len): + assert query_seq_len is not None and title_seq_len is not None + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, lstm_hidden_size) + query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*lstm_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + +class GRUModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + gru_hidden_size=128, + direction="forward", + gru_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.gru_encoder = nlp.seq2vec.GRUEncoder( + emb_dim, gru_hidden_size, num_layers=gru_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.gru_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len, title_seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = 
self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, gru_hidden_size) + query_repr = self.gru_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.gru_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*gru_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + +class CNNModel(nn.Layer): + """ + This class implements the + + + Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=256, + ngram_filter_sizes=(3,), + fc_hidden_size=128, + ): + super().__init__() + self.padding_idx = padding_idx + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, num_filter) + query_repr = self.encoder(embedded_query) + title_repr = self.encoder(embedded_title) + # Shape: (batch_size, 2*num_filter) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits diff --git a/examples/text_matching/simnet/predict.py b/examples/text_matching/simnet/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3455a239226bab4f5696cabce4bdff0d38185414 --- /dev/null +++ b/examples/text_matching/simnet/predict.py @@ -0,0 +1,109 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +import paddle.nn.functional as F +from model import SimNet +from utils import preprocess_prediction_data + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The path to vocabulary.") +parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + logits = model(query_ids, title_ids, query_seq_lens, title_seq_lens) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + # Loads vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
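+    # Each element of `data` below is a [query, title] pair of raw Chinese text;
+    # `preprocess_prediction_data` tokenizes both texts with the JiebaTokenizer created above
+    # and returns [query_ids, title_ids, query_seq_len, title_seq_len] for every pair.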
+ data = [ + ["世界上什么东西最小", "世界上什么东西最小?"], + ["光眼睛大就好看吗", "眼睛好看吗?"], + ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], + ] + examples = preprocess_prediction_data(data, tokenizer) + results = predict( + model, + examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + ) + + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_matching/simnet/train.py b/examples/text_matching/simnet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..848311004be0881ec76f46f9ef8d1e3c46a64220 --- /dev/null +++ b/examples/text_matching/simnet/train.py @@ -0,0 +1,122 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from model import SimNet +from utils import convert_example + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The directory to dataset.") +parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. 
+ """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Loads vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + # Loads dataset. + train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(train_ds.label_list)) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + tokenizer = JiebaTokenizer(vocab) + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False) + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + test_loader = create_dataloader( + test_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + model.fit( + train_loader, + dev_loader, + epochs=args.epochs, + save_dir=args.save_dir, + ) diff --git a/examples/text_matching/simnet/utils.py b/examples/text_matching/simnet/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5de0876215d191025cc15298077d003345c17d89 --- /dev/null +++ b/examples/text_matching/simnet/utils.py @@ -0,0 +1,73 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def convert_example(example, tokenizer, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. 
+ + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + query_ids(obj:`list[int]`): The list of query ids. + title_ids(obj:`list[int]`): The list of title ids. + query_seq_len(obj:`int`): The input sequence query length. + title_seq_len(obj:`int`): The input sequence title length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + query, title = example["query"], example["title"] + query_ids = np.array(tokenizer.encode(query), dtype="int64") + query_seq_len = np.array(len(query_ids), dtype="int64") + title_ids = np.array(tokenizer.encode(title), dtype="int64") + title_seq_len = np.array(len(title_ids), dtype="int64") + + if not is_test: + label = np.array(example["label"], dtype="int64") + return query_ids, title_ids, query_seq_len, title_seq_len, label + else: + return query_ids, title_ids, query_seq_len, title_seq_len + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[List[str, str]]`): + The prediction data whose each element is a text pair. + Each text will be tokenized by jieba.lcut() function. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - query_ids(obj:`list[int]`): The list of query ids. + - title_ids(obj:`list[int]`): The list of title ids. + - query_seq_len(obj:`int`): The input sequence query length. + - title_seq_len(obj:`int`): The input sequence title length. + + """ + examples = [] + for query, title in data: + query_ids = tokenizer.encode(query) + title_ids = tokenizer.encode(title) + examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) + return examples diff --git a/examples/text_summarization/bart/README.md b/examples/text_summarization/bart/README.md new file mode 100644 index 0000000000000000000000000000000000000000..db140d4d8866c3dfe0fb387c13b31b67efb3cfc9 --- /dev/null +++ b/examples/text_summarization/bart/README.md @@ -0,0 +1,227 @@ +# BART + +## 模型简介 + +BART是一种Seq2Seq结构的降噪自编码器,通过增加噪声来破环文本然后重建原文本来训练模型。它使用一个标准的Transformer结构,可以被看作泛化的BERT(由于是双向编码器),GPT(由于是从左到右解码器),和一些其他的预训练模型结构。 + +本项目是BART在 PaddlePaddle 2.2上开源实现的文本摘要的例子,包含了在[CNN/DailyMail](https://arxiv.org/pdf/1704.04368.pdf)数据集上微调和生成的代码。 + +## 快速开始 + +### 环境依赖 + +- nltk +- rouge_score + +安装方式:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. 
+├── run_summarization.py # 模型finetune主程序入口 +├── generate.py # 模型生成主程序入口 +├── utils.py # 定义参数及一些工具函数 +├── requirements.txt # 环境依赖文件 +└── README.md # 文档说明 +``` + +### 数据准备 + +**CNN/DailyMail**数据集是一个英文数据集,包含CNN和《每日邮报》记者撰写的30多万篇独特新闻文章,常用来做文本摘要。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了CNN/DailyMail数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_set, dev_set, test_set = load_dataset("cnn_dailymail", splits=["train", "dev", "test"]) +``` + +### 模型训练 + +运行如下命令即可在训练集上进行finetune,并在验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus 1,2 run_summarization.py \ + --model_name_or_path=bart-base \ + --dataset_name=cnn_dailymail \ + --output_dir=output \ + --max_source_length=1024 \ + --max_target_length=142 \ + --learning_rate=1e-4 \ + --num_train_epochs=6 \ + --logging_steps=100 \ + --save_steps=1000 \ + --seed=42 \ + --train_batch_size=20 \ + --eval_batch_size=64 \ + --warmup_proportion=0.1 \ + --ignore_pad_token_for_loss=True \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU + +- `model_name_or_path` 指示了finetune使用的预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | bart-base | + | bart-large | + +- `dataset_name` 表示训练的数据集。 + +- `output_dir` 表示模型的保存路径。 + +- `max_source_length` 表示输入article的最大长度。 + +- `max_target_length` 表示输入highlights的最大长度。 + +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 + +- `num_train_epochs` 表示训练轮数。 + +- `logging_steps` 表示日志打印间隔。 + +- `save_steps` 表示模型保存及评估间隔。 + +- `seed` 表示随机数生成器的种子。 + +- `epochs` 表示训练轮数。 + +- `train_batch_size` 表示训练**每张卡**上的样本数目。 + +- `eval_batch_size` 表示预测**单卡**上的样本数目。 + +- `warmup_proportion` 表示warmup_steps所占总步数的比例。学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数。 + +- `ignore_pad_token_for_loss` 表示计算loss时忽略padding。 + +- `device` 表示使用的设备。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。如: + +```text +./output/ +├── bart_model_1000.pdparams +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── merges.txt +│ ├── tokenizer_config.json +│ └── vocab.json +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,只需指定`model_name_or_path`为本地微调模型的路径即可。 + +### 模型预测 + +运行如下命令即可在验证集上进行测试 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python generate.py \ + --model_name_or_path=bart-base-cnndm-model \ + --dataset_name=cnn_dailymail \ + --output_path=generate.txt \ + --max_source_length=1024 \ + --max_target_length=142 \ + --decode_strategy=greedy_search \ + --top_k=2 \ + --top_p=1.0 \ + --num_beams=1 \ + --length_penalty=0.0 \ + --batch_size=64 \ + --seed=42 \ + --ignore_pad_token_for_loss=True \ + --logging_steps=100 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | bart-base | + | bart-large | + +- `dataset_name` 表示预测的数据集。 + +- `output_path` 表示预测结果的保存路径。 + +- `max_source_length` 表示输入article的最大长度。 + +- `max_target_length` 表示输入highlights的最大长度。 + +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 + +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 + +- `top_p` 表示采用"sampling"解码策略时,从词表中采样并选择概率之和大于给定阈值`top_p`的token。 + +- `num_beams` 表示besm search的beam size。 + +- `length_penalty` 表示besm search生成长度的指数惩罚。 + +- `batch_size` 表示每次迭代**单卡**上的样本数目。 + +- `seed` 表示随机数生成器的种子。 + +- `ignore_pad_token_for_loss` 表示训练时计算loss时忽略padding。如果训练时设置为True,那么预测时的label需要还原来计算评估指标。 + +- `logging_steps` 表示日志打印间隔。 + +- `device` 表示使用的设备。 + +程序运行结束后会将预测生成的摘要保存在`output_path`中。同时终端中会输出评估结果。 + +采用预训练模型及微调模型在验证集上有如下结果: + +| model_name_or_path | Rouge-1 | Rouge-2 | Rouge-L | +| :----------------------: | :-------------: | :-------------: |:-------------: | +| [bart-base-cnndm-model](https://bj.bcebos.com/paddlenlp/models/transformers/bart/bart-base-cnndm-model.tar.gz ) | 43.6446 | 20.1447 | 41.0132 | + +**NOTE:** `bart-base-cnndm-model`是按本项目中的超参finetune得到的结果。 + +### 模型高性能预测 + +在模型预测阶段,我们提供了基于 FastGeneration 的高性能预测的选项,可以选择性开启是否需要采用高性能预测。只需在上述模型预测上添加两个参数即可:分别是`faster`,`use_fp16_decoding`。 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python generate.py \ + --model_name_or_path=bart-base-cnndm-model \ + --dataset_name=cnn_dailymail \ + --output_path=generate.txt \ + --max_source_length=1024 \ + --max_target_length=142 \ + --decode_strategy=greedy_search \ + --top_k=2 \ + --top_p=1.0 \ + --num_beams=1 \ + --length_penalty=0.0 \ + --batch_size=64 \ + --seed=42 \ + --ignore_pad_token_for_loss=True \ + --logging_steps=100 \ + --faster \ + --use_fp16_decoding \ + --device=gpu +``` +其中新增参数释义如下: +- `faster` 表示是否开启高性能预测。设置 `--faster` 即表示开启。 +- `use_fp16_decoding` 表示在开启高性能预测的时候,是否使用 fp16 来完成预测过程。设置 `--use_fp16_decoding` 即表示使用 fp16 进行预测,否则使用 fp32。 + +## 参考文献 +1. Lewis M , Liu Y , Goyal N , et al. [BART: Denoising Sequence-to-Sequence Pre-training for Natural +Language Generation, Translation, and Comprehension](https://aclanthology.org/2020.acl-main.703.pdf)[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7871-7880. +2. See A , Liu P J , CD Manning. [Get To The Point: Summarization with Pointer-Generator Networks](https://aclanthology.org/P17-1099.pdf)[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 1073–1083. 
diff --git a/examples/text_summarization/bart/generate.py b/examples/text_summarization/bart/generate.py new file mode 100644 index 0000000000000000000000000000000000000000..12c7018cf89bc028149b9d51dfac549f2434ed90 --- /dev/null +++ b/examples/text_summarization/bart/generate.py @@ -0,0 +1,206 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from utils import compute_metrics, convert_example + +from paddlenlp.data import Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer + +summarization_name_mapping = {"cnn_dailymail": ("article", "highlights")} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name_or_path", default="bart-base", type=str, required=True, help="Path to pre-trained model. " + ) + parser.add_argument( + "--dataset_name", + default="cnn_dailymail", + type=str, + required=True, + help="The name of the dataset to use. Selected in the list: " + ", ".join(summarization_name_mapping.keys()), + ) + parser.add_argument( + "--output_path", type=str, default="generate.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument( + "--max_source_length", + default=1024, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=0, + type=int, + help="The minimum total sequence length for target text when generating. ", + ) + parser.add_argument( + "--max_target_length", + default=142, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument( + "--decode_strategy", default="greedy_search", type=str, help="The decode strategy in generation." 
+ ) + parser.add_argument( + "--top_k", + default=2, + type=int, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument("--top_p", default=1.0, type=float, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", default=1, type=int, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + default=0.6, + type=float, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument( + "--early_stopping", + default=False, + type=eval, + help="Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.", + ) + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument("--faster", action="store_true", help="Whether to process inference using FastGeneration. ") + parser.add_argument( + "--use_fp16_decoding", + action="store_true", + help="Whether to use fp16 when using FastGeneration. Only works when using FastGeneration. ", + ) + parser.add_argument("--batch_size", default=64, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--ignore_pad_token_for_loss", + default=True, + type=bool, + help="Whether to ignore the tokens corresponding to " "padded labels in the loss computation or not.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def batchify_fn(samples): + fn = Tuple( + Stack(dtype="int64"), # input_ids + Stack(dtype="int64"), # attention mask + Stack(dtype="int32"), # mem_seq_lens + Stack(dtype="int64"), # decoder_input_ids + Stack(dtype="int64"), # labels + ) + return fn(samples) + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + set_seed(args) + tokenizer = BartTokenizer.from_pretrained(args.model_name_or_path) + model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path) + dataset = load_dataset(args.dataset_name, splits=["dev"]) + trans_func = partial( + convert_example, + text_column=summarization_name_mapping[args.dataset_name][0], + summary_column=summarization_name_mapping[args.dataset_name][1], + tokenizer=tokenizer, + decoder_start_token_id=model.bart.decoder_start_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + is_train=False, + ) + + dataset = dataset.map(trans_func, lazy=True) + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + data_loader = DataLoader( + dataset=dataset, batch_sampler=batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + data_loader.pin_memory = False + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in enumerate(data_loader): + input_ids, _, mem_seq_lens, _, labels = batch + preds, _ = model.generate( + input_ids=input_ids, + seq_lens=mem_seq_lens, + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + use_fast=args.faster, + ) + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + start_time = time.time() + + rouge_result, decoded_preds = compute_metrics(all_preds, all_labels, tokenizer, args.ignore_pad_token_for_loss) + print("Rouge result: ", rouge_result) + with open(args.output_path, "w", encoding="utf-8") as fout: + for decoded_pred in decoded_preds: + fout.write(decoded_pred + "\n") + print("Save generated result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + generate(args) diff --git a/examples/text_summarization/bart/requirements.txt b/examples/text_summarization/bart/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..659847365c1ff958cb7acafc2ba9ce9c8f40d234 --- /dev/null +++ b/examples/text_summarization/bart/requirements.txt @@ -0,0 +1,2 @@ +nltk==3.6.2 +rouge_score==0.0.4 diff --git a/examples/text_summarization/bart/run_summarization.py b/examples/text_summarization/bart/run_summarization.py new file mode 100644 index 0000000000000000000000000000000000000000..3951f9778f52ccc3efbe69f7c2e20400a643adef --- /dev/null +++ b/examples/text_summarization/bart/run_summarization.py @@ -0,0 +1,295 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import argparse +import random +import time +import distutils.util +from pprint import pprint +from functools import partial +from tqdm import tqdm +import numpy as np + +import paddle +import paddle.nn as nn +from paddle.io import BatchSampler, DistributedBatchSampler, DataLoader +from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.utils.log import logger +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Tuple, Stack +from utils import convert_example, compute_metrics + +summarization_name_mapping = {"cnn_dailymail": ("article", "highlights")} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name_or_path", default="bart-base", type=str, required=True, help="Path to pre-trained model. " + ) + parser.add_argument( + "--dataset_name", + default="cnn_dailymail", + type=str, + required=True, + help="The name of the dataset to use. Selected in the list: " + ", ".join(summarization_name_mapping.keys()), + ) + parser.add_argument( + "--output_dir", + default="output", + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_source_length", + default=1024, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=0, + type=int, + help="The minimum total sequence length for target text when generating. ", + ) + parser.add_argument( + "--max_target_length", + default=142, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--train_batch_size", + default=20, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=12, + type=int, + help="Batch size per GPU/CPU for evaluation.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." 
+ ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training." + ) + parser.add_argument("--scale_loss", default=2**15, type=float, help="The value of scale_loss for fp16.") + parser.add_argument( + "--ignore_pad_token_for_loss", + default=True, + type=bool, + help="Whether to ignore the tokens corresponding to " "padded labels in the loss computation or not.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, tokenizer, ignore_pad_token_for_loss, min_target_length, max_target_length): + model.eval() + all_preds = [] + all_labels = [] + model = model._layers if isinstance(model, paddle.DataParallel) else model + for batch in tqdm(data_loader, total=len(data_loader), desc="Eval step"): + input_ids, _, _, labels = batch + preds = model.generate( + input_ids=input_ids, min_length=min_target_length, max_length=max_target_length, use_cache=True + )[0] + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + rouge_result, decoded_preds = compute_metrics(all_preds, all_labels, tokenizer, ignore_pad_token_for_loss) + logger.info(rouge_result) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + tokenizer = BartTokenizer.from_pretrained(args.model_name_or_path) + model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + text_column=summarization_name_mapping[args.dataset_name][0], + summary_column=summarization_name_mapping[args.dataset_name][1], + tokenizer=tokenizer, + decoder_start_token_id=model.bart.decoder_start_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + ) + logger.info("Loading train and dev dataset: %s" % args.dataset_name) + train_set, dev_set = load_dataset(args.dataset_name, splits=["train", "dev"]) + logger.info("Loaded train and dev dataset: %s" % args.dataset_name) + train_set = train_set.map(trans_func, lazy=True) + train_batch_sampler = DistributedBatchSampler(train_set, batch_size=args.train_batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Stack(dtype="int64"), # input_ids + Stack(dtype="int64"), # attention mask + Stack(dtype="int64"), # decoder_input_ids + Stack(dtype="int64"), # labels + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_set, batch_sampler=train_batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + 
dev_set = dev_set.map(trans_func, lazy=True) + dev_batch_sampler = BatchSampler(dev_set, batch_size=args.eval_batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_set, batch_sampler=dev_batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = nn.CrossEntropyLoss() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + for epoch in tqdm(range(args.num_train_epochs), desc="Epoch"): + for step, batch in tqdm(enumerate(train_data_loader), desc="Train step", total=len(train_data_loader)): + global_step += 1 + input_ids, attention_mask, decoder_input_ids, labels = batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, attention_mask, decoder_input_ids) + loss = loss_fct(logits, labels) + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.minimize(optimizer, scaled_loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate( + model, + dev_data_loader, + tokenizer, + args.ignore_pad_token_for_loss, + args.min_target_length, + args.max_target_length, + ) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "bart_model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "bart_model_final_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + args = 
parse_args() + pprint(args) + do_train(args) diff --git a/examples/text_summarization/bart/utils.py b/examples/text_summarization/bart/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..892e0f520c5707b5bc6ee31278f2438d01de5ca7 --- /dev/null +++ b/examples/text_summarization/bart/utils.py @@ -0,0 +1,115 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np +import nltk +from rouge_score import rouge_scorer, scoring + + +def convert_example( + example, + text_column, + summary_column, + tokenizer, + decoder_start_token_id, + max_source_length, + max_target_length, + ignore_pad_token_for_loss=True, + is_train=True, +): + """ + Convert a example into necessary features. + """ + inputs = example[text_column] + targets = example[summary_column] + labels = tokenizer(targets, max_length=max_target_length, padding="max_length", truncation=True) + decoder_input_ids = [decoder_start_token_id] + labels["input_ids"][:-1] + if ignore_pad_token_for_loss: + labels["input_ids"] = [(l if l != tokenizer.pad_token_id else -100) for l in labels["input_ids"]] + if is_train: + model_inputs = tokenizer( + inputs, + max_length=max_source_length, + padding="max_length", + truncation=True, + return_attention_mask=True, + return_length=False, + ) + return model_inputs["input_ids"], model_inputs["attention_mask"], decoder_input_ids, labels["input_ids"] + else: + model_inputs = tokenizer( + inputs, + max_length=max_source_length, + padding="max_length", + truncation=True, + return_attention_mask=True, + return_length=True, + ) + return ( + model_inputs["input_ids"], + model_inputs["attention_mask"], + model_inputs["length"], + decoder_input_ids, + labels["input_ids"], + ) + + +def compute_metrics(preds, labels, tokenizer, ignore_pad_token_for_loss=True): + def compute_rouge(predictions, references, rouge_types=None, use_stemmer=True): + if rouge_types is None: + rouge_types = ["rouge1", "rouge2", "rougeLsum"] + + scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=use_stemmer) + aggregator = scoring.BootstrapAggregator() + + for ref, pred in zip(references, predictions): + score = scorer.score(ref, pred) + aggregator.add_scores(score) + result = aggregator.aggregate() + result = {key: round(value.mid.fmeasure * 100, 4) for key, value in result.items()} + return result + + def post_process_text(preds, labels): + preds = [pred.strip() for pred in preds] + labels = [label.strip() for label in labels] + + # rougeLSum expects newline after each sentence + preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds] + labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels] + + return preds, labels + + def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
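+        Truncates the sequence at the first `eos_idx` and drops the bos/eos ids
+        unless `output_bos`/`output_eos` is set.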
+        """
+        eos_pos = len(seq) - 1
+        for i, idx in enumerate(seq):
+            if idx == eos_idx:
+                eos_pos = i
+                break
+        seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)]
+        return seq
+
+    if ignore_pad_token_for_loss:
+        labels = np.asarray(labels)
+        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
+    decoded_preds, decoded_labels = [], []
+    for pred, label in zip(preds, labels):
+        pred_id = post_process_seq(pred, tokenizer.bos_token_id, tokenizer.eos_token_id)
+        label_id = post_process_seq(label, tokenizer.bos_token_id, tokenizer.eos_token_id)
+        decoded_preds.append(tokenizer.convert_ids_to_string(pred_id))
+        decoded_labels.append(tokenizer.convert_ids_to_string(label_id))
+    decoded_preds, decoded_labels = post_process_text(decoded_preds, decoded_labels)
+    rouge_result = compute_rouge(decoded_preds, decoded_labels)
+    return rouge_result, decoded_preds
diff --git a/examples/text_summarization/pointer_summarizer/.gitignore b/examples/text_summarization/pointer_summarizer/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..4725dc6de854f8d59b6600aa4b1f9b4f6d197d56
--- /dev/null
+++ b/examples/text_summarization/pointer_summarizer/.gitignore
@@ -0,0 +1,2 @@
+log/*
+finished_files/*
diff --git a/examples/text_summarization/pointer_summarizer/README.md b/examples/text_summarization/pointer_summarizer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..52bbdf8253586cae88119ae66bf46c7b480c616f
--- /dev/null
+++ b/examples/text_summarization/pointer_summarizer/README.md
@@ -0,0 +1,63 @@
+# Pointer Generator Network for Text Summarization
+
+This code is the Paddle v2.0 implementation of *[Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368)*.
+The code adapts and aligns with [a previous PyTorch implementation](https://github.com/atulkum/pointer_summarizer).
+
+To reach the state-of-the-art performance stated in the source paper, please use the default hyper-parameters listed in *config.py*.
+
+## Model performance (with pointer generation and coverage loss enabled)
+After training for 100k iterations with *batch_size=8*, the Paddle implementation achieves a ROUGE-1-f1 of 0.3980 (0.3907 by [a previous PyTorch implementation](https://github.com/atulkum/pointer_summarizer) and 0.3953 by [the source paper](https://arxiv.org/abs/1704.04368)). 
+ +``` +ROUGE-1: +rouge_1_f_score: 0.3980 with confidence interval (0.3959, 0.4002) +rouge_1_recall: 0.4639 with confidence interval (0.4613, 0.4667) +rouge_1_precision: 0.3707 with confidence interval (0.3683, 0.3732) + +ROUGE-2: +rouge_2_f_score: 0.1726 with confidence interval (0.1704, 0.1749) +rouge_2_recall: 0.2008 with confidence interval (0.1984, 0.2034) +rouge_2_precision: 0.1615 with confidence interval (0.1593, 0.1638) + +ROUGE-l: +rouge_l_f_score: 0.3617 with confidence interval (0.3597, 0.3640) +rouge_l_recall: 0.4214 with confidence interval (0.4188, 0.4242) +rouge_l_precision: 0.3371 with confidence interval (0.3348, 0.3396) + +``` + +## Prerequisites: +* The code is tested on Python 3.7.1 and Paddle 2.0.0 +* Training takes around 1s/iter on a single Tesla V100 (\~28 hours to train 100k iters) +* Decoding the entire test set takes 2-3 hours + +## Data Preprocessing: +1) Follow data generation instruction from https://github.com/abisee/cnn-dailymail **but place the *make_datafiles_json.py* script provided in this repo into https://github.com/abisee/cnn-dailymail and run *make_datafiles_json.py* instead of *make_datafiles.py* to minimize package dependencies.** +2) place the output folder *finished_files_json/* as a subfolder in this repo +3) You might need to change some paths and parameters in *config.py* + + +## How to run training: +* To train the model from start: +``` +python train.py +``` +* To continue training using a previously trained model: +``` +python train.py -m path/to/model/dir/ +``` + +## Set up ROUGE +* You need to setup [pyrouge](https://github.com/andersjo/pyrouge) to get the rouge score +* Also see [this tutorial](https://poojithansl7.wordpress.com/2018/08/04/setting-up-rouge/) to set up rouge and pyrouge. + + +## How to decode & evaluate: +* To decode using a previously trained model: +``` +python decode.py path/to/model/dir/ +``` +* If you already have the summaries generated using *decode.py* and only needs to run rouge evaluation: +``` +python rouge_eval.py path/to/decoded/dir/ +``` diff --git a/examples/text_summarization/pointer_summarizer/__init__.py b/examples/text_summarization/pointer_summarizer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/examples/text_summarization/pointer_summarizer/config.py b/examples/text_summarization/pointer_summarizer/config.py new file mode 100644 index 0000000000000000000000000000000000000000..795fb03fdb339895bff7960397ed6342d7a0cae9 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/config.py @@ -0,0 +1,31 @@ +# Data directories +train_data_path = "./finished_files/chunked/train_*.json" +eval_data_path = "./finished_files/val.json" +decode_data_path = "./finished_files/test.json" +vocab_path = "./finished_files/vocab" +log_root = "./log" + +# Hyperparameters +hidden_dim = 256 +emb_dim = 128 +batch_size = 8 +max_enc_steps = 400 +max_dec_steps = 100 +beam_size = 4 +min_dec_steps = 35 +vocab_size = 50000 + +lr = 0.15 +adagrad_init_acc = 0.1 +rand_unif_init_mag = 0.02 +trunc_norm_init_std = 1e-4 +max_grad_norm = 2.0 + +pointer_gen = True +is_coverage = True +cov_loss_wt = 1.0 + +eps = 1e-12 +max_iterations = 100000 + +lr_coverage = 0.15 diff --git a/examples/text_summarization/pointer_summarizer/data.py b/examples/text_summarization/pointer_summarizer/data.py new file mode 100644 index 0000000000000000000000000000000000000000..a8479943e691c3cfb68df383a503e9bb1fedf333 --- /dev/null +++ 
b/examples/text_summarization/pointer_summarizer/data.py @@ -0,0 +1,503 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This file is adapted from https://github.com/abisee/pointer-generator/blob/master/data.py + +import csv +import glob +import io +import json +import queue +import random +import time +from random import shuffle +from threading import Thread + +import config +import data +import numpy as np + +random.seed(123) + + +# <s> and </s> are used in the data files to segment the abstracts into sentences. They don't receive vocab ids. +SENTENCE_START = "<s>" +SENTENCE_END = "</s>" + +PAD_TOKEN = "[PAD]" # This has a vocab id, which is used to pad the encoder input, decoder input and target sequence +UNKNOWN_TOKEN = "[UNK]" # This has a vocab id, which is used to represent out-of-vocabulary words +START_DECODING = "[START]" # This has a vocab id, which is used at the start of every decoder input sequence +STOP_DECODING = "[STOP]" # This has a vocab id, which is used at the end of untruncated target sequences + +# Note: none of <s>, </s>, [PAD], [UNK], [START], [STOP] should appear in the vocab file. + + +class Example(object): + def __init__(self, article, abstract_sentences, vocab): + # Get ids of special tokens + start_decoding = vocab.word2id(data.START_DECODING) + stop_decoding = vocab.word2id(data.STOP_DECODING) + + # Process the article + article_words = article.split() + if len(article_words) > config.max_enc_steps: + article_words = article_words[: config.max_enc_steps] + self.enc_len = len(article_words) # store the length after truncation but before padding + self.enc_input = [ + vocab.word2id(w) for w in article_words + ] # list of word ids; OOVs are represented by the id for UNK token + + # Process the abstract + abstract = " ".join(abstract_sentences) # string + abstract_words = abstract.split() # list of strings + abs_ids = [ + vocab.word2id(w) for w in abstract_words + ] # list of word ids; OOVs are represented by the id for UNK token + + # Get the decoder input sequence and target sequence + self.dec_input, self.target = self.get_dec_inp_targ_seqs( + abs_ids, config.max_dec_steps, start_decoding, stop_decoding + ) + self.dec_len = len(self.dec_input) + + # If using pointer-generator mode, we need to store some extra info + if config.pointer_gen: + # Store a version of the enc_input where in-article OOVs are represented by their temporary OOV id; also store the in-article OOVs words themselves + self.enc_input_extend_vocab, self.article_oovs = data.article2ids(article_words, vocab) + + # Get a version of the reference summary where in-article OOVs are represented by their temporary article OOV id + abs_ids_extend_vocab = data.abstract2ids(abstract_words, vocab, self.article_oovs) + + # Overwrite decoder target sequence so it uses the temp article OOV ids + _, self.target = self.get_dec_inp_targ_seqs( + abs_ids_extend_vocab, config.max_dec_steps, start_decoding, stop_decoding + ) + + # Store the 
original strings + self.original_article = article + self.original_abstract = abstract + self.original_abstract_sents = abstract_sentences + + def get_dec_inp_targ_seqs(self, sequence, max_len, start_id, stop_id): + inp = [start_id] + sequence[:] + target = sequence[:] + if len(inp) > max_len: # truncate + inp = inp[:max_len] + target = target[:max_len] # no end_token + else: # no truncation + target.append(stop_id) # end token + assert len(inp) == len(target) + return inp, target + + def pad_decoder_inp_targ(self, max_len, pad_id): + while len(self.dec_input) < max_len: + self.dec_input.append(pad_id) + while len(self.target) < max_len: + self.target.append(pad_id) + + def pad_encoder_input(self, max_len, pad_id): + while len(self.enc_input) < max_len: + self.enc_input.append(pad_id) + if config.pointer_gen: + while len(self.enc_input_extend_vocab) < max_len: + self.enc_input_extend_vocab.append(pad_id) + + +class Batch(object): + def __init__(self, example_list, vocab, batch_size): + self.batch_size = batch_size + self.pad_id = vocab.word2id(data.PAD_TOKEN) # id of the PAD token used to pad sequences + self.init_encoder_seq(example_list) # initialize the input to the encoder + self.init_decoder_seq(example_list) # initialize the input and targets for the decoder + self.store_orig_strings(example_list) # store the original strings + + def init_encoder_seq(self, example_list): + # Determine the maximum length of the encoder input sequence in this batch + max_enc_seq_len = max([ex.enc_len for ex in example_list]) + + # Pad the encoder input sequences up to the length of the longest sequence + for ex in example_list: + ex.pad_encoder_input(max_enc_seq_len, self.pad_id) + + # Initialize the numpy arrays + # Note: our enc_batch can have different length (second dimension) for each batch because we use dynamic_rnn for the encoder. + self.enc_batch = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.int32) + self.enc_lens = np.zeros((self.batch_size), dtype=np.int32) + self.enc_padding_mask = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.float32) + + # Fill in the numpy arrays + for i, ex in enumerate(example_list): + self.enc_batch[i, :] = ex.enc_input[:] + self.enc_lens[i] = ex.enc_len + for j in range(ex.enc_len): + self.enc_padding_mask[i][j] = 1 + + # For pointer-generator mode, need to store some extra info + if config.pointer_gen: + # Determine the max number of in-article OOVs in this batch + self.max_art_oovs = max([len(ex.article_oovs) for ex in example_list]) + # Store the in-article OOVs themselves + self.art_oovs = [ex.article_oovs for ex in example_list] + # Store the version of the enc_batch that uses the article OOV ids + self.enc_batch_extend_vocab = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.int32) + for i, ex in enumerate(example_list): + self.enc_batch_extend_vocab[i, :] = ex.enc_input_extend_vocab[:] + + def init_decoder_seq(self, example_list): + # Pad the inputs and targets + for ex in example_list: + ex.pad_decoder_inp_targ(config.max_dec_steps, self.pad_id) + + # Initialize the numpy arrays. 
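+        # Unlike the encoder arrays (sized to the longest sequence in this batch),
+        # the decoder arrays always use the fixed width config.max_dec_steps,
+        # because pad_decoder_inp_targ above pads every example to that length.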
+ self.dec_batch = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.int32) + self.target_batch = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.int32) + self.dec_padding_mask = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.float32) + self.dec_lens = np.zeros((self.batch_size), dtype=np.int32) + + # Fill in the numpy arrays + for i, ex in enumerate(example_list): + self.dec_batch[i, :] = ex.dec_input[:] + self.target_batch[i, :] = ex.target[:] + self.dec_lens[i] = ex.dec_len + for j in range(ex.dec_len): + self.dec_padding_mask[i][j] = 1 + + def store_orig_strings(self, example_list): + self.original_articles = [ex.original_article for ex in example_list] # list of lists + self.original_abstracts = [ex.original_abstract for ex in example_list] # list of lists + self.original_abstracts_sents = [ex.original_abstract_sents for ex in example_list] # list of lists + + +class Batcher(object): + BATCH_QUEUE_MAX = 100 # max number of batches the batch_queue can hold + + def __init__(self, data_path, vocab, mode, batch_size, single_pass): + self._data_path = data_path + self._vocab = vocab + self._single_pass = single_pass + self.mode = mode + self.batch_size = batch_size + # Initialize a queue of Batches waiting to be used, and a queue of Examples waiting to be batched + self._batch_queue = queue.Queue(self.BATCH_QUEUE_MAX) + self._example_queue = queue.Queue(self.BATCH_QUEUE_MAX * self.batch_size) + + # Different settings depending on whether we're in single_pass mode or not + if single_pass: + self._num_example_q_threads = 1 # just one thread, so we read through the dataset just once + self._num_batch_q_threads = 1 # just one thread to batch examples + self._bucketing_cache_size = ( + 1 # only load one batch's worth of examples before bucketing; this essentially means no bucketing + ) + self._finished_reading = False # this will tell us when we're finished reading the dataset + else: + self._num_example_q_threads = 1 # 16 # num threads to fill example queue + self._num_batch_q_threads = 1 # 4 # num threads to fill batch queue + self._bucketing_cache_size = ( + 1 # 100 # how many batches-worth of examples to load into cache before bucketing + ) + + # Start the threads that load the queues + self._example_q_threads = [] + for _ in range(self._num_example_q_threads): + self._example_q_threads.append(Thread(target=self.fill_example_queue)) + self._example_q_threads[-1].daemon = True + self._example_q_threads[-1].start() + self._batch_q_threads = [] + for _ in range(self._num_batch_q_threads): + self._batch_q_threads.append(Thread(target=self.fill_batch_queue)) + self._batch_q_threads[-1].daemon = True + self._batch_q_threads[-1].start() + + # Start a thread that watches the other threads and restarts them if they're dead + if not single_pass: # We don't want a watcher in single_pass mode because the threads shouldn't run forever + self._watch_thread = Thread(target=self.watch_threads) + self._watch_thread.daemon = True + self._watch_thread.start() + + def next_batch(self): + # If the batch queue is empty, print a warning + if self._batch_queue.qsize() == 0: + print( + "Bucket input queue is empty when calling next_batch. 
Bucket queue size: %i, Input queue size: %i" + % (self._batch_queue.qsize(), self._example_queue.qsize()) + ) + if self._single_pass and self._finished_reading: + print("Finished reading dataset in single_pass mode.") + return None + + batch = self._batch_queue.get() # get the next Batch + return batch + + def fill_example_queue(self): + input_gen = self.text_generator(data.example_generator(self._data_path, self._single_pass)) + + while True: + try: + (article, abstract) = next( + input_gen + ) # read the next example from file. article and abstract are both strings. + except StopIteration: # if there are no more examples: + print("The example generator for this example queue filling thread has exhausted data.") + if self._single_pass: + print("single_pass mode is on, so we've finished reading dataset. This thread is stopping.") + self._finished_reading = True + break + else: + raise Exception("single_pass mode is off but the example generator is out of data; error.") + + abstract_sentences = [ + sent.strip() for sent in data.abstract2sents(abstract) + ] # Use the <s> and </s> tags in abstract to get a list of sentences. + example = Example(article, abstract_sentences, self._vocab) # Process into an Example. + self._example_queue.put(example) # place the Example in the example queue. + + def fill_batch_queue(self): + while True: + if self.mode == "decode": + # Beam search decode mode where a single example is repeated in the batch + ex = self._example_queue.get() + b = [ex for _ in range(self.batch_size)] + self._batch_queue.put(Batch(b, self._vocab, self.batch_size)) + else: + # Get bucketing_cache_size-many batches of Examples into a list, then sort + inputs = [] + for _ in range(self.batch_size * self._bucketing_cache_size): + inputs.append(self._example_queue.get()) + inputs = sorted( + inputs, key=lambda inp: inp.enc_len, reverse=True + ) # sort by length of encoder sequence + + # Group the sorted Examples into batches, optionally shuffle the batches, and place in the batch queue. + batches = [] + for i in range(0, len(inputs), self.batch_size): + batches.append(inputs[i : i + self.batch_size]) + if not self._single_pass: + shuffle(batches) + for b in batches: # each b is a list of Example objects + self._batch_queue.put(Batch(b, self._vocab, self.batch_size)) + + def watch_threads(self): + while True: + print( + "Bucket queue size: %i, Input queue size: %i" + % (self._batch_queue.qsize(), self._example_queue.qsize()) + ) + + time.sleep(60) + for idx, t in enumerate(self._example_q_threads): + if not t.is_alive(): # if the thread is dead + print("Found example queue thread dead. Restarting.") + new_t = Thread(target=self.fill_example_queue) + self._example_q_threads[idx] = new_t + new_t.daemon = True + new_t.start() + for idx, t in enumerate(self._batch_q_threads): + if not t.is_alive(): # if the thread is dead + print("Found batch queue thread dead. Restarting.") + new_t = Thread(target=self.fill_batch_queue) + self._batch_q_threads[idx] = new_t + new_t.daemon = True + new_t.start() + + def text_generator(self, example_generator): + while True: + e = next(example_generator) + try: + article_text = e["article"] + abstract_text = e["abstract"] + except ValueError: + print("Failed to get article or abstract from example") + continue + if len(article_text) == 0: # See https://github.com/abisee/pointer-generator/issues/1 + print("Found an example with empty article text. 
Skipping it.") + continue + else: + yield (article_text, abstract_text) + + +class Vocab(object): + def __init__(self, vocab_file, max_size): + self._word_to_id = {} + self._id_to_word = {} + self._count = 0 # keeps track of total number of words in the Vocab + + # [UNK], [PAD], [START] and [STOP] get the ids 0,1,2,3. + for w in [UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]: + self._word_to_id[w] = self._count + self._id_to_word[self._count] = w + self._count += 1 + + # Read the vocab file and add words up to max_size + with open(vocab_file, "r", encoding="utf8") as vocab_f: + for line in vocab_f: + pieces = line.split() + if len(pieces) != 2: + print("Warning: incorrectly formatted line in vocabulary file: %s\n" % line) + continue + w = pieces[0] + if w in [SENTENCE_START, SENTENCE_END, UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]: + raise Exception( + "<s>, </s>, [UNK], [PAD], [START] and [STOP] shouldn't be in the vocab file, but %s is" % w + ) + if w in self._word_to_id: + raise Exception("Duplicated word in vocabulary file: %s" % w) + self._word_to_id[w] = self._count + self._id_to_word[self._count] = w + self._count += 1 + if max_size != 0 and self._count >= max_size: + print( + "max_size of vocab was specified as %i; we now have %i words. Stopping reading." + % (max_size, self._count) + ) + break + + print( + "Finished constructing vocabulary of %i total words. Last word added: %s" + % (self._count, self._id_to_word[self._count - 1]) + ) + + def word2id(self, word): + if word not in self._word_to_id: + return self._word_to_id[UNKNOWN_TOKEN] + return self._word_to_id[word] + + def id2word(self, word_id): + if word_id not in self._id_to_word: + raise ValueError("Id not found in vocab: %d" % word_id) + return self._id_to_word[word_id] + + def size(self): + return self._count + + def write_metadata(self, fpath): + print("Writing word embedding metadata file to %s..." % (fpath)) + with open(fpath, "w") as f: + fieldnames = ["word"] + writer = csv.DictWriter(f, delimiter="\t", fieldnames=fieldnames) + for i in range(self.size()): + writer.writerow({"word": self._id_to_word[i]}) + + +def example_generator(data_path, single_pass): + while True: + filelist = glob.glob(data_path) # get the list of datafiles + assert filelist, "Error: Empty filelist at %s" % data_path # check filelist isn't empty + if single_pass: + filelist = sorted(filelist) + else: + random.shuffle(filelist) + for f in filelist: + reader = io.open(f, "r", encoding="utf8") + while True: + reader_str = reader.readline() + if reader_str == "": + break + reader_json = json.loads(reader_str) + yield reader_json + if single_pass: + print("example_generator completed reading all datafiles. No more data.") + break + + +def article2ids(article_words, vocab): + ids = [] + oovs = [] + unk_id = vocab.word2id(UNKNOWN_TOKEN) + for w in article_words: + i = vocab.word2id(w) + if i == unk_id: # If w is OOV + if w not in oovs: # Add to list of OOVs + oovs.append(w) + oov_num = oovs.index(w) # This is 0 for the first article OOV, 1 for the second article OOV... + ids.append(vocab.size() + oov_num) # This is e.g. 50000 for the first article OOV, 50001 for the second... 
+ else: + ids.append(i) + return ids, oovs + + +def abstract2ids(abstract_words, vocab, article_oovs): + ids = [] + unk_id = vocab.word2id(UNKNOWN_TOKEN) + for w in abstract_words: + i = vocab.word2id(w) + if i == unk_id: # If w is an OOV word + if w in article_oovs: # If w is an in-article OOV + vocab_idx = vocab.size() + article_oovs.index(w) # Map to its temporary article OOV number + ids.append(vocab_idx) + else: # If w is an out-of-article OOV + ids.append(unk_id) # Map to the UNK token id + else: + ids.append(i) + return ids + + +def outputids2words(id_list, vocab, article_oovs): + words = [] + for i in id_list: + try: + w = vocab.id2word(i) # might be [UNK] + except ValueError: # w is OOV + assert ( + article_oovs is not None + ), "Error: model produced a word ID that isn't in the vocabulary. This should not happen in baseline (no pointer-generator) mode" + article_oov_idx = i - vocab.size() + try: + w = article_oovs[article_oov_idx] + except ValueError: # i doesn't correspond to an article oov + raise ValueError( + "Error: model produced word ID %i which corresponds to article OOV %i but this example only has %i article OOVs" + % (i, article_oov_idx, len(article_oovs)) + ) + words.append(w) + return words + + +def abstract2sents(abstract): + cur = 0 + sents = [] + while True: + try: + start_p = abstract.index(SENTENCE_START, cur) + end_p = abstract.index(SENTENCE_END, start_p + 1) + cur = end_p + len(SENTENCE_END) + sents.append(abstract[start_p + len(SENTENCE_START) : end_p]) + except ValueError: # no more sentences + return sents + + +def show_art_oovs(article, vocab): + unk_token = vocab.word2id(UNKNOWN_TOKEN) + words = article.split(" ") + words = [("__%s__" % w) if vocab.word2id(w) == unk_token else w for w in words] + out_str = " ".join(words) + return out_str + + +def show_abs_oovs(abstract, vocab, article_oovs): + unk_token = vocab.word2id(UNKNOWN_TOKEN) + words = abstract.split(" ") + new_words = [] + for w in words: + if vocab.word2id(w) == unk_token: # w is oov + if article_oovs is None: # baseline mode + new_words.append("__%s__" % w) + else: # pointer-generator mode + if w in article_oovs: + new_words.append("__%s__" % w) + else: + new_words.append("!!__%s__!!" % w) + else: # w is in-vocab word + new_words.append(w) + out_str = " ".join(new_words) + return out_str diff --git a/examples/text_summarization/pointer_summarizer/decode.py b/examples/text_summarization/pointer_summarizer/decode.py new file mode 100644 index 0000000000000000000000000000000000000000..e66d9442f1fde5911f8ce60ce543efd356645a66 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/decode.py @@ -0,0 +1,232 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
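To make the extended-vocabulary mapping in `data.py` above concrete, here is a small standalone sketch (a made-up three-word vocabulary, not the `Vocab` class) that mirrors the logic of `article2ids`: every in-article OOV receives a temporary id of `vocab_size + oov_index`, which `outputids2words` can later map back to the original string.

```python
# Illustrative sketch only: toy vocabulary and article, mirroring article2ids in data.py.
vocab = {"[UNK]": 0, "the": 1, "storm": 2}  # toy vocab of size 3
article = ["the", "cyclone", "hit", "the", "coast"]

ids, oovs = [], []
for w in article:
    if w in vocab:
        ids.append(vocab[w])
    else:
        if w not in oovs:
            oovs.append(w)
        ids.append(len(vocab) + oovs.index(w))  # temporary extended-vocab id

print(ids)   # [1, 3, 4, 1, 5]
print(oovs)  # ['cyclone', 'hit', 'coast']
```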
+ +# Except for the paddle part, content of this file is copied from https://github.com/abisee/pointer-generator/blob/master/ + +import os +import re +import sys +import time + +import config +import data +import paddle +from data import Batcher, Vocab +from model import Model +from train_util import get_input_from_batch +from utils import rouge_eval, rouge_log, write_for_rouge + + +class Beam(object): + def __init__(self, tokens, log_probs, state, context, coverage): + self.tokens = tokens + self.log_probs = log_probs + self.state = state + self.context = context + self.coverage = coverage + + def extend(self, token, log_prob, state, context, coverage): + return Beam( + tokens=self.tokens + [token], + log_probs=self.log_probs + [log_prob], + state=state, + context=context, + coverage=coverage, + ) + + @property + def latest_token(self): + return self.tokens[-1] + + @property + def avg_log_prob(self): + return sum(self.log_probs) / len(self.tokens) + + +class BeamSearch(object): + def __init__(self, model_file_path): + model_name = ( + re.findall(r"train_\d+", model_file_path)[0] + "_" + re.findall(r"model_\d+_\d+\.\d+", model_file_path)[0] + ) + self._decode_dir = os.path.join(config.log_root, "decode_%s" % (model_name)) + self._rouge_ref_dir = os.path.join(self._decode_dir, "rouge_ref") + self._rouge_dec_dir = os.path.join(self._decode_dir, "rouge_dec_dir") + for p in [self._decode_dir, self._rouge_ref_dir, self._rouge_dec_dir]: + if not os.path.exists(p): + os.mkdir(p) + + self.vocab = Vocab(config.vocab_path, config.vocab_size) + self.batcher = Batcher( + config.decode_data_path, self.vocab, mode="decode", batch_size=config.beam_size, single_pass=True + ) + self.model = Model(model_file_path, is_eval=True) + + def sort_beams(self, beams): + return sorted(beams, key=lambda h: h.avg_log_prob, reverse=True) + + def decode(self): + start = time.time() + counter = 0 + batch = self.batcher.next_batch() + while batch is not None: + # Run beam search to get best Hypothesis + best_summary = self.beam_search(batch) + + # Extract the output ids from the hypothesis and convert back to words + output_ids = [int(t) for t in best_summary.tokens[1:]] + decoded_words = data.outputids2words( + output_ids, self.vocab, (batch.art_oovs[0] if config.pointer_gen else None) + ) + + # Remove the [STOP] token from decoded_words, if necessary + try: + fst_stop_idx = decoded_words.index(data.STOP_DECODING) + decoded_words = decoded_words[:fst_stop_idx] + except ValueError: + decoded_words = decoded_words + + original_abstract_sents = batch.original_abstracts_sents[0] + + write_for_rouge(original_abstract_sents, decoded_words, counter, self._rouge_ref_dir, self._rouge_dec_dir) + counter += 1 + print("global step %d, %.2f step/s" % (counter, 1.0 / (time.time() - start))) + start = time.time() + batch = self.batcher.next_batch() + + print("Decoder has finished reading dataset for single_pass.") + print("Now starting ROUGE eval...") + results_dict = rouge_eval(self._rouge_ref_dir, self._rouge_dec_dir) + rouge_log(results_dict, self._decode_dir) + + def beam_search(self, batch): + # The batch should have only one example + ( + enc_batch, + enc_padding_mask, + enc_lens, + enc_batch_extend_vocab, + extra_zeros, + c_t_0, + coverage_t_0, + ) = get_input_from_batch(batch) + + encoder_outputs, encoder_feature, encoder_hidden = self.model.encoder(enc_batch, enc_lens) + s_t_0 = self.model.reduce_state(encoder_hidden) + + dec_h, dec_c = s_t_0 # 1 x 2*hidden_size + dec_h = dec_h.squeeze() + dec_c = dec_c.squeeze() + + # Prepare 
decoder batch + beams = [ + Beam( + tokens=[self.vocab.word2id(data.START_DECODING)], + log_probs=[0.0], + state=(dec_h[0], dec_c[0]), + context=c_t_0[0], + coverage=(coverage_t_0[0] if config.is_coverage else None), + ) + for _ in range(config.beam_size) + ] + results = [] + steps = 0 + while steps < config.max_dec_steps and len(results) < config.beam_size: + latest_tokens = [h.latest_token for h in beams] + latest_tokens = [ + t if t < self.vocab.size() else self.vocab.word2id(data.UNKNOWN_TOKEN) for t in latest_tokens + ] + y_t_1 = paddle.to_tensor(latest_tokens) + all_state_h = [] + all_state_c = [] + + all_context = [] + + for h in beams: + state_h, state_c = h.state + all_state_h.append(state_h) + all_state_c.append(state_c) + + all_context.append(h.context) + + s_t_1 = (paddle.stack(all_state_h, 0).unsqueeze(0), paddle.stack(all_state_c, 0).unsqueeze(0)) + c_t_1 = paddle.stack(all_context, 0) + + coverage_t_1 = None + if config.is_coverage: + all_coverage = [] + for h in beams: + all_coverage.append(h.coverage) + coverage_t_1 = paddle.stack(all_coverage, 0) + + final_dist, s_t, c_t, attn_dist, p_gen, coverage_t = self.model.decoder( + y_t_1, + s_t_1, + encoder_outputs, + encoder_feature, + enc_padding_mask, + c_t_1, + extra_zeros, + enc_batch_extend_vocab, + coverage_t_1, + steps, + ) + log_probs = paddle.log(final_dist) + topk_log_probs, topk_ids = paddle.topk(log_probs, config.beam_size * 2) + + dec_h, dec_c = s_t + dec_h = dec_h.squeeze() + dec_c = dec_c.squeeze() + + all_beams = [] + num_orig_beams = 1 if steps == 0 else len(beams) + for i in range(num_orig_beams): + h = beams[i] + state_i = (dec_h[i], dec_c[i]) + context_i = c_t[i] + coverage_i = coverage_t[i] if config.is_coverage else None + + for j in range(config.beam_size * 2): # for each of the top 2*beam_size hyps: + new_beam = h.extend( + token=int(topk_ids[i, j]), + log_prob=float(topk_log_probs[i, j]), + state=state_i, + context=context_i, + coverage=coverage_i, + ) + all_beams.append(new_beam) + + beams = [] + for h in self.sort_beams(all_beams): + if h.latest_token == self.vocab.word2id(data.STOP_DECODING): + if steps >= config.min_dec_steps: + results.append(h) + else: + beams.append(h) + if len(beams) == config.beam_size or len(results) == config.beam_size: + break + + steps += 1 + + if len(results) == 0: + results = beams + + beams_sorted = self.sort_beams(results) + + return beams_sorted[0] + + +if __name__ == "__main__": + model_filename = sys.argv[1] + beam_search_processor = BeamSearch(model_filename) + beam_search_processor.decode() diff --git a/examples/text_summarization/pointer_summarizer/make_datafiles_json.py b/examples/text_summarization/pointer_summarizer/make_datafiles_json.py new file mode 100644 index 0000000000000000000000000000000000000000..e266a6544247810d72da68f117267d9720078eb4 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/make_datafiles_json.py @@ -0,0 +1,282 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
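The beam search in `decode.py` above keeps `beam_size` partial hypotheses, extends each with the top-scoring next tokens, and ranks them by `avg_log_prob` so longer hypotheses are not penalized merely for accumulating more log-probability terms. The toy class below is an illustrative sketch of that bookkeeping only; `ToyBeam` and the token ids are invented and it is not part of `decode.py`.

```python
# Illustrative sketch of Beam.extend and avg_log_prob ranking used by BeamSearch above.
class ToyBeam:
    def __init__(self, tokens, log_probs):
        self.tokens = tokens        # token ids decoded so far
        self.log_probs = log_probs  # per-token log-probabilities

    def extend(self, token, log_prob):
        # Extending never mutates the parent hypothesis, so one beam can spawn many children.
        return ToyBeam(self.tokens + [token], self.log_probs + [log_prob])

    @property
    def avg_log_prob(self):
        return sum(self.log_probs) / len(self.tokens)


start = ToyBeam(tokens=[2], log_probs=[0.0])            # 2 = hypothetical [START] id
candidates = [start.extend(17, -0.1), start.extend(42, -0.9)]
best = sorted(candidates, key=lambda h: h.avg_log_prob, reverse=True)[0]
print(best.tokens, best.avg_log_prob)                    # [2, 17] -0.05
```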
+ +import collections +import hashlib +import json +import os +import subprocess +import sys + +dm_single_close_quote = "\u2019" # unicode +dm_double_close_quote = "\u201d" +END_TOKENS = [ + ".", + "!", + "?", + "...", + "'", + "`", + '"', + dm_single_close_quote, + dm_double_close_quote, + ")", +] # acceptable ways to end a sentence + +# We use these to separate the summary sentences in the json datafiles +SENTENCE_START = "<s>" +SENTENCE_END = "</s>" + +all_train_urls = "url_lists/all_train.txt" +all_val_urls = "url_lists/all_val.txt" +all_test_urls = "url_lists/all_test.txt" + +cnn_tokenized_stories_dir = "cnn_stories_tokenized_json" +dm_tokenized_stories_dir = "dm_stories_tokenized_json" +finished_files_dir = "finished_files_json" +chunks_dir = os.path.join(finished_files_dir, "chunked") + +# These are the number of .story files we expect there to be in cnn_stories_dir and dm_stories_dir +num_expected_cnn_stories = 92579 +num_expected_dm_stories = 219506 + +VOCAB_SIZE = 200000 +CHUNK_SIZE = 1000 # num examples per chunk, for the chunked data + + +def chunk_file(set_name): + in_file = finished_files_dir + os.sep + "%s.json" % set_name + reader = open(in_file, "r") + chunk = 0 + finished = False + while not finished: + chunk_fname = os.path.join(chunks_dir, "%s_%03d.json" % (set_name, chunk)) # new chunk + with open(chunk_fname, "w") as writer: + for _ in range(CHUNK_SIZE): + len_line = reader.readline() + if not len_line: + finished = True + break + writer.write(len_line) + chunk += 1 + + +def chunk_all(): + # Make a dir to hold the chunks + if not os.path.isdir(chunks_dir): + os.mkdir(chunks_dir) + # Chunk the data + for set_name in ["train", "val", "test"]: + print("Splitting %s data into chunks..." % set_name) + chunk_file(set_name) + print("Saved chunked data in %s" % chunks_dir) + + +def tokenize_stories(stories_dir, tokenized_stories_dir): + """Maps a whole directory of .story files to a tokenized version using Stanford CoreNLP Tokenizer""" + print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir)) + stories = os.listdir(stories_dir) + # make IO list file + print("Making list of files to tokenize...") + with open("mapping.txt", "w") as f: + for s in stories: + f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s))) + command = ["java", "edu.stanford.nlp.process.PTBTokenizer", "-ioFileList", "-preserveLines", "mapping.txt"] + print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir)) + subprocess.call(command) + print("Stanford CoreNLP Tokenizer has finished.") + os.remove("mapping.txt") + + # Check that the tokenized stories directory contains the same number of files as the original directory + num_orig = len(os.listdir(stories_dir)) + num_tokenized = len(os.listdir(tokenized_stories_dir)) + if num_orig != num_tokenized: + raise Exception( + "The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" 
+ % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig) + ) + print("Successfully finished tokenizing %s to %s.\n" % (stories_dir, tokenized_stories_dir)) + + +def read_text_file(text_file): + lines = [] + with open(text_file, "r") as f: + for line in f: + lines.append(line.strip()) + return lines + + +def hashhex(s): + """Returns a heximal formated SHA1 hash of the input string.""" + h = hashlib.sha1() + h.update(s.encode()) + return h.hexdigest() + + +def get_url_hashes(url_list): + return [hashhex(url) for url in url_list] + + +def fix_missing_period(line): + """Adds a period to a line that is missing a period""" + if "@highlight" in line: + return line + if line == "": + return line + if line[-1] in END_TOKENS: + return line + # print line[-1] + return line + " ." + + +def get_art_abs(story_file): + lines = read_text_file(story_file) + + # Lowercase everything + lines = [line.lower() for line in lines] + + # Put periods on the ends of lines that are missing them (this is a problem in the dataset because many image captions don't end in periods; consequently they end up in the body of the article as run-on sentences) + lines = [fix_missing_period(line) for line in lines] + + # Separate out article and abstract sentences + article_lines = [] + highlights = [] + next_is_highlight = False + for idx, line in enumerate(lines): + if line == "": + continue # empty line + elif line.startswith("@highlight"): + next_is_highlight = True + elif next_is_highlight: + highlights.append(line) + else: + article_lines.append(line) + + # Make article into a single string + article = " ".join(article_lines) + + # Make abstract into a signle string, putting <s> and </s> tags around the sentences + abstract = " ".join(["%s %s %s" % (SENTENCE_START, sent, SENTENCE_END) for sent in highlights]) + + return article, abstract + + +def write_to_bin(url_file, out_file, makevocab=False): + """Reads the tokenized .story files corresponding to the urls listed in the url_file and writes them to a out_file.""" + print("Making bin file for URLs listed in %s..." % url_file) + url_list = read_text_file(url_file) + url_hashes = get_url_hashes(url_list) + story_fnames = [s + ".story" for s in url_hashes] + num_stories = len(story_fnames) + + if makevocab: + vocab_counter = collections.Counter() + + with open(out_file, "w") as writer: + for idx, s in enumerate(story_fnames): + if idx % 1000 == 0: + print( + "Writing story %i of %i; %.2f percent done" + % (idx, num_stories, float(idx) * 100.0 / float(num_stories)) + ) + + # Look in the tokenized story dirs to find the .story file corresponding to this url + if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)): + story_file = os.path.join(cnn_tokenized_stories_dir, s) + elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)): + story_file = os.path.join(dm_tokenized_stories_dir, s) + else: + print( + "Error: Couldn't find tokenized story file %s in either tokenized story directories %s and %s. Was there an error during tokenization?" + % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir) + ) + # Check again if tokenized stories directories contain correct number of files + print( + "Checking that the tokenized stories directories %s and %s contain correct number of files..." 
+ % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir) + ) + check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories) + check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories) + raise Exception( + "Tokenized stories directories %s and %s contain correct number of files but story file %s found in neither." + % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s) + ) + + # Get the strings to write to .json file + article, abstract = get_art_abs(story_file) + + # Write to json + writer.write(json.dumps({"article": article, "abstract": abstract}) + "\n") + + # Write the vocab to file, if applicable + if makevocab: + art_tokens = article.split(" ") + abs_tokens = abstract.split(" ") + abs_tokens = [ + t for t in abs_tokens if t not in [SENTENCE_START, SENTENCE_END] + ] # remove these tags from vocab + tokens = art_tokens + abs_tokens + tokens = [t.strip() for t in tokens] # strip + tokens = [t for t in tokens if t != ""] # remove empty + vocab_counter.update(tokens) + + print("Finished writing file %s\n" % out_file) + + # write vocab to file + if makevocab: + print("Writing vocab file...") + with open(os.path.join(finished_files_dir, "vocab"), "w") as writer: + for word, count in vocab_counter.most_common(VOCAB_SIZE): + writer.write(word + " " + str(count) + "\n") + print("Finished writing vocab file") + + +def check_num_stories(stories_dir, num_expected): + num_stories = len(os.listdir(stories_dir)) + if num_stories != num_expected: + raise Exception( + "stories directory %s contains %i files but should contain %i" % (stories_dir, num_stories, num_expected) + ) + + +if __name__ == "__main__": + if len(sys.argv) != 3: + print("USAGE: python make_datafiles.py <cnn_stories_dir> <dailymail_stories_dir>") + sys.exit() + cnn_stories_dir = sys.argv[1] + dm_stories_dir = sys.argv[2] + + # Check the stories directories contain the correct number of .story files + check_num_stories(cnn_stories_dir, num_expected_cnn_stories) + check_num_stories(dm_stories_dir, num_expected_dm_stories) + + # Create some new directories + if not os.path.exists(cnn_tokenized_stories_dir): + os.makedirs(cnn_tokenized_stories_dir) + if not os.path.exists(dm_tokenized_stories_dir): + os.makedirs(dm_tokenized_stories_dir) + if not os.path.exists(finished_files_dir): + os.makedirs(finished_files_dir) + + # Run stanford tokenizer on both stories dirs, outputting to tokenized stories directories + tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir) + tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir) + + # Read the tokenized stories, do a little postprocessing then write to bin files + write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.json")) + write_to_bin(all_val_urls, os.path.join(finished_files_dir, "val.json")) + write_to_bin(all_train_urls, os.path.join(finished_files_dir, "train.json"), makevocab=True) + + # Chunk the data. This splits each of train.json, val.json and test.json into smaller chunks, each containing e.g. 
1000 examples, and saves them in finished_files/chunks + chunk_all() diff --git a/examples/text_summarization/pointer_summarizer/model.py b/examples/text_summarization/pointer_summarizer/model.py new file mode 100644 index 0000000000000000000000000000000000000000..413df876e05be34ccfc97f0cd289ef028bd65abb --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/model.py @@ -0,0 +1,263 @@ +import os +import sys + +import paddle +import paddle.nn.initializer as I + +import paddle.nn as nn +import paddle.nn.functional as F +import config + + +def paddle2D_scatter_add(x_tensor, index_tensor, update_tensor, dim=0): + dim0, dim1 = update_tensor.shape + update_tensor = paddle.flatten(update_tensor, start_axis=0, stop_axis=1) + index_tensor = paddle.reshape(index_tensor, [-1, 1]) + if dim == 0: + index_tensor = paddle.concat(x=[index_tensor, (paddle.arange(dim1 * dim0) % dim0).unsqueeze(1)], axis=1) + elif dim == 1: + index_tensor = paddle.concat(x=[(paddle.arange(dim1 * dim0) // dim1).unsqueeze(1), index_tensor], axis=1) + output_tensor = paddle.scatter_nd_add(x_tensor, index_tensor, update_tensor) + return output_tensor + + +class Encoder(paddle.nn.Layer): + def __init__(self): + super(Encoder, self).__init__() + + # Initialized embeddings + self.embedding = nn.Embedding( + config.vocab_size, + config.emb_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + # Initialized lstm weights + self.lstm = nn.LSTM( + config.emb_dim, + config.hidden_dim, + num_layers=1, + direction="bidirect", + weight_ih_attr=paddle.ParamAttr( + initializer=I.Uniform(low=-config.rand_unif_init_mag, high=config.rand_unif_init_mag) + ), + bias_ih_attr=paddle.ParamAttr(initializer=I.Constant(value=0.0)), + ) + + # Initialized linear weights + self.W_h = nn.Linear(config.hidden_dim * 2, config.hidden_dim * 2, bias_attr=False) + + # The variable seq_lens should be in descending order + def forward(self, input, seq_lens): + embedded = self.embedding(input) + self.embedded = embedded + + output, hidden = self.lstm(embedded, sequence_length=paddle.to_tensor(seq_lens, dtype="int32")) + + encoder_feature = paddle.reshape(output, [-1, 2 * config.hidden_dim]) # B * t_k x 2*hidden_dim + encoder_feature = self.W_h(encoder_feature) + + return output, encoder_feature, hidden + + +class ReduceState(paddle.nn.Layer): + def __init__(self): + super(ReduceState, self).__init__() + + self.reduce_h = nn.Linear( + config.hidden_dim * 2, + config.hidden_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + self.reduce_c = nn.Linear( + config.hidden_dim * 2, + config.hidden_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + def forward(self, hidden): + h, c = hidden # h, c dim = 2 x b x hidden_dim + h_in = paddle.reshape(h.transpose([1, 0, 2]), [-1, config.hidden_dim * 2]) + hidden_reduced_h = F.relu(self.reduce_h(h_in)) + c_in = paddle.reshape(c.transpose([1, 0, 2]), [-1, config.hidden_dim * 2]) + hidden_reduced_c = F.relu(self.reduce_c(c_in)) + + return (hidden_reduced_h.unsqueeze(0), hidden_reduced_c.unsqueeze(0)) # h, c dim = 1 x b x hidden_dim + + +class Attention(paddle.nn.Layer): + def __init__(self): + super(Attention, self).__init__() + # Attention + if config.is_coverage: + self.W_c = nn.Linear(1, config.hidden_dim * 2, bias_attr=False) + self.decode_proj = nn.Linear(config.hidden_dim * 2, config.hidden_dim * 2) + self.v = nn.Linear(config.hidden_dim * 2, 1, bias_attr=False) + + def 
forward(self, s_t_hat, encoder_outputs, encoder_feature, enc_padding_mask, coverage): + b, t_k, n = encoder_outputs.shape + + dec_fea = self.decode_proj(s_t_hat) # B x 2*hidden_dim + dec_fea_expanded = paddle.expand(dec_fea.unsqueeze(1), [b, t_k, n]) # B x t_k x 2*hidden_dim + dec_fea_expanded = paddle.reshape(dec_fea_expanded, [-1, n]) # B * t_k x 2*hidden_dim + + att_features = encoder_feature + dec_fea_expanded # B * t_k x 2*hidden_dim + if config.is_coverage: + coverage_input = paddle.reshape(coverage, [-1, 1]) # B * t_k x 1 + coverage_feature = self.W_c(coverage_input) # B * t_k x 2*hidden_dim + att_features = att_features + coverage_feature + + e = F.tanh(att_features) # B * t_k x 2*hidden_dim + scores = self.v(e) # B * t_k x 1 + scores = paddle.reshape(scores, [-1, t_k]) # B x t_k + + attn_dist_ = F.softmax(scores, axis=1) * enc_padding_mask # B x t_k + normalization_factor = attn_dist_.sum(1, keepdim=True) + # attn_dist = attn_dist_ / normalization_factor + attn_dist = attn_dist_ / ( + paddle.reshape(normalization_factor, [-1, 1]) + + paddle.ones_like(paddle.reshape(normalization_factor, [-1, 1])) * sys.float_info.epsilon + ) + # See the issue: https://github.com/atulkum/pointer_summarizer/issues/54 + + attn_dist = attn_dist.unsqueeze(1) # B x 1 x t_k + c_t = paddle.bmm(attn_dist, encoder_outputs) # B x 1 x n + c_t = paddle.reshape(c_t, [-1, config.hidden_dim * 2]) # B x 2*hidden_dim + + attn_dist = paddle.reshape(attn_dist, [-1, t_k]) # B x t_k + + if config.is_coverage: + coverage = paddle.reshape(coverage, [-1, t_k]) + coverage = coverage + attn_dist + + return c_t, attn_dist, coverage + + +class Decoder(paddle.nn.Layer): + def __init__(self): + super(Decoder, self).__init__() + self.attention_network = Attention() + # Decoder + self.embedding = nn.Embedding( + config.vocab_size, + config.emb_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + self.x_context = nn.Linear(config.hidden_dim * 2 + config.emb_dim, config.emb_dim) + + self.lstm = nn.LSTM( + config.emb_dim, + config.hidden_dim, + num_layers=1, + direction="forward", + weight_ih_attr=paddle.ParamAttr( + initializer=I.Uniform(low=-config.rand_unif_init_mag, high=config.rand_unif_init_mag) + ), + bias_ih_attr=paddle.ParamAttr(initializer=I.Constant(value=0.0)), + ) + + if config.pointer_gen: + self.p_gen_linear = nn.Linear(config.hidden_dim * 4 + config.emb_dim, 1) + + self.out1 = nn.Linear(config.hidden_dim * 3, config.hidden_dim) + self.out2 = nn.Linear( + config.hidden_dim, + config.vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + def forward( + self, + y_t_1, + s_t_1, + encoder_outputs, + encoder_feature, + enc_padding_mask, + c_t_1, + extra_zeros, + enc_batch_extend_vocab, + coverage, + step, + ): + if not self.training and step == 0: + h_decoder, c_decoder = s_t_1 + s_t_hat = paddle.concat( + ( + paddle.reshape(h_decoder, [-1, config.hidden_dim]), + paddle.reshape(c_decoder, [-1, config.hidden_dim]), + ), + 1, + ) # B x 2*hidden_dim + c_t, _, coverage_next = self.attention_network( + s_t_hat, encoder_outputs, encoder_feature, enc_padding_mask, coverage + ) + coverage = coverage_next + + y_t_1_embd = self.embedding(y_t_1) + x = self.x_context(paddle.concat((c_t_1, y_t_1_embd), 1)) + lstm_out, s_t = self.lstm(x.unsqueeze(1), s_t_1) + + h_decoder, c_decoder = s_t + s_t_hat = paddle.concat( + (paddle.reshape(h_decoder, [-1, config.hidden_dim]), paddle.reshape(c_decoder, [-1, config.hidden_dim])), 1 + ) # B x 
2*hidden_dim + c_t, attn_dist, coverage_next = self.attention_network( + s_t_hat, encoder_outputs, encoder_feature, enc_padding_mask, coverage + ) + + if self.training or step > 0: + coverage = coverage_next + + p_gen = None + if config.pointer_gen: + p_gen_input = paddle.concat((c_t, s_t_hat, x), 1) # B x (2*2*hidden_dim + emb_dim) + p_gen = self.p_gen_linear(p_gen_input) + p_gen = F.sigmoid(p_gen) + + output = paddle.concat((paddle.reshape(lstm_out, [-1, config.hidden_dim]), c_t), 1) # B x hidden_dim * 3 + output1 = self.out1(output) # B x hidden_dim + output2 = self.out2(output1) # B x vocab_size + vocab_dist = F.softmax(output2, axis=1) + + if config.pointer_gen: + vocab_dist_ = p_gen * vocab_dist + attn_dist_ = (1 - p_gen) * attn_dist + + if extra_zeros is not None: + vocab_dist_ = paddle.concat([vocab_dist_, extra_zeros], 1) + final_dist = paddle2D_scatter_add(vocab_dist_, enc_batch_extend_vocab, attn_dist_, 1) + else: + final_dist = vocab_dist + + return final_dist, s_t, c_t, attn_dist, p_gen, coverage + + +class Model(object): + def __init__(self, model_file_path=None, is_eval=False): + super(Model, self).__init__() + encoder = Encoder() + decoder = Decoder() + reduce_state = ReduceState() + + # Shared the embedding between encoder and decoder + decoder.embedding.weight = encoder.embedding.weight + + if paddle.distributed.get_world_size() > 1: + encoder = paddle.DataParallel(encoder) + decoder = paddle.DataParallel(decoder) + reduce_state = paddle.DataParallel(reduce_state) + + if is_eval: + encoder.eval() + decoder.eval() + reduce_state.eval() + + self.encoder = encoder + self.decoder = decoder + self.reduce_state = reduce_state + + if model_file_path is not None: + self.decoder.set_state_dict(paddle.load(os.path.join(model_file_path, "decoder.params"))) + self.encoder.set_state_dict(paddle.load(os.path.join(model_file_path, "encoder.params"))) + self.reduce_state.set_state_dict(paddle.load(os.path.join(model_file_path, "reduce_state.params"))) diff --git a/examples/text_summarization/pointer_summarizer/rouge_eval.py b/examples/text_summarization/pointer_summarizer/rouge_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..1d0dd29b9d833a33d74904e68b2b891374344f6c --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/rouge_eval.py @@ -0,0 +1,24 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
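For reference, the pointer-generator head in `model.py` above forms the final distribution by scatter-adding the attention mass onto the extended-vocabulary ids of the source tokens (`paddle2D_scatter_add` with `dim=1`). The snippet below is a standalone numeric sketch of that operation; the tensor values and the helper name `scatter_add_2d` are made up for illustration.

```python
# Illustrative sketch of the row-wise scatter-add behind final_dist in model.py.
import paddle

vocab_dist = paddle.to_tensor([[0.5, 0.3, 0.2, 0.0],   # batch of 2, extended vocab of 4
                               [0.4, 0.4, 0.2, 0.0]])
attn_dist = paddle.to_tensor([[0.6, 0.4],              # attention over 2 source positions
                              [0.9, 0.1]])
src_ids = paddle.to_tensor([[3, 1],                    # extended-vocab id of each source token
                            [0, 3]])


def scatter_add_2d(x, index, updates):
    # Same trick as paddle2D_scatter_add(dim=1): build (row, col) pairs, then scatter_nd_add.
    rows, cols = updates.shape
    flat_updates = paddle.flatten(updates)
    col_index = paddle.reshape(index, [-1, 1])
    row_index = (paddle.arange(rows * cols) // cols).unsqueeze(1)
    nd_index = paddle.concat([row_index, col_index], axis=1)
    return paddle.scatter_nd_add(x, nd_index, flat_updates)


final_dist = scatter_add_2d(vocab_dist, src_ids, attn_dist)
print(final_dist)  # row 0 -> [0.5, 0.7, 0.2, 0.6], row 1 -> [1.3, 0.4, 0.2, 0.1]
```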
+ +import sys + +from utils import rouge_eval, rouge_log + +decode_dir = sys.argv[1] + +print("Decoder has finished reading dataset for single_pass.") +print("Now starting ROUGE eval...") +results_dict = rouge_eval(decode_dir + "rouge_ref", decode_dir + "rouge_dec_dir") +rouge_log(results_dict, decode_dir + "rouge_dec_dir") diff --git a/examples/text_summarization/pointer_summarizer/train.py b/examples/text_summarization/pointer_summarizer/train.py new file mode 100644 index 0000000000000000000000000000000000000000..6c9406cc0686bfb919d892ca2f5207ab7849590a --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/train.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +import time + +import config +import paddle +from data import Batcher, Vocab +from model import Model +from paddle.optimizer import Adagrad +from train_util import get_input_from_batch, get_output_from_batch +from utils import calc_running_avg_loss + +# Flush out immediately +sys.stdout.flush() + + +class Trainer(object): + def __init__(self): + self.vocab = Vocab(config.vocab_path, config.vocab_size) + self.batcher = Batcher( + config.train_data_path, self.vocab, mode="train", batch_size=config.batch_size, single_pass=False + ) + + train_dir = os.path.join(config.log_root, "train_%d" % (int(time.time()))) + + if not os.path.exists(config.log_root): + os.mkdir(config.log_root) + + if not os.path.exists(train_dir): + os.mkdir(train_dir) + + self.model_dir = os.path.join(train_dir, "model") + if not os.path.exists(self.model_dir): + os.mkdir(self.model_dir) + + def save_model(self, running_avg_loss, iter): + state = { + "encoder": self.model.encoder.state_dict(), + "decoder": self.model.decoder.state_dict(), + "reduce_state": self.model.reduce_state.state_dict(), + "optimizer": self.optimizer.state_dict(), + } + model_save_dir = os.path.join(self.model_dir, "model_%06d_%.8f" % (iter, running_avg_loss)) + for k in state: + model_save_path = os.path.join(model_save_dir, "%s.params" % k) + paddle.save(state[k], model_save_path) + return model_save_dir + + def setup_train(self, model_file_path=None): + self.model = Model(model_file_path) + + initial_lr = config.lr_coverage if config.is_coverage else config.lr + params = ( + list(self.model.encoder.parameters()) + + list(self.model.decoder.parameters()) + + list(self.model.reduce_state.parameters()) + ) + + self.optimizer = Adagrad( + parameters=params, + learning_rate=initial_lr, + initial_accumulator_value=config.adagrad_init_acc, + epsilon=1.0e-10, + grad_clip=paddle.nn.ClipGradByGlobalNorm(clip_norm=config.max_grad_norm), + ) + + start_iter, start_loss = 0, 0 + + if model_file_path is not None: + start_iter = int(model_file_path.split("_")[-2]) + start_loss = float(model_file_path.split("_")[-1].replace(os.sep, "")) + + if not config.is_coverage: + self.optimizer.set_state_dict(paddle.load(os.path.join(model_file_path, "optimizer.params"))) + + 
return start_iter, start_loss + + def train_one_batch(self, batch, iter): + + ( + enc_batch, + enc_padding_mask, + enc_lens, + enc_batch_extend_vocab, + extra_zeros, + c_t_1, + coverage, + ) = get_input_from_batch(batch) + dec_batch, dec_padding_mask, max_dec_len, dec_lens_var, target_batch = get_output_from_batch(batch) + + self.optimizer.clear_gradients() + + encoder_outputs, encoder_feature, encoder_hidden = self.model.encoder(enc_batch, enc_lens) + s_t_1 = self.model.reduce_state(encoder_hidden) + + step_losses = [] + for di in range(min(max_dec_len, config.max_dec_steps)): + y_t_1 = dec_batch[:, di] + + final_dist, s_t_1, c_t_1, attn_dist, p_gen, next_coverage = self.model.decoder( + y_t_1, + s_t_1, + encoder_outputs, + encoder_feature, + enc_padding_mask, + c_t_1, + extra_zeros, + enc_batch_extend_vocab, + coverage, + di, + ) + + target = target_batch[:, di] + add_index = paddle.arange(0, target.shape[0]) + new_index = paddle.stack([add_index, target], axis=1) + gold_probs = paddle.gather_nd(final_dist, new_index).squeeze() + step_loss = -paddle.log(gold_probs + config.eps) + + if config.is_coverage: + step_coverage_loss = paddle.sum(paddle.minimum(attn_dist, coverage), 1) + step_loss = step_loss + config.cov_loss_wt * step_coverage_loss + coverage = next_coverage + + step_mask = dec_padding_mask[:, di] + step_loss = step_loss * step_mask + step_losses.append(step_loss) + + sum_losses = paddle.sum(paddle.stack(step_losses, 1), 1) + batch_avg_loss = sum_losses / dec_lens_var + loss = paddle.mean(batch_avg_loss) + + loss.backward() + self.optimizer.minimize(loss) + + return float(loss) + + def trainIters(self, n_iters, model_file_path=None): + iter, running_avg_loss = self.setup_train(model_file_path) + start = time.time() + while iter < n_iters: + batch = self.batcher.next_batch() + loss = self.train_one_batch(batch, iter) + running_avg_loss = calc_running_avg_loss(loss, running_avg_loss, iter) + iter += 1 + print( + "global step %d/%d, step loss: %.8f, running avg loss: %.8f, speed: %.2f step/s" + % (iter, n_iters, loss, running_avg_loss, 1.0 / (time.time() - start)) + ) + start = time.time() + if iter % 5000 == 0 or iter == 1: + if paddle.distributed.get_rank() == 0: + model_save_dir = self.save_model(running_avg_loss, iter) + print( + "Saved model for iter %d with running avg loss %.8f to directory: %s" + % (iter, running_avg_loss, model_save_dir) + ) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Train script") + parser.add_argument( + "-m", dest="model_file_path", required=False, default=None, help="Model file for retraining (default: None)." + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override config.max_iterations.", + ) + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + + args = parser.parse_args() + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + train_processor = Trainer() + if args.max_steps > 0: + train_processor.trainIters(args.max_steps, args.model_file_path) + else: + train_processor.trainIters(config.max_iterations, args.model_file_path) diff --git a/examples/text_summarization/pointer_summarizer/train_util.py b/examples/text_summarization/pointer_summarizer/train_util.py new file mode 100644 index 0000000000000000000000000000000000000000..63773ce4a58fb1c8033004569853e5e68c6707a7 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/train_util.py @@ -0,0 +1,38 @@ +import numpy as np +import paddle +import config + + +def get_input_from_batch(batch): + batch_size = len(batch.enc_lens) + enc_batch = paddle.to_tensor(batch.enc_batch, dtype="int64") + enc_padding_mask = paddle.to_tensor(batch.enc_padding_mask, dtype="float32") + enc_lens = batch.enc_lens + extra_zeros = None + enc_batch_extend_vocab = None + + if config.pointer_gen: + enc_batch_extend_vocab = paddle.to_tensor(batch.enc_batch_extend_vocab, dtype="int64") + # The variable max_art_oovs is the max over all the article oov list in the batch + if batch.max_art_oovs > 0: + extra_zeros = paddle.zeros((batch_size, batch.max_art_oovs)) + + c_t_1 = paddle.zeros((batch_size, 2 * config.hidden_dim)) + + coverage = None + if config.is_coverage: + coverage = paddle.zeros(enc_batch.shape) + + return enc_batch, enc_padding_mask, enc_lens, enc_batch_extend_vocab, extra_zeros, c_t_1, coverage + + +def get_output_from_batch(batch): + dec_batch = paddle.to_tensor(batch.dec_batch, dtype="int64") + dec_padding_mask = paddle.to_tensor(batch.dec_padding_mask, dtype="float32") + dec_lens = batch.dec_lens + max_dec_len = np.max(dec_lens) + dec_lens_var = paddle.to_tensor(dec_lens, dtype="float32") + + target_batch = paddle.to_tensor(batch.target_batch, dtype="int64") + + return dec_batch, dec_padding_mask, max_dec_len, dec_lens_var, target_batch diff --git a/examples/text_summarization/pointer_summarizer/utils.py b/examples/text_summarization/pointer_summarizer/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2cc081d905d3652b3b2fdd33a041e824b8f46182 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/utils.py @@ -0,0 +1,99 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
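The per-step loss in `train_one_batch` above gathers the probability of each gold token with `paddle.gather_nd`, takes its negative log, and zeroes out padded positions with the decoder mask. Below is a minimal numeric sketch of that single step on made-up tensors; the literal `1e-12` merely stands in for `config.eps`.

```python
# Illustrative sketch of the masked negative log-likelihood step in train.py.
import paddle

final_dist = paddle.to_tensor([[0.7, 0.2, 0.1],
                               [0.1, 0.6, 0.3]])         # batch of 2, vocab of 3
target = paddle.to_tensor([0, 2])                         # gold token id for each example
step_mask = paddle.to_tensor([1.0, 0.0])                  # second example is padding at this step

add_index = paddle.arange(0, target.shape[0])
new_index = paddle.stack([add_index, target], axis=1)     # [[0, 0], [1, 2]]
gold_probs = paddle.gather_nd(final_dist, new_index)      # [0.7, 0.3]
step_loss = -paddle.log(gold_probs + 1e-12) * step_mask   # padded position contributes 0
print(step_loss)
```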
+
+# Content of this file is copied from https://github.com/abisee/pointer-generator/blob/master/
+import logging
+import os
+
+import pyrouge
+
+
+def print_results(article, abstract, decoded_output):
+    print("")
+    print("ARTICLE: %s" % article)
+    print("REFERENCE SUMMARY: %s" % abstract)
+    print("GENERATED SUMMARY: %s" % decoded_output)
+    print("")
+
+
+def make_html_safe(s):
+    # Escape angle brackets so the text does not break the HTML files produced by pyrouge.
+    s = s.replace("<", "&lt;")
+    s = s.replace(">", "&gt;")
+    return s
+
+
+def rouge_eval(ref_dir, dec_dir):
+    r = pyrouge.Rouge155()
+    r.model_filename_pattern = "#ID#_reference.txt"
+    r.system_filename_pattern = r"(\d+)_decoded.txt"
+    r.model_dir = ref_dir
+    r.system_dir = dec_dir
+    logging.getLogger("global").setLevel(logging.WARNING)  # silence pyrouge logging
+    rouge_results = r.convert_and_evaluate()
+    return r.output_to_dict(rouge_results)
+
+
+def rouge_log(results_dict, dir_to_write):
+    log_str = ""
+    for x in ["1", "2", "l"]:
+        log_str += "\nROUGE-%s:\n" % x
+        for y in ["f_score", "recall", "precision"]:
+            key = "rouge_%s_%s" % (x, y)
+            key_cb = key + "_cb"
+            key_ce = key + "_ce"
+            val = results_dict[key]
+            val_cb = results_dict[key_cb]
+            val_ce = results_dict[key_ce]
+            log_str += "%s: %.4f with confidence interval (%.4f, %.4f)\n" % (key, val, val_cb, val_ce)
+    print(log_str)
+    results_file = os.path.join(dir_to_write, "ROUGE_results.txt")
+    print("Writing final ROUGE results to %s..." % results_file)
+    with open(results_file, "w") as f:
+        f.write(log_str)
+
+
+def calc_running_avg_loss(loss, running_avg_loss, step, decay=0.99):
+    if running_avg_loss == 0:  # on the first iteration just take the loss
+        running_avg_loss = loss
+    else:
+        running_avg_loss = running_avg_loss * decay + (1 - decay) * loss
+    running_avg_loss = min(running_avg_loss, 12)  # clip
+    return running_avg_loss
+
+
+def write_for_rouge(reference_sents, decoded_words, ex_index, _rouge_ref_dir, _rouge_dec_dir):
+    decoded_sents = []
+    while len(decoded_words) > 0:
+        try:
+            fst_period_idx = decoded_words.index(".")
+        except ValueError:
+            fst_period_idx = len(decoded_words)
+        sent = decoded_words[: fst_period_idx + 1]
+        decoded_words = decoded_words[fst_period_idx + 1 :]
+        decoded_sents.append(" ".join(sent))
+
+    # Pyrouge calls a perl script that puts the data into HTML files.
+    # Therefore we need to make our output HTML safe.
+ decoded_sents = [make_html_safe(w) for w in decoded_sents] + reference_sents = [make_html_safe(w) for w in reference_sents] + + ref_file = os.path.join(_rouge_ref_dir, "%06d_reference.txt" % ex_index) + decoded_file = os.path.join(_rouge_dec_dir, "%06d_decoded.txt" % ex_index) + + with open(ref_file, "w") as f: + for idx, sent in enumerate(reference_sents): + f.write(sent) if idx == len(reference_sents) - 1 else f.write(sent + "\n") + with open(decoded_file, "w") as f: + for idx, sent in enumerate(decoded_sents): + f.write(sent) if idx == len(decoded_sents) - 1 else f.write(sent + "\n") diff --git a/examples/text_summarization/prophetnet/README.md b/examples/text_summarization/prophetnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a96213df8b809faa3be61e12e60d3c4ff2b311a0 --- /dev/null +++ b/examples/text_summarization/prophetnet/README.md @@ -0,0 +1,250 @@ +# Prophetnet + +## 模型简介 + +ProphetNet(先知网络)是一种新型的 seq2seq 预训练模型。在训练时,Prophetnet 每一时刻将会学习同时预测未来的 N 个字符,这种自监督学习目标可以使得模型考虑未来更远的字符,防止模型对强局部相关(strong +local correlation)过拟合。 + +本项目是 Prophetnet 在 PaddlePaddle 2.4 上开源实现的文本摘要的例子,包含了在 CNN/DailyMail 数据集,Gigaword 数据集上微调和生成的代码。 + +### 项目依赖 + +``` +pip install -r requirements.txt +python -m pip install paddlepaddle-gpu==2.4.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html +pip install paddlenlp==2.5.2 +``` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +├── train_prophetnet.py # 模型finetune主程序入口 +├── generate.py # 模型生成主程序入口 +├── eval.py # 生成结果评估入口 +├── uncase_tokenize_data.py # 数据预处理 +├── uncompress_data.sh # 数据解压脚本 +├── run_train.sh # 模型训练脚本 +├── run_eval.sh # 模型评估脚本 +├── requirements.txt # 环境依赖文件 +└── README.md # 文档说明 +``` + +### 数据准备 + +GLGE 数据集下载:[链接](https://drive.google.com/file/d/1F4zppa9Gqrh6iNyVsZJkxfbm5waalqEA/view) + +GLGE 测试集下载:[链接](https://drive.google.com/file/d/11lDXIG87dChIfukq3x2Wx4r5_duCRm_J/view) + +将glge_public.tar与glge_hidden_v1.1.tar.gz放入到项目根目录下。 + +``` +bash uncompress_data.sh +``` + +### 数据预处理 + +``` +python uncase_tokenize_data.py --dataset <DATASET> +``` + +说明: + +- `<DATASET>`可选`cnndm`, `gigaword`. 
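例如,以 cnndm 数据集为例(示例命令,处理 gigaword 时将 `--dataset` 替换为 `gigaword` 即可):

```
python uncase_tokenize_data.py --dataset cnndm
```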
+ +### 模型训练 + +``` +bash run_train.sh <DATASET> +``` + +或直接运行finetune程序 + +- cnndm: + +``` +python -m paddle.distributed.launch --gpus 0 python train_prophetnet.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=8 \ + --num_train_epochs=4 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=4 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/cnndm +``` + +- gigaword: + +``` +python -m paddle.distributed.launch --gpus 0 python train_prophetnet.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=16 \ + --per_device_eval_batch_size=32 \ + --num_train_epochs=6 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=8 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/gigaword +``` + +其中参数释义如下: + +- `dataset` 指定数据集,可选cnndm和gigaword + +- `model_name_or_path` 预训练模型名称或本地预训练模型初始化权重文件路径 + +- `per_device_train_batch_size` 表示单卡训练样本批大小 + +- `per_device_eval_batch_size` 表示单卡验证样本批大小 + +- `num_train_epochs` 表示训练轮数 + +- `learning_rate` 表示学习率 + +- `warmup_init_lr` 表示预热学习率 + +- `warmup_steps` 表示预热学习步数 + +- `max_grad_norm` 表示梯度裁剪 + +- `dataloader_num_workers` 指定数据加载规模 + +- `logging_steps` 表示打印结果间隔 + +- `save_steps`表示验证间隔 + +- `do_train` 表示是否训练 + +- `do_eval` 表示是否验证 + +- `output_idr` 指定微调结果权重存放路径 + +已经finetune好的模型权重: + +- cnndm : [链接](https://pan.baidu.com/s/1cemrUDxkqEW9raoasJ_VKw), 提取码:1egi + +- gigaword : [链接](https://pan.baidu.com/s/1qRH2FStT3vNQtDjZLkYJBQ), 提取码:on5v + +### 模型评估 + +使用prophetNet源码的[评估脚本](https://pan.baidu.com/s/1FOnd01rNvDJoONYegacq1Q), 此脚本依赖于pyrouge,需要提前安装rouge。 + +``` +pip install git+https://github.com/pltrdy/pyrouge +``` + +``` +bash run_eval.sh <DATASET> +``` + +或直接运行模型生成程序 + +- cnndm: + +``` +python generate.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/cnndm/generate.txt \ + --min_target_length=45 \ + --max_target_length=110 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.2 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu + +python eval.py --dataset cnndm --generated ./generate/cnndm/generate.txt +``` + +- gigaword: + +``` +python generate.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/gigaword/generate.txt \ + --min_target_length=1 \ + --max_target_length=200 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.6 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu + +python eval.py --dataset gigaword --generated ./generate/gigaword/generate.txt +``` + +其中参数释义如下: + +- `dataset` 指定数据集,可选cnndm和gigaword + +- `vocab_file` 指定词表文件 + +- `output_path` 指定生成结果存放路径 + +- `min_target_length` 指定解码最短长度 + +- `max_target_length` 指定解码最大长度 + +- `decode_strategy` 指定解码策略 + +- `num_beams` 指定beam_search解码宽度 + +- `length_penalty` 指定beam_search解码的长度指数惩罚 + +- `batch_size` 指定评估样本批大小 + +- `ignore_pad_token_for_loss` 表示计算loss时忽略padding + +- `early_stopping` 指定生成结束符是否停止预测 + +- `logging_steps` 指定日志打印间隔 + +- `device` 指定使用设备 + +### 微调测试精度 + +> #### 在CNN/DM数据集的测试效果如下表。 + +|网络 |opt|batch_size|数据集|ROUGE_1|ROUGE_2|ROUGE_L| +| 
:---: | :---: | :---: | :---: | :---: | :---: | :---: | +|prophetnet-large-uncased|Adam|4|CNN/DM|44.17|21.24|41.36| + +> #### 在gigaword数据集的测试效果如下表。 + +|网络 |opt|batch_size|数据集|ROUGE_1|ROUGE_2|ROUGE_L| +| :---: | :---: | :---: | :---: | :---: | :---: | :---: | +|prophetnet-large-uncased|Adam|16|gigaword|38.92|19.81|36.06| + +### 实验环境 + +- GPU RTX3090 * 1, CPU Intel i7-11700k +- Ubuntu 18.04 + +### 参考文献 + +1. Qi W, Yan Y, Gong Y, et al. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training[J]. arXiv + preprint arXiv:2001.04063, 2020. diff --git a/examples/text_summarization/prophetnet/eval.py b/examples/text_summarization/prophetnet/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..01a1042c4eca395aa801650297d802069f35757b --- /dev/null +++ b/examples/text_summarization/prophetnet/eval.py @@ -0,0 +1,62 @@ +import argparse +import os +import re +import sys +from os import listdir +from os.path import isfile, join + +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", type=str, help="choose from all, or 1 of 8 dataset like cnndm, gigaword etc.") +parser.add_argument("--generated", type=str, help="generated output file.") + +args = parser.parse_args() + +data_root_path = "data" + +support_dataset = ["cnndm", "gigaword"] +files2rouge_template = ".*ROUGE-1 Average_F: (?P<rouge1_f>\d+(\.\d*)?|\.\d+).*ROUGE-2 Average_F: (?P<rouge2_f>\d+(\.\d*)?|\.\d+).*ROUGE-L Average_F: (?P<rougeL_f>\d+(\.\d*)?|\.\d+).*" +# gigaword_template='.*ROUGE-1: (?P<rouge1_f>\d+(\.\d*)?|\.\d+).*ROUGE-2: (?P<rouge2_f>\d+(\.\d*)?|\.\d+).*ROUGE-L: (?P<rougeL_f>\d+(\.\d*)?|\.\d+).*' +qg_template = ".*Bleu_4: (?P<bleu4>\d+(\.\d*)?|\.\d+).*METEOR: (?P<meteor>\d+(\.\d*)?|\.\d+).*ROUGE_L: (?P<rougeL>\d+(\.\d*)?|\.\d+).*" +personachat_template = ".*?(?P<d1>[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?).*?(?P<d2>[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?).*Bleu_1: (?P<bleu1>\d+(\.\d*)?|\.\d+).*Bleu_2: (?P<bleu2>\d+(\.\d*)?|\.\d+).*" + + +def scale_up(d): + return {k: float(d[k]) * 100 for k in d.keys()} + + +def eval_one_dataset(): + golden_file = f"{data_root_path}/{args.dataset}_data/test.tgt" + + eval_template = { + "cnndm": f"python ./evaluate/cnndm/postprocess_cnn_dm.py --generated {generated_file} --golden {golden_file}", + "gigaword": f"python ./evaluate/gigaword/eval.py --perl --pred {generated_file} --gold {golden_file}", + } + + cmd = eval_template[args.dataset] + try: + output = os.popen(cmd).read() + if args.dataset in ["cnndm", "gigaword"]: + d = re.search(files2rouge_template, output.replace("\n", " ")).groupdict() + d = scale_up(d) + print(f"{args.dataset}\trouge1/rouge2/rougeL\t{d['rouge1_f']:.2f}/{d['rouge2_f']:.2f}/{d['rougeL_f']:.2f}") + except: + print("Unexpected error:", sys.exc_info()[0]) + print(f"{args.dataset} evaluate failed!") + + +if args.dataset != "all": + generated_file = args.generated + eval_one_dataset() +else: + output_root_path = args.generated + onlyfolders = [f for f in listdir(output_root_path) if not isfile(join(args.generated, f))] + for dataset in support_dataset: + for folder in onlyfolders: + if folder.startswith(dataset): + for hypo_file in listdir(args.generated + "/" + folder): + if "hypo" in hypo_file or "score" in hypo_file: + generated_file = args.generated + "/" + folder + "/" + hypo_file + print(f"{dataset}\tpredict_file:{generated_file}") + args.dataset = dataset + args.gnerated = generated_file + eval_one_dataset() diff --git a/examples/text_summarization/prophetnet/evaluate/cnndm/bs_pyrouge.py 
b/examples/text_summarization/prophetnet/evaluate/cnndm/bs_pyrouge.py new file mode 100644 index 0000000000000000000000000000000000000000..a4859890760a2a07e45890d9056a1d3dae131e88 --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/cnndm/bs_pyrouge.py @@ -0,0 +1,649 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function, unicode_literals, division + +import codecs +import os +import platform +import re +from functools import partial +from subprocess import check_output +from tempfile import mkdtemp + +try: + from configparser import ConfigParser +except ImportError: + from ConfigParser import ConfigParser + +import logging +from pyrouge.utils import log +from pyrouge.utils.file_utils import verify_dir + +REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}", "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'} + + +def clean(x): + return re.sub(r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''", lambda m: REMAP.get(m.group()), x) + + +class DirectoryProcessor: + @staticmethod + def process(input_dir, output_dir, function): + """ + Apply function to all files in input_dir and save the resulting output + files in output_dir. + + """ + if not os.path.exists(output_dir): + os.makedirs(output_dir) + logger = log.get_global_console_logger() + logger.info("Processing files in {}.".format(input_dir)) + input_file_names = os.listdir(input_dir) + for input_file_name in input_file_names: + input_file = os.path.join(input_dir, input_file_name) + with codecs.open(input_file, "r", encoding="UTF-8") as f: + input_string = f.read() + output_string = function(input_string) + output_file = os.path.join(output_dir, input_file_name) + with codecs.open(output_file, "w", encoding="UTF-8") as f: + f.write(clean(output_string.lower())) + logger.info("Saved processed files to {}.".format(output_dir)) + + +class Rouge155(object): + """ + This is a wrapper for the ROUGE 1.5.5 summary evaluation package. + This class is designed to simplify the evaluation process by: + + 1) Converting summaries into a format ROUGE understands. + 2) Generating the ROUGE configuration file automatically based + on filename patterns. + + This class can be used within Python like this: + + rouge = Rouge155() + rouge.system_dir = 'test/systems' + rouge.model_dir = 'test/models' + + # The system filename pattern should contain one group that + # matches the document ID. + rouge.system_filename_pattern = 'SL.P.10.R.11.SL062003-(\d+).html' + + # The model filename pattern has '#ID#' as a placeholder for the + # document ID. If there are multiple model summaries, pyrouge + # will use the provided regex to automatically match them with + # the corresponding system summary. Here, [A-Z] matches + # multiple model summaries for a given #ID#. 
+ rouge.model_filename_pattern = 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + + rouge_output = rouge.evaluate() + print(rouge_output) + output_dict = rouge.output_to_dict(rouge_output) + print(output_dict) + -> {'rouge_1_f_score': 0.95652, + 'rouge_1_f_score_cb': 0.95652, + 'rouge_1_f_score_ce': 0.95652, + 'rouge_1_precision': 0.95652, + [...] + + + To evaluate multiple systems: + + rouge = Rouge155() + rouge.system_dir = '/PATH/TO/systems' + rouge.model_dir = 'PATH/TO/models' + for system_id in ['id1', 'id2', 'id3']: + rouge.system_filename_pattern = \ + 'SL.P/.10.R.{}.SL062003-(\d+).html'.format(system_id) + rouge.model_filename_pattern = \ + 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + rouge_output = rouge.evaluate(system_id) + print(rouge_output) + + """ + + def __init__(self, rouge_dir=None, rouge_args=None, temp_dir=None): + """ + Create a Rouge155 object. + + rouge_dir: Directory containing Rouge-1.5.5.pl + rouge_args: Arguments to pass through to ROUGE if you + don't want to use the default pyrouge + arguments. + + """ + self.temp_dir = temp_dir + self.log = log.get_global_console_logger() + self.log.setLevel(logging.WARNING) + self.__set_dir_properties() + self._config_file = None + self._settings_file = self.__get_config_path() + self.__set_rouge_dir(rouge_dir) + self.args = self.__clean_rouge_args(rouge_args) + self._system_filename_pattern = None + self._model_filename_pattern = None + + def save_home_dir(self): + config = ConfigParser() + section = "pyrouge settings" + config.add_section(section) + config.set(section, "home_dir", self._home_dir) + with open(self._settings_file, "w") as f: + config.write(f) + self.log.info("Set ROUGE home directory to {}.".format(self._home_dir)) + + @property + def settings_file(self): + """ + Path of the settings file, which stores the ROUGE home dir. + + """ + return self._settings_file + + @property + def bin_path(self): + """ + The full path of the ROUGE binary (although it's technically + a script), i.e. rouge_home_dir/ROUGE-1.5.5.pl + + """ + if self._bin_path is None: + raise Exception( + "ROUGE path not set. Please set the ROUGE home directory " + "and ensure that ROUGE-1.5.5.pl exists in it." + ) + return self._bin_path + + @property + def system_filename_pattern(self): + """ + The regular expression pattern for matching system summary + filenames. The regex string. + + E.g. "SL.P.10.R.11.SL062003-(\d+).html" will match the system + filenames in the SPL2003/system folder of the ROUGE SPL example + in the "sample-test" folder. + + Currently, there is no support for multiple systems. + + """ + return self._system_filename_pattern + + @system_filename_pattern.setter + def system_filename_pattern(self, pattern): + self._system_filename_pattern = pattern + + @property + def model_filename_pattern(self): + """ + The regular expression pattern for matching model summary + filenames. The pattern needs to contain the string "#ID#", + which is a placeholder for the document ID. + + E.g. "SL.P.10.R.[A-Z].SL062003-#ID#.html" will match the model + filenames in the SPL2003/system folder of the ROUGE SPL + example in the "sample-test" folder. + + "#ID#" is a placeholder for the document ID which has been + matched by the "(\d+)" part of the system filename pattern. + The different model summaries for a given document ID are + matched by the "[A-Z]" part. 
+ + """ + return self._model_filename_pattern + + @model_filename_pattern.setter + def model_filename_pattern(self, pattern): + self._model_filename_pattern = pattern + + @property + def config_file(self): + return self._config_file + + @config_file.setter + def config_file(self, path): + config_dir, _ = os.path.split(path) + verify_dir(config_dir, "configuration file") + self._config_file = path + + def split_sentences(self): + """ + ROUGE requires texts split into sentences. In case the texts + are not already split, this method can be used. + + """ + from pyrouge.utils.sentence_splitter import PunktSentenceSplitter + + self.log.info("Splitting sentences.") + ss = PunktSentenceSplitter() + + def sent_split_to_string(s): + return "\n".join(ss.split(s)) + + process_func = partial(DirectoryProcessor.process, function=sent_split_to_string) + self.__process_summaries(process_func) + + @staticmethod + def convert_summaries_to_rouge_format(input_dir, output_dir): + """ + Convert all files in input_dir into a format ROUGE understands + and saves the files to output_dir. The input files are assumed + to be plain text with one sentence per line. + + input_dir: Path of directory containing the input files. + output_dir: Path of directory in which the converted files + will be saved. + + """ + DirectoryProcessor.process(input_dir, output_dir, Rouge155.convert_text_to_rouge_format) + + @staticmethod + def convert_text_to_rouge_format(text, title="dummy title"): + """ + Convert a text to a format ROUGE understands. The text is + assumed to contain one sentence per line. + + text: The text to convert, containg one sentence per line. + title: Optional title for the text. The title will appear + in the converted file, but doesn't seem to have + any other relevance. + + Returns: The converted text as string. + + """ + sentences = text.split("\n") + sent_elems = [ + '<a name="{i}">[{i}]</a> <a href="#{i}" id={i}>' "{text}</a>".format(i=i, text=sent) + for i, sent in enumerate(sentences, start=1) + ] + html = """<html> +<head> +<title>{title} + + +{elems} + +""".format( + title=title, elems="\n".join(sent_elems) + ) + + return html + + @staticmethod + def write_config_static( + system_dir, system_filename_pattern, model_dir, model_filename_pattern, config_file_path, system_id=None + ): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their corresponding model summary + files. + + pyrouge uses regular expressions to automatically find the + matching model summary files for a given system summary file + (cf. docstrings for system_filename_pattern and + model_filename_pattern). + + system_dir: Path of directory containing + system summaries. + system_filename_pattern: Regex string for matching + system summary filenames. + model_dir: Path of directory containing + model summaries. + model_filename_pattern: Regex string for matching model + summary filenames. + config_file_path: Path of the configuration file. + system_id: Optional system ID string which + will appear in the ROUGE output. 
+ + """ + system_filenames = [f for f in os.listdir(system_dir)] + system_models_tuples = [] + + system_filename_pattern = re.compile(system_filename_pattern) + for system_filename in sorted(system_filenames): + match = system_filename_pattern.match(system_filename) + if match: + id = match.groups(0)[0] + model_filenames = [model_filename_pattern.replace("#ID#", id)] + # model_filenames = Rouge155.__get_model_filenames_for_id( + # id, model_dir, model_filename_pattern) + system_models_tuples.append((system_filename, sorted(model_filenames))) + if not system_models_tuples: + raise Exception( + "Did not find any files matching the pattern {} " + "in the system summaries directory {}.".format(system_filename_pattern.pattern, system_dir) + ) + + with codecs.open(config_file_path, "w", encoding="utf-8") as f: + f.write('') + for task_id, (system_filename, model_filenames) in enumerate(system_models_tuples, start=1): + eval_string = Rouge155.__get_eval_string( + task_id, system_id, system_dir, system_filename, model_dir, model_filenames + ) + f.write(eval_string) + f.write("") + + def write_config(self, config_file_path=None, system_id=None): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their matching model summary files. + + This is a non-static version of write_config_file_static(). + + config_file_path: Path of the configuration file. + system_id: Optional system ID string which will + appear in the ROUGE output. + + """ + if not system_id: + system_id = 1 + if (not config_file_path) or (not self._config_dir): + self._config_dir = mkdtemp(dir=self.temp_dir) + config_filename = "rouge_conf.xml" + else: + config_dir, config_filename = os.path.split(config_file_path) + verify_dir(config_dir, "configuration file") + self._config_file = os.path.join(self._config_dir, config_filename) + Rouge155.write_config_static( + self._system_dir, + self._system_filename_pattern, + self._model_dir, + self._model_filename_pattern, + self._config_file, + system_id, + ) + self.log.info("Written ROUGE configuration to {}".format(self._config_file)) + + def evaluate(self, system_id=1, rouge_args=None): + """ + Run ROUGE to evaluate the system summaries in system_dir against + the model summaries in model_dir. The summaries are assumed to + be in the one-sentence-per-line HTML format ROUGE understands. + + system_id: Optional system ID which will be printed in + ROUGE's output. + + Returns: Rouge output as string. + + """ + self.write_config(system_id=system_id) + options = self.__get_options(rouge_args) + command = [self._bin_path] + options + self.log.info("Running ROUGE with command {}".format(" ".join(command))) + rouge_output = check_output(command).decode("UTF-8") + return rouge_output + + def convert_and_evaluate(self, system_id=1, split_sentences=False, rouge_args=None): + """ + Convert plain text summaries to ROUGE format and run ROUGE to + evaluate the system summaries in system_dir against the model + summaries in model_dir. Optionally split texts into sentences + in case they aren't already. + + This is just a convenience method combining + convert_summaries_to_rouge_format() and evaluate(). + + split_sentences: Optional argument specifying if + sentences should be split. + system_id: Optional system ID which will be printed + in ROUGE's output. + + Returns: ROUGE output as string. 
+ + """ + if split_sentences: + self.split_sentences() + self.__write_summaries() + rouge_output = self.evaluate(system_id, rouge_args) + return rouge_output + + def output_to_dict(self, output): + """ + Convert the ROUGE output into python dictionary for further + processing. + + """ + # 0 ROUGE-1 Average_R: 0.02632 (95%-conf.int. 0.02632 - 0.02632) + pattern = re.compile(r"(\d+) (ROUGE-\S+) (Average_\w): (\d.\d+) " r"\(95%-conf.int. (\d.\d+) - (\d.\d+)\)") + results = {} + for line in output.split("\n"): + match = pattern.match(line) + if match: + sys_id, rouge_type, measure, result, conf_begin, conf_end = match.groups() + measure = {"Average_R": "recall", "Average_P": "precision", "Average_F": "f_score"}[measure] + rouge_type = rouge_type.lower().replace("-", "_") + key = "{}_{}".format(rouge_type, measure) + results[key] = float(result) + results["{}_cb".format(key)] = float(conf_begin) + results["{}_ce".format(key)] = float(conf_end) + return results + + ################################################################### + # Private methods + + def __set_rouge_dir(self, home_dir=None): + """ + Verify presence of ROUGE-1.5.5.pl and data folder, and set + those paths. + + """ + if not home_dir: + self._home_dir = self.__get_rouge_home_dir_from_settings() + else: + self._home_dir = home_dir + self.save_home_dir() + self._bin_path = os.path.join(self._home_dir, "ROUGE-1.5.5.pl") + self.data_dir = os.path.join(self._home_dir, "data") + if not os.path.exists(self._bin_path): + raise Exception( + "ROUGE binary not found at {}. Please set the " + "correct path by running pyrouge_set_rouge_path " + "/path/to/rouge/home.".format(self._bin_path) + ) + + def __get_rouge_home_dir_from_settings(self): + config = ConfigParser() + with open(self._settings_file) as f: + if hasattr(config, "read_file"): + config.read_file(f) + else: + # use deprecated python 2.x method + config.readfp(f) + rouge_home_dir = config.get("pyrouge settings", "home_dir") + return rouge_home_dir + + @staticmethod + def __get_eval_string(task_id, system_id, system_dir, system_filename, model_dir, model_filenames): + """ + ROUGE can evaluate several system summaries for a given text + against several model summaries, i.e. there is an m-to-n + relation between system and model summaries. The system + summaries are listed in the tag and the model summaries + in the tag. pyrouge currently only supports one system + summary per text, i.e. it assumes a 1-to-n relation between + system and model summaries. + + """ + peer_elems = '

<P ID="{id}">{name}</P>

'.format(id=system_id, name=system_filename) + + model_elems = [ + '{name}'.format(id=chr(65 + i), name=name) for i, name in enumerate(model_filenames) + ] + + model_elems = "\n\t\t\t".join(model_elems) + eval_string = """ + + {model_root} + {peer_root} + + + + {peer_elems} + + + {model_elems} + + +""".format( + task_id=task_id, model_root=model_dir, model_elems=model_elems, peer_root=system_dir, peer_elems=peer_elems + ) + return eval_string + + def __process_summaries(self, process_func): + """ + Helper method that applies process_func to the files in the + system and model folders and saves the resulting files to new + system and model folders. + + """ + temp_dir = mkdtemp(dir=self.temp_dir) + new_system_dir = os.path.join(temp_dir, "system") + os.mkdir(new_system_dir) + new_model_dir = os.path.join(temp_dir, "model") + os.mkdir(new_model_dir) + self.log.info( + "Processing summaries. Saving system files to {} and " + "model files to {}.".format(new_system_dir, new_model_dir) + ) + process_func(self._system_dir, new_system_dir) + process_func(self._model_dir, new_model_dir) + self._system_dir = new_system_dir + self._model_dir = new_model_dir + + def __write_summaries(self): + self.log.info("Writing summaries.") + self.__process_summaries(self.convert_summaries_to_rouge_format) + + @staticmethod + def __get_model_filenames_for_id(id, model_dir, model_filenames_pattern): + pattern = re.compile(model_filenames_pattern.replace("#ID#", id)) + model_filenames = [f for f in os.listdir(model_dir) if pattern.match(f)] + if not model_filenames: + raise Exception( + "Could not find any model summaries for the system" + " summary with ID {}. Specified model filename pattern was: " + "{}".format(id, model_filenames_pattern) + ) + return model_filenames + + def __get_options(self, rouge_args=None): + """ + Get supplied command line arguments for ROUGE or use default + ones. + + """ + if self.args: + options = self.args.split() + elif rouge_args: + options = rouge_args.split() + else: + options = [ + "-e", + self._data_dir, + "-c", + 95, + # '-2', + # '-1', + # '-U', + "-m", + # '-v', + "-r", + 1000, + "-n", + 2, + # '-w', 1.2, + "-a", + ] + options = list(map(str, options)) + + options = self.__add_config_option(options) + return options + + def __create_dir_property(self, dir_name, docstring): + """ + Generate getter and setter for a directory property. + + """ + property_name = "{}_dir".format(dir_name) + private_name = "_" + property_name + setattr(self, private_name, None) + + def fget(self): + return getattr(self, private_name) + + def fset(self, path): + verify_dir(path, dir_name) + setattr(self, private_name, path) + + p = property(fget=fget, fset=fset, doc=docstring) + setattr(self.__class__, property_name, p) + + def __set_dir_properties(self): + """ + Automatically generate the properties for directories. + + """ + directories = [ + ("home", "The ROUGE home directory."), + ("data", "The path of the ROUGE 'data' directory."), + ("system", "Path of the directory containing system summaries."), + ("model", "Path of the directory containing model summaries."), + ] + for (dirname, docstring) in directories: + self.__create_dir_property(dirname, docstring) + + def __clean_rouge_args(self, rouge_args): + """ + Remove enclosing quotation marks, if any. 
+ + """ + if not rouge_args: + return + quot_mark_pattern = re.compile('"(.+)"') + match = quot_mark_pattern.match(rouge_args) + if match: + cleaned_args = match.group(1) + return cleaned_args + else: + return rouge_args + + def __add_config_option(self, options): + return options + [self._config_file] + + def __get_config_path(self): + if platform.system() == "Windows": + parent_dir = os.getenv("APPDATA") + config_dir_name = "pyrouge" + elif os.name == "posix": + parent_dir = os.path.expanduser("~") + config_dir_name = ".pyrouge" + else: + parent_dir = os.path.dirname(__file__) + config_dir_name = "" + config_dir = os.path.join(parent_dir, config_dir_name) + if not os.path.exists(config_dir): + os.makedirs(config_dir) + return os.path.join(config_dir, "settings.ini") + + +if __name__ == "__main__": + import argparse + from utils.argparsers import rouge_path_parser + + parser = argparse.ArgumentParser(parents=[rouge_path_parser]) + args = parser.parse_args() + + rouge = Rouge155(args.rouge_home) + rouge.save_home_dir() diff --git a/examples/text_summarization/prophetnet/evaluate/cnndm/postprocess_cnn_dm.py b/examples/text_summarization/prophetnet/evaluate/cnndm/postprocess_cnn_dm.py new file mode 100644 index 0000000000000000000000000000000000000000..67b22c98c28983ce5ecc4c5af91df183c2ddd21e --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/cnndm/postprocess_cnn_dm.py @@ -0,0 +1,275 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
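+
+# Overview: this script post-processes CNN/DM generation output and reports ROUGE.
+#   1. Each prediction line is split on "[X_SEP]" and passed through fix_tokenization(),
+#      which restores PTB-style tokens (-LRB-/-RRB-, "n't", "--", "U.N.", "3,000", ...).
+#   2. Very short sentences are dropped, near-duplicate sentences are filtered via
+#      get_f1() / remove_duplicate() (--duplicate_rate), and the output can be cut
+#      to --trunc_len tokens.
+#   3. The cleaned hypotheses are scored against --golden with ROUGE-1.5.5 through
+#      bs_pyrouge.Rouge155 (see test_rouge()).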
+ +import argparse +import os +import shutil +import string +import tempfile +import time + +from bs_pyrouge import Rouge155 + +parser = argparse.ArgumentParser() +parser.add_argument("--generated", type=str, help="generated output file.") +parser.add_argument("--golden", type=str, help="Gold output file.") +parser.add_argument( + "--duplicate_rate", + type=float, + default=0.7, + help="If the duplicate rate (compared with history) is large, we can discard the current sentence.", +) +parser.add_argument("--trunc_len", type=int, default=0, help="Truncate line by the maximum length.") +args = parser.parse_args() + +fin = open(args.generated, "r", encoding="utf-8") +fgolden = open(args.golden, "r", encoding="utf-8") +dedup_rate = args.duplicate_rate +trunc_len = args.trunc_len + +_tok_dict = {"(": "-LRB-", ")": "-RRB-", "[": "-LSB-", "]": "-RSB-", "{": "-LCB-", "}": "-RCB-"} + + +def _is_digit(w): + for ch in w: + if not (ch.isdigit() or ch == ","): + return False + return True + + +def fix_tokenization(text): + input_tokens = text.split() + output_tokens = [] + has_left_quote = False + has_left_single_quote = False + + i = 0 + prev_dash = False + while i < len(input_tokens): + tok = input_tokens[i] + flag_prev_dash = False + if tok in _tok_dict.keys(): + output_tokens.append(_tok_dict[tok]) + i += 1 + elif tok == '"': + if has_left_quote: + output_tokens.append("''") + else: + output_tokens.append("``") + has_left_quote = not has_left_quote + i += 1 + elif ( + tok == "'" + and len(output_tokens) > 0 + and output_tokens[-1].endswith("n") + and i < len(input_tokens) - 1 + and input_tokens[i + 1] == "t" + ): + output_tokens[-1] = output_tokens[-1][:-1] + output_tokens.append("n't") + i += 2 + elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"): + output_tokens.append("'" + input_tokens[i + 1]) + i += 2 + elif tok == "'": + if has_left_single_quote: + output_tokens.append("'") + else: + output_tokens.append("`") + has_left_single_quote = not has_left_single_quote + i += 1 + elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".": + output_tokens.append("...") + i += 3 + elif ( + tok == "," + and len(output_tokens) > 0 + and _is_digit(output_tokens[-1]) + and i < len(input_tokens) - 1 + and _is_digit(input_tokens[i + 1]) + ): + # $ 3 , 000 -> $ 3,000 + output_tokens[-1] += "," + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and output_tokens[-1].isdigit() + and i < len(input_tokens) - 1 + and input_tokens[i + 1].isdigit() + ): + # 3 . 03 -> $ 3.03 + output_tokens[-1] += "." + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and len(output_tokens[-1]) == 1 + and output_tokens[-1].isupper() + and i < len(input_tokens) - 2 + and len(input_tokens[i + 1]) == 1 + and input_tokens[i + 1].isupper() + and input_tokens[i + 2] == "." + ): + # U . N . -> U.N. 
+ k = i + 3 + while k + 2 < len(input_tokens): + if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == ".": + k += 2 + else: + break + output_tokens[-1] += "".join(input_tokens[i:k]) + i += 2 + elif tok == "-": + if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-": + output_tokens.append("--") + i += 2 + elif i == len(input_tokens) - 1 or i == 0: + output_tokens.append("-") + i += 1 + elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation: + output_tokens[-1] += "-" + i += 1 + flag_prev_dash = True + else: + output_tokens.append("-") + i += 1 + elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation: + output_tokens[-1] += tok + i += 1 + else: + output_tokens.append(tok) + i += 1 + prev_dash = flag_prev_dash + return " ".join(output_tokens) + + +def remove_duplicate(l_list, duplicate_rate): + tk_list = [l.lower().split() for l in l_list] + r_list = [] + history_set = set() + for i, w_list in enumerate(tk_list): + w_set = set(w_list) + if len(w_set & history_set) / len(w_set) <= duplicate_rate: + r_list.append(l_list[i]) + history_set |= w_set + return r_list + + +def test_rouge(cand, ref): + temp_dir = tempfile.mkdtemp() + candidates = cand + references = ref + assert len(candidates) == len(references) + + cnt = len(candidates) + current_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + tmp_dir = os.path.join(temp_dir, "rouge-tmp-{}".format(current_time)) + if not os.path.isdir(tmp_dir): + os.mkdir(tmp_dir) + os.mkdir(tmp_dir + "/candidate") + os.mkdir(tmp_dir + "/reference") + try: + for i in range(cnt): + if len(references[i]) < 1: + continue + with open(tmp_dir + "/candidate/cand.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(candidates[i]) + with open(tmp_dir + "/reference/ref.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(references[i]) + r = Rouge155(temp_dir=temp_dir) + r.model_dir = tmp_dir + "/reference/" + r.system_dir = tmp_dir + "/candidate/" + r.model_filename_pattern = "ref.#ID#.txt" + r.system_filename_pattern = r"cand.(\d+).txt" + rouge_results = r.convert_and_evaluate() + print(rouge_results) + results_dict = r.output_to_dict(rouge_results) + finally: + if os.path.isdir(tmp_dir): + shutil.rmtree(tmp_dir) + return results_dict + + +def rouge_results_to_str(results_dict): + return ">> ROUGE-F(1/2/l): {:.2f}/{:.2f}/{:.2f}\nROUGE-R(1/2/3/l): {:.2f}/{:.2f}/{:.2f}\n".format( + results_dict["rouge_1_f_score"] * 100, + results_dict["rouge_2_f_score"] * 100, + results_dict["rouge_l_f_score"] * 100, + results_dict["rouge_1_recall"] * 100, + results_dict["rouge_2_recall"] * 100, + results_dict["rouge_l_recall"] * 100, + ) + + +def count_tokens(tokens): + counter = {} + for t in tokens: + if t in counter.keys(): + counter[t] += 1 + else: + counter[t] = 1 + return counter + + +def get_f1(text_a, text_b): + tokens_a = text_a.lower().split() + tokens_b = text_b.lower().split() + if len(tokens_a) == 0 or len(tokens_b) == 0: + return 1 if len(tokens_a) == len(tokens_b) else 0 + set_a = count_tokens(tokens_a) + set_b = count_tokens(tokens_b) + match = 0 + for token in set_a.keys(): + if token in set_b.keys(): + match += min(set_a[token], set_b[token]) + p = match / len(tokens_a) + r = match / len(tokens_b) + return 2.0 * p * r / (p + r + 1e-5) + + +generated_list = [] +for line in fin: + buf = [] + for sentence in line.strip().split("[X_SEP]"): + sentence = fix_tokenization(sentence) + if any(get_f1(sentence, s) > 1.0 for s in buf): + 
continue + s_len = len(sentence.split()) + if s_len <= 4: + continue + buf.append(sentence) + if dedup_rate < 1: + buf = remove_duplicate(buf, dedup_rate) + if trunc_len: + num_left = trunc_len + trunc_list = [] + for bit in buf: + tk_list = bit.split() + n = min(len(tk_list), num_left) + trunc_list.append(" ".join(tk_list[:n])) + num_left -= n + if num_left <= 0: + break + else: + trunc_list = buf + generated_list.append("\n".join(trunc_list)) + +golden_list = [] +for line in fgolden: + line = line.strip().replace(" ", "\n") + golden_list.append(line) + +scores = test_rouge(generated_list, golden_list) +print(rouge_results_to_str(scores)) diff --git a/examples/text_summarization/prophetnet/evaluate/gigaword/bs_pyrouge.py b/examples/text_summarization/prophetnet/evaluate/gigaword/bs_pyrouge.py new file mode 100644 index 0000000000000000000000000000000000000000..df4cbbecbcb26f646ee9d7249b50dd06a6736132 --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/gigaword/bs_pyrouge.py @@ -0,0 +1,649 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function, unicode_literals, division + +import codecs +import logging +import os +import platform +import re +from functools import partial +from subprocess import check_output +from tempfile import mkdtemp + +try: + from configparser import ConfigParser +except ImportError: + from ConfigParser import ConfigParser + +from pyrouge.utils import log +from pyrouge.utils.file_utils import verify_dir + +REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}", "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'} + + +def clean(x): + return re.sub(r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''", lambda m: REMAP.get(m.group()), x) + + +class DirectoryProcessor: + @staticmethod + def process(input_dir, output_dir, function): + """ + Apply function to all files in input_dir and save the resulting output + files in output_dir. + + """ + if not os.path.exists(output_dir): + os.makedirs(output_dir) + logger = log.get_global_console_logger() + logger.info("Processing files in {}.".format(input_dir)) + input_file_names = os.listdir(input_dir) + for input_file_name in input_file_names: + input_file = os.path.join(input_dir, input_file_name) + with codecs.open(input_file, "r", encoding="UTF-8") as f: + input_string = f.read() + output_string = function(input_string) + output_file = os.path.join(output_dir, input_file_name) + with codecs.open(output_file, "w", encoding="UTF-8") as f: + f.write(clean(output_string.lower())) + logger.info("Saved processed files to {}.".format(output_dir)) + + +class Rouge155(object): + """ + This is a wrapper for the ROUGE 1.5.5 summary evaluation package. + This class is designed to simplify the evaluation process by: + + 1) Converting summaries into a format ROUGE understands. + 2) Generating the ROUGE configuration file automatically based + on filename patterns. 
+ + This class can be used within Python like this: + + rouge = Rouge155() + rouge.system_dir = 'test/systems' + rouge.model_dir = 'test/models' + + # The system filename pattern should contain one group that + # matches the document ID. + rouge.system_filename_pattern = 'SL.P.10.R.11.SL062003-(\d+).html' + + # The model filename pattern has '#ID#' as a placeholder for the + # document ID. If there are multiple model summaries, pyrouge + # will use the provided regex to automatically match them with + # the corresponding system summary. Here, [A-Z] matches + # multiple model summaries for a given #ID#. + rouge.model_filename_pattern = 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + + rouge_output = rouge.evaluate() + print(rouge_output) + output_dict = rouge.output_to_dict(rouge_output) + print(output_dict) + -> {'rouge_1_f_score': 0.95652, + 'rouge_1_f_score_cb': 0.95652, + 'rouge_1_f_score_ce': 0.95652, + 'rouge_1_precision': 0.95652, + [...] + + + To evaluate multiple systems: + + rouge = Rouge155() + rouge.system_dir = '/PATH/TO/systems' + rouge.model_dir = 'PATH/TO/models' + for system_id in ['id1', 'id2', 'id3']: + rouge.system_filename_pattern = \ + 'SL.P/.10.R.{}.SL062003-(\d+).html'.format(system_id) + rouge.model_filename_pattern = \ + 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + rouge_output = rouge.evaluate(system_id) + print(rouge_output) + + """ + + def __init__(self, rouge_dir=None, rouge_args=None, temp_dir=None): + """ + Create a Rouge155 object. + + rouge_dir: Directory containing Rouge-1.5.5.pl + rouge_args: Arguments to pass through to ROUGE if you + don't want to use the default pyrouge + arguments. + + """ + self.temp_dir = temp_dir + self.log = log.get_global_console_logger() + self.log.setLevel(logging.WARNING) + self.__set_dir_properties() + self._config_file = None + self._settings_file = self.__get_config_path() + self.__set_rouge_dir(rouge_dir) + self.args = self.__clean_rouge_args(rouge_args) + self._system_filename_pattern = None + self._model_filename_pattern = None + + def save_home_dir(self): + config = ConfigParser() + section = "pyrouge settings" + config.add_section(section) + config.set(section, "home_dir", self._home_dir) + with open(self._settings_file, "w") as f: + config.write(f) + self.log.info("Set ROUGE home directory to {}.".format(self._home_dir)) + + @property + def settings_file(self): + """ + Path of the settings file, which stores the ROUGE home dir. + + """ + return self._settings_file + + @property + def bin_path(self): + """ + The full path of the ROUGE binary (although it's technically + a script), i.e. rouge_home_dir/ROUGE-1.5.5.pl + + """ + if self._bin_path is None: + raise Exception( + "ROUGE path not set. Please set the ROUGE home directory " + "and ensure that ROUGE-1.5.5.pl exists in it." + ) + return self._bin_path + + @property + def system_filename_pattern(self): + """ + The regular expression pattern for matching system summary + filenames. The regex string. + + E.g. "SL.P.10.R.11.SL062003-(\d+).html" will match the system + filenames in the SPL2003/system folder of the ROUGE SPL example + in the "sample-test" folder. + + Currently, there is no support for multiple systems. + + """ + return self._system_filename_pattern + + @system_filename_pattern.setter + def system_filename_pattern(self, pattern): + self._system_filename_pattern = pattern + + @property + def model_filename_pattern(self): + """ + The regular expression pattern for matching model summary + filenames. 
The pattern needs to contain the string "#ID#", + which is a placeholder for the document ID. + + E.g. "SL.P.10.R.[A-Z].SL062003-#ID#.html" will match the model + filenames in the SPL2003/system folder of the ROUGE SPL + example in the "sample-test" folder. + + "#ID#" is a placeholder for the document ID which has been + matched by the "(\d+)" part of the system filename pattern. + The different model summaries for a given document ID are + matched by the "[A-Z]" part. + + """ + return self._model_filename_pattern + + @model_filename_pattern.setter + def model_filename_pattern(self, pattern): + self._model_filename_pattern = pattern + + @property + def config_file(self): + return self._config_file + + @config_file.setter + def config_file(self, path): + config_dir, _ = os.path.split(path) + verify_dir(config_dir, "configuration file") + self._config_file = path + + def split_sentences(self): + """ + ROUGE requires texts split into sentences. In case the texts + are not already split, this method can be used. + + """ + from pyrouge.utils.sentence_splitter import PunktSentenceSplitter + + self.log.info("Splitting sentences.") + ss = PunktSentenceSplitter() + + def sent_split_to_string(s): + return "\n".join(ss.split(s)) + + process_func = partial(DirectoryProcessor.process, function=sent_split_to_string) + self.__process_summaries(process_func) + + @staticmethod + def convert_summaries_to_rouge_format(input_dir, output_dir): + """ + Convert all files in input_dir into a format ROUGE understands + and saves the files to output_dir. The input files are assumed + to be plain text with one sentence per line. + + input_dir: Path of directory containing the input files. + output_dir: Path of directory in which the converted files + will be saved. + + """ + DirectoryProcessor.process(input_dir, output_dir, Rouge155.convert_text_to_rouge_format) + + @staticmethod + def convert_text_to_rouge_format(text, title="dummy title"): + """ + Convert a text to a format ROUGE understands. The text is + assumed to contain one sentence per line. + + text: The text to convert, containg one sentence per line. + title: Optional title for the text. The title will appear + in the converted file, but doesn't seem to have + any other relevance. + + Returns: The converted text as string. + + """ + sentences = text.split("\n") + sent_elems = [ + '[{i}] ' "{text}".format(i=i, text=sent) + for i, sent in enumerate(sentences, start=1) + ] + html = """ + +{title} + + +{elems} + +""".format( + title=title, elems="\n".join(sent_elems) + ) + + return html + + @staticmethod + def write_config_static( + system_dir, system_filename_pattern, model_dir, model_filename_pattern, config_file_path, system_id=None + ): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their corresponding model summary + files. + + pyrouge uses regular expressions to automatically find the + matching model summary files for a given system summary file + (cf. docstrings for system_filename_pattern and + model_filename_pattern). + + system_dir: Path of directory containing + system summaries. + system_filename_pattern: Regex string for matching + system summary filenames. + model_dir: Path of directory containing + model summaries. + model_filename_pattern: Regex string for matching model + summary filenames. + config_file_path: Path of the configuration file. + system_id: Optional system ID string which + will appear in the ROUGE output. 
+ + """ + system_filenames = [f for f in os.listdir(system_dir)] + system_models_tuples = [] + + system_filename_pattern = re.compile(system_filename_pattern) + for system_filename in sorted(system_filenames): + match = system_filename_pattern.match(system_filename) + if match: + id = match.groups(0)[0] + model_filenames = [model_filename_pattern.replace("#ID#", id)] + # model_filenames = Rouge155.__get_model_filenames_for_id( + # id, model_dir, model_filename_pattern) + system_models_tuples.append((system_filename, sorted(model_filenames))) + if not system_models_tuples: + raise Exception( + "Did not find any files matching the pattern {} " + "in the system summaries directory {}.".format(system_filename_pattern.pattern, system_dir) + ) + + with codecs.open(config_file_path, "w", encoding="utf-8") as f: + f.write('') + for task_id, (system_filename, model_filenames) in enumerate(system_models_tuples, start=1): + eval_string = Rouge155.__get_eval_string( + task_id, system_id, system_dir, system_filename, model_dir, model_filenames + ) + f.write(eval_string) + f.write("") + + def write_config(self, config_file_path=None, system_id=None): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their matching model summary files. + + This is a non-static version of write_config_file_static(). + + config_file_path: Path of the configuration file. + system_id: Optional system ID string which will + appear in the ROUGE output. + + """ + if not system_id: + system_id = 1 + if (not config_file_path) or (not self._config_dir): + self._config_dir = mkdtemp(dir=self.temp_dir) + config_filename = "rouge_conf.xml" + else: + config_dir, config_filename = os.path.split(config_file_path) + verify_dir(config_dir, "configuration file") + self._config_file = os.path.join(self._config_dir, config_filename) + Rouge155.write_config_static( + self._system_dir, + self._system_filename_pattern, + self._model_dir, + self._model_filename_pattern, + self._config_file, + system_id, + ) + self.log.info("Written ROUGE configuration to {}".format(self._config_file)) + + def evaluate(self, system_id=1, rouge_args=None): + """ + Run ROUGE to evaluate the system summaries in system_dir against + the model summaries in model_dir. The summaries are assumed to + be in the one-sentence-per-line HTML format ROUGE understands. + + system_id: Optional system ID which will be printed in + ROUGE's output. + + Returns: Rouge output as string. + + """ + self.write_config(system_id=system_id) + options = self.__get_options(rouge_args) + command = [self._bin_path] + options + self.log.info("Running ROUGE with command {}".format(" ".join(command))) + rouge_output = check_output(command).decode("UTF-8") + return rouge_output + + def convert_and_evaluate(self, system_id=1, split_sentences=False, rouge_args=None): + """ + Convert plain text summaries to ROUGE format and run ROUGE to + evaluate the system summaries in system_dir against the model + summaries in model_dir. Optionally split texts into sentences + in case they aren't already. + + This is just a convenience method combining + convert_summaries_to_rouge_format() and evaluate(). + + split_sentences: Optional argument specifying if + sentences should be split. + system_id: Optional system ID which will be printed + in ROUGE's output. + + Returns: ROUGE output as string. 
+ + """ + if split_sentences: + self.split_sentences() + self.__write_summaries() + rouge_output = self.evaluate(system_id, rouge_args) + return rouge_output + + def output_to_dict(self, output): + """ + Convert the ROUGE output into python dictionary for further + processing. + + """ + # 0 ROUGE-1 Average_R: 0.02632 (95%-conf.int. 0.02632 - 0.02632) + pattern = re.compile(r"(\d+) (ROUGE-\S+) (Average_\w): (\d.\d+) " r"\(95%-conf.int. (\d.\d+) - (\d.\d+)\)") + results = {} + for line in output.split("\n"): + match = pattern.match(line) + if match: + sys_id, rouge_type, measure, result, conf_begin, conf_end = match.groups() + measure = {"Average_R": "recall", "Average_P": "precision", "Average_F": "f_score"}[measure] + rouge_type = rouge_type.lower().replace("-", "_") + key = "{}_{}".format(rouge_type, measure) + results[key] = float(result) + results["{}_cb".format(key)] = float(conf_begin) + results["{}_ce".format(key)] = float(conf_end) + return results + + ################################################################### + # Private methods + + def __set_rouge_dir(self, home_dir=None): + """ + Verify presence of ROUGE-1.5.5.pl and data folder, and set + those paths. + + """ + if not home_dir: + self._home_dir = self.__get_rouge_home_dir_from_settings() + else: + self._home_dir = home_dir + self.save_home_dir() + self._bin_path = os.path.join(self._home_dir, "ROUGE-1.5.5.pl") + self.data_dir = os.path.join(self._home_dir, "data") + if not os.path.exists(self._bin_path): + raise Exception( + "ROUGE binary not found at {}. Please set the " + "correct path by running pyrouge_set_rouge_path " + "/path/to/rouge/home.".format(self._bin_path) + ) + + def __get_rouge_home_dir_from_settings(self): + config = ConfigParser() + with open(self._settings_file) as f: + if hasattr(config, "read_file"): + config.read_file(f) + else: + # use deprecated python 2.x method + config.readfp(f) + rouge_home_dir = config.get("pyrouge settings", "home_dir") + return rouge_home_dir + + @staticmethod + def __get_eval_string(task_id, system_id, system_dir, system_filename, model_dir, model_filenames): + """ + ROUGE can evaluate several system summaries for a given text + against several model summaries, i.e. there is an m-to-n + relation between system and model summaries. The system + summaries are listed in the tag and the model summaries + in the tag. pyrouge currently only supports one system + summary per text, i.e. it assumes a 1-to-n relation between + system and model summaries. + + """ + peer_elems = '

<P ID="{id}">{name}</P>

'.format(id=system_id, name=system_filename) + + model_elems = [ + '{name}'.format(id=chr(65 + i), name=name) for i, name in enumerate(model_filenames) + ] + + model_elems = "\n\t\t\t".join(model_elems) + eval_string = """ + + {model_root} + {peer_root} + + + + {peer_elems} + + + {model_elems} + + +""".format( + task_id=task_id, model_root=model_dir, model_elems=model_elems, peer_root=system_dir, peer_elems=peer_elems + ) + return eval_string + + def __process_summaries(self, process_func): + """ + Helper method that applies process_func to the files in the + system and model folders and saves the resulting files to new + system and model folders. + + """ + temp_dir = mkdtemp(dir=self.temp_dir) + new_system_dir = os.path.join(temp_dir, "system") + os.mkdir(new_system_dir) + new_model_dir = os.path.join(temp_dir, "model") + os.mkdir(new_model_dir) + self.log.info( + "Processing summaries. Saving system files to {} and " + "model files to {}.".format(new_system_dir, new_model_dir) + ) + process_func(self._system_dir, new_system_dir) + process_func(self._model_dir, new_model_dir) + self._system_dir = new_system_dir + self._model_dir = new_model_dir + + def __write_summaries(self): + self.log.info("Writing summaries.") + self.__process_summaries(self.convert_summaries_to_rouge_format) + + @staticmethod + def __get_model_filenames_for_id(id, model_dir, model_filenames_pattern): + pattern = re.compile(model_filenames_pattern.replace("#ID#", id)) + model_filenames = [f for f in os.listdir(model_dir) if pattern.match(f)] + if not model_filenames: + raise Exception( + "Could not find any model summaries for the system" + " summary with ID {}. Specified model filename pattern was: " + "{}".format(id, model_filenames_pattern) + ) + return model_filenames + + def __get_options(self, rouge_args=None): + """ + Get supplied command line arguments for ROUGE or use default + ones. + + """ + if self.args: + options = self.args.split() + elif rouge_args: + options = rouge_args.split() + else: + options = [ + "-e", + self._data_dir, + "-c", + 95, + # '-2', + # '-1', + # '-U', + "-m", + # '-v', + "-r", + 1000, + "-n", + 2, + # '-w', 1.2, + "-a", + ] + options = list(map(str, options)) + + options = self.__add_config_option(options) + return options + + def __create_dir_property(self, dir_name, docstring): + """ + Generate getter and setter for a directory property. + + """ + property_name = "{}_dir".format(dir_name) + private_name = "_" + property_name + setattr(self, private_name, None) + + def fget(self): + return getattr(self, private_name) + + def fset(self, path): + verify_dir(path, dir_name) + setattr(self, private_name, path) + + p = property(fget=fget, fset=fset, doc=docstring) + setattr(self.__class__, property_name, p) + + def __set_dir_properties(self): + """ + Automatically generate the properties for directories. + + """ + directories = [ + ("home", "The ROUGE home directory."), + ("data", "The path of the ROUGE 'data' directory."), + ("system", "Path of the directory containing system summaries."), + ("model", "Path of the directory containing model summaries."), + ] + for (dirname, docstring) in directories: + self.__create_dir_property(dirname, docstring) + + def __clean_rouge_args(self, rouge_args): + """ + Remove enclosing quotation marks, if any. 
+ + """ + if not rouge_args: + return + quot_mark_pattern = re.compile('"(.+)"') + match = quot_mark_pattern.match(rouge_args) + if match: + cleaned_args = match.group(1) + return cleaned_args + else: + return rouge_args + + def __add_config_option(self, options): + return options + [self._config_file] + + def __get_config_path(self): + if platform.system() == "Windows": + parent_dir = os.getenv("APPDATA") + config_dir_name = "pyrouge" + elif os.name == "posix": + parent_dir = os.path.expanduser("~") + config_dir_name = ".pyrouge" + else: + parent_dir = os.path.dirname(__file__) + config_dir_name = "" + config_dir = os.path.join(parent_dir, config_dir_name) + if not os.path.exists(config_dir): + os.makedirs(config_dir) + return os.path.join(config_dir, "settings.ini") + + +if __name__ == "__main__": + import argparse + from utils.argparsers import rouge_path_parser + + parser = argparse.ArgumentParser(parents=[rouge_path_parser]) + args = parser.parse_args() + + rouge = Rouge155(args.rouge_home) + rouge.save_home_dir() diff --git a/examples/text_summarization/prophetnet/evaluate/gigaword/eval.py b/examples/text_summarization/prophetnet/evaluate/gigaword/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..51ed27e6804af685a6c54704a1c004e25ed518fa --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/gigaword/eval.py @@ -0,0 +1,368 @@ +"""BERT finetuning runner.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import glob +import json +import logging +import os +import shutil +import string +import tempfile +import time +from multiprocessing import Pool, cpu_count +from pathlib import Path + +# pip install py-rouge +import rouge + +# from pytorch_pretrained_bert.tokenization import BertTokenizer +# pip install pyrouge +from bs_pyrouge import Rouge155 + +logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO +) +logger = logging.getLogger(__name__) +logger.setLevel(logging.WARNING) + +parser = argparse.ArgumentParser() + +# Required parameters +parser.add_argument("--gold", type=str, help="Gold output file.") +parser.add_argument("--pred", type=str, help="Input prediction file.") +parser.add_argument("--split", type=str, default="", help="Data split (train/dev/test).") +parser.add_argument("--save_best", action="store_true", help="Save best epoch.") +parser.add_argument("--only_eval_best", action="store_true", help="Only evaluate best epoch.") +parser.add_argument("--trunc_len", type=int, default=0, help="Truncate line by the maximum length.") +default_process_count = max(1, cpu_count() - 1) +parser.add_argument( + "--processes", type=int, default=default_process_count, help="Number of processes to use (default %(default)s)" +) +parser.add_argument("--perl", action="store_true", help="Using the perl script.") +parser.add_argument("--lazy_eval", action="store_true", help="Skip evaluation if the .rouge file exists.") +args = parser.parse_args() + +SPECIAL_TOKEN = ["[UNK]", "[PAD]", "[CLS]", "[MASK]"] +evaluator = rouge.Rouge(metrics=["rouge-n", "rouge-l"], max_n=2, limit_length=False, apply_avg=True, weight_factor=1.2) + + +def test_rouge(cand, ref): + temp_dir = tempfile.mkdtemp() + candidates = cand + references = ref + assert len(candidates) == len(references) + + cnt = len(candidates) + current_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + tmp_dir = os.path.join(temp_dir, 
"rouge-tmp-{}".format(current_time)) + if not os.path.isdir(tmp_dir): + os.mkdir(tmp_dir) + os.mkdir(tmp_dir + "/candidate") + os.mkdir(tmp_dir + "/reference") + try: + for i in range(cnt): + if len(references[i]) < 1: + continue + with open(tmp_dir + "/candidate/cand.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(candidates[i]) + with open(tmp_dir + "/reference/ref.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(references[i]) + r = Rouge155(temp_dir=temp_dir) + r.model_dir = tmp_dir + "/reference/" + r.system_dir = tmp_dir + "/candidate/" + r.model_filename_pattern = "ref.#ID#.txt" + r.system_filename_pattern = r"cand.(\d+).txt" + rouge_results = r.convert_and_evaluate() + print(rouge_results) + results_dict = r.output_to_dict(rouge_results) + finally: + if os.path.isdir(tmp_dir): + shutil.rmtree(tmp_dir) + return results_dict + + +def rouge_results_to_str(results_dict): + return ">> ROUGE-F(1/2/l): {:.2f}/{:.2f}/{:.2f}\nROUGE-R(1/2/3/l): {:.2f}/{:.2f}/{:.2f}\n".format( + results_dict["rouge_1_f_score"] * 100, + results_dict["rouge_2_f_score"] * 100, + results_dict["rouge_l_f_score"] * 100, + results_dict["rouge_1_recall"] * 100, + results_dict["rouge_2_recall"] * 100, + results_dict["rouge_l_recall"] * 100, + ) + + +def count_tokens(tokens): + counter = {} + for t in tokens: + if t in counter.keys(): + counter[t] += 1 + else: + counter[t] = 1 + return counter + + +def get_f1(text_a, text_b): + tokens_a = text_a.lower().split() + tokens_b = text_b.lower().split() + if len(tokens_a) == 0 or len(tokens_b) == 0: + return 1 if len(tokens_a) == len(tokens_b) else 0 + set_a = count_tokens(tokens_a) + set_b = count_tokens(tokens_b) + match = 0 + for token in set_a.keys(): + if token in set_b.keys(): + match += min(set_a[token], set_b[token]) + p = match / len(tokens_a) + r = match / len(tokens_b) + return 2.0 * p * r / (p + r + 1e-5) + + +_tok_dict = { + "(": "-lrb-", + ")": "-rrb-", + "[": "-lsb-", + "]": "-rsb-", + "{": "-lcb-", + "}": "-rcb-", + "[UNK]": "UNK", + "&": "&", + "<": "<", + ">": ">", +} + + +def _is_digit(w): + for ch in w: + if not (ch.isdigit() or ch == ","): + return False + return True + + +def fix_tokenization(text): + input_tokens = text.split() + output_tokens = [] + has_left_quote = False + has_left_single_quote = False + + i = 0 + prev_dash = False + while i < len(input_tokens): + tok = input_tokens[i] + flag_prev_dash = False + if tok in _tok_dict.keys(): + output_tokens.append(_tok_dict[tok]) + i += 1 + elif tok == '"': + if has_left_quote: + output_tokens.append("''") + else: + output_tokens.append("``") + has_left_quote = not has_left_quote + i += 1 + elif ( + tok == "'" + and len(output_tokens) > 0 + and output_tokens[-1].endswith("n") + and i < len(input_tokens) - 1 + and input_tokens[i + 1] == "t" + ): + output_tokens[-1] = output_tokens[-1][:-1] + output_tokens.append("n't") + i += 2 + elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"): + output_tokens.append("'" + input_tokens[i + 1]) + i += 2 + elif tok == "'": + if has_left_single_quote: + output_tokens.append("'") + else: + output_tokens.append("`") + has_left_single_quote = not has_left_single_quote + i += 1 + elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." 
and input_tokens[i + 2] == ".": + output_tokens.append("...") + i += 3 + elif ( + tok == "," + and len(output_tokens) > 0 + and _is_digit(output_tokens[-1]) + and i < len(input_tokens) - 1 + and _is_digit(input_tokens[i + 1]) + ): + # $ 3 , 000 -> $ 3,000 + output_tokens[-1] += "," + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and output_tokens[-1].isdigit() + and i < len(input_tokens) - 1 + and input_tokens[i + 1].isdigit() + ): + # 3 . 03 -> $ 3.03 + output_tokens[-1] += "." + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and len(output_tokens[-1]) == 1 + and output_tokens[-1].isupper() + and i < len(input_tokens) - 2 + and len(input_tokens[i + 1]) == 1 + and input_tokens[i + 1].isupper() + and input_tokens[i + 2] == "." + ): + # U . N . -> U.N. + k = i + 3 + while k + 2 < len(input_tokens): + if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == ".": + k += 2 + else: + break + output_tokens[-1] += "".join(input_tokens[i:k]) + i += 2 + elif tok == "-": + if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-": + output_tokens.append("--") + i += 2 + elif i == len(input_tokens) - 1 or i == 0: + output_tokens.append("-") + i += 1 + elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation: + output_tokens[-1] += "-" + i += 1 + flag_prev_dash = True + else: + output_tokens.append("-") + i += 1 + elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation: + output_tokens[-1] += tok + i += 1 + else: + output_tokens.append(tok) + i += 1 + prev_dash = flag_prev_dash + return " ".join(output_tokens) + + +def process_eval(eval_fn): + gold_list = [] + with open(args.gold, "r", encoding="utf-8") as f_in: + for l in f_in: + line = l.strip() + gold_list.append(line) + + pred_list = [] + with open(eval_fn, "r", encoding="utf-8") as f_in: + for l in f_in: + buf = [] + sentence = fix_tokenization(l.strip()).replace("1", "#") + buf.append(sentence) + if args.trunc_len: + num_left = args.trunc_len + trunc_list = [] + for bit in buf: + tk_list = bit.split() + n = min(len(tk_list), num_left) + trunc_list.append(" ".join(tk_list[:n])) + num_left -= n + if num_left <= 0: + break + else: + trunc_list = buf + line = "\n".join(trunc_list) + pred_list.append(line) + with open(eval_fn + ".post", "w", encoding="utf-8") as f_out: + for l in pred_list: + f_out.write(l.strip()) + f_out.write("\n") + # rouge scores + if len(pred_list) < len(gold_list): + # evaluate subset + gold_list = gold_list[: len(pred_list)] + assert len(pred_list) == len(gold_list) + if args.perl: + scores = test_rouge(pred_list, gold_list) + else: + scores = evaluator.get_scores(pred_list, [[it] for it in gold_list]) + return eval_fn, scores + + +def main(): + if args.perl: + eval_fn_list = list(glob.glob(args.pred)) + else: + eval_fn_list = [ + eval_fn for eval_fn in glob.glob(args.pred) if not (args.lazy_eval and Path(eval_fn + ".rouge").exists()) + ] + eval_fn_list = list(filter(lambda fn: not (fn.endswith(".post") or fn.endswith(".rouge")), eval_fn_list)) + + if args.only_eval_best: + best_epoch_dict = {} + for dir_path in set(Path(fn).parent for fn in eval_fn_list): + fn_save = os.path.join(dir_path, "save_best.dev") + if Path(fn_save).exists(): + with open(fn_save, "r") as f_in: + __, o_name, __ = f_in.read().strip().split("\n") + epoch = o_name.split(".")[1] + best_epoch_dict[dir_path] = epoch + new_eval_fn_list = [] + for fn in eval_fn_list: + dir_path = 
Path(fn).parent + if dir_path in best_epoch_dict: + if Path(fn).name.split(".")[1] == best_epoch_dict[dir_path]: + new_eval_fn_list.append(fn) + eval_fn_list = new_eval_fn_list + + logger.info("***** Evaluation: %s *****", ",".join(eval_fn_list)) + num_pool = max(1, min(args.processes, len(eval_fn_list))) + logger.info(args.processes, len(eval_fn_list), num_pool) + p = Pool(num_pool) + r_list = p.imap_unordered(process_eval, eval_fn_list) + r_list = sorted([(fn, scores) for fn, scores in r_list], key=lambda x: x[0]) + rg2_dict = {} + for fn, scores in r_list: + logger.info(fn) + if args.perl: + print(rouge_results_to_str(scores)) + else: + rg2_dict[fn] = scores["rouge-2"]["f"] + print( + "ROUGE-1: {}\tROUGE-2: {}\tROUGE-L: {}\n".format( + scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"] + ) + ) + with open(fn + ".rouge", "w") as f_out: + f_out.write(json.dumps({"rg1": scores["rouge-1"]["f"], "rg2": scores["rouge-2"]["f"]})) + p.close() + p.join() + + if args.save_best: + # find best results + group_dict = {} + for k, v in rg2_dict.items(): + d_name, o_name = Path(k).parent, Path(k).name + if (d_name not in group_dict) or (v > group_dict[d_name][1]): + group_dict[d_name] = (o_name, v) + # compare and save the best result + for k, v in group_dict.items(): + fn = os.path.join(k, "save_best." + args.split) + o_name_s, rst_s = v + should_save = True + if Path(fn).exists(): + with open(fn, "r") as f_in: + rst_f = float(f_in.read().strip().split("\n")[-1]) + if rst_s <= rst_f: + should_save = False + if should_save: + with open(fn, "w") as f_out: + f_out.write("{0}\n{1}\n{2}\n".format(k, o_name_s, rst_s)) + + +if __name__ == "__main__": + main() diff --git a/examples/text_summarization/prophetnet/generate.py b/examples/text_summarization/prophetnet/generate.py new file mode 100644 index 0000000000000000000000000000000000000000..54f51e851809a977bd143dd054d0d13aba84af2d --- /dev/null +++ b/examples/text_summarization/prophetnet/generate.py @@ -0,0 +1,275 @@ +import argparse +import os +import random +import time +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from rouge_score import rouge_scorer, scoring +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers.prophetnet.modeling import ProphetNetForConditionalGeneration +from paddlenlp.transformers.prophetnet.tokenizer import ProphetNetTokenizer + +summarization_name_mapping = {"cnn_dailymail": ("article", "highlights")} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name_or_path", + default="prophetnet-large-uncased", + type=str, + required=True, + help="Path to pre-trained model. ", + ) + parser.add_argument( + "--dataset", default="gigaword", choices=["cnndm", "gigaword"], type=str, help="Path to tokenizer vocab file. " + ) + parser.add_argument( + "--output_path", type=str, default="generate.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument( + "--max_source_length", + default=1024, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=45, + type=int, + help="The minimum total sequence length for target text when generating. 
", + ) + parser.add_argument( + "--max_target_length", + default=110, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument( + "--decode_strategy", default="beam_search", type=str, help="The decode strategy in generation." + ) + parser.add_argument( + "--top_k", + default=2, + type=int, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument("--top_p", default=1.0, type=float, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", default=5, type=int, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + default=1.2, + type=float, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument( + "--early_stopping", + default=False, + type=eval, + help="Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.", + ) + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument( + "--num_beam_groups", + default=1, + type=int, + help="Number of groups to divide `num_beams` into in order to use DIVERSE BEAM SEARCH.", + ) + parser.add_argument( + "--repetition_penalty", + default=1.0, + type=float, + help="Number of groups to divide `num_beams` into in order to use DIVERSE BEAM SEARCH.", + ) + parser.add_argument("--batch_size", default=4, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--ignore_pad_token_for_loss", + default=True, + type=bool, + help="Whether to ignore the tokens corresponding to " "padded labels in the loss computation or not.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def compute_metrics(preds, labels, tokenizer, ignore_pad_token_for_loss=True, compute_rouge_=True): + def compute_rouge(predictions, references, rouge_types=None, use_stemmer=True): + if rouge_types is None: + rouge_types = ["rouge1", "rouge2", "rougeLsum"] + + scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=use_stemmer) + aggregator = scoring.BootstrapAggregator() + + for ref, pred in zip(references, predictions): + score = scorer.score(ref, pred) + aggregator.add_scores(score) + result = aggregator.aggregate() + result = {key: round(value.mid.fmeasure * 100, 4) for key, value in result.items()} + return result + + def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + if ignore_pad_token_for_loss: + labels = np.asarray(labels) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + decoded_preds, decoded_labels = [], [] + for pred, label in zip(preds, labels): + pred_id = post_process_seq(pred, tokenizer.bos_token_id, tokenizer.eos_token_id) + label_id = post_process_seq(label, tokenizer.bos_token_id, tokenizer.eos_token_id) + decoded_preds.append(tokenizer.convert_ids_to_string(pred_id)) + decoded_labels.append(tokenizer.convert_ids_to_string(label_id)) + + if compute_rouge_: + rouge_result = compute_rouge(decoded_preds, decoded_labels) + return rouge_result, decoded_preds + else: + return decoded_preds, decoded_labels + + +def read(data_path): + data_path_src = data_path[0] + data_path_tgt = data_path[1] + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + src_lines_length = len(f_d_s.readlines()) + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + tgt_lines_length = len(f_d_t.readlines()) + assert src_lines_length == tgt_lines_length + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + for row_d_s, row_d_t in tqdm(zip(f_d_s, f_d_t), total=src_lines_length): + yield {"article": row_d_s, "highlights": row_d_t} + + +def convert_example(is_test=False): + def warpper(example): + """convert an example into necessary features""" + tokens = example["article"] + labels = example["highlights"] + src_ids, src_attention_mask_ids = tokens.split("$1$") + src_ids = [int(i) for i in src_ids.split(" ")] + src_attention_mask_ids = [int(i) for i in src_attention_mask_ids.split(" ")] + + if not is_test: + labels, decoder_input_attention_mask_ids = labels.split("$1$") + labels = [int(i) for i in labels.split(" ")] + decoder_input_attention_mask_ids = [int(i) for i in decoder_input_attention_mask_ids.split(" ")] + decoder_input_ids = [labels[-1]] + labels[:-1] + + return src_ids, src_attention_mask_ids, decoder_input_ids, decoder_input_attention_mask_ids, labels + + else: + labels, _ = labels.split("$1$") + labels = [int(i) for i in labels.split(" ")] + return src_ids, src_attention_mask_ids, labels + + return warpper + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + tokenizer = ProphetNetTokenizer.from_pretrained(args.model_name_or_path) + model = ProphetNetForConditionalGeneration.from_pretrained(args.model_name_or_path) + + test_data_src = "data/" + args.dataset + "_data/uncased_tok_data/test.src" + test_data_tgt = "data/" + args.dataset + "_data/uncased_tok_data/test.tgt" + + test_dataset = load_dataset(read, data_path=[test_data_src, test_data_tgt], lazy=False) + + trunc = convert_example(is_test=True) + + test_dataset = test_dataset.map(trunc) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=0), # attn mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # labels + ): fn(samples) + + batch_sampler = BatchSampler(test_dataset, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_dataset, batch_sampler=batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in 
tqdm(enumerate(test_data_loader), total=len(test_data_loader)): + input_ids, attention_mask, labels = batch + preds, _ = model.generate( + input_ids=input_ids, + attention_mask=attention_mask, + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + num_beam_groups=args.num_beam_groups, + repetition_penalty=args.repetition_penalty, + ) + total_time += time.time() - start_time + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + start_time = time.time() + decoded_preds, _ = compute_metrics( + all_preds, all_labels, tokenizer, args.ignore_pad_token_for_loss, compute_rouge_=False + ) + if not os.path.exists(os.path.abspath(os.path.dirname(args.output_path) + os.path.sep + ".")): + os.makedirs(os.path.abspath(os.path.dirname(args.output_path) + os.path.sep + ".")) + with open(args.output_path, "w", encoding="utf-8") as fout: + for decoded_pred in decoded_preds: + fout.write(decoded_pred + "\n") + print("Save generated result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + generate(args) diff --git a/examples/text_summarization/prophetnet/requirements.txt b/examples/text_summarization/prophetnet/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..36ec00fa8f7284bb1ac36c268554ff8418a5f636 --- /dev/null +++ b/examples/text_summarization/prophetnet/requirements.txt @@ -0,0 +1,5 @@ +configparser==5.2.0 +nltk==3.6.7 +numpy==1.21.0 +tqdm==4.62.3 +py-rouge==1.1 \ No newline at end of file diff --git a/examples/text_summarization/prophetnet/run_eval.sh b/examples/text_summarization/prophetnet/run_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..39b7bdcf1cfc03849b17860c8b3bd4ad74de8f15 --- /dev/null +++ b/examples/text_summarization/prophetnet/run_eval.sh @@ -0,0 +1,37 @@ +DATASET=$1 + +if [ $DATASET = cnndm ] +then +python generate.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/cnndm/generate.txt \ + --min_target_length=45 \ + --max_target_length=110 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.2 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu +else +python generate.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/gigaword/generate.txt \ + --min_target_length=1 \ + --max_target_length=200 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.6 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu +fi + + +python eval.py --dataset $DATASET --generated ./generate/$DATASET/generate.txt \ No newline at end of file diff --git a/examples/text_summarization/prophetnet/run_train.sh b/examples/text_summarization/prophetnet/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..625098dc77e205291e9d7e0c996c00ac0df796bb --- /dev/null +++ b/examples/text_summarization/prophetnet/run_train.sh @@ -0,0 +1,54 @@ +#!/bin/bash + +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +DATASET=$1 + +if [ "$DATASET" == cnndm ] +then +python -m paddle.distributed.launch --gpus 0 train_prophetnet.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=8 \ + --num_train_epochs=4 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=4 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/cnndm +else +python -m paddle.distributed.launch --gpus 0 train_prophetnet.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=16 \ + --per_device_eval_batch_size=32 \ + --num_train_epochs=6 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=8 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/gigaword +fi \ No newline at end of file diff --git a/examples/text_summarization/prophetnet/train_prophetnet.py b/examples/text_summarization/prophetnet/train_prophetnet.py new file mode 100644 index 0000000000000000000000000000000000000000..f799a6dabbf6ce5341ba9712fdc8dc368858383c --- /dev/null +++ b/examples/text_summarization/prophetnet/train_prophetnet.py @@ -0,0 +1,252 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
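+# The --learning_rate, --warmup_init_lr and --warmup_steps flags passed by
+# run_train.sh configure the InverseSquareRootSchedule defined below: the
+# learning rate rises linearly from warmup_init_lr to learning_rate over the
+# first warmup_steps updates, then decays as learning_rate * sqrt(warmup_steps / step).
+# For example, with the cnndm settings (1e-07 -> 1e-04 over 1000 steps),
+# step 4000 uses 1e-04 * sqrt(1000 / 4000) = 5e-05 and step 16000 uses 2.5e-05.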
+ +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from tqdm import tqdm + +from paddlenlp.data import Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers.prophetnet.modeling import ( + ProphetNetForConditionalGeneration, +) +from paddlenlp.transformers.prophetnet.tokenizer import ProphetNetTokenizer + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="prophetnet-large-uncased", + metadata={"help": ("Path to pre-trained model.")}, + ) + warmup_init_lr: Optional[float] = field( + default=1e-07, + ) + + +@dataclass +class DataArguments: + dataset: Optional[str] = field( + default="gigaword", + metadata={"help": ("Path to tokenizer vocab file.")}, + ) + + +def read(data_path): + data_path_src = data_path[0] + data_path_tgt = data_path[1] + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + src_lines_length = len(f_d_s.readlines()) + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + tgt_lines_length = len(f_d_t.readlines()) + assert src_lines_length == tgt_lines_length + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + for row_d_s, row_d_t in tqdm(zip(f_d_s, f_d_t), total=src_lines_length): + yield {"article": row_d_s, "highlights": row_d_t} + + +class InverseSquareRootSchedule(paddle.optimizer.lr.LRScheduler): + def __init__(self, warmup_init_lr, warmup_end_lr, warmup_steps, last_epoch=-1, verbose=False): + self.lr_step = (warmup_end_lr - warmup_init_lr) / warmup_steps + self.decay_factor = warmup_end_lr * warmup_steps**0.5 + self.warmup_steps = warmup_steps + self.warmup_init_lr = warmup_init_lr + super(InverseSquareRootSchedule, self).__init__(warmup_init_lr, last_epoch, verbose) + + def get_lr(self): + if self.last_epoch < self.warmup_steps: + self.base_lr = self.warmup_init_lr + self.last_epoch * self.lr_step + else: + self.base_lr = self.decay_factor * self.last_epoch**-0.5 + return self.base_lr + + +def convert_example(is_test=False): + def warpper(example): + """convert an example into necessary features""" + tokens = example["article"] + labels = example["highlights"] + src_ids, src_attention_mask_ids = tokens.split("$1$") + src_ids = [int(i) for i in src_ids.split(" ")] + src_attention_mask_ids = [int(i) for i in src_attention_mask_ids.split(" ")] + + if not is_test: + labels, decoder_input_attention_mask_ids = labels.split("$1$") + labels = [int(i) for i in labels.split(" ")] + decoder_input_attention_mask_ids = [int(i) for i in decoder_input_attention_mask_ids.split(" ")] + decoder_input_ids = [labels[-1]] + labels[:-1] + return src_ids, src_attention_mask_ids, decoder_input_ids, decoder_input_attention_mask_ids, labels + + else: + return src_ids, src_attention_mask_ids + + return warpper + + +@dataclass +class DataCollator: + tokenizer: ProphetNetTokenizer + + def __call__(self, features): + src_ids = [] + src_pids = [] + tgt_ids = [] + tgt_pids = [] + labels = [] + batch = {} + + for feature in features: + src_idx, src_pid, tgt_idx, tgt_pid, label = feature + src_ids.append(src_idx) + src_pids.append(src_pid) + tgt_ids.append(tgt_idx) + tgt_pids.append(tgt_pid) + labels.append(label) + + src_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(src_ids),) + src_pids = (Pad(axis=0, pad_val=0)(src_pids),) + tgt_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(tgt_ids),) + tgt_pids = 
(Pad(axis=0, pad_val=0)(tgt_pids),) + labels = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(labels),) + + batch["src_ids"] = src_ids[0] + batch["src_pids"] = src_pids[0] + batch["tgt_ids"] = tgt_ids[0] + batch["tgt_pids"] = tgt_pids[0] + batch["labels"] = labels[0] + + return batch + + +def loss_func(model, logits, labels, ignore_index=-100): + expend_targets = paddle.cast( + paddle.zeros((model.prophetnet.config["ngram"], labels.shape[0], labels.shape[1])).fill_(ignore_index), + dtype=paddle.int32, + ) + + for i in range(model.prophetnet.config["ngram"]): + if i > 0 and model.prophetnet.disable_ngram_loss: + break + expend_targets[i, :, :] = labels.cast(dtype=paddle.int32) # B,Ngram,Seq + + logits = logits.transpose([1, 0, 2, 3]) + + if model.prophetnet.eps > 0.0: + expend_targets_mask = paddle.cast(expend_targets != ignore_index, dtype=paddle.float32) + expend_targets = paddle.nn.functional.one_hot(expend_targets, num_classes=model.vocab_size) + expend_targets = paddle.nn.functional.label_smooth(expend_targets, epsilon=model.prophetnet.eps) + loss = paddle.nn.functional.cross_entropy(logits, expend_targets, soft_label=True, reduction="none").squeeze() + loss = paddle.sum(expend_targets_mask * loss) / expend_targets_mask.sum() + else: + loss = paddle.nn.functional.cross_entropy( + logits, expend_targets.cast(dtype=paddle.int64), ignore_index=ignore_index + ) + + return loss + + +class ProphetnetTrainer(Trainer): + def compute_loss(self, model, inputs, return_outputs=False): + src_ids, src_attention_mask_ids, decoder_input_ids, decoder_input_attention_mask_ids, label_ids = inputs + src_ids = inputs["src_ids"] + src_attention_mask_ids = inputs["src_pids"] + decoder_input_ids = inputs["tgt_ids"] + decoder_input_attention_mask_ids = inputs["tgt_pids"] + label_ids = inputs["labels"] + + src_ids = src_ids.cast(dtype=paddle.int32) + src_attention_mask_ids = src_attention_mask_ids.cast(dtype=paddle.int32) + decoder_input_ids = decoder_input_ids.cast(dtype=paddle.int32) + decoder_input_attention_mask_ids = decoder_input_attention_mask_ids.cast(dtype=paddle.int32) + label_ids = label_ids.cast(dtype=paddle.int64) + + outputs = model( + input_ids=src_ids, + attention_mask=src_attention_mask_ids, + decoder_input_ids=decoder_input_ids, + decoder_attention_mask=decoder_input_attention_mask_ids, + ) + loss = loss_func(model, outputs[2], label_ids, ignore_index=model.padding_idx) + + return (loss, outputs) if return_outputs else loss + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + train_data_src = "data/" + data_args.dataset + "_data/uncased_tok_data/train.src" + train_data_tgt = "data/" + data_args.dataset + "_data/uncased_tok_data/train.tgt" + + dev_data_src = "data/" + data_args.dataset + "_data/uncased_tok_data/dev.src" + dev_data_tgt = "data/" + data_args.dataset + "_data/uncased_tok_data/dev.tgt" + + train_dataset = load_dataset(read, data_path=[train_data_src, train_data_tgt], lazy=False) + dev_dataset = load_dataset(read, data_path=[dev_data_src, dev_data_tgt], lazy=False) + + tokenizer = ProphetNetTokenizer.from_pretrained(model_args.model_name_or_path) + + trans_func = convert_example() + + train_dataset = train_dataset.map(trans_func) + dev_dataset = dev_dataset.map(trans_func) + batchify_fn = 
DataCollator(tokenizer) + + model = ProphetNetForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + + lr_scheduler = InverseSquareRootSchedule( + model_args.warmup_init_lr, training_args.learning_rate, training_args.warmup_steps + ) + optimizer = paddle.optimizer.Adam( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=training_args.weight_decay, + grad_clip=paddle.nn.ClipGradByNorm(training_args.max_grad_norm), + ) + + trainer = ProphetnetTrainer( + model=model, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=dev_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + optimizers=(optimizer, lr_scheduler), + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_summarization/prophetnet/uncase_tokenize_data.py b/examples/text_summarization/prophetnet/uncase_tokenize_data.py new file mode 100644 index 0000000000000000000000000000000000000000..7a0b8f84ba0f1c5d42aa086d0312d925a3a9b25c --- /dev/null +++ b/examples/text_summarization/prophetnet/uncase_tokenize_data.py @@ -0,0 +1,92 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
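+# This script pre-tokenizes the GLGE cnndm/gigaword text files with the
+# ProphetNet tokenizer and writes one example per line in the form
+# "<input_ids>$1$<attention_mask>", e.g. "102 2003 1037 103$1$1 1 1 1"
+# (the token ids here are made-up placeholders). convert_example() in
+# train_prophetnet.py and generate.py later splits each line on "$1$" to
+# recover the id and attention-mask sequences.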
+ +import argparse +import os + +import tqdm +from nltk.tokenize.treebank import TreebankWordDetokenizer + +from paddlenlp.transformers.prophetnet.tokenizer import ProphetNetTokenizer + + +def uncased_preocess(fin, fout, keep_sep=False, max_len=512): + tokenizer = ProphetNetTokenizer(vocab_file="prophetnet.tokenizer") + fin = open(fin, "r", encoding="utf-8") + fout = open(fout, "w", encoding="utf-8") + twd = TreebankWordDetokenizer() + for line in tqdm.tqdm(fin.readlines()): + line = line.strip().replace("``", '"').replace("''", '"').replace("`", "'") + s_list = [twd.detokenize(x.strip().split(" "), convert_parentheses=True) for x in line.split("")] + if keep_sep: + output_string = " [X_SEP] ".join(s_list) + else: + output_string = " ".join(s_list) + encoded_string = tokenizer(output_string, return_attention_mask=True, max_length=max_len) + ids, attention_mask_ids = encoded_string["input_ids"][:max_len], encoded_string["attention_mask"][:max_len] + output_string = "$1$".join([" ".join([str(i) for i in ids]), " ".join([str(i) for i in attention_mask_ids])]) + fout.write("{}\n".format(output_string)) + + +def tokenize_with_bert_uncase(fin, fout, max_len=512): + fin = open(fin, "r", encoding="utf-8") + fout = open(fout, "w", encoding="utf-8") + tokenizer = ProphetNetTokenizer(vocab_file="prophetnet.tokenizer") + for line in tqdm.tqdm(fin.readlines()): + encoded_string = tokenizer(line, return_attention_mask=True, max_length=max_len) + ids, attention_mask_ids = encoded_string["input_ids"][:max_len], encoded_string["attention_mask"][:max_len] + output_string = "$1$".join([" ".join([str(i) for i in ids]), " ".join([str(i) for i in attention_mask_ids])]) + fout.write("{}\n".format(output_string)) + + +def tokenize_data(dataset): + dataset = dataset + "_data" + input_dir = "./data/%s" % (dataset) + output_dir = "./data/%s/uncased_tok_data" % (dataset) + if not os.path.isdir(output_dir): + os.makedirs(output_dir) + if dataset == "cnndm": + uncased_preocess("%s/train.src" % input_dir, "%s/train.src" % output_dir, keep_sep=False) + uncased_preocess("%s/dev.src" % input_dir, "%s/dev.src" % output_dir, keep_sep=False) + uncased_preocess("%s/test.src" % input_dir, "%s/test.src" % output_dir, keep_sep=False) + uncased_preocess("%s/train.tgt" % input_dir, "%s/train.tgt" % output_dir, keep_sep=True, max_len=128) + uncased_preocess("%s/dev.tgt" % input_dir, "%s/dev.tgt" % output_dir, keep_sep=True) + uncased_preocess("%s/test.tgt" % input_dir, "%s/test.tgt" % output_dir, keep_sep=True) + else: + tokenize_with_bert_uncase("%s/train.src" % input_dir, "%s/train.src" % output_dir) + tokenize_with_bert_uncase("%s/train.tgt" % input_dir, "%s/train.tgt" % output_dir) + tokenize_with_bert_uncase("%s/dev.src" % input_dir, "%s/dev.src" % output_dir) + tokenize_with_bert_uncase("%s/dev.tgt" % input_dir, "%s/dev.tgt" % output_dir) + tokenize_with_bert_uncase("%s/test.src" % input_dir, "%s/test.src" % output_dir) + tokenize_with_bert_uncase("%s/test.tgt" % input_dir, "%s/test.tgt" % output_dir) + + +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", type=str, help="choose dataset from all, or 1 of 8 datasets: cnndm, gigaword") +args = parser.parse_args() + +DATASET_LIST = ["cnndm", "gigaword"] + +if args.dataset != "all" and args.dataset not in DATASET_LIST: + print("please choose dataset from all, or 1 of 8 datasets: cnndm, gigaword") + exit() +else: + if args.dataset == "all": + dataset_list = DATASET_LIST + else: + dataset_list = [args.dataset] + +print(dataset_list) +for dataset in dataset_list: + 
tokenize_data(dataset) diff --git a/examples/text_summarization/prophetnet/uncompress_data.sh b/examples/text_summarization/prophetnet/uncompress_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..392596ae14b29a62bed5bf7dedf0c979b68ea9c4 --- /dev/null +++ b/examples/text_summarization/prophetnet/uncompress_data.sh @@ -0,0 +1,12 @@ +tar -xvf ./glge_public.tar +tar -zxvf ./glge_hidden_v1.1.tar.gz + +DATA=./data +DATASETS=(cnndm gigaword) +mkdir $DATA +for DATASET in ${DATASETS[@]}; do + echo $DATASET +mkdir $DATA/$DATASET\_data +mv ./glge-released-dataset/easy/$DATASET\_data/org_data/* $DATA/$DATASET\_data/ +mv ./glge-hidden-dataset/easy/$DATASET\_data/org_data/* $DATA/$DATASET\_data/ +done diff --git a/examples/text_summarization/unimo-text/README.md b/examples/text_summarization/unimo-text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c0d336331d03142e490890df9cb2a11b7fada0df --- /dev/null +++ b/examples/text_summarization/unimo-text/README.md @@ -0,0 +1,294 @@ +# 生成式文本摘要应用 + +**目录** +- [生成式文本摘要应用](#生成式文本摘要应用) + - [简介](#简介) + - [基于预训练语言模型的文本摘要](#基于预训练语言模型的文本摘要) + - [效果展示](#效果展示) + - [开箱即用](#开箱即用) + - [支持单条、批量预测](#支持单条批量预测) + - [可配置参数说明](#可配置参数说明) + - [训练定制](#训练定制) + - [文本摘要应用定制训练全流程介绍](#文本摘要应用定制训练全流程介绍) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型训练](#模型训练) + - [模型预测](#模型预测) + - [模型推理部署](#模型推理部署) + - [FastGeneration加速及模型静态图导出](#fastgeneration加速及模型静态图导出) + - [模型部署](#模型部署) + - [References](#references) + + +## 简介 +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + +本项目是基于预训练语言模型UNIMO-Text的文本摘要,具有以下优势: +- 效果领先。 +- 开箱即用。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 +- 高性能推理。本项目基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/fast_generation)进行推理加速,能够提供更高性能的推理体验。 +- 训练推理全流程打通。本项目提供了全面的定制训练流程,从数据准备、模型训练预测,到模型推理部署,一应俱全。 + +### 基于预训练语言模型的文本摘要 + +基于预训练语言模型(Pretrained Language Models, PLMs)范式的自动文本摘要是目前最常用、效果最好(SOTA)的方式。 +预训练模型是在超大规模的语料采用无监督(unsupervised)或者弱监督(weak-supervised)的方式进行预训练,能够学习如何准确地理解自然语言并以自然语言的形式流畅表达,这两项都是完成文本摘要任务的重要能力。 + +PaddleNLP提供了方便易用的接口,可指定模型名或模型参数文件路径通过from_pretrained()方法加载不同网络结构的预训练模型,且相应预训练模型权重下载速度快速、稳定。下面以中文unimo-text-1.0-summary模型为例,演示如何加载预训练模型和分词器: +``` +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer +model_name = "unimo-text-1.0-summary" +model = UNIMOLMHeadModel.from_pretrained(model_name) +tokenizer = UNIMOTokenizer.from_pretrained(model_name) +``` + +## 效果展示 + +## 开箱即用 +PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。 +### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> summarizer = Taskflow("text_summarization") +# 单条输入 +>>> summarizer("雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。") +# 输出:'雪后的景色可真美丽呀!' + +# 多条输入 +>>> summarizer([ + "雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。", + "根据“十个工作日”原则,下轮调价窗口为8月23日24时。卓创资讯分析,原油价格或延续震荡偏弱走势,且新周期的原油变化率仍将负值开局,消息面对国内成品油市场并无提振。受此影响,预计国内成品油批发价格或整体呈现稳中下滑走势,但“金九银十”即将到来,卖方看好后期市场,预计跌幅较为有限。" + ]) +#输出:['雪后的景色可真美丽呀!', '成品油调价窗口8月23日24时开启'] +``` + +### 可配置参数说明 +* `model`:可选模型,默认为`unimo-text-1.0-summary`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + + +## 训练定制 +### 文本摘要应用定制训练全流程介绍 +接下来,我们将按数据准备、训练、预测、推理部署对文本摘要应用的全流程进行介绍。 +1. **数据准备** +- 如果没有已标注的数据集,我们推荐[doccano](https://github.com/doccano/doccano)数据标注工具。 +如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集](#从本地文件创建数据集)。 + +2. 
**模型训练** + +- 数据准备完成后,可以开始使用我们的数据集对预训练模型进行微调训练。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。中文任务默认使用"unimo-text-1.0-summary"模型,unimo-text-1.0-summary还支持large模型,详见[UNIMO模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/UNIMO/contents.html),可以根据任务和设备需求进行选择。 + + +3. **模型预测** + +- 训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型预测结果。 + +4. **模型推理部署** + +- 模型部署需要将保存的最佳模型参数(动态图)导出成静态图参数,用于后续的推理部署。 + +- 文本摘要应用提供了基于Paddle Inference的本地部署predictor,并且支持在GPU设备使用FastGeneration进行加速。 + +- 文本摘要应用提供了基于Paddle Serving的服务端部署方案。 + +### 环境依赖 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +text_summarization/ +├── deploy # 部署 +│ ├── paddle_inference # PaddleInference高性能推理部署 +│ │ ├── inference_unimo_text.py # 推理部署脚本 +│ │ └── README.md # 说明文档 +│ └── paddle_serving +│ ├── config.yml # 配置文件 +│ ├── pipeline_client.py # 客户端程序 +│ ├── pipeline_service.py # 服务器程序 +│ ├── export_serving.sh # serving模型导出脚本 +│ └── README.md # 说明文档 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_model.sh # 动态图参数导出静态图参数shell脚本 +├── train.py # 训练评估脚本 +├── train.sh # 训练评估shell脚本 +├── utils.py # 工具函数脚本 +└── README.md # 说明文档 +``` + +### 数据准备 + +#### 数据加载 +#### 从本地文件创建数据集 + +在许多情况,我们需要使用本地数据集来训练我们的文本摘要模型,本项目支持使用固定格式本地数据集文件进行训练。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +└── test.json # 可选,待预测数据文件 +``` +本地数据集文件格式如下: +- train.json/test.json 文件每行格式: +```text +{ +"title": "任志强抨击政府把土地作为投机品地产业被人为破坏", +"content": "“北京的保障房市场就像一个巨大的赌场,每个人都在期待中奖。”面对中国目前现行的保障性住房政策,华远地产董事长任志强再次语出惊人。(分享自@第一财经-中国房地产金融)" +} +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + + +### 模型训练 +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=output +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} train.py \ + --model_name_or_path=unimo-text-1.0-summary \ + --train_file train.json \ + --eval_file test.json \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=10000 \ + --epochs=10 \ + --batch_size=32 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=60 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_eval \ + --device=gpu \ +``` +也可以直接使用`train.sh`. 
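如果需要在自定义脚本中加载上文 `train.json` 格式(每行一个 JSON,包含 `title` 与 `content` 字段)的本地数据,可以参考下面的最小示例(仅作示意,并非 `train.py` 的原始实现):

```python
import json

from paddlenlp.datasets import load_dataset


def read(data_path):
    # 每行一个 JSON:"content" 为原文,"title" 为参考摘要
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            yield {"content": example["content"], "title": example["title"]}


train_ds = load_dataset(read, data_path="data/train.json", lazy=False)
print(train_ds[0]["title"])
```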
+ +关键参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `dataset_name` 数据集名称。 +- `train_file` 本地训练数据地址。 +- `eval_file` 本地测试数据地址。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型(详见[UNIMO模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/UNIMO/contents.html)),或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unimo-text-1.0-summary | + | unimo-text-1.0 | + | unimo-text-1.0-large | + +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_eval` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +更多参数详情和参数的默认值请参考`train.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./checkpoints/ +├── model_8000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── special_tokens_map.json +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... +``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python train.py \ + --do_eval \ + --eval_file test.json \ + --model_name_or_path=your_model_path \ + --logging_steps=100 \ + --batch_size=16 \ + --max_seq_len=60 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --device=gpu +``` + +程序运行结束后会将预测结果保存在`output_path`中。 + + +Finetuned baseline的模型在[LCSTS](https://aclanthology.org/D15-1229/)测试集上有如下结果: +| model_name | Rouge-1 | Rouge-2 | Rouge-L | BLEU-4 | +| :-----------------------------: | :---: | :-----------: | :-------------------: |:-------------------: | +| finetuned unimo-text-1.0-summary | 39.56 | 26.24 | 36.35 | 21.48 | + + +### 模型推理部署 + +#### FastGeneration加速及模型静态图导出 + +使用动态图训练结束之后,可以通过[静态图导出脚本](export_model.py)实现基于FastGeneration的高性能预测加速,并将动态图参数导出成静态图参数,静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py \ + --model_name_or_path unimo-text-1.0-summary \ + --decoding_strategy beam_search \ + --inference_model_dir ./inference_model \ + --max_out_len 30 \ +``` +关键参数释义如下: + +* `model_name_or_path`:动态图训练保存的参数路径;默认为"unimo-text-1.0-summary"。 +* `inference_model_dir`:静态图图保存的参数路径;默认为"./inference_model"。 +* `max_out_len`:最大输出长度。 + +执行命令后将会自动导出模型到指定的 `inference_model` 中,保存模型文件结构如下所示: + +```text +inference_model/ +├── unimo_text.pdiparams +├── unimo_text.pdiparams.info +└── unimo_text.pdmodel +``` + +#### 模型部署 +文本摘要应用已打通多种场景部署方案,点击链接获取具体的使用教程。 +- [Paddle Inference 推理 (Python)](./deploy/paddle_inference/README.md) +- [Paddle Serving 服务化部署(Python)](./deploy/paddle_serving/README.md) + +## References +Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). 
diff --git a/examples/text_summarization/unimo-text/deploy/paddle_inference/README.md b/examples/text_summarization/unimo-text/deploy/paddle_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ca5577601cf2520aab2be98c4456a2c869840ecc --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_inference/README.md @@ -0,0 +1,31 @@ +# Paddle Inference部署 +本文档将介绍如何使用[Paddle Inference](https://paddle-inference.readthedocs.io/en/latest/guides/introduction/index_intro.html#paddle-inference)工具进行自动文本摘要应用高性能推理推理部署。 + +**目录** + * [背景介绍](#背景介绍) + * [导出预测部署模型](#导出预测部署模型) + * [基于Python预测](#基于Python预测) + + +## 背景介绍 +Paddle inference和主框架的Model.predict均可实现推理预测,Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力,主框架的Model 对象是一个具备训练、测试、推理的神经网络。相比于Model.predict,inference可使用MKLDNN、CUDNN、TensorRT进行预测加速。Model.predict适用于训练好的模型直接进行预测,paddle inference适用于对推理性能、通用性有要求的用户,针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。由于 Paddle Inference 能力直接基于飞桨的训练算子,因此它支持飞桨训练出的所有模型的推理。 + + +Paddle Inference Python端预测部署主要包含两个步骤: +- 导出预测部署模型 +- 基于Python预测 + + +## 导出预测部署模型 +部署时需要使用预测格式的模型(即动态图转静态图操作)。预测格式模型相对训练格式模型而言,在拓扑上裁剪掉了预测不需要的算子,并且会做特定部署优化。具体操作详见[FastGeneration加速及模型静态图导出](../../README.md)。 + +## 基于Python预测 + + +在终端输入以下命令可在GPU上进行预测: +```shell +python inference_unimo_text.py --inference_model_dir ../../inference_model +``` + +关键参数释义如下: +* `inference_model_dir`:用于高性能推理的静态图模型参数路径;默认为"../../inference_model"。 diff --git a/examples/text_summarization/unimo-text/deploy/paddle_inference/inference_unimo_text.py b/examples/text_summarization/unimo-text/deploy/paddle_inference/inference_unimo_text.py new file mode 100644 index 0000000000000000000000000000000000000000..f60fbf35841b1d4257615d9463aca01ed572638c --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_inference/inference_unimo_text.py @@ -0,0 +1,151 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from pprint import pprint + +import numpy as np +from paddle import inference + +from paddlenlp.data import Pad +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--inference_model_dir", + default="../../inference_model", + type=str, + help="Path to save inference model of UNIMOText. ", + ) + args = parser.parse_args() + return args + + +def setup_predictor(args): + """Setup inference predictor.""" + # Load FastGeneration lib. 
+ load("FastGeneration", verbose=True) + model_file = os.path.join(args.inference_model_dir, "unimo_text.pdmodel") + params_file = os.path.join(args.inference_model_dir, "unimo_text.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = inference.Config(model_file, params_file) + config.enable_use_gpu(100, 0) + config.switch_ir_optim() + config.enable_memory_optim() + config.disable_glog_info() + + predictor = inference.create_predictor(config) + return predictor + + +def convert_example(example, tokenizer, max_seq_len=512, return_length=True): + """Convert all examples into necessary features.""" + source = example + tokenized_example = tokenizer.gen_encode( + source, + max_seq_len=max_seq_len, + add_start_token_for_decoding=True, + return_length=True, + is_split_into_words=False, + ) + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, pad_right=False): + """Batchify a batch of examples.""" + + def pad_mask(batch_attention_mask): + """Pad attention_mask.""" + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + if pad_right: + mask_data[:seq_len:, :seq_len] = np.array(batch_attention_mask[i], dtype="float32") + else: + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=pad_right, dtype="int32") + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + input_dict = {} + input_dict["input_ids"] = input_ids + input_dict["token_type_ids"] = token_type_ids + input_dict["attention_mask"] = attention_mask + input_dict["seq_len"] = seq_len + return input_dict + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +def infer(args, predictor): + """Use predictor to inference.""" + tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0-summary") + + inputs = [ + "雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。", + "根据“十个工作日”原则,下轮调价窗口为8月23日24时。卓创资讯分析,原油价格或延续震荡偏弱走势,且新周期的原油变化率仍将负值开局,消息面对国内成品油市场并无提振。受此影响,预计国内成品油批发价格或整体呈现稳中下滑走势,但“金九银十”即将到来,卖方看好后期市场,预计跌幅较为有限。", + ] + + examples = [convert_example(i, tokenizer) for i in inputs] + data = batchify_fn(examples, tokenizer.pad_token_id) + + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + input_handles[name].copy_from_cpu(data[name]) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + for idx, sample in enumerate(output[0]): + for beam_idx, beam in enumerate(sample): + if beam_idx > len(sample) // 2: + break + print(f"Example {idx} beam beam_idx {beam_idx}: ", "".join(postprocess_response(beam, tokenizer))) + + +if __name__ == "__main__": + args = setup_args() + pprint(args) + predictor = setup_predictor(args) + infer(args, predictor) diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/README.md b/examples/text_summarization/unimo-text/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..31061c513704e20e786edb56bac35a6c17d83722 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/README.md @@ -0,0 +1,140 @@ +# Paddle Serving服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署自动文本摘要在线服务。 + +## 目录 +- [Paddle Serving服务化部署](#paddle-serving服务化部署) + - [目录](#目录) + - [背景介绍](#背景介绍) + - [环境准备](#环境准备) + - [安装Paddle Serving](#安装paddle-serving) + - [模型转换](#模型转换) + - [pipeline部署](#pipeline部署) + - [修改配置文件](#修改配置文件) + - [server启动服务](#server启动服务) + - [client发送服务请求](#client发送服务请求) + +## 背景介绍 +Paddle Serving 依托深度学习框架 PaddlePaddle 旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving 支持 RESTful、gRPC、bRPC 等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。集成高性能服务端推理引擎 Paddle Inference 和端侧引擎 Paddle Lite。设计并实现基于有向无环图(DAG) 的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理、请求缓存等特性。 + +Paddle Serving Python端预测部署主要包含以下步骤: +- 环境准备 +- 模型转换 +- 部署模型 + +## 环境准备 +### 安装Paddle Serving +安装client和serving app,用于向服务发送请求: +```shell +pip install paddle_serving_app paddle_serving_client +``` +安装GPU server,用于启动服务: + +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 # -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 可以开启国内清华镜像源来加速下载 +- 如果要安装最新版本的PaddleServing参考[链接](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md)。 + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 + 
+用已安装的paddle_serving_client将静态图参数模型转换成serving格式。关于如何使用将训练后的动态图模型转为静态图模型详见[FastGeneration加速及模型静态图导出](../../README.md)。 + +模型转换命令如下: +```shell +python -m paddle_serving_client.convert --dirname ../../inference_model \ + --model_filename unimo_text.pdmodel \ + --params_filename unimo_text.pdiparams \ + --serving_server inference_model_server \ + --serving_client inference_model_client +``` +关键参数释义如下: +* `dirname`:模型文件夹地址。 +* `model_filename`:模型文件名。 +* `params_filename`:模型参数名。 +* `serving_server`:server的模型文件和配置文件路径,默认"serving_server"。 +* `serving_client`:client的配置文件路径,默认"serving_client"。 + +也可以直接使用`export_serving.sh`. + +更多参数可通过以下命令查询: +```shell +python -m paddle_serving_client.convert --help +``` +模型转换完成后,会在paddle_serving文件夹多出inference_model_server和inference_model_client的文件夹,文件夹目录格式如下: +``` +inference_model_server/ +├── unimo_text.pdiparams +├── unimo_text.pdmodel +├── serving_server_conf.prototxt +└── serving_server_conf.stream.prototxt + +inference_model_client/ +├── serving_client_conf.prototxt +└── serving_client_conf.stream.prototxt +``` + +## pipeline部署 + +paddle_serving目录包含启动pipeline服务和发送预测请求的代码,包括: +``` +paddle_serving/ +├──config.yml # 启动服务端的配置文件 +├──pipeline_client.py # 发送pipeline预测请求的脚本 +└──pipeline_service.py # 启动pipeline服务端的脚本 +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。 + +### server启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +# 启动服务 +python pipeline_service.py +``` +成功启动服务后,log.txt中会打印类似如下日志 +``` +--- Running analysis [ir_graph_to_program_pass] +I0831 12:29:41.132828 28269 analysis_predictor.cc:1035] ======= optimize end ======= +I0831 12:29:41.133375 28269 naive_executor.cc:102] --- skip [feed], feed -> seq_len +I0831 12:29:41.133384 28269 naive_executor.cc:102] --- skip [feed], feed -> attention_mask +I0831 12:29:41.133390 28269 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0831 12:29:41.133401 28269 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0831 12:29:41.134040 28269 naive_executor.cc:102] --- skip [_generated_var_3], fetch -> fetch +I0831 12:29:41.134049 28269 naive_executor.cc:102] --- skip [gather_tree_0.tmp_0], fetch -> fetch +[2022-08-31 12:29:41,138] [ INFO] - Already cached /root/.paddlenlp/models/unimo-text-1.0-summary/unimo-text-1.0-vocab.txt +[2022-08-31 12:29:41,161] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/unimo-text-1.0-summary/tokenizer_config.json +[2022-08-31 12:29:41,162] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/unimo-text-1.0-summary/special_tokens_map.json +[PipelineServicer] succ init +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +2022/08/31 12:29:41 start proxy service +``` + +### client发送服务请求 +执行以下命令发送文本摘要服务请求: +```shell +python pipeline_client.py +``` +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器)。 diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/config.yml b/examples/text_summarization/unimo-text/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..c6a2744c818a58f4a52d998620dcf6d76b554395 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/config.yml @@ -0,0 +1,54 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18011 + +#http端口, 
rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9999 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 10 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: True + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + text_summarization: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 11 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: ./inference_model_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准,不设置默认取全部输出变量 + fetch_list: ["_generated_var_3", "transpose_0.tmp_0"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: False + \ No newline at end of file diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/export_serving.sh b/examples/text_summarization/unimo-text/deploy/paddle_serving/export_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..d34b97fdd5264076c11aacf043e2586496f67f8e --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/export_serving.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle_serving_client.convert --dirname ../../inference_model \ + --model_filename unimo_text.pdmodel \ + --params_filename unimo_text.pdiparams \ + --serving_server inference_model_server \ + --serving_client inference_model_client \ No newline at end of file diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_client.py b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_client.py new file mode 100644 index 0000000000000000000000000000000000000000..70f11f46d589eb00211c2cf2ef1988b4d4fd9a59 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_client.py @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +from paddlenlp.utils.log import logger + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data): + inputs = np.array([i.encode("utf-8") for i in data], dtype=np.object_) + start_time = time.time() + ret = self.client.predict(feed_dict={"inputs": inputs}) + end_time = time.time() + logger.info("time cost :{} seconds".format(end_time - start_time)) + if not ret.value: + logger.warning("Fail to fetch summary.") + # ret is special class but a dict + for d, s in zip(data, eval(ret.value[0])): + print("Text: ", d) + print("Summary: ", s[0]) + print("-" * 50) + + +if __name__ == "__main__": + server_url = "127.0.0.1:18011" + runner = Runner(server_url) + texts = [ + "雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。", + "根据“十个工作日”原则,下轮调价窗口为8月23日24时。卓创资讯分析,原油价格或延续震荡偏弱走势,且新周期的原油变化率仍将负值开局,消息面对国内成品油市场并无提振。受此影响,预计国内成品油批发价格或整体呈现稳中下滑走势,但“金九银十”即将到来,卖方看好后期市场,预计跌幅较为有限。", + ] + runner.Run(texts) diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_service.py b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_service.py new file mode 100644 index 0000000000000000000000000000000000000000..0184041c6f930ce07ab15a26fd00e2bc63b56735 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_service.py @@ -0,0 +1,129 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
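+# Serving-side pipeline for the summarization model. pipeline_client.py above
+# sends a feed_dict whose "inputs" entry is a numpy object array of UTF-8
+# encoded texts; preprocess() below decodes and batches them, and postprocess()
+# returns the generated summaries as the string repr of a nested list in
+# out_dict["outputs"], which the client converts back to Python objects with
+# eval() and prints one summary per input text.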
+ +import numpy as np +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.data import Pad +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer +from paddlenlp.utils.log import logger + + +def convert_example(example, tokenizer, max_seq_len=512, return_length=True): + """Convert all examples into necessary features.""" + source = example + tokenized_example = tokenizer.gen_encode( + source, + max_seq_len=max_seq_len, + add_start_token_for_decoding=True, + return_length=True, + is_split_into_words=False, + ) + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, pad_right=False): + """Batchify a batch of examples.""" + + def pad_mask(batch_attention_mask): + """Pad attention_mask.""" + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + if pad_right: + mask_data[:seq_len:, :seq_len] = np.array(batch_attention_mask[i], dtype="float32") + else: + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=pad_right, dtype="int32") + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + input_dict = {} + input_dict["input_ids"] = input_ids + input_dict["token_type_ids"] = token_type_ids + input_dict["attention_mask"] = attention_mask + input_dict["seq_len"] = seq_len + return input_dict + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +class UnimoTextOp(Op): + """Op for unimo_text.""" + + def init_op(self): + self.tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0-summary") + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["inputs"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + logger.error("input value {}is not supported.".format(data)) + data = [i.decode("utf-8") for i in data] + examples = [convert_example(i, self.tokenizer) for i in data] + input_dict = batchify_fn(examples, self.tokenizer.pad_token_id) + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model input + return input_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + outputs = fetch_dict["transpose_0.tmp_0"] + results = [] + for sample in outputs: + result = [] + for idx, beam in enumerate(sample): + if idx >= len(sample) // 2: + break + res = "".join(postprocess_response(beam, self.tokenizer)) + result.append(res) + results.append(result) + out_dict = {} + out_dict["outputs"] = str(results) + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model output + return out_dict, None, "" + + +class UnimoTextService(WebService): + def get_pipeline_response(self, read_op): + return UnimoTextOp(name="text_summarization", input_ops=[read_op]) + + +if __name__ == "__main__": + # Load FastGeneration lib. + load("FastGeneration", verbose=True) + service = UnimoTextService(name="text_summarization") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/examples/text_summarization/unimo-text/export_model.py b/examples/text_summarization/unimo-text/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..4dc1a0b58fe51f0d124833f073ea8b35f0afc554 --- /dev/null +++ b/examples/text_summarization/unimo-text/export_model.py @@ -0,0 +1,114 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from pprint import pprint + +import paddle + +from paddlenlp.ops import FasterUNIMOText +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="unimo-text-1.0-summary", + type=str, + help="The model name to specify the UNIMOText to use. ", + ) + parser.add_argument( + "--inference_model_dir", + default="./inference_model", + type=str, + help="Path to save inference model of UNIMOText. 
", + ) + parser.add_argument("--topk", default=4, type=int, help="The number of candidate to procedure top_k sampling. ") + parser.add_argument( + "--topp", default=1.0, type=float, help="The probability threshold to procedure top_p sampling. " + ) + parser.add_argument("--max_out_len", default=64, type=int, help="Maximum output length. ") + parser.add_argument("--min_out_len", default=1, type=int, help="Minimum output length. ") + parser.add_argument("--num_return_sequence", default=1, type=int, help="The number of returned sequence. ") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set. ") + parser.add_argument("--num_return_sequences", default=1, type=int, help="The number of returned sequences. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + choices=["sampling", "beam_search"], + type=str, + help="The main strategy to decode. ", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument( + "--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. " + ) + + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + model_name_or_path = args.model_name_or_path + model = UNIMOLMHeadModel.from_pretrained(model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(model_name_or_path) + + unimo_text = FasterUNIMOText(model=model, use_fp16_decoding=args.use_fp16_decoding, trans_out=True) + + # Set evaluate mode + unimo_text.eval() + + # Convert dygraph model to static graph model + unimo_text = paddle.jit.to_static( + unimo_text, + input_spec=[ + # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # token_type_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # attention_mask + paddle.static.InputSpec(shape=[None, 1, None, None], dtype="float32"), + # seq_len + paddle.static.InputSpec(shape=[None], dtype="int64"), + args.max_out_len, + args.min_out_len, + args.topk, + args.topp, + args.num_beams, # num_beams. Used for beam_search. + args.decoding_strategy, + tokenizer.cls_token_id, # cls/bos + tokenizer.mask_token_id, # mask/eos + tokenizer.pad_token_id, # pad + args.diversity_rate, # diversity rate. Used for beam search. + args.temperature, + args.num_return_sequences, + ], + ) + + # Save converted static graph model + paddle.jit.save(unimo_text, os.path.join(args.inference_model_dir, "unimo_text")) + logger.info("UNIMOText has been saved to {}.".format(args.inference_model_dir)) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + + do_predict(args) diff --git a/examples/text_summarization/unimo-text/export_model.sh b/examples/text_summarization/unimo-text/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..6a51ec14e0df28b10f50e8d5c9717c63aad7294c --- /dev/null +++ b/examples/text_summarization/unimo-text/export_model.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --model_name_or_path unimo-text-1.0-summary \ + --decoding_strategy beam_search \ + --inference_model_dir ./inference_model \ + --max_out_len 30 \ \ No newline at end of file diff --git a/examples/text_summarization/unimo-text/train.py b/examples/text_summarization/unimo-text/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e6f3d0429df06fd9ed95541551d5aa2e31f51bdd --- /dev/null +++ b/examples/text_summarization/unimo-text/train.py @@ -0,0 +1,286 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from paddle.optimizer import AdamW +from utils import compute_metrics, create_data_loader, print_args, select_sum, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument( + "--model_name_or_path", + type=str, + default="unimo-text-1.0-summary", + help="The path or shortcut name of the pre-trained model.", + ) + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--eval_file", type=str, required=False, default=None, help="Eval data path.") + parser.add_argument( + "--save_dir", type=str, default="./checkpoints", help="The directory where the checkpoints will be saved." 
+ ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="The weight decay for optimizer.") + parser.add_argument("--epochs", type=int, default=3, help="Total number of training epochs to perform.") + parser.add_argument("--warmup_proportion", type=float, default=0.02, help="The number of warmup steps.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="The max value of grad norm.") + parser.add_argument("--beta1", type=float, default=0.9, help="beta1") + parser.add_argument("--beta2", type=float, default=0.98, help="beta2") + parser.add_argument("--epsilon", type=float, default=1e-6, help="epsilon") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum sequence length of training.") + parser.add_argument("--max_dec_len", type=int, default=20, help="The maximum sequence length of decoding.") + parser.add_argument("--min_dec_len", type=int, default=3, help="The minimal sequence length of decoding.") + parser.add_argument( + "--max_target_len", type=int, default=30, help="The maximum target sequence length of training." + ) + parser.add_argument( + "--num_return_sequences", + type=int, + default=1, + help="The numbers of returned sequences for one input in generation.", + ) + parser.add_argument( + "--decode_strategy", type=str, default="beam_search", help="The decode strategy in generation." + ) + parser.add_argument( + "--top_k", + type=int, + default=0, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument( + "--temperature", type=float, default=1.0, help="The value used to module the next token probabilities." + ) + parser.add_argument("--top_p", type=float, default=1.0, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", type=int, default=6, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + type=float, + default=1.2, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument("--device", type=str, default="gpu", help="The device to select for training the model.") + parser.add_argument( + "--output_path", type=str, default="./predict.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_eval", action="store_true", help="Whether to eval and predict.") + parser.add_argument("--use_amp", action="store_true", help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + args = parser.parse_args() + return args + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds = load_dataset(read_file, file=args.train_file, lazy=False) + dev_ds = load_dataset(read_file, file=args.eval_file, lazy=False) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.epochs + num_train_epochs = args.epochs + + print(f"num_training_steps: {num_training_steps}, num_train_epochs: {num_train_epochs}") + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
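+    # Concretely, decay_params below keeps the names of all parameters whose
+    # name contains neither "bias" nor "norm", and apply_decay_param_fun makes
+    # AdamW apply weight decay only to the parameters in that list.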
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + step = 0 + total_time = 0.0 + for epoch in range(num_train_epochs): + print("\nEpoch %d/%d" % (epoch + 1, num_train_epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + with paddle.amp.auto_cast( + args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"], level="O1" + ): + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.step(optimizer) + scaler.update() + optimizer.clear_grad(set_to_zero=False) + else: + loss.backward() + optimizer.step() + optimizer.clear_grad() + lr_scheduler.step() + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "epoch %d - step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (epoch, step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step == num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + batch_start_time = time.time() + if step >= num_training_steps: + break + if step >= num_training_steps: + break + + print("\nTraining completed.") + elif args.do_eval: + dev_ds = load_dataset(read_file, file=args.eval_file, lazy=False) + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("eval step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = 
time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "title" in data_loader.dataset[0].keys(): + targets = [example["title"] for example in data_loader.dataset] + compute_metrics(pred_ref, targets) + + model.train() + return + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/text_summarization/unimo-text/train.sh b/examples/text_summarization/unimo-text/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..302b66ede711f38d58f3b40c7e57e52f4262ba46 --- /dev/null +++ b/examples/text_summarization/unimo-text/train.sh @@ -0,0 +1,40 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=output +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} train.py \ + --model_name_or_path=unimo-text-1.0-summary \ + --train_file train.json \ + --eval_file test.json \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=10000 \ + --epochs=10 \ + --batch_size=32 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=60 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_eval \ + --device=gpu \ diff --git a/examples/text_summarization/unimo-text/utils.py b/examples/text_summarization/unimo-text/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5542758ff8ccb984cf73fd6edca0bda1a8322f90 --- /dev/null +++ b/examples/text_summarization/unimo-text/utils.py @@ -0,0 +1,217 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
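+
+# Module overview: shared helpers for the UNIMO-text summarization example.
+# convert_example builds features with tokenizer.gen_encode (adding
+# masked_positions/labels in train mode), batchify_fn left-pads a batch
+# (Pad with pad_right=False) and expands the attention mask to
+# [batch_size, 1, max_len, max_len], compute_metrics reports ROUGE-1/2/L and
+# BLEU-4, and select_sum keeps the best-scoring decoded sequence per input.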
+ +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from rouge import Rouge + +from paddlenlp.data import Pad +from paddlenlp.metrics import BLEU + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("rouge-1:", round(rouge1, 4)) + print("rouge-2:", round(rouge2, 4)) + print("rouge-L:", round(rougel, 4)) + print("BLEU-4:", round(bleu4.score(), 4)) + + +def convert_example(example, tokenizer, max_seq_len=512, max_target_len=128, mode="train"): + """Convert all examples into necessary features.""" + source = example["content"] + if mode != "test": + tokenized_example = tokenizer.gen_encode( + source, + target=example["title"], + max_seq_len=max_seq_len, + max_target_len=max_target_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + else: + tokenized_example = tokenizer.gen_encode( + source, max_seq_len=max_seq_len, add_start_token_for_decoding=True, return_position_ids=True + ) + + if "title" in example and example["title"]: + tokenized_example["title"] = example["title"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). 
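+        # Rows/columns that keep the -1e9 fill value correspond to left-padding
+        # (Pad below is created with pad_right=False); that large negative bias
+        # effectively removes the padded positions from the attention softmax.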
+ attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode != "test": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + else: + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + mode=mode, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + + # TODO: Support return scores in FT. 
+ tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/text_to_knowledge/README.md b/examples/text_to_knowledge/README.md new file mode 100644 index 0000000000000000000000000000000000000000..39d41248a007b44f4aeee417436f6a75bd569e77 --- /dev/null +++ b/examples/text_to_knowledge/README.md @@ -0,0 +1,168 @@ +# 解语(Text to Knowledge) + +[解语官网](https://www.paddlepaddle.org.cn/textToKnowledge) + +解语(Text to Knowledge)是首个覆盖中文全词类的知识库(百科知识树)及知识标注与挖掘框架,拥有可描述所有中文词汇的词类体系、中文知识标注工具集,以及更适用于中文挖掘任务的预训练语言模型。 + + +覆盖中文全词类的知识库和知识标注工具能够帮助你面对更加多元的应用场景,方便地融合自有知识体系,显著提升中文文本解析和挖掘效果,并能够更容易地利用知识增强机器学习模型效果。解语经过大规模工业应用验证,在实际业务中取得了良好的应用效果,适合通用领域中文文本理解任务。 + +image + + +**解语由以下三部分构成:** + +- [百科知识树(TermTree)](./termtree) :包括能够描述所有中文词汇的TermType词类体系,以及Term关系和属性值。 +- 中文知识标注工具集:包括[词类知识标注工具(WordTag)](./wordtag) 和[名词短语标注工具(NPTag)](./nptag),[适用于中文文本挖掘的预训练语言模型(ERNIE-CTM)](./ernie-ctm),为中文文本解析提供词类序列标注框架,结合百科知识树可实现定制化词类序列标注。 +- 中文知识挖掘方案:包括[知识模板挖掘工具](./wordtag-ie),旨在提供灵活可配置,可快速定制的中文知识挖掘方案。 + +**本次发布的解语开源试用版包括:** + +- 百科知识树(TermTree)V1.0试用版:包括简化版的TermType词类体系,和约100w的term集。 +- 中文词类知识标注工具(WordTag)V1.0版。 +- 名词短语标注工具(NPTag)V1.0版。 +- 中文预训练语言模型(ERNIE-CTM)V1.0版。 + + +---- + +## 解语的应用场景 + +解语可直接用于各类中文文本解析与挖掘任务,提升文本解析与挖掘精度;也可以作为中文文本特征生成器,为各类机器学习模型提供文本特征。 + +中文词类知识标注工具(WordTag)整合了传统中文解析的**分词**、**词性标注**、**命名实体识别**的能力,能够将任意中文句子解析为**完整的词类序列**。结合百科知识树(TermTree),可为应用提供一套通用的知识关联(term-linking)框架,方便应用适配关联自己的应用知识图谱,更好地将知识用于中文自然语言处理(NLP)任务。 + +![解语示例](doc/img/text_to_knowledge_example.png) + + +### 应用场景A:文本挖掘/解析模板生成与匹配 + +虽然近年来,深度学习模型尤其是预训练语言模型的广泛使用大幅提升了各项中文NLP任务效果,但在实际的工业应用中,单独使用深度学习模型往往达不到应用需求,还需要结合规则模型以提升精度以及解决恶劣case,如,知识图谱构建、query解析、语义一致性判定等应用。 + +在这些应用中,文本挖掘/解析模板是最常用的规则模型。WordTag包含了覆盖中文所有词汇的词类标注体系,在生成模板以及模板匹配上有着天然的优势。用户可以根据WordTag标注的样本词类序列,自动生成或配置更加丰富、精准的挖掘/解析模板,然后对目标文本使用WordTag标注,即可利用模板进行匹配,从而大大降低人工配置模板的代价,显著提升生产效率。 + +例如,输入文本:*美人鱼是周星驰执导的电影*,得到预测结果: + +```json +{ + "text": "美人鱼是周星驰执导的电影", + "items": [ + { + "item": "美人鱼", + "offset": 0, + "wordtag_label": "作品类_实体", + "length": 3, + "termid": "作品与出版物_eb_美人鱼" + }, + { + "item": "是", + "offset": 3, + "wordtag_label": "肯定词", + "length": 1, + "termid": "肯定否定词_cb_是" + }, + { + "item": "周星驰", + "offset": 4, + "wordtag_label": "人物类_实体", + "length": 3, + "termid": "人物_eb_周星驰" + }, + { + "item": "执导", + "offset": 7, + "wordtag_label": "场景事件", + "length": 2, + "termid": "场景事件_cb_执导" + }, + { + "item": "的", + "offset": 9, + "wordtag_label": "助词", + "length": 1, + "termid": "助词_cb_的" + }, + { + "item": "电影", + "offset": 10, + "wordtag_label": "作品类_概念", + "length": 2, + "termid": "影视作品_cb_电影" + } + ] +} +``` + +将上述标注结果中的词类序列取出,去除虚词、标点等与语义无关的词,可将抽取出的词类直接构造成为挖掘匹配模板: + +``` +[作品类_实体][肯定词|是][人物类_实体][场景事件|执导][作品类_概念|电影] +``` + +利用该模板,以及结合TermTree进行概念扩展,可以匹配出所有该句式的文本,例如: + +> 《狂人日记》是鲁迅创作的第一个短篇白话日记体小说 +> +> 《澳门风云》是王晶创作执导的合家欢贺岁喜剧赌片 +> +> 《千王之王2000》是一部王晶于1999年执导的喜剧电影 +> +> 《射雕英雄传》是金庸创作的长篇武侠小说 + +WordTag的标注结果中,区分了“人物类\_实体”和“人物类\_概念”,以及“作品类\_实体”和“作品类\_概念”,使得模板生成更为精准。同时,TermTree中也区分了命名实体词(eb: entity base)与非实体词(cb: concept base),这样,可以利用TermTree分别进行实体扩展(e.g., 周星驰->王晶)和概念扩展(e.g., 电影->小说),生成更加丰富多样的模板,支持更细化的应用场景。 + +### 应用场景B:词类知识增强的深度学习模型 + +词类特征同时也是一类重要的文本特征,可为原始文本token提供有效的边界信息、归组信息,减少样本中的噪音,防止模型过拟合;还可作为层次泛化特征,弥补统计共现特征的不足。 + +在深度学习模型应用中,可将WordTag产出的词类作为embedding特征,直接叠加到文本token上,作为深度学习模型的输入;在BERT等模型中,也可以将词类作为文本序列中的一部分,利用position id和可见性矩阵控制token和词类特征之间的可见性,作为深度学习模型的输入。 + +### 应用场景C:知识图谱关联(term-linking) + 
+随着知识图谱技术的普及和越来越多应用知识图谱数据的发布,如何利用知识提升NLP任务效果,成为近年来NLP研究的热点方向。文本与图谱知识结合的前提是将图谱中的实体准确link到文本上,这是知识图谱应用的一大难点。现有的方案多是基于某个特定图谱实现的,缺乏通用的图谱关联解决方案。我们尝试使用“**WordTag+TermTree**”提供一套通用的图谱关联(term-linking)技术框架。 + +**NOTE:** 为了避免歧义,我们 **用term统一指代图谱收录的各类实体、概念、术语**。 + +为了能够适配应用中的不同实体集(例如,不同的企业有不同的人物实体集合,不同的小说站有不同的小说实体集合),我们将term-linking拆分为两个步骤: + +- 第一步是基于词类的linking,主要解决“同名概念词/实体词”、“不同类的同名词”消歧问题,这一步只使用文本本身特征和词类特征,不使用图谱中的实体属性值(SPO)知识,从而支持切换不同应用图谱; +- 第二步是同类同名实体词的linking,主要解决同类下不同属性值的实体消歧问题,这一步需要使用实体词的SPO知识(一般用于实体特征表示计算,以及文本-实体相似度计算)。 + +“WordTag+TermTree”的开源版提供了第一步的解决示例,第二步由于依赖于特定图谱的SPO知识,暂时无法提供通用工具,未来可能提供通用解决方案。 + +### 应用场景D:文本分类和文本挖掘样本优化 + +工业NLP应用场景中,文本分类、文本挖掘是最常见的任务。虽然,预训练语言模型的技术进步大幅提升了小样本学习的效果,但要达到理想的工业应用效果,还是需要大规模高精度监督训练样本。 + +人工标注可以产出高精度小规模训练样本。半监督学习等技术可以帮助用户基于人工标准样本快速扩充样本规模,但无法保证样本精度。这种情况下,可以使用“WordTag+TermTree”辅助筛选和修正样本,提升样本精度,例如: + +- 使用WordTag产出样本模板,再利用TermTree进行泛化约束,筛选出高置信度的样本,或者过滤不合格的样本; + +- 利用词类关系检测类别与样本的一致性,比如,医疗类文本与“疾病损伤、药物、医疗卫生机构”等词类相关,可以利用TermTree知识筛选出该类别高置信度的样本。 + +此外,统计模型容易拟合高频term,导致在低频term上泛化效果不好,这时可以利用TermTree筛选样本,提升样本平衡性,从而提升模型泛化能力。 + +## 后续计划 + +1. 发布百科知识树(TermTree)正式版数据,建立知识共建社区,支持用户提交应用词表/应用图谱 & 定制化TermTree, [TermTree下载链接](https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz); +2. 持续优化ERNIE-CTM预训练模型,支持多种参数规模模型发布,探索更好的适配中文解析挖掘任务的预训练模型; +3. 持续优化中文文本知识标注工具集,提供更加精准的知识标注服务;发布多粒度标注工具,支持更加丰富的应用场景。 + +## 在论文中引用解语 + +如果您的工作成果中使用了解语,请增加下述引用。我们非常乐于看到解语对您的工作带来帮助。 + +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + + +## 问题与反馈 + +解语在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png b/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png new file mode 100644 index 0000000000000000000000000000000000000000..f5ec1073b759d83cd1b9e4aa1eeb70c09e4b697f Binary files /dev/null and b/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png differ diff --git a/examples/text_to_knowledge/doc/img/ernie_ctm_model.png b/examples/text_to_knowledge/doc/img/ernie_ctm_model.png new file mode 100644 index 0000000000000000000000000000000000000000..2d886e91f593140f1d0888d8cd987b2d33800408 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/ernie_ctm_model.png differ diff --git a/examples/text_to_knowledge/doc/img/text_to_knowledge.png b/examples/text_to_knowledge/doc/img/text_to_knowledge.png new file mode 100644 index 0000000000000000000000000000000000000000..2a158a0b256db3246ec5f88b90636dc77c7a4083 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/text_to_knowledge.png differ diff --git a/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png b/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png new file mode 100644 index 0000000000000000000000000000000000000000..bf2e2212268bab759100e5365249e174653e0563 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png differ diff --git a/examples/text_to_knowledge/doc/img/wordtag_example.png b/examples/text_to_knowledge/doc/img/wordtag_example.png new file mode 100644 index 0000000000000000000000000000000000000000..b415962dda24c32c8b0583cd1b15693168543251 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/wordtag_example.png differ diff --git a/examples/text_to_knowledge/doc/img/wordtag_model.png b/examples/text_to_knowledge/doc/img/wordtag_model.png new file mode 100644 
index 0000000000000000000000000000000000000000..705e9b7d05a0eb59f9e8d3941e5a2ebcd3018f2c Binary files /dev/null and b/examples/text_to_knowledge/doc/img/wordtag_model.png differ diff --git a/examples/text_to_knowledge/ernie-ctm/README.md b/examples/text_to_knowledge/ernie-ctm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1e6a85355c606b20325233bb1efb94f0b84b7b89 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/README.md @@ -0,0 +1,167 @@ + +# 解语:ERNIE-CTM(ERNIE for **Chinese Text Mining**) + +ERNIE-CTM是适用于中文文本挖掘任务的预训练语言模型,拥有更全面的汉字字表集合,更优的中文文本挖掘任务表现,与PaddleNLP深度结合,提供更加便捷的应用实践。 + +## ERNIE-CTM特点 + +- 全面的中文汉字字表扩充 + - ERNIE-CTM的字符集包含2万+汉字,以及中文常用符号(常用标点、汉语拼音、编号)、部分外语符号(假名、单位)等,大幅减少中文解析挖掘任务中UNK(未识别字符)引发的标注问题。同时,ERNIE-CTM使用了embedding分解,可以更加灵活地扩充应用字表。 +- 更加适配中文文本挖掘任务 + - ERNIE-CTM中在每个表示后面添加了全局信息,在序列特征上叠加了全局的信息,使得在文本挖掘任务上有更加强力的表现。 +- 支持多种特征训练的模型结构 + - ERNIE-CTM的模型结构中,支持多种特征训练,用户可按照自己的需求任意添加任务及对应特征训练模型,而无需考虑任务之间的冲突所造成的灾难性遗忘。 + + + +## ERNIE-CTM模型介绍 + +### 模型结构 + +ERNIE-CTM的模型结构大体与BERT相同,都是双向transformer结构。区别是,ERNIE-CTM为能灵活扩充字表,采用了ALBERT的embedding分解,将embedding层分解为128维,参数列表如下: + +| 模型 | embedding size | hidden size | hidden layers | vocab size | +| -------------- | -------------- | ----------- | ------------- | ---------- | +| ERNIE-CTM-base | 128 | 768 | 12 | 23000 | + +ERNIE-CTM以字粒度建模,英文区分大小写,其输入表示如下: + +![ERNIE-CTM输入](../doc/img/ernie_ctm_inputs.png) + +其中,`[CLS{n}]`是ERNIE-CTM预留出的全局观察位,其中`n`从0开始计数,该全局观察位用于不同的训练任务,建模不同的语义特征,在下游任务中,可以结合使用,如使用attention筛选/融合特征,以达到更好的效果。而在灵活使用`[CLS{n}]`的时候,为中途增减任务token时不影响文本输入,所有的`[CLS{n}]`的位置编码均为0,且可以使用可见性矩阵(visible matrix)控制`[CLS{n}]`位置的特征对序列中其他位置,以及其他的全局观察位的可见性,以获得更加灵活、独立的特征表示。 + +本次开源的ERNIE-CTM-base模型中,使用了两个全局观察位`[CLS0]`和`[CLS1]`,具体作用见下文预训练任务介绍。 + +### 预训练任务 + +ERNIE-CTM使用的预训练任务为掩码语言模型(Masked Language Model,MLM)及ALBERT所使用的句子顺序预测(Sentence Order Prediction,SOP)。 + +其中`[CLS0]`用于训练SOP任务,训练方式如ALBERT中描述,正例为同一篇文章中的两个连续的句子,负例为用一篇文章中两个连续的句子顺序翻转。 + +`[CLS1]`做为全局的监督信号,应用于MLM任务中。训练MLM任务前,将`[CLS1]`特征表示拼接在所有的序列表示之后,通过线性层融合,成为最终的序列表示,之后预测MLM任务。所以,ERNIE-CTM最终输出的文本序列表示中,都融合了`[CLS1]`的特征表示。最终的序列表示中,带有全句的特征,一定程度可避免序列中全局特征捕捉不足,同时,`[CLS1]`最终的表示中也充分融合了句子内容的信息,弥补了SOP任务对文本主题信息捕捉不足的缺陷。 + +![ERNIE-CTM总体结构](../doc/img/ernie_ctm_model.png) + +### WordTag增量训练 + +在Ernie-Ctm微调任务中我们提供了一个基于[WordTag](../wordtag)的百科知识标注任务,该任务旨在解析中文词汇的知识标注,在该词性体系中覆盖了所有中文词汇的词类体系,包括各类实体词与非实体词(如概念、实体/专名、语法词等)。除了使用已有的WordTag工具对通用中文文本进行词类知识标注,WordTag同样支持用户使用自己的数据进行增量训练,下面是在WordTag模型上进行增量训练的具体示例流程。 + +#### 代码结构说明 + +```text +wordtag/ +├── data.py # 训练数据处理脚本 +├── metric.py # 模型效果验证指标脚本 +├── predict.py # 预测脚本 +├── README.md # 使用说明 +├── train.py # 训练脚本 +└── utils.py # 工具函数 +``` + +#### 数据准备 + +我们提供了少数样本用以示例增量训练。执行以下命令,下载并解压示例数据集: + +```bash +wget https://bj.bcebos.com/paddlenlp/datasets/wordtag_dataset_v3.tar.gz && tar -zxvf wordtag_dataset_v3.tar.gz +``` +解压之后 + +```text +data/ +├── dev.txt # 验证集 +├── tags.txt # WordTag标签集合 +└── train.json # 训练数据 +``` + +训练样本示例如下,每个单词以"/type"的形式标记其词性或实体类别,单词之间使用空格作为切分标记 + +```text +砚台/物体类 与/连词 笔/物体类 、/w 墨/物体类 、/w 纸/物体类 是/肯定词 中国/世界地区类 传统/修饰词 的/助词 文房四宝/词汇用语 。/w +《/w 全球化与中国:理论与发展趋势/作品类_实体 》/w 是/肯定词 2010年/时间类 经济管理出版社/组织机构类 出版/场景事件 的/助词 图书/作品类_概念 ,/w 作者/人物类_概念 是/肯定词 余永定/人物类_实体 、/w 路爱国/人物类_实体 、/w 高海红/人物类_实体 。/w +``` + +#### 模型训练 + +```shell +python -m paddle.distributed.launch --gpus "0" train.py \ + --max_seq_len 128 \ + --batch_size 32 \ + --learning_rate 5e-5 \ + --num_train_epochs 3 \ + --logging_steps 10 \ + --save_steps 100 \ + --output_dir ./output \ + --device "gpu" +``` + +其中参数释义如下: +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 
表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 + + + +### 模型预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --params_path ./output/model_300/model_state.pdparams \ + --batch_size 32 \ + --device "gpu" +``` + +## 自定义模型一键预测 + +Taskflow支持加载增量训练后的模型进行一键预测,通过`task_path`定义用户自定义路径即可。 + +文件组成: +```text +custom_task_path/ +├── model_state.pdparams +├── model_config.json +└── tags.txt +``` + +```python +from paddlenlp import Taskflow + +my_wordtag = Taskflow("knowledge_mining", task_path="./custom_task_path/") + +my_wordtag("美人鱼是周星驰执导的一部电影") +# [{'text': '美人鱼是周星驰执导的一部电影', 'items': [{'item': '美人鱼', 'offset': 0, 'wordtag_label': '作品类_实体', 'length': 3, 'termid': '作品与出版物_eb_美人鱼'}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '周星驰', 'offset': 4, 'wordtag_label': '人物类_实体', 'length': 3, 'termid': '人物_eb_周星驰'}, {'item': '执导', 'offset': 7, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_执导'}, {'item': '的', 'offset': 9, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '一部', 'offset': 10, 'wordtag_label': '数量词', 'length': 2}, {'item': '电影', 'offset': 12, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '影视作品_cb_电影'}]}] +``` + + +## ERNIE-CTM后续计划 + + +1. 提升预训练语料的多样性(开源版主要使用了百度百科语料),持续优化预训练模型 +2. 发布其他参数量的预训练模型(tiny、large等),便于不同场景应用 +3. 维护开源社区,探索模型优化方向,整合优秀idea + + + +## 在论文中引用ERNIE-CTM + +如果您的工作成果中使用了ERNIE-CTM,请增加下述引用。我们非常乐于看到ERNIE-CTM对您的工作带来帮助。 +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + + +## 问题与反馈 + +ERNIE-CTM在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/ernie-ctm/data_process.py b/examples/text_to_knowledge/ernie-ctm/data_process.py new file mode 100644 index 0000000000000000000000000000000000000000..f40243dabb7064db449266023dc6cd6eb8d1eca5 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/data_process.py @@ -0,0 +1,93 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
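+
+# Module overview: data helpers for WordTag incremental training.
+# read_custom_data reads the space-separated "词/词类" sample format and
+# transfer_str_to_example expands every word into character-level S-/B-/I-/E-
+# tags; convert_example then tokenizes the characters with ErnieCtmTokenizer
+# and pads the tag sequence with "O" for the leading summary observation tokens
+# and the trailing [SEP].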
+ +import paddle + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + vocab[line.strip()] = i + i += 1 + return vocab + + +def convert_example(example, tokenizer, max_seq_len, tags_to_idx=None, summary_num=2, is_test=False): + tokens = example["tokens"] + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token", max_seq_len=max_seq_len) + + if is_test: + return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] + + tags = example["tags"] + if len(tokenized_input["input_ids"]) - 1 - summary_num < len(tags): + tags = tags[: len(tokenized_input["input_ids"]) - 1 - summary_num] + # '[CLS]' and '[SEP]' will get label 'O' + tags = ["O"] * (summary_num) + tags + ["O"] + tags += ["O"] * (len(tokenized_input["input_ids"]) - len(tags)) + tokenized_input["tags"] = [tags_to_idx[x] for x in tags] + return ( + tokenized_input["input_ids"], + tokenized_input["token_type_ids"], + tokenized_input["seq_len"], + tokenized_input["tags"], + ) + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_custom_data(filename): + """Reads data""" + with open(filename, "r", encoding="utf-8") as f: + for line in f: + example = transfer_str_to_example(line.strip()) + yield example + + +def transfer_str_to_example(sample): + text = "" + tags = [] + items = sample.split(" ") + items = [item.rsplit("/", 1) for item in items] + for w, t in items: + text += w + if len(w) == 1: + tags.append(f"S-{t}") + else: + l = len(w) + for j in range(l): + if j == 0: + tags.append(f"B-{t}") + elif j == l - 1: + tags.append(f"E-{t}") + else: + tags.append(f"I-{t}") + res = { + "tokens": list(text), + "tags": tags, + } + return res diff --git a/examples/text_to_knowledge/ernie-ctm/metric.py b/examples/text_to_knowledge/ernie-ctm/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..4341bd743a2774dbbf7664e9c329ee371dfcea22 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/metric.py @@ -0,0 +1,219 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import List, Tuple + +import paddle + + +class SequenceAccuracy(paddle.metric.Metric): + """ + Masked language model pre-train task accuracy. 
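+
+    In this fine-tuning example it is reused as token-level tagging accuracy:
+    compute() keeps only the positions whose gold label differs from
+    ignore_index (train.py passes the id of the "O" tag), and accumulate()
+    returns the fraction of those positions predicted correctly.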
+ """ + + def __init__(self): + super(SequenceAccuracy, self).__init__() + self.correct_k = 0 + self.total = 0 + + def compute(self, pred, label, ignore_index): + pred = paddle.argmax(pred, 1) + active_acc = label.reshape([-1]) != ignore_index + active_pred = pred.masked_select(active_acc) + active_labels = label.masked_select(active_acc) + correct = active_pred.equal(active_labels) + return correct + + def update(self, correct): + self.correct_k += correct.cast("float32").sum(0) + self.total += correct.shape[0] + + def reset(self): + self.correct_k = 0 + self.total = 0 + + def accumulate(self): + return float(self.correct_k) / self.total + + def name(self): + return "Masked Language Model Accuracy" + + +def wordseg_hard_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-seg + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + p, q = 0, 0 + a_l, b_l = 0, 0 + acc = 0.0 + while q < len(list_b) and p < len(list_a): + a_r = a_l + len(list_a[p][0]) - 1 + b_r = b_l + len(list_b[q][0]) - 1 + if a_r < b_l: + p += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + q += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + acc += 1.0 + p += 1 + q += 1 + a_l = a_r + 1 + b_l = b_r + 1 + continue + p += 1 + return acc + + +def wordtag_hard_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-tag + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + p, q = 0, 0 + a_l, b_l = 0, 0 + acc = 0.0 + while q < len(list_b) and p < len(list_a): + a_r = a_l + len(list_a[p][0]) - 1 + b_r = b_l + len(list_b[q][0]) - 1 + if a_r < b_l: + p += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + q += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + if list_a[p][-1] == list_b[q][-1]: + acc += 1.0 + p += 1 + q += 1 + a_l, b_l = a_r + 1, b_r + 1 + continue + p += 1 + return acc + + +def wordtag_soft_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-tag + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + p, q = 0, 0 + a_l, b_l = 0, 0 + acc = 0.0 + while q < len(list_b) and p < len(list_a): + a_r = a_l + len(list_a[p][0]) - 1 + b_r = b_l + len(list_b[q][0]) - 1 + if a_r < b_l: + p += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + q += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + if list_a[p][-1] == list_b[q][-1]: + acc += 1.0 + elif list_b[q][-1].startswith(list_a[p][-1]): + acc += 1.0 + elif list_b[q] == "词汇用语": + acc += 1.0 + p += 1 + q += 1 + a_l, b_l = a_r + 1, b_r + 1 + continue + p += 1 + return acc + + +def wordseg_soft_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-seg + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + i, j = 0, 0 + acc = 0.0 + a_l, b_l = 0, 0 + while i < len(list_a) and j < len(list_b): + a_r = a_l + len(list_a[i][0]) - 1 + b_r = b_l + len(list_b[j][0]) - 1 + if a_r < b_l: + i += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + j += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + acc += 1.0 + a_l, b_l = a_r + 1, b_r + 1 + i, j = i + 1, j + 1 + continue + if a_l == b_l and a_r < b_r: + cnt = 0.0 + tmp_a_r = a_r + for k in range(i + 1, len(list_a)): + tmp_a_r += len(list_a[k]) + cnt += 1.0 + if 
tmp_a_r == b_r: + acc += cnt + i, j = k + 1, j + 1 + a_l, b_l = tmp_a_r + 1, b_r + 1 + break + i += 1 + continue + if a_l == b_l and a_r > b_r: + tmp_b_r = b_r + for k in range(j + 1, len(list_b)): + tmp_b_r += len(list_b[k]) + if tmp_b_r == a_r: + acc += 1.0 + i, j = i + 1, k + 1 + a_l, b_l = a_r + 1, tmp_b_r + 1 + break + j += 1 + continue + i += 1 + return acc diff --git a/examples/text_to_knowledge/ernie-ctm/predict.py b/examples/text_to_knowledge/ernie-ctm/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..8620f9c393a039372f32d8e2fec5aebc5cb39268 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/predict.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data_process import convert_example, load_dict +from utils import decode + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import ErnieCtmTokenizer, ErnieCtmWordtagModel + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default="./output/model_300/model_state.pdparams", required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") +parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', type=str, choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def do_predict(data, model, tokenizer, viterbi_decoder, tags_to_idx, idx_to_tags, batch_size=1, summary_num=2): + + examples = [] + for text in data: + example = {"tokens": list(text)} + input_ids, token_type_ids, seq_len = convert_example(example, tokenizer, args.max_seq_len, is_test=True) + + examples.append((input_ids, token_type_ids, seq_len)) + + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + all_pred_tags = [] + + model.eval() + for batch in batches: + input_ids, token_type_ids, seq_len = batchify_fn(batch) + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + seq_len = paddle.to_tensor(seq_len) + pred_tags = model(input_ids, token_type_ids, lengths=seq_len)[0] + all_pred_tags.extend(pred_tags.numpy().tolist()) + results = decode(data, all_pred_tags, summary_num, idx_to_tags) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + data = [ + "美人鱼是周星驰执导的一部电影", + ] + + tags_to_idx = load_dict(os.path.join(args.data_dir, "tags.txt")) + idx_to_tags = dict(zip(*(tags_to_idx.values(), tags_to_idx.keys()))) + + model = ErnieCtmWordtagModel.from_pretrained("wordtag", num_tag=len(tags_to_idx)) + tokenizer = ErnieCtmTokenizer.from_pretrained("wordtag") + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + results = do_predict( + data, model, tokenizer, model.viterbi_decoder, tags_to_idx, idx_to_tags, batch_size=args.batch_size + ) + print(results) diff --git a/examples/text_to_knowledge/ernie-ctm/train.py b/examples/text_to_knowledge/ernie-ctm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..a67b5d09486fa575f8b28a9d89a13fd6eb2568d0 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/train.py @@ -0,0 +1,203 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
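+
+# Module overview: fine-tuning script for the WordTag task described in the
+# README. It reads train.txt/dev.txt with read_custom_data, fine-tunes
+# ErnieCtmWordtagModel ("wordtag") with AdamW and LinearDecayWithWarmup (weight
+# decay is skipped for bias/norm parameters), logs loss and speed every
+# --logging_steps, and evaluates with SequenceAccuracy and saves a checkpoint
+# every --save_steps steps under --output_dir.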
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data_process import convert_example, create_dataloader, load_dict, read_custom_data +from metric import SequenceAccuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + ErnieCtmTokenizer, + ErnieCtmWordtagModel, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + + # yapf: disable + parser.add_argument("--data_dir", default="./data", type=str, help="The input data dir, should contain train.json.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") + parser.add_argument("--output_dir", default="./output", type=str, help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.", ) + parser.add_argument("--logging_steps", type=int, default=5, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.", ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps. 
If > 0: Override warmup_proportion") + parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over total steps.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--seed", default=1000, type=int, help="random seed for initialization") + parser.add_argument("--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu.") + # yapf: enable + + args = parser.parse_args() + return args + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, tags, tags_to_idx): + model.eval() + metric.reset() + losses = [] + for batch in data_loader(): + input_ids, token_type_ids, seq_len, tags = batch + loss, seq_logits = model(input_ids, token_type_ids, lengths=seq_len, tag_labels=tags)[:2] + loss = loss.mean() + losses.append(loss.numpy()) + + correct = metric.compute( + pred=seq_logits.reshape([-1, len(tags_to_idx)]), label=tags.reshape([-1]), ignore_index=tags_to_idx["O"] + ) + metric.update(correct) + acc = metric.accumulate() + logger.info("eval loss: %.5f, acc: %.5f" % (np.mean(losses), acc)) + model.train() + metric.reset() + + +def do_train(args): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_custom_data, filename=os.path.join(args.data_dir, "train.txt"), is_test=False, lazy=False + ) + dev_ds = load_dataset(read_custom_data, filename=os.path.join(args.data_dir, "dev.txt"), is_test=False, lazy=False) + tags_to_idx = load_dict(os.path.join(args.data_dir, "tags.txt")) + + tokenizer = ErnieCtmTokenizer.from_pretrained("wordtag") + model = ErnieCtmWordtagModel.from_pretrained("wordtag", num_labels=len(tags_to_idx)) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, tags_to_idx=tags_to_idx) + + def batchify_fn(samples): + fn = Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=tags_to_idx["O"], dtype="int64"), # tags + ) + return fn(samples) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.num_train_epochs + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + logger.info("Total 
steps: %s" % num_training_steps) + logger.info("WarmUp steps: %s" % warmup) + + metric = SequenceAccuracy() + + total_loss = 0 + global_step = 0 + + for epoch in range(1, args.num_train_epochs + 1): + logger.info(f"Epoch {epoch} beginnig") + start_time = time.time() + + for total_step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, seq_len, tags = batch + + loss = model(input_ids, token_type_ids, lengths=seq_len, tag_labels=tags)[0] + + loss = loss.mean() + total_loss += loss + loss.backward() + + optimizer.step() + optimizer.clear_grad() + lr_scheduler.step() + + if global_step % args.logging_steps == 0 and rank == 0: + end_time = time.time() + speed = float(args.logging_steps) / (end_time - start_time) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, total_loss / args.logging_steps, speed) + ) + start_time = time.time() + total_loss = 0 + + if (global_step % args.save_steps == 0 or global_step == num_training_steps) and rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + evaluate(model, metric, dev_data_loader, tags, tags_to_idx) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/text_to_knowledge/ernie-ctm/utils.py b/examples/text_to_knowledge/ernie-ctm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..293945bef20362d069e68125ce6318f6189bf455 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/utils.py @@ -0,0 +1,48 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
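+
+# NOTE: `decode` below turns per-token tag ids into WordTag items. The first
+# `summary_num` positions (the summary tokens, e.g. [CLS0]/[CLS1]) and the final
+# [SEP] position are skipped. Labels look like "B-XXX"/"I-XXX"/"E-XXX"/"S-XXX" or
+# "O": "S"/"O" emit a single-character item, "B"/"I" accumulate characters, and
+# "E" closes the accumulated span. `reset_offset` then recomputes character
+# offsets and lengths of the emitted items.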
+
+
+def reset_offset(pred_words):
+    for i in range(0, len(pred_words)):
+        if i > 0:
+            pred_words[i]["offset"] = pred_words[i - 1]["offset"] + len(pred_words[i - 1]["item"])
+        pred_words[i]["length"] = len(pred_words[i]["item"])
+    return pred_words
+
+
+def decode(texts, all_pred_tags, summary_num, idx_to_tags):
+    batch_results = []
+    for i, pred_tags in enumerate(all_pred_tags):
+        pred_words, pred_word = [], []
+
+        for j, tag in enumerate(pred_tags[summary_num:-1]):
+            if j >= len(texts[i]):
+                break
+            pred_label = idx_to_tags[tag]
+            if pred_label.find("-") != -1:
+                _, label = pred_label.split("-")
+            else:
+                label = pred_label
+            if pred_label.startswith("S") or pred_label.startswith("O"):
+                pred_words.append({"item": texts[i][j], "offset": 0, "wordtag_label": label})
+            else:
+                pred_word.append(texts[i][j])
+                if pred_label.startswith("E"):
+                    pred_words.append({"item": "".join(pred_word), "offset": 0, "wordtag_label": label})
+                    del pred_word[:]
+
+        pred_words = reset_offset(pred_words)
+        result = {"text": texts[i], "items": pred_words}
+        batch_results.append(result)
+    return batch_results
diff --git a/examples/text_to_knowledge/nptag/README.md b/examples/text_to_knowledge/nptag/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..aece4d724dfca330fc2b19c3cdc605f27ac23523
--- /dev/null
+++ b/examples/text_to_knowledge/nptag/README.md
@@ -0,0 +1,160 @@
+# 解语:NPTag(名词短语标注工具)
+
+NPTag(名词短语标注工具)是首个能够覆盖所有中文名词性词汇及短语的细粒度知识标注工具,旨在解决NLP中名词性短语收录不足导致的OOV(out-of-vocabulary,超出收录词表)问题,可直接应用于构造知识特征,辅助NLP任务。
+
+## NPTag特点
+
+- 包含2000+细粒度类别,覆盖所有中文名词性短语的词类体系,更丰富的知识标注结果
+  - NPTag使用的词类体系为覆盖所有中文名词性短语的词类体系,对所有类目做了更细类目的识别(如注射剂、鱼类、博物馆等),共包含2000+细粒度类别,且可以直接关联百科知识树。
+- 可自由定制的分类框架
+  - NPTag开源版标注使用的词类体系是我们在实践中对**百科词条**分类应用较好的一个版本,用户可以自由定制自己的词类体系和训练样本,构建自己的NPTag,以获得更好的适配效果。例如,可按照自定义的类别构造训练样本,使用小学习率、短训练周期微调NPTag模型,即可获得自己定制的NPTag工具。
+
+## NPTag模型介绍
+
+NPTag使用[ERNIE-CTM](../ernie-ctm)+prompt训练而成,使用启发式搜索解码,保证分类结果都在标签体系之内。
+
+### finetune任务
+
+在微调任务中提供了一个中文名词短语标注的任务,旨在对中文名词短语进行细粒度分类。
+
+#### 代码结构说明
+
+```text
+nptag/
+├── deploy # 部署
+│   └── python
+│       └── predict.py # python预测部署示例
+├── data.py # 训练数据处理脚本
+├── export_model.py # 模型导出脚本
+├── metric.py # 模型效果验证指标脚本
+├── predict.py # 预测脚本
+├── README.md # 使用说明
+├── train.py # 训练脚本
+└── utils.py # 工具函数
+```
+
+#### 数据准备
+
+执行以下命令,下载并解压示例数据集:
+
+```bash
+wget https://bj.bcebos.com/paddlenlp/paddlenlp/datasets/nptag_dataset.tar.gz && tar -zxvf nptag_dataset.tar.gz
+```
+
+解压之后
+```text
+data/
+├── name_category_map.json # NPTag标签文件
+├── dev.txt # 验证集
+└── train.txt # 训练集
+```
+
+数据集`train.txt`和`dev.txt`格式示例(text VS label)
+```
+石竹 植物
+杂链聚合物 化学物质
+罗伯特·布雷森 人
+```
+
+标签文件`name_category_map.json`格式示例,其中key为细粒度标签,即NPTag的预测结果;value为粗粒度标签,示例中对应WordTag的标签集合,用户可以根据场景需要自定义修改该标签映射
+```
+{
+    "植物": "生物类_植物",
+    "化学物质": "物体类_化学物质",
+    "人": "人物类_实体"
+}
+```
+
+#### 模型训练
+```bash
+python -m paddle.distributed.launch --gpus "0" train.py \
+    --batch_size 64 \
+    --learning_rate 1e-6 \
+    --num_train_epochs 3 \
+    --logging_steps 10 \
+    --save_steps 100 \
+    --output_dir ./output \
+    --device "gpu"
+```
+
+可支持配置的参数:
+- `data_dir`: 数据集文件路径,默认数据集存放在当前目录data文件夹下。
+- `init_from_ckpt`: 模型参数路径,热启动模型训练,默认为None。
+- `output_dir`: 模型保存路径,默认保存在当前目录的output文件夹下。
+- `max_seq_len`: 模型使用的最大序列长度,默认为64。
+- `learning_rate`: finetune的最大学习率;默认为1e-6。
+- `num_train_epochs`: 表示训练轮数,默认为3。
+- `logging_steps`: 日志打印步数间隔,默认为10。
+- `save_steps`: 模型保存的步数间隔,默认为100。
+- `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为64。
+- `weight_decay`: 控制正则项力度的参数,用于防止过拟合,默认为0.0。
+- `warmup_proportion`: 
学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +- `adam_epsilon`: Adam优化器的参数,避免分母为零,默认为1e-8。 +- `seed`: 随机种子,默认为1000。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 + +### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --params_path ./output/model_100/model_state.pdparams +``` + +### 基于静态图的预测部署 + +使用动态图训练结束之后,可以将动态图参数导出成静态图参数,从而获得最优的预测部署性能,执行如下命令完成动态图转换静态图的功能: +```shell +python export_model.py --params_path=./output/model_100/model_state.pdparams --output_path=./export +``` + +导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: +```shell +python deploy/python/predict.py --model_dir=./export +``` + +## Taskflow一键预测 + +除了以上的finetune示例,Taskflow内置了一个百度基于大规模标注汉语短语数据集训练的名词短语标注工具`NPTag`。用户可以方便地使用该工具完成对中文名词短语的一键预测。 + +```python +from paddlenlp import Taskflow + +nptag = Taskflow("knowledge_mining", model="nptag") +nptag("糖醋排骨") +''' +[{'text': '糖醋排骨', 'label': '菜品'}] +''' + +nptag(["糖醋排骨", "红曲霉菌"]) +''' +[{'text': '糖醋排骨', 'label': '菜品'}, {'text': '红曲霉菌', 'label': '微生物'}] +''' + +# 输出粗粒度类别标签`category`,即WordTag的词汇标签。 +nptag = Taskflow("knowledge_mining", model="nptag", linking=True) +nptag(["糖醋排骨", "红曲霉菌"]) + +''' +[{'text': '糖醋排骨', 'label': '菜品', 'category': '饮食类_菜品'}, {'text': '红曲霉菌', 'label': '微生物', 'category': '生物类_微生物'}] +''' +``` + +## 在论文中引用NPTag + +如果您的工作成果中使用了NPTag,请增加下述引用。我们非常乐于看到解语对您的工作带来帮助。 + +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + +## 问题与反馈 + +解语在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/nptag/data.py b/examples/text_to_knowledge/nptag/data.py new file mode 100644 index 0000000000000000000000000000000000000000..4d482fe69f8ec8a310abcb382297241efbdb3b8b --- /dev/null +++ b/examples/text_to_knowledge/nptag/data.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def convert_example(example, tokenzier, max_seq_len=512, max_cls_len=5, summary_num=2, is_test=False): + """ + Builds model inputs from a sequence for noun phrase classification task. + A prompt template is added to the end of the sequence. + + Prompt template: + + - ``[是] + [MASK] * max_cls_len`` + + Model input example: + + - ``[CLS0][CLS1] X [是][MASK]...[MASK][SEP]`` + + where X is the input text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. 
+ max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + max_cls_len(obj:`int`): The maximum length of labels. + summary_num(obj:`int`): The number of summary tokens, e.g. `[CLS0]` and `[CLS1]`. + is_test(obj:`bool`): If True, it will not return the label. + + """ + + if len(example["text"]) + max_cls_len + 1 + summary_num + 1 > max_seq_len: + example["text"] = example["text"][: (max_seq_len - (max_cls_len + 1 + summary_num + 1))] + + tokens = list(example["text"]) + ["是"] + ["[MASK]"] * max_cls_len + inputs = tokenzier(tokens, return_length=True, is_split_into_words="token", max_length=max_seq_len) + + label_indices = list(range(inputs["seq_len"] - 1 - max_cls_len, inputs["seq_len"] - 1)) + + if is_test: + return inputs["input_ids"], inputs["token_type_ids"], label_indices + + label_tokens = list(example["label"]) + ["[PAD]"] * (max_cls_len - len(example["label"])) + labels = np.full([inputs["seq_len"]], fill_value=-100, dtype=np.int64) + labels[label_indices] = tokenzier.convert_tokens_to_ids(label_tokens) + return inputs["input_ids"], inputs["token_type_ids"], labels + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_custom_data(filename): + """Reads data""" + with open(filename, "r", encoding="utf-8") as f: + for line in f: + text, label = line.strip().split("\t") + yield {"text": text, "label": label} diff --git a/examples/text_to_knowledge/nptag/deploy/python/predict.py b/examples/text_to_knowledge/nptag/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5bc505639132f257fab9ee835ecbb186d74b00f1 --- /dev/null +++ b/examples/text_to_knowledge/nptag/deploy/python/predict.py @@ -0,0 +1,150 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
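+
+# NOTE: This deployment script runs the static-graph model exported by
+# export_model.py: it loads `inference.pdmodel` / `inference.pdiparams` from
+# --model_dir with paddle.inference, and applies the same prompt decoding
+# (top-k search over label tokens plus BK-tree fallback) as predict.py.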
+ +import argparse +import os +import sys + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import ErnieCtmTokenizer + +sys.path.append("./") + +from data import convert_example # noqa: E402 +from utils import construct_dict_map, decode, find_topk, search # noqa: E402 + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, default="./export/", help="The directory to static model.") +parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") +parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=3, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +class Predictor(object): + def __init__(self, model_dir, device): + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + # Disable IR optimization for NPTag + config.switch_ir_optim(False) + + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + examples = [] + for text in data: + example = {"text": text} + input_ids, token_type_ids, label_indices = convert_example( + example, tokenizer, max_seq_len=args.max_seq_len, is_test=True + ) + examples.append((input_ids, token_type_ids, label_indices)) + + batches = [examples[idx : idx + args.batch_size] for idx in range(0, len(examples), args.batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # label_indices + ): fn(samples) + + name_dict, bk_tree, id_vocabs, vocab_ids = construct_dict_map( + tokenizer, os.path.join(args.data_dir, "name_category_map.json") + ) + + all_scores_can = [] + all_preds_can = [] + pred_ids = [] + + for batch in batches: + input_ids, token_type_ids, label_indices = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(token_type_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + + for i, l in zip(label_indices, logits): + score = l[i[0] : i[-1] + 1, vocab_ids] + # Find topk candidates of scores and predicted indices. 
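+                # `score` holds the logits of the [MASK] positions restricted to the
+                # candidate label-token vocabulary (`vocab_ids`); the top-4 tokens per
+                # position are kept and later combined by `search` / BK-tree fallback.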
+ score_can, pred_id_can = find_topk(score, k=4, axis=-1) + + all_scores_can.extend([score_can.tolist()]) + all_preds_can.extend([pred_id_can.tolist()]) + pred_ids.extend([pred_id_can[:, 0].tolist()]) + + results = [] + for i, d in enumerate(data): + label = decode(pred_ids[i], id_vocabs) + result = { + "text": d, + "label": label, + } + if label not in name_dict: + scores_can = all_scores_can[i] + pred_ids_can = all_preds_can[i] + labels_can = search(scores_can, pred_ids_can, 0, [], 0) + labels_can.sort(key=lambda d: -d[1]) + for labels in labels_can: + cls_label_can = decode(labels[0], id_vocabs) + if cls_label_can in name_dict: + result["label"] = cls_label_can + break + else: + labels_can = bk_tree.search_similar_word(label) + result["label"] = labels_can[0][0] + + result["category"] = name_dict[result["label"]] + results.append(result) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device) + + tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") + + data = [ + "刘德华", + "快乐薯片", + "自适应共振理论映射", + ] + + results = predictor.predict(data, tokenizer) + print(results) diff --git a/examples/text_to_knowledge/nptag/export_model.py b/examples/text_to_knowledge/nptag/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..7956b003226053a8cb6baade0021ddb27f666b38 --- /dev/null +++ b/examples/text_to_knowledge/nptag/export_model.py @@ -0,0 +1,47 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from paddlenlp.transformers import ErnieCtmNptagModel + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./output/model_100/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + model = ErnieCtmNptagModel.from_pretrained("nptag") + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + ], + ) + # Save in static graph model. 
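+    # paddle.jit.save writes `inference.pdmodel` / `inference.pdiparams` under
+    # --output_path, which deploy/python/predict.py consumes via --model_dir.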
+ save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/text_to_knowledge/nptag/metric.py b/examples/text_to_knowledge/nptag/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..8e9ceaf02aeed7e622f2926fff0f1f8bb8d19586 --- /dev/null +++ b/examples/text_to_knowledge/nptag/metric.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + + +class NPTagAccuracy(paddle.metric.Metric): + """ + Accuracy for NPTag Prompt Model. + """ + + def __init__(self): + super(NPTagAccuracy, self).__init__() + self.reset() + + def reset(self): + self.corrects = 0 + self.total = 0 + + def compute(self, preds, labels): + correct = [] + for pred, label in zip(preds, labels): + real_pred, real_label = ([] for _ in range(2)) + for i in range(len(label)): + if label[i] == -100 or label[i] == 0: + continue + real_pred.append(pred[i]) + real_label.append(label[i]) + + if all(real_pred[i] == real_label[i] for i in range(len(real_label))): + correct.append(1) + else: + correct.append(0) + return correct + + def update(self, correct): + self.corrects += sum(correct) + self.total += len(correct) + + def accumulate(self): + return float(self.corrects) / self.total + + def name(self): + return "NPTag Prompt Model Accuracy" diff --git a/examples/text_to_knowledge/nptag/predict.py b/examples/text_to_knowledge/nptag/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..ea791b59a4bf6f06e017104fb2b9b595d088ad13 --- /dev/null +++ b/examples/text_to_knowledge/nptag/predict.py @@ -0,0 +1,125 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data import convert_example +from utils import construct_dict_map, decode, find_topk, search + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import ErnieCtmNptagModel, ErnieCtmTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default="./output/model_100/model_state.pdparams", required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") +parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', type=str, choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def do_predict(data, model, tokenizer, batch_size=1, max_cls_len=5, summary_num=2): + examples = [] + for text in data: + example = {"text": text} + input_ids, token_type_ids, label_indices = convert_example( + example, tokenizer, max_seq_len=args.max_seq_len, is_test=True + ) + examples.append((input_ids, token_type_ids, label_indices)) + + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # label_indices + ): fn(samples) + + name_dict, bk_tree, id_vocabs, vocab_ids = construct_dict_map( + tokenizer, os.path.join(args.data_dir, "name_category_map.json") + ) + + all_scores_can = [] + all_preds_can = [] + pred_ids = [] + + model.eval() + for batch in batches: + input_ids, token_type_ids, label_indices = batchify_fn(batch) + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + logits = model(input_ids, token_type_ids)[0].numpy() + for i, l in zip(label_indices, logits): + score = l[i[0] : i[-1] + 1, vocab_ids] + # Find topk candidates of scores and predicted indices. + score_can, pred_id_can = find_topk(score, k=4, axis=-1) + + all_scores_can.extend([score_can.tolist()]) + all_preds_can.extend([pred_id_can.tolist()]) + pred_ids.extend([pred_id_can[:, 0].tolist()]) + + results = [] + for i, d in enumerate(data): + label = decode(pred_ids[i], id_vocabs) + + result = { + "text": d, + "label": label, + } + + if label not in name_dict: + scores_can = all_scores_can[i] + pred_ids_can = all_preds_can[i] + labels_can = search(scores_can, pred_ids_can, 0, [], 0) + labels_can.sort(key=lambda d: -d[1]) + for labels in labels_can: + cls_label_can = decode(labels[0], id_vocabs) + if cls_label_can in name_dict: + result["label"] = cls_label_can + break + else: + labels_can = bk_tree.search_similar_word(label) + if len(labels_can) != 0: + result["label"] = labels_can[0][0] + + if result["label"] in name_dict: + result["category"] = name_dict[result["label"]] + results.append(result) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + data = [ + "刘德华", + "快乐薯片", + "自适应共振理论映射", + ] + + model = ErnieCtmNptagModel.from_pretrained("nptag") + tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + results = do_predict(data, model, tokenizer, batch_size=args.batch_size) + print(results) diff --git a/examples/text_to_knowledge/nptag/train.py b/examples/text_to_knowledge/nptag/train.py new file mode 100644 index 0000000000000000000000000000000000000000..d76c809d6010bb7f53607b65208a17e4d7dd239a --- /dev/null +++ b/examples/text_to_knowledge/nptag/train.py @@ -0,0 +1,191 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_custom_data +from metric import NPTagAccuracy + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + ErnieCtmNptagModel, + ErnieCtmTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + + # yapf: disable + parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain train.json and dev.json.") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument("--output_dir", type=str, default="./output", help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", ) + parser.add_argument("--learning_rate", type=float, default=1e-6, help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.", ) + parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.", ) + parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay if we apply some.") + parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Linear warmup proportion over total steps.") + parser.add_argument("--adam_epsilon", type=float, default=1e-8, help="Epsilon for Adam optimizer.") + parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + # yapf: enable + + args = parser.parse_args() + return args + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, criterion, data_loader, vocab_size): + model.eval() + metric.reset() + losses = [] + for batch in data_loader(): + input_ids, token_type_ids, labels = batch + outputs = model(input_ids, token_type_ids) + logits = outputs[0] + loss = criterion(logits.reshape([-1, vocab_size]), labels.reshape([-1])) + losses.append(loss.numpy()) + probs = F.softmax(logits, axis=-1) + preds = paddle.argmax(probs, axis=-1).numpy() + correct = metric.compute(preds, labels) + 
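+        # NPTagAccuracy.compute returns a 0/1 list per example; positions whose
+        # label is -100 (non-[MASK] positions) or 0 are skipped (see metric.py).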
metric.update(correct) + acc = metric.accumulate() + logger.info("eval loss: %.5f, acc: %.5f" % (np.mean(losses), acc)) + model.train() + metric.reset() + + +def do_train(args): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_custom_data, filename=os.path.join(args.data_dir, "train.txt"), is_test=False, lazy=False + ) + dev_ds = load_dataset(read_custom_data, filename=os.path.join(args.data_dir, "dev.txt"), is_test=False, lazy=False) + + tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") + model = ErnieCtmNptagModel.from_pretrained("nptag") + vocab_size = model.ernie_ctm.config["vocab_size"] + + trans_func = partial(convert_example, tokenzier=tokenizer, max_seq_len=args.max_seq_len) + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Pad(axis=0, pad_val=-100, dtype="int64"), # labels + ): fn(samples) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + num_training_steps = len(train_data_loader) * args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + logger.info("Total steps: %s" % num_training_steps) + + metric = NPTagAccuracy() + criterion = paddle.nn.CrossEntropyLoss() + + global_step = 0 + for epoch in range(1, args.num_train_epochs + 1): + logger.info(f"Epoch {epoch} beginnig") + start_time = time.time() + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, labels = batch + outputs = model(input_ids, token_type_ids) + logits = outputs[0] + loss = criterion(logits.reshape([-1, vocab_size]), labels.reshape([-1])) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + lr_scheduler.step() + + if global_step % args.logging_steps == 0 and rank == 0: + end_time = time.time() + speed = float(args.logging_steps) / (end_time - start_time) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss.item(), speed) + ) + start_time = time.time() + + if (global_step % args.save_steps == 0 or global_step == num_training_steps) and rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model._layers.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + evaluate(model, metric, criterion, dev_data_loader, vocab_size) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, 
value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/text_to_knowledge/nptag/utils.py b/examples/text_to_knowledge/nptag/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ad2e233e322345a82ff72b0d99cf2ccaed21a2f6 --- /dev/null +++ b/examples/text_to_knowledge/nptag/utils.py @@ -0,0 +1,195 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from collections import OrderedDict +from typing import List + +import numpy as np + + +def construct_dict_map(tokenizer, name_dict_path): + """Construct dict map""" + with open(name_dict_path, encoding="utf-8") as fp: + name_dict = json.load(fp) + cls_vocabs = OrderedDict() + bk_tree = BurkhardKellerTree() + for k in name_dict: + bk_tree.add(k) + for c in k: + if c not in cls_vocabs: + cls_vocabs[c] = len(cls_vocabs) + cls_vocabs["[PAD]"] = len(cls_vocabs) + id_vocabs = dict(zip(cls_vocabs.values(), cls_vocabs.keys())) + vocab_ids = tokenizer.vocab.to_indices(list(cls_vocabs.keys())) + return name_dict, bk_tree, id_vocabs, vocab_ids + + +def decode(pred_ids, id_vocabs): + tokens = [id_vocabs[i] for i in pred_ids] + valid_token = [] + for token in tokens: + if token == "[PAD]": + break + valid_token.append(token) + return "".join(valid_token) + + +def search(scores_can, pred_ids_can, depth, path, score): + if depth >= 5: + return [(path, score)] + res = [] + for i in range(len(pred_ids_can[0])): + tmp_res = search( + scores_can, pred_ids_can, depth + 1, path + [pred_ids_can[depth][i]], score + scores_can[depth][i] + ) + res.extend(tmp_res) + return res + + +def find_topk(a, k, axis=-1, largest=True, sorted=True): + if axis is None: + axis_size = a.size + else: + axis_size = a.shape[axis] + assert 1 <= k <= axis_size + + a = np.asanyarray(a) + if largest: + index_array = np.argpartition(a, axis_size - k, axis=axis) + topk_indices = np.take(index_array, -np.arange(k) - 1, axis=axis) + else: + index_array = np.argpartition(a, k - 1, axis=axis) + topk_indices = np.take(index_array, np.arange(k), axis=axis) + topk_values = np.take_along_axis(a, topk_indices, axis=axis) + if sorted: + sorted_indices_in_topk = np.argsort(topk_values, axis=axis) + if largest: + sorted_indices_in_topk = np.flip(sorted_indices_in_topk, axis=axis) + sorted_topk_values = np.take_along_axis(topk_values, sorted_indices_in_topk, axis=axis) + sorted_topk_indices = np.take_along_axis(topk_indices, sorted_indices_in_topk, axis=axis) + return sorted_topk_values, sorted_topk_indices + return topk_values, topk_indices + + +def levenstein_distance(s1: str, s2: str) -> int: + """Calculate minimal Levenstein distance between s1 and s2. + + Args: + s1 (str): string + s2 (str): string + + Returns: + int: the minimal distance. 
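+
+    Example:
+        >>> levenstein_distance("kitten", "sitting")
+        3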
+ """ + m, n = len(s1) + 1, len(s2) + 1 + + # Initialize + dp = [[0] * n for i in range(m)] + dp[0][0] = 0 + for i in range(1, m): + dp[i][0] = dp[i - 1][0] + 1 + for j in range(1, n): + dp[0][j] = dp[0][j - 1] + 1 + + for i in range(1, m): + for j in range(1, n): + if s1[i - 1] != s2[j - 1]: + dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1 + else: + dp[i][j] = dp[i - 1][j - 1] + return dp[m - 1][n - 1] + + +class BurkhardKellerNode(object): + """Node implementatation for BK-Tree. A BK-Tree node stores the information of current word, and its approximate words calculated by levenstein distance. + + Args: + word (str): word of current node. + """ + + def __init__(self, word: str): + self.word = word + self.next = {} + + +class BurkhardKellerTree(object): + """Implementataion of BK-Tree""" + + def __init__(self): + self.root = None + self.nodes = {} + + def __add(self, cur_node: BurkhardKellerNode, word: str): + """Insert a word into current tree. If tree is empty, set this word to root. + + Args: + word (str): word to be inserted. + """ + if self.root is None: + self.root = BurkhardKellerNode(word) + return + if word in self.nodes: + return + dist = levenstein_distance(word, cur_node.word) + if dist not in cur_node.next: + self.nodes[word] = cur_node.next[dist] = BurkhardKellerNode(word) + else: + self.__add(cur_node.next[dist], word) + + def add(self, word: str): + """Insert a word into current tree. If tree is empty, set this word to root. + + Args: + word (str): word to be inserted. + """ + return self.__add(self.root, word) + + def __search_similar_word(self, cur_node: BurkhardKellerNode, s: str, threshold: int = 2) -> List[str]: + res = [] + if cur_node is None: + return res + dist = levenstein_distance(cur_node.word, s) + if dist <= threshold: + res.append((cur_node.word, dist)) + start = max(dist - threshold, 1) + while start < dist + threshold: + tmp_res = self.__search_similar_word(cur_node.next.get(start, None), s)[:] + res.extend(tmp_res) + start += 1 + return res + + def search_similar_word(self, word: str) -> List[str]: + """Search the most similar (minimal levenstain distance) word between `s`. + + Args: + s (str): target word + + Returns: + List[str]: similar words. + """ + res = self.__search_similar_word(self.root, word) + + def max_prefix(s1: str, s2: str) -> int: + res = 0 + length = min(len(s1), len(s2)) + for i in range(length): + if s1[i] == s2[i]: + res += 1 + else: + break + return res + + res.sort(key=lambda d: (d[1], -max_prefix(d[0], word))) + return res diff --git a/examples/text_to_knowledge/termtree/README.md b/examples/text_to_knowledge/termtree/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7fd126b43fed7ce3281ded192c94b09662fc0bf2 --- /dev/null +++ b/examples/text_to_knowledge/termtree/README.md @@ -0,0 +1,271 @@ +# 解语:TermTree(百科知识树) +TermTree(百科知识树)是一个描述所有中文词汇(包括概念、实体/专名、领域术语、语法词等,统一称之为Term)的树状知识库,完整的TermTree由两部分构成: + +> I. TermType词类体系:覆盖所有中文词汇词类的树状知识体系,是对中文词汇集合的一种全划分层次表示; +> +> II. Term关系和属性值:描述具体Term之间关系和Term属性值网状图谱,用于整合各应用知识图谱; + +本次发布的TermTreeV1.0试用版是TermTree的一个常用子集,包括两部分内容: + +> A. 简化版的TermType词类体系,由160+ termtype(三层结构)和 7000+ subtype构成。 +> +> B. 
约100w的term集(挂接在TermType词类体系下),包括大多数常用概念(src=cb,基础概念库,termtype准确率为98%)和一部分高频百科实体(src=eb,基础实体库,termtype准确率为95%)。 +> +> 开源版不包括Term关系和属性值,但给出了实体的百科词条链接,应用方可以利用百科链接整合其他知识图谱使用。 + +我们提供了TermTreeV1.0试用版的下载链接供大家使用,[下载链接](https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz) 。 + +**注:** 与其他常见应用知识图谱不同,TermTree的核心是概念词,而非专名实体词。因为,在中文文本中,概念词的含义是相对稳定的,而专名实体词随应用变化(例如,不同电商有不同的商品实体集,不同的小说站有不同的小说实体集),因此,TermTree通过 “提供常用概念集 + 可插拔的应用实体集/应用知识图谱” 来达到支持不同的应用适配。 + +## 自定义TermTree + +`termtree.py`文件中的TermTree类支持TermTree的加载、增加、保存操作,因为涉及到数据结构整体性和一致性,暂不支持删除和修改操作。下面提供了离线维护自定义TermTree的代码示例 + +### 文件准备 + +首先下载已有的TermTreeV1.0 +```shell +wget https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz && tar -zxvf TermTree.V1.0.tar.gz +``` + +### TermTree维护与修改 + +加载TermTreeV1.0,增加新的term +```python +from termtree import TermTree + +# 加载百科知识树 +termtree = TermTree.from_dir("termtree_type.csv", "TermTree.V1.0") + +# 增加新term: 平原上的火焰 +termtree.add_term(term="平原上的火焰", + base="eb", + term_type="影视作品") + +# 保存修改, 执行后将在当前路径生成文件`termtree_data`,即新的自定义TermTree +termtree.save("./") +``` + +#### API说明 + +- ```python + def add_term() + ``` + +- **参数** + - term (str): 待增加的term名称。 + - base (str): term属于概念词(cb)还是实体词(eb)。 + - term_type (str): term的主类别。 + - sub_type (Optional[List[str]], optional): term的辅助类别或细分类别,非必选。 + - sub_terms (Optional[List[str]], optional): 用于描述同类同名的term集,非必选。 + - alias (Optional[List[str]], optional): term的常用别名,非必选。 + - alias_ext (Optional[List[str]], optional): term的常用扩展别名,非必选。 + - data (Optional[Dict[str, Any]], optional): 以dict形式构造该term节点,非必选。 + + +### 自定义Term-Linking + +Taskflow支持使用自定义TermTree实现自定义Term-Linking,该示例中"平原上的火焰"的Term-Linking如下: +作品类_实体(wordtag_label) -> 影视作品_eb_平原上的火焰(term_id) + +通过`task_path`定义用户自定义路径,文件组成: +```text +custom_task_path/ +├── termtree_type.csv +└── termtree_data +``` + +使用Taskflow加载自定义TermTree来进行预测: + +```python +from paddlenlp import Taskflow + +wordtag = Taskflow("knowledge_mining", task_path="./custom_task_path/") + +wordtag("《平原上的火焰》是今年新上映的电影") +# [{'text': '《平原上的火焰》是今年新上映的电影', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '平原上的火焰', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 6, 'termid': '影视作品_eb_平原上的火焰'}, {'item': '》', 'offset': 7, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 8, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '今年', 'offset': 9, 'wordtag_label': '时间类', 'length': 2, 'termid': '时间阶段_cb_今年'}, {'item': '新', 'offset': 11, 'wordtag_label': '修饰词', 'length': 1, 'termid': '修饰词_cb_新'}, {'item': '上映', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_上映'}, {'item': '的', 'offset': 14, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '电影', 'offset': 15, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '影视作品_cb_电影'}]}] +``` + +## 常见问题 + +**常见问题1:为什么TermTree采用树状结构(Tree),而不是网状结构(Net/Graph)?** + +- 树结构是对知识空间的全划分,网状结构是对相关关系的描述和提炼。树结构更方便做到对词类体系的全面描述。 + +- 树结构适合概念层次的泛化推理,网状结构适合相关性的泛化推理。树结构的知识对统计相关知识有很好的互补作用,在应用中能够更好地弥补统计模型的不足。 +- 两者可以结合表示和使用:Term集合整体以树结构组织(TermType词类体系),Term间的关系用网状结构描述(Term关系和属性值)。可以将TermTree视为中文词汇的层次描述框架,应用知识图谱可以基于TermType词类体系方便地整合到TermTree。 + +**常见问题2:为什么TermTree叫做百科知识树?是否只能用于描述百科知识?** + +- 一方面,Term可以泛指任意概念、实体/专名、领域术语、语法词等,用“百科”是为了表达Term的多样性,而不是限定Term的来源,Term可以来自任意中文文本; +- 另一方面,各类别的词汇都可以在百科词条中找到样例,用“百科”也是为了表示对所有中文词汇词类的描述能力。 + +**常见问题3:中文词汇词类描述体系有很多,为什么采用这个体系?** + +- TermTree的词类体系是在大规模工业应用实践(如百科文本解析挖掘、query理解)中打磨出来的中文词类体系,在理论上可能不是一个完备体系,但很适合通用领域中文解析挖掘任务。 + + +## TermTree字段说明 + +| 字段 | 说明 | 备注 | +| ------------ | 
------------------------------------------------------------ | ------------------------------------------------------------ | +| id | 【必有】唯一标识符 | 可基于termid生成 | +| term | 【必有】term的名字 | | +| termid | 【必有】term的id(唯一),构造方式为termtype_src_term | 采用显式构造id的方式,便于应用数据扩展和整合 | +| src | 【必有】term的来源库,当前包括两个基础库cb和eb。其中cb为基础概念库(concept base,收录常用词汇用语,可作为各类应用的基础集),eb为基础实体库(entity base, 收录常见命名实体,可根据应用需求扩展) | cb、eb的划分标准不同应用不一样,可根据需求调整;应用方也可以构造自己的应用库,与cb、eb整合使用。 | +| termtype | 【必有】term的主类别,详细描述参见 [termtree\_type](./termtree_type.csv) | 多上位的term会选择其中一个作为termtype,其他上位作为subtype,方便应用筛选 | +| subtype | 【非必须】term的辅助类别或细分类别 | 如果应用特别关注某个subtype,也可以将其升级为termtype使用(需要相应更新termid和id) | +| subterms | 【非必须】用于描述同类同名的term集,若“termtype+src”下term只对应一个实例,则subterms为空;若“termtype+src”下term对应多个实例,则subterms记录这些实例,其字段与term相同 | 不需要区分subterm的两种常见场景:1. 应用只需词类特征;2. 上下文信息不足,无法区分具体实例 | +| subterms_num | 【非必须】subterms中的subterm数量 | 如果没有subterm,则值为0 | +| alias | 【非必须】term的常用别名 | 通常为歧义小的别名 | +| alias\_ext | 【非必须】term的常用扩展别名,经常是term或alias的一个子片段,单独出现有其他含义,结合上下文可识别为别名。 | 通常为歧义大的别名,便于应用筛选使用。e.g., 四维彩超的alias_ext“四维” | +| links | 【非必须】该term对应的其他term的id,可以是本知识库中的id,也可以是其他知识库如百度百科id | 如果是本知识库中的id,则表示两者可以指代同一实体 | + +## 数据示例 +```json +// 示例1:无subterms的term +{ + "id": "c472a6fe74eb2008c4e5b958a047eb5c", + "termid": "植物_cb_苹果", + "term": "苹果", + "src": "cb", + "termtype": "植物", + "subtype": [], + "subterms": [], + "subterms_num": 0, + "alias": [ + "苹果树" + ], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/14822460" + ] + } + ] +} + +// 示例2:有subterms的term +{ + "id": "824716062a4d74efc0897d676700a24e", + "termid": "影视作品_eb_苹果", + "term": "苹果", + "src": "eb", + "termtype": "影视作品", + "subtype": [], + "subterms": [ + { + "id": "9bb5b38dc50233b1ccd28d1c33c37605", + "subtype": [ + "影视作品_cb_电影", + "影视动漫作品_cb_剧情片" + ], + "alias": [], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011191" + ] + } + ] + }, + { + "id": "688dc07cc98f02cbd4d21e2700290590", + "subtype": [ + "影视作品_cb_韩国电影" + ], + "alias": [], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011208" + ] + } + ] + }, + { + "id": "bbf4abe6ac412b181eac383333ca9fef", + "subtype": [ + "影视作品_cb_剧情电影" + ], + "alias": [], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011176" + ] + } + ] + } + ], + "subterms_num": 3, + "alias": [], + "alias_ext": [], + "links": [] +} +``` + +## TermTree特点 + + 1. 将所有中文词汇放在一个统一类别体系下表示,包括**概念、实体/专名、领域术语、语法词**。 +- 解决传统标注技术下(e.g., 词性标注、命名实体识别),概念、实体、词性特征难以统一计算的问题。 + + 2. 为中文精准解析挖掘服务的词汇类别体系,以全面覆盖**百科词条、搜索query、新闻资讯**中出现的中文词汇为目标,支持通用场景文本理解。 + - 应用可以通过指定词表的TermType,方便地整合到TermTree中,定制应用特化版。 + + 3. 尽可能收录常用概念词,并区分常用概念词(src=cb)和专名实体词(src=eb),以解决专名实体与概念在计算中容易混淆的问题。为此,特别补充收录了很多百科中缺少的概念词。 + - 例:“琴房(歌曲类实体)” VS. “琴房(区域场所类概念)” + - 例:“甩掉(歌曲类实体)” VS. “甩掉(场景事件类概念)” + + 4. 将同类同名实体拆分为term和subterm两层(参见数据示例),term作为给定termtype下所有同名实体的表示,subterm作为同类同名实体集中每一个具体实体的表示: + - 一方面解决文本中信息不足无法区分具体实体时的标注问题; + - 一方面减少同名词汇的消歧计算代价(只需要计算同类下的同名实体,有效解决概念词和实体词识别混淆的问题) + + 5. 
为重要的概念/实体构建完整上位归类路径(**注:** TermTreeV1.0试用版暂不包括),用于细粒度特征计算和知识推断,参见以下示例 + + | term | 类别| src| 上位归类路径示例 | + |---|---|---|---| + |苹果 | 植物类|cb|苹果 → 苹果属 → 蔷薇科 → 蔷薇目 → 双子叶植物纲 → 被子植物门 → 种子植物 → 植物界 → 真核生物域 → 生物| + | 黄香蕉苹果| 饮食类|cb|黄香蕉苹果 →苹果 →水果 → 蔬果和菌藻类 →食材 →食物 →饮食| + |甲型流感 | 疾病类|cb|甲型流感 → 流行性感冒 → 感冒 → 呼吸道感染 → 呼吸系统疾病 → 疾病损伤 → 生物疾病| + |甲型流感病毒| 微生物类|cb|甲型流感病毒 → 流行性感冒病毒 → 正粘病毒科 → RNA病毒 → 生物病毒 → 病原微生物 → 微生物 → 生物| + |琴房| 区域场所类|cb|琴房 → 音乐室 → 活动室 →活动场所 →区域场所| + |琴房| 音乐类|eb|琴房 → 歌曲 →音乐作品 →艺术作品 →作品 → 作品与出版物| + |认同感 | 生活用语类|cb|认同感 →正面感受 → 感受 → 知觉感受 → 个体描述 → 生活用语| + | 认同感| 图书类|eb|认同感 →书籍 →图书 →书刊 →出版物 → 作品与出版物| + |佛罗伦萨足球俱乐部| 体育组织机构|eb|佛罗伦萨足球俱乐部 →意大利足球联赛球队→职业足球俱乐部→足球俱乐部 →足球队 →球队 →运动队 →体育组织机构 →组织机构| + |佛罗伦萨市 | 世界地区类|cb|佛罗伦萨市 →托斯卡纳大区 →意大利 →南欧 →欧洲 →地球区域 →世界地区| + |言情小说 | 小说类|cb|言情小说 →情感小说 →小说 →文学作品 →作品 →作品与出版物| + | 言情小说| 音乐类|eb|言情小说 → 歌曲 →音乐作品 →艺术作品 →作品 → 作品与出版物| +> **注:** TermType词类体系可视为所有上位归类路径的集合。 + +## TermTree应用方式 + +1. 直接作为词表使用,利用termtype和subtype筛选应用所需的词表(停用词表、黑白名单、概念扩展词表等)。 +2. 结合中文文本知识标注工具(WordTag等)使用,用于文本词类特征生成、挖掘/解析pattern生成、样本构建和优化等等,参见"[解语的应用场景](../)"。 +3. 整合应用知识图谱,为应用知识图谱提供通用词汇知识补充。 + +## TermTree后续规划 + +1. 数据覆盖扩展到全量百度百科词条,提升TermType归类准确率,便于应用方筛选构建应用适配的TermTree; +2. 建立知识共建社区,支持用户提交自己的term词表,生成定制版TermTree。 + + +## 在论文中引用TermTree +如果您的工作成果中使用了TermTree,请增加下述引用。我们非常乐于看到TermTree对您的工作带来帮助。 +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + +## 问题与反馈 + +百科知识树在持续扩充优化中,如果您有任何建议或发现数据问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/termtree/termtree.py b/examples/text_to_knowledge/termtree/termtree.py new file mode 100644 index 0000000000000000000000000000000000000000..7b09795ef1135f1a4bb4c7ff930c05da49a65da5 --- /dev/null +++ b/examples/text_to_knowledge/termtree/termtree.py @@ -0,0 +1,416 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import csv +import json +import os +import warnings +from typing import Any, Dict, List, Optional, Tuple, Union + + +class TermTreeNode(object): + """Defination of term node. All members are protected, to keep rigorism of data struct. + + Args: + sid (str): term id of node. + term (str): term, common name of this term. + base (str): `cb` indicates concept base, `eb` indicates entity base. + term_type (Optional[str], optional): type of this term, constructs hirechical of `term` node. Defaults to None. + hyper (Optional[str], optional): parent type of a `type` node. Defaults to None. + node_type (str, optional): type statement of node, `type` or `term`. Defaults to "term". + alias (Optional[List[str]], optional): alias of this term. Defaults to None. + alias_ext (Optional[List[str]], optional): extended alias of this term, CANNOT be used in matching. + Defaults to None. + sub_type (Optional[List[str]], optional): grouped by some term. Defaults to None. 
+ sub_term (Optional[List[str]], optional): some lower term. Defaults to None. + data (Optional[Dict[str, Any]], optional): to sore full imformation of a term. Defaults to None. + + """ + + def __init__( + self, + sid: str, + term: str, + base: str, + node_type: str = "term", + term_type: Optional[str] = None, + hyper: Optional[str] = None, + level: Optional[int] = None, + alias: Optional[List[str]] = None, + alias_ext: Optional[List[str]] = None, + sub_type: Optional[List[str]] = None, + sub_term: Optional[List[str]] = None, + data: Optional[Dict[str, Any]] = None, + ): + self._sid = sid + self._term = term + self._base = base + self._term_type = term_type + self._hyper = hyper + self._sub_term = sub_term if sub_term is not None else [] + self._sub_type = sub_type if sub_type is not None else [] + self._alias = alias if alias is not None else [] + self._alias_ext = alias_ext if alias_ext is not None else [] + self._data = data + self._level = level + self._node_type = node_type + self._sons = set() + + def __str__(self): + if self._data is not None: + return json.dumps(self._data, ensure_ascii=False) + else: + res = { + "termid": self._sid, + "term": self._term, + "src": self._base, + "alias": self._alias, + "alias_ext": self._alias_ext, + "termtype": self._term_type, + "subterms": self._sub_term, + "subtype": self._sub_type, + "links": [], + } + return json.dumps(res, ensure_ascii=False) + + @property + def sid(self): + return self._sid + + @property + def term(self): + return self._term + + @property + def base(self): + return self._base + + @property + def alias(self): + return self._alias + + @property + def alias_ext(self): + return self._alias_ext + + @property + def termtype(self): + return self._term_type + + @property + def subtype(self): + return self._sub_type + + @property + def subterm(self): + return self._sub_term + + @property + def hyper(self): + return self._hyper + + @property + def level(self): + return self._level + + @property + def sons(self): + return self._sons + + @property + def node_type(self): + return self._node_type + + def add_son(self, son_name): + self._sons.add(son_name) + + @classmethod + def from_dict(cls, data: Dict[str, Any]): + """Build a node from dictionary data. + + Args: + data (Dict[str, Any]): Dictionary data contain all k-v data. + + Returns: + [type]: TermTree node object. + """ + return cls( + sid=data["termid"], + term=data["term"], + base=data["src"], + term_type=data["termtype"], + sub_type=data["subtype"], + sub_term=data["subterms"], + alias=data["alias"], + alias_ext=data["alias_ext"], + data=data, + ) + + @classmethod + def from_json(cls, json_str: str): + """Build a node from JSON string. + + Args: + json_str (str): JSON string formatted by TermTree data. + + Returns: + [type]: TermTree node object. 
+ """ + dict_data = json.loads(json_str) + return cls.from_dict(dict_data) + + +class TermTree(object): + """TermTree class.""" + + def __init__(self): + self._nodes: Dict[str, TermTreeNode] = {} + self._root = TermTreeNode(sid="root", term="root", base="cb", node_type="root", level=0) + self._nodes["root"] = self.root + self._index = {} + + def __build_sons(self): + for node in self._nodes: + self.__build_son(self._nodes[node]) + + def __getitem__(self, item): + return self._nodes[item] + + def __contains__(self, item): + return item in self._nodes + + def __iter__(self): + return self._nodes.__iter__() + + @property + def root(self): + return self._root + + def __load_type(self, file_path: str): + with open(file_path, "rt", newline="", encoding="utf8") as csvfile: + file_handler = csv.DictReader(csvfile, delimiter="\t") + for row in file_handler: + if row["type-1"] not in self: + self.add_type(type_name=row["type-1"], hyper_type="root") + if row["type-2"] != "" and row["type-2"] not in self: + self.add_type(type_name=row["type-2"], hyper_type=row["type-1"]) + if row["type-3"] != "" and row["type-3"] not in self: + self.add_type(type_name=row["type-3"], hyper_type=row["type-2"]) + + def __judge_term_node(self, node: TermTreeNode) -> bool: + if node.termtype not in self: + raise ValueError(f"Term type of new node {node.termtype} does not exists.") + if node.sid in self: + warnings.warn(f"{node.sid} exists, will be replaced by new node.") + + def add_term( + self, + term: Optional[str] = None, + base: Optional[str] = None, + term_type: Optional[str] = None, + sub_type: Optional[List[str]] = None, + sub_term: Optional[List[str]] = None, + alias: Optional[List[str]] = None, + alias_ext: Optional[List[str]] = None, + data: Optional[Dict[str, Any]] = None, + ): + """Add a term into TermTree. + + Args: + term (str): common name of name. + base (str): term is concept or entity. + term_type (str): term type of this term + sub_type (Optional[List[str]], optional): sub type of this term, must exists in TermTree. Defaults to None. + sub_terms (Optional[List[str]], optional): sub terms of this term. Defaults to None. + alias (Optional[List[str]], optional): alias of this term. Defaults to None. + alias_ext (Optional[List[str]], optional): . Defaults to None. + data (Optional[Dict[str, Any]], optional): [description]. Defaults to None. + """ + if data is not None: + new_node = TermTreeNode.from_dict(data) + else: + new_node = TermTreeNode( + sid=f"{term_type}_{base}_{term}", + term=term, + base=base, + term_type=term_type, + sub_term=sub_term, + sub_type=sub_type, + alias=alias, + alias_ext=alias_ext, + node_type="term", + ) + self.__judge_term_node(new_node) + self._nodes[new_node.sid] = new_node + self.__build_index(new_node) + + def add_type(self, type_name, hyper_type): + if type_name in self._nodes: + raise ValueError(f"Term Type {type_name} exists.") + if hyper_type not in self._nodes: + raise ValueError(f"Hyper type {hyper_type} does not exist, please add it first.") + if self._nodes[hyper_type].level == 3: + raise ValueError( + "Term type schema must be 3-LEVEL, 3rd level type node should not be a parent of type node." 
+ ) + self._nodes[type_name] = TermTreeNode( + sid=type_name, + term=type_name, + base=None, + hyper=hyper_type, + node_type="type", + level=self._nodes[hyper_type].level + 1, + ) + self.__build_index(self._nodes[type_name]) + + def __load_file(self, file_path: str): + with open(file_path, encoding="utf-8") as fp: + for line in fp: + data = json.loads(line) + self.add_term(data=data) + + def __build_son(self, node: TermTreeNode): + """Build sons of a node + + Args: + node (TermTreeNode): son node. + """ + type_node = None + if node.termtype is not None: + type_node = self._nodes[node.termtype] + elif node.hyper is not None: + type_node = self._nodes[node.hyper] + if type_node is not None: + type_node.add_son(node.sid) + for sub_type in node.subtype: + sub_type_node = self._nodes[sub_type] + sub_type_node.add_son(node.sid) + + def build_son(self, node: str): + self.__build_son(self[node]) + + def __build_index(self, node: TermTreeNode): + if node.term not in self._index: + self._index[node.term] = [] + self._index[node.term].append(node.sid) + for alia in node.alias: + if alia not in self._index: + self._index[alia] = [] + self._index[alia].append(node.sid) + + def __judge_hyper(self, source_id, target_id) -> bool: + queue = [source_id] + visited_node = {source_id} + while len(queue) > 0: + cur_id = queue.pop(0) + if cur_id == target_id: + return True + cur_node = self._nodes[cur_id] + edge = [] + if cur_node.hyper is not None: + edge.append(cur_node.hyper) + if cur_node.termtype is not None: + edge.append(cur_node.termtype) + edge.extend(cur_node.subtype) + for next_id in edge: + if next_id not in visited_node: + queue.append(next_id) + visited_node.add(next_id) + return False + + def find_term(self, term: str, term_type: Optional[str] = None) -> Tuple[bool, Union[List[str], None]]: + """Find a term in Term Tree. If term not exists, return None. + If `term_type` is not None, will find term with this type. + + Args: + term (str): term to look up. + term_type (Optional[str], optional): find term in this term_type. Defaults to None. + + Returns: + Union[None, List[str]]: [description] + """ + if term not in self._index: + return False, None + else: + if term_type is None: + return True, self._index[term] + else: + out = [] + for term_id in self._index[term]: + if self.__judge_hyper(term_id, term_type) is True: + out.append(term_id) + if len(out) > 0: + return True, out + else: + return False, None + + def build_from_dir(self, term_schema_path, term_data_path, linking=True): + """Build TermTree from a directory which should contain type schema and term data. + + Args: + dir ([type]): [description] + """ + self.__load_type(term_schema_path) + if linking: + self.__load_file(term_data_path) + self.__build_sons() + + @classmethod + def from_dir(cls, term_schema_path, term_data_path, linking=True) -> "TermTree": + """Build TermTree from a directory which should contain type schema and term data. 
+ + Args: + source_dir ([type]): [description] + + Returns: + TermTree: [description] + """ + term_tree = cls() + term_tree.build_from_dir(term_schema_path, term_data_path, linking) + return term_tree + + def __dfs(self, cur_id: str, depth: int, path: Dict[str, str], writer: csv.DictWriter): + cur_node = self._nodes[cur_id] + if cur_node.node_type == "term": + return + if depth > 0: + path[f"type-{depth}"] = cur_id + if path["type-1"] != "": + writer.writerow(path) + for son in cur_node.sons: + self.__dfs(son, depth + 1, path, writer) + if depth > 0: + path[f"type-{depth}"] = "" + + def save(self, save_dir): + """Save term tree to directory `save_dir` + + Args: + save_dir ([type]): Directory. + """ + if os.path.exists(save_dir) is False: + os.makedirs(save_dir, exist_ok=True) + out_path = {} + for i in range(1, 3): + out_path[f"type-{i}"] = "" + with open(f"{save_dir}/termtree_type.csv", "wt", encoding="utf-8", newline="") as fp: + fieldnames = ["type-1", "type-2", "type-3"] + csv_writer = csv.DictWriter(fp, delimiter="\t", fieldnames=fieldnames) + csv_writer.writeheader() + self.__dfs("root", 0, out_path, csv_writer) + with open(f"{save_dir}/termtree_data", "w", encoding="utf-8", newline="") as fp: + for nid in self: + node = self[nid] + if node.node_type == "term": + print(node, file=fp) diff --git a/examples/text_to_knowledge/termtree/termtree_type.csv b/examples/text_to_knowledge/termtree/termtree_type.csv new file mode 100644 index 0000000000000000000000000000000000000000..0ab88eefe103993d4de2c410488b7b2f886d8970 --- /dev/null +++ b/examples/text_to_knowledge/termtree/termtree_type.csv @@ -0,0 +1,164 @@ +type-1 type-2 type-3 说明 主要对应词性 subtype示例(开放集合) src(C表示cb、E表示eb) +角色 n/普通名词;nr/人名 群体 C/E +角色 人物 人物类概念、人物类实体 nr/人名 职业角色、历史人物、行业人物 C/E +角色 民族族群 民族和族群 五十六个民族 C +角色 虚拟角色 非现实的角色 虚拟人物、虚拟生物 C/E +作品与出版物 作品类概念、作品类实体 nw/作品名 拓片 C/E +作品与出版物 游戏 电子游戏、视频小游戏、网页游戏 C/E +作品与出版物 影视动漫作品 视频作品 C/E +作品与出版物 影视动漫作品 动漫作品 漫画、动画 C/E +作品与出版物 影视动漫作品 影视作品 电影、电视剧 C/E +作品与出版物 影视动漫作品 视频节目 脱口秀节目、新闻类节目、访谈节目 C/E +作品与出版物 音乐 歌曲、音乐专辑 C/E +作品与出版物 小说 网络小说、言情小说 C/E +作品与出版物 诗歌 诗词 C/E +作品与出版物 计算机软件 工具软件、办公软件 C/E +作品与出版物 舞蹈 C/E +作品与出版物 美术 雕塑作品、油画作品、工艺美术作品 C/E +作品与出版物 图书 书籍、词典、教材 C/E +作品与出版物 刊物 报纸、期刊 C/E +作品与出版物 文件 文书 C/E +作品与出版物 作品IP C/E +区域场所 ns/地名 宗教场所、建筑物 C/E +区域场所 景点 公园、植物园、动物园、博物馆 C/E +区域场所 楼盘住宅 商业楼盘、住宅楼盘、住宅小区 C/E +区域场所 交通场所 机场、车站、港口、交通道路、交通线路 C/E +区域场所 住宿场所 酒店、旅馆 C/E +区域场所 餐饮场所 咖啡馆、餐馆 C/E +区域场所 网上组织机构场所 网站、虚拟社区 C/E +位置方位 位置方位词 f/方位名词;s/处所名词 C +组织机构 组织机构类 nt/机构团体名 委员会、论坛 C/E +组织机构 演艺团体 乐队、艺术团、偶像组合 C/E +组织机构 国家机关 政府部门、党政机关 C/E +组织机构 企事业单位 公司、厂商、企业 C/E +组织机构 教育组织机构 学校、大学、幼儿园、培训机构 C/E +组织机构 居民服务机构 母婴护理机构、婚介机构、美容护理机构、家政服务机构 C/E +组织机构 医疗卫生机构 医院、药店、诊所、科室 C/E +组织机构 体育组织机构 运动队、体育俱乐部 C/E +组织机构 金融组织机构 银行、交易所、投资机构 C/E +组织机构 军事组织机构 部队、军区 C/E +品牌 品牌名 n/普通名词;nz/其他专名 C/E +品牌 汽车品牌 C/E +品牌 手机品牌 C/E +品牌 个护用品品牌 护肤品牌、彩妆品牌 C/E +物体与物品 包括物品和物质 n/普通名词;nz/其他专名 物体构造、化学物质 C/E +物体与物品 物品 飞机、船舶、轴承、摄影器材 C/E +物体与物品 物品 汽车 C/E +物体与物品 物品 手机 C/E +物体与物品 物品 美容美发用品 化妆品、美发用品 C/E +物体与物品 物品 电子电器产品 计算机、家用电器 C/E +物体与物品 物品 衣物饰品 服装、箱包、鞋靴、饰品配件 C/E +物体与物品 物品 兵器 武器、导弹、冷兵器.. 
C/E +物体与物品 设施 C/E +饮食 饮食类 n/普通名词;nz/其他专名 食材 C/E +饮食 菜品 菜品类 汤品、面食 C/E +饮食 饮品 饮品类 茶叶、酒、饮料和冷饮类 C/E +生物 生物类 n/普通名词;nz/其他专名 C +生物 动物 动物类 猫、狗、鸟纲、昆虫纲、鱼纲 C +生物 植物 植物类 C +生物 微生物 微生物类 真菌、细菌、生物病毒 C +世界地区 世界地区,包括地球外区域 ns/地名 首都、地球山脉、地球河流、地球岛屿、历史地区 C +世界地区 中国地区 中国地区 中国省级行政区、中国省会 C +世界地区 国家 现代意义的国家 现存国家、历史国家 C +虚拟事物 非现实事物 n/普通名词;nz/其他专名 虚拟场所、虚拟场景 C/E +虚拟事物 虚拟物品 虚拟宝物、游戏卡牌、游戏道具、游戏装备 C/E +文化 文化相关的特定类目 n/普通名词;nz/其他专名 C +文化 姓氏与人名 中文姓氏、英文姓氏 C +文化 语言文字 汉语、方言 C +文化 民俗 方术、数术、十二生肖、占星学星座、周易六十四卦 C +文化 政权朝代 历史政权、中国朝代 C +文化 制度政策协议 C/E +文化 奖项赛事活动 奖项、活动 C/E +事件 事件类 n/普通名词;nz/其他专名 展览、会议、案件、事故、战争 C/E +术语 领域术语、专名 nz/其他专名 C +术语 编码符号指标 价格、符号、信号、度量单位、邮政编码..... C +术语 教育用语 C +术语 教育用语 学科 C +术语 教育用语 学历学位 学历、学位 C +术语 教育用语 专业 C +术语 游戏用语 C +术语 游戏用语 麻将术语 C +术语 医药学术语 中医术语、西医术语、医学指标、诊断治疗方法 C +术语 医药学术语 医疗美容项目 C +术语 医药学术语 药物 中药、西药 C +术语 医药学术语 疾病损伤 疾病、疾病症状 C +术语 医药学术语 动物疾病 C +术语 金融术语 股票术语、证券术语、保险术语、银行术语 C +术语 金融术语 股票 C/E +术语 金融术语 保险 C/E +术语 金融术语 基金 C/E +术语 金融术语 银行卡 借记卡、信用卡 C/E +术语 经济术语 会计术语 C +术语 法律术语 C +术语 法律术语 法律法规 法律体系、法律、法规 C/E +术语 法律术语 罪名 C +术语 体育术语 围棋术语、象棋术语、篮球术语 C +术语 体育术语 体育运动项目 球类运动、武术功夫 C +术语 赌博博彩用语 赌博用语 C +术语 赌博博彩用语 彩票 C +术语 天文学术语 星系、恒星 C +术语 天文学术语 星座 八十八星座 C +术语 天文学术语 星体 小行星 C +术语 生物学术语 C +术语 生物学术语 动物体构造 动物器官系统、骨 C +术语 生物学术语 植物病虫害 植物病害、植物虫害 C +术语 机械工程术语 机械制造术语 C +术语 机械工程术语 汽车术语 C +术语 大气科学术语 气象学术语、气候学术语 C +术语 大气科学术语 台风 C/E +术语 计算机术语 病毒程序、计算机网络术语、编程技术术语 C +术语 文化术语 摄影术语、音乐术语、文学术语 C +术语 数学术语 数学概念、数学公式、几何学术语 C +术语 物理术语 电学术语、力学术语 C +术语 化学术语 化学结构 C +术语 统计术语 数理统计术语 C +术语 地学术语 地理学术语、地质学术语 C +术语 农业学术语 土壤学术语 C +术语 心理学术语 心理现象 C +术语 语言学术语 语法、词法、音韵学术语 C +术语 建筑术语 土木工程术语、装修术语 C +术语 军事术语 C +术语 政治术语 C +术语 哲学术语 哲学理论、伦理学术语、逻辑学术语 C +术语 宗教术语 道教术语、佛教术语 C +术语 通信术语 C +术语 材料科学术语 C +术语 航空科技术语 C +术语 水利科技术语 水利工程 C +术语 测绘学术语 测量术语 C +术语 电力术语 C +术语 社会学术语 C +术语 交通术语 船舶工程术语 C +术语 钓鱼术语 C +术语 ACGN术语 C +生活用语 日常生活中常用词 n/普通名词;nz/其他专名 信息知识资料、标识物、行业、服务 C +生活用语 情绪 C +生活用语 态度 C +生活用语 表情 笑、哭、眼神 C +生活用语 人物造型 妆容、发型 C +生活用语 个性特点 C +生活用语 颜色 C +生活用语 场景事件 包括常见动词 v/普通动词;vn/名动词;vd/动副词 考试 C/E +时间阶段 时间相关词 t/时间名词 时间、年代、世纪... 
C +时间阶段 地质年代 C +时间阶段 特殊日 农历二十四节气、假日、节日、纪念日 C +词汇用语 语法词类、汉字、成语等,用于兜底 n/普通名词 C +词汇用语 汉字 汉字字表 C/E +词汇用语 成语 成语词表 C/E +词汇用语 俗语 非成语的俗语 歇后语、顺口溜、谚语 C/E +词汇用语 诗句 诗句 C/E +词汇用语 介词 介词 p/介词 C +词汇用语 助词 助词 u/助词 C +词汇用语 代词 代词 r/代词 C +词汇用语 连词 连词 c/连词 C +词汇用语 副词 副词 d/副词 C +词汇用语 疑问词 疑问词 C +词汇用语 肯定否定词 常用肯定词和否定词 C +词汇用语 量词 量词 q/量词 C +词汇用语 数量词 数量词 m/数量词 C +词汇用语 叹词 叹词 C +词汇用语 拟声词 拟声词 C +词汇用语 修饰词 修饰词,包括常见形容词 n/普通名词;a/形容词;ad/副形词;an/名形词 C +词汇用语 汉字偏旁部首 汉字偏旁部首 C +词汇用语 日文假名 日文假名 平假名、片假名 C +词汇用语 汉语拼音 汉语拼音字母 C diff --git a/examples/text_to_knowledge/wordtag-ie/README.md b/examples/text_to_knowledge/wordtag-ie/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c5b28912c3098020c3e70f2ee85e886613fef347 --- /dev/null +++ b/examples/text_to_knowledge/wordtag-ie/README.md @@ -0,0 +1,135 @@ +# 解语:WordTag-IE(基于中文词类知识的信息抽取工具) + +WordTag-IE(基于中文词类知识的信息抽取工具)是在WordTag标注结果之上实现的信息抽取工具,旨在提供一个灵活、可配置,能够精准、全面覆盖简单句式的**规则信息抽取工具**。我们已提供了通用配置,可覆盖一些常见的抽取句式。用户也可以根据我们提供的配置方法,完成自己的配置,应用于自己的领域、专业文本。其产出数据,可作为模型的训练样本,也可以直接当作挖掘结果使用。 + +![](https://user-images.githubusercontent.com/1371212/172542329-754cb4f9-3526-400b-be6e-d60e078af872.png) + +## WordTag-IE特点 + +- **灵活、方便的配置,即时生效** + - WordTag-IE是在WordTag标注结果的基础上,完全使用规则实现的关系抽取工具。其配置完全基于WordTag的词类知识以及TermTree中的词表实现,实现了灵活、简单配置,且保证了产出数据的一致性 + +## 使用示例 + +在WordTag的任务中基础上可以打开`with_ie` 开关即可输出信息抽取的结果, 下面是使用PaddleNLP Taskflow使用WordTag-IE的使用示例。 +```python +>>> from paddlenlp import Taskflow +>>> wordtag_ie = Taskflow("knowledge_mining", with_ie=True) +>>> wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。') +[[{'text': '《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 5, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 6, 'wordtag_label': '肯定词', 'length': 1}, {'item': '一首', 'offset': 7, 'wordtag_label': '数量词_单位数量词', 'length': 2}, {'item': '由', 'offset': 9, 'wordtag_label': '介词', 'length': 1}, {'item': '王杰', 'offset': 10, 'wordtag_label': '人物类_实体', 'length': 2}, {'item': '作词', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2}, {'item': '、', 'offset': 14, 'wordtag_label': 'w', 'length': 1}, {'item': '作曲', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2}, {'item': '并', 'offset': 17, 'wordtag_label': '连词', 'length': 1}, {'item': '演唱', 'offset': 18, 'wordtag_label': '场景事件', 'length': 2}, {'item': '的', 'offset': 20, 'wordtag_label': '助词', 'length': 1}, {'item': '歌曲', 'offset': 21, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': ',', 'offset': 23, 'wordtag_label': 'w', 'length': 1}, {'item': '收录', 'offset': 24, 'wordtag_label': '场景事件', 'length': 2}, {'item': '在', 'offset': 26, 'wordtag_label': '介词', 'length': 1}, {'item': '专辑', 'offset': 27, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '同名', 'offset': 29, 'wordtag_label': '场景事件', 'length': 2}, {'item': '《', 'offset': 31, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 32, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 36, 'wordtag_label': 'w', 'length': 1}, {'item': '中', 'offset': 37, 'wordtag_label': '词汇用语', 'length': 1}, {'item': ',', 'offset': 38, 'wordtag_label': 'w', 'length': 1}, {'item': '由', 'offset': 39, 'wordtag_label': '介词', 'length': 1}, {'item': '波丽佳音', 'offset': 40, 'wordtag_label': '人物类_实体', 'length': 4}, {'item': '唱片', 'offset': 44, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '于', 'offset': 46, 'wordtag_label': '介词', 'length': 1}, {'item': '1996年08月31日', 
'offset': 47, 'wordtag_label': '时间类_具体时间', 'length': 11}, {'item': '发行', 'offset': 58, 'wordtag_label': '场景事件', 'length': 2}, {'item': '。', 'offset': 60, 'wordtag_label': 'w', 'length': 1}]}], [[{'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '创作', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], 'GROUP': '创作者', 'SRC': 'HTG', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '歌曲', 'offset': 21, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '收录', 'TRIG': [{'item': '收录', 'offset': 24}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '收录于', 'SRC': 'HGT', 'TRIG': [{'item': '收录', 'offset': 24}]}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '王杰', 'type': '人物类_实体', 'offset': 10}], 'GROUP': '创作者', 'TRIG': [{'item': '专辑', 'offset': 27}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '王杰', 'type': '人物类_实体', 'offset': 10}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '创作', 'SRC': 'HGT', 'TRIG': [{'item': '专辑', 'offset': 27}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 32}, 'TAIL_ROLE': [{'item': '唱片', 'offset': 44, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}]]] +``` +同时可以通过 `schema` 来配置相关关系类型, 抽取自定义的关系组合 + +``` python +>>> from pprint import pprint +>>> schema = [ + { + "head_role": "作品类_实体", #头实体词类 + "group": "创作者", #关系名 + "tail_role": [ + { + "main": [ + "人物类_实体" #尾实体词类 + ], + "support": [] #相关词类,可作为该关系的补充,不可作为尾实体独立存在 + } + ], + "trig_word": [ + "作词", #触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null + ], + "trig_type": "trigger", #trigger表明由触发词触发,tail表明为尾实体触发 + "reverse": False, #是否为反向配置,即尾实体实际是头,头实体实际是尾 + "trig_direction": "B", #触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B + "rel_group": "创作" #对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 + }] +>>> wordtag_ie.set_schema(schema) +>>> pprint(wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。')[1]) +[[{'GROUP': '创作', + 'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, + 'SRC': 'REVERSE', + 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}, + {'GROUP': '创作者', + 'HEAD_ROLE': {'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}, + 'SRC': 'HTG', + 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}]] +``` + +## 配置示例 + +我们提供了配置示例文件[demo_config.json](./demo_config.json),用户可以直接基于这个文件实现自己想要的配置。 + +我们以“出版方”这个关系为例: + +```json +{ + "head_role": "作品类_实体", //头实体词类 + "group": "出版方", //关系名 + "tail_role": [ + { + "main": [ + "组织机构类" + ], //尾实体词类 + "support": [ + "时间类_具体时间" + ] //相关词类,可作为该关系的补充,不可作为尾实体独立存在 + } + ], //尾实体配置 + "trig_word": [ + "出版" + ], //触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null + "trig_direction": "L", //触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B + "trig_type": "trigger", //trigger表明由触发词触发,tail表明为尾实体触发 + "reverse": false, //是否为反向配置,即尾实体实际是头,头实体实际是尾 + "rel_group": "出版" 
//对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 +} +``` + +### 配置原则 + +1. 文本中的头实体(head_role)一定在尾实体(tail_role)的前面(即左边),可以通过配置反向标记(reverse)和反向关系名(rel_group)生成反关系 +2. 两种触发模式:触发词触发(trig_type为trigger)和尾实体触发(trig_type为tail),两者的触发方向(trig_direction)配置不同 + + - 触发词的触发方向约束的是文本中尾实体在触发词的左边还是右边,默认是双向触发(B),可以配置向左触发(L)或向右(R)触发,以提升挖掘精度 + + - 尾实体触发不用配置方向,因为必然在头实体之后 +## 实现方法 + +使用WordTag的标注结果,相当于已实现将**无限的词收拢到了有限的词类体系中**,而待抽取的关系,则变成了仅发生在词类与词类之间,便可以枚举出来。例如,`人物类_实体`与`作品类_实体`之间的关系可以是“创作”,而“创作”的触发词(如作词、作曲、演唱、执导、出演等)或触发pattern,则可以通过知识库枚举得到,如此,则实现了灵活配置。 + +那么,接下来一个问题则是,我们如何从现在的序列解析结果中,得到关系三元组数据呢? + +要解决这个问题,我们依旧要从中文语言学的成果中寻找答案:中文更偏孤立语,注重**意合**,依靠词序和词之间的意义联系成句,词性、句法特征弱。也就是说,我们在解析的时候,可以尝试摒弃所谓句法特征,只是从次序上下手。于是,我们发现,只需要覆盖好 SPO 的几种常用表达顺序,单向搜索,即可覆盖大部分简单句。 + +例如,对于`<张艺谋,创作,十面埋伏>`这一 SPO 三元组,常用表达顺序有如下几种: + +- S-P-O:张艺谋执导了《十面埋伏》。 +- S-O-P:张艺谋是《十面埋伏》的导演。 +- O-S-P:《十面埋伏》是张艺谋执导的电影。 +- O-P-S:《十面埋伏》的导演是张艺谋。 + +然而,这种模式仍然过于复杂,如遇到多组 SPO 关系并存的文本,如果要完全照顾到这四种表达顺序,则很容易发生混乱,难以得到严格对应的三元组。所以,我们设计了**互反关系**的概念,即头实体和尾实体对调后,对应的反向关系。例如三元组`<张艺谋,创作,十面埋伏>`,则存在一个反向三元组`<十面埋伏,创作者,三元组>`。那么,当我们找到一个头实体之后,只需要考虑它之后的部分(即 `S-P-O` 和 `S-O-P` 两种表达顺序)就行了。 + +另外,我们认为,规范表达中,关系触发和尾实体一定实在同一个短语中出现,所以,触发关系之后,寻找尾实体的过程中,我们仅搜索与触发在同一个短语中的实体及相关元素。 + +## 后续计划 + +- 实现基于语义结构的抽取,覆盖复杂句式 + +## 在论文中引用WordTag-IE + +如果您的工作成果中使用了WordTag-IE,请增加下述引用。我们非常乐于看到WordTag-IE对您的工作带来帮助。 + +``` +@article{qin2022WordTag-IE, + title={WordTag-IE: a Rule-based Tool for Chinese Information Extraction}, + author={Qin, Huapeng and Zhao, Min and Tang, Wei}, + technical report={Baidu, Inc. TR:2022-KG-WordTag-IE}, + year={2022} +} +``` + +## 问题与反馈 + +WordTag-IE在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/wordtag-ie/demo_config.json b/examples/text_to_knowledge/wordtag-ie/demo_config.json new file mode 100644 index 0000000000000000000000000000000000000000..fa39a2ef3e4adc19f87607cf02358436013956af --- /dev/null +++ b/examples/text_to_knowledge/wordtag-ie/demo_config.json @@ -0,0 +1,955 @@ +[ + { + "head_role": "人物类_实体", + "group": "名字", + "tail_role": [ + { + "main": [ + "文化类_姓氏与人名", + "其他角色类", + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "原名" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "人物类_实体", + "group": "性别", + "tail_role": [ + { + "main": [ + "信息资料" + ], + "support": [] + } + ], + "trig_word": [ + "男", + "女" + ], + "trig_type": "role", + "reverse": false, + "trig_direction": null + }, + { + "head_role": "人物类_实体", + "group": "出生于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "世界地区类" + ] + }, + { + "main": [ + "世界地区类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "出生", + "出生于", + "生于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "时间类_具体时间", + "group": "出生于", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [ + "世界地区类" + ] + } + ], + "trig_word": [ + "出生", + "出生于", + "生于" + ], + "trig_type": "trigger", + "reverse": true, + "trig_direction": "B" + }, + { + "head_role": "人物类_实体", + "group": "参加工作时间", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "参加工作" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "人物类_实体", + "group": "入党时间", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "入党" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "人物类_实体", + "group": "加入组织", + "tail_role": [ + { + "main": [ + "组织机构类", + "组织机构类_概念" + ], + 
"support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "加入", + "参加" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "时间类_具体时间", + "group": "加入组织", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [ + "组织机构类", + "组织机构类_概念" + ] + } + ], + "trig_word": [ + "加入", + "参加" + ], + "trig_type": "trigger", + "reverse": true, + "trig_direction": "R" + }, + { + "head_role": "人物类_实体", + "group": "享年", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "年仅" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "人物类_实体", + "group": "创作", + "tail_role": [ + { + "main": [ + "作品类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "创作", + "监制", + "监制人", + "作词", + "作词人", + "作曲", + "作曲人", + "编曲", + "演唱", + "演唱者", + "制作人", + "制作", + "制片人", + "制片", + "主持人", + "主持", + "导演", + "执导", + "编剧", + "作者", + "所著", + "主编", + "撰写", + "编著", + "编撰" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "创作者" + }, + { + "head_role": "人物类_实体", + "group": "出演", + "tail_role": [ + { + "main": [ + "作品类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "配音" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "演员" + }, + { + "head_role": "人物类_实体", + "group": "饰演", + "tail_role": [ + { + "main": [ + "其他角色类", + "人物类_实体" + ], + "support": [ + "作品类_实体" + ] + } + ], + "trig_word": [ + "扮演", + "饰演", + "饰" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "人物类_实体", + "group": "代言", + "tail_role": [ + { + "main": [ + "品牌名" + ], + "support": [] + } + ], + "trig_word": [ + "代言" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "代言人" + }, + { + "head_role": "人物类_实体", + "group": "创建", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [] + } + ], + "trig_word": [ + "创办", + "创建" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R", + "rel_group": "创建人" + }, + { + "head_role": "人物类_实体", + "group": "获奖", + "tail_role": [ + { + "main": [ + "文化类_奖项赛事活动" + ], + "support": [ + "作品类_实体", + "数量词_序数词" + ] + } + ], + "trig_word": [ + "获", + "获得", + "荣获", + "获颁" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "作品类_实体", + "group": "类型", + "tail_role": [ + { + "main": [ + "作品类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "作品类_实体", + "group": "出品方", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "出品" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "作品类_实体", + "group": "出版方", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "出版" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "作品类_实体", + "group": "发表于", + "tail_role": [ + { + "main": [ + "场所类_网上场所" + ], + "support": [] + } + ], + "trig_word": [ + "发表", + "连载", + "发表于", + "连载于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "作品类_实体", + "group": "创作者", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "创作", + "监制", + "监制人", + "作词", + "作词人", + "作曲", + "作曲人", + "编曲", + "演唱", + "演唱者", + "制作人", + 
"制作", + "制片人", + "制片", + "主持人", + "主持", + "导演", + "执导", + "编剧", + "作者", + "所著", + "主编", + "撰写", + "编著", + "编撰" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "创作" + }, + { + "head_role": "作品类_实体", + "group": "演员", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "配音" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "出演" + }, + { + "head_role": "作品类_实体", + "group": "收录于", + "tail_role": [ + { + "main": [ + "作品类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "收录于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "作品类_实体", + "group": "改编自", + "tail_role": [ + { + "main": [ + "作品类_实体", + "作品类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "改编", + "改编自" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "作品类_实体", + "group": "获奖", + "tail_role": [ + { + "main": [ + "文化类_奖项赛事活动" + ], + "support": [] + } + ], + "trig_word": [ + "获", + "获得", + "荣获", + "获颁" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "作品类_实体", + "group": "上市于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "上市" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "组织机构类", + "group": "创建于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "世界地区类", + "组织机构类_国家机关" + ] + }, + { + "main": [ + "世界地区类", + "组织机构类_国家机关" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "成立", + "创办", + "创建", + "建立", + "登记成立", + "成立登记" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "组织机构类", + "group": "创建人", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "创办", + "创建" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L", + "rel_group": "创建" + }, + { + "head_role": "组织机构类", + "group": "上市于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "{[education:场外交易市场]}" + ] + }, + { + "main": [ + "{[education:场外交易市场]}" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "上市" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "组织机构类", + "group": "成立于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "世界地区类" + ] + }, + { + "main": [ + "世界地区类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "成立于", + "成立" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "时间类_具体时间", + "group": "成立于", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [ + "世界地区类" + ] + } + ], + "trig_word": [ + "成立于", + "成立" + ], + "trig_type": "trigger", + "reverse": true, + "trig_direction": "L" + }, + { + "head_role": "组织机构类", + "group": "所属组织", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [] + } + ], + "trig_word": [ + "隶属", + "隶属于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "世界地区类", + "group": "所属地区", + "tail_role": [ + { + "main": [ + "世界地区类" + ], + "support": [] + } + ], + "trig_word": [ + "首都", + "省会", + "首府" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "世界地区类", + "group": "所属地区", + "tail_role": [ + { + "main": [ + "世界地区类" + ], + "support": [] + } + ], + "trig_word": [ + 
"隶属", + "隶属于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "世界地区类", + "group": "所属地区", + "tail_role": [ + { + "main": [ + "世界地区类" + ], + "support": [] + } + ], + "trig_word": [ + "下辖" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "世界地区类", + "group": "官方语言", + "tail_role": [ + { + "main": [ + "文化类_语言文字" + ], + "support": [] + } + ], + "trig_word": [ + "官方语言" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "世界地区类", + "group": "海拔", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "海拔" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "世界地区类", + "group": "面积", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "面积", + "占地" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "场所类", + "group": "类型", + "tail_role": [ + { + "main": [ + "场所类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "场所类", + "group": "面积", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "面积", + "占地" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "物体类", + "group": "类型", + "tail_role": [ + { + "main": [ + "物体类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "物体类_兵器", + "group": "类型", + "tail_role": [ + { + "main": [ + "物体类_兵器" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "物体类", + "group": "上市于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "上市" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "物体类", + "group": "制造方", + "tail_role": [ + { + "main": [ + "组织机构类_企事业单位" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "生产", + "制造", + "推出", + "发布" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "品牌名", + "group": "类型", + "tail_role": [ + { + "main": [ + "品牌名_品牌类型" + ], + "support": [ + "世界地区类_国家" + ] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + } +] diff --git a/examples/text_to_knowledge/wordtag/README.md b/examples/text_to_knowledge/wordtag/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ce194b26e07b4c62bb952c75883ea6b1cb1bd2d5 --- /dev/null +++ b/examples/text_to_knowledge/wordtag/README.md @@ -0,0 +1,248 @@ +# 解语:WordTag(中文词类知识标注工具) + +WordTag(中文词类知识标注工具)是首个能够覆盖所有中文词汇的词类知识标注工具,旨在为中文文本解析提供全面、丰富的知识标注结果,可以应用于模板(挖掘模板、解析模板)生成与匹配、知识挖掘(新词发现、关系挖掘)等自然语言处理任务中,提升文本解析与挖掘精度;也可以作为中文文本特征生成器,为各类机器学习模型提供文本特征。 + +![wordtag示例](../doc/img/wordtag_example.png) + +## WordTag特点 + +- **覆盖所有中文词汇的词类体系,更丰富的知识标注结果** + - WordTag使用的词类体系为覆盖所有中文词汇的词类体系,包括各类实体词与非实体词(如概念、实体/专名、语法词等)。WordTag开源版对部分类目(如组织机构等),做了更细类目的划分识别(如,医疗卫生机构、体育组织机构),对仅使用文本信息难以细分的类目(如人物类、作品类、品牌名等),不做更细粒度的词类识别。用户需要细粒度的词类识别时,可利用百科知识树的类别体系自行定制。 +- **整合百科知识树链接结果,获得更丰富的标注知识** + - 如上图示例所示,各个切分标注结果中,除词类标注外,还整合了百科知识树的链接结果,用户可以结合百科知识树数据共同使用:如,利用百科知识树中的subtype获得更细的上位粒度,利用term的百科信息获得更加丰富的知识等。 +- **可定制的词类序列标注框架** 
+ - WordTag开源版标注使用的词类体系是我们在实践中对**百科文本**解析应用较好的一个版本,不同类型文本(如,搜索query、新闻资讯)的词类分布不同,用户可以利用百科知识树定制自己的词类体系和训练样本,构建自己的WordTag应用版,以获得更好的适配效果。例如,可将自定义的词表按照百科知识树的字段定义好,挂接/整合到百科知识树上,即可使用自己的Term数据定制标注样本和标注任务。 + +## 模型结构 + +模型使用[ERNIE-CTM](../ernie-ctm)+CRF训练而成,预测时使用viterbi解码,模型结构如下: + +wordtag模型结构 + + +## Term-Linking实现 + +WordTag提供从文本到百科知识树的链接方法,即Term-Linking,只需将term词类体系与百科知识树数据加载到工具中,即可在解析结果中得到term-linking结果。 + +为了能够适配应用中的不同实体集(例如,不同的企业有不同的人物实体集合,不同的小说站有不同的小说实体集合),我们将term-linking拆分为两个步骤: + +- 第一步是基于词类的linking,主要解决“同名概念词/实体词”、“不同类的同名词”消歧问题,这一步只使用文本本身特征和词类特征,不使用图谱中的实体属性值(SPO)知识,从而支持切换不同应用知识图谱; +- 第二步是同类同名实体词的linking,主要解决同类下不同属性值的实体消歧问题,这一步需要使用实体词的SPO知识(一般用于实体特征表示计算,以及文本-实体相似度计算)。 + +“WordTag+百科知识树”的开源版提供了第一步的解决示例,第二步由于依赖于特定图谱的SPO知识,无法提供通用工具,未来可能提供通用解决方案。 + +WordTag模型对所有的词预测到上位词类之后,会直接根据预测到的词类,映射到term体系(映射表参见代码配置),查找相应的term,进行link。用户也可根据自己的数据分布,定制term-linking策略: + +- link到自己定制的term词表:只需将term词表按照TermTree挂接好之后更换数据即可; +- 调整WordTag预测词类与term词表的映射关系(如,增加自定义类别):在代码配置中直接调整映射表即可。 + +## WordTag类别标签集合 + +WordTag共包含91种词性及专名类别标签,标签集合如下表 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+<table>
+    <thead>
+        <tr><th colspan='7'>WordTag标签集合</th></tr>
+    </thead>
+    <tbody>
+        <tr><td>人物类_实体</td><td>组织机构类_军事组织机构_概念</td><td>文化类_制度政策协议</td><td>位置方位</td><td>术语类_医药学术语</td><td>信息资料_性别</td><td>否定词</td></tr>
+        <tr><td>人物类_概念</td><td>组织机构类_医疗卫生机构</td><td>文化类_姓氏与人名</td><td>世界地区类</td><td>术语类_生物体</td><td>链接地址</td><td>数量词</td></tr>
+        <tr><td>作品类_实体</td><td>组织机构类_医疗卫生机构_概念</td><td>生物类</td><td>世界地区类_国家</td><td>疾病损伤类</td><td>个性特征</td><td>数量词_序数词</td></tr>
+        <tr><td>作品类_概念</td><td>组织机构类_教育组织机构</td><td>生物类_植物</td><td>世界地区类_区划概念</td><td>疾病损伤类_植物病虫害</td><td>感官特征</td><td>数量词_单位数量词</td></tr>
+        <tr><td>组织机构类</td><td>组织机构类_教育组织机构_概念</td><td>生物类_动物</td><td>世界地区类_地理概念</td><td>宇宙类</td><td>场景事件</td><td>叹词</td></tr>
+        <tr><td>组织机构类_概念</td><td>物体类</td><td>品牌名</td><td>饮食类</td><td>事件类</td><td>介词</td><td>拟声词</td></tr>
+        <tr><td>组织机构类_企事业单位</td><td>物体类_概念</td><td>品牌名_品牌类型</td><td>饮食类_菜品</td><td>时间类</td><td>介词_方位介词</td><td>修饰词</td></tr>
+        <tr><td>组织机构类_企事业单位_概念</td><td>物体类_兵器</td><td>场所类</td><td>饮食类_饮品</td><td>时间类_特殊日</td><td>助词</td><td>修饰词_性质</td></tr>
+        <tr><td>组织机构类_国家机关</td><td>物体类_化学物质</td><td>场所类_概念</td><td>药物类</td><td>时间类_朝代</td><td>代词</td><td>修饰词_类型</td></tr>
+        <tr><td>组织机构类_国家机关_概念</td><td>其他角色类</td><td>场所类_交通场所</td><td>药物类_中药</td><td>时间类_具体时间</td><td>连词</td><td>修饰词_化</td></tr>
+        <tr><td>组织机构类_体育组织机构</td><td>文化类</td><td>场所类_交通场所_概念</td><td>术语类</td><td>时间类_时长</td><td>副词</td><td>外语单词</td></tr>
+        <tr><td>组织机构类_体育组织机构_概念</td><td>文化类_语言文字</td><td>场所类_网上场所</td><td>术语类_术语类型</td><td>词汇用语</td><td>疑问词</td><td>汉语拼音</td></tr>
+        <tr><td>组织机构类_军事组织机构</td><td>文化类_奖项赛事活动</td><td>场所类_网上场所_概念</td><td>术语类_符号指标类</td><td>信息资料</td><td>肯定词</td><td>w(标点)</td></tr>
+    </tbody>
+</table>
+ + +## WordTag应用场景 + +参见"[解语的应用场景](../)" + + +## WordTag示例代码 +下面提供了WordTag模型进行文本到百科知识树链接的示例程序。 + +### Term-Linking示例程序 + +Term-Linking示例程序可以对无标签数据启动模型预测, 例如想对下面几段文本进行百科知识树的链接解析 +``` +"《孤女》是2010年九州出版社出版的小说,作者是余兼羽。", +"热梅茶是一道以梅子为主要原料制作的茶饮" +``` + +执行下面的脚本即可快速获取上面两段文本的百科知识树链接的结果 + +```python +from paddlenlp import Taskflow +wordtag = Taskflow("knowledge_mining", model="wordtag", linking=True) +wordtag(["热梅茶是一道以梅子为主要原料制作的茶饮", + "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"]) +# Support the input text directly +wordtag("热梅茶是一道以梅子为主要原料制作的茶饮") + +``` +下面是运行WordTag工具后的知识链接的预测结果 + +```json +[{'text': '《孤女》是2010年九州出版社出版的小说,作者是余兼羽。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '孤女', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 2, 'termid': '小说_eb_孤女'}, {'item': '》', 'offset': 3, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 4, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '2010年', 'offset': 5, 'wordtag_label': '时间类', 'length': 5, 'termid': '时间阶段_cb_2010年'}, {'item': '九州出版社', 'offset': 10, 'wordtag_label': '组织机构类', 'length': 5, 'termid': '组织机构_eb_九州出版社'}, {'item': '出版', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_出版'}, {'item': '的', 'offset': 17, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '小说', 'offset': 18, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '小说_cb_小说'}, {'item': ',', 'offset': 20, 'wordtag_label': 'w', 'length': 1}, {'item': '作者', 'offset': 21, 'wordtag_label': '人物类_概念', 'length': 2, 'termid': '人物_cb_作者'}, {'item': '是', 'offset': 23, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '余兼羽', 'offset': 24, 'wordtag_label': '人物类_实体', 'length': 3}, {'item': '。', 'offset': 27, 'wordtag_label': 'w', 'length': 1}]}, {'text': '热梅茶是一道以梅子为主要原料制作的茶饮', 'items': [{'item': '热梅茶', 'offset': 0, 'wordtag_label': '饮食类_饮品', 'length': 3}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '一道', 'offset': 4, 'wordtag_label': '数量词', 'length': 2}, {'item': '以', 'offset': 6, 'wordtag_label': '介词', 'length': 1, 'termid': '介词_cb_以'}, {'item': '梅子', 'offset': 7, 'wordtag_label': '饮食类', 'length': 2, 'termid': '饮食_cb_梅'}, {'item': '为', 'offset': 9, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_为'}, {'item': '主要原料', 'offset': 10, 'wordtag_label': '物体类', 'length': 4, 'termid': '物品_cb_主要原料'}, {'item': '制作', 'offset': 14, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_制作'}, {'item': '的', 'offset': 16, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '茶饮', 'offset': 17, 'wordtag_label': '饮食类_饮品', 'length': 2, 'termid': '饮品_cb_茶饮'}]}] +{'text': '热梅茶是一道以梅子为主要原料制作的茶饮', 'items': [{'item': '热梅茶', 'offset': 0, 'wordtag_label': '饮食类_饮品', 'length': 3}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '一道', 'offset': 4, 'wordtag_label': '数量词', 'length': 2}, {'item': '以', 'offset': 6, 'wordtag_label': '介词', 'length': 1, 'termid': '介词_cb_以'}, {'item': '梅子', 'offset': 7, 'wordtag_label': '饮食类', 'length': 2, 'termid': '饮食_cb_梅'}, {'item': '为', 'offset': 9, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_为'}, {'item': '主要原料', 'offset': 10, 'wordtag_label': '物体类', 'length': 4, 'termid': '物品_cb_主要原料'}, {'item': '制作', 'offset': 14, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_制作'}, {'item': '的', 'offset': 16, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '茶饮', 'offset': 17, 'wordtag_label': '饮食类_饮品', 'length': 2, 'termid': 
'饮品_cb_茶饮'}]} +``` + +同时我们也提供了基于上述taskflow的python执行脚本,具体的执行方式如下: +```shell +python predict.py --max_seq_len 128 --batch_size 2 +``` +其中参数释义如下: +- `max_seq_len` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每个预测批次的样本数目。 + +## WordTag进阶使用 + +### 自定义模型一键预测 + +用户可以使用自有数据对WordTag模型进行增量训练,然后使用Taskflow进行一键预测,参见[WordTag增量训练示例](../ernie-ctm)。 + +### 自定义Term-Linking + +Taskflow默认使用TermTreeV1.0实现Term-Linking, 用户也可以基于自己的TermTree实现Term-Linking,参见[自定义TermTree](../termtree)。 + +## Release Note + +- 2022.06:新增25个细化词类,用于下游挖掘任务 + +## WordTag后续计划 + +1. 持续优化知识标注模型,获得更加精准的标注结果; +2. 发布多粒度、多种参数规模的知识标注模型; +3. 提供细粒度term及subterm消歧的解决方案。 + + +## 在论文中引用WordTag + +如果您的工作成果中使用了WordTag,请增加下述引用。我们非常乐于看到WordTag对您的工作带来帮助。 +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + + +## 问题与反馈 + +WordTag在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/wordtag/predict.py b/examples/text_to_knowledge/wordtag/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5f4fe353782dfaab4206a0f79412a270c570e364 --- /dev/null +++ b/examples/text_to_knowledge/wordtag/predict.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp import Taskflow + + +def parse_args(): + parser = argparse.ArgumentParser() + + # fmt: off + parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", ) + parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.", ) + parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + # fmt: on + + args = parser.parse_args() + return args + + +def do_predict(args): + paddle.set_device(args.device) + wordtag = Taskflow( + "knowledge_mining", model="wordtag", batch_size=args.batch_size, max_seq_length=args.max_seq_len, linking=True + ) + txts = ["《孤女》是2010年九州出版社出版的小说,作者是余兼羽。", "热梅茶是一道以梅子为主要原料制作的茶饮"] + res = wordtag(txts) + print(res) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_predict(args) diff --git a/examples/text_to_sql/IGSQL/README.md b/examples/text_to_sql/IGSQL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7ae994dbaba1ef4bb9a0cc30aec578736a6658a7 --- /dev/null +++ b/examples/text_to_sql/IGSQL/README.md @@ -0,0 +1,126 @@ +# IGSQL: Database Schema Interaction Graph Based Neural Model for Context-Dependent Text-to-SQL Generation + +## 上下文相关的 Text2SQL 任务 + +语义解析是一种交互式分析技术,其将用户输入的自然语言表述转成可操作执行的语义表示形式,如逻辑表达式(如一阶逻辑表示,lambda表示等)、编程语言(如SQL、python等)、数学公式等。 + +Text2SQL 是语义解析技术中的一类任务,让机器自动将用户输入的自然语言问题转成可与数据库交互的 SQL 查询语言,实现基于数据库的自动问答能力。 + +上下文相关的 Text2SQL 则指在多轮问答、对话等场景中,对问题的解析除了依赖当前轮次的输入语句,往往同时依赖于上文中的用户语句和系统答复等,即要求模型具备上下文的感知(建模)能力,才可以更好地完成 SQL 生成的任务。这种多轮交互的方式更符合人类的行为习惯,所以上下文相关的 Text2SQL 解析技术也日益受到重视,成为学术界、工业界的研究重点和应用方向。 + +## 数据集 + +当前学术界主流的上下文相关的 Text2SQL 数据集包括[SParC](https://yale-lily.github.io/sparc)、[CoSQL](https://yale-lily.github.io/cosql) 等,详细说明可参见上述链接页面及相应的论文。 + +## 基线系统 +本系统基于 PaddlePaddle 动态图复现了 [IGSQL](https://github.com/headacheboy/IGSQL) 模型,其核心是基于预训练模型(ERNIE、BERT等)和LSTM的基础 Encoder,以及针对多轮场景的交互 Schema Encoder 和上下文句子 Encoder,而解码端则是在 EditSQL 基础上扩展而来的、基于门控机制和拷贝机制的 SQL 序列生成 Decoder。 + +# 环境准备 +代码运行需要 Linux 主机,Python 3.7 和 PaddlePaddle 2.1 以上版本。 + +## 推荐的环境 + +* 操作系统 CentOS 7.5 +* Python 3.7.9 +* PaddlePaddle develop + +除此之外,强烈建议使用支持 GPU 的硬件环境。 + +## PaddlePaddle + +可根据机器情况和个人需求在 PaddlePaddle 和 PaddlePaddle-GPU 中二选一安装。 +如果机器支持GPU,则建议安装GPU版本。 + +关于 PaddlePaddle 的安装教程、使用方法等请参考[官方文档](https://www.paddlepaddle.org.cn/#quick-start). 
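+
+下面给出一个基于 pip 的安装命令示意(仅供参考,具体的安装命令、版本与 CUDA 适配请以上述官方文档为准):
+
+```bash
+# CPU 版本
+pip install paddlepaddle
+
+# GPU 版本(需要本机已具备相应的 CUDA 运行环境)
+pip install paddlepaddle-gpu
+```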
+ +## 第三方 Python 库 +除 PaddlePaddle 及其依赖之外,还依赖其它第三方 Python 库,位于代码根目录的 requirements.txt 文件中。 + +可使用 pip 一键安装 + +```pip install -r requirements.txt``` + +# 数据准备 + +```bash +# 下载模型训练、测试数据 +# 得到的sparc,cosql 两个数据集 +wget https://bj.bcebos.com/paddlenlp/paddlenlp/resource/igsql_data.tar.gz +tar xzvf igsql_data.tar.gz +# 下载glove词向量 +wget http://nlp.stanford.edu/data/glove.840B.300d.zip +unzip glove.840B.300d.zip +``` + +# 数据预处理 + +对原始数据进行数据预处理,以适配模型的输入,以sparc为例: + +```bash +python preprocess.py --dataset=sparc --remove_from +``` + +## 训练 + +以训练sparc模型为例: + +```bash +python run.py --raw_train_filename="data/sparc_data_removefrom/train.pkl" \ + --raw_validation_filename="data/sparc_data_removefrom/dev.pkl" \ + --database_schema_filename="data/sparc_data_removefrom/tables.json" \ + --embedding_filename="glove.840B.300d.txt" \ + --data_directory="processed_data_sparc_removefrom" \ + --logdir="logs_sparc" \ + --train=True \ + --evaluate=True +``` + +参数说明: +* raw_train_filename, raw_validation_filename, database_schema_filename: 数据集文件路径。 +* embedding_filename: GLOVE 词向量文件路径。 +* data_directory: 预处理得到的文件夹路径。 +* logdir: 输出日志文件夹路径。 +* train,evaluate: 是否执行train,evaluate。 + + +### 训练阶段的输出日志 +训练过程会输出loss、acc相关日志,内容类似: +``` +total_gold_tokens:13, step:5981================================= ] 99% ETA: 0:00:03 +LOSS:0.4242228865623474 +train [==================================] 100% Time: 1:20:22 +Predicting with file logs_sparc/train-eval_predictions.json +logs_sparc/train-eval[==================================] 100% Time: 0:01:30 +Predicting with file logs_sparc/valid-eval_predictions.json +logs_sparc/valid-eval[==================================]100% Time: 0:04:53 +``` + +## 预测 + +以预测sparc数据集为例: + +```bash +python run.py --raw_train_filename="data/sparc_data_removefrom/train.pkl" \ + --raw_validation_filename="data/sparc_data_removefrom/dev.pkl" \ + --database_schema_filename="data/sparc_data_removefrom/tables.json" \ + --embedding_filename="glove.840B.300d.txt" \ + --data_directory="processed_data_sparc_removefrom" \ + --logdir="logs_sparc_eval" \ + --evaluate=True \ + --save_file="logs_sparc/best_model" +``` + +参数说明: +* save_file: 加载的模型路径,请修改为真实的模型加载路径。 + +执行完上述命令后,预测结果保存在 "logs_sparc_eval/valid_use_gold_queries_predictions.json"。 + +# 评估 + +执行以下命令获得评估结果: + +```bash +python postprocess_eval.py --dataset=sparc --split=dev --pred_file logs_sparc_eval/valid_use_gold_queries_predictions.json --remove_from +``` + +其中的 --pred_file 参数请修改为真实的模型预测输出路径,评估结果保存在 "logs_sparc_eval/valid_use_gold_queries_predictions.json.eval"。 diff --git a/examples/text_to_sql/IGSQL/data_util/anonymization.py b/examples/text_to_sql/IGSQL/data_util/anonymization.py new file mode 100644 index 0000000000000000000000000000000000000000..61686f65d69360fcbccfabf41582a568766703ff --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/anonymization.py @@ -0,0 +1,274 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
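+# Usage sketch (illustration only, not part of the original module): the file
+# path and the "sql" key below follow the defaults visible elsewhere in this
+# repo (e.g. data_util/atis_data.py and atis_batch.py); adjust them as needed.
+#   anonymizer = Anonymizer("data/anonymization.txt")
+#   anon_seq = anonymizer.anonymize(sql_tokens, tok_to_entity_dict, key="sql", add_new_anon_toks=True)
+#   raw_seq = deanonymize(anon_seq, tok_to_entity_dict, key="sql")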
+"""Code for identifying and anonymizing entities in NL and SQL.""" + +import copy +import json + +from . import util + +ENTITY_NAME = "ENTITY" +CONSTANT_NAME = "CONSTANT" +TIME_NAME = "TIME" +SEPARATOR = "#" + + +def timeval(string): + """Returns the numeric version of a time. + + Args: + string (`str`): String representing a time. + + Returns: + `str`: String representing the absolute time. + """ + if string.endswith("am") or string.endswith("pm") and string[:-2].isdigit(): + numval = int(string[:-2]) + if len(string) == 3 or len(string) == 4: + numval *= 100 + if string.endswith("pm"): + numval += 1200 + return str(numval) + return "" + + +def is_time(string): + """Returns whether a string represents a time. + + Args: + string (str): String to check. + + Returns: + `bool`: Whether the string represents a time. + """ + if string.endswith("am") or string.endswith("pm"): + if string[:-2].isdigit(): + return True + + return False + + +def deanonymize(sequence, ent_dict, key): + """Deanonymizes a sequence. + + Args: + sequence (`list`): List of tokens to deanonymize. + ent_dict (`dict`): Maps from tokens to the entity dictionary. + key (`str`): The key to use, in this case either natural language or SQL. + + Returns: + `list`: Deanonymized sequence of tokens. + """ + new_sequence = [] + for token in sequence: + if token in ent_dict: + new_sequence.extend(ent_dict[token][key]) + else: + new_sequence.append(token) + + return new_sequence + + +class Anonymizer: + """Anonymization class for keeping track of entities in this domain and + scripts for anonymizing/deanonymizing. + + Attributes: + anonymization_map (`list`): Containing entities from + the anonymization file. + entity_types (`list`): All entities in the anonymization file. + keys (`set`): Possible keys (types of text handled); in this case it should be + one for natural language and another for SQL. + entity_set (`set`): entity_types as a set. + """ + + def __init__(self, filename): + self.anonymization_map = [] + self.entity_types = [] + self.keys = set() + + pairs = [json.loads(line) for line in open(filename).readlines()] + for pair in pairs: + for key in pair: + if key != "type": + self.keys.add(key) + self.anonymization_map.append(pair) + if pair["type"] not in self.entity_types: + self.entity_types.append(pair["type"]) + + self.entity_types.append(ENTITY_NAME) + self.entity_types.append(CONSTANT_NAME) + self.entity_types.append(TIME_NAME) + + self.entity_set = set(self.entity_types) + + def get_entity_type_from_token(self, token): + """Gets the type of an entity given an anonymized token. + + Args: + token (`str`): The entity token. + + Returns: + `str`: representing the type of the entity. + """ + # these are in the pattern NAME:#, so just strip the thing after the + # colon + colon_loc = token.index(SEPARATOR) + entity_type = token[:colon_loc] + assert entity_type in self.entity_set + + return entity_type + + def is_anon_tok(self, token): + """Returns whether a token is an anonymized token or not. + + Args: + token (`str`): The token to check. + + Returns: + `bool`: whether the token is an anonymized token. + """ + return token.split(SEPARATOR)[0] in self.entity_set + + def get_anon_id(self, token): + """Gets the entity index (unique ID) for a token. + + Args: + token (`str`): The token to get the index from. + + Returns: + `int`: the token ID if it is an anonymized token; otherwise -1. 
+ """ + if self.is_anon_tok(token): + return self.entity_types.index(token.split(SEPARATOR)[0]) + else: + return -1 + + def anonymize(self, sequence, tok_to_entity_dict, key, add_new_anon_toks=False): + """Anonymizes a sequence. + + Args: + sequence (`list`): Sequence to anonymize. + tok_to_entity_dict (`dict`): Existing dictionary mapping from anonymized + tokens to entities. + key (`str`): Which kind of text this is (natural language or SQL) + add_new_anon_toks (`bool`): Whether to add new entities to tok_to_entity_dict. + + Returns: + `list`: The anonymized sequence. + """ + # Sort the token-tok-entity dict by the length of the modality. + sorted_dict = sorted(tok_to_entity_dict.items(), key=lambda k: len(k[1][key]))[::-1] + + anonymized_sequence = copy.deepcopy(sequence) + + if add_new_anon_toks: + type_counts = {} + for entity_type in self.entity_types: + type_counts[entity_type] = 0 + for token in tok_to_entity_dict: + entity_type = self.get_entity_type_from_token(token) + type_counts[entity_type] += 1 + + # First find occurrences of things in the anonymization dictionary. + for token, modalities in sorted_dict: + our_modality = modalities[key] + + # Check if this key's version of the anonymized thing is in our + # sequence. + while util.subsequence(our_modality, anonymized_sequence): + found = False + for startidx in range(len(anonymized_sequence) - len(our_modality) + 1): + if anonymized_sequence[startidx : startidx + len(our_modality)] == our_modality: + anonymized_sequence = ( + anonymized_sequence[:startidx] + + [token] + + anonymized_sequence[startidx + len(our_modality) :] + ) + found = True + break + assert found, ( + "Thought " + str(our_modality) + " was in [" + str(anonymized_sequence) + "] but could not find it" + ) + + # Now add new keys if they are present. + if add_new_anon_toks: + + # For every span in the sequence, check whether it is in the anon map + # for this modality + sorted_anon_map = sorted(self.anonymization_map, key=lambda k: len(k[key]))[::-1] + + for pair in sorted_anon_map: + our_modality = pair[key] + + token_type = pair["type"] + new_token = token_type + SEPARATOR + str(type_counts[token_type]) + + while util.subsequence(our_modality, anonymized_sequence): + found = False + for startidx in range(len(anonymized_sequence) - len(our_modality) + 1): + if anonymized_sequence[startidx : startidx + len(our_modality)] == our_modality: + if new_token not in tok_to_entity_dict: + type_counts[token_type] += 1 + tok_to_entity_dict[new_token] = pair + + anonymized_sequence = ( + anonymized_sequence[:startidx] + + [new_token] + + anonymized_sequence[startidx + len(our_modality) :] + ) + found = True + break + assert found, ( + "Thought " + + str(our_modality) + + " was in [" + + str(anonymized_sequence) + + "] but could not find it" + ) + + # Also replace integers with constants + for index, token in enumerate(anonymized_sequence): + if token.isdigit() or is_time(token): + if token.isdigit(): + entity_type = CONSTANT_NAME + value = new_token + if is_time(token): + entity_type = TIME_NAME + value = timeval(token) + + # First try to find the constant in the entity dictionary already, + # and get the name if it's found. 
+ new_token = "" + new_dict = {} + found = False + for entity, value in tok_to_entity_dict.items(): + if value[key][0] == token: + new_token = entity + new_dict = value + found = True + break + + if not found: + new_token = entity_type + SEPARATOR + str(type_counts[entity_type]) + new_dict = {} + for tempkey in self.keys: + new_dict[tempkey] = [token] + + tok_to_entity_dict[new_token] = new_dict + type_counts[entity_type] += 1 + + anonymized_sequence[index] = new_token + + return anonymized_sequence diff --git a/examples/text_to_sql/IGSQL/data_util/atis_batch.py b/examples/text_to_sql/IGSQL/data_util/atis_batch.py new file mode 100644 index 0000000000000000000000000000000000000000..f0eb9351ad3f1696353dbbed87e7629a06396bd7 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/atis_batch.py @@ -0,0 +1,334 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy + +from . import snippets as snip +from . import sql_util +from . import vocabulary as vocab + + +class UtteranceItem: + def __init__(self, interaction, index): + self.interaction = interaction + self.utterance_index = index + + def __str__(self): + return str(self.interaction.utterances[self.utterance_index]) + + def histories(self, maximum): + if maximum > 0: + history_seqs = [] + for utterance in self.interaction.utterances[: self.utterance_index]: + history_seqs.append(utterance.input_seq_to_use) + + if len(history_seqs) > maximum: + history_seqs = history_seqs[-maximum:] + + return history_seqs + return [] + + def input_sequence(self): + return self.interaction.utterances[self.utterance_index].input_seq_to_use + + def previous_query(self): + if self.utterance_index == 0: + return [] + return self.interaction.utterances[self.utterance_index - 1].anonymized_gold_query + + def anonymized_gold_query(self): + return self.interaction.utterances[self.utterance_index].anonymized_gold_query + + def snippets(self): + return self.interaction.utterances[self.utterance_index].available_snippets + + def original_gold_query(self): + return self.interaction.utterances[self.utterance_index].original_gold_query + + def contained_entities(self): + return self.interaction.utterances[self.utterance_index].contained_entities + + def original_gold_queries(self): + return [q[0] for q in self.interaction.utterances[self.utterance_index].all_gold_queries] + + def gold_tables(self): + return [q[1] for q in self.interaction.utterances[self.utterance_index].all_gold_queries] + + def gold_query(self): + return self.interaction.utterances[self.utterance_index].gold_query_to_use + [vocab.EOS_TOK] + + def gold_edit_sequence(self): + return self.interaction.utterances[self.utterance_index].gold_edit_sequence + + def gold_table(self): + return self.interaction.utterances[self.utterance_index].gold_sql_results + + def all_snippets(self): + return self.interaction.snippets + + def within_limits(self, max_input_length=float("inf"), max_output_length=float("inf")): + return 
self.interaction.utterances[self.utterance_index].length_valid(max_input_length, max_output_length) + + def expand_snippets(self, sequence): + # Remove the EOS + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + # First remove the snippets + no_snippets_sequence = self.interaction.expand_snippets(sequence) + no_snippets_sequence = sql_util.fix_parentheses(no_snippets_sequence) + return no_snippets_sequence + + def flatten_sequence(self, sequence): + # Remove the EOS + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + # First remove the snippets + no_snippets_sequence = self.interaction.expand_snippets(sequence) + + # Deanonymize + deanon_sequence = self.interaction.deanonymize(no_snippets_sequence, "sql") + return deanon_sequence + + +class UtteranceBatch: + def __init__(self, items): + self.items = items + + def __len__(self): + return len(self.items) + + def start(self): + self.index = 0 + + def next(self): + item = self.items[self.index] + self.index += 1 + return item + + def done(self): + return self.index >= len(self.items) + + +class PredUtteranceItem: + def __init__(self, input_sequence, interaction_item, previous_query, index, available_snippets): + self.input_seq_to_use = input_sequence + self.interaction_item = interaction_item + self.index = index + self.available_snippets = available_snippets + self.prev_pred_query = previous_query + + def input_sequence(self): + return self.input_seq_to_use + + def histories(self, maximum): + histories = [] + if maximum == 0: + return histories + for utterance in self.interaction_item.processed_utterances[: self.index]: + histories.append(utterance.input_sequence()) + if len(histories) > maximum: + histories = histories[-maximum:] + return histories + + def snippets(self): + return self.available_snippets + + def previous_query(self): + return self.prev_pred_query + + def flatten_sequence(self, sequence): + return self.interaction_item.flatten_sequence(sequence) + + def remove_snippets(self, sequence): + return sql_util.fix_parentheses(self.interaction_item.expand_snippets(sequence)) + + def set_predicted_query(self, query): + self.anonymized_pred_query = query + + +class InteractionItem: + def __init__( + self, + interaction, + max_input_length=float("inf"), + max_output_length=float("inf"), + nl_to_sql_dict={}, + maximum_length=float("inf"), + ): + if maximum_length != float("inf"): + self.interaction = copy.deepcopy(interaction) + self.interaction.utterances = self.interaction.utterances[:maximum_length] + else: + self.interaction = interaction + self.processed_utterances = [] + self.snippet_bank = [] + self.identifier = self.interaction.identifier + + self.max_input_length = max_input_length + self.max_output_length = max_output_length + + self.nl_to_sql_dict = nl_to_sql_dict + + self.index = 0 + + def __len__(self): + return len(self.interaction) + + def __str__(self): + s = "Utterances, gold queries, and predictions:\n" + for i, utterance in enumerate(self.interaction.utterances): + s += " ".join(utterance.input_seq_to_use) + "\n" + pred_utterance = self.processed_utterances[i] + s += " ".join(pred_utterance.gold_query()) + "\n" + s += " ".join(pred_utterance.anonymized_query()) + "\n" + s += "\n" + s += "Snippets:\n" + for snippet in self.snippet_bank: + s += str(snippet) + "\n" + + return s + + def start_interaction(self): + assert len(self.snippet_bank) == 0 + assert len(self.processed_utterances) == 0 + assert self.index == 0 + + def next_utterance(self): + utterance = 
self.interaction.utterances[self.index] + self.index += 1 + + available_snippets = self.available_snippets(snippet_keep_age=1) + + return PredUtteranceItem( + utterance.input_seq_to_use, + self, + self.processed_utterances[-1].anonymized_pred_query if len(self.processed_utterances) > 0 else [], + self.index - 1, + available_snippets, + ) + + def done(self): + return len(self.processed_utterances) == len(self.interaction) + + def finish(self): + self.snippet_bank = [] + self.processed_utterances = [] + self.index = 0 + + def utterance_within_limits(self, utterance_item): + return utterance_item.within_limits(self.max_input_length, self.max_output_length) + + def available_snippets(self, snippet_keep_age): + return [snippet for snippet in self.snippet_bank if snippet.index <= snippet_keep_age] + + def gold_utterances(self): + utterances = [] + for i, utterance in enumerate(self.interaction.utterances): + utterances.append(UtteranceItem(self.interaction, i)) + return utterances + + def get_schema(self): + return self.interaction.schema + + def add_utterance(self, utterance, predicted_sequence, snippets=None, previous_snippets=[], simple=False): + if not snippets: + self.add_snippets(predicted_sequence, previous_snippets=previous_snippets, simple=simple) + else: + for snippet in snippets: + snippet.assign_id(len(self.snippet_bank)) + self.snippet_bank.append(snippet) + + for snippet in self.snippet_bank: + snippet.increase_age() + self.processed_utterances.append(utterance) + + def add_snippets(self, sequence, previous_snippets=[], simple=False): + if sequence: + if simple: + snippets = sql_util.get_subtrees_simple(sequence, oldsnippets=previous_snippets) + else: + snippets = sql_util.get_subtrees(sequence, oldsnippets=previous_snippets) + for snippet in snippets: + snippet.assign_id(len(self.snippet_bank)) + self.snippet_bank.append(snippet) + + for snippet in self.snippet_bank: + snippet.increase_age() + + def expand_snippets(self, sequence): + return sql_util.fix_parentheses(snip.expand_snippets(sequence, self.snippet_bank)) + + def remove_snippets(self, sequence): + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + no_snippets_sequence = self.expand_snippets(sequence) + no_snippets_sequence = sql_util.fix_parentheses(no_snippets_sequence) + return no_snippets_sequence + + def flatten_sequence(self, sequence, gold_snippets=False): + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + if gold_snippets: + no_snippets_sequence = self.interaction.expand_snippets(sequence) + else: + no_snippets_sequence = self.expand_snippets(sequence) + no_snippets_sequence = sql_util.fix_parentheses(no_snippets_sequence) + + deanon_sequence = self.interaction.deanonymize(no_snippets_sequence, "sql") + return deanon_sequence + + def gold_query(self, index): + return self.interaction.utterances[index].gold_query_to_use + [vocab.EOS_TOK] + + def original_gold_query(self, index): + return self.interaction.utterances[index].original_gold_query + + def gold_table(self, index): + return self.interaction.utterances[index].gold_sql_results + + +class InteractionBatch: + def __init__(self, items): + self.items = items + + def __len__(self): + return len(self.items) + + def start(self): + self.timestep = 0 + self.current_interactions = [] + + def get_next_utterance_batch(self, snippet_keep_age, use_gold=False): + items = [] + self.current_interactions = [] + for interaction in self.items: + if self.timestep < len(interaction): + utterance_item = 
interaction.original_utterances(snippet_keep_age, use_gold)[self.timestep] + self.current_interactions.append(interaction) + items.append(utterance_item) + + self.timestep += 1 + return UtteranceBatch(items) + + def done(self): + finished = True + for interaction in self.items: + if self.timestep < len(interaction): + finished = False + return finished + return finished diff --git a/examples/text_to_sql/IGSQL/data_util/atis_data.py b/examples/text_to_sql/IGSQL/data_util/atis_data.py new file mode 100644 index 0000000000000000000000000000000000000000..6ac6ec6ba20f266cb7d42c6f75bb2e7b158d70f2 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/atis_data.py @@ -0,0 +1,495 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Utility functions for loading and processing ATIS data.""" + +import os +import random +import json + +from . import anonymization as anon +from . import atis_batch +from . import dataset_split as ds +from .interaction import load_function +from .entities import NLtoSQLDict +from .atis_vocab import ATISVocabulary + +ENTITIES_FILENAME = "data/entities.txt" +ANONYMIZATION_FILENAME = "data/anonymization.txt" + + +class ATISDataset: + """Contains the ATIS data.""" + + def __init__(self, params): + self.anonymizer = None + if params.anonymize: + self.anonymizer = anon.Anonymizer(ANONYMIZATION_FILENAME) + + if not os.path.exists(params.data_directory): + os.mkdir(params.data_directory) + + self.entities_dictionary = NLtoSQLDict(ENTITIES_FILENAME) + + database_schema = None + if params.database_schema_filename: + if "removefrom" not in params.data_directory: + ( + database_schema, + column_names_surface_form, + column_names_embedder_input, + ) = self.read_database_schema_simple(params.database_schema_filename) + else: + database_schema, column_names_surface_form, column_names_embedder_input = self.read_database_schema( + params.database_schema_filename + ) + + int_load_function = load_function( + params, self.entities_dictionary, self.anonymizer, database_schema=database_schema + ) + + def collapse_list(the_list): + """Collapses a list of list into a single list.""" + return [s for i in the_list for s in i] + + if "atis" not in params.data_directory: + self.train_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_train_filename), + params.raw_train_filename, + int_load_function, + ) + self.valid_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_validation_filename), + params.raw_validation_filename, + int_load_function, + ) + + s = set() + for ele in self.train_data.examples: + s.add(ele.schema.table_schema["db_id"]) + db_id_ls = list(s) + db_id_ls.sort() + self.id2db = db_id_ls + self.db2id = {} + for i in range(len(self.id2db)): + self.db2id[self.id2db[i]] = i + + train_input_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.input_seqs())) + valid_input_seqs = collapse_list(self.valid_data.get_ex_properties(lambda i: i.input_seqs())) + + 
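+            # For the non-ATIS splits, the input vocabulary below is built from both the train and validation utterances.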
all_input_seqs = train_input_seqs + valid_input_seqs + + self.input_vocabulary = ATISVocabulary( + all_input_seqs, + os.path.join(params.data_directory, params.input_vocabulary_filename), + params, + is_input="input", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + self.output_vocabulary_schema = ATISVocabulary( + column_names_embedder_input, + os.path.join(params.data_directory, "schema_" + params.output_vocabulary_filename), + params, + is_input="schema", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + train_output_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.output_seqs())) + valid_output_seqs = collapse_list(self.valid_data.get_ex_properties(lambda i: i.output_seqs())) + all_output_seqs = train_output_seqs + valid_output_seqs + + sql_keywords = [ + ".", + "t1", + "t2", + "=", + "select", + "as", + "join", + "on", + ")", + "(", + "where", + "t3", + "by", + ",", + "group", + "distinct", + "t4", + "and", + "limit", + "desc", + ">", + "avg", + "having", + "max", + "in", + "<", + "sum", + "t5", + "intersect", + "not", + "min", + "except", + "or", + "asc", + "like", + "!", + "union", + "between", + "t6", + "-", + "t7", + "+", + "/", + ] + sql_keywords += ["count", "from", "value", "order"] + sql_keywords += ["group_by", "order_by", "limit_value", "!="] + + # skip column_names_surface_form but keep sql_keywords + skip_tokens = list(set(column_names_surface_form) - set(sql_keywords)) + + if params.data_directory == "processed_data_sparc_removefrom": + all_output_seqs = [] + out_vocab_ordered = [ + "select", + "value", + ")", + "(", + "where", + "=", + ",", + "count", + "group_by", + "order_by", + "limit_value", + "desc", + ">", + "distinct", + "avg", + "and", + "having", + "<", + "in", + "max", + "sum", + "asc", + "like", + "not", + "or", + "min", + "intersect", + "except", + "!=", + "union", + "between", + "-", + "+", + ] + for i in range(len(out_vocab_ordered)): + all_output_seqs.append(out_vocab_ordered[: i + 1]) + + self.output_vocabulary = ATISVocabulary( + all_output_seqs, + os.path.join(params.data_directory, params.output_vocabulary_filename), + params, + is_input="output", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + skip=skip_tokens, + ) + else: + self.train_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_train_filename), + params.raw_train_filename, + int_load_function, + ) + if params.train: + self.valid_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_validation_filename), + params.raw_validation_filename, + int_load_function, + ) + if params.evaluate or params.attention: + self.dev_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_dev_filename), + params.raw_dev_filename, + int_load_function, + ) + if params.enable_testing: + self.test_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_test_filename), + params.raw_test_filename, + int_load_function, + ) + + train_input_seqs = [] + train_input_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.input_seqs())) + + self.input_vocabulary = ATISVocabulary( + train_input_seqs, + os.path.join(params.data_directory, params.input_vocabulary_filename), + params, + is_input="input", + min_occur=2, + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + train_output_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.output_seqs())) + + self.output_vocabulary 
= ATISVocabulary( + train_output_seqs, + os.path.join(params.data_directory, params.output_vocabulary_filename), + params, + is_input="output", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + self.output_vocabulary_schema = None + + def read_database_schema_simple(self, database_schema_filename): + with open(database_schema_filename, "r") as f: + database_schema = json.load(f) + + database_schema_dict = {} + column_names_surface_form = [] + column_names_embedder_input = [] + for table_schema in database_schema: + db_id = table_schema["db_id"] + database_schema_dict[db_id] = table_schema + + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names = table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + + for i, (table_id, column_name) in enumerate(column_names_original): + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + + for table_name in table_names_original: + column_names_surface_form.append(table_name.lower()) + + for i, (table_id, column_name) in enumerate(column_names): + column_name_embedder_input = column_name + column_names_embedder_input.append(column_name_embedder_input.split()) + + for table_name in table_names: + column_names_embedder_input.append(table_name.split()) + + database_schema = database_schema_dict + + return database_schema, column_names_surface_form, column_names_embedder_input + + def read_database_schema(self, database_schema_filename): + with open(database_schema_filename, "r") as f: + database_schema = json.load(f) + + database_schema_dict = {} + column_names_surface_form = [] + column_names_embedder_input = [] + for table_schema in database_schema: + db_id = table_schema["db_id"] + database_schema_dict[db_id] = table_schema + + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names = table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + + # also add table_name.* + for table_name in table_names_original: + column_names_surface_form.append("{}.*".format(table_name.lower())) + + for i, (table_id, column_name) in enumerate(column_names): + if table_id >= 0: + table_name = table_names[table_id] + column_name_embedder_input = table_name + " . " + column_name + else: + column_name_embedder_input = column_name + column_names_embedder_input.append(column_name_embedder_input.split()) + + for table_name in table_names: + column_name_embedder_input = table_name + " . 
*" + column_names_embedder_input.append(column_name_embedder_input.split()) + + database_schema = database_schema_dict + + return database_schema, column_names_surface_form, column_names_embedder_input + + def get_all_utterances(self, dataset, max_input_length=float("inf"), max_output_length=float("inf")): + """Returns all utterances in a dataset.""" + items = [] + for interaction in dataset.examples: + for i, utterance in enumerate(interaction.utterances): + if utterance.length_valid(max_input_length, max_output_length): + items.append(atis_batch.UtteranceItem(interaction, i)) + return items + + def get_all_interactions( + self, + dataset, + max_interaction_length=float("inf"), + max_input_length=float("inf"), + max_output_length=float("inf"), + sorted_by_length=False, + ): + """Gets all interactions in a dataset that fit the criteria. + + Args: + dataset (`ATISDatasetSplit`): The dataset to use. + max_interaction_length (`int`): Maximum interaction length to keep. + max_input_length (`int`): Maximum input sequence length to keep. + max_output_length (`int`): Maximum output sequence length to keep. + sorted_by_length (`bool`): Whether to sort the examples by interaction length. + + Returns: + `list`: All interactions. + """ + ints = [ + atis_batch.InteractionItem( + interaction, max_input_length, max_output_length, self.entities_dictionary, max_interaction_length + ) + for interaction in dataset.examples + ] + if sorted_by_length: + return sorted(ints, key=lambda x: len(x))[::-1] + else: + return ints + + def get_utterance_batches( + self, batch_size, max_input_length=float("inf"), max_output_length=float("inf"), randomize=False + ): + """Gets batches of utterances in the data. + + Args: + batch_size (`int`): Batch size to use. + max_input_length (`int`): Maximum length of input to keep. + max_output_length (`int`): Maximum length of output to use. + randomize (`bool`): Whether to randomize the ordering. + + Returns: + `list`: Batches of utterances. + """ + # First, get all interactions and the positions of the utterances that are + # possible in them. + items = self.get_all_utterances(self.train_data, max_input_length, max_output_length) + # if randomize: + # random.shuffle(items) + + batches = [] + + current_batch_items = [] + for item in items: + if len(current_batch_items) >= batch_size: + batches.append(atis_batch.UtteranceBatch(current_batch_items)) + current_batch_items = [] + current_batch_items.append(item) + batches.append(atis_batch.UtteranceBatch(current_batch_items)) + + assert sum([len(batch) for batch in batches]) == len(items) + + return batches + + def get_interaction_batches( + self, + batch_size, + max_interaction_length=float("inf"), + max_input_length=float("inf"), + max_output_length=float("inf"), + randomize=False, + ): + """Gets batches of interactions in the data. + + Args: + batch_size (`int`): Batch size to use. + max_interaction_length (`int`): Maximum length of interaction to keep + max_input_length (`int`): Maximum length of input to keep. + max_output_length (`int`): Maximum length of output to keep. + randomize (`bool`): Whether to randomize the ordering. + + Returns: + `list`: Batches of interactions. 
+ + """ + items = self.get_all_interactions( + self.train_data, + max_interaction_length, + max_input_length, + max_output_length, + sorted_by_length=not randomize, + ) + if randomize: + random.shuffle(items) + + batches = [] + current_batch_items = [] + for item in items: + if len(current_batch_items) >= batch_size: + batches.append(atis_batch.InteractionBatch(current_batch_items)) + current_batch_items = [] + current_batch_items.append(item) + batches.append(atis_batch.InteractionBatch(current_batch_items)) + + assert sum([len(batch) for batch in batches]) == len(items) + + return batches + + def get_random_utterances(self, num_samples, max_input_length=float("inf"), max_output_length=float("inf")): + """Gets a random selection of utterances in the data. + + Args: + num_samples (`bool`): Number of random utterances to get. + max_input_length (`int`): Limit of input length. + max_output_length (`int`): Limit on output length. + + Returns: + `list`: A random selection of utterances. + """ + items = self.get_all_utterances(self.train_data, max_input_length, max_output_length) + random.shuffle(items) + return items[:num_samples] + + def get_random_interactions( + self, + num_samples, + max_interaction_length=float("inf"), + max_input_length=float("inf"), + max_output_length=float("inf"), + ): + """Gets a random selection of interactions in the data. + + Args: + num_samples (`bool`): Number of random interactions to get. + max_input_length (`int`): Limit of input length. + max_output_length (`int`): Limit on output length. + + Returns: + A random selection of interactions. + """ + items = self.get_all_interactions(self.train_data, max_interaction_length, max_input_length, max_output_length) + # random.shuffle(items) + return items[:num_samples] + + +def num_utterances(dataset): + """Returns the total number of utterances in the dataset.""" + return sum([len(interaction) for interaction in dataset.examples]) diff --git a/examples/text_to_sql/IGSQL/data_util/atis_vocab.py b/examples/text_to_sql/IGSQL/data_util/atis_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..9a4e875c5221dcbdb74db3d94aa4bbc9b091878e --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/atis_vocab.py @@ -0,0 +1,84 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Gets and stores vocabulary for the ATIS data.""" + +from . import snippets +from .vocabulary import Vocabulary, UNK_TOK, DEL_TOK, EOS_TOK + +INPUT_FN_TYPES = [UNK_TOK, DEL_TOK, EOS_TOK] +OUTPUT_FN_TYPES = [UNK_TOK, EOS_TOK] + +MIN_INPUT_OCCUR = 1 +MIN_OUTPUT_OCCUR = 1 + + +class ATISVocabulary: + """Stores the vocabulary for the ATIS data. + + Attributes: + raw_vocab (`Vocabulary`): Vocabulary object. + tokens (`set`): Set of all of the strings in the vocabulary. + inorder_tokens (`list`): List of all tokens, with a strict and + unchanging order. 
+ """ + + def __init__(self, token_sequences, filename, params, is_input="input", min_occur=1, anonymizer=None, skip=None): + + if is_input == "input": + functional_types = INPUT_FN_TYPES + elif is_input == "output": + functional_types = OUTPUT_FN_TYPES + elif is_input == "schema": + functional_types = [UNK_TOK] + else: + functional_types = [] + + self.raw_vocab = Vocabulary( + token_sequences, + filename, + functional_types=functional_types, + min_occur=min_occur, + ignore_fn=lambda x: snippets.is_snippet(x) + or (anonymizer and anonymizer.is_anon_tok(x)) + or (skip and x in skip), + ) + self.tokens = set(self.raw_vocab.token_to_id.keys()) + self.inorder_tokens = self.raw_vocab.id_to_token + + assert len(self.inorder_tokens) == len(self.raw_vocab) + + def __len__(self): + return len(self.raw_vocab) + + def token_to_id(self, token): + """Maps from a token to a unique ID. + + Args: + token (`str`): The token to look up. + + Returns: + `int`: Uniquely identifying the token. + """ + return self.raw_vocab.token_to_id[token] + + def id_to_token(self, identifier): + """Maps from a unique integer to an identifier. + + Args: + identifier (`int`): The unique ID. + + Returns: + `str`: Representing the token. + """ + return self.raw_vocab.id_to_token[identifier] diff --git a/examples/text_to_sql/IGSQL/data_util/dataset_split.py b/examples/text_to_sql/IGSQL/data_util/dataset_split.py new file mode 100644 index 0000000000000000000000000000000000000000..7349046ae0e8ef9a2e0293a895e45460fd64fd5b --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/dataset_split.py @@ -0,0 +1,64 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Utility functions for loading and processing ATIS data.""" + +import os +import pickle + + +class DatasetSplit: + """Stores a split of the ATIS dataset. + + Attributes: + examples (`list`): Stores the examples in the split. + """ + + def __init__(self, processed_filename, raw_filename, load_function): + if os.path.exists(processed_filename): + print("Loading preprocessed data from " + processed_filename) + with open(processed_filename, "rb") as infile: + self.examples = pickle.load(infile) + else: + print("Loading raw data from " + raw_filename + " and writing to " + processed_filename) + + infile = open(raw_filename, "rb") + examples_from_file = pickle.load(infile) + assert isinstance(examples_from_file, list), raw_filename + " does not contain a list of examples" + infile.close() + + self.examples = [] + for example in examples_from_file: + obj, keep = load_function(example) + + if keep: + self.examples.append(obj) + + print("Loaded " + str(len(self.examples)) + " examples") + outfile = open(processed_filename, "wb") + pickle.dump(self.examples, outfile) + outfile.close() + + def get_ex_properties(self, function): + """Applies some function to the examples in the dataset. + + Args: + function (`function`): Function to apply to all examples. 
+ + Returns + `list`: The return value of the function + """ + elems = [] + for example in self.examples: + elems.append(function(example)) + return elems diff --git a/examples/text_to_sql/IGSQL/data_util/entities.py b/examples/text_to_sql/IGSQL/data_util/entities.py new file mode 100644 index 0000000000000000000000000000000000000000..7887b24a07cd06274ee4f181edad2f0401c23d78 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/entities.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Classes for keeping track of the entities in a natural language string. """ + +import json + + +class NLtoSQLDict: + """ + Entity dict file should contain, on each line, a JSON dictionary with + "input" and "output" keys specifying the string for the input and output + pairs. The idea is that the existence of the key in an input sequence + likely corresponds to the existence of the value in the output sequence. + + The entity_dict should map keys (input strings) to a list of values (output + strings) where this property holds. This allows keys to map to multiple + output strings (e.g. for times). + """ + + def __init__(self, entity_dict_filename): + self.entity_dict = {} + + pairs = [json.loads(line) for line in open(entity_dict_filename).readlines()] + for pair in pairs: + input_seq = pair["input"] + output_seq = pair["output"] + if input_seq not in self.entity_dict: + self.entity_dict[input_seq] = [] + self.entity_dict[input_seq].append(output_seq) + + def get_sql_entities(self, tokenized_nl_string): + """ + Gets the output-side entities which correspond to the input entities in + the input sequence. + + Args: + tokenized_nl_string (`list`): list of tokens in the input string. + + Outputs: + `list`: The output strings. + """ + assert len(tokenized_nl_string) > 0 + flat_input_string = " ".join(tokenized_nl_string) + entities = [] + + # See if any input strings are in our input sequence, and add the + # corresponding output strings if so. + for entry, values in self.entity_dict.items(): + in_middle = " " + entry + " " in flat_input_string + + leftspace = " " + entry + at_end = leftspace in flat_input_string and flat_input_string.endswith(leftspace) + + rightspace = entry + " " + at_beginning = rightspace in flat_input_string and flat_input_string.startswith(rightspace) + if in_middle or at_end or at_beginning: + for out_string in values: + entities.append(out_string) + + # Also add any integers in the input string (these aren't in the entity) + # dict. + for token in tokenized_nl_string: + if token.isnumeric(): + entities.append(token) + + return entities diff --git a/examples/text_to_sql/IGSQL/data_util/interaction.py b/examples/text_to_sql/IGSQL/data_util/interaction.py new file mode 100644 index 0000000000000000000000000000000000000000..0b21b05471c9e1294709127e97333efc5df11ef1 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/interaction.py @@ -0,0 +1,325 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Contains the class for an interaction in ATIS. """ + +import paddle + +from . import anonymization as anon +from . import sql_util +from .snippets import expand_snippets +from .utterance import Utterance, OUTPUT_KEY, ANON_INPUT_KEY + + +class Schema: + def __init__(self, table_schema, simple=False): + if simple: + self.helper1(table_schema) + else: + self.helper2(table_schema) + + def helper1(self, table_schema): + self.table_schema = table_schema + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names = table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + assert len(column_names) == len(column_names_original) and len(table_names) == len(table_names_original) + + column_keep_index = [] + + self.column_names_surface_form = [] + self.column_names_surface_form_to_id = {} + for i, (table_id, column_name) in enumerate(column_names_original): + column_name_surface_form = column_name + column_name_surface_form = column_name_surface_form.lower() + if column_name_surface_form not in self.column_names_surface_form_to_id: + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = ( + len(self.column_names_surface_form) - 1 + ) + column_keep_index.append(i) + + column_keep_index_2 = [] + for i, table_name in enumerate(table_names_original): + column_name_surface_form = table_name.lower() + if column_name_surface_form not in self.column_names_surface_form_to_id: + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = ( + len(self.column_names_surface_form) - 1 + ) + column_keep_index_2.append(i) + + self.column_names_embedder_input = [] + self.column_names_embedder_input_to_id = {} + for i, (table_id, column_name) in enumerate(column_names): + column_name_embedder_input = column_name + if i in column_keep_index: + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = ( + len(self.column_names_embedder_input) - 1 + ) + + for i, table_name in enumerate(table_names): + column_name_embedder_input = table_name + if i in column_keep_index_2: + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = ( + len(self.column_names_embedder_input) - 1 + ) + + max_id_1 = max(v for k, v in self.column_names_surface_form_to_id.items()) + max_id_2 = max(v for k, v in self.column_names_embedder_input_to_id.items()) + assert (len(self.column_names_surface_form) - 1) == max_id_2 == max_id_1 + + self.num_col = len(self.column_names_surface_form) + + def helper2(self, table_schema): + self.table_schema = table_schema + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names 
= table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + assert len(column_names) == len(column_names_original) and len(table_names) == len(table_names_original) + + column_keep_index = [] + + self.column_names_surface_form = [] + self.column_names_surface_form_to_id = {} + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + column_name_surface_form = column_name + column_name_surface_form = column_name_surface_form.lower() + if column_name_surface_form not in self.column_names_surface_form_to_id: + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = ( + len(self.column_names_surface_form) - 1 + ) + column_keep_index.append(i) + + start_i = len(self.column_names_surface_form_to_id) + for i, table_name in enumerate(table_names_original): + column_name_surface_form = "{}.*".format(table_name.lower()) + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = i + start_i + + self.column_names_embedder_input = [] + self.column_names_embedder_input_to_id = {} + for i, (table_id, column_name) in enumerate(column_names): + if table_id >= 0: + table_name = table_names[table_id] + column_name_embedder_input = table_name + " . " + column_name + else: + column_name_embedder_input = column_name + if i in column_keep_index: + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = ( + len(self.column_names_embedder_input) - 1 + ) + + start_i = len(self.column_names_embedder_input_to_id) + for i, table_name in enumerate(table_names): + column_name_embedder_input = table_name + " . 
*" + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = i + start_i + + assert ( + len(self.column_names_surface_form) + == len(self.column_names_surface_form_to_id) + == len(self.column_names_embedder_input) + == len(self.column_names_embedder_input_to_id) + ) + + max_id_1 = max(v for k, v in self.column_names_surface_form_to_id.items()) + max_id_2 = max(v for k, v in self.column_names_embedder_input_to_id.items()) + assert (len(self.column_names_surface_form) - 1) == max_id_2 == max_id_1 + + self.num_col = len(self.column_names_surface_form) + + def __len__(self): + return self.num_col + + def in_vocabulary(self, column_name, surface_form=False): + if surface_form: + return column_name in self.column_names_surface_form_to_id + else: + return column_name in self.column_names_embedder_input_to_id + + def column_name_embedder_bow(self, column_name, surface_form=False, column_name_token_embedder=None): + assert self.in_vocabulary(column_name, surface_form) + if surface_form: + column_name_id = self.column_names_surface_form_to_id[column_name] + column_name_embedder_input = self.column_names_embedder_input[column_name_id] + else: + column_name_embedder_input = column_name + + column_name_embeddings = [column_name_token_embedder(token) for token in column_name_embedder_input.split()] + column_name_embeddings = paddle.stack(column_name_embeddings, axis=0) + return paddle.mean(column_name_embeddings, axis=0) + + def set_column_name_embeddings(self, column_name_embeddings): + self.column_name_embeddings = column_name_embeddings + assert len(self.column_name_embeddings) == self.num_col + + def column_name_embedder(self, column_name, surface_form=False): + assert self.in_vocabulary(column_name, surface_form) + if surface_form: + column_name_id = self.column_names_surface_form_to_id[column_name] + else: + column_name_id = self.column_names_embedder_input_to_id[column_name] + + return self.column_name_embeddings[column_name_id] + + +class Interaction: + def __init__(self, utterances, schema, snippets, anon_tok_to_ent, identifier, params): + self.utterances = utterances + self.schema = schema + self.snippets = snippets + self.anon_tok_to_ent = anon_tok_to_ent + self.identifier = identifier + + # Ensure that each utterance's input and output sequences, when remapped + # without anonymization or snippets, are the same as the original + # version. 
+ for i, utterance in enumerate(self.utterances): + deanon_input = self.deanonymize(utterance.input_seq_to_use, ANON_INPUT_KEY) + assert deanon_input == utterance.original_input_seq, ( + "Anonymized sequence [" + + " ".join(utterance.input_seq_to_use) + + "] is not the same as [" + + " ".join(utterance.original_input_seq) + + "] when deanonymized (is [" + + " ".join(deanon_input) + + "] instead)" + ) + desnippet_gold = self.expand_snippets(utterance.gold_query_to_use) + deanon_gold = self.deanonymize(desnippet_gold, OUTPUT_KEY) + assert deanon_gold == utterance.original_gold_query, ( + "Anonymized and/or snippet'd query " + + " ".join(utterance.gold_query_to_use) + + " is not the same as " + + " ".join(utterance.original_gold_query) + ) + + def __str__(self): + string = "Utterances:\n" + for utterance in self.utterances: + string += str(utterance) + "\n" + string += "Anonymization dictionary:\n" + for ent_tok, deanon in self.anon_tok_to_ent.items(): + string += ent_tok + "\t" + str(deanon) + "\n" + + return string + + def __len__(self): + return len(self.utterances) + + def deanonymize(self, sequence, key): + """Deanonymizes a predicted query or an input utterance. + + Args: + sequence (`list`): The sequence to deanonymize. + key (`str`): The key in the anonymization table, e.g. NL or SQL. + """ + return anon.deanonymize(sequence, self.anon_tok_to_ent, key) + + def expand_snippets(self, sequence): + """Expands snippets for a sequence. + + Args: + sequence (`list`): A SQL query. + + """ + return expand_snippets(sequence, self.snippets) + + def input_seqs(self): + in_seqs = [] + for utterance in self.utterances: + in_seqs.append(utterance.input_seq_to_use) + return in_seqs + + def output_seqs(self): + out_seqs = [] + for utterance in self.utterances: + out_seqs.append(utterance.gold_query_to_use) + return out_seqs + + +def load_function(parameters, nl_to_sql_dict, anonymizer, database_schema=None): + def fn(interaction_example): + keep = False + + raw_utterances = interaction_example["interaction"] + + if "database_id" in interaction_example: + database_id = interaction_example["database_id"] + interaction_id = interaction_example["interaction_id"] + identifier = database_id + "/" + str(interaction_id) + else: + identifier = interaction_example["id"] + + schema = None + if database_schema: + if "removefrom" not in parameters.data_directory: + schema = Schema(database_schema[database_id], simple=True) + else: + schema = Schema(database_schema[database_id]) + + snippet_bank = [] + + utterance_examples = [] + + anon_tok_to_ent = {} + + for utterance in raw_utterances: + available_snippets = [snippet for snippet in snippet_bank if snippet.index <= 1] + + proc_utterance = Utterance( + utterance, available_snippets, nl_to_sql_dict, parameters, anon_tok_to_ent, anonymizer + ) + keep_utterance = proc_utterance.keep + + if schema: + assert keep_utterance + + if keep_utterance: + keep = True + utterance_examples.append(proc_utterance) + + # Update the snippet bank, and age each snippet in it. 
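+            # ATIS data uses the full sqlparse-based subtree extraction (get_subtrees); other datasets use the simpler
+            # per-line extraction (get_subtrees_simple). New snippets are registered in the bank, and every snippet
+            # in the bank is then aged by one utterance.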
+ if parameters.use_snippets: + if "atis" in parameters.data_directory: + snippets = sql_util.get_subtrees( + proc_utterance.anonymized_gold_query, proc_utterance.available_snippets + ) + else: + snippets = sql_util.get_subtrees_simple( + proc_utterance.anonymized_gold_query, proc_utterance.available_snippets + ) + + for snippet in snippets: + snippet.assign_id(len(snippet_bank)) + snippet_bank.append(snippet) + + for snippet in snippet_bank: + snippet.increase_age() + + interaction = Interaction(utterance_examples, schema, snippet_bank, anon_tok_to_ent, identifier, parameters) + + return interaction, keep + + return fn diff --git a/examples/text_to_sql/IGSQL/data_util/snippets.py b/examples/text_to_sql/IGSQL/data_util/snippets.py new file mode 100644 index 0000000000000000000000000000000000000000..e9a6fdfd2a2c26d0119335560686f3db617f9218 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/snippets.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Contains the Snippet class and methods for handling snippets.""" + +SNIPPET_PREFIX = "SNIPPET_" + + +def is_snippet(token): + """Determines whether a token is a snippet or not. + + Args: + token (`str`): The token to check. + + Returns: + `bool`: Indicating whether it's a snippet. + """ + return token.startswith(SNIPPET_PREFIX) + + +def expand_snippets(sequence, snippets): + """Given a sequence and a list of snippets, expand the snippets in the sequence. + + Args: + sequence (`list`): Query containing snippet references. + snippets (`list`): List of available snippets. + + Returns: + `list`: The expanded sequence list. + """ + snippet_id_to_snippet = {} + for snippet in snippets: + assert snippet.name not in snippet_id_to_snippet + snippet_id_to_snippet[snippet.name] = snippet + expanded_seq = [] + for token in sequence: + if token in snippet_id_to_snippet: + expanded_seq.extend(snippet_id_to_snippet[token].sequence) + else: + assert not is_snippet(token) + expanded_seq.append(token) + + return expanded_seq + + +def snippet_index(token): + """Returns the index of a snippet. + + Args: + token (`str`): The snippet to check. + + Returns: + `int`: The index of the snippet. + """ + assert is_snippet(token) + return int(token.split("_")[-1]) + + +class Snippet: + """Contains a snippet.""" + + def __init__(self, sequence, startpos, sql, age=0): + self.sequence = sequence + self.startpos = startpos + self.sql = sql + + # TODO: age vs. index? 
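+        # As implemented here: `index` counts the utterances since the snippet entered the bank (incremented by
+        # increase_age() and compared against snippet_keep_age), while `age` carries over how many consecutive
+        # gold queries contained the same subtree (see get_subtrees / get_subtrees_simple).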
+ self.age = age + self.index = 0 + + self.name = "" + self.embedding = None + + self.endpos = self.startpos + len(self.sequence) + assert self.endpos < len(self.sql), ( + "End position of snippet is " + + str(self.endpos) + + " which is greater than length of SQL (" + + str(len(self.sql)) + + ")" + ) + assert self.sequence == self.sql[self.startpos : self.endpos], ( + "Value of snippet (" + " ".join(self.sequence) + ") " + "is not the same as SQL at the same positions (" + " ".join(self.sql[self.startpos : self.endpos]) + ")" + ) + + def __str__(self): + return self.name + "\t" + str(self.age) + "\t" + " ".join(self.sequence) + + def __len__(self): + return len(self.sequence) + + def increase_age(self): + """Ages a snippet by one.""" + self.index += 1 + + def assign_id(self, number): + """Assigns the name of the snippet to be the prefix + the number.""" + self.name = SNIPPET_PREFIX + str(number) + + def set_embedding(self, embedding): + """Sets the embedding of the snippet.""" + self.embedding = embedding diff --git a/examples/text_to_sql/IGSQL/data_util/sql_util.py b/examples/text_to_sql/IGSQL/data_util/sql_util.py new file mode 100644 index 0000000000000000000000000000000000000000..eb9b92260d3a118b4c31f69f42ee8c29759dd0d0 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/sql_util.py @@ -0,0 +1,440 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import random +import signal + +import pymysql +import sqlparse +from sqlparse import sql as sql_types +from sqlparse import tokens as token_types + +from . import util +from .snippets import Snippet + +interesting_selects = ["DISTINCT", "MAX", "MIN", "count"] +ignored_subtrees = [["1", "=", "1"]] + + +def strip_whitespace_front(token_list): + """Strips whitespace and punctuation from the front of a SQL token list. + + Args: + token_list(`list`): the token list. + + Outputs: + `list`: New token list. + """ + new_token_list = [] + found_valid = False + + for token in token_list: + if not (token.is_whitespace or token.ttype == token_types.Punctuation) or found_valid: + found_valid = True + new_token_list.append(token) + + return new_token_list + + +def strip_whitespace(token_list): + """Strips whitespace from a token list. + + Args: + token_list(`list`): the token list. + + Returns: + `list`: New token list with no whitespace/punctuation surrounding. + """ + subtokens = strip_whitespace_front(token_list) + subtokens = strip_whitespace_front(subtokens[::-1])[::-1] + return subtokens + + +def token_list_to_seq(token_list): + """Converts a Token list to a sequence of strings, stripping out surrounding + punctuation and all whitespace. + + Args: + token_list(`list`): the list of tokens. 
+ + Outputs: + `list`: sequence of strings + """ + subtokens = strip_whitespace(token_list) + + seq = [] + flat = sqlparse.sql.TokenList(subtokens).flatten() + for i, token in enumerate(flat): + strip_token = str(token).strip() + if len(strip_token) > 0: + seq.append(strip_token) + if len(seq) > 0: + if seq[0] == "(" and seq[-1] == ")": + seq = seq[1:-1] + + return seq + + +def find_subtrees(sequence, current_subtrees, where_parent=False, keep_conj_subtrees=False): + """Finds subtrees for a subsequence of SQL. + + Args: + sequence(`list`): Sequence of SQL tokens. + current_subtrees(`list`): Current list of subtrees. + where_parent(`bool`, optional): Whether the parent of the current sequence was a where clause + keep_conj_subtrees('bool', optional): Whether to look for a conjunction in this sequence and + keep its arguments + """ + + # If the parent of the subsequence was a WHERE clause, keep everything in the + # sequence except for the beginning WHERE and any surrounding parentheses. + if where_parent: + # Strip out the beginning WHERE, and any punctuation or whitespace at the + # beginning or end of the token list. + seq = token_list_to_seq(sequence.tokens[1:]) + if len(seq) > 0 and seq not in current_subtrees: + current_subtrees.append(seq) + + # If the current sequence has subtokens, i.e. if it's a node that can be + # expanded, check for a conjunction in its subtrees, and expand its subtrees. + # Also check for any SELECT statements and keep track of what follows. + if sequence.is_group: + if keep_conj_subtrees: + subtokens = strip_whitespace(sequence.tokens) + + # Check if there is a conjunction in the subsequence. If so, keep the + # children. Also make sure you don't split where AND is used within a + # child -- the subtokens sequence won't treat those ANDs differently (a + # bit hacky but it works) + has_and = False + for i, token in enumerate(subtokens): + if token.value == "OR" or token.value == "AND": + has_and = True + break + + if has_and: + and_subtrees = [] + current_subtree = [] + for i, token in enumerate(subtokens): + if token.value == "OR" or ( + token.value == "AND" + and i - 4 >= 0 + and i - 4 < len(subtokens) + and subtokens[i - 4].value != "BETWEEN" + ): + and_subtrees.append(current_subtree) + current_subtree = [] + else: + current_subtree.append(token) + and_subtrees.append(current_subtree) + + for subtree in and_subtrees: + seq = token_list_to_seq(subtree) + if len(seq) > 0 and seq[0] == "WHERE": + seq = seq[1:] + if seq not in current_subtrees: + current_subtrees.append(seq) + + in_select = False + select_toks = [] + for i, token in enumerate(sequence.tokens): + # Mark whether this current token is a WHERE. + is_where = isinstance(token, sql_types.Where) + + # If you are in a SELECT, start recording what follows until you hit a + # FROM + if token.value == "SELECT": + in_select = True + elif in_select: + select_toks.append(token) + if token.value == "FROM": + in_select = False + + seq = [] + if len(sequence.tokens) > i + 2: + seq = token_list_to_seq(select_toks + [sequence.tokens[i + 2]]) + + if seq not in current_subtrees and len(seq) > 0 and seq[0] in interesting_selects: + current_subtrees.append(seq) + + select_toks = [] + + # Recursively find subtrees in the children of the node. 
+ find_subtrees(token, current_subtrees, is_where, where_parent or keep_conj_subtrees) + + +def get_subtrees(sql, oldsnippets=[]): + parsed = sqlparse.parse(" ".join(sql))[0] + + subtrees = [] + find_subtrees(parsed, subtrees) + + final_subtrees = [] + for subtree in subtrees: + if subtree not in ignored_subtrees: + final_version = [] + keep = True + + parens_counts = 0 + for i, token in enumerate(subtree): + if token == ".": + newtoken = final_version[-1] + "." + subtree[i + 1] + final_version = final_version[:-1] + [newtoken] + keep = False + elif keep: + final_version.append(token) + else: + keep = True + + if token == "(": + parens_counts -= 1 + elif token == ")": + parens_counts += 1 + + if parens_counts == 0: + final_subtrees.append(final_version) + + snippets = [] + sql = [str(tok) for tok in sql] + for subtree in final_subtrees: + startpos = -1 + for i in range(len(sql) - len(subtree) + 1): + if sql[i : i + len(subtree)] == subtree: + startpos = i + if startpos >= 0 and startpos + len(subtree) < len(sql): + age = 0 + for prevsnippet in oldsnippets: + if prevsnippet.sequence == subtree: + age = prevsnippet.age + 1 + snippet = Snippet(subtree, startpos, sql, age=age) + snippets.append(snippet) + + return snippets + + +def get_subtrees_simple(sql, oldsnippets=[]): + sql_string = " ".join(sql) + format_sql = sqlparse.format(sql_string, reindent=True) + + # get subtrees + subtrees = [] + for sub_sql in format_sql.split("\n"): + sub_sql = sub_sql.replace("(", " ( ").replace(")", " ) ").replace(",", " , ") + + subtree = sub_sql.strip().split() + if len(subtree) > 1: + subtrees.append(subtree) + + final_subtrees = subtrees + + snippets = [] + sql = [str(tok) for tok in sql] + for subtree in final_subtrees: + startpos = -1 + for i in range(len(sql) - len(subtree) + 1): + if sql[i : i + len(subtree)] == subtree: + startpos = i + + if startpos >= 0 and startpos + len(subtree) <= len(sql): + age = 0 + for prevsnippet in oldsnippets: + if prevsnippet.sequence == subtree: + age = prevsnippet.age + 1 + new_sql = sql + [";"] + snippet = Snippet(subtree, startpos, new_sql, age=age) + snippets.append(snippet) + + return snippets + + +conjunctions = {"AND", "OR", "WHERE"} + + +def get_all_in_parens(sequence): + if sequence[-1] == ";": + sequence = sequence[:-1] + + if "(" not in sequence: + return [] + + if sequence[0] == "(" and sequence[-1] == ")": + in_parens = sequence[1:-1] + return [in_parens] + get_all_in_parens(in_parens) + else: + paren_subseqs = [] + current_seq = [] + num_parens = 0 + in_parens = False + for token in sequence: + if in_parens: + current_seq.append(token) + if token == ")": + num_parens -= 1 + if num_parens == 0: + in_parens = False + paren_subseqs.append(current_seq) + current_seq = [] + elif token == "(": + in_parens = True + current_seq.append(token) + if token == "(": + num_parens += 1 + + all_subseqs = [] + for subseq in paren_subseqs: + all_subseqs.extend(get_all_in_parens(subseq)) + return all_subseqs + + +def split_by_conj(sequence): + num_parens = 0 + current_seq = [] + subsequences = [] + + for token in sequence: + if num_parens == 0: + if token in conjunctions: + subsequences.append(current_seq) + current_seq = [] + break + current_seq.append(token) + if token == "(": + num_parens += 1 + elif token == ")": + num_parens -= 1 + + assert num_parens >= 0 + + return subsequences + + +def get_sql_snippets(sequence): + # First, get all subsequences of the sequence that are surrounded by + # parentheses. 
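+    # This helper prints each candidate subsequence and then calls exit(); it is only used for manual inspection
+    # of snippet extraction, not in the training or evaluation pipeline.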
+ all_in_parens = get_all_in_parens(sequence) + all_subseq = [] + + # Then for each one, split the sequence on conjunctions (AND/OR). + for seq in all_in_parens: + subsequences = split_by_conj(seq) + all_subseq.append(seq) + all_subseq.extend(subsequences) + + # Finally, also get "interesting" selects + + for i, seq in enumerate(all_subseq): + print(str(i) + "\t" + " ".join(seq)) + exit() + + +def add_snippets_to_query(snippets, ignored_entities, query, prob_align=1.0): + query_copy = copy.copy(query) + + # Replace the longest snippets first, so sort by length descending. + sorted_snippets = sorted(snippets, key=lambda s: len(s.sequence))[::-1] + + for snippet in sorted_snippets: + ignore = False + snippet_seq = snippet.sequence + + # If it contains an ignored entity, then don't use it. + for entity in ignored_entities: + ignore = ignore or util.subsequence(entity, snippet_seq) + + # No NL entities found in snippet, then see if snippet is a substring of + # the gold sequence + if not ignore: + snippet_length = len(snippet_seq) + + # Iterate through gold sequence to see if it's a subsequence. + for start_idx in range(len(query_copy) - snippet_length + 1): + if query_copy[start_idx : start_idx + snippet_length] == snippet_seq: + align = random.random() < prob_align + + if align: + prev_length = len(query_copy) + + # At the start position of the snippet, replace with an + # identifier. + query_copy[start_idx] = snippet.name + + # Then cut out the indices which were collapsed into + # the snippet. + query_copy = query_copy[: start_idx + 1] + query_copy[start_idx + snippet_length :] + + # Make sure the length is as expected + assert len(query_copy) == prev_length - (snippet_length - 1) + + return query_copy + + +def execution_results(query, username, password, timeout=3): + connection = pymysql.connect(user=username, password=password) + + class TimeoutException(Exception): + pass + + def timeout_handler(signum, frame): + raise TimeoutException + + signal.signal(signal.SIGALRM, timeout_handler) + + syntactic = True + semantic = True + + table = [] + + with connection.cursor() as cursor: + signal.alarm(timeout) + try: + cursor.execute("SET sql_mode='IGNORE_SPACE';") + cursor.execute("use atis3;") + cursor.execute(query) + table = cursor.fetchall() + cursor.close() + except TimeoutException: + signal.alarm(0) + cursor.close() + except pymysql.err.ProgrammingError: + syntactic = False + semantic = False + cursor.close() + except pymysql.err.InternalError: + semantic = False + cursor.close() + except Exception: + signal.alarm(0) + signal.alarm(0) + cursor.close() + signal.alarm(0) + + connection.close() + + return (syntactic, semantic, sorted(table)) + + +def executable(query, username, password, timeout=2): + return execution_results(query, username, password, timeout)[1] + + +def fix_parentheses(sequence): + num_left = sequence.count("(") + num_right = sequence.count(")") + + if num_right < num_left: + fixed_sequence = sequence[:-1] + [")" for _ in range(num_left - num_right)] + [sequence[-1]] + return fixed_sequence + + return sequence diff --git a/examples/text_to_sql/IGSQL/data_util/tokenizers.py b/examples/text_to_sql/IGSQL/data_util/tokenizers.py new file mode 100644 index 0000000000000000000000000000000000000000..6cedd55d42ed2241b2310c17459e76795283f9a8 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/tokenizers.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenizers for natural language SQL queries, and lambda calculus.""" + +import nltk +import sqlparse + + +def nl_tokenize(string): + """Tokenizes a natural language string into tokens. + Assumes data is space-separated (this is true of ZC07 data in ATIS2/3). + + Args: + string(`str`): the string to tokenize. + Outputs: + `list`: a list of tokens. + """ + return nltk.word_tokenize(string) + + +def sql_tokenize(string): + """Tokenizes a SQL statement into tokens. + + Args: + string(`str`): string to tokenize. + + Outputs: + `list`: a list of tokens. + """ + tokens = [] + statements = sqlparse.parse(string) + + # SQLparse gives you a list of statements. + for statement in statements: + # Flatten the tokens in each statement and add to the tokens list. + flat_tokens = sqlparse.sql.TokenList(statement.tokens).flatten() + for token in flat_tokens: + strip_token = str(token).strip() + if len(strip_token) > 0: + tokens.append(strip_token) + + newtokens = [] + keep = True + for i, token in enumerate(tokens): + if token == ".": + newtoken = newtokens[-1] + "." + tokens[i + 1] + newtokens = newtokens[:-1] + [newtoken] + keep = False + elif keep: + newtokens.append(token) + else: + keep = True + + return newtokens + + +def lambda_tokenize(string): + """Tokenizes a lambda-calculus statement into tokens. + + Args: + string(`str`): a lambda-calculus string + + Outputs: + `list`: a list of tokens. + """ + + space_separated = string.split(" ") + + new_tokens = [] + + # Separate the string by spaces, then separate based on existence of ( or + # ). + for token in space_separated: + tokens = [] + + current_token = "" + for char in token: + if char == ")" or char == "(": + tokens.append(current_token) + tokens.append(char) + current_token = "" + else: + current_token += char + tokens.append(current_token) + new_tokens.extend([tok for tok in tokens if tok]) + + return new_tokens diff --git a/examples/text_to_sql/IGSQL/data_util/util.py b/examples/text_to_sql/IGSQL/data_util/util.py new file mode 100644 index 0000000000000000000000000000000000000000..fc814eb18e810828280af5f1bce3aaa64f2dd92a --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/util.py @@ -0,0 +1,31 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains various utility functions.""" + + +def subsequence(first_sequence, second_sequence): + """ + Returns whether the first sequence is a subsequence of the second sequence. 
+ + Args: + first_sequence (`list`): A sequence. + second_sequence (`list`): Another sequence. + + Returns: + `bool`: Whether first_sequence is a subsequence of second_sequence. + """ + for startidx in range(len(second_sequence) - len(first_sequence) + 1): + if second_sequence[startidx : startidx + len(first_sequence)] == first_sequence: + return True + return False diff --git a/examples/text_to_sql/IGSQL/data_util/utterance.py b/examples/text_to_sql/IGSQL/data_util/utterance.py new file mode 100644 index 0000000000000000000000000000000000000000..b1a0879208bfc43b0d58dfc1dbec04b87e19a196 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/utterance.py @@ -0,0 +1,102 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Contains the Utterance class. """ + +from . import sql_util +from . import tokenizers + +ANON_INPUT_KEY = "cleaned_nl" +OUTPUT_KEY = "sql" + + +class Utterance: + def process_input_seq(self, anonymize, anonymizer, anon_tok_to_ent): + assert not anon_tok_to_ent or anonymize + assert not anonymize or anonymizer + + if anonymize: + assert anonymizer + + self.input_seq_to_use = anonymizer.anonymize( + self.original_input_seq, anon_tok_to_ent, ANON_INPUT_KEY, add_new_anon_toks=True + ) + else: + self.input_seq_to_use = self.original_input_seq + + def process_gold_seq( + self, output_sequences, nl_to_sql_dict, available_snippets, anonymize, anonymizer, anon_tok_to_ent + ): + # Get entities in the input sequence: + # anonymized entity types + # othe recognized entities (this includes "flight") + entities_in_input = [[tok] for tok in self.input_seq_to_use if tok in anon_tok_to_ent] + entities_in_input.extend(nl_to_sql_dict.get_sql_entities(self.input_seq_to_use)) + + # Get the shortest gold query (this is what we use to train) + shortest_gold_and_results = min(output_sequences, key=lambda x: len(x[0])) + + # Tokenize and anonymize it if necessary. + self.original_gold_query = shortest_gold_and_results[0] + self.gold_sql_results = shortest_gold_and_results[1] + + self.contained_entities = entities_in_input + + # Keep track of all gold queries and the resulting tables so that we can + # give credit if it predicts a different correct sequence. + self.all_gold_queries = output_sequences + + self.anonymized_gold_query = self.original_gold_query + if anonymize: + self.anonymized_gold_query = anonymizer.anonymize( + self.original_gold_query, anon_tok_to_ent, OUTPUT_KEY, add_new_anon_toks=False + ) + + # Add snippets to it. + self.gold_query_to_use = sql_util.add_snippets_to_query( + available_snippets, entities_in_input, self.anonymized_gold_query + ) + + def __init__(self, example, available_snippets, nl_to_sql_dict, params, anon_tok_to_ent={}, anonymizer=None): + # Get output and input sequences from the dictionary representation. 
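+        # An example is kept only when it has a non-empty tokenized input and at least one gold output sequence;
+        # otherwise processing of this utterance stops here.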
+ output_sequences = example[OUTPUT_KEY] + self.original_input_seq = tokenizers.nl_tokenize(example[params.input_key]) + self.available_snippets = available_snippets + self.keep = False + + if len(output_sequences) > 0 and len(self.original_input_seq) > 0: + # Only keep this example if there is at least one output sequence. + self.keep = True + if len(output_sequences) == 0 or len(self.original_input_seq) == 0: + return + + # Process the input sequence + self.process_input_seq(params.anonymize, anonymizer, anon_tok_to_ent) + + # Process the gold sequence + self.process_gold_seq( + output_sequences, nl_to_sql_dict, self.available_snippets, params.anonymize, anonymizer, anon_tok_to_ent + ) + + def __str__(self): + string = "Original input: " + " ".join(self.original_input_seq) + "\n" + string += "Modified input: " + " ".join(self.input_seq_to_use) + "\n" + string += "Original output: " + " ".join(self.original_gold_query) + "\n" + string += "Modified output: " + " ".join(self.gold_query_to_use) + "\n" + string += "Snippets:\n" + for snippet in self.available_snippets: + string += str(snippet) + "\n" + return string + + def length_valid(self, input_limit, output_limit): + return len(self.input_seq_to_use) < input_limit and len(self.gold_query_to_use) < output_limit diff --git a/examples/text_to_sql/IGSQL/data_util/vocabulary.py b/examples/text_to_sql/IGSQL/data_util/vocabulary.py new file mode 100644 index 0000000000000000000000000000000000000000..6b102f084c955c5d5f2ea22088056012395ea0ac --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/vocabulary.py @@ -0,0 +1,106 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains class and methods for storing and computing a vocabulary from text.""" + +import operator +import os +import pickle + +# Special sequencing tokens. +UNK_TOK = "_UNK" # Replaces out-of-vocabulary words. +EOS_TOK = "_EOS" # Appended to the end of a sequence to indicate its end. +DEL_TOK = ";" + + +class Vocabulary: + """Vocabulary class: stores information about words in a corpus. + + Attributes: + functional_types (`list`): Functional vocabulary words, such as EOS. + max_size (`int`): The maximum size of vocabulary to keep. + min_occur (`int`): The minimum number of times a word should occur to keep it. + id_to_token (`list`): Ordered list of word types. + token_to_id (`dict`): Maps from each unique word type to its index. + """ + + def get_vocab(self, sequences, ignore_fn): + """Gets vocabulary from a list of sequences. + + Args: + sequences (`list`): Sequences from which to compute the vocabulary. + ignore_fn (`function`): Function used to tell whether to ignore a + token during computation of the vocabulary. + + Returns: + `list`: List of the unique word types in the vocabulary. 
+ """ + type_counts = {} + + for sequence in sequences: + for token in sequence: + if not ignore_fn(token): + if token not in type_counts: + type_counts[token] = 0 + type_counts[token] += 1 + + # Create sorted list of tokens, by their counts. Reverse so it is in order of + # most frequent to least frequent. + sorted_type_counts = sorted(sorted(type_counts.items()), key=operator.itemgetter(1))[::-1] + + sorted_types = [typecount[0] for typecount in sorted_type_counts if typecount[1] >= self.min_occur] + + # Append the necessary functional tokens. + sorted_types = self.functional_types + sorted_types + + # Cut off if vocab_size is set (nonnegative) + if self.max_size >= 0: + vocab = sorted_types[: max(self.max_size, len(sorted_types))] + else: + vocab = sorted_types + + return vocab + + def __init__( + self, sequences, filename, functional_types=None, max_size=-1, min_occur=0, ignore_fn=lambda x: False + ): + self.functional_types = functional_types + self.max_size = max_size + self.min_occur = min_occur + + vocab = self.get_vocab(sequences, ignore_fn) + + self.id_to_token = [] + self.token_to_id = {} + + for i, word_type in enumerate(vocab): + self.id_to_token.append(word_type) + self.token_to_id[word_type] = i + + # Load the previous vocab, if it exists. + if os.path.exists(filename): + infile = open(filename, "rb") + loaded_vocab = pickle.load(infile) + infile.close() + + print("Loaded vocabulary from " + str(filename)) + if loaded_vocab.id_to_token != self.id_to_token or loaded_vocab.token_to_id != self.token_to_id: + print("Loaded vocabulary is different than generated vocabulary.") + else: + print("Writing vocabulary to " + str(filename)) + outfile = open(filename, "wb") + pickle.dump(self, outfile) + outfile.close() + + def __len__(self): + return len(self.id_to_token) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/evaluation.py b/examples/text_to_sql/IGSQL/eval_scripts/evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..46e988cd5e26a52370d6c77085e8e181ce18ee4f --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/evaluation.py @@ -0,0 +1,919 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import sqlite3 + +from process_sql import Schema, get_schema, get_sql + +# Flag to disable value evaluation +DISABLE_VALUE = True +# Flag to disable distinct in select evaluation +DISABLE_DISTINCT = True + +################################ +# val: number(float)/string(str)/sql(dict) +# col_unit: (agg_id, col_id, isDistinct(bool)) +# val_unit: (unit_op, col_unit1, col_unit2) +# table_unit: (table_type, col_unit/sql) +# cond_unit: (not_op, op_id, val_unit, val1, val2) +# condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +# sql { +# 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) +# 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} +# 'where': condition +# 'groupBy': [col_unit1, col_unit2, ...] 
+# 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) +# 'having': condition +# 'limit': None/limit value +# 'intersect': None/sql +# 'except': None/sql +# 'union': None/sql +# } +################################ + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +WHERE_OPS = ("not", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like", "is", "exists") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +COND_OPS = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +HARDNESS = { + "component1": ("where", "group", "order", "limit", "join", "or", "like"), + "component2": ("except", "union", "intersect"), +} + + +def condition_has_or(conds): + return "or" in conds[1::2] + + +def condition_has_like(conds): + return WHERE_OPS.index("like") in [cond_unit[1] for cond_unit in conds[::2]] + + +def condition_has_sql(conds): + for cond_unit in conds[::2]: + val1, val2 = cond_unit[3], cond_unit[4] + if val1 is not None and type(val1) is dict: + return True + if val2 is not None and type(val2) is dict: + return True + return False + + +def val_has_op(val_unit): + return val_unit[0] != UNIT_OPS.index("none") + + +def has_agg(unit): + return unit[0] != AGG_OPS.index("none") + + +def accuracy(count, total): + if count == total: + return 1 + return 0 + + +def recall(count, total): + if count == total: + return 1 + return 0 + + +def F1(acc, rec): + if (acc + rec) == 0: + return 0 + return (2.0 * acc * rec) / (acc + rec) + + +def get_scores(count, pred_total, label_total): + if pred_total != label_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, label): + pred_sel = pred["select"][1] + label_sel = label["select"][1] + label_wo_agg = [unit[1] for unit in label_sel] + pred_total = len(pred_sel) + label_total = len(label_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in label_sel: + cnt += 1 + label_sel.remove(unit) + if unit[1] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[1]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_where(pred, label): + pred_conds = [unit for unit in pred["where"][::2]] + label_conds = [unit for unit in label["where"][::2]] + label_wo_agg = [unit[2] for unit in label_conds] + pred_total = len(pred_conds) + label_total = len(label_conds) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_conds: + if unit in label_conds: + cnt += 1 + label_conds.remove(unit) + if unit[2] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[2]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_group(pred, label): + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + pred_total = len(pred_cols) + label_total = len(label_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + label_cols = [label.split(".")[1] if "." 
in label else label for label in label_cols] + for col in pred_cols: + if col in label_cols: + cnt += 1 + label_cols.remove(col) + return label_total, pred_total, cnt + + +def eval_having(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["groupBy"]) > 0: + pred_total = 1 + if len(label["groupBy"]) > 0: + label_total = 1 + + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + if pred_total == label_total == 1 and pred_cols == label_cols and pred["having"] == label["having"]: + cnt = 1 + + return label_total, pred_total, cnt + + +def eval_order(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(label["orderBy"]) > 0: + label_total = 1 + if ( + len(label["orderBy"]) > 0 + and pred["orderBy"] == label["orderBy"] + and ( + (pred["limit"] is None and label["limit"] is None) + or (pred["limit"] is not None and label["limit"] is not None) + ) + ): + cnt = 1 + return label_total, pred_total, cnt + + +def eval_and_or(pred, label): + pred_ao = pred["where"][1::2] + label_ao = label["where"][1::2] + pred_ao = set(pred_ao) + label_ao = set(label_ao) + + if pred_ao == label_ao: + return 1, 1, 1 + return len(pred_ao), len(label_ao), 0 + + +def get_nestedSQL(sql): + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, label): + label_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if label is not None: + label_total += 1 + if pred is not None and label is not None: + cnt += Evaluator().eval_exact_match(pred, label) + return label_total, pred_total, cnt + + +def eval_IUEN(pred, label): + lt1, pt1, cnt1 = eval_nested(pred["intersect"], label["intersect"]) + lt2, pt2, cnt2 = eval_nested(pred["except"], label["except"]) + lt3, pt3, cnt3 = eval_nested(pred["union"], label["union"]) + label_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return label_total, pred_total, cnt + + +def get_keywords(sql): + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, label): + pred_keywords = 
get_keywords(pred) + label_keywords = get_keywords(label) + pred_total = len(pred_keywords) + label_total = len(label_keywords) + cnt = 0 + + for k in pred_keywords: + if k in label_keywords: + cnt += 1 + return label_total, pred_total, cnt + + +def count_agg(units): + return len([unit for unit in units if has_agg(unit)]) + + +def count_component1(sql): + count = 0 + if len(sql["where"]) > 0: + count += 1 + if len(sql["groupBy"]) > 0: + count += 1 + if len(sql["orderBy"]) > 0: + count += 1 + if sql["limit"] is not None: + count += 1 + if len(sql["from"]["table_units"]) > 0: # JOIN + count += len(sql["from"]["table_units"]) - 1 + + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + count += len([token for token in ao if token == "or"]) + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + count += len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) + + return count + + +def count_component2(sql): + nested = get_nestedSQL(sql) + return len(nested) + + +def count_others(sql): + count = 0 + # number of aggregation + agg_count = count_agg(sql["select"][1]) + agg_count += count_agg(sql["where"][::2]) + agg_count += count_agg(sql["groupBy"]) + if len(sql["orderBy"]) > 0: + agg_count += count_agg( + [unit[1] for unit in sql["orderBy"][1] if unit[1]] + [unit[2] for unit in sql["orderBy"][1] if unit[2]] + ) + agg_count += count_agg(sql["having"]) + if agg_count > 1: + count += 1 + + # number of select columns + if len(sql["select"][1]) > 1: + count += 1 + + # number of where conditions + if len(sql["where"]) > 1: + count += 1 + + # number of group by clauses + if len(sql["groupBy"]) > 1: + count += 1 + + return count + + +class Evaluator: + """A simple evaluator""" + + def __init__(self): + self.partial_scores = None + + def eval_hardness(self, sql): + count_comp1_ = count_component1(sql) + count_comp2_ = count_component2(sql) + count_others_ = count_others(sql) + + if count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ == 0: + return "easy" + elif (count_others_ <= 2 and count_comp1_ <= 1 and count_comp2_ == 0) or ( + count_comp1_ <= 2 and count_others_ < 2 and count_comp2_ == 0 + ): + return "medium" + elif ( + (count_others_ > 2 and count_comp1_ <= 2 and count_comp2_ == 0) + or (2 < count_comp1_ <= 3 and count_others_ <= 2 and count_comp2_ == 0) + or (count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ <= 1) + ): + return "hard" + else: + return "extra" + + def eval_exact_match(self, pred, label): + partial_scores = self.eval_partial_match(pred, label) + self.partial_scores = partial_scores + + for _, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + if len(label["from"]["table_units"]) > 0: + label_tables = sorted(label["from"]["table_units"]) + pred_tables = sorted(pred["from"]["table_units"]) + return label_tables == pred_tables + return 1 + + def eval_partial_match(self, pred, label): + res = {} + + label_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["select(no AGG)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["where"] = {"acc": acc, 
"rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_group(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group(no Having)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt = eval_having(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_order(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_and_or(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_IUEN(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_keywords(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + return res + + +def isValidSQL(sql, db): + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(sql) + except Exception: + return False + return True + + +def print_scores(scores, etype): + levels = ["easy", "medium", "hard", "extra", "all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + + print("{:20} {:20} {:20} {:20} {:20} {:20}".format("", *levels)) + counts = [scores[level]["count"] for level in levels] + print("{:20} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d}".format("count", *counts)) + + if etype in ["all", "exec"]: + print("===================== EXECUTION ACCURACY =====================") + this_scores = [scores[level]["exec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("execution", *this_scores)) + + if etype in ["all", "match"]: + print("\n====================== EXACT MATCHING ACCURACY =====================") + exact_scores = [scores[level]["exact"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("exact match", *exact_scores)) + print("\n---------------------PARTIAL MATCHING ACCURACY----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["acc"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- PARTIAL MATCHING RECALL ----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["rec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- PARTIAL MATCHING F1 --------------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["f1"] 
for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + +def evaluate(gold, predict, db_dir, etype, kmaps): + with open(gold) as f: + glist = [l.strip().split("\t") for l in f.readlines() if len(l.strip()) > 0] + + with open(predict) as f: + plist = [l.strip().split("\t") for l in f.readlines() if len(l.strip()) > 0] + evaluator = Evaluator() + + levels = ["easy", "medium", "hard", "extra", "all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + entries = [] + scores = {} + + for level in levels: + scores[level] = {"count": 0, "partial": {}, "exact": 0.0} + scores[level]["exec"] = 0 + for type_ in partial_types: + scores[level]["partial"][type_] = {"acc": 0.0, "rec": 0.0, "f1": 0.0, "acc_count": 0, "rec_count": 0} + + eval_err_num = 0 + for p, g in zip(plist, glist): + p_str = p[0] + g_str, db = g + db_name = db + db = os.path.join(db_dir, db, db + ".sqlite") + schema = Schema(get_schema(db)) + g_sql = get_sql(schema, g_str) + hardness = evaluator.eval_hardness(g_sql) + scores[hardness]["count"] += 1 + scores["all"]["count"] += 1 + + try: + p_sql = get_sql(schema, p_str) + except Exception: + # If p_sql is not valid, then we will use an empty sql to evaluate with the correct sql + p_sql = { + "except": None, + "from": {"conds": [], "table_units": []}, + "groupBy": [], + "having": [], + "intersect": None, + "limit": None, + "orderBy": [], + "select": [False, []], + "union": None, + "where": [], + } + eval_err_num += 1 + print("eval_err_num:{}".format(eval_err_num)) + + # rebuild sql for value evaluation + kmap = kmaps[db_name] + g_valid_col_units = build_valid_col_units(g_sql["from"]["table_units"], schema) + g_sql = rebuild_sql_val(g_sql) + g_sql = rebuild_sql_col(g_valid_col_units, g_sql, kmap) + p_valid_col_units = build_valid_col_units(p_sql["from"]["table_units"], schema) + p_sql = rebuild_sql_val(p_sql) + p_sql = rebuild_sql_col(p_valid_col_units, p_sql, kmap) + + if etype in ["all", "exec"]: + exec_score = eval_exec_match(db, p_str, g_str, p_sql, g_sql) + if exec_score: + scores[hardness]["exec"] += 1 + + if etype in ["all", "match"]: + exact_score = evaluator.eval_exact_match(p_sql, g_sql) + partial_scores = evaluator.partial_scores + if exact_score == 0: + print("{} pred: {}".format(hardness, p_str)) + print("{} gold: {}".format(hardness, g_str)) + print("") + scores[hardness]["exact"] += exact_score + scores["all"]["exact"] += exact_score + for type_ in partial_types: + if partial_scores[type_]["pred_total"] > 0: + scores[hardness]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores[hardness]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores[hardness]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores[hardness]["partial"][type_]["rec_count"] += 1 + scores[hardness]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + if partial_scores[type_]["pred_total"] > 0: + scores["all"]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores["all"]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores["all"]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores["all"]["partial"][type_]["rec_count"] += 1 + scores["all"]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + + entries.append( + { + "predictSQL": p_str, + "goldSQL": g_str, + "hardness": hardness, + "exact": 
exact_score, + "partial": partial_scores, + } + ) + + for level in levels: + if scores[level]["count"] == 0: + continue + if etype in ["all", "exec"]: + scores[level]["exec"] /= scores[level]["count"] + + if etype in ["all", "match"]: + scores[level]["exact"] /= scores[level]["count"] + for type_ in partial_types: + if scores[level]["partial"][type_]["acc_count"] == 0: + scores[level]["partial"][type_]["acc"] = 0 + else: + scores[level]["partial"][type_]["acc"] = ( + scores[level]["partial"][type_]["acc"] / scores[level]["partial"][type_]["acc_count"] * 1.0 + ) + if scores[level]["partial"][type_]["rec_count"] == 0: + scores[level]["partial"][type_]["rec"] = 0 + else: + scores[level]["partial"][type_]["rec"] = ( + scores[level]["partial"][type_]["rec"] / scores[level]["partial"][type_]["rec_count"] * 1.0 + ) + if scores[level]["partial"][type_]["acc"] == 0 and scores[level]["partial"][type_]["rec"] == 0: + scores[level]["partial"][type_]["f1"] = 1 + else: + scores[level]["partial"][type_]["f1"] = ( + 2.0 + * scores[level]["partial"][type_]["acc"] + * scores[level]["partial"][type_]["rec"] + / (scores[level]["partial"][type_]["rec"] + scores[level]["partial"][type_]["acc"]) + ) + + print_scores(scores, etype) + + +def eval_exec_match(db, p_str, g_str, pred, gold): + """ + return 1 if the values between prediction and gold are matching + in the corresponding index. Currently not support multiple col_unit(pairs). + """ + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(p_str) + p_res = cursor.fetchall() + except Exception: + return False + + cursor.execute(g_str) + q_res = cursor.fetchall() + + def res_map(res, val_units): + rmap = {} + for idx, val_unit in enumerate(val_units): + key = tuple(val_unit[1]) if not val_unit[2] else (val_unit[0], tuple(val_unit[1]), tuple(val_unit[2])) + rmap[key] = [r[idx] for r in res] + return rmap + + p_val_units = [unit[1] for unit in pred["select"][1]] + q_val_units = [unit[1] for unit in gold["select"][1]] + return res_map(p_res, p_val_units) == res_map(q_res, q_val_units) + + +# Rebuild SQL functions for value evaluation +def rebuild_cond_unit_val(cond_unit): + if cond_unit is None or not DISABLE_VALUE: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is not dict: + val1 = None + else: + val1 = rebuild_sql_val(val1) + if type(val2) is not dict: + val2 = None + else: + val2 = rebuild_sql_val(val2) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_val(condition): + if condition is None or not DISABLE_VALUE: + return condition + + res = [] + for idx, it in enumerate(condition): + if idx % 2 == 0: + res.append(rebuild_cond_unit_val(it)) + else: + res.append(it) + return res + + +def rebuild_sql_val(sql): + if sql is None or not DISABLE_VALUE: + return sql + + sql["from"]["conds"] = rebuild_condition_val(sql["from"]["conds"]) + sql["having"] = rebuild_condition_val(sql["having"]) + sql["where"] = rebuild_condition_val(sql["where"]) + sql["intersect"] = rebuild_sql_val(sql["intersect"]) + sql["except"] = rebuild_sql_val(sql["except"]) + sql["union"] = rebuild_sql_val(sql["union"]) + + return sql + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.idMap.values(): + if "." 
in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + if col_unit is None: + return col_unit + + agg_id, col_id, distinct = col_unit + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + if DISABLE_DISTINCT: + distinct = None + return agg_id, col_id, distinct + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return unit_op, col_unit1, col_unit2 + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap): + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, tuple): + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap): + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_col(valid_col_units, condition, kmap): + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + if sel is None: + return sel + distinct, _list = sel + new_list = [] + for it in _list: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + if DISABLE_DISTINCT: + distinct = None + return distinct, new_list + + +def rebuild_from_col(valid_col_units, from_, kmap): + if from_ is None: + return from_ + + from_["table_units"] = [ + rebuild_table_unit_col(valid_col_units, table_unit, kmap) for table_unit in from_["table_units"] + ] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], kmap) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [rebuild_val_unit_col(valid_col_units, val_unit, kmap) for val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap): + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap) + + return sql + + +def build_foreign_key_map(entry): + cols_orig = entry["column_names_original"] + tables_orig = 
entry["table_names_original"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." + c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--gold", dest="gold", type=str) + parser.add_argument("--pred", dest="pred", type=str) + parser.add_argument("--db", dest="db", type=str) + parser.add_argument("--table", dest="table", type=str) + parser.add_argument("--etype", dest="etype", type=str) + args = parser.parse_args() + + gold = args.gold + pred = args.pred + db_dir = args.db + table = args.table + etype = args.etype + + assert etype in ["all", "exec", "match"], "Unknown evaluation method" + + kmaps = build_foreign_key_map_from_json(table) + + evaluate(gold, pred, db_dir, etype, kmaps) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/evaluation_sqa.py b/examples/text_to_sql/IGSQL/eval_scripts/evaluation_sqa.py new file mode 100644 index 0000000000000000000000000000000000000000..4330e660ad05e7cf5d88d85b6b2d8f632f2c556d --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/evaluation_sqa.py @@ -0,0 +1,995 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import sqlite3 + +from process_sql import Schema, get_schema, get_sql + +# Flag to disable value evaluation +DISABLE_VALUE = True +# Flag to disable distinct in select evaluation +DISABLE_DISTINCT = True + +################################ +# val: number(float)/string(str)/sql(dict) +# col_unit: (agg_id, col_id, isDistinct(bool)) +# val_unit: (unit_op, col_unit1, col_unit2) +# table_unit: (table_type, col_unit/sql) +# cond_unit: (not_op, op_id, val_unit, val1, val2) +# condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +# sql { +# 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) +# 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} +# 'where': condition +# 'groupBy': [col_unit1, col_unit2, ...] 
+# 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) +# 'having': condition +# 'limit': None/limit value +# 'intersect': None/sql +# 'except': None/sql +# 'union': None/sql +# } +################################ + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +WHERE_OPS = ("not", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like", "is", "exists") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +COND_OPS = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +HARDNESS = { + "component1": ("where", "group", "order", "limit", "join", "or", "like"), + "component2": ("except", "union", "intersect"), +} + + +def condition_has_or(conds): + return "or" in conds[1::2] + + +def condition_has_like(conds): + return WHERE_OPS.index("like") in [cond_unit[1] for cond_unit in conds[::2]] + + +def condition_has_sql(conds): + for cond_unit in conds[::2]: + val1, val2 = cond_unit[3], cond_unit[4] + if val1 is not None and type(val1) is dict: + return True + if val2 is not None and type(val2) is dict: + return True + return False + + +def val_has_op(val_unit): + return val_unit[0] != UNIT_OPS.index("none") + + +def has_agg(unit): + return unit[0] != AGG_OPS.index("none") + + +def accuracy(count, total): + if count == total: + return 1 + return 0 + + +def recall(count, total): + if count == total: + return 1 + return 0 + + +def F1(acc, rec): + if (acc + rec) == 0: + return 0 + return (2.0 * acc * rec) / (acc + rec) + + +def get_scores(count, pred_total, label_total): + if pred_total != label_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, label): + pred_sel = pred["select"][1] + label_sel = label["select"][1] + label_wo_agg = [unit[1] for unit in label_sel] + pred_total = len(pred_sel) + label_total = len(label_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in label_sel: + cnt += 1 + label_sel.remove(unit) + if unit[1] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[1]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_where(pred, label): + pred_conds = [unit for unit in pred["where"][::2]] + label_conds = [unit for unit in label["where"][::2]] + label_wo_agg = [unit[2] for unit in label_conds] + pred_total = len(pred_conds) + label_total = len(label_conds) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_conds: + if unit in label_conds: + cnt += 1 + label_conds.remove(unit) + if unit[2] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[2]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_group(pred, label): + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + pred_total = len(pred_cols) + label_total = len(label_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + label_cols = [label.split(".")[1] if "." 
in label else label for label in label_cols] + for col in pred_cols: + if col in label_cols: + cnt += 1 + label_cols.remove(col) + return label_total, pred_total, cnt + + +def eval_having(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["groupBy"]) > 0: + pred_total = 1 + if len(label["groupBy"]) > 0: + label_total = 1 + + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + if pred_total == label_total == 1 and pred_cols == label_cols and pred["having"] == label["having"]: + cnt = 1 + + return label_total, pred_total, cnt + + +def eval_order(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(label["orderBy"]) > 0: + label_total = 1 + if ( + len(label["orderBy"]) > 0 + and pred["orderBy"] == label["orderBy"] + and ( + (pred["limit"] is None and label["limit"] is None) + or (pred["limit"] is not None and label["limit"] is not None) + ) + ): + cnt = 1 + return label_total, pred_total, cnt + + +def eval_and_or(pred, label): + pred_ao = pred["where"][1::2] + label_ao = label["where"][1::2] + pred_ao = set(pred_ao) + label_ao = set(label_ao) + + if pred_ao == label_ao: + return 1, 1, 1 + return len(pred_ao), len(label_ao), 0 + + +def get_nestedSQL(sql): + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, label): + label_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if label is not None: + label_total += 1 + if pred is not None and label is not None: + cnt += Evaluator().eval_exact_match(pred, label) + return label_total, pred_total, cnt + + +def eval_IUEN(pred, label): + lt1, pt1, cnt1 = eval_nested(pred["intersect"], label["intersect"]) + lt2, pt2, cnt2 = eval_nested(pred["except"], label["except"]) + lt3, pt3, cnt3 = eval_nested(pred["union"], label["union"]) + label_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return label_total, pred_total, cnt + + +def get_keywords(sql): + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, label): + pred_keywords = 
get_keywords(pred) + label_keywords = get_keywords(label) + pred_total = len(pred_keywords) + label_total = len(label_keywords) + cnt = 0 + + for k in pred_keywords: + if k in label_keywords: + cnt += 1 + return label_total, pred_total, cnt + + +def count_agg(units): + return len([unit for unit in units if has_agg(unit)]) + + +def count_component1(sql): + count = 0 + if len(sql["where"]) > 0: + count += 1 + if len(sql["groupBy"]) > 0: + count += 1 + if len(sql["orderBy"]) > 0: + count += 1 + if sql["limit"] is not None: + count += 1 + if len(sql["from"]["table_units"]) > 0: # JOIN + count += len(sql["from"]["table_units"]) - 1 + + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + count += len([token for token in ao if token == "or"]) + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + count += len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) + + return count + + +def count_component2(sql): + nested = get_nestedSQL(sql) + return len(nested) + + +def count_others(sql): + count = 0 + # number of aggregation + agg_count = count_agg(sql["select"][1]) + agg_count += count_agg(sql["where"][::2]) + agg_count += count_agg(sql["groupBy"]) + if len(sql["orderBy"]) > 0: + agg_count += count_agg( + [unit[1] for unit in sql["orderBy"][1] if unit[1]] + [unit[2] for unit in sql["orderBy"][1] if unit[2]] + ) + agg_count += count_agg(sql["having"]) + if agg_count > 1: + count += 1 + + # number of select columns + if len(sql["select"][1]) > 1: + count += 1 + + # number of where conditions + if len(sql["where"]) > 1: + count += 1 + + # number of group by clauses + if len(sql["groupBy"]) > 1: + count += 1 + + return count + + +class Evaluator: + """A simple evaluator""" + + def __init__(self): + self.partial_scores = None + + def eval_hardness(self, sql): + count_comp1_ = count_component1(sql) + count_comp2_ = count_component2(sql) + count_others_ = count_others(sql) + + if count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ == 0: + return "easy" + elif (count_others_ <= 2 and count_comp1_ <= 1 and count_comp2_ == 0) or ( + count_comp1_ <= 2 and count_others_ < 2 and count_comp2_ == 0 + ): + return "medium" + elif ( + (count_others_ > 2 and count_comp1_ <= 2 and count_comp2_ == 0) + or (2 < count_comp1_ <= 3 and count_others_ <= 2 and count_comp2_ == 0) + or (count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ <= 1) + ): + return "hard" + else: + return "extra" + + def eval_exact_match(self, pred, label): + partial_scores = self.eval_partial_match(pred, label) + self.partial_scores = partial_scores + + for key, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + + if len(label["from"]["table_units"]) > 0: + label_tables = sorted(label["from"]["table_units"]) + pred_tables = sorted(pred["from"]["table_units"]) + return label_tables == pred_tables + return 1 + + def eval_partial_match(self, pred, label): + res = {} + + label_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["select(no AGG)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["where"] = {"acc": 
acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_group(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group(no Having)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt = eval_having(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_order(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_and_or(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_IUEN(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_keywords(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + return res + + +def isValidSQL(sql, db): + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(sql) + except Exception: + return False + return True + + +def print_scores(scores, etype): + turns = ["turn 1", "turn 2", "turn 3", "turn 4", "turn >4"] + levels = ["easy", "medium", "hard", "extra", "all", "joint_all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + + print("{:20} {:20} {:20} {:20} {:20} {:20} {:20}".format("", *levels)) + counts = [scores[level]["count"] for level in levels] + print("{:20} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d}".format("count", *counts)) + + if etype in ["all", "exec"]: + print("===================== EXECUTION ACCURACY =====================") + this_scores = [scores[level]["exec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("execution", *this_scores)) + + if etype in ["all", "match"]: + print("\n====================== EXACT MATCHING ACCURACY =====================") + exact_scores = [scores[level]["exact"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("exact match", *exact_scores)) + print("\n---------------------PARTIAL MATCHING ACCURACY----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["acc"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- PARTIAL MATCHING RECALL ----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["rec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- 
PARTIAL MATCHING F1 --------------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["f1"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("\n\n{:20} {:20} {:20} {:20} {:20} {:20}".format("", *turns)) + counts = [scores[turn]["count"] for turn in turns] + print("{:20} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d}".format("count", *counts)) + + if etype in ["all", "exec"]: + print("===================== TRUN XECUTION ACCURACY =====================") + this_scores = [scores[turn]["exec"] for turn in turns] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("execution", *this_scores)) + + if etype in ["all", "match"]: + print("\n====================== TRUN EXACT MATCHING ACCURACY =====================") + exact_scores = [scores[turn]["exact"] for turn in turns] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("exact match", *exact_scores)) + + +def evaluate(gold, predict, db_dir, etype, kmaps): + with open(gold) as f: + glist = [] + gseq_one = [] + for l in f.readlines(): + if len(l.strip()) == 0: + glist.append(gseq_one) + gseq_one = [] + else: + lstrip = l.strip().split("\t") + gseq_one.append(lstrip) + + with open(predict) as f: + plist = [] + pseq_one = [] + for l in f.readlines(): + if len(l.strip()) == 0: + plist.append(pseq_one) + pseq_one = [] + else: + pseq_one.append(l.strip().split("\t")) + evaluator = Evaluator() + + turns = ["turn 1", "turn 2", "turn 3", "turn 4", "turn >4"] + levels = ["easy", "medium", "hard", "extra", "all", "joint_all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + entries = [] + scores = {} + + for turn in turns: + scores[turn] = {"count": 0, "exact": 0.0} + scores[turn]["exec"] = 0 + + for level in levels: + scores[level] = {"count": 0, "partial": {}, "exact": 0.0} + scores[level]["exec"] = 0 + for type_ in partial_types: + scores[level]["partial"][type_] = {"acc": 0.0, "rec": 0.0, "f1": 0.0, "acc_count": 0, "rec_count": 0} + + eval_err_num = 0 + for p, g in zip(plist, glist): + print("----------------------interaction begin--------------") + scores["joint_all"]["count"] += 1 + turn_scores = {"exec": [], "exact": []} + for idx, pg in enumerate(zip(p, g)): + p, g = pg + p_str = p[0] + p_str = p_str.replace("value", "1") + g_str, db = g + db_name = db + db = os.path.join(db_dir, db, db + ".sqlite") + schema = Schema(get_schema(db)) + g_sql = get_sql(schema, g_str) + hardness = evaluator.eval_hardness(g_sql) + if idx > 3: + idx = ">4" + else: + idx += 1 + turn_id = "turn " + str(idx) + scores[turn_id]["count"] += 1 + scores[hardness]["count"] += 1 + scores["all"]["count"] += 1 + + try: + p_sql = get_sql(schema, p_str) + except Exception: + # If p_sql is not valid, then we will use an empty sql to evaluate with the correct sql + p_sql = { + "except": None, + "from": {"conds": [], "table_units": []}, + "groupBy": [], + "having": [], + "intersect": None, + "limit": None, + "orderBy": [], + "select": [False, []], + "union": None, + "where": [], + } + eval_err_num += 1 + print("eval_err_num:{}".format(eval_err_num)) + + # rebuild sql for value evaluation + kmap = kmaps[db_name] + g_valid_col_units = build_valid_col_units(g_sql["from"]["table_units"], schema) + g_sql = rebuild_sql_val(g_sql) + g_sql = rebuild_sql_col(g_valid_col_units, g_sql, kmap) + p_valid_col_units = 
build_valid_col_units(p_sql["from"]["table_units"], schema) + p_sql = rebuild_sql_val(p_sql) + p_sql = rebuild_sql_col(p_valid_col_units, p_sql, kmap) + + if etype in ["all", "exec"]: + exec_score = eval_exec_match(db, p_str, g_str, p_sql, g_sql) + if exec_score: + scores[hardness]["exec"] += 1 + scores[turn_id]["exec"] += 1 + turn_scores["exec"].append(1) + else: + turn_scores["exec"].append(0) + + if etype in ["all", "match"]: + exact_score = evaluator.eval_exact_match(p_sql, g_sql) + partial_scores = evaluator.partial_scores + if exact_score == 0: + turn_scores["exact"].append(0) + print("{} pred: {}".format(hardness, p_str)) + print("{} gold: {}".format(hardness, g_str)) + print("") + else: + print("correct") + print("{} pred: {}".format(hardness, p_str)) + print("{} gold: {}".format(hardness, g_str)) + print("") + turn_scores["exact"].append(1) + scores[turn_id]["exact"] += exact_score + scores[hardness]["exact"] += exact_score + scores["all"]["exact"] += exact_score + for type_ in partial_types: + if partial_scores[type_]["pred_total"] > 0: + scores[hardness]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores[hardness]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores[hardness]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores[hardness]["partial"][type_]["rec_count"] += 1 + scores[hardness]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + if partial_scores[type_]["pred_total"] > 0: + scores["all"]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores["all"]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores["all"]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores["all"]["partial"][type_]["rec_count"] += 1 + scores["all"]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + + entries.append( + { + "predictSQL": p_str, + "goldSQL": g_str, + "hardness": hardness, + "exact": exact_score, + "partial": partial_scores, + } + ) + + if all(v == 1 for v in turn_scores["exec"]): + scores["joint_all"]["exec"] += 1 + + if all(v == 1 for v in turn_scores["exact"]): + scores["joint_all"]["exact"] += 1 + print("all correct") + + for turn in turns: + if scores[turn]["count"] == 0: + continue + if etype in ["all", "exec"]: + scores[turn]["exec"] /= scores[turn]["count"] + + if etype in ["all", "match"]: + scores[turn]["exact"] /= scores[turn]["count"] + + for level in levels: + if scores[level]["count"] == 0: + continue + if etype in ["all", "exec"]: + scores[level]["exec"] /= scores[level]["count"] + + if etype in ["all", "match"]: + scores[level]["exact"] /= scores[level]["count"] + for type_ in partial_types: + if scores[level]["partial"][type_]["acc_count"] == 0: + scores[level]["partial"][type_]["acc"] = 0 + else: + scores[level]["partial"][type_]["acc"] = ( + scores[level]["partial"][type_]["acc"] / scores[level]["partial"][type_]["acc_count"] * 1.0 + ) + if scores[level]["partial"][type_]["rec_count"] == 0: + scores[level]["partial"][type_]["rec"] = 0 + else: + scores[level]["partial"][type_]["rec"] = ( + scores[level]["partial"][type_]["rec"] / scores[level]["partial"][type_]["rec_count"] * 1.0 + ) + if scores[level]["partial"][type_]["acc"] == 0 and scores[level]["partial"][type_]["rec"] == 0: + scores[level]["partial"][type_]["f1"] = 1 + else: + scores[level]["partial"][type_]["f1"] = ( + 2.0 + * scores[level]["partial"][type_]["acc"] + * scores[level]["partial"][type_]["rec"] + / (scores[level]["partial"][type_]["rec"] + 
scores[level]["partial"][type_]["acc"]) + ) + + print_scores(scores, etype) + + +def eval_exec_match(db, p_str, g_str, pred, gold): + """ + return 1 if the values between prediction and gold are matching + in the corresponding index. Currently not support multiple col_unit(pairs). + """ + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(p_str) + p_res = cursor.fetchall() + except Exception: + return False + + cursor.execute(g_str) + q_res = cursor.fetchall() + + def res_map(res, val_units): + rmap = {} + for idx, val_unit in enumerate(val_units): + key = tuple(val_unit[1]) if not val_unit[2] else (val_unit[0], tuple(val_unit[1]), tuple(val_unit[2])) + rmap[key] = [r[idx] for r in res] + return rmap + + p_val_units = [unit[1] for unit in pred["select"][1]] + q_val_units = [unit[1] for unit in gold["select"][1]] + return res_map(p_res, p_val_units) == res_map(q_res, q_val_units) + + +# Rebuild SQL functions for value evaluation +def rebuild_cond_unit_val(cond_unit): + if cond_unit is None or not DISABLE_VALUE: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is not dict: + val1 = None + else: + val1 = rebuild_sql_val(val1) + if type(val2) is not dict: + val2 = None + else: + val2 = rebuild_sql_val(val2) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_val(condition): + if condition is None or not DISABLE_VALUE: + return condition + + res = [] + for idx, it in enumerate(condition): + if idx % 2 == 0: + res.append(rebuild_cond_unit_val(it)) + else: + res.append(it) + return res + + +def rebuild_sql_val(sql): + if sql is None or not DISABLE_VALUE: + return sql + + sql["from"]["conds"] = rebuild_condition_val(sql["from"]["conds"]) + sql["having"] = rebuild_condition_val(sql["having"]) + sql["where"] = rebuild_condition_val(sql["where"]) + sql["intersect"] = rebuild_sql_val(sql["intersect"]) + sql["except"] = rebuild_sql_val(sql["except"]) + sql["union"] = rebuild_sql_val(sql["union"]) + + return sql + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.idMap.values(): + if "." 
in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + if col_unit is None: + return col_unit + + agg_id, col_id, distinct = col_unit + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + if DISABLE_DISTINCT: + distinct = None + return agg_id, col_id, distinct + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return unit_op, col_unit1, col_unit2 + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap): + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, tuple): + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap): + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_col(valid_col_units, condition, kmap): + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + if sel is None: + return sel + distinct, _list = sel + new_list = [] + for it in _list: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + if DISABLE_DISTINCT: + distinct = None + return distinct, new_list + + +def rebuild_from_col(valid_col_units, from_, kmap): + if from_ is None: + return from_ + + from_["table_units"] = [ + rebuild_table_unit_col(valid_col_units, table_unit, kmap) for table_unit in from_["table_units"] + ] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], kmap) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [rebuild_val_unit_col(valid_col_units, val_unit, kmap) for val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap): + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap) + + return sql + + +def build_foreign_key_map(entry): + cols_orig = entry["column_names_original"] + tables_orig = 
entry["table_names_original"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." + c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--gold", dest="gold", type=str) + parser.add_argument("--pred", dest="pred", type=str) + parser.add_argument("--db", dest="db", type=str) + parser.add_argument("--table", dest="table", type=str) + parser.add_argument("--etype", dest="etype", type=str) + args = parser.parse_args() + + gold = args.gold + pred = args.pred + db_dir = args.db + table = args.table + etype = args.etype + + assert etype in ["all", "exec", "match"], "Unknown evaluation method" + + kmaps = build_foreign_key_map_from_json(table) + + evaluate(gold, pred, db_dir, etype, kmaps) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/metric_averages.py b/examples/text_to_sql/IGSQL/eval_scripts/metric_averages.py new file mode 100644 index 0000000000000000000000000000000000000000..4eb3fea22bf5cbe347a1526305d46271ecfa0e9c --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/metric_averages.py @@ -0,0 +1,66 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import sys + +predictions = [json.loads(line) for line in open(sys.argv[1]).readlines() if line] + +string_count = 0.0 +sem_count = 0.0 +syn_count = 0.0 +table_count = 0.0 +strict_table_count = 0.0 + +precision_denom = 0.0 +precision = 0.0 +recall_denom = 0.0 +recall = 0.0 +f1_score = 0.0 +f1_denom = 0.0 + +time = 0.0 + +for prediction in predictions: + if prediction["correct_string"]: + string_count += 1.0 + if prediction["semantic"]: + sem_count += 1.0 + if prediction["syntactic"]: + syn_count += 1.0 + if prediction["correct_table"]: + table_count += 1.0 + if prediction["strict_correct_table"]: + strict_table_count += 1.0 + if prediction["gold_tables"] != "[[]]": + precision += prediction["table_prec"] + precision_denom += 1 + if prediction["pred_table"] != "[]": + recall += prediction["table_rec"] + recall_denom += 1 + + if prediction["gold_tables"] != "[[]]": + f1_score += prediction["table_f1"] + f1_denom += 1 + +num_p = len(predictions) +print("string precision: " + str(string_count / num_p)) +print("% semantic: " + str(sem_count / num_p)) +print("% syntactic: " + str(syn_count / num_p)) +print("table prec: " + str(table_count / num_p)) +print("strict table prec: " + str(strict_table_count / num_p)) +print("table row prec: " + str(precision / precision_denom)) +print("table row recall: " + str(recall / recall_denom)) +print("table row f1: " + str(f1_score / f1_denom)) +print("inference time: " + str(time / num_p)) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/process_sql.py b/examples/text_to_sql/IGSQL/eval_scripts/process_sql.py new file mode 100644 index 0000000000000000000000000000000000000000..4168932db02bc53df13a40a77a145111a66da2a4 --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/process_sql.py @@ -0,0 +1,579 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import sqlite3 + +from nltk import word_tokenize + +################################ +# Assumptions: +# 1. sql is correct +# 2. only table name has alias +# 3. only one intersect/union/except +# +# val: number(float)/string(str)/sql(dict) +# col_unit: (agg_id, col_id, isDistinct(bool)) +# val_unit: (unit_op, col_unit1, col_unit2) +# table_unit: (table_type, col_unit/sql) +# cond_unit: (not_op, op_id, val_unit, val1, val2) +# condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +# sql { +# 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) +# 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} +# 'where': condition +# 'groupBy': [col_unit1, col_unit2, ...] 
+# 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) +# 'having': condition +# 'limit': None/limit value +# 'intersect': None/sql +# 'except': None/sql +# 'union': None/sql +# } +################################ +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +WHERE_OPS = ("not", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like", "is", "exists") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +COND_OPS = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + + +class Schema: + """ + Simple schema which maps table&column to a unique identifier + """ + + def __init__(self, schema): + self._schema = schema + self._idMap = self._map(self._schema) + + @property + def schema(self): + return self._schema + + @property + def idMap(self): + return self._idMap + + def _map(self, schema): + idMap = {"*": "__all__"} + id = 1 + for key, vals in schema.items(): + for val in vals: + idMap[key.lower() + "." + val.lower()] = "__" + key.lower() + "." + val.lower() + "__" + id += 1 + + for key in schema: + idMap[key.lower()] = "__" + key.lower() + "__" + id += 1 + + return idMap + + +def get_schema(db): + """ + Get database's schema, which is a dict with table name as key + and list of column names as value + Args: + db(`str`): Database path. + Returns: + `dict`: Schema dict. + """ + + schema = {} + conn = sqlite3.connect(db) + cursor = conn.cursor() + + # fetch table names + cursor.execute("SELECT name FROM sqlite_master WHERE type='table';") + tables = [str(table[0].lower()) for table in cursor.fetchall()] + + # fetch table info + for table in tables: + cursor.execute("PRAGMA table_info({})".format(table)) + schema[table] = [str(col[1].lower()) for col in cursor.fetchall()] + + return schema + + +def get_schema_from_json(fpath): + with open(fpath) as f: + data = json.load(f) + + schema = {} + for entry in data: + table = str(entry["table"].lower()) + cols = [str(col["column_name"].lower()) for col in entry["col_data"]] + schema[table] = cols + + return schema + + +def tokenize(string): + string = str(string) + string = string.replace("'", '"') # ensures all string values wrapped by "" problem?? 
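(The remainder of `tokenize` below first protects quoted string literals with placeholder tokens, lower-cases and word-tokenizes the query with NLTK, restores the literals, and finally re-joins split comparison operators such as `>` `=` into `>=`. A rough behavioural sketch, assuming the NLTK punkt data is installed; exact tokens depend on `nltk.word_tokenize`:)

```python
from process_sql import tokenize  # module path assumed from the diff layout

toks = tokenize('SELECT name FROM singer WHERE age >= 20 AND country = "France"')
# roughly: ['select', 'name', 'from', 'singer', 'where', 'age', '>=', '20',
#           'and', 'country', '=', '"France"']
```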
+ quote_idxs = [idx for idx, char in enumerate(string) if char == '"'] + assert len(quote_idxs) % 2 == 0, "Unexpected quote" + + # keep string value as token + vals = {} + for i in range(len(quote_idxs) - 1, -1, -2): + qidx1 = quote_idxs[i - 1] + qidx2 = quote_idxs[i] + val = string[qidx1 : qidx2 + 1] + key = "__val_{}_{}__".format(qidx1, qidx2) + string = string[:qidx1] + key + string[qidx2 + 1 :] + vals[key] = val + + toks = [word.lower() for word in word_tokenize(string)] + # replace with string value token + for i in range(len(toks)): + if toks[i] in vals: + toks[i] = vals[toks[i]] + + # find if there exists !=, >=, <= + eq_idxs = [idx for idx, tok in enumerate(toks) if tok == "="] + eq_idxs.reverse() + prefix = ("!", ">", "<") + for eq_idx in eq_idxs: + pre_tok = toks[eq_idx - 1] + if pre_tok in prefix: + toks = toks[: eq_idx - 1] + [pre_tok + "="] + toks[eq_idx + 1 :] + + return toks + + +def scan_alias(toks): + """Scan the index of 'as' and build the map for all alias""" + as_idxs = [idx for idx, tok in enumerate(toks) if tok == "as"] + alias = {} + for idx in as_idxs: + alias[toks[idx + 1]] = toks[idx - 1] + return alias + + +def get_tables_with_alias(schema, toks): + tables = scan_alias(toks) + for key in schema: + assert key not in tables, "Alias {} has the same name in table".format(key) + tables[key] = key + return tables + + +def parse_col(toks, start_idx, tables_with_alias, schema, default_tables=None): + tok = toks[start_idx] + if tok == "*": + return start_idx + 1, schema.idMap[tok] + + if "." in tok: # if token is a composite + alias, col = tok.split(".") + key = tables_with_alias[alias] + "." + col + return start_idx + 1, schema.idMap[key] + + assert default_tables is not None and len(default_tables) > 0, "Default tables should not be None or empty" + + for alias in default_tables: + table = tables_with_alias[alias] + if tok in schema.schema[table]: + key = table + "." 
+ tok + return start_idx + 1, schema.idMap[key] + + assert False, "Error col: {}".format(tok) + + +def parse_col_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + isDistinct = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + assert idx < len_ and toks[idx] == "(" + idx += 1 + if toks[idx] == "distinct": + idx += 1 + isDistinct = True + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + assert idx < len_ and toks[idx] == ")" + idx += 1 + return idx, (agg_id, col_id, isDistinct) + + if toks[idx] == "distinct": + idx += 1 + isDistinct = True + agg_id = AGG_OPS.index("none") + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (agg_id, col_id, isDistinct) + + +def parse_val_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + col_unit1 = None + col_unit2 = None + unit_op = UNIT_OPS.index("none") + + idx, col_unit1 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + if idx < len_ and toks[idx] in UNIT_OPS: + unit_op = UNIT_OPS.index(toks[idx]) + idx += 1 + idx, col_unit2 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (unit_op, col_unit1, col_unit2) + + +def parse_table_unit(toks, start_idx, tables_with_alias, schema): + idx = start_idx + len_ = len(toks) + key = tables_with_alias[toks[idx]] + + if idx + 1 < len_ and toks[idx + 1] == "as": + idx += 3 + else: + idx += 1 + + return idx, schema.idMap[key], key + + +def parse_value(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, val = parse_sql(toks, idx, tables_with_alias, schema) + elif '"' in toks[idx]: # token is a string value + val = toks[idx] + idx += 1 + else: + try: + val = float(toks[idx]) + idx += 1 + except Exception: + end_idx = idx + while ( + end_idx < len_ + and toks[end_idx] != "," + and toks[end_idx] != ")" + and toks[end_idx] != "and" + and toks[end_idx] not in CLAUSE_KEYWORDS + and toks[end_idx] not in JOIN_KEYWORDS + ): + end_idx += 1 + + idx, val = parse_col_unit(toks[start_idx:end_idx], 0, tables_with_alias, schema, default_tables) + idx = end_idx + + if isBlock: + assert toks[idx] == ")" + idx += 1 + + return idx, val + + +def parse_condition(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + conds = [] + + while idx < len_: + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + not_op = False + if toks[idx] == "not": + not_op = True + idx += 1 + + assert idx < len_ and toks[idx] in WHERE_OPS, "Error condition: idx: {}, tok: {}".format(idx, toks[idx]) + op_id = WHERE_OPS.index(toks[idx]) + idx += 1 + val1 = val2 = None + if op_id == WHERE_OPS.index("between"): # between..and... 
special case: dual values + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + assert toks[idx] == "and" + idx += 1 + idx, val2 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + else: # normal case: single value + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + val2 = None + + conds.append((not_op, op_id, val_unit, val1, val2)) + + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";") or toks[idx] in JOIN_KEYWORDS): + break + + if idx < len_ and toks[idx] in COND_OPS: + conds.append(toks[idx]) + idx += 1 # skip and/or + + return idx, conds + + +def parse_select(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + assert toks[idx] == "select", "'select' not found" + idx += 1 + isDistinct = False + if idx < len_ and toks[idx] == "distinct": + idx += 1 + isDistinct = True + val_units = [] + + while idx < len_ and toks[idx] not in CLAUSE_KEYWORDS: + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + + return idx, (isDistinct, val_units) + + +def parse_from(toks, start_idx, tables_with_alias, schema): + """ + Assume in the from clause, all table units are combined with join + """ + assert "from" in toks[start_idx:], "'from' not found" + + len_ = len(toks) + idx = toks.index("from", start_idx) + 1 + default_tables = [] + table_units = [] + conds = [] + + while idx < len_: + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, sql = parse_sql(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["sql"], sql)) + else: + if idx < len_ and toks[idx] == "join": + idx += 1 # skip join + idx, table_unit, table_name = parse_table_unit(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["table_unit"], table_unit)) + default_tables.append(table_name) + if idx < len_ and toks[idx] == "on": + idx += 1 # skip on + idx, this_conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + if len(conds) > 0: + conds.append("and") + conds.extend(this_conds) + + if isBlock: + assert toks[idx] == ")" + idx += 1 + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + break + + return idx, table_units, conds, default_tables + + +def parse_where(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "where": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_group_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + col_units = [] + + if idx >= len_ or toks[idx] != "group": + return idx, col_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, col_unit = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + col_units.append(col_unit) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, col_units + + +def parse_order_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + val_units 
= [] + order_type = "asc" # default type is 'asc' + + if idx >= len_ or toks[idx] != "order": + return idx, val_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append(val_unit) + if idx < len_ and toks[idx] in ORDER_OPS: + order_type = toks[idx] + idx += 1 + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, (order_type, val_units) + + +def parse_having(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "having": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_limit(toks, start_idx): + idx = start_idx + len_ = len(toks) + + if idx < len_ and toks[idx] == "limit": + idx += 2 + # make limit value can work, cannot assume put 1 as a fake limit number + if type(toks[idx - 1]) != int: + return idx, 1 + + return idx, int(toks[idx - 1]) + + return idx, None + + +def parse_sql(toks, start_idx, tables_with_alias, schema): + isBlock = False # indicate whether this is a block of sql/sub-sql + len_ = len(toks) + idx = start_idx + + sql = {} + if toks[idx] == "(": + isBlock = True + idx += 1 + + # parse from clause in order to get default tables + from_end_idx, table_units, conds, default_tables = parse_from(toks, start_idx, tables_with_alias, schema) + sql["from"] = {"table_units": table_units, "conds": conds} + # select clause + _, select_col_units = parse_select(toks, idx, tables_with_alias, schema, default_tables) + idx = from_end_idx + sql["select"] = select_col_units + # where clause + idx, where_conds = parse_where(toks, idx, tables_with_alias, schema, default_tables) + sql["where"] = where_conds + # group by clause + idx, group_col_units = parse_group_by(toks, idx, tables_with_alias, schema, default_tables) + sql["groupBy"] = group_col_units + # having clause + idx, having_conds = parse_having(toks, idx, tables_with_alias, schema, default_tables) + sql["having"] = having_conds + # order by clause + idx, order_col_units = parse_order_by(toks, idx, tables_with_alias, schema, default_tables) + sql["orderBy"] = order_col_units + # limit clause + idx, limit_val = parse_limit(toks, idx) + sql["limit"] = limit_val + + idx = skip_semicolon(toks, idx) + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + idx = skip_semicolon(toks, idx) + + # intersect/union/except clause + for op in SQL_OPS: # initialize IUE + sql[op] = None + if idx < len_ and toks[idx] in SQL_OPS: + sql_op = toks[idx] + idx += 1 + idx, IUE_sql = parse_sql(toks, idx, tables_with_alias, schema) + sql[sql_op] = IUE_sql + return idx, sql + + +def load_data(fpath): + with open(fpath) as f: + data = json.load(f) + return data + + +def get_sql(schema, query): + toks = tokenize(query) + tables_with_alias = get_tables_with_alias(schema.schema, toks) + _, sql = parse_sql(toks, 0, tables_with_alias, schema) + + return sql + + +def skip_semicolon(toks, start_idx): + idx = start_idx + while idx < len(toks) and toks[idx] == ";": + idx += 1 + return idx diff --git a/examples/text_to_sql/IGSQL/logger.py b/examples/text_to_sql/IGSQL/logger.py new file mode 100644 index 0000000000000000000000000000000000000000..4d0584f7515fcbf28cc3f7afc43222e3090c8e74 --- /dev/null +++ b/examples/text_to_sql/IGSQL/logger.py @@ -0,0 +1,67 @@ +# 
Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains the logging class.""" + + +class Logger: + def __init__(self, filename, option): + self.fileptr = open(filename, option) + if option == "r": + self.lines = self.fileptr.readlines() + else: + self.lines = [] + + def put(self, string): + """Writes to the file.""" + self.fileptr.write(string + "\n") + self.fileptr.flush() + + def close(self): + """Closes the logger.""" + self.fileptr.close() + + def findlast(self, identifier, default=0.0): + """Finds the last line in the log with a certain value.""" + for line in self.lines[::-1]: + if line.lower().startswith(identifier): + string = line.strip().split("\t")[1] + if string.replace(".", "").isdigit(): + return float(string) + elif string.lower() == "true": + return True + elif string.lower() == "false": + return False + else: + return string + return default + + def contains(self, string): + """Determines whether the string is present in the log.""" + for line in self.lines[::-1]: + if string.lower() in line.lower(): + return True + return False + + def findlast_log_before(self, before_str): + """Finds the last entry in the log before another entry.""" + loglines = [] + in_line = False + for line in self.lines[::-1]: + if line.startswith(before_str): + in_line = True + elif in_line: + loglines.append(line) + if line.strip() == "" and in_line: + return "".join(loglines[::-1]) + return "".join(loglines[::-1]) diff --git a/examples/text_to_sql/IGSQL/model/attention.py b/examples/text_to_sql/IGSQL/model/attention.py new file mode 100644 index 0000000000000000000000000000000000000000..a1989ec726cd17754081067b624b69dc66162486 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/attention.py @@ -0,0 +1,103 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains classes for computing and keeping track of attention distributions. +""" +from collections import namedtuple + +import numpy as np +import paddle +import paddle.nn.functional as F + +np.random.seed(0) + + +class AttentionResult(namedtuple("AttentionResult", ("scores", "distribution", "vector"))): + """Stores the result of an attention calculation.""" + + __slots__ = () + + +class Attention(paddle.nn.Layer): + """Attention mechanism class. Stores parameters for and computes attention. 
+ + Attributes: + transform_query (`bool`): Whether or not to transform the query being + passed in with a weight transformation before computing attentino. + transform_key (`bool`): Whether or not to transform the key being + passed in with a weight transformation before computing attentino. + transform_value (`bool`): Whether or not to transform the value being + passed in with a weight transformation before computing attentino. + key_size (`int`): The size of the key vectors. + value_size (`int`): The size of the value vectors. + the query or key. + query_weights (`Parameter`): Weights for transforming the query. + key_weights (`Parameter`): Weights for transforming the key. + value_weights (`Parameter`): Weights for transforming the value. + """ + + def __init__(self, query_size, key_size, value_size): + super().__init__() + self.key_size = key_size + self.value_size = value_size + + _initializer = paddle.nn.initializer.XavierUniform() + + query_weights = paddle.ParamAttr(initializer=_initializer) + + self.query_linear = paddle.nn.Linear(query_size, self.key_size, weight_attr=query_weights, bias_attr=False) + + def transform_arguments(self, query, keys, values): + """Transforms the query/key/value inputs before attention calculations. + + Arguments: + query (`Tensor`): Vector representing the query (e.g., hidden state.) + keys (`list`): List of vectors representing the key + values. + values (`list`): List of vectors representing the values. + + Returns: + `triple`: The first represents the (transformed) + query, the second represents the (transformed and concatenated) + keys, and the third represents the (transformed and concatenated) + values. + """ + assert len(keys) == len(values) + + all_keys = paddle.stack(keys, axis=1) + all_values = paddle.stack(values, axis=1) + + assert all_keys.shape[0] == self.key_size, ( + "Expected key size of " + str(self.key_size) + " but got " + str(all_keys.shape[0]) + ) + assert all_values.shape[0] == self.value_size + + if query.dim() == 1: + query = query.unsqueeze(0) + query = self.query_linear(query) + + return query, all_keys, all_values + + def forward(self, query, keys, values=None): + if not values: + values = keys + + query_t, keys_t, values_t = self.transform_arguments(query, keys, values) + + scores = paddle.t(paddle.mm(query_t, keys_t)) # len(key) x len(query) + + distribution = F.softmax(scores, axis=0) # len(key) x len(query) + + context_vector = paddle.mm(values_t, distribution).squeeze() # value_size x len(query) + + return AttentionResult(scores, distribution, context_vector) diff --git a/examples/text_to_sql/IGSQL/model/bert_utils.py b/examples/text_to_sql/IGSQL/model/bert_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f1363374684a8e2ea1e1f2044acdf9c00bce9579 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/bert_utils.py @@ -0,0 +1,421 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
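As a quick sanity check of the `Attention` layer defined in `model/attention.py` above, here is a minimal usage sketch; the sizes, import path, and random inputs are illustrative assumptions, not values from this diff:

```python
import paddle

from model.attention import Attention  # import path assumed from the diff layout

attn = Attention(query_size=300, key_size=300, value_size=300)

query = paddle.randn([300])                     # e.g. a decoder hidden state
keys = [paddle.randn([300]) for _ in range(5)]  # e.g. one encoder state per input token

result = attn(query, keys)                      # values default to the keys
print(result.distribution.shape)                # [5, 1]: softmax over the 5 keys
print(result.vector.shape)                      # [300]: attention-weighted context vector
```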
+# modified from https://github.com/naver/sqlova + +import paddle + +from paddlenlp.transformers import BertModel, BertPretrainedModel, BertTokenizer + + +def get_bert(params): + model_bert = BertModel.from_pretrained("bert-base-uncased") + bert_config = BertPretrainedModel.pretrained_init_configuration["bert-base-uncased"] + tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + + return model_bert, tokenizer, bert_config + + +def generate_inputs(tokenizer, nlu1_tok, hds1): + tokens = [] + segment_ids = [] + sent_ids = [] + + t_to_tt_idx_hds1 = [] + + tokens.append("[CLS]") + i_st_nlu = len(tokens) # to use it later + + segment_ids.append(0) + sent_ids.append(0) + for token in nlu1_tok: + tokens.append(token) + segment_ids.append(0) + sent_ids.append(0) + i_ed_nlu = len(tokens) + tokens.append("[SEP]") + segment_ids.append(0) + sent_ids.append(0) + + i_hds = [] + for i, hds11 in enumerate(hds1): + i_st_hd = len(tokens) + t_to_tt_idx_hds11 = [] + sub_tok = [] + for sub_tok1 in hds11.split(): + t_to_tt_idx_hds11.append(len(sub_tok)) + sub_tok += tokenizer.tokenize(sub_tok1) + t_to_tt_idx_hds1.append(t_to_tt_idx_hds11) + tokens += sub_tok + + i_ed_hd = len(tokens) + i_hds.append((i_st_hd, i_ed_hd)) + segment_ids += [1] * len(sub_tok) + sent_ids += [1] * len(sub_tok) + if i < len(hds1) - 1: + tokens.append("[SEP]") + segment_ids.append(0) + sent_ids.append(1) + elif i == len(hds1) - 1: + tokens.append("[SEP]") + segment_ids.append(1) + sent_ids.append(1) + else: + raise EnvironmentError + + i_nlu = (i_st_nlu, i_ed_nlu) + + return tokens, sent_ids, segment_ids, i_nlu, i_hds, t_to_tt_idx_hds1 + + +def gen_l_hpu(i_hds): + """ + # Treat columns as if it is a batch of natural language utterance with batch-size = # of columns * # of batch_size + i_hds = [(17, 18), (19, 21), (22, 23), (24, 25), (26, 29), (30, 34)]) + """ + l_hpu = [] + for i_hds1 in i_hds: + for i_hds11 in i_hds1: + l_hpu.append(i_hds11[1] - i_hds11[0]) + + return l_hpu + + +def get_bert_output(model_bert, tokenizer, nlu_t, hds, max_seq_length): + """ + Here, input is toknized further by WordPiece (WP) tokenizer and fed into BERT. + """ + + l_n = [] + l_hs = [] # The length of columns for each batch + + input_ids = [] + tokens = [] + segment_ids = [] + sent_ids = [] + input_mask = [] + + i_nlu = [] # index to retreive the position of contextual vector later. + i_hds = [] + + nlu_tt = [] + + t_to_tt_idx = [] + tt_to_t_idx = [] + + t_to_tt_idx_hds = [] + + for b, nlu_t1 in enumerate(nlu_t): + hds1 = hds[b] + l_hs.append(len(hds1)) + + # 1. 2nd tokenization using WordPiece + tt_to_t_idx1 = [] # number indicates where sub-token belongs to in 1st-level-tokens (here, CoreNLP). + t_to_tt_idx1 = [] # orig_to_tok_idx[i] = start index of i-th-1st-level-token in all_tokens. + nlu_tt1 = [] # all_doc_tokens[ orig_to_tok_idx[i] ] returns first sub-token segement of i-th-1st-level-token + for (i, token) in enumerate(nlu_t1): + t_to_tt_idx1.append( + len(nlu_tt1) + ) # all_doc_tokens[ indicate the start position of original 'white-space' tokens. + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tt_to_t_idx1.append(i) + nlu_tt1.append(sub_token) # all_doc_tokens are further tokenized using WordPiece tokenizer + nlu_tt.append(nlu_tt1) + tt_to_t_idx.append(tt_to_t_idx1) + t_to_tt_idx.append(t_to_tt_idx1) + + l_n.append(len(nlu_tt1)) + + # [CLS] nlu [SEP] col1 [SEP] col2 [SEP] ...col-n [SEP] + # 2. Generate BERT inputs & indices. 
+ tokens1, sent_ids1, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, hds1) + + assert len(t_to_tt_idx_hds1) == len(hds1) + + t_to_tt_idx_hds.append(t_to_tt_idx_hds1) + + input_ids1 = tokenizer.convert_tokens_to_ids(tokens1) + + # Input masks + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask1 = [1] * len(input_ids1) + + # 3. Zero-pad up to the sequence length. + if len(nlu_t) == 1: + max_seq_length = len(input_ids1) + while len(input_ids1) < max_seq_length: + input_ids1.append(0) + input_mask1.append(0) + segment_ids1.append(0) + sent_ids1.append(0) + + assert len(input_ids1) == max_seq_length + assert len(input_mask1) == max_seq_length + assert len(segment_ids1) == max_seq_length + assert len(sent_ids1) == max_seq_length + + input_ids.append(input_ids1) + tokens.append(tokens1) + segment_ids.append(segment_ids1) + sent_ids.append(sent_ids1) + + input_mask.append(input_mask1) + + i_nlu.append(i_nlu1) + i_hds.append(i_hds1) + + # Convert to tensor + all_input_ids = paddle.to_tensor(input_ids, dtype="int64") + all_segment_ids = paddle.to_tensor(segment_ids, dtype="int64") + + # 4. Generate BERT output. + all_encoder_layer, pooled_output = model_bert( + all_input_ids, token_type_ids=all_segment_ids, output_hidden_states=True + ) + + # 5. generate l_hpu from i_hds + l_hpu = gen_l_hpu(i_hds) + + assert len(set(l_n)) == 1 and len(set(i_nlu)) == 1 + assert l_n[0] == i_nlu[0][1] - i_nlu[0][0] + + return ( + all_encoder_layer, + pooled_output, + tokens, + i_nlu, + i_hds, + l_n, + l_hpu, + l_hs, + nlu_tt, + t_to_tt_idx, + tt_to_t_idx, + t_to_tt_idx_hds, + ) + + +def get_wemb_n(i_nlu, l_n, hS, num_hidden_layers, all_encoder_layer, num_out_layers_n): + """ + Get the representation of each tokens. + """ + bS = len(l_n) + l_n_max = max(l_n) + wemb_n = [] + for b in range(bS): + # [B, max_len, dim] + # Fill zero for non-exist part. + i_nlu1 = i_nlu[b] + for i_noln in range(num_out_layers_n): + i_layer = num_hidden_layers - 1 - i_noln + tmp = all_encoder_layer[i_layer][b, i_nlu1[0] : i_nlu1[1], :].unsqueeze(0) + pad_right = l_n_max - (i_nlu1[1] - i_nlu1[0]) + pad_tmp = paddle.nn.functional.pad(tmp, [0, pad_right], data_format="NLC").squeeze(0) + wemb_n.append(pad_tmp) + wemb_n = paddle.stack(wemb_n) + return wemb_n + + +def get_wemb_h(i_hds, l_hpu, l_hs, hS, num_hidden_layers, all_encoder_layer, num_out_layers_h): + """ + As if + [ [table-1-col-1-tok1, t1-c1-t2, ...], + [t1-c2-t1, t1-c2-t2, ...]. + ... + [t2-c1-t1, ...,] + ] + """ + l_hpu_max = max(l_hpu) + wemb_h = [] + b_pu = -1 + + for b, i_hds1 in enumerate(i_hds): + for b1, i_hds11 in enumerate(i_hds1): + b_pu += 1 + for i_nolh in range(num_out_layers_h): + i_layer = num_hidden_layers - 1 - i_nolh + tmp = all_encoder_layer[i_layer][b, i_hds11[0] : i_hds11[1], :].unsqueeze(0) + pad_right = l_hpu_max - (i_hds11[1] - i_hds11[0]) + pad_tmp = paddle.nn.functional.pad(tmp, [0, pad_right], data_format="NLC").squeeze(0) + wemb_h.append(pad_tmp) + wemb_h = paddle.stack(wemb_h) + return wemb_h + + +def get_wemb_bert( + bert_config, model_bert, tokenizer, nlu_t, hds, max_seq_length, num_out_layers_n=1, num_out_layers_h=1 +): + + # get contextual output of all tokens from bert + ( + all_encoder_layer, + pooled_output, + tokens, + i_nlu, + i_hds, + l_n, + l_hpu, + l_hs, + nlu_tt, + t_to_tt_idx, + tt_to_t_idx, + t_to_tt_idx_hds, + ) = get_bert_output(model_bert, tokenizer, nlu_t, hds, max_seq_length) + # all_encoder_layer: BERT outputs from all layers. 
+ # pooled_output: output of [CLS] vec. + # tokens: BERT intput tokens + # i_nlu: start and end indices of question in tokens + # i_hds: start and end indices of headers + + # get the wemb + wemb_n = get_wemb_n( + i_nlu, l_n, bert_config["hidden_size"], bert_config["num_hidden_layers"], all_encoder_layer, num_out_layers_n + ) + + wemb_h = get_wemb_h( + i_hds, + l_hpu, + l_hs, + bert_config["hidden_size"], + bert_config["num_hidden_layers"], + all_encoder_layer, + num_out_layers_h, + ) + + return wemb_n, wemb_h, l_n, l_hpu, l_hs, nlu_tt, t_to_tt_idx, tt_to_t_idx, t_to_tt_idx_hds + + +def prepare_input(tokenizer, input_sequence, input_schema, max_seq_length): + nlu_t = [] + hds = [] + + nlu_t1 = input_sequence + all_hds = input_schema.column_names_embedder_input + + nlu_tt1 = [] + for (i, token) in enumerate(nlu_t1): + nlu_tt1 += tokenizer.tokenize(token) + + current_hds1 = [] + for hds1 in all_hds: + new_hds1 = current_hds1 + [hds1] + tokens1, _, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, new_hds1) + if len(segment_ids1) > max_seq_length: + nlu_t.append(nlu_t1) + hds.append(current_hds1) + current_hds1 = [hds1] + else: + current_hds1 = new_hds1 + + if len(current_hds1) > 0: + nlu_t.append(nlu_t1) + hds.append(current_hds1) + + return nlu_t, hds + + +def prepare_input_v2(tokenizer, input_sequence, input_schema): + nlu_t = [] + hds = [] + max_seq_length = 0 + + nlu_t1 = input_sequence + all_hds = input_schema.column_names_embedder_input + + nlu_tt1 = [] + for (i, token) in enumerate(nlu_t1): + nlu_tt1 += tokenizer.tokenize(token) + + current_hds1 = [] + current_table = "" + for hds1 in all_hds: + hds1_table = hds1.split(".")[0].strip() + if hds1_table == current_table: + current_hds1.append(hds1) + else: + tokens1, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, current_hds1) + max_seq_length = max(max_seq_length, len(segment_ids1)) + + nlu_t.append(nlu_t1) + hds.append(current_hds1) + current_hds1 = [hds1] + current_table = hds1_table + + if len(current_hds1) > 0: + tokens1, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, current_hds1) + max_seq_length = max(max_seq_length, len(segment_ids1)) + nlu_t.append(nlu_t1) + hds.append(current_hds1) + + return nlu_t, hds, max_seq_length + + +def get_bert_encoding( + bert_config, + model_bert, + tokenizer, + input_sequence, + input_schema, + bert_input_version="v1", + max_seq_length=512, + num_out_layers_n=1, + num_out_layers_h=1, +): + if bert_input_version == "v1": + nlu_t, hds = prepare_input(tokenizer, input_sequence, input_schema, max_seq_length) + elif bert_input_version == "v2": + nlu_t, hds, max_seq_length = prepare_input_v2(tokenizer, input_sequence, input_schema) + + wemb_n, wemb_h, l_n, l_hpu, l_hs, nlu_tt, t_to_tt_idx, tt_to_t_idx, t_to_tt_idx_hds = get_wemb_bert( + bert_config, model_bert, tokenizer, nlu_t, hds, max_seq_length, num_out_layers_n, num_out_layers_h + ) + + t_to_tt_idx = t_to_tt_idx[0] + assert len(t_to_tt_idx) == len(input_sequence) + assert sum(len(t_to_tt_idx_hds1) for t_to_tt_idx_hds1 in t_to_tt_idx_hds) == len( + input_schema.column_names_embedder_input + ) + + assert list(wemb_h.shape)[0] == len(input_schema.column_names_embedder_input) + + utterance_states = [] + for i in range(len(t_to_tt_idx)): + start = t_to_tt_idx[i] + if i == len(t_to_tt_idx) - 1: + end = l_n[0] + else: + end = t_to_tt_idx[i + 1] + utterance_states.append(paddle.mean(wemb_n[:, start:end, :], axis=[0, 1])) + assert 
len(utterance_states) == len(input_sequence) + + schema_token_states = [] + cnt = -1 + for t_to_tt_idx_hds1 in t_to_tt_idx_hds: + for t_to_tt_idx_hds11 in t_to_tt_idx_hds1: + cnt += 1 + schema_token_states1 = [] + for i in range(len(t_to_tt_idx_hds11)): + start = t_to_tt_idx_hds11[i] + if i == len(t_to_tt_idx_hds11) - 1: + end = l_hpu[cnt] + else: + end = t_to_tt_idx_hds11[i + 1] + schema_token_states1.append(paddle.mean(wemb_h[cnt, start:end, :], axis=0)) + assert len(schema_token_states1) == len(input_schema.column_names_embedder_input[cnt].split()) + schema_token_states.append(schema_token_states1) + + assert len(schema_token_states) == len(input_schema.column_names_embedder_input) + + return utterance_states, schema_token_states diff --git a/examples/text_to_sql/IGSQL/model/decoder.py b/examples/text_to_sql/IGSQL/model/decoder.py new file mode 100644 index 0000000000000000000000000000000000000000..f0a7bf9c1c46bed0955dd99f1927eb22cd289704 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/decoder.py @@ -0,0 +1,277 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Decoder for the SQL generation problem.""" + +from collections import namedtuple + +import data_util.snippets as snippet_handler +import numpy as np +import paddle +import paddle.nn.functional as F +from data_util.vocabulary import EOS_TOK, UNK_TOK + +from . import embedder +from .token_predictor import PredictionInputWithSchema + +np.random.seed(0) + + +def flatten_distribution(distribution_map, probabilities): + """Flattens a probability distribution given a map of "unique" values. + All values in distribution_map with the same value should get the sum + of the probabilities. + + Arguments: + distribution_map (`list`): List of values to get the probability for. + probabilities (`np.ndarray`): Probabilities corresponding to the values in + distribution_map. + + Returns: + `list`: `np.ndarray` of the same size where probabilities for duplicates + in distribution_map are given the sum of the probabilities in probabilities. 
+ """ + assert len(distribution_map) == len(probabilities) + if len(distribution_map) != len(set(distribution_map)): + idx_first_dup = 0 + seen_set = set() + for i, tok in enumerate(distribution_map): + if tok in seen_set: + idx_first_dup = i + break + seen_set.add(tok) + new_dist_map = distribution_map[:idx_first_dup] + list( + set(distribution_map) - set(distribution_map[:idx_first_dup]) + ) + assert len(new_dist_map) == len(set(new_dist_map)) + new_probs = np.array( + probabilities[:idx_first_dup] + [0.0 for _ in range(len(set(distribution_map)) - idx_first_dup)] + ) + assert len(new_probs) == len(new_dist_map) + + for i, token_name in enumerate(distribution_map[idx_first_dup:]): + if token_name not in new_dist_map: + new_dist_map.append(token_name) + + new_index = new_dist_map.index(token_name) + new_probs[new_index] += probabilities[i + idx_first_dup] + new_probs = new_probs.tolist() + else: + new_dist_map = distribution_map + new_probs = probabilities + + assert len(new_dist_map) == len(new_probs) + + return new_dist_map, new_probs + + +class SQLPrediction(namedtuple("SQLPrediction", ("predictions", "sequence", "probability"))): + """Contains prediction for a sequence.""" + + __slots__ = () + + def __str__(self): + return str(self.probability) + "\t" + " ".join(self.sequence) + + +class SequencePredictorWithSchema(paddle.nn.Layer): + """Predicts a sequence. + + Attributes: + lstms (list of dy.RNNBuilder): The RNN used. + token_predictor (TokenPredictor): Used to actually predict tokens. + """ + + def __init__(self, params, input_size, output_embedder, column_name_token_embedder, token_predictor): + super().__init__() + + self.lstmCell = paddle.nn.LSTMCell(input_size, params.decoder_state_size) + + self.token_predictor = token_predictor + self.output_embedder = output_embedder + self.column_name_token_embedder = column_name_token_embedder + + start_token_embedding = self.create_parameter( + [params.output_embedding_size], + dtype="float32", + default_initializer=paddle.nn.initializer.Uniform(low=-0.1, high=0.1), + ) + self.add_parameter("start_token_embedding", start_token_embedding) + + self.input_size = input_size + self.params = params + + def _initialize_decoder_lstm(self, encoder_state): + decoder_lstm_states = [] + + # check which one is h_0, which is c_0 + c_0 = encoder_state[1].reshape([1, -1]) + h_0 = encoder_state[0].reshape([1, -1]) + + decoder_lstm_states.append((h_0, c_0)) + return decoder_lstm_states + + def get_output_token_embedding(self, output_token, input_schema, snippets): + if self.params.use_snippets and snippet_handler.is_snippet(output_token): + output_token_embedding = embedder.bow_snippets(output_token, snippets, self.output_embedder, input_schema) + else: + if input_schema: + assert self.output_embedder.in_vocabulary(output_token) or input_schema.in_vocabulary( + output_token, surface_form=True + ) + # 经过 + if self.output_embedder.in_vocabulary(output_token): + output_token_embedding = self.output_embedder(output_token) + else: + output_token_embedding = input_schema.column_name_embedder(output_token, surface_form=True) + else: + output_token_embedding = self.output_embedder(output_token) + return output_token_embedding + + def get_decoder_input(self, output_token_embedding, prediction): + if self.params.use_schema_attention and self.params.use_query_attention: + decoder_input = paddle.concat( + [ + output_token_embedding, + prediction.utterance_attention_results.vector, + prediction.schema_attention_results.vector, + 
prediction.query_attention_results.vector, + ], + axis=0, + ) + elif self.params.use_schema_attention: + decoder_input = paddle.concat( + [ + output_token_embedding, + prediction.utterance_attention_results.vector, + prediction.schema_attention_results.vector, + ], + axis=0, + ) + else: + decoder_input = paddle.concat( + [output_token_embedding, prediction.utterance_attention_results.vector], axis=0 + ) + return decoder_input + + def forward( + self, + final_encoder_state, + encoder_states, + schema_states, + max_generation_length, + snippets=None, + gold_sequence=None, + input_sequence=None, + previous_queries=None, + previous_query_states=None, + input_schema=None, + dropout_amount=0.0, + ): + """Generates a sequence.""" + index = 0 + + context_vector_size = self.input_size - self.params.output_embedding_size + + # Decoder states: just the initialized decoder. + # Current input to decoder: phi(start_token) ; zeros the size of the + # context vector + predictions = [] + sequence = [] + probability = 1.0 + + decoder_states = self._initialize_decoder_lstm(final_encoder_state)[0] + + decoder_input = paddle.concat([self.start_token_embedding, paddle.zeros([context_vector_size])], axis=0) + continue_generating = True + while continue_generating: + if len(sequence) == 0 or sequence[-1] != EOS_TOK: + + decoder_state, decoder_states = self.lstmCell(decoder_input.unsqueeze(0), decoder_states) + decoder_state = decoder_state.squeeze() + + prediction_input = PredictionInputWithSchema( + decoder_state=decoder_state, + input_hidden_states=encoder_states, + schema_states=schema_states, + snippets=snippets, + input_sequence=input_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + ) + + prediction = self.token_predictor(prediction_input, dropout_amount=dropout_amount) + + predictions.append(prediction) + # 经过 + if gold_sequence: + output_token = gold_sequence[index] + + output_token_embedding = self.get_output_token_embedding(output_token, input_schema, snippets) + + decoder_input = self.get_decoder_input(output_token_embedding, prediction) + + sequence.append(gold_sequence[index]) + + if index >= len(gold_sequence) - 1: + continue_generating = False + else: + assert prediction.scores.dim() == 1 + probabilities = F.softmax(prediction.scores, axis=0).cpu().numpy().tolist() + + distribution_map = prediction.aligned_tokens + assert len(probabilities) == len(distribution_map) + + if self.params.use_previous_query and self.params.use_copy_switch and len(previous_queries) > 0: + assert prediction.query_scores.dim() == 1 + query_token_probabilities = ( + F.softmax(prediction.query_scores, axis=0).cpu().data.numpy().tolist() + ) + + query_token_distribution_map = prediction.query_tokens + + assert len(query_token_probabilities) == len(query_token_distribution_map) + + copy_switch = prediction.copy_switch.cpu().data.numpy() + + # Merge the two + probabilities = (np.array(probabilities) * (1 - copy_switch)).tolist() + ( + np.array(query_token_probabilities) * copy_switch + ).tolist() + distribution_map = distribution_map + query_token_distribution_map + assert len(probabilities) == len(distribution_map) + + # Get a new probabilities and distribution_map consolidating duplicates + distribution_map, probabilities = flatten_distribution(distribution_map, probabilities) + + # Modify the probability distribution so that the UNK token can never be produced + probabilities[distribution_map.index(UNK_TOK)] = 0.0 + argmax_index = 
int(np.argmax(probabilities)) + + argmax_token = distribution_map[argmax_index] + sequence.append(argmax_token) + + output_token_embedding = self.get_output_token_embedding(argmax_token, input_schema, snippets) + + decoder_input = self.get_decoder_input(output_token_embedding, prediction) + + probability *= probabilities[argmax_index] + + continue_generating = False + if index < max_generation_length and argmax_token != EOS_TOK: + continue_generating = True + + index += 1 + + return SQLPrediction(predictions, sequence, probability) diff --git a/examples/text_to_sql/IGSQL/model/embedder.py b/examples/text_to_sql/IGSQL/model/embedder.py new file mode 100644 index 0000000000000000000000000000000000000000..cfa23cd3a66985cfbde4275e053d6f77ed749b65 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/embedder.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Embedder for tokens. """ + +import data_util.snippets as snippet_handler +import data_util.vocabulary as vocabulary_handler +import paddle + + +class Embedder(paddle.nn.Layer): + """Embeds tokens.""" + + def __init__( + self, + embedding_size, + name="", + initializer=None, + vocabulary=None, + num_tokens=-1, + anonymizer=None, + freeze=False, + use_unk=True, + ): + super().__init__() + + if vocabulary: + assert num_tokens < 0, "Specified a vocabulary but also set number of tokens to " + str(num_tokens) + self.in_vocabulary = lambda token: token in vocabulary.tokens + self.vocab_token_lookup = lambda token: vocabulary.token_to_id(token) + if use_unk: + self.unknown_token_id = vocabulary.token_to_id(vocabulary_handler.UNK_TOK) + else: + self.unknown_token_id = -1 + self.vocabulary_size = len(vocabulary) + else: + + def check_vocab(index): + """Makes sure the index is in the vocabulary.""" + assert index < num_tokens, ( + "Passed token ID " + str(index) + "; expecting something less than " + str(num_tokens) + ) + return index < num_tokens + + self.in_vocabulary = check_vocab + self.vocab_token_lookup = lambda x: x + self.unknown_token_id = num_tokens # Deliberately throws an error here, + # But should crash before this + self.vocabulary_size = num_tokens + + self.anonymizer = anonymizer + + emb_name = name + "-tokens" + print( + "Creating token embedder called " + + emb_name + + " of size " + + str(self.vocabulary_size) + + " x " + + str(embedding_size) + ) + + if initializer is not None: + self.token_embedding_matrix = paddle.nn.Embedding(initializer.shape[0], initializer.shape[1]) + self.token_embedding_matrix.weight.set_value(initializer) + else: + initializer = paddle.nn.initializer.Uniform(low=-0.1, high=0.1) + self.token_embedding_matrix = paddle.nn.Embedding( + self.vocabulary_size, embedding_size, weight_attr=initializer + ) + + def forward(self, token): + assert isinstance(token, int) or not snippet_handler.is_snippet( + token + ), "embedder should only be called on flat tokens; use snippet_bow if you are trying to encode snippets" + + if 
self.in_vocabulary(token): + index_list = paddle.to_tensor(self.vocab_token_lookup(token), "int64") + return self.token_embedding_matrix(index_list).squeeze() + else: + index_list = paddle.to_tensor(self.unknown_token_id, "int64") + return self.token_embedding_matrix(index_list).squeeze() + + +def bow_snippets(token, snippets, output_embedder, input_schema): + """Bag of words embedding for snippets""" + assert snippet_handler.is_snippet(token) and snippets + + snippet_sequence = [] + for snippet in snippets: + if snippet.name == token: + snippet_sequence = snippet.sequence + break + assert snippet_sequence + + if input_schema: + snippet_embeddings = [] + for output_token in snippet_sequence: + assert output_embedder.in_vocabulary(output_token) or input_schema.in_vocabulary( + output_token, surface_form=True + ) + if output_embedder.in_vocabulary(output_token): + snippet_embeddings.append(output_embedder(output_token)) + else: + snippet_embeddings.append(input_schema.column_name_embedder(output_token, surface_form=True)) + else: + snippet_embeddings = [output_embedder(subtoken) for subtoken in snippet_sequence] + + snippet_embeddings = paddle.stack(snippet_embeddings, axis=0) # len(snippet_sequence) x emb_size + return paddle.mean(snippet_embeddings, axis=0) # emb_size diff --git a/examples/text_to_sql/IGSQL/model/model.py b/examples/text_to_sql/IGSQL/model/model.py new file mode 100644 index 0000000000000000000000000000000000000000..0f52ab8066c17d840758ba6a8a5164ccd391c7f9 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/model.py @@ -0,0 +1,376 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Class for the Sequence to sequence model for ATIS.""" + +import numpy as np +import paddle +from data_util.vocabulary import DEL_TOK, UNK_TOK + +from . import bert_utils +from .embedder import Embedder + + +def get_token_indices(token, index_to_token): + """Maps from a gold token (string) to a list of indices. + + Args: + token (`string`): String to look up. + index_to_token (`list`): Ordered list of tokens. + + Returns: + `list`: Representing the indices of the token in the probability + distribution. + """ + if token in index_to_token: + if len(set(index_to_token)) == len(index_to_token): # no duplicates + return [index_to_token.index(token)] + else: + indices = [] + for index, other_token in enumerate(index_to_token): + if token == other_token: + indices.append(index) + assert len(indices) == len(set(indices)) + return indices + else: + return [index_to_token.index(UNK_TOK)] + + +def flatten_utterances(utterances): + """Gets a flat sequence from a sequence of utterances. + + Args: + utterances (`list`): Utterances to concatenate. + + Returns: + `list`: Representing the flattened sequence with separating + delimiter tokens. 
+ """ + sequence = [] + for i, utterance in enumerate(utterances): + sequence.extend(utterance) + if i < len(utterances) - 1: + sequence.append(DEL_TOK) + + return sequence + + +def encode_snippets_with_states(snippets, states): + """Encodes snippets by using previous query states instead. + + Args: + snippets (`list`): Input snippets. + states (`list`): Previous hidden states to use. + """ + for snippet in snippets: + snippet.set_embedding(paddle.concat([states[snippet.startpos], states[snippet.endpos]], axis=0)) + return snippets + + +def load_word_embeddings(input_vocabulary, output_vocabulary, output_vocabulary_schema, params): + print(output_vocabulary.inorder_tokens) + print() + + if params.reload_embedding == 1: + input_vocabulary_embeddings = np.load(params.data_directory + "/input_embeddings.npy") + output_vocabulary_embeddings = np.load(params.data_directory + "/output_embeddings.npy") + output_vocabulary_schema_embeddings = np.load(params.data_directory + "/output_schema_embeddings.npy") + input_embedding_size = 300 + return ( + input_vocabulary_embeddings, + output_vocabulary_embeddings, + output_vocabulary_schema_embeddings, + input_embedding_size, + ) + + def read_glove_embedding(embedding_filename, embedding_size): + glove_embeddings = {} + + with open(embedding_filename) as f: + cnt = 1 + for line in f: + cnt += 1 + if params.debug or not params.train: + if cnt == 1000: + print("Read 1000 word embeddings") + break + l_split = line.split() + word = " ".join(l_split[0 : len(l_split) - embedding_size]) + embedding = np.array([float(val) for val in l_split[len(l_split) - embedding_size :]]) + glove_embeddings[word] = embedding + + return glove_embeddings + + print("Loading Glove Embedding from", params.embedding_filename) + glove_embedding_size = 300 + glove_embeddings = read_glove_embedding(params.embedding_filename, glove_embedding_size) + print("Done") + + input_embedding_size = glove_embedding_size + + def create_word_embeddings(vocab): + + vocabulary_embeddings = np.zeros((len(vocab), glove_embedding_size), dtype=np.float32) + vocabulary_tokens = vocab.inorder_tokens + + glove_oov = 0 + para_oov = 0 + for token in vocabulary_tokens: + token_id = vocab.token_to_id(token) + if token in glove_embeddings: + vocabulary_embeddings[token_id][:glove_embedding_size] = glove_embeddings[token] + else: + glove_oov += 1 + + print("Glove OOV:", glove_oov, "Para OOV", para_oov, "Total", len(vocab)) + + return vocabulary_embeddings + + input_vocabulary_embeddings = create_word_embeddings(input_vocabulary) + output_vocabulary_embeddings = create_word_embeddings(output_vocabulary) + output_vocabulary_schema_embeddings = None + if output_vocabulary_schema: + output_vocabulary_schema_embeddings = create_word_embeddings(output_vocabulary_schema) + + np.save(params.data_directory + "/input_embeddings", input_vocabulary_embeddings) + np.save(params.data_directory + "/output_embeddings", output_vocabulary_embeddings) + np.save(params.data_directory + "/output_schema_embeddings", output_vocabulary_schema_embeddings) + + return ( + input_vocabulary_embeddings, + output_vocabulary_embeddings, + output_vocabulary_schema_embeddings, + input_embedding_size, + ) + + +class ATISModel(paddle.nn.Layer): + """Sequence-to-sequence model for predicting a SQL query given an utterance + and an interaction prefix. 
+ """ + + def __init__(self, params, input_vocabulary, output_vocabulary, output_vocabulary_schema, anonymizer): + super().__init__() + + self.params = params + + self.dropout = 0.0 + + if params.use_bert: + self.model_bert, self.tokenizer, self.bert_config = bert_utils.get_bert(params) + + if "atis" not in params.data_directory: + if params.use_bert: + ( + input_vocabulary_embeddings, + output_vocabulary_embeddings, + output_vocabulary_schema_embeddings, + input_embedding_size, + ) = load_word_embeddings(input_vocabulary, output_vocabulary, output_vocabulary_schema, params) + + # Create the output embeddings + self.output_embedder = Embedder( + params.output_embedding_size, + name="output-embedding", + initializer=output_vocabulary_embeddings, + vocabulary=output_vocabulary, + anonymizer=anonymizer, + freeze=False, + ) + self.column_name_token_embedder = None + + # Create the encoder + encoder_input_size = params.input_embedding_size + encoder_output_size = params.encoder_state_size + if params.use_bert: + encoder_input_size = self.bert_config["hidden_size"] + + if params.discourse_level_lstm: + encoder_input_size += params.encoder_state_size // 2 + + self.utterance_encoder = paddle.nn.LSTM( + encoder_input_size, encoder_output_size // 2, num_layers=params.encoder_num_layers, direction="bidirect" + ) + + # Positional embedder for utterances + attention_key_size = params.encoder_state_size + self.schema_attention_key_size = attention_key_size + if params.state_positional_embeddings: + attention_key_size += params.positional_embedding_size + self.positional_embedder = Embedder( + params.positional_embedding_size, name="positional-embedding", num_tokens=params.maximum_utterances + ) + + self.utterance_attention_key_size = attention_key_size + + # Create the discourse-level LSTM parameters + if params.discourse_level_lstm: + self.discourse_lstms = paddle.nn.LSTMCell(params.encoder_state_size, params.encoder_state_size // 2) + + initial_discourse_state = self.create_parameter( + [params.encoder_state_size // 2], + dtype="float32", + default_initializer=paddle.nn.initializer.Uniform(low=-0.1, high=0.1), + ) + self.add_parameter("initial_discourse_state", initial_discourse_state) + + # Snippet encoder + final_snippet_size = 0 + + # Previous query Encoder + if params.use_previous_query: + self.query_encoder = paddle.nn.LSTM( + params.output_embedding_size, + params.encoder_state_size // 2, + num_layers=params.encoder_num_layers, + direction="bidirect", + ) + + self.final_snippet_size = final_snippet_size + + def _initialize_discourse_states(self): + discourse_state = self.initial_discourse_state + + hidden_size = self.discourse_lstms.weight_hh.shape[1] + + h_0 = paddle.zeros([1, hidden_size]) + c_0 = paddle.zeros([1, hidden_size]) + + return discourse_state, (h_0, c_0) + + def _add_positional_embeddings(self, hidden_states, utterances, group=False): + grouped_states = [] + + start_index = 0 + for utterance in utterances: + grouped_states.append(hidden_states[start_index : start_index + len(utterance)]) + start_index += len(utterance) + assert ( + len(hidden_states) + == sum([len(seq) for seq in grouped_states]) + == sum([len(utterance) for utterance in utterances]) + ) + + new_states = [] + flat_sequence = [] + + num_utterances_to_keep = min(self.params.maximum_utterances, len(utterances)) + for i, (states, utterance) in enumerate( + zip(grouped_states[-num_utterances_to_keep:], utterances[-num_utterances_to_keep:]) + ): + positional_sequence = [] + index = num_utterances_to_keep - i - 1 + + for 
state in states: + positional_sequence.append(paddle.concat([state, self.positional_embedder(index)], axis=0)) + + assert len(positional_sequence) == len(utterance), ( + "Expected utterance and state sequence length to be the same, " + + "but they were " + + str(len(utterance)) + + " and " + + str(len(positional_sequence)) + ) + + if group: + new_states.append(positional_sequence) + else: + new_states.extend(positional_sequence) + flat_sequence.extend(utterance) + + return new_states, flat_sequence + + def build_optim(self): + params_trainer = [] + params_bert_trainer = [] + for name, param in self.named_parameters(): + if not param.stop_gradient: + if self.params.all_in_one_trainer: + param.name = name + params_trainer.append(param) + else: + if "model_bert" in name: + params_bert_trainer.append(param) + else: + params_trainer.append(param) + clip = paddle.nn.ClipGradByNorm(clip_norm=self.params.clip) + + if self.params.scheduler: + self.scheduler = paddle.optimizer.lr.ReduceOnPlateau( + learning_rate=self.params.initial_learning_rate, + mode="min", + ) + self.trainer = paddle.optimizer.Adam( + parameters=params_trainer, learning_rate=self.scheduler, grad_clip=clip + ) + else: + self.trainer = paddle.optimizer.Adam(parameters=params_trainer, learning_rate=1.0, grad_clip=clip) + if self.params.fine_tune_bert: + if self.params.scheduler: + self.scheduler = paddle.optimizer.lr.ReduceOnPlateau( + learning_rate=self.params.initial_learning_rate, + mode="min", + ) + self.bert_trainer = paddle.optimizer.Adam( + parameters=params_bert_trainer, learning_rate=self.scheduler, grad_clip=clip + ) + else: + self.bert_trainer = paddle.optimizer.Adam( + parameters=params_bert_trainer, learning_rate=1.0, grad_clip=clip + ) + + def set_dropout(self, value): + """Sets the dropout to a specified value. + + Args: + value (`float`): Value to set dropout to. + """ + self.dropout = value + + def set_learning_rate(self, value): + """Sets the learning rate for the trainer. + + Args: + value (`float`): The new learning rate. + """ + # return + for param_group in self.trainer._parameter_list: + if self.params.all_in_one_trainer: + if "model_bert" in param_group.name: + param_group.optimize_attr["learning_rate"] = value * 0.01 + else: + param_group.optimize_attr["learning_rate"] = value + else: + param_group.optimize_attr["learning_rate"] = value + + if self.params.use_bert: + if not self.params.all_in_one_trainer: + for param_group in self.bert_trainer._parameter_list: + param_group.optimize_attr["learning_rate"] = value * 0.01 + + def save(self, filename): + """Saves the model to the specified filename. + + Args: + filename (`str`): The filename to save to. + """ + paddle.save(self.state_dict(), filename) + + def load(self, filename): + """Loads saved parameters into the parameter collection. + + Args: + filename (`str`): Name of file containing parameters. + """ + self.load_dict(paddle.load(filename)) + print("Loaded model from file " + filename) diff --git a/examples/text_to_sql/IGSQL/model/model_utils.py b/examples/text_to_sql/IGSQL/model/model_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..68fb9969673a8bfde5bbf70b65b9a30a6057b2d4 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/model_utils.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains various utility functions for Dynet models.""" + +import numpy as np +import paddle +import paddle.nn.functional as F + + +def compute_loss(gold_seq, scores, index_to_token_maps, gold_tok_to_id, noise=0.00000001): + """Computes the loss of a gold sequence given scores. + + Args: + gold_seq (`list`): A sequence of gold tokens. + scores (`list`): Expressions representing the scores of + potential output tokens for each token in gold_seq. + index_to_token_maps (`list`): Maps from index in the + sequence to a dictionary mapping from a string to a set of integers. + gold_tok_to_id (`func`): Maps from the gold token + and some lookup function to the indices in the probability distribution + where the gold token occurs. + noise (`float`, optional): The amount of noise to add to the loss. + + Returns: + `Tensor`: representing the sum of losses over the sequence. + """ + assert len(gold_seq) == len(scores) == len(index_to_token_maps) + + losses = [] + for i, gold_tok in enumerate(gold_seq): + score = scores[i] + token_map = index_to_token_maps[i] + + gold_indices = gold_tok_to_id(gold_tok, token_map) + + assert len(gold_indices) > 0 + noise_i = noise + """ + if len(gold_indices) == 1: + noise_i = 0 + """ + + probdist = score + + prob_of_tok = paddle.sum(paddle.index_select(probdist, paddle.to_tensor(gold_indices))) + + if prob_of_tok < noise_i: + prob_of_tok = prob_of_tok + noise_i + elif prob_of_tok > 1 - noise_i: + prob_of_tok = prob_of_tok - noise_i + losses.append(-paddle.log(prob_of_tok)) + + return paddle.sum(paddle.stack(losses)) + + +def get_seq_from_scores(scores, index_to_token_maps): + """Gets the argmax sequence from a set of scores. + + Args: + scores (`list`): Sequences of output scores. + index_to_token_maps (`list`): For each output token, maps + the index in the probability distribution to a string. + + Returns: + `list`: Representing the argmax sequence. + """ + seq = [] + for score, tok_map in zip(scores, index_to_token_maps): + # score_numpy_list = score.cpu().detach().numpy() + score_numpy_list = score.cpu().numpy() + assert score.shape[0] == len(tok_map) == len(list(score_numpy_list)) + seq.append(tok_map[np.argmax(score_numpy_list)]) + return seq + + +def per_token_accuracy(gold_seq, pred_seq): + """Returns the per-token accuracy comparing two strings (recall). + + Args: + gold_seq (`list`): A list of gold tokens. + pred_seq (`list`): A list of predicted tokens. + + Returns: + `float`: Representing the accuracy. + """ + num_correct = 0 + for i, gold_token in enumerate(gold_seq): + if i < len(pred_seq) and pred_seq[i] == gold_token: + num_correct += 1 + + return float(num_correct) / len(gold_seq) + + +def forward_one_multilayer(rnns, lstm_input, layer_states, dropout_amount=0.0): + """Goes forward for one multilayer RNN cell step. + + Args: + lstm_input (`Tensor`): Some input to the step. + layer_states (`list`): The states of each layer in the cell. + dropout_amount (`float`, optional): The amount of dropout to apply, in + between the layers. 
+ + Returns: + (`list` , `list`), `Tensor`, (`list`): Representing (each layer's cell memory, + each layer's cell hidden state), the final hidden state, and (each layer's updated RNNState). + """ + num_layers = len(layer_states) + new_states = [] + cell_states = [] + hidden_states = [] + state = lstm_input + for i in range(num_layers): + layer_h, new_state = rnns[i](paddle.unsqueeze(state, 0), layer_states[i]) + new_states.append(new_state) + + layer_h = layer_h.squeeze() + layer_c = new_state[1].squeeze() + + state = layer_h + if i < num_layers - 1: + # p stands for probability of an element to be zeroed. i.e. p=1 means switch off all activations. + state = F.dropout(state, p=dropout_amount) + + cell_states.append(layer_c) + hidden_states.append(layer_h) + + return (cell_states, hidden_states), state, new_states + + +def encode_sequence(sequence, rnns, embedder, dropout_amount=0.0): + """Encodes a sequence given RNN cells and an embedding function. + + Args: + seq (`list`): The sequence to encode. + rnns (`list`): The RNNs to use. + emb_fn (`func`): Function that embeds strings to + word vectors. + size (`int`): The size of the RNN. + dropout_amount (`float`, optional): The amount of dropout to apply. + + Returns: + (`list`, `list`), `list`: The first pair is the (final cell memories, final cell states) + of all layers, and the second list is a list of the final layer's cell + state for all tokens in the sequence. + """ + + batch_size = 1 + layer_states = [] + for rnn in rnns: + hidden_size = rnn.weight_hh.shape[1] + + h_0 = paddle.zeros([batch_size, hidden_size]) + c_0 = paddle.zeros([batch_size, hidden_size]) + + layer_states.append((h_0, c_0)) + + outputs = [] + for token in sequence: + rnn_input = embedder(token) + (cell_states, hidden_states), output, layer_states = forward_one_multilayer( + rnns, rnn_input, layer_states, dropout_amount + ) + outputs.append(output) + + return (cell_states, hidden_states), outputs + + +def mask_fill(input, mask, value): + return input * paddle.cast(paddle.logical_not(mask), input.dtype) + paddle.cast(mask, input.dtype) * value + + +def LSTM_output_transfer(utterance_states, final_utterance_state): + + if len(utterance_states) != 0: + utterance_states = utterance_states.squeeze(0) + utterance_states = paddle.split(utterance_states, utterance_states.shape[0]) + for idx in range(len(utterance_states)): + utterance_states[idx] = utterance_states[idx].squeeze(0) + + if len(final_utterance_state) != 0: + (hidden_state, cell_memory) = final_utterance_state + hidden_states = paddle.concat([hidden_state[0], hidden_state[1]], axis=-1).squeeze(0) + cell_memories = paddle.concat([cell_memory[0], cell_memory[1]], axis=-1).squeeze(0) + final_utterance_state = (hidden_states.squeeze(0), cell_memories.squeeze(0)) + return utterance_states, final_utterance_state diff --git a/examples/text_to_sql/IGSQL/model/schema_interaction_model.py b/examples/text_to_sql/IGSQL/model/schema_interaction_model.py new file mode 100644 index 0000000000000000000000000000000000000000..2a7a94e4be161c7a5442d74a0eb6e4554b3da3e5 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/schema_interaction_model.py @@ -0,0 +1,936 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Class for the Sequence to sequence model for ATIS.""" + +import data_util.snippets as snippet_handler +import data_util.vocabulary as vocab +import numpy as np +import paddle +import paddle.nn.functional as F +from data_util import sql_util +from data_util.vocabulary import EOS_TOK + +from . import bert_utils, model_utils +from .attention import Attention +from .decoder import SequencePredictorWithSchema +from .model import ATISModel, get_token_indices +from .token_predictor import construct_token_predictor + +np.random.seed(0) + + +class GraphNN(paddle.nn.Layer): + def __init__(self, params): + super(GraphNN, self).__init__() + self.params = params + + weight_attr_final_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr_final_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(value=0.0)) + + weight_attr_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(value=0.0)) + + weight_attr_qfc = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr_qfc = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(value=0.0)) + + self.final_fc = paddle.nn.Linear( + params.encoder_state_size, + params.encoder_state_size, + weight_attr=weight_attr_final_fc, + bias_attr=bias_attr_final_fc, + ) + self.fc = paddle.nn.Linear( + params.encoder_state_size, params.encoder_state_size, weight_attr=weight_attr_fc, bias_attr=bias_attr_fc + ) + self.qfc = paddle.nn.Linear( + params.encoder_state_size, params.encoder_state_size, weight_attr=weight_attr_qfc, bias_attr=bias_attr_qfc + ) + self.dropout = paddle.nn.Dropout(0.1) + self.leakyReLU = paddle.nn.LeakyReLU(0.2) + self.elu = paddle.nn.ELU() + self.relu = paddle.nn.ReLU() + + def forward(self, x, adj_matrix, previous_x=None): + # x: [len_tokens, d] + # adj_matrix: [len_tokens, len_tokens] + if previous_x is not None: + x_new = self.leakyReLU(self.fc(paddle.concat([previous_x, x], axis=0))).unsqueeze(0) + else: + x_new = self.leakyReLU(self.fc(x).unsqueeze(0)) # [1, len_tokens, d] + q = self.leakyReLU(self.qfc(x_new)) + x_ = paddle.concat(paddle.split(x_new, 3, axis=2), axis=0) + q_ = paddle.concat(paddle.split(q, 3, axis=2), axis=0) + outputs = paddle.matmul(q_, x_.transpose([0, 2, 1])) / 10.0 # [3, len_tokens, len_tokens] + tmp_adj_matrix = (adj_matrix == 0).expand(shape=[3, adj_matrix.shape[0], adj_matrix.shape[1]]) + outputs = model_utils.mask_fill(input=outputs, mask=tmp_adj_matrix, value=-1e9) + outputs = self.dropout(F.softmax(outputs, axis=-1)) + outputs = paddle.matmul(outputs, x_) + outputs = paddle.concat(paddle.split(outputs, 3, axis=0), axis=2) + if previous_x is not None: + outputs = paddle.split(outputs, 2, axis=1)[1] + outputs = x.unsqueeze(0) + outputs + ret = x + self.dropout(self.leakyReLU(self.final_fc(outputs).squeeze(0))) + return ret + + +LIMITED_INTERACTIONS = { + "raw/atis2/12-1.1/ATIS2/TEXT/TRAIN/SRI/QS0/1": 22, + "raw/atis3/17-1.1/ATIS3/SP_TRN/MIT/8K7/5": 14, + "raw/atis2/12-1.1/ATIS2/TEXT/TEST/NOV92/770/5": -1, +} + +END_OF_INTERACTION = {"quit", 
"exit", "done"} + + +class SchemaInteractionATISModel(ATISModel): + """Interaction ATIS model, where an interaction is processed all at once.""" + + def __init__(self, params, input_vocabulary, output_vocabulary, output_vocabulary_schema, anonymizer): + ATISModel.__init__(self, params, input_vocabulary, output_vocabulary, output_vocabulary_schema, anonymizer) + + if self.params.use_schema_encoder: + # Create the schema encoder + schema_encoder_input_size = params.input_embedding_size + schema_encoder_state_size = params.encoder_state_size + if params.use_bert: + schema_encoder_input_size = self.bert_config["hidden_size"] + + self.schema_encoder = paddle.nn.LSTM( + schema_encoder_input_size, + schema_encoder_state_size // 2, + num_layers=params.encoder_num_layers, + direction="bidirect", + ) + + # self-attention + if self.params.use_schema_self_attention: + self.schema2schema_attention_module = Attention( + self.schema_attention_key_size, self.schema_attention_key_size, self.schema_attention_key_size + ) + + # utterance level attention + if self.params.use_utterance_attention: + self.utterance_attention_module = Attention( + self.params.encoder_state_size, self.params.encoder_state_size, self.params.encoder_state_size + ) + + # Use attention module between input_hidden_states and schema_states + # schema_states: self.schema_attention_key_size x len(schema) + # input_hidden_states: self.utterance_attention_key_size x len(input) + if params.use_encoder_attention: + self.utterance2schema_attention_module = Attention( + self.schema_attention_key_size, self.utterance_attention_key_size, self.utterance_attention_key_size + ) + self.schema2utterance_attention_module = Attention( + self.utterance_attention_key_size, self.schema_attention_key_size, self.schema_attention_key_size + ) + + new_attention_key_size = self.schema_attention_key_size + self.utterance_attention_key_size + self.schema_attention_key_size = new_attention_key_size + self.utterance_attention_key_size = new_attention_key_size + + self.token_predictor = construct_token_predictor( + params, + output_vocabulary, + self.utterance_attention_key_size, + self.schema_attention_key_size, + self.final_snippet_size, + anonymizer, + ) + + # Use schema_attention in decoder + if params.use_schema_attention and params.use_query_attention: + decoder_input_size = ( + params.output_embedding_size + + self.utterance_attention_key_size + + self.schema_attention_key_size + + params.encoder_state_size + ) + elif params.use_schema_attention: + decoder_input_size = ( + params.output_embedding_size + self.utterance_attention_key_size + self.schema_attention_key_size + ) + else: + decoder_input_size = params.output_embedding_size + self.utterance_attention_key_size + + self.decoder = SequencePredictorWithSchema( + params, decoder_input_size, self.output_embedder, self.column_name_token_embedder, self.token_predictor + ) + + if params.gnn_layer_number: + self.gnn_history = paddle.nn.LayerList([GraphNN(params) for _ in range(2 * params.gnn_layer_number)]) + self.gnn = paddle.nn.LayerList([GraphNN(params) for _ in range(params.gnn_layer_number)]) + + def predict_turn( + self, + utterance_final_state, + input_hidden_states, + schema_states, + max_generation_length, + gold_query=None, + snippets=None, + input_sequence=None, + previous_queries=None, + previous_query_states=None, + input_schema=None, + feed_gold_tokens=False, + training=False, + ): + """Gets a prediction for a single turn -- calls decoder and updates loss, etc. 
+ + TODO: this can probably be split into two methods, one that just predicts + and another that computes the loss. + """ + predicted_sequence = [] + fed_sequence = [] + loss = None + token_accuracy = 0.0 + if self.params.use_encoder_attention: + schema_attention = self.utterance2schema_attention_module( + paddle.stack(schema_states, axis=0), input_hidden_states + ).vector # input_value_size x len(schema) + utterance_attention = self.schema2utterance_attention_module( + paddle.stack(input_hidden_states, axis=0), schema_states + ).vector # schema_value_size x len(input) + + if schema_attention.dim() == 1: + schema_attention = schema_attention.unsqueeze(1) + if utterance_attention.dim() == 1: + utterance_attention = utterance_attention.unsqueeze(1) + + new_schema_states = paddle.concat( + [paddle.stack(schema_states, axis=1), schema_attention], axis=0 + ) # (input_value_size+schema_value_size) x len(schema) + schema_states = list(paddle.split(new_schema_states, num_or_sections=new_schema_states.shape[1], axis=1)) + schema_states = [schema_state.squeeze() for schema_state in schema_states] + + new_input_hidden_states = paddle.concat( + [paddle.stack(input_hidden_states, axis=1), utterance_attention], axis=0 + ) # (input_value_size+schema_value_size) x len(input) + input_hidden_states = list( + paddle.split(new_input_hidden_states, num_or_sections=new_input_hidden_states.shape[1], axis=1) + ) + input_hidden_states = [input_hidden_state.squeeze() for input_hidden_state in input_hidden_states] + + if feed_gold_tokens: + decoder_results = self.decoder( + utterance_final_state, + input_hidden_states, + schema_states, + max_generation_length, + gold_sequence=gold_query, + input_sequence=input_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + snippets=snippets, + dropout_amount=self.dropout, + ) + + all_scores = [] + all_alignments = [] + for prediction in decoder_results.predictions: + scores = F.softmax(prediction.scores, axis=0) + alignments = prediction.aligned_tokens + if self.params.use_previous_query and self.params.use_copy_switch and len(previous_queries) > 0: + query_scores = F.softmax(prediction.query_scores, axis=0) + copy_switch = prediction.copy_switch + scores = paddle.concat([scores * (1 - copy_switch), query_scores * copy_switch], axis=0) + alignments = alignments + prediction.query_tokens + + all_scores.append(scores) + all_alignments.append(alignments) + + # Compute the loss + gold_sequence = gold_query + + loss = model_utils.compute_loss(gold_sequence, all_scores, all_alignments, get_token_indices) + if not training: + predicted_sequence = model_utils.get_seq_from_scores(all_scores, all_alignments) + token_accuracy = model_utils.per_token_accuracy(gold_sequence, predicted_sequence) + fed_sequence = gold_sequence + else: + decoder_results = self.decoder( + utterance_final_state, + input_hidden_states, + schema_states, + max_generation_length, + input_sequence=input_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + snippets=snippets, + dropout_amount=self.dropout, + ) + predicted_sequence = decoder_results.sequence + fed_sequence = predicted_sequence + + decoder_states = [pred.decoder_state for pred in decoder_results.predictions] + + # fed_sequence contains EOS, which we don't need when encoding snippets. + # also ignore the first state, as it contains the BEG encoding. 
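+ # Note (added comment, describing the loop below): for each fed token we record the matching decoder
+ # state; a snippet token is expanded so that every token of the snippet's underlying sequence receives
+ # a copy of that state, keeping the collected states aligned token-for-token with the expanded query.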
+ + for token, state in zip(fed_sequence[:-1], decoder_states[1:]): + if snippet_handler.is_snippet(token): + snippet_length = 0 + for snippet in snippets: + if snippet.name == token: + snippet_length = len(snippet.sequence) + break + assert snippet_length > 0 + decoder_states.extend([state for _ in range(snippet_length)]) + else: + decoder_states.append(state) + + return (predicted_sequence, loss, token_accuracy, decoder_states, decoder_results) + + def encode_schema_bow_simple(self, input_schema): + schema_states = [] + for column_name in input_schema.column_names_embedder_input: + schema_states.append( + input_schema.column_name_embedder_bow( + column_name, surface_form=False, column_name_token_embedder=self.column_name_token_embedder + ) + ) + input_schema.set_column_name_embeddings(schema_states) + return schema_states + + def encode_schema_self_attention(self, schema_states): + schema_self_attention = self.schema2schema_attention_module( + paddle.stack(schema_states, axis=0), schema_states + ).vector + if schema_self_attention.dim() == 1: + schema_self_attention = schema_self_attention.unsqueeze(1) + residual_schema_states = list( + paddle.split(schema_self_attention, num_or_sections=schema_self_attention.shape[1], axis=1) + ) + residual_schema_states = [schema_state.squeeze() for schema_state in residual_schema_states] + + new_schema_states = [ + schema_state + residual_schema_state + for schema_state, residual_schema_state in zip(schema_states, residual_schema_states) + ] + + return new_schema_states + + def encode_schema(self, input_schema, dropout=False): + schema_states = [] + for column_name_embedder_input in input_schema.column_names_embedder_input: + tokens = column_name_embedder_input.split() + + schema_states_one, final_schema_state_one = self.schema_encoder(paddle.stack(tokens).unsqueeze(0)) + schema_states_one, final_schema_state_one = model_utils.LSTM_output_transfer( + schema_states_one, final_schema_state_one + ) + + # final_schema_state_one: 1 means hidden_states instead of cell_memories, -1 means last layer + schema_states.append(final_schema_state_one[0]) + + input_schema.set_column_name_embeddings(schema_states) + + # self-attention over schema_states + if self.params.use_schema_self_attention: + schema_states = self.encode_schema_self_attention(schema_states) + + return schema_states + + def get_bert_encoding(self, input_sequence, input_schema, discourse_state, dropout): + utterance_states, schema_token_states = bert_utils.get_bert_encoding( + self.bert_config, + self.model_bert, + self.tokenizer, + input_sequence, + input_schema, + bert_input_version=self.params.bert_input_version, + num_out_layers_n=1, + num_out_layers_h=1, + ) + + if self.params.discourse_level_lstm: + utterance_token_embedder = lambda x: paddle.concat([x, discourse_state], axis=0) + for idx in range(len(utterance_states)): + utterance_states[idx] = utterance_token_embedder(utterance_states[idx]) + + utterance_states, final_utterance_state = self.utterance_encoder(paddle.stack(utterance_states).unsqueeze(0)) + utterance_states, final_utterance_state = model_utils.LSTM_output_transfer( + utterance_states, final_utterance_state + ) + + schema_states = [] + for schema_token_states1 in schema_token_states: + schema_states_one, final_schema_state_one = self.schema_encoder( + paddle.stack(schema_token_states1).unsqueeze(0) + ) + schema_states_one, final_schema_state_one = model_utils.LSTM_output_transfer( + schema_states_one, final_schema_state_one + ) + + # final_schema_state_one: 1 means 
hidden_states instead of cell_memories, -1 means last layer + schema_states.append(sum(schema_states_one) / len(schema_states_one)) + + input_schema.set_column_name_embeddings(schema_states) + + # self-attention over schema_states + if self.params.use_schema_self_attention: + schema_states = self.encode_schema_self_attention(schema_states) + + return final_utterance_state, utterance_states, schema_states + + def get_query_token_embedding(self, output_token, input_schema): + if input_schema: + if not ( + self.output_embedder.in_vocabulary(output_token) + or input_schema.in_vocabulary(output_token, surface_form=True) + ): + output_token = "value" + if self.output_embedder.in_vocabulary(output_token): + output_token_embedding = self.output_embedder(output_token) + else: + output_token_embedding = input_schema.column_name_embedder(output_token, surface_form=True) + else: + output_token_embedding = self.output_embedder(output_token) + return output_token_embedding + + def get_utterance_attention( + self, final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ): + # self-attention between utterance_states + final_utterance_states_h.append(final_utterance_state[0]) + final_utterance_states_c.append(final_utterance_state[1]) + final_utterance_states_c = final_utterance_states_c[-num_utterances_to_keep:] + final_utterance_states_h = final_utterance_states_h[-num_utterances_to_keep:] + + attention_result = self.utterance_attention_module(final_utterance_states_c[-1], final_utterance_states_c) + final_utterance_state_attention_c = final_utterance_states_c[-1] + attention_result.vector.squeeze() + + attention_result = self.utterance_attention_module(final_utterance_states_h[-1], final_utterance_states_h) + final_utterance_state_attention_h = final_utterance_states_h[-1] + attention_result.vector.squeeze() + + final_utterance_state = (final_utterance_state_attention_h, final_utterance_state_attention_c) + + return final_utterance_states_c, final_utterance_states_h, final_utterance_state + + def get_previous_queries(self, previous_queries, previous_query_states, previous_query, input_schema): + + query_token_embedder = lambda query_token: self.get_query_token_embedding(query_token, input_schema) + + previous_query_embedding = [] + + for output_token in previous_query: + previous_query_embedding.append(query_token_embedder(output_token)) + previous_query_embedding = paddle.stack(previous_query_embedding) + + previous_queries.append(previous_query) + num_queries_to_keep = min(self.params.maximum_queries, len(previous_queries)) + previous_queries = previous_queries[-num_queries_to_keep:] + + previous_outputs, _ = self.query_encoder(previous_query_embedding.unsqueeze(0)) + previous_outputs, _ = model_utils.LSTM_output_transfer(previous_outputs, _) + + assert len(previous_outputs) == len(previous_query) + previous_query_states.append(previous_outputs) + previous_query_states = previous_query_states[-num_queries_to_keep:] + + return previous_queries, previous_query_states + + def get_adj_matrix(self, inner, foreign_keys, num_col): + ret = np.eye(num_col) + all_keys = inner + foreign_keys + for ele in all_keys: + ret[ele[0]][ele[1]] = 1 + ret[ele[1]][ele[0]] = 1 + return ret + + def get_adj_utterance_matrix(self, inner, foreign_keys, num_col): + ret = np.eye(2 * num_col) + + all_keys = inner + foreign_keys + for i in range(num_col): + ret[i][num_col + i] = 1 + ret[num_col + i][i] = 1 + for ele in all_keys: + + # self graph connect + ret[ele[0]][ele[1]] = 1 + 
ret[ele[1]][ele[0]] = 1 + ret[num_col + ele[0]][num_col + ele[1]] = 1 + ret[num_col + ele[1]][num_col + ele[0]] = 1 + + ret[ele[0]][num_col + ele[1]] = 1 + ret[num_col + ele[1]][ele[0]] = 1 + ret[num_col + ele[0]][ele[1]] = 1 + ret[ele[1]][num_col + ele[0]] = 1 + ret = ret.dot(ret) + return ret + + def train_step( + self, interaction, max_generation_length, snippet_alignment_probability=1.0, db2id=None, id2db=None, step=None + ): + """Trains the interaction-level model on a single interaction. + + Args: + interaction (Interaction): The interaction to train on. + learning_rate (float): Learning rate to use. + snippet_keep_age (int): Age of oldest snippets to use. + snippet_alignment_probability (float): The probability that a snippet will + be used in constructing the gold sequence. + """ + # assert self.params.discourse_level_lstm + + losses = [] + total_gold_tokens = 0 + + input_hidden_states = [] + input_sequences = [] + + final_utterance_states_c = [] + final_utterance_states_h = [] + + previous_query_states = [] + previous_queries = [] + + discourse_state = None + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self._initialize_discourse_states() + + # Schema and schema embeddings + input_schema = interaction.get_schema() + schema_states = [] + + if input_schema and not self.params.use_bert: + schema_states = self.encode_schema_bow_simple(input_schema) + + # Get the intra-turn graph and cross-turn graph + inner = [] + for i, ele in enumerate(interaction.interaction.schema.column_names_surface_form): + for j in range(i + 1, len(interaction.interaction.schema.column_names_surface_form)): + if ele.split(".")[0] == interaction.interaction.schema.column_names_surface_form[j].split(".")[0]: + inner.append([i, j]) + adjacent_matrix = self.get_adj_matrix(inner, input_schema.table_schema["foreign_keys"], input_schema.num_col) + adjacent_matrix_cross = self.get_adj_utterance_matrix( + inner, input_schema.table_schema["foreign_keys"], input_schema.num_col + ) + adjacent_matrix = paddle.to_tensor(adjacent_matrix) + adjacent_matrix_cross = paddle.to_tensor(adjacent_matrix_cross) + + previous_schema_states = paddle.zeros([input_schema.num_col, self.params.encoder_state_size]) + + for utterance_index, utterance in enumerate(interaction.gold_utterances()): + + if ( + interaction.identifier in LIMITED_INTERACTIONS + and utterance_index > LIMITED_INTERACTIONS[interaction.identifier] + ): + break + + input_sequence = utterance.input_sequence() + + available_snippets = utterance.snippets() + previous_query = utterance.previous_query() + + # Get the gold query: reconstruct if the alignment probability is less than one + if snippet_alignment_probability < 1.0: + gold_query = sql_util.add_snippets_to_query( + available_snippets, + utterance.contained_entities(), + utterance.anonymized_gold_query(), + prob_align=snippet_alignment_probability, + ) + [vocab.EOS_TOK] + else: + gold_query = utterance.gold_query() + + final_utterance_state, utterance_states, schema_states = self.get_bert_encoding( + input_sequence, input_schema, discourse_state, dropout=True + ) + + # temp1=final_utterance_state + + schema_states = paddle.stack(schema_states, axis=0) + for i in range(self.params.gnn_layer_number): + schema_states = self.gnn_history[2 * i](schema_states, adjacent_matrix_cross, previous_schema_states) + schema_states = self.gnn_history[2 * i + 1]( + schema_states, adjacent_matrix_cross, previous_schema_states + ) + schema_states = self.gnn[i](schema_states, adjacent_matrix) + 
previous_schema_states = schema_states + schema_states_ls = paddle.split(schema_states, schema_states.shape[0], axis=0) + schema_states = [ele.squeeze(0) for ele in schema_states_ls] + + input_hidden_states.extend(utterance_states) + input_sequences.append(input_sequence) + + num_utterances_to_keep = min(self.params.maximum_utterances, len(input_sequences)) + + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self.discourse_lstms( + final_utterance_state[0].unsqueeze(0), discourse_lstm_states + ) + discourse_state = discourse_state.squeeze() + + if self.params.use_utterance_attention: + + ( + final_utterance_states_c, + final_utterance_states_h, + final_utterance_state, + ) = self.get_utterance_attention( + final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ) + + if self.params.state_positional_embeddings: + utterance_states, flat_sequence = self._add_positional_embeddings(input_hidden_states, input_sequences) + + snippets = None + + if self.params.use_previous_query: + if len(previous_query) > 0: + previous_queries, previous_query_states = self.get_previous_queries( + previous_queries, previous_query_states, previous_query, input_schema + ) + + if len(gold_query) <= max_generation_length and len(previous_query) <= max_generation_length: + prediction = self.predict_turn( + final_utterance_state, + utterance_states, + schema_states, + max_generation_length, + gold_query=gold_query, + snippets=snippets, + input_sequence=flat_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + feed_gold_tokens=True, + training=True, + ) + loss = prediction[1] + total_gold_tokens += len(gold_query) + losses.append(loss) + else: + # Break if previous decoder snippet encoding -- because the previous + # sequence was too long to run the decoder. + if self.params.previous_decoder_snippet_encoding: + break + continue + + if losses: + average_loss = paddle.sum(paddle.stack(losses)) / total_gold_tokens + print(f"total_gold_tokens:{total_gold_tokens}, step:{step}") + print(f"LOSS:{float(average_loss.numpy())}") + if paddle.sum(paddle.cast(paddle.isinf(average_loss), "int32")) == paddle.ones([1]): + self.save("./inf_checkpoint") + + # Renormalize so the effect is normalized by the batch size. 
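+ # Note (added comment): when reweight_batch is set, the averaged per-token loss is scaled by the number
+ # of turns in this interaction and divided by the configured batch size before backpropagation.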
+ normalized_loss = average_loss + if self.params.reweight_batch: + normalized_loss = len(losses) * average_loss / float(self.params.batch_size) + + normalized_loss.backward() + + if step <= self.params.warmup_step: + self.set_learning_rate(step / self.params.warmup_step * self.params.initial_learning_rate) + step += 1 + + self.trainer.step() + if self.params.fine_tune_bert: + self.bert_trainer.step() + self.bert_trainer.clear_grad() + self.trainer.clear_grad() + + loss_scalar = float(normalized_loss.numpy()) + isNan = sum(paddle.cast(paddle.isnan(normalized_loss), "float32").numpy().tolist()) == 0 + if paddle.isnan(normalized_loss): + print("nan error but keep running") + assert isNan + + else: + loss_scalar = 0.0 + + return loss_scalar, step + + def predict_with_predicted_queries(self, interaction, max_generation_length, syntax_restrict=True): + """Predicts an interaction, using the predicted queries to get snippets.""" + + syntax_restrict = False + + predictions = [] + + input_hidden_states = [] + input_sequences = [] + + final_utterance_states_c = [] + final_utterance_states_h = [] + + previous_query_states = [] + previous_queries = [] + + discourse_state = None + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self._initialize_discourse_states() + + # Schema and schema embeddings + input_schema = interaction.get_schema() + schema_states = [] + + # Get the intra-turn graph and cross-turn graph + inner = [] + for i, ele in enumerate(interaction.interaction.schema.column_names_surface_form): + for j in range(i + 1, len(interaction.interaction.schema.column_names_surface_form)): + if ele.split(".")[0] == interaction.interaction.schema.column_names_surface_form[j].split(".")[0]: + inner.append([i, j]) + adjacent_matrix = self.get_adj_matrix(inner, input_schema.table_schema["foreign_keys"], input_schema.num_col) + adjacent_matrix_cross = self.get_adj_utterance_matrix( + inner, input_schema.table_schema["foreign_keys"], input_schema.num_col + ) + + adjacent_matrix = paddle.to_tensor(adjacent_matrix) + adjacent_matrix_cross = paddle.to_tensor(adjacent_matrix_cross) + previous_schema_states = paddle.zeros([input_schema.num_col, self.params.encoder_state_size]) + + if input_schema and not self.params.use_bert: + schema_states = self.encode_schema_bow_simple(input_schema) + + interaction.start_interaction() + + while not interaction.done(): + utterance = interaction.next_utterance() + + available_snippets = utterance.snippets() + previous_query = utterance.previous_query() + + input_sequence = utterance.input_sequence() + + if not self.params.use_bert: + if self.params.discourse_level_lstm: + utterance_token_embedder = lambda token: paddle.concat( + [self.input_embedder(token), discourse_state], axis=0 + ) + else: + utterance_token_embedder = self.input_embedder + final_utterance_state, utterance_states = self.utterance_encoder( + input_sequence, utterance_token_embedder + ) + else: + final_utterance_state, utterance_states, schema_states = self.get_bert_encoding( + input_sequence, input_schema, discourse_state, dropout=False + ) + + schema_states = paddle.stack(schema_states, axis=0) + for i in range(self.params.gnn_layer_number): + schema_states = self.gnn_history[2 * i](schema_states, adjacent_matrix_cross, previous_schema_states) + schema_states = self.gnn_history[2 * i + 1]( + schema_states, adjacent_matrix_cross, previous_schema_states + ) + schema_states = self.gnn[i](schema_states, adjacent_matrix) + previous_schema_states = schema_states + + 
schema_states_ls = paddle.split(schema_states, schema_states.shape[0], axis=0) + schema_states = [ele.squeeze(0) for ele in schema_states_ls] + + input_hidden_states.extend(utterance_states) + input_sequences.append(input_sequence) + + num_utterances_to_keep = min(self.params.maximum_utterances, len(input_sequences)) + + if self.params.discourse_level_lstm: + + discourse_state, discourse_lstm_states = self.discourse_lstms( + final_utterance_state[0].unsqueeze(0), discourse_lstm_states + ) + + if self.params.use_utterance_attention: + ( + final_utterance_states_c, + final_utterance_states_h, + final_utterance_state, + ) = self.get_utterance_attention( + final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ) + + if self.params.state_positional_embeddings: + utterance_states, flat_sequence = self._add_positional_embeddings(input_hidden_states, input_sequences) + else: + flat_sequence = [] + for utt in input_sequences[-num_utterances_to_keep:]: + flat_sequence.extend(utt) + + snippets = None + if self.params.use_snippets: + snippets = self._encode_snippets(previous_query, available_snippets, input_schema) + + if self.params.use_previous_query and len(previous_query) > 0: + previous_queries, previous_query_states = self.get_previous_queries( + previous_queries, previous_query_states, previous_query, input_schema + ) + + results = self.predict_turn( + final_utterance_state, + utterance_states, + schema_states, + max_generation_length, + input_sequence=flat_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + snippets=snippets, + ) + + predicted_sequence = results[0] + predictions.append(results) + + # Update things necessary for using predicted queries + anonymized_sequence = utterance.remove_snippets(predicted_sequence) + if EOS_TOK in anonymized_sequence: + anonymized_sequence = anonymized_sequence[:-1] # Remove _EOS + else: + anonymized_sequence = ["select", "*", "from", "t1"] + + if not syntax_restrict: + utterance.set_predicted_query(interaction.remove_snippets(predicted_sequence)) + if input_schema: + # on SParC + interaction.add_utterance( + utterance, anonymized_sequence, previous_snippets=utterance.snippets(), simple=True + ) + else: + # on ATIS + interaction.add_utterance( + utterance, anonymized_sequence, previous_snippets=utterance.snippets(), simple=False + ) + else: + utterance.set_predicted_query(utterance.previous_query()) + interaction.add_utterance( + utterance, utterance.previous_query(), previous_snippets=utterance.snippets() + ) + + return predictions + + def predict_with_gold_queries(self, interaction, max_generation_length, feed_gold_query=False): + """Predicts SQL queries for an interaction. + + Args: + interaction (Interaction): Interaction to predict for. + feed_gold_query (bool): Whether or not to feed the gold token to the + generation step. 
+ """ + + predictions = [] + + input_hidden_states = [] + input_sequences = [] + + final_utterance_states_c = [] + final_utterance_states_h = [] + + previous_query_states = [] + previous_queries = [] + + discourse_state = None + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self._initialize_discourse_states() + + # Schema and schema embeddings + input_schema = interaction.get_schema() + schema_states = [] + if input_schema and not self.params.use_bert: + schema_states = self.encode_schema_bow_simple(input_schema) + + # Get the intra-turn graph and cross-turn graph + inner = [] + for i, ele in enumerate(interaction.interaction.schema.column_names_surface_form): + for j in range(i + 1, len(interaction.interaction.schema.column_names_surface_form)): + if ele.split(".")[0] == interaction.interaction.schema.column_names_surface_form[j].split(".")[0]: + inner.append([i, j]) + adjacent_matrix = self.get_adj_matrix(inner, input_schema.table_schema["foreign_keys"], input_schema.num_col) + adjacent_matrix_cross = self.get_adj_utterance_matrix( + inner, input_schema.table_schema["foreign_keys"], input_schema.num_col + ) + + adjacent_matrix = paddle.to_tensor(adjacent_matrix) + adjacent_matrix_cross = paddle.to_tensor(adjacent_matrix_cross) + previous_schema_states = paddle.zeros([input_schema.num_col, self.params.encoder_state_size]) + + for _, utterance in enumerate(interaction.gold_utterances()): + + input_sequence = utterance.input_sequence() + + previous_query = utterance.previous_query() + + final_utterance_state, utterance_states, schema_states = self.get_bert_encoding( + input_sequence, input_schema, discourse_state, dropout=True + ) + + schema_states = paddle.stack(schema_states, axis=0) + for i in range(self.params.gnn_layer_number): + schema_states = self.gnn_history[2 * i](schema_states, adjacent_matrix_cross, previous_schema_states) + schema_states = self.gnn_history[2 * i + 1]( + schema_states, adjacent_matrix_cross, previous_schema_states + ) + schema_states = self.gnn[i](schema_states, adjacent_matrix) + previous_schema_states = schema_states + + schema_states_ls = paddle.split(schema_states, schema_states.shape[0], axis=0) + schema_states = [ele.squeeze(0) for ele in schema_states_ls] + + input_hidden_states.extend(utterance_states) + input_sequences.append(input_sequence) + + num_utterances_to_keep = min(self.params.maximum_utterances, len(input_sequences)) + + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self.discourse_lstms( + final_utterance_state[0].unsqueeze(0), discourse_lstm_states + ) + discourse_state = discourse_state.squeeze() + + if self.params.use_utterance_attention: + ( + final_utterance_states_c, + final_utterance_states_h, + final_utterance_state, + ) = self.get_utterance_attention( + final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ) + + if self.params.state_positional_embeddings: + utterance_states, flat_sequence = self._add_positional_embeddings(input_hidden_states, input_sequences) + else: + flat_sequence = [] + for utt in input_sequences[-num_utterances_to_keep:]: + flat_sequence.extend(utt) + + snippets = None + + if self.params.use_previous_query and len(previous_query) > 0: + previous_queries, previous_query_states = self.get_previous_queries( + previous_queries, previous_query_states, previous_query, input_schema + ) + + prediction = self.predict_turn( + final_utterance_state, + utterance_states, + schema_states, + max_generation_length, + 
gold_query=utterance.gold_query(), + snippets=snippets, + input_sequence=flat_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + feed_gold_tokens=feed_gold_query, + ) + + predictions.append(prediction) + + return predictions diff --git a/examples/text_to_sql/IGSQL/model/token_predictor.py b/examples/text_to_sql/IGSQL/model/token_predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..7561cd9aac016bbcb5c3b57cf56c057546f519cd --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/token_predictor.py @@ -0,0 +1,447 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Predicts a token.""" + +from collections import namedtuple + +import paddle +import paddle.nn.functional as F + +from .attention import Attention, AttentionResult + + +class PredictionInput( + namedtuple("PredictionInput", ("decoder_state", "input_hidden_states", "snippets", "input_sequence")) +): + """Inputs to the token predictor.""" + + __slots__ = () + + +class PredictionInputWithSchema( + namedtuple( + "PredictionInputWithSchema", + ( + "decoder_state", + "input_hidden_states", + "schema_states", + "snippets", + "input_sequence", + "previous_queries", + "previous_query_states", + "input_schema", + ), + ) +): + """Inputs to the token predictor.""" + + __slots__ = () + + +class TokenPrediction( + namedtuple( + "TokenPrediction", + ( + "scores", + "aligned_tokens", + "utterance_attention_results", + "schema_attention_results", + "query_attention_results", + "copy_switch", + "query_scores", + "query_tokens", + "decoder_state", + ), + ) +): + """A token prediction.""" + + __slots__ = () + + +def score_schema_tokens(input_schema, schema_states, scorer): + # schema_states: emd_dim x num_tokens + scores = paddle.t(paddle.mm(paddle.t(scorer), schema_states)) # num_tokens x 1 + if scores.shape[0] != len(input_schema): + raise ValueError("Got " + str(scores.shape[0]) + " scores for " + str(len(input_schema)) + " schema tokens") + return scores, input_schema.column_names_surface_form + + +def score_query_tokens(previous_query, previous_query_states, scorer): + scores = paddle.t(paddle.mm(paddle.t(scorer), previous_query_states)) # num_tokens x 1 + if scores.shape[0] != len(previous_query): + raise ValueError("Got " + str(scores.shape[0]) + " scores for " + str(len(previous_query)) + " query tokens") + return scores, previous_query + + +class TokenPredictor(paddle.nn.Layer): + """Predicts a token given a (decoder) state. + + Attributes: + vocabulary (`Vocabulary`): A vocabulary object for the output. + attention_module (`Attention`): An attention module. + state_transformation_weights (`Parameter`): Transforms the input state + before predicting a token. + vocabulary_weights (`Parameter`): Final layer weights. + vocabulary_biases (`Parameter`): Final layer biases. 
+ """ + + def __init__(self, params, vocabulary, attention_key_size): + super().__init__() + self.params = params + self.vocabulary = vocabulary + self.attention_module = Attention(params.decoder_state_size, attention_key_size, attention_key_size) + + bias_initializer = paddle.nn.initializer.Uniform(low=-0.1, high=0.1) + + _initializer = paddle.nn.initializer.XavierUniform() + + state_transform_weights = paddle.ParamAttr(initializer=_initializer) + + vocabulary_biases = paddle.ParamAttr(initializer=bias_initializer) + + self.state_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + attention_key_size, + out_features=params.decoder_state_size, + weight_attr=state_transform_weights, + bias_attr=False, + ) + + self.vocabulary_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=len(vocabulary), + weight_attr=state_transform_weights, + bias_attr=vocabulary_biases, + ) + + def _get_intermediate_state(self, state, dropout_amount=0.0): + intermediate_state = paddle.tanh(self.state_transform_Linear(state)) + return F.dropout(intermediate_state, dropout_amount) + + def _score_vocabulary_tokens(self, state): + scores = paddle.t(self.vocabulary_Linear(state)) + + if scores.shape[0] != len(self.vocabulary.inorder_tokens): + raise ValueError( + "Got " + + str(scores.shape[0]) + + " scores for " + + str(len(self.vocabulary.inorder_tokens)) + + " vocabulary items" + ) + + return scores, self.vocabulary.inorder_tokens + + def forward(self, prediction_input, dropout_amount=0.0): + decoder_state = prediction_input.decoder_state + input_hidden_states = prediction_input.input_hidden_states + + attention_results = self.attention_module(decoder_state, input_hidden_states) + + state_and_attn = paddle.concat([decoder_state, attention_results.vector], axis=0) + + intermediate_state = self._get_intermediate_state(state_and_attn, dropout_amount=dropout_amount) + vocab_scores, vocab_tokens = self._score_vocabulary_tokens(intermediate_state) + + return TokenPrediction(vocab_scores, vocab_tokens, attention_results, decoder_state) + + +class SchemaTokenPredictor(TokenPredictor): + """Token predictor that also predicts snippets. + + Attributes: + snippet_weights (`Parameter`): Weights for scoring snippets against some + state. 
+ """ + + def __init__(self, params, vocabulary, utterance_attention_key_size, schema_attention_key_size, snippet_size): + TokenPredictor.__init__(self, params, vocabulary, utterance_attention_key_size) + + _initializer = paddle.nn.initializer.XavierUniform() + + if params.use_schema_attention: + self.utterance_attention_module = self.attention_module + self.schema_attention_module = Attention( + params.decoder_state_size, schema_attention_key_size, schema_attention_key_size + ) + + if self.params.use_query_attention: + self.query_attention_module = Attention( + params.decoder_state_size, params.encoder_state_size, params.encoder_state_size + ) + + self.start_query_attention_vector = self.create_parameter( + [params.encoder_state_size], + dtype="float32", + default_initializer=paddle.nn.initializer.Uniform(low=-0.1, high=0.1), + ) + + state_transform_weights = paddle.ParamAttr(initializer=_initializer) + if params.use_schema_attention and self.params.use_query_attention: + self.state_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + + utterance_attention_key_size + + schema_attention_key_size + + params.encoder_state_size, + out_features=params.decoder_state_size, + weight_attr=state_transform_weights, + bias_attr=False, + ) + elif params.use_schema_attention: + self.state_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + utterance_attention_key_size + schema_attention_key_size, + out_features=params.decoder_state_size, + weight_attr=state_transform_weights, + bias_attr=False, + ) + + schema_token_weights = paddle.ParamAttr(initializer=_initializer) + self.schema_token_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=schema_attention_key_size, + weight_attr=schema_token_weights, + bias_attr=False, + ) + + if self.params.use_previous_query: + + query_token_weights = paddle.ParamAttr(initializer=_initializer) + self.query_token_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=self.params.encoder_state_size, + weight_attr=query_token_weights, + bias_attr=False, + ) + + if self.params.use_copy_switch: + state2copyswitch_transform_weights = paddle.ParamAttr(initializer=_initializer) + if self.params.use_query_attention: + self.state2copyswitch_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + + utterance_attention_key_size + + schema_attention_key_size + + params.encoder_state_size, + out_features=1, + weight_attr=state2copyswitch_transform_weights, + bias_attr=False, + ) + else: + self.state2copyswitch_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + utterance_attention_key_size + schema_attention_key_size, + out_features=1, + weight_attr=state2copyswitch_transform_weights, + bias_attr=False, + ) + + state2copy_transform_weights = paddle.ParamAttr(initializer=_initializer) + self.state2copy_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=3, + weight_attr=state2copy_transform_weights, + bias_attr=False, + ) + + def _get_schema_token_scorer(self, state): + scorer = paddle.t(self.schema_token_Linear(state)) + return scorer + + def _get_query_token_scorer(self, state): + scorer = paddle.t(self.query_token_Linear(state)) + return scorer + + def _get_copy_switch(self, state): + copy_switch = F.sigmoid(self.state2copyswitch_transform_Linear(state)) + return copy_switch.squeeze() + + def forward(self, prediction_input, dropout_amount=0.0): + decoder_state = 
prediction_input.decoder_state + input_hidden_states = prediction_input.input_hidden_states + + input_schema = prediction_input.input_schema + schema_states = prediction_input.schema_states + + if self.params.use_schema_attention: + schema_attention_results = self.schema_attention_module(decoder_state, schema_states) + utterance_attention_results = self.utterance_attention_module(decoder_state, input_hidden_states) + else: + utterance_attention_results = self.attention_module(decoder_state, input_hidden_states) + schema_attention_results = None + + query_attention_results = None + if self.params.use_query_attention: + previous_query_states = prediction_input.previous_query_states + if len(previous_query_states) > 0: + query_attention_results = self.query_attention_module(decoder_state, previous_query_states[-1]) + else: + query_attention_results = self.start_query_attention_vector + query_attention_results = AttentionResult(None, None, query_attention_results) + + if self.params.use_schema_attention and self.params.use_query_attention: + state_and_attn = paddle.concat( + [ + decoder_state, + utterance_attention_results.vector, + schema_attention_results.vector, + query_attention_results.vector, + ], + axis=0, + ) + elif self.params.use_schema_attention: + state_and_attn = paddle.concat( + [decoder_state, utterance_attention_results.vector, schema_attention_results.vector], axis=0 + ) + else: + state_and_attn = paddle.concat([decoder_state, utterance_attention_results.vector], axis=0) + + intermediate_state = self._get_intermediate_state(state_and_attn, dropout_amount=dropout_amount) + copy_score = F.sigmoid(self.state2copy_transform_Linear(intermediate_state).squeeze(0)) + + vocab_scores, vocab_tokens = self._score_vocabulary_tokens(intermediate_state) + + final_scores = vocab_scores + aligned_tokens = [] + aligned_tokens.extend(vocab_tokens) + + schema_states = paddle.stack(schema_states, axis=1) + schema_scores, schema_tokens = score_schema_tokens( + input_schema, schema_states, self._get_schema_token_scorer(intermediate_state) + ) + + final_scores = paddle.concat([copy_score[0] * final_scores, copy_score[1] * schema_scores], axis=0) + aligned_tokens.extend(schema_tokens) + + # Previous Queries + previous_queries = prediction_input.previous_queries + previous_query_states = prediction_input.previous_query_states + + copy_switch = None + query_scores = None + query_tokens = None + if self.params.use_previous_query and len(previous_queries) > 0: + if self.params.use_copy_switch: + copy_switch = self._get_copy_switch(state_and_attn) + for turn, (previous_query, previous_query_state) in enumerate( + zip(previous_queries, previous_query_states) + ): + + assert len(previous_query) == len(previous_query_state) + previous_query_state = paddle.stack(previous_query_state, axis=1) + query_scores, query_tokens = score_query_tokens( + previous_query, previous_query_state, self._get_query_token_scorer(intermediate_state) + ) + query_scores = query_scores.squeeze() + + if query_scores is not None: + final_scores = paddle.concat([final_scores, copy_score[2] * query_scores], axis=0) + aligned_tokens += query_tokens + + return TokenPrediction( + final_scores, + aligned_tokens, + utterance_attention_results, + schema_attention_results, + query_attention_results, + copy_switch, + query_scores, + query_tokens, + decoder_state, + ) + + +class AnonymizationTokenPredictor(TokenPredictor): + """Token predictor that also predicts anonymization tokens. 
+
+    Attributes:
+        anonymizer (`Anonymizer`): The anonymization object.
+
+    """
+
+    def __init__(self, params, vocabulary, attention_key_size, anonymizer):
+        TokenPredictor.__init__(self, params, vocabulary, attention_key_size)
+        if not anonymizer:
+            raise ValueError("Expected an anonymizer, but was None")
+        self.anonymizer = anonymizer
+
+    def _score_anonymized_tokens(self, input_sequence, attention_scores):
+        scores = []
+        tokens = []
+        for i, token in enumerate(input_sequence):
+            if self.anonymizer.is_anon_tok(token):
+                scores.append(attention_scores[i])
+                tokens.append(token)
+
+        if len(scores) > 0:
+            if len(scores) != len(tokens):
+                raise ValueError("Got " + str(len(scores)) + " scores for " + str(len(tokens)) + " anonymized tokens")
+
+            anonymized_scores = paddle.concat(scores, axis=0)
+            if anonymized_scores.dim() == 1:
+                anonymized_scores = anonymized_scores.unsqueeze(1)
+            return anonymized_scores, tokens
+        else:
+            return None, []
+
+    def forward(self, prediction_input, dropout_amount=0.0):
+        decoder_state = prediction_input.decoder_state
+        input_hidden_states = prediction_input.input_hidden_states
+        input_sequence = prediction_input.input_sequence
+        assert input_sequence
+
+        attention_results = self.attention_module(decoder_state, input_hidden_states)
+
+        state_and_attn = paddle.concat([decoder_state, attention_results.vector], axis=0)
+
+        intermediate_state = self._get_intermediate_state(state_and_attn, dropout_amount=dropout_amount)
+        vocab_scores, vocab_tokens = self._score_vocabulary_tokens(intermediate_state)
+
+        final_scores = vocab_scores
+        aligned_tokens = []
+        aligned_tokens.extend(vocab_tokens)
+
+        anonymized_scores, anonymized_tokens = self._score_anonymized_tokens(input_sequence, attention_results.scores)
+
+        if anonymized_scores is not None:
+            final_scores = paddle.concat([final_scores, anonymized_scores], axis=0)
+            aligned_tokens.extend(anonymized_tokens)
+
+        final_scores = final_scores.squeeze()
+
+        return TokenPrediction(
+            final_scores, aligned_tokens, attention_results, None, None, None, None, None, decoder_state
+        )
+
+
+def construct_token_predictor(
+    params, vocabulary, utterance_attention_key_size, schema_attention_key_size, snippet_size, anonymizer=None
+):
+    """Constructs a token predictor given the parameters.
+
+    Args:
+        params (`dict`): Contains the command line parameters/hyperparameters.
+        vocabulary (`Vocabulary`): Vocabulary object for output generation.
+        utterance_attention_key_size (`int`): The size of the utterance attention keys.
+        schema_attention_key_size (`int`): The size of the schema attention keys.
+        snippet_size (`int`): The size of the snippet representations.
+        anonymizer (`Anonymizer`): An anonymization object.
+    """
+
+    if not anonymizer and not params.previous_decoder_snippet_encoding:
+        print("using SchemaTokenPredictor")
+        return SchemaTokenPredictor(
+            params, vocabulary, utterance_attention_key_size, schema_attention_key_size, snippet_size
+        )
+    elif params.use_snippets and anonymizer and not params.previous_decoder_snippet_encoding:
+        print("using AnonymizationTokenPredictor")
+        return AnonymizationTokenPredictor(params, vocabulary, utterance_attention_key_size, anonymizer)
+    else:
+        print("Unknown token_predictor")
+        exit()
diff --git a/examples/text_to_sql/IGSQL/model_util.py b/examples/text_to_sql/IGSQL/model_util.py
new file mode 100644
index 0000000000000000000000000000000000000000..a30608b6263286f6b489a7513160d7121ea8f201
--- /dev/null
+++ b/examples/text_to_sql/IGSQL/model_util.py
@@ -0,0 +1,639 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Basic model training and evaluation functions.""" + +import json +import random +from enum import Enum + +import paddle +import progressbar + +from . import model_utils +from .data_util import sql_util + + +def write_prediction( + fileptr, + identifier, + input_seq, + probability, + prediction, + flat_prediction, + gold_query, + flat_gold_queries, + gold_tables, + index_in_interaction, + database_username, + database_password, + database_timeout, + compute_metrics=True, +): + pred_obj = {} + pred_obj["identifier"] = identifier + if len(identifier.split("/")) == 2: + database_id, interaction_id = identifier.split("/") + else: + database_id = "atis" + interaction_id = identifier + pred_obj["database_id"] = database_id + pred_obj["interaction_id"] = interaction_id + + pred_obj["input_seq"] = input_seq + pred_obj["probability"] = probability + pred_obj["prediction"] = prediction + pred_obj["flat_prediction"] = flat_prediction + pred_obj["gold_query"] = gold_query + pred_obj["flat_gold_queries"] = flat_gold_queries + pred_obj["index_in_interaction"] = index_in_interaction + pred_obj["gold_tables"] = str(gold_tables) + + # Now compute the metrics we want. + + if compute_metrics: + # First metric: whether flat predicted query is in the gold query set. 
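+        # When the exact string match fails, the predicted SQL is executed against the
+        # database and its result table is scored against the gold tables (precision/recall/F1).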
+        correct_string = " ".join(flat_prediction) in [" ".join(q) for q in flat_gold_queries]
+        pred_obj["correct_string"] = correct_string
+
+        # Database metrics
+        if not correct_string:
+            syntactic, semantic, pred_table = sql_util.execution_results(
+                " ".join(flat_prediction), database_username, database_password, database_timeout
+            )
+            pred_table = sorted(pred_table)
+            best_prec = 0.0
+            best_rec = 0.0
+            best_f1 = 0.0
+
+            for gold_table in gold_tables:
+                num_overlap = float(len(set(pred_table) & set(gold_table)))
+
+                if len(set(gold_table)) > 0:
+                    prec = num_overlap / len(set(gold_table))
+                else:
+                    prec = 1.0
+
+                if len(set(pred_table)) > 0:
+                    rec = num_overlap / len(set(pred_table))
+                else:
+                    rec = 1.0
+
+                if prec > 0.0 and rec > 0.0:
+                    f1 = (2 * (prec * rec)) / (prec + rec)
+                else:
+                    f1 = 1.0
+
+                best_prec = max(best_prec, prec)
+                best_rec = max(best_rec, rec)
+                best_f1 = max(best_f1, f1)
+
+        else:
+            syntactic = True
+            semantic = True
+            pred_table = []
+            best_prec = 1.0
+            best_rec = 1.0
+            best_f1 = 1.0
+
+        assert best_prec <= 1.0
+        assert best_rec <= 1.0
+        assert best_f1 <= 1.0
+        pred_obj["syntactic"] = syntactic
+        pred_obj["semantic"] = semantic
+        correct_table = (pred_table in gold_tables) or correct_string
+        pred_obj["correct_table"] = correct_table
+        pred_obj["strict_correct_table"] = correct_table and syntactic
+        pred_obj["pred_table"] = str(pred_table)
+        pred_obj["table_prec"] = best_prec
+        pred_obj["table_rec"] = best_rec
+        pred_obj["table_f1"] = best_f1
+
+    fileptr.write(json.dumps(pred_obj) + "\n")
+
+
+class Metrics(Enum):
+    """Definitions of simple metrics to compute."""
+
+    LOSS = 1
+    TOKEN_ACCURACY = 2
+    STRING_ACCURACY = 3
+    CORRECT_TABLES = 4
+    STRICT_CORRECT_TABLES = 5
+    SEMANTIC_QUERIES = 6
+    SYNTACTIC_QUERIES = 7
+    FIRST_ACC = 8
+    AFTER_FIRST_ACC = 9
+
+
+def get_progressbar(name, size):
+    """Gets a progress bar object given a name and the total size.
+
+    Args:
+        name (`str`): The name to display on the side.
+        size (`int`): The maximum size of the progress bar.
+
+    """
+    return progressbar.ProgressBar(
+        maxval=size,
+        widgets=[name, progressbar.Bar("=", "[", "]"), " ", progressbar.Percentage(), " ", progressbar.ETA()],
+    )
+
+
+def train_epoch_with_utterances(batches, model, randomize=True):
+    """Trains model for a single epoch given batches of utterance data.
+
+    Args:
+        batches (`UtteranceBatch`): The batches to give to training.
+        model (`ATISModel`): The model object.
+        randomize (`bool`): Whether or not to randomize the order that the batches are seen.
+    """
+    if randomize:
+        random.shuffle(batches)
+    progbar = get_progressbar("train ", len(batches))
+    progbar.start()
+    loss_sum = 0.0
+
+    for i, batch in enumerate(batches):
+        batch_loss = model.train_step(batch)
+        loss_sum += batch_loss
+
+        progbar.update(i)
+
+    progbar.finish()
+
+    total_loss = loss_sum / len(batches)
+
+    return total_loss
+
+
+def train_epoch_with_interactions(
+    interaction_batches, params, model, randomize=True, db2id=None, id2db=None, step=None
+):
+    """Trains model for single epoch given batches of interactions.
+
+    Args:
+        interaction_batches (`list`): The batches to train on.
+        params (`namespace`): Parameters to run with.
+        model (`ATISModel`): Model to train.
+        randomize (`bool`): Whether or not to randomize the order that batches are seen.
+    """
+    if randomize:
+        random.shuffle(interaction_batches)
+    progbar = get_progressbar("train ", len(interaction_batches))
+    progbar.start()
+    loss_sum = 0.0
+
+    skip_ls = ["sakila_1", "baseball_1", "soccer_1", "cre_Drama_Workshop_Groups", "formula_1", "assets_maintenance/8"]
+    skip_num = 0
+
+    for i, interaction_batch in enumerate(interaction_batches):
+        assert len(interaction_batch) == 1
+
+        interaction = interaction_batch.items[0]
+
+        if interaction.identifier == "raw/atis2/12-1.1/ATIS2/TEXT/TEST/NOV92/770/5":
+            continue
+
+        if "sparc" in params.data_directory and "baseball_1" in interaction.identifier:
+            continue
+
+        skip = False
+        if "cosql" in params.data_directory:
+            print(interaction.identifier, i, skip_num)
+            for ele in skip_ls:
+                if ele in interaction.identifier:
+                    print("skip")
+                    skip = True
+                    continue
+        if skip:
+            print("skip, length:", len(interaction.gold_utterances()))
+            skip_num += 1
+            continue
+
+        batch_loss, step = model.train_step(
+            interaction, params.train_maximum_sql_length, db2id=db2id, id2db=id2db, step=step
+        )
+
+        loss_sum += batch_loss
+
+        progbar.update(i)
+
+    progbar.finish()
+
+    total_loss = loss_sum / len(interaction_batches)
+
+    return total_loss, step
+
+
+# Compute accuracy metrics.
+def update_sums(
+    metrics,
+    metrics_sums,
+    predicted_sequence,
+    flat_sequence,
+    gold_query,
+    original_gold_query,
+    gold_forcing=False,
+    loss=None,
+    token_accuracy=0.0,
+    database_username="",
+    database_password="",
+    database_timeout=0,
+    gold_table=None,
+):
+    """Updates summing for metrics in an aggregator.
+
+    TODO: don't use sums, just keep the raw value.
+    """
+    if Metrics.LOSS in metrics:
+        metrics_sums[Metrics.LOSS] += loss
+
+    if Metrics.TOKEN_ACCURACY in metrics:
+        if gold_forcing:
+            metrics_sums[Metrics.TOKEN_ACCURACY] += token_accuracy
+        else:
+            num_tokens_correct = 0.0
+            for j, token in enumerate(gold_query):
+                if len(predicted_sequence) > j and predicted_sequence[j] == token:
+                    num_tokens_correct += 1
+            metrics_sums[Metrics.TOKEN_ACCURACY] += num_tokens_correct / len(gold_query)
+
+    if Metrics.STRING_ACCURACY in metrics:
+
+        metrics_sums[Metrics.STRING_ACCURACY] += int(flat_sequence == original_gold_query)
+
+    if Metrics.CORRECT_TABLES in metrics:
+        assert database_username, "You did not provide a database username"
+        assert database_password, "You did not provide a database password"
+        assert database_timeout > 0, "Database timeout is 0 seconds"
+
+        # Evaluate SQL
+        if flat_sequence != original_gold_query:
+            syntactic, semantic, table = sql_util.execution_results(
+                " ".join(flat_sequence), database_username, database_password, database_timeout
+            )
+        else:
+            syntactic = True
+            semantic = True
+            table = gold_table
+
+        metrics_sums[Metrics.CORRECT_TABLES] += int(table == gold_table)
+        if Metrics.SYNTACTIC_QUERIES in metrics:
+            metrics_sums[Metrics.SYNTACTIC_QUERIES] += int(syntactic)
+        if Metrics.SEMANTIC_QUERIES in metrics:
+            metrics_sums[Metrics.SEMANTIC_QUERIES] += int(semantic)
+        if Metrics.STRICT_CORRECT_TABLES in metrics:
+            metrics_sums[Metrics.STRICT_CORRECT_TABLES] += int(table == gold_table and syntactic)
+
+
+def construct_averages(metrics_sums, total_num):
+    """Computes the averages for metrics.
+
+    Args:
+        metrics_sums (`dict`): Sums for a metric.
+        total_num (`int`): Number to divide by (average).
+ """ + metrics_averages = {} + if isinstance(total_num, int): + for metric, value in metrics_sums.items(): + metrics_averages[metric] = value / total_num + if metric != "loss": + metrics_averages[metric] *= 100.0 + else: + for metric, value in metrics_sums.items(): + metrics_averages[metric] = value / total_num + if metric != "loss": + metrics_averages[metric] *= 100.0 + + return metrics_averages + + +def evaluate_utterance_sample( + sample, + model, + max_generation_length, + name="", + gold_forcing=False, + metrics=None, + total_num=-1, + database_username="", + database_password="", + database_timeout=0, + write_results=False, +): + """Evaluates a sample of utterance examples. + + Args: + sample (`list`): Examples to evaluate. + model (`ATISModel`): Model to predict with. + max_generation_length (`int`): Maximum length to generate. + name (`str`): Name to log with. + gold_forcing (`bool`): Whether to force the gold tokens during decoding. + metrics (`list`): Metrics to evaluate with. + total_num (`int`): Number to divide by when reporting results. + database_username (`str`): Username to use for executing queries. + database_password (`str`): Password to use when executing queries. + database_timeout (`float`): Timeout on queries when executing. + write_results (`bool`): Whether to write the results to a file. + """ + assert metrics + + if total_num < 0: + total_num = len(sample) + + metrics_sums = {} + for metric in metrics: + metrics_sums[metric] = 0.0 + + predictions_file = open(name + "_predictions.json", "w") + print("Predicting with filename " + str(name) + "_predictions.json") + progbar = get_progressbar(name, len(sample)) + progbar.start() + + predictions = [] + for i, item in enumerate(sample): + _, loss, predicted_seq = model.eval_step(item, max_generation_length, feed_gold_query=gold_forcing) + loss = loss / len(item.gold_query()) + predictions.append(predicted_seq) + + flat_sequence = item.flatten_sequence(predicted_seq) + token_accuracy = model_utils.per_token_accuracy(item.gold_query(), predicted_seq) + + if write_results: + write_prediction( + predictions_file, + identifier=item.interaction.identifier, + input_seq=item.input_sequence(), + probability=0, + prediction=predicted_seq, + flat_prediction=flat_sequence, + gold_query=item.gold_query(), + flat_gold_queries=item.original_gold_queries(), + gold_tables=item.gold_tables(), + index_in_interaction=item.utterance_index, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + ) + + update_sums( + metrics, + metrics_sums, + predicted_seq, + flat_sequence, + item.gold_query(), + item.original_gold_queries()[0], + gold_forcing, + loss, + token_accuracy, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + gold_table=item.gold_tables()[0], + ) + + progbar.update(i) + + progbar.finish() + predictions_file.close() + + return construct_averages(metrics_sums, total_num), None + + +def evaluate_interaction_sample( + sample, + model, + max_generation_length, + name="", + gold_forcing=False, + metrics=None, + total_num=-1, + database_username="", + database_password="", + database_timeout=0, + use_predicted_queries=False, + write_results=False, + use_gpu=False, + compute_metrics=False, +): + """Evaluates a sample of interactions.""" + predictions_file = open(name + "_predictions.json", "w") + print("Predicting with file " + str(name + "_predictions.json")) + metrics_sums = {} + for metric in metrics: + 
metrics_sums[metric] = 0.0 + progbar = get_progressbar(name, len(sample)) + progbar.start() + + num_utterances = 0 + predictions = [] + + model.eval() + + for i, interaction in enumerate(sample): + try: + with paddle.no_grad(): + if use_predicted_queries: + example_preds = model.predict_with_predicted_queries(interaction, max_generation_length) + else: + example_preds = model.predict_with_gold_queries( + interaction, max_generation_length, feed_gold_query=gold_forcing + ) + except RuntimeError as exception: + print("Failed on interaction: " + str(interaction.identifier)) + print(exception) + print("\n\n") + exit() + + predictions.extend(example_preds) + + assert len(example_preds) == len(interaction.interaction.utterances) or not example_preds + for j, pred in enumerate(example_preds): + num_utterances += 1 + + sequence, loss, token_accuracy, _, decoder_results = pred + + if use_predicted_queries: + item = interaction.processed_utterances[j] + original_utt = interaction.interaction.utterances[item.index] + + gold_query = original_utt.gold_query_to_use + original_gold_query = original_utt.original_gold_query + + gold_table = original_utt.gold_sql_results + gold_queries = [q[0] for q in original_utt.all_gold_queries] + gold_tables = [q[1] for q in original_utt.all_gold_queries] + index = item.index + else: + item = interaction.gold_utterances()[j] + + gold_query = item.gold_query() + original_gold_query = item.original_gold_query() + + gold_table = item.gold_table() + gold_queries = item.original_gold_queries() + gold_tables = item.gold_tables() + index = item.utterance_index + if loss: + loss = loss / len(gold_query) + + flat_sequence = item.flatten_sequence(sequence) + + # if isinstance(flat_sequence[-1],int): + # if flat_sequence[-1]==0: + # num_first_utterances += 1 + # else: + # num_after_first_utterances += 1 + + if write_results: + write_prediction( + predictions_file, + identifier=interaction.identifier, + input_seq=item.input_sequence(), + probability=decoder_results.probability, + prediction=sequence, + flat_prediction=flat_sequence, + gold_query=gold_query, + flat_gold_queries=gold_queries, + gold_tables=gold_tables, + index_in_interaction=index, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + compute_metrics=compute_metrics, + ) + + update_sums( + metrics, + metrics_sums, + sequence, + flat_sequence, + gold_query, + original_gold_query, + gold_forcing, + loss, + token_accuracy, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + gold_table=gold_table, + ) + + progbar.update(i) + + progbar.finish() + + if total_num < 0: + total_num = num_utterances + + predictions_file.close() + return construct_averages(metrics_sums, total_num), predictions + + +def evaluate_using_predicted_queries( + sample, + model, + name="", + gold_forcing=False, + metrics=None, + total_num=-1, + database_username="", + database_password="", + database_timeout=0, + snippet_keep_age=1, +): + predictions_file = open(name + "_predictions.json", "w") + print("Predicting with file " + str(name + "_predictions.json")) + assert not gold_forcing + metrics_sums = {} + for metric in metrics: + metrics_sums[metric] = 0.0 + progbar = get_progressbar(name, len(sample)) + progbar.start() + + num_utterances = 0 + predictions = [] + for i, item in enumerate(sample): + int_predictions = [] + item.start_interaction() + while not item.done(): + utterance = item.next_utterance(snippet_keep_age) + 
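+            # Decode the next utterance; the prediction is kept only if it is executable
+            # and clears the probability threshold, otherwise an empty query is recorded.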
+ predicted_sequence, loss, _, probability = model.eval_step(utterance) + int_predictions.append((utterance, predicted_sequence)) + + flat_sequence = utterance.flatten_sequence(predicted_sequence) + + if ( + sql_util.executable( + flat_sequence, username=database_username, password=database_password, timeout=database_timeout + ) + and probability >= 0.24 + ): + utterance.set_pred_query(item.remove_snippets(predicted_sequence)) + item.add_utterance( + utterance, item.remove_snippets(predicted_sequence), previous_snippets=utterance.snippets() + ) + else: + # Add the /previous/ predicted query, guaranteed to be syntactically + # correct + seq = [] + utterance.set_pred_query(seq) + item.add_utterance(utterance, seq, previous_snippets=utterance.snippets()) + + original_utt = item.interaction.utterances[utterance.index] + write_prediction( + predictions_file, + identifier=item.interaction.identifier, + input_seq=utterance.input_sequence(), + probability=probability, + prediction=predicted_sequence, + flat_prediction=flat_sequence, + gold_query=original_utt.gold_query_to_use, + flat_gold_queries=[q[0] for q in original_utt.all_gold_queries], + gold_tables=[q[1] for q in original_utt.all_gold_queries], + index_in_interaction=utterance.index, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + ) + + update_sums( + metrics, + metrics_sums, + predicted_sequence, + flat_sequence, + original_utt.gold_query_to_use, + original_utt.original_gold_query, + gold_forcing, + loss, + token_accuracy=0, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + gold_table=original_utt.gold_sql_results, + ) + + predictions.append(int_predictions) + progbar.update(i) + + progbar.finish() + + if total_num < 0: + total_num = num_utterances + predictions_file.close() + + return construct_averages(metrics_sums, total_num), predictions diff --git a/examples/text_to_sql/IGSQL/parse_args.py b/examples/text_to_sql/IGSQL/parse_args.py new file mode 100644 index 0000000000000000000000000000000000000000..1ab7d9eb373f9f3aba4cc24cba72c2e27857029d --- /dev/null +++ b/examples/text_to_sql/IGSQL/parse_args.py @@ -0,0 +1,167 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
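+
+"""Command line argument definitions for the IGSQL example."""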
+ +import argparse +import os +import sys + +args = sys.argv + + +def interpret_args(): + """Interprets the command line arguments, and returns a dictionary.""" + parser = argparse.ArgumentParser() + + parser.add_argument("--no_gpus", type=bool, default=1) + + # Data parameters + parser.add_argument( + "--raw_train_filename", type=str, default="../atis_data/data/resplit/processed/train_with_tables.pkl" + ) + parser.add_argument( + "--raw_dev_filename", type=str, default="../atis_data/data/resplit/processed/dev_with_tables.pkl" + ) + parser.add_argument( + "--raw_validation_filename", type=str, default="../atis_data/data/resplit/processed/valid_with_tables.pkl" + ) + parser.add_argument( + "--raw_test_filename", type=str, default="../atis_data/data/resplit/processed/test_with_tables.pkl" + ) + + parser.add_argument("--data_directory", type=str, default="processed_data") + + parser.add_argument("--processed_train_filename", type=str, default="train.pkl") + parser.add_argument("--processed_dev_filename", type=str, default="dev.pkl") + parser.add_argument("--processed_validation_filename", type=str, default="validation.pkl") + parser.add_argument("--processed_test_filename", type=str, default="test.pkl") + + parser.add_argument("--database_schema_filename", type=str, default=None) + parser.add_argument("--embedding_filename", type=str, default=None) + + parser.add_argument("--input_vocabulary_filename", type=str, default="input_vocabulary.pkl") + parser.add_argument("--output_vocabulary_filename", type=str, default="output_vocabulary.pkl") + + parser.add_argument("--input_key", type=str, default="utterance") + + parser.add_argument("--anonymize", type=bool, default=False) + parser.add_argument("--anonymization_scoring", type=bool, default=False) + parser.add_argument("--use_snippets", type=bool, default=False) + + parser.add_argument("--use_previous_query", type=bool, default=True) + parser.add_argument("--maximum_queries", type=int, default=1) + parser.add_argument("--use_copy_switch", type=bool, default=False) + parser.add_argument("--use_query_attention", type=bool, default=True) + + parser.add_argument("--use_utterance_attention", type=bool, default=True) + + parser.add_argument("--scheduler", type=bool, default=False) + + parser.add_argument("--use_bert", type=bool, default=True) + parser.add_argument("--bert_input_version", type=str, default="v1") + parser.add_argument("--fine_tune_bert", type=bool, default=True) + parser.add_argument("--lr_bert", default=1e-5, type=float, help="BERT model learning rate.") + + # Debugging/logging parameters + parser.add_argument("--reload_embedding", type=bool, default=False) + parser.add_argument("--logdir", type=str, default="logs") + parser.add_argument("--deterministic", type=bool, default=False) + parser.add_argument("--num_train", type=int, default=-1) + + parser.add_argument("--logfile", type=str, default="log.txt") + parser.add_argument("--results_file", type=str, default="results.txt") + + # Model architecture + parser.add_argument("--input_embedding_size", type=int, default=300) + parser.add_argument("--output_embedding_size", type=int, default=300) + + parser.add_argument("--encoder_state_size", type=int, default=300) + parser.add_argument("--decoder_state_size", type=int, default=300) + + parser.add_argument("--encoder_num_layers", type=int, default=1) + parser.add_argument("--decoder_num_layers", type=int, default=1) + parser.add_argument("--snippet_num_layers", type=int, default=1) + + parser.add_argument("--maximum_utterances", type=int, 
default=5) + parser.add_argument("--state_positional_embeddings", type=bool, default=True) + parser.add_argument("--positional_embedding_size", type=int, default=50) + + parser.add_argument("--snippet_age_embedding", type=bool, default=False) + parser.add_argument("--snippet_age_embedding_size", type=int, default=64) + parser.add_argument("--max_snippet_age_embedding", type=int, default=4) + parser.add_argument("--previous_decoder_snippet_encoding", type=bool, default=False) + + parser.add_argument("--discourse_level_lstm", type=bool, default=True) + + parser.add_argument("--use_schema_attention", type=bool, default=True) + parser.add_argument("--use_encoder_attention", type=bool, default=True) + + parser.add_argument("--use_schema_encoder", type=bool, default=True) + parser.add_argument("--use_schema_self_attention", type=bool, default=False) + parser.add_argument("--use_schema_encoder_2", type=bool, default=False) + + # Training parameters + parser.add_argument("--batch_size", type=int, default=16) + parser.add_argument("--train_maximum_sql_length", type=int, default=400) # 200 + parser.add_argument("--train_evaluation_size", type=int, default=100) + + parser.add_argument("--dropout_amount", type=float, default=0.5) + + parser.add_argument("--initial_patience", type=float, default=10.0) + parser.add_argument("--patience_ratio", type=float, default=1.01) + + parser.add_argument("--initial_learning_rate", type=float, default=1e-3) + parser.add_argument("--learning_rate_ratio", type=float, default=0.9) + + parser.add_argument("--interaction_level", type=bool, default=True) + parser.add_argument("--reweight_batch", type=bool, default=True) + parser.add_argument("--gnn_layer_number", type=int, default=1) + parser.add_argument("--clip", type=float, default=5.0) + parser.add_argument("--warmup_step", type=int, default=1000) + + # Setting + parser.add_argument("--train", type=bool, default=False) + parser.add_argument("--debug", type=bool, default=False) + + parser.add_argument("--evaluate", type=bool, default=False) + parser.add_argument("--attention", type=bool, default=False) + parser.add_argument("--save_file", type=str, default="") + parser.add_argument("--enable_testing", type=bool, default=False) + parser.add_argument("--use_predicted_queries", type=bool, default=False) + parser.add_argument("--evaluate_split", type=str, default="valid") + parser.add_argument("--evaluate_with_gold_forcing", type=bool, default=False) + parser.add_argument("--eval_maximum_sql_length", type=int, default=400) + parser.add_argument("--results_note", type=str, default="") + parser.add_argument("--compute_metrics", type=bool, default=False) + + parser.add_argument("--reference_results", type=str, default="") + + parser.add_argument("--interactive", type=bool, default=False) + + parser.add_argument("--database_username", type=str, default="aviarmy") + parser.add_argument("--database_password", type=str, default="aviarmy") + parser.add_argument("--database_timeout", type=int, default=2) + + parser.add_argument("--all_in_one_trainer", type=bool, default=False) + + args = parser.parse_args() + + if not os.path.exists(args.logdir): + os.makedirs(args.logdir) + + if not (args.train or args.evaluate or args.interactive or args.attention): + raise ValueError("You need to be training or evaluating") + if args.enable_testing and not args.evaluate: + raise ValueError("You should evaluate the model if enabling testing") + + return args diff --git a/examples/text_to_sql/IGSQL/postprocess_eval.py 
b/examples/text_to_sql/IGSQL/postprocess_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..f2c41b383e12aa0e1d29012955b73f1773ca74e5 --- /dev/null +++ b/examples/text_to_sql/IGSQL/postprocess_eval.py @@ -0,0 +1,512 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import subprocess +import traceback +from collections import defaultdict + +import sqlparse + + +def find_shortest_path(start, end, graph): + stack = [[start, []]] + visited = set() + while len(stack) > 0: + ele, history = stack.pop() + if ele == end: + return history + for node in graph[ele]: + if node[0] not in visited: + stack.append((node[0], history + [(node[0], node[1])])) + visited.add(node[0]) + + +def gen_from(candidate_tables, schema): + if len(candidate_tables) <= 1: + if len(candidate_tables) == 1: + ret = "from {}".format(schema["table_names_original"][list(candidate_tables)[0]]) + else: + ret = "from {}".format(schema["table_names_original"][0]) + return {}, ret + + table_alias_dict = {} + uf_dict = {} + for t in candidate_tables: + uf_dict[t] = -1 + idx = 1 + graph = defaultdict(list) + for acol, bcol in schema["foreign_keys"]: + t1 = schema["column_names"][acol][0] + t2 = schema["column_names"][bcol][0] + graph[t1].append((t2, (acol, bcol))) + graph[t2].append((t1, (bcol, acol))) + candidate_tables = list(candidate_tables) + start = candidate_tables[0] + table_alias_dict[start] = idx + idx += 1 + ret = "from {} as T1".format(schema["table_names_original"][start]) + try: + for end in candidate_tables[1:]: + if end in table_alias_dict: + continue + path = find_shortest_path(start, end, graph) + prev_table = start + if not path: + table_alias_dict[end] = idx + idx += 1 + ret = "{} join {} as T{}".format( + ret, + schema["table_names_original"][end], + table_alias_dict[end], + ) + continue + for node, (acol, bcol) in path: + if node in table_alias_dict: + prev_table = node + continue + table_alias_dict[node] = idx + idx += 1 + ret = "{} join {} as T{} on T{}.{} = T{}.{}".format( + ret, + schema["table_names_original"][node], + table_alias_dict[node], + table_alias_dict[prev_table], + schema["column_names_original"][acol][1], + table_alias_dict[node], + schema["column_names_original"][bcol][1], + ) + prev_table = node + except Exception: + traceback.print_exc() + print("db:{}".format(schema["db_id"])) + return table_alias_dict, ret + return table_alias_dict, ret + + +def normalize_space(format_sql): + format_sql_1 = [ + " ".join(sub_sql.strip().replace(",", " , ").replace("(", " ( ").replace(")", " ) ").split()) + for sub_sql in format_sql.split("\n") + ] + format_sql_1 = "\n".join(format_sql_1) + format_sql_2 = ( + format_sql_1.replace("\njoin", " join") + .replace(",\n", ", ") + .replace(" where", "\nwhere") + .replace(" intersect", "\nintersect") + .replace("union ", "union\n") + .replace("\nand", " and") + .replace("order by t2 .\nstart desc", "order by t2 . 
start desc") + ) + return format_sql_2 + + +def get_candidate_tables(format_sql, schema): + candidate_tables = [] + + tokens = format_sql.split() + for ii, token in enumerate(tokens): + if "." in token: + table_name = token.split(".")[0] + candidate_tables.append(table_name) + + candidate_tables = list(set(candidate_tables)) + + table_names_original = [table_name.lower() for table_name in schema["table_names_original"]] + candidate_tables_id = [table_names_original.index(table_name) for table_name in candidate_tables] + + assert -1 not in candidate_tables_id + table_names_original = schema["table_names_original"] + + return candidate_tables_id, table_names_original + + +def get_surface_form_orig(format_sql_2, schema): + column_names_surface_form = [] + column_names_surface_form_original = [] + + column_names_original = schema["column_names_original"] + table_names_original = schema["table_names_original"] + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + # this is just * + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + column_names_surface_form_original.append(column_name_surface_form) + + # also add table_name.* + for table_name in table_names_original: + column_names_surface_form.append("{}.*".format(table_name.lower())) + column_names_surface_form_original.append("{}.*".format(table_name)) + + assert len(column_names_surface_form) == len(column_names_surface_form_original) + for surface_form, surface_form_original in zip(column_names_surface_form, column_names_surface_form_original): + format_sql_2 = format_sql_2.replace(surface_form, surface_form_original) + + return format_sql_2 + + +def add_from_clase(sub_sql, from_clause): + select_right_sub_sql = [] + left_sub_sql = [] + left = True + num_left_parathesis = 0 # in select_right_sub_sql + num_right_parathesis = 0 # in select_right_sub_sql + tokens = sub_sql.split() + for ii, token in enumerate(tokens): + if token == "select": + left = False + if left: + left_sub_sql.append(token) + continue + select_right_sub_sql.append(token) + if token == "(": + num_left_parathesis += 1 + elif token == ")": + num_right_parathesis += 1 + + def remove_missing_tables_from_select(select_statement): + tokens = select_statement.split(",") + + stop_idx = -1 + for i in range(len(tokens)): + idx = len(tokens) - 1 - i + token = tokens[idx] + if ".*" in token and "count " not in token: + pass + else: + stop_idx = idx + 1 + break + + if stop_idx > 0: + new_select_statement = ",".join(tokens[:stop_idx]).strip() + else: + new_select_statement = select_statement + + return new_select_statement + + if num_left_parathesis == num_right_parathesis or num_left_parathesis > num_right_parathesis: + sub_sqls = [] + sub_sqls.append(remove_missing_tables_from_select(sub_sql)) + sub_sqls.append(from_clause) + else: + assert num_left_parathesis < num_right_parathesis + select_sub_sql = [] + right_sub_sql = [] + for i in range(len(select_right_sub_sql)): + token_idx = len(select_right_sub_sql) - 1 - i + token = select_right_sub_sql[token_idx] + if token == ")": + num_right_parathesis -= 1 + if num_right_parathesis == num_left_parathesis: + select_sub_sql = select_right_sub_sql[:token_idx] + right_sub_sql = select_right_sub_sql[token_idx:] + break + + sub_sqls = [] + + if len(left_sub_sql) > 0: + sub_sqls.append(" ".join(left_sub_sql)) + if 
len(select_sub_sql) > 0: + new_select_statement = remove_missing_tables_from_select(" ".join(select_sub_sql)) + sub_sqls.append(new_select_statement) + + sub_sqls.append(from_clause) + + if len(right_sub_sql) > 0: + sub_sqls.append(" ".join(right_sub_sql)) + + return sub_sqls + + +def postprocess_single(format_sql_2, schema, start_alias_id=0): + candidate_tables_id, table_names_original = get_candidate_tables(format_sql_2, schema) + format_sql_2 = get_surface_form_orig(format_sql_2, schema) + + if len(candidate_tables_id) == 0: + final_sql = format_sql_2.replace("\n", " ") + elif len(candidate_tables_id) == 1: + # easy case + table_name = table_names_original[candidate_tables_id[0]] + from_clause = "from {}".format(table_name) + format_sql_3 = [] + for sub_sql in format_sql_2.split("\n"): + if "select" in sub_sql: + format_sql_3 += add_from_clase(sub_sql, from_clause) + else: + format_sql_3.append(sub_sql) + final_sql = " ".join(format_sql_3).replace("{}.".format(table_name), "") + else: + # more than 1 candidate_tables + table_alias_dict, ret = gen_from(candidate_tables_id, schema) + + from_clause = ret + for i in range(len(table_alias_dict)): + from_clause = from_clause.replace("T{}".format(i + 1), "T{}".format(i + 1 + start_alias_id)) + + table_name_to_alias = {} + for table_id, alias_id in table_alias_dict.items(): + table_name = table_names_original[table_id] + alias = "T{}".format(alias_id + start_alias_id) + table_name_to_alias[table_name] = alias + start_alias_id = start_alias_id + len(table_alias_dict) + + format_sql_3 = [] + for sub_sql in format_sql_2.split("\n"): + if "select" in sub_sql: + format_sql_3 += add_from_clase(sub_sql, from_clause) + else: + format_sql_3.append(sub_sql) + format_sql_3 = " ".join(format_sql_3) + + for table_name, alias in table_name_to_alias.items(): + format_sql_3 = format_sql_3.replace("{}.".format(table_name), "{}.".format(alias)) + + final_sql = format_sql_3 + + for i in range(5): + final_sql = final_sql.replace("select count ( T{}.* ) ".format(i), "select count ( * ) ") + final_sql = final_sql.replace("count ( T{}.* ) from ".format(i), "count ( * ) from ") + final_sql = final_sql.replace("order by count ( T{}.* ) ".format(i), "order by count ( * ) ") + final_sql = final_sql.replace("having count ( T{}.* ) ".format(i), "having count ( * ) ") + + return final_sql, start_alias_id + + +def postprocess_nested(format_sql_2, schema): + candidate_tables_id, table_names_original = get_candidate_tables(format_sql_2, schema) + if len(candidate_tables_id) == 1: + format_sql_2 = get_surface_form_orig(format_sql_2, schema) + # easy case + table_name = table_names_original[candidate_tables_id[0]] + from_clause = "from {}".format(table_name) + format_sql_3 = [] + for sub_sql in format_sql_2.split("\n"): + if "select" in sub_sql: + format_sql_3 += add_from_clase(sub_sql, from_clause) + else: + format_sql_3.append(sub_sql) + final_sql = " ".join(format_sql_3).replace("{}.".format(table_name), "") + else: + # case 1: easy case, except / union / intersect + # case 2: nested queries in condition + final_sql = [] + + def postprocess_subquery(sub_query_one, schema, start_alias_id_1): + final_sub_sql = [] + sub_query = [] + for sub_sql in sub_query_one.split("\n"): + if "select" in sub_sql: + if len(sub_query) > 0: + sub_query = "\n".join(sub_query) + sub_query, start_alias_id_1 = postprocess_single(sub_query, schema, start_alias_id_1) + final_sub_sql.append(sub_query) + sub_query = [] + sub_query.append(sub_sql) + else: + sub_query.append(sub_sql) + if len(sub_query) > 
0: + sub_query = "\n".join(sub_query) + sub_query, start_alias_id_1 = postprocess_single(sub_query, schema, start_alias_id_1) + final_sub_sql.append(sub_query) + + final_sub_sql = " ".join(final_sub_sql) + return final_sub_sql, False, start_alias_id_1 + + start_alias_id = 0 + sub_query = [] + for sub_sql in format_sql_2.split("\n"): + if "except" in sub_sql or "union" in sub_sql or "intersect" in sub_sql: + sub_query = "\n".join(sub_query) + sub_query, _, start_alias_id = postprocess_subquery(sub_query, schema, start_alias_id) + final_sql.append(sub_query) + final_sql.append(sub_sql) + sub_query = [] + else: + sub_query.append(sub_sql) + if len(sub_query) > 0: + sub_query = "\n".join(sub_query) + sub_query, _, start_alias_id = postprocess_subquery(sub_query, schema, start_alias_id) + final_sql.append(sub_query) + + final_sql = " ".join(final_sql) + + # special case of from a subquery + final_sql = final_sql.replace("select count ( * ) (", "select count ( * ) from (") + + return final_sql + + +def postprocess_one(pred_sql, schema): + pred_sql = ( + pred_sql.replace("group_by", "group by") + .replace("order_by", "order by") + .replace("limit_value", "limit 1") + .replace("_EOS", "") + .replace(" value ", " 1 ") + .replace("distinct", "") + .strip(",") + .strip() + ) + if pred_sql.endswith("value"): + pred_sql = pred_sql[: -len("value")] + "1" + + try: + format_sql = sqlparse.format(pred_sql, reindent=True) + except Exception: + return pred_sql + format_sql_2 = normalize_space(format_sql) + + num_select = format_sql_2.count("select") + + if num_select > 1: + final_sql = postprocess_nested(format_sql_2, schema) + else: + final_sql, _ = postprocess_single(format_sql_2, schema) + + return final_sql + + +def postprocess(predictions, database_schema, remove_from=False): + correct = 0 + total = 0 + postprocess_sqls = {} + + for pred in predictions: + db_id = pred["database_id"] + schema = database_schema[db_id] + if db_id not in postprocess_sqls: + postprocess_sqls[db_id] = [] + + interaction_id = pred["interaction_id"] + turn_id = pred["index_in_interaction"] + total += 1 + + pred_sql_str = " ".join(pred["flat_prediction"]) + + gold_sql_str = " ".join(pred["flat_gold_queries"][0]) + if pred_sql_str == gold_sql_str: + correct += 1 + + postprocess_sql = pred_sql_str + if remove_from: + postprocess_sql = postprocess_one(pred_sql_str, schema) + + postprocess_sqls[db_id].append((postprocess_sql, interaction_id, turn_id)) + + # print (correct, total, float(correct)/total) + return postprocess_sqls + + +def read_prediction(pred_file): + print("Read prediction from", pred_file) + predictions = [] + with open(pred_file) as f: + for line in f: + pred = json.loads(line) + predictions.append(pred) + print("Number of predictions", len(predictions)) + return predictions + + +def read_schema(table_schema_path): + with open(table_schema_path) as f: + database_schema = json.load(f) + + database_schema_dict = {} + for table_schema in database_schema: + db_id = table_schema["db_id"] + database_schema_dict[db_id] = table_schema + + return database_schema_dict + + +def write_and_evaluate(postprocess_sqls, db_path, table_schema_path, gold_path, dataset): + db_list = [] + with open(gold_path) as f: + for line in f: + line_split = line.strip().split("\t") + if len(line_split) != 2: + continue + db = line.strip().split("\t")[1] + if db not in db_list: + db_list.append(db) + + output_file = "output_temp.txt" + if dataset == "spider": + with open(output_file, "w") as f: + for db in db_list: + for postprocess_sql, 
interaction_id, turn_id in postprocess_sqls[db]: + f.write(postprocess_sql + "\n") + + command = "python3 eval_scripts/evaluation.py --db {} --table {} --etype match --gold {} --pred {}".format( + db_path, table_schema_path, gold_path, os.path.abspath(output_file) + ) + elif dataset in ["sparc", "cosql"]: + cnt = 0 + with open(output_file, "w") as f: + for db in db_list: + for postprocess_sql, interaction_id, turn_id in postprocess_sqls[db]: + if turn_id == 0 and cnt > 0: + f.write("\n") + f.write("{}\n".format(postprocess_sql)) + cnt += 1 + + command = "python eval_scripts/evaluation_sqa.py --db {} --table {} --etype match --gold {} --pred {}".format( + db_path, table_schema_path, gold_path, os.path.abspath(output_file) + ) + command += "; rm output_temp.txt" + return command + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--dataset", choices=("spider", "sparc", "cosql"), default="sparc") + parser.add_argument("--split", type=str, default="dev") + parser.add_argument("--pred_file", type=str, default="") + parser.add_argument("--remove_from", action="store_true", default=False) + args = parser.parse_args() + + db_path = "data/database/" + if args.dataset == "spider": + table_schema_path = "data/spider/tables.json" + if args.split == "dev": + gold_path = "data/spider/dev_gold.sql" + elif args.dataset == "sparc": + table_schema_path = "data/sparc/tables.json" + if args.split == "dev": + gold_path = "data/sparc/dev_gold.txt" + elif args.dataset == "cosql": + table_schema_path = "data/cosql/tables.json" + if args.split == "dev": + gold_path = "data/cosql/dev_gold.txt" + + pred_file = args.pred_file + + database_schema = read_schema(table_schema_path) + predictions = read_prediction(pred_file) + postprocess_sqls = postprocess(predictions, database_schema, args.remove_from) + + command = write_and_evaluate(postprocess_sqls, db_path, table_schema_path, gold_path, args.dataset) + + eval_output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True) + with open(pred_file + ".eval", "w") as f: + f.write(eval_output.decode("utf-8")) + print("Eval result in", pred_file + ".eval") diff --git a/examples/text_to_sql/IGSQL/preprocess.py b/examples/text_to_sql/IGSQL/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..9fceaf9e9c3fcc401b1acda6643e215943d8696e --- /dev/null +++ b/examples/text_to_sql/IGSQL/preprocess.py @@ -0,0 +1,759 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
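+
+"""Preprocesses the raw text-to-SQL data into the interaction format (JSON and pickle) used for training."""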
+ +import argparse +import json +import os +import pickle +import shutil + +import sqlparse +from postprocess_eval import get_candidate_tables + + +def write_interaction(interaction_list, split, output_dir): + json_split = os.path.join(output_dir, split + ".json") + pkl_split = os.path.join(output_dir, split + ".pkl") + + with open(json_split, "w") as outfile: + for interaction in interaction_list: + json.dump(interaction, outfile, indent=4) + outfile.write("\n") + + new_objs = [] + for i, obj in enumerate(interaction_list): + new_interaction = [] + for ut in obj["interaction"]: + sql = ut["sql"] + sqls = [sql] + tok_sql_list = [] + for sql in sqls: + results = [] + tokenized_sql = sql.split() + tok_sql_list.append((tokenized_sql, results)) + ut["sql"] = tok_sql_list + new_interaction.append(ut) + obj["interaction"] = new_interaction + new_objs.append(obj) + + with open(pkl_split, "wb") as outfile: + pickle.dump(new_objs, outfile) + + return + + +def read_database_schema(database_schema_filename, schema_tokens, column_names, database_schemas_dict): + with open(database_schema_filename) as f: + database_schemas = json.load(f) + + def get_schema_tokens(table_schema): + column_names_surface_form = [] + column_names = [] + column_names_original = table_schema["column_names_original"] + table_names_original = table_schema["table_names_original"] + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + # this is just * + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + column_names.append(column_name.lower()) + + # also add table_name.* + for table_name in table_names_original: + column_names_surface_form.append("{}.*".format(table_name.lower())) + + return column_names_surface_form, column_names + + for table_schema in database_schemas: + database_id = table_schema["db_id"] + database_schemas_dict[database_id] = table_schema + schema_tokens[database_id], column_names[database_id] = get_schema_tokens(table_schema) + + return schema_tokens, column_names, database_schemas_dict + + +def remove_from_with_join(format_sql_2): + used_tables_list = [] + format_sql_3 = [] + table_to_name = {} + table_list = [] + old_table_to_name = {} + old_table_list = [] + for sub_sql in format_sql_2.split("\n"): + if "select " in sub_sql: + # only replace alias: t1 -> table_name, t2 -> table_name, etc... 
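+            # Apply the alias map collected from the previous from clause to the lines
+            # gathered so far before this new sub-select resets it.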
+ if len(table_list) > 0: + for i in range(len(format_sql_3)): + for table, name in table_to_name.items(): + format_sql_3[i] = format_sql_3[i].replace(table, name) + + old_table_list = table_list + old_table_to_name = table_to_name + table_to_name = {} + table_list = [] + format_sql_3.append(sub_sql) + elif sub_sql.startswith("from"): + new_sub_sql = None + sub_sql_tokens = sub_sql.split() + for t_i, t in enumerate(sub_sql_tokens): + if t == "as": + table_to_name[sub_sql_tokens[t_i + 1]] = sub_sql_tokens[t_i - 1] + table_list.append(sub_sql_tokens[t_i - 1]) + elif t == ")" and new_sub_sql is None: + # new_sub_sql keeps some trailing parts after ')' + new_sub_sql = " ".join(sub_sql_tokens[t_i:]) + if len(table_list) > 0: + # if it's a from clause with join + if new_sub_sql is not None: + format_sql_3.append(new_sub_sql) + + used_tables_list.append(table_list) + else: + # if it's a from clause without join + table_list = old_table_list + table_to_name = old_table_to_name + assert "join" not in sub_sql + if new_sub_sql is not None: + sub_sub_sql = sub_sql[: -len(new_sub_sql)].strip() + assert len(sub_sub_sql.split()) == 2 + used_tables_list.append([sub_sub_sql.split()[1]]) + format_sql_3.append(sub_sub_sql) + format_sql_3.append(new_sub_sql) + elif "join" not in sub_sql: + assert len(sub_sql.split()) == 2 or len(sub_sql.split()) == 1 + if len(sub_sql.split()) == 2: + used_tables_list.append([sub_sql.split()[1]]) + + format_sql_3.append(sub_sql) + else: + print("bad from clause in remove_from_with_join") + exit() + else: + format_sql_3.append(sub_sql) + + if len(table_list) > 0: + for i in range(len(format_sql_3)): + for table, name in table_to_name.items(): + format_sql_3[i] = format_sql_3[i].replace(table, name) + + used_tables = [] + for t in used_tables_list: + for tt in t: + used_tables.append(tt) + used_tables = list(set(used_tables)) + + return format_sql_3, used_tables, used_tables_list + + +def remove_from_without_join(format_sql_3, column_names, schema_tokens): + format_sql_4 = [] + table_name = None + for sub_sql in format_sql_3.split("\n"): + if "select " in sub_sql: + if table_name: + for i in range(len(format_sql_4)): + tokens = format_sql_4[i].split() + for ii, token in enumerate(tokens): + if token in column_names and tokens[ii - 1] != ".": + if ( + ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(" + ) or ii + 1 == len(tokens): + if "{}.{}".format(table_name, token) in schema_tokens: + tokens[ii] = "{} . {}".format(table_name, token) + format_sql_4[i] = " ".join(tokens) + + format_sql_4.append(sub_sql) + elif sub_sql.startswith("from"): + sub_sql_tokens = sub_sql.split() + if len(sub_sql_tokens) == 1: + table_name = None + elif len(sub_sql_tokens) == 2: + table_name = sub_sql_tokens[1] + else: + print("bad from clause in remove_from_without_join") + print(format_sql_3) + exit() + else: + format_sql_4.append(sub_sql) + + if table_name: + for i in range(len(format_sql_4)): + tokens = format_sql_4[i].split() + for ii, token in enumerate(tokens): + if token in column_names and tokens[ii - 1] != ".": + if (ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(") or ii + 1 == len( + tokens + ): + if "{}.{}".format(table_name, token) in schema_tokens: + tokens[ii] = "{} . 
{}".format(table_name, token) + format_sql_4[i] = " ".join(tokens) + + return format_sql_4 + + +def add_table_name(format_sql_3, used_tables, column_names, schema_tokens): + # If just one table used, easy case, replace all column_name -> table_name.column_name + if len(used_tables) == 1: + table_name = used_tables[0] + format_sql_4 = [] + for sub_sql in format_sql_3.split("\n"): + if sub_sql.startswith("from"): + format_sql_4.append(sub_sql) + continue + + tokens = sub_sql.split() + for ii, token in enumerate(tokens): + if token in column_names and tokens[ii - 1] != ".": + if (ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(") or ii + 1 == len( + tokens + ): + if "{}.{}".format(table_name, token) in schema_tokens: + tokens[ii] = "{} . {}".format(table_name, token) + format_sql_4.append(" ".join(tokens)) + return format_sql_4 + + def get_table_name_for(token): + table_names = [] + for table_name in used_tables: + if "{}.{}".format(table_name, token) in schema_tokens: + table_names.append(table_name) + if len(table_names) == 0: + return "table" + if len(table_names) > 1: + return None + else: + return table_names[0] + + format_sql_4 = [] + for sub_sql in format_sql_3.split("\n"): + if sub_sql.startswith("from"): + format_sql_4.append(sub_sql) + continue + + tokens = sub_sql.split() + for ii, token in enumerate(tokens): + # skip * + if token == "*": + continue + if token in column_names and tokens[ii - 1] != ".": + if (ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(") or ii + 1 == len(tokens): + table_name = get_table_name_for(token) + if table_name: + tokens[ii] = "{} . {}".format(table_name, token) + format_sql_4.append(" ".join(tokens)) + + return format_sql_4 + + +def check_oov(format_sql_final, output_vocab, schema_tokens): + for sql_tok in format_sql_final.split(): + if not (sql_tok in schema_tokens or sql_tok in output_vocab): + print("OOV!", sql_tok) + raise Exception("OOV") + + +def normalize_space(format_sql): + format_sql_1 = [ + " ".join( + sub_sql.strip().replace(",", " , ").replace(".", " . ").replace("(", " ( ").replace(")", " ) ").split() + ) + for sub_sql in format_sql.split("\n") + ] + format_sql_1 = "\n".join(format_sql_1) + + format_sql_2 = ( + format_sql_1.replace("\njoin", " join") + .replace(",\n", ", ") + .replace(" where", "\nwhere") + .replace(" intersect", "\nintersect") + .replace("\nand", " and") + .replace("order by t2 .\nstart desc", "order by t2 . start desc") + ) + + format_sql_2 = ( + format_sql_2.replace("select\noperator", "select operator") + .replace("select\nconstructor", "select constructor") + .replace("select\nstart", "select start") + .replace("select\ndrop", "select drop") + .replace("select\nwork", "select work") + .replace("select\ngroup", "select group") + .replace("select\nwhere_built", "select where_built") + .replace("select\norder", "select order") + .replace("from\noperator", "from operator") + .replace("from\nforward", "from forward") + .replace("from\nfor", "from for") + .replace("from\ndrop", "from drop") + .replace("from\norder", "from order") + .replace(".\nstart", ". start") + .replace(".\norder", ". order") + .replace(".\noperator", ". operator") + .replace(".\nsets", ". sets") + .replace(".\nwhere_built", ". where_built") + .replace(".\nwork", ". work") + .replace(".\nconstructor", ". constructor") + .replace(".\ngroup", ". group") + .replace(".\nfor", ". for") + .replace(".\ndrop", ". drop") + .replace(".\nwhere", ". 
where") + ) + + format_sql_2 = ( + format_sql_2.replace("group by", "group_by") + .replace("order by", "order_by") + .replace("! =", "!=") + .replace("limit value", "limit_value") + ) + return format_sql_2 + + +def normalize_final_sql(format_sql_5): + format_sql_final = ( + format_sql_5.replace("\n", " ") + .replace(" . ", ".") + .replace("group by", "group_by") + .replace("order by", "order_by") + .replace("! =", "!=") + .replace("limit value", "limit_value") + ) + + # normalize two bad sqls + if "t1" in format_sql_final or "t2" in format_sql_final or "t3" in format_sql_final or "t4" in format_sql_final: + format_sql_final = format_sql_final.replace("t2.dormid", "dorm.dormid") + + # This is the failure case of remove_from_without_join() + format_sql_final = format_sql_final.replace( + "select city.city_name where city.state_name in ( select state.state_name where state.state_name in ( select river.traverse where river.river_name = value ) and state.area = ( select min ( state.area ) where state.state_name in ( select river.traverse where river.river_name = value ) ) ) order_by population desc limit_value", + "select city.city_name where city.state_name in ( select state.state_name where state.state_name in ( select river.traverse where river.river_name = value ) and state.area = ( select min ( state.area ) where state.state_name in ( select river.traverse where river.river_name = value ) ) ) order_by city.population desc limit_value", + ) + + return format_sql_final + + +def parse_sql(sql_string, db_id, column_names, output_vocab, schema_tokens, schema): + format_sql = sqlparse.format(sql_string, reindent=True) + format_sql_2 = normalize_space(format_sql) + + format_sql_3, used_tables, used_tables_list = remove_from_with_join(format_sql_2) + + format_sql_3 = "\n".join(format_sql_3) + format_sql_4 = add_table_name(format_sql_3, used_tables, column_names, schema_tokens) + + format_sql_4 = "\n".join(format_sql_4) + format_sql_5 = remove_from_without_join(format_sql_4, column_names, schema_tokens) + + format_sql_5 = "\n".join(format_sql_5) + format_sql_final = normalize_final_sql(format_sql_5) + + candidate_tables_id, table_names_original = get_candidate_tables(format_sql_final, schema) + + check_oov(format_sql_final, output_vocab, schema_tokens) + + return format_sql_final + + +def read_spider_split( + split_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from +): + with open(split_json) as f: + split_data = json.load(f) + print("read_spider_split", split_json, len(split_data)) + + for i, ex in enumerate(split_data): + db_id = ex["db_id"] + + final_sql = [] + skip = False + for query_tok in ex["query_toks_no_value"]: + if query_tok != "." and "." in query_tok: + # invalid sql; didn't use table alias in join + final_sql += query_tok.replace(".", " . 
").split() + skip = True + else: + final_sql.append(query_tok) + final_sql = " ".join(final_sql) + + if skip and "train" in split_json: + continue + + if remove_from: + final_sql_parse = parse_sql( + final_sql, db_id, column_names[db_id], output_vocab, schema_tokens[db_id], database_schemas[db_id] + ) + else: + final_sql_parse = final_sql + + final_utterance = " ".join(ex["question_toks"]) + + if db_id not in interaction_list: + interaction_list[db_id] = [] + + interaction = {} + interaction["id"] = "" + interaction["scenario"] = "" + interaction["database_id"] = db_id + interaction["interaction_id"] = len(interaction_list[db_id]) + interaction["final"] = {} + interaction["final"]["utterance"] = final_utterance + interaction["final"]["sql"] = final_sql_parse + interaction["interaction"] = [{"utterance": final_utterance, "sql": final_sql_parse}] + interaction_list[db_id].append(interaction) + + return interaction_list + + +def read_data_json( + split_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from +): + with open(split_json) as f: + split_data = json.load(f) + print("read_data_json", split_json, len(split_data)) + + for interaction_data in split_data: + db_id = interaction_data["database_id"] + final_sql = interaction_data["final"]["query"] + final_utterance = interaction_data["final"]["utterance"] + + if db_id not in interaction_list: + interaction_list[db_id] = [] + + # no interaction_id in train + if "interaction_id" in interaction_data["interaction"]: + interaction_id = interaction_data["interaction"]["interaction_id"] + else: + interaction_id = len(interaction_list[db_id]) + + interaction = {} + interaction["id"] = "" + interaction["scenario"] = "" + interaction["database_id"] = db_id + interaction["interaction_id"] = interaction_id + interaction["final"] = {} + interaction["final"]["utterance"] = final_utterance + interaction["final"]["sql"] = final_sql + interaction["interaction"] = [] + + for turn in interaction_data["interaction"]: + turn_sql = [] + skip = False + for query_tok in turn["query_toks_no_value"]: + if query_tok != "." and "." in query_tok: + # invalid sql; didn't use table alias in join + turn_sql += query_tok.replace(".", " . ").split() + skip = True + else: + turn_sql.append(query_tok) + turn_sql = " ".join(turn_sql) + + # Correct some human sql annotation error + turn_sql = turn_sql.replace( + "select f_id from files as t1 join song as t2 on t1 . f_id = t2 . f_id", + "select t1 . f_id from files as t1 join song as t2 on t1 . f_id = t2 . f_id", + ) + turn_sql = turn_sql.replace("select name from climber mountain", "select name from climber") + turn_sql = turn_sql.replace( + "select sid from sailors as t1 join reserves as t2 on t1 . sid = t2 . sid join boats as t3 on t3 . bid = t2 . bid", + "select t1 . sid from sailors as t1 join reserves as t2 on t1 . sid = t2 . sid join boats as t3 on t3 . bid = t2 . 
bid", + ) + turn_sql = turn_sql.replace("select avg ( price ) from goods )", "select avg ( price ) from goods") + turn_sql = turn_sql.replace( + "select min ( annual_fuel_cost ) , from vehicles", "select min ( annual_fuel_cost ) from vehicles" + ) + turn_sql = turn_sql.replace( + "select * from goods where price < ( select avg ( price ) from goods", + "select * from goods where price < ( select avg ( price ) from goods )", + ) + turn_sql = turn_sql.replace( + "select distinct id , price from goods where price < ( select avg ( price ) from goods", + "select distinct id , price from goods where price < ( select avg ( price ) from goods )", + ) + turn_sql = turn_sql.replace( + "select id from goods where price > ( select avg ( price ) from goods", + "select id from goods where price > ( select avg ( price ) from goods )", + ) + + if skip and "train" in split_json: + continue + + if remove_from: + try: + turn_sql_parse = parse_sql( + turn_sql, + db_id, + column_names[db_id], + output_vocab, + schema_tokens[db_id], + database_schemas[db_id], + ) + except Exception: + print("continue") + continue + else: + turn_sql_parse = turn_sql + + if "utterance_toks" in turn: + turn_utterance = " ".join(turn["utterance_toks"]) # not lower() + else: + turn_utterance = turn["utterance"] + + interaction["interaction"].append({"utterance": turn_utterance, "sql": turn_sql_parse}) + if len(interaction["interaction"]) > 0: + interaction_list[db_id].append(interaction) + + return interaction_list + + +def read_spider(spider_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from): + interaction_list = {} + + train_json = os.path.join(spider_dir, "train.json") + interaction_list = read_spider_split( + train_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + dev_json = os.path.join(spider_dir, "dev.json") + interaction_list = read_spider_split( + dev_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + return interaction_list + + +def read_sparc(sparc_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from): + interaction_list = {} + + train_json = os.path.join(sparc_dir, "train_no_value.json") + interaction_list = read_data_json( + train_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + dev_json = os.path.join(sparc_dir, "dev_no_value.json") + interaction_list = read_data_json( + dev_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + return interaction_list + + +def read_cosql(cosql_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from): + interaction_list = {} + + train_json = os.path.join(cosql_dir, "train.json") + interaction_list = read_data_json( + train_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + dev_json = os.path.join(cosql_dir, "dev.json") + interaction_list = read_data_json( + dev_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + return interaction_list + + +def read_db_split(data_dir): + train_database = [] + with open(os.path.join(data_dir, "train_db_ids.txt")) as f: + for line in f: + train_database.append(line.strip()) + + dev_database = [] + with open(os.path.join(data_dir, "dev_db_ids.txt")) as f: + for line in f: + dev_database.append(line.strip()) + + return train_database, dev_database + + +def 
preprocess(dataset, remove_from=False): + # Validate output_vocab + output_vocab = [ + "_UNK", + "_EOS", + ".", + "t1", + "t2", + "=", + "select", + "from", + "as", + "value", + "join", + "on", + ")", + "(", + "where", + "t3", + "by", + ",", + "count", + "group", + "order", + "distinct", + "t4", + "and", + "limit", + "desc", + ">", + "avg", + "having", + "max", + "in", + "<", + "sum", + "t5", + "intersect", + "not", + "min", + "except", + "or", + "asc", + "like", + "!", + "union", + "between", + "t6", + "-", + "t7", + "+", + "/", + ] + if remove_from: + output_vocab = [ + "_UNK", + "_EOS", + "=", + "select", + "value", + ")", + "(", + "where", + ",", + "count", + "group_by", + "order_by", + "distinct", + "and", + "limit_value", + "limit", + "desc", + ">", + "avg", + "having", + "max", + "in", + "<", + "sum", + "intersect", + "not", + "min", + "except", + "or", + "asc", + "like", + "!=", + "union", + "between", + "-", + "+", + "/", + ] + print("size of output_vocab", len(output_vocab)) + print("output_vocab", output_vocab) + print() + + if dataset == "spider": + spider_dir = "data/spider/" + database_schema_filename = "data/spider/tables.json" + output_dir = "data/spider_data" + if remove_from: + output_dir = "data/spider_data_removefrom" + train_database, dev_database = read_db_split(spider_dir) + elif dataset == "sparc": + sparc_dir = "data/sparc/" + database_schema_filename = "data/sparc/tables.json" + output_dir = "data/sparc_data" + if remove_from: + output_dir = "data/sparc_data_removefrom" + train_database, dev_database = read_db_split(sparc_dir) + elif dataset == "cosql": + cosql_dir = "data/cosql/" + database_schema_filename = "data/cosql/tables.json" + output_dir = "data/cosql_data" + if remove_from: + output_dir = "data/cosql_data_removefrom" + train_database, dev_database = read_db_split(cosql_dir) + + if os.path.isdir(output_dir): + shutil.rmtree(output_dir) + os.mkdir(output_dir) + + schema_tokens = {} + column_names = {} + database_schemas = {} + + print("Reading spider database schema file") + schema_tokens, column_names, database_schemas = read_database_schema( + database_schema_filename, schema_tokens, column_names, database_schemas + ) + num_database = len(schema_tokens) + print("num_database", num_database, len(train_database), len(dev_database)) + print("total number of schema_tokens / databases:", len(schema_tokens)) + + output_database_schema_filename = os.path.join(output_dir, "tables.json") + with open(output_database_schema_filename, "w") as outfile: + json.dump([v for k, v in database_schemas.items()], outfile, indent=4) + + if dataset == "spider": + interaction_list = read_spider( + spider_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + elif dataset == "sparc": + interaction_list = read_sparc( + sparc_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + elif dataset == "cosql": + interaction_list = read_cosql( + cosql_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + print("interaction_list length", len(interaction_list)) + + train_interaction = [] + for database_id in interaction_list: + if database_id not in dev_database: + train_interaction += interaction_list[database_id] + + dev_interaction = [] + for database_id in dev_database: + if database_id in interaction_list.keys(): + dev_interaction += interaction_list[database_id] + + print("train interaction: ", len(train_interaction)) + print("dev interaction: ", len(dev_interaction)) + + 
write_interaction(train_interaction, "train", output_dir) + write_interaction(dev_interaction, "dev", output_dir) + + return + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--dataset", choices=("spider", "sparc", "cosql"), default="sparc") + parser.add_argument("--remove_from", action="store_true", default=False) + args = parser.parse_args() + preprocess(args.dataset, args.remove_from) diff --git a/examples/text_to_sql/IGSQL/requirements.txt b/examples/text_to_sql/IGSQL/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..07db382b82e06152829025cf2d1008adb660e19f --- /dev/null +++ b/examples/text_to_sql/IGSQL/requirements.txt @@ -0,0 +1,4 @@ +sqlparse +pymysql +progressbar +nltk diff --git a/examples/text_to_sql/IGSQL/run.py b/examples/text_to_sql/IGSQL/run.py new file mode 100644 index 0000000000000000000000000000000000000000..bf9923c5dcd25f1dec039c377b9796890a2d2192 --- /dev/null +++ b/examples/text_to_sql/IGSQL/run.py @@ -0,0 +1,335 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains a main function for training and/or evaluating a model.""" + +import os +import sys + +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +import random # noqa: E402 + +import numpy as np # noqa: E402 +import paddle # noqa: E402 +from data_util import atis_data # noqa: E402 +from logger import Logger # noqa: E402 +from model.schema_interaction_model import SchemaInteractionATISModel # noqa: E402 +from model_util import ( # noqa: E402 + Metrics, + evaluate_interaction_sample, + evaluate_using_predicted_queries, + evaluate_utterance_sample, + train_epoch_with_interactions, + train_epoch_with_utterances, +) +from parse_args import interpret_args # noqa: E402 + +np.random.seed(0) +np.set_printoptions(16) +random.seed(0) + +VALID_EVAL_METRICS = [Metrics.LOSS, Metrics.TOKEN_ACCURACY, Metrics.STRING_ACCURACY] +TRAIN_EVAL_METRICS = [Metrics.LOSS, Metrics.TOKEN_ACCURACY, Metrics.STRING_ACCURACY] +FINAL_EVAL_METRICS = [Metrics.STRING_ACCURACY, Metrics.TOKEN_ACCURACY] + + +def train(model, data, params): + """Trains a model. + + Args: + model (ATISModel): The model to train. + data (ATISData): The data that is used to train. + params (namespace): Training parameters. + """ + # Get the training batches. 
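+    # When params.interaction_level is set, batch size is forced to 1 and the
+    # sampling/evaluation helpers below are swapped for their interaction-level
+    # counterparts, so a "batch" is one full dialogue rather than a single utterance.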
+ log = Logger(os.path.join(params.logdir, params.logfile), "w") + num_train_original = atis_data.num_utterances(data.train_data) + log.put("Original number of training utterances:\t" + str(num_train_original)) + + eval_fn = evaluate_utterance_sample + trainbatch_fn = data.get_utterance_batches + trainsample_fn = data.get_random_utterances + validsample_fn = data.get_all_utterances + batch_size = params.batch_size + if params.interaction_level: + batch_size = 1 + eval_fn = evaluate_interaction_sample + trainbatch_fn = data.get_interaction_batches + trainsample_fn = data.get_random_interactions + validsample_fn = data.get_all_interactions + + maximum_output_length = params.train_maximum_sql_length + train_batches = trainbatch_fn( + batch_size, max_output_length=maximum_output_length, randomize=not params.deterministic + ) + + if params.num_train >= 0: + train_batches = train_batches[: params.num_train] + + training_sample = trainsample_fn(params.train_evaluation_size, max_output_length=maximum_output_length) + valid_examples = validsample_fn(data.valid_data, max_output_length=maximum_output_length) + + num_train_examples = sum([len(batch) for batch in train_batches]) + num_steps_per_epoch = len(train_batches) + + log.put("Actual number of used training examples:\t" + str(num_train_examples)) + log.put("(Shortened by output limit of " + str(maximum_output_length) + ")") + log.put("Number of steps per epoch:\t" + str(num_steps_per_epoch)) + log.put("Batch size:\t" + str(batch_size)) + + print("Kept " + str(num_train_examples) + "/" + str(num_train_original) + " examples") + print("Batch size of " + str(batch_size) + " gives " + str(num_steps_per_epoch) + " steps per epoch") + + # Keeping track of things during training. + epochs = 0 + patience = params.initial_patience + learning_rate_coefficient = 1.0 + previous_epoch_loss = float("inf") + previous_valid_acc = 0.0 + maximum_string_accuracy = 0.0 + + countdown = int(patience) + + keep_training = True + step = 0 + + # init learning_rate + model.set_learning_rate(params.initial_learning_rate) + + while keep_training: + log.put("Epoch:\t" + str(epochs)) + model.set_dropout(params.dropout_amount) + model.train() + + if not params.scheduler: + model.set_learning_rate(learning_rate_coefficient * params.initial_learning_rate) + + # Run a training step. + if params.interaction_level: + epoch_loss, step = train_epoch_with_interactions( + train_batches, + params, + model, + randomize=not params.deterministic, + db2id=data.db2id, + id2db=data.id2db, + step=step, + ) + else: + epoch_loss = train_epoch_with_utterances(train_batches, model, randomize=not params.deterministic) + + log.put("train epoch loss:\t" + str(epoch_loss)) + + model.set_dropout(0.0) + + model.eval() + + with paddle.no_grad(): + + # Run an evaluation step on a sample of the training data. + train_eval_results = eval_fn( + training_sample, + model, + params.train_maximum_sql_length, + name=os.path.join(params.logdir, "train-eval"), + write_results=True, + gold_forcing=True, + metrics=TRAIN_EVAL_METRICS, + )[0] + + for name, value in train_eval_results.items(): + log.put("train final gold-passing " + name.name + ":\t" + "%.2f" % value) + + # Run an evaluation step on the validation set. 
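+            # This validation pass also runs with gold_forcing=True ("gold-passing"
+            # in the log). Its loss/token accuracy drive the learning-rate decay
+            # below, and its string accuracy controls patience and best-model saving.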
+ valid_eval_results = eval_fn( + valid_examples, + model, + params.eval_maximum_sql_length, + name=os.path.join(params.logdir, "valid-eval"), + write_results=True, + gold_forcing=True, + metrics=VALID_EVAL_METRICS, + )[0] + for name, value in valid_eval_results.items(): + log.put("valid gold-passing " + name.name + ":\t" + "%.2f" % value) + + valid_loss = valid_eval_results[Metrics.LOSS] + valid_token_accuracy = valid_eval_results[Metrics.TOKEN_ACCURACY] + string_accuracy = valid_eval_results[Metrics.STRING_ACCURACY] + + if params.scheduler: + model.scheduler.step(valid_loss) + + if ( + valid_loss > previous_epoch_loss + and valid_token_accuracy < previous_valid_acc + and step >= params.warmup_step + ): + learning_rate_coefficient *= params.learning_rate_ratio + log.put("learning rate coefficient:\t" + str(learning_rate_coefficient)) + + previous_epoch_loss = valid_loss + previous_valid_acc = valid_token_accuracy + saved = False + + if not saved and string_accuracy > maximum_string_accuracy: + maximum_string_accuracy = string_accuracy + patience = patience * params.patience_ratio + countdown = int(patience) + last_save_file = os.path.join(params.logdir, "best_model") + model.save(last_save_file) + + log.put("maximum string accuracy:\t" + str(maximum_string_accuracy)) + log.put("patience:\t" + str(patience)) + log.put("save file:\t" + str(last_save_file)) + + if countdown <= 0: + keep_training = False + + countdown -= 1 + log.put("countdown:\t" + str(countdown)) + log.put("") + + epochs += 1 + + log.put("Finished training!") + log.close() + + return last_save_file + + +def evaluate(model, data, params, last_save_file, split): + """Evaluates a pretrained model on a dataset. + + Args: + model (ATISModel): Model class. + data (ATISData): All of the data. + params (namespace): Parameters for the model. + last_save_file (str): Location where the model save file is. 
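+        split (str): Name of the data split to evaluate ("train", "dev", "valid" or "test").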
+ """ + if last_save_file: + model.load(last_save_file) + else: + if not params.save_file: + raise ValueError("Must provide a save file name if not training first.") + model.load(params.save_file) + + filename = split + + if filename == "dev": + split = data.dev_data + elif filename == "train": + split = data.train_data + elif filename == "test": + split = data.test_data + elif filename == "valid": + split = data.valid_data + else: + raise ValueError("Split not recognized: " + str(params.evaluate_split)) + + if params.use_predicted_queries: + filename += "_use_predicted_queries" + else: + filename += "_use_gold_queries" + + full_name = os.path.join(params.logdir, filename) + params.results_note + + if params.interaction_level or params.use_predicted_queries: + examples = data.get_all_interactions(split) + if params.interaction_level: + evaluate_interaction_sample( + examples, + model, + name=full_name, + metrics=FINAL_EVAL_METRICS, + total_num=atis_data.num_utterances(split), + database_username=params.database_username, + database_password=params.database_password, + database_timeout=params.database_timeout, + use_predicted_queries=params.use_predicted_queries, + max_generation_length=params.eval_maximum_sql_length, + write_results=True, + use_gpu=True, + compute_metrics=params.compute_metrics, + ) + else: + evaluate_using_predicted_queries( + examples, + model, + name=full_name, + metrics=FINAL_EVAL_METRICS, + total_num=atis_data.num_utterances(split), + database_username=params.database_username, + database_password=params.database_password, + database_timeout=params.database_timeout, + ) + else: + examples = data.get_all_utterances(split) + evaluate_utterance_sample( + examples, + model, + name=full_name, + gold_forcing=False, + metrics=FINAL_EVAL_METRICS, + total_num=atis_data.num_utterances(split), + max_generation_length=params.eval_maximum_sql_length, + database_username=params.database_username, + database_password=params.database_password, + database_timeout=params.database_timeout, + write_results=True, + ) + + +def main(): + """Main function that trains and/or evaluates a model.""" + params = interpret_args() + + paddle.set_device("gpu") + + # Prepare the dataset into the proper form. + data = atis_data.ATISDataset(params) + params.num_db = len(data.db2id) + + # Construct the model object. 
+ if params.interaction_level: + model_type = SchemaInteractionATISModel + else: + print("not implemented") + exit() + + model = model_type( + params, + data.input_vocabulary, + data.output_vocabulary, + data.output_vocabulary_schema, + data.anonymizer if params.anonymize and params.anonymization_scoring else None, + ) + + model.build_optim() + + sys.stdout.flush() + + last_save_file = "" + + if params.train: + last_save_file = train(model, data, params) + if params.evaluate and "valid" in params.evaluate_split: + evaluate(model, data, params, last_save_file, split="valid") + if params.evaluate and "dev" in params.evaluate_split: + evaluate(model, data, params, last_save_file, split="dev") + if params.evaluate and "test" in params.evaluate_split: + evaluate(model, data, params, last_save_file, split="test") + + +if __name__ == "__main__": + main() diff --git a/examples/text_to_sql/RAT-SQL/README.md b/examples/text_to_sql/RAT-SQL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..48ed9294db7491d1a8071913d9ec78e903be5dc1 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/README.md @@ -0,0 +1,198 @@ +# Enhanced RAT-SQL + +## Text2SQL 任务 + +语义解析是一种交互式分析技术,其将用户输入的自然语言表述转成一种指定的语义表示形式,如图表示(AMR等)、逻辑表达式(一阶逻辑表示,lambda表示等)、编程语言(SQL、python等)、数学公式等。 + +Text2SQL 是语义解析技术中的一类任务,基于给定的数据库,其将用户输入的自然语言问题转成可与数据库交互的 SQL 查询语句,实现基于数据库的自动问答能力。 + + +## 数据集 + +数据集是推动自然语言处理技术进步的基石。为了处理不同场景、不同领域的应用需求,学术界及工业界陆续开放了一些相关数据集。千言项目为了验证模型的鲁棒性、泛化性等,针对每个自然语言处理问题,均收集和整理多个开源数据集,进行统一的处理并提供统一的测评方式。 + +作为千言项目的重要任务之一,语义解析方向收集和整理了NL2SQL、CSpider和DuSQL数据集,详情可参见千言官网的语义解析任务页面。 + + +## 基线系统 + +本基线系统基于PaddlePaddle2.0动态图实现了复杂数据集上的SOTA模型[RAT-SQL](https://arxiv.org/abs/1911.04942),其核心是基于encoder-decoder框架的序列生成模型。本系统编码端使用了 ERNIE + Relation-aware Transformer对问题和数据库schema 进行编码, 解码端实现了基于语法指导的解码算法,具体算法思想请见[TRANX](https://www.aclweb.org/anthology/D18-2002.pdf)。 + +同时,为了兼容上述提及的三个数据集,我们基于RAT-SQL模型进行扩展以丰富其问题解决能力,主要包括:1)多语言处理能力;2)value识别能力。 + +该基线系统除了提供模型的训练、预测外,还提供了评估及数据处理脚本。参赛选手及相关研究者可基于此系统进行更深层的效果优化。 + +# 环境准备 +代码运行需要 Linux 主机,Python 3.7 和 PaddlePaddle 2.0 以上版本。 + +## 推荐的环境 + +* 操作系统 CentOS 7.5 +* Python 3.7.9 +* PaddlePaddle 2.0.0 + +除此之外,强烈建议使用支持 GPU 的硬件环境。 + +## PaddlePaddle + +可根据机器情况和个人需求在 PaddlePaddle 和 PaddlePaddle-GPU 中二选一安装。 +如果机器支持GPU,则建议安装GPU版本。 + +``` +# CPU 版本 +pip3 install paddlepaddle +# GPU 版本 +pip3 install paddlepaddle-gpu +``` + +更多关于 PaddlePaddle 的安装教程、使用方法等请参考[官方文档](https://www.paddlepaddle.org.cn/#quick-start). 
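+
+安装完成后,可参考以下命令快速自检 PaddlePaddle 是否安装成功、GPU 是否可用(仅为可选的检查示例):
+
+```
+python3 -c "import paddle; paddle.utils.run_check()"
+```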
+ +## 第三方 Python 库 +除 PaddlePaddle 及其依赖之外,还依赖其它第三方 Python 库,位于代码根目录的 requirements.txt 文件中。 + +可使用 pip 一键安装 + +```pip3 install -r requirements.txt``` + +# 数据准备 +运行前需要自行下载训练、测试数据。 + +``` +# 下载模型训练、测试数据 +# 得到的数据包括 DuSQL, NL2SQL 和 CSpider 三个数据集(同[千言-语义解析](https://aistudio.baidu.com/aistudio/competition/detail/47)任务的三个数据集) +bash data/download_model_data.sh + +# 下载训练好的 Text2SQL 模型 +# 得到的数据包括: +# data +# ├── trained_model +# │   ├── dusql.pdparams +# │   ├── nl2sql.pdparams +# │   ├── cspider.pdparams +bash data/download_trained_model.sh + +``` + +# 数据预处理 + +对原始数据进行格式转换、依赖信息补充等,以适配模型的输入。下面以DuSQL数据集为例进行说明。 + +## 获取 Schema Linking 结果 +将 schema linking 独立出来,以便于针对这一步进行特定优化,可有效提升模型最终的效果。 + +``` +# 训练集 +./run.sh ./script/schema_linking.py \ + -s data/DuSQL/db_schema.json \ + -c data/DuSQL/db_content.json \ + -o data/DuSQL/match_values_train.json \ + data/DuSQL/train.json --is-train + +# 开发集和测试集 +./run.sh ./script/schema_linking.py \ + -s data/DuSQL/db_schema.json \ + -c data/DuSQL/db_content.json \ + -o data/DuSQL/match_values_dev.json \ + data/DuSQL/dev.json +./run.sh ./script/schema_linking.py \ + -s data/DuSQL/db_schema.json \ + -c data/DuSQL/db_content.json \ + -o data/DuSQL/match_values_test.json \ + data/DuSQL/test.json + +``` + +需要注意的是,对于 NL2SQL 数据集,需要额外指定参数 `--sql-format nl2sql`,以便适配其简化的 SQL Json 格式。 +此参数默认取值为 'dusql',可同时兼容 DuSQL 和 CSpider 数据集。 + +## 获得模型输入 + +对 DuSQL 原始数据和Schema Linking的结果做处理,得到模型的输入,位于 data/DuSQL/preproc 目录下: +``` +./run.sh ./script/text2sql_main.py \ + --mode preproc \ + --config conf/text2sql_dusql.jsonnet \ + --data-root data/DuSQL/ \ + --is-cached false \ + --output data/DuSQL/preproc +``` + +# 运行模型 + +## 模型配置文件 + +模型运行必需的配置位于conf下,默认提供的配置包括:text2sql_dusql.jsonnet, text2sql_nl2sql.jsonnet 和 text2sql_cspider.jsonnet, 分别用于 DuSQL, NL2SQL 和 CSpider 三个数据集的训练、预测等任务。 下文中如无特殊说明,则上述配置统称为 config.jsonnet。 + +## 训练 + +以训练DuSQL 模型为例 + +``` +bash ./train.sh 1 output/dusql_baseline --config conf/text2sql_dusql.jsonnet --data-root data/DuSQL/preproc +``` + +参数说明: +* 1 表示并发数,代码会自动选取剩余显存最多的卡使用,当前仅支持 1 卡训练;也可手动指定使用哪张卡,比如使用卡2,则此参数写为 cuda:2 +* output/dusql_experiment 表示训练过程保存的模型、预测开发集的结果等保存的目录,按需设置即可 +* --config conf/text2sql_dusql.jsonnet 指定本次任务的核心配置。注意 text2sql_dusql.jsonnet 需要替换为特定的配置文件 +* --data-root: 指定数据集的根目录。也可通过 --train-set/--dev-set/--test-set/--db 分别指定不同文件的路径 + +全部的参数可通过 `bash ./run.sh ./script/text2sql_main.py -h` 查看。其中常用参数: +* --pretrain-model: 指定 ERNIE 预训练模型目录 +* --batch-size: batch size 大小 +* --epochs: 总的训练轮数 +* --init-model-params: 热启模型的文件路径 + +命令行参数的优先级高于配置文件,即如果在命令行指定了config文件包含的参数,则以命令行的设置为准。 + +### 训练阶段的输出日志 +训练过程会输出loss、acc相关日志,日志会同时输出到屏幕和 --output 参数指定目录下的 train.log 文件中。 +内容类似: +``` +[train] epoch 1, batch 600. loss is 34.1222593689. cost 442.40s +[train] epoch 1, batch 700. loss is 33.3783876610. cost 424.55s +[train] epoch 1/30 loss is 34.777802, cost 2826.10s. +[eval] dev loss 0.000000, acc 1.0000. got best and saved. 
cost [27.94s] +``` +其中,间隔多少steps输出一次日志在conf中设置(train.log_steps),也可通过命令行参数指定(--log-steps)。 +为了提升训练速度,并非每个 epoch 结束都会执行评估,所以 eval 一行的 acc 实际中使用了 epoch 代替。 + +## 预测 + +以预测 DuSQL 开发集为例,结果保存到 output/dusql_dev_infer_result.json。 + +``` +bash ./run.sh ./script/text2sql_main.py --mode infer \ + --config conf/text2sql_dusql.jsonnet \ + --data-root data/DuSQL/preproc \ + --test-set data/DuSQL/preproc/dev.pkl \ + --init-model-param output/dusql_baseline/....../model.pdparams \ + --output output/dusql_dev_infer_result.sql +``` +其中的 --init-model-param 参数请修改为真实的模型路径。 + +## 评估 + +同样以 DuSQL 开发集的预测结果为例。 + +``` +python ./evaluation/text2sql_evaluation.py \ + -g data/DuSQL/gold_dev.sql \ + -t data/DuSQL/db_schema.json \ + -d DuSQL \ + -p output/dusql_dev_infer_result.sql +``` + +注意,其中的 `data/DuSQL/gold_dev.sql` 需要开发者从 dev.json 中提取得到,格式为“question_id \t sql_query \t db_id”。 + +# 基线效果 + +评价指标:Exact Match Accuracy,即预测的SQL query与标准SQL query相等的问题占比,这里“相等”判断时忽略成分的顺序影响,即SELECT A,B和SELECT B,A是相等的。 +评估数据集:各数据集的开发集合。 +评估设置:基线系统默认代码和配置。 + +| 数据集 | 准确率(%) | +|-------- | --- | +| DuSQL | 64.3 | +| NL2SQL | 73.0 | +| CSpider | 33.6 | diff --git a/examples/text_to_sql/RAT-SQL/conf/DuSQL.asdl b/examples/text_to_sql/RAT-SQL/conf/DuSQL.asdl new file mode 100644 index 0000000000000000000000000000000000000000..0edd7a574b2fb781ac335921c82aae021d446ca5 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/DuSQL.asdl @@ -0,0 +1,119 @@ +-- Assumptions: +-- 1. sql is correct +-- 2. only table name has alias +-- 3. only one intersect/union/except + +module DuSQL +{ + -- val: number(float)/string(str)/sql(dict) + val = Value(value val_id) | ValSql(sql s) | ColUnit(column col_id) + + -- col_unit: (agg_id, col_id, isDistinct(bool)) + col_unit = ( + agg_type agg_id, + -- TODO fix + column col_id + ) + + -- val_unit: (unit_op, col_unit1, col_unit2) + -- val_unit = ( + -- unit_type unit_op, + -- col_unit col_unit1, + -- col_unit col_unit2 + -- ) + val_unit = Column(col_unit col_unit1) + | Minus(col_unit col_unit1, col_unit col_unit2) + | Plus(col_unit col_unit1, col_unit col_unit2) + | Times(col_unit col_unit1, col_unit col_unit2) + | Divide(col_unit col_unit1, col_unit col_unit2) + + -- condition: [cond_unit1, 'and'/'or', cond_unit2, ...] + -- cond_unit: (agg_id, op_id, val_unit, val1, val2) + cond = And(cond left, cond right) + | Or(cond left, cond right) + | NotIn(agg_type agg_id, val_unit val_unit, val val1) + | Between(agg_type agg_id, val_unit val_unit, val val1, val val2) + | Eq(agg_type agg_id, val_unit val_unit, val val1) + | Gt(agg_type agg_id, val_unit val_unit, val val1) + | Lt(agg_type agg_id, val_unit val_unit, val val1) + | Ge(agg_type agg_id, val_unit val_unit, val val1) + | Le(agg_type agg_id, val_unit val_unit, val val1) + | Ne(agg_type agg_id, val_unit val_unit, val val1) + | In(agg_type agg_id, val_unit val_unit, val val1) + | Like(agg_type agg_id, val_unit val_unit, val val1) + + -- sql { + -- 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + -- 'where': condition + -- 'groupBy': [col_unit1, col_unit2, ...] + -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + -- 'having': condition + -- 'limit': None/limit value + -- 'intersect': None/sql + -- 'except': None/sql + -- 'union': None/sql + -- } + + sql = ( + select select, + from from, + sql_where sql_where, + sql_groupby sql_groupby, + sql_orderby sql_orderby, + sql_ieu sql_ieu, + ) + + sql_where = ( + cond? 
where, + ) + + sql_groupby = ( + col_unit* group_by, + cond? having, + ) + + sql_orderby = ( + order_by? order_by, + value? limit, + ) + + sql_ieu = ( + sql? intersect, + sql? except, + sql? union, + ) + + -- 'select': ([(agg_id, val_unit), (agg_id, val_unit), ...]) + select = (agg* aggs) + agg = (agg_type agg_id, val_unit val_unit) + + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + from = (table_unit* table_units, cond? conds) + -- table_unit: (table_type, col_unit/sql) + table_unit = TableUnitSql(sql s) | Table(table table_id) + + -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + order_by = (order order, agg* aggs) + + -- CLAUSE_KEYWORDS = ('select', 'from', 'where', 'group', 'order', 'limit', 'intersect', 'union', 'except') + -- JOIN_KEYWORDS = ('join', 'on', 'as') + + -- WHERE_OPS = ('not', 'between', '=', '>', '<', '>=', '<=', '!=', 'in', 'like', 'is', 'exists') + -- cond_type = Between | Eq | Gt | Lt | Ge | Le | Ne | In | Like | Is | Exists + + -- UNIT_OPS = ('none', '-', '+', "*", '/') + --unit_type = NoneUnitOp | Minus | Plus | Times | Divide + + -- AGG_OPS = ('none', 'max', 'min', 'count', 'sum', 'avg') + agg_type = NoneAggOp | Max | Min | Count | Sum | Avg + + -- TABLE_TYPE = { + -- 'sql': "sql", + -- 'table_unit': "table_unit", + -- } + -- COND_OPS = ('and', 'or') + -- SQL_OPS = ('intersect', 'union', 'except') + -- ORDER_OPS = ('desc', 'asc') + order = Asc | Desc +} diff --git a/examples/text_to_sql/RAT-SQL/conf/NL2SQL.asdl b/examples/text_to_sql/RAT-SQL/conf/NL2SQL.asdl new file mode 100644 index 0000000000000000000000000000000000000000..d7c7fdbb7f6686e1a2af2a1960e8664e477e49a3 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/NL2SQL.asdl @@ -0,0 +1,119 @@ +-- Assumptions: +-- 1. sql is correct +-- 2. only table name has alias +-- 3. only one intersect/union/except + +module NL2SQL +{ + -- val: number(float)/string(str)/sql(dict) + val = Value(value val_id) | ValSql(sql s) | ColUnit(column col_id) + + -- col_unit: (agg_id, col_id, isDistinct(bool)) + col_unit = ( + agg_type agg_id, + -- TODO fix + column col_id + ) + + -- val_unit: (unit_op, col_unit1, col_unit2) + -- val_unit = ( + -- unit_type unit_op, + -- col_unit col_unit1, + -- col_unit col_unit2 + -- ) + val_unit = Column(col_unit col_unit1) + | Minus(col_unit col_unit1, col_unit col_unit2) + | Plus(col_unit col_unit1, col_unit col_unit2) + | Times(col_unit col_unit1, col_unit col_unit2) + | Divide(col_unit col_unit1, col_unit col_unit2) + + -- condition: [cond_unit1, 'and'/'or', cond_unit2, ...] + -- cond_unit: (agg_id, op_id, val_unit, val1, val2) + cond = And(cond left, cond right) + | Or(cond left, cond right) + | NotIn(agg_type agg_id, val_unit val_unit, val val1) + | Between(agg_type agg_id, val_unit val_unit, val val1, val val2) + | Eq(agg_type agg_id, val_unit val_unit, val val1) + | Gt(agg_type agg_id, val_unit val_unit, val val1) + | Lt(agg_type agg_id, val_unit val_unit, val val1) + | Ge(agg_type agg_id, val_unit val_unit, val val1) + | Le(agg_type agg_id, val_unit val_unit, val val1) + | Ne(agg_type agg_id, val_unit val_unit, val val1) + | In(agg_type agg_id, val_unit val_unit, val val1) + | Like(agg_type agg_id, val_unit val_unit, val val1) + + -- sql { + -- 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + -- 'where': condition + -- 'groupBy': [col_unit1, col_unit2, ...] 
+ -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + -- 'having': condition + -- 'limit': None/limit value + -- 'intersect': None/sql + -- 'except': None/sql + -- 'union': None/sql + -- } + + sql = ( + select select, + from from, + sql_where sql_where, + sql_groupby sql_groupby, + sql_orderby sql_orderby, + sql_ieu sql_ieu, + ) + + sql_where = ( + cond? where, + ) + + sql_groupby = ( + col_unit* group_by, + cond? having, + ) + + sql_orderby = ( + order_by? order_by, + value? limit, + ) + + sql_ieu = ( + sql? intersect, + sql? except, + sql? union, + ) + + -- 'select': ([(agg_id, val_unit), (agg_id, val_unit), ...]) + select = (agg* aggs) + agg = (agg_type agg_id, val_unit val_unit) + + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + from = (table_unit* table_units, cond? conds) + -- table_unit: (table_type, col_unit/sql) + table_unit = TableUnitSql(sql s) | Table(table table_id) + + -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + order_by = (order order, agg* aggs) + + -- CLAUSE_KEYWORDS = ('select', 'from', 'where', 'group', 'order', 'limit', 'intersect', 'union', 'except') + -- JOIN_KEYWORDS = ('join', 'on', 'as') + + -- WHERE_OPS = ('not', 'between', '=', '>', '<', '>=', '<=', '!=', 'in', 'like', 'is', 'exists') + -- cond_type = Between | Eq | Gt | Lt | Ge | Le | Ne | In | Like | Is | Exists + + -- UNIT_OPS = ('none', '-', '+', "*", '/') + --unit_type = NoneUnitOp | Minus | Plus | Times | Divide + + -- AGG_OPS = ('none', 'avg', 'max', 'min', 'count', 'sum') + agg_type = NoneAggOp | Avg | Max | Min | Count | Sum + + -- TABLE_TYPE = { + -- 'sql': "sql", + -- 'table_unit': "table_unit", + -- } + -- COND_OPS = ('and', 'or') + -- SQL_OPS = ('intersect', 'union', 'except') + -- ORDER_OPS = ('desc', 'asc') + order = Asc | Desc +} diff --git a/examples/text_to_sql/RAT-SQL/conf/text2sql_cspider.jsonnet b/examples/text_to_sql/RAT-SQL/conf/text2sql_cspider.jsonnet new file mode 100644 index 0000000000000000000000000000000000000000..4370bae165f3c431d644492f5e5cee748ad5d0ab --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/text2sql_cspider.jsonnet @@ -0,0 +1,62 @@ + +function(data_path='data/CSpider/preproc') { + general: { + mode: null, + batch_size: 16, + use_cuda: true, + is_cloud: false, + is_debug: false, + use_fp16: 0, + }, + model: { + pretrain_model_type: 'BERT', + pretrain_model: 'bert-base-multilingual-uncased', + init_model_params: null, + init_model_optim: null, + model_name: 'seq2tree_v2', + grammar_type: 'dusql_v2', + rat_layers: 8, + rat_heads: 8, + enc_value_with_col: true, + num_value_col_type: 'q_num', # cls|col_0|q_num + value_memory: true, + predict_value: false, + max_seq_len: 510, + max_question_len: 120, + max_column_num: 100, + max_table_num: 18, + max_column_tokens: 50, # useless + max_table_tokens: 20, # useless + }, + data: { + db: null, + grammar: 'conf/DuSQL.asdl', + train_set: null, + dev_set: null, + test_set: null, + eval_file: null, + output: 'output', + is_cached: false, + }, + train: { + epochs: 50, + log_steps: 10, + trainer_num: 1, + # [begin] config for optimizer + learning_rate: 1e-05, + lr_scheduler: "linear_warmup_decay", + warmup_steps: 0, + warmup_proportion: 0.1, + weight_decay: 0.01, + use_dynamic_loss_scaling: false, + init_loss_scaling: 128, + incr_every_n_steps: 100, + decr_every_n_nan_or_inf: 2, + incr_ratio: 2.0, + decr_ratio: 0.8, + grad_clip: 1.0, + # [end] optimizer + random_seed: null, + use_data_parallel: false, + } +} diff --git a/examples/text_to_sql/RAT-SQL/conf/text2sql_dusql.jsonnet 
b/examples/text_to_sql/RAT-SQL/conf/text2sql_dusql.jsonnet new file mode 100644 index 0000000000000000000000000000000000000000..6c77b0c73ccfed1ff36b6b65f1e5749a28db1762 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/text2sql_dusql.jsonnet @@ -0,0 +1,62 @@ + +function(data_path='data/DuSQL/preproc') { + general: { + mode: null, + batch_size: 16, + use_cuda: true, + is_cloud: false, + is_debug: false, + use_fp16: 0, + }, + model: { + pretrain_model_type: 'ERNIE', + pretrain_model: 'ernie-1.0', + init_model_params: null, + init_model_optim: null, + model_name: 'seq2tree_v2', + grammar_type: 'dusql_v2', + rat_layers: 8, + rat_heads: 8, + enc_value_with_col: true, + num_value_col_type: 'q_num', # cls|col_0|q_num + value_memory: true, + predict_value: true, + max_seq_len: 510, + max_question_len: 120, + max_column_num: 60, + max_table_num: 15, + max_column_tokens: 50, # useless + max_table_tokens: 20, # useless + }, + data: { + db: null, + grammar: 'conf/DuSQL.asdl', + train_set: null, + dev_set: null, + test_set: null, + eval_file: null, + output: 'output', + is_cached: false, + }, + train: { + epochs: 30, + log_steps: 10, + trainer_num: 1, + # [begin] config for optimizer + learning_rate: 1e-05, + lr_scheduler: "linear_warmup_decay", + warmup_steps: 0, + warmup_proportion: 0.1, + weight_decay: 0.01, + use_dynamic_loss_scaling: false, + init_loss_scaling: 128, + incr_every_n_steps: 100, + decr_every_n_nan_or_inf: 2, + incr_ratio: 2.0, + decr_ratio: 0.8, + grad_clip: 1.0, + # [end] optimizer + random_seed: null, + use_data_parallel: false, + } +} diff --git a/examples/text_to_sql/RAT-SQL/conf/text2sql_nl2sql.jsonnet b/examples/text_to_sql/RAT-SQL/conf/text2sql_nl2sql.jsonnet new file mode 100644 index 0000000000000000000000000000000000000000..80550cfd0c181ba2c55e73524f53d55d173469a6 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/text2sql_nl2sql.jsonnet @@ -0,0 +1,62 @@ + +function(data_path='data/NL2SQL/preproc') { + general: { + mode: null, + batch_size: 16, + use_cuda: true, + is_cloud: false, + is_debug: false, + use_fp16: 0, + }, + model: { + pretrain_model_type: 'ERNIE', + pretrain_model: 'ernie-1.0', + init_model_params: null, + init_model_optim: null, + model_name: 'seq2tree_v2', + grammar_type: 'nl2sql', + rat_layers: 8, + rat_heads: 8, + enc_value_with_col: true, + num_value_col_type: 'q_num', # cls|col_0|q_num + value_memory: true, + predict_value: true, + max_seq_len: 510, + max_question_len: 120, + max_column_num: 60, + max_table_num: 15, + max_column_tokens: 50, # useless + max_table_tokens: 20, # useless + }, + data: { + db: null, + grammar: 'conf/NL2SQL.asdl', + train_set: null, + dev_set: null, + test_set: null, + eval_file: null, + output: 'output', + is_cached: false, + }, + train: { + epochs: 12, + log_steps: 10, + trainer_num: 1, + # [begin] config for optimizer + learning_rate: 1e-05, + lr_scheduler: "linear_warmup_decay", + warmup_steps: 0, + warmup_proportion: 0.1, + weight_decay: 0.01, + use_dynamic_loss_scaling: false, + init_loss_scaling: 128, + incr_every_n_steps: 100, + decr_every_n_nan_or_inf: 2, + incr_ratio: 2.0, + decr_ratio: 0.8, + grad_clip: 1.0, + # [end] optimizer + random_seed: null, + use_data_parallel: false, + } +} diff --git a/examples/text_to_sql/RAT-SQL/data/download_model_data.sh b/examples/text_to_sql/RAT-SQL/data/download_model_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..87a77dd1d86a562baf657eef4f337a5c69677115 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/data/download_model_data.sh @@ -0,0 +1,14 @@ 
+#!/bin/bash + +cd `dirname $0` + +set -e + +wget --no-check-certificate https://dataset-bj.cdn.bcebos.com/qianyan/NL2SQL.zip +unzip NL2SQL.zip >/dev/null + +wget --no-check-certificate https://dataset-bj.cdn.bcebos.com/qianyan/CSpider.zip +unzip CSpider.zip >/dev/null + +wget --no-check-certificate https://bj.bcebos.com/v1/dataset-bj/qianyan/DuSQL.zip +unzip DuSQL.zip >/dev/null diff --git a/examples/text_to_sql/RAT-SQL/data/download_trained_model.sh b/examples/text_to_sql/RAT-SQL/data/download_trained_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..47d64e59bf48a0f71c047d7d35b27becc5eae0ce --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/data/download_trained_model.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +cd `dirname $0` + +version='v1.0.0' +target_file="text2sql_trained_model_$version.tar.gz" +wget --no-check-certificate https://dataset-bj.cdn.bcebos.com/qianyan/$target_file +tar xzf $target_file \ No newline at end of file diff --git a/examples/text_to_sql/RAT-SQL/evaluation/README.md b/examples/text_to_sql/RAT-SQL/evaluation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dc05871a09158aea37838af1aba930e42210c1e3 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/README.md @@ -0,0 +1,34 @@ +# 环境 +建议 python3.7 + +# 评估 +输入文件格式: +1. 文件以.sql结尾 +2. 文件每行的格式:"qid\tsql_query\tdb_id",其中predict文件db_id是可选字段,gold文件db_id是必选字段 +3. 评估指标:exact matching score + +# 使用 + +## 命令行 + + python text2sql_evaluation.py \ + --g 'data/DuSQL/test_gold.sql' \ # gold文件 + --p 'test_DuSQL.sql' \ # predict文件 + --t 'data/DuSQL/db_schema.json' \ # schema文件 + --d 'DuSQL' # 选择dataset(DuSQL、NL2SQL、CSPider可选) + +## 接口 + + from text2sql_evaluation import evaluate + score, score_novalue = evaluate('table.json', 'gold.sql', 'pred.sql', dataset='DuSQL') +其中: + score["all"] = {"exact": exact num, "count": test examples num, "acc": accuracy} + score_novalue["all"] = {"exact": exact num, "count": test examples num, "acc": accuracy} + +## 输出 + with value: + {"exact": exact correct num, "count": test examples num, "acc": accuracy} + without value: + {"exact": exact correct num, "count": test examples num, "acc": accuracy} +其中: + acc表示最终输出准确率 diff --git a/examples/text_to_sql/RAT-SQL/evaluation/data/gold.sql b/examples/text_to_sql/RAT-SQL/evaluation/data/gold.sql new file mode 100644 index 0000000000000000000000000000000000000000..bc6d8e119c55717949579a96111ce220bb0f5aa7 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/data/gold.sql @@ -0,0 +1,18 @@ +1 ( select 书名 from 传记 where 页数 > 400 ) intersect ( select 书名 from 传记 where 出版时间 > "1981-03-24" ) 人物传记 +2 ( select 姓名 from 作者 where 作品数量 >= 50 ) except ( select 姓名 from 作者 order by 出生日期 desc limit 3 ) 小说 +3 ( select 开源课程名称 from 学校的开源课程 order by 课时 desc limit 3 ) except ( select 开源课程名称 from 学校的开源课程 where 主讲教师 != "王建安" ) 在线学习平台 +4 select avg ( 现价格 ) sum ( 原价格 ) from 本月特价书籍 榜单 +5 select max ( 电子书售价 ) from 电子书 购书平台 +6 select min ( 电子书售价 ) avg ( 购买人数 ) max ( 会员价格 ) from 电子书 购书平台 +7 select sum ( 豆瓣评分 ) max ( 1星占比 ) from 书籍 豆瓣读书 +8 select 书名, 出版社 from 传记 where 作者 != "柳润墨" order by 页数 asc 人物传记 +9 select 书名, 类型 from 网络小说 where 评分 == ( select max ( 评分 ) from 网络小说 ) 网易云阅读 +10 select 出版社 from 文集 group by 出版社 order by avg ( 页数 ) desc limit 1 文集 +11 select 名称 from 小说改编话剧 where 演出总场次 < ( select max ( 演出总场次 ) from 小说改编话剧 where 演出剧团 != "开心麻花" ) 小说 +12 select 名称 from 文集 where 页数 < ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +13 select 名称 from 文集 where 页数 == ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +14 select 
名称, 作者 from 书籍 where 豆瓣评分 > 5.4 order by 1星占比 desc 豆瓣读书 +15 select 名称, 评价人数 * 1星占比 from 书籍 where 作者 == "塔拉·韦斯特弗" 豆瓣读书 +16 select 姓名, 国籍 from 作者 where 作品数量 == ( select max ( 作品数量 ) from 作者 ) 小说 +17 select 姓名, 逝世日期 - 出生日期 from 作者 where 作品数量 < 50 小说 +18 select 讲述朝代 from 中国朝代历史 历史类书籍 diff --git a/examples/text_to_sql/RAT-SQL/evaluation/data/pred.sql b/examples/text_to_sql/RAT-SQL/evaluation/data/pred.sql new file mode 100644 index 0000000000000000000000000000000000000000..cff97fda2aa84c904c7e44dfbfe0da34af7ab7a6 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/data/pred.sql @@ -0,0 +1,18 @@ +1 ( select 出版社 from 传记 where 页数 > 400 ) intersect ( select 书名 from 传记 where 出版时间 > "1981-03-24" ) 人物传记 +2 ( select 姓名 from 作者 where 作品数量 >= 60 ) except ( select 姓名 from 作者 order by 出生日期 desc limit 3 ) 小说 +3 ( select 开源课程名称 from 学校的开源课程 order by 课时 desc limit 5 ) except ( select 开源课程名称 from 学校的开源课程 where 主讲教师 != "王建安" ) 在线学习平台 +4 select avg ( 现价格 ) max ( 原价格 ) from 本月特价书籍 榜单 +5 select max ( 电子书售价 ) from 电子书 购书平台 +6 select min ( 电子书售价 ) avg ( 购买人数 ) max ( 会员价格 ) from 电子书 购书平台 +7 select sum ( 豆瓣评分 ) max ( 1星占比 ) from 书籍 豆瓣读书 +8 select 书名 from 传记 where 作者 != "柳润墨" order by 页数 asc 人物传记 +9 select 书名, 类型 from 网络小说 where 评分 in ( select max ( 评分 ) from 网络小说 ) 网易云阅读 +10 select 出版社 from 文集 group by 出版社 order by avg ( 页数 ) desc limit 1 文集 +11 select 名称 from 小说改编话剧 where 演出总场次 < ( select max ( 演出总场次 ) from 小说改编话剧 where 演出剧团 != "开心麻花" ) 小说 +12 select 名称 from 文集 where 页数 < ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +13 select 名称 from 文集 where 页数 == ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +14 select 名称, 作者 from 书籍 where 豆瓣评分 > 5.4 order by 1星占比 desc 豆瓣读书 +15 select 名称, 评价人数 * 1星占比 from 书籍 where 作者 == "塔拉·韦斯特弗" 豆瓣读书 +16 select 姓名, 国籍 from 作者 where 作品数量 == ( select max ( 作品数量 ) from 作者 ) 小说 +17 select 姓名, 逝世日期 - 出生日期 from 作者 where 作品数量 < 50 小说 +18 select 讲述朝代 from 中国朝代历史 历史类书籍 diff --git a/examples/text_to_sql/RAT-SQL/evaluation/data/table.json b/examples/text_to_sql/RAT-SQL/evaluation/data/table.json new file mode 100644 index 0000000000000000000000000000000000000000..144854f31c369b8f6282a3d27d4fefbc0dca1fec --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/data/table.json @@ -0,0 +1,2049 @@ +[ + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "国籍" + ], + [ + 0, + "职业" + ], + [ + 0, + "主要成就" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "作者" + ], + [ + 1, + "页数" + ], + [ + 1, + "出版社" + ], + [ + 1, + "出版时间" + ], + [ + 2, + "传记id" + ], + [ + 2, + "人物id" + ], + [ + 2, + "记录时间" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "number", + "text", + "text", + "number", + "text", + "time", + "number", + "number", + "text" + ], + "db_id": "人物传记", + "foreign_keys": [ + [ + 13, + 1 + ], + [ + 12, + 6 + ] + ], + "primary_keys": [ + 1, + 6 + ], + "table_names": [ + "名人", + "传记", + "名人传记" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "成立时间" + ], + [ + 0, + "年营业额" + ], + [ + 0, + "是否自营" + ], + [ + 0, + "会员费" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "作者" + ], + [ + 1, + "类型" + ], + [ + 2, + "书名id" + ], + [ + 2, + "平台id" + ], + [ + 2, + "售价" + ], + [ + 2, + "购买人数" + ], + [ + 2, + "评分" + ], + [ + 2, + "评分人数" + ], + [ + 2, + "加入购物车人数" + ], + [ + 2, + "收藏人数" + ], + [ + 2, + "缺货" + ], + [ + 3, + "书名id" + ], + [ + 3, + "平台id" + ], + [ + 3, + "电子书售价" + ], + [ + 3, + "会员价格" + ], + [ + 3, + "购买人数" + ] + ], + 
"column_types": [ + "text", + "number", + "text", + "time", + "number", + "binary", + "number", + "number", + "text", + "text", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "binary", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "购书平台", + "foreign_keys": [ + [ + 11, + 7 + ], + [ + 12, + 1 + ], + [ + 20, + 7 + ], + [ + 21, + 1 + ] + ], + "primary_keys": [ + 1, + 7 + ], + "table_names": [ + "平台", + "图书", + "图书与平台", + "电子书" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "性别" + ], + [ + 0, + "国籍" + ], + [ + 0, + "职业" + ], + [ + 0, + "所在单位" + ], + [ + 1, + "词条id" + ], + [ + 1, + "名称" + ], + [ + 1, + "作者id" + ], + [ + 1, + "会议名称" + ], + [ + 1, + "年份" + ], + [ + 1, + "引用量" + ], + [ + 2, + "论文id" + ], + [ + 2, + "引用论文id" + ], + [ + 2, + "是否对比论文" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "text", + "number", + "text", + "number", + "text", + "time", + "number", + "number", + "number", + "binary" + ], + "db_id": "论文", + "foreign_keys": [ + [ + 13, + 7 + ], + [ + 9, + 1 + ], + [ + 14, + 7 + ] + ], + "primary_keys": [ + 1, + 7 + ], + "table_names": [ + "作者", + "论文", + "论文引用" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "书名" + ], + [ + 0, + "讲述国家" + ], + [ + 0, + "讲述时代" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "讲述朝代" + ], + [ + 2, + "词条id" + ], + [ + 2, + "书名" + ], + [ + 2, + "描述战事" + ], + [ + 3, + "词条id" + ], + [ + 3, + "书名" + ], + [ + 3, + "讲述名人" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "time", + "number", + "text", + "text", + "number", + "text", + "text", + "number", + "text", + "text" + ], + "db_id": "历史类书籍", + "foreign_keys": [], + "primary_keys": [ + 1, + 5, + 8, + 11 + ], + "table_names": [ + "国家历史", + "中国朝代历史", + "战争历史", + "人物历史" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "成立年数" + ], + [ + 0, + "教师数量" + ], + [ + 0, + "课程体系分级" + ], + [ + 1, + "平台id" + ], + [ + 1, + "适合群体" + ], + [ + 1, + "一节课时间" + ], + [ + 1, + "课时数" + ], + [ + 1, + "主题数" + ], + [ + 1, + "词汇量" + ], + [ + 2, + "平台id" + ], + [ + 2, + "外教来自国家" + ], + [ + 2, + "外教数量" + ], + [ + 2, + "教师职业占比" + ], + [ + 3, + "平台id" + ], + [ + 3, + "课时数" + ], + [ + 3, + "原价" + ], + [ + 3, + "折扣" + ] + ], + "column_types": [ + "text", + "number", + "text", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "在线英语教学", + "foreign_keys": [ + [ + 16, + 1 + ], + [ + 6, + 1 + ], + [ + 12, + 1 + ] + ], + "primary_keys": [ + 1 + ], + "table_names": [ + "平台", + "青少年课程", + "教师", + "学费(平台,课时数,价格,折扣)" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "国籍" + ], + [ + 0, + "毕业院校" + ], + [ + 0, + "民族" + ], + [ + 1, + "词条id" + ], + [ + 1, + "名称" + ], + [ + 1, + "作者id" + ], + [ + 1, + "页数" + ], + [ + 1, + "定价" + ], + [ + 1, + "出版社" + ], + [ + 1, + "出版时间" + ], + [ + 1, + "开本" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "number", + "text", + "number", + "number", + "number", + "text", + "time", + "text" + ], + "db_id": "文集", + "foreign_keys": [ + [ + 8, + 1 + ] + ], + "primary_keys": [ + 1, + 6 + ], + "table_names": [ + "作者", + "文集" + ] + }, + { + 
"column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "英文名" + ], + [ + 0, + "原著作者" + ], + [ + 0, + "字数" + ], + [ + 1, + "词条id" + ], + [ + 1, + "姓名" + ], + [ + 1, + "国籍" + ], + [ + 1, + "翻译作品数量" + ], + [ + 2, + "词条id" + ], + [ + 2, + "名称" + ], + [ + 2, + "成立时间" + ], + [ + 2, + "成立地点" + ], + [ + 3, + "书籍id" + ], + [ + 3, + "译者id" + ], + [ + 3, + "出版社id" + ], + [ + 3, + "出版册数" + ], + [ + 3, + "出版时间" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "number", + "text", + "text", + "number", + "number", + "text", + "time", + "text", + "number", + "number", + "number", + "number", + "time" + ], + "db_id": "外文书籍", + "foreign_keys": [ + [ + 14, + 1 + ], + [ + 16, + 10 + ], + [ + 15, + 6 + ] + ], + "primary_keys": [ + 1, + 6, + 10 + ], + "table_names": [ + "外文书籍", + "译者", + "出版社", + "书籍出版信息" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "书名" + ], + [ + 0, + "作者" + ], + [ + 0, + "评分" + ], + [ + 0, + "总排名" + ], + [ + 1, + "图书id" + ], + [ + 1, + "评价人数" + ], + [ + 2, + "图书id" + ], + [ + 2, + "现价格" + ], + [ + 2, + "原价格" + ], + [ + 3, + "图书id" + ], + [ + 3, + "购买人数" + ], + [ + 3, + "收藏人数" + ], + [ + 4, + "图书id" + ], + [ + 4, + "推荐人数" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "榜单", + "foreign_keys": [ + [ + 11, + 1 + ], + [ + 8, + 1 + ], + [ + 6, + 1 + ], + [ + 14, + 1 + ] + ], + "primary_keys": [ + 1, + 6, + 8, + 11, + 14 + ], + "table_names": [ + "图书", + "五星榜单", + "本月特价书籍", + "人气榜单", + "必读榜单" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "语言" + ], + [ + 0, + "类别" + ], + [ + 0, + "主办单位" + ], + [ + 0, + "创刊时间" + ], + [ + 0, + "国家" + ], + [ + 0, + "出版刊数" + ], + [ + 1, + "年份" + ], + [ + 1, + "期刊id" + ], + [ + 1, + "统计平台" + ], + [ + 1, + "出版文献数" + ], + [ + 1, + "被下载数量" + ], + [ + 1, + "被引数量" + ], + [ + 1, + "复合影响因子" + ], + [ + 1, + "综合影响因子" + ], + [ + 2, + "词条id" + ], + [ + 2, + "姓名" + ], + [ + 2, + "职业" + ], + [ + 3, + "人物id" + ], + [ + 3, + "期刊id" + ], + [ + 3, + "次数" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "time", + "text", + "number", + "time", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "text", + "text", + "number", + "number", + "number" + ], + "db_id": "期刊", + "foreign_keys": [ + [ + 20, + 17 + ], + [ + 10, + 1 + ], + [ + 21, + 1 + ] + ], + "primary_keys": [ + 1, + 17 + ], + "table_names": [ + "期刊", + "期刊文献", + "封面人物", + "期刊封面人物" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "作者" + ], + [ + 0, + "作者国籍" + ], + [ + 0, + "豆瓣评分" + ], + [ + 0, + "评价人数" + ], + [ + 0, + "5星占比" + ], + [ + 0, + "1星占比" + ], + [ + 1, + "年份" + ], + [ + 1, + "书籍id" + ], + [ + 1, + "评分" + ], + [ + 1, + "排名" + ], + [ + 2, + "年份" + ], + [ + 2, + "书籍id" + ], + [ + 2, + "关注数" + ], + [ + 2, + "排名" + ], + [ + 3, + "年份" + ], + [ + 3, + "书籍id" + ], + [ + 3, + "购买数" + ], + [ + 3, + "排名" + ], + [ + 4, + "书籍id" + ], + [ + 4, + "平台" + ], + [ + 4, + "售价" + ], + [ + 4, + "是否有货" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "number", + "number", + "number", + "time", + "number", + "number", + "number", + "time", + "number", + "number", + "number", + "time", + 
"number", + "number", + "number", + "number", + "text", + "number", + "binary" + ], + "db_id": "豆瓣读书", + "foreign_keys": [ + [ + 14, + 1 + ], + [ + 21, + 1 + ], + [ + 10, + 1 + ], + [ + 18, + 1 + ] + ], + "primary_keys": [ + 1 + ], + "table_names": [ + "书籍", + "高分图书榜单", + "最受关注图书榜单", + "最畅销图书榜单", + "购买平台" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "国籍" + ], + [ + 0, + "出生日期" + ], + [ + 0, + "出生地" + ], + [ + 0, + "逝世日期" + ], + [ + 0, + "作品数量" + ], + [ + 1, + "词条id" + ], + [ + 1, + "小说名" + ], + [ + 1, + "文学体裁" + ], + [ + 1, + "作者id" + ], + [ + 1, + "首版时间" + ], + [ + 1, + "字数" + ], + [ + 2, + "词条id" + ], + [ + 2, + "名称" + ], + [ + 2, + "小说id" + ], + [ + 2, + "演出剧团" + ], + [ + 2, + "导演" + ], + [ + 2, + "演出总场次" + ], + [ + 2, + "观众评分" + ], + [ + 3, + "词条id" + ], + [ + 3, + "剧名" + ], + [ + 3, + "小说id" + ], + [ + 3, + "首播时间" + ], + [ + 3, + "集数" + ], + [ + 3, + "豆瓣评分" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "time", + "text", + "time", + "number", + "number", + "text", + "text", + "number", + "time", + "number", + "number", + "text", + "number", + "text", + "text", + "number", + "number", + "number", + "text", + "number", + "time", + "number", + "number" + ], + "db_id": "小说", + "foreign_keys": [ + [ + 11, + 1 + ], + [ + 16, + 8 + ], + [ + 23, + 8 + ] + ], + "primary_keys": [ + 1, + 8, + 14, + 21 + ], + "table_names": [ + "作者", + "小说", + "小说改编话剧", + "小说改编电视剧" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "年龄" + ], + [ + 0, + "出版作品数" + ], + [ + 0, + "网络作品数" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "作者id" + ], + [ + 1, + "评分" + ], + [ + 1, + "评价人数" + ], + [ + 1, + "字数" + ], + [ + 1, + "点击数" + ], + [ + 1, + "类型" + ], + [ + 2, + "词条id" + ], + [ + 2, + "书名" + ], + [ + 2, + "作者id" + ], + [ + 2, + "评分" + ], + [ + 2, + "类型" + ], + [ + 2, + "状态" + ], + [ + 2, + "价格" + ], + [ + 3, + "网络小说id" + ], + [ + 3, + "周排名" + ], + [ + 3, + "月排名" + ], + [ + 3, + "总排名" + ], + [ + 4, + "网络小说id" + ], + [ + 4, + "周排名" + ], + [ + 4, + "月排名" + ], + [ + 4, + "总排名" + ] + ], + "column_types": [ + "text", + "number", + "text", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "text", + "number", + "text", + "number", + "number", + "text", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "网易云阅读", + "foreign_keys": [ + [ + 25, + 14 + ], + [ + 8, + 1 + ], + [ + 21, + 14 + ], + [ + 16, + 1 + ] + ], + "primary_keys": [ + 1, + 6, + 14 + ], + "table_names": [ + "作者", + "出版图书", + "网络小说", + "畅销榜", + "收藏榜" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "类型" + ], + [ + 0, + "适用阶段" + ], + [ + 0, + "适用年级" + ], + [ + 0, + "科目类型" + ], + [ + 0, + "价格" + ], + [ + 0, + "特点" + ], + [ + 1, + "试卷id" + ], + [ + 1, + "套数" + ], + [ + 1, + "押题命中率" + ], + [ + 2, + "省份" + ], + [ + 2, + "参考试卷id" + ], + [ + 2, + "版本" + ], + [ + 2, + "购买数量" + ], + [ + 2, + "平均得分" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "text", + "number", + "text", + "number", + "text", + "number", + "text", + "number", + "text", + "number", + "number" + ], + "db_id": "教材辅助参考书", + "foreign_keys": [ + [ + 9, + 1 + ], + [ + 13, + 1 + ] + ], + "primary_keys": [ + 1, + 9 + ], + "table_names": [ + "参考书", + "参考试卷", + "适用城市" + ] + }, + { + "column_names": [ 
+ [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "类型" + ], + [ + 0, + "国家" + ], + [ + 0, + "世界排名" + ], + [ + 1, + "词条id" + ], + [ + 1, + "名称" + ], + [ + 1, + "所属专业" + ], + [ + 1, + "适合学者类型" + ], + [ + 2, + "词条id" + ], + [ + 2, + "名称" + ], + [ + 2, + "课程数量" + ], + [ + 2, + "合作学校数量" + ], + [ + 2, + "是否免费" + ], + [ + 3, + "平台id" + ], + [ + 3, + "学校id" + ], + [ + 3, + "合作课程数量" + ], + [ + 4, + "词条id" + ], + [ + 4, + "开源课程名称" + ], + [ + 4, + "课程id" + ], + [ + 4, + "学校id" + ], + [ + 4, + "课时" + ], + [ + 4, + "主讲教师" + ], + [ + 5, + "开源课程id" + ], + [ + 5, + "平台id" + ], + [ + 5, + "是否直播" + ], + [ + 5, + "课程时长" + ], + [ + 5, + "价格" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "number", + "text", + "text", + "text", + "number", + "text", + "number", + "number", + "binary", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "text", + "number", + "number", + "binary", + "number", + "number" + ], + "db_id": "在线学习平台", + "foreign_keys": [ + [ + 20, + 6 + ], + [ + 21, + 1 + ], + [ + 24, + 18 + ], + [ + 25, + 10 + ], + [ + 16, + 1 + ], + [ + 15, + 10 + ] + ], + "primary_keys": [ + 1, + 6, + 10, + 18 + ], + "table_names": [ + "学校", + "课程", + "平台", + "平台合作学校", + "学校的开源课程", + "平台课程" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "成立时间" + ], + [ + 0, + "级别" + ], + [ + 1, + "会议id" + ], + [ + 1, + "年份" + ], + [ + 1, + "长文提交量" + ], + [ + 1, + "长文录取率" + ], + [ + 1, + "短文提交量" + ], + [ + 1, + "短文录取率" + ], + [ + 2, + "会议id" + ], + [ + 2, + "年份" + ], + [ + 2, + "大洲" + ], + [ + 2, + "提交数量占比" + ], + [ + 3, + "会议id" + ], + [ + 3, + "年份" + ], + [ + 3, + "国家" + ], + [ + 3, + "提交数量占比" + ], + [ + 4, + "方向名称" + ], + [ + 4, + "会议id" + ], + [ + 4, + "长文提交量" + ], + [ + 4, + "长文录取率" + ], + [ + 4, + "短文提交量" + ], + [ + 4, + "短文录取率" + ] + ], + "column_types": [ + "text", + "number", + "text", + "time", + "text", + "number", + "time", + "number", + "number", + "number", + "number", + "number", + "time", + "text", + "number", + "number", + "time", + "text", + "number", + "text", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "NLP会议", + "foreign_keys": [ + [ + 5, + 1 + ], + [ + 11, + 1 + ], + [ + 15, + 1 + ], + [ + 20, + 1 + ] + ], + "primary_keys": [ + 1 + ], + "table_names": [ + "会议", + "各会议论文", + "各会议论文大洲分布", + "各会议论文国家分布", + "2019年会议各方向分布" + ] + } +] \ No newline at end of file diff --git a/examples/text_to_sql/RAT-SQL/evaluation/text2sql_evaluation.py b/examples/text_to_sql/RAT-SQL/evaluation/text2sql_evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..ee85021e4139d615cc1cfc9f1e7f106cb36b7b04 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/text2sql_evaluation.py @@ -0,0 +1,1602 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Calculating the exact accuracy. 
For select, where and others schema, it will be +seen as right if has different order. This script refers to https://github.com/taoyds/spider。 +""" +import copy +import json +import logging +import re +from collections import defaultdict +from io import open + +from utils import evaluate_NL2SQL, is_float + +""" +val: number(float)/string(str)/sql(dict) +col_unit: (agg_id, col_id, isdistinct(bool)) +val_unit: (unit_op, col_unit1, col_unit2) +table_unit: (table_type, col_unit/sql) +cond_unit: (not_op, cond_op, val_unit, val1, val2) +condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +sql { + 'select': [(agg_id, val_unit), (agg_id, val_unit), ...] + 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + 'where': condition + 'groupBy': [col_unit1, col_unit2, ...] + 'orderBy': ('asc'/'desc', [(agg_id, val_unit), ...]) + 'having': condition + 'limit': None/number(int) + 'intersect': None/sql + 'except': None/sql + 'union': None/sql +} +""" + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +COND_OPS = ("not_in", "between", "==", ">", "<", ">=", "<=", "!=", "in", "like") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +LOGIC_AND_OR = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +CONST_COLUMN = set(["time_now"]) + +EXPECT_BRACKET_PRE_TOKENS = set(AGG_OPS + SQL_OPS + COND_OPS + CLAUSE_KEYWORDS + ("from", ",", "distinct")) + +g_empty_sql = { + "select": [], + "from": {"conds": [], "table_units": []}, + "where": [], + "groupBy": [], + "having": [], + "orderBy": [], + "limit": None, + "except": None, + "intersect": None, + "union": None, +} + +VALUE = "1" + + +def tokenize(string, single_equal=False, math=True): + """ + Args: + + Returns: + """ + + string = string.replace("'", '"').lower() + assert string.count('"') % 2 == 0, "Unexpected quote" + + def _extract_value(string): + """extract values in sql""" + fields = string.split('"') + for idx, tok in enumerate(fields): + if idx % 2 == 1: + fields[idx] = '"%s"' % (tok) + return fields + + def _resplit(tmp_tokens, fn_split, fn_omit): + """resplit""" + new_tokens = [] + for token in tmp_tokens: + token = token.strip() + if fn_omit(token): + new_tokens.append(token) + elif re.match(r"\d\d\d\d-\d\d(-\d\d)?", token): + new_tokens.append('"%s"' % (token)) + else: + new_tokens.extend(fn_split(token)) + return new_tokens + + tokens_tmp = _extract_value(string) + + two_bytes_op = ["==", "!=", ">=", "<=", "<>", ""] + if single_equal: + sep1 = re.compile(r"([ \+\-\*/\(\)=,><;])") # 单字节运算符 + else: + sep1 = re.compile(r"([ \+\-\*/\(\),><;])") # 单字节运算符 + sep2 = re.compile("(" + "|".join(two_bytes_op) + ")") # 多字节运算符 + tokens_tmp = _resplit(tokens_tmp, lambda x: x.split(" "), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep2, x), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep1, x), lambda x: x in two_bytes_op or x.startswith('"')) + tokens = list(filter(lambda x: x.strip() not in ("", "distinct", "DISTINCT"), tokens_tmp)) + + def _post_merge(tokens): + """merge: + * col name with "(", ")" + * values with +/- + """ + idx = 1 + while idx < len(tokens): + if tokens[idx] == "(" and tokens[idx - 1] not in EXPECT_BRACKET_PRE_TOKENS and tokens[idx - 1] != "=": + # 兼容单引号,这里可能有问题 + while idx < len(tokens): + tmp_tok = 
tokens.pop(idx) + tokens[idx - 1] += tmp_tok + if tmp_tok == ")": + break + elif tokens[idx] in ("+", "-") and tokens[idx - 1] in COND_OPS and idx + 1 < len(tokens): + tokens[idx] += tokens[idx + 1] + tokens.pop(idx + 1) + idx += 1 + else: + idx += 1 + return tokens + + tokens = _post_merge(tokens) + if single_equal: + tokens = [i if i != "=" else "==" for i in tokens] + return tokens + + +def scan_alias(toks): + """Scan the index of 'as' and build the map for all alias""" + as_idxs = [idx for idx, tok in enumerate(toks) if tok == "as"] + alias = {} + for idx in as_idxs: + alias[toks[idx + 1]] = toks[idx - 1] + return alias + + +def get_tables_with_alias(schema, toks): + """ + Args: + + Returns: + """ + tables = scan_alias(toks) + for key in schema: + assert key not in tables, "Alias {} has the same name in table".format(key) + tables[key] = key + return tables + + +def parse_col(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + :returns next idx, column id + """ + tok = toks[start_idx] + if tok == "*": + return start_idx + 1, schema.id_map[tok] + if tok in CONST_COLUMN: + return start_idx + 1, tok + + if "." in tok: # if token is a composite + alias, col = tok.split(".") + key = tables_with_alias[alias] + "." + col + return start_idx + 1, schema.id_map[key] + + assert default_tables is not None and len(default_tables) > 0, "Default tables should not be None or empty" + + for alias in default_tables: + table = tables_with_alias[alias] + if tok in schema.schema[table]: + key = table + "." + tok + return start_idx + 1, schema.id_map[key] + + raise RuntimeError("Error col: {},{}".format(tok, toks)) + + +def parse_col_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + :returns next idx, (agg_op id, col_id) + """ + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + assert idx < len_ and toks[idx] == "(" + idx += 1 + if toks[idx] == "distinct": + idx += 1 + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + assert idx < len_ and toks[idx] == ")" + idx += 1 + return idx, (agg_id, col_id) + if toks[idx] == "distinct": + idx += 1 + agg_id = AGG_OPS.index("none") + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (agg_id, col_id) + + +def parse_val_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + col_unit1 = None + col_unit2 = None + unit_op = UNIT_OPS.index("none") + + idx, col_unit1 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + if idx < len_ and toks[idx] in UNIT_OPS: + unit_op = UNIT_OPS.index(toks[idx]) + idx += 1 + idx, col_unit2 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + if unit_op in (UNIT_OPS.index("+"), UNIT_OPS.index("*")): + col_unit1, col_unit2 = sorted([col_unit1, col_unit2]) + + return idx, (unit_op, col_unit1, col_unit2) + + +def parse_table_unit(toks, start_idx, tables_with_alias, schema): + """ + :returns next idx, table id, table name + """ + idx = start_idx + len_ = len(toks) + key = tables_with_alias[toks[idx]] + + if idx + 1 < len_ and toks[idx + 1] == "as": + idx += 3 + else: + idx 
+= 1 + + return idx, schema.id_map[key], key + + +def parse_value(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + def _force_float(str_num): + """force float, just for debug""" + last = "" + while len(str_num) > 0: + try: + n = float(str_num) + if last == "%": + n /= 100 + return n + except Exception: + last = str_num[-1] + str_num = str_num[:-1] + raise ValueError("not a float number") + + if toks[idx] == "select": + idx, val = parse_sql(toks, idx, tables_with_alias, schema) + elif toks[idx].startswith('"') and toks[idx].endswith('"'): # token is a string value + val = toks[idx] + idx += 1 + else: + try: + val_str = toks[idx] + # val = float(val_str) if val_str[-1] != '%' else float(val_str[:-1]) / 100 + val = _force_float(val_str) + idx += 1 + except Exception: + end_idx = idx + while ( + end_idx < len_ + and toks[end_idx] != "," + and toks[end_idx] != ")" + and toks[end_idx] != "and" + and toks[end_idx] not in CLAUSE_KEYWORDS + and toks[end_idx] not in JOIN_KEYWORDS + ): + end_idx += 1 + + idx, val = parse_col_unit(toks[start_idx:end_idx], 0, tables_with_alias, schema, default_tables) + idx = end_idx + + if isBlock: + assert toks[idx] == ")" + idx += 1 + + return idx, val + + +def parse_condition(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + conds = [] + + while idx < len_: + agg_id = 0 + if idx < len_ and toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + + op_str = toks[idx] + if op_str == "not": + assert toks[idx + 1] == "in", '"not" must followed by "in"' + op_str = "not_in" + idx += 1 + assert idx < len_ and op_str in COND_OPS, "Error condition: idx: {}, tok: {},, toks: {}".format( + idx, op_str, toks + ) + op_id = COND_OPS.index(op_str) + idx += 1 + val1 = val2 = None + # idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + # val2 = None + if op_id == COND_OPS.index("between"): # between..and... 
special case: dual values + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + assert toks[idx] == "and" + idx += 1 + idx, val2 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + else: # normal case: single value + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + val2 = None + + conds.append((agg_id, op_id, val_unit, val1, val2)) + + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";") or toks[idx] in JOIN_KEYWORDS): + break + + if idx < len_ and toks[idx] in LOGIC_AND_OR: + conds.append(toks[idx]) + idx += 1 # skip and/or + + return idx, conds + + +def parse_select(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + assert toks[idx] == "select", "'select' not found" + idx += 1 + if idx < len_ and toks[idx] == "distinct": + idx += 1 + val_units = [] + + while idx < len_ and toks[idx] not in CLAUSE_KEYWORDS: + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + + return idx, val_units + + +def parse_from(toks, start_idx, tables_with_alias, schema): + """ + Assume in the from clause, all table units are combined with join + """ + assert "from" in toks[start_idx:], "'from' not found" + + len_ = len(toks) + idx = toks.index("from", start_idx) + 1 + default_tables = [] + table_units = [] + conds = [] + last_table = None + + while idx < len_: + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, sql = parse_sql(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["sql"], sql)) + last_table = sql["from"]["table_units"][0][1].strip("_") + else: + if idx < len_ and toks[idx] == "join": + idx += 1 # skip join + idx, table_unit, table_name = parse_table_unit(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["table_unit"], table_unit)) + default_tables.append(table_name) + if idx < len_ and toks[idx] == "on": + idx += 1 # skip on + idx, this_conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + if len(conds) > 0: + conds.append("and") + conds.extend(this_conds) + + if isBlock: + assert toks[idx] == ")" + idx += 1 + if idx < len_ and toks[idx] == "a": + assert last_table is not None, "last_table should be a table name strin, not None" + tables_with_alias["a"] = last_table + idx += 2 + elif idx < len_ and toks[idx] == "b": + assert last_table is not None, "last_table should be a table name strin, not None" + tables_with_alias["b"] = last_table + idx += 1 + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + break + + return [idx, table_units, conds, default_tables] + + +def parse_where(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "where": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_group_by(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + col_units = [] + + if idx >= len_ or toks[idx] != "group": + return idx, col_units + + idx += 1 + 
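# after "group" has been consumed, the next token must be "by" +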
assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, col_unit = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + col_units.append(col_unit) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, col_units + + +def parse_order_by(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + val_units = [] + order_type = "asc" # default type is 'asc' + + if idx >= len_ or toks[idx] != "order": + return idx, val_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] in ORDER_OPS: + order_type = toks[idx] + idx += 1 + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, (order_type, val_units) + + +def parse_having(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "having": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_limit(toks, start_idx): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + if idx < len_ and toks[idx] == "limit": + idx += 2 + return idx, int(toks[idx - 1]) + + return idx, None + + +def parse_sql(toks, start_idx, tables_with_alias, schema): + """ + Args: + + Returns: + """ + isBlock = False # indicate whether this is a block of sql/sub-sql + len_ = len(toks) + idx = start_idx + + sql = {} + if toks[idx] == "(": + isBlock = True + idx += 1 + + # parse from clause in order to get default tables + from_end_idx, table_units, conds, default_tables = parse_from(toks, start_idx, tables_with_alias, schema) + sql["from"] = {"table_units": table_units, "conds": conds} + # select clause + _, select_col_units = parse_select(toks, idx, tables_with_alias, schema, default_tables) + idx = from_end_idx + sql["select"] = select_col_units + # where clause + idx, where_conds = parse_where(toks, idx, tables_with_alias, schema, default_tables) + sql["where"] = where_conds + # group by clause + idx, group_col_units = parse_group_by(toks, idx, tables_with_alias, schema, default_tables) + sql["groupBy"] = group_col_units + # having clause + idx, having_conds = parse_having(toks, idx, tables_with_alias, schema, default_tables) + sql["having"] = having_conds + # order by clause + idx, order_col_units = parse_order_by(toks, idx, tables_with_alias, schema, default_tables) + sql["orderBy"] = order_col_units + # limit clause + idx, limit_val = parse_limit(toks, idx) + sql["limit"] = limit_val + + idx = skip_semicolon(toks, idx) + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + idx = skip_semicolon(toks, idx) + + # intersect/union/except clause + for op in SQL_OPS: # initialize IUE + sql[op] = None + if idx < len_ and toks[idx] in SQL_OPS: + sql_op = toks[idx] + idx += 1 + idx, IUE_sql = parse_sql(toks, idx, tables_with_alias, schema) + sql[sql_op] = IUE_sql + return idx, sql + + +def load_data(fpath): + """ + Args: + + Returns: + """ + with open(fpath) as f: + data = 
json.load(f) + return data + + +def get_sql(schema, query, single_equal=False): + """ + Args: + + Returns: + """ + toks = tokenize(query, single_equal=single_equal) + tables_with_alias = get_tables_with_alias(schema.schema, toks) + _, sql = parse_sql(toks, 0, tables_with_alias, schema) + + return sql + + +def skip_semicolon(toks, start_idx): + """ + Args: + + Returns: + """ + idx = start_idx + while idx < len(toks) and toks[idx] == ";": + idx += 1 + return idx + + +################################# + + +class Evaluator(object): + """A simple evaluator""" + + def __init__(self): + """init""" + self.partial_scores = None + + def _eval_exact_match(self, pred, gold, value_match=True): + """eval_exact_match""" + partial_scores = self.eval_partial_match(pred, gold, value_match=value_match) + self.partial_scores = partial_scores + + for _, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + + gold_table_units = gold["from"]["table_units"] + pred_table_units = pred["from"]["table_units"] + if len(pred_table_units) != len(gold_table_units) or any( + map(lambda x: type(x[0][1]) != type(x[1][1]), zip(pred_table_units, gold_table_units)) # noqa: E721 + ): + return 0 + if type(gold_table_units[0][1]) is not dict: + return 1 if sorted(gold_table_units) == sorted(pred_table_units) else 0 + + # TODO: 严格考虑顺序 + def __eval_from_sql(pred_tables, gold_tables): + """eval from sql""" + for pred_table_unit, gold_table_unit in zip(pred_tables, gold_tables): + pred_table_sql = pred_table_unit[1] + gold_table_sql = gold_table_unit[1] + _, _, correct = eval_nested(pred_table_sql, gold_table_sql, value_match) + if correct == 0: + return 0 + return 1 + + correct = __eval_from_sql(pred_table_units, gold_table_units) + if len(gold_table_units) > 1 and correct == 0: + return __eval_from_sql(pred_table_units, list(reversed(gold_table_units))) + else: + return correct + + def eval_exact_match(self, pred, gold, value_match=True): + """wrapper of evaluate examct match, to process + `SQL1 intersect/union SQL2` vs `SQL2 intersect/union SQL1` + + Args: + pred (TYPE): NULL + gold (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + score = self._eval_exact_match(pred, gold, value_match=value_match) + if score == 1: + return score + + if gold["union"] is not None: + gold_tmp = copy.deepcopy(gold) + new_gold = gold_tmp["union"] + gold_tmp["union"] = None + new_gold["union"] = gold_tmp + return self._eval_exact_match(pred, new_gold, value_match=value_match) + elif gold["intersect"] is not None: + gold_tmp = copy.deepcopy(gold) + new_gold = gold_tmp["intersect"] + gold_tmp["intersect"] = None + new_gold["intersect"] = gold_tmp + return self._eval_exact_match(pred, new_gold, value_match=value_match) + else: + return 0 + + def eval_partial_match(self, pred, gold, value_match=True): + """eval_partial_match""" + res = {} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, gold_total) + res["select(no AGG)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["where"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, 
pred_total, gold_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_group(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_having(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["having"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_order(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_and_or(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_IUEN(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_keywords(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + return res + + +class Schema(object): + """ + Simple schema which maps table&column to a unique identifier + """ + + def __init__(self, db): + """init""" + self._schema = self._build_schema(db) + self._id_map = self._map(self._schema) + + @property + def schema(self): + """_schema property""" + return self._schema + + @property + def id_map(self): + """_id_map property""" + return self._id_map + + def _build_schema(self, db): + """build schema by input db + + Args: + db (dict): NULL + + Returns: TODO + + Raises: NULL + """ + tables = [x.lower() for x in db.get("table_names_original", db["table_names"])] + dct_table2cols = defaultdict(list) + for table_id, column in db.get("column_names_original", db["column_names"]): + if table_id < 0: + continue + dct_table2cols[tables[table_id]].append(column.lower()) + return dct_table2cols + + def _map(self, schema): + """map""" + id_map = {"*": "__all__"} + for key, vals in schema.items(): + for val in vals: + id_map[key.lower() + "." + val.lower()] = "__" + key.lower() + "." 
+ val.lower() + "__" + + for key in schema: + id_map[key.lower()] = "__" + key.lower() + "__" + + return id_map + + +def get_scores(count, pred_total, gold_total): + """ + Args: + + Returns: + """ + if pred_total != gold_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, gold): + """ + Args: + + Returns: + """ + pred_sel = copy.deepcopy(pred["select"]) + gold_sel = copy.deepcopy(gold["select"]) + gold_wo_agg = [unit[1] for unit in gold_sel] + pred_total = len(pred_sel) + gold_total = len(gold_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in gold_sel: + cnt += 1 + gold_sel.remove(unit) + if unit[1] in gold_wo_agg: + cnt_wo_agg += 1 + gold_wo_agg.remove(unit[1]) + + return [gold_total, pred_total, cnt, cnt_wo_agg] + + +def eval_nested_cond(pred_cond, gold_cond, value_match=True): + """ + + Args: + pred_cond (TYPE): NULL + gold_cond (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if pred_cond[:3] != gold_cond[:3] or type(pred_cond[3]) is not dict: + return 0 + + _, _, correct = eval_nested(pred_cond[3], gold_cond[3], value_match) + if correct == 0: + return 0 + + return pred_cond[4] == gold_cond[4] + + +def eval_cond(pred, gold, value_match=True): + """ + + Args: + pred (TYPE): NULL + gold (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + + def _equal(p, g): + p = p.strip("\"'") if type(p) is str else p + g = g.strip("\"'") if type(g) is str else g + if str(p) == str(g): + return True + if is_float(p) and is_float(g) and float(p) == float(g): + return True + return False + + if not value_match: + if not isinstance(pred[3], dict): + pred[3] = VALUE + if pred[4] is not None: + pred[4] = VALUE + + if not isinstance(gold[3], dict): + gold[3] = VALUE + if gold[4] is not None: + gold[4] = VALUE + + if type(gold[3]) is dict: + return eval_nested_cond(pred, gold, value_match) + + if pred[:3] != gold[:3]: + return 0 + + if _equal(pred[3], gold[3]) and _equal(pred[4], gold[4]): + return 1 + else: + return 0 + + +def eval_where(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + pred_conds = copy.deepcopy([unit for unit in sorted(pred["where"][::2], key=lambda x: [str(i) for i in x])]) + gold_conds = copy.deepcopy([unit for unit in sorted(gold["where"][::2], key=lambda x: [str(i) for i in x])]) + pred_total = len(pred_conds) + gold_total = len(gold_conds) + cnt = 0 + cnt_wo_agg = 0 + + # 已经排过序了,可以一一比对 + for unit_p, unit_g in zip(pred_conds, gold_conds): + cnt += eval_cond(unit_p, unit_g, value_match) + + if unit_p[2] == unit_g[2]: + cnt_wo_agg += 1 + + return [gold_total, pred_total, cnt, cnt_wo_agg] + + +def eval_group(pred, gold): + """ + Args: + + Returns: + """ + pred_cols = [unit[1] for unit in pred["groupBy"]] + gold_cols = [unit[1] for unit in gold["groupBy"]] + pred_total = len(pred_cols) + gold_total = len(gold_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + gold_cols = [gold.split(".")[1] if "." 
in gold else gold for gold in gold_cols] + for col in pred_cols: + if col in gold_cols: + cnt += 1 + gold_cols.remove(col) + return [gold_total, pred_total, cnt] + + +def eval_having(pred, gold, value_match=True): + """不评估and/or,在其它分支专门评估 + Args: + + Returns: + """ + if len(pred["having"]) != len(gold["having"]): + return [1, 1, 0] + + pred_total = len(pred["having"][::2]) + gold_total = len(gold["having"][::2]) + cnt = 0 + for pred_cond, gold_cond in zip(sorted(pred["having"][::2]), sorted(gold["having"][::2])): + if eval_cond(pred_cond, gold_cond, value_match) == 1: + cnt += 1 + + return [gold_total, pred_total, cnt] + + +def eval_order(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + pred_total = gold_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(gold["orderBy"]) > 0: + gold_total = 1 + + if value_match: + if len(gold["orderBy"]) > 0 and pred["orderBy"] == gold["orderBy"] and pred["limit"] == gold["limit"]: + cnt = 1 + else: + if len(gold["orderBy"]) > 0 and pred["orderBy"] == gold["orderBy"]: + cnt = 1 + + return [gold_total, pred_total, cnt] + + +def eval_and_or(pred, gold): + """ + Args: + + Returns: + """ + + def _extract(conds): + """extract condition and/or""" + op_set = set() + for i in range(1, len(conds) - 1, 2): + left = conds[i - 1][:3] + right = conds[i + 1][:3] + left, right = list(sorted([left, right])) + op_set.add(f"{left}{conds[i].lower()}{right}") + return op_set + + # eval where and/or + pred_op_set = _extract(pred["where"]) + gold_op_set = _extract(gold["where"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + # eval having and/or + pred_op_set = _extract(pred["having"]) + gold_op_set = _extract(gold["having"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + return [1, 1, 1] + + +def get_nestedSQL(sql): + """ + Args: + + Returns: + """ + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + ## + for from_nest_sql in [table_unit[1] for table_unit in sql["from"]["table_units"] if table_unit[0] == "sql"]: + nested.append(from_nest_sql) + + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + gold_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if gold is not None: + gold_total += 1 + if pred is not None and gold is not None: + cnt += Evaluator().eval_exact_match(pred, gold, value_match=value_match) + return [gold_total, pred_total, cnt] + + +def eval_IUEN(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + lt1, pt1, cnt1 = eval_nested(pred["intersect"], gold["intersect"], value_match=value_match) + lt2, pt2, cnt2 = eval_nested(pred["except"], gold["except"], value_match=value_match) + lt3, pt3, cnt3 = eval_nested(pred["union"], gold["union"], value_match=value_match) + gold_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return [gold_total, pred_total, cnt] + + +def get_keywords(sql): + """ + Args: + + Returns: + """ + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + 
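# sql["orderBy"][0] holds the sort direction ("asc"/"desc"), which is counted as a keyword here +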
res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + # TODO + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, gold): + """ + Args: + + Returns: + """ + pred_keywords = get_keywords(pred) + gold_keywords = get_keywords(gold) + pred_total = len(pred_keywords) + gold_total = len(gold_keywords) + cnt = 0 + + for k in pred_keywords: + if k in gold_keywords: + cnt += 1 + return [gold_total, pred_total, cnt] + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + """ + Args: + + Returns: + """ + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.id_map.values(): + if "." in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + """ + Args: + + Returns: + """ + if col_unit is None: + return col_unit + + agg_id, col_id = col_unit[0], col_unit[1] + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + return agg_id, col_id + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + """ + Args: + + Returns: + """ + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return [unit_op, col_unit1, col_unit2] + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap): + """ + Args: + + Returns: + """ + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, dict): + col_unit_or_sql = rebuild_sql_col(valid_col_units, col_unit_or_sql, kmap) + elif isinstance(col_unit_or_sql, tuple): + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap): + """ + Args: + + Returns: + """ + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is dict: + rebuild_sql_col(valid_col_units, val1, kmap) + else: + val1 = str(val1) + val2 = str(val2) + + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return [not_op, op_id, val_unit, val1, val2] + + +def rebuild_condition_col(valid_col_units, condition, kmap): + """ + Args: + + Returns: + """ + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + """ + Args: + 
+ Returns: + """ + if sel is None: + return sel + new_list = [] + for it in sel: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + return new_list + + +def rebuild_from_col(valid_col_units, from_, kmap): + """ + Args: + + Returns: + """ + if from_ is None: + return from_ + + fn_proc = lambda x: rebuild_table_unit_col(valid_col_units, x, kmap) + from_["table_units"] = [fn_proc(table_unit) for table_unit in from_["table_units"]] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], kmap) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + """ + Args: + + Returns: + """ + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + """ + Args: + + Returns: + """ + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [(agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap)) for agg_id, val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap): + """ + Args: + + Returns: + """ + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap) + + return sql + + +def build_foreign_key_map(entry): + """ + Args: + + Returns: + """ + cols_orig = entry["column_names"] + tables_orig = entry["table_names"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." 
+ c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + """keyset_in_list""" + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + """ + Args: + + Returns: + """ + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +def evaluate_complex(table, gold, predict, mode="exact", single_equal=False): + """evaluate main + + Args: + table (str): all tables file name + gold (str): gold file name + pred (str): predict file name + mode (str): partial or exact + + Returns: float + exact match acc + """ + kmaps = build_foreign_key_map_from_json(table) + + with open(table) as ifs: + table_list = json.load(ifs) + table_dict = {} + for table in table_list: + if table["db_id"] in table_dict: + continue + table_dict[table["db_id"]] = table + with open(gold) as ifs: + gold_list = [l.strip().split("\t") for l in ifs if len(l.strip()) > 0] + gold_dict = dict([(x[0], x[1:]) for x in gold_list]) + + with open(predict) as ifs: + pred_list = [l.strip().split("\t") for l in ifs if len(l.strip()) > 0] + pred_dict = dict([(x[0], x[1:]) for x in pred_list]) + + evaluator = Evaluator() + + scores = { + "all": {"count": 0, "exact": 0, "acc": 0}, + "select": {"acc": 0, "rec": 0, "f1": 0}, + "select(no AGG)": {"acc": 0, "rec": 0, "f1": 0}, + "where": {"acc": 0, "rec": 0, "f1": 0}, + "where(no OP)": {"acc": 0, "rec": 0, "f1": 0}, + "group": {"acc": 0, "rec": 0, "f1": 0}, + "having": {"acc": 0, "rec": 0, "f1": 0}, + "order": {"acc": 0, "rec": 0, "f1": 0}, + "and/or": {"acc": 0, "rec": 0, "f1": 0}, + "IUEN": {"acc": 0, "rec": 0, "f1": 0}, + "keywords": {"acc": 0, "rec": 0, "f1": 0}, + } + + scores_novalue = { + "all": {"count": 0, "exact": 0, "acc": 0}, + "select": {"acc": 0, "rec": 0, "f1": 0}, + "select(no AGG)": {"acc": 0, "rec": 0, "f1": 0}, + "where": {"acc": 0, "rec": 0, "f1": 0}, + "where(no OP)": {"acc": 0, "rec": 0, "f1": 0}, + "group": {"acc": 0, "rec": 0, "f1": 0}, + "having": {"acc": 0, "rec": 0, "f1": 0}, + "order": {"acc": 0, "rec": 0, "f1": 0}, + "and/or": {"acc": 0, "rec": 0, "f1": 0}, + "IUEN": {"acc": 0, "rec": 0, "f1": 0}, + "keywords": {"acc": 0, "rec": 0, "f1": 0}, + } + + eval_err_num = 0 + for ins_id, g in gold_dict.items(): + scores["all"]["count"] += 1 + scores_novalue["all"]["count"] += 1 + if ins_id not in pred_dict: + continue + p = pred_dict[ins_id] + + pred_str = p[0] + gold_str, db_id = g + schema = Schema(table_dict[db_id]) + gold_str = gold_str.replace("==", "=") + gold_sql = get_sql(schema, gold_str, single_equal=single_equal) + + kmap = kmaps[db_id] + # rebuild sql for value evaluation + g_valid_col_units = build_valid_col_units(gold_sql["from"]["table_units"], schema) + gold_sql = rebuild_sql_col(g_valid_col_units, gold_sql, kmap) + + try: + pred_str = pred_str.replace("==", "=") + pred_sql = get_sql(schema, pred_str, single_equal=single_equal) + + p_valid_col_units = 
build_valid_col_units(pred_sql["from"]["table_units"], schema) + pred_sql = rebuild_sql_col(p_valid_col_units, pred_sql, kmap) + except Exception: + # If pred_sql is not valid, then we will use an empty sql to evaluate with the correct sql + pred_sql = g_empty_sql + eval_err_num += 1 + + exact_score = evaluator.eval_exact_match(pred_sql, gold_sql, value_match=True) + exact_score_novalue = evaluator.eval_exact_match(pred_sql, gold_sql, value_match=False) + if exact_score == 0: + logging.debug("error instance %s:\npred: %s\ngold: %s" % (ins_id, pred_str, gold_str)) + scores["all"]["exact"] += exact_score + scores_novalue["all"]["exact"] += exact_score_novalue + score = evaluator.eval_partial_match(pred_sql, gold_sql, value_match=True) + for k, v in score.items(): + for k1, v1 in v.items(): + if k1 in scores[k].keys(): + scores[k][k1] += v1 + + score_novalue = evaluator.eval_partial_match(pred_sql, gold_sql, value_match=False) + for k, v in score_novalue.items(): + for k1, v1 in v.items(): + if k1 in scores_novalue[k].keys(): + scores_novalue[k][k1] += v1 + + if scores["all"]["count"] == 0: + logging.warn("the number of evaluated instance is zero") + return 0.0, 0.0 + if scores_novalue["all"]["count"] == 0: + logging.warn("the number of evaluated no value instance is zero") + return 0.0, 0.0 + + for k, v in scores.items(): + if "acc" in v.keys() and "rec" in v.keys(): + scores[k]["acc"] = scores[k]["acc"] / scores["all"]["count"] * 1.0 + scores[k]["rec"] = scores[k]["rec"] / scores["all"]["count"] * 1.0 + scores[k]["f1"] = 2 * scores[k]["acc"] * scores[k]["rec"] / (scores[k]["acc"] + scores[k]["rec"]) + + for k, v in scores_novalue.items(): + if "acc" in v.keys() and "rec" in v.keys(): + scores_novalue[k]["acc"] = scores_novalue[k]["acc"] / scores_novalue["all"]["count"] * 1.0 + scores_novalue[k]["rec"] = scores_novalue[k]["rec"] / scores_novalue["all"]["count"] * 1.0 + scores_novalue[k]["f1"] = ( + 2 + * scores_novalue[k]["acc"] + * scores_novalue[k]["rec"] + / (scores_novalue[k]["acc"] + scores_novalue[k]["rec"]) + ) + scores["all"]["acc"] = scores["all"]["exact"] / scores["all"]["count"] + scores_novalue["all"]["acc"] = scores_novalue["all"]["exact"] / scores_novalue["all"]["count"] + + return scores, scores_novalue + + +def evaluate(table, gold, predict, mode="exact", dataset="DuSQL"): + """ + dataset:['CSpider', 'DuSQL', 'NL2SQL'] + """ + if dataset == "NL2SQL": + scores, scores_novalue = evaluate_NL2SQL(table, gold, predict, mode=mode, single_equal=True) + elif dataset == "DuSQL": + scores, scores_novalue = evaluate_complex(table, gold, predict, mode=mode, single_equal=True) + else: + scores, scores_novalue = evaluate_complex(table, gold, predict, mode=mode, single_equal=True) + return scores, scores_novalue + + +if __name__ == "__main__": + from argparse import ArgumentParser + + arg_parser = ArgumentParser() + arg_parser.add_argument("-g", "--gold", dest="gold", type=str) + arg_parser.add_argument("-p", "--pred", dest="pred", type=str) + arg_parser.add_argument("-t", "--table", dest="table", type=str) + arg_parser.add_argument("-d", "--dataset", choices=("NL2SQL", "CSpider", "DuSQL"), required=True, type=str) + arg_parser.add_argument("--debug", default=False, action="store_true") + args = arg_parser.parse_args() + + logging.basicConfig( + level=logging.DEBUG if args.debug else logging.INFO, + format="%(levelname)s: %(asctime)s %(filename)s [%(funcName)s:%(lineno)d][%(process)d] %(message)s", + datefmt="%m-%d %H:%M:%S", + filename="eval.log", + filemode="a", + ) + + out, out_novalue = 
evaluate(args.table, args.gold, args.pred, dataset=args.dataset) + print("with value:") + print(out["all"]) + print("*" * 20) + print("without_value") + print(out_novalue["all"] if out_novalue else "None") diff --git a/examples/text_to_sql/RAT-SQL/evaluation/utils.py b/examples/text_to_sql/RAT-SQL/evaluation/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..081b2df3b1f75822bf6b1117b21ad2fbaa6298e9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/utils.py @@ -0,0 +1,400 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import json +import logging +import re + +op_sql_dict = {0: ">", 1: "<", 2: "==", 3: "!="} +agg_sql_dict = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"} +conn_sql_dict = {0: "", 1: "and", 2: "or"} + +# from IRNet keywords, need to be simplify +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +COND_OPS = ("not_in", "between", "==", ">", "<", ">=", "<=", "!=", "in", "like") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +LOGIC_AND_OR = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +CONST_COLUMN = set(["time_now"]) + +EXPECT_BRACKET_PRE_TOKENS = set(AGG_OPS + SQL_OPS + COND_OPS + CLAUSE_KEYWORDS + ("from", ",")) + +g_empty_sql = { + "select": [], + "from": {"conds": [], "table_units": []}, + "where": [], + "groupBy": [], + "having": [], + "orderBy": [], + "limit": None, + "except": None, + "intersect": None, + "union": None, +} + + +def is_float(value): + """is float""" + try: + float(value) + return True + except ValueError: + return False + except TypeError: + return False + + +def get_scores(count, pred_total, gold_total): + """ + Args: + + Returns: + """ + if pred_total != gold_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def tokenize_NL2SQL(string, cols, single_equal=False, math=True): + """ + Args: + + Returns: + """ + + string = string.replace("'", '"').lower() + assert string.count('"') % 2 == 0, "Unexpected quote" + + re_cols = [i.lower() for i in cols] + + def _extract_value(string): + """extract values in sql""" + fields = string.split('"') + for idx, tok in enumerate(fields): + if idx % 2 == 1: + fields[idx] = '"%s"' % (tok) + return fields + + def _resplit(tmp_tokens, fn_split, fn_omit): + """resplit""" + new_tokens = [] + for token in tmp_tokens: + token = token.strip() + if fn_omit(token): + new_tokens.append(token) + elif re.match(r"\d\d\d\d-\d\d(-\d\d)?", token): + new_tokens.append('"%s"' % (token)) + else: + new_tokens.extend(fn_split(token)) + return new_tokens + + def _split_aggs(tmp_tokens): + """split aggs in select""" + new_toks = [] + for i, tok in enumerate(tmp_tokens): + if tok in ("from", "where"): + 
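# once the FROM/WHERE clause starts, keep the remaining tokens unchanged and stop splitting +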
new_toks.extend(tmp_tokens[i:]) + break + if not ((tok.endswith(")") or tok.endswith("),")) and len(tok) > 5): + new_toks.extend(tok.split(",")) + continue + + extra = "" + if tok.endswith(","): + extra = "," + tok = tok[:-1] + + if tok[:4] in ("sum(", "avg(", "max(", "min("): + new_toks.extend([tok[:3], "(", tok[4:-1], ")"]) + elif tok[:6] == "count(": + new_toks.extend(["count", "(", tok[6:-1], ")"]) + else: + new_toks.append(tok) + + if extra: + new_toks.append(extra) + + return new_toks + + def join_by_col(toks, cols): + new_toks = [] + _len = len(toks) + i = 0 + while i < _len - 1: + merge = False + for j in range(10): + if "".join(toks[i : i + j]) in cols: + new_toks.append("".join(toks[i : i + j])) + i += j + merge = True + if not merge: + new_toks.append(toks[i]) + i += 1 + new_toks.append(toks[-1]) + return new_toks + + tokens_tmp = _extract_value(string) + + two_bytes_op = ["==", "!=", ">=", "<=", "<>", ""] + sep2 = re.compile("(" + "|".join(two_bytes_op) + ")") # 多字节运算符 + tokens_tmp = _resplit(tokens_tmp, lambda x: x.split(" "), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep2, x), lambda x: x.startswith('"')) + tokens_tmp = _split_aggs(tokens_tmp) + tokens = list(filter(lambda x: x.strip() != "", tokens_tmp)) + + tokens = join_by_col(tokens, re_cols) + + def _post_merge(tokens): + """merge: + * col name with "(", ")" + * values with +/- + """ + idx = 1 + while idx < len(tokens): + if tokens[idx] == "(" and tokens[idx - 1] not in EXPECT_BRACKET_PRE_TOKENS and tokens[idx - 1] != "=": + while idx < len(tokens): + tmp_tok = tokens.pop(idx) + tokens[idx - 1] += tmp_tok + if tmp_tok == ")": + break + elif tokens[idx] in ("+", "-") and tokens[idx - 1] in COND_OPS and idx + 1 < len(tokens): + tokens[idx] += tokens[idx + 1] + tokens.pop(idx + 1) + idx += 1 + else: + idx += 1 + return tokens + + tokens = _post_merge(tokens) + if single_equal: + tokens = [i if i != "=" else "==" for i in tokens] + return tokens + + +def sql2query(sql, cols): + """ + transform sql json to sql query, this is only for NL2SQL, eg. 
select a, b where a op val1 + """ + + sels = sql["sel"] + aggs = sql["agg"] + op = sql["cond_conn_op"] + conds = sql["conds"] + + condstrs = [f'{cols[cond[0]]} {op_sql_dict[cond[1]]} "{cond[2]}"' for cond in conds] + cond_str = f" {conn_sql_dict[op]} ".join(condstrs) + + def agg_col(agg, col): + if agg == 0: + return cols[col] + else: + return f"{agg_sql_dict[agg]} ( {cols[col]} )" + + selstrs = [agg_col(i, j) for i, j in zip(aggs, sels)] + sel_str = " , ".join(selstrs) + + return f"SELECT {sel_str} WHERE {cond_str}" + + +def query2sql(query, cols, single_equal=False, with_value=True): + + cols = [i.lower() for i in cols] + + sql_op_dict = {} + sql_agg_dict = {} + sql_conn_dict = {} + for k, v in op_sql_dict.items(): + sql_op_dict[v] = k + sql_op_dict[v.lower()] = k + for k, v in agg_sql_dict.items(): + sql_agg_dict[v] = k + sql_agg_dict[v.lower()] = k + for k, v in conn_sql_dict.items(): + sql_conn_dict[v] = k + sql_conn_dict[v.lower()] = k + + query = tokenize_NL2SQL(query, cols, single_equal=single_equal, math=False) + assert query[0] == "select" + + def parse_cols(toks, start_idx): + """ + :returns next idx, (agg, col) + """ + if "from" in toks: + toks = toks[: toks.index("from")] + idx = start_idx + len_ = len(toks) + outs = [] + while idx < len_: + if toks[idx] in AGG_OPS: + idx += 1 + assert idx < len_ and toks[idx] == "(", toks[idx] + idx += 1 + agg, col = toks[start_idx], toks[idx] + idx += 1 + assert idx < len_ and toks[idx] == ")", toks[idx] + "".join(toks) + idx += 1 + outs.append((agg, col)) + elif toks[idx] == ",": + idx += 1 + else: + agg, col = "", toks[idx] + idx += 1 + outs.append(("", col)) + return outs + + def _format_col(old_col): + """format""" + if old_col.lower().startswith("table_"): + return old_col.split(".", 1)[1] + else: + return old_col + + if "where" not in query: + cond_index = len(query) + conn = "" + conds = [] + else: + cond_index = query.index("where") + condstr = query[cond_index + 1 :] + conn = [i for i in condstr[3::4]] + assert len(set(conn)) < 2, conn + conn = list(set(conn))[0] if conn else "" + conds = [condstr[i : i + 3] for i in range(len(condstr))[::4]] + sels = parse_cols(query[:cond_index], 1) + + sql = {} + + sql["agg"] = [sql_agg_dict[i[0]] for i in sels] + sql["cond_conn_op"] = sql_conn_dict[conn] + sql["sel"] = [cols.index(_format_col(i[1])) for i in sels] + if with_value: + sql["conds"] = [[cols.index(_format_col(c[0])), sql_op_dict[c[1]], '"' + c[2].strip('"') + '"'] for c in conds] + else: + sql["conds"] = [[cols.index(_format_col(c[0])), sql_op_dict[c[1]], "1"] for c in conds] + + sql_sels = [(sql_agg_dict[i[0]], cols.index(_format_col(i[1]))) for i in sels] + return sql, sql_sels + + +def evaluate_NL2SQL(table, gold, predict, single_equal=False, mode=None): + scores = {} + scores_novalue = {} + + # load db + with open(table) as ifs: + table_list = json.load(ifs) + table_dict = {} + for table in table_list: + table_dict[table["db_id"]] = table + + # load qa + with open(gold, "r", encoding="utf-8") as f1, open(predict, "r", encoding="utf-8") as f2: + gold_list = [l.strip().split("\t") for l in f1 if len(l.strip()) > 0] + gold_dict = dict([(x[0], x[1:]) for x in gold_list]) + + pred_list = [l.strip().split("\t") for l in f2 if len(l.strip()) > 0] + pred_dict = dict([(x[0], x[1]) for x in pred_list]) + + right = total = 0 + cnt_sel = 0 + cnt_cond = cnt_conn = 0 + + def compare_set(gold, pred): + _pred = copy.deepcopy(pred) + _gold = copy.deepcopy(gold) + + pred_total = len(_pred) + gold_total = len(_gold) + cnt = 0 + + for unit in 
_pred: + if unit in _gold: + cnt += 1 + _gold.remove(unit) + return cnt, pred_total, gold_total + + for qid, item in gold_dict.items(): + total += 1 + if qid not in pred_dict: + continue + sql_gold, db_id = "".join(item[0:-1]), item[-1] + + db = table_dict[db_id] + cols = [i[1] for i in db["column_names"]] + + sql_pred = pred_dict[qid] + + try: + sql_gold = sql_gold.replace("==", "=") + sql_pred = sql_pred.replace("==", "=") + components_gold, sels_gold = query2sql(sql_gold, cols, single_equal=single_equal) + components_pred, sels_pred = query2sql(sql_pred, cols, single_equal=single_equal) + + cnt, pred_total, gold_total = compare_set(sels_gold, sels_pred) + score_sels, _, _ = get_scores(cnt, pred_total, gold_total) + cnt, pred_total, gold_total = compare_set(components_gold["conds"], components_pred["conds"]) + score_conds, _, _ = get_scores(cnt, pred_total, gold_total) + score_conn = components_gold["cond_conn_op"] == components_pred["cond_conn_op"] + + if score_sels: + cnt_sel += 1 + if score_conds: + cnt_cond += 1 + if score_conn: + cnt_conn += 1 + if score_sels and score_conds and score_conn: + right += 1 + else: + logging.debug("error instance %s:\npred: %s\ngold: %s" % (qid, sql_pred, sql_gold)) + except Exception: + # traceback.print_exc() + logging.warning("parse sql error, error sql:") + logging.warning(sql_gold + "|||" + sql_pred) + # raise e + continue + + scores["all"] = dict([("count", total), ("exact", right), ("acc", right * 1.0 / total)]) + scores["select"] = dict([("count", total), ("exact", cnt_sel), ("acc", cnt_sel * 1.0 / total)]) + scores["condition"] = dict([("count", total), ("exact", cnt_cond), ("acc", cnt_cond * 1.0 / total)]) + scores["connection"] = dict([("count", total), ("exact", cnt_conn), ("acc", cnt_conn * 1.0 / total)]) + + return scores, scores_novalue + + +if __name__ == "__main__": + print(query2sql("SELECT 所在省份 , 产线名称 WHERE 日熔量(吨) < 600", [])) + print(query2sql("SELECT MAX ( 货币资金(亿元) ) WHERE 总资产(亿元) > 100 or 净资产(亿元) > 100", [])) + print(query2sql("SELECT 股价 , EPS17A WHERE 铁路公司 = 广深铁路", ["股价", "铁路公司", "EPS17A"], True)) + cols = ["公司", "2014(亿元)", "2015(亿元)", "2016(亿元)"] + print(query2sql("SELECT COUNT ( 公司 ) WHERE 2014(亿元) > 20 and 2015(亿元) > 20 and 2016(亿元) > 20", cols)) + + # print(query2sql("SELECT 书名/Title WHERE 索书号/CallNo. == BF637.U53C555=12010 or ISBN == 9.78142212482e+12", ["书名/Title","索书号/CallNo.",'ISBN'])) + # print(tokenize("SELECT 标称生产企业名称 WHERE 规格(包装规格) == 187.2g/盒 and 标称产品名称 == 富兰克牌西洋参含片", math=False)) + # print(tokenize("SELECT 设备型号 WHERE 生产企业 == AISINAWCO.,LTD. 
or 设备名称 == WCDMA无线数据终端", math=False))
+    # print(tokenize("SELECT sum(t1.amount_claimed) FROM claim_headers AS t1 JOIN claims_documents AS t2 ON t1.claim_header_id = t2.claim_id WHERE t2.created_date = ( SELECT created_date FROM claims_documents ORDER BY created_date LIMIT 1 )"))
+    # print(query2sql("SELECT 书号(ISBN) WHERE 教材名称 == 线性代数 or 教材名称 == 中级有机化学", ["书号(ISBN)", "教材名称" ]))
diff --git a/examples/text_to_sql/RAT-SQL/requirements.txt b/examples/text_to_sql/RAT-SQL/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..4c5c04d646299aa5a3171b4e24153f503b86225d
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/requirements.txt
@@ -0,0 +1,11 @@
+nvgpu
+tqdm
+LAC
+cn2an
+setproctitle
+sentencepiece
+attrs
+asdl
+networkx
+pyrsistent
+jsonnet
diff --git a/examples/text_to_sql/RAT-SQL/run.sh b/examples/text_to_sql/RAT-SQL/run.sh
new file mode 100644
index 0000000000000000000000000000000000000000..e625033cf6b322c7e8e43857e71106557b59beed
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/run.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+
+WORKROOT=$(cd $(dirname $0); pwd)
+cd $WORKROOT
+
+#### gpu libs ####
+# 添加cuda, cudnn库的路径
+export LD_LIBRARY_PATH=/home/work/cuda-10.0/lib64:$LD_LIBRARY_PATH
+export LD_LIBRARY_PATH=/home/work/cuda-10.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
+# 单独添加库的路径
+##export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v7.4/cuda/lib64:$LD_LIBRARY_PATH
+# 添加NCCL库的路径。单卡训练时非必须
+export LD_LIBRARY_PATH=/home/work/nccl_2.3.5/lib/:$LD_LIBRARY_PATH
+
+#### paddle ####
+# 是否是分布式训练,0标识是分布式,1标识是单机
+export PADDLE_IS_LOCAL=1
+# 申请显存比例
+export FLAGS_fraction_of_gpu_memory_to_use=1.0
+# 选择要使用的GPU
+export CUDA_VISIBLE_DEVICES=`python script/available_gpu.py --best 1`
+# CPU 核数
+export CPU_NUM=1
+# 表示是否使用垃圾回收策略来优化网络的内存使用,<0表示禁用,>=0表示启用
+export FLAGS_eager_delete_tensor_gb=1.0
+# 是否使用快速垃圾回收策略
+export FLAGS_fast_eager_deletion_mode=1
+# 垃圾回收策略释放变量的内存大小百分比,范围为[0.0, 1.0]
+export FLAGS_memory_fraction_of_eager_deletion=1
+# 如果为1,则会在allreduce_op_handle中调用cudaStreamSynchronize(nccl_stream),这种模式在某些情况下可以获得更好的性能
+#export FLAGS_sync_nccl_allreduce=1
+
+#### python ####
+export PYTHONPATH=$WORKROOT:$PYTHONPATH
+#echo "PYTHONPATH=$PYTHONPATH"
+## python 3.6/3.7 is recommended
+PYTHON_BIN=`which python3`
+
+echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
+echo "running command: ($PYTHON_BIN $@)"
+$PYTHON_BIN -u $@
+exit $?
+
diff --git a/examples/text_to_sql/RAT-SQL/script/available_gpu.py b/examples/text_to_sql/RAT-SQL/script/available_gpu.py
new file mode 100644
index 0000000000000000000000000000000000000000..1de2aba8171df20c4476402dfbbeefeed1a39740
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/script/available_gpu.py
@@ -0,0 +1,45 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import traceback
+
+import nvgpu
+
+logging.basicConfig(
+    level=logging.DEBUG,
+    format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s",
+    datefmt="%m-%d %H:%M:%S",
+    filename=None,
+    filemode="a",
+)
+
+if __name__ == "__main__":
+    from argparse import ArgumentParser
+
+    try:
+        arg_parser = ArgumentParser(description="print available_gpu id, using nvgpu")
+        arg_parser.add_argument("-b", "--best", default=None, type=int, help="output best N")
+        args = arg_parser.parse_args()
+
+        if args.best is not None:
+            gpus = sorted(nvgpu.gpu_info(), key=lambda x: (x["mem_used"], x["index"]))
+            ids = [x["index"] for x in gpus]
+            print(",".join(ids[: args.best]))
+        else:
+            print(",".join(nvgpu.available_gpus()))
+
+    except Exception:
+        traceback.print_exc()
+        exit(-1)
diff --git a/examples/text_to_sql/RAT-SQL/script/schema_linking.py b/examples/text_to_sql/RAT-SQL/script/schema_linking.py
new file mode 100644
index 0000000000000000000000000000000000000000..76ac0c9df30d165874e01e2948e1370c802b1053
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/script/schema_linking.py
@@ -0,0 +1,231 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import logging
+import re
+import sys
+import traceback
+from collections import defaultdict
+
+from text2sql.dataproc.dusql_dataset_v2 import load_tables
+
+logging.basicConfig(
+    level=logging.DEBUG,
+    format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s",
+    datefmt="%m-%d %H:%M:%S",
+    filename=None,
+    filemode="a",
+)
+
+g_date_patt = re.compile(r"(([0-9]{2})[0-9]{2}年)?[0-9]{1,2}月[0-9]{2}日|([0-9]{2})[0-9]{2}年[0-9]{1,2}月")
+
+
+def get_char_list(sentence):
+    def is_ascii(s):
+        """check if s is an English letter or number
+        Args:
+            s (str): NULL
+        Returns: bool
+        """
+        return ord(s) < 128
+
+    if len(sentence) == 0:
+        return []
+
+    lst_result = [sentence[0]]
+    last_is_ascii = lst_result[-1].isalnum()
+    for char in sentence[1:]:
+        if char == " ":
+            last_is_ascii = False
+            continue
+        elif char == "-":
+            last_is_ascii = False
+            lst_result.append(char)
+            continue
+
+        if is_ascii(char) and last_is_ascii:
+            lst_result[-1] += char
+            continue
+
+        if is_ascii(char):
+            last_is_ascii = True
+        else:
+            last_is_ascii = False
+
+        lst_result.append(char)
+
+    return tuple(lst_result)
+
+
+def _format_date_cell(old_cell):
+    new_cell = old_cell.rstrip("月日")
+    new_cell = new_cell.replace("年", "-")
+    new_cell = new_cell.replace("月", "-")
+    return new_cell
+
+
+def _build(cells):
+    dct_index = defaultdict(set)
+    for cell in set(cells):
+        if type(cell) is not str:
+            continue
+        cell = cell.strip()
+        if re.match(g_date_patt, cell):
+            cell = _format_date_cell(cell)
+        cell_chars = get_char_list(cell.lower())
+        dct_index[cell.lower()].add((cell, len(cell_chars)))
+        for pos in range(len(cell_chars) - 1):
+            bigram = cell_chars[pos : pos + 2]
+            # tri_gram = cell_chars[pos: pos + 3]
+            # four_gram = cell_chars[pos: pos
+ 4] + dct_index[bigram].add((cell, len(cell_chars) - 1)) + # dct_index[tri_gram].add((cell, len(cell_chars) - 2)) + # dct_index[four_gram].add(cell) + return dct_index + + +def build_cell_index(db_dict): + for db in db_dict.values(): + column_cells = [] + for column in db.columns: + cell_index = _build(column.cells) + column_cells.append(cell_index) + db.column_cells_index = column_cells + + +def extract_value_from_sql(sql_json, sql_format="dusql"): + dct_col_values = defaultdict(list) + if sql_format == "nl2sql": + for col, _, val in item["sql"]["conds"]: + dct_col_values[col].append(val) + return dct_col_values + + def _merge_dict(base_dict, extra_dict): + for k, v in extra_dict.items(): + base_dict[k].extend(v) + + def _extract_value_from_sql_cond(cond, dct_col_values): + if type(cond[3]) is dict: + new_col_values = extract_value_from_sql(cond[3]) + _merge_dict(dct_col_values, new_col_values) + return + col_id = cond[2][1][1] + dct_col_values[col_id].append(cond[3]) + if cond[4] is not None: + dct_col_values[col_id].append(cond[4]) + + for table_unit in sql_json["from"]["table_units"]: + if type(table_unit[1]) is dict: + new_col_values = extract_value_from_sql(table_unit[1]) + _merge_dict(dct_col_values, new_col_values) + + for cond in sql_json["where"][::2]: + _extract_value_from_sql_cond(cond, dct_col_values) + for cond in sql_json["having"][::2]: + _extract_value_from_sql_cond(cond, dct_col_values) + + if sql_json["intersect"] is not None: + new_col_values = extract_value_from_sql(sql_json["intersect"]) + _merge_dict(dct_col_values, new_col_values) + if sql_json["union"] is not None: + new_col_values = extract_value_from_sql(sql_json["union"]) + _merge_dict(dct_col_values, new_col_values) + if sql_json["except"] is not None: + new_col_values = extract_value_from_sql(sql_json["except"]) + _merge_dict(dct_col_values, new_col_values) + + return dct_col_values + + +def search_values(query, db, extra_values): + lst_match_values = [] + for column, cell_index in zip(db.columns, db.column_cells_index): + if column.id == 0: + lst_match_values.append([]) + continue + + candi_cnt = defaultdict(float) + query_chars = get_char_list(query.lower()) + appear_set = set() + for pos in range(len(query_chars)): + unigram = query_chars[pos] + if len(unigram) > 2 and unigram not in appear_set and unigram in cell_index: + for cell, base in cell_index[unigram]: + candi_cnt[cell] += 1.0 / base + if pos == len(query_chars) - 1: + break + + bigram = query_chars[pos : pos + 2] + if bigram not in cell_index: + continue + if bigram in appear_set: + continue + appear_set.add(bigram) + for cell, base in cell_index[bigram]: + candi_cnt[cell] += 1.0 / base + + if extra_values is not None and column.id in extra_values: + gold_values = extra_values[column.id] + for gval in gold_values: + candi_cnt[str(gval)] += 2.0 + + lst_match_values.append(list(sorted(candi_cnt.items(), key=lambda x: x[1], reverse=True))[:10]) + + return lst_match_values + + +if __name__ == "__main__": + import argparse + + try: + arg_parser = argparse.ArgumentParser(description="linking candidate values for each column") + arg_parser.add_argument( + "input", nargs="?", type=argparse.FileType("r"), default=sys.stdin, help="input file path" + ) + arg_parser.add_argument("-s", "--db-schema", required=True, help="file path") + arg_parser.add_argument("-c", "--db-content", required=True, help="file path") + arg_parser.add_argument( + "-o", "--output", type=argparse.FileType("w"), default=sys.stdout, help="output file path" + ) + 
arg_parser.add_argument("-t", "--is-train", default=False, action="store_true") + arg_parser.add_argument("-f", "--sql-format", default="dusql", choices=["dusql", "nl2sql", "cspider"]) + args = arg_parser.parse_args() + + sys.stderr.write(">>> loading databases...\n") + dct_db, _ = load_tables(args.db_schema, args.db_content) + build_cell_index(dct_db) + + sys.stderr.write(">>> extracting values...\n") + lst_output = [] + for idx, item in enumerate(json.load(args.input)): + question_id = item.get("question_id", f"qid{idx:06d}") + question = item["question"] + db_id = item["db_id"] + db = dct_db[db_id] + + extra_values = None + if args.is_train: + extra_values = extract_value_from_sql(item["sql"], args.sql_format) + + match_values = search_values(question, db, extra_values) + lst_output.append( + {"question_id": question_id, "question": question, "db_id": db_id, "match_values": match_values} + ) + + json.dump(lst_output, args.output, indent=2, ensure_ascii=False) + except Exception: + traceback.print_exc() + # logging.critical(traceback.format_exc()) + exit(-1) diff --git a/examples/text_to_sql/RAT-SQL/script/text2sql_main.py b/examples/text_to_sql/RAT-SQL/script/text2sql_main.py new file mode 100644 index 0000000000000000000000000000000000000000..14f329a3035d425bf41d72707d43708b8aa07d47 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/script/text2sql_main.py @@ -0,0 +1,218 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import random +import sys +from functools import partial +from pathlib import Path + +import numpy as np +import paddle +import paddle.distributed as dist +import text2sql +from text2sql import dataproc, global_config, launch +from text2sql.grammars.cspider_v2 import CSpiderLanguageV2 +from text2sql.grammars.dusql_v2 import DuSQLLanguageV2 +from text2sql.grammars.nl2sql import NL2SQLLanguage + +ModelClass = None +GrammarClass = None +DataLoaderClass = None +DatasetClass = None +g_input_encoder = None +g_label_encoder = None + + +def preprocess(config): + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": False, + } + + output_base = config.data.output + if config.data.train_set is not None: + dataset = DatasetClass(name="train", data_file=config.data.train_set, **dataset_config) + dataset.save(output_base, save_db=True) + g_label_encoder.save(Path(output_base) / "label_vocabs") + + if config.data.dev_set is not None: + dataset = DatasetClass(name="dev", data_file=config.data.dev_set, **dataset_config) + dataset.save(output_base, save_db=False) + + if config.data.test_set is not None: + dataset = DatasetClass(name="test", data_file=config.data.test_set, **dataset_config) + dataset.save(output_base, save_db=False) + + +def train(config): + logging.info("training arguments: %s", config) + if config.train.use_data_parallel: + logging.info("parallel mode. 
init env...") + dist.init_parallel_env() + + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": True, + } + train_set = DatasetClass(name="train", data_file=config.data.train_set, **dataset_config) + dev_set = DatasetClass(name="dev", data_file=config.data.dev_set, **dataset_config) + + shuf_train = True if not config.general.is_debug else False + train_reader = DataLoaderClass(config, train_set, batch_size=config.general.batch_size, shuffle=shuf_train) + # dev_reader = dataproc.DataLoader(config, dev_set, batch_size=config.general.batch_size, shuffle=False) + dev_reader = DataLoaderClass(config, dev_set, batch_size=1, shuffle=False) + max_train_steps = config.train.epochs * (len(train_set) // config.general.batch_size // config.train.trainer_num) + + model = ModelClass(config.model, g_label_encoder) + if config.model.init_model_params is not None: + logging.info("loading model param from %s", config.model.init_model_params) + model.set_state_dict(paddle.load(config.model.init_model_params)) + if config.train.use_data_parallel: + logging.info("parallel mode. init model...") + model = paddle.DataParallel(model) + + optimizer = text2sql.optim.init_optimizer(model, config.train, max_train_steps) + if config.model.init_model_optim is not None: + logging.info("loading model optim from %s", config.model.init_model_optim) + optimizer.set_state_dict(paddle.load(config.model.init_model_optim)) + + logging.info("start of training...") + launch.trainer.train(config, model, optimizer, config.train.epochs, train_reader, dev_reader) + logging.info("end of training...") + + +def inference(config): + if config.model.init_model_params is None: + raise RuntimeError("config.init_model_params should be a valid model path") + + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": True, + } + test_set = DatasetClass(name="test", data_file=config.data.test_set, **dataset_config) + test_reader = DataLoaderClass(config, test_set, batch_size=1, shuffle=False) + + model = ModelClass(config.model, g_label_encoder) + logging.info("loading model param from %s", config.model.init_model_params) + state_dict = paddle.load(config.model.init_model_params) + model.set_state_dict(state_dict) + + logging.info("start of inference...") + launch.infer.inference( + model, test_reader, config.data.output, beam_size=config.general.beam_size, model_name=config.model.model_name + ) + logging.info("end of inference...") + + +def evaluate(config): + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": True, + "schema_file": config.data.db_schema, + } + test_set = DatasetClass(name="test", data_file=config.data.test_set, **dataset_config) + with open(config.data.eval_file) as ifs: + infer_results = list(ifs) + model = None + + logging.info("start of evaluating...") + launch.eval.evaluate(model, test_set, infer_results, eval_value=config.general.is_eval_value) + logging.info("end of evaluating....") + + +def init_env(config): + log_level = logging.INFO if not config.general.is_debug else logging.DEBUG + formatter = logging.Formatter("%(levelname)s %(asctime)s %(filename)s:%(lineno)03d * %(message)s") + logger = logging.getLogger() + logger.setLevel(log_level) + handler = logger.handlers[0] + handler.setLevel(log_level) + handler.setFormatter(formatter) + + seed = config.train.random_seed + if 
seed is not None: + random.seed(seed) + paddle.seed(seed) + np.random.seed(seed) + + global ModelClass + global GrammarClass + global DatasetClass + global DataLoaderClass + global g_input_encoder + global g_label_encoder + + if config.model.grammar_type == "dusql_v2": + GrammarClass = DuSQLLanguageV2 + elif config.model.grammar_type == "nl2sql": + GrammarClass = NL2SQLLanguage + elif config.model.grammar_type == "cspider_v2": + GrammarClass = CSpiderLanguageV2 + else: + raise ValueError("grammar type is not supported: %s" % (config.model.grammar_type)) + g_label_encoder = dataproc.SQLPreproc( + config.data.grammar, + GrammarClass, + predict_value=config.model.predict_value, + is_cached=config.general.mode != "preproc", + ) + + assert config.model.model_name == "seq2tree_v2", "only seq2tree_v2 is supported" + g_input_encoder = dataproc.ErnieInputEncoderV2(config.model) + ModelClass = lambda x1, x2: text2sql.models.EncDecModel(x1, x2, "v2") + DatasetClass = dataproc.DuSQLDatasetV2 + DataLoaderClass = partial(dataproc.DataLoader, collate_fn=dataproc.dataloader.collate_batch_data_v2) + + +def _set_proc_name(config, tag_base): + """ + set process name on local machine + """ + if config.general.is_cloud: + return + if tag_base.startswith("train"): + tag_base = "train" + import setproctitle + + setproctitle.setproctitle(tag_base + "_" + config.data.output.rstrip("/").split("/")[-1]) + + +if __name__ == "__main__": + config = global_config.gen_config() + init_env(config) + + run_mode = config.general.mode + if run_mode == "preproc": + preprocess(config) + sys.exit(0) + + _set_proc_name(config, run_mode) + if run_mode == "test": + evaluate(config) + elif run_mode == "infer": + inference(config) + elif run_mode.startswith("train"): + if config.train.use_data_parallel: + dist.spawn(train, args=(config,)) + else: + train(config) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..2d65733fca8c9689d99c4f5d538b365caa67de56 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from . import dataproc +from . import grammars +from . import io +from . import launch +from . import models +from . import optim +from . import utils diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..5aa57bf32d9483441dffe231e6411ca6f34d55f3 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .base_classes import * + +from . import dataloader +from .dataloader import DataLoader +from .dusql_dataset_v2 import DuSQLDatasetV2 +from .ernie_input_encoder_v2 import ErnieInputEncoderV2 +from .sql_preproc_v2 import SQLPreproc diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/base_classes.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/base_classes.py new file mode 100644 index 0000000000000000000000000000000000000000..3514e5c63eb60e05968b0d0a9e0edbc403d90e2b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/base_classes.py @@ -0,0 +1,29 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +class BaseInputEncoder(object): + """Docstring for BaseInputEncoder.""" + + def __init__(self): + """init of class""" + super(BaseInputEncoder, self).__init__() + + def encode(self, inputs): + raise NotImplementedError + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dataloader.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dataloader.py new file mode 100644 index 0000000000000000000000000000000000000000..144d0c232392314132eed38e68ec9edc5f711258 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dataloader.py @@ -0,0 +1,151 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import sys + +import numpy as np +import paddle +from text2sql.utils import nn_utils + + +def collate_batch_data_v2(origin_batch, config): + """format origin batch data for model forward""" + TOKEN_IDS = [] + SENT_IDS = [] + + QUESTION_TOKENS_INDEX = [] + TABLE_INDEX = [] + COLUMN_INDEX = [] + VALUE_INDEX = [] + RELATION_MATRIXES = [] + + lst_orig_inputs = [] + lst_orig_labels = [] + for orig_input, orig_label in origin_batch: + if orig_input.value_indexes[-1] > 510: + logging.warning( + "sequence is too long: %d. 
question is %s", orig_input.value_indexes[-1] + 2, orig_input.question
+            )
+            continue
+        lst_orig_inputs.append(orig_input)
+        lst_orig_labels.append(orig_label)
+
+        TOKEN_IDS.append(orig_input.token_ids)
+        SENT_IDS.append(orig_input.sent_ids)
+
+        # orig_input.span_lens[0] 即 question 包含 [cls], [sep] 的长度
+        QUESTION_TOKENS_INDEX.append(list(range(1, orig_input.column_indexes[0] - 1)))
+        TABLE_INDEX.append(orig_input.table_indexes)
+        COLUMN_INDEX.append(orig_input.column_indexes)
+        VALUE_INDEX.append(orig_input.value_indexes)
+
+        relations = orig_input.relations
+        RELATION_MATRIXES.append(np.pad(relations, (0, config.max_seq_len - relations.shape[0])))
+
+    TOKEN_IDS = nn_utils.pad_sequences(TOKEN_IDS, max_len=config.max_seq_len)
+    SENT_IDS = nn_utils.pad_sequences(SENT_IDS, max_len=config.max_seq_len)
+
+    QUESTION_TOKENS_INDEX = nn_utils.pad_sequences(QUESTION_TOKENS_INDEX, max_len=config.max_question_len)
+    TABLE_INDEX = nn_utils.pad_sequences(TABLE_INDEX, max_len=config.max_table_num)
+    COLUMN_INDEX = nn_utils.pad_sequences(COLUMN_INDEX, max_len=config.max_column_num)
+    VALUE_INDEX = nn_utils.pad_sequences(VALUE_INDEX, max_len=config.max_column_num * 2)
+
+    inputs = {
+        "src_ids": TOKEN_IDS,
+        "sent_ids": SENT_IDS,
+        "question_tokens_index": QUESTION_TOKENS_INDEX,
+        "table_indexes": TABLE_INDEX,
+        "column_indexes": COLUMN_INDEX,
+        "value_indexes": VALUE_INDEX,
+        "orig_inputs": lst_orig_inputs,
+    }
+    RELATION_MATRIXES = np.array(RELATION_MATRIXES).astype(np.int64)
+    inputs["relations"] = RELATION_MATRIXES
+
+    for key, value in inputs.items():
+        if key in ("orig_inputs",):
+            continue
+        inputs[key] = paddle.to_tensor(value)
+    return (inputs, lst_orig_labels)
+
+
+class DataLoader(object):
+    """Data Loader for train, test and inference"""
+
+    def __init__(
+        self,
+        config,
+        dataset,
+        batch_size=1,
+        collate_fn=collate_batch_data_v2,
+        shuffle=False,
+        drop_last=False,
+        use_data_parallel=False,
+        use_multiprocess=False,
+    ):
+        super(DataLoader, self).__init__()
+        assert batch_size > 0, "batch_size must be an integer > 0"
+
+        self.config = config
+        self._dataset = dataset
+        self._batch_size = batch_size
+        self._collate_fn = collate_fn
+        self._shuffle = shuffle
+        self._drop_last = drop_last
+        self._use_data_parallel = use_data_parallel
+        self._use_multiprocess = use_multiprocess
+
+        self.dataloader = paddle.fluid.reader.DataLoader.from_generator(
+            capacity=1000, return_list=True, use_multiprocess=use_multiprocess
+        )
+        self.dataloader.set_batch_generator(self.create_generator())
+        if use_data_parallel:
+            self.dataloader = paddle.distributed_batch_reader(self.dataloader)
+
+    def __call__(self):
+        """call"""
+        return self.create_generator()()
+
+    def create_generator(self):
+        """Returns a generator, each iteration returns a batch of data"""
+
+        def _reader():
+            range_fn = np.random.permutation if self._shuffle else np.arange
+            batch = []
+            for iid in range_fn(len(self._dataset)):
+                batch.append(self._dataset[iid])
+                if len(batch) == self._batch_size:
+                    outputs = self._collate_fn(batch, self.config.model)
+                    batch = []
+                    if len(outputs[1]) == 0:
+                        continue
+                    yield outputs
+
+            if len(batch) > 0 and not self._drop_last:
+                yield self._collate_fn(batch, self.config.model)
+
+        return _reader
+
+    @property
+    def name(self):
+        """read property of name"""
+        return self._dataset.name
+
+
+if __name__ == "__main__":
+    """run simple tests"""
+    if len(sys.argv) != 5:
+        print("usage: %s schema content data grammar_file" % (sys.argv[0]))
+        sys.exit(1)
diff --git
a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dusql_dataset_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dusql_dataset_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..8093ff4c421f455fa41054f7094f094266691cbe --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dusql_dataset_v2.py @@ -0,0 +1,336 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import os +import pickle +import sys +from pathlib import Path + +import attr +import networkx as nx +import paddle +import tqdm +from text2sql.utils import linking_utils, text_utils + +g_ernie_input_parser = None +g_match_score_threshold = 0.3 + + +@attr.s +class DuSQLItem: + text = attr.ib() + code = attr.ib() + schema = attr.ib() + orig = attr.ib() + orig_schema = attr.ib() + + +@attr.s +class Column: + id = attr.ib() + table = attr.ib() + name = attr.ib() + orig_name = attr.ib() + dtype = attr.ib() + cells = attr.ib(factory=list) + foreign_key_for = attr.ib(default=None) + + +@attr.s +class Table: + id = attr.ib() + name = attr.ib() + orig_name = attr.ib() + columns = attr.ib(factory=list) + primary_keys = attr.ib(factory=list) + primary_keys_id = attr.ib(factory=list) + foreign_keys_tables = attr.ib(factory=set) + + +@attr.s +class DB: + db_id = attr.ib() + tables = attr.ib() + columns = attr.ib() + foreign_key_graph = attr.ib() + orig = attr.ib() + connection = attr.ib(default=None) + + +def _extract_column_cells(table_names, tables_content): + lst_column_cells = [table_names] + + for table_name in table_names: + table_info = tables_content.get(table_name, None) + if table_info is None: + return None + rows = table_info.get("cell", []) + if len(rows) == 0: + rows = [[] for _ in tables_content[table_name]["header"]] + lst_column_cells.extend(rows) + else: + lst_column_cells.extend(list(zip(*rows))) + + return lst_column_cells + + +def load_tables(schema_file, content_file): + """load tables from json files""" + schemas = {} + eval_foreign_key_maps = {} + + with open(schema_file) as ifs_schema, open(content_file) as ifs_content: + lst_schema = json.load(ifs_schema) + dct_content = {x["db_id"]: x for x in json.load(ifs_content)} + + for schema_dict in lst_schema: + db_id = schema_dict["db_id"] + + contents = dct_content[db_id] + lst_column_cells = _extract_column_cells(schema_dict["table_names"], contents["tables"]) + if lst_column_cells is None: + lst_column_cells = [[] for _ in schema_dict["column_names"]] + assert len(lst_column_cells) == len(schema_dict["column_names"]) + + if "table_names_original" not in schema_dict: + schema_dict["table_names_original"] = schema_dict["table_names"] + if "column_names_original" not in schema_dict: + schema_dict["column_names_original"] = schema_dict["column_names"] + tables = tuple( + Table(id=i, name=text_utils.wordseg(name), orig_name=orig_name) + for i, (name, orig_name) in enumerate( + zip(schema_dict["table_names"], schema_dict["table_names_original"]) + ) + 
) + columns = tuple( + Column( + id=i, + table=tables[table_id] if table_id >= 0 else None, + name=text_utils.wordseg(col_name), + orig_name=orig_col_name, + dtype=col_type, + # 1. drop data with length > 20 + # 2. ID is startswith item_ + cells=[ + x for x in set([str(c) for c in lst_column_cells[i]]) if len(x) <= 20 or x.startswith("item_") + ], + ) + for i, ((table_id, col_name), (_, orig_col_name), col_type) in enumerate( + zip(schema_dict["column_names"], schema_dict["column_names_original"], schema_dict["column_types"]) + ) + ) + + # Link columns to tables + for column in columns: + if column.table: + column.table.columns.append(column) + + # Register primary keys + for column_id in schema_dict["primary_keys"]: + column = columns[column_id] + column.table.primary_keys.append(column) + column.table.primary_keys_id.append(column_id) + + # Register foreign keys + foreign_key_graph = nx.DiGraph() + for source_column_id, dest_column_id in schema_dict["foreign_keys"]: + source_column = columns[source_column_id] + dest_column = columns[dest_column_id] + source_column.foreign_key_for = dest_column + columns[source_column_id].table.foreign_keys_tables.add(dest_column_id) + foreign_key_graph.add_edge( + source_column.table.id, dest_column.table.id, columns=(source_column_id, dest_column_id) + ) + foreign_key_graph.add_edge( + dest_column.table.id, source_column.table.id, columns=(dest_column_id, source_column_id) + ) + + schemas[db_id] = DB(db_id, tables, columns, foreign_key_graph, schema_dict) + # TODO + # eval_foreign_key_maps[db_id] = evaluation.build_foreign_key_map(schema_dict) + + return schemas, eval_foreign_key_maps + + +class DuSQLExample(object): + """Define struct of one DuSQL example, and its processing methods""" + + def __init__(self, json_example, db, input_encoder): + super(DuSQLExample, self).__init__() + + self.orig = json_example + self.question = json_example["question"] + self.question_id = json_example["question_id"] + self.columns = db.columns + self.tables = db.tables + self.db = db + + self.column_match_cells = self._filter_match_values(json_example["match_values"]) + + ernie_inputs = input_encoder.encode(self.question, db, self.column_match_cells) + self.token_ids = ernie_inputs.token_ids + self.sent_ids = ernie_inputs.sent_ids + self.table_indexes = ernie_inputs.table_indexes + self.column_indexes = ernie_inputs.column_indexes + self.value_indexes = ernie_inputs.value_indexes + self.values = ernie_inputs.value_list + + self.token_mapping = ernie_inputs.token_mapping + self.question_tokens = ernie_inputs.orig_question_tokens + self.candi_nums = ernie_inputs.candi_nums + self.relations = self._compute_relations() + + def _filter_match_values(self, match_values_info): + """filter by match score""" + lst_result = [] + for column_values in match_values_info: + filtered_results = [] + for value, score in column_values: + if score > g_match_score_threshold: + filtered_results.append(value) + else: # column_values should ordered by score + break + lst_result.append(filtered_results) + return lst_result + + def _compute_relations(self): + schema_linking_results = self._linking_wrapper(linking_utils.compute_schema_linking) + cell_value_linking_results = self._linking_wrapper(linking_utils.compute_cell_value_linking) + link_info_dict = {"sc_link": schema_linking_results, "cv_link": cell_value_linking_results} + + q_len = self.column_indexes[0] - 2 + c_len = len(self.columns) + t_len = len(self.tables) + total_len = q_len + c_len + t_len + relation_matrix = 
linking_utils.build_relation_matrix( + link_info_dict, total_len, q_len, c_len, list(range(c_len + 1)), list(range(t_len + 1)), self.db + ) + return relation_matrix + + def _linking_wrapper(self, fn_linking): + """wrapper for linking function, do linking and id convert""" + link_result = fn_linking(self.question_tokens, self.db) + + # convert words id to BERT word pieces id + new_result = {} + for m_name, matches in link_result.items(): + new_match = {} + for pos_str, match_type in matches.items(): + qid_str, col_tab_id_str = pos_str.split(",") + qid, col_tab_id = int(qid_str), int(col_tab_id_str) + for real_qid in self.token_mapping[qid]: + new_match[f"{real_qid},{col_tab_id}"] = match_type + new_result[m_name] = new_match + return new_result + + def __repr__(self): + """format for reviewing""" + return str(self.__dict__) + + +class DuSQLDatasetV2(paddle.io.Dataset): + """implement of DuSQL dataset for training/evaluating""" + + def __init__( + self, name, db_file, data_file, input_encoder, label_encoder, is_cached=False, schema_file=None, has_label=True + ): + super(DuSQLDatasetV2, self).__init__() + + self.name = name + self.input_encoder = input_encoder + self.label_encoder = label_encoder + self.db_schema_file = schema_file + self.has_label = has_label + self._qid2index = {} + + if is_cached: + self.db_dict, self._examples = None, None + self.load(db_file, data_file) + else: + schema_file, content_file = db_file + self.db_dict, _ = load_tables(schema_file, content_file) + self._examples = [] + match_value_file = Path(os.path.dirname(data_file)) / ("match_values_" + os.path.basename(data_file)) + if not match_value_file.exists(): + raise FileNotFoundError("match value file not found: " + str(match_value_file)) + with open(data_file) as ifs_data, open(match_value_file) as ifs_mval: + self.collate_examples(json.load(ifs_data), json.load(ifs_mval)) + + def collate_examples(self, orig_examples, match_values): + """collate examples, and append to self._examples""" + for idx, (item, m_val) in tqdm.tqdm(enumerate(zip(orig_examples, match_values))): + if "question_id" in item: + assert ( + item["question_id"] == m_val["question_id"] + ), f'data no match: {item["question_id"]} != {m_val["question_id"]}' + item["match_values"] = m_val["match_values"] + db = self.db_dict[item["db_id"]] + if not self.input_encoder.check(item, db): + logging.warning(f'check failed: db_id={item["db_id"]}, question={item["question"]}') + continue + if "question_id" not in item: + item["question_id"] = f"qid{idx:06d}" + inputs = DuSQLExample(item, db, self.input_encoder) + if "sql" not in item or type(item["sql"]) is not dict or not self.has_label: + outputs = None + else: + outputs = self.label_encoder.add_item(self.name, item["sql"], inputs.values) + self._qid2index[item["question_id"]] = len(self._examples) + self._examples.append([inputs, outputs]) + + def save(self, save_dir, save_db=True): + """save data to disk + + Args: + save_dir (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + os.makedirs(save_dir, exist_ok=True) + if save_db: + with open(Path(save_dir) / "db.pkl", "wb") as ofs: + pickle.dump(self.db_dict, ofs) + with open(Path(save_dir) / f"{self.name}.pkl", "wb") as ofs: + pickle.dump([self._examples, self._qid2index], ofs) + + def load(self, db_file, data_file): + """load data from disk""" + with open(db_file, "rb") as ifs: + self.db_dict = pickle.load(ifs) + with open(data_file, "rb") as ifs: + self._examples, self._qid2index = pickle.load(ifs) + + def get_by_qid(self, qid): + """ """ + index = 
self._qid2index[qid] + return self._examples[index] + + def __getitem__(self, idx): + """get one example""" + return self._examples[idx] + + def __len__(self): + """size of data examples""" + return len(self._examples) + + +if __name__ == "__main__": + """run simple tests""" + if len(sys.argv) != 5: + print("usage: %s schema content data grammar_file" % (sys.argv[0])) + sys.exit(1) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/ernie_input_encoder_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/ernie_input_encoder_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..f68e3da9ff0fd2d5416479bfcc32d2808fa06248 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/ernie_input_encoder_v2.py @@ -0,0 +1,293 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import sys +from collections import namedtuple + +import numpy as np +from text2sql.dataproc import BaseInputEncoder +from text2sql.utils import text_utils + +from paddlenlp.transformers import BertTokenizer, ErnieTokenizer + +ErnieInput = namedtuple( + "ErnieInput", + "token_ids sent_ids table_indexes column_indexes value_indexes value_list token_mapping orig_question_tokens candi_nums", +) + + +class ErnieInputEncoderV2(BaseInputEncoder): + """use ernie field_reader to seg, it will automatically add padding,mask,position,task,sentence and return length""" + + padding_id = 0 + truncation_type = 0 + + def __init__(self, model_config): + super(ErnieInputEncoderV2, self).__init__() + + self.config = model_config + self.enc_value_with_col = model_config.enc_value_with_col + if model_config.pretrain_model_type == "BERT": + self.tokenizer = BertTokenizer.from_pretrained(model_config.pretrain_model) + self.special_token_dict = { + "table": "[unused1]", + "column": "[unused2]", + "value": "[unused3]", + "text": "[unused11]", + "real": "[unused12]", + "number": "[unused13]", + "time": "[unused14]", + "binary": "[unused15]", + "boolean": "[unused16]", + "bool": "[unused17]", + "others": "[unused18]", + } + else: + self.tokenizer = ErnieTokenizer.from_pretrained(model_config.pretrain_model) + # low frequency token will be used as special token + # Other candidate: overchicstoretvhome + self.special_token_dict = { + "table": "blogabstract", + "column": "wx17house", + "value": "fluke62max", + "text": "googlemsn", + "real": "sputniknews", + "number": "sputniknews", + "time": "pixstyleme3c", + "binary": "pixnetfacebookyahoo", + "boolean": "pixnetfacebookyahoo", + "bool": "pixnetfacebookyahoo", + "others": "ubuntuforumwikilinuxpastechat", + } + self._need_bool_value = True if self.config.grammar_type != "nl2sql" else False + + def check(self, data, db): + if len(db.columns) > self.config.max_column_num or len(db.tables) > self.config.max_table_num: + return False + return True + + def encode(self, question, db, column_match_cells=None, candi_nums=None, col_orders=None, debug=False): + question = question.strip() + if 
self.config.num_value_col_type != "q_num": + orig_question_tokens = text_utils.wordseg(self.question) + candi_nums = list(set(["0", "1"] + text_utils.CandidateValueExtractor.extract_num_from_text(question))) + candi_nums_index = [-1] * len(candi_nums) + else: + orig_question_tokens, candi_nums, candi_nums_index = text_utils.wordseg_and_extract_num(question) + if "0" not in candi_nums: + candi_nums.append("0") + candi_nums_index.append(-1) + if "1" not in candi_nums: + candi_nums.append("1") + candi_nums_index.append(-1) + tokens, value_list, schema_indexes, token_mapping = self.tokenize( + orig_question_tokens, db, column_match_cells, candi_nums, candi_nums_index, col_orders + ) + if debug: + sys.stderr.write(json.dumps(tokens, ensure_ascii=False) + "\n") + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + + table_indexes, column_indexes, value_indexes, num_value_indexes = schema_indexes + q_len = column_indexes[0] + sent_ids = [0] * q_len + [1] * (len(token_ids) - q_len) + + value_indexes += num_value_indexes + return ErnieInput( + token_ids, + sent_ids, + table_indexes, + column_indexes, + value_indexes, + value_list, + token_mapping, + orig_question_tokens, + candi_nums, + ) + + def tokenize(self, question, db, column_match_cells=None, candi_nums=None, candi_nums_index=None, col_orders=None): + """ + Tokenize question and columns and concatenate. + final_tokens will include:Question、Schema(include non digital value)、digital value + [CLS] Q tokens [SEP] + [T] table1 [C] col1 [V] value [C] col2 ... [SEP] + [V] number [V] ... [SEP] + """ + if col_orders is None: + col_orders = np.arange(len(db.columns)) + + if type(question) is str: + q_tokens_tmp = self.tokenizer.tokenize(question) + token_idx_mapping = [[i] for i in range(len(q_tokens_tmp))] + else: + # question is tokens list + q_tokens_tmp, token_idx_mapping = self._resplit_words(question) + + final_candi_num_index = [] + if candi_nums_index is not None: + for idx in candi_nums_index: + if idx < 0: + final_candi_num_index.append(0) + else: + final_candi_num_index.append(token_idx_mapping[idx][0] + 1) + + # handle question tokens + question_tokens = ["[CLS]"] + q_tokens_tmp + final_tokens = question_tokens[: self.config.max_question_len] + ["[SEP]"] + + columns = [db.columns[i] for i in col_orders] + if column_match_cells is not None: + column_match_cells = [column_match_cells[i] for i in col_orders] + else: + column_match_cells = [None] * len(columns) + + # handle schema tokens + table_indexes = [] + column_indexes = [] + value_indexes = [] + value_list = [] + universe_value_set = set(["是", "否"]) if self._need_bool_value else set() + for idx, (column, match_cells) in enumerate(zip(columns, column_match_cells)): + if idx == 1 or idx > 1 and column.table.id != columns[idx - 1].table.id: + table_indexes.append(len(final_tokens)) + final_tokens.append(self.special_token_dict["table"]) + final_tokens += self.tokenizer.tokenize(column.table.orig_name) + + if idx == 0: + col_name = "任意列" + col_type = self.special_token_dict["text"] + else: + col_name = column.orig_name + # col_name = remove_brackets(col_name) + col_type = self.special_token_dict[column.dtype] + + column_indexes.append(len(final_tokens)) + final_tokens += [col_type] + self.tokenizer.tokenize(col_name) + + if match_cells is not None and len(match_cells) > 0: + if column.dtype in ("text", "time"): + if not self.config.predict_value: + match_cells = match_cells[:1] # the first cell used to complement semantics + for mcell in match_cells: + value_list.append(mcell) + toks 
= [self.special_token_dict["value"]] + self.tokenizer.tokenize(mcell) + if self.enc_value_with_col: + value_indexes.extend([column_indexes[-1], len(final_tokens)]) + else: + value_indexes.append(len(final_tokens)) + final_tokens += toks + elif self.config.predict_value: + for mcell in match_cells: + universe_value_set.add(mcell) + final_tokens.append("[SEP]") + + if self.config.predict_value: + for value in universe_value_set: + value_list.append(value) + toks = [self.special_token_dict["value"]] + self.tokenizer.tokenize(value) + if self.enc_value_with_col: + value_indexes.extend([0, len(final_tokens)]) + else: + value_indexes.append(len(final_tokens)) + final_tokens += toks + final_tokens.append("[SEP]") + + # handle number value tokens: condition and limit number values + num_value_indexes = [] + if candi_nums is not None and len(candi_nums) > 0: + value_list += candi_nums + for num, index in zip(candi_nums, final_candi_num_index): + if self.enc_value_with_col: + # index is the index of current number in question + num_value_indexes.extend([index, len(final_tokens)]) + elif self.config.num_value_col_type == "q_num": + num_value_indexes.append(index) + else: + num_value_indexes.append(len(final_tokens)) + final_tokens += [self.special_token_dict["value"]] + self.tokenizer.tokenize(num) + else: + # use fixed special token value/empty + if self.enc_value_with_col: + value_indexes = [0, len(final_tokens), 0, len(final_tokens) + 1] + else: + value_indexes = [len(final_tokens), len(final_tokens) + 1] + num_value_indexes = [] + value_list = ["value", "empty"] + final_tokens.extend(value_list) + final_tokens.append("[SEP]") + + # packed_sents_lens = [q_lens, column_tokens_lens, table_tokens_lens, limit_tokens_lens] + # packed_sents, packed_sents_lens = self._pack([question_tokens], + # column_tokens, + # table_tokens, + # limit_tokens, + # value_indexes=column_values_index) + + return ( + final_tokens, + value_list, + [table_indexes, column_indexes, value_indexes, num_value_indexes], + token_idx_mapping, + ) + + def _resplit_words(self, words): + """resplit words by bert_tokenizer""" + lst_new_result = [] + token_idx_mapping = [] + for idx, word in enumerate(words): + tokens = self.tokenizer.tokenize(word) + new_id_start = len(lst_new_result) + new_id_end = new_id_start + len(tokens) + lst_new_result.extend(tokens) + token_idx_mapping.append(list(range(new_id_start, new_id_end))) + return lst_new_result, token_idx_mapping + + def _pack(self, *sents_of_tokens_list, value_indexes=None): + packed_sents = [] + packed_sents_lens_all = [] + for sents_of_tokens in sents_of_tokens_list: + packed_sents_lens = [] + for tokens in sents_of_tokens: + packed_tokens = tokens + ["[SEP]"] + packed_sents += packed_tokens + packed_sents_lens.append(len(packed_tokens)) + packed_sents_lens_all.append(packed_sents_lens) + return packed_sents, packed_sents_lens_all + + +if __name__ == "__main__": + """run some simple test cases""" + if len(sys.argv) != 3: + print("usage: %s --db db_path") + sys.exit(1) + + from pathlib import Path + + from text2sql import global_config + from text2sql.dataproc.dusql_dataset import load_tables + + config = global_config.gen_config() + parser = ErnieInputEncoderV2(config) + q = "这 是 一项 测试 。 hello world !" 
+ db_path = Path(config.data.db) + db_dict, _ = load_tables(db_path / "db_schema.json", db_path / "db_content.json") + db = db_dict[list(db_dict.keys())[0]] + column_match_cells = [None] * len(db.columns) + column_match_cells[1] = ["你好", "[CLS]"] + print(q) + print([x.orig_name for x in db.columns]) + print([x.orig_name for x in db.tables]) + print(parser.encode(q, db, column_match_cells=column_match_cells, candi_nums=["1", "0", "10000000"], debug=True)) + print("*" * 100) + print(parser.encode(q.split(" "), db, candi_nums=["1", "0", "10000000"], debug=True)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_label.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_label.py new file mode 100644 index 0000000000000000000000000000000000000000..f0805650db93ef210bebcf366c7e922f5124c1b8 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_label.py @@ -0,0 +1,440 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +g_open_value_predict = False +g_having_agg_threshold = 0.9 + + +class SQL(object): + """SQL define""" + + op_sql_dict = {0: ">", 1: "<", 2: "==", 3: "!=", 4: ">=", 5: "<="} + agg_sql_dict = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"} + conn_sql_dict = {0: "", 1: "and", 2: "or"} + order_dict = {0: "", 1: "asc", 2: "desc"} + sel_num_dict = {0: 1, 1: 2, 2: 3, 3: 4} + # cond_num_dict = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5} + cond_num_dict = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4} + group_num_dict = {0: 0, 1: 1} + group_type_dict = {0: "none", 1: "group", 2: "group_having", 3: "group_order"} + + order2id = {"": 0, "asc": 1, "desc": 2} + + num_where_ops = len(op_sql_dict) + 1 + num_agg_ops = len(agg_sql_dict) + num_cond_ops = len(conn_sql_dict) + num_order_directions = len(order_dict) + num_sel_num = len(sel_num_dict) + num_where_num = len(cond_num_dict) + num_group_num = len(group_num_dict) + num_group_type = len(group_type_dict) + + dtype_str = "text" + dtype_num = "real" + + def __init__(self, cond_conn_op: int, agg: list, sel: list, conds: list, **kwargs): + """doc""" + self.cond_conn_op = cond_conn_op + self.sel = [] + self.agg = [] + sel_agg_pairs = sorted(zip(sel, agg), key=lambda x: x[0]) + for col_id, agg_op in sel_agg_pairs: + self.sel.append(col_id) + self.agg.append(agg_op) + self.conds = list(sorted(conds, key=lambda x: x[0])) + self.order_by = list(sorted(kwargs.get("order_by", []))) + self.group_by = list(sorted(kwargs.get("group_by", []))) + self.having = list(sorted(kwargs.get("having", []))) + order_str = kwargs.get("order_direction", "").lower() + self.order_direction = self.order2id.get(order_str, 0) + limit = kwargs.get("limit", None) + self.limit = "0" if limit is None else str(limit) + self.sel_num = len(self.sel) + self.cond_num = len(self.conds) + self.group_num = len(self.group_by) + self.group_type = 0 + if len(self.group_by) > 0: + self.group_type = 1 + if len(self.having) > 0: + self.group_type = 2 + elif len(self.order_by) > 0 
and self.order_by[0][0] > 0: + self.group_type = 3 + + @classmethod + def from_dict(cls, data: dict): + """doc""" + return cls(**data) + + def keys(self): + """doc""" + return ["cond_conn_op", "sel", "agg", "conds", "order_by", "order_direction", "limit", "group_by", "having"] + + def __getitem__(self, key): + """doc""" + return getattr(self, key) + + def to_json(self): + """doc""" + return json.dumps(dict(self), ensure_ascii=False, sort_keys=True) + + def equal_all_mode(self, other): + """doc""" + return self.to_json() == other.to_json() + + def __eq__(self, other): + """doc""" + raise NotImplementedError("compare mode not set") + + def __repr__(self): + """doc""" + repr_str = "" + repr_str += "sel: {}\n".format(self.sel) + repr_str += "agg: {}\n".format([self.agg_sql_dict[a] for a in self.agg]) + repr_str += "cond_conn_op: '{}'\n".format(self.conn_sql_dict[self.cond_conn_op]) + repr_str += "conds: {}".format([[cond[0], self.op_sql_dict[cond[1]], cond[2]] for cond in self.conds]) + + # TODO: support order/group/... + + return repr_str + + def __str__(self): + """doc""" + return self.to_json() + + def _repr_html_(self): + """doc""" + return self.__repr__().replace("\n", "
") + + +def sql2label(sql, num_cols): + """encode sql""" + # because of classification task, label is from 0 + # so sel_num and cond_num should -1,and label should +1 in prediction phrase + cond_conn_op_label = sql.cond_conn_op + + sel_num_label = sql.sel_num - 1 + # the new dataset has cond_num = 0, do not -1 + cond_num_label = len(sql.conds) + len(sql.having) + sel_label = np.zeros(num_cols, dtype="int32") + sel_agg_label = np.zeros((num_cols, SQL.num_agg_ops), dtype="int32") + for col_id, agg_op in zip(sql.sel, sql.agg): + assert col_id < num_cols, f"select col_id({col_id}) >= num_cols({num_cols}): {sql}" + sel_agg_label[col_id][agg_op] = 1 + sel_label[col_id] = 1 + # len(SQL.op_sql_dict) over all op ID range,which means defaults to no OP + cond_op_label = np.ones(num_cols, dtype="int32") * len(SQL.op_sql_dict) + having_agg_label = np.zeros((num_cols, SQL.num_agg_ops), dtype="int32") + + for col_id, cond_op, _ in sql.conds: + assert col_id < num_cols, f"where col_id({col_id}) >= num_cols({num_cols}): {sql}" + cond_op_label[col_id] = cond_op + + for agg, col_id, cond_op, _ in sql.having: + assert col_id < num_cols, f"having col_id({col_id}) >= num_cols({num_cols}): {sql}" + cond_op_label[col_id] = cond_op + having_agg_label[col_id][agg] = 1 + + order_col_label = np.zeros(num_cols, dtype="int32") + order_agg_label = np.zeros((num_cols, SQL.num_agg_ops), dtype="int32") + + order_direction_label = sql.order_direction + for agg, order_col in sql.order_by: + order_col_label[order_col] = 1 + order_agg_label[order_col][agg] = 1 + + group_num_label = sql.group_num + having_num_label = len(sql.having) + group_col_label = np.zeros(num_cols, dtype="int32") + for col_id in sql.group_by: + assert col_id < num_cols, f"group_by col_id({col_id}) >= num_cols({num_cols}): {sql}" + group_col_label[col_id] = 1 + + return ( + sel_num_label, + cond_num_label, + cond_conn_op_label, + sel_agg_label, + sel_label, + cond_op_label, + order_col_label, + order_agg_label, + order_direction_label, + group_num_label, + having_num_label, + group_col_label, + having_agg_label, + ) + + +def decode( + sel_num, + sel_col, + sel_agg, + where_num, + where_conn, + where_op, + where_op_prob, + col_value, + order_direction, + order_col, + order_agg, + limit_label, + group_num, + having_num, + group_col, + having_agg, + having_agg_prob, + header_match_cells, + candi_limit_nums, +): + """decode one instance predicts to sql""" + if col_value is None: + col_value = [None] * len(where_op) + # use dict to find label number, equals to label+1 + sel_num = SQL.sel_num_dict[int(sel_num)] + sorted_sel_index = sorted(range(len(sel_col)), key=lambda i: sel_col[i], reverse=True) + sel_col = [int(col_id) for col_id in sorted_sel_index][:sel_num] + sel_agg = [int(sel_agg[col_id]) for col_id in sorted_sel_index][:sel_num] + + cond_num = SQL.cond_num_dict[int(where_num)] + where_conn = int(where_conn) + cond_probs = [] + conds = [] + for col_id, (cond_op, cond_prob, value_id) in enumerate(zip(where_op, where_op_prob, col_value)): + if cond_op < len(SQL.op_sql_dict): + cond_probs.append(cond_prob) + value = get_value_by_id(col_id, value_id, header_match_cells) + conds.append([col_id, int(cond_op), value]) + if cond_num < len(conds): + sorted_cond_index = sorted(range(len(cond_probs)), key=lambda i: cond_probs[i], reverse=True) + conds = [conds[i] for i in sorted_cond_index[:cond_num]] + + if group_num is None: + group_num = 0 + if group_num > 0: + sorted_group_index = sorted(range(len(group_col)), key=lambda i: group_col[i], reverse=True) + 
group_col = [int(col_id) for col_id in sorted_group_index[:group_num]] + else: + group_col = [] + + having = [] + if having_num is None: + having_num = 0 + if having_agg is not None and group_num > 0 and having_num > 0: + having_agg_info = [] + for idx, (col_id, _, _) in enumerate(conds): + if having_agg[col_id] > 0: + having_agg_info.append([idx, int(having_agg[col_id]), float(having_agg_prob[col_id])]) + # cond_num -= 1 + if len(having_agg_info) > 0 and having_agg_info[0][2] >= g_having_agg_threshold: + # 按 agg 概率最大排序 + having_agg_info.sort(key=lambda x: x[2], reverse=True) + idx, agg, _ = having_agg_info[0] + having = [[agg] + list(conds[idx])] + conds.pop(idx) + + order_direction = int(order_direction) if order_direction is not None else 0 + if order_direction == 0 or order_col is None or order_agg is None: + order_by = [] + limit = "0" + else: + sorted_order_index = sorted(range(len(order_col)), key=lambda i: order_col[i], reverse=True) + order_col = [int(col_id) for col_id in sorted_order_index[:1]] + order_agg = [int(order_agg[col_id]) for col_id in sorted_order_index[:1]] + order_by = [[order_agg[0], order_col[0]]] + if limit_label < len(candi_limit_nums): + limit = candi_limit_nums[limit_label] + if limit == "0": + limit = "1" + else: + limit = "1" + + return { + "sel": list(sel_col), + "sel_num": int(sel_num), + "cond_num": int(cond_num), + "agg": list(sel_agg), + "cond_conn_op": int(where_conn), + "conds": [list(cond) for cond in conds], + "order_direction": order_direction, + "order_by": list(order_by), + "limit": limit, + "having_num": int(having_num), + "group_num": int(group_num), + "group_by": list(group_col), + "having": list(having), + } + + +def get_value_by_id(col_id, value_id, header_match_cells): + """ + + Args: + col_id (TYPE): NULL + value_id (TYPE): NULL + header_match_cells (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if value_id is None or value_id < 0: + return None + + assert col_id < len(header_match_cells) + + curr_cells = header_match_cells[col_id] + if len(curr_cells) == 0: + return "0" + if value_id >= len(curr_cells): + return curr_cells[0] + else: + return curr_cells[value_id] + + +def decode_sqls(preds, header_lens, header_values_list, limit_nums_list): + """Generate sqls from model outputs""" + fn_empty_preds = lambda: [None] * len(preds["sel_num"]) + + preds_sel_num = np.argmax(preds["sel_num"], axis=-1) + preds_sel_col = preds["sel_col"] + preds_sel_agg = np.argmax(preds["sel_agg"], axis=-1) + preds_cond_num = np.argmax(preds["cond_num"], axis=-1) + preds_where_conn = np.argmax(preds["where_conn"], axis=-1) + preds_where_op = np.argmax(preds["where_op"], axis=-1) + preds_where_op_prob = np.max(preds["where_op"], axis=-1) + + preds_order_direction = np.argmax(preds["order_direction"], axis=-1) + preds_order_col = preds["order_col"] + preds_order_agg = np.argmax(preds["order_agg"], axis=-1) + preds_limit_label = np.argmax(preds["limit_label"], axis=-1) + + preds_group_num = np.argmax(preds["group_num"], axis=-1) + preds_having_num = np.argmax(preds["having_num"], axis=-1) + preds_group_col = preds["group_col"] + preds_having_agg = np.argmax(preds["having_agg"], axis=-1) + preds_having_agg_prob = np.max(preds["having_agg"], axis=-1) + + if g_open_value_predict: + preds_col_value = np.argmax(preds["col_value"], axis=-1) + else: + preds_col_value = fn_empty_preds() + + sqls = [] + for ( + sel_num, + sel_col, + sel_agg, + where_num, + where_conn, + where_op, + where_op_prob, + col_value, + order_direction, + order_col, + order_agg, + 
limit_label, + group_num, + having_num, + group_col, + having_agg, + having_agg_prob, + header_len, + limit_nums, + ) in zip( + preds_sel_num, + preds_sel_col, + preds_sel_agg, + preds_cond_num, + preds_where_conn, + preds_where_op, + preds_where_op_prob, + preds_col_value, + preds_order_direction, + preds_order_col, + preds_order_agg, + preds_limit_label, + preds_group_num, + preds_having_num, + preds_group_col, + preds_having_agg, + preds_having_agg_prob, + header_lens, + limit_nums_list, + ): + + sel_col = sel_col[:header_len] + sel_agg = sel_agg[:header_len] + where_op = where_op[:header_len] + where_op_prob = where_op_prob[:header_len] + if g_open_value_predict: + col_value = col_value[:header_len] + order_col = order_col[:header_len] + order_agg = order_agg[:header_len] + group_col = group_col[:header_len] + having_agg = having_agg[:header_len] + + sql = decode( + sel_num, + sel_col, + sel_agg, + where_num, + where_conn, + where_op, + where_op_prob, + col_value, + order_direction, + order_col, + order_agg, + limit_label, + group_num, + having_num, + group_col, + having_agg, + having_agg_prob, + None, + limit_nums, + ) + sqls.append(sql) + + return sqls + + +if __name__ == "__main__": + """run some simple test""" + # if len(sys.argv) > 2: + # with open(sys.argv[1]) as ifs: + # gold_sqls = [SQL.from_dict(json.loads(x)["sql"]) for x in ifs] + # with open(sys.argv[2]) as ifs: + # pred_sqls = [json.loads(x) for x in ifs] + + # print(f"acc of {sys.argv[1]} vs {sys.argv[2]}: ", get_acc(gold_sqls, pred_sqls)) + # else: + # gold_sqls = [{"sel": [5], "sel_num": 1, "cond_num": 2, "agg": [0], + # "cond_conn_op": 1, "conds": [[0, 2, '123'], [1, 2, '444']], + # "order_direction": "asc", "order_by": [[0, 1]]}] + # pred_sqls = [{"sel": [1, 0], "agg": [0, 4], "cond_conn_op": 0, + # "conds": [], "having_conn_op": 0, "having": [], "order_by": [], + # "order_direction": "", "limit": None, "group_by": [20]}] + # enc_out_names = ["sel_num_label", "cond_num_label", "cond_conn_op_label", "sel_agg_label", + # "sel_label", "cond_op_label", "order_col_label", "order_agg_label", + # "order_direction_label", "group_num_label", "group_col_label", + # "having_agg_label"] + # enc_out = sql2label(SQL.from_dict(gold_sqls[0]), 8) + # for name, array in zip(enc_out_names, enc_out): + # print(name, array) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_preproc_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_preproc_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..18b9f8c972fca0064080d4f89818fc5141ea6b29 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_preproc_v2.py @@ -0,0 +1,457 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
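+
+"""Decoder-side SQL preprocessing: parses gold SQL JSON into grammar ASTs,
+records the observed grammar productions, and builds the decoder vocabulary;
+results can be cached with ``save()`` and reloaded with ``load()``.
+
+Illustrative usage only (the asdl path and output directory below are
+placeholders, not defaults):
+
+    from text2sql.grammars.dusql_v2 import DuSQLLanguageV2
+
+    preproc = SQLPreproc("conf/grammar.asdl", DuSQLLanguageV2, is_cached=False)
+    preproc.add_item("train", sql_json, value_list)  # gold sql dict + candidate value list
+    preproc.save("data/label_vocabs")
+"""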
+ +import collections +import collections.abc +import json +import logging +import os +import shutil +from pathlib import Path + +import attr +from text2sql.dataproc import vocab +from text2sql.utils import serialization + + +def get_field_presence_info(ast_wrapper, node, field_infos): + """get_field_presence_info""" + present = [] + for field_info in field_infos: + field_value = node.get(field_info.name) + is_present = field_value is not None and field_value != [] + + maybe_missing = field_info.opt or field_info.seq + is_builtin_type = field_info.type in ast_wrapper.primitive_types + + if maybe_missing and is_builtin_type: + # TODO: make it possible to deal with "singleton?" + present.append(is_present and type(field_value).__name__) + elif maybe_missing and not is_builtin_type: + present.append(is_present) + elif not maybe_missing and is_builtin_type: + present.append(type(field_value).__name__) + elif not maybe_missing and not is_builtin_type: + assert is_present + present.append(True) + return tuple(present) + + +@attr.s +class DecoderSQLItem: + """DecoderSQLItem""" + + tree = attr.ib() + orig_code = attr.ib() + sql_query = attr.ib(default="") + + +class SQLPreproc(object): + """SQLPreproc""" + + def __init__( + self, + base_path, + grammar_class, + predict_value=True, + min_freq=3, + max_count=5000, + use_seq_elem_rules=False, + is_cached=False, + ): + """ + Args: + base_path (TYPE): if is_cached is False, base_path is the asdl grammar file. + if is_cached is True, base_path is path to cached directory. + grammar_class (TYPE): grammar class, like grammars.dusql.DuSQLLanguage + predict_value (TYPE): Default is True + min_freq (TYPE): Default is 3 + max_count (TYPE): Default is 5000 + use_seq_elem_rules (TYPE): Default is False + is_cached (TYPE): Default is False + + Raises: NULL + """ + self.base_path = base_path + self.predict_value = predict_value + self.vocab = None + self.all_rules = None + self.rules_mask = None + + # key: train/dev/val/test/... 
+ # value: examples + self.items = collections.defaultdict(list) + self.sum_type_constructors = collections.defaultdict(set) + self.field_presence_infos = collections.defaultdict(set) + self.seq_lengths = collections.defaultdict(set) + self.primitive_types = set() + + if not is_cached: + self.grammar = grammar_class(self.base_path) + self.ast_wrapper = self.grammar.ast_wrapper + self.vocab_builder = vocab.VocabBuilder(min_freq, max_count) + else: + self.grammar = None + self.ast_wrapper = None + self.load(grammar_class) + + self.use_seq_elem_rules = use_seq_elem_rules + if self.predict_value: + self.format_sql_value = self.transfer_sql_value + else: + self.format_sql_value = self.fix_sql_value + + def _get_val_index(self, val, value_dict): + def _float(val): + try: + return True, str(int(float(val))) + except Exception: + return False, "" + + val = str(val) + if val in value_dict: + return value_dict[val] + is_float, new_val = _float(val) + if is_float and new_val in value_dict: + return value_dict[new_val] + + new_val = val.replace(".", "") + candi = [] + for v, idx in value_dict.items(): + v = v.replace(".", "") + if v.startswith(new_val) or new_val.startswith(v): + candi.append((v, idx)) + + if len(candi) == 1: + return candi[0][1] + elif len(candi) > 1: + candi.sort(key=lambda x: len(x[0]), reverse=True) + return candi[0][1] + + return -1 + + def transfer_sql_value(self, sql_json, value_dict): + """transfer value str to int index""" + if "cond_conn_op" in sql_json: + self.transfer_simple_sql_value(sql_json, value_dict) + return + + def _trans_cond(cond): + """transfer condition value""" + val1 = cond[3] + val2 = cond[4] + if type(val1) is dict: + self.transfer_sql_value(val1, value_dict) + if val2 is not None: + val2 = self._get_val_index(val2, value_dict) + cond[4] = val2 if val2 >= 0 else 0 + return + + val1 = self._get_val_index(val1, value_dict) + if val2 is not None: + val2 = self._get_val_index(val2, value_dict) + if val1 == -1: + val1 = 0 + logging.debug("lost value: %s. candidates: %s", cond[3], ", ".join(value_dict.keys())) + logging.debug("sql is: %s", json.dumps(sql_json, ensure_ascii=False)) + if val2 == -1: + val2 = 0 + cond[3] = val1 + cond[4] = val2 + + for table_unit in sql_json["from"]["table_units"]: + if type(table_unit[1]) is dict: + self.transfer_sql_value(table_unit[1], value_dict) + + for cond in sql_json["where"][::2]: + _trans_cond(cond) + for cond in sql_json["having"][::2]: + _trans_cond(cond) + + if sql_json["limit"] is not None: + limit = str(sql_json["limit"]) + else: + limit = "0" + if limit in value_dict: + sql_json["limit"] = value_dict[limit] + else: + logging.debug("value of limit is lost: %s. 
candidates: %s", limit, ", ".join(value_dict.keys())) + sql_json["limit"] = value_dict["0"] + + if sql_json["intersect"] is not None: + self.transfer_sql_value(sql_json["intersect"], value_dict) + if sql_json["union"] is not None: + self.transfer_sql_value(sql_json["union"], value_dict) + if sql_json["except"] is not None: + self.transfer_sql_value(sql_json["except"], value_dict) + + def transfer_simple_sql_value(self, sql_json, value_dict): + for cond in sql_json["conds"]: + value = cond[2] + new_val = self._get_val_index(value, value_dict) + if new_val == -1: + new_val = 0 + cond[2] = new_val + + def fix_sql_value(self, sql_json, value_dict): + """fix sql value to 'value' token""" + + def _fix_cond_value(cond): + """transfer condition value""" + val1 = cond[3] + val2 = cond[4] + if type(val1) is dict: + self.fix_sql_value(val1, value_dict) + if val2 is not None: + val2 = self._get_val_index("value", value_dict) + cond[4] = val2 if val2 >= 0 else 0 + return + + val1 = self._get_val_index("value", value_dict) + if val2 is not None: + val2 = self._get_val_index("value", value_dict) + if val1 == -1: + val1 = 0 + logging.info("lost value: %s. candidates: %s", cond[3], ", ".join(value_dict.keys())) + logging.debug("sql is: %s", json.dumps(sql_json, ensure_ascii=False)) + if val2 == -1: + val2 = 0 + cond[3] = val1 + cond[4] = val2 + + for table_unit in sql_json["from"]["table_units"]: + if type(table_unit[1]) is dict: + self.fix_sql_value(table_unit[1], value_dict) + + for cond in sql_json["where"][::2]: + _fix_cond_value(cond) + for cond in sql_json["having"][::2]: + _fix_cond_value(cond) + + if sql_json["limit"] is not None: + limit = "value" + else: + limit = "empty" + assert limit in value_dict + sql_json["limit"] = value_dict[limit] + + if sql_json["intersect"] is not None: + self.fix_sql_value(sql_json["intersect"], value_dict) + if sql_json["union"] is not None: + self.fix_sql_value(sql_json["union"], value_dict) + if sql_json["except"] is not None: + self.fix_sql_value(sql_json["except"], value_dict) + + def add_item(self, section, sql_json, value_list): + """add an item""" + value_dict = {val: idx for idx, val in enumerate(value_list)} + self.format_sql_value(sql_json, value_dict) + + parsed = self.grammar.parse(sql_json, section) + self.ast_wrapper.verify_ast(parsed) # will raise AssertionError, if verify failed + + root = parsed + if section == "train": + for token in self._all_tokens(root): + self.vocab_builder.add_word(token) + self._record_productions(root) + + item = DecoderSQLItem(tree=root, orig_code=sql_json) + self.items[section].append(item) + return item + + def clear_items(self): + """clear items""" + self.items = collections.defaultdict(list) + + def _construct_cache_path(self, root_path): + root_path = Path(root_path) + self.vocab_path = root_path / "dec_vocab.json" + self.observed_productions_path = root_path / "observed_productions.json" + self.grammar_rules_path = root_path / "grammar_rules.json" + self.grammar_file = root_path / "grammar.asdl" + + def save(self, save_path): + """save parsed items to disk""" + os.makedirs(save_path, exist_ok=True) + self._construct_cache_path(save_path) + + self.vocab = self.vocab_builder.finish() + self.vocab.save(self.vocab_path) + # observed_productions + self.sum_type_constructors = serialization.to_dict_with_sorted_values(self.sum_type_constructors) + self.field_presence_infos = serialization.to_dict_with_sorted_values(self.field_presence_infos, key=str) + self.seq_lengths = 
serialization.to_dict_with_sorted_values(self.seq_lengths) + self.primitive_types = sorted(self.primitive_types) + with open(self.observed_productions_path, "w") as f: + json.dump( + { + "sum_type_constructors": self.sum_type_constructors, + "field_presence_infos": self.field_presence_infos, + "seq_lengths": self.seq_lengths, + "primitive_types": self.primitive_types, + }, + f, + indent=2, + sort_keys=True, + ) + + # grammar + self.all_rules, self.rules_mask = self._calculate_rules() + with open(self.grammar_rules_path, "w") as f: + json.dump( + { + "all_rules": self.all_rules, + "rules_mask": self.rules_mask, + }, + f, + indent=2, + sort_keys=True, + ) + + shutil.copy2(self.base_path, self.grammar_file) + + def load(self, grammar_class): + """load parsed items from disk""" + self._construct_cache_path(self.base_path) + + self.grammar = grammar_class(self.grammar_file) + self.ast_wrapper = self.grammar.ast_wrapper + self.vocab = vocab.Vocab.load(self.vocab_path) + + observed_productions = json.load(open(self.observed_productions_path)) + self.sum_type_constructors = observed_productions["sum_type_constructors"] + self.field_presence_infos = observed_productions["field_presence_infos"] + self.seq_lengths = observed_productions["seq_lengths"] + self.primitive_types = observed_productions["primitive_types"] + + grammar = json.load(open(self.grammar_rules_path)) + self.all_rules = serialization.tuplify(grammar["all_rules"]) + self.rules_mask = grammar["rules_mask"] + + def _record_productions(self, tree): + """_record_productions""" + queue = [(tree, False)] + while queue: + node, is_seq_elem = queue.pop() + node_type = node["_type"] + + # Rules of the form: + # expr -> Attribute | Await | BinOp | BoolOp | ... + # expr_seq_elem -> Attribute | Await | ... | Template1 | Template2 | ... + for type_name in [node_type] + node.get("_extra_types", []): + if type_name in self.ast_wrapper.constructors: + sum_type_name = self.ast_wrapper.constructor_to_sum_type[type_name] + if is_seq_elem and self.use_seq_elem_rules: + self.sum_type_constructors[sum_type_name + "_seq_elem"].add(type_name) + else: + self.sum_type_constructors[sum_type_name].add(type_name) + + # Rules of the form: + # FunctionDef + # -> identifier name, arguments args + # | identifier name, arguments args, stmt* body + # | identifier name, arguments args, expr* decorator_list + # | identifier name, arguments args, expr? returns + # ... + # | identifier name, arguments args, stmt* body, expr* decorator_list, expr returns + assert node_type in self.ast_wrapper.singular_types + field_presence_info = get_field_presence_info( + self.ast_wrapper, node, self.ast_wrapper.singular_types[node_type].fields + ) + self.field_presence_infos[node_type].add(field_presence_info) + + for field_info in self.ast_wrapper.singular_types[node_type].fields: + field_value = node.get(field_info.name, [] if field_info.seq else None) + to_enqueue = [] + if field_info.seq: + # Rules of the form: + # stmt* -> stmt + # | stmt stmt + # | stmt stmt stmt + self.seq_lengths[field_info.type + "*"].add(len(field_value)) + to_enqueue = field_value + else: + to_enqueue = [field_value] + for child in to_enqueue: + if isinstance(child, collections.abc.Mapping) and "_type" in child: + queue.append((child, field_info.seq)) + else: + self.primitive_types.add(type(child).__name__) + + def _calculate_rules(self): + """_calculate_rules""" + offset = 0 + + all_rules = [] + rules_mask = {} + + # Rules of the form: + # expr -> Attribute | Await | BinOp | BoolOp | ... 
+ # expr_seq_elem -> Attribute | Await | ... | Template1 | Template2 | ... + for parent, children in sorted(self.sum_type_constructors.items()): + assert not isinstance(children, set) + rules_mask[parent] = (offset, offset + len(children)) + offset += len(children) + all_rules += [(parent, child) for child in children] + + # Rules of the form: + # FunctionDef + # -> identifier name, arguments args + # | identifier name, arguments args, stmt* body + # | identifier name, arguments args, expr* decorator_list + # | identifier name, arguments args, expr? returns + # ... + # | identifier name, arguments args, stmt* body, expr* decorator_list, expr returns + for name, field_presence_infos in sorted(self.field_presence_infos.items()): + assert not isinstance(field_presence_infos, set) + rules_mask[name] = (offset, offset + len(field_presence_infos)) + offset += len(field_presence_infos) + all_rules += [(name, presence) for presence in field_presence_infos] + + # Rules of the form: + # stmt* -> stmt + # | stmt stmt + # | stmt stmt stmt + for seq_type_name, lengths in sorted(self.seq_lengths.items()): + assert not isinstance(lengths, set) + rules_mask[seq_type_name] = (offset, offset + len(lengths)) + offset += len(lengths) + all_rules += [(seq_type_name, i) for i in lengths] + + return tuple(all_rules), rules_mask + + def _all_tokens(self, root): + """_all_tokens""" + queue = [root] + while queue: + node = queue.pop() + type_info = self.ast_wrapper.singular_types[node["_type"]] + + for field_info in reversed(type_info.fields): + field_value = node.get(field_info.name) + if field_info.type in self.grammar.pointers: + pass + elif field_info.type in self.ast_wrapper.primitive_types: + for token in self.grammar.tokenize_field_value(field_value): + yield token + elif isinstance(field_value, (list, tuple)): + queue.extend(field_value) + elif field_value is not None: + queue.append(field_value) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/vocab.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..037399586ecd991434f542ac93952109a1c09801 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/vocab.py @@ -0,0 +1,96 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""vocabulary utils from rat-sql: https://github.com/Microsoft/rat-sql""" + +import collections +import collections.abc +import json + +UNK = "" +BOS = "" +EOS = "" + + +class Vocab(collections.abc.Set): + def __init__(self, iterable, special_elems=(UNK, BOS, EOS)): + elements = list(special_elems) + elements.extend(iterable) + assert len(elements) == len(set(elements)) + + self.id_to_elem = {i: elem for i, elem in enumerate(elements)} + self.elem_to_id = {elem: i for i, elem in enumerate(elements)} + + def __iter__(self): + for i in range(len(self)): + yield self.id_to_elem[i] + + def __contains__(self, value): + return value in self.elem_to_id + + def __len__(self): + return len(self.elem_to_id) + + def __getitem__(self, key): + if isinstance(key, slice): + raise TypeError("Slices not supported.") + return self.id_to_elem[key] + + def index(self, value): + try: + return self.elem_to_id[value] + except KeyError: + return self.elem_to_id[UNK] + + def indices(self, values): + return [self.index(value) for value in values] + + def __hash__(self): + return id(self) + + @classmethod + def load(self, in_path): + return Vocab(json.load(open(in_path)), special_elems=()) + + def save(self, out_path): + with open(out_path, "w") as ofs: + json.dump([self.id_to_elem[i] for i in range(len(self.id_to_elem))], ofs) + + +class VocabBuilder: + def __init__(self, min_freq=None, max_count=None): + self.word_freq = collections.Counter() + self.min_freq = min_freq + self.max_count = max_count + + def add_word(self, word, count=1): + self.word_freq[word] += count + + def finish(self, *args, **kwargs): + # Select the `max_count` most frequent words. If `max_count` is None, then choose all of the words. + eligible_words_and_freqs = self.word_freq.most_common(self.max_count) + if self.min_freq is not None: + for i, (word, freq) in enumerate(eligible_words_and_freqs): + if freq < self.min_freq: + eligible_words_and_freqs = eligible_words_and_freqs[:i] + break + + return Vocab((word for word, freq in sorted(eligible_words_and_freqs)), *args, **kwargs) + + def save(self, path): + with open(path, "w") as f: + json.dump(self.word_freq, f) + + def load(self, path): + with open(path, "r") as f: + self.word_freq = collections.Counter(json.load(f)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/global_config.py b/examples/text_to_sql/RAT-SQL/text2sql/global_config.py new file mode 100644 index 0000000000000000000000000000000000000000..e31afaf4d90ef442b297313606027226a3b479a9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/global_config.py @@ -0,0 +1,149 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
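+
+"""Command-line and config handling for text2sql: defines the argparse parser
+and merges parsed CLI arguments over the jsonnet config file (CLI values take
+priority when both are given)."""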
+ +import argparse +import json +import logging +import os +import sys +from types import SimpleNamespace + +import _jsonnet as jsonnet + + +def define_args_parser(): + """define command-line args parser""" + + def _arg_bool(arg): + """trans arg to bool type""" + if arg is None: + return arg + + if type(arg) is not str: + return bool(arg) + + if arg.isdigit(): + return bool(int(arg)) + + if arg.lower() == "true": + return True + else: + return False + + parser = argparse.ArgumentParser(description="text2sql command-line interface") + parser.add_argument( + "-c", "--config", default="conf/text2sql.jsonnet", help="global config file path. it's priority is the lowest" + ) + + general_args = parser.add_argument_group(title="general") + general_args.add_argument( + "--mode", + type=str.lower, + default="debug", + required=False, + choices=["preproc", "train", "infer", "test", "debug"], + ) + general_args.add_argument("--batch-size", type=int) + general_args.add_argument("--beam-size", default=1, type=int) + general_args.add_argument("--use-cuda", type=_arg_bool, default=True, help="is run in cuda mode") + general_args.add_argument("--is-eval-value", type=_arg_bool, default=True, help="is evaluating value") + general_args.add_argument("--is-cloud", type=_arg_bool, default=False, help="is run in paddle cloud") + general_args.add_argument("--is-debug", type=_arg_bool, default=False, help="is run in debug mode") + + model_args = parser.add_argument_group(title="model") + model_args.add_argument("--pretrain-model", help="ernie model path for dygraph") + model_args.add_argument("--init-model-params", help="trained model params") + model_args.add_argument("--init-model-optim", help="dumped model optimizer") + model_args.add_argument("--model-name", choices=["seq2tree_v2"], help="ernie model path for dygraph") + model_args.add_argument("--grammar-type", choices=["dusql_v2", "nl2sql"], help="") + + data_args = parser.add_argument_group(title="data") + data_args.add_argument("--data-root", help="root data path. 
low priority.") + data_args.add_argument("--db", help="a tuple of pathes (schema, content) or path to dumped file") + data_args.add_argument("--db-schema", help="temp argument") + data_args.add_argument("--grammar", help="path to grammar definition file, or cached label vocabs directory") + data_args.add_argument("--train-set", help="original dataset path or dumped file path") + data_args.add_argument("--dev-set", help="original dataset path or dumped file path") + data_args.add_argument("--test-set", help="original dataset path or dumped file path") + data_args.add_argument("--eval-file", help="file to be evaluated(inferenced result)") + data_args.add_argument("--output", help="") + data_args.add_argument("--is-cached", type=_arg_bool, help="is dataset in cached format") + + train_args = parser.add_argument_group(title="train") + train_args.add_argument("--epochs", type=int) + train_args.add_argument("--learning-rate", type=float) + train_args.add_argument("--log-steps", type=int) + train_args.add_argument("--random-seed", type=int) + train_args.add_argument("--use-data-parallel", type=_arg_bool) + + return parser + + +def gen_config(arg_list=None): + """read configs from file, and updating it by command-line arguments + + Args: + config_path (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + parser = define_args_parser() + cli_args = parser.parse_args(arg_list) + if cli_args.data_root is not None: + root_path = cli_args.data_root + if cli_args.is_cached or cli_args.is_cached is None: + if cli_args.db is None: + cli_args.db = os.path.join(root_path, "db.pkl") + if cli_args.grammar is None: + cli_args.grammar = os.path.join(root_path, "label_vocabs") + if cli_args.train_set is None: + cli_args.train_set = os.path.join(root_path, "train.pkl") + if cli_args.dev_set is None: + cli_args.dev_set = os.path.join(root_path, "dev.pkl") + if cli_args.test_set is None and not cli_args.mode.startswith("train"): + cli_args.test_set = os.path.join(root_path, "test.pkl") + else: + if cli_args.db is None: + cli_args.db = [os.path.join(root_path, "db_schema.json"), os.path.join(root_path, "db_content.json")] + if cli_args.train_set is None: + cli_args.train_set = os.path.join(root_path, "train.json") + if cli_args.dev_set is None: + cli_args.dev_set = os.path.join(root_path, "dev.json") + if cli_args.test_set is None and not cli_args.mode.startswith("train"): + cli_args.test_set = os.path.join(root_path, "test.json") + + arg_groups = {} + for group in parser._action_groups: + group_dict = {arg.dest: getattr(cli_args, arg.dest, None) for arg in group._group_actions} + arg_groups[group.title] = {k: v for k, v in group_dict.items() if v is not None} + + config_file = cli_args.config + config = json.loads(jsonnet.evaluate_file(config_file), object_hook=lambda o: SimpleNamespace(**o)) + + for group, args in arg_groups.items(): + if not hasattr(config, group): + logging.debug(f"group {group} is not a module of config") + setattr(config, group, SimpleNamespace()) + config_module = getattr(config, group) + for name, value in args.items(): + setattr(config_module, name, value) + + return config + + +if __name__ == "__main__": + """run some simple test cases""" + print(gen_config(sys.argv[1:] + ["--mode", "train", "--db", "path/to/db"])) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6f0ea85344b7e0c679730356928c8749cf71cd66 --- /dev/null +++ 
b/examples/text_to_sql/RAT-SQL/text2sql/grammars/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/cspider_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/cspider_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..466bc51471902242a6a73991f9a2e23fcf843301 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/grammars/cspider_v2.py @@ -0,0 +1,701 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import itertools + +import asdl +import attr +import networkx as nx +from text2sql.utils import ast_util + + +def bimap(first, second): + return {f: s for f, s in zip(first, second)}, {s: f for f, s in zip(first, second)} + + +def filter_nones(d): + return {k: v for k, v in d.items() if v is not None and v != []} + + +def join(iterable, delimiter): + it = iter(iterable) + yield next(it) + for x in it: + yield delimiter + yield x + + +def intersperse(delimiter, seq): + return itertools.islice(itertools.chain.from_iterable(zip(itertools.repeat(delimiter), seq)), 1, None) + + +class CSpiderLanguageV2: + root_type = "sql" + + def __init__( + self, + asdl_file, + output_from=True, + use_table_pointer=True, + include_literals=False, + include_columns=True, + end_with_from=True, + clause_order=None, + infer_from_conditions=True, + factorize_sketch=2, + ): + + # collect pointers and checkers + custom_primitive_type_checkers = {} + self.include_columns = include_columns + self.pointers = set(["table", "column", "value"]) + custom_primitive_type_checkers["table"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["column"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["value"] = lambda x: isinstance(x, int) + + # create ast wrapper + self.factorize_sketch = factorize_sketch + self.ast_wrapper = ast_util.ASTWrapper( + asdl.parse(asdl_file), custom_primitive_type_checkers=custom_primitive_type_checkers + ) + if not use_table_pointer: + self.ast_wrapper.singular_types["Table"].fields[0].type = "int" + if not include_columns: + col_unit_fields = self.ast_wrapper.singular_types["col_unit"].fields + assert col_unit_fields[1].name == "col_id" + del col_unit_fields[1] + + # literals of limit field + self.include_literals = include_literals + + # from field + self.output_from = output_from + self.end_with_from = end_with_from 
+ self.clause_order = clause_order + self.infer_from_conditions = infer_from_conditions + if self.clause_order: + # clause order is prioritized over configurations like end_with_from + assert factorize_sketch == 2 # TODO support other grammars + sql_fields = self.ast_wrapper.product_types["sql"].fields + letter2field = {k: v for k, v in zip("SFWGOI", sql_fields)} + new_sql_fields = [letter2field[k] for k in self.clause_order] + self.ast_wrapper.product_types["sql"].fields = new_sql_fields + else: + if not self.output_from: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + del sql_fields[1] + else: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + if self.end_with_from: + sql_fields.append(sql_fields[1]) + del sql_fields[1] + + def parse(self, code, section): + return self.parse_sql(code) + + def unparse(self, tree, db, value_list): + unparser = CSpiderUnparser(self.ast_wrapper, db, value_list, self.factorize_sketch) + return unparser.unparse_sql(tree) + + @classmethod + def tokenize_field_value(cls, field_value): + if isinstance(field_value, bytes): + field_value_str = field_value.encode("latin1") + elif isinstance(field_value, str): + field_value_str = field_value + else: + field_value_str = str(field_value) + if field_value_str[0] == '"' and field_value_str[-1] == '"': + field_value_str = field_value_str[1:-1] + # TODO: Get rid of surrounding quotes + return [field_value_str] + + def parse_val(self, val): + if isinstance(val, int): + return { + "_type": "Value", + "val_id": val, + } + elif isinstance(val, dict): + return { + "_type": "ValSql", + "s": self.parse_sql(val), + } + else: + raise ValueError(val) + + def parse_col_unit(self, col_unit): + agg_id, col_id, is_distinct = col_unit + result = { + "_type": "col_unit", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "is_distinct": is_distinct, + } + if self.include_columns: + result["col_id"] = col_id + return result + + def parse_val_unit(self, val_unit): + unit_op, col_unit1, col_unit2 = val_unit + result = { + "_type": self.UNIT_TYPES_F[unit_op], + "col_unit1": self.parse_col_unit(col_unit1), + } + if unit_op != 0: + result["col_unit2"] = self.parse_col_unit(col_unit2) + return result + + def parse_table_unit(self, table_unit): + table_type, value = table_unit + if table_type == "sql": + return { + "_type": "TableUnitSql", + "s": self.parse_sql(value), + } + elif table_type == "table_unit": + return { + "_type": "Table", + "table_id": value, + } + else: + raise ValueError(table_type) + + def parse_cond(self, cond, optional=False): + if optional and not cond: + return None + + if len(cond) > 1: + return { + "_type": self.LOGIC_OPERATORS_F[cond[1]], + "left": self.parse_cond(cond[:1]), + "right": self.parse_cond(cond[2:]), + } + + ((not_op, op_id, val_unit, val1, val2),) = cond + result = { + "_type": self.COND_TYPES_F[op_id], + "val_unit": self.parse_val_unit(val_unit), + "val1": self.parse_val(val1), + } + if op_id == 1: # between + result["val2"] = self.parse_val(val2) + if not_op: + result = { + "_type": "Not", + "c": result, + } + return result + + def parse_sql(self, sql, optional=False): + if optional and sql is None: + return None + if self.factorize_sketch == 0: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + "where": self.parse_cond(sql["where"], optional=True), + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "order_by": self.parse_order_by(sql["orderBy"]), + 
"having": self.parse_cond(sql["having"], optional=True), + "limit": sql["limit"] if self.include_literals else (sql["limit"] is not None), + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + } + ) + elif self.factorize_sketch == 1: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": filter_nones( + { + "_type": "having", + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": filter_nones( + { + "_type": "limit", + "limit": sql["limit"] + if self.include_literals + else (sql["limit"] is not None), + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ), + } + ), + } + ), + } + ) + elif self.factorize_sketch == 2: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + } + ), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": sql["limit"] if sql["limit"] is not None else 0, + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ) + + def parse_select(self, select): + is_distinct, aggs = select + return { + "_type": "select", + "is_distinct": is_distinct, + "aggs": [self.parse_agg(agg) for agg in aggs], + } + + def parse_agg(self, agg): + agg_id, val_unit = agg + return { + "_type": "agg", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.parse_val_unit(val_unit), + } + + def parse_from(self, from_, infer_from_conditions=False): + return filter_nones( + { + "_type": "from", + "table_units": [self.parse_table_unit(u) for u in from_["table_units"]], + "conds": self.parse_cond(from_["conds"], optional=True) if not infer_from_conditions else None, + } + ) + + def parse_order_by(self, order_by): + if not order_by: + return None + + order, val_units = order_by + return { + "_type": "order_by", + "order": {"_type": self.ORDERS_F[order]}, + "val_units": [self.parse_val_unit(v) for v in val_units], + } + + COND_TYPES_F, COND_TYPES_B = bimap( + # ('not', 'between', '=', '>', '<', '>=', 
'<=', '!=', 'in', 'like', 'is', 'exists'), + # (None, 'Between', 'Eq', 'Gt', 'Lt', 'Ge', 'Le', 'Ne', 'In', 'Like', 'Is', 'Exists')) + range(1, 10), + ("Between", "Eq", "Gt", "Lt", "Ge", "Le", "Ne", "In", "Like"), + ) + + UNIT_TYPES_F, UNIT_TYPES_B = bimap( + # ('none', '-', '+', '*', '/'), + range(5), + ("Column", "Minus", "Plus", "Times", "Divide"), + ) + + AGG_TYPES_F, AGG_TYPES_B = bimap(range(6), ("NoneAggOp", "Max", "Min", "Count", "Sum", "Avg")) + + ORDERS_F, ORDERS_B = bimap(("asc", "desc"), ("Asc", "Desc")) + + LOGIC_OPERATORS_F, LOGIC_OPERATORS_B = bimap(("and", "or"), ("And", "Or")) + + +@attr.s +class CSpiderUnparser: + ast_wrapper = attr.ib() + schema = attr.ib() + value_list = attr.ib() + factorize_sketch = attr.ib(default=0) + + UNIT_TYPES_B = { + "Minus": "-", + "Plus": "+", + "Times": "*", + "Divide": "/", + } + COND_TYPES_B = { + "Between": "BETWEEN", + "Eq": "=", + "Gt": ">", + "Lt": "<", + "Ge": ">=", + "Le": "<=", + "Ne": "!=", + "In": "IN", + "Like": "LIKE", + } + + @classmethod + def conjoin_conds(cls, conds): + if not conds: + return None + if len(conds) == 1: + return conds[0] + return {"_type": "And", "left": conds[0], "right": cls.conjoin_conds(conds[1:])} + + @classmethod + def linearize_cond(cls, cond): + if cond["_type"] in ("And", "Or"): + conds, keywords = cls.linearize_cond(cond["right"]) + return [cond["left"]] + conds, [cond["_type"]] + keywords + else: + return [cond], [] + + def unparse_val(self, val): + if val["_type"] == "Value": + value_index = int(val["val_id"]) + if value_index >= len(self.value_list): + value_index = 0 + return f'"{self.value_list[value_index]}"' + if val["_type"] == "ValSql": + return f'({self.unparse_sql(val["s"])})' + if val["_type"] == "ColUnit": + return self.unparse_col_unit(val["c"]) + + def unparse_col_unit(self, col_unit): + if "col_id" in col_unit: + column = self.schema.columns[col_unit["col_id"]] + if column.table is None: + column_name = column.orig_name + else: + column_name = f"{column.table.orig_name}.{column.orig_name}" + else: + column_name = "some_col" + + if col_unit["is_distinct"]: + column_name = f"DISTINCT {column_name}" + agg_type = col_unit["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return column_name + else: + return f"{agg_type}({column_name})" + + def unparse_val_unit(self, val_unit): + if val_unit["_type"] == "Column": + return self.unparse_col_unit(val_unit["col_unit1"]) + col1 = self.unparse_col_unit(val_unit["col_unit1"]) + col2 = self.unparse_col_unit(val_unit["col_unit2"]) + return f'{col1} {self.UNIT_TYPES_B[val_unit["_type"]]} {col2}' + + # def unparse_table_unit(self, table_unit): + # raise NotImplementedError + + def unparse_cond(self, cond, negated=False): + if cond["_type"] == "And": + assert not negated + return f'{self.unparse_cond(cond["left"])} AND {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Or": + assert not negated + return f'{self.unparse_cond(cond["left"])} OR {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Not": + return self.unparse_cond(cond["c"], negated=True) + elif cond["_type"] == "Between": + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [ + "BETWEEN", + self.unparse_val(cond["val1"]), + "AND", + self.unparse_val(cond["val2"]), + ] + return " ".join(tokens) + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [self.COND_TYPES_B[cond["_type"]], self.unparse_val(cond["val1"])] + return " ".join(tokens) + + def refine_from(self, tree): + """ 
+ 1) Inferring tables from columns predicted + 2) Mix them with the predicted tables if any + 3) Inferring conditions based on tables + """ + + # nested query in from clause, recursively use the refinement + if "from" in tree and tree["from"]["table_units"][0]["_type"] == "TableUnitSql": + for table_unit in tree["from"]["table_units"]: + subquery_tree = table_unit["s"] + self.refine_from(subquery_tree) + return + + # get predicted tables + predicted_from_table_ids = set() + if "from" in tree: + table_unit_set = [] + for table_unit in tree["from"]["table_units"]: + if table_unit["table_id"] not in predicted_from_table_ids: + predicted_from_table_ids.add(table_unit["table_id"]) + table_unit_set.append(table_unit) + tree["from"]["table_units"] = table_unit_set # remove duplicate + + # Get all candidate columns + candidate_column_ids = set( + self.ast_wrapper.find_all_descendants_of_type(tree, "column", lambda field: field.type != "sql") + ) + candidate_columns = [self.schema.columns[i] for i in candidate_column_ids] + must_in_from_table_ids = set(column.table.id for column in candidate_columns if column.table is not None) + + # Table the union of inferred and predicted tables + all_from_table_ids = must_in_from_table_ids.union(predicted_from_table_ids) + if not all_from_table_ids: + # TODO: better heuristic e.g., tables that have exact match + all_from_table_ids = {0} + + covered_tables = set() + candidate_table_ids = sorted(all_from_table_ids) + start_table_id = candidate_table_ids[0] + conds = [] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path(self.schema.foreign_key_graph, source=start_table_id, target=table_id) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + all_from_table_ids.add(target_table_id) + col1, col2 = self.schema.foreign_key_graph[source_table_id][target_table_id]["columns"] + conds.append( + { + "_type": "Eq", + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col1, + "is_distinct": False, + }, + }, + "val1": { + "_type": "ColUnit", + "c": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col2, + "is_distinct": False, + }, + }, + } + ) + table_units = [{"_type": "Table", "table_id": i} for i in sorted(all_from_table_ids)] + + tree["from"] = { + "_type": "from", + "table_units": table_units, + } + cond_node = self.conjoin_conds(conds) + if cond_node is not None: + tree["from"]["conds"] = cond_node + + def unparse_sql(self, tree): + self.refine_from(tree) + + result = [ + # select select, + self.unparse_select(tree["select"]), + # from from, + self.unparse_from(tree["from"]), + ] + + def find_subtree(_tree, name): + if self.factorize_sketch == 0: + return _tree, _tree + elif name in _tree: + if self.factorize_sketch == 1: + return _tree[name], _tree[name] + elif self.factorize_sketch == 2: + return _tree, _tree[name] + else: + raise NotImplementedError + + tree, target_tree = find_subtree(tree, "sql_where") + # cond? 
where, + if "where" in target_tree: + result += ["WHERE", self.unparse_cond(target_tree["where"])] + + tree, target_tree = find_subtree(tree, "sql_groupby") + # col_unit* group_by, + if "group_by" in target_tree: + result += ["GROUP BY", ", ".join(self.unparse_col_unit(c) for c in target_tree["group_by"])] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # order_by? order_by, + if "order_by" in target_tree: + result.append(self.unparse_order_by(target_tree["order_by"])) + + tree, target_tree = find_subtree(tree, "sql_groupby") + # cond? having, + if "having" in target_tree: + result += ["HAVING", self.unparse_cond(target_tree["having"])] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # int? limit, + if "limit" in target_tree: + limit_index = int(target_tree["limit"]) + limit_value = "0" + if limit_index < len(self.value_list): + limit_value = self.value_list[limit_index] + if limit_value == "value": + limit_value = "1" + if limit_value.isdigit() and limit_value != "0": + result += ["LIMIT", str(limit_value)] + + tree, target_tree = find_subtree(tree, "sql_ieu") + # sql? intersect, + if "intersect" in target_tree: + result += ["INTERSECT", self.unparse_sql(target_tree["intersect"])] + # sql? except, + if "except" in target_tree: + result += ["EXCEPT", self.unparse_sql(target_tree["except"])] + # sql? union + if "union" in target_tree: + result += ["UNION", self.unparse_sql(target_tree["union"])] + + return " ".join(result) + + def unparse_select(self, select): + tokens = ["SELECT"] + if select["is_distinct"]: + tokens.append("DISTINCT") + tokens.append(", ".join(self.unparse_agg(agg) for agg in select.get("aggs", []))) + return " ".join(tokens) + + def unparse_agg(self, agg): + unparsed_val_unit = self.unparse_val_unit(agg["val_unit"]) + agg_type = agg["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return unparsed_val_unit + else: + return f"{agg_type}({unparsed_val_unit})" + + def unparse_from(self, from_): + if "conds" in from_: + all_conds, keywords = self.linearize_cond(from_["conds"]) + else: + all_conds, keywords = [], [] + assert all(keyword == "And" for keyword in keywords) + + cond_indices_by_table = collections.defaultdict(set) + tables_involved_by_cond_idx = collections.defaultdict(set) + for i, cond in enumerate(all_conds): + for column in self.ast_wrapper.find_all_descendants_of_type(cond, "column"): + table = self.schema.columns[column].table + if table is None: + continue + cond_indices_by_table[table.id].add(i) + tables_involved_by_cond_idx[i].add(table.id) + + output_table_ids = set() + output_cond_indices = set() + tokens = ["FROM"] + for i, table_unit in enumerate(from_.get("table_units", [])): + if i > 0: + tokens += ["JOIN"] + + if table_unit["_type"] == "TableUnitSql": + tokens.append(f'({self.unparse_sql(table_unit["s"])})') + elif table_unit["_type"] == "Table": + table_id = table_unit["table_id"] + tokens += [self.schema.tables[table_id].orig_name] + output_table_ids.add(table_id) + + # Output "ON " if all tables involved in the condition have been output + conds_to_output = [] + for cond_idx in sorted(cond_indices_by_table[table_id]): + if cond_idx in output_cond_indices: + continue + if tables_involved_by_cond_idx[cond_idx] <= output_table_ids: + conds_to_output.append(all_conds[cond_idx]) + output_cond_indices.add(cond_idx) + if conds_to_output: + tokens += ["ON"] + tokens += list(intersperse("AND", (self.unparse_cond(cond) for cond in conds_to_output))) + return " ".join(tokens) + + def unparse_order_by(self, order_by): + return 
f'ORDER BY {", ".join(self.unparse_val_unit(v) for v in order_by["val_units"])} {order_by["order"]["_type"]}' diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/dusql_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/dusql_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..75989fd74b3e6d6f3d240d7715fa480c88f497bd --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/grammars/dusql_v2.py @@ -0,0 +1,750 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import itertools +import logging + +import asdl +import attr +import networkx as nx +from text2sql.utils import ast_util + + +def bimap(first, second): + return {f: s for f, s in zip(first, second)}, {s: f for f, s in zip(first, second)} + + +def filter_nones(d): + return {k: v for k, v in d.items() if v is not None and v != []} + + +def join(iterable, delimiter): + it = iter(iterable) + yield next(it) + for x in it: + yield delimiter + yield x + + +def intersperse(delimiter, seq): + return itertools.islice(itertools.chain.from_iterable(zip(itertools.repeat(delimiter), seq)), 1, None) + + +class DuSQLLanguageV2(object): + root_type = "sql" + + def __init__( + self, + asdl_file, + output_from=True, # < changed + use_table_pointer=True, # < changed + include_literals=False, # < changed + include_columns=True, + end_with_from=True, # < changed + clause_order=None, + infer_from_conditions=True, # < changed + factorize_sketch=2, + ): + + # collect pointers and checkers + self.pointers = set(["table", "column", "value"]) + custom_primitive_type_checkers = {} + custom_primitive_type_checkers["table"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["column"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["value"] = lambda x: isinstance(x, int) + + self.include_columns = include_columns + # create ast wrapper + self.factorize_sketch = factorize_sketch + self.ast_wrapper = ast_util.ASTWrapper( + asdl.parse(asdl_file), custom_primitive_type_checkers=custom_primitive_type_checkers + ) + + # from field + self.output_from = output_from + self.end_with_from = end_with_from + self.clause_order = clause_order + self.infer_from_conditions = infer_from_conditions + if self.clause_order: + # clause order is prioritized over configurations like end_with_from + assert factorize_sketch == 2 # TODO support other grammars + sql_fields = self.ast_wrapper.product_types["sql"].fields + letter2field = {k: v for k, v in zip("SFWGOI", sql_fields)} + new_sql_fields = [letter2field[k] for k in self.clause_order] + self.ast_wrapper.product_types["sql"].fields = new_sql_fields + else: + if not self.output_from: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + del sql_fields[1] + else: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + if self.end_with_from: + sql_fields.append(sql_fields[1]) + del sql_fields[1] + + def 
parse(self, code, section): + return self.parse_sql(code) + + def unparse(self, tree, db, value_list): + unparser = DuSQLUnparser(self.ast_wrapper, db, value_list, self.factorize_sketch) + return unparser.unparse_sql(tree) + + @classmethod + def tokenize_field_value(cls, field_value): + if isinstance(field_value, bytes): + field_value_str = field_value.encode("latin1") + elif isinstance(field_value, str): + field_value_str = field_value + else: + field_value_str = str(field_value) + if field_value_str[0] == '"' and field_value_str[-1] == '"': + field_value_str = field_value_str[1:-1] + # TODO: Get rid of surrounding quotes + return [field_value_str] + + def parse_val(self, val): + if isinstance(val, int): # Modify: add int + return { + "_type": "Value", + "val_id": val, + } + elif isinstance(val, dict): + return { + "_type": "ValSql", + "s": self.parse_sql(val), + } + else: + raise ValueError(val) + + def parse_col_unit(self, col_unit): + agg_id, col_id = col_unit[:2] + result = { + "_type": "col_unit", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + } + if self.include_columns: + result["col_id"] = col_id + return result + + def parse_val_unit(self, val_unit): + unit_op, col_unit1, col_unit2 = val_unit + result = { + "_type": self.UNIT_TYPES_F[unit_op], + "col_unit1": self.parse_col_unit(col_unit1), + } + if unit_op != 0: + result["col_unit2"] = self.parse_col_unit(col_unit2) + if result["col_unit1"]["col_id"] == "TIME_NOW": + logging.debug("fix TIME_NOW grammar") + result["col_unit1"]["col_id"] = result["col_unit2"]["col_id"] + return result + + def parse_table_unit(self, table_unit): + table_type, value = table_unit + if table_type == "sql": + return { + "_type": "TableUnitSql", + "s": self.parse_sql(value), + } + elif table_type == "table_unit": + return { + "_type": "Table", + "table_id": value, + } + else: + raise ValueError(table_type) + + def parse_cond(self, cond, optional=False): + if optional and not cond: + return None + + if len(cond) > 1: + return { + "_type": self.LOGIC_OPERATORS_F[cond[1]], + "left": self.parse_cond(cond[:1]), + "right": self.parse_cond(cond[2:]), + } + + ((agg_id, op_id, val_unit, val1, val2),) = cond + result = { + "_type": self.COND_TYPES_F[op_id], + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.parse_val_unit(val_unit), + "val1": self.parse_val(val1), + } + if op_id == 1: # between + result["val2"] = self.parse_val(val2) + return result + + def parse_sql(self, sql, optional=False): + if optional and sql is None: + return None + if self.factorize_sketch == 0: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + "where": self.parse_cond(sql["where"], optional=True), + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "order_by": self.parse_order_by(sql["orderBy"]), + "having": self.parse_cond(sql["having"], optional=True), + "limit": sql["limit"] if self.include_literals else (sql["limit"] is not None), + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + } + ) + elif self.factorize_sketch == 1: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + 
"_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": filter_nones( + { + "_type": "having", + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": filter_nones( + { + "_type": "limit", + "limit": sql["limit"] + if self.include_literals + else (sql["limit"] is not None), + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ), + } + ), + } + ), + } + ) + elif self.factorize_sketch == 2: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + } + ), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": sql["limit"] if sql["limit"] is not None else 0, + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ) + + def parse_select(self, select): + if type(select[0]) is bool: + aggs = select[1] + else: + aggs = select + return { + "_type": "select", + "aggs": [self.parse_agg(agg) for agg in aggs], + } + + def parse_agg(self, agg): + if len(agg) == 2: + agg_id, val_unit = agg + else: + agg_id, val_unit = 0, agg + return { + "_type": "agg", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.parse_val_unit(val_unit), + } + + def parse_from(self, from_, infer_from_conditions=False): + return filter_nones( + { + "_type": "from", + "table_units": [self.parse_table_unit(u) for u in from_["table_units"]], + "conds": self.parse_cond(from_["conds"], optional=True) if not infer_from_conditions else None, + } + ) + + def parse_order_by(self, order_by): + if not order_by: + return None + + # DIFF: Spider&DuSQL + order, order_units = order_by + return { + "_type": "order_by", + "order": {"_type": self.ORDERS_F[order]}, + "aggs": [self.parse_agg(v) for v in order_units], + } + + COND_TYPES_F, COND_TYPES_B = bimap( + # Spider: (not, between, =, >, <, >=, <=, !=, in, like, is, exists) + # RAT: (None, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like, Is, Exists) + # DuSQL: (NotIn, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like) + # DIFF: Spider&DuSQL + range(0, 10), + ("NotIn", "Between", "Eq", "Gt", "Lt", "Ge", "Le", "Ne", "In", "Like"), + ) + + UNIT_TYPES_F, UNIT_TYPES_B = bimap( + # ('none', '-', '+', '*', '/'), + range(5), + ("Column", "Minus", "Plus", "Times", "Divide"), + ) + + AGG_TYPES_F, AGG_TYPES_B = bimap(range(6), ("NoneAggOp", "Max", "Min", "Count", "Sum", "Avg")) + + ORDERS_F, ORDERS_B = bimap(("asc", "desc"), ("Asc", "Desc")) + + LOGIC_OPERATORS_F, LOGIC_OPERATORS_B = bimap(("and", 
"or"), ("And", "Or")) + + +@attr.s +class DuSQLUnparser: + ast_wrapper = attr.ib() + schema = attr.ib() + value_list = attr.ib() + factorize_sketch = attr.ib(default=0) + + UNIT_TYPES_B = { + "Minus": "-", + "Plus": "+", + "Times": "*", + "Divide": "/", + } + COND_TYPES_B = { + "Between": "BETWEEN", + "Eq": "=", + "Gt": ">", + "Lt": "<", + "Ge": ">=", + "Le": "<=", + "Ne": "!=", + "In": "IN", + "NotIn": "NOT IN", + "Like": "LIKE", + } + + @classmethod + def conjoin_conds(cls, conds): + if not conds: + return None + if len(conds) == 1: + return conds[0] + return {"_type": "And", "left": conds[0], "right": cls.conjoin_conds(conds[1:])} + + @classmethod + def linearize_cond(cls, cond): + if cond["_type"] in ("And", "Or"): + conds, keywords = cls.linearize_cond(cond["right"]) + return [cond["left"]] + conds, [cond["_type"]] + keywords + else: + return [cond], [] + + def unparse_val(self, val): + if val["_type"] == "Value": + value_index = int(val["val_id"]) + if value_index >= len(self.value_list): + value_index = 0 + return f'"{self.value_list[value_index]}"' + if val["_type"] == "ValSql": + return f'({self.unparse_sql(val["s"])})' + if val["_type"] == "ColUnit": + column = self.schema.columns[val["col_id"]] + return f"{column.table.orig_name}.{column.orig_name}" + + def unparse_col_unit(self, col_unit, alias_table_name=None): + if "col_id" in col_unit: + column = self.schema.columns[col_unit["col_id"]] + if alias_table_name is not None: + column_name = f"{alias_table_name}.{column.orig_name}" + elif column.table is not None: + column_name = f"{column.table.orig_name}.{column.orig_name}" + else: + column_name = column.orig_name + else: + column_name = "some_col" + + agg_type = col_unit["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return column_name + else: + return f"{agg_type}({column_name})" + + def unparse_val_unit(self, val_unit, is_row_calc=False): + if val_unit["_type"] == "Column": + return self.unparse_col_unit(val_unit["col_unit1"]) + col1 = self.unparse_col_unit(val_unit["col_unit1"], alias_table_name="a" if is_row_calc else None) + col2 = self.unparse_col_unit(val_unit["col_unit2"], alias_table_name="b" if is_row_calc else None) + calc_op = self.UNIT_TYPES_B[val_unit["_type"]] + # TODO: DuSQL const col + if col1 == col2 and calc_op == "-": + col1 = "TIME_NOW" + return f"{col1} {calc_op} {col2}" + + # def unparse_table_unit(self, table_unit): + # raise NotImplementedError + + def unparse_cond(self, cond, negated=False): + """ + Args: + cond: + { + "_type": "Ne", + "agg_id": { + "_type": "NoneAggOp" + }, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": { + "_type": "NoneAggOp" + }, + "col_id": 11, + } + }, + "val1": { + "_type": "Value", + "val_id": 0 + } + } + """ + if cond["_type"] == "And": + assert not negated + return f'{self.unparse_cond(cond["left"])} AND {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Or": + assert not negated + return f'{self.unparse_cond(cond["left"])} OR {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Not": + return self.unparse_cond(cond["c"], negated=True) + elif cond["_type"] == "Between": + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [ + "BETWEEN", + self.unparse_val(cond["val1"]), + "AND", + self.unparse_val(cond["val2"]), + ] + return " ".join(tokens) + tokens = [self.unparse_val_unit(cond["val_unit"])] + if cond["agg_id"]["_type"] != "NoneAggOp": + agg = cond["agg_id"]["_type"] + tokens[0] = f"{agg}({tokens[0]})" + if 
negated: + tokens.append("NOT") + tokens += [self.COND_TYPES_B[cond["_type"]], self.unparse_val(cond["val1"])] + return " ".join(tokens) + + def refine_from(self, tree): + """ + 1) Inferring tables from columns predicted + 2) Mix them with the predicted tables if any + 3) Inferring conditions based on tables + + Returns: bool + True: 是行计算 + False: 不是行计算 + """ + + # nested query in from clause, recursively use the refinement + if "from" in tree and tree["from"]["table_units"][0]["_type"] == "TableUnitSql": + for table_unit in tree["from"]["table_units"]: + # 正常解码的结果中,FROM 中要么全是 sub-sql,要么全是普通的 table_id + if "s" not in table_unit: + logging.warning("error tree in FROM clause: %s", str(tree)) + continue + subquery_tree = table_unit["s"] + self.refine_from(subquery_tree) + return len(tree["from"]["table_units"]) == 2 # 行计算 + + # get predicted tables + predicted_from_table_ids = set() + if "from" in tree: + table_unit_set = [] + for table_unit in tree["from"]["table_units"]: + if "table_id" in table_unit and table_unit["table_id"] not in predicted_from_table_ids: + predicted_from_table_ids.add(table_unit["table_id"]) + table_unit_set.append(table_unit) + tree["from"]["table_units"] = table_unit_set # remove duplicate + + # Get all candidate columns + candidate_column_ids = set( + self.ast_wrapper.find_all_descendants_of_type(tree, "column", lambda field: field.type != "sql") + ) + candidate_columns = [self.schema.columns[i] for i in candidate_column_ids] + must_in_from_table_ids = set(column.table.id for column in candidate_columns if column.table is not None) + + # Table the union of inferred and predicted tables + all_from_table_ids = must_in_from_table_ids.union(predicted_from_table_ids) + if not all_from_table_ids: + # TODO: better heuristic e.g., tables that have exact match + all_from_table_ids = {0} + + covered_tables = set() + candidate_table_ids = sorted(all_from_table_ids) + start_table_id = candidate_table_ids[0] + conds = [] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path(self.schema.foreign_key_graph, source=start_table_id, target=table_id) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + all_from_table_ids.add(target_table_id) + col1, col2 = self.schema.foreign_key_graph[source_table_id][target_table_id]["columns"] + conds.append( + { + "_type": "Eq", + "agg_id": {"_type": DuSQLLanguageV2.AGG_TYPES_F[0]}, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col1, + }, + }, + "val1": { + "_type": "ColUnit", + "col_id": col2, + }, + } + ) + table_units = [{"_type": "Table", "table_id": i} for i in sorted(all_from_table_ids)] + + tree["from"] = { + "_type": "from", + "table_units": table_units, + } + cond_node = self.conjoin_conds(conds) + if cond_node is not None: + tree["from"]["conds"] = cond_node + return False + + def unparse_sql(self, tree): + is_row_calc = self.refine_from(tree) + + result = [ + # select select, + self.unparse_select(tree["select"], is_row_calc), + # from from, + self.unparse_from(tree["from"], is_row_calc), + ] + + def find_subtree(_tree, name): + if self.factorize_sketch == 0: + return _tree, _tree + elif name in _tree: + if self.factorize_sketch == 1: + return _tree[name], _tree[name] + elif self.factorize_sketch == 2: + return _tree, _tree[name] + else: + 
raise NotImplementedError + + tree, target_tree = find_subtree(tree, "sql_where") + # cond? where, + if "where" in target_tree: + result += ["WHERE", self.unparse_cond(target_tree["where"])] + + tree, target_tree = find_subtree(tree, "sql_groupby") + # col_unit* group_by, + if "group_by" in target_tree: + result += ["GROUP BY", ", ".join(self.unparse_col_unit(c) for c in target_tree["group_by"])] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # order_by? order_by, + if "order_by" in target_tree: + result.append(self.unparse_order_by(target_tree["order_by"])) + + tree, target_tree = find_subtree(tree, "sql_groupby") + # cond? having, + if "having" in target_tree: + having_block = self.unparse_cond(target_tree["having"]).split(" ") + if having_block[0] == "*": # 没有 agg + logging.info("post process: adding count() for having statement") + having_block[0] = "count(*)" + result += ["HAVING", " ".join(having_block)] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # int? limit, + if "limit" in target_tree: + limit_index = int(target_tree["limit"]) + limit_value = "0" + if limit_index < len(self.value_list): + limit_value = self.value_list[limit_index] + if limit_value == "value": + limit_value = "1" + if limit_value.isdigit() and limit_value != "0": # 0表示没有limit + result += ["LIMIT", str(limit_value)] + + tree, target_tree = find_subtree(tree, "sql_ieu") + # sql? intersect, + if "intersect" in target_tree: + result += ["INTERSECT", self.unparse_sql(target_tree["intersect"])] + # sql? except, + if "except" in target_tree: + result += ["EXCEPT", self.unparse_sql(target_tree["except"])] + # sql? union + if "union" in target_tree: + result += ["UNION", self.unparse_sql(target_tree["union"])] + + return " ".join(result) + + def unparse_select(self, select, is_row_calc=False): + tokens = ["SELECT"] + tokens.append(", ".join(self.unparse_agg(agg, is_row_calc) for agg in select.get("aggs", []))) + return " ".join(tokens) + + def unparse_agg(self, agg, is_row_calc=False): + unparsed_val_unit = self.unparse_val_unit(agg["val_unit"], is_row_calc) + agg_type = agg["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return unparsed_val_unit + else: + return f"{agg_type}({unparsed_val_unit})" + + def unparse_from(self, from_, is_row_calc=False): + if "conds" in from_: + all_conds, keywords = self.linearize_cond(from_["conds"]) + else: + all_conds, keywords = [], [] + assert all(keyword == "And" for keyword in keywords) + + cond_indices_by_table = collections.defaultdict(set) + tables_involved_by_cond_idx = collections.defaultdict(set) + for i, cond in enumerate(all_conds): + for column in self.ast_wrapper.find_all_descendants_of_type(cond, "column"): + if type(column) is dict: + column = column["col_id"] + table = self.schema.columns[column].table + if table is None: + continue + cond_indices_by_table[table.id].add(i) + tables_involved_by_cond_idx[i].add(table.id) + + output_table_ids = set() + output_cond_indices = set() + tokens = ["FROM"] + for i, table_unit in enumerate(from_.get("table_units", [])): + if i > 0: + if not is_row_calc: + tokens += ["JOIN"] + else: + tokens += [","] + + if table_unit["_type"] == "TableUnitSql": + tokens.append(f'({self.unparse_sql(table_unit["s"])})') + if is_row_calc: # 行计算SQL的别名 + tokens.append(["a", "b", "c"][i]) + elif table_unit["_type"] == "Table": + table_id = table_unit["table_id"] + tokens += [self.schema.tables[table_id].orig_name] + output_table_ids.add(table_id) + + # Output "ON " if all tables involved in the condition have been output + 
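+                # Each join condition is emitted at most once (tracked via output_cond_indices),
+                # and only after every table it references is already contained in output_table_ids.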
conds_to_output = [] + for cond_idx in sorted(cond_indices_by_table[table_id]): + if cond_idx in output_cond_indices: + continue + if tables_involved_by_cond_idx[cond_idx] <= output_table_ids: + conds_to_output.append(all_conds[cond_idx]) + output_cond_indices.add(cond_idx) + if conds_to_output: + tokens += ["ON"] + tokens += list(intersperse("AND", (self.unparse_cond(cond) for cond in conds_to_output))) + return " ".join(tokens) + + def unparse_order_by(self, order_by): + return f'ORDER BY {", ".join(self.unparse_agg(v) for v in order_by["aggs"])} {order_by["order"]["_type"]}' + + +if __name__ == "__main__": + """run some simple test cases""" + dusql_lang = DuSQLLanguageV2("conf/DuSQL.asdl") diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/nl2sql.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/nl2sql.py new file mode 100644 index 0000000000000000000000000000000000000000..b12592472721ab1069b58d1448a77e515f132ce9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/grammars/nl2sql.py @@ -0,0 +1,552 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import itertools +import logging + +import asdl +import attr +import networkx as nx +from text2sql.utils import ast_util + + +def bimap(first, second): + return {f: s for f, s in zip(first, second)}, {s: f for f, s in zip(first, second)} + + +def filter_nones(d): + return {k: v for k, v in d.items() if v is not None and v != []} + + +def join(iterable, delimiter): + it = iter(iterable) + yield next(it) + for x in it: + yield delimiter + yield x + + +def intersperse(delimiter, seq): + return itertools.islice(itertools.chain.from_iterable(zip(itertools.repeat(delimiter), seq)), 1, None) + + +class NL2SQLLanguage(object): + root_type = "sql" + + def __init__( + self, + asdl_file, + output_from=True, # < changed + use_table_pointer=True, # < changed + include_literals=False, # < changed + include_columns=True, + end_with_from=True, # < changed + clause_order=None, + infer_from_conditions=True, # < changed + factorize_sketch=2, + ): + + # collect pointers and checkers + self.pointers = set(["table", "column", "value"]) + custom_primitive_type_checkers = {} + custom_primitive_type_checkers["table"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["column"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["value"] = lambda x: isinstance(x, int) + + self.include_columns = include_columns + # create ast wrapper + self.factorize_sketch = factorize_sketch + self.ast_wrapper = ast_util.ASTWrapper( + asdl.parse(asdl_file), custom_primitive_type_checkers=custom_primitive_type_checkers + ) + + # from field + self.output_from = output_from + self.end_with_from = end_with_from + self.clause_order = clause_order + self.infer_from_conditions = infer_from_conditions + if self.clause_order: + # clause order is prioritized over configurations like end_with_from + assert factorize_sketch == 2 # TODO support other grammars + sql_fields = 
self.ast_wrapper.product_types["sql"].fields + letter2field = {k: v for k, v in zip("SFWGOI", sql_fields)} + new_sql_fields = [letter2field[k] for k in self.clause_order] + self.ast_wrapper.product_types["sql"].fields = new_sql_fields + else: + if not self.output_from: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + del sql_fields[1] + else: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + if self.end_with_from: + sql_fields.append(sql_fields[1]) + del sql_fields[1] + + def parse(self, code, section): + return self.parse_sql(code) + + def unparse(self, tree, db, value_list): + unparser = NL2SQLUnparser(self.ast_wrapper, db, value_list, self.factorize_sketch) + return unparser.unparse_sql(tree) + + @classmethod + def tokenize_field_value(cls, field_value): + if isinstance(field_value, bytes): + field_value_str = field_value.encode("latin1") + elif isinstance(field_value, str): + field_value_str = field_value + else: + field_value_str = str(field_value) + if field_value_str[0] == '"' and field_value_str[-1] == '"': + field_value_str = field_value_str[1:-1] + # TODO: Get rid of surrounding quotes + return [field_value_str] + + def build_val_unit(self, col_id): + result = { + "_type": self.UNIT_TYPES_F[0], + "col_unit1": {"_type": "col_unit", "agg_id": {"_type": self.AGG_TYPES_F[0]}, "col_id": col_id}, + } + return result + + def parse_sql(self, sql, optional=False): + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["sel"], sql["agg"]), + **( + {"from": {"_type": "from", "table_units": [{"_type": "Table", "table_id": 0}]}} + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["conds"], ["", "and", "or"][sql["cond_conn_op"]]), + } + ), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [], + "having": None, + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": None, + "limit": None, + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": None, + "except": None, + "union": None, + } + ), + } + ) + + def parse_select(self, select, aggs): + return { + "_type": "select", + "aggs": [self.parse_agg(col, agg) for col, agg in zip(select, aggs)], + } + + def parse_agg(self, col, agg): + return { + "_type": "agg", + "agg_id": {"_type": self.AGG_TYPES_F[agg]}, + "val_unit": self.build_val_unit(col), + } + + def parse_cond(self, conds, cond_conn_op): + if len(conds) > 1: + return { + "_type": self.LOGIC_OPERATORS_F[cond_conn_op], + "left": self.parse_cond(conds[:1], cond_conn_op), + "right": self.parse_cond(conds[1:], cond_conn_op), + } + + agg_id = 0 + col_id, op_id, val1 = conds[0] + result = { + "_type": self.COND_TYPES_F[op_id], + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.build_val_unit(col_id), + "val1": self.parse_val(val1), + } + return result + + def parse_val(self, val): + return { + "_type": "Value", + "val_id": val, + } + + COND_TYPES_F, COND_TYPES_B = bimap( + # Spider: (not, between, =, >, <, >=, <=, !=, in, like, is, exists) + # RAT: (None, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like, Is, Exists) + # DuSQL: (NotIn, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like) + # DIFF: Spider&DuSQL + range(0, 6), + ("Gt", "Lt", "Eq", "Ne", "Ge", "Le"), + ) + + UNIT_TYPES_F, UNIT_TYPES_B = bimap( + # ('none', '-', '+', '*', '/'), + range(5), + ("Column", "Minus", "Plus", "Times", "Divide"), + ) + + AGG_TYPES_F, 
AGG_TYPES_B = bimap(range(6), ("NoneAggOp", "Avg", "Max", "Min", "Count", "Sum")) + + ORDERS_F, ORDERS_B = bimap(("asc", "desc"), ("Asc", "Desc")) + + LOGIC_OPERATORS_F, LOGIC_OPERATORS_B = bimap(("and", "or"), ("And", "Or")) + + +@attr.s +class NL2SQLUnparser: + ast_wrapper = attr.ib() + schema = attr.ib() + value_list = attr.ib() + factorize_sketch = attr.ib(default=0) + + UNIT_TYPES_B = { + "Minus": "-", + "Plus": "+", + "Times": "*", + "Divide": "/", + } + COND_TYPES_B = { + "Between": "BETWEEN", + "Eq": "=", + "Gt": ">", + "Lt": "<", + "Ge": ">=", + "Le": "<=", + "Ne": "!=", + "In": "IN", + "NotIn": "NOT IN", + "Like": "LIKE", + } + + @classmethod + def conjoin_conds(cls, conds): + if not conds: + return None + if len(conds) == 1: + return conds[0] + return {"_type": "And", "left": conds[0], "right": cls.conjoin_conds(conds[1:])} + + @classmethod + def linearize_cond(cls, cond): + if cond["_type"] in ("And", "Or"): + conds, keywords = cls.linearize_cond(cond["right"]) + return [cond["left"]] + conds, [cond["_type"]] + keywords + else: + return [cond], [] + + def unparse_val(self, val): + if val["_type"] == "Value": + value_index = int(val["val_id"]) + if value_index >= len(self.value_list): + value_index = 0 + return f'"{self.value_list[value_index]}"' + if val["_type"] == "ValSql": + return f'({self.unparse_sql(val["s"])})' + if val["_type"] == "ColUnit": + column = self.schema.columns[val["col_id"]] + return f"{column.table.orig_name}.{column.orig_name}" + + def unparse_col_unit(self, col_unit, alias_table_name=None): + if "col_id" in col_unit: + column = self.schema.columns[col_unit["col_id"]] + if alias_table_name is not None: + column_name = f"{alias_table_name}.{column.orig_name}" + elif column.table is not None: + column_name = f"{column.table.orig_name}.{column.orig_name}" + else: + column_name = column.orig_name + else: + column_name = "some_col" + + agg_type = col_unit["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return column_name + else: + return f"{agg_type}({column_name})" + + def unparse_val_unit(self, val_unit, is_row_calc=False): + if val_unit["_type"] == "Column": + return self.unparse_col_unit(val_unit["col_unit1"]) + col1 = self.unparse_col_unit(val_unit["col_unit1"], alias_table_name="a" if is_row_calc else None) + col2 = self.unparse_col_unit(val_unit["col_unit2"], alias_table_name="b" if is_row_calc else None) + calc_op = self.UNIT_TYPES_B[val_unit["_type"]] + # TODO: DuSQL const col + if col1 == col2 and calc_op == "-": + col1 = "TIME_NOW" + return f"{col1} {calc_op} {col2}" + + # def unparse_table_unit(self, table_unit): + # raise NotImplementedError + + def unparse_cond(self, cond, negated=False): + """ + Args: + cond: + { + "_type": "Ne", + "agg_id": { + "_type": "NoneAggOp" + }, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": { + "_type": "NoneAggOp" + }, + "col_id": 11, + } + }, + "val1": { + "_type": "Value", + "val_id": 0 + } + } + """ + if cond["_type"] == "And": + assert not negated + return f'{self.unparse_cond(cond["left"])} AND {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Or": + assert not negated + return f'{self.unparse_cond(cond["left"])} OR {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Not": + return self.unparse_cond(cond["c"], negated=True) + elif cond["_type"] == "Between": + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [ + "BETWEEN", + self.unparse_val(cond["val1"]), + "AND", + self.unparse_val(cond["val2"]), + 
] + return " ".join(tokens) + tokens = [self.unparse_val_unit(cond["val_unit"])] + if cond["agg_id"]["_type"] != "NoneAggOp": + agg = cond["agg_id"]["_type"] + tokens[0] = f"{agg}({tokens[0]})" + if negated: + tokens.append("NOT") + tokens += [self.COND_TYPES_B[cond["_type"]], self.unparse_val(cond["val1"])] + return " ".join(tokens) + + def refine_from(self, tree): + """ + 1) Inferring tables from columns predicted + 2) Mix them with the predicted tables if any + 3) Inferring conditions based on tables + + Returns: bool + True: row calculation + False: not row calculation + """ + + # nested query in from clause, recursively use the refinement + if "from" in tree and tree["from"]["table_units"][0]["_type"] == "TableUnitSql": + for table_unit in tree["from"]["table_units"]: + # during natural decoding, all of FROM is sub-sql or ordinary table_id + if "s" not in table_unit: + logging.warning("error tree in FROM clause: %s", str(tree)) + continue + subquery_tree = table_unit["s"] + self.refine_from(subquery_tree) + return len(tree["from"]["table_units"]) == 2 # row calculation + + # get predicted tables + predicted_from_table_ids = set() + if "from" in tree: + table_unit_set = [] + for table_unit in tree["from"]["table_units"]: + if "table_id" in table_unit and table_unit["table_id"] not in predicted_from_table_ids: + predicted_from_table_ids.add(table_unit["table_id"]) + table_unit_set.append(table_unit) + tree["from"]["table_units"] = table_unit_set # remove duplicate + + # Get all candidate columns + candidate_column_ids = set( + self.ast_wrapper.find_all_descendants_of_type(tree, "column", lambda field: field.type != "sql") + ) + candidate_columns = [self.schema.columns[i] for i in candidate_column_ids] + must_in_from_table_ids = set(column.table.id for column in candidate_columns if column.table is not None) + + # Table the union of inferred and predicted tables + all_from_table_ids = must_in_from_table_ids.union(predicted_from_table_ids) + if not all_from_table_ids: + # TODO: better heuristic e.g., tables that have exact match + all_from_table_ids = {0} + + covered_tables = set() + candidate_table_ids = sorted(all_from_table_ids) + start_table_id = candidate_table_ids[0] + conds = [] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path(self.schema.foreign_key_graph, source=start_table_id, target=table_id) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + all_from_table_ids.add(target_table_id) + col1, col2 = self.schema.foreign_key_graph[source_table_id][target_table_id]["columns"] + conds.append( + { + "_type": "Eq", + "agg_id": {"_type": NL2SQLLanguage.AGG_TYPES_F[0]}, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col1, + }, + }, + "val1": { + "_type": "ColUnit", + "col_id": col2, + }, + } + ) + table_units = [{"_type": "Table", "table_id": i} for i in sorted(all_from_table_ids)] + + tree["from"] = { + "_type": "from", + "table_units": table_units, + } + cond_node = self.conjoin_conds(conds) + if cond_node is not None: + tree["from"]["conds"] = cond_node + return False + + def unparse_sql(self, tree): + is_row_calc = self.refine_from(tree) + + result = [ + # select select, + self.unparse_select(tree["select"], is_row_calc), + # from from, + self.unparse_from(tree["from"], is_row_calc), + 
] + + def find_subtree(_tree, name): + if self.factorize_sketch == 0: + return _tree, _tree + elif name in _tree: + if self.factorize_sketch == 1: + return _tree[name], _tree[name] + elif self.factorize_sketch == 2: + return _tree, _tree[name] + else: + raise NotImplementedError + + tree, target_tree = find_subtree(tree, "sql_where") + # cond? where, + if "where" in target_tree: + result += ["WHERE", self.unparse_cond(target_tree["where"])] + + return " ".join(result) + + def unparse_select(self, select, is_row_calc=False): + tokens = ["SELECT"] + tokens.append(", ".join(self.unparse_agg(agg, is_row_calc) for agg in select.get("aggs", []))) + return " ".join(tokens) + + def unparse_agg(self, agg, is_row_calc=False): + unparsed_val_unit = self.unparse_val_unit(agg["val_unit"], is_row_calc) + agg_type = agg["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return unparsed_val_unit + else: + return f"{agg_type}({unparsed_val_unit})" + + def unparse_from(self, from_, is_row_calc=False): + if "conds" in from_: + all_conds, keywords = self.linearize_cond(from_["conds"]) + else: + all_conds, keywords = [], [] + assert all(keyword == "And" for keyword in keywords) + + cond_indices_by_table = collections.defaultdict(set) + tables_involved_by_cond_idx = collections.defaultdict(set) + for i, cond in enumerate(all_conds): + for column in self.ast_wrapper.find_all_descendants_of_type(cond, "column"): + if type(column) is dict: + column = column["col_id"] + table = self.schema.columns[column].table + if table is None: + continue + cond_indices_by_table[table.id].add(i) + tables_involved_by_cond_idx[i].add(table.id) + + output_table_ids = set() + output_cond_indices = set() + tokens = ["FROM"] + for i, table_unit in enumerate(from_.get("table_units", [])): + if i > 0: + if not is_row_calc: + tokens += ["JOIN"] + else: + tokens += [","] + + if table_unit["_type"] == "TableUnitSql": + tokens.append(f'({self.unparse_sql(table_unit["s"])})') + if is_row_calc: # 行计算SQL的别名 + tokens.append(["a", "b", "c"][i]) + elif table_unit["_type"] == "Table": + table_id = table_unit["table_id"] + tokens += [self.schema.tables[table_id].orig_name] + output_table_ids.add(table_id) + + # Output "ON " if all tables involved in the condition have been output + conds_to_output = [] + for cond_idx in sorted(cond_indices_by_table[table_id]): + if cond_idx in output_cond_indices: + continue + if tables_involved_by_cond_idx[cond_idx] <= output_table_ids: + conds_to_output.append(all_conds[cond_idx]) + output_cond_indices.add(cond_idx) + if conds_to_output: + tokens += ["ON"] + tokens += list(intersperse("AND", (self.unparse_cond(cond) for cond in conds_to_output))) + return " ".join(tokens) + + +if __name__ == "__main__": + """run some simple test cases""" + dusql_lang = NL2SQLLanguage("conf/DuSQL.asdl") diff --git a/examples/text_to_sql/RAT-SQL/text2sql/io.py b/examples/text_to_sql/RAT-SQL/text2sql/io.py new file mode 100644 index 0000000000000000000000000000000000000000..65dd2889c51579f5e8d09fd72cd7c3b19e6d79e6 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/io.py @@ -0,0 +1,45 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import os +import traceback + +import paddle + + +def init_ernie_model(model_class, model_dir): + """init ernie model from static graph checkpoint""" + with open(os.path.join(model_dir, "ernie_config.json")) as ifs: + config = json.load(ifs) + + state = paddle.static.load_program_state(os.path.join(model_dir, "params")) + ernie = model_class(config, name="") + ernie.set_dict(state, use_structured_name=False) + return ernie, config["hidden_size"] + + +def save(model, optimizer, save_path): + try: + paddle.save(model.state_dict(), save_path + ".pdparams") + paddle.save(optimizer.state_dict(), save_path + ".pdopt") + except Exception: + logging.error("save model and optimizer failed. save path: %s", save_path) + logging.error(traceback.format_exc()) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..136a074e0fbc89edcfe6c812e1383bfc80cf6dbb --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from . import trainer +from . import infer +from . import eval diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/eval.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..dd6a9806ccc437e1d1df920b2b60adfdb56c98f7 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/eval.py @@ -0,0 +1,44 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +from text2sql.utils import metrics + + +def evaluate(model, dataset, infer_results, name="DuSQL", eval_value=True): + if name.lower() == "dusql": + metric = metrics.MetricDuSQLAcc(dataset, eval_value=eval_value) + else: + raise RuntimeError(f"only supports name DuSQL. 
but got {name}") + + for idx, line in enumerate(infer_results): + qid, pred_query, db_id, detail_result = line.strip().split("\t") + dct_result = json.loads(detail_result) + qid = dct_result["question_id"] + metric.update(dataset.get_by_qid(qid)[0], pred_query) + + eval_result = metric.finalize() + print("evaluating result:", json.dumps(eval_result["total_scores"], indent=4)) + with open("output/debug.json", "w") as ofs: + import random + + random.shuffle(eval_result["per_item"]) + json.dump(eval_result["per_item"], ofs, indent=4, ensure_ascii=False) + return eval_result + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/infer.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..4304c2364febf1dd2244dd893314cc997ce0dc66 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/infer.py @@ -0,0 +1,135 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os + +import numpy as np +import paddle +import tqdm +from text2sql.models import beam_search, sql_beam_search + + +def inference( + model, data, output_path, beam_size=1, mode="infer", output_history=True, use_heuristic=True, model_name="seq2tree" +): + model.eval() + os.makedirs(os.path.dirname(output_path), exist_ok=True) + with paddle.no_grad(), open(output_path, "w") as ofs: + if mode == "infer": + _do_infer(model, data, beam_size, output_history, ofs, use_heuristic, model_name) + elif mode == "debug": + _debug(model, data, ofs) + + +def _do_infer(model, data, beam_size, output_history, ofs, use_heuristic=True, model_name="seq2tree"): + for i, (inputs, labels) in enumerate(tqdm.tqdm(data())): + if model_name.startswith("seq2tree"): + decoded = _infer_one(model, inputs, beam_size, output_history, use_heuristic, labels) + else: + decoded = _infer_general(model, inputs, labels) + db_id = inputs["orig_inputs"][0].db.db_id + question_id = inputs["orig_inputs"][0].question_id + question = inputs["orig_inputs"][0].question + gold_query = labels[0].orig_code if labels is not None and labels[0] is not None else "" + values = inputs["orig_inputs"][0].values + if len(decoded) == 0: + pred_query = "select *" + else: + pred_query = decoded[0]["pred_query"] + lst_output = [ + question_id, + pred_query, + db_id, + json.dumps( + { + "db_id": db_id, + "question_id": question_id, + "question": question, + "gold_query": gold_query, + "values": values, + "beams": decoded, + }, + ensure_ascii=False, + ), + ] + ofs.write("\t".join(lst_output) + "\n") + ofs.flush() + + +def _infer_one(model, inputs, beam_size, output_history=False, use_heuristic=True, labels=None): + """inference one example""" + if use_heuristic: + # TODO: from_cond should be true from non-bert model + beams = sql_beam_search.beam_search_with_heuristics( + model, inputs, beam_size=beam_size, max_steps=1000, from_cond=False + ) + else: + beams = 
beam_search.beam_search(model, inputs, beam_size=beam_size, max_steps=1000) + decoded = [] + for beam in beams: + model_output, inferred_code = beam.inference_state.finalize() + + decoded.append( + { + "pred_query": inferred_code, + "model_output": model_output, + "score": beam.score, + **( + { + "choice_history": beam.choice_history, + "score_history": beam.score_history, + } + if output_history + else {} + ), + } + ) + return decoded + + +def _infer_general(model, inputs, labels=None): + output = model(inputs) + sel_num = np.argmax(output.sel_num.numpy()).item() + # labels[0].sel_num, labels[0].sel_col + pred_sel_col = output.sel_col[0].numpy() + col_ids = list(zip(range(pred_sel_col.shape[1]), pred_sel_col.tolist()[0])) + sorted_col = sorted(col_ids, key=lambda x: x[1], reverse=True) + pred_cols = list(sorted(sorted_col[:sel_num], key=lambda x: x[0])) + gold_cols = [] + for cid, label in enumerate(labels[0].sel_col): + if label == 1: + gold_cols.append(cid) + + return {"sel_num": (sel_num, labels[0].sel_num), "sel_col": ([x[0] for x in pred_cols], gold_cols)} + + +def _debug(model, data, ofs): + for i, item in enumerate(tqdm.tqdm(data)): + ((_, history),) = model.compute_loss([item], debug=True) + ofs.write( + json.dumps( + { + "index": i, + "history": history, + } + ) + + "\n" + ) + ofs.flush() + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/trainer.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..9c0e02f8edabcac8b763c5deb4488012b510578b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/trainer.py @@ -0,0 +1,107 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +import traceback +from pathlib import Path + +from text2sql import io, utils +from text2sql.launch import infer + + +def log_train_step(epoch, batch, steps_loss, cost_time): + if len(steps_loss) == 0: + return + + logging.info( + f"[train] epoch {epoch}, batch {batch}. " + + f"loss is {sum(steps_loss) / len(steps_loss):.10f}. 
" + + f"cost {cost_time:.2f}s" + ) + steps_loss.clear() + + +def epoch_train(config, model, optimizer, epoch, train_data, is_debug=False): + model.train() + + total_loss = 0 + steps_loss = [] + timer = utils.Timer() + batch_id = 1 + for batch_id, (inputs, labels) in enumerate(train_data(), start=1): + loss = model(inputs, labels) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + if type(optimizer._learning_rate) is not float: + optimizer._learning_rate.step() + + total_loss += loss.item() + steps_loss.append(loss.item()) + if batch_id % config.train.log_steps == 0 or is_debug: + log_train_step(epoch, batch_id, steps_loss, timer.interval()) + log_train_step(epoch, batch_id, steps_loss, timer.interval()) + + return total_loss / batch_id + + +def _eval_during_train(model, data, epoch, output_root): + if epoch in [1, 2, 3, 4] + [6, 7, 9, 10, 11, 13, 14, 16, 17, 19] + list(range(21, 100, 2)): + return 0, epoch + model.eval() + try: + output = Path(output_root) / "infer_result" / f"{data.name}.infer_epoch{epoch:03d}.sql" + infer.inference(model, data, output) + except OSError: + traceback.print_exc() + logging.error(traceback.format_exc()) + return 0, epoch + + mean_loss = 0 + return mean_loss, epoch + + +def train(config, model, optimizer, epochs, train_data, dev_data, test_data=None): + best_acc = -1e10 + best_epoch = 0 + timer = utils.Timer() + for epoch in range(1, epochs + 1): + loss = epoch_train(config, model, optimizer, epoch, train_data, config.general.is_debug) + cost_time = timer.interval() + logging.info(f"[train] epoch {epoch}/{epochs} loss is {loss:.6f}, cost {cost_time:.2f}s.") + + dev_loss, dev_acc = _eval_during_train(model, dev_data, epoch, config.data.output) + log_str = f"[eval] dev loss {dev_loss:.6f}, acc {dev_acc:.4f}." + if test_data is not None: + test_loss, test_acc = _eval_during_train(model, test_data, epoch, config.data.output) + log_str += f" test loss {test_loss:.6f}, acc {test_acc:.4f}." + + if dev_acc > best_acc: + best_acc, best_epoch = dev_acc, epoch + save_path = os.path.join(config.data.output, f"epoch{epoch:03d}_acc{best_acc:.4f}", "model") + io.save(model, optimizer, save_path) + log_str += " got best and saved." + else: + log_str += f" best acc is {best_acc} on epoch {best_epoch}." + + cost_time = timer.interval() + log_str += f" cost [{cost_time:.2f}s]" + logging.info(log_str) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ccd17d06b7f18ad2dd870ee7de5c41ec3ec6ec7d --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from .enc_dec import EncDecModel diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/attention.py b/examples/text_to_sql/RAT-SQL/text2sql/models/attention.py new file mode 100644 index 0000000000000000000000000000000000000000..6f457530a6ba86f39114032d7430adc33805fb5f --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/attention.py @@ -0,0 +1,179 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import numpy as np +import paddle +import paddle.nn.functional as F + + +def maybe_mask(attn, attn_mask): + if attn_mask is not None: + assert all( + a == 1 or b == 1 or a == b for a, b in zip(attn.shape[::-1], attn_mask.shape[::-1]) + ), f"Attention mask shape {attn_mask.shape} should be broadcastable with attention shape {attn.shape}" + + attn.data.masked_fill_(attn_mask, -float("inf")) + + +def attention(query, key, value, mask=None, dropout=None): + "Compute 'Scaled Dot Product Attention'" + d_k = query.shape[-1] + scores = paddle.matmul(query, key, transpose_y=True) / math.sqrt(d_k) + if mask is not None: + scores = scores.masked_fill(mask == 0, -1e9) + p_attn = F.softmax(scores, axis=-1) + if dropout is not None: + p_attn = dropout(p_attn) + # return paddle.matmul(p_attn, value), scores.squeeze(1).squeeze(1) + return paddle.matmul(p_attn, value), p_attn + + +class Attention(paddle.nn.Layer): + def __init__(self, pointer): + super().__init__() + self.pointer = pointer + self.softmax = paddle.nn.Softmax(axis=-1) + + def forward(self, query, values, attn_mask=None): + # query shape: batch x query_size + # values shape: batch x num values x value_size + + # attn_logits shape: batch x num values + attn_logits = self.pointer(query, values, attn_mask) + # attn_logits shape: batch x num values + attn = self.softmax(attn_logits) + # output shape: batch x 1 x value_size + output = paddle.bmm(attn.unsqueeze(1), values) + output = output.squeeze(1) + return output, attn + + +class ScaledDotProductPointer(paddle.nn.Layer): + def __init__(self, query_size, key_size): + super().__init__() + self.query_proj = paddle.nn.Linear(query_size, key_size) + self.temp = np.power(key_size, 0.5) + + def forward(self, query, keys, attn_mask=None): + # query shape: batch x query_size + # keys shape: batch x num keys x key_size + + # proj_query shape: batch x key_size x 1 + proj_query = self.query_proj(query).unsqueeze(2) + + # attn_logits shape: batch x num keys + attn_logits = paddle.bmm(keys, proj_query).squeeze(2) / self.temp + maybe_mask(attn_logits, attn_mask) + return attn_logits + + +class ScaledDotProductAttention(Attention): + def __init__(self, query_size, value_size): + super().__init__(ScaledDotProductPointer(query_size, value_size)) + + +class BahdanauPointer(paddle.nn.Layer): + def __init__(self, query_size, key_size, proj_size): + super().__init__() + self.compute_scores = paddle.nn.Sequential( + paddle.nn.Linear(query_size + key_size, proj_size), paddle.nn.Tanh(), paddle.nn.Linear(proj_size, 1) + ) + + def 
forward(self, query: paddle.Tensor, keys: paddle.Tensor, attn_mask=None): + # query shape: batch x query_size + # keys shape: batch x num keys x key_size + + # query_expanded shape: batch x num keys x query_size + query_expanded = query.unsqueeze(1).expand([query.shape[0], keys.shape[1], query.shape[-1]]) + + # scores shape: batch x num keys x 1 + attn_logits = self.compute_scores( + # shape: batch x num keys x query_size + key_size + paddle.concat((query_expanded, keys), axis=2) + ) + # scores shape: batch x num keys + attn_logits = attn_logits.squeeze(2) + maybe_mask(attn_logits, attn_mask) + return attn_logits + + +class BahdanauAttention(Attention): + def __init__(self, query_size, value_size, proj_size): + super().__init__(BahdanauPointer(query_size, value_size, proj_size)) + + +# Adapted from The Annotated Transformers +class MultiHeadedAttention(paddle.nn.Layer): + def __init__(self, h, query_size, value_size, dropout=0.1): + super().__init__() + assert query_size % h == 0 + assert value_size % h == 0 + + # We assume d_v always equals d_k + self.d_k = value_size // h + self.h = h + + self.linears = paddle.nn.LayerList( + [ + paddle.nn.Linear(query_size, value_size), + paddle.nn.Linear(value_size, value_size), + paddle.nn.Linear(value_size, value_size), + paddle.nn.Linear(value_size, value_size), + ] + ) + + self.attn = None + self.dropout = paddle.nn.Dropout(p=dropout) + + def forward(self, query, values, attn_mask=None): + "Implements Figure 2" + if attn_mask is not None: + # Same mask applied to all h heads. + attn_mask = attn_mask.unsqueeze(1) + nbatches = query.shape[0] + + # 1) Do all the linear projections in batch from d_model => h x d_k + query, keys, values = [ + l(x).reshape([nbatches, -1, self.h, self.d_k]).transpose([0, 2, 1, 3]) + for l, x in zip(self.linears, (query, values, values)) + ] + + # 2) Apply attention on all the projected vectors in batch. + # x, self.attn = transformer.sparse_attention( + x, self.attn = attention(query, keys, values, mask=attn_mask, dropout=self.dropout) + + # 3) "Concat" using a view and apply a final linear. + x = x.transpose([0, 2, 1, 3]).reshape([nbatches, -1, self.h * self.d_k]) + x = x.squeeze(1) + return self.linears[3](x), self.attn + + +if __name__ == "__main__": + """run some simple test cases""" + sdpp = ScaledDotProductPointer(query_size=8, key_size=16) + sdpa = ScaledDotProductAttention(query_size=8, value_size=16) + bp = BahdanauPointer(query_size=8, key_size=16, proj_size=12) + ba = BahdanauAttention(query_size=8, value_size=16, proj_size=12) + mha = MultiHeadedAttention(h=2, query_size=8, value_size=16) + + q = paddle.to_tensor(list(range(1, 9)), dtype="float32").reshape([1, 8]) + v = paddle.to_tensor(list(range(1, 17)), dtype="float32").reshape([1, 1, 16]) + + print(sdpp(q, v)) + print(sdpa(q, v)) + print(bp(q, v)) + print(ba(q, v)) + print(mha(q, v)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/beam_search.py b/examples/text_to_sql/RAT-SQL/text2sql/models/beam_search.py new file mode 100644 index 0000000000000000000000000000000000000000..c108a96df2ec66611d33bfa388524b17366d99b4 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/beam_search.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import operator + +import attr + + +@attr.s +class Hypothesis: + inference_state = attr.ib() + next_choices = attr.ib() + score = attr.ib(default=0) + + choice_history = attr.ib(factory=list) + score_history = attr.ib(factory=list) + + +def beam_search(model, orig_item, preproc_item, beam_size, max_steps): + inference_state, next_choices = model.begin_inference(orig_item, preproc_item) + beam = [Hypothesis(inference_state, next_choices)] + finished = [] + + for step in range(max_steps): + # Check if all beams are finished + if len(finished) == beam_size: + break + + candidates = [] + + # For each hypothesis, get possible expansions + # Score each expansion + for hyp in beam: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + + # Keep the top K expansions + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: beam_size - len(finished)] + + # Create the new hypotheses from the expansions + beam = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + next_choices = inference_state.step(choice) + if next_choices is None: + finished.append( + Hypothesis( + inference_state, + None, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + else: + beam.append( + Hypothesis( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + + finished.sort(key=operator.attrgetter("score"), reverse=True) + return finished diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/enc_dec.py b/examples/text_to_sql/RAT-SQL/text2sql/models/enc_dec.py new file mode 100644 index 0000000000000000000000000000000000000000..7ae4f7d50d39b86e0e8837924e3f5fcc4103377b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/enc_dec.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
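+# EncDecModel wires the relation-aware encoder (encoder_v2.Text2SQLEncoderV2) to the
+# Text2SQLDecoder from text2sql.models.sql_decoder. In training mode, forward(inputs, labels)
+# returns the mean decoding loss over the batch; in inference mode,
+# forward(inputs, db=db, is_train=False) decodes the first encoded example together with
+# its candidate values.
+#
+# Rough usage sketch (config and label_encoder construction omitted; the names below are
+# illustrative only):
+#
+#     model = EncDecModel(config, label_encoder, model_version="v2")
+#     loss = model(inputs, labels)                       # training
+#     preds = model(inputs, db=db, is_train=False)       # inference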
+ +import paddle +from paddle import nn +from text2sql.models import encoder_v2 +from text2sql.models.sql_decoder import decoder as decoder_v2 + + +class EncDecModel(nn.Layer): + """Dygraph version of BoomUp Model""" + + def __init__(self, config, label_encoder, model_version="v2"): + super(EncDecModel, self).__init__() + + self._config = config + self._model_version = model_version + + assert model_version in ("v2",), "model_version only support v2" + self.encoder = encoder_v2.Text2SQLEncoderV2(config) + self.decoder = decoder_v2.Text2SQLDecoder( + label_encoder, dropout=0.2, desc_attn="mha", use_align_mat=True, use_align_loss=True + ) + + def forward(self, inputs, labels=None, db=None, is_train=True): + if is_train: + assert labels is not None, "labels should not be None while training" + return self._train(inputs, labels) + else: + assert db is not None, "db should not be None while inferencing" + return self._inference(inputs, db) + + def _train(self, inputs, labels): + enc_results = self.encoder(inputs) + lst_loss = [] + for orig_inputs, label_info, enc_result in zip(inputs["orig_inputs"], labels, enc_results): + loss = self.decoder.compute_loss(orig_inputs, label_info, enc_result) + lst_loss.append(loss) + + return paddle.mean(paddle.stack(lst_loss, axis=0), axis=0) + + def _inference(self, inputs, db): + enc_state = self.encoder(inputs) + if self._model_version == "v1": + return self.decoder.inference(enc_state[0], db) + elif self._model_version == "v2": + return self.decoder.inference(enc_state[0], db, inputs["orig_inputs"][0].values) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/encoder_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/models/encoder_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..0e336675115644cb6a46250a958b4f2cf59bc35a --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/encoder_v2.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
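+# Text2SQLEncoderV2 runs the serialized question/schema input through a pretrained BERT or
+# ERNIE model, gathers the hidden states of question tokens, tables, columns and candidate
+# values (nn_utils.batch_gather_2d), and refines them with relation-aware self-attention
+# layers (RelationAwareEncoder). It returns one EncoderState per example, carrying the shared
+# memory plus the memory-to-column/table/value alignment matrices consumed by the pointer
+# decoder.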
+ +import os + +import attr +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F +from text2sql.models import relational_encoder, relational_transformer +from text2sql.utils import linking_utils, nn_utils, utils + +from paddlenlp.transformers import BertModel, ErnieModel, ErniePretrainedModel + + +@attr.s +class EncoderState: + """Encoder state define""" + + state = attr.ib() + cls_hidden = attr.ib() + memory = attr.ib() + question_memory = attr.ib() + schema_memory = attr.ib() + words = attr.ib() + + pointer_memories = attr.ib() + pointer_maps = attr.ib() + + m2c_align_mat = attr.ib() + m2t_align_mat = attr.ib() + m2v_align_mat = attr.ib() + + def find_word_occurrences(self, word): + """find word occurrences""" + return [i for i, w in enumerate(self.words) if w == word] + + +class Text2SQLEncoderV2(nn.Layer): + """Encoder for text2sql model""" + + def __init__(self, model_config, extra=None): + super(Text2SQLEncoderV2, self).__init__() + self.enc_value_with_col = model_config.enc_value_with_col + + self.pretrain_model_type = model_config.pretrain_model_type + if model_config.pretrain_model_type == "BERT": + PretrainModel = BertModel + vocab_file = os.path.join( + os.path.expandvars("$HOME"), + ".paddlenlp/models", + model_config.pretrain_model, + model_config.pretrain_model + "-vocab.txt", + ) + args = {"vocab_size": utils.count_file_lines(vocab_file), "type_vocab_size": 2} + self.hidden_size = 768 + elif model_config.pretrain_model_type == "ERNIE": + PretrainModel = ErnieModel + ernie_config = ErniePretrainedModel.pretrained_init_configuration[model_config.pretrain_model] + # with open(Path(model_config.pretrain_model) / + # 'ernie_config.json') as ifs: + # ernie_config = json.load(ifs) + args = {"cfg": ernie_config} + self.hidden_size = ernie_config["hidden_size"] + else: + raise RuntimeError(f"unsupported pretrain model type: {model_config.pretrain_model_type}") + + if model_config.init_model_params is None: + self.base_encoder = PretrainModel.from_pretrained(model_config.pretrain_model) + else: + self.base_encoder = PretrainModel(**args["cfg"]) + # initializer = nn.initializer.TruncatedNormal(std=0.02) + self.rel_has_value = True + self.encs_update = relational_encoder.RelationAwareEncoder( + num_layers=model_config.rat_layers, + num_heads=model_config.rat_heads, + num_relations=len(linking_utils.RELATIONS), + hidden_size=self.hidden_size, + has_value=self.rel_has_value, + ) + if not self.rel_has_value: + self.value_align = relational_transformer.RelationalPointerNet( + hidden_size=self.hidden_size, num_relations=0 + ) + + self.include_in_memory = set(["question", "column", "table", "value"]) + + def forward(self, inputs): + """modeling forward stage of encoder""" + seq_hidden, cls_hidden = self.base_encoder(inputs["src_ids"], inputs["sent_ids"]) + if self.pretrain_model_type != "ERNIE" and self.pretrain_model_type != "BERT": + cls_hidden, seq_hidden = seq_hidden, cls_hidden + + question_tokens_index = inputs["question_tokens_index"] + table_indexes = inputs["table_indexes"] + column_indexes = inputs["column_indexes"] + value_indexes = inputs["value_indexes"] + + question_encs = nn_utils.batch_gather_2d(seq_hidden, question_tokens_index) + table_encs = nn_utils.batch_gather_2d(seq_hidden, table_indexes) + column_encs = nn_utils.batch_gather_2d(seq_hidden, column_indexes) + value_encs = nn_utils.batch_gather_2d(seq_hidden, value_indexes) + if self.enc_value_with_col: + value_num = value_encs.shape[1] // 2 + value_encs = 
value_encs.reshape([value_encs.shape[0], value_num, 2, -1]).sum(axis=2) + + orig_inputs = inputs["orig_inputs"] + column_pointer_maps = [{i: [i] for i in range(len(orig_input.columns))} for orig_input in orig_inputs] + table_pointer_maps = [{i: [i] for i in range(len(orig_input.tables))} for orig_input in orig_inputs] + value_pointer_maps = [{i: [i] for i in range(len(orig_input.values))} for orig_input in orig_inputs] + + enc_results = [] + # calculate relation encoding one-by-one + for batch_idx, orig_input in enumerate(orig_inputs): + q_len = orig_input.column_indexes[0] - 2 + col_size = len(orig_input.columns) + tab_size = len(orig_input.tables) + val_size = len(orig_input.values) + + q_enc = question_encs[batch_idx][:q_len] + tab_enc = table_encs[batch_idx][:tab_size] + col_enc = column_encs[batch_idx][:col_size] + val_enc = value_encs[batch_idx][:val_size] + + c_boundary = list(range(col_size + 1)) + t_boundary = list(range(tab_size + 1)) + + v_e_input = val_enc.unsqueeze(0) if self.rel_has_value else None + (q_enc_new, c_enc_new, t_enc_new, v_enc_new), align_mat = self.encs_update.forward_unbatched( + q_enc.unsqueeze(0), + col_enc.unsqueeze(0), + tab_enc.unsqueeze(0), + c_boundary, + t_boundary, + orig_input.relations, + v_e_input, + ) + + memory = [] + if "question" in self.include_in_memory: + memory.append(q_enc_new) + if "table" in self.include_in_memory: + memory.append(t_enc_new) + if "column" in self.include_in_memory: + memory.append(c_enc_new) + if "value" in self.include_in_memory and self.rel_has_value: + memory.append(v_enc_new) + memory = paddle.concat(memory, axis=1) + if not self.rel_has_value: + v_enc_new = val_enc.unsqueeze(0) + m2v_align_mat = self.value_align(memory, v_enc_new, relations=None) + align_mat[2] = m2v_align_mat + + schema_memory = (c_enc_new, t_enc_new) + if self.rel_has_value: + schema_memory += (v_enc_new,) + + enc_results.append( + EncoderState( + state=None, + cls_hidden=cls_hidden[batch_idx], + memory=memory, + question_memory=q_enc_new, + schema_memory=paddle.concat(schema_memory, axis=1), + words=orig_input.question_tokens, + pointer_memories={ + "table": t_enc_new, + "column": c_enc_new, + "value": v_enc_new, + }, + pointer_maps={ + "column": column_pointer_maps[batch_idx], + "table": table_pointer_maps[batch_idx], + "value": value_pointer_maps[batch_idx], + }, + m2c_align_mat=align_mat[0], + m2t_align_mat=align_mat[1], + m2v_align_mat=align_mat[2], + ) + ) + + return enc_results + + def span_encoder(self, cls_hidden, seq_hidden, span_index, span_tokens_index, span_tokens_mask, proj_fn=None): + """encode spans(like headers, table names) by sequence hidden states""" + batch_size, max_col_nums, max_col_tokens = span_tokens_index.shape + hidden_size = cls_hidden.shape[-1] + + # shape = [batch, max_col, hidden_size] + span_enc1 = nn_utils.batch_gather_2d(seq_hidden, span_index) + + token_gather_index = paddle.reshape(span_tokens_index, shape=[-1, max_col_nums * max_col_tokens]) + span_tokens_enc_origin = nn_utils.batch_gather_2d(seq_hidden, token_gather_index) + + span_tokens_weight = paddle.reshape( + paddle.matmul(span_tokens_enc_origin, paddle.unsqueeze(cls_hidden, [-1])), + [-1, max_col_nums, max_col_tokens], + ) + span_tokens_weight = F.softmax(nn_utils.sequence_mask(span_tokens_weight, span_tokens_mask), axis=-1) + + span_tokens_enc_origin = paddle.reshape( + span_tokens_enc_origin, [-1, max_col_nums, max_col_tokens, hidden_size] + ) + span_enc2 = paddle.sum(paddle.multiply(span_tokens_enc_origin, span_tokens_weight.unsqueeze([-1])), axis=2) 
+ + span_enc = paddle.concat([span_enc1, span_enc2], axis=-1) + if proj_fn is not None: + span_enc = proj_fn(span_enc) + return span_enc + + +if __name__ == "__main__": + """run some simple test cases""" + inputs = { + "src_ids": paddle.to_tensor(np.array([0, 1, 2, 3, 4, 5], dtype=np.int64).reshape([1, 6])), + "sent_ids": paddle.to_tensor(np.array([0, 1, 1, 1, 1, 1], dtype=np.int64).reshape([1, 6])), + "question_tokens_index": paddle.to_tensor(list(range(1, 5)), dtype="int64").reshape([1, 4]), + "column_index": paddle.to_tensor([1, 4], dtype="int64").reshape([1, 2]), + "column_mask": paddle.to_tensor([1, 1], dtype="float32").reshape([1, 2]), + "column_tokens_index": paddle.to_tensor([1, 2, 3, 4, 5, 0], dtype="int64").reshape([1, 2, 3]), + "column_tokens_mask": paddle.to_tensor([1, 1, 1, 1, 1, 0], dtype="float32").reshape([1, 2, 3]), + "table_index": paddle.to_tensor([1, 4], dtype="int64").reshape([1, 2]), + "table_mask": paddle.to_tensor([1, 1], dtype="float32").reshape([1, 2]), + "table_tokens_index": paddle.to_tensor([1, 2, 3, 4, 5, 0], dtype="int64").reshape([1, 2, 3]), + "table_tokens_mask": paddle.to_tensor([1, 1, 1, 1, 1, 0], dtype="float32").reshape([1, 2, 3]), + "limit_nums_index": paddle.to_tensor([1, 4], dtype="int64").reshape([1, 2]), + "limit_nums_mask": paddle.to_tensor([1, 1], dtype="float32").reshape([1, 2]), + "orig_inputs": [ + { + "columns": ["a", "b"], + "tables": ["t1", "t2"], + "question_tokens": ["a", "bc", "d"], + "span_lens": [[6], [1, 1], [1, 1]], + "relations": np.arange(8 * 8).reshape(8, 8), + } + ], + } + + # model = Text2SQLEncoder() + # outputs = model(inputs) + # print(outputs) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/relational_encoder.py b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..adc3868d8d4bdf12664422b969990ec3145d7bd1 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_encoder.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
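+
+# Module overview (informal): RelationAwareEncoder is a thin wrapper around
+# relational_transformer.RelationalTransformerEncoder. It concatenates the
+# question/column/table (and, when available, value) encodings, applies
+# relation-aware self-attention over the joint sequence, splits the updated
+# states back into their segments, and uses RelationalPointerNet to compute the
+# memory-to-column/table/value alignment matrices.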
+ +import numpy as np +import paddle +from text2sql.models import relational_transformer + + +class RelationAwareEncoder(paddle.nn.Layer): + """Relation-aware encoder""" + + def __init__(self, num_layers, num_heads, num_relations, hidden_size, has_value=False, dropout=0.1): + super(RelationAwareEncoder, self).__init__() + + self._num_layers = num_layers + self._num_heads = num_heads + self._hidden_size = hidden_size + self._dropout = dropout + + cfg = { + "num_hidden_layers": num_layers, + "num_attention_heads": num_heads, + "num_relations": num_relations, + "hidden_size": hidden_size, + "hidden_act": "relu", + "attention_probs_dropout_prob": dropout, + "hidden_dropout_prob": dropout, + "initializer_range": 0.02, + } + self.encoder = relational_transformer.RelationalTransformerEncoder(cfg) + if not has_value: + self.align_attn = relational_transformer.RelationalPointerNet(hidden_size, num_relations) + else: + self.align_attn = relational_transformer.RelationalPointerNet(hidden_size, 0) + + def forward(self, q_enc, c_enc, t_enc, c_boundaries, t_boundaries, relations, v_enc=None): + assert q_enc.shape[0] == 1 and c_enc.shape[0] == 1 and t_enc.shape[0] == 1 + return self.forward_unbatched(q_enc, c_enc, t_enc, c_boundaries, t_boundaries, relations) + + def forward_unbatched(self, q_enc, c_enc, t_enc, c_boundaries, t_boundaries, relations, v_enc=None): + enc = paddle.concat((q_enc, c_enc, t_enc), axis=1) + # enc = enc.transpose([1, 0, 2]) + + relations_t = paddle.to_tensor(relations, dtype="int64").unsqueeze([0]) + enc_new, _, _ = self.encoder(enc, relations_t) + + # Split updated_enc again + c_base = q_enc.shape[1] + t_base = q_enc.shape[1] + c_enc.shape[1] + q_enc_new = enc_new[:, :c_base] + c_enc_new = enc_new[:, c_base:t_base] + t_enc_new = enc_new[:, t_base:] + + if v_enc is None: + m2c_align_mat = self.align_attn(enc_new, c_enc_new, relations_t[:, :, c_base:t_base]) + m2t_align_mat = self.align_attn(enc_new, t_enc_new, relations_t[:, :, t_base:]) + m2v_align_mat = None + else: + enc_new = paddle.concat((enc_new, v_enc), axis=1) + m2c_align_mat = self.align_attn(enc_new, c_enc_new, relations=None) + m2t_align_mat = self.align_attn(enc_new, t_enc_new, relations=None) + m2v_align_mat = self.align_attn(enc_new, v_enc, relations=None) + + return ([q_enc_new, c_enc_new, t_enc_new, v_enc], [m2c_align_mat, m2t_align_mat, m2v_align_mat]) + + +if __name__ == "__main__": + """run some simple test cases""" + + hidden_size = 4 + q = paddle.to_tensor(list(range(12)), dtype="float32").reshape([1, 3, hidden_size]) + c = paddle.to_tensor(list(range(8)), dtype="float32").reshape([1, 2, hidden_size]) + t = paddle.to_tensor(list(range(8)), dtype="float32").reshape([1, 2, hidden_size]) + c_bound = None + t_bound = None + relations = np.zeros([7, 7], dtype=np.int64) + relations[0, 3] = 10 + relations[0, 1] = 1 + relations[0, 2] = 2 + relations[1, 2] = 1 + relations[1, 4] = 11 + relations[3, 4] = 21 + relations[3, 5] = 31 + + model = RelationAwareEncoder(2, 2, 99, hidden_size) + outputs = model(q, c, t, c_bound, t_bound, relations) + print(outputs) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/relational_transformer.py b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..aba8fdd11346b28624fc2c6e5a4abeaefc162fa7 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_transformer.py @@ -0,0 +1,413 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import paddle +import paddle.nn.functional as F +from paddle import nn + +ACT_DICT = { + "relu": nn.ReLU, + "gelu": nn.GELU, +} + + +def _build_linear(n_in, n_out, name=None, init=None): + return nn.Linear( + n_in, + n_out, + weight_attr=paddle.ParamAttr(name="%s.w_0" % name if name is not None else None, initializer=init), + bias_attr="%s.b_0" % name if name is not None else None, + ) + + +def _build_ln(n_in, name): + return nn.LayerNorm( + normalized_shape=n_in, + weight_attr=paddle.ParamAttr( + name="%s_layer_norm_scale" % name if name is not None else None, initializer=nn.initializer.Constant(1.0) + ), + bias_attr=paddle.ParamAttr( + name="%s_layer_norm_bias" % name if name is not None else None, initializer=nn.initializer.Constant(0.0) + ), + ) + + +def new_name(name, postfix): + if name is None: + ret = None + elif name == "": + ret = postfix + else: + ret = "%s_%s" % (name, postfix) + return ret + + +# Adapted from +# https://github.com/tensorflow/tensor2tensor/blob/0b156ac533ab53f65f44966381f6e147c7371eee/tensor2tensor/layers/common_attention.py +def relative_attention_logits(query, key, relation): + """relative attention logits(scores) + + Args: + query (TYPE): NULL + key (TYPE): NULL + relation (TYPE): NULL + + Returns: Tensor, shape = [batch, heads, num queries, num kvs] + + Raises: NULL + """ + # We can't reuse the same logic as tensor2tensor because we don't share relation vectors across the batch. + # In this version, relation vectors are shared across heads. + # query: [batch, heads, num queries, depth]. + # key: [batch, heads, num kvs, depth]. + # relation: [batch, num queries, num kvs, depth]. + + # qk_matmul is [batch, heads, num queries, num kvs] + qk_matmul = paddle.matmul(query, key, transpose_y=True) + if relation is None: + return qk_matmul / math.sqrt(query.shape[-1]) + + # q_t is [batch, num queries, heads, depth] + q_t = query.transpose([0, 2, 1, 3]) + # r_t is [batch, num queries, depth, num kvs] + r_t = relation.transpose([0, 1, 3, 2]) + + # [batch, num queries, heads, depth] * [batch, num queries, depth, num kvs] + # = [batch, num queries, heads, num kvs] + # For each batch and query, we have a query vector per head. + # We take its dot product with the relation vector for each kv. + # + # transposed = [batch, heads, num queries, num kvs] + qr_matmul = paddle.matmul(q_t, r_t).transpose([0, 2, 1, 3]) + + # [batch, heads, num queries, num kvs] + return (qk_matmul + qr_matmul) / math.sqrt(query.shape[-1]) + + # Sharing relation vectors across batch and heads: + # query: [batch, heads, num queries, depth]. + # key: [batch, heads, num kvs, depth]. + # relation: [num queries, num kvs, depth]. + # + # Then take + # key reshaped + # [num queries, batch * heads, depth] + # relation.transpose(-2, -1) + # [num queries, depth, num kvs] + # and multiply them together. + # + # Without sharing relation vectors across heads: + # query: [batch, heads, num queries, depth]. + # key: [batch, heads, num kvs, depth]. 
+ # relation: [batch, heads, num queries, num kvs, depth]. + # + # Then take + # key.unsqueeze(3) + # [batch, heads, num queries, 1, depth] + # relation.transpose(-2, -1) + # [batch, heads, num queries, depth, num kvs] + # and multiply them together: + # [batch, heads, num queries, 1, depth] + # * [batch, heads, num queries, depth, num kvs] + # = [batch, heads, num queries, 1, num kvs] + # and squeeze + # [batch, heads, num queries, num kvs] + + +def relative_attention_values(weight, value, relation): + """In this version, relation vectors are shared across heads. + Args: + weight: [batch, heads, num queries, num kvs]. + value: [batch, heads, num kvs, depth]. + relation: [batch, num queries, num kvs, depth]. + Returns: Tensor, shape = [batch, heads, num queries, depth] + """ + # wv_matmul is [batch, heads, num queries, depth] + wv_matmul = paddle.matmul(weight, value) + + # w_t is [batch, num queries, heads, num kvs] + w_t = weight.transpose([0, 2, 1, 3]) + # [batch, num queries, heads, num kvs] + # * [batch, num queries, num kvs, depth] + # = [batch, num queries, heads, depth] + # transposed = [batch, heads, num queries, depth] + wr_matmul = paddle.matmul(w_t, relation).transpose([0, 2, 1, 3]) + + return wv_matmul + wr_matmul + + +class RelationalAttentionLayer(nn.Layer): + def __init__(self, cfg, name=None): + super(RelationalAttentionLayer, self).__init__() + initializer = nn.initializer.TruncatedNormal(std=cfg["initializer_range"]) + d_model = cfg["hidden_size"] + n_head = cfg["num_attention_heads"] + assert d_model % n_head == 0 + d_model_q = cfg.get("query_hidden_size_per_head", d_model // n_head) * n_head + d_model_v = cfg.get("value_hidden_size_per_head", d_model // n_head) * n_head + self.n_head = n_head + self.d_key = d_model_q // n_head + self.q = _build_linear(d_model, d_model_q, new_name(name, "query_fc"), initializer) + self.k = _build_linear(d_model, d_model_q, new_name(name, "key_fc"), initializer) + self.v = _build_linear(d_model, d_model_v, new_name(name, "value_fc"), initializer) + self.o = _build_linear(d_model_v, d_model, new_name(name, "output_fc"), initializer) + self.dropout = nn.Dropout(p=cfg["attention_probs_dropout_prob"]) + + def forward(self, queries, keys, values, relation_k, relation_v, attn_bias=None, past_cache=None): + """relational attention forward. + seq_len in `shape` means num queries/keys/values of attention + + Args: + queries (TYPE): shape = [batch, seq_len, num_heads * hidden] + keys (TYPE): shape = queries.shape + values (TYPE): shape = queries.shape + relation_k (TYPE): shape = [batch, seq_len, seq_len, hidden] + relation_v (TYPE): shape = relation_k.shape + attn_bias (TYPE): used as sequence mask. 
Default is None + past_cache (TYPE): Default is None + + Returns: TODO + + Raises: NULL + """ + assert len(queries.shape) == len(keys.shape) == len(values.shape) == 3 + # bsz, q_len, q_dim = queries.shape + # bsz, k_len, k_dim = keys.shape + # bsz, v_len, v_dim = values.shape + # assert k_len == v_len + + q = self.q(queries) + k = self.k(keys) + v = self.v(values) + + cache = (k, v) + if past_cache is not None: + cached_k, cached_v = past_cache + k = paddle.concat([cached_k, k], 1) + v = paddle.concat([cached_v, v], 1) + + def _transpose(inputs): + """reshape and transpose + Args: inputs: shape = [batch, seq_len, heads * hidden] + Returns: shape = [batch, heads, seq_len, hidden] + """ + hidden_size = inputs.shape[-1] // self.n_head + outputs = inputs.reshape([0, 0, self.n_head, hidden_size]) + return outputs.transpose([0, 2, 1, 3]) + + q, k, v = [_transpose(x) for x in (q, k, v)] + + q = q.scale(self.d_key**-0.5) + scores = relative_attention_logits(q, k, relation_k) + if attn_bias is not None: + scores += attn_bias + scores = F.softmax(scores) + scores = self.dropout(scores) + + out = relative_attention_values(scores, v, relation_v) + # input: [batch, heads, seq_len, hidden] + # output: [batch, seq_len, heads * hidden] + out = out.transpose([0, 2, 1, 3]) + out = out.reshape([0, 0, out.shape[2] * out.shape[3]]) + out = self.o(out) + return out, cache + + +class RelationalPointerNet(nn.Layer): + """Pointer Netword with Relations""" + + def __init__(self, hidden_size, num_relations, init_range=0.02, name=None): + """init of class + + Args: + cfg (TYPE): NULL + + """ + super(RelationalPointerNet, self).__init__() + self.hidden_size = hidden_size + + initializer = nn.initializer.TruncatedNormal(std=init_range) + self.q = _build_linear(hidden_size, hidden_size, new_name(name, "query_fc"), initializer) + self.k = _build_linear(hidden_size, hidden_size, new_name(name, "key_fc"), initializer) + # self.dropout = nn.Dropout(p=cfg['attention_probs_dropout_prob']) + + self.relation_emb = None + if num_relations > 0: + self.relation_emb = nn.Embedding(num_relations, hidden_size) + self.scores = None + + def forward(self, queries, keys, relations, attn_bias=None): + """relational attention forward. + seq_len in `shape` means num queries/keys/values of attention + + Args: + queries (TYPE): shape = [batch, seq_len, num_heads * hidden] + keys (TYPE): shape = queries.shape + relations (TYPE): shape = [batch, seq_len, seq_len, hidden] + attn_bias (TYPE): used as sequence mask. 
Default is None + + Returns: TODO + + Raises: NULL + """ + assert len(queries.shape) == len(keys.shape) == 3 + + q = self.q(queries) + k = self.k(keys) + r = None + if relations is not None: + r = self.relation_emb(relations) + + def _transpose(inputs): + """reshape and transpose + Args: inputs: shape = [batch, seq_len, heads * hidden] + Returns: shape = [batch, heads, seq_len, hidden] + """ + # 1 代表 head 数量,此处恒为 1。 + outputs = inputs.reshape([0, 0, 1, self.hidden_size]) + return outputs.transpose([0, 2, 1, 3]) + + q = _transpose(q) + k = _transpose(k) + # q = q.scale(self.hidden_size**-0.5) + scores = relative_attention_logits(q, k, r) + if attn_bias is not None: + scores += attn_bias + + self.scores = F.softmax(scores) + return self.scores.squeeze([0, 1]) + + +class PositionwiseFeedForwardLayer(nn.Layer): + def __init__(self, cfg, name=None): + super(PositionwiseFeedForwardLayer, self).__init__() + initializer = nn.initializer.TruncatedNormal(std=cfg["initializer_range"]) + d_model = cfg["hidden_size"] + d_ffn = cfg.get("intermediate_size", 4 * d_model) + self.act = ACT_DICT[cfg["hidden_act"]]() + self.i = _build_linear( + d_model, + d_ffn, + new_name(name, "fc_0"), + initializer, + ) + self.o = _build_linear(d_ffn, d_model, new_name(name, "fc_1"), initializer) + prob = cfg.get("intermediate_dropout_prob", 0.0) + self.dropout = nn.Dropout(p=prob) + + def forward(self, inputs): + hidden = self.act(self.i(inputs)) + hidden = self.dropout(hidden) + out = self.o(hidden) + return out + + +class RelationalTransformerBlock(nn.Layer): + """A transformer block with relations""" + + def __init__(self, cfg, name=None): + super(RelationalTransformerBlock, self).__init__() + d_model = cfg["hidden_size"] + n_heads = cfg["num_attention_heads"] + self.attn = RelationalAttentionLayer(cfg, name=new_name(name, "multi_head_att")) + self.ln1 = _build_ln(d_model, name=new_name(name, "post_att")) + self.ffn = PositionwiseFeedForwardLayer(cfg, name=new_name(name, "ffn")) + self.ln2 = _build_ln(d_model, name=new_name(name, "post_ffn")) + prob = cfg.get("intermediate_dropout_prob", cfg["hidden_dropout_prob"]) + self.dropout = nn.Dropout(p=prob) + + # 假设 k/v 的 + rel_hidden = d_model // n_heads + self.relation_k_emb = nn.Embedding(cfg["num_relations"], rel_hidden) + self.relation_v_emb = nn.Embedding(cfg["num_relations"], rel_hidden) + + def forward(self, inputs, relations, attn_bias=None, past_cache=None): + relation_k = self.relation_k_emb(relations) + relation_v = self.relation_k_emb(relations) + + attn_out, cache = self.attn( + inputs, inputs, inputs, relation_k, relation_v, attn_bias, past_cache=past_cache + ) # self attn + attn_out = self.dropout(attn_out) + hidden = attn_out + inputs + hidden = self.ln1(hidden) # dropout/ add/ norm + + ffn_out = self.ffn(hidden) + ffn_out = self.dropout(ffn_out) + hidden = ffn_out + hidden + hidden = self.ln2(hidden) + return hidden, cache + + +class RelationalTransformerEncoder(nn.Layer): + def __init__(self, cfg, name=None): + super(RelationalTransformerEncoder, self).__init__() + n_layers = cfg["num_hidden_layers"] + self.block = nn.LayerList( + [RelationalTransformerBlock(cfg, new_name(name, "layer_%d" % i)) for i in range(n_layers)] + ) + + def forward(self, inputs, relations, attn_bias=None, past_cache=None): + """relational transformer encoder, forward stage of + n layers and m heads transformer blocks with relations + + Args: + inputs (TYPE): shape= [batch, seq_len, hidden] + relations (TYPE): shape = [batch, seq_len, seq_len] + attn_bias (TYPE): mask for inputs 
sequence. Default is None + past_cache (TYPE): Default is None + + Returns: (last_hidden_state, all_hidden_state_list, (cache_list_k, cache_list_v)) + + Raises: NULL + """ + if past_cache is not None: + assert isinstance( + past_cache, tuple + ), "unknown type of `past_cache`," + " expect tuple or list. got %s" % repr(type(past_cache)) + past_cache = list(zip(*past_cache)) + else: + past_cache = [None] * len(self.block) + cache_list_k, cache_list_v, hidden_list = [], [], [inputs] + + for b, p in zip(self.block, past_cache): + inputs, cache = b(inputs, relations, attn_bias=attn_bias, past_cache=p) + cache_k, cache_v = cache + cache_list_k.append(cache_k) + cache_list_v.append(cache_v) + hidden_list.append(inputs) + + return inputs, hidden_list, (cache_list_k, cache_list_v) + + +if __name__ == "__main__": + """run some simple test cases""" + cfg = { + "num_hidden_layers": 12, + "num_attention_heads": 2, + "num_relations": 99, + "hidden_size": 4, + "hidden_act": "relu", + "attention_probs_dropout_prob": 0.1, + "hidden_dropout_prob": 0.1, + "initializer_range": 0.02, + } + + model = RelationalTransformerEncoder(cfg) + print(model) + inputs = paddle.to_tensor(list(range(24)), dtype="float32").reshape([2, 3, 4]) + relations = paddle.to_tensor(list(range(18)), dtype="int64").reshape([2, 3, 3]) + hidden, _, _ = model(inputs, relations) + print(hidden) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_beam_search.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_beam_search.py new file mode 100644 index 0000000000000000000000000000000000000000..6b43d0610c4d70a89585b44569cd04799529dad0 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_beam_search.py @@ -0,0 +1,445 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
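+
+# Module overview (informal): beam-search strategies used at decoding time.
+# beam_search_with_heuristics beam-searches decoding prefixes until a FROM
+# clause must be filled, enumerates candidate completions, and filters them
+# with schema heuristics: no duplicated tables, foreign keys checked against
+# shortest paths in db.foreign_key_graph (networkx), and every table whose
+# column is mentioned must itself be selected. The *_with_oracle_* variants
+# instead force the gold columns/tables or the gold tree sketch during search.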
+ +import operator + +import attr +import networkx as nx +from text2sql.dataproc.sql_preproc_v2 import get_field_presence_info +from text2sql.models.beam_search import Hypothesis +from text2sql.models.sql_decoder.decoder import TreeState + + +@attr.s +class Hypothesis4Filtering(Hypothesis): + column_history = attr.ib(factory=list) + table_history = attr.ib(factory=list) + key_column_history = attr.ib(factory=list) + + +def beam_search_with_heuristics(model, inputs, beam_size, max_steps, from_cond=True): + """ + Find the valid FROM clause with beam search + """ + orig_inputs = inputs["orig_inputs"][0] + # inference_state, next_choices = model.inference(inputs, orig_inputs.db) + inference_state, next_choices = model(inputs, db=orig_inputs.db, is_train=False) + beam = [Hypothesis4Filtering(inference_state, next_choices)] + + cached_finished_seqs = [] # cache filtered trajectories + beam_prefix = beam + while True: + # search prefixes with beam search + prefixes2fill_from = [] + for step in range(max_steps): + if len(prefixes2fill_from) >= beam_size: + break + + candidates = [] + for hyp in beam_prefix: + if ( + hyp.inference_state.cur_item.state == hyp.inference_state.State.CHILDREN_APPLY + and hyp.inference_state.cur_item.node_type == "from" + ): + # only from not fill, save it and process in the following code + prefixes2fill_from.append(hyp) + else: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: beam_size - len(prefixes2fill_from)] + + # Create the new hypotheses from the expansions + beam_prefix = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + + # cache column choice + column_history = hyp.column_history[:] + if ( + hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY + and hyp.inference_state.cur_item.node_type == "column" + ): + column_history = column_history + [choice] + + # get next choices + next_choices = inference_state.step(choice) + assert next_choices is not None + beam_prefix.append( + Hypothesis4Filtering( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + column_history, + ) + ) + + prefixes2fill_from.sort(key=operator.attrgetter("score"), reverse=True) + # assert len(prefixes) == beam_size + + # enumerating + beam_from = prefixes2fill_from + max_size = 6 + unfiltered_finished = [] + prefixes_unfinished = [] + for step in range(max_steps): + if len(unfiltered_finished) + len(prefixes_unfinished) > max_size: + break + + candidates = [] + for hyp in beam_from: + if ( + step > 0 + and hyp.inference_state.cur_item.state == hyp.inference_state.State.CHILDREN_APPLY + and hyp.inference_state.cur_item.node_type == "from" + ): + prefixes_unfinished.append(hyp) + else: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: max_size - len(prefixes_unfinished)] + + beam_from = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + + # cache table choice + table_history = hyp.table_history[:] + key_column_history = hyp.key_column_history[:] + if hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY: + if 
hyp.inference_state.cur_item.node_type == "table": + table_history = table_history + [choice] + elif hyp.inference_state.cur_item.node_type == "column": + key_column_history = key_column_history + [choice] + + next_choices = inference_state.step(choice) + if next_choices is None: + unfiltered_finished.append( + Hypothesis4Filtering( + inference_state, + None, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + hyp.column_history, + table_history, + key_column_history, + ) + ) + else: + beam_from.append( + Hypothesis4Filtering( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + hyp.column_history, + table_history, + key_column_history, + ) + ) + # [END] for step in range(max_steps) + + unfiltered_finished.sort(key=operator.attrgetter("score"), reverse=True) + + # filtering + filtered_finished = [] + for hyp in unfiltered_finished: + mentioned_column_ids = set(hyp.column_history) + mentioned_key_column_ids = set(hyp.key_column_history) + mentioned_table_ids = set(hyp.table_history) + + # duplicate tables + if len(mentioned_table_ids) != len(hyp.table_history): + continue + + # the foreign key should be correctly used + # NOTE: the new version does not predict conditions in FROM clause anymore + if from_cond: + covered_tables = set() + must_include_key_columns = set() + candidate_table_ids = sorted(mentioned_table_ids) + start_table_id = candidate_table_ids[0] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path( + orig_inputs.db.foreign_key_graph, source=start_table_id, target=table_id + ) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + if target_table_id not in mentioned_table_ids: + continue + col1, col2 = orig_inputs.db.foreign_key_graph[source_table_id][target_table_id]["columns"] + must_include_key_columns.add(col1) + must_include_key_columns.add(col2) + if not must_include_key_columns == mentioned_key_column_ids: + continue + + # tables whose columns are mentioned should also exist + must_table_ids = set() + for col in mentioned_column_ids: + tab_ = orig_inputs.db.columns[col].table + if tab_ is not None: + must_table_ids.add(tab_.id) + if not must_table_ids.issubset(mentioned_table_ids): + continue + + filtered_finished.append(hyp) + + filtered_finished.sort(key=operator.attrgetter("score"), reverse=True) + # filtered.sort(key=lambda x: x.score / len(x.choice_history), reverse=True) + prefixes_unfinished.sort(key=operator.attrgetter("score"), reverse=True) + # new_prefixes.sort(key=lambda x: x.score / len(x.choice_history), reverse=True) + + prefixes_, filtered_ = merge_beams(prefixes_unfinished, filtered_finished, beam_size) + + if filtered_: + cached_finished_seqs = cached_finished_seqs + filtered_ + cached_finished_seqs.sort(key=operator.attrgetter("score"), reverse=True) + + if prefixes_ and len(prefixes_[0].choice_history) < 200: + beam_prefix = prefixes_ + for hyp in beam_prefix: + hyp.table_history = [] + hyp.column_history = [] + hyp.key_column_history = [] + elif cached_finished_seqs: + return cached_finished_seqs[:beam_size] + else: + return unfiltered_finished[:beam_size] + + +# merge sorted beam +def merge_beams(beam_1, beam_2, beam_size): + if len(beam_1) == 0 or len(beam_2) == 0: + return beam_1, beam_2 + + annoated_beam_1 = [("beam_1", b) for b 
in beam_1] + annoated_beam_2 = [("beam_2", b) for b in beam_2] + merged_beams = annoated_beam_1 + annoated_beam_2 + merged_beams.sort(key=lambda x: x[1].score, reverse=True) + + ret_beam_1 = [] + ret_beam_2 = [] + for label, beam in merged_beams[:beam_size]: + if label == "beam_1": + ret_beam_1.append(beam) + else: + assert label == "beam_2" + ret_beam_2.append(beam) + return ret_beam_1, ret_beam_2 + + +def beam_search_with_oracle_column(model, inputs, preproc_item, beam_size, max_steps): + inference_state, next_choices = model.begin_inference(inputs, preproc_item) + beam = [Hypothesis(inference_state, next_choices)] + finished = [] + assert beam_size == 1 + + # identify all the cols mentioned in the gold sql + root_node = preproc_item[1].tree + + col_queue = list( + reversed([val for val in model.decoder.ast_wrapper.find_all_descendants_of_type(root_node, "column")]) + ) + tab_queue = list( + reversed([val for val in model.decoder.ast_wrapper.find_all_descendants_of_type(root_node, "table")]) + ) + col_queue_copy = col_queue[:] + tab_queue_copy = tab_queue[:] + + predict_counter = 0 + + for step in range(max_steps): + # Check if all beams are finished + if len(finished) == beam_size: + break + + # hijack the next choice using the gold col + assert len(beam) == 1 + hyp = beam[0] + if hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY: + if hyp.inference_state.cur_item.node_type == "column" and len(col_queue) > 0: + gold_col = col_queue[0] + + flag = False + for _choice in hyp.next_choices: + if _choice[0] == gold_col: + flag = True + hyp.next_choices = [_choice] + col_queue = col_queue[1:] + break + assert flag + elif hyp.inference_state.cur_item.node_type == "table" and len(tab_queue) > 0: + gold_tab = tab_queue[0] + + flag = False + for _choice in hyp.next_choices: + if _choice[0] == gold_tab: + flag = True + hyp.next_choices = [_choice] + tab_queue = tab_queue[1:] + break + assert flag + + # for debug + if hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY: + predict_counter += 1 + + # For each hypothesis, get possible expansions + # Score each expansion + candidates = [] + for hyp in beam: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + + # Keep the top K expansions + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: beam_size - len(finished)] + + # Create the new hypotheses from the expansions + beam = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + next_choices = inference_state.step(choice) + if next_choices is None: + finished.append( + Hypothesis( + inference_state, + None, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + else: + beam.append( + Hypothesis( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + if (len(col_queue_copy) + len(tab_queue_copy)) != predict_counter: + # print("The number of column/tables are not matched") + pass + finished.sort(key=operator.attrgetter("score"), reverse=True) + return finished + + +def beam_search_with_oracle_sketch(model, inputs, preproc_item, beam_size, max_steps): + inference_state, next_choices = model.begin_inference(inputs, preproc_item) + hyp = Hypothesis(inference_state, next_choices) + + parsed = model.decoder.preproc.grammar.parse(inputs["orig_inputs"][0].code, "val") + 
if not parsed: + return [] + + queue = [ + TreeState( + node=preproc_item[1].tree, + parent_field_type=model.decoder.preproc.grammar.root_type, + ) + ] + + while queue: + item = queue.pop() + node = item.node + parent_field_type = item.parent_field_type + + if isinstance(node, (list, tuple)): + node_type = parent_field_type + "*" + rule = (node_type, len(node)) + if rule not in model.decoder.rules_index: + return [] + rule_idx = model.decoder.rules_index[rule] + assert inference_state.cur_item.state == inference_state.State.LIST_LENGTH_APPLY + + if model.decoder.preproc.use_seq_elem_rules and parent_field_type in model.decoder.ast_wrapper.sum_types: + parent_field_type += "_seq_elem" + + for i, elem in reversed(list(enumerate(node))): + queue.append(TreeState(node=elem, parent_field_type=parent_field_type)) + + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [rule_idx], hyp.score_history + [0]) + continue + + if parent_field_type in model.decoder.preproc.grammar.pointers: + assert inference_state.cur_item.state == inference_state.State.POINTER_APPLY + # best_choice = max(next_choices, key=lambda x: x[1]) + # node = best_choice[0] # override the node + + assert isinstance(node, int) + next_choices = inference_state.step(node) + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [node], hyp.score_history + [0]) + continue + + if parent_field_type in model.decoder.ast_wrapper.primitive_types: + field_value_split = model.decoder.preproc.grammar.tokenize_field_value(node) + [""] + + for token in field_value_split: + next_choices = inference_state.step(token) + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + field_value_split, hyp.score_history + [0]) + continue + + type_info = model.decoder.ast_wrapper.singular_types[node["_type"]] + + if parent_field_type in model.decoder.preproc.sum_type_constructors: + # ApplyRule, like expr -> Call + rule = (parent_field_type, type_info.name) + rule_idx = model.decoder.rules_index[rule] + assert inference_state.cur_item.state == inference_state.State.SUM_TYPE_APPLY + extra_rules = [ + model.decoder.rules_index[parent_field_type, extra_type] for extra_type in node.get("_extra_types", []) + ] + next_choices = inference_state.step(rule_idx, extra_rules) + + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [rule_idx], hyp.score_history + [0]) + + if type_info.fields: + # ApplyRule, like Call -> expr[func] expr*[args] keyword*[keywords] + # Figure out which rule needs to be applied + present = get_field_presence_info(model.decoder.ast_wrapper, node, type_info.fields) + rule = (node["_type"], tuple(present)) + rule_idx = model.decoder.rules_index[rule] + next_choices = inference_state.step(rule_idx) + + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [rule_idx], hyp.score_history + [0]) + + # reversed so that we perform a DFS in left-to-right order + for field_info in reversed(type_info.fields): + if field_info.name not in node: + continue + queue.append(TreeState(node=node[field_info.name], parent_field_type=field_info.type)) + + return [hyp] diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6f0ea85344b7e0c679730356928c8749cf71cd66 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/align_dec_func.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/align_dec_func.py new file mode 100644 index 0000000000000000000000000000000000000000..954693a5c60cbe9c35bd4208094e9ae36a72e41c --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/align_dec_func.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def compute_align_loss(model, desc_enc, example): + """model: a nl2code decoder""" + # find relevant columns + root_node = example.tree + rel_cols = list(reversed([val for val in model.ast_wrapper.find_all_descendants_of_type(root_node, "column")])) + rel_tabs = list(reversed([val for val in model.ast_wrapper.find_all_descendants_of_type(root_node, "table")])) + rel_vals = np.abs( + list(reversed([val for val in model.ast_wrapper.find_all_descendants_of_type(root_node, "value")])) + ) + + rel_cols_t = paddle.to_tensor(sorted(list(set(rel_cols))), dtype="int64") + rel_tabs_t = paddle.to_tensor(sorted(list(set(rel_tabs))), dtype="int64") + rel_vals_t = paddle.to_tensor(sorted(list(set(rel_vals))), dtype="int64") + + mc_att_on_rel_col = desc_enc.m2c_align_mat.index_select(rel_cols_t, axis=1) + mc_max_rel_att = mc_att_on_rel_col.max(axis=0) + mc_max_rel_att = mc_max_rel_att.clip(min=1e-9) + + mt_att_on_rel_tab = desc_enc.m2t_align_mat.index_select(rel_tabs_t, axis=1) + mt_max_rel_att = mt_att_on_rel_tab.max(axis=0) + mt_max_rel_att = mt_max_rel_att.clip(min=1e-9) + + mv_att_on_rel_val = desc_enc.m2v_align_mat.index_select(rel_vals_t, axis=1) + mv_max_rel_att = mv_att_on_rel_val.max(axis=0) + mv_max_rel_att = mv_max_rel_att.clip(min=1e-9) + + value_loss_weight = 2.0 + align_loss = ( + -paddle.log(mc_max_rel_att).mean() + - paddle.log(mt_max_rel_att).mean() + - value_loss_weight * paddle.log(mv_max_rel_att).mean() + ) + return align_loss + + +def compute_pointer_with_align(model, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc): + """compute_pointer_with_align""" + new_state, attention_weights = model._update_state( + node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc + ) + # output shape: batch (=1) x emb_size + output = new_state[0] + memory_pointer_logits = model.pointers[node_type](output, desc_enc.memory) + memory_pointer_probs = 
paddle.nn.functional.softmax(memory_pointer_logits, axis=1) + # pointer_logits shape: batch (=1) x num choices + if node_type == "column": + pointer_probs = paddle.mm(memory_pointer_probs, desc_enc.m2c_align_mat) + elif node_type == "table": + pointer_probs = paddle.mm(memory_pointer_probs, desc_enc.m2t_align_mat) + else: # value + pointer_probs = paddle.mm(memory_pointer_probs, desc_enc.m2v_align_mat) + pointer_probs = pointer_probs.clip(min=1e-9) + pointer_logits = paddle.log(pointer_probs) + return output, new_state, pointer_logits, attention_weights + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/decoder.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/decoder.py new file mode 100644 index 0000000000000000000000000000000000000000..56d022236f3b5ce9df34bb7e086c8d9c7956570b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/decoder.py @@ -0,0 +1,638 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import itertools + +import attr +import paddle +import paddle.nn.functional as F +from text2sql.dataproc import sql_preproc_v2, vocab +from text2sql.models import attention +from text2sql.models.sql_decoder import align_dec_func +from text2sql.models.sql_decoder.infer_tree_traversal import InferenceTreeTraversal +from text2sql.models.sql_decoder.train_tree_traversal import TrainTreeTraversal +from text2sql.models.sql_decoder.tree_traversal import TreeTraversal + + +def maybe_stack(items, axis=None): + to_stack = [item for item in items if item is not None] + if not to_stack: + return None + elif len(to_stack) == 1: + return to_stack[0].unsqueeze(axis) + else: + return paddle.stack(to_stack, axis) + + +def accumulate_logprobs(d, keys_and_logprobs): + for key, logprob in keys_and_logprobs: + existing = d.get(key) + if existing is None: + d[key] = logprob + else: + d[key] = paddle.logsumexp(paddle.stack((logprob, existing), axis=0), axis=0) + + +@attr.s +class TreeState: + node = attr.ib() + parent_field_type = attr.ib() + + +class Text2SQLDecoder(paddle.nn.Layer): + """Decoder model""" + + Preproc = sql_preproc_v2.SQLPreproc + + def __init__( + self, + preproc, + # + rule_emb_size=128, + node_embed_size=64, + # TODO: This should be automatically inferred from encoder + enc_recurrent_size=768, + recurrent_size=512, + dropout=0.0, + desc_attn="bahdanau", + copy_pointer=None, + multi_loss_type="logsumexp", + sup_att=None, + use_align_mat=False, + use_align_loss=False, + enumerate_order=False, + loss_type="softmax", + ): + """init""" + super().__init__() + self.preproc = preproc + self.ast_wrapper = preproc.ast_wrapper + self.terminal_vocab = preproc.vocab + + self.rule_emb_size = rule_emb_size + self.node_emb_size = node_embed_size + self.enc_recurrent_size = enc_recurrent_size + self.recurrent_size = recurrent_size + + self.rules_index = {v: idx for idx, v in 
enumerate(self.preproc.all_rules)} + self.use_align_mat = use_align_mat + self.use_align_loss = use_align_loss + self.enumerate_order = enumerate_order + + if use_align_mat: + self.compute_align_loss = lambda *args: align_dec_func.compute_align_loss(self, *args) + self.compute_pointer_with_align = lambda *args: align_dec_func.compute_pointer_with_align(self, *args) + + if self.preproc.use_seq_elem_rules: + self.node_type_vocab = vocab.Vocab( + sorted(self.preproc.primitive_types) + + sorted(self.ast_wrapper.custom_primitive_types) + + sorted(self.preproc.sum_type_constructors.keys()) + + sorted(self.preproc.field_presence_infos.keys()) + + sorted(self.preproc.seq_lengths.keys()), + special_elems=(), + ) + else: + self.node_type_vocab = vocab.Vocab( + sorted(self.preproc.primitive_types) + + sorted(self.ast_wrapper.custom_primitive_types) + + sorted(self.ast_wrapper.sum_types.keys()) + + sorted(self.ast_wrapper.singular_types.keys()) + + sorted(self.preproc.seq_lengths.keys()), + special_elems=(), + ) + + self.state_update = paddle.nn.LSTMCell( + input_size=self.rule_emb_size * 2 + self.enc_recurrent_size + self.recurrent_size + self.node_emb_size, + hidden_size=self.recurrent_size, + ) + # dropout=dropout) + + self.attn_type = desc_attn + if desc_attn == "bahdanau": + self.desc_attn = attention.BahdanauAttention( + query_size=self.recurrent_size, value_size=self.enc_recurrent_size, proj_size=50 + ) + elif desc_attn == "mha": + self.desc_attn = attention.MultiHeadedAttention( + h=8, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + elif desc_attn == "mha-1h": + self.desc_attn = attention.MultiHeadedAttention( + h=1, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + elif desc_attn == "sep": + self.question_attn = attention.MultiHeadedAttention( + h=1, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + self.schema_attn = attention.MultiHeadedAttention( + h=1, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + else: + # TODO: Figure out how to get right sizes (query, value) to module + self.desc_attn = desc_attn + self.sup_att = sup_att + + self.rule_logits = paddle.nn.Sequential( + paddle.nn.Linear(self.recurrent_size, self.rule_emb_size), + paddle.nn.Tanh(), + paddle.nn.Linear(self.rule_emb_size, len(self.rules_index)), + ) + self.rule_embedding = paddle.nn.Embedding( + num_embeddings=len(self.rules_index), embedding_dim=self.rule_emb_size + ) + + self.gen_logodds = paddle.nn.Linear(self.recurrent_size, 1) + self.terminal_logits = paddle.nn.Sequential( + paddle.nn.Linear(self.recurrent_size, self.rule_emb_size), + paddle.nn.Tanh(), + paddle.nn.Linear(self.rule_emb_size, len(self.terminal_vocab)), + ) + self.terminal_embedding = paddle.nn.Embedding( + num_embeddings=len(self.terminal_vocab), embedding_dim=self.rule_emb_size + ) + if copy_pointer is None: + self.copy_pointer = attention.BahdanauPointer( + query_size=self.recurrent_size, key_size=self.enc_recurrent_size, proj_size=50 + ) + else: + # TODO: Figure out how to get right sizes (query, key) to module + self.copy_pointer = copy_pointer + if multi_loss_type == "logsumexp": + self.multi_loss_reduction = lambda logprobs: -paddle.logsumexp(logprobs, axis=1) + elif multi_loss_type == "mean": + self.multi_loss_reduction = lambda logprobs: -paddle.mean(logprobs, axis=1) + + self.pointers = {} + self.pointer_action_emb_proj = {} + for pointer_type in self.preproc.grammar.pointers: + self.pointers[pointer_type] = attention.ScaledDotProductPointer( + 
query_size=self.recurrent_size, key_size=self.enc_recurrent_size + ) + self.pointer_action_emb_proj[pointer_type] = paddle.nn.Linear(self.enc_recurrent_size, self.rule_emb_size) + setattr(self, pointer_type + "_pointer", self.pointers[pointer_type]) + setattr(self, pointer_type + "_action_emb_proj", self.pointer_action_emb_proj[pointer_type]) + + self.node_type_embedding = paddle.nn.Embedding( + num_embeddings=len(self.node_type_vocab), embedding_dim=self.node_emb_size + ) + + # TODO batching + self.zero_rule_emb = paddle.zeros([1, self.rule_emb_size]) + self.zero_recurrent_emb = paddle.zeros([1, self.recurrent_size]) + if loss_type == "softmax": + self.xent_loss = paddle.nn.CrossEntropyLoss(reduction="none") + elif loss_type == "entmax": + raise ValueError("entmax is not supported") + # self.xent_loss = entmax.entmax15_loss + elif loss_type == "sparsemax": + raise ValueError("sparsemax is not supported") + # self.xent_loss = entmax.sparsemax_loss + elif loss_type == "label_smooth": + self.xent_loss = self.label_smooth_loss + + def label_smooth_loss(self, X, target, smooth_value=0.1): + """label smooth loss""" + if self.training: + logits = paddle.log_softmax(X, axis=1) + size = X.size()[1] + one_hot = paddle.full(X.size(), smooth_value / (size - 1)).to(X.device) + one_hot.scatter_(1, target.unsqueeze(0), 1 - smooth_value) + loss = F.kl_div(logits, one_hot, reduction="batchmean") + return loss.unsqueeze(0) + else: + return paddle.nn.functional.cross_entropy(X, target, reduction="none") + + @classmethod + def _calculate_rules(cls, preproc): + """calculate rules""" + offset = 0 + + all_rules = [] + rules_mask = {} + + # Rules of the form: + # expr -> Attribute | Await | BinOp | BoolOp | ... + # expr_seq_elem -> Attribute | Await | ... | Template1 | Template2 | ... + for parent, children in sorted(preproc.sum_type_constructors.items()): + assert parent not in rules_mask + rules_mask[parent] = (offset, offset + len(children)) + offset += len(children) + all_rules += [(parent, child) for child in children] + + # Rules of the form: + # FunctionDef + # -> identifier name, arguments args + # | identifier name, arguments args, stmt* body + # | identifier name, arguments args, expr* decorator_list + # | identifier name, arguments args, expr? returns + # ... 
+ # | identifier name, arguments args, stmt* body, expr* decorator_list, expr returns + for name, field_presence_infos in sorted(preproc.field_presence_infos.items()): + assert name not in rules_mask + rules_mask[name] = (offset, offset + len(field_presence_infos)) + offset += len(field_presence_infos) + all_rules += [(name, presence) for presence in field_presence_infos] + + # Rules of the form: + # stmt* -> stmt + # | stmt stmt + # | stmt stmt stmt + for seq_type_name, lengths in sorted(preproc.seq_lengths.items()): + assert seq_type_name not in rules_mask + rules_mask[seq_type_name] = (offset, offset + len(lengths)) + offset += len(lengths) + all_rules += [(seq_type_name, i) for i in lengths] + + return all_rules, rules_mask + + def compute_loss(self, enc_input, example, desc_enc, debug=False): + """train main""" + if not (self.enumerate_order and self.training): + mle_loss = self.compute_mle_loss(enc_input, example, desc_enc, debug) + else: + mle_loss = self.compute_loss_from_all_ordering(enc_input, example, desc_enc, debug) + + if self.use_align_loss: + align_loss = self.compute_align_loss(desc_enc, example) + return mle_loss + align_loss + return mle_loss + + def compute_loss_from_all_ordering(self, enc_input, example, desc_enc, debug): + """compute loss from all ordering""" + + def get_permutations(node): + """get permutations""" + + def traverse_tree(node): + """traverse tree""" + nonlocal permutations + if isinstance(node, (list, tuple)): + p = itertools.permutations(range(len(node))) + permutations.append(list(p)) + for child in node: + traverse_tree(child) + elif isinstance(node, dict): + for node_name in node: + traverse_tree(node[node_name]) + + permutations = [] + traverse_tree(node) + return permutations + + def get_perturbed_tree(node, permutation): + """get perturbed tree""" + + def traverse_tree(node, parent_type, parent_node): + """traverse tree""" + if isinstance(node, (list, tuple)): + nonlocal permutation + p_node = [node[i] for i in permutation[0]] + parent_node[parent_type] = p_node + permutation = permutation[1:] + for child in node: + traverse_tree(child, None, None) + elif isinstance(node, dict): + for node_name in node: + traverse_tree(node[node_name], node_name, node) + + node = copy.deepcopy(node) + traverse_tree(node, None, None) + return node + + orig_tree = example.tree + permutations = get_permutations(orig_tree) + products = itertools.product(*permutations) + loss_list = [] + for product in products: + tree = get_perturbed_tree(orig_tree, product) + example.tree = tree + loss = self.compute_mle_loss(enc_input, example, desc_enc) + loss_list.append(loss) + example.tree = orig_tree + loss_v = paddle.stack(loss_list, 0) + return paddle.logsumexp(loss_v, 0) + + def compute_mle_loss(self, enc_input, example, desc_enc, debug=False): + """compute mle loss""" + traversal = TrainTreeTraversal(self, desc_enc, debug) + traversal.step(None) + # for debug + # class List(list): + # def __init__(self, *args, **kwargs): + # """ """ + # super(List, self).__init__(*args, **kwargs) + + # def append(self, *args, **kwargs): + # """ """ + # super().append(*args, **kwargs) + # print('append:', list(reversed(self))) + + # def pop(self): + # """ """ + # print('pop:', list(reversed(self))) + # item = super().pop() + # return item + # + # queue = List() + # queue.append( + # TreeState( + # node=example.tree, + # parent_field_type=self.preproc.grammar.root_type, + # )) + + queue = [ + TreeState( + node=example.tree, + parent_field_type=self.preproc.grammar.root_type, + ) + ] + while 
queue: + item = queue.pop() + node = item.node + parent_field_type = item.parent_field_type + + if isinstance(node, (list, tuple)): + node_type = parent_field_type + "*" + rule = (node_type, len(node)) + rule_idx = self.rules_index[rule] + assert traversal.cur_item.state == TreeTraversal.State.LIST_LENGTH_APPLY + traversal.step(rule_idx) + + if self.preproc.use_seq_elem_rules and parent_field_type in self.ast_wrapper.sum_types: + parent_field_type += "_seq_elem" + + for i, elem in reversed(list(enumerate(node))): + queue.append( + TreeState( + node=elem, + parent_field_type=parent_field_type, + ) + ) + continue + + if parent_field_type in self.preproc.grammar.pointers: + assert isinstance(node, int) + assert traversal.cur_item.state == TreeTraversal.State.POINTER_APPLY + pointer_map = desc_enc.pointer_maps.get(parent_field_type) + # TODO: fix -1 + if node == -1: + node = 0 + if pointer_map: + values = pointer_map[node] + traversal.step(values[0]) + else: + traversal.step(node) + continue + + if parent_field_type in self.ast_wrapper.primitive_types: + # identifier, int, string, bytes, object, singleton + # - could be bytes, str, int, float, bool, NoneType + # - terminal tokens vocabulary is created by turning everything into a string (with `str`) + # - at decoding time, cast back to str/int/float/bool + field_value_split = self.preproc.grammar.tokenize_field_value(node) + [vocab.EOS] + + for token in field_value_split: + assert traversal.cur_item.state == TreeTraversal.State.GEN_TOKEN + traversal.step(token) + continue + + type_info = self.ast_wrapper.singular_types[node["_type"]] + + if parent_field_type in self.preproc.sum_type_constructors: + # ApplyRule, like expr -> Call + rule = (parent_field_type, type_info.name) + rule_idx = self.rules_index[rule] + assert traversal.cur_item.state == TreeTraversal.State.SUM_TYPE_APPLY + extra_rules = [ + self.rules_index[parent_field_type, extra_type] for extra_type in node.get("_extra_types", []) + ] + traversal.step(rule_idx, extra_rules) + + if type_info.fields: + # ApplyRule, like Call -> expr[func] expr*[args] keyword*[keywords] + # Figure out which rule needs to be applied + present = sql_preproc_v2.get_field_presence_info(self.ast_wrapper, node, type_info.fields) + rule = (node["_type"], tuple(present)) + rule_idx = self.rules_index[rule] + assert traversal.cur_item.state == TreeTraversal.State.CHILDREN_APPLY + traversal.step(rule_idx) + + # reversed so that we perform a DFS in left-to-right order + for field_info in reversed(type_info.fields): + if field_info.name not in node: + continue + + queue.append( + TreeState( + node=node[field_info.name], + parent_field_type=field_info.type, + ) + ) + + loss = paddle.sum(paddle.stack(tuple(traversal.loss), axis=0), axis=0) + if debug: + return loss, [attr.asdict(entry) for entry in traversal.history] + else: + return loss + + def inference(self, desc_enc, db, value_list): + """infer main""" + traversal = InferenceTreeTraversal(self, desc_enc, db, value_list) + choices = traversal.step(None) + return traversal, choices + + def _desc_attention(self, prev_state, desc_enc): + """desc attention + + Args: + prev_state (tuple): (prev_hidden, prev_cell_state) + desc_enc (encoder.EncoderState) + """ + # prev_state shape: + # - h_n: batch (=1) x emb_size + # - c_n: batch (=1) x emb_size + query = prev_state[0] + if self.attn_type != "sep": + return self.desc_attn(query, desc_enc.memory, attn_mask=None) + else: + question_context, question_attention_logits = self.question_attn(query, desc_enc.question_memory) 
+            schema_context, schema_attention_logits = self.schema_attn(query, desc_enc.schema_memory)
+            return question_context + schema_context, schema_attention_logits
+
+    def _tensor(self, data, dtype=None):
+        """create a new tensor"""
+        return paddle.to_tensor(data, dtype=dtype)
+
+    def _index(self, vocab, word):
+        """get token id"""
+        return self._tensor([vocab.index(word)])
+
+    def _update_state(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc):
+        """update state"""
+        # desc_context: shape = batch (=1) x emb_size
+        desc_context, attention_logits = self._desc_attention(prev_state, desc_enc)
+        # node_type_emb: shape = batch (=1) x emb_size
+        node_type_emb = self.node_type_embedding(self._index(self.node_type_vocab, node_type))
+
+        # Build the input used to update the LSTM state
+        state_input = paddle.concat(
+            (
+                prev_action_emb,  # a_{t-1}: rule_emb_size
+                desc_context,  # c_t: enc_recurrent_size
+                parent_h,  # s_{p_t}: recurrent_size
+                parent_action_emb,  # a_{p_t}: rule_emb_size
+                node_type_emb,  # n_{f-t}: node_emb_size
+            ),
+            axis=-1,
+        )
+        # state_input shape: batch (=1) x (emb_size * 5)
+        _, new_state = self.state_update(state_input, prev_state)
+        return new_state, attention_logits
+
+    def apply_rule(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc):
+        """apply rule"""
+        new_state, attention_logits = self._update_state(
+            node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc
+        )
+        # output shape: batch (=1) x emb_size
+        output = new_state[0]
+        # rule_logits shape: batch (=1) x num choices
+        rule_logits = self.rule_logits(output)
+
+        return output, new_state, rule_logits
+
+    def rule_infer(self, node_type, rule_logits):
+        """rule infer"""
+        rule_logprobs = paddle.nn.functional.log_softmax(rule_logits, axis=-1)
+        rules_start, rules_end = self.preproc.rules_mask[node_type]
+
+        # TODO: Mask other probabilities first?
+ return list(zip(range(rules_start, rules_end), rule_logprobs[0, rules_start:rules_end])) + + def gen_token(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc): + """gen token""" + new_state, attention_logits = self._update_state( + node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc + ) + # output shape: batch (=1) x emb_size + output = new_state[0] + + # gen_logodds shape: batch (=1) + gen_logodds = self.gen_logodds(output).squeeze(1) + + return new_state, output, gen_logodds + + def gen_token_loss(self, output, gen_logodds, token, desc_enc): + """gen token loss""" + # token_idx shape: batch (=1), LongTensor + token_idx = self._index(self.terminal_vocab, token) + # action_emb shape: batch (=1) x emb_size + + # +unk, +in desc: copy + # +unk, -in desc: gen (an unk token) + # -unk, +in desc: copy, gen + # -unk, -in desc: gen + # gen_logodds shape: batch (=1) + desc_locs = desc_enc.find_word_occurrences(token) + if desc_locs: + # copy: if the token appears in the description at least once + # copy_loc_logits shape: batch (=1) x desc length + copy_loc_logits = self.copy_pointer(output, desc_enc.memory) + copy_logprob = ( + # log p(copy | output) + # shape: batch (=1) + paddle.nn.functional.log_sigmoid(-gen_logodds) + - self.xent_loss(copy_loc_logits, self._tensor(desc_locs[0:1])) + # xent_loss: -log p(location | output) + # TODO: sum the probability of all occurrences + # shape: batch (=1) + ) + else: + copy_logprob = None + + # gen: ~(unk & in desc), equivalent to ~unk | ~in desc + if token in self.terminal_vocab or copy_logprob is None: + token_logits = self.terminal_logits(output) + # shape: + gen_logprob = ( + # log p(gen | output) + # shape: batch (=1) + paddle.nn.functional.log_sigmoid(gen_logodds) + - self.xent_loss(token_logits, token_idx) + # xent_loss: -log p(token | output) + # shape: batch (=1) + ) + else: + gen_logprob = None + + # loss should be -log p(...), so negate + loss_piece = -paddle.logsumexp(maybe_stack([copy_logprob, gen_logprob], axis=1), axis=1) + return loss_piece + + def token_infer(self, output, gen_logodds, desc_enc): + """token infer""" + # Copy tokens + # log p(copy | output) + # shape: batch (=1) + copy_logprob = paddle.nn.functional.log_sigmoid(-gen_logodds) + copy_loc_logits = self.copy_pointer(output, desc_enc.memory) + # log p(loc_i | copy, output) + # shape: batch (=1) x seq length + copy_loc_logprobs = paddle.nn.functional.log_softmax(copy_loc_logits, axis=-1) + # log p(loc_i, copy | output) + copy_loc_logprobs += copy_logprob + + log_prob_by_word = {} + # accumulate_logprobs is needed because the same word may appear + # multiple times in desc_enc.words. 
+ accumulate_logprobs(log_prob_by_word, zip(desc_enc.words, copy_loc_logprobs.squeeze(0))) + + # Generate tokens + # log p(~copy | output) + # shape: batch (=1) + gen_logprob = paddle.nn.functional.log_sigmoid(gen_logodds) + token_logits = self.terminal_logits(output) + # log p(v | ~copy, output) + # shape: batch (=1) x vocab size + token_logprobs = paddle.nn.functional.log_softmax(token_logits, axis=-1) + # log p(v, ~copy| output) + # shape: batch (=1) x vocab size + token_logprobs += gen_logprob + + accumulate_logprobs( + log_prob_by_word, + ((self.terminal_vocab[idx], token_logprobs[0, idx]) for idx in range(token_logprobs.shape[1])), + ) + + return list(log_prob_by_word.items()) + + def compute_pointer(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc): + """compute pointer""" + new_state, attention_logits = self._update_state( + node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc + ) + # output shape: batch (=1) x emb_size + output = new_state[0] + # pointer_logits shape: batch (=1) x num choices + pointer_logits = self.pointers[node_type](output, desc_enc.pointer_memories[node_type]) + + return output, new_state, pointer_logits, attention_logits + + def pointer_infer(self, node_type, logits): + """pointer infer""" + logprobs = paddle.nn.functional.log_softmax(logits, axis=-1) + # TODO batching + return list(zip(range(logits.shape[1]), logprobs[0])) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/infer_tree_traversal.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/infer_tree_traversal.py new file mode 100644 index 0000000000000000000000000000000000000000..654472b58e0f26855d98495448bc1539964d7a19 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/infer_tree_traversal.py @@ -0,0 +1,216 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
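+
+# NOTE: InferenceTreeTraversal below records a sequence of TreeAction objects
+# while the decoder walks the grammar, and finalize() replays them to rebuild
+# the AST dict and unparse it into a SQL string. A rough usage sketch, assuming
+# the decoder's inference()/step()/finalize() from this codebase; the greedy
+# `pick_best` driver is only an illustration, not part of this module:
+#
+#     traversal, choices = model.inference(desc_enc, db, value_list)
+#     while choices is not None:
+#         choice = pick_best(choices)   # e.g. take the candidate with max log-prob
+#         choices = traversal.step(choice)
+#     tree, sql_query = traversal.finalize()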
+ +import attr +import pyrsistent +import paddle + +from text2sql.models.sql_decoder.tree_traversal import TreeTraversal +from text2sql.dataproc import vocab + + +class InferenceTreeTraversal(TreeTraversal): + """InferenceTreeTraversal""" + + class TreeAction: + pass + + @attr.s(frozen=True) + class SetParentField(TreeAction): + """SetParentField""" + + parent_field_name = attr.ib() + node_type = attr.ib() + node_value = attr.ib(default=None) + + @attr.s(frozen=True) + class CreateParentFieldList(TreeAction): + """CreateParentFieldList""" + + parent_field_name = attr.ib() + + @attr.s(frozen=True) + class AppendTerminalToken(TreeAction): + """AppendTerminalToken""" + + parent_field_name = attr.ib() + value = attr.ib() + + @attr.s(frozen=True) + class FinalizeTerminal(TreeAction): + """FinalizeTerminal""" + + parent_field_name = attr.ib() + terminal_type = attr.ib() + + @attr.s(frozen=True) + class NodeFinished(TreeAction): + """NodeFinished""" + + pass + + SIMPLE_TERMINAL_TYPES = { + "str": str, + "int": int, + "float": float, + "bool": lambda n: {"True": True, "False": False}.get(n, False), + } + + SIMPLE_TERMINAL_TYPES_DEFAULT = { + "str": "", + "int": 0, + "float": 0, + "bool": True, + } + + def __init__(self, model, desc_enc, db=None, value_list=None): + """__init__""" + super().__init__(model, desc_enc) + self.actions = pyrsistent.pvector() + self.db = db + self.value_list = value_list + + def clone(self): + """clone""" + super_clone = super().clone() + super_clone.actions = self.actions + super_clone.db = self.db + super_clone.value_list = self.value_list + return super_clone + + def rule_choice(self, node_type, rule_logits): + """rule_choice""" + return self.model.rule_infer(node_type, rule_logits) + + def token_choice(self, output, gen_logodds): + """token_choice""" + return self.model.token_infer(output, gen_logodds, self.desc_enc) + + def pointer_choice(self, node_type, logits, attention_logits): + """pointer_choice""" + # Group them based on pointer map + pointer_logprobs = self.model.pointer_infer(node_type, logits) + pointer_map = self.desc_enc.pointer_maps.get(node_type) + if not pointer_map: + return pointer_logprobs + + pointer_logprobs = dict(pointer_logprobs) + return [ + ( + orig_index, + paddle.logsumexp(paddle.stack(tuple(pointer_logprobs[i] for i in mapped_indices), axis=0), axis=0), + ) + for orig_index, mapped_indices in pointer_map.items() + ] + + def update_using_last_choice(self, last_choice, extra_choice_info, attention_offset): + """update_using_last_choice""" + super().update_using_last_choice(last_choice, extra_choice_info, attention_offset) + + # Record actions + # CHILDREN_INQUIRE + if self.cur_item.state == TreeTraversal.State.CHILDREN_INQUIRE: + self.actions = self.actions.append( + self.SetParentField(self.cur_item.parent_field_name, self.cur_item.node_type) + ) + type_info = self.model.ast_wrapper.singular_types[self.cur_item.node_type] + if not type_info.fields: + self.actions = self.actions.append(self.NodeFinished()) + + # LIST_LENGTH_APPLY + elif self.cur_item.state == TreeTraversal.State.LIST_LENGTH_APPLY: + self.actions = self.actions.append(self.CreateParentFieldList(self.cur_item.parent_field_name)) + + # GEN_TOKEN + elif self.cur_item.state == TreeTraversal.State.GEN_TOKEN: + if last_choice == vocab.EOS: + self.actions = self.actions.append( + self.FinalizeTerminal(self.cur_item.parent_field_name, self.cur_item.node_type) + ) + elif last_choice is not None: + self.actions = self.actions.append( + 
self.AppendTerminalToken(self.cur_item.parent_field_name, last_choice) + ) + + elif self.cur_item.state == TreeTraversal.State.POINTER_APPLY: + self.actions = self.actions.append( + self.SetParentField(self.cur_item.parent_field_name, node_type=None, node_value=last_choice) + ) + + # NODE_FINISHED + elif self.cur_item.state == TreeTraversal.State.NODE_FINISHED: + self.actions = self.actions.append(self.NodeFinished()) + + def finalize(self): + """finalize""" + root = current = None + stack = [] + for i, action in enumerate(self.actions): + if isinstance(action, self.SetParentField): + if action.node_value is None: + new_node = {"_type": action.node_type} + else: + new_node = action.node_value + + if action.parent_field_name is None: + # Initial node in tree. + assert root is None + root = current = new_node + stack.append(root) + continue + + existing_list = current.get(action.parent_field_name) + if existing_list is None: + current[action.parent_field_name] = new_node + else: + assert isinstance(existing_list, list) + current[action.parent_field_name].append(new_node) + + if action.node_value is None: + stack.append(current) + current = new_node + + elif isinstance(action, self.CreateParentFieldList): + current[action.parent_field_name] = [] + + elif isinstance(action, self.AppendTerminalToken): + tokens = current.get(action.parent_field_name) + if tokens is None: + tokens = current[action.parent_field_name] = [] + tokens.append('"' + action.value + '"') + + elif isinstance(action, self.FinalizeTerminal): + terminal = "".join(current.get(action.parent_field_name, [])) + constructor = self.SIMPLE_TERMINAL_TYPES.get(action.terminal_type) + if constructor: + try: + value = constructor(terminal) + except ValueError: + value = self.SIMPLE_TERMINAL_TYPES_DEFAULT[action.terminal_type] + elif action.terminal_type == "bytes": + value = terminal.decode("latin1") + elif action.terminal_type == "NoneType": + value = None + else: + raise ValueError(f"Unknown terminal type: {action.terminal_type}") + current[action.parent_field_name] = value + + elif isinstance(action, self.NodeFinished): + current = stack.pop() + + else: + raise ValueError(action) + + assert not stack + return root, self.model.preproc.grammar.unparse(root, self.db, self.value_list) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/train_tree_traversal.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/train_tree_traversal.py new file mode 100644 index 0000000000000000000000000000000000000000..259322bfb05e4b482f2b05596862965dcd683ea9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/train_tree_traversal.py @@ -0,0 +1,132 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
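+
+# NOTE: TrainTreeTraversal below is the teacher-forced counterpart of
+# InferenceTreeTraversal: at every decision point it stores an XentChoicePoint
+# or TokenChoicePoint, and update_using_last_choice() turns the gold choice
+# into a loss term appended to self.loss. A rough sketch of how the decoder
+# drives it during compute_mle_loss (gold_rule_idx stands in for the gold
+# decisions read off example.tree):
+#
+#     traversal = TrainTreeTraversal(model, desc_enc)
+#     traversal.step(None)            # prime the traversal at the grammar root
+#     traversal.step(gold_rule_idx)   # then feed gold rules / tokens / pointers
+#     ...
+#     loss = paddle.sum(paddle.stack(tuple(traversal.loss), axis=0), axis=0)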
+ +import operator + +import attr +import pyrsistent +import paddle + +from text2sql.models.sql_decoder.tree_traversal import TreeTraversal + + +@attr.s +class ChoiceHistoryEntry: + """ChoiceHistoryEntry""" + + rule_left = attr.ib() + choices = attr.ib() + probs = attr.ib() + valid_choices = attr.ib() + + +class TrainTreeTraversal(TreeTraversal): + """TrainTreeTraversal""" + + @attr.s(frozen=True) + class XentChoicePoint: + """XentChoicePoint""" + + logits = attr.ib() + weight = attr.ib(default=1.0) + + def compute_loss(self, outer, idx, extra_indices): + """compute loss""" + if extra_indices: + logprobs = paddle.nn.functional.log_softmax(self.logits, axis=1) + valid_logprobs = logprobs[:, [idx] + extra_indices] + return self.weight * outer.model.multi_loss_reduction(valid_logprobs) + else: + # idx shape: batch (=1) + idx = outer.model._tensor([idx]) + # loss_piece shape: batch (=1) + loss = outer.model.xent_loss(self.logits, idx) + return self.weight * loss + + @attr.s(frozen=True) + class TokenChoicePoint: + """TokenChoicePoint""" + + lstm_output = attr.ib() + gen_logodds = attr.ib() + + def compute_loss(self, outer, token, extra_tokens): + """compute loss""" + return outer.model.gen_token_loss(self.lstm_output, self.gen_logodds, token, outer.desc_enc) + + def __init__(self, model, desc_enc, debug=False): + """__init__""" + super().__init__(model, desc_enc) + self.choice_point = None + self.loss = pyrsistent.pvector() + + self.debug = debug + self.history = pyrsistent.pvector() + + def clone(self): + """clone""" + super_clone = super().clone() + super_clone.choice_point = self.choice_point + super_clone.loss = self.loss + super_clone.debug = self.debug + super_clone.history = self.history + return super_clone + + def rule_choice(self, node_type, rule_logits): + """rule_choice""" + self.choice_point = self.XentChoicePoint(rule_logits) + if self.debug: + choices = [] + probs = [] + for rule_idx, logprob in sorted( + self.model.rule_infer(node_type, rule_logits), key=operator.itemgetter(1), reverse=True + ): + _, rule = self.model.preproc.all_rules[rule_idx] + choices.append(rule) + probs.append(logprob.exp().item()) + self.history = self.history.append(ChoiceHistoryEntry(node_type, choices, probs, None)) + + def token_choice(self, output, gen_logodds): + """token_choice""" + self.choice_point = self.TokenChoicePoint(output, gen_logodds) + + def pointer_choice(self, node_type, logits, attention_logits): + """pointer_choice""" + loss_weight = 1.0 + if node_type == "value": + loss_weight = 2.0 + self.choice_point = self.XentChoicePoint(logits, weight=loss_weight) + self.attention_choice = self.XentChoicePoint(attention_logits, weight=loss_weight) + + def update_using_last_choice(self, last_choice, extra_choice_info, attention_offset): + """update_using_last_choice""" + super().update_using_last_choice(last_choice, extra_choice_info, attention_offset) + if last_choice is None: + return + + if self.debug and isinstance(self.choice_point, self.XentChoicePoint): + valid_choice_indices = [last_choice] + ([] if extra_choice_info is None else extra_choice_info) + self.history[-1].valid_choices = [ + self.model.preproc.all_rules[rule_idx][1] for rule_idx in valid_choice_indices + ] + + self.loss = self.loss.append(self.choice_point.compute_loss(self, last_choice, extra_choice_info)) + + if attention_offset is not None and self.attention_choice is not None: + self.loss = self.loss.append( + self.attention_choice.compute_loss(self, attention_offset, extra_indices=None) + ) + + self.choice_point = None 
+ self.attention_choice = None diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/tree_traversal.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/tree_traversal.py new file mode 100644 index 0000000000000000000000000000000000000000..47ec81d78b6e1c3db76bba0ef7da0dcc8372a4e5 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/tree_traversal.py @@ -0,0 +1,429 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import enum + +import attr +import pyrsistent +from text2sql.dataproc import vocab +from text2sql.utils import nn_utils + + +@attr.s +class TreeState: + node = attr.ib() + parent_field_type = attr.ib() + + +class TreeTraversal: + class Handler: + handlers = {} + + @classmethod + def register_handler(cls, func_type): + """register_handler""" + if func_type in cls.handlers: + raise RuntimeError(f"{func_type} handler is already registered") + + def inner_func(func): + """inner_func""" + cls.handlers[func_type] = func.__name__ + return func + + return inner_func + + @attr.s(frozen=True) + class QueueItem: + item_id = attr.ib() + state = attr.ib() + node_type = attr.ib() + parent_action_emb = attr.ib() + parent_h = attr.ib() # parent hidden state + parent_field_name = attr.ib() + + def to_str(self): + """to_str""" + return f"" + + class State(enum.Enum): + """State""" + + SUM_TYPE_INQUIRE = 0 + SUM_TYPE_APPLY = 1 + CHILDREN_INQUIRE = 2 + CHILDREN_APPLY = 3 + LIST_LENGTH_INQUIRE = 4 + LIST_LENGTH_APPLY = 5 + GEN_TOKEN = 6 + POINTER_INQUIRE = 7 + POINTER_APPLY = 8 + NODE_FINISHED = 9 + + def __init__(self, model, desc_enc): + """__init__""" + if model is None: + return + + self.model = model + self.desc_enc = desc_enc + + # TODO: model.state_update.set_dropout_masks(batch_size=1) + self.recurrent_state = nn_utils.lstm_init(None, self.model.recurrent_size, 1) + self.prev_action_emb = model.zero_rule_emb + + root_type = model.preproc.grammar.root_type + if root_type in model.preproc.ast_wrapper.sum_types: + initial_state = TreeTraversal.State.SUM_TYPE_INQUIRE + else: + initial_state = TreeTraversal.State.CHILDREN_INQUIRE + + self.queue = pyrsistent.pvector() + self.cur_item = TreeTraversal.QueueItem( + item_id=0, + state=initial_state, + node_type=root_type, + parent_action_emb=self.model.zero_rule_emb, + parent_h=self.model.zero_recurrent_emb, + parent_field_name=None, + ) + self.next_item_id = 1 + + self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule + + def clone(self): + """clone""" + other = self.__class__(None, None) + other.model = self.model + other.desc_enc = self.desc_enc + other.recurrent_state = self.recurrent_state + other.prev_action_emb = self.prev_action_emb + other.queue = self.queue + other.cur_item = self.cur_item + other.next_item_id = self.next_item_id + other.actions = self.actions + other.update_prev_action_emb = self.update_prev_action_emb + return other + + def step(self, last_choice, extra_choice_info=None, 
attention_offset=None):
+        """step"""
+        while True:
+            self.update_using_last_choice(last_choice, extra_choice_info, attention_offset)
+
+            # Dispatch to the handler registered for the current state. Handlers
+            # return (choices, continued): when continued is True the traversal
+            # advanced on its own and the loop keeps going without a new choice;
+            # otherwise the choices are returned to the caller.
+            handler_name = TreeTraversal.Handler.handlers[self.cur_item.state]
+            handler = getattr(self, handler_name)
+            choices, continued = handler(last_choice)
+            if continued:
+                last_choice = choices
+                continue
+            else:
+                return choices
+
+    def update_using_last_choice(self, last_choice, extra_choice_info, attention_offset):
+        """update_using_last_choice"""
+        if last_choice is None:
+            return
+        self.update_prev_action_emb(self, last_choice, extra_choice_info)
+
+    @classmethod
+    def _update_prev_action_emb_apply_rule(cls, self, last_choice, extra_choice_info):
+        """_update_prev_action_emb_apply_rule"""
+        rule_idx = self.model._tensor([last_choice])
+        self.prev_action_emb = self.model.rule_embedding(rule_idx)
+
+    @classmethod
+    def _update_prev_action_emb_gen_token(cls, self, last_choice, extra_choice_info):
+        """_update_prev_action_emb_gen_token"""
+        # token_idx shape: batch (=1), LongTensor
+        token_idx = self.model._index(self.model.terminal_vocab, last_choice)
+        # action_emb shape: batch (=1) x emb_size
+        self.prev_action_emb = self.model.terminal_embedding(token_idx)
+
+    @classmethod
+    def _update_prev_action_emb_pointer(cls, self, last_choice, extra_choice_info):
+        """_update_prev_action_emb_pointer"""
+        # TODO batching
+        self.prev_action_emb = self.model.pointer_action_emb_proj[self.cur_item.node_type](
+            self.desc_enc.pointer_memories[self.cur_item.node_type][:, last_choice]
+        )
+
+    def pop(self):
+        """pop"""
+        # Pop the last item off the traversal queue and make it the current item.
+        if len(self.queue) > 0:
+            self.cur_item = self.queue[-1]
+            self.queue = self.queue.delete(len(self.queue) - 1)
+            return True
+        return False
+
+    @Handler.register_handler(State.SUM_TYPE_INQUIRE)
+    def process_sum_inquire(self, last_choice):
+        """process_sum_inquire"""
+        # 1. ApplyRule, like expr -> Call
+        # a. Ask which one to choose
+        output, self.recurrent_state, rule_logits = self.model.apply_rule(
+            self.cur_item.node_type,
+            self.recurrent_state,
+            self.prev_action_emb,
+            self.cur_item.parent_h,
+            self.cur_item.parent_action_emb,
+            self.desc_enc,
+        )
+        self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.SUM_TYPE_APPLY, parent_h=output)
+
+        self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule
+        choices = self.rule_choice(self.cur_item.node_type, rule_logits)
+        return choices, False
+
+    @Handler.register_handler(State.SUM_TYPE_APPLY)
+    def process_sum_apply(self, last_choice):
+        """process_sum_apply"""
+        # b. Add action, prepare for #2
+        sum_type, singular_type = self.model.preproc.all_rules[last_choice]
+        assert sum_type == self.cur_item.node_type
+
+        self.cur_item = attr.evolve(
+            self.cur_item,
+            node_type=singular_type,
+            parent_action_emb=self.prev_action_emb,
+            state=TreeTraversal.State.CHILDREN_INQUIRE,
+        )
+        return None, True
+
+    @Handler.register_handler(State.CHILDREN_INQUIRE)
+    def process_children_inquire(self, last_choice):
+        """process_children_inquire"""
+        # 2. ApplyRule, like Call -> expr[func] expr*[args] keyword*[keywords]
+        # Check if we have no children
+        type_info = self.model.ast_wrapper.singular_types[self.cur_item.node_type]
+        if not type_info.fields:
+            if self.pop():
+                last_choice = None
+                return last_choice, True
+            else:
+                return None, False
+
+        # a. Ask about presence
+        output, self.recurrent_state, rule_logits = self.model.apply_rule(
+            self.cur_item.node_type,
+            self.recurrent_state,
+            self.prev_action_emb,
+            self.cur_item.parent_h,
+            self.cur_item.parent_action_emb,
+            self.desc_enc,
+        )
+        self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.CHILDREN_APPLY, parent_h=output)
+
+        self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule
+        choices = self.rule_choice(self.cur_item.node_type, rule_logits)
+        return choices, False
+
+    @Handler.register_handler(State.CHILDREN_APPLY)
+    def process_children_apply(self, last_choice):
+        """process_children_apply"""
+        # b. Create the children
+        node_type, children_presence = self.model.preproc.all_rules[last_choice]
+        assert node_type == self.cur_item.node_type
+
+        self.queue = self.queue.append(
+            TreeTraversal.QueueItem(
+                item_id=self.cur_item.item_id,
+                state=TreeTraversal.State.NODE_FINISHED,
+                node_type=None,
+                parent_action_emb=None,
+                parent_h=None,
+                parent_field_name=None,
+            )
+        )
+        for field_info, present in reversed(
+            list(
+                zip(
+                    self.model.ast_wrapper.singular_types[node_type].fields,
+                    children_presence,
+                )
+            )
+        ):
+            if not present:
+                continue
+
+            # seq field: LIST_LENGTH_INQUIRE x
+            # sum type: SUM_TYPE_INQUIRE x
+            # product type:
+            #   no children: not possible
+            #   children: CHILDREN_INQUIRE
+            # constructor type: not possible x
+            # builtin type: GEN_TOKEN x
+            child_type = field_type = field_info.type
+            if field_info.seq:
+                child_state = TreeTraversal.State.LIST_LENGTH_INQUIRE
+            elif field_type in self.model.ast_wrapper.sum_types:
+                child_state = TreeTraversal.State.SUM_TYPE_INQUIRE
+            elif field_type in self.model.ast_wrapper.product_types:
+                assert self.model.ast_wrapper.product_types[field_type].fields
+                child_state = TreeTraversal.State.CHILDREN_INQUIRE
+            elif field_type in self.model.preproc.grammar.pointers:
+                child_state = TreeTraversal.State.POINTER_INQUIRE
+            elif field_type in self.model.ast_wrapper.primitive_types:
+                child_state = TreeTraversal.State.GEN_TOKEN
+                child_type = present
+            else:
+                raise ValueError(f"Unable to handle field type {field_type}")
+
+            self.queue = self.queue.append(
+                TreeTraversal.QueueItem(
+                    item_id=self.next_item_id,
+                    state=child_state,
+                    node_type=child_type,
+                    parent_action_emb=self.prev_action_emb,
+                    parent_h=self.cur_item.parent_h,
+                    parent_field_name=field_info.name,
+                )
+            )
+            self.next_item_id += 1
+
+        # pop the last element of the queue and assign it to self.cur_item
+        advanced = self.pop()
+        assert advanced
+        last_choice = None
+        return last_choice, True
+
+    @Handler.register_handler(State.LIST_LENGTH_INQUIRE)
+    def process_list_length_inquire(self, last_choice):
+        """process_list_length_inquire"""
+        list_type = self.cur_item.node_type + "*"
+        output, self.recurrent_state, rule_logits = self.model.apply_rule(
+            list_type,
+            self.recurrent_state,
+            self.prev_action_emb,
+            self.cur_item.parent_h,
+            self.cur_item.parent_action_emb,
+            self.desc_enc,
+        )
+        self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.LIST_LENGTH_APPLY, parent_h=output)
+
+        self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule
+        choices = self.rule_choice(list_type, rule_logits)
+        return choices, False
+
+    @Handler.register_handler(State.LIST_LENGTH_APPLY)
+    def process_list_length_apply(self, last_choice):
+        """process_list_length_apply"""
+        list_type, num_children = self.model.preproc.all_rules[last_choice]
+        elem_type = self.cur_item.node_type
+        assert list_type == elem_type + "*"
+
+        child_node_type = elem_type
+        if elem_type in self.model.ast_wrapper.sum_types:
+            child_state = TreeTraversal.State.SUM_TYPE_INQUIRE
+            if self.model.preproc.use_seq_elem_rules:
+                child_node_type = elem_type + "_seq_elem"
+        elif elem_type in self.model.ast_wrapper.product_types:
+            child_state = TreeTraversal.State.CHILDREN_INQUIRE
+        elif elem_type == "identifier":
+            child_state = TreeTraversal.State.GEN_TOKEN
+            child_node_type = "str"
+        elif elem_type in self.model.ast_wrapper.primitive_types:
+            raise ValueError("sequential builtin types not supported")
+        else:
+            raise ValueError(f"Unable to handle seq field type {elem_type}")
+
+        for i in range(num_children):
+            self.queue = 
self.queue.append( + TreeTraversal.QueueItem( + item_id=self.next_item_id, + state=child_state, + node_type=child_node_type, + parent_action_emb=self.prev_action_emb, + parent_h=self.cur_item.parent_h, + parent_field_name=self.cur_item.parent_field_name, + ) + ) + self.next_item_id += 1 + + advanced = self.pop() + assert advanced + last_choice = None + return last_choice, True + + @Handler.register_handler(State.GEN_TOKEN) + def process_gen_token(self, last_choice): + """process_gen_token""" + if last_choice == vocab.EOS: + if self.pop(): + last_choice = None + return last_choice, True + else: + return None, False + + self.recurrent_state, output, gen_logodds = self.model.gen_token( + self.cur_item.node_type, + self.recurrent_state, + self.prev_action_emb, + self.cur_item.parent_h, + self.cur_item.parent_action_emb, + self.desc_enc, + ) + self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_gen_token + choices = self.token_choice(output, gen_logodds) + return choices, False + + @Handler.register_handler(State.POINTER_INQUIRE) + def process_pointer_inquire(self, last_choice): + """process_pointer_inquire""" + # a. Ask which one to choose + output, self.recurrent_state, logits, attention_logits = self.model.compute_pointer_with_align( + self.cur_item.node_type, + self.recurrent_state, + self.prev_action_emb, + self.cur_item.parent_h, + self.cur_item.parent_action_emb, + self.desc_enc, + ) + self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.POINTER_APPLY, parent_h=output) + + self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_pointer + choices = self.pointer_choice(self.cur_item.node_type, logits, attention_logits) + return choices, False + + @Handler.register_handler(State.POINTER_APPLY) + def process_pointer_apply(self, last_choice): + """process_pointer_apply""" + if self.pop(): + last_choice = None + return last_choice, True + else: + return None, False + + @Handler.register_handler(State.NODE_FINISHED) + def process_node_finished(self, last_choice): + """process_node_finished""" + if self.pop(): + last_choice = None + return last_choice, True + else: + return None, False + + def rule_choice(self, node_type, rule_logits): + """rule_choice""" + raise NotImplementedError + + def token_choice(self, output, gen_logodds): + """token_choice""" + raise NotImplementedError + + def pointer_choice(self, node_type, logits, attention_logits): + """pointer_choice""" + raise NotImplementedError diff --git a/examples/text_to_sql/RAT-SQL/text2sql/optim.py b/examples/text_to_sql/RAT-SQL/text2sql/optim.py new file mode 100644 index 0000000000000000000000000000000000000000..3ca7406376f3e80295569813af80508a1af5664a --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/optim.py @@ -0,0 +1,54 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
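+
+# NOTE: get_warmup_and_linear_decay below scales the base learning rate by
+#     min(step / warmup_steps, 1 - (step - warmup_steps) / (max_steps - warmup_steps)),
+# i.e. a linear warm-up followed by a linear decay to zero. For example, with
+# the (hypothetical) settings max_steps=10000 and warmup_steps=1000, the
+# multiplier is 0.5 at step 500, 1.0 at step 1000, 0.5 again at step 5500 and
+# 0.0 at step 10000.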
+ +import re + +import paddle + +param_name_to_exclude_from_weight_decay = re.compile(r".*layer_norm_scale|.*layer_norm_bias|.*b_0") + + +def get_warmup_and_linear_decay(max_steps, warmup_steps): + """ERNIE/demo/utils.py""" + return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) + + +def init_optimizer(model, config, train_steps, scale_params_lr=None): + if scale_params_lr is not None: + for model, lr_scale in scale_params_lr: + for param in model.parameters(): + param.optimize_attr["learning_rate"] *= lr_scale + + warmup_steps = int(config.warmup_proportion * train_steps) + lr_scheduler = paddle.optimizer.lr.LambdaDecay( + config.learning_rate, get_warmup_and_linear_decay(train_steps, warmup_steps) + ) + optimizer = paddle.optimizer.AdamW( + lr_scheduler, + parameters=model.parameters(), + weight_decay=config.weight_decay, + apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n), + grad_clip=paddle.nn.ClipGradByGlobalNorm(config.grad_clip), + ) + return optimizer + + +if __name__ == "__main__": + """run some simple test cases""" + import types + + model = paddle.vision.models.LeNet() + config = types.SimpleNamespace(learning_rate=1e-3, warmup_proportion=0.1, weight_decay=0.2, grad_clip=1.0) + optim = init_optimizer(model, config, train_steps=10000) + print(optim) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6b9868224721c471baafd08878ebf0a7a84d765a --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .utils import * diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/ast_util.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/ast_util.py new file mode 100644 index 0000000000000000000000000000000000000000..2c13a7d855970b30fa8af2b5e158ec1f1956b270 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/ast_util.py @@ -0,0 +1,272 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Any, Dict, Union + +import asdl +import attr + + +class ASTWrapperVisitor(asdl.VisitorBase): + """Used by ASTWrapper to collect information. + + - put constructors in one place. + - checks that all fields have names. + - get all optional fields. 
+ """ + + def __init__(self): + # type: () -> None + super(ASTWrapperVisitor, self).__init__() + self.constructors = {} # type: Dict[str, asdl.Constructor] + self.sum_types = {} # type: Dict[str, asdl.Sum] + self.product_types = {} # type: Dict[str, asdl.Product] + self.fieldless_constructors = {} # type: Dict[str, asdl.Constructor] + + def visitModule(self, mod): + # type: (asdl.Module) -> None + for dfn in mod.dfns: + self.visit(dfn) + + def visitType(self, type_): + # type: (asdl.Type) -> None + self.visit(type_.value, str(type_.name)) + + def visitSum(self, sum_, name): + # type: (asdl.Sum, str) -> None + self.sum_types[name] = sum_ + for t in sum_.types: + self.visit(t, name) + + def visitConstructor(self, cons, _name): + # type: (asdl.Constructor, str) -> None + assert cons.name not in self.constructors + self.constructors[cons.name] = cons + if not cons.fields: + self.fieldless_constructors[cons.name] = cons + for f in cons.fields: + self.visit(f, cons.name) + + def visitField(self, field, name): + # type: (asdl.Field, str) -> None + # pylint: disable=no-self-use + if field.name is None: + raise ValueError(f"Field of type {field.type} in {name} lacks name") + + def visitProduct(self, prod, name): + # type: (asdl.Product, str) -> None + self.product_types[name] = prod + for f in prod.fields: + self.visit(f, name) + + +SingularType = Union[asdl.Constructor, asdl.Product] + + +class ASTWrapper(object): + """Provides helper methods on the ASDL AST.""" + + default_primitive_type_checkers = { + "identifier": lambda x: isinstance(x, str), + "int": lambda x: isinstance(x, int), + "string": lambda x: isinstance(x, str), + "bytes": lambda x: isinstance(x, bytes), + "object": lambda x: isinstance(x, object), + "singleton": lambda x: x is True or x is False or x is None, + } + + # pylint: disable=too-few-public-methods + + def __init__(self, ast_def, custom_primitive_type_checkers={}): + # type: (asdl.Module, str) -> None + self.ast_def = ast_def + + visitor = ASTWrapperVisitor() + visitor.visit(ast_def) + + self.constructors = visitor.constructors + self.sum_types = visitor.sum_types + self.product_types = visitor.product_types + self.seq_fragment_constructors = {} + self.primitive_type_checkers = {**self.default_primitive_type_checkers, **custom_primitive_type_checkers} + self.custom_primitive_types = set(custom_primitive_type_checkers.keys()) + self.primitive_types = set(self.primitive_type_checkers.keys()) + + # Product types and constructors: + # no need to decide upon a further type for these. 
+ self.singular_types = {} # type: Dict[str, SingularType] + self.singular_types.update(self.constructors) + self.singular_types.update(self.product_types) + + # IndexedSets for each sum type + self.sum_type_vocabs = { + name: sorted(t.name for t in sum_type.types) for name, sum_type in self.sum_types.items() + } + self.constructor_to_sum_type = { + constructor.name: name for name, sum_type in self.sum_types.items() for constructor in sum_type.types + } + self.seq_fragment_constructor_to_sum_type = { + constructor.name: name for name, sum_type in self.sum_types.items() for constructor in sum_type.types + } + self.fieldless_constructors = sorted(visitor.fieldless_constructors.keys()) + + @property + def types(self): + # type: () -> Dict[str, Union[asdl.Sum, asdl.Product]] + return self.ast_def.types + + @property + def root_type(self): + # type: () -> str + return self._root_type + + def add_sum_type(self, name, sum_type): + assert name not in self.sum_types + self.sum_types[name] = sum_type + self.types[name] = sum_type + + for type_ in sum_type.types: + self._add_constructor(name, type_) + + def add_constructors_to_sum_type(self, sum_type_name, constructors): + for constructor in constructors: + self._add_constructor(sum_type_name, constructor) + self.sum_types[sum_type_name].types += constructors + + def remove_product_type(self, product_type_name): + self.singular_types.pop(product_type_name) + self.product_types.pop(product_type_name) + self.types.pop(product_type_name) + + def add_seq_fragment_type(self, sum_type_name, constructors): + for constructor in constructors: + # TODO: Record that this constructor is a sequence fragment? + self._add_constructor(sum_type_name, constructor) + + sum_type = self.sum_types[sum_type_name] + if not hasattr(sum_type, "seq_fragment_types"): + sum_type.seq_fragment_types = [] + sum_type.seq_fragment_types += constructors + + def _add_constructor(self, sum_type_name, constructor): + assert constructor.name not in self.constructors + self.constructors[constructor.name] = constructor + assert constructor.name not in self.singular_types + self.singular_types[constructor.name] = constructor + assert constructor.name not in self.constructor_to_sum_type + self.constructor_to_sum_type[constructor.name] = sum_type_name + + if not constructor.fields: + self.fieldless_constructors.append(constructor.name) + self.fieldless_constructors.sort() + + def verify_ast(self, node, expected_type=None, field_path=(), is_seq=False): + """Checks that `node` conforms to the current ASDL.""" + if node is None: + raise ValueError(f"node is None. path: {field_path}") + if not isinstance(node, dict): + raise ValueError(f"node is type {type(node)}. path: {field_path}") + + node_type = node["_type"] # type: str + if expected_type is not None: + sum_product = self.types[expected_type] + if isinstance(sum_product, asdl.Product): + if node_type != expected_type: + raise ValueError(f"Expected type {expected_type}, but instead saw {node_type}. path: {field_path}") + elif isinstance(sum_product, asdl.Sum): + possible_names = [t.name for t in sum_product.types] + if is_seq: + possible_names += [t.name for t in getattr(sum_product, "seq_fragment_types", [])] + if node_type not in possible_names: + raise ValueError( + f'Expected one of {", ".join(possible_names)}, but instead saw {node_type}. 
path: {field_path}' + ) + + else: + raise ValueError(f"Unexpected type in ASDL: {sum_product}") + + if node_type in self.types: + # Either a product or a sum type; we want it to be a product type + sum_product = self.types[node_type] + if isinstance(sum_product, asdl.Sum): + raise ValueError(f"sum type {node_type} not allowed as node type. path: {field_path}") + fields_to_check = sum_product.fields + elif node_type in self.constructors: + fields_to_check = self.constructors[node_type].fields + else: + raise ValueError(f"Unknown node_type {node_type}. path: {field_path}") + + for field in fields_to_check: + # field.opt: + # - missing is okay + # field.seq + # - missing is okay + # - otherwise, must be list + if field.name not in node: + if field.opt or field.seq: + continue + raise ValueError(f"required field {field.name} is missing. path: {field_path}") + + if field.seq and field.name in node and not isinstance(node[field.name], (list, tuple)): # noqa: E125 + raise ValueError(f"sequential field {field.name} is not sequence. path: {field_path}") + + # Check that each item in this field has the expected type. + items = node.get(field.name, ()) if field.seq else (node.get(field.name),) + + # pylint: disable=cell-var-from-loop + if field.type in self.primitive_type_checkers: + check = self.primitive_type_checkers[field.type] + else: + # pylint: disable=line-too-long + check = lambda n: self.verify_ast( + n, field.type, field_path + (field.name,), is_seq=field.seq + ) # noqa: E731,E501 + + for item in items: + assert check(item) + return True + + def find_all_descendants_of_type(self, tree, target_type, descend_pred=lambda field: True): + queue = [tree] + while queue: + node = queue.pop() + if not isinstance(node, dict): + continue + for field_info in self.singular_types[node["_type"]].fields: + if field_info.opt and field_info.name not in node: + continue + if not descend_pred(field_info): + continue + + if field_info.seq: + values = node.get(field_info.name, []) + else: + values = [node[field_info.name]] + + if field_info.type == target_type: + for value in values: + yield value + else: + queue.extend(values) + + +# Improve this when mypy supports recursive types. +Node = Dict[str, Any] + + +@attr.s +class HoleValuePlaceholder: + id = attr.ib() + is_seq = attr.ib() + is_opt = attr.ib() diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/dusql_evaluation.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/dusql_evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..d832acb97cc7b71e4aed325ea867a159f7faa11b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/dusql_evaluation.py @@ -0,0 +1,1298 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Calculating the exact accuracy. For select, where and others schema, it will be +seen as right if has different order. 
This script refers to https://github.com/taoyds/spider。 +""" +import copy +import json +import logging +import re +from collections import defaultdict +from io import open + +from text2sql.utils import text_utils + +""" +val: number(float)/string(str)/sql(dict) +col_unit: (agg_id, col_id) +val_unit: (unit_op, col_unit1, col_unit2) +table_unit: (table_type, col_unit/sql) +cond_unit: (not_op, cond_op, val_unit, val1, val2) +condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +sql { + 'select': [(agg_id, val_unit), (agg_id, val_unit), ...] + 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + 'where': condition + 'groupBy': [col_unit1, col_unit2, ...] + 'orderBy': ('asc'/'desc', [(agg_id, val_unit), ...]) + 'having': condition + 'limit': None/number(int) + 'intersect': None/sql + 'except': None/sql + 'union': None/sql +} +""" + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +COND_OPS = ("not_in", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +LOGIC_AND_OR = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +CONST_COLUMN = set(["time_now"]) + +EXPECT_BRACKET_PRE_TOKENS = set(AGG_OPS + SQL_OPS + COND_OPS + CLAUSE_KEYWORDS + ("from", ",", "distinct")) + +g_empty_sql = { + "select": [], + "from": {"conds": [], "table_units": []}, + "where": [], + "groupBy": [], + "having": [], + "orderBy": [], + "limit": None, + "except": None, + "intersect": None, + "union": None, +} + +g_eval_value = True +g_is_nl2sql_dataset = False + + +################################# +def tokenize(string): + """ + Args: + + Returns: + """ + string = string.replace("'", '"').lower() + assert string.count('"') % 2 == 0, "Unexpected quote" + + def _extract_value(string): + """extract values in sql""" + fields = string.split('"') + for idx, tok in enumerate(fields): + if idx % 2 == 1: + fields[idx] = '"%s"' % (tok) + return fields + + def _resplit(tmp_tokens, fn_split, fn_omit): + """resplit""" + new_tokens = [] + for token in tmp_tokens: + token = token.strip() + if fn_omit(token): + new_tokens.append(token) + elif re.match(r"\d\d\d\d-\d\d(-\d\d)?", token): + new_tokens.append('"%s"' % (token)) + else: + new_tokens.extend(fn_split(token)) + return new_tokens + + tokens_tmp = _extract_value(string) + + two_bytes_op = ["==", "!=", ">=", "<=", "<>", ""] + sep1 = re.compile(r"([ \+\-\*/\(\),><;])") # 单字节运算符 + sep2 = re.compile("(" + "|".join(two_bytes_op) + ")") # 多字节运算符 + tokens_tmp = _resplit(tokens_tmp, lambda x: x.split(" "), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep2, x), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep1, x), lambda x: x in two_bytes_op or x.startswith('"')) + tokens = list(filter(lambda x: x.strip() not in ("", "distinct", "DISTINCT"), tokens_tmp)) + + def _post_merge(tokens): + """merge: + * col name with "(", ")" + * values with +/- + """ + idx = 1 + while idx < len(tokens): + if tokens[idx] == "(" and tokens[idx - 1] not in EXPECT_BRACKET_PRE_TOKENS: + while idx < len(tokens): + tmp_tok = tokens.pop(idx) + tokens[idx - 1] += tmp_tok + if tmp_tok == ")": + break + elif tokens[idx] in ("+", "-") and tokens[idx - 1] in COND_OPS and idx + 1 < len(tokens): + tokens[idx] += tokens[idx + 1] 
+ tokens.pop(idx + 1) + idx += 1 + else: + idx += 1 + return tokens + + tokens = _post_merge(tokens) + return tokens + + +def scan_alias(toks): + """Scan the index of 'as' and build the map for all alias""" + as_idxs = [idx for idx, tok in enumerate(toks) if tok == "as"] + alias = {} + for idx in as_idxs: + alias[toks[idx + 1]] = toks[idx - 1] + return alias + + +def get_tables_with_alias(schema, toks): + tables = scan_alias(toks) + for key in schema: + assert key not in tables, "Alias {} has the same name in table".format(key) + tables[key] = key + return tables + + +def parse_col(toks, start_idx, tables_with_alias, schema, default_tables=None): + tok = toks[start_idx] + if tok == "*": + return start_idx + 1, schema.id_map[tok] + if tok in CONST_COLUMN: + return start_idx + 1, tok + + if g_is_nl2sql_dataset: + fn_check = lambda tok: "." in tok and tok.startswith("table_") + else: + fn_check = lambda tok: "." in tok + if fn_check(tok): # if token is a composite + alias, col = tok.split(".", 1) + key = tables_with_alias[alias] + "." + col + return start_idx + 1, schema.id_map[key] + + assert default_tables is not None and len(default_tables) > 0, "Default tables should not be None or empty" + + for alias in default_tables: + table = tables_with_alias[alias] + if tok in schema.schema[table]: + key = table + "." + tok + return start_idx + 1, schema.id_map[key] + + raise RuntimeError("Error col: {} from {}".format(tok, toks)) + + +def parse_col_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + assert idx < len_ and toks[idx] == "(" + idx += 1 + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + assert idx < len_ and toks[idx] == ")" + idx += 1 + return idx, (agg_id, col_id) + + agg_id = AGG_OPS.index("none") + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (agg_id, col_id) + + +def parse_val_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + col_unit1 = None + col_unit2 = None + unit_op = UNIT_OPS.index("none") + + idx, col_unit1 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + if idx < len_ and toks[idx] in UNIT_OPS: + unit_op = UNIT_OPS.index(toks[idx]) + idx += 1 + idx, col_unit2 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + if unit_op in (UNIT_OPS.index("+"), UNIT_OPS.index("*")): + col_unit1, col_unit2 = sorted([col_unit1, col_unit2]) + + return idx, (unit_op, col_unit1, col_unit2) + + +def parse_table_unit(toks, start_idx, tables_with_alias, schema): + idx = start_idx + len_ = len(toks) + key = tables_with_alias[toks[idx]] + + if idx + 1 < len_ and toks[idx + 1] == "as": + idx += 3 + else: + idx += 1 + + return idx, schema.id_map[key], key + + +def parse_value(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + def _force_float(str_num): + """force float, just for debug""" + last = "" + while len(str_num) > 0: + try: + n = float(str_num) + if last == "%": + n /= 100 + return n + except 
Exception: + last = str_num[-1] + str_num = str_num[:-1] + raise ValueError("not a float number") + + if toks[idx] == "select": + idx, val = parse_sql(toks, idx, tables_with_alias, schema) + elif toks[idx].startswith('"') and toks[idx].endswith('"'): # token is a string value + val = toks[idx] + idx += 1 + else: + try: + val_str = toks[idx] + # val = float(val_str) if val_str[-1] != '%' else float(val_str[:-1]) / 100 + val = _force_float(val_str) + idx += 1 + except Exception: + end_idx = idx + while ( + end_idx < len_ + and toks[end_idx] != "," + and toks[end_idx] != ")" + and toks[end_idx] != "and" + and toks[end_idx] not in CLAUSE_KEYWORDS + and toks[end_idx] not in JOIN_KEYWORDS + ): + end_idx += 1 + + idx, val = parse_col_unit(toks[start_idx:end_idx], 0, tables_with_alias, schema, default_tables) + idx = end_idx + + if isBlock: + assert toks[idx] == ")" + idx += 1 + + return idx, val + + +def parse_condition(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + conds = [] + + while idx < len_: + agg_id = 0 + if idx < len_ and toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + + op_str = toks[idx] + if op_str == "not": + assert toks[idx + 1] == "in", '"not" must followed by "in"' + op_str = "not_in" + idx += 1 + assert idx < len_ and op_str in COND_OPS, "Error condition: idx: {}, tok: {}".format(idx, op_str) + op_id = COND_OPS.index(op_str) + idx += 1 + val1 = val2 = None + if op_id == COND_OPS.index("between"): + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + assert toks[idx].lower() == "and" + idx += 1 + idx, val2 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + else: + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + val2 = None + + conds.append((agg_id, op_id, val_unit, val1, val2)) + + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";") or toks[idx] in JOIN_KEYWORDS): + break + + if idx < len_ and toks[idx] in LOGIC_AND_OR: + conds.append(toks[idx]) + idx += 1 # skip and/or + + return idx, conds + + +def parse_select(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + assert toks[idx] == "select", "'select' not found" + idx += 1 + val_units = [] + + while idx < len_ and toks[idx] not in CLAUSE_KEYWORDS: + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + + return idx, val_units + + +def parse_from(toks, start_idx, tables_with_alias, schema): + """ + Assume in the from clause, all table units are combined with join + """ + assert "from" in toks[start_idx:], "'from' not found" + + len_ = len(toks) + idx = toks.index("from", start_idx) + 1 + default_tables = [] + table_units = [] + conds = [] + last_table = None + + while idx < len_: + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, sql = parse_sql(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["sql"], sql)) + last_table = sql["from"]["table_units"][0][1].strip("_") + else: + if idx < len_ and toks[idx] == "join": + idx += 1 # skip join + idx, table_unit, table_name = parse_table_unit(toks, idx, 
tables_with_alias, schema) + table_units.append((TABLE_TYPE["table_unit"], table_unit)) + default_tables.append(table_name) + if idx < len_ and toks[idx] == "on": + idx += 1 # skip on + idx, this_conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + if len(conds) > 0: + conds.append("and") + conds.extend(this_conds) + + if isBlock: + assert toks[idx] == ")" + idx += 1 + if idx < len_ and toks[idx] == "a": + assert last_table is not None, "last_table should be a table name string, not None" + tables_with_alias["a"] = last_table + idx += 2 + elif idx < len_ and toks[idx] == "b": + assert last_table is not None, "last_table should be a table name string, not None" + tables_with_alias["b"] = last_table + idx += 1 + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + break + + return [idx, table_units, conds, default_tables] + + +def parse_where(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "where": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_group_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + col_units = [] + + if idx >= len_ or toks[idx] != "group": + return idx, col_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, col_unit = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + col_units.append(col_unit) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, col_units + + +def parse_order_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + val_units = [] + order_type = "asc" # default type is 'asc' + + if idx >= len_ or toks[idx] != "order": + return idx, val_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] in ORDER_OPS: + order_type = toks[idx] + idx += 1 + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, (order_type, val_units) + + +def parse_having(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "having": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_limit(toks, start_idx): + idx = start_idx + len_ = len(toks) + + if idx < len_ and toks[idx] == "limit": + idx += 2 + limit_num = int(toks[idx - 1]) + return idx, limit_num + + return idx, None + + +def parse_sql(toks, start_idx, tables_with_alias, schema): + isBlock = False # indicate whether this is a block of sql/sub-sql + len_ = len(toks) + idx = start_idx + + sql = {} + if toks[idx] == "(": + isBlock = True + idx += 1 + + # parse from clause in order to get default tables + from_end_idx, table_units, conds, default_tables = parse_from(toks, start_idx, tables_with_alias, schema) + sql["from"] = {"table_units": table_units, "conds": conds} + # select clause + 
_, select_col_units = parse_select(toks, idx, tables_with_alias, schema, default_tables) + idx = from_end_idx + sql["select"] = select_col_units + # where clause + idx, where_conds = parse_where(toks, idx, tables_with_alias, schema, default_tables) + sql["where"] = where_conds + # group by clause + idx, group_col_units = parse_group_by(toks, idx, tables_with_alias, schema, default_tables) + sql["groupBy"] = group_col_units + # having clause + idx, having_conds = parse_having(toks, idx, tables_with_alias, schema, default_tables) + sql["having"] = having_conds + # order by clause + idx, order_col_units = parse_order_by(toks, idx, tables_with_alias, schema, default_tables) + sql["orderBy"] = order_col_units + # limit clause + idx, limit_val = parse_limit(toks, idx) + sql["limit"] = limit_val + + idx = skip_semicolon(toks, idx) + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + idx = skip_semicolon(toks, idx) + + # intersect/union/except clause + for op in SQL_OPS: # initialize IUE + sql[op] = None + if idx < len_ and toks[idx] in SQL_OPS: + sql_op = toks[idx] + idx += 1 + idx, IUE_sql = parse_sql(toks, idx, tables_with_alias, schema) + sql[sql_op] = IUE_sql + return idx, sql + + +def load_data(fpath): + with open(fpath) as f: + data = json.load(f) + return data + + +def get_sql(schema, query): + toks = tokenize(query) + tables_with_alias = get_tables_with_alias(schema.schema, toks) + _, sql = parse_sql(toks, 0, tables_with_alias, schema) + + return sql + + +def skip_semicolon(toks, start_idx): + idx = start_idx + while idx < len(toks) and toks[idx] == ";": + idx += 1 + return idx + + +################################# + +g_db_schema_file = None +g_foreign_key_maps = None + + +class Evaluator(object): + """A simple evaluator""" + + def __init__(self, db_schema_file, foreign_key_maps, eval_value=True): + """init""" + self.schemas = {} + self.foreign_key_maps = foreign_key_maps + self.partial_scores = None + self.scores = {"all": {"count": 0, "exact": 0}} + global g_db_schema_file + global g_foreign_key_maps + g_db_schema_file = db_schema_file + g_foreign_key_maps = foreign_key_maps + + with open(db_schema_file) as ifs: + databases = json.load(ifs) + for db in databases: + self.schemas[db["db_id"]] = Schema(db) + is_nl2sql = all([len(x.schema) == 1 for x in self.schemas.values()]) + if is_nl2sql: + global g_is_nl2sql_dataset + g_is_nl2sql_dataset = True + + # number of failed to parse predicted sql query + self.eval_err_num = 0 + + global g_eval_value + g_eval_value = eval_value + self.eval_value = eval_value + + def _eval_exact_match(self, pred, gold): + """eval_exact_match""" + partial_scores = self.eval_partial_match(pred, gold) + self.partial_scores = partial_scores + + for _, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + + gold_table_units = gold["from"]["table_units"] + pred_table_units = pred["from"]["table_units"] + if len(pred_table_units) != len(gold_table_units) or any( + map(lambda x: type(x[0][1]) != type(x[1][1]), zip(pred_table_units, gold_table_units)) # noqa: E721 + ): + return 0 + if type(gold_table_units[0][1]) is not dict: + return 1 if sorted(gold_table_units) == sorted(pred_table_units) else 0 + + # TODO: 严格考虑顺序 + def __eval_from_sql(pred_tables, gold_tables): + """eval from sql""" + for pred_table_unit, gold_table_unit in zip(pred_tables, gold_tables): + pred_table_sql = pred_table_unit[1] + gold_table_sql = gold_table_unit[1] + _, _, correct = eval_nested(pred_table_sql, gold_table_sql) + if correct == 0: + return 0 + return 1 + + 
correct = __eval_from_sql(pred_table_units, gold_table_units) + if len(gold_table_units) > 1 and correct == 0: + return __eval_from_sql(pred_table_units, list(reversed(gold_table_units))) + else: + return correct + + # if len(gold['from']['table_units']) > 0: + # gold_tables = sorted(gold['from']['table_units'], key=lambda x: str(x)) + # pred_tables = sorted(pred['from']['table_units'], key=lambda x: str(x)) + # return gold_tables == pred_tables + # return 1 + + def eval_exact_match(self, pred, gold): + """wrapper of evaluate exact match, to process + `SQL1 intersect/union SQL2` vs `SQL2 intersect/union SQL1` + """ + score = self._eval_exact_match(pred, gold) + if score == 1: + return score + + if gold["union"] is not None: + new_gold = gold["union"] + gold["union"] = None + new_gold["union"] = gold + return self._eval_exact_match(pred, new_gold) + elif gold["intersect"] is not None: + new_gold = gold["intersect"] + gold["intersect"] = None + new_gold["intersect"] = gold + return self._eval_exact_match(pred, new_gold) + else: + return 0 + + def eval_partial_match(self, pred, gold): + """eval partial match""" + res = {} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, gold_total) + res["select(no AGG)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["where"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, gold_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_group(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["group(no Having)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "gold_total": gold_total, + "pred_total": pred_total, + } + + gold_total, pred_total, cnt = eval_having(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_order(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_and_or(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_IUEN(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_keywords(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + return res + + def evaluate_one(self, db_id, gold_query, pred_query): + """evaluate one predicted result, and cache evaluating info + + Args: + db (TYPE): NULL + gold_query (TYPE): NULL + pred_query (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + 
self.scores["all"]["count"] += 1 + + schema = self.schemas[db_id] + kmap = self.foreign_key_maps[db_id] + + gold_query = gold_query.replace("==", "=") + gold_sql = get_sql(schema, gold_query) + # rebuild sql for value evaluation + g_valid_col_units = build_valid_col_units(gold_sql["from"]["table_units"], schema) + gold_sql = rebuild_sql_col(g_valid_col_units, gold_sql, kmap, self.eval_value) + + is_parse_error = False + try: + pred_sql = get_sql(schema, pred_query) + + p_valid_col_units = build_valid_col_units(pred_sql["from"]["table_units"], schema) + pred_sql = rebuild_sql_col(p_valid_col_units, pred_sql, kmap, self.eval_value) + except Exception: + # If pred_sql is not valid, then we will use an empty sql to evaluate with the correct sql + pred_sql = g_empty_sql + self.eval_err_num += 1 + is_parse_error = True + + exact_score = self.eval_exact_match(pred_sql, gold_sql) + if exact_score == 0: + logging.debug("error instance %s:\npred: %s\ngold: %s" % (db_id, pred_query, gold_query)) + self.scores["all"]["exact"] += exact_score + return { + "gold": gold_query, + "pred": pred_query, + "correct": int(exact_score), + "parse_error": int(is_parse_error), + } + + def finalize(self): + """ + Returns: TODO + + Raises: NULL + """ + self.scores["all"]["exact"] /= self.scores["all"]["count"] + return self.scores + + +class Schema(object): + """ + Simple schema which maps table&column to a unique identifier + """ + + def __init__(self, db): + """init""" + self._schema = self._build_schema(db) + self._id_map = self._map(self._schema) + + @property + def schema(self): + """_schema property""" + return self._schema + + @property + def id_map(self): + """_id_map property""" + return self._id_map + + def _build_schema(self, db): + """build schema by input db + + Args: + db (dict): NULL + + Returns: TODO + + Raises: NULL + """ + tables = [x.lower() for x in db.get("table_names_original", db["table_names"])] + dct_table2cols = defaultdict(list) + for table_id, column in db.get("column_names_original", db["column_names"]): + if table_id < 0: + continue + dct_table2cols[tables[table_id]].append(column.lower()) + return dct_table2cols + + def _map(self, schema): + """map""" + id_map = {"*": "__all__"} + for key, vals in schema.items(): + for val in vals: + id_map[key.lower() + "." + val.lower()] = "__" + key.lower() + "." 
+ val.lower() + "__" + + for key in schema: + id_map[key.lower()] = "__" + key.lower() + "__" + + return id_map + + +def get_scores(count, pred_total, gold_total): + """ + Args: + + Returns: + """ + if pred_total != gold_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, gold): + """ + Args: + + Returns: + """ + pred_sel = copy.deepcopy(pred["select"]) + gold_sel = copy.deepcopy(gold["select"]) + gold_wo_agg = [unit[1] for unit in gold_sel] + pred_total = len(pred_sel) + gold_total = len(gold_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in gold_sel: + cnt += 1 + gold_sel.remove(unit) + if unit[1] in gold_wo_agg: + cnt_wo_agg += 1 + gold_wo_agg.remove(unit[1]) + + return [gold_total, pred_total, cnt, cnt_wo_agg] + + +def eval_nested_cond(pred_cond, gold_cond): + """ + + Args: + pred_cond (TYPE): NULL + gold_cond (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if pred_cond[:3] != gold_cond[:3] or type(pred_cond[3]) is not dict: + return 0 + + _, _, correct = eval_nested(pred_cond[3], gold_cond[3]) + if correct == 0: + return 0 + + return pred_cond[4] == gold_cond[4] + + +def eval_cond(pred, gold): + def _equal(p, g): + if str(p) == str(g): + return True + p = p.strip("\"'") if type(p) is str else p + g = g.strip("\"'") if type(g) is str else g + if text_utils.is_float(p) and text_utils.is_float(g) and float(p) == float(g): + return True + return False + + if type(gold[3]) is dict: + return eval_nested_cond(pred, gold) + + if pred[:3] != gold[:3]: + return 0 + + if _equal(pred[3], gold[3]) and _equal(pred[4], gold[4]): + return 1 + else: + return 0 + + +def eval_where(pred, gold): + pred_conds = list(sorted([unit for unit in pred["where"][::2]], key=lambda x: [str(i) for i in x])) + gold_conds = list(sorted([unit for unit in gold["where"][::2]], key=lambda x: [str(i) for i in x])) + # gold_wo_agg = [unit[2] for unit in gold_conds] + pred_total = len(pred_conds) + gold_total = len(gold_conds) + cnt = 0 + cnt_wo_agg = 0 + + for unit_p, unit_g in zip(pred_conds, gold_conds): + cnt += eval_cond(unit_p, unit_g) + + if unit_p[2] == unit_g[2]: + cnt_wo_agg += 1 + + # for unit in pred_conds: + # if unit in gold_conds: + # cnt += 1 + # gold_conds.remove(unit) + # if unit[2] in gold_wo_agg: + # cnt_wo_agg += 1 + # gold_wo_agg.remove(unit[2]) + return [gold_total, pred_total, cnt, cnt_wo_agg] + # return [gold_total, pred_total, cnt, gold_total] + + +def eval_group(pred, gold): + pred_cols = [unit[1] for unit in pred["groupBy"]] + gold_cols = [unit[1] for unit in gold["groupBy"]] + pred_total = len(pred_cols) + gold_total = len(gold_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + gold_cols = [gold.split(".")[1] if "." 
in gold else gold for gold in gold_cols] + for col in pred_cols: + if col in gold_cols: + cnt += 1 + gold_cols.remove(col) + return [gold_total, pred_total, cnt] + + +def eval_having(pred, gold): + """and/or will be evaluate in other branch""" + if len(pred["having"]) != len(gold["having"]): + return [1, 1, 0] + + pred_total = len(pred["having"][::2]) + gold_total = len(gold["having"][::2]) + cnt = 0 + for pred_cond, gold_cond in zip(sorted(pred["having"][::2]), sorted(gold["having"][::2])): + if eval_cond(pred_cond, gold_cond) == 1: + cnt += 1 + + return [gold_total, pred_total, cnt] + + +def eval_order(pred, gold): + pred_total = gold_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(gold["orderBy"]) > 0: + gold_total = 1 + + if len(gold["orderBy"]) > 0 and pred["orderBy"] == gold["orderBy"] and pred["limit"] == gold["limit"]: + cnt = 1 + + return [gold_total, pred_total, cnt] + + +def eval_and_or(pred, gold): + def _extract(conds): + """extract condition and/or""" + op_set = set() + for i in range(1, len(conds) - 1, 2): + left = conds[i - 1][:3] + right = conds[i + 1][:3] + left, right = list(sorted([left, right])) + op_set.add(f"{left}{conds[i].lower()}{right}") + return op_set + + # eval where and/or + pred_op_set = _extract(pred["where"]) + gold_op_set = _extract(gold["where"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + # eval having and/or + pred_op_set = _extract(pred["having"]) + gold_op_set = _extract(gold["having"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + return [1, 1, 1] + + +def get_nestedSQL(sql): + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + ## + for from_nest_sql in [table_unit[1] for table_unit in sql["from"]["table_units"] if table_unit[0] == "sql"]: + nested.append(from_nest_sql) + + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, gold): + gold_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if gold is not None: + gold_total += 1 + if pred is not None and gold is not None: + cnt += Evaluator(g_db_schema_file, g_foreign_key_maps, g_eval_value).eval_exact_match(pred, gold) + return [gold_total, pred_total, cnt] + + +def eval_IUEN(pred, gold): + lt1, pt1, cnt1 = eval_nested(pred["intersect"], gold["intersect"]) + lt2, pt2, cnt2 = eval_nested(pred["except"], gold["except"]) + lt3, pt3, cnt3 = eval_nested(pred["union"], gold["union"]) + gold_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return [gold_total, pred_total, cnt] + + +def get_keywords(sql): + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + # TODO + cond_units = 
sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, gold): + pred_keywords = get_keywords(pred) + gold_keywords = get_keywords(gold) + pred_total = len(pred_keywords) + gold_total = len(gold_keywords) + cnt = 0 + + for k in pred_keywords: + if k in gold_keywords: + cnt += 1 + return [gold_total, pred_total, cnt] + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.id_map.values(): + if "." in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + if col_unit is None: + return col_unit + + agg_id, col_id = col_unit + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + return agg_id, col_id + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return [unit_op, col_unit1, col_unit2] + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap, eval_value=True): + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, dict): + col_unit_or_sql = rebuild_sql_col(valid_col_units, col_unit_or_sql, kmap, eval_value) + elif isinstance(col_unit_or_sql, tuple): # useless + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap, eval_value): + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is dict: + rebuild_sql_col(valid_col_units, val1, kmap, eval_value) + if not eval_value: + if type(val1) is not dict: + val1 = "1" + if type(val2) is not dict and val2 is not None: + val2 = "2" + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return [not_op, op_id, val_unit, val1, val2] + + +def rebuild_condition_col(valid_col_units, condition, kmap, eval_value): + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap, eval_value) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + if sel is None: + return sel + new_list = [] + for it in sel: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + return new_list + + +def rebuild_from_col(valid_col_units, from_, kmap, eval_value=True): + if from_ is None: + return from_ + + fn_proc = lambda x: rebuild_table_unit_col(valid_col_units, x, kmap, eval_value) + from_["table_units"] = [fn_proc(table_unit) for table_unit in from_["table_units"]] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], 
kmap, True) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [(agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap)) for agg_id, val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap, eval_value): + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap, eval_value) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap, eval_value) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap, eval_value) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap, eval_value) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap, eval_value) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap, eval_value) + if not eval_value: + if sql["limit"] is None or int(sql["limit"]) <= 0: + sql["limit"] = 0 + else: + sql["limit"] = 1 + + return sql + + +def build_foreign_key_map(entry): + cols_orig = entry["column_names_original"] + tables_orig = entry["table_names_original"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." + c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + """keyset_in_list""" + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +if __name__ == "__main__": + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/linking_utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/linking_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..876f923301e0cb5df28319628e8c36d593aa5b35 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/linking_utils.py @@ -0,0 +1,558 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import itertools +import logging +import re + +import numpy as np +from text2sql.utils import text_utils + +# the max matching ngram +g_linking_ngrams_n = 5 + +STOPWORDS = set( + [ + "的", + "是", + ",", + "?", + "有", + "多少", + "哪些", + "我", + "什么", + "你", + "知道", + "啊", + "一下", + "吗", + "在", + "请问", + "或者", + "想", + "和", + "为", + "帮", + "那个", + "你好", + "这", + "了", + "并且", + "都", + "呢", + "呀", + "哪个", + "还有", + "这个", + "-", + "项目", + "我查", + "就是", + "它", + "要求", + "谁", + "了解", + "告诉", + "时候", + "个", + "能", + "那", + "人", + "问", + "中", + "可以", + "一共", + "哪", + "麻烦", + "叫", + "想要", + "《", + "》", + "分别", + ] +) + + +def clamp(value, abs_max): + """clamp value""" + value = max(-abs_max, value) + value = min(abs_max, value) + return value + + +class Relations(object): + """Docstring for Relations.""" + + def __init__( + self, + qq_max_dist=2, + cc_foreign_key=True, + cc_table_match=True, + cc_max_dist=2, + ct_foreign_key=True, + ct_table_match=True, + tc_table_match=True, + tc_foreign_key=True, + tt_max_dist=2, + tt_foreign_key=True, + merge_types=False, + sc_link=True, + cv_link=True, + ): + super(Relations, self).__init__() + + self.qq_max_dist = qq_max_dist + self.cc_foreign_key = cc_foreign_key + self.cc_table_match = cc_table_match + self.cc_max_dist = cc_max_dist + self.ct_foreign_key = ct_foreign_key + self.ct_table_match = ct_table_match + self.tc_table_match = tc_table_match + self.tc_foreign_key = tc_foreign_key + self.tt_max_dist = tt_max_dist + self.tt_foreign_key = tt_foreign_key + self.merge_types = merge_types + self.sc_link = sc_link + self.cv_link = cv_link + + self.relation_ids = {} + + def add_relation(name): + self.relation_ids[name] = len(self.relation_ids) + logging.debug("relation: %s --> %d", name, self.relation_ids[name]) + + # < TODO: add_relation('[UNK]') + + def add_rel_dist(name, max_dist): + for i in range(-max_dist, max_dist + 1): + add_relation((name, i)) + + add_rel_dist("qq_dist", qq_max_dist) + + add_relation("qc_default") + # if qc_token_match: + # add_relation('qc_token_match') + + add_relation("qt_default") + # if qt_token_match: + # add_relation('qt_token_match') + + add_relation("cq_default") + # if cq_token_match: + # add_relation('cq_token_match') + + add_relation("cc_default") + if cc_foreign_key: + add_relation("cc_foreign_key_forward") + add_relation("cc_foreign_key_backward") + if cc_table_match: + add_relation("cc_table_match") + add_rel_dist("cc_dist", cc_max_dist) + + add_relation("ct_default") + if ct_foreign_key: + add_relation("ct_foreign_key") + if ct_table_match: + add_relation("ct_primary_key") + add_relation("ct_table_match") + add_relation("ct_any_table") + + add_relation("tq_default") + # if cq_token_match: + # add_relation('tq_token_match') + + add_relation("tc_default") + if tc_table_match: + add_relation("tc_primary_key") + add_relation("tc_table_match") + add_relation("tc_any_table") + if tc_foreign_key: + add_relation("tc_foreign_key") + + add_relation("tt_default") + if tt_foreign_key: + add_relation("tt_foreign_key_forward") + add_relation("tt_foreign_key_backward") + add_relation("tt_foreign_key_both") + 
add_rel_dist("tt_dist", tt_max_dist) + + # schema linking relations + # forward_backward + if sc_link: + add_relation("qcCEM") + add_relation("cqCEM") + add_relation("qtTEM") + add_relation("tqTEM") + add_relation("qcCPM") + add_relation("cqCPM") + add_relation("qtTPM") + add_relation("tqTPM") + + if cv_link: + add_relation("qcNUMBER") + add_relation("cqNUMBER") + add_relation("qcTIME") + add_relation("cqTIME") + add_relation("qcCELLMATCH") + add_relation("cqCELLMATCH") + + if merge_types: + assert not cc_foreign_key + assert not cc_table_match + assert not ct_foreign_key + assert not ct_table_match + assert not tc_foreign_key + assert not tc_table_match + assert not tt_foreign_key + + assert cc_max_dist == qq_max_dist + assert tt_max_dist == qq_max_dist + + add_relation("xx_default") + self.relation_ids["qc_default"] = self.relation_ids["xx_default"] + self.relation_ids["qt_default"] = self.relation_ids["xx_default"] + self.relation_ids["cq_default"] = self.relation_ids["xx_default"] + self.relation_ids["cc_default"] = self.relation_ids["xx_default"] + self.relation_ids["ct_default"] = self.relation_ids["xx_default"] + self.relation_ids["tq_default"] = self.relation_ids["xx_default"] + self.relation_ids["tc_default"] = self.relation_ids["xx_default"] + self.relation_ids["tt_default"] = self.relation_ids["xx_default"] + + if sc_link: + self.relation_ids["qcCEM"] = self.relation_ids["xx_default"] + self.relation_ids["qcCPM"] = self.relation_ids["xx_default"] + self.relation_ids["qtTEM"] = self.relation_ids["xx_default"] + self.relation_ids["qtTPM"] = self.relation_ids["xx_default"] + self.relation_ids["cqCEM"] = self.relation_ids["xx_default"] + self.relation_ids["cqCPM"] = self.relation_ids["xx_default"] + self.relation_ids["tqTEM"] = self.relation_ids["xx_default"] + self.relation_ids["tqTPM"] = self.relation_ids["xx_default"] + if cv_link: + self.relation_ids["qcNUMBER"] = self.relation_ids["xx_default"] + self.relation_ids["cqNUMBER"] = self.relation_ids["xx_default"] + self.relation_ids["qcTIME"] = self.relation_ids["xx_default"] + self.relation_ids["cqTIME"] = self.relation_ids["xx_default"] + self.relation_ids["qcCELLMATCH"] = self.relation_ids["xx_default"] + self.relation_ids["cqCELLMATCH"] = self.relation_ids["xx_default"] + + for i in range(-qq_max_dist, qq_max_dist + 1): + self.relation_ids["cc_dist", i] = self.relation_ids["qq_dist", i] + self.relation_ids["tt_dist", i] = self.relation_ids["tt_dist", i] + + logging.info("relations num is: %d", len(self.relation_ids)) + + def __len__(self): + """size of relations + Returns: int + """ + return len(self.relation_ids) + + +RELATIONS = Relations() + + +# schema linking, similar to IRNet +def compute_schema_linking(tokens, db): + """schema linking""" + + def partial_match(x_list, y_list): + """check partial match""" + x_str = "".join(x_list) + y_str = "".join(y_list) + if x_str in STOPWORDS: + return False + if re.match("%s" % re.escape(x_str), y_str): + assert x_str in y_str + return True + else: + return False + + def exact_match(x_list, y_list): + """check exact match""" + x, y = x_list, y_list + if type(x) is list: + x = "".join(x) + if type(y) is list: + y = "".join(y) + return x == y + + def set_q_relation(q_match_dict, q_start, q_match_len, other_id, relation_tag, force=True): + """set match relation for question""" + for q_id in range(q_start, q_start + q_match_len): + key = f"{q_id},{other_id}" + if not force and key in q_match_dict: + continue + q_match_dict[key] = relation_tag + + columns = [x.name for x in db.columns] + 
tables = [x.name for x in db.tables] + + q_col_match = dict() + q_tab_match = dict() + + col_id2list = dict() + for col_id, col_item in enumerate(columns): + col_id2list[col_id] = col_item + + tab_id2list = dict() + for tab_id, tab_item in enumerate(tables): + tab_id2list[tab_id] = tab_item + + # 5-gram + n = g_linking_ngrams_n + while n > 0: + for i, n_gram_list in enumerate(text_utils.ngrams(tokens, n)): + if len("".join(n_gram_list).strip()) == 0: + continue + # exact match case + for col_id, col in col_id2list.items(): + if exact_match(n_gram_list, col): + set_q_relation(q_col_match, i, n, col_id, "CEM") + for tab_id, tab in tab_id2list.items(): + if exact_match(n_gram_list, tab): + set_q_relation(q_tab_match, i, n, tab_id, "TEM") + + # partial match case + for col_id, col in col_id2list.items(): + if partial_match(n_gram_list, col): + set_q_relation(q_col_match, i, n, col_id, "CPM", force=False) + for tab_id, tab in tab_id2list.items(): + if partial_match(n_gram_list, tab): + set_q_relation(q_tab_match, i, n, tab_id, "TEM", force=False) + n -= 1 + return {"q_col_match": q_col_match, "q_tab_match": q_tab_match} + + +def compute_cell_value_linking(tokens, db): + """cell-value linking""" + + def isnumber(word): + """check if input is a number""" + try: + float(word) + return True + except Exception: + return False + + def check_cell_match(word, cells): + """check if word partial/exact match one of values""" + for cell in cells: + if word in cell: + return True + return False + + num_date_match = {} + cell_match = {} + + for q_id, word in enumerate(tokens): + if len(word.strip()) == 0: + continue + if word in STOPWORDS: + continue + + num_flag = isnumber(word) + for col_id, column in enumerate(db.columns): + # word is number + if num_flag: + if column.dtype in ("number", "real", "time"): # TODO fine-grained date + rel = "NUMBER" if column.dtype == "real" else column.dtype.upper() + num_date_match[f"{q_id},{col_id}"] = rel + elif column.dtype.lower() == "binary": # binary condition should use special process + continue + elif check_cell_match(word, column.cells): + cell_match[f"{q_id},{col_id}"] = "CELLMATCH" + + cv_link = {"num_date_match": num_date_match, "cell_match": cell_match} + return cv_link + + +def _table_id(db, col): + if col == 0: + return None + else: + return db.columns[col].table.id + + +def _foreign_key_id(db, col): + foreign_col = db.columns[col].foreign_key_for + if foreign_col is None: + return None + return foreign_col.id + + +def _match_foreign_key(db, col, table): + foreign_key_id = _foreign_key_id(db, col) + if foreign_key_id is None: + return None + return table == _table_id(db, foreign_key_id) + + +def build_relation_matrix(other_links, total_length, q_length, c_length, c_boundaries, t_boundaries, db): + """build relation matrix""" + sc_link = other_links.get("sc_link", {"q_col_match": {}, "q_tab_match": {}}) + cv_link = other_links.get("cv_link", {"num_date_match": {}, "cell_match": {}}) + + # Catalogue which things are where + loc_types = {} + for i in range(q_length): + loc_types[i] = ("question",) + + c_base = q_length + for c_id, (c_start, c_end) in enumerate(zip(c_boundaries, c_boundaries[1:])): + for i in range(c_start + c_base, c_end + c_base): + loc_types[i] = ("column", c_id) + t_base = q_length + c_length + for t_id, (t_start, t_end) in enumerate(zip(t_boundaries, t_boundaries[1:])): + for i in range(t_start + t_base, t_end + t_base): + loc_types[i] = ("table", t_id) + + relations = np.zeros((total_length, total_length), dtype=np.int64) + for i, j in 
itertools.product(range(total_length), repeat=2): + + def _set_relation(name): + """set relation for position (i, j)""" + relations[i, j] = RELATIONS.relation_ids[name] + + def _get_qc_links(q_id, c_id): + """get link relation of q and col""" + coord = "%d,%d" % (q_id, c_id) + if coord in sc_link["q_col_match"]: + return sc_link["q_col_match"][coord] + elif coord in cv_link["cell_match"]: + return cv_link["cell_match"][coord] + elif coord in cv_link["num_date_match"]: + return cv_link["num_date_match"][coord] + return "_default" + + def _get_qt_links(q_id, c_id): + """get link relation of q and tab""" + coord = "%d,%d" % (q_id, c_id) + if coord in sc_link["q_tab_match"]: + return sc_link["q_tab_match"][coord] + else: + return "_default" + + try: + i_type, j_type = loc_types[i], loc_types[j] + except Exception as e: + logging.error( + f"loc_types: {loc_types}. c_boundaries: {c_boundaries}." + + f"i, j, total_length and q_length: {i}, {j}, {total_length}, {q_length}" + ) + raise e + + if i_type[0] == "question": + # relation of question-to-* + if j_type[0] == "question": # relation qq + _set_relation(("qq_dist", clamp(j - i, RELATIONS.qq_max_dist))) + elif j_type[0] == "column": # relation qc + j_real = j_type[1] + rel = _get_qc_links(i, j_real) + _set_relation("qc" + rel) + elif j_type[0] == "table": # relation qt + j_real = j_type[1] + rel = _get_qt_links(i, j_real) + _set_relation("qt" + rel) + elif i_type[0] == "column": + # relation of column-to-* + if j_type[0] == "question": # relation cq + i_real = i_type[1] + rel = _get_qc_links(j, i_real) + _set_relation("cq" + rel) + elif j_type[0] == "column": # relation cc + col1, col2 = i_type[1], j_type[1] + if col1 == col2: + _set_relation(("cc_dist", clamp(j - i, RELATIONS.cc_max_dist))) + else: + _set_relation("cc_default") + # TODO: foreign keys and table match + if RELATIONS.cc_foreign_key: + if _foreign_key_id(db, col1) == col2: + _set_relation("cc_foreign_key_forward") + if _foreign_key_id(db, col2) == col1: + _set_relation("cc_foreign_key_backward") + if RELATIONS.cc_table_match and _table_id(db, col1) == _table_id(db, col2): + _set_relation("cc_table_match") + elif j_type[0] == "table": # relation ct + col, table = i_type[1], j_type[1] + _set_relation("ct_default") + if RELATIONS.ct_foreign_key and _match_foreign_key(db, col, table): + _set_relation("ct_foreign_key") + if RELATIONS.ct_table_match: + col_table = _table_id(db, col) + if col_table == table: + if col in db.columns[col].table.primary_keys_id: + _set_relation("ct_primary_key") + else: + _set_relation("ct_table_match") + elif col_table is None: + _set_relation("ct_any_table") + elif i_type[0] == "table": + # relation of table-to-* + if j_type[0] == "question": + i_real = i_type[1] + rel = _get_qt_links(j, i_real) + _set_relation("tq" + rel) + elif j_type[0] == "column": + table, col = i_type[1], j_type[1] + _set_relation("tc_default") + + if RELATIONS.tc_foreign_key and _match_foreign_key(db, col, table): + _set_relation("tc_foreign_key") + if RELATIONS.tc_table_match: + col_table = _table_id(db, col) + if col_table == table: + if col in db.columns[col].table.primary_keys_id: + _set_relation("tc_primary_key") + else: + _set_relation("tc_table_match") + elif col_table is None: + _set_relation("tc_any_table") + elif j_type[0] == "table": + table1, table2 = i_type[1], j_type[1] + if table1 == table2: + _set_relation(("tt_dist", clamp(j - i, RELATIONS.tt_max_dist))) + else: + _set_relation("tt_default") + if RELATIONS.tt_foreign_key: + forward = table2 in 
db.tables[table1].foreign_keys_tables + backward = table1 in db.tables[table2].foreign_keys_tables + if forward and backward: + _set_relation("tt_foreign_key_both") + elif forward: + _set_relation("tt_foreign_key_forward") + elif backward: + _set_relation("tt_foreign_key_backward") + + return relations + + +if __name__ == "__main__": + """run some simple test cases""" + q = "帮 我 查 一 下 大众 帕 萨 特 的 轴距 和 能源 类型 分别 是 什么 , 叫 什么 名 ?".split(" ") + for i, tok in enumerate(q): + print(i, tok) + # header = Header(['名称', '品牌', '轴距', '能源类型'], ['text', 'text', 'real', 'text']) + # print(header.names) + # print(compute_schema_linking(q, header)) + + # q = '帮 我 查 一 下 大众 轴距 大于 10 米 的 车 能源 类型 分别 是 什么 ?'.split(' ') + # for i, tok in enumerate(q): + # print(i, tok) + # rows = [['帕萨特', '大众', '10', '汽油车'], + # ['伊兰特', '现代', '10', '汽油车'], + # ['GL8', '别克', '10', '汽油车']] + # table = Table('tid1', 'tname', 'title', header, rows) + # print(compute_cell_value_linking(q, table)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/metrics.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..bc5b616a6ed2e677678210d2d4d17642e166a995 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/metrics.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os + +from text2sql.dataproc import sql_label +from text2sql.utils import dusql_evaluation, nn_utils + + +class MetricSimpleSQLAcc(object): + """SimpleSQLAccMetric.""" + + def __init__(self, eval_value=False): + """init of class + + Args: + eval_value (TYPE): Default is False + + """ + super(MetricSimpleSQLAcc, self).__init__() + + self._eval_value = eval_value + self._gold_list = [] + self._pred_list = [] + self._correctness = [] + + def update(self, labels, predicts): + preds = nn_utils.tensor2numpy(predicts) + pred_sqls = sql_label.decode_sqls(preds, labels["header_lens"], None, labels["limit_nums"]) + self._gold_list.extend(labels["gold_sqls"]) + self._pred_list.extend(pred_sqls) + + def calc(self): + conn_correct = 0 + sel_col_agg_correct = 0 + conds_correct = 0 + conds_col_correct = 0 + conds_col_op_correct = 0 + all_correct = 0 + sel_num_correct = 0 + cond_num_correct = 0 + order_correct = 0 + limit_correct = 0 + group_correct = 0 + having_correct = 0 + num_queries = len(self._gold_list) + self._correctness.clear() + for pred_sql, true_sql in zip(self._pred_list, self._gold_list): + n_correct = 0 + if pred_sql["sel_num"] == true_sql.sel_num: + sel_num_correct += 1 + if pred_sql["cond_num"] == len(true_sql.conds) + len(true_sql.having): + cond_num_correct += 1 + if pred_sql["cond_conn_op"] == true_sql.cond_conn_op: + conn_correct += 1 + n_correct += 1 + pred_aggs = set(zip(pred_sql["sel"], pred_sql["agg"])) + true_aggs = set(zip(true_sql.sel, true_sql.agg)) + if pred_aggs == true_aggs: + sel_col_agg_correct += 1 + n_correct += 1 + pred_conds = set([(cond[0], cond[1], cond[2]) for cond in pred_sql["conds"]]) + if not self._eval_value: + true_conds_tmp = [(cond[0], cond[1], None) for cond in true_sql.conds] + else: + true_conds_tmp = [(cond[0], cond[1], cond[2]) for cond in true_sql.conds] + true_conds = set(true_conds_tmp) + if pred_conds == true_conds: + conds_correct += 1 + n_correct += 1 + pred_conds_col = set([cond[0] for cond in pred_sql["conds"]]) + true_conds_col = set([cond[0] for cond in true_sql["conds"]]) + if pred_conds_col == true_conds_col: + conds_col_correct += 1 + pred_conds_col_op = set([(cond[0], cond[1]) for cond in pred_sql["conds"]]) + true_conds_col_op = set([(cond[0], cond[1]) for cond in true_sql["conds"]]) + if pred_conds_col_op == true_conds_col_op: + conds_col_op_correct += 1 + + pred_order_direc = pred_sql["order_direction"] + true_order_direc = true_sql["order_direction"] + pred_order_by = pred_sql["order_by"] + true_order_by = true_sql["order_by"] + if pred_order_direc == true_order_direc and pred_order_by == true_order_by: + n_correct += 1 + order_correct += 1 + + pred_limit = pred_sql["limit"] + true_limit = true_sql["limit"] + if pred_limit == true_limit: + n_correct += 1 + limit_correct += 1 + + pred_group_by = pred_sql["group_by"] + true_group_by = true_sql["group_by"] + if pred_group_by == true_group_by: + n_correct += 1 + group_correct += 1 + + pred_having = [list(x) for x in pred_sql["having"]] + true_having = [list(x) for x in true_sql["having"]] + if not self._eval_value: + true_having = [x[:-1] + [None] for x in true_having] + if pred_having == true_having: + n_correct += 1 + having_correct += 1 + + if n_correct == 7: + all_correct += 1 + self._correctness.append(1) + else: + self._correctness.append(0) + + self._acc = all_correct / num_queries + self._sub_task_acc = { + "sel_num": sel_num_correct / num_queries, + "sel_col_agg": sel_col_agg_correct / num_queries, + "cond_num": cond_num_correct / num_queries, + 
"cond_conn": conn_correct / num_queries, + "where_conds": conds_correct / num_queries, + "where_col": conds_col_correct / num_queries, + "where_col_op": conds_col_op_correct / num_queries, + } + self._sub_task_acc.update( + { + "order_by": order_correct / num_queries, + "limit": limit_correct / num_queries, + } + ) + self._sub_task_acc.update( + { + "group_by": group_correct / num_queries, + "having": having_correct / num_queries, + } + ) + return self._acc, self._sub_task_acc + + def save(self, save_dir, file_tag): + """ + + Args: + save_dir (TYPE): NULL + file_tag (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if "{acc}" in file_tag or "{acc:" in file_tag: + file_tag = file_tag.format(acc=self._acc) + file_path = os.path.join(save_dir, file_tag) + if not os.path.isdir(os.path.dirname(file_path)): + os.mkdir(os.path.dirname(file_path)) + with open(file_path, "w") as ofs: + for pred, correct in zip(self._pred_list, self._correctness): + pred["correct"] = correct + ofs.write(json.dumps(pred, ensure_ascii=False) + "\n") + + def __str__(self): + """ + Returns: TODO + + Raises: NULL + """ + return f"acc {self._acc * 100:.2f}, sub tasks {self._sub_task_acc}" + + +class MetricDuSQLAcc(object): + """Acc Metric for DuSQL like dataset""" + + def __init__(self, dataset, eval_value=True): + """init""" + super(MetricDuSQLAcc, self).__init__() + self.dataset = dataset + self.eval_value = eval_value + + self.foreign_key_maps = { + db_id: dusql_evaluation.build_foreign_key_map(db.orig) for db_id, db in self.dataset.db_dict.items() + } + self.evaluator = dusql_evaluation.Evaluator( + self.dataset.db_schema_file, self.foreign_key_maps, eval_value=self.eval_value + ) + self.results = [] + + def update(self, item, inferred_code): + """update one instance""" + sql_query = item.orig["query"] if "query" in item.orig else item.orig["sql_query"] + ret_dict = self.evaluator.evaluate_one(item.db.db_id, sql_query, inferred_code) + ret_dict["db_id"] = item.orig["db_id"] + ret_dict["question"] = item.orig["question"] + self.results.append(ret_dict) + + def udpate_beams(self, item, inferred_codes, orig_question=None): + """update one instance beam""" + beam_dict = {} + if orig_question: + beam_dict["orig_question"] = orig_question + for i, code in enumerate(inferred_codes): + ret_dict = self.evaluator.evaluate_one(item.db.db_id, item.orig["query"], code) + beam_dict[i] = ret_dict + if ret_dict["exact"] is True: + break + self.results.append(beam_dict) + + def finalize(self): + """finalize""" + self.evaluator.finalize() + return {"per_item": self.results, "total_scores": self.evaluator.scores} + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/nn_utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/nn_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..02d04743d52ca028bcb6d3d1f31f5331b2fb76e2 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/nn_utils.py @@ -0,0 +1,199 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle +from paddle import nn + + +def build_linear(n_in, n_out, name=None, init=None): + return nn.Linear( + n_in, + n_out, + weight_attr=paddle.ParamAttr(name="%s.w_0" % name if name is not None else None, initializer=init), + bias_attr="%s.b_0" % name if name is not None else None, + ) + + +def build_layer_norm(n_in, name): + return nn.LayerNorm( + normalized_shape=n_in, + weight_attr=paddle.ParamAttr( + name="%s_layer_norm_scale" % name if name is not None else None, initializer=nn.initializer.Constant(1.0) + ), + bias_attr=paddle.ParamAttr( + name="%s_layer_norm_bias" % name if name is not None else None, initializer=nn.initializer.Constant(0.0) + ), + ) + + +def lstm_init(num_layers, hidden_size, *batch_sizes): + init_size = batch_sizes + (hidden_size,) + if num_layers is not None: + init_size = (num_layers,) + init_size + init = paddle.zeros(init_size) + return (init, init) + + +def batch_gather_2d(var, indices): + """Gather slices from var in each batch, according to corresponding + index in indices. Currently, it only support 2d Tensor. + + Args: + var (Variable): with shape [batch_size, ...] + indices (Variable): with shape [batch_size, max_len] + + Returns: Variable with shape [batch_size] + + Raises: NULL + + Examples: + var + [[1, 2, 3], + [4, 5, 6]] + indices + [[2, 0], [1, 2]] + + return + [[3, 1], [5, 6]] + + """ + if len(indices.shape) != 2: + raise ValueError( + "shape of indices error. it should be a 2-D layers. " "but got shape = %s" % (str(indices.shape),) + ) + + batch_size = paddle.shape(indices)[0] + + zero = paddle.to_tensor([0], dtype="int64") + one = paddle.to_tensor([1], dtype="int64") + end = paddle.cast(batch_size, dtype="int64") + batch_indices_1d = paddle.unsqueeze(paddle.arange(zero, end, one, dtype=indices.dtype), [1]) + + seq_len = indices.shape[1] + batch_indices = paddle.expand(batch_indices_1d, [batch_size, seq_len]) + + coord_2d = paddle.concat([paddle.unsqueeze(batch_indices, [2]), paddle.unsqueeze(indices, [2])], axis=2) + coord_2d.stop_gradient = True + coord_1d = paddle.reshape(coord_2d, shape=[-1, 2]) + output_1d = paddle.gather_nd(var, coord_1d) + output_2d = paddle.reshape(output_1d, [batch_size, seq_len, var.shape[-1]]) + return output_2d + + +def sequence_mask(seq_hidden, mask, mode="zero"): + """ + + Args: + seq_hidden (Tensor): NULL + mask (Tensor): 1 for un-mask tokens, and 0 for mask tokens. + mode (str): zero/-inf/+inf + + Returns: TODO + + Raises: NULL + """ + + while len(mask.shape) < len(seq_hidden.shape): + mask = mask.unsqueeze([-1]) + + mask = mask.cast(dtype=seq_hidden.dtype) + masked = paddle.multiply(seq_hidden, mask) + if mode == "zero": + return masked + + if mode == "-inf": + scale_size = +1e5 + elif mode == "+inf": + scale_size = -1e5 + else: + raise ValueError(f"mask mode setting error. 
expect zero/-inf/+inf, but got {mode}") + + add_mask = paddle.scale(mask - 1, scale=scale_size) + masked = paddle.add(masked, add_mask) + return masked + + +def pad_sequences(seqs, max_len, value=0.0, dtype=np.int64): + """padding sequences""" + data_max_len = 0 + format_seqs = [] + for seq in seqs: + format_seqs.append(list(seq)) + data_max_len = len(seq) if len(seq) > data_max_len else data_max_len + max_len = min(max_len, data_max_len) + padded = [] + for seq in format_seqs: + padded.append(seq[:max_len] + [value] * (max_len - len(seq))) + padded = np.array(padded) + return padded.astype(dtype) + + +def pad_sequences_for_3d(seqs, max_col, max_num, dtype=np.int64): + """padding sequences for 3d""" + padded = [] + for seq in seqs: + padded.append(np.vstack((seq, np.zeros((max_col - seq.shape[0], max_num), dtype=np.int64)))) + return np.array(padded).astype(dtype) + + +def pad_index_sequences(seqs, max_col, max_row, dtype=np.int64): + """padding sequences for column token indexes""" + padded = [] + for query in seqs: + new_cols = [] + for col in query[:max_row]: + temp_cols = col[:max_col] + [0] * (max_col - len(col)) + new_cols.append(temp_cols) + new_cols = new_cols + [[0] * max_col for _ in range(max_row - len(new_cols))] + padded.append(new_cols) + return np.array(padded).astype(dtype) + + +def tensor2numpy(inputs): + if type(inputs) in (list, tuple): + return [x.numpy() for x in inputs] + elif type(inputs) is dict: + outputs = {} + for key, value in inputs.items(): + if type(value) is paddle.Tensor: + outputs[key] = value.numpy() + else: + outputs[key] = value + return outputs + elif type(inputs) is paddle.Tensor: + return inputs.numpy() + else: + raise ValueError("only support inputs to be of type list/tuple/dict/Tensor." + f"but got {type(inputs)}") + + +if __name__ == "__main__": + """run some simple test cases""" + seq_input = paddle.to_tensor( + [ + [1, 2, 3, 4], + [5, 5, 5, 5], + ], + dtype="float32", + ) + mask = paddle.to_tensor( + [ + [1, 1, 0, 0], + [1, 1, 1, 0], + ], + dtype="float32", + ) + + print(sequence_mask(seq_input, mask, mode="zero")) + print(sequence_mask(seq_input, mask, mode="-inf")) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/serialization.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/serialization.py new file mode 100644 index 0000000000000000000000000000000000000000..1dd638d5319bff1c3cefd120efc211337e0d2c3c --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/serialization.py @@ -0,0 +1,39 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +def to_dict_with_sorted_values(d, key=None): + """to dict with sorted values""" + return {k: sorted(v, key=key) for k, v in d.items()} + + +def to_dict_with_set_values(d): + """to dict with set values""" + result = {} + for k, v in d.items(): + hashable_v = [] + for v_elem in v: + if isinstance(v_elem, list): + hashable_v.append(tuple(v_elem)) + else: + hashable_v.append(v_elem) + result[k] = set(hashable_v) + return result + + +def tuplify(x): + """tuplify""" + if not isinstance(x, (tuple, list)): + return x + return tuple(tuplify(elem) for elem in x) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/text_utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/text_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8273bcb6b1d207378b55e388bcebd98463e8e751 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/text_utils.py @@ -0,0 +1,424 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re +from collections import defaultdict +from statistics import mean + +import cn2an +from LAC import LAC + +g_max_candi_value = 5 +g_date_patt = re.compile(r"(([0-9]{2})[0-9]{2}-)?(0?[1-9]|1[012])-[0123][0-9]") +g_date_patt2 = re.compile(r"(([0-9]{2})[0-9]{2}年)?[0-9]{1,2}月[0-9]{2}[号日]|([0-9]{2})[0-9]{2}年[0-9]{1,2}月") + +g_lac_seg = LAC(mode="seg") +g_lac_lac = LAC(mode="lac") + +wordseg = lambda sentence: g_lac_seg.run(sentence) +lac = lambda sentence: g_lac_lac.run(sentence) + +# LAC Tags +# 标签 含义 标签 含义 标签 含义 标签 含义 +# n 普通名词 f 方位名词 s 处所名词 nw 作品名 +# nz 其他专名 v 普通动词 vd 动副词 vn 名动词 +# a 形容词 ad 副形词 an 名形词 d 副词 +# m 数量词 q 量词 r 代词 p 介词 +# c 连词 u 助词 xc 其他虚词 w 标点符号 +# PER 人名 LOC 地名 ORG 机构名 TIME 时间 +g_ner_tag_mapping = { + "LOC": "LOC", + "TIME": "TIME", + "PER": "PER", + "m": "NUM", +} +EMPTY_TAG = "o" + + +def ner(sentence): + """wordseg and ner + + Args: + sentence (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + results = lac(sentence) + words = results[0] + tags_tmp = results[1] + tags = [] + for tag in tags_tmp: + tags.append(g_ner_tag_mapping.get(tag, EMPTY_TAG)) + return (words, tags) + + +def ngrams(tok_list, n): + """generate n-grams from tok_list + + Args: + tok_list (TYPE): NULL + n (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + for pos in range(len(tok_list) - n + 1): + yield tok_list[pos : pos + n] + + +def remove_brackets(s): + """Remove brackets [] () from text""" + return re.sub(r"[\(\(].*[\)\)]", "", s) + + +def is_float(value): + """is float""" + try: + float(value) + return True + except ValueError: + return False + except TypeError: + return False + + +def cn_to_an(string): + """cn to an""" + try: + return str(cn2an.cn2an(string, "normal")) + except ValueError: + return string + + +def an_to_cn(string): + """an to cn""" + try: + return str(cn2an.an2cn(string)) + except ValueError: + return string + + +def str_to_num(string): + """str to num""" + try: + float_val = float(cn_to_an(string)) + if int(float_val) == float_val: + return str(int(float_val)) + else: + 
return str(float_val) + except ValueError: + return None + + +def str_to_year(string): + """str to year""" + year = string.replace("年", "") + year = cn_to_an(year) + if is_float(year) and float(year) < 1900: + year = int(year) + 2000 + return str(year) + else: + return None + + +class CandidateValueExtractor: + """ + params: + """ + + CN_NUM = "〇一二三四五六七八九零壹贰叁肆伍陆柒捌玖貮两1234567890" + CN_UNIT = "十拾百佰千仟万萬亿億兆点." + + @classmethod + def norm_unit(cls, rows, col_id, values): + """norm unit""" + l = [] + for row in rows: + if isinstance(row[col_id], str) or row[col_id] is None: + return None + l.append(len(str(int(row[col_id])))) + mean_len = round(mean(l) + 0.5) + + new_values = set() + for value in values: + flag = False + if value.isdigit(): + str_value = str(value) + diff = len(str_value) - mean_len + if diff > 2: + tail_str = str_value[-1 * diff :] + if tail_str.count("0") == len(tail_str): + new_values.add(str_value[:mean_len]) + new_values.add(value) + flag = True + if not flag: + new_values.add(value) + return list(new_values) + + @classmethod + def search_values(cls, question, table): + """search candidate cells from table, that will be used as sql values + + Args: + question_words (list): NULL + question_tags (list): NULL + table (Table): NULL + + Returns: TODO + + Raises: NULL + """ + # 提取年份和数字 + value_in_question = cls.extract_values_from_text(question) + all_candidate = [] + for col_id in range(len(table.header)): + header = table.header[col_id] + # 提取col出现在question中的cell + # TODO 这里存在一个问题,一个text类型cell必须完全在question中出现才会被当做候选cell + value_in_column = cls.extract_values_from_column(question, table, col_id, header.type) + if header.type == "text": + candi_values = value_in_column + elif header.type == "real": + norm_unit_res = cls.norm_unit(table.rows, col_id, value_in_question) + if norm_unit_res is not None: + value_in_question = norm_unit_res + candi_values = value_in_question + if len(candi_values) >= g_max_candi_value: + candi_values = candi_values[:g_max_candi_value] + else: + st_candi_values = set(candi_values) + for v in value_in_column: + if v in st_candi_values: + continue + st_candi_values.add(v) + if len(st_candi_values) >= g_max_candi_value: + break + candi_values = list(st_candi_values) + all_candidate.append(candi_values) + return all_candidate + + # 19年 or 一九年 will be replaced to 2019年 + @classmethod + def extract_year_from_text(cls, text): + """extract year from text""" + values = [] + # FIXME trick: yrs is from 2000 + num_year_texts = re.findall(r"[0-9][0-9]年", text) + values += ["20{}".format(text[:-1]) for text in num_year_texts] + cn_year_texts = re.findall(r"[{}][{}]年".format(cls.CN_NUM, cls.CN_NUM), text) + cn_year_values = [str_to_year(text) for text in cn_year_texts] + values += [value for value in cn_year_values if value is not None] + return values + + @classmethod + def extract_date_from_text(cls, text): + """ + + Args: + text (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + date_values = [] + unmatched_spans = [] + base = 0 + while base < len(text): + res = re.search(g_date_patt, text[base:]) + if res is None: + unmatched_spans.append(text[base:]) + else: + start, end = res.span() + unmatched_spans.append(text[base:start]) + unmatched_spans.append(text[end:]) + base = end + date_values.append(text[start:end]) + return date_values, list(filter(lambda x: x.strip() != "", unmatched_spans)) + + @classmethod + def extract_num_from_text(cls, text): + """extract num from text""" + values = [] + # 1. 
all digital number + num_values = re.findall(r"[-+]?[0-9]*\.?[0-9]+", text) + values += num_values + # 2. include chinese word + cn_num_unit = cls.CN_NUM + cls.CN_UNIT + cn_num_texts = re.findall(r"[{}]*\.?[{}]+".format(cn_num_unit, cn_num_unit), text) + + cn_num_values = [str_to_num(text) for text in cn_num_texts] + values += [value for value in cn_num_values if value is not None] + # 3. both number and chinese word + cn_num_mix = re.findall(r"[0-9]*\.?[{}]+".format(cls.CN_UNIT), text) + for word in cn_num_mix: + num = re.findall(r"[-+]?[0-9]*\.?[0-9]+", word) + for n in num: + word = word.replace(n, an_to_cn(n)) + str_num = str_to_num(word) + if str_num is not None: + values.append(str_num) + return values + + @classmethod + def extract_values_from_text(cls, text): + """extract values from text""" + values = [] + values += cls.extract_year_from_text(text) + values_tmp, unmatched_spans = cls.extract_date_from_text(text) + values.extend(values_tmp) + for span in unmatched_spans: + values += cls.extract_num_from_text(span) + return list(set(values)) + + @classmethod + def extract_values_from_column(cls, question, table, col_id, col_type): + """extract values from column""" + if col_type == "text": + base_threshold = 0 + else: + base_threshold = 2 + + value_score = table.search(question, col_id) + value_score_filter = list(filter(lambda x: x[1] > base_threshold, value_score)) + if len(value_score_filter) == 0: + return [] + + value_score_filter.sort(key=lambda x: x[1], reverse=True) + # if col_type == 'text' \ + # and len(value_score_filter) > g_max_candi_value \ + # and value_score_filter[g_max_candi_value][1] == value_score_filter[0][1]: + # value_score_filter_tmp = value_score_filter[:50] + # tmp_score = value_score_filter[g_max_candi_value][1] + # select_col_values = [x[0] for x in value_score_filter_tmp if x[1] >= tmp_score] + # else: + # select_col_values = [x[0] for x in value_score_filter[:g_max_candi_value]] + select_col_values = [x[0] for x in value_score_filter[:g_max_candi_value]] + + return select_col_values + + +def re_search(patt, text): + """ + + Args: + patt (TYPE): NULL + text (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + lst_result = [] + pos = 0 + while True: + match = re.search(patt, text[pos:]) + if match is None: + break + lst_result.append((match.start() + pos, match.end() + pos)) + pos = pos + match.end() + 1 + + return lst_result + + +CN_NUM = "〇一二三四五六七八九零壹贰叁肆伍陆柒捌玖貮两1234567890" +CN_UNIT = "十拾百佰千仟万萬亿億兆点." 
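+# Module-level copies of the Chinese digit/unit alphabets used by CandidateValueExtractor;
+# the compiled patterns below reuse them to locate numeric spans in raw question text.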
+CN_NUM_UNIT = CN_NUM + CN_UNIT +PATT_NUM = re.compile(r"[-+]?[0-9]*\.?[0-9]+") +PATT_CN_NUM = re.compile(r"[{}]*\.?[{}]+".format(CN_NUM_UNIT, CN_NUM_UNIT)) +PATT_MIX_NUM = re.compile(r"[0-9]*\.?[{}]+".format(CN_UNIT)) + + +def _extract_num_span(text): + """extract number and mark their spans + + Args: + text (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + dct_start2end = defaultdict(set) + # digital number + spans = re_search(PATT_NUM, text) + for start, end in spans: + dct_start2end[start].add((end, text[start:end])) + + # chinese number + spans = re_search(PATT_CN_NUM, text) + for start, end in spans: + num = str_to_num(text[start:end]) + if num is None: + continue + dct_start2end[start].add((end, num)) + + # number, chinese + spans = re_search(PATT_MIX_NUM, text) + for start, end in spans: + orig_num = text[start:end] + for ar_num in re.findall(PATT_NUM, orig_num): + orig_num = orig_num.replace(ar_num, an_to_cn(ar_num)) + num = str_to_num(orig_num) + if num is not None: + dct_start2end[start].add((end, num)) + + lst_result = [] + for start, st_end_and_num in sorted(dct_start2end.items()): + lst_end, lst_num = list(zip(*st_end_and_num)) + end = max(lst_end) + if len(lst_result) == 0 or start > lst_result[-1][0][1]: + lst_result.append([(start, end), lst_num]) + continue + last_start, last_end = lst_result[-1][0] + if end - start > last_end - last_start: + lst_result.pop(-1) + lst_result.append([(start, end), lst_num]) + else: + pass + + return lst_result + + +def wordseg_and_extract_num(text): + lst_span_and_nums = _extract_num_span(text) + lst_words = [] + pos = 0 + lst_nums = [] + lst_nums_index = [] + for span, nums in lst_span_and_nums: + start, end = span + lst_words.extend(wordseg(text[pos:start])) + lst_nums.extend(nums) + lst_nums_index.extend([len(lst_words)] * len(nums)) + lst_words.append(text[start:end]) + pos = end + + if pos < len(text): + lst_words.extend(wordseg(text[pos:])) + + return lst_words, lst_nums, lst_nums_index + + +if __name__ == "__main__": + """run some simple test cases""" + lst_token = ["hello", ",", "I", "am", "Li", "Lei", "."] + print(list(ngrams(lst_token, 2))) + print(list(ngrams(lst_token, 4))) + + text = "123年后,你好一百万不多,2百万不少" + print(wordseg_and_extract_num(text)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3de10ab34c618ff99d37ba0ac4d7bf0d4d94b84d --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/utils.py @@ -0,0 +1,104 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
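+# Misc helpers: a lightweight Timer for cost statistics, list/file utilities and a debug tensor printer.
+# Typical Timer usage (sketch): t = Timer(); ...; elapsed = t.check()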
+ +import logging +import time + + +class Timer(object): + """Stat Cost Time""" + + def __init__(self, msg=""): + super(Timer, self).__init__() + + self._msg = msg + self._start = time.time() + self._last = self._start + + def reset(self, only_last=False, msg=None): + """reset all setting""" + if msg is not None: + self._msg = msg + curr_time = time.time() + self._last = curr_time + if not only_last: + self._start = curr_time + + def check(self): + """check cost time from start""" + end = time.time() + cost = end - self._start + return cost + + def interval(self): + """check cost time from lst""" + end = time.time() + cost = end - self._last + self._last = end + return cost + + def ending(self): + """ending checking and log""" + cost = "%.2f" % time.time() - self._start + if self._msg == "": + log_msg = "cost time: %s" % (cost) + elif "{}" in self._msg: + log_msg = self._msg.format(cost) + else: + log_msg = self._msg + cost + + logging.info(log_msg) + + +def list_increment(lst: list, base: int): + """increment each element in list""" + for i in range(len(lst)): + lst[i] += base + return lst + + +def count_file_lines(filename): + cnt = 0 + with open(filename) as ifs: + for _ in ifs: + cnt += 1 + return cnt + + +def print_tensors(tag="*", **kwargs): + """print tensors for debugging""" + print(tag * 50) + for key, value in kwargs.items(): + print(key, ":", value) + + +if __name__ == "__main__": + """run some simple test cases""" + from boomup import data_struct + + question = "三峡碧江需要大于2的招聘数量" + table_json = { + "rows": [ + [4.0, "污水运行工", "三峡碧江公司", "渝北", 2.0, "大专及以上", "给排水/环境工程/机电及相关专业", "sxswrlzyb@163.com"], + [5.0, "污水运行工", "三峡垫江公司", "垫江", 1.0, "大专及以上", "给排水/环境工程/机电及相关专业", "sxswrlzyb@163.com"], + ], + "name": "Table_a7b5108c3b0611e98ad7f40f24344a08", + "title": "", + "header": ["岗位序号", "招聘岗位", "用人单位", "工作地点", "招聘数量", "学历要求", "专业及资格要求", "简历投递邮箱"], + "common": "", + "id": "a7b510", + "types": ["real", "text", "text", "text", "real", "text", "text", "text"], + } + table_json["header"] = data_struct.Header(table_json["header"], table_json["types"]) + table = data_struct.Table(**table_json) diff --git a/examples/text_to_sql/RAT-SQL/train.sh b/examples/text_to_sql/RAT-SQL/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..65f98c27abba52d886119885022af76f05a77ffa --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/train.sh @@ -0,0 +1,37 @@ +#!/bin/bash + +if [ $# -ge 1 ] && [ "$1" == "-h" ]; then + echo "usage:" + echo " $0 trainer_num output_path [main args]" + exit 0 +fi + +trainer_num=$1 +output_path=$2 +shift 2 +if [[ $trainer_num = cuda* ]]; then + cuda_devices=`echo $trainer_num | sed 's/cuda://'` + trainer_num=`echo $cuda_devices | awk -F',' '{print NF}'` +else + cuda_devices=`python script/available_gpu.py --best $trainer_num` +fi + +WORKROOT=$(cd $(dirname $0); pwd) +cd $WORKROOT + +#### paddle #### +# 选择要使用的GPU +export CUDA_VISIBLE_DEVICES=$cuda_devices +# CPU 核数 +export CPU_NUM=$trainer_num +#### python #### +export PYTHONPATH=$WORKROOT:$WORKROOT/third:$WORKROOT/third/ERNIE:$PYTHONPATH +echo "PYTHONPATH=$PYTHONPATH" + +echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" +[ -d $output_path ] || mkdir -p $output_path +[ -f $output_path/train.log ] && mv $output_path/train.log $output_path/train.log.`date +%Y%m%d_%H%M%S` +echo "running command: ($PYTHON_BIN $@ --output $output_path)" > $output_path/train.log +python ./script/text2sql_main.py $@ --mode train --output $output_path 2>&1 | tee -a $output_path/train.log +exit $? 
+ diff --git a/examples/torch_migration/README.md b/examples/torch_migration/README.md new file mode 100644 index 0000000000000000000000000000000000000000..603f040ed11467e011fd3eb01e1aaffd5ff15650 --- /dev/null +++ b/examples/torch_migration/README.md @@ -0,0 +1,62 @@ +# BERT-SST2-Prod +Reproduction process of BERT on SST2 dataset + +# 安装说明 + +* 下载代码库 + +```shell +git clone https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/torch_migration +``` + +* 进入文件夹,安装requirements + +```shell +pip install -r requirements.txt +``` + +* 安装PaddlePaddle与PyTorch + +```shell +# CPU版本的PaddlePaddle +pip install paddlepaddle==2.2.0 -i https://mirror.baidu.com/pypi/simple +# 如果希望安装GPU版本的PaddlePaddle,可以使用下面的命令 +# pip install paddlepaddle-gpu==2.2.0.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html +# 安装PyTorch +pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html +``` + +**注意**: 本项目依赖于paddlepaddle-2.2.0版本,安装时需要注意。 + +* 验证PaddlePaddle是否安装成功 + +运行python,输入下面的命令。 + +```shell +import paddle +paddle.utils.run_check() +print(paddle.__version__) +``` + +如果输出下面的内容,则说明PaddlePaddle安装成功。 + +``` +PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now. +2.2.0 +``` + + +* 验证PyTorch是否安装成功 + +运行python,输入下面的命令,如果可以正常输出,则说明torch安装成功。 + +```shell +import torch +print(torch.__version__) +# 如果安装的是cpu版本,可以按照下面的命令确认torch是否安装成功 +# 期望输出为 tensor([1.]) +print(torch.Tensor([1.0])) +# 如果安装的是gpu版本,可以按照下面的命令确认torch是否安装成功 +# 期望输出为 tensor([1.], device='cuda:0') +print(torch.Tensor([1.0]).cuda()) +``` diff --git a/examples/torch_migration/docs/ThesisReproduction_NLP.md b/examples/torch_migration/docs/ThesisReproduction_NLP.md new file mode 100644 index 0000000000000000000000000000000000000000..eee175d34a280b06fa2b40a11c4401c4ae5182be --- /dev/null +++ b/examples/torch_migration/docs/ThesisReproduction_NLP.md @@ -0,0 +1,928 @@ +# 论文复现指南 + +## 目录 + +- [1. 总览](#1) + - [1.1 背景](#1.1) + - [1.2 前序工作](#1.2) +- [2. 整体框图](#2) + - [2.1 流程概览](#2.1) + - [2.2 reprod_log whl包](#2.2) +- [3. 论文复现理论知识及实战](#3) + - [3.1 模型结构对齐](#3.1) + - [3.2 验证/测试集数据读取对齐](#3.2) + - [3.3 评估指标对齐](#3.3) + - [3.4 损失函数对齐](#3.4) + - [3.5 优化器对齐](#3.5) + - [3.6 学习率对齐](#3.6) + - [3.7 正则化策略对齐](#3.7) + - [3.8 反向对齐](#3.8) + - [3.9 训练集数据读取对齐](#3.9) + - [3.10 网络初始化对齐](#3.10) + - [3.11 模型训练对齐](#3.11) + - [3.12 单机多卡训练](#3.12) +- [4. 论文复现注意事项与FAQ](#4) + - [4.0 通用注意事项](#4.0) + - [4.1 模型结构对齐](#4.1) + - [4.2 验证/测试集数据读取对齐](#4.2) + - [4.3 评估指标对齐](#4.3) + - [4.4 损失函数对齐](#4.4) + - [4.5 优化器对齐](#4.5) + - [4.6 学习率对齐](#4.6) + - [4.7 正则化策略对齐](#4.7) + - [4.8 反向对齐](#4.8) + - [4.9 训练集数据读取对齐](#4.9) + - [4.10 网络初始化对齐](#4.10) + - [4.11 模型训练对齐](#4.11) + + +## 1. 
总览 + + +### 1.1 背景 + +* 以深度学习为核心的人工智能技术仍在高速发展,通过论文复现,开发者可以获得 + * 学习成长:自我能力提升 + * 技术积累:对科研或工作有所帮助和启发 + * 社区荣誉:成果被开发者广泛使用 + + +### 1.2 前序工作 + +基于本指南复现论文过程中,建议开发者准备以下内容。 + +* 了解该模型输入输出格式。以BERT的情感分类任务为例,通过阅读论文与参考代码,了解到模型输入为`[batch_size, sequence_length]`的tensor,类型为`int64`,label为`[batch, ]`的label,类型为`int64`。 +* 准备好训练/验证数据集,用于模型训练与评估 +* 准备好fake input data以及label,与模型输入shape、type等保持一致,用于后续模型前向对齐。 + * 在对齐模型前向过程中,我们不需要考虑数据集模块等其他模块,此时使用fake data是将模型结构和数据部分解耦非常合适的一种方式。 + * 将fake data以文件的形式存储下来,也可以保证PaddlePaddle与参考代码的模型结构输入是完全一致的,更便于排查问题。 + * 在该步骤中,以BERT为例,生成fake data的脚本可以参考:[gen_fake_data.py](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/fake_data/gen_fake_data.py)。 +* 在特定设备(CPU/GPU)上,跑通参考代码的预测过程(前向)以及至少2轮(iteration)迭代过程,保证后续基于PaddlePaddle复现论文过程中可对比。 +* 本文档基于 `BERT-SST2-Prod` 代码以及`reprod_log` whl包进行说明与测试。如果希望体验,建议参考[BERT-SST2-Prod文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/README.md)进行安装与测试。 +* 在复现的过程中,只需要将PaddlePaddle的复现代码以及打卡日志上传至github,不能在其中添加参考代码的实现,在验收通过之后,需要删除打卡日志。建议在初期复现的时候,就将复现代码与参考代码分成2个文件夹进行管理。 + + +## 2. 整体框图 + + +### 2.1 流程概览 + +面对一篇自然语言处理的论文,复现该论文的整体流程如下图所示。 + +![图片](https://user-images.githubusercontent.com/16911935/199389647-b000a7b1-28d1-485e-8ec0-3e7e2c05884a.png) + +总共包含11个步骤。为了高效复现论文,设置了5个验收节点。如上图中黄色框所示。后续章节会详细介绍上述步骤和验收节点,具体内容安排如下: + +* 第3章:介绍11个复现步骤的理论知识、实战以及验收流程。 +* 第4章:针对复现流程过程中每个步骤可能出现的问题,本章会进行详细介绍。如果还是不能解决问题,可以提ISSUE进行讨论,提ISSUE地址:[https://github.com/PaddlePaddle/Paddle/issues/new/choose](https://github.com/PaddlePaddle/Paddle/issues/new/choose) + + +### 2.2 reprod_log whl包 + +#### 2.2.1 reprod_log工具简介 +`reprod_log`是用于论文复现赛中辅助自查和验收工具。该工具源代码地址在:[https://github.com/WenmuZhou/reprod_log](https://github.com/WenmuZhou/reprod_log)。主要功能如下: + +* 存取指定节点的输入输出tensor +* 基于文件的tensor读写 +* 2个字典的对比验证 +* 对比结果的输出与记录 + +更多API与使用方法可以参考:[reprod_log API使用说明](https://github.com/WenmuZhou/reprod_log/blob/master/README.md)。 + +#### 2.2.2 reprod_log使用demo + +下面基于代码:[https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/reprod_log_demo](https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/reprod_log_demo),给出如何使用该工具。 + +文件夹中包含`write_log.py`和`check_log_diff.py`文件,其中`write_log.py`中给出了`ReprodLogger`类的使用方法,`check_log_diff.py`给出了`ReprodDiffHelper`类的使用方法,依次运行两个python文件,使用下面的方式运行代码。 + +```shell +# 进入文件夹 +cd pipeline/reprod_log_demo +# 随机生成矩阵,写入文件中 +python write_log.py +# 进行文件对比,输出日志 +python check_log_diff.py +``` + +最终会输出以下内容 + +``` +[2021/11/18 09:29:31] root INFO: demo_test_1: +[2021/11/18 09:29:31] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/18 09:29:31] root INFO: demo_test_2: +[2021/11/18 09:29:31] root INFO: mean diff: check passed: False, value: 0.33387675881385803 +[2021/11/18 09:29:31] root INFO: diff check failed +``` + +可以看出:对于key为`demo_test_1`的矩阵,由于diff为0,小于设置的阈值`1e-6`,核验成功;对于key为`demo_test_2`的矩阵,由于diff为0.33,大于设置的阈值`1e-6`,核验失败。 + +#### 2.2.3 reprod_log在论文复现中应用 + +在论文复现中,基于reprod_log的结果记录模块,产出下面若干文件 +``` +log_reprod +├── forward_paddle.npy +├── forward_torch.npy # 与forward_paddle.npy作为一并核查的文件对 +├── metric_paddle.npy +├── metric_torch.npy # 与metric_paddle.npy作为一并核查的文件对 +├── loss_paddle.npy +├── loss_torch.npy # 与loss_paddle.npy作为一并核查的文件对 +├── bp_align_paddle.npy +├── bp_align_torch.npy # 与bp_align_paddle.npy作为一并核查的文件对 +├── train_align_paddle.npy +├── train_align_torch.npy # pytorch运行得到的参考评估指标 +``` + +基于reprod_log的`ReprodDiffHelper`模块,产出下面5个日志文件。 + +``` +├── forward_diff.log # forward_paddle.npy与forward_torch.npy生成的diff结果文件 +├── metric_diff.log # metric_paddle.npy与metric_torch.npy生成的diff结果文件 +├── loss_diff.log # 
loss_paddle.npy与loss_torch.npy生成的diff结果文件 +├── bp_align_diff.log # bp_align_paddle.npy与bp_align_torch.npy生成的diff结果文件 +├── train_align_diff.log # train_align_paddle.train_align_torch.npy生成的diff结果文件 +``` + +上述文件的生成代码都需要开发者进行开发,验收时需要提供上面罗列的所有文件(不需要提供产生这些文件的可运行程序)以及完整的模型训练评估程序和日志。 +BERT-SST2-Prod项目提供了基于reprod_log的5个验收点对齐验收示例,具体代码地址为:[https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline](https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline), +每个文件夹中的README.md文档提供了使用说明。 + + +## 3. 论文复现理论知识及实战 + + +### 3.1 模型结构对齐 + +对齐模型结构时,一般有3个主要步骤: + +* 网络结构代码转换 +* 权重转换 +* 模型组网正确性验证 + +下面详细介绍这3个部分。 + +#### 3.1.1 网络结构代码转换 + +**【基本流程】** + +由于PyTorch的API和PaddlePaddle的API非常相似,可以参考[PyTorch-PaddlePaddle API映射表](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/08_api_mapping/pytorch_api_mapping_cn.html) +,组网部分代码直接进行手动转换即可。 + +**【注意事项】** + +如果遇到PaddlePaddle没有的API,可以尝试用多种API来组合,也可以给PaddlePaddle团队提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues),获得支持。 + +**【实战】** + +BERT网络结构的PyTorch实现: [transformers-bert](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py) + +对应转换后的PaddlePaddle实现: [paddlenlp-bert](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/bert/modeling.py) + + +#### 3.1.2 权重转换 + +**【基本流程】** + +组网代码转换完成之后,需要对模型权重进行转换,如果PyTorch repo中已经提供权重,那么可以直接下载并进行后续的转换;如果没有提供,则可以基于PyTorch代码,随机生成一个初始化权重(定义完model以后,使用`torch.save()` API保存模型权重),然后进行权重转换。 + +**【注意事项】** + +在权重转换的时候,需要注意`paddle.nn.Linear`等API的权重保存格式和名称等与PyTorch稍有diff,具体内容可以参考`4.1章节`。 + +**【实战】** + +BERT的代码转换脚本可以在这里查看:[https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py), + +注意:运行该代码需要首先下载Huggingface的BERT预训练模型到该目录下,下载地址为:[https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin](https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin) + +```python +# https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py + +from collections import OrderedDict + +import numpy as np +import paddle +import torch +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM +from transformers import BertForMaskedLM as PTBertForMaskedLM + + +def convert_pytorch_checkpoint_to_paddle( + pytorch_checkpoint_path="pytorch_model.bin", + paddle_dump_path="model_state.pdparams", + version="old", ): + hf_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", + } + do_not_transpose = [] + if version == "old": + hf_to_paddle.update({ + "predictions.bias": "predictions.decoder_bias", + ".gamma": ".weight", + ".beta": ".bias", + }) + do_not_transpose = do_not_transpose + ["predictions.decoder.weight"] + + pytorch_state_dict = torch.load( + pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + is_transpose = False + if k[-7:] == ".weight": + # embeddings.weight and LayerNorm.weight do not transpose + if all(d not 
in k for d in do_not_transpose): + if ".embeddings." not in k and ".LayerNorm." not in k: + if v.ndim == 2: + v = v.transpose(0, 1) + is_transpose = True + oldk = k + for hf_name, pd_name in hf_to_paddle.items(): + k = k.replace(hf_name, pd_name) + + # add prefix `bert.` + if "bert." not in k and "cls." not in k and "classifier" not in k: + k = "bert." + k + + print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +def compare(out_torch, out_paddle): + out_torch = out_torch.detach().numpy() + out_paddle = out_paddle.detach().numpy() + assert out_torch.shape == out_paddle.shape + abs_dif = np.abs(out_torch - out_paddle) + mean_dif = np.mean(abs_dif) + max_dif = np.max(abs_dif) + min_dif = np.min(abs_dif) + print("mean_dif:{}".format(mean_dif)) + print("max_dif:{}".format(max_dif)) + print("min_dif:{}".format(min_dif)) + + +def test_forward(): + paddle.set_device("cpu") + model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_torch.eval() + model_paddle.eval() + np.random.seed(42) + x = np.random.randint( + 1, model_paddle.bert.config["vocab_size"], size=(4, 64)) + input_torch = torch.tensor(x, dtype=torch.int64) + out_torch = model_torch(input_torch)[0] + + input_paddle = paddle.to_tensor(x, dtype=paddle.int64) + out_paddle = model_paddle(input_paddle)[0] + + print("torch result shape:{}".format(out_torch.shape)) + print("paddle result shape:{}".format(out_paddle.shape)) + compare(out_torch, out_paddle) + + +if __name__ == "__main__": + convert_pytorch_checkpoint_to_paddle( + "./bert-base-uncased/pytorch_model.bin", + "./bert-base-uncased/model_state.pdparams") + test_forward() + # torch result shape:torch.Size([4, 64, 30522]) + # paddle result shape:[4, 64, 30522] + # mean_dif:1.666686512180604e-05 + # max_dif:0.00015211105346679688 + # min_dif:0.0 +``` + +运行完成之后,会在当前目录生成`model_state.pdparams`文件,即为转换后的PaddlePaddle预训练模型。 +**Tips**: 由于paddlenlp中已有转换后的bert-base-uncased模型,因此可以一键加载,程序会自动下载对应权重! + + +#### 3.1.3 模型组网正确性验证 + +**【基本流程】** + +1. 定义PyTorch模型,加载权重,固定seed,基于numpy生成随机数,转换为PyTorch可以处理的tensor,送入网络,获取输出,使用reprod_log保存结果。 +2. 定义PaddlePaddle模型,加载权重,固定seed,基于numpy生成随机数,转换为PaddlePaddle可以处理的tensor,送入网络,获取输出,使用reprod_log保存结果。 +3. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +* 模型在前向对齐验证时,需要调用`model.eval()`方法,保证组网中的随机量被关闭,比如BatchNorm、Dropout等。 +* 给定相同的输入数据,为保证可复现性,如果有随机数生成,固定相关的随机种子。 +* 输出diff可以使用`np.mean(np.abs(o1 - o2))`进行计算,一般小于1e-6的话,可以认为前向没有问题。如果最终输出结果diff较大,可以使用二分的方法进行排查,比如说BERT,包含1个embdding层、12个transformer-block以及最后的MLM head层,那么完成模型组网和权重转换之后,如果模型输出没有对齐,可以尝试输出中间某一个transformer-block的tensor进行对比,如果相同,则向后进行排查;如果不同,则继续向前进行排查,以此类推,直到找到导致没有对齐的操作。 + +**【实战】** + +BERT模型组网正确性验证可以参考如下示例代码: +[https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/Step1](https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/Step1 + +**【验收】** + +对于待复现的项目,前向对齐验收流程如下。 + +1. 准备输入:fake data + * 使用参考代码的dataloader,生成一个batch的数据,保存下来,在前向对齐时,直接从文件中读入。 + * 固定随机数种子,生成numpy随机矩阵,转化tensor +2. 保存输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为tensor的值。最后将dict保存到文件中。建议命名为`forward_paddle.npy`和`forward_torch.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`forward_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 提交内容:新建文件夹,将`forward_paddle.npy`、`forward_torch.npy`与`forward_diff_log.txt`文件放在文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 +5. 
注意: + * PaddlePaddle与PyTorch保存的dict的key需要保持相同,否则report过程可能会提示key无法对应,从而导致report失败,之后的`【验收】`环节也是如此。 + * 如果是固定随机数种子,建议将fake data保存到dict中,方便check参考代码和PaddlePaddle的输入是否一致。 + + +### 3.2 验证/测试集数据读取对齐 + +**【基本流程】** + +对于一个数据集,一般有以下一些信息需要重点关注 + +* 数据集名称、下载地址 +* 训练集/验证集/测试集 + +PaddlePaddle中数据集相关的API为`paddle.io.Dataset`,PyTorch中对应为`torch.utils.data.Dataset`,二者功能一致,在绝大多数情况下,可以使用该类构建数据集。它是描述Dataset方法和行为的抽象类,在具体实现的时候,需要继承这个基类,实现其中的`__getitem__`和`__len__`方法。除了参考代码中相关实现,也可以参考待复现论文中的说明。 + +复现完Dataset之后,可以构建Dataloader,对数据进行组batch、批处理,送进网络进行计算。 + +`paddle.io.DataLoader`可以进行数据加载,将数据分成批数据,并提供加载过程中的采样。PyTorch对应的实现为`torch.utils.data.DataLoader`,二者在功能上一致,只是在参数方面稍有diff:(1)PaddlePaddle缺少对`pin_memory`等参数的支持;(2)PaddlePaddle增加了`use_shared_memory`参数来选择是否使用共享内存加速数据加载过程。 + +**【注意事项】** + +论文中一般会提供数据集的名称以及基本信息。复现过程中,我们在下载完数据之后,建议先检查下是否和论文中描述一致,否则可能存在的问题有: + +* 数据集版本不同,比如论文中使用了cnn_dailymail的v3.0.0版本数据集,但是我们下载的是cnn_dailymail的v1.0.0版本数据集,如果不对其进行检查,可能会导致我们最终训练的数据量等与论文中有diff +* 数据集使用方式不同,有些论文中,可能只是抽取了该数据集的子集进行方法验证,此时需要注意抽取方法,需要保证抽取出的子集完全相同。 +* 在评估指标对齐时,我们可以固定batch size,关闭Dataloader的shuffle操作。 + +构建数据集时,可以使用paddlenlp中的数据集加载方式,具体可以参考:[如何自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。对应地,PyTorch中的数据处理api可以参考:[huggingface的datasets自定义数据集](https://huggingface.co/docs/datasets/about_dataset_load.html#building-a-dataset)。对于其中之一,可以找到另一个平台的实现。 + +此外, +* 有些自定义的数据处理方法,如果不涉及到深度学习框架的部分,可以直接复用。 +* 对于特定任务中的数据预处理方法,比如说Tokenizer,如果没有现成的API可以调用,可以参考官方模型套件中的一些实现方法,比如PaddleClas、PaddleDetection、PaddleSeg等。 + +**【实战】** + +BERT模型复现过程中,数据预处理和Dataset、Dataloader的检查可以参考该文件: +[https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/test_data.py](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/test_data.py) + + +使用方法可以参考[数据检查文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/README.md)。 + + +### 3.3 评估指标对齐 + +**【基本流程】** + +PaddlePaddle提供了一系列Metric计算类,比如说`Accuracy`, `Auc`, `Precision`, `Recall`等,而PyTorch中,目前可以通过组合的方式实现metric计算,或者调用[huggingface-datasets](https://huggingface.co/docs/datasets/about_metrics.html?highlight=metric),在论文复现的过程中,需要注意保证对于该模块,给定相同的输入,二者输出完全一致。具体流程如下。 + +1. 构建fake数据 +1. 使用PyTorch的指标获取评估结果,使用reprod_log保存结果。 +2. 使用PaddlePaddle的指标获取评估结果,使用reprod_log保存结果。 +3. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +在评估指标对齐之前,需要注意保证对于该模块,给定相同的输入,二者输出完全一致。 + + +**【实战】** + +评估指标对齐检查方法可以参考文档:[评估指标对齐检查方法文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/README.md#%E6%95%B0%E6%8D%AE%E8%AF%84%E4%BC%B0%E5%AF%B9%E9%BD%90%E6%B5%81%E7%A8%8B) + + +**【验收】** + +对于待复现的项目,评估指标对齐验收流程如下。 + +1. 输入:dataloader, model +2. 输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为具体评估指标的值。最后将dict使用reprod_log保存到各自的文件中,建议命名为`metric_paddle.npy`和`metric_torch.npy`。 + * 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`metric_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +3. 提交内容:将`metric_paddle.npy`、`metric_torch.npy`与`metric_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 +4. 注意: + * 数据需要是真实数据 + * 需要检查论文是否只是抽取了验证集/测试集中的部分文件,如果是的话,则需要保证PaddlePaddle和参考代码中dataset使用的数据集一致。 + + + +### 3.4 损失函数对齐 + +**【基本流程】** + +PaddlePaddle与PyTorch均提供了很多loss function,用于模型训练,具体的API映射表可以参考:[Loss类API映射列表](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/08_api_mapping/pytorch_api_mapping_cn.html#lossapi)。以CrossEntropyLoss为例,主要区别为: +* PaddlePaddle提供了对软标签、指定softmax计算纬度的支持。 + +如果论文中使用的loss function没有指定的API,则可以尝试通过组合API的方式,实现自定义的loss function。 + +具体流程如下。 + +1. 
定义PyTorch模型,加载权重,加载fake data 和 fake label(或者固定seed,基于numpy生成随机数),转换为PyTorch可以处理的tensor,送入网络,获取loss结果,使用reprod_log保存结果。 +2. 定义PaddlePaddle模型,加载fake data 和 fake label(或者固定seed,基于numpy生成随机数),转换为PaddlePaddle可以处理的tensor,送入网络,获取loss结果,使用reprod_log保存结果。 +3. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +* 计算loss的时候,建议设置`model.eval()`,避免模型中随机量的问题。 + +**【实战】** + +本部分可以参考文档:[https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step3/README.md](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step3/README.md)。 + +**【验收】** + +对于待复现的项目,损失函数对齐验收流程如下。 + +1. 输入:fake data & label +2. 输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为具体评估指标的值。最后将dict使用reprod_log保存到各自的文件中,建议命名为`loss_paddle.npy`和`loss_torch.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`loss_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 提交内容:将`loss_paddle.npy`、`loss_torch.npy`与`loss_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 + + +### 3.5 优化器对齐 + +**【基本流程】** + +PaddlePaddle中的optimizer有`paddle.optimizer`等一系列实现,PyTorch中则有`torch.Optim`等一系列实现。 + +**【注意事项】** + +以SGD等优化器为例,PaddlePaddle与Pytorch的优化器区别主要如下。 + +* PaddlePaddle在优化器中增加了对梯度裁剪的支持,在训练GAN或者一些NLP、多模态任务中,这个用到的比较多。 +* PaddlePaddle的SGD不支持动量更新、动量衰减和Nesterov动量,这里需要使用`paddle.optimizer.Momentum` API实现这些功能。 + +**【实战】** + +本部分对齐建议对照[PaddlePaddle优化器API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/Overview_cn.html)与参考代码的优化器实现进行对齐,用之后的反向对齐统一验证该模块的正确性。 + + + +### 3.6 学习率对齐 + +**【基本流程】** + +* 学习率策略主要用于指定训练过程中的学习率变化曲线,这里可以将定义好的学习率策略,不断step,即可得到对应的学习率值,可以将学习率值保存在列表或者矩阵中,使用`reprod_log`工具判断二者是否对齐。 + +**【注意事项】** + +PaddlePaddle中,需要首先构建学习率策略,再传入优化器对象中;对于PyTorch,如果希望使用更丰富的学习率策略,需要先构建优化器,再传入学习率策略类API。 + +**【实战】** + +学习率复现对齐,可以参考代码:[学习率对齐验证文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step4/README.md#%E5%AD%A6%E4%B9%A0%E7%8E%87%E5%AF%B9%E9%BD%90%E9%AA%8C%E8%AF%81)。 + + +### 3.7 正则化策略对齐 + +**【基本流程】** + +L2正则化策略用于模型训练,可以防止模型对训练数据过拟合,L1正则化可以用于得到稀疏化的权重矩阵,PaddlePaddle中有`paddle.regularizer.L1Decay`与`paddle.regularizer.L2Decay` API。PyTorch中,torch.optim集成的优化器只有L2正则化方法,直接在构建optimizer的时候,传入`weight_decay`参数即可。 + +**【注意事项】** + +* PaddlePaddle的optimizer中支持L1Decat/L2Decay。 +* PyTorch的optimizer支持不同参数列表的学习率分别设置,params传入字典即可,而PaddlePaddle的2.1.0版本目前尚未支持这种行为,可以通过设置`ParamAttr`的`learning_rate`参数,来确定相对学习率倍数。PaddlePaddle的2.2.0版本中虽然实现该功能,但是模型收敛速度较慢,不建议使用。[优化器收敛速度慢](https://github.com/PaddlePaddle/Paddle/issues/36915) + +**【实战】** + +本部分对齐建议对照[PaddlePaddle正则化API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/regularizer/L2Decay_cn.html)与参考代码的优化器实现进行对齐,用之后的反向对齐统一验证该模块的正确性。 + + +### 3.8 反向对齐 + +**【基本流程】** + +此处可以通过numpy生成假的数据和label(推荐),也可以准备固定的真实数据。具体流程如下。 + +1. 检查两个代码的训练超参数全部一致,如优化器及其超参数、学习率、LayerNorm中的eps等。 +2. 将PaddlePaddle与PyTorch网络中涉及的所有随机操作全部关闭,如dropout、drop_path等,推荐将模型设置为eval模式(`model.eval()`) +3. 加载相同的weight dict(可以通过PyTorch来存储随机的权重),将准备好的数据分别传入网络并迭代,观察二者loss是否一致(此处batch-size要一致,如果使用多个真实数据,要保证传入网络的顺序一致) +4. 
如果经过2轮以上,loss均可以对齐,则基本可以认为反向对齐。 + + +**【注意事项】** + +* 如果第一轮loss就没有对齐,则需要仔细排查一下模型前向部分。 +* 如果第二轮开始,loss开始无法对齐,则首先需要排查下超参数的差异,没问题的话,在`loss.backward()`方法之后,使用`tensor.grad`获取梯度值,二分的方法查找diff,定位出PaddlePaddle与PyTorch梯度无法对齐的API或者操作,然后进一步验证并反馈。 + +梯度的打印方法示例代码如下所示,注释掉的内容即为打印网络中所有参数的梯度shape。 + +```python + # 代码地址:https://github.com/JunnYu/BERT-SST2-Prod/blob/2c372656bb1b077b0073c50161771d9fa9d8de5a/pipeline/Step4/test_bp.py#L12 + def pd_train_some_iters(model, + criterion, + optimizer, + fake_data, + fake_label, + max_iter=2): + model = PDBertForSequenceClassification.from_pretrained("bert-base-uncased", num_classes=2) + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + criterion = paddle.nn.CrossEntropy() + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW(learning_rate=3e-5, parameters=model.parameters(), + weight_decay=1e-2, + epsilon=1e-6, + apply_decay_param_fun=lambda x: x in decay_params) + loss_list = [] + for idx in range(max_iter): + input_ids = paddle.to_tensor(fake_data) + labels = paddle.to_tensor(fake_label) + + output = model(input_ids) + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(loss) + return loss_list +``` + + + + +**【实战】** + +本部分可以参考文档:[反向对齐操作文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step4/README.md#%E5%8F%8D%E5%90%91%E5%AF%B9%E9%BD%90%E6%93%8D%E4%BD%9C%E6%96%B9%E6%B3%95)。 + +**【验收】** + +对于待复现的项目,反向对齐验收流程如下。 + +1. 输入:fake data & label +2. 输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为具体loss的值。最后将dict使用reprod_log保存到各自的文件中,建议命名为`bp_align_paddle.npy`和`bp_align_torch.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`bp_align_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 提交内容:将`bp_align_paddle.npy`、`bp_align_torch.npy`与`bp_align_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 +5. 
注意: + * loss需要保存至少2轮以上。 + * 在迭代的过程中,需要保证模型的batch size等超参数完全相同 + * 在迭代的过程中,需要设置`model.eval()`,使用固定的假数据,同时加载相同权重的预训练模型。 + + +### 3.9 训练集数据读取对齐 + +**【基本流程】** + +该部分内容与3.2节内容基本一致,参考PyTorch的代码,实现训练集数据读取与预处理模块即可。 + +**【注意事项】** + +该部分内容,可以参考3.8节的自测方法,将输入的`fake data & label`替换为训练的dataloader,但是需要注意的是: +* 在使用train dataloader的时候,建议设置random seed,对于PyTorch来说 + +```python +#initialize random seed +torch.manual_seed(config.SEED) +torch.cuda.manual_seed_all(config.SEED) +np.random.seed(config.SEED) +random.seed(config.SEED) +``` + +对于PaddlePaddle来说 + +```python +paddle.seed(config.SEED) +np.random.seed(config.SEED) +random.seed(config.SEED) +``` + + + +### 3.10 网络初始化对齐 + +**【基本流程】** + +* 下面给出了部分初始化API的映射表。 + +|PaddlePaddle API | PyTorch API | +|---|---| +| paddle.nn.initializer.KaimingNormal | torch.nn.init.kaiming_normal_ | +| paddle.nn.initializer.KaimingUniform | torch.nn.init.kaiming_uniform_ | +| paddle.nn.initializer.XavierNormal | torch.nn.init.xavier_normal_ | +| paddle.nn.initializer.XavierUniform | torch.nn.init.xavier_uniform_ | + +**【注意事项】** + +* 更多初始化API可以参考[PyTorch初始化API文档](https://pytorch.org/docs/stable/nn.init.html)以及[PaddlePaddle初始化API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#chushihuaxiangguan)。 + +**【实战】** + +本部分对齐建议对照[PaddlePaddle 初始化API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#chushihuaxiangguan)与参考代码的初始化实现对齐。 + + +### 3.11 模型训练对齐 + +**【基本流程】** + +完成前面的步骤之后,就可以开始全量数据的训练对齐任务了。按照下面的步骤进行训练对齐。 + +1. 准备train/eval data, loader, model +2. 对model按照论文所述进行初始化(如果论文中提到加载了预训练模型,则按需加载pretrained model) +3. 加载配置,开始训练,迭代得到最终模型与评估指标,将评估指标使用reprod_log保存到文件中。 +4. 将PaddlePaddle提供的参考指标使用reprod_log提交到另一个文件中。 +5. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +* 【强烈】建议先做完反向对齐之后再进行模型训练对齐,二者之间的不确定量包括:数据集、PaddlePaddle与参考代码在模型training mode下的区别,初始化参数。 +* 在训练对齐过程中,受到较多随机量的影响,精度有少量diff是正常的,以SST-2数据集的分类为例,diff在0.15%以内可以认为是正常的,这里可以根据不同的任务,适当调整对齐检查的阈值(`ReprodDiffHelper.report`函数中的`diff_threshold`参数)。 +* 训练过程中的波动是正常的,如果最终收敛结果不一致,可以 + * 仔细排查Dropout、BatchNorm以及其他组网模块及超参是否无误。 + * 基于参考代码随机生成一份预训练模型,转化为PaddlePaddle的模型,并使用PaddlePaddle加载训练,对比二者的收敛曲线与最终结果,排查初始化影响。 + * 使用参考代码的Dataloader生成的数据,进行模型训练,排查train dataloader的影响。 + +**【实战】** + +本部分可以参考文档:[训练对齐操作文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step5/README.md)。 + +**【验收】** + +对于待复现的项目,训练对齐验收流程如下。 + +1. 输入:train/eval dataloader, model +2. 输出: + * PaddlePaddle:dict,key为保存值的name(自定义),value为具体评估指标的值。最后将dict使用reprod_log保存到文件中,建议命名为`train_align_paddle.npy`。 + * benchmark:dict,key为保存值的name(自定义),value为论文复现赛的评估指标要求的值。最后将dict使用reprod_log保存到文件中,建议命名为`train_align_benchmark.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`train_align_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 
提交内容:将`train_align_paddle.npy`、`train_align_benchmark.npy`与`train_align_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,最终一并打包上传即可。 + + +### 3.12 单机多卡训练 + +如果希望使用单机多卡提升训练效率,可以从以下几个过程对代码进行修改。 + +#### 3.12.1 数据读取 + +对于PaddlePaddle来说,多卡数据读取这块主要的变化在sampler + +对于单机单卡,sampler实现方式如下所示。 + +```python +train_sampler = paddle.io.RandomSampler(dataset) +train_batch_sampler = paddle.io.BatchSampler( + sampler=train_sampler, batch_size=args.batch_size) +``` + +对于单机多卡任务,sampler实现方式如下所示。 + +```python +train_batch_sampler = paddle.io.DistributedBatchSampler( + dataset=dataset, + batch_size=args.batch_size, + shuffle=True, + drop_last=False + ) +``` + +注意:在这种情况下,单机多卡的代码仍然能够以单机单卡的方式运行,因此建议以这种sampler方式进行论文复现。 + + +#### 3.12.2 多卡模型初始化 + +如果以多卡的方式运行,需要初始化并行训练环境,代码如下所示。 + +```python +if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() +``` + +在模型组网并初始化参数之后,需要使用`paddle.DataParallel()`对模型进行封装,使得模型可以通过数据并行的模式被执行。代码如下所示。 + +```python +if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) +``` + + +#### 3.12.3 模型保存、日志保存等其他模块 + +以模型保存为例,我们只需要在0号卡上保存即可,否则多个trainer同时保存的话,可能会造成写冲突,导致最终保存的模型不可用。 + + +#### 3.12.4 程序启动方式 + +对于单机单卡,启动脚本如下所示。[https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert) + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu \ + --use_amp False +``` + + +对于单机多卡(示例中为4卡训练),启动脚本如下所示。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu \ + --use_amp False +``` + +注意:这里8卡训练时,虽然单卡的batch size没有变化(32),但是总卡的batch size相当于是单卡的8倍,因此学习率也设置为了单卡时的8倍。 + + +**【实战】** + +本部分可以参考paddlenlp库中的例子:[单机多卡训练](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert)。 + + +## 4. 
论文复现注意事项与FAQ + +本部分主要总结大家在论文复现赛过程中遇到的问题,如果本章内容没有能够解决你的问题,欢迎给该文档提出优化建议或者给Paddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose)。 + + +### 4.0 通用注意事项 + +* 需要仔细对照PaddlePaddle与参考代码的优化器参数实现,确保优化器参数严格对齐。 +* 如果遇到一些Paddle不支持的API操作,可以尝试使用替代实现进行复现。如下面的PyTorch代码,PaddlePaddle中可以通过slice + concat API的组合形式进行功能实现。同时,对于这个问题,建议优先给Paddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose),列出Paddle不支持的实现,开发人员会根据优先级进行开发。 + +```python +torch.stack([ + per_locations[:, 0] - per_box_regression[:, 0], + per_locations[:, 1] - per_box_regression[:, 1], + per_locations[:, 0] + per_box_regression[:, 2], + per_locations[:, 1] + per_box_regression[:, 3], +], dim=1) +``` +* 如果遇到Paddle不包含的OP或者API,比如(1) 如果是某些算法实现存在调用了外部OP,而且Paddle也不包含该OP实现;(2) 其他框架存在的API或者OP,但是Paddle中没有这些OP。此时: + * 对于Paddle资深用户来说,可以尝试使用Paddle的自定义算子功能,存在一定的代码开发量。 + * 对于初学者来说,可以给Paddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose),列出Paddle不支持的实现,Paddle开发人员会根据优先级进行实现。 +* PaddlePaddle与PyTorch对于不同名称的API,实现的功能可能是相同的,复现的时候注意,比如[paddle.optimizer.lr.StepDecay](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/lr/StepDecay_cn.html#stepdecay)与[torch.optim.lr_scheduler.StepLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html#torch.optim.lr_scheduler.StepLR) 。 +* 对于PaddlePaddle来说,通过`paddle.set_device`函数(全局)来确定模型结构是运行在什么设备上,对于torch来说,是通过`model.to("device")` (局部)来确定模型结构的运行设备,这块在复现的时候需要注意。 + + + +### 4.1 模型结构对齐 + +#### 4.1.1 API +* 对于 `paddle.nn.Linear` 层的weight参数,PaddlePaddle与PyTorch的保存方式不同,在转换时需要进行转置,示例代码可以参考[BERT权重转换脚本](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py)。 +* `torch.masked_fill`函数的功能目前可以使用`paddle.where`进行实现,可以参考:[链接](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#paddletorch-masked-fillapi)。 +* `pack_padded_sequence`和`pad_packed_sequence`这两个API目前PaddlePaddle中没有实现,可以直接在RNN或者LSTM的输入中传入`sequence_length`来实现等价的功能。 + + +#### 4.1.2 权重转换 + +* 在权重转换的时候,不能只关注参数的名称,比如说有些`paddle.nn.Linear`层,但是定义的变量名称为`conv`,这种也是需要进行权重转置的。 +* 权重转换时,建议同时打印 Paddle 和 PyTorch 对应权重的shape,以防止名称相似但是shape不同的参数权重转换报错。 + +#### 4.1.3 模型组网正确性验证 + +* 在论文复现的过程中,可能会遇到一些经典的模型结构,比如Transformer等,Paddle官方也提供了Transformer的实现,但是这里建议自己根据PyTorch代码重新实现一遍,一方面是对整体的模型结构更加熟悉,另一方面也保证模型结构和权重完全对齐。 +* 在复杂的网络结构中,如果前向结果对不齐,可以按照模块排查问题,比如依次获取embedding、transformer-block、mlm-head输出等,看下问题具体出现在哪个子模块,再进到子模块详细排查。 +* 网络结构对齐后,尽量使用训练好的预训练模型和真实的数据进行前向diff计算,这样更准确。 + + +### 4.2 验证/测试集数据读取对齐 + +* 需要仔细排查数据预处理,不仅包含的预处理方法相同,也需要保证预处理的流程相同,比如先padding策略不同和截断策略的不同会导致得到最终的结果是不同的。 + + +### 4.3 评估指标对齐 + +* 真实数据评估时,需要注意评估时 `paddle.io.DataLoader` 的 ``drop_last`` 参数是否打开(文档[链接](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html#dataloader)),复现代码需要与参考代码保持一致,否则最后不够batch-size的数据的评估会有diff。 +* 在识别或者检索过程中,为了加速评估过程,往往会将评估函数由CPU实现改为GPU实现,由此会带来评估函数输出的不一致。这是由于sort函数对于相同值的排序结果不同带来的。在复现的过程中,如果可以接受轻微的指标不稳定,可以使用PaddlePaddle的sort函数,如果对于指标非常敏感,同时对速度性能要求很高,可以给PaddlePaddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose),由研发人员高优开发。 + + + +### 4.4 损失函数对齐 + +* 部分算法的损失函数中会用到 bool 索引,这时候可以使用[paddle.where](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/where_cn.html#where) 代替。 +* `paddle.nn.CrossEntropyLoss` 默认是在最后一维(axis=-1)计算损失函数,而 `torch.nn.CrossEntropyLoss` 是在axis=1的地方计算损失函数,因此如果输入的维度大于2,这里需要保证计算的维(axis)相同,否则可能会出错。 +* 在生成模型中会遇到梯度损失,需要对模型中的算子求二次梯度,目前`MaxPooling`暂时不支持二次梯度,如果复现的过程中遇到了需要对`MaxPooling`求二次梯度的情况,可以和Paddle官方开发同学反馈,进一步确认解决方案。 +* 在保存损失函数值的时候,注意要使用`paddle.no_grad`,或者仅仅保存转换成 numpy 的数组,避免损失没有析构导致内存泄漏问题。 + +```python +# 
错误示范 +loss = celoss(pred, label) +avg_loss += loss +# 正确示范1 +loss = celoss(pred, label) +avg_loss += loss.numpy() +# 正确示范2 +loss = celoss(pred, label) +with paddle.no_grad() + avg_loss += loss +``` + + +### 4.5 优化器对齐 + +* Paddle目前支持在 ``optimizer`` 中通过设置 ``params_groups`` 的方式设置不同参数的更新方式,可以参考[代码示例](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/optimizer/optimizer.py#L107) 。 +* 有些模型训练时,会使用梯度累加策略,即累加到一定step数量之后才进行参数更新,这时在实现上需要注意对齐。 +* 在某些任务中,比如说深度学习可视化、可解释性等任务中,一般只要求模型前向过程,不需要训练,此时优化器、学习率等用于模型训练的模块对于该类论文复现是不需要的。 +* 在文本分类领域,大多数Transformer模型都采用了AdamW优化器,并且会设置weigh decay,同时部分参数设置为no weight decay,例如位置编码的参数通常设置为no weight decay,no weight decay参数设置不正确,最终会有明显的精度损失,需要特别注意。一般可以通过分析模型权重来发现该问题,分别计算官方模型和复现模型每层参数权重的平均值、方差,对每一层依次对比,有显著差异的层可能存在问题,因为在weight decay的作用下,参数权重数值会相对较小,而未正确设置no weight decay,则会造成该层参数权重数值异常偏小。 + + + +### 4.6 学习率对齐 + +* PaddlePaddle 中参数的学习率受到优化器学习率和`ParamAttr`中设置的学习率影响,因此跟踪学习率需要将二者结合进行跟踪。 +* 对于复现代码和参考代码,学习率在整个训练过程中在相同的轮数相同的iter下应该保持一致,可以通过`reprod_log`工具、打印学习率值或者可视化二者学习率的log来查看diff。 +* 有些网络的学习率策略比较细致,比如带warmup的学习率策略,这里需要保证起始学习率等参数都完全一致。 + + + +### 4.7 正则化策略对齐 + +* 在如Transformer或者少部分CNN模型中,存在一些参数不做正则化(正则化系数为0)的情况。这里需要找到这些参数并对齐取消实施正则化策略,可以参考[这里](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/ppcls/arch/backbone/model_zoo/resnest.py#L72),对特定参数进行修改。 + + +### 4.8 反向对齐 + +* 反向对齐时,如果第二轮开始,loss开始无法对齐,则首先需要排查下超参数的差异,没问题的话,在`loss.backward()`方法之后,使用`tensor.grad`获取梯度值,二分的方法查找diff,定位出PaddlePaddle与PyTorch梯度无法对齐的API或者操作,然后进一步验证。第3章中给出了获取所有参数的梯度方法,如果只希望打印特定参数的梯度,可以用下面的方式。 + + +```python +import paddle + +def print_hook_fn(grad): + print(grad) + +x = paddle.to_tensor([0., 1., 2., 3.], stop_gradient=False) +h = x.register_hook(print_hook_fn) +w = x * 4 +w.backward() +# backward之后会输出下面的内容 +# Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=False, +# [4., 4., 4., 4.]) +``` + + + +### 4.9 训练集数据读取对齐 + +#### 4.9.1 API + +* 在前向过程中,如果数据预处理过程运行出错,请先将 ``paddle.io.DataLoader`` 的 ``num_workers`` 参数设为0,然后根据单个进程下的报错日志定位出具体的bug。 + +#### 4.9.2 数据预处理 + + +* 如果数据处理过程中涉及到随机数生成,建议固定seed (`np.random.seed(0)`, `random.seed(0)`),查看复现代码和参考代码处理后的数据是否有diff。 +* 对文本进行tokenizer处理时,需要确定文本的截断策略,padding策略。 + + +### 4.10 网络初始化对齐 + +* 对于不同的深度学习框架,网络初始化在大多情况下,即使值的分布完全一致,也无法保证值完全一致,这里也是论文复现中不确定性比较大的地方。如果十分怀疑初始化导致的问题,建议将参考的初始化权重转成paddle模型,加载该初始化模型训练,看下收敛精度。 +* CNN对于模型初始化相对来说没有那么敏感,在迭代轮数与数据集足够的情况下,最终精度指标基本接近;而transformer系列模型对于初始化比较敏感,在transformer系列模型训练对齐过程中,建议对这一块进行重点检查。 + + + +### 4.11 模型训练对齐 + +#### 4.11.1 训练对齐通用问题 + +* 有条件的话,复现工作之前最好先基于官方代码完成训练,保证与官方指标能够对齐,并且将训练策略和训练过程中的关键指标记录保存下来,比如每个epoch的学习率、Train Loss、Eval Loss、Eval Acc等,在复现网络的训练过程中,将关键指标保存下来,这样可以将两次训练中关键指标的变化曲线绘制出来,能够很方便的进行对比。 +* 训练过程中可以对loss或者acc进行可视化,和竞品loss或者acc进行直观的对比;如果训练较大的数据集,1次完整训练的成本比较高,此时可以隔一段时间查看一下,如果精度差异比较大,建议先停掉实验,排查原因。 +* 如果训练的过程中出nan,一般是因为除0或者log0的情况, 可以着重看下几个部分: + * 如果有预训练模型的话,可以确认下是否加载正确 + * 模型结构中计算loss的部分是否有考虑到正样本为0的情况 + * 也可能是某个API的数值越界导致的,可以测试较小的输入是否还会出现nan。 +* 如果训练过程中如果出现不收敛的情况,可以 + * 简化网络和数据,实验是否收敛; + * 如果是基于原有实现进行改动,可以尝试控制变量法,每次做一个改动,逐个排查; + * 检查学习率是否过大、优化器设置是否合理,排查下weight decay是否设置正确; + * 保存不同step之间的模型参数,观察模型参数是否更新。 diff --git a/examples/torch_migration/pipeline/Step1/README.md b/examples/torch_migration/pipeline/Step1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b3db11238110a8c3d1f7479b9ee6c3e172293549 --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/README.md @@ -0,0 +1,86 @@ +# 使用方法 + + +本部分内容以前向对齐为例,介绍基于`repord_log`工具对齐的检查流程。其中与`reprod_log`工具有关的部分都是需要开发者需要添加的部分。 + + +```shell +# 进入文件夹并生成torch的bert模型权重 +cd pipeline/weights/ && python torch_bert_weights.py +# 
进入文件夹并将torch的bert模型权重转换为paddle +cd pipeline/weights/ && python torch2paddle.py +# 进入文件夹并生成classifier权重 +cd pipeline/classifier_weights/ && python generate_classifier_weights.py +# 进入Step1文件夹 +cd pipeline/Step1/ +# 生成paddle的前向数据 +python pd_forward_bert.py +# 生成torch的前向数据 +python pt_forward_bert.py +# 对比生成log +python check_step1.py +``` + +具体地,以PaddlePaddle为例,`pd_forward_bert.py`的具体代码如下所示。 + +```python +import numpy as np +import paddle +from reprod_log import ReprodLogger +import sys +import os +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +config_path = CURRENT_DIR.rsplit('/', 1)[0] +sys.path.append(config_path) +from models.pd_bert import * + +# 导入reprod_log中的ReprodLogger类 +from reprod_log import ReprodLogger + +reprod_logger = ReprodLogger() + +# 组网初始化加载BertModel权重 +paddle_dump_path = '../weights/paddle_weight.pdparams' +config = BertConfig() +model = BertForSequenceClassification(config) +checkpoint = paddle.load(paddle_dump_path) +model.bert.load_dict(checkpoint) + +# 加载分类权重 +classifier_weights = paddle.load( + "../classifier_weights/paddle_classifier_weights.bin") +model.load_dict(classifier_weights) +model.eval() +# 读入fake data并转换为tensor,这里也可以固定seed在线生成fake data +fake_data = np.load("../fake_data/fake_data.npy") +fake_data = paddle.to_tensor(fake_data) +# 模型前向 +out = model(fake_data) +# 保存前向结果,对于不同的任务,需要开发者添加。 +reprod_logger.add("logits", out.cpu().detach().numpy()) +reprod_logger.save("forward_paddle.npy") +``` + +diff检查的代码可以参考:[check_step1.py](./check_step1.py),具体代码如下所示。 + +```python +# https://github.com/littletomatodonkey/AlexNet-Prod/blob/master/pipeline/Step1/check_step1.py +# 使用reprod_log排查diff +from reprod_log import ReprodDiffHelper +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./forward_torch.npy") + paddle_info = diff_helper.load_info("./forward_paddle.npy") + diff_helper.compare_info(torch_info, paddle_info) + diff_helper.report(path="forward_diff.log") +``` + +产出日志如下,同时会将check的结果保存在`forward_diff.log`文件中。 + +``` +[2021/11/17 20:15:50] root INFO: logits: +[2021/11/17 20:15:50] root INFO: mean diff: check passed: True, value: 1.30385160446167e-07 +[2021/11/17 20:15:50] root INFO: diff check passed +``` + +平均绝对误差为1.3e-7,测试通过。 diff --git a/examples/torch_migration/pipeline/Step1/check_step1.py b/examples/torch_migration/pipeline/Step1/check_step1.py new file mode 100644 index 0000000000000000000000000000000000000000..6dbb247cf179e584d36b0009120c942cc782ba09 --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/check_step1.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
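+# Compare the forward logits dumped by pd_forward_bert.py and pt_forward_bert.py
+# and write the diff report to forward_diff.log via reprod_log.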
+ +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./forward_torch.npy") + paddle_info = diff_helper.load_info("./forward_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + diff_helper.report(path="forward_diff.log") diff --git a/examples/torch_migration/pipeline/Step1/pd_forward_bert.py b/examples/torch_migration/pipeline/Step1/pd_forward_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..9bdd14f4071b0a5929b6cefadd5d832a8fc8880d --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/pd_forward_bert.py @@ -0,0 +1,49 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys + +import numpy as np +import paddle +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + paddle.set_device("cpu") + + # def logger + reprod_logger = ReprodLogger() + + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + # read or gen fake data + + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = paddle.to_tensor(fake_data) + # forward + out = model(fake_data)[0] + reprod_logger.add("logits", out.cpu().detach().numpy()) + reprod_logger.save("forward_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step1/pt_forward_bert.py b/examples/torch_migration/pipeline/Step1/pt_forward_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..55b7f7490caa616ad6d832c600517e57bf08c429 --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/pt_forward_bert.py @@ -0,0 +1,47 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
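+# Build the reference PyTorch BERT classifier, run it on the shared fake data
+# and save the logits as forward_torch.npy for the diff check in check_step1.py.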
+import os +import sys + +import numpy as np +import torch +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pt_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + # def logger + reprod_logger = ReprodLogger() + + pytorch_dump_path = "../weights/torch_weight.bin" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + + classifier_weights = torch.load("../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.eval() + + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = torch.from_numpy(fake_data) + # forward + out = model(fake_data)[0] + reprod_logger.add("logits", out.cpu().detach().numpy()) + reprod_logger.save("forward_torch.npy") diff --git a/examples/torch_migration/pipeline/Step1/torch2paddle.py b/examples/torch_migration/pipeline/Step1/torch2paddle.py new file mode 100644 index 0000000000000000000000000000000000000000..4a2b4977051b97291ce45e7e49aabc1581267b7b --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/torch2paddle.py @@ -0,0 +1,114 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
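+# Step1 helper: load a PyTorch BERT state dict and save it as a PaddlePaddle
+# .pdparams file (see test_forward() for an optional output comparison).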
+ +from collections import OrderedDict + +import numpy as np +import paddle +import torch +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM +from transformers import BertForMaskedLM as PTBertForMaskedLM + + +def convert_pytorch_checkpoint_to_paddle( + pytorch_checkpoint_path="pytorch_model.bin", + paddle_dump_path="model_state.pdparams", + version="old", +): + hf_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", + } + do_not_transpose = [] + if version == "old": + hf_to_paddle.update( + { + "predictions.bias": "predictions.decoder_bias", + ".gamma": ".weight", + ".beta": ".bias", + } + ) + do_not_transpose = do_not_transpose + ["predictions.decoder.weight"] + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + is_transpose = False + if k[-7:] == ".weight": + # embeddings.weight and LayerNorm.weight do not transpose + if all(d not in k for d in do_not_transpose): + if ".embeddings." not in k and ".LayerNorm." not in k: + if v.ndim == 2: + if "embeddings" not in k: + v = v.transpose(0, 1) + is_transpose = True + is_transpose = False + oldk = k + print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +def compare(out_torch, out_paddle): + out_torch = out_torch.detach().numpy() + out_paddle = out_paddle.detach().numpy() + assert out_torch.shape == out_paddle.shape + abs_dif = np.abs(out_torch - out_paddle) + mean_dif = np.mean(abs_dif) + max_dif = np.max(abs_dif) + min_dif = np.min(abs_dif) + print("mean_dif:{}".format(mean_dif)) + print("max_dif:{}".format(max_dif)) + print("min_dif:{}".format(min_dif)) + + +def test_forward(): + paddle.set_device("cpu") + model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_torch.eval() + model_paddle.eval() + np.random.seed(42) + x = np.random.randint(1, model_paddle.bert.config["vocab_size"], size=(4, 64)) + input_torch = torch.tensor(x, dtype=torch.int64) + out_torch = model_torch(input_torch)[0] + + input_paddle = paddle.to_tensor(x, dtype=paddle.int64) + out_paddle = model_paddle(input_paddle)[0] + + print("torch result shape:{}".format(out_torch.shape)) + print("paddle result shape:{}".format(out_paddle.shape)) + compare(out_torch, out_paddle) + + +if __name__ == "__main__": + convert_pytorch_checkpoint_to_paddle("test.bin", "test_paddle.pdparams") +# test_forward() +# torch result shape:torch.Size([4, 64, 30522]) +# paddle result shape:[4, 64, 30522] +# mean_dif:1.666686512180604e-05 +# max_dif:0.00015211105346679688 +# min_dif:0.0 diff --git a/examples/torch_migration/pipeline/Step2/README.md b/examples/torch_migration/pipeline/Step2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..029761c85e47fdbfda1fa427092a1e5791b9271f --- /dev/null +++ 
b/examples/torch_migration/pipeline/Step2/README.md @@ -0,0 +1,131 @@ +# 使用方法 + +## 数据集和数据加载对齐步骤 + +* 使用下面的命令,判断数据预处理以及数据集是否构建正确。 + +```shell +python test_data.py +``` + +显示出以下内容,Dataset以及Dataloader的长度和内容diff均满足小于指定阈值,可以认为复现成功。 + +``` +[2021/11/17 20:57:06] root INFO: length: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_0_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_0_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_0_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_1_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_1_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_1_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_2_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_2_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_2_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_3_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_3_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_3_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_4_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_4_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_4_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_0_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_0_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_0_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_1_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_1_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_1_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_2_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_2_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_2_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 
20:57:06] root INFO: dataloader_3_input_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_3_token_type_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_3_labels:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_4_input_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_4_token_type_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_4_labels:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: diff check passed
+```
+
+
+## 数据评估对齐流程
+
+### 评估代码和修改内容说明
+
+PyTorch准确率评估指标使用的是huggingface的datasets库。
+
+```python
+import torch
+import numpy as np
+from datasets import load_metric
+hf_metric = load_metric("accuracy.py")
+logits = np.random.normal(0, 1, size=(64, 2)).astype("float32")
+labels = np.random.randint(0, 2, size=(64,)).astype("int64")
+hf_metric.add_batch(predictions=torch.from_numpy(logits).argmax(dim=-1), references=torch.from_numpy(labels))
+hf_accuracy = hf_metric.compute()["accuracy"]
+print(hf_accuracy)
+```
+
+对应地,PaddlePaddle评估指标代码如下:
+
+```python
+import paddle
+import numpy as np
+from paddle.metric import Accuracy
+pd_metric = Accuracy()
+pd_metric.reset()
+logits = np.random.normal(0, 1, size=(64, 2)).astype("float32")
+labels = np.random.randint(0, 2, size=(64,)).astype("int64")
+correct = pd_metric.compute(paddle.to_tensor(logits), paddle.to_tensor(labels))
+pd_metric.update(correct)
+pd_accuracy = pd_metric.accumulate()
+print(pd_accuracy)
+```
+
+### 操作步骤
+
+运行下面的命令,验证数据集评估是否正常。
+
+```shell
+# 生成paddle和pytorch指标
+python test_metric.py
+# 对比生成log
+python check_step2.py
+```
+
+最终结果输出如下,accuracy精度diff为0,小于阈值,验证通过。
+```
+[2021/11/17 21:15:05] root INFO: accuracy:
+[2021/11/17 21:15:05] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:15:05] root INFO: diff check passed
+
+```
diff --git a/examples/torch_migration/pipeline/Step2/accuracy.py b/examples/torch_migration/pipeline/Step2/accuracy.py
new file mode 100644
index 0000000000000000000000000000000000000000..aecc76a72f54fb6a3c71404597769806d7847466
--- /dev/null
+++ b/examples/torch_migration/pipeline/Step2/accuracy.py
@@ -0,0 +1,90 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Accuracy metric."""
+
+import datasets
+from sklearn.metrics import accuracy_score
+
+_DESCRIPTION = """
+Accuracy is the proportion of correct predictions among the total number of cases processed.
It can be computed with: +Accuracy = (TP + TN) / (TP + TN + FP + FN) +TP: True positive +TN: True negative +FP: False positive +FN: False negative +""" + +_KWARGS_DESCRIPTION = """ +Args: + predictions: Predicted labels, as returned by a model. + references: Ground truth labels. + normalize: If False, return the number of correctly classified samples. + Otherwise, return the fraction of correctly classified samples. + sample_weight: Sample weights. +Returns: + accuracy: Accuracy score. +Examples: + + >>> accuracy_metric = datasets.load_metric("accuracy") + >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1]) + >>> print(results) + {'accuracy': 1.0} +""" + +_CITATION = """\ +@article{scikit-learn, + title={Scikit-learn: Machine Learning in {P}ython}, + author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. + and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. + and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and + Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, + journal={Journal of Machine Learning Research}, + volume={12}, + pages={2825--2830}, + year={2011} +} +""" + + +@datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION) +class Accuracy(datasets.Metric): + def _info(self): + return datasets.MetricInfo( + description=_DESCRIPTION, + citation=_CITATION, + inputs_description=_KWARGS_DESCRIPTION, + features=datasets.Features( + { + "predictions": datasets.Sequence(datasets.Value("int32")), + "references": datasets.Sequence(datasets.Value("int32")), + } + if self.config_name == "multilabel" + else { + "predictions": datasets.Value("int32"), + "references": datasets.Value("int32"), + } + ), + reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"], + ) + + def _compute(self, predictions, references, normalize=True, sample_weight=None): + return { + "accuracy": accuracy_score( + references, + predictions, + normalize=normalize, + sample_weight=sample_weight, + ).item(), + } diff --git a/examples/torch_migration/pipeline/Step2/check_step2.py b/examples/torch_migration/pipeline/Step2/check_step2.py new file mode 100644 index 0000000000000000000000000000000000000000..ac74370e6a99d16c7ed03a80d60f87e51d1f66b7 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/check_step2.py @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
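+# check_step2.py loads the accuracy metrics dumped by test_metric.py for both
+# frameworks (metric_torch.npy and metric_paddle.npy), compares them with
+# reprod_log.ReprodDiffHelper, and writes the comparison report to metric_diff.log.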
+ +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./metric_torch.npy") + paddle_info = diff_helper.load_info("./metric_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(path="metric_diff.log") diff --git a/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv b/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv new file mode 100644 index 0000000000000000000000000000000000000000..fdc6b82affefaea1a9b623436adc4f11968cf661 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv @@ -0,0 +1,33 @@ +sentence label +it 's a charming and often affecting journey . 1 +unflinchingly bleak and desperate 0 +allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 1 +the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 1 +it 's slow -- very , very slow . 0 +although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women . 1 +a sometimes tedious film . 0 +or doing last year 's taxes with your ex-wife . 0 +you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance . 1 +in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey . 0 +the mesmerizing performances of the leads keep the film grounded and keep the audience riveted . 1 +it takes a strange kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie . 0 +... the film suffers from a lack of humor ( something needed to balance out the violence ) ... 0 +we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . 1 +even horror fans will most likely not find what they 're seeking with trouble every day ; the movie lacks both thrills and humor . 0 +a gorgeous , high-spirited musical from india that exquisitely blends music , dance , song , and high drama . 1 +the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . 1 +audrey tatou has a knack for picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning-glory exuberant as she was in amélie . 1 +... the movie is just a plain old monster . 0 +in its best moments , resembles a bad high school production of grease , without benefit of song . 0 +pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins . 0 +the iditarod lasts for days - this just felt like it did . 0 +holden caulfield did it better . 0 +a delectable and intriguing thriller filled with surprises , read my lips is an original . 1 +seldom has a movie so closely matched the spirit of a man and his work . 1 +nicks , seemingly uncertain what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy . 0 +the action switches between past and present , but the material link is too tenuous to anchor the emotional connections that purport to span a 125-year divide . 0 +it 's an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part . 
1 +it 's a cookie-cutter movie , a cut-and-paste job . 0 +i had to look away - this was god awful . 0 +thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men . 1 +... designed to provide a mix of smiles and tears , `` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns . 0 diff --git a/examples/torch_migration/pipeline/Step2/predict.py b/examples/torch_migration/pipeline/Step2/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..6333dcd105e4b9051d8ffaaa30a7f8a7f7302e9f --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/predict.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys +from functools import partial + +import paddle +import paddle.nn as nn +import pandas as pd + +from paddlenlp.datasets import load_dataset as ppnlp_load_dataset +from paddlenlp.transformers import BertTokenizer as PPNLPBertTokenizer + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + + +def get_data(): + def read(data_path): + df = pd.read_csv(data_path, sep="\t") + for _, row in df.iterrows(): + yield {"sentence": row["sentence"], "labels": row["label"]} + + def convert_example(example, tokenizer, max_length=128): + # labels = [example["labels"]] + # labels = np.array([example["labels"]], dtype="int64") + example = tokenizer(example["sentence"], max_seq_len=max_length) + return example + + tokenizer = PPNLPBertTokenizer.from_pretrained("bert-base-uncased") + dataset_test = ppnlp_load_dataset(read, data_path="demo_sst2_sentence/demo.tsv", lazy=False) + trans_func = partial(convert_example, tokenizer=tokenizer, max_length=128) + + dataset_test = dataset_test.map(trans_func, lazy=False) + one_sentence = dataset_test.new_data[0] + + for k in ["input_ids", "token_type_ids"]: + one_sentence[k] = paddle.to_tensor(one_sentence[k], dtype="int64") + one_sentence[k] = paddle.unsqueeze(one_sentence[k], axis=0) + + return one_sentence + + +@paddle.no_grad() +def main(): + # 模型定义 + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + + model.eval() + # 要预测的句子 + data = get_data() + softmax = nn.Softmax() + # 预测的各类别的概率值 + output = softmax(model(**data)[0]).numpy() + + # 概率值最大的类别 + class_id = output.argmax() + # 对应的概率值 + prob = output[0][class_id] + print(f"class_id: {class_id}, prob: {prob}") + return output + + +if __name__ == "__main__": + main() diff --git a/examples/torch_migration/pipeline/Step2/test_data.py 
b/examples/torch_migration/pipeline/Step2/test_data.py new file mode 100644 index 0000000000000000000000000000000000000000..587e4e088dace3dd1b7db41e080a32a88291d1f3 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/test_data.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from functools import partial + +import numpy as np +import paddle +import pandas as pd +import torch +from datasets import Dataset +from reprod_log import ReprodDiffHelper, ReprodLogger +from transformers import BertTokenizer as HFBertTokenizer + +from paddlenlp.datasets import load_dataset as ppnlp_load_dataset +from paddlenlp.transformers import BertTokenizer as PPNLPBertTokenizer + + +def build_paddle_data_pipeline(): + from paddlenlp.data import DataCollatorWithPadding + + def read(data_path): + df = pd.read_csv(data_path, sep="\t") + for _, row in df.iterrows(): + yield {"sentence": row["sentence"], "labels": row["label"]} + + def convert_example(example, tokenizer, max_length=128): + labels = [example["labels"]] + example = tokenizer(example["sentence"], max_seq_len=max_length) + + example["labels"] = labels + return example + + # load tokenizer + tokenizer = PPNLPBertTokenizer.from_pretrained("bert-base-uncased") + # load data + dataset_test = ppnlp_load_dataset(read, data_path="demo_sst2_sentence/demo.tsv", lazy=False) + trans_func = partial(convert_example, tokenizer=tokenizer, max_length=128) + + # tokenize data + dataset_test = dataset_test.map(trans_func, lazy=False) + + test_sampler = paddle.io.SequenceSampler(dataset_test) + test_batch_sampler = paddle.io.BatchSampler(sampler=test_sampler, batch_size=4) + data_collator = DataCollatorWithPadding(tokenizer) + data_loader_test = paddle.io.DataLoader( + dataset_test, + batch_sampler=test_batch_sampler, + num_workers=0, + collate_fn=data_collator, + ) + + return dataset_test, data_loader_test + + +def build_torch_data_pipeline(): + from transformers import DataCollatorWithPadding + + tokenizer = HFBertTokenizer.from_pretrained("bert-base-uncased") + + def preprocess_function(examples): + result = tokenizer( + examples["sentence"], + padding=False, + max_length=128, + truncation=True, + return_token_type_ids=True, + ) + if "label" in examples: + result["labels"] = [examples["label"]] + return result + + # load data + dataset_test = Dataset.from_csv("demo_sst2_sentence/demo.tsv", sep="\t") + dataset_test = dataset_test.map( + preprocess_function, + batched=False, + remove_columns=dataset_test.column_names, + desc="Running tokenizer on dataset", + ) + dataset_test.set_format("np", columns=["input_ids", "token_type_ids", "labels"]) + test_sampler = torch.utils.data.SequentialSampler(dataset_test) + collate_fn = DataCollatorWithPadding(tokenizer) + data_loader_test = torch.utils.data.DataLoader( + dataset_test, + batch_size=4, + sampler=test_sampler, + num_workers=0, + collate_fn=collate_fn, + ) + return dataset_test, data_loader_test + + +def test_data_pipeline(): + 
diff_helper = ReprodDiffHelper() + paddle_dataset, paddle_dataloader = build_paddle_data_pipeline() + torch_dataset, torch_dataloader = build_torch_data_pipeline() + + logger_paddle_data = ReprodLogger() + logger_torch_data = ReprodLogger() + + logger_paddle_data.add("length", np.array(len(paddle_dataset))) + logger_torch_data.add("length", np.array(len(torch_dataset))) + + # random choose 5 images and check + for idx in range(5): + rnd_idx = np.random.randint(0, len(paddle_dataset)) + for k in ["input_ids", "token_type_ids", "labels"]: + + logger_paddle_data.add(f"dataset_{idx}_{k}", np.array(paddle_dataset[rnd_idx][k])) + + logger_torch_data.add(f"dataset_{idx}_{k}", np.array(torch_dataset[rnd_idx][k])) + + for idx, (paddle_batch, torch_batch) in enumerate(zip(paddle_dataloader, torch_dataloader)): + if idx >= 5: + break + for i, k in enumerate(["input_ids", "token_type_ids", "labels"]): + logger_paddle_data.add(f"dataloader_{idx}_{k}", paddle_batch[k].numpy()) + logger_torch_data.add(f"dataloader_{idx}_{k}", torch_batch[k].cpu().numpy()) + + diff_helper.compare_info(logger_paddle_data.data, logger_torch_data.data) + diff_helper.report() + + +if __name__ == "__main__": + test_data_pipeline() diff --git a/examples/torch_migration/pipeline/Step2/test_metric.py b/examples/torch_migration/pipeline/Step2/test_metric.py new file mode 100644 index 0000000000000000000000000000000000000000..2b82fd8348e53c894efe9319efb047c8245bbae5 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/test_metric.py @@ -0,0 +1,49 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
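+# test_metric.py feeds the same random logits and labels to paddle.metric.Accuracy
+# and to the HuggingFace datasets accuracy metric (accuracy.py), then saves both
+# results with ReprodLogger (metric_paddle.npy / metric_torch.npy) so that
+# check_step2.py can verify the metric alignment.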
+ +import numpy as np +import paddle +import torch +from datasets import load_metric +from paddle.metric import Accuracy +from reprod_log import ReprodLogger + + +def generate(): + pd_metric = Accuracy() + pd_metric.reset() + hf_metric = load_metric("accuracy.py") + for i in range(4): + logits = np.random.normal(0, 1, size=(64, 2)).astype("float32") + labels = np.random.randint(0, 2, size=(64,)).astype("int64") + # paddle metric + correct = pd_metric.compute(paddle.to_tensor(logits), paddle.to_tensor(labels)) + pd_metric.update(correct) + # hf metric + hf_metric.add_batch( + predictions=torch.from_numpy(logits).argmax(dim=-1), + references=torch.from_numpy(labels), + ) + pd_accuracy = pd_metric.accumulate() + hf_accuracy = hf_metric.compute()["accuracy"] + reprod_logger = ReprodLogger() + reprod_logger.add("accuracy", np.array([pd_accuracy])) + reprod_logger.save("metric_paddle.npy") + reprod_logger = ReprodLogger() + reprod_logger.add("accuracy", np.array([hf_accuracy])) + reprod_logger.save("metric_torch.npy") + + +if __name__ == "__main__": + generate() diff --git a/examples/torch_migration/pipeline/Step3/README.md b/examples/torch_migration/pipeline/Step3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4e6e79ae1bf1f152d335517bd96df8dc87616378 --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/README.md @@ -0,0 +1,67 @@ +# 使用方法 + +## 代码解析 + +以PaddlePaddle为例,下面为定义模型、计算loss并保存的代码。 + +```python +# paddle_loss.py +if __name__ == "__main__": + paddle.set_device("cpu") + + # def logger + reprod_logger = ReprodLogger() + + model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_classes=2) + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + + criterion = nn.CrossEntropyLoss() + + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = paddle.to_tensor(fake_data) + + fake_label = np.load("../fake_data/fake_label.npy") + fake_label = paddle.to_tensor(fake_label) + + # forward + out = model(fake_data) + + loss = criterion(out, fake_label) + # + reprod_logger.add("loss", loss.cpu().detach().numpy()) + reprod_logger.save("loss_paddle.npy") + +``` + +记录loss并保存在`loss_paddle.npy`文件中。 + + +## 操作步骤 + +* 具体操作步骤如下所示。 + + +```shell +# 生成paddle的前向loss结果 +python paddle_loss.py + +# 生成torch的前向loss结果 +python torch_loss.py + +# 对比生成log +python check_step3.py +``` + +`check_step3.py`的输出结果如下所示,同时也会保存在`loss_diff.log`文件中。 + +``` +[2021/11/17 21:27:35] root INFO: loss: +[2021/11/17 21:27:35] root INFO: mean diff: check passed: True, value: 5.960464477539063e-08 +[2021/11/17 21:27:35] root INFO: diff check passed + +``` + +diff为5.96e-8,check通过。 diff --git a/examples/torch_migration/pipeline/Step3/check_step3.py b/examples/torch_migration/pipeline/Step3/check_step3.py new file mode 100644 index 0000000000000000000000000000000000000000..546233dade0ea728512e43318790abd6742627d7 --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/check_step3.py @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./loss_torch.npy") + paddle_info = diff_helper.load_info("./loss_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(path="loss_diff.log") diff --git a/examples/torch_migration/pipeline/Step3/paddle_loss.py b/examples/torch_migration/pipeline/Step3/paddle_loss.py new file mode 100644 index 0000000000000000000000000000000000000000..c791e03fc7e105232d26a37d314b273139d528e0 --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/paddle_loss.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys + +import numpy as np +import paddle +import paddle.nn as nn +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + paddle.set_device("cpu") + + # def logger + reprod_logger = ReprodLogger() + + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + + criterion = nn.CrossEntropyLoss() + + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = paddle.to_tensor(fake_data) + + fake_label = np.load("../fake_data/fake_label.npy") + fake_label = paddle.to_tensor(fake_label) + + # forward + out = model(fake_data)[0] + + loss = criterion(out, fake_label) + reprod_logger.add("loss", loss.cpu().detach().numpy()) + reprod_logger.save("loss_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step3/torch_loss.py b/examples/torch_migration/pipeline/Step3/torch_loss.py new file mode 100644 index 0000000000000000000000000000000000000000..fa455bb5a4d9ed6fe992a3e27e33c4bb94a5c7ea --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/torch_loss.py @@ -0,0 +1,56 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys + +import numpy as np +import torch +import torch.nn as nn +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pt_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + + # def logger + reprod_logger = ReprodLogger() + + criterion = nn.CrossEntropyLoss() + + pytorch_dump_path = "../weights/torch_weight.bin" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + + classifier_weights = torch.load("../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.eval() + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = torch.from_numpy(fake_data) + + fake_label = np.load("../fake_data/fake_label.npy") + fake_label = torch.from_numpy(fake_label) + + # forward + out = model(fake_data)[0] + + loss = criterion(out, fake_label) + reprod_logger.add("loss", loss.cpu().detach().numpy()) + reprod_logger.save("loss_torch.npy") diff --git a/examples/torch_migration/pipeline/Step4/README.md b/examples/torch_migration/pipeline/Step4/README.md new file mode 100644 index 0000000000000000000000000000000000000000..695b0728a773d0c14ef6dbd90f7342a2d8afdd20 --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/README.md @@ -0,0 +1,136 @@ +# 使用方法 + +### 学习率对齐验证 + +运行下面的命令,检查学习率模块设置是否正确。 + +```shell +python test_lr_scheduler.py +``` + +最终输出内容如下。 + +``` +[2021/11/17 21:44:19] root INFO: step_100_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_300_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_500_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_700_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_900_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_100_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_300_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_500_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: False, value: 9.35605818719964e-06 +[2021/11/17 21:44:19] root INFO: step_700_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: False, value: 1.3681476625617212e-05 +[2021/11/17 21:44:19] root INFO: step_900_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: False, value: 1.8924391285779562e-05 +[2021/11/17 21:44:19] root INFO: step_100_polynomial_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: 
True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_300_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_500_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_700_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_900_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: diff check failed
+
+```
+
+linear和polynomial方式衰减的学习率diff为0,check通过;cosine方式衰减的学习率可能由于计算误差未通过,diff在1e-5量级。
+
+
+### 反向对齐操作方法
+
+#### 代码讲解
+
+以PaddlePaddle为例,训练流程核心代码如下所示。每个iter中输入相同的fake data与fake label,计算loss,进行梯度反传与参数更新,将loss批量返回,用于后续的验证。
+
+```python
+def pd_train_some_iters(fake_data, fake_label, max_iter=2):
+    paddle_dump_path = '../weights/paddle_weight.pdparams'
+    config = PDBertConfig()
+    model = PDBertForSequenceClassification(config)
+    checkpoint = paddle.load(paddle_dump_path)
+    model.bert.load_dict(checkpoint)
+    classifier_weights = paddle.load(
+        "../classifier_weights/paddle_classifier_weights.bin")
+    model.load_dict(classifier_weights)
+    model.eval()
+    criterion = paddle.nn.CrossEntropyLoss()
+    decay_params = [
+        p.name for n, p in model.named_parameters()
+        if not any(nd in n for nd in ["bias", "norm"])
+    ]
+    optimizer = paddle.optimizer.AdamW(
+        learning_rate=3e-5,
+        parameters=model.parameters(),
+        weight_decay=1e-2,
+        epsilon=1e-6,
+        apply_decay_param_fun=lambda x: x in decay_params)
+    loss_list = []
+    for idx in range(max_iter):
+        input_ids = paddle.to_tensor(fake_data)
+        labels = paddle.to_tensor(fake_label)
+
+        output = model(input_ids)[0]
+        loss = criterion(output, labels)
+        loss.backward()
+        optimizer.step()
+        optimizer.clear_grad()
+        loss_list.append(loss)
+    return loss_list
+```
+
+
+#### 操作方法
+
+运行下面的命令,基于fake data与fake label,依次生成若干轮loss数据并保存,使用`reprod_log`工具进行diff排查。
+
+```shell
+# 生成paddle和torch的反向对齐loss数据
+python test_bp.py
+
+# 对比生成log
+python check_step4.py
+```
+
+最终输出结果如下,同时会保存在文件`bp_align_diff.log`中。
+
+```
+[2021/11/17 22:08:30] root INFO: loss_0:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_1:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_2:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_3:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_4:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_5:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_6:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_7:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_8:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_9:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: diff check passed
+
+```
+
+前面10轮的loss diff均等于0,check通过。
diff --git a/examples/torch_migration/pipeline/Step4/check_step4.py
b/examples/torch_migration/pipeline/Step4/check_step4.py new file mode 100644 index 0000000000000000000000000000000000000000..20823116d45584038cc2b49fab1d693a66c1ae0c --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/check_step4.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./bp_align_torch.npy") + paddle_info = diff_helper.load_info("./bp_align_paddle.npy") + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(diff_threshold=1e-2, path="bp_align_diff.log") diff --git a/examples/torch_migration/pipeline/Step4/test_bp.py b/examples/torch_migration/pipeline/Step4/test_bp.py new file mode 100644 index 0000000000000000000000000000000000000000..1cf28c97bcad3cc9e17016698ea53ca44d28cb1b --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/test_bp.py @@ -0,0 +1,127 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
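+# test_bp.py checks backward alignment: both models load the same converted weights,
+# run a few optimization steps on identical fake data with AdamW (lr=3e-5, weight
+# decay skipped for bias/LayerNorm parameters), and the per-step losses are dumped
+# to bp_align_torch.npy / bp_align_paddle.npy for check_step4.py to compare.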
+import os +import sys + +import numpy as np +import paddle +import torch +from reprod_log import ReprodLogger +from transformers import AdamW + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +# isort: off + +from models.pd_bert import BertConfig as PDBertConfig # noqa: E402 +from models.pd_bert import ( # noqa: E402 + BertForSequenceClassification as PDBertForSequenceClassification, +) +from models.pt_bert import BertConfig as HFBertConfig # noqa: E402 +from models.pt_bert import ( # noqa: E402 + BertForSequenceClassification as HFBertForSequenceClassification, +) + +# isort: on + + +def pd_train_some_iters(fake_data, fake_label, max_iter=2): + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = PDBertConfig() + model = PDBertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + criterion = paddle.nn.CrossEntropyLoss() + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=3e-5, + parameters=model.parameters(), + weight_decay=1e-2, + epsilon=1e-6, + apply_decay_param_fun=lambda x: x in decay_params, + ) + loss_list = [] + for idx in range(max_iter): + input_ids = paddle.to_tensor(fake_data) + labels = paddle.to_tensor(fake_label) + + output = model(input_ids)[0] + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(loss) + return loss_list + + +def hf_train_some_iters(fake_data, fake_label, max_iter=2): + + pytorch_dump_path = "../weights/torch_weight.bin" + config = HFBertConfig() + model = HFBertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + classifier_weights = torch.load("../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.eval() + criterion = torch.nn.CrossEntropyLoss() + no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": 1e-2, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5) + + loss_list = [] + for idx in range(max_iter): + input_ids = torch.from_numpy(fake_data) + labels = torch.from_numpy(fake_label) + + output = model(input_ids)[0] + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.zero_grad() + loss_list.append(loss) + return loss_list + + +if __name__ == "__main__": + print("Start training") + paddle.set_device("cpu") + fake_data = np.load("../fake_data/fake_data.npy") + fake_label = np.load("../fake_data/fake_label.npy") + hf_reprod_logger = ReprodLogger() + hf_loss_list = hf_train_some_iters(fake_data, fake_label, 10) + for idx, loss in enumerate(hf_loss_list): + hf_reprod_logger.add(f"loss_{idx}", loss.detach().cpu().numpy()) + hf_reprod_logger.save("bp_align_torch.npy") + + pd_reprod_logger = ReprodLogger() + pd_loss_list = pd_train_some_iters(fake_data, fake_label, 10) + for idx, loss in enumerate(pd_loss_list): + 
pd_reprod_logger.add(f"loss_{idx}", loss.detach().cpu().numpy()) + pd_reprod_logger.save("bp_align_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py b/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py new file mode 100644 index 0000000000000000000000000000000000000000..caab6bd1432e91f639ba0aa36c7067293bfe140b --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import torch +from reprod_log import ReprodDiffHelper, ReprodLogger +from torch.optim import AdamW +from transformers.optimization import get_scheduler as get_hf_scheduler + +# define paddle scheduler +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "polynomial": PolyDecayWithWarmup, +} + + +def get_paddle_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def test_lr(): + diff_helper = ReprodDiffHelper() + pd_reprod_logger = ReprodLogger() + hf_reprod_logger = ReprodLogger() + lr = 3e-5 + num_warmup_steps = 345 + num_training_steps = 1024 + milestone = [100, 300, 500, 700, 900] + for scheduler_type in ["linear", "cosine", "polynomial"]: + torch_optimizer = AdamW(torch.nn.Linear(1, 1).parameters(), lr=lr) + hf_scheduler = get_hf_scheduler( + name=scheduler_type, + optimizer=torch_optimizer, + num_warmup_steps=num_warmup_steps, + num_training_steps=num_training_steps, + ) + pd_scheduler = get_paddle_scheduler( + learning_rate=lr, + scheduler_type=scheduler_type, + num_warmup_steps=num_warmup_steps, + num_training_steps=num_training_steps, + ) + + for i in range(num_training_steps): + hf_scheduler.step() + pd_scheduler.step() + if i in milestone: + hf_reprod_logger.add( + f"step_{i}_{scheduler_type}_lr", + np.array([hf_scheduler.get_last_lr()[-1]]), + ) + pd_reprod_logger.add(f"step_{i}_{scheduler_type}_lr", np.array([pd_scheduler.get_lr()])) + + diff_helper.compare_info(hf_reprod_logger.data, pd_reprod_logger.data) + diff_helper.report() + + +if __name__ == "__main__": + test_lr() diff --git a/examples/torch_migration/pipeline/Step5/README.md b/examples/torch_migration/pipeline/Step5/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..bab96301cac7c0e808aac92475e640e4d1ac05e5 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/README.md @@ -0,0 +1,29 @@ +# 使用方法 + +首先运行下面的python代码,生成`train_align_torch.npy`和`train_align_paddle.npy`文件。 + +```python +# 运行生成paddle结果 +cd bert_paddle/ +sh train.sh +# 运行生成torch结果 +cd bert_torch/ +sh train.sh +``` + +然后运行下面的代码,运行训练脚本;之后使用`check_step5.py`进行精度diff验证。 + +```shell +# 对比生成log +python check_step5.py +``` + +这里需要注意的是,由于是精度对齐,SST-2数据集的精度diff在0.15%以内时,可以认为对齐,因此将`diff_threshold`参数修改为了`0.0015`。 + +``` +[2021/11/17 22:41:12] root INFO: acc: +[2021/11/17 22:41:12] root INFO: mean diff: check passed: True, value: 0.0011467889908256534 +[2021/11/17 22:41:12] root INFO: diff check passed +``` + +最终diff为`0.00114`,小于阈值标准,检查通过。 diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/train.py b/examples/torch_migration/pipeline/Step5/bert_paddle/train.py new file mode 100644 index 0000000000000000000000000000000000000000..425b4f47c01ae91accf921a826b2144f90e1126b --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_paddle/train.py @@ -0,0 +1,313 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +import utils +from paddle.metric import Accuracy +from paddle.optimizer import AdamW +from reprod_log import ReprodLogger + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import BertTokenizer + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 2)[0] +sys.path.append(CONFIG_PATH) + +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + + +def train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + data_loader, + epoch, + print_freq, + scaler=None, +): + model.train() + metric_logger = utils.MetricLogger(delimiter=" ") + metric_logger.add_meter("lr", utils.SmoothedValue(window_size=1, fmt="{value}")) + metric_logger.add_meter("sentence/s", utils.SmoothedValue(window_size=10, fmt="{value}")) + + header = "Epoch: [{}]".format(epoch) + for batch in metric_logger.log_every(data_loader, print_freq, header): + inputs = {"input_ids": batch[0], "token_type_ids": batch[1]} + labels = batch[2] + start_time = time.time() + with paddle.amp.auto_cast( + enable=scaler is not None, + custom_white_list=["layer_norm", "softmax", "gelu"], + ): + logits = model(**inputs)[0] + loss = criterion( + logits.reshape([-1, 2]), + labels.reshape( + [ + -1, + ] + ), + ) + + optimizer.clear_grad() + if scaler is not None: + scaler.scale(loss).backward() + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + batch_size = inputs["input_ids"].shape[0] + metric_logger.update(loss=loss.item(), lr=lr_scheduler.get_lr()) + 
metric_logger.meters["sentence/s"].update(batch_size / (time.time() - start_time)) + + +def evaluate(model, criterion, data_loader, metric, print_freq=100): + model.eval() + metric.reset() + metric_logger = utils.MetricLogger(delimiter=" ") + header = "Test:" + with paddle.no_grad(): + for batch in metric_logger.log_every(data_loader, print_freq, header): + inputs = {"input_ids": batch[0], "token_type_ids": batch[1]} + labels = batch[2] + logits = model(**inputs)[0] + loss = criterion( + logits.reshape([-1, 2]), + labels.reshape( + [ + -1, + ] + ), + ) + metric_logger.update(loss=loss.item()) + correct = metric.compute(logits, labels) + metric.update(correct) + acc_global_avg = metric.accumulate() + # gather the stats from all processes + metric_logger.synchronize_between_processes() + print(" * Accuracy {acc_global_avg:.6f}".format(acc_global_avg=acc_global_avg)) + return acc_global_avg + + +def set_seed(seed=42): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def convert_example(example, tokenizer, max_length=128): + labels = np.array([example["labels"]], dtype="int64") + example = tokenizer(example["sentence"], max_seq_len=max_length) + return { + "input_ids": example["input_ids"], + "token_type_ids": example["token_type_ids"], + "labels": labels, + } + + +def load_data(args, tokenizer): + print("Loading data") + train_ds = load_dataset("glue", args.task_name, splits="train") + validation_ds = load_dataset("glue", args.task_name, splits="dev") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_length=args.max_length) + train_ds = train_ds.map(trans_func, lazy=False) + validation_ds = validation_ds.map(trans_func, lazy=False) + + train_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + validation_sampler = paddle.io.BatchSampler(validation_ds, batch_size=args.batch_size, shuffle=False) + + return train_ds, validation_ds, train_sampler, validation_sampler + + +def main(args): + if args.output_dir: + pass + # utils.mkdir(args.output_dir) + print(args) + scaler = None + # if args.fp16: + # scaler = paddle.amp.GradScaler() + paddle.set_device(args.device) + + if args.seed is not None: + set_seed(args.seed) + + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "labels": Stack(dtype="int64"), + } + ): fn(samples) + train_dataset, validation_dataset, train_sampler, validation_sampler = load_data(args, tokenizer) + + train_data_loader = paddle.io.DataLoader( + train_dataset, + batch_sampler=train_sampler, + num_workers=args.workers, + collate_fn=batchify_fn, + ) + validation_data_loader = paddle.io.DataLoader( + validation_dataset, + batch_sampler=validation_sampler, + num_workers=args.workers, + collate_fn=batchify_fn, + ) + + print("Creating model") + paddle_dump_path = "../../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + + print("Creating criterion") + criterion = nn.CrossEntropyLoss() + + print("Creating lr_scheduler") + lr_scheduler = utils.get_scheduler( + learning_rate=args.lr, + scheduler_type=args.lr_scheduler_type, + 
num_warmup_steps=args.num_warmup_steps, + num_training_steps=args.num_train_epochs * len(train_data_loader), + ) + + print("Creating optimizer") + # Split weights in two groups, one with weight decay and the other not. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + epsilon=1e-6, + apply_decay_param_fun=lambda x: x in decay_params, + ) + metric = Accuracy() + + if args.test_only: + evaluate(model, criterion, validation_data_loader, metric) + return + + print("Start training") + start_time = time.time() + best_accuracy = 0.0 + for epoch in range(args.num_train_epochs): + + train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + train_data_loader, + epoch, + args.print_freq, + scaler, + ) + acc = evaluate(model, criterion, validation_data_loader, metric) + best_accuracy = max(best_accuracy, acc) + if args.output_dir: + pass + + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("Training time {}".format(total_time_str)) + return best_accuracy + + +def get_args_parser(add_help=True): + import argparse + + parser = argparse.ArgumentParser(description="Paddle SST-2 Classification Training", add_help=add_help) + parser.add_argument("--task_name", default="sst-2", help="the name of the glue task to train on.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + help="path to pretrained model or model identifier from huggingface.co/models.", + ) + parser.add_argument("--device", default="gpu", help="device") + parser.add_argument("--batch_size", default=32, type=int) + parser.add_argument( + "--max_length", + type=int, + default=128, + help=( + "The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated," + ), + ) + parser.add_argument("--num_train_epochs", default=3, type=int, help="number of total epochs to run") + parser.add_argument( + "--workers", + default=0, + type=int, + help="number of data loading workers (default: 16)", + ) + parser.add_argument("--lr", default=3e-5, type=float, help="initial learning rate") + parser.add_argument( + "--weight_decay", + default=1e-2, + type=float, + help="weight decay (default: 1e-2)", + dest="weight_decay", + ) + parser.add_argument( + "--lr_scheduler_type", + default="linear", + help="the scheduler type to use.", + choices=["linear", "cosine", "polynomial"], + ) + parser.add_argument( + "--num_warmup_steps", + default=0, + type=int, + help="number of steps for the warmup in the lr scheduler.", + ) + parser.add_argument("--print_freq", default=10, type=int, help="print frequency") + parser.add_argument("--output_dir", default="outputs", help="path where to save") + parser.add_argument( + "--test_only", + help="only test the model", + action="store_true", + ) + parser.add_argument("--seed", default=42, type=int, help="a seed for reproducible training.") + # Mixed precision training parameters + parser.add_argument("--fp16", action="store_true", help="whether or not mixed precision training") + + return parser + + +if __name__ == "__main__": + args = get_args_parser().parse_args() + acc = main(args) + reprod_logger = ReprodLogger() + reprod_logger.add("acc", np.array([acc])) + reprod_logger.save("train_align_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh b/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c5e367f6404ac9f8df12e554ffe6ef67c308e0a --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh @@ -0,0 +1,20 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle.distributed.launch --gpus "1" train.py \ + --model_name_or_path bert-base-uncased \ + --batch_size 128 \ + --num_warmup_steps 158 \ + --output_dir paddle_outputs + diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py b/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..101eeca84cd4ea911a52020df0839c24cdb43cec --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py @@ -0,0 +1,213 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import errno +import os +import time +from collections import defaultdict, deque + +import paddle + +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. + """ + + def __init__(self, window_size=20, fmt=None): + if fmt is None: + fmt = "{median:.4f} ({global_avg:.4f})" + self.deque = deque(maxlen=window_size) + self.total = 0.0 + self.count = 0 + self.fmt = fmt + + def update(self, value, n=1): + self.deque.append(value) + self.count += n + self.total += value * n + + def synchronize_between_processes(self): + """ + Warning: does not synchronize the deque! + """ + t = paddle.to_tensor([self.count, self.total], dtype="float64") + t = t.numpy().tolist() + self.count = int(t[0]) + self.total = t[1] + + @property + def median(self): + d = paddle.to_tensor(list(self.deque)) + return d.median().item() + + @property + def avg(self): + d = paddle.to_tensor(list(self.deque), dtype="float32") + return d.mean().item() + + @property + def global_avg(self): + return self.total / self.count + + @property + def max(self): + return max(self.deque) + + @property + def value(self): + return self.deque[-1] + + def __str__(self): + return self.fmt.format( + median=self.median, + avg=self.avg, + global_avg=self.global_avg, + max=self.max, + value=self.value, + ) + + +class MetricLogger(object): + def __init__(self, delimiter="\t"): + self.meters = defaultdict(SmoothedValue) + self.delimiter = delimiter + + def update(self, **kwargs): + for k, v in kwargs.items(): + if isinstance(v, paddle.Tensor): + v = v.item() + assert isinstance(v, (float, int)) + self.meters[k].update(v) + + def __getattr__(self, attr): + if attr in self.meters: + return self.meters[attr] + if attr in self.__dict__: + return self.__dict__[attr] + raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, attr)) + + def __str__(self): + loss_str = [] + for name, meter in self.meters.items(): + loss_str.append("{}: {}".format(name, str(meter))) + return self.delimiter.join(loss_str) + + def synchronize_between_processes(self): + for meter in self.meters.values(): + meter.synchronize_between_processes() + + def add_meter(self, name, meter): + self.meters[name] = meter + + def log_every(self, iterable, print_freq, header=None): + i = 0 + if not header: + header = "" + start_time = time.time() + end = time.time() + iter_time = SmoothedValue(fmt="{avg:.4f}") + data_time = SmoothedValue(fmt="{avg:.4f}") + space_fmt = ":" + str(len(str(len(iterable)))) + "d" + if paddle.device.is_compiled_with_cuda(): + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + ] + ) + else: + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + ] + ) + for obj in iterable: + data_time.update(time.time() - end) + yield obj + iter_time.update(time.time() - end) + if i % print_freq == 0: + eta_seconds = iter_time.global_avg * (len(iterable) - i) + eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) + print( + log_msg.format( + i, + len(iterable), + eta=eta_string, + meters=str(self), + time=str(iter_time), + data=str(data_time), 
+ ) + ) + i += 1 + end = time.time() + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("{} Total time: {}".format(header, total_time_str)) + + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "polynomial": PolyDecayWithWarmup, +} + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def mkdir(path): + try: + os.makedirs(path) + except OSError as e: + if e.errno != errno.EEXIST: + raise diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py b/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py new file mode 100644 index 0000000000000000000000000000000000000000..aecc76a72f54fb6a3c71404597769806d7847466 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py @@ -0,0 +1,90 @@ +# coding=utf-8 +# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Accuracy metric.""" + +import datasets +from sklearn.metrics import accuracy_score + +_DESCRIPTION = """ +Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: +Accuracy = (TP + TN) / (TP + TN + FP + FN) +TP: True positive +TN: True negative +FP: False positive +FN: False negative +""" + +_KWARGS_DESCRIPTION = """ +Args: + predictions: Predicted labels, as returned by a model. + references: Ground truth labels. + normalize: If False, return the number of correctly classified samples. + Otherwise, return the fraction of correctly classified samples. + sample_weight: Sample weights. +Returns: + accuracy: Accuracy score. +Examples: + + >>> accuracy_metric = datasets.load_metric("accuracy") + >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1]) + >>> print(results) + {'accuracy': 1.0} +""" + +_CITATION = """\ +@article{scikit-learn, + title={Scikit-learn: Machine Learning in {P}ython}, + author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. + and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. + and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and + Cournapeau, D. and Brucher, M. and Perrot, M. 
and Duchesnay, E.}, + journal={Journal of Machine Learning Research}, + volume={12}, + pages={2825--2830}, + year={2011} +} +""" + + +@datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION) +class Accuracy(datasets.Metric): + def _info(self): + return datasets.MetricInfo( + description=_DESCRIPTION, + citation=_CITATION, + inputs_description=_KWARGS_DESCRIPTION, + features=datasets.Features( + { + "predictions": datasets.Sequence(datasets.Value("int32")), + "references": datasets.Sequence(datasets.Value("int32")), + } + if self.config_name == "multilabel" + else { + "predictions": datasets.Value("int32"), + "references": datasets.Value("int32"), + } + ), + reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"], + ) + + def _compute(self, predictions, references, normalize=True, sample_weight=None): + return { + "accuracy": accuracy_score( + references, + predictions, + normalize=normalize, + sample_weight=sample_weight, + ).item(), + } diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/glue.py b/examples/torch_migration/pipeline/Step5/bert_torch/glue.py new file mode 100644 index 0000000000000000000000000000000000000000..10cec84128a579c3b290eb847e720c16f03cb4c7 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/glue.py @@ -0,0 +1,625 @@ +# coding=utf-8 +# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Lint as: python3 +"""The General Language Understanding Evaluation (GLUE) benchmark.""" + +import csv +import os +import textwrap + +import datasets +import numpy as np + +_GLUE_CITATION = """\ +@inproceedings{wang2019glue, + title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding}, + author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.}, + note={In the Proceedings of ICLR.}, + year={2019} +} +""" + +_GLUE_DESCRIPTION = """\ +GLUE, the General Language Understanding Evaluation benchmark +(https://gluebenchmark.com/) is a collection of resources for training, +evaluating, and analyzing natural language understanding systems. 
+ +""" + +_MRPC_DEV_IDS = "https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv" +_MRPC_TRAIN = "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt" +_MRPC_TEST = "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt" + +_MNLI_BASE_KWARGS = dict( + text_features={ + "premise": "sentence1", + "hypothesis": "sentence2", + }, + label_classes=["entailment", "neutral", "contradiction"], + label_column="gold_label", + data_url="https://dl.fbaipublicfiles.com/glue/data/MNLI.zip", + data_dir="MNLI", + citation=textwrap.dedent( + """\ + @InProceedings{N18-1101, + author = "Williams, Adina + and Nangia, Nikita + and Bowman, Samuel", + title = "A Broad-Coverage Challenge Corpus for + Sentence Understanding through Inference", + booktitle = "Proceedings of the 2018 Conference of + the North American Chapter of the + Association for Computational Linguistics: + Human Language Technologies, Volume 1 (Long + Papers)", + year = "2018", + publisher = "Association for Computational Linguistics", + pages = "1112--1122", + location = "New Orleans, Louisiana", + url = "http://aclweb.org/anthology/N18-1101" + } + @article{bowman2015large, + title={A large annotated corpus for learning natural language inference}, + author={Bowman, Samuel R and Angeli, Gabor and Potts, Christopher and Manning, Christopher D}, + journal={arXiv preprint arXiv:1508.05326}, + year={2015} + }""" + ), + url="http://www.nyu.edu/projects/bowman/multinli/", +) + + +class GlueConfig(datasets.BuilderConfig): + """BuilderConfig for GLUE.""" + + def __init__( + self, + text_features, + label_column, + data_url, + data_dir, + citation, + url, + label_classes=None, + process_label=lambda x: x, + **kwargs, + ): + """BuilderConfig for GLUE. + + Args: + text_features: `dict[string, string]`, map from the name of the feature + dict for each text field to the name of the column in the tsv file + label_column: `string`, name of the column in the tsv file corresponding + to the label + data_url: `string`, url to download the zip file from + data_dir: `string`, the path to the folder containing the tsv files in the + downloaded zip + citation: `string`, citation for the data set + url: `string`, url for information about the data set + label_classes: `list[string]`, the list of classes if the label is + categorical. If not provided, then the label will be of type + `datasets.Value('float32')`. + process_label: `Function[string, any]`, function taking in the raw value + of the label and processing it to the form required by the label feature + **kwargs: keyword arguments forwarded to super. + """ + super(GlueConfig, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs) + self.text_features = text_features + self.label_column = label_column + self.label_classes = label_classes + self.data_url = data_url + self.data_dir = data_dir + self.citation = citation + self.url = url + self.process_label = process_label + + +class Glue(datasets.GeneratorBasedBuilder): + """The General Language Understanding Evaluation (GLUE) benchmark.""" + + BUILDER_CONFIGS = [ + GlueConfig( + name="cola", + description=textwrap.dedent( + """\ + The Corpus of Linguistic Acceptability consists of English + acceptability judgments drawn from books and journal articles on + linguistic theory. 
Each example is a sequence of words annotated + with whether it is a grammatical English sentence.""" + ), + text_features={"sentence": "sentence"}, + label_classes=["unacceptable", "acceptable"], + label_column="is_acceptable", + data_url="https://dl.fbaipublicfiles.com/glue/data/CoLA.zip", + data_dir="CoLA", + citation=textwrap.dedent( + """\ + @article{warstadt2018neural, + title={Neural Network Acceptability Judgments}, + author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R}, + journal={arXiv preprint arXiv:1805.12471}, + year={2018} + }""" + ), + url="https://nyu-mll.github.io/CoLA/", + ), + GlueConfig( + name="sst2", + description=textwrap.dedent( + """\ + The Stanford Sentiment Treebank consists of sentences from movie reviews and + human annotations of their sentiment. The task is to predict the sentiment of a + given sentence. We use the two-way (positive/negative) class split, and use only + sentence-level labels.""" + ), + text_features={"sentence": "sentence"}, + label_classes=["negative", "positive"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/SST-2.zip", + data_dir="SST-2", + citation=textwrap.dedent( + """\ + @inproceedings{socher2013recursive, + title={Recursive deep models for semantic compositionality over a sentiment treebank}, + author={Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D and Ng, Andrew and Potts, Christopher}, + booktitle={Proceedings of the 2013 conference on empirical methods in natural language processing}, + pages={1631--1642}, + year={2013} + }""" + ), + url="https://datasets.stanford.edu/sentiment/index.html", + ), + GlueConfig( + name="mrpc", + description=textwrap.dedent( + """\ + The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of + sentence pairs automatically extracted from online news sources, with human annotations + for whether the sentences in the pair are semantically equivalent.""" + ), # pylint: disable=line-too-long + text_features={"sentence1": "", "sentence2": ""}, + label_classes=["not_equivalent", "equivalent"], + label_column="Quality", + data_url="", # MRPC isn't hosted by GLUE. + data_dir="MRPC", + citation=textwrap.dedent( + """\ + @inproceedings{dolan2005automatically, + title={Automatically constructing a corpus of sentential paraphrases}, + author={Dolan, William B and Brockett, Chris}, + booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)}, + year={2005} + }""" + ), + url="https://www.microsoft.com/en-us/download/details.aspx?id=52398", + ), + GlueConfig( + name="qqp", + description=textwrap.dedent( + """\ + The Quora Question Pairs2 dataset is a collection of question pairs from the + community question-answering website Quora. 
The task is to determine whether a + pair of questions are semantically equivalent.""" + ), + text_features={ + "question1": "question1", + "question2": "question2", + }, + label_classes=["not_duplicate", "duplicate"], + label_column="is_duplicate", + data_url="https://dl.fbaipublicfiles.com/glue/data/QQP-clean.zip", + data_dir="QQP", + citation=textwrap.dedent( + """\ + @online{WinNT, + author = {Iyer, Shankar and Dandekar, Nikhil and Csernai, Kornel}, + title = {First Quora Dataset Release: Question Pairs}, + year = {2017}, + url = {https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs}, + urldate = {2019-04-03} + }""" + ), + url="https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs", + ), + GlueConfig( + name="stsb", + description=textwrap.dedent( + """\ + The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of + sentence pairs drawn from news headlines, video and image captions, and natural + language inference data. Each pair is human-annotated with a similarity score + from 1 to 5.""" + ), + text_features={ + "sentence1": "sentence1", + "sentence2": "sentence2", + }, + label_column="score", + data_url="https://dl.fbaipublicfiles.com/glue/data/STS-B.zip", + data_dir="STS-B", + citation=textwrap.dedent( + """\ + @article{cer2017semeval, + title={Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation}, + author={Cer, Daniel and Diab, Mona and Agirre, Eneko and Lopez-Gazpio, Inigo and Specia, Lucia}, + journal={arXiv preprint arXiv:1708.00055}, + year={2017} + }""" + ), + url="http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark", + process_label=np.float32, + ), + GlueConfig( + name="mnli", + description=textwrap.dedent( + """\ + The Multi-Genre Natural Language Inference Corpus is a crowdsourced + collection of sentence pairs with textual entailment annotations. Given a premise sentence + and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis + (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are + gathered from ten different sources, including transcribed speech, fiction, and government reports. + We use the standard test set, for which we obtained private labels from the authors, and evaluate + on both the matched (in-domain) and mismatched (cross-domain) section. We also use and recommend + the SNLI corpus as 550k examples of auxiliary training data.""" + ), + **_MNLI_BASE_KWARGS, + ), + GlueConfig( + name="mnli_mismatched", + description=textwrap.dedent( + """\ + The mismatched validation and test splits from MNLI. + See the "mnli" BuilderConfig for additional information.""" + ), + **_MNLI_BASE_KWARGS, + ), + GlueConfig( + name="mnli_matched", + description=textwrap.dedent( + """\ + The matched validation and test splits from MNLI. + See the "mnli" BuilderConfig for additional information.""" + ), + **_MNLI_BASE_KWARGS, + ), + GlueConfig( + name="qnli", + description=textwrap.dedent( + """\ + The Stanford Question Answering Dataset is a question-answering + dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn + from Wikipedia) contains the answer to the corresponding question (written by an annotator). We + convert the task into sentence pair classification by forming a pair between each question and each + sentence in the corresponding context, and filtering out pairs with low lexical overlap between the + question and the context sentence. 
The task is to determine whether the context sentence contains + the answer to the question. This modified version of the original task removes the requirement that + the model select the exact answer, but also removes the simplifying assumptions that the answer + is always present in the input and that lexical overlap is a reliable cue.""" + ), # pylint: disable=line-too-long + text_features={ + "question": "question", + "sentence": "sentence", + }, + label_classes=["entailment", "not_entailment"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/QNLIv2.zip", + data_dir="QNLI", + citation=textwrap.dedent( + """\ + @article{rajpurkar2016squad, + title={Squad: 100,000+ questions for machine comprehension of text}, + author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy}, + journal={arXiv preprint arXiv:1606.05250}, + year={2016} + }""" + ), + url="https://rajpurkar.github.io/SQuAD-explorer/", + ), + GlueConfig( + name="rte", + description=textwrap.dedent( + """\ + The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual + entailment challenges. We combine the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim + et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009).4 Examples are + constructed based on news and Wikipedia text. We convert all datasets to a two-class split, where + for three-class datasets we collapse neutral and contradiction into not entailment, for consistency.""" + ), # pylint: disable=line-too-long + text_features={ + "sentence1": "sentence1", + "sentence2": "sentence2", + }, + label_classes=["entailment", "not_entailment"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/RTE.zip", + data_dir="RTE", + citation=textwrap.dedent( + """\ + @inproceedings{dagan2005pascal, + title={The PASCAL recognising textual entailment challenge}, + author={Dagan, Ido and Glickman, Oren and Magnini, Bernardo}, + booktitle={Machine Learning Challenges Workshop}, + pages={177--190}, + year={2005}, + organization={Springer} + } + @inproceedings{bar2006second, + title={The second pascal recognising textual entailment challenge}, + author={Bar-Haim, Roy and Dagan, Ido and Dolan, Bill and Ferro, Lisa and Giampiccolo, Danilo and Magnini, Bernardo and Szpektor, Idan}, + booktitle={Proceedings of the second PASCAL challenges workshop on recognising textual entailment}, + volume={6}, + number={1}, + pages={6--4}, + year={2006}, + organization={Venice} + } + @inproceedings{giampiccolo2007third, + title={The third pascal recognizing textual entailment challenge}, + author={Giampiccolo, Danilo and Magnini, Bernardo and Dagan, Ido and Dolan, Bill}, + booktitle={Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing}, + pages={1--9}, + year={2007}, + organization={Association for Computational Linguistics} + } + @inproceedings{bentivogli2009fifth, + title={The Fifth PASCAL Recognizing Textual Entailment Challenge.}, + author={Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Giampiccolo, Danilo}, + booktitle={TAC}, + year={2009} + }""" + ), + url="https://aclweb.org/aclwiki/Recognizing_Textual_Entailment", + ), + GlueConfig( + name="wnli", + description=textwrap.dedent( + """\ + The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task + in which a system must read a sentence with a pronoun and select the referent of that pronoun from + a list of choices. 
The examples are manually constructed to foil simple statistical methods: Each + one is contingent on contextual information provided by a single word or phrase in the sentence. + To convert the problem into sentence pair classification, we construct sentence pairs by replacing + the ambiguous pronoun with each possible referent. The task is to predict if the sentence with the + pronoun substituted is entailed by the original sentence. We use a small evaluation set consisting of + new examples derived from fiction books that was shared privately by the authors of the original + corpus. While the included training set is balanced between two classes, the test set is imbalanced + between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: + hypotheses are sometimes shared between training and development examples, so if a model memorizes the + training examples, they will predict the wrong label on corresponding development set + example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence + between a model's score on this task and its score on the unconverted original task. We + call converted dataset WNLI (Winograd NLI).""" + ), + text_features={ + "sentence1": "sentence1", + "sentence2": "sentence2", + }, + label_classes=["not_entailment", "entailment"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/WNLI.zip", + data_dir="WNLI", + citation=textwrap.dedent( + """\ + @inproceedings{levesque2012winograd, + title={The winograd schema challenge}, + author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora}, + booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning}, + year={2012} + }""" + ), + url="https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html", + ), + GlueConfig( + name="ax", + description=textwrap.dedent( + """\ + A manually-curated evaluation dataset for fine-grained analysis of + system performance on a broad range of linguistic phenomena. This + dataset evaluates sentence understanding through Natural Language + Inference (NLI) problems. Use a model trained on MulitNLI to produce + predictions for this dataset.""" + ), + text_features={ + "premise": "sentence1", + "hypothesis": "sentence2", + }, + label_classes=["entailment", "neutral", "contradiction"], + label_column="", # No label since we only have test set. + # We must use a URL shortener since the URL from GLUE is very long and + # causes issues in TFDS. + data_url="https://dl.fbaipublicfiles.com/glue/data/AX.tsv", + data_dir="", # We are downloading a tsv. + citation="", # The GLUE citation is sufficient. 
+ url="https://gluebenchmark.com/diagnostics", + ), + ] + + def _info(self): + features = {text_feature: datasets.Value("string") for text_feature in self.config.text_features.keys()} + if self.config.label_classes: + features["label"] = datasets.features.ClassLabel(names=self.config.label_classes) + else: + features["label"] = datasets.Value("float32") + features["idx"] = datasets.Value("int32") + return datasets.DatasetInfo( + description=_GLUE_DESCRIPTION, + features=datasets.Features(features), + homepage=self.config.url, + citation=self.config.citation + "\n" + _GLUE_CITATION, + ) + + def _split_generators(self, dl_manager): + if self.config.name == "ax": + data_file = dl_manager.download(self.config.data_url) + return [ + datasets.SplitGenerator( + name=datasets.Split.TEST, + gen_kwargs={ + "data_file": data_file, + "split": "test", + }, + ) + ] + + if self.config.name == "mrpc": + data_dir = None + mrpc_files = dl_manager.download( + { + "dev_ids": _MRPC_DEV_IDS, + "train": _MRPC_TRAIN, + "test": _MRPC_TEST, + } + ) + else: + dl_dir = dl_manager.download_and_extract(self.config.data_url) + data_dir = os.path.join(dl_dir, self.config.data_dir) + mrpc_files = None + train_split = datasets.SplitGenerator( + name=datasets.Split.TRAIN, + gen_kwargs={ + "data_file": os.path.join(data_dir or "", "train.tsv"), + "split": "train", + "mrpc_files": mrpc_files, + }, + ) + if self.config.name == "mnli": + return [ + train_split, + _mnli_split_generator("validation_matched", data_dir, "dev", matched=True), + _mnli_split_generator("validation_mismatched", data_dir, "dev", matched=False), + _mnli_split_generator("test_matched", data_dir, "test", matched=True), + _mnli_split_generator("test_mismatched", data_dir, "test", matched=False), + ] + elif self.config.name == "mnli_matched": + return [ + _mnli_split_generator("validation", data_dir, "dev", matched=True), + _mnli_split_generator("test", data_dir, "test", matched=True), + ] + elif self.config.name == "mnli_mismatched": + return [ + _mnli_split_generator("validation", data_dir, "dev", matched=False), + _mnli_split_generator("test", data_dir, "test", matched=False), + ] + else: + return [ + train_split, + datasets.SplitGenerator( + name=datasets.Split.VALIDATION, + gen_kwargs={ + "data_file": os.path.join(data_dir or "", "dev.tsv"), + "split": "dev", + "mrpc_files": mrpc_files, + }, + ), + datasets.SplitGenerator( + name=datasets.Split.TEST, + gen_kwargs={ + "data_file": os.path.join(data_dir or "", "test.tsv"), + "split": "test", + "mrpc_files": mrpc_files, + }, + ), + ] + + def _generate_examples(self, data_file, split, mrpc_files=None): + if self.config.name == "mrpc": + # We have to prepare the MRPC dataset from the original sources ourselves. + examples = self._generate_example_mrpc_files(mrpc_files=mrpc_files, split=split) + for example in examples: + yield example["idx"], example + else: + process_label = self.config.process_label + label_classes = self.config.label_classes + + # The train and dev files for CoLA are the only tsv files without a + # header. 
+ is_cola_non_test = self.config.name == "cola" and split != "test" + + with open(data_file, encoding="utf8") as f: + reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + if is_cola_non_test: + reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + + for n, row in enumerate(reader): + if is_cola_non_test: + row = { + "sentence": row[3], + "is_acceptable": row[1], + } + + example = {feat: row[col] for feat, col in self.config.text_features.items()} + example["idx"] = n + + if self.config.label_column in row: + label = row[self.config.label_column] + # For some tasks, the label is represented as 0 and 1 in the tsv + # files and needs to be cast to integer to work with the feature. + if label_classes and label not in label_classes: + label = int(label) if label else None + example["label"] = process_label(label) + else: + example["label"] = process_label(-1) + + # Filter out corrupted rows. + for value in example.values(): + if value is None: + break + else: + yield example["idx"], example + + def _generate_example_mrpc_files(self, mrpc_files, split): + if split == "test": + with open(mrpc_files["test"], encoding="utf8") as f: + # The first 3 bytes are the utf-8 BOM \xef\xbb\xbf, which messes with + # the Quality key. + f.seek(3) + reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + for n, row in enumerate(reader): + yield { + "sentence1": row["#1 String"], + "sentence2": row["#2 String"], + "label": int(row["Quality"]), + "idx": n, + } + else: + with open(mrpc_files["dev_ids"], encoding="utf8") as f: + reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + dev_ids = [[row[0], row[1]] for row in reader] + with open(mrpc_files["train"], encoding="utf8") as f: + # The first 3 bytes are the utf-8 BOM \xef\xbb\xbf, which messes with + # the Quality key. + f.seek(3) + reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + for n, row in enumerate(reader): + is_row_in_dev = [row["#1 ID"], row["#2 ID"]] in dev_ids + if is_row_in_dev == (split == "dev"): + yield { + "sentence1": row["#1 String"], + "sentence2": row["#2 String"], + "label": int(row["Quality"]), + "idx": n, + } + + +def _mnli_split_generator(name, data_dir, split, matched): + return datasets.SplitGenerator( + name=name, + gen_kwargs={ + "data_file": os.path.join(data_dir, "%s_%s.tsv" % (split, "matched" if matched else "mismatched")), + "split": split, + "mrpc_files": None, + }, + ) diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/train.py b/examples/torch_migration/pipeline/Step5/bert_torch/train.py new file mode 100644 index 0000000000000000000000000000000000000000..b236071f2a695f92428189aabfec91cab2f008b3 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/train.py @@ -0,0 +1,326 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
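The local `glue.py` builder and `accuracy.py` metric above are consumed by the training scripts through the HuggingFace `datasets` loaders. Below is a minimal sketch of that round trip, assuming both scripts sit next to the caller and that a `datasets` release which still ships `load_metric` is installed; the toy accuracy value is the one from the metric's own docstring.

```python
# Illustrative only: load the local GLUE builder and accuracy metric scripts.
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("glue.py", "sst2")  # "sst2" picks the SST-2 BuilderConfig above
metric = load_metric("accuracy.py")

# Each SST-2 example carries a "sentence", an integer "label" and an "idx".
print(raw_datasets["validation"][0])

# The metric accumulates batches and reports {"accuracy": ...} on compute().
metric.add_batch(predictions=[0, 1], references=[0, 1])
print(metric.compute())  # {'accuracy': 1.0}
```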
+ +import datetime +import os +import random +import sys +import time + +import numpy as np +import torch +import torch.utils.data +import utils +from datasets import load_dataset, load_metric +from reprod_log import ReprodLogger +from torch import nn +from transformers import AdamW, BertTokenizer, DataCollatorWithPadding, get_scheduler + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 2)[0] +sys.path.append(CONFIG_PATH) + +from models.pt_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +task_to_keys = { + "cola": ("sentence", None), + "mnli": ("premise", "hypothesis"), + "mrpc": ("sentence1", "sentence2"), + "qnli": ("question", "sentence"), + "qqp": ("question1", "question2"), + "rte": ("sentence1", "sentence2"), + "sst2": ("sentence", None), + "stsb": ("sentence1", "sentence2"), + "wnli": ("sentence1", "sentence2"), +} + + +def train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + data_loader, + device, + epoch, + print_freq, + scaler=None, +): + model.train() + metric_logger = utils.MetricLogger(delimiter=" ") + metric_logger.add_meter("lr", utils.SmoothedValue(window_size=1, fmt="{value}")) + metric_logger.add_meter("sentence/s", utils.SmoothedValue(window_size=10, fmt="{value}")) + + header = "Epoch: [{}]".format(epoch) + for batch in metric_logger.log_every(data_loader, print_freq, header): + start_time = time.time() + batch.to(device) + labels = batch.pop("labels") + with torch.cuda.amp.autocast(enabled=scaler is not None): + logits = model(**batch)[0] + loss = criterion(logits.reshape(-1, 2), labels.reshape(-1)) + + optimizer.zero_grad() + if scaler is not None: + scaler.scale(loss).backward() + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + batch_size = batch["input_ids"].shape[0] + metric_logger.update(loss=loss.item(), lr=lr_scheduler.get_last_lr()[-1]) + metric_logger.meters["sentence/s"].update(batch_size / (time.time() - start_time)) + + +def evaluate(model, criterion, data_loader, device, metric, print_freq=100): + model.eval() + metric_logger = utils.MetricLogger(delimiter=" ") + header = "Test:" + with torch.no_grad(): + for batch in metric_logger.log_every(data_loader, print_freq, header): + batch.to(device) + labels = batch.pop("labels") + logits = model(**batch)[0] + loss = criterion(logits.reshape(-1, 2), labels.reshape(-1)) + metric_logger.update(loss=loss.item()) + metric.add_batch( + predictions=logits.argmax(dim=-1), + references=labels, + ) + acc_global_avg = metric.compute()["accuracy"] + # gather the stats from all processes + metric_logger.synchronize_between_processes() + print(" * Accuracy {acc_global_avg:.6f}".format(acc_global_avg=acc_global_avg)) + return acc_global_avg + + +def set_seed(seed=42): + random.seed(seed) + np.random.seed(seed) + torch.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + + +def load_data(args, tokenizer): + print("Loading data") + raw_datasets = load_dataset("glue.py", args.task_name, cache_dir=args.data_cache_dir) + sentence1_key, sentence2_key = task_to_keys[args.task_name] + + def preprocess_function(examples): + texts = ( + (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key]) + ) + result = tokenizer(*texts, padding=False, max_length=args.max_length, truncation=True) + + if "label" in examples: + result["labels"] = examples["label"] + return result + + train_ds = raw_datasets["train"].map( + preprocess_function, + 
batched=True, + remove_columns=raw_datasets["train"].column_names, + desc="Running tokenizer on train dataset", + new_fingerprint=f"train_tokenized_dataset_{args.task_name}", + ) + validation_ds = raw_datasets["validation"].map( + preprocess_function, + batched=True, + remove_columns=raw_datasets["validation"].column_names, + desc="Running tokenizer on validation dataset", + new_fingerprint=f"validation_tokenized_dataset_{args.task_name}", + ) + train_sampler = torch.utils.data.SequentialSampler(train_ds) + validation_sampler = torch.utils.data.SequentialSampler(validation_ds) + + return train_ds, validation_ds, train_sampler, validation_sampler + + +def main(args): + if args.output_dir: + utils.mkdir(args.output_dir) + print(args) + scaler = None + if args.fp16: + scaler = torch.cuda.amp.GradScaler() + device = torch.device(args.device) + torch.backends.cudnn.benchmark = True + + if args.seed is not None: + set_seed(args.seed) + + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if args.fp16 else None)) + train_dataset, validation_dataset, train_sampler, validation_sampler = load_data(args, tokenizer) + train_data_loader = torch.utils.data.DataLoader( + train_dataset, + batch_size=args.batch_size, + sampler=train_sampler, + num_workers=args.workers, + collate_fn=data_collator, + ) + + validation_data_loader = torch.utils.data.DataLoader( + validation_dataset, + batch_size=args.batch_size, + sampler=validation_sampler, + num_workers=args.workers, + collate_fn=data_collator, + ) + + print("Creating model") + pytorch_dump_path = "../../weights/torch_weight.bin" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + + classifier_weights = torch.load("../../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.to(device) + + print("Creating criterion") + criterion = nn.CrossEntropyLoss() + + print("Creating optimizer") + # Split weights in two groups, one with weight decay and the other not. 
+ no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": args.weight_decay, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.lr) + + print("Creating lr_scheduler") + lr_scheduler = get_scheduler( + name=args.lr_scheduler_type, + optimizer=optimizer, + num_warmup_steps=args.num_warmup_steps, + num_training_steps=args.num_train_epochs * len(train_data_loader), + ) + + metric = load_metric("accuracy.py") + if args.test_only: + evaluate(model, criterion, validation_data_loader, device=device) + return + + print("Start training") + start_time = time.time() + best_accuracy = 0.0 + for epoch in range(args.num_train_epochs): + train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + train_data_loader, + device, + epoch, + args.print_freq, + scaler, + ) + acc = evaluate(model, criterion, validation_data_loader, device=device, metric=metric) + best_accuracy = max(best_accuracy, acc) + if args.output_dir: + pass + + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("Training time {}".format(total_time_str)) + return best_accuracy + + +def get_args_parser(add_help=True): + import argparse + + parser = argparse.ArgumentParser(description="PyTorch SST-2 Classification Training", add_help=add_help) + parser.add_argument("--data_cache_dir", default="data_caches", help="data cache dir.") + parser.add_argument("--task_name", default="sst2", help="the name of the glue task to train on.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + help="path to pretrained model or model identifier from huggingface.co/models.", + ) + parser.add_argument("--device", default="cuda:2", help="device") + parser.add_argument("--batch_size", default=32, type=int) + parser.add_argument( + "--max_length", + type=int, + default=128, + help=( + "The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated," + ), + ) + parser.add_argument("--num_train_epochs", default=3, type=int, help="number of total epochs to run") + parser.add_argument( + "--workers", + default=0, + type=int, + help="number of data loading workers (default: 16)", + ) + parser.add_argument("--lr", default=3e-5, type=float, help="initial learning rate") + parser.add_argument( + "--weight_decay", + default=1e-2, + type=float, + help="weight decay (default: 1e-2)", + dest="weight_decay", + ) + parser.add_argument( + "--lr_scheduler_type", + default="linear", + help="the scheduler type to use.", + choices=[ + "linear", + "cosine", + "cosine_with_restarts", + "polynomial", + "constant", + "constant_with_warmup", + ], + ) + parser.add_argument( + "--num_warmup_steps", + default=0, + type=int, + help="number of steps for the warmup in the lr scheduler.", + ) + parser.add_argument("--print_freq", default=10, type=int, help="print frequency") + parser.add_argument("--output_dir", default="outputs", help="path where to save") + parser.add_argument( + "--test_only", + help="only test the model", + action="store_true", + ) + parser.add_argument("--seed", default=42, type=int, help="a seed for reproducible training.") + # Mixed precision training parameters + parser.add_argument("--fp16", action="store_true", help="whether or not mixed precision training") + + return parser + + +if __name__ == "__main__": + args = get_args_parser().parse_args() + acc = main(args) + reprod_logger = ReprodLogger() + reprod_logger.add("acc", np.array([acc])) + reprod_logger.save("train_align_benchmark.npy") diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/train.sh b/examples/torch_migration/pipeline/Step5/bert_torch/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..1d26be50340b7850bc36f33411045499bc34a221 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/train.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python train.py \ + --model_name_or_path bert-base-uncased \ + --batch_size 128 \ + --num_warmup_steps 158 \ + --output_dir bert_outputs \ \ No newline at end of file diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/utils.py b/examples/torch_migration/pipeline/Step5/bert_torch/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a77288f7823bd9a388e97d8713d6770199f3f974 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/utils.py @@ -0,0 +1,204 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import errno +import os +import time +from collections import defaultdict, deque + +import torch + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. + """ + + def __init__(self, window_size=20, fmt=None): + if fmt is None: + fmt = "{median:.4f} ({global_avg:.4f})" + self.deque = deque(maxlen=window_size) + self.total = 0.0 + self.count = 0 + self.fmt = fmt + + def update(self, value, n=1): + self.deque.append(value) + self.count += n + self.total += value * n + + def synchronize_between_processes(self): + """ + Warning: does not synchronize the deque! + """ + return + + @property + def median(self): + d = torch.tensor(list(self.deque)) + return d.median().item() + + @property + def avg(self): + d = torch.tensor(list(self.deque), dtype=torch.float32) + return d.mean().item() + + @property + def global_avg(self): + return self.total / self.count + + @property + def max(self): + return max(self.deque) + + @property + def value(self): + return self.deque[-1] + + def __str__(self): + return self.fmt.format( + median=self.median, + avg=self.avg, + global_avg=self.global_avg, + max=self.max, + value=self.value, + ) + + +class MetricLogger(object): + def __init__(self, delimiter="\t"): + self.meters = defaultdict(SmoothedValue) + self.delimiter = delimiter + + def update(self, **kwargs): + for k, v in kwargs.items(): + if isinstance(v, torch.Tensor): + v = v.item() + assert isinstance(v, (float, int)) + self.meters[k].update(v) + + def __getattr__(self, attr): + if attr in self.meters: + return self.meters[attr] + if attr in self.__dict__: + return self.__dict__[attr] + raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, attr)) + + def __str__(self): + loss_str = [] + for name, meter in self.meters.items(): + loss_str.append("{}: {}".format(name, str(meter))) + return self.delimiter.join(loss_str) + + def synchronize_between_processes(self): + for meter in self.meters.values(): + meter.synchronize_between_processes() + + def add_meter(self, name, meter): + self.meters[name] = meter + + def log_every(self, iterable, print_freq, header=None): + i = 0 + if not header: + header = "" + start_time = time.time() + end = time.time() + iter_time = SmoothedValue(fmt="{avg:.4f}") + data_time = SmoothedValue(fmt="{avg:.4f}") + space_fmt = ":" + str(len(str(len(iterable)))) + "d" + if torch.cuda.is_available(): + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + "max mem: {memory:.0f}", + ] + ) + else: + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + ] + ) + MB = 1024.0 * 1024.0 + for obj in iterable: + data_time.update(time.time() - end) + yield obj + iter_time.update(time.time() - end) + if i % print_freq == 0: + eta_seconds = iter_time.global_avg * (len(iterable) - i) + eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) + if torch.cuda.is_available(): 
+ print( + log_msg.format( + i, + len(iterable), + eta=eta_string, + meters=str(self), + time=str(iter_time), + data=str(data_time), + memory=torch.cuda.max_memory_allocated() / MB, + ) + ) + else: + print( + log_msg.format( + i, + len(iterable), + eta=eta_string, + meters=str(self), + time=str(iter_time), + data=str(data_time), + ) + ) + i += 1 + end = time.time() + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("{} Total time: {}".format(header, total_time_str)) + + +def accuracy(output, target, topk=(1,)): + """Computes the accuracy over the k top predictions for the specified values of k""" + with torch.no_grad(): + maxk = max(topk) + batch_size = target.size(0) + + _, pred = output.topk(maxk, 1, True, True) + pred = pred.t() + correct = pred.eq(target[None]) + + res = [] + for k in topk: + correct_k = correct[:k].flatten().sum(dtype=torch.float32) + res.append(correct_k * (100.0 / batch_size)) + return res + + +def mkdir(path): + try: + os.makedirs(path) + except OSError as e: + if e.errno != errno.EEXIST: + raise diff --git a/examples/torch_migration/pipeline/Step5/check_step5.py b/examples/torch_migration/pipeline/Step5/check_step5.py new file mode 100644 index 0000000000000000000000000000000000000000..79d3556a8ae0a604cc35f622df28d76c142d554a --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/check_step5.py @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("bert_torch/train_align_benchmark.npy") + paddle_info = diff_helper.load_info("bert_paddle/train_align_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(path="train_align_diff.log", diff_threshold=0.0025) diff --git a/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py b/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py new file mode 100644 index 0000000000000000000000000000000000000000..1c696aac3af0a44a79d65125ec5197597c0090e0 --- /dev/null +++ b/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py @@ -0,0 +1,37 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
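`check_step5.py` above closes the loop on training alignment: each `train.py` dumps its best accuracy with `ReprodLogger`, and the checker diffs the two dumps against a 0.0025 threshold. A compressed sketch of that round trip follows; the accuracy numbers are made up and only stand in for real runs.

```python
# Sketch of the reprod_log round trip behind Step5 (values are hypothetical).
import numpy as np
from reprod_log import ReprodDiffHelper, ReprodLogger

# What bert_torch/train.py and bert_paddle/train.py do after training finishes.
torch_logger = ReprodLogger()
torch_logger.add("acc", np.array([0.9231]))   # hypothetical benchmark accuracy
torch_logger.save("train_align_benchmark.npy")

paddle_logger = ReprodLogger()
paddle_logger.add("acc", np.array([0.9243]))  # hypothetical Paddle accuracy
paddle_logger.save("train_align_paddle.npy")

# What check_step5.py does: the check passes only if the diff stays within threshold.
helper = ReprodDiffHelper()
helper.compare_info(helper.load_info("train_align_benchmark.npy"),
                    helper.load_info("train_align_paddle.npy"))
helper.report(path="train_align_diff.log", diff_threshold=0.0025)
```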
+ +import numpy as np +import paddle +import torch + + +def generate(seed): + np.random.seed(seed) + weight = np.random.normal(0, 0.02, (768, 2)).astype("float32") + bias = np.zeros((2,)).astype("float32") + paddle_weights = { + "classifier.weight": weight, + "classifier.bias": bias, + } + torch_weights = { + "classifier.weight": torch.from_numpy(weight).t(), + "classifier.bias": torch.from_numpy(bias), + } + torch.save(torch_weights, "torch_classifier_weights.bin") + paddle.save(paddle_weights, "paddle_classifier_weights.bin") + + +if __name__ == "__main__": + generate(seed=42) diff --git a/examples/torch_migration/pipeline/fake_data/gen_fake_data.py b/examples/torch_migration/pipeline/fake_data/gen_fake_data.py new file mode 100644 index 0000000000000000000000000000000000000000..e083799c0484e456ab7ed574eb33f38a8def010b --- /dev/null +++ b/examples/torch_migration/pipeline/fake_data/gen_fake_data.py @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +def gen_fake_data(): + fake_data = np.random.randint(1, 30522, size=(4, 64)).astype(np.int64) + fake_label = np.array([0, 1, 1, 0]).astype(np.int64) + np.save("fake_data.npy", fake_data) + np.save("fake_label.npy", fake_label) + + +if __name__ == "__main__": + gen_fake_data() diff --git a/examples/torch_migration/pipeline/models/pd_bert.py b/examples/torch_migration/pipeline/models/pd_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..3e798421241910dbb0394836b4950da5aab59f1b --- /dev/null +++ b/examples/torch_migration/pipeline/models/pd_bert.py @@ -0,0 +1,424 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
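`generate_classifier_weights.py` above saves the same classifier weights for both frameworks but transposes the PyTorch copy, because `paddle.nn.Linear` stores its weight as `(in_features, out_features)` while `torch.nn.Linear` stores `(out_features, in_features)`. A small sketch of that layout difference (illustrative only, not part of the pipeline):

```python
# Why the torch copy of the classifier weight is transposed.
import numpy as np
import paddle
import torch

weight = np.random.normal(0, 0.02, (768, 2)).astype("float32")

pd_linear = paddle.nn.Linear(768, 2)
pt_linear = torch.nn.Linear(768, 2)

print(pd_linear.weight.shape)   # [768, 2]            -> matches `weight` directly
print(pt_linear.weight.shape)   # torch.Size([2, 768]) -> needs the transpose

pd_linear.weight.set_value(paddle.to_tensor(weight))
with torch.no_grad():
    pt_linear.weight.copy_(torch.from_numpy(weight).t())
```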
+"""Paddle BERT model.""" + +import math +from typing import Optional, Tuple + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +ACT2FN = { + "relu": F.relu, + "gelu": F.gelu, + "tanh": F.tanh, + "sigmoid": F.sigmoid, +} +NEG_INF = -1e4 + + +class BertConfig: + def __init__( + self, + vocab_size: int = 30522, + hidden_size: int = 768, + num_hidden_layers: int = 12, + num_attention_heads: int = 12, + intermediate_size: int = 3072, + hidden_act: str = "gelu", + hidden_dropout_prob: float = 0.1, + attention_probs_dropout_prob: float = 0.1, + max_position_embeddings: int = 512, + type_vocab_size: int = 2, + initializer_range: float = 0.02, + pad_token_id: int = 0, + pool_act: str = "tanh", + layer_norm_eps: float = 1e-12, + output_attentions: bool = False, + output_hidden_states: bool = False, + num_labels=2, + **kwargs + ): + self.pad_token_id = pad_token_id + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.pool_act = pool_act + self.layer_norm_eps = layer_norm_eps + self.output_attentions = output_attentions + self.output_hidden_states = output_hidden_states + self.num_labels = num_labels + + +class BertEmbeddings(nn.Layer): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.register_buffer( + "position_ids", paddle.arange(config.max_position_embeddings, dtype="int64").reshape((1, -1)) + ) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + ) -> paddle.Tensor: + input_shape = input_ids.shape + seq_length = input_ids.shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, :seq_length] + + if token_type_ids is None: + token_type_ids = paddle.zeros(input_shape, dtype=paddle.int64) + + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + position_embeddings = self.position_embeddings(position_ids) + embeddings = inputs_embeds + token_type_embeddings + position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertSelfAttention(nn.Layer): + def __init__(self, config): + super().__init__() + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = config.hidden_size // config.num_attention_heads + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = 
nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x: paddle.Tensor) -> paddle.Tensor: + new_x_shape = x.shape[:-1] + [self.num_attention_heads, self.attention_head_size] + x = x.reshape(new_x_shape) + return x.transpose([0, 2, 1, 3]) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + + # compute q,k,v + query_layer = self.transpose_for_scores(self.query(hidden_states)) + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = paddle.matmul(query_layer, key_layer, transpose_y=True) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = F.softmax(attention_scores, axis=-1) + attention_probs = self.dropout(attention_probs) + + context_layer = paddle.matmul(attention_probs, value_layer) + + context_layer = context_layer.transpose([0, 2, 1, 3]) + new_context_layer_shape = context_layer.shape[:-2] + [ + self.all_head_size, + ] + context_layer = context_layer.reshape(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + return outputs + + +class BertSelfOutput(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: paddle.Tensor, input_tensor: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Layer): + def __init__(self, config): + super().__init__() + self.self = BertSelfAttention(config) + self.output = BertSelfOutput(config) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + self_outputs = self.self( + hidden_states, + attention_mask, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +class BertIntermediate(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = 
nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: paddle.Tensor, input_tensor: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Layer): + def __init__(self, config): + super().__init__() + self.seq_len_dim = 1 + self.attention = BertAttention(config) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + # self attn + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + output_attentions=output_attentions, + ) + attention_output = self_attention_outputs[0] + + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + # ffn + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + + outputs = (layer_output,) + outputs + + return outputs + + +class BertEncoder(nn.Layer): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.LayerList([BertLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + + for layer_module in self.layer: + # add hidden_states + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_outputs = layer_module( + hidden_states, + attention_mask, + output_attentions, + ) + hidden_states = layer_outputs[0] + + # add self attn + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + return tuple( + v + for v in [ + hidden_states, + all_hidden_states, + all_self_attentions, + ] + if v is not None + ) + + +class BertPooler(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = ACT2FN[config.pool_act] + + def forward(self, hidden_states: paddle.Tensor) -> paddle.Tensor: + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPreTrainedModel(nn.Layer): + def _init_weights(self, module): + """Initialize the weights""" + normal_init = nn.initializer.Normal(mean=0.0, std=self.config.initializer_range) + zero_init = nn.initializer.Constant(0.0) + one_init = nn.initializer.Constant(1.0) + if isinstance(module, nn.Linear): + normal_init(module.weight) + if module.bias is not None: + zero_init(module.bias) + elif isinstance(module, nn.Embedding): + normal_init(module.weight) + if module._padding_idx is not None: + with paddle.no_grad(): + module.weight[module._padding_idx] = 0 + elif isinstance(module, nn.LayerNorm): + zero_init(module.bias) + one_init(module.weight) + + +class BertModel(BertPreTrainedModel): + def __init__(self, config, add_pooling_layer=True): + super().__init__() + 
self.config = config + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + + self.pooler = BertPooler(config) if add_pooling_layer else None + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[paddle.Tensor]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + + if token_type_ids is None: + token_type_ids = paddle.zeros(input_ids.shape, dtype=paddle.int64) + + if attention_mask is not None: + attention_mask = (1.0 - attention_mask[:, :, None, None]) * NEG_INF + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + ) + encoder_outputs = self.encoder( + embedding_output, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + return (sequence_output, pooled_output) + encoder_outputs[1:] + + +class BertForSequenceClassification(BertPreTrainedModel): + def __init__(self, config): + super().__init__() + self.num_labels = config.num_labels + self.config = config + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[paddle.Tensor]: + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + + pooled_output = outputs[1] + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + output = (logits,) + outputs[2:] + return output diff --git a/examples/torch_migration/pipeline/models/pt_bert.py b/examples/torch_migration/pipeline/models/pt_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..fab4fd9dd5182040d67c79def30d5e0ce1381d8e --- /dev/null +++ b/examples/torch_migration/pipeline/models/pt_bert.py @@ -0,0 +1,420 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""PyTorch BERT model.""" + +import math +from typing import Optional, Tuple + +import torch +import torch.nn as nn +import torch.nn.functional as F + +ACT2FN = { + "relu": F.relu, + "gelu": F.gelu, + "tanh": F.tanh, + "sigmoid": F.sigmoid, +} +NEG_INF = -1e4 + + +class BertConfig: + def __init__( + self, + vocab_size: int = 30522, + hidden_size: int = 768, + num_hidden_layers: int = 12, + num_attention_heads: int = 12, + intermediate_size: int = 3072, + hidden_act: str = "gelu", + hidden_dropout_prob: float = 0.1, + attention_probs_dropout_prob: float = 0.1, + max_position_embeddings: int = 512, + type_vocab_size: int = 2, + initializer_range: float = 0.02, + pad_token_id: int = 0, + pool_act: str = "tanh", + layer_norm_eps: float = 1e-12, + output_attentions: bool = False, + output_hidden_states: bool = False, + num_labels=2, + **kwargs + ): + self.pad_token_id = pad_token_id + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.pool_act = pool_act + self.layer_norm_eps = layer_norm_eps + self.output_attentions = output_attentions + self.output_hidden_states = output_hidden_states + self.num_labels = num_labels + + +class BertEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + ) -> torch.Tensor: + input_shape = input_ids.size() + seq_length = input_ids.shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, :seq_length] + + if token_type_ids is None: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device) + + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + position_embeddings = self.position_embeddings(position_ids) + embeddings = inputs_embeds + token_type_embeddings + position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertSelfAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = config.hidden_size // config.num_attention_heads + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + 
self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor: + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + + # compute q,k,v + query_layer = self.transpose_for_scores(self.query(hidden_states)) + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = F.softmax(attention_scores, dim=-1) + attention_probs = self.dropout(attention_probs) + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + return outputs + + +class BertSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.self = BertSelfAttention(config) + self.output = BertSelfOutput(config) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + self_outputs = self.self( + hidden_states, + attention_mask, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +class BertIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = 
nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.seq_len_dim = 1 + self.attention = BertAttention(config) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + # self attn + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + output_attentions=output_attentions, + ) + attention_output = self_attention_outputs[0] + + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + # ffn + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + + outputs = (layer_output,) + outputs + + return outputs + + +class BertEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + + for layer_module in self.layer: + # add hidden_states + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_outputs = layer_module( + hidden_states, + attention_mask, + output_attentions, + ) + hidden_states = layer_outputs[0] + + # add self attn + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + return tuple( + v + for v in [ + hidden_states, + all_hidden_states, + all_self_attentions, + ] + if v is not None + ) + + +class BertPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = ACT2FN[config.pool_act] + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPreTrainedModel(nn.Module): + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +class BertModel(BertPreTrainedModel): + def __init__(self, 
config, add_pooling_layer=True): + super().__init__() + self.config = config + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + + self.pooler = BertPooler(config) if add_pooling_layer else None + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[torch.Tensor]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + + device = input_ids.device + + if token_type_ids is None: + token_type_ids = torch.zeros(input_ids.shape, dtype=torch.long, device=device) + + if attention_mask is not None: + attention_mask = (1.0 - attention_mask[:, :, None, None]) * NEG_INF + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + ) + encoder_outputs = self.encoder( + embedding_output, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + return (sequence_output, pooled_output) + encoder_outputs[1:] + + +class BertForSequenceClassification(BertPreTrainedModel): + def __init__(self, config): + super().__init__() + self.num_labels = config.num_labels + self.config = config + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[torch.Tensor]: + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + + pooled_output = outputs[1] + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + output = (logits,) + outputs[2:] + return output diff --git a/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py b/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py new file mode 100644 index 0000000000000000000000000000000000000000..40ebdf029afe3167cb3447dec5055aa5a7e45547 --- /dev/null +++ b/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
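+# Compare the two reprod_log result files written by write_log.py and save a per-key
+# mean-absolute-difference report (pass threshold 1e-6) to ./diff.txt.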
+ +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + + info1 = diff_helper.load_info("./result_1.npy") + info2 = diff_helper.load_info("./result_2.npy") + + diff_helper.compare_info(info1, info2) + + diff_helper.report(diff_method="mean", diff_threshold=1e-6, path="./diff.txt") diff --git a/examples/torch_migration/pipeline/reprod_log_demo/write_log.py b/examples/torch_migration/pipeline/reprod_log_demo/write_log.py new file mode 100644 index 0000000000000000000000000000000000000000..b2985e3db724447a7a45db29b8667fa641a5ee2e --- /dev/null +++ b/examples/torch_migration/pipeline/reprod_log_demo/write_log.py @@ -0,0 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from reprod_log import ReprodLogger + +if __name__ == "__main__": + reprod_log_1 = ReprodLogger() + reprod_log_2 = ReprodLogger() + + data_1 = np.random.rand(4, 64, 768).astype(np.float32) + data_2 = np.random.rand(4, 64, 768).astype(np.float32) + + reprod_log_1.add("demo_test_1", data_1) + reprod_log_1.add("demo_test_2", data_1) + reprod_log_1.save("result_1.npy") + + reprod_log_2.add("demo_test_1", data_1) + reprod_log_2.add("demo_test_2", data_2) + reprod_log_2.save("result_2.npy") diff --git a/examples/torch_migration/pipeline/weights/torch2paddle.py b/examples/torch_migration/pipeline/weights/torch2paddle.py new file mode 100644 index 0000000000000000000000000000000000000000..74511fea26e95e800c24f8118fc9c157ade56c26 --- /dev/null +++ b/examples/torch_migration/pipeline/weights/torch2paddle.py @@ -0,0 +1,115 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
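+# Convert a PyTorch BERT checkpoint (e.g. ./torch_weight.bin produced by torch_bert_weight.py)
+# into a Paddle state dict; test_forward() cross-checks the outputs of the HuggingFace and
+# PaddleNLP masked-LM models on the same random input.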
+ +from collections import OrderedDict + +import numpy as np +import paddle +import torch +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM +from transformers import BertForMaskedLM as PTBertForMaskedLM + + +def convert_pytorch_checkpoint_to_paddle( + pytorch_checkpoint_path="pytorch_model.bin", + paddle_dump_path="model_state.pdparams", + version="old", +): + hf_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", + } + do_not_transpose = [] + if version == "old": + hf_to_paddle.update( + { + "predictions.bias": "predictions.decoder_bias", + ".gamma": ".weight", + ".beta": ".bias", + } + ) + do_not_transpose = do_not_transpose + ["predictions.decoder.weight"] + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + is_transpose = False + if k[-7:] == ".weight": + # embeddings.weight and LayerNorm.weight do not transpose + if all(d not in k for d in do_not_transpose): + if ".embeddings." not in k and ".LayerNorm." not in k: + if v.ndim == 2: + if "embeddings" not in k: + v = v.transpose(0, 1) + is_transpose = True + is_transpose = False + oldk = k + # for hf_name, pd_name in hf_to_paddle.items(): + # k = k.replace(hf_name, pd_name) + + # add prefix `bert.` + if "bert." not in k and "cls." 
not in k and "classifier" not in k: + k = k + + print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +def compare(out_torch, out_paddle): + out_torch = out_torch.detach().numpy() + out_paddle = out_paddle.detach().numpy() + assert out_torch.shape == out_paddle.shape + abs_dif = np.abs(out_torch - out_paddle) + mean_dif = np.mean(abs_dif) + max_dif = np.max(abs_dif) + min_dif = np.min(abs_dif) + print("mean_dif:{}".format(mean_dif)) + print("max_dif:{}".format(max_dif)) + print("min_dif:{}".format(min_dif)) + + +def test_forward(): + paddle.set_device("cpu") + model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_torch.eval() + model_paddle.eval() + np.random.seed(42) + x = np.random.randint(1, model_paddle.bert.config["vocab_size"], size=(4, 64)) + input_torch = torch.tensor(x, dtype=torch.int64) + out_torch = model_torch(input_torch)[0] + + input_paddle = paddle.to_tensor(x, dtype=paddle.int64) + out_paddle = model_paddle(input_paddle)[0] + + print("torch result shape:{}".format(out_torch.shape)) + print("paddle result shape:{}".format(out_paddle.shape)) + compare(out_torch, out_paddle) + + +if __name__ == "__main__": + convert_pytorch_checkpoint_to_paddle("./torch_weight.bin", "./paddle_weight.pdparams") diff --git a/examples/torch_migration/pipeline/weights/torch_bert_weight.py b/examples/torch_migration/pipeline/weights/torch_bert_weight.py new file mode 100644 index 0000000000000000000000000000000000000000..819229e156a57c3c15a588de43b18fde6496b97c --- /dev/null +++ b/examples/torch_migration/pipeline/weights/torch_bert_weight.py @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
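+# Download the HuggingFace bert-base-uncased weights and dump the raw state_dict to
+# ./torch_weight.bin so that torch2paddle.py can convert it to a Paddle checkpoint.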
+ +from transformers import BertModel +import torch + +hf_model = BertModel.from_pretrained("bert-base-uncased") +hf_model.eval() +PATH = "./torch_weight.bin" +torch.save(hf_model.state_dict(), PATH) diff --git a/examples/torch_migration/requirements.txt b/examples/torch_migration/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..4d3875d031562bd8503b533f5c2c63816400322e --- /dev/null +++ b/examples/torch_migration/requirements.txt @@ -0,0 +1,5 @@ +paddlepaddle-gpu==2.2.0 +torch>=1.7 +transformers +paddlenlp +git+https://github.com/WenmuZhou/reprod_log.git diff --git a/examples/word_embedding/README.md b/examples/word_embedding/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9d7eeac4b12c3d0325050131515a60fddb83381b --- /dev/null +++ b/examples/word_embedding/README.md @@ -0,0 +1,106 @@ +# Word Embedding with PaddleNLP + +## 简介 + +PaddleNLP已预置多个公开的预训练Embedding,用户可以通过使用`paddlenlp.embeddings.TokenEmbedding`接口加载预训练Embedding,从而提升训练效果。以下通过基于开源情感倾向分类数据集ChnSentiCorp的文本分类训练例子展示`paddlenlp.embeddings.TokenEmbedding`对训练提升的效果。更多的`paddlenlp.embeddings.TokenEmbedding`用法,请参考[TokenEmbedding 接口使用指南](../../docs/model_zoo/embeddings.md) 。 + + +## 快速开始 + +### 环境依赖 + +- visualdl + +安装命令:`pip install visualdl` + + +### 启动训练 + +我们以中文情感分类公开数据集ChnSentiCorp为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在验证集(dev.tsv)验证。训练时会自动下载词表dict.txt,用于对数据集进行切分,构造数据样本。 + +启动训练: + +```shell +# 使用paddlenlp.embeddings.TokenEmbedding +python train.py --device='gpu' \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=20 \ + --use_token_embedding=True \ + --vdl_dir='./vdl_dir' + +# 使用paddle.nn.Embedding +python train.py --device='gpu' \ + --lr=1e-4 \ + --batch_size=64 \ + --epochs=20 \ + --use_token_embedding=False \ + --vdl_dir='./vdl_dir' +``` + +以上参数表示: + +* `device`: 选择训练设备,目前可选'gpu', 'cpu', 'xpu'。 默认为`gpu`。 +* `lr`: 学习率, 默认为5e-4。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为5。 +* `use_token_embedding`: 是否使用`paddlenlp.embeddings.TokenEmbedding`,默认为True。 +* `vdl_dir`: VisualDL日志目录。训练过程中的VisualDL信息会在该目录下保存。默认为`./vdl_dir` + +该脚本还提供以下参数: + +* `save_dir`: 模型保存目录。默认值为"./checkpoints/"。 +* `init_from_ckpt`: 恢复模型训练的断点路径。默认值为None,表示不恢复训练。 +* `embedding_name`: 预训练Embedding名称,默认为`w2v.baidu_encyclopedia.target.word-word.dim300`. 支持的预训练Embedding可参考[Embedding 模型汇总](../../docs/model_zoo/embeddings.md)。 + +**注意:** + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。训练过程中会实时保存每个epoch的模型参数,并以当前epoch值命名。如第2个Epochs,模型参数会被保存为`./checkpoints/2.pdparams`,优化器状态保存为`./checkpoints/2.pdopt`。 + +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... 
+└── final.pdparams +``` + +如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如果用户想热启第10个Epoch保存的模型,则设置 `--init_from_ckpt=./checkpoints/10`即可,程序会自动加载模型参数`./checkpoints/10.pdparams`,也会自动加载优化器状态`./checkpoints/10.pdopt`。 + + +### 启动VisualDL + +推荐使用VisualDL查看实验对比。以下为VisualDL的启动命令,其中logdir参数指定的目录需要与启动训练时指定的`vdl_dir`相同。(更多VisualDL的用法,可参考[VisualDL使用指南](https://github.com/PaddlePaddle/VisualDL#2-launch-panel)) + +``` +visualdl --logdir ./vdl_dir --port 8888 --host 0.0.0.0 +``` + +### 训练效果对比 + +在Chrome浏览器输入 `ip:8888` (ip为启动VisualDL机器的IP)。 + +以下为示例实验效果对比图,蓝色是使用`paddlenlp.embeddings.TokenEmbedding`进行的实验,绿色是使用没有加载预训练模型的Embedding进行的实验。 +可以看到,使用`paddlenlp.embeddings.TokenEmbedding`的训练,其验证acc变化趋势上升,并收敛于0.90左右,收敛后相对平稳,不容易过拟合。 +而没有使用`paddlenlp.embeddings.TokenEmbedding`的训练,其验证acc变化趋势向下,并收敛于0.86左右。从示例实验可以观察到,使用`paddlenlp.embedding.TokenEmbedding`能提升训练效果。 + +Eval Acc: + +![eval acc](https://user-images.githubusercontent.com/16698950/102076935-79ac5480-3e43-11eb-81f8-6e509c394fbf.png) + +| | Best Acc | +| ------------------------------------| ------------- | +| paddle.nn.Embedding | 0.8965 | +| paddelnlp.embeddings.TokenEmbedding | 0.9082 | + +## 致谢 +- 感谢 [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)提供Word2Vec中文Embedding预训练模型,[GloVe Project](https://nlp.stanford.edu/projects/glove)提供的GloVe英文Embedding预训练模型,[FastText Project](https://fasttext.cc/docs/en/english-vectors.html)提供的fasttext英文预训练模型。 + +## 参考论文 +- Li, Shen, et al. "Analogical reasoning on chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018). +- Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. +- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. +- T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. Advances in Pre-Training Distributed Word Representations diff --git a/examples/word_embedding/data.py b/examples/word_embedding/data.py new file mode 100644 index 0000000000000000000000000000000000000000..23b8b61abfd349ab22e594901961e51b3cb69380 --- /dev/null +++ b/examples/word_embedding/data.py @@ -0,0 +1,115 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
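+# Data helpers for the word embedding example: vocabulary loading, jieba / JiebaTokenizer
+# based tokenization, padding and prediction-data preprocessing used by train.py.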
+import jieba +import numpy as np + +from paddlenlp.data import JiebaTokenizer + +tokenizer = jieba + + +def set_tokenizer(vocab): + global tokenizer + if vocab is not None: + tokenizer = JiebaTokenizer(vocab=vocab) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = {} + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n").split("\t")[0] + vocab[token] = index + return vocab + + +def convert_tokens_to_ids(tokens, vocab): + """Converts a token id (or a sequence of id) in a token string + (or a sequence of tokens), using the vocabulary. + """ + + ids = [] + unk_id = vocab.get("[UNK]", None) + for token in tokens: + wid = vocab.get(token, unk_id) + if wid: + ids.append(wid) + return ids + + +def convert_example(example, vocab, unk_token_id=1, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + vocab(obj:`dict`): The vocabulary. + unk_token_id(obj:`int`, defaults to 1): The unknown token id. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of token ids.s + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + input_ids = [] + for token in tokenizer.cut(example["text"]): + token_id = vocab.get(token, unk_token_id) + input_ids.append(token_id) + valid_length = np.array([len(input_ids)]) + input_ids = np.array(input_ids, dtype="int32") + if not is_test: + label = np.array(example["label"], dtype="int64") + return input_ids, valid_length, label + else: + return input_ids, valid_length + + +def pad_texts_to_max_seq_len(texts, max_seq_len, pad_token_id=0): + """ + Padded the texts to the max sequence length if the length of text is lower than it. + Unless it truncates the text. + Args: + texts(obj:`list`): Texts which contains a sequence of word ids. + max_seq_len(obj:`int`): Max sequence length. + pad_token_id(obj:`int`, optional, defaults to 0) : The pad token index. + """ + for index, text in enumerate(texts): + seq_len = len(text) + if seq_len < max_seq_len: + padded_tokens = [pad_token_id for _ in range(max_seq_len - seq_len)] + new_text = text + padded_tokens + texts[index] = new_text + elif seq_len > max_seq_len: + new_text = text[:max_seq_len] + texts[index] = new_text + + +def preprocess_prediction_data(data, vocab): + """ + It process the prediction data as the format used as training. + Args: + data (obj:`List[str]`): The prediction data whose each element is a tokenized text. + Returns: + examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + """ + examples = [] + for text in data: + tokens = " ".join(tokenizer.cut(text)).split(" ") + ids = convert_tokens_to_ids(tokens, vocab) + examples.append([ids, len(ids)]) + return examples diff --git a/examples/word_embedding/train.py b/examples/word_embedding/train.py new file mode 100644 index 0000000000000000000000000000000000000000..bd997f6eea9a1132c349c0cfa796883efeb5dd7d --- /dev/null +++ b/examples/word_embedding/train.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import os.path as osp +from functools import partial + +import data +import paddle +import paddle.nn as nn + +import paddlenlp +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset +from paddlenlp.embeddings import TokenEmbedding +from paddlenlp.utils.downloader import get_path_from_url + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--epochs", type=int, default=5, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", help="Select cpu, gpu, xpu devices to train model.") +parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='./checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--use_token_embedding", type=eval, default=True, help="Whether use pretrained embedding") +parser.add_argument("--embedding_name", type=str, default="w2v.baidu_encyclopedia.target.word-word.dim300", help="The name of pretrained embedding") +parser.add_argument("--vdl_dir", type=str, default="vdl_dir/", help="VisualDL log directory") +args = parser.parse_args() +# yapf: enable + +WORD_DICT_URL = "https://bj.bcebos.com/paddlenlp/data/dict.txt" + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, pad_token_id=0): + """ + Creats dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn, lazy=True) + + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.get("[PAD]", 0)), # input_ids + Stack(dtype="int32"), # seq len + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) + return dataloader + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. 
+ Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 300): The embedding dimension. + hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. + fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__( + self, + vocab_size, + num_classes, + vocab_path, + emb_dim=300, + hidden_size=128, + fc_hidden_size=96, + use_token_embedding=True, + ): + super().__init__() + if use_token_embedding: + self.embedder = TokenEmbedding(args.embedding_name, extended_vocab_path=vocab_path) + emb_dim = self.embedder.embedding_dim + else: + padding_idx = vocab_size - 1 + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = paddlenlp.seq2vec.BoWEncoder(emb_dim) + self.fc1 = nn.Linear(self.bow_encoder.get_output_dim(), hidden_size) + self.fc2 = nn.Linear(hidden_size, fc_hidden_size) + self.dropout = nn.Dropout(p=0.3, axis=1) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + + # Shape: (batch_size, embedding_dim) + summed = self.bow_encoder(embedded_text) + summed = self.dropout(summed) + encoded_text = paddle.tanh(summed) + + # Shape: (batch_size, hidden_size) + fc1_out = paddle.tanh(self.fc1(encoded_text)) + # Shape: (batch_size, fc_hidden_size) + fc2_out = paddle.tanh(self.fc2(fc1_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc2_out) + return logits + + +if __name__ == "__main__": + assert args.device in ["cpu", "gpu", "xpu"], "Invalid device! Available device should be cpu, gpu, or xpu." + paddle.set_device(args.device) + + # Loads vocab. + vocab_path = "./dict.txt" + if not os.path.exists(vocab_path): + # download in current directory + get_path_from_url(WORD_DICT_URL, "./") + vocab = data.load_vocab(vocab_path) + + if "[PAD]" not in vocab: + vocab["[PAD]"] = len(vocab) + # Loads dataset. + train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) + + # Constructs the network. + model = BoWModel( + vocab_size=len(vocab), + num_classes=len(train_ds.label_list), + vocab_path=vocab_path, + use_token_embedding=args.use_token_embedding, + ) + if args.use_token_embedding: + vocab = model.embedder.vocab + data.set_tokenizer(vocab) + vocab = vocab.token_to_idx + else: + v = Vocab.from_dict(vocab, unk_token="[UNK]", pad_token="[PAD]") + data.set_tokenizer(v) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + trans_fn = partial(data.convert_example, vocab=vocab, unk_token_id=vocab["[UNK]"], is_test=False) + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", pad_token_id=vocab["[PAD]"] + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", pad_token_id=vocab["[PAD]"] + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. 
+ if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + log_dir = "use_normal_embedding" + if args.use_token_embedding: + log_dir = "use_token_embedding" + log_dir = osp.join(args.vdl_dir, log_dir) + callback = paddle.callbacks.VisualDL(log_dir=log_dir) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) diff --git a/fast_generation/README.md b/fast_generation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fe699a9c7271e55ad1be9960560a1510ec7806ad --- /dev/null +++ b/fast_generation/README.md @@ -0,0 +1,305 @@ +# FastGeneration + +FastGeneration是PaddleNLP v2.2版本加入的文本生成高性能加速功能,其支持GPT、OPT、BART、UnifiedTransformer等多种NLP生成类预训练模型,并且支持多种解码策略,可以用于机器翻译、文本续写、文本摘要、对话生成等多种NLG任务的GPU场景预测加速。 + +功能底层依托于[NV FasterTransformer](https://github.com/NVIDIA/FasterTransformer),该库针对标准的Transformer和GPT模型、beam search和sampling解码策略进行了性能优化。PaddleNLP FastGeneration在其之上进行了扩展,实现了更多模型和生成策略的优化支持,并将功能入口封装于`model.generate`函数。功能的开启和关闭通过传入`use_fast`参数进行控制(默认为关闭状态)。通过调用generate函数,用户可以简单的使用模型高性能推理功能。下图展示了FastGeneration的启动流程: + + +
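+Complementing the startup flow described above, the following is a minimal, self-contained sketch of this entry point (for illustration only: the model name, prompt and generation settings here are placeholder choices; the canonical example is `samples/gpt_sample.py` shown later in this README):
+```python
+import paddle
+from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel
+
+# Placeholder model name; any generation model covered by FastGeneration works the same way.
+model_name = "gpt-cpm-small-cn-distill"
+tokenizer = GPTChineseTokenizer.from_pretrained(model_name)
+model = GPTLMHeadModel.from_pretrained(model_name)
+model.eval()
+
+input_ids = paddle.to_tensor([tokenizer("花间一壶酒,独酌无相亲。举杯邀明月,")["input_ids"]], dtype="int64")
+
+# use_fast=True enables FastGeneration; the first call JIT-compiles the high-performance ops,
+# and unsupported argument combinations fall back to the regular generate() automatically.
+outputs, _ = model.generate(
+    input_ids=input_ids, max_length=10, decode_strategy="greedy_search", use_fast=True)
+
+# outputs holds the generated token ids; they can be mapped back to text with
+# tokenizer.convert_ids_to_tokens().
+print(outputs)
+```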
+ +## Features + +- 全面支持生成式预训练模型。包括GPT、OPT、CodeGen、GPTJ、BART、mBART、UnifiedTransformer和UNIMO-text。 +- 支持大多数主流解码策略。包括Beam Search、Sampling、Greedy Search。以及Diverse Sibling Search、Length Penalty等子策略。 +- 解码速度快。最高可达非加速版generate函数的**18倍**。**并支持FP16混合精度计算**。 +- 易用性强。功能的入口为`model.generate`，与非加速版生成api的使用方法相同，当满足加速条件时使用jit即时编译高性能算子并用于生成，不满足则自动切换回非加速版生成api。 +- GPT、UnifiedTransformer和UNIMO-text模型支持高性能并行推理，在具备MPI和NCCL的环境中一行代码即可开启使用，允许通过多张小显存容量的 GPU 使用百亿大模型，预测速度较单卡也进一步提升。百亿模型四卡并行高性能推理速度达单卡高性能推理速度2+倍。 + +### Inference Model Support +下表为PaddleNLP FastGeneration对预训练模型和解码策略的支持情况(GPU)。 + +| Model Name | GPT2 | OPT | CodeGen| GPTJ| BART | mBART | UnifiedTransformer | +|------------------------|---------|---------| ---------| ---------|-----------------|-----------------|--------------------| +| Model Structure | Decoder | Decoder |Decoder|Decoder| Encoder-Decoder | Encoder-Decoder | Prefix-LM | +| Beam Search | ❌ | ❌ |❌|❌| ✅ | ✅ | ✅ | +| Top-K Sampling | ✅ | ✅ |✅|✅| ✅ | ✅ | ✅ | +| Top-P Sampling | ✅ | ✅ |✅|✅| ✅ | ✅ | ✅ | +| Diverse Sibling Search | ❌ | ❌ |❌|❌| ✅ | ✅ | ✅ | +| Forced Decoding | ❌ | ❌ |❌|❌| ❌ | ✅ | ❌ | +| Length Penalty | ❌ | ❌ |❌|❌| ✅ | ✅ | ✅ | +| Temperature | ✅ | ✅ |✅|✅| ✅ | ✅ | ✅ | +| Repetition Penalty | ✅ | ✅ |✅|✅| ❌ | ❌ | ❌ | + +## Performance + +FastGeneration的高性能解码相比原版generate方法加速明显，并且与竞品相比也有极大的速度优势。以下为性能对比图: + +- **batch_size = 4, out_seq_len = 32** +- Device: Tesla V100-SXM2-16GB +- CUDA version 11.2 +- cudnn version 8 +- torch version 1.10.0+cu113 +- transformers version 4.12.5 + +### **BART** (bart-base, batch_size=4, max_length=32) + +
+ +### **GPT** (gpt2, batch_size=4, max_length=32) + +
+ +### **OPT** (opt, batch_size=4, max_length=32) + +
+ +### **CodeGen:** +* 环境和超参 + - Platform: Tesla V100-SXM2-32GB + - CUDA 10.1 + - CUDNN 7.6.5 + - PaddlePaddle-gpu 2.3.1.post101 + - transformers==4.21.1 + - torch==1.11.0 + - Batch Size: 1 + - Input Length: 60 + - Output Length: 20 +
+ +- Platform: A100-40G +
+ +### **Pegasus** +* 环境和超参 + - Platform: Tesla V100-SXM2-32GB + - CUDA 10.1 + - CUDNN 7.6.5 + - PaddlePaddle-gpu 2.3.2.post101 + - transformers==4.21.1 + - torch==1.11.0 + - Batch Size: 4 + - Input Length: 60 + - Output Length: 20 + - Decode_strategy: beam search + - num_beams: 4 +
+ +更详细的性能数据请参见[这里](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/fast_generation/perf) + +## Quick Start + +### 高性能推理 + +为体现FastGeneration的易用性,我们在`samples`文件夹中内置了几个典型任务示例,下面以基于GPT模型的中文文本续写任务为例: + +```sh +python samples/gpt_sample.py +``` + +如果是第一次执行,PaddleNLP会启动即时编译([JIT Compile](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/new_op/new_custom_op_cn.html#jit-compile))自动编译高性能解码算子。 + +```sh +... +2021-11-17 13:42:56,771 - INFO - execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build +INFO:utils.cpp_extension:execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build +grep: warning: GREP_OPTIONS is deprecated; please use an alias or script +running build +running build_ext +-- The C compiler identification is GNU 8.2.0 +-- The CXX compiler identification is GNU 8.2.0 +-- The CUDA compiler identification is NVIDIA 10.2.89 +-- Check for working C compiler: /usr/bin/cc +-- Check for working C compiler: /usr/bin/cc -- works +-- Detecting C compiler ABI info +-- Detecting C compiler ABI info - done +-- Detecting C compile features +-- Detecting C compile features - done +-- Check for working CXX compiler: /usr +... +``` + +编译过程通常会花费几分钟的时间编译只会进行一次,之后再次使用高性能解码就不需要重新编译了,编译完成后会继续运行,可以看到生成的结果如下: + +``` +Model input: 花间一壶酒,独酌无相亲。举杯邀明月, +Result: 对影成三人。 +``` + +打开示例代码 `samples/gpt_sample.py` ,我们可以看到如下代码: + +``` +... +model = GPTLMHeadModel.from_pretrained(model_name) +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', + use_fast=True) +... +``` + +可以看到,FastGeneration的使用方法与 `model.generate()` 相同,只需传入输入tensor和解码相关参数即可,使用非常简便。如果要使用非加速版的 `model.generate()` 方法,只需传入 `use_fast=False` 即可,示例如下: + +``` +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', use_fast=False) +... +``` + +**NOTE:** 需要注意的是,如果传入 `model.generate()` 的参数不满足高性能版本的要求。程序会做出提示并自动切换为非加速版本,例如我们在上面的例子中传入 `min_length=1` ,会得到如下提示: + +``` +... +[2021-11-17 14:21:06,132] [ WARNING] - 'min_length != 0' is not supported yet in the fast version +[2021-11-17 14:21:06,132] [ WARNING] - FastGeneration is not available, and the original version would be used instead. +... 
+``` + +关于该函数的详细介绍可以参考API文档[generate](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.transformers.generation_utils.html)和**Aistudio教程[文本生成任务实战:如何使用PaddleNLP实现各种解码策略](https://aistudio.baidu.com/aistudio/projectdetail/3243711?contributionType=1)。**`samples`文件夹中的其他示例的使用方法相同。 + +### 并行推理 + +FastGeneration对GPT、UnifiedTransformer和UNIMO-text模型在高性能推理的基础上还实现了模型并行功能,其中GPT支持Tensor Parallel和Layer Parallel(Pipeline Parallel)两种并行策略的组合,UnifiedTransformer和UNIMO-text支持Tensor Parallel。关于这两种并行策略的详细介绍请参考[Megatron论文](https://arxiv.org/pdf/2104.04473.pdf)。 + +并行推理当前依赖MPI([MPICH](https://www.mpich.org)、[OpenMPI](https://www.open-mpi.org)均可)和[NCCL](https://developer.nvidia.com/nccl),如需使用还请先安装依赖。在使用时,相比上面的单卡高性能加速代码中也只增加了`from_pretrained`创建加载模型之前加上`enable_ft_para()`一行。 +#### GPT 并行推理 + +GPT高性能并行推理的完整使用示例已在`gpt_mp_sample.py`中提供,按照如下方式启动即可: + +```sh +mpirun -n 4 python gpt_mp_sample.py --tensor_para_size 4 --layer_para_size 1 +``` + +其中`-n 4`指明使用的进程和GPU数,`tensor_para_size`和`tensor_para_size`分别指明Tensor Parallel和Layer Parallel各自使用的GPU数,均设置为1则进行单卡预测。另外加上`--use_fp16`以使用FP16,加上`--profile`可以进行相应设置的性能测试。其他生成相关的参数设置释义如下: +- `model_name` 指定使用的GPT模型,默认为[`gpt-cpm-larg-cn`](https://github.com/TsinghuaAI/CPM-1-Generate)。 +- `max_length` 指定生成的最大长度,默认为50。 +- `topk` 用于Top-K采样策略,采样时将只从概率最高K个token中采样,默认为1,即greedy search。 +- `topp` 用于Top-P采样策略,采样时将只从概率最高且累加概率不超过该值的token中采样,默认为1.0。 +- `temperature` 用于调整预测概率分布,默认为1.0,即保持模型原有的预测概率。 + +使用`gpt-cpm-larg-cn`(2.6B)和默认设置,在V100上4卡Tensor Parallel较单卡高性能预测速度提升约40%。 + +#### PLATO-XL 并行推理 + +PLATO-XL百亿对话预训练模型(11B UnifiedTransformer模型)高性能并行推理的完整使用示例已在`plato_xl_sample.py`中提供(当前只支持Tensor Parallel),按照如下方式启动即可: + +```shell +mpirun -n 4 python plato_xl_sample.py +``` + +参数释义基本同上。在V100上4卡Tensor Parallel高性能预测为单卡高性能预测速度的2倍。 + +## Generate Examples + +除了以上示例之外,PaddleNLP的examples中大多使用了`model.generate`的示例都可以通过调整到合适的参数使用高性能推理。具体如下: + +- [examples/dialogue/unified_transformer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dialogue/unified_transformer) +- [model_zoo/gpt/fast_gpt](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt/fast_gpt) +- [examples/text_generation/unimo-text](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_generation/unimo-text) +- [examples/text_summarization/bart](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_summarization/bart) + +下面我们以基于 `Unified Transformer` 的任务型对话为例展示一下FastGeneration的加速效果: + +打开以上链接中Unified Transformer对应的example,找到README中对应预测的脚本。稍作修改如下: + +```sh +export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --faster + --device=gpu +``` + +由于这里只是展示性能,我们直接在 `model_name_or_path` 填入PaddleNLP预训练模型名称 `unified_transformer-12L-cn-luge` 。 + +可以看到,由于该任务为对话任务,我们为了防止模型生成过多安全回复(如:哈哈哈、不错等),保证生成结果具有更多的随机性,我们选择TopK-sampling作为解码策略,并让k=5。 + +打开 `infer.py` ,可以看到我们传入的脚本参数大多都提供给了 `model.generate()` 方法: + +``` +output = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + seq_len=seq_len, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, 
+ early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fp16_decoding=args.use_fp16_decoding, + use_fast=args.faster) +``` + +运行脚本,输出结果如下: + +```sh +step 10 - 1.695s/step +step 20 - 1.432s/step +step 30 - 1.435s/step +``` + +可以看到,非加速版 `generate()` 方法的预测速度为每个step耗时1.5秒左右。 + +下面我们在启动脚本中传入 `--faster` 参数,该参数会向 `generate()` 方法传入 `use_fast=True` ,启动加速模式。同时我们需要设置 `--min_dec_len=0` ,因为FastGeneration当前还不支持该参数。新的脚本启动参数如下: + +```sh +export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=0 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu \ + --faster +``` + +再次运行脚本,输出结果如下(由于我们已经编译过高性能算子,所以这里不会重新编译): + +```sh +[2021-11-23 13:38:09,200] [ DEBUG] - skipping 'FastGeneration' extension (up-to-date) build +step 10 - 0.250s/step +step 20 - 0.156s/step +step 30 - 0.141s/step +``` + +可以看到,FastGeneration的预测速度为每个step耗时0.15秒左右,相比非加速版提速超过9倍。 diff --git a/fast_generation/perf/README.md b/fast_generation/perf/README.md new file mode 100644 index 0000000000000000000000000000000000000000..242bf765ec1187d9039f4bb67e4f4af21dc83443 --- /dev/null +++ b/fast_generation/perf/README.md @@ -0,0 +1,250 @@ +# FastGeneration Performance + +以下性能数据为非加速版generate方法和FastGeneration对比数据。 + +- **测试设备:** Tesla V100-SXM2-16GB +- **Batch Size:** 4 +- **Max Length:** 32 + +## 性能数据 +*** + +CUDA 10.1, cudnn 7, gcc 82 + +torch version 1.10.0+cu102, transformers version 4.12.5 + +**BART:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)<br>(ms) | HF generate<br>(ms) | Speed Up Rate<br>(Faster32/HF) | Speed Up Rate<br>(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 6<br>num_attention_heads = 12<br>hidden_size = 768<br>
(bart-base)|top_k = 1|37.53|34.01|136.89|3.65|4.02 +| |top_k = 4 |39.33|34.98|146.89|3.73|4.2 | +| |top_k = 8 |42.35|34.77|136.80|3.23|3.93| +| |top_k = 16 |40.95|35.45|148.45|3.63|4.19| +| |top_p = 0.4 |45.83|33.32|184.36|4.02|5.53| +| |num_beams = 4|44.72|37.51|242.73|5.43|6.47| +| |num_beams = 8|61.56|40.27|273.93|4.45|6.8 | +| |num_beams = 16|82.05|46.68|433.51|5.28|9.29| +|num_layers = 12
num_attention_heads = 16<br>hidden_size = 1024<br>
(bart-large)|top_k = 1|55.03|45.44|199.27|3.62|4.39| +| |top_k = 4|70.12|56.81|220.96|3.15|3.89| +| |top_k = 8|69.96|57.73|201.06|2.87|3.48| +| |top_k = 16|69.16|59.62|223.73|3.23|3.75| +| |top_p = 0.4|73.49|61.43|275.86|3.75|4.49| +| |num_beams = 4|66.44|50.71|277.61|4.18|5.47| +| |num_beams = 8|135.30|85.75|314.78|2.33|3.67| +| |num_beams = 16|168.01|100.22|441.95|2.63|4.41| + +**GPT:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)<br>(ms) | HF generate<br>(ms) | Speed Up Rate<br>(Faster32/HF) | Speed Up Rate<br>(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 12<br>num_attention_heads = 12<br>hidden_size = 768<br>
(gpt2)|top_k = 1|69.29|59.20|363.93|5.25|6.15| +| |top_k = 4|68.07|60.92|391.02|5.74|6.42| +| |top_k = 8|69.16|60.45|401.18|5.80|6.64| +| |top_k = 16|73.59|62.40|401.55|5.46|6.44| +| |top_p = 0.4|95.61|76.26|429.63|4.49|5.63| +|num_layers = 24
num_attention_heads = 16
hidden_size = 1024
(gpt2-medium)|top_k = 1|127.04|95.13|726.83|5.72|7.64| +| |top_k = 4|126.74|93.95|694.53|5.48|7.39| +| |top_k = 8|128.11|94.07|743.63|5.80|7.91| +| |top_k = 16|126.78|95.00|732.96|5.78|7.72| +| |top_p = 0.4|143.36|105.40|756.12|5.27|7.17| +|num_layers = 36
num_attention_heads = 20
hidden_size = 1280
(gpt2-large)|top_k = 1|236.80|200.37|1057.94|4.47|5.28|
+| |top_k = 4|236.69|201.95|1075.17|4.54|5.32|
+| |top_k = 8|237.04|202.00|1084.60|4.58|5.37|
+| |top_k = 16|235.01|201.79|1110.75|4.73|5.5|
+| |top_p = 0.4|270.31|205.84|1111.16|4.11|5.4|
+
+**OPT:**
+
+* 模型参数
+
+| Model Name | num_layers | num_attention_heads | hidden_size |
+|------------|------------|---------------------|-------------|
+| OPT-125m | 12 | 12 | 768 |
+| OPT-350M | 24 | 16 | 1024 |
+
+transformers: 4.20.1
+
+* 性能指标数据
+
+| Model | Decoding Strategy | Faster Generation(FP32)(ms) | Faster Generation(FP16)(ms) | HF Generation(ms) | Speed Up Rate(Faster32/HF) | Speed Up Rate(Faster16/HF) |
+|:--------:|:-------------------:|:-----------------------------:|:-----------------------------:|:-------------------:|:----------------------------:|:----------------------------:|
+| opt-125m | top_k=1 | 58.39 | 48.82 | 290.14 | 4.97 | 5.94 |
+| | top_k=4 | 58.45 | 49.05 | 283.55 | 4.85 | 5.78 |
+| | top_k=8 | 59.13 | 49.32 | 284.76 | 4.82 | 5.77 |
+| | top_k=16 | 60.15 | 49.54 | 299.87 | 4.99 | 6.05 |
+| | top_p=0.4 | 75.78 | 60.72 | 335.70 | 4.43 | 5.53 |
+| opt-350m | top_k=1 | 124.49 | 90.58 | 511.46 | 4.11 | 5.65 |
+| | top_k=4 | 125.60 | 90.96 | 528.42 | 4.21 | 5.81 |
+| | top_k=8 | 125.93 | 90.96 | 523.46 | 4.16 | 5.75 |
+| | top_k=16 | 126.25 | 91.58 | 524.79 | 4.16 | 5.73 |
+| | top_p=0.4 | 142.93 | 103.68 | 600.80 | 4.20 | 5.79 |
+
+***
+
+CUDA 11.2, cudnn 8, gcc 82
+
+torch version 1.10.0+cu113, transformers version 4.12.5
+
+**BART:**
+
+| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)
(ms) | HF generate
(ms) | Speed Up Rate
(Faster32/HF) | Speed Up Rate
(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 6
num_attention_heads = 12
hidden_size = 768
(bart-base)|top_k = 1|31.1|27.4|139.46|4.48|5.09 +| |top_k = 4 |32.13|29.06|149.81|4.66|5.16| +| |top_k = 8 |31.7|28.36|154.3|4.87|5.44| +| |top_k = 16 |32.93|28.66|145.85|4.43|5.09| +| |top_p = 0.4 |33.35|29.01|173.18|5.19|5.97| +| |num_beams = 4|47.55|38.02|252.71|5.31|6.65| +| |num_beams = 8|52.19|41.39|282.3|5.41|6.82| +| |num_beams = 16|67.18|45.82|441.59|6.57|9.64| +|num_layers = 12
num_attention_heads = 16
hidden_size = 1024
(bart-large)|top_k = 1|45.8|37.43|173.08|3.78|4.62| +| |top_k = 4|51.11|48.28|246.27|4.82|5.1| +| |top_k = 8|61.61|50.67|246.19|4.0|4.86| +| |top_k = 16|63.81|48.33|272.93|4.28|5.65| +| |top_p = 0.4|63.0|50.05|288.76|4.58|5.77| +| |num_beams = 4|65.54|48.58|273.84|4.18|5.64| +| |num_beams = 8|75.68|52.59|340.86|4.5|6.48| +| |num_beams = 16|102.87|62.25|477.97|4.65|7.68| + +**GPT:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)
(ms) | HF generate
(ms) | Speed Up Rate
(Faster32/HF) | Speed Up Rate
(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 12
num_attention_heads = 12
hidden_size = 768
(gpt2)|top_k = 1|50.84|40.37|399.58|7.86|9.9| +| |top_k = 4|50.38|38.81|419.55|8.33|10.81| +| |top_k = 8|51.23|36.78|411.7|8.04|11.19| +| |top_k = 16|51.03|38.76|408.36|8.0|10.54| +| |top_p = 0.4|68.55|48.04|489.45|7.14|10.19| +|num_layers = 24
num_attention_heads = 16
hidden_size = 1024
(gpt2-medium)|top_k = 1|111.37|79.73|753.11|6.76|9.45| +| |top_k = 4|110.53|80.48|767.48|6.94|9.54| +| |top_k = 8|109.87|78.92|754.99|6.87|9.57| +| |top_k = 16|110.61|85.26|764.16|6.91|8.96| +| |top_p = 0.4|127.51|87.72|830.24|6.51|9.46| +|num_layers = 36
num_attention_heads = 20
hidden_size = 1280
(gpt2-large)|top_k = 1|203.76|142.85|1108.26|5.44|7.76| +| |top_k = 4|204.18|139.49|1230.63|6.03|8.82| +| |top_k = 8|204.22|139.14|1238.96|6.07|8.9| +| |top_k = 16|204.11|140.04|1148.05|5.62|8.2| +| |top_p = 0.4|222.12|150.68|1248.75|5.62|8.29| + + +**OPT:** + +* 模型参数 + +| Model Name | num_layers | num_attention_heads | hidden_size | +|------------|------------|---------------------|-------------| +| OPT-125m | 12 | 12 | 768 | +| OPT-350M | 24 | 16 | 1024 | + +transformers: 4.20.1 + +* 性能结果报表 + +| Model | Decoding Strategy | Faster Generation(FP32)(ms) | Faster Generation(FP16)(ms) | HF Generation(ms) | Speed Up Rate(Faster32/HF) | Speed Up Rate(Faster16/HF) | +|:--------:|:-------------------:|:-----------------------------:|:-----------------------------:|:-------------------:|:----------------------------:|:----------------------------:| +| opt-125m | top_k=1 | 50.57 | 42.59 | 267.95 | 5.30 | 6.29 | +| | top_k=4 | 50.88 | 40.01 | 280.95 | 5.52 | 7.02 | +| | top_k=8 | 50.91 | 43.77 | 268.54 | 5.27 | 6.14 | +| | top_k=16 | 51.08 | 42.56 | 265.40 | 5.20 | 6.24 | +| | top_p=0.4 | 69.08 | 54.59 | 330.56 | 4.78 | 6.06 | +| opt-350m | top_k=1 | 110.22 | 77.82 | 507.00 | 4.60 | 6.51 | +| | top_k=4 | 110.76 | 77.93 | 479.42 | 4.33 | 6.15 | +| | top_k=8 | 142.07 | 78.86 | 513.79 | 3.62 | 6.52 | +| | top_k=16 | 110.80 | 78.19 | 488.34 | 4.41 | 6.25 | +| | top_p=0.4 | 128.33 | 92.57 | 544.18 | 4.24 | 5.88 | + +**CodeGen:** +* 环境和超参 + +- Platform: Tesla V100-SXM2-32GB +- CUDA 10.1 +- CUDNN 7.6.5 +- PaddlePaddle-gpu 2.3.1.post101 +- transformers==4.21.1 +- torch==1.11.0 +- Batch Size: 1 +- Input Length: 60 +- Output Length: 20 + +* 模型参数 + +| Model Name | num_layers | num_attention_heads | hidden_size | +|------------|------------|---------------------|-------------| +| Salesforce/codegen-350M-mono | 20 | 16 | 1024 | +| Salesforce/codegen-2B-mono | 32 | 32 | 2560 | +| Salesforce/codegen-6B-mono | 33 | 16 | 4096 | +| Salesforce/codegen-16B-mono | 34 | 24 | 6144 | + + + +* 性能结果报表 + +| Model | Decoding Strategy | Faster Generation(FP32)(ms) | Faster Generation(FP16)(ms) | HF Generation(ms) | Speed Up Rate(Faster32/HF) | Speed Up Rate(Faster16/HF) | +|:--------:|:-------------------:|:-----------------------------:|:-----------------------------:|:-------------------:|:----------------------------:|:----------------------------:| +| Salesforce/codegen-350M-mono | top_k=1 | 57.76 | 51.35 | 709.62 | 12.29 | 13.82 | +| | top_k=4 | 57.42 | 50.88 | 639.58 | 11.14 | 12.57 | +| | top_k=8 | 57.24 | 51.67 | 685.82 | 11.98 | 13.27 | +| | top_k=16 | 57.57 | 51.61 | 686.62 | 11.93 | 13.30 | +| | top_p=0.4 | 67.26 | 57.35 | 656.12 | 9.75 | 11.44 | +| Salesforce/codegen-2B-mono| top_k=1 | 319.03 | 207.41 | 1040.71 | 3.26 | 5.02 | +| | top_k=4 | 318.98 | 207.37 | 1014.32 | 3.18 | 4.89 | +| | top_k=8 | 319.66 | 207.26 | 1084.09 | 3.39 | 5.23 | +| | top_k=16 | 320.04 | 207.74 | 1040.28 | 3.25 | 5.01 | +| | top_p=0.4 | 329.07 | 213.97 | 1055.55 | 3.21 | 4.93 | +| Salesforce/codegen-6B-mono| top_k=1 | 762.91 | 411.94 | 1384.90 | 1.82 | 3.36 | +| | top_k=4 | 762.58 | 412.79 | 1378.32 | 1.81 | 3.34 | +| | top_k=8 | 763.43 | 413.32 | 1366.45 | 1.79 | 3.31 | +| | top_k=16 | 762.79 | 413.83 | 1376.69 | 1.80 | 3.33 | +| | top_p=0.4 | 771.77 | 419.16 | 1366.49 | 1.77 | 3.26 | + + +**Pegasus:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)
(ms) | HF generate
(ms) | Speed Up Rate
(Faster32/HF) | Speed Up Rate
(Faster16/HF) | +|-----|----|---|---|---|---|---| +|IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese|num_beams=2|87.41|75.47|1322.24|15.13|17.52 +| |num_beams=4 |91.55|66.47|1364.43|14.90|20.53| +| |num_beams=6 |94.55|73.25|1391.35|14.72|18.99| +| |num_beams=8 |100.48|84.82|1467.64|14.61|17.30| +|IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese|num_beams=2|120.15|94.26|1735.21|14.44|18.41| +| |num_beams=4 |126.42|99.07|1622.31|12.83|16.38| +| |num_beams=6 |142.21|99.95|1717.49|12.08|17.18| +| |num_beams=8 |158.26|104.31|1697.65|10.73|16.27| + + +## 测试方法 + +运行如下命令即可bart性能测试: + +```sh +bash run_perf_bart.sh +``` + +运行如下命令即可启动gpt性能测试: + +```sh +bash run_perf_gpt.sh +``` + +运行以上命令后,脚本会自动使用不同的模型参数进行性能测试,结果如下图所示: + +```sh +... +[2021-12-10 08:11:37,255] [ DEBUG] - skipping 'FastGeneration' extension (up-to-date) build +Namespace(decode_strategy='sampling', max_length=32, model_name_or_path='bart-base', num_beams=1, top_k=1, top_p=1.0, use_fp16_decoding=False) +Faster FP32 cost: 40.13654176145792 +PD cost: 511.413540635258 +HF cost: 138.49875444546342 +Speed up Faster FP32/PD: 12.741843671403577 +Speed up Faster FP32/HF: 3.4506897796177394 +... +... +[2021-12-10 08:13:42,858] [ DEBUG] - skipping 'FastGeneration' extension (up-to-date) build +Namespace(decode_strategy='sampling', max_length=32, model_name_or_path='bart-base', num_beams=1, top_k=1, top_p=1.0, use_fp16_decoding=True) +Faster FP16 cost: 34.004870522767305 +... +``` +可以看到,对于每组参数,脚本会先输出FP32和竞品的测试对比,再单独输出FP16的性能数据。 + +**NOTE:** 根据测试环境和机器状态的不同,以上性能测试脚本的结果可能与表中结果有所出入。 diff --git a/fast_generation/perf/bart_perf.py b/fast_generation/perf/bart_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..8466dafcaaefa8bd1a3e92f763f104ef578b77ea --- /dev/null +++ b/fast_generation/perf/bart_perf.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import paddle +import torch +from transformers import BartForConditionalGeneration as hf_bart_model + +from paddlenlp.data import Pad +from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer + + +def prepare_input(tokenizer, sentences): + word_pad = Pad(tokenizer.pad_token_id, dtype="int64") + tokenized = tokenizer(sentences) + inputs = word_pad([i["input_ids"] for i in tokenized]) + input_ids = paddle.to_tensor(inputs) + return input_ids + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="bart-base", + type=str, + choices=["bart-base", "bart-large"], + help="The model name to specify the bart to use. Can be one of ['bart-base', 'bart-large']. ", + ) + parser.add_argument( + "--decode_strategy", + default="sampling", + type=str, + choices=["greedy_search", "beam_search", "sampling"], + help="The decoding strategy. 
Can be one of ['greedy_search', 'beam_search', 'sampling']", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The parameters for beam search. ") + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--max_length", default=32, type=int, help="Maximum output length. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + tokenizer = BartTokenizer.from_pretrained(args.model_name_or_path) + model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path) + # Set evaluate mode + model.eval() + sentences = [ + "I love that girl, but does not me.", + "She is so that I can not help glance at .", + "Nothing's gonna my love for you.", + "Drop everything now. Meet me in the pouring . Kiss me on the sidewalk.", + ] + + input_ids = prepare_input(tokenizer, sentences) + + # Define model + model.eval() + + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + early_stopping=True, + use_fast=True, + use_fp16_decoding=args.use_fp16_decoding, + ) + paddle.device.cuda.synchronize(place) + fast_cost = (time.perf_counter() - start) / 50 * 1000 + + if args.use_fp16_decoding: + pprint(args) + print("Fast FP16 cost:", fast_cost) + return + + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + early_stopping=True, + ) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / 50 * 1000 + + device = torch.device("cuda:0") + hf_model = hf_bart_model.from_pretrained("facebook/" + args.model_name_or_path) + hf_model.to(device) + hf_model.eval() + hf_input_ids = prepare_input(tokenizer, sentences) + hf_input_ids = torch.tensor(hf_input_ids.numpy()) + hf_input_ids = hf_input_ids.to(device) + + if args.decode_strategy == "sampling": + do_sample = True + else: + do_sample = False + with torch.no_grad(): + for i in range(num_loop): + # For warmup. 
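+            # The first 50 of the 100 iterations warm up the GPU; the timer starts at
+            # i == 50, so "HF cost" below is the average latency in ms over the last 50 calls.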
+ if 50 == i: + torch.cuda.synchronize() + start = time.perf_counter() + hf_model.generate( + hf_input_ids, + do_sample=do_sample, + max_length=args.max_length + 1, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + no_repeat_ngram_size=0, + length_penalty=0.0, + ) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / 50 * 1000 + + pprint(args) + print("Fast FP32 cost:", fast_cost) + print("PD cost:", pd_cost) + print("HF cost:", hf_cost) + print("Speed up Fast FP32/PD:", pd_cost / fast_cost) + print("Speed up Fast FP32/HF:", hf_cost / fast_cost) + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/codegen_perf.py b/fast_generation/perf/codegen_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..8620a11336c5c48ed0d443a5654a857c8221782b --- /dev/null +++ b/fast_generation/perf/codegen_perf.py @@ -0,0 +1,175 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import numpy as np +import paddle +import pynvml + +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer + +pynvml.nvmlInit() + + +def query_by_id(gpu_id=2): + handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id) + meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle) + return meminfo.used // 1024 // 1024 + + +def perf_pd(args): + start_mem = query_by_id(args.gpu_id) + place = "gpu" + place = paddle.set_device(place) + tokenizer = CodeGenTokenizer.from_pretrained(args.model_name_or_path) + model = CodeGenForCausalLM.from_pretrained(args.model_name_or_path, load_state_as_np=True) + model.eval() + load_mem = query_by_id(args.gpu_id) + + input_ids_np = [ + np.random.choice(list(tokenizer.decoder.keys())[:-1], args.input_len) for _ in range(args.batch_size) + ] + input_ids = paddle.to_tensor(input_ids_np) + + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. 
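+            # Timing starts at the midpoint of the loop (the first half is warmup).
+            # GPU memory is re-sampled with pynvml after each generate() call, so
+            # generate_mem reports the usage observed while decoding.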
+ if num_loop // 2 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.generate_len, + min_length=args.generate_len, + decode_strategy="sampling", + top_k=args.top_k, + top_p=args.top_p, + use_fast=args.use_faster, + use_fp16_decoding=args.use_fp16_decoding, + ) + generate_mem = query_by_id(args.gpu_id) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return pd_cost, load_mem - start_mem, generate_mem - start_mem + + +def perf_hf(args): + import torch + from transformers import CodeGenForCausalLM as hf_codegen + from transformers import CodeGenTokenizer as hf_tokenizer + + start_mem = query_by_id(args.gpu_id) + device = torch.device("cuda") + tokenizer = hf_tokenizer.from_pretrained(args.model_name_or_path) + model = hf_codegen.from_pretrained(args.model_name_or_path) + model.to(device) + model.eval() + load_mem = query_by_id(args.gpu_id) + + input_ids_np = [np.random.choice(list(tokenizer.decoder.keys()), args.input_len) for _ in range(args.batch_size)] + input_ids = torch.tensor(input_ids_np) + input_ids = input_ids.to(device) + num_loop = 100 + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if num_loop // 2 == i: + torch.cuda.synchronize() + start = time.perf_counter() + model.generate( + input_ids, + do_sample=True, + max_length=args.generate_len + input_ids.shape[-1], + min_length=args.generate_len + input_ids.shape[-1], + top_k=args.top_k, + top_p=args.top_p, + ) + generate_mem = query_by_id(args.gpu_id) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return hf_cost, load_mem - start_mem, generate_mem - start_mem + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--perf_type", + default="pd", + type=str, + choices=["pd", "pd_faster_fp32", "pd_faster_fp16", "hf"], + help="The type of perf. ", + ) + parser.add_argument( + "--model_name_or_path", + default="Salesforce/codegen-350M-mono", + type=str, + choices=[ + "Salesforce/codegen-350M-mono", + "Salesforce/codegen-2B-mono", + "Salesforce/codegen-6B-mono", + "Salesforce/codegen-16B-mono", + ], + help="The model name to specify the bart to use. ", + ) + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure topk sampling. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--batch_size", default=1, type=int, help="The size of input batch. ") + parser.add_argument("--input_len", default=60, type=int, help="The size of model input. ") + parser.add_argument("--generate_len", default=20, type=int, help="Length of output . ") + parser.add_argument("--gpu_id", default=2, type=int, help="The id of GPU . ") + parser.add_argument( + "--use_faster", action="store_true", help="Whether to process inference using faster codegen. " + ) + + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. 
") + args = parser.parse_args() + return args + + +def do_predict(args): + try: + if args.perf_type == "pd": + args.use_faster = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp32": + args.use_faster = True + args.use_fp16_decoding = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp16": + args.use_faster = True + args.use_fp16_decoding = True + paddle.set_default_dtype("float16") + cost, load_mem, generate_mem = perf_pd(args) + else: + cost, load_mem, generate_mem = perf_hf(args) + pprint(args) + print( + f"CodeGenPerfResult: cost_time: {cost} ms, load_mem: {load_mem} MB, generate_mem:{generate_mem} MB, args:{args}\n" + ) + except Exception as e: + pprint(args) + print(f"CodeGenPerfResult: ERROR: {e}, args:{args}\n") + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/gpt_perf.py b/fast_generation/perf/gpt_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..87afcba682b4fd81abcbd6b60f20749cde814ce4 --- /dev/null +++ b/fast_generation/perf/gpt_perf.py @@ -0,0 +1,155 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import numpy as np +import paddle +import torch +from transformers import GPT2LMHeadModel as hf_gpt_model + +from paddlenlp.transformers import GPTLMHeadModel, GPTTokenizer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="gpt2-en", + type=str, + choices=["gpt2-en", "gpt2-medium-en", "gpt2-large-en"], + help="The model name to specify the bart to use. Can be one of ['gpt2-en', 'gpt2-medium-en', 'gpt2-large-en']. ", + ) + parser.add_argument( + "--decode_strategy", + default="sampling", + type=str, + choices=["greedy_search", "sampling"], + help="The decoding strategy. Can be one of ['greedy_search', 'sampling']", + ) + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument("--batch_size", default=4, type=int, help="The size of input batch. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--max_length", default=32, type=int, help="Maximum output length. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. 
") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + tokenizer = GPTTokenizer.from_pretrained(args.model_name_or_path) + model = GPTLMHeadModel.from_pretrained(args.model_name_or_path) + # Set evaluate mode + model.eval() + bos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + + input_ids_np = np.array([[bos_id] for i in range(args.batch_size)]).astype("int64").reshape([args.batch_size, 1]) + input_ids = paddle.to_tensor(input_ids_np) + # Define model + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + use_fast=True, + use_fp16_decoding=args.use_fp16_decoding, + ) + paddle.device.cuda.synchronize(place) + fast_cost = (time.perf_counter() - start) / 50 * 1000 + + if args.use_fp16_decoding: + pprint(args) + print("Fast FP16 cost:", fast_cost) + return + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + ) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / 50 * 1000 + + device = torch.device("cuda:0") + hf_model = hf_gpt_model.from_pretrained(args.model_name_or_path[:-3]) + hf_model.to(device) + hf_model.eval() + + hf_input_ids = torch.tensor(input_ids_np) + hf_input_ids = hf_input_ids.to(device) + + if args.decode_strategy == "sampling": + do_sample = True + else: + do_sample = False + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + torch.cuda.synchronize() + start = time.perf_counter() + hf_model.generate( + hf_input_ids, + do_sample=do_sample, + max_length=args.max_length + 1, + bos_token_id=bos_id, + eos_token_id=eos_id, + pad_token_id=0, + top_k=args.top_k, + top_p=args.top_p, + ) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / 50 * 1000 + + pprint(args) + print("Fast FP32 cost:", fast_cost) + print("PD cost:", pd_cost) + print("HF cost:", hf_cost) + print("Speed up Fast FP32/PD:", pd_cost / fast_cost) + print("Speed up Fast FP32/HF:", hf_cost / fast_cost) + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/opt_perf.py b/fast_generation/perf/opt_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..213881fbf947c116ebc9934eefc3aa5e7c99c9c8 --- /dev/null +++ b/fast_generation/perf/opt_perf.py @@ -0,0 +1,162 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +# append project root dir to project to make it run with latest code +import sys +import time +from pprint import pprint + +import numpy as np +import paddle +import torch +from transformers.models.opt.modeling_opt import OPTForCausalLM as hf_opt_model + +from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM + +sys.path.insert(0, "../../") + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="facebook/opt-125m", + type=str, + choices=["facebook/opt-125m", "facebook/opt-350m", "facebook/opt-1.3b", "facebook/opt-2.7b"], + help="The model name to specify the bart to use. Can be one of ['facebook/opt-125m', 'facebook/opt-350m', 'facebook/opt-1.3b', 'facebook/opt-2.7b']. ", + ) + parser.add_argument( + "--decode_strategy", + default="greedy_search", + type=str, + choices=["greedy_search", "sampling"], + help="The decoding strategy. Can be one of ['greedy_search', 'sampling']", + ) + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument("--batch_size", default=4, type=int, help="The size of input batch. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--max_length", default=32, type=int, help="Maximum output length. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + tokenizer = GPTTokenizer.from_pretrained(args.model_name_or_path) + model = OPTForCausalLM.from_pretrained(args.model_name_or_path) + # Set evaluate mode + model.eval() + bos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + + input_ids_np = np.array([[bos_id] for i in range(args.batch_size)]).astype("int64").reshape([args.batch_size, 1]) + input_ids = paddle.to_tensor(input_ids_np) + # Define model + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + use_fast=True, + use_fp16_decoding=args.use_fp16_decoding, + ) + paddle.device.cuda.synchronize(place) + fast_cost = (time.perf_counter() - start) / 50 * 1000 + + if args.use_fp16_decoding: + pprint(args) + print("Fast FP16 cost:", fast_cost) + return + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. 
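+            # Baseline: the same generate() call without use_fast / use_fp16_decoding,
+            # timed the same way and reported as "PD cost".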
+ if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + ) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / 50 * 1000 + + device = torch.device("cuda:0") + hf_model = hf_opt_model.from_pretrained(args.model_name_or_path) + + hf_model.to(device) + hf_model.eval() + + hf_input_ids = torch.tensor(input_ids_np) + hf_input_ids = hf_input_ids.to(device) + + if args.decode_strategy == "sampling": + do_sample = True + else: + do_sample = False + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + torch.cuda.synchronize() + start = time.perf_counter() + hf_model.generate( + hf_input_ids, + do_sample=do_sample, + max_length=args.max_length + 1, + bos_token_id=bos_id, + eos_token_id=eos_id, + pad_token_id=0, + top_k=args.top_k, + top_p=args.top_p, + ) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / 50 * 1000 + + pprint(args) + print("Fast FP32 cost:", fast_cost) + print("PD cost:", pd_cost) + print("HF cost:", hf_cost) + print("Speed up Fast FP32/PD:", pd_cost / fast_cost) + print("Speed up Fast FP32/HF:", hf_cost / fast_cost) + + +if __name__ == "__main__": + args = parse_args() + print(args.model_name_or_path) + do_predict(args) diff --git a/fast_generation/perf/pegasus_perf.py b/fast_generation/perf/pegasus_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..ae9c6ce61b6a09abca9d7c59379bccb740530617 --- /dev/null +++ b/fast_generation/perf/pegasus_perf.py @@ -0,0 +1,168 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import numpy as np +import paddle +import pynvml + +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + +pynvml.nvmlInit() + + +def query_by_id(gpu_id=2): + handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id) + meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle) + return meminfo.used // 1024 // 1024 + + +def perf_pd(args): + start_mem = query_by_id(args.gpu_id) + place = "gpu" + place = paddle.set_device(place) + tokenizer = PegasusChineseTokenizer.from_pretrained(args.model_name_or_path) + model = PegasusForConditionalGeneration.from_pretrained(args.model_name_or_path, load_state_as_np=True) + model.eval() + load_mem = query_by_id(args.gpu_id) + input_ids_np = [np.random.choice(range(len(tokenizer.vocab)), args.input_len) for _ in range(args.batch_size)] + input_ids = paddle.to_tensor(input_ids_np) + + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. 
+ if num_loop // 2 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.generate_len, + min_length=args.generate_len, + decode_strategy="beam_search", + num_beams=args.num_beams, + use_fast=args.use_faster, + use_fp16_decoding=args.use_fp16_decoding, + ) + generate_mem = query_by_id(args.gpu_id) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return pd_cost, load_mem - start_mem, generate_mem - start_mem + + +def perf_hf(args): + import torch + from tokenizers_pegasus import PegasusTokenizer as hf_tokenizer + from transformers import PegasusForConditionalGeneration as hf_pegasus + + start_mem = query_by_id(args.gpu_id) + device = torch.device("cuda") + tokenizer = hf_tokenizer.from_pretrained(args.model_name_or_path) + model = hf_pegasus.from_pretrained(args.model_name_or_path) + model.to(device) + model.eval() + load_mem = query_by_id(args.gpu_id) + + input_ids_np = [np.random.choice(range(len(tokenizer.vocab)), args.input_len) for _ in range(args.batch_size)] + input_ids = torch.tensor(input_ids_np) + input_ids = input_ids.to(device) + num_loop = 100 + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if num_loop // 2 == i: + torch.cuda.synchronize() + start = time.perf_counter() + model.generate( + input_ids, + do_sample=False, + num_beams=args.num_beams, + max_length=args.generate_len + input_ids.shape[-1], + min_length=args.generate_len + input_ids.shape[-1], + ) + generate_mem = query_by_id(args.gpu_id) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return hf_cost, load_mem - start_mem, generate_mem - start_mem + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--perf_type", + default="pd", + type=str, + choices=["pd", "pd_faster_fp32", "pd_faster_fp16", "hf"], + help="The type of perf. ", + ) + parser.add_argument( + "--model_name_or_path", + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + type=str, + choices=[ + "IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese", + ], + help="The model name to specify the pegasus to use. ", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The number of beams to procedure beam search. ") + parser.add_argument("--batch_size", default=1, type=int, help="The size of input batch. ") + parser.add_argument("--input_len", default=60, type=int, help="The size of model input. ") + parser.add_argument("--generate_len", default=20, type=int, help="Length of output . ") + parser.add_argument("--gpu_id", default=2, type=int, help="The id of GPU . ") + parser.add_argument( + "--use_faster", action="store_true", help="Whether to process inference using faster pegasus. " + ) + + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. 
") + args = parser.parse_args() + return args + + +def do_predict(args): + try: + if args.perf_type == "pd": + args.use_faster = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp32": + args.use_faster = True + args.use_fp16_decoding = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp16": + args.use_faster = True + args.use_fp16_decoding = True + # paddle.set_default_dtype('float16') + cost, load_mem, generate_mem = perf_pd(args) + else: + cost, load_mem, generate_mem = perf_hf(args) + pprint(args) + print( + f"PegasusPerfResult: cost_time: {cost} ms, load_mem: {load_mem} MB, generate_mem:{generate_mem} MB, args:{args}\n" + ) + except Exception as e: + pprint(args) + print(f"PegasusPerfResult: ERROR: {e}, args:{args}\n") + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/run_perf_bart.sh b/fast_generation/perf/run_perf_bart.sh new file mode 100644 index 0000000000000000000000000000000000000000..fa087770cb5afabe1a20cdce8acb5be6a17ca976 --- /dev/null +++ b/fast_generation/perf/run_perf_bart.sh @@ -0,0 +1,76 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=3 + +for model_name in bart-base bart-large; + do + for top_k in 1 4 8 16; + do + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 + sleep 10s + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + for num_beams in 4 8 16; + do + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=beam_search \ + --num_beams=$num_beams \ + --top_k=1 \ + --top_p=1 \ + --max_length=32 + sleep 10s + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=beam_search \ + --num_beams=$num_beams \ + --top_k=1 \ + --top_p=1 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + done \ No newline at end of file diff --git a/fast_generation/perf/run_perf_codegen.sh b/fast_generation/perf/run_perf_codegen.sh new file mode 100644 index 0000000000000000000000000000000000000000..be4792096e2efb3df8fcafcbf4f2260d109c9539 --- /dev/null +++ b/fast_generation/perf/run_perf_codegen.sh @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +GPU_ID=1 +export CUDA_VISIBLE_DEVICES=${GPU_ID} + +for model_name in Salesforce/codegen-350M-mono Salesforce/codegen-2B-mono Salesforce/codegen-6B-mono; + do + for top_k in 1 4 8 16; + do + for input_len in 60; + do + for generate_len in 20; + do + for perf_type in pd pd_faster_fp32 pd_faster_fp16 hf; + do + echo model_name: $model_name, perf_type: $perf_type, top_k: $top_k, top_p: 1.0, input_len: $input_len, generate_len: $generate_len + python codegen_perf.py \ + --model_name_or_path=$model_name \ + --perf_type=$perf_type \ + --top_k=$top_k \ + --top_p=1.0 \ + --input_len=$input_len \ + --generate_len=$generate_len \ + --gpu_id ${GPU_ID} + sleep 3s + done + done + done + done + for top_p in 0.4; + do + for input_len in 60; + do + for generate_len in 20; + do + for perf_type in pd pd_faster_fp32 pd_faster_fp16 hf; + do + echo model_name: $model_name, perf_type: $perf_type, top_k: 0, top_p: $top_p, input_len: $input_len, generate_len: $generate_len + python codegen_perf.py \ + --model_name_or_path=$model_name \ + --perf_type=$perf_type \ + --top_k=0 \ + --top_p=$top_p \ + --input_len=$input_len \ + --generate_len=$generate_len \ + --gpu_id ${GPU_ID} + sleep 3s + done + done + done + done + done \ No newline at end of file diff --git a/fast_generation/perf/run_perf_gpt.sh b/fast_generation/perf/run_perf_gpt.sh new file mode 100644 index 0000000000000000000000000000000000000000..5363b0546af65d01737b6a1e75a2aaa5c9fe3971 --- /dev/null +++ b/fast_generation/perf/run_perf_gpt.sh @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
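+
+# Sweep over gpt2-en / gpt2-medium-en / gpt2-large-en: sampling with top_k in {1, 4, 8, 16}
+# (top_p=1) plus one top_p=0.4 run (top_k=0), each with and without --use_fp16_decoding.
+# FP32 runs print the Fast FP32 / PD / HF costs; FP16 runs print only the Fast FP16 cost.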
+ +export CUDA_VISIBLE_DEVICES=3 + +for model_name in gpt2-en gpt2-medium-en gpt2-large-en; + do + for top_k in 1 4 8 16; + do + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 + sleep 10s + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done \ No newline at end of file diff --git a/fast_generation/perf/run_perf_opt.sh b/fast_generation/perf/run_perf_opt.sh new file mode 100644 index 0000000000000000000000000000000000000000..bc1d525c00acf95ec80e162794874e3b5b390373 --- /dev/null +++ b/fast_generation/perf/run_perf_opt.sh @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=3 + +for model_name in facebook/opt-125m facebook/opt-350m; + do + for top_k in 1 4 8 16; + do + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done diff --git a/fast_generation/perf/run_perf_pegasus.sh b/fast_generation/perf/run_perf_pegasus.sh new file mode 100644 index 0000000000000000000000000000000000000000..264c28b22c8ba45d8c104661da5237947d547b36 --- /dev/null +++ b/fast_generation/perf/run_perf_pegasus.sh @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
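+
+# Sweep over the two Randeng-Pegasus checkpoints: batch_size in {1, 4, 8, 16} x
+# num_beams in {2, 4, 6, 8}, comparing pd_faster_fp16 / pd_faster_fp32 / pd / hf.
+# Each run prints one "PegasusPerfResult" line with latency and GPU memory usage.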
+ +GPU_ID=4 +export CUDA_VISIBLE_DEVICES=${GPU_ID} + +for model_name in IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese; + do + for batch_size in 1 4 8 16; + do + for num_beams in 2 4 6 8; + do + for input_len in 60; + do + for generate_len in 20; + do + for perf_type in pd_faster_fp16 pd_faster_fp32 pd hf; + do + echo model_name: $model_name, perf_type: $perf_type, batch_size:$batch_size, num_beams: $num_beams, input_len: $input_len, generate_len: $generate_len + python pegasus_perf.py \ + --model_name_or_path=$model_name \ + --perf_type=$perf_type \ + --batch_size=$batch_size \ + --num_beams=$num_beams \ + --input_len=$input_len \ + --generate_len=$generate_len \ + --gpu_id ${GPU_ID} + sleep 3s + done + done + done + done + done + done \ No newline at end of file diff --git a/fast_generation/samples/codegen_16b_sample.py b/fast_generation/samples/codegen_16b_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..02c121645e1ca17203a1a75242d76bec2a77235e --- /dev/null +++ b/fast_generation/samples/codegen_16b_sample.py @@ -0,0 +1,38 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer + +# Can be load on A100-40G +paddle.set_default_dtype("float16") +model_name = "Salesforce/codegen-16B-mono" + +tokenizer = CodeGenTokenizer.from_pretrained(model_name) +model = CodeGenForCausalLM.from_pretrained(model_name, load_state_as_np=True) +model.eval() + +inputs = "def hello" +input_ids = tokenizer([inputs], return_tensors="pd")["input_ids"] + +# Enable FastGeneration +outputs, _ = model.generate( + input_ids=input_ids, max_length=128, decode_strategy="greedy_search", use_fp16_decoding=True, use_fast=True +) + +result = tokenizer.decode(outputs[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/codegen_sample.py b/fast_generation/samples/codegen_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..77cb5c7a335e8641b951adaa0d6bc1c3589962a3 --- /dev/null +++ b/fast_generation/samples/codegen_sample.py @@ -0,0 +1,37 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
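+
+# Minimal FastGeneration sample: complete a Python snippet with the
+# Salesforce/codegen-350M-mono checkpoint using greedy search and use_fast=True.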
+ +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer + +model_name = "Salesforce/codegen-350M-mono" + +tokenizer = CodeGenTokenizer.from_pretrained(model_name) +model = CodeGenForCausalLM.from_pretrained(model_name) +model.eval() + +inputs = "def hello" +input_ids = tokenizer([inputs], return_tensors="pd")["input_ids"] + +outputs, _ = model.generate( + input_ids=input_ids, max_length=128, decode_strategy="greedy_search", use_fp16_decoding=True, use_fast=True +) + +result = tokenizer.decode(outputs[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]) + +print("Model input:", inputs) +print("Result:", result) +# Result: _world(): +# print("Hello World") + +# hello_world() diff --git a/fast_generation/samples/gpt_mp_sample.py b/fast_generation/samples/gpt_mp_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..f2370f9b2e8fc9861e8769d64e55381d87d47869 --- /dev/null +++ b/fast_generation/samples/gpt_mp_sample.py @@ -0,0 +1,132 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import paddle + +from paddlenlp.ops import enable_ft_para, get_ft_para_conf +from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel, GPTTokenizer + +MODEL_CLASSES = { + "gpt-cpm-large-cn": (GPTLMHeadModel, GPTChineseTokenizer), + "gpt-cpm-small-cn-distill": (GPTLMHeadModel, GPTChineseTokenizer), + "gpt2-en": (GPTLMHeadModel, GPTTokenizer), + "gpt2-medium-en": (GPTLMHeadModel, GPTTokenizer), + "gpt2-large-en": (GPTLMHeadModel, GPTTokenizer), + "gpt2-xl-en": (GPTLMHeadModel, GPTTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name", + default="gpt-cpm-large-cn", + choices=list(MODEL_CLASSES.keys()), + help="The model name to specify which gpt to use. It can be " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument("--batch_size", default=4, type=int, help="Batch size.") + parser.add_argument("--max_length", default=50, type=int, help="Maximum output length.") + parser.add_argument( + "--topk", default=1, type=int, help="The number of highest probability tokens to keep for top-k-sampling." + ) + parser.add_argument("--topp", default=1.0, type=float, help="The cumulative probability for top-p-filtering.") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set.") + parser.add_argument("--tensor_para_size", default=2, type=int, help="The size for tensor parallel.") + parser.add_argument("--layer_para_size", default=1, type=int, help="The size for layer parallel.") + parser.add_argument( + "--layer_para_batch_size", + default=None, + type=int, + help="The local batch size for pipeline parallel." 
"It is suggested to use `batch_size // layer_para_size`.", + ) + parser.add_argument("--use_fp16", action="store_true", help="Whether to use fp16 to predict.") + parser.add_argument("--profile", action="store_true", help="Whether to profile.") + args = parser.parse_args() + return args + + +def profile(batch_size, total_step=50, warmup_step=10, rank=0): + def _wrapper(func): + def _impl(*args, **kwargs): + for i in range(total_step): + if i == warmup_step: + paddle.device.cuda.synchronize() + start_time = time.time() + out = func(*args, **kwargs) + paddle.device.cuda.synchronize() + end_time = time.time() + if rank is None or get_ft_para_conf().rank == rank: + time_interval = end_time - start_time + num_batch = total_step - warmup_step + print("Latency: %2fs, QPS: %2f" % (time_interval / num_batch, num_batch * batch_size / time_interval)) + return out + + return _impl + + return _wrapper + + +def main(args): + if args.use_fp16: + paddle.set_default_dtype("float16") + enable_ft_para( + args.tensor_para_size, + args.layer_para_size, + args.batch_size // args.layer_para_size if args.layer_para_batch_size is None else args.layer_para_batch_size, + ) + # TODO(guosheng): Maybe device can be set in `enable_ft_para` + paddle.set_device("gpu:" + str(get_ft_para_conf().rank)) + + model_name = args.model_name + if args.profile: + MODEL_CLASSES[model_name][0].generate = profile(args.batch_size)(MODEL_CLASSES[model_name][0].generate) + tokenizer = MODEL_CLASSES[model_name][-1].from_pretrained(model_name) + model = MODEL_CLASSES[model_name][0].from_pretrained(model_name, load_state_as_np=True) + model.eval() + + # NOTE: When using prompt, open this and replace the text with what you want. + input = "花间一壶酒,独酌无相亲。举杯邀明月," + # input = '一时黛玉进了荣府,下了车。众嬷嬷引着,便往东转弯,' + # input = '爱因斯坦曾经说过:' + input_ids = tokenizer(input)["input_ids"] + # NOTE: When generating from the beginning, open this. + # input_ids = [tokenizer.eos_token_id] + input_ids = [input_ids] * args.batch_size + + inputs_ids = paddle.to_tensor(input_ids, dtype="int32") + outputs, _ = model.generate( + input_ids=inputs_ids, + max_length=args.max_length, + decode_strategy="sampling", + top_k=args.topk, + top_p=args.topp, + temperature=args.temperature, + use_fast=True, + ) + + # Only make the first process to output. + if get_ft_para_conf().rank == 0: + for i in range(len(outputs)): + result = tokenizer.convert_ids_to_string(outputs[i].numpy().tolist()) + print("Result:", result) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + main(args) diff --git a/fast_generation/samples/gpt_sample.py b/fast_generation/samples/gpt_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..e0cff0bba726b680014055e6cf4ac6db3bf6ef28 --- /dev/null +++ b/fast_generation/samples/gpt_sample.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle + +from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel + +model_name = "gpt-cpm-small-cn-distill" + +tokenizer = GPTChineseTokenizer.from_pretrained(model_name) +model = GPTLMHeadModel.from_pretrained(model_name) +model.eval() + +inputs = "花间一壶酒,独酌无相亲。举杯邀明月," +inputs_ids = tokenizer(inputs)["input_ids"] +inputs_ids = paddle.to_tensor(inputs_ids, dtype="int64").unsqueeze(0) + +outputs, _ = model.generate(input_ids=inputs_ids, max_length=10, decode_strategy="greedy_search", use_fast=True) + +result = tokenizer.convert_ids_to_string(outputs[0].numpy().tolist()) + +print("Model input:", inputs) +print("Result:", result) +# 对影成三人。 diff --git a/fast_generation/samples/gptj_sample.py b/fast_generation/samples/gptj_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..f335121287c87f6d88c61b863396e28811c2d6f6 --- /dev/null +++ b/fast_generation/samples/gptj_sample.py @@ -0,0 +1,42 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.transformers import GPTJForCausalLM, GPTJTokenizer + +paddle.set_default_dtype("float16") +model_name = "EleutherAI/gpt-j-6B" + +tokenizer = GPTJTokenizer.from_pretrained(model_name) +model = GPTJForCausalLM.from_pretrained(model_name, load_state_as_np=True) +model.eval() + +inputs = "What is PaddleNLP?" +input_ids = tokenizer([inputs], return_tensors="pd")["input_ids"] + +outputs, _ = model.generate( + input_ids=input_ids, + max_length=100, + decode_strategy="sampling", + temperature=0.8, + top_p=0.9, + use_fp16_decoding=True, + use_fast=True, +) + +result = tokenizer.decode(outputs[0]) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/mbart_sample.py b/fast_generation/samples/mbart_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..e16c4e7de1764f4ba911a000b054df6505d7d9e3 --- /dev/null +++ b/fast_generation/samples/mbart_sample.py @@ -0,0 +1,58 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
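+
+# Minimal FastGeneration sample: translate an English sentence into Chinese with
+# mbart-large-50-many-to-many-mmt, forcing the target language via forced_bos_token_id
+# and decoding with beam search (num_beams=4) and use_fast=True.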
+import paddle + +from paddlenlp.transformers import MBart50Tokenizer, MBartForConditionalGeneration + +model_name = "mbart-large-50-many-to-many-mmt" + +tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="en_XX") +model = MBartForConditionalGeneration.from_pretrained(model_name) +model.eval() + + +def postprocess_response(seq, bos_idx, eos_idx): + """Post-process the decoded sequence.""" + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if idx != bos_idx and idx != eos_idx] + res = tokenizer.convert_ids_to_string(seq) + return res + + +bos_id = tokenizer.lang_code_to_id["zh_CN"] +eos_id = model.mbart.config["eos_token_id"] + +inputs = "PaddleNLP is a powerful NLP library with Awesome pre-trained models and easy-to-use interface, supporting wide-range of NLP tasks from research to industrial applications." +input_ids = tokenizer(inputs)["input_ids"] +input_ids = paddle.to_tensor(input_ids, dtype="int32").unsqueeze(0) + +outputs, _ = model.generate( + input_ids=input_ids, + forced_bos_token_id=bos_id, + decode_strategy="beam_search", + num_beams=4, + max_length=50, + use_fast=True, +) + +result = postprocess_response(outputs[0].numpy().tolist(), bos_id, eos_id) + +print("Model input:", inputs) + +print("Result:", result) +# PaddleNLP是一个强大的NLP库,具有超乎寻常的预训练模型和易于使用的接口,支持从研究到工业应用的广泛的NLP任务。 diff --git a/fast_generation/samples/opt_sample.py b/fast_generation/samples/opt_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..812fd6e01b8f39e423f53e1c5cd6aea32306abce --- /dev/null +++ b/fast_generation/samples/opt_sample.py @@ -0,0 +1,45 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM + +model_name = "facebook/opt-350m" + +tokenizer = GPTTokenizer.from_pretrained(model_name) +model = OPTForCausalLM.from_pretrained(model_name) +model.eval() + +inputs = """a chat between a curious human and Statue of Liberty. +Human: What is your name? +Statue: I am statue of liberty. +Human: where do you live? +Statue: New york city. +Human: how long have you lived there?。""" + +inputs_ids = tokenizer([inputs])["input_ids"] +inputs_ids = paddle.to_tensor(inputs_ids, dtype="int64") + +outputs, _ = model.generate( + input_ids=inputs_ids, + max_length=20, + decode_strategy="greedy_search", + use_fast=True, +) + +result = tokenizer.convert_ids_to_string(outputs[0].numpy().tolist()) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/pegasus_sample.py b/fast_generation/samples/pegasus_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..ddbc340808b602fabb78305b32c57e82a344585d --- /dev/null +++ b/fast_generation/samples/pegasus_sample.py @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + +model = PegasusForConditionalGeneration.from_pretrained("IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese") +tokenizer = PegasusChineseTokenizer.from_pretrained("IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese") +model.eval() + +inputs = "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!" +tokenized = tokenizer(inputs, return_tensors="pd") +outputs, _ = model.generate( + input_ids=tokenized["input_ids"], + decode_strategy="beam_search", + num_beams=4, + use_fp16_decoding=True, + use_fast=True, +) +result = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/plato_sample.py b/fast_generation/samples/plato_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..ac79e60918e49a9596e2b6a49739f58af6817fa7 --- /dev/null +++ b/fast_generation/samples/plato_sample.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + +model_name = "plato-mini" + +tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name) +model = UnifiedTransformerLMHeadModel.from_pretrained(model_name) +model.eval() + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +inputs = "你好啊,你今年多大了" + +inputs_ids = tokenizer.dialogue_encode( + inputs, add_start_token_as_response=True, return_tensors=True, is_split_into_words=False +) + +outputs, _ = model.generate( + input_ids=inputs_ids["input_ids"], + token_type_ids=inputs_ids["token_type_ids"], + position_ids=inputs_ids["position_ids"], + attention_mask=inputs_ids["attention_mask"], + max_length=64, + decode_strategy="sampling", + top_k=5, + use_fast=True, +) + +result = postprocess_response(outputs[0].numpy(), tokenizer) +result = "".join(result) + +print("Model input:", inputs) +print("Result:", result) +# 我今年23岁了,你今年多大了? diff --git a/fast_generation/samples/plato_xl_sample.py b/fast_generation/samples/plato_xl_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..b7c91f4fc9217215c3e71169aac8cc76e08e4b39 --- /dev/null +++ b/fast_generation/samples/plato_xl_sample.py @@ -0,0 +1,162 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from distutils.util import strtobool +from pprint import pprint + +import paddle + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.ops import enable_ft_para, get_ft_para_conf +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--use_role", type=strtobool, default=True, help="Whether to use role embeddings.") + parser.add_argument( + "--position_style", + default="relative", + choices=["continuous", "relative"], + type=str, + help="The type for positional embedding. Default is relative.", + ) + parser.add_argument("--batch_size", default=1, type=int, help="Batch size.") + parser.add_argument( + "--num_return_sequences", default=1, type=int, help="The number of returned sequences for each sample." + ) + parser.add_argument("--max_out_len", default=64, type=int, help="Maximum output sequence length.") + parser.add_argument("--min_out_len", default=1, type=int, help="Minimum output sequence length.") + parser.add_argument( + "--topk", default=1, type=int, help="The number of highest probability tokens to keep for top-k-sampling." 
+ ) + parser.add_argument("--topp", default=1.0, type=float, help="The cumulative probability for top-p-filtering.") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set.") + parser.add_argument("--use_fp16", action="store_true", help="Whether to use fp16 to predict.") + parser.add_argument("--profile", action="store_true", help="Whether to profile.") + args = parser.parse_args() + return args + + +def profile(batch_size, total_step=50, warmup_step=10, rank=0): + def _wrapper(func): + def _impl(*args, **kwargs): + for i in range(total_step): + if i == warmup_step: + paddle.device.cuda.synchronize() + start_time = time.time() + out = func(*args, **kwargs) + paddle.device.cuda.synchronize() + end_time = time.time() + if rank is None or get_ft_para_conf().rank == rank: + time_interval = end_time - start_time + num_batch = total_step - warmup_step + print("Latency: %2fs, QPS: %2f" % (time_interval / num_batch, num_batch * batch_size / time_interval)) + return out + + return _impl + + return _wrapper + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + response = " ".join(tokens) + return response + + +def main(args): + # For memory saving when using FastGeneration: + # If environment variable `PPFG_QKV_MEM_OPT` is set and the weights of q/k/v + # is fused, it will try to delete the original unfused weights. Note the + # rollback to original model would not be guarantee anymore when the fast + # model failed if the original weights are deleted. + os.environ["PPFG_QKV_MEM_OPT"] = "1" + if args.use_fp16: + paddle.set_default_dtype("float16") + enable_ft_para() + # TODO(guosheng): Maybe device can be set in `enable_ft_para` + paddle.set_device("gpu:" + str(get_ft_para_conf().rank)) + + if args.profile: + UnifiedTransformerLMHeadModel.generate = profile(args.batch_size)(UnifiedTransformerLMHeadModel.generate) + tokenizer = UnifiedTransformerTokenizer.from_pretrained("plato-xl") + model = UnifiedTransformerLMHeadModel.from_pretrained("plato-xl", load_state_as_np=True) + model.eval() + + history = [ + "hi , Mary ! What do you usually like to do in your spare time ?", + "well , I spend a lot of time watching movies .", + "what a confidence ! I always watch a lot of movies , too ." + "oh really , Frank ? 
What kind of movies do you like ?", + ] + inputs = [history] * args.batch_size + inputs = list( + map( + lambda history: tokenizer.dialogue_encode( + history=history, + add_start_token_as_response=True, + return_length=True, + return_role_ids=args.use_role, + position_style=args.position_style, + ), + inputs, + ) + ) + collator = DataCollatorWithPadding(tokenizer) + data = collator(inputs) + + outputs, _ = model.generate( + input_ids=data["input_ids"], + token_type_ids=data["token_type_ids"], + position_ids=data["position_ids"], + attention_mask=data["attention_mask"].cast("float32"), # TODO(guosheng): remove this cast + role_ids=data.get("role_ids", None), + seq_len=data["seq_len"], + max_length=args.max_out_len, + min_length=args.min_out_len, + decode_strategy="sampling", + top_k=args.topk, + top_p=args.topp, + temperature=args.temperature, + num_return_sequences=args.num_return_sequences, + use_fast=True, + use_fp16_decoding=args.use_fp16, + ) + + # Only make the first process to output. + if get_ft_para_conf().rank == 0: + for i in range(len(outputs)): + result = postprocess_response(outputs[i].numpy(), tokenizer) + print("Result:", result) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + main(args) diff --git a/fast_generation/samples/t5_sample.py b/fast_generation/samples/t5_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..53ad13f903c115e647411c0edb56dab752653a73 --- /dev/null +++ b/fast_generation/samples/t5_sample.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--max_length", default=256, type=int, help="Maximum output sequence length.") + parser.add_argument("--beam_size", default=4, type=int, help="The beam size to set.") + parser.add_argument("--use_faster", action="store_true", help="Whether to use faster to predict.") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 to predict.") + args = parser.parse_args() + return args + + +def predict(args): + model_name = "t5-base" + + model = T5ForConditionalGeneration.from_pretrained(model_name) + model.eval() + tokenizer = T5Tokenizer.from_pretrained(model_name) + + en_text = ' This image section from an infrared recording by the Spitzer telescope shows a "family portrait" of countless generations of stars: the oldest stars are seen as blue dots. 
' + input_ids = tokenizer.encode("translate English to French: " + en_text, return_tensors="pd")["input_ids"] + + output, _ = model.generate( + input_ids=input_ids, + num_beams=args.beam_size, + max_length=args.max_length, + decode_strategy="beam_search", + use_fast=True, # args.use_faster, + use_fp16_decoding=args.use_fp16_decoding, + ) + + translation = tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False) + + print("The original sentence: ", en_text) + print("The translation result: ", translation) + + +if __name__ == "__main__": + args = parse_args() + + predict(args) diff --git a/fast_generation/samples/unimo_text_sample.py b/fast_generation/samples/unimo_text_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..29197be47e52b209160e9fe5c52586e9b265219b --- /dev/null +++ b/fast_generation/samples/unimo_text_sample.py @@ -0,0 +1,59 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer + +model_name = "unimo-text-1.0-lcsts-new" + +model = UNIMOLMHeadModel.from_pretrained(model_name) +model.eval() +tokenizer = UNIMOTokenizer.from_pretrained(model_name) + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +inputs = "深度学习是人工智能的核心技术领域。百度飞桨作为中国首个自主研发、功能丰富、开源开放的产业级深度学习平台,将从多层次技术产品、产业AI人才培养和强大的生态资源支持三方面全面护航企业实现快速AI转型升级。" + +inputs_ids = tokenizer.gen_encode( + inputs, add_start_token_for_decoding=True, return_tensors=True, is_split_into_words=False +) + +outputs, _ = model.generate( + input_ids=inputs_ids["input_ids"], + token_type_ids=inputs_ids["token_type_ids"], + position_ids=inputs_ids["position_ids"], + attention_mask=inputs_ids["attention_mask"], + max_length=64, + decode_strategy="beam_search", + num_beams=2, + use_fast=True, +) + +result = postprocess_response(outputs[0].numpy(), tokenizer) +result = "".join(result) + +print("Model input:", inputs) +print("Result:", result) +# 百度飞桨:深度学习助力企业转型升级 diff --git a/fast_tokenizer/CMakeLists.txt b/fast_tokenizer/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..ce238239ae7f88c87ad301c39cef25a287c64e0e --- /dev/null +++ b/fast_tokenizer/CMakeLists.txt @@ -0,0 +1,235 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +cmake_minimum_required(VERSION 3.10) + +project(tokenizers LANGUAGES CXX C VERSION 1.0) + +option(WITH_TESTING "Compile PaddleNLP fast_tokenizer with unit testing" OFF) +option(WITH_PYTHON "Compile PaddleNLP fast_tokenizer with python interpreter" ON) +option(WITH_ICU_LITE "Compile PaddleNLP fast_tokenizer with lite icu library" OFF) +option(USE_ABI0 "Set -D_GLIBCXX_USE_CXX11_ABI to 0" OFF) + +add_definitions(-DFASTTOKENIZER_LIB) + +set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake") +set (PUBLIC_DEPEND_LIBS "") + +if(NOT CMAKE_BUILD_TYPE) + set(CMAKE_BUILD_TYPE "Release" CACHE STRING + "Choose the type of build, options are: Debug Release +RelWithDebInfo MinSizeRel." + FORCE) +endif(NOT CMAKE_BUILD_TYPE) + +if(NOT WIN32) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11") + if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "aarch64") + set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,--no-as-needed") + endif() + if (USE_ABI0) + message(STATUS "-D_GLIBCXX_USE_CXX11_ABI will be set to 0.") + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0") + else() + message(STATUS "-D_GLIBCXX_USE_CXX11_ABI will be set to 1.") + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=1) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=1") + endif() +else() + set(CMAKE_CXX_STANDARD 11) +endif() + +IF(WIN32) +# Need to add flags for windows +foreach( + flag_var + CMAKE_CXX_FLAGS + CMAKE_CXX_FLAGS_DEBUG + CMAKE_CXX_FLAGS_RELEASE + CMAKE_CXX_FLAGS_MINSIZEREL + CMAKE_CXX_FLAGS_RELWITHDEBINFO + CMAKE_C_FLAGS + CMAKE_C_FLAGS_DEBUG + CMAKE_C_FLAGS_RELEASE + CMAKE_C_FLAGS_MINSIZEREL + CMAKE_C_FLAGS_RELWITHDEBINFO) + if(${flag_var} MATCHES "/MD") + string(REGEX REPLACE "/MD" "/MT" ${flag_var} "${${flag_var}}") + elseif(NOT (${flag_var} MATCHES "/MT")) + set(${flag_var} "${${flag_var}} /MT") + endif() + set(${flag_var} "${${flag_var}}" CACHE STRING "msvc compiler flags" FORCE) +endforeach() + +add_definitions("-DNOMINMAX") +# windows build turn off warnings, use parallel compiling. 
+ +foreach( + flag_var + CMAKE_CXX_FLAGS + CMAKE_CXX_FLAGS_DEBUG + CMAKE_CXX_FLAGS_RELEASE + CMAKE_CXX_FLAGS_MINSIZEREL + CMAKE_CXX_FLAGS_RELWITHDEBINFO + CMAKE_C_FLAGS + CMAKE_C_FLAGS_DEBUG + CMAKE_C_FLAGS_RELEASE + CMAKE_C_FLAGS_MINSIZEREL + CMAKE_C_FLAGS_RELWITHDEBINFO) + string(REGEX REPLACE "/W[1-4]" " /W0 " ${flag_var} "${${flag_var}}") +endforeach() + +foreach(flag_var CMAKE_CXX_FLAGS CMAKE_C_FLAGS) + set(${flag_var} "${${flag_var}} /w") +endforeach() + +# Windows Remove /Zi, /ZI for Release, MinSizeRel builds +foreach(flag_var + CMAKE_C_FLAGS CMAKE_C_FLAGS_RELEASE CMAKE_C_FLAGS_MINSIZEREL + CMAKE_CXX_FLAGS CMAKE_CXX_FLAGS_RELEASE CMAKE_CXX_FLAGS_MINSIZEREL) +if(${flag_var} MATCHES "/Z[iI]") + string(REGEX REPLACE "/Z[iI]" "" ${flag_var} "${${flag_var}}") +endif() +endforeach() + +set(CMAKE_SUPPRESS_REGENERATION ON) +set(CMAKE_STATIC_LIBRARY_PREFIX lib) + +set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /bigobj") +set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /bigobj") +set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /bigobj") +set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /bigobj") + +if("${CMAKE_GENERATOR}" STREQUAL "Ninja") + set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /Zc:inline") + set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /Zc:inline") + set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /Zc:inline") + set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /Zc:inline") +endif() + +foreach(flag_var CMAKE_SHARED_LINKER_FLAGS CMAKE_STATIC_LINKER_FLAGS + CMAKE_EXE_LINKER_FLAGS CMAKE_LINKER_FLAGS) +set(${flag_var} + "${${flag_var}} /ignore:4049 /ignore:4217 /ignore:4006 /ignore:4221") +set(${flag_var} "${${flag_var}} /NODEFAULTLIB:MSVCRT.LIB") +endforeach() + +ELSE(WIN32) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fPIC") + IF (NOT APPLE) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ldl") + IF (NOT ANDROID) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread") + ELSE() + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Os") + ENDIF() + ENDIF() + set (PUBLIC_DEPEND_LIBS ${CMAKE_DL_LIBS}) +ENDIF(WIN32) + +set(CMAKE_INSTALL_PREFIX ${PROJECT_SOURCE_DIR}) +set(TOKENIZERS_INSTALL_INCLUDE_DIR ${PROJECT_SOURCE_DIR}) + +message("CMAKE_BUILD_TYPE = " ${CMAKE_BUILD_TYPE}) +message("CMAKE_CXX_FLAGS = ${CMAKE_CXX_FLAGS}") +message("CMAKE_EXE_LINKER_FLAGS = ${CMAKE_EXE_LINKER_FLAGS}") + +# config GIT_URL with github mirrors to speed up dependent repos clone +option(GIT_URL "Git URL to clone dependent repos" ${GIT_URL}) +if(NOT GIT_URL) + set(GIT_URL "https://github.com") +endif() + +include_directories(${TOKENIZERS_INSTALL_INCLUDE_DIR}) + +include(generic) +include(third_party) + +add_subdirectory(fast_tokenizer) + +if(WITH_PYTHON) + +add_subdirectory(python) + +if (NOT APPLE AND NOT WIN32) # Linux +add_custom_target(build_tokenizers_bdist_wheel ALL + COMMAND ${PYTHON_EXECUTABLE} setup.py bdist_wheel --plat-name=manylinux1_x86_64 + COMMENT "Packing whl packages------>>>" + DEPENDS copy_python_tokenizers) +else() +add_custom_target(build_tokenizers_bdist_wheel ALL + COMMAND ${CMAKE_COMMAND} -E env ${py_env} ${PYTHON_EXECUTABLE} setup.py bdist_wheel + COMMENT "Packing whl packages------>>>" + DEPENDS copy_python_tokenizers) +endif() + +else(WITH_PYTHON) # Pack fast_tokenizer cpp lib + +set(CPP_PACKAGE_DIR ${CMAKE_BINARY_DIR}/cpp/fast_tokenizer) +add_custom_target(build_cpp_package_dir ALL + COMMAND ${CMAKE_COMMAND} -E make_directory ${CPP_PACKAGE_DIR}/lib ${CPP_PACKAGE_DIR}/include ${CPP_PACKAGE_DIR}/third_party/include ${CPP_PACKAGE_DIR}/third_party/lib + DEPENDS 
core_tokenizers) + +# copy cmake +configure_file(${PROJECT_SOURCE_DIR}/FastTokenizer.cmake.in ${PROJECT_SOURCE_DIR}/FastTokenizer.cmake @ONLY) +file(COPY ${PROJECT_SOURCE_DIR}/FastTokenizer.cmake DESTINATION ${CPP_PACKAGE_DIR}/) + +# copy headers +file(COPY ${PROJECT_SOURCE_DIR}/fast_tokenizer/ DESTINATION ${CPP_PACKAGE_DIR}/include/fast_tokenizer/ + FILES_MATCHING PATTERN "*.h" + PATTERN "test" EXCLUDE + PATTERN "demo" EXCLUDE + PATTERN "pybind" EXCLUDE) + +add_custom_target(copy_third_party_headers ALL + COMMAND ${CMAKE_COMMAND} -E copy_directory + ${GFLAGS_INCLUDE_DIR} ${ICU_INCLUDE_DIR} ${DART_INCLUDE_DIR} + ${GLOG_INCLUDE_DIR} ${JSON_INCLUDE_DIR} ${RE2_INCLUDE_DIR} + ${CPP_PACKAGE_DIR}/third_party/include + DEPENDS build_cpp_package_dir) + +# copy library +set(TOKENIZER_CORE_NAME "core_tokenizers") +set(TOKENIZER_CORE_PATH ${CMAKE_BINARY_DIR}/fast_tokenizer) +if (WIN32) + set(ICU_DLL_DIR ${CMAKE_BINARY_DIR}/third_party/icu/src/extern_icu/icu4c/bin64) + set(ICU_LIB_DIR ${CMAKE_BINARY_DIR}/third_party/icu/src/extern_icu/icu4c/lib64) + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_PATH}/${TOKENIZER_CORE_NAME}.dll ${TOKENIZER_CORE_PATH}/${TOKENIZER_CORE_NAME}.lib ${CPP_PACKAGE_DIR}/lib + COMMAND ${CMAKE_COMMAND} -E copy ${ICU_DLL_DIR}/icudt70.dll ${ICU_DLL_DIR}/icuuc70.dll ${ICU_LIB_DIR}/icudt.lib ${ICU_LIB_DIR}/icuuc.lib ${CPP_PACKAGE_DIR}/third_party/lib + DEPENDS build_cpp_package_dir core_tokenizers) +elseif(APPLE) + set(TOKENIZER_CORE_LIBS_PATH "${TOKENIZER_CORE_PATH}/lib${TOKENIZER_CORE_NAME}.dylib") + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_LIBS_PATH} ${CPP_PACKAGE_DIR}/lib + DEPENDS build_cpp_package_dir core_tokenizers) +elseif(ANDROID) + set(TOKENIZER_CORE_LIBS_PATH "${TOKENIZER_CORE_PATH}/lib${TOKENIZER_CORE_NAME}.so") + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_LIBS_PATH} ${CPP_PACKAGE_DIR}/lib + COMMAND ${CMAKE_STRIP} --strip-all ${CPP_PACKAGE_DIR}/lib/lib${TOKENIZER_CORE_NAME}.so + DEPENDS build_cpp_package_dir core_tokenizers) +else() + set(TOKENIZER_CORE_LIBS_PATH "${TOKENIZER_CORE_PATH}/lib${TOKENIZER_CORE_NAME}.so") + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_LIBS_PATH} ${CPP_PACKAGE_DIR}/lib + DEPENDS build_cpp_package_dir core_tokenizers) +endif() + +add_custom_target(create_commit_id_file ALL + COMMAND ${GIT_EXECUTABLE} log -1 --format=%H > ${CPP_PACKAGE_DIR}/commit.log + DEPENDS copy_shared_library) +endif(WITH_PYTHON) + diff --git a/fast_tokenizer/FastTokenizer.cmake.in b/fast_tokenizer/FastTokenizer.cmake.in new file mode 100644 index 0000000000000000000000000000000000000000..cf9c341867686ff40f59cb5cb392c4f02e4f5211 --- /dev/null +++ b/fast_tokenizer/FastTokenizer.cmake.in @@ -0,0 +1,44 @@ +CMAKE_MINIMUM_REQUIRED (VERSION 3.12) + +set(USE_ABI0 @USE_ABI0@) + +if (NOT WIN32) + if (USE_ABI0) + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0) + else() + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=1) + endif() +endif() + + +if(NOT CMAKE_BUILD_TYPE) + set(CMAKE_BUILD_TYPE "Release" CACHE STRING + "Choose the type of build, options are: Debug Release +RelWithDebInfo MinSizeRel." 
+ FORCE) +endif(NOT CMAKE_BUILD_TYPE) + +if(NOT WIN32) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -Wno-narrowing") +else() + set(CMAKE_CXX_STANDARD 11) +endif() + +set(LIBRARY_NAME core_tokenizers) + +set(FAST_TOKENIZER_INCS "") +list(APPEND FAST_TOKENIZER_INCS ${CMAKE_CURRENT_LIST_DIR}/include) +list(APPEND FAST_TOKENIZER_INCS ${CMAKE_CURRENT_LIST_DIR}/third_party/include) + +set(FAST_TOKENIZER_LIBS "") +find_library(FTLIB ${LIBRARY_NAME} ${CMAKE_CURRENT_LIST_DIR}/lib NO_DEFAULT_PATH) +list(APPEND FAST_TOKENIZER_LIBS ${FTLIB}) + +if (WIN32) +find_library(ICUDT icudt ${CMAKE_CURRENT_LIST_DIR}/third_party/lib NO_DEFAULT_PATH) +list(APPEND FAST_TOKENIZER_LIBS ${ICUDT}) + +find_library(ICUUC icuuc ${CMAKE_CURRENT_LIST_DIR}/third_party/lib NO_DEFAULT_PATH) +list(APPEND FAST_TOKENIZER_LIBS ${ICUUC}) +endif() diff --git a/fast_tokenizer/LICENSE b/fast_tokenizer/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..d945753f93c1bade8953a38fa2141850b174b26b --- /dev/null +++ b/fast_tokenizer/LICENSE @@ -0,0 +1,203 @@ +Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/fast_tokenizer/Makefile b/fast_tokenizer/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..4823a02a3cc188ee98ce9709370d05f6e71b3cbb --- /dev/null +++ b/fast_tokenizer/Makefile @@ -0,0 +1,38 @@ +# Makefile for fast_tokenizer +# +# GitHb: https://github.com/PaddlePaddle/PaddleNLP +# Author: Paddle Team https://github.com/PaddlePaddle +# + +# Compile and test for fast_tokenizer cpp library + +.PHONY: fast_tokenizer_cpp_compile + +fast_tokenizer_cpp_compile: + mkdir -p build_cpp && cd build_cpp && \ + cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=ON -DCMAKE_BUILD_TYPE=Release && \ + make -j4 + +.PHONY: fast_tokenizer_cpp_test + +fast_tokenizer_cpp_test: + bash run_fast_tokenizer_cpp_test.sh build_cpp/fast_tokenizer/test + +# Compile and test for fast_tokenizer python library + +.PHONY: fast_tokenizer_python_install + +fast_tokenizer_python_install: + pip install numpy wheel pytest paddlepaddle .. 
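The `fast_tokenizer_python_*` targets below build the wheel and run its pytest suite; after installing that wheel, a quick import-level smoke test might look like the sketch here (it only relies on `set_thread_num`, which the FastTokenizer README later in this patch documents).

```python
# Minimal post-install smoke test for the fast_tokenizer wheel (a sketch, not
# part of the test suite): import the package and set the tokenization thread count.
import fast_tokenizer

fast_tokenizer.set_thread_num(1)
print("fast_tokenizer imported and configured")
```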
+ +.PHONY: fast_tokenizer_python_compile + +fast_tokenizer_python_compile: + mkdir -p build_py && cd build_py && \ + cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release && \ + make -j4 + +.PHONY: fast_tokenizer_python_test + +fast_tokenizer_python_test: + pip install build_py/dist/*whl && pytest build_py/python/tests diff --git a/fast_tokenizer/README.md b/fast_tokenizer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6b8d18e6f56ff7c53e42deb2669a8d3ff7fc544b --- /dev/null +++ b/fast_tokenizer/README.md @@ -0,0 +1,150 @@ + +# ⚡ FastTokenizer:高性能文本处理库 + +------------------------------------------------------------------------------------------ + +


+ + +FastTokenizer 是一款简单易用、功能强大的跨平台高性能文本预处理库,集成业界多个常用的 Tokenizer 实现,支持不同 NLP 场景下的文本预处理功能,如文本分类、阅读理解,序列标注等。在 Python 端结合 PaddleNLP Tokenizer 模块,为用户在训练、推理阶段提供高效通用的文本预处理能力。 + +![fast-tokenizer-framework](https://user-images.githubusercontent.com/10826371/215277623-1fcd84e5-1cc7-43a8-a33c-890103cf1cc8.png) + +## 特性 + +- 高性能。底层采用 C++ 实现,并且集成高性能分词算法 `FastWordPiece` [1],其性能远高于常规 Python 实现的 Tokenizer。在文本分类任务上,FastTokenizer 对比 Python 版本 Tokenizer 加速比最高可达20倍。支持多线程加速多文本批处理分词。默认使用单线程分词。 +- 可拓展性强。用户可以通过指定不同的 `Normalizer`, `PreTokenizer`, `Model` 以及 `PostProcessor` 组件自定义 Tokenizer。可在 [FastTokenizer Pipeline](docs/pipeline/README.md) 文档了解更多关于组件的介绍以及使用方式。 +- 跨平台。FastTokenizer 可在不同的系统平台上使用,目前已支持 Windows x64,Linux x64 以及 MacOS 10.14+ 平台上使用。 +- 支持多编程语言。FastTokenizer 提供在 [C++](./docs/cpp/README.md)、[Python](./docs/python/README.md) 语言上开发的能力。 +- 包含所有预处理。覆盖绝大部分 Transformer 模型的 Tokenizer 所需要的功能,包括特殊 Tokens 的拼接、截断等。输入的原始文本经过 FastTokenizer 处理后得到的结果可直接输入到 Transformer 类模型。 + +## 快速开始 + +下面将介绍 Python 版本 FastTokenizer 的使用方式,C++ 版本的使用方式可参考 [FastTokenizer C++ 库使用教程](./docs/cpp/README.md)。 + +### 环境依赖 + +- Windows 64位系统 +- Linux x64系统 +- MacOS 10.14+系统( m1 芯片的 MacOS,需要使用 x86_64 版本的 Anaconda 作为 python 环境方可安装使用) +- Python 3.6 ~ 3.10 + +### 安装 FastTokenizer + +```python +pip install fast-tokenizer-python +``` + +### FastTokenizer 使用示例 + +- 准备词表 + +```shell +# Linux或者Mac用户可直接执行以下命令下载测试的词表,Windows 用户可在浏览器上下载到本地。 +wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt +``` + +- 切词示例 + +FastTokenizer 库内置 NLP 任务常用的 Tokenizer,如 ErnieFastTokenizer。下面将展示 FastTokenizer 的简单用法。 + +```python +import fast_tokenizer +from fast_tokenizer import ErnieFastTokenizer, models + +# 0.(可选)设置线程数 +fast_tokenizer.set_thread_num(1) +# 1. 加载词表 +vocab = models.WordPiece.read_file("ernie_vocab.txt") +# 2. 实例化 ErnieFastTokenizer 对象 +fast_tokenizer = ErnieFastTokenizer(vocab) +# 3. 切词 +output = fast_tokenizer.encode("我爱中国") +# 4. 输出结果 +print("ids: ", output.ids) +print("type_ids: ", output.type_ids) +print("tokens: ", output.tokens) +print("offsets: ", output.offsets) +print("attention_mask: ", output.attention_mask) + +# 5. 
示例输出 +# ids: [1, 75, 329, 12, 20, 2] +# type_ids: [0, 0, 0, 0, 0, 0] +# tokens: ['[CLS]', '我', '爱', '中', '国', '[SEP]'] +# offsets: [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (0, 0)] +# attention_mask: [1, 1, 1, 1, 1, 1] +``` + +### FastTokenizer 在 PaddleNLP Tokenizer 模块加速示例 + +PaddleNLP Tokenizer 模块可简单地应用在模型训练以及推理部署的文本预处理阶段,并通过 `AutoTokenizer.from_pretrained` 方式实例化相应的 Tokenizer 。其中 `AutoTokenizer` 默认加载得到的 Tokenizer 是常规 Python 实现的 Tokenizer,其性能会低于 C++ 实现的 FastTokenizer。为了提升 PaddleNLP Tokenizer 模块性能,目前 PaddleNLP Tokenizer 模块已经支持使用 FastTokenizer 作为 Tokenizer 的后端加速切词阶段。在现有的 Tokenizer 加载接口中,仅需添加 `use_fast=True` 这一关键词参数,其余代码保持不变,即可加载 Fast 版本的 Tokenizer,代码示例如下: + +```python +from paddlenlp.transformers import AutoTokenizer + +# 默认加载Python版本的Tokenizer +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +# 打开use_fast开关,可加载Fast版本Tokenizer +fast_tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh', use_fast=True) + +text1 = tokenizer('自然语言处理') +text2 = fast_tokenizer('自然语言处理') + +print(text1) +print(text2) + +# 示例输出 +# {'input_ids': [1, 67, 187, 405, 545, 239, 38, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]} +# {'input_ids': [1, 67, 187, 405, 545, 239, 38, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]} + +``` + +目前 PaddleNLP 已支持 BERT、ERNIE、TinyBERT 以及 ERNIE-M 4种 Tokenizer 的 Fast 版本,其余模型的 Tokenizer 暂不支持 Fast 版本。 + +## FAQ + +**Q:我在 AutoTokenizer.from_pretrained 接口上已经打开 `use_fast=True` 开关,为什么文本预处理阶段性能上好像没有任何变化?** + +A:在有三种情况下,打开 `use_fast=True` 开关可能无法提升性能: + 1. 没有安装 fast_tokenizer 。若在没有安装 fast_tokenizer 库的情况下打开 `use_fast` 开关,PaddleNLP 会给出以下warning:"Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. "。 + + 2. 加载的 Tokenizer 类型暂不支持 Fast 版本。目前支持4种 Tokenizer 的 Fast 版本,分别是 BERT、ERNIE、TinyBERT 以及 ERNIE-M Tokenizer。若加载不支持 Fast 版本的 Tokenizer 情况下打开 `use_fast` 开关,PaddleNLP 会给出以下warning:"The tokenizer XXX doesn't have the fast version. Please check the map paddlenlp.transformers.auto.tokenizer.FAST_TOKENIZER_MAPPING_NAMES to see which fast tokenizers are currently supported." + + 3. 待切词文本长度过短(如文本平均长度小于5)。这种情况下切词开销可能不是整个文本预处理的性能瓶颈,导致在使用 FastTokenizer 后仍无法提升整体性能。 + +**Q:如何使用多线程加速分词?** + +A:可以通过调用 `fast_tokenizer.set_thread_num(xxx)` 使用多线程进行分词。需要谨慎开启多线程加速分词,在以下场景下可以考虑开启多线程: + 1. CPU资源充足。若在推理阶段使用CPU进行推理,开启多线程分词可能会出现资源竞争情况,从而影响推理阶段的性能。 + + 2. 文本的批大小较大。若批大小比较小,开启多线程可能不会得到任何加速效果,并且可能会因为线程调度导致延时增长。建议批大小大于4的时候再考虑开启多线程分词。 + + 3. 文本长度较长。若文本长度较短,开启多线程可能不会得到任何加速效果,并且可能会因为线程调度导致延时增长。建议文本平均长度大于16的时候再考虑开启多线程分词。 + +**Q:Windows 上编译、运行示例出错。** 相关issue:[issues 4673](https://github.com/PaddlePaddle/PaddleNLP/issues/4673)。 + +A:FastTokenizer 支持 Linux、Windows 以及 MacOS 系统上运行,同一示例可以在不同的操作系统上运行。如果出现在其他系统编译运行没错,但在 Windows 上编译或者运行示例出错的问题,大概率是编译过程中遇到中文字符的编码问题,FastTokenizer 要求字符集必须为 UTF-8。可以参考Visual Studio的官方文档,设置源字符集为/utf-8解决:[/utf-8(将源字符集和执行字符集设置为 UTF-8)](https://learn.microsoft.com/zh-cn/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-170)。 + +## 参考文献 + +- [1] Xinying Song, Alex Salcianuet al. 
"Fast WordPiece Tokenization", EMNLP, 2021 + +## 相关文档 + +[FastTokenizer Pipeline](docs/pipeline/README.md) + +[FastTokenizer 编译指南](docs/compile/README.md) + +[FastTokenizer C++ 库使用教程](./docs/cpp/README.md) + +[FastTokenizer Python 库使用教程](./docs/python/README.md) diff --git a/fast_tokenizer/cmake/ByproductsICU.cmake b/fast_tokenizer/cmake/ByproductsICU.cmake new file mode 100644 index 0000000000000000000000000000000000000000..3b68f08249bc0ca4846c05304e55ee08647bcb70 --- /dev/null +++ b/fast_tokenizer/cmake/ByproductsICU.cmake @@ -0,0 +1,54 @@ +# MIT License +# +# Copyright (c) 2018 The ViaDuck Project +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +function(GetICUByproducts ICU_PATH ICU_LIB_VAR ICU_INCLUDE_VAR ICU_BASE_NAMES_VAR) + # include directory + set(${ICU_INCLUDE_VAR} "${ICU_PATH}/include" PARENT_SCOPE) + + if (WIN32) + # windows basenames and pre/suffixes + set(ICU_LIB_BASE_NAMES dt in io tu uc) + + set(ICU_SHARED_PREFIX "lib") + set(ICU_STATIC_PREFIX "") + set(ICU_SHARED_SUFFIX ".dll.a") + set(ICU_STATIC_SUFFIX ".lib") + set(ICU_INSTALL_LIB "lib64") + else() + # unix basenames and pre/suffixes + set(ICU_LIB_BASE_NAMES i18n data uc io tu) + set(ICU_SHARED_PREFIX ${CMAKE_SHARED_LIBRARY_PREFIX}) + set(ICU_STATIC_PREFIX ${CMAKE_STATIC_LIBRARY_PREFIX}) + set(ICU_SHARED_SUFFIX ${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ICU_STATIC_SUFFIX ${CMAKE_STATIC_LIBRARY_SUFFIX}) + set(ICU_INSTALL_LIB "lib") + endif() + # add static and shared libs to the libraries variable + foreach(ICU_BASE_NAME ${ICU_LIB_BASE_NAMES}) + set(ICU_SHARED_LIB "${ICU_PATH}/${ICU_INSTALL_LIB}/${ICU_SHARED_PREFIX}icu${ICU_BASE_NAME}${ICU_SHARED_SUFFIX}") + set(ICU_STATIC_LIB "${ICU_PATH}/${ICU_INSTALL_LIB}/${ICU_STATIC_PREFIX}icu${ICU_BASE_NAME}${ICU_STATIC_SUFFIX}") + + if (ICU_STATIC) + list(APPEND ${ICU_LIB_VAR} ${ICU_STATIC_LIB}) + else() + list(APPEND ${ICU_LIB_VAR} ${ICU_SHARED_LIB}) + endif() + list(APPEND ${ICU_BASE_NAMES_VAR} ${ICU_BASE_NAME}) + endforeach() + set(${ICU_LIB_VAR} ${${ICU_LIB_VAR}} PARENT_SCOPE) + set(${ICU_BASE_NAMES_VAR} ${${ICU_BASE_NAMES_VAR}} PARENT_SCOPE) +endfunction() \ No newline at end of file diff --git a/fast_tokenizer/cmake/FindNumPy.cmake b/fast_tokenizer/cmake/FindNumPy.cmake new file mode 100644 index 0000000000000000000000000000000000000000..7815e86cb503cf91afdaba7fd4f0ec02d87132a1 --- /dev/null +++ b/fast_tokenizer/cmake/FindNumPy.cmake @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +# Find the Python NumPy package +# PYTHON_NUMPY_INCLUDE_DIR +# NUMPY_FOUND +# will be set by this script + +cmake_minimum_required(VERSION 2.6) + +if(NOT PYTHON_EXECUTABLE) + if(NumPy_FIND_QUIETLY) + find_package(PythonInterp QUIET) + else() + find_package(PythonInterp) + set(_numpy_out 1) + endif() +endif() + +if (PYTHON_EXECUTABLE) + # write a python script that finds the numpy path + file(WRITE ${PROJECT_BINARY_DIR}/FindNumpyPath.py + "try: import numpy; print(numpy.get_include())\nexcept:pass\n") + + # execute the find script + exec_program("${PYTHON_EXECUTABLE}" ${PROJECT_BINARY_DIR} + ARGS "FindNumpyPath.py" + OUTPUT_VARIABLE NUMPY_PATH) +elseif(_numpy_out) + message(STATUS "Python executable not found.") +endif(PYTHON_EXECUTABLE) + +find_path(PYTHON_NUMPY_INCLUDE_DIR numpy/arrayobject.h + HINTS "${NUMPY_PATH}" "${PYTHON_INCLUDE_PATH}") + +if(PYTHON_NUMPY_INCLUDE_DIR) + set(PYTHON_NUMPY_FOUND 1 CACHE INTERNAL "Python numpy found") +endif(PYTHON_NUMPY_INCLUDE_DIR) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(NumPy DEFAULT_MSG PYTHON_NUMPY_INCLUDE_DIR) diff --git a/fast_tokenizer/cmake/dummy.c.in b/fast_tokenizer/cmake/dummy.c.in new file mode 100644 index 0000000000000000000000000000000000000000..17ba4d3495eb41be61ab59425a6ddc49fa4389e3 --- /dev/null +++ b/fast_tokenizer/cmake/dummy.c.in @@ -0,0 +1,3 @@ +// Generated by @dummy_GENERATOR@. DO NOT EDIT!!! + +const char *dummy = "@dummy_CONTENT@"; \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/dart.cmake b/fast_tokenizer/cmake/external/dart.cmake new file mode 100644 index 0000000000000000000000000000000000000000..1e00807f774567db810304adc8c70f83dfea12b1 --- /dev/null +++ b/fast_tokenizer/cmake/external/dart.cmake @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +include(ExternalProject) + +set(DART_PREFIX_DIR ${THIRD_PARTY_PATH}/dart) +SET(DART_REPOSITORY ${GIT_URL}/s-yata/darts-clone.git) +SET(DART_TAG master) + +set(DART_INCLUDE_DIR ${THIRD_PARTY_PATH}/dart/src/extern_dart/include) +include_directories(${DART_INCLUDE_DIR}) + +ExternalProject_Add( + extern_dart + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${DART_REPOSITORY} + GIT_TAG ${DART_TAG} + PREFIX ${DART_PREFIX_DIR} + # If we explicitly leave the `UPDATE_COMMAND` of the ExternalProject_Add + # function in CMakeLists blank, it will cause another parameter GIT_TAG + # to be modified without triggering incremental compilation, and the + # third-party library version changes cannot be incorporated. 
+ # reference: https://cmake.org/cmake/help/latest/module/ExternalProject.html + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +add_library(dart INTERFACE) + +add_dependencies(dart extern_dart) diff --git a/fast_tokenizer/cmake/external/gflags.cmake b/fast_tokenizer/cmake/external/gflags.cmake new file mode 100644 index 0000000000000000000000000000000000000000..cb7e0420045aecdcb467cac0eba872c5dbe76fb2 --- /dev/null +++ b/fast_tokenizer/cmake/external/gflags.cmake @@ -0,0 +1,121 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(GFLAGS_PREFIX_DIR ${THIRD_PARTY_PATH}/gflags) +SET(GFLAGS_INSTALL_DIR ${THIRD_PARTY_PATH}/install/gflags) +SET(GFLAGS_INCLUDE_DIR "${GFLAGS_INSTALL_DIR}/include" CACHE PATH "gflags include directory." FORCE) +set(GFLAGS_REPOSITORY ${GIT_URL}/gflags/gflags.git) +set(GFLAGS_TAG "v2.2.2") +IF(WIN32) + set(GFLAGS_LIBRARIES "${GFLAGS_INSTALL_DIR}/lib/gflags_static.lib" CACHE FILEPATH "GFLAGS_LIBRARIES" FORCE) +ELSE(WIN32) + set(GFLAGS_LIBRARIES "${GFLAGS_INSTALL_DIR}/lib/libgflags.a" CACHE FILEPATH "GFLAGS_LIBRARIES" FORCE) + set(BUILD_COMMAND $(MAKE) --silent) + set(INSTALL_COMMAND $(MAKE) install) +ENDIF(WIN32) + +INCLUDE_DIRECTORIES(${GFLAGS_INCLUDE_DIR}) + +IF(ANDROID) +set(CROSS_COMPILE_CMAKE_ARGS + "-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME}" + "-DCMAKE_SYSTEM_VERSION=${CMAKE_SYSTEM_VERSION}" + "-DCMAKE_ANDROID_ARCH_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DCMAKE_ANDROID_NDK=${CMAKE_ANDROID_NDK}" + "-DCMAKE_ANDROID_STL_TYPE=${CMAKE_ANDROID_STL_TYPE}" + "-DANDROID_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DANDROID_TOOLCHAIN=${ANDROID_TOOLCHAIN}" + "-DANDROID_STL=${CMAKE_ANDROID_STL_TYPE}" + "-DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}" + "-DCMAKE_TOOLCHAIN_FILE=${CMAKE_ANDROID_NDK}/build/cmake/android.toolchain.cmake" + "-DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=${CMAKE_ANDROID_NDK_TOOLCHAIN_VERSION}" + "-DANDROID_PLATFORM=android-${ANDROID_NATIVE_API_LEVEL}" + "-D__ANDROID_API__=${ANDROID_NATIVE_API_LEVEL}") + +ExternalProject_Add( + extern_gflags + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GFLAGS_REPOSITORY} + GIT_TAG ${GFLAGS_TAG} + PREFIX ${GFLAGS_PREFIX_DIR} + UPDATE_COMMAND "" + BUILD_COMMAND ${BUILD_COMMAND} + INSTALL_COMMAND ${INSTALL_COMMAND} + CMAKE_ARGS ${CROSS_COMPILE_CMAKE_ARGS} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DBUILD_STATIC_LIBS=ON + -DCMAKE_INSTALL_PREFIX=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS 
-DCMAKE_INSTALL_PREFIX:PATH=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GFLAGS_LIBRARIES} +) +ELSE() +ExternalProject_Add( + extern_gflags + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GFLAGS_REPOSITORY} + GIT_TAG ${GFLAGS_TAG} + PREFIX ${GFLAGS_PREFIX_DIR} + UPDATE_COMMAND "" + BUILD_COMMAND ${BUILD_COMMAND} + INSTALL_COMMAND ${INSTALL_COMMAND} + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DBUILD_STATIC_LIBS=ON + -DCMAKE_INSTALL_PREFIX=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GFLAGS_LIBRARIES} +) +ENDIF() + +ADD_LIBRARY(gflags STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gflags PROPERTY IMPORTED_LOCATION ${GFLAGS_LIBRARIES}) +ADD_DEPENDENCIES(gflags extern_gflags) + +# On Windows (including MinGW), the Shlwapi library is used by gflags if available. +if (WIN32) + include(CheckIncludeFileCXX) + check_include_file_cxx("shlwapi.h" HAVE_SHLWAPI) + if (HAVE_SHLWAPI) + set_property(GLOBAL PROPERTY OS_DEPENDENCY_MODULES shlwapi.lib) + endif(HAVE_SHLWAPI) +endif (WIN32) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/glog.cmake b/fast_tokenizer/cmake/external/glog.cmake new file mode 100644 index 0000000000000000000000000000000000000000..2afc39608ce1afd99e3e17d25095ed08648f4ae8 --- /dev/null +++ b/fast_tokenizer/cmake/external/glog.cmake @@ -0,0 +1,117 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(GLOG_PREFIX_DIR ${THIRD_PARTY_PATH}/glog) +SET(GLOG_INSTALL_DIR ${THIRD_PARTY_PATH}/install/glog) +SET(GLOG_INCLUDE_DIR "${GLOG_INSTALL_DIR}/include" CACHE PATH "glog include directory." FORCE) +SET(GLOG_REPOSITORY ${GIT_URL}/google/glog.git) +SET(GLOG_TAG v0.4.0) + +IF(WIN32) + SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/glog.lib" CACHE FILEPATH "glog library." FORCE) + SET(GLOG_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4267 /wd4530") + add_definitions("/DGOOGLE_GLOG_DLL_DECL=") +ELSE(WIN32) + SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/libglog.a" CACHE FILEPATH "glog library." 
FORCE) + SET(GLOG_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) +ENDIF(WIN32) + +INCLUDE_DIRECTORIES(${GLOG_INCLUDE_DIR}) + +IF(ANDROID) +set(CROSS_COMPILE_CMAKE_ARGS + "-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME}" + "-DCMAKE_SYSTEM_VERSION=${CMAKE_SYSTEM_VERSION}" + "-DCMAKE_ANDROID_ARCH_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DCMAKE_ANDROID_NDK=${CMAKE_ANDROID_NDK}" + "-DCMAKE_ANDROID_STL_TYPE=${CMAKE_ANDROID_STL_TYPE}" + "-DANDROID_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DANDROID_TOOLCHAIN=${ANDROID_TOOLCHAIN}" + "-DANDROID_STL=${CMAKE_ANDROID_STL_TYPE}" + "-DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}" + "-DCMAKE_TOOLCHAIN_FILE=${CMAKE_ANDROID_NDK}/build/cmake/android.toolchain.cmake" + "-DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=${CMAKE_ANDROID_NDK_TOOLCHAIN_VERSION}" + "-DANDROID_PLATFORM=android-${ANDROID_NATIVE_API_LEVEL}" + "-D__ANDROID_API__=${ANDROID_NATIVE_API_LEVEL}") + +ExternalProject_Add( + extern_glog + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GLOG_REPOSITORY} + GIT_TAG ${GLOG_TAG} + DEPENDS gflags + PREFIX ${GLOG_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS ${CROSS_COMPILE_CMAKE_ARGS} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${GLOG_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DWITH_GFLAGS=OFF + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR:PATH=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GLOG_LIBRARIES} +) +ELSE() +ExternalProject_Add( + extern_glog + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GLOG_REPOSITORY} + GIT_TAG ${GLOG_TAG} + DEPENDS gflags + PREFIX ${GLOG_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${GLOG_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DWITH_GFLAGS=OFF + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR:PATH=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GLOG_LIBRARIES} +) +ENDIF() + +ADD_LIBRARY(glog STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET glog PROPERTY IMPORTED_LOCATION ${GLOG_LIBRARIES}) +ADD_DEPENDENCIES(glog extern_glog gflags) +LINK_LIBRARIES(glog) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/gtest.cmake b/fast_tokenizer/cmake/external/gtest.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4b870934558739f43f572afc02c232ef392fa45c --- /dev/null +++ 
b/fast_tokenizer/cmake/external/gtest.cmake @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +IF(WITH_TESTING) + ENABLE_TESTING() +ENDIF() + +INCLUDE(GNUInstallDirs) +INCLUDE(ExternalProject) + +SET(GTEST_PREFIX_DIR ${THIRD_PARTY_PATH}/gtest) +SET(GTEST_INSTALL_DIR ${THIRD_PARTY_PATH}/install/gtest) +SET(GTEST_INCLUDE_DIR "${GTEST_INSTALL_DIR}/include" CACHE PATH "gtest include directory." FORCE) +set(GTEST_REPOSITORY ${GIT_URL}/google/googletest.git) +set(GTEST_TAG release-1.8.1) + +INCLUDE_DIRECTORIES(${GTEST_INCLUDE_DIR}) + +IF(WIN32) + set(GTEST_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest.lib" CACHE FILEPATH "gtest libraries." FORCE) + set(GTEST_MAIN_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest_main.lib" CACHE FILEPATH "gtest main libraries." FORCE) + string(REPLACE "/w " "" GTEST_CMAKE_C_FLAGS "${CMAKE_C_FLAGS}") + string(REPLACE "/w " "" GTEST_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") + string(REPLACE "/W0 " "" GTEST_CMAKE_C_FLAGS "${GTEST_CMAKE_C_FLAGS}") + string(REPLACE "/W0 " "" GTEST_CMAKE_CXX_FLAGS "${GTEST_CMAKE_CXX_FLAGS}") +ELSE(WIN32) + set(GTEST_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/libgtest.a" CACHE FILEPATH "gtest libraries." FORCE) + set(GTEST_MAIN_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/libgtest_main.a" CACHE FILEPATH "gtest main libraries." 
FORCE) + set(GTEST_CMAKE_C_FLAGS "${CMAKE_C_FLAGS}") + set(GTEST_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") +ENDIF(WIN32) + +ExternalProject_Add( + extern_gtest + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GTEST_REPOSITORY} + GIT_TAG ${GTEST_TAG} + DEPENDS ${GTEST_DEPENDS} + PREFIX ${GTEST_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${GTEST_CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX=${GTEST_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DBUILD_GMOCK=ON + -Dgtest_disable_pthreads=ON + -Dgtest_force_shared_crt=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GTEST_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GTEST_LIBRARIES} + BUILD_BYPRODUCTS ${GTEST_MAIN_LIBRARIES} +) + +ADD_LIBRARY(gtest STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gtest PROPERTY IMPORTED_LOCATION ${GTEST_LIBRARIES}) +ADD_DEPENDENCIES(gtest extern_gtest) + +ADD_LIBRARY(gtest_main STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gtest_main PROPERTY IMPORTED_LOCATION ${GTEST_MAIN_LIBRARIES}) +ADD_DEPENDENCIES(gtest_main extern_gtest) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/icu.cmake b/fast_tokenizer/cmake/external/icu.cmake new file mode 100644 index 0000000000000000000000000000000000000000..cd604d384ef6e4fd97ec0500aba7d632de69840c --- /dev/null +++ b/fast_tokenizer/cmake/external/icu.cmake @@ -0,0 +1,138 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+include(CMakeParseArguments) +include(ExternalProject) +include (ByproductsICU) +SET(ICU_PREFIX_DIR ${THIRD_PARTY_PATH}/icu) +SET(ICU_INSTALL_DIR ${THIRD_PARTY_PATH}/install/icu) +if(ANDROID) + set(ICU_URL_PREFIX "https://bj.bcebos.com/fastdeploy/test") + # check ABI, toolchain + if((NOT ANDROID_ABI MATCHES "armeabi-v7a") AND (NOT ANDROID_ABI MATCHES "arm64-v8a")) + message(FATAL_ERROR "FastTokenizer for Android only support armeabi-v7a, arm64-v8a now.") + endif() + if(NOT ANDROID_TOOLCHAIN MATCHES "clang") + message(FATAL_ERROR "Currently, only support clang toolchain while cross compiling FastTokenizer for Android, but found ${ANDROID_TOOLCHAIN}.") + endif() + if (WITH_ICU_LITE) + set(ICU_REPOSITORY ${ICU_URL_PREFIX}/icu-lite-android-${ANDROID_ABI}.tgz) + else() + set(ICU_REPOSITORY ${ICU_URL_PREFIX}/icu-android-${ANDROID_ABI}.tgz) + endif() +else() + SET(ICU_REPOSITORY ${GIT_URL}/unicode-org/icu.git) +endif() +SET(ICU_TAG release-70-1) +set(FIND_OR_BUILD_ICU_DIR ${CMAKE_CURRENT_LIST_DIR}) + +set(HOST_CFLAGS "${CMAKE_C_FLAGS}") +set(HOST_CXXFLAGS "${CMAKE_CXX_FLAGS}") +set(HOST_CC "${CMAKE_C_COMPILER}") +set(HOST_CXX "${CMAKE_CXX_COMPILER}") +set(HOST_LDFLAGS "${CMAKE_MODULE_LINKER_FLAGS}") + +set(HOST_ENV_CMAKE ${CMAKE_COMMAND} -E env + CC=${HOST_CC} + CXX=${HOST_CXX} + CFLAGS=${HOST_CFLAGS} + CXXFLAGS=${HOST_CXXFLAGS} + LDFLAGS=${HOST_LDFLAGS} +) + +# predict host libraries +set(ICU_STATIC TRUE) +GetICUByproducts(${ICU_INSTALL_DIR} ICU_LIBRARIES ICU_INCLUDE_DIRS ICU_BASE_NAMES) +INCLUDE_DIRECTORIES(${ICU_INCLUDE_DIRS}) + +if(WIN32) +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${ICU_REPOSITORY} + GIT_TAG ${ICU_TAG} + GIT_PROGRESS 1 + PREFIX ${ICU_PREFIX_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND msbuild ..\\extern_icu\\icu4c\\source\\allinone\\allinone.sln /p:Configuration=Release /p:Platform=x64 /p:RuntimeLibrary=MT_StaticRelease /p:SkipUWP=true + BUILD_COMMAND "" + INSTALL_COMMAND ${CMAKE_COMMAND} -E copy_directory ../extern_icu/icu4c/include ${ICU_INSTALL_DIR}/include + && ${CMAKE_COMMAND} -E copy_directory ../extern_icu/icu4c/lib64 ${ICU_INSTALL_DIR}/lib64 + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +elseif(APPLE) +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${ICU_REPOSITORY} + GIT_TAG ${ICU_TAG} + GIT_PROGRESS 1 + PREFIX ${ICU_PREFIX_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND ${HOST_ENV_CMAKE} ../extern_icu/icu4c/source/runConfigureICU "MacOSX/GCC" --enable-static --disable-shared --enable-rpath + BUILD_COMMAND make -j4 + INSTALL_COMMAND make install prefix="" DESTDIR=${ICU_INSTALL_DIR} install + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +elseif(ANDROID) +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + URL ${ICU_REPOSITORY} + PREFIX ${ICU_PREFIX_DIR} + CONFIGURE_COMMAND "" + UPDATE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND + ${CMAKE_COMMAND} -E remove_directory ${ICU_INSTALL_DIR} && + ${CMAKE_COMMAND} -E make_directory ${ICU_INSTALL_DIR} && + ${CMAKE_COMMAND} -E rename ${ICU_PREFIX_DIR}/src/extern_icu/lib/ ${ICU_INSTALL_DIR}/lib && + ${CMAKE_COMMAND} -E copy_directory ${ICU_PREFIX_DIR}/src/extern_icu/include ${ICU_INSTALL_DIR}/include + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +else() +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${ICU_REPOSITORY} + GIT_TAG ${ICU_TAG} + GIT_PROGRESS 1 + PREFIX ${ICU_PREFIX_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND ${HOST_ENV_CMAKE} 
../extern_icu/icu4c/source/runConfigureICU "Linux/gcc" --enable-static --disable-shared --enable-rpath + BUILD_COMMAND make -j4 + INSTALL_COMMAND make install prefix="" DESTDIR=${ICU_INSTALL_DIR} install + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +endif() + +list(LENGTH ICU_LIBRARIES ICU_LIB_LEN) +MATH(EXPR ICU_LIB_LEN "${ICU_LIB_LEN}-1") + +# icui18n icudata icuuc icuio icutu +foreach(ICU_IDX RANGE ${ICU_LIB_LEN}) + list(GET ICU_LIBRARIES ${ICU_IDX} ICU_LIB) + list(GET ICU_BASE_NAMES ${ICU_IDX} ICU_BASE_NAME) + ADD_LIBRARY("icu${ICU_BASE_NAME}" STATIC IMPORTED GLOBAL) + SET_PROPERTY(TARGET "icu${ICU_BASE_NAME}" PROPERTY IMPORTED_LOCATION ${ICU_LIB}) + ADD_DEPENDENCIES("icu${ICU_BASE_NAME}" extern_icu) + list(APPEND ICU_INTERFACE_LINK_LIBRARIES "icu${ICU_BASE_NAME}") +endforeach() + +if(WIN32) +ADD_LIBRARY("icudata" ALIAS "icudt") +endif() \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/nlohmann_json.cmake b/fast_tokenizer/cmake/external/nlohmann_json.cmake new file mode 100644 index 0000000000000000000000000000000000000000..9a34c5cca503ccb9565a8fcfdaeefc9811ec20fb --- /dev/null +++ b/fast_tokenizer/cmake/external/nlohmann_json.cmake @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +include(ExternalProject) + +set(JSON_PREFIX_DIR ${THIRD_PARTY_PATH}/json) +SET(JSON_REPOSITORY ${GIT_URL}/nlohmann/json.git) +SET(JSON_TAG v3.10.5) + +set(JSON_INCLUDE_DIR ${THIRD_PARTY_PATH}/json/src/extern_json/single_include) +include_directories(${JSON_INCLUDE_DIR}) + +ExternalProject_Add( + extern_json + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${JSON_REPOSITORY} + GIT_TAG ${JSON_TAG} + GIT_PROGRESS 1 + PREFIX ${JSON_PREFIX_DIR} + # If we explicitly leave the `UPDATE_COMMAND` of the ExternalProject_Add + # function in CMakeLists blank, it will cause another parameter GIT_TAG + # to be modified without triggering incremental compilation, and the + # third-party library version changes cannot be incorporated. + # reference: https://cmake.org/cmake/help/latest/module/ExternalProject.html + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +add_library(json INTERFACE) + +add_dependencies(json extern_json) diff --git a/fast_tokenizer/cmake/external/protobuf.cmake b/fast_tokenizer/cmake/external/protobuf.cmake new file mode 100644 index 0000000000000000000000000000000000000000..e5f3e19be7b52b3344538864f04b0ee646f7f685 --- /dev/null +++ b/fast_tokenizer/cmake/external/protobuf.cmake @@ -0,0 +1,295 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) +# Always invoke `FIND_PACKAGE(Protobuf)` for importing function protobuf_generate_cpp +IF(NOT WIN32) +FIND_PACKAGE(Protobuf QUIET) +ENDIF(NOT WIN32) +macro(UNSET_VAR VAR_NAME) + UNSET(${VAR_NAME} CACHE) + UNSET(${VAR_NAME}) +endmacro() + +UNSET_VAR(PROTOBUF_INCLUDE_DIR) +UNSET_VAR(PROTOBUF_FOUND) +UNSET_VAR(PROTOBUF_PROTOC_EXECUTABLE) +UNSET_VAR(PROTOBUF_PROTOC_LIBRARY) +UNSET_VAR(PROTOBUF_LITE_LIBRARY) +UNSET_VAR(PROTOBUF_LIBRARY) +UNSET_VAR(PROTOBUF_INCLUDE_DIR) +UNSET_VAR(Protobuf_PROTOC_EXECUTABLE) +function(protobuf_generate_python SRCS) + # shameless copy from https://github.com/Kitware/CMake/blob/master/Modules/FindProtobuf.cmake + if(NOT ARGN) + message(SEND_ERROR "Error: PROTOBUF_GENERATE_PYTHON() called without any proto files") + return() + endif() + + if(PROTOBUF_GENERATE_CPP_APPEND_PATH) + # Create an include path for each file specified + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(ABS_PATH ${ABS_FIL} PATH) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) + endif() + endforeach() + else() + set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR}) + endif() + if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS) + set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}") + endif() + + if(DEFINED Protobuf_IMPORT_DIRS) + foreach(DIR ${Protobuf_IMPORT_DIRS}) + get_filename_component(ABS_PATH ${DIR} ABSOLUTE) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) + endif() + endforeach() + endif() + + set(${SRCS}) + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(FIL_WE ${FIL} NAME_WE) + if(NOT PROTOBUF_GENERATE_CPP_APPEND_PATH) + get_filename_component(FIL_DIR ${FIL} DIRECTORY) + if(FIL_DIR) + set(FIL_WE "${FIL_DIR}/${FIL_WE}") + endif() + endif() + list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py") + add_custom_command( + OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py" + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} --python_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL} + DEPENDS ${ABS_FIL} ${PROTOBUF_PROTOC_EXECUTABLE} + COMMENT "Running Python protocol buffer compiler on ${FIL}" + VERBATIM ) + endforeach() + + set(${SRCS} ${${SRCS}} PARENT_SCOPE) +endfunction() + +# Print and set the protobuf library information, +# finish this cmake process and exit from this file. +macro(PROMPT_PROTOBUF_LIB) + SET(protobuf_DEPS ${ARGN}) + + MESSAGE(STATUS "Protobuf protoc executable: ${PROTOBUF_PROTOC_EXECUTABLE}") + MESSAGE(STATUS "Protobuf-lite library: ${PROTOBUF_LITE_LIBRARY}") + MESSAGE(STATUS "Protobuf library: ${PROTOBUF_LIBRARY}") + MESSAGE(STATUS "Protoc library: ${PROTOBUF_PROTOC_LIBRARY}") + MESSAGE(STATUS "Protobuf version: ${PROTOBUF_VERSION}") + INCLUDE_DIRECTORIES(${PROTOBUF_INCLUDE_DIR}) + + # Assuming that all the protobuf libraries are of the same type. 
+ IF(${PROTOBUF_LIBRARY} MATCHES ${CMAKE_STATIC_LIBRARY_SUFFIX}) + SET(protobuf_LIBTYPE STATIC) + ELSEIF(${PROTOBUF_LIBRARY} MATCHES "${CMAKE_SHARED_LIBRARY_SUFFIX}$") + SET(protobuf_LIBTYPE SHARED) + ELSE() + MESSAGE(FATAL_ERROR "Unknown library type: ${PROTOBUF_LIBRARY}") + ENDIF() + + ADD_LIBRARY(protobuf ${protobuf_LIBTYPE} IMPORTED GLOBAL) + SET_PROPERTY(TARGET protobuf PROPERTY IMPORTED_LOCATION ${PROTOBUF_LIBRARY}) + + ADD_LIBRARY(protobuf_lite ${protobuf_LIBTYPE} IMPORTED GLOBAL) + SET_PROPERTY(TARGET protobuf_lite PROPERTY IMPORTED_LOCATION ${PROTOBUF_LITE_LIBRARY}) + + ADD_LIBRARY(libprotoc ${protobuf_LIBTYPE} IMPORTED GLOBAL) + SET_PROPERTY(TARGET libprotoc PROPERTY IMPORTED_LOCATION ${PROTOC_LIBRARY}) + + ADD_EXECUTABLE(protoc IMPORTED GLOBAL) + SET_PROPERTY(TARGET protoc PROPERTY IMPORTED_LOCATION ${PROTOBUF_PROTOC_EXECUTABLE}) + # FIND_Protobuf.cmake uses `Protobuf_PROTOC_EXECUTABLE`. + # make `protobuf_generate_cpp` happy. + SET(Protobuf_PROTOC_EXECUTABLE ${PROTOBUF_PROTOC_EXECUTABLE}) + + FOREACH(dep ${protobuf_DEPS}) + ADD_DEPENDENCIES(protobuf ${dep}) + ADD_DEPENDENCIES(protobuf_lite ${dep}) + ADD_DEPENDENCIES(libprotoc ${dep}) + ADD_DEPENDENCIES(protoc ${dep}) + ENDFOREACH() + + RETURN() +endmacro() +macro(SET_PROTOBUF_VERSION) + EXEC_PROGRAM(${PROTOBUF_PROTOC_EXECUTABLE} ARGS --version OUTPUT_VARIABLE PROTOBUF_VERSION) + STRING(REGEX MATCH "[0-9]+.[0-9]+" PROTOBUF_VERSION "${PROTOBUF_VERSION}") +endmacro() + +set(PROTOBUF_ROOT "" CACHE PATH "Folder contains protobuf") +IF (WIN32) + SET(PROTOBUF_ROOT ${THIRD_PARTY_PATH}/install/protobuf) +ENDIF(WIN32) + +if (NOT "${PROTOBUF_ROOT}" STREQUAL "") + find_path(PROTOBUF_INCLUDE_DIR google/protobuf/message.h PATHS ${PROTOBUF_ROOT}/include NO_DEFAULT_PATH) + find_library(PROTOBUF_LIBRARY protobuf libprotobuf.lib PATHS ${PROTOBUF_ROOT}/lib NO_DEFAULT_PATH) + find_library(PROTOBUF_LITE_LIBRARY protobuf-lite libprotobuf-lite.lib PATHS ${PROTOBUF_ROOT}/lib NO_DEFAULT_PATH) + find_library(PROTOBUF_PROTOC_LIBRARY protoc libprotoc.lib PATHS ${PROTOBUF_ROOT}/lib NO_DEFAULT_PATH) + find_program(PROTOBUF_PROTOC_EXECUTABLE protoc PATHS ${PROTOBUF_ROOT}/bin NO_DEFAULT_PATH) + if (PROTOBUF_INCLUDE_DIR AND PROTOBUF_LIBRARY AND PROTOBUF_LITE_LIBRARY AND PROTOBUF_PROTOC_LIBRARY AND PROTOBUF_PROTOC_EXECUTABLE) + message(STATUS "Using custom protobuf library in ${PROTOBUF_ROOT}.") + SET(PROTOBUF_FOUND true) + SET_PROTOBUF_VERSION() + PROMPT_PROTOBUF_LIB() + else() + message(WARNING "Cannot find protobuf library in ${PROTOBUF_ROOT}") + endif() +endif() + +FUNCTION(build_protobuf TARGET_NAME BUILD_FOR_HOST) + STRING(REPLACE "extern_" "" TARGET_DIR_NAME "${TARGET_NAME}") + SET(PROTOBUF_SOURCES_DIR ${THIRD_PARTY_PATH}/${TARGET_DIR_NAME}) + SET(PROTOBUF_INSTALL_DIR ${THIRD_PARTY_PATH}/install/${TARGET_DIR_NAME}) + + SET(${TARGET_NAME}_INCLUDE_DIR "${PROTOBUF_INSTALL_DIR}/include" PARENT_SCOPE) + SET(PROTOBUF_INCLUDE_DIR "${PROTOBUF_INSTALL_DIR}/include" PARENT_SCOPE) + SET(${TARGET_NAME}_LITE_LIBRARY + "${PROTOBUF_INSTALL_DIR}/lib/libprotobuf-lite${CMAKE_STATIC_LIBRARY_SUFFIX}" + PARENT_SCOPE) + SET(${TARGET_NAME}_LIBRARY + "${PROTOBUF_INSTALL_DIR}/lib/libprotobuf${CMAKE_STATIC_LIBRARY_SUFFIX}" + PARENT_SCOPE) + SET(${TARGET_NAME}_PROTOC_LIBRARY + "${PROTOBUF_INSTALL_DIR}/lib/libprotoc${CMAKE_STATIC_LIBRARY_SUFFIX}" + PARENT_SCOPE) + SET(${TARGET_NAME}_PROTOC_EXECUTABLE + "${PROTOBUF_INSTALL_DIR}/bin/protoc${CMAKE_EXECUTABLE_SUFFIX}" + PARENT_SCOPE) + + SET(PROTOBUF_REPO "https://github.com/protocolbuffers/protobuf.git") + SET(PROTOBUF_TAG 
"9f75c5aa851cd877fb0d93ccc31b8567a6706546") + SET(OPTIONAL_CACHE_ARGS "") + SET(OPTIONAL_ARGS "") + + IF(BUILD_FOR_HOST) + SET(OPTIONAL_ARGS + "-DCMAKE_C_COMPILER=${HOST_C_COMPILER}" + "-DCMAKE_CXX_COMPILER=${HOST_CXX_COMPILER}" + "-Dprotobuf_WITH_ZLIB=OFF" + "-DZLIB_ROOT:FILEPATH=${ZLIB_ROOT}") + SET(OPTIONAL_CACHE_ARGS "-DZLIB_ROOT:STRING=${ZLIB_ROOT}") + ELSE() + # protobuf have compile issue when use android stl c++_static + SET(PROTOBUF_REPO "https://github.com/tensor-tang/protobuf.git") + SET(PROTOBUF_TAG "mobile") + SET(OPTIONAL_ARGS "-Dprotobuf_WITH_ZLIB=OFF" + ${CROSS_COMPILE_CMAKE_ARGS} + "-DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}" + "-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}" + "-DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}" + "-DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG}" + "-DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}" + "-DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}") + ENDIF() + IF(WIN32) + SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} "-DCMAKE_GENERATOR_PLATFORM=x64") + ENDIF() + + if(LITE_WITH_LIGHT_WEIGHT_FRAMEWORK) + ExternalProject_Add( + ${TARGET_NAME} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${PROTOBUF_SOURCES_DIR} + SOURCE_SUBDIR cmake + UPDATE_COMMAND "" + GIT_REPOSITORY ${PROTOBUF_REPO} + GIT_TAG ${PROTOBUF_TAG} + GIT_PROGRESS 1 + CMAKE_ARGS + ${OPTIONAL_ARGS} + -Dprotobuf_BUILD_TESTS=OFF + -DCMAKE_SKIP_RPATH=ON + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${PROTOBUF_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=lib + -DBUILD_SHARED_LIBS=OFF + CMAKE_CACHE_ARGS + -DCMAKE_INSTALL_PREFIX:PATH=${PROTOBUF_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + ${OPTIONAL_CACHE_ARGS} + ) + else() + ExternalProject_Add( + ${TARGET_NAME} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${PROTOBUF_SOURCES_DIR} + UPDATE_COMMAND "" + GIT_REPOSITORY ${PROTOBUF_REPO} + GIT_TAG ${PROTOBUF_TAG} + GIT_PROGRESS 1 + CONFIGURE_COMMAND + ${CMAKE_COMMAND} ${PROTOBUF_SOURCES_DIR}/src/${TARGET_NAME}/cmake + ${OPTIONAL_ARGS} + -Dprotobuf_BUILD_TESTS=OFF + -DCMAKE_SKIP_RPATH=ON + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${PROTOBUF_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=lib + -DBUILD_SHARED_LIBS=OFF + CMAKE_CACHE_ARGS + -DCMAKE_INSTALL_PREFIX:PATH=${PROTOBUF_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + ${OPTIONAL_CACHE_ARGS} + ) + endif() +ENDFUNCTION() + +SET(PROTOBUF_VERSION 3.1.0) + +IF(LITE_WITH_LIGHT_WEIGHT_FRAMEWORK) + build_protobuf(protobuf_host TRUE) + LIST(APPEND external_project_dependencies protobuf_host) + SET(PROTOBUF_PROTOC_EXECUTABLE ${protobuf_host_PROTOC_EXECUTABLE} + CACHE FILEPATH "protobuf executable." FORCE) +ENDIF() + +IF(NOT PROTOBUF_FOUND) + build_protobuf(extern_protobuf FALSE) + + SET(PROTOBUF_INCLUDE_DIR ${extern_protobuf_INCLUDE_DIR} + CACHE PATH "protobuf include directory." FORCE) + SET(PROTOBUF_LITE_LIBRARY ${extern_protobuf_LITE_LIBRARY} + CACHE FILEPATH "protobuf lite library." FORCE) + SET(PROTOBUF_LIBRARY ${extern_protobuf_LIBRARY} + CACHE FILEPATH "protobuf library." FORCE) + SET(PROTOBUF_PROTOC_LIBRARY ${extern_protobuf_PROTOC_LIBRARY} + CACHE FILEPATH "protoc library." 
FORCE) + + IF(LITE_WITH_LIGHT_WEIGHT_FRAMEWORK) + PROMPT_PROTOBUF_LIB(protobuf_host extern_protobuf) + ELSE() + SET(PROTOBUF_PROTOC_EXECUTABLE ${extern_protobuf_PROTOC_EXECUTABLE} + CACHE FILEPATH "protobuf executable." FORCE) + PROMPT_PROTOBUF_LIB(extern_protobuf) + ENDIF() + +ENDIF(NOT PROTOBUF_FOUND) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/pybind11.cmake b/fast_tokenizer/cmake/external/pybind11.cmake new file mode 100644 index 0000000000000000000000000000000000000000..7f5f15d3e091de0d433b45c604927ee7a11f8d82 --- /dev/null +++ b/fast_tokenizer/cmake/external/pybind11.cmake @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +include(ExternalProject) + +set(PYBIND_PREFIX_DIR ${THIRD_PARTY_PATH}/pybind) +SET(PYBIND_REPOSITORY ${GIT_URL}/pybind/pybind11.git) +SET(PYBIND_TAG v2.9.0) + +set(PYBIND_INCLUDE_DIR ${THIRD_PARTY_PATH}/pybind/src/extern_pybind/include) +include_directories(${PYBIND_INCLUDE_DIR}) + +ExternalProject_Add( + extern_pybind + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${PYBIND_REPOSITORY} + GIT_TAG ${PYBIND_TAG} + PREFIX ${PYBIND_PREFIX_DIR} + # If we explicitly leave the `UPDATE_COMMAND` of the ExternalProject_Add + # function in CMakeLists blank, it will cause another parameter GIT_TAG + # to be modified without triggering incremental compilation, and the + # third-party library version changes cannot be incorporated. + # reference: https://cmake.org/cmake/help/latest/module/ExternalProject.html + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +add_library(pybind INTERFACE) + +add_dependencies(pybind extern_pybind) diff --git a/fast_tokenizer/cmake/external/python.cmake b/fast_tokenizer/cmake/external/python.cmake new file mode 100644 index 0000000000000000000000000000000000000000..81da0782893bdf98dfe30578ec96f97edb59e0da --- /dev/null +++ b/fast_tokenizer/cmake/external/python.cmake @@ -0,0 +1,74 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +INCLUDE(python_module) + +FIND_PACKAGE(PythonInterp ${PY_VERSION} REQUIRED) +FIND_PACKAGE(PythonLibs ${PY_VERSION} REQUIRED) + +if(WIN32) + execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" +"from distutils import sysconfig as s;import sys;import struct; +print(sys.prefix); +print(s.get_config_var('LDVERSION') or s.get_config_var('VERSION')); +" + RESULT_VARIABLE _PYTHON_SUCCESS + OUTPUT_VARIABLE _PYTHON_VALUES + ERROR_VARIABLE _PYTHON_ERROR_VALUE) + + if(NOT _PYTHON_SUCCESS EQUAL 0) + set(PYTHONLIBS_FOUND FALSE) + return() + endif() + + # Convert the process output into a list + string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES}) + string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES}) + list(GET _PYTHON_VALUES 0 PYTHON_PREFIX) + list(GET _PYTHON_VALUES 1 PYTHON_LIBRARY_SUFFIX) + + # Make sure all directory separators are '/' + string(REGEX REPLACE "\\\\" "/" PYTHON_PREFIX ${PYTHON_PREFIX}) + + set(PYTHON_LIBRARY + "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + + # when run in a venv, PYTHON_PREFIX points to it. But the libraries remain in the + # original python installation. They may be found relative to PYTHON_INCLUDE_DIR. + if(NOT EXISTS "${PYTHON_LIBRARY}") + get_filename_component(_PYTHON_ROOT ${PYTHON_INCLUDE_DIR} DIRECTORY) + set(PYTHON_LIBRARY + "${_PYTHON_ROOT}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + endif() + + # raise an error if the python libs are still not found. + if(NOT EXISTS "${PYTHON_LIBRARY}") + message(FATAL_ERROR "Python libraries not found") + endif() + SET(PYTHON_LIBRARIES "${PYTHON_LIBRARY}") +endif(WIN32) + +# Fixme: Maybe find a static library. Get SHARED/STATIC by FIND_PACKAGE. +ADD_LIBRARY(python SHARED IMPORTED GLOBAL) +SET_PROPERTY(TARGET python PROPERTY IMPORTED_LOCATION ${PYTHON_LIBRARIES}) + +SET(py_env "") +IF(PYTHONINTERP_FOUND) + find_python_module(pip REQUIRED) + find_python_module(numpy REQUIRED) + find_python_module(wheel REQUIRED) + FIND_PACKAGE(NumPy REQUIRED) +ENDIF(PYTHONINTERP_FOUND) +INCLUDE_DIRECTORIES(${PYTHON_INCLUDE_DIR}) +INCLUDE_DIRECTORIES(${PYTHON_NUMPY_INCLUDE_DIR}) diff --git a/fast_tokenizer/cmake/external/re2.cmake b/fast_tokenizer/cmake/external/re2.cmake new file mode 100644 index 0000000000000000000000000000000000000000..079dfaa3182c857706d9100225ea9bb1c098439b --- /dev/null +++ b/fast_tokenizer/cmake/external/re2.cmake @@ -0,0 +1,108 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+INCLUDE(ExternalProject) + +SET(RE2_PREFIX_DIR ${THIRD_PARTY_PATH}/re2) +SET(RE2_INSTALL_DIR ${THIRD_PARTY_PATH}/install/re2) +# As we add extra features for utf8proc, we use the non-official repo +SET(RE2_REPOSITORY ${GIT_URL}/google/re2.git) +SET(RE2_TAG 2022-04-01) + +IF(WIN32) + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/re2.lib") + add_definitions(-DRE2_STATIC) +ELSEIF(APPLE) + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") +ELSEIF(ANDROID) + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") +ELSE() + IF(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "aarch64|arm64") + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") + ELSE() + file(READ "/etc/issue" ETC_ISSUE) + string(REGEX MATCH "Debian|Ubuntu" DIST ${ETC_ISSUE}) + IF(DIST STREQUAL "Debian") + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") + ELSEIF(DIST STREQUAL "Ubuntu") + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") + ELSE() + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib64/libre2.a") + ENDIF() + ENDIF() +ENDIF() + +SET(RE2_INCLUDE_DIR ${RE2_INSTALL_DIR}/include) +INCLUDE_DIRECTORIES(${RE2_INCLUDE_DIR}) + +IF(ANDROID) +set(CROSS_COMPILE_CMAKE_ARGS + "-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME}" + "-DCMAKE_SYSTEM_VERSION=${CMAKE_SYSTEM_VERSION}" + "-DCMAKE_ANDROID_ARCH_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DCMAKE_ANDROID_NDK=${CMAKE_ANDROID_NDK}" + "-DCMAKE_ANDROID_STL_TYPE=${CMAKE_ANDROID_STL_TYPE}" + "-DANDROID_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DANDROID_TOOLCHAIN=${ANDROID_TOOLCHAIN}" + "-DANDROID_STL=${CMAKE_ANDROID_STL_TYPE}" + "-DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}" + "-DCMAKE_TOOLCHAIN_FILE=${CMAKE_ANDROID_NDK}/build/cmake/android.toolchain.cmake" + "-DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=${CMAKE_ANDROID_NDK_TOOLCHAIN_VERSION}" + "-DANDROID_PLATFORM=android-${ANDROID_NATIVE_API_LEVEL}" + "-D__ANDROID_API__=${ANDROID_NATIVE_API_LEVEL}") + +ExternalProject_Add( + extern_re2 + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${RE2_REPOSITORY} + GIT_TAG ${RE2_TAG} + PREFIX ${RE2_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS ${CROSS_COMPILE_CMAKE_ARGS} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX:PATH=${RE2_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + BUILD_BYPRODUCTS ${RE2_LIBRARIES} +) +ELSE() +ExternalProject_Add( + extern_re2 + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${RE2_REPOSITORY} + GIT_TAG ${RE2_TAG} + PREFIX ${RE2_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX:PATH=${RE2_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + BUILD_BYPRODUCTS ${RE2_LIBRARIES} +) +ENDIF() + +ADD_LIBRARY(re2 STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET re2 PROPERTY IMPORTED_LOCATION ${RE2_LIBRARIES}) +ADD_DEPENDENCIES(re2 extern_re2) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/utf8proc.cmake b/fast_tokenizer/cmake/external/utf8proc.cmake new file mode 100644 index 
0000000000000000000000000000000000000000..460cbab819c4e5a132c5c7b798a27b366cc821ce --- /dev/null +++ b/fast_tokenizer/cmake/external/utf8proc.cmake @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(UTF8PROC_PREFIX_DIR ${THIRD_PARTY_PATH}/utf8proc) +SET(UTF8PROC_INSTALL_DIR ${THIRD_PARTY_PATH}/install/utf8proc) +# As we add extra features for utf8proc, we use the non-official repo +SET(UTF8PROC_REPOSITORY ${GIT_URL}/JuliaStrings/utf8proc.git) +SET(UTF8PROC_TAG v2.6.1) + +IF(WIN32) + SET(UTF8PROC_LIBRARIES "${UTF8PROC_INSTALL_DIR}/lib/utf8proc_static.lib") + add_definitions(-DUTF8PROC_STATIC) +ELSE(WIN32) + SET(UTF8PROC_LIBRARIES "${UTF8PROC_INSTALL_DIR}/lib/libutf8proc.a") +ENDIF(WIN32) + +INCLUDE_DIRECTORIES(${UTF8PROC_INSTALL_DIR}/include) + +ExternalProject_Add( + extern_utf8proc + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${UTF8PROC_REPOSITORY} + GIT_TAG ${UTF8PROC_TAG} + PREFIX ${UTF8PROC_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DBUILD_SHARED=ON + -DBUILD_STATIC=ON + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_INSTALL_PREFIX:PATH=${UTF8PROC_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE} + BUILD_BYPRODUCTS ${UTF8PROC_LIBRARIES} +) + +ADD_LIBRARY(utf8proc STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET utf8proc PROPERTY IMPORTED_LOCATION ${UTF8PROC_LIBRARIES}) +ADD_DEPENDENCIES(utf8proc extern_utf8proc) \ No newline at end of file diff --git a/fast_tokenizer/cmake/generic.cmake b/fast_tokenizer/cmake/generic.cmake new file mode 100644 index 0000000000000000000000000000000000000000..07266667383b831e1a149d9c50a6c3606381700a --- /dev/null +++ b/fast_tokenizer/cmake/generic.cmake @@ -0,0 +1,208 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +function(cc_library TARGET_NAME) + set(options STATIC static SHARED shared INTERFACE interface) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(cc_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + if(WIN32) + # add libxxx.lib prefix in windows + set(${TARGET_NAME}_LIB_NAME "${CMAKE_STATIC_LIBRARY_PREFIX}${TARGET_NAME}${CMAKE_STATIC_LIBRARY_SUFFIX}" CACHE STRING "output library name for target ${TARGET_NAME}") + endif(WIN32) + if(cc_library_SRCS) + if(cc_library_SHARED OR cc_library_shared) # build *.so + add_library(${TARGET_NAME} SHARED ${cc_library_SRCS}) + elseif(cc_library_INTERFACE OR cc_library_interface) + generate_dummy_static_lib(LIB_NAME ${TARGET_NAME} FILE_PATH ${target_SRCS} GENERATOR "generic.cmake:cc_library") + else() + add_library(${TARGET_NAME} STATIC ${cc_library_SRCS}) + endif() + if(cc_library_DEPS) + # remove link to python, see notes at: + # https://github.com/pybind/pybind11/blob/master/docs/compiling.rst#building-manually + if("${cc_library_DEPS};" MATCHES "python;") + list(REMOVE_ITEM cc_library_DEPS python) + add_dependencies(${TARGET_NAME} python) + if(WIN32) + target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES}) + else() + target_link_libraries(${TARGET_NAME} "-Wl,-undefined,dynamic_lookup") + endif(WIN32) + endif() + target_link_libraries(${TARGET_NAME} ${cc_library_DEPS} ${PUBLIC_DEPEND_LIBS}) + endif() + # For C++ 17 filesystem + # target_link_libraries(${TARGET_NAME} stdc++fs) + + # cpplint code style + foreach(source_file ${cc_library_SRCS}) + string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + list(APPEND cc_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + endif() + endforeach() + + else(cc_library_SRCS) + if(cc_library_DEPS) + list(REMOVE_DUPLICATES cc_library_DEPS) + set(dummy_FILE_PATH "${CMAKE_CURRENT_BINARY_DIR}/${TARGET_NAME}_dummy.c") + configure_file(${PROJECT_SOURCE_DIR}/cmake/dummy.c.in ${dummy_FILE_PATH} @ONLY) + if(cc_library_SHARED OR cc_library_shared) # build *.so + add_library(${TARGET_NAME} SHARED ${dummy_FILE_PATH}) + elseif(cc_library_INTERFACE OR cc_library_interface) + generate_dummy_static_lib(LIB_NAME ${TARGET_NAME} FILE_PATH ${dummy_FILE_PATH} GENERATOR "generic.cmake:cc_library") + else() + add_library(${TARGET_NAME} STATIC ${dummy_FILE_PATH}) + endif() + target_link_libraries(${TARGET_NAME} ${cc_library_DEPS}) + else() + message(FATAL_ERROR "Please specify source files or libraries in cc_library(${TARGET_NAME} ...).") + endif() + endif(cc_library_SRCS) +endfunction(cc_library) + +function(cc_test_build TARGET_NAME) + if(WITH_TESTING AND NOT "$ENV{CI_SKIP_CPP_TEST}" STREQUAL "ON") + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + add_executable(${TARGET_NAME} ${cc_test_SRCS}) + if(WIN32) + if("${cc_test_DEPS};" MATCHES "python;") + list(REMOVE_ITEM cc_test_DEPS python) + target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES}) + endif() + endif(WIN32) + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} ${os_dependency_modules} tokenizers_gtest_main gtest glog) + add_dependencies(${TARGET_NAME} ${cc_test_DEPS} gtest) + endif() +endfunction() + +function(cc_test_run TARGET_NAME) + if(WITH_TESTING AND NOT "$ENV{CI_SKIP_CPP_TEST}" STREQUAL "ON") + set(oneValueArgs "") + set(multiValueArgs COMMAND ARGS) + 
cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + add_test(NAME ${TARGET_NAME} + COMMAND ${cc_test_COMMAND} ${cc_test_ARGS} + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) + # No unit test should exceed 2 minutes. + if (WIN32) + set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 150) + endif() + if (APPLE) + set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 20) + endif() + elseif(WITH_TESTING AND NOT TEST ${TARGET_NAME}) + add_test(NAME ${TARGET_NAME} COMMAND ${CMAKE_COMMAND} -E echo CI skip ${TARGET_NAME}.) + endif() +endfunction() + +function(cc_test TARGET_NAME) + # The environment variable `CI_SKIP_CPP_TEST` is used to skip the compilation + # and execution of test in CI. `CI_SKIP_CPP_TEST` is set to ON when no files + # other than *.py are modified. + if(WITH_TESTING) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS ARGS) + cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + cc_test_build(${TARGET_NAME} + SRCS ${cc_test_SRCS} + DEPS ${cc_test_DEPS}) + add_test(NAME ${TARGET_NAME} COMMAND ${CMAKE_COMMAND} -E echo CI skip ${TARGET_NAME}.) + endif() +endfunction(cc_test) + +# create a dummy source file, then create a static library. +# LIB_NAME should be the static lib name. +# FILE_PATH should be the dummy source file path. +# GENERATOR should be the file name invoke this function. +# CONTENT should be some helpful info. +# example: generate_dummy_static_lib(mylib FILE_PATH /path/to/dummy.c GENERATOR mylib.cmake CONTENT "helpful info") +function(generate_dummy_static_lib) + set(options "") + set(oneValueArgs LIB_NAME FILE_PATH GENERATOR CONTENT) + set(multiValueArgs "") + cmake_parse_arguments(dummy "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + if(NOT dummy_LIB_NAME) + message(FATAL_ERROR "You must provide a static lib name.") + endif() + if(NOT dummy_FILE_PATH) + set(dummy_FILE_PATH "${CMAKE_CURRENT_BINARY_DIR}/${dummy_LIB_NAME}_dummy.c") + endif() + if(NOT dummy_GENERATOR) + message(FATAL_ERROR "You must provide a generator file name.") + endif() + if(NOT dummy_CONTENT) + set(dummy_CONTENT "${dummy_LIB_NAME}_dummy.c for lib ${dummy_LIB_NAME}") + endif() + + configure_file(${PROJECT_SOURCE_DIR}/cmake/dummy.c.in ${dummy_FILE_PATH} @ONLY) + add_library(${dummy_LIB_NAME} STATIC ${dummy_FILE_PATH}) +endfunction() + +function(paddle_protobuf_generate_cpp SRCS HDRS) + if(NOT ARGN) + message( + SEND_ERROR + "Error: paddle_protobuf_generate_cpp() called without any proto files") + return() + endif() + + set(${SRCS}) + set(${HDRS}) + + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(FIL_WE ${FIL} NAME_WE) + set(_protobuf_protoc_src "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.cc") + set(_protobuf_protoc_hdr "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.h") + list(APPEND ${SRCS} "${_protobuf_protoc_src}") + list(APPEND ${HDRS} "${_protobuf_protoc_hdr}") + + add_custom_command( + OUTPUT "${_protobuf_protoc_src}" "${_protobuf_protoc_hdr}" + COMMAND ${CMAKE_COMMAND} -E make_directory "${CMAKE_CURRENT_BINARY_DIR}" + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} -I${CMAKE_CURRENT_SOURCE_DIR} + --cpp_out "${CMAKE_CURRENT_BINARY_DIR}" ${ABS_FIL} + # Set `EXTERN_PROTOBUF_DEPEND` only if need to compile `protoc.exe`. 
+ DEPENDS ${ABS_FIL} ${EXTERN_PROTOBUF_DEPEND} + COMMENT "Running C++ protocol buffer compiler on ${FIL}" + VERBATIM) + endforeach() + + set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE) + set(${SRCS} + ${${SRCS}} + PARENT_SCOPE) + set(${HDRS} + ${${HDRS}} + PARENT_SCOPE) +endfunction() + +function(proto_library TARGET_NAME) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(proto_library "${options}" "${oneValueArgs}" + "${multiValueArgs}" ${ARGN}) + set(proto_srcs) + set(proto_hdrs) + paddle_protobuf_generate_cpp(proto_srcs proto_hdrs ${proto_library_SRCS}) + cc_library( + ${TARGET_NAME} + SRCS ${proto_srcs} + DEPS ${proto_library_DEPS} protobuf) +endfunction() \ No newline at end of file diff --git a/fast_tokenizer/cmake/python_module.cmake b/fast_tokenizer/cmake/python_module.cmake new file mode 100644 index 0000000000000000000000000000000000000000..8fdccc17e9b0ca5e66c62652bbc8fe4a7065fc00 --- /dev/null +++ b/fast_tokenizer/cmake/python_module.cmake @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +function(find_python_module module) + string(TOUPPER ${module} module_upper) + if(NOT PY_${module_upper}) + if(ARGC GREATER 1 AND ARGV1 STREQUAL "REQUIRED") + set(${module}_FIND_REQUIRED TRUE) + else() + set(${module}_FIND_REQUIRED FALSE) + endif() + # A module's location is usually a directory, but for binary modules + # it's a .so file. 
+ execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" + "import re, ${module}; print(re.compile('/__init__.py.*').sub('',${module}.__file__))" + RESULT_VARIABLE _${module}_status + OUTPUT_VARIABLE _${module}_location + ERROR_QUIET + OUTPUT_STRIP_TRAILING_WHITESPACE) + if(NOT _${module}_status) + set(PY_${module_upper} ${_${module}_location} CACHE STRING + "Location of Python module ${module}") + endif(NOT _${module}_status) + endif(NOT PY_${module_upper}) + find_package_handle_standard_args(PY_${module} DEFAULT_MSG PY_${module_upper}) + if(NOT PY_${module_upper}_FOUND AND ${module}_FIND_REQUIRED) + message(FATAL_ERROR "python module ${module} is not found") + endif() + + execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" + "import sys, ${module}; sys.stdout.write(${module}.__version__)" + OUTPUT_VARIABLE _${module}_version + RESULT_VARIABLE _${module}_status + ERROR_QUIET + OUTPUT_STRIP_TRAILING_WHITESPACE) + if(NOT _${module}_status) + set(PY_${module_upper}_VERSION ${_${module}_version} CACHE STRING + "Version of Python module ${module}") + endif(NOT _${module}_status) + + set(PY_${module_upper}_FOUND ${PY_${module_upper}_FOUND} PARENT_SCOPE) + set(PY_${module_upper}_VERSION ${PY_${module_upper}_VERSION} PARENT_SCOPE) +endfunction(find_python_module) diff --git a/fast_tokenizer/cmake/third_party.cmake b/fast_tokenizer/cmake/third_party.cmake new file mode 100644 index 0000000000000000000000000000000000000000..ef17fc6bf3dc9c24c79ce3225a7d7d4557b1d5ae --- /dev/null +++ b/fast_tokenizer/cmake/third_party.cmake @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+include(ExternalProject)
+
+set(THIRD_PARTY_PATH "${CMAKE_BINARY_DIR}/third_party" CACHE STRING
+    "A path setting third party libraries download & build directories.")
+
+include(external/icu)
+if(WITH_TESTING)
+  include(external/gtest)
+endif()
+include(external/gflags)
+include(external/glog)
+include(external/re2)
+include(external/nlohmann_json)
+include(external/dart) # For trie
+if (WITH_PYTHON)
+  include(external/python)
+  include(external/pybind11)
+endif()
\ No newline at end of file
diff --git a/fast_tokenizer/docs/compile/README.md b/fast_tokenizer/docs/compile/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6ae8c791ae7c8c5ea946377d639612d4566bd2dd
--- /dev/null
+++ b/fast_tokenizer/docs/compile/README.md
@@ -0,0 +1,16 @@
+# FastTokenizer 编译指南
+
+本文档介绍 FastTokenizer C++ 库、Python 库以及 Android 库三种编译方式,请根据编译的目标平台参考如下文档:
+
+- [Linux & Mac 编译](./how_to_build_linux_and_mac.md)
+- [Windows 编译](./how_to_build_windows.md)
+- [Android 编译](./how_to_build_android.md)
+
+FastTokenizer 使用 CMake 编译,编译过程中各平台可用的编译选项如下表所示(表后附一个选项设置示例):
+
+| 选项 | 作用 | 备注 |
+|:---- | :--- | :--- |
+| WITH_PYTHON | 是否编译为 Python 库,默认为是,否则为 C++ 库 ||
+| WITH_TESTING | 是否编译 C++ 单测,默认为否 ||
+| WITH_ICU_LITE | 是否与 ICU Lite 依赖包联编,打开后可减小 FastTokenizer 库体积,默认为否 | 只能用于 Android 编译,打开后 FastTokenizer 库体积大小从 **32 M 减少到 7.4 M**,但只能对中英文进行分词。|
+| USE_ABI0 | 是否以 _GLIBCXX_USE_CXX11_ABI=0 编译,默认为 OFF ||
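+
+下面给出一个在 Linux 下组合上述选项、编译 C++ 库并开启单测的命令示意(选项取值仅作演示,完整流程请以各平台编译文档为准):
+
+```bash
+# 在 fast_tokenizer 的 build 目录中执行
+cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=ON -DUSE_ABI0=OFF -DCMAKE_BUILD_TYPE=Release
+make -j8
+```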
diff --git a/fast_tokenizer/docs/compile/how_to_build_android.md b/fast_tokenizer/docs/compile/how_to_build_android.md
new file mode 100644
index 0000000000000000000000000000000000000000..40c1cfe375db2b56c9d541268309d18c56599c20
--- /dev/null
+++ b/fast_tokenizer/docs/compile/how_to_build_android.md
@@ -0,0 +1,46 @@
+# Android 编译
+
+FastTokenizer 提供两种版本的 Android 库,分别是常规版本以及轻量版本。常规版本的 FastTokenizer Android 库功能齐全,可支持任意语言的分词,库体积大约为 **32 M**;轻量版本主要支持中文和英文两种语言的分词,库体积约为 **7.4 M**。开发者可以根据实际需求选择合适的版本,以下分别介绍这两种版本的编译方式。
+
+## 环境依赖
+
+- cmake >= 3.10
+- NDK >= 20
+
+## 配置 NDK
+```bash
+wget https://dl.google.com/android/repository/android-ndk-r20b-linux-x86_64.zip
+unzip android-ndk-r20b-linux-x86_64.zip # 会解压缩到 android-ndk-r20b 目录
+export NDK_ROOT=${PWD}/android-ndk-r20b
+```
+
+## 编译 C++ 库方法
+
+### 常规版本
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang
+make -j8
+```
+
+### 轻量版本
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang -DWITH_ICU_LITE=ON
+make -j8
+```
+
+### 库体积压缩
+
+编译后的 C++ 库位于当前目录下的 `cpp` 目录中。可以选择使用 strip 减小库体积:
+```shell
+$NDK_ROOT/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin/strip libcore_tokenizers.so
+```
+
+更多编译选项说明参考[编译指南](./README.md)
diff --git a/fast_tokenizer/docs/compile/how_to_build_linux_and_mac.md b/fast_tokenizer/docs/compile/how_to_build_linux_and_mac.md
new file mode 100644
index 0000000000000000000000000000000000000000..cd13724aef2dfe355b3228ebfc62b00b29d96a36
--- /dev/null
+++ b/fast_tokenizer/docs/compile/how_to_build_linux_and_mac.md
@@ -0,0 +1,36 @@
+# Linux & Mac 编译
+
+## 环境依赖
+
+- cmake >= 3.10
+- gcc >= 8.2.0
+
+## 编译 C++ 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
+make -j8
+```
+
+编译后的 C++ 库位于当前目录下的 `cpp` 目录中。
+
+## 编译 Python 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+# 设置 Python 环境
+export LD_LIBRARY_PATH=/opt/_internal/cpython-3.6.0/lib/:${LD_LIBRARY_PATH}
+export PATH=/opt/_internal/cpython-3.6.0/bin/:${PATH}
+
+cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
+make -j8
+```
+
+编译后的 wheel 包位于当前目录下的 `dist` 目录中。
+
+更多编译选项说明参考[编译指南](./README.md)
diff --git a/fast_tokenizer/docs/compile/how_to_build_windows.md b/fast_tokenizer/docs/compile/how_to_build_windows.md
new file mode 100644
index 0000000000000000000000000000000000000000..4796b0418034eaf51ee65b0e712c3eb6c135b5b1
--- /dev/null
+++ b/fast_tokenizer/docs/compile/how_to_build_windows.md
@@ -0,0 +1,42 @@
+# Windows 编译
+
+## 环境依赖
+
+- cmake >= 3.10
+- VS 2019
+- ninja
+
+以上依赖安装好后,在 Windows 菜单打开 `x64 Native Tools Command Prompt for VS 2019` 命令工具即可进行下面的编译环节。
+
+## 编译 C++ 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -G "Ninja" -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
+ninja -j8
+```
+
+编译后的 C++ 库位于当前目录下的 `cpp` 目录中。
+
+## 编译 Python 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+# 需要指定 Python 库
+cmake .. -G "Ninja" -DWITH_PYTHON=ON ^
+         -DWITH_TESTING=OFF ^
+         -DCMAKE_BUILD_TYPE=Release ^
+         -DPYTHON_EXECUTABLE=C:\Python37\python.exe ^
+         -DPYTHON_INCLUDE_DIR=C:\Python37\include ^
+         -DPYTHON_LIBRARY=C:\Python37\libs\python3%%x.lib
+ninja -j8
+```
+
+编译后的 wheel 包位于当前目录下的 `dist` 目录中。
+
+更多编译选项说明参考[编译指南](./README.md)
diff --git a/fast_tokenizer/docs/cpp/README.md b/fast_tokenizer/docs/cpp/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7b8976efd4de2b7c469e786dfc58f33a3978217c
--- /dev/null
+++ b/fast_tokenizer/docs/cpp/README.md
@@ -0,0 +1,71 @@
+# FastTokenizer C++ 库使用教程
+
+## 1. 快速安装
+
+当前版本 FastTokenizer C++ 库支持不同的操作系统以及硬件平台,并为以下平台提供预编译包:
+|系统|下载地址|
+|---|---|
+|Linux-x64| [fast_tokenizer-linux-x64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.2.tgz) |
+|Linux-aarch64| [fast_tokenizer-linux-aarch64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-aarch64-1.0.2.tgz) |
+|Windows| [fast_tokenizer-win-x64-1.0.2.zip](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-win-x64-1.0.2.zip) |
+|MacOS-x64| [fast_tokenizer-osx-x86_64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-x86_64-1.0.2.tgz) |
+|MacOS-arm64| [fast_tokenizer-osx-arm64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-arm64-1.0.2.tgz) |
+|Android-arm64-v8a| [fast_tokenizer-android-arm64-v8a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-android-arm64-v8a-1.0.2.tgz) |
+|Android-armeabi-v7a| [fast_tokenizer-android-armeabi-v7a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-android-armeabi-v7a-1.0.2.tgz) |
+|Android-lite-arm64-v8a| [fast_tokenizer-lite-android-arm64-v8a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-lite-android-arm64-v8a-1.0.2.tgz) |
+|Android-lite-armeabi-v7a| [fast_tokenizer-lite-android-armeabi-v7a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-lite-android-armeabi-v7a-1.0.2.tgz) |
+
+### 环境依赖
+
+#### 系统环境要求
+|系统|版本|架构|
+|---|---|---|
+|Linux|Ubuntu 16.04+,CentOS 7+|x64, aarch64|
+|Windows|10+|x64|
+|MacOS| 11.4+|x64, arm64|
+|Android| - |arm64-v8a, armeabi-v7a|
+
+#### Linux,Mac 编译环境要求
+|依赖|版本|
+|---|---|
+|cmake|>=3.16|
+|gcc|>=8.2.0|
+
+#### Windows 编译环境要求
+|依赖|版本|
+|---|---|
+|cmake|>=3.16|
+|VisualStudio|2019|
+
+### 下载解压
+
+```shell
+wget -c https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.2.tgz
+
+tar xvfz fast_tokenizer-linux-x64-1.0.2.tgz
+# 解压后为 fast_tokenizer 目录
+```
+
+解压后得到 fast_tokenizer 目录,该目录的结构如下:
+
+```shell
+
+fast_tokenizer
+|__ commit.log # 编译时的 commit id
+|__ FastTokenizer.cmake # FastTokenizer CMake 文件,定义了头文件目录、动态链接库目录变量
+|__ include # FastTokenizer 的头文件目录
+|__ lib # FastTokenizer 的动态链接库目录
+|__ third_party # FastTokenizer 依赖的第三方库目录
+
+```
+
+推荐用户直接使用 CMake 方式引入 FastTokenizer 库。在 CMake 中引入 FastTokenizer 时,只需添加一行 `include(FastTokenizer.cmake)`,即可获取 FastTokenizer 预定义的 CMake 变量 `FAST_TOKENIZER_INCS` 和 `FAST_TOKENIZER_LIBS`,分别指定 FastTokenizer 的头文件目录以及动态链接库目录。
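+
+下面给出一个引入 FastTokenizer 的最简 CMakeLists.txt 片段,仅作示意:其中解压路径 `FAST_TOKENIZER_INSTALL_DIR`、目标名 `demo` 与源文件 `demo.cc` 均为假设的名称,请根据实际工程调整。
+
+```cmake
+cmake_minimum_required(VERSION 3.10)
+project(fast_tokenizer_demo CXX)
+
+# 假设 FAST_TOKENIZER_INSTALL_DIR 指向解压后的 fast_tokenizer 目录
+set(FAST_TOKENIZER_INSTALL_DIR /path/to/fast_tokenizer)
+
+# 引入 FastTokenizer.cmake 后即可使用 FAST_TOKENIZER_INCS 与 FAST_TOKENIZER_LIBS 两个变量
+include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake)
+include_directories(${FAST_TOKENIZER_INCS})
+
+add_executable(demo demo.cc)
+target_link_libraries(demo ${FAST_TOKENIZER_LIBS})
+```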
快速开始
+
+目前 FastTokenizer 提供了以下 C++ 使用示例。
+
+[ErnieFastTokenizer C++示例](../../examples/ernie-3.0/README.md)
+
+[ClipFastTokenizer C++示例](../../examples/clip/README.md)
diff --git a/fast_tokenizer/docs/pipeline/README.md b/fast_tokenizer/docs/pipeline/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..16da52f90ef4a9d5079cdaed2eb3674e65854cc5
--- /dev/null
+++ b/fast_tokenizer/docs/pipeline/README.md
@@ -0,0 +1,254 @@
+# FastTokenizer Pipeline
+
+当我们使用 Tokenizer 的 `Tokenizer.encode` 或者 `Tokenizer.encode_batch` 方法进行分词时,会经历如下四个阶段:Normalize、PreTokenize、Model 以及 PostProcess。针对这四个阶段,FastTokenizer 提供 Normalizer、PreTokenizer、Model 以及 PostProcessor 四个组件,分别完成各阶段所需要的工作。下面将详细介绍四大组件具体负责的工作,并通过示例介绍如何组合四个组件定义一个 Tokenizer。
+
+## Normalizer
+
+Normalizer 组件主要用于将原始字符串标准化,输出标准化的字符串,常见的标准化操作有大小写转换、半角全角转换等。FastTokenizer 所有 Normalizer 类都继承自 `normalizers.Normalizer`,命名方式均为 `normalizers.*Normalizer`。FastTokenizer 还支持将现有 Normalizer 类进行组合得到一个 Normalizer 序列,用户可以通过调用 `normalizers.SequenceNormalizer` 使用已有的 Normalizer 自定义新的 Normalizer。下面将分别展示 Python 以及 C++ 上的使用示例。
+
+### Python 示例
+
+```python
+import fast_tokenizer
+from fast_tokenizer.normalizers import LowercaseNormalizer, SequenceNormalizer, NFDNormalizer, StripAccentsNormalizer
+
+normalizer = SequenceNormalizer([NFDNormalizer(), StripAccentsNormalizer(), LowercaseNormalizer()])
+print(normalizer.normalize_str("Héllò hôw are ü?"))
+# hello how are u?
+```
+
+### C++ 示例
+
+```c++
+
+#include <iostream>
+#include "fast_tokenizer/normalizers/normalizers.h"
+using namespace paddlenlp::fast_tokenizer;
+
+int main() {
+  normalizers::NFDNormalizer n1;
+  normalizers::StripAccentsNormalizer n2;
+  normalizers::LowercaseNormalizer n3;
+  normalizers::SequenceNormalizer normalizer({&n1, &n2, &n3});
+  normalizers::NormalizedString normalized("Héllò hôw are ü?");
+  normalizer(&normalized);
+  // Expected output
+  // normalized string: hello how are u?
+  // original string: Héllò hôw are ü?
+  std::cout << "normalized string: " << normalized.GetStr() << std::endl;
+  std::cout << "original string: " << normalized.GetOrignalStr() << std::endl;
+}
+
+```
+
+## PreTokenizer
+
+PreTokenizer 组件主要使用简单的分词方法,将标准化后的字符串进行预切词,得到较大粒度的词组(word),例如按照标点、空格等方式进行切分。FastTokenizer 所有 PreTokenizer 类都继承自 `pretokenizers.PreTokenizer`,命名方式均为 `pretokenizers.*PreTokenizer`。下面将分别展示 Python 以及 C++ 上使用空格对文本进行分词的使用示例。
+
+### Python 示例
+
+```python
+import fast_tokenizer
+from fast_tokenizer.pretokenizers import WhitespacePreTokenizer
+pretokenizer = WhitespacePreTokenizer()
+print(pretokenizer.pretokenize_str("Hello! How are you? I'm fine, thank you."))
+# [('Hello!', (0, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you?', (15, 19)), ("I'm", (20, 23)), ('fine,', (24, 29)), ('thank', (30, 35)), ('you.', (36, 40))]
+```
+
+### C++ 示例
+
+```c++
+
+#include <iostream>
+#include "fast_tokenizer/pretokenizers/pretokenizers.h"
+
+using namespace paddlenlp::fast_tokenizer;
+
+int main() {
+  pretokenizers::WhitespacePreTokenizer pretokenizer;
+  pretokenizers::PreTokenizedString pretokenized(
+      "Hello! How are you? 
I'm fine, thank you."); + pretokenizer(&pretokenized); + auto&& splits = pretokenized.GetSplits(true, core::OffsetType::CHAR); + for (auto&& split : splits) { + auto&& value = std::get<0>(split); + auto&& offset = std::get<1>(split); + std::cout << "(" << value << ", (" << offset.first << ", " << offset.second + << ")" + << ")" << std::endl; + } + return 0; +} + +// (Hello!, (0, 6)) +// (How, (7, 10)) +// (are, (11, 14)) +// (you?, (15, 19)) +// (I'm, (20, 23)) +// (fine,, (24, 29)) +// (thank, (30, 35)) +// (you., (36, 40)) + +``` + +## Model + +Model 组件是 FastTokenizer 核心模块,用于将粗粒度词组按照一定的算法进行切分,得到细粒度的 Token(word piece)及其对应的在词表中的 id,目前支持的切词算法包括 FastWordPiece[1]、WordPiece、BPE 以及 Unigram。其中,`FastWordPiece` 是 "Fast WordPiece Tokenization" 提出的基于`MinMaxMatch`匹配算法的一种分词算法。原有 `WordPiece` 算法的时间复杂度与序列长度为二次方关系,在对长文本进行分词操作时,时间开销比较大。而 `FastWordPiece` 算法通过 `Aho–Corasick` 算法避免 Token 失配时从头匹配,将 `WordPiece` 算法的时间复杂度降低为与序列长度的线性关系,大大提升了分词效率。下面是 `FastWordPiece` 类的初始化示例。 + +### Python 示例 + +```python +import fast_tokenizer +from fast_tokenizer.models import FastWordPiece + +# Initialize model from ernie 3.0 vocab file +model = FastWordPiece.from_file("ernie-3.0-medium-vocab.txt", with_pretokenization=True) +print(model.tokenize("我爱中国!")) +# [id: 75 value:我 offset: (0, 3), id: 329 value:爱 offset: (3, 6), id: 12 value:中 offset: (6, 9), id: 20 value:国 offset: (9, 12), id: 12046 value:! offset: (12, 13)] +``` + +### C++ 示例 + +```c++ + +#include +#include + +#include "fast_tokenizer/models/models.h" + +using namespace paddlenlp::fast_tokenizer; + +int main() { + std::string text = "我爱中国!"; + auto model = models::FastWordPiece::GetFastWordPieceFromFile( + "ernie_vocab.txt", "[UNK]", 100, "##", true); + std::vector results = model.Tokenize(text); + for (const core::Token& token : results) { + std::cout << "id: " << token.id_ << ", value: " << token.value_ + << ", offset: (" << token.offset_.first << ", " + << token.offset_.second << ")." << std::endl; + } + return 0; +} + +// id: 75, value: 我, offset: (0, 3). +// id: 329, value: 爱, offset: (3, 6). +// id: 12, value: 中, offset: (6, 9). +// id: 20, value: 国, offset: (9, 12). +// id: 12044, value: !, offset: (12, 15). +``` + +## PostProcessor + +PostProcess 组件主要执行 Transformer 类模型的文本序列的后处理逻辑,比如添加 [SEP] 等特殊 Token,并且会将前面分词得到的结果转为一个 `Encoding` 的结构体,包含 token_ids, type_ids, offset, position_ids 等模型所需要的信息。FastTokenizer 所有 PostProcessor 类都继承自 `normalizers.PostProcessor`,命名方式均为 `normalizers.*PostProcessor`。 + +## Tokenizer + +Tokenizer 对象在运行`Tokenizer.encode` 或者 `Tokenizer.encode_batch` 方法进行分词时,通过调用各个阶段组件的回调函数运行不同阶段的处理逻辑。所以我们定义 Tokenizer 对象时,需要设置各个阶段的组件。下面将通过代码示例展示如何定义 ERNIE 模型的 Tokenizer。 + +### Python 示例 + +```python + +import fast_tokenizer +from fast_tokenizer import Tokenizer +from fast_tokenizer.models import FastWordPiece +from fast_tokenizer.normalizers import BertNormalizer +from fast_tokenizer.pretokenizers import BertPreTokenizer + +# 1. Initialize model from ernie 3.0 vocab file +model = FastWordPiece.from_file("ernie-3.0-medium-vocab.txt") + +# 2. Use model to initialize a tokenizer object +tokenizer = Tokenizer(model) + +# 3. Set a normalizer +tokenizer.normalizer = BertNormalizer( + clean_text=True, + handle_chinese_chars=True, + strip_accents=True, + lowercase=True, +) + +# 4. Set a pretokenizer +tokenizer.pretokenizer = BertPreTokenizer() + +print(tokenizer.encode("我爱中国!")) + +# The Encoding content: +# ids: 75, 329, 12, 20, 12046 +# type_ids: 0, 0, 0, 0, 0 +# tokens: 我, 爱, 中, 国, ! 
+# offsets: (0, 1), (1, 2), (2, 3), (3, 4), (4, 5) +# special_tokens_mask: 0, 0, 0, 0, 0 +# attention_mask: 1, 1, 1, 1, 1 +# sequence_ranges: +``` + +针对 ERNIE、BERT 这类常见模型,FastTokenizer Python 库 已经定义好这类模型的 Tokenizer,可以通过 `from fast_tokenizer import ErnieFastTokenizer` 直接使用。 + +### C++ 示例 + +```c++ + +#include +#include + +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" + +using namespace paddlenlp::fast_tokenizer; + +int main() { + std::vector texts{"我爱中国!"}; + core::Tokenizer tokenizer; + + // 1. Set model + auto model = models::FastWordPiece::GetFastWordPieceFromFile( + "ernie_vocab.txt", "[UNK]", 100, "##", true); + tokenizer.SetModel(model); + + // 2. Set Normalizer + normalizers::BertNormalizer normalizer( + /* clean_text = */ true, + /* handle_chinese_chars = */ true, + /* strip_accents= */ true, + /* lowercase = */ true); + tokenizer.SetNormalizer(normalizer); + + // 3. Set Pretokenizer + pretokenizers::BertPreTokenizer pretokenizer; + tokenizer.SetPreTokenizer(pretokenizer); + + // 4. Set PostProcessor + postprocessors::BertPostProcessor postprocessor; + tokenizer.SetPostProcessor(postprocessor); + + std::vector encodings; + tokenizer.EncodeBatchStrings(texts, &encodings); + + for (auto encoding : encodings) { + std::cout << encoding.DebugString() << std::endl; + } + return 0; +} + +// The Encoding content: +// ids: 101, 75, 329, 12, 20, 12044, 102 +// type_ids: 0, 0, 0, 0, 0, 0, 0 +// tokens: [CLS], 我, 爱, 中, 国, !, [SEP] +// offsets: (0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 0) +// special_tokens_mask: 1, 0, 0, 0, 0, 0, 1 +// attention_mask: 1, 1, 1, 1, 1, 1, 1 +// sequence_ranges: {0 : (1, 6) }, +``` + +针对 ERNIE、BERT 这类常见模型,FastTokenizer C++ 库 已经定义好这类模型的 Tokenizer,可以通过 `paddlenlp::fast_tokenizer::tokenizers_impl::ErnieFastTokenizer` 直接使用。 + + +## 参考文献 + +- [1] Xinying Song, Alex Salcianuet al. "Fast WordPiece Tokenization", EMNLP, 2021 diff --git a/fast_tokenizer/docs/python/README.md b/fast_tokenizer/docs/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..86720f9dcca9a4eb490396c1be0aa6e24758cfe3 --- /dev/null +++ b/fast_tokenizer/docs/python/README.md @@ -0,0 +1,16 @@ +# FastTokenizer Python 库使用教程 + +## 1. 
快速安装 + +### 环境依赖 + +- Windows 64位系统 +- Linux x64系统 +- MacOS 10.14+系统(m1芯片的MacOS,需要使用x86_64版本的Anaconda作为python环境方可安装使用) +- Python 3.6 ~ 3.10 + +### 安装 + +```shell +pip install --upgrade fast_tokenizer +``` diff --git a/fast_tokenizer/examples/clip/README.md b/fast_tokenizer/examples/clip/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/clip/cpp/CMakeLists.txt b/fast_tokenizer/examples/clip/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..271104d7f89caa8035677d23112694f2af9827b4 --- /dev/null +++ b/fast_tokenizer/examples/clip/cpp/CMakeLists.txt @@ -0,0 +1,28 @@ +cmake_minimum_required(VERSION 3.10) +project(cpp_fast_tokenizer_demo CXX C) +option(FAST_TOKENIZER_INSTALL_DIR "Path of downloaded fast_tokenizer sdk.") + +# Download clip vocab and merge files +set(CLIP_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_vocab.json) +set(CLIP_MERGES_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_merges.txt) + +if (EXISTS ${CLIP_VOCAB_PATH}) + message("The ${CLIP_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/vocab.json" ${CLIP_VOCAB_PATH} SHOW_PROGRESS) + message("Already download the vocab.json of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +if (EXISTS ${CLIP_MERGES_PATH}) + message("The ${CLIP_MERGES_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/merges.txt" ${CLIP_MERGES_PATH} SHOW_PROGRESS) + message("Already download the merges.txt of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +# Get FAST_TOKENIZER_INCS and FAST_TOKENIZER_LIBS +include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake) +include_directories(${FAST_TOKENIZER_INCS}) + +add_executable(demo ${PROJECT_SOURCE_DIR}/demo.cc) +target_link_libraries(demo ${FAST_TOKENIZER_LIBS}) diff --git a/fast_tokenizer/examples/clip/cpp/README.md b/fast_tokenizer/examples/clip/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d24c0f7fd65f17d9e2b451e984c9eeefd44078b3 --- /dev/null +++ b/fast_tokenizer/examples/clip/cpp/README.md @@ -0,0 +1,99 @@ +# ClipFastTokenizer C++ 示例 + +## 1. 快速安装 + +当前版本FastTokenizer C++库支持不同的操作系统以及硬件平台,用户可以根据实际的使用环境,从以下选择合适的预编译包: +|系统|下载地址| +|---|---| +|Linux-x64| [fast_tokenizer-linux-x64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz) | +|Linux-aarch64| [fast_tokenizer-linux-aarch64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-aarch64-1.0.0.tgz) | +|Windows| [fast_tokenizer-win-x64-1.0.0.zip](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-win-x64-1.0.0.zip) | +|MacOS-x64| [fast_tokenizer-osx-x86_64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-x86_64-1.0.0.tgz) | +|MacOS-arm64| [fast_tokenizer-osx-arm64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-arm64-1.0.0.tgz) | + +### 环境依赖 + +#### 系统环境要求 +|系统|版本| +|---|---| +|Linux|Ubuntu 16.04+,CentOS 7+| +|Windows|10| +|MacOS| 11.4+| + + +#### Linux,Mac编译环境要求 +|依赖|版本| +|---|---| +|cmake|>=16.0| +|gcc|>=8.2.0| + +#### Windows编译环境要求 +|依赖|版本| +|---|---| +|cmake|>=16.0| +|VisualStudio|2019| + +## 2. 
快速开始 + +以下以Linux平台为例, 介绍如何使用FastTokenizer C++预编译包完成demo示例编译及运行。该示例会生成一个名为`demo`的可执行文件。 + +### 2.1 下载解压 + +```shell +wget -c https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz + +tar xvfz fast_tokenizer-linux-x64-1.0.0.tgz +# 解压后为fast_tokenizer目录 +``` + +解压后得到fast_tokenizer目录,该目录的结构如下: + +```shell + +fast_tokenizer +|__ commit.log # 编译时的commit id +|__ FastTokenizer.cmake # FastTokenizer CMake文件,定义了头文件目录、动态链接库目录变量 +|__ include # FastTokenizer的头文件目录 +|__ lib # FastTokenizer的动态链接库目录 +|__ third_party # FastTokenizer依赖的第三方库目录 + +``` + +推荐用户直接使用cmake方式引入FastTokenizer库。在CMake引入FastTokenizer时,只需添加一行 `include(FastTokenizer.cmake)`,即可获取FastTokenizer的预定义的CMake变量`FAST_TOKENIZER_INCS`和`FAST_TOKENIZER_LIBS`,分别指定FastTokenizer的头文件目录以及动态链接库目录。 + + +### 2.2 编译 + +示例提供简单的CMakeLists.txt, 用户仅需指定fast_tokenizer包的路径,即可完成编译。 + +```shell + +# 创建编译目录 +mkdir build +cd build + +# 运行cmake,通过指定fast_tokenizer包的路径,构建Makefile +cmake .. -DFAST_TOKENIZER_INSTALL_DIR=/path/to/fast_tokenizer + +# 编译 +make + +``` + +### 2.3 运行 + +```shell +./demo +``` + + +### 2.4 样例输出 + +输出包含原始文本的输入,以及分词后的ids序列结果(含padding)。 + +```shell + +text = "a photo of an astronaut riding a horse on mars" +ids = [49406, 320, 1125, 539, 550, 18376, 6765, 320, 4558, 525, 7496, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407] + +``` diff --git a/fast_tokenizer/examples/clip/cpp/demo.cc b/fast_tokenizer/examples/clip/cpp/demo.cc new file mode 100644 index 0000000000000000000000000000000000000000..0d7983b2ab55b0c14a46396f88ef51482d42b5ef --- /dev/null +++ b/fast_tokenizer/examples/clip/cpp/demo.cc @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h" +using namespace paddlenlp; + +template +std::ostream& operator<<(std::ostream& os, const std::vector vec) { + os << "["; + for (int i = 0; i < vec.size(); ++i) { + if (i == 0) { + os << vec[i]; + } else { + os << ", " << vec[i]; + } + } + os << "]"; + return os; +} + +fast_tokenizer::tokenizers_impl::ClipFastTokenizer CreateClipFastTokenizer( + const std::string& vocab_path, + const std::string& merge_path, + uint32_t max_length, + bool pad_to_max_length = true) { + fast_tokenizer::tokenizers_impl::ClipFastTokenizer tokenizer( + vocab_path, merge_path, max_length); + if (pad_to_max_length) { + tokenizer.EnablePadMethod(fast_tokenizer::core::RIGHT, + tokenizer.GetPadTokenId(), + 0, + tokenizer.GetPadToken(), + &max_length, + nullptr); + } + return tokenizer; +} + +int main() { + // 1. 
Define a clip fast tokenizer + auto tokenizer = CreateClipFastTokenizer("clip_vocab.json", + "clip_merges.txt", + /*max_length = */ 77, + /* pad_to_max_length = */ true); + // 2. Tokenize the input strings + std::vector encodings; + std::vector texts = { + "a photo of an astronaut riding a horse on mars"}; + tokenizer.EncodeBatchStrings(texts, &encodings); + + for (int i = 0; i < texts.size(); ++i) { + std::cout << "text = \"" << texts[i] << "\"" << std::endl; + std::cout << "ids = " << encodings[i].GetIds() << std::endl; + } + + return 0; +} diff --git a/fast_tokenizer/examples/clip/python/README.md b/fast_tokenizer/examples/clip/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/clip/python/demo.py b/fast_tokenizer/examples/clip/python/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/fast_tokenizer/examples/clip/python/demo.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/fast_tokenizer/examples/ernie-3.0/README.md b/fast_tokenizer/examples/ernie-3.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..db728533e676ca67382271ae705064f7b96e0119 --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/README.md @@ -0,0 +1,21 @@ +# ErnieFastTokenizer分词示例 + +FastTokenizer库在C++、Python端提供ErnieFastTokenizer接口,用户只需传入模型相应的词表即可调用该接口,完成高效分词操作。该接口底层使用`WordPiece`算法进行分词。针对`WordPiece`算法,FastTokenizer实现了"Fast WordPiece Tokenization"提出的基于`MinMaxMatch`的`FastWordPiece`算法。原有`WordPiece`算法的时间复杂度与序列长度为二次方关系,在对长文本进行分词操作时,时间开销比较大。而`FastWordPiece`算法通过`Aho–Corasick `算法将`WordPiece`算法的时间复杂度降低为与序列长度的线性关系,大大提升了分词效率。`ErnieFastTokenizer`除了支持ERNIE模型的分词以外,还支持其他基于`WordPiece`算法分词的模型,比如`BERT`, `TinyBert`等,详细的模型列表如下: + +## 支持的模型列表 + +- ERNIE +- BERT +- TinyBERT +- ERNIE Gram +- ERNIE ViL + +## 详细分词示例文档 + +[C++ 分词示例](./cpp/README.md) + +[Python 分词示例](./python/README.md) + +## 参考文献 + +- Xinying Song, Alex Salcianuet al. 
"Fast WordPiece Tokenization", EMNLP, 2021 diff --git a/fast_tokenizer/examples/ernie-3.0/cpp/CMakeLists.txt b/fast_tokenizer/examples/ernie-3.0/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..b7f349ea036e11adcb71d1f260461d20bde9839c --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/cpp/CMakeLists.txt @@ -0,0 +1,22 @@ +cmake_minimum_required(VERSION 3.10) +project(cpp_fast_tokenizer_demo CXX C) + +option(FAST_TOKENIZER_INSTALL_DIR "Path of downloaded fast_tokenizer sdk.") + +# Download ernie vocab for demo +set(ERNIE_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/ernie_vocab.txt) +if (EXISTS ${ERNIE_VOCAB_PATH}) + message(STATUS "The ${ERNIE_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt" ${ERNIE_VOCAB_PATH} SHOW_PROGRESS) + message(STATUS "Already download the vocab.txt of ernie to ${CMAKE_CURRENT_BINARY_DIR} for demo.") +endif() + +# Get FAST_TOKENIZER_INCS and FAST_TOKENIZER_LIBS +message(STATUS "The fast_tokenizer install dir: ${FAST_TOKENIZER_INSTALL_DIR}") +include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake) + +include_directories(${FAST_TOKENIZER_INCS}) + +add_executable(demo ${PROJECT_SOURCE_DIR}/demo.cc) +target_link_libraries(demo ${FAST_TOKENIZER_LIBS}) diff --git a/fast_tokenizer/examples/ernie-3.0/cpp/README.md b/fast_tokenizer/examples/ernie-3.0/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/ernie-3.0/cpp/demo.cc b/fast_tokenizer/examples/ernie-3.0/cpp/demo.cc new file mode 100644 index 0000000000000000000000000000000000000000..d886791d4d8a28f15b63d72f370b2aa38811fd05 --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/cpp/demo.cc @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +using namespace paddlenlp; + +int main() { + // 1. Define a ernie fast tokenizer + fast_tokenizer::tokenizers_impl::ErnieFastTokenizer tokenizer( + "ernie_vocab.txt"); + // 2. 
Tokenize the input strings + // case 1: tokenize a single string + std::cout << "case 1: Tokenize a single string" << std::endl; + fast_tokenizer::core::Encoding encoding; + std::string single_string = + "商赢环球股份有限公司关于延期回复上海证券交易所对" + "公司2017年年度报告的事后审核问询函的公告"; + tokenizer.EncodePairStrings(single_string, &encoding); + std::cout << encoding.DebugString() << std::endl; + + // case 2: tokenize a pair of strings + std::cout << "case 2: Tokenize a pair of strings" << std::endl; + std::string text = "蚂蚁借呗等额还款可以换成先息后本吗"; + std::string text_pair = "借呗有先息到期还本吗"; + + tokenizer.EncodePairStrings(text, text_pair, &encoding); + std::cout << encoding.DebugString() << std::endl; + + // case 3: Tokenize a batch of single strings + std::cout << "case 3: Tokenize a batch of single strings" << std::endl; + std::vector encodings; + std::vector strings_list = { + "通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理?", + "凌云研发的国产两轮电动车怎么样,有什么惊喜?", + "一辆车的寿命到底多长,最多可以开多久?"}; + tokenizer.EncodeBatchStrings(strings_list, &encodings); + for (auto&& encoding : encodings) { + std::cout << encoding.DebugString() << std::endl; + } + + // case 4: Tokenize a batch of pair strings + std::cout << "case 4: Tokenize a batch of pair strings" << std::endl; + std::vector texts = { + "花呗自动从余额宝扣款,需要我自己设置吗", + "这个蚂蚁花呗能恢复正常用不", + "在经济的另一次转变中,人们发现在低地农场饲养羔羊更具成本效益,部分原因" + "是那里有更丰富、更有营养的牧场,因此湖地农场的利润变得更少。"}; + std::vector text_pairs = { + "支付宝余额会自动还花呗吗", + "我的蚂蚁花呗 怎么用不了", + "人们发现,经济的另一个转变更有营养。"}; + tokenizer.EncodeBatchStrings(texts, text_pairs, &encodings); + for (auto&& encoding : encodings) { + std::cout << encoding.DebugString() << std::endl; + } + return 0; +} \ No newline at end of file diff --git a/fast_tokenizer/examples/ernie-3.0/python/README.md b/fast_tokenizer/examples/ernie-3.0/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/ernie-3.0/python/demo.py b/fast_tokenizer/examples/ernie-3.0/python/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..490604e7bcc76a224d7d8f0ef5d103a3f7bb9ce2 --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/python/demo.py @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
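+# This demo expects the ERNIE vocab file `ernie_vocab.txt` in the working directory
+# (e.g. https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt); it builds
+# an ErnieFastTokenizer from it and prints the fields of the resulting Encoding.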
+ +import fast_tokenizer +from fast_tokenizer import ErnieFastTokenizer, models + +fast_tokenizer.set_thread_num(1) +vocab = models.WordPiece.read_file("ernie_vocab.txt") +fast_tokenizer = ErnieFastTokenizer(vocab) +output = fast_tokenizer.encode("我爱中国") +print("ids: ", output.ids) +print("type_ids: ", output.type_ids) +print("tokens: ", output.tokens) +print("offsets: ", output.offsets) +print("attention_mask: ", output.attention_mask) diff --git a/fast_tokenizer/fast_tokenizer/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..a4269b2bbd640203181511cceec5a59ae83d157c --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/CMakeLists.txt @@ -0,0 +1,44 @@ +add_subdirectory(decoders) +add_subdirectory(models) +add_subdirectory(normalizers) +add_subdirectory(pretokenizers) +add_subdirectory(postprocessors) +add_subdirectory(core) +add_subdirectory(utils) +# set the relative path of shared library +if (NOT APPLE AND NOT WIN32) + set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-rpath='$ORIGIN'") +endif() + +if (WITH_PYTHON) + add_subdirectory(pybind) + cc_library(core_tokenizers SHARED + SRCS pybind/pybind.cc tokenizers/ernie_fast_tokenizer.cc + DEPS pybind python pybind_normalizers pybind_utils + pybind_pretokenizers pybind_models pybind_decoders + pybind_postprocessors pybind_tokenizers pybind_exception + pybind_core normalizers pretokenizers core models + tokenizer added_vocabulary postprocessors json) + set_target_properties(core_tokenizers PROPERTIES PREFIX "") + if (WIN32) + set_target_properties(core_tokenizers PROPERTIES SUFFIX ".pyd") + else() + set_target_properties(core_tokenizers PROPERTIES SUFFIX ".so") + endif() + + if (APPLE) + SET(CMAKE_INSTALL_RPATH "@loader_path/core_tokenizers.so") + endif() + +else(WITH_PYTHON) + cc_library(core_tokenizers SHARED + SRCS tokenizers/ernie_fast_tokenizer.cc tokenizers/clip_fast_tokenizer.cc + DEPS normalizers pretokenizers models decoders + postprocessors core added_vocabulary tokenizer json) + + if (APPLE) + SET(CMAKE_INSTALL_RPATH "@loader_path/lib/libcore_tokenizers.dylib") + endif() +endif(WITH_PYTHON) + +add_subdirectory(test) diff --git a/fast_tokenizer/fast_tokenizer/core/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/core/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2700f66a44018a824502a865449e23dd636b6f39 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/CMakeLists.txt @@ -0,0 +1,4 @@ +cc_library(added_vocabulary SRCS added_vocabulary.cc DEPS normalizers pretokenizers json) +cc_library(base SRCS base.cc DEPS json) +cc_library(tokenizer SRCS tokenizer.cc DEPS added_vocabulary json decoders trie models postprocessors base) +cc_library(core SRCS encoding.cc DEPS json base) diff --git a/fast_tokenizer/fast_tokenizer/core/added_vocabulary.cc b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.cc new file mode 100644 index 0000000000000000000000000000000000000000..bdb05fa136b82dd28031d4d211a0e2064c54af1e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.cc @@ -0,0 +1,424 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/models/model.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "glog/logging.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +inline bool StartWithWord(const std::string& sequence) { + static re2::RE2 pattern("^\\w"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline bool EndWithWord(const std::string& sequence) { + static re2::RE2 pattern("\\w$"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline bool StartWithSpace(const std::string& sequence) { + static re2::RE2 pattern("^\\s*"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline bool EndWithSpace(const std::string& sequence) { + static re2::RE2 pattern("\\s*$"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline size_t GetEndSpaceIdx(const std::string& sequence) { + static re2::RE2 pattern("\\s*$"); + re2::StringPiece result_str; + pattern.Match( + sequence, 0, sequence.length(), RE2::UNANCHORED, &result_str, 1); + return result_str.data() - sequence.data(); +} + +inline size_t GetStartSpaceIdx(const std::string& sequence) { + static re2::RE2 pattern("^\\s*"); + re2::StringPiece result_str; + pattern.Match( + sequence, 0, sequence.length(), RE2::UNANCHORED, &result_str, 1); + return result_str.data() + result_str.length() - sequence.data(); +} + +inline size_t GetLeftMostSpaceFromEnd(const std::string& sequence) { + if (EndWithSpace(sequence)) { + return GetEndSpaceIdx(sequence); + } + return sequence.length(); +} + +inline size_t GetRightMostSpaceFromStart(const std::string& sequence) { + if (StartWithSpace(sequence)) { + return GetStartSpaceIdx(sequence); + } + return 0; +} + +AddedToken::AddedToken() + : content_(""), + is_single_word_(false), + use_lstrip_(false), + use_rstrip_(false), + use_normalized_(true), + is_special_(false) {} + +AddedToken::AddedToken(const std::string& content, + bool is_special, + bool single_word, + bool lstrip, + bool rstrip) + : content_(content), + is_special_(is_special), + use_normalized_(!is_special), + is_single_word_(single_word), + use_lstrip_(lstrip), + use_rstrip_(rstrip) {} + +std::string AddedToken::GetContent() const { return content_; } + +bool AddedToken::GetIsSpecial() const { return is_special_; } + +bool AddedToken::GetUseNormalized() const { return use_normalized_; } + +void AddedToken::SetIsSingleWord(bool is_single_word) { + is_single_word_ = is_single_word; +} +bool AddedToken::GetUseLStrip() const { return use_lstrip_; } + +bool AddedToken::GetUseRStrip() const { return use_rstrip_; } + +bool AddedToken::GetIsSingleWord() const { return is_single_word_; } + +void AddedToken::SetContent(const std::string& content) { content_ = content; } + +void AddedToken::SetUseLStrip(bool use_lstrip) { use_lstrip_ = use_lstrip; } + +void AddedToken::SetUseRStrip(bool use_rstrip) { use_rstrip_ = use_rstrip; } + +void AddedToken::SetUseNormalized(bool use_normalized) { + use_normalized_ = use_normalized; +} + +void AddedToken::SetIsSpecial(bool is_special) { 
is_special_ = is_special; } + +bool AddedToken::operator==(const AddedToken& other) const { + return content_ == other.content_; +} + +AddedVocabulary::AddedVocabulary() + : split_trie_({std::make_shared(""), Vocab()}), + split_normalized_trie_({std::make_shared(""), Vocab()}) {} + +size_t AddedVocabulary::GetLen() const { return vocab_.size(); } + +core::Vocab AddedVocabulary::GetVocab() const { return vocab_; } +core::Vocab& AddedVocabulary::GetMutableVocab() { return vocab_; } + +bool AddedVocabulary::TokenToId(const std::string& token, + const models::Model& model, + uint32_t* id) const { + if (vocab_.find(token) != vocab_.end()) { + *id = vocab_.at(token); + return true; + } + return model.TokenToId(token, id); +} + +bool AddedVocabulary::IdToToken(uint32_t id, + const models::Model& model, + std::string* token) const { + if (vocab_reversed_.find(id) != vocab_reversed_.end()) { + *token = vocab_reversed_.at(id).GetContent(); + return true; + } + return model.IdToToken(id, token); +} + +bool AddedVocabulary::IsSpecialToken(const std::string& token) const { + return special_tokens_set_.find(token) != special_tokens_set_.end(); +} + +size_t AddedVocabulary::AddSpecialTokens( + const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers) { + return AddTokens(tokens, model, normalizers); +} + +size_t AddedVocabulary::AddTokens(const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers) { + for (const auto& token : tokens) { + if (token.GetIsSpecial() && !token.GetContent().empty() && + !IsSpecialToken(token.GetContent())) { + special_tokens_.push_back(token); + special_tokens_set_.insert(token.GetContent()); + } + } + int ignored_tokens_num = 0; + for (const auto& token : tokens) { + if (token.GetContent().empty()) { + ignored_tokens_num += 1; + continue; + } + uint32_t id; + if (TokenToId(token.GetContent(), model, &id)) { + ignored_tokens_num += 1; + } else { + uint32_t new_id = model.GetVocabSize() + GetLen(); + vocab_[token.GetContent()] = new_id; + if (special_tokens_set_.count(token.GetContent()) == 0) { + added_tokens_.push_back(token); + } + id = new_id; + } + vocab_reversed_[id] = token; + } + RefreshAddedTokens(model, normalizers); + return tokens.size() - ignored_tokens_num; +} +void AddedVocabulary::RefreshAddedTokens( + const models::Model& model, const normalizers::Normalizer* normalizers) { + using TokenAndId = std::pair; + std::vector normalized, non_normalized; + for (const auto& tokens : {special_tokens_, added_tokens_}) { + for (const auto& token : tokens) { + uint32_t id; + if (TokenToId(token.GetContent(), model, &id)) { + if (token.GetUseNormalized()) { + normalized.push_back({token, id}); + } else { + non_normalized.push_back({token, id}); + } + } + } + } + Vocab ids; + std::vector tokens; + for (const auto& token_ids : non_normalized) { + tokens.push_back(token_ids.first); + ids[token_ids.first.GetContent()] = token_ids.second; + } + // Create a regex pattern + std::string pattern(""); + for (int i = 0; i < tokens.size(); ++i) { + if (i > 0) { + pattern += "|"; + } + std::string pattern_str = ""; + for (const auto& ch : tokens[i].GetContent()) { + if (ch == '[' || ch == ']') { + pattern_str.append(1, '\\'); + } + pattern_str.append(1, ch); + } + pattern += "\(" + pattern_str + "\)"; + } + // Update split_trie_ + split_trie_.first = std::make_shared(pattern); + split_trie_.second = std::move(ids); + Vocab normalized_ids; + std::vector normalized_tokens; + for (const auto& token_ids : 
normalized) { + normalized_tokens.push_back(token_ids.first); + normalized_ids[token_ids.first.GetContent()] = token_ids.second; + } + + std::string normalized_pattern(""); + for (int i = 0; i < normalized_tokens.size(); ++i) { + normalizers::NormalizedString normalized_content( + normalized_tokens[i].GetContent()); + if (normalizers != nullptr) { + (*normalizers)(&normalized_content); + } + if (i > 0) { + normalized_pattern += "|"; + } + std::string pattern_str = ""; + for (const auto& ch : normalized_content.GetStr()) { + if (ch == '[' || ch == ']') { + pattern_str.append(1, '\\'); + } + pattern_str.append(1, ch); + } + normalized_pattern += "\(" + pattern_str + "\)"; + } + split_normalized_trie_.first = std::make_shared(normalized_pattern); + split_normalized_trie_.second = std::move(normalized_ids); +} + +bool AddedVocabulary::FindMatch(const std::string& sequence, + const MatchSet& pattern, + std::vector* results) const { + if (sequence.empty()) { + return false; + } + std::vector splits; + size_t start = 0; + size_t start_offset = 0; + size_t end = sequence.length(); + re2::StringPiece result_str; + VLOG(6) << "start = " << start << ", end = " << end + << ", sequence = " << sequence + << ", pattern: " << pattern.first->pattern(); + while (pattern.first->Match( + sequence, start, end, RE2::UNANCHORED, &result_str, 1) && + result_str != "") { + VLOG(6) << "result_str: " << result_str << ", " << pattern.first->pattern(); + size_t curr_start = result_str.data() - sequence.data(); + size_t curr_end = curr_start + result_str.length(); + uint32_t id = pattern.second.at(result_str.ToString()); + AddedToken added_tokens = vocab_reversed_.at(id); + VLOG(6) << "start = " << start << ", end = " << end + << ", curr_start = " << curr_start << ", curr_end = " << curr_end; + if (added_tokens.GetIsSingleWord()) { + bool start_space = + (curr_start == 0) || !EndWithWord(sequence.substr(0, curr_start)); + bool stop_space = (curr_end >= sequence.length()) || + !StartWithWord(sequence.substr(curr_end)); + if (!start_space || !stop_space) { + // Discard not single word + start = curr_end; + continue; + } + } + if (added_tokens.GetUseLStrip()) { + auto new_start = GetEndSpaceIdx(sequence.substr(0, curr_start)); + curr_start = std::max(new_start, start_offset); + } + if (added_tokens.GetUseRStrip()) { + curr_end += GetStartSpaceIdx(sequence.substr(curr_end)); + } + if (curr_start > start_offset) { + splits.push_back({0, false, {start_offset, curr_start}}); + } + splits.push_back({id, true, {curr_start, curr_end}}); + start = curr_end; + start_offset = curr_end; + } + if (start != sequence.length()) { + splits.push_back({0, false, {start, sequence.length()}}); + } + *results = std::move(splits); + return true; +} + +bool AddedVocabulary::SplitWithIndices( + const normalizers::NormalizedString& normalized, + const MatchSet& pattern, + std::vector* split_results) const { + std::vector match_results; + bool status = FindMatch(normalized.GetStr(), pattern, &match_results); + for (auto& match_result : match_results) { + normalizers::NormalizedString slice; + auto id = std::get<0>(match_result); + auto is_not_unk = std::get<1>(match_result); + auto offsets = std::get<2>(match_result); + normalized.Slice(offsets, &slice, false); + std::vector tokens; + if (is_not_unk) { + tokens.emplace_back(core::Token{id, slice.GetStr(), {0, slice.GetLen()}}); + } + // use push_back({slice, tokens}) will raise error in windows platform. 
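+    // so construct the StringSplit in place from (slice, tokens) via emplace_back instead.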
+ split_results->emplace_back(slice, tokens); + } + return status; +} + +void AddedVocabulary::ExtractAndNormalize( + const normalizers::Normalizer* normalizers, + const std::string& sequence, + pretokenizers::PreTokenizedString* pretokenized) const { + pretokenized->SetOriginalStr(sequence); + pretokenized->Split( + [&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + this->SplitWithIndices(*normalized, this->split_trie_, string_splits); + }); + pretokenized->Split( + [&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + if (normalizers != nullptr) { + (*normalizers)(normalized); + VLOG(6) << "After normalized: " << normalized->GetStr(); + this->SplitWithIndices( + *normalized, this->split_normalized_trie_, string_splits); + } + }); +} + +const std::unordered_map& +AddedVocabulary::GetAddedTokenVocabReversed() const { + return vocab_reversed_; +} + + +void to_json(nlohmann::json& j, const AddedTokenWithId& added_token) { + j = { + {"id", added_token.id_}, + {"content", added_token.added_token_.GetContent()}, + {"single_word", added_token.added_token_.GetIsSingleWord()}, + {"lstrip", added_token.added_token_.GetUseLStrip()}, + {"rstrip", added_token.added_token_.GetUseRStrip()}, + {"normalized", added_token.added_token_.GetUseNormalized()}, + {"special", added_token.added_token_.GetIsSpecial()}, + }; +} + +void from_json(const nlohmann::json& j, AddedTokenWithId& added_token) { + j.at("id").get_to(added_token.id_); + std::string content = j.at("content").get(); + added_token.added_token_.SetContent(content); + + bool single_word = j.at("single_word").get(); + added_token.added_token_.SetIsSingleWord(single_word); + + bool lstrip = j.at("lstrip").get(); + added_token.added_token_.SetUseLStrip(lstrip); + + bool rstrip = j.at("rstrip").get(); + added_token.added_token_.SetUseRStrip(rstrip); + + bool normalized = j.at("normalized").get(); + added_token.added_token_.SetUseNormalized(normalized); + + bool special = j.at("special").get(); + added_token.added_token_.SetIsSpecial(special); +} + +void to_json(nlohmann::json& j, const AddedVocabulary& added_vocab) { + nlohmann::json jarray = nlohmann::json::array(); + for (const auto& vocab_item : added_vocab.vocab_reversed_) { + AddedTokenWithId added_token_with_id; + added_token_with_id.id_ = vocab_item.first; + added_token_with_id.added_token_ = vocab_item.second; + nlohmann::json jo = added_token_with_id; + jarray.emplace_back(jo); + } + j = jarray; +} + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/added_vocabulary.h b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.h new file mode 100644 index 0000000000000000000000000000000000000000..a9b26f6778184c3174110fdf4344409726d2dc0e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.h @@ -0,0 +1,154 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include // For shared_ptr +#include +#include + +#include "fast_tokenizer/core/base.h" +#include "nlohmann/json.hpp" + +namespace re2 { +class RE2; +} // namespace re2 + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace normalizers { +class Normalizer; +class NormalizedString; +} // namespace normalizers + +namespace models { +class Model; +} // namespace models + +namespace pretokenizers { +class PreTokenizedString; +struct StringSplit; +} // namespace pretokenizers + +namespace core { + +using MatchSet = std::pair, Vocab>; +using MatchResult = std::tuple; + +bool StartWithWord(const std::string& sequence); +bool EndWithWord(const std::string& sequence); +bool StartWithSpace(const std::string& sequence); +bool EndWithSpace(const std::string& sequence); + +class FASTTOKENIZER_DECL AddedToken { +public: + AddedToken(); + AddedToken(const std::string& content, + bool is_special = false, + bool single_word = false, + bool lstrip = false, + bool rstrip = false); + void SetIsSingleWord(bool is_single_word); + void SetUseLStrip(bool use_lstrip); + void SetUseRStrip(bool use_rstrip); + void SetUseNormalized(bool use_normalized); + void SetContent(const std::string& content); + void SetIsSpecial(bool is_special); + std::string GetContent() const; + bool GetIsSpecial() const; + bool GetUseNormalized() const; + bool GetUseLStrip() const; + bool GetUseRStrip() const; + bool GetIsSingleWord() const; + bool operator==(const AddedToken& other) const; + +private: + std::string content_; + bool is_single_word_; + bool use_lstrip_; + bool use_rstrip_; + bool use_normalized_; + bool is_special_; + friend struct AddedTokenWithId; +}; + +struct FASTTOKENIZER_DECL AddedTokenWithId { + AddedToken added_token_; + uint32_t id_; + friend void to_json(nlohmann::json& j, const AddedTokenWithId& added_token); + friend void from_json(const nlohmann::json& j, AddedTokenWithId& added_token); +}; + +class FASTTOKENIZER_DECL AddedVocabulary { +public: + AddedVocabulary(); + size_t GetLen() const; + core::Vocab& GetMutableVocab(); + core::Vocab GetVocab() const; + bool TokenToId(const std::string& token, + const models::Model& model, + uint32_t* id) const; + bool IdToToken(uint32_t id, + const models::Model& model, + std::string* token) const; + bool IsSpecialToken(const std::string& token) const; + size_t AddSpecialTokens(const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers); + size_t AddTokens(const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers); + void RefreshAddedTokens(const models::Model& model, + const normalizers::Normalizer* normalizers); + bool FindMatch(const std::string& sequence, + const MatchSet& pattern, + std::vector* results) const; + bool SplitWithIndices( + const normalizers::NormalizedString& normalized, + const MatchSet& pattern, + std::vector* split_results) const; + void ExtractAndNormalize( + const normalizers::Normalizer* normalizers, + const std::string& sequence, + pretokenizers::PreTokenizedString* pretokenized) const; + const std::unordered_map& GetAddedTokenVocabReversed() + const; + +private: + core::Vocab vocab_; + std::unordered_map vocab_reversed_; + std::vector added_tokens_; + std::vector special_tokens_; + std::unordered_set special_tokens_set_; + MatchSet split_trie_; + MatchSet split_normalized_trie_; + friend void to_json(nlohmann::json& j, const AddedVocabulary& added_vocab); + friend void from_json(const nlohmann::json& j, AddedVocabulary& 
added_vocab); +}; + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp + +namespace std { +template <> +class hash { +public: + size_t operator()( + const paddlenlp::fast_tokenizer::core::AddedToken& added_token) const { + return std::hash()(added_token.GetContent()); + } +}; +} diff --git a/fast_tokenizer/fast_tokenizer/core/base.cc b/fast_tokenizer/fast_tokenizer/core/base.cc new file mode 100644 index 0000000000000000000000000000000000000000..2d0bdb23d0121e40188d45c4dcfa05789eaf006f --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/base.cc @@ -0,0 +1,53 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/base.h" + +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +static int fast_tokenizer_thread_num = 1; + +void SetThreadNum(int thread_num) { fast_tokenizer_thread_num = thread_num; } + +int GetThreadNum() { return fast_tokenizer_thread_num; } + +void RunMultiThread(std::function func, + size_t batch_size) { + int thread_num = GetThreadNum(); + if (thread_num == 1) { + // Note(zhoushunjie): No need to create threads when + // thread_num equals to 1. + func(0, batch_size); + } else { + std::vector vectorOfThread; + size_t start_index = 0; + size_t step_index = ceil(batch_size / float(thread_num)); + + for (size_t thread_index = 0; thread_index < thread_num; thread_index++) { + vectorOfThread.emplace_back(std::thread(func, start_index, step_index)); + start_index = start_index + step_index; + } + for (size_t thread_index = 0; thread_index < thread_num; thread_index++) { + vectorOfThread[thread_index].join(); + } + } +} + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/base.h b/fast_tokenizer/fast_tokenizer/core/base.h new file mode 100644 index 0000000000000000000000000000000000000000..21af2d912f7cbdd2285950300328f54d41b3cedc --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/base.h @@ -0,0 +1,378 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include +#include +#include + +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace std { +template <> +struct hash> { + size_t operator()(const std::pair& x) const { + size_t h1 = hash()(x.first); + size_t h2 = hash()(x.second); + return h1 ^ (h2 << 1); + } +}; +} + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +enum FASTTOKENIZER_DECL OffsetType { CHAR, BYTE }; +enum FASTTOKENIZER_DECL Direction { LEFT, RIGHT }; +enum FASTTOKENIZER_DECL TruncStrategy { + LONGEST_FIRST, + ONLY_FIRST, + ONLY_SECOND +}; +enum FASTTOKENIZER_DECL PadStrategy { BATCH_LONGEST, FIXED_SIZE }; + +enum FASTTOKENIZER_DECL SplitMode { + REMOVED, + ISOLATED, + MERGED_WITH_PREVIOUS, + MERGED_WITH_NEXT, + CONTIGUOUS +}; + +NLOHMANN_JSON_SERIALIZE_ENUM(OffsetType, + { + {CHAR, "CHAR"}, {BYTE, "BYTE"}, + }); + +NLOHMANN_JSON_SERIALIZE_ENUM(Direction, + { + {LEFT, "LEFT"}, {RIGHT, "RIGHT"}, + }); + +NLOHMANN_JSON_SERIALIZE_ENUM(TruncStrategy, + { + {LONGEST_FIRST, "LONGEST_FIRST"}, + {ONLY_FIRST, "ONLY_FIRST"}, + {ONLY_SECOND, "ONLY_SECOND"}, + }); + + +NLOHMANN_JSON_SERIALIZE_ENUM(PadStrategy, + { + {BATCH_LONGEST, "BATCH_LONGEST"}, + {FIXED_SIZE, "FIXED_SIZE"}, + }); + +struct FASTTOKENIZER_DECL TruncMethod { + Direction direction_; + size_t max_len_; + TruncStrategy strategy_; + size_t stride_; + TruncMethod() + : max_len_(512), + stride_(0), + strategy_(LONGEST_FIRST), + direction_(RIGHT) {} +}; + +struct FASTTOKENIZER_DECL PadMethod { + PadStrategy strategy_; + Direction direction_; + uint32_t pad_id_; + uint32_t pad_token_type_id_; + std::string pad_token_; + uint32_t pad_len_; + uint32_t pad_to_multiple_of_; + + PadMethod() + : strategy_(BATCH_LONGEST), + direction_(RIGHT), + pad_id_(0), + pad_token_type_id_(0), + pad_token_("[PAD]"), + pad_len_(0), + pad_to_multiple_of_(0) {} +}; + +inline void to_json(nlohmann::json& j, const TruncMethod& trunc_method) { + j = { + {"strategy", trunc_method.strategy_}, + {"direction", trunc_method.direction_}, + {"max_len", trunc_method.max_len_}, + {"stride", trunc_method.stride_}, + }; +} + +inline void from_json(const nlohmann::json& j, TruncMethod& trunc_method) { + j["strategy"].get_to(trunc_method.strategy_); + j["direction"].get_to(trunc_method.direction_); + j["max_len"].get_to(trunc_method.max_len_); + j["stride"].get_to(trunc_method.stride_); +} + + +inline void to_json(nlohmann::json& j, const PadMethod& pad_method) { + j = { + {"strategy", pad_method.strategy_}, + {"direction", pad_method.direction_}, + {"pad_id", pad_method.pad_id_}, + {"pad_token_type_id", pad_method.pad_token_type_id_}, + {"pad_token", pad_method.pad_token_}, + {"pad_len", pad_method.pad_len_}, + {"pad_to_multiple_of", pad_method.pad_to_multiple_of_}, + }; +} + +inline void from_json(const nlohmann::json& j, PadMethod& pad_method) { + j["strategy"].get_to(pad_method.strategy_); + j["direction"].get_to(pad_method.direction_); + j["pad_id"].get_to(pad_method.pad_id_); + j["pad_token_type_id"].get_to(pad_method.pad_token_type_id_); + j["pad_token"].get_to(pad_method.pad_token_); + j["pad_len"].get_to(pad_method.pad_len_); + j["pad_to_multiple_of"].get_to(pad_method.pad_to_multiple_of_); +} + +using Offset = std::pair; +using Range = std::pair; +using Vocab = std::unordered_map; +using VocabList = std::vector>; +using VocabReversed = std::unordered_map; +using SortedVocabReversed = std::map; +using Pair = std::pair; +using MergeMap = std::unordered_map>; +using Merges = std::vector>; + +inline void 
to_json(nlohmann::json& j, + const SortedVocabReversed& sorted_vocab_r) { + j = nlohmann::ordered_json(); + for (const auto& item : sorted_vocab_r) { + j[item.second] = item.first; + } +} + +struct FASTTOKENIZER_DECL Token { + uint32_t id_; + std::string value_; + Offset offset_; + Token() = default; + Token(uint32_t id, const std::string& value, const Offset& offset) + : id_(id), value_(value), offset_(offset) {} +}; + +struct FASTTOKENIZER_DECL Merge { + size_t pos_; + uint32_t rank_; + uint32_t new_id_; + + bool operator==(const Merge& other) const { + return pos_ == other.pos_ && rank_ == other.rank_; + } + bool operator<(const Merge& other) const { + // Used in priority queue + // The queue will output the Merge value + // in ascending order of rank_ + if (rank_ != other.rank_) { + return rank_ > other.rank_; + } + return pos_ > other.pos_; + } +}; + +struct FASTTOKENIZER_DECL Symbol { + uint32_t ch_; // symbol id + int prev_; + int next_; + size_t len_; + + Symbol() = default; + Symbol(uint32_t ch, int prev, int next, size_t len) + : ch_(ch), prev_(prev), next_(next), len_(len) {} + // Merges the current Symbol with the other one. + // In order to update prev/next, we consider Self to be the Symbol on the + // left, + // and other to be the next one on the right. + void MergeWith(const Symbol& other, uint32_t ch) { + ch_ = ch; + next_ = other.next_; + len_ += other.len_; + } +}; + +struct FASTTOKENIZER_DECL BPEWord { + BPEWord() = default; + BPEWord(size_t capacity) { Reserve(capacity); } + void Reserve(size_t capacity) { symbols_.reserve(capacity); } + void Add(uint32_t ch, size_t byte_len) { + int len = symbols_.size(); + int next = -1; + int prev = -1; + if (len >= 1) { + symbols_.back().next_ = len; + prev = len - 1; + } + symbols_.emplace_back(ch, prev, next, byte_len); + } + + void Merge(uint32_t c1, + uint32_t c2, + uint32_t replacement, + std::vector>* changes) { + for (int i = 0; i < symbols_.size(); ++i) { + // Found a byte pair + if (symbols_[i].ch_ == c1 && i + 1 < symbols_.size() && + symbols_[i + 1].ch_ == c2) { + auto& first = symbols_[i]; + auto& second = symbols_[i + 1]; + // If there are other characters before the pair + if (i > 0) { + changes->push_back({{symbols_[i - 1].ch_, first.ch_}, -1}); + changes->push_back({{symbols_[i - 1].ch_, replacement}, 1}); + } + Symbol symbols{ + replacement, first.prev_, second.next_, first.len_ + second.len_}; + symbols_.insert(symbols_.begin() + i, symbols); + symbols_.erase(symbols_.begin() + i + 1, symbols_.begin() + i + 3); + if (i + 1 < symbols_.size()) { + changes->push_back({{second.ch_, symbols_[i + 1].ch_}, -1}); + changes->push_back({{replacement, symbols_[i + 1].ch_}, 1}); + } + } + } + } + + void MergeAll(const MergeMap& merges, const std::vector& dropout) { + std::priority_queue queue; + std::vector skip; + skip.reserve(symbols_.size()); + for (int i = 0; i < symbols_.size() - 1; ++i) { + auto& first = symbols_[i]; + auto& second = symbols_[i + 1]; + if (merges.find({first.ch_, second.ch_}) != merges.end()) { + auto new_merge_info = merges.at({first.ch_, second.ch_}); + core::Merge new_merge{static_cast(i), + new_merge_info.first, + new_merge_info.second}; + queue.push(new_merge); + } + } + std::random_device + rd; // Will be used to obtain a seed for the random number engine + std::mt19937 gen(rd()); // Standard mersenne_twister_engine seeded with + // rd() + std::uniform_real_distribution distrib(0.0, 1.0); + bool can_skip = (dropout.size() > 0); + while (!queue.empty()) { + // Can't use reference there, because 
the pop operation will change the + // top value + auto top = queue.top(); + queue.pop(); + if (can_skip && distrib(gen) < dropout[0]) { + // May dropout some merges + skip.push_back(top); + } else { + for (auto& skip_merge : skip) { + queue.push(skip_merge); + } + skip.clear(); + if (symbols_[top.pos_].len_ == 0) { + continue; + } + if (symbols_[top.pos_].next_ == -1) { + continue; + } + size_t next_pos = symbols_[top.pos_].next_; + auto& right = symbols_[next_pos]; + // Make sure we are not processing an expired queue entry + auto target_new_pair = Pair{symbols_[top.pos_].ch_, right.ch_}; + if (merges.find(target_new_pair) == merges.end() || + merges.at(target_new_pair).second != top.new_id_) { + continue; + } + // Otherwise, let's merge + symbols_[top.pos_].MergeWith(right, top.new_id_); + // Tag the right part as removed + symbols_[next_pos].len_ = 0; + // Update `prev` on the new `next` to the current pos + if (right.next_ > -1 && (right.next_ < symbols_.size())) { + symbols_[right.next_].prev_ = top.pos_; + } + // Insert the new pair formed with the previous symbol + auto& current = symbols_[top.pos_]; + if (current.prev_ >= 0) { + auto prev = current.prev_; + auto& prev_symbol = symbols_[prev]; + auto new_pair = Pair{prev_symbol.ch_, current.ch_}; + if (merges.find(new_pair) != merges.end()) { + auto new_merge = merges.at(new_pair); + queue.push({static_cast(current.prev_), + new_merge.first, + new_merge.second}); + } + } + + // Insert the new pair formed with the next symbol + size_t next = current.next_; + if (next < symbols_.size()) { + auto& next_symbol = symbols_[next]; + auto next_pair = Pair{current.ch_, next_symbol.ch_}; + if (merges.find(next_pair) != merges.end()) { + auto new_merge = merges.at(next_pair); + queue.push({top.pos_, new_merge.first, new_merge.second}); + } + } + } + } + symbols_.erase( + std::remove_if(symbols_.begin(), + symbols_.end(), + [](const Symbol& symbol) { return symbol.len_ == 0; }), + symbols_.end()); + } + + void GetChars(std::vector* result) const { + result->reserve(symbols_.size()); + for (const auto& symbol : symbols_) { + result->emplace_back(symbol.ch_); + } + } + + void GetOffset(std::vector* result) const { + result->reserve(symbols_.size()); + uint32_t pos = 0; + for (const auto& symbol : symbols_) { + result->emplace_back(pos, pos + symbol.len_); + pos += symbol.len_; + } + } + + std::vector symbols_; +}; + +FASTTOKENIZER_DECL void SetThreadNum(int thread_num); + +FASTTOKENIZER_DECL int GetThreadNum(); + +FASTTOKENIZER_DECL void RunMultiThread(std::function func, + size_t batch_size); + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/encoding.cc b/fast_tokenizer/fast_tokenizer/core/encoding.cc new file mode 100644 index 0000000000000000000000000000000000000000..379d21df931b9ef7803ccc23f6d6321e09cfc702 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/encoding.cc @@ -0,0 +1,669 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/encoding.h" +#include +#include +#include +#include +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +Encoding::Encoding(const std::vector& ids, + const std::vector& type_ids, + const std::vector& tokens, + const std::vector& words_idx, + const std::vector& offsets, + const std::vector& special_tokens_mask, + const std::vector& attention_mask, + const std::vector& overflowing, + const std::unordered_map& sequence_ranges) + : ids_(ids), + type_ids_(type_ids), + tokens_(tokens), + words_idx_(words_idx), + offsets_(offsets), + special_tokens_mask_(special_tokens_mask), + attention_mask_(attention_mask), + overflowing_(overflowing), + sequence_ranges_(sequence_ranges) {} +// Move version +Encoding::Encoding(std::vector&& ids, + std::vector&& type_ids, + std::vector&& tokens, + std::vector&& words_idx, + std::vector&& offsets, + std::vector&& special_tokens_mask, + std::vector&& attention_mask, + std::vector&& overflowing, + std::unordered_map&& sequence_ranges) + : ids_(std::move(ids)), + type_ids_(std::move(type_ids)), + tokens_(std::move(tokens)), + words_idx_(std::move(words_idx)), + offsets_(std::move(offsets)), + special_tokens_mask_(std::move(special_tokens_mask)), + attention_mask_(std::move(attention_mask)), + overflowing_(std::move(overflowing)), + sequence_ranges_(std::move(sequence_ranges)) {} + +Encoding::Encoding(uint32_t capacity) { + ids_.reserve(capacity); + type_ids_.reserve(capacity); + tokens_.reserve(capacity); + words_idx_.reserve(capacity); + offsets_.reserve(capacity); + special_tokens_mask_.reserve(capacity); + attention_mask_.reserve(capacity); +} + +Encoding::Encoding(const std::vector& tokens, uint32_t type_id) + : type_ids_(tokens.size(), type_id), + words_idx_(tokens.size(), std::numeric_limits::max()), + attention_mask_(tokens.size(), 1), + special_tokens_mask_(tokens.size(), 0) { + auto length = tokens.size(); + ids_.reserve(length); + offsets_.reserve(length); + tokens_.reserve(length); + for (const auto& token : tokens) { + ids_.push_back(token.id_); + tokens_.push_back(token.value_); + offsets_.push_back(token.offset_); + } +} + +Encoding::Encoding(Encoding&& other) + : ids_(std::move(other.ids_)), + type_ids_(std::move(other.type_ids_)), + tokens_(std::move(other.tokens_)), + words_idx_(std::move(other.words_idx_)), + offsets_(std::move(other.offsets_)), + special_tokens_mask_(std::move(other.special_tokens_mask_)), + attention_mask_(std::move(other.attention_mask_)), + overflowing_(std::move(other.overflowing_)), + sequence_ranges_(std::move(other.sequence_ranges_)) {} + +Encoding& Encoding::operator=(Encoding&& other) { + ids_ = std::move(other.ids_); + type_ids_ = std::move(other.type_ids_); + tokens_ = std::move(other.tokens_); + words_idx_ = std::move(other.words_idx_); + offsets_ = std::move(other.offsets_); + special_tokens_mask_ = std::move(other.special_tokens_mask_); + attention_mask_ = std::move(other.attention_mask_); + overflowing_ = std::move(other.overflowing_); + sequence_ranges_ = std::move(other.sequence_ranges_); + return *this; +} + +bool Encoding::IsEmpty() const { return ids_.empty(); } + +int Encoding::GetLen() const { return ids_.size(); } + +int Encoding::GetNumSequence() const { + if (sequence_ranges_.empty()) { + return 1; + } + return sequence_ranges_.size(); +} + +void Encoding::SetSequenceIds(uint32_t seq_ids) { + sequence_ranges_[seq_ids] = {0, 
GetLen()}; +} + +const std::vector& Encoding::GetTokens() const { return tokens_; } + +const std::vector& Encoding::GetWordsIdx() const { + return words_idx_; +} + +std::vector& Encoding::GetMutableWordsIdx() { return words_idx_; } + +std::vector Encoding::GetSequenceIds() const { + std::vector sequences(GetLen()); + for (uint32_t seq_id = 0; seq_id < GetNumSequence(); ++seq_id) { + Range range = sequence_ranges_.at(seq_id); + auto seq_len = range.second - range.first; + for (int i = range.first; i < range.second; ++i) { + sequences[i] = seq_id; + } + } + return sequences; +} + +const std::vector& Encoding::GetIds() const { return ids_; } + +const std::vector& Encoding::GetTypeIds() const { return type_ids_; } + +const std::vector& Encoding::GetOffsets() const { return offsets_; } + +std::vector& Encoding::GetMutableOffsets() { return offsets_; } + +const std::vector& Encoding::GetSpecialTokensMask() const { + return special_tokens_mask_; +} + +const std::vector& Encoding::GetAttentionMask() const { + return attention_mask_; +} + +const std::vector& Encoding::GetOverflowing() const { + return overflowing_; +} + +std::vector& Encoding::GetMutableOverflowing() { + return overflowing_; +} + +Range Encoding::GetSequenceRange(uint32_t seq_id) const { + return sequence_ranges_.at(seq_id); +} + +void Encoding::ProcessTokenWithOffsets( + std::function + process_token_fn) { + auto length = GetLen(); + for (int i = 0; i < length; ++i) { + process_token_fn(i, tokens_[i], &offsets_[i]); + } +} + +std::vector Encoding::TokenIdxToSequenceIds( + uint32_t token_idx) const { + std::vector seq_ids; + if (token_idx < GetLen()) { + if (sequence_ranges_.empty()) { + seq_ids.push_back(0); + } else { + for (auto iter = sequence_ranges_.begin(); iter != sequence_ranges_.end(); + ++iter) { + if (token_idx >= iter->second.first && + token_idx < iter->second.second) { + seq_ids.push_back(iter->first); + break; + } + } + } + } + return seq_ids; +} + +std::vector Encoding::WordIdxToTokensIdx(uint32_t word_idx, + uint32_t seq_id) const { + auto seq_range = sequence_ranges_.at(seq_id); + std::vector ranges; + int start = -1; + int end = -1; + for (uint32_t i = seq_range.first; i < seq_range.second; ++i) { + // -1 is the word index of special token + if (words_idx_[i] > word_idx && + words_idx_[i] != std::numeric_limits::max()) { + break; + } + if (words_idx_[i] == word_idx) { + if (start < 0 || i < start) { + start = i; + } + if (end < 0 || i >= end) { + end = i + 1; + } + } + } + if (start >= 0 && end >= 0) { + seq_range.first += start; + seq_range.second += end; + ranges.push_back(seq_range); + } + return ranges; +} + +std::vector Encoding::WordIdxToCharOffsets(uint32_t word_idx, + uint32_t seq_id) const { + std::vector offsets; + std::vector ranges = WordIdxToTokensIdx(word_idx, seq_id); + if (ranges.size() > 0) { + auto start = ranges[0].first; + auto end = ranges[0].second; + if (end > 0) { + offsets.push_back({offsets_[start].first, offsets_[end - 1].second}); + } + } + return offsets; +} + +std::vector> Encoding::TokenIdxToCharOffsets( + uint32_t token_idx) const { + std::vector> results; + auto seq_ids = TokenIdxToSequenceIds(token_idx); + if (seq_ids.size() > 0) { + results.push_back({seq_ids[0], offsets_[token_idx]}); + } + return results; +} + +std::vector> Encoding::TokenIdxToWordIdx( + uint32_t token_idx) const { + std::vector> results; + auto seq_ids = TokenIdxToSequenceIds(token_idx); + if (seq_ids.size() > 0) { + results.push_back({seq_ids[0], words_idx_[token_idx]}); + } + return results; +} + 
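+// Maps a character position inside sequence `seq_id` back to the index of the
+// token whose offset range covers it; the result is empty when the position
+// does not fall inside any token's offsets (e.g. padding tokens, whose offsets
+// are (0, 0)).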
+std::vector Encoding::CharOffsetsToTokenIdx(uint32_t char_pos, + uint32_t seq_id) const { + std::vector token_idx; + auto seq_range = sequence_ranges_.at(seq_id); + for (int i = seq_range.first; i < seq_range.second; ++i) { + if (char_pos >= offsets_[i].first && char_pos < offsets_[i].second) { + token_idx.push_back(i); + break; + } + } + return token_idx; +} + +std::vector Encoding::CharOffsetsToWordIdx(uint32_t char_pos, + uint32_t seq_id) const { + std::vector token_idx = CharOffsetsToTokenIdx(char_pos, seq_id); + std::vector word_idx; + if (token_idx.size() > 0) { + auto words_idx = TokenIdxToWordIdx(token_idx[0]); + if (words_idx.size() > 0) { + word_idx.push_back(words_idx[0].second); + } + } + return word_idx; +} + +void Encoding::Truncate(size_t max_len, size_t stride, Direction direction) { + size_t encoding_len = ids_.size(); + if (max_len < encoding_len) { + if (max_len == 0) { + *this = Encoding(0); + overflowing_.push_back(*this); + return; + } + assert(stride < max_len); + sequence_ranges_.clear(); + + size_t step_len = max_len - stride; + bool found_end = false; + std::vector part_ranges; + // Get PartRanges + if (direction == RIGHT) { + for (size_t start = 0; start < encoding_len && !found_end; + start += step_len) { + size_t stop = std::min(start + max_len, encoding_len); + found_end = (stop == encoding_len); + part_ranges.push_back({start, stop}); + } + } else { + for (size_t i = 0; i < encoding_len; i += step_len) { + size_t stop = encoding_len - i; + size_t start = (stop < max_len) ? 0 : stop - max_len; + if (start < stop && !found_end) { + found_end = (start == 0); + part_ranges.push_back({start, stop}); + } else { + break; + } + } + } + // Create new encoding + auto new_encoding_len = part_ranges[0].second - part_ranges[0].first; + Encoding new_encoding( + std::vector(ids_.begin(), ids_.begin() + new_encoding_len), + std::vector(type_ids_.begin(), + type_ids_.begin() + new_encoding_len), + std::vector(tokens_.begin(), + tokens_.begin() + new_encoding_len), + std::vector(words_idx_.begin(), + words_idx_.begin() + new_encoding_len), + std::vector(offsets_.begin(), + offsets_.begin() + new_encoding_len), + std::vector(special_tokens_mask_.begin(), + special_tokens_mask_.begin() + new_encoding_len), + std::vector(attention_mask_.begin(), + attention_mask_.begin() + new_encoding_len), + std::vector(), + std::unordered_map()); + // Set overflowing + for (size_t i = 1; i < part_ranges.size() - 1; ++i) { + auto start = part_ranges[i].first; + auto end = part_ranges[i].second; + new_encoding.overflowing_.emplace_back(Encoding( + std::vector(ids_.begin() + start, ids_.begin() + end), + std::vector(type_ids_.begin() + start, + type_ids_.begin() + end), + std::vector(tokens_.begin() + start, + tokens_.begin() + end), + std::vector(words_idx_.begin() + start, + words_idx_.begin() + end), + std::vector(offsets_.begin() + start, offsets_.begin() + end), + std::vector(special_tokens_mask_.begin() + start, + special_tokens_mask_.begin() + end), + std::vector(attention_mask_.begin() + start, + attention_mask_.begin() + end), + std::vector(), + std::unordered_map())); + } + *this = std::move(new_encoding); + } +} + + +void Encoding::MergeWith(const Encoding& pair, bool growing_offsets) { + std::vector overflowings; + + for (const auto& this_o : overflowing_) { + auto n_encoding = this_o; + n_encoding.MergeWith(pair, growing_offsets); + overflowings.emplace_back(n_encoding); + for (const auto& pair_o : pair.overflowing_) { + auto n_encoding = this_o; + n_encoding.MergeWith(pair_o, 
growing_offsets); + overflowings.emplace_back(n_encoding); + } + } + for (const auto& pair_o : pair.overflowing_) { + auto n_encoding = *this; + n_encoding.MergeWith(pair_o, growing_offsets); + overflowings.emplace_back(n_encoding); + } + + auto orignal_len = GetLen(); + for (const auto& pair_seq_range : pair.sequence_ranges_) { + sequence_ranges_.insert({pair_seq_range.first, + {pair_seq_range.second.first + orignal_len, + pair_seq_range.second.second + orignal_len}}); + } +#define EXTEND_VECTOR(member) \ + member.insert(member.end(), pair.member.begin(), pair.member.end()) + EXTEND_VECTOR(ids_); + EXTEND_VECTOR(type_ids_); + EXTEND_VECTOR(tokens_); + EXTEND_VECTOR(words_idx_); + EXTEND_VECTOR(special_tokens_mask_); + EXTEND_VECTOR(attention_mask_); +#undef EXTEND_VECTOR + // Setting offet + uint32_t starting_offset = 0; + if (growing_offsets && offsets_.size() > 0) { + starting_offset = offsets_.back().second; + } + for (const auto& pair_offset : pair.offsets_) { + offsets_.push_back({pair_offset.first + starting_offset, + pair_offset.second + starting_offset}); + } + + overflowing_ = std::move(overflowings); +} + +void Encoding::Pad(uint32_t target_length, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + Direction direction) { + for (auto& overflowing : overflowing_) { + overflowing.Pad(target_length, pad_id, pad_type_id, pad_token, direction); + } + // Need to be padded in this situation + if (GetLen() < target_length) { + auto pad_len = target_length - GetLen(); + if (direction == LEFT) { + ids_.insert(ids_.begin(), pad_len, pad_id); + type_ids_.insert(type_ids_.begin(), pad_len, pad_type_id); + tokens_.insert(tokens_.begin(), pad_len, pad_token); + words_idx_.insert( + words_idx_.begin(), pad_len, std::numeric_limits::max()); + attention_mask_.insert(attention_mask_.begin(), pad_len, 0); + special_tokens_mask_.insert(special_tokens_mask_.begin(), pad_len, 1); + offsets_.insert(offsets_.begin(), pad_len, {0, 0}); + } else { + ids_.insert(ids_.end(), pad_len, pad_id); + type_ids_.insert(type_ids_.end(), pad_len, pad_type_id); + tokens_.insert(tokens_.end(), pad_len, pad_token); + words_idx_.insert( + words_idx_.end(), pad_len, std::numeric_limits::max()); + attention_mask_.insert(attention_mask_.end(), pad_len, 0); + special_tokens_mask_.insert(special_tokens_mask_.end(), pad_len, 1); + offsets_.insert(offsets_.end(), pad_len, {0, 0}); + } + } +} + +// Static method +Encoding Encoding::Merge(const std::vector& encodings, + bool growing_offsets) { + Encoding merged_encoding; + for (auto& encoding : encodings) { + merged_encoding.MergeWith(encoding, growing_offsets); + } + return merged_encoding; +} + +void Encoding::SetTypeIds(const std::vector& type_ids) { + type_ids_ = type_ids; +} + +bool Encoding::operator==(const Encoding& other) const { + if (overflowing_.size() != other.overflowing_.size()) { + return false; + } + for (int i = 0; i < overflowing_.size(); ++i) { + if (!(overflowing_[i] == other.overflowing_[i])) { + return false; + } + } + return ids_ == other.ids_ && type_ids_ == other.type_ids_ && + tokens_ == other.tokens_ && words_idx_ == other.words_idx_ && + offsets_ == other.offsets_ && + special_tokens_mask_ == other.special_tokens_mask_ && + attention_mask_ == other.attention_mask_ && + sequence_ranges_ == other.sequence_ranges_; +} + +std::string Encoding::DebugString() const { + std::ostringstream oss; + oss << "The Encoding content: \n"; + oss << "ids: "; + for (int i = 0; i < ids_.size(); ++i) { + oss << ids_[i]; + if (i < ids_.size() - 1) 
{ + oss << ", "; + } + } + oss << "\n"; + + oss << "type_ids: "; + for (int i = 0; i < type_ids_.size(); ++i) { + oss << type_ids_[i]; + if (i < type_ids_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "tokens: "; + for (int i = 0; i < tokens_.size(); ++i) { + oss << tokens_[i]; + if (i < tokens_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "offsets: "; + for (int i = 0; i < offsets_.size(); ++i) { + oss << "(" << offsets_[i].first << ", " << offsets_[i].second << ")"; + if (i < offsets_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "special_tokens_mask: "; + for (int i = 0; i < special_tokens_mask_.size(); ++i) { + oss << special_tokens_mask_[i]; + if (i < special_tokens_mask_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "attention_mask: "; + for (int i = 0; i < attention_mask_.size(); ++i) { + oss << attention_mask_[i]; + if (i < attention_mask_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "sequence_ranges: "; + for (auto iter = sequence_ranges_.begin(); iter != sequence_ranges_.end(); + ++iter) { + oss << "{" << iter->first << " : (" << iter->second.first << ", " + << iter->second.second << ") }, "; + } + return oss.str(); +} + + +bool TruncateEncodings(Encoding* encoding, + Encoding* pair_encoding, + const TruncMethod& method) { + if (method.max_len_ == 0) { + encoding->Truncate(0, method.stride_, method.direction_); + if (pair_encoding != nullptr) { + pair_encoding->Truncate(0, method.stride_, method.direction_); + } + return true; + } + size_t total_length = encoding->GetIds().size(); + if (pair_encoding != nullptr) { + total_length += pair_encoding->GetIds().size(); + } + if (total_length <= method.max_len_) { + return true; + } + auto num_of_removed_ids = total_length - method.max_len_; + + if (method.strategy_ == TruncStrategy::LONGEST_FIRST) { + if (pair_encoding == nullptr) { + encoding->Truncate(method.max_len_, method.stride_, method.direction_); + } else { + auto encoding_len = encoding->GetIds().size(); + auto pair_encoding_len = pair_encoding->GetIds().size(); + bool has_swapped = false; + if (encoding_len > pair_encoding_len) { + std::swap(encoding_len, pair_encoding_len); + has_swapped = true; + } + if (encoding_len > method.max_len_) { + pair_encoding_len = encoding_len; + } else { + pair_encoding_len = + std::max(method.max_len_ - encoding_len, encoding_len); + } + if (pair_encoding_len + encoding_len > method.max_len_) { + // In this case, make sure the encoding_len is larger than + // pair_encoding_len + encoding_len = method.max_len_ / 2; + pair_encoding_len = encoding_len + method.max_len_ % 2; + } + if (has_swapped) { + std::swap(encoding_len, pair_encoding_len); + } + encoding->Truncate(encoding_len, method.stride_, method.direction_); + pair_encoding->Truncate( + pair_encoding_len, method.stride_, method.direction_); + } + } else { + // TruncStrategy::ONLY_FIRST or TruncStrategy::ONLY_SECOND + Encoding* result = nullptr; + if (method.strategy_ == TruncStrategy::ONLY_FIRST) { + result = encoding; + } else if (method.strategy_ == TruncStrategy::ONLY_SECOND) { + if (pair_encoding == nullptr) { + // Can't truncate the pair text when it doesn't exist + return false; + } + result = pair_encoding; + } + if (result->GetIds().size() > num_of_removed_ids) { + result->Truncate(result->GetIds().size() - num_of_removed_ids, + method.stride_, + method.direction_); + } else { + // Target sequence is too short to be truncated. 
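+      // It holds no more ids than the number that must be removed, so give up.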
+ return false; + } + } + return true; +} + +void MultiThreadPadEncodings(std::vector* encodings, + const PadMethod& method, + size_t pad_length, + size_t start_index, + size_t step_index) { + auto batch_size = encodings->size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + auto& encoding = (*encodings)[i]; + encoding.Pad(pad_length, + method.pad_id_, + method.pad_token_type_id_, + method.pad_token_, + method.direction_); + } +} +void PadEncodings(std::vector* encodings, const PadMethod& method) { + if (encodings == nullptr || encodings->empty()) { + return; + } + size_t pad_length = 0; + if (method.strategy_ == PadStrategy::BATCH_LONGEST) { + for (const auto& encoding : *encodings) { + pad_length = std::max(encoding.GetIds().size(), pad_length); + } + } else { + pad_length = method.pad_len_; + } + if (method.pad_to_multiple_of_ > 0 && + pad_length % method.pad_to_multiple_of_) { + pad_length += pad_length - pad_length % method.pad_to_multiple_of_; + } + auto batch_size = encodings->size(); + auto func = std::bind(&MultiThreadPadEncodings, + encodings, + std::ref(method), + pad_length, + std::placeholders::_1, + std::placeholders::_2); + RunMultiThread(func, batch_size); +} + + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/encoding.h b/fast_tokenizer/fast_tokenizer/core/encoding.h new file mode 100644 index 0000000000000000000000000000000000000000..5a9d3a41b714a044164ce4441b064d96b6f5aea6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/encoding.h @@ -0,0 +1,135 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" + +#include +#include +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +class FASTTOKENIZER_DECL Encoding { +public: + Encoding() = default; + Encoding(const std::vector& ids, + const std::vector& type_ids, + const std::vector& tokens, + const std::vector& words_idx, + const std::vector& offsets, + const std::vector& special_tokens_mask, + const std::vector& attention_mask, + const std::vector& overflowing, + const std::unordered_map& sequence_ranges); + // Move version + Encoding(std::vector&& ids, + std::vector&& type_ids, + std::vector&& tokens, + std::vector&& words_idx, + std::vector&& offsets, + std::vector&& special_tokens_mask, + std::vector&& attention_mask, + std::vector&& overflowing, + std::unordered_map&& sequence_ranges); + Encoding(uint32_t size); + Encoding(const std::vector& tokens, uint32_t type_id); + + Encoding(Encoding&&); + Encoding(const Encoding&) = default; + Encoding& operator=(Encoding&&); + Encoding& operator=(const Encoding&) = default; + + bool IsEmpty() const; + void SetSequenceIds(uint32_t seq_ids); + + // Getter + int GetLen() const; + int GetNumSequence() const; + const std::vector& GetTokens() const; + const std::vector& GetWordsIdx() const; + std::vector& GetMutableWordsIdx(); + std::vector GetSequenceIds() const; + const std::vector& GetIds() const; + const std::vector& GetTypeIds() const; + const std::vector& GetOffsets() const; + std::vector& GetMutableOffsets(); + const std::vector& GetSpecialTokensMask() const; + const std::vector& GetAttentionMask() const; + const std::vector& GetOverflowing() const; + std::vector& GetMutableOverflowing(); + Range GetSequenceRange(uint32_t seq_id) const; + + void ProcessTokenWithOffsets( + std::function + process_token_fn); + + // token_idx: The index of token in the sequence + std::vector TokenIdxToSequenceIds(uint32_t token_idx) const; + std::vector WordIdxToTokensIdx(uint32_t word_idx, + uint32_t seq_id) const; + std::vector WordIdxToCharOffsets(uint32_t word_idx, + uint32_t seq_id) const; + std::vector> TokenIdxToCharOffsets( + uint32_t token_idx) const; + std::vector> TokenIdxToWordIdx( + uint32_t token_idx) const; + std::vector CharOffsetsToTokenIdx(uint32_t char_pos, + uint32_t seq_id) const; + std::vector CharOffsetsToWordIdx(uint32_t char_pos, + uint32_t seq_id) const; + void Truncate(size_t max_len, size_t stride, Direction direction); + void MergeWith(const Encoding& pair, bool growing_offsets); + void Pad(uint32_t target_length, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + Direction direction); + // Static method + static Encoding Merge(const std::vector& encodings, + bool growing_offsets); + std::string DebugString() const; + void SetTypeIds(const std::vector& type_ids); + bool operator==(const Encoding& other) const; + +private: + std::vector ids_; + std::vector type_ids_; + std::vector tokens_; + std::vector words_idx_; + std::vector offsets_; + std::vector special_tokens_mask_; + std::vector attention_mask_; + std::vector overflowing_; + std::unordered_map sequence_ranges_; +}; + +bool FASTTOKENIZER_DECL TruncateEncodings(Encoding* encoding, + Encoding* pair_encoding, + const TruncMethod& method); +void FASTTOKENIZER_DECL PadEncodings(std::vector* encoding, + const PadMethod& method); + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git 
a/fast_tokenizer/fast_tokenizer/core/tokenizer.cc b/fast_tokenizer/fast_tokenizer/core/tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..2b907b56936ebb5821f9796d961d52d8bf1b5f6a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/tokenizer.cc @@ -0,0 +1,876 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/tokenizer.h" + +#include + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/decoders/decoders.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +normalizers::Normalizer* Tokenizer::GetNormalizerPtr() const { + return normalizer_.get(); +} + +void Tokenizer::ReleaseNormaizer() { normalizer_ = nullptr; } + +pretokenizers::PreTokenizer* Tokenizer::GetPreTokenizer() const { + return pretokenizer_.get(); +} + +void Tokenizer::ReleasePreTokenizer() { pretokenizer_ = nullptr; } + +void Tokenizer::SetTruncMethod(const TruncMethod& trunc_method) { + trunc_method_ = trunc_method; +} + +void Tokenizer::EnableTruncMethod(size_t max_len, + size_t stride, + Direction direction, + TruncStrategy strategy) { + use_truncation_ = true; + trunc_method_.direction_ = direction; + trunc_method_.max_len_ = max_len; + trunc_method_.strategy_ = strategy; + trunc_method_.stride_ = stride; +} + +void Tokenizer::DisableTruncMethod() { use_truncation_ = false; } + +TruncMethod Tokenizer::GetTruncMethod() const { return trunc_method_; } + +void Tokenizer::SetPadMethod(const PadMethod& pad_method) { + pad_method_ = pad_method; +} + +void Tokenizer::EnablePadMethod(Direction direction, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + uint32_t* length, + uint32_t* pad_to_multiple_of) { + use_padding_ = true; + pad_method_.direction_ = direction; + pad_method_.pad_id_ = pad_id; + pad_method_.pad_token_type_id_ = pad_type_id; + pad_method_.pad_token_ = pad_token; + if (length != nullptr) { + pad_method_.pad_len_ = *length; + pad_method_.strategy_ = PadStrategy::FIXED_SIZE; + } else { + pad_method_.strategy_ = PadStrategy::BATCH_LONGEST; + } + if (pad_to_multiple_of != nullptr) { + pad_method_.pad_to_multiple_of_ = *pad_to_multiple_of; + } else { + pad_method_.pad_to_multiple_of_ = 0; + } +} +void Tokenizer::DisablePadMethod() { use_padding_ = false; } + +PadMethod Tokenizer::GetPadMethod() const { return pad_method_; } + +models::Model* Tokenizer::GetModelPtr() const { return model_.get(); } + +void Tokenizer::ReleasePostProcessor() { post_processor_ = nullptr; } + +postprocessors::PostProcessor* Tokenizer::GetPostProcessorPtr() const { + return post_processor_.get(); +} + +void Tokenizer::ReleaseDecoder() { 
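+  // Drop the decoder; Decode() then falls back to joining tokens with single spaces.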
decoder_ = nullptr; } + +decoders::Decoder* Tokenizer::GetDecoderPtr() const { return decoder_.get(); } + +Vocab Tokenizer::GetVocab(bool with_added_vocabulary) const { + auto vocab = model_->GetVocab(); + auto added_vocab = added_vocabulary_.GetVocab(); + if (with_added_vocabulary) { + for (const auto& vocab_item : added_vocab) { + vocab.insert(vocab_item); + } + } + return vocab; +} + +size_t Tokenizer::GetVocabSize(bool with_added_vocabulary) const { + size_t vocab_size = model_->GetVocabSize(); + if (with_added_vocabulary) { + vocab_size += added_vocabulary_.GetLen(); + } + return vocab_size; +} + +size_t Tokenizer::AddTokens(const std::vector& tokens) { + return added_vocabulary_.AddTokens(tokens, *model_, normalizer_.get()); +} + +size_t Tokenizer::AddSpecialTokens(const std::vector& tokens) { + return added_vocabulary_.AddSpecialTokens(tokens, *model_, normalizer_.get()); +} + +bool Tokenizer::TokenToId(const std::string& token, uint32_t* id) const { + return added_vocabulary_.TokenToId(token, *model_, id); +} + +bool Tokenizer::IdToToken(uint32_t id, std::string* token) const { + return added_vocabulary_.IdToToken(id, *model_, token); +} + +bool Tokenizer::DoTokenize(pretokenizers::PreTokenizedString* pretokenized, + uint32_t type_id, + const std::vector& word_idx, + OffsetType offset_type, + Encoding* encoding) const { + pretokenized->Tokenize([&](normalizers::NormalizedString* normalized) { + return this->GetModelPtr()->Tokenize(normalized->GetStr()); + }); + return pretokenized->TransformToEncoding( + word_idx, type_id, offset_type, encoding); +} + +bool Tokenizer::DoPreTokenize( + pretokenizers::PreTokenizedString* pretokenized) const { + if (pretokenizer_ != nullptr) { + (*pretokenizer_)(pretokenized); + } + return true; +} + +struct InputStringVisitor { + InputStringVisitor(const Tokenizer* tokenizer, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) + : tokenizer_(tokenizer), + type_id_(type_id), + offset_type_(offset_type), + encodings_(encodings) {} + void operator()(const std::vector& pretokenized_texts) const { + tokenizer_->EncodeSingleText( + pretokenized_texts, type_id_, offset_type_, encodings_); + } + + void operator()(const std::string& raw_text) const { + tokenizer_->EncodeSingleText(raw_text, type_id_, offset_type_, encodings_); + } + const Tokenizer* tokenizer_; + uint32_t type_id_; + OffsetType offset_type_; + Encoding* encodings_; +}; + +void Tokenizer::EncodeSingleString(const InputString& input_string, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const { + paddlenlp::visit(InputStringVisitor(this, type_id, offset_type, encodings), + input_string); +} + +void Tokenizer::PostProcess(Encoding* encoding, + Encoding* pair_encoding, + bool add_special_tokens, + Encoding* result_encoding) const { + // 1. Trunc + if (use_truncation_) { + auto added_tokens_num = 0; + if (post_processor_ != nullptr) { + added_tokens_num = + post_processor_->AddedTokensNum(pair_encoding != nullptr); + } + if (add_special_tokens && added_tokens_num > 0) { + auto trunc_method = trunc_method_; + trunc_method.max_len_ -= added_tokens_num; + TruncateEncodings(encoding, pair_encoding, trunc_method); + } else { + TruncateEncodings(encoding, pair_encoding, trunc_method_); + } + } + // 2. Post process + if (post_processor_ == nullptr) { + postprocessors::PostProcessor::DefaultProcess( + encoding, pair_encoding, result_encoding); + } else { + (*post_processor_)( + encoding, pair_encoding, add_special_tokens, result_encoding); + } + // 3. 
Pad + if (use_padding_) { + std::vector encodings; + encodings.push_back(*result_encoding); + PadEncodings(&encodings, pad_method_); + } +} + +void Tokenizer::EncodePairStrings(const EncodeInput& encode_input, + Encoding* encodings, + bool add_special_tokens) const { + Encoding encoding; + if (encode_input.type() == typeid(InputString)) { + const auto& input_string = paddlenlp::get(encode_input); + EncodeSingleString(input_string, 0, OffsetType::CHAR, &encoding); + PostProcess(&encoding, nullptr, add_special_tokens, encodings); + } else { + Encoding pair_encoding; + const auto& input_string_pair = + paddlenlp::get>(encode_input); + EncodeSingleString(input_string_pair.first, 0, OffsetType::CHAR, &encoding); + EncodeSingleString( + input_string_pair.second, 1, OffsetType::CHAR, &pair_encoding); + PostProcess(&encoding, &pair_encoding, add_special_tokens, encodings); + } +} + +void Tokenizer::EncodePairStrings(const std::string& text, + const std::string& text_pair, + Encoding* encodings, + bool add_special_tokens) const { + Encoding encoding, pair_encoding; + EncodeSingleString(text, 0, OffsetType::CHAR, &encoding); + EncodeSingleString(text_pair, 1, OffsetType::CHAR, &pair_encoding); + PostProcess(&encoding, &pair_encoding, add_special_tokens, encodings); +} + +void Tokenizer::MultiThreadEncodeBatchStrings( + const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const { + if (texts.size() != text_pairs.size()) { + throw std::runtime_error( + "The size of text must equal to the size of text_pair"); + } + auto batch_size = texts.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + EncodePairStrings( + texts[i], text_pairs[i], &(*encodings)[i], add_special_tokens); + } +} + +void Tokenizer::MultiThreadEncodeBatchStrings( + const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const { + auto batch_size = batch_encode_input.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + EncodePairStrings( + batch_encode_input[i], &(*encodings)[i], add_special_tokens); + } +} + +void Tokenizer::MultiThreadEncodeBatchStrings( + const std::vector& texts, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const { + auto batch_size = texts.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + EncodePairStrings(texts[i], &(*encodings)[i], add_special_tokens); + } +} + +void Tokenizer::EncodeBatchStrings( + const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens) const { + auto batch_size = batch_encode_input.size(); + encodings->resize(batch_size); + auto func = [&](size_t start_index, size_t step_index) { + MultiThreadEncodeBatchStrings(batch_encode_input, + encodings, + add_special_tokens, + start_index, + step_index); + }; + RunMultiThread(func, batch_size); + + if (use_padding_) { + PadEncodings(encodings, pad_method_); + } +} + +void Tokenizer::EncodeBatchStrings(const std::vector& texts, + std::vector* encodings, + bool add_special_tokens) const { + auto batch_size = texts.size(); + encodings->resize(batch_size); + auto 
func = [&](size_t start_index, size_t step_index) { + MultiThreadEncodeBatchStrings( + texts, encodings, add_special_tokens, start_index, step_index); + }; + RunMultiThread(func, batch_size); + + if (use_padding_) { + PadEncodings(encodings, pad_method_); + } +} + +void Tokenizer::EncodeBatchStrings(const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens) const { + auto batch_size = texts.size(); + encodings->resize(batch_size); + auto func = [&](size_t start_index, size_t step_index) { + MultiThreadEncodeBatchStrings(texts, + text_pairs, + encodings, + add_special_tokens, + start_index, + step_index); + }; + RunMultiThread(func, batch_size); + + if (use_padding_) { + PadEncodings(encodings, pad_method_); + } +} + +void Tokenizer::EncodeSingleText( + const std::vector& pretokenized_texts, + uint32_t type_id, + OffsetType offset_type, + Encoding* encoding) const { + std::vector encodings; + for (uint32_t i = 0; i < pretokenized_texts.size(); ++i) { + encodings.emplace_back( + EncodeTextToEncoding({i}, type_id, offset_type, pretokenized_texts[i])); + } + *encoding = Encoding::Merge(encodings, false); +} + +void Tokenizer::EncodeSingleText(const std::string& raw_text, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const { + *encodings = EncodeTextToEncoding({}, type_id, offset_type, raw_text); +} + +Encoding Tokenizer::EncodeTextToEncoding(const std::vector& word_idx, + uint32_t type_id, + OffsetType offset_type, + const std::string& text) const { + pretokenizers::PreTokenizedString pretokenized; + added_vocabulary_.ExtractAndNormalize(normalizer_.get(), text, &pretokenized); + DoPreTokenize(&pretokenized); + Encoding encoding; + DoTokenize(&pretokenized, type_id, word_idx, offset_type, &encoding); + return encoding; +} + +const AddedVocabulary& Tokenizer::GetAddedVocabulary() const { + return added_vocabulary_; +} + +void Tokenizer::Save(const std::string& path, bool pretty) const { + std::string json_str; + ToJsonStr(&json_str, pretty); + std::ofstream fout(path); + fout << json_str; +} + +void Tokenizer::ToJsonStr(std::string* json_str, bool pretty) const { + int indent = -1; + if (pretty) { + indent = 2; + } + nlohmann::json j = *this; + *json_str = j.dump(indent); +} + +Tokenizer Tokenizer::LoadFromFile(const std::string& json_path) { + std::ifstream fin(json_path); + nlohmann::json j; + fin >> j; + Tokenizer tokenizer; + j.get_to(tokenizer); + return tokenizer; +} + +Tokenizer Tokenizer::LoadFromStr(const std::string& json_str) { + auto jo = nlohmann::json::parse(json_str); + Tokenizer tokenizer; + jo.get_to(tokenizer); + return tokenizer; +} + +void Tokenizer::Decode(const std::vector& token_ids, + std::string* result, + bool skip_special_tokens) const { + // Get tokens + std::vector tokens; + std::string token; + for (int i = 0; i < token_ids.size(); ++i) { + IdToToken(token_ids[i], &token); + if (!added_vocabulary_.IsSpecialToken(token) || !skip_special_tokens) { + tokens.push_back(token); + } + } + if (decoder_ != nullptr) { + (*decoder_)(tokens, result); + } else { + for (int i = 0; i < tokens.size(); ++i) { + if (i > 0) { + *result += " "; + } + *result += tokens[i]; + } + } +} + + +void Tokenizer::MultiThreadDecodeBatch( + const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens, + size_t start_index, + size_t step_index) const { + auto batch_size = batch_token_ids.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + 
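+  // Each worker decodes its own contiguous slice [start_index, end_index) of the batch.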
for (size_t i = start_index; i < end_index; ++i) { + Decode(batch_token_ids[i], &(*results)[i], skip_special_tokens); + } +} + +void Tokenizer::DecodeBatch( + const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens) const { + auto batch_size = batch_token_ids.size(); + results->resize(batch_size); + auto func = [&](size_t start_index, size_t step_index) { + MultiThreadDecodeBatch( + batch_token_ids, results, skip_special_tokens, start_index, step_index); + }; + RunMultiThread(func, batch_size); +} + +bool Tokenizer::GetUseTruncation() const { return use_truncation_; } + +bool Tokenizer::GetUsePadding() const { return use_padding_; } + +void to_json(nlohmann::json& j, const Tokenizer& tokenizer) { + j = { + {"added_tokens", tokenizer.added_vocabulary_}, + }; + + j["truncation"] = nullptr; + if (tokenizer.use_truncation_) { + j["truncation"] = tokenizer.trunc_method_; + } + + j["padding"] = nullptr; + if (tokenizer.use_padding_) { + j["padding"] = tokenizer.pad_method_; + } + + j["normalizer"] = nullptr; + if (tokenizer.normalizer_ != nullptr) { + if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::BertNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::ReplaceNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::StripNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::StripAccentsNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFCNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFDNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFKCNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFKDNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NmtNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::LowercaseNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::SequenceNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::PrecompiledNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } + } + + j["pretokenizer"] = nullptr; + if (tokenizer.pretokenizer_ != nullptr) { + if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::BertPreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::MetaSpacePreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + 
typeid(pretokenizers::WhitespacePreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::WhitespaceAndPunctuationPreTokenizer)) { + j["pretokenizer"] = + *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::SequencePreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::ByteLevelPreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::SplitPreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } + } + + j["model"] = nullptr; + if (tokenizer.model_ != nullptr) { + if (typeid(*tokenizer.model_.get()) == typeid(models::WordPiece)) { + j["model"] = *dynamic_cast(tokenizer.model_.get()); + } else if (typeid(*tokenizer.model_.get()) == + typeid(models::FastWordPiece)) { + j["model"] = + *dynamic_cast(tokenizer.model_.get()); + } else if (typeid(*tokenizer.model_.get()) == typeid(models::BPE)) { + j["model"] = *dynamic_cast(tokenizer.model_.get()); + } else if (typeid(*tokenizer.model_.get()) == typeid(models::Unigram)) { + j["model"] = *dynamic_cast(tokenizer.model_.get()); + } + } + + j["postprocessor"] = nullptr; + if (tokenizer.post_processor_ != nullptr) { + if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::BertPostProcessor)) { + j["postprocessor"] = *dynamic_cast( + tokenizer.post_processor_.get()); + } else if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::TemplatePostProcessor)) { + j["postprocessor"] = + *dynamic_cast( + tokenizer.post_processor_.get()); + } else if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::RobertaPostProcessor)) { + j["postprocessor"] = *dynamic_cast( + tokenizer.post_processor_.get()); + } else if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::ByteLevelPostProcessor)) { + j["postprocessor"] = + *dynamic_cast( + tokenizer.post_processor_.get()); + } + } + + j["decoder"] = nullptr; + if (tokenizer.decoder_ != nullptr) { + if (typeid(*tokenizer.decoder_.get()) == typeid(decoders::WordPiece)) { + j["decoder"] = + *dynamic_cast(tokenizer.decoder_.get()); + } + } +} + +void from_json(const nlohmann::json& j, Tokenizer& tokenizer) { + // deserialize normalizer_ + try { + const auto& normalizer = j.at("normalizer"); + if (!normalizer.is_null()) { + if (normalizer.at("type") == "BertNormalizer") { + normalizers::BertNormalizer bert_normalizer; + normalizer.get_to(bert_normalizer); + tokenizer.SetNormalizer(bert_normalizer); + } else if (normalizer.at("type") == "ReplaceNormalizer") { + normalizers::ReplaceNormalizer replace_normalizer; + normalizer.get_to(replace_normalizer); + tokenizer.SetNormalizer(replace_normalizer); + } else if (normalizer.at("type") == "StripNormalizer") { + normalizers::StripNormalizer strip_normalizer; + normalizer.get_to(strip_normalizer); + tokenizer.SetNormalizer(strip_normalizer); + } else if (normalizer.at("type") == "StripAccentsNormalizer") { + normalizers::StripAccentsNormalizer strip_normalizer; + normalizer.get_to(strip_normalizer); + tokenizer.SetNormalizer(strip_normalizer); + } else if (normalizer.at("type") == "NFCNormalizer") { + normalizers::NFCNormalizer unicode_normalizer; + 
normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NFDNormalizer") { + normalizers::NFDNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NFKCNormalizer") { + normalizers::NFKCNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NFKDNormalizer") { + normalizers::NFKDNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NmtNormalizer") { + normalizers::NmtNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "LowercaseNormalizer") { + normalizers::LowercaseNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "SequenceNormalizer") { + normalizers::SequenceNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "PrecompiledNormalizer") { + normalizers::PrecompiledNormalizer precompiled_normalizer; + normalizer.get_to(precompiled_normalizer); + tokenizer.SetNormalizer(precompiled_normalizer); + } + } + + // deserialize pretokenizer_ + nlohmann::json pretokenizer; + if (j.find("pretokenizer") == j.end()) { + pretokenizer = j.at("pre_tokenizer"); + } else { + pretokenizer = j.at("pretokenizer"); + } + if (!pretokenizer.is_null()) { + if (pretokenizer.at("type") == "BertPreTokenizer") { + pretokenizers::BertPreTokenizer bert_pretokenizer; + tokenizer.SetPreTokenizer(bert_pretokenizer); + } else if (pretokenizer.at("type") == "MetaSpacePreTokenizer") { + pretokenizers::MetaSpacePreTokenizer meta_pretokenizer; + pretokenizer.get_to(meta_pretokenizer); + tokenizer.SetPreTokenizer(meta_pretokenizer); + } else if (pretokenizer.at("type") == "WhitespacePreTokenizer") { + pretokenizers::WhitespacePreTokenizer whitespace_pretokenizer; + tokenizer.SetPreTokenizer(whitespace_pretokenizer); + } else if (pretokenizer.at("type") == + "WhitespaceAndPunctuationPreTokenizer") { + pretokenizers::WhitespaceAndPunctuationPreTokenizer + whitespace_pretokenizer; + tokenizer.SetPreTokenizer(whitespace_pretokenizer); + } else if (pretokenizer.at("type") == "SequencePreTokenizer") { + pretokenizers::SequencePreTokenizer sequence_pretokenizer; + pretokenizer.get_to(sequence_pretokenizer); + tokenizer.SetPreTokenizer(sequence_pretokenizer); + } else if (pretokenizer.at("type") == "ByteLevelPreTokenizer") { + pretokenizers::ByteLevelPreTokenizer byte_pretokenizer; + pretokenizer.get_to(byte_pretokenizer); + tokenizer.SetPreTokenizer(byte_pretokenizer); + } else if (pretokenizer.at("type") == "SplitPreTokenizer") { + pretokenizers::SplitPreTokenizer split_pretokenizer; + pretokenizer.get_to(split_pretokenizer); + tokenizer.SetPreTokenizer(split_pretokenizer); + } + } + + // deserialize model_ + const auto& model = j.at("model"); + if (!model.is_null()) { + if (model.at("type") == "WordPiece") { + models::WordPiece wordpiece; + model.get_to(wordpiece); + tokenizer.SetModel(wordpiece); + } else if (model.at("type") == "FastWordPiece") { + models::FastWordPiece wordpiece; + model.get_to(wordpiece); + tokenizer.SetModel(wordpiece); + } else if 
(model.at("type") == "BPE") { + models::BPE bpe; + model.get_to(bpe); + tokenizer.SetModel(bpe); + } else if (model.at("type") == "Unigram") { + models::Unigram unigram; + model.get_to(unigram); + tokenizer.SetModel(unigram); + } + } + + // deserialize post_processor_ + nlohmann::json post_processor; + if (j.find("postprocessor") == j.end()) { + post_processor = j.at("post_processor"); + } else { + post_processor = j.at("postprocessor"); + } + if (!post_processor.is_null()) { + if (post_processor.at("type") == "BertPostProcessor") { + postprocessors::BertPostProcessor bert_postprocessor; + post_processor.get_to(bert_postprocessor); + tokenizer.SetPostProcessor(bert_postprocessor); + } else if (post_processor.at("type") == "TemplateProcessing") { + postprocessors::TemplatePostProcessor template_postprocessor; + post_processor.get_to(template_postprocessor); + tokenizer.SetPostProcessor(template_postprocessor); + } else if (post_processor.at("type") == "RobertaPostProcessor") { + postprocessors::RobertaPostProcessor roberta_postprocessor; + post_processor.get_to(roberta_postprocessor); + tokenizer.SetPostProcessor(roberta_postprocessor); + } else if (post_processor.at("type") == "ByteLevelPostProcessor") { + postprocessors::ByteLevelPostProcessor byte_level_postprocessor; + post_processor.get_to(byte_level_postprocessor); + tokenizer.SetPostProcessor(byte_level_postprocessor); + } + } + + // deserialize trunc_method_ + const auto& trunc_method = j.at("truncation"); + if (!trunc_method.is_null()) { + tokenizer.use_truncation_ = true; + trunc_method.get_to(tokenizer.trunc_method_); + } else { + tokenizer.use_truncation_ = false; + } + + // deserialize pad_method_ + const auto& pad_method = j.at("padding"); + if (!pad_method.is_null()) { + tokenizer.use_padding_ = true; + pad_method.get_to(tokenizer.pad_method_); + } else { + tokenizer.use_padding_ = false; + } + + // deserialize added_vocabulary_ + const auto& added_tokens = j.at("added_tokens"); + core::AddedTokenWithId added_token_with_id; + std::vector tokens(added_tokens.size()); + for (int i = 0; i < added_tokens.size(); ++i) { + added_tokens[i].get_to(added_token_with_id); + tokens[i] = added_token_with_id.added_token_; + } + tokenizer.AddSpecialTokens(tokens); + + const auto& decoder = j.at("decoder"); + if (!decoder.is_null()) { + if (decoder.at("type") == "WordPiece") { + decoders::WordPiece wordpiece_decoder; + decoder.get_to(wordpiece_decoder); + tokenizer.SetDecoder(wordpiece_decoder); + } + } + + } catch (nlohmann::json::out_of_range& e) { + VLOG(0) << e.what(); + } +} +// Instantiate normalizers +template void Tokenizer::SetNormalizer(const normalizers::BertNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::LowercaseNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFCNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFKCNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFDNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFKDNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NmtNormalizer&); +template void Tokenizer::SetNormalizer( + const normalizers::PrecompiledNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::ReplaceNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::SequenceNormalizer&); +template void Tokenizer::SetNormalizer( + const normalizers::StripAccentsNormalizer&); +template void Tokenizer::SetNormalizer(const 
normalizers::StripNormalizer&); + +// Instantiate pretokenizers +template void Tokenizer::SetPreTokenizer( + const pretokenizers::BertPreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::WhitespacePreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::WhitespaceAndPunctuationPreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::MetaSpacePreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::SequencePreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::ByteLevelPreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::SplitPreTokenizer&); + +// Instantiate models +template Tokenizer::Tokenizer(const models::WordPiece&); +template void Tokenizer::SetModel(const models::WordPiece&); +template Tokenizer::Tokenizer(const models::FastWordPiece&); +template void Tokenizer::SetModel(const models::FastWordPiece&); +template Tokenizer::Tokenizer(const models::BPE&); +template void Tokenizer::SetModel(const models::BPE&); +template Tokenizer::Tokenizer(const models::Unigram&); +template void Tokenizer::SetModel(const models::Unigram&); + +// Instantiate processors +template void Tokenizer::SetPostProcessor( + const postprocessors::BertPostProcessor&); +template void Tokenizer::SetPostProcessor( + const postprocessors::TemplatePostProcessor&); +template void Tokenizer::SetPostProcessor( + const postprocessors::RobertaPostProcessor&); +template void Tokenizer::SetPostProcessor( + const postprocessors::ByteLevelPostProcessor&); + +// Instantiate Decoder +template void Tokenizer::SetDecoder(const decoders::WordPiece& decoder); +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/tokenizer.h b/fast_tokenizer/fast_tokenizer/core/tokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..71fdf6099f64616b4ed539ee5da9ec2b80fb6b8d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/tokenizer.h @@ -0,0 +1,254 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once +#include // For shared_ptr +#include + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" +#include "fast_tokenizer/utils/variant.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace normalizers { + +class Normalizer; +class NormalizedString; + +} // namespace normalizers + +namespace pretokenizers { + +class PreTokenizer; +class PreTokenizedString; + +} // namespace pretokenizers + +namespace models { +class Model; +} // namespace models + +namespace postprocessors { +class PostProcessor; +} // namespace postprocessors + +namespace decoders { +class Decoder; +}; + +namespace core { + +class AddedVocabulary; +class Encoding; + +using InputString = paddlenlp::variant>; +using EncodeInput = + paddlenlp::variant>; + +class FASTTOKENIZER_DECL Tokenizer { +public: + Tokenizer() + : model_(nullptr), + normalizer_(nullptr), + pretokenizer_(nullptr), + post_processor_(nullptr), + decoder_(nullptr), + use_padding_(true), + use_truncation_(true) {} + template + Tokenizer(const ModelType& model) + : model_(std::make_shared(model)), + normalizer_(nullptr), + pretokenizer_(nullptr), + post_processor_(nullptr), + decoder_(nullptr), + use_padding_(true), + use_truncation_(true) {} + + template + void SetNormalizer(const NormalizerType& normalizer) { + normalizer_ = std::make_shared(normalizer); + } + void ReleaseNormaizer(); + normalizers::Normalizer* GetNormalizerPtr() const; + + template + void SetPreTokenizer(const PreTokenizerType& pretokenizer) { + pretokenizer_ = std::make_shared(pretokenizer); + } + void ReleasePreTokenizer(); + pretokenizers::PreTokenizer* GetPreTokenizer() const; + + template + void SetModel(const ModelType& model) { + model_ = std::make_shared(model); + } + models::Model* GetModelPtr() const; + + template + void SetPostProcessor(const PostProcessorType& post_processor) { + post_processor_ = std::make_shared(post_processor); + } + void ReleasePostProcessor(); + postprocessors::PostProcessor* GetPostProcessorPtr() const; + + template + void SetDecoder(const DecoderType& decoder) { + decoder_ = std::make_shared(decoder); + } + void ReleaseDecoder(); + decoders::Decoder* GetDecoderPtr() const; + + void SetTruncMethod(const TruncMethod& trunc_method); + void DisableTruncMethod(); + void EnableTruncMethod(size_t max_len, + size_t stride, + Direction direction, + TruncStrategy strategy); + TruncMethod GetTruncMethod() const; + + void SetPadMethod(const PadMethod& pad_method); + void DisablePadMethod(); + void EnablePadMethod(Direction direction, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + uint32_t* length, + uint32_t* pad_to_multiple_of); + PadMethod GetPadMethod() const; + + Vocab GetVocab(bool with_added_vocabulary = true) const; + size_t GetVocabSize(bool with_added_vocabulary = true) const; + bool TokenToId(const std::string& token, uint32_t* id) const; + bool IdToToken(uint32_t id, std::string* token) const; + size_t AddTokens(const std::vector& tokens); + size_t AddSpecialTokens(const std::vector& tokens); + bool DoTokenize(pretokenizers::PreTokenizedString* pretokenized, + uint32_t type_id, + const std::vector& word_idx, + OffsetType offset_type, + Encoding* encoding) const; + bool DoPreTokenize(pretokenizers::PreTokenizedString* pretokenized) const; + + void EncodeSingleString(const InputString& input_string, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const; + void 
PostProcess(Encoding* encoding, + Encoding* pair_encoding, + bool add_special_tokens, + Encoding* result_encoding) const; + void EncodePairStrings(const EncodeInput& encode_input, + Encoding* encodings, + bool add_special_tokens = true) const; + void EncodePairStrings(const std::string& text, + const std::string& text_pair, + Encoding* encodings, + bool add_special_tokens = true) const; + + void MultiThreadEncodeBatchStrings( + const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const; + // Tokenize the unpretokenized text. + void MultiThreadEncodeBatchStrings(const std::vector& texts, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const; + void MultiThreadEncodeBatchStrings(const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const; + + void EncodeBatchStrings(const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens = true) const; + // Tokenize the unpretokenized text. + void EncodeBatchStrings(const std::vector& texts, + std::vector* encodings, + bool add_special_tokens = true) const; + void EncodeBatchStrings(const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens = true) const; + + // Encode single text which is already pretokenized. + void EncodeSingleText(const std::vector& pretokenized_texts, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const; + // Encode single raw text + void EncodeSingleText(const std::string& raw_text, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const; + const AddedVocabulary& GetAddedVocabulary() const; + void Save(const std::string& json_path, bool pretty = true) const; + void ToJsonStr(std::string* json_str, bool pretty = true) const; + + // Create a tokenzier from json path + static Tokenizer LoadFromFile(const std::string& json_path); + static Tokenizer LoadFromStr(const std::string& json_str); + + bool GetUseTruncation() const; + bool GetUsePadding() const; + + // Decode: From tokens to a complete string + void Decode(const std::vector& token_ids, + std::string* result, + bool skip_special_tokens = true) const; + void MultiThreadDecodeBatch( + const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens, + size_t start_index, + size_t step_index) const; + void DecodeBatch(const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens = true) const; + +private: + Encoding EncodeTextToEncoding(const std::vector& word_idx, + uint32_t type_id, + OffsetType offset_type, + const std::string& text) const; + // All member of Tokenizer + std::shared_ptr normalizer_; + std::shared_ptr pretokenizer_; + std::shared_ptr model_; + std::shared_ptr post_processor_; + std::shared_ptr decoder_; + + TruncMethod trunc_method_; + PadMethod pad_method_; + AddedVocabulary added_vocabulary_; + bool use_truncation_; + bool use_padding_; + + friend void to_json(nlohmann::json& j, const Tokenizer& tokenizer); + friend void from_json(const nlohmann::json& j, Tokenizer& tokenizer); +}; + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/decoders/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/decoders/CMakeLists.txt new file mode 100644 index 
0000000000000000000000000000000000000000..d2fffc3dac6e24bd846f3e775280e08dfd6012b8 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/CMakeLists.txt @@ -0,0 +1 @@ +cc_library(decoders SRCS wordpiece.cc DEPS json utils) diff --git a/fast_tokenizer/fast_tokenizer/decoders/decoder.h b/fast_tokenizer/fast_tokenizer/decoders/decoder.h new file mode 100644 index 0000000000000000000000000000000000000000..7969e22d11f7dd5370cdb2ba4c6b0c375c7653ca --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/decoder.h @@ -0,0 +1,32 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace decoders { + +struct FASTTOKENIZER_DECL Decoder { + virtual void operator()(const std::vector tokens, + std::string* result) const = 0; +}; + +} // namespace decoders +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/decoders/decoders.h b/fast_tokenizer/fast_tokenizer/decoders/decoders.h new file mode 100644 index 0000000000000000000000000000000000000000..efc72779de9f9908b2f7f58576199ef6e349f9f0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/decoders.h @@ -0,0 +1,18 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/decoders/decoder.h" +#include "fast_tokenizer/decoders/wordpiece.h" \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/decoders/wordpiece.cc b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..e81f1562d24265d8974e708a63cc9d2bfabb9d1c --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.cc @@ -0,0 +1,69 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/decoders/wordpiece.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace decoders { + +WordPiece::WordPiece(const std::string prefix, bool cleanup) + : prefix_(prefix), cleanup_(cleanup) {} + +void WordPiece::CleanUp(std::string* result) const { + utils::StringReplaceAll(result, " .", "."); + utils::StringReplaceAll(result, " !", "!"); + utils::StringReplaceAll(result, " ?", "?"); + utils::StringReplaceAll(result, " ,", ","); + utils::StringReplaceAll(result, " ' ", "'"); + utils::StringReplaceAll(result, " n't", "n't"); + utils::StringReplaceAll(result, " 'm", "'m"); + utils::StringReplaceAll(result, " do not", " don't"); + utils::StringReplaceAll(result, " 's", "'s"); + utils::StringReplaceAll(result, " 've", "'ve"); + utils::StringReplaceAll(result, " 're", "'re"); +} + +void WordPiece::operator()(const std::vector tokens, + std::string* result) const { + *result = ""; + for (int i = 0; i < tokens.size(); ++i) { + if (i > 0) { + *result += " "; + } + *result += tokens[i]; + } + utils::StringReplaceAll(result, " " + prefix_, ""); + if (cleanup_) { + CleanUp(result); + } +} + +void to_json(nlohmann::json& j, const WordPiece& decoder) { + j = { + {"type", "WordPiece"}, + {"cleanup", decoder.cleanup_}, + {"prefix", decoder.prefix_}, + }; +} + +void from_json(const nlohmann::json& j, WordPiece& decoder) { + j["cleanup"].get_to(decoder.cleanup_); + j["prefix"].get_to(decoder.prefix_); +} + +} // namespace decoders +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/decoders/wordpiece.h b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.h new file mode 100644 index 0000000000000000000000000000000000000000..1f41b3f8b5dcf4475bc701f4d81521eb3c0bd5f5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.h @@ -0,0 +1,42 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/decoders/decoder.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace decoders { + +struct FASTTOKENIZER_DECL WordPiece : public Decoder { + virtual void operator()(const std::vector tokens, + std::string* result) const; + + WordPiece(const std::string prefix = "##", bool cleanup = true); + +private: + void CleanUp(std::string* result) const; + std::string prefix_; + bool cleanup_; + + friend void to_json(nlohmann::json& j, const WordPiece& decoder); + friend void from_json(const nlohmann::json& j, WordPiece& decoder); +}; + +} // namespace decoders +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/models/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..f51706a770358988d5e72386da0eea90e363a7d2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/CMakeLists.txt @@ -0,0 +1,3 @@ +cc_library(models + SRCS wordpiece.cc fast_wordpiece.cc bpe.cc unigram.cc + DEPS core json trie failure icuuc icudata lattice utils) diff --git a/fast_tokenizer/fast_tokenizer/models/bpe.cc b/fast_tokenizer/fast_tokenizer/models/bpe.cc new file mode 100644 index 0000000000000000000000000000000000000000..ad80d2752d6f3b115512d573c3e07e2e27eebc87 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/bpe.cc @@ -0,0 +1,349 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/models/bpe.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/utf8.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { +const std::string WHITESPACE = " \n\r\t\f\v"; + +void BPE::Init(const core::Merges& merges) { + if (dropout_.size() > 0) { + if (dropout_[0] > 1.0 || dropout_[0] <= 0.0) { + std::ostringstream oss; + oss << "The range of dropout rate should be (0,1], but receive " + << dropout_[0]; + throw std::runtime_error(oss.str()); + } + } + // construct vocab_r + for (auto&& item : vocab_) { + vocab_reversed_[item.second] = item.first; + } + int prefix_len = 0; + if (continuing_subword_prefix_.size() > 0) { + prefix_len += continuing_subword_prefix_[0].length(); + } + + // construct merge_map + for (int i = 0; i < merges.size(); i++) { + auto&& merge = merges[i]; + try { + auto a_id = vocab_.at(merge.first); + auto b_id = vocab_.at(merge.second); + auto new_token = merge.first + merge.second.substr(prefix_len); + auto new_id = vocab_.at(new_token); + merges_.insert({core::Pair(a_id, b_id), {i, new_id}}); + } catch (...) 
{ + std::ostringstream oss; + oss << "Can't merge token out of the vocabulary"; + throw std::runtime_error(oss.str()); + } + } + + // construct unk + if (unk_token_.size() > 0) { + try { + unk_token_id_.emplace_back(vocab_.at(unk_token_.front())); + } catch (...) { + std::ostringstream oss; + oss << "Unk token `" << unk_token_.front() + << "` not found in the vocabulary"; + throw std::runtime_error(oss.str()); + } + } +} + +BPE::BPE() : fuse_unk_(false), cache_(utils::DEFAULT_CACHE_CAPACITY) {} + +BPE::BPE(const core::Vocab& vocab, + const core::Merges& merges, + size_t cache_capacity, + const std::vector& dropout, + const std::vector& unk_token, + const std::vector& continuing_subword_prefix, + const std::vector& end_of_word_suffix, + bool fuse_unk) + : vocab_(vocab), + fuse_unk_(fuse_unk), + dropout_(dropout), + unk_token_(unk_token), + continuing_subword_prefix_(continuing_subword_prefix), + end_of_word_suffix_(end_of_word_suffix), + cache_(utils::DEFAULT_CACHE_CAPACITY) { + Init(merges); +} + +void BPE::ClearCache() { cache_.Clear(); } + +core::Vocab BPE::GetVocabFromFile(const std::string& vocab_json_path) { + std::ifstream fin(vocab_json_path); + core::Vocab vocab; + nlohmann::json j; + fin >> j; + for (nlohmann::json::iterator it = j.begin(); it != j.end(); ++it) { + vocab[it.key()] = it.value(); + } + return vocab; +} + +void BPE::ConstructMergesPair(const std::string word_line, + std::pair* result) { + auto pair_a_begin = word_line.find_first_not_of(WHITESPACE); + auto pair_a_end = word_line.find_first_of(WHITESPACE, pair_a_begin); + auto pair_b_begin = word_line.find_first_not_of(WHITESPACE, pair_a_end); + auto pair_b_end = word_line.find_first_of(WHITESPACE, pair_b_begin); + *result = {word_line.substr(pair_a_begin, pair_a_end - pair_a_begin), + word_line.substr(pair_b_begin, pair_b_end - pair_b_begin)}; +} + +core::Merges BPE::GetMergesFromFile(const std::string& merge_path) { + std::ifstream fin(merge_path); + core::Merges merges; + std::string word_str; + while (std::getline(fin, word_str)) { + if (word_str.find("#version") == 0) { + continue; + } + std::pair result; + ConstructMergesPair(word_str, &result); + merges.emplace_back(result); + } + return merges; +} + +void BPE::GetVocabAndMergesFromFile(const std::string& vocab_json_path, + const std::string& merge_path, + core::Vocab* vocab, + core::Merges* merges) { + *vocab = BPE::GetVocabFromFile(vocab_json_path); + *merges = BPE::GetMergesFromFile(merge_path); +} + +void BPE::MergeWord(const std::string& word, core::BPEWord* bpe_word) { + std::vector> unk; + bpe_word->Reserve(word.length()); + uint32_t start = 0; + while (start < word.length()) { + uint32_t content_char; + uint32_t content_char_width = + utils::UTF8ToUInt32(word.data() + start, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + uint32_t end = start + content_char_width; + bool is_first = (start == 0); + bool is_last = (end >= word.length()); + std::string curr_str = word.substr(start, content_char_width); + // Add the `continuing_subword_prefix` if relevant + if (!is_first) { + if (continuing_subword_prefix_.size() > 0) { + curr_str = continuing_subword_prefix_.front() + curr_str; + } + } + // Add the `end_of_word_suffix` if relevant + if (is_last) { + if (end_of_word_suffix_.size() > 0) { + curr_str = curr_str + end_of_word_suffix_.front(); + } + } + if (vocab_.find(curr_str) != vocab_.end()) { + if (unk.size() > 0) { + bpe_word->Add(unk.front().first, unk.front().second); + unk.clear(); + } + auto id = vocab_.at(curr_str); + 
bpe_word->Add(id, content_char_width); + } else { + if (unk_token_id_.size() > 0) { + if (unk.size() == 0) { + unk.push_back({unk_token_id_.front(), content_char_width}); + } else { + if (fuse_unk_) { + unk[0] = {unk[0].first, unk[0].second + content_char_width}; + } else { + bpe_word->Add(unk[0].first, unk[0].second); + unk[0] = {unk_token_id_.front(), content_char_width}; + } + } + } + } + start = end; + } + + if (unk.size() > 0) { + bpe_word->Add(unk.front().first, unk.front().second); + } + bpe_word->MergeAll(merges_, dropout_); +} + +void BPE::WordToTokens(const core::BPEWord& bpe_word, + std::vector* tokens) { + std::vector chars; + bpe_word.GetChars(&chars); + + std::vector offsets; + bpe_word.GetOffset(&offsets); + + tokens->reserve(offsets.size()); + for (int i = 0; i < offsets.size(); ++i) { + tokens->emplace_back(chars[i], vocab_reversed_[chars[i]], offsets[i]); + } +} + +void BPE::TokenizeWithCache(const std::string& sequence, + std::vector* tokens) { + core::BPEWord bpe_word; + if (cache_.GetValue(sequence, &bpe_word)) { + WordToTokens(bpe_word, tokens); + } else { + MergeWord(sequence, &bpe_word); + WordToTokens(bpe_word, tokens); + cache_.SetValue(sequence, bpe_word); + } +} + +std::vector BPE::Tokenize(const std::string& sequence) { + std::vector tokens; + if (sequence.empty()) { + return tokens; + } + if (dropout_.size() == 0) { + TokenizeWithCache(sequence, &tokens); + return tokens; + } + core::BPEWord bpe_word; + MergeWord(sequence, &bpe_word); + WordToTokens(bpe_word, &tokens); + return tokens; +} + +bool BPE::TokenToId(const std::string& token, uint32_t* id) const { + if (vocab_.find(token) == vocab_.end()) { + return false; + } + *id = vocab_.at(token); + return true; +} + +bool BPE::IdToToken(uint32_t id, std::string* token) const { + if (vocab_reversed_.find(id) == vocab_reversed_.end()) { + return false; + } + *token = vocab_reversed_.at(id); + return true; +} + +core::Vocab BPE::GetVocab() const { return vocab_; } + +size_t BPE::GetVocabSize() const { return vocab_.size(); } + +// Return the saved voacb path and merges.txt +std::vector BPE::Save(const std::string& folder, + const std::string& filename_prefix) const { + // write vocab json + std::string vocab_path; + if (filename_prefix == "") { + vocab_path = utils::PathJoin(folder, "vocab.json"); + } else { + vocab_path = utils::PathJoin({folder, filename_prefix, "-vocab.json"}); + } + VLOG(6) << "Vocab path" << vocab_path; + core::SortedVocabReversed sorted_vocab_r(vocab_reversed_.begin(), + vocab_reversed_.end()); + nlohmann::json j = sorted_vocab_r; + std::ofstream fout(vocab_path); + fout << j.dump(); + fout.close(); + + // write merges.txt + std::string merges_path; + if (filename_prefix == "") { + merges_path = utils::PathJoin(folder, "merges.txt"); + } else { + merges_path = utils::PathJoin({folder, filename_prefix, "-merges.txt"}); + } + VLOG(6) << "Merges path" << merges_path; + std::ofstream merge_fout(merges_path); + merge_fout << "#version: 0.2\n"; + for (auto&& merge : merges_) { + merge_fout << vocab_reversed_.at(merge.first.first) << " " + << vocab_reversed_.at(merge.first.second) << "\n"; + } + merge_fout.close(); + return {vocab_path, merges_path}; +} + +void to_json(nlohmann::json& j, const BPE& model) { + std::vector> merges; + for (auto& merge : model.merges_) { + merges.push_back({merge.first, merge.second.first}); + } + std::sort(merges.begin(), + merges.end(), + [](const std::pair& a, + const std::pair& b) { + return a.second < b.second; + }); + std::vector merge_strs; + for (auto& merge : 
merges) { + std::string s = model.vocab_reversed_.at(merge.first.first) + " " + + model.vocab_reversed_.at(merge.first.second); + merge_strs.push_back(s); + } + + core::SortedVocabReversed sorted_vocab_r(model.vocab_reversed_.begin(), + model.vocab_reversed_.end()); + + j = {{"type", "BPE"}, + {"unk_token", model.unk_token_}, + {"continuing_subword_prefix", model.continuing_subword_prefix_}, + {"end_of_word_suffix", model.end_of_word_suffix_}, + {"fuse_unk", model.fuse_unk_}, + {"dropout", model.dropout_}, + {"vocab", sorted_vocab_r}, + {"merges", merge_strs}}; +} + +void from_json(const nlohmann::json& j, BPE& model) { + j["vocab"].get_to(model.vocab_); + j["unk_token"].get_to(model.unk_token_); + j["continuing_subword_prefix"].get_to(model.continuing_subword_prefix_); + j["end_of_word_suffix"].get_to(model.end_of_word_suffix_); + j["fuse_unk"].get_to(model.fuse_unk_); + j["dropout"].get_to(model.dropout_); + + std::vector merge_strs; + j["merges"].get_to(merge_strs); + + core::Merges merges; + std::pair result; + for (auto& word_line : merge_strs) { + BPE::ConstructMergesPair(word_line, &result); + merges.push_back(result); + } + model.Init(merges); +} + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/bpe.h b/fast_tokenizer/fast_tokenizer/models/bpe.h new file mode 100644 index 0000000000000000000000000000000000000000..bb8cfd08cc413b3b73aea15fdcf5c866514a11cc --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/bpe.h @@ -0,0 +1,82 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#pragma once
+
+#include "fast_tokenizer/models/model.h"
+#include "nlohmann/json.hpp"
+#include "fast_tokenizer/utils/cache.h"
+#include "fast_tokenizer/utils/utils.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace models {
+
+struct FASTTOKENIZER_DECL BPE : public Model {
+  BPE();
+  BPE(const core::Vocab& vocab,
+      const core::Merges& merges,
+      size_t cache_capacity = utils::DEFAULT_CACHE_CAPACITY,
+      const std::vector<float>& dropout = {},
+      const std::vector<std::string>& unk_token = {},
+      const std::vector<std::string>& continuing_subword_prefix = {},
+      const std::vector<std::string>& end_of_word_suffix = {},
+      bool fuse_unk = false);
+  virtual std::vector<core::Token> Tokenize(
+      const std::string& sequence) override;
+  virtual bool TokenToId(const std::string& token, uint32_t* id) const override;
+  virtual bool IdToToken(uint32_t id, std::string* token) const override;
+  virtual core::Vocab GetVocab() const override;
+  virtual size_t GetVocabSize() const override;
+  // Return the saved vocab path
+  virtual std::vector<std::string> Save(
+      const std::string& folder,
+      const std::string& filename_prefix) const override;
+
+  void ClearCache();
+  static core::Vocab GetVocabFromFile(const std::string& vocab_json_path);
+  static core::Merges GetMergesFromFile(const std::string& merge_path);
+  static void GetVocabAndMergesFromFile(const std::string& vocab_json_path,
+                                        const std::string& merge_path,
+                                        core::Vocab* vocab,
+                                        core::Merges* merges);
+  static void ConstructMergesPair(const std::string word_line,
+                                  std::pair<std::string, std::string>* result);
+
+private:
+  void Init(const core::Merges& merges);
+  void MergeWord(const std::string& word, core::BPEWord* bpe_word);
+  void WordToTokens(const core::BPEWord& bpe_word,
+                    std::vector<core::Token>* tokens);
+  void TokenizeWithCache(const std::string& sequence,
+                         std::vector<core::Token>* tokens);
+  core::Vocab vocab_;
+  core::VocabReversed vocab_reversed_;
+  core::MergeMap merges_;
+
+  utils::Cache<std::string, core::BPEWord> cache_;
+  // Each of the following vectors may contain 0 or 1 element
+  std::vector<float> dropout_;
+  std::vector<std::string> unk_token_;
+  std::vector<uint32_t> unk_token_id_;
+  std::vector<std::string> continuing_subword_prefix_;
+  std::vector<std::string> end_of_word_suffix_;
+  bool fuse_unk_;
+  friend void to_json(nlohmann::json& j, const BPE& model);
+  friend void from_json(const nlohmann::json& j, BPE& model);
+};
+
+}  // namespace models
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.cc b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.cc
new file mode 100644
index 0000000000000000000000000000000000000000..92124c6b15ae210638f3360a396bf77eed00e01c
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.cc
@@ -0,0 +1,428 @@
+// Copyright 2022 TF.Text Authors.
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+
+// http://www.apache.org/licenses/LICENSE-2.0
+
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
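+
+// FastWordPiece follows the single-pass ("LinMaxMatch") approach used by
+// TF.Text's Fast WordPiece Tokenization: vocabulary tokens are stored in a
+// Darts trie augmented with precomputed failure links and failure pops, so a
+// word is tokenized left to right without the re-scanning of the classic
+// longest-match-first WordPiece algorithm. The end-to-end variant
+// (with_pretokenization = true) additionally splits on whitespace,
+// punctuation and Chinese characters while scanning.
+//
+// Illustrative usage sketch (the vocab path and input text are placeholders,
+// not assets shipped with this patch):
+//
+//   auto model = models::FastWordPiece::GetFastWordPieceFromFile(
+//       "vocab.txt", "[UNK]", 100, "##", /*with_pretokenization=*/true);
+//   std::vector<core::Token> tokens = model.Tokenize("unaffable");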
+ +#include "fast_tokenizer/models/fast_wordpiece.h" + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" +#include "unicode/uchar.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +const std::string WHITESPACE = " \n\r\t\f\v"; + +void FastWordPiece::InitFailureAndTrie() { + unk_token_id_ = vocab_.at(unk_token_); + trie_.SetWithPretokenization(with_pretokenization_); + trie_.SetUNKToken(unk_token_); + trie_.SetContinuingSubwordPrefix(continuing_subword_prefix_); + failure_array_.SetWithPretokenization(with_pretokenization_); + failure_array_.InitFromVocabAndTrie( + vocab_, &trie_, unk_token_, continuing_subword_prefix_); + PrecomputeEncodeValueForSubwordPrefix(); +} + +FastWordPiece::FastWordPiece() : WordPiece(), with_pretokenization_(false) {} + +FastWordPiece::FastWordPiece(const core::Vocab& vocab, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix, + bool with_pretokenization) + : WordPiece(vocab, + unk_token, + max_input_chars_per_word, + continuing_subword_prefix), + trie_(continuing_subword_prefix, unk_token, with_pretokenization), + with_pretokenization_(with_pretokenization), + failure_array_(with_pretokenization) { + InitFailureAndTrie(); +} + +void FastWordPiece::PrecomputeEncodeValueForSubwordPrefix() { + auto subword_prefix_tokens = WordPiece::Tokenize(continuing_subword_prefix_); + encoded_value_for_subword_prefix_.reserve(subword_prefix_tokens.size()); + + for (auto& token : subword_prefix_tokens) { + utils::FailureVocabToken failure_vocab_token( + token.value_, token.id_, continuing_subword_prefix_); + int encoded_value = utils::EncodeToken( + failure_vocab_token.TokenId(), + failure_vocab_token.TokenLengthWithoutContinuingSubwordPrefix(), + failure_vocab_token.IsSuffixToken()); + encoded_value_for_subword_prefix_.push_back(encoded_value); + } +} + +bool FastWordPiece::TryFollowFailureLinkAndCollectTokens( + const std::string& sequence, + int sequence_offset_in_text, + int* curr_offset_in_sequence, + utils::Trie::TraversalCursor* node, + std::vector* tokens) const { + int curr_node_value = 0; + if (trie_.TryGetData(*node, &curr_node_value)) { + AppendTokensToOutput(sequence, + sequence_offset_in_text, + curr_offset_in_sequence, + curr_node_value, + tokens); + trie_.SetTraversalCursor( + node, failure_array_.GetFailure(node->node_id_)->failure_link_); + return true; + } + const auto& node_aux = failure_array_.GetFailure(node->node_id_); + + if (node_aux->failure_link_ == utils::kNullNode) { + // No failure_link can be followed. 
+ return false; + } + int offset = 0, length = 0; + utils::GetFailurePopsOffsetAndLength( + node_aux->failure_pops_offset_length_, &offset, &length); + for (int i = offset; i < offset + length; ++i) { + AppendTokensToOutput(sequence, + sequence_offset_in_text, + curr_offset_in_sequence, + failure_array_.GetFailurePop(i), + tokens); + } + trie_.SetTraversalCursor(node, node_aux->failure_link_); + return true; +} + +void FastWordPiece::AppendTokensToOutput( + const std::string& sequence, + int sequence_offset_in_text, + int* curr_offset_in_sequence, + int curr_node_value, + std::vector* tokens) const { + uint32_t id = utils::GetTokenIdFromEncodedValue(curr_node_value); + std::string value; + int token_substr_length = + utils::GetTokenLengthFromEncodedValue(curr_node_value); + if (*curr_offset_in_sequence == 0 && + utils::IsSuffixTokenFromEncodedValue(curr_node_value)) { + token_substr_length += continuing_subword_prefix_.size(); + } + + if (id == unk_token_id_) { + value = unk_token_; + } else { + auto c_offset = *curr_offset_in_sequence; + c_offset = (std::min)(c_offset, static_cast(sequence.length() - 1)); + value = sequence.substr(*curr_offset_in_sequence, token_substr_length); + } + + if (*curr_offset_in_sequence > 0) { + value = continuing_subword_prefix_ + value; + } + core::Offset offset = { + sequence_offset_in_text + *curr_offset_in_sequence, + sequence_offset_in_text + *curr_offset_in_sequence + token_substr_length}; + tokens->emplace_back(id, value, offset); + + *curr_offset_in_sequence += token_substr_length; +} + +void FastWordPiece::ResetOutputAppendUNK( + int sequence_offset_in_text, + int sequence_size, + int* original_num_tokens, + std::vector* tokens) const { + tokens->resize(*original_num_tokens + 1); + tokens->back() = { + unk_token_id_, + unk_token_, + {sequence_offset_in_text, sequence_offset_in_text + sequence_size}}; + (*original_num_tokens)++; +} + +bool FastWordPiece::TryHandleContinuingSubWordPrefix( + const std::string& sequence, + int sequence_offset_in_text, + const utils::Trie::TraversalCursor& curr_node, + int* original_num_tokens, + int* curr_offset_in_sequence, + std::vector* tokens) const { + if (curr_node.node_id_ != trie_.GetSuffixRoot()) { + return false; + } + int cur_num_tokens = tokens->size(); + if (cur_num_tokens != *original_num_tokens) { + return false; + } + if (encoded_value_for_subword_prefix_.size() == 1 && + utils::GetTokenIdFromEncodedValue(encoded_value_for_subword_prefix_[0]) == + unk_token_id_) { + ResetOutputAppendUNK( + sequence_offset_in_text, sequence.size(), original_num_tokens, tokens); + return true; + } + for (int encoded_token_value : encoded_value_for_subword_prefix_) { + AppendTokensToOutput(sequence, + sequence_offset_in_text, + curr_offset_in_sequence, + encoded_token_value, + tokens); + } + return true; +} + +void FastWordPiece::HandleTheRemainingStringOnTriePath( + const std::string& sequence, + int sequence_offset_in_text, + utils::Trie::TraversalCursor* curr_node, + int* original_num_tokens, + int* curr_offset_in_sequence, + std::vector* tokens) const { + if (curr_node->node_id_ == utils::Trie::kRootNodeId) { + return; + } + if (TryHandleContinuingSubWordPrefix(sequence, + sequence_offset_in_text, + *curr_node, + original_num_tokens, + curr_offset_in_sequence, + tokens)) { + *original_num_tokens = tokens->size(); + return; + } + while (curr_node->node_id_ != trie_.GetSuffixRoot() && + curr_node->node_id_ != trie_.GetPuncFailureNode()) { + if (!TryFollowFailureLinkAndCollectTokens(sequence, + sequence_offset_in_text, + 
curr_offset_in_sequence, + curr_node, + tokens)) { + ResetOutputAppendUNK(sequence_offset_in_text, + sequence.size(), + original_num_tokens, + tokens); + return; + } + } + *original_num_tokens = tokens->size(); +} + +int FastWordPiece::SkipRemainingOfWordAndTrailingWhiteSpaces( + const std::string& sequence, int* curr_idx) const { + int seq_len = sequence.length(); + uint32_t curr_unicode_char; + int end_of_word = *curr_idx; + while (*curr_idx < seq_len) { + auto chwidth = + utils::UTF8ToUInt32(sequence.data() + *curr_idx, &curr_unicode_char); + curr_unicode_char = utils::UTF8ToUnicode(curr_unicode_char); + if (u_isUWhiteSpace(curr_unicode_char)) { + *curr_idx += chwidth; + break; + } + if (utils::IsPunctuationOrChineseChar(curr_unicode_char)) { + break; + } + *curr_idx += chwidth; + end_of_word = *curr_idx; + } + return end_of_word; +} + +std::vector FastWordPiece::TokenizeWithoutPreTokenize( + const std::string& sequence) const { + VLOG(6) << "Using FastWordPiece::TokenizeWithoutPreTokenize to tokenize " + "sequence"; + if (sequence.empty()) { + return {}; + } + std::vector all_tokens; + size_t unicode_len = + utils::GetUnicodeLenFromUTF8(sequence.data(), sequence.length()); + int original_num_tokens = 0; + if (unicode_len > max_input_chars_per_word_) { + ResetOutputAppendUNK(0, sequence.size(), &original_num_tokens, &all_tokens); + } else { + int curr_offset_in_sequence = 0; + auto curr_node = trie_.CreateRootTraversalCursor(); + for (auto ch : sequence) { + while (!trie_.TryTraverseOneStep(&curr_node, ch)) { + if (!TryFollowFailureLinkAndCollectTokens(sequence, + 0, + &curr_offset_in_sequence, + &curr_node, + &all_tokens)) { + ResetOutputAppendUNK( + 0, sequence.size(), &original_num_tokens, &all_tokens); + return all_tokens; + } + } + } + HandleTheRemainingStringOnTriePath(sequence, + 0, + &curr_node, + &original_num_tokens, + &curr_offset_in_sequence, + &all_tokens); + } + if (all_tokens.size() == 0) { + ResetOutputAppendUNK(0, sequence.size(), &original_num_tokens, &all_tokens); + } + VLOG(6) << "All tokens num from TokenizeWithoutPreTokenize: " + << all_tokens.size(); + return all_tokens; +} + +std::vector FastWordPiece::TokenizeWithPreTokenize( + const std::string& sequence) const { + VLOG(6) + << "Using FastWordPiece::TokenizeWithPreTokenize to tokenize sequence"; + // Need to implement + if (sequence.empty()) { + return {}; + } + std::vector all_tokens; + int original_num_tokens = 0; + uint32_t prev_unicode_char, curr_unicode_char; + int curr_idx = 0; + int chwidth = 0; + auto seq_len = sequence.length(); + while (curr_idx < seq_len) { + int curr_offset_in_word = 0; + auto curr_node = trie_.CreateRootTraversalCursor(); + int bytes_length = 0; + int word_offset_in_sequence = curr_idx; + std::string sequence_substr = sequence.substr(curr_idx); + bool fail_to_match = false; + while (curr_idx < seq_len) { + prev_unicode_char = curr_unicode_char; + chwidth = + utils::UTF8ToUInt32(sequence.data() + curr_idx, &curr_unicode_char); + curr_unicode_char = utils::UTF8ToUnicode(curr_unicode_char); + if (bytes_length + chwidth > max_input_chars_per_word_) { + break; + } + std::string curr_substr = sequence.substr(curr_idx, chwidth); + while (!trie_.TryTraverseSeveralSteps(&curr_node, curr_substr)) { + if (!TryFollowFailureLinkAndCollectTokens(sequence_substr, + word_offset_in_sequence, + &curr_offset_in_word, + &curr_node, + &all_tokens)) { + fail_to_match = true; + break; + } + } + if (fail_to_match) { + break; + } + bytes_length += chwidth; + curr_idx += chwidth; + } + if (curr_idx >= 
seq_len) { + HandleTheRemainingStringOnTriePath(sequence_substr, + word_offset_in_sequence, + &curr_node, + &original_num_tokens, + &curr_offset_in_word, + &all_tokens); + break; + } + bool curr_unicode_char_is_space = u_isUWhiteSpace(curr_unicode_char); + if (curr_unicode_char_is_space || + utils::IsPunctuationOrChineseChar(curr_unicode_char) || + (curr_idx > 0 && + utils::IsPunctuationOrChineseChar(prev_unicode_char))) { + HandleTheRemainingStringOnTriePath( + sequence_substr.substr(0, curr_idx - word_offset_in_sequence), + word_offset_in_sequence, + &curr_node, + &original_num_tokens, + &curr_offset_in_word, + &all_tokens); + if (curr_unicode_char_is_space) { + curr_idx += chwidth; + } + continue; + } + curr_idx += chwidth; + int end_of_word = + SkipRemainingOfWordAndTrailingWhiteSpaces(sequence, &curr_idx); + ResetOutputAppendUNK(word_offset_in_sequence, + end_of_word - word_offset_in_sequence, + &original_num_tokens, + &all_tokens); + } + if (all_tokens.size() == 0) { + ResetOutputAppendUNK(0, sequence.size(), &original_num_tokens, &all_tokens); + } + VLOG(6) << "All tokens num from TokenizeWithPreTokenize: " + << all_tokens.size(); + return all_tokens; +} + +std::vector FastWordPiece::Tokenize(const std::string& sequence) { + if (!with_pretokenization_) { + return TokenizeWithoutPreTokenize(sequence); + } + return TokenizeWithPreTokenize(sequence); +} + +FastWordPiece FastWordPiece::GetFastWordPieceFromFile( + const std::string& file, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix, + bool with_pretokenization) { + auto vocab = GetVocabFromFile(file); + return FastWordPiece(vocab, + unk_token, + max_input_chars_per_word, + continuing_subword_prefix, + with_pretokenization); +} + +void to_json(nlohmann::json& j, const FastWordPiece& model) { + j = { + {"type", "FastWordPiece"}, + {"vocab", model.vocab_}, + {"unk_token", model.unk_token_}, + {"max_input_chars_per_word", model.max_input_chars_per_word_}, + {"continuing_subword_prefix", model.continuing_subword_prefix_}, + {"with_pretokenization", model.with_pretokenization_}, + }; +} + +void from_json(const nlohmann::json& j, FastWordPiece& model) { + j["vocab"].get_to(model.vocab_); + j["unk_token"].get_to(model.unk_token_); + j["max_input_chars_per_word"].get_to(model.max_input_chars_per_word_); + j["continuing_subword_prefix"].get_to(model.continuing_subword_prefix_); + j["with_pretokenization"].get_to(model.with_pretokenization_); + model.InitFailureAndTrie(); +} + +} // namespace models +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.h b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.h new file mode 100644 index 0000000000000000000000000000000000000000..f26534ba6753cf34db90a784b2e78e7a254418dc --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.h @@ -0,0 +1,95 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "fast_tokenizer/models/model.h"
+#include "fast_tokenizer/models/wordpiece.h"
+#include "fast_tokenizer/utils/failure.h"
+#include "fast_tokenizer/utils/trie.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "nlohmann/json.hpp"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace models {
+
+struct FASTTOKENIZER_DECL FastWordPiece : public WordPiece {
+  FastWordPiece();
+  FastWordPiece(const core::Vocab& vocab,
+                const std::string& unk_token = "[UNK]",
+                size_t max_input_chars_per_word = 100,
+                const std::string& continuing_subword_prefix = "##",
+                bool with_pretokenization = false);
+
+  virtual std::vector<core::Token> Tokenize(
+      const std::string& sequence) override;
+  static FastWordPiece GetFastWordPieceFromFile(
+      const std::string& file,
+      const std::string& unk_token = "[UNK]",
+      size_t max_input_chars_per_word = 100,
+      const std::string& continuing_subword_prefix = "##",
+      bool with_pretokenization = false);
+
+private:
+  void InitFailureAndTrie();
+  std::vector<core::Token> TokenizeWithoutPreTokenize(
+      const std::string& sequence) const;
+  std::vector<core::Token> TokenizeWithPreTokenize(
+      const std::string& sequence) const;
+  bool TryFollowFailureLinkAndCollectTokens(
+      const std::string& sequence,
+      int sequence_offset_in_text,
+      int* curr_offset_in_sequence,
+      utils::Trie::TraversalCursor* node,
+      std::vector<core::Token>* tokens) const;
+
+  void AppendTokensToOutput(const std::string& sequence,
+                            int sequence_offset_in_text,
+                            int* curr_offset_in_sequence,
+                            int curr_node_value,
+                            std::vector<core::Token>* tokens) const;
+  void HandleTheRemainingStringOnTriePath(
+      const std::string& sequence,
+      int sequence_offset_in_text,
+      utils::Trie::TraversalCursor* node,
+      int* original_num_tokens,
+      int* curr_offset_in_sequence,
+      std::vector<core::Token>* tokens) const;
+  bool TryHandleContinuingSubWordPrefix(
+      const std::string& sequence,
+      int sequence_offset_in_text,
+      const utils::Trie::TraversalCursor& node,
+      int* original_num_tokens,
+      int* curr_offset_in_sequence,
+      std::vector<core::Token>* tokens) const;
+  void ResetOutputAppendUNK(int sequence_offset_in_text,
+                            int sequence_size,
+                            int* original_num_tokens,
+                            std::vector<core::Token>* tokens) const;
+  int SkipRemainingOfWordAndTrailingWhiteSpaces(const std::string& sequence,
+                                                int* curr_idx) const;
+  void PrecomputeEncodeValueForSubwordPrefix();
+  utils::Trie trie_;
+  utils::FailureArray failure_array_;
+  std::vector<int> encoded_value_for_subword_prefix_;
+  friend void to_json(nlohmann::json& j, const FastWordPiece& model);
+  friend void from_json(const nlohmann::json& j, FastWordPiece& model);
+  bool with_pretokenization_;  // The end-to-end version of FastWordPiece
+};
+
+}  // namespace models
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/models/model.h b/fast_tokenizer/fast_tokenizer/models/model.h
new file mode 100644
index 0000000000000000000000000000000000000000..8a8f8daddf6f1d4c0092a477b20975a91e1b6265
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/models/model.h
@@ -0,0 +1,39 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +struct FASTTOKENIZER_DECL Model { + virtual std::vector Tokenize(const std::string& tokens) = 0; + virtual bool TokenToId(const std::string& token, uint32_t* id) const = 0; + virtual bool IdToToken(uint32_t id, std::string* token) const = 0; + virtual core::Vocab GetVocab() const = 0; + virtual size_t GetVocabSize() const = 0; + // Return the saved voacb path + virtual std::vector Save( + const std::string& folder, const std::string& filename_prefix) const = 0; +}; + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/models.h b/fast_tokenizer/fast_tokenizer/models/models.h new file mode 100644 index 0000000000000000000000000000000000000000..feafdd1ae5902f4242a5cffdf16f67f5fe80f8e6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/models.h @@ -0,0 +1,21 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/models/bpe.h" +#include "fast_tokenizer/models/fast_wordpiece.h" +#include "fast_tokenizer/models/model.h" +#include "fast_tokenizer/models/unigram.h" +#include "fast_tokenizer/models/wordpiece.h" \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/models/unigram.cc b/fast_tokenizer/fast_tokenizer/models/unigram.cc new file mode 100644 index 0000000000000000000000000000000000000000..255ee1c3ca2978cda2c198ec78f7a4117504f117 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/unigram.cc @@ -0,0 +1,436 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/models/unigram.h" +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/unique_ptr.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +constexpr float kUnkPenalty = 10.0; + +Unigram::Unigram() { + core::VocabList vocab = {{"", 0.0}}; + std::vector unk_id = {0}; + Init(vocab, unk_id); +} + +Unigram::Unigram(const core::VocabList& vocab, + const std::vector& unk_id) { + Init(vocab, unk_id); +} + +Unigram::Unigram(const Unigram& other) { Init(other.vocab_, other.unk_id_); } + +void Unigram::Init(const core::VocabList& vocab, + const std::vector& unk_id) { + size_t n = vocab.size(); + if (unk_id.size() > 0) { + if (n == 0) { + std::ostringstream oss; + oss << "EmptyVocabulary error occurs when init unigram with unk token."; + throw std::runtime_error(oss.str()); + } else if (unk_id[0] >= n) { + std::ostringstream oss; + oss << "Unk token id is not in vocab when init unigram with unk token."; + throw std::runtime_error(oss.str()); + } + } + + vocab_ = vocab; + unk_id_ = unk_id; + + bos_id_ = n + 1; + eos_id_ = n + 2; + min_score_ = std::numeric_limits::max(); + + std::vector keys; + std::vector values; + // id = 0 is unk_id_ + for (size_t id = 0; id < n; ++id) { + size_t actual_id = id; + token_to_ids_.insert({vocab[id].first, actual_id}); + keys.push_back(vocab[id].first.c_str()); + values.push_back(actual_id); + if (vocab[id].second < min_score_) { + min_score_ = vocab[id].second; + } + } + + std::vector sorted_keys; + std::vector sorted_values; + utils::GetSortedVocab(keys, values, &sorted_keys, &sorted_values); + trie_ = utils::make_unique(); + if (trie_->build(sorted_keys.size(), + const_cast(&sorted_keys[0]), + nullptr, + &sorted_values[0]) != 0) { + std::ostringstream oss; + oss << "Cannot build double-array."; + throw std::runtime_error(oss.str()); + return; + } + // Computes the maximum number of shared prefixes in the trie. 
+ const int kMaxTrieResultsSize = 1024; + std::vector results( + kMaxTrieResultsSize); + trie_results_size_ = 0; + for (size_t id = 0; id < n; ++id) { + const int num_nodes = trie_->commonPrefixSearch(vocab[id].first.data(), + results.data(), + results.size(), + vocab[id].first.size()); + trie_results_size_ = std::max(trie_results_size_, num_nodes); + } + fuse_unk_ = true; + is_optimized_ = true; + if (trie_results_size_ == 0) { + std::ostringstream oss; + oss << "No entry is found in the trie."; + throw std::runtime_error(oss.str()); + } +} + +float Unigram::GetVocabScore(uint32_t id) const { return vocab_.at(id).second; } + +bool Unigram::TokenToId(const std::string& token, uint32_t* id) const { + if (token_to_ids_.find(token) == token_to_ids_.end()) { + return false; + } + *id = token_to_ids_.at(token); + return true; +} + +bool Unigram::IdToToken(uint32_t id, std::string* token) const { + if (id >= vocab_.size()) { + return false; + } + *token = vocab_[id].first; + return true; +} + +core::Vocab Unigram::GetVocab() const { return token_to_ids_; } + +size_t Unigram::GetVocabSize() const { return vocab_.size(); } + +std::vector Unigram::Tokenize(const std::string& sequence) { + std::vector encode_result; + Encode(sequence, &encode_result); + size_t offset = 0; + std::vector tokens; + tokens.reserve(encode_result.size()); + auto UpdateTokens = [&](const std::string& str) { + uint32_t id = 0; + if (token_to_ids_.find(str) != token_to_ids_.end()) { + id = token_to_ids_.at(str); + } else { + if (unk_id_.size() > 0) { + id = unk_id_[0]; + } + } + auto len = str.length(); + tokens.emplace_back(id, str, core::Offset{offset, offset + len}); + offset += len; + }; + + for (auto&& str : encode_result) { + // Avoid to append the filtered_token_ to encoded_result + if (str == filtered_token_) { + offset += filtered_token_.length(); + continue; + } + // Split the tokenized tokens following some regex rule + if (split_rule_ != nullptr) { + re2::StringPiece result; + int start = 0; + int end = str.length(); + while (split_rule_->Match(str, start, end, RE2::UNANCHORED, &result, 1)) { + int curr_start = result.data() - str.data(); + int res_len = result.length(); + start = curr_start + res_len; + std::string result_str(result.data(), res_len); + if (result_str == filtered_token_) { + offset += filtered_token_.length(); + continue; + } + UpdateTokens(result_str); + } + if (start == 0) { + // Hasn't been splitted + UpdateTokens(str); + } + } else { + UpdateTokens(str); + } + } + return tokens; +} + +std::vector Unigram::Save( + const std::string& folder, const std::string& filename_prefix) const { + std::string vocab_path; + if (filename_prefix == "") { + vocab_path = utils::PathJoin(folder, "unigram.json"); + } else { + vocab_path = utils::PathJoin({folder, filename_prefix, "-unigram.json"}); + } + VLOG(6) << "Vocab path" << vocab_path; + std::ofstream fout(vocab_path); + nlohmann::json j = *this; + fout << j.dump(); + fout.close(); + return {vocab_path}; +} + +void Unigram::PopulateNodes(utils::Lattice* lattice) const { + auto get_chars_length = [&lattice](int begin_pos, const char* end) { + int pos = begin_pos; + while (lattice->surface(pos) < end) ++pos; + return pos - begin_pos; + }; + + const float unk_score = min_score_ - kUnkPenalty; + + const int len = lattice->size(); + const char* end = lattice->sentence() + lattice->utf8_size(); + + // +1 just in case. 
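+  // (num_nodes returned by commonPrefixSearch is checked with CHECK_LT
+  // against this buffer's size below.)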
+ std::vector trie_results( + trie_results_size_ + 1); + + for (int begin_pos = 0; begin_pos < len; ++begin_pos) { + const char* begin = lattice->surface(begin_pos); + + // Finds all pieces which are prefix of surface(begin_pos). + const size_t num_nodes = + trie_->commonPrefixSearch(begin, + trie_results.data(), + trie_results.size(), + static_cast(end - begin)); + CHECK_LT(num_nodes, trie_results.size()); + + bool has_single_node = false; + + // Inserts pieces to the lattice. + for (size_t k = 0; k < num_nodes; ++k) { + const int length = + get_chars_length(begin_pos, begin + trie_results[k].length); + const int id = trie_results[k].value; + utils::Lattice::Node* node = lattice->Insert(begin_pos, length); + node->id = id; // the value of Trie stores vocab_id. + // User defined symbol receives extra bonus to always be selected. + node->score = vocab_[id].second; + + if (!has_single_node && node->length == 1) { + has_single_node = true; + } + } + + if (!has_single_node) { + if (unk_id_.size() > 0) { + utils::Lattice::Node* node = lattice->Insert(begin_pos, 1); + node->id = unk_id_[0]; // add UNK node. + node->score = unk_score; + } + } + } +} + +void Unigram::Encode(const std::string& normalized, + std::vector* encode_result) { + encode_result->clear(); + if (normalized.empty()) { + return; + } + if (!cache_.GetValue(normalized, encode_result)) { + if (is_optimized_) { + EncodeOptimized(normalized, encode_result); + } else { + EncodeUnoptimized(normalized, encode_result); + } + cache_.SetValue(normalized, *encode_result); + } +} + +void Unigram::EncodeOptimized(const std::string& normalized, + std::vector* encode_result) { + // Represents the last node of the best path. + struct BestPathNode { + int id = -1; // The vocab id. (maybe -1 for UNK) + float best_path_score = + 0; // The total score of the best path ending at this node. + int starts_at = + -1; // The starting position (in utf-8) of this node. The entire best + // path can be constructed by backtracking along this link. + }; + const int size = normalized.size(); + const float unk_score = min_score_ - kUnkPenalty; + // The ends are exclusive. + std::vector best_path_ends_at(size + 1); + // Generate lattice on-the-fly (not stored) and update best_path_ends_at. + int starts_at = 0; + while (starts_at < size) { + std::size_t node_pos = 0; + std::size_t key_pos = starts_at; + const auto best_path_score_till_here = + best_path_ends_at[starts_at].best_path_score; + bool has_single_node = false; + const int mblen = std::min( + utils::OneCharLen(normalized.data() + starts_at), size - starts_at); + while (key_pos < size) { + const int ret = + trie_->traverse(normalized.data(), node_pos, key_pos, key_pos + 1); + if (ret == -2) break; + if (ret >= 0) { + // Update the best path node. 
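+        // best_path_ends_at[i] holds the highest-scoring tokenization of
+        // normalized[0, i); extending it with the piece the trie just matched
+        // gives a candidate path ending at key_pos (a Viterbi-style DP).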
+ auto& target_node = best_path_ends_at[key_pos]; + const auto length = (key_pos - starts_at); + const auto score = GetVocabScore(ret); + const auto candidate_best_path_score = + score + best_path_score_till_here; + VLOG(4) << "key_pos: " << key_pos; + VLOG(4) << "score: " << score; + VLOG(4) << "best_path_score_till_here: " << best_path_score_till_here; + VLOG(4) << "starts_at: " << starts_at; + VLOG(4) << "token: " << vocab_.at(ret).first; + if (target_node.starts_at == -1 || + candidate_best_path_score > target_node.best_path_score) { + target_node.best_path_score = candidate_best_path_score; + target_node.starts_at = starts_at; + target_node.id = ret; + } + if (!has_single_node && length == mblen) { + has_single_node = true; + } + } + } + if (!has_single_node) { + auto& target_node = best_path_ends_at[starts_at + mblen]; + const auto candidate_best_path_score = + unk_score + best_path_score_till_here; + if (target_node.starts_at == -1 || + candidate_best_path_score > target_node.best_path_score) { + target_node.best_path_score = candidate_best_path_score; + target_node.starts_at = starts_at; + target_node.id = -1; + if (unk_id_.size() > 0) { + target_node.id = unk_id_[0]; + } + } + } + // Move by one unicode character. + starts_at += mblen; + } + int ends_at = size; + std::vector token; + while (ends_at > 0) { + const auto& node = best_path_ends_at[ends_at]; + auto starts_at = node.starts_at; + if (fuse_unk_ && unk_id_.size() > 0 && node.id == unk_id_[0]) { + token.push_back(normalized.substr(starts_at, ends_at - starts_at)); + } else { + if (!token.empty()) { + encode_result->push_back(""); + auto& back = encode_result->back(); + for (int i = token.size() - 1; i >= 0; --i) { + back.append(token[i]); + } + token.clear(); + } + encode_result->push_back( + normalized.substr(starts_at, ends_at - starts_at)); + } + ends_at = node.starts_at; + } + if (!token.empty()) { + encode_result->push_back(""); + auto& back = encode_result->back(); + for (int i = token.size() - 1; i >= 0; --i) { + back.append(token[i]); + } + } + std::reverse(encode_result->begin(), encode_result->end()); +} + +void Unigram::EncodeUnoptimized(const std::string& normalized, + std::vector* encode_result) { + utils::Lattice lattice; + lattice.SetSentence( + utils::simple_string_view(normalized.data(), normalized.size())); + PopulateNodes(&lattice); + if (fuse_unk_) { + std::string token; + for (const auto* node : lattice.Viterbi().first) { + if (unk_id_.size() > 0 && node->id == unk_id_[0]) { + token.append(node->piece.data(), node->piece.size()); + } else { + if (!token.empty()) { + encode_result->push_back(token); + token.clear(); + } + encode_result->push_back(std::string(node->piece.data())); + } + if (!token.empty()) { + encode_result->push_back(token); + } + } + } else { + for (const auto* node : lattice.Viterbi().first) { + encode_result->push_back(std::string(node->piece.data())); + } + } +} + +void Unigram::SetFilterToken(const std::string& filtered_token) { + filtered_token_ = filtered_token; +} + +void Unigram::SetSplitRule(const std::string& split_rule) { + split_rule_ = utils::make_unique(split_rule); +} + +void to_json(nlohmann::json& j, const Unigram& model) { + std::string split_rule = ""; + if (model.split_rule_ != nullptr) { + split_rule = model.split_rule_->pattern(); + } + j = {{"type", "Unigram"}, + {"unk_id", model.unk_id_}, + {"vocab", model.vocab_}, + {"filter_token", model.filtered_token_}, + {"split_rule", split_rule}}; +} + +void from_json(const nlohmann::json& j, Unigram& model) { + 
std::string filter_token = j.at("filter_token").get(); + std::string split_rule = j.at("split_rule").get(); + model.Init(j.at("vocab").get(), + j.at("unk_id").get>()); + if (!split_rule.empty()) { + model.SetSplitRule(split_rule); + } + model.SetFilterToken(filter_token); +} + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/unigram.h b/fast_tokenizer/fast_tokenizer/models/unigram.h new file mode 100644 index 0000000000000000000000000000000000000000..c66cbbbae5a3b3946cff4263b59c40c668d5f4b4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/unigram.h @@ -0,0 +1,86 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/models/model.h" +#include "fast_tokenizer/utils/cache.h" +#include "fast_tokenizer/utils/lattice.h" +#include "fast_tokenizer/utils/trie.h" + +#include "darts.h" +#include "nlohmann/json.hpp" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +struct FASTTOKENIZER_DECL Unigram : public Model { + Unigram(); + Unigram(const core::VocabList& vocab, const std::vector& unk_id); + Unigram(const Unigram& other); + virtual bool TokenToId(const std::string& token, uint32_t* id) const override; + virtual bool IdToToken(uint32_t id, std::string* token) const override; + virtual core::Vocab GetVocab() const override; + virtual size_t GetVocabSize() const override; + virtual std::vector Tokenize( + const std::string& sequence) override; + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override; + // Set the filter token for unigram. + void SetFilterToken(const std::string& filtered_token); + // Set the special spliting rule for unigram. + void SetSplitRule(const std::string& split_rule); + +private: + float GetVocabScore(uint32_t id) const; + void Init(const core::VocabList& vocab, const std::vector& unk_id); + void PopulateNodes(utils::Lattice* lattice) const; + void Encode(const std::string& normalized, + std::vector* encode_result); + void EncodeOptimized(const std::string& normalized, + std::vector* encode_result); + void EncodeUnoptimized(const std::string& normalized, + std::vector* encode_result); + + core::Vocab token_to_ids_; + core::VocabList vocab_; + utils::Cache> cache_; + std::unique_ptr trie_; + double min_score_; + std::vector unk_id_; + size_t bos_id_; + size_t eos_id_; + bool fuse_unk_; + bool is_optimized_; + int trie_results_size_; + // Some tokenizer, such as ernie-m, may avoid to append some special + // token to final result, the unigram model doesn't filter any tokens + // by default. + std::string filtered_token_; + // For special rule of token spliting after tokenization, + // the unigram model has no spliting rule by default. + // It's useful for some cases, such as ernie-m tokenizer. 
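+  // The rule is an RE2 pattern supplied through SetSplitRule().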
+ std::unique_ptr split_rule_; + + friend void to_json(nlohmann::json& j, const Unigram& model); + friend void from_json(const nlohmann::json& j, Unigram& model); +}; + +} // namespace models +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/wordpiece.cc b/fast_tokenizer/fast_tokenizer/models/wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..079602d128064bda6c359482091800a99150ff1b --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/wordpiece.cc @@ -0,0 +1,294 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include + +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/utf8.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { +const std::string WHITESPACE = " \n\r\t\f\v"; + +WordPiece::WordPiece() + : unk_token_("[UNK]"), + continuing_subword_prefix_("##"), + max_input_chars_per_word_(100), + unk_token_id_(0) {} + +WordPiece::WordPiece(const core::Vocab& vocab, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix, + bool handle_chinese_chars) + : vocab_(vocab), + unk_token_(unk_token), + max_input_chars_per_word_(max_input_chars_per_word), + continuing_subword_prefix_(continuing_subword_prefix), + handle_chinese_chars_(handle_chinese_chars) { + for (const auto& vocab_item : vocab) { + vocab_reversed_[vocab_item.second] = vocab_item.first; + } + unk_token_id_ = vocab.at(unk_token); +} + +// Move version +WordPiece::WordPiece(core::Vocab&& vocab, + std::string&& unk_token, + size_t max_input_chars_per_word, + std::string&& continuing_subword_prefix, + bool handle_chinese_chars) + : vocab_(std::move(vocab)), + unk_token_(std::move(unk_token)), + max_input_chars_per_word_(std::move(max_input_chars_per_word)), + continuing_subword_prefix_(std::move(continuing_subword_prefix)), + handle_chinese_chars_(handle_chinese_chars) { + for (const auto& vocab_item : vocab) { + vocab_reversed_[vocab_item.second] = vocab_item.first; + } + unk_token_id_ = vocab.at(unk_token); +} + +core::Vocab WordPiece::GetVocab() const { return vocab_; } + +size_t WordPiece::GetVocabSize() const { return vocab_.size(); } + +bool WordPiece::TokenToId(const std::string& token, uint32_t* id) const { + if (vocab_.find(token) == vocab_.end()) { + return false; + } + *id = vocab_.at(token); + return true; +} + +bool WordPiece::IdToToken(uint32_t id, std::string* token) const { + if (vocab_reversed_.find(id) == vocab_reversed_.end()) { + return false; + } + *token = vocab_reversed_.at(id); + return true; +} + +std::vector WordPiece::Save( + const std::string& folder, const std::string& filename_prefix) const { + std::string filepath; + if (filename_prefix == "") { + filepath = utils::PathJoin(folder, "vocab.txt"); + } else { + filepath = utils::PathJoin({folder, 
filename_prefix, "-vocab.txt"}); + } + VLOG(6) << "Full path" << filepath; + std::ofstream fout(filepath); + std::vector> vocab(vocab_.begin(), + vocab_.end()); + std::sort(vocab.begin(), + vocab.end(), + [](const std::pair& left, + const std::pair& right) -> bool { + return left.second < right.second; + }); + for (const auto& vocab_item : vocab) { + fout << vocab_item.first << "\n"; + } + fout.close(); + return {filepath}; +} + +static bool CheckIfStringIsAlphaNum(const std::string& str) { + return std::count_if(str.begin(), str.end(), [](char ch) { + return std::isalnum(ch) > 0; + }) == str.length(); +} + +std::vector WordPiece::Tokenize(const std::string& sequence) { + VLOG(6) << "Using WordPiece::Tokenize to tokenize sequence '" << sequence << "'"; + std::vector all_tokens; + size_t unicode_len = + utils::GetUnicodeLenFromUTF8(sequence.data(), sequence.length()); + if (unicode_len > max_input_chars_per_word_) { + all_tokens.emplace_back( + vocab_.at(unk_token_), unk_token_, core::Offset{0, sequence.length()}); + } else { + bool found_token = true; + uint32_t start = 0; + + while (start < sequence.length()) { + uint32_t end = sequence.length(); + core::Token cur_token; + bool match_cur_token = false; + while (start < end) { + std::string sub_str = sequence.substr(start, end - start); + if (start > 0 && + (handle_chinese_chars_ || CheckIfStringIsAlphaNum(sub_str))) { + sub_str = continuing_subword_prefix_ + sub_str; + } + const auto& vocab_iter = vocab_.find(sub_str); + if (vocab_iter != vocab_.end()) { + cur_token = {vocab_iter->second, sub_str, {start, end}}; + match_cur_token = true; + break; + } + // std::u32string u32sub_str = conv.from_bytes(sub_str); + // end -= utils::GetUTF8CharLen(u32sub_str.back()); + for (auto it = sub_str.rbegin(); it != sub_str.rend(); ++it) { + --end; + if (utils::IsCharBeginBoundary(*it)) { + break; + } + } + } + if (!match_cur_token) { + found_token = false; + break; + } + all_tokens.emplace_back(cur_token); + start = end; + } + + if (!found_token) { + all_tokens.clear(); + all_tokens.emplace_back(vocab_.at(unk_token_), + unk_token_, + core::Offset{0, sequence.length()}); + } + } + return all_tokens; +} + + +core::Vocab WordPiece::GetVocabFromFile(const std::string& file) { + std::ifstream fin(file); + core::Vocab vocab; + int i = 0; + constexpr int MAX_BUFFER_SIZE = 256; + char word[MAX_BUFFER_SIZE]; + while (fin.getline(word, MAX_BUFFER_SIZE)) { + std::string word_str = word; + auto leading_spaces = word_str.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)(leading_spaces, word_str.length() - 1); + word_str = word_str.substr(leading_spaces); + } + auto trailing_spaces = word_str.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + word_str = word_str.substr(0, trailing_spaces + 1); + } + if (word_str != "") { + vocab[word_str] = i++; + } + } + return vocab; +} + +WordPiece WordPiece::GetWordPieceFromFile( + const std::string& file, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix) { + auto vocab = GetVocabFromFile(file); + return WordPiece( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix); +} + +void to_json(nlohmann::json& j, const WordPiece& model) { + j = { + {"type", "WordPiece"}, + {"vocab", model.vocab_}, + {"unk_token", model.unk_token_}, + {"max_input_chars_per_word", model.max_input_chars_per_word_}, + {"continuing_subword_prefix", model.continuing_subword_prefix_}, + }; +} + 
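+// Illustrative JSON round-trip (a sketch only; assumes a "vocab.txt" file
+// exists and contains the default "[UNK]" token):
+//   WordPiece model(WordPiece::GetVocabFromFile("vocab.txt"));
+//   nlohmann::json j = model;   // serialized by to_json above
+//   WordPiece restored;
+//   from_json(j, restored);     // deserialized by from_json below
+// Note that from_json only repopulates the serialized fields; derived members
+// such as vocab_reversed_ and unk_token_id_ are not rebuilt here.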
+void from_json(const nlohmann::json& j, WordPiece& model) { + j["vocab"].get_to(model.vocab_); + j["unk_token"].get_to(model.unk_token_); + j["max_input_chars_per_word"].get_to(model.max_input_chars_per_word_); + j["continuing_subword_prefix"].get_to(model.continuing_subword_prefix_); +} + + +WordPieceConfig::WordPieceConfig() + : unk_token_("[UNK]"), + max_input_chars_per_word_(100), + continuing_subword_prefix_("##") {} + + +void WordPieceFactory::SetFiles(const std::string& files) { + config_.files_ = files; +} + +void WordPieceFactory::SetUNKToken(const std::string& unk_token) { + config_.unk_token_ = unk_token; +} + +void WordPieceFactory::SetMaxInputCharsPerWord( + size_t max_input_chars_per_word) { + config_.max_input_chars_per_word_ = max_input_chars_per_word; +} + +void WordPieceFactory::SetContinuingSubwordPrefix( + const std::string& continuing_subword_prefix) { + config_.continuing_subword_prefix_ = continuing_subword_prefix; +} + +WordPiece WordPieceFactory::CreateWordPieceModel() { + std::ifstream fin(config_.files_); + if (fin) { + GetVocabFromFiles(config_.files_); + } else { + VLOG(0) << "File " << config_.files_ + << " doesn't exist or can't be accessed."; + config_.vocab_ = core::Vocab(); + } + return WordPiece{config_.vocab_, + config_.unk_token_, + config_.max_input_chars_per_word_, + config_.continuing_subword_prefix_}; +} + +void WordPieceFactory::GetVocabFromFiles(const std::string& files) { + std::ifstream fin(files); + config_.vocab_.clear(); + int i = 0; + constexpr int MAX_BUFFER_SIZE = 256; + char word[MAX_BUFFER_SIZE]; + while (fin.getline(word, MAX_BUFFER_SIZE)) { + std::string word_str = word; + auto leading_spaces = word_str.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)(leading_spaces, word_str.length() - 1); + word_str = word_str.substr(leading_spaces); + } + auto trailing_spaces = word_str.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + word_str = word_str.substr(0, trailing_spaces + 1); + } + if (word_str != "") { + config_.vocab_[word_str] = i++; + } + } +} + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/wordpiece.h b/fast_tokenizer/fast_tokenizer/models/wordpiece.h new file mode 100644 index 0000000000000000000000000000000000000000..956485522f25b277c665a05001f626c2bc192e3c --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/wordpiece.h @@ -0,0 +1,88 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/models/model.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +struct FASTTOKENIZER_DECL WordPiece : public Model { + WordPiece(); + WordPiece(const core::Vocab& vocab, + const std::string& unk_token = "[UNK]", + size_t max_input_chars_per_word = 100, + const std::string& continuing_subword_prefix = "##", + bool handle_chinese_chars = true); + // Move version + WordPiece(core::Vocab&& vocab, + std::string&& unk_token, + size_t max_input_chars_per_word, + std::string&& continuing_subword_prefix, + bool handle_chinese_chars); + virtual std::vector Tokenize( + const std::string& sequence) override; + virtual bool TokenToId(const std::string& token, uint32_t* id) const override; + virtual bool IdToToken(uint32_t id, std::string* token) const override; + virtual core::Vocab GetVocab() const override; + virtual size_t GetVocabSize() const override; + // Return the saved voacb full path + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override; + static core::Vocab GetVocabFromFile(const std::string& file); + static WordPiece GetWordPieceFromFile( + const std::string& file, + const std::string& unk_token = "[UNK]", + size_t max_input_chars_per_word = 100, + const std::string& continuing_subword_prefix = "##"); + +protected: + core::Vocab vocab_; + core::VocabReversed vocab_reversed_; + std::string unk_token_; + uint32_t unk_token_id_; + size_t max_input_chars_per_word_; + std::string continuing_subword_prefix_; + bool handle_chinese_chars_; + friend void to_json(nlohmann::json& j, const WordPiece& model); + friend void from_json(const nlohmann::json& j, WordPiece& model); +}; + +struct WordPieceConfig { + WordPieceConfig(); + std::string files_; + core::Vocab vocab_; + std::string unk_token_; + size_t max_input_chars_per_word_; + std::string continuing_subword_prefix_; +}; + + +struct WordPieceFactory { + WordPieceConfig config_; + void SetFiles(const std::string& files); + void SetUNKToken(const std::string& unk_token); + void SetMaxInputCharsPerWord(size_t max_input_chars_per_word); + void SetContinuingSubwordPrefix(const std::string& continuing_subword_prefix); + WordPiece CreateWordPieceModel(); + void GetVocabFromFiles(const std::string& files); +}; + +} // namespace models +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/normalizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..b44e4bbcc455f894953c592ce6e43b591dbc7eda --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/CMakeLists.txt @@ -0,0 +1,5 @@ +cc_library(normalizers + SRCS normalizer.cc unicode.cc + utils.cc strip.cc replace.cc bert.cc + precompiled.cc + DEPS re2 json sentencepiece_normalizer icuuc icudata) diff --git a/fast_tokenizer/fast_tokenizer/normalizers/bert.cc b/fast_tokenizer/fast_tokenizer/normalizers/bert.cc new file mode 100644 index 0000000000000000000000000000000000000000..f4745c449cd80f26156b459d427d4a60fd028869 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/bert.cc @@ -0,0 +1,120 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/normalizers/bert.h" + +#include +#include +#include + +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/utils.h" +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" +#include "unicode/uchar.h" +#include "unicode/unistr.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { +BertNormalizer::BertNormalizer(bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase) + : clean_text_(clean_text), + handle_chinese_chars_(handle_chinese_chars), + strip_accents_(strip_accents), + lowercase_(lowercase) {} + +static bool IsControl(int ch) { + if (ch == '\t' || ch == '\n' || ch == '\r') return false; + // It means (general category "C"). + return !u_isprint(ch); +} + +void BertNormalizer::DoCleanText(NormalizedString* input) const { + (*input) + .FilterChar([](char32_t ch) -> bool { + return !(ch == 0 || ch == 0xfffd || IsControl(ch)); + }) + .MapChar([](char32_t ch) -> char32_t { + if (utils::IsWhiteSpace(ch)) { + return ' '; + } + return ch; + }); +} + +void BertNormalizer::DoHandleChineseChars(NormalizedString* input) const { + std::wstring_convert, char32_t> conv; + std::u32string u32input = conv.from_bytes(input->GetStr()); + std::u32string u32output; + std::vector changes; + u32output.reserve(u32input.length() * 3); + changes.reserve(u32input.length() * 3); + for (int i = 0; i < u32input.length(); ++i) { + if (utils::IsChineseChar(u32input[i])) { + u32output.push_back(' '); + u32output.push_back(u32input[i]); + u32output.push_back(' '); + changes.push_back(0); + changes.push_back(1); + changes.push_back(1); + } else { + u32output.push_back(u32input[i]); + changes.push_back(0); + } + } + OffsetMapping new_normalized_offset{u32output, changes}; + input->UpdateNormalized(new_normalized_offset, 0); +} +void BertNormalizer::operator()(NormalizedString* input) const { + if (clean_text_) { + DoCleanText(input); + } + if (handle_chinese_chars_) { + DoHandleChineseChars(input); + } + if (strip_accents_) { + StripAccentsNormalizer()(input); + } + if (lowercase_) { + input->Lowercase(); + } +} + +void to_json(nlohmann::json& j, const BertNormalizer& bert_normalizer) { + j = { + {"type", "BertNormalizer"}, + {"clean_text", bert_normalizer.clean_text_}, + {"handle_chinese_chars", bert_normalizer.handle_chinese_chars_}, + {"strip_accents", bert_normalizer.strip_accents_}, + {"lowercase", bert_normalizer.lowercase_}, + }; +} + +void from_json(const nlohmann::json& j, BertNormalizer& bert_normalizer) { + j.at("clean_text").get_to(bert_normalizer.clean_text_); + j.at("handle_chinese_chars").get_to(bert_normalizer.handle_chinese_chars_); + j.at("lowercase").get_to(bert_normalizer.lowercase_); + if (!j.at("strip_accents").is_null()) { + j.at("strip_accents").get_to(bert_normalizer.strip_accents_); + } else { + bert_normalizer.strip_accents_ = false; + } +} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/bert.h b/fast_tokenizer/fast_tokenizer/normalizers/bert.h new file mode 100644 index 
0000000000000000000000000000000000000000..4312bdefb01e0cce421027906321eaa906c3ccb4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/bert.h @@ -0,0 +1,47 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "nlohmann/json.hpp" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { +struct FASTTOKENIZER_DECL BertNormalizer : public Normalizer { + BertNormalizer(bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true); + virtual void operator()(NormalizedString* input) const override; + BertNormalizer(const BertNormalizer&) = default; + BertNormalizer(BertNormalizer&&) = default; + +private: + bool clean_text_; + bool handle_chinese_chars_; + bool strip_accents_; + bool lowercase_; + void DoCleanText(NormalizedString* input) const; + void DoHandleChineseChars(NormalizedString* input) const; + friend void to_json(nlohmann::json& j, const BertNormalizer& bert_normalizer); + friend void from_json(const nlohmann::json& j, + BertNormalizer& bert_normalizer); +}; +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/normalizer.cc b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..bc63e0845063f6260a4043baee8ee9f9f9faf71a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.cc @@ -0,0 +1,656 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utf8.h" + +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "re2/re2.h" +#include "unicode/edits.h" +#include "unicode/errorcode.h" +#include "unicode/normalizer2.h" +#include "unicode/uchar.h" +#include "unicode/utypes.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +NormalizedString::NormalizedString(const std::string& original) + : original_(original), normalized_(original), original_shift_(0) { + // calculate alignments + std::wstring_convert, char32_t> conv; + std::u32string u32normalized = conv.from_bytes(normalized_); + for (int i = 0; i < u32normalized.length(); ++i) { + auto new_normalized_char_len = utils::GetUTF8CharLen(u32normalized[i]); + uint32_t start = 0; + uint32_t end = 0; + if (i != 0) { + start = alignments_.back().second; + } + end = start + new_normalized_char_len; + for (int j = 0; j < new_normalized_char_len; ++j) { + alignments_.push_back({start, end}); + } + } +} + +NormalizedString::NormalizedString(NormalizedString&& other) + : original_(std::move(other.original_)), + normalized_(std::move(other.normalized_)), + alignments_(std::move(other.alignments_)), + original_shift_(other.original_shift_) {} + +NormalizedString& NormalizedString::operator=(NormalizedString&& other) { + original_ = std::move(other.original_); + normalized_ = std::move(other.normalized_); + alignments_ = std::move(other.alignments_); + original_shift_ = other.original_shift_; + return *this; +} + +const std::string& NormalizedString::GetStr() const { return normalized_; } + +const std::string& NormalizedString::GetOrignalStr() const { return original_; } + +uint32_t NormalizedString::GetLen() const { return normalized_.length(); } + +uint32_t NormalizedString::GetOriginalLen() const { return original_.length(); } + +core::Offset NormalizedString::GetOrginalOffset() const { + return {original_shift_, GetOriginalLen() + original_shift_}; +} + +bool NormalizedString::IsEmpty() const { return normalized_.empty(); } + +bool NormalizedString::IsOriginalEmpty() const { return original_.empty(); } + +void NormalizedString::UpdateNormalized(const OffsetMapping& new_normalized, + uint32_t initial_offset) { + UpdateNormalizedRange(new_normalized, initial_offset, {0, GetLen()}, true); +} + +void NormalizedString::UpdateNormalizedRange( + const OffsetMapping& new_normalized, + uint32_t initial_offset, + const core::Range& range, + bool origin_range) { + auto n_range = range; + if (origin_range) { + ConvertOffsets(&n_range, origin_range); + } + // Retrieve the original characters that are being replaced. This let us + // compute the change in byte sizes along the way. 
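+  // Convention for new_normalized.changes (one entry per normalized char):
+  //    0 : the char replaces exactly one char of the old normalized text;
+  //   >0 : the char is an insertion and consumes no old char;
+  //   <0 : the char additionally consumes that many extra old chars (removal).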
+ std::wstring_convert, char32_t> conv; + n_range.first = (std::min)(n_range.first, + static_cast(normalized_.length() - 1)); + std::u32string u32replaced_normalized = conv.from_bytes( + normalized_.substr(n_range.first, n_range.second - n_range.first)); + uint32_t initial_removed = 0; + // calculate initial_removed + for (int i = 0; i < initial_offset; ++i) { + size_t chwidth = utils::GetUTF8CharLen(u32replaced_normalized[i]); + initial_removed += chwidth; + } + + uint32_t offset = initial_removed + n_range.first; + std::vector alignments; + alignments.reserve(n_range.second - n_range.first); + + int replaced_normalized_idx = initial_removed; + // Calculate the new alignments + for (int i = 0; i < new_normalized.u32normalized.length(); ++i) { + auto idx = offset; + core::Range align; + int curr_changes = new_normalized.changes[i]; + if (curr_changes > 0) { + // Insert a char + if (idx < 1) { + align = {0, 0}; + } else { + align = alignments_[idx - 1]; + } + } else { + align = alignments_[idx]; + } + char32_t new_normalized_char = new_normalized.u32normalized[i]; + auto new_normalized_char_len = utils::GetUTF8CharLen(new_normalized_char); + char32_t replaced_char = -1; + if (curr_changes <= 0) { + replaced_char = u32replaced_normalized[replaced_normalized_idx++]; + } + uint32_t replaced_char_size = + (replaced_char == -1) ? 0 : utils::GetUTF8CharLen(replaced_char); + + uint32_t total_bytes_to_remove = 0; + if (curr_changes < 0) { + for (int j = 0; j < -curr_changes; ++j) { + replaced_char = u32replaced_normalized[replaced_normalized_idx++]; + total_bytes_to_remove += utils::GetUTF8CharLen(replaced_char); + } + } + offset += replaced_char_size + total_bytes_to_remove; + alignments.insert(alignments.end(), new_normalized_char_len, align); + } + // Replace the old alignments in n_range + if (n_range.second - n_range.first >= alignments.size()) { + std::memcpy(alignments_.data() + n_range.first, + alignments.data(), + alignments.size() * sizeof(core::Range)); + alignments_.erase(alignments_.begin() + n_range.first + alignments.size(), + alignments_.begin() + n_range.second); + } else { + std::vector new_alignments; + auto third_len = 0; + if (alignments_.size() > n_range.second) { + third_len = alignments_.size() - n_range.second; + } + new_alignments.resize(n_range.first + alignments.size() + third_len); + if (n_range.first > 0) { + std::copy_n(alignments_.begin(), n_range.first, new_alignments.begin()); + } + std::copy_n(alignments.begin(), + alignments.size(), + new_alignments.begin() + n_range.first); + if (third_len > 0) { + std::copy_n(alignments_.begin() + n_range.second, + third_len, + new_alignments.begin() + n_range.first + alignments.size()); + } + alignments_ = std::move(new_alignments); + } + // Unicode -> UTF8 + uint32_t normalized_utf8_size = 0; + for (auto& ch : new_normalized.u32normalized) { + normalized_utf8_size += utils::GetUTF8CharLen(ch); + } + std::vector utf8_str(normalized_utf8_size + 1); + utils::GetUTF8Str(new_normalized.u32normalized.data(), + utf8_str.data(), + new_normalized.u32normalized.length()); + + // Update normalized_ + auto normalized_iter = normalized_.begin(); + normalized_.replace(normalized_iter + n_range.first, + normalized_iter + n_range.second, + utf8_str.data(), + normalized_utf8_size); +} + +bool NormalizedString::ConvertOffsets(core::Range* range, + bool origin_range) const { + auto len_original = GetOriginalLen(); + auto len_normalized = GetLen(); + if (range->first == range->second) { + return true; + } + if (range->first > range->second) { + 
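+    // Reject inverted ranges.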
return false; + } + if (origin_range && original_.empty() && + (range->first == 0 && range->second == 0)) { + range->second = len_normalized; + return true; + } + if (!origin_range && normalized_.empty() && + (range->first == 0 && range->second == 0)) { + range->second = len_original; + return true; + } + if (origin_range) { + int start = -1; + int end = -1; + for (int i = 0; i < alignments_.size(); ++i) { + if (range->second >= alignments_[i].second) { + if (start < 0 && range->first <= alignments_[i].first) { + if (alignments_[i].first != alignments_[i].second) { + start = i; + } + } + if (range->second >= alignments_[i].second) { + end = i + 1; + } + } + } + if (start > 0 && end < 0) { + *range = {start, start}; + } else if (start < 0 && end > 0) { + *range = {end, end}; + } else if (start > 0 && end > 0) { + *range = {start, end}; + } else { + return false; + } + } else { + range->first = alignments_[range->first].first; + range->second = alignments_[range->second - 1].second; + } + return true; +} + +void NormalizedString::RunNormalization(const std::string& mode) { + icu::ErrorCode icu_error; + const icu::Normalizer2* normalizer = nullptr; + if (mode == "NFD") { + normalizer = icu::Normalizer2::getNFDInstance(icu_error); + } else if (mode == "NFKD") { + normalizer = icu::Normalizer2::getNFKDInstance(icu_error); + } else if (mode == "NFC") { + normalizer = icu::Normalizer2::getNFCInstance(icu_error); + } else if (mode == "NFKC") { + normalizer = icu::Normalizer2::getNFKCInstance(icu_error); + } + std::string normalized_result; + icu::Edits edits; + icu::StringByteSink byte_sink(&normalized_result); + normalizer->normalizeUTF8( + 0, + icu::StringPiece(normalized_.data(), normalized_.size()), + byte_sink, + &edits, + icu_error); + std::wstring_convert, char32_t> conv; + std::u32string u32new_normalized = conv.from_bytes(normalized_result); + // Set changes + std::vector changes; + changes.reserve(u32new_normalized.length()); + auto iter = edits.getFineIterator(); + int old_offset = 0; + int new_offset = 0; + // The edits record the byte level modification, so need to transform to char + // level + // using GetUnicodeLenFromUTF8 + while (iter.next(icu_error)) { + auto old_length = iter.oldLength(); + auto new_length = iter.newLength(); + auto new_unicode_len = utils::GetUnicodeLenFromUTF8( + normalized_result.data() + new_offset, new_length); + auto old_unicode_len = utils::GetUnicodeLenFromUTF8( + normalized_.data() + old_offset, old_length); + old_offset += old_length; + new_offset += new_length; + if (old_unicode_len == new_unicode_len) { + // Just replace the char + changes.insert(changes.end(), old_unicode_len, 0); + } else if (old_unicode_len < new_unicode_len) { + // Insert the char + changes.insert(changes.end(), old_unicode_len, 0); + changes.insert(changes.end(), new_unicode_len - old_unicode_len, 1); + } else /* old_length > new_length */ { + // Remove the char + if (new_unicode_len > 1) { + changes.insert(changes.end(), new_unicode_len - 1, 0); + } + changes.push_back(new_unicode_len - old_unicode_len); + } + } + OffsetMapping new_normalized_offset{u32new_normalized, changes}; + // Update normalized_ and alignments_ + UpdateNormalized(new_normalized_offset, 0); +} + +NormalizedString& NormalizedString::NFD() { + RunNormalization("NFD"); + return *this; +} + +NormalizedString& NormalizedString::NFKD() { + RunNormalization("NFKD"); + return *this; +} + +NormalizedString& NormalizedString::NFC() { + RunNormalization("NFC"); + return *this; +} + +NormalizedString& 
NormalizedString::NFKC() { + RunNormalization("NFKC"); + return *this; +} + +NormalizedString& NormalizedString::LStrip() { return LRStrip(true, false); } + +NormalizedString& NormalizedString::RStrip() { return LRStrip(false, true); } + +const std::string WHITESPACE = " \n\r\t\f\v"; + +NormalizedString& NormalizedString::LRStrip(bool left, bool right) { + uint32_t leading_spaces = 0; + uint32_t trailing_spaces = 0; + std::string new_normalized = normalized_; + if (left) { + leading_spaces = new_normalized.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)( + leading_spaces, static_cast(new_normalized.length() - 1)); + new_normalized = new_normalized.substr(leading_spaces); + } + } + if (right) { + trailing_spaces = new_normalized.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + new_normalized = new_normalized.substr(0, trailing_spaces + 1); + } + } + + std::wstring_convert, char32_t> conv; + std::u32string u32new_normalized = conv.from_bytes(new_normalized); + // Set changes + std::vector changes(u32new_normalized.length(), 0); + changes.back() = -trailing_spaces; + + OffsetMapping new_normalized_offset{u32new_normalized, changes}; + // Update normalized_ and alignments_ + UpdateNormalized(new_normalized_offset, leading_spaces); + return *this; +} + +NormalizedString& NormalizedString::FilterChar( + std::function keep_char_fn) { + std::wstring_convert, char32_t> conv; + std::u32string u32new_normalized; + u32new_normalized.reserve(normalized_.length()); + uint32_t removed_start = 0; + uint32_t removed = 0; + std::vector changes; + changes.reserve(normalized_.length()); + bool has_init_ch = false; + uint32_t last_char; + uint32_t curr_char; + size_t utf8_len = 0; + while (utf8_len < normalized_.length()) { + auto chwidth = + utils::UTF8ToUInt32(normalized_.data() + utf8_len, &curr_char); + curr_char = utils::UTF8ToUnicode(curr_char); + if (keep_char_fn(curr_char)) { + if (has_init_ch) { + u32new_normalized.push_back(last_char); + changes.push_back(-removed); + } else { + has_init_ch = true; + removed_start = removed; + } + last_char = curr_char; + removed = 0; + } else { + removed += 1; + } + utf8_len += chwidth; + } + if (has_init_ch) { + u32new_normalized.push_back(last_char); + changes.push_back(-removed); + } + OffsetMapping new_normalized_offset{u32new_normalized, changes}; + // Update normalized_ and alignments_ + UpdateNormalized(new_normalized_offset, removed_start); + return *this; +} + +NormalizedString& NormalizedString::MapChar( + std::function map_char_fn) { + size_t utf8_len = 0; + std::u32string u32normalized; + uint32_t curr_char; + u32normalized.reserve(normalized_.length()); + while (utf8_len < normalized_.length()) { + auto chwidth = + utils::UTF8ToUInt32(normalized_.data() + utf8_len, &curr_char); + curr_char = utils::UTF8ToUnicode(curr_char); + curr_char = map_char_fn(curr_char); + u32normalized.push_back(curr_char); + utf8_len += chwidth; + } + std::vector changes(u32normalized.size(), 0); + UpdateNormalized({u32normalized, changes}, 0); + return *this; +} + +NormalizedString& NormalizedString::Lowercase() { + std::wstring_convert, char32_t> conv; + std::u32string u32normalized = conv.from_bytes(normalized_); + // Can cover all single char covert cases + for (int i = 0; i < u32normalized.length(); ++i) { + u32normalized[i] = u_tolower(u32normalized[i]); + } + // No need to update normalized range + normalized_ = conv.to_bytes(u32normalized); + return *this; +} + +NormalizedString& 
NormalizedString::Replace(const re2::RE2& pattern, + const std::string& content) { + re2::StringPiece result; + size_t start = 0; + size_t end = normalized_.length(); + int64_t offset = 0; + + std::u32string u32content; + u32content.reserve(content.size()); + std::vector changes; + changes.reserve(content.size()); + + size_t content_utf8_len = 0; + while (content_utf8_len < content.length()) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(content.data() + content_utf8_len, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + u32content.push_back(content_char); + changes.push_back(1); + content_utf8_len += content_char_width; + } + size_t new_len = content.length(); + + OffsetMapping new_normalized{u32content, changes}; + + while (pattern.Match(normalized_, start, end, RE2::UNANCHORED, &result, 1)) { + size_t curr_start = result.data() - normalized_.data(); + size_t old_len = result.length(); + size_t curr_end = curr_start + old_len; + size_t removed_chars = + utils::GetUnicodeLenFromUTF8(normalized_.data() + curr_start, old_len); + UpdateNormalizedRange( + new_normalized, removed_chars, {curr_start, curr_end}, false); + offset = new_len - old_len; + // update start + start = curr_end; + if (offset >= 0) { + start = curr_end + offset; + } else { + size_t uoffset = -offset; + start = (curr_end >= uoffset) ? curr_end - uoffset : 0; + } + end = normalized_.length(); + } + return *this; +} + +NormalizedString& NormalizedString::Prepend(const std::string& content) { + // Get the first unicode char of normalized + uint32_t first_char_of_normalized; + auto first_char_width = + utils::UTF8ToUInt32(normalized_.data(), &first_char_of_normalized); + first_char_of_normalized = utils::UTF8ToUnicode(first_char_of_normalized); + + std::u32string u32content; + u32content.reserve(content.length()); + std::vector changes; + changes.reserve(content.length()); + uint32_t utf8_len = 0; + while (utf8_len < content.length()) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(content.data() + utf8_len, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + u32content.push_back(content_char); + if (utf8_len == 0) { + changes.push_back(0); + } else { + changes.push_back(1); + } + utf8_len += content_char_width; + } + u32content.push_back(first_char_of_normalized); + changes.push_back(1); + UpdateNormalizedRange({u32content, changes}, 0, {0, first_char_width}, false); + return *this; +} + +bool NormalizedString::ValidateRange(const core::Range& range, + bool origin_range) const { + if (origin_range) { + return utils::IsCharBoundary(original_.data() + range.first) && + utils::IsCharBoundary(original_.data() + range.second - 1); + } + return utils::IsCharBoundary(normalized_.data() + range.first) && + utils::IsCharBoundary(normalized_.data() + range.second - 1); +} + +bool NormalizedString::Slice(core::Range range, + NormalizedString* normalized, + bool origin_range) const { + if (ValidateRange(range, origin_range)) { + core::Range normalized_range = range; + core::Range original_range = range; + if (origin_range) { + ConvertOffsets(&normalized_range, true); + } else { + ConvertOffsets(&original_range, false); + } + uint32_t n_shift = original_range.first; + + original_range.first = + (std::min)(original_range.first, + static_cast(this->original_.length() - 1)); + normalized->original_ = this->original_.substr( + original_range.first, original_range.second - original_range.first); + + normalized_range.first = + 
(std::min)(normalized_range.first, + static_cast(this->normalized_.length() - 1)); + normalized->normalized_ = this->normalized_.substr( + normalized_range.first, + normalized_range.second - normalized_range.first); + normalized->alignments_.reserve(normalized_range.second - + normalized_range.first); + for (uint32_t i = normalized_range.first; i < normalized_range.second; + ++i) { + normalized->alignments_.emplace_back( + this->alignments_[i].first - n_shift, + this->alignments_[i].second - n_shift); + } + + normalized->original_shift_ = this->original_shift_ + original_range.first; + return true; + } + return false; +} + +uint32_t NormalizedString::GetMatch( + const std::string& normalized, + const re2::RE2& pattern, + std::vector>* matches, + bool invert) const { + size_t start = 0; + size_t end = normalized.length(); + // Construct the matches whose mode is REMOVED. + re2::StringPiece result; + uint32_t reserved_num = 0; + while (pattern.Match(normalized, start, end, RE2::UNANCHORED, &result, 1)) { + size_t curr_start = result.data() - normalized.data(); + size_t curr_end = curr_start + result.length(); + if (start != curr_start) { + matches->push_back({{start, curr_start}, invert}); + if (!invert) { + ++reserved_num; + } + } + matches->push_back({{curr_start, curr_end}, !invert}); + if (invert) { + ++reserved_num; + } + start = curr_end; + } + if (start < end) { + matches->push_back({{start, end}, invert}); + if (!invert) { + ++reserved_num; + } + } + return reserved_num; +} + +uint32_t NormalizedString::GetMatch( + const std::string& normalized, + const std::function& pattern_func, + std::vector>* matches, + bool invert) const { + size_t utf8_len = 0; + size_t start = 0; + size_t curr_start = 0; + size_t curr_end = 0; + matches->reserve(normalized.length()); + uint32_t ch; + uint32_t reserved_num = 0; + while (utf8_len < normalized.length()) { + auto chwidth = utils::UTF8ToUInt32(normalized.data() + utf8_len, &ch); + ch = utils::UTF8ToUnicode(ch); + if (pattern_func(ch)) { + curr_start = utf8_len; + curr_end = curr_start + chwidth; + if (curr_start != start) { + matches->emplace_back(core::Range{start, curr_start}, invert); + if (!invert) { + ++reserved_num; + } + } + matches->emplace_back(core::Range{curr_start, curr_end}, !invert); + if (invert) { + ++reserved_num; + } + start = curr_end; + } + utf8_len += chwidth; + } + if (start < normalized.length()) { + matches->emplace_back(core::Range{start, normalized.length()}, invert); + if (!invert) { + ++reserved_num; + } + } + return reserved_num; +} + +template void NormalizedString::Split(const re2::RE2& pattern, + core::SplitMode mode, + std::vector* normalizes, + bool invert) const; +template void NormalizedString::Split( + const std::function& pattern_func, + core::SplitMode mode, + std::vector* normalizes, + bool invert) const; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/normalizer.h b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.h new file mode 100644 index 0000000000000000000000000000000000000000..9a8b74e687cb0119ecc255a0f4029953f7e9e355 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.h @@ -0,0 +1,205 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" + +namespace re2 { +class RE2; +} // namespace re2 + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL OffsetMapping { + std::u32string u32normalized; + std::vector changes; // Same size as normalized +}; + +class FASTTOKENIZER_DECL NormalizedString { +public: + NormalizedString(const std::string& original); + NormalizedString(NormalizedString&& other); + NormalizedString(const NormalizedString& other) = default; + NormalizedString& operator=(const NormalizedString& other) = default; + NormalizedString& operator=(NormalizedString&& other); + const std::string& GetStr() const; + const std::string& GetOrignalStr() const; + uint32_t GetLen() const; + uint32_t GetOriginalLen() const; + core::Offset GetOrginalOffset() const; + bool IsEmpty() const; + bool IsOriginalEmpty() const; + + // Unicode Normalization + NormalizedString& NFD(); + NormalizedString& NFKD(); + NormalizedString& NFC(); + NormalizedString& NFKC(); + + // Strip + NormalizedString& LRStrip(bool left, bool right); + NormalizedString& LStrip(); + NormalizedString& RStrip(); + + NormalizedString& FilterChar(std::function keep_char_fn); + NormalizedString& MapChar(std::function map_char_fn); + NormalizedString& Lowercase(); + NormalizedString& Replace(const re2::RE2& pattern, + const std::string& content); + NormalizedString& Prepend(const std::string& content); + bool Slice(core::Range range, + NormalizedString* normalized, + bool origin_range) const; + + void UpdateNormalized(const OffsetMapping& new_normalized, + uint32_t initial_offset); + template + void Split(const PatternType& + pattern, /* re2::RE2 or std::function */ + core::SplitMode mode, + std::vector* normalizes, + bool invert = false) const { + // Vec<(Offsets, should_remove)> + std::vector> matches; + auto normalizes_size = GetMatch(normalized_, pattern, &matches, invert); + // Convert matches + switch (mode) { + case core::SplitMode::REMOVED: + break; + case core::SplitMode::ISOLATED: { + for (auto& match : matches) { + match.second = false; + } + normalizes_size = matches.size(); + break; + } + case core::SplitMode::MERGED_WITH_PREVIOUS: { + bool previous_match = false; + std::vector> new_matches; + for (const auto& match : matches) { + auto offset = match.first; + bool curr_match = match.second; + if (curr_match && !previous_match) { + if (new_matches.size() > 0) { + new_matches.back().first.second = offset.second; + } else { + new_matches.push_back({offset, false}); + } + } else { + new_matches.push_back({offset, false}); + } + previous_match = curr_match; + } + matches = std::move(new_matches); + normalizes_size = matches.size(); + break; + } + case core::SplitMode::MERGED_WITH_NEXT: { + bool previous_match = false; + std::vector> new_matches; + for (auto it = matches.crbegin(); it != matches.crend(); ++it) { + const auto& match = *it; + auto offset = match.first; + bool curr_match = match.second; + if (curr_match && !previous_match) { + if (new_matches.size() > 0) { + 
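+              // Iterating in reverse, so back() is the segment that follows
+              // this match in forward order; extend its start so the match
+              // is merged into it.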
new_matches.back().first.first = offset.first; + } else { + new_matches.push_back({offset, false}); + } + } else { + new_matches.push_back({offset, false}); + } + previous_match = curr_match; + } + matches = std::move(new_matches); + normalizes_size = matches.size(); + std::reverse(matches.begin(), matches.end()); + break; + } + case core::SplitMode::CONTIGUOUS: { + bool previous_match = false; + std::vector> new_matches; + for (const auto& match : matches) { + auto offset = match.first; + bool curr_match = match.second; + if (curr_match == previous_match) { + if (new_matches.size() > 0) { + new_matches.back().first.second = offset.second; + } else { + new_matches.push_back({offset, false}); + } + } else { + new_matches.push_back({offset, false}); + } + previous_match = curr_match; + } + matches = std::move(new_matches); + normalizes_size = matches.size(); + break; + } + default: + break; + } + normalizes->resize(normalizes_size); + int idx = 0; + for (const auto& match : matches) { + if (!match.second) { + Slice(match.first, &(normalizes->at(idx++)), false); + } + } + } + bool ConvertOffsets(core::Range* range, bool origin_range = true) const; + NormalizedString() = default; + +private: + std::string original_; + std::string normalized_; + // In order to keep track of the offset mapping from + // original_ to normalized_ + std::vector alignments_; + uint32_t original_shift_; + + void UpdateNormalizedRange(const OffsetMapping& new_normalized, + uint32_t initial_offset, + const core::Range& range, + bool origin_range = true); + void RunNormalization(const std::string& mode); + bool ValidateRange(const core::Range& range, bool origin_range) const; + + uint32_t GetMatch(const std::string& normalized, + const re2::RE2& pattern, + std::vector>* matches, + bool invert = false) const; + + uint32_t GetMatch(const std::string& normalized, + const std::function& pattern_func, + std::vector>* matches, + bool invert = false) const; +}; + +struct FASTTOKENIZER_DECL Normalizer { + virtual void operator()(NormalizedString* mut_str) const = 0; +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/normalizers.h b/fast_tokenizer/fast_tokenizer/normalizers/normalizers.h new file mode 100644 index 0000000000000000000000000000000000000000..6f29e0c2eb25de0c185aeb3605ad1b7ca8663da6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/normalizers.h @@ -0,0 +1,23 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/normalizers/precompiled.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "fast_tokenizer/normalizers/utils.h" diff --git a/fast_tokenizer/fast_tokenizer/normalizers/precompiled.cc b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.cc new file mode 100644 index 0000000000000000000000000000000000000000..7d5189d30f26e0d5617fc79abc059a56626262cd --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.cc @@ -0,0 +1,87 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/normalizers/precompiled.h" +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/unique_ptr.h" + + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +PrecompiledNormalizer::PrecompiledNormalizer( + const std::string& precompiled_charsmap) { + SetPrecompiledCharsMap(precompiled_charsmap); +} + +PrecompiledNormalizer::PrecompiledNormalizer( + const PrecompiledNormalizer& precompiled_normalizer) + : sentencepiece_normalizer_(new utils::Normalizer( + *precompiled_normalizer.sentencepiece_normalizer_.get())) {} + +static std::string GetByteFromString(const std::string& str) { + std::ostringstream oss; + oss << std::hex << std::setfill('0'); + for (int i = 0; i < str.length(); ++i) { + oss << "\\x" << std::setw(2) << (static_cast(str[i]) & 0xFF); + } + return oss.str(); +} + +void PrecompiledNormalizer::SetPrecompiledCharsMap( + const std::string& precompiled_charsmap) { + sentencepiece_normalizer_ = + utils::make_unique(precompiled_charsmap); +} + +void PrecompiledNormalizer::operator()(NormalizedString* mut_str) const { + std::string normalized; + std::vector norm_to_orig; + std::u32string u32content; + if (sentencepiece_normalizer_->Normalize(mut_str->GetStr().data(), + mut_str->GetStr().length(), + &normalized, + &norm_to_orig, + &u32content)) { + mut_str->UpdateNormalized({u32content, norm_to_orig}, 0); + } +} + +void to_json(nlohmann::json& j, + const PrecompiledNormalizer& precompiled_normalizer) { + const auto& precompiled_str = + precompiled_normalizer.sentencepiece_normalizer_ + ->GetPrecompiledCharsmap(); + std::vector bytes(precompiled_str.begin(), precompiled_str.end()); + j = {{"type", "PrecompiledNormalizer"}, {"precompiled_charsmap", bytes}}; +} + +void from_json(const nlohmann::json& j, + PrecompiledNormalizer& precompiled_normalizer) { + std::vector bytes; + j.at("precompiled_charsmap").get_to(bytes); + std::ostringstream precompiled_charsmap_oss; + for (int i = 0; i < bytes.size(); ++i) { + precompiled_charsmap_oss << static_cast(bytes[i]); + } + precompiled_normalizer.SetPrecompiledCharsMap(precompiled_charsmap_oss.str()); +} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff 
--git a/fast_tokenizer/fast_tokenizer/normalizers/precompiled.h b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.h new file mode 100644 index 0000000000000000000000000000000000000000..3641952030f6706c33227292cb648d264c9bfc32 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.h @@ -0,0 +1,44 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "nlohmann/json.hpp" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/sentencepiece_normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace normalizers { +struct FASTTOKENIZER_DECL PrecompiledNormalizer : public Normalizer { + PrecompiledNormalizer() = default; + explicit PrecompiledNormalizer(const std::string& precompiled_charsmap); + PrecompiledNormalizer(const PrecompiledNormalizer& precompiled_normalizer); + + virtual void operator()(NormalizedString* mut_str) const override; + void SetPrecompiledCharsMap(const std::string& precompiled_charsmap); + +private: + std::unique_ptr sentencepiece_normalizer_; + friend void to_json(nlohmann::json& j, + const PrecompiledNormalizer& precompiled_normalizer); + friend void from_json(const nlohmann::json& j, + PrecompiledNormalizer& precompiled_normalizer); +}; +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/replace.cc b/fast_tokenizer/fast_tokenizer/normalizers/replace.cc new file mode 100644 index 0000000000000000000000000000000000000000..1d7f81d09a5fd61abf1371f59681e3b9913d6ec9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/replace.cc @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/utils/unique_ptr.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +ReplaceNormalizer::ReplaceNormalizer(const std::string& pattern, + const std::string& content) + : pattern_(new re2::RE2(pattern)), content_(content) {} + +ReplaceNormalizer::ReplaceNormalizer( + const ReplaceNormalizer& replace_normalizer) + : pattern_(new re2::RE2(replace_normalizer.pattern_->pattern())), + content_(replace_normalizer.content_) {} + +void ReplaceNormalizer::operator()(NormalizedString* input) const { + input->Replace(*pattern_, content_); +} + +void to_json(nlohmann::json& j, const ReplaceNormalizer& replace_normalizer) { + j = { + {"type", "ReplaceNormalizer"}, + {"pattern", replace_normalizer.pattern_->pattern()}, + {"content", replace_normalizer.content_}, + }; +} + +void from_json(const nlohmann::json& j, ReplaceNormalizer& replace_normalizer) { + replace_normalizer.pattern_ = + utils::make_unique(j.at("pattern").get()); + j.at("content").get_to(replace_normalizer.content_); +} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/replace.h b/fast_tokenizer/fast_tokenizer/normalizers/replace.h new file mode 100644 index 0000000000000000000000000000000000000000..76141f7669c8ad8377153d5e37d5816282fac07a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/replace.h @@ -0,0 +1,45 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "nlohmann/json.hpp" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "re2/re2.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL ReplaceNormalizer : public Normalizer { + ReplaceNormalizer() = default; + ReplaceNormalizer(const std::string& pattern, const std::string& content); + ReplaceNormalizer(const ReplaceNormalizer& replace_normalizer); + virtual void operator()(NormalizedString* mut_str) const override; + friend void to_json(nlohmann::json& j, + const ReplaceNormalizer& replace_normalizer); + friend void from_json(const nlohmann::json& j, + ReplaceNormalizer& replace_normalizer); + +private: + std::unique_ptr pattern_; + std::string content_; +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/strip.cc b/fast_tokenizer/fast_tokenizer/normalizers/strip.cc new file mode 100644 index 0000000000000000000000000000000000000000..c14c23f27164b98c5fda60c73858da39f5cfc145 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/strip.cc @@ -0,0 +1,68 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/normalizers/strip.h" +#include "unicode/translit.h" +#include "unicode/unistr.h" +#include "unicode/utypes.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { +StripNormalizer::StripNormalizer(bool left /* = true*/, bool right /* = true*/) + : left_(left), right_(right) {} + +void StripNormalizer::operator()(NormalizedString* input) const { + if (left_) { + input->LStrip(); + } + if (right_) { + input->RStrip(); + } +} + +void to_json(nlohmann::json& j, const StripNormalizer& strip_normalizer) { + j = { + {"type", "StripNormalizer"}, + {"left", strip_normalizer.left_}, + {"right", strip_normalizer.right_}, + }; +} + +void from_json(const nlohmann::json& j, StripNormalizer& strip_normalizer) { + j.at("left").get_to(strip_normalizer.left_); + j.at("right").get_to(strip_normalizer.right_); +} + +void StripAccentsNormalizer::operator()(NormalizedString* input) const { + input->NFD(); + input->FilterChar([](char32_t ch) -> bool { + // equals to `unicodedata.category(char) == 'Mn'` + return u_charType(ch) != U_NON_SPACING_MARK; + }); +} + +void to_json(nlohmann::json& j, + const StripAccentsNormalizer& strip_normalizer) { + j = { + {"type", "StripAccentsNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, + StripAccentsNormalizer& strip_normalizer) {} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/strip.h b/fast_tokenizer/fast_tokenizer/normalizers/strip.h new file mode 100644 index 0000000000000000000000000000000000000000..e8af13ac36a6110867adfb559fa37ffdcc9ae58b --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/strip.h @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL StripNormalizer : public Normalizer { + StripNormalizer(bool left = true, bool right = true); + virtual void operator()(NormalizedString* input) const override; + StripNormalizer(StripNormalizer&&) = default; + StripNormalizer(const StripNormalizer&) = default; + +private: + bool left_; + bool right_; + friend void to_json(nlohmann::json& j, + const StripNormalizer& strip_normalizer); + friend void from_json(const nlohmann::json& j, + StripNormalizer& strip_normalizer); +}; + +struct FASTTOKENIZER_DECL StripAccentsNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, + const StripAccentsNormalizer& strip_normalizer); + friend void from_json(const nlohmann::json& j, + StripAccentsNormalizer& strip_normalizer); +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/unicode.cc b/fast_tokenizer/fast_tokenizer/normalizers/unicode.cc new file mode 100644 index 0000000000000000000000000000000000000000..5c16c0d00eae9d6b8d678328564dbaa34edd5ee5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/unicode.cc @@ -0,0 +1,103 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include + +#include "fast_tokenizer/normalizers/unicode.h" +#include "unicode/edits.h" +#include "unicode/errorcode.h" +#include "unicode/normalizer2.h" +#include "unicode/utypes.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +void NFCNormalizer::operator()(NormalizedString* input) const { input->NFC(); } + +void to_json(nlohmann::json& j, const NFCNormalizer& normalizer) { + j = { + {"type", "NFCNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFCNormalizer& normalizer) {} + +void NFKCNormalizer::operator()(NormalizedString* input) const { + input->NFKC(); +} + +void to_json(nlohmann::json& j, const NFKCNormalizer& normalizer) { + j = { + {"type", "NFKCNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFKCNormalizer& normalizer) {} + +void NFDNormalizer::operator()(NormalizedString* input) const { input->NFD(); } + +void to_json(nlohmann::json& j, const NFDNormalizer& normalizer) { + j = { + {"type", "NFDNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFDNormalizer& normalizer) {} + +void NFKDNormalizer::operator()(NormalizedString* input) const { + input->NFKD(); +} + +void to_json(nlohmann::json& j, const NFKDNormalizer& normalizer) { + j = { + {"type", "NFKDNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFKDNormalizer& normalizer) {} + +void NmtNormalizer::operator()(NormalizedString* input) const { + input->FilterChar([](char32_t ch) -> bool { + if ((ch >= 0x0001 && ch <= 0x0008) || (ch == 0x000B) || + (ch >= 0x000E && ch <= 0x001F) || (ch == 0x007F) || (ch == 0x008F) || + (ch == 0x009F)) { + return false; + } + return true; + }); + input->MapChar([](char32_t ch) -> char32_t { + if ((ch == 0x0009) || (ch == 0x000A) || (ch == 0x000C) || (ch == 0x000D) || + (ch == 0x1680) || (ch >= 0x200B && ch <= 0x200F) || (ch == 0x2028) || + (ch == 0x2029) || (ch == 0x2581) || (ch == 0xFEFF) || (ch == 0xFFFD)) { + return ' '; + } + return ch; + }); +} + +void to_json(nlohmann::json& j, const NmtNormalizer& normalizer) { + j = { + {"type", "NmtNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NmtNormalizer& normalizer) {} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/unicode.h b/fast_tokenizer/fast_tokenizer/normalizers/unicode.h new file mode 100644 index 0000000000000000000000000000000000000000..6bf9c4b8de42b1117163e466292af2607612aced --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/unicode.h @@ -0,0 +1,57 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL NFCNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFCNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFCNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NFDNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFDNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFDNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NFKCNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFKCNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFKCNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NFKDNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFKDNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFKDNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NmtNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NmtNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NmtNormalizer& normalizer); +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/utils.cc b/fast_tokenizer/fast_tokenizer/normalizers/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..15f6875b87490d80c33f1f6cef411a8270798333 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/utils.cc @@ -0,0 +1,158 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/normalizers/utils.h" +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/precompiled.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "unicode/unistr.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +void SequenceNormalizer::AppendNormalizer(Normalizer* normalizer) { + std::shared_ptr normalizer_ptr; + if (typeid(*normalizer) == typeid(SequenceNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(LowercaseNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(StripNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(StripAccentsNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFCNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFDNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFKCNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFKDNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NmtNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(ReplaceNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(BertNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(PrecompiledNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } + normalizer_ptrs_.push_back(std::move(normalizer_ptr)); +} + +SequenceNormalizer::SequenceNormalizer( + const std::vector& normalizers) { + for (auto& normalizer : normalizers) { + AppendNormalizer(normalizer); + } +} + +void SequenceNormalizer::operator()(NormalizedString* input) const { + std::string result; + for (auto& normalizer : normalizer_ptrs_) { + normalizer->operator()(input); + } +} +void LowercaseNormalizer::operator()(NormalizedString* input) const { + input->Lowercase(); +} + +void to_json(nlohmann::json& j, const SequenceNormalizer& normalizer) { + nlohmann::json jlist; + for (auto& ptr : normalizer.normalizer_ptrs_) { + nlohmann::json jitem; + if (typeid(*ptr) == typeid(SequenceNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(LowercaseNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(StripNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(StripAccentsNormalizer)) { + 
jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFCNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFDNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFKCNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFKDNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NmtNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(ReplaceNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(BertNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(PrecompiledNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } + jlist.push_back(jitem); + } + j = {{"type", "SequenceNormalizer"}, {"normalizers", jlist}}; +} + +void from_json(const nlohmann::json& j, + SequenceNormalizer& sequence_normalizer) { +#define TRY_APPEND_NORMALIZER(NORMALIZER_TYPE) \ + if (normalizer_type == #NORMALIZER_TYPE) { \ + NORMALIZER_TYPE normalizer; \ + normalizer_json.get_to(normalizer); \ + sequence_normalizer.AppendNormalizer(&normalizer); \ + } + + for (auto& normalizer_json : j.at("normalizers")) { + std::string normalizer_type; + normalizer_json.at("type").get_to(normalizer_type); + TRY_APPEND_NORMALIZER(BertNormalizer); + TRY_APPEND_NORMALIZER(PrecompiledNormalizer); + TRY_APPEND_NORMALIZER(ReplaceNormalizer); + TRY_APPEND_NORMALIZER(StripAccentsNormalizer); + TRY_APPEND_NORMALIZER(StripNormalizer); + TRY_APPEND_NORMALIZER(NFCNormalizer); + TRY_APPEND_NORMALIZER(NFKCNormalizer); + TRY_APPEND_NORMALIZER(NFDNormalizer); + TRY_APPEND_NORMALIZER(NFKDNormalizer); + TRY_APPEND_NORMALIZER(NmtNormalizer); + TRY_APPEND_NORMALIZER(LowercaseNormalizer); + TRY_APPEND_NORMALIZER(SequenceNormalizer); + } +#undef TRY_APPEND_NORMALIZER +} + +void to_json(nlohmann::json& j, const LowercaseNormalizer& normalizer) { + j = { + {"type", "LowercaseNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, LowercaseNormalizer& normalizer) {} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/utils.h b/fast_tokenizer/fast_tokenizer/normalizers/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..94fd2c91cdf06f72f6a6e29937161ae441ecf409 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/utils.h @@ -0,0 +1,50 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL SequenceNormalizer : public Normalizer { + SequenceNormalizer() = default; + SequenceNormalizer(const SequenceNormalizer&) = default; + SequenceNormalizer(const std::vector& normalizers); + virtual void operator()(NormalizedString* input) const override; + void AppendNormalizer(Normalizer* normalizer); + +private: + std::vector> normalizer_ptrs_; + friend void to_json(nlohmann::json& j, const SequenceNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, + SequenceNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL LowercaseNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const LowercaseNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, + LowercaseNormalizer& normalizer); +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/postprocessors/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..9d4aad766e77052d84d931bd35ad4a6ac490a606 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/CMakeLists.txt @@ -0,0 +1 @@ +cc_library(postprocessors SRCS bert.cc postprocessor.cc template.cc roberta.cc byte_level.cc DEPS core json) diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/bert.cc b/fast_tokenizer/fast_tokenizer/postprocessors/bert.cc new file mode 100644 index 0000000000000000000000000000000000000000..d40067c9d837adf6f303eb517eeadb43960b9b7d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/bert.cc @@ -0,0 +1,208 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "fast_tokenizer/core/encoding.h" +#include "glog/logging.h" +#include "fast_tokenizer/postprocessors/bert.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +BertPostProcessor::BertPostProcessor() + : sep_({"[SEP]", 102}), cls_({"[CLS]", 101}) {} +BertPostProcessor::BertPostProcessor( + const std::pair& sep, + const std::pair& cls) + : sep_(sep), cls_(cls) {} +size_t BertPostProcessor::AddedTokensNum(bool is_pair) const { + if (is_pair) { + // [CLS] A [SEP] B [SEP] + return 3; + } + // [CLS] A [SEP] + return 2; +} + +void BertPostProcessor::operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + if (!add_special_tokens) { + DefaultProcess(encoding, pair_encoding, result_encoding); + return; + } +// Construct the sequence as: [CLS] A [SEP] +#define CREATE_PROCESSED_ENCODING_SEQ( \ + encoding_ptr, attr, name, head_value, back_value) \ + auto encoding_##name = encoding_ptr->Get##attr(); \ + decltype(encoding_##name) name(encoding_##name.size() + 2); \ + std::copy(encoding_##name.begin(), encoding_##name.end(), name.begin() + 1); \ + name.front() = head_value; \ + name.back() = back_value + // ids + CREATE_PROCESSED_ENCODING_SEQ(encoding, Ids, ids, cls_.second, sep_.second); + // type_ids + CREATE_PROCESSED_ENCODING_SEQ(encoding, TypeIds, type_ids, 0, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Tokens, tokens, cls_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ(encoding, WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Offsets, offsets, empty_offsets, empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + // overflowing + auto& overflowings = encoding->GetMutableOverflowing(); + for (auto& overflow_encoding : overflowings) { + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Ids, ids, cls_.second, sep_.second); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), TypeIds, type_ids, 0, 0); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Tokens, tokens, cls_.first, sep_.first); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), WordsIdx, word_idx, -1, -1); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Offsets, offsets, empty_offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + + overflow_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + + core::Encoding new_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + if (pair_encoding != nullptr) { +#define CREATE_PROCESSED_PARI_ENCODING_SEQ( \ + encoding_ptr, attr, name, 
back_value) \ + auto encoding_##name = encoding_ptr->Get##attr(); \ + decltype(encoding_##name) name(encoding_##name.size() + 1); \ + std::copy(encoding_##name.begin(), encoding_##name.end(), name.begin()); \ + name.back() = back_value + + CREATE_PROCESSED_PARI_ENCODING_SEQ(pair_encoding, Ids, ids, sep_.second); + CREATE_PROCESSED_PARI_ENCODING_SEQ(pair_encoding, TypeIds, type_ids, 1); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + pair_encoding, Tokens, tokens, sep_.first); + CREATE_PROCESSED_PARI_ENCODING_SEQ(pair_encoding, WordsIdx, word_idx, -1); + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_PARI_ENCODING_SEQ( + pair_encoding, Offsets, offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + std::unordered_map sequence_ranges; + sequence_ranges[1] = {0, ids.size() - 1}; + // overflowing + auto& overflowings = pair_encoding->GetMutableOverflowing(); + for (auto& overflow_pair_encoding : overflowings) { + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), Ids, ids, sep_.second); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), TypeIds, type_ids, 1); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), Tokens, tokens, sep_.first); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), WordsIdx, word_idx, -1); + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), Offsets, offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + + overflow_pair_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + + core::Encoding new_pair_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + new_encoding.MergeWith(new_pair_encoding, false); + } +#undef CREATE_PROCESSED_ENCODING_SEQ +#undef CREATE_PROCESSED_PARI_ENCODING_SEQ + *result_encoding = std::move(new_encoding); +} + +void to_json(nlohmann::json& j, const BertPostProcessor& bert_postprocessor) { + j = { + {"type", "BertPostProcessor"}, + {"sep", bert_postprocessor.sep_}, + {"cls", bert_postprocessor.cls_}, + }; +} + +void from_json(const nlohmann::json& j, BertPostProcessor& bert_postprocessor) { + j["cls"].get_to(bert_postprocessor.cls_); + j["sep"].get_to(bert_postprocessor.sep_); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/bert.h b/fast_tokenizer/fast_tokenizer/postprocessors/bert.h new file mode 100644 index 0000000000000000000000000000000000000000..cc8c77f9785be134f39919fe398d278a11f2f2f8 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/bert.h @@ -0,0 +1,43 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "fast_tokenizer/postprocessors/postprocessor.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "nlohmann/json.hpp"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace postprocessors {
+
+struct FASTTOKENIZER_DECL BertPostProcessor : public PostProcessor {
+  BertPostProcessor(const std::pair<std::string, uint32_t>& sep,
+                    const std::pair<std::string, uint32_t>& cls);
+  BertPostProcessor();
+  virtual size_t AddedTokensNum(bool is_pair) const override;
+  virtual void operator()(core::Encoding* encoding,
+                          core::Encoding* pair_encoding,
+                          bool add_special_tokens,
+                          core::Encoding* result_encoding) const override;
+  std::pair<std::string, uint32_t> sep_;
+  std::pair<std::string, uint32_t> cls_;
+  friend void to_json(nlohmann::json& j,
+                      const BertPostProcessor& bert_postprocessor);
+  friend void from_json(const nlohmann::json& j,
+                        BertPostProcessor& bert_postprocessor);
+};
+}  // namespace postprocessors
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.cc b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.cc
new file mode 100644
index 0000000000000000000000000000000000000000..51f0b45ecec43148b464c213d96a3570f85efeaf
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.cc
@@ -0,0 +1,74 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+ +#include "fast_tokenizer/postprocessors/byte_level.h" +#include "fast_tokenizer/pretokenizers/byte_level.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +ByteLevelPostProcessor::ByteLevelPostProcessor(bool add_prefix_space, + bool trim_offsets, + bool use_regex) + : add_prefix_space_(add_prefix_space), + trim_offsets_(trim_offsets), + use_regex_(use_regex) {} + + +size_t ByteLevelPostProcessor::AddedTokensNum(bool is_pair) const { return 0; } + +void ByteLevelPostProcessor::operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + if (trim_offsets_) { + pretokenizers::ProcessOffsets(encoding, add_special_tokens); + for (auto& overflowing : encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + if (pair_encoding != nullptr) { + pretokenizers::ProcessOffsets(pair_encoding, add_special_tokens); + for (auto& overflowing : pair_encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + } + } + + encoding->SetSequenceIds(0); + if (pair_encoding != nullptr) { + pair_encoding->SetSequenceIds(1); + } +} + +void to_json(nlohmann::json& j, + const ByteLevelPostProcessor& byte_level_postprocessor) { + j = { + {"type", "ByteLevelPostProcessor"}, + {"add_prefix_space", byte_level_postprocessor.add_prefix_space_}, + {"trim_offsets", byte_level_postprocessor.trim_offsets_}, + {"use_regex", byte_level_postprocessor.use_regex_}, + }; +} + +void from_json(const nlohmann::json& j, + ByteLevelPostProcessor& byte_level_postprocessor) { + j["add_prefix_space"].get_to(byte_level_postprocessor.add_prefix_space_); + j["trim_offsets"].get_to(byte_level_postprocessor.trim_offsets_); + j["use_regex"].get_to(byte_level_postprocessor.use_regex_); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.h b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.h new file mode 100644 index 0000000000000000000000000000000000000000..75cdd995fec5e5c80ad9d82128bb5b8e5f0e1449 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.h @@ -0,0 +1,46 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#pragma once + +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +struct FASTTOKENIZER_DECL ByteLevelPostProcessor : public PostProcessor { + ByteLevelPostProcessor(bool add_prefix_space = true, + bool trim_offsets = true, + bool use_regex = true); + virtual size_t AddedTokensNum(bool is_pair) const override; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override; + + friend void to_json(nlohmann::json& j, + const ByteLevelPostProcessor& byte_level_postprocessor); + friend void from_json(const nlohmann::json& j, + ByteLevelPostProcessor& byte_level_postprocessor); + bool add_prefix_space_; + bool trim_offsets_; + bool use_regex_; +}; + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.cc b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.cc new file mode 100644 index 0000000000000000000000000000000000000000..abe994beb7d8f38c257e790f55c70babf46010ee --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/core/encoding.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +void PostProcessor::DefaultProcess(core::Encoding* encoding, + core::Encoding* pair_encoding, + core::Encoding* result_encoding) { + if (pair_encoding == nullptr) { + *result_encoding = *encoding; + } else { + encoding->SetSequenceIds(0); + pair_encoding->SetSequenceIds(1); + encoding->MergeWith(*pair_encoding, false); + *result_encoding = *encoding; + } +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.h b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.h new file mode 100644 index 0000000000000000000000000000000000000000..34fda8377a2aadab775b61c6d38e5230467b45eb --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.h @@ -0,0 +1,41 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace core { +class Encoding; +} // namespace core + +namespace postprocessors { + +struct FASTTOKENIZER_DECL PostProcessor { + virtual size_t AddedTokensNum(bool is_pair) const = 0; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const = 0; + static void DefaultProcess(core::Encoding* encoding, + core::Encoding* pair_encoding, + core::Encoding* result_encoding); +}; +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/postprocessors.h b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessors.h new file mode 100644 index 0000000000000000000000000000000000000000..9427f2478b9a487f34419e6fa394ffe84bcde0f2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessors.h @@ -0,0 +1,21 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/postprocessors/bert.h" +#include "fast_tokenizer/postprocessors/byte_level.h" +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/postprocessors/roberta.h" +#include "fast_tokenizer/postprocessors/template.h" diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/roberta.cc b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.cc new file mode 100644 index 0000000000000000000000000000000000000000..4a468847c2220f927c89eb7117656302a3a2091e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.cc @@ -0,0 +1,244 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/postprocessors/roberta.h" +#include "fast_tokenizer/pretokenizers/byte_level.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +RobertaPostProcessor::RobertaPostProcessor( + const std::pair& sep, + const std::pair& cls, + bool trim_offsets, + bool add_prefix_space) + : sep_(sep), + cls_(cls), + trim_offsets_(trim_offsets), + add_prefix_space_(add_prefix_space) {} + +size_t RobertaPostProcessor::AddedTokensNum(bool is_pair) const { + if (is_pair) { + return 4; + } + return 2; +} + +void RobertaPostProcessor::operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + if (trim_offsets_) { + pretokenizers::ProcessOffsets(encoding, add_special_tokens); + for (auto& overflowing : encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + if (pair_encoding != nullptr) { + pretokenizers::ProcessOffsets(pair_encoding, add_special_tokens); + for (auto& overflowing : pair_encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + } + } + encoding->SetTypeIds(std::vector(encoding->GetLen(), 0)); + if (pair_encoding != nullptr) { + pair_encoding->SetTypeIds( + std::vector(pair_encoding->GetLen(), 0)); + } + if (!add_special_tokens) { + DefaultProcess(encoding, pair_encoding, result_encoding); + return; + } +// Construct the sequence as: [CLS] A [SEP] +#define CREATE_PROCESSED_ENCODING_SEQ( \ + encoding_ptr, attr, name, head_value, back_value) \ + auto encoding_##name = encoding_ptr->Get##attr(); \ + decltype(encoding_##name) name(encoding_##name.size() + 2); \ + std::copy(encoding_##name.begin(), encoding_##name.end(), name.begin() + 1); \ + name.front() = head_value; \ + name.back() = back_value + // ids + CREATE_PROCESSED_ENCODING_SEQ(encoding, Ids, ids, cls_.second, sep_.second); + // type_ids. Because there is no nsp task in the roberta, all type_ids are set + // to 0. 
+ std::vector type_ids(encoding->GetTypeIds().size() + 2, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Tokens, tokens, cls_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ(encoding, WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Offsets, offsets, empty_offsets, empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + // overflowing + auto& overflowings = encoding->GetMutableOverflowing(); + for (auto& overflow_encoding : overflowings) { + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Ids, ids, cls_.second, sep_.second); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), TypeIds, type_ids, 0, 0); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Tokens, tokens, cls_.first, sep_.first); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), WordsIdx, word_idx, -1, -1); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Offsets, offsets, empty_offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + + overflow_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + + core::Encoding new_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + + // Construct the sequence as: [CLS] A [SEP] [SEP] B [SEP] + if (pair_encoding != nullptr) { + // ids + CREATE_PROCESSED_ENCODING_SEQ( + pair_encoding, Ids, ids, sep_.second, sep_.second); + // type_ids. Because there is no nsp task in the roberta, all type_ids are + // set to 0. 
+ std::vector type_ids(pair_encoding->GetTypeIds().size() + 2, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + pair_encoding, Tokens, tokens, sep_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ(pair_encoding, WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ( + pair_encoding, Offsets, offsets, empty_offsets, empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[1] = {1, ids.size() - 1}; + // overflowing + auto& overflowings = pair_encoding->GetMutableOverflowing(); + for (auto& overflow_pair_encoding : overflowings) { + // ids + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_pair_encoding), Ids, ids, sep_.second, sep_.second); + // type_ids + std::vector type_ids( + overflow_pair_encoding.GetTypeIds().size() + 2, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_pair_encoding), Tokens, tokens, sep_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_pair_encoding), WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ((&overflow_pair_encoding), + Offsets, + offsets, + empty_offsets, + empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + overflow_pair_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + core::Encoding new_pair_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + new_encoding.MergeWith(new_pair_encoding, false); + } +#undef CREATE_PROCESSED_ENCODING_SEQ + *result_encoding = std::move(new_encoding); +} + +void to_json(nlohmann::json& j, + const RobertaPostProcessor& roberta_postprocessor) { + j = { + {"type", "RobertaPostProcessor"}, + {"sep", roberta_postprocessor.sep_}, + {"cls", roberta_postprocessor.cls_}, + {"trim_offsets", roberta_postprocessor.trim_offsets_}, + {"add_prefix_space", roberta_postprocessor.add_prefix_space_}, + }; +} + +void from_json(const nlohmann::json& j, + RobertaPostProcessor& roberta_postprocessor) { + j["cls"].get_to(roberta_postprocessor.cls_); + j["sep"].get_to(roberta_postprocessor.sep_); + j["trim_offsets"].get_to(roberta_postprocessor.trim_offsets_); + j["add_prefix_space"].get_to(roberta_postprocessor.add_prefix_space_); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/roberta.h b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.h new file mode 100644 index 0000000000000000000000000000000000000000..0601882f1df1cc20872c63161e475bd7505ec086 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.h @@ -0,0 
+1,47 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "fast_tokenizer/postprocessors/postprocessor.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "nlohmann/json.hpp"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace postprocessors {
+
+struct FASTTOKENIZER_DECL RobertaPostProcessor : public PostProcessor {
+  RobertaPostProcessor(
+      const std::pair<std::string, uint32_t>& sep = {"</s>", 2},
+      const std::pair<std::string, uint32_t>& cls = {"<s>", 0},
+      bool trim_offsets = true,
+      bool add_prefix_space = true);
+  virtual size_t AddedTokensNum(bool is_pair) const override;
+  virtual void operator()(core::Encoding* encoding,
+                          core::Encoding* pair_encoding,
+                          bool add_special_tokens,
+                          core::Encoding* result_encoding) const override;
+  std::pair<std::string, uint32_t> sep_;
+  std::pair<std::string, uint32_t> cls_;
+  bool trim_offsets_;
+  bool add_prefix_space_;
+  friend void to_json(nlohmann::json& j,
+                      const RobertaPostProcessor& roberta_postprocessor);
+  friend void from_json(const nlohmann::json& j,
+                        RobertaPostProcessor& roberta_postprocessor);
+};
+}  // namespace postprocessors
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
\ No newline at end of file
diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/template.cc b/fast_tokenizer/fast_tokenizer/postprocessors/template.cc
new file mode 100644
index 0000000000000000000000000000000000000000..28a3eb92587e8eeb1924eacdfb674ac9e60a7328
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/postprocessors/template.cc
@@ -0,0 +1,454 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <string>
+#include <vector>
+
+#include "fast_tokenizer/core/encoding.h"
+#include "fast_tokenizer/postprocessors/template.h"
+#include "glog/logging.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace postprocessors {
+
+void ParseIdFromString(const std::string& template_id_string,
+                       TemplatePiece* template_piece) {
+  if (template_id_string.find_first_of("$") == 0) {
+    *template_piece = TemplateSequence();
+    auto& seq = paddlenlp::get<TemplateSequence>(*template_piece);
+    std::string rest =
+        template_id_string.substr(template_id_string.find_first_not_of("$"));
+    if (rest == "" || rest == "A" || rest == "a") {
+      seq = TemplateSequence{SequenceType::SEQ_A, 0};
+    } else if (rest == "B" || rest == "b") {
+      seq = TemplateSequence{SequenceType::SEQ_B, 0};
+    } else {
+      std::string::size_type sz;
+      uint32_t type_id = std::stoul(rest, &sz);
+      if (sz == rest.length()) {
+        seq = TemplateSequence{SequenceType::SEQ_A, type_id};
+      } else {
+        throw std::runtime_error(
+            "ParseIdFromString error! The format of template piece id should "
+            "be "
+            "$A, $a, $B, $b or ${type_id}");
+      }
+    }
+  } else {
+    *template_piece = TemplateSpecialToken();
+    paddlenlp::get<TemplateSpecialToken>(*template_piece) = {
+        template_id_string, 0};
+  }
+}
+
+void SetTypeId(uint32_t type_id, TemplatePiece* template_piece) {
+  if (paddlenlp::get_if<TemplateSequence>(template_piece) != nullptr) {
+    paddlenlp::get<TemplateSequence>(*template_piece).second = type_id;
+  } else {
+    paddlenlp::get<TemplateSpecialToken>(*template_piece).second = type_id;
+  }
+}
+
+void GetTemplatePieceFromString(const std::string& template_string,
+                                TemplatePiece* template_piece) {
+  auto spliter_idx = template_string.find_first_of(":");
+  if (spliter_idx == std::string::npos) {
+    ParseIdFromString(template_string, template_piece);
+  } else {
+    std::string template_id_string = template_string.substr(0, spliter_idx);
+    std::string template_type_id_string =
+        template_string.substr(spliter_idx + 1);
+    ParseIdFromString(template_id_string, template_piece);
+
+    std::string::size_type sz;
+    uint32_t type_id = std::stoul(template_type_id_string, &sz);
+    if (sz == template_type_id_string.length()) {
+      SetTypeId(type_id, template_piece);
+    } else {
+      throw std::runtime_error(
+          "ParseTypeIdFromString error! 
The type id should be unsigned " + "integer."); + } + } +} + +void to_json(nlohmann::json& j, const TemplatePiece& template_piece) { + if (paddlenlp::get_if(&template_piece) != nullptr) { + auto& template_sequence = paddlenlp::get(template_piece); + j = { + {"Sequence", + { + {"id", template_sequence.first}, + {"type_id", template_sequence.second}, + }}, + }; + } else { + auto& template_special_token = + paddlenlp::get(template_piece); + j = { + {"SpecialToken", + { + {"id", template_special_token.first}, + {"type_id", template_special_token.second}, + }}, + }; + } +} + +void from_json(const nlohmann::json& j, TemplatePiece& template_piece) { + if (j.find("Sequence") != j.end()) { + template_piece = + TemplateSequence(j["Sequence"]["id"], j["Sequence"]["type_id"]); + } else { + template_piece = TemplateSpecialToken(j["SpecialToken"]["id"], + j["SpecialToken"]["type_id"]); + } +} + +void to_json(nlohmann::json& j, const SpecialToken& special_token) { + j = { + {"id", special_token.id_}, + {"ids", special_token.ids_}, + {"tokens", special_token.tokens_}, + }; +} + +void from_json(const nlohmann::json& j, SpecialToken& special_token) { + j["id"].get_to(special_token.id_); + j["ids"].get_to(special_token.ids_); + j["tokens"].get_to(special_token.tokens_); +} + +size_t TemplatePostProcessor::CountAdded( + Template* template_, const SpecialTokensMap& special_tokens_map) { + size_t count = 0; + for (auto& piece : template_->pieces_) { + TemplateSpecialToken* special_token = + paddlenlp::get_if(&piece); + if (special_token != nullptr) { + auto token_iter = + special_tokens_map.tokens_map_.find(special_token->first); + if (token_iter != special_tokens_map.tokens_map_.end()) { + count += token_iter->second.ids_.size(); + } + } + } + return count; +} + +void to_json(nlohmann::json& j, const Template& template_) { + for (auto& piece : template_.pieces_) { + j.push_back(piece); + } +} + +void from_json(const nlohmann::json& j, Template& template_) { + template_.pieces_.resize(j.size()); + for (int i = 0; i < j.size(); ++i) { + j[i].get_to(template_.pieces_[i]); + } +} + +void to_json(nlohmann::json& j, const SpecialTokensMap& tokens_map) { + for (auto it = tokens_map.tokens_map_.begin(); + it != tokens_map.tokens_map_.end(); + ++it) { + j[it->first] = it->second; + } +} + +void from_json(const nlohmann::json& j, SpecialTokensMap& tokens_map) { + SpecialToken special_token; + for (auto it = j.begin(); it != j.end(); ++it) { + tokens_map.tokens_map_[it.key()] = it.value().get_to(special_token); + } +} + +size_t TemplatePostProcessor::DefaultAdded(bool is_single) { + Template* target = nullptr; + if (is_single) { + target = &single_; + } else { + target = &pair_; + } + return CountAdded(target, special_tokens_map_); +} + +void TemplatePostProcessor::UpdateAddedTokensNum() { + added_single_ = DefaultAdded(true); + added_pair_ = DefaultAdded(false); +} + +void TemplatePostProcessor::UpdateSinglePieces( + const std::string& template_str) { + single_.GetPiecesFromStr(template_str); + added_single_ = DefaultAdded(true); +} + +void TemplatePostProcessor::UpdateSinglePieces( + const std::vector& pieces) { + single_.GetPiecesFromVec(pieces); + added_single_ = DefaultAdded(true); +} + +void TemplatePostProcessor::UpdatePairPieces(const std::string& template_str) { + pair_.GetPiecesFromStr(template_str); + added_pair_ = DefaultAdded(false); +} + +void TemplatePostProcessor::UpdatePairPieces( + const std::vector& pieces) { + pair_.GetPiecesFromVec(pieces); + added_pair_ = DefaultAdded(false); +} + 
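// --- Editor's note (illustrative sketch, not part of the patch itself) ------
// The parsing helpers above accept template pieces of the form "$A"/"$B"
// (sequence placeholders), optionally suffixed with ":type_id", plus literal
// special-token strings. A minimal sketch of how such a post-processor might
// be configured, assuming BERT-style special tokens and whitespace-separated
// template pieces (both are assumptions made for illustration only):
//
//   TemplatePostProcessor postprocessor;
//   // Single sequence: [CLS] A [SEP]; every piece defaults to type_id 0.
//   postprocessor.UpdateSinglePieces("[CLS] $A [SEP]");
//   // Sequence pair: the second sequence and its trailing [SEP] use type_id 1.
//   postprocessor.UpdatePairPieces("[CLS] $A:0 [SEP]:0 $B:1 [SEP]:1");
//
// The special-token strings would still need to be registered via
// SetTokensMap() (defined below) so their ids can be looked up when the
// template is applied to an encoding.
// -----------------------------------------------------------------------------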
+TemplatePostProcessor::TemplatePostProcessor() { UpdateAddedTokensNum(); } + +TemplatePostProcessor::TemplatePostProcessor( + const Template& single, + const Template& pair, + const std::vector& special_tokens_map) + : single_(single), pair_(pair), special_tokens_map_(special_tokens_map) { + UpdateAddedTokensNum(); +} + +size_t TemplatePostProcessor::AddedTokensNum(bool is_pair) const { + if (is_pair) { + return added_pair_; + } + return added_single_; +} + +void TemplatePostProcessor::SetTokensMap( + const std::vector& special_tokens) { + special_tokens_map_.SetTokensMap(special_tokens); + UpdateAddedTokensNum(); +} + +void TemplatePostProcessor::ApplyTemplate( + const Template& pieces, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + size_t new_size = 0; + for (auto&& piece : pieces.pieces_) { + if (paddlenlp::get_if(&piece) != nullptr) { + auto seq_type = paddlenlp::get(piece).first; + if (seq_type == SequenceType::SEQ_A) { + new_size += encoding->GetLen(); + } else { + if (pair_encoding == nullptr) { + throw std::runtime_error( + "Template expected a pair sequence, but none provided"); + } + new_size += pair_encoding->GetLen(); + } + } else { + if (add_special_tokens) { + auto&& special_token = + paddlenlp::get(piece).first; + if (special_tokens_map_.tokens_map_.find(special_token) != + special_tokens_map_.tokens_map_.end()) { + new_size += + special_tokens_map_.tokens_map_.at(special_token).ids_.size(); + } + } + } + } + std::vector ids; + ids.reserve(new_size); + std::vector type_ids; + type_ids.reserve(new_size); + std::vector tokens; + tokens.reserve(new_size); + std::vector words_idx; + words_idx.reserve(new_size); + std::vector offsets; + offsets.reserve(new_size); + std::vector special_tokens_mask; + special_tokens_mask.reserve(new_size); + std::vector attention_mask; + attention_mask.reserve(new_size); + std::unordered_map sequence_ranges; + std::vector result_overflowings; + auto& overflowings = encoding->GetMutableOverflowing(); + + core::Encoding result_overflowing_encoding; + for (auto& overflow_encoding : overflowings) { + core::Encoding encoding_copy = overflow_encoding; + core::Encoding pair_encoding_copy; + if (pair_encoding != nullptr) { + pair_encoding_copy = *pair_encoding; + ApplyTemplate(pieces, + &encoding_copy, + &pair_encoding_copy, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + for (auto& pair_overflow_encoding : + pair_encoding->GetMutableOverflowing()) { + core::Encoding pair_encoding_copy = pair_overflow_encoding; + ApplyTemplate(pieces, + &encoding_copy, + &pair_encoding_copy, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + } + } else { + ApplyTemplate(pieces, + &encoding_copy, + pair_encoding, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + } + } + if (pair_encoding != nullptr) { + for (auto& pair_overflow_encoding : + pair_encoding->GetMutableOverflowing()) { + core::Encoding encoding_copy = *encoding; + core::Encoding pair_encoding_copy = pair_overflow_encoding; + ApplyTemplate(pieces, + &encoding_copy, + &pair_encoding_copy, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + } + } + VLOG(6) << "Template pieces num: " << pieces.pieces_.size(); + for (auto& piece : pieces.pieces_) { + if 
(paddlenlp::get_if<TemplateSequence>(&piece) != nullptr) {
+      auto& template_sequence = paddlenlp::get<TemplateSequence>(piece);
+      if (template_sequence.first == SequenceType::SEQ_A) {
+        auto seq_start = ids.size();
+        auto seq_end = seq_start + encoding->GetLen();
+        sequence_ranges[0] = {seq_start, seq_end};
+        ids.insert(
+            ids.end(), encoding->GetIds().begin(), encoding->GetIds().end());
+        type_ids.insert(
+            type_ids.end(), encoding->GetLen(), template_sequence.second);
+        tokens.insert(tokens.end(),
+                      encoding->GetTokens().begin(),
+                      encoding->GetTokens().end());
+        words_idx.insert(words_idx.end(),
+                         encoding->GetWordsIdx().begin(),
+                         encoding->GetWordsIdx().end());
+        offsets.insert(offsets.end(),
+                       encoding->GetOffsets().begin(),
+                       encoding->GetOffsets().end());
+        special_tokens_mask.insert(special_tokens_mask.end(),
+                                   encoding->GetSpecialTokensMask().begin(),
+                                   encoding->GetSpecialTokensMask().end());
+        attention_mask.insert(attention_mask.end(),
+                              encoding->GetAttentionMask().begin(),
+                              encoding->GetAttentionMask().end());
+      } else if (template_sequence.first == SequenceType::SEQ_B) {
+        if (pair_encoding == nullptr) {
+          throw std::runtime_error("Missing pair sequence, checked above");
+        }
+        auto seq_start = ids.size();
+        auto seq_end = seq_start + pair_encoding->GetLen();
+        sequence_ranges[1] = {seq_start, seq_end};
+        ids.insert(ids.end(),
+                   pair_encoding->GetIds().begin(),
+                   pair_encoding->GetIds().end());
+        type_ids.insert(
+            type_ids.end(), pair_encoding->GetLen(), template_sequence.second);
+        tokens.insert(tokens.end(),
+                      pair_encoding->GetTokens().begin(),
+                      pair_encoding->GetTokens().end());
+        words_idx.insert(words_idx.end(),
+                         pair_encoding->GetWordsIdx().begin(),
+                         pair_encoding->GetWordsIdx().end());
+        offsets.insert(offsets.end(),
+                       pair_encoding->GetOffsets().begin(),
+                       pair_encoding->GetOffsets().end());
+        special_tokens_mask.insert(
+            special_tokens_mask.end(),
+            pair_encoding->GetSpecialTokensMask().begin(),
+            pair_encoding->GetSpecialTokensMask().end());
+        attention_mask.insert(attention_mask.end(),
+                              pair_encoding->GetAttentionMask().begin(),
+                              pair_encoding->GetAttentionMask().end());
+      }
+    } else {
+      auto& special_token = paddlenlp::get<TemplateSpecialToken>(piece);
+      if (add_special_tokens) {
+        const std::string& id = special_token.first;
+        uint32_t type_id = special_token.second;
+        auto& tok = special_tokens_map_.tokens_map_.at(
+            id);  // We already checked existence above
+        auto size = tok.ids_.size();
+        ids.insert(ids.end(), tok.ids_.begin(), tok.ids_.end());
+        type_ids.insert(type_ids.end(), size, type_id);
+        tokens.insert(tokens.end(), tok.tokens_.begin(), tok.tokens_.end());
+        words_idx.insert(words_idx.end(), size, -1 /* 2^32 - 1 */);
+        offsets.insert(offsets.end(), size, {0, 0});
+        special_tokens_mask.insert(special_tokens_mask.end(), size, 1);
+        attention_mask.insert(attention_mask.end(), size, 1);
+      }
+    }
+  }
+  *result_encoding = core::Encoding(std::move(ids),
+                                    std::move(type_ids),
+                                    std::move(tokens),
+                                    std::move(words_idx),
+                                    std::move(offsets),
+                                    std::move(special_tokens_mask),
+                                    std::move(attention_mask),
+                                    std::move(result_overflowings),
+                                    std::move(sequence_ranges));
+}
+
+void TemplatePostProcessor::operator()(core::Encoding* encoding,
+                                       core::Encoding* pair_encoding,
+                                       bool add_special_tokens,
+                                       core::Encoding* result_encoding) const {
+  if (pair_encoding != nullptr) {
+    ApplyTemplate(
+        pair_, encoding, pair_encoding, add_special_tokens, result_encoding);
+  } else {
+    ApplyTemplate(
+        single_, encoding, pair_encoding, add_special_tokens, result_encoding);
+  }
+}
+
+void to_json(nlohmann::json& j,
+             const 
TemplatePostProcessor& template_postprocessor) { + j = { + {"type", "TemplateProcessing"}, + {"single", template_postprocessor.single_}, + {"pair", template_postprocessor.pair_}, + {"special_tokens", template_postprocessor.special_tokens_map_}, + }; +} + +void from_json(const nlohmann::json& j, + TemplatePostProcessor& template_postprocessor) { + j["single"].get_to(template_postprocessor.single_); + j["pair"].get_to(template_postprocessor.pair_); + j["special_tokens"].get_to(template_postprocessor.special_tokens_map_); + template_postprocessor.added_single_ = + template_postprocessor.DefaultAdded(true); + template_postprocessor.added_pair_ = + template_postprocessor.DefaultAdded(false); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/template.h b/fast_tokenizer/fast_tokenizer/postprocessors/template.h new file mode 100644 index 0000000000000000000000000000000000000000..aa20de483daa1729fc7ae83d8b3fb03385dddfe0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/template.h @@ -0,0 +1,191 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include + +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/utils/utils.h" +#include "fast_tokenizer/utils/variant.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +enum FASTTOKENIZER_DECL SequenceType { SEQ_A, SEQ_B }; +NLOHMANN_JSON_SERIALIZE_ENUM(SequenceType, + { + {SEQ_A, "A"}, {SEQ_B, "B"}, + }); +// The template indicate `${Id} : ${TypeId}` +using TemplateSequence = std::pair; +using TemplateSpecialToken = std::pair; + +using TemplatePiece = + paddlenlp::variant; +void to_json(nlohmann::json& j, const TemplatePiece& template_piece); +void from_json(const nlohmann::json& j, TemplatePiece& template_piece); + +void ParseIdFromString(const std::string& template_id_string, + TemplatePiece* template_piece); +void SetTypeId(uint32_t type_id, TemplatePiece* template_piece); +void GetTemplatePieceFromString(const std::string& template_string, + TemplatePiece* template_piece); + +struct FASTTOKENIZER_DECL SpecialToken { + std::string id_; + std::vector ids_; + std::vector tokens_; + SpecialToken() = default; + SpecialToken(const std::string& id, + const std::vector& ids, + const std::vector& tokens) + : id_(id), ids_(ids), tokens_(tokens) {} + SpecialToken(const std::string& token, uint32_t id) { + id_ = token; + ids_.push_back(id); + tokens_.push_back(token); + } + friend void to_json(nlohmann::json& j, const SpecialToken& special_token); + friend void from_json(const nlohmann::json& j, SpecialToken& special_token); +}; + +struct FASTTOKENIZER_DECL Template { + std::vector pieces_; + Template() = default; + explicit Template(const std::string& template_str) { + std::vector pieces; + + // Parse the pieces + size_t start = template_str.find_first_not_of(" "); + 
size_t pos; + while ((pos = template_str.find_first_of(" ", start)) != + std::string::npos) { + pieces.push_back(template_str.substr(start, pos - start)); + start = template_str.find_first_not_of(" ", pos); + } + if (start != std::string::npos) { + pieces.push_back(template_str.substr(start)); + } + AddStringPiece(pieces); + } + + explicit Template(const std::vector& pieces) + : pieces_(pieces) {} + explicit Template(const std::vector& pieces) { + AddStringPiece(pieces); + } + + void GetPiecesFromVec(const std::vector& pieces) { + AddStringPiece(pieces); + } + + void GetPiecesFromStr(const std::string& template_str) { + std::vector pieces; + + // Parse the pieces + size_t start = template_str.find_first_not_of(" "); + size_t pos; + while ((pos = template_str.find_first_of(" ", start)) != + std::string::npos) { + pieces.push_back(template_str.substr(start, pos - start)); + start = template_str.find_first_not_of(" ", pos); + } + if (start != std::string::npos) { + pieces.push_back(template_str.substr(start)); + } + AddStringPiece(pieces); + } + + void Clean() { pieces_.clear(); } + +private: + void AddStringPiece(const std::vector& pieces) { + for (auto&& piece : pieces) { + TemplatePiece template_piece; + GetTemplatePieceFromString(piece, &template_piece); + if (paddlenlp::get_if(&template_piece)) { + pieces_.push_back(paddlenlp::get(template_piece)); + } else { + pieces_.push_back(paddlenlp::get(template_piece)); + } + } + } + + friend void to_json(nlohmann::json& j, const Template& template_); + friend void from_json(const nlohmann::json& j, Template& template_); +}; + +struct FASTTOKENIZER_DECL SpecialTokensMap { + std::unordered_map tokens_map_; + SpecialTokensMap() = default; + explicit SpecialTokensMap(const std::vector& special_tokens) { + SetTokensMap(special_tokens); + } + void SetTokensMap(const std::vector& special_tokens) { + tokens_map_.clear(); + for (const auto& special_token : special_tokens) { + tokens_map_.insert({special_token.id_, special_token}); + } + } + friend void to_json(nlohmann::json& j, const SpecialTokensMap& tokens_map); + friend void from_json(const nlohmann::json& j, SpecialTokensMap& tokens_map); +}; + +struct FASTTOKENIZER_DECL TemplatePostProcessor : public PostProcessor { + TemplatePostProcessor(); + TemplatePostProcessor(const Template&, + const Template&, + const std::vector&); + + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override; + virtual size_t AddedTokensNum(bool is_pair) const override; + + void UpdateSinglePieces(const std::string& template_str); + void UpdateSinglePieces(const std::vector& pieces); + void UpdatePairPieces(const std::string& template_str); + void UpdatePairPieces(const std::vector& pieces); + void UpdateAddedTokensNum(); + void SetTokensMap(const std::vector& special_tokens); + size_t CountAdded(Template* template_, + const SpecialTokensMap& special_tokens_map); + size_t DefaultAdded(bool is_single = true); + void ApplyTemplate(const Template& pieces, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const; + + friend void to_json(nlohmann::json& j, + const TemplatePostProcessor& template_postprocessor); + friend void from_json(const nlohmann::json& j, + TemplatePostProcessor& template_postprocessor); + + Template single_; + Template pair_; + size_t added_single_; + size_t added_pair_; + SpecialTokensMap special_tokens_map_; +}; + +} // namespace 
postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/pretokenizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2acf3a9a7a5d01fecebd53bcf64fab52fa960f25 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/CMakeLists.txt @@ -0,0 +1,4 @@ +cc_library(pretokenizers + SRCS pretokenizer.cc whitespace.cc bert.cc metaspace.cc + sequence.cc byte_level.cc split.cc whitespace_and_punctuation.cc + DEPS normalizers core json utils) diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/bert.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.cc new file mode 100644 index 0000000000000000000000000000000000000000..ce45d0c316f223514a93dc6e9037aeb119511662 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/bert.h" + +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" +#include "re2/re2.h" +#include "unicode/uchar.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +// Note (zhoushunjie): Init re2::RE2 objects cost too much, +// ensure not init in the operator() +static re2::RE2 pattern("[\\s\\p{Zs}]+"); +static re2::RE2 punc_pattern("[[:punct:]]|[\\pP]"); + +void BertPreTokenizer::operator()(PreTokenizedString* pretokenized) const { + std::vector normalized_splits; + pretokenized->Split([&normalized_splits]( + int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + // Use single character match instead of regex to improve performance + normalized->Split([](char32_t ch) -> bool { return u_isUWhiteSpace(ch); }, + core::SplitMode::REMOVED, + &normalized_splits); + for (auto&& normalize : normalized_splits) { + if (!normalize.IsEmpty()) { + string_splits->emplace_back(std::move(normalize)); + } + } + }); + normalized_splits.clear(); + pretokenized->Split([&normalized_splits]( + int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + // Use single character match instead of regex to improve performance + normalized->Split( + utils::IsPunctuation, core::SplitMode::ISOLATED, &normalized_splits); + for (auto&& normalize : normalized_splits) { + if (!normalize.IsEmpty()) { + VLOG(6) << "After pretokenized: " << normalize.GetStr(); + string_splits->emplace_back(std::move(normalize)); + } + } + }); +} + +void to_json(nlohmann::json& j, const BertPreTokenizer& bert_pre_tokenizer) { + j = { + {"type", "BertPreTokenizer"}, + }; +} + +void from_json(const nlohmann::json& j, BertPreTokenizer& bert_pre_tokenizer) {} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/bert.h b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.h new file mode 100644 index 
0000000000000000000000000000000000000000..7543ac5d25f212363e0676b094f8e3e887ea9f9d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.h @@ -0,0 +1,35 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL BertPreTokenizer : public PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const BertPreTokenizer& bert_pre_tokenizer); + friend void from_json(const nlohmann::json& j, + BertPreTokenizer& bert_pre_tokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.cc new file mode 100644 index 0000000000000000000000000000000000000000..e686bce3c785386e757d25a419b8d054dbed1497 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.cc @@ -0,0 +1,150 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
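+
+// ByteLevelPreTokenizer optionally splits the input with the GPT-2 style
+// regex defined below and then remaps every byte of the UTF-8 text to a
+// printable character via utils::CreateBytesToChars(), so that later stages
+// (e.g. byte-level BPE) never have to deal with raw whitespace or control
+// bytes.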
+
+#include "fast_tokenizer/pretokenizers/byte_level.h"
+
+#include <codecvt>
+#include <locale>
+
+#include "fast_tokenizer/utils/utf8.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "glog/logging.h"
+#include "re2/re2.h"
+#include "unicode/uchar.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace pretokenizers {
+
+static re2::RE2 pattern(
+    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+)");
+
+static std::unordered_map<uint8_t, uint32_t> BYTES_TO_CHARS =
+    utils::CreateBytesToChars();
+
+ByteLevelPreTokenizer::ByteLevelPreTokenizer(bool add_prefix_space,
+                                             bool use_regex)
+    : add_prefix_space_(add_prefix_space), use_regex_(use_regex) {}
+
+void ByteLevelPreTokenizer::operator()(
+    PreTokenizedString* pretokenized) const {
+  std::vector<normalizers::NormalizedString> normalized_splits;
+  pretokenized->Split([&normalized_splits, this](
+                          int idx,
+                          normalizers::NormalizedString* normalized,
+                          std::vector<StringSplit>* string_splits) {
+    if (this->add_prefix_space_ && normalized->GetStr().find(' ') != 0) {
+      normalized->Prepend(" ");
+    }
+    if (this->use_regex_) {
+      normalized->Split(
+          pattern, core::SplitMode::ISOLATED, &normalized_splits);
+      for (auto&& normalize : normalized_splits) {
+        if (!normalize.IsEmpty()) {
+          string_splits->emplace_back(std::move(normalize));
+        }
+      }
+    } else {
+      string_splits->emplace_back(*normalized);
+    }
+  });
+  pretokenized->Normalize([](normalizers::NormalizedString* normalized) {
+    const std::string& str = normalized->GetStr();
+    std::u32string u32normalized;
+    std::vector<int> changes;
+    size_t utf8_len = 0;
+    uint32_t curr_char;
+    while (utf8_len < str.length()) {
+      auto chwidth = utils::UTF8ToUInt32(str.data() + utf8_len, &curr_char);
+      curr_char = utils::UTF8ToUnicode(curr_char);
+      for (int i = 0; i < chwidth; ++i) {
+        u32normalized.push_back(BYTES_TO_CHARS.at(str[i + utf8_len]));
+        if (i == 0) {
+          changes.push_back(0);
+        } else {
+          changes.push_back(1);
+        }
+      }
+      utf8_len += chwidth;
+    }
+    normalized->UpdateNormalized({u32normalized, changes}, 0);
+  });
+}
+
+void to_json(nlohmann::json& j,
+             const ByteLevelPreTokenizer& byte_pre_tokenizer) {
+  j = {
+      {"type", "ByteLevelPreTokenizer"},
+      {"add_prefix_space", byte_pre_tokenizer.add_prefix_space_},
+      {"use_regex", byte_pre_tokenizer.use_regex_},
+  };
+}
+
+void from_json(const nlohmann::json& j,
+               ByteLevelPreTokenizer& byte_pre_tokenizer) {
+  j.at("add_prefix_space").get_to(byte_pre_tokenizer.add_prefix_space_);
+  j.at("use_regex").get_to(byte_pre_tokenizer.use_regex_);
+}
+
+void ProcessOffsets(core::Encoding* encoding, bool add_prefix_space) {
+  auto process_token_fn =
+      [&](uint32_t i, const std::string& token, core::Offset* offset) -> void {
+    uint32_t leading_spaces = 0;
+    uint32_t trailing_spaces = 0;
+
+    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
+    std::u32string u32token = conv.from_bytes(token);
+    for (int i = 0; i < u32token.size(); ++i) {
+      if (utils::IsWhiteSpace(u32token[i]) ||
+          u32token[i] == BYTES_TO_CHARS.at(' ')) {
+        ++leading_spaces;
+      } else {
+        break;
+      }
+    }
+
+    for (int i = u32token.size() - 1; i >= 0; --i) {
+      if (utils::IsWhiteSpace(u32token[i]) ||
+          u32token[i] == BYTES_TO_CHARS.at(' ')) {
+        ++trailing_spaces;
+      } else {
+        break;
+      }
+    }
+
+    if (leading_spaces > 0 || trailing_spaces > 0) {
+      if (leading_spaces > 0) {
+        bool is_first = (i == 0) || (offset->first == 0);
+        if (is_first && add_prefix_space && leading_spaces == 1) {
+          leading_spaces = 0;
+        }
+        offset->first =
+            (std::min)(offset->first + leading_spaces, offset->second);
+      }
+    }
+    if (trailing_spaces > 0 && offset->second >= 
trailing_spaces) { + offset->second = + (std::max)(offset->second - trailing_spaces, offset->first); + } + }; + encoding->ProcessTokenWithOffsets(process_token_fn); +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.h b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.h new file mode 100644 index 0000000000000000000000000000000000000000..c06dcc373f6dd3cdefeea4272dd01c7a30b20da9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.h @@ -0,0 +1,44 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL ByteLevelPreTokenizer : public PreTokenizer { + ByteLevelPreTokenizer(bool add_prefix_space = true, bool use_regex = true); + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const ByteLevelPreTokenizer& byte_pre_tokenizer); + friend void from_json(const nlohmann::json& j, + ByteLevelPreTokenizer& byte_pre_tokenizer); + +private: + bool add_prefix_space_; + bool use_regex_; +}; + +void FASTTOKENIZER_DECL ProcessOffsets(core::Encoding* encoding, + bool add_prefix_space); + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.cc new file mode 100644 index 0000000000000000000000000000000000000000..f3a26001f11b395a16da257512456b169060d602 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.cc @@ -0,0 +1,87 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pretokenizers/metaspace.h" + +#include "fast_tokenizer/utils/utf8.h" +#include "glog/logging.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +static re2::RE2 pattern(" "); + +void MetaSpacePreTokenizer::UpdateReplacementChar() { + uint32_t ch; + utils::UTF8ToUInt32(replacement_.data(), &ch); + replacement_char_ = utils::UTF8ToUnicode(ch); +} + +MetaSpacePreTokenizer::MetaSpacePreTokenizer(const std::string& replacement, + bool add_prefix_space) + : replacement_(replacement), add_prefix_space_(add_prefix_space) { + UpdateReplacementChar(); +} + +std::string MetaSpacePreTokenizer::GetReplacement() const { + return replacement_; +} + +void MetaSpacePreTokenizer::SetReplacement(const std::string& replacement) { + replacement_ = replacement; + UpdateReplacementChar(); +} + +void MetaSpacePreTokenizer::operator()(PreTokenizedString* pretokenized) const { + std::vector normalized_splits; + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + normalized->Replace(pattern, replacement_); + if (add_prefix_space_ && normalized->GetStr().find(replacement_) != 0) { + normalized->Prepend(replacement_); + } + normalized->Split( + [&](char32_t ch) -> bool { return ch == replacement_char_; }, + core::SplitMode::MERGED_WITH_NEXT, + &normalized_splits); + for (auto&& normalize : normalized_splits) { + if (!normalize.IsEmpty()) { + VLOG(6) << "After pretokenized: " << normalize.GetStr(); + string_splits->emplace_back(std::move(normalize)); + } + } + }); +} + +void to_json(nlohmann::json& j, + const MetaSpacePreTokenizer& meta_pretokenizer) { + j = { + {"type", "MetaSpacePreTokenizer"}, + {"replacement", meta_pretokenizer.replacement_}, + {"add_prefix_space", meta_pretokenizer.add_prefix_space_}, + }; +} + +void from_json(const nlohmann::json& j, + MetaSpacePreTokenizer& meta_pretokenizer) { + j.at("add_prefix_space").get_to(meta_pretokenizer.add_prefix_space_); + meta_pretokenizer.SetReplacement(j.at("replacement").get()); +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.h b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.h new file mode 100644 index 0000000000000000000000000000000000000000..4b5c20504d3a25091e72244addfea7b0c2c29dc9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.h @@ -0,0 +1,48 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL MetaSpacePreTokenizer : public PreTokenizer { + // Replaces white space with U+2581 (LOWER ONE EIGHT BLOCK) + MetaSpacePreTokenizer(const std::string& replacement = "\xe2\x96\x81", + bool add_prefix_space = true); + MetaSpacePreTokenizer(const MetaSpacePreTokenizer&) = default; + virtual void operator()(PreTokenizedString* pretokenized) const override; + std::string GetReplacement() const; + void SetReplacement(const std::string&); + +private: + void UpdateReplacementChar(); + std::string replacement_; + bool add_prefix_space_; + char32_t replacement_char_; + + friend void to_json(nlohmann::json& j, + const MetaSpacePreTokenizer& meta_pretokenizer); + friend void from_json(const nlohmann::json& j, + MetaSpacePreTokenizer& meta_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..47d7b5f6ed5008a8c7f26b7f394c62efdce541fa --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.cc @@ -0,0 +1,277 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" + +#include +#include +#include + +#include "fast_tokenizer/utils/unique_ptr.h" +#include "fast_tokenizer/utils/utf8.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +BytesToCharOffsetConverter::BytesToCharOffsetConverter(const std::string& seq) + : OffsetConverter(seq) { + std::wstring_convert, char32_t> conv; + std::u32string u32seq = conv.from_bytes(seq); + offset_map_.reserve(u32seq.length() * 4); + for (int i = 0; i < u32seq.length(); ++i) { + auto utf8_len = utils::GetUTF8CharLen(u32seq[i]); + for (int j = 0; j < utf8_len; ++j) { + offset_map_.push_back(i); + } + } +} + +bool BytesToCharOffsetConverter::convert(const core::Offset& offset, + core::Offset* result) const { + size_t byte_start = offset.first; + size_t byte_end = offset.second; + if (offset_map_.size() <= byte_start) { + return false; + } + auto char_start = offset_map_.at(byte_start); + auto char_end = char_start + 1; + if (offset_map_.size() > byte_end) { + char_end = offset_map_.at(byte_end); + } else if (offset_map_.size() > byte_end - 1) { + char_end = offset_map_.at(byte_end - 1) + 1; + } + *result = {char_start, char_end}; + return true; +} + + +CharToBytesOffsetConverter::CharToBytesOffsetConverter(const std::string& seq) + : OffsetConverter(seq) { + std::wstring_convert, char32_t> conv; + std::u32string u32seq = conv.from_bytes(seq); + uint32_t index = 0; + offset_map_.reserve(u32seq.length() * 4); + for (int i = 0; i < u32seq.length(); ++i) { + offset_map_.push_back(index); + auto utf8_len = fast_tokenizer::utils::GetUTF8CharLen(u32seq[i]); + index += utf8_len; + } + offset_map_.push_back(index); +} + +bool CharToBytesOffsetConverter::convert(const core::Offset& offset, + core::Offset* result) const { + size_t char_start = offset.first; + size_t char_end = offset.second; + if (offset_map_.size() <= char_start) { + return false; + } + auto byte_start = offset_map_.at(char_start); + auto byte_end = byte_start + 1; + if (offset_map_.size() > char_end) { + byte_end = offset_map_.at(char_end); + } + *result = {byte_start, byte_end}; + return true; +} + +PreTokenizedString::PreTokenizedString(const std::string& original) + : original_(original) { + splits_.emplace_back(std::move(StringSplit(original_))); +} + +PreTokenizedString::PreTokenizedString( + const normalizers::NormalizedString& normalized) + : original_(normalized.GetOrignalStr()) { + splits_.emplace_back(std::move(StringSplit(original_))); +} + +PreTokenizedString& PreTokenizedString::operator=(PreTokenizedString&& other) { + original_ = std::move(other.original_); + splits_ = std::move(other.splits_); + return *this; +} + +size_t PreTokenizedString::GetSplitsSize() const { return splits_.size(); } + +StringSplit PreTokenizedString::GetSplit(int idx) const { return splits_[idx]; } + +const std::string& PreTokenizedString::GetOriginStr() const { + return original_; +} + +void PreTokenizedString::Split( + std::function*)> split_fn) { + std::vector new_splits; + new_splits.reserve(splits_.size()); + for (int i = 0; i < splits_.size(); ++i) { + if (splits_[i].tokens_.size() > 0) { + new_splits.emplace_back(std::move(splits_[i])); + continue; + } + split_fn(i, &splits_[i].normalized_, &new_splits); + } + splits_ = std::move(new_splits); +} + +void PreTokenizedString::Normalize( + std::function normalize_fn) { + for (auto& split : splits_) { + if (split.tokens_.empty()) { + normalize_fn(&split.normalized_); + } + } +} +void 
PreTokenizedString::Tokenize( + std::function(normalizers::NormalizedString*)> + tokenize_fn) { + for (auto& split : splits_) { + if (split.tokens_.empty()) { + split.tokens_ = std::move(tokenize_fn(&split.normalized_)); + } + } +} + +bool PreTokenizedString::TransformToEncoding( + const std::vector& input_word_idx, + uint32_t type_id, + core::OffsetType offset_type, + core::Encoding* encoding) const { + if (splits_.empty()) { + *encoding = core::Encoding(); + return true; + } + for (const auto& split : splits_) { + if (split.tokens_.empty()) { + throw std::logic_error( + "The split of PreTokenizedString is empty, please call " + "PreTokenizedString::Tokenize first before transform to Encoding."); + return false; + } + } + + if (offset_type == core::OffsetType::CHAR) { + return TransformToEncodingUseConvertor( + input_word_idx, type_id, encoding); + } + return TransformToEncodingUseConvertor( + input_word_idx, type_id, encoding); +} + +template +bool PreTokenizedString::TransformToEncodingUseConvertor( + const std::vector& input_word_idx, + uint32_t type_id, + core::Encoding* encoding) const { + Convertor converter(original_); + uint32_t tokens_size = 0; + for (int i = 0; i < splits_.size(); ++i) { + tokens_size += splits_[i].tokens_.size(); + } + + std::vector token_ids(tokens_size); + std::vector tokens(tokens_size); + std::vector offsets(tokens_size); + uint32_t curr_idx = 0; + for (int i = 0; i < splits_.size(); ++i) { + const auto& split = splits_[i]; + const auto& normalized = split.normalized_; + auto offset = normalized.GetOrginalOffset(); + core::Offset tmp_offset; + bool has_set_offset = false; + for (const auto& token : split.tokens_) { + auto token_offset = token.offset_; + bool flag = normalized.ConvertOffsets(&token_offset, false); + if (flag) { + token_offset.first += offset.first; + token_offset.second += offset.first; + } + if (has_set_offset) { + offset = token_offset; + has_set_offset = true; + } + converter.convert(token_offset, &tmp_offset); + token_ids[curr_idx] = token.id_; + tokens[curr_idx] = token.value_; + offsets[curr_idx] = tmp_offset; + ++curr_idx; + } + } + // Setting words_idx + std::vector words_idx(tokens_size); + if (input_word_idx.size() == 0) { + uint32_t word_offset = 0; + for (uint32_t i = 0; i < splits_.size(); ++i) { + std::fill_n( + words_idx.begin() + word_offset, splits_[i].tokens_.size(), i); + word_offset += splits_[i].tokens_.size(); + } + } else { + std::fill(words_idx.begin(), words_idx.end(), input_word_idx[0]); + } + *encoding = std::move(core::Encoding( + std::move(token_ids), + std::vector(tokens_size, type_id), // type_ids + std::move(tokens), + std::move(words_idx), + std::move(offsets), + std::vector(tokens_size, 0), /* special_tokens_mask */ + std::vector(tokens_size, 1), /* attention_mask */ + std::vector(), /* overflowing */ + std::unordered_map() /* sequence_ranges */)); + return true; +} + +void PreTokenizedString::SetOriginalStr(const std::string& original) { + original_ = original; + splits_.clear(); + splits_.emplace_back(original_); +} + +std::vector>> +PreTokenizedString::GetSplits(bool is_original, + const core::OffsetType& offset_type) const { + std::unique_ptr converter; + if (offset_type == core::OffsetType::BYTE) { + converter = utils::make_unique(original_); + } else { + converter = utils::make_unique(original_); + } + std::vector>> + result; + uint32_t offset = 0; + for (auto&& split : splits_) { + core::Offset curr_offset, split_offset; + if (is_original) { + split_offset = split.normalized_.GetOrginalOffset(); + } 
else { + auto len = split.normalized_.GetLen(); + offset += len; + split_offset = {offset - len, offset}; + } + + // Convert to char offsets if relevant + converter->convert(split_offset, &curr_offset); + result.emplace_back(split.normalized_.GetStr(), curr_offset, split.tokens_); + } + return result; +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.h b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..02abfff8949c4e2d2193aa0093f1fdf30e7e67a0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.h @@ -0,0 +1,117 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL StringSplit { + normalizers::NormalizedString normalized_; + std::vector tokens_; + StringSplit(normalizers::NormalizedString&& normalized) + : normalized_(std::move(normalized)) {} + StringSplit(const normalizers::NormalizedString& normalized) + : normalized_(normalized) {} + StringSplit(const normalizers::NormalizedString& normalized, + const std::vector& tokens) + : normalized_(normalized), tokens_(tokens) {} + StringSplit() = default; + StringSplit(const StringSplit& other) = default; + StringSplit(StringSplit&& other) + : tokens_(std::move(other.tokens_)), + normalized_(std::move(other.normalized_)) {} + + StringSplit& operator=(const StringSplit& other) = default; + StringSplit& operator=(StringSplit&& other) { + tokens_ = std::move(other.tokens_); + normalized_ = std::move(other.normalized_); + return *this; + } +}; + +class FASTTOKENIZER_DECL PreTokenizedString { +public: + PreTokenizedString() = default; + PreTokenizedString(const std::string& original); + PreTokenizedString(const normalizers::NormalizedString& normalized); + PreTokenizedString& operator=(PreTokenizedString&& other); + + void Split(std::function*)> split_fn); + void Normalize( + std::function normalize_fn); + // For wordpiece, bpe ...... 
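+  // Tokenize() applies the supplied callback to every split that has not been
+  // tokenized yet and stores the produced tokens on that split.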
+ void Tokenize( + std::function(normalizers::NormalizedString*)> + tokenize_fn); + bool TransformToEncoding(const std::vector& word_idx, + uint32_t type_id, + core::OffsetType offset_type, + core::Encoding* encodings) const; + template + bool TransformToEncodingUseConvertor(const std::vector& word_idx, + uint32_t type_id, + core::Encoding* encodings) const; + size_t GetSplitsSize() const; + StringSplit GetSplit(int idx) const; + const std::string& GetOriginStr() const; + void SetOriginalStr(const std::string& original); + std::vector>> + GetSplits(bool is_original, const core::OffsetType& offset_type) const; + +private: + std::string original_; + std::vector splits_; +}; + +struct FASTTOKENIZER_DECL PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const = 0; +}; + +struct FASTTOKENIZER_DECL OffsetConverter { + OffsetConverter(const std::string&) {} + virtual bool convert(const core::Offset&, core::Offset*) const { + return true; + } +}; + +struct FASTTOKENIZER_DECL BytesToCharOffsetConverter : public OffsetConverter { + std::vector offset_map_; + BytesToCharOffsetConverter(const std::string&); + virtual bool convert(const core::Offset&, core::Offset*) const; +}; + +struct FASTTOKENIZER_DECL CharToBytesOffsetConverter : public OffsetConverter { + std::vector offset_map_; + CharToBytesOffsetConverter(const std::string&); + virtual bool convert(const core::Offset&, core::Offset*) const; +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizers.h b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizers.h new file mode 100644 index 0000000000000000000000000000000000000000..5828ecd2dafe66b5ba34184a3f12d3799d4333e9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizers.h @@ -0,0 +1,24 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/pretokenizers/bert.h" +#include "fast_tokenizer/pretokenizers/byte_level.h" +#include "fast_tokenizer/pretokenizers/metaspace.h" +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/pretokenizers/sequence.h" +#include "fast_tokenizer/pretokenizers/split.h" +#include "fast_tokenizer/pretokenizers/whitespace.h" +#include "fast_tokenizer/pretokenizers/whitespace_and_punctuation.h" diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.cc new file mode 100644 index 0000000000000000000000000000000000000000..3ac87ddbc3501bb4639fa12f8fef8a197f1a5e19 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.cc @@ -0,0 +1,122 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "glog/logging.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +SequencePreTokenizer::SequencePreTokenizer( + const std::vector& pretokenizers) { + for (auto& pretokenizer : pretokenizers) { + AppendPreTokenizer(pretokenizer); + } +} + +void SequencePreTokenizer::AppendPreTokenizer(PreTokenizer* pretokenizer) { + std::shared_ptr pretokenizer_ptr; + if (typeid(*pretokenizer) == typeid(SequencePreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(BertPreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(MetaSpacePreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(WhitespacePreTokenizer)) { + auto cast_pretokenizer = + dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == + typeid(WhitespaceAndPunctuationPreTokenizer)) { + auto cast_pretokenizer = + dynamic_cast(pretokenizer); + pretokenizer_ptr = std::make_shared( + *cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(SplitPreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(ByteLevelPreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else { + VLOG(6) << "This pretokenizer is not supportted now."; + } + pretokenzer_ptrs_.push_back(pretokenizer_ptr); +} + +void SequencePreTokenizer::operator()(PreTokenizedString* pretokenized) const { + for (auto& pretokenizer : pretokenzer_ptrs_) { + pretokenizer->operator()(pretokenized); + } +} + +void to_json(nlohmann::json& j, + const SequencePreTokenizer& sequence_pretokenizer) { + nlohmann::json jlist; + for (auto& ptr : sequence_pretokenizer.pretokenzer_ptrs_) { + nlohmann::json jitem; + if (typeid(*ptr) == typeid(SequencePreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(BertPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(MetaSpacePreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(WhitespacePreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(WhitespaceAndPunctuationPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(SplitPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(ByteLevelPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } + jlist.push_back(jitem); + } + j = {{"type", "SequencePreTokenizer"}, {"pretokenizers", jlist}}; +} + +void from_json(const nlohmann::json& j, + SequencePreTokenizer& 
sequence_pretokenizer) { +#define TRY_APPEND_PRETOKENIZER(PRETOKENIZER_TYPE) \ + if (pretokenizer_type == #PRETOKENIZER_TYPE) { \ + PRETOKENIZER_TYPE pretokenizer; \ + pretokenizer_json.get_to(pretokenizer); \ + sequence_pretokenizer.AppendPreTokenizer(&pretokenizer); \ + } + for (auto& pretokenizer_json : j.at("pretokenizers")) { + std::string pretokenizer_type; + pretokenizer_json.at("type").get_to(pretokenizer_type); + TRY_APPEND_PRETOKENIZER(SequencePreTokenizer); + TRY_APPEND_PRETOKENIZER(WhitespacePreTokenizer); + TRY_APPEND_PRETOKENIZER(WhitespaceAndPunctuationPreTokenizer); + TRY_APPEND_PRETOKENIZER(MetaSpacePreTokenizer); + TRY_APPEND_PRETOKENIZER(BertPreTokenizer); + TRY_APPEND_PRETOKENIZER(ByteLevelPreTokenizer); + TRY_APPEND_PRETOKENIZER(SplitPreTokenizer); + } +#undef TRY_APPEND_PRETOKENIZER +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.h b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.h new file mode 100644 index 0000000000000000000000000000000000000000..741f8d43c08af413afb5036721208a214281d86a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.h @@ -0,0 +1,43 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL SequencePreTokenizer : public PreTokenizer { + SequencePreTokenizer() = default; + SequencePreTokenizer(const SequencePreTokenizer&) = default; + SequencePreTokenizer(const std::vector& pretokenizers); + virtual void operator()(PreTokenizedString* pretokenized) const override; + void AppendPreTokenizer(PreTokenizer* pretokenizer); + +private: + std::vector> pretokenzer_ptrs_; + friend void to_json(nlohmann::json& j, + const SequencePreTokenizer& sequence_pretokenizer); + friend void from_json(const nlohmann::json& j, + SequencePreTokenizer& sequence_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/split.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/split.cc new file mode 100644 index 0000000000000000000000000000000000000000..927af338b2e26c4d57b852d7b9d1053a3b53d817 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/split.cc @@ -0,0 +1,72 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/split.h" + +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/unique_ptr.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +SplitPreTokenizer::SplitPreTokenizer( + const SplitPreTokenizer& split_pretokenizer) + : pattern_(new re2::RE2(split_pretokenizer.pattern_->pattern())) { + split_mode_ = split_pretokenizer.split_mode_; + invert_ = split_pretokenizer.invert_; +} + +SplitPreTokenizer::SplitPreTokenizer(const std::string& pattern, + core::SplitMode split_mode, + bool invert) + : invert_(invert), split_mode_(split_mode) { + pattern_ = utils::make_unique(pattern); +} + +void SplitPreTokenizer::operator()(PreTokenizedString* pretokenized) const { + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split(*pattern_, split_mode_, &normalized_splits, invert_); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); +} + + +void to_json(nlohmann::json& j, const SplitPreTokenizer& split_pretokenizer) { + j = { + {"type", "SplitPreTokenizer"}, + {"pattern", split_pretokenizer.pattern_->pattern()}, + {"split_mode", split_pretokenizer.split_mode_}, + {"invert", split_pretokenizer.invert_}, + }; +} + +void from_json(const nlohmann::json& j, SplitPreTokenizer& split_pretokenizer) { + split_pretokenizer.pattern_ = + utils::make_unique(j.at("pattern").get()); + j.at("split_mode").get_to(split_pretokenizer.split_mode_); + j.at("invert").get_to(split_pretokenizer.invert_); +} + + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/split.h b/fast_tokenizer/fast_tokenizer/pretokenizers/split.h new file mode 100644 index 0000000000000000000000000000000000000000..52796f9f4e240441b2d4fd2204bdeb77a044f525 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/split.h @@ -0,0 +1,48 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace re2 { +class RE2; +} // namespace re2 + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL SplitPreTokenizer : public PreTokenizer { + SplitPreTokenizer() = default; + SplitPreTokenizer(const std::string& pattern, + core::SplitMode split_mode, + bool invert); + SplitPreTokenizer(const SplitPreTokenizer& split_pretokenizer); + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const SplitPreTokenizer& split_pretokenizer); + friend void from_json(const nlohmann::json& j, + SplitPreTokenizer& split_pretokenizer); + +private: + bool invert_; + core::SplitMode split_mode_; + std::unique_ptr pattern_; +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.cc new file mode 100644 index 0000000000000000000000000000000000000000..6ef950870997127f299be630f439db6c1dd69c2a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.cc @@ -0,0 +1,50 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/whitespace.h" + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { +static re2::RE2 pattern("[\\s\\p{Zs}]+"); + +void WhitespacePreTokenizer::operator()( + PreTokenizedString* pretokenized) const { + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split(pattern, core::SplitMode::REMOVED, &normalized_splits); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); +} + +void to_json(nlohmann::json& j, + const WhitespacePreTokenizer& whitespace_pretokenizer) { + j = { + {"type", "WhitespacePreTokenizer"}, + }; +} + +void from_json(const nlohmann::json& j, + WhitespacePreTokenizer& whitespace_pretokenizer) {} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.h b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.h new file mode 100644 index 0000000000000000000000000000000000000000..43aa955ffc0655b97e1644f7266676d74ea0a8f2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.h @@ -0,0 +1,34 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL WhitespacePreTokenizer : public PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const WhitespacePreTokenizer& whitespace_pretokenizer); + friend void from_json(const nlohmann::json& j, + WhitespacePreTokenizer& whitespace_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.cc new file mode 100644 index 0000000000000000000000000000000000000000..d54d59cdbcfc3ef2427e3a9caebcd6ed8ea055ee --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.cc @@ -0,0 +1,60 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pretokenizers/whitespace_and_punctuation.h" + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { +static re2::RE2 pattern("[\\s\\p{Zs}]+"); + +void WhitespaceAndPunctuationPreTokenizer::operator()( + PreTokenizedString* pretokenized) const { + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split(pattern, core::SplitMode::REMOVED, &normalized_splits); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split("\\w+", core::SplitMode::ISOLATED, &normalized_splits); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); +} + +void to_json( + nlohmann::json& j, + const WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer) { + j = { + {"type", "WhitespaceAndPunctuationPreTokenizer"}, + }; +} + +void from_json(const nlohmann::json& j, + WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer) {} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.h b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.h new file mode 100644 index 0000000000000000000000000000000000000000..f7343052e5857ca134b473ca267b9d92de33910d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.h @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL WhitespaceAndPunctuationPreTokenizer + : public PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json( + nlohmann::json& j, + const WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer); + friend void from_json( + const nlohmann::json& j, + WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/pybind/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..66e6290ddd96d94464f0ceaac85b1fde67606ed4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/CMakeLists.txt @@ -0,0 +1,10 @@ +set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-rpath='$ORIGIN'") +cc_library(pybind_utils SRCS utils.cc DEPS pybind python json) +cc_library(pybind_normalizers SRCS normalizers.cc DEPS pybind python json normalizers) +cc_library(pybind_pretokenizers SRCS pretokenizers.cc DEPS pybind python json pretokenizers) +cc_library(pybind_models SRCS models.cc DEPS pybind python json models) +cc_library(pybind_postprocessors SRCS postprocessors.cc DEPS pybind python core json postprocessors) +cc_library(pybind_tokenizers SRCS tokenizers.cc DEPS pybind python pybind_utils json tokenizer) +cc_library(pybind_exception SRCS exception.cc DEPS pybind python) +cc_library(pybind_decoders SRCS decoders.cc DEPS pybind python json decoders) +cc_library(pybind_core SRCS core.cc DEPS pybind python json) \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/pybind/core.cc b/fast_tokenizer/fast_tokenizer/pybind/core.cc new file mode 100644 index 0000000000000000000000000000000000000000..26cd25d9317dc19ede8d009024a6e3645c55115a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/core.cc @@ -0,0 +1,288 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/pybind/core.h" + +#include +#include + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +py::list GetWordIdx(const core::Encoding& self) { + py::list list; + for (const auto& idx : self.GetWordsIdx()) { + if (idx == static_cast(-1)) { + list.append(py::none()); + } else { + list.append(py::cast(idx)); + } + } + return list; +} + +void BindCore(pybind11::module* m) { + py::class_(*m, "Token") + .def(py::init<>()) + .def_readwrite("id", &core::Token::id_) + .def_readwrite("value", &core::Token::value_) + .def_readwrite("offset", &core::Token::offset_) + .def("__repr__", [](const core::Token& token) { + std::ostringstream oss; + oss << "id: " << token.id_ << "\tvalue:" << token.value_ + << "\toffset: (" << token.offset_.first << ", " + << token.offset_.second << ")"; + return oss.str(); + }); + py::class_(*m, "PadMethod") + .def(py::init<>()) + .def_readwrite("strategy", &core::PadMethod::strategy_) + .def_readwrite("direction", &core::PadMethod::direction_) + .def_readwrite("pad_id", &core::PadMethod::pad_id_) + .def_readwrite("pad_token_type_id", &core::PadMethod::pad_token_type_id_) + .def_readwrite("pad_token", &core::PadMethod::pad_token_) + .def_readwrite("pad_len", &core::PadMethod::pad_len_) + .def_readwrite("pad_to_multiple_of", + &core::PadMethod::pad_to_multiple_of_); + py::class_(*m, "TruncMethod") + .def(py::init<>()) + .def_readwrite("direction", &core::TruncMethod::direction_) + .def_readwrite("max_len", &core::TruncMethod::max_len_) + .def_readwrite("strategy", &core::TruncMethod::strategy_) + .def_readwrite("stride", &core::TruncMethod::stride_); + + py::enum_(*m, "OffsetType") + .value("CHAR", core::OffsetType::CHAR) + .value("BYTE", core::OffsetType::BYTE) + .export_values(); + py::enum_(*m, "Direction") + .value("LEFT", core::Direction::LEFT) + .value("RIGHT", core::Direction::RIGHT) + .export_values(); + py::enum_(*m, "TruncStrategy") + .value("LONGEST_FIRST", core::TruncStrategy::LONGEST_FIRST) + .value("ONLY_FIRST", core::TruncStrategy::ONLY_FIRST) + .value("ONLY_SECOND", core::TruncStrategy::ONLY_SECOND) + .export_values(); + py::enum_(*m, "PadStrategy") + .value("BATCH_LONGEST", core::PadStrategy::BATCH_LONGEST) + .value("FIXED_SIZE", core::PadStrategy::FIXED_SIZE) + .export_values(); + + py::enum_(*m, "SplitMode") + .value("REMOVED", core::SplitMode::REMOVED) + .value("ISOLATED", core::SplitMode::ISOLATED) + .value("MERGED_WITH_PREVIOUS", core::SplitMode::MERGED_WITH_PREVIOUS) + .value("MERGED_WITH_NEXT", core::SplitMode::MERGED_WITH_NEXT) + .value("CONTIGUOUS", core::SplitMode::CONTIGUOUS) + .export_values(); + + py::class_(*m, "Encoding") + .def(py::init&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::unordered_map&>(), + py::arg("ids"), + py::arg("type_ids"), + py::arg("tokens"), + py::arg("words_idx"), + py::arg("offsets"), + py::arg("special_tokens_mask"), + py::arg("attention_mask"), + py::arg("overflowing"), + py::arg("sequence_ranges")) + .def(py::init(), py::arg("size")) + .def(py::init&, uint32_t>(), + py::arg("tokens"), + py::arg("type_id")) + .def("__str__", &core::Encoding::DebugString) + .def("__repr__", &core::Encoding::DebugString) + .def("__len__", &core::Encoding::GetLen) + 
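+          // Read-only properties below expose the encoding's parallel arrays (ids, type_ids, tokens, offsets, masks, overflowing).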
.def_property_readonly("n_sequences", &core::Encoding::GetNumSequence) + .def_property_readonly("tokens", &core::Encoding::GetTokens) + .def_property_readonly("word_ids", &GetWordIdx) + .def_property_readonly("sequence_ids", &core::Encoding::GetSequenceIds) + .def_property_readonly("ids", &core::Encoding::GetIds) + .def_property_readonly("type_ids", &core::Encoding::GetTypeIds) + .def_property_readonly("offsets", &core::Encoding::GetOffsets) + .def_property_readonly("special_tokens_mask", + &core::Encoding::GetSpecialTokensMask) + .def_property_readonly("attention_mask", + &core::Encoding::GetAttentionMask) + .def_property_readonly("overflowing", &core::Encoding::GetOverflowing) + .def("set_sequence_ids", + &core::Encoding::SetSequenceIds, + py::arg("sequence_id")) + .def("char_to_token", + [](const core::Encoding& self, + uint32_t char_pos, + uint32_t seq_id) -> py::object { + auto token_idxs = self.CharOffsetsToTokenIdx(char_pos, seq_id); + if (token_idxs.size() == 0) { + return py::none(); + } + return py::cast(token_idxs[0]); + }, + py::arg("char_pos"), + py::arg("sequence_index") = 0) + .def("char_to_word", + [](const core::Encoding& self, + uint32_t char_pos, + uint32_t seq_id) -> py::object { + auto word_idxs = self.CharOffsetsToWordIdx(char_pos, seq_id); + if (word_idxs.size() == 0) { + return py::none(); + } + return py::cast(word_idxs[0]); + }, + py::arg("char_pos"), + py::arg("sequence_index") = 0) + .def_static("merge", + &core::Encoding::Merge, + py::arg("encodings"), + py::arg("growing_offsets") = true) + .def("pad", + [](core::Encoding& self, + uint32_t length, + const std::string& direction, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token) { + core::Direction direct; + if (direction == "right") { + direct = core::Direction::RIGHT; + } else { + direct = core::Direction::LEFT; + } + self.Pad(length, pad_id, pad_type_id, pad_token, direct); + }, + py::arg("length"), + py::arg("direction") = "right", + py::arg("pad_id") = 0, + py::arg("pad_type_id") = 0, + py::arg("pad_token") = "[PAD]") + .def("token_to_chars", + [](const core::Encoding& self, uint32_t token_index) -> py::object { + auto offsets = self.TokenIdxToCharOffsets(token_index); + if (offsets.size() == 0) { + return py::none(); + } + return py::cast(offsets[0]); + }, + py::arg("token_index")) + .def("token_to_sequence", + [](const core::Encoding& self, uint32_t token_index) -> py::object { + auto seq_ids = self.TokenIdxToSequenceIds(token_index); + if (seq_ids.size() == 0) { + return py::none(); + } + return py::cast(seq_ids[0]); + }, + py::arg("token_index")) + .def("token_to_word", + [](const core::Encoding& self, uint32_t token_index) -> py::object { + auto word_idx = self.TokenIdxToWordIdx(token_index); + if (word_idx.size() == 0) { + return py::none(); + } + return py::cast(word_idx[0].second); + }, + py::arg("token_index")) + .def("word_to_chars", + [](const core::Encoding& self, + uint32_t word_index, + uint32_t sequence_index) -> py::object { + auto ranges = + self.WordIdxToCharOffsets(word_index, sequence_index); + if (ranges.size() == 0) { + return py::none(); + } + return py::cast(ranges[0]); + }, + py::arg("word_index"), + py::arg("sequence_index") = 0) + .def("word_to_tokens", + [](const core::Encoding& self, + uint32_t word_index, + uint32_t sequence_index) -> py::object { + auto ranges = self.WordIdxToTokensIdx(word_index, sequence_index); + if (ranges.size() == 0) { + return py::none(); + } + return py::cast(ranges[0]); + }, + py::arg("word_index"), + py::arg("sequence_index") 
= 0) + .def("truncate", + [](core::Encoding& self, + size_t max_length, + size_t stride, + const std::string& direction) { + core::Direction direct; + if (direction == "right") { + direct = core::Direction::RIGHT; + } else { + direct = core::Direction::LEFT; + } + self.Truncate(max_length, stride, direct); + }, + py::arg("max_length"), + py::arg("stride") = 0, + py::arg("direction") = "right"); + + py::class_(*m, "AddedToken") + .def(py::init<>()) + .def(py::init([](const std::string& content, + bool single_word, + bool lstrip, + bool rstrip, + bool normalized) { + return core::AddedToken( + content, !normalized, single_word, lstrip, rstrip); + }), + py::arg("content"), + py::arg("single_word") = false, + py::arg("lstrip") = false, + py::arg("rstrip") = false, + py::arg("normalized") = true) + .def(py::self == py::self) + .def_property_readonly("content", &core::AddedToken::GetContent) + .def_property_readonly("get_is_special", &core::AddedToken::GetIsSpecial) + .def_property_readonly( + "normalized", + [](const core::AddedToken& self) { return !self.GetUseNormalized(); }) + .def_property_readonly("lstrip", &core::AddedToken::GetUseLStrip) + .def_property_readonly("rstrip", &core::AddedToken::GetUseRStrip) + .def_property_readonly("single_word", &core::AddedToken::GetIsSingleWord); + + m->def("set_thread_num", &core::SetThreadNum); + m->def("get_thread_num", &core::GetThreadNum); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/core.h b/fast_tokenizer/fast_tokenizer/pybind/core.h new file mode 100644 index 0000000000000000000000000000000000000000..4d42bdd00cbfe90d30ee97a0223cf1db9244aa4a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/core.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindCore(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/decoders.cc b/fast_tokenizer/fast_tokenizer/pybind/decoders.cc new file mode 100644 index 0000000000000000000000000000000000000000..0e0bdbe8728ff2fee6212931550f4c9effb30be5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/decoders.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/decoders/decoders.h" +#include +#include "fast_tokenizer/pybind/decoders.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyDecoder : public decoders::Decoder { +public: + using Decoder::Decoder; + virtual void operator()(const std::vector tokens, + std::string* result) const override { + PYBIND11_OVERLOAD_PURE_NAME( + void, Decoder, "__call__", operator(), tokens, result); + } +}; + +class PyWordPieceDecoder : public decoders::WordPiece { +public: + using WordPiece::WordPiece; + virtual void operator()(const std::vector tokens, + std::string* result) const override { + PYBIND11_OVERLOAD_NAME( + void, WordPiece, "__call__", operator(), tokens, result); + } +}; + +void BindDecoders(pybind11::module* m) { + auto submodule = m->def_submodule("decoders", "The decoders module"); + py::class_(submodule, "Decoder") + .def(py::init<>()) + .def("decode", + [](const decoders::Decoder& self, + const std::vector& tokens) { + std::string result; + self(tokens, &result); + return result; + }, + py::arg("tokens")); + + py::class_(submodule, "WordPiece") + .def(py::init(), + py::arg("prefix") = "##", + py::arg("cleanup") = true) + .def("decode", + [](const decoders::Decoder& self, + const std::vector& tokens) { + std::string result; + self(tokens, &result); + return result; + }, + py::arg("tokens")); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/decoders.h b/fast_tokenizer/fast_tokenizer/pybind/decoders.h new file mode 100644 index 0000000000000000000000000000000000000000..27be3049cf674b2d581b44edfed031bbb2df3817 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/decoders.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindDecoders(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/exception.cc b/fast_tokenizer/fast_tokenizer/pybind/exception.cc new file mode 100644 index 0000000000000000000000000000000000000000..35df7987fd54c66c0ffb4bea8086b573e58fc792 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/exception.cc @@ -0,0 +1,35 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pybind/exception.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void ThrowExceptionToPython(std::exception_ptr p) { + static PyObject* EnforceNotMetException = + PyErr_NewException("tokenizer.EnforceNotMet", PyExc_Exception, NULL); + try { + if (p) std::rethrow_exception(p); + } catch (const std::runtime_error& e) { + PyErr_SetString(PyExc_RuntimeError, e.what()); + } +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/exception.h b/fast_tokenizer/fast_tokenizer/pybind/exception.h new file mode 100644 index 0000000000000000000000000000000000000000..49381948179f8addef9d95f0023463b3db28490e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/exception.h @@ -0,0 +1,45 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include + +#include "pybind11/pybind11.h" +#define TOKENIZERS_TRY try { +#define TOKENIZERS_CATCH_AND_THROW_RETURN_NULL \ + } \ + catch (...) { \ + ThrowExceptionToPython(std::current_exception()); \ + return nullptr; \ + } + +#define TOKENIZERS_CATCH_AND_THROW_RETURN_NEG \ + } \ + catch (...) { \ + ThrowExceptionToPython(std::current_exception()); \ + return -1; \ + } + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void ThrowExceptionToPython(std::exception_ptr p); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/models.cc b/fast_tokenizer/fast_tokenizer/pybind/models.cc new file mode 100644 index 0000000000000000000000000000000000000000..3cee6b03f3884ab913a9639001870262e106e9ae --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/models.cc @@ -0,0 +1,551 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/models/models.h" + +#include + +#include "fast_tokenizer/pybind/models.h" +#include "fast_tokenizer/pybind/utils.h" +#include "glog/logging.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyModel : public models::Model { +public: + using Model::Model; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_PURE_NAME( + std::vector, Model, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_PURE_NAME( + bool, Model, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_PURE_NAME( + bool, Model, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_PURE_NAME(core::Vocab, Model, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_PURE_NAME(size_t, Model, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_PURE_NAME( + std::vector, Model, "save", Save, folder, filename_prefix); + } +}; + +class PyWordPiece : public models::WordPiece { + using WordPiece::WordPiece; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, WordPiece, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME( + bool, WordPiece, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME( + bool, WordPiece, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, WordPiece, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME(size_t, WordPiece, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME(std::vector, + WordPiece, + "save", + Save, + folder, + filename_prefix); + } +}; + +class PyFastWordPiece : public models::FastWordPiece { + using FastWordPiece::FastWordPiece; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, FastWordPiece, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME( + bool, FastWordPiece, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME( + bool, FastWordPiece, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, FastWordPiece, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME( + size_t, FastWordPiece, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME(std::vector, + FastWordPiece, + "save", + Save, + folder, + filename_prefix); + } +}; + +class PyBPE : public models::BPE { + using BPE::BPE; + virtual 
std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, BPE, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME(bool, BPE, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME(bool, BPE, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, BPE, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME(size_t, BPE, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME( + std::vector, BPE, "save", Save, folder, filename_prefix); + } +}; + +class PyUnigram : public models::Unigram { + using Unigram::Unigram; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, Unigram, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME(bool, Unigram, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME(bool, Unigram, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, Unigram, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME(size_t, Unigram, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME(std::vector, + Unigram, + "save", + Save, + folder, + filename_prefix); + } +}; + +void BindModels(pybind11::module* m) { + auto submodule = m->def_submodule("models", "The models module"); + py::class_(submodule, "Model") + .def(py::init<>()) + .def("tokenize", &models::Model::Tokenize) + .def("token_to_id", &models::Model::TokenToId) + .def("id_to_token", &models::Model::IdToToken) + .def("get_vocab", &models::Model::GetVocab) + .def("get_vocab_size", &models::Model::GetVocabSize) + .def("save", &models::Model::Save); + py::class_(submodule, "WordPiece") + .def(py::init<>()) + .def(py::init(), + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##", + py::arg("handle_chinese_chars") = true) + .def("tokenize", &models::WordPiece::Tokenize) + .def("token_to_id", + [](const models::WordPiece& wordpiece, const std::string& token) { + uint32_t id; + wordpiece.TokenToId(token, &id); + return id; + }) + .def("id_to_token", + [](const models::WordPiece& wordpiece, uint32_t id) { + std::string token; + wordpiece.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::WordPiece::GetVocab) + .def("get_vocab_size", &models::WordPiece::GetVocabSize) + .def_static( + "read_file", &models::WordPiece::GetVocabFromFile, py::arg("vocab")) + .def_static("from_file", + &models::WordPiece::GetWordPieceFromFile, + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##") + .def( + "save", + [](const models::WordPiece& wordpiece, + const std::string& folder, + const py::object& py_obj) { 
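+            // A None prefix passed from Python is treated as an empty string before delegating to WordPiece::Save.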
+ std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return wordpiece.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()); + py::class_(submodule, "FastWordPiece") + .def(py::init<>()) + .def(py::init(), + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##", + py::arg("with_pretokenization") = false) + .def("tokenize", &models::FastWordPiece::Tokenize) + .def("token_to_id", + [](const models::FastWordPiece& model, const std::string& token) { + uint32_t id; + model.TokenToId(token, &id); + return id; + }) + .def("id_to_token", + [](const models::FastWordPiece& model, uint32_t id) { + std::string token; + model.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::FastWordPiece::GetVocab) + .def("get_vocab_size", &models::FastWordPiece::GetVocabSize) + .def_static("read_file", + &models::FastWordPiece::GetVocabFromFile, + py::arg("vocab")) + .def_static("from_file", + &models::FastWordPiece::GetFastWordPieceFromFile, + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##", + py::arg("with_pretokenization") = false) + .def( + "save", + [](const models::FastWordPiece& wordpiece, + const std::string& folder, + const py::object& py_obj) { + std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return wordpiece.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()); + py::class_(submodule, "BPE") + .def(py::init([](const py::object& py_vocab, + const py::object& py_merges, + const py::object& py_cache_capacity, + const py::object& py_dropout, + const py::object& py_unk_token, + const py::object& py_continuing_subword_prefix, + const py::object& py_end_of_word_suffix, + const py::object& py_fuse_unk) { + core::Vocab vocab; + if (!py_vocab.is(py::none())) { + vocab = py_vocab.cast(); + } + + core::Merges merges; + if (!py_merges.is(py::none())) { + merges = py_merges.cast(); + } + + size_t cache_capacity = utils::DEFAULT_CACHE_CAPACITY; + if (!py_cache_capacity.is(py::none())) { + cache_capacity = py_cache_capacity.cast(); + } + + std::vector dropout; + if (!py_dropout.is(py::none())) { + dropout.emplace_back(py_dropout.cast()); + } + + std::vector unk_token; + if (!py_unk_token.is(py::none())) { + unk_token.emplace_back(py_unk_token.cast()); + } + + std::vector continuing_subword_prefix; + if (!py_continuing_subword_prefix.is(py::none())) { + continuing_subword_prefix.emplace_back( + py_continuing_subword_prefix.cast()); + } + + std::vector end_of_word_suffix; + if (!py_end_of_word_suffix.is(py::none())) { + end_of_word_suffix.emplace_back( + py_end_of_word_suffix.cast()); + } + + bool fuse_unk = false; + if (!py_fuse_unk.is(py::none())) { + fuse_unk = py_fuse_unk.cast(); + } + models::BPE self(vocab, + merges, + cache_capacity, + dropout, + unk_token, + continuing_subword_prefix, + end_of_word_suffix, + fuse_unk); + return self; + }), + py::arg("vocab") = py::none(), + py::arg("merges") = py::none(), + py::arg("cache_capacity") = py::none(), + py::arg("dropout") = py::none(), + py::arg("unk_token") = py::none(), + py::arg("continuing_subword_prefix") = py::none(), + py::arg("end_of_word_suffix") = py::none(), + py::arg("fuse_unk") = py::none()) + .def("tokenize", &models::BPE::Tokenize) + .def("token_to_id", + [](const models::BPE& model, const std::string& token) { + 
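+             // Look up the id for `token`; the bool success flag returned by TokenToId is not checked here.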
uint32_t id; + model.TokenToId(token, &id); + return id; + }) + .def("id_to_token", + [](const models::BPE& model, uint32_t id) { + std::string token; + model.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::BPE::GetVocab) + .def("get_vocab_size", &models::BPE::GetVocabSize) + .def( + "save", + [](const models::BPE& bpe, + const std::string& folder, + const py::object& py_obj) { + std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return bpe.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()) + .def_static( + "read_file", + [](const std::string& vocab_path, const std::string& merges_path) { + core::Vocab vocab; + core::Merges merges; + models::BPE::GetVocabAndMergesFromFile( + vocab_path, merges_path, &vocab, &merges); + return py::make_tuple(vocab, merges); + }, + py::arg("vocab"), + py::arg("merges")) + .def_static( + "from_file", + [](const std::string& vocab_path, + const std::string& merges_path, + const py::kwargs& kwargs) { + core::Vocab vocab; + core::Merges merges; + models::BPE::GetVocabAndMergesFromFile( + vocab_path, merges_path, &vocab, &merges); + VLOG(6) << "In BPE from_file:"; + size_t cache_capacity = utils::DEFAULT_CACHE_CAPACITY; + if (kwargs.contains("cache_capacity")) { + cache_capacity = kwargs["cache_capacity"].cast(); + VLOG(6) << "cache_capacity = " << cache_capacity; + } + std::vector dropout; + if (kwargs.contains("dropout")) { + dropout.emplace_back(kwargs["dropout"].cast()); + VLOG(6) << "dropout = " << kwargs["dropout"].cast(); + } + + std::vector unk_token; + if (kwargs.contains("unk_token")) { + unk_token.emplace_back(kwargs["unk_token"].cast()); + VLOG(6) << "unk_token = " + << kwargs["unk_token"].cast(); + } + + std::vector continuing_subword_prefix; + if (kwargs.contains("continuing_subword_prefix")) { + continuing_subword_prefix.emplace_back( + kwargs["continuing_subword_prefix"].cast()); + VLOG(6) + << "continuing_subword_prefix = " + << kwargs["continuing_subword_prefix"].cast(); + } + + std::vector end_of_word_suffix; + if (kwargs.contains("end_of_word_suffix")) { + end_of_word_suffix.emplace_back( + kwargs["end_of_word_suffix"].cast()); + VLOG(6) << "end_of_word_suffix = " + << kwargs["end_of_word_suffix"].cast(); + } + + bool fuse_unk = false; + if (kwargs.contains("fuse_unk")) { + fuse_unk = kwargs["fuse_unk"].cast(); + VLOG(6) << "fuse_unk = " << kwargs["fuse_unk"].cast(); + } + return models::BPE(vocab, + merges, + cache_capacity, + dropout, + unk_token, + continuing_subword_prefix, + end_of_word_suffix, + fuse_unk); + }, + py::arg("vocab"), + py::arg("merges")); + py::class_(submodule, "Unigram") + .def(py::init([](const py::object& py_vocab_list, + const py::object& py_unk_token_id) { + if (py_vocab_list.is(py::none()) && + py_unk_token_id.is(py::none())) { + return models::Unigram(); + } else if (!py_vocab_list.is(py::none()) && + !py_unk_token_id.is(py::none())) { + try { + core::VocabList vocab_list = + py_vocab_list.cast(); + size_t unk_id = py_unk_token_id.cast(); + return models::Unigram(vocab_list, {unk_id}); + } catch (std::exception& e) { + VLOG(0) << "Init Unigram error:" << e.what(); + goto error; + } + } + error: + throw py::value_error( + "`vocab` and `unk_id` must be both specified"); + }), + py::arg("vocab") = py::none(), + py::arg("unk_id") = py::none()) + .def("tokenize", &models::Unigram::Tokenize) + .def("token_to_id", + [](const models::Unigram& model, const std::string& token) { + uint32_t id; + model.TokenToId(token, &id); + return 
id; + }) + .def("id_to_token", + [](const models::Unigram& model, uint32_t id) { + std::string token; + model.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::Unigram::GetVocab) + .def("get_vocab_size", &models::Unigram::GetVocabSize) + .def("set_filter_token", + &models::Unigram::SetFilterToken, + py::arg("filter_token") = "") + .def("set_split_rule", + &models::Unigram::SetSplitRule, + py::arg("split_rule") = "") + .def( + "save", + [](const models::Unigram& unigram, + const std::string& folder, + const py::object& py_obj) { + std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return unigram.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()); +} +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/models.h b/fast_tokenizer/fast_tokenizer/pybind/models.h new file mode 100644 index 0000000000000000000000000000000000000000..ca675e61c61e52b52e49b369af0cfa4500066204 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/models.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindModels(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/normalizers.cc b/fast_tokenizer/fast_tokenizer/pybind/normalizers.cc new file mode 100644 index 0000000000000000000000000000000000000000..9c561b11850322a64498f63107da9066f4bc6b01 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/normalizers.cc @@ -0,0 +1,462 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/normalizers/normalizers.h" +#include +#include "fast_tokenizer/pybind/normalizers.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyNormalizer : public normalizers::Normalizer { +public: + using Normalizer::Normalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_PURE_NAME( + void, Normalizer, "__call__", operator(), mut_str); + } +}; + +class PyBertNormalizer : public normalizers::BertNormalizer { +public: + using BertNormalizer::BertNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, BertNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyReplaceNormalizer : public normalizers::ReplaceNormalizer { +public: + using ReplaceNormalizer::ReplaceNormalizer; + PyReplaceNormalizer(const ReplaceNormalizer& r) : ReplaceNormalizer(r) {} + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, ReplaceNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyStripNormalizer : public normalizers::StripNormalizer { +public: + using StripNormalizer::StripNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, StripNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyStripAccentsNormalizer : public normalizers::StripAccentsNormalizer { +public: + using StripAccentsNormalizer::StripAccentsNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, StripAccentsNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFCNormalizer : public normalizers::NFCNormalizer { +public: + using NFCNormalizer::NFCNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFCNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFKCNormalizer : public normalizers::NFKCNormalizer { +public: + using NFKCNormalizer::NFKCNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFKCNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFDNormalizer : public normalizers::NFDNormalizer { +public: + using NFDNormalizer::NFDNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFDNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFKDNormalizer : public normalizers::NFKDNormalizer { +public: + using NFKDNormalizer::NFKDNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFKDNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNmtNormalizer : public normalizers::NmtNormalizer { +public: + using NmtNormalizer::NmtNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NmtNormalizer, "__call__", operator(), mut_str); + } +}; + +class PySequenceNormalizer : public normalizers::SequenceNormalizer { +public: + using SequenceNormalizer::SequenceNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, SequenceNormalizer, "__call__", operator(), mut_str); + } +}; 
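+// The remaining Py*Normalizer wrappers below follow the same trampoline pattern: operator() is routed to a Python-side __call__ override via PYBIND11_OVERLOAD_NAME.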
+ +class PyLowercaseNormalizer : public normalizers::LowercaseNormalizer { +public: + using LowercaseNormalizer::LowercaseNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, LowercaseNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyPrecompiledNormalizer : public normalizers::PrecompiledNormalizer { +public: + using PrecompiledNormalizer::PrecompiledNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, PrecompiledNormalizer, "__call__", operator(), mut_str); + } +}; + +void BindNormalizers(pybind11::module* m) { + auto submodule = m->def_submodule("normalizers", "The normalizers module"); + py::class_(submodule, "NormalizedString") + .def(py::init()) + .def(py::init<>()) + .def("__str__", &normalizers::NormalizedString::GetStr); + py::class_(submodule, "Normalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::Normalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::Normalizer::operator()); + py::class_(submodule, + "BertNormalizer") + .def(py::init(), + py::arg("clean_text") = true, + py::arg("handle_chinese_chars") = true, + py::arg("strip_accents") = true, + py::arg("lowercase") = true) + .def(py::init([](bool clean_text, + bool handle_chinese_chars, + const py::object& strip_accents_obj, + bool lowercase) { + bool strip_accents = lowercase; + if (!strip_accents_obj.is(py::none())) { + strip_accents = strip_accents_obj.cast(); + } + return std::unique_ptr( + new normalizers::BertNormalizer(clean_text, + handle_chinese_chars, + strip_accents, + lowercase)); + }), + py::arg("clean_text") = true, + py::arg("handle_chinese_chars") = true, + py::arg("strip_accents") = true, + py::arg("lowercase") = true) + .def("normalize_str", + [](const normalizers::BertNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::BertNormalizer::operator()) + .def("__getstate__", [](const normalizers::BertNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + + py::class_( + submodule, "ReplaceNormalizer") + .def(py::init(), + py::arg("replace_normalizer")) + .def(py::init(), + py::arg("pattern"), + py::arg("content")) + .def("normalize_str", + [](const normalizers::ReplaceNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::ReplaceNormalizer::operator()) + .def("__getstate__", [](const normalizers::ReplaceNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + + py::class_(submodule, + "StripNormalizer") + .def(py::init(), + py::arg("left") = true, + py::arg("right") = true) + .def( + "normalize_str", + [](const normalizers::StripNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::StripNormalizer::operator()) + .def("__getstate__", [](const normalizers::StripNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "StripAccentsNormalizer") + 
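+      // Same binding pattern as the normalizers above: normalize_str applies the normalizer to a fresh NormalizedString, and __getstate__ serializes the normalizer to its JSON config string.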
.def(py::init<>()) + .def("normalize_str", + [](const normalizers::StripAccentsNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::StripAccentsNormalizer::operator()) + .def("__getstate__", [](const normalizers::StripAccentsNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFCNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFCNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFCNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFCNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFDNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFDNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFDNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFDNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFKCNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFKCNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFKCNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFKCNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFKDNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFKDNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFKDNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFKDNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NmtNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NmtNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NmtNormalizer::operator()) + .def("__getstate__", [](const normalizers::NmtNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "LowercaseNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::LowercaseNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::LowercaseNormalizer::operator()) + .def("__getstate__", [](const normalizers::LowercaseNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "SequenceNormalizer") + .def( + py::init([](const py::list& py_list) { + normalizers::Normalizer* normalizer_ptr; + std::vector normalizers; + for (py::handle py_normalizer : py_list) { + if (pybind11::type::of(py_normalizer) + 
.is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::ReplaceNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::SequenceNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::StripAccentsNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::StripNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::PrecompiledNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else { + throw py::value_error( + "Type of normalizers should be one of " + "`LowercaseNormalizer`," + " `BertNormalizer`, `NFCNormalizer`, `NFKCNormalizer`, " + "`NFDNormalizer`," + " `NFKDNormalizer`, `NmtNormalizer`, `ReplaceNormalizer`, " + "`SequenceNormalizer`," + " `StripAccentsNormalizer`, `StripNormalizer`, " + "`PrecompiledNormalizer`"); + } + normalizers.push_back(normalizer_ptr); + } + return normalizers::SequenceNormalizer(normalizers); + }), + py::arg("normalizers")) + .def("normalize_str", + [](const normalizers::SequenceNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::SequenceNormalizer::operator()) + .def("__getstate__", [](const normalizers::SequenceNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "PrecompiledNormalizer") + .def(py::init<>()) + .def(py::init(), py::arg("precompiled_charsmap")) + .def("normalize_str", + [](const normalizers::PrecompiledNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::PrecompiledNormalizer::operator()); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/normalizers.h b/fast_tokenizer/fast_tokenizer/pybind/normalizers.h new file mode 100644 index 0000000000000000000000000000000000000000..64cd9b6e2ed486f9ef6ba7bcc74512aba5c00b9d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/normalizers.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindNormalizers(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/postprocessors.cc b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.cc new file mode 100644 index 0000000000000000000000000000000000000000..6795c125481afc1fa6aff5fd7599cf13fe94d7a6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.cc @@ -0,0 +1,418 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/pybind/postprocessors.h" +#include "fast_tokenizer/pybind/utils.h" +#include "glog/logging.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyPostProcessor : public postprocessors::PostProcessor { +public: + using PostProcessor::PostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_PURE_NAME(void, + PostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_PURE_NAME(size_t, + PostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyBertPostProcessor : public postprocessors::BertPostProcessor { +public: + using BertPostProcessor::BertPostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + BertPostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + BertPostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyTemplatePostProcessor : public postprocessors::TemplatePostProcessor { +public: + using TemplatePostProcessor::TemplatePostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + TemplatePostProcessor, + "__call__", + operator(), 
+ encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + TemplatePostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyRobertaPostProcessor : public postprocessors::RobertaPostProcessor { +public: + using RobertaPostProcessor::RobertaPostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + RobertaPostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + RobertaPostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyByteLevelPostProcessor : public postprocessors::ByteLevelPostProcessor { +public: + using ByteLevelPostProcessor::ByteLevelPostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + ByteLevelPostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + ByteLevelPostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +void BindPostProcessors(pybind11::module* m) { + auto submodule = + m->def_submodule("postprocessors", "The postprocessors module"); + py::class_(submodule, + "PostProcessor") + .def(py::init<>()) + .def("num_special_tokens_to_add", + &postprocessors::PostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::PostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + py::class_( + submodule, "BertPostProcessor") + .def(py::init<>()) + .def(py::init&, + const std::pair&>(), + py::arg("sep"), + py::arg("cls")) + .def("num_special_tokens_to_add", + &postprocessors::BertPostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::BertPostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + + // For Template Processing + py::class_(submodule, "SpecialToken") + .def(py::init<>()) + .def(py::init&, + const std::vector&>(), + py::arg("id"), + py::arg("ids"), + py::arg("tokens")) + .def(py::init(), + py::arg("token"), + py::arg("id")); + + py::class_(submodule, "Template") + .def(py::init<>()) + .def(py::init(), py::arg("template")) + .def(py::init&>(), py::arg("pieces")) + .def(py::init&>(), + py::arg("pieces")); + + py::class_( + submodule, "TemplatePostProcessor") + .def( + py::init([](const py::object& single_obj, + const py::object& pair_obj, + const py::object& special_tokens_obj) { + 
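+            // Factory lambda for TemplatePostProcessor: `single` and `pair`
+            // accept either a template string or a list of template pieces,
+            // while `special_tokens` accepts a list of (token, id) /
+            // (id, token) tuples or dicts with "id", "ids" and "tokens"
+            // keys; any other type raises py::value_error.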
postprocessors::TemplatePostProcessor self; + // Setting single + if (py::isinstance(single_obj)) { + std::vector template_piece = + CastPyArg2VectorOfStr(single_obj.ptr(), 0); + self.UpdateSinglePieces(template_piece); + } else if (py::isinstance(single_obj)) { + self.UpdateSinglePieces( + CastPyArg2AttrString(single_obj.ptr(), 0)); + } else { + throw py::value_error( + "Type of args single need to be List[str] or str."); + } + // Setting pair + if (py::isinstance(pair_obj)) { + std::vector template_piece = + CastPyArg2VectorOfStr(pair_obj.ptr(), 0); + self.UpdatePairPieces(template_piece); + } else if (py::isinstance(pair_obj)) { + self.UpdatePairPieces(CastPyArg2AttrString(pair_obj.ptr(), 0)); + } else { + throw py::value_error( + "Type of args pair need to be List[str] or str."); + } + // Setting special_tokens + if (py::isinstance(special_tokens_obj)) { + std::vector special_tokens; + for (auto& str : special_tokens_obj.cast()) { + if (py::isinstance(str)) { + auto token_tuple = str.cast(); + uint32_t id; + std::string token; + if (token_tuple.size() == 2) { + if (py::isinstance(token_tuple[0]) && + py::isinstance(token_tuple[1])) { + token = token_tuple[0].cast(); + id = token_tuple[1].cast(); + } else if (py::isinstance(token_tuple[1]) && + py::isinstance(token_tuple[0])) { + token = token_tuple[1].cast(); + id = token_tuple[0].cast(); + } else { + throw py::value_error( + "`Tuple` with both a token and its associated ID, in " + "any order"); + } + special_tokens.emplace_back(token, id); + } else { + throw py::value_error( + "Type of args special_tokens need to be " + "List[Union[Tuple[int, str], Tuple[str, int], dict]]"); + } + } else if (py::isinstance(str)) { + auto token_dict = str.cast(); + std::string id; + std::vector ids; + std::vector tokens; + if (token_dict.contains("id") && + py::isinstance(token_dict["id"])) { + id = token_dict["id"].cast(); + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key 'id'" + "and the respective value should be `str`"); + } + if (token_dict.contains("ids") && + py::isinstance(token_dict["ids"])) { + for (auto py_id : token_dict["ids"].cast()) { + if (py::isinstance(py_id)) { + ids.push_back(py_id.cast()); + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'ids'" + "and the respective value should be List[int]"); + } + } + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'ids'" + "and the respective value should be List[int]"); + } + if (token_dict.contains("tokens") && + py::isinstance(token_dict["tokens"])) { + for (auto& py_token : + token_dict["tokens"].cast()) { + if (py::isinstance(py_token)) { + tokens.push_back(py_token.cast()); + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'tokens'" + "and the respective value should be List[str]"); + } + } + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'tokens'" + "and the respective value should be List[str]"); + } + special_tokens.emplace_back(id, ids, tokens); + } else { + throw py::value_error( + "Type of args special_tokens need to be " + "List[Union[Tuple[int, str], Tuple[str, int], dict]]"); + } + } + self.SetTokensMap(special_tokens); + } else { + throw py::value_error( + "Type of args special_tokens need to be " + "List[Union[Tuple[int, str], Tuple[str, int], dict]]"); + } + return self; + }), + py::arg("single"), + py::arg("pair"), + py::arg("special_tokens")) + 
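+      // The remaining TemplatePostProcessor bindings mirror the other
+      // post-processors: `num_special_tokens_to_add` forwards to
+      // AddedTokensNum, and `__call__` runs the post-processing step and
+      // returns a new core::Encoding built from `encoding` and the optional
+      // `pair_encoding`.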
.def("num_special_tokens_to_add", + &postprocessors::TemplatePostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::TemplatePostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + + py::class_( + submodule, "RobertaPostProcessor") + .def(py::init<>()) + .def(py::init&, + const std::pair&, + bool, + bool>(), + py::arg("sep"), + py::arg("cls"), + py::arg("trim_offsets") = true, + py::arg("add_prefix_space") = true) + .def("num_special_tokens_to_add", + &postprocessors::RobertaPostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::RobertaPostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + py::class_( + submodule, "ByteLevelPostProcessor") + .def(py::init(), + py::arg("add_prefix_space") = true, + py::arg("trim_offsets") = true, + py::arg("use_regex") = true) + .def("num_special_tokens_to_add", + &postprocessors::ByteLevelPostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::ByteLevelPostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/postprocessors.h b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.h new file mode 100644 index 0000000000000000000000000000000000000000..b30b31a951ee6caac96a55f7686d0a781ca6c59e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindPostProcessors(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.cc b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.cc new file mode 100644 index 0000000000000000000000000000000000000000..efb9ae77446d993992cd73629085ae598a7f1631 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.cc @@ -0,0 +1,265 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + + +#include "fast_tokenizer/pretokenizers/pretokenizers.h" + +#include + +#include "fast_tokenizer/pybind/pretokenizers.h" +#include "re2/re2.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyPreTokenizer : public pretokenizers::PreTokenizer { +public: + using PreTokenizer::PreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_PURE_NAME( + void, PreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyWhitespacePreTokenizer : public pretokenizers::WhitespacePreTokenizer { +public: + using WhitespacePreTokenizer::WhitespacePreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, WhitespacePreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyWhitespaceAndPunctuationPreTokenizer + : public pretokenizers::WhitespaceAndPunctuationPreTokenizer { +public: + using WhitespaceAndPunctuationPreTokenizer:: + WhitespaceAndPunctuationPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME(void, + WhitespaceAndPunctuationPreTokenizer, + "__call__", + operator(), + pretokenized); + } +}; + +class PyBertPreTokenizer : public pretokenizers::BertPreTokenizer { +public: + using BertPreTokenizer::BertPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, BertPreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyMetaSpacePreTokenizer : public pretokenizers::MetaSpacePreTokenizer { +public: + using MetaSpacePreTokenizer::MetaSpacePreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, MetaSpacePreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PySequencePreTokenizer : public pretokenizers::SequencePreTokenizer { +public: + using SequencePreTokenizer::SequencePreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, SequencePreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyByteLevelPreTokenizer : public pretokenizers::ByteLevelPreTokenizer { +public: + using ByteLevelPreTokenizer::ByteLevelPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, ByteLevelPreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PySplitPreTokenizer : public pretokenizers::SplitPreTokenizer { +public: + using SplitPreTokenizer::SplitPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, SplitPreTokenizer, "__call__", operator(), pretokenized); + } +}; + 
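+// The Py*PreTokenizer classes above are pybind11 "trampoline" classes: each
+// one overrides operator() and uses PYBIND11_OVERLOAD_NAME /
+// PYBIND11_OVERLOAD_PURE_NAME so that Python subclasses overriding
+// `__call__` are dispatched correctly from C++.
+// Illustrative Python usage (assuming the `pretokenizers` submodule of the
+// compiled core_tokenizers extension has been imported; exact packaging may
+// differ):
+//   pretok = pretokenizers.WhitespacePreTokenizer()
+//   pretokenized = pretokenizers.PreTokenizedString("hello world")
+//   pretok(pretokenized)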
+void BindPreTokenizers(pybind11::module* m) { + auto sub_module = + m->def_submodule("pretokenizers", "The pretokenizers module"); + py::class_(sub_module, "StringSplit") + .def(py::init(), + py::arg("nomalized_text")) + .def(py::init&>(), + py::arg("nomalized_text"), + py::arg("tokens")) + .def_readwrite("normalized", &pretokenizers::StringSplit::normalized_) + .def_readwrite("tokens", &pretokenizers::StringSplit::tokens_); + py::class_(sub_module, + "PreTokenizedString") + .def(py::init<>()) + .def(py::init(), py::arg("raw_text")) + .def(py::init(), + py::arg("nomalized_text")) + .def("get_string_split", &pretokenizers::PreTokenizedString::GetSplit) + .def("get_string_splits_size", + &pretokenizers::PreTokenizedString::GetSplitsSize) + .def("get_original_text", + &pretokenizers::PreTokenizedString::GetOriginStr) + .def( + "get_splits", + [](const pretokenizers::PreTokenizedString& self, + const std::string& offset_referential, + const std::string& offset_type) { + bool is_original = true; + if (offset_referential != "original") { + is_original = false; + } + core::OffsetType type = core::OffsetType::CHAR; + if (offset_type != "char") { + type = core::OffsetType::BYTE; + } + return self.GetSplits(is_original, type); + }, + py::arg("offset_referential") = "original", + py::arg("offset_type") = "char") + .def("to_encoding", + [](const pretokenizers::PreTokenizedString& self, + const std::vector& word_idx, + uint32_t type_id, + core::OffsetType offset_type) { + core::Encoding encoding; + self.TransformToEncoding( + word_idx, type_id, offset_type, &encoding); + return encoding; + }); + py::class_(sub_module, + "PreTokenizer") + .def(py::init<>()) + .def("__call__", &pretokenizers::PreTokenizer::operator()); + py::class_( + sub_module, "WhitespacePreTokenizer") + .def(py::init<>()) + .def("__call__", &pretokenizers::WhitespacePreTokenizer::operator()); + py::class_( + sub_module, "WhitespaceAndPunctuationPreTokenizer") + .def(py::init<>()) + .def("__call__", + &pretokenizers::WhitespaceAndPunctuationPreTokenizer::operator()); + py::class_( + sub_module, "BertPreTokenizer") + .def(py::init<>()) + .def("__call__", &pretokenizers::BertPreTokenizer::operator()); + py::class_( + sub_module, "MetaSpacePreTokenizer") + .def(py::init(), + py::arg("replacement") = "_", + py::arg("add_prefix_space") = true) + .def("__call__", &pretokenizers::MetaSpacePreTokenizer::operator()); + py::class_( + sub_module, "SequencePreTokenizer") + .def( + py::init([](const py::list& py_list) { + pretokenizers::PreTokenizer* pretokenizer_ptr; + std::vector pretokenizers; + for (py::handle py_pretokenizer : py_list) { + if (pybind11::type::of(py_pretokenizer) + .is(py::type::of())) { + pretokenizer_ptr = + py_pretokenizer.cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::MetaSpacePreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::SequencePreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::WhitespacePreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers:: + WhitespaceAndPunctuationPreTokenizer>())) { + pretokenizer_ptr = py_pretokenizer.cast< + pretokenizers::WhitespaceAndPunctuationPreTokenizer*>(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + 
pretokenizers::ByteLevelPreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (py::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::SplitPreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer.cast(); + } else { + throw py::value_error( + "Type of normalizers should be one of `BertPreTokenizer`," + " `MetaSpacePreTokenizer`, `SequencePreTokenizer`," + " `WhitespacePreTokenizer`, `ByteLevelPreTokenizer`," + " `WhitespaceAndPunctuationPreTokenizer`, " + "`SplitPreTokenizer`"); + } + pretokenizers.push_back(pretokenizer_ptr); + } + return pretokenizers::SequencePreTokenizer(pretokenizers); + }), + py::arg("pretokenizers")) + .def("__call__", &pretokenizers::SequencePreTokenizer::operator()); + py::class_( + sub_module, "ByteLevelPreTokenizer") + .def(py::init(), + py::arg("add_prefix_space") = true, + py::arg("use_regex") = true) + .def("__call__", &pretokenizers::ByteLevelPreTokenizer::operator()); + py::class_( + sub_module, "SplitPreTokenizer") + .def(py::init(), + py::arg("pattern"), + py::arg("split_mode"), + py::arg("invert")) + .def("__call__", &pretokenizers::SplitPreTokenizer::operator()); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.h b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.h new file mode 100644 index 0000000000000000000000000000000000000000..ffe7b1adf83d0d5e5fef7e3274b1dfb71eedc069 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindPreTokenizers(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/pybind.cc b/fast_tokenizer/fast_tokenizer/pybind/pybind.cc new file mode 100644 index 0000000000000000000000000000000000000000..3f7b32f56d88dd1b65ae0a69199d24b284b4e6e1 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/pybind.cc @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ +#include + +#include +#include + +#include "fast_tokenizer/pybind/core.h" +#include "fast_tokenizer/pybind/decoders.h" +#include "fast_tokenizer/pybind/models.h" +#include "fast_tokenizer/pybind/normalizers.h" +#include "fast_tokenizer/pybind/postprocessors.h" +#include "fast_tokenizer/pybind/pretokenizers.h" +#include "fast_tokenizer/pybind/tokenizers.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PYBIND11_MODULE(core_tokenizers, m) { + m.doc() = "The paddlenlp fast_tokenizer core module."; + // 1. Bind normalizers submodule + BindNormalizers(&m); + // 2. Bind pre_tokenizers submodule + BindPreTokenizers(&m); + // 3. Bind models submodule + BindModels(&m); + // 4. Bind processors submodule + BindPostProcessors(&m); + // 5. Bind tokenizers submodule + BindTokenizers(&m); + // 6. Bind core + BindCore(&m); + // 7. Bind decoder submodule + BindDecoders(&m); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/tokenizers.cc b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.cc new file mode 100644 index 0000000000000000000000000000000000000000..69902251a2ced6bc69de577213079e3fcd0ca034 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.cc @@ -0,0 +1,1428 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pybind/tokenizers.h" + +#include + +#include + +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/decoders/decoders.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "fast_tokenizer/pybind/exception.h" +#include "fast_tokenizer/pybind/utils.h" +#include "glog/logging.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PyTypeObject* p_tokenizer_type; // For Tokenizer + +PyNumberMethods number_methods; +PySequenceMethods sequence_methods; +PyMappingMethods mapping_methods; + +typedef struct { + PyObject_HEAD core::Tokenizer tokenizer; + // Weak references + PyObject* weakrefs; +} TokenizerObject; + +static PyObject* TokenizerPropertiesGetNormaizer(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetNormalizerPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetNormalizer(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleaseNormaizer(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of Normalizer"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetPreTokenizer(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetPreTokenizer()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int 
TokenizerPropertiesSetPreTokenizer(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of< + pretokenizers::WhitespaceAndPunctuationPreTokenizer>())) { + const auto& pretokenizer = + py_obj + .cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleasePreTokenizer(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of PreTokenizer"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetModel(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetModelPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetModel(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of Model"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetPostProcessor(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetPostProcessorPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetPostProcessor(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + 
py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleasePostProcessor(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of PostProcessor"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetPadding(TokenizerObject* self, + void* closure) { + if (!self->tokenizer.GetUsePadding()) { + Py_RETURN_NONE; + } + auto pad_method = self->tokenizer.GetPadMethod(); + PyObject* py_dict = PyDict_New(); + PyDict_SetItem(py_dict, ToPyObject("pad_id"), ToPyObject(pad_method.pad_id_)); + PyDict_SetItem(py_dict, + ToPyObject("pad_token_type_id"), + ToPyObject(pad_method.pad_token_type_id_)); + PyDict_SetItem( + py_dict, ToPyObject("pad_token"), ToPyObject(pad_method.pad_token_)); + if (pad_method.pad_to_multiple_of_ > 0) { + PyDict_SetItem(py_dict, + ToPyObject("pad_to_multiple_of"), + ToPyObject(pad_method.pad_to_multiple_of_)); + } else { + Py_INCREF(Py_None); + PyDict_SetItem(py_dict, ToPyObject("pad_to_multiple_of"), Py_None); + } + + PyDict_SetItem( + py_dict, + ToPyObject("direction"), + ToPyObject((pad_method.direction_ == core::Direction::RIGHT) ? "right" + : "left")); + if (pad_method.strategy_ == core::PadStrategy::BATCH_LONGEST) { + Py_INCREF(Py_None); + PyDict_SetItem(py_dict, ToPyObject("length"), Py_None); + } else { + PyDict_SetItem( + py_dict, ToPyObject("length"), ToPyObject(pad_method.pad_len_)); + } + return py_dict; +} + +static PyObject* TokenizerPropertiesGetTruncation(TokenizerObject* self, + void* closure) { + if (!self->tokenizer.GetUseTruncation()) { + Py_RETURN_NONE; + } + auto trunc_method = self->tokenizer.GetTruncMethod(); + PyObject* py_dict = PyDict_New(); + PyDict_SetItem( + py_dict, ToPyObject("max_length"), ToPyObject(trunc_method.max_len_)); + PyDict_SetItem( + py_dict, ToPyObject("stride"), ToPyObject(trunc_method.stride_)); + PyDict_SetItem( + py_dict, + ToPyObject("direction"), + ToPyObject((trunc_method.direction_ == core::Direction::RIGHT) ? 
"right" + : "left")); + if (trunc_method.strategy_ == core::TruncStrategy::LONGEST_FIRST) { + PyDict_SetItem( + py_dict, ToPyObject("strategy"), ToPyObject("longest_first")); + } else if (trunc_method.strategy_ == core::TruncStrategy::ONLY_FIRST) { + PyDict_SetItem(py_dict, ToPyObject("strategy"), ToPyObject("only_first")); + } else if (trunc_method.strategy_ == core::TruncStrategy::ONLY_SECOND) { + PyDict_SetItem(py_dict, ToPyObject("strategy"), ToPyObject("only_second")); + } + return py_dict; +} + +static PyObject* TokenizerPropertiesGetDecoder(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetDecoderPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetDecoder(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& decoder = py_obj.cast(); + self->tokenizer.SetDecoder(decoder); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleaseDecoder(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of Decoder"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +struct PyGetSetDef tokenizer_variable_properties[] = { + {"normalizer", + (getter)TokenizerPropertiesGetNormaizer, + (setter)TokenizerPropertiesSetNormalizer, + nullptr, + nullptr}, + {"pretokenizer", + (getter)TokenizerPropertiesGetPreTokenizer, + (setter)TokenizerPropertiesSetPreTokenizer, + nullptr, + nullptr}, + {"model", + (getter)TokenizerPropertiesGetModel, + (setter)TokenizerPropertiesSetModel, + nullptr, + nullptr}, + {"postprocessor", + (getter)TokenizerPropertiesGetPostProcessor, + (setter)TokenizerPropertiesSetPostProcessor, + nullptr, + nullptr}, + {"padding", + (getter)TokenizerPropertiesGetPadding, + nullptr, + nullptr, + nullptr}, + {"truncation", + (getter)TokenizerPropertiesGetTruncation, + nullptr, + nullptr, + nullptr}, + {"decoder", + (getter)TokenizerPropertiesGetDecoder, + (setter)TokenizerPropertiesSetDecoder, + nullptr, + nullptr}, + {nullptr, nullptr, nullptr, nullptr, nullptr}}; + +PyObject* TokenizerNew(PyTypeObject* type, PyObject* args, PyObject* kwargs) { + PyObject* obj = type->tp_alloc(type, 0); + if (obj) { + auto v = reinterpret_cast(obj); + new (&(v->tokenizer)) core::Tokenizer(); + } + return obj; +} + +static void TokenizerDealloc(TokenizerObject* self) { + if (self->weakrefs != NULL) + PyObject_ClearWeakRefs(reinterpret_cast(self)); + self->tokenizer.~Tokenizer(); + Py_TYPE(self)->tp_free(reinterpret_cast(self)); +} + +int TokenizerInit(PyObject* self, PyObject* args, PyObject* kwargs) { + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + // all kwargs + PyObject* kw_model = NULL; + // the keywords argument + static char* kwlist[] = {const_cast("model"), NULL}; + // 'O' Store a Python object (without any conversion) in a C object pointer, + // '|' Indicates that the remaining arguments in the Python argument list are + // optional. 
+ bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_model); + std::unordered_map kws_map{{"model", kw_model}}; + + auto py_tokenizer_ptr = reinterpret_cast(self); + Py_ssize_t args_num = PyTuple_Size(args); + + if (args_num == 1) { + py::object py_obj = + py::reinterpret_borrow(PyTuple_GET_ITEM(args, 0)); + if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else { + std::ostringstream oss; + oss << "Expected type of arguments is `model`"; + throw std::runtime_error(oss.str()); + } + return 0; + } else if (args_num >= 1) { + std::ostringstream oss; + oss << "Expected number of arguments is 0 or 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + return 1; +} + +// def add_special_tokens(token) +static PyObject* AddSpecialTokens(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_special_tokens = NULL; + static char* kwlist[] = {const_cast("tokens"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|O", kwlist, &kw_special_tokens); + Py_ssize_t args_num = PyTuple_Size(args); + std::string tokens; + if (args_num == (Py_ssize_t)1) { + if (PyList_Check(kw_special_tokens)) { + std::vector added_tokens; + Py_ssize_t tokens_num = PyList_GET_SIZE(kw_special_tokens); + for (Py_ssize_t i = 0; i < tokens_num; ++i) { + PyObject* obj = PyList_GetItem(kw_special_tokens, i); + if (PyUnicode_Check(obj)) { + added_tokens.push_back( + core::AddedToken(CastPyArg2AttrString(obj, 0), true)); + } else { + py::handle py_obj(obj); + if (!py::type::of(py_obj).is(py::type::of())) { + throw std::runtime_error( + "The argument of tokens should be List[Union[str, " + "AddedToken]]"); + } + auto added_token = py_obj.cast(); + added_tokens.push_back(added_token); + } + } + return ToPyObject(self->tokenizer.AddSpecialTokens(added_tokens)); + } else { + // throw error + throw std::runtime_error( + "Need to pass the string list as to argument tokens"); + } + } else { + // throw error + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def add_tokens(token) +static PyObject* AddTokens(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_tokens = NULL; + static char* kwlist[] = {const_cast("tokens"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_tokens); + Py_ssize_t args_num = PyTuple_Size(args); + std::string tokens; + if (args_num == (Py_ssize_t)1) { + if (PyList_Check(kw_tokens)) { + std::vector added_tokens; + Py_ssize_t tokens_num = PyList_GET_SIZE(kw_tokens); + for (Py_ssize_t i = 0; i < tokens_num; ++i) { + PyObject* obj = PyList_GetItem(kw_tokens, i); + if (PyUnicode_Check(obj)) { + added_tokens.push_back( + core::AddedToken(CastPyArg2AttrString(obj, 0), true)); + } else { + py::handle py_obj(obj); + if (!py::type::of(py_obj).is(py::type::of())) { + throw 
std::runtime_error( + "The argument of tokens should be List[Union[str, " + "AddedToken]]"); + } + auto added_token = py_obj.cast(); + added_tokens.push_back(added_token); + } + } + return ToPyObject(self->tokenizer.AddTokens(added_tokens)); + } else { + throw std::runtime_error( + "Need to pass the string list as to argument tokens"); + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def enable_padding(direction="right", pad_id=0, +// pad_type_id=0, pad_token="[PAD]", +// length=None, ad_to_multiple_of=None) +static PyObject* EnablePadding(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_direction = NULL; + PyObject* kw_pad_id = NULL; + PyObject* kw_pad_type_id = NULL; + PyObject* kw_pad_token = NULL; + PyObject* kw_length = NULL; + PyObject* kw_pad_to_multiple_of = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("direction"), + const_cast("pad_id"), + const_cast("pad_type_id"), + const_cast("pad_token"), + const_cast("length"), + const_cast("pad_to_multiple_of"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOOOOO", + kwlist, + &kw_direction, + &kw_pad_id, + &kw_pad_type_id, + &kw_pad_token, + &kw_length, + &kw_pad_to_multiple_of); + Py_ssize_t args_num = PyTuple_Size(args); + std::string direction = "right"; + uint32_t pad_id = 0; + uint32_t pad_type_id = 0; + std::string pad_token = "[PAD]"; + uint32_t* length_ptr = nullptr; + uint32_t* pad_to_multiple_of_ptr = nullptr; + uint32_t length = 0; + uint32_t pad_to_multiple_of = 0; + VLOG(6) << "args_num: " << args_num << ", flag_kwargs: " << flag_kwargs; + VLOG(6) << "kw_direction: " << kw_direction; + VLOG(6) << "kw_pad_id: " << kw_pad_id; + VLOG(6) << "kw_pad_type_id: " << kw_pad_type_id; + VLOG(6) << "kw_pad_token: " << kw_pad_token; + VLOG(6) << "kw_length: " << kw_length; + VLOG(6) << "kw_pad_to_multiple_of: " << kw_pad_to_multiple_of; + if (args_num >= (Py_ssize_t)0 && args_num <= (Py_ssize_t)6) { + if ((args_num == 0 && flag_kwargs && kw_direction) || (args_num >= 1)) { + direction = CastPyArg2AttrString(kw_direction, 0); + } + if ((args_num <= 1 && flag_kwargs && kw_pad_id) || (args_num >= 2)) { + pad_id = CastPyArg2AttrSize_t(kw_pad_id, 1); + } + if ((args_num <= 2 && flag_kwargs && kw_pad_type_id) || (args_num >= 3)) { + pad_type_id = CastPyArg2AttrSize_t(kw_pad_type_id, 2); + } + if ((args_num <= 3 && flag_kwargs && kw_pad_token) || (args_num >= 4)) { + pad_token = CastPyArg2AttrString(kw_pad_token, 3); + } + if ((args_num <= 4 && flag_kwargs && kw_length) || (args_num >= 5)) { + if (!(kw_length == Py_None)) { + length = CastPyArg2AttrSize_t(kw_length, 4); + length_ptr = &length; + } + } + if ((args_num <= 5 && flag_kwargs && kw_pad_to_multiple_of) || + (args_num == 6)) { + if (!(kw_pad_to_multiple_of == Py_None)) { + pad_to_multiple_of = CastPyArg2AttrSize_t(kw_pad_to_multiple_of, 5); + pad_to_multiple_of_ptr = &pad_to_multiple_of; + } + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 0 to 6, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + core::Direction pad_direction; + if (direction == "right") { + pad_direction = core::Direction::RIGHT; + } else if (direction == "left") { + pad_direction = core::Direction::LEFT; + } else { + throw std::runtime_error( + "The 
direction args should be \"right\" or \"left\""); + } + self->tokenizer.EnablePadMethod(pad_direction, + pad_id, + pad_type_id, + pad_token, + length_ptr, + pad_to_multiple_of_ptr); + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def disable_padding() +static PyObject* DisablePadding(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num == (Py_ssize_t)0) { + self->tokenizer.DisablePadMethod(); + Py_RETURN_NONE; + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 0, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def enable_truncation(max_length, stride=0, strategy="longest_first", +// direction="right") +static PyObject* EnableTruncation(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_max_length = NULL; + PyObject* kw_stride = NULL; + PyObject* kw_strategy = NULL; + PyObject* kw_direction = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("max_length"), + const_cast("stride"), + const_cast("strategy"), + const_cast("direction"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOOO", + kwlist, + &kw_max_length, + &kw_stride, + &kw_strategy, + &kw_direction); + Py_ssize_t args_num = PyTuple_Size(args); + uint32_t max_length = 0; + uint32_t stride = 0; + std::string strategy = "longest_first"; + std::string direction = "right"; + + if (args_num >= (Py_ssize_t)0 && args_num <= (Py_ssize_t)4) { + max_length = CastPyArg2AttrSize_t(kw_max_length, 0); + if ((args_num <= 1 && flag_kwargs && kw_stride) || (args_num >= 2)) { + stride = CastPyArg2AttrSize_t(kw_stride, 1); + } + if ((args_num <= 2 && flag_kwargs && kw_strategy) || (args_num >= 3)) { + strategy = CastPyArg2AttrString(kw_strategy, 2); + } + if ((args_num <= 3 && flag_kwargs && kw_direction) || (args_num >= 4)) { + direction = CastPyArg2AttrString(kw_direction, 3); + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments 1 to 4, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + + core::TruncStrategy trunc_strategy; + if (strategy == "longest_first") { + trunc_strategy = core::TruncStrategy::LONGEST_FIRST; + } else if (strategy == "only_first") { + trunc_strategy = core::TruncStrategy::ONLY_FIRST; + } else if (strategy == "only_second") { + trunc_strategy = core::TruncStrategy::ONLY_SECOND; + } else { + throw std::runtime_error( + "The strategy args should be \"longest_first\", \"only_first\" or " + "\"only_second\""); + } + core::Direction trunc_direction; + if (direction == "right") { + trunc_direction = core::Direction::RIGHT; + } else if (direction == "left") { + trunc_direction = core::Direction::LEFT; + } else { + throw std::runtime_error( + "The direction args should be \"right\" or \"left\""); + } + self->tokenizer.EnableTruncMethod( + max_length, stride, trunc_direction, trunc_strategy); + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def disable_truncation() +static PyObject* DisableTruncation(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num == (Py_ssize_t)0) { + self->tokenizer.DisableTruncMethod(); + Py_RETURN_NONE; + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 0, but recive " << args_num; + throw 
std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def get_vocab(with_added_vocabulary=True) +static PyObject* GetVocab(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_with_added_vocabulary = NULL; + static char* kwlist[] = {const_cast("with_added_vocabulary"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|O", kwlist, &kw_with_added_vocabulary); + Py_ssize_t args_num = PyTuple_Size(args); + bool with_added_vocabulary = true; + if (args_num == (Py_ssize_t)0) { + with_added_vocabulary = true; + py::object py_obj = + py::cast(self->tokenizer.GetVocab(with_added_vocabulary)); + py_obj.inc_ref(); + return py_obj.ptr(); + } else if (args_num == (Py_ssize_t)1) { + if (PyBool_Check(kw_with_added_vocabulary)) { + with_added_vocabulary = + CastPyArg2AttrBoolean(kw_with_added_vocabulary, 0); + py::object py_obj = + py::cast(self->tokenizer.GetVocab(with_added_vocabulary)); + py_obj.inc_ref(); + return py_obj.ptr(); + } else { + // throw error + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 0 to 1, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def get_vocab_size(with_added_vocabulary=True) +static PyObject* GetVocabSize(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_with_added_vocabulary = NULL; + static char* kwlist[] = {const_cast("with_added_vocabulary"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|O", kwlist, &kw_with_added_vocabulary); + Py_ssize_t args_num = PyTuple_Size(args); + bool with_added_vocabulary = true; + if (args_num == (Py_ssize_t)0) { + with_added_vocabulary = true; + return ToPyObject(self->tokenizer.GetVocabSize(with_added_vocabulary)); + } else if (args_num == (Py_ssize_t)1) { + if (PyBool_Check(kw_with_added_vocabulary)) { + with_added_vocabulary = + CastPyArg2AttrBoolean(kw_with_added_vocabulary, 0); + return ToPyObject(self->tokenizer.GetVocabSize(with_added_vocabulary)); + } else { + // throw error + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 0, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def encode(sequence, pair=None, is_pretokenized=False, +// add_special_tokens=True) +static PyObject* Encode(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_sequence = NULL; + PyObject* kw_pair = NULL; + PyObject* kw_is_pretokenized = NULL; + PyObject* kw_add_special_tokens = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("sequence"), + const_cast("pair"), + const_cast("is_pretokenized"), + const_cast("add_special_tokens"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOOO", + kwlist, + &kw_sequence, + &kw_pair, + &kw_is_pretokenized, + &kw_add_special_tokens); + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)4) { + bool is_pretokenized = false; + bool add_special_tokens = true; + bool has_pair = false; + core::Encoding encoding; + core::Encoding pair_encoding; + core::Encoding result_encoding; + + if ((args_num <= 2 && flag_kwargs && kw_is_pretokenized) || + (args_num >= 3)) { + is_pretokenized = CastPyArg2AttrBoolean(kw_is_pretokenized, 2); + } + if 
((args_num <= 3 && flag_kwargs && kw_add_special_tokens) || + (args_num >= 4)) { + add_special_tokens = CastPyArg2AttrBoolean(kw_add_special_tokens, 3); + } + if (is_pretokenized) { + if (PyList_Check(kw_sequence) || PyTuple_Check(kw_sequence)) { + std::vector sequence_array = + CastPyArg2VectorOfStr(kw_sequence, 0); + std::vector pair_array; + if ((args_num <= 1 && flag_kwargs && kw_pair && kw_pair != Py_None) || + (args_num >= 2)) { + has_pair = true; + pair_array = CastPyArg2VectorOfStr(kw_pair, 1); + } + self->tokenizer.EncodeSingleString( + sequence_array, 0, core::OffsetType::CHAR, &encoding); + core::Encoding* pair_encoding_ptr = nullptr; + if (has_pair) { + self->tokenizer.EncodeSingleString( + pair_array, 0, core::OffsetType::CHAR, &pair_encoding); + pair_encoding_ptr = &pair_encoding; + } + self->tokenizer.PostProcess( + &encoding, pair_encoding_ptr, add_special_tokens, &result_encoding); + } else { + // throw error + std::ostringstream oss; + oss << "The sequence should be list of string when " + "is_pretokenized=True"; + throw std::runtime_error(oss.str()); + } + } else { + std::string sequence = CastPyArg2AttrString(kw_sequence, 0); + std::string pair; + if (((args_num <= 1 && flag_kwargs && kw_pair) || (args_num >= 2)) && + kw_pair != Py_None) { + has_pair = true; + pair = CastPyArg2AttrString(kw_pair, 1); + } + self->tokenizer.EncodeSingleString( + sequence, 0, core::OffsetType::CHAR, &encoding); + core::Encoding* pair_encoding_ptr = nullptr; + if (has_pair) { + self->tokenizer.EncodeSingleString( + pair, 1, core::OffsetType::CHAR, &pair_encoding); + pair_encoding_ptr = &pair_encoding; + } + self->tokenizer.PostProcess( + &encoding, pair_encoding_ptr, add_special_tokens, &result_encoding); + } + py::object py_obj = py::cast(result_encoding); + py_obj.inc_ref(); + return py_obj.ptr(); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 4, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def encode(input, add_special_tokens=True, is_pretokenized=False) +static PyObject* EncodeBatch(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_input = NULL; + PyObject* kw_special_tokens = NULL; + PyObject* kw_is_pretokenized = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("input"), + const_cast("add_special_tokens"), + const_cast("is_pretokenized"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOO", + kwlist, + &kw_input, + &kw_special_tokens, + &kw_is_pretokenized); + bool add_special_tokens = true; + bool is_pretokenized = false; + Py_ssize_t args_num = PyTuple_Size(args); + VLOG(6) << " args_num: " << args_num << ", flag_kwargs: " << flag_kwargs + << ", flag_: " << flag_; + std::vector batch_encode_input; + if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)3) { + if ((args_num <= 1 && flag_kwargs && kw_special_tokens) || + (args_num >= 2)) { + add_special_tokens = CastPyArg2AttrBoolean(kw_special_tokens, 1); + } + if ((args_num <= 2 && kw_is_pretokenized && flag_kwargs) || args_num == 3) { + is_pretokenized = CastPyArg2AttrBoolean(kw_is_pretokenized, 2); + } + if (PyList_Check(kw_input)) { + Py_ssize_t list_size = PyList_Size(kw_input); + for (Py_ssize_t i = 0; i < list_size; ++i) { + PyObject* item = PyList_GetItem(kw_input, i); + // Has pair + if (PyTuple_Check(item) && PyTuple_Size(item) == 2) { + PyObject* text = 
PyTuple_GetItem(item, 0); + PyObject* text_pair = PyTuple_GetItem(item, 1); + // pretokenized + if (is_pretokenized) { + Py_ssize_t pretokenized_size = PyList_Size(item); + std::vector text_vec; + std::vector text_pair_vec; + for (Py_ssize_t j = 0; j < pretokenized_size; ++j) { + PyObject* py_text = PyList_GetItem(text, j); + PyObject* py_text_pair = PyList_GetItem(text_pair, j); + text_vec.push_back(CastPyArg2AttrString(py_text, 0)); + text_pair_vec.push_back(CastPyArg2AttrString(py_text_pair, 1)); + } + batch_encode_input.push_back( + std::pair{text_vec, + text_pair_vec}); + } else { + batch_encode_input.push_back( + std::pair{ + CastPyArg2AttrString(text, 0), + CastPyArg2AttrString(text_pair, 1)}); + } + } else { + // Only get text + if (is_pretokenized) { + Py_ssize_t pretokenized_size = PyList_Size(item); + std::vector str_vec(pretokenized_size); + for (Py_ssize_t j = 0; j < pretokenized_size; ++j) { + PyObject* py_text = PyList_GetItem(item, j); + str_vec[j] = CastPyArg2AttrString(py_text, 0); + } + batch_encode_input.push_back(str_vec); + } else { + batch_encode_input.push_back(CastPyArg2AttrString(item, 0)); + } + } + } + } else { + std::ostringstream oss; + oss << "Expected the type of input argument is list"; + throw std::runtime_error(oss.str()); + } + std::vector result_encodings; + self->tokenizer.EncodeBatchStrings( + batch_encode_input, &result_encodings, add_special_tokens); + py::object py_obj = py::cast(result_encodings); + py_obj.inc_ref(); + return py_obj.ptr(); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def id_to_token(id) +static PyObject* IdToToken(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + PyObject* kw_id = NULL; + static char* kwlist[] = {const_cast("id"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_id); + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num == (Py_ssize_t)1) { + if (PyLong_Check(kw_id)) { + uint32_t id = PyLong_AsLong(kw_id); + std::string token; + if (self->tokenizer.IdToToken(id, &token)) { + return ToPyObject(token); + } + Py_RETURN_NONE; + } else { + // throw error + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; +} + +// def token_to_id(token) +static PyObject* TokenToId(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_token = NULL; + static char* kwlist[] = {const_cast("token"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_token); + Py_ssize_t args_num = PyTuple_Size(args); + std::string token = ""; + if (args_num == (Py_ssize_t)1) { + token = CastPyArg2AttrString(kw_token, 0); + uint32_t id; + if (self->tokenizer.TokenToId(token, &id)) { + return ToPyObject(id); + } + Py_RETURN_NONE; + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def num_special_tokens_to_add(is_pair) +static PyObject* NumSpecialTokensToAdd(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_is_pair = NULL; + static char* kwlist[] = {const_cast("is_pair"), NULL}; + bool flag_ = + 
PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_is_pair); + Py_ssize_t args_num = PyTuple_Size(args); + bool is_pair = false; + if (args_num == (Py_ssize_t)1) { + is_pair = CastPyArg2AttrBoolean(kw_is_pair, 0); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + auto postprocessor_ptr = self->tokenizer.GetPostProcessorPtr(); + if (postprocessor_ptr == nullptr) { + return ToPyObject(0); + } + return ToPyObject(postprocessor_ptr->AddedTokensNum(is_pair)); + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + + +static PyObject* Save(TokenizerObject* self, PyObject* args, PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_path = NULL; + PyObject* kw_pretty = NULL; + static char* kwlist[] = { + const_cast("path"), const_cast("pretty"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|OO", kwlist, &kw_path, &kw_pretty); + bool pretty = true; + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)2) { + if (args_num == (Py_ssize_t)2) { + pretty = CastPyArg2AttrBoolean(kw_pretty, 1); + } + std::string path = CastPyArg2AttrString(kw_path, 0); + self->tokenizer.Save(path, pretty); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* ToStr(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_pretty = NULL; + static char* kwlist[] = {const_cast("pretty"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_pretty); + bool pretty = true; + Py_ssize_t args_num = PyTuple_Size(args); + std::string json_str; + if (args_num >= (Py_ssize_t)0 && args_num <= (Py_ssize_t)1) { + if (args_num == (Py_ssize_t)1) { + pretty = CastPyArg2AttrBoolean(kw_pretty, 0); + } + self->tokenizer.ToJsonStr(&json_str, pretty); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + return ToPyObject(json_str); + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* FromStr(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_json = NULL; + static char* kwlist[] = {const_cast("json"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_json); + Py_ssize_t args_num = PyTuple_Size(args); + std::string json_str; + core::Tokenizer tokenizer; + if (args_num == (Py_ssize_t)1) { + json_str = CastPyArg2AttrString(kw_json, 0); + tokenizer = core::Tokenizer::LoadFromStr(json_str); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + TokenizerObject* obj = + (TokenizerObject*)TokenizerNew(p_tokenizer_type, NULL, NULL); + obj->tokenizer = tokenizer; + return (PyObject*)obj; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* FromFile(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_path = NULL; + static char* kwlist[] = {const_cast("json"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_path); + Py_ssize_t args_num = PyTuple_Size(args); + std::string path; + core::Tokenizer tokenizer; + if (args_num == (Py_ssize_t)1) { 
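+    // Same pattern as from_str above, except the tokenizer configuration is
+    // deserialized from the JSON file at `path` via
+    // core::Tokenizer::LoadFromFile before being wrapped in a new
+    // TokenizerObject.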
+    path = CastPyArg2AttrString(kw_path, 0);
+    tokenizer = core::Tokenizer::LoadFromFile(path);
+  } else {
+    std::ostringstream oss;
+    oss << "Expected number of arguments is 1, but received " << args_num;
+    throw std::runtime_error(oss.str());
+  }
+  TokenizerObject* obj =
+      (TokenizerObject*)TokenizerNew(p_tokenizer_type, NULL, NULL);
+  obj->tokenizer = tokenizer;
+  return (PyObject*)obj;
+  TOKENIZERS_CATCH_AND_THROW_RETURN_NULL
+}
+
+// def decode(self, ids, skip_special_tokens=True):
+static PyObject* Decode(TokenizerObject* self,
+                        PyObject* args,
+                        PyObject* kwargs) {
+  TOKENIZERS_TRY
+  PyObject* kw_ids = NULL;
+  PyObject* kw_skip_special_tokens = NULL;
+  bool flag_kwargs = false;
+  if (kwargs) flag_kwargs = true;
+  static char* kwlist[] = {
+      const_cast<char*>("ids"), const_cast<char*>("skip_special_tokens"), NULL};
+  bool flag_ = PyArg_ParseTupleAndKeywords(
+      args, kwargs, "|OO", kwlist, &kw_ids, &kw_skip_special_tokens);
+  bool skip_special_tokens = true;
+  Py_ssize_t args_num = PyTuple_Size(args);
+  VLOG(6) << " args_num: " << args_num << ", flag_kwargs: " << flag_kwargs
+          << ", flag_: " << flag_;
+  if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)2) {
+    if (args_num == (Py_ssize_t)2 || (flag_kwargs && kw_skip_special_tokens)) {
+      skip_special_tokens = CastPyArg2AttrBoolean(kw_skip_special_tokens, 1);
+    }
+    auto ids = CastPyArg2VectorOfInt<uint32_t>(kw_ids, 0);
+    std::string result;
+    self->tokenizer.Decode(ids, &result, skip_special_tokens);
+    return ToPyObject(result);
+  } else {
+    std::ostringstream oss;
+    oss << "Expected number of arguments is from 1 to 2, but received "
+        << args_num;
+    throw std::runtime_error(oss.str());
+  }
+  TOKENIZERS_CATCH_AND_THROW_RETURN_NULL
+}
+
+// def decode_batch(self, sequences, skip_special_tokens=True):
+static PyObject* DecodeBatch(TokenizerObject* self,
+                             PyObject* args,
+                             PyObject* kwargs) {
+  TOKENIZERS_TRY
+  PyObject* kw_sequences = NULL;
+  PyObject* kw_skip_special_tokens = NULL;
+  bool flag_kwargs = false;
+  if (kwargs) flag_kwargs = true;
+  static char* kwlist[] = {const_cast<char*>("sequences"),
+                           const_cast<char*>("skip_special_tokens"),
+                           NULL};
+  bool flag_ = PyArg_ParseTupleAndKeywords(
+      args, kwargs, "|OO", kwlist, &kw_sequences, &kw_skip_special_tokens);
+  bool skip_special_tokens = true;
+  Py_ssize_t args_num = PyTuple_Size(args);
+  VLOG(6) << " args_num: " << args_num << ", flag_kwargs: " << flag_kwargs
+          << ", flag_: " << flag_;
+  if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)2) {
+    if (args_num == (Py_ssize_t)2 || (flag_kwargs && kw_skip_special_tokens)) {
+      skip_special_tokens = CastPyArg2AttrBoolean(kw_skip_special_tokens, 1);
+    }
+    std::vector<std::vector<uint32_t>> batch_ids;
+    PyObject* item = nullptr;
+    if (PyTuple_Check(kw_sequences)) {
+      Py_ssize_t len = PyTuple_Size(kw_sequences);
+      for (Py_ssize_t i = 0; i < len; i++) {
+        item = PyTuple_GetItem(kw_sequences, i);
+        batch_ids.emplace_back(CastPyArg2VectorOfInt<uint32_t>(item, 0));
+      }
+    } else if (PyList_Check(kw_sequences)) {
+      Py_ssize_t len = PyList_Size(kw_sequences);
+      for (Py_ssize_t i = 0; i < len; i++) {
+        item = PyList_GetItem(kw_sequences, i);
+        batch_ids.emplace_back(CastPyArg2VectorOfInt<uint32_t>(item, 0));
+      }
+    } else {
+      std::ostringstream oss;
+      oss << "Argument sequences needs to be a list or tuple of int sequences";
+      throw std::runtime_error(oss.str());
+    }
+    std::vector<std::string> result;
+    self->tokenizer.DecodeBatch(batch_ids, &result, skip_special_tokens);
+    return ToPyObject(result);
+  } else {
+    std::ostringstream oss;
+    oss << "Expected number of arguments is from 1 to 2, but received "
+        << args_num;
+    throw
std::runtime_error(oss.str()); + } + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +PyMethodDef tokenizer_variable_methods[] = { + {"add_special_tokens", + (PyCFunction)(void (*)(void))AddSpecialTokens, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"add_tokens", + (PyCFunction)(void (*)(void))AddTokens, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"enable_padding", + (PyCFunction)(void (*)(void))EnablePadding, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"disable_padding", + (PyCFunction)(void (*)(void))DisablePadding, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"enable_truncation", + (PyCFunction)(void (*)(void))EnableTruncation, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"disable_truncation", + (PyCFunction)(void (*)(void))DisableTruncation, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"get_vocab", + (PyCFunction)(void (*)(void))GetVocab, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"get_vocab_size", + (PyCFunction)(void (*)(void))GetVocabSize, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"encode", + (PyCFunction)(void (*)(void))Encode, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"encode_batch", + (PyCFunction)(void (*)(void))EncodeBatch, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"decode", + (PyCFunction)(void (*)(void))Decode, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"decode_batch", + (PyCFunction)(void (*)(void))DecodeBatch, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"id_to_token", + (PyCFunction)(void (*)(void))IdToToken, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"token_to_id", + (PyCFunction)(void (*)(void))TokenToId, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"num_special_tokens_to_add", + (PyCFunction)(void (*)(void))NumSpecialTokensToAdd, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"save", + (PyCFunction)(void (*)(void))Save, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"to_str", + (PyCFunction)(void (*)(void))ToStr, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"from_str", + (PyCFunction)(void (*)(void))FromStr, + METH_VARARGS | METH_KEYWORDS | METH_STATIC, + NULL}, + {"from_file", + (PyCFunction)(void (*)(void))FromFile, + METH_VARARGS | METH_KEYWORDS | METH_STATIC, + NULL}, + // TODO(zhoushunjie): Need to implement + // {"from_buffer", + // (PyCFunction)(void (*)(void))NumSpecialTokensToAdd, + // METH_VARARGS | METH_KEYWORDS | METH_STATIC, + // NULL}, + // {"from_pretrained", + // (PyCFunction)(void (*)(void))NumSpecialTokensToAdd, + // METH_VARARGS | METH_KEYWORDS | METH_STATIC, + // NULL}, + {NULL, NULL, 0, NULL}}; + +void BindTokenizers(pybind11::module* m) { + auto heap_type = reinterpret_cast( + PyType_Type.tp_alloc(&PyType_Type, 0)); + heap_type->ht_name = ToPyObject("Tokenizer"); + heap_type->ht_qualname = ToPyObject("Tokenizer"); + auto type = &heap_type->ht_type; + type->tp_name = "Tokenizer"; + type->tp_basicsize = sizeof(TokenizerObject); + type->tp_dealloc = (destructor)TokenizerDealloc; + type->tp_as_number = &number_methods; + type->tp_as_sequence = &sequence_methods; + type->tp_as_mapping = &mapping_methods; + type->tp_methods = tokenizer_variable_methods; + type->tp_getset = tokenizer_variable_properties; + type->tp_init = TokenizerInit; + type->tp_new = TokenizerNew; + Py_INCREF(&PyBaseObject_Type); + type->tp_base = reinterpret_cast(&PyBaseObject_Type); + type->tp_flags |= + Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HEAPTYPE; +#if PY_VERSION_HEX >= 0x03050000 + type->tp_as_async = &heap_type->as_async; +#endif + p_tokenizer_type = type; + + if (PyType_Ready(type) < 0) { + throw "Init Tokenizers error in BindTokenizers(PyType_Ready)."; + return; + } + 
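+  // The type object is now fully initialized; it is registered on the module
+  // below under the name "Tokenizer". Illustrative Python usage (a sketch;
+  // the exact import path depends on how this extension module is packaged):
+  //   tokenizer = Tokenizer.from_file("tokenizer.json")
+  //   encoding = tokenizer.encode("hello world")
+  //   text = tokenizer.decode([1, 2, 3], skip_special_tokens=True)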
+ Py_INCREF(type); + if (PyModule_AddObject( + m->ptr(), "Tokenizer", reinterpret_cast(type)) < 0) { + Py_DECREF(type); + Py_DECREF(m->ptr()); + throw "Init Tokenizers error in BindTokenizers(PyModule_AddObject)."; + return; + } +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/tokenizers.h b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.h new file mode 100644 index 0000000000000000000000000000000000000000..b9e45b957da8998b7b16fb2e757f148319595e43 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindTokenizers(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/utils.cc b/fast_tokenizer/fast_tokenizer/pybind/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..377aeb2ed32e3f0877bbf22f94d8f4ed0a8eb2fd --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/utils.cc @@ -0,0 +1,277 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include + +#include "fast_tokenizer/pybind/utils.h" +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PyObject* ToPyObject(bool value) { + if (value) { + Py_INCREF(Py_True); + return Py_True; + } else { + Py_INCREF(Py_False); + return Py_False; + } +} + +PyObject* ToPyObject(int value) { return PyLong_FromLong(value); } + +PyObject* ToPyObject(uint32_t value) { return PyLong_FromUnsignedLong(value); } + +PyObject* ToPyObject(int64_t value) { return PyLong_FromLongLong(value); } + +PyObject* ToPyObject(size_t value) { return PyLong_FromSize_t(value); } + +PyObject* ToPyObject(float value) { return PyLong_FromDouble(value); } + +PyObject* ToPyObject(double value) { return PyLong_FromDouble(value); } + +PyObject* ToPyObject(const char* value) { return PyUnicode_FromString(value); } + +PyObject* ToPyObject(const std::string& value) { + return PyUnicode_FromString(value.c_str()); +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, (Py_ssize_t)i, ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, (Py_ssize_t)i, ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector>& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + return result; +} + +bool CastPyArg2AttrBoolean(PyObject* obj, ssize_t arg_pos) { + if (obj == Py_None) { + return false; // To be compatible with QA integration testing. Some + // test case pass in None. 
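+                  // Any non-boolean object other than None is rejected below
+                  // with a std::runtime_error.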
+  } else if (obj == Py_True) {
+    return true;
+  } else if (obj == Py_False) {
+    return false;
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be bool, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+  }
+  return false;
+}
+
+std::string CastPyArg2AttrString(PyObject* obj, ssize_t arg_pos) {
+  if (PyUnicode_Check(obj)) {
+    Py_ssize_t size;
+    const char* data;
+    data = PyUnicode_AsUTF8AndSize(obj, &size);
+    return std::string(data, static_cast<size_t>(size));
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be str, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return "";
+  }
+}
+
+int CastPyArg2AttrInt(PyObject* obj, ssize_t arg_pos) {
+  if (PyLong_Check(obj) && !PyBool_Check(obj)) {
+    return static_cast<int>(PyLong_AsLong(obj));
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be int, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+int64_t CastPyArg2AttrLong(PyObject* obj, ssize_t arg_pos) {
+  if (PyLong_Check(obj) && !PyBool_Check(obj)) {
+    return (int64_t)PyLong_AsLong(obj);  // NOLINT
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be int, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+size_t CastPyArg2AttrSize_t(PyObject* obj, ssize_t arg_pos) {
+  if (PyLong_Check(obj) && !PyBool_Check(obj)) {
+    return PyLong_AsSize_t(obj);
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be int, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+float CastPyArg2AttrFloat(PyObject* obj, ssize_t arg_pos) {
+  if (PyFloat_Check(obj) || PyLong_Check(obj)) {
+    return static_cast<float>(PyFloat_AsDouble(obj));
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be float, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+std::vector<std::string> CastPyArg2VectorOfStr(PyObject* obj, size_t arg_pos) {
+  std::vector<std::string> result;
+  if (PyList_Check(obj)) {
+    Py_ssize_t len = PyList_Size(obj);
+    PyObject* item = nullptr;
+    for (Py_ssize_t i = 0; i < len; i++) {
+      item = PyList_GetItem(obj, i);
+      if (PyUnicode_Check(item)) {
+        result.emplace_back(CastPyArg2AttrString(item, 0));
+      } else {
+        std::ostringstream oss;
+        oss << "argument (position " << arg_pos + 1
+            << ") must be list of str, but got "
+            << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+        throw std::runtime_error(oss.str());
+        return {};
+      }
+    }
+  } else if (PyTuple_Check(obj)) {
+    Py_ssize_t len = PyTuple_Size(obj);
+    PyObject* item = nullptr;
+    for (Py_ssize_t i = 0; i < len; i++) {
+      item = PyTuple_GetItem(obj, i);
+      if (PyUnicode_Check(item)) {
+        result.emplace_back(CastPyArg2AttrString(item, 0));
+      } else {
+        std::ostringstream oss;
+        oss << "argument (position " << arg_pos + 1
+            << ") must be list of str, but got "
+            << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+        throw std::runtime_error(oss.str());
+        return {};
+      }
+    }
+  } else if (obj == Py_None) {
+    return {};
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1
+        << ") must be list or tuple, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
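+    // Unreachable after the throw above; kept so that every control path
+    // returns a value.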
+ return {}; + } + return result; +} + +bool PyObject_CheckLongOrConvertToLong(PyObject** obj) { + if ((PyLong_Check(*obj) && !PyBool_Check(*obj))) { + return true; + } + + if (std::string((reinterpret_cast((*obj)->ob_type))->tp_name) + .find("numpy") != std::string::npos) { + auto to = PyNumber_Long(*obj); + if (to) { + *obj = to; + return true; + } + } + + return false; +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/utils.h b/fast_tokenizer/fast_tokenizer/pybind/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..75ef55d654a571eebf5fef44c6ea6e0521835eec --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/utils.h @@ -0,0 +1,108 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include + +// For Windows ssize_t +#if defined(_MSC_VER) +#include +typedef SSIZE_T ssize_t; +#endif + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PyObject* ToPyObject(int value); +PyObject* ToPyObject(uint32_t value); +PyObject* ToPyObject(bool value); +PyObject* ToPyObject(int64_t value); +PyObject* ToPyObject(size_t value); +PyObject* ToPyObject(float value); +PyObject* ToPyObject(double value); +PyObject* ToPyObject(const char* value); +PyObject* ToPyObject(const std::string& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector>& value); +PyObject* ToPyObject(const std::vector& value); + +bool PyObject_CheckLongOrConvertToLong(PyObject** obj); +bool CastPyArg2AttrBoolean(PyObject* obj, ssize_t arg_pos); +std::string CastPyArg2AttrString(PyObject* obj, ssize_t arg_pos); +int CastPyArg2AttrInt(PyObject* obj, ssize_t arg_pos); +int64_t CastPyArg2AttrLong(PyObject* obj, ssize_t arg_pos); +size_t CastPyArg2AttrSize_t(PyObject* obj, ssize_t arg_pos); +float CastPyArg2AttrFloat(PyObject* obj, ssize_t arg_pos); +std::vector CastPyArg2VectorOfStr(PyObject* obj, size_t arg_pos); + +template +std::vector CastPyArg2VectorOfInt(PyObject* obj, size_t arg_pos) { + std::vector result; + if (PyList_Check(obj)) { + Py_ssize_t len = PyList_Size(obj); + PyObject* item = nullptr; + for (Py_ssize_t i = 0; i < len; i++) { + item = PyList_GetItem(obj, i); + if (PyObject_CheckLongOrConvertToLong(&item)) { + result.emplace_back(static_cast(PyLong_AsLong(item))); + } else { + std::ostringstream oss; + oss << "argument (position " << arg_pos + 1 + << "must be list of int, but got " + << reinterpret_cast(item->ob_type)->tp_name + << " at pos " << i; + throw oss.str(); + return {}; + } + } + } else if (PyTuple_Check(obj)) { + Py_ssize_t len = PyTuple_Size(obj); + PyObject* item = nullptr; + for (Py_ssize_t i = 0; i < len; i++) { + 
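+      // PyTuple_GetItem returns a borrowed reference, so no Py_DECREF is
+      // needed for "item".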
item = PyTuple_GetItem(obj, i); + if (PyObject_CheckLongOrConvertToLong(&item)) { + result.emplace_back(static_cast(PyLong_AsLong(item))); + } else { + std::ostringstream oss; + oss << "argument (position " << arg_pos + 1 + << "must be list of int, but got " + << reinterpret_cast(item->ob_type)->tp_name + << " at pos " << i; + throw oss.str(); + return {}; + } + } + } else if (obj == Py_None) { + return {}; + } else { + std::ostringstream oss; + oss << "argument (position " << arg_pos + 1 + << "must be list or tuple, but got " + << reinterpret_cast(obj->ob_type)->tp_name; + throw oss.str(); + return {}; + } + return result; +} +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/test/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/test/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..380e63582b161fc491259415251ad48e1e075d73 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/CMakeLists.txt @@ -0,0 +1,58 @@ +if(WITH_TESTING) +cc_library(tokenizers_gtest_main SRCS gtest_main.cc DEPS gtest gflags) + +# Test Normalizers modules +cc_test(test_normalizer SRCS test_normalizer.cc DEPS normalizers) +cc_test(test_unicode SRCS test_unicode.cc DEPS normalizers) +cc_test(test_replace SRCS test_replace.cc DEPS normalizers) +cc_test(test_strip SRCS test_strip.cc DEPS normalizers) +cc_test(test_utils SRCS test_utils.cc DEPS normalizers) + +# Test PreTokenizers modules +cc_test(test_whitespace SRCS test_whitespace.cc DEPS pretokenizers) +cc_test(test_bert_pretokenizer SRCS test_bert_pretokenizer.cc DEPS pretokenizers) +cc_test(test_split_pretokenizer SRCS test_split_pretokenizer.cc DEPS pretokenizers) + +# Test Model +cc_test(test_wordpiece SRCS test_wordpiece.cc DEPS models) +cc_test(test_fast_wordpiece SRCS test_fast_wordpiece.cc DEPS models) + +# Download ernie vocab for test +set(ERNIE_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/ernie_vocab.txt) +if (EXISTS ${ERNIE_VOCAB_PATH}) + message("The ${ERNIE_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt" ${ERNIE_VOCAB_PATH} SHOW_PROGRESS) + message("Already download the vocab.txt of ernie to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +# Download clip vocab and merge files +set(CLIP_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_vocab.json) +set(CLIP_MERGES_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_merges.txt) + +if (EXISTS ${CLIP_VOCAB_PATH}) + message("The ${CLIP_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/vocab.json" ${CLIP_VOCAB_PATH} SHOW_PROGRESS) + message("Already download the vocab.json of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +if (EXISTS ${CLIP_MERGES_PATH}) + message("The ${CLIP_MERGES_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/merges.txt" ${CLIP_MERGES_PATH} SHOW_PROGRESS) + message("Already download the merges.txt of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +# Test Tokenizer +cc_test(test_bert_tokenizer SRCS test_bert_tokenizer.cc DEPS normalizers pretokenizers models postprocessors tokenizer) + +# Test PostProcessor +cc_test(test_roberta_postprocessor SRCS test_roberta_postprocessor.cc DEPS normalizers pretokenizers models postprocessors tokenizer) + +if(NOT WITH_PYTHON) + cc_test(test_ernie_fast_tokenizer SRCS test_ernie_fast_tokenizer.cc DEPS normalizers 
pretokenizers models postprocessors tokenizer core_tokenizers) + cc_test(test_clip_fast_tokenizer SRCS test_clip_fast_tokenizer.cc DEPS normalizers pretokenizers models postprocessors tokenizer core_tokenizers) +endif() + +endif() diff --git a/fast_tokenizer/fast_tokenizer/test/gtest_main.cc b/fast_tokenizer/fast_tokenizer/test/gtest_main.cc new file mode 100644 index 0000000000000000000000000000000000000000..8dc400abf0bea266f46297837814e68fe8049dca --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/gtest_main.cc @@ -0,0 +1,62 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "gtest/gtest.h" +#include "gflags/gflags.h" +#include "glog/logging.h" + + +int main(int argc, char** argv) { + testing::InitGoogleTest(&argc, argv); + std::vector new_argv; + for (int i = 0; i < argc; ++i) { + new_argv.push_back(argv[i]); + } + + std::vector envs; + std::vector undefok; + + char* env_str = nullptr; + if (envs.size() > 0) { + std::string env_string = "--tryfromenv="; + for (auto t : envs) { + env_string += t + ","; + } + env_string = env_string.substr(0, env_string.length() - 1); + env_str = strdup(env_string.c_str()); + new_argv.push_back(env_str); + VLOG(1) << "gtest env_string:" << env_string; + } + + char* undefok_str = nullptr; + if (undefok.size() > 0) { + std::string undefok_string = "--undefok="; + for (auto t : undefok) { + undefok_string += t + ","; + } + undefok_string = undefok_string.substr(0, undefok_string.length() - 1); + undefok_str = strdup(undefok_string.c_str()); + new_argv.push_back(undefok_str); + VLOG(1) << "gtest undefok_string:" << undefok_string; + } + + int new_argc = static_cast(new_argv.size()); + char** new_argv_address = new_argv.data(); + ::GFLAGS_NAMESPACE::ParseCommandLineFlags( + &new_argc, &new_argv_address, false); + int ret = RUN_ALL_TESTS(); + if (env_str) free(env_str); + if (undefok_str) free(undefok_str); + return ret; +} diff --git a/fast_tokenizer/fast_tokenizer/test/test_bert_pretokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_bert_pretokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..f4be133e375dbc2f338a14d8a1145b75d0ebc2e0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_bert_pretokenizer.cc @@ -0,0 +1,49 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include "fast_tokenizer/pretokenizers/bert.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { +TEST(pretokenizers, bert) { + std::string input = + "I \t am good\r at \nsport. I like\tfootball especially!!!"; + std::vector expected_outputs = {"I", + "am", + "good", + "at", + "sport", + ".", + "I", + "like", + "football", + "especially", + "!", + "!", + "!"}; + pretokenizers::PreTokenizedString bert_input(input); + pretokenizers::BertPreTokenizer()(&bert_input); + ASSERT_EQ(expected_outputs.size(), bert_input.GetSplitsSize()); + for (int i = 0; i < expected_outputs.size(); ++i) { + ASSERT_EQ(bert_input.GetSplit(i).normalized_.GetStr(), expected_outputs[i]); + } +} +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_bert_tokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_bert_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..6aac0fdf1222c889e693c003da119adf8d624f11 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_bert_tokenizer.cc @@ -0,0 +1,120 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ +#include +#include +#include +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/postprocessors/bert.h" +#include "fast_tokenizer/pretokenizers/bert.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +template +void CheckVectorEqual(const std::vector& a, const std::vector& b) { + ASSERT_EQ(a.size(), b.size()); + auto size = a.size(); + for (int i = 0; i < size; ++i) { + ASSERT_EQ(a[i], b[i]); + } +} + +TEST(tokenizer, bert_tokenizer) { + models::WordPieceFactory factory; + std::string vocab_file = "ernie_vocab.txt"; + factory.SetFiles(vocab_file); + // Declare the components of tokenizer + auto word_piece = factory.CreateWordPieceModel(); + auto normalizer = normalizers::BertNormalizer(); + auto pretokenizer = pretokenizers::BertPreTokenizer(); + auto postprocessor = + postprocessors::BertPostProcessor({"[SEP]", 2}, {"[CLS]", 1}); + core::PadMethod pad_method; + core::TruncMethod trunc_method; + + // Initialize tokenizer + core::Tokenizer tokenizer(word_piece); + tokenizer.SetNormalizer(normalizer); + tokenizer.SetPreTokenizer(pretokenizer); + tokenizer.SetPostProcessor(postprocessor); + tokenizer.SetPadMethod(pad_method); + tokenizer.SetTruncMethod(trunc_method); + std::vector special_added_tokens = { + {"[PAD]", true}, + {"[CLS]", true}, + {"[SEP]", true}, + {"[MASK]", true}, + {"[UNK]", true}, + }; + auto special_tokens_num = tokenizer.AddSpecialTokens(special_added_tokens); + + // Tokenize the sample strings + std::vector encodings(2); + tokenizer.EncodePairStrings("今天天气真好", &encodings[0]); + tokenizer.EncodePairStrings("don't know how this missed award nominations.", + &encodings[1]); + std::vector> expected_tokens = { + {"[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"}, + {"[CLS]", + "don", + "[UNK]", + "t", + "know", + "how", + "this", + "miss", + "##ed", + "award", + "no", + "##min", + "##ations", + ".", + "[SEP]"}}; + std::vector> expected_ids = { + {1, 508, 125, 125, 266, 384, 170, 2}, + {1, + 3362, + 17963, + 2052, + 3821, + 5071, + 3730, + 7574, + 9530, + 6301, + 3825, + 10189, + 11005, + 42, + 2}}; + std::vector> expected_type_ids = { + {0, 0, 0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}; + for (int i = 0; i < encodings.size(); ++i) { + CheckVectorEqual(expected_tokens[i], encodings[i].GetTokens()); + CheckVectorEqual(expected_ids[i], encodings[i].GetIds()); + CheckVectorEqual(expected_type_ids[i], encodings[i].GetTypeIds()); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_clip_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_clip_fast_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..b2604b3cc90624d9526fd1e514723f2f5b58e418 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_clip_fast_tokenizer.cc @@ -0,0 +1,50 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h" + +#include "fast_tokenizer/test/utils.h" + +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(tokenizer, clip_full) { + std::string vocab_path = "clip_vocab.json"; + std::string merges_path = "clip_merges.txt"; + tokenizers_impl::ClipFastTokenizer clip_tokenizer(vocab_path, merges_path); + + core::Encoding encoding; + std::string input_text = "A\n'll 11p223RF☆ho!!to?'d'd''d of a cat"; + std::vector expected_ids = { + 49406, 320, 1342, 272, 272, 335, 273, 273, 274, 16368, 13439, 2971, + 748, 531, 13610, 323, 1896, 8445, 323, 539, 320, 2368, 49407}; + std::vector expected_tokens = { + "<|startoftext|>", "a", "'ll", "1", "1", + "p", "2", "2", "3", "rf", + "âĺĨ", "ho", "!!", "to", "?'", + "d", "'d", "''", "d", "of", + "a", "cat", "<|endoftext|>"}; + clip_tokenizer.EncodePairStrings(input_text, &encoding); + CheckVectorEqual(expected_ids, encoding.GetIds()); + CheckVectorEqual(expected_tokens, encoding.GetTokens()); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_ernie_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_ernie_fast_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..f3b73f6b42d951fec698ddb11f17ecfcfd60a2ef --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_ernie_fast_tokenizer.cc @@ -0,0 +1,88 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/postprocessors/bert.h" +#include "fast_tokenizer/pretokenizers/bert.h" +#include "fast_tokenizer/test/utils.h" +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" + +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(tokenizer, ernie_fast_tokenizer) { + std::string vocab_file = "ernie_vocab.txt"; + tokenizers_impl::ErnieFastTokenizer ernie_fast_tokenizer(vocab_file); + std::vector encodings(2); + ernie_fast_tokenizer.EncodePairStrings("今天天气真好", &encodings[0]); + ernie_fast_tokenizer.EncodePairStrings( + "don't know how this missed award nominations.", &encodings[1]); + std::vector> expected_tokens = { + {"[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"}, + {"[CLS]", + "don", + "[UNK]", + "t", + "know", + "how", + "this", + "miss", + "##ed", + "award", + "no", + "##min", + "##ations", + ".", + "[SEP]"}}; + std::vector> expected_ids = { + {1, 508, 125, 125, 266, 384, 170, 2}, + {1, + 3362, + 17963, + 2052, + 3821, + 5071, + 3730, + 7574, + 9530, + 6301, + 3825, + 10189, + 11005, + 42, + 2}}; + std::vector> expected_type_ids = { + {0, 0, 0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}; + for (int i = 0; i < encodings.size(); ++i) { + CheckVectorEqual(expected_tokens[i], encodings[i].GetTokens()); + CheckVectorEqual(expected_ids[i], encodings[i].GetIds()); + CheckVectorEqual(expected_type_ids[i], encodings[i].GetTypeIds()); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_fast_wordpiece.cc b/fast_tokenizer/fast_tokenizer/test/test_fast_wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..c30517ebd4a3c992526f1bb62dc096fba7b0d3a3 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_fast_wordpiece.cc @@ -0,0 +1,42 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "fast_tokenizer/models/fast_wordpiece.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(model, fast_wordpiece_token_to_id) { + auto vocab = models::FastWordPiece::GetVocabFromFile("ernie_vocab.txt"); + models::FastWordPiece fast_wordpiece_model(vocab); + // Test tokens in vocab + for (const auto& item : vocab) { + uint32_t id; + fast_wordpiece_model.TokenToId(item.first, &id); + ASSERT_EQ(item.second, id); + } + // Test [UNK] token + uint32_t fast_wordpiece_id; + ASSERT_FALSE(fast_wordpiece_model.TokenToId("dasd", &fast_wordpiece_id)); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_normalizer.cc b/fast_tokenizer/fast_tokenizer/test/test_normalizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..97de7ea8307394c3f8ee7e789f7a8f4a70835412 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_normalizer.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, split) { + re2::RE2 pattern("-"); + std::string input = "The-final--countdown"; + normalizers::NormalizedString split_input(input); + auto test_split = [&pattern, &split_input]( + core::SplitMode mode, const std::vector expected_strings) { + std::vector normalizes; + split_input.Split(pattern, mode, &normalizes); + ASSERT_EQ(expected_strings.size(), normalizes.size()); + for (int i = 0; i < expected_strings.size(); ++i) { + ASSERT_EQ(expected_strings[i], normalizes[i].GetStr()); + } + }; + + test_split(core::SplitMode::REMOVED, {"The", "final", "countdown"}); + test_split(core::SplitMode::ISOLATED, + {"The", "-", "final", "-", "-", "countdown"}); + test_split(core::SplitMode::CONTIGUOUS, + {"The", "-", "final", "--", "countdown"}); + test_split(core::SplitMode::MERGED_WITH_PREVIOUS, + {"The-", "final-", "-", "countdown"}); + test_split(core::SplitMode::MERGED_WITH_NEXT, + {"The", "-final", "-", "-countdown"}); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_replace.cc b/fast_tokenizer/fast_tokenizer/test/test_replace.cc new file mode 100644 index 0000000000000000000000000000000000000000..c32a193e4e92b0ea57b89594acf9075c59b58a7e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_replace.cc @@ -0,0 +1,47 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, replace) { + std::string input = "This is a ''test''"; + std::string expected_output = "This is a \"test\""; + normalizers::NormalizedString replace_input(input); + + normalizers::ReplaceNormalizer normalizer("''", "\""); + normalizer(&replace_input); + ASSERT_EQ(expected_output, replace_input.GetStr()); + + // Regex pattern + std::string regex_input = "This is a test"; + std::string expected_regex_output = "This is a test"; + normalizers::NormalizedString regex_replace_input(regex_input); + normalizers::ReplaceNormalizer regex_normalizer("(\\s+)", " "); + regex_normalizer(®ex_replace_input); + ASSERT_EQ(expected_regex_output, regex_replace_input.GetStr()); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_roberta_postprocessor.cc b/fast_tokenizer/fast_tokenizer/test/test_roberta_postprocessor.cc new file mode 100644 index 0000000000000000000000000000000000000000..3fabbf6af049b92e6bb9271da75e1753dde6922b --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_roberta_postprocessor.cc @@ -0,0 +1,94 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/postprocessors/roberta.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(postprocessors, roberta) { + postprocessors::RobertaPostProcessor postprocessor; + core::Encoding encoding( + {core::Token(12, "Hello", {0, 5}), core::Token(14, "there", {6, 11})}, 0); + core::Encoding pair_encoding({core::Token(15, "pair", {0, 4})}, 0); + core::Encoding result_encoding; + + core::Encoding encoding_copy = encoding; + core::Encoding pair_encoding_copy = pair_encoding; + + postprocessor(&encoding_copy, nullptr, true, &result_encoding); + uint32_t special_word_idx = std::numeric_limits::max(); + ASSERT_EQ(result_encoding, + core::Encoding({0, 12, 14, 2}, + {0, 0, 0, 0}, + {"", "Hello", "there", ""}, + std::vector(4, special_word_idx), + {{0, 0}, {0, 5}, {6, 11}, {0, 0}}, + {1, 0, 0, 1}, + {1, 1, 1, 1}, + {}, + {{0, {1, 3}}})); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(2), + std::vector(1, 0)); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(3).size(), 0); + + encoding_copy = encoding; + postprocessor(&encoding_copy, &pair_encoding_copy, true, &result_encoding); + ASSERT_EQ( + result_encoding, + core::Encoding({0, 12, 14, 2, 2, 15, 2}, + {0, 0, 0, 0, 0, 0, 0}, + {"", "Hello", "there", "", "", "pair", ""}, + std::vector(7, special_word_idx), + {{0, 0}, {0, 5}, {6, 11}, {0, 0}, {0, 0}, {0, 4}, {0, 0}}, + {1, 0, 0, 1, 1, 0, 1}, + {1, 1, 1, 1, 1, 1, 1}, + {}, + {{0, {1, 3}}, {1, {5, 6}}})); + + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(2), + std::vector(1, 0)); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(3), std::vector{}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(4), std::vector{}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(5), std::vector{1}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(6), std::vector{}); + + encoding_copy = encoding; + pair_encoding_copy = pair_encoding; + postprocessor(&encoding_copy, &pair_encoding_copy, false, &result_encoding); + ASSERT_EQ(result_encoding, + core::Encoding({12, 14, 15}, + {0, 0, 0}, + {"Hello", "there", "pair"}, + std::vector(3, special_word_idx), + {{0, 5}, {6, 11}, {0, 4}}, + {0, 0, 0}, + {1, 1, 1}, + {}, + {{0, {0, 2}}, {1, {2, 3}}})); + + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(0), std::vector{0}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(1), std::vector{0}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(2), std::vector{1}); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_split_pretokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_split_pretokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..89c4df06943eb90ce0cbb0da4751bad72fb85e59 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_split_pretokenizer.cc @@ -0,0 +1,110 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/pretokenizers/split.h" +#include "glog/logging.h" +#include "gtest/gtest.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(pretokenizers, split_basic) { + std::string input = "How are you doing?"; + // All tokens' id are set to 0. + std::vector>> test_cases = + {{ + core::SplitMode::REMOVED, + std::vector{{0, "How", {0, 3}}, + {0, "are", {4, 7}}, + {0, "you", {8, 11}}, + {0, "doing", {12, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::ISOLATED, + std::vector{{0, "How", {0, 3}}, + {0, " ", {3, 4}}, + {0, "are", {4, 7}}, + {0, " ", {7, 8}}, + {0, "you", {8, 11}}, + {0, " ", {11, 12}}, + {0, "doing", {12, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::MERGED_WITH_PREVIOUS, + std::vector{{0, "How ", {0, 4}}, + {0, "are ", {4, 8}}, + {0, "you ", {8, 12}}, + {0, "doing", {12, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::MERGED_WITH_NEXT, + std::vector{{0, "How", {0, 3}}, + {0, " are", {3, 7}}, + {0, " you", {7, 11}}, + {0, " doing", {11, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::CONTIGUOUS, + std::vector{{0, "How", {0, 3}}, + {0, " ", {3, 4}}, + {0, "are", {4, 7}}, + {0, " ", {7, 8}}, + {0, "you", {8, 11}}, + {0, " ", {11, 12}}, + {0, "doing?", {12, 18}}}, + }}; + std::string pattern = R"(\w+|[^\w\s]+)"; + for (auto&& test : test_cases) { + pretokenizers::PreTokenizedString pretokenized(input); + pretokenizers::SplitPreTokenizer pretok(pattern, test.first, true); + pretok(&pretokenized); + ASSERT_EQ(test.second.size(), pretokenized.GetSplitsSize()); + for (int i = 0; i < test.second.size(); ++i) { + auto&& curr_split = pretokenized.GetSplit(i); + ASSERT_EQ(test.second[i].value_, curr_split.normalized_.GetStr()); + auto original_offset = curr_split.normalized_.GetOrginalOffset(); + ASSERT_EQ(test.second[i].offset_, original_offset); + } + } +} + +TEST(pretokenizers, split_invert) { + std::string input = "Hello Hello Hello"; + pretokenizers::PreTokenizedString pretok_str(input), + pretok_str_for_invert(input); + pretokenizers::SplitPreTokenizer pretok(" ", core::SplitMode::REMOVED, false); + pretokenizers::SplitPreTokenizer pretok_invert( + "Hello", core::SplitMode::REMOVED, true); + + pretok(&pretok_str); + pretok_invert(&pretok_str_for_invert); + + ASSERT_EQ(pretok_str.GetSplitsSize(), pretok_str_for_invert.GetSplitsSize()); + for (int i = 0; i < pretok_str.GetSplitsSize(); ++i) { + ASSERT_EQ(pretok_str.GetSplit(i).normalized_.GetStr(), + pretok_str_for_invert.GetSplit(i).normalized_.GetStr()); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_strip.cc b/fast_tokenizer/fast_tokenizer/test/test_strip.cc new file mode 100644 index 0000000000000000000000000000000000000000..370ad9c37d79847b2cfef79aa364024920fe9c09 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_strip.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, strip) { + std::string input = " \t我爱中国\n\f\v"; + std::string expected_lstrip_output = "我爱中国\n\f\v"; + std::string expected_rstrip_output = " \t我爱中国"; + std::string expected_lrstrip_output = "我爱中国"; + + normalizers::NormalizedString lrstrip_input(input); + normalizers::NormalizedString lstrip_input(input); + normalizers::NormalizedString rstrip_input(input); + + normalizers::StripNormalizer lrstrip(true, true); + lrstrip(&lrstrip_input); + std::string lrstrip_output = lrstrip_input.GetStr(); + ASSERT_EQ(expected_lrstrip_output, lrstrip_output); + + normalizers::StripNormalizer lstrip(true, false); + lstrip(&lstrip_input); + std::string lstrip_output = lstrip_input.GetStr(); + ASSERT_EQ(expected_lstrip_output, lstrip_output); + + normalizers::StripNormalizer rstrip(false, true); + rstrip(&rstrip_input); + std::string rstrip_output = rstrip_input.GetStr(); + ASSERT_EQ(expected_rstrip_output, rstrip_output); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_unicode.cc b/fast_tokenizer/fast_tokenizer/test/test_unicode.cc new file mode 100644 index 0000000000000000000000000000000000000000..e4e927b13f35a4f9681539db8833a07784cb60e6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_unicode.cc @@ -0,0 +1,61 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, unicode) { + std::string input = "\u1e9b\u0323a\u1e9b\u0323"; + std::string expected_nfkc_output = "ṩaṩ"; + std::string expected_nfc_output = "\u1e9b\u0323a\u1e9b\u0323"; + std::string expected_nfkd_output = "\u0073\u0323\u0307a\u0073\u0323\u0307"; + std::string expected_nfd_output = "\u017f\u0323\u0307a\u017f\u0323\u0307"; + + normalizers::NFKCNormalizer nfkc; + normalizers::NormalizedString normalized_input1(input); + normalizers::NormalizedString normalized_input2(input); + normalizers::NormalizedString normalized_input3(input); + normalizers::NormalizedString normalized_input4(input); + nfkc(&normalized_input1); + std::string nfkc_output = normalized_input1.GetStr(); + ASSERT_EQ(expected_nfkc_output, nfkc_output); + + normalizers::NFCNormalizer nfc; + nfc(&normalized_input2); + std::string nfc_output = normalized_input2.GetStr(); + ASSERT_EQ(expected_nfc_output, nfc_output); + + normalizers::NFKDNormalizer nfkd; + nfkd(&normalized_input3); + std::string nfkd_output = normalized_input3.GetStr(); + ASSERT_EQ(expected_nfkd_output, nfkd_output); + + normalizers::NFDNormalizer nfd; + nfd(&normalized_input4); + std::string nfd_output = normalized_input4.GetStr(); + ASSERT_EQ(expected_nfd_output, nfd_output); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_utils.cc b/fast_tokenizer/fast_tokenizer/test/test_utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..8711f47724b42f7ed931be213edb44c171f86740 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_utils.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, utils) { + std::string input = "ÓÓßSSCHLOË"; + std::string expected_output = "óóßsschloë"; + normalizers::NormalizedString lower_input(input); + lower_input.Lowercase(); + ASSERT_EQ(expected_output, lower_input.GetStr()); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_whitespace.cc b/fast_tokenizer/fast_tokenizer/test/test_whitespace.cc new file mode 100644 index 0000000000000000000000000000000000000000..95a2fdee98b2cc12eefbc425a9865fbe942c483d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_whitespace.cc @@ -0,0 +1,40 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/pretokenizers/whitespace.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(pretokenizers, whitespace) { + std::string input = "I \t am good\r at \nsport"; + std::vector expected_outputs = { + "I", "am", "good", "at", "sport"}; + pretokenizers::PreTokenizedString whitespace_input(input); + pretokenizers::WhitespacePreTokenizer()(&whitespace_input); + ASSERT_EQ(expected_outputs.size(), whitespace_input.GetSplitsSize()); + for (int i = 0; i < expected_outputs.size(); ++i) { + ASSERT_EQ(whitespace_input.GetSplit(i).normalized_.GetStr(), + expected_outputs[i]); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_wordpiece.cc b/fast_tokenizer/fast_tokenizer/test/test_wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..3d7661f923193a4584d501ecb24981b86220757f --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_wordpiece.cc @@ -0,0 +1,96 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "fast_tokenizer/models/wordpiece.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(model, wordpiece_factory) { + models::WordPieceFactory factory; + auto check_config = [&](const std::string& filename, + size_t vocab_size, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix) { + ASSERT_EQ(filename, factory.config_.files_); + ASSERT_EQ(vocab_size, factory.config_.vocab_.size()); + ASSERT_EQ(unk_token, factory.config_.unk_token_); + ASSERT_EQ(max_input_chars_per_word, + factory.config_.max_input_chars_per_word_); + ASSERT_EQ(continuing_subword_prefix, + factory.config_.continuing_subword_prefix_); + }; + check_config("", 0, "[UNK]", 100, "##"); + + std::string vocab_file = "ernie_vocab.txt"; + factory.SetFiles(vocab_file); + factory.GetVocabFromFiles(vocab_file); + check_config(vocab_file, 17964UL, "[UNK]", 100, "##"); +} + +TEST(model, wordpiece_model) { + models::WordPieceFactory factory; + factory.SetFiles("ernie_vocab.txt"); + + auto wordpiece_model = factory.CreateWordPieceModel(); + auto check_token_id = [&](const std::string& expected_token, + uint32_t expected_id) { + std::string token; + uint32_t id; + wordpiece_model.TokenToId(expected_token, &id); + wordpiece_model.IdToToken(expected_id, &token); + ASSERT_EQ(id, expected_id); + ASSERT_EQ(token, expected_token); + }; + std::array tokens = { + "[PAD]", "[CLS]", "[SEP]", "[MASK]", ",", "的", "、", "一", "人", "有"}; + for (int i = 0; i < tokens.size(); i++) { + check_token_id(tokens[i], i); + } + // check non-exist token + uint32_t id; + ASSERT_FALSE(wordpiece_model.TokenToId("xxsada", &id)); + // check non-exist id + std::string token; + ASSERT_FALSE( + wordpiece_model.IdToToken(wordpiece_model.GetVocabSize(), &token)); + + // Check Tokenize Chinese + auto chinese_tokens = wordpiece_model.Tokenize("今天天气真好"); + auto check_token = [](const core::Token& token, + const std::string& expected_string, + uint32_t id, + core::Offset offset) { + ASSERT_EQ(token.value_, expected_string); + ASSERT_EQ(token.id_, id); + ASSERT_EQ(token.offset_, offset); + }; + check_token(chinese_tokens[0], "今", 508, {0, 3}); + check_token(chinese_tokens[1], "##天", 12172, {3, 6}); + check_token(chinese_tokens[2], "##天", 12172, {6, 9}); + check_token(chinese_tokens[3], "##气", 12311, {9, 12}); + check_token(chinese_tokens[4], "##真", 12427, {12, 15}); + check_token(chinese_tokens[5], "##好", 12217, {15, 18}); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/utils.h b/fast_tokenizer/fast_tokenizer/test/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..af1525ebbf1da5b4998e7d1639d5e86bbe84bd57 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/utils.h @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include + +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +template +void CheckVectorEqual(const std::vector& a, const std::vector& b) { + ASSERT_EQ(a.size(), b.size()); + auto size = a.size(); + for (int i = 0; i < size; ++i) { + ASSERT_EQ(a[i], b[i]); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/tokenizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..929d6320cf7bb8046cd03d395799bcc447336c9e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.cc @@ -0,0 +1,138 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tokenizers_impl { + +ClipFastTokenizer::ClipFastTokenizer( + const std::string& vocab_path, + const std::string& merges_path, + uint32_t max_length, + const std::string& unk_token, + const std::string& pad_token, + const std::string& bos_token, + const std::string& eos_token, + bool add_prefix_space, + const std::string& continuing_subword_prefix, + const std::string& end_of_word_suffix, + bool trim_offsets) { + core::Vocab vocab; + core::Merges merges; + models::BPE::GetVocabAndMergesFromFile( + vocab_path, merges_path, &vocab, &merges); + VLOG(6) << "The vocab size of ClipFastTokenizer is " << vocab.size(); + VLOG(6) << "The merges size of ClipFastTokenizer is " << merges.size(); + + models::BPE bpe(vocab, + merges, + 10000, + {}, + {unk_token}, + {continuing_subword_prefix}, + {end_of_word_suffix}, + false); + // Set tokenizer model + this->SetModel(bpe); + + // Set added tokens + std::vector added_tokens; + uint32_t id; + unk_token_ = unk_token; + if (this->TokenToId(unk_token, &id)) { + added_tokens.emplace_back(unk_token, true); + } + pad_token_ = pad_token; + if (this->TokenToId(pad_token, &id)) { + added_tokens.emplace_back(pad_token, true); + pad_token_id_ = id; + } + bos_token_ = bos_token; + if (this->TokenToId(bos_token, &id)) { + added_tokens.emplace_back(bos_token, true); + bos_token_id_ = id; + } + eos_token_ = eos_token; + if (this->TokenToId(eos_token, &id)) { + 
added_tokens.emplace_back(eos_token, true); + eos_token_id_ = id; + } + this->AddSpecialTokens(added_tokens); + + // Set normalizers + normalizers::NFCNormalizer nfc_normalizer; + normalizers::ReplaceNormalizer replace_normalizer(R"(\s+)", " "); + normalizers::LowercaseNormalizer lower_normalizer; + normalizers::SequenceNormalizer seq_normalizer; + seq_normalizer.AppendNormalizer(&nfc_normalizer); + seq_normalizer.AppendNormalizer(&replace_normalizer); + seq_normalizer.AppendNormalizer(&lower_normalizer); + this->SetNormalizer(seq_normalizer); + + // Set pretokenizers + pretokenizers::ByteLevelPreTokenizer byte_level_pretokenizer(add_prefix_space, + true); + pretokenizers::SplitPreTokenizer split_pretokenizer( + R"('s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+)", + core::SplitMode::REMOVED, + true); + pretokenizers::SequencePreTokenizer seq_pretokenizer; + seq_pretokenizer.AppendPreTokenizer(&split_pretokenizer); + seq_pretokenizer.AppendPreTokenizer(&byte_level_pretokenizer); + this->SetPreTokenizer(seq_pretokenizer); + + // Set postprocessors + postprocessors::RobertaPostProcessor roberta_postprocessor( + {eos_token, eos_token_id_}, + {bos_token, bos_token_id_}, + /* trim_offsets= */ false, + add_prefix_space); + this->SetPostProcessor(roberta_postprocessor); + + if (max_length == 0) { + this->DisableTruncMethod(); + } else { + this->EnableTruncMethod(max_length, + 0, + core::Direction::RIGHT, + core::TruncStrategy::LONGEST_FIRST); + } +} + +std::string ClipFastTokenizer::GetPadToken() const { return pad_token_; } + +uint32_t ClipFastTokenizer::GetPadTokenId() const { return pad_token_id_; } + +std::string ClipFastTokenizer::GetUNKToken() const { return unk_token_; } + +uint32_t ClipFastTokenizer::GetUNKTokenId() const { return unk_token_id_; } + +std::string ClipFastTokenizer::GetBOSToken() const { return bos_token_; } + +uint32_t ClipFastTokenizer::GetBOSTokenId() const { return bos_token_id_; } + +std::string ClipFastTokenizer::GetEOSToken() const { return eos_token_; } + +uint32_t ClipFastTokenizer::GetEOSTokenId() const { return eos_token_id_; } + +} // namespace tokenizers_impl +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.h b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..c018919615563f4e8191d84c87374f88ee41de40 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.h @@ -0,0 +1,61 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#pragma once
+
+#include <string>
+#include <vector>
+#include "fast_tokenizer/core/encoding.h"
+#include "fast_tokenizer/core/tokenizer.h"
+#include "fast_tokenizer/utils/utils.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace tokenizers_impl {
+
+struct FASTTOKENIZER_DECL ClipFastTokenizer : public core::Tokenizer {
+  ClipFastTokenizer(const std::string& vocab_path,
+                    const std::string& merges_path,
+                    uint32_t max_length = 0,
+                    const std::string& unk_token = "<|endoftext|>",
+                    const std::string& pad_token = "<|endoftext|>",
+                    const std::string& bos_token = "<|startoftext|>",
+                    const std::string& eos_token = "<|endoftext|>",
+                    bool add_prefix_space = false,
+                    const std::string& continuing_subword_prefix = "",
+                    const std::string& end_of_word_suffix = "",
+                    bool trim_offsets = false);
+  std::string GetPadToken() const;
+  uint32_t GetPadTokenId() const;
+  std::string GetUNKToken() const;
+  uint32_t GetUNKTokenId() const;
+  std::string GetBOSToken() const;
+  uint32_t GetBOSTokenId() const;
+  std::string GetEOSToken() const;
+  uint32_t GetEOSTokenId() const;
+
+private:
+  std::string pad_token_;
+  uint32_t pad_token_id_;
+  std::string unk_token_;
+  uint32_t unk_token_id_;
+  std::string bos_token_;
+  uint32_t bos_token_id_;
+  std::string eos_token_;
+  uint32_t eos_token_id_;
+};
+
+} // namespace tokenizers_impl
+} // namespace fast_tokenizer
+} // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2c9d3bacbd5ce93dea93ee8932addaa0227a992e
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.cc
@@ -0,0 +1,152 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
*/ + +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tokenizers_impl { + +ErnieFastTokenizer::ErnieFastTokenizer(const std::string& vocab_path, + const std::string& unk_token, + const std::string& sep_token, + const std::string& cls_token, + const std::string& pad_token, + const std::string& mask_token, + bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase, + const std::string& wordpieces_prefix, + uint32_t max_sequence_len) { + core::Vocab vocab; + utils::GetVocabFromFiles(vocab_path, &vocab); + VLOG(6) << "The vocab size of ErnieFastTokenizer is " << vocab.size(); + Init(vocab, + unk_token, + sep_token, + cls_token, + pad_token, + mask_token, + clean_text, + handle_chinese_chars, + strip_accents, + lowercase, + wordpieces_prefix, + max_sequence_len); +} + + +ErnieFastTokenizer::ErnieFastTokenizer(const core::Vocab& vocab, + const std::string& unk_token, + const std::string& sep_token, + const std::string& cls_token, + const std::string& pad_token, + const std::string& mask_token, + bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase, + const std::string& wordpieces_prefix, + uint32_t max_sequence_len) { + Init(vocab, + unk_token, + sep_token, + cls_token, + pad_token, + mask_token, + clean_text, + handle_chinese_chars, + strip_accents, + lowercase, + wordpieces_prefix, + max_sequence_len); +} + + +void ErnieFastTokenizer::Init(const core::Vocab& vocab, + const std::string& unk_token, + const std::string& sep_token, + const std::string& cls_token, + const std::string& pad_token, + const std::string& mask_token, + bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase, + const std::string& wordpieces_prefix, + uint32_t max_sequence_len) { + models::FastWordPiece wordpiece(vocab, + unk_token, + 100 /* max_input_chars_per_word */, + wordpieces_prefix, + true); + this->SetModel(wordpiece); + + std::vector added_tokens; + uint32_t id; + if (this->TokenToId(unk_token, &id)) { + added_tokens.emplace_back(unk_token, true); + } + if (this->TokenToId(sep_token, &id)) { + added_tokens.emplace_back(sep_token, true); + } + if (this->TokenToId(cls_token, &id)) { + added_tokens.emplace_back(cls_token, true); + } + if (this->TokenToId(pad_token, &id)) { + added_tokens.emplace_back(pad_token, true); + } + if (this->TokenToId(mask_token, &id)) { + added_tokens.emplace_back(mask_token, true); + } + this->AddSpecialTokens(added_tokens); + + + normalizers::BertNormalizer bert_normalizer( + clean_text, handle_chinese_chars, strip_accents, lowercase); + this->SetNormalizer(bert_normalizer); + + if (vocab.size() > 0) { + uint32_t sep_id, cls_id; + if (!this->TokenToId(sep_token, &sep_id)) { + throw std::invalid_argument("sep_token not found in the vocabulary"); + } + if (!this->TokenToId(cls_token, &cls_id)) { + throw std::invalid_argument("cls_token not found in the vocabulary"); + } + postprocessors::BertPostProcessor bert_postprocessor({sep_token, sep_id}, + {cls_token, cls_id}); + this->SetPostProcessor(bert_postprocessor); + } + if (max_sequence_len == 0) { + this->DisableTruncMethod(); + } else { + 
this->EnableTruncMethod(max_sequence_len, + 0, + core::Direction::RIGHT, + core::TruncStrategy::LONGEST_FIRST); + } +} + +} // namespace tokenizers_impl +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.h b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..b1cf0dbe52b2bc04a7b35fdb6a8440a0f8799a08 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.h @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#pragma once + +#include +#include +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tokenizers_impl { + +struct FASTTOKENIZER_DECL ErnieFastTokenizer : public core::Tokenizer { + ErnieFastTokenizer(const std::string& vocab_path, + const std::string& unk_token = "[UNK]", + const std::string& sep_token = "[SEP]", + const std::string& cls_token = "[CLS]", + const std::string& pad_token = "[PAD]", + const std::string& mask_token = "[MASK]", + bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true, + const std::string& wordpieces_prefix = "##", + uint32_t max_sequence_len = 0); + + ErnieFastTokenizer(const core::Vocab& vocab, + const std::string& unk_token = "[UNK]", + const std::string& sep_token = "[SEP]", + const std::string& cls_token = "[CLS]", + const std::string& pad_token = "[PAD]", + const std::string& mask_token = "[MASK]", + bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true, + const std::string& wordpieces_prefix = "##", + uint32_t max_sequence_len = 0); + +private: + void Init(const core::Vocab& vocab, + const std::string& unk_token = "[UNK]", + const std::string& sep_token = "[SEP]", + const std::string& cls_token = "[CLS]", + const std::string& pad_token = "[PAD]", + const std::string& mask_token = "[MASK]", + bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true, + const std::string& wordpieces_prefix = "##", + uint32_t max_sequence_len = 0); +}; + +} // namespace fast_tokenizer_impl +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/utils/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..d4ef2b1eb91f094f1254468777b2f431f5f27b29 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/CMakeLists.txt @@ -0,0 +1,5 @@ +cc_library(utils SRCS utils.cc DEPS icuuc icudata) +cc_library(trie SRCS trie.cc DEPS dart utils) +cc_library(failure SRCS failure.cc DEPS trie utils) +cc_library(sentencepiece_normalizer SRCS sentencepiece_normalizer.cc DEPS trie icuuc icudata utils) +cc_library(lattice 
SRCS lattice.cc DEPS utils) \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/utils/cache.h b/fast_tokenizer/fast_tokenizer/utils/cache.h new file mode 100644 index 0000000000000000000000000000000000000000..70471057239442b5febf466c932de370626fe47e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/cache.h @@ -0,0 +1,102 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include + +#include "fast_tokenizer/utils/shared_mutex.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +static size_t DEFAULT_CACHE_CAPACITY = 10000; +typedef utils::shared_mutex RWLock; +typedef std::unique_lock WLock; +typedef utils::shared_lock RLock; + +template +struct Cache { + std::unordered_map map_; + size_t capacity_; + Cache(size_t capacity = DEFAULT_CACHE_CAPACITY) : capacity_(capacity) { + Fresh(); + } + + Cache(const Cache& other) { + RLock guard(cache_mutex_); + map_ = other.map_; + capacity_ = other.capacity_; + } + + Cache& operator=(const Cache& other) { + RLock guard(cache_mutex_); + map_ = other.map_; + capacity_ = other.capacity_; + return *this; + } + + void Fresh() { CreateCacheMap(capacity_); } + void Clear() { + WLock guard(cache_mutex_); + map_.clear(); + } + + bool GetValue(const K& key, V* value) { + // It's not guaranteed to get the value if the key is in cache + // for non-blocking read. + if (cache_mutex_.try_lock_shared()) { + if (map_.find(key) == map_.end()) { + cache_mutex_.unlock_shared(); + return false; + } + *value = map_.at(key); + cache_mutex_.unlock_shared(); + return true; + } + return false; + } + + bool SetValue(const K& key, const V& value) { + // Before trying to acquire a write lock, we check if we are already at + // capacity with a read handler. + if (cache_mutex_.try_lock_shared()) { + if (map_.size() >= capacity_) { + cache_mutex_.unlock_shared(); + return false; + } + } else { + return false; + } + if (cache_mutex_.try_lock()) { + map_.insert({key, value}); + cache_mutex_.unlock(); + return true; + } + return false; + } + +private: + void CreateCacheMap(size_t capacity) { + WLock guard(cache_mutex_); + map_ = std::unordered_map(capacity); + } + RWLock cache_mutex_; +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/failure.cc b/fast_tokenizer/fast_tokenizer/utils/failure.cc new file mode 100644 index 0000000000000000000000000000000000000000..1ae50b4d334e8fb2589a6f0cb77af5479587b474 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/failure.cc @@ -0,0 +1,425 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/failure.h" +#include "fast_tokenizer/utils/trie.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +Failure::Failure() + : failure_link_(utils::kNullNode), + failure_pops_offset_length_(utils::kNullFailurePopsList) {} + +FailureVocabToken::FailureVocabToken( + const std::string& token, + int token_id, + const std::string& continuing_subword_prefix) + : token_(token), + token_id_(token_id), + is_suffix_token_(false), + actual_token_start_offset_(0), + actual_token_unicode_len_(0), + contains_punctuation_(false) { + if (!continuing_subword_prefix.empty() && + token_ != continuing_subword_prefix && + utils::IsSuffixWord(token_, continuing_subword_prefix)) { + is_suffix_token_ = true; + actual_token_start_offset_ = continuing_subword_prefix.size(); + } + // Iterate over the Unicode chars from the token, to initialize + // contains_punctuation_ and actual_token_unicode_len_. + int token_len = token.size(); + int cur_pos = actual_token_start_offset_; + uint32_t ch; + const char* pSrc = token.c_str(); + while (cur_pos < token_len) { + uint32_t count = utils::UTF8ToUInt32(pSrc + cur_pos, &ch); + cur_pos += count; + ch = utils::UTF8ToUnicode(ch); + if (!contains_punctuation_ && utils::IsPunctuationOrChineseChar(ch)) { + contains_punctuation_ = true; + } + ++actual_token_unicode_len_; + } +} + +const std::string& FailureVocabToken::Token() const { return token_; } + +int FailureVocabToken::TokenId() const { return token_id_; } + +bool FailureVocabToken::IsSuffixToken() const { return is_suffix_token_; } + +bool FailureVocabToken::ContainsPunctuation() const { + return contains_punctuation_; +} + +int FailureVocabToken::TokenUnicodeLengthWithoutContinuingSubwordPrefix() + const { + return actual_token_unicode_len_; +} + +int FailureVocabToken::TokenLengthWithoutContinuingSubwordPrefix() const { + return token_.size() - actual_token_start_offset_; +} + +void FailureArray::BuildFailureVocab( + const std::unordered_map& vocab, + const std::string& unk_token, + const std::string& continuing_subword_prefix) { + if (vocab.size() > utils::kMaxSupportedVocabSize) { + std::ostringstream oss; + oss << "Vocab size exceeds the max supported (" + << utils::kMaxSupportedVocabSize + << "). Found vocab size: " << vocab.size(); + throw std::invalid_argument(oss.str()); + } + failure_vocab_tokens_.reserve(vocab.size()); + int unk_id = vocab.at(unk_token); + for (auto& item : vocab) { + if (item.first == continuing_subword_prefix) { + VLOG(6) + << "The empty suffix token is found in the vocabulary, which takes " + "place in token id space but will (almost) never be used in the " + "result. Consider cleaning it from the vocabulary."; + continue; + } + if (item.first.empty()) { + VLOG(6) + << "The empty string is found in the vocabulary, which takes place " + "in the token id space but will never be used in the result. 
" + "Consider cleaning it from the vocabulary."; + continue; + } + FailureVocabToken vocab_token( + item.first, item.second, continuing_subword_prefix); + if (vocab_token.TokenLengthWithoutContinuingSubwordPrefix() > + utils::kMaxVocabTokenLengthInUTF8Bytes) { + std::ostringstream oss; + oss << "Vocab token utf8 length (excluding suffix indicator) exceeds the " + "max supported (" + << utils::kMaxVocabTokenLengthInUTF8Bytes + << "). The vocab token is: " << item.first + << " with utf8 length (excluding suffix indicator): " + << vocab_token.TokenLengthWithoutContinuingSubwordPrefix(); + throw std::invalid_argument(oss.str()); + } + // Skip the word which contains punctuation but not a punctuation. + if (with_pretokenization_ && vocab_token.ContainsPunctuation() && + (vocab_token.IsSuffixToken() || + vocab_token.TokenUnicodeLengthWithoutContinuingSubwordPrefix() > 1)) { + continue; + } + failure_vocab_tokens_.emplace_back(vocab_token); + } + if (failure_vocab_tokens_.empty()) { + std::ostringstream oss; + oss << "No valid vocab tokens were found to build the trie."; + throw std::invalid_argument(oss.str()); + } + if (!continuing_subword_prefix.empty()) { + const bool suffix_token_exists = std::any_of( + failure_vocab_tokens_.begin(), + failure_vocab_tokens_.end(), + [](const FailureVocabToken& token) { return token.IsSuffixToken(); }); + if (!suffix_token_exists) { + auto new_suffix_token = continuing_subword_prefix + + std::string(1, utils::kInvalidControlChar); + failure_vocab_tokens_.emplace_back( + new_suffix_token, unk_id, continuing_subword_prefix); + } + } + if (with_pretokenization_) { + for (uint32_t cp = 1; cp <= 0x0010FFFF; ++cp) { + if (!utils::IsUnicodeChar(cp) || !utils::IsPunctuationOrChineseChar(cp)) { + continue; + } + char utf_str[5]; + utils::GetUTF8Str(reinterpret_cast(&cp), utf_str, 1); + std::string punc_str(utf_str); + if (vocab.count(punc_str) == 0) { + failure_vocab_tokens_.emplace_back( + punc_str, unk_id, continuing_subword_prefix); + } + } + failure_vocab_tokens_.emplace_back( + std::string(1, kInvalidControlChar), unk_id, continuing_subword_prefix); + } +} + +void FailureArray::CreateVocabFromFailureVocab( + const std::vector& failure_vocab_tokens, + std::unordered_map* vocab) const { + for (auto&& failure_vocab : failure_vocab_tokens) { + (*vocab)[failure_vocab.Token()] = failure_vocab.TokenId(); + } +} + +void FailureArray::InitFromVocabAndTrie( + const std::unordered_map& vocab, + Trie* trie, + const std::string& unk_token, + const std::string& continuing_subword_prefix) { + BuildFailureVocab(vocab, unk_token, continuing_subword_prefix); + + // Create Trie + std::unordered_map new_vocab; + CreateVocabFromFailureVocab(failure_vocab_tokens_, &new_vocab); + trie->SetVocab(new_vocab); + + // Create failure array + BuildFailureArray(failure_vocab_tokens_, trie); +} + +void FailureArray::RemovePunctuationTrieLink(Trie* trie) const { + auto continuing_subword_prefix = trie->GetContinuingSubwordPrefix(); + if (with_pretokenization_ && !continuing_subword_prefix.empty()) { + int cur_idx = 0; + int next_idx = 0; + uint32_t curr_char, next_char; + bool prev_node_is_root = false; + auto node = trie->CreateRootTraversalCursor(); + while (cur_idx < continuing_subword_prefix.length()) { + next_idx = cur_idx; + auto chwidth = utils::UTF8ToUInt32( + continuing_subword_prefix.data() + next_idx, &curr_char); + curr_char = utils::UTF8ToUnicode(curr_char); + next_idx = cur_idx + chwidth; + prev_node_is_root = (node.node_id_ == trie->kRootNodeId); + std::string 
cur_unicode_char(continuing_subword_prefix.data() + cur_idx, + chwidth); + if (!trie->TryTraverseSeveralSteps(&node, cur_unicode_char)) { + throw std::runtime_error( + "Cannot locate a character in suffix_indicator_. It should never " + "happen."); + } + if (IsPunctuationOrChineseChar(curr_char)) { + if (prev_node_is_root) { + cur_idx = next_idx; + auto next_chwidth = utils::UTF8ToUInt32( + continuing_subword_prefix.data() + next_idx, &next_char); + next_idx += next_chwidth; + std::string next_unicode_char( + continuing_subword_prefix.data() + cur_idx, next_chwidth); + auto child_node = node; + if (!trie->TryTraverseSeveralSteps(&child_node, next_unicode_char)) { + throw std::runtime_error( + "Cannot locate a character in suffix_indicator_. It should " + "never happen."); + } + trie->DeleteLinkFromParent(child_node.node_id_); + } else { + trie->DeleteLinkFromParent(node.node_id_); + } + break; + } + cur_idx = next_idx; + } + } +} + +// Algorithm 2 in https://arxiv.org/pdf/2012.15524.pdf +void FailureArray::BuildFailureArray( + const std::vector& failure_vocab_tokens, Trie* trie) { + std::vector> node_outgoing_edge_labels; + BuildOutgoingEdgeLabelsForTrie( + failure_vocab_tokens, trie, &node_outgoing_edge_labels); + failure_array_.resize(trie->Size()); + std::queue trie_node_queue({trie->kRootNodeId}); + if (trie->GetSuffixRoot() != trie->kRootNodeId) { + trie_node_queue.push(trie->GetSuffixRoot()); + } + while (!trie_node_queue.empty()) { + uint32_t parent_id = trie_node_queue.front(); + trie_node_queue.pop(); + std::vector outgoing_labels_sorted( + node_outgoing_edge_labels[parent_id].begin(), + node_outgoing_edge_labels[parent_id].end()); + std::sort(outgoing_labels_sorted.begin(), outgoing_labels_sorted.end()); + for (const char edge_label : outgoing_labels_sorted) { + auto child_node = trie->CreateTraversalCursor(parent_id); + if (!trie->TryTraverseOneStep(&child_node, edge_label)) { + std::ostringstream oss; + oss << "Failed to traverse to child following edge " << edge_label + << " at parent " << parent_id << "."; + throw std::runtime_error(oss.str()); + } + if (child_node.node_id_ == trie->GetSuffixRoot()) { + continue; + } + int child_data_value = -1; + // Case 1: str(v) in V + // * f(v) = trie.GetSuffixRoot() + // * F(v) = [str(v)] + if (trie->TryGetData(child_node, &child_data_value)) { + uint32_t failure_link = trie->GetSuffixRoot(); + if (node_id_is_punc_map_.count(child_node.node_id_) == 0) { + throw std::invalid_argument( + "Failed to find if an end node in the trie is a punctuation char " + "in node_id_is_punc_map_. It should never happen."); + } + if (with_pretokenization_ && + node_id_is_punc_map_.at(child_node.node_id_)) { + failure_link = trie->GetPuncFailureNode(); + } + AssignFailureLinkAndPops(child_node.node_id_, + failure_link, + {child_data_value}, + utils::kNullFailurePopsList); + trie_node_queue.push(child_node.node_id_); + continue; + } + + // Case 2: str(v) is not in V + const Failure& parent_failure = failure_array_[parent_id]; + if (parent_failure.failure_link_ != utils::kNullNode) { + std::vector one_step_pops; + auto curr_node = + trie->CreateTraversalCursor(parent_failure.failure_link_); + // Find the failure link util the failure link is root or + // the node has the outgoing label correspoding to edge_label. 
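// (Rough sketch of the walk below: starting from f(parent), follow failure
// links; each time we fall back from a node, append that node's failure pops
// to one_step_pops. When we reach a node u that can be advanced by edge_label,
// f(v) becomes goto(u, edge_label) and F(v) becomes F(parent) followed by the
// pops collected on the walk; if no such u exists, f(v) and F(v) stay empty.)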
+ while (true) { + if (trie->TryTraverseOneStep(&curr_node, edge_label)) { + AssignFailureLinkAndPops( + child_node.node_id_, + curr_node.node_id_, + one_step_pops, + parent_failure.failure_pops_offset_length_); + break; + } + const Failure& curr_node_failure = failure_array_[curr_node.node_id_]; + if (curr_node_failure.failure_link_ == utils::kNullNode) { + break; + } + GetFailurePopsAndAppendToOut( + curr_node_failure.failure_pops_offset_length_, &one_step_pops); + trie->SetTraversalCursor(&curr_node, curr_node_failure.failure_link_); + } + } + // If the failure_link of parent is root, + // * f(v) = none + // * F(v) = [] + trie_node_queue.push(child_node.node_id_); + } + } + RemovePunctuationTrieLink(trie); +} + +void FailureArray::AssignFailureLinkAndPops( + uint32_t cur_node, + uint32_t failure_link, + const std::vector& one_step_pops, + int parent_failure_pops_offset_length) { + if (failure_link == utils::kNullNode) { + return; + } + auto& curr_node_failure = failure_array_[cur_node]; + curr_node_failure.failure_link_ = failure_link; + if (one_step_pops.empty()) { + curr_node_failure.failure_pops_offset_length_ = + parent_failure_pops_offset_length; + } else { + const int offset = failure_pops_pool_.size(); + if (offset > utils::kMaxSupportedFailurePoolOffset) { + std::ostringstream oss; + oss << "Failure pops list offset is " << offset + << ", which exceeds maximum supported offset " + << utils::kMaxSupportedFailurePoolOffset + << ". The vocabulary seems to be too large to be supported."; + throw std::runtime_error(oss.str()); + } + GetFailurePopsAndAppendToOut(parent_failure_pops_offset_length, + &failure_pops_pool_); + failure_pops_pool_.insert( + failure_pops_pool_.end(), one_step_pops.begin(), one_step_pops.end()); + const int length = failure_pops_pool_.size() - offset; + if (length > utils::kMaxSupportedFailurePoolOffset) { + std::ostringstream oss; + oss << "Failure pops list size is " << length + << ", which exceeds maximum supported offset " + << utils::kMaxFailurePopsListSize; + throw std::runtime_error(oss.str()); + } + curr_node_failure.failure_pops_offset_length_ = + utils::EncodeFailurePopList(offset, length); + } +} + +void FailureArray::GetFailurePopsAndAppendToOut( + uint32_t failure_pops_offset_length, std::vector* out_failure_pops) { + if (failure_pops_offset_length == utils::kNullFailurePopsList) { + return; + } + int offset = 0, length = 0; + utils::GetFailurePopsOffsetAndLength( + failure_pops_offset_length, &offset, &length); + out_failure_pops->insert(out_failure_pops->end(), + failure_pops_pool_.begin() + offset, + failure_pops_pool_.begin() + offset + length); +} + +void FailureArray::BuildOutgoingEdgeLabelsForTrie( + const std::vector& failure_vocab_tokens, + Trie* trie, + std::vector>* node_outgoing_edge_labels) { + node_outgoing_edge_labels->resize(trie->Size()); + const std::string dummy_token = std::string(1, utils::kInvalidControlChar); + for (auto& item : failure_vocab_tokens) { + if (item.Token() != dummy_token) { + BuildOutgoingEdgeLabelsFromToken(item, trie, node_outgoing_edge_labels); + } + } +} + +void FailureArray::BuildOutgoingEdgeLabelsFromToken( + const FailureVocabToken& vocab_token, + Trie* trie, + std::vector>* node_outgoing_edge_labels) { + const std::string& token = vocab_token.Token(); + Trie::TraversalCursor curr_node; + int char_pos = 0; + trie->SetTraversalCursor(&curr_node, Trie::kRootNodeId); + while (char_pos < token.size()) { + const char edge_label = token[char_pos]; + 
(*node_outgoing_edge_labels)[curr_node.node_id_].insert(edge_label); + if (!trie->TryTraverseOneStep(&curr_node, edge_label)) { + std::ostringstream oss; + oss << "Error in traversing to child following edge `" << edge_label + << "` from the prefix `" << token.substr(0, char_pos) + << "` at parent id " << curr_node.node_id_ << ". The token is `" + << token << "`. The char position" + << " is " << char_pos << "."; + + throw std::runtime_error(oss.str()); + } + ++char_pos; + } + node_id_is_punc_map_[curr_node.node_id_] = + !vocab_token.IsSuffixToken() && vocab_token.ContainsPunctuation() && + vocab_token.TokenUnicodeLengthWithoutContinuingSubwordPrefix() == 1; +} + + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/failure.h b/fast_tokenizer/fast_tokenizer/utils/failure.h new file mode 100644 index 0000000000000000000000000000000000000000..c302f53496a8b4bd3156234635f108e50ac1e8e2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/failure.h @@ -0,0 +1,107 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +class Trie; + +// Used in Fast WordPiece Model specially +struct Failure { + uint32_t failure_link_; + // Indicate the number of failure_pops + // and the offset in failure_pops_pool + uint32_t failure_pops_offset_length_; + Failure(); +}; + +class FailureVocabToken { +public: + FailureVocabToken(const std::string& token, + int token_id, + const std::string& continuing_subword_prefix); + + const std::string& Token() const; + + int TokenId() const; + bool IsSuffixToken() const; + bool ContainsPunctuation() const; + int TokenUnicodeLengthWithoutContinuingSubwordPrefix() const; + int TokenLengthWithoutContinuingSubwordPrefix() const; + +private: + std::string token_; + int token_id_; + bool is_suffix_token_; + int actual_token_start_offset_; + int actual_token_unicode_len_; + bool contains_punctuation_; +}; + +struct FailureArray { + FailureArray(bool with_pretokenization = false) + : with_pretokenization_(with_pretokenization) {} + void BuildFailureArray( + const std::vector& failure_vocab_tokens, Trie* trie); + void BuildFailureVocab(const std::unordered_map& vocab, + const std::string& unk_token, + const std::string& continuing_subword_prefix); + void InitFromVocabAndTrie( + const std::unordered_map& vocab, + Trie* trie, + const std::string& unk_token, + const std::string& continuing_subword_prefix); + const Failure* GetFailure(int idx) const { return &(failure_array_.at(idx)); } + int GetFailurePop(int idx) const { return failure_pops_pool_.at(idx); } + void SetWithPretokenization(bool with_pretokenization) { + with_pretokenization_ = with_pretokenization; + } + +private: + void BuildOutgoingEdgeLabelsForTrie( + const std::vector& failure_vocab_tokens, + Trie* trie, + 
std::vector>* node_outgoing_edge_labels); + void BuildOutgoingEdgeLabelsFromToken( + const FailureVocabToken& vocab_token, + Trie* trie, + std::vector>* node_outgoing_edge_labels); + void AssignFailureLinkAndPops(uint32_t cur_node, + uint32_t failure_link, + const std::vector& one_step_pops, + int parent_failure_pops_offset_length); + void GetFailurePopsAndAppendToOut(uint32_t failure_pops_offset_length, + std::vector* out_failure_pops); + void RemovePunctuationTrieLink(Trie* trie) const; + void CreateVocabFromFailureVocab( + const std::vector& failure_vocab_tokens, + std::unordered_map* vocab) const; + std::vector failure_array_; + std::vector failure_pops_pool_; + std::unordered_map node_id_is_punc_map_; + std::vector failure_vocab_tokens_; + bool with_pretokenization_; // The end-to-end version of FailureArray +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/lattice.cc b/fast_tokenizer/fast_tokenizer/utils/lattice.cc new file mode 100644 index 0000000000000000000000000000000000000000..14447788cda5ea07e4a41129f6da4452a57e2b61 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/lattice.cc @@ -0,0 +1,546 @@ +// Copyright 2016 Google Inc. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "fast_tokenizer/utils/lattice.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "glog/logging.h" + +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// Size of nodes pre-allocated in Lattice. +constexpr size_t kPreallocateLatticeNodeSize = 1024; + +constexpr float kEpsilon = 1e-7; + +constexpr unsigned int kDefaultSeed = static_cast(-1); +static std::atomic g_seed(kDefaultSeed); + +inline float LogSumExp(float x, float y, bool init_mode) { + if (init_mode) { + return y; + } + const float vmin = std::min(x, y); + const float vmax = std::max(x, y); + constexpr float kMinusLogEpsilon = 50; + if (vmax > vmin + kMinusLogEpsilon) { + return vmax; + } else { + return vmax + log(std::exp(static_cast(vmin - vmax)) + 1.0); + } +} + +uint32_t GetRandomGeneratorSeed() { + return g_seed == kDefaultSeed ? 
std::random_device{}() : g_seed.load(); +} + +std::mt19937 *GetRandomGenerator() { + thread_local static std::mt19937 mt(GetRandomGeneratorSeed()); + return &mt; +} + +inline float Gumbel() { + const float kEpsilon = 1e-7; + auto *mt = GetRandomGenerator(); + std::uniform_real_distribution dis(0.0, 1.0); + float noise = -std::log(-(std::log(dis(*mt) + kEpsilon))); + return noise; +} + +Lattice::Lattice() : node_allocator_(kPreallocateLatticeNodeSize) {} +Lattice::~Lattice() {} + +const std::vector &Lattice::begin_nodes(int pos) const { + return begin_nodes_[pos]; +} + +const std::vector &Lattice::end_nodes(int pos) const { + return end_nodes_[pos]; +} + +int Lattice::size() const { + // -1 because surface_ may include the EOS. + return std::max(0, surface_.size() - 1); +} + +int Lattice::utf8_size() const { return sentence_.size(); } + +const char *Lattice::sentence() const { return sentence_.data(); } + +const char *Lattice::surface(int pos) const { return surface_[pos]; } + +Lattice::Node *Lattice::bos_node() const { return end_nodes_[0][0]; } + +Lattice::Node *Lattice::eos_node() const { return begin_nodes_[size()][0]; } + +Lattice::Node *Lattice::NewNode() { + Node *node = node_allocator_.Allocate(); + node->node_id = node_allocator_.size() - 1; + return node; +} + +void Lattice::Clear() { + begin_nodes_.clear(); + end_nodes_.clear(); + sentence_ = utils::simple_string_view(""); + surface_.clear(); + node_allocator_.Free(); +} + +void Lattice::SetSentence(utils::simple_string_view sentence) { + Clear(); + + sentence_ = sentence; + surface_.reserve(sentence.size() + 1); + + while (!sentence.empty()) { + const int mblen = + std::min(utils::OneCharLen(sentence.data()), sentence.size()); + surface_.push_back(sentence.data()); + sentence.remove_prefix(mblen); + } + surface_.push_back(sentence.data()); + + const int len = size(); + begin_nodes_.resize(len + 1); + end_nodes_.resize(len + 1); + + constexpr size_t kReservedNodeSize = 16; + for (int i = 0; i <= len; ++i) { + begin_nodes_[i].reserve(kReservedNodeSize); + end_nodes_[i].reserve(kReservedNodeSize); + } + + Node *bos = NewNode(); + bos->id = -1; + bos->pos = 0; + end_nodes_[0].push_back(bos); + + Node *eos = NewNode(); + eos->id = -1; + eos->pos = len; + begin_nodes_[len].push_back(eos); +} + +Lattice::Node *Lattice::Insert(int pos, int length) { + Node *node = NewNode(); + node->pos = pos; + node->length = length; + const int utf8_length = + static_cast(surface(pos + length) - surface(pos)); + node->piece = simple_string_view(surface(pos), utf8_length); + begin_nodes_[pos].push_back(node); + end_nodes_[pos + node->length].push_back(node); + return node; +} + +Lattice::LatticePathWithScore Lattice::Viterbi() { + const int len = size(); + + for (int pos = 0; pos <= len; ++pos) { + for (Node *rnode : begin_nodes_[pos]) { + rnode->prev = nullptr; + float best_score = 0.0; + Node *best_node = nullptr; + for (Node *lnode : end_nodes_[pos]) { + const float score = lnode->backtrace_score + rnode->score; + if (best_node == nullptr || score > best_score) { + best_node = lnode; + best_score = score; + } + } + if (best_node == nullptr) { + LOG(ERROR) << "Failed to find the best path in Viterbi."; + return {}; + } + rnode->prev = best_node; + rnode->backtrace_score = best_score; + } + } + + // backtrace + std::vector results; + float score = begin_nodes(len)[0]->backtrace_score; + for (Node *node = begin_nodes_[len][0]->prev; node->prev != nullptr; + node = node->prev) { + results.push_back(node); + } + + std::reverse(results.begin(), 
results.end()); + + LatticePathWithScore retval = {results, score}; + + return retval; +} + +std::vector Lattice::ForwardAlgorithm(float inv_theta) const { + const int len = size(); + std::vector alpha(node_allocator_.size(), 0.0); + + for (int pos = 0; pos <= len; ++pos) { + for (Node *rnode : begin_nodes_[pos]) { + for (Node *lnode : end_nodes_[pos]) { + alpha[rnode->node_id] = + LogSumExp(alpha[rnode->node_id], + inv_theta * lnode->score + alpha[lnode->node_id], + lnode == end_nodes_[pos][0]); + } + } + } + + return alpha; +} + +std::vector Lattice::BackwardAlgorithm(float inv_theta) const { + const int len = size(); + std::vector beta(node_allocator_.size(), 0.0); + + for (int pos = len; pos >= 0; --pos) { + for (Node *lnode : end_nodes_[pos]) { + for (Node *rnode : begin_nodes_[pos]) { + beta[lnode->node_id] = LogSumExp(beta[lnode->node_id], + rnode->score + beta[rnode->node_id], + rnode == begin_nodes_[pos][0]); + } + } + } + + return beta; +} + +float Lattice::PopulateMarginal(float freq, + std::vector *expected) const { + if (expected == nullptr) return 0.0; + + const int len = size(); + + // alpha and beta (accumulative log prob) in Forward Backward. + // the index of alpha/beta is Node::node_id. + + const auto alpha = ForwardAlgorithm(1.0); + const auto beta = BackwardAlgorithm(1.0); + + const float Z = alpha[begin_nodes_[len][0]->node_id]; + for (int pos = 0; pos < len; ++pos) { + for (Node *node : begin_nodes_[pos]) { + if (node->id >= 0) { + // the index of |expected| is a Node::id, which is a vocabulary id. + (*expected)[node->id] += + freq * + std::exp(static_cast(alpha[node->node_id] + node->score + + beta[node->node_id] - Z)); + } + } + } + + return freq * Z; +} + +float Lattice::CalculateEntropy(float inv_theta) const { + const int len = size(); + + // alpha[node_id] is the marginal prob of sequence up to start of node + // H is entropy of sequence + // the index of alpha/H is Node::node_id. + std::vector H(node_allocator_.size(), 0.0); + + // Populate the forward marginals to get the normalising constant + const auto alpha = ForwardAlgorithm(inv_theta); + + // Now populate the forward entropies + for (int pos = 0; pos <= len; ++pos) { + for (Node *rnode : begin_nodes_[pos]) { + for (Node *lnode : end_nodes_[pos]) { + // Contribution each lnode makes = p(lnode) * (H(lnode) + log p(lnode)) + + // We have to normalise p(lnode) by the marginal contribution it makes + const float lnode_transition_prob = + ((inv_theta * lnode->score) + alpha[lnode->node_id] - + alpha[rnode->node_id]); + H[rnode->node_id] += std::exp(lnode_transition_prob) * + (H[lnode->node_id] + lnode_transition_prob); + } + } + } + + return -H[begin_nodes_[len][0]->node_id]; +} + +// The node structure to support A* algorithm in Lattice::NBest() +struct Hypothesis { + Lattice::Node *node; + Hypothesis *next; + float fx; // the priority to pop a new hypothesis from the priority queue. + float gx; // the sum of scores from EOS to the left-most node in x. +}; + +// Helper function for cloning a Hypothesis and the ones on their next paths. +// The graph structure is preserved. +// +// to_clone: the Hypothesis to clone. +// clone_map: mapping between the old pointers and the new pointers. +// allocator: allocate and own the cloned Hypothesis. +// +// Returns the cloned Hypothesis*. All Hypothesis on its "next" chain are also +// guaranteed to have been allocated via "allocator", and "clone_map" is updated +// with all new mappings. 
+Hypothesis *CloneHypAndDependents( + const Hypothesis *to_clone, + std::unordered_map *clone_map, + FreeList *allocator) { + Hypothesis *cloned = nullptr; + Hypothesis **result_callback = &cloned; + + // Iteratively clone "to_clone" and its dependencies. + // The new pointer will be written back to *result_callback. + while (to_clone != nullptr) { + // If "to_clone" has already been cloned before, we just look up the result. + auto iter = clone_map->find(to_clone); + if (iter != clone_map->end()) { + *result_callback = iter->second; + break; + } + + // Allocate a new Hypothesis and copy the values. + Hypothesis *new_hyp = allocator->Allocate(); + *new_hyp = *to_clone; + *result_callback = new_hyp; + clone_map->insert({to_clone, new_hyp}); + + // Move on to clone "to_clone->next". + to_clone = to_clone->next; + result_callback = &(new_hyp->next); + LOG(ERROR) << "Failed to find the best path in Viterbi."; + } + return cloned; +} + +std::vector Lattice::NBest(size_t nbest_size, + bool sample, + float inv_theta) { + if (nbest_size < 1) { + LOG(WARNING) << "nbest_size >= 1. Returns empty result."; + return {}; + } + + if (nbest_size == 1 && !sample) { + return {Viterbi()}; + } + + // Uses A* search to enumerate N-bests. + // Given a lattice, enumerates hypotheses (paths) from EOS. + // At each partial path x, compute f(x) as follows + // f(x) = g(x) + h(x). + // g(x): the sum of scores from EOS to the left-most node in x. + // for a complete hypothesis, g(hyp) is the score of the hypothesis. + // h(x): a heuristic that estimates the largest score from x to BOS. + // f(x): the priority to pop a new hypothesis from the priority queue. + // + // As left-to-right Viterbi search can tell the *exact* value of h(x), + // we can obtain the exact n-best results with A*. + + class HypothesisComparator { + public: + const bool operator()(Hypothesis *h1, Hypothesis *h2) { + return (h1->fx < h2->fx); + } + }; + + using Agenda = std::priority_queue, + HypothesisComparator>; + constexpr size_t kPreallocatedHypothesisSize = 512; + FreeList hypothesis_allocator(kPreallocatedHypothesisSize); + + Agenda agenda; + std::vector results; + + auto *eos = hypothesis_allocator.Allocate(); + eos->node = eos_node(); + eos->next = nullptr; + eos->gx = 0.0; + + std::vector alpha(node_allocator_.size(), 0.0); + + if (sample) { + // Run forwards algorithm to get normalising constants + alpha = ForwardAlgorithm(inv_theta); + // f(eos) = Gumbel(0), as it is the perturbed score of the entire lattice. + eos->fx = Gumbel(); + } else { + // Run Viterbi first to fill backtrace score. + Viterbi(); + eos->fx = eos->node->backtrace_score; + } + agenda.push(eos); + + int shrink_count = 0; // Number of times agenda has shrunk. For logging only. + bool printed_memory_warning = false; // For logging only. 
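// The loop below repeatedly pops the best partial hypothesis and extends it
// one lattice node to the left. A minimal driver of this API (all names are
// declared in lattice.h in this diff; ids and scores are only illustrative):
//
//   Lattice lattice;
//   lattice.SetSentence("abc");
//   Lattice::Node *piece = lattice.Insert(0, 3);  // one piece covering "abc"
//   piece->id = 7;                                 // vocabulary id
//   piece->score = -0.5f;                          // log probability
//   auto paths = lattice.NBest(/*nbest_size=*/2, /*sample=*/false,
//                              /*inv_theta=*/1.0f);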
+ while (!agenda.empty()) { + auto *top = agenda.top(); + agenda.pop(); + auto *node = top->node; + + // Reaches to BOS + if (node == bos_node()) { + results.resize(results.size() + 1); + for (auto *n = top->next; n->next != nullptr; n = n->next) { + results.back().first.push_back(n->node); + } + results.back().second = top->fx; + if (results.size() == nbest_size) { + break; + } + continue; + } + + const int end_nodes_size = end_nodes(node->pos).size(); + std::vector probs(end_nodes_size, 0.0); + std::vector perturbed_probs(end_nodes_size, 0.0); + std::vector adjusted_probs(end_nodes_size, 0.0); + const float Z = alpha[node->node_id]; + if (sample) { + float max_score = -1e8; + // Calculate the marginal and perturbed scores for stochastic search + for (int i = 0; i < end_nodes(node->pos).size(); i++) { + Node *lnode = end_nodes(node->pos)[i]; + // Calculate backwards transition score + probs[i] = + top->gx + alpha[lnode->node_id] + (inv_theta * lnode->score) - Z; + perturbed_probs[i] = probs[i] + Gumbel(); + if (perturbed_probs[i] > max_score) { + max_score = perturbed_probs[i]; + } + } + // Now constrain the sampled continuations to match the score of parent + for (int i = 0; i < adjusted_probs.size(); i++) { + // Use numerically stable version of truncated Gumbel: + // https://arxiv.org/pdf/1903.06059.pdf appendix B.3 + const float v = top->fx - perturbed_probs[i] + + std::log1p(-std::exp(perturbed_probs[i] - max_score)); + adjusted_probs[i] = top->fx - std::max(static_cast(0.0), v) - + std::log1p(std::exp(-std::abs(v))); + } + } + + // Expands new node ending at node->pos + for (int i = 0; i < end_nodes(node->pos).size(); i++) { + Node *lnode = end_nodes(node->pos)[i]; + auto *hyp = hypothesis_allocator.Allocate(); + hyp->node = lnode; + if (sample) { + hyp->gx = probs[i]; + hyp->fx = adjusted_probs[i]; + } else { + hyp->gx = lnode->score + top->gx; // just adds node->score + hyp->fx = + lnode->backtrace_score + top->gx; // backtrace_score is h(node). + } + hyp->next = top; + agenda.push(hyp); + } + + static constexpr int kOneBillion = 1000000000; // 10^9. + if (hypothesis_allocator.size() >= kOneBillion) { + if (!printed_memory_warning) { + printed_memory_warning = true; + LOG(WARNING) << "Allocator size exceeds " << kOneBillion + << " with an example of length " << this->size(); + } + } + + // When the input is too long or contains duplicated phrases, + // `agenda` will get extremely big. Here we avoid this case by + // dynamically shrinking the agenda. + constexpr int kMaxAgendaSize = 10000; + constexpr int kMinAgendaSize = 512; + if (agenda.size() >= kMaxAgendaSize) { + // Keeps the top `kMinAgendaSize` hypothesis. + Agenda new_agenda; + // Keeps the top hypothesis and the ones on their "next" paths. + FreeList new_allocator(kPreallocatedHypothesisSize); + // Map between old Hypothesis* and new Hypothesis*. + std::unordered_map clone_map; + + const int size = std::min(kMinAgendaSize, nbest_size * 10); + shrink_count++; + LOG(WARNING) << "Too big agenda size " << agenda.size() + << ". 
Shrinking (round " << shrink_count << ") down to " + << size << "."; + for (int i = 0; i < size; ++i) { + const Hypothesis *top_hyp = agenda.top(); + Hypothesis *cloned_hyp = + CloneHypAndDependents(top_hyp, &clone_map, &new_allocator); + new_agenda.push(cloned_hyp); + agenda.pop(); + } + agenda = std::move(new_agenda); + hypothesis_allocator.swap(new_allocator); + } + } + + return results; +} + +std::vector Lattice::Sample(float inv_theta) { + const int len = size(); + if (len == 0) return {}; + + std::vector alpha(node_allocator_.size(), 0.0); + + alpha = ForwardAlgorithm(inv_theta); + + auto *mt = GetRandomGenerator(); + + std::vector results; + std::vector probs; + + float Z = alpha[eos_node()->node_id]; + Node *node = eos_node(); + while (true) { + probs.clear(); + for (const Node *lnode : end_nodes_[node->pos]) { + probs.push_back(std::exp(static_cast( + alpha[lnode->node_id] + inv_theta * lnode->score - Z))); + } + std::discrete_distribution dist(probs.begin(), probs.end()); + node = end_nodes_[node->pos][dist(*mt)]; + if (node == bos_node()) break; + + Z = alpha[node->node_id]; + results.push_back(node); + } + + std::reverse(results.begin(), results.end()); + return results; +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/lattice.h b/fast_tokenizer/fast_tokenizer/utils/lattice.h new file mode 100644 index 0000000000000000000000000000000000000000..daa6523d059d46bd9d56cce4427e408ee0de76b4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/lattice.h @@ -0,0 +1,192 @@ +// Copyright 2016 Google Inc. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "fast_tokenizer/utils/string_view.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// Copy from https://github.com/google/sentencepiece/blob/master/src/freelist.h +// Simple FreeList that allocates a chunk of T at once. +template +class FreeList { +public: + FreeList() = delete; + explicit FreeList(size_t chunk_size) : chunk_size_(chunk_size) {} + virtual ~FreeList() { + for (auto &chunk : freelist_) delete[] chunk; + } + + // `Free` doesn't free the object but reuse the allocated memory chunks. + void Free() { + const int size = std::min(chunk_index_ + 1, freelist_.size()); + for (int i = 0; i < size; ++i) { + T *chunk = freelist_[i]; + memset(static_cast(chunk), 0, sizeof(*chunk) * chunk_size_); + } + chunk_index_ = 0; + element_index_ = 0; + } + + // Returns the number of allocated elements. + size_t size() const { return chunk_size_ * chunk_index_ + element_index_; } + + void swap(FreeList &other) { + std::swap(freelist_, other.freelist_); + std::swap(element_index_, other.element_index_); + std::swap(chunk_index_, other.chunk_index_); + std::swap(chunk_size_, other.chunk_size_); + } + + // Returns the element as an array. 
+ T *operator[](size_t index) const { + return freelist_[index / chunk_size_] + index % chunk_size_; + } + + // Allocates new element. + T *Allocate() { + if (element_index_ >= chunk_size_) { + ++chunk_index_; + element_index_ = 0; + } + + if (chunk_index_ == freelist_.size()) { + T *chunk = new T[chunk_size_]; + memset(static_cast(chunk), 0, sizeof(*chunk) * chunk_size_); + freelist_.push_back(chunk); + } + + T *result = freelist_[chunk_index_] + element_index_; + ++element_index_; + + return result; + } + +private: + std::vector freelist_; + + // The last element is stored at freelist_[chunk_index_][element_index_] + size_t element_index_ = 0; + size_t chunk_index_ = 0; + size_t chunk_size_ = 0; // Do not modify except in swap() +}; + + +// Copy from +// https://github.com/google/sentencepiece/blob/master/src/unigram_model.h +class Lattice { +public: + Lattice(); + virtual ~Lattice(); + + struct Node { + utils::simple_string_view piece; // Sentence piece representation. + uint32_t pos; // Unicode position in the sentence. + uint32_t length; // Unicode length, not UT8 byte. + uint32_t node_id; // unique id in the current lattice. + int id; // vocab id. (maybe -1 for UNK) + float score; // logprob of this sentencepiece. + float backtrace_score; // backtrace info used in Viterbi. + Node *prev; // best previous node on Viterbi path. + + std::string DebugString() const; + }; + + // Returns bos node. + Node *bos_node() const; + + // Returns eos node. + Node *eos_node() const; + + // Returns nodes starting at |pos|. + const std::vector &begin_nodes(int pos) const; + + // Returns nodes ending at |pos|. + const std::vector &end_nodes(int pos) const; + + // Returns Unicode character length. + int size() const; + + // Returns multi-byte (utf8) length. + int utf8_size() const; + + // Returns the substring of sentence. sentence[pos:] + const char *surface(int pos) const; + + // Returns immutable sentence. The same as surface(0) + const char *sentence() const; + + // Clears the lattice. + void Clear(); + + // Sets new sentence. + void SetSentence(utils::simple_string_view sentence); + + // Inserts a new node at [pos, pos + length - 1]. + // After calling this method, The caller must set Node::score and Node::id. + Node *Insert(int pos, int length); + + using LatticePathWithScore = std::pair, float>; + + // Returns Viterbi path. All nodes must be populated in advance. + LatticePathWithScore Viterbi(); + + // Runs forwards/backwards algorithm, returns vector with normalised + // transition probs. + std::vector ForwardAlgorithm(float theta) const; + std::vector BackwardAlgorithm(float theta) const; + + // Returns n-best results. + std::vector NBest(size_t nbest_size, + bool sample, + float theta); + + // Samples one path from the lattice according to the + // generation probability (Product of piece probabilities). + // `theta` is a smoothing parameter. + std::vector Sample(float theta); + + // Calculates the entropy of the lattice. + float CalculateEntropy(float theta) const; + + // Populates marginal probability of every node in this lattice. + // |freq| is the frequency of the sentence. + // for (auto *node : all_nodes_) { + // (*expected)[node->id] += marginal_prob_of_node * freq; + // } + // Returns the log-likelihood of this sentence. + float PopulateMarginal(float freq, std::vector *expected) const; + +private: + // Returns new node. + // Lattice class has the ownership of the returned value. 
+ Node *NewNode(); + + utils::simple_string_view sentence_; + std::vector surface_; + std::vector> begin_nodes_; + std::vector> end_nodes_; + FreeList node_allocator_; +}; + + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/path.h b/fast_tokenizer/fast_tokenizer/utils/path.h new file mode 100644 index 0000000000000000000000000000000000000000..a58a00af613abf822e3af365855fe286a7d7479d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/path.h @@ -0,0 +1,58 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include + +#ifdef _MSC_VER +#define PATH_SEP "\\" +#else +#define PATH_SEP "/" +#endif + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +inline std::string PathJoin(const std::vector& paths, + const std::string& sep = PATH_SEP) { + if (paths.size() == 1) { + return paths[0]; + } + std::string filepath = ""; + for (const auto& path : paths) { + if (filepath == "") { + filepath += path; + continue; + } + if (path[0] == sep[0] || filepath.back() == sep[0]) { + filepath += path; + } else { + filepath += sep + path; + } + } + return filepath; +} + +inline std::string PathJoin(const std::string& folder, + const std::string filename, + const std::string& sep = PATH_SEP) { + return PathJoin(std::vector{folder, filename}, sep); +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.cc b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..4a7bf9950ab5996347a3b3254c0ef75887d7b728 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.cc @@ -0,0 +1,342 @@ +// Copyright 2016 Google Inc. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
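For reference, the `Lattice` declared above is driven in three steps: set the sentence, insert candidate pieces together with their vocab ids and log-probability scores, then decode. The sketch below is illustrative only and not part of this diff; the vocab id and score values are hypothetical placeholders.

```cpp
#include "fast_tokenizer/utils/lattice.h"

using paddlenlp::fast_tokenizer::utils::Lattice;
using paddlenlp::fast_tokenizer::utils::simple_string_view;

void SegmentExample() {
  Lattice lattice;
  lattice.SetSentence(simple_string_view("hello"));  // text to segment
  // Register one candidate piece spanning the 5 characters starting at pos 0.
  // The caller is expected to fill in the vocab id and the log-prob score.
  Lattice::Node *node = lattice.Insert(0, 5);
  node->id = 42;        // hypothetical vocab id
  node->score = -1.5f;  // hypothetical log-probability
  // Best-scoring path over all inserted pieces, together with its score.
  Lattice::LatticePathWithScore best = lattice.Viterbi();
  (void)best;
}
```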
+ +#include "fast_tokenizer/utils/sentencepiece_normalizer.h" +#include +#include "fast_tokenizer/utils/unique_ptr.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" + +#include "glog/logging.h" +#include "unicode/brkiter.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +PrefixMatcher::PrefixMatcher(const std::set& dic) { + if (dic.empty()) return; + std::vector key; + key.reserve(dic.size()); + for (const auto& it : dic) key.push_back(it); + trie_ = utils::make_unique(); + trie_->build(key.size(), const_cast(&key[0]), nullptr, nullptr); +} + +int PrefixMatcher::PrefixMatch(const char* w, size_t w_len, bool* found) const { + if (trie_ == nullptr) { + if (found) *found = false; + return std::min(w_len, OneCharLen(w)); + } + constexpr int kResultSize = 64; + Darts::DoubleArray::result_pair_type trie_results[kResultSize]; + const int num_nodes = + trie_->commonPrefixSearch(w, trie_results, kResultSize, w_len); + if (found) *found = (num_nodes > 0); + if (num_nodes == 0) { + return std::min(w_len, OneCharLen(w)); + } + + int mblen = 0; + for (int i = 0; i < num_nodes; ++i) { + mblen = std::max(trie_results[i].length, mblen); + } + return mblen; +} + +std::string PrefixMatcher::GlobalReplace(const char* w, + size_t w_len, + const char* out, + size_t out_len, + const char** result_w) const { + std::string result; + if (w_len > 0) { + bool found = false; + const int mblen = PrefixMatch(w, w_len, &found); + if (found) { + result.append(out, out_len); + } else { + result.append(w, mblen); + } + *result_w = w + mblen; + } + return result; +} + +Normalizer::Normalizer(const std::string& precompiled_charsmap) + : precompiled_charsmap_(precompiled_charsmap) { + Init(); +} + +Normalizer::Normalizer(const Normalizer& other) + : precompiled_charsmap_(other.precompiled_charsmap_) { + Init(); +} + + +Normalizer::~Normalizer() {} + +std::string Normalizer::GetPrecompiledCharsmap() const { + return precompiled_charsmap_; +} + +void Normalizer::Init() { + if (!precompiled_charsmap_.empty()) { +#ifdef IS_BIG_ENDIAN + DecodePrecompiledCharsMap(precompiled_charsmap_.data(), + precompiled_charsmap_.length(), + &trie_blob_, + &normalized_blob_, + &precompiled_charsmap_buffer_); +#else + DecodePrecompiledCharsMap(precompiled_charsmap_.data(), + precompiled_charsmap_.length(), + &trie_blob_, + &normalized_blob_); +#endif + // Reads the body of double array. + trie_ = utils::make_unique(); + // The second arg of set_array is not the size of blob, + // but the number of double array units. 
+ trie_->set_array(const_cast(trie_blob_.data()), + trie_blob_.size() / trie_->unit_size()); + normalized_ = normalized_blob_.data(); + } +} + +void Normalizer::DecodePrecompiledCharsMap(const char* blob, + size_t blob_size, + std::string* trie_blob, + std::string* normalized, + std::string* buffer) { + uint32_t trie_blob_size = 0; + uint32_t offset = 0; + if (blob_size <= sizeof(trie_blob_size) || + !DecodePOD(blob, sizeof(trie_blob_size), &trie_blob_size) || + trie_blob_size >= blob_size) { + throw std::runtime_error("Blob for normalization rule is broken."); + } +#ifdef IS_BIG_ENDIAN + trie_blob_size = util::Swap32(trie_blob_size); +#endif + if (trie_blob_size >= blob_size) { + throw std::runtime_error("Trie data size exceeds the input blob size."); + } + offset += sizeof(trie_blob_size); +#ifdef IS_BIG_ENDIAN + buffer->assign(blob + offset, trie_blob_size); + uint32* data = reinterpret_cast(const_cast(buffer->data())); + for (int i = 0; i < trie_blob_size / 4; ++i) data[i] = util::Swap32(data[i]); + *trie_blob = std::string(buffer->data(), trie_blob_size); +#else + *trie_blob = std::string(blob + offset, trie_blob_size); +#endif + offset += trie_blob_size; + *normalized = std::string(blob + offset, blob_size - offset); +} + +std::string Normalizer::EncodePrecompiledCharsMap( + const std::string& trie_blob, const std::string& normalized) { + // + std::string blob; + blob.append(EncodePOD(trie_blob.size())); + blob.append(trie_blob.data(), trie_blob.size()); + blob.append(normalized.data(), normalized.size()); + +#ifdef IS_BIG_ENDIAN + uint32* data = reinterpret_cast(const_cast(blob.data())); + for (int i = 0; i <= trie_blob.size() / 4; ++i) { + data[i] = util::Swap32(data[i]); + } +#endif + return blob; +} + +std::pair Normalizer::NormalizePrefix( + const char* input, size_t input_len) const { + std::pair result; + if (input_len == 0) { + return result; + } + if (matcher_ != nullptr) { + bool found = false; + const int mblen = matcher_->PrefixMatch(input, input_len, &found); + if (found) { + return std::make_pair(simple_string_view(input, input_len), mblen); + } + } + size_t longest_length = 0; + int longest_value = 0; + if (trie_ != nullptr) { + // Allocates trie_results in stack, which makes the encoding speed 36% + // fast. (38k sentences/sec => 60k sentences/sec). Builder checks that the + // result size never exceeds kMaxTrieResultsSize. This array consumes + // 0.5kByte in stack, which is less than default stack frames (16kByte). + Darts::DoubleArray::result_pair_type + trie_results[Normalizer::kMaxTrieResultsSize]; + const size_t num_nodes = trie_->commonPrefixSearch( + input, trie_results, Normalizer::kMaxTrieResultsSize, input_len); + + // Finds the longest rule. + for (size_t k = 0; k < num_nodes; ++k) { + if (longest_length == 0 || trie_results[k].length > longest_length) { + longest_length = trie_results[k].length; // length of prefix + longest_value = trie_results[k].value; // pointer to |normalized_|. + } + } + } + + if (longest_length == 0) { + size_t length = 0; + if (!IsValidDecodeUTF8(input, input + input_len, &length)) { + // Found a malformed utf8. + // The rune is set to be 0xFFFD (REPLACEMENT CHARACTER), + // which is a valid Unicode of three bytes in utf8, + // but here we only consume one byte. 
+ result.second = 1; + static const char kReplacementChar[] = "\xEF\xBF\xBD"; + result.first = simple_string_view(kReplacementChar); + } else { + result.second = length; + result.first = simple_string_view(input, length); + } + } else { + result.second = longest_length; + // No need to pass the size of normalized sentence, + // since |normalized| is delimitered by "\0". + result.first = simple_string_view(&(normalized_[longest_value])); + } + return result; +} + +bool Normalizer::Normalize(const char* input, + size_t input_len, + std::string* normalized, + std::vector* norm_to_orig, + std::u32string* u32content) const { + bool modified = false; + norm_to_orig->clear(); + normalized->clear(); + if (input_len == 0) { + return modified; + } + + // Reserves the output buffer to avoid re-allocations. + const size_t kReservedSize = input_len * 3; + normalized->reserve(kReservedSize); + norm_to_orig->reserve(kReservedSize); + if (u32content != nullptr) { + u32content->reserve(kReservedSize); + } + UErrorCode err = U_ZERO_ERROR; + std::unique_ptr iter( + icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), + err)); + UText utext = UTEXT_INITIALIZER; + utext_openUTF8(&utext, input, input_len, &err); + iter->setText(&utext, err); + int curr_pos = iter->current(); + while (iter->next() != icu::BreakIterator::DONE) { + int next_pos = iter->current(); + int curr_len = next_pos - curr_pos; + std::pair p; + if (curr_len < 6) { + p = NormalizePrefix(input + curr_pos, curr_len); + simple_string_view sp = p.first; + if (sp.data() != input + curr_pos) { + if (!sp.empty()) { + for (size_t n = 0; n < sp.size(); ++n) { + *normalized += sp.data()[n]; + } + } + Replace(sp, + simple_string_view(input + curr_pos, curr_len), + norm_to_orig, + u32content); + modified = true; + curr_pos = next_pos; + continue; + } + } + int curr_grapheme_pos = curr_pos; + while (curr_grapheme_pos < next_pos) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(input + curr_grapheme_pos, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + p = NormalizePrefix(input + curr_grapheme_pos, content_char_width); + simple_string_view sp = p.first; + if (sp.data() != input + curr_grapheme_pos) { + if (!sp.empty()) { + for (size_t n = 0; n < sp.size(); ++n) { + *normalized += sp.data()[n]; + } + } + Replace( + sp, + simple_string_view(input + curr_grapheme_pos, content_char_width), + norm_to_orig, + u32content); + modified = true; + } else { + for (int i = 0; i < sp.size(); ++i) { + *normalized += sp.data()[i]; + } + if (u32content != nullptr) { + u32content->push_back(content_char); + } + norm_to_orig->push_back(0); + } + curr_grapheme_pos += content_char_width; + } + curr_pos = next_pos; + } + utext_close(&utext); + return modified; +} + +void Normalizer::Replace(const simple_string_view& new_part, + const simple_string_view& old_part, + std::vector* changes, + std::u32string* u32content) const { + auto new_unicode_len = + GetUnicodeLenFromUTF8(new_part.data(), new_part.size()); + auto old_unicode_len = + GetUnicodeLenFromUTF8(old_part.data(), old_part.size()); + if (u32content != nullptr) { + size_t utf8_len = 0; + while (utf8_len < new_part.size()) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(new_part.data() + utf8_len, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + u32content->push_back(content_char); + utf8_len += content_char_width; + } + } + changes->insert(changes->end(), new_unicode_len, 0); + if (new_unicode_len > 
old_unicode_len) { + auto diff = new_unicode_len - old_unicode_len; + for (auto i = changes->size() - 1; i >= changes->size() - diff; --i) { + (*changes)[i] = 1; + } + } else { + changes->back() -= old_unicode_len - new_unicode_len; + } +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.h b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.h new file mode 100644 index 0000000000000000000000000000000000000000..3a8543cc39c8d3d8816abe053f09e52848867cd5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.h @@ -0,0 +1,114 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/utils/string_view.h" + +#include "darts.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +struct Cstrless { + bool operator()(const char* a, const char* b) const { + return std::strcmp(a, b) < 0; + } +}; + +class PrefixMatcher { +public: + // Initializes the PrefixMatcher with `dic`. + explicit PrefixMatcher(const std::set& dic); + + int PrefixMatch(const char* w, size_t w_len, bool* found = nullptr) const; + + std::string GlobalReplace(const char* w, + size_t w_len, + const char* out, + size_t out_len, + const char** result_w) const; + +private: + std::unique_ptr trie_; +}; + +class Normalizer { +public: + // Instantiates Normalizer with |spec|. + // |spec| should not be deleted until Normalizer is destroyed. + explicit Normalizer(const std::string& precompiled_charsmap); + Normalizer(const Normalizer& other); + virtual ~Normalizer(); + + virtual void SetPrefixMatcher(const PrefixMatcher* matcher) { + matcher_ = matcher; + } + + virtual bool Normalize(const char* input, + size_t input_len, + std::string* normalized, + std::vector* norm_to_orig, + std::u32string* u32content = nullptr) const; + std::string GetPrecompiledCharsmap() const; + +private: + void Init(); + void Replace(const simple_string_view& new_part, + const simple_string_view& old_part, + std::vector* changes, + std::u32string* u32content = nullptr) const; + std::pair NormalizePrefix(const char* input, + size_t input_len) const; + + + // // Encodes trie_blob and normalized string and return compiled blob. + static std::string EncodePrecompiledCharsMap(const std::string& trie_blob, + const std::string& normalized); + + // Decodes blob into trie_blob and normalized string. 
+ static void DecodePrecompiledCharsMap(const char* blob, + size_t blob_size, + std::string* trie_blob, + std::string* normalized, + std::string* buffer = nullptr); + + static constexpr int kMaxTrieResultsSize = 32; + + std::unique_ptr trie_; + + const char* normalized_ = nullptr; + std::string normalized_blob_; + std::string trie_blob_; + + // Prefix matcher; + const PrefixMatcher* matcher_ = nullptr; + + // Split hello world into "hello_" and "world_" instead of + // "_hello" and "_world". + const bool treat_whitespace_as_suffix_ = false; + std::string precompiled_charsmap_buffer_; + std::string precompiled_charsmap_; +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/shared_mutex.h b/fast_tokenizer/fast_tokenizer/utils/shared_mutex.h new file mode 100644 index 0000000000000000000000000000000000000000..37931ff9f740f8a3a7f9449c1535bd5935c22e21 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/shared_mutex.h @@ -0,0 +1,304 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// The code is from http://howardhinnant.github.io/shared_mutex.cpp +// C++ 11 shared_mutex implementation +class shared_mutex { + typedef std::mutex mutex_t; + typedef std::condition_variable cond_t; + typedef unsigned count_t; + + mutex_t mut_; + cond_t gate1_; + cond_t gate2_; + count_t state_; + + static const count_t write_entered_ = 1U << (sizeof(count_t) * CHAR_BIT - 1); + static const count_t n_readers_ = ~write_entered_; + +public: + shared_mutex() : state_(0) {} + ~shared_mutex() { std::lock_guard _(mut_); } + + shared_mutex(const shared_mutex&) = delete; + shared_mutex& operator=(const shared_mutex&) = delete; + + // Exclusive ownership + + void lock() { + std::unique_lock lk(mut_); + while (state_ & write_entered_) gate1_.wait(lk); + state_ |= write_entered_; + while (state_ & n_readers_) gate2_.wait(lk); + } + bool try_lock() { + std::unique_lock lk(mut_); + if (state_ == 0) { + state_ = write_entered_; + return true; + } + return false; + } + template + bool try_lock_for(const std::chrono::duration& rel_time) { + return try_lock_until(std::chrono::steady_clock::now() + rel_time); + } + template + bool try_lock_until(const std::chrono::time_point& abs_time); + void unlock() { + std::lock_guard _(mut_); + state_ = 0; + gate1_.notify_all(); + } + + // Shared ownership + + void lock_shared() { + std::unique_lock lk(mut_); + while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_) + gate1_.wait(lk); + count_t num_readers = (state_ & n_readers_) + 1; + state_ &= ~n_readers_; + state_ |= num_readers; + } + bool try_lock_shared() { + std::unique_lock lk(mut_); + count_t num_readers = state_ & n_readers_; + if (!(state_ & write_entered_) && num_readers != n_readers_) { + ++num_readers; + state_ &= 
~n_readers_; + state_ |= num_readers; + return true; + } + return false; + } + template + bool try_lock_shared_for(const std::chrono::duration& rel_time) { + return try_lock_shared_until(std::chrono::steady_clock::now() + rel_time); + } + template + bool try_lock_shared_until( + const std::chrono::time_point& abs_time); + void unlock_shared() { + std::lock_guard _(mut_); + count_t num_readers = (state_ & n_readers_) - 1; + state_ &= ~n_readers_; + state_ |= num_readers; + if (state_ & write_entered_) { + if (num_readers == 0) gate2_.notify_one(); + } else { + if (num_readers == n_readers_ - 1) gate1_.notify_one(); + } + } +}; + +template +bool shared_mutex::try_lock_until( + const std::chrono::time_point& abs_time) { + std::unique_lock lk(mut_); + if (state_ & write_entered_) { + while (true) { + std::cv_status status = gate1_.wait_until(lk, abs_time); + if ((state_ & write_entered_) == 0) break; + if (status == std::cv_status::timeout) return false; + } + } + state_ |= write_entered_; + if (state_ & n_readers_) { + while (true) { + std::cv_status status = gate2_.wait_until(lk, abs_time); + if ((state_ & n_readers_) == 0) break; + if (status == std::cv_status::timeout) { + state_ &= ~write_entered_; + return false; + } + } + } + return true; +} + +template +bool shared_mutex::try_lock_shared_until( + const std::chrono::time_point& abs_time) { + std::unique_lock lk(mut_); + if ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_) { + while (true) { + std::cv_status status = gate1_.wait_until(lk, abs_time); + if ((state_ & write_entered_) == 0 && (state_ & n_readers_) < n_readers_) + break; + if (status == std::cv_status::timeout) return false; + } + } + count_t num_readers = (state_ & n_readers_) + 1; + state_ &= ~n_readers_; + state_ |= num_readers; + return true; +} + +template +class shared_lock { +public: + typedef Mutex mutex_type; + +private: + mutex_type* m_; + bool owns_; + + struct __nat { + int _; + }; + +public: + shared_lock() : m_(nullptr), owns_(false) {} + + explicit shared_lock(mutex_type& m) : m_(&m), owns_(true) { + m_->lock_shared(); + } + + shared_lock(mutex_type& m, std::defer_lock_t) : m_(&m), owns_(false) {} + + shared_lock(mutex_type& m, std::try_to_lock_t) + : m_(&m), owns_(m.try_lock_shared()) {} + + shared_lock(mutex_type& m, std::adopt_lock_t) : m_(&m), owns_(true) {} + + template + shared_lock(mutex_type& m, + const std::chrono::time_point& abs_time) + : m_(&m), owns_(m.try_lock_shared_until(abs_time)) {} + template + shared_lock(mutex_type& m, const std::chrono::duration& rel_time) + : m_(&m), owns_(m.try_lock_shared_for(rel_time)) {} + + ~shared_lock() { + if (owns_) m_->unlock_shared(); + } + + shared_lock(shared_lock const&) = delete; + shared_lock& operator=(shared_lock const&) = delete; + + shared_lock(shared_lock&& sl) : m_(sl.m_), owns_(sl.owns_) { + sl.m_ = nullptr; + sl.owns_ = false; + } + + shared_lock& operator=(shared_lock&& sl) { + if (owns_) m_->unlock_shared(); + m_ = sl.m_; + owns_ = sl.owns_; + sl.m_ = nullptr; + sl.owns_ = false; + return *this; + } + + explicit shared_lock(std::unique_lock&& ul) + : m_(ul.mutex()), owns_(ul.owns_lock()) { + if (owns_) m_->unlock_and_lock_shared(); + ul.release(); + } + + void lock(); + bool try_lock(); + template + bool try_lock_for(const std::chrono::duration& rel_time) { + return try_lock_until(std::chrono::steady_clock::now() + rel_time); + } + template + bool try_lock_until(const std::chrono::time_point& abs_time); + void unlock(); + + void swap(shared_lock&& u) { + std::swap(m_, u.m_); + 
std::swap(owns_, u.owns_); + } + + mutex_type* release() { + mutex_type* r = m_; + m_ = nullptr; + owns_ = false; + return r; + } + bool owns_lock() const { return owns_; } + operator int __nat::*() const { return owns_ ? &__nat::_ : 0; } + mutex_type* mutex() const { return m_; } +}; + +template +void shared_lock::lock() { + if (m_ == nullptr) + throw std::system_error(std::error_code(EPERM, std::system_category()), + "shared_lock::lock: references null mutex"); + if (owns_) + throw std::system_error(std::error_code(EDEADLK, std::system_category()), + "shared_lock::lock: already locked"); + m_->lock_shared(); + owns_ = true; +} + +template +bool shared_lock::try_lock() { + if (m_ == nullptr) + throw std::system_error(std::error_code(EPERM, std::system_category()), + "shared_lock::try_lock: references null mutex"); + if (owns_) + throw std::system_error(std::error_code(EDEADLK, std::system_category()), + "shared_lock::try_lock: already locked"); + owns_ = m_->try_lock_shared(); + return owns_; +} + +template +template +bool shared_lock::try_lock_until( + const std::chrono::time_point& abs_time) { + if (m_ == nullptr) + throw std::system_error( + std::error_code(EPERM, std::system_category()), + "shared_lock::try_lock_until: references null mutex"); + if (owns_) + throw std::system_error(std::error_code(EDEADLK, std::system_category()), + "shared_lock::try_lock_until: already locked"); + owns_ = m_->try_lock_shared_until(abs_time); + return owns_; +} + +template +void shared_lock::unlock() { + if (!owns_) + throw std::system_error(std::error_code(EPERM, std::system_category()), + "shared_lock::unlock: not locked"); + m_->unlock_shared(); + owns_ = false; +} + +template +inline void swap(shared_lock& x, shared_lock& y) { + x.swap(y); +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/string_view.h b/fast_tokenizer/fast_tokenizer/utils/string_view.h new file mode 100644 index 0000000000000000000000000000000000000000..35cacdefed9f2f4bb14392ee3775c00ceb663bc5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/string_view.h @@ -0,0 +1,53 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
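The `shared_mutex`/`shared_lock` pair above backports the standard reader-writer primitives to C++11 toolchains: any number of readers may hold the lock at once, while a writer waits for exclusive ownership. A small usage sketch (illustrative only, not part of this diff):

```cpp
#include <mutex>  // std::lock_guard

#include "fast_tokenizer/utils/shared_mutex.h"

namespace utils = paddlenlp::fast_tokenizer::utils;

utils::shared_mutex cache_mutex;
int cached_value = 0;

int ReadCached() {
  // Shared (reader) ownership: many threads can read concurrently.
  utils::shared_lock<utils::shared_mutex> lock(cache_mutex);
  return cached_value;
}

void UpdateCached(int v) {
  // Exclusive (writer) ownership via lock()/unlock().
  std::lock_guard<utils::shared_mutex> lock(cache_mutex);
  cached_value = v;
}
```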
+ +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +struct simple_string_view { + const char* ptr_; + size_t offset_; + size_t size_; + explicit simple_string_view(const char* ptr = nullptr) + : ptr_(ptr), offset_(0), size_(0) { + while (ptr_ && ptr_[size_] != '\0') { + size_++; + } + } + simple_string_view(const char* ptr, size_t size) : ptr_(ptr), size_(size) {} + + const char* data() const { + if (!ptr_) { + return ptr_ + offset_; + } + return ptr_; + } + size_t size() const { return size_; } + bool empty() const { return size_ == 0; } + + void remove_prefix(size_t n) { + assert(n <= size_); + ptr_ += n; + size_ -= n; + } +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/trie.cc b/fast_tokenizer/fast_tokenizer/utils/trie.cc new file mode 100644 index 0000000000000000000000000000000000000000..b063e91ff0853829fe7ec4596ca872e744cfae66 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/trie.cc @@ -0,0 +1,231 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/trie.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +void Trie::CreateTrie(const std::vector& keys, + const std::vector& values) { + trie_ = std::make_shared(); + trie_->build(keys.size(), + const_cast(&keys[0]), + nullptr, + const_cast(&values[0])); + const uint32_t* trie_ptr = reinterpret_cast(trie_->array()); + trie_array_ = std::vector(trie_ptr, trie_ptr + trie_->size()); +} + +int Trie::EncodeTokenId(const std::string& token, uint32_t id) const { + bool is_suffix_token = (token.rfind(continuing_subword_prefix_) == 0); + uint32_t token_length = token.length(); + if (is_suffix_token) { + token_length -= continuing_subword_prefix_.length(); + } + return EncodeToken(id, token_length, is_suffix_token); +} + +void Trie::InitTrieSuffixRoot() { + auto node = CreateRootTraversalCursor(); + if (!TryTraverseSeveralSteps(&node, continuing_subword_prefix_)) { + throw std::runtime_error( + "Cannot locate suffix_root_. This should never happen."); + } + suffix_root_ = node.node_id_; +} + +void Trie::InitTrie(const std::vector& keys, + const std::vector& values) { + std::vector sorted_keys; + std::vector sorted_values; + GetSortedVocab(keys, values, &sorted_keys, &sorted_values); + CreateTrie(sorted_keys, sorted_values); + InitTrieSuffixRoot(); + if (with_pretokenization_ && keys.size() > 0) { + auto node = CreateRootTraversalCursor(); + if (!TryTraverseSeveralSteps(&node, + std::string(1, utils::kInvalidControlChar))) { + throw std::runtime_error( + "Cannot locate the dummy node for the failure link for punctuation " + "nodes. 
This should never happen."); + } + punct_failure_link_node_ = node.node_id_; + DeleteLinkFromParent(punct_failure_link_node_); + DeleteValueOfNode(punct_failure_link_node_); + } +} + +void Trie::AddPuncVocab( + std::vector* punc_vocab, + const std::unordered_map& vocab) const { + if (with_pretokenization_) { + for (uint32_t cp = 1; cp <= 0x0010FFFF; ++cp) { + if (!utils::IsUnicodeChar(cp) || !utils::IsPunctuationOrChineseChar(cp)) { + continue; + } + char utf_str[5]; + utils::GetUTF8Str(reinterpret_cast(&cp), utf_str, 1); + std::string punc_str(utf_str); + if (vocab.count(punc_str) == 0) { + punc_vocab->push_back(punc_str); + } + } + punc_vocab->push_back(std::string(1, utils::kInvalidControlChar)); + } +} + +void Trie::SetVocab(const std::unordered_map& vocab) { + std::vector keys; + std::vector values; + for (auto&& item : vocab) { + keys.push_back(item.first.c_str()); + values.push_back(EncodeTokenId(item.first, item.second)); + } + InitTrie(keys, values); +} + +void Trie::SetVocabList(const std::vector& keys) { + std::unordered_map vocab; + for (int i = 0; i < keys.size(); ++i) { + vocab[keys[i]] = i; + } + SetVocab(vocab); +} + +Trie::Trie(const std::string& continuing_subword_prefix, + const std::string& unk_token, + bool with_pretokenization) + : trie_(nullptr), + continuing_subword_prefix_(continuing_subword_prefix), + suffix_root_(utils::kNullNode), + punct_failure_link_node_(utils::kNullNode), + unk_token_(unk_token), + with_pretokenization_(with_pretokenization) {} + +Trie::Trie(const std::unordered_map& vocab, + const std::string& continuing_subword_prefix, + const std::string& unk_token, + bool with_pretokenization) + : continuing_subword_prefix_(continuing_subword_prefix), + unk_token_(unk_token), + suffix_root_(utils::kNullNode), + punct_failure_link_node_(utils::kNullNode), + with_pretokenization_(with_pretokenization) { + SetVocab(vocab); +} + +Trie::Trie(const std::vector& keys, + const std::string& continuing_subword_prefix, + const std::string& unk_token, + bool with_pretokenization) + : continuing_subword_prefix_(continuing_subword_prefix), + unk_token_(unk_token), + suffix_root_(utils::kNullNode), + punct_failure_link_node_(utils::kNullNode), + with_pretokenization_(with_pretokenization) { + SetVocabList(keys); +} + +Trie::TraversalCursor Trie::CreateRootTraversalCursor() const { + return CreateTraversalCursor(kRootNodeId); +} + +Trie::TraversalCursor Trie::CreateTraversalCursor(uint32_t node_id) const { + return Trie::TraversalCursor(node_id, trie_array_[node_id]); +} + +void Trie::SetTraversalCursor(Trie::TraversalCursor* cursor, + uint32_t node_id) const { + cursor->node_id_ = node_id; + cursor->unit_ = trie_array_[node_id]; +} + +bool Trie::TryTraverseOneStep(Trie::TraversalCursor* cursor, + unsigned char ch) const { + const uint32_t next_node_id = cursor->node_id_ ^ Offset(cursor->unit_) ^ ch; + const uint32_t next_node_unit = trie_array_[next_node_id]; + if (Label(next_node_unit) != ch) { + return false; + } + cursor->node_id_ = next_node_id; + cursor->unit_ = next_node_unit; + return true; +} + +bool Trie::TryTraverseSeveralSteps(Trie::TraversalCursor* cursor, + const std::string& path) const { + return TryTraverseSeveralSteps(cursor, path.data(), path.size()); +} + +bool Trie::TryTraverseSeveralSteps(Trie::TraversalCursor* cursor, + const char* ptr, + int size) const { + uint32_t cur_id = cursor->node_id_; + uint32_t cur_unit = cursor->unit_; + for (; size > 0; --size, ++ptr) { + const unsigned char ch = static_cast(*ptr); + cur_id ^= Offset(cur_unit) ^ ch; + 
cur_unit = trie_array_[cur_id]; + if (Label(cur_unit) != ch) { + return false; + } + } + cursor->node_id_ = cur_id; + cursor->unit_ = cur_unit; + return true; +} + +bool Trie::TryGetData(const Trie::TraversalCursor& cursor, + int* out_data) const { + if (!HasLeaf(cursor.unit_)) { + return false; + } + const uint32_t value_unit = + trie_array_[cursor.node_id_ ^ Offset(cursor.unit_)]; + *out_data = Value(value_unit); + return true; +} + +void Trie::DeleteValueOfNode(uint32_t node_id) { + trie_array_[node_id] &= 0xFFFFFEFF; +} + +void Trie::DeleteLinkFromParent(uint32_t child_node_id) { + trie_array_[child_node_id] &= 0xFFFFFF00; +} + +void Trie::SetWithPretokenization(bool with_pretokenization) { + with_pretokenization_ = with_pretokenization; +} + +void Trie::SetUNKToken(const std::string& unk_token) { unk_token_ = unk_token; } + +void Trie::SetContinuingSubwordPrefix( + const std::string& continuing_subword_prefix) { + continuing_subword_prefix_ = continuing_subword_prefix; +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/trie.h b/fast_tokenizer/fast_tokenizer/utils/trie.h new file mode 100644 index 0000000000000000000000000000000000000000..b4c9b0cbff4d05a84326479d7b9767312308adf4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/trie.h @@ -0,0 +1,120 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
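The `Trie` implementation above wraps a Darts double-array built from the sorted WordPiece vocabulary; a lookup walks the array byte by byte and, on a hit, reads back a value that packs the token id, token length, and suffix flag. A minimal lookup sketch follows (illustrative only, not part of this diff; the toy vocabulary and the use of `GetTokenIdFromEncodedValue` from `utils.h` are assumptions made for the example):

```cpp
#include <string>
#include <vector>

#include "fast_tokenizer/utils/trie.h"
#include "fast_tokenizer/utils/utils.h"

namespace utils = paddlenlp::fast_tokenizer::utils;

int main() {
  // Hypothetical toy vocabulary; "##" marks continuing subwords.
  std::vector<std::string> vocab = {"[UNK]", "he", "##llo", "hello"};
  utils::Trie trie(vocab);

  // Walk the double-array over the input bytes, then read the packed value.
  auto cursor = trie.CreateRootTraversalCursor();
  int encoded = 0;
  if (trie.TryTraverseSeveralSteps(&cursor, std::string("hello")) &&
      trie.TryGetData(cursor, &encoded)) {
    // The value packs (id, length, is-suffix); see EncodeToken in utils.h.
    int token_id = utils::GetTokenIdFromEncodedValue(encoded);
    (void)token_id;
  }
  return 0;
}
```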
+ +#pragma once +#include +#include +#include +#include +#include +#include +#include "darts.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +class Trie { +public: + static constexpr uint32_t kRootNodeId = 0; + + Trie(const std::string& continuing_subword_prefix = "##", + const std::string& unk_token = "[UNK]", + bool with_pretokenization = false); + Trie(const std::unordered_map& vocab, + const std::string& continuing_subword_prefix = "##", + const std::string& unk_token = "[UNK]", + bool with_pretokenization = false); + Trie(const std::vector& keys, + const std::string& continuing_subword_prefix = "##", + const std::string& unk_token = "[UNK]", + bool with_pretokenization = false); + struct TraversalCursor { + uint32_t node_id_; + uint32_t unit_; + TraversalCursor(uint32_t node_id = 0, uint32_t unit = 0) + : node_id_(node_id), unit_(unit) {} + }; + + TraversalCursor CreateRootTraversalCursor() const; + TraversalCursor CreateTraversalCursor(uint32_t node_id) const; + void SetTraversalCursor(TraversalCursor* cursor, uint32_t node_id) const; + bool TryTraverseOneStep(TraversalCursor* cursor, unsigned char ch) const; + bool TryTraverseSeveralSteps(TraversalCursor* cursor, + const std::string& path) const; + bool TryGetData(const TraversalCursor& cursor, int* out_data) const; + void SetVocab(const std::unordered_map& vocab); + void SetVocabList(const std::vector& vocab); + void SetWithPretokenization(bool with_pretokenization_); + void SetUNKToken(const std::string& unk_token); + void SetContinuingSubwordPrefix(const std::string& continuing_subword_prefix); + + uint32_t Size() const { + if (trie_.get() != nullptr) { + return trie_->size(); + } + return 0; + } + std::string GetContinuingSubwordPrefix() const { + return continuing_subword_prefix_; + } + uint32_t GetSuffixRoot() const { return suffix_root_; } + uint32_t GetPuncFailureNode() const { return punct_failure_link_node_; } + void DeleteValueOfNode(uint32_t node_id); + void DeleteLinkFromParent(uint32_t child_node_id); + +private: + void AddPuncVocab( + std::vector* punc_vocab, + const std::unordered_map& vocab) const; + void InitTrieSuffixRoot(); + void InitTrie(const std::vector& keys, + const std::vector& values); + int EncodeTokenId(const std::string& token, uint32_t id) const; + void CreateTrie(const std::vector& keys, + const std::vector& values); + + bool TryTraverseSeveralSteps(TraversalCursor* cursor, + const char* ptr, + int size) const; + + static uint32_t Offset(uint32_t unit) { + return (unit >> 10) << ((unit & 0x200) >> 6); + } + + // Returns a label associated with a node. + // A leaf node will have the MSB set and thus return an invalid label. + static uint32_t Label(uint32_t unit) { return unit & 0x800000ff; } + + // Returns whether a node has a leaf as a child. + static bool HasLeaf(uint32_t unit) { return unit & 0x100; } + + // Returns a value associated with a node. Available when a node is a leaf. 
+ static int Value(uint32_t unit) { + return static_cast(unit & 0x7fffffff); + } + + std::shared_ptr trie_; + std::vector trie_array_; + std::string continuing_subword_prefix_; + std::string unk_token_; + uint32_t suffix_root_; + uint32_t punct_failure_link_node_; + bool with_pretokenization_; +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/unique_ptr.h b/fast_tokenizer/fast_tokenizer/utils/unique_ptr.h new file mode 100644 index 0000000000000000000000000000000000000000..767e203fcd2d5d2c9bddb49cf042cf6472465120 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/unique_ptr.h @@ -0,0 +1,61 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// Trait to select overloads and return types for MakeUnique. +template +struct MakeUniqueResult { + using scalar = std::unique_ptr; +}; +template +struct MakeUniqueResult { + using array = std::unique_ptr; +}; +template +struct MakeUniqueResult { + using invalid = void; +}; + +// MakeUnique(...) is an early implementation of C++14 std::make_unique. +// It is designed to be 100% compatible with std::make_unique so that the +// eventual switchover will be a simple renaming operation. +template +typename MakeUniqueResult::scalar make_unique(Args &&... args) { // NOLINT + return std::unique_ptr( + new T(std::forward(args)...)); // NOLINT(build/c++11) +} + +// Overload for array of unknown bound. +// The allocation of arrays needs to use the array form of new, +// and cannot take element constructor arguments. +template +typename MakeUniqueResult::array make_unique(size_t n) { + return std::unique_ptr(new typename std::remove_extent::type[n]()); +} + +// Reject arrays of known bound. +template +typename MakeUniqueResult::invalid make_unique(Args &&... /* args */) = + delete; // NOLINT + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/utf8.h b/fast_tokenizer/fast_tokenizer/utils/utf8.h new file mode 100644 index 0000000000000000000000000000000000000000..dbb8c92f67329db99151649e433aa6d39c58360a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/utf8.h @@ -0,0 +1,225 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +static constexpr uint32_t kUnicodeError = 0xFFFD; + +inline bool IsUnicodeNonChar(uint32_t c) { + return ((c) >= 0xfdd0 && ((c) <= 0xfdef || ((c)&0xfffe) == 0xfffe) && + (c) <= 0x10ffff); +} + +inline bool IsUnicodeChar(uint32_t c) { + return ((c) < 0xd800 || + (0xdfff < (c) && (c) <= 0x10ffff && !IsUnicodeNonChar(c))); +} + +inline uint32_t BytesInUTF8Char(uint8_t byte) { + unsigned int count = 1; + // no if-statements means no divergence + count += static_cast((byte & 0xF0) == 0xF0); + count += static_cast((byte & 0xE0) == 0xE0); + count += static_cast((byte & 0xC0) == 0xC0); + count -= static_cast((byte & 0xC0) == 0x80); + return count; +} + +inline uint32_t GetUnicodeLenFromUTF8(const char* pSrc, size_t length) { + size_t unicode_len = 0; + size_t start = 0; + while (start < length && pSrc[start] != '\0') { + size_t chwidth = BytesInUTF8Char(pSrc[start]); + start += chwidth; + ++unicode_len; + } + return unicode_len; +} + +inline uint32_t UTF8ToUInt32(const char* pSrc, uint32_t* chr) { + uint32_t chwidth = BytesInUTF8Char(static_cast(*pSrc)); + *chr = static_cast(*pSrc++) & 0xFF; + if (chwidth > 1) { + *chr = (*chr) << 8; + *chr |= (static_cast(*pSrc++) & 0xFF); // << 8; + if (chwidth > 2) { + *chr = (*chr) << 8; + *chr |= (static_cast(*pSrc++) & 0xFF); // << 16; + if (chwidth > 3) { + *chr = (*chr) << 8; + *chr |= (static_cast(*pSrc++) & 0xFF); // << 24; + } + } + } + return chwidth; +} + +inline uint32_t UTF8ToUnicode(uint32_t utf8) { + uint32_t unchr = 0; + if (utf8 < 0x00000080) { + unchr = utf8; + } else if (utf8 < 0x0000E000) { + unchr = (utf8 & 0x1F00) >> 2; + unchr |= (utf8 & 0x003F); + } else if (utf8 < 0x00F00000) { + unchr = (utf8 & 0x0F0000) >> 4; + unchr |= (utf8 & 0x003F00) >> 2; + unchr |= (utf8 & 0x00003F); + } else if (utf8 <= static_cast(0xF8000000)) { + unchr = (utf8 & 0x03000000) >> 6; + unchr |= (utf8 & 0x003F0000) >> 4; + unchr |= (utf8 & 0x00003F00) >> 2; + unchr |= (utf8 & 0x0000003F); + } + return unchr; +} + +inline bool IsCharBeginBoundary(char ch) { + return ((~ch) >> 7) || ((ch & 0xC0) == 0xC0); +} + +inline bool IsCharBoundary(const char* ch) { + return IsCharBeginBoundary(*ch) || IsCharBeginBoundary(*(ch + 1)); +} + +inline uint32_t UnicodeToUTF8(uint32_t unchr) { + uint32_t utf8 = 0; + if (unchr < 0x00000080) { + utf8 = unchr; + } else if (unchr < 0x00000800) { + utf8 = (unchr << 2) & 0x1F00; + utf8 |= (unchr & 0x3F); + utf8 |= 0x0000C080; + } else if (unchr < 0x00010000) { + utf8 = (unchr << 4) & 0x0F0000; // upper 4 bits + utf8 |= (unchr << 2) & 0x003F00; // next 6 bits + utf8 |= (unchr & 0x3F); // last 6 bits + utf8 |= 0x00E08080; + } else if (unchr < 0x00110000) { // 3-byte unicode + utf8 = (unchr << 6) & 0x07000000; // upper 3 bits + utf8 |= (unchr << 4) & 0x003F0000; // next 6 bits + utf8 |= (unchr << 2) & 0x00003F00; // next 6 bits + utf8 |= (unchr & 0x3F); // last 6 bits + utf8 |= static_cast(0xF0808080); + } + return utf8; +} + +inline uint32_t BytesInUnicodeChar(uint32_t chr) { + uint32_t count = 1; + // no if-statements means no divergence + count += static_cast((chr & static_cast(0x0000FF00)) > 0); + count += static_cast((chr & static_cast(0x00FF0000)) > 0); + count += static_cast((chr & static_cast(0xFF000000)) > 0); + return count; +} + +inline uint32_t UnicodeToUTF8Char(uint32_t chr, char* dst) { + uint32_t chwidth = BytesInUnicodeChar(chr); + for (uint32_t idx = 0; idx < chwidth; ++idx) { + dst[chwidth - idx - 1] = static_cast(chr & 
0xFF); + chr = chr >> 8; + } + return chwidth; +} + +inline uint32_t GetUTF8CharLen(uint32_t u32chr) { + return BytesInUnicodeChar(UnicodeToUTF8(u32chr)); +} + +inline void GetUTF8Str(const char32_t* unicode_str, + char* utf8_str, + size_t unicode_len) { + char dst_char[5] = {0}; + for (size_t i = 0; i < unicode_len; ++i) { + uint32_t utf8_uint32 = UnicodeToUTF8(unicode_str[i]); + uint32_t utf8_char_count = UnicodeToUTF8Char(utf8_uint32, dst_char); + dst_char[utf8_char_count] = '\0'; + memcpy(utf8_str, dst_char, utf8_char_count); + utf8_str += utf8_char_count; + } + *utf8_str = '\0'; +} + +inline void GetUnicodeStr(const char* pSrc, + char32_t* unicode_str, + size_t unicode_len) { + uint32_t curr_unicode_char; + uint32_t count = UTF8ToUInt32(pSrc, &curr_unicode_char); + curr_unicode_char = UTF8ToUnicode(curr_unicode_char); + for (size_t i = 0; i < unicode_len; ++i) { + unicode_str[i] = curr_unicode_char; + pSrc += count; + count = UTF8ToUInt32(pSrc, &curr_unicode_char); + curr_unicode_char = UTF8ToUnicode(curr_unicode_char); + } +} + +inline bool IsTrailByte(char x) { return static_cast(x) < -0x40; } + +inline bool IsValidCodepoint(char32_t c) { + return (static_cast(c) < 0xD800) || (c >= 0xE000 && c <= 0x10FFFF); +} + +// mblen sotres the number of bytes consumed after decoding. +inline uint32_t DecodeUTF8(const char* begin, const char* end, size_t* mblen) { + const size_t len = end - begin; + + if (static_cast(begin[0]) < 0x80) { + *mblen = 1; + return static_cast(begin[0]); + } else if (len >= 2 && (begin[0] & 0xE0) == 0xC0) { + const uint32_t cp = (((begin[0] & 0x1F) << 6) | ((begin[1] & 0x3F))); + if (IsTrailByte(begin[1]) && cp >= 0x0080 && IsValidCodepoint(cp)) { + *mblen = 2; + return cp; + } + } else if (len >= 3 && (begin[0] & 0xF0) == 0xE0) { + const uint32_t cp = (((begin[0] & 0x0F) << 12) | ((begin[1] & 0x3F) << 6) | + ((begin[2] & 0x3F))); + if (IsTrailByte(begin[1]) && IsTrailByte(begin[2]) && cp >= 0x0800 && + IsValidCodepoint(cp)) { + *mblen = 3; + return cp; + } + } else if (len >= 4 && (begin[0] & 0xf8) == 0xF0) { + const uint32_t cp = (((begin[0] & 0x07) << 18) | ((begin[1] & 0x3F) << 12) | + ((begin[2] & 0x3F) << 6) | ((begin[3] & 0x3F))); + if (IsTrailByte(begin[1]) && IsTrailByte(begin[2]) && + IsTrailByte(begin[3]) && cp >= 0x10000 && IsValidCodepoint(cp)) { + *mblen = 4; + return cp; + } + } + + // Invalid UTF-8. + *mblen = 1; + return kUnicodeError; +} + +inline bool IsValidDecodeUTF8(const char* begin, + const char* end, + size_t* mblen) { + const uint32_t c = DecodeUTF8(begin, end, mblen); + return c != kUnicodeError || *mblen == 3; +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/utils.cc b/fast_tokenizer/fast_tokenizer/utils/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..dd23726c1bf457748e3cd7d79e11ec8fc116a698 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/utils.cc @@ -0,0 +1,147 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/utils/utils.h" + +#include "unicode/uchar.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +void GetVocabFromFiles(const std::string& files, + std::unordered_map* vocab) { + const static std::string WHITESPACE = " \n\r\t\f\v"; + std::ifstream fin(files); + if (!fin.good()) { + std::cerr << "The vocab file " << files + << " seems to be unable to access" + " or non-exists, please check again. " + << std::endl; + return; + } + vocab->clear(); + int i = 0; + constexpr int MAX_BUFFER_SIZE = 256; + char word[MAX_BUFFER_SIZE]; + while (fin.getline(word, MAX_BUFFER_SIZE)) { + std::string word_str = word; + auto leading_spaces = word_str.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)(leading_spaces, word_str.length() - 1); + word_str = word_str.substr(leading_spaces); + } + auto trailing_spaces = word_str.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + word_str = word_str.substr(0, trailing_spaces + 1); + } + if (word_str != "") { + (*vocab)[word_str] = i++; + } + } +} + +bool IsChineseChar(int ch) { + return ( + (ch >= 0x4E00 && ch <= 0x9FFF) || (ch >= 0x3400 && ch <= 0x4DBF) || + (ch >= 0x20000 && ch <= 0x2A6DF) || (ch >= 0x2A700 && ch <= 0x2B73F) || + (ch >= 0x2B740 && ch <= 0x2B81F) || (ch >= 0x2B820 && ch <= 0x2CEAF) || + (ch >= 0xF900 && ch <= 0xFAFF) || (ch >= 0x2F800 && ch <= 0x2FA1F)); +} + +bool IsPunctuation(int ch) { + return (ch >= 33 && ch <= 47) || (ch >= 58 && ch <= 64) || + (ch >= 91 && ch <= 96) || (ch >= 123 && ch <= 126) || u_ispunct(ch); +} + +bool IsPunctuationOrChineseChar(int ch) { + return IsChineseChar(ch) || IsPunctuation(ch); +} + +bool StringReplace(std::string* str, + const std::string& from, + const std::string& to) { + size_t start_pos = str->find(from); + if (start_pos == std::string::npos) return false; + str->replace(start_pos, from.length(), to); + return true; +} + +void StringReplaceAll(std::string* str, + const std::string& from, + const std::string& to) { + if (from.empty()) return; + size_t start_pos = 0; + while ((start_pos = str->find(from, start_pos)) != std::string::npos) { + str->replace(start_pos, from.length(), to); + start_pos += to.length(); // In case 'to' contains 'from', like replacing + // 'x' with 'yx' + } +} + +void GetSortedVocab(const std::vector& keys, + const std::vector& values, + std::vector* sorted_keys, + std::vector* sorted_values) { + // Sort the vocab + std::vector sorted_vocab_index(keys.size()); + std::iota(sorted_vocab_index.begin(), sorted_vocab_index.end(), 0); + std::sort(sorted_vocab_index.begin(), + sorted_vocab_index.end(), + [&keys](const int a, const int b) { + return std::strcmp(keys[a], keys[b]) < 0; + }); + + sorted_keys->resize(keys.size()); + sorted_values->resize(keys.size()); + for (int i = 0; i < sorted_vocab_index.size(); ++i) { + auto idx = sorted_vocab_index[i]; + (*sorted_keys)[i] = keys[idx]; + (*sorted_values)[i] = values[idx]; + } +} + +std::unordered_map CreateBytesToChars() { + std::unordered_map bytes_to_chars; + bool bytes_flag[256] = {false}; + std::vector> ranges = { + {'!', '~'}, {'\xA1', '\xAC'}, {'\xAE', '\xFF'}}; + + for (int i = 0; i < ranges.size(); ++i) { + for (uint32_t c = ranges[i].first; c <= ranges[i].second; ++c) { + bytes_to_chars.insert({c, c}); + bytes_flag[c] = true; + } + } + uint32_t n = 0; + for (uint32_t b = 0; b <= 255; ++b) { + if 
(!bytes_flag[b]) { + bytes_to_chars.insert({b, (1 << 8) + n}); + n += 1; + } + } + return bytes_to_chars; +} + +bool IsWhiteSpace(int ch) { + const std::string WHITESPACE = " \n\r\t\f\v"; + for (int i = 0; i < WHITESPACE.length(); ++i) { + if (ch == WHITESPACE[i]) return true; + } + return u_isspace(ch); +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/utils.h b/fast_tokenizer/fast_tokenizer/utils/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..6a9e418693d337223c3afb0a3f8b94aba7979357 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/utils.h @@ -0,0 +1,187 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#if defined(_FREEBSD) +#include +#endif +#if !defined(__APPLE__) && !defined(_WIN32) && !defined(_FREEBSD) +#include +#if BYTE_ORDER == __BIG_ENDIAN +#define IS_BIG_ENDIAN +#endif +#endif + +#if defined(_WIN32) +#ifdef FASTTOKENIZER_LIB +#define FASTTOKENIZER_DECL __declspec(dllexport) +#else +#define FASTTOKENIZER_DECL __declspec(dllimport) +#endif // FASTTOKENIZER_LIB +#else +#define FASTTOKENIZER_DECL __attribute__((visibility("default"))) +#endif // _WIN32 + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +void GetVocabFromFiles(const std::string& files, + std::unordered_map* vocab); + +bool IsChineseChar(int ch); + +bool IsPunctuation(int ch); + +bool IsPunctuationOrChineseChar(int ch); + +bool StringReplace(std::string* str, + const std::string& from, + const std::string& to); + +void StringReplaceAll(std::string* str, + const std::string& from, + const std::string& to); + +// Used in fast wordpiece model +static constexpr uint32_t kBitToIndicateSuffixToken = 30; + +static constexpr uint32_t kBitsToEncodeVocabTokenLength = 8; + +static constexpr uint32_t kMaskToEncodeVocabTokenLength = + (1 << kBitsToEncodeVocabTokenLength) - 1; + +static constexpr uint32_t kMaxVocabTokenLengthInUTF8Bytes = + (1 << kBitsToEncodeVocabTokenLength); + +static constexpr uint32_t kMaxSupportedVocabSize = + (1 << (32 - 1 - 1 - kBitsToEncodeVocabTokenLength)); + +static constexpr uint32_t kMaskToEncodeVocabTokenId = + ((1 << kBitToIndicateSuffixToken) - 1) ^ kMaskToEncodeVocabTokenLength; + +inline int EncodeToken(uint32_t token_id, + uint32_t token_length, + bool is_suffix_token) { + int encoded_value = (is_suffix_token << kBitToIndicateSuffixToken) | + (token_id << kBitsToEncodeVocabTokenLength) | + (token_length - 1); + return encoded_value; +} + +inline bool IsSuffixTokenFromEncodedValue(int token_encoded_value) { + return static_cast(token_encoded_value >> kBitToIndicateSuffixToken); +} + +// Gets the token id from the encoded value. 
+// Gets the token id from the encoded value.
+inline int GetTokenIdFromEncodedValue(int token_encoded_value) {
+  return (token_encoded_value & kMaskToEncodeVocabTokenId) >>
+         kBitsToEncodeVocabTokenLength;
+}
+
+// Gets the token length (without the suffix indicator) from the encoded value.
+inline int GetTokenLengthFromEncodedValue(int token_encoded_value) {
+  return (token_encoded_value & kMaskToEncodeVocabTokenLength) + 1;
+}
+
+static constexpr uint32_t kBitsToEncodeFailurePopsListSize =
+    kBitsToEncodeVocabTokenLength;
+
+static constexpr uint32_t kMaskToEncodeFailurePopsListSize =
+    (1 << kBitsToEncodeFailurePopsListSize) - 1;
+
+static constexpr uint32_t kMaxFailurePopsListSize =
+    (1 << kBitsToEncodeFailurePopsListSize);
+
+static constexpr uint32_t kMaxSupportedFailurePoolOffset =
+    (1 << (32 - kBitsToEncodeFailurePopsListSize)) - 1 - 1;
+
+static constexpr uint32_t kNullFailurePopsList =
+    std::numeric_limits<uint32_t>::max();
+
+inline uint32_t EncodeFailurePopList(int offset, int length) {
+  return (offset << kBitsToEncodeFailurePopsListSize) | (length - 1);
+}
+
+inline void GetFailurePopsOffsetAndLength(uint32_t offset_and_length,
+                                          int* out_offset,
+                                          int* out_length) {
+  *out_offset = offset_and_length >> kBitsToEncodeFailurePopsListSize;
+  *out_length = (offset_and_length & kMaskToEncodeFailurePopsListSize) + 1;
+}
+
+static constexpr uint32_t kNullNode = std::numeric_limits<uint32_t>::max();
+
+static constexpr uint32_t kMaxSupportedTrieSize =
+    std::numeric_limits<uint32_t>::max();
+
+// A Unicode control char that never appears in the input as it is filtered
+// during text normalization. It is used to build dummy nodes in the trie.
+static constexpr char kInvalidControlChar = 0x11;
+
+inline bool IsSuffixWord(const std::string& word,
+                         const std::string& continuing_subword_prefix) {
+  return word.rfind(continuing_subword_prefix) == 0;
+}
+
+template <typename T>
+inline bool DecodePOD(const char* str, size_t str_len, T* result) {
+  if (sizeof(*result) != str_len) {
+    return false;
+  }
+  memcpy(result, str, sizeof(T));
+  return true;
+}
+
+
+template <typename T>
+inline std::string EncodePOD(const T& value) {
+  std::string s;
+  s.resize(sizeof(T));
+  memcpy(const_cast<char*>(s.data()), &value, sizeof(T));
+  return s;
+}
+
+inline size_t OneCharLen(const char* src) {
+  return "\1\1\1\1\1\1\1\1\1\1\1\1\2\2\3\4"[(*src & 0xFF) >> 4];
+}
+
+#ifdef IS_BIG_ENDIAN
+inline uint32_t Swap32(uint32_t x) { return __builtin_bswap32(x); }
+#endif
+
+void GetSortedVocab(const std::vector<const char*>& keys,
+                    const std::vector<uint32_t>& values,
+                    std::vector<const char*>* sorted_keys,
+                    std::vector<uint32_t>* sorted_values);
+
+std::unordered_map<uint32_t, uint32_t> CreateBytesToChars();
+
+bool IsWhiteSpace(int ch);
+
+}  // namespace utils
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/utils/variant.h b/fast_tokenizer/fast_tokenizer/utils/variant.h
new file mode 100644
index 0000000000000000000000000000000000000000..3429c2ce15b3c02055c4670edee9f901b81679e3
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/utils/variant.h
@@ -0,0 +1,2859 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and +// limitations under the License. + +// Copy from +// https://github.com/mpark/variant/blob/single-header/v1.4.0/variant.hpp +// Modify the following points: +// 1. modify namespace mpark to namespace paddlenlp +// 2. add type() member function for variant class +// 3. remove the visitation implementation under the branhch with +// MPARK_CPP14_CONSTEXPR defined since lib::cpp14::array could not be converted +// to std::initializer_list in Paddle's compilation +// 4. decorate PYBIND11_HIDDEN for struct value_visitor + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#pragma once + +// gcc >= 9 has a bug that creates a false positive warning. +// Reference: +// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92145 +// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89381 +#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 9 +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wdeprecated-copy" +#endif + +/* + variant synopsis + +namespace std { + + // 20.7.2, class template variant + template + class variant { + public: + + // 20.7.2.1, constructors + constexpr variant() noexcept(see below); + variant(const variant&); + variant(variant&&) noexcept(see below); + + template constexpr variant(T&&) noexcept(see below); + + template + constexpr explicit variant(in_place_type_t, Args&&...); + + template + constexpr explicit variant( + in_place_type_t, initializer_list, Args&&...); + + template + constexpr explicit variant(in_place_index_t, Args&&...); + + template + constexpr explicit variant( + in_place_index_t, initializer_list, Args&&...); + + // 20.7.2.2, destructor + ~variant(); + + // 20.7.2.3, assignment + variant& operator=(const variant&); + variant& operator=(variant&&) noexcept(see below); + + template variant& operator=(T&&) noexcept(see below); + + // 20.7.2.4, modifiers + template + T& emplace(Args&&...); + + template + T& emplace(initializer_list, Args&&...); + + template + variant_alternative& emplace(Args&&...); + + template + variant_alternative& emplace(initializer_list, Args&&...); + + // 20.7.2.5, value status + constexpr bool valueless_by_exception() const noexcept; + constexpr size_t index() const noexcept; + + // 20.7.2.6, swap + void swap(variant&) noexcept(see below); + }; + + // 20.7.3, variant helper classes + template struct variant_size; // undefined + + template + constexpr size_t variant_size_v = variant_size::value; + + template struct variant_size; + template struct variant_size; + template struct variant_size; + + template + struct variant_size>; + + template struct variant_alternative; // undefined + + template + using variant_alternative_t = typename variant_alternative::type; + + template struct variant_alternative; + template struct variant_alternative; + template struct variant_alternative; + + template + struct variant_alternative>; + + constexpr size_t variant_npos = -1; + + // 20.7.4, value access + template + constexpr bool holds_alternative(const variant&) noexcept; + + template + constexpr variant_alternative_t>& + get(variant&); + + template + constexpr variant_alternative_t>&& + get(variant&&); + + template + constexpr variant_alternative_t> const& + get(const variant&); + + template + constexpr variant_alternative_t> const&& + get(const variant&&); + + template + constexpr T& get(variant&); + + 
template + constexpr T&& get(variant&&); + + template + constexpr const T& get(const variant&); + + template + constexpr const T&& get(const variant&&); + + template + constexpr add_pointer_t>> + get_if(variant*) noexcept; + + template + constexpr add_pointer_t>> + get_if(const variant*) noexcept; + + template + constexpr add_pointer_t + get_if(variant*) noexcept; + + template + constexpr add_pointer_t + get_if(const variant*) noexcept; + + // 20.7.5, relational operators + template + constexpr bool operator==(const variant&, const variant&); + + template + constexpr bool operator!=(const variant&, const variant&); + + template + constexpr bool operator<(const variant&, const variant&); + + template + constexpr bool operator>(const variant&, const variant&); + + template + constexpr bool operator<=(const variant&, const variant&); + + template + constexpr bool operator>=(const variant&, const variant&); + + // 20.7.6, visitation + template + constexpr see below visit(Visitor&&, Variants&&...); + + // 20.7.7, class monostate + struct monostate; + + // 20.7.8, monostate relational operators + constexpr bool operator<(monostate, monostate) noexcept; + constexpr bool operator>(monostate, monostate) noexcept; + constexpr bool operator<=(monostate, monostate) noexcept; + constexpr bool operator>=(monostate, monostate) noexcept; + constexpr bool operator==(monostate, monostate) noexcept; + constexpr bool operator!=(monostate, monostate) noexcept; + + // 20.7.9, specialized algorithms + template + void swap(variant&, variant&) noexcept(see below); + + // 20.7.10, class bad_variant_access + class bad_variant_access; + + // 20.7.11, hash support + template struct hash; + template struct hash>; + template <> struct hash; + +} // namespace std + +*/ + +#include +#include +#include +#include +#include +#include +#include + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#ifndef MPARK_CONFIG_HPP +#define MPARK_CONFIG_HPP + +// MSVC 2015 Update 3. +#if __cplusplus < 201103L && (!defined(_MSC_VER) || _MSC_FULL_VER < 190024210) +#error "MPark.Variant requires C++11 support." 
+#endif + +#ifndef __has_attribute +#define __has_attribute(x) 0 +#endif + +#ifndef __has_builtin +#define __has_builtin(x) 0 +#endif + +#ifndef __has_include +#define __has_include(x) 0 +#endif + +#ifndef __has_feature +#define __has_feature(x) 0 +#endif + +#if __has_attribute(always_inline) || defined(__GNUC__) +#define MPARK_ALWAYS_INLINE __attribute__((__always_inline__)) inline +#elif defined(_MSC_VER) +#define MPARK_ALWAYS_INLINE __forceinline +#else +#define MPARK_ALWAYS_INLINE inline +#endif + +#if __has_builtin(__builtin_addressof) || \ + (defined(__GNUC__) && __GNUC__ >= 7) || defined(_MSC_VER) +#define MPARK_BUILTIN_ADDRESSOF +#endif + +#if __has_builtin(__builtin_unreachable) || defined(__GNUC__) +#define MPARK_BUILTIN_UNREACHABLE __builtin_unreachable() +#elif defined(_MSC_VER) +#define MPARK_BUILTIN_UNREACHABLE __assume(false) +#else +#define MPARK_BUILTIN_UNREACHABLE +#endif + +#if __has_builtin(__type_pack_element) +#define MPARK_TYPE_PACK_ELEMENT +#endif + +#if defined(__cpp_constexpr) && __cpp_constexpr >= 200704 && \ + !(defined(__GNUC__) && __GNUC__ == 4 && __GNUC_MINOR__ == 9) +#define MPARK_CPP11_CONSTEXPR +#endif + +#if defined(__cpp_constexpr) && __cpp_constexpr >= 201304 +#define MPARK_CPP14_CONSTEXPR +#endif + +#if __has_feature(cxx_exceptions) || defined(__cpp_exceptions) || \ + (defined(_MSC_VER) && defined(_CPPUNWIND)) +#define MPARK_EXCEPTIONS +#endif + +#if defined(__cpp_generic_lambdas) || defined(_MSC_VER) +#define MPARK_GENERIC_LAMBDAS +#endif + +#if defined(__cpp_lib_integer_sequence) +#define MPARK_INTEGER_SEQUENCE +#endif + +#if defined(__cpp_return_type_deduction) || defined(_MSC_VER) +#define MPARK_RETURN_TYPE_DEDUCTION +#endif + +#if defined(__cpp_lib_transparent_operators) || defined(_MSC_VER) +#define MPARK_TRANSPARENT_OPERATORS +#endif + +#if defined(__cpp_variable_templates) || defined(_MSC_VER) +#define MPARK_VARIABLE_TEMPLATES +#endif + +#if !defined(__GLIBCXX__) || __has_include() // >= libstdc++-5 +#define MPARK_TRIVIALITY_TYPE_TRAITS +#define MPARK_INCOMPLETE_TYPE_TRAITS +#endif + +#endif // MPARK_CONFIG_HPP + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#ifndef MPARK_IN_PLACE_HPP +#define MPARK_IN_PLACE_HPP + +#include + +namespace paddlenlp { + +struct in_place_t { + explicit in_place_t() = default; +}; + +template +struct in_place_index_t { + explicit in_place_index_t() = default; +}; + +template +struct in_place_type_t { + explicit in_place_type_t() = default; +}; + +#ifdef MPARK_VARIABLE_TEMPLATES +constexpr in_place_t in_place{}; + +template +constexpr in_place_index_t in_place_index{}; + +template +constexpr in_place_type_t in_place_type{}; +#endif + +} // namespace paddlenlp + +#endif // MPARK_IN_PLACE_HPP + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#ifndef MPARK_LIB_HPP +#define MPARK_LIB_HPP + +#include +#include +#include +#include + +#define MPARK_RETURN(...) \ + noexcept(noexcept(__VA_ARGS__))->decltype(__VA_ARGS__) { return __VA_ARGS__; } + +namespace paddlenlp { +namespace lib { +template +struct identity { + using type = T; +}; + +inline namespace cpp14 { +template +struct array { + constexpr const T &operator[](std::size_t index) const { return data[index]; } + + T data[N == 0 ? 
1 : N]; +}; + +template +using add_pointer_t = typename std::add_pointer::type; + +template +using common_type_t = typename std::common_type::type; + +template +using decay_t = typename std::decay::type; + +template +using enable_if_t = typename std::enable_if::type; + +template +using remove_const_t = typename std::remove_const::type; + +template +using remove_reference_t = typename std::remove_reference::type; + +template +inline constexpr T &&forward(remove_reference_t &t) noexcept { + return static_cast(t); +} + +template +inline constexpr T &&forward(remove_reference_t &&t) noexcept { + static_assert(!std::is_lvalue_reference::value, + "can not forward an rvalue as an lvalue"); + return static_cast(t); +} + +template +inline constexpr remove_reference_t &&move(T &&t) noexcept { + return static_cast &&>(t); +} + +#ifdef MPARK_INTEGER_SEQUENCE +using std::index_sequence; +using std::index_sequence_for; +using std::integer_sequence; +using std::make_index_sequence; +#else +template +struct integer_sequence { + using value_type = T; + static constexpr std::size_t size() noexcept { return sizeof...(Is); } +}; + +template +using index_sequence = integer_sequence; + +template +struct make_index_sequence_concat; + +template +struct make_index_sequence_concat, + index_sequence> + : identity> {}; + +template +struct make_index_sequence_impl; + +template +using make_index_sequence = typename make_index_sequence_impl::type; + +template +struct make_index_sequence_impl + : make_index_sequence_concat, + make_index_sequence> {}; + +template <> +struct make_index_sequence_impl<0> : identity> {}; + +template <> +struct make_index_sequence_impl<1> : identity> {}; + +template +using index_sequence_for = make_index_sequence; +#endif + +// +#ifdef MPARK_TRANSPARENT_OPERATORS +using equal_to = std::equal_to<>; +#else +struct equal_to { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) == lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using not_equal_to = std::not_equal_to<>; +#else +struct not_equal_to { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) != lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using less = std::less<>; +#else +struct less { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) < lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using greater = std::greater<>; +#else +struct greater { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) > lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using less_equal = std::less_equal<>; +#else +struct less_equal { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) <= lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using greater_equal = std::greater_equal<>; +#else +struct greater_equal { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) >= lib::forward(rhs)) +}; +#endif +} // namespace cpp14 + +inline namespace cpp17 { +// +template +using bool_constant = std::integral_constant; + +template +struct voider : identity {}; + +template +using void_t = typename voider::type; + +namespace detail { +namespace swappable { + +using std::swap; + +template +struct is_swappable { + private: + template (), 
std::declval()))> + inline static std::true_type test(int); + + template + inline static std::false_type test(...); + + public: + static constexpr bool value = decltype(test(0))::value; +}; + +template +struct is_nothrow_swappable { + static constexpr bool value = + noexcept(swap(std::declval(), std::declval())); +}; + +template +struct is_nothrow_swappable : std::false_type {}; + +} // namespace swappable +} // namespace detail + +using detail::swappable::is_swappable; + +template +using is_nothrow_swappable = + detail::swappable::is_nothrow_swappable::value, T>; + +// +namespace detail { + +template +struct is_reference_wrapper : std::false_type {}; + +template +struct is_reference_wrapper> : std::true_type {}; + +template +struct Invoke; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmf, Arg &&arg, Args &&...args) + MPARK_RETURN((lib::forward(arg).*pmf)(lib::forward(args)...)) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmf, Arg &&arg, Args &&...args) + MPARK_RETURN((lib::forward(arg).get().* + pmf)(lib::forward(args)...)) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmf, Arg &&arg, Args &&...args) + MPARK_RETURN(((*lib::forward(arg)).* + pmf)(lib::forward(args)...)) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmo, Arg &&arg) + MPARK_RETURN(lib::forward(arg).*pmo) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmo, Arg &&arg) + MPARK_RETURN(lib::forward(arg).get().*pmo) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmo, Arg &&arg) + MPARK_RETURN((*lib::forward(arg)).*pmo) +}; + +template +inline constexpr auto invoke(R T::*f, Arg &&arg, Args &&...args) + MPARK_RETURN(Invoke::value, + (std::is_base_of>::value ? 0 + : is_reference_wrapper>::value + ? 
1 + : 2)>::invoke(f, + lib::forward(arg), + lib::forward(args)...)) + +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4100) +#endif + template + inline constexpr auto invoke(F &&f, Args &&...args) + MPARK_RETURN(lib::forward(f)(lib::forward(args)...)) +#ifdef _MSC_VER +#pragma warning(pop) +#endif +} // namespace detail + +template +inline constexpr auto invoke(F &&f, Args &&...args) + MPARK_RETURN(detail::invoke(lib::forward(f), + lib::forward(args)...)) + + namespace detail { + template + struct invoke_result {}; + + template + struct invoke_result< + void_t(), std::declval()...))>, + F, + Args...> : identity(), + std::declval()...))> {}; + +} // namespace detail + +template +using invoke_result = detail::invoke_result; + +template +using invoke_result_t = typename invoke_result::type; + +namespace detail { + +template +struct is_invocable : std::false_type {}; + +template +struct is_invocable>, F, Args...> + : std::true_type {}; + +template +struct is_invocable_r : std::false_type {}; + +template +struct is_invocable_r>, R, F, Args...> + : std::is_convertible, R> {}; + +} // namespace detail + +template +using is_invocable = detail::is_invocable; + +template +using is_invocable_r = detail::is_invocable_r; + +namespace detail { + +template +struct is_nothrow_invocable { + static constexpr bool value = + noexcept(lib::invoke(std::declval(), std::declval()...)); +}; + +template +struct is_nothrow_invocable : std::false_type {}; + +template +struct is_nothrow_invocable_r { + private: + inline static R impl() { + return lib::invoke(std::declval(), std::declval()...); + } + + public: + static constexpr bool value = noexcept(impl()); +}; + +template +struct is_nothrow_invocable_r : std::false_type {}; + +} // namespace detail + +template +using is_nothrow_invocable = + detail::is_nothrow_invocable::value, F, Args...>; + +template +using is_nothrow_invocable_r = detail:: + is_nothrow_invocable_r::value, R, F, Args...>; + +// +#ifdef MPARK_BUILTIN_ADDRESSOF +template +inline constexpr T *addressof(T &arg) noexcept { + return __builtin_addressof(arg); +} +#else +namespace detail { + +namespace has_addressof_impl { + +struct fail; + +template +inline fail operator&(T &&); + +template +inline static constexpr bool impl() { + return (std::is_class::value || std::is_union::value) && + !std::is_same()), fail>::value; +} + +} // namespace has_addressof_impl + +template +using has_addressof = bool_constant()>; + +template +inline constexpr T *addressof(T &arg, std::true_type) noexcept { + return std::addressof(arg); +} + +template +inline constexpr T *addressof(T &arg, std::false_type) noexcept { + return &arg; +} + +} // namespace detail + +template +inline constexpr T *addressof(T &arg) noexcept { + return detail::addressof(arg, detail::has_addressof{}); +} +#endif + +template +inline constexpr T *addressof(const T &&) = delete; + +} // namespace cpp17 + +template +struct remove_all_extents : identity {}; + +template +struct remove_all_extents> : remove_all_extents {}; + +template +using remove_all_extents_t = typename remove_all_extents::type; + +template +using size_constant = std::integral_constant; + +template +struct indexed_type : size_constant { + using type = T; +}; + +template +using all = std::is_same, + integer_sequence>; + +#ifdef MPARK_TYPE_PACK_ELEMENT +template +using type_pack_element_t = __type_pack_element; +#else +template +struct type_pack_element_impl { + private: + template + struct set; + + template + struct set> : indexed_type... 
{}; + + template + inline static std::enable_if impl(indexed_type); + + inline static std::enable_if impl(...); + + public: + using type = decltype(impl(set>{})); +}; + +template +using type_pack_element = typename type_pack_element_impl::type; + +template +using type_pack_element_t = typename type_pack_element::type; +#endif + +#ifdef MPARK_TRIVIALITY_TYPE_TRAITS +using std::is_trivially_copy_assignable; +using std::is_trivially_copy_constructible; +using std::is_trivially_move_assignable; +using std::is_trivially_move_constructible; +#else +template +struct is_trivially_copy_constructible + : bool_constant::value &&__has_trivial_copy( + T)> {}; + +template +struct is_trivially_move_constructible : bool_constant<__is_trivial(T)> {}; + +template +struct is_trivially_copy_assignable + : bool_constant::value &&__has_trivial_assign( + T)> {}; + +template +struct is_trivially_move_assignable : bool_constant<__is_trivial(T)> {}; +#endif + +template +struct dependent_type : T {}; + +template +struct push_back; + +template +using push_back_t = typename push_back::type; + +template +struct push_back, J> { + using type = index_sequence; +}; + +} // namespace lib +} // namespace paddlenlp + +#undef MPARK_RETURN + +#endif // MPARK_LIB_HPP + +namespace paddlenlp { + +#ifdef MPARK_RETURN_TYPE_DEDUCTION + +#define AUTO auto +#define AUTO_RETURN(...) \ + { return __VA_ARGS__; } + +#define AUTO_REFREF auto && +#define AUTO_REFREF_RETURN(...) \ + { return __VA_ARGS__; } + +#define DECLTYPE_AUTO decltype(auto) +#define DECLTYPE_AUTO_RETURN(...) \ + { return __VA_ARGS__; } + +#else + +#define AUTO auto +#define AUTO_RETURN(...) \ + ->lib::decay_t { return __VA_ARGS__; } + +#define AUTO_REFREF auto +#define AUTO_REFREF_RETURN(...) \ + ->decltype((__VA_ARGS__)) { \ + static_assert(std::is_reference::value, ""); \ + return __VA_ARGS__; \ + } + +#define DECLTYPE_AUTO auto +#define DECLTYPE_AUTO_RETURN(...) 
\ + ->decltype(__VA_ARGS__) { return __VA_ARGS__; } + +#endif + +class bad_variant_access : public std::exception { + public: + virtual const char *what() const noexcept override { + return "bad_variant_access"; + } +}; + +[[noreturn]] inline void throw_bad_variant_access() { +#ifdef MPARK_EXCEPTIONS + throw bad_variant_access{}; +#else + std::terminate(); + MPARK_BUILTIN_UNREACHABLE; +#endif +} + +template +class variant; + +template +struct variant_size; + +#ifdef MPARK_VARIABLE_TEMPLATES +template +constexpr std::size_t variant_size_v = variant_size::value; +#endif + +template +struct variant_size : variant_size {}; + +template +struct variant_size : variant_size {}; + +template +struct variant_size : variant_size {}; + +template +struct variant_size> : lib::size_constant {}; + +template +struct variant_alternative; + +template +using variant_alternative_t = typename variant_alternative::type; + +template +struct variant_alternative + : std::add_const> {}; + +template +struct variant_alternative + : std::add_volatile> {}; + +template +struct variant_alternative + : std::add_cv> {}; + +template +struct variant_alternative> { + static_assert(I < sizeof...(Ts), + "index out of bounds in `std::variant_alternative<>`"); + using type = lib::type_pack_element_t; +}; + +constexpr std::size_t variant_npos = static_cast(-1); + +namespace detail { + +constexpr std::size_t not_found = static_cast(-1); +constexpr std::size_t ambiguous = static_cast(-2); + +#ifdef MPARK_CPP14_CONSTEXPR +template +inline constexpr std::size_t find_index() { + constexpr lib::array matches = { + {std::is_same::value...}}; + std::size_t result = not_found; + for (std::size_t i = 0; i < sizeof...(Ts); ++i) { + if (matches[i]) { + if (result != not_found) { + return ambiguous; + } + result = i; + } + } + return result; +} +#else +inline constexpr std::size_t find_index_impl(std::size_t result, std::size_t) { + return result; +} + +template +inline constexpr std::size_t find_index_impl(std::size_t result, + std::size_t idx, + bool b, + Bs... bs) { + return b ? (result != not_found ? ambiguous + : find_index_impl(idx, idx + 1, bs...)) + : find_index_impl(result, idx + 1, bs...); +} + +template +inline constexpr std::size_t find_index() { + return find_index_impl(not_found, 0, std::is_same::value...); +} +#endif + +template +using find_index_sfinae_impl = + lib::enable_if_t>; + +template +using find_index_sfinae = find_index_sfinae_impl()>; + +template +struct find_index_checked_impl : lib::size_constant { + static_assert(I != not_found, "the specified type is not found."); + static_assert(I != ambiguous, "the specified type is ambiguous."); +}; + +template +using find_index_checked = find_index_checked_impl()>; + +struct valueless_t {}; + +enum class Trait { TriviallyAvailable, Available, Unavailable }; + +template + class IsTriviallyAvailable, + template + class IsAvailable> +inline constexpr Trait trait() { + return IsTriviallyAvailable::value ? Trait::TriviallyAvailable + : IsAvailable::value ? Trait::Available + : Trait::Unavailable; +} + +#ifdef MPARK_CPP14_CONSTEXPR +template +inline constexpr Trait common_trait(Traits... 
traits_) { + Trait result = Trait::TriviallyAvailable; + lib::array traits = {{traits_...}}; + for (std::size_t i = 0; i < sizeof...(Traits); ++i) { + Trait t = traits[i]; + if (static_cast(t) > static_cast(result)) { + result = t; + } + } + return result; +} +#else +inline constexpr Trait common_trait_impl(Trait result) { return result; } + +template +inline constexpr Trait common_trait_impl(Trait result, Trait t, Traits... ts) { + return static_cast(t) > static_cast(result) + ? common_trait_impl(t, ts...) + : common_trait_impl(result, ts...); +} + +template +inline constexpr Trait common_trait(Traits... ts) { + return common_trait_impl(Trait::TriviallyAvailable, ts...); +} +#endif + +template +struct traits { + static constexpr Trait copy_constructible_trait = + common_trait(trait()...); + + static constexpr Trait move_constructible_trait = + common_trait(trait()...); + + static constexpr Trait copy_assignable_trait = + common_trait(copy_constructible_trait, + trait()...); + + static constexpr Trait move_assignable_trait = + common_trait(move_constructible_trait, + trait()...); + + static constexpr Trait destructible_trait = common_trait( + trait()...); +}; + +namespace access { + +struct recursive_union { +#ifdef MPARK_RETURN_TYPE_DEDUCTION + template + inline static constexpr auto &&get_alt(V &&v, in_place_index_t<0>) { + return lib::forward(v).head_; + } + + template + inline static constexpr auto &&get_alt(V &&v, in_place_index_t) { + return get_alt(lib::forward(v).tail_, in_place_index_t{}); + } +#else + template + struct get_alt_impl { + template + inline constexpr AUTO_REFREF operator()(V &&v) const + AUTO_REFREF_RETURN(get_alt_impl{}(lib::forward(v).tail_)) + }; + + template + struct get_alt_impl<0, Dummy> { + template + inline constexpr AUTO_REFREF operator()(V &&v) const + AUTO_REFREF_RETURN(lib::forward(v).head_) + }; + + template + inline static constexpr AUTO_REFREF get_alt(V &&v, in_place_index_t) + AUTO_REFREF_RETURN(get_alt_impl{}(lib::forward(v))) +#endif +}; + +struct base { + template + inline static constexpr AUTO_REFREF get_alt(V &&v) +#ifdef _MSC_VER + AUTO_REFREF_RETURN(recursive_union::get_alt(lib::forward(v).data_, + in_place_index_t{})) +#else + AUTO_REFREF_RETURN(recursive_union::get_alt(data(lib::forward(v)), + in_place_index_t{})) +#endif +}; + +struct variant { + template + inline static constexpr AUTO_REFREF get_alt(V &&v) + AUTO_REFREF_RETURN(base::get_alt(lib::forward(v).impl_)) +}; + +} // namespace access + +namespace visitation { + +#if defined(MPARK_CPP14_CONSTEXPR) && !defined(_MSC_VER) +#define MPARK_VARIANT_SWITCH_VISIT +#endif + +struct base { + template + using dispatch_result_t = + decltype(lib::invoke(std::declval(), + access::base::get_alt<0>(std::declval())...)); + + template + struct expected { + template + inline static constexpr bool but_got() { + return std::is_same::value; + } + }; + + template + struct visit_return_type_check { + static_assert(expected::template but_got(), + "`visit` requires the visitor to have a single return type"); + + template + inline static constexpr DECLTYPE_AUTO invoke(Visitor &&visitor, + Alts &&...alts) + DECLTYPE_AUTO_RETURN(lib::invoke(lib::forward(visitor), + lib::forward(alts)...)) + }; + +#ifdef MPARK_VARIANT_SWITCH_VISIT + template + struct dispatcher; + + template + struct dispatcher { + template + MPARK_ALWAYS_INLINE static constexpr R dispatch(F &&, + typename ITs::type &&..., + Vs &&...) 
{ + MPARK_BUILTIN_UNREACHABLE; + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_case(F &&, Vs &&...) { + MPARK_BUILTIN_UNREACHABLE; + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_at(std::size_t, + F &&, + Vs &&...) { + MPARK_BUILTIN_UNREACHABLE; + } + }; + + template + struct dispatcher { + template + MPARK_ALWAYS_INLINE static constexpr R dispatch( + F &&f, typename ITs::type &&...visited_vs) { + using Expected = R; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt( + lib::forward(visited_vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt( + lib::forward(visited_vs))...); + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch( + F &&f, typename ITs::type &&...visited_vs, V &&v, Vs &&...vs) { +#define MPARK_DISPATCH(I) \ + dispatcher<(I < lib::decay_t::size()), \ + R, \ + ITs..., \ + lib::indexed_type>:: \ + template dispatch<0>(lib::forward(f), \ + lib::forward(visited_vs)..., \ + lib::forward(v), \ + lib::forward(vs)...) + +#define MPARK_DEFAULT(I) \ + dispatcher<(I < lib::decay_t::size()), R, ITs...>::template dispatch( \ + lib::forward(f), \ + lib::forward(visited_vs)..., \ + lib::forward(v), \ + lib::forward(vs)...) + + switch (v.index()) { + case B + 0: + return MPARK_DISPATCH(B + 0); + case B + 1: + return MPARK_DISPATCH(B + 1); + case B + 2: + return MPARK_DISPATCH(B + 2); + case B + 3: + return MPARK_DISPATCH(B + 3); + case B + 4: + return MPARK_DISPATCH(B + 4); + case B + 5: + return MPARK_DISPATCH(B + 5); + case B + 6: + return MPARK_DISPATCH(B + 6); + case B + 7: + return MPARK_DISPATCH(B + 7); + case B + 8: + return MPARK_DISPATCH(B + 8); + case B + 9: + return MPARK_DISPATCH(B + 9); + case B + 10: + return MPARK_DISPATCH(B + 10); + case B + 11: + return MPARK_DISPATCH(B + 11); + case B + 12: + return MPARK_DISPATCH(B + 12); + case B + 13: + return MPARK_DISPATCH(B + 13); + case B + 14: + return MPARK_DISPATCH(B + 14); + case B + 15: + return MPARK_DISPATCH(B + 15); + case B + 16: + return MPARK_DISPATCH(B + 16); + case B + 17: + return MPARK_DISPATCH(B + 17); + case B + 18: + return MPARK_DISPATCH(B + 18); + case B + 19: + return MPARK_DISPATCH(B + 19); + case B + 20: + return MPARK_DISPATCH(B + 20); + case B + 21: + return MPARK_DISPATCH(B + 21); + case B + 22: + return MPARK_DISPATCH(B + 22); + case B + 23: + return MPARK_DISPATCH(B + 23); + case B + 24: + return MPARK_DISPATCH(B + 24); + case B + 25: + return MPARK_DISPATCH(B + 25); + case B + 26: + return MPARK_DISPATCH(B + 26); + case B + 27: + return MPARK_DISPATCH(B + 27); + case B + 28: + return MPARK_DISPATCH(B + 28); + case B + 29: + return MPARK_DISPATCH(B + 29); + case B + 30: + return MPARK_DISPATCH(B + 30); + case B + 31: + return MPARK_DISPATCH(B + 31); + default: + return MPARK_DEFAULT(B + 32); + } + +#undef MPARK_DEFAULT +#undef MPARK_DISPATCH + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_case(F &&f, Vs &&...vs) { + using Expected = R; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...); + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_at(std::size_t index, + F &&f, + V &&v, + Vs &&...vs) { + static_assert(lib::all<(lib::decay_t::size() == + lib::decay_t::size())...>::value, + "all of the variants must be the same size."); +#define MPARK_DISPATCH_AT(I) \ + dispatcher<(I < 
lib::decay_t::size()), R>::template dispatch_case( \ + lib::forward(f), lib::forward(v), lib::forward(vs)...) + +#define MPARK_DEFAULT(I) \ + dispatcher<(I < lib::decay_t::size()), R>::template dispatch_at( \ + index, lib::forward(f), lib::forward(v), lib::forward(vs)...) + + switch (index) { + case B + 0: + return MPARK_DISPATCH_AT(B + 0); + case B + 1: + return MPARK_DISPATCH_AT(B + 1); + case B + 2: + return MPARK_DISPATCH_AT(B + 2); + case B + 3: + return MPARK_DISPATCH_AT(B + 3); + case B + 4: + return MPARK_DISPATCH_AT(B + 4); + case B + 5: + return MPARK_DISPATCH_AT(B + 5); + case B + 6: + return MPARK_DISPATCH_AT(B + 6); + case B + 7: + return MPARK_DISPATCH_AT(B + 7); + case B + 8: + return MPARK_DISPATCH_AT(B + 8); + case B + 9: + return MPARK_DISPATCH_AT(B + 9); + case B + 10: + return MPARK_DISPATCH_AT(B + 10); + case B + 11: + return MPARK_DISPATCH_AT(B + 11); + case B + 12: + return MPARK_DISPATCH_AT(B + 12); + case B + 13: + return MPARK_DISPATCH_AT(B + 13); + case B + 14: + return MPARK_DISPATCH_AT(B + 14); + case B + 15: + return MPARK_DISPATCH_AT(B + 15); + case B + 16: + return MPARK_DISPATCH_AT(B + 16); + case B + 17: + return MPARK_DISPATCH_AT(B + 17); + case B + 18: + return MPARK_DISPATCH_AT(B + 18); + case B + 19: + return MPARK_DISPATCH_AT(B + 19); + case B + 20: + return MPARK_DISPATCH_AT(B + 20); + case B + 21: + return MPARK_DISPATCH_AT(B + 21); + case B + 22: + return MPARK_DISPATCH_AT(B + 22); + case B + 23: + return MPARK_DISPATCH_AT(B + 23); + case B + 24: + return MPARK_DISPATCH_AT(B + 24); + case B + 25: + return MPARK_DISPATCH_AT(B + 25); + case B + 26: + return MPARK_DISPATCH_AT(B + 26); + case B + 27: + return MPARK_DISPATCH_AT(B + 27); + case B + 28: + return MPARK_DISPATCH_AT(B + 28); + case B + 29: + return MPARK_DISPATCH_AT(B + 29); + case B + 30: + return MPARK_DISPATCH_AT(B + 30); + case B + 31: + return MPARK_DISPATCH_AT(B + 31); + default: + return MPARK_DEFAULT(B + 32); + } + +#undef MPARK_DEFAULT +#undef MPARK_DISPATCH_AT + } + }; +#else + template + inline static constexpr const T &at(const T &elem) noexcept { + return elem; + } + + template + inline static constexpr const lib::remove_all_extents_t &at( + const lib::array &elems, std::size_t i, Is... is) noexcept { + return at(elems[i], is...); + } + + template + inline static constexpr lib::array, sizeof...(Fs) + 1> + make_farray(F &&f, Fs &&...fs) { + return {{lib::forward(f), lib::forward(fs)...}}; + } + + template + struct make_fmatrix_impl { + template + inline static constexpr dispatch_result_t dispatch(F &&f, + Vs &&...vs) { + using Expected = dispatch_result_t; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...); + } + +#ifdef MPARK_RETURN_TYPE_DEDUCTION + template + inline static constexpr auto impl(lib::index_sequence) { + return &dispatch; + } + + template + inline static constexpr auto impl(Is, + lib::index_sequence, + Ls... 
ls) { + return make_farray(impl(lib::push_back_t{}, ls...)...); + } +#else + template + struct impl; + + template + struct impl> { + inline constexpr AUTO operator()() const AUTO_RETURN(&dispatch) + }; + + template + struct impl, Ls...> { + inline constexpr AUTO operator()() const + AUTO_RETURN(make_farray(impl, Ls...>{}()...)) + }; +#endif + }; + +#ifdef MPARK_RETURN_TYPE_DEDUCTION + template + inline static constexpr auto make_fmatrix() { + return make_fmatrix_impl::impl( + lib::index_sequence<>{}, + lib::make_index_sequence::size()>{}...); + } +#else + template + inline static constexpr AUTO make_fmatrix() + AUTO_RETURN(typename make_fmatrix_impl::template impl< + lib::index_sequence<>, + lib::make_index_sequence::size()>...>{}()) +#endif + + template + struct make_fdiagonal_impl { + template + inline static constexpr dispatch_result_t dispatch(F &&f, + Vs &&...vs) { + using Expected = dispatch_result_t; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...); + } + + template + inline static constexpr AUTO impl(lib::index_sequence) + AUTO_RETURN(make_farray(&dispatch...)) + }; + + template + inline static constexpr auto make_fdiagonal() + -> decltype(make_fdiagonal_impl::impl( + lib::make_index_sequence::size()>{})) { + static_assert(lib::all<(lib::decay_t::size() == + lib::decay_t::size())...>::value, + "all of the variants must be the same size."); + return make_fdiagonal_impl::impl( + lib::make_index_sequence::size()>{}); + } +#endif +}; + +#if !defined(MPARK_VARIANT_SWITCH_VISIT) && \ + (!defined(_MSC_VER) || _MSC_VER >= 1910) +template +using fmatrix_t = decltype(base::make_fmatrix()); + +template +struct fmatrix { + static constexpr fmatrix_t value = base::make_fmatrix(); +}; + +template +constexpr fmatrix_t fmatrix::value; + +template +using fdiagonal_t = decltype(base::make_fdiagonal()); + +template +struct fdiagonal { + static constexpr fdiagonal_t value = + base::make_fdiagonal(); +}; + +template +constexpr fdiagonal_t fdiagonal::value; +#endif + +struct alt { + template + inline static constexpr DECLTYPE_AUTO visit_alt(Visitor &&visitor, Vs &&...vs) +#ifdef MPARK_VARIANT_SWITCH_VISIT + DECLTYPE_AUTO_RETURN( + base::dispatcher(vs)))...>>:: + template dispatch<0>(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#elif !defined(_MSC_VER) || _MSC_VER >= 1910 + DECLTYPE_AUTO_RETURN( + base::at(fmatrix(vs)))...>::value, + vs.index()...)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#else + DECLTYPE_AUTO_RETURN(base::at( + base::make_fmatrix(vs)))...>(), + vs.index()...)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#endif + + template + inline static constexpr DECLTYPE_AUTO + visit_alt_at(std::size_t index, Visitor &&visitor, Vs &&...vs) +#ifdef MPARK_VARIANT_SWITCH_VISIT + DECLTYPE_AUTO_RETURN( + base::dispatcher< + true, + base::dispatch_result_t< + Visitor, + decltype(as_base(lib::forward(vs)))...>>:: + template dispatch_at<0>(index, + lib::forward(visitor), + as_base(lib::forward(vs))...)) +#elif !defined(_MSC_VER) || _MSC_VER >= 1910 + DECLTYPE_AUTO_RETURN(base::at( + fdiagonal(vs)))...>::value, + index)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#else + DECLTYPE_AUTO_RETURN( + base::at(base::make_fdiagonal< + Visitor &&, + decltype(as_base(lib::forward(vs)))...>(), + index)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#endif +}; + +struct variant { + private: + 
template + struct visitor { + template + inline static constexpr bool does_not_handle() { + return lib::is_invocable::value; + } + }; + + template + struct visit_exhaustiveness_check { + static_assert(visitor::template does_not_handle(), + "`visit` requires the visitor to be exhaustive."); + + inline static constexpr DECLTYPE_AUTO invoke(Visitor &&visitor, + Values &&...values) + DECLTYPE_AUTO_RETURN(lib::invoke(lib::forward(visitor), + lib::forward(values)...)) + }; + + template + struct value_visitor { + Visitor &&visitor_; + + template + inline constexpr DECLTYPE_AUTO operator()(Alts &&...alts) const + DECLTYPE_AUTO_RETURN(visit_exhaustiveness_check< + Visitor, + decltype((lib::forward(alts).value))...>:: + invoke(lib::forward(visitor_), + lib::forward(alts).value...)) + }; + + template + inline static constexpr AUTO make_value_visitor(Visitor &&visitor) + AUTO_RETURN(value_visitor{lib::forward(visitor)}) + + public + : template + inline static constexpr DECLTYPE_AUTO + visit_alt(Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN(alt::visit_alt(lib::forward(visitor), + lib::forward(vs).impl_...)) + + template + inline static constexpr DECLTYPE_AUTO + visit_alt_at(std::size_t index, Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN( + alt::visit_alt_at(index, + lib::forward(visitor), + lib::forward(vs).impl_...)) + + template + inline static constexpr DECLTYPE_AUTO + visit_value(Visitor &&visitor, Vs &&...vs) DECLTYPE_AUTO_RETURN( + visit_alt(make_value_visitor(lib::forward(visitor)), + lib::forward(vs)...)) + + template + inline static constexpr DECLTYPE_AUTO + visit_value_at(std::size_t index, Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN( + visit_alt_at(index, + make_value_visitor(lib::forward(visitor)), + lib::forward(vs)...)) +}; + +} // namespace visitation + +template +struct alt { + using value_type = T; + +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4244) +#endif + template + inline explicit constexpr alt(in_place_t, Args &&...args) + : value(lib::forward(args)...) {} +#ifdef _MSC_VER +#pragma warning(pop) +#endif + + T value; +}; + +template +union recursive_union; + +template +union recursive_union {}; + +#define MPARK_VARIANT_RECURSIVE_UNION(destructible_trait, destructor) \ + template \ + union recursive_union { \ + public: \ + inline explicit constexpr recursive_union(valueless_t) noexcept \ + : dummy_{} {} \ + \ + template \ + inline explicit constexpr recursive_union(in_place_index_t<0>, \ + Args &&...args) \ + : head_(in_place_t{}, lib::forward(args)...) {} \ + \ + template \ + inline explicit constexpr recursive_union(in_place_index_t, \ + Args &&...args) \ + : tail_(in_place_index_t{}, lib::forward(args)...) 
{} \ + \ + recursive_union(const recursive_union &) = default; \ + recursive_union(recursive_union &&) = default; \ + \ + destructor \ + \ + recursive_union & \ + operator=(const recursive_union &) = default; \ + recursive_union &operator=(recursive_union &&) = default; \ + \ + private: \ + char dummy_; \ + alt head_; \ + recursive_union tail_; \ + \ + friend struct access::recursive_union; \ + } + +MPARK_VARIANT_RECURSIVE_UNION(Trait::TriviallyAvailable, + ~recursive_union() = default;); +MPARK_VARIANT_RECURSIVE_UNION(Trait::Available, ~recursive_union(){}); +MPARK_VARIANT_RECURSIVE_UNION(Trait::Unavailable, ~recursive_union() = delete;); + +#undef MPARK_VARIANT_RECURSIVE_UNION + +using index_t = unsigned int; + +template +class base { + public: + inline explicit constexpr base(valueless_t tag) noexcept + : data_(tag), index_(static_cast(-1)) {} + + template + inline explicit constexpr base(in_place_index_t, Args &&...args) + : data_(in_place_index_t{}, lib::forward(args)...), index_(I) {} + + inline constexpr bool valueless_by_exception() const noexcept { + return index_ == static_cast(-1); + } + + inline constexpr std::size_t index() const noexcept { + return valueless_by_exception() ? variant_npos : index_; + } + + protected: + using data_t = recursive_union; + + friend inline constexpr base &as_base(base &b) { return b; } + friend inline constexpr const base &as_base(const base &b) { return b; } + friend inline constexpr base &&as_base(base &&b) { return lib::move(b); } + friend inline constexpr const base &&as_base(const base &&b) { + return lib::move(b); + } + + friend inline constexpr data_t &data(base &b) { return b.data_; } + friend inline constexpr const data_t &data(const base &b) { return b.data_; } + friend inline constexpr data_t &&data(base &&b) { return lib::move(b).data_; } + friend inline constexpr const data_t &&data(const base &&b) { + return lib::move(b).data_; + } + + inline static constexpr std::size_t size() { return sizeof...(Ts); } + + data_t data_; + index_t index_; + + friend struct access::base; + friend struct visitation::base; +}; + +struct dtor { +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4100) +#endif + template + inline void operator()(Alt &alt) const noexcept { + alt.~Alt(); + } +#ifdef _MSC_VER +#pragma warning(pop) +#endif +}; + +#if !defined(_MSC_VER) || _MSC_VER >= 1910 +#define MPARK_INHERITING_CTOR(type, base) using base::base; +#else +#define MPARK_INHERITING_CTOR(type, base) \ + template \ + inline explicit constexpr type(Args &&...args) \ + : base(lib::forward(args)...) 
{} +#endif + +template +class destructor; + +#define MPARK_VARIANT_DESTRUCTOR(destructible_trait, definition, destroy) \ + template \ + class destructor, destructible_trait> \ + : public base { \ + using super = base; \ + \ + public: \ + MPARK_INHERITING_CTOR(destructor, super) \ + using super::operator=; \ + \ + destructor(const destructor &) = default; \ + destructor(destructor &&) = default; \ + definition destructor &operator=(const destructor &) = default; \ + destructor &operator=(destructor &&) = default; \ + \ + protected: \ + destroy \ + } + +MPARK_VARIANT_DESTRUCTOR( + Trait::TriviallyAvailable, ~destructor() = default; + , inline void destroy() noexcept { + this->index_ = static_cast(-1); + }); + +MPARK_VARIANT_DESTRUCTOR( + Trait::Available, + ~destructor() { destroy(); }, + inline void destroy() noexcept { + if (!this->valueless_by_exception()) { + visitation::alt::visit_alt(dtor{}, *this); + } + this->index_ = static_cast(-1); + }); + +MPARK_VARIANT_DESTRUCTOR(Trait::Unavailable, ~destructor() = delete; + , inline void destroy() noexcept = delete;); + +#undef MPARK_VARIANT_DESTRUCTOR + +template +class constructor : public destructor { + using super = destructor; + + public: + MPARK_INHERITING_CTOR(constructor, super) + using super::operator=; + + protected: +#ifndef MPARK_GENERIC_LAMBDAS + struct ctor { + template + inline void operator()(LhsAlt &lhs_alt, RhsAlt &&rhs_alt) const { + constructor::construct_alt(lhs_alt, lib::forward(rhs_alt).value); + } + }; +#endif + + template + inline static T &construct_alt(alt &a, Args &&...args) { + auto *result = ::new (static_cast(lib::addressof(a))) + alt(in_place_t{}, lib::forward(args)...); + return result->value; + } + + template + inline static void generic_construct(constructor &lhs, Rhs &&rhs) { + lhs.destroy(); + if (!rhs.valueless_by_exception()) { + visitation::alt::visit_alt_at( + rhs.index(), +#ifdef MPARK_GENERIC_LAMBDAS + [](auto &lhs_alt, auto &&rhs_alt) { + constructor::construct_alt( + lhs_alt, lib::forward(rhs_alt).value); + } +#else + ctor {} +#endif + , + lhs, + lib::forward(rhs)); + lhs.index_ = rhs.index_; + } + } +}; + +template +class move_constructor; + +#define MPARK_VARIANT_MOVE_CONSTRUCTOR(move_constructible_trait, definition) \ + template \ + class move_constructor, move_constructible_trait> \ + : public constructor> { \ + using super = constructor>; \ + \ + public: \ + MPARK_INHERITING_CTOR(move_constructor, super) \ + using super::operator=; \ + \ + move_constructor(const move_constructor &) = default; \ + definition ~move_constructor() = default; \ + move_constructor &operator=(const move_constructor &) = default; \ + move_constructor &operator=(move_constructor &&) = default; \ + } + +MPARK_VARIANT_MOVE_CONSTRUCTOR( + Trait::TriviallyAvailable, + move_constructor(move_constructor &&that) = default;); + +MPARK_VARIANT_MOVE_CONSTRUCTOR( + Trait::Available, + move_constructor(move_constructor &&that) noexcept( + lib::all::value...>::value) + : move_constructor(valueless_t{}) { + this->generic_construct(*this, lib::move(that)); + }); + +MPARK_VARIANT_MOVE_CONSTRUCTOR(Trait::Unavailable, + move_constructor(move_constructor &&) = delete;); + +#undef MPARK_VARIANT_MOVE_CONSTRUCTOR + +template +class copy_constructor; + +#define MPARK_VARIANT_COPY_CONSTRUCTOR(copy_constructible_trait, definition) \ + template \ + class copy_constructor, copy_constructible_trait> \ + : public move_constructor> { \ + using super = move_constructor>; \ + \ + public: \ + MPARK_INHERITING_CTOR(copy_constructor, super) \ + using 
super::operator=; \ + \ + definition copy_constructor(copy_constructor &&) = default; \ + ~copy_constructor() = default; \ + copy_constructor &operator=(const copy_constructor &) = default; \ + copy_constructor &operator=(copy_constructor &&) = default; \ + } + +MPARK_VARIANT_COPY_CONSTRUCTOR( + Trait::TriviallyAvailable, + copy_constructor(const copy_constructor &that) = default;); + +MPARK_VARIANT_COPY_CONSTRUCTOR( + Trait::Available, copy_constructor(const copy_constructor &that) + : copy_constructor(valueless_t{}) { + this->generic_construct(*this, that); + }); + +MPARK_VARIANT_COPY_CONSTRUCTOR( + Trait::Unavailable, copy_constructor(const copy_constructor &) = delete;); + +#undef MPARK_VARIANT_COPY_CONSTRUCTOR + +template +class assignment : public copy_constructor { + using super = copy_constructor; + + public: + MPARK_INHERITING_CTOR(assignment, super) + using super::operator=; + + template + inline /* auto & */ auto emplace(Args &&...args) + -> decltype(this->construct_alt(access::base::get_alt(*this), + lib::forward(args)...)) { + this->destroy(); + auto &result = this->construct_alt(access::base::get_alt(*this), + lib::forward(args)...); + this->index_ = I; + return result; + } + + protected: +#ifndef MPARK_GENERIC_LAMBDAS + template + struct assigner { + template + inline void operator()(ThisAlt &this_alt, ThatAlt &&that_alt) const { + self->assign_alt(this_alt, lib::forward(that_alt).value); + } + assignment *self; + }; +#endif + + template + inline void assign_alt(alt &a, Arg &&arg) { + if (this->index() == I) { +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4244) +#endif + a.value = lib::forward(arg); +#ifdef _MSC_VER +#pragma warning(pop) +#endif + } else { + struct { + void operator()(std::true_type) const { + this_->emplace(lib::forward(arg_)); + } + void operator()(std::false_type) const { + this_->emplace(T(lib::forward(arg_))); + } + assignment *this_; + Arg &&arg_; + } impl{this, lib::forward(arg)}; + impl(lib::bool_constant < std::is_nothrow_constructible::value || + !std::is_nothrow_move_constructible::value > {}); + } + } + + template + inline void generic_assign(That &&that) { + if (this->valueless_by_exception() && that.valueless_by_exception()) { + // do nothing. 
+ } else if (that.valueless_by_exception()) { + this->destroy(); + } else { + visitation::alt::visit_alt_at( + that.index(), +#ifdef MPARK_GENERIC_LAMBDAS + [this](auto &this_alt, auto &&that_alt) { + this->assign_alt(this_alt, + lib::forward(that_alt).value); + } +#else + assigner { this } +#endif + , + *this, + lib::forward(that)); + } + } +}; + +template +class move_assignment; + +#define MPARK_VARIANT_MOVE_ASSIGNMENT(move_assignable_trait, definition) \ + template \ + class move_assignment, move_assignable_trait> \ + : public assignment> { \ + using super = assignment>; \ + \ + public: \ + MPARK_INHERITING_CTOR(move_assignment, super) \ + using super::operator=; \ + \ + move_assignment(const move_assignment &) = default; \ + move_assignment(move_assignment &&) = default; \ + ~move_assignment() = default; \ + move_assignment &operator=(const move_assignment &) = default; \ + definition \ + } + +MPARK_VARIANT_MOVE_ASSIGNMENT( + Trait::TriviallyAvailable, + move_assignment &operator=(move_assignment &&that) = default;); + +MPARK_VARIANT_MOVE_ASSIGNMENT( + Trait::Available, + move_assignment & + operator=(move_assignment &&that) noexcept( + lib::all<(std::is_nothrow_move_constructible::value && + std::is_nothrow_move_assignable::value)...>::value) { + this->generic_assign(lib::move(that)); + return *this; + }); + +MPARK_VARIANT_MOVE_ASSIGNMENT( + Trait::Unavailable, + move_assignment &operator=(move_assignment &&) = delete;); + +#undef MPARK_VARIANT_MOVE_ASSIGNMENT + +template +class copy_assignment; + +#define MPARK_VARIANT_COPY_ASSIGNMENT(copy_assignable_trait, definition) \ + template \ + class copy_assignment, copy_assignable_trait> \ + : public move_assignment> { \ + using super = move_assignment>; \ + \ + public: \ + MPARK_INHERITING_CTOR(copy_assignment, super) \ + using super::operator=; \ + \ + copy_assignment(const copy_assignment &) = default; \ + copy_assignment(copy_assignment &&) = default; \ + ~copy_assignment() = default; \ + definition copy_assignment &operator=(copy_assignment &&) = default; \ + } + +MPARK_VARIANT_COPY_ASSIGNMENT( + Trait::TriviallyAvailable, + copy_assignment &operator=(const copy_assignment &that) = default;); + +MPARK_VARIANT_COPY_ASSIGNMENT( + Trait::Available, copy_assignment &operator=(const copy_assignment &that) { + this->generic_assign(that); + return *this; + }); + +MPARK_VARIANT_COPY_ASSIGNMENT( + Trait::Unavailable, + copy_assignment &operator=(const copy_assignment &) = delete;); + +#undef MPARK_VARIANT_COPY_ASSIGNMENT + +template +class impl : public copy_assignment> { + using super = copy_assignment>; + + public: + MPARK_INHERITING_CTOR(impl, super) + using super::operator=; + + template + inline void assign(Arg &&arg) { + this->assign_alt(access::base::get_alt(*this), lib::forward(arg)); + } + + inline void swap(impl &that) { + if (this->valueless_by_exception() && that.valueless_by_exception()) { + // do nothing. 
+ } else if (this->index() == that.index()) { + visitation::alt::visit_alt_at( + this->index(), +#ifdef MPARK_GENERIC_LAMBDAS + [](auto &this_alt, auto &that_alt) { + using std::swap; + swap(this_alt.value, that_alt.value); + } +#else + swapper {} +#endif + , + *this, + that); + } else { + impl *lhs = this; + impl *rhs = lib::addressof(that); + if (lhs->move_nothrow() && !rhs->move_nothrow()) { + std::swap(lhs, rhs); + } + impl tmp(lib::move(*rhs)); +#ifdef MPARK_EXCEPTIONS + // EXTENSION: When the move construction of `lhs` into `rhs` throws + // and `tmp` is nothrow move constructible then we move `tmp` back + // into `rhs` and provide the strong exception safety guarantee. + try { + this->generic_construct(*rhs, lib::move(*lhs)); + } catch (...) { + if (tmp.move_nothrow()) { + this->generic_construct(*rhs, lib::move(tmp)); + } + throw; + } +#else + this->generic_construct(*rhs, lib::move(*lhs)); +#endif + this->generic_construct(*lhs, lib::move(tmp)); + } + } + + inline const std::type_info &type() const { + return visitation::alt::visit_alt_at( + this->index(), +#ifdef MPARK_GENERIC_LAMBDAS + [](auto &alt) -> const std::type_info & { return typeid(alt.value); } +#else + typer {} +#endif + , + *this); + } + + private: +#ifndef MPARK_GENERIC_LAMBDAS + struct swapper { + template + inline void operator()(ThisAlt &this_alt, ThatAlt &that_alt) const { + using std::swap; + swap(this_alt.value, that_alt.value); + } + }; + + struct typer { + template + inline const std::type_info &operator()(Alt &alt) const { + return typeid(alt.value); + } + }; +#endif + + inline constexpr bool move_nothrow() const { + return this->valueless_by_exception() || + lib::array{{std::is_nothrow_move_constructible< + Ts>::value...}}[this->index()]; + } +}; + +#undef MPARK_INHERITING_CTOR + +template +struct overload_leaf { + using F = lib::size_constant (*)(T); + operator F() const { return nullptr; } +}; + +template +struct overload_impl { + private: + template + struct impl; + + template + struct impl> : overload_leaf... 
{}; + + public: + using type = impl>; +}; + +template +using overload = typename overload_impl::type; + +template +using best_match = lib::invoke_result_t, T &&>; + +template +struct is_in_place_index : std::false_type {}; + +template +struct is_in_place_index> : std::true_type {}; + +template +struct is_in_place_type : std::false_type {}; + +template +struct is_in_place_type> : std::true_type {}; + +} // namespace detail + +template +class variant { + static_assert(0 < sizeof...(Ts), + "variant must consist of at least one alternative."); + + static_assert(lib::all::value...>::value, + "variant can not have an array type as an alternative."); + + static_assert(lib::all::value...>::value, + "variant can not have a reference type as an alternative."); + + static_assert(lib::all::value...>::value, + "variant can not have a void type as an alternative."); + + public: + template < + typename Front = lib::type_pack_element_t<0, Ts...>, + lib::enable_if_t::value, int> = 0> + inline constexpr variant() noexcept( + std::is_nothrow_default_constructible::value) + : impl_(in_place_index_t<0>{}) {} + + variant(const variant &) = default; + variant(variant &&) = default; + + template < + typename Arg, + typename Decayed = lib::decay_t, + lib::enable_if_t::value, int> = 0, + lib::enable_if_t::value, int> = 0, + lib::enable_if_t::value, int> = 0, + std::size_t I = detail::best_match::value, + typename T = lib::type_pack_element_t, + lib::enable_if_t::value, int> = 0> + inline constexpr variant(Arg &&arg) noexcept( + std::is_nothrow_constructible::value) + : impl_(in_place_index_t{}, lib::forward(arg)) {} + + template , + lib::enable_if_t::value, int> = 0> + inline explicit constexpr variant( + in_place_index_t, + Args &&...args) noexcept(std::is_nothrow_constructible::value) + : impl_(in_place_index_t{}, lib::forward(args)...) {} + + template < + std::size_t I, + typename Up, + typename... Args, + typename T = lib::type_pack_element_t, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline explicit constexpr variant( + in_place_index_t, + std::initializer_list il, + Args &&...args) noexcept(std:: + is_nothrow_constructible< + T, + std::initializer_list &, + Args...>::value) + : impl_(in_place_index_t{}, il, lib::forward(args)...) {} + + template ::value, + lib::enable_if_t::value, int> = 0> + inline explicit constexpr variant( + in_place_type_t, + Args &&...args) noexcept(std::is_nothrow_constructible::value) + : impl_(in_place_index_t{}, lib::forward(args)...) {} + + template < + typename T, + typename Up, + typename... Args, + std::size_t I = detail::find_index_sfinae::value, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline explicit constexpr variant( + in_place_type_t, + std::initializer_list il, + Args &&...args) noexcept(std:: + is_nothrow_constructible< + T, + std::initializer_list &, + Args...>::value) + : impl_(in_place_index_t{}, il, lib::forward(args)...) 
{} + + ~variant() = default; + + variant &operator=(const variant &) = default; + variant &operator=(variant &&) = default; + + template , variant>::value, + int> = 0, + std::size_t I = detail::best_match::value, + typename T = lib::type_pack_element_t, + lib::enable_if_t<(std::is_assignable::value && + std::is_constructible::value), + int> = 0> + inline variant &operator=(Arg &&arg) noexcept( + (std::is_nothrow_assignable::value && + std::is_nothrow_constructible::value)) { + impl_.template assign(lib::forward(arg)); + return *this; + } + + template , + lib::enable_if_t::value, int> = 0> + inline T &emplace(Args &&...args) { + return impl_.template emplace(lib::forward(args)...); + } + + template < + std::size_t I, + typename Up, + typename... Args, + typename T = lib::type_pack_element_t, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline T &emplace(std::initializer_list il, Args &&...args) { + return impl_.template emplace(il, lib::forward(args)...); + } + + template ::value, + lib::enable_if_t::value, int> = 0> + inline T &emplace(Args &&...args) { + return impl_.template emplace(lib::forward(args)...); + } + + template < + typename T, + typename Up, + typename... Args, + std::size_t I = detail::find_index_sfinae::value, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline T &emplace(std::initializer_list il, Args &&...args) { + return impl_.template emplace(il, lib::forward(args)...); + } + + inline constexpr bool valueless_by_exception() const noexcept { + return impl_.valueless_by_exception(); + } + + inline constexpr std::size_t index() const noexcept { return impl_.index(); } + + template , + Dummy>::value && + lib::dependent_type, + Dummy>::value)...>::value, + int> = 0> + inline void swap(variant &that) noexcept( + lib::all<(std::is_nothrow_move_constructible::value && + lib::is_nothrow_swappable::value)...>::value) { + impl_.swap(that.impl_); + } + + inline const std::type_info &type() const noexcept { return impl_.type(); } + + private: + detail::impl impl_; + + friend struct detail::access::variant; + friend struct detail::visitation::variant; +}; + +template +inline constexpr bool holds_alternative(const variant &v) noexcept { + return v.index() == I; +} + +template +inline constexpr bool holds_alternative(const variant &v) noexcept { + return holds_alternative::value>(v); +} + +namespace detail { +template +struct generic_get_impl { + constexpr generic_get_impl(int) noexcept {} + + constexpr AUTO_REFREF operator()(V &&v) const + AUTO_REFREF_RETURN(access::variant::get_alt(lib::forward(v)).value) +}; + +template +inline constexpr AUTO_REFREF generic_get(V &&v) + AUTO_REFREF_RETURN(generic_get_impl(holds_alternative(v) + ? 
0 + : (throw_bad_variant_access(), + 0))(lib::forward(v))) +} // namespace detail + +template +inline constexpr variant_alternative_t> &get( + variant &v) { + return detail::generic_get(v); +} + +template +inline constexpr variant_alternative_t> &&get( + variant &&v) { + return detail::generic_get(lib::move(v)); +} + +template +inline constexpr const variant_alternative_t> &get( + const variant &v) { + return detail::generic_get(v); +} + +template +inline constexpr const variant_alternative_t> &&get( + const variant &&v) { + return detail::generic_get(lib::move(v)); +} + +template +inline constexpr T &get(variant &v) { + return get::value>(v); +} + +template +inline constexpr T &&get(variant &&v) { + return get::value>(lib::move(v)); +} + +template +inline constexpr const T &get(const variant &v) { + return get::value>(v); +} + +template +inline constexpr const T &&get(const variant &&v) { + return get::value>(lib::move(v)); +} + +namespace detail { + +template +inline constexpr /* auto * */ AUTO generic_get_if(V *v) noexcept + AUTO_RETURN(v &&holds_alternative(*v) + ? lib::addressof(access::variant::get_alt(*v).value) + : nullptr) + +} // namespace detail + +template +inline constexpr lib::add_pointer_t>> +get_if(variant *v) noexcept { + return detail::generic_get_if(v); +} + +template +inline constexpr lib::add_pointer_t< + const variant_alternative_t>> +get_if(const variant *v) noexcept { + return detail::generic_get_if(v); +} + +template +inline constexpr lib::add_pointer_t get_if(variant *v) noexcept { + return get_if::value>(v); +} + +template +inline constexpr lib::add_pointer_t get_if( + const variant *v) noexcept { + return get_if::value>(v); +} + +namespace detail { +template +struct convert_to_bool { + template + inline constexpr bool operator()(Lhs &&lhs, Rhs &&rhs) const { + static_assert( + std::is_convertible, bool>::value, + "relational operators must return a type" + " implicitly convertible to bool"); + return lib::invoke(RelOp{}, lib::forward(lhs), lib::forward(rhs)); + } +}; +} // namespace detail + +template +inline constexpr bool operator==(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using equal_to = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.index() != rhs.index()) return false; + if (lhs.valueless_by_exception()) return true; + return variant::visit_value_at(lhs.index(), equal_to{}, lhs, rhs); +#else + return lhs.index() == rhs.index() && + (lhs.valueless_by_exception() || + variant::visit_value_at(lhs.index(), equal_to{}, lhs, rhs)); +#endif +} + +template +inline constexpr bool operator!=(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using not_equal_to = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.index() != rhs.index()) return true; + if (lhs.valueless_by_exception()) return false; + return variant::visit_value_at(lhs.index(), not_equal_to{}, lhs, rhs); +#else + return lhs.index() != rhs.index() || + (!lhs.valueless_by_exception() && + variant::visit_value_at(lhs.index(), not_equal_to{}, lhs, rhs)); +#endif +} + +template +inline constexpr bool operator<(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using less = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (rhs.valueless_by_exception()) return false; + if (lhs.valueless_by_exception()) return true; + if (lhs.index() < rhs.index()) return true; + if (lhs.index() > rhs.index()) return false; + return variant::visit_value_at(lhs.index(), less{}, 
lhs, rhs); +#else + return !rhs.valueless_by_exception() && + (lhs.valueless_by_exception() || lhs.index() < rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), less{}, lhs, rhs))); +#endif +} + +template +inline constexpr bool operator>(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using greater = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.valueless_by_exception()) return false; + if (rhs.valueless_by_exception()) return true; + if (lhs.index() > rhs.index()) return true; + if (lhs.index() < rhs.index()) return false; + return variant::visit_value_at(lhs.index(), greater{}, lhs, rhs); +#else + return !lhs.valueless_by_exception() && + (rhs.valueless_by_exception() || lhs.index() > rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), greater{}, lhs, rhs))); +#endif +} + +template +inline constexpr bool operator<=(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using less_equal = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.valueless_by_exception()) return true; + if (rhs.valueless_by_exception()) return false; + if (lhs.index() < rhs.index()) return true; + if (lhs.index() > rhs.index()) return false; + return variant::visit_value_at(lhs.index(), less_equal{}, lhs, rhs); +#else + return lhs.valueless_by_exception() || + (!rhs.valueless_by_exception() && + (lhs.index() < rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), less_equal{}, lhs, rhs)))); +#endif +} + +template +inline constexpr bool operator>=(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using greater_equal = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (rhs.valueless_by_exception()) return true; + if (lhs.valueless_by_exception()) return false; + if (lhs.index() > rhs.index()) return true; + if (lhs.index() < rhs.index()) return false; + return variant::visit_value_at(lhs.index(), greater_equal{}, lhs, rhs); +#else + return rhs.valueless_by_exception() || + (!lhs.valueless_by_exception() && + (lhs.index() > rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), greater_equal{}, lhs, rhs)))); +#endif +} + +struct monostate {}; + +inline constexpr bool operator<(monostate, monostate) noexcept { return false; } + +inline constexpr bool operator>(monostate, monostate) noexcept { return false; } + +inline constexpr bool operator<=(monostate, monostate) noexcept { return true; } + +inline constexpr bool operator>=(monostate, monostate) noexcept { return true; } + +inline constexpr bool operator==(monostate, monostate) noexcept { return true; } + +inline constexpr bool operator!=(monostate, monostate) noexcept { + return false; +} + +namespace detail { + +template +inline constexpr bool all_impl(const lib::array &bs, std::size_t idx) { + return idx >= N || (bs[idx] && all_impl(bs, idx + 1)); +} + +template +inline constexpr bool all(const lib::array &bs) { + return all_impl(bs, 0); +} + +} // namespace detail + +template +inline constexpr DECLTYPE_AUTO visit(Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN( + (detail::all(lib::array{ + {!vs.valueless_by_exception()...}}) + ? 
(void)0 + : throw_bad_variant_access()), + detail::visitation::variant::visit_value(lib::forward(visitor), + lib::forward(vs)...)) + + template + inline auto swap(variant &lhs, + variant &rhs) noexcept(noexcept(lhs.swap(rhs))) + -> decltype(lhs.swap(rhs)) { + lhs.swap(rhs); +} + +namespace detail { + +template +using enabled_type = T; + +namespace hash { + +template +constexpr bool meets_requirements() noexcept { + return std::is_copy_constructible::value && + std::is_move_constructible::value && + lib::is_invocable_r::value; +} + +template +constexpr bool is_enabled() noexcept { + using H = std::hash; + return meets_requirements() && + std::is_default_constructible::value && + std::is_copy_assignable::value && std::is_move_assignable::value; +} + +} // namespace hash + +} // namespace detail + +#undef AUTO +#undef AUTO_RETURN + +#undef AUTO_REFREF +#undef AUTO_REFREF_RETURN + +#undef DECLTYPE_AUTO +#undef DECLTYPE_AUTO_RETURN + +} // namespace paddlenlp + +namespace std { + +template +struct hash, + paddlenlp::lib::enable_if_t>()...>::value>>> { + using argument_type = paddlenlp::variant; + using result_type = std::size_t; + + inline result_type operator()(const argument_type &v) const { + using paddlenlp::detail::visitation::variant; + std::size_t result = + v.valueless_by_exception() + ? 299792458 // Random value chosen by the universe upon creation + : variant::visit_alt( +#ifdef MPARK_GENERIC_LAMBDAS + [](const auto &alt) { + using alt_type = paddlenlp::lib::decay_t; + using value_type = paddlenlp::lib::remove_const_t< + typename alt_type::value_type>; + return hash{}(alt.value); + } +#else + hasher {} +#endif + , + v); + return hash_combine(result, hash{}(v.index())); + } + + private: +#ifndef MPARK_GENERIC_LAMBDAS + struct hasher { + template + inline std::size_t operator()(const Alt &alt) const { + using alt_type = paddlenlp::lib::decay_t; + using value_type = + paddlenlp::lib::remove_const_t; + return hash{}(alt.value); + } + }; +#endif + + static std::size_t hash_combine(std::size_t lhs, std::size_t rhs) { + return lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2); + } +}; + +template <> +struct hash { + using argument_type = paddlenlp::monostate; + using result_type = std::size_t; + + inline result_type operator()(const argument_type &) const noexcept { + return 66740831; // return a fundamentally attractive random value. 
+ } +}; + +} // namespace std + +#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 9 +#pragma GCC diagnostic pop +#endif diff --git a/fast_tokenizer/icu_filters.json b/fast_tokenizer/icu_filters.json new file mode 100644 index 0000000000000000000000000000000000000000..42f1735d0421e8ec5e32927218790b14792285be --- /dev/null +++ b/fast_tokenizer/icu_filters.json @@ -0,0 +1,35 @@ +{ + "localeFilter": { + "filterType": "language", + "includelist": [ + "en", + "zh" + ] + }, + "featureFilters": { + "coll_tree": "exclude", + "coll_ucadata": "exclude", + + "confusables": "exclude", + + "curr_tree": "exclude", + "curr_supplemental": "exclude", + + "lang_tree": "exclude", + + "unit_tree": "exclude", + + "rbnf_tree": "exclude", + + "zone_tree": "exclude", + "zone_supplemental": "exclude", + + "stringprep": "exclude", + + "translit": "exclude", + + "unames": "exclude", + + "conversion_mappings": "exclude" + } +} diff --git a/fast_tokenizer/perf/README.md b/fast_tokenizer/perf/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7d4f4b1d33a9e499c18d7d8a718818bdcd9f0931 --- /dev/null +++ b/fast_tokenizer/perf/README.md @@ -0,0 +1,68 @@ +# 飞桨FastTokenizer性能测试 + +在PaddleNLP v2.2.0版本中PaddleNLP推出了高性能的Transformer类文本分词器,简称飞桨FastTokenizer。为了验证飞桨FastTokenizer的性能快的特点,PaddleNLP选取了业内常见的一些文本分词器进行了性能对比比较,主要进行性能参考的是HuggingFace BertTokenizer, Tensorflow-text BertTokenizer. 我们以 bert-base-chinese 模型为例进行了文本分词性能实验对比,在中文的数据下进行性能对比实验,下面是具体实验设置信息: +* [HuggingFace Tokenizers(Python)](https://github.com/huggingface/tokenizers): + +```python +from transformers import AutoTokenizer + +hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False) +``` + +* [HuggingFace Tokenizers(Rust)](https://github.com/huggingface/tokenizers): + +```python +from transformers import AutoTokenizer + +hf_tokenizer_fast = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True) +``` + +* [TensorFlow-Text](https://www.tensorflow.org/text/api_docs/python/text/BertTokenizer): + +```python +import tensorflow_text as tf_text + +# vocab 为bert-base-chinese的词汇表 +tf_tokenizer = tf_text.BertTokenizer(vocab) +``` + +* [飞桨FastTokenizer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/experimental): + +```python +from paddlenlp.experimental import FastTokenizer + +fast_tokenizer = FastTokenizer.from_pretrained("bert-base-chinese") + +``` + + +## 环境依赖 + +* paddlepaddle >= 2.2.1 +* paddlenlp >= 2.2 +* transformers == 4.11.3 +* tokenizers == 0.10.3 +* tensorflow_text == 2.5.0 + + +```shell +pip install -r requirements.txt +``` + +## 运行 + +```shell +python perf.py +``` + +- 测试环境: + + * CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz,物理核数40 + * GPU: CUDA 10.2, CuDNN 7.6.5, 16G + +- 测试结果: + + +
+*(Figure: tokenization throughput of FastTokenizer and the other tokenizers at a fixed text length, across different batch sizes)*
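+
+A minimal sketch for reproducing a few of these data points, assuming the `--batch_size` / `--max_seq_length` flags defined in `perf.py` and the `OMP_NUM_THREADS` / `RAYON_RS_NUM_CPUS` environment variables used in `run_all_perf.sh`:
+
+```shell
+# Sketch only: sweep a few batch sizes at a fixed sequence length.
+# OMP_NUM_THREADS / RAYON_RS_NUM_CPUS bound the CPU threads, as run_all_perf.sh does.
+export OMP_NUM_THREADS=16
+export RAYON_RS_NUM_CPUS=16
+for batch_size in 1 8 64; do
+    python perf.py --batch_size $batch_size --max_seq_length 128
+done
+```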
+ +飞桨FastTokenizer与其他框架性能的对比,是在固定文本长度在不同batch size下的分词吞吐量。纵坐标是对数坐标,单位是1w tokens/秒。随着batch size的增大,飞桨FastTokenizer速度会远远超过其他同类产品的实现,尤其是在大batch文本上飞桨框架能充分发挥多核机器的优势,取得领先的速度。 diff --git a/fast_tokenizer/perf/perf.py b/fast_tokenizer/perf/perf.py new file mode 100644 index 0000000000000000000000000000000000000000..0b40060b2c71970b71209cf149e744f997d9f2fb --- /dev/null +++ b/fast_tokenizer/perf/perf.py @@ -0,0 +1,134 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import time + +import tensorflow as tf +import tensorflow_text as tf_text +from transformers import AutoTokenizer + +from paddlenlp.experimental import FastTokenizer, to_tensor +from paddlenlp.transformers import BertTokenizer + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size for tokenization.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of tokenization epochs to perform.") +parser.add_argument("--num_samples", default=100, type=int, help="The number of samples to be tokenized") +# yapf: enable +args = parser.parse_args() + +max_seq_length = args.max_seq_length +batch_size = args.batch_size +epochs = args.epochs +num_samples = args.num_samples +total_tokens = epochs * num_samples * max_seq_length + +text = ( + "在世界几大古代文明中,中华文明源远流长、从未中断,至今仍充满蓬勃生机与旺盛生命力,这在人类历史上是了不起的奇迹。" + "本固根深、一脉相承的历史文化是铸就这一奇迹的重要基础。先秦时期是中华文化的创生期,奠定了此后几千年中华文化发展的" + "基础。考古发现证实,早期中华文明的形成经历了从“满天星斗”到“月明星稀”再到“多元一体”的过程。在这个过程中,不同地域、" + "不同人群的文化交流交融,中华民族最早的大家庭逐渐成形,国家由此诞生,“大同”社会理想和“天下为公,选贤与能,讲信修睦”" + "的价值追求逐渐深入人心。在早期国家形成过程中,我们的先人积累了初步的国家治理经验,包括经济、政治、军事、法律、文化" + "等各个方面,最终以典章、思想的形式进行总结和传承。流传至今的夏商西周国家治理经验、春秋战国诸子百家思想,是先秦时期" + "历史文化的集中反映。秦汉至宋元时期是中华文化的发展期,中华传统文化在这个时期走向成熟并迈向新的高峰。中央集权制度的" + "形成、郡县制度的推广、官僚制度的健全,推动中国传统社会形成国家治理的基本形态,为中国传统社会的长期延续和发展提供了" + "坚实的制度和文化支撑,贯穿其中的价值主线是对“大一统”的坚定追求。与此同时,民为邦本的民本思想、以文化人的文治主张、" + "协和万邦的天下观等,也在实践中得到丰富和完善。在追求“大一统”的历史中,民族精神世代相传,民族英雄史不绝书。" +) + +data = [text[:max_seq_length]] * num_samples + +# BERT Tokenizer using PaddleNLP FastTokenizer +pp_tokenizer = FastTokenizer.from_pretrained("bert-base-chinese") + +batches = [to_tensor(data[idx : idx + batch_size]) for idx in range(0, len(data), batch_size)] + +for batch_data in batches: + input_ids, token_type_ids = pp_tokenizer(text=batch_data, max_seq_len=max_seq_length) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + input_ids, token_type_ids = pp_tokenizer(batch_data, max_seq_len=max_seq_length) +end = time.time() + +print("The throughput of paddle FastTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True) + +batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + +for batch_data in batches: + encoded_inputs = 
hf_tokenizer(batch_data) + +# BERT Tokenizer using HuggingFace AutoTokenizer +start = time.time() +for _ in range(epochs): + for batch_data in batches: + encoded_inputs = hf_tokenizer(batch_data) # , padding=True, truncation=True) +end = time.time() +print("The throughput of huggingface FastTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +# BERT Tokenizer using PaddleNLP BertTokenizer +py_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese") +for batch_data in batches: + encoded_inputs = py_tokenizer(batch_data) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + encoded_inputs = py_tokenizer(batch_data) +end = time.time() +print("The throughput of paddle BertTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +# BERT Tokenizer using HuggingFace AutoTokenizer +hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False) + +for batch_data in batches: + encoded_inputs = hf_tokenizer(batch_data) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + encoded_inputs = hf_tokenizer(batch_data) # , padding=True, truncation=True) +end = time.time() +print("The throughput of huggingface python tokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +# BERT Tokenizer using TensorFlow Text +vocab_list = list(py_tokenizer.vocab.token_to_idx.keys()) +lookup_table = tf.lookup.StaticVocabularyTable( + tf.lookup.KeyValueTensorInitializer( + keys=vocab_list, + key_dtype=tf.string, + values=tf.range(tf.size(vocab_list, out_type=tf.int64), dtype=tf.int64), + value_dtype=tf.int64, + ), + num_oov_buckets=1, +) + +tf_tokenizer = tf_text.BertTokenizer(lookup_table) + +for batch_data in batches: + input_ids = tf_tokenizer.tokenize(batch_data) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + input_ids = tf_tokenizer.tokenize(batch_data) +end = time.time() +print("The throughput of TensorFlow Text BertTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) diff --git a/fast_tokenizer/perf/requirements.txt b/fast_tokenizer/perf/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0a26c4d03ac5d967e6f7976deb52b0468a6ae3a4 --- /dev/null +++ b/fast_tokenizer/perf/requirements.txt @@ -0,0 +1,3 @@ +paddlenlp>=2.2.0 +transformer==4.11.3 +tensorflow_text==2.5.0 \ No newline at end of file diff --git a/fast_tokenizer/perf/run_all_perf.sh b/fast_tokenizer/perf/run_all_perf.sh new file mode 100644 index 0000000000000000000000000000000000000000..052de2a717cde719812a5618fe5d77a5d92835e3 --- /dev/null +++ b/fast_tokenizer/perf/run_all_perf.sh @@ -0,0 +1,27 @@ +# !/bin/sh + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +for seq_len in 32 64 128 256 512; do +for batch_size in 1 2 4 8 16 32 64; do +mkdir -p seq_len_$seq_len/batch_size_$batch_size +for thread_num in 1 2 4 8 16 32 64; do +echo "Experiment setting: thread_num=$thread_num, batch_size=$batch_size, sequence_length=$seq_len" +export OMP_NUM_THREADS=$thread_num +export RAYON_RS_NUM_CPUS=$thread_num +python perf.py --batch_size $batch_size --max_seq_length $seq_len >seq_len_$seq_len/batch_size_$batch_size/parallel$thread_num.log 2>nohup.out +done +done +done \ No newline at end of file diff --git a/fast_tokenizer/python/CMakeLists.txt b/fast_tokenizer/python/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..ab0fe0065e3b844179ea2feef38ab862b9a85c71 --- /dev/null +++ b/fast_tokenizer/python/CMakeLists.txt @@ -0,0 +1,31 @@ +# 1. Copy the python tokenizers directory to the binary directory +add_custom_target(copy_python_tokenizers ALL + COMMAND ${CMAKE_COMMAND} -E copy_directory ${CMAKE_SOURCE_DIR}/python ${CMAKE_BINARY_DIR}/python + DEPENDS core_tokenizers) + + +# 2. Copy setup.py +add_custom_target(copy_setup ALL + COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_SOURCE_DIR}/setup.py ${CMAKE_BINARY_DIR}/setup.py) + +# 3. Copy the core_tokenizers.so to python tokenizers directory +set(TOKENIZER_CORE_NAME "core_tokenizers") +set(TOKENIZER_DST_DIR ${CMAKE_BINARY_DIR}/python/fast_tokenizer) +set(TOKENIZER_SRC_DIR ${CMAKE_BINARY_DIR}/fast_tokenizer) +set(TOKENIZER_THIRD_PARTY_DIR ${CMAKE_BINARY_DIR}/third_party) + +IF(WIN32) +set(ICU_DLL_DIR ${TOKENIZER_THIRD_PARTY_DIR}/icu/src/extern_icu/icu4c/bin64) +add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_SRC_DIR}/${TOKENIZER_CORE_NAME}.pyd ${TOKENIZER_SRC_DIR}/${TOKENIZER_CORE_NAME}.lib ${TOKENIZER_DST_DIR} + COMMAND ${CMAKE_COMMAND} -E copy ${ICU_DLL_DIR}/icudt70.dll ${ICU_DLL_DIR}/icuuc70.dll ${TOKENIZER_DST_DIR}/libs + DEPENDS copy_python_tokenizers) +ELSE(WIN32) +add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_SRC_DIR}/${TOKENIZER_CORE_NAME}.so ${TOKENIZER_DST_DIR} + DEPENDS copy_python_tokenizers) +ENDIF() + +add_custom_target(create_commit_id_file ALL + COMMAND ${GIT_EXECUTABLE} log -1 --format=%H > ${TOKENIZER_DST_DIR}/commit.log + DEPENDS copy_python_tokenizers) diff --git a/fast_tokenizer/python/fast_tokenizer/__init__.py b/fast_tokenizer/python/fast_tokenizer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ea7b5a9570f456153e7fb030165b377f04e26987 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/__init__.py @@ -0,0 +1,63 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +__version__ = "1.0.2" + +import os +import platform +import sys +from typing import Dict, List, Tuple, Union + +try: + current_path = os.path.abspath(os.path.dirname(__file__)) + if os.name == "nt": + third_lib_path = current_path + os.sep + "libs" + # Will load shared library from 'path' on windows + os.environ["path"] = current_path + ";" + third_lib_path + ";" + os.environ["path"] + sys.path.insert(0, third_lib_path) + # Note: from python3.8, PATH will not take effect + # https://github.com/python/cpython/pull/12302 + # Use add_dll_directory to specify dll resolution path + if sys.version_info[:2] >= (3, 8): + os.add_dll_directory(third_lib_path) +except ImportError as e: + if os.name == "nt": + executable_path = os.path.abspath(os.path.dirname(sys.executable)) + raise ImportError( + """NOTE: You may need to run \"set PATH=%s;%%PATH%%\" + if you encounters \"DLL load failed\" errors. If you have python + installed in other directory, replace \"%s\" with your own + directory. The original error is: \n %s""" + % (executable_path, executable_path, str(e)) + ) + else: + raise ImportError( + """NOTE: You may need to run \"export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH\" + if you encounters \"libmkldnn.so not found\" errors. If you have python + installed in other directory, replace \"/usr/local/lib\" with your own + directory. The original error is: \n""" + + str(e) + ) +except Exception as e: + raise e + +from . import core_tokenizers as C +from .c_wrap import * + +from . import decoders, models, normalizers, postprocessors, pretokenizers +from .tokenizers_impl import ( + ClipFastTokenizer, + ErnieFastTokenizer, + SentencePieceBPEFastTokenizer, +) diff --git a/fast_tokenizer/python/fast_tokenizer/c_wrap.py b/fast_tokenizer/python/fast_tokenizer/c_wrap.py new file mode 100644 index 0000000000000000000000000000000000000000..9645afceab8a2c4f91f550d087252b0df7ed9f7b --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/c_wrap.py @@ -0,0 +1,505 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Dict, List, Tuple, Union + +from . 
import core_tokenizers as C + +TextInputSequence = str +PreTokenizedInputSequence = Union[List[str], Tuple[str]] + +TextEncodeInput = Union[ + TextInputSequence, + Tuple[TextInputSequence, TextInputSequence], + List[TextInputSequence], +] + +PreTokenizedEncodeInput = Union[ + PreTokenizedInputSequence, + Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence], + List[PreTokenizedInputSequence], +] + +InputSequence = Union[TextInputSequence, PreTokenizedInputSequence] + +EncodeInput = Union[TextEncodeInput, PreTokenizedEncodeInput] + + +class OffsetType: + CHAR = C.OffsetType.CHAR + BYTE = C.OffsetType.BYTE + + +class Direction: + LEFT = C.Direction.LEFT + RIGHT = C.Direction.RIGHT + + +class TruncStrategy: + LONGEST_FIRST = C.TruncStrategy.LONGEST_FIRST + ONLY_FIRST = C.TruncStrategy.ONLY_FIRST + ONLY_SECOND = C.TruncStrategy.ONLY_SECOND + + +class PadStrategy: + BATCH_LONGEST = C.PadStrategy.BATCH_LONGEST + FIXED_SIZE = C.PadStrategy.FIXED_SIZE + + +class SplitMode: + REMOVED = C.SplitMode.REMOVED + ISOLATED = C.SplitMode.ISOLATED + MERGED_WITH_NEXT = C.SplitMode.MERGED_WITH_NEXT + MERGED_WITH_PREVIOUS = C.SplitMode.MERGED_WITH_PREVIOUS + CONTIGUOUS = C.SplitMode.CONTIGUOUS + + +class Token: + def __init__(self): + self._token = C.Token() + + @property + def id(self): + return self._token.id + + @id.setter + def id(self, id: int): + self._token.id = id + + @property + def value(self): + return self._token.value + + @value.setter + def value(self, value: str): + self._token.value = value + + @property + def offset(self): + return self._token.offset + + @offset.setter + def offset(self, offset: Tuple[int, int]): + self._token.offset = offset + + def __repr__(self): + return self._token.__repr__() + + +class PadMethod: + def __init__(self): + self._pad_method = C.PadMethod() + + @property + def strategy(self): + return self._pad_method.strategy + + @strategy.setter + def strategy(self, strategy: str): + """Set the strategy of PadMethod. + :param strategy: (str) The strategy of PadMethod, 'batch_longest' and 'fixed_size' are valid + :return None + """ + self._pad_method.strategy = getattr(PadStrategy, strategy.upper()) + + @property + def direction(self): + return self._pad_method.direction + + @direction.setter + def direction(self, direction: str): + """Set the direction of PadMethod. 
+ :param strategy: (str) The direction of PadMethod, 'left' and 'right' are valid + :return None + """ + self._pad_method.direction = getattr(Direction, direction.upper()) + + @property + def pad_id(self): + return self._pad_method.pad_id + + @pad_id.setter + def pad_id(self, pad_id: int): + self._pad_method.pad_id = pad_id + + @property + def pad_token_type_id(self): + return self._pad_method.pad_token_type_id + + @pad_token_type_id.setter + def pad_token_type_id(self, pad_token_type_id: int): + self._pad_method.pad_token_type_id = pad_token_type_id + + @property + def pad_token(self): + return self._pad_method.pad_token + + @pad_token.setter + def pad_token(self, pad_token: str): + self._pad_method.pad_token = pad_token + + @property + def pad_len(self): + return self._pad_method.pad_len + + @pad_len.setter + def pad_len(self, pad_len: int): + self._pad_method.pad_len = pad_len + + @property + def pad_to_multiple_of(self): + return self._pad_method.pad_to_multiple_of + + @pad_to_multiple_of.setter + def pad_to_multiple_of(self, pad_to_multiple_of): + self._pad_method.pad_to_multiple_of = pad_to_multiple_of + + +class TruncMethod: + def __init__(self): + self._trunc_method = C.TruncMethod() + + @property + def max_len(self): + return self._trunc_method.max_len + + @max_len.setter + def max_len(self, max_len: int): + self._trunc_method.max_len = max_len + + @property + def strategy(self): + return self._trunc_method.strategy + + @strategy.setter + def strategy(self, strategy: str): + """Set the strategy of TruncMethod. + :param strategy: (str) The strategy of PadMethod, 'longest_first', 'only_first' and 'only_second' are valid + :return None + """ + self._trunc_method.strategy = getattr(TruncStrategy, strategy.upper()) + + @property + def direction(self): + return self._trunc_method.direction + + @direction.setter + def direction(self, direction: str): + """Set the direction of TruncMethod. 
+ :param strategy: (str) The direction of TruncMethod, 'left' and 'right' are valid + :return None + """ + self._trunc_method.direction = getattr(Direction, direction.upper()) + + @property + def stride(self): + return self._trunc_method.stride + + @stride.setter + def stride(self, stride: int): + self._trunc_method.stride = stride + + +class AddedToken: + def __init__(self, content="", single_word=False, lstrip=False, rstrip=False, normalized=True): + self._added_token = C.AddedToken(content, single_word, lstrip, rstrip, normalized) + + @property + def content(self): + return self._added_token.content + + @property + def get_is_special(self): + return self._added_token.get_is_special + + @property + def normalized(self): + return self._added_token.normalized + + @property + def lstrip(self): + return self._added_token.lstrip + + @property + def rstrip(self): + return self._added_token.rstrip + + @property + def single_word(self): + return self._added_token.single_word + + def __eq__(self, other): + return self._added_token == other._added_token + + +class Encoding: + def __init__( + self, + ids: List[int], + type_ids: List[int], + tokens: List[str], + words_idx: List[int], + offsets: List[Tuple[int, int]], + special_tokens_mask: List[int], + attention_mask: List[int], + overflowing: List, + sequence_ranges: Dict[str, Tuple[int, int]], + ): + self._encoding = C.Encoding( + ids, + type_ids, + tokens, + words_idx, + offsets, + special_tokens_mask, + attention_mask, + overflowing, + sequence_ranges, + ) + + def __str__(self): + return str(self._encoding) + + def __repr__(self): + return self._encoding.__repr__() + + def __len__(self): + return len(self._encoding) + + @property + def n_sequences(self): + return self._encoding.n_sequences + + @property + def tokens(self): + return self._encoding.tokens + + @property + def word_ids(self): + return self._encoding.word_ids + + @property + def sequence_ids(self): + return self._encoding.sequence_ids + + @property + def ids(self): + return self._encoding.ids + + @property + def type_ids(self): + return self._encoding.type_ids + + @property + def offsets(self): + return self._encoding.offsets + + @property + def special_tokens_mask(self): + return self._encoding.special_tokens_mask + + @property + def attention_mask(self): + return self._encoding.attention_mask + + @property + def overflowing(self): + return self._encoding.overflowing + + def set_sequence_ids(self, sequence_id: int): + return self._encoding.set_sequence_ids(sequence_id) + + def char_to_token(self, char_pos, sequence_index: int = 0): + return self._encoding.char_to_token(char_pos, sequence_index) + + @staticmethod + def merge(encodings: List, growing_offsets: bool = True): + return C.Encoding.merge(encodings, growing_offsets) + + def token_to_chars(self, token_index: int): + return self._encoding.token_to_chars(token_index) + + def token_to_sequence(self, token_index: int): + return self._encoding.token_to_sequence(token_index) + + def token_to_word(self, token_index: int): + return self._encoding.token_to_word(token_index) + + def word_to_chars(self, word_index: int, sequence_index: int = 0): + return self._encoding.word_to_chars(word_index, sequence_index) + + def word_to_tokens(self, word_index: int, sequence_index: int = 0): + return self._encoding.word_to_tokens(word_index, sequence_index) + + def truncate(self, max_length: int, stride: int = 0, direction: str = "right"): + return self._encoding.truncate(max_length, stride, direction) + + def pad( + self, length: int, direction: 
str = "right", pad_id: int = 0, pad_type_id: int = 0, pad_token: str = "[PAD]" + ): + return self._encoding.pad(length, direction, pad_id, pad_type_id, pad_token) + + +class Tokenizer: + def __init__(self, model): + self._tokenizer = None + if model is not None: + self._tokenizer = C.Tokenizer(model._model) + + @property + def normalizer(self): + return self._tokenizer.normalizer + + @normalizer.setter + def normalizer(self, normalizer): + self._tokenizer.normalizer = normalizer._normalizer + + @property + def pretokenizer(self): + return self._tokenizer.pretokenizer + + @pretokenizer.setter + def pretokenizer(self, pretokenizer): + self._tokenizer.pretokenizer = pretokenizer._pretokenizer + + @property + def model(self): + return self._tokenizer.model + + @model.setter + def model(self, model): + self._tokenizer.model = model._model + + @property + def postprocessor(self): + return self._tokenizer.postprocessor + + @postprocessor.setter + def postprocessor(self, postprocessor): + self._tokenizer.postprocessor = postprocessor._postprocessor + + @property + def decoder(self): + return self._tokenizer.decoder + + @decoder.setter + def decoder(self, decoder): + self._tokenizer.decoder = decoder._decoder + + @property + def padding(self): + return self._tokenizer.padding + + @property + def truncation(self): + return self._tokenizer.truncation + + def add_special_tokens(self, tokens: List[str]): + return self._tokenizer.add_special_tokens(tokens) + + def add_tokens(self, tokens: List[str]): + return self._tokenizer.add_tokens(tokens) + + def enable_padding( + self, + direction: str = "right", + pad_id: int = 0, + pad_type_id: int = 0, + pad_token: str = "[PAD]", + length: int = None, + pad_to_multiple_of: int = None, + ): + return self._tokenizer.enable_padding(direction, pad_id, pad_type_id, pad_token, length, pad_to_multiple_of) + + def disable_padding(self): + return self._tokenizer.disable_padding() + + def enable_truncation( + self, max_length: int, stride: int = 0, strategy: str = "longest_first", direction: str = "right" + ): + return self._tokenizer.enable_truncation(max_length, stride, strategy, direction) + + def disable_truncation(self): + return self._tokenizer.disable_truncation() + + def get_vocab(self, with_added_vocabulary: bool = True): + return self._tokenizer.get_vocab(with_added_vocabulary) + + def get_vocab_size(self, with_added_vocabulary: bool = True): + return self._tokenizer.get_vocab_size(with_added_vocabulary) + + def encode( + self, + sequence: InputSequence, + pair: InputSequence = None, + is_pretokenized: bool = False, + add_special_tokens: bool = True, + ): + return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens) + + def encode_batch( + self, + input: Union[List[EncodeInput], Tuple[EncodeInput]], + add_special_tokens: bool = True, + is_pretokenized: bool = False, + ): + return self._tokenizer.encode_batch(input, add_special_tokens, is_pretokenized) + + def decode(self, ids: List[int], skip_special_tokens: bool = True): + return self._tokenizer.decode(ids, skip_special_tokens) + + def decode_batch(self, sequences: List[List[int]], skip_special_tokens: bool = True): + return self._tokenizer.decode_batch(sequences, skip_special_tokens) + + def id_to_token(self, id: int): + return self._tokenizer.id_to_token(id) + + def token_to_id(self, token: str): + return self._tokenizer.token_to_id(token) + + def num_special_tokens_to_add(self, is_pair: bool = True): + return self._tokenizer.num_special_tokens_to_add(is_pair) + + def save(self, path: 
str, pretty: bool = True): + return self._tokenizer.save(path, pretty) + + def to_str(self, pretty: bool = True): + return self._tokenizer.to_str(pretty) + + @staticmethod + def from_str(json: str): + tr = Tokenizer(None) + tr._tokenizer = C.Tokenizer.from_str(json) + return tr + + @staticmethod + def from_file(json: str): + tr = Tokenizer(None) + tr._tokenizer = C.Tokenizer.from_file(json) + return tr + + +def set_thread_num(thread_num): + """Set the number of threads for accelerating batch tokenization + :param thread_num: (int) The number of threads + :return None + """ + C.set_thread_num(thread_num) + + +def get_thread_num(): + """Get the number of tokenization threads + :return int + """ + return C.get_thread_num() diff --git a/fast_tokenizer/python/fast_tokenizer/decoders/__init__.py b/fast_tokenizer/python/fast_tokenizer/decoders/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..142e5af3345a553aa9dda2fecdea5310e330226f --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/decoders/__init__.py @@ -0,0 +1,28 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC +from typing import List + +from .. import C + + +class Decoder(ABC): + def decode(self, tokens: List[str]): + return self._decoder.decode(tokens) + + +class WordPiece(Decoder): + def __init__(self, prefix: str = "##", cleanup: bool = True): + self._decoder = C.decoders.WordPiece(prefix, cleanup) diff --git a/fast_tokenizer/python/fast_tokenizer/libs/__init__.py b/fast_tokenizer/python/fast_tokenizer/libs/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ac566ed8eb61b16f39ab0c71ab928e0d0d76284a --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/libs/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# used for setup.py.in to store the thirdparty shared libraries diff --git a/fast_tokenizer/python/fast_tokenizer/models/__init__.py b/fast_tokenizer/python/fast_tokenizer/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..567944e4d2b738edaeaaf7448b6823d72f53eabd --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/models/__init__.py @@ -0,0 +1,169 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Tuple, Union, List, Dict +from abc import ABC + +from .. import C + + +class Model(ABC): + def tokenize(self, tokens: List[str]): + return self._model.tokenize(tokens) + + def token_to_id(self, token: str): + return self._model.token_to_id(token) + + def id_to_token(self, id: int): + return self._model.id_to_token(id) + + def get_vocab(self): + return self._model.get_vocab() + + def get_vocab_size(self): + return self._model.get_vocab_size() + + def save(self, folder: str, prefix: str = None): + return self._model.save(folder, prefix) + + +class WordPiece(Model): + def __init__( + self, + vocab: Dict[str, int], + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "##", + handle_chinese_chars: bool = True, + ): + self._model = None + if vocab is not None: + self._model = C.models.WordPiece( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix, handle_chinese_chars + ) + + @staticmethod + def read_file(vocab: str): + """Read a vocab.txt file + + :params vocab: (str) The path to a vocab.txt file + :return: Dict[str, int], The vocabulary as a dict + """ + return C.models.WordPiece.read_file(vocab) + + @staticmethod + def from_file( + vocab: str, + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "continuing_subword_prefix", + ): + """Load a WordPiece instance from vocab file. + + :param vocab: (str) The path to a vocab.txt file + :param unk_token: (str) The unknown token + :param max_input_chars_per_word: (int) The max number of char when tokenize a word + :param continuing_subword_prefix: (str) The latter subword prefix. + :return: An instance of WordPiece. + """ + wp = WordPiece(None) + wp._model = C.models.WordPiece.from_file(vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix) + return wp + + +class FastWordPiece(Model): + def __init__( + self, + vocab: Dict[str, int], + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "##", + with_pretokenization: bool = False, + ): + self._model = None + if vocab is not None: + self._model = C.models.FastWordPiece( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix, with_pretokenization + ) + + @staticmethod + def read_file(vocab: str): + """Read a vocab.txt file + + :params vocab: (str) The path to a vocab.txt file + :return: Dict[str, int], The vocabulary as a dict + """ + return C.models.FastWordPiece.read_file(vocab) + + @staticmethod + def from_file( + vocab: str, + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "continuing_subword_prefix", + with_pretokenization: bool = False, + ): + """Load a FastWordPiece instance from vocab file. + + :param vocab: (str) The path to a vocab.txt file + :param unk_token: (str) The unknown token + :param max_input_chars_per_word: (int) The max number of char when tokenize a word + :param continuing_subword_prefix: (str) The latter subword prefix. 
+ :param with_pretokenization: (bool) Whether to pretokenize sequence during the wordpiece tokenization. + If it's true, the end to end tokenization would be faster. + :return: An instance of FastWordPiece. + """ + wp = FastWordPiece(None) + wp._model = C.models.FastWordPiece.from_file( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix, with_pretokenization + ) + return wp + + +class BPE(Model): + def __init__( + self, + vocab: Dict[str, int] = None, + merges: List[Tuple[str, str]] = None, + cache_capacity: int = None, + dropout: float = None, + unk_token: str = None, + continuing_subword_prefix: str = None, + end_of_word_suffix: str = None, + fuse_unk: bool = None, + ): + self._model = C.models.BPE( + vocab, merges, cache_capacity, dropout, unk_token, continuing_subword_prefix, end_of_word_suffix, fuse_unk + ) + + @staticmethod + def read_file(vocab: str, merges: str): + return C.models.BPE.read_file(vocab, merges) + + @staticmethod + def from_file(vocab: str, merges: str, **kwargs): + bpe = BPE() + bpe._model = C.models.BPE.from_file(vocab, merges, **kwargs) + return bpe + + +class Unigram(Model): + def __init__(self, vocab: List[Tuple[str, float]] = None, unk_id: int = None): + self._model = C.models.Unigram(vocab, unk_id) + + def set_filter_token(self, filter_token: str = ""): + return self._model.set_filter_token(filter_token) + + def set_split_rule(self, split_rule: str = ""): + return self._model.set_split_rule(split_rule) diff --git a/fast_tokenizer/python/fast_tokenizer/normalizers/__init__.py b/fast_tokenizer/python/fast_tokenizer/normalizers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6e8fa6e45b2d4515aa96b3159b7fd252d375c7e3 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/normalizers/__init__.py @@ -0,0 +1,103 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC + +from .. 
import C + + +class NormalizedString: + def __init__(self, raw_str: str): + self._normalized = C.normalizers.NormalizedString(raw_str) + + def __str__(self): + return str(self._normalized) + + +class Normalizer(ABC): + def normalize_str(self, sequence: str): + return self._normalizer.normalize_str(sequence) + + def __call__(self, normalized: NormalizedString): + return self._normalizer(normalized._normalized) + + def __getstate__(self): + return self._normalizer.__getstate__() + + +class BertNormalizer(Normalizer): + def __init__( + self, + clean_text: bool = True, + handle_chinese_chars: bool = True, + strip_accents: bool = True, + lowercase: bool = True, + ): + self._normalizer = C.normalizers.BertNormalizer(clean_text, handle_chinese_chars, strip_accents, lowercase) + + +class ReplaceNormalizer(Normalizer): + def __init__(self, pattern: str, content: str): + self._normalizer = C.normalizers.ReplaceNormalizer(pattern, content) + + +class StripNormalizer(Normalizer): + def __init__(self, left: bool = True, right: bool = True): + self._normalizer = C.normalizers.StripNormalizer(left, right) + + +class StripAccentsNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.StripAccentsNormalizer() + + +class NFCNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFCNormalizer() + + +class NFDNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFDNormalizer() + + +class NFKCNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFKCNormalizer() + + +class NFKDNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFKDNormalizer() + + +class NmtNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NmtNormalizer() + + +class LowercaseNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.LowercaseNormalizer() + + +class SequenceNormalizer(Normalizer): + def __init__(self, normalizer_list=[]): + normalizer_list = [normalizer._normalizer for normalizer in normalizer_list] + self._normalizer = C.normalizers.SequenceNormalizer(normalizer_list) + + +class PrecompiledNormalizer(Normalizer): + def __init__(self, precompiled_charsmap: str): + self._normalizer = C.normalizers.PrecompiledNormalizer(precompiled_charsmap) diff --git a/fast_tokenizer/python/fast_tokenizer/postprocessors/__init__.py b/fast_tokenizer/python/fast_tokenizer/postprocessors/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..496c6413b76b0fe7e68d1473e10f0bd135554adb --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/postprocessors/__init__.py @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC +from typing import List, Tuple, Union + +from .. 
import C, Encoding + + +class PostProcessor(ABC): + def num_special_tokens_to_add(self, is_pair: bool = True): + return self._postprocessor.num_special_tokens_to_add(is_pair) + + def __call__(self, encoding: Encoding, pair_encoding: Encoding, add_special_tokens: bool): + return self._postprocessor(encoding, pair_encoding, add_special_tokens) + + +class BertPostProcessor(PostProcessor): + def __init__(self, sep: Tuple[str, int] = ("[SEP]", 102), cls: Tuple[str, int] = ("[CLS]", 101)): + self._postprocessor = C.postprocessors.BertPostProcessor(sep, cls) + + +class RobertaPostProcessor(PostProcessor): + def __init__( + self, + sep: Tuple[str, int] = ("", 2), + cls: Tuple[str, int] = ("", 0), + trim_offsets: bool = True, + add_prefix_space: bool = True, + ): + self._postprocessor = C.postprocessors.RobertaPostProcessor(sep, cls, trim_offsets, add_prefix_space) + + +class ByteLevelPostProcessor(PostProcessor): + def __init__(self, add_prefix_space: bool = True, trim_offsets: bool = True, use_regex: bool = True): + self._postprocessor = C.postprocessors.ByteLevelPostProcessor(add_prefix_space, trim_offsets, use_regex) + + +class TemplatePostProcessor(PostProcessor): + def __init__( + self, single: Union[str, List[str]], pair: Union[str, List[str]], special_tokens: List[Tuple[str, int]] + ): + self._postprocessor = C.postprocessors.TemplatePostProcessor(single, pair, special_tokens) diff --git a/fast_tokenizer/python/fast_tokenizer/pretokenizers/__init__.py b/fast_tokenizer/python/fast_tokenizer/pretokenizers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6458f579ea1eac1a4f30ea7d3cc3209f43476276 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/pretokenizers/__init__.py @@ -0,0 +1,113 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC +from typing import Dict, List, Tuple, Union + +from .. 
import C, OffsetType, Token +from ..normalizers import NormalizedString + + +class StringSplit: + def __init__(self, nomalized_text: NormalizedString, tokens: List[Token]): + tokens = [token._token for token in tokens] + self._string_split = C.pretokenizers.StringSplit(nomalized_text._normalized, tokens) + + @property + def normalized(self): + return NormalizedString(self._string_split.normalized) + + @normalized.setter + def normalized(self, normalized: NormalizedString): + self._string_split.normalized = normalized._normalized + + @property + def tokens(self): + return self._string_split.tokens + + @tokens.setter + def tokens(self, tokens: List[Token]): + self._string_split.tokens = [token._token for token in tokens] + + +class PreTokenizedString: + def __init__(self, text: str): + self._pretokenized = C.pretokenizers.PreTokenizedString(text) + + def get_string_split(self, idx: int): + return self._pretokenized.get_string_split(idx) + + def get_string_splits_size(self): + return self._pretokenized.get_string_splits_size() + + def get_original_text(self): + return self._pretokenized.get_original_text() + + def get_splits(self, offset_referential: str = "original", offset_type: str = "char"): + """ + param offset_referential: "original" or "normalized" + param offset_type: "char" or "byte" + """ + return self._pretokenized.get_splits(offset_referential, offset_type) + + def to_encoding(self, word_idx: List[int], type_id: int, offset_type): + return self._pretokenized.to_encoding(word_idx, type_id, offset_type) + + +class PreTokenizer(ABC): + def __call__(self, pretokenized: PreTokenizedString): + return self._pretokenizer(pretokenized._pretokenized) + + def pretokenize_str(self, pretokenized_str: str): + pretokenized = PreTokenizedString(pretokenized_str) + self(pretokenized) + splits = pretokenized.get_splits() + result = [(s, offset) for s, offset, tokens in splits] + return result + + +class WhitespacePreTokenizer(PreTokenizer): + def __init__(self): + self._pretokenizer = C.pretokenizers.WhitespacePreTokenizer() + + +class WhitespaceAndPunctuationPreTokenizer(PreTokenizer): + def __init__(self): + self._pretokenizer = C.pretokenizers.WhitespaceAndPunctuationPreTokenizer() + + +class BertPreTokenizer(PreTokenizer): + def __init__(self): + self._pretokenizer = C.pretokenizers.BertPreTokenizer() + + +class MetaSpacePreTokenizer(PreTokenizer): + def __init__(self, replacement: str = "_", add_prefix_space: bool = True): + self._pretokenizer = C.pretokenizers.MetaSpacePreTokenizer(replacement, add_prefix_space) + + +class SequencePreTokenizer(PreTokenizer): + def __init__(self, pretokenizers: List): + pretokenizers = [pretokenizer._pretokenizer for pretokenizer in pretokenizers] + self._pretokenizer = C.pretokenizers.SequencePreTokenizer(pretokenizers) + + +class ByteLevelPreTokenizer(PreTokenizer): + def __init__(self, add_prefix_space: bool = True, use_regex: bool = True): + self._pretokenizer = C.pretokenizers.ByteLevelPreTokenizer(add_prefix_space, use_regex) + + +class SplitPreTokenizer(PreTokenizer): + def __init__(self, pattern: str, split_mode: int, invert: bool = True): + self._pretokenizer = C.pretokenizers.SplitPreTokenizer(pattern, C.SplitMode(split_mode), invert) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/__init__.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cfb9bfb4859b7d86df0f2aa600537cd901597d4a --- /dev/null +++ 
b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .base_tokenizer import BaseFastTokenizer +from .ernie import ErnieFastTokenizer +from .sentencepiece_bpe import SentencePieceBPEFastTokenizer +from .clip import ClipFastTokenizer diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/base_tokenizer.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/base_tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..c6c5631d53b200ddb3dfbe0d10e2d28728e08ebe --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/base_tokenizer.py @@ -0,0 +1,169 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from fast_tokenizer import Tokenizer + +__all__ = ["BaseFastTokenizer"] + + +class BaseFastTokenizer: + def __init__(self, tokenizer_impl, parma_dict=None): + self._tokenizer = tokenizer_impl + self._parma_dict = parma_dict if parma_dict is not None else {} + + def __repr__(self): + return "Tokenizer(vocabulary_size={}, {})".format( + self._tokenizer.get_vocab_size(), + ", ".join(k + "=" + str(v) for k, v in self._parma_dict.items()), + ) + + def num_special_tokens_to_add(self, is_pair): + return self._tokenizer.num_special_tokens_to_add(is_pair) + + def get_vocab(self, with_added_tokens=True): + return self._tokenizer.get_vocab(with_added_tokens=with_added_tokens) + + def get_vocab_size(self, with_added_tokens=True): + return self._tokenizer.get_vocab_size(with_added_tokens=with_added_tokens) + + def enable_padding( + self, + direction="right", + pad_id=0, + pad_type_id=0, + pad_token="[PAD]", + pad_to_multiple_of=None, + length=None, + ): + return self._tokenizer.enable_padding( + direction=direction, + pad_to_multiple_of=pad_to_multiple_of, + pad_id=pad_id, + pad_type_id=pad_type_id, + pad_token=pad_token, + length=length, + ) + + def disable_padding(self): + self._tokenizer.disable_padding() + + @property + def padding(self): + return self._tokenizer.padding + + def enable_truncation(self, max_length, stride=0, strategy="longest_first", direction="right"): + self._tokenizer.enable_truncation(max_length, stride, strategy, direction) + + def disable_truncation(self): + self._tokenizer.disable_truncation() + + def truncation(self): + return self._tokenizer.truncation + + def add_tokens(self, tokens): + return self._tokenizer.add_tokens(tokens) + + def add_special_tokens(self, special_tokens): + return self._tokenizer.add_special_tokens(special_tokens) + + def encode( + self, + sequence, + pair=None, + is_pretokenized=False, + add_special_tokens=True, + ): + if sequence is None: + raise ValueError("encode: `sequence` can't be `None`") + return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens) + + def encode_batch(self, inputs, add_special_tokens=True, is_pretokenized=False): + if inputs is None: + raise ValueError("encode_batch: `inputs` can't be `None`") + return self._tokenizer.encode_batch(inputs, add_special_tokens, is_pretokenized) + + def decode(self, ids, skip_special_tokens=True) -> str: + if ids is None: + raise ValueError("None input is not valid. Should be a list of integers.") + + return self._tokenizer.decode(ids, skip_special_tokens=skip_special_tokens) + + def decode_batch(self, sequences, skip_special_tokens=True) -> str: + if sequences is None: + raise ValueError("None input is not valid. 
Should be list of list of integers.") + + return self._tokenizer.decode_batch(sequences, skip_special_tokens=skip_special_tokens) + + def token_to_id(self, token): + return self._tokenizer.token_to_id(token) + + def id_to_token(self, id): + return self._tokenizer.id_to_token(id) + + def post_process(self, encoding, pair=None, add_special_tokens=True): + return self._tokenizer.post_process(encoding, pair, add_special_tokens) + + @property + def model(self): + return self._tokenizer.model + + @model.setter + def model(self, model): + self._tokenizer.model = model + + @property + def normalizer(self): + return self._tokenizer.normalizer + + @normalizer.setter + def normalizer(self, normalizer): + self._tokenizer.normalizer = normalizer + + @property + def pretokenizer(self): + return self._tokenizer.pretokenizer + + @pretokenizer.setter + def pretokenizer(self, pretokenizer): + self._tokenizer.pretokenizer = pretokenizer + + @property + def postprocessor(self): + return self._tokenizer.postprocessor + + @postprocessor.setter + def postprocessor(self, postprocessor): + self._tokenizer.postprocessor = postprocessor + + @property + def decoder(self): + return self._tokenizer.decoder + + @decoder.setter + def decoder(self, decoder): + self._tokenizer.decoder = decoder + + def save(self, path, pretty=True): + self._tokenizer.save(path, pretty) + + def to_str(self, pretty=True): + return self._tokenizer.to_str(pretty) + + @staticmethod + def from_str(json): + return Tokenizer.from_str(json) + + @staticmethod + def from_file(path): + return Tokenizer.from_file(path) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/clip.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/clip.py new file mode 100644 index 0000000000000000000000000000000000000000..0add6051bc44a89c8c69ff5428c9c54da80aafeb --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/clip.py @@ -0,0 +1,101 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from .base_tokenizer import BaseFastTokenizer + +from fast_tokenizer.normalizers import NFCNormalizer, ReplaceNormalizer, LowercaseNormalizer, SequenceNormalizer +from fast_tokenizer.pretokenizers import SplitPreTokenizer, ByteLevelPreTokenizer, SequencePreTokenizer +from fast_tokenizer.models import BPE +from fast_tokenizer.postprocessors import RobertaPostProcessor +from fast_tokenizer import Tokenizer, SplitMode + +__all__ = ["ClipFastTokenizer"] + + +class ClipFastTokenizer(BaseFastTokenizer): + def __init__( + self, + vocab=None, + merges=None, + max_length=None, + unk_token="<|endoftext|>", + pad_token="<|endoftext|>", + bos_token="<|startoftext|>", + eos_token="<|endoftext|>", + add_prefix_space=False, + continuing_subword_prefix="", + end_of_word_suffix="", + trim_offsets=False, + ): + # Init Tokenizer instance using tokenization model + tokenizer = Tokenizer( + BPE( + vocab, + merges, + unk_token=unk_token, + continuing_subword_prefix=continuing_subword_prefix, + end_of_word_suffix=end_of_word_suffix, + fuse_unk=False, + ) + ) + + # Add special tokens + bos_token_id = 0 + eos_token_id = 1 + if tokenizer.token_to_id(str(unk_token)) is not None: + tokenizer.add_special_tokens([str(unk_token)]) + if tokenizer.token_to_id(str(pad_token)) is not None: + tokenizer.add_special_tokens([str(pad_token)]) + if tokenizer.token_to_id(str(bos_token)) is not None: + bos_token_id = tokenizer.token_to_id(str(bos_token)) + tokenizer.add_special_tokens([str(bos_token)]) + if tokenizer.token_to_id(str(eos_token)) is not None: + eos_token_id = tokenizer.token_to_id(str(eos_token)) + tokenizer.add_special_tokens([str(eos_token)]) + + # Set the normalizer + tokenizer.normalizer = SequenceNormalizer( + [NFCNormalizer(), ReplaceNormalizer(r"\s+", " "), LowercaseNormalizer()] + ) + + # Set the pretokenizer + tokenizer.pretokenizer = SequencePreTokenizer( + [ + SplitPreTokenizer( + r"""'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", + split_mode=SplitMode.REMOVED, + invert=True, + ), + ByteLevelPreTokenizer(add_prefix_space=False), + ] + ) + + # Set the postprocessor + tokenizer.postprocessor = RobertaPostProcessor( + sep=(eos_token, eos_token_id), cls=(bos_token, bos_token_id), trim_offsets=False, add_prefix_space=False + ) + + parameters = { + "model": "BPE", + "unk_token": unk_token, + "pad_token": pad_token, + "bos_token": bos_token, + "eos_token": eos_token, + "add_prefix_space": add_prefix_space, + "max_length": max_length, + "continuing_subword_prefix": continuing_subword_prefix, + "end_of_word_suffix": end_of_word_suffix, + "trim_offsets": trim_offsets, + } + super().__init__(tokenizer, parameters) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/ernie.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..6c854faa1713098b1726d3ffe56dd9de162099ac --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/ernie.py @@ -0,0 +1,110 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from fast_tokenizer import Tokenizer, decoders +from fast_tokenizer.models import FastWordPiece, WordPiece +from fast_tokenizer.normalizers import BertNormalizer +from fast_tokenizer.postprocessors import BertPostProcessor +from fast_tokenizer.pretokenizers import BertPreTokenizer + +from .base_tokenizer import BaseFastTokenizer + +__all__ = ["ErnieFastTokenizer"] + + +class ErnieFastTokenizer(BaseFastTokenizer): + def __init__( + self, + vocab=None, + unk_token="[UNK]", + sep_token="[SEP]", + cls_token="[CLS]", + pad_token="[PAD]", + mask_token="[MASK]", + clean_text=True, + handle_chinese_chars=True, + strip_accents=True, + lowercase=True, + wordpieces_prefix="##", + max_sequence_len=None, + max_input_chars_per_word=100, + use_fast_wordpiece=False, + use_fast_wordpiece_with_pretokenization=False, + ): + tokenizer_model = WordPiece if not use_fast_wordpiece else FastWordPiece + model_kwargs = { + "unk_token": str(unk_token), + "continuing_subword_prefix": wordpieces_prefix, + "max_input_chars_per_word": max_input_chars_per_word, + } + if use_fast_wordpiece: + model_kwargs["with_pretokenization"] = use_fast_wordpiece_with_pretokenization + else: + model_kwargs["handle_chinese_chars"] = handle_chinese_chars + if vocab is not None: + tokenizer = Tokenizer(tokenizer_model(vocab, **model_kwargs)) + else: + tokenizer = Tokenizer(tokenizer_model(**model_kwargs)) + + if tokenizer.token_to_id(str(unk_token)) is not None: + tokenizer.add_special_tokens([str(unk_token)]) + if tokenizer.token_to_id(str(sep_token)) is not None: + tokenizer.add_special_tokens([str(sep_token)]) + if tokenizer.token_to_id(str(cls_token)) is not None: + tokenizer.add_special_tokens([str(cls_token)]) + if tokenizer.token_to_id(str(pad_token)) is not None: + tokenizer.add_special_tokens([str(pad_token)]) + if tokenizer.token_to_id(str(mask_token)) is not None: + tokenizer.add_special_tokens([str(mask_token)]) + + tokenizer.normalizer = BertNormalizer( + clean_text=clean_text, + handle_chinese_chars=handle_chinese_chars, + strip_accents=strip_accents, + lowercase=lowercase, + ) + if not use_fast_wordpiece or not use_fast_wordpiece_with_pretokenization: + tokenizer.pretokenizer = BertPreTokenizer() + + if vocab is not None: + sep_token_id = tokenizer.token_to_id(str(sep_token)) + if sep_token_id is None: + raise TypeError("sep_token not found in the vocabulary") + cls_token_id = tokenizer.token_to_id(str(cls_token)) + if cls_token_id is None: + raise TypeError("cls_token not found in the vocabulary") + + tokenizer.postprocessor = BertPostProcessor((str(sep_token), sep_token_id), (str(cls_token), cls_token_id)) + + tokenizer.decoder = decoders.WordPiece(prefix=wordpieces_prefix) + if max_sequence_len is None: + tokenizer.disable_truncation() + else: + tokenizer.enable_truncation(max_sequence_len) + + parameters = { + "model": "BertWordPiece", + "unk_token": unk_token, + "sep_token": sep_token, + "cls_token": cls_token, + "pad_token": pad_token, + "mask_token": mask_token, + "clean_text": clean_text, + "handle_chinese_chars": handle_chinese_chars, + "strip_accents": strip_accents, + "lowercase": lowercase, + "wordpieces_prefix": wordpieces_prefix, + } + + super().__init__(tokenizer, parameters) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/sentencepiece_bpe.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/sentencepiece_bpe.py new file mode 100644 index 
0000000000000000000000000000000000000000..3daacc6d28c9c856270bfc1da84ed12f94fb1592 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/sentencepiece_bpe.py @@ -0,0 +1,56 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .base_tokenizer import BaseFastTokenizer +from fast_tokenizer.models import BPE +from fast_tokenizer.normalizers import NFKCNormalizer +from fast_tokenizer import Tokenizer +from fast_tokenizer.pretokenizers import MetaSpacePreTokenizer + +__all__ = ["SentencePieceBPEFastTokenizer"] + + +class SentencePieceBPEFastTokenizer(BaseFastTokenizer): + def __init__( + self, + vocab=None, + merges=None, + unk_token="", + replacement="▁", + add_prefix_space=True, + dropout=None, + fuse_unk=False, + ): + if vocab is not None and merges is not None: + tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk)) + else: + tokenizer = Tokenizer(BPE()) + if tokenizer.token_to_id(str(unk_token)) is not None: + tokenizer.add_special_tokens([str(unk_token)]) + tokenizer.normalizer = NFKCNormalizer() + tokenizer.pretokenizer = MetaSpacePreTokenizer(replacement=replacement, add_prefix_space=add_prefix_space) + parameters = { + "model": "SentencePieceBPE", + "unk_token": unk_token, + "replacement": replacement, + "add_prefix_space": add_prefix_space, + "dropout": dropout, + } + + super().__init__(tokenizer, parameters) + + @staticmethod + def from_file(vocab_filename, merges_filename, **kwargs): + vocab, merges = BPE.read_file(vocab_filename, merges_filename) + return SentencePieceBPEFastTokenizer(vocab, merges, **kwargs) diff --git a/fast_tokenizer/python/tests/test_byte_level_pretokenizer.py b/fast_tokenizer/python/tests/test_byte_level_pretokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..9f42234acf9bf568c87bbfd0d774c2b1cfc5350e --- /dev/null +++ b/fast_tokenizer/python/tests/test_byte_level_pretokenizer.py @@ -0,0 +1,69 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +from fast_tokenizer import pretokenizers + + +class TestByteLevelPreTokenizer(unittest.TestCase): + def setUp(self): + self.pretokenized = pretokenizers.PreTokenizedString("Hello my friend, how is your day going?") + + def check_equals(self, add_prefix_space, use_regex, expected_result): + bytelevel = pretokenizers.ByteLevelPreTokenizer(add_prefix_space=add_prefix_space, use_regex=use_regex) + bytelevel(self.pretokenized) + splits = self.pretokenized.get_splits() + result = [(s, offset) for s, offset, tokens in splits] + self.assertEqual(result, expected_result) + + def test_pretokenize_with_regex(self): + expected_result = [ + ("Hello", (0, 5)), + ("Ġmy", (5, 8)), + ("Ġfriend", (8, 15)), + (",", (15, 16)), + ("Ġhow", (16, 20)), + ("Ġis", (20, 23)), + ("Ġyour", (23, 28)), + ("Ġday", (28, 32)), + ("Ġgoing", (32, 38)), + ("?", (38, 39)), + ] + + self.check_equals(False, True, expected_result) + + def test_pretokenize_without_regex(self): + expected_result = [("HelloĠmyĠfriend,ĠhowĠisĠyourĠdayĠgoing?", (0, 39))] + self.check_equals(False, False, expected_result) + + def test_pretokenize_with_prefix_with_regex(self): + expected_result = [ + ("ĠHello", (0, 5)), + ("Ġmy", (5, 8)), + ("Ġfriend", (8, 15)), + (",", (15, 16)), + ("Ġhow", (16, 20)), + ("Ġis", (20, 23)), + ("Ġyour", (23, 28)), + ("Ġday", (28, 32)), + ("Ġgoing", (32, 38)), + ("?", (38, 39)), + ] + + self.check_equals(True, True, expected_result) + + def test_pretokenize_with_prefix_without_regex(self): + expected_result = [("ĠHelloĠmyĠfriend,ĠhowĠisĠyourĠdayĠgoing?", (0, 39))] + self.check_equals(True, False, expected_result) diff --git a/fast_tokenizer/python/tests/test_clip_tokenizer.py b/fast_tokenizer/python/tests/test_clip_tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..a6124b4f8c444bf7f92509aef083cf757f6f2be0 --- /dev/null +++ b/fast_tokenizer/python/tests/test_clip_tokenizer.py @@ -0,0 +1,91 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import unittest + +from fast_tokenizer import ClipFastTokenizer, models +from paddlenlp.utils.downloader import get_path_from_url + + +class TestClipFastTokenizer(unittest.TestCase): + def setUp(self): + vocab_path = os.path.join(os.getcwd(), "vocab.json") + merges_path = os.path.join(os.getcwd(), "merges.txt") + if not os.path.exists(vocab_path): + get_path_from_url( + "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/vocab.json", os.getcwd() + ) + if not os.path.exists(merges_path): + get_path_from_url( + "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/merges.txt", os.getcwd() + ) + vocab, merges = models.BPE.read_file(vocab_path, merges_path) + self.tokenizer = ClipFastTokenizer(vocab, merges) + self.expected_ids = [ + 49406, + 320, + 1342, + 272, + 272, + 335, + 273, + 273, + 274, + 16368, + 13439, + 2971, + 748, + 531, + 13610, + 323, + 1896, + 8445, + 323, + 539, + 320, + 2368, + 49407, + ] + self.expected_tokens = [ + "<|startoftext|>", + "a", + "'ll", + "1", + "1", + "p", + "2", + "2", + "3", + "rf", + "âĺĨ", + "ho", + "!!", + "to", + "?'", + "d", + "'d", + "''", + "d", + "of", + "a", + "cat", + "<|endoftext|>", + ] + self.input_text = "A\n'll 11p223RF☆ho!!to?'d'd''d of a cat" + + def test_encode(self): + result = self.tokenizer.encode(self.input_text) + self.assertEqual(result.tokens, self.expected_tokens) + self.assertEqual(result.ids, self.expected_ids) diff --git a/fast_tokenizer/python/tests/test_fast_wordpiece.py b/fast_tokenizer/python/tests/test_fast_wordpiece.py new file mode 100644 index 0000000000000000000000000000000000000000..7125aef04ace165664f2e766b2853a3229e4618f --- /dev/null +++ b/fast_tokenizer/python/tests/test_fast_wordpiece.py @@ -0,0 +1,92 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +from fast_tokenizer import ErnieFastTokenizer +from fast_tokenizer.models import WordPiece, FastWordPiece +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +logger.logger.setLevel("ERROR") + + +class TestWordpiece(unittest.TestCase): + def set_flag(self): + self.use_fast_wordpiece = False + self.use_fast_wordpiece_with_pretokenization = False + + def setUp(self): + self.max_seq_length = 128 + self.wordpiece_tokenizer = AutoTokenizer.from_pretrained("ernie-1.0", use_fast=True) + ernie_vocab = self.wordpiece_tokenizer.vocab + self.set_flag() + self.fast_wordpiece_tokenizer = ErnieFastTokenizer( + ernie_vocab, + max_sequence_len=self.max_seq_length, + use_fast_wordpiece=self.use_fast_wordpiece, + use_fast_wordpiece_with_pretokenization=self.use_fast_wordpiece_with_pretokenization, + ) + self.dataset = [example["sentence"] for example in load_dataset("clue", "tnews", splits=["train"])] + + def test_encode(self): + for sentence in self.dataset: + wordpiece_result = self.wordpiece_tokenizer(sentence, max_length=self.max_seq_length) + expected_input_ids = wordpiece_result["input_ids"] + expected_token_type_ids = wordpiece_result["token_type_ids"] + + fast_wordpiece_result = self.fast_wordpiece_tokenizer.encode(sentence) + actual_input_ids = fast_wordpiece_result.ids + actual_token_type_ids = fast_wordpiece_result.type_ids + self.assertEqual(expected_input_ids, actual_input_ids) + self.assertEqual(expected_token_type_ids, actual_token_type_ids) + + def test_get_offset_mapping(self): + for i, sentence in enumerate(self.dataset): + wordpiece_result = self.wordpiece_tokenizer( + sentence, max_length=self.max_seq_length, return_offsets_mapping=True + ) + expected_offset_mapping = wordpiece_result["offset_mapping"] + + fast_wordpiece_result = self.fast_wordpiece_tokenizer.encode(sentence) + actual_offset_mapping = fast_wordpiece_result.offsets + self.assertEqual(expected_offset_mapping, actual_offset_mapping) + + +class TestFastWordpiece(TestWordpiece): + def set_flag(self): + self.use_fast_wordpiece = True + self.use_fast_wordpiece_with_pretokenization = False + + +class TestFastWordpieceWithPretokenization(TestWordpiece): + def set_flag(self): + self.use_fast_wordpiece = True + self.use_fast_wordpiece_with_pretokenization = True + + +class TestFromfile(unittest.TestCase): + def setUp(self): + self.max_seq_length = 128 + t = AutoTokenizer.from_pretrained("ernie-1.0", use_fast=True) + self.vocab_file = t.init_kwargs["vocab_file"] + + def test(self): + WordPiece.from_file(self.vocab_file) + FastWordPiece.from_file(self.vocab_file) + + +if __name__ == "__main__": + unittest.main() diff --git a/fast_tokenizer/python/tests/test_tokenizer_json.py b/fast_tokenizer/python/tests/test_tokenizer_json.py new file mode 100644 index 0000000000000000000000000000000000000000..ace7f3858d02eddc6e05b6b82e07c4a889221e0c --- /dev/null +++ b/fast_tokenizer/python/tests/test_tokenizer_json.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import unittest + +import fast_tokenizer +from fast_tokenizer import ErnieFastTokenizer +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +logger.logger.setLevel("ERROR") + + +class TestTokenizerJson(unittest.TestCase): + def setUp(self): + wordpiece_tokenizer = AutoTokenizer.from_pretrained("ernie-1.0") + ernie_vocab = wordpiece_tokenizer.vocab.token_to_idx + self.fast_tokenizer = ErnieFastTokenizer(ernie_vocab) + + +class TestNormalizerJson(TestTokenizerJson): + def check_normalizer_json(self, normalizer): + self.fast_tokenizer.normalizer = normalizer + json_file = str(normalizer.__class__) + ".json" + self.fast_tokenizer.save(json_file) + tokenizer = ErnieFastTokenizer.from_file(json_file) + os.remove(json_file) + self.assertEqual(normalizer.__getstate__(), tokenizer.normalizer.__getstate__()) + + def test_replace(self): + replace_normalizer = fast_tokenizer.normalizers.ReplaceNormalizer("''", '"') + self.check_normalizer_json(replace_normalizer) + + def test_strip(self): + strip_normalizer = fast_tokenizer.normalizers.StripNormalizer(True, True) + self.check_normalizer_json(strip_normalizer) + + def test_strip_accent(self): + strip_normalizer = fast_tokenizer.normalizers.StripAccentsNormalizer() + self.check_normalizer_json(strip_normalizer) + + def test_nfc(self): + nfc_normalizer = fast_tokenizer.normalizers.NFCNormalizer() + self.check_normalizer_json(nfc_normalizer) + + def test_nfkc(self): + nfkc_normalizer = fast_tokenizer.normalizers.NFKCNormalizer() + self.check_normalizer_json(nfkc_normalizer) + + def test_nfd(self): + nfd_normalizer = fast_tokenizer.normalizers.NFDNormalizer() + self.check_normalizer_json(nfd_normalizer) + + def test_nfkd(self): + nfkd_normalizer = fast_tokenizer.normalizers.NFKDNormalizer() + self.check_normalizer_json(nfkd_normalizer) + + def test_nmt(self): + nmt_normalizer = fast_tokenizer.normalizers.NmtNormalizer() + self.check_normalizer_json(nmt_normalizer) + + def test_lowercase(self): + lowercase_normalizer = fast_tokenizer.normalizers.LowercaseNormalizer() + self.check_normalizer_json(lowercase_normalizer) + + def test_sequence(self): + lowercase_normalizer = fast_tokenizer.normalizers.LowercaseNormalizer() + sequence_normalizer = fast_tokenizer.normalizers.SequenceNormalizer(normalizer_list=[lowercase_normalizer]) + self.check_normalizer_json(sequence_normalizer) + + +if __name__ == "__main__": + unittest.main() diff --git a/fast_tokenizer/run_build_android_armv7_lib.sh b/fast_tokenizer/run_build_android_armv7_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..45303830c0c5335fe0dd73a3a621668f25a0383b --- /dev/null +++ b/fast_tokenizer/run_build_android_armv7_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_armeabi_v7a +cd build_android_armeabi_v7a +cmake .. 
-DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="armeabi-v7a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang +make -j8 diff --git a/fast_tokenizer/run_build_android_armv7_lite_lib.sh b/fast_tokenizer/run_build_android_armv7_lite_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..4d09e46108e13c1f4d04152e2e6483572bb997e3 --- /dev/null +++ b/fast_tokenizer/run_build_android_armv7_lite_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_armeabi_v7a_lite +cd build_android_armeabi_v7a_lite +cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="armeabi-v7a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang -DWITH_ICU_LITE=ON +make -j8 diff --git a/fast_tokenizer/run_build_android_armv8_lib.sh b/fast_tokenizer/run_build_android_armv8_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..f3905ad31a861195c98f2a37c259caab8f6a8d1c --- /dev/null +++ b/fast_tokenizer/run_build_android_armv8_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_arm64_v8a +cd build_android_arm64_v8a +cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang +make -j8 diff --git a/fast_tokenizer/run_build_android_armv8_lite_lib.sh b/fast_tokenizer/run_build_android_armv8_lite_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..60390b1102a676fff0b00360595c6ea4452136e8 --- /dev/null +++ b/fast_tokenizer/run_build_android_armv8_lite_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_arm64_v8a_lite +cd build_android_arm64_v8a_lite +cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang -DWITH_ICU_LITE=ON +make -j8 diff --git a/fast_tokenizer/run_build_cpp_lib.bat b/fast_tokenizer/run_build_cpp_lib.bat new file mode 100644 index 0000000000000000000000000000000000000000..faf396a27be563a12d3b7f71ef447bc19b98e8a4 --- /dev/null +++ b/fast_tokenizer/run_build_cpp_lib.bat @@ -0,0 +1,7 @@ +if not exist build_cpp mkdir build_cpp +cd build_cpp +for /d %%G in ("*") do rmdir /s /q "%%G" +del /q * +cmake .. -G "Ninja" -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release +ninja -j20 +cd .. \ No newline at end of file diff --git a/fast_tokenizer/run_build_cpp_lib.sh b/fast_tokenizer/run_build_cpp_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..9e3ccceec6780fc7fe95cb54417e120a5bb6dd55 --- /dev/null +++ b/fast_tokenizer/run_build_cpp_lib.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Can be used in linux and mac +mkdir -p build_cpp +cd build_cpp +rm -rf * +platform="$(uname -s)" +if [[ $platform == Linux* ]]; +then + core_num=`nproc` +else + core_num=`sysctl -n hw.logicalcpu` +fi +echo "Compile with $core_num cores" +cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release +make -j${core_num} + +if [[ $? == 0 ]]; +then + echo "Successfully compile." +else + echo "Fail compiling." +fi +cd .. diff --git a/fast_tokenizer/run_build_py_lib.bat b/fast_tokenizer/run_build_py_lib.bat new file mode 100644 index 0000000000000000000000000000000000000000..72bbd6824488e60ae97a1d06ee49405a81d9d631 --- /dev/null +++ b/fast_tokenizer/run_build_py_lib.bat @@ -0,0 +1,14 @@ +for %%x in (6 7 8 9 10) do ( + if not exist build_py3%%x mkdir build_py3%%x + cd build_py3%%x + for /d %%G in ("*") do rmdir /s /q "%%G" + del /q * + cmake .. -G "Ninja" -DWITH_PYTHON=ON ^ + -DWITH_TESTING=OFF ^ + -DCMAKE_BUILD_TYPE=Release ^ + -DPYTHON_EXECUTABLE=C:\Python3%%x\python.exe ^ + -DPYTHON_INCLUDE_DIR=C:\Python3%%x\include ^ + -DPYTHON_LIBRARY=C:\Python3%%x\libs\python3%%x.lib + ninja -j20 + cd .. +) diff --git a/fast_tokenizer/run_build_py_lib.sh b/fast_tokenizer/run_build_py_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..a473cd51a048b1c7ed5a9fbd382a21389049cb1d --- /dev/null +++ b/fast_tokenizer/run_build_py_lib.sh @@ -0,0 +1,44 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Can be used in linux and mac +# build python lib +mkdir -p build_py36 build_py37 build_py38 build_py39 build_py310 +for py_version in 6 7 8 9 10; +do + cd build_py3${py_version} + rm -rf * + platform="$(uname -s)" + if [[ $platform == Linux* ]]; + then + export LD_LIBRARY_PATH=/opt/_internal/cpython-3.${py_version}.0/lib/:${LD_LIBRARY_PATH} + export PATH=/opt/_internal/cpython-3.${py_version}.0/bin/:${PATH} + core_num=`nproc` + else + export LD_LIBRARY_PATH=/Users/paddle/miniconda2/envs/py3${py_version}/lib/:${LD_LIBRARY_PATH} + export PATH=/Users/paddle/miniconda2/envs/py3${py_version}/bin/:${PATH} + core_num=`sysctl -n hw.logicalcpu` + fi + echo "Compile with $core_num cores" + cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release + make -j${core_num} + if [[ $? == 0 ]]; + then + echo "Successfully compile." + else + echo "Fail compiling." + fi + cd .. +done + diff --git a/fast_tokenizer/run_fast_tokenizer_cpp_test.sh b/fast_tokenizer/run_fast_tokenizer_cpp_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..02eeda082ed53f2b05a59859d822957fa2a071bf --- /dev/null +++ b/fast_tokenizer/run_fast_tokenizer_cpp_test.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +cd ${PWD}/$1 + +for testcase in `ls ${TEST_DIR}`; do + if [[ "${testcase}" == "test"* ]]; then + ${PWD}/${testcase} + if [ $? -ne 0 ]; then + exit -1 + fi + fi +done diff --git a/fast_tokenizer/setup.py b/fast_tokenizer/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..d172cda979caa41c55058d4d14410e7d345f2389 --- /dev/null +++ b/fast_tokenizer/setup.py @@ -0,0 +1,97 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os + +from setuptools import Distribution, setup +from setuptools.command.install import install + + +class BinaryDistribution(Distribution): + # when build the package, it will add + # platform name such as "cp37-cp37m-linux_x86_64" + def has_ext_modules(self): + return True + + +class InstallPlatlib(install): + def finalize_options(self): + install.finalize_options(self) + if self.distribution.has_ext_modules(): + self.install_lib = self.install_platlib + + +if os.name != "nt": + package_data = {"fast_tokenizer": ["core_tokenizers.so", "commit.log"]} + package_data["fast_tokenizer.libs"] = [] +else: + package_data = {"fast_tokenizer": ["core_tokenizers.pyd", "core_tokenizers.lib", "commit.log"]} + # Add icu dll + package_data["fast_tokenizer.libs"] = ["icuuc70.dll", "icudt70.dll"] + + +def get_version(): + f = open(os.path.join("python", "fast_tokenizer", "__init__.py")) + lines = f.readlines() + version = "" + for line in lines: + if line.startswith("__version__"): + version = line.split("=")[1] + version = version.strip().replace('"', "") + break + return version + + +long_description = "PaddleNLP Fast Tokenizer Library written in C++ " +setup( + name="fast-tokenizer-python", + version=get_version(), + author="PaddlePaddle Speech and Language Team", + author_email="paddlesl@baidu.com", + description=long_description, + long_description=long_description, + zip_safe=False, + url="https://github.com/PaddlePaddle/PaddleNLP/fast_tokenizer", + package_dir={"": "python"}, + packages=[ + "fast_tokenizer", + "fast_tokenizer.tokenizers_impl", + "fast_tokenizer.normalizers", + "fast_tokenizer.pretokenizers", + "fast_tokenizer.models", + "fast_tokenizer.postprocessors", + "fast_tokenizer.libs", + "fast_tokenizer.decoders", + ], + package_data=package_data, + extras_require={"test": ["pytest>=6.0"]}, + python_requires=">=3.6", + cmdclass={"install": InstallPlatlib}, + license="Apache 2.0", + distclass=BinaryDistribution, + classifiers=[ + "Development Status :: 5 - Production/Stable", + "Operating System :: OS Independent", + "Intended Audience :: Developers", + "Intended Audience :: Education", + "Intended Audience :: Science/Research", + "License :: OSI Approved :: Apache Software License", + "Programming Language :: C++", + "Programming Language :: Python :: 3.6", + "Programming Language :: Python :: 3.7", + "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + ], +) diff --git a/fast_tokenizer/tools/codestyle/clang_format.hook b/fast_tokenizer/tools/codestyle/clang_format.hook new file mode 100644 index 0000000000000000000000000000000000000000..1d928216867c0ba3897d71542fea44debf8d72a0 --- /dev/null +++ b/fast_tokenizer/tools/codestyle/clang_format.hook @@ -0,0 +1,15 @@ +#!/bin/bash +set -e + +readonly VERSION="3.8" + +version=$(clang-format -version) + +if ! [[ $version == *"$VERSION"* ]]; then + echo "clang-format version check failed." + echo "a version contains '$VERSION' is needed, but get '$version'" + echo "you can install the right version, and make an soft-link to '\$PATH' env" + exit -1 +fi + +clang-format $@ diff --git a/fast_tokenizer/tools/codestyle/copyright.hook b/fast_tokenizer/tools/codestyle/copyright.hook new file mode 100644 index 0000000000000000000000000000000000000000..ecdc8a23ddeb2770b65294c8126e36cf18ca8eed --- /dev/null +++ b/fast_tokenizer/tools/codestyle/copyright.hook @@ -0,0 +1,134 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import print_function +from __future__ import unicode_literals + +import argparse +import io +import re +import sys +import os +import datetime + +COPYRIGHT = '''Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License.''' + +def _generate_copyright(comment_mark): + copyright=COPYRIGHT.split(os.linesep) + header = copyright[0].rstrip() + + p = re.search('(\d{4})', header).group(0) + now = datetime.datetime.now() + + header = header.replace(p,str(now.year)) + + ans=[comment_mark + " " + header + os.linesep] + for idx, line in enumerate(copyright[1:]): + ans.append(comment_mark + " " + line.rstrip() + os.linesep) + + return ans + +def _get_comment_mark(path): + lang_type=re.compile(r"\.(py|sh)$") + if lang_type.search(path) is not None: + return "#" + + lang_type=re.compile(r"\.(h|c|hpp|cc|cpp|cu|go|cuh|proto)$") + if lang_type.search(path) is not None: + return "//" + + return None + + +RE_ENCODE = re.compile(r"^[ \t\v]*#.*?coding[:=]", re.IGNORECASE) +RE_COPYRIGHT = re.compile(r".*Copyright \(c\) \d{4}", re.IGNORECASE) +RE_SHEBANG = re.compile(r"^[ \t\v]*#[ \t]?\!") + +def _check_copyright(path): + head=[] + try: + with open(path) as f: + head = [next(f) for x in range(4)] + except StopIteration: + pass + + for idx, line in enumerate(head): + if RE_COPYRIGHT.search(line) is not None: + return True + + return False + +def generate_copyright(path, comment_mark): + original_contents = io.open(path, encoding="utf-8").readlines() + head = original_contents[0:4] + + insert_line_no=0 + for i, line in enumerate(head): + if RE_ENCODE.search(line) or RE_SHEBANG.search(line): + insert_line_no=i+1 + + copyright = _generate_copyright(comment_mark) + if insert_line_no == 0: + new_contents = copyright + if len(original_contents) > 0 and len(original_contents[0].strip()) != 0: + new_contents.append(os.linesep) + new_contents.extend(original_contents) + else: + new_contents=original_contents[0:insert_line_no] + new_contents.append(os.linesep) + new_contents.extend(copyright) + if len(original_contents) > insert_line_no and len(original_contents[insert_line_no].strip()) != 0: + new_contents.append(os.linesep) + new_contents.extend(original_contents[insert_line_no:]) + new_contents="".join(new_contents) + + with io.open(path, 'w') as output_file: + output_file.write(new_contents) + + + +def main(argv=None): + parser = argparse.ArgumentParser( + description='Checker for copyright 
declaration.') + parser.add_argument('filenames', nargs='*', help='Filenames to check') + args = parser.parse_args(argv) + + retv = 0 + for path in args.filenames: + comment_mark = _get_comment_mark(path) + if comment_mark is None: + print("warning:Unsupported file", path, file=sys.stderr) + continue + + if _check_copyright(path): + continue + + generate_copyright(path, comment_mark) + + +if __name__ == '__main__': + exit(main()) diff --git a/fast_tokenizer/tools/codestyle/cpplint_pre_commit.hook b/fast_tokenizer/tools/codestyle/cpplint_pre_commit.hook new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/tools/codestyle/pylint_pre_commit.hook b/fast_tokenizer/tools/codestyle/pylint_pre_commit.hook new file mode 100644 index 0000000000000000000000000000000000000000..bf2e982dd88a3bf5585b32e4d60badfe1acd9337 --- /dev/null +++ b/fast_tokenizer/tools/codestyle/pylint_pre_commit.hook @@ -0,0 +1,18 @@ +#!/bin/bash + +TOTAL_ERRORS=0 + + +DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" +export PYTHONPATH=$DIR:$PYTHONPATH + +# The trick to remove deleted files: https://stackoverflow.com/a/2413151 +for file in $(git diff --name-status | awk '$1 != "D" {print $2}'); do + pylint --disable=all --load-plugins=docstring_checker \ + --enable=doc-string-one-line,doc-string-end-with,doc-string-with-all-args,doc-string-triple-quotes,doc-string-missing,doc-string-indent-error,doc-string-with-returns,doc-string-with-raises $file; + TOTAL_ERRORS=$(expr $TOTAL_ERRORS + $?); +done + +exit $TOTAL_ERRORS +#For now, just warning: +#exit 0 diff --git "a/llama\346\250\241\345\236\213\347\273\223\346\236\204.png" "b/llama\346\250\241\345\236\213\347\273\223\346\236\204.png" new file mode 100644 index 0000000000000000000000000000000000000000..8e85936ebbb813ae2ddd5b8b8509006ea760e247 Binary files /dev/null and "b/llama\346\250\241\345\236\213\347\273\223\346\236\204.png" differ diff --git "a/llama\347\256\227\346\263\225\345\216\237\347\220\206.png" "b/llama\347\256\227\346\263\225\345\216\237\347\220\206.png" new file mode 100644 index 0000000000000000000000000000000000000000..276b6ec5d331026a9aa8dcdbe8e0a30a2feebc7a Binary files /dev/null and "b/llama\347\256\227\346\263\225\345\216\237\347\220\206.png" differ diff --git a/llm/README.md b/llm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..22e665f949f744631fa8b77578fa7a95c1e1158b --- /dev/null +++ b/llm/README.md @@ -0,0 +1,404 @@ +# 飞桨大语言模型 +大模型全流程工具基于PaddlePaddle的4D分布式并行能力旨在提供高性能、灵活易用大模型工具,可以根据自己的需求轻易来定制化百亿和千亿大模型训练,同时支持高性能的压缩推理和服务化,最终使用大模型能力提升业务效果。 + +| Model | Pretrain | SFT | LoRA | PrefixTuning | Generation | Quantization | +| --- | --- | --- | --- | --- | --- | --- | +| [LLaMA v1/v2](./llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| [ChatGLM-6B](./chatglm) | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | +| [ChatGLM2-6B](./chatglm2) | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | +| [Bloom](./bloom) | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | +| [GPT-3](./gpt-3) | ✅ | ✅ | ✅ | WIP | ✅ | WIP | +| [OPT](./opt) | WIP | ✅ | ✅ | WIP| ✅ | WIP | +| [GLM](./glm) |N/A | ✅ | ✅ | WIP| ✅ | WIP | +| [Qwen](./qwen) |N/A | ✅ | ✅ | ✅ | ✅ | WIP | + + +# LLM全流程工具介绍 +我们提供了模型预训练、精调(SFT、LoRA、PrefixTuning)、量化、动态图推理、服务化部署全流程脚本,开发者可以根据自己的需求定制化自己的大语言模型。 + +
+ llm +
+ +
+ + LLM全流程工具流程图(上图:PaddleNLP 2.6进展 下图:最终目标) + +
+ +## 1. 环境准备 + +- PaddlePaddle >= 2.5.1 +- PaddleNLP >= 2.6.0 +- tiktoken (仅 Qwen 需要) + +## 2. 预训练 +[LLaMA v1/v2](./llama)、[GPT-3](./gpt-3) 目录中提供了模型预训练的数据准备和训练细节,后续我们将支持更多的模型预训练。 + +## 3. 精调 +目前精调统一脚本只支持[LLaMA v1/v2](./llama)、[ChatGLM-6B](./chatglm)、[ChatGLM2-6B](./chatglm2)、[Bloom](./bloom)、[OPT](./opt)、[Qwen](./qwen),其他模型精调使用详见对应模型目录。接下来我们将以**Llama 2**为例介绍如何使用统一脚本进行SFT、LoRA、Prefix Tuning。更多LoRA、Prefix Tuning请参见[PEFT文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/peft.md)。 + +### 3.1 精调训练数据格式 + +为了方便用户测试,我们也提供示例数据集[广告生成数据集](https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz),用户也可以仿照数据集的格式制作自己的数据集进行精调。我们支持的数据格式是每行包含一个字典,每个字典包含以下字段: + +- `src` : `str, List(str)`, 模型的输入指令(instruction)、提示(prompt),模型应该执行的任务。 +- `tgt` : `str, List(str)`, 模型的输出。 + +样例数据: +``` +{"src": "类型#裙*颜色#蓝色*风格#清新*图案#蝴蝶结", "tgt": "裙身处采用立体蝴蝶结装饰辅以蓝色条带点缀,令衣身造型饱满富有层次的同时为其注入一丝甜美气息。将女孩清新娇俏的一面衬托而出。"} +... +``` + + + +### 3.2 SFT +SFT(Supervised Fine-Tuning)依托飞桨提出的[4D混合分布式并行](https://ai.baidu.com/forum/topic/show/987996)能力,支持使用Trainer API轻松切换数据并行(DP)、[张量并行(TP, Tensor Parallelism)](https://arxiv.org/abs/1909.08053)、[流水线并行(PP, Pipeline Parallelism)](https://arxiv.org/abs/1811.06965)(目前仅支持Llama)等多种分布式训练策略。 + +4D 混合并行策略如何组合?如图所示,在单机内使用通信量较大,适合使用机器内的卡间通信的张量并行(张量并行又称模型并行,MP)和分组参数切片(Sharding)的2D组合策略;训练千亿规模模型时,叠加流水线并行策略使用多台机器共同分担;同时叠加数据并行来增加并发数量,提升训练速度。 +
+ +
+ +``` +# 张量并行分布式训练(常用) +python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_argument.json + +# 目前ChatGLM2、OPT不支持张量并行,默认使用Sharding策略(Paddle 2.5.1支持Sharding Stage2,Sharding Stage3需要使用Paddle develop版本) +python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./chatglm2/sft_argument.json + +# 张量并行&流水线并行分布式训练(目前仅支持Llama) +python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_pp_argument.json +``` + +### 3.3 LoRA + +Transformer模型中包含许多Linear层需要进行密集的矩阵乘法计算,而这些通常具有全秩(full rank)。[LoRA](https://arxiv.org/abs/2106.09685)提出冻结预训练的权重矩阵, 通过引入两个低 rank 矩阵 $AB$(图中橙色的两个矩阵) 来近似权重的更新过程 $W_0+\Delta W=W_0+B A$ , 其中 $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$,实验表面将输入表达随机投影到较小的子空间模型仍然可以有效地学习下游任务还可以节约大量的计算显存需求。 + + +
+ +
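+为便于理解上式中低秩矩阵 $A$、$B$ 的作用,下面给出一段示意性的 numpy 草图(仅为概念演示,`d`、`k`、`r` 的取值为假设的示例,并非 PaddleNLP LoRA API 的实际实现):
+
+```python
+# LoRA 重参数化的最小 numpy 示意:冻结 W0,只训练低秩矩阵 A、B
+import numpy as np
+
+d, k, r = 1024, 768, 8            # r 远小于 d、k,对应上文的 lora_rank
+rng = np.random.default_rng(0)
+
+W0 = rng.normal(size=(d, k))      # 预训练权重,训练中保持冻结
+A = rng.normal(size=(r, k)) * 0.01
+B = np.zeros((d, r))              # B 初始化为 0,保证初始增量 ΔW = BA = 0
+
+x = rng.normal(size=(k,))         # 单条输入特征
+y = W0 @ x + B @ (A @ x)          # 前向:W0 x + B(Ax),只需更新 A、B
+
+# 推理前可把低秩增量合并回主干权重,输出保持不变
+W_merged = W0 + B @ A
+assert np.allclose(W_merged @ x, y)
+```
+
+可以看到把 $BA$ 合并回 $W_0$ 后前向结果不变,这也是后文 3.7 节 LoRA 参数合并的依据。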
+ + +PaddleNLP LoRA API支持数据并行、张量并行等多种分布式训练策略,可以通过控制`tensor_parallel_degree` 调整并行训练策略。LoRA策略默认应用在所有Linear层,可拓展至**单机LoRA微调千亿模型**。 + + +``` +# 单卡训练 +python finetune_generation.py ./llama/lora_argument.json + +# 张量并行分布式训练(ChatGLM2、OPT不支持张量并行) +# 将lora_argument.json中tensor_parallel_degree修改为2 +python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./llama/lora_argument.json +``` + + +### 3.4 Prefix Tuning + +[Prefix Tuning](https://arxiv.org/abs/2101.00190)受提示学习(Prompt learning)的影响,加入的一部分 prefix embedding 作为连续型提示进行训练。prefix embedding是由专门的 prefix encoder 网络生成的数个张量,会以 past_key_value的方式被插入到语言模型每一层的 hidden_state之前。 + +
+ +
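+下面用一段示意性的 numpy 草图说明前缀 K/V 以 past_key_value 方式拼接后如何参与单头注意力计算(张量形状与数值均为假设的示例,并非 PaddleNLP Prefix Tuning API 的实际实现):
+
+```python
+# past_key_value 式前缀参与注意力计算的概念示意
+import numpy as np
+
+rng = np.random.default_rng(0)
+seq_len, num_prefix_tokens, head_dim = 16, 4, 64
+
+# 输入序列在该层投影得到的 Q/K/V
+q = rng.normal(size=(seq_len, head_dim))
+k = rng.normal(size=(seq_len, head_dim))
+v = rng.normal(size=(seq_len, head_dim))
+
+# prefix encoder 生成的可训练前缀 K/V(主干权重冻结,仅训练这部分张量)
+prefix_k = rng.normal(size=(num_prefix_tokens, head_dim))
+prefix_v = rng.normal(size=(num_prefix_tokens, head_dim))
+
+# 前缀以 past_key_value 形式拼接在每层 K/V 之前
+k_cat = np.concatenate([prefix_k, k], axis=0)
+v_cat = np.concatenate([prefix_v, v], axis=0)
+
+scores = q @ k_cat.T / np.sqrt(head_dim)              # (seq_len, num_prefix_tokens + seq_len)
+scores -= scores.max(axis=-1, keepdims=True)
+weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
+out = weights @ v_cat                                  # 每个输入位置都能关注到前缀,前缀本身不产生输出位置
+```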
+ +PaddleNLP Prefix Tuning API支持数据并行、张量并行等多种分布式训练策略,可以通过控制`tensor_parallel_degree` 调整并行训练策略。 +``` +# 单卡训练 +python finetune_generation.py ./llama/pt_argument.json + +# 张量并行分布式训练(ChatGLM2、OPT不支持张量并行) +# 将pt_argument.json中tensor_parallel_degree修改为2 +python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./llama/pt_argument.json +``` +### 3.5 精调参数介绍 +
  模型参数(ModelArgument)
+ +- `model_name_or_path`: 预训练模型名称或者本地的模型路径,用于热启模型和分词器,默认为None。每个模型**支持模型权重**详见各模型目录。 +- `lora`: 是否开启LoRA微调策略,默认为False。 +- `lora_path`: LoRA参数和配置路径,对LoRA参数进行初始化,默认为None。 +- `lora_rank`: LoRA算法中rank(秩)的值,默认为8。 +- `prefix_tuning`: 是否使用Prefix Tuning策略,默认为False。 +- `num_prefix_tokens`: Prefix Tuning策略中Prefix Token数量,默认为128。 + +
+ +
  数据参数(DataArgument)
+ +- `dataset_name_or_path`: 本地数据集目录或内置数据集名称,默认为None。脚本已适配单文件和多文件,会自己寻找`dataset_name_or_path/train.json` 或者 `dataset_name_or_path/train/*.json`作为训练集文件, 以及`dataset_name_or_path/dev.json` 或者 `dataset_name_or_path/dev/*.json`作为验证集文件。 +- `task_name`: 用于选择内置数据集中的具体任务,默认为None。 +- `eval_with_do_generation`: 在模型效果评估的时候是否调用model.generate,默认为False。设置为True时,指标为ppl, accuracy;设置为False时,指标为BLEU4/Rouge,建议将`metric_for_best_model`设为bleu4。 +- `save_generation_output`: 当`eval_with_do_generation`设为True,是否将生成结果保存在`generated_output.json`文件中,默认为False。 +- `intokens`:是否使用InToken数据流(减少Padding冗余计算,大幅提升有效Token计算效率),默认为False。当`eval_with_do_generation`设为True,评估过程不支持InToken数据流。。 +- `src_length`: 模型输入上下文最大token长度,默认为1024。 +- `max_length`:模型输入(上下文+生成内容)的最大token长度, 默认为2048。当`intokens`设为True的时候,同时也为InToken数据流模型训练输入最大长度,通常建议设为模型允许输入最大长度,同时`per_device_train_batch_size`设为1,使用`gradient_accumulation_steps`控制batch size。 +- `lazy`:设置为False则使用`MapDataset`,设置为True则使用`IterDataset`,默认为False。对于数据量较大的时候建议设为True,`IterDataset`可以避免一次性将所有数据读入内存,注意需要设置`max_steps`并且`evaluation_strategy`和`save_strategy`设为`steps` + +
+ + +
  生成参数(GenerateArgument)
+ +注:以下参数仅在`eval_with_do_generation`为True,调用model.generate()时生效。 + +- `top_k`: “采样”策略中为 top-k 过滤保留的最高概率标记的数量。默认为1,等价于贪心策略。 +- `top_p`:“采样”策略中 top-p 过滤的累积概率。默认为1.0,表示不起作用。 +
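+为帮助理解上述两个采样参数的含义,下面给出一段概念性的 numpy 草图(仅为示意,并非 model.generate() 的实际实现):
+
+```python
+# top_k / top_p 过滤的概念示意:对候选 token 的概率分布做截断后重新归一化
+import numpy as np
+
+def filter_probs(probs, top_k=1, top_p=1.0):
+    probs = probs.copy()
+    if top_k > 0:                                  # top-k:只保留概率最高的 k 个候选
+        kth = np.sort(probs)[-top_k]
+        probs[probs < kth] = 0.0
+    if top_p < 1.0:                                # top-p:按概率从大到小累加,保留累积概率首次超过 p 的最小集合
+        order = np.argsort(probs)[::-1]
+        cumulative = np.cumsum(probs[order])
+        probs[order[cumulative > top_p][1:]] = 0.0
+    return probs / probs.sum()
+
+probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
+print(filter_probs(probs, top_k=3, top_p=0.8))     # top_k=1 时只保留概率最高的候选,即贪心策略
+```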
+ +
  训练参数(TrainingArguments)
+ +以下仅介绍TrainingArguments部分常用参数,详情请参见[TrainingArguments文档](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + +- `output_dir`: 用于保存相关的文件目录,主要包括模型相关文件、训练过程中的checkpoint、分词器相关文件、评估的结果文件,默认为None。 +- `per_device_train_batch_size`: 训练集训练过程批处理大小,对应 micro batch size,默认为8。该参数需要根据具体的数据集来设定,该参数越大,占用显存越高,训练代价越大;反之,占用显存越小,训练速度越快。 +- `gradient_accumulation_steps`:梯度累积步数,顾名思义,就是将多次计算得到的梯度值进行累加,然后一次性进行参数更新,默认为1。等效于将原有训练batch size*gradient_accumulation_steps。 +- `per_device_eval_batch_size`: 验证集批处理大小,对应 micro batch size,默认为8。该参数越大,占用显存越高;该参数越小,占用显存越低。 +- `eval_accumulation_steps`:在将结果移动到CPU之前,累积输出张量的预测步骤数。如果如果未设置,则在移动到CPU之前,整个预测都会在GPU上累积(速度更快需要更多的显存),默认为None。 +- `num_train_epochs`:模型训练的轮次,默认为3。 +- `learning_rate`:优化器的初始学习率,默认为 5e-05。 +- `warmup_steps`: warmup的步数,默认为0。当warmup_steps>0时,会覆盖warmup_ratio的设置。 +- `logging_steps`: 日志打印的频率,仅当logging_strategy=="step"生效,默认为 500。如果希望看到较快的日志反馈或者即时的训练的速度,可以减小logging_steps。 +- `evaluation_strategy`: 评估策略,默认为no。"no":训练期间不进行评估;"steps":在每eval_steps结束进行;"epoch":在每个 epoch 结束时进行。 +- `save_strategy`: 保存策略,默认为no。"no":训练期间不进行评估;"steps":在每eval_steps结束进行;"epoch":在每个 epoch 结束时进行。 +- `fp16`: 是否需要开启FP16训练,开启FP16训练可以加速训练,默认为False。 +- `bf16`: 是否需要开启BF16训练,开启BF16训练可以加速训练,默认为False。 +- `fp16_opt_level`: 可设置O1或者O2,在 O1 级别下,在白名单中的算子将使用 float16/bfloat16 计算,在黑名单中的算子将使用 float32 计算。在 O2 级别下,模型的参数被转换为 float16/bfloat16, 如果算子的浮点型输入全是 float16/bfloat16,算子才会采用 float16/bfloat16 计算,若任意浮点型输入是 float32 类型,算子将采用 float32 计算。默认为O1。 +- `do_train`: 是否打开训练,默认为False。 +- `do_eval`: 是否打开评估,默认为False。 +- `disable_tqdm`: 是否关掉tqdm的进度条,默认为False。如果需要预估整体的训练时长,可以打开该配置,实时观察训练进度。 +- `load_best_model_at_end`: 训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +- `metric_for_best_model`: 最优模型指标,如"accuarcy"等,用于比较模型好坏,默认为None。 +- `recompute`: 重计算,暂支持full策略。开启后可降低显存以达到增大batch size的目的,默认为False。 +- `save_total_limit`: 保留checkpoint的个数,老的checkpoint会被删除,默认为None。 +- `tensor_parallel_degree`: 此参数tensor_parallel_degree表示将一层transformer结构的份数,该方法对通信开销较大, 建议 tensor_parallel_degree<=8, 尽量使用机器内部通信。默认为-1,表示不启用张量并行。 +- `pipeline_parallel_degree`: 表示划分流水线的大小.(假设该参数为4, 模型12层, 则每一个pp stage 包含3层模型) 默认值-1, 表示不启用流水线并行。 + +
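+结合上述各组参数,下面给出一个用 Python 生成示意性参数文件的草图:字段名均取自上文的参数说明,具体取值只是假设的示例,并非官方推荐配置,请按实际任务调整后再传给 `finetune_generation.py`。
+
+```python
+# 生成一个示意性的精调参数文件(字段来自上文 ModelArgument / DataArgument / TrainingArguments 说明,
+# 取值仅为示例)
+import json
+
+sft_args = {
+    "model_name_or_path": "meta-llama/Llama-2-7b-chat",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/llama_sft_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 8,
+    "num_train_epochs": 3,
+    "learning_rate": 5e-5,
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": True,
+    "fp16_opt_level": "O1",
+    "do_train": True,
+    "do_eval": True,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "tensor_parallel_degree": 4,
+}
+
+with open("my_sft_argument.json", "w") as f:
+    json.dump(sft_args, f, indent=4)
+# 随后即可参照上文命令,将该 json 文件路径作为 finetune_generation.py 的参数传入
+```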
+
+
+### 3.6 张量并行参数合并
+在使用张量并行(TP,Tensor Parallelism)训练的过程中,为了节省TP参数合并的时间,中间checkpoint往往会把参数存储为多个TP参数分片;可以使用我们提供的分片参数合并脚本进行合并。
+
+```
+python merge_tp_params.py \
+    --model_name_or_path ./checkpoints/llama_sft_ckpts/checkpoint-100
+```
+
  脚本参数介绍
+- `model_name_or_path`: 必须,本地的TP模型参数路径,默认为None。 +- `device`: 运行环境,默认为gpu。 +
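+
+张量并行参数合并在概念上做的事情,可以用下面的示意理解(仅为直观说明列切分权重的拼接过程,并非 `merge_tp_params.py` 的真实实现,维度与切分数均为假设):
+
+```python
+# 张量并行(列切分)权重分片与合并的直观示意
+import numpy as np
+
+tp_degree = 2
+full_weight = np.arange(16, dtype=np.float32).reshape(4, 4)
+
+shards = np.split(full_weight, tp_degree, axis=-1)   # 训练/保存时:每张卡各持有一个分片
+merged = np.concatenate(shards, axis=-1)             # 合并脚本概念上做的事情:按切分维度拼回完整权重
+
+assert np.allclose(merged, full_weight)
+```
+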
+ +### 3.7 LoRA参数合并 +为了后续的**压缩**和**静态图推理**方便,我们提供LoRA参数合并脚本,可以将LoRA参数合并到主干模型并保存相应的权重。 +``` +python merge_lora_params.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --lora_path ./checkpoints/llama_lora_ckpts +``` +
  脚本参数介绍
+ +- `model_name_or_path`: 必须,预训练模型名称或者本地的模型路径,用于热启模型和分词器,默认为None。 +- `lora_path`: LoRA参数和配置路径,对LoRA参数进行初始化,默认为None。 +- `merge_model_path`: 必须,合并参数后保存路径,默认为None。 +- `device`: 运行环境,默认为gpu。 +
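+
+LoRA 参数合并在数学上的含义可以用下面的示意理解(仅为概念说明,并非 `merge_lora_params.py` 的真实实现;维度与 scaling 取值均为假设):
+
+```python
+# LoRA 合并的数学含义示意:W' = W + scaling * (B @ A)
+import numpy as np
+
+d, r = 8, 2                        # 假设的隐层维度与 LoRA 秩(lora_rank)
+W = np.random.randn(d, d)          # 主干 Linear 权重
+A = np.random.randn(r, d)          # LoRA 低秩矩阵 A
+B = np.zeros((d, r))               # LoRA 低秩矩阵 B(训练初始为 0)
+scaling = 1.0                      # 通常为 lora_alpha / lora_rank
+
+W_merged = W + scaling * (B @ A)   # 合并后推理只需一个普通 Linear,无需额外的 LoRA 分支
+```
+
+合并后的权重即可直接用于后续的压缩与静态图推理。
+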
+ +## 4. 模型推理 + +### 4.1 动态图推理 + +```shell +# 预训练&SFT动态图模型推理 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --dtype "float16" \ + --mode "dynamic" + +# LoRA动态图模型推理 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --lora_path ./checkpoints/llama_lora_ckpts \ + --mode "dynamic" + +# Prefix Tuning动态图模型推理 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --prefix_path ./checkpoints/llama_pt_ckpts \ + --mode "dynamic" +``` + +### 4.2 静态图推理 + +```shell +# 首先需要运行一下命令将动态图导出为静态图 +# LoRA需要先合并参数,详见3.7LoRA参数合并 +# Prefix Tuning暂不支持 +python export_model.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --output_path ./inference \ + --dtype float16 + + +# 静态图模型推理 +python predictor.py \ + --model_name_or_path inference \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --dtype "float16" \ + --mode "static" +``` + +### 4.3 InferenceModel 动态图推理 + +```shell +# InferenceModel 动态图推理 +# LoRA需要先合并参数,详见3.7LoRA参数合并 +# Prefix Tuning暂不支持 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --dtype float16 \ + --max_length 1024 \ + --mode "dynamic" \ + --inference_model +``` + +### 4.4 InferenceModel 静态图推理 + +```shell +# 首先需要运行一下命令将InferenceModel动态图导出为静态图 +# LoRA需要先合并参数,详见3.7LoRA参数合并 +# Prefix Tuning暂不支持 +python export_model.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --output_path ./inference \ + --dtype float16 \ + --inference_model + +# InferenceModel 静态图推理 +python predictor.py \ + --model_name_or_path ./inference \ + --dtype float16 \ + --max_length 1024 \ + --output_file "infer.json" \ + --mode "static" \ + --inference_model +``` + + +### 4.5 参数介绍 + +
  脚本参数介绍
+
+- `model_name_or_path`: 必须,预训练模型名称或者本地的模型路径,用于热启模型和分词器,默认为None。
+- `batch_size`: 批处理大小,默认为8。该参数越大,占用显存越高;该参数越小,占用显存越低。
+- `src_length`: 模型输入上下文的最大token长度,默认为1024。
+- `max_length`:模型输入(上下文+生成内容)的最大token长度,默认为2048。
+- `lora_path`: LoRA参数和配置路径,用于对LoRA参数进行初始化,默认为None。
+- `prefix_path`: Prefix Tuning参数和配置路径,用于对Prefix Tuning参数进行初始化,默认为None。
+- `top_k`: “采样”策略中为 top-k 过滤保留的最高概率标记的数量。默认为1,等价于贪心策略。
+- `top_p`:“采样”策略中 top-p 过滤的累积概率。默认为1.0,表示不起作用。
+- `temperature`:“采样”策略中会将输出logits除以temperature。默认为1.0,表示不起作用。
+- `data_file`:必须,待推理的json文件,默认为None。
+- `output_file`:保存推理结果的文件名,默认为output.json。
+- `device`: 运行环境,默认为gpu。
+- `dtype`: 模型参数dtype,默认为None。如果没有传入`lora_path`或`prefix_path`,则必须传入该参数。
+- `model_type`: 初始化不同类型模型,gpt-3: GPTForCausalLM;ernie-3.5-se: Ernie35ForCausalLM;默认为 None。
+- `mode`: 使用动态图或者静态图推理,可选值为[dynamic, static],默认为 dynamic。
+- `inference_model`: 是否使用InferenceModel推理,默认为 False。
+
+ +## 5. 服务化部署 + +### 5.1 环境准备 + +- python >= 3.8 +- gradio +- flask + +### 5.2 Flask & Gradio UI服务化部署 + +我们提供了一套简单易用的UI服务化部署脚本: + + +``` +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --port 8010 \ + --flask_port 8011 \ + --src_length 1024 \ + --dtype "float16" +``` + +
  脚本参数介绍
+ + +- `port`: Gradio UI 服务端口号,默认8011。 +- `flask_port`: Flask服务端口号,默认8010。 +- 其他参数请参见动态图推理中参数。 + +
+
+## 6. 量化
+
+**注**:量化后模型暂不支持推理,相关开源工作正在进行中,敬请期待。
+
+量化算法可以将模型输入和模型权重用更低比特的数值表示,能够有效减少内存占用和计算开销。下面我们提供PTQ、GPTQ两种量化算法,并结合**PaddleSlim自研策略**进行量化,更多技术细节详见[量化策略详细教程](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/quant/advanced_quantization.md)。
+
+### 6.1 环境安装
+- PaddleSlim develop版本
+- PaddlePaddle develop版本
+
+### 6.2 数据准备
+
+量化中默认使用训练集作为校正(Calibration)数据集,开发集作为评估数据集。如果希望使用其他数据作为校正数据集,则在数据目录下新增`quant.json`文件,文件格式请参照精调训练数据格式(构造示例见本节末尾)。
+
+### 6.3 PTQ量化
+
+```
+python finetune_generation.py ./llama/ptq_argument.json
+```
+
+### 6.4 GPTQ量化
+
+```
+python finetune_generation.py ./llama/gptq_argument.json
+```
+
+### 6.5 量化参数介绍
+
  量化参数(QuantArgument)
+
+- `quant_type`: PTQ、QAT使用的量化类型,默认为A8W8。支持A8W8、WINT4、WINT8:A8W8指对激活(输入)进行INT8量化,同时对模型权重进行INT8量化;WINT4指仅对模型权重进行INT4量化,后续使用WeightOnly进行推理;WINT8指仅对模型权重进行INT8量化,后续使用WeightOnly进行推理。
+- `do_ptq`: 是否进行PTQ量化,默认为False。
+- `ptq_step`: PTQ量化步数,也即模型前向次数,默认为32。
+- `shift`: 是否在PTQ量化前进行[Shift策略](https://arxiv.org/abs/2304.09145),默认为False。使用Shift策略需要设`do_ptq`为True。
+- `shift_all_linears`: 是否对模型中所有Linear层应用Shift,如果为True,将会对非LayerNorm-Linear组合的Linear进行Shift,并且添加两个op,默认为False。
+- `shift_sampler`: Shift策略使用的sampler,默认为ema。可选none、ema:none指直接利用MinMax计算Shift中的零点;ema指使用指数滑动平均计算Shift中的零点。
+- `shift_step`: Shift采样步数,也即模型前向次数,默认为32。
+- `smooth`: 是否在PTQ量化前进行[SmoothQuant策略](https://arxiv.org/abs/2211.10438),默认为False。使用Smooth策略需要设`do_ptq`为True。
+- `smooth_all_linears`: 是否对模型中所有Linear层应用Smooth,如果为True,将会对非LayerNorm-Linear组合的Linear进行Smooth,并且添加两个op,默认为False。
+- `smooth_sampler`: Smooth策略使用的sampler,默认为none,可选none、multi_step。multi_step会保存多轮前向结果进行计算,需要更大的显存。
+- `smooth_step`: Smooth采样步数,也即模型前向次数,默认为32。
+- `smooth_piecewise_search`: Smooth是否进行分段搜索,默认为False。分段搜索根据数值大小将激活分成K段,对每一段进行alpha和scale的搜索。
+- `smooth_k_piece`: 使用分段搜索功能时的分段数量,默认为3。根据经验建议10B模型设置为3,100B模型设置为6。
+- `smooth_search_piece`: 使用分段搜索功能时,是否搜索分段数量,默认为False。设为True时,`smooth_k_piece`建议设为6;搜索分段数量耗时较长,如需加速Smooth过程建议关闭。
+- `do_gptq`: 是否进行GPTQ量化,GPTQ对模型进行WINT4量化,相比于普通PTQ量化精度更高,但量化时间较长。默认为False。
+- `gptq_step`: GPTQ量化步数,也即模型前向次数,默认为8。
+ + +
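+上面 A8W8 / WINT8 中“INT8 量化”的基本含义,可以用下面的对称量化示意理解(仅为概念演示,并非 PaddleSlim 的实际实现;整张量共用一个 scale 的做法是简化假设):
+
+```python
+# INT8 对称量化 / 反量化的最小示意
+import numpy as np
+
+def quantize_int8(x):
+    scale = np.abs(x).max() / 127.0                      # 简化:整张量共用一个 scale
+    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
+    return q, scale
+
+def dequantize_int8(q, scale):
+    return q.astype(np.float32) * scale
+
+w = np.random.randn(4, 4).astype(np.float32)             # 假设的一小块权重
+q, s = quantize_int8(w)
+print(np.abs(w - dequantize_int8(q, s)).max())           # 量化引入的最大误差
+```
+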
  其他参数
+ +- `per_device_train_batch_size`: 量化前向批大小,默认为8。量化过程只有模型前向,相比于普通训练需要显存较少。 + +- 更多参数详见精调参数介绍。 + +
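+
+关于 6.2 中自定义校正数据集 `quant.json` 的构造,下面给出一个从训练集中抽样的示意(路径与抽样条数均为假设,数据格式沿用精调训练数据的 `src`/`tgt` 格式):
+
+```python
+# 从 train.json 中随机抽取若干条样本作为校正集 quant.json(示意)
+import random
+
+with open("data/train.json", "r", encoding="utf-8") as f:
+    lines = f.readlines()
+
+random.seed(0)
+sampled = random.sample(lines, k=min(128, len(lines)))
+
+with open("data/quant.json", "w", encoding="utf-8") as f:
+    f.writelines(sampled)
+```
+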
diff --git a/llm/argument.py b/llm/argument.py new file mode 100644 index 0000000000000000000000000000000000000000..fff1c3ceea1a3fbdff26d846ad4829d11e1d3d43 --- /dev/null +++ b/llm/argument.py @@ -0,0 +1,114 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from dataclasses import dataclass, field + +from paddlenlp.trainer import TrainingArguments + + +@dataclass +class TrainingArguments(TrainingArguments): + benchmark: bool = field(default=False, metadata={"help": "Whether runs benchmark"}) + + +@dataclass +class DataArgument: + dataset_name_or_path: str = field(default=None, metadata={"help": "Name or path for dataset"}) + task_name: str = field(default=None, metadata={"help": "Additional name to select a more specific task."}) + intokens: bool = field(default=False, metadata={"help": "Whether to use InTokens data stream"}) + src_length: int = field(default=1024, metadata={"help": "The maximum length of source(context) tokens."}) + max_length: int = field( + default=2048, + metadata={ + "help": "The maximum length that model input tokens can have. When intokens is set to True, it's also the maximum length for InTokens data stream" + }, + ) + eval_with_do_generation: bool = field(default=False, metadata={"help": "Whether to do generation for evaluation"}) + save_generation_output: bool = field( + default=False, + metadata={"help": "Whether to save generated text to file when eval_with_do_generation set to True."}, + ) + lazy: bool = field( + default=False, + metadata={ + "help": "Weather to return `MapDataset` or an `IterDataset`.True for `IterDataset`. False for `MapDataset`." + }, + ) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default=None, metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + + # LoRA related parameters + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=8, metadata={"help": "Lora attention dimension"}) + + # prefix tuning related parameters + prefix_tuning: bool = field(default=False, metadata={"help": "Whether to use Prefix technique"}) + num_prefix_tokens: int = field(default=128, metadata={"help": "Number of prefix tokens"}) + + +@dataclass +class QuantArgument: + quant_type: str = field( + default="A8W8", metadata={"help": "Quantization type. 
Supported values: A8W8, WINT4,WINT8"} + ) + + # QAT related parameters + # Not Yet support + do_qat: bool = field(default=False, metadata={"help": "Whether to use QAT technique"}) + + # PTQ related parameters + do_ptq: bool = field(default=False, metadata={"help": "Whether to use PTQ"}) + ptq_step: int = field(default=32, metadata={"help": "Step for PTQ"}) + + shift: bool = field(default=False, metadata={"help": "Whether to use Shift"}) + shift_all_linears: bool = field(default=False, metadata={"help": "Whether to shift all linears"}) + shift_sampler: str = field( + default="ema", metadata={"help": "The name of shift sampler, choosen from ['ema', 'none']"} + ) + shift_step: int = field(default=32, metadata={"help": "Sample steps when shift"}) + + smooth: bool = field(default=False, metadata={"help": "Whether to use Smooth"}) + smooth_all_linears: bool = field(default=False, metadata={"help": "Whether to smooth all linears"}) + smooth_sampler: str = field( + default="none", metadata={"help": "The name of smooth sampler, choosen from ['multi_step','none']"} + ) + smooth_step: int = field(default=32, metadata={"help": "Sample steps when smooth"}) + smooth_piecewise_search: bool = field( + default=False, metadata={"help": "The number of piece in piecewise search for smooth strategy."} + ) + smooth_k_piece: int = field(default=3, metadata={"help": "Number of pieces for K-search"}) + smooth_search_piece: bool = field(default=False, metadata={"help": "Whether search k_piece when piecewise search"}) + + # GPTQ related parameters + do_gptq: bool = field(default=False, metadata={"help": "Whether to use GPTQ"}) + gptq_step: int = field(default=8, metadata={"help": "Step for GPTQ"}) + + +@dataclass +class GenerateArgument: + top_k: int = field( + default=1, + metadata={ + "help": "The number of highest probability tokens to keep for top-k-filtering in the sampling strategy" + }, + ) + top_p: float = field( + default=1.0, metadata={"help": "The cumulative probability for top-p-filtering in the sampling strategy."} + ) diff --git a/llm/benchmark.sh b/llm/benchmark.sh new file mode 100644 index 0000000000000000000000000000000000000000..1f2db3dff9dbe44e0631a65aa54a3ef9bd1a45e1 --- /dev/null +++ b/llm/benchmark.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +export PYTHONPATH=$(dirname $(pwd)):$PYTHONPATH + +export FLAGS_control_flow_use_new_executor=1 +export FLAGS_new_executor_serial_run=1 +export FLAGS_allocator_strategy=naive_best_fit +export FLAGS_fraction_of_gpu_memory_to_use=0.92 + +python predictor.py \ + --model_name_or_path ./llama7b-inference_model_fp16 \ + --dtype float16 \ + --src_length 300 \ + --max_length 100 \ + --output_file "infer.json" \ + --mode "static" \ + --batch_size 1 \ + --benchmark \ + --inference_model \ No newline at end of file diff --git a/llm/bloom/README.md b/llm/bloom/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2cdeafa669686d4fd49fd91cdd56b3b814b9c175 --- /dev/null +++ b/llm/bloom/README.md @@ -0,0 +1,25 @@ +# BLOOM + +## 1.模型介绍 + + +BLOOM是一种自回归大型语言模型(LLM),在大量文本数据上训练从而生生成目标文本,同时它能够支持46种语言和13种编程语言的文本交互。BLOOM 主要基于文本生成任务训练而成,可以很好的完成文本续写任务,此外 BloomZ 系列模型加入了 Instruction Tuning。 + +**支持模型权重:** +| Model | +|----------------------------------| +| bigscience/bloom-560m | +| bigscience/bloom-560m-bf16 | +| bigscience/bloom-1b1/ | +| bigscience/bloom-3b | +| bigscience/bloom-7b1 | +| bigscience/bloomz-560m/ | +| bigscience/bloomz-1b1 | +| bigscience/bloomz-3b | +| bigscience/bloomz-7b1-mt | +| bigscience/bloomz-7b1-p3 | +| bigscience/bloomz-7b1 | +| bellegroup/belle-7b-2m | + +## 2. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/bloom/gptq_argument.json b/llm/bloom/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..6a5cb7e882a70dd13112e726ba1623ee3c9a254a --- /dev/null +++ b/llm/bloom/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/bloom_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/bloom/lora_argument.json b/llm/bloom/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..9403c788ba70eb95f522dc1de429bdb84f63ed71 --- /dev/null +++ b/llm/bloom/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "bigscience/bloomz-7b1-mt", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/bloom/pt_argument.json b/llm/bloom/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..4112caffa884834fd50bf51ca9f7b7ccc436779d --- /dev/null +++ b/llm/bloom/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "bigscience/bloomz-7b1-mt", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_pt_ckpts", + 
"per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git a/llm/bloom/ptq_argument.json b/llm/bloom/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..0c96b852c6d3c3195a1be742122cb3ad0046314d --- /dev/null +++ b/llm/bloom/ptq_argument.json @@ -0,0 +1,22 @@ +{ + "model_name_or_path": "./checkpoints/bloom_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_ptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16, + "smooth": true, + "smooth_step": 16, + "smooth_all_linears": true, + "smooth_piecewise_search": true, + "smooth_k_piece": true, + "smooth_search_piece": true + } \ No newline at end of file diff --git a/llm/bloom/sft_argument.json b/llm/bloom/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..3b3b35f93c79f8ab810448c1f8bb3f282d17a995 --- /dev/null +++ b/llm/bloom/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "bigscience/bloomz-7b1-mt", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } \ No newline at end of file diff --git a/llm/chatglm/README.md b/llm/chatglm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..281a7ceea61fefd9510adf005fd7eb71fde481ed --- /dev/null +++ b/llm/chatglm/README.md @@ -0,0 +1,19 @@ +# ChatGLM-6B + +## 1. 模型介绍 + +ChatGLM-6B 是一个开源的、支持中英双语问答的对话语言模型,基于 [General Language Model (GLM)](https://arxiv.org/abs/2103.10360) 架构,具有 62 亿参数。ChatGLM-6B 使用了和 ChatGLM 相同的技术,针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练,辅以监督微调、反馈自助、人类反馈强化学习等技术的加持,62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。 + +**支持模型权重:** + +| Model | +|----------------------------------| +| THUDM/chatglm-6b | +| THUDM/chatglm-6b-v1.1 | + +## 2. 模型协议 + +ChatGLM-6B 模型的权重的使用需要遵循[License](../../paddlenlp/transformers/chatglm/LICENSE)。 + +## 3. 
模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/chatglm/gptq_argument.json b/llm/chatglm/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..8b1c07742ba82d624a957621c5293ac4a507e3e4 --- /dev/null +++ b/llm/chatglm/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/chatglm_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/chatglm/lora_argument.json b/llm/chatglm/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..5f2e017a1c8afa779994cb13ce372fd5c8f19736 --- /dev/null +++ b/llm/chatglm/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/chatglm/pt_argument.json b/llm/chatglm/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..a34a86c0b3579f6ef268ef95bc1cc51ee1f365a3 --- /dev/null +++ b/llm/chatglm/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git a/llm/chatglm/ptq_argument.json b/llm/chatglm/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..63474a9e0a194d2b1100c7e2b91560e396d079a9 --- /dev/null +++ b/llm/chatglm/ptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_ptq_ckpts", + "do_eval": 
true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16 + } \ No newline at end of file diff --git a/llm/chatglm/sft_argument.json b/llm/chatglm/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..5884b2f99882a68a014ee2351b09140d8ff7ac40 --- /dev/null +++ b/llm/chatglm/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "THUDM/chatglm-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } \ No newline at end of file diff --git a/llm/chatglm2/README.md b/llm/chatglm2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bbc50870529830d43d9631997aa1d8e1461fe0ad --- /dev/null +++ b/llm/chatglm2/README.md @@ -0,0 +1,19 @@ +# ChatGLM2-6B + +## 1. 模型介绍 + +ChatGLM2-6B 是开源中英双语对话模型 [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) 的第二代版本,在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上,ChatGLM2-6B 引入了[FlashAttention](https://github.com/HazyResearch/flash-attention)和[Multi-Query Attention](https://arxiv.org/abs/1911.02150v1)等新特性。更详细的模型介绍见[ChatGLM2-6B GitHub](https://github.com/THUDM/ChatGLM2-6B) + +**支持模型权重:** + +| Model | +|----------------------------------| +| THUDM/chatglm2-6b | + +## 2. 模型协议 + + +ChatGLM2-6B 模型的权重的使用需要遵循[License](../../paddlenlp/transformers/chatglm_v2/LICENSE)。 + +## 3. 
模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/chatglm2/gptq_argument.json b/llm/chatglm2/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..9285e8b628ad652cb39b3f621b303f9e4abd0301 --- /dev/null +++ b/llm/chatglm2/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/chatglm2_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/chatglm2/lora_argument.json b/llm/chatglm2/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..3bf0a3bce244d6cecbe6daabd2e467a6ff7c9c42 --- /dev/null +++ b/llm/chatglm2/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm2-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/chatglm2/pt_argument.json b/llm/chatglm2/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..ae664a5201f21d701cf78bf5498d96afa3589806 --- /dev/null +++ b/llm/chatglm2/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm2-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git a/llm/chatglm2/ptq_argument.json b/llm/chatglm2/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..acae45bd480f0880f4617152769c3a7c4b4fed3b --- /dev/null +++ b/llm/chatglm2/ptq_argument.json @@ -0,0 +1,22 @@ +{ + "model_name_or_path": "./checkpoints/chatglm2_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": 
"./checkpoints/chatglm2_ptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16, + "smooth": true, + "smooth_step": 16, + "smooth_all_linears": true, + "smooth_piecewise_search": true, + "smooth_k_piece": true, + "smooth_search_piece": true + } \ No newline at end of file diff --git a/llm/chatglm2/sft_argument.json b/llm/chatglm2/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..e978a9de0f127396880079992b4bb4dcfb076e67 --- /dev/null +++ b/llm/chatglm2/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "THUDM/chatglm2-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "sharding_parallel_degree": 4, + "sharding": "stage3" + } \ No newline at end of file diff --git a/llm/data.py b/llm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3a39643c096be8c27f99d7e5aa0956f63ab39abf --- /dev/null +++ b/llm/data.py @@ -0,0 +1,131 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp.peft import LoRAModel, PrefixModelForCausalLM + + +def get_convert_example(model): + if isinstance(model, LoRAModel) or isinstance(model, PrefixModelForCausalLM): + base_model_prefix = model.model.base_model_prefix + else: + base_model_prefix = model.base_model_prefix + + if base_model_prefix == "chatglm": + return convert_example_chatglm + elif base_model_prefix in ["chatglm_v2", "llama", "bloom", "opt", "qwen"]: + return convert_example_common + else: + raise ValueError( + f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm, bloom, llama." 
+ ) + + +class DataFormatError(ValueError): + pass + + +def tokenize_example(tokenizer, example, data_args): + if "src" in example and "tgt" in example: + source = example["src"][0] if isinstance(example["src"], list) else example["src"] + target = example["tgt"][0] if isinstance(example["tgt"], list) else example["tgt"] + else: + raise DataFormatError( + f"Example format is wrong, please check: {example} or rewrite tokenize_example in data.py " + ) + tokenized_source = tokenizer( + source, + max_length=data_args.src_length, + truncation=True, + truncation_side="left", + add_special_tokens=True, + ) + tgt_max_length = data_args.max_length - len(tokenized_source["input_ids"]) + tokenized_target = tokenizer( + target, + max_length=tgt_max_length, + truncation=True, + truncation_side="right", + add_special_tokens=False, + ) + + tokenized_target_input_ids = tokenized_target["input_ids"] + # Add eos_token_id at the end of sequence if the sentence is not truncated. + # Attention! In some cases(ex. ChatGLMv2), tokenized eos_token is not equal to eos_token_id. + if len(tokenized_target_input_ids) < tgt_max_length: + tokenized_target_input_ids += [tokenizer.eos_token_id] + + return tokenized_source, tokenized_target_input_ids + + +def convert_example_common(example, tokenizer, data_args, is_test=True, intokens=False): + tokenized_source, tokenized_target_input_ids = tokenize_example(tokenizer, example, data_args) + + if is_test: + return { + **tokenized_source, + "labels": tokenized_target_input_ids, + } + else: + input_ids = tokenized_source["input_ids"] + tokenized_target_input_ids + source_length = len(tokenized_source["input_ids"]) + labels = [-100] * source_length + input_ids[source_length:] + # shift input_ids and labels + input_ids, labels = input_ids[:-1], labels[1:] + seq_length = len(input_ids) + features = {"input_ids": input_ids, "labels": labels} + if "position_ids" in tokenized_source: + features["position_ids"] = list(range(seq_length)) + if intokens: + features["attention_mask"] = np.tri(seq_length, seq_length, dtype=bool) + + return features + + +def convert_example_chatglm(example, tokenizer, data_args, is_test=True, intokens=False): + + tokenized_source, tokenized_target_input_ids = tokenize_example(tokenizer, example, data_args) + if is_test: + return { + **tokenized_source, + "labels": tokenized_target_input_ids, + } + else: + input_ids = tokenized_source["input_ids"] + tokenized_target_input_ids + bos_position = len(tokenized_source["input_ids"]) - 1 + labels = [-100] * bos_position + input_ids[bos_position:] + # shift input_ids and labels + input_ids, labels = input_ids[:-1], labels[1:] + features = { + "input_ids": input_ids, + "labels": labels, + } + + if intokens: + seq_length = len(input_ids) + # attention_mask + attention_mask = np.tri(seq_length, seq_length, dtype=bool) + attention_mask[:, :bos_position] = 1 + features["attention_mask"] = attention_mask + # 2d position_ids + position_ids = np.arange(seq_length, dtype=np.int64) + block_position_ids = np.concatenate( + [ + np.zeros(bos_position, dtype=np.int64), + np.arange(1, seq_length - bos_position + 1, dtype=np.int64), + ] + ) + features["position_ids"] = np.stack([position_ids, block_position_ids], axis=0) + + return features diff --git a/llm/ernie-3.5-se/README.md b/llm/ernie-3.5-se/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ad14040deba7ec63d7ac023847b720e3a30f907e --- /dev/null +++ b/llm/ernie-3.5-se/README.md @@ -0,0 +1,203 @@ +# ERNIE-3.5-SE + +## 1. 
模型介绍 + +我们采用了Attention和FFN并行的Parallel Transformer的实现方式,将FFN和Attention层进行并行计算。通过这样的设计,我们可以把Attention和FFN需要的线形层计算进行算子融合,降低kernel调用以及通讯次数,提升并行训练的效率。并且我们发现第一层的FFN和最后一层的Attn作用不大,因此采用了“掐头去尾”策略,将底层的FFN的计算量挪到模型的顶层,在同FLOPs下效果和传统Transformer结构一致,但有更好的训练速度和吞吐。 + + + + + + + + + + +
Parallel Transformer “掐头去尾”策略
+ + +* Rope Embedding+[随机位置编码](https://aclanthology.org/2023.acl-short.161):我们采用的旋转位置编码Rope,并且为了有较好的模型外推能力,我们保留了线形层的Bias。为了提供长文外推能力,我们通过随机间隔取Position Ids,让模型能够有训短推长的能力。 + + + +* Sequence Length Warmup:通过动态调整前期训练的序列长度,提升模型的收敛效率。 + + +## 2. 预训练 + +预训练数据制作参考[此处](../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/data/ernie_openwebtext_100k_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/data/ernie_openwebtext_100k_idx.npz +``` + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv ernie_openwebtext_100k_ids.npy ./data +mv ernie_openwebtext_100k_idx.npz ./data +``` + +使用下面脚本,即可启动 ernie-3.5-se-3b 的预训练,也可直接参考 run_trainer_stage2.sh。 +```shell +task_name="ernie35_hybrid" +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "baidu/ernie-3.5-se-3b" \ + --tokenizer_name_or_path "ernie-tokenizer" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 4096 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_ln 1 \ + --bf16 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --learning_rate 0.0003 \ + --min_learning_rate 0.00003 \ + --lr_scheduler_type "cosine" \ + --max_steps 300000 \ + --save_steps 200 \ + --adam_beta2 0.95 \ + --weight_decay 0.1 \ + --warmup_steps 2000 \ + --max_grad_norm 1.0 \ + --logging_steps 2 \ + --dataloader_num_workers 0 \ + --sharding "stage2" \ + --sharding_parallel_degree 8 \ + --eval_steps 200 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 0\ + --recompute 1 \ + --do_train \ + --do_eval \ + --save_total_limit 10 \ + --device "gpu" +``` +注意: +1. 需要paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包 +2. `use_flash_attention` 需要在A100机器开启,否则loss可能不正常(很快变成0.00x,非常小不正常)。建议使用cuda11.8环境。 +3. `continue_training` 表示从现有的预训练模型加载训练,如果需要从头开始预训练模型,则设置为0。 +4. `use_fused_ln` 需要安装[此目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt-3/external_ops)下的自定义OP, `python setup.py install`。如果安装后仍然找不到算子,需要额外设置PYTHONPATH +5. 当前脚本为sharding版本,需要4D并行训练(数据、sharding、张量、流水线并行)的用户,可另外调整相关参数。 + + + +## 3. 
精调 + +### SFT +```shell +python -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + finetune_generation.py \ + --output_dir "output_sft/$task_name" \ + --per_device_train_batch_size 4 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 8 \ + --model_name_or_path \ + --task_name squad \ + --num_train_epochs 2 \ + --learning_rate 3e-5 \ + --warmup_steps 30 \ + --logging_steps 1 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --src_length 1024 \ + --tgt_length 1024 \ + --bf16 \ + --fp16_opt_level O2 \ + --do_train \ + --do_eval \ + --disable_tqdm True \ + --load_best_model_at_end True \ + --metric_for_best_model accuracy \ + --eval_with_do_generation False \ + --recompute \ + --save_total_limit 1 \ + --overwrite_output_dir \ + --sharding "stage2" \ + --sharding_parallel_degree 8 +``` + +### LoRA +```shell +python finetune_generation.py \ + --output_dir ./checkpoints/ \ + --per_device_train_batch_size 4 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 8 \ + --model_name_or_path \ + --task_name squad \ + --num_train_epochs 2 \ + --learning_rate 3e-4 \ + --warmup_steps 30 \ + --logging_steps 1 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --src_length 1024 \ + --tgt_length 1024 \ + --bf16 \ + --fp16_opt_level O2 \ + --do_train \ + --do_eval \ + --disable_tqdm True \ + --load_best_model_at_end True \ + --metric_for_best_model accuracy \ + --eval_with_do_generation False \ + --recompute \ + --save_total_limit 1 \ + --overwrite_output_dir \ + --lora True \ + --lora_rank 8 +``` + +其中参数释义如下: + +- `model_name_or_path`: 预训练模型内置名称或者模型所在目录. +- `num_train_epochs`: 要执行的训练 epoch 总数(如果不是整数,将在停止训练之前执行最后一个 epoch +的小数部分百分比)。 +- `max_steps`: 模型训练步数。 +- `learning_rate`: 参数更新的学习率。 +- `warmup_steps`: 学习率热启的步数。 +- `eval_steps`: 模型评估的间隔步数。 +- `logging_steps`: 训练日志打印的间隔步数。 +- `save_steps`: 模型参数保存的间隔步数。 +- `save_total_limit`: 模型 checkpoint 保存的份数。 +- `output_dir`: 模型参数保存目录。 +- `src_length`: 上下文的最大输入长度,默认为128. +- `tgt_length`: 生成文本的最大长度,默认为160. +- `gradient_accumulation_steps`: 模型参数梯度累积的步数,可用于扩大 batch size。实际的 batch_size = per_device_train_batch_size * gradient_accumulation_steps。 +- `bf16`: 使用 bfloat16 精度进行模型训练和推理。 +- `fp16_opt_level`: bfloat16 精度训练模式,`O2`表示纯 bfloat16 训练。 +- `recompute`: 使用重计算策略,开启后可节省训练显存。 +- `do_train`: 是否训练模型。 +- `do_eval`: 是否评估模型。 +- `tensor_parallel_degree`: 模型并行数量。 +- `eval_with_do_generation`: 在评估的时候是否调用model.generate,默认为False。 +- `lora`: 是否使用 LoRA 技术。 +- `merge_weights`: 是否合并原始模型和 LoRA 模型的权重。 +- `lora_rank`: LoRA 算法中rank(秩)的值,默认为8。 +- `lora_path`: LoRA 参数和配置路径,对 LoRA 参数进行初始化。 +- `task_name`: 内置数据集任务名 +- `data_name`: 内置数据集名,定义数据集名必须同时定义数据集任务名 +- `dataset_path`: 自定义数据集路径。 + + +## 4. 动态图预测 + +```shell +python predict_generation.py \ + --model_name_or_path \ + --tokenizer_name_or_path ernie-tokenizer +``` diff --git a/llm/ernie-3.5-se/configuration.py b/llm/ernie-3.5-se/configuration.py new file mode 100644 index 0000000000000000000000000000000000000000..ecd6e055e7e1c26aa9d308cbafbc6eaf4c9c914b --- /dev/null +++ b/llm/ernie-3.5-se/configuration.py @@ -0,0 +1,202 @@ +# !/usr/bin/env python3 +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Ernie35 model configuration""" + +from paddlenlp.transformers.configuration_utils import PretrainedConfig +from paddlenlp.utils.log import logger + +__all__ = [ + "ERNIE_PRETRAINED_INIT_CONFIGURATION", + "Ernie35Config", + "ERNIE_PRETRAINED_RESOURCE_FILES_MAP", +] + +ERNIE_PRETRAINED_INIT_CONFIGURATION = { + "ernie/tiny-random-ernie": { + "fuse_linear": False, + "fuse_ln": False, + "hidden_size": 768, + "ignored_index": -100, + "initializer_range": 0.02, + "intermediate_size": 2048, + "max_position_embeddings": 4096, + "model_type": "ernie", + "num_attention_heads": 12, + "num_hidden_layers": 3, + "pad_token_id": 0, + "parallel_attn_hatf": True, + "enable_random_position_ids": True, + "use_progressive_seq_len": False, + "layer_norm_eps": 1e-06, + "tensor_parallel_output": True, + "tie_word_embeddings": False, + "use_bias": True, + "use_flash_attention": True, + "use_recompute": False, + "use_recompute_attn": False, + "vocab_size": 32000, + "weight_share_add_bias": True, + }, + "baidu/ernie-3.5-se-3b": { + "fuse_linear": False, + "fuse_ln": False, + "hidden_size": 3072, + "ignored_index": -100, + "initializer_range": 0.02, + "intermediate_size": 8192, + "max_position_embeddings": 32768, # 32k + "model_type": "ernie", + "num_attention_heads": 24, + "num_hidden_layers": 32, + "pad_token_id": 0, + "parallel_attn_hatf": True, + "enable_random_position_ids": True, + "use_progressive_seq_len": True, + "layer_norm_eps": 1e-06, + "tensor_parallel_output": True, + "tie_word_embeddings": False, + "use_bias": True, + "use_flash_attention": True, + "use_recompute": False, + "use_recompute_attn": False, + "vocab_size": 32000, + "weight_share_add_bias": True, + }, +} + +# Hypothetical model weights currently +ERNIE_PRETRAINED_RESOURCE_FILES_MAP = { + "model_state": {}, +} + + +class Ernie35Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`~Ernie35Model`]. It is used to instantiate an Ernie35 + model according to the specified arguments, defining the model architecture. Instantiating a configuration with the + defaults will yield a similar configuration to that of the Ernie35. + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + Args: + vocab_size (`int`, *optional*, defaults to 65536): + Vocabulary size of the Ernie35 model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`~Ernie35Model`]. + hidden_size (`int`, *optional*, defaults to 3072): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to 8192): + Dimension of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer encoder. 
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-12): + The epsilon used by the layer normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + tie_word_embeddings(`bool`, *optional*, defaults to `False`): + Whether to tie weight embeddings + """ + model_type = "ernie" + attribute_map = { + "n_positions": "max_position_embeddings", + "n_embd": "hidden_size", + "n_layer": "num_hidden_layers", + "n_head": "num_attention_heads", + "n_inner": "intermediate_size", + "activation_function": "hidden_act", + } + pretrained_init_configuration = ERNIE_PRETRAINED_INIT_CONFIGURATION + + def __init__( + self, + vocab_size=65536, + hidden_size=768, + intermediate_size=11008, + max_position_embeddings=2048, + num_hidden_layers=2, + num_attention_heads=2, + initializer_range=0.02, # no use + layer_norm_eps=1e-6, + use_cache=True, + use_flash_attention=True, + use_recompute=False, + use_recompute_attn=False, + fuse_ln=False, + tensor_parallel_output=True, + pad_token_id=0, + bos_token_id=1, + eos_token_id=2, + use_bias=False, + sequence_parallel=False, + weight_share=False, # non-PP only + weight_share_add_bias=True, + fuse_linear=False, + seqlen=False, + virtual_pp_degree=1, + ignored_index=-100, + parallel_attn_hatf=True, + enable_random_position_ids=False, + use_progressive_seq_len=False, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.max_position_embeddings = max_position_embeddings + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.use_cache = use_cache + self.use_recompute_attn = use_recompute_attn + if use_recompute_attn: + logger.warning("set `use_recompute_attn`=True, disabling `use_recompute`") + use_recompute = False + self.use_recompute = use_recompute + self.use_flash_attention = use_flash_attention + self.tensor_parallel_output = tensor_parallel_output + self.pad_token_id = pad_token_id + self.bos_token_id = bos_token_id + self.eos_token_id = eos_token_id + self.fuse_ln = fuse_ln + self.sequence_parallel = sequence_parallel + self.seqlen = seqlen + self.virtual_pp_degree = virtual_pp_degree + self.use_bias = use_bias + self.weight_share_add_bias = weight_share_add_bias + kwargs["tie_word_embeddings"] = weight_share + self.fuse_linear = fuse_linear + self.ignored_index = ignored_index + self.parallel_attn_hatf = parallel_attn_hatf + self.enable_random_position_ids = enable_random_position_ids + self.use_progressive_seq_len = use_progressive_seq_len + super().__init__( + pad_token_id=pad_token_id, + bos_token_id=bos_token_id, + eos_token_id=eos_token_id, + tensor_parallel_output=tensor_parallel_output, + **kwargs, + ) + if self.sequence_parallel: + assert self.seqlen, "seqlen not provided in sequence-parallel" + assert ( + self.tensor_parallel_degree > 1 + ), f"senquence-parallel only works in mp, got mp={self.tensor_parallel_degree}" diff --git a/llm/ernie-3.5-se/conversion_utils.py 
b/llm/ernie-3.5-se/conversion_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..325a1cd30d7cc57e32f65927470cb0547268d96e --- /dev/null +++ b/llm/ernie-3.5-se/conversion_utils.py @@ -0,0 +1,198 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +def split_qkv_gate_up_tensor_parallel_weight( + weight, tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size, num_heads +): + """ + [QKV, G, U] -> [QKV1, G1, U1], [QKV2, G2, U2] + + Only support split Column dim. + + """ + assert weight.shape[-1] == 3 * hidden_size + 2 * intermediate_size, "input weight size dismatch!" + + if "PySafeSlice" in str(type(weight)): + QKV = weight[:, 0 : 3 * hidden_size] + G = weight[:, 3 * hidden_size : 3 * hidden_size + intermediate_size] + U = weight[:, 3 * hidden_size + intermediate_size :] + + # Split QKV + block_size = 3 * hidden_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + 3 * hidden_size % tensor_parallel_degree == 0 + ), f"The choosen size {hidden_size} is not compatible with sharding on {tensor_parallel_degree} shards" + qkv = QKV[:, start:stop] + + # Split G, U + block_size = intermediate_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + intermediate_size % tensor_parallel_degree == 0 + ), f"The choosen size {intermediate_size} is not compatible with sharding on {tensor_parallel_degree} shards" + g = G[:, start:stop] + u = U[:, start:stop] + + tensor = np.concatenate([qkv, g, u], axis=-1) + return tensor + + QKV, G, U = np.split(weight, [hidden_size * 3, hidden_size * 3 + intermediate_size], axis=-1) + assert ( + weight.shape[-1] % tensor_parallel_degree == 0 + ), f"The choosen size {weight.shape[-1]} is not compatible with sharding on {tensor_parallel_degree} shards, for tensor shape {weight.shape}" + sQKV, sG, sU = [np.split(item, tensor_parallel_degree, axis=-1) for item in [QKV, G, U]] + qkv, g, u = [item[tensor_parallel_rank] for item in [sQKV, sG, sU]] + tensor = np.concatenate([qkv, g, u], axis=-1) + return tensor + + +def merge_qkv_gate_up_tensor_parallel_weight(weight_list, tensor_parallel_degree, hidden_size, intermediate_size): + """ + [QKV1, G1, U1], [QKV2, G2, U2] -> [Q, K, V, G, U] + + Only support split Column dim. 
+ + """ + bhs = hidden_size // tensor_parallel_degree + bis = intermediate_size // tensor_parallel_degree + + qkv_l, g_l, u_l = [], [], [] + for weight in weight_list: + qkv, g, u = np.split(weight, [bhs * 3, bhs * 3 + bis], axis=-1) + qkv_l.append(qkv) + g_l.append(g) + u_l.append(u) + QKV, G, U = [np.concatenate(item, axis=-1) for item in [qkv_l, g_l, u_l]] + tensor = np.concatenate([QKV, G, U], axis=-1) + return tensor + + +def qkv_gate_up_proj_split_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size, num_heads): + def fn(x): + if x is None: + return None + x = split_qkv_gate_up_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + num_heads=num_heads, + ) + return x + + return fn + + +def qkv_gate_up_proj_merge_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size, num_heads): + def fn(x): + if x is None: + return None + x = merge_qkv_gate_up_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + ) + return x + + return fn + + +def split_o_tensor_parallel_weight( + weight, tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size +): + """ + Only support split Row dim. + """ + assert weight.shape[0] == intermediate_size + hidden_size, "input weight size dismatch!" + if "PySafeSlice" in str(type(weight)): + A = weight[:intermediate_size] + block_size = intermediate_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + intermediate_size % tensor_parallel_degree == 0 + ), f"The choosen size {intermediate_size} is not compatible with sharding on {tensor_parallel_degree} shards" + a = A[start:stop] + + B = weight[intermediate_size:] + block_size = hidden_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + hidden_size % tensor_parallel_degree == 0 + ), f"The choosen size {hidden_size} is not compatible with sharding on {tensor_parallel_degree} shards" + b = B[start:stop] + tensor = np.concatenate([a, b], axis=0) + return tensor + + A, B = np.split(weight, [intermediate_size], axis=0) + assert ( + weight.shape[0] % tensor_parallel_degree == 0 + ), f"The choosen size {weight.shape[-1]} is not compatible with sharding on {tensor_parallel_degree} shards, for tensor shape {weight.shape}" + sA = np.split(A, tensor_parallel_degree, axis=0) + sB = np.split(B, tensor_parallel_degree, axis=0) + a, b = [item[tensor_parallel_rank] for item in [sA, sB]] + tensor = np.concatenate([a, b], axis=0) + return tensor + + +def merge_o_tensor_parallel_weight(weight_list, tensor_parallel_degree, hidden_size, intermediate_size): + bis = intermediate_size // tensor_parallel_degree + a_l, b_l = [], [] + for weight in weight_list: + a, b = np.split(weight, [bis], axis=0) + a_l.append(a) + b_l.append(b) + A, B = [np.concatenate(item, axis=0) for item in [a_l, b_l]] + tensor = np.concatenate([A, B], axis=0) + return tensor + + +def o_proj_split_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size): + def fn(x): + if x is None: + return None + x = split_o_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + hidden_size=hidden_size, + 
intermediate_size=intermediate_size, + ) + return x + + return fn + + +def o_proj_merge_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size): + def fn(x): + if x is None: + return None + x = merge_o_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + ) + return x + + return fn diff --git a/llm/ernie-3.5-se/data.py b/llm/ernie-3.5-se/data.py new file mode 100644 index 0000000000000000000000000000000000000000..e439183c6917f8db6a43488550ab249a36db0f39 --- /dev/null +++ b/llm/ernie-3.5-se/data.py @@ -0,0 +1,199 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import json + +import paddle + +IGNORE_INDEX = -100 + +PROMPT_DICT = { + "prompt_input": ( + "Below is an instruction that describes a task, paired with an input that provides further context. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:" + ), + "prompt_no_input": ( + "Below is an instruction that describes a task. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Response:" + ), +} + + +def reader(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + yield json_line + + +def convert_example(example, tokenizer, data_args, is_test=False): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + if "context" in example: + context = example["context"] + question = example["question"] + try: + answer = example["answers"][0] + except: + print(example["context"]) + print(example["question"]) + print(example["answers"]) + print(example["answer_starts"]) + print(example["is_impossible"]) + input_seq = f"answer: {answer} context: {context}" + output_seq = f"question: {question}
" + elif "instruction" in example: + input_seq = f"{example['instruction']}" + output_seq = f"{example['output']} " + elif "src" in example: + context = example["src"][0] if isinstance(example["src"], list) else example["src"] + question = example["tgt"][0] if isinstance(example["tgt"], list) else example["tgt"] + input_seq = f"{context}" + output_seq = f"{question} " + else: + raise ValueError("Please check the dataset format.") + + source_tokenized = tokenizer( + input_seq, + return_tensors="pd", + max_length=data_args.src_length, + truncation=True, + ) + + source_input_ids_len = ( + source_tokenized["input_ids"].not_equal(paddle.to_tensor(tokenizer.pad_token_id)).sum().item() + ) + + example_tokenized = tokenizer( + input_seq + output_seq, + return_tensors="pd", + max_length=data_args.src_length + data_args.tgt_length, + padding=False, + truncation=True, + ) + + input_ids = example_tokenized["input_ids"][0] + labels = copy.deepcopy(input_ids) + labels[:source_input_ids_len] = IGNORE_INDEX + + if is_test: + return dict( + input_ids=source_tokenized["input_ids"][0], + labels=labels, + ) + + # shift labels + input_ids, labels = input_ids[:-1], labels[1:] + + return dict( + input_ids=input_ids, + labels=labels, + ) + + +def custom_instruction_convert_example( + example, tokenizer, data_args, is_test=False, benchmark=False, model_max_length=512 +): + """ + Convert an example into necessary features. + """ + + if benchmark: + prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"] + + if example.get("input", "") != "": + input_seq = prompt_input.format_map(example) + else: + input_seq = prompt_no_input.format_map(example) + + output_seq = example["output"] + tokenizer.eos_token + else: + instruction = "" + input = "" + output = "" + if "instruction" in example and "output" in example: + instruction = example["instruction"] + output = example["output"] + else: + assert False, "instruction and output are not in the input dictionary." 
+ if "input" in example["input"]: + input = example["input"] + + input_seq = instruction + input + output_seq = output + tokenizer.eos_token + + # To compatible with compile training mode in benchmark, input will be pad to fix length + source_tokenized = tokenizer( + input_seq, + return_tensors="pd", + max_length=data_args.src_length if not benchmark else model_max_length, + truncation=True, + ) + + source_input_ids_len = ( + source_tokenized["input_ids"].not_equal(paddle.to_tensor(tokenizer.pad_token_id)).sum().item() + ) + + total_length = data_args.src_length + data_args.tgt_length + + example_tokenized = tokenizer( + input_seq + output_seq, + return_tensors="pd", + max_length=total_length if not benchmark else model_max_length, + truncation=True, + ) + + input_ids = example_tokenized["input_ids"][0] + labels = copy.deepcopy(input_ids) + labels[:source_input_ids_len] = IGNORE_INDEX + + if is_test: + return dict( + input_ids=source_tokenized["input_ids"][0], + labels=labels, + ) + + # shift labels + input_ids, labels = input_ids[:-1], labels[1:] + + return dict( + input_ids=input_ids, + labels=labels, + ) + + +def left_padding(inputs, pad_id, max_length=-1): + for ids in inputs: + max_length = max(max_length, len(ids)) + + def extend_max_lenth(value, max_length, to_pad_id): + return [to_pad_id] * (max_length - len(value)) + value + + def extend_filed(values, max_length, to_pad_id): + res = [] + for value in values: + res.append(extend_max_lenth(value.tolist(), max_length, to_pad_id)) + return res + + res = extend_filed(inputs, max_length, pad_id) + return paddle.to_tensor(res) diff --git a/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model b/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model new file mode 100644 index 0000000000000000000000000000000000000000..ec2ed09d7a2cad429ad536c1582b7f769ef171cd Binary files /dev/null and b/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model differ diff --git a/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json b/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json new file mode 100644 index 0000000000000000000000000000000000000000..1c141cfedc26fa17771e88d0a1372d4a7387a3ec --- /dev/null +++ b/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json @@ -0,0 +1 @@ +{"bos_token": {"content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "eos_token": {"content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "unk_token": {"content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}} \ No newline at end of file diff --git a/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json b/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json new file mode 100644 index 0000000000000000000000000000000000000000..c12b770ded0ecebb105bfa45d039ddc79076e952 --- /dev/null +++ b/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json @@ -0,0 +1 @@ +{"add_bos_token": true, "add_eos_token": false, "model_max_length": 2048, "pad_token": null, "sp_model_kwargs": {}, "tokenizer_class": "Ernie35Tokenizer", "clean_up_tokenization_spaces": false, "bos_token": {"__type": "AddedToken", "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "eos_token": {"__type": "AddedToken", "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "unk_token": {"__type": "AddedToken", "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}} diff --git 
a/llm/ernie-3.5-se/ernie_dataset.py b/llm/ernie-3.5-se/ernie_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..120b439260c183e28067a456be734dc7eb46dfe0 --- /dev/null +++ b/llm/ernie-3.5-se/ernie_dataset.py @@ -0,0 +1,958 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. + +import hashlib +import math +import os +import time +from itertools import accumulate + +import numpy as np +import paddle +from paddle.distributed import fleet + +local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + +class FakeHCG: + def get_data_parallel_group(self): + return None + + def get_pipe_parallel_group(self): + return None + + def get_model_parallel_group(self): + return None + + +class MMapIndexedDataset(paddle.io.Dataset): + def __init__(self, path, skip_warmup=False): + super().__init__() + + self._path = path + + # All documment ids, extend as 1-D array. + + for suffix in ["_ids.npy", "_idx.npz"]: + if not os.path.isfile(path + suffix): + raise ValueError("File Not found, %s" % (path + suffix)) + + self._token_ids = np.load(path + "_ids.npy", mmap_mode="r", allow_pickle=True) + process_data = np.load(path + "_idx.npz") + self._sizes = process_data["lens"] + self._pointers = np.empty(len(self._sizes) + 1, dtype=np.int64) + self._pointers[0] = 0 + np.cumsum(self._sizes, out=self._pointers[1:]) + self._doc_idx = process_data["docs"] + + def __getstate__(self): + return self._path + + def __len__(self): + return len(self._sizes) + + # @lru_cache(maxsize=8) + def __getitem__(self, idx): + if isinstance(idx, int): + size = self._sizes[idx] + ptr = self._pointers[idx] + np_array = self._token_ids[ptr : ptr + size] + return np_array + + elif isinstance(idx, slice): + start, stop, step = idx.indices(len(self)) + if step != 1: + raise ValueError("Slices into indexed_dataset must be contiguous") + ptr = self._pointers[start] + sizes = self._sizes[idx] + offsets = list(accumulate(sizes)) + total_size = sum(sizes) + np_array = self._token_ids[ptr : ptr + total_size] + sents = np.split(np_array, offsets[:-1]) + return sents + + def get(self, idx, offset=0, length=None): + """Retrieves a single item from the dataset with the option to only + return a portion of the item. + + get(idx) is the same as [idx] but get() does not support slicing. + """ + size = self._sizes[idx] + ptr = self._pointers[idx] + + if length is None: + length = size - offset + ptr += offset + np_array = self._token_ids[ptr : ptr + length] + return np_array + + @property + def sizes(self): + return self._sizes + + @property + def doc_idx(self): + return self._doc_idx + + def get_doc_idx(self): + return self._doc_idx + + def set_doc_idx(self, doc_idx_): + self._doc_idx = doc_idx_ + + +class BlendableDataset(paddle.io.Dataset): + def __init__(self, datasets, weights, size, *, data_cache_path=None): + + self.datasets = datasets + num_datasets = len(datasets) + assert num_datasets == len(weights) + + self.size = size + + # Normalize weights. + weights = np.array(weights, dtype=np.float64) + sum_weights = np.sum(weights) + assert sum_weights > 0.0 + weights /= sum_weights + + # Build indicies. 
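+ # dataset_index[i] records which dataset global sample i is drawn from, and dataset_sample_index[i] is that sample's position within the chosen dataset; both arrays are filled in place by helpers.build_blending_indices below.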
+ def _build_indices(): + start_time = time.time() + assert num_datasets < 255 + dataset_index = np.zeros(self.size, dtype=np.uint8) + dataset_sample_index = np.zeros(self.size, dtype=np.int64) + + from tool_helpers import helpers + + helpers.build_blending_indices( + dataset_index, + dataset_sample_index, + weights, + num_datasets, + self.size, + local_rank == 0, + # paddle.distributed.get_rank() == 0, + ) + print_rank_0( + "> elapsed time for building blendable dataset indices: " + "{:.2f} (sec)".format(time.time() - start_time) + ) + return dataset_index, dataset_sample_index + + desc = "Blendable dataset\n\n" + desc += "Datasets:\n" + for dataset in datasets: + desc += dataset.desc + "\n\n" + desc += f"Weights: {weights}\n" + desc += f"Size: {size}\n" + self.desc = desc + + if data_cache_path: + desc_hash = hashlib.md5(desc.encode("utf-8")).hexdigest() + desc_path = os.path.join(data_cache_path, desc_hash + ".dsc") + index_path = os.path.join(data_cache_path, desc_hash + "_index.npy") + sample_index_path = os.path.join(data_cache_path, desc_hash + "_sample_index.npy") + cache_hit = os.path.isfile(index_path) and os.path.isfile(sample_index_path) + cache_success = True + # if paddle.distributed.get_rank() == 0 and not cache_hit: + if local_rank == 0 and not cache_hit: + print( + " > WARNING: could not find index map files for blendable" + " dataset, building indices on rank 0 ...", + flush=True, + ) + dataset_index, dataset_sample_index = _build_indices() + try: + os.makedirs(os.path.dirname(index_path), exist_ok=True) + with open(desc_path, "wt") as fd: + fd.write(desc) + np.save(index_path, dataset_index, allow_pickle=True) + np.save(sample_index_path, dataset_sample_index, allow_pickle=True) + except OSError: + print(f"There was an error trying to create the data cache directory ({data_cache_path})") + print("or a file in it. This is set with the --data-cache-path argument. Please") + print("ensure you have write access to this directory or specify one that you do have") + print("write access to.") + cache_success = False + + try: + hcg = paddle.distributed.fleet.get_hybrid_communicate_group() + except: + hcg = FakeHCG() + + counts = paddle.to_tensor([cache_success], dtype="int64") + paddle.distributed.all_reduce(counts, group=hcg.get_data_parallel_group()) + paddle.distributed.all_reduce(counts, group=hcg.get_pipeline_model_parallel_group()) + if counts[0].item() != ( + paddle.distributed.get_world_size() + // paddle.distributed.get_world_size(group=hcg.get_tensor_model_parallel_group()) + ): + print_rank_0("Data index creation unsuccessful, exiting.") + exit() + + paddle.distributed.barrier() + # Load on all ranks. 
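+ # Once rank 0 has written the cached index files, every rank memory-maps them from disk (np.load with mmap_mode="r") instead of rebuilding.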
+ print_rank_0(f"> loading blendable dataset index: {index_path}") + self.dataset_index = np.load(index_path, allow_pickle=True, mmap_mode="r") + assert self.dataset_index.size == self.size + + print_rank_0(f"> loading blendable dataset sample index: {sample_index_path}") + self.dataset_sample_index = np.load(sample_index_path, allow_pickle=True, mmap_mode="r") + assert self.dataset_sample_index.size == self.size + else: + self.dataset_index, self.dataset_sample_index = _build_indices() + + # Check size + _ = self.__getitem__(self.size - 1) + try: + _ = self.__getitem__(self.size) + raise RuntimeError("BlendedDataset size is improperly bounded") + except IndexError: + pass + print_rank_0("> size of blendable dataset: " "{} samples".format(self.size)) + + def __len__(self): + return self.size + + def __getitem__(self, idx): + dataset_idx = self.dataset_index[idx] + sample_idx = self.dataset_sample_index[idx] + return { + "dataset_idx": dataset_idx, + **self.datasets[dataset_idx][sample_idx], + } + + +def make_indexed_dataset(data_prefix, data_impl=None, skip_warmup=False): + return MMapIndexedDataset(data_prefix) + + +def get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples): + + # The data prefix should be in the format of: + # weight-1, data-prefix-1, weight-2, data-prefix-2, .. + assert len(data_prefix) % 2 == 0 + num_datasets = len(data_prefix) // 2 + weights = [0] * num_datasets + prefixes = [0] * num_datasets + for i in range(num_datasets): + weights[i] = float(data_prefix[2 * i]) + prefixes[i] = (data_prefix[2 * i + 1]).strip() + # Normalize weights + weight_sum = 0.0 + for weight in weights: + weight_sum += weight + assert weight_sum > 0.0 + weights = [weight / weight_sum for weight in weights] + + # Add 0.5% (the 1.005 factor) so in case the bleding dataset does + # not uniformly distribute the number of samples, we still have + # samples left to feed to the network. + datasets_train_valid_test_num_samples = [] + for weight in weights: + datasets_train_valid_test_num_samples.append( + [int(math.ceil(val * weight * 1.005)) for val in train_valid_test_num_samples] + ) + + return prefixes, weights, datasets_train_valid_test_num_samples + + +def print_rank_0(*args, **kwargs): + if paddle.distributed.get_rank() == 0: + print(*args, **kwargs) + + +def build_train_valid_test_datasets( + data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples, + seq_length, + seed, + skip_warmup, + train_data_prefix=None, + valid_data_prefix=None, + test_data_prefix=None, + return_doc_ids=False, + *, + data_cache_path=None +): + """Build train, valid, and test datasets.""" + + if data_prefix: + print_rank_0("Single data path provided for train, valid & test") + + # Single dataset. + if len(data_prefix) == 1: + return _build_train_valid_test_datasets( + data_prefix[0], + data_impl, + splits_string, + train_valid_test_num_samples, + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples) + prefixes, weights, datasets_train_valid_test_num_samples = output + train_num_samples, valid_num_samples, test_num_samples = map(sum, zip(*datasets_train_valid_test_num_samples)) + + # Build individual datasets. 
+ train_datasets = [] + valid_datasets = [] + test_datasets = [] + for i in range(len(prefixes)): + train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( + prefixes[i], + data_impl, + splits_string, + datasets_train_valid_test_num_samples[i], + seq_length, + seed, + skip_warmup, + return_doc_ids, + data_cache_path=data_cache_path, + ) + if train_ds: + train_datasets.append(train_ds) + if valid_ds: + valid_datasets.append(valid_ds) + if test_ds: + test_datasets.append(test_ds) + + # Blend. + blending_train_dataset = None + if train_datasets: + blending_train_dataset = BlendableDataset( + train_datasets, weights, train_num_samples, data_cache_path=data_cache_path + ) + blending_valid_dataset = None + if valid_datasets: + blending_valid_dataset = BlendableDataset( + valid_datasets, weights, valid_num_samples, data_cache_path=data_cache_path + ) + blending_test_dataset = None + if test_datasets: + blending_test_dataset = BlendableDataset( + test_datasets, weights, test_num_samples, data_cache_path=data_cache_path + ) + + return (blending_train_dataset, blending_valid_dataset, blending_test_dataset) + + else: + print_rank_0("Separate data paths provided for train, valid & test. Split string will be ignored.") + + train_dataset, valid_dataset, test_dataset = None, None, None + # Single dataset. + if train_data_prefix is not None: + train_dataset = build_dataset( + "train", + train_data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples[0], + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + + if valid_data_prefix is not None: + valid_dataset = build_dataset( + "valid", + valid_data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples[1], + seq_length, + seed, + False, + data_cache_path=data_cache_path, + ) + + if test_data_prefix is not None: + test_dataset = build_dataset( + "test", + test_data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples[2], + seq_length, + seed, + False, + data_cache_path=data_cache_path, + ) + + return (train_dataset, valid_dataset, test_dataset) + + +def get_train_valid_test_split_(splits_string, size): + """Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(",") != -1: + splits = [float(s) for s in splits_string.split(",")] + elif splits_string.find("/") != -1: + splits = [float(s) for s in splits_string.split("/")] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.0) + splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def _build_train_valid_test_datasets( + data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples, + seq_length, + seed, + skip_warmup, + return_doc_ids=False, + *, + data_cache_path=None +): + """Build train, valid, and test datasets.""" + + # Indexed dataset. + indexed_dataset = get_indexed_dataset_(data_prefix, data_impl, skip_warmup) + + total_num_of_documents = indexed_dataset.sizes.shape[0] + splits = get_train_valid_test_split_(splits_string, total_num_of_documents) + + # Print stats about the splits. 
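+ # splits holds four cumulative document indices [0, train_end, valid_end, total], so split i owns documents in [splits[i], splits[i + 1]).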
+ print_rank_0(" > dataset split:") + + def print_split_stats(name, index): + print_rank_0(" {}:".format(name)) + print_rank_0( + " document indices in [{}, {}) total of {} " + "documents".format(splits[index], splits[index + 1], splits[index + 1] - splits[index]) + ) + + print_split_stats("train", 0) + print_split_stats("validation", 1) + print_split_stats("test", 2) + + def build_dataset(index, name): + dataset = None + if splits[index + 1] > splits[index]: + documents = np.arange(start=splits[index], stop=splits[index + 1], step=1, dtype=np.int32) + dataset = LLMDataset( + name, + data_prefix, + documents, + indexed_dataset, + splits_string, + train_valid_test_num_samples[index], + seq_length, + seed, + return_doc_ids, + data_cache_path=data_cache_path, + ) + return dataset + + train_dataset = build_dataset(0, "train") + valid_dataset = build_dataset(1, "valid") + test_dataset = build_dataset(2, "test") + + return (train_dataset, valid_dataset, test_dataset) + + +def build_dataset( + dataset_name, + data_prefix, + data_impl, + splits_string, + num_samples, + seq_length, + seed, + skip_warmup, + *, + data_cache_path=None +): + dataset = None + if len(data_prefix) == 1: + dataset = _build_dataset( + dataset_name, + data_prefix[0], + data_impl, + splits_string, + num_samples, + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + else: + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, num_samples) + prefixes, weights, dataset_num_samples = output + num_samples = sum(dataset_num_samples) + + # Build individual datasets. + datasets = [] + for i in range(len(prefixes)): + ds = _build_dataset( + dataset_name, + prefixes[i], + data_impl, + splits_string, + dataset_num_samples[i], + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + if ds: + datasets.append(ds) + + if datasets: + dataset = BlendableDataset(datasets, weights, num_samples, data_cache_path=data_cache_path) + + return dataset + + +def _build_dataset( + dataset_name, + data_prefix, + data_impl, + splits_string, + num_samples, + seq_length, + seed, + skip_warmup, + *, + data_cache_path=None +): + """ + Build dataset. This method is called when individual + train, valid, test datasets are provided + """ + + # Indexed dataset. 
+ indexed_dataset = get_indexed_dataset_(data_prefix, data_impl, skip_warmup) + + total_num_of_documents = indexed_dataset.sizes.shape[0] + + print_rank_0(" {}:".format(dataset_name)) + print_rank_0( + " document indices in [0, {}) total of {} " + "documents".format(total_num_of_documents, total_num_of_documents) + ) + + documents = np.arange(start=0, stop=total_num_of_documents, step=1, dtype=np.int32) + + dataset = LLMDataset( + dataset_name, + data_prefix, + documents, + indexed_dataset, + splits_string, + num_samples, + seq_length, + seed, + data_cache_path=data_cache_path, + ) + + return dataset + + +def get_indexed_dataset_(data_prefix, data_impl, skip_warmup): + """Build indexed dataset.""" + print_rank_0(" > building dataset index ...") + + start_time = time.time() + indexed_dataset = make_indexed_dataset(data_prefix, data_impl, skip_warmup) + print_rank_0(" > finished creating indexed dataset in {:4f} " "seconds".format(time.time() - start_time)) + print_rank_0(" number of documents: {}".format(indexed_dataset.sizes.shape[0])) + + return indexed_dataset + + +class LLMDataset(paddle.io.Dataset): + def __init__( + self, + name, + data_prefix, + documents, + indexed_dataset, + splits_string, + num_samples, + seq_length, + seed, + return_doc_ids=False, + *, + data_cache_path=None + ): + + self.name = name + self.indexed_dataset = indexed_dataset + self.return_doc_ids = return_doc_ids + + # Checks + assert np.min(documents) >= 0 + assert np.max(documents) < indexed_dataset.sizes.shape[0] + + # Build index mappings. + self.doc_idx, self.sample_idx, self.shuffle_idx, self.desc, self.desc_hash = _build_index_mappings( + self.name, + data_prefix, + documents, + self.indexed_dataset.sizes, + splits_string, + num_samples, + seq_length, + seed, + data_cache_path=data_cache_path, + ) + + def __len__(self): + # -1 is due to data structure used to retieve the index: + # sample i --> [sample_idx[i], sample_idx[i+1]) + return self.sample_idx.shape[0] - 1 + + def __getitem__(self, idx): + # Get the shuffled index. + idx = self.shuffle_idx[idx] + # Start and end documents and offsets. + doc_index_f = self.sample_idx[idx][0] + doc_index_l = self.sample_idx[idx + 1][0] + offset_f = self.sample_idx[idx][1] + offset_l = self.sample_idx[idx + 1][1] + # If we are within the same document, just extract the chunk. + doc_ids = [] + if doc_index_f == doc_index_l: + doc_ids.append(self.doc_idx[doc_index_f]) + sample = self.indexed_dataset.get( + self.doc_idx[doc_index_f], offset=offset_f, length=offset_l - offset_f + 1 + ) + else: + # Otherwise, get the rest of the initial document. + doc_ids.append(self.doc_idx[doc_index_f]) + sample_list = [self.indexed_dataset.get(self.doc_idx[doc_index_f], offset=offset_f)] + # Loop over all in between documents and add the entire document. + for i in range(doc_index_f + 1, doc_index_l): + doc_ids.append(self.doc_idx[i]) + sample_list.append(self.indexed_dataset.get(self.doc_idx[i])) + # And finally add the relevant portion of last document. 
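+ # A sample that crosses document boundaries is stitched from the tail of the first document, any whole documents in between, and the first offset_l + 1 tokens of the last one.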
+ doc_ids.append(self.doc_idx[doc_index_l]) + sample_list.append(self.indexed_dataset.get(self.doc_idx[doc_index_l], length=offset_l + 1)) + sample = np.concatenate(sample_list) + + if self.return_doc_ids: # for retro preprocessing + return {"text": np.array(sample, dtype=np.int64), "doc_ids": np.array(doc_ids, dtype=np.int64)} + else: + return {"text": np.array(sample, dtype=np.int64)} + + +def _build_index_mappings( + name, data_prefix, documents, sizes, splits_string, num_samples, seq_length, seed, *, data_cache_path +): + """Build doc-idx, sample-idx, and shuffle-idx. + doc-idx: is an array (ordered) of documents to be used in training. + sample-idx: is the start document index and document offset for each + training sample. + shuffle-idx: maps the sample index into a random index into sample-idx. + """ + # Number of tokens in each epoch and number of required epochs. + tokens_per_epoch = _num_tokens(documents, sizes) + num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples) + + # rng state + np_rng = np.random.RandomState(seed=seed) + + # Filename of the index mappings. + desc = "LLM Dataset\n\n" + desc += f"Data prefix {data_prefix}\n" + desc += f"Dataset name {name}\n" + desc += f"Number of samples {num_samples}\n" + desc += f"Sequence length {seq_length}\n" + desc += f"Random seed {seed}\n" + desc += f"Split {splits_string}\n" + desc_hash = hashlib.md5(desc.encode("utf-8")).hexdigest() + print_rank_0(desc_hash, desc) + desc_filename = desc_hash + ".dsc" + doc_idx_filename = desc_hash + "_doc_idx.npy" + sample_idx_filename = desc_hash + "_sample_idx.npy" + shuffle_idx_filename = desc_hash + "_shuffle_idx.npy" + + # Look for cache in main data dir first to avoid unnecessary + # duplication, then look in data-cache-path if specified, + # If nothing is found, use the last path looked in + build_indices = True + prefixes = [os.path.join(os.path.dirname(data_prefix), "index-cache")] + if data_cache_path is not None: + prefixes.append(data_cache_path) + for prefix in prefixes: + idx_path = { + "desc": os.path.join(prefix, desc_filename), + "doc": os.path.join(prefix, doc_idx_filename), + "sample": os.path.join(prefix, sample_idx_filename), + "shuffle": os.path.join(prefix, shuffle_idx_filename), + } + for f in idx_path.values(): + if not os.path.isfile(f): + break + else: + # Found our files! + build_indices = False + break + data_cache_dir = os.path.dirname(idx_path["desc"]) + data_cache_success = True + + # local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + # Build the indexed mapping if not exist. + if build_indices and paddle.distributed.get_rank() == 0: + # if build_indices and local_rank == 0: + print_rank_0(" > WARNING: could not find index map files, building " "the indices on rank 0 ...") + + # For the last epoch, decide whether include the entire epoch + # in the global shuffle or not. + + # If we need only one epoch, then separating last epoch does + # not mean anything. + if num_epochs == 1: + separate_last_epoch = False + print(" > only one epoch required, setting " "separate_last_epoch to False", flush=True) + + else: + # Get the number of samples for the last epoch + num_samples_from_epochs_minus_one = ((num_epochs - 1) * tokens_per_epoch - 1) // seq_length + last_epoch_num_samples = num_samples - num_samples_from_epochs_minus_one + assert last_epoch_num_samples >= 0, "last epoch number of samples should be non-negative." 
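+ # num_samples_per_epoch is the number of full seq_length windows a single epoch of tokens can supply.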
+ num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length + assert last_epoch_num_samples <= ( + num_samples_per_epoch + 1 + ), "last epoch number of samples exceeded max value." + # If we have less than 80% of the samples for the last epoch, + # seperate out the epoch and treat it differently. + # Note: the 80% number is just based on common sense and can + # be adjusted if needed. + separate_last_epoch = last_epoch_num_samples < int(0.80 * num_samples_per_epoch) + if separate_last_epoch: + string = ( + " > last epoch number of samples ({}) is smaller " + "than 80% of number of samples per epoch ({}), " + "setting separate_last_epoch to True" + ) + else: + string = ( + " > last epoch number of samples ({}) is larger " + "than 80% of number of samples per epoch ({}), " + "setting separate_last_epoch to False" + ) + print(string.format(last_epoch_num_samples, num_samples_per_epoch), flush=True) + + try: + os.makedirs(data_cache_dir, exist_ok=True) + + # description + with open(idx_path["desc"], "wt") as fd: + fd.write(desc) + + # doc-idx. + start_time = time.time() + doc_idx = _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch) + np.save(idx_path["doc"], doc_idx, allow_pickle=True) + print_rank_0( + " > elasped time to build and save doc-idx mapping " + "(seconds): {:4f}".format(time.time() - start_time) + ) + # sample-idx. + start_time = time.time() + # Use C++ implementation for speed. + # First compile and then import. + # from megatron.data import helpers + from tool_helpers import helpers + + assert doc_idx.dtype == np.int32 + assert sizes.dtype == np.int32 + sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch) + np.save(idx_path["sample"], sample_idx, allow_pickle=True) + print_rank_0( + " > elasped time to build and save sample-idx mapping " + "(seconds): {:4f}".format(time.time() - start_time) + ) + # shuffle-idx. + start_time = time.time() + # -1 is due to data structure used to retieve the index: + # sample i --> [sample_idx[i], sample_idx[i+1]) + if separate_last_epoch: + num_samples_ = num_samples_from_epochs_minus_one + else: + num_samples_ = sample_idx.shape[0] - 1 + shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1, np_rng) + np.save(idx_path["shuffle"], shuffle_idx, allow_pickle=True) + print_rank_0( + " > elasped time to build and save shuffle-idx mapping" + " (seconds): {:4f}".format(time.time() - start_time) + ) + except OSError: + print(f"There was an error trying to create the data cache directory ({data_cache_dir})") + print('or a file in it. This defaults to a directory "index-cache" within the directory') + print("the data files are in and can be set with the --data-cache-path argument. 
Please") + print("ensure you have write access to this directory or specify one that you do have") + print("write access to.") + data_cache_success = False + + try: + hcg = fleet.get_hybrid_communicate_group() + except: + hcg = FakeHCG() + + counts = paddle.to_tensor([data_cache_success], dtype="int64") + if paddle.distributed.get_world_size() > 1: + paddle.distributed.all_reduce(counts, group=hcg.get_data_parallel_group()) + paddle.distributed.all_reduce(counts, group=hcg.get_sharding_parallel_group()) + paddle.distributed.all_reduce(counts, group=hcg.get_pipe_parallel_group()) + + if counts[0].item() != ( + paddle.distributed.get_world_size() + // max(paddle.distributed.get_world_size(group=hcg.get_model_parallel_group()), 1) + ): + print_rank_0("Data index creation unsuccessful, exiting.") + exit() + + paddle.distributed.barrier() + # Load mappings. + start_time = time.time() + print_rank_0(f" > loading doc-idx mapping from {idx_path['doc']}") + doc_idx = np.load(idx_path["doc"], allow_pickle=True, mmap_mode="r") + + print_rank_0(f" > loading sample-idx mapping from {idx_path['sample']}") + sample_idx = np.load(idx_path["sample"], allow_pickle=True, mmap_mode="r") + + print_rank_0(f" > loading shuffle-idx mapping from {idx_path['shuffle']}") + shuffle_idx = np.load(idx_path["shuffle"], allow_pickle=True, mmap_mode="r") + + print_rank_0(" loaded indexed file in {:3.3f} seconds".format(time.time() - start_time)) + print_rank_0(" total number of samples: {}".format(sample_idx.shape[0])) + print_rank_0(" total number of epochs: {}".format(num_epochs)) + + return doc_idx, sample_idx, shuffle_idx, desc, desc_hash + + +def _num_tokens(documents, sizes): + """Total number of tokens in the dataset.""" + return np.sum(sizes[documents]) + + +def _num_epochs(tokens_per_epoch, seq_length, num_samples): + """Based on number of samples and sequence lenght, calculate how many + epochs will be needed.""" + num_epochs = 0 + total_tokens = 0 + while True: + num_epochs += 1 + total_tokens += tokens_per_epoch + # -1 is because we need to retrieve seq_length + 1 token each time + # but the last token will overlap with the first token of the next + # sample except for the last sample. + if ((total_tokens - 1) // seq_length) >= num_samples: + return num_epochs + + +def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch): + """Build an array with length = number-of-epochs * number-of-dcuments. + Each index is mapped to a corresponding document.""" + if not separate_last_epoch or num_epochs == 1: + doc_idx = np.mgrid[0:num_epochs, 0 : len(documents)][1] + doc_idx[:] = documents + doc_idx = doc_idx.reshape(-1) + doc_idx = doc_idx.astype(np.int32) + np_rng.shuffle(doc_idx) + return doc_idx + + doc_idx_first = _build_doc_idx(documents, num_epochs - 1, np_rng, False) + doc_idx_last = _build_doc_idx(documents, 1, np_rng, False) + return np.concatenate((doc_idx_first, doc_idx_last)) + + +def _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch): + """Sample index mapping is a 2D array with sizes + [number-of-samples + 1, 2] where [..., 0] contains + the index into `doc_idx` and [..., 1] is the + starting offset in that document.""" + + # Total number of samples. For -1 see comments in `_num_epochs`. + num_samples = (num_epochs * tokens_per_epoch - 1) // seq_length + sample_idx = np.zeros([num_samples + 1, 2], dtype=np.int32) + + # Index into sample_idx. + sample_index = 0 + # Index into doc_idx. + doc_idx_index = 0 + # Begining offset for each document. 
+ doc_offset = 0 + # Start with first document and no offset. + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + while sample_index <= num_samples: + # Start with a fresh sequence. + remaining_seq_length = seq_length + 1 + while remaining_seq_length != 0: + # Get the document length. + doc_id = doc_idx[doc_idx_index] + doc_length = sizes[doc_id] - doc_offset + # And add it to the current sequence. + remaining_seq_length -= doc_length + # If we have more than a full sequence, adjust offset and set + # remaining length to zero so we return from the while loop. + # Note that -1 here is for the same reason we have -1 in + # `_num_epochs` calculations. + if remaining_seq_length <= 0: + doc_offset += remaining_seq_length + doc_length - 1 + remaining_seq_length = 0 + else: + # Otherwise, start from the begining of the next document. + doc_idx_index += 1 + doc_offset = 0 + # Record the sequence. + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + + return sample_idx + + +def _build_shuffle_idx(num_samples, total_size, np_rng): + """Build the range [0, size) and shuffle.""" + print( + " > building shuffle index with split [0, {}) and [{}, {}) " + "...".format(num_samples, num_samples, total_size), + flush=True, + ) + + dtype_ = np.uint32 + if total_size >= (np.iinfo(np.uint32).max - 1): + dtype_ = np.int64 + + shuffle_idx_first = np.arange(start=0, stop=num_samples, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_first) + if num_samples == total_size: + return shuffle_idx_first + + shuffle_idx_last = np.arange(start=num_samples, stop=total_size, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_last) + + return np.concatenate((shuffle_idx_first, shuffle_idx_last)) diff --git a/llm/ernie-3.5-se/finetune_generation.py b/llm/ernie-3.5-se/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..0b23b9e24c953848ef714aa655b4d889ca29e367 --- /dev/null +++ b/llm/ernie-3.5-se/finetune_generation.py @@ -0,0 +1,266 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
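+# Fine-tuning entry point for ERNIE 3.5 SE: loads Ernie35ForCausalLM, optionally applies LoRA, and runs training/evaluation through Ernie35Trainer.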
+ +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import convert_example, custom_instruction_convert_example, reader +from modeling import Ernie35ForCausalLM +from tokenizer import Ernie35Tokenizer +from utils import Ernie35Trainer, compute_metrics, compute_metrics_not_do_generation + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import ( + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, + set_seed, +) +from paddlenlp.utils.log import logger + + +@dataclass +class DataArgument: + + data_name: str = field(default=None, metadata={"help": "The name of data."}) + task_name: str = field(default=None, metadata={"help": "The name of task."}) + dataset_path: str = field(default=None, metadata={"help": "The file name of train dataset."}) + src_length: int = field(default=512, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=256, metadata={"help": "The max length of target text."}) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default="baidu/ernie-3.5-se-3b", + metadata={"help": "Build-in pretrained model name or the path to local model."}, + ) + label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."}) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + eval_with_do_generation: bool = field( + default=False, metadata={"help": "Evaluate with generation, instead for calc loss."} + ) + benchmark: bool = field( + default=False, + metadata={"help": "Whether or not run benchmark."}, + ) + profiler_options: str = field( + default=None, + metadata={"help": "profiler_options."}, + ) + # lora + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=4, metadata={"help": "Lora attention dimension"}) + merge_weights: bool = field( + default=False, metadata={"help": "Merge weights of the original model and the Lora model"} + ) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + data_args.always_pad_to_max_length = False + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.benchmark = model_args.benchmark + training_args.tgt_length = data_args.tgt_length + + training_args.profiler_options = model_args.profiler_options + setattr(training_args, "label_smoothing", model_args.label_smoothing) + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + paddle.set_device(training_args.device) + + set_seed(args=training_args) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + 
) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + model_class = Ernie35ForCausalLM + + # Load the pretrained language model. + model = model_class.from_pretrained( + model_args.model_name_or_path, + tensor_parallel_output=False, + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, # todo enable set dtype to avoid additional mem usage + use_progressive_seq_len=False, + parallel_attn_hatf=True, + ) + + if model_args.lora: + if model_args.lora_path is None: + # Not yet support RowParallelLinear + target_modules = [ + ".*q_proj.*", + ".*v_proj.*", + ".*k_proj.*", + ".*gate_proj.*", + ".*up_proj.*", + ".*o_proj.*", + ".*down_proj.*", + ".*qkv_gate_up_proj.*", + ] + + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=model_args.merge_weights, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + tokenizer = Ernie35Tokenizer.from_pretrained( + model_args.model_name_or_path, + padding_side="left", # Allow batch inference + ) + tokenizer.pad_token = tokenizer.unk_token + + # Load the dataset. 
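+ # Dataset selection: benchmark mode reads ./data/train.txt; otherwise a built-in dataset is picked via data_name/task_name, or train.json/dev.json are read from dataset_path.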
+ if training_args.benchmark: + train_ds = load_dataset(reader, data_path="./data/train.txt", lazy=False) + training_args.do_eval = False + data_args.always_pad_to_max_length = True + trans_func = partial( + custom_instruction_convert_example, tokenizer=tokenizer, data_args=data_args, benchmark=True + ) + elif training_args.do_train or training_args.do_eval: + if data_args.data_name is not None: + if data_args.task_name is not None: + train_ds, dev_ds = load_dataset(data_args.data_name, data_args.task_name, splits=["train", "dev"]) + else: + raise ValueError("`data_args.task_name` and `data_args.data_name` should be specified together") + elif data_args.task_name is not None: + if data_args.task_name == "squad": + train_ds, dev_ds = load_dataset("squad", splits=["train_v1", "dev_v1"]) + else: + train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train", "dev"]) + elif data_args.dataset_path is not None: + train_ds = load_dataset(reader, data_path=os.path.join(data_args.dataset_path, "train.json"), lazy=False) + dev_ds = load_dataset(reader, data_path=os.path.join(data_args.dataset_path, "dev.json"), lazy=False) + else: + raise ValueError("Please set the correct data arguments(data_name, task_name, dataset_pat)") + trans_func = partial(convert_example, tokenizer=tokenizer, data_args=data_args) + + if training_args.do_train: + train_ds = train_ds.map(partial(trans_func)) + if training_args.do_eval: + # pipeline_parallel eval is the same as training. + dev_ds = dev_ds.map(partial(trans_func, is_test=model_args.eval_with_do_generation)) + + model_max_length = data_args.src_length + data_args.tgt_length if not training_args.benchmark else 512 + collate_fn = DataCollatorForSeq2Seq( + return_tensors="pd", + tokenizer=tokenizer, + max_length=model_max_length if data_args.always_pad_to_max_length else -1, + padding="max_length" if data_args.always_pad_to_max_length else True, + return_attention_mask=True, + ) + + def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + preds = eval_preds.predictions + preds = [x[x != -100] for x in preds] + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = [x[x != -100] for x in eval_preds.label_ids] + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + + all_preds = [pred.strip() for pred in all_preds] + all_labels = [label.strip() for label in all_labels] + all_preds = [pred.strip("question:") for pred in all_preds] + all_labels = [label.strip("question:") for label in all_labels] + + eval_result = compute_metrics(all_preds, all_labels) + return eval_result + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = Ernie35Trainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics_func + if model_args.eval_with_do_generation + else compute_metrics_not_do_generation, + do_generation=model_args.eval_with_do_generation, + data_collator=collate_fn, + ) + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if 
training_args.do_eval: + eval_result = trainer.evaluate() + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/ernie-3.5-se/modeling.py b/llm/ernie-3.5-se/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..570433b994c920f877fe1a230ee7c1eb74ba1de6 --- /dev/null +++ b/llm/ernie-3.5-se/modeling.py @@ -0,0 +1,1415 @@ +# !/usr/bin/env python3 + +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Ernie35 model""" +import contextlib +import math +import os +from functools import partial +from typing import Optional, Tuple + +import numpy as np +import paddle +import paddle.nn.functional as F +from configuration import Ernie35Config +from paddle import nn +from paddle.distributed import fleet +from paddle.distributed.fleet.layers.mpu.mp_layers import ( + ColumnParallelLinear, + RowParallelLinear, + VocabParallelEmbedding, +) +from paddle.distributed.fleet.layers.mpu.random import get_rng_state_tracker +from paddle.distributed.fleet.utils import recompute +from paddle.incubate.nn.layer.fused_dropout_add import FusedDropoutAdd + +from paddlenlp.transformers.model_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, + CausalLMOutputWithCrossAttentions, +) +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model +from paddlenlp.utils.log import logger + +try: + from paddle.nn.functional.flash_attention import flash_attention + + logger.warning("Use flash attention in scaled-dot-product. 
Attention mask is deprecated") +except: + flash_attention = None + +try: + import fused_ln as fused +except ImportError: + logger.warning("fused-ln not found, run `python setup.py install` to build fused ln") + fused = None + + +ERNIE_PRETRAINED_MODEL_ARCHIVE_LIST = [] + +__all__ = [ + "Ernie35Model", + "Ernie35PretrainedModel", + "Ernie35ForCausalLM", +] + + +def get_triangle_upper_mask(x, mask=None): + if mask is not None: + return mask + # [bsz, n_head, q_len, kv_seq_len] + shape = x.shape + # [bsz, 1, q_len, kv_seq_len] + shape[1] = 1 + mask = paddle.full(shape, -np.inf, dtype=x.dtype) + mask.stop_gradient = True + mask = paddle.triu(mask, diagonal=1) + mask.stop_gradient = True + return mask + + +def parallel_matmul( + x, + y, + bias=None, + transpose_y=False, + tensor_parallel_degree=1, + tensor_parallel_output=True, + fuse_linear=False, +): + + if tensor_parallel_degree > 1 and y.is_distributed: + pg = fleet.get_hybrid_communicate_group().get_model_parallel_group() + input_parallel = paddle.distributed.collective._c_identity(x, group=pg) + if transpose_y: + logits = paddle.matmul(input_parallel, y, transpose_y=True) + if bias is not None: + logits += bias + else: + if not fuse_linear: + logits = F.linear(input_parallel, y, bias) + else: + logits = paddle.incubate.nn.functional.fused_linear(input_parallel, y, bias) # hack for 逐位对齐 + + if tensor_parallel_output: + return logits + + return paddle.distributed.collective._c_concat(logits, group=pg) + + else: + logits = paddle.matmul(x, y, transpose_y=transpose_y) + if bias is not None: + logits += bias + return logits + + +def finfo(dtype: paddle.dtype = None): + if dtype is None: + dtype = paddle.get_default_dtype() + + if dtype == paddle.bfloat16: + # Numpy do not support `np.finfo(np.uint16)`, so try to construct a finfo object to fetch min value + class BFloatFInfo: + min = -3.3895313892515355e38 + + return BFloatFInfo + if dtype == paddle.float32: + return np.finfo(np.float32) + if dtype == paddle.float16: + return np.finfo(np.float16) + if dtype == paddle.float64: + return np.finfo(np.float64) + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def scaled_dot_product_attention( + query_states, key_states, value_states, attention_mask, output_attentions, config, is_causal=True +): + + bsz, q_len, num_heads, _ = paddle.shape(query_states) + head_dim = config.hidden_size // config.num_attention_heads + _, kv_seq_len, _, _ = value_states.shape + + if config.use_flash_attention and flash_attention is not None: + # Flash Attention now ignore attention mask + # Current Flash Attention doesn't support attn maskt + # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim] + # Torch Flash Attention input [ bz, nhead, seqlen, head_dim] + # without past keys + attn_output, attn_weights = flash_attention( + query_states, + key_states, + value_states, + causal=is_causal and query_states.shape[1] != 1, + return_softmax=output_attentions, + ) + + attn_output = attn_output.reshape([bsz, q_len, head_dim * num_heads]) + return attn_output, attn_weights + else: + + query_states = paddle.transpose(query_states, [0, 2, 1, 3]) / math.sqrt(head_dim) + # merge with the next tranpose + key_states = paddle.transpose(key_states, [0, 2, 1, 3]) + value_states = paddle.transpose(value_states, [0, 2, 1, 3]) + + attn_weights = paddle.matmul(query_states, key_states.transpose([0, 1, 3, 2])) + + if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: + raise ValueError( + f"Attention 
weights should be of shape {(bsz, num_heads, q_len, kv_seq_len)}, but is" + f" {attn_weights.shape}" + ) + + if attention_mask is None: + attention_mask = get_triangle_upper_mask(attn_weights) + + attention_mask = attention_mask.reshape([bsz, 1, q_len, kv_seq_len]) + if attention_mask.shape != [bsz, 1, q_len, kv_seq_len]: + raise ValueError( + f"Attention mask should be of shape {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" + ) + + attn_weights = attention_mask + attn_weights + attn_weights = paddle.maximum( + attn_weights, paddle.to_tensor(float(finfo(query_states.dtype).min), dtype=query_states.dtype) + ) + + with paddle.amp.auto_cast(False): + attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + + attn_output = paddle.matmul(attn_weights, value_states) + attn_output = attn_output.transpose([0, 2, 1, 3]) + attn_output = attn_output.reshape([bsz, q_len, head_dim * num_heads]) + if output_attentions: + return attn_output, attn_weights + return attn_output, None + + +def _make_causal_mask(input_ids_shape, past_key_values_length, dtype): + """ + Make causal mask used for self-attention. + """ + batch_size, target_length = input_ids_shape + + mask = paddle.full((target_length, target_length), float(finfo(dtype).min)) + + mask_cond = paddle.arange(mask.shape[-1]) + mask = masked_fill(mask, mask_cond < (mask_cond + 1).reshape([mask.shape[-1], 1]), 0) + + if past_key_values_length > 0: + mask = paddle.concat([paddle.zeros([target_length, past_key_values_length]), mask], axis=-1) + + return mask[None, None, :, :].expand([batch_size, 1, target_length, target_length + past_key_values_length]) + + +def _expand_mask(mask, dtype, tgt_length): + """ + Expands attention_mask from `[batch_size, src_length]` to `[batch_size, 1, tgt_length, src_length]`. 
+ """ + if mask.ndim == 4: + expanded_mask = mask + elif mask.ndim == 3: + expanded_mask = mask[:, None, :, :] + else: + batch_size, src_length = mask.shape[0], mask.shape[-1] + tgt_length = tgt_length if tgt_length is not None else src_length + + expanded_mask = mask[:, None, None, :].expand([batch_size, 1, tgt_length, src_length]) + + inverted_mask = 1.0 - expanded_mask + return masked_fill(inverted_mask, inverted_mask.cast("bool"), float(finfo(dtype).min)) + + +class LayerNorm(nn.LayerNorm): + def __init__(self, config): + super().__init__(config.hidden_size, epsilon=config.layer_norm_eps) + + +class FusedLayerNorm(nn.Layer): + def __init__(self, config): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.weight = paddle.create_parameter( + shape=[self.hidden_size], + dtype=paddle.get_default_dtype(), + default_initializer=nn.initializer.Constant(1.0), + ) + self.bias = paddle.create_parameter( + shape=[self.hidden_size], + dtype=paddle.get_default_dtype(), + is_bias=True, + default_initializer=nn.initializer.Constant(0.0), + ) + self.variance_epsilon = config.layer_norm_eps + + def forward(self, hidden_states): + return fused.fused_ln(hidden_states, self.weight, self.bias, self.variance_epsilon)[0] + + +class RotaryEmbedding(nn.Layer): + def __init__(self, config, dim, max_position_embeddings=4096, base=10000): + super().__init__() + # dtype = paddle.get_default_dtype() + self.config = config + self.base = base + self.max_position_embeddings = max_position_embeddings + inv_freq = 1.0 / (base ** (paddle.cast(paddle.arange(0, dim, 2), dtype="float32") / dim)) + + # higher acc using float32 + t = paddle.arange(max_position_embeddings, dtype="float32") + freqs = paddle.einsum("i,j->ij", t, inv_freq.cast("float32")) + # Different from paper, but it uses a different permutation in order to obtain the same calculation + emb = paddle.concat([freqs, freqs], axis=-1) + + # [bs, seqlen, nhead, head_dim] + self.cos_cached = emb.cos() # [None, :, None, :] # .astype(dtype) + self.sin_cached = emb.sin() # [None, :, None, :] # .astype(dtype) + + def forward(self, x, seq_len=None, position_ids=None): + if position_ids is not None: + return self.cos_cached, self.sin_cached + start = 0 + if self.config.enable_random_position_ids: + if self.training: + np_rng = np.random.RandomState( + int(os.getenv("TRAINER_GLOBAL_STEP", "0")) + (paddle.distributed.get_rank() * 100000) + ) + pos_ids = np.array(list(np.sort(np_rng.permutation(self.max_position_embeddings)[:seq_len]))).astype( + "int64" + ) + pos_ids = paddle.to_tensor(pos_ids) + else: + if seq_len <= 4096: + times = 8 + else: + times = self.max_position_embeddings // seq_len + pos_ids = [times - 1] + pos_ids += [times] * (seq_len - 1) + pos_ids = paddle.cumsum(paddle.to_tensor(pos_ids)) + + return ( + self.cos_cached[pos_ids], + self.sin_cached[pos_ids], + ) + + return ( + self.cos_cached[start : start + seq_len, :], + self.sin_cached[start : start + seq_len, :], + ) + + @classmethod + def rotate_half(cls, x): + """Rotates half the hidden dims of the input.""" + + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return paddle.concat([-x2, x1], axis=-1) + + @classmethod + def apply_rotary_pos_emb(cls, q, k, cos, sin, offset: int = 0, position_ids=None): + if position_ids is not None: + assert offset == 0, offset + cos = F.embedding(position_ids, cos) + sin = F.embedding(position_ids, sin) + else: + cos = cos.unsqueeze(0) + sin = sin.unsqueeze(0) + cos = cos[:, offset : q.shape[1] + offset, None, :] 
+ sin = sin[:, offset : q.shape[1] + offset, None, :] + + # q_embed = (q * cos) + (rotate_half(q) * sin) + # k_embed = (k * cos) + (rotate_half(k) * sin) + + cos = paddle.cast(cos, q.dtype) + sin = paddle.cast(sin, q.dtype) + q_embed = paddle.add(paddle.multiply(q, cos), paddle.multiply(cls.rotate_half(q), sin)) + k_embed = paddle.add(paddle.multiply(k, cos), paddle.multiply(cls.rotate_half(k), sin)) + return q_embed, k_embed + + +class Ernie35MLP(nn.Layer): + def __init__(self, config): + super().__init__() + self.hidden_size = config.hidden_size + self.intermediate_size = config.intermediate_size + + if config.tensor_parallel_degree > 1: + self.gate_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + gather_output=False, + has_bias=config.use_bias, + fuse_matmul_bias=config.fuse_linear, + ) + self.up_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + gather_output=False, + has_bias=config.use_bias, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=config.use_bias) + self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=config.use_bias) + + if config.tensor_parallel_degree > 1: + self.down_proj = RowParallelLinear( + self.intermediate_size, + self.hidden_size, + input_is_parallel=True, + has_bias=config.use_bias, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias_attr=config.use_bias) + + def forward(self, x): + x = F.silu(self.gate_proj(x)) * self.up_proj(x) + return self.down_proj(x) + + +def rope_attn( + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions=False, + past_key_value=None, + use_cache=False, + rotary_emb=None, + config=None, +): + + if mix_layer is not None: + query_states, key_states, value_states = paddle.split(mix_layer, 3, axis=-1) + + kv_seq_len = key_states.shape[-3] + offset = 0 + if past_key_value is not None: + offset = past_key_value[0].shape[-3] + kv_seq_len += offset + + cos, sin = rotary_emb(value_states, seq_len=kv_seq_len, position_ids=position_ids) + + query_states, key_states = rotary_emb.apply_rotary_pos_emb( + query_states, + key_states, + cos, + sin, + position_ids=position_ids, + offset=offset if position_ids is None else 0, + ) + + if past_key_value is not None: + # reuse k, v, self_attention + key_states = paddle.concat([past_key_value[0], key_states], axis=1) + value_states = paddle.concat([past_key_value[1], value_states], axis=1) + + past_key_value = (key_states, value_states) if use_cache else None + attn_output, attn_weights = scaled_dot_product_attention( + query_states, + key_states, + value_states, + attention_mask, + output_attentions, + config=config, + ) + return attn_output, attn_weights, past_key_value + + +class Ernie35Attention(nn.Layer): + def __init__(self, config): + super().__init__() + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + self.use_recompute_attn = config.use_recompute_attn + + if config.tensor_parallel_degree > 1: + assert ( + self.num_heads % config.tensor_parallel_degree == 0 + ), "num_heads: {self.num_heads}, tensor_parallel_degree: {config.tensor_parallel_degree}" + self.num_heads = self.num_heads // config.tensor_parallel_degree + + if config.tensor_parallel_degree > 1: + self.q_proj = ColumnParallelLinear( + self.hidden_size, + 
self.hidden_size, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + self.k_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + self.v_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.q_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + self.k_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + self.v_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + + if config.tensor_parallel_degree > 1: + self.o_proj = RowParallelLinear( + self.hidden_size, + self.hidden_size, + has_bias=config.use_bias, + input_is_parallel=True, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.o_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + + self.rotary_emb = RotaryEmbedding( + config, + self.head_dim, + max_position_embeddings=config.max_position_embeddings, + ) + self._cast_to_low_precison = False + + self.config = config + + def forward( + self, + hidden_states, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[Tuple[paddle.Tensor]] = None, + output_attentions: bool = False, + use_cache: bool = False, + ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: + """Input shape: Batch x Time x Channel""" + bsz, q_len, _ = hidden_states.shape + query_states = key_states = value_states = mix_layer = None + # if self.fuse_attn: + # mix_layer = self.qkv_proj(hidden_states).reshape([bsz, q_len, self.num_heads, 3 * self.head_dim]) + # else: + query_states = self.q_proj(hidden_states).reshape(shape=[bsz, q_len, self.num_heads, self.head_dim]) + key_states = self.k_proj(hidden_states).reshape(shape=[bsz, q_len, self.num_heads, self.head_dim]) + value_states = self.v_proj(hidden_states).reshape(shape=[bsz, q_len, self.num_heads, self.head_dim]) + + _rope_attn = partial(rope_attn, rotary_emb=self.rotary_emb, config=self.config) + if self.use_recompute_attn: + assert past_key_value is None, "do not use kv cache in recompute" + assert not use_cache + attn_output, attn_weights, past_key_value = recompute( + _rope_attn, + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + use_reentrant=self.config.recompute_use_reentrant, + ) + else: + attn_output, attn_weights, past_key_value = _rope_attn( + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + past_key_value, + use_cache=use_cache, + ) + attn_output = self.o_proj(attn_output) + + if not output_attentions: + attn_weights = None + + return attn_output, attn_weights, past_key_value + + +class Ernie35MLPAttention(nn.Layer): + def __init__(self, config): + super().__init__() + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + self.intermediate_size = config.intermediate_size + + self.use_recompute_attn = config.use_recompute_attn + + if config.tensor_parallel_degree > 1: + assert ( + self.num_heads % config.tensor_parallel_degree == 0 + ), "num_heads: {self.num_heads}, tensor_parallel_degree: 
{config.tensor_parallel_degree}" + self.num_heads = self.num_heads // config.tensor_parallel_degree + self.intermediate_size_this_rank = self.intermediate_size // config.tensor_parallel_degree + else: + self.intermediate_size_this_rank = self.intermediate_size + + if config.tensor_parallel_degree > 1: + self.qkv_gate_up_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size * 3 + self.intermediate_size * 2, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + self.o_proj = RowParallelLinear( + self.hidden_size + self.intermediate_size, + self.hidden_size, + has_bias=config.use_bias, + input_is_parallel=True, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.qkv_gate_up_proj = nn.Linear( + self.hidden_size, + self.hidden_size * 3 + self.intermediate_size * 2, + bias_attr=config.use_bias, + ) + self.o_proj = nn.Linear( + self.hidden_size + self.intermediate_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + + self.rotary_emb = RotaryEmbedding( + config, + self.head_dim, + max_position_embeddings=config.max_position_embeddings, + ) + self._cast_to_low_precison = False + + self.config = config + + def forward( + self, + hidden_states, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[Tuple[paddle.Tensor]] = None, + output_attentions: bool = False, + use_cache: bool = False, + ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: + bsz, q_len, _ = hidden_states.shape + query_states = key_states = value_states = mix_layer = None + + mix_layer = self.qkv_gate_up_proj(hidden_states) + mix_layer, up_states, gate_states = mix_layer.split( + [self.head_dim * self.num_heads * 3, self.intermediate_size_this_rank, self.intermediate_size_this_rank], + axis=-1, + ) + mix_layer = mix_layer.reshape([bsz, q_len, self.num_heads, 3 * self.head_dim]) + _rope_attn = partial(rope_attn, rotary_emb=self.rotary_emb, config=self.config) + + if self.use_recompute_attn: + assert past_key_value is None, "do not use kv cache in recompute" + attn_output, attn_weights, past_key_value = recompute( + _rope_attn, + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + use_reentrant=False, + ) + else: + attn_output, attn_weights, past_key_value = _rope_attn( + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + past_key_value, + use_cache=use_cache, + ) + + ffn_output = F.silu(up_states) * gate_states + + output_states = paddle.concat([ffn_output, attn_output], axis=-1) + output_states = self.o_proj(output_states) + + if not output_attentions: + attn_weights = None + + return output_states, attn_weights, past_key_value + + +class Ernie35DecoderLayer(nn.Layer): + def __init__(self, config, has_ffn=True, has_mha=True, parallel_attn_ffn=False): + super().__init__() + self.hidden_size = config.hidden_size + self.attn_ffn_no_mha = not parallel_attn_ffn and not has_mha + + Norm = LayerNorm + if config.fuse_ln: + Norm = FusedLayerNorm + + self.self_attn_mlp = self.self_attn = self.mlp = None + if parallel_attn_ffn: + logger.info("using parallel-attn") + self.self_attn_mlp = Ernie35MLPAttention(config) + self.input_layernorm = Norm(config) + self.residual_add1 = FusedDropoutAdd(0.0, mode="upscale_in_train") + else: + logger.info("using normal-attn") + if has_mha: + self.self_attn = Ernie35Attention(config) + self.residual_add1 = 
FusedDropoutAdd(0.0, mode="upscale_in_train") + self.input_layernorm = Norm(config) + + if has_ffn: + self.mlp = Ernie35MLP(config) + self.post_attention_layernorm = Norm(config) + self.residual_add2 = FusedDropoutAdd(0.0, mode="upscale_in_train") + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + use_cache: Optional[bool] = False, + ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]: + """ + Args: + hidden_states (`paddle.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` + attention_mask (`paddle.Tensor`, *optional*): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + use_cache (`bool`, *optional*): + If set to `True`, `cache` key value states are returned and can be used to speed up decoding + (see `cache`). + cache (`Tuple(paddle.Tensor)`, *optional*): cached past key and value projection states + """ + if self.self_attn_mlp is not None: + + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + + # Self Attention + hidden_states, self_attn_weights, present_key_value = self.self_attn_mlp( + hidden_states=hidden_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = self.residual_add1(hidden_states, residual) + else: + if self.self_attn is not None: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + + # Self Attention + hidden_states, self_attn_weights, present_key_value = self.self_attn( + hidden_states=hidden_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = self.residual_add1(hidden_states, residual) + + if self.mlp is not None: + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = self.residual_add2(hidden_states, residual) + + if self.attn_ffn_no_mha: + present_key_value = None + + outputs = (hidden_states,) + if output_attentions: + outputs += (self_attn_weights,) + + if use_cache: + outputs += (present_key_value,) + + # remove empty tuple for pipeline parallel + if type(outputs) is tuple and len(outputs) == 1: + outputs = outputs[0] + + return outputs + + +class Ernie35PretrainedModel(PretrainedModel): + config_class = Ernie35Config + base_model_prefix = "ernie" + + @classmethod + def _get_tensor_parallel_mappings(cls, config, is_split=True): + + from conversion_utils import ( + o_proj_merge_fn, + o_proj_split_fn, + qkv_gate_up_proj_merge_fn, + qkv_gate_up_proj_split_fn, + ) + + from paddlenlp.transformers.conversion_utils import split_or_merge_func + + fn = split_or_merge_func( + is_split=is_split, + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + num_attention_heads=config.num_attention_heads, + ) + qkv_gate_up_proj_fn = qkv_gate_up_proj_split_fn if is_split else qkv_gate_up_proj_merge_fn + fuse_qkvgu_fn = 
qkv_gate_up_proj_fn( # is_column: True + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + num_heads=config.num_attention_heads, + ) + o_proj_fn = o_proj_split_fn if is_split else o_proj_merge_fn + fuse_o_proj_fn = o_proj_fn( # is_column: False + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + ) + + def get_tensor_parallel_split_mappings(num_layers, parallel_attn_hatf): + final_actions = {} + base_actions = { + # Column Linear + "layers.0.self_attn_mlp.qkv_gate_up_proj.weight": fuse_qkvgu_fn, + "lm_head.weight": partial(fn, is_column=True), + # Row Linear + "embed_tokens.weight": partial(fn, is_column=False), + "layers.0.self_attn_mlp.o_proj.weight": fuse_o_proj_fn, + } + if config.use_bias: + base_actions.update( + { + # Column Linear + "layers.0.self_attn_mlp.qkv_gate_up_proj.bias": fuse_qkvgu_fn, + "lm_head.bias": partial(fn, is_column=True), + } + ) + + start = 0 if not parallel_attn_hatf else 1 + end = num_layers if not parallel_attn_hatf else num_layers - 1 + for key, action in base_actions.items(): + if "layers.0." in key: + for i in range(start, end): + final_actions[key.replace("layers.0.", f"layers.{i}.")] = action + if "layers.0." not in key: + final_actions[key] = action + + if parallel_attn_hatf: + # Layer 0 + final_actions["layers.0.self_attn.q_proj.weight"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.k_proj.weight"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.v_proj.weight"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.o_proj.weight"] = partial(fn, is_column=False) + if config.use_bias: + final_actions["layers.0.self_attn.q_proj.bias"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.k_proj.bias"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.v_proj.bias"] = partial(fn, is_column=True) + # Layer num_layers - 1 + final_actions[f"layers.{num_layers - 1}.mlp.gate_proj.weight"] = partial(fn, is_column=True) + final_actions[f"layers.{num_layers - 1}.mlp.up_proj.weight"] = partial(fn, is_column=True) + final_actions[f"layers.{num_layers - 1}.mlp.down_proj.weight"] = partial(fn, is_column=False) + if config.use_bias: + final_actions[f"layers.{num_layers - 1}.mlp.gate_proj.bias"] = partial(fn, is_column=True) + final_actions[f"layers.{num_layers - 1}.mlp.up_proj.bias"] = partial(fn, is_column=True) + + return final_actions + + mappings = get_tensor_parallel_split_mappings(config.num_hidden_layers, config.parallel_attn_hatf) + + return mappings + + def _init_weights(self, layer): + """Initialization hook""" + if self.config.tensor_parallel_degree > 1: + rng_tracker = get_rng_state_tracker().rng_state + else: + rng_tracker = contextlib.nullcontext + + if isinstance( + layer, + ( + ColumnParallelLinear, + RowParallelLinear, + VocabParallelEmbedding, + Ernie35LMHead, + nn.Embedding, + nn.Linear, + ), + ): + + with rng_tracker(): + dtype = paddle.get_default_dtype() + # layer.weight.set_value( + # paddle.randn(layer.weight.shape, dtype=dtype).scale(self.config.initializer_range) + # ) + tmp = paddle.randn(layer.weight.shape, dtype=dtype).scale(self.config.initializer_range) + src_tensor = tmp.value().get_tensor() + layer.weight.value().get_tensor()._share_data_with(src_tensor) + # layer.weight.copy_(tmp, True) + + 
logger.info( + f'dist-init-fc: shape={layer.weight.shape}, range={self.config.initializer_range},type={type(layer)},norm={layer.weight.astype("float32").norm().item()}' + ) + + elif isinstance(layer, RotaryEmbedding): + head_dim = self.config.hidden_size // self.config.num_attention_heads + inv_freq = 1.0 / (layer.base ** (np.arange(0, head_dim, 2).astype("float32") / head_dim)) + # higher acc using float32 + t = np.arange(layer.max_position_embeddings, dtype="float32") + freqs = np.einsum("i,j->ij", t, inv_freq) + emb = np.concatenate([freqs, freqs], axis=-1) + # [bs, seqlen, nhead, head_dim] + cos_cached = np.cos(emb)[:, :] + sin_cached = np.sin(emb)[:, :] + layer.cos_cached.set_value(cos_cached) + layer.sin_cached.set_value(sin_cached) + + +@register_base_model +class Ernie35Model(Ernie35PretrainedModel): + """ + Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Ernie35DecoderLayer`] + Args: + config: Ernie35Config + """ + + def __init__(self, config: Ernie35Config): + super().__init__(config) + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + self.hidden_size = config.hidden_size + self.config = config + + if config.tensor_parallel_degree > 1: + self.embed_tokens = VocabParallelEmbedding( + self.vocab_size, + self.hidden_size, + ) + else: + self.embed_tokens = nn.Embedding( + self.vocab_size, + self.hidden_size, + ) + + self.layers = nn.LayerList() + for i in range(config.num_hidden_layers): + if not config.parallel_attn_hatf: + _layer = Ernie35DecoderLayer(config, has_ffn=False, has_mha=False, parallel_attn_ffn=True) + else: + # no ffn in fisrt layer + # no mha in last layer + # Head mhA tail Ffn + if i == 0: + _layer = Ernie35DecoderLayer(config, has_ffn=False, has_mha=True, parallel_attn_ffn=False) + elif i == (config.num_hidden_layers - 1): + _layer = Ernie35DecoderLayer(config, has_ffn=True, has_mha=False, parallel_attn_ffn=False) + else: + _layer = Ernie35DecoderLayer(config, has_ffn=False, has_mha=False, parallel_attn_ffn=True) + + self.layers.append(_layer) + logger.info( + f"building layer:{i}, mha={_layer.self_attn is not None},ffn={_layer.mlp is not None},paral-mha-ffn={_layer.self_attn_mlp is not None}" + ) + + Norm = LayerNorm + if config.fuse_ln: + Norm = FusedLayerNorm + self.norm = Norm(config) + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, value): + self.embed_tokens = value + + @classmethod + def _prepare_decoder_attention_mask(cls, attention_mask, input_shape, past_key_values_length, dtype): + # create causal mask + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + combined_attention_mask = None + if input_shape[-1] > 1: + combined_attention_mask = _make_causal_mask( + input_shape, past_key_values_length=past_key_values_length, dtype=dtype + ) + + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + expanded_attn_mask = _expand_mask(attention_mask, dtype, tgt_length=input_shape[-1]) + combined_attention_mask = ( + expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask + ) + combined_attention_mask = paddle.maximum( + combined_attention_mask.astype(dtype), paddle.to_tensor(float(finfo(dtype).min), dtype=dtype) + ) + return combined_attention_mask + + @paddle.jit.not_to_static + def recompute_training(self, layer_module, hidden_states, attention_mask, position_ids, output_attentions): + def create_custom_forward(module): + def custom_forward(*inputs): + return 
module(*inputs, output_attentions, None) + + return custom_forward + + hidden_states = recompute( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + position_ids, + use_reentrant=False, + ) + return hidden_states + + def forward( + self, + input_ids=None, + position_ids=None, + attention_mask=None, + inputs_embeds=None, + use_cache=None, + past_key_values=None, + output_attentions=False, + output_hidden_states=None, + return_dict=False, + **kwargs, + ): + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # retrieve input_ids and inputs_embeds + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.shape + elif inputs_embeds is not None: + batch_size, seq_length, _ = inputs_embeds.shape + else: + raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds") + + if past_key_values is None: + past_key_values = tuple([None] * len(self.layers)) + + seq_length_with_past = seq_length + cache_length = 0 + if past_key_values[0] is not None: + cache_length = paddle.shape(past_key_values[0][0])[1] + seq_length_with_past += cache_length + if inputs_embeds is None: + inputs_embeds = self.embed_tokens(input_ids).astype(self.embed_tokens.weight.dtype) + + # embed positions + if attention_mask is None: + attention_mask = paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) + + attention_mask = self._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), cache_length, inputs_embeds.dtype + ) + hidden_states = inputs_embeds + + # decoder layers + all_hidden_states = () if output_hidden_states else None + all_self_attns = () if output_attentions else None + next_decoder_cache = () if use_cache else None + + for idx, (decoder_layer) in enumerate(self.layers): + if output_hidden_states: + all_hidden_states += (hidden_states,) + + past_key_value = past_key_values[idx] if past_key_values is not None else None + has_gradient = not hidden_states.stop_gradient + if self.config.use_recompute and has_gradient: + layer_outputs = self.recompute_training( + decoder_layer, + hidden_states, + attention_mask, + position_ids, + output_attentions, + ) + else: + layer_outputs = decoder_layer( + hidden_states, + attention_mask, + position_ids, + output_attentions, + past_key_value, + use_cache, + ) + + if isinstance(layer_outputs, (tuple, list)): + hidden_states = layer_outputs[0] + else: + hidden_states = layer_outputs + + if use_cache: + next_decoder_cache += (layer_outputs[2 if output_attentions else 1],) + + if output_attentions: + all_self_attns += (layer_outputs[1],) + + hidden_states = self.norm(hidden_states) + + # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + + next_cache = next_decoder_cache if use_cache else None + + if not return_dict: + return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + 
past_key_values=next_cache, + hidden_states=all_hidden_states, + attentions=all_self_attns, + cross_attentions=None, + ) + + +class Ernie35PretrainingCriterion(paddle.nn.Layer): + """ + Criterion for Ernie35. + It calculates the final loss. + """ + + def __init__(self, config, return_tuple=True): + super(Ernie35PretrainingCriterion, self).__init__() + self.ignored_index = getattr(config, "ignored_index", -100) + self.config = config + self.return_tuple = return_tuple + self.enable_parallel_cross_entropy = config.tensor_parallel_degree > 1 and config.tensor_parallel_output + + if self.enable_parallel_cross_entropy: + logger.info("using parallel cross entroy, take care") + self.loss_func = fleet.meta_parallel.ParallelCrossEntropy() + else: + self.loss_func = paddle.nn.CrossEntropyLoss( + reduction="none", + ) + + def forward(self, prediction_scores, masked_lm_labels): + if self.enable_parallel_cross_entropy: + assert ( + prediction_scores.shape[-1] != self.config.vocab_size + ), f"enable_parallel_cross_entropy, the vocab_size should be splited: {prediction_scores.shape[-1]}, {self.config.vocab_size}" + + with paddle.amp.auto_cast(False): + masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2)) + lossmask = masked_lm_labels != self.ignored_index + if (~lossmask).all(): + loss = paddle.mean(masked_lm_loss) * 0.0 + loss_sum = masked_lm_loss.sum().detach() + else: + lossmask = lossmask.reshape([-1]).cast(paddle.float32) + masked_lm_loss = paddle.sum(masked_lm_loss.cast(paddle.float32).reshape([-1]) * lossmask) + loss = masked_lm_loss / lossmask.sum() + loss_sum = masked_lm_loss.sum().detach() + if not self.return_tuple: + if self.training: + return loss + return loss_sum + return loss, loss_sum + + +class Ernie35LMHead(nn.Layer): + def __init__(self, config): + super(Ernie35LMHead, self).__init__() + self.config = config + if config.tensor_parallel_degree > 1: + vocab_size = config.vocab_size // config.tensor_parallel_degree + else: + vocab_size = config.vocab_size + + self.weight = self.create_parameter( + shape=[vocab_size, config.hidden_size] if config.tie_word_embeddings else [config.hidden_size, vocab_size], + dtype=paddle.get_default_dtype(), + ) + if config.weight_share_add_bias and config.use_bias: + self.bias = self.create_parameter( + shape=[vocab_size], + dtype=paddle.get_default_dtype(), + ) + else: + self.bias = None + + # Must set distributed attr for Tensor Parallel ! 
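+        # When the vocab is split across tensor-parallel ranks, the weight/bias here only hold one shard,
+        # so they are flagged `is_distributed` and `split_axis` records which axis was sharded.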
+ self.weight.is_distributed = True if (vocab_size != config.vocab_size) else False + if config.weight_share_add_bias and config.use_bias: + self.bias.is_distributed = True if (vocab_size != config.vocab_size) else False + + if self.weight.is_distributed: + self.weight.split_axis = 1 + if config.weight_share_add_bias and config.use_bias and self.bias.is_distributed: + self.bias.split_axis = 0 + + def forward(self, hidden_states, tensor_parallel_output=None): + if tensor_parallel_output is None: + tensor_parallel_output = self.config.tensor_parallel_output + logits = parallel_matmul( + hidden_states, + self.weight, + bias=self.bias, + transpose_y=self.config.tie_word_embeddings, + tensor_parallel_degree=self.config.tensor_parallel_degree, + tensor_parallel_output=tensor_parallel_output, + fuse_linear=self.config.fuse_linear, + ) + return logits + + +class Ernie35ForCausalLM(Ernie35PretrainedModel): + _keys_to_ignore_on_load_missing = [r"lm_head.weight"] + + def __init__(self, config): + super().__init__(config) + # initialize-trick for big model, see https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/README.md#std-init + new_initializer_range = math.sqrt(0.3333 / config.hidden_size) + logger.info(f"change initializer-range from {config.initializer_range} to {new_initializer_range}") + config.initializer_range = new_initializer_range + self.config = config + + self.ernie = Ernie35Model(config) + self.lm_head = Ernie35LMHead(config) + self.criterion = Ernie35PretrainingCriterion(config) + + self.tie_weights() # maybe weight share + + # Initialize weights and apply final processing + def _post_init(self, original_init, *args, **kwargs): + super()._post_init(self, original_init, *args, **kwargs) + factor = 1 / math.sqrt(2 * self.config.num_hidden_layers) + logger.info(f"using post init div: factor:{factor}") + with paddle.no_grad(): + for l in self.ernie.layers: + if l.self_attn_mlp is not None: + l.self_attn_mlp.o_proj.weight.scale_(factor) + if l.self_attn is not None: + l.self_attn.o_proj.weight.scale_(factor) + if l.mlp is not None: + l.mlp.down_proj.weight.scale_(factor) + + def get_input_embeddings(self): + return self.ernie.embed_tokens + + def set_input_embeddings(self, value): + self.ernie.embed_tokens = value + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def set_decoder(self, decoder): + self.ernie = decoder + + def get_decoder(self): + return self.ernie + + @staticmethod + def prepare_attention_mask_for_generation(input_ids, pad_token_id, eos_token_id): + is_pad_token_in_inputs_ids = (pad_token_id is not None) and paddle.any( + input_ids == pad_token_id + ).numpy().item() + is_pad_token_not_equal_to_eos_token_id = (eos_token_id is None) or ( + (eos_token_id is not None) and (pad_token_id != eos_token_id) + ) + if is_pad_token_in_inputs_ids and is_pad_token_not_equal_to_eos_token_id: + attention_mask = (input_ids != pad_token_id).astype("int64") + else: + attention_mask = paddle.ones_like(input_ids, dtype="int64") + return attention_mask + + def prepare_inputs_for_generation( + self, input_ids, use_cache=False, past_key_values=None, inputs_embeds=None, **kwargs + ): + if past_key_values: + input_ids = input_ids[:, -1:] + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + # if `inputs_embeds` are passed, we only want to use them in the 1st generation step + if inputs_embeds is not None and 
past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "past_key_values": past_key_values, + "use_cache": True, # use_cache, + "attention_mask": attention_mask, + "position_ids": position_ids, + "return_dict": True, + } + ) + return model_inputs + + @staticmethod + def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False): + # update cache + if isinstance(outputs, tuple) and len(outputs) > 1 and not isinstance(outputs[1], paddle.Tensor): + model_kwargs["past_key_values"] = outputs[1] + + if isinstance(outputs, CausalLMOutputWithCrossAttentions) and "past_key_values" in outputs: + model_kwargs["past_key_values"] = outputs.past_key_values + + # update token_type_ids with last value + if "token_type_ids" in model_kwargs and model_kwargs["token_type_ids"] is not None: + token_type_ids = model_kwargs["token_type_ids"] + model_kwargs["token_type_ids"] = paddle.concat([token_type_ids, token_type_ids[:, -1:]], axis=-1) + + interval = int(os.getenv("pos_decoding_interval", "2")) + # update position_ids + if "position_ids" in model_kwargs and model_kwargs["position_ids"] is not None: + position_ids = model_kwargs["position_ids"] + model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[:, -1:] + interval], axis=-1) + + if not is_encoder_decoder: + # update attention mask + if "attention_mask" in model_kwargs: + attention_mask = model_kwargs["attention_mask"] + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype="int64")], axis=-1 + ) + # update role_ids + if "role_ids" in model_kwargs and model_kwargs["role_ids"] is not None: + role_ids = model_kwargs["role_ids"] + model_kwargs["role_ids"] = paddle.concat([role_ids, role_ids[:, -1:]], axis=-1) + + return model_kwargs + + def forward( + self, + input_ids, + position_ids=None, + attention_mask=None, + inputs_embeds=None, + labels=None, + use_cache=False, + past_key_values=None, + output_attentions=None, + output_hidden_states=None, + return_dict=False, + ignored_index=0, # no use + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + def progressive_seq(x, y): + globel_step = int(os.getenv("TRAINER_GLOBAL_STEP", "0")) + if globel_step < 500: + return x[:, :512], y[:, :512] + if globel_step < 1000: + return x[:, :1024], y[:, :1024] + if globel_step < 1500: + return x[:, :2048], y[:, :2048] + return x, y + + if self.config.use_progressive_seq_len and self.training: + input_ids, labels = progressive_seq(input_ids, labels) + + outputs = self.ernie( + input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + past_key_values=past_key_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + hidden_states = outputs[0] + # if labels is None,means we need full output, instead of tensor_parallel_output + # tensor_parallel_output is togather with ParallelCrossEntropy + # tensor_parallel_output = ( + # self.config.tensor_parallel_output and labels is not None and self.config.tensor_parallel_degree > 1 + # ) + + logits = self.lm_head( + 
hidden_states, + ) # tensor_parallel_output=tensor_parallel_output) + + if return_dict: + if labels is not None: + loss, _ = self.criterion(logits, labels) + else: + loss = None + return CausalLMOutputWithCrossAttentions( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + assert labels is not None + loss, loss_sum = self.criterion(logits, labels) + return loss, loss_sum diff --git a/llm/ernie-3.5-se/predict_generation.py b/llm/ernie-3.5-se/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..4120b1862a41eb5c4869b973441e7ffd0c23dc44 --- /dev/null +++ b/llm/ernie-3.5-se/predict_generation.py @@ -0,0 +1,212 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json +import os +import sys + +sys.path.append("../../") + +import paddle +from modeling import Ernie35Config, Ernie35ForCausalLM +from paddle.distributed import fleet +from tokenizer import Ernie35Tokenizer + + +def get_parser(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default="baidu/ernie-3.5-se-3b", help="The directory of model.") + parser.add_argument("--tokenizer_name_or_path", default="ernie-tokenizer", help="The directory of tokenizer.") + parser.add_argument( + "--merge_tensor_parallel_path", default=None, help="The directory of model to merge tensor parallel parts." 
+ ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=50, help="the max length of source text") + parser.add_argument("--tgt_length", type=int, default=100, help="the max length of decoding length") + + parser.add_argument("--top_k", type=int, default=3, help="top_k parameter for generation") + parser.add_argument("--top_p", type=float, default=0.8, help="top_p parameter for generation") + parser.add_argument("--temperature", type=float, default=0.95, help="top_p parameter for generation") + parser.add_argument("--data_file", default=None, help="data file directory") + parser.add_argument("--predict_file", default="prediction.json", help="predict result file directory") + parser.add_argument("--device", type=str, default="gpu", help="Device") + return parser + + +def parse_arguments(): + parser = get_parser() + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args=None, tokenizer=None, model=None, **kwargs): + if args is None: + self.tokenizer = tokenizer + self.model = model + self.src_length = kwargs["src_length"] + self.tgt_length = kwargs["tgt_length"] + else: + self.tokenizer = Ernie35Tokenizer.from_pretrained(args.tokenizer_name_or_path) + self.tokenizer.pad_token = self.tokenizer.unk_token + self.batch_size = args.batch_size + self.args = args + self.src_length = self.args.src_length + self.tgt_length = self.args.tgt_length + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = 0 + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + self.rank = tensor_parallel_rank + + config = Ernie35Config.from_pretrained(args.model_name_or_path) + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = (tensor_parallel_rank,) + dtype = "float16" if config.dtype is None else config.dtype + use_flash_attn = False if dtype == "float32" else True + self.model = Ernie35ForCausalLM.from_pretrained( + args.model_name_or_path, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + load_state_as_np=True, + dtype=dtype, + use_flash_attention=use_flash_attn, + ) + print(self.model) + + self.model.eval() + + def preprocess(self, input_text): + inputs = self.tokenizer( + input_text, + padding=True, + return_tensors="np", + max_length=self.src_length, + return_attention_mask=True, + return_position_ids=True, + ) + inputs_tensor = {} + for key, value in inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + + return inputs_tensor + + def infer(self, inputs): + if self.src_length + self.tgt_length > 4096: + self.model.pos_decoding_interval = max( + self.model.config.max_position_embeddings // (self.tgt_length + self.src_length), 1 + ) + else: + self.model.pos_decoding_interval = 8 + if self.model.pos_decoding_interval > 1 and "position_ids" in inputs: + inputs["position_ids"] = inputs["position_ids"] * 
self.model.pos_decoding_interval + ( + self.model.pos_decoding_interval - 1 + ) + + os.environ["pos_decoding_interval"] = str(self.model.pos_decoding_interval) + + if self.model.config.dtype == "float32" or self.model.config.dtype is None: + with paddle.no_grad(): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + decode_strategy="sampling", + temperature=self.args.temperature, + top_k=self.args.top_k, + top_p=self.args.top_p, + repetition_penalty=1.0, + ) + else: + white = ["flash_attn"] + with paddle.no_grad(): + with paddle.amp.auto_cast(True, custom_white_list=white, level="O2", dtype=self.model.config.dtype): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + decode_strategy="sampling", + temperature=self.args.temperature, + top_k=self.args.top_k, + top_p=self.args.top_p, + repetition_penalty=1.0, + ) + result = result[0] + return result + + def postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.decode(x, skip_special_tokens=True) + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + paddle.set_device(args.device) + predictor = Predictor(args) + if args.data_file is None: + all_texts = [ + "The Broncos took an early lead in Super Bowl 50 and never trailed. Newton was limited by Denver's defense, ", + "Hello, it is a great day! nice to ", + "a b c d e f g ", + ] + else: + all_texts = [] + with open(args.data_file, "r", encoding="utf-8") as f: + for line in f: + example = json.loads(line) + context = example["src"][0] if isinstance(example["src"], list) else example["src"] + all_texts.append(context) + batch_texts = batchfy_text(all_texts, args.batch_size) + with open(args.predict_file, "w", encoding="utf-8") as f: + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print("{}\n{}".format(text, result)) + out = {"src": text, "output": result} + f.write(json.dumps(out, ensure_ascii=False) + "\n") + + if args.merge_tensor_parallel_path is not None: + predictor.model.save_pretrained( + save_dir=args.merge_tensor_parallel_path, + merge_tensor_parallel=True, + ) + predictor.tokenizer.save_pretrained(args.merge_tensor_parallel_path) diff --git a/llm/ernie-3.5-se/run_pretrain.py b/llm/ernie-3.5-se/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..5e08185f7bb4c14f916a611c6e8455ad2dd65118 --- /dev/null +++ b/llm/ernie-3.5-se/run_pretrain.py @@ -0,0 +1,465 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +GPT/Ernie35 pretraining scripts. 
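+Builds the Ernie35 model, tokenizer and mmap-backed train/valid/test datasets, then drives them with the PaddleNLP Trainer.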
+""" +import math +import os +import random +import time +from dataclasses import dataclass, field +from typing import Optional + +import numpy as np +import paddle +from ernie_dataset import build_train_valid_test_datasets, print_rank_0 +from modeling import Ernie35Config, Ernie35ForCausalLM +from tokenizer import Ernie35Tokenizer + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( # Ernie35Config,; Ernie35ForCausalLM, + CosineAnnealingWithWarmupDecay, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": ( + Ernie35Config, + Ernie35ForCausalLM, + ), +} + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. 
+ """ + + model_type: Optional[str] = field( + default="ernie", metadata={"help": "Only support for ernie pre-training for now."} + ) + model_name_or_path: str = field( + default="__internal_testing__/tiny-random-ernie", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + use_flash_attention: bool = field( + default=False, + metadata={"help": "use_flash_attention"}, + ) + use_fused_ln: bool = field( + default=False, + metadata={"help": "ernie, use_fused_ln"}, + ) + fuse_attention_qkv: bool = field( + default=True, + metadata={"help": "gpt, fuse_attention_qkv"}, + ) + recompute_granularity: str = field( + default="full", + metadata={"help": "full core_attn"}, + ) + virtual_pp_degree: int = field( + default=1, + metadata={"help": "virtual_pp_degree"}, + ) + + continue_training: bool = field( + default=False, + metadata={ + "help": "Pre-training from existing paddlenlp model weights. Default Fasle and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." + }, + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, +): + + train_valid_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + print_rank_0(" train: {}".format(train_valid_test_num_samples[0])) + print_rank_0(" validation: {}".format(train_valid_test_num_samples[1])) + print_rank_0(" test: {}".format(train_valid_test_num_samples[2])) + + # Build the datasets. + train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl="mmap", + splits_string=data_args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=False, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + logger.info(tokenizer._decode(input_ids)) + # logger.info(tokenizer._decode(labels)) + # logger.info(tokenizer.convert_ids_to_tokens(input_ids)) + + # eod_token = tokenizer.eos_token_id + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn(x["text"] for x in data) + # Unpack. + tokens_ = paddle.to_tensor(tokens_, dtype="int64") + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + # Loss mask. 
+ # loss_mask = paddle.ones(tokens.shape, dtype=paddle.float32) + # loss_mask[data == eod_token] = 0.0 + + return { + "input_ids": tokens, + # "token_type_ids": out[1], + # "attention_mask": out[2], + # "loss_mask": loss_mask, + "labels": labels, + } + + print_dataset(train_dataset[0]) + print_dataset(valid_dataset[0]) + print_dataset(test_dataset[0]) + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + # keep eval_dataloader + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. 
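+        # The evaluation loop below is capped at `max_eval_iters=self.args.eval_iters` batches (set to 10 in main()),
+        # which keeps evaluation cost bounded during long pre-training runs.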
+ compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = Ernie35Tokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + if not model_args.continue_training: + config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128) + logger.info(f"Reset vocab size to {config.vocab_size} for batter amp peformance.") + + config.use_flash_attention = model_args.use_flash_attention + config.fuse_ln = model_args.use_fused_ln + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.recompute_granularity = model_args.recompute_granularity + config.virtual_pp_degree = model_args.virtual_pp_degree + config.use_recompute = training_args.recompute + + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + + print("Final pre-training config:", config) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + if model_args.continue_training: + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + use_progressive_seq_len=True, + ) + else: + model = model_class._from_config(config, dtype=dtype) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + if training_args.warmup_steps > 0: + warmup_steps = training_args.warmup_steps + else: + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer + ) + + trainer = PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git 
a/llm/ernie-3.5-se/run_trainer_stage2.sh b/llm/ernie-3.5-se/run_trainer_stage2.sh new file mode 100644 index 0000000000000000000000000000000000000000..54d1ff7d391a16c1634b6c690d4058b2667b9631 --- /dev/null +++ b/llm/ernie-3.5-se/run_trainer_stage2.sh @@ -0,0 +1,70 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +set -x +unset CUDA_VISIBLE_DEVICES + +task_name="ernie35_hybrid" +# rm -rf output/$task_name/ +# rm -rf "output/$task_name""_log" + +data_dir="./data" + +PYTHONPATH=../../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "baidu/ernie-3.5-se-3b" \ + --tokenizer_name_or_path "ernie-tokenizer" \ + --input_dir "${data_dir}" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 4096 \ + --per_device_train_batch_size 2 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 2 \ + --use_flash_attention 1 \ + --use_fused_ln 1 \ + --amp_master_grad 0 \ + --bf16 1 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --virtual_pp_degree 1 \ + --learning_rate 0.0003 \ + --min_learning_rate 0.00003 \ + --lr_scheduler_type "cosine" \ + --max_steps 300000 \ + --save_steps 200 \ + --adam_beta2 0.95 \ + --weight_decay 0.1 \ + --warmup_steps 2000 \ + --max_grad_norm 1.0 \ + --logging_steps 1 \ + --dataloader_num_workers 0 \ + --eval_steps 50 \ + --report_to "visualdl" \ + --sharding "stage2" \ + --sharding_parallel_degree 8 \ + --disable_tqdm true \ + --continue_training 0 \ + --recompute 0 \ + --do_train \ + --do_eval \ + --save_total_limit 5 \ + --device "gpu" diff --git a/llm/ernie-3.5-se/tokenizer.py b/llm/ernie-3.5-se/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..0afbd2502d2ff3e2ce3ea7320e87d26c21f314a4 --- /dev/null +++ b/llm/ernie-3.5-se/tokenizer.py @@ -0,0 +1,193 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
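+
+# Ernie35Tokenizer wraps a SentencePiece (BPE) model: it loads `sentencepiece.bpe.model`,
+# pads on the left and, by default, prepends the bos token when building model inputs.
+# Minimal usage sketch (illustrative only; assumes a local `ernie-tokenizer` directory
+# containing the SentencePiece model, as used by predict_generation.py):
+#   tok = Ernie35Tokenizer.from_pretrained("ernie-tokenizer")
+#   inputs = tok("a b c d e f g", return_attention_mask=True, return_position_ids=True)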
+ +import os +from shutil import copyfile +from typing import List, Optional, Tuple + +import sentencepiece as spm + +from paddlenlp.transformers import PretrainedTokenizer +from paddlenlp.utils.log import logger + + +class Ernie35Tokenizer(PretrainedTokenizer): + model_input_names = ["input_ids", "attention_mask", "position_ids"] + resource_files_names = { + "vocab_file": "sentencepiece.bpe.model", + } + padding_side = "left" + + def __init__( + self, + vocab_file, + unk_token="", + bos_token="", + eos_token="", + add_bos_token=True, + add_eos_token=False, + sp_model_kwargs=None, + decode_with_prefix_space=False, + **kwargs + ): + self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs) + + if self.pad_token is None: + self.pad_token = self.unk_token + self.vocab_file = vocab_file + self.add_bos_token = add_bos_token + self.add_eos_token = add_eos_token + self.decode_with_prefix_space = decode_with_prefix_space + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(vocab_file) + + @property + def vocab_size(self): + """Returns vocab size""" + return self.sp_model.get_piece_size() + + @property + def bos_token_id(self) -> Optional[int]: + return self.sp_model.bos_id() + + @property + def eos_token_id(self) -> Optional[int]: + return self.sp_model.eos_id() + + def get_vocab(self): + """Returns vocab as a dict""" + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} + vocab.update(self.added_tokens_encoder) + return vocab + + def _tokenize(self, text): + """Returns a tokenized string.""" + return self.sp_model.encode(text, out_type=str) + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.sp_model.piece_to_id(token) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + token = self.sp_model.IdToPiece(index) + return token + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (string) in a single string.""" + current_sub_tokens = [] + out_string = "" + prev_is_special = False + for i, token in enumerate(tokens): + # make sure that special tokens are not decoded using sentencepiece model + if token in self.all_special_tokens: + if not prev_is_special and i != 0: + out_string += " " + out_string += self.sp_model.decode(current_sub_tokens) + token + prev_is_special = True + current_sub_tokens = [] + else: + current_sub_tokens.append(token) + prev_is_special = False + out_string += self.sp_model.decode(current_sub_tokens) + return out_string + + def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]: + """ + Save the vocabulary and special tokens file to a directory. + Args: + save_directory (`str`): + The directory in which to save the vocabulary. + Returns: + `Tuple(str)`: Paths to the files saved. 
+ """ + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + out_vocab_file = os.path.join( + save_directory, + (filename_prefix + "-" if filename_prefix else "") + self.resource_files_names["vocab_file"], + ) + + if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): + copyfile(self.vocab_file, out_vocab_file) + elif not os.path.isfile(self.vocab_file): + with open(out_vocab_file, "wb") as fi: + content_spiece_model = self.sp_model.serialized_model_proto() + fi.write(content_spiece_model) + + return (out_vocab_file,) + + def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): + if self.add_bos_token: + bos_token_ids = [self.bos_token_id] + else: + bos_token_ids = [] + + output = bos_token_ids + token_ids_0 + + if token_ids_1 is not None: + output = output + token_ids_1 + + if self.add_eos_token: + output = output + [self.eos_token_id] + + return output + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` method. + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if token_ids_1 is None: + return [1] + ([0] * len(token_ids_0)) + [1] + return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1] + + def create_token_type_ids_from_sequences( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make + use of token type ids, therefore a list of zeros is returned. + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + Returns: + `List[int]`: List of zeros. + """ + eos = [self.eos_token_id] + + if token_ids_1 is None: + return len(token_ids_0 + eos) * [0] + return len(token_ids_0 + eos + token_ids_1 + eos) * [0] diff --git a/llm/ernie-3.5-se/utils.py b/llm/ernie-3.5-se/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4260f80124ba92b370f54f97d1677522ea5c6442 --- /dev/null +++ b/llm/ernie-3.5-se/utils.py @@ -0,0 +1,231 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import time +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.optimizer.lr import LambdaDecay +from rouge import Rouge +from sklearn.metrics import accuracy_score + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import PrinterCallback, ProgressCallback, Trainer +from paddlenlp.trainer.integrations import TrainerCallback +from paddlenlp.utils.log import logger + + +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time + + +class BenchmarkCallback(TrainerCallback): + def __init__(self, benchmark=True, profiler_options=None): + self.benchmark = benchmark + self.profiler_options = profiler_options + + def on_train_begin(self, args, state, control, **kwargs): + assert args.gradient_accumulation_steps == 1 and not args.do_eval and not args.do_predict + if self.benchmark: + self.reader_cost_avg = AverageStatistical() + + def on_epoch_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.epoch_start = time.time() + self.batch_start = time.time() + + def on_step_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.reader_cost_avg.record(time.time() - self.batch_start) + + def on_step_end(self, args, state, control, **kwargs): + if self.benchmark: + self.batch_start = time.time() + if control.should_log: + self.maybe_log_save_evaluate_start = time.time() + + def on_log(self, args, state, control, logs=None, **kwargs): + if self.benchmark: + if logs is not None and "interval_steps_per_second" in logs: + self.batch_start = self.batch_start + (time.time() - self.maybe_log_save_evaluate_start) + ips = logs["interval_steps_per_second"] * args.train_batch_size + avg_batch_cost = 1 / logs["interval_steps_per_second"] + logger.info( + "global step %d / %d, loss: %f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sample/sec" + % ( + state.global_step, + state.max_steps, + logs["loss"], + self.reader_cost_avg.get_average(), + avg_batch_cost, + args.train_batch_size, + ips, + ) + ) + self.reader_cost_avg.reset() + + def on_epoch_end(self, args, state, control, **kwargs): + if self.benchmark: + train_epoch_cost = time.time() - self.epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (state.epoch, train_epoch_cost)) + + +class Ernie35Trainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + if self.args.benchmark or self.args.profiler_options is not None: + self.add_callback( + BenchmarkCallback(benchmark=self.args.benchmark, profiler_options=self.args.profiler_options) + ) + if self.args.benchmark: + if self.args.disable_tqdm: + self.pop_callback(PrinterCallback) + else: + self.pop_callback(ProgressCallback) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) 
-> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. + # keepdim in order to maintain the same shape as logits + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + model.eval() + + preds = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=self.args.tgt_length, + min_length=0, + use_cache=True, + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, + decode_strategy="sampling", + )[0] + all_labels = [] + for label in inputs["labels"].numpy(): + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(preds), paddle.to_tensor(all_labels)) + + def create_scheduler(self, num_training_steps: int): + num_warmup_steps = ( + self.args.warmup_steps if self.args.warmup_steps > 0 else self.args.warmup_ratio * num_training_steps + ) + + def lr_lambda(current_step: int): + if current_step < num_warmup_steps: + return float(current_step) / float(max(1, num_warmup_steps)) + else: + decay_step_ratio = (current_step - num_warmup_steps) / (num_training_steps - num_warmup_steps) + return 1.0 - (1.0 - self.args.lr_decay_ratio) * decay_step_ratio + + if self.lr_scheduler is None: + self.lr_scheduler = LambdaDecay(self.args.learning_rate, lr_lambda, last_epoch=-1) + return self.lr_scheduler + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(Ernie35Trainer, self).log(logs, **kwargs) + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + + rouge1 = round(rouge1, 4) + rouge2 = round(rouge2, 4) + rougel = round(rougel, 4) + bleu4 = round(bleu4.score(), 4) + return dict( + rouge1=rouge1, + rouge2=rouge2, + rougel=rougel, + bleu4=bleu4, + ) + + +def compute_metrics_not_do_generation(eval_preds): + flattened_preds = np.array(eval_preds.predictions).flatten() + flattened_labels = np.array(eval_preds.label_ids).flatten() + filtered_preds = flattened_preds[flattened_labels != -100] + filtered_labels = flattened_labels[flattened_labels != -100] + accuracy = accuracy_score(y_true=filtered_labels, y_pred=filtered_preds) + return { + "accuracy": accuracy, + } diff --git a/llm/export_model.py b/llm/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..167f558d763ed707e8c0527a8f29c9893ff8e9f7 --- /dev/null +++ b/llm/export_model.py @@ -0,0 +1,96 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
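To make the label-masking convention in `compute_metrics_not_do_generation` above concrete: positions labelled `-100` (prompt and padding tokens) are dropped before accuracy is computed. A small sketch with toy arrays (real `eval_preds` come from the trainer):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy predictions/labels; -100 marks positions to ignore, as in the helper above.
preds = np.array([[5, 7, 2], [9, 1, 3]]).flatten()
labels = np.array([[5, 7, -100], [9, 4, -100]]).flatten()

mask = labels != -100
print(accuracy_score(y_true=labels[mask], y_pred=preds[mask]))  # -> 0.75
```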
+from __future__ import annotations + +import os +from dataclasses import dataclass, field + +import paddle +from paddle.distributed import fleet +from predictor import ModelArgument, PredictorArgument, create_predictor +from tqdm import tqdm +from utils import generate_rank_mapping, get_infer_model_path + +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.utils.log import logger + + +@dataclass +class ExportArgument: + output_path: str = field(default=None, metadata={"help": "The output path of model."}) + + +def load_inference_model(model_path, model_name, param_name, exe): + model_abs_path = os.path.join(model_path, model_name) + param_abs_path = os.path.join(model_path, param_name) + if os.path.exists(model_abs_path) and os.path.exists(param_abs_path): + return paddle.static.io.load_inference_model(model_path, exe, model_name, param_name) + else: + return paddle.static.io.load_inference_model(model_path, exe) + + +def validate_pdmodel(model_path, model_prefix): + paddle.enable_static() + place = paddle.CUDAPlace(0) + exe = paddle.static.Executor(place) + scope = paddle.static.Scope() + + with paddle.static.scope_guard(scope): + net_program, feed_target_names, fetch_targets = paddle.static.io.load_inference_model( + os.path.join(model_path, model_prefix), exe + ) + + for block in net_program.blocks: + ops: list[paddle.framework.Operator] = block.ops + for op in tqdm(ops, desc="checking the validation of ops"): + if op.type.lower() == "print": + logger.warning(f"UNEXPECTED OP<{op.type}> which will reduce the performace of the static model") + + +def main(): + parser = PdArgumentParser((PredictorArgument, ModelArgument, ExportArgument)) + predictor_args, model_args, export_args = parser.parse_args_into_dataclasses() + + paddle.set_default_dtype(predictor_args.dtype) + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + # set predictor type + predictor = create_predictor(predictor_args, model_args, tensor_parallel_degree, tensor_parallel_rank) + predictor.model.eval() + + predictor.model.to_static( + get_infer_model_path(export_args.output_path, predictor_args.model_prefix), + {"dtype": predictor_args.dtype, "export_precache": predictor_args.export_precache}, + ) + predictor.model.config.save_pretrained(export_args.output_path) + predictor.tokenizer.save_pretrained(export_args.output_path) + generate_rank_mapping(os.path.join(export_args.output_path, "rank_mapping.csv")) + + validate_pdmodel(export_args.output_path, "model") + + +if __name__ == "__main__": + main() diff --git a/llm/finetune_generation.py b/llm/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..b190352e3705f0565e40be9e2f06f684d3c3261a --- /dev/null +++ b/llm/finetune_generation.py @@ -0,0 +1,452 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json +import os +import sys +from functools import partial + +import paddle +from argument import ( + DataArgument, + GenerateArgument, + ModelArgument, + QuantArgument, + TrainingArguments, +) +from data import get_convert_example +from utils import ( + CausalLMTrainer, + InTokensIterDatasetCallback, + compute_metrics, + get_lora_target_modules, + get_prefix_tuning_params, +) + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import InTokensIterableDataset, InTokensMapDataset, load_dataset +from paddlenlp.metrics import BLEU, Rouge1, Rouge2, RougeL +from paddlenlp.peft import LoRAConfig, LoRAModel, PrefixConfig, PrefixModelForCausalLM +from paddlenlp.trainer import PdArgumentParser, get_last_checkpoint +from paddlenlp.trainer.trainer_callback import TrainerState +from paddlenlp.transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, + LlamaTokenizer, +) +from paddlenlp.utils.log import logger + + +def read_local_dataset(path): + with open(path, "r", encoding="utf-8") as fp: + for line in fp: + yield json.loads(line.strip()) + + +def main(): + # Arguments + parser = PdArgumentParser((GenerateArgument, QuantArgument, ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + gen_args, quant_args, model_args, data_args, training_args = parser.parse_json_file( + json_file=os.path.abspath(sys.argv[1]) + ) + else: + gen_args, quant_args, model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.print_config(quant_args, "Quant") + training_args.print_config(gen_args, "Generation") + + if sum([quant_args.do_ptq, quant_args.do_qat, quant_args.do_gptq, training_args.do_train]) > 1: + raise ValueError( + "--do_train, --do_ptq, --do_gptq and --do_qat cannot work at the same time. Please choose only one at a time" + ) + + # Setup GPU & distributed training + paddle.set_device(training_args.device) + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+    )
+
+    # Load model
+    if training_args.fp16_opt_level == "O2":
+        if training_args.fp16:
+            dtype = "float16"
+        elif training_args.bf16:
+            dtype = "bfloat16"
+        else:
+            raise ValueError("Please specify dtype: --fp16 or --bf16")
+    else:
+        dtype = "float32"
+
+    if training_args.pipeline_parallel_degree > 1:
+        if data_args.eval_with_do_generation and training_args.do_eval:
+            raise ValueError("Please set eval_with_do_generation to false in pipeline parallel mode.")
+        from llama.modeling_pp import LlamaForCausalLMPipe
+
+        model = LlamaForCausalLMPipe.from_pretrained(
+            model_args.model_name_or_path,
+            tensor_parallel_output=False,
+            tensor_parallel_degree=training_args.tensor_parallel_degree,
+            tensor_parallel_rank=training_args.tensor_parallel_rank,
+            use_flash_attention=model_args.use_flash_attention,
+            dtype=dtype,
+        )
+    else:
+        model_config = AutoConfig.from_pretrained(
+            model_args.model_name_or_path,
+            tensor_parallel_output=False,
+            tensor_parallel_degree=training_args.tensor_parallel_degree,
+            tensor_parallel_rank=training_args.tensor_parallel_rank,
+            dtype=dtype,
+        )
+        if hasattr(model_config, "use_flash_attention"):
+            model_config.use_flash_attention = model_args.use_flash_attention
+        model = AutoModelForCausalLM.from_pretrained(
+            model_args.model_name_or_path,
+            config=model_config,
+        )
+
+    # Load tokenizer & dataset
+    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
+    if isinstance(tokenizer, LlamaTokenizer):
+        tokenizer.pad_token_id = tokenizer.eos_token_id
+
+    if data_args.dataset_name_or_path is None:
+        raise ValueError(f"Please specify dataset name or path (got {data_args.dataset_name_or_path})")
+    elif os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) and os.path.exists(
+        os.path.join(data_args.dataset_name_or_path, "dev.json")
+    ):
+        # train_ds, dev_ds = load_dataset(
+        #     "json",
+        #     data_files={
+        #         "train": os.path.join(data_args.dataset_name_or_path, "train.json"),
+        #         "dev": os.path.join(data_args.dataset_name_or_path, "dev.json"),
+        #     },
+        #     lazy=data_args.lazy,
+        # )
+        train_ds = load_dataset(
+            read_local_dataset,
+            path=os.path.join(data_args.dataset_name_or_path, "train.json"),
+            lazy=data_args.lazy,
+        )
+        dev_ds = load_dataset(
+            read_local_dataset,
+            path=os.path.join(data_args.dataset_name_or_path, "dev.json"),
+            lazy=data_args.lazy,
+        )
+
+    elif os.path.exists(os.path.join(data_args.dataset_name_or_path, "train")) and os.path.exists(
+        os.path.join(data_args.dataset_name_or_path, "dev")
+    ):
+        import glob
+
+        train_files = glob.glob(os.path.join(data_args.dataset_name_or_path, "train", "*.json"))
+        dev_files = glob.glob(os.path.join(data_args.dataset_name_or_path, "dev", "*.json"))
+        train_ds, dev_ds = load_dataset(
+            "json", data_files={"train": train_files, "dev": dev_files}, lazy=data_args.lazy
+        )
+    else:
+        if data_args.task_name is not None:
+            train_ds, dev_ds = load_dataset(
+                data_args.dataset_name_or_path, data_args.task_name, splits=["train", "dev"]
+            )
+        else:
+            train_ds, dev_ds = load_dataset(data_args.dataset_name_or_path, splits=["train", "dev"])
+    # TODO(ZHUI & sijunhe): Temporary implementation. Generalize this logic and move to Trainer later.
+    if training_args.resume_from_checkpoint is not None and data_args.lazy:
+        logger.info(
+            f"Loading from '{training_args.resume_from_checkpoint}' with `lazy=True`, manually skipping dataset and setting `ignore_data_skip` to True."
+ ) + training_args.ignore_data_skip = True + state = TrainerState.load_from_json(os.path.join(training_args.resume_from_checkpoint, "trainer_state.json")) + if state.trial_params is not None and "intokens_global_step" in state.trial_params: + consumed_samples = state.trial_params["intokens_global_step"] + else: + consumed_samples = ( + state.global_step + * training_args.per_device_train_batch_size + * training_args.gradient_accumulation_steps + * training_args.dataset_world_size + ) + logger.info( + f"Skipping the first {consumed_samples} samples to warmup the dataset from checkpoint '{training_args.resume_from_checkpoint}'." + ) + train_ds = train_ds.skip(consumed_samples) + + if training_args.pipeline_parallel_degree > 1: + from data import convert_example_common + + trans_func = partial(convert_example_common, tokenizer=tokenizer, data_args=data_args) + else: + trans_func = partial(get_convert_example(model), tokenizer=tokenizer, data_args=data_args) + if data_args.intokens: + if ( + model.base_model_prefix not in ["llama", "bloom", "chatglm", "chatglm_v2", "qwen"] + and training_args.pipeline_parallel_degree < 1 + ): + raise NotImplementedError( + "InTokens data stream is only implemented for LLaMA, Bloom, ChatGLM and QWen so far." + ) + train_ds = train_ds.map(partial(trans_func, is_test=False, intokens=data_args.intokens)) + eval_intokens = data_args.intokens + if data_args.intokens and data_args.eval_with_do_generation: + logger.warning( + "`intokens` conflicts with `eval_with_do_generation`. Setting intokens to False for the eval_dataset." + ) + eval_intokens = False + dev_ds = dev_ds.map(partial(trans_func, is_test=data_args.eval_with_do_generation, intokens=eval_intokens)) + if data_args.intokens: + if data_args.lazy: + intoken_dataset = InTokensIterableDataset + else: + intoken_dataset = InTokensMapDataset + logger.info("Creating InTokens Data Stream. 
This may take a few minutes.") + train_ds = intoken_dataset( + train_ds, + tokenizer=tokenizer, + max_length=data_args.max_length, + ) + if eval_intokens: + dev_ds = intoken_dataset( + dev_ds, + tokenizer=tokenizer, + max_length=data_args.max_length, + ) + + if model_args.prefix_tuning: + prefix_tuning_params = get_prefix_tuning_params(model) + prefix_config = PrefixConfig( + num_prefix_tokens=model_args.num_prefix_tokens, + num_attention_heads=prefix_tuning_params["num_attention_heads"], + num_hidden_layers=prefix_tuning_params["num_hidden_layers"], + hidden_size=prefix_tuning_params["hidden_size"], + multi_query_group_num=prefix_tuning_params["multi_query_group_num"], + dtype=dtype, + ) + model = PrefixModelForCausalLM( + model=model, + prefix_config=prefix_config, + postprocess_past_key_value=prefix_tuning_params["postprocess_past_key_value"], + ) + model.mark_only_prefix_as_trainable() + model.print_trainable_parameters() + + if model_args.lora: + if model_args.lora_path is None: + target_modules = get_lora_target_modules(model) + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=False, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + def compute_metrics_do_generation(eval_preds): + rouge1 = Rouge1() + rouge2 = Rouge2() + rougel = RougeL() + bleu4 = BLEU(n_size=4) + + predictions = [x[x != -100].tolist() for x in eval_preds.predictions] + references = [x[x != -100].tolist() for x in eval_preds.label_ids] + + predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True, clean_up_tokenization_spaces=False) + references = tokenizer.batch_decode(references, skip_special_tokens=True, clean_up_tokenization_spaces=False) + if data_args.save_generation_output: + with open(os.path.join(training_args.output_dir, "generated_output.json"), "w", encoding="utf-8") as f: + for pred, ref in zip(predictions, references): + out = {"output": pred, "tgt": ref} + f.write(json.dumps(out, ensure_ascii=False) + "\n") + + # for pred in predictions: + rouge1_score = rouge1.score(predictions, references) + rouge2_score = rouge2.score(predictions, references) + for pred, ref in zip(predictions, references): + rougel.add_inst(pred, [ref]) + bleu4.add_inst(pred, [ref]) + return { + "rouge1": rouge1_score, + "rouge2": rouge2_score, + "rougel": rougel.score(), + "bleu4": bleu4.score(), + } + + # Create trainer + max_length = data_args.max_length if training_args.pipeline_parallel_degree > 1 else None + padding = "max_length" if training_args.pipeline_parallel_degree > 1 else True + trainer = CausalLMTrainer( + model=model, + args=training_args, + train_dataset=train_ds, + eval_dataset=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics_do_generation if data_args.eval_with_do_generation else compute_metrics, + data_collator=DataCollatorForSeq2Seq( + tokenizer=tokenizer, + max_length=max_length, + padding=padding, + max_label_length=max_length, + return_tensors="np", + ), + do_generation=data_args.eval_with_do_generation, + callbacks=[InTokensIterDatasetCallback()] if isinstance(train_ds, InTokensIterableDataset) else None, + gen_args=gen_args, + data_args=data_args, + ) + + # Train + if training_args.do_train: + checkpoint = None + if 
training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + if training_args.benchmark: + total_effective_tokens = ( + sum([len(i["input_ids"]) for i in trainer.train_dataset]) * training_args.num_train_epochs + ) + effective_tokens_per_second = total_effective_tokens / train_result.metrics["train_runtime"] + logger.info(f"Effective_Tokens_per_second: {effective_tokens_per_second} ") + logger.info("Benchmark done.") + else: + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + # QAT + if quant_args.do_qat: + if training_args.tensor_parallel_degree > 1: + raise NotImplementedError("Only support qat on single gpu.") + from quant import create_qat_model + + trainer.model = create_qat_model(quant_args, trainer.model, dtype) + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + trainer.save_model() + trainer.log_metrics("qat", train_result.metrics) + trainer.save_metrics("qat", train_result.metrics) + trainer.save_state() + + # PTQ + if quant_args.do_ptq: + if isinstance(model, LoRAModel): + raise NotImplementedError( + "PTQ strategy not supported for LoRA model. Please merge lora parameters to pretrain model first." + ) + from quant import apply_ptq, apply_shift, apply_smooth, get_ptq_model_config + + trainer.model.eval() + # Prepare ptq dataloader + if os.path.exists(os.path.join(data_args.dataset_name_or_path, "quant.json")): + # ptq_ds = load_dataset( + # "json", data_files=os.path.join(data_args.dataset_name_or_path, "quant.json"), lazy=data_args.lazy, + # )[0] + ptq_ds = load_dataset( + read_local_dataset, + path=os.path.join(data_args.dataset_name_or_path, "quant.json"), + lazy=data_args.lazy, + ) + ptq_ds = ptq_ds.map(partial(trans_func, is_test=False)) + else: + ptq_ds = train_ds + logger.info( + f"Not found quant.json in {data_args.dataset_name_or_path}. Set train dataset as PTQ calibration dataset." + ) + ptq_dataloader = trainer.get_ptq_dataloader(ptq_ds) + if quant_args.shift or quant_args.smooth: + ptq_model_config = get_ptq_model_config(trainer.model) + + if quant_args.shift: + apply_shift(quant_args, trainer, ptq_dataloader, ptq_model_config) + + if quant_args.smooth: + apply_smooth(quant_args, trainer, ptq_dataloader, ptq_model_config) + + apply_ptq(quant_args, trainer, ptq_dataloader) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + + if quant_args.do_gptq: + if isinstance(model, LoRAModel): + raise NotImplementedError( + "PTQ strategy not supported for LoRA model. Please merge lora parameters to pretrain model first." + ) + from quant import apply_gptq + + # Prepare ptq dataloader + if os.path.exists(os.path.join(data_args.dataset_name_or_path, "quant.json")): + # ptq_ds = load_dataset( + # "json", data_files=os.path.join(data_args.dataset_name_or_path, "quant.json"), lazy=data_args.lazy, + # )[0] + ptq_ds = load_dataset( + read_local_dataset, + path=os.path.join(data_args.dataset_name_or_path, "quant.json"), + lazy=data_args.lazy, + ) + ptq_ds = ptq_ds.map(partial(trans_func, is_test=False)) + else: + ptq_ds = train_ds + logger.info( + f"Not found quant.json in {data_args.dataset_name_or_path}. Set train dataset as PTQ calibration dataset." 
+ ) + ptq_dataloader = trainer.get_ptq_dataloader(ptq_ds) + apply_gptq(quant_args, trainer, ptq_dataloader) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + + # Evaluation dev set + if training_args.do_eval: + eval_result = trainer.evaluate(dev_ds) + trainer.log_metrics("eval", eval_result) + + # Evaluation test set + if training_args.do_predict: + # test_ds = load_dataset( + # "json", data_files=os.path.join(data_args.dataset_name_or_path, "test.json"), lazy=data_args.lazy, + # )[0] + test_ds = load_dataset( + read_local_dataset, + path=os.path.join(data_args.dataset_name_or_path, "test.json"), + lazy=data_args.lazy, + ) + test_ds = test_ds.map(partial(trans_func, is_test=data_args.eval_with_do_generation)) + eval_result = trainer.predict(test_ds).metrics + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/flask_server.py b/llm/flask_server.py new file mode 100644 index 0000000000000000000000000000000000000000..7c2d0dc9bd183b447dbc2605105d7d87d9d8c6b1 --- /dev/null +++ b/llm/flask_server.py @@ -0,0 +1,219 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import json +import re +import time +from dataclasses import dataclass, field +from multiprocessing.shared_memory import SharedMemory + +from predictor import BasePredictor, ModelArgument, PredictorArgument, create_predictor + +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.utils.log import logger + + +@dataclass +class ServerArgument: + port: int = field(default=8011, metadata={"help": "The port of ui service"}) + flask_port: int = field(default=8010, metadata={"help": "The port of flask service"}) + title: str = field(default="LLM", metadata={"help": "The title of gradio"}) + + +def read_shared_memory(memory: SharedMemory): + """read content from shared memory + + Args: + memory (SharedMemory): the instance of shared Memory + """ + length = int(memory.buf[0]) * 256 + int(memory.buf[1]) + if length == 0: + return "" + + sentence = bytes(memory.buf[2 : length + 2]).decode() + return sentence + + +def write_shared_memory(memory: SharedMemory, sentence: str): + """write content into shared memory + + [0:2]: store the length of sentence + [2:]: store the content of sentence + + Args: + memory (SharedMemory): the instance of shared Memory + sentence (str): the content which must be string + """ + buffer = bytearray(memory.buf.nbytes) + data = sentence.encode("utf-8") + + buffer[0:2] = bytearray([len(data) // 256, len(data) % 256]) + buffer[2 : len(data) + 2] = data + memory.buf[:] = buffer + + +SLEEP_SECOND = 0.5 +SHARED_MEMORY_NAME = "shared_memory" + + +def create_shared_memory(name: int, rank: int): + """create shared memory between multi-process + + Args: + name (int): the name of memory block + rank (int): the rank of current process + """ + file = f"{SHARED_MEMORY_NAME}-{name}" + shared_memory = None + if rank != 0: + while True: + try: 
+ shared_memory = SharedMemory(file, size=1024 * 100) + print("success create shared_memory") + break + except FileNotFoundError: + time.sleep(0.01) + print("sleep for create shared memory") + else: + shared_memory = SharedMemory(file, create=True, size=1024 * 100) + return shared_memory + + +def enforce_stop_tokens(text, stop) -> str: + """Code by Langchain""" + """Cut off the text as soon as any stop words occur.""" + return re.split(re.escape(stop), text)[0] + + +class PredictorServer: + def __init__(self, args: ServerArgument, predictor: BasePredictor): + + self.predictor = predictor + self.args = args + + self.input_shared_memory = create_shared_memory("input", self.predictor.tensor_parallel_rank) + self.output_shared_memory = create_shared_memory("output", self.predictor.tensor_parallel_rank) + + if self.predictor.tensor_parallel_rank == 0: + write_shared_memory(self.input_shared_memory, "") + write_shared_memory(self.output_shared_memory, "") + + def predict(self, input_texts: str | list[str]): + return self.predictor.predict(input_texts) + + def start_predict(self, data): + print("start to predict under data", data) + + data = json.dumps(data, ensure_ascii=False) + write_shared_memory(self.input_shared_memory, data) + + while True: + result = read_shared_memory(self.output_shared_memory) + if result: + write_shared_memory(self.output_shared_memory, "") + return result + + else: + print("not found result, so to sleep ...") + + time.sleep(0.5) + + def start_flask_server(self): + from flask import Flask, jsonify, request + + app = Flask(__name__) + + @app.post("/api/chat") + def _server(): + data = request.get_json() + logger.info(f"Request: {json.dumps(data, indent=2, ensure_ascii=False)}") + try: + pred_seq = self.start_predict(data) + output = { + "error_code": 0, + "error_msg": "Success", + "result": {"response": {"role": "bot", "utterance": pred_seq}}, + } + except Exception as err: + logger.error(f"Server error: {err}") + output = {"error_code": 1000, "error_msg": f"Server error: {err}", "result": None} + + logger.info(f"Response: {json.dumps(output, indent=2, ensure_ascii=False)}") + return jsonify(output) + + app.run(host="0.0.0.0", port=self.args.flask_port) + + def start_ui_service(self, args): + # do not support start ui service in one command + from multiprocessing import Process + + from gradio_ui import main + + p = Process(target=main, args=(args,)) + p.daemon = True + p.start() + + +def main(args, server: PredictorServer): + from time import sleep + + while True: + sleep(0.5) + content = read_shared_memory(server.input_shared_memory) + + if content: + content = json.loads(content) + + context = content.pop("context", "") + content.pop("extra_info", None) + + generation_args = content + server.predictor.config.max_length = generation_args["max_length"] + server.predictor.config.top_p = generation_args["top_p"] + server.predictor.config.temperature = generation_args["temperature"] + server.predictor.config.top_k = generation_args["top_k"] + server.predictor.config.repetition_penalty = generation_args["repetition_penalty"] + + for key, value in generation_args.items(): + setattr(server.args, key, value) + + result = server.predict(context) + result = result[0] + if not result: + result = "invalid response" + write_shared_memory(server.output_shared_memory, result) + write_shared_memory(server.input_shared_memory, "") + + +if __name__ == "__main__": + + parser = PdArgumentParser((PredictorArgument, ModelArgument, ServerArgument)) + predictor_args, model_args, server_args = 
parser.parse_args_into_dataclasses() + predictor = create_predictor(predictor_args, model_args) + + server = PredictorServer(server_args, predictor) + + if server.predictor.tensor_parallel_rank == 0: + server.start_ui_service(server_args) + + from multiprocessing import Process + + p = Process( + target=server.start_flask_server, + ) + p.daemon = True + p.start() + + main(server_args, server) diff --git a/llm/glm/README.md b/llm/glm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..86bc69d571e6c1de3b71281e8993a10a0436d1bc --- /dev/null +++ b/llm/glm/README.md @@ -0,0 +1,102 @@ +# GLM + +## 1. 模型介绍 + +[General Language Model (GLM)](https://arxiv.org/abs/2103.10360) 是以自回归填空作为训练目标的通用语言模型,可用于各类理解和生成任务。 + +现有预训练框架包括以 BERT 为代表的自编码模型,以 GPT 为代表的自回归模型和以 T5 为代表的编码-解码模型。但这些框架均不能完全支持自然语言理解、无条件生成和条件生成这三类主要任务。为了解决这一问题,我们提出了基于自回归填空任务的通用语言模型(GLM)。GLM 使用 2D 位置编码和任意顺序预测改进了填空预训练过程,在自然语言理解任务上超越了 BERT 和 T5。同时,GLM 的预训练过程基于多种任务,填空长度和数量各不相同。在自然语言理解、无条件生成和条件生成任务上,GLM 均超过了具有相同参数规模和训练数据量的 BERT、T5 和 GPT 模型。除此之外,GLM 还以 BERT Large 1.25 倍参数量的规模取得了当前最优的效果,证明了其在不同下游任务上良好的泛化能力。 + + +**支持模型权重:** + +| Model | +|----------------------------------| +| THUDM/glm-large-chinese | +| THUDM/glm-10b-chinese | + +## 3. 模型精调 + +### SFT + +``` +python -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py \ +--model_name_or_path THUDM/glm-large-chinese \ +--num_train_epochs 4 \ +--learning_rate 3e-5 \ +--warmup_ratio 0.06 \ +--weight_decay 0.1 \ +--label_smoothing 0.1 \ +--save_steps 100 \ +--logging_steps 1 \ +--eval_steps 100 \ +--output_dir ./checkpoints/glm-large-chinese \ +--src_length 608 \ +--tgt_length 160 \ +--min_tgt_length 55 \ +--length_penalty 0.7 \ +--no_repeat_ngram_size 3 \ +--num_beams 5 \ +--select_topk True \ +--per_device_eval_batch_size 2 \ +--per_device_train_batch_size 2 \ +--max_grad_norm 1.0 \ +--lr_scheduler_type linear \ +--fp16 \ +--fp16_opt_level O2 \ +--recompute \ +--do_train \ +--do_eval +``` + +### 单卡LoRA微调 + +``` +python finetune_generation.py \ +--model_name_or_path THUDM/glm-large-chinese \ +--num_train_epochs 4 \ +--learning_rate 3e-5 \ +--warmup_ratio 0.06 \ +--weight_decay 0.1 \ +--label_smoothing 0.1 \ +--save_steps 100 \ +--logging_steps 1 \ +--eval_steps 100 \ +--output_dir ./checkpoints/glm-large-chinese \ +--src_length 608 \ +--tgt_length 160 \ +--min_tgt_length 55 \ +--length_penalty 0.7 \ +--no_repeat_ngram_size 3 \ +--num_beams 5 \ +--select_topk True \ +--per_device_eval_batch_size 2 \ +--per_device_train_batch_size 2 \ +--max_grad_norm 1.0 \ +--lr_scheduler_type linear \ +--fp16 \ +--fp16_opt_level O2 \ +--recompute \ +--do_train \ +--do_eval \ +--lora True +``` + +其中参数释义如下: + +- `model_name_or_path`: 预训练模型内置名称或者模型所在目录,默认为`THUDM/glm-large-chinese`。 +- `src_length`: 上下文的最大输入长度,默认为608. +- `tgt_length`: 生成文本的最大长度,默认为160. +- `min_tgt_length`: 生成文本的最小长度,默认为55. +- `length_penalty`: 生成解码时的长度惩罚因子,默认为0.7. +- `num_beams`: 搜索方向数量,默认为5。 +- `label_smoothing`: 标签平滑因子,默认为0.1. +- `lr_decay_ratio`: 学习率衰减因子,默认为0.1. +- `lora`: 是否使用LoRA技术. + + +## 3.4 动态图推理 + +``` +python predict_generation.py \ + --model_name_or_path THUDM/glm-large-chinese +``` diff --git a/llm/glm/data.py b/llm/glm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..40f5f3320a64cb25c1b78c6125e4accd0835d634 --- /dev/null +++ b/llm/glm/data.py @@ -0,0 +1,67 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +def custom_convert_example(example, tokenizer, data_args, is_test=True): + source = None + title = None + target = None + if "source" in example and "title" in example: + source = example["source"] + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." + if "target" in example.keys(): + target = example["target"] + elif "question" in example.keys(): + target = example["question"] + example["text_a"] = "答案:" + title + "," + "上下文:" + source + example["text_b"] = "在已知答案的前提下,问题:" + target + inputs = tokenizer.encode(example["text_a"], max_length=data_args.src_length - 1, truncation=True) + inputs["input_ids"] = inputs["input_ids"][:-1] + [tokenizer.gmask_token_id] + inputs["input_ids"][-1:] + pad_length = data_args.src_length - len(inputs["input_ids"]) + inputs["input_ids"] = np.array([inputs["input_ids"] + [tokenizer.pad_token_id] * pad_length]) + inputs["attention_mask"] = np.array([inputs["attention_mask"] + [1] + [0] * pad_length]) + sep = inputs["input_ids"].shape[1] + inputs = tokenizer.build_inputs_for_generation( + inputs, + max_gen_length=data_args.tgt_length, + targets=" " + example["text_b"] if not is_test else None, + padding="max_length", + ) + + for input_name in inputs.keys(): + inputs[input_name] = inputs[input_name].squeeze(0) + if is_test: + inputs["position_ids"] = inputs["position_ids"][:, : inputs["input_ids"].shape[-1]] + labels = tokenizer.encode( + " " + example["text_b"], add_special_tokens=False, max_length=data_args.tgt_length - 1 + )["input_ids"] + loss_mask = [0] * sep + [1] * len(labels) + [0] * (data_args.tgt_length - len(labels)) + labels = ( + [0] * sep + + labels + + [tokenizer.eop_token_id] + + [tokenizer.pad_token_id] * (data_args.tgt_length - len(labels) - 1) + ) + inputs["label_ids"] = labels + inputs["loss_mask"] = loss_mask + return inputs diff --git a/llm/glm/finetune_generation.py b/llm/glm/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..dd536c73f87a9aff029804857c1388eacdf1f508 --- /dev/null +++ b/llm/glm/finetune_generation.py @@ -0,0 +1,189 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
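For reference, `custom_convert_example` in `llm/glm/data.py` above concatenates the answer and context into `text_a`, puts the question into `text_b`, and splices the `[gMASK]` id in just before the final special token. A small sketch with a made-up `dureader_qg`-style record (field values and token ids below are illustrative only):

```python
# Illustrative record in the context/answer/question format handled above (values are made up).
example = {"context": "贷款的基本条件是有稳定的收入来源。", "answer": "稳定的收入来源", "question": "贷款需要什么条件?"}

text_a = "答案:" + example["answer"] + "," + "上下文:" + example["context"]
text_b = "在已知答案的前提下,问题:" + example["question"]

# The [gMASK] id is inserted just before the final special token (hypothetical ids).
input_ids, gmask_token_id = [101, 202, 303, 102], 50000
input_ids = input_ids[:-1] + [gmask_token_id] + input_ids[-1:]
print(text_a, text_b, input_ids)
```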
+ +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import custom_convert_example +from utils import GLMTrainer + +from paddlenlp.data import DefaultDataCollator +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU, Rouge1, Rouge2, RougeL +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer +from paddlenlp.utils.log import logger + + +@dataclass +class DataArgument: + task_name: str = field(default="dureader_qg", metadata={"help": "The name of task."}) + src_length: int = field(default=608, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=160, metadata={"help": "The max length of target text."}) + min_tgt_length: int = field(default=55, metadata={"help": "The min length of target text."}) + length_penalty: float = field(default=0.7, metadata={"help": "The length penalty."}) + no_repeat_ngram_size: int = field(default=3, metadata={"help": "The no repeat ngram size."}) + num_beams: int = field(default=5, metadata={"help": "The number of beams."}) + select_topk: bool = field(default=True, metadata={"help": "Whether to select top k tokens for generation."}) + top_p: float = field( + default=0.0, metadata={"help": "The cumulative probability for top-p-filtering in the 'sampling' strategy."} + ) + top_k: int = field( + default=0, + metadata={ + "help": "The number of highest probability tokens to keep for top-k-filtering in the 'sampling' strategy." + }, + ) + no_block_position: bool = field(default=False) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default="THUDM/glm-2b", metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."}) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + setattr(training_args, "label_smoothing", model_args.label_smoothing) + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + dtype = None + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + # Load the pretrained language model. + model = AutoModelForConditionalGeneration.from_pretrained( + model_args.model_name_or_path, + output_predict=True, + parallel_output=True, + load_state_as_np=True, + dtype=dtype, # todo enable set dtype to avoid additional mem usage + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + ) + if model_args.lora: + # TODO: hardcode parameters for now. Change after MergedLoRA is introduced + lora_config = LoRAConfig( + target_modules=[".*query_key_value.*"], + r=4, + lora_alpha=8, + merge_weights=True, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Load the dataset. 
+ train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train", "dev"]) + trans_func = partial(custom_convert_example, tokenizer=tokenizer, data_args=data_args) + train_ds = train_ds.map(partial(trans_func, is_test=False)) + test_ds = dev_ds.map(trans_func) + + collate_fn = DefaultDataCollator() + + def compute_metrics(eval_preds): + rouge1 = Rouge1() + rouge2 = Rouge2() + rougel = RougeL() + bleu4 = BLEU(n_size=4) + predictions = [x[x != -100] for x in eval_preds.predictions] + references = [x[x != -100] for x in eval_preds.label_ids] + + # for pred in predictions: + + rouge1_score = rouge1.score(predictions, references) + rouge2_score = rouge2.score(predictions, references) + for pred, ref in zip(predictions, references): + rougel.add_inst(pred, [ref]) + bleu4.add_inst(pred, [ref]) + return { + "rouge1": rouge1_score, + "rouge2": rouge2_score, + "rougel": rougel.score(), + "bleu4": bleu4.score(), + } + + trainer = GLMTrainer( + model=model, + args=training_args, + train_dataset=train_ds, + eval_dataset=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + do_generation=True, + data_collator=collate_fn, + ) + if training_args.fp16_opt_level == "O2": + trainer.disable_autocast_context_manager() + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if training_args.do_eval: + eval_result = trainer.evaluate(test_ds) + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/glm/predict_generation.py b/llm/glm/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..7467216557a19d14059d168a71f28b7d9c731b9d --- /dev/null +++ b/llm/glm/predict_generation.py @@ -0,0 +1,153 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.distributed import fleet + +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.transformers import ( + AutoConfig, + AutoModelForConditionalGeneration, + AutoTokenizer, +) + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", default="THUDM/glm-large-chinese", required=True, help="The directory of model." + ) + parser.add_argument("--lora_path", default=None, help="The directory of LoRA parameters. Default to None") + parser.add_argument( + "--merge_tensor_parallel_path", default=None, help="The directory of model to merge tensor parallel parts." 
+ ) + parser.add_argument("--batch_size", type=int, default=2, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=200, help="The batch size of data.") + parser.add_argument("--tgt_length", type=int, default=20, help="The batch size of data.") + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + self.batch_size = args.batch_size + self.args = args + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = 0 + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + if self.args.lora_path is not None: + lora_config = LoRAConfig.from_pretrained(self.args.lora_path) + dtype = lora_config.dtype + else: + config = AutoConfig.from_pretrained(args.model_name_or_path) + dtype = config.dtype if config.dtype is not None else "float32" + + self.model = AutoModelForConditionalGeneration.from_pretrained( + args.model_name_or_path, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + load_state_as_np=True, + dtype=dtype, + low_cpu_mem_usage=True, + ) + if self.args.lora_path is not None: + self.model = LoRAModel.from_pretrained(self.model, self.args.lora_path) + self.model.eval() + + def preprocess(self, input_text): + input_text = [text.strip() + "[gMASK]" for text in input_text] + inputs = self.tokenizer( + input_text, + return_tensors="np", + add_special_tokens=True, + padding=True, + max_length=self.args.src_length, + truncation=True, + truncation_side="left", + ) + inputs = self.tokenizer.build_inputs_for_generation(inputs, max_gen_length=self.args.tgt_length) + inputs_tensor = {} + for key, value in inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + return inputs_tensor + + def infer(self, inputs): + result = self.model.generate( + **inputs, + decode_strategy="sampling", + top_k=1, + max_length=self.args.tgt_length, + eos_token_id=self.tokenizer.eop_token_id, + pad_token_id=self.tokenizer.pad_token_id, + ) + result = result[0] + return result + + def postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.decode(x, skip_special_tokens=True) + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + all_texts = [ + "答案:年基准利率4.35%,上下文:从实际看,贷款的基本条件是: 一是中国大陆居民,年龄在60岁以下; 二是有稳定的住址和工作或经营地点; 三是有稳定的收入来源; 四是无不良信用记录,贷款用途不能作为炒股,赌博等行为; 五是具有完全民事行为能力。在已知答案的前提下,问题:", + "答案:U系列,上下文:U系列是最好的,采用国际顶尖技术(由格力自主研发)双级变频压缩机,提高压缩机运转效率,制冷制热能力更强劲;1赫兹变频技术,使空调相当于一个15 W电灯泡,更加节能省电;送风面积广,风力大;生态风,净化空气。非常不错,现在国美在做活动,可以了解一下。在已知答案的前提下,问题:", + ] + batch_texts = batchfy_text(all_texts, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = 
predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print("{}\n{}".format(text, result)) + + if args.merge_tensor_parallel_path is not None: + predictor.model.save_pretrained( + save_dir=args.merge_tensor_parallel_path, + merge_tensor_parallel=True, + ) + predictor.tokenizer.save_pretrained(args.merge_tensor_parallel_path) diff --git a/llm/glm/utils.py b/llm/glm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d3b9e8919aa7286df303769873f98ce678a838df --- /dev/null +++ b/llm/glm/utils.py @@ -0,0 +1,79 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.nn as nn + +from paddlenlp.trainer import Trainer + + +class GLMTrainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if not self.do_generation: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + + model.eval() + with paddle.no_grad(): + tokens = model.generate( + input_ids=inputs["input_ids"], + position_ids=inputs["position_ids"], + attention_mask=inputs["attention_mask"], + decode_strategy="sampling", + top_k=1, + repetition_penalty=2.0, + bos_token_id=self.tokenizer.sop_token_id, + eos_token_id=self.tokenizer.eop_token_id, + pad_token_id=self.tokenizer.pad_token_id, + )[0] + all_preds = [] + for pred_tokens in tokens: + all_preds.append(pred_tokens[pred_tokens != self.tokenizer.pad_token_id].tolist()) + max_pred_length = max([len(x) for x in all_preds]) + for index, preds in enumerate(all_preds): + all_preds[index] = preds + [-100] * (max_pred_length - len(preds)) + + all_labels = [] + for label, mask in zip(inputs["labels"].numpy(), inputs["loss_mask"].numpy()): + label = label[mask.astype("bool")] + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(all_preds), paddle.to_tensor(all_labels)) + + def log(self, logs: Dict[str, float], **kwargs) -> None: + + if self.state.epoch is not None: + logs["epoch"] = round(self.state.epoch, 4) + + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + output = {**logs, **{"step": self.state.global_step}} + self.state.log_history.append(output) + self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs, **kwargs) diff --git a/llm/gpt-3/README.md b/llm/gpt-3/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..5c687502627f3072dd62d19579df421afd2b7764 --- /dev/null +++ b/llm/gpt-3/README.md @@ -0,0 +1,194 @@ +# GPT + +## 1. 模型介绍 + +GPT-3是一种预训练语言模型,它能够模拟人类语言思维和表达。GPT-3拥有巨大的参数,包含了1750亿个参数,这使得它具有强大的语言理解和生成能力。它可以完成的任务包括文本生成、文本摘要、回答问题、翻译、阅读理解等。GPT-3的预训练过程使用了大量的语料库,包括互联网上的大量文本。它通过分析这些文本,学习如何生成和理解人类语言。GPT-3在自然语言处理领域具有很高的影响力,它可以模拟人类对话和生成文本,这使得它在许多应用领域都有广泛的应用,比如智能客服、自然语言处理、游戏设计等。 + +## 2. 预训练 + +预训练数据制作参考[此处](../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz +``` + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv gpt_en_dataset_300m_ids.npy ./data +mv gpt_en_dataset_300m_idx.npz ./data +``` + +注意: +1. 需要paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包 +2. `use_flash_attention` 需要在A100机器开启,否则loss可能不正常(很快变成0.00x,非常小不正常)。建议使用cuda11.8环境。 + +使用下面脚本,即可在gpt2-medium-en的基础上,继续训练. +```shell +task_name="gpt3_hybrid" +export PYTHONPATH="../../PaddleNLP/" +export FLAGS_cudnn_deterministic=True +log_dir="log" +rm -rf $log_dir + +python -u -m paddle.distributed.launch \ + --gpus "0" \ + --log_dir ${log_dir} \ + run_pretrain.py \ + --model_type "gpt" \ + --model_name_or_path gpt2-medium-en \ + --tokenizer_name_or_path gpt2-medium-en \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 1024 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --fuse_attention_qkv 1 \ + --use_flash_attention 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000005 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --recompute 1 \ + --gradient_accumulation_steps 2 \ + --do_train \ + --do_eval \ + --device "gpu" +``` + +其中参数释义如下: + +- `model_name_or_path`: 预训练模型内置名称或者模型所在目录,默认为`gpt2-medium-en`。 +- `num_train_epochs`: 要执行的训练 epoch 总数(如果不是整数,将在停止训练之前执行最后一个 epoch +的小数部分百分比)。 +- `max_steps`: 模型训练步数。 +- `learning_rate`: 参数更新的学习率。 +- `warmup_steps`: 学习率热启的步数。 +- `eval_steps`: 模型评估的间隔步数。 +- `logging_steps`: 训练日志打印的间隔步数。 +- `save_steps`: 模型参数保存的间隔步数。 +- `output_dir`: 模型参数保存目录。 +- `src_length`: 上下文的最大输入长度,默认为128. +- `tgt_length`: 生成文本的最大长度,默认为160. +- `gradient_accumulation_steps`: 模型参数梯度累积的步数,可用于扩大 batch size。实际的 batch_size = per_device_train_batch_size * gradient_accumulation_steps。 +- `fuse_attention_qkv`:在MultiHeadAttention中使用qkv线性层融合 +- `use_flash_attention`:使用flash attention技术,注意此处需要在A100机器开启 +- `fp16`: 使用 float16 精度进行模型训练和推理。 +- `fp16_opt_level`: float16 精度训练模式,`O2`表示纯 float16 训练。 +- `recompute`: 使用重计算策略,开启后可节省训练显存。 +- `do_train`: 是否训练模型。 +- `do_eval`: 是否评估模型。 +- `tensor_parallel_degree`: 模型并行数量。 +- `lora`: 是否使用LoRA技术。 +- `task_name`: 内置数据集任务名。 + + + + +## 3. 
微调 +### SFT + +```shell +task_name="gpt3_hybrid" +export PYTHONPATH="../../PaddleNLP/" +export FLAGS_cudnn_deterministic=True +log_dir="log" +rm -rf $log_dir + +python -u -m paddle.distributed.launch \ + --gpus "0" \ + --log_dir ${log_dir} \ + finetune_generation.py \ + --model_type "gpt" \ + --model_name_or_path gpt2-medium-en \ + --output_dir "output/$task_name" \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 1 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.00001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --recompute 1 \ + --gradient_accumulation_steps 2 \ + --do_train \ + --do_eval \ + --device "gpu" +``` + +### LoRA + +```shell +task_name="gpt3_hybrid" +export PYTHONPATH="../../PaddleNLP/" +export FLAGS_cudnn_deterministic=True +log_dir="log" +rm -rf $log_dir + +python finetune_generation.py \ + --model_type "gpt" \ + --model_name_or_path gpt2-medium-en \ + --output_dir "output/$task_name" \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 1 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 3e-4 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --recompute 1 \ + --gradient_accumulation_steps 2 \ + --do_train \ + --do_eval \ + --device "gpu" \ + --lora +``` + + +## 4. 动态图推理 + +```shell +python predict_generation.py +``` diff --git a/llm/gpt-3/finetune_generation.py b/llm/gpt-3/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..a8906807ad98efa71e4dd466cae3ca50fb14930c --- /dev/null +++ b/llm/gpt-3/finetune_generation.py @@ -0,0 +1,247 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
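+
+# NOTE: minimal launch sketch (illustrative only, not part of the original script); the complete
+# SFT and LoRA commands are documented in llm/gpt-3/README.md, and the paths/hyper-parameters
+# below should be adapted to your environment:
+#
+#   python -u -m paddle.distributed.launch --gpus "0" finetune_generation.py \
+#       --model_type "gpt" --model_name_or_path gpt2-medium-en \
+#       --output_dir "output/gpt3_hybrid" --fp16 --fp16_opt_level "O2" \
+#       --sharding "stage2" --do_train --do_eval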
+ +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from modeling_pp import GPTForCausalLMPipe +from utils import ( + DataCollatorForSupervisedDataset, + GPTTrainer, + compute_metrics, + convert_example, +) + +from paddlenlp.datasets import load_dataset +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import ( + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, + set_seed, +) +from paddlenlp.transformers import AutoTokenizer, GPTConfig, GPTForCausalLM +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTConfig, GPTForCausalLM), +} + + +@dataclass +class DataArgument: + task_name: str = field(default="squad", metadata={"help": "The name of task."}) + src_length: int = field(default=1024, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=142, metadata={"help": "The max length of target text."}) + generate_num: int = field(default=0, metadata={"help": "Save first k examples generation result in dev dataset"}) + + +@dataclass +class ModelArgument: + model_type: str = field( + default="gpt-cn", metadata={"help": "Build-in pretrained model from the different model type."} + ) + model_name_or_path: str = field( + default="gpt-cpm-large-cn", metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + use_flash_attn: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + enable_fuse_transformer: bool = field( + default=False, + metadata={"help": "gpt, enable_fuse_transformer"}, + ) + + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "gpt, fuse_attention_qkv"}, + ) + eval_with_do_generation: bool = field( + default=True, metadata={"help": "Evaluate with generation, instead for calc loss."} + ) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + # lora + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=8, metadata={"help": "Lora attention dimension"}) + merge_weights: bool = field( + default=False, metadata={"help": "Merge weights of the original model and the Lora model"} + ) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + # data_args.always_pad_to_max_length = False + data_args.always_pad_to_max_length = training_args.pipeline_parallel_degree > 1 + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.tgt_length = data_args.tgt_length + paddle.set_device(training_args.device) + + set_seed(args=training_args) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + if training_args.pipeline_parallel_degree > 1: + model_class = GPTForCausalLMPipe + # Load the tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + tokenizer.padding_side = "left" + + # Load and set the pretrained configuration + config = config_class.from_pretrained(model_args.model_name_or_path) + config.enable_fuse_transformer = model_args.enable_fuse_transformer + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.use_flash_attn = model_args.use_flash_attn + config.use_recompute = training_args.recompute + + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + config.ignore_index = tokenizer.pad_token_id + + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + ) + if model_args.lora: + if model_args.lora_path is None: + target_modules = [ + ".*qkv_proj.*", + ".*q_proj.*", + ".*k_proj.*", + ".*v_proj.*", + ".*linear1.*", + ".*linear2.*", + ".*out_proj.*", + ] + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=model_args.merge_weights, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + # Load the dataset. 
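+    # The default DataArgument.task_name is "squad" (loaded with the train_v1/dev_v1 splits).
+    # convert_example in utils.py builds an "answer: ... context: ..." prompt with a
+    # "question: ..." target, i.e. the model is fine-tuned to generate questions from
+    # (answer, context) pairs.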
+ if training_args.do_train or training_args.do_eval: + train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train_v1", "dev_v1"]) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_source_length=data_args.src_length, + max_target_length=data_args.tgt_length, + ) + + if training_args.do_train: + train_ds = train_ds.map(partial(trans_func)) + if training_args.do_eval: + is_test = model_args.eval_with_do_generation + dev_ds = dev_ds.map(partial(trans_func, is_test=is_test)) + + collate_fn = DataCollatorForSupervisedDataset( + tokenizer, max_length=1024 if data_args.always_pad_to_max_length else 0 + ) + + def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + preds = eval_preds.predictions + preds = [x[x != -100] for x in preds] + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = [x[x != -100] for x in eval_preds.label_ids] + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + + all_preds = [pred.strip() for pred in all_preds] + all_labels = [label.strip() for label in all_labels] + all_preds = [pred.strip("question:") for pred in all_preds] + all_labels = [label.strip("question:") for label in all_labels] + + eval_result = compute_metrics(all_preds, all_labels) + return eval_result + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = GPTTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics_func + if (model_args.eval_with_do_generation and training_args.do_eval) + else None, + do_generation=model_args.eval_with_do_generation, + data_collator=collate_fn, + ) + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if training_args.do_eval: + eval_result = trainer.evaluate() + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/gpt-3/modeling_pp.py b/llm/gpt-3/modeling_pp.py new file mode 100644 index 0000000000000000000000000000000000000000..c6b72766e2256f775123bfeb902b2e8278478362 --- /dev/null +++ b/llm/gpt-3/modeling_pp.py @@ -0,0 +1,299 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
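+
+# Pipeline-parallel flavor of GPT: the model is flattened into a sequence of stage layers
+# (a shared GPTEmbeddingPipe, GPTDecoderLayerPipe blocks, a final LayerNormPipe, and a tied
+# embedding/logits head computed via parallel_matmul). GPTForCausalLMPipe defined here is
+# selected by run_pretrain.py and finetune_generation.py when --pipeline_parallel_degree > 1.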
+ +# pass +import paddle.distributed.fleet as fleet +import paddle.nn as nn +from paddle.distributed.fleet.meta_parallel import ( + LayerDesc, + PipelineLayer, + SharedLayerDesc, +) +from paddle.distributed.fleet.utils import recompute + +from paddlenlp.transformers import ( + GPTConfig, + GPTDecoderLayer, + GPTEmbeddings, + GPTPretrainedModel, + GPTPretrainingCriterion, + PretrainedModel, +) +from paddlenlp.transformers.gpt.modeling import parallel_matmul + + +def get_hcg(): + return fleet.get_hybrid_communicate_group() + + +def get_attr(layer, name): + if getattr(layer, name, None) is not None: + return getattr(layer, name, None) + else: + return get_attr(layer._layer, name) + + +def parse_args(args): + if isinstance(args, tuple): + if len(args) == 3: + hidden_states, attention_mask, position_ids = args + elif len(args) == 2: + hidden_states, attention_mask = args + position_ids = None + else: + hidden_states = args + attention_mask, position_ids = None, None + + if position_ids is not None: + position_ids.stop_gradient = True + + if attention_mask is not None: + attention_mask.stop_gradient = True + + return hidden_states, attention_mask, position_ids + + +def return_args(hidden_states, attention_mask=None, position_ids=None): + ret = (hidden_states,) + + if attention_mask is not None: + ret += (attention_mask.clone(),) + if position_ids is not None: + ret += (position_ids.clone(),) + if len(ret) == 1: + ret = ret[0] + + return ret + + +class GPTEmbeddingPipe(GPTEmbeddings): + """Extends GPTEmbeddings to forward attention_mask through the pipeline.""" + + @property + def embedding_weight(self): + return get_attr(self.word_embeddings, "weight") + + def forward(self, args): + input_ids, attention_mask, position_ids = parse_args(args) + input_ids.stop_gradient = True + embeddings = super().forward(input_ids=input_ids, position_ids=position_ids) + return embeddings + + +class GPTDecoderLayerPipe(GPTDecoderLayer): + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + # hidden_states = super().forward(hidden_states, tgt_mask=attention_mask) + if self.enable_recompute and self.config.recompute_granularity == "full": + hidden_states = recompute(super().forward, hidden_states, attention_mask) + else: + hidden_states = super().forward(hidden_states, tgt_mask=attention_mask) + + return return_args(hidden_states, attention_mask, position_ids) + + +class LayerNormPipe(nn.LayerNorm): + def __init__(self, config): + super(LayerNormPipe, self).__init__(config.hidden_size, epsilon=1e-05) + + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + hidden_states = super().forward(hidden_states) + return return_args(hidden_states, attention_mask, position_ids) + + +class PipelinePretrainedModel(PretrainedModel): + _sequential_layers = [] + _pipeline_name_mapping = None + + def __init__(self, config, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + def add_sequential_layer(self, layer_desc, name_prefix=""): + self._sequential_layers.append({"layer": layer_desc, "name_prefix": name_prefix}) + + def get_sequential_layers(self): + return [x["layer"] for x in self._sequential_layers] + + def get_sequential_name_prefixs(self): + return {str(index): x["name_prefix"] for index, x in enumerate(self._sequential_layers)} + + def _set_pipeline_name_mapping(self, mappings=None): + if mappings is not None: + self._pipeline_name_mapping = mappings + else: + mapping = {} + state_dict_keys = list(super().state_dict().keys()) + 
first_key = state_dict_keys[0].split(".") + # if use virtual pp_degree, the prefix is like 0.0.xxx + # else it will be like 0.xxx + use_virtual_pp_degree = first_key[0].isdigit() and first_key[1].isdigit() + + prefixs = self.get_sequential_name_prefixs() + for k in state_dict_keys: + name_splited = k.split(".") + # TODO(wawltor) Fix the virtual pipeline + if use_virtual_pp_degree: + idx = str(int(name_splited[0]) + int(name_splited[1])) + single_name = [prefixs[idx]] + single_name.extend(name_splited[2:]) + else: + idx = name_splited[0] + if idx == "shared_layers": + single_name = name_splited[2:] + single_name = ["gpt.embeddings"] + single_name + elif idx.isdigit(): + single_name = [prefixs[idx]] + single_name.extend(name_splited[1:]) + else: + raise ("The mapping table had bad row, please check parameter name:{}".format(k)) + mapping[".".join(single_name)] = k + + self._pipeline_name_mapping = mapping + + return self._pipeline_name_mapping + + def _prepare_pipeline_inputs_func(self, inputs): + first_stage_keys = ["input_ids", "attention_mask"] + last_stage_keys = ["labels"] + + def get_expected_keys(inputs, keys): + ret = tuple([inputs.pop(k) for k in keys if k in inputs]) + if len(ret) == 1: + ret = ret[0] + return ret + + if type(inputs) is dict: + return [ + get_expected_keys(inputs, first_stage_keys), + get_expected_keys(inputs, last_stage_keys), + ] + + keys = list(inputs[0].keys()) + inputs_batch = {key: [data.pop(key) for data in inputs] for key in keys} + return [ + get_expected_keys(inputs_batch, first_stage_keys), + get_expected_keys(inputs_batch, last_stage_keys), + ] + + def state_dict(self, *args, **kwargs): + state_dict = super().state_dict(*args, **kwargs) + + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + pp_to_single_mapping = {v: k for k, v in self._pipeline_name_mapping.items()} + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + state_dict[pp_to_single_mapping[k]] = v + + return state_dict + + def set_state_dict(self, state_dict, *args, **kwargs): + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + if k not in self._pipeline_name_mapping: + continue + state_dict[self._pipeline_name_mapping[k]] = v + + ret = super().set_state_dict(state_dict, *args, **kwargs) + return ret + + +class GPTForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): + """LlamaForPretraining adapted for pipeline parallelism. + + The largest change is flattening the LlamaModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. + """ + + config_class = GPTConfig + + _get_tensor_parallel_mappings = GPTPretrainedModel._get_tensor_parallel_mappings + _init_weights = GPTPretrainedModel._init_weights + + # NO base_model_prefix !!!! 
+ + def __init__( + self, + config, + pp_recompute_interval=1, + ): + self.config = config + + virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1) + + hcg = get_hcg() + tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1) + tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + self.add_sequential_layer( + SharedLayerDesc("gpt", GPTEmbeddingPipe, shared_weight_attr="embedding_weight", config=config), "gpt" + ) + for i in range(config.num_hidden_layers): + self.add_sequential_layer( + LayerDesc(GPTDecoderLayerPipe, config=config), + f"gpt.decoder.layers.{i}", + ) + + self.add_sequential_layer(LayerDesc(LayerNormPipe, config=config), "gpt.decoder.norm") + + def _logits_helper(embedding, output): + return parallel_matmul(output, embedding.embedding_weight, True) + + self.add_sequential_layer( + SharedLayerDesc( + "gpt", + GPTEmbeddingPipe, + forward_func=_logits_helper, + shared_weight_attr="embedding_weight", + config=config, + ), + "gpt", + ) + + recompute_interval = 0 + # if use_recompute and recompute_granularity == "full": + # assert pp_recompute_interval <= config.num_hidden_layers // ( + # virtual_pp_degree * get_hcg().topology().get_dim_size("pipe") + # ), "pp recompute interval should smaller than num layers of each pp chunk" + # recompute_interval = pp_recompute_interval + + seg_method = "layer:GPTDecoderLayer" + if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0: + seg_method = "uniform" + + PipelineLayer.__init__( + self, + layers=self.get_sequential_layers(), + loss_fn=GPTPretrainingCriterion(config), + topology=get_hcg().topology(), + seg_method=seg_method, + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": get_hcg().get_model_parallel_group(), + "offload": False, + "partition": False, + }, + num_virtual_pipeline_stages=virtual_pp_degree, + ) + self.apply(self._init_weights) diff --git a/llm/gpt-3/predict_generation.py b/llm/gpt-3/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..53e1d95c22d38aa8347c3d3af27ecf680335fec0 --- /dev/null +++ b/llm/gpt-3/predict_generation.py @@ -0,0 +1,167 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
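+
+# NOTE: illustrative invocation only; the values shown are simply the argparse defaults
+# declared further down:
+#
+#   python predict_generation.py --model_type gpt2-cn --model_name_or_path gpt-cpm-large-cn \
+#       --src_length 200 --tgt_length 200
+#
+# Running the same script under paddle.distributed.launch with several GPUs enables the
+# tensor-parallel path (tensor_parallel_degree is taken from the world size below).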
+from __future__ import annotations + +import paddle +from utils import get_hcg, init_dist_env, set_seed + +from paddlenlp.transformers import ( + GPTChineseTokenizer, + GPTConfig, + GPTForCausalLM, + GPTTokenizer, +) + +MODEL_CLASSES = { + "gpt2": (GPTForCausalLM, GPTTokenizer), + "gpt2-cn": (GPTForCausalLM, GPTChineseTokenizer), +} + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", default="gpt2-cn", help="The directory of model.") + parser.add_argument("--model_name_or_path", default="gpt-cpm-large-cn", help="The directory of model.") + parser.add_argument("--save_onepiece_model_path", default=None, help="The directory of model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=200, help="The batch size of data.") + parser.add_argument("--tgt_length", type=int, default=200, help="The batch size of data.") + parser.add_argument("--seed", type=int, default=20, help="the seed of parameter initialization") + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args=None, tokenizer=None, model=None, **kwargs): + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + self.tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + self.tokenizer.padding_side = "left" + self.batch_size = args.batch_size + self.args = args + self.src_length = self.args.src_length + self.tgt_length = self.args.tgt_length + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = 0 + if tensor_parallel_degree > 1: + hcg = get_hcg() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + config = GPTConfig.from_pretrained(args.model_name_or_path) + dtype = config.dtype if config.dtype is not None else "float16" + + self.model = GPTForCausalLM.from_pretrained( + args.model_name_or_path, + load_state_as_np=True, + low_cpu_mem_usage=True, + dtype=dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + if self.tokenizer.pad_token_id is None: + self.tokenizer.pad_token_id = self.model.config.pad_token_id + self.model.eval() + + def preprocess(self, input_text): + inputs = self.tokenizer( + input_text, + return_tensors="np", + padding=True, + max_length=self.src_length, + ) + inputs_tensor = {} + for key, value in inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + return inputs_tensor + + def infer(self, inputs): + if self.model.config.dtype == "float32" or self.model.config.dtype is None: + with paddle.no_grad(): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eol_token_id, + pad_token_id=self.tokenizer.pad_token_id, + decode_strategy="sampling", + top_k=1, + ) + else: + with paddle.no_grad(): + with paddle.amp.auto_cast(False, level="O2", dtype=self.model.config.dtype): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eol_token_id, + pad_token_id=self.tokenizer.pad_token_id, + decode_strategy="sampling", + top_k=1, + ) + result = result[0] + return result + + def 
postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.convert_ids_to_string(x) + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + def save_onepiece_model(self, save_onepiece_model_path): + self.model.save_pretrained(save_dir=save_onepiece_model_path, merge_tensor_parallel=True) + paddle.distributed.barrier() + self.tokenizer.save_pretrained(save_onepiece_model_path) + paddle.distributed.barrier() + + +def predict(): + args = parse_arguments() + + # Init the fleet config + tensor_parallel_degree = paddle.distributed.get_world_size() + if tensor_parallel_degree > 1: + init_dist_env(tensor_parallel_degree=tensor_parallel_degree, seed=args.seed) + set_seed(args.seed) + + predictor = Predictor(args) + all_texts = ["问题:中国的首都是哪里?答案:北京。\n问题:苹果的CEO是谁? 答案:", "问题:中国的首都是哪里?答案:北京。\n问题:广东的省会是哪个城市? 答案:"] + batch_texts = batchfy_text(all_texts, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print(result) + if args.save_onepiece_model_path is not None: + predictor.save_onepiece_model(args.save_onepiece_model_path) + + +if __name__ == "__main__": + predict() diff --git a/llm/gpt-3/run_pretrain.py b/llm/gpt-3/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..e68f80eef6beea1abce1c8fef6012f9a05afa302 --- /dev/null +++ b/llm/gpt-3/run_pretrain.py @@ -0,0 +1,438 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +GPT/Llama pretraining scripts. 
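+
+Typically launched with `paddle.distributed.launch`; the pretraining section of
+llm/gpt-3/README.md contains a complete example command for gpt2-medium-en.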
+""" +import math +import os +import sys +import time +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from modeling_pp import GPTForCausalLMPipe + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + set_seed, + speed_metrics, +) +from paddlenlp.transformers import ( + AutoTokenizer, + CosineAnnealingWithWarmupDecay, + GPTConfig, + GPTForCausalLM, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": ( + GPTConfig, + GPTForCausalLM, + ), +} + +from paddlenlp.data.causal_dataset import build_train_valid_test_datasets, print_rank_0 + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."}) + skip_warmup: bool = field( + default=True, + metadata={"help": "Whether to skip the warmup process of mmap files."}, + ) + data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. 
+ """ + + model_type: Optional[str] = field(default="gpt", metadata={"help": "Only support for gpt pre-training for now."}) + model_name_or_path: str = field( + default="gpt2-medium-en", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + output_attentions: bool = field(default=False, metadata={"help": "Whether output attention weights"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + fused_linear: bool = field( + default=False, + metadata={"help": "gpt, whether to fuse linear projection"}, + ) + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "gpt, whether to fuse attention qkv"}, + ) + enable_fuse_transformer: bool = field( + default=False, + metadata={"help": "gpt, enable_fuse_transformer"}, + ) + hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) + attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention hidden dropout prob."}) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, +): + + train_val_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + print_rank_0(" train: {}".format(train_val_test_num_samples[0])) + print_rank_0(" validation: {}".format(train_val_test_num_samples[1])) + print_rank_0(" test: {}".format(train_val_test_num_samples[2])) + + # Build the datasets. + train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl=data_args.data_impl, + splits_string=data_args.split, + train_val_test_num_samples=train_val_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=data_args.skip_warmup, + data_cache_path=data_args.data_cache, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + + logger.info(tokenizer._decode(input_ids)) + + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn([x["text"] for x in data]) + + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + # Attention mask. + attention_mask = paddle.ones(tokens.shape, dtype=paddle.int64) + + return { + "input_ids": tokens, + "attention_mask": attention_mask, + "labels": labels, + } + + print_dataset(train_dataset[0]) + print_dataset(valid_dataset[0]) + print_dataset(test_dataset[0]) + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... 
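+        # e.g. --input_dir "0.3 data/corpus_a 0.7 data/corpus_b" mixes two preprocessed datasets
+        # with sampling weights 0.3 and 0.7 (the paths here are purely illustrative).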
+ return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + # keep eval_dataloader + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + if data_args.data_cache is not None: + os.makedirs(data_args.data_cache, exist_ok=True) + + set_seed(seed=training_args.seed, args=training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 
10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + config.output_attentions = model_args.output_attentions + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + config.hidden_dropout_prob = model_args.hidden_dropout_prob + config.attention_probs_dropout_prob = model_args.attention_probs_dropout_prob + config.enable_fuse_transformer = model_args.enable_fuse_transformer + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.use_recompute = training_args.recompute + config.use_flash_attention = model_args.use_flash_attention + + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + + print("Final pre-training config:", config) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + if training_args.pipeline_parallel_degree > 1: + model_class = GPTForCausalLMPipe + + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + ) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer + ) + + trainer = PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + 
optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + checkpoint = None + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git a/llm/gpt-3/utils.py b/llm/gpt-3/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..fed4793d956bc252b1b6f017d74ff7d0d93be1e3 --- /dev/null +++ b/llm/gpt-3/utils.py @@ -0,0 +1,389 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import random +import re +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.optimizer.lr import LambdaDecay +from rouge import Rouge + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import Trainer +from paddlenlp.utils.log import logger + +PREFIX_CHECKPOINT_DIR = "model_state" +_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"\.tp(\d+)" + ".pdparams$") + + +_hcg = None + + +def set_hcg(hcg): + global _hcg + _hcg = hcg + + +def get_hcg(): + global _hcg + return _hcg + + +def set_seed(seed): + # NOTE(shenliang03): For parameter init seed: + # seed: dp/mp_undistributed_paramter/sharding is same; others is different + # For compute seed(dropout): + # global seed: only mp group is same. 
+ # local seed: all groups are different + + hcg = get_hcg() + if paddle.distributed.get_world_size() > 1: + # obtain rank message of hybrid parallel + + mp_rank = hcg.get_model_parallel_rank() + mp_size = hcg.get_model_parallel_world_size() + + pp_rank = hcg.get_stage_id() + pp_size = hcg.get_pipe_parallel_world_size() + + dp_rank = hcg.get_data_parallel_rank() + dp_size = hcg.get_data_parallel_world_size() + + sharding_rank = hcg.get_sharding_parallel_rank() + # sharding_size = hcg.get_sharding_parallel_world_size() + else: + mp_rank, mp_size = 0, 1 + pp_rank, pp_size = 0, 1 + dp_rank, dp_size = 0, 1 + sharding_rank, _ = 0, 1 + + # NOTE: the commented seeds are set only for precision validation + # seed += 100 * pp_rank + random.seed(seed + 100 * pp_rank) + np.random.seed(seed + 100 * pp_rank) + + # seed = mp_rank + + # pp_rank * (mp_size) + + # dp_rank * (mp_size * pp_size) + + # sharding_rank * (mp_size * pp_size * dp_size) + # seed offset is order to avoid conflicts with the parameter initialization seed + + seed_offset = seed + 1024 + paddle.distributed.get_world_size() + global_seed = ( + seed_offset + + pp_rank * (mp_size) + + dp_rank * (mp_size * pp_size) + + sharding_rank * (mp_size * pp_size * dp_size) + ) + + seed_offset += paddle.distributed.get_world_size() + local_seed = ( + seed_offset + + mp_rank + + pp_rank * (mp_size) + + dp_rank * (mp_size * pp_size) + + sharding_rank * (mp_size * pp_size * dp_size) + ) + + tracker = get_rng_state_tracker() + tracker.add("global_seed", global_seed) + tracker.add("local_seed", local_seed) + + paddle.seed(global_seed) + + logger.info("The global seed is set to {} and local seed is set to {}.".format(global_seed, local_seed)) + + +def create_hcg(strategy, hcg_name="HybridCommunicateGroup"): + if hcg_name == "HybridCommunicateGroup": + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + else: + dist.init_parallel_env() + hcg = eval("{}".format(hcg_name))(strategy) + + return hcg + + +def init_dist_env( + tensor_parallel_degree=1, sharding_parallel_degree=1, pipeline_parallel_degree=1, data_parallel_degree=1, seed=1 +): + + strategy = fleet.DistributedStrategy() + + def is_segment_parallel_supported(): + import inspect + + members = [name for (name, date) in inspect.getmembers(fleet.HybridCommunicateGroup)] + return "get_sep_parallel_world_size" in members + + if tensor_parallel_degree == 1 and sharding_parallel_degree == 1: + if is_segment_parallel_supported(): + order = ["pp", "dp", "sharding", "sep", "mp"] + else: + order = ["pp", "dp", "sharding", "mp"] + else: + if is_segment_parallel_supported(): + order = ["dp", "pp", "sharding", "sep", "mp"] + else: + order = ["dp", "pp", "sharding", "mp"] + + strategy.hybrid_configs = { + "dp_degree": data_parallel_degree, + "mp_degree": tensor_parallel_degree, + "pp_degree": pipeline_parallel_degree, + "sharding_degree": sharding_parallel_degree, + "order": order, + } + + # TODO(wawltor) The inference parallel do not support the pipeline mode + + """ + if pipeline_parallel_degree > 1: + if "sequence_parallel" in config.Model: + if config.Model.sequence_parallel: + assert config.Global.enable_partial_send_recv is False, ( + "if config.Distributed.pp_degree > 1 and config.Model.sequence_parallel is True, " + "config.Global.enable_partial_send_recv should be set False." 
+ ) + + strategy.pipeline_configs = { + "accumulate_steps": config.Global.local_batch_size // config.Global.micro_batch_size, + "micro_batch_size": config.Global.micro_batch_size, + "enable_partial_send_recv": config.Global.enable_partial_send_recv, + } + """ + + # set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": seed} + + hcg = create_hcg(strategy) + set_hcg(hcg) + + +def convert_example( + example, + tokenizer, + max_source_length, + max_target_length, + is_test=False, +): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + context = example["context"] + question = example["question"] + try: + answer = example["answers"][0] + except Exception: + print(example["context"]) + print(example["question"]) + print(example["answers"]) + print(example["answer_starts"]) + print(example["is_impossible"]) + + input_seq = f"answer: {answer} context: {context} " + output_seq = f"question: {question} " + + outputs = tokenizer( + output_seq, + max_length=max_target_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_token_type_ids=False, + ) + inputs = tokenizer( + input_seq, + max_length=max_source_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_length=False, + ) + + final = {} + for k in outputs.keys(): + final[k] = inputs[k] + outputs[k] + if k == "input_ids": + final["labels"] = [tokenizer.pad_token_id] * len(inputs["input_ids"]) + outputs[k] + if is_test: + return dict(input_ids=inputs["input_ids"], labels=outputs["input_ids"]) + + # shift inputs and labels + final["input_ids"] = final["input_ids"][:-1] + final["labels"] = final["labels"][1:] + return final + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + + rouge1 = round(rouge1, 4) + rouge2 = round(rouge2, 4) + rougel = round(rougel, 4) + bleu4 = round(bleu4.score(), 4) + return dict( + rouge1=rouge1, + rouge2=rouge2, + rougel=rougel, + bleu4=bleu4, + ) + + +class DataCollatorForSupervisedDataset(DataCollatorForSeq2Seq): + """Collate examples for supervised fine-tuning.""" + + def __call__(self, features, return_tensors=None): + # Deep copy to avoid modifying features in-place + batch = copy.deepcopy(features) + if return_tensors is None: + return_tensors = self.return_tensors + labels = [feature["labels"] for feature in batch] if "labels" in batch[0].keys() else None + # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the + # same length to return tensors. + if labels is not None: + # Note(gongenlei): In pipeline, max_label_length = self.max_length + if self.padding == "max_length" and self.max_length is not None: + max_label_length = self.max_length + else: + max_label_length = max(len(l) for l in labels) + if self.pad_to_multiple_of is not None: + max_label_length = ( + (max_label_length + self.pad_to_multiple_of - 1) + // self.pad_to_multiple_of + * self.pad_to_multiple_of + ) + + padding_side = self.tokenizer.padding_side + for feature in batch: + remainder = [self.tokenizer.pad_token_id] * (max_label_length - len(feature["labels"])) + if isinstance(feature["labels"], list): + feature["labels"] = ( + feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"] + ) + elif padding_side == "right": + feature["labels"] = np.concatenate([feature["labels"], remainder]).astype(np.int64) + else: + feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64) + + batch = self.tokenizer.pad( + batch, + padding=self.padding, + max_length=self.max_length, + pad_to_multiple_of=self.pad_to_multiple_of, + return_tensors=return_tensors, + return_attention_mask=self.return_attention_mask, + ) + + return batch + + +class GPTTrainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. 
+ # keepdim in order to maintain the same shape as logits + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + model.eval() + + preds = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"] if "attention_mask" in inputs else None, + max_length=self.args.tgt_length, + min_length=0, + use_cache=True, + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, + decode_strategy="sampling", + )[0] + all_labels = [] + for label in inputs["labels"].numpy(): + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(preds), paddle.to_tensor(all_labels)) + + def create_scheduler(self, num_training_steps: int): + num_warmup_steps = ( + self.args.warmup_steps if self.args.warmup_steps > 0 else self.args.warmup_ratio * num_training_steps + ) + + def lr_lambda(current_step: int): + if current_step < num_warmup_steps: + return float(current_step) / float(max(1, num_warmup_steps)) + else: + decay_step_ratio = (current_step - num_warmup_steps) / (num_training_steps - num_warmup_steps) + return 1.0 - (1.0 - self.args.lr_decay_ratio) * decay_step_ratio + + if self.lr_scheduler is None: + self.lr_scheduler = LambdaDecay(self.args.learning_rate, lr_lambda, last_epoch=-1) + return self.lr_scheduler + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(GPTTrainer, self).log(logs, **kwargs) diff --git a/llm/gradio_ui.py b/llm/gradio_ui.py new file mode 100644 index 0000000000000000000000000000000000000000..39f52ef91e2df3c0daa76d97f2955ba6bcfccb69 --- /dev/null +++ b/llm/gradio_ui.py @@ -0,0 +1,202 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
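+
+# NOTE: thin Gradio front end: the dialogue context lives in a gr.State dict, and every user
+# turn is POSTed, together with the slider settings (top_k/top_p/temperature/repetition_penalty/
+# max_length), to the local flask service at /api/chat; the reply is appended to the chatbot view.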
+
+from __future__ import annotations
+
+import argparse
+import copy
+import json
+
+import gradio as gr
+import requests
+
+
+def setup_args():
+    """Set up command-line arguments."""
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8073)
+    # NOTE: args.flask_port and args.title are read below; the defaults here are assumptions.
+    # Point --flask_port at the flask service that exposes /api/chat.
+    parser.add_argument("--flask_port", type=int, default=8071)
+    parser.add_argument("--title", type=str, default="LLM")
+    args = parser.parse_args()
+    return args
+
+
+def launch(args):
+    """Launch characters dialogue demo."""
+
+    def rollback(state):
+        """Rollback context."""
+        context = state.setdefault("context", [])
+        # Nothing to roll back before the first complete user/bot round.
+        if len(context) < 2:
+            return None, get_shown_context(context), context, state
+        utterance = context[-2]["utterance"]
+        context = context[:-2]
+        state["context"] = context
+        shown_context = get_shown_context(context)
+        return utterance, shown_context, context, state
+
+    def regen(state, top_k, top_p, temperature, repetition_penalty, max_length):
+        """Regenerate response."""
+        context = state.setdefault("context", [])
+        # Nothing to regenerate before the first complete user/bot round.
+        if len(context) < 2:
+            return None, get_shown_context(context), context, state
+        context.pop()
+        user_turn = context.pop()
+        return infer(user_turn["utterance"], state, top_k, top_p, temperature, repetition_penalty, max_length)
+
+    def infer(utterance, state, top_k, top_p, temperature, repetition_penalty, max_length):
+        """Model inference."""
+        utterance = utterance.strip().replace("<br>
", "\n") + context = state.setdefault("context", []) + + if not utterance: + gr.Warning("invalid inputs") + # gr.Warning("请输入有效问题") + shown_context = get_shown_context(context) + return None, shown_context, context, state + + context.append({"role": "user", "utterance": utterance}) + data = { + "context": utterance, + "top_k": top_k, + "top_p": top_p, + "temperature": temperature, + "repetition_penalty": repetition_penalty, + "max_length": max_length, + "min_length": 1, + } + result = requests.post(f"http://0.0.0.0:{args.flask_port}/api/chat", json=data).json() + bot_response = result["result"]["response"] + + # replace \n with br: https://github.com/gradio-app/gradio/issues/4344 + bot_response["utterance"] = bot_response["utterance"].replace("\n", "
") + context.append(bot_response) + shown_context = get_shown_context(context) + return None, shown_context, context, state + + def clean_context(context): + """Clean context for EB input.""" + cleaned_context = copy.deepcopy(context) + for turn in cleaned_context: + if turn["role"] == "bot": + bot_resp = turn["utterance"] + if bot_resp.startswith(""): + bot_resp = "\n".join(bot_resp.split("\n")[1:]) + turn["utterance"] = bot_resp + return cleaned_context + + def extract_eda(eb_debug_info): + """Extract EDA result from EB dispatch info.""" + eda_res = None + for item in eb_debug_info: + if item["sys"] == "EDA": + eda_output = json.loads(item["output"]) + eda_res = eda_output["result"] + break + return eda_res + + def extract_eb_input(eb_debug_info, convert_for_ar=True): + """Extract EB raw input from EB dispatch info.""" + eb_raw_input = None + for item in eb_debug_info: + if item["sys"] == "EB": + eb_output = json.loads(item["output"]) + eb_raw_input = eb_output["text_after_process"] + if convert_for_ar: + eb_raw_input = eb_raw_input.replace("[CLS]", "").replace("[SEP]", "") + break + return eb_raw_input + + def get_shown_context(context): + """Get gradio chatbot.""" + shown_context = [] + for turn_idx in range(0, len(context), 2): + shown_context.append([context[turn_idx]["utterance"], context[turn_idx + 1]["utterance"]]) + return shown_context + + with gr.Blocks(title="LLM", theme=gr.themes.Soft()) as block: + gr.Markdown(f"# {args.title}") + with gr.Row(): + with gr.Column(scale=1): + top_k = gr.Slider( + minimum=1, maximum=100, value=50, step=1, label="Top-k", info="该参数越大,模型生成结果更加随机,反之生成结果更加确定。" + ) + top_p = gr.Slider( + minimum=0, maximum=1, value=0.7, step=0.05, label="Top-p", info="该参数越大,模型生成结果更加随机,反之生成结果更加确定。" + ) + temperature = gr.Slider( + minimum=0.05, + maximum=1.5, + value=0.95, + step=0.05, + label="Temperature", + info="该参数越小,模型生成结果更加随机,反之生成结果更加确定。", + ) + repetition_penalty = gr.Slider( + minimum=0.1, + maximum=10, + value=1.0, + step=0.05, + label="Repetition Penalty", + info="该参数越大,生成结果重复的概率越低。设置 1 则不开启。", + ) + max_length = gr.Slider( + minimum=1, maximum=1024, value=50, step=1, label="Max Length", info="生成结果的最大长度。" + ) + with gr.Column(scale=4): + state = gr.State({}) + context_chatbot = gr.Chatbot(label="Context") + utt_text = gr.Textbox(placeholder="请输入...", label="Utterance") + with gr.Row(): + clear_btn = gr.Button("清空") + rollback_btn = gr.Button("撤回") + regen_btn = gr.Button("重新生成") + send_btn = gr.Button("发送") + with gr.Row(): + raw_context_json = gr.JSON(label="Raw Context") + + utt_text.submit( + infer, + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length], + outputs=[utt_text, context_chatbot, raw_context_json, state], + api_name="chat", + ) + clear_btn.click( + lambda _: (None, None, None, {}), + inputs=clear_btn, + outputs=[utt_text, context_chatbot, raw_context_json, state], + api_name="clear", + show_progress=False, + ) + rollback_btn.click( + rollback, + inputs=[state], + outputs=[utt_text, context_chatbot, raw_context_json, state], + show_progress=False, + ) + regen_btn.click( + regen, + inputs=[state, top_k, top_p, temperature, repetition_penalty, max_length], + outputs=[utt_text, context_chatbot, raw_context_json, state], + ) + send_btn.click( + infer, + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length], + outputs=[utt_text, context_chatbot, raw_context_json, state], + ) + + block.queue(default_enabled=True).launch(server_name="0.0.0.0", server_port=args.port, debug=True) + + +def 
main(args): + launch(args) + + +if __name__ == "__main__": + args = setup_args() + main(args) diff --git a/llm/llama/README.md b/llm/llama/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8eeac47ea70321c6d1209434146213456a124f52 --- /dev/null +++ b/llm/llama/README.md @@ -0,0 +1,115 @@ +# LLaMA + +## 1. 模型介绍 + +**支持模型权重:** + +| Model | +| ---------------------------------| +| facebook/llama-7b | +| facebook/llama-13b | +| facebook/llama-30b | +| facebook/llama-65b | +| meta-llama/Llama-2-7b | +| meta-llama/Llama-2-7b-chat | +| meta-llama/Llama-2-13b | +| meta-llama/Llama-2-13b-chat | +| meta-llama/Llama-2-70b | +| meta-llama/Llama-2-70b-chat | +| ziqingyang/chinese-llama-7b | +| ziqingyang/chinese-llama-13b | +| ziqingyang/chinese-alpaca-7b | +| ziqingyang/chinese-alpaca-13b | +| idea-ccnl/ziya-llama-13b-v1 | +| linly-ai/chinese-llama-2-7b | +| baichuan-inc/Baichuan-7B | +| baichuan-inc/Baichuan-13B-Base | +| baichuan-inc/Baichuan-13B-Chat | +| FlagAlpha/Llama2-Chinese-7b-Chat | +| FlagAlpha/Llama2-Chinese-13b-Chat | + + + +使用方法: + +```python +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer +model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat") +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat") +``` + +## 2. 模型协议 + +LLaMA 模型的权重的使用则需要遵循[License](../../paddlenlp/transformers/llama/LICENSE)。 + +Llama2 模型的权重的使用则需要遵循[License](../../paddlenlp/transformers/llama/Llama2.LICENSE)。 + + +## 3. 预训练 + +预训练数据制作参考[此处](../../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz +``` + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv llama_openwebtext_100k_ids.npy ./data +mv llama_openwebtext_100k_idx.npz ./data +``` + +使用下面脚本,即可在llama-7b的基础上,继续训练. +```shell +task_name_or_path="llama_hybrid" +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name_or_path""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-7b" \ + --tokenizer_name_or_path "facebook/llama-7b" \ + --input_dir "./data" \ + --output_dir "output/$task_name_or_path" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000005 \ + --lr_scheduler_type "cosine" \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 1\ + --recompute 1 \ + --do_train \ + --do_eval \ + --device "gpu" +``` +注意: +1. 需要paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包 +2. `use_flash_attention` 需要在A100机器开启,否则loss可能不正常(很快变成0.00x,非常小不正常)。建议使用cuda11.8环境。 +3. `continue_training` 表示从现有的预训练模型加载训练。7b模型初始loss大概为1.99x, 随机初始化模型loss从11.x左右下降。 +4. `use_fused_rms_norm` 需要安装[此目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt-3/external_ops)下的自定义OP, `python setup.py install`。如果安装后仍然找不到算子,需要额外设置PYTHONPATH +5. 
当前脚本为sharding版本,需要4D并行训练(数据、sharding、张量、流水线并行)的用户,请参考 `run_trainer_tp4pp2.sh`脚本。 + +## 4. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/llama/benchmark.py b/llm/llama/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..221220eacc319486e67bfea5e9efef7cdff8f489 --- /dev/null +++ b/llm/llama/benchmark.py @@ -0,0 +1,376 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import json +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from benchmark_utils import ( + LlamaTrainer, + compute_metrics, + compute_metrics_not_do_generation, +) +from modeling_pp import LlamaForCausalLMPipe + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.peft import LoRAConfig, LoRAModel, PrefixConfig, PrefixModelForCausalLM +from paddlenlp.peft.prefix import llama_postprocess_past_key_value +from paddlenlp.trainer import ( + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, + set_seed, +) +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +@dataclass +class DataArgument: + data_name: str = field(default=None, metadata={"help": "The name of data."}) + task_name_or_path: str = field(default=None, metadata={"help": "The name of task."}) + src_length: int = field(default=512, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=256, metadata={"help": "The max length of target text."}) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default="facebook/llama-7b", metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."}) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + eval_with_do_generation: bool = field( + default=False, metadata={"help": "Evaluate with generation, instead for calc loss."} + ) + profiler_options: str = field( + default=None, + metadata={"help": "profiler_options."}, + ) + # lora + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=4, metadata={"help": "Lora attention dimension"}) + merge_weights: bool = field( + default=False, metadata={"help": "Merge weights of the original model and the Lora model"} + ) + # prefix + prefix_tuning: bool = field(default=False, metadata={"help": "Whether to use Prefix technique"}) + num_prefix_tokens: int = field(default=10, metadata={"help": "Number of prefix tokens"}) + prefix_projection: bool = field(default=False, metadata={"help": "Whether to project the prefix 
tokens"}) + # qat + qat: bool = field(default=False, metadata={"help": "Whether to use QAT technique"}) + qat_type: str = field(default="A8W8", metadata={"help": "Quantization type. Supported values: A8W8, W4,A8W4"}) + + +PROMPT_DICT = { + "prompt_input": ( + "Below is an instruction that describes a task, paired with an input that provides further context. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:" + ), + "prompt_no_input": ( + "Below is an instruction that describes a task. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Response:" + ), +} + + +def read_local_dataset(path): + with open(path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + yield json_line + + +def custom_instruction_convert_example(example, tokenizer, data_args, is_test=False, model_max_length=512): + """ + Convert an example into necessary features. + """ + + prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"] + + if example.get("input", "") != "": + input_seq = prompt_input.format_map(example) + else: + input_seq = prompt_no_input.format_map(example) + + output_seq = example["output"] + tokenizer.eos_token + + # To compatible with compile training mode in benchmark, input will be pad to fix length + source_tokenized = tokenizer( + input_seq, + return_tensors="pd", + max_length=model_max_length, + truncation=True, + ) + + source_input_ids_len = ( + source_tokenized["input_ids"].not_equal(paddle.to_tensor(tokenizer.pad_token_id)).sum().item() + ) + + example_tokenized = tokenizer( + input_seq + output_seq, + return_tensors="pd", + max_length=model_max_length, + truncation=True, + ) + + input_ids = example_tokenized["input_ids"][0] + labels = copy.deepcopy(input_ids) + labels[:source_input_ids_len] = -100 + + if is_test: + return dict( + input_ids=source_tokenized["input_ids"][0], + labels=labels, + ) + + # shift labels + input_ids, labels = input_ids[:-1], labels[1:] + + return dict( + input_ids=input_ids, + labels=labels, + ) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + data_args.always_pad_to_max_length = training_args.pipeline_parallel_degree > 1 + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.tgt_length = data_args.tgt_length + + training_args.profiler_options = model_args.profiler_options + setattr(training_args, "label_smoothing", model_args.label_smoothing) + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + paddle.set_device(training_args.device) + + set_seed(args=training_args) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + model_class = AutoModelForCausalLM + if training_args.pipeline_parallel_degree > 1: + if model_args.eval_with_do_generation and training_args.do_eval: + raise ValueError("Plese set eval_with_do_generation to false in pipeline parallel mode.") + model_class = LlamaForCausalLMPipe + + # Load the pretrained language model. + model = model_class.from_pretrained( + model_args.model_name_or_path, + tensor_parallel_output=False, + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, # todo enable set dtype to avoid additional mem usage + ) + if model_args.lora: + if model_args.lora_path is None: + # Not yet support RowParallelLinear + target_modules = [ + ".*q_proj.*", + ".*v_proj.*", + ".*k_proj.*", + ".*gate_proj.*", + ".*up_proj.*", + ".*o_proj.*", + ".*down_proj.*", + ] + + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=model_args.merge_weights, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + if model_args.qat: + from paddle import nn + from paddle.quantization import QAT, QuantConfig + + # FakeQuanterChannelWiseAbsMaxObserver not yet merge in Paddle develop + from paddle.quantization.quanters import FakeQuanterChannelWiseAbsMaxObserver + from paddle.quantization.quanters.abs_max import ( + FakeQuanterWithAbsMaxObserverLayer, + ) + from paddleslim.quant.quanters import PACTQuanter + + # from paddle.quantization.quanters import FakeQuanterWithAbsMaxObserver + from paddlenlp.peft.lora import LoRALinear + from paddlenlp.peft.lora.lora_quant_layers import QuantedLoRALinear + + q_config = QuantConfig(activation=None, weight=None) + q_config.add_qat_layer_mapping(LoRALinear, QuantedLoRALinear) + + if model_args.qat_type == "A8W8": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, init_value=20, dtype=dtype) + # activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype="float32") + elif model_args.qat_type == "W4": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + elif model_args.qat_type == "A8W4": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, 
init_value=20, dtype=dtype) + # activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + else: + raise ValueError("qat_type should be one of ['A8W8', 'W4', 'A8W4']") + + q_config.add_type_config(LoRALinear, weight=weight, activation=activation) + q_config.add_type_config(nn.Linear, weight=weight, activation=activation) + + qat = QAT(q_config) + model = qat.quantize(model, inplace=True) + + if model_args.prefix_tuning: + prefix_config = PrefixConfig( + num_prefix_tokens=model_args.num_prefix_tokens, + num_attention_heads=model.config.n_head, + num_hidden_layers=model.config.n_layer, + hidden_size=model.config.hidden_size, + prefix_projection=model_args.prefix_projection, + prefix_projection_hidden_size=model.config.hidden_size, + dtype=dtype, + ) + model = PrefixModelForCausalLM( + model=model, + prefix_config=prefix_config, + postprocess_past_key_value=llama_postprocess_past_key_value, + ) + model.mark_only_prefix_as_trainable() + model.print_trainable_parameters() + + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, + padding_side="left", # Allow batch inference + ) + tokenizer.pad_token = tokenizer.unk_token + + # Load the dataset. + train_ds = load_dataset(read_local_dataset, path="./data/train.txt", lazy=False) + training_args.do_eval = False + data_args.always_pad_to_max_length = True + trans_func = partial(custom_instruction_convert_example, tokenizer=tokenizer, data_args=data_args) + + train_ds = train_ds.map(partial(trans_func)) + + model_max_length = 512 + collate_fn = DataCollatorForSeq2Seq( + return_tensors="pd", + tokenizer=tokenizer, + max_length=model_max_length if data_args.always_pad_to_max_length else -1, + padding="max_length" if data_args.always_pad_to_max_length else True, + max_label_length=model_max_length if data_args.always_pad_to_max_length else None, + return_attention_mask=True, + ) + + def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + preds = eval_preds.predictions + preds = [x[x != -100] for x in preds] + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = [x[x != -100] for x in eval_preds.label_ids] + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + + all_preds = [pred.strip() for pred in all_preds] + all_labels = [label.strip() for label in all_labels] + all_preds = [pred.strip("question:") for pred in all_preds] + all_labels = [label.strip("question:") for label in all_labels] + + eval_result = compute_metrics(all_preds, all_labels) + return eval_result + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = LlamaTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics_func + if model_args.eval_with_do_generation + else compute_metrics_not_do_generation, + do_generation=model_args.eval_with_do_generation, + data_collator=collate_fn, + ) + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if training_args.do_eval: + eval_result = 
trainer.evaluate() + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/llama/benchmark_utils.py b/llm/llama/benchmark_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a0dbd0637c7448858311e9837521d57e0567cc29 --- /dev/null +++ b/llm/llama/benchmark_utils.py @@ -0,0 +1,227 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.optimizer.lr import LambdaDecay +from rouge import Rouge +from sklearn.metrics import accuracy_score + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import PrinterCallback, ProgressCallback, Trainer +from paddlenlp.trainer.integrations import TrainerCallback +from paddlenlp.utils.log import logger + + +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time + + +class BenchmarkCallback(TrainerCallback): + def __init__(self, benchmark=True, profiler_options=None): + self.benchmark = benchmark + self.profiler_options = profiler_options + + def on_train_begin(self, args, state, control, **kwargs): + assert args.gradient_accumulation_steps == 1 and not args.do_eval and not args.do_predict + if self.benchmark: + self.reader_cost_avg = AverageStatistical() + + def on_epoch_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.epoch_start = time.time() + self.batch_start = time.time() + + def on_step_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.reader_cost_avg.record(time.time() - self.batch_start) + + def on_step_end(self, args, state, control, **kwargs): + if self.benchmark: + self.batch_start = time.time() + if control.should_log: + self.maybe_log_save_evaluate_start = time.time() + + def on_log(self, args, state, control, logs=None, **kwargs): + if self.benchmark: + if logs is not None and "interval_steps_per_second" in logs: + self.batch_start = self.batch_start + (time.time() - self.maybe_log_save_evaluate_start) + ips = logs["interval_steps_per_second"] * args.train_batch_size + avg_batch_cost = 1 / logs["interval_steps_per_second"] + logger.info( + "global step %d / %d, loss: %f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sample/sec" + % ( + state.global_step, + state.max_steps, + logs["loss"], + self.reader_cost_avg.get_average(), + avg_batch_cost, + args.train_batch_size, + ips, + ) + ) + self.reader_cost_avg.reset() + + def 
on_epoch_end(self, args, state, control, **kwargs): + if self.benchmark: + train_epoch_cost = time.time() - self.epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (state.epoch, train_epoch_cost)) + + +class LlamaTrainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + self.add_callback(BenchmarkCallback(benchmark=True, profiler_options=self.args.profiler_options)) + if self.args.disable_tqdm: + self.pop_callback(PrinterCallback) + else: + self.pop_callback(ProgressCallback) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. + # keepdim in order to maintain the same shape as logits + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + model.eval() + + preds = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=self.args.tgt_length, + min_length=0, + use_cache=True, + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, + decode_strategy="sampling", + )[0] + all_labels = [] + for label in inputs["labels"].numpy(): + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(preds), paddle.to_tensor(all_labels)) + + def create_scheduler(self, num_training_steps: int): + num_warmup_steps = ( + self.args.warmup_steps if self.args.warmup_steps > 0 else self.args.warmup_ratio * num_training_steps + ) + + def lr_lambda(current_step: int): + if current_step < num_warmup_steps: + return float(current_step) / float(max(1, num_warmup_steps)) + else: + decay_step_ratio = (current_step - num_warmup_steps) / (num_training_steps - num_warmup_steps) + return 1.0 - (1.0 - self.args.lr_decay_ratio) * decay_step_ratio + + if self.lr_scheduler is None: + self.lr_scheduler = LambdaDecay(self.args.learning_rate, lr_lambda, last_epoch=-1) + return self.lr_scheduler + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(LlamaTrainer, self).log(logs, **kwargs) + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + + rouge1 = round(rouge1, 4) + rouge2 = round(rouge2, 4) + rougel = round(rougel, 4) + bleu4 = round(bleu4.score(), 4) + return dict( + rouge1=rouge1, + rouge2=rouge2, + rougel=rougel, + bleu4=bleu4, + ) + + +def compute_metrics_not_do_generation(eval_preds): + flattened_preds = np.array(eval_preds.predictions).flatten() + flattened_labels = np.array(eval_preds.label_ids).flatten() + filtered_preds = flattened_preds[flattened_labels != -100] + filtered_labels = flattened_labels[flattened_labels != -100] + accuracy = accuracy_score(y_true=filtered_labels, y_pred=filtered_preds) + return { + "accuracy": accuracy, + } diff --git a/llm/llama/fused_layers.py b/llm/llama/fused_layers.py new file mode 100644 index 0000000000000000000000000000000000000000..2196fed9445c8a98593ddf7a2975e1799fa2e1eb --- /dev/null +++ b/llm/llama/fused_layers.py @@ -0,0 +1,77 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
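+
+# This module monkey-patches paddle.nn.functional.linear (and, when the fused_gemm_epilogue kernel
+# is available, paddle.incubate.nn.functional.fused_linear) with FusedLinearWithGradAdd: its backward
+# accumulates weight/bias gradients in place via _C_ops.fused_linear_param_grad_add, removing the
+# extra elementwise add that gradient accumulation would otherwise perform for every nn.Linear.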
+import paddle +from paddle import _C_ops +from paddle.framework import core + + +def is_fused_matmul_bias_supported(): + if paddle.is_compiled_with_cuda() and not paddle.is_compiled_with_rocm() or paddle.is_compiled_with_xpu(): + return hasattr(core.eager.ops.legacy, "fused_gemm_epilogue") + else: + return False + + +if is_fused_matmul_bias_supported(): + origin_linear = paddle.incubate.nn.functional.fused_linear +else: + origin_linear = paddle.nn.functional.linear + + +class FusedLinearWithGradAdd(paddle.autograd.PyLayer): + @staticmethod + def forward(ctx, x, weight, bias=None, name=None): + y = origin_linear(x, weight, bias) + ctx.save_for_backward(x, weight, bias) + return y + + @staticmethod + def backward(ctx, y_grad): + x, weight, bias = ctx.saved_tensor() + x_grad = paddle.matmul(y_grad, weight, transpose_y=True) + + # _C_ops.fused_linear_param_grad_add(x, y_grad, dw, db, multi precision, has bias) + if bias is None: + if hasattr(weight, "main_grad"): + weight.main_grad, _ = _C_ops.fused_linear_param_grad_add( + x, y_grad, weight.main_grad, None, True, False + ) + return x_grad, None + else: + if weight.grad is not None: + weight.grad, _ = _C_ops.fused_linear_param_grad_add(x, y_grad, weight.grad, None, False, False) + return x_grad, None + else: + weight_grad, _ = _C_ops.fused_linear_param_grad_add(x, y_grad, None, None, False, False) + return x_grad, weight_grad + + if hasattr(weight, "main_grad") and hasattr(bias, "main_grad"): + weight.main_grad, bias.main_grad = _C_ops.fused_linear_param_grad_add( + x, y_grad, weight.main_grad, bias.main_grad, True + ) + return x_grad, None, None + else: + if weight.grad is not None: + assert bias.grad is not None + weight.grad, bias.grad = _C_ops.fused_linear_param_grad_add(x, y_grad, weight.grad, bias.grad, False) + return x_grad, None, None + else: + weight_grad, bias_grad = _C_ops.fused_linear_param_grad_add(x, y_grad, None, None, False) + return x_grad, weight_grad, bias_grad + + +def mock_layers(): + paddle.nn.functional.linear = FusedLinearWithGradAdd.apply + if is_fused_matmul_bias_supported(): + paddle.incubate.nn.functional.fused_linear = FusedLinearWithGradAdd.apply diff --git a/llm/llama/gptq_argument.json b/llm/llama/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..75944f076c2967df2d63026ed92fab40a7682c0d --- /dev/null +++ b/llm/llama/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/llama/lora_argument.json b/llm/llama/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..95bd4029b48aec6436c7296ad005fe45507c97bf --- /dev/null +++ b/llm/llama/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + 
"src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/llama/megre_tp_and_pp.py b/llm/llama/megre_tp_and_pp.py new file mode 100644 index 0000000000000000000000000000000000000000..1758ecf597107860412d7886d7de34fcfd00192b --- /dev/null +++ b/llm/llama/megre_tp_and_pp.py @@ -0,0 +1,88 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.transformers import LlamaConfig, LlamaForCausalLM +from paddlenlp.utils.log import logger + + +def merge_pipeline_parallel(tp_degree, pp_degree, path): + tp_state_dict_list = [] + for tp in range(tp_degree): + tp_state_dict = {} + for pp in range(pp_degree): + tmp = paddle.load(os.path.join(path, f"model_state.tp{tp:0>2d}_pp{pp:0>2d}.pdparams"), return_numpy=True) + for k, v in tmp.items(): + tp_state_dict[k] = v + + tp_state_dict_list.append(tp_state_dict) + + return tp_state_dict_list + + +def merge_tensor_parallel(cls, state_dict_list, config) -> None: + """the entry of converting config and converting model file + + Args: + input_dir (str | None): the input dir which contains `pytorch_model.bin` and `config.json` file + config (PretrainedConfig): the PretrainedConfig instance of model + """ + name_action_mappings = cls._get_tensor_parallel_mappings(config, is_split=False) + state_keys_map = cls._resolve_prefix_keys(name_action_mappings.keys(), state_dict_list[0].keys()) + + for k, v in state_keys_map.items(): + name_action_mappings[v] = name_action_mappings.pop(k) + + state_dict_to_save = {} + for key in state_dict_list[0].keys(): + tensor = state_dict_list[0][key] + if key in name_action_mappings: + ret = [x[key] for x in state_dict_list] + action = name_action_mappings.pop(key) + tensor = action(ret) + + state_dict_to_save[key] = tensor + + if len(name_action_mappings) > 0: + for x in name_action_mappings.keys(): + logger.warning(f"key <{x}> need to merge tensor parallel but we can't find in model state.") + + print("Finally, we merging state dict to fellowing tensors.") + for k, v in state_dict_to_save.items(): + print(k, v.shape, v.dtype) + + return state_dict_to_save + + +def main(): + tp_degree = 2 + pp_degree = 2 + model_name_or_path = "temp_dir_to_your_ckpt" + + assert tp_degree > 1 + assert pp_degree > 1 + config = LlamaConfig.from_pretrained(model_name_or_path) + cls = LlamaForCausalLM + + tp_state_dict_list = merge_pipeline_parallel(tp_degree, pp_degree, model_name_or_path) + state_dict_to_save = merge_tensor_parallel(cls=cls, state_dict_list=tp_state_dict_list, config=config) + print("saving") + paddle.save(state_dict_to_save, os.path.join(model_name_or_path, 
"model_state.pdparams")) + + +if __name__ == "__main__": + main() diff --git a/llm/llama/modeling_pp.py b/llm/llama/modeling_pp.py new file mode 100644 index 0000000000000000000000000000000000000000..5354741fdea87b5b5c29055165d85ac4f8188c39 --- /dev/null +++ b/llm/llama/modeling_pp.py @@ -0,0 +1,278 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# pass +import paddle +import paddle.distributed.fleet as fleet +import paddle.nn as nn +from paddle.distributed.fleet.meta_parallel import LayerDesc, PipelineLayer + +from paddlenlp.transformers import PretrainedModel +from paddlenlp.transformers.llama.modeling import ( + LlamaConfig, + LlamaDecoderLayer, + LlamaLMHead, + LlamaModel, + LlamaPretrainedModel, + LlamaPretrainingCriterion, + LlamaRMSNorm, +) + + +def get_hcg(): + return fleet.get_hybrid_communicate_group() + + +def parse_args(args): + if isinstance(args, tuple): + if len(args) == 3: + hidden_states, attention_mask, position_ids = args + elif len(args) == 2: + hidden_states, attention_mask = args + position_ids = None + else: + hidden_states = args + attention_mask, position_ids = None, None + + if position_ids is not None: + position_ids.stop_gradient = True + + if attention_mask is not None: + attention_mask.stop_gradient = True + + return hidden_states, attention_mask, position_ids + + +def return_args(hidden_states, attention_mask=None, position_ids=None): + ret = (hidden_states,) + + if attention_mask is not None: + ret += (attention_mask.clone(),) + if position_ids is not None: + ret += (position_ids.clone(),) + if len(ret) == 1: + ret = ret[0] + + return ret + + +class LlamaEmbeddingPipe(nn.Layer): + """Extends LlamaEmbeddings to forward attention_mask through the pipeline.""" + + def __init__(self, config): + super(LlamaEmbeddingPipe, self).__init__() + self.sequence_parallel = config.sequence_parallel + self.hidden_size = config.hidden_size + if config.tensor_parallel_degree > 1: + self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding( + config.vocab_size, + config.hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()), + ) + else: + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size) + + def forward(self, args): + """_summary_ + + Args: + input (_type_): _description_ + + Returns: + _type_: _description_ + """ + input_ids, attention_mask, position_ids = parse_args(args) + input_embeds = self.embed_tokens(input_ids) + if self.sequence_parallel: + from paddlenlp.transformers import ScatterOp + + # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim] + bs, seq_len, hidden_size = input_embeds.shape + input_embeds = paddle.reshape_(input_embeds, [bs * seq_len, hidden_size]) + # [seq_len * bs / n, num_head * head_dim] (n is mp parallelism) + input_embeds = ScatterOp.apply(input_embeds) + + batch_size, seq_length = input_ids.shape + if attention_mask is not None: + attention_mask = LlamaModel._prepare_decoder_attention_mask( + 
attention_mask, (batch_size, seq_length), 0, input_embeds.dtype + ) + attention_mask.stop_gradient = True + + return return_args(input_embeds, attention_mask, position_ids) + + +class LlamaDecoderLayerPipe(LlamaDecoderLayer): + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + hidden_states = super().forward(hidden_states, attention_mask=attention_mask) + return return_args(hidden_states, attention_mask, position_ids) + + +class LlamaRMSNormPipe(LlamaRMSNorm): + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + return super().forward(hidden_states) + + +class PipelinePretrainedModel(PretrainedModel): + _sequential_layers = [] + _pipeline_name_mapping = None + + def __init__(self, config, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + def add_sequential_layer(self, layer_desc, name_prefix=""): + self._sequential_layers.append({"layer": layer_desc, "name_prefix": name_prefix}) + + def get_sequential_layers(self): + return [x["layer"] for x in self._sequential_layers] + + def get_sequential_name_prefixs(self): + return {str(index): x["name_prefix"] for index, x in enumerate(self._sequential_layers)} + + def _set_pipeline_name_mapping(self, mappings=None): + if mappings is not None: + self._pipeline_name_mapping = mappings + else: + mapping = {} + state_dict_keys = list(super().state_dict().keys()) + first_key = state_dict_keys[0].split(".") + # if use virtual pp_degree, the prefix is like 0.0.xxx + # else it will be like 0.xxx + use_virtual_pp_degree = first_key[0].isdigit() and first_key[1].isdigit() + + prefixs = self.get_sequential_name_prefixs() + for k in state_dict_keys: + name_splited = k.split(".") + if use_virtual_pp_degree: + idx = str(int(name_splited[0]) + int(name_splited[1])) + single_name = [prefixs[idx]] + single_name.extend(name_splited[2:]) + else: + idx = name_splited[0] + single_name = [prefixs[idx]] + single_name.extend(name_splited[1:]) + mapping[".".join(single_name)] = k + + self._pipeline_name_mapping = mapping + + return self._pipeline_name_mapping + + def state_dict(self, *args, **kwargs): + state_dict = super().state_dict(*args, **kwargs) + + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + pp_to_single_mapping = {v: k for k, v in self._pipeline_name_mapping.items()} + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + state_dict[pp_to_single_mapping[k]] = v + + return state_dict + + def set_state_dict(self, state_dict, *args, **kwargs): + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + if k not in self._pipeline_name_mapping: + continue + state_dict[self._pipeline_name_mapping[k]] = v + + ret = super().set_state_dict(state_dict, *args, **kwargs) + return ret + + +class LlamaForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): + """LlamaForPretraining adapted for pipeline parallelism. + + The largest change is flattening the LlamaModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. + """ + + config_class = LlamaConfig + + _get_tensor_parallel_mappings = LlamaPretrainedModel._get_tensor_parallel_mappings + _init_weights = LlamaPretrainedModel._init_weights + + # NO base_model_prefix !!!! 
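+    # (PipelinePretrainedModel above already translates between the flat pipeline parameter names and
+    # the single-card "llama.*" names, which appears to be why base_model_prefix is intentionally unset.)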
+ + def __init__( + self, + config, + # scale_qk_by_layer_num=True, + # virtual_pp_degree=4, + ): + self.config = config + + self.use_recompute = self.config.use_recompute + self.recompute_granularity = self.config.recompute_granularity + self.pp_recompute_interval = self.config.pp_recompute_interval + self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else [] + if self.recompute_granularity == "full": + assert len(self.no_recompute_layers) == 0, "for pp with full recompute, no_recompute_layers is not support" + + # virtual_pp_degree = self.config.virtual_pp_degree + virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1) + + hcg = get_hcg() + tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1) + tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + self.add_sequential_layer(LayerDesc(LlamaEmbeddingPipe, config=config), "llama") + for i in range(config.num_hidden_layers): + self.add_sequential_layer( + LayerDesc(LlamaDecoderLayerPipe, config=config, layerwise_recompute=i not in self.no_recompute_layers), + f"llama.layers.{i}", + ) + + self.add_sequential_layer(LayerDesc(LlamaRMSNormPipe, config=config), "llama.norm") + self.add_sequential_layer(LayerDesc(LlamaLMHead, config=config), "lm_head") + + recompute_interval = 0 + if self.use_recompute and self.recompute_granularity == "full": + assert self.config.pp_recompute_interval <= config.num_hidden_layers // ( + virtual_pp_degree * get_hcg().topology().get_dim_size("pipe") + ), "pp recompute interval should smaller than num layers of each pp chunk" + recompute_interval = self.config.pp_recompute_interval + + seg_method = "layer:LlamaDecoderLayer" + if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0: + seg_method = "uniform" + + PipelineLayer.__init__( + self, + layers=self.get_sequential_layers(), + loss_fn=LlamaPretrainingCriterion(config), + topology=get_hcg().topology(), + seg_method=seg_method, + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": get_hcg().get_model_parallel_group(), + "offload": False, + "partition": False, + }, + num_virtual_pipeline_stages=virtual_pp_degree, + ) + self.apply(self._init_weights) + # DON'T init PipelinePretrainedModel + # PipelinePretrainedModel.__init__(self.super(), config=config) diff --git a/llm/llama/pt_argument.json b/llm/llama/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..6c7e8343a0df34d40a4f31c91e0ad4a052878472 --- /dev/null +++ b/llm/llama/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git 
a/llm/llama/ptq_argument.json b/llm/llama/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..82cdddbbcb3e09f29cd45132f4c7e1edf80bac78 --- /dev/null +++ b/llm/llama/ptq_argument.json @@ -0,0 +1,22 @@ +{ + "model_name_or_path": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_ptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16, + "smooth": true, + "smooth_step": 16, + "smooth_all_linears": true, + "smooth_piecewise_search": true, + "smooth_k_piece": true, + "smooth_search_piece": true + } \ No newline at end of file diff --git a/llm/llama/run_pretrain.py b/llm/llama/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..4ca3efbddbbf7f42031cdebee30a3423b66cf7bd --- /dev/null +++ b/llm/llama/run_pretrain.py @@ -0,0 +1,562 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +GPT/Llama pretraining scripts. +""" +import math +import os +import random +import sys +import time +from dataclasses import dataclass, field +from typing import List, Optional + +import numpy as np +import paddle + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( + AutoTokenizer, + CosineAnnealingWithWarmupDecay, + LinearAnnealingWithWarmupDecay, + LlamaConfig, + LlamaForCausalLM, + register_sequence_parallel_allreduce_hooks, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "llama": ( + LlamaConfig, + LlamaForCausalLM, + ), +} + +from fused_layers import mock_layers +from modeling_pp import LlamaForCausalLMPipe + +from paddlenlp.data.causal_dataset import build_train_valid_test_datasets, print_rank_0 + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + enable_linear_fused_grad_add: bool = field( + default=False, + metadata={ + "help": "Enable fused linear grad add strategy, which will reduce elementwise add for grad accumulation in the backward of nn.Linear ." 
+ }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."}) + skip_warmup: bool = field( + default=True, + metadata={"help": "Whether to skip the warmup process of mmap files."}, + ) + data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="llama", metadata={"help": "Only support for llama pre-training for now."} + ) + model_name_or_path: str = field( + default="__internal_testing__/tiny-random-llama", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + use_flash_attention: bool = field( + default=False, + metadata={"help": "use_flash_attention"}, + ) + use_fused_rms_norm: bool = field( + default=False, + metadata={"help": "llama, use_fused_rms_norm"}, + ) + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "whether to fuse attention qkv"}, + ) + fuse_attention_ffn: bool = field( + default=False, + metadata={"help": "whether to fuse first up and gate proj in mlp block"}, + ) + recompute_granularity: str = field( + default="full", + metadata={"help": "Choose among ['full', 'core_attn', 'full_attn']"}, + ) + virtual_pp_degree: int = field( + default=1, + metadata={"help": "virtual_pp_degree"}, + ) + continue_training: bool = field( + default=False, + metadata={ + "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." + }, + ) + sequence_parallel: bool = field( + default=False, + metadata={"help": "whether to use sequence parallel"}, + ) + fuse_sequence_parallel_allreduce: bool = field( + default=False, + metadata={"help": "whether to use fuse sequence parallel allreduce"}, + ) + rope_fusion_level: Optional[str] = field( + default=None, + metadata={ + "help": "The level of fusion of rope embedding. 
Can be chosen from:\n" + "(1) 'full': fuse sin cos compute and rope embedding\n" + "(2) 'core': only fuse rope embedding, will compute the sin and cos\n" + "(3) None: don't fuse any part of the rope embedding" + }, + ) + no_recompute_layers: Optional[List[int]] = field( + default=None, + metadata={"help": "Specify the full transformer layers that should not be recomputed."}, + ) + pp_recompute_interval: int = field( + default=1, + metadata={ + "help": "The interval for the number of layers at which recomputation occurs. A value of 0 indicates no recomputation. Default is 0." + }, + ) + recompute_use_reentrant: bool = field( + default=False, + metadata={"help": "recompute_use_reentrant"}, + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + need_data=True, +): + train_val_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + print_rank_0(" train: {}".format(train_val_test_num_samples[0])) + print_rank_0(" validation: {}".format(train_val_test_num_samples[1])) + print_rank_0(" test: {}".format(train_val_test_num_samples[2])) + + # Build the datasets. + train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl=data_args.data_impl, + splits_string=data_args.split, + train_val_test_num_samples=train_val_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=data_args.skip_warmup, + data_cache_path=data_args.data_cache, + need_data=need_data, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode.") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + + logger.info(tokenizer._decode(input_ids)) + + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn([x["text"] for x in data]) + + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + return { + "input_ids": tokens, + "labels": labels, + } + + if need_data: + print_dataset(train_dataset[0], "train") + print_dataset(valid_dataset[0], "valid") + print_dataset(test_dataset[0], "test") + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... 
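+ # (illustrative) --input_dir "0.5 /data/ds1_prefix 0.5 /data/ds2_prefix" mixes two preprocessed
+ # datasets; the weight/prefix list is passed through unchanged to build_train_valid_test_datasets.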
+ return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] # add + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + # keep eval_dataloader + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if training_args.enable_linear_fused_grad_add: + mock_layers() + + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + if data_args.data_cache is not None: + os.makedirs(data_args.data_cache, 
exist_ok=True) + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + + config.seq_length = data_args.max_seq_length + # There are some technique extend RotaryEmbedding context. 
so don't change max_position_embeddings + if not model_args.continue_training: + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + + if not model_args.continue_training: + config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128) + logger.info(f"Reset vocab size to {config.vocab_size} for batter amp peformance.") + + if model_args.no_recompute_layers is not None: + model_args.no_recompute_layers.sort() + + config.use_flash_attention = model_args.use_flash_attention + config.use_fused_rms_norm = model_args.use_fused_rms_norm + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.fuse_attention_ffn = model_args.fuse_attention_ffn + config.recompute_granularity = model_args.recompute_granularity + config.virtual_pp_degree = model_args.virtual_pp_degree + config.sequence_parallel = model_args.sequence_parallel + config.fuse_sequence_parallel_allreduce = model_args.fuse_sequence_parallel_allreduce + config.rope_fusion_level = model_args.rope_fusion_level + config.no_recompute_layers = model_args.no_recompute_layers + config.pp_recompute_interval = model_args.pp_recompute_interval + config.recompute_use_reentrant = model_args.recompute_use_reentrant + + config.use_recompute = training_args.recompute + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + + print("Final pre-training config:", config) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + if training_args.pipeline_parallel_degree > 1 and model_args.model_type == "llama": + model_class = LlamaForCausalLMPipe + + if model_args.continue_training: + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + ) + else: + model = model_class._from_config(config, dtype=dtype) + + if model_args.sequence_parallel: + register_sequence_parallel_allreduce_hooks( + model, training_args.gradient_accumulation_steps, model_args.fuse_sequence_parallel_allreduce + ) + + if training_args.recompute: + model.recompute_enable() + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + need_data=training_args.should_load_dataset, + ) + + total_effective_tokens = ( + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps + * data_args.max_seq_length + ) + + trainer = 
PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + if training_args.should_load_dataset: + effective_tokens_per_second = total_effective_tokens / train_result.metrics["train_runtime"] + print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + print(f"ips: {effective_tokens_per_second:.2f} tokens/s") + + +if __name__ == "__main__": + main() diff --git a/llm/llama/run_trainer.sh b/llm/llama/run_trainer.sh new file mode 100644 index 0000000000000000000000000000000000000000..fc23373ac3763938201a28a2ebea88a5b44b9247 --- /dev/null +++ b/llm/llama/run_trainer.sh @@ -0,0 +1,59 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES +task_name="llama_hybrid" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + + +PYTHONPATH=../../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-7b" \ + --tokenizer_name_or_path "facebook/llama-7b" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 1\ + --recompute 1 \ + --do_train \ + --do_eval \ + --device "gpu" \ + --data_impl "mmap" \ No newline at end of file diff --git a/llm/llama/run_trainer_tp4pp2.sh b/llm/llama/run_trainer_tp4pp2.sh new file mode 100644 index 0000000000000000000000000000000000000000..07b7fa1996d53c002d36ee3f6f6ff9fe1dd1d569 --- /dev/null +++ b/llm/llama/run_trainer_tp4pp2.sh @@ -0,0 +1,70 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +task_name="llama_hybrid" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + + +PYTHONPATH=../../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-7b" \ + --tokenizer_name_or_path "facebook/llama-7b" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 4 \ + --per_device_eval_batch_size 4 \ + --use_flash_attention 1 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --tensor_parallel_degree 4 \ + --pipeline_parallel_degree 2 \ + --virtual_pp_degree 1 \ + --sequence_parallel 0 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 10 \ + --dataloader_num_workers 1 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --sharding "stage1" \ + --disable_tqdm true \ + --continue_training 1 \ + --recompute 0 \ + --recompute_granularity full \ + --do_train \ + --do_eval \ + --device "gpu" \ + --distributed_dataloader 1 + # --pipeline_parallel_config "disable_partial_send_recv" # if set sequence_parallel True, please note off this line. + # reompute settings: + # --no_recompute_layers 0 1 2 3 4 5 6 7 8 9 10 ... int int + # --pp_recompute_interval 0 # A value of 0 indicates no recomputation. diff --git a/llm/llama/run_trainer_tp8.sh b/llm/llama/run_trainer_tp8.sh new file mode 100644 index 0000000000000000000000000000000000000000..3141558bf5b565f550e757e993ea91cdcb2ed9d3 --- /dev/null +++ b/llm/llama/run_trainer_tp8.sh @@ -0,0 +1,64 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
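+
+# Pretrains facebook/llama-13b with tensor_parallel_degree 8 (no pipeline parallelism) on a
+# single 8-GPU node. The RCCL/HSA environment variables below appear to target ROCm devices
+# and can be dropped on CUDA GPUs.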
+ +task_name="llama_13b" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + + +PYTHONPATH=../../:$PYTHONPATH \ +RCCL_NCHANNELS=8 HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-13b" \ + --tokenizer_name_or_path "facebook/llama-13b" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 2 \ + --use_flash_attention 0 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --tensor_parallel_degree 8 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 10 \ + --dataloader_num_workers 1 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --sharding "stage1" \ + --disable_tqdm true \ + --continue_training 1 \ + --recompute 1 \ + --recompute_granularity full \ + --do_train \ + --do_eval \ + --device "gpu" \ + --distributed_dataloader 1 + # --pipeline_parallel_config "disable_partial_send_recv" # if set sequence_parallel True, please note off this line. + # reompute settings: + # --no_recompute_layers 0 1 2 3 4 5 6 7 8 9 10 ... int int + # --pp_recompute_interval 0 # A value of 0 indicates no recomputation. diff --git a/llm/llama/sft_argument.json b/llm/llama/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..8f46d4a3778d1452bd65a14e36507467eca191a1 --- /dev/null +++ b/llm/llama/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } \ No newline at end of file diff --git a/llm/llama/sft_pp_argument.json b/llm/llama/sft_pp_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..68d79ce1e606f93d90601aec21eb369057fabdb4 --- /dev/null +++ b/llm/llama/sft_pp_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 256, + "max_length": 512, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + 
"recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 2, + "pipeline_parallel_degree": 2, + "pipeline_parallel_config": "disable_p2p_cache_shape" + } \ No newline at end of file diff --git a/llm/llama/tests/test_pipeline_parallel.py b/llm/llama/tests/test_pipeline_parallel.py new file mode 100644 index 0000000000000000000000000000000000000000..a3b595a92082670ebc05dd8a0619cbd3aeb7d04d --- /dev/null +++ b/llm/llama/tests/test_pipeline_parallel.py @@ -0,0 +1,151 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from modeling_pp import LlamaForCausalLMPipe +from paddle.distributed.fleet.meta_parallel.pipeline_parallel import PipelineParallel + +from paddlenlp.transformers import LlamaForCausalLM + + +class TestLlama(unittest.TestCase): + def test_pipeline_model(self): + world_size = paddle.distributed.get_world_size() + pp_degree = world_size + tp_degree = 1 + if world_size > 2: + pp_degree = 2 + assert world_size % pp_degree == 0 + tp_degree = world_size // pp_degree + + pp_degree = -1 + if pp_degree == -1: + tp_degree = world_size + pp_degree = 1 + + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tp_degree, + "pp_degree": pp_degree, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + + if pp_degree > 1: + model_class = LlamaForCausalLMPipe + else: + model_class = LlamaForCausalLM + + model_name_or_path = "./llama-7b-2l" + # model_name_or_path = "__internal_testing__/tiny-random-llama" + # hidden_size = 4096 + model = model_class.from_pretrained( + model_name_or_path, + num_attention_heads=32, + tensor_parallel_degree=tp_degree, + tensor_parallel_rank=hcg.get_model_parallel_rank(), + tensor_parallel_output=False, + # use_flash_attention=True, + ) + + model.eval() + + # for k, v in model.state_dict().items(): + # print(k, v.shape, v.dtype, v.abs().sum().item()) + # if k == "lm_head.weight": + # print(v) + + # input_ids = paddle.to_tensor([[x for x in range(100, 110)]], dtype="int64") + # labels = paddle.to_tensor([[x for x in range(101, 111)]], dtype="int64") + attention_mask = None + input_ids = paddle.load("/ssd2/zhonghui03/Datasets/PaddleNLP/examples/language_model/llama/input_ids") + labels = paddle.load("/ssd2/zhonghui03/Datasets/PaddleNLP/examples/language_model/llama/labels") + attention_mask = paddle.load( + "/ssd2/zhonghui03/Datasets/PaddleNLP/examples/language_model/llama/attention_mask" + ) + + # labels[labels < 0] = 1 + + if pp_degree > 1: + pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + ret = pp_model.eval_batch(data=[input_ids, labels], compute_loss=True) + else: + # pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + # pp_model.data = [input_ids, labels] + # ret = pp_model._forward_step(None) + ret = model(input_ids=input_ids, 
labels=labels, attention_mask=attention_mask) + ret = ret[0] + + # np.testing.assert_allclose(ret.item(), 10.49988270, atol=1e-7) + print(f"ret mp{tp_degree} pp", ret.item()) + ret_mp_pp = ret.item() + + single_model = LlamaForCausalLM.from_pretrained( + model_name_or_path, + num_attention_heads=32, + tensor_parallel_output=False, + ) + single_model.eval() + ret = single_model(input_ids=input_ids, labels=labels, attention_mask=attention_mask) + print("ret single", ret[0].item()) + print( + f"diff: {(ret[0].item()- ret_mp_pp)/ret[0].item()}", + ) + np.testing.assert_allclose(ret[0].item(), ret_mp_pp, rtol=1.5e-7) + # 15.526779174804688 + # 16.879518508911133 + + +if __name__ == "__main__": + TestLlama().test_pipeline_model() + +# 3 bugs to fix in paddlepaddle +# pp_layers.py +# def _construct_shared_comm(self): +# shared_comm = {} +# if self._topo.get_dim("pipe") == 1: +# return shared_comm + +# topology.py +# def _set_p2p_group(self): +# self.send_next_group = None +# self.send_prev_group = None +# self.recv_next_group = None +# self.recv_prev_group = None +# if self._pp_degree <= 1: +# return + +# pipeline_parallel.py +# def _load_micro_batch(self, cache_id, stage=None): +# inputs = self.data +# if stage == "fisrt": +# assert self.is_pipeline_first_stage() +# assert len(inputs) == 2, "length of input should be 2" +# return self._load_micro_batch_impl(inputs[0], cache_id) +# elif stage== "last": +# assert self.is_pipeline_last_stage() +# assert len(inputs) == 2, "length of input should be 2" +# return self._load_micro_batch_impl(inputs[1], cache_id) +# else: +# inputs = None +# +# +# CUDA_VISIBLE_DEVICES=2 PYTHONPATH=./ pytest -s -v tests/test_pipeline_parallel.py +# PYTHONPATH=/ssd2/zhonghui03/Datasets/PaddleNLP:$PYTHONPATH PYTHONPATH=$PYTHONPATH:./ python -m paddle.distributed.launch --gpus 0,1,2,3 tests/test_pipeline_parallel.py diff --git a/llm/llama/tests/test_sequence_parallel.py b/llm/llama/tests/test_sequence_parallel.py new file mode 100644 index 0000000000000000000000000000000000000000..e7cdc932bf2499dec400d82147386779074b6103 --- /dev/null +++ b/llm/llama/tests/test_sequence_parallel.py @@ -0,0 +1,119 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
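+
+# Numerical check for sequence parallelism: the same synthetic batch is fed to a
+# tensor-/pipeline-parallel Llama model with sequence_parallel=True and to a single-card
+# LlamaForCausalLM, and the two losses are compared with np.testing.assert_allclose.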
+ +import unittest + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from modeling_pp import LlamaForCausalLMPipe +from paddle.distributed.fleet.meta_parallel.pipeline_parallel import PipelineParallel + +from paddlenlp.transformers import LlamaConfig, LlamaForCausalLM + + +class TestLlama(unittest.TestCase): + def test_sequence_model(self): + world_size = paddle.distributed.get_world_size() + pp_degree = world_size + tp_degree = 1 + + if world_size > 2: + pp_degree = 2 + assert world_size % pp_degree == 0 + tp_degree = world_size // pp_degree + + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tp_degree, + "pp_degree": pp_degree, + "sharding_degree": 1, + } + strategy.pipeline_configs = {"enable_partial_send_recv": False if pp_degree > 1 else True} + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + mp_group = hcg.get_model_parallel_group() + tensor_parallel_rank = mp_group.rank + + if pp_degree > 1: + model_class = LlamaForCausalLMPipe + else: + model_class = LlamaForCausalLM + + # model_name_or_path = "facebook/llama-7b" + model_name_or_path = "__internal_testing__/tiny-random-llama" + + seq_len = 2048 + batch_size = 2 + + config = LlamaConfig.from_pretrained(model_name_or_path) + config.seq_length = seq_len + config.use_flash_attention = False + config.use_fused_rms_norm = False + config.fuse_attention_qkv = False + config.recompute_granularity = "full" + config.virtual_pp_degree = 1 + config.use_recompute = False + + config.tensor_parallel_degree = tp_degree + config.tensor_parallel_rank = tensor_parallel_rank + config.tensor_parallel_output = False + config.sequence_parallel = True + + config.fuse_sequence_parallel_allreduce = False + + # hidden_size = 4096 + model = model_class.from_pretrained( + model_name_or_path, + config=config, + dtype="float32", + ) + + model.eval() + + input_ids = paddle.arange(100, 100 + batch_size * seq_len, dtype="int64").reshape([batch_size, seq_len]) + labels = paddle.arange(101, 101 + batch_size * seq_len, dtype="int64").reshape([batch_size, seq_len]) + + attention_mask = None + if pp_degree > 1: + pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + pp_model.accumulate_steps = batch_size # for micro_batch_size * acc_steps == batch_size + ret = pp_model.eval_batch(data=[input_ids, labels], compute_loss=True) + else: + # pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + # pp_model.data = [input_ids, labels] + # ret = pp_model._forward_step(None) + ret = model(input_ids=input_ids, labels=labels, attention_mask=attention_mask) + ret = ret[0] + + # np.testing.assert_allclose(ret.item(), 10.49988270, atol=1e-7) + print(f"ret mp{tp_degree} pp{pp_degree}", ret.item()) + ret_mp_pp = ret.item() + + single_model = LlamaForCausalLM.from_pretrained(model_name_or_path, config=config) + single_model.eval() + ret = single_model(input_ids=input_ids, labels=labels, attention_mask=attention_mask) + print("ret single", ret[0].item()) + print( + f"diff: {(ret[0].item()- ret_mp_pp)/ret[0].item()}", + ) + np.testing.assert_allclose(ret[0].item(), ret_mp_pp, rtol=1.5e-7) + + +if __name__ == "__main__": + TestLlama().test_sequence_model() + +# CUDA_VISIBLE_DEVICES=2 PYTHONPATH=./ pytest -s -v tests/test_pipeline_parallel.py +# PYTHONPATH=/ssd2/zhonghui03/Datasets/PaddleNLP:$PYTHONPATH PYTHONPATH=$PYTHONPATH:./ python -m paddle.distributed.launch --gpus 0,1,2,3 tests/test_pipeline_parallel.py diff --git 
a/llm/merge_lora_params.py b/llm/merge_lora_params.py new file mode 100644 index 0000000000000000000000000000000000000000..b2b27ffc900af1adfb815c99343498ea408e1467 --- /dev/null +++ b/llm/merge_lora_params.py @@ -0,0 +1,57 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import paddle + +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.transformers import AutoModelForCausalLM + + +def parse_arguments(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default=None, required=True, help="The directory of pretrained model.") + parser.add_argument( + "--lora_path", default=None, required=True, help="The directory of LoRA parameters. Default to None" + ) + parser.add_argument("--merge_model_path", default=None, help="The directory of merged parameters. Default to None") + parser.add_argument("--device", type=str, default="gpu", help="Device") + return parser.parse_args() + + +def merge(): + args = parse_arguments() + paddle.set_device(args.device) + lora_config = LoRAConfig.from_pretrained(args.lora_path) + dtype = lora_config.dtype + lora_config.merge_weights = True + + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + dtype=dtype, + ) + model = LoRAModel.from_pretrained(model=model, lora_path=args.lora_path, lora_config=lora_config) + model.eval() + if args.merge_model_path is None: + args.merge_model_path = args.lora_path + + model_state_dict = model.model.state_dict() + for key in list(model_state_dict): + if "lora" in key: + del model_state_dict[key] + model.model.save_pretrained(args.merge_model_path, state_dict=model_state_dict) + + +if __name__ == "__main__": + merge() diff --git a/llm/merge_tp_params.py b/llm/merge_tp_params.py new file mode 100644 index 0000000000000000000000000000000000000000..6f310c1836b5930cd045b298e3b0dc5860899ee1 --- /dev/null +++ b/llm/merge_tp_params.py @@ -0,0 +1,100 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
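+
+# Merges tensor-parallel checkpoint shards (model_state.tp00.pdparams, model_state.tp01.pdparams, ...)
+# back into a single model_state.pdparams, using the merge actions returned by the model's
+# _get_tensor_parallel_mappings(config, is_split=False).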
+import importlib +import os + +import paddle + +from paddlenlp.transformers import AutoConfig +from paddlenlp.transformers.auto.modeling import MAPPING_NAMES +from paddlenlp.utils.log import logger + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default=None, required=True, help="The directory of model.") + parser.add_argument("--device", type=str, default="gpu", help="Device") + return parser.parse_args() + + +def load_tp_params(tp_degree, path): + tp_state_dict_list = [] + for tp in range(tp_degree): + tp_state_dict = {} + tmp = paddle.load(os.path.join(path, f"model_state.tp{tp:0>2d}.pdparams"), return_numpy=True) + for k, v in tmp.items(): + tp_state_dict[k] = v + tp_state_dict_list.append(tp_state_dict) + + return tp_state_dict_list + + +def merge_tensor_parallel(model_class, state_dict_list, config) -> None: + """the entry of converting config and converting model file + + Args: + input_dir (str | None): the input dir which contains `pytorch_model.bin` and `config.json` file + config (PretrainedConfig): the PretrainedConfig instance of model + """ + name_action_mappings = model_class._get_tensor_parallel_mappings(config, is_split=False) + state_keys_map = model_class._resolve_prefix_keys(name_action_mappings.keys(), state_dict_list[0].keys()) + + for k, v in state_keys_map.items(): + name_action_mappings[v] = name_action_mappings.pop(k) + + state_dict_to_save = {} + for key in state_dict_list[0].keys(): + tensor = state_dict_list[0][key] + if key in name_action_mappings: + ret = [x[key] for x in state_dict_list] + action = name_action_mappings.pop(key) + tensor = action(ret) + + state_dict_to_save[key] = tensor + + if len(name_action_mappings) > 0: + for x in name_action_mappings.keys(): + logger.warning(f"key <{x}> need to merge tensor parallel but we can't find in model state.") + + logger.info("Finally, we merging state dict to fellowing tensors.") + for k, v in state_dict_to_save.items(): + logger.info(f"{k}, {v.shape}, {v.dtype}") + + return state_dict_to_save + + +def main(): + args = parse_arguments() + paddle.set_device(args.device) + config = AutoConfig.from_pretrained(args.model_name_or_path) + init_class = config["architectures"][0] + import_class = importlib.import_module(f"paddlenlp.transformers.{MAPPING_NAMES[init_class[:-11]]}.modeling") + model_class = getattr(import_class, init_class) + + if config.tensor_parallel_degree > 1: + tp_state_dict_list = load_tp_params(config.tensor_parallel_degree, args.model_name_or_path) + state_dict_to_save = merge_tensor_parallel( + model_class=model_class, state_dict_list=tp_state_dict_list, config=config + ) + + logger.info("Saving") + paddle.save(state_dict_to_save, os.path.join(args.model_name_or_path, "model_state.pdparams")) + else: + logger.info("No need to merge since config.tensor_parallel_degree <= 1.") + + +if __name__ == "__main__": + main() diff --git a/llm/opt/README.md b/llm/opt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..98b3f140fbfb3e2d004edf05eb1e8a4ba16c6c0f --- /dev/null +++ b/llm/opt/README.md @@ -0,0 +1,22 @@ +# OPT + +## 1. 
模型介绍 + +[OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 是以自回归填空作为训练目标的通用语言模型,可用于各类理解和生成任务。 + +**支持模型权重:** +| Model | +|----------------------------------| +|facebook/opt-125m| +|facebook/opt-350m | +|facebook/opt-1.3b | +|facebook/opt-2.7b | +|facebook/opt-6.7b| +|facebook/opt-13b | +|facebook/opt-30b | +|facebook/opt-66b | +|facebook/opt-iml-1.3b | +|opt-iml-max-1.3b | + +## 2. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/opt/lora_argument.json b/llm/opt/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..3c25fd57844001c52704947fb30bd30b656dcd9b --- /dev/null +++ b/llm/opt/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/opt-125m", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/opt_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/opt/sft_argument.json b/llm/opt/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..14e302fb19aee3d8c8dc19964ce6e068ea31174e --- /dev/null +++ b/llm/opt/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "facebook/opt-125m", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/opt_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "sharding_parallel_degree": 4, + "sharding": "stage2" + } \ No newline at end of file diff --git a/llm/predictor.py b/llm/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..dc6da46cbe733bd0b058e9715df5a707162192fc --- /dev/null +++ b/llm/predictor.py @@ -0,0 +1,827 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
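+
+# Unified inference entry point: parses PredictorArgument / ModelArgument, builds a dygraph or
+# static-graph predictor (optionally wrapping LoRA / Prefix Tuning weights, or the fused
+# high-performance inference model when inference_model is enabled), and runs batched generation.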
+from __future__ import annotations + +import json +import os +import sys +import time +from abc import abstractmethod +from dataclasses import dataclass, field + +import numpy as np +import paddle +import paddle.distributed.fleet.base.topology as tp +from paddle.distributed import fleet +from utils import ( + dybatch_preprocess, + get_alibi_slopes, + get_infer_model_path, + get_prefix_tuning_params, + load_real_time_tokens, +) + +from paddlenlp.peft import LoRAConfig, LoRAModel, PrefixConfig, PrefixModelForCausalLM +from paddlenlp.taskflow.utils import static_mode_guard +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, + LlamaTokenizer, + PretrainedModel, + PretrainedTokenizer, +) +from paddlenlp.utils.import_utils import import_module, is_paddlenlp_ops_available + + +@dataclass +class PredictorArgument: + model_name_or_path: str = field(default=None, metadata={"help": "The directory of model."}) + model_prefix: str = field(default="model", metadata={"help": "the prefix name of static model"}) + src_length: int = field(default=1024, metadata={"help": "The max length of source text."}) + max_length: int = field(default=2048, metadata={"help": "the max length for decoding."}) + top_k: int = field(default=0, metadata={"help": "top_k parameter for generation"}) + top_p: float = field(default=0.7, metadata={"help": "top_p parameter for generation"}) + temperature: float = field(default=0.95, metadata={"help": "top_p parameter for generation"}) + repetition_penalty: float = field(default=1.0, metadata={"help": "repetition penalty parameter for generation"}) + device: str = field(default="gpu", metadata={"help": "Device"}) + dtype: str = field(default=None, metadata={"help": "Model dtype"}) + lora_path: str = field(default=None, metadata={"help": "The directory of LoRA parameters. Default to None"}) + export_precache: bool = field(default=False, metadata={"help": "whether use prefix weight to do infer"}) + prefix_path: str = field( + default=None, metadata={"help": "The directory of Prefix Tuning parameters. Default to None"} + ) + decode_strategy: str = field( + default="sampling", + metadata={ + "help": "the decoding strategy of generation, which should be one of ['sampling', 'greedy_search', 'beam_search']. Default to sampling" + }, + ) + + mode: str = field( + default="dynamic", metadata={"help": "the type of predictor, it should be one of [dynamic, static]"} + ) + inference_model: bool = field(default=False, metadata={"help": "whether use InferenceModel to do generation"}) + quant_type: str = field( + default="None", + metadata={"help": "The quant type of inference model, support `weight_only_int8`, `weight_only_int4`."}, + ) + batch_size: int = field(default=1, metadata={"help": "The batch size of data."}) + benchmark: bool = field( + default=False, + metadata={ + "help": "If benchmark set as `True`, we will force model decode to max_length, which is helpful to compute throughput. 
" + }, + ) + + +@dataclass +class ModelArgument: + model_type: str = field( + default=None, + metadata={"help": "the type of the model, which can be one of ['gpt-3', 'ernie-3.5-se', 'llama-img2txt']"}, + ) + data_file: str = field(default=None, metadata={"help": "data file directory"}) + output_file: str = field(default="output.json", metadata={"help": "predict result file directory"}) + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +def init_dist_env(): + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + + if tensor_parallel_degree > 1: + # refer to: https://github.com/PaddlePaddle/Paddle/blob/4abea956ee852ce52791a1e08fa92ed4d3be150d/python/paddle/distributed/fleet/fleet.py#L298C23-L298C45 + hcg = tp._HYBRID_PARALLEL_GROUP + if hcg is None: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + + tensor_parallel_rank = hcg.get_model_parallel_rank() + return tensor_parallel_rank, tensor_parallel_degree + + +class BasePredictor: + def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = None): + self.model_config = AutoConfig.from_pretrained(config.model_name_or_path) + self.config: PredictorArgument = config + if tokenizer is None: + tokenizer = AutoTokenizer.from_pretrained(config.model_name_or_path, padding_side="left") + + self.tokenizer = tokenizer + self.return_tensors = "pd" + self.tensor_parallel_rank, self.tensor_parallel_degree = init_dist_env() + self.model_config.tensor_parallel_rank, self.model_config.tensor_parallel_degree = init_dist_env() + + def _preprocess(self, source): + tokenized_source = self.tokenizer( + source, + max_length=self.config.src_length, + truncation=True, + truncation_side="left", + return_tensors=self.return_tensors, + padding=True, + add_special_tokens=True, + ) + return tokenized_source + + @abstractmethod + def _infer(self, inputs): + raise NotImplementedError + + def _postprocess(self, predictions): + decoded_predictions = self.tokenizer.batch_decode( + predictions, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + return decoded_predictions + + def predict(self, input_texts: str | list[str]): + tokenized_source = self._preprocess(input_texts) + predictions = self._infer(tokenized_source) + decoded_predictions = self._postprocess(predictions) + return decoded_predictions + + +class DygraphPredictor(BasePredictor): + def __init__( + self, config: PredictorArgument, model: PretrainedModel = None, tokenizer: PretrainedTokenizer = None + ): + super().__init__(config, tokenizer) + self.model = model + if config.lora_path is not None: + lora_config = LoRAConfig.from_pretrained(config.lora_path) + dtype = lora_config.dtype + lora_config.merge_weights = True + elif config.prefix_path is not None: + prefix_config = PrefixConfig.from_pretrained(config.prefix_path) + dtype = prefix_config.dtype + elif config.dtype is not None: + dtype = config.dtype + else: + raise ValueError("Please specific the model dtype.") + + if self.model is None: + self.model = AutoModelForCausalLM.from_pretrained( + config.model_name_or_path, + dtype=dtype, + 
tensor_parallel_degree=self.tensor_parallel_degree, + tensor_parallel_rank=self.tensor_parallel_rank, + ) + + if config.lora_path is not None: + self.model = LoRAModel.from_pretrained( + model=self.model, lora_path=config.lora_path, lora_config=lora_config + ) + if config.prefix_path is not None: + prefix_tuning_params = get_prefix_tuning_params(self.model) + self.model = PrefixModelForCausalLM.from_pretrained( + model=self.model, + prefix_path=config.prefix_path, + postprocess_past_key_value=prefix_tuning_params["postprocess_past_key_value"], + ) + self.model.eval() + + @paddle.no_grad() + def _infer(self, inputs: dict[str, paddle.Tensor]): + result = self.model.generate( + **inputs, + max_length=self.config.max_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + decode_strategy=self.config.decode_strategy, + temperature=self.config.temperature, + top_k=self.config.top_k, + top_p=self.config.top_p, + repetition_penalty=self.config.repetition_penalty, + ) + result = result[0] + return result + + +class StaticGraphPredictor(BasePredictor): + def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = None): + super().__init__(config, tokenizer) + + params_path = os.path.join(self.config.model_name_or_path, self.config.model_prefix + ".pdiparams") + model_path = os.path.join(self.config.model_name_or_path, self.config.model_prefix + ".pdmodel") + inference_config = paddle.inference.Config(model_path, params_path) + + if self.config.device == "gpu": + # set GPU configs accordingly + inference_config.enable_use_gpu(100, 0) + elif self.config.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + inference_config.disable_gpu() + inference_config.disable_glog_info() + + with static_mode_guard(): + self.predictor = paddle.inference.create_predictor(inference_config) + + self.return_tensors = "np" + + def _preprocess(self, input_text: str | list[str]): + inputs = super()._preprocess(input_text) + inputs["max_length"] = np.array(self.config.max_length, dtype="int64") + + inputs["top_p"] = np.array(self.config.top_p, dtype="float32") + inputs["temperature"] = np.array(self.config.temperature, dtype="float32") + inputs["top_k"] = np.array(self.config.top_k, dtype="int64") + inputs["repetition_penalty"] = np.array(self.config.repetition_penalty, dtype="float32") + + return inputs + + def _infer(self, inputs: dict[str, np.ndarray]): + for name in self.predictor.get_input_names(): + self.predictor.get_input_handle(name).copy_from_cpu(inputs[name]) + + self.predictor.run() + output_names = self.predictor.get_output_names() + output_handle = self.predictor.get_output_handle(output_names[0]) + results = output_handle.copy_to_cpu() + # the first result is decoding_ids + decoded_ids = results.tolist() + return decoded_ids + + +class InferencePredictorMixin: + def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer): + + self.architectures = self.model_config.architectures[0].lower() + + self.dtype = config.dtype or self.model_config + self.cache_kvs = [paddle.zeros(shape, dtype=self.dtype) for shape in self.cache_kvs_shape] + self.num_layers, self.num_attention_heads, self.head_dim = ( + len(self.cache_kvs), + self.cache_kvs[0].shape[-3], + self.cache_kvs[0].shape[-1], + ) + self.total_max_length = config.src_length + config.max_length + self.pre_ids = paddle.full([config.batch_size, self.total_max_length], -1, 
dtype="int64") + if "chatglm" in self.architectures: + self.attention_mask = paddle.ones( + shape=(config.batch_size, 1, self.total_max_length, self.total_max_length), + dtype=self.dtype, + ) + self.tgt_pos = paddle.ones( + shape=[config.batch_size, 2, 1], + dtype="int64", + ) + else: + self.attention_mask = paddle.zeros( + shape=(config.batch_size, 1, self.total_max_length, self.total_max_length), + dtype=self.dtype, + ) + + self.tgt_generation_mask = paddle.zeros( + shape=[config.batch_size, 1, 1, self.total_max_length], + dtype=self.dtype, + ) + self.arange_tensor_encoder = paddle.zeros( + shape=(config.batch_size, 1, self.total_max_length), dtype=self.dtype + ) + + if config.export_precache: + if config.prefix_path: + prefix_cache = ( + paddle.to_tensor(np.load(f"{config.prefix_path}/pre_caches.npy")).astype(self.dtype).unsqueeze(2) + ) + prefix_cache = paddle.expand( + prefix_cache, + [ + self.num_layers, + 2, + config.batch_size, + self.num_attention_heads, + prefix_cache.shape[-2], + self.head_dim, + ], + ) + self.pre_caches = [item.squeeze_(0) for item in paddle.split(prefix_cache, self.num_layers, axis=0)] + else: + prefix_cache = paddle.zeros( + [self.num_layers, 2, config.batch_size, self.num_attention_heads, 128, self.head_dim], + dtype=self.dtype, + ) + self.pre_caches = [item.squeeze_(0) for item in paddle.split(prefix_cache, self.num_layers, axis=0)] + + def _postprocess(self, predictions): + if paddle.distributed.get_rank() == 0: + tokens: np.ndarray = load_real_time_tokens() + decoded_predictions = self.tokenizer.batch_decode( + tokens.tolist(), skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + return decoded_predictions + else: + return None + + def _preprocess(self, source): + self.attention_mask[:] = 0 + self.tgt_generation_mask[:] = 0 + pre_caches_length = 0 if not self.config.export_precache else self.pre_caches[0].shape[-2] + + if "chatglm" in self.architectures: + inputs = dybatch_preprocess( + self.tokenizer, + source, + self.config.src_length, + self.config.max_length, + self.architectures, + top_p=self.config.top_p, + temperature=self.config.temperature, + ) + for i in range(inputs["input_ids"].shape[0]): + length = inputs["seq_len_encoder"][i][0] + self.attention_mask[i, 0, :length, :length] = 0 + self.attention_mask[i, 0, : length - 1, length - 1] = 1 + self.tgt_generation_mask[i, 0, 0, :length] = paddle.ones(shape=[1, length], dtype=self.config.dtype) + self.tgt_pos[i, 0, 0] = paddle.to_tensor([length], dtype="int64") + + inputs["tgt_pos"] = self.tgt_pos + elif "bloom" in self.architectures: + inputs = dybatch_preprocess( + self.tokenizer, + source, + self.config.src_length, + self.config.max_length, + self.architectures, + top_p=self.config.top_p, + temperature=self.config.temperature, + ) + for i in range(inputs["input_ids"].shape[0]): + length = inputs["seq_len_encoder"][i][0] + self.attention_mask[i, :, :length, :length] = paddle.tril( + paddle.ones(shape=(length, length), dtype=self.config.dtype) + ) + self.arange_tensor_encoder[i, :, :length] = paddle.arange(length).astype(self.config.dtype) + + self.tgt_generation_mask[i, :, 0, :length] = paddle.ones(shape=[1, length], dtype=self.config.dtype) + # alibi encoder + alibi_slopes = get_alibi_slopes(self.model_config.n_head) + inputs["position_ids"] = paddle.to_tensor(alibi_slopes, dtype="float32") + + alibi = alibi_slopes[..., None] * self.arange_tensor_encoder + alibi = alibi[:, :, None, :] + + if self.model_config.tensor_parallel_degree > 1: + block_size = self.model_config.n_head // 
self.model_config.tensor_parallel_degree + alibi = alibi[ + :, + self.model_config.tensor_parallel_rank + * block_size : (self.model_config.tensor_parallel_rank + 1) + * block_size, + ] + alibi = alibi.reshape([inputs["input_ids"].shape[0], block_size, 1, self.config.max_length]) + inputs["position_ids"] = inputs["position_ids"][ + self.model_config.tensor_parallel_rank + * block_size : (self.model.config.tensor_parallel_rank + 1) + * block_size + ] + + alibi_encoder = alibi.expand( + [ + inputs["input_ids"].shape[0], + self.model_config.n_head // self.model_config.tensor_parallel_degree, + self.total_max_length, + self.total_max_length, + ] + ) + alibi_decoder = alibi.expand( + [ + inputs["input_ids"].shape[0], + self.model_config.n_head // self.model_config.tensor_parallel_degree, + 1, + self.total_max_length, + ] + ) + self.attention_mask = ( + alibi_encoder + (1 - self.attention_mask) * paddle.finfo(self.attention_mask.dtype).min + ) + self.tgt_generation_mask = ( + alibi_decoder + (1 - self.tgt_generation_mask) * paddle.finfo(self.tgt_generation_mask.dtype).min + ) + + else: + inputs = dybatch_preprocess( + self.tokenizer, + source, + self.config.src_length, + self.config.max_length, + self.architectures, + top_p=self.config.top_p, + temperature=self.config.temperature, + pre_caches_length=pre_caches_length, + ) + + for i in range(inputs["input_ids"].shape[0]): + length = inputs["seq_len_encoder"][i][0] + self.attention_mask[i, 0, :length, :length] = paddle.tril( + paddle.ones(shape=(length, length), dtype=self.config.dtype) + ) + + if pre_caches_length > 0: + if self.config.prefix_path is None: + prefix_attention_mask = paddle.zeros( + [1, length, pre_caches_length], dtype=self.attention_mask.dtype + ) + else: + prefix_attention_mask = paddle.ones( + [1, length, pre_caches_length], dtype=self.attention_mask.dtype + ) + post_attention_mask = paddle.tril( + paddle.ones(shape=(length, length), dtype=self.attention_mask.dtype) + ).unsqueeze_(axis=0) + self.attention_mask[i, 0, :length, : length + pre_caches_length] = paddle.concat( + [prefix_attention_mask, post_attention_mask], axis=2 + ) + + if self.config.prefix_path is None: + self.tgt_generation_mask[i, 0, 0, pre_caches_length : length + pre_caches_length] = paddle.ones( + shape=[1, length], dtype="float16" + ) + else: + self.tgt_generation_mask[i, 0, 0, : length + pre_caches_length] = paddle.ones( + shape=[1, length + pre_caches_length], dtype=self.config.dtype + ) + + inputs["pre_ids"] = self.pre_ids + inputs["attention_mask"] = self.attention_mask + inputs["tgt_generation_mask"] = self.tgt_generation_mask + + if pre_caches_length > 0: + if self.config.mode == "dynamic": + inputs["pre_caches"] = self.pre_caches + else: + for i in range(len(self.pre_caches)): + inputs["pre_caches_{}".format(i)] = self.pre_caches[i].numpy() + + return inputs + + +class StaticInferencePredictor(InferencePredictorMixin, BasePredictor): + def __init__( + self, + config: PredictorArgument, + cache_kvs_shape: list[list[int]], + tokenizer: PretrainedTokenizer = None, + ): + self.cache_kvs_shape = cache_kvs_shape + BasePredictor.__init__(self, config, tokenizer) + InferencePredictorMixin.__init__(self, config, tokenizer) + + self.predictor = self._create_predictor(config) + + def _create_predictor(self, predictor_args: PredictorArgument): + if not is_paddlenlp_ops_available(): + raise ValueError( + "you should install the paddlenlp ops to run inference predictor, " + "https://github.com/PaddlePaddle/PaddleNLP/blob/develop/csrc/README.md" + ) + + # register 
the custome ops + import_module("paddlenlp_ops.encode_rotary_qk") + import_module("paddlenlp_ops.get_padding_offset") + import_module("paddlenlp_ops.qkv_transpose_split") + import_module("paddlenlp_ops.rebuild_padding") + import_module("paddlenlp_ops.transpose_remove_padding") + import_module("paddlenlp_ops.write_cache_kv") + + infer_model_path = get_infer_model_path(predictor_args.model_name_or_path, predictor_args.model_prefix) + + config = paddle.inference.Config(infer_model_path + ".pdmodel", infer_model_path + ".pdiparams") + + config.switch_ir_optim(True) + device_id = int(os.environ.get("FLAGS_selected_gpus", 0)) + config.enable_use_gpu(100, device_id) + # config.disable_glog_info() + # config.enable_memory_optim() + + if self.tensor_parallel_degree > 1: + trainer_endpoints = fleet.worker_endpoints() + current_endpoint = trainer_endpoints[self.tensor_parallel_rank] + + dist_config = config.dist_config() + dist_config.set_ranks(self.tensor_parallel_degree, self.tensor_parallel_rank) + dist_config.set_endpoints(trainer_endpoints, current_endpoint) + dist_config.enable_dist_model(True) + + dist_config.set_comm_init_config(os.path.join(predictor_args.model_name_or_path, "rank_mapping.csv")) + config.set_dist_config(dist_config) + + predictor = paddle.inference.create_predictor(config) + return predictor + + @paddle.no_grad() + def _infer(self, inputs): + for k, v in inputs.items(): + input_tensor = self.predictor.get_input_handle(k) + + if "mask" in k or "position" in k: + input_tensor.share_external_data(v) + else: + if paddle.is_tensor(v): + v = v.numpy() + input_tensor.copy_from_cpu(v) + + for i in range(len(self.cache_kvs_shape)): + input_tensor = self.predictor.get_input_handle("cache_kvs_" + str(i)) + input_tensor.share_external_data(self.cache_kvs[i]) + + input_tensor = self.predictor.get_input_handle("pre_ids") + input_tensor.share_external_data(self.pre_ids) + + self.predictor.run() + + +class DygraphInferencePredictor(InferencePredictorMixin, BasePredictor): + def __init__( + self, + config: PredictorArgument, + model: PretrainedModel = None, + tokenizer: PretrainedTokenizer = None, + ): + self.cache_kvs_shape = model.get_cache_kvs_shape(model.config, config.batch_size) + BasePredictor.__init__(self, config, tokenizer) + InferencePredictorMixin.__init__(self, config, tokenizer) + self.model = model + + @paddle.no_grad() + def _infer(self, inputs: dict[str, paddle.Tensor]): + for key in inputs.keys(): + if paddle.is_tensor(inputs[key]): + continue + if isinstance(inputs[key], list): + if paddle.is_tensor(inputs[key]): + continue + inputs[key] = [paddle.to_tensor(item) for item in inputs[key]] + else: + inputs[key] = paddle.to_tensor(inputs[key]) + + inputs["cache_kvs"] = self.cache_kvs + self.model.generate( + **inputs, + ) + return None + + +def create_predictor( + predictor_args: PredictorArgument, + model_args: ModelArgument, + tensor_parallel_degree: int = 1, + tensor_parallel_rank: int = 0, +): + tokenizer = AutoTokenizer.from_pretrained(predictor_args.model_name_or_path) + # TODO(wj-Mcat): fix llama tokenzier pad_token bug + if isinstance(tokenizer, LlamaTokenizer) and not tokenizer.pad_token: + tokenizer.pad_token = tokenizer.unk_token + + # update config parameter for inference predictor + if predictor_args.decode_strategy == "greedy_search" and predictor_args.inference_model: + predictor_args.top_p = 0.0 + predictor_args.temperature = 1.0 + + tensor_parallel_rank, tensor_parallel_degree = init_dist_env() + if not predictor_args.inference_model: + if predictor_args.mode 
== "dynamic": + if model_args.model_type == "gpt-3": + sys.path.append("./gpt-3") + from modeling import GPTForCausalLM + + model = GPTForCausalLM.from_pretrained( + predictor_args.model_name_or_path, + dtype=predictor_args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + elif model_args.model_type == "ernie-3.5-se": + sys.path.append("./ernie-3.5-se") + from modeling import Ernie35ForCausalLM + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + model = Ernie35ForCausalLM.from_pretrained( + predictor_args.model_name_or_path, + dtype=predictor_args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + else: + model = AutoModelForCausalLM.from_pretrained( + predictor_args.model_name_or_path, + dtype=predictor_args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + + predictor = DygraphPredictor(predictor_args, model=model, tokenizer=tokenizer) + elif predictor_args.mode == "static": + predictor = StaticGraphPredictor(predictor_args, tokenizer=tokenizer) + else: + raise ValueError("the `mode` should be one of [dynamic, static]") + else: + if predictor_args.mode == "dynamic": + # TODO(wj-Mcat): complete AutoInferenceModel & AutoPredictor + config = AutoConfig.from_pretrained(predictor_args.model_name_or_path) + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + if "llama" in config.architectures[0].lower(): + if model_args.model_type == "llama-img2txt": + # we use llama for img2txt. + from paddlenlp.experimental.transformers import ( + LlamaForMiniGPT4InferenceModel as LlamaInferenceModel, + ) + else: + from paddlenlp.experimental.transformers import ( + LlamaForCausalLMInferenceModel as LlamaInferenceModel, + ) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + config.quant_bits = -1 + + if predictor_args.quant_type.startswith("weight_only_int"): + quant_bits = int(predictor_args.quant_type[-1]) + config.quant_bits = quant_bits + + model = LlamaInferenceModel.from_pretrained( + predictor_args.model_name_or_path, config=config, dtype=predictor_args.dtype + ) + model.eval() + elif "chatglm" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + ChatGLMForCausalLMInferenceModel, + ) + + model = ChatGLMForCausalLMInferenceModel.from_pretrained( + predictor_args.model_name_or_path, + config=config, + dtype=predictor_args.dtype, + ) + model.eval() + elif "bloom" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + BloomForCausalLMInferenceModel, + ) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + model = BloomForCausalLMInferenceModel.from_pretrained( + predictor_args.model_name_or_path, + config=config, + dtype=predictor_args.dtype, + ) + cache_kvs_shape = BloomForCausalLMInferenceModel.get_cache_kvs_shape(config, predictor_args.batch_size) + model.eval() + predictor = DygraphInferencePredictor(predictor_args, model=model, tokenizer=tokenizer) + elif predictor_args.mode == "static": + config = AutoConfig.from_pretrained(predictor_args.model_name_or_path) + if "llama" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + LlamaForCausalLMInferenceModel, + ) + + 
cache_kvs_shape = LlamaForCausalLMInferenceModel.get_cache_kvs_shape(config, predictor_args.batch_size) + predictor = StaticInferencePredictor(predictor_args, cache_kvs_shape, tokenizer=tokenizer) + elif "chatglm" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + ChatGLMForCausalLMInferenceModel, + ) + + cache_kvs_shape = ChatGLMForCausalLMInferenceModel.get_cache_kvs_shape( + config, predictor_args.batch_size + ) + predictor = StaticInferencePredictor(predictor_args, cache_kvs_shape, tokenizer=tokenizer) + elif "bloom" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + BloomForCausalLMInferenceModel, + ) + + cache_kvs_shape = BloomForCausalLMInferenceModel.get_cache_kvs_shape(config, predictor_args.batch_size) + predictor = StaticInferencePredictor( + predictor_args, cache_kvs_shape=cache_kvs_shape, tokenizer=tokenizer + ) + else: + raise ValueError("the `mode` should be one of [dynamic, static]") + return predictor + + +def predict(): + parser = PdArgumentParser((PredictorArgument, ModelArgument)) + predictor_args, model_args = parser.parse_args_into_dataclasses() + + paddle.set_device(predictor_args.device) + paddle.set_default_dtype(predictor_args.dtype) + + tensor_parallel_degree = paddle.distributed.get_world_size() + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + + predictor = create_predictor(predictor_args, model_args) + source_texts = [] + target_texts = [] + if model_args.data_file: + with open(model_args.data_file, "r", encoding="utf-8") as f: + for line in f: + example = json.loads(line) + source_texts.append(example["src"]) + target_texts.append(example["tgt"]) + else: + source_texts = ["hello world, how are you?", "你好,请问你是谁?"] + target_texts = ["", ""] + + batch_source_texts = batchfy_text(source_texts, predictor_args.batch_size) + batch_target_texts = batchfy_text(target_texts, predictor_args.batch_size) + + with open(model_args.output_file, "w", encoding="utf-8") as f: + for bs, batch_source_text in enumerate(batch_source_texts): + outputs = predictor.predict(batch_source_text) + + if predictor.tensor_parallel_rank > 0: + continue + for output, source, target in zip(outputs, batch_source_texts[bs], batch_target_texts[bs]): + print("***********Source**********") + print(source) + print("***********Target**********") + print(target) + print("***********Output**********") + print(output) + out = {"src": source, "tgt": target, "output": output} + f.write(json.dumps(out, ensure_ascii=False) + "\n") + + if predictor_args.benchmark: + benchmark(predictor, predictor_args, model_args) + + +def benchmark(predictor, predictor_args, model_args): + # Just construct a simple benchmark input. We pad input to the src_length. + test_texts = "hello world, how are you?" 
+ benchmark_texts = [test_texts + "" * predictor_args.src_length for _ in range(predictor_args.batch_size)] + + batch_benchmark_texts = batchfy_text(benchmark_texts, predictor_args.batch_size) + print("***********Start Benchmark**********") + + warmup_time = 10 + test_time = 100 + + print("***********Start Warmup**********") + for _ in range(warmup_time): + for bs, batch_source_text in enumerate(batch_benchmark_texts): + outputs = predictor.predict(batch_source_text) + + print("***********Start Speed Test**********") + start = time.perf_counter() + for _ in range(test_time): + for bs, batch_source_text in enumerate(batch_benchmark_texts): + outputs = predictor.predict(batch_source_text) + end = time.perf_counter() + + output_tokens = sum([len(output) for output in outputs]) + print( + "Input length is: {}, Output length is: {}, bs is: {}, Generate speed is: {:.3f} tokens/s(ips), QPS: {:.3f} requests/s. ".format( + predictor_args.src_length, + predictor_args.max_length, + predictor_args.batch_size, + (output_tokens / (end - start) / test_time), + (predictor_args.batch_size / (end - start) / test_time), + ) + ) + + +if __name__ == "__main__": + predict() diff --git a/llm/quant.py b/llm/quant.py new file mode 100644 index 0000000000000000000000000000000000000000..56314d6e468fe898224e9009f652996a9ae123d1 --- /dev/null +++ b/llm/quant.py @@ -0,0 +1,230 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import json +import os + +import paddle +from paddle import nn +from paddle.distributed.fleet.meta_parallel import ( + ColumnParallelLinear, + RowParallelLinear, +) +from paddle.quantization import PTQ, QAT, QuantConfig +from paddle.quantization.quanters.abs_max import FakeQuanterWithAbsMaxObserverLayer +from paddleslim.quant.advanced import ( + GPTQ, + EMASampler, + MultiStepSampler, + PieceWiseSearch, + Shift, + Smooth, +) +from paddleslim.quant.advanced.utils import find_parent_layer_and_sub_name +from paddleslim.quant.layers import ( + QuantizedColumnParallelLinear, + QuantizedRowParallelLinear, +) +from paddleslim.quant.observers import AbsMaxChannelWiseWeightObserver, AVGObserver +from paddleslim.quant.observers.abs_max_weight import ( + AbsMaxChannelWiseWeightObserverLayer, +) +from paddleslim.quant.observers.avg import AVGObserverLayer +from paddleslim.quant.quanters import PACTQuanter + +from paddlenlp.peft import PrefixModelForCausalLM +from paddlenlp.peft.lora import LoRALinear +from paddlenlp.peft.lora.lora_quant_layers import QuantedLoRALinear +from paddlenlp.utils.log import logger + + +def create_qat_model(quant_args, model, dtype): + # FakeQuanterChannelWiseAbsMaxObserver not yet merge in Paddle develop + from paddle.quantization.quanters import FakeQuanterChannelWiseAbsMaxObserver + + q_config = QuantConfig(activation=None, weight=None) + q_config.add_qat_layer_mapping(LoRALinear, QuantedLoRALinear) + if quant_args.quant_type == "A8W8": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, init_value=20, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype="float32") + elif quant_args.quant_type == "WINT4": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + elif quant_args.quant_type == "WINT8": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype="float32") + elif quant_args.quant_type == "A8W4": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, init_value=20, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + else: + raise ValueError("quant_type should be one of ['A8W8', 'WINT4', 'WINT8']") + q_config.add_type_config(LoRALinear, weight=weight, activation=activation) + q_config.add_type_config(nn.Linear, weight=weight, activation=activation) + + qat = QAT(q_config) + model = qat.quantize(model, inplace=True) + return model + + +def apply_shift(quant_args, trainer, ptq_dataloader, ptq_model_config): + logger.info("***** Running Shift *****") + shift_sampler = EMASampler() if quant_args.shift_sampler == "ema" else None + shift = Shift( + model=trainer.model, + model_config=ptq_model_config, + sample_function=shift_sampler, + shift_all_linears=quant_args.shift_all_linears, + ) + + trainer.ptq_loop( + ptq_dataloader, + description="Shift", + max_eval_iters=quant_args.shift_step, + ) + shift.update_weight() + del shift, shift_sampler + logger.info("***** Shift done *****") + + +def apply_smooth(quant_args, trainer, ptq_dataloader, ptq_model_config): + logger.info("***** Running Smooth *****") + smooth_sampler = MultiStepSampler() if quant_args.smooth_sampler == "multi_step" else None + if quant_args.smooth_piecewise_search: + search_func = PieceWiseSearch( + k_piece=quant_args.smooth_k_piece, + bits_length=8, + search_piece=quant_args.smooth_search_piece, + search_alpha_min=0.2, + search_alpha_max=0.8, + search_scale_min=1.0, + search_scale_max=5.0, + 
weight_quant_method="abs_max_channel_wise", + act_quant_method="avg", + ) + else: + search_func = None + smooth = Smooth( + trainer.model, + ptq_model_config, + alpha=0.5, + smooth_all_linears=quant_args.smooth_all_linears, + sample_function=smooth_sampler, + search_function=search_func, + ) + trainer.ptq_loop( + ptq_dataloader, + description="Smooth", + max_eval_iters=quant_args.smooth_step, + ) + + smooth.update_weight() + del smooth, smooth_sampler, search_func + logger.info("***** Smooth done *****") + + +def apply_ptq(quant_args, trainer, ptq_dataloader): + logger.info("***** Running PTQ *****") + q_config = QuantConfig(activation=None, weight=None) + + if quant_args.quant_type == "A8W8": + activation = AVGObserver(quant_bits=8) + weight = AbsMaxChannelWiseWeightObserver(quant_bits=8) + elif quant_args.quant_type == "WINT4": + activation = None + weight = AbsMaxChannelWiseWeightObserver(quant_bits=4) + elif quant_args.quant_type == "WINT8": + activation = None + weight = AbsMaxChannelWiseWeightObserver(quant_bits=8) + else: + raise ValueError("quant_type should be one of ['A8W8', 'WINT4', 'WINT8']") + + q_config.add_qat_layer_mapping(ColumnParallelLinear, QuantizedColumnParallelLinear) + q_config.add_qat_layer_mapping(RowParallelLinear, QuantizedRowParallelLinear) + q_config.add_type_config( + [paddle.nn.Linear, ColumnParallelLinear, RowParallelLinear, QuantedLoRALinear], + activation=activation, + weight=weight, + ) + + ptq = PTQ(q_config) + trainer.model = ptq.quantize(trainer.model, inplace=True) + trainer.ptq_loop( + ptq_dataloader, + description="PTQ", + max_eval_iters=quant_args.ptq_step, + ) + weight_scales = {} + act_scales = {} + for cur_name, cur_layer in trainer.model.named_sublayers(): + if isinstance(cur_layer, AbsMaxChannelWiseWeightObserverLayer): + if "_observer" not in cur_name: + weight_scales[cur_name] = cur_layer.scales().numpy().tolist() + if isinstance(cur_layer, AVGObserverLayer): + if "_observer" not in cur_name: + act_scales[cur_name] = cur_layer.scales().numpy().tolist() + + weight_scales_path = os.path.join(trainer.args.output_dir, "weight_scales.json") + with open(weight_scales_path, "w") as f: + json.dump(weight_scales, f) + logger.info(f"Weight scales saved in {weight_scales_path}.") + + act_scales_path = os.path.join(trainer.args.output_dir, "act_scales.json") + with open(act_scales_path, "w") as f: + json.dump(act_scales, f) + logger.info(f"Activation scales saved in {act_scales_path}.") + + trainer.model = ptq.convert(trainer.model, inplace=True) + logger.info("***** PTQ done *****") + + +def apply_gptq(quant_args, trainer, ptq_dataloader): + logger.info("***** Running GPTQ *****") + num_layer = 0 + model = trainer.model + for cur_name, cur_layer in model.named_sublayers(): + if type(cur_layer) in [paddle.nn.Linear, ColumnParallelLinear, RowParallelLinear]: + num_layer += 1 + logger.info(f"GPTQ layer: {num_layer}, {cur_name}") + parent_layer, sub_name = find_parent_layer_and_sub_name(model, cur_name) + cur_quant_layer = GPTQ(cur_layer) + setattr(parent_layer, sub_name, cur_quant_layer) + trainer.ptq_loop( + ptq_dataloader, + description="GPTQ", + max_eval_iters=quant_args.gptq_step, + ) + cur_quant_layer.fasterquant(percdamp=0.1, groupsize=-1, actorder=True) + del cur_quant_layer + setattr(parent_layer, sub_name, cur_layer) + logger.info("***** GPTQ done *****") + + +def get_ptq_model_config(model): + if isinstance(model, PrefixModelForCausalLM): + base_model_prefix = model.model.base_model_prefix + else: + base_model_prefix = model.base_model_prefix + 
+ if base_model_prefix in ["chatglm"]: + raise NotImplementedError(f"{model} does not support Shift or Smooth.") + elif base_model_prefix == "chatglm_v2": + model_config = {"fused_qkv": False, "parallel_ffn": False, "skip_norm_list": ["rms_norm_56"]} + elif base_model_prefix == "bloom": + model_config = {"fused_qkv": True, "parallel_ffn": False} + elif base_model_prefix == "llama": + model_config = {"fused_qkv": False, "parallel_ffn": True} + else: + raise ValueError( + f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm_V2, bloom, llama." + ) + return model_config diff --git a/llm/qwen/README.md b/llm/qwen/README.md new file mode 100644 index 0000000000000000000000000000000000000000..08d29823f9650f7b1e787144d4beb945e06eb310 --- /dev/null +++ b/llm/qwen/README.md @@ -0,0 +1,14 @@ +# Qwen + +## 1.模型介绍 + +[通义千问-7B(Qwen-7B)](https://arxiv.org/abs/2205.01068) 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。 + +**支持模型权重:** +| Model | +|-------------------| +| qwen/qwen-7b | +| qwen/qwen-7b-chat | + +## 2. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/qwen/lora_argument.json b/llm/qwen/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..56d5b66c75755848345d0803333f80b86fc92f4c --- /dev/null +++ b/llm/qwen/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "qwen/qwen-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/qwen_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "bf16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } diff --git a/llm/qwen/pt_argument.json b/llm/qwen/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..763c951176692fc8c9fd15feb833ccfb68caa8e3 --- /dev/null +++ b/llm/qwen/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "qwen/qwen-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/qwen_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "bf16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } diff --git a/llm/qwen/sft_argument.json b/llm/qwen/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..d06801aef2be3b0d42c44332f70cfe960d950776 --- /dev/null +++ b/llm/qwen/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": 
"qwen/qwen-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/qwen_sft_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "bf16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } diff --git a/llm/utils.py b/llm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5d439cec114512d40dfd28cd49e66719b10d5e3b --- /dev/null +++ b/llm/utils.py @@ -0,0 +1,561 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import glob +import math +import os +import struct +from typing import Dict, Optional + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.distributed import fleet +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from sklearn.metrics import accuracy_score + +from paddlenlp.datasets import InTokensIterableDataset +from paddlenlp.trainer import Trainer, TrainerCallback +from paddlenlp.trainer.trainer_utils import IterableDatasetShard, has_length +from paddlenlp.utils.log import logger + + +def compute_metrics(eval_preds): + + flattened_preds = np.array(eval_preds.predictions).flatten() + flattened_labels = np.array(eval_preds.label_ids).flatten() + filtered_preds = flattened_preds[flattened_labels != -100] + filtered_labels = flattened_labels[flattened_labels != -100] + accuracy = accuracy_score(y_true=filtered_labels, y_pred=filtered_preds) + return { + "accuracy": accuracy, + } + + +def get_prefix_tuning_params(model): + if model.base_model_prefix == "chatglm": + from paddlenlp.peft.prefix import chatglm_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + num_hidden_layers = model.config.num_hidden_layers + hidden_size = model.config.hidden_size + postprocess_past_key_value = chatglm_postprocess_past_key_value + multi_query_group_num = None + elif model.base_model_prefix == "chatglm_v2": + from paddlenlp.peft.prefix import chatglm_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + num_hidden_layers = model.config.num_layers + hidden_size = model.config.hidden_size + postprocess_past_key_value = chatglm_postprocess_past_key_value + multi_query_group_num = model.config.multi_query_group_num + elif model.base_model_prefix == "bloom": + from paddlenlp.peft.prefix import bloom_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + 
num_hidden_layers = model.config.n_layer + hidden_size = model.config.n_embed + postprocess_past_key_value = bloom_postprocess_past_key_value + multi_query_group_num = None + elif model.base_model_prefix == "llama": + from paddlenlp.peft.prefix import llama_postprocess_past_key_value + + num_attention_heads = model.config.n_head + num_hidden_layers = model.config.n_layer + hidden_size = model.config.hidden_size + postprocess_past_key_value = llama_postprocess_past_key_value + multi_query_group_num = None + elif model.base_model_prefix == "qwen": + from paddlenlp.peft.prefix import qwen_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + num_hidden_layers = model.config.num_hidden_layers + hidden_size = model.config.hidden_size + postprocess_past_key_value = qwen_postprocess_past_key_value + multi_query_group_num = None + else: + raise ValueError(f"Unknown base_model_prefix: {model.base_model_prefix}. ") + return dict( + num_attention_heads=num_attention_heads, + num_hidden_layers=num_hidden_layers, + hidden_size=hidden_size, + postprocess_past_key_value=postprocess_past_key_value, + multi_query_group_num=multi_query_group_num, + ) + + +def get_lora_target_modules(model): + # Not yet support RowParallelLinear + if model.base_model_prefix == "chatglm": + target_modules = [".*query_key_value.*", ".*dense.*", ".*dense_h_to_4h.*", ".*dense_4h_to_h.*"] + elif model.base_model_prefix == "chatglm_v2": + target_modules = [ + ".*query.*", + ".*key.*", + ".*value.*", + ".*dense.*", + ".*dense_h_to_4h.*", + ".*dense_4h_to_h.*", + ] + elif model.base_model_prefix == "bloom": + target_modules = [".*query_key_value.*", ".*dense.*", ".*dense_h_to_4h.*", ".*dense_4h_to_h.*"] + elif model.base_model_prefix == "llama": + target_modules = [ + ".*q_proj.*", + ".*v_proj.*", + ".*k_proj.*", + ".*o_proj.*", + ".*gate_proj.*", + ".*down_proj.*", + ".*up_proj.*", + ] + elif model.base_model_prefix == "opt": + target_modules = [ + ".*project_in.*", + ".*project_out.*", + ".*q_proj.*", + ".*k_proj.*", + ".*v_proj.*", + ".*qkv_proj.*", + ".*out_proj.*", + ".*linear1.*", + ".*linear2.*", + ] + elif model.base_model_prefix == "qwen": + target_modules = [ + ".*attn.c_attn.*", + ".*attn.c_proj.*", + ".*mlp.w1.*", + ".*mlp.w2.*", + ".*mlp.c_proj.*", + ] + else: + raise ValueError(f"Unknown base_model_prefix: {model.base_model_prefix}.") + return target_modules + + +class InTokensIterDatasetCallback(TrainerCallback): + """ + A [`TrainerCallback`] that handles early stopping. 
+ + """ + + def on_step_end(self, args, state, control, **kwargs): + train_dataloader = kwargs["train_dataloader"] + if isinstance(train_dataloader.dataset, InTokensIterableDataset): + dataset = train_dataloader.dataset + elif isinstance(train_dataloader.dataset, IterableDatasetShard) and isinstance( + train_dataloader.dataset.dataset, InTokensIterableDataset + ): + dataset = train_dataloader.dataset.dataset + else: + raise ValueError( + "Unexpected dataset format: InTokensIterDatasetCallback expectes `paddlenlp.datasets.InTokensIterableDataset`" + ) + if state.trial_params is None: + state.trial_params = {} + state.trial_params["intokens_global_step"] = dataset.intokens_global_step + + +class CausalLMTrainer(Trainer): + def __init__(self, do_generation: bool, gen_args, data_args, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + self.gen_args = gen_args + self.data_args = data_args + + def prediction_step( + self, + model, + inputs, + prediction_loss_only: bool, + ignore_keys=None, + ): + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. + # keepdim in order to maintain the same shape as logits + if isinstance(logits, (list, tuple)): + logits = logits[0] + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + loss = None + + model.eval() + with paddle.no_grad(): + generated_tokens = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"] if "attention_mask" in inputs else None, + position_ids=inputs["position_ids"] if "position_ids" in inputs else None, + max_length=max(self.data_args.max_length - inputs["input_ids"].shape[-1], 1), + decode_strategy="sampling", + top_k=self.gen_args.top_k, + top_p=self.gen_args.top_p, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + use_cache=True, + )[0] + all_preds = [] + for pred_tokens in generated_tokens: + pred_tokens = pred_tokens[pred_tokens != self.tokenizer.pad_token_id].tolist() + all_preds.append(pred_tokens) + max_pred_length = max([len(x) for x in all_preds]) + for index, preds in enumerate(all_preds): + all_preds[index] = preds + [-100] * (max_pred_length - len(preds)) + all_preds = paddle.to_tensor(all_preds) + + if "labels" in inputs: + all_labels = paddle.to_tensor(inputs["labels"]) + else: + all_labels = None + + return (loss, all_preds, all_labels) + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(CausalLMTrainer, self).log(logs, **kwargs) + + def get_ptq_dataloader(self, ptq_ds): + if self.args.world_size <= 1: + ptq_sampler = BatchSampler( + dataset=ptq_ds, + shuffle=True, + batch_size=self.args.per_device_train_batch_size, + drop_last=self.args.dataloader_drop_last, + ) + else: + ptq_sampler = DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=True, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + ptq_dataloader = DataLoader( + ptq_ds, + batch_sampler=ptq_sampler, + collate_fn=self.data_collator, + 
num_workers=self.args.dataloader_num_workers, + ) + return ptq_dataloader + + def ptq_loop( + self, + dataloader: DataLoader, + description: str, + max_eval_iters: Optional[int] = -1, + ): + if isinstance(dataloader, paddle.io.DataLoader): + batch_size = dataloader.batch_sampler.batch_size + else: + raise ValueError("Only support for paddle.io.DataLoader") + + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + if max_eval_iters > 0: + logger.info(f" Total {description} steps = {max_eval_iters}") + else: + logger.info(f" Total {description} steps = {len(dataloader)}") + else: + logger.info(" Num examples: Unknown") + if max_eval_iters > 0: + logger.info(f" Total {description} steps = {max_eval_iters}") + + logger.info(f" Pre device batch size = {batch_size}") + logger.info(f" Total Batch size = {batch_size * self.args.dataset_world_size}") + self.model.eval() + with paddle.no_grad(): + for step, inputs in enumerate(dataloader): + self.prediction_step(model=self.model, inputs=inputs, prediction_loss_only=True, ignore_keys=None) + if max_eval_iters > 0 and step >= max_eval_iters - 1: + break + + +def get_infer_model_path(input_dir, model_prefix): + if dist.get_world_size() > 1: + local_rank = dist.ParallelEnv().dev_id + return os.path.join(input_dir, "rank_{}".format(local_rank), model_prefix) + else: + return os.path.join(input_dir, model_prefix) + + +def generate_rank_mapping(output_filename): + ring_id = -1 + try: + hcg = fleet.get_hybrid_communicate_group() + model_parallel_group = hcg.get_model_parallel_group() + ring_id = model_parallel_group.id + except Exception: + pass + + if ring_id == -1: + return + + world_size = dist.get_world_size() + with open(output_filename, "w") as f: + f.write("[ring_id -> ranks]\n") + f.write(",".join(map(str, [0] + list(range(world_size)))) + "\n") + f.write(",".join(map(str, [ring_id] + list(range(world_size)))) + "\n") + + f.write("[rank -> ring_ids]\n") + for i in range(world_size): + f.write("{},0,{}\n".format(i, ring_id)) + + +def deserialize_from_file(fp): + x_type = fp.read(1) + x_type_out = struct.unpack("c", x_type)[0] + # data + data_list = [] + if x_type_out == b"0": + data = fp.read(4) + data_out = struct.unpack("f", data)[0] + while data: + data_out = struct.unpack("f", data)[0] + data_list.append(data_out) + data = fp.read(4) + elif x_type_out == b"1": + data = fp.read(8) + while data: + data_out = struct.unpack("l", data)[0] + data_list.append(data_out) + data = fp.read(8) + elif x_type_out == b"2": + data = fp.read(4) + while data: + data_out = struct.unpack("i", data)[0] + data_list.append(data_out) + data = fp.read(4) + else: + print("type error") + data_arr = np.array(data_list) + return data_arr + + +def get_alibi_slopes(num_heads): + closest_power_of_2 = 2 ** math.floor(math.log2(num_heads)) + base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))) + powers = np.arange(1, 1 + closest_power_of_2) + slopes = np.power(base, powers) + + if closest_power_of_2 != num_heads: + extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))) + num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2) + extra_powers = np.arange(1, 1 + 2 * num_remaining_heads, 2) + slopes = np.concatante([slopes, np.power(extra_base, extra_powers)], axis=0) + + return slopes.astype("float32") + + +def pad_batch_data(insts, pad_id=0, return_seq_len=False, pad_style="right"): + """Pad sequences to the max sequence length in batch.""" + max_len = max(map(len, insts)) + if pad_style == "left": + 
inst_data = np.array([[pad_id] * (max_len - len(inst)) + list(inst) for inst in insts]) + else: + inst_data = np.array([list(inst) + [pad_id] * (max_len - len(inst)) for inst in insts]) + + if return_seq_len: + seq_len = np.array([len(inst) for inst in insts]) + return inst_data.astype("int64").reshape([-1, max_len]), seq_len + else: + return inst_data.astype("int64").reshape([-1, max_len]) + + +def dybatch_preprocess( + tokenizer, + texts: list[str], + src_length: int, + max_length: int, + architectures: str, + top_p: float, + temperature: float, + pre_caches_length: int = 0, + benchmark: bool = False, +): + """Pre-process generation inputs.""" + if "chatglm" in architectures: + input_ids = [] + position_ids = [] + + for text in texts: + tokens = tokenizer(text, return_tensors="np", padding=True, max_length=src_length) + input_ids.append(tokens["input_ids"][0]) + position_ids.append(tokens["position_ids"][0]) + + inputs = {} + pad_token_id = tokenizer([tokenizer.pad_token], return_tensors="np")["input_ids"][0][0] + inputs["input_ids"], seq_len = pad_batch_data(input_ids, pad_id=pad_token_id, return_seq_len=True) + bs = inputs["input_ids"].shape[0] + max_len = max(map(len, input_ids)) + + inst_data_pos = [] + for i in range(len(position_ids)): + inst_data_pos.append(np.array([list(inst) + [0] * (max_len - len(inst)) for inst in position_ids[i]])) + inputs["position_ids"] = paddle.to_tensor(np.array(inst_data_pos)) + else: + input_ids = [] + position_ids = [] + if isinstance(texts, str): + texts = [texts] + + for text in texts: + tokens = tokenizer( + text, + return_tensors="np", + padding=False, + max_length=src_length, + return_attention_mask=False, + return_token_type_ids=False, + ) + input_ids.append(tokens["input_ids"][0]) + + inputs = {} + pad_token_id = tokenizer([tokenizer.pad_token], return_tensors="np")["input_ids"][0][-1] + inputs["input_ids"], seq_len = pad_batch_data(input_ids, pad_id=pad_token_id, return_seq_len=True) + bs = inputs["input_ids"].shape[0] + max_len = max(map(len, input_ids)) + + position_ids = paddle.zeros(shape=[bs, max_length + src_length], dtype="int64") + + for i in range(bs): + position_ids[i, pre_caches_length : pre_caches_length + seq_len[i]] = paddle.arange(seq_len[i]) + inputs["position_ids"] = position_ids + + tgt_ids = [input[-1:] for input in input_ids] + tgt_pos = [] + for i, valid_len in enumerate(map(len, input_ids)): + tgt_pos.append(valid_len - 1) + + step_idx = [ + 0, + ] * bs + tgt_pos = np.array(tgt_pos).astype("int64") + inputs["eos_token_id"] = ( + np.array( + [ + tokenizer.eos_token_id, + ] + * bs + ) + .reshape(-1, 1) + .astype("int64") + ) + inputs["top_p"] = ( + np.array( + [ + top_p, + ] + * bs + ) + .reshape(-1, 1) + .astype("float32") + ) + inputs["temperature"] = ( + np.array( + [ + temperature, + ] + * bs + ) + .reshape(-1, 1) + .astype("float32") + ) + inputs["seq_len_encoder"] = seq_len.astype("int32").reshape(-1, 1) + inputs["seq_len_decoder"] = (seq_len + pre_caches_length).astype("int32").reshape(-1, 1) + inputs["step_idx"] = np.array(step_idx).astype("int64").reshape(-1, 1) + inputs["tgt_ids"] = np.array(tgt_ids).astype("int64").reshape(-1, 1) + inputs["tgt_pos"] = tgt_pos.reshape(-1, 1) + inputs["max_length"] = np.array(max_length - pre_caches_length).astype("int64").reshape((-1, 1)) + inputs["min_length"] = ( + np.array( + [ + 1 + if not benchmark + else max_length + - pre_caches_length, # Note(Zhengzekang): When in benchmark mode, we need to set a fixed decode length. 
+ ] + * bs + ) + .astype("int64") + .reshape((-1, 1)) + ) + inputs["penalty_score"] = ( + np.array( + [ + 1.0, + ] + * bs + ) + .astype("float32") + .reshape((-1, 1)) + ) + inputs["frequency_score"] = ( + np.array( + [ + 0.0, + ] + * bs + ) + .astype("float32") + .reshape((-1, 1)) + ) + inputs["presence_score"] = ( + np.array( + [ + 0.0, + ] + * bs + ) + .astype("float32") + .reshape((-1, 1)) + ) + inputs["stop_flags"] = ( + np.array( + [ + 0, + ] + * bs + ) + .astype("bool") + .reshape((-1, 1)) + ) + inputs["stop_nums"] = np.array([bs]).astype("int64") + return inputs + + +def load_real_time_tokens(): + tokens = [] + files = glob.glob(os.path.join("./real_time_save.*")) + for j in range(1, len(files) + 1): + filename = "./real_time_save.temp_ids_rank_0_step_{}".format(j) + if not os.path.exists(filename): + break + fp = open(filename, "rb+") + fp.read(1) + data_list = deserialize_from_file(fp) + fp.close() + tokens.append(np.array(data_list).reshape(-1, 1)) + os.system("rm -f ./real_time_save.temp_ids_rank_*") + tokens = np.concatenate(tokens, axis=1) + return tokens diff --git a/model.properties b/model.properties new file mode 100644 index 0000000000000000000000000000000000000000..642940d8a4661878287656530612d098b34a541f --- /dev/null +++ b/model.properties @@ -0,0 +1,10 @@ +# 模型唯一标识 +modelCode=488 +# 模型名称 +modelName=llama_paddle +# 模型描述 +modelDescription=基于Paddle框架的llama-13b +# 应用场景(多个标签以英文逗号分割) +appScenario=训练,推理,医疗,教育,科研,金融 +# 框架类型(多个标签以英文逗号分割) +frameType=Paddle,PaddleNLP diff --git a/model_zoo/README.md b/model_zoo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..31bc46d811cad13bc17e7b73e19734849257d5fb --- /dev/null +++ b/model_zoo/README.md @@ -0,0 +1,3 @@ +# PaddleNLP Selected Model Zoo + +本目录是飞桨PaddleNLP精选模型库,提供了高质量的预训练模型和端到端的全流程部署工具链。 diff --git a/model_zoo/bert/README.md b/model_zoo/bert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6c172be0ee0ffda8ba225f99e4511e597232865d --- /dev/null +++ b/model_zoo/bert/README.md @@ -0,0 +1,293 @@ +# BERT + +## 模型简介 + +[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers)以[Transformer](https://arxiv.org/abs/1706.03762) 编码器为网络基本组件,使用掩码语言模型(Masked Language Model)和邻接句子预测(Next Sentence Prediction)两个任务在大规模无标注文本语料上进行预训练(pre-train),得到融合了双向内容的通用语义表示模型。以预训练产生的通用语义表示模型为基础,结合任务适配的简单输出层,微调(fine-tune)后即可应用到下游的NLP任务,效果通常也较直接在下游的任务上训练的模型更优。此前BERT即在[GLUE评测任务](https://gluebenchmark.com/tasks)上取得了SOTA的结果。 + +本项目是BERT在 Paddle 2.0上的开源实现,包含了预训练和[GLUE评测任务](https://gluebenchmark.com/tasks)上的微调代码。 + +## 快速开始 + +### 环境依赖 + +本教程除了需要安装PaddleNLP库,还需以下依赖 + +```text +h5py +``` + +### 数据准备 + +#### Pre-training数据准备 + +`create_pretraining_data.py` 是创建预训练程序所需数据的脚本。其以文本文件(使用换行符换行和空白符分隔,data目录下提供了部分示例数据)为输入,经由BERT tokenizer进行tokenize后再做生成sentence pair正负样本、掩码token等处理,最后输出hdf5格式的数据文件。使用方式如下: + +```shell +python create_pretraining_data.py \ + --input_file=data/sample_text.txt \ + --output_file=data/training_data.hdf5 \ + --bert_model=bert-base-uncased \ + --max_seq_length=128 \ + --max_predictions_per_seq=20 \ + --masked_lm_prob=0.15 \ + --random_seed=12345 \ + --dupe_factor=5 +``` + +其中参数释义如下: +- `input_file` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有`.txt`文件。 +- `output_file` 指定输出文件。 +- `bert_model` 指定使用特定BERT模型对应的tokenizer进行tokenize处理。 +- `max_seq_length` 指定最大句子长度,超过该长度将被截断,不足该长度的将会进行padding。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目。 +- `masked_lm_prob` 表示每个token被mask的概率。 +- `random_seed` 指定随机种子。 +- `dupe_factor` 指定输入数据被重复处理的次数,每次处理将重新产生随机mask。 + 
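+例如,若已经把多份按换行分隔的语料整理为同一目录下的若干 `.txt` 文件(下述 `data/corpus` 目录仅为示意),可以直接把该目录作为 `--input_file` 传入,一次性处理目录中的全部文件:
+
+```shell
+python create_pretraining_data.py \
+    --input_file=data/corpus \
+    --output_file=data/corpus_training_data.hdf5 \
+    --bert_model=bert-base-uncased \
+    --max_seq_length=128 \
+    --max_predictions_per_seq=20 \
+    --masked_lm_prob=0.15 \
+    --random_seed=12345 \
+    --dupe_factor=5
+```
+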
+使用以上预训练数据生成程序可以用于处理领域垂类数据后进行二次预训练。若需要使用BERT论文中预训练使用的英文Wiki和BookCorpus数据,可以参考[这里](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT)进行处理,得到的数据可以直接接入本项目中的预训练程序使用。
+
+#### Fine-tuning数据准备
+
+##### GLUE评测任务数据
+
+GLUE评测任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_glue.py`执行微调时将会自动下载。
+
+### 执行Pre-training
+
+#### GPU训练
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" run_pretrain.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --max_predictions_per_seq 20 \
+    --batch_size 32   \
+    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --num_train_epochs 3 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device gpu \
+    --use_amp False
+```
+ +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数,达到`max_steps`后就提前结束。注意,我们必须设置 `max_steps`。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `use_amp` 指示是否启用自动混合精度训练。 +
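+上述命令默认仅使用 0 号 GPU。由于 `batch_size` 表示每张卡上的样本数,如需多卡加速预训练,只需调整 `--gpus` 参数即可。下面给出一个使用 0~3 号共 4 张卡的示例(卡号仅为示意,请按实际环境修改),其余参数与上面的单卡命令保持一致:
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0,1,2,3" run_pretrain.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --max_predictions_per_seq 20 \
+    --batch_size 32 \
+    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --num_train_epochs 3 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device gpu \
+    --use_amp False
+```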
+ +#### GPU训练(Trainer版本) +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_pretrain_trainer.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-4 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 3 \ + --input_dir data/ \ + --output_dir pretrained_models/ \ + --logging_steps 1 \ + --save_steps 20000 \ + --max_steps 1000000 \ + --device gpu \ + --fp16 False \ + --do_train +``` + +
+#### XPU训练
+```shell
+unset FLAGS_selected_xpus
+python -m paddle.distributed.launch --xpus "0" run_pretrain.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --max_predictions_per_seq 20 \
+    --batch_size 32   \
+    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --num_train_epochs 3 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device xpu \
+    --use_amp False
+```
+ +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数,达到`max_steps`后就提前结束。注意,我们必须设置 `max_steps`。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `use_amp` 指示是否启用自动混合精度训练。 +
+ +#### XPU训练(Trainer版本) +```shell +unset FLAGS_selected_xpus +python -m paddle.distributed.launch --xpus "0" run_pretrain_trainer.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-4 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 3 \ + --input_dir data/ \ + --output_dir pretrained_models/ \ + --logging_steps 1 \ + --save_steps 20000 \ + --max_steps 1000000 \ + --device xpu \ + --fp16 False \ + --do_train +``` +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。 +- `per_device_train_batch_size` 表示用于训练的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数,达到`max_steps`后就提前结束。注意,我们必须设置 `max_steps`。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `fp16` 是否使用 fp16 混合精度训练而不是 fp32 训练。(`bool`, 可选, 默认为 `False`) +- `do_train` 是否进行训练任务。(`bool`, 可选, 默认为 `False`) + +**NOTICE**: 预训练时data目录存放的是经过 `create_pretraining_data.py` 处理后的数据,因此需要通过该数据处理脚本预先处理,否则预训练将会出现报错。 + +### 执行Fine-tunning + +以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue_trainer.py \ + --model_name_or_path bert-base-uncased \ + --task_name SST2 \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu \ + --fp16 False\ + --do_train \ + --do_eval +``` + +其中参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。注:`bert-base-uncased`等对应使用的预训练模型转自[huggingface/transformers](https://github.com/huggingface/transformers),具体可参考当前目录下converter中的内容。 +- `task_name` 表示Fine-tuning的任务。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示用于训练的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) +- `per_device_eval_batch_size` 表示用于评估的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 +- `fp16` 是否使用 fp16 混合精度训练而不是 fp32 训练。(`bool`, 可选, 默认为 `False`) +- `do_train` 是否进行训练任务。(`bool`, 可选, 默认为 `False`) +- `do_eval` 是否进行评估任务。同上。(`bool`, 可选, 默认为 `False`) + +基于`bert-base-uncased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:-----------------:| +| SST2 | Accuracy | 0.92660 | +| QNLI | Accuracy | 0.91707 | +| CoLA | Mattehew's corr | 0.59557 | +| MRPC | F1/Accuracy | 0.91667/0.88235 | +| STSB | Person/Spearman corr | 0.88847/0.88350 | +| QQP | Accuracy/F1 | 0.90581/0.87347 | +| MNLI | Matched acc/MisMatched acc | 0.84422/0.84825 | +| RTE | Accuracy | 0.711191 | + + +### 预测 + +在Fine-tuning完成后,我们可以使用如下方式导出希望用来预测的模型: + +```shell +python -u 
./export_model.py \ + --model_type bert \ + --model_path bert-base-uncased \ + --output_path ./infer_model/model +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_path` 表示训练模型的保存路径,与训练时的`output_dir`一致。 +- `output_path` 表示导出预测模型文件的前缀。保存时会添加后缀(`pdiparams`,`pdiparams.info`,`pdmodel`);除此之外,还会在`output_path`包含的目录下保存tokenizer相关内容。 + +完成模型导出后,可以开始部署。`deploy/python/seq_cls_infer.py` 文件提供了python部署预测示例。可执行以下命令运行部署示例: + +```shell +python deploy/python/seq_cls_infer.py --model_dir infer_model/ --device gpu --backend paddle +``` + +运行后预测结果打印如下: + +```bash +[2023-03-02 08:30:03,877] [ INFO] - We are using to load '../../infer_model/'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id: 0, example id: 0, sentence1: against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting, label: positive, negative prob: 0.0003, positive prob: 0.9997. +Batch id: 1, example id: 0, sentence1: the situation in a well-balanced fashion, label: positive, negative prob: 0.0002, positive prob: 0.9998. +Batch id: 2, example id: 0, sentence1: at achieving the modest , crowd-pleasing goals it sets for itself, label: positive, negative prob: 0.0017, positive prob: 0.9983. +Batch id: 3, example id: 0, sentence1: so pat it makes your teeth hurt, label: negative, negative prob: 0.9986, positive prob: 0.0014. +Batch id: 4, example id: 0, sentence1: this new jangle of noise , mayhem and stupidity must be a serious contender for the title ., label: negative, negative prob: 0.9806, positive prob: 0.0194. +``` + +更多详细用法可参考 [Python 部署](deploy/python/README.md)。 + +## 扩展 + +上述的介绍是基于动态图的BERT的预训练任务和微调任务以及预测任务的实践过程,同时在我们也提供了基于PaddlePaddle Fleet API的静态图的BERT相关实践,在组网代码层面保持动静统一,在计算速度以及多机联合训练方面有着更优的性能,具体的细节可以参考 [BERT静态图](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/bert/static) 。 diff --git a/model_zoo/bert/create_pretraining_data.py b/model_zoo/bert/create_pretraining_data.py new file mode 100644 index 0000000000000000000000000000000000000000..d88859585adba2a24223bc20d8d73b1298381a9b --- /dev/null +++ b/model_zoo/bert/create_pretraining_data.py @@ -0,0 +1,473 @@ +# coding=utf-8 +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. +# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Create masked LM/next sentence masked_lm examples for BERT.""" +import argparse +import collections +import os +import random +from io import open + +import h5py +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import BertTokenizer +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + + +class TrainingInstance(object): + """A single training instance (sentence pair).""" + + def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next): + self.tokens = tokens + self.segment_ids = segment_ids + self.is_random_next = is_random_next + self.masked_lm_positions = masked_lm_positions + self.masked_lm_labels = masked_lm_labels + + +def write_instance_to_example_file(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_file): + """Create example files from `TrainingInstance`s.""" + + total_written = 0 + features = collections.OrderedDict() + + num_instances = len(instances) + features["input_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["input_mask"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["segment_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["masked_lm_positions"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["masked_lm_ids"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32") + + for inst_index, instance in enumerate(tqdm(instances)): + input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) + input_mask = [1] * len(input_ids) + segment_ids = list(instance.segment_ids) + assert len(input_ids) <= max_seq_length + + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + masked_lm_positions = list(instance.masked_lm_positions) + masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) + masked_lm_weights = [1.0] * len(masked_lm_ids) + + while len(masked_lm_positions) < max_predictions_per_seq: + masked_lm_positions.append(0) + masked_lm_ids.append(0) + masked_lm_weights.append(0.0) + + next_sentence_label = 1 if instance.is_random_next else 0 + + features["input_ids"][inst_index] = input_ids + features["input_mask"][inst_index] = input_mask + features["segment_ids"][inst_index] = segment_ids + features["masked_lm_positions"][inst_index] = masked_lm_positions + features["masked_lm_ids"][inst_index] = masked_lm_ids + features["next_sentence_labels"][inst_index] = next_sentence_label + + total_written += 1 + + print("saving data") + f = h5py.File(output_file, "w") + f.create_dataset("input_ids", data=features["input_ids"], dtype="i4", compression="gzip") + f.create_dataset("input_mask", data=features["input_mask"], dtype="i1", compression="gzip") + f.create_dataset("segment_ids", data=features["segment_ids"], dtype="i1", compression="gzip") + f.create_dataset("masked_lm_positions", data=features["masked_lm_positions"], dtype="i4", compression="gzip") + f.create_dataset("masked_lm_ids", data=features["masked_lm_ids"], dtype="i4", compression="gzip") + f.create_dataset("next_sentence_labels", data=features["next_sentence_labels"], dtype="i1", compression="gzip") + f.flush() + f.close() + + +def create_training_instances( + input_files, tokenizer, max_seq_length, dupe_factor, 
short_seq_prob, masked_lm_prob, max_predictions_per_seq, rng +): + """Create `TrainingInstance`s from raw text.""" + all_documents = [[]] + + # Input file format: + # (1) One sentence per line. These should ideally be actual sentences, not + # entire paragraphs or arbitrary spans of text. (Because we use the + # sentence boundaries for the "next sentence prediction" task). + # (2) Blank lines between documents. Document boundaries are needed so + # that the "next sentence prediction" task doesn't span between documents. + for input_file in input_files: + print("creating instance from {}".format(input_file)) + with open(input_file, "r", encoding="UTF-8") as reader: + while True: + line = convert_to_unicode(reader.readline()) + if not line: + break + line = line.strip() + + # Empty lines are used as document delimiters + if not line: + all_documents.append([]) + tokens = tokenizer.tokenize(line) + if tokens: + all_documents[-1].append(tokens) + + # Remove empty documents + all_documents = [x for x in all_documents if x] + rng.shuffle(all_documents) + + # vocab_words = list(tokenizer.vocab.keys()) + vocab_words = list(tokenizer.vocab.token_to_idx.keys()) + instances = [] + for _ in range(dupe_factor): + for document_index in range(len(all_documents)): + instances.extend( + create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, + ) + ) + + rng.shuffle(instances) + return instances + + +def create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, +): + """Creates `TrainingInstance`s for a single document.""" + document = all_documents[document_index] + + # Account for [CLS], [SEP], [SEP] + max_num_tokens = max_seq_length - 3 + + # We *usually* want to fill up the entire sequence since we are padding + # to `max_seq_length` anyways, so short sequences are generally wasted + # computation. However, we *sometimes* + # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter + # sequences to minimize the mismatch between pre-training and fine-tuning. + # The `target_seq_length` is just a rough target however, whereas + # `max_seq_length` is a hard limit. + target_seq_length = max_num_tokens + if rng.random() < short_seq_prob: + target_seq_length = rng.randint(2, max_num_tokens) + + # We DON'T just concatenate all of the tokens from a document into a long + # sequence and choose an arbitrary split point because this would make the + # next sentence prediction task too easy. Instead, we split the input into + # segments "A" and "B" based on the actual "sentences" provided by the user + # input. + instances = [] + current_chunk = [] + current_length = 0 + i = 0 + while i < len(document): + segment = document[i] + current_chunk.append(segment) + current_length += len(segment) + if i == len(document) - 1 or current_length >= target_seq_length: + if current_chunk: + # `a_end` is how many segments from `current_chunk` go into the `A` + # (first) sentence. 
+ a_end = 1 + if len(current_chunk) >= 2: + a_end = rng.randint(1, len(current_chunk) - 1) + + tokens_a = [] + for j in range(a_end): + tokens_a.extend(current_chunk[j]) + + tokens_b = [] + # Random next + is_random_next = False + if len(current_chunk) == 1 or rng.random() < 0.5: + is_random_next = True + target_b_length = target_seq_length - len(tokens_a) + + # This should rarely go for more than one iteration for large + # corpora. However, just to be careful, we try to make sure that + # the random document is not the same as the document + # we're processing. + for _ in range(10): + random_document_index = rng.randint(0, len(all_documents) - 1) + if random_document_index != document_index: + break + + # If picked random document is the same as the current document + if random_document_index == document_index: + is_random_next = False + + random_document = all_documents[random_document_index] + random_start = rng.randint(0, len(random_document) - 1) + for j in range(random_start, len(random_document)): + tokens_b.extend(random_document[j]) + if len(tokens_b) >= target_b_length: + break + # We didn't actually use these segments so we "put them back" so + # they don't go to waste. + num_unused_segments = len(current_chunk) - a_end + i -= num_unused_segments + # Actual next + else: + is_random_next = False + for j in range(a_end, len(current_chunk)): + tokens_b.extend(current_chunk[j]) + truncate_seq_pair(tokens_a, tokens_b, target_seq_length, rng) + + assert len(tokens_a) >= 1 + assert len(tokens_b) >= 1 + + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + + tokens.append("[SEP]") + segment_ids.append(0) + + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + (tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions( + tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng + ) + instance = TrainingInstance( + tokens=tokens, + segment_ids=segment_ids, + is_random_next=is_random_next, + masked_lm_positions=masked_lm_positions, + masked_lm_labels=masked_lm_labels, + ) + instances.append(instance) + current_chunk = [] + current_length = 0 + i += 1 + + return instances + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) + + +def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): + """Creates the predictions for the masked LM objective.""" + + cand_indexes = [] + for (i, token) in enumerate(tokens): + if token == "[CLS]" or token == "[SEP]": + continue + cand_indexes.append(i) + + rng.shuffle(cand_indexes) + + output_tokens = list(tokens) + + num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) + + masked_lms = [] + covered_indexes = set() + for index in cand_indexes: + if len(masked_lms) >= num_to_predict: + break + if index in covered_indexes: + continue + covered_indexes.add(index) + + masked_token = None + # 80% of the time, replace with [MASK] + if rng.random() < 0.8: + masked_token = "[MASK]" + else: + # 10% of the time, keep original + if rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] + + output_tokens[index] = masked_token + + masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) + + masked_lms = sorted(masked_lms, 
key=lambda x: x.index) + + masked_lm_positions = [] + masked_lm_labels = [] + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + + return (output_tokens, masked_lm_positions, masked_lm_labels) + + +def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): + """Truncates a pair of sequences to a maximum sequence length.""" + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_num_tokens: + break + + trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b + assert len(trunc_tokens) >= 1 + + # We want to sometimes truncate from the front and sometimes from the + # back to add more randomness and avoid biases. + if rng.random() < 0.5: + del trunc_tokens[0] + else: + trunc_tokens.pop() + + +def main(): + + parser = argparse.ArgumentParser() + + parser.add_argument( + "--input_file", + default=None, + type=str, + required=True, + help="The input train corpus. can be directory with .txt files or a path to a single file", + ) + parser.add_argument( + "--output_file", + default=None, + type=str, + required=True, + help="The output file where created hdf5 formatted data will be written.", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + required=False, + help="The vocabulary the BERT model will train on. " + "Use bert_model argument would ignore this. " + "The bert_model argument is recommended.", + ) + parser.add_argument( + "--do_lower_case", + action="store_true", + default=True, + help="Whether to lower case the input text. True for uncased models, False for cased models. " + "Use bert_model argument would ignore this. The bert_model argument is recommended.", + ) + parser.add_argument( + "--bert_model", + default="bert-base-uncased", + type=str, + required=False, + help="Bert pre-trained model selected in the list: bert-base-uncased, " + "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese." + "If provided, use the pre-trained model used tokenizer to create data " + "and ignore vocab_file and do_lower_case.", + ) + + # Other parameters + # int + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after WordPiece tokenization. \n" + "Sequences longer than this will be truncated, and sequences shorter \n" + "than this will be padded.", + ) + parser.add_argument( + "--dupe_factor", + default=10, + type=int, + help="Number of times to duplicate the input data (with different masks).", + ) + parser.add_argument( + "--max_predictions_per_seq", default=20, type=int, help="Maximum number of masked LM predictions per sequence." + ) + + # floats + parser.add_argument("--masked_lm_prob", default=0.15, type=float, help="Masked LM probability.") + parser.add_argument( + "--short_seq_prob", + default=0.1, + type=float, + help="Probability to create a sequence shorter than maximum sequence length", + ) + + parser.add_argument("--random_seed", type=int, default=12345, help="random seed for initialization") + + args = parser.parse_args() + print(args) + + if args.bert_model: + tokenizer = BertTokenizer.from_pretrained(args.bert_model) + else: + assert args.vocab_file, "vocab_file must be set If bert_model is not provided." 
+ tokenizer = BertTokenizer(args.vocab_file, do_lower_case=args.do_lower_case) + + input_files = [] + if os.path.isfile(args.input_file): + input_files.append(args.input_file) + elif os.path.isdir(args.input_file): + input_files = [ + os.path.join(args.input_file, f) + for f in os.listdir(args.input_file) + if (os.path.isfile(os.path.join(args.input_file, f)) and f.endswith(".txt")) + ] + else: + raise ValueError("{} is not a valid path".format(args.input_file)) + + rng = random.Random(args.random_seed) + instances = create_training_instances( + input_files, + tokenizer, + args.max_seq_length, + args.dupe_factor, + args.short_seq_prob, + args.masked_lm_prob, + args.max_predictions_per_seq, + rng, + ) + + output_file = args.output_file + + write_instance_to_example_file( + instances, tokenizer, args.max_seq_length, args.max_predictions_per_seq, output_file + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/data/sample_text.txt b/model_zoo/bert/data/sample_text.txt new file mode 100644 index 0000000000000000000000000000000000000000..75ec60cdb7842e023d32f95f0e16e1973eff4b71 --- /dev/null +++ b/model_zoo/bert/data/sample_text.txt @@ -0,0 +1,100 @@ +Zulfiqar A. Bhutta trained as a physician in Pakistan in the early stages of his career. +He holds titles across various organizations in diverse geographies. +Professor Bhutta is the Founding Director of the Center of Excellence in Women and Child Health & Institute for Global Child Health & Development, at the Aga Khan University South-Central Asia, East Africa & United Kingdom. +He is currently the Co-Director at the Centre for Global Child Health, at the Hospital for Sick Children and leads many projects as a Senior Scientist at the Research Institute in the Centre for Global Child Health at Sick Kids. +He holds a Professorship at the University of Toronto in the Department of Nutritional Sciences and the Division of Epidemiology, Dalla Lana School of Public Health. +Additionally, he holds concurrent professorship at the Department of Paediatrics, Aga Khan University in Karachi, Pakistan and at the Schools of Public Health of Johns Hopkins University, Tufts University, Boston University, University of Alberta and the London School of Hygiene & Tropical Medicine. +He is a designated Distinguished National Professor of the Government of Pakistan and was the Founding Chair of the National Research Ethics Committee of the Government of Pakistan from 2003-2014. +Dr. Bhutta received his MBBS from Khyber Medical College in Peshawar, Pakistan in 1977 at which time he was names "Best Graduate of the Year" and awarded the University Gold Medal for overall distinction. +His PhD work was completed at Karolinska Institute in Stockholm, Sweden in 1996. +He is a Fellow of the Royal College of Physicians (Edinburgh & London), the Royal College of Paediatrics and Child Health (London), American Academy of Paediatrics and the Pakistan Academy of Sciences. +Following the completion of his PhD Dr. Bhutta began working as House Surgeon in Obstetrics & Gynecology at the Khyber Teaching Hospital, Peshawar (April-November 1978). +He began work in paediatrics as a physician in November of 1978 in the Professorial Unit at the Institute of Child Health, Jinnah Postgraduate Medical Centre, Karachi (Pakistan). +Through 1980's he continued his work as a surgeon and paediatrician. +He undertook his first professor position in the Department of Paediatrics, The Aga Khan University Hospital, Karachi (Pakistan), from November 1987 to June 1992. 
+In 2005, Dr. Bhutta became the Chairman of the Department of Paediatrics & Child Health at the Aga Khan University & Medical Center, a position held until 2008. +Following his term as Chairman he became The Noordin Noormahomed Sheriff Professor & Founding Chair, Division of Women & Child Health, The Aga Khan University, a position he held for four years. +Dr. Bhutta currently holds the titles of co-director of the Centre for Global Child Health at the Hospital for Sick Children in Toronto, and founding director of the Centre of Excellence in Women and Child Health at the Aga Khan University. +In 2020, he was appointed founding director of the Institute for Global child Health & Development at the Aga Khan University and elected Fellow to the Royal Society, United Kingdom. +Outside of his professional responsibilities Dr. Bhutta serves on various local and international boards and committees, including a series of editorial boards. +In his various capacities Dr. Bhutta has produced a large collection of publications working with his teams at Sick Kids, AKU and international partners. +These include book reviews, chapters, 1. +"Haematological disorders" "Neonatal Jaundice" in Neonatal Vade‑Mecum, Fleming PJ, Speidel BD, Dunn PM Eds, Lloyd‑Luke Publishers, UK, 1986. +Revised 2nd Edition 1991. +2. +"Nutritional management of acute and persistent diarrhoea". +A M Molla, Bhutta Z A and  A Molla. +In McNeish A S, Mittal S K and Walker-Smith J A (eds). +Recent trends in diarrhoea and malnutrition, MAMC, Delhi, 1991, pp 37-51. +3. +"Paediatric Prescribing” in "Text book of Paediatrics for developing countries"            Arif MA, Hanif SM, Wasti SMK Eds, 1989, 2nd Edition 1996,  PPA, Karachi. +& Lahore 4. +"Innovations in neonatal care : Impact on neonatal survival in the developing world:. +Bhutta Z A  Zaidi S (Editor) 1992. +TWEL Publisher. +Karachi pp 121-131 5. +"Short course therapy in Pediatrics" Bhutta Z A& Teele D.  In Tice A D, Waldvogel F (Eds), Contemporary issues in Infectious Disease Epidemiology and Management, 1993 Gardiner Caldwell, Cheshire, pp 52 - 60. +6. +"Dietary management of persistent diarrhoea". +Bhutta Z A, Molla A M, Issani Z. +In Reflections on  Diarrhoeal Disease & Nutrition  of Children". +1993 Karachi, pp 97 - 103. +7. +"Prescribing practices amongst general practitioners (GPs) and consultant paediatricians in childhood diarrhoea.”  S.Q. +Nizami, I.A. +Khan, Bhutta Z A. +In "Reflections on Diarrhoeal Disease and Nutrition of Children". +1993 Karachi, pp  88-90. +8. +"The challenge of multidrug-resistant typhoid". +Bhutta Z A. +In Puri R K, Sachdev H P S, Choudhry P, Verma I C (Eds), Current concepts in Paediatrics, 1994. +Jaypee Publishers, New Delhi, pp 403.8. +9. +"Perinatal Care in Pakistan: Current status and trends". +In Proceedings of the Workshop in Reproductive Health. +College of Physicians and Surgeons, Pakistan, Karachi, 1995, pp 95-103. +10. +“A study of whole body protein kinetics in malnourished children with persistent diarrhoea” Bhutta Z A, Nizami SQ, Isani Z, Hardy S, Hendricks K, Young V.   Report of the second RCM coordinated Research Programme for application of stable isotope tracer methods to studies of energy metabolism in malnourished populations of developing countries. +NAHRES-30 1996 IAEA Vienna. +11. +"Pneumococcal infections in Pakistan: a country report". +In Adult Immunization in Asia, Fondation Mercel Merieux, Lyon, 1998. pp 79-82. +12. +“Factors affecting protein and aminoacid metabolism in childhood from developing countries". 
+In Child Nutrition: an international perspective. +Editors Solomons NW, Caballero B, Brown KH. +CRC Press 1998. +13. +"Protein Digestion and Bioavailability". +In Encyclopedia of Human Nutrition. +Editors: Sadler M, Strain JJ, Caballero B. +Academic Press (London), 1998 pp.1646-54. +14. +"Perinatal Care in Pakistan. +Reproductive Health: A manual for family practice and primary health care. +Bhutta Z A, Maqbool S.  College of Physicians and Surgeons, Pakistan, Karachi, 1999, pp 69-78. +15. +“Effective interventions to reduce neonatal mortality and morbidity from perinatal infection. +Bhutta ZA. +In Costello A, Manandhar D (eds). +"Improving Newborn Infant Health in Developing Countries’ 1999. +Imperial College Press, London pp.289-308. +16. +“Ambulatory management of typhoid fever”            “Risk factors and management of micronutrient deficiencies”            “Management of persistent diarrhoea in developing countries”. +In Manual of International Child Health, British Medical Journal, 2000 (in press). +17. +“The role of Cefixime in typhoid fever during childhood” in Cefixime, Adam D, Quintiliani R (Eds), Torre-Lazur-McCann, Tokyo, 2000; pp.107-112. +18. +"Micronutrients and Child Health in the Commonwealth”, Commonwealth Foundation" (UK) (2001). +19. +"Isotopic evaluation of breast milk intake, energy metabolism growth and body composition of exclusively breastfed infants in Pakistan". +Bhutta ZA, Nizami SQ, Weaver LT, Preston T. In Application of Stable Isotopes to evaluate Growth and Body Composition of Exclusively Breastfed infants, IAEA and WHO, NAHRES Report. +2000. +20. +“Typhoid Fever in Childhood: the south Asian experience”. +Ahmad K &Bhutta ZA. +In "Recent Advances in Paediatrics", Gupte S (Ed), 2000, India . +21. +“Neonatal Infections in developing countries” in  Carrera JM, Cabero L, Baraibar R (Eds). +The Perinatal Medicine of the new Millennium. \ No newline at end of file diff --git a/model_zoo/bert/deploy/python/README.md b/model_zoo/bert/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b2f9ef8727edd8bb695a9d482b99ff5119b6ce9b --- /dev/null +++ b/model_zoo/bert/deploy/python/README.md @@ -0,0 +1,139 @@ +# FastDeploy BERT 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下分别提供 `seq_cls_infer.py` 快速完成在 CPU/GPU 的 GLUE 文本分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 BERT 模型在 GLUE SST-2 数据集上进行自然语言推断任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [BERT 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/bert/infer_model`(用户可按实际情况设置)。 + + +```bash +# CPU 推理 +python seq_cls_infer.py --model_dir ../../infer_model/ --device cpu --backend paddle +# GPU 推理 +python seq_cls_infer.py --model_dir ../../infer_model/ --device gpu --backend paddle +``` + +运行完成后返回的结果如下: + +```bash +[2023-03-02 08:30:03,877] [ INFO] - We are using to load '../../infer_model/'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. 
+Batch id: 0, example id: 0, sentence1: against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting, label: positive, negative prob: 0.0003, positive prob: 0.9997.
+Batch id: 1, example id: 0, sentence1: the situation in a well-balanced fashion, label: positive, negative prob: 0.0002, positive prob: 0.9998.
+Batch id: 2, example id: 0, sentence1: at achieving the modest , crowd-pleasing goals it sets for itself, label: positive, negative prob: 0.0017, positive prob: 0.9983.
+Batch id: 3, example id: 0, sentence1: so pat it makes your teeth hurt, label: negative, negative prob: 0.9986, positive prob: 0.0014.
+Batch id: 4, example id: 0, sentence1: this new jangle of noise , mayhem and stupidity must be a serious contender for the title ., label: negative, negative prob: 0.9806, positive prob: 0.0194.
+```
+
+## 参数说明
+
+| 参数 | 参数说明 |
+|----------|--------------|
+| --model_dir | 指定部署模型的目录 |
+| --batch_size | 输入的 batch size,默认为 1 |
+| --max_length | 最大序列长度,默认为 128 |
+| --device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为 'cpu' |
+| --device_id | 运行设备的 id,默认为 0 |
+| --cpu_threads | 当使用 cpu 推理时,指定推理的 cpu 线程数,默认为 1 |
+| --backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为 'paddle' |
+| --use_fp16 | 是否使用 FP16 模式进行推理,使用 tensorrt 和 paddle_tensorrt 后端时可开启,默认为 False |
+| --use_fast | 是否使用 FastTokenizer 加速分词阶段,默认为 True |
+
+## FastDeploy 高阶用法
+
+FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 BERT 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 BERT 模型。
+
+符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持;
+
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|----------------|--------------------|--------------------------------|--------------------|
+| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
+| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ✅ | N/A |
+| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
+| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
+| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
+| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
+| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ✅ | ✅ |
+| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
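+
+下面给出一个按照上表选择硬件与推理后端的最小示意(仅供参考:这里假设部署模型已导出至 `../../infer_model/`,文件前缀为 `model`,实际请按自己的模型目录调整;完整的前后处理流程见本目录的 `seq_cls_infer.py`):
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+# 指定导出的推理模型与参数文件
+option.set_model_path("../../infer_model/model.pdmodel", "../../infer_model/model.pdiparams")
+# 选择硬件(对应上表"硬件对应的接口"一列)
+option.use_gpu(0)
+# 选择推理引擎后端(对应上表"推理引擎对应的接口"一列)
+option.use_paddle_infer_backend()
+runtime = fd.Runtime(option)
+```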
diff --git a/model_zoo/bert/deploy/python/seq_cls_infer.py b/model_zoo/bert/deploy/python/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..68f8a1b113573fa49be40ea5c453a5292a65e9bc --- /dev/null +++ b/model_zoo/bert/deploy/python/seq_cls_infer.py @@ -0,0 +1,159 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + option.use_gpu(args.device_id) + if args.backend == "paddle": + 
option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "token_type_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + def preprocess(self, text): + data = self.tokenizer(text, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + texts_ds = [ + "against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting", + "the situation in a well-balanced fashion", + "at achieving the modest , crowd-pleasing goals it sets for itself", + "so pat it makes your teeth hurt", + "this new jangle of noise , mayhem and stupidity must be a serious contender for the title .", + ] + label_map = {0: "negative", 1: "positive"} + batch_texts = batchfy_text(texts_ds, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for i, sentence1 in enumerate(texts): + print( + f"Batch id: {bs}, example id: {i}, sentence1: {sentence1}, " + f"label: {label_map[outputs['label'][i]]}, negative prob: {outputs['confidence'][i][0]:.4f}, " + f"positive prob: {outputs['confidence'][i][1]:.4f}." + ) diff --git a/model_zoo/bert/export_model.py b/model_zoo/bert/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..367be76da408757c2dfeb166e670cb8b2cf22e4f --- /dev/null +++ b/model_zoo/bert/export_model.py @@ -0,0 +1,77 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from run_glue_trainer import MODEL_CLASSES + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="Path of the trained model to be exported.", + ) + parser.add_argument( + "--output_path", + default=None, + type=str, + required=True, + help="The output file prefix used to save the exported inference model.", + ) + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # build model and load trained parameters + model = model_class.from_pretrained(args.model_path) + # switch to eval model + model.eval() + # convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # save converted static graph model + paddle.jit.save(model, args.output_path) + # also save tokenizer for inference usage + tokenizer = tokenizer_class.from_pretrained(args.model_path) + tokenizer.save_pretrained(os.path.dirname(args.output_path)) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/run_glue_trainer.py b/model_zoo/bert/run_glue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..e06af9e9bcac2f491514aa6dbfa6eca09efb7743 --- /dev/null +++ b/model_zoo/bert/run_glue_trainer.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
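+
+# This script fine-tunes a pretrained sequence classification model (e.g. BERT/ERNIE) on a GLUE
+# task with the PaddleNLP Trainer API: it loads the HuggingFace `glue` dataset, tokenizes the
+# sentence (pair) fields listed in `task_to_keys`, pads batches with `DataCollatorWithPadding`,
+# and reports the task-specific metric from `METRIC_CLASSES`.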
+ +from dataclasses import dataclass, field + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, +) + +METRIC_CLASSES = { + "cola": Mcc, + "sst2": Accuracy, + "mrpc": AccuracyAndF1, + "stsb": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, + "wnli": Accuracy, +} + +task_to_keys = { + "cola": ("sentence", None), + "mnli": ("premise", "hypothesis"), + "mrpc": ("sentence1", "sentence2"), + "qnli": ("question", "sentence"), + "qqp": ("question1", "question2"), + "rte": ("sentence1", "sentence2"), + "sst2": ("sentence", None), + "stsb": ("sentence1", "sentence2"), + "wnli": ("sentence1", "sentence2"), +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +@dataclass +class ModelArguments: + task_name: str = field( + default=None, + metadata={"help": "The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys())}, + ) + model_name_or_path: str = field( + default=None, + metadata={"help": "Path to pre-trained model or shortcut name"}, + ) + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + +def do_train(): + training_args, model_args = PdArgumentParser([TrainingArguments, ModelArguments]).parse_args_into_dataclasses() + training_args: TrainingArguments = training_args + model_args: ModelArguments = model_args + + training_args.print_config(model_args, "Model") + training_args.print_config(training_args, "Training") + + model_args.task_name = model_args.task_name.lower() + + sentence1_key, sentence2_key = task_to_keys[model_args.task_name] + + train_ds = load_dataset("glue", model_args.task_name, split="train") + columns = train_ds.column_names + is_regression = model_args.task_name == "stsb" + label_list = None + if not is_regression: + label_list = train_ds.features["label"].names + num_labels = len(label_list) + else: + num_labels = 1 + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + def preprocess_function(examples): + # Tokenize the texts + texts = ( + (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key]) + ) + result = tokenizer(*texts, max_length=model_args.max_seq_length, truncation=True) + if "label" in examples: + # In all cases, rename the column to labels because the model will expect that. 
+ result["labels"] = examples["label"] + return result + + train_ds = train_ds.map(preprocess_function, batched=True, remove_columns=columns) + data_collator = DataCollatorWithPadding(tokenizer) + + if model_args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", model_args.task_name, split=["validation_matched", "validation_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(preprocess_function, batched=True, remove_columns=columns) + dev_ds_mismatched = dev_ds_mismatched.map(preprocess_function, batched=True, remove_columns=columns) + dev_ds = {"matched": dev_ds_matched, "mismatched": dev_ds_mismatched} + else: + dev_ds = load_dataset("glue", model_args.task_name, split="validation") + dev_ds = dev_ds.map(preprocess_function, batched=True, remove_columns=columns) + + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_labels=num_labels) + + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + if is_regression: + preds = np.squeeze(preds) + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = METRIC_CLASSES[model_args.task_name]() + result = metric.compute(preds, label) + metric.update(result) + + if isinstance(metric, AccuracyAndF1): + acc, precision, recall, f1, _ = metric.accumulate() + return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1} + elif isinstance(metric, Mcc): + mcc = metric.accumulate() + return {"mcc": mcc[0]} + elif isinstance(metric, PearsonAndSpearman): + pearson, spearman, _ = metric.accumulate() + return {"pearson": pearson, "spearman": spearman} + elif isinstance(metric, Accuracy): + acc = metric.accumulate() + return {"accuracy": acc} + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + # training + if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + if model_args.task_name == "mnli": + for _, eval_dataset in dev_ds.items(): + eval_metrics = trainer.evaluate(eval_dataset) + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", eval_metrics) + else: + eval_metrics = trainer.evaluate(dev_ds) + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/bert/run_pretrain.py b/model_zoo/bert/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..0c855de0e47498cd1d54c74561bfbd4f84274f2b --- /dev/null +++ b/model_zoo/bert/run_pretrain.py @@ -0,0 +1,476 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
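+
+# This script runs BERT/ERNIE pretraining (masked LM + next sentence prediction) on the HDF5
+# shards produced by create_pretraining_data.py. It supports multi-GPU training via
+# paddle.distributed, optional mixed precision (--use_amp), and dynamic-to-static conversion
+# (--to_static) for benchmarking.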
+ +import argparse +import os +import random +import sys +import time +from concurrent.futures import ThreadPoolExecutor + +import h5py +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Stack +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForPretraining, + BertModel, + BertPretrainingCriterion, + BertTokenizer, + ErnieForPretraining, + ErnieModel, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = { + "bert": (BertModel, BertForPretraining, BertPretrainingCriterion, BertTokenizer), + "ernie": (ErnieModel, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + + parser.add_argument( + "--max_predictions_per_seq", default=80, type=int, help="The maximum total of masked tokens in input sequence" + ) + + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--max_steps", + default=1000000, + type=int, + help="Set total number of training steps to perform. ", + ) + parser.add_argument( + "--preprocessing_num_workers", + type=int, + default=0, + help="The number of processes to use for the preprocessing.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "gpu", "xpu", "npu"], + help="Device for selecting for the training.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument( + "--amp_level", type=str, default="O2", choices=["O1", "O2"], help="select O1 or O2 of amp level." 
+ ) + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + parser.add_argument("--to_static", type=strtobool, default=False, help="Enable training under @to_static.") + + # For benchmark. + parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + parser.add_argument( + "--fuse_transformer", + type=strtobool, + default=False, + help="Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not.", + ) + parser.add_argument( + "--cinn", + type=strtobool, + default=False, + help="If cinn is True, we will apply @to_static to model.bert.encoder, else we will apply it to the whole model.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def create_pretraining_dataset(input_file, max_pred_length, shared_list, args, worker_init): + train_data = PretrainingDataset(input_file=input_file, max_pred_length=max_pred_length) + # files have been sharded, no need to dispatch again + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + # DataLoader cannot be pickled because of its place. + # If it can be pickled, use global function instead of lambda and use + # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch. + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # input_ids, segment_ids, input_mask, masked_lm_positions, + # masked_lm_labels, next_sentence_labels, mask_token_num + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # Padding for divisibility by 8 for fp16 or int8 usage + if size % 8 != 0: + size += 8 - (size % 8) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + # mask_token_num + out.append(np.asarray([mask_token_num], dtype=np.float32)) + return out + + train_data_loader = DataLoader( + dataset=train_data, + batch_sampler=train_batch_sampler, + collate_fn=_collate_data, + num_workers=args.preprocessing_num_workers, + worker_init_fn=worker_init, + return_list=True, + ) + return train_data_loader, input_file + + +def create_input_specs(): + input_ids = paddle.static.InputSpec(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.InputSpec(name="segment_ids", shape=[-1, -1], dtype="int64") + position_ids = None + input_mask = paddle.static.InputSpec(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.InputSpec(name="masked_lm_positions", shape=[-1], dtype="int32") + return [input_ids, segment_ids, position_ids, input_mask, masked_lm_positions] + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, 
max_pred_length): + self.input_file = input_file + self.max_pred_length = max_pred_length + f = h5py.File(input_file, "r") + keys = [ + "input_ids", + "input_mask", + "segment_ids", + "masked_lm_positions", + "masked_lm_ids", + "next_sentence_labels", + ] + self.inputs = [np.asarray(f[key][:]) for key in keys] + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs[0]) + + def __getitem__(self, index): + + [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [ + input[index].astype(np.int64) if indice < 5 else np.asarray(input[index].astype(np.int64)) + for indice, input in enumerate(self.inputs) + ] + # TODO: whether to use reversed mask by changing 1s and 0s to be + # consistent with nv bert + input_mask = (1 - np.reshape(input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e9 + + index = self.max_pred_length + # store number of masked tokens in index + # outputs of torch.nonzero diff with that of numpy.nonzero by zip + padded_mask_indices = (masked_lm_positions == 0).nonzero()[0] + if len(padded_mask_indices) != 0: + index = padded_mask_indices[0].item() + else: + index = self.max_pred_length + # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64) + # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index] + masked_lm_labels = masked_lm_ids[:index] + masked_lm_positions = masked_lm_positions[:index] + # softmax_with_cross_entropy enforce last dim size equal 1 + masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1) + next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1) + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + + args.model_type = args.model_type.lower() + base_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + if args.model_name_or_path in pretrained_models_list: + config = model_class.config_class.from_pretrained(args.model_name_or_path) + config.fuse = args.fuse_transformer + model = model_class(config) + else: + model = model_class.from_pretrained(args.model_name_or_path) + criterion = criterion_class(getattr(model, model_class.base_model_prefix).config.vocab_size) + # decorate @to_static for benchmark, skip it by default. + if args.to_static: + if args.cinn: + model.bert.encoder = paddle.jit.to_static(model.bert.encoder) + logger.info("Successfully to apply @to_static to model.bert.encoder.") + else: + specs = create_input_specs() + model = paddle.jit.to_static(model, input_spec=specs) + logger.info("Successfully to apply @to_static to the whole model with specs: {}.".format(specs)) + + # If use default last_epoch, lr of the first iteration is 0. + # Use `last_epoch = 0` to be consistent with nv bert. + num_training_steps = args.max_steps + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps, last_epoch=0) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + model = paddle.amp.decorate(models=model, level=args.amp_level, save_dtype="float32") + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + pool = ThreadPoolExecutor(1) + global_step = 0 + for epoch in range(sys.maxsize): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) and "train" in f + ] + files.sort() + num_files = len(files) + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + shared_file_list = {} + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + data_file = files[ + ( + f_start_id * paddle.distributed.get_world_size() + + paddle.distributed.get_rank() + + remainder * f_start_id + ) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + train_data_loader, _ = create_pretraining_dataset( + data_file, args.max_predictions_per_seq, shared_file_list, args, worker_init + ) + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + dataset_future = pool.submit( + create_pretraining_dataset, + data_file, + args.max_predictions_per_seq, + shared_file_list, + args, + worker_init, + ) + train_cost_avg = TimeCostAverage() + reader_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + train_reader_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + global_step += 1 + ( + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ) = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu", "fused_attention", "fused_feedforward"], + level=args.amp_level, + ): + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + loss = criterion( + prediction_scores, + seq_relationship_score, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + total_samples += args.batch_size + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + if 
global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step: %d, epoch: %d, batch: %d, loss: %f, " + "avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + epoch, + step, + loss, + reader_cost_avg.get_average(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + total_samples / (args.logging_steps * train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + reader_cost_avg.reset() + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + + del train_data_loader + train_data_loader, data_file = dataset_future.result(timeout=None) + + +if __name__ == "__main__": + args = parse_args() + print(args) + do_train(args) diff --git a/model_zoo/bert/run_pretrain_trainer.py b/model_zoo/bert/run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..f5624ea3dcf795d962f6e834352b1057bbd19250 --- /dev/null +++ b/model_zoo/bert/run_pretrain_trainer.py @@ -0,0 +1,259 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
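+
+# Trainer-based variant of run_pretrain.py: it consumes the same HDF5 pretraining data and the
+# same masked LM / next sentence prediction objectives, but delegates the training loop, logging
+# and checkpointing to the PaddleNLP Trainer API.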
+ +import os +from dataclasses import dataclass, field + +import h5py +import numpy as np +import paddle +from paddle.io import Dataset + +from paddlenlp.data import Stack +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForPretraining, + BertTokenizer, + ErnieForPretraining, + ErnieTokenizer, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "bert": (BertForPretraining, BertTokenizer), + "ernie": (ErnieForPretraining, ErnieTokenizer), +} + + +@dataclass +class DataArguments: + input_dir: str = field(default=None, metadata={"help": "The input directory where the data will be read from."}) + + +@dataclass +class ModelArguments: + model_type: str = field( + default="bert", metadata={"help": "Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())} + ) + model_name_or_path: str = field( + default=None, + metadata={ + "help": "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ) + }, + ) + max_predictions_per_seq: int = field( + default=80, metadata={"help": "The maximum total of masked tokens in input sequence"} + ) + + to_static: strtobool = field(default=False, metadata={"help": "Enable training under @to_static."}) + profiler_options: str = field( + default=None, + metadata={"help": "Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not."}, + ) + fuse_transformer: strtobool = field( + default=False, + metadata={"help": "Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not."}, + ) + + +def get_train_data_file(data_args): + files = [ + os.path.join(data_args.input_dir, f) + for f in os.listdir(data_args.input_dir) + if os.path.isfile(os.path.join(data_args.input_dir, f)) and "train" in f + ] + files.sort() + num_files = len(files) + # random.Random(training_args.seed + epoch).shuffle(files) + f_start_id = 0 + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_start_id) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[(f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files] + + return data_file + + +def data_collator(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # input_ids, segment_ids, input_mask, masked_lm_positions, + # masked_lm_labels, next_sentence_labels, mask_token_num + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + _, seq_length = out[0].shape + size = _ = sum(len(x[3]) for x in data) + # Padding for divisibility by 8 for fp16 or int8 usage + if size % 8 != 0: + size += 8 - (size % 8) + # masked_lm_positions + # Organize as a 1D 
tensor for gather or use gather_nd + + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -100, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return { + "input_ids": out[0], + "token_type_ids": out[1], + "attention_mask": out[2], + "masked_positions": out[3], + "labels": out[4], + "next_sentence_label": out[5], + } + + +def create_input_specs(): + input_ids = paddle.static.InputSpec(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.InputSpec(name="segment_ids", shape=[-1, -1], dtype="int64") + position_ids = None + input_mask = paddle.static.InputSpec(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.InputSpec(name="masked_lm_positions", shape=[-1], dtype="int32") + return [input_ids, segment_ids, position_ids, input_mask, masked_lm_positions] + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, max_pred_length): + self.input_file = input_file + self.max_pred_length = max_pred_length + f = h5py.File(input_file, "r") + keys = [ + "input_ids", + "input_mask", + "segment_ids", + "masked_lm_positions", + "masked_lm_ids", + "next_sentence_labels", + ] + self.inputs = [np.asarray(f[key][:]) for key in keys] + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs[0]) + + def __getitem__(self, index): + + [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [ + input[index].astype(np.int64) if indice < 5 else np.asarray(input[index].astype(np.int64)) + for indice, input in enumerate(self.inputs) + ] + # TODO: whether to use reversed mask by changing 1s and 0s to be + # consistent with nv bert + input_mask = (1 - np.reshape(input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e9 + + index = self.max_pred_length + # store number of masked tokens in index + # outputs of torch.nonzero diff with that of numpy.nonzero by zip + padded_mask_indices = (masked_lm_positions == 0).nonzero()[0] + if len(padded_mask_indices) != 0: + index = padded_mask_indices[0].item() + # mask_token_num = index + else: + index = self.max_pred_length + # mask_token_num = self.max_pred_length + # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64) + # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index] + masked_lm_labels = masked_lm_ids[:index] + masked_lm_positions = masked_lm_positions[:index] + # softmax_with_cross_entropy enforce last dim size equal 1 + masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1) + next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1) + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] + + +def do_train(): + data_args, training_args, model_args = PdArgumentParser( + [DataArguments, TrainingArguments, ModelArguments] + ).parse_args_into_dataclasses() + training_args: TrainingArguments = training_args + model_args: ModelArguments = model_args + data_args: DataArguments = data_args + + training_args.print_config(data_args, "Data") + training_args.print_config(model_args, "Model") + training_args.print_config(model_args, "Training") + + model_args.model_type = model_args.model_type.lower() + model_class, tokenizer_class = 
MODEL_CLASSES[model_args.model_type] + + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + + config = model_class.config_class.from_pretrained(model_args.model_name_or_path) + config.fuse = model_args.fuse_transformer + model = model_class(config) + + data_file = get_train_data_file(data_args) + train_dataset = PretrainingDataset(input_file=data_file, max_pred_length=model_args.max_predictions_per_seq) + + # decorate @to_static for benchmark, skip it by default. + if model_args.to_static: + specs = create_input_specs() + model = paddle.jit.to_static(model, input_spec=specs) + logger.info("Successfully to apply @to_static with specs: {}".format(specs)) + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=None, + tokenizer=tokenizer, + ) + # training + if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/bert/static/README.md b/model_zoo/bert/static/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fa3d5f62b28db0703a35415f8d4a7c9f0633b324 --- /dev/null +++ b/model_zoo/bert/static/README.md @@ -0,0 +1,153 @@ +# BERT Benchmark with Fleet API +## 模型简介 + +[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers)以[Transformer](https://arxiv.org/abs/1706.03762) 编码器为网络基本组件,使用掩码语言模型(Masked Language Model)和邻接句子预测(Next Sentence Prediction)两个任务在大规模无标注文本语料上进行预训练(pre-train),得到融合了双向内容的通用语义表示模型。以预训练产生的通用语义表示模型为基础,结合任务适配的简单输出层,微调(fine-tune)后即可应用到下游的NLP任务,效果通常也较直接在下游的任务上训练的模型更优。此前BERT即在[GLUE评测任务](https://gluebenchmark.com/tasks)上取得了SOTA的结果。 + +本项目是BERT在 Paddle 2.0上的开源实现,包含了预训练和[GLUE评测任务](https://gluebenchmark.com/tasks)上的微调代码。 + +## 快速开始 + +### 数据准备 + +#### Pre-training数据准备 + +`create_pretraining_data.py` 是创建预训练程序所需数据的脚本。其以文本文件(使用换行符换行和空白符分隔,data目录下提供了部分示例数据)为输入,经由BERT tokenizer进行tokenize后再做生成sentence pair正负样本、掩码token等处理,最后输出hdf5格式的数据文件。使用方式如下: + +```shell +python create_pretraining_data.py \ + --input_file=data/sample_text.txt \ + --output_file=data/training_data.hdf5 \ + --bert_model=bert-base-uncased \ + --max_seq_length=128 \ + --max_predictions_per_seq=20 \ + --masked_lm_prob=0.15 \ + --random_seed=12345 \ + --dupe_factor=5 +``` + +其中参数释义如下: +- `input_file` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有`.txt`文件。 +- `output_file` 指定输出文件。 +- `bert_model` 指定使用特定BERT模型对应的tokenizer进行tokenize处理。 +- `max_seq_length` 指定最大句子长度,超过该长度将被截断,不足该长度的将会进行padding。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目。 +- `masked_lm_prob` 表示每个token被mask的概率。 +- `random_seed` 指定随机种子。 +- `dupe_factor` 指定输入数据被重复处理的次数,每次处理将重新产生随机mask。 + +使用以上预训练数据生成程序可以用于处理领域垂类数据后进行二次预训练。若需要使用BERT论文中预训练使用的英文Wiki和BookCorpus数据,可以参考[这里](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT)进行处理,得到的数据可以直接接入本项目中的预训练程序使用。 + +#### Fine-tuning数据准备 +Fine-tuning的数据集已经被PaddleNLP框架集成,只需要填写相应的数据集的名称,PaddleNLP会自动下载数据集,具体的使用方法可以参考 `run_glue.py` 脚本。 + +##### GLUE评测任务数据 + +GLUE评测任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_glue.py`执行微调时将会自动下载。 + +### 执行Pre-training + +#### GPU训练 +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_pretrain.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --batch_size 32 \ + 
    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device gpu \
+    --use_amp False
+```
+其中参数释义如下:
+- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。
+- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。
+- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。
+- `batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
+- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。
+- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。
+- `warmup_steps` 表示动态学习率热启的step数。
+- `num_train_epochs` 表示训练轮数。
+- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。
+- `output_dir` 表示模型的保存目录。
+- `logging_steps` 表示日志打印间隔。
+- `save_steps` 表示模型保存及评估间隔。
+- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。
+- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。
+- `use_amp` 指示是否启用自动混合精度训练。
+**NOTICE**: 预训练时data目录存放的是经过 `create_pretraining_data.py` 处理后的数据,因此需要通过该数据处理脚本预先处理,否则预训练将会出现报错。
+
+### 执行Fine-tuning
+
+以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下:
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" run_glue.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --task_name SST-2 \
+    --max_seq_length 128 \
+    --batch_size 32 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3 \
+    --logging_steps 1 \
+    --save_steps 500 \
+    --output_dir ./tmp/ \
+    --device gpu
+```
+
+其中参数释义如下:
+- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。
+- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。注:`bert-base-uncased`等对应使用的预训练模型转自[huggingface/transformers](https://github.com/huggingface/transformers),具体可参考当前目录下converter中的内容。
+- `task_name` 表示Fine-tuning的任务。
+- `max_seq_length` 表示最大句子长度,超过该长度将被截断。
+- `batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
+- `num_train_epochs` 表示训练轮数。
+- `logging_steps` 表示日志打印间隔。
+- `save_steps` 表示模型保存及评估间隔。
+- `output_dir` 表示模型保存路径。
+- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。
+
+基于`bert-base-uncased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果:
+
+| Task  | Metric                       | Result      |
+|-------|------------------------------|-------------|
+| CoLA  | Matthews corr                | 59.90       |
+| SST-2 | Accuracy                     | 92.76       |
+| STS-B | Pearson/Spearman corr        | 89.12       |
+| MNLI  | matched acc./mismatched acc. | 84.45/84.62 |
+| QNLI  | acc.                         | 91.73       |
+| RTE   | acc.                         | 67.15       |
+
+### 预测
+
+Fine-tuning过程中,`run_glue.py` 会按 `save_steps` 间隔将可直接用于预测的推理模型保存到 `output_dir` 下。
+Fine-tuning完成后,可以按照如下方式,使用保存的推理模型对GLUE评测任务进行预测(基于Paddle的[Python预测API](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/python_infer_cn.html)):
+
+```shell
+python -u ./predict_glue.py \
+    --task_name SST-2 \
+    --model_type bert \
+    --model_path ./tmp/model_20/infer_model \
+    --batch_size 32 \
+    --max_seq_length 128
+```
+
+其中参数释义如下:
+- `task_name` 表示Fine-tuning的任务。
+- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。
+- `model_path` 表示预测模型文件的前缀,即 `run_glue.py` 保存推理模型时使用的路径前缀。
+- `batch_size` 表示每个预测批次的样本数目。
+- `max_seq_length` 表示最大句子长度,超过该长度将被截断。
+
+**NOTICE**: 预测脚本中的 './tmp/model_20/infer_model' 是 run_glue.py 中保存下来的模型,具体的模型路径请根据实际保存的路径设定。
diff --git a/model_zoo/bert/static/create_pretraining_data.py b/model_zoo/bert/static/create_pretraining_data.py
new file mode 100644
index 0000000000000000000000000000000000000000..09394efc722c9fb8badebf1cbc64f11711192729
--- /dev/null
+++ b/model_zoo/bert/static/create_pretraining_data.py
@@ -0,0 +1,474 @@
+# coding=utf-8
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
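The six HDF5 datasets written by `write_instance_to_example_file` below are exactly the keys that `PretrainingDataset` (in `run_pretrain.py` and `dataset.py`) reads back for training. As a quick sanity check after running the data-creation command from the README, something along these lines should work (a minimal sketch; the file name follows the README example, and the shapes assume the `max_seq_length=128` / `max_predictions_per_seq=20` settings shown there):

```python
import h5py

# Inspect the HDF5 file produced by create_pretraining_data.py.
with h5py.File("data/training_data.hdf5", "r") as f:
    for key in [
        "input_ids",             # [num_instances, max_seq_length]
        "input_mask",            # [num_instances, max_seq_length]
        "segment_ids",           # [num_instances, max_seq_length]
        "masked_lm_positions",   # [num_instances, max_predictions_per_seq]
        "masked_lm_ids",         # [num_instances, max_predictions_per_seq]
        "next_sentence_labels",  # [num_instances]
    ]:
        print(key, f[key].shape, f[key].dtype)
```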
+"""Create masked LM/next sentence masked_lm examples for BERT.""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import collections +import os +import random +from io import open + +import h5py +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import BertTokenizer +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + + +class TrainingInstance(object): + """A single training instance (sentence pair).""" + + def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next): + self.tokens = tokens + self.segment_ids = segment_ids + self.is_random_next = is_random_next + self.masked_lm_positions = masked_lm_positions + self.masked_lm_labels = masked_lm_labels + + +def write_instance_to_example_file(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_file): + """Create example files from `TrainingInstance`s.""" + + total_written = 0 + features = collections.OrderedDict() + + num_instances = len(instances) + features["input_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["input_mask"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["segment_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["masked_lm_positions"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["masked_lm_ids"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32") + + for inst_index, instance in enumerate(tqdm(instances)): + input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) + input_mask = [1] * len(input_ids) + segment_ids = list(instance.segment_ids) + assert len(input_ids) <= max_seq_length + + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + masked_lm_positions = list(instance.masked_lm_positions) + masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) + masked_lm_weights = [1.0] * len(masked_lm_ids) + + while len(masked_lm_positions) < max_predictions_per_seq: + masked_lm_positions.append(0) + masked_lm_ids.append(0) + masked_lm_weights.append(0.0) + + next_sentence_label = 1 if instance.is_random_next else 0 + + features["input_ids"][inst_index] = input_ids + features["input_mask"][inst_index] = input_mask + features["segment_ids"][inst_index] = segment_ids + features["masked_lm_positions"][inst_index] = masked_lm_positions + features["masked_lm_ids"][inst_index] = masked_lm_ids + features["next_sentence_labels"][inst_index] = next_sentence_label + + total_written += 1 + + print("saving data") + f = h5py.File(output_file, "w") + f.create_dataset("input_ids", data=features["input_ids"], dtype="i4", compression="gzip") + f.create_dataset("input_mask", data=features["input_mask"], dtype="i1", compression="gzip") + f.create_dataset("segment_ids", data=features["segment_ids"], dtype="i1", compression="gzip") + f.create_dataset("masked_lm_positions", data=features["masked_lm_positions"], dtype="i4", compression="gzip") + f.create_dataset("masked_lm_ids", data=features["masked_lm_ids"], dtype="i4", compression="gzip") + f.create_dataset("next_sentence_labels", data=features["next_sentence_labels"], dtype="i1", compression="gzip") + f.flush() + f.close() + + +def 
create_training_instances( + input_files, tokenizer, max_seq_length, dupe_factor, short_seq_prob, masked_lm_prob, max_predictions_per_seq, rng +): + """Create `TrainingInstance`s from raw text.""" + all_documents = [[]] + + # Input file format: + # (1) One sentence per line. These should ideally be actual sentences, not + # entire paragraphs or arbitrary spans of text. (Because we use the + # sentence boundaries for the "next sentence prediction" task). + # (2) Blank lines between documents. Document boundaries are needed so + # that the "next sentence prediction" task doesn't span between documents. + for input_file in input_files: + print("creating instance from {}".format(input_file)) + with open(input_file, "r", encoding="UTF-8") as reader: + while True: + line = convert_to_unicode(reader.readline()) + if not line: + break + line = line.strip() + + # Empty lines are used as document delimiters + if not line: + all_documents.append([]) + tokens = tokenizer.tokenize(line) + if tokens: + all_documents[-1].append(tokens) + + # Remove empty documents + all_documents = [x for x in all_documents if x] + rng.shuffle(all_documents) + + # vocab_words = list(tokenizer.vocab.keys()) + vocab_words = list(tokenizer.vocab.token_to_idx.keys()) + instances = [] + for _ in range(dupe_factor): + for document_index in range(len(all_documents)): + instances.extend( + create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, + ) + ) + + rng.shuffle(instances) + return instances + + +def create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, +): + """Creates `TrainingInstance`s for a single document.""" + document = all_documents[document_index] + + # Account for [CLS], [SEP], [SEP] + max_num_tokens = max_seq_length - 3 + + # We *usually* want to fill up the entire sequence since we are padding + # to `max_seq_length` anyways, so short sequences are generally wasted + # computation. However, we *sometimes* + # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter + # sequences to minimize the mismatch between pre-training and fine-tuning. + # The `target_seq_length` is just a rough target however, whereas + # `max_seq_length` is a hard limit. + target_seq_length = max_num_tokens + if rng.random() < short_seq_prob: + target_seq_length = rng.randint(2, max_num_tokens) + + # We DON'T just concatenate all of the tokens from a document into a long + # sequence and choose an arbitrary split point because this would make the + # next sentence prediction task too easy. Instead, we split the input into + # segments "A" and "B" based on the actual "sentences" provided by the user + # input. + instances = [] + current_chunk = [] + current_length = 0 + i = 0 + while i < len(document): + segment = document[i] + current_chunk.append(segment) + current_length += len(segment) + if i == len(document) - 1 or current_length >= target_seq_length: + if current_chunk: + # `a_end` is how many segments from `current_chunk` go into the `A` + # (first) sentence. 
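+                # At this point `current_chunk` holds whole sentences whose combined
+                # length is roughly `target_seq_length`. The chunk is split into segment A
+                # (the first `a_end` sentences) and segment B, which is either the rest of
+                # the chunk (is_random_next=False) or sentences drawn from a different,
+                # randomly chosen document (is_random_next=True, ~50% of the time below).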
+ a_end = 1 + if len(current_chunk) >= 2: + a_end = rng.randint(1, len(current_chunk) - 1) + + tokens_a = [] + for j in range(a_end): + tokens_a.extend(current_chunk[j]) + + tokens_b = [] + # Random next + is_random_next = False + if len(current_chunk) == 1 or rng.random() < 0.5: + is_random_next = True + target_b_length = target_seq_length - len(tokens_a) + + # This should rarely go for more than one iteration for large + # corpora. However, just to be careful, we try to make sure that + # the random document is not the same as the document + # we're processing. + for _ in range(10): + random_document_index = rng.randint(0, len(all_documents) - 1) + if random_document_index != document_index: + break + + # If picked random document is the same as the current document + if random_document_index == document_index: + is_random_next = False + + random_document = all_documents[random_document_index] + random_start = rng.randint(0, len(random_document) - 1) + for j in range(random_start, len(random_document)): + tokens_b.extend(random_document[j]) + if len(tokens_b) >= target_b_length: + break + # We didn't actually use these segments so we "put them back" so + # they don't go to waste. + num_unused_segments = len(current_chunk) - a_end + i -= num_unused_segments + # Actual next + else: + is_random_next = False + for j in range(a_end, len(current_chunk)): + tokens_b.extend(current_chunk[j]) + truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) + + assert len(tokens_a) >= 1 + assert len(tokens_b) >= 1 + + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + + tokens.append("[SEP]") + segment_ids.append(0) + + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + (tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions( + tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng + ) + instance = TrainingInstance( + tokens=tokens, + segment_ids=segment_ids, + is_random_next=is_random_next, + masked_lm_positions=masked_lm_positions, + masked_lm_labels=masked_lm_labels, + ) + instances.append(instance) + current_chunk = [] + current_length = 0 + i += 1 + + return instances + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) + + +def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): + """Creates the predictions for the masked LM objective.""" + + cand_indexes = [] + for (i, token) in enumerate(tokens): + if token == "[CLS]" or token == "[SEP]": + continue + cand_indexes.append(i) + + rng.shuffle(cand_indexes) + + output_tokens = list(tokens) + + num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) + + masked_lms = [] + covered_indexes = set() + for index in cand_indexes: + if len(masked_lms) >= num_to_predict: + break + if index in covered_indexes: + continue + covered_indexes.add(index) + + masked_token = None + # 80% of the time, replace with [MASK] + if rng.random() < 0.8: + masked_token = "[MASK]" + else: + # 10% of the time, keep original + if rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] + + output_tokens[index] = masked_token + + masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) + + masked_lms = sorted(masked_lms, key=lambda 
x: x.index) + + masked_lm_positions = [] + masked_lm_labels = [] + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + + return (output_tokens, masked_lm_positions, masked_lm_labels) + + +def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): + """Truncates a pair of sequences to a maximum sequence length.""" + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_num_tokens: + break + + trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b + assert len(trunc_tokens) >= 1 + + # We want to sometimes truncate from the front and sometimes from the + # back to add more randomness and avoid biases. + if rng.random() < 0.5: + del trunc_tokens[0] + else: + trunc_tokens.pop() + + +def main(): + + parser = argparse.ArgumentParser() + + parser.add_argument( + "--input_file", + default=None, + type=str, + required=True, + help="The input train corpus. can be directory with .txt files or a path to a single file", + ) + parser.add_argument( + "--output_file", + default=None, + type=str, + required=True, + help="The output file where created hdf5 formatted data will be written.", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + required=False, + help="The vocabulary the BERT model will train on. " + "Use bert_model argument would ignore this. " + "The bert_model argument is recommended.", + ) + parser.add_argument( + "--do_lower_case", + action="store_true", + default=True, + help="Whether to lower case the input text. True for uncased models, False for cased models. " + "Use bert_model argument would ignore this. The bert_model argument is recommended.", + ) + parser.add_argument( + "--bert_model", + default="bert-base-uncased", + type=str, + required=False, + help="Bert pre-trained model selected in the list: bert-base-uncased, " + "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese." + "If provided, use the pre-trained model used tokenizer to create data " + "and ignore vocab_file and do_lower_case.", + ) + + # Other parameters + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after WordPiece tokenization. \n" + "Sequences longer than this will be truncated, and sequences shorter \n" + "than this will be padded.", + ) + parser.add_argument( + "--dupe_factor", + default=10, + type=int, + help="Number of times to duplicate the input data (with different masks).", + ) + parser.add_argument( + "--max_predictions_per_seq", default=20, type=int, help="Maximum number of masked LM predictions per sequence." + ) + + # floats + parser.add_argument("--masked_lm_prob", default=0.15, type=float, help="Masked LM probability.") + parser.add_argument( + "--short_seq_prob", + default=0.1, + type=float, + help="Probability to create a sequence shorter than maximum sequence length", + ) + + parser.add_argument("--random_seed", type=int, default=12345, help="random seed for initialization") + + args = parser.parse_args() + print(args) + + if args.bert_model: + tokenizer = BertTokenizer.from_pretrained(args.bert_model) + else: + assert args.vocab_file, "vocab_file must be set If bert_model is not provided." 
+ tokenizer = BertTokenizer(args.vocab_file, do_lower_case=args.do_lower_case) + + input_files = [] + if os.path.isfile(args.input_file): + input_files.append(args.input_file) + elif os.path.isdir(args.input_file): + input_files = [ + os.path.join(args.input_file, f) + for f in os.listdir(args.input_file) + if (os.path.isfile(os.path.join(args.input_file, f)) and f.endswith(".txt")) + ] + else: + raise ValueError("{} is not a valid path".format(args.input_file)) + + rng = random.Random(args.random_seed) + instances = create_training_instances( + input_files, + tokenizer, + args.max_seq_length, + args.dupe_factor, + args.short_seq_prob, + args.masked_lm_prob, + args.max_predictions_per_seq, + rng, + ) + + output_file = args.output_file + + write_instance_to_example_file( + instances, tokenizer, args.max_seq_length, args.max_predictions_per_seq, output_file + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/static/data/sample_text.txt b/model_zoo/bert/static/data/sample_text.txt new file mode 100644 index 0000000000000000000000000000000000000000..75ec60cdb7842e023d32f95f0e16e1973eff4b71 --- /dev/null +++ b/model_zoo/bert/static/data/sample_text.txt @@ -0,0 +1,100 @@ +Zulfiqar A. Bhutta trained as a physician in Pakistan in the early stages of his career. +He holds titles across various organizations in diverse geographies. +Professor Bhutta is the Founding Director of the Center of Excellence in Women and Child Health & Institute for Global Child Health & Development, at the Aga Khan University South-Central Asia, East Africa & United Kingdom. +He is currently the Co-Director at the Centre for Global Child Health, at the Hospital for Sick Children and leads many projects as a Senior Scientist at the Research Institute in the Centre for Global Child Health at Sick Kids. +He holds a Professorship at the University of Toronto in the Department of Nutritional Sciences and the Division of Epidemiology, Dalla Lana School of Public Health. +Additionally, he holds concurrent professorship at the Department of Paediatrics, Aga Khan University in Karachi, Pakistan and at the Schools of Public Health of Johns Hopkins University, Tufts University, Boston University, University of Alberta and the London School of Hygiene & Tropical Medicine. +He is a designated Distinguished National Professor of the Government of Pakistan and was the Founding Chair of the National Research Ethics Committee of the Government of Pakistan from 2003-2014. +Dr. Bhutta received his MBBS from Khyber Medical College in Peshawar, Pakistan in 1977 at which time he was names "Best Graduate of the Year" and awarded the University Gold Medal for overall distinction. +His PhD work was completed at Karolinska Institute in Stockholm, Sweden in 1996. +He is a Fellow of the Royal College of Physicians (Edinburgh & London), the Royal College of Paediatrics and Child Health (London), American Academy of Paediatrics and the Pakistan Academy of Sciences. +Following the completion of his PhD Dr. Bhutta began working as House Surgeon in Obstetrics & Gynecology at the Khyber Teaching Hospital, Peshawar (April-November 1978). +He began work in paediatrics as a physician in November of 1978 in the Professorial Unit at the Institute of Child Health, Jinnah Postgraduate Medical Centre, Karachi (Pakistan). +Through 1980's he continued his work as a surgeon and paediatrician. 
+He undertook his first professor position in the Department of Paediatrics, The Aga Khan University Hospital, Karachi (Pakistan), from November 1987 to June 1992. +In 2005, Dr. Bhutta became the Chairman of the Department of Paediatrics & Child Health at the Aga Khan University & Medical Center, a position held until 2008. +Following his term as Chairman he became The Noordin Noormahomed Sheriff Professor & Founding Chair, Division of Women & Child Health, The Aga Khan University, a position he held for four years. +Dr. Bhutta currently holds the titles of co-director of the Centre for Global Child Health at the Hospital for Sick Children in Toronto, and founding director of the Centre of Excellence in Women and Child Health at the Aga Khan University. +In 2020, he was appointed founding director of the Institute for Global child Health & Development at the Aga Khan University and elected Fellow to the Royal Society, United Kingdom. +Outside of his professional responsibilities Dr. Bhutta serves on various local and international boards and committees, including a series of editorial boards. +In his various capacities Dr. Bhutta has produced a large collection of publications working with his teams at Sick Kids, AKU and international partners. +These include book reviews, chapters, 1. +"Haematological disorders" "Neonatal Jaundice" in Neonatal Vade‑Mecum, Fleming PJ, Speidel BD, Dunn PM Eds, Lloyd‑Luke Publishers, UK, 1986. +Revised 2nd Edition 1991. +2. +"Nutritional management of acute and persistent diarrhoea". +A M Molla, Bhutta Z A and  A Molla. +In McNeish A S, Mittal S K and Walker-Smith J A (eds). +Recent trends in diarrhoea and malnutrition, MAMC, Delhi, 1991, pp 37-51. +3. +"Paediatric Prescribing” in "Text book of Paediatrics for developing countries"            Arif MA, Hanif SM, Wasti SMK Eds, 1989, 2nd Edition 1996,  PPA, Karachi. +& Lahore 4. +"Innovations in neonatal care : Impact on neonatal survival in the developing world:. +Bhutta Z A  Zaidi S (Editor) 1992. +TWEL Publisher. +Karachi pp 121-131 5. +"Short course therapy in Pediatrics" Bhutta Z A& Teele D.  In Tice A D, Waldvogel F (Eds), Contemporary issues in Infectious Disease Epidemiology and Management, 1993 Gardiner Caldwell, Cheshire, pp 52 - 60. +6. +"Dietary management of persistent diarrhoea". +Bhutta Z A, Molla A M, Issani Z. +In Reflections on  Diarrhoeal Disease & Nutrition  of Children". +1993 Karachi, pp 97 - 103. +7. +"Prescribing practices amongst general practitioners (GPs) and consultant paediatricians in childhood diarrhoea.”  S.Q. +Nizami, I.A. +Khan, Bhutta Z A. +In "Reflections on Diarrhoeal Disease and Nutrition of Children". +1993 Karachi, pp  88-90. +8. +"The challenge of multidrug-resistant typhoid". +Bhutta Z A. +In Puri R K, Sachdev H P S, Choudhry P, Verma I C (Eds), Current concepts in Paediatrics, 1994. +Jaypee Publishers, New Delhi, pp 403.8. +9. +"Perinatal Care in Pakistan: Current status and trends". +In Proceedings of the Workshop in Reproductive Health. +College of Physicians and Surgeons, Pakistan, Karachi, 1995, pp 95-103. +10. +“A study of whole body protein kinetics in malnourished children with persistent diarrhoea” Bhutta Z A, Nizami SQ, Isani Z, Hardy S, Hendricks K, Young V.   Report of the second RCM coordinated Research Programme for application of stable isotope tracer methods to studies of energy metabolism in malnourished populations of developing countries. +NAHRES-30 1996 IAEA Vienna. +11. +"Pneumococcal infections in Pakistan: a country report". 
+In Adult Immunization in Asia, Fondation Mercel Merieux, Lyon, 1998. pp 79-82. +12. +“Factors affecting protein and aminoacid metabolism in childhood from developing countries". +In Child Nutrition: an international perspective. +Editors Solomons NW, Caballero B, Brown KH. +CRC Press 1998. +13. +"Protein Digestion and Bioavailability". +In Encyclopedia of Human Nutrition. +Editors: Sadler M, Strain JJ, Caballero B. +Academic Press (London), 1998 pp.1646-54. +14. +"Perinatal Care in Pakistan. +Reproductive Health: A manual for family practice and primary health care. +Bhutta Z A, Maqbool S.  College of Physicians and Surgeons, Pakistan, Karachi, 1999, pp 69-78. +15. +“Effective interventions to reduce neonatal mortality and morbidity from perinatal infection. +Bhutta ZA. +In Costello A, Manandhar D (eds). +"Improving Newborn Infant Health in Developing Countries’ 1999. +Imperial College Press, London pp.289-308. +16. +“Ambulatory management of typhoid fever”            “Risk factors and management of micronutrient deficiencies”            “Management of persistent diarrhoea in developing countries”. +In Manual of International Child Health, British Medical Journal, 2000 (in press). +17. +“The role of Cefixime in typhoid fever during childhood” in Cefixime, Adam D, Quintiliani R (Eds), Torre-Lazur-McCann, Tokyo, 2000; pp.107-112. +18. +"Micronutrients and Child Health in the Commonwealth”, Commonwealth Foundation" (UK) (2001). +19. +"Isotopic evaluation of breast milk intake, energy metabolism growth and body composition of exclusively breastfed infants in Pakistan". +Bhutta ZA, Nizami SQ, Weaver LT, Preston T. In Application of Stable Isotopes to evaluate Growth and Body Composition of Exclusively Breastfed infants, IAEA and WHO, NAHRES Report. +2000. +20. +“Typhoid Fever in Childhood: the south Asian experience”. +Ahmad K &Bhutta ZA. +In "Recent Advances in Paediatrics", Gupte S (Ed), 2000, India . +21. +“Neonatal Infections in developing countries” in  Carrera JM, Cabero L, Baraibar R (Eds). +The Perinatal Medicine of the new Millennium. \ No newline at end of file diff --git a/model_zoo/bert/static/dataset.py b/model_zoo/bert/static/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..e25357ad6a62a1c276fba03f51961a7cdbf7e1b7 --- /dev/null +++ b/model_zoo/bert/static/dataset.py @@ -0,0 +1,139 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
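The `_collate_data` helper below (like the Trainer collator in `run_pretrain.py` above) flattens the per-example masked positions into a single 1-D gather index over the concatenated `[batch * seq_length]` token axis, padding the index length to a multiple of 8 so fp16/int8 kernels get friendly shapes. A minimal NumPy sketch of that indexing scheme, using made-up toy values:

```python
import numpy as np

# Toy batch: 2 examples, seq_length 8, with 2 and 1 masked positions respectively.
seq_length = 8
masked_positions = [[1, 5], [3]]      # per-example masked token positions (as in x[3])
masked_ids = [[1037, 2003], [2128]]   # per-example masked token label ids (as in x[4])

size = sum(len(p) for p in masked_positions)
if size % 8 != 0:                     # pad so the gather index length is a multiple of 8
    size += 8 - (size % 8)

positions = np.zeros(size, dtype=np.int32)
labels = np.full([size, 1], -1, dtype=np.int64)   # -1 marks padded (ignored) label slots
k = 0
for i, (pos_list, id_list) in enumerate(zip(masked_positions, masked_ids)):
    for pos, tok in zip(pos_list, id_list):
        positions[k] = i * seq_length + pos       # offset into the flattened token axis
        labels[k] = tok
        k += 1

print(positions)  # [ 1  5 11  0  0  0  0  0]; padded entries point at token 0 of example 0
```

A model can then reshape its sequence output to `[batch * seq_length, hidden]` and gather the masked-token representations with this index.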
+ +import h5py +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Stack + + +def create_pretraining_dataset(input_file, max_pred_length, args, data_holders, worker_init=None, places=None): + train_data = PretrainingDataset(input_file=input_file, max_pred_length=max_pred_length) + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # input_ids, segment_ids, input_mask, masked_lm_positions, + # masked_lm_labels, next_sentence_labels, mask_token_num + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # Padding for divisibility by 8 for fp16 or int8 usage + if size % 8 != 0: + size += 8 - (size % 8) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + # mask_token_num + out.append(np.asarray([mask_token_num], dtype=np.float32)) + if args.use_amp and args.use_pure_fp16: + # cast input_mask to fp16 + out[2] = out[2].astype(np.float16) + # cast masked_lm_scale to fp16 + out[-1] = out[-1].astype(np.float16) + return out + + train_data_loader = DataLoader( + dataset=train_data, + places=places, + feed_list=data_holders, + batch_sampler=train_batch_sampler, + collate_fn=_collate_data, + num_workers=0, + worker_init_fn=worker_init, + return_list=False, + ) + return train_data_loader, input_file + + +def create_data_holder(args): + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.data(name="segment_ids", shape=[-1, -1], dtype="int64") + input_mask = paddle.static.data(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.data(name="masked_lm_positions", shape=[-1], dtype="int32") + masked_lm_labels = paddle.static.data(name="masked_lm_labels", shape=[-1, 1], dtype="int64") + next_sentence_labels = paddle.static.data(name="next_sentence_labels", shape=[-1, 1], dtype="int64") + masked_lm_scale = paddle.static.data(name="masked_lm_scale", shape=[-1, 1], dtype="float32") + return [ + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ] + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, max_pred_length): + self.input_file = input_file + self.max_pred_length = max_pred_length + f = h5py.File(input_file, "r") + keys = [ + "input_ids", + "input_mask", + "segment_ids", + "masked_lm_positions", + "masked_lm_ids", + "next_sentence_labels", + ] + self.inputs = [np.asarray(f[key][:]) for key in keys] + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs[0]) + + def __getitem__(self, index): + + [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [ + input[index].astype(np.int64) if indice < 5 else np.asarray(input[index].astype(np.int64)) + for indice, input in enumerate(self.inputs) + ] + # TODO: whether to use reversed mask by changing 1s and 0s to be + # consistent with nv bert + input_mask = (1 - 
np.reshape(input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e4 + + index = self.max_pred_length + # store number of masked tokens in index + # outputs of torch.nonzero diff with that of numpy.nonzero by zip + padded_mask_indices = (masked_lm_positions == 0).nonzero()[0] + if len(padded_mask_indices) != 0: + index = padded_mask_indices[0].item() + # mask_token_num = index + else: + index = self.max_pred_length + # mask_token_num = self.max_pred_length + # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64) + # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index] + masked_lm_labels = masked_lm_ids[:index] + masked_lm_positions = masked_lm_positions[:index] + # softmax_with_cross_entropy enforce last dim size equal 1 + masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1) + next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1) + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] diff --git a/model_zoo/bert/static/predict_glue.py b/model_zoo/bert/static/predict_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..8095066423d81947316efaa8e9d32bfcfecc44bd --- /dev/null +++ b/model_zoo/bert/static/predict_glue.py @@ -0,0 +1,146 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from run_glue import METRIC_CLASSES, MODEL_CLASSES, convert_example + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to perform predict, selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="The path prefix of inference model to be used.", + ) + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu", "xpu"], + help="Device selected for inference.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size for predict.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + args = parser.parse_args() + return args + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataset, collate_fn, batch_size=1): + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, num_workers=0, return_list=True + ) + outputs = [] + for data in data_loader: + output = self.predict_batch(data) + outputs.append(output) + return outputs + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + test_ds = load_dataset("glue", args.task_name, splits="test") + tokenizer = tokenizer_class.from_pretrained(os.path.dirname(args.model_path)) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=test_ds.label_list, + max_seq_length=args.max_seq_length, + is_test=True, + ) + test_ds = test_ds.map(trans_func) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + predictor.predict(test_ds, batch_size=args.batch_size, collate_fn=batchify_fn) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/static/run_glue.py b/model_zoo/bert/static/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..d6cb54dc960f00297342b8a30a293778fe0a4265 --- /dev/null +++ b/model_zoo/bert/static/run_glue.py @@ -0,0 +1,395 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "sts-b": PearsonAndSpearman, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + args = parser.parse_args() + return args + + +def create_data_holder(task_name): + """ + Define the input data holder for the glue task. + """ + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + token_type_ids = paddle.static.data(name="token_type_ids", shape=[-1, -1], dtype="int64") + if task_name == "sts-b": + label = paddle.static.data(name="label", shape=[-1, 1], dtype="float32") + else: + label = paddle.static.data(name="label", shape=[-1, 1], dtype="int64") + + return [input_ids, token_type_ids, label] + + +def reset_program_state_dict(args, model, state_dict, pretrained_state_dict): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." + """ + reset_state_dict = {} + scale = ( + model.initializer_range + if hasattr(model, "initializer_range") + else getattr(model, args.model_type).config["initializer_range"] + ) + reset_parameter_names = [] + for n, p in state_dict.items(): + if n in pretrained_state_dict: + reset_state_dict[p.name] = np.array(pretrained_state_dict[n]) + reset_parameter_names.append(n) + elif p.name in pretrained_state_dict and "bert" in n: + reset_state_dict[p.name] = np.array(pretrained_state_dict[p.name]) + reset_parameter_names.append(n) + else: + dtype_str = "float32" + if str(p.dtype) == "VarType.FP64": + dtype_str = "float64" + reset_state_dict[p.name] = np.random.normal(loc=0.0, scale=scale, size=p.shape).astype(dtype_str) + logger.info("the following parameter had reset, please check. {}".format(reset_parameter_names)) + return reset_state_dict + + +def set_seed(args): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def evaluate(exe, metric, loss, correct, dev_program, data_loader, phase="eval"): + """ + The evaluate process, calcluate the eval loss and metric. 
+ """ + metric.reset() + returns = [loss] + if isinstance(correct, list) or isinstance(correct, tuple): + returns.extend(list(correct)) + else: + returns.append(correct) + for batch in data_loader: + exe.run(dev_program, feed=batch, fetch_list=returns) + return_numpys = exe.run(dev_program, feed=batch, fetch_list=returns) + metric_numpy = return_numpys[1] if len(return_numpys[1:]) == 1 else return_numpys[1:] + metric.update(metric_numpy) + res = metric.accumulate() + if isinstance(metric, Mcc): + print("%s loss: %f, mcc: %s" % (phase, return_numpys[0], res[0])) + elif isinstance(metric, PearsonAndSpearman): + print( + "%s loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s" + % (phase, return_numpys[0], res[0], res[1], res[2]) + ) + else: + print("%s loss: %f, acc: %s, " % (phase, return_numpys[0], res)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """ + Convert a glue example into necessary features. + """ + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + # Set the paddle execute environment + paddle.enable_static() + place = paddle.set_device(args.device) + fleet.init(is_collective=True) + set_seed(args) + + # Create the main_program for the training and dev_program for the validation + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + dev_program = paddle.static.Program() + + # Get the configuration of tokenizer and model + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + metric_class = METRIC_CLASSES[args.task_name] + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Create the tokenizer and dataset + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + train_ds = load_dataset("glue", args.task_name, splits="train") + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + # Define the input data and create the train/dev data_loader + with paddle.static.program_guard(main_program, startup_program): + [input_ids, token_type_ids, labels] = create_data_holder(args.task_name) + + train_data_loader = DataLoader( + dataset=train_ds, + feed_list=[input_ids, token_type_ids, labels], + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=False, + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + 
+ dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + feed_list=[input_ids, token_type_ids, labels], + num_workers=0, + return_list=False, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + + # Create the training-forward program, and clone it for the validation + with paddle.static.program_guard(main_program, startup_program): + num_class = 1 if train_ds.label_list is None else len(train_ds.label_list) + model, pretrained_state_dict = model_class.from_pretrained(args.model_name_or_path, num_classes=num_class) + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + logits = model(input_ids, token_type_ids) + loss = loss_fct(logits, labels) + dev_program = main_program.clone(for_test=True) + + # Create the training-backward program, this pass will not be + # executed in the validation + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + with paddle.static.program_guard(main_program, startup_program): + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + optimizer = fleet.distributed_optimizer(optimizer) + optimizer.minimize(loss) + + # Create the metric pass for the validation + with paddle.static.program_guard(dev_program, startup_program): + metric = metric_class() + correct = metric.compute(logits, labels) + + # Initialize the fine-tuning parameter, we will load the parameters in + # pre-training model. And initialize the parameter which not in pre-training model + # by the normal distribution. 
+ exe = paddle.static.Executor(place) + exe.run(startup_program) + state_dict = model.state_dict() + reset_state_dict = reset_program_state_dict(args, model, state_dict, pretrained_state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + loss_return = exe.run(main_program, feed=batch, fetch_list=[loss]) + if global_step % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss_return[0], args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + lr_scheduler.step() + if global_step % args.save_steps == 0: + # Validation pass, record the loss and metric + if args.task_name == "mnli": + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_matched, "matched eval") + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_mismatched, "mismatched eval") + else: + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + paddle.static.save_inference_model( + os.path.join(output_dir, "model"), [input_ids, token_type_ids], [logits], exe + ) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + args = parse_args() + assert args.device in ["cpu", "gpu", "xpu"], "Invalid device! Available device should be cpu, gpu, or xpu." + + do_train(args) diff --git a/model_zoo/bert/static/run_glue_with_sparaity.py b/model_zoo/bert/static/run_glue_with_sparaity.py new file mode 100644 index 0000000000000000000000000000000000000000..023c58b0b812a088e3358cba1329f99d2de69754 --- /dev/null +++ b/model_zoo/bert/static/run_glue_with_sparaity.py @@ -0,0 +1,408 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
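For reference, `run_glue.py` above drives AdamW with `LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps)`: the learning rate is warmed up linearly over `warmup_steps` and then decayed linearly towards zero at `num_training_steps`. The snippet below is only a rough sketch of that curve with hypothetical numbers (the SST-2 command above actually keeps the default `warmup_steps=0`), not PaddleNLP's exact implementation:

```python
def linear_decay_with_warmup(base_lr, total_steps, warmup_steps, step):
    # Linear warmup from 0 to base_lr, then linear decay back to 0 (illustrative only).
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical run: base_lr=2e-5 (as in the SST-2 example), 6000 total steps, 600 warmup steps.
for step in (0, 300, 600, 3300, 6000):
    print(step, f"{linear_decay_with_warmup(2e-5, 6000, 600, step):.2e}")
```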
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.incubate import asp +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "sts-b": PearsonAndSpearman, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + args = parser.parse_args() + return args + + +def create_data_holder(task_name): + """ + Define the input data holder for the glue task. 
+ """ + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + token_type_ids = paddle.static.data(name="token_type_ids", shape=[-1, -1], dtype="int64") + if task_name == "sts-b": + label = paddle.static.data(name="label", shape=[-1, 1], dtype="float32") + else: + label = paddle.static.data(name="label", shape=[-1, 1], dtype="int64") + + return [input_ids, token_type_ids, label] + + +def reset_program_state_dict(args, model, state_dict, pretrained_state_dict): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." + """ + reset_state_dict = {} + scale = ( + model.initializer_range + if hasattr(model, "initializer_range") + else getattr(model, args.model_type).config["initializer_range"] + ) + reset_parameter_names = [] + for n, p in state_dict.items(): + if n in pretrained_state_dict: + reset_state_dict[p.name] = np.array(pretrained_state_dict[n]) + reset_parameter_names.append(n) + elif p.name in pretrained_state_dict and "bert" in n: + reset_state_dict[p.name] = np.array(pretrained_state_dict[p.name]) + reset_parameter_names.append(n) + else: + dtype_str = "float32" + if str(p.dtype) == "VarType.FP64": + dtype_str = "float64" + reset_state_dict[p.name] = np.random.normal(loc=0.0, scale=scale, size=p.shape).astype(dtype_str) + logger.info("the following parameter had reset, please check. {}".format(reset_parameter_names)) + return reset_state_dict + + +def set_seed(args): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def evaluate(exe, metric, loss, correct, dev_program, data_loader, phase="eval"): + """ + The evaluate process, calcluate the eval loss and metric. + """ + metric.reset() + returns = [loss] + if isinstance(correct, list) or isinstance(correct, tuple): + returns.extend(list(correct)) + else: + returns.append(correct) + for batch in data_loader: + exe.run(dev_program, feed=batch, fetch_list=returns) + return_numpys = exe.run(dev_program, feed=batch, fetch_list=returns) + metric_numpy = return_numpys[1] if len(return_numpys[1:]) == 1 else return_numpys[1:] + metric.update(metric_numpy) + res = metric.accumulate() + if isinstance(metric, Mcc): + print("%s loss: %f, mcc: %s" % (phase, return_numpys[0], res[0])) + elif isinstance(metric, PearsonAndSpearman): + print( + "%s loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s" + % (phase, return_numpys[0], res[0], res[1], res[2]) + ) + else: + print("%s loss: %f, acc: %s, " % (phase, return_numpys[0], res)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """ + Convert a glue example into necessary features. 
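+    Single sentences and sentence pairs are both handled; for training data the label is
+    attached as a numpy array whose dtype matches the task (float32 for regression,
+    int64 otherwise).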
+ """ + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + # Set the paddle execute environment + paddle.enable_static() + place = paddle.set_device(args.device) + set_seed(args) + + # Create the main_program for the training and dev_program for the validation + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + dev_program = paddle.static.Program() + + # Get the configuration of tokenizer and model + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + metric_class = METRIC_CLASSES[args.task_name] + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Create the tokenizer and dataset + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + train_ds = load_dataset("glue", args.task_name, splits="train") + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ), + ): + return fn(samples) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + # Define the input data and create the train/dev data_loader + with paddle.static.program_guard(main_program, startup_program): + [input_ids, token_type_ids, labels] = create_data_holder(args.task_name) + + train_data_loader = DataLoader( + dataset=train_ds, + feed_list=[input_ids, token_type_ids, labels], + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=False, + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + feed_list=[input_ids, token_type_ids, labels], + num_workers=0, + return_list=False, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, 
batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + + # Create the training-forward program, and clone it for the validation + with paddle.static.program_guard(main_program, startup_program): + num_class = 1 if train_ds.label_list is None else len(train_ds.label_list) + model, pretrained_state_dict = model_class.from_pretrained(args.model_name_or_path, num_classes=num_class) + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + logits = model(input_ids, token_type_ids) + loss = loss_fct(logits, labels) + dev_program = main_program.clone(for_test=True) + + # Create the training-backward program, this pass will not be + # executed in the validation + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + with paddle.static.program_guard(main_program, startup_program): + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + # Keep Pooler and task-specific layer dense. + # Please note, excluded_layers must be set before calling `optimizer.minimize()`. + asp.set_excluded_layers(main_program, [model.bert.pooler.dense.full_name(), model.classifier.full_name()]) + # Calling asp.decorate() to wrap minimize() in optimizer, which + # will insert necessary masking operations for ASP workflow. + optimizer = asp.decorate(optimizer) + optimizer.minimize(loss) + + # Create the metric pass for the validation + with paddle.static.program_guard(dev_program, startup_program): + metric = metric_class() + correct = metric.compute(logits, labels) + + # Initialize the fine-tuning parameter, we will load the parameters in + # pre-training model. And initialize the parameter which not in pre-training model + # by the normal distribution. 
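+    # asp.prune_model() below derives the 2:4 sparsity masks from the current weight
+    # values, so pruning is performed only after the pre-trained parameters are loaded.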
+ exe = paddle.static.Executor(place) + exe.run(startup_program) + state_dict = model.state_dict() + reset_state_dict = reset_program_state_dict(args, model, state_dict, pretrained_state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + # Pruning model to be 2:4 sparse pattern + # Must call `exe.run(startup_program)` first before calling `asp.prune_model` + asp.prune_model(place, main_program) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + loss_return = exe.run(main_program, feed=batch, fetch_list=[loss]) + if global_step % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss_return[0], args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + lr_scheduler.step() + if global_step % args.save_steps == 0: + # Validation pass, record the loss and metric + if args.task_name == "mnli": + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_matched, "matched eval") + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_mismatched, "mismatched eval") + else: + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + paddle.static.save_inference_model( + os.path.join(output_dir, "model"), [input_ids, token_type_ids], [logits], exe + ) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + args = parse_args() + assert args.device in ["cpu", "gpu", "xpu"], "Invalid device! Available device should be cpu, gpu, or xpu." + + do_train(args) diff --git a/model_zoo/bert/static/run_pretrain.py b/model_zoo/bert/static/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..d300a5f791fc2e6146e894912fbccb65cfb8db82 --- /dev/null +++ b/model_zoo/bert/static/run_pretrain.py @@ -0,0 +1,397 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
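+
+# Static-graph BERT pre-training driven by paddle.distributed.fleet: each worker picks
+# its own shard of the pre-processed training files, and AMP / pure-fp16 as well as
+# gradient merging can be enabled through the command-line flags (see dist_optimizer).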
+ +import argparse +import os +import random +import time +from concurrent.futures import ThreadPoolExecutor + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from dataset import create_data_holder, create_pretraining_dataset + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForPretraining, + BertPretrainingCriterion, + BertTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils import profiler +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = {"bert": (BertForPretraining, BertTokenizer)} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_predictions_per_seq", default=80, type=int, help="The maximum total of masked tokens in input sequence" + ) + + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument( + "--enable_addto", + type=strtobool, + default=False, + help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training.", + ) + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + parser.add_argument("--use_pure_fp16", type=strtobool, default=False, help="Whether to use pure fp16 training.") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + parser.add_argument( + "--gradient_merge_steps", + type=int, + default=1, + help="Number of merge steps before gradient update." "global_batch_size = gradient_merge_steps * batch_size.", + ) + + # For benchmark. 
+ parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + parser.add_argument( + "--fuse_transformer", + type=strtobool, + default=False, + help="Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not.", + ) + args = parser.parse_args() + return args + + +def select_dataset_file_for_each_worker(files, f_start_id, worker_num, worker_index): + """ + Spliting the train file according to the worker index. + """ + num_files = len(files) + if worker_num > num_files: + remainder = worker_num % num_files + data_file = files[(f_start_id * worker_num + worker_index + remainder * f_start_id) % num_files] + else: + data_file = files[(f_start_id * worker_num + worker_index) % num_files] + return data_file + + +def reset_program_state_dict(model, state_dict): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." + """ + scale = model.initializer_range if hasattr(model, "initializer_range") else model.bert.config.initializer_range + + new_state_dict = dict() + for n, p in state_dict.items(): + if "layer_norm" not in p.name: + dtype_str = "float32" + if str(p.dtype) == "VarType.FP64": + dtype_str = "float64" + new_state_dict[p.name] = np.random.normal(loc=0.0, scale=scale, size=p.shape).astype(dtype_str) + return new_state_dict + + +def create_strategy(args): + """ + Create build strategy and exec strategy. + """ + build_strategy = paddle.static.BuildStrategy() + exec_strategy = paddle.static.ExecutionStrategy() + + build_strategy.enable_addto = args.enable_addto + + exec_strategy.num_threads = 1 + exec_strategy.num_iteration_per_drop_scope = 10000 + return build_strategy, exec_strategy + + +def dist_optimizer(args, optimizer): + """ + Create a distributed optimizer based on a normal optimizer + """ + build_strategy, exec_strategy = create_strategy(args) + + dist_strategy = fleet.DistributedStrategy() + dist_strategy.execution_strategy = exec_strategy + dist_strategy.build_strategy = build_strategy + + dist_strategy.fuse_grad_size_in_MB = 16 + if args.use_amp: + dist_strategy.amp = True + + custom_black_list = ["lookup_table", "lookup_table_v2"] if args.use_pure_fp16 else None + dist_strategy.amp_configs = { + "custom_white_list": ["softmax", "layer_norm", "gelu"], + "init_loss_scaling": args.scale_loss, + "custom_black_list": custom_black_list, + "use_pure_fp16": args.use_pure_fp16, + } + if args.gradient_merge_steps > 1: + dist_strategy.gradient_merge = True + dist_strategy.gradient_merge_configs = {"k_steps": args.gradient_merge_steps} + + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + return optimizer + + +def set_seed(seed): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +class WorkerInitObj(object): + "Construct the object with different seed, and the Dataloader will generate the data" + "with different seed in each worker." 
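+    # Passed as the worker init hook when the pretraining DataLoader is created, so each
+    # data-loading worker seeds numpy/random with `seed + worker_id`.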
+ + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def do_train(args): + # Initialize the paddle and paddle fleet execute environment + paddle.enable_static() + place = paddle.set_device(args.device) + fleet.init(is_collective=True) + + worker_num = fleet.worker_num() + worker_index = fleet.worker_index() + + # Create the random seed for the worker + set_seed(args.seed) + worker_init = WorkerInitObj(args.seed + worker_index) + + # Define the input data in the static mode + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + + data_holders = create_data_holder(args) + + [ + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ] = data_holders + + # Define the model structure in static mode + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + config = model_class.config_class.from_pretrained(args.model_name_or_path) + if config.vocab_size % 8 != 0: + config.vocab_size += 8 - (config.vocab_size % 8) + config.fuse = args.fuse_transformer + model = model_class(config) + criterion = BertPretrainingCriterion(model.bert.config.vocab_size) + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels, masked_lm_scale + ) + + # Define the dynamic learing_reate scheduler and optimizer + # BUG: train_data_loader is undefined variable here hence the noqa: F821 + num_training_steps = ( + args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs # noqa: F821 + ) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + multi_precision=args.use_pure_fp16, + ) + + # Use the fleet api to compile the distributed optimizer + optimizer = dist_optimizer(args, optimizer) + optimizer.minimize(loss) + + # Define the Executor for running the static model + exe = paddle.static.Executor(place) + exe.run(startup_program) + state_dict = model.state_dict() + + # Use the state dict to update the parameter + reset_state_dict = reset_program_state_dict(model, state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + if args.use_amp: + optimizer.amp_init(place) + + pool = ThreadPoolExecutor(1) + global_step = 0 + epoch = 0 + while True: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) and "training" in f + ] + files.sort() + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + # Select one file for each worker and create the DataLoader for the file + data_file = select_dataset_file_for_each_worker(files, f_start_id, worker_num, worker_index) + train_data_loader, _ = create_pretraining_dataset( + data_file, args.max_predictions_per_seq, args, data_holders, worker_init, paddle.static.cuda_places() + ) + + for f_id in range(f_start_id + 1, len(files)): + data_file = select_dataset_file_for_each_worker(files, f_id, worker_num, worker_index) + dataset_future = pool.submit( + create_pretraining_dataset, + data_file, + args.max_predictions_per_seq, + args, + data_holders, + worker_init, + paddle.static.cuda_places(), + ) + + train_cost_avg = TimeCostAverage() + reader_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + train_reader_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + global_step += 1 + loss_return = exe.run(main_program, feed=batch, fetch_list=[loss]) + total_samples += args.batch_size + # In the new 2.0 api, must call this function to change the learning_rate + lr_scheduler.step() + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + if global_step % args.logging_steps == 0: + print( + "total step: %d, epoch: %d, batch: %d, loss: %f, " + "avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + epoch, + step, + loss_return[0], + reader_cost_avg.get_average(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + args.batch_size / (reader_cost_avg.get_average() + train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + reader_cost_avg.reset() + if global_step % args.save_steps == 0: + if worker_index == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model.save_model_config(output_dir) + paddle.static.save(main_program, os.path.join(output_dir, "model_state")) + tokenizer.save_pretrained(output_dir) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + del 
train_data_loader
+                train_data_loader, data_file = dataset_future.result(timeout=None)
+        epoch += 1
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    print(args)
+    do_train(args)
diff --git a/model_zoo/bert/static_ipu/README.md b/model_zoo/bert/static_ipu/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f18e209a004d0f2c1ca2c3a82c240bd58b82de77
--- /dev/null
+++ b/model_zoo/bert/static_ipu/README.md
@@ -0,0 +1,223 @@
+# Paddle-BERT with Graphcore IPUs
+
+## Overview
+
+This project enables BERT-Base pre-training and SQuAD fine-tuning using [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) on Graphcore [IPU-POD16](https://www.graphcore.ai/products/mk2/ipu-pod16).
+
+## File Structure
+
+| File              | Description                                                         |
+| ----------------- | ------------------------------------------------------------------- |
+| `README.md`       | How to run the model.                                               |
+| `run_pretrain.py` | The script to run the pre-training tasks (phase1 and phase2).       |
+| `run_squad.py`    | The script to run the SQuAD fine-tuning and validation tasks.       |
+| `modeling.py`     | The script that builds the BERT-Base model.                         |
+| `dataset_ipu.py`  | The script that loads the input data for pre-training.              |
+| `custom_ops/`     | The folder containing the custom ops used by the model.             |
+| `scripts/`        | The folder containing the scripts for running the model.            |
+
+## Dataset
+
+- Pretraining dataset
+
+  The Wikipedia dataset is used for pre-training. Please refer to the Wikipedia dataset generator provided by [Nvidia](https://github.com/NVIDIA/DeepLearningExamples.git) to generate the pre-training dataset.
+
+  The sequence lengths used in pre-training phase1 and phase2 are 128 and 384, respectively. The following steps generate the dataset.
+
+  ```bash
+  # Here we use a specific commit; the latest commit should also be fine.
+  git clone https://github.com/NVIDIA/DeepLearningExamples.git
+  cd DeepLearningExamples/PyTorch/LanguageModeling/BERT
+  git checkout 88eb3cff2f03dad85035621d041e23a14345999e
+
+  # Change the parameter `--max_seq_length 512` to `--max_seq_length 384` at line 50 and
+  # `--max_predictions_per_seq 80` to `--max_predictions_per_seq 56` at line 51.
+  vim data/create_datasets_from_start.sh
+
+  # Build the docker image
+  bash scripts/docker/build.sh
+
+  # Use NVIDIA's docker container to download and generate the hdf5 files. This may require a GPU.
+  # You can remove `--gpus $NV_VISIBLE_DEVICES` to avoid the GPU requirement.
+  bash scripts/docker/launch.sh
+
+  # Generate the dataset with wiki_only
+  bash data/create_datasets_from_start.sh wiki_only
+  ```
+
+- SQuAD v1.1 dataset
+
+  The SQuAD v1.1 dataset is downloaded automatically; you don't have to download it manually.
+
+
+## Quick Start Guide
+
+### Prepare Project Environment
+
+- Create docker image
+
+```bash
+# clone paddle repo
+git clone https://github.com/paddlepaddle/Paddle.git -b release/2.3
+cd Paddle
+
+# build docker image
+docker build -t paddlepaddle/paddle:latest-dev-ipu -f tools/dockerfile/Dockerfile.ipu .
+``` + +- Create docker container + +```bash +# clone paddlenlp repo +git clone https://github.com/paddlepaddle/paddlenlp.git +cd paddlenlp/examples/language_model/bert/static_ipu + +# create docker container +# the ipuof configuration file need to be pre-generated and mounted to docker container +# the environment variable IPUOF_CONFIG_PATH should point to the ipuof configuration file +# more information on ipuof configuration is available at https://docs.graphcore.ai/projects/vipu-admin/en/latest/cli_reference.html?highlight=ipuof#ipuof-configuration-file +docker run --ulimit memlock=-1:-1 --net=host --cap-add=IPC_LOCK \ +--device=/dev/infiniband/ --ipc=host \ +--name paddle-bert-base \ +-v ${IPUOF_CONFIG_PATH}:/ipu.conf \ +-e IPUOF_CONFIG_PATH=/ipu.conf \ +-v ${PWD}:/workdir \ +-w /home -it paddlepaddle/paddle:latest-dev-ipu bash +``` + +All of later processes are required to be executed in the container. + +- Compile and installation + +```bash +# clone paddle repo +git clone https://github.com/paddlepaddle/Paddle.git -b release/2.3 +cd Paddle + +mkdir build && cd build + +# run cmake +cmake .. -DWITH_IPU=ON -DWITH_PYTHON=ON -DPY_VERSION=3.7 -DWITH_MKL=ON \ + -DPOPLAR_DIR=/opt/poplar -DPOPART_DIR=/opt/popart -DCMAKE_BUILD_TYPE=Release + +# compile +make paddle_python -j$(nproc) + +# install paddle package +pip install -U python/dist/paddlepaddle-0.0.0-cp37-cp37m-linux_x86_64.whl + +# go to workdir +cd /workdir +``` + +### Execution + +- Run pretraining phase1 (sequence_length = 128) + +```bash +# pod16 +# takes about 11.3 hours +bash scripts/pod16/run_pretrain.sh + +# pod4 +# takes about 11.3 * 4 hours +bash scripts/pod4/run_pretrain.sh +``` + +- Run pretraining phase2 (sequence_length = 384) + +```bash +# pod16 +# takes about 3 hours +bash scripts/pod16/run_pretrain_phase2.sh + +# pod4 +# takes about 3 * 4 hours +bash scripts/pod4/run_pretrain_phase2.sh +``` + +- Run SQuAD finetune task + +```bash +# pod16 +bash scripts/pod16/run_squad.sh + +# pod4 +bash scripts/pod4/run_squad.sh +``` + +- Run SQuAD validation + +```bash +# pod16 +bash scripts/pod16/run_squad_infer.sh + +# pod4 +bash scripts/pod4/run_squad_infer.sh +``` + +#### Parameters + +- `task` The type of the NLP model. +- `input_files` The directory of the input data. +- `output_dir` The directory of the trained models. +- `is_training` Training or inference. +- `seq_len` The sequence length. +- `vocab_size` Size of the vocabulary. +- `max_predictions_per_seq` The max number of the masked token each sentence. +- `max_position_embeddings` The length of the input mask. +- `num_hidden_layers` The number of encoder layers. +- `hidden_size` The size of the hidden state of the transformer layers size. +- `ignore_index` The ignore index for the masked position. +- `hidden_dropout_prob` The dropout probability for fully connected layer in embedding and encoder +- `attention_probs_dropout_prob` The dropout probability for attention layer in encoder. +- `learning_rate` The learning rate for training. +- `weight_decay` The weight decay. +- `beta1` The Adam/Lamb beta1 value +- `beta2` The Adam/Lamb beta2 value +- `adam_epsilon` Epsilon for Adam optimizer. +- `max_steps` The max training steps. +- `warmup_steps` The warmup steps used to update learning rate with lr_schedule. +- `scale_loss` The loss scaling. 
+- `accl1_type` set accl1 type to FLOAT or FLOAT16 +- `accl2_type` set accl2 type to FLOAT or FLOAT16 +- `weight_decay_mode` decay or l2 regularization +- `optimizer_state_offchip` The store location of the optimizer tensors +- `logging_steps` The gap steps of logging. +- `save_steps` Save the paddle model every n steps. +- `epochs` the iteration of the whole dataset. +- `batch_size` total batch size (= batches_per_step \* num_replica \* grad_acc_factor \* micro_batch_size). +- `micro_batch_size` The batch size of the IPU graph. +- `batches_per_step` The number of batches per step with pipelining. +- `seed` The random seed. +- `num_ipus` The number of IPUs. +- `ipu_enable_fp16` Enable FP16 or not. +- `num_replica` The number of the graph replication. +- `enable_grad_acc` Enable gradiant accumulation or not. +- `grad_acc_factor` Update the weights every n batches. +- `available_mem_proportion` The available proportion of memory used by conv or matmul. +- `shuffle` Shuffle Dataset. +- `wandb` Enable logging to Weights and Biases. +- `enable_engine_caching` Enable engine caching or not. +- `enable_load_params` Load paddle params or not. +- `tf_checkpoint` Path to Tensorflow Checkpoint to initialise the model. + +## Result + +For a POD16 platform: + +| Task | Metric | Result | +| ------ | -------- | ------- | +| Phase1 | MLM Loss | 1.6064 | +| | NSP Loss | 0.0272 | +| | MLM Acc | 0.6689 | +| | NSP Acc | 0.9897 | +| | tput | 11700 | +| Phase2 | MLM Loss | 1.5029 | +| | NSP Loss | 0.02444 | +| | MLM Acc | 0.68555 | +| | NSP Acc | 0.99121 | +| | tput | 3470 | +| SQuAD | EM | 79.9053 | +| | F1 | 87.6396 | diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_checkpointoutput.cc b/model_zoo/bert/static_ipu/custom_ops/custom_checkpointoutput.cc new file mode 100644 index 0000000000000000000000000000000000000000..edc7eec8fbf369e723e5e76e438513d874e697a8 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/custom_checkpointoutput.cc @@ -0,0 +1,41 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#include "paddle/extension.h"
+
+namespace {
+std::vector<std::vector<int64_t>> InferShape(std::vector<int64_t> x_shape) {
+  return {x_shape};
+}
+
+std::vector<paddle::DataType> InferDtype(paddle::DataType x_dtype) {
+  return {x_dtype};
+}
+
+std::vector<paddle::Tensor> OpForward(const paddle::Tensor &x) { return {x}; }
+
+std::vector<paddle::Tensor> OpBackward(const paddle::Tensor &x) { return {x}; }
+}
+
+PD_BUILD_OP(checkpointoutput)
+    .Inputs({"X"})
+    .Outputs({"Out"})
+    .SetInferShapeFn(PD_INFER_SHAPE(InferShape))
+    .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype))
+    .SetKernelFn(PD_KERNEL(OpForward));
+
+PD_BUILD_GRAD_OP(checkpointoutput)
+    .Inputs({paddle::Grad("Out")})
+    .Outputs({paddle::Grad("X")})
+    .SetKernelFn(PD_KERNEL(OpBackward));
diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_detach.cc b/model_zoo/bert/static_ipu/custom_ops/custom_detach.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2796fd07d60d28e7b293990a2e007c4084ea14a6
--- /dev/null
+++ b/model_zoo/bert/static_ipu/custom_ops/custom_detach.cc
@@ -0,0 +1,42 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/extension.h"
+
+namespace {
+std::vector<std::vector<int64_t>>
+InferShape(std::vector<int64_t> x_shape) {
+  return {x_shape};
+}
+
+std::vector<paddle::DataType> InferDtype(paddle::DataType x_dtype) {
+  return {x_dtype};
+}
+
+std::vector<paddle::Tensor> OpForward(const paddle::Tensor &x) { return {x}; }
+
+std::vector<paddle::Tensor> OpBackward(const paddle::Tensor &x) { return {x}; }
+}
+
+PD_BUILD_OP(detach)
+    .Inputs({"X"})
+    .Outputs({"Out"})
+    .SetInferShapeFn(PD_INFER_SHAPE(InferShape))
+    .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype))
+    .SetKernelFn(PD_KERNEL(OpForward));
+
+PD_BUILD_GRAD_OP(detach)
+    .Inputs({paddle::Grad("Out")})
+    .Outputs({paddle::Grad("X")})
+    .SetKernelFn(PD_KERNEL(OpBackward));
diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_identity.cc b/model_zoo/bert/static_ipu/custom_ops/custom_identity.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1997d0e896c12a0dc20b0a19d5bd4360a4a2bde1
--- /dev/null
+++ b/model_zoo/bert/static_ipu/custom_ops/custom_identity.cc
@@ -0,0 +1,41 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
*/ + +#include "paddle/extension.h" + +namespace { +std::vector> InferShape(std::vector x_shape) { + return {x_shape}; +} + +std::vector InferDtype(paddle::DataType x_dtype) { + return {x_dtype}; +} + +std::vector OpForward(const paddle::Tensor &x) { return {x}; } + +std::vector OpBackward(const paddle::Tensor &x) { return {x}; } +} + +PD_BUILD_OP(identity) + .Inputs({"X"}) + .Outputs({"Out"}) + .SetInferShapeFn(PD_INFER_SHAPE(InferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype)) + .SetKernelFn(PD_KERNEL(OpForward)); + +PD_BUILD_GRAD_OP(identity) + .Inputs({paddle::Grad("Out")}) + .Outputs({paddle::Grad("X")}) + .SetKernelFn(PD_KERNEL(OpBackward)); diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_nll_loss.cc b/model_zoo/bert/static_ipu/custom_ops/custom_nll_loss.cc new file mode 100644 index 0000000000000000000000000000000000000000..88112a26b7e3b245ee530b9587a5d05e9d710b26 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/custom_nll_loss.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/extension.h" + +namespace { +std::vector> +InferShape(std::vector x_shape, std::vector y_shape, + const int &reduction, const std::string &ignoreIndex, + const bool &inputIsLogProbability) { + // 0: sum, 1: mean, 2: none + if (reduction == 2) { + return {y_shape}; + } else { + return {{1}}; + } +} + +std::vector InferDtype(paddle::DataType x_dtype, + paddle::DataType y_dtype) { + return {x_dtype}; +} + +std::vector OpForward(const paddle::Tensor &x, + const paddle::Tensor &y) { + return {x}; +} + +std::vector OpBackward(const paddle::Tensor &x) { return {x}; } +} + +PD_BUILD_OP(custom_nll_loss) + .Inputs({"X", "Y"}) + .Outputs({"Out"}) + .Attrs({"reduction: int", "ignoreIndex: std::string", + "inputIsLogProbability: bool"}) + .SetInferShapeFn(PD_INFER_SHAPE(InferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype)) + .SetKernelFn(PD_KERNEL(OpForward)); + +PD_BUILD_GRAD_OP(custom_nll_loss) + .Inputs({paddle::Grad("Out")}) + .Outputs({paddle::Grad("X")}) + .SetKernelFn(PD_KERNEL(OpBackward)); diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_shape_infer.cc b/model_zoo/bert/static_ipu/custom_ops/custom_shape_infer.cc new file mode 100644 index 0000000000000000000000000000000000000000..74e144d8d7e6c715f7aded4df6ccd5423c9cdba5 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/custom_shape_infer.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include + +auto splitShapeInferenceFun = [](popart::ShapeInferenceContext &ctx) { + auto numOutputs = ctx.getNumOutputs(); + auto type = ctx.inType(0); + auto shape = ctx.inShape(0); + auto axis = ctx.getAttribute("axis"); + auto split = ctx.getAttribute>("split"); + + for (int i = 0; i < numOutputs; i++) { + shape[axis] = split.at(i); + ctx.outInfo(i) = {type, shape}; + } +}; + +#if POPART_VERSION_MAJOR == 2 +#if POPART_VERSION_MINOR == 3 +// for version 2.3, need to register a shape inference function for Split op +static popart::RegisterShapeInferenceFunction + splitRegister11(popart::Onnx::Operators::Split_11, splitShapeInferenceFun); +#endif +#endif \ No newline at end of file diff --git a/model_zoo/bert/static_ipu/custom_ops/disable_attn_dropout_bwd_pattern.cc b/model_zoo/bert/static_ipu/custom_ops/disable_attn_dropout_bwd_pattern.cc new file mode 100644 index 0000000000000000000000000000000000000000..803ae20c658bea3dce14b7d6e84a6fa71e9b33fc --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/disable_attn_dropout_bwd_pattern.cc @@ -0,0 +1,91 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "utils.cc" + +// Tests have found that disabling dropout in the backwards pass of the attention, just before the softmax, +// can improve SQuAD fine-tuning. This pattern finds that op replaces it with an identity op. +class DisableAttnDropoutBwdPattern : public popart::PreAliasPattern { +public: + bool matches(popart::Op *op) const override { + int check_levels = 2; + + if (!op->isConvertibleTo()) { + return false; + } + + // Is dropout enabled? If ratio is 0, we don't need to apply the pattern. + auto dropoutGradOp = dynamic_cast(op); + if (dropoutGradOp->getRatio() == 0.f) { + return false; + } + + // The specific attention DropoutGradOp we want to cull sits between a matmul and a softmax, + // so we'll look through producers and consumers and see if we can find them. + auto grad = op->input->tensor(popart::DropoutGradOp::getGradInIndex()); + + // The MatMulPattern converts the MatMulLhsGradOp to a MatMulOp + // There doesn't seem to be a way to check if a pattern is enabled from inside another pattern. + // The IR holds the patterns object, but it’s inaccessible for checking the status of individual patterns. + // Check both, with the most likely first. 
+ bool hasMatMulProducer = search_producers_for(grad, check_levels) != nullptr; + if (!hasMatMulProducer) { + hasMatMulProducer |= search_producers_for(grad, check_levels) != nullptr; + } + + return hasMatMulProducer && search_consumers_for(grad) != nullptr; + } + + std::vector touches(popart::Op *) const override { return {}; } + + bool apply(popart::Op *op) const override { + if (!op->isConvertibleTo()) { + return false; + } + + auto dropoutGradOp = dynamic_cast(op); + auto identityOp = makeReplacementOpInIr(popart::Onnx::Operators::Identity_1, + dropoutGradOp, + ""); + + auto inputId = dropoutGradOp->inId(popart::DropoutGradOp::getGradInIndex()); + auto outputId = dropoutGradOp->outId(popart::DropoutGradOp::getOutIndex()); + dropoutGradOp->disconnectAllInputs(); + dropoutGradOp->disconnectAllOutputs(); + dropoutGradOp->getGraph().eraseOp(dropoutGradOp->id); + + identityOp->connectInTensor(popart::IdentityOp::getInIndex(), inputId); + identityOp->connectOutTensor(popart::IdentityOp::getOutIndex(), outputId); + identityOp->setup(); + + return true; + } +}; + + +static popart::PatternCreator disableAttnDropoutBwdPatternCreator("DisableAttnDropoutBwdPattern", false); diff --git a/model_zoo/bert/static_ipu/custom_ops/tied_gather.cc b/model_zoo/bert/static_ipu/custom_ops/tied_gather.cc new file mode 100644 index 0000000000000000000000000000000000000000..2350ffd243c4c0172dc16c2f73ee9378a8853c29 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/tied_gather.cc @@ -0,0 +1,181 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +namespace CustomOperators { + const popart::OperatorIdentifier TiedGather = {"ai.graphcore", "TiedGather", 1}; +} // namespace CustomOperators + +class TiedGatherOp; +class TiedGatherGradOp; + +class TiedGatherGradOp : public popart::GatherGradOp { +public: + TiedGatherGradOp(const popart::GatherOp &op, int64_t axis_) + : popart::GatherGradOp(op, axis_), + fwd_op(&op) {} + const popart::GatherOp *fwd_op; +}; + +class TiedGatherOp : public popart::GatherOp { +public: + TiedGatherOp(int64_t axis_, const popart::Op::Settings &settings_) + : popart::GatherOp(CustomOperators::TiedGather, axis_, settings_) {} + bool check_indices = true; + + std::unique_ptr clone() const override { + return std::make_unique(*this); + } + + std::vector> getGradOps() { + std::vector> result; + result.push_back(std::make_unique(*this, getAxis())); + result[0]->pruneable = false; + return result; + } +}; + +class TiedGatherOpx : public popart::popx::Opx { +public: + TiedGatherOpx(popart::Op *op, popart::popx::Devicex *devicex) : popart::popx::Opx(op, devicex) { + verifyOp(op, CustomOperators::TiedGather); + // We always want this to layout its inputs + inputCreatorPriority = std::numeric_limits::max(); + } + + bool createsEquiv(int, const popart::popx::Opx *, int) const final { return false; } + + std::set mustExistBeforeCreate(int) const final { return {}; } + + popart::popx::InputCreatorType getInputCreatorType(int index0) const final { + return index0 == TiedGatherOp::dataInIndex() ? popart::popx::InputCreatorType::CanCreate + : popart::popx::Opx::getInputCreatorType(index0); + } + + poplar::Tensor createInput(popart::InIndex index, + const poplar::DebugNameAndId &dnai) const final { + popart::logging::debug("TiedGather asked to create index {}: name {}", index, dnai); + if (index != TiedGatherOp::dataInIndex()) { + throw popart::error("CustomOps Error: GatherOpx::createInput Cannot create input {}", index); + } + + auto inputInfo = inInfo(TiedGatherOp::indicesInIndex()); + auto weightInfo = inInfo(TiedGatherOp::dataInIndex()); + + unsigned inputSize = inputInfo.nelms(); + unsigned inChannels = weightInfo.dim(getOp().getAxis()); + unsigned outChannels = weightInfo.nelms() / inChannels; + + std::vector lhsShape = {inputSize, inChannels}; + std::vector rhsShape = {inChannels, outChannels}; + + return poplin::createMatMulInputRHS(graph(), + popart::popx::popType(weightInfo), + lhsShape, + rhsShape, + dnai, + {}, + &dv_p->matmulCache); + } + + // Identical to popart::opx::GatherOpx::grow however: + // 1) uses popops::gather instead of popops::multislice + // 2) range checks the indices and masks those out of range + void grow(poplar::program::Sequence &prog) const final { + const auto indicesShape = inShape(TiedGatherOp::indicesInIndex()); + const auto outputShape = + popart::vXtoY(outShape(TiedGatherOp::outIndex())); + + auto op = getOp(); + unsigned axis = op.getAxis(); + auto indices = getInTensor(TiedGatherOp::indicesInIndex()); + auto data = getInTensor(TiedGatherOp::dataInIndex()); + + // If there are no indices, return an empty tensor of the appropriate + // shape + if (indices.numElements() == 0) { + auto result = graph().addVariable( + data.elementType(), outputShape, debugContext("result")); + + setOutTensor(TiedGatherOp::outIndex(), result); + } else { + // Flatten the scalar indices. 
+ auto offsets = indices.flatten(); + // reinterpret the indices as unsigned int. This assumes negative indices. + // are impossible. + offsets = offsets.reinterpret(poplar::UNSIGNED_INT); + + // Place the gather axis at the front. + data = data.dimShufflePartial({0}, {axis}); + // Store the shape for later. + auto tmp_shape = data.shape(); + // Flatten the other dimensions. + data = data.flatten(1, data.rank()); + + // Change (2) + poplar::Tensor mask; + if (op.check_indices) { + auto gather_size = data.shape()[0]; + mask = popops::lt(graph(), offsets, static_cast(gather_size), prog, debugContext("mask + tiedGatherOpxCreator(CustomOperators::TiedGather); diff --git a/model_zoo/bert/static_ipu/custom_ops/tied_gather_pattern.cc b/model_zoo/bert/static_ipu/custom_ops/tied_gather_pattern.cc new file mode 100644 index 0000000000000000000000000000000000000000..d22fcd22711e689468c816b266ed5bd0b054343d --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/tied_gather_pattern.cc @@ -0,0 +1,504 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "tied_gather.cc" +#include "utils.cc" + +using SerialiseSettings = popart::MatMulBaseOp::SerialiseSettings; + +// This pattern matches for graphs of the shape. 
+// +// Weight +// / \ +// Transpose MatMul +// | +// Indices --Gather +// +// And performs the following transformations: +// 1) Disable FullyConnectedPass on MatMul +// 2) Add Detach between the Gather and the Weight so no SGD ops are created (they will be added later by TiedGatherAccumulatePattern) +// 3) Replace Gather with TiedGather +// Resulting in: +// Weight +// / \ +// Transpose MatMul +// | +// Detach +// | +// Indices --TiedGather +// +// Conditionally, if MatMul is annotated with serialisation it will: +// 4) Replace Gather with N x TiedGather to match the serialisation on the MatMul +// Resulting in: +// For serialisation factor: 2 +// +// Weight +// / \ +// Transpose MatMul +// | +// Indices Detach +// | | | | +// | | | Slice--\ +// | Sub -|------TiedGather +// | | | +// | Slice--\ | +// Sub ---------TiedGather | +// \ | +// Add +// +namespace { +bool produced_by_transpose(popart::Tensor *t) { + return t->hasProducer() && t->getProducer()->isConvertibleTo(); +} +} + +class TiedGatherPattern : public popart::PreAliasPattern { + mutable std::map tied_op_map; +public: + bool matches(popart::Op *op) const override { + auto &ir = op->getIr(); + // Only run in the fwd pass + if (op->getIr().hasConstructedBackwards()) { + return false; + } + if (op->getIr().isTraining() && !op->getIr().getSessionOptions().enableGradientAccumulation) { + return false; + } + if (op->isConvertibleTo() && !op->isConvertibleTo()) { + if (produced_by_transpose(op->input->tensor(popart::GatherOp::dataInIndex()))) { + auto matmul = weight_consumed_by(op->input->tensor(popart::GatherOp::dataInIndex())); + if (matmul) { + tied_op_map.insert({op, matmul}); + return true; + } + } + } + return false; + } + + std::vector touches(popart::Op *) const override { return {}; } + + bool apply(popart::Op *op) const override { + auto &graph = op->getGraph(); + + auto gather = dynamic_cast(op); + auto matmul = tied_op_map[gather]; + + // (1) + matmul->setUseFullyConnectedPass(false); + + auto axis = gather->getAxis(); + auto serialisation = matmul->getSerialiseSettings(); + + auto data = gather->input->tensor(popart::GatherOp::dataInIndex()); + auto indices = gather->input->tensor(popart::GatherOp::indicesInIndex()); + auto out = gather->output->tensor(popart::GatherOp::outIndex()); + + // Disconnect "out" so it can be connected to the replacing ops. 
+ gather->disconnectAllOutputs(); + + // (2) + auto detach_up = std::make_unique( + popart::Onnx::CustomOperators::Detach_1, + popart::Op::Settings(graph, "TiedGatherDetach") + ); + auto detach = detach_up.get(); + transferBaseProperties(gather, detach); + graph.moveIntoGraph(std::move(detach_up)); + detach->connectInTensor(0, data->id); + auto detached_data_id = data->id + "/detached"; + detach->createAndConnectOutTensor(0, detached_data_id); + detach->setup(); + data = graph.getTensors().get(detached_data_id); + + std::string name = gather->name(); + if (name.empty()) { + name = std::to_string(gather->id); + } + + auto replace_with_tied_gather = [&](popart::TensorId dict, popart::TensorId ind, int64_t i, const std::string &debugContext) { + auto tied_gather_up = std::make_unique( + axis, + popart::Op::Settings(graph, debugContext)); + auto tied_gather = tied_gather_up.get(); + transferBaseProperties(gather, tied_gather); + graph.moveIntoGraph(std::move(tied_gather_up)); + + tied_gather->connectInTensor(TiedGatherOp::dataInIndex(), dict); + tied_gather->connectInTensor(TiedGatherOp::indicesInIndex(), ind); + + auto out_id = out->id; + if (i >= 0) { + out_id = debugContext + ":0"; + tied_gather->createAndConnectOutTensor(TiedGatherOp::outIndex(), out_id); + } else { + tied_gather->connectOutTensor(TiedGatherOp::outIndex(), out_id); + } + + graph.topoCons->transfer(gather, tied_gather); + + tied_gather->setup(); + + return out_id; + }; + + if (serialisation.factor <= 1 || serialisation.mode == SerialiseSettings::Mode::None) { + // (3) + replace_with_tied_gather(data->id, indices->id, -1, name); + } else { + // (4) + if (serialisation.mode != SerialiseSettings::Mode::OutputChannels) { + throw popart::error("CustomOps Error: Tied Gather Pattern only supports Serialisation::Mode::OutputChannels"); + } + + auto slice_op = [&](int64_t starts, int64_t ends, const std::string &debugContext) { + auto slice_up = std::make_unique( + popart::Onnx::AiOnnx::OpSet9::Slice, + std::vector({starts}), + std::vector({ends}), + std::vector({axis}), + popart::Op::Settings(graph, debugContext + "/slice")); + auto slice = slice_up.get(); + transferBaseProperties(gather, slice); + graph.moveIntoGraph(std::move(slice_up)); + slice->connectInTensor(popart::SliceOp::getInIndex(), data->id); + auto data_slice = debugContext + "/slice:0"; + slice->createAndConnectOutTensor(popart::SliceOp::getOutIndex(), data_slice); + slice->setup(); + return data_slice; + }; + + auto subtract_with_constant = [&](popart::Tensor *a, int64_t c, const std::string &debugContext) { + auto sub_up = std::make_unique( + popart::Onnx::Operators::Sub_7, + popart::Op::Settings(graph, debugContext + "/sub")); + auto sub = sub_up.get(); + transferBaseProperties(gather, sub); + graph.moveIntoGraph(std::move(sub_up)); + sub->connectInTensor(popart::SubtractOp::getArg0InIndex(), a->id); + // Create constant to subtract from + static unsigned i = 0; + auto sub_const_id = a->id + "_sub_const_" + std::to_string(i++); + popart::TensorInfo subInfo(a->info.dataType(), {1}); + std::vector d(1, c); + graph.getTensors().addConstInit(sub_const_id, subInfo, d.data()); + sub->connectInTensor(popart::SubtractOp::getArg1InIndex(), sub_const_id); + auto indices_sub = debugContext + "/sub:0"; + sub->createAndConnectOutTensor(popart::SubtractOp::getOutIndex(), indices_sub); + sub->setup(); + return indices_sub; + }; + + auto add_op = [&](popart::TensorId a, popart::TensorId b, popart::TensorId out, const std::string &debugContext) { + auto add_up = std::make_unique( 
+ popart::Onnx::Operators::Add_6, + popart::Op::Settings(graph, debugContext + "/add")); + auto add = add_up.get(); + transferBaseProperties(gather, add); + graph.moveIntoGraph(std::move(add_up)); + add->connectInTensor(popart::AddOp::getArg0InIndex(), a); + add->connectInTensor(popart::AddOp::getArg1InIndex(), b); + if (graph.getTensors().contains(out)) { + add->connectOutTensor(popart::AddOp::getOutIndex(), out); + } else { + add->createAndConnectOutTensor(popart::AddOp::getOutIndex(), out); + } + add->setup(); + return out; + }; + + popart::TensorId tmp_id; + for (int64_t i = 0; i < serialisation.factor; i++) { + int64_t slice_size = data->info.dim(axis) / serialisation.factor; + auto serial_name = name + "/" + std::to_string(i); + // Slice the Dictionary + auto data_slice = slice_op(i * slice_size, (i + 1) * slice_size, serial_name); + // Subtract the indices + auto indices_sub = subtract_with_constant(indices, i * slice_size, serial_name); + // Add the tied gather to the graph + auto next_id = replace_with_tied_gather(data_slice, indices_sub, i, serial_name); + + // Add the results + if (i == 0) { + tmp_id = next_id; + } else { + auto out_id = out->id; + if (i < serialisation.factor - 1) { + out_id += "_tmp" + std::to_string(i); + } + tmp_id = add_op(tmp_id, next_id, out_id, serial_name); + + // Tie the add to happen directly after the gather + graph.topoCons->insert( + graph.getTensors().get(next_id)->getProducer(), + graph.getTensors().get(tmp_id)->getProducer(), + true); + } + } + } + + gather->disconnectAllInputs(); + graph.eraseOp(gather->id); + + return true; + } +}; + +// This pattern matches for graphs of the shape. +// +// Weight +// | \ +// TiedGatherGrad MatMul +// | +// Accl - Accumulate +// +// And will perform the following transformation +// 1) Replace TiedGatherGrad with SparseAccumulate +// +// Resulting in: +// +// Weight +// | \ +// | MatMul +// | | +// | Accl - Accumulate +// | | | +// SparseAccumulate - Optimizer +// +// (--> is a topocon) + +class TiedGatherAccumulatePattern : public popart::PreAliasPattern { +public: + bool matches(popart::Op *op) const override { + // Only works with gradient accumulation + if (!op->getIr().getSessionOptions().enableGradientAccumulation) { + return false; + } + // Only run after the optimizers have been created + if (!op->getIr().hasDecomposedOptimizers()) { + return false; + } + return op->isConvertibleTo(); + } + + std::vector touches(popart::Op *) const override { return {}; } + + bool apply(popart::Op *op) const override { + auto gather_grad = dynamic_cast(op); + auto gather = gather_grad->fwd_op; + auto root_weight = get_variable(gather->input->tensor(popart::GatherOp::dataInIndex())); + + auto gather_ops = find_all_consumers(root_weight); + + auto &ir = op->getIr(); + + // Get all the Accumulate ops in the normal context + std::vector accumulate_ops; + + auto update_ops = find_all_consumers(root_weight); + if (update_ops.size() < 1) { + // OptimizerDecomposePattern has not run. + throw popart::error("CustomOps Error: Could not find update ops for weight {}", root_weight->id); + } + + for (size_t i = 0; i < update_ops.size(); i++) { + auto var_update = update_ops[i]; + + auto accum = var_update->inTensor(popart::VarUpdateWithUpdaterOp::getUpdaterInIndex()); + // Accumulate Ops in the normal fragment are Gradient Accumulation. 
+ auto accl_op = search_producers_for(accum, 10); + + if (accl_op) { + auto exists = std::find_if(accumulate_ops.begin(), accumulate_ops.end(), [&accl_op](popart::Op* op){ return op->id == accl_op->id; }); + if (exists == accumulate_ops.end()) { + accumulate_ops.push_back(accl_op); + } + } else { + popart::logging::info("CustomOps Warning: Could not find outer AccumulateOp gradient accumulation via accumulator {}.", accum->id); + } + } + + if (accumulate_ops.size() != gather_ops.size()) { + throw popart::error("CustomOps Error: The number of gather ops ({}) does not match the number of accumulate ops ({}).", gather_ops.size(), accumulate_ops.size()); + } + + // Match up gather serial index to Accumulator's matmul index. + // TODO: Find a more robust way than sorting input ids + std::sort(accumulate_ops.begin(), accumulate_ops.end(), + [](const popart::Op *l, const popart::Op *r) { + return l->input->tensor(popart::AccumulateOp::getVarToUpdateInIndex())->id.compare( + r->input->tensor(popart::AccumulateOp::getVarToUpdateInIndex())->id) < 0; + }); + std::sort(gather_ops.begin(), gather_ops.end(), + [](const popart::Op *l, const popart::Op *r) { + return l->name().compare(r->name()) < 0; + }); + + auto itr = std::find(gather_ops.begin(), gather_ops.end(), gather); + if (itr == gather_ops.end()) { + throw popart::error("CustomOps Error: Could not find {} in the consumers of {}.", gather->name(), root_weight->id); + } + + unsigned serial_index = std::distance(gather_ops.begin(), itr); + + auto dense_accl = accumulate_ops[serial_index]; + + auto accl_id = dense_accl->inId(popart::AccumulateOp::getVarToUpdateInIndex()); + auto weight_id = gather->inId(popart::GatherOp::dataInIndex()); + popart::logging::pattern::info("Using tied accumulator {} for {}", accl_id, gather->name()); + + // Transpose must be inplace so the accumulator is actually updated + accl_id = transpose_inplace(accl_id, gather_grad); + + auto &graph = op->getGraph(); + + auto accum_type = dense_accl->getAccumulationType(); + popart::Tensor *factor = dense_accl->getFactor().isConst() ? nullptr : dense_accl->inTensor(popart::SparseAccumulateOp::getFactorInIndex()); + + if (factor != nullptr && accum_type == popart::AccumulationType::Mean) { + auto inv_counter = factor->id + "_inverse"; + if (!graph.getTensors().contains(inv_counter)) { + popart::TensorInfo one_info(factor->info.dataType(), {}); + std::vector one_data(one_info.nelms(), 1); + const auto &one_id = graph.getIr().createIntermediateTensorId("one"); + graph.getTensors().addConstInit(one_id, one_info, one_data.data()); + auto inv_op = graph.createConnectedOp( + {{popart::DivOp::getArg0InIndex(), one_id}, + {popart::DivOp::getArg1InIndex(), factor->id}}, + {{popart::DivOp::getOutIndex(), inv_counter}}, + popart::Onnx::Operators::Div_7, + popart::Op::Settings(graph, "mean_accumulate_inverse")); + transferBaseProperties(gather_grad, inv_op); + + for (auto cons : factor->consumers.getOps()) { + if (cons->isConvertibleTo() && + cons->inId(popart::AccumulateOp::getVarToUpdateInIndex()) == factor->id) { + graph.topoCons->insert(cons, inv_op); + } + } + } + accum_type = popart::AccumulationType::DampenedAdd; + factor = graph.getTensor(inv_counter); + } + + // Add sparseAccumulateOp. 
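+    // The SparseAccumulateOp created below replaces TiedGatherGrad entirely:
+    // rather than materialising a dense embedding gradient, it scatters the
+    // incoming gradient rows into the (in-place transposed) tied accumulator at
+    // the gathered indices, reusing the accumulation type and factor taken from
+    // the dense AccumulateOp (adjusted above for the Mean case).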
+ auto sparse_accl_up = std::make_unique( + accum_type, + dense_accl->getFactor(), + gather_grad->getAxis(), + popart::Op::Settings(graph, "_tiedAccumulate/" + std::to_string(serial_index))); + + auto sparse_accl = sparse_accl_up.get(); + transferBaseProperties(gather_grad, sparse_accl); + graph.moveIntoGraph(std::move(sparse_accl_up)); + + // Inputs + // Accumulator + sparse_accl->connectInTensor(popart::SparseAccumulateOp::getVarToUpdateInIndex(), + accl_id); + // Gradients + sparse_accl->connectInTensor( + popart::SparseAccumulateOp::getUpdaterInIndex(), + gather_grad->inId(popart::GatherGradOp::gradInIndex())); + // Scale + if (!dense_accl->getFactor().isConst()) { + sparse_accl->connectInTensor( + // the index at which the dampening scale factor is received, + popart::SparseAccumulateOp::getFactorInIndex(), + // the name of the dampening scale factor + factor->id); + } + // Indices + sparse_accl->connectInTensor( + popart::SparseAccumulateOp::getIndicesInIndex(), + gather_grad->inId(popart::GatherGradOp::indicesInIndex())); + + // Original weight to be cloned + sparse_accl->connectInTensor( + popart::SparseAccumulateOp::getOriginalVarToUpdateInIndex(), + weight_id); + + // Transfer TopoCons + graph.topoCons->transfer(gather_grad, sparse_accl); + + // gatherGrad output that will be isolated + auto grad_Id = gather_grad->outId(TiedGatherGradOp::gradOutIndex()); + + // Remove TiedGatherGrad + gather_grad->disconnectAllInputs(); + gather_grad->disconnectAllOutputs(); + graph.eraseOp(gather_grad->id); + + // Outputs + sparse_accl->createAndConnectOutTensor( + popart::SparseAccumulateOp::getUpdatedVarOutIndex(), + sparse_accl->name() + ":0"); + + // remove the gatherGrad output + graph.getTensors().remove(grad_Id); + + // Finalise sparse op + sparse_accl->setup(); + + return true; + } + + popart::TensorId transpose_inplace(popart::TensorId tid, popart::Op *op) const { + auto &graph = op->getGraph(); + + // TransposeInplaceOp's constructor requires a transposeOp + auto outplace_up = std::make_unique( + popart::Onnx::AiOnnx::OpSet9::Transpose, + std::vector{1, 0}, + popart::Op::Settings(graph, tid + "_Transpose")); + auto transpose_up = outplace_up->getInplaceVariant(popart::Onnx::CustomOperators::TransposeInplace); + + auto transpose = transpose_up.get(); + transferBaseProperties(op, transpose); + graph.moveIntoGraph(std::move(transpose_up)); + + transpose->connectInTensor(popart::TransposeOp::getInIndex(), tid); + popart::TensorId out_id = tid + "/transposed"; + transpose->createAndConnectOutTensor(popart::TransposeOp::getOutIndex(), out_id); + + transpose->setup(); + return out_id; + } +}; + +static popart::PatternCreator TiedGatherPatternCreator("TiedGatherPattern", true); +static popart::PatternCreator TiedGatherAccumulatePatternCreator("TiedGatherAccumulatePattern", true); diff --git a/model_zoo/bert/static_ipu/custom_ops/utils.cc b/model_zoo/bert/static_ipu/custom_ops/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..b6c6570f803cd28c2c61a30040b6a66eca9af3be --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/utils.cc @@ -0,0 +1,173 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +template +static T *search_producers_for(popart::Tensor *t, int max_depth=-1) { + + // Searched as far as we can without success + if (t->tensorType() == popart::TensorType::Variable || !t->hasProducer()) { + return nullptr; + } + auto op = t->getProducer(); + if (op->isConvertibleTo() && op->settings.executionContext == Ctx) { + return dynamic_cast(op); + } + + if (op->input->n() < 1) { + return nullptr; + } + + unsigned producer_index = 0; + if (op->input->n() > 1) { + if (op->isConvertibleTo()) { + producer_index = popart::AdamUpdaterOp::getAccl1InIndex(); + } else if (op->isConvertibleTo()) { + producer_index = popart::AdamVarUpdateOp::getUpdaterInIndex(); + } else if (op->isConvertibleTo()) { + producer_index = popart::AccumulateBaseOp::getUpdaterInIndex(); + } else if (op->isConvertibleTo()) { + producer_index = popart::DropoutGradOp::getGradInIndex(); + } else if (op->isConvertibleTo()) { + // Grad Unscaling for Adam-based optimizers + producer_index = popart::MulOp::getArg0InIndex(); + } else if (op->isConvertibleTo()) { + // Replicated Tensor Sharding + producer_index = popart::ReplicatedReduceScatterOp::getInIndex(); + } else if (op->isConvertibleTo()) { + // Replicated Tensor Sharding + producer_index = popart::ReplicatedAllGatherOp::getInIndex(); + } else { + return nullptr; + } + } + + // Providing a max-search depth of -1 will remove the depth limit at the cost of potentially + // unnecessary checks. + if (max_depth > 0) { + max_depth -= 1; + if (max_depth == 0) { + return nullptr; + } + } + + return search_producers_for(op->input->tensor(producer_index), max_depth); +} + +// Finds the underlying variable by searching through producers. +static popart::Tensor *get_variable(popart::Tensor *t) { + if (t->tensorType() == popart::TensorType::Variable || t->tensorType() == popart::TensorType::Const) { + return t; + } else if (!t->hasProducer()) { + return nullptr; + } + auto op = t->getProducer(); + if (op->input->n() != 1) { + return nullptr; + } + return get_variable(op->input->tensors().front()); +} + +// Attempts to find T by searching through consumers. +template +static T *search_consumers_for(popart::Tensor *w, std::queue &q) { + for (auto consumer : w->consumers.getOps()) { + if (consumer->isConvertibleTo() && consumer->settings.executionContext == Ctx) { + return dynamic_cast(consumer); + } + + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor(popart::DropoutGradOp::getGradInIndex())); + } + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor( + popart::ReplicatedReduceScatterOp::getOutIndex())); + } + + // TODO: Improve this as it's too general. Most ops that have one input and one output are view changing. 
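+    // Assume a consumer with exactly one input and one output (for example a
+    // reshape, transpose or cast) only changes the view of the tensor, and
+    // continue the search through its output.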
+ if (consumer->input->n() == 1 && consumer->output->n() == 1) { + q.push(consumer->output->tensor(0)); + } + } + if (q.size() < 1) { + return nullptr; + } + w = q.front(); + q.pop(); + return search_consumers_for(w, q); +} +template +static T *search_consumers_for(popart::Tensor *w) { + std::queue q; + return search_consumers_for(w, q); +} + +template +static T *weight_consumed_by(popart::Tensor *w) { + w = get_variable(w); + if (w) { + return search_consumers_for(w); + } + return nullptr; +} + +template +static void find_all_consumers(popart::Tensor *w,std::queue &q, std::vector &result) { + for (auto consumer : w->consumers.getOps()) { + if (std::find(result.begin(), result.end(), consumer) == result.end()) { + if (consumer->isConvertibleTo() && consumer->settings.executionContext == Ctx) { + result.push_back(dynamic_cast(consumer)); + } + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor(popart::MatMulOp::getOutIndex())); + } + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor( + popart::ReplicatedReduceScatterOp::getOutIndex())); + } + // Most ops that have one input and one output are view changing. + if (consumer->input->n() == 1 && consumer->output->n() == 1) { + q.push(consumer->output->tensor(0)); + } + } + } + if (q.size() < 1) { + return; + } + w = q.front(); + q.pop(); + return find_all_consumers(w, q, result); +} +template +static std::vector find_all_consumers(popart::Tensor *w) { + std::queue q; + std::vector result; + find_all_consumers(w, q, result); + return result; +} diff --git a/model_zoo/bert/static_ipu/custom_ops/workarounds/prevent_const_expr_folding_op.cc b/model_zoo/bert/static_ipu/custom_ops/workarounds/prevent_const_expr_folding_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d6482ad4e98a80e46e98395b3d70190fe4bdb69b --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/workarounds/prevent_const_expr_folding_op.cc @@ -0,0 +1,137 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include +#include + +namespace CustomOperators +{ + const popart::OperatorIdentifier PreventConstFolding = {"ai.graphcore", "PreventConstFolding", 1}; +} // namespace CustomOperators +namespace CustomGradOperators { + const popart::OperatorIdentifier PreventConstFoldingGrad = {"ai.graphcore", "PreventConstFoldingGrad", 1}; +} // namespace CustomGradOperators + +class PreventConstFoldingOp; +class PreventConstFoldingGradOp; +class PreventConstFoldingOpx; +class PreventConstFoldingGradOpx; + +// By default, const expressions ops get folded to optimise the graph and remove unnessary ops +// at the start. However, in this case, it causes the word embedding to exist in both its +// original and transposed form. By adding this op, the constant expression folding transform +// can't fold through it, so we prevent folding after this point. 
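+//
+// Conceptual sketch of the intent (illustration only, not taken from the PopART
+// sources):
+//
+//   without this op:  word_embedding (const) -> Transpose -> MatMul
+//                     const-expression folding pre-computes Transpose(W), so the
+//                     original and the transposed weight both stay live.
+//
+//   with this op:     word_embedding -> PreventConstFolding -> Transpose -> MatMul
+//                     folding cannot pass through the custom op, so only a single
+//                     copy of the weight is kept.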
+ +class PreventConstFoldingOp : public popart::Op +{ +public: + PreventConstFoldingOp(const popart::OperatorIdentifier &_opid, const Op::Settings &settings_) + : Op(_opid, settings_) {} + + void setup() final { outInfo(0) = inInfo(0); } + + std::unique_ptr clone() const { + return std::make_unique(*this); + } + + std::vector> getGradOps() { + std::vector> upops; + upops.emplace_back(std::make_unique(*this)); + return upops; + } + + float getSubgraphValue() const final { return getLowSubgraphValue(); } +}; + +static popart::OpDefinition PreventConstFoldingOpDef({}); + +static popart::OpCreator PreventConstFoldingOpCreator( + popart::OpDefinitions({{CustomOperators::PreventConstFolding, + PreventConstFoldingOpDef}}), + [](const popart::OpCreatorInfo &oci) -> std::unique_ptr { + return std::unique_ptr( + new PreventConstFoldingOp(oci.opid, oci.settings)); + }, + true); + +class PreventConstFoldingOpx : public popart::popx::Opx { +public: + PreventConstFoldingOpx(popart::Op *op, popart::popx::Devicex *devicex) : popart::popx::Opx(op, devicex) + { verifyOp(op, CustomOperators::PreventConstFolding); } + + popart::popx::InputCreatorType getInputCreatorType(popart::InIndex) const { + return popart::popx::InputCreatorType::CanUnwind; + } + + poplar::Tensor unwindTensorLayout(poplar::Tensor tensor, popart::InIndex, popart::OutIndex) const { + return tensor; + } + + popart::view::RegMap unwindRegion(popart::InIndex, popart::OutIndex) const { + return [this](const popart::view::Region &r) { + return popart::view::Regions(1, r); + }; + } + + void grow(poplar::program::Sequence &prog) const final { + insert(outId(0), getInTensor(0)); + } +}; + +class PreventConstFoldingGradOp : public PreventConstFoldingOp +{ +public: + PreventConstFoldingGradOp(const PreventConstFoldingOp &fwdOp) + : PreventConstFoldingOp(CustomGradOperators::PreventConstFoldingGrad, fwdOp.getSettings()) {} + + PreventConstFoldingGradOp(const popart::Op::Settings &settings) + : PreventConstFoldingOp(CustomGradOperators::PreventConstFoldingGrad, settings) {} + + std::unique_ptr clone() const final { + return std::make_unique(*this); + } + + const std::vector &gradInputInfo() const { + static const std::vector inInfo = { + {0, 0, popart::GradOpInType::GradOut}}; + + return inInfo; + } + const std::map &gradOutToNonGradIn() const { + static const std::map outInfo = {{0, 0}}; + return outInfo; + } +}; + +class PreventConstFoldingGradOpx : public popart::popx::Opx { +public: + PreventConstFoldingGradOpx(popart::Op *op, popart::popx::Devicex *devicex) + : popart::popx::Opx(op, devicex) { + verifyOp(op, CustomGradOperators::PreventConstFoldingGrad); + } + + void grow(poplar::program::Sequence &prog) const final { + setOutTensor(0, getInTensor(0)); + } +}; + +static popart::popx::OpxCreator + preventConstFoldingOpxCreator(CustomOperators::PreventConstFolding); +static popart::popx::OpxCreator + preventConstFoldingGradOpxCreator(CustomGradOperators::PreventConstFoldingGrad); diff --git a/model_zoo/bert/static_ipu/dataset_ipu.py b/model_zoo/bert/static_ipu/dataset_ipu.py new file mode 100644 index 0000000000000000000000000000000000000000..4e27c5d7041c6112a7f68332bcfe37220f009584 --- /dev/null +++ b/model_zoo/bert/static_ipu/dataset_ipu.py @@ -0,0 +1,270 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import multiprocessing +import threading +from queue import Queue + +import h5py +import numpy as np +import paddle + +KEYS = ("input_ids", "input_mask", "segment_ids", "masked_lm_positions", "masked_lm_ids", "next_sentence_labels") + + +def shuffle_dict(dic, len): + idxs = np.arange(len) + np.random.shuffle(idxs) + for k, v in dic.items(): + dic[k] = v[idxs] + + +class PretrainingHDF5DataLoader: + def __init__( + self, + input_files, + max_seq_length=128, + max_mask_tokens=20, + batch_size=1, + dtype=np.int32, + shuffle=False, + pad_position_value=511, + num_workers=3, + ): + self.files = input_files + self.batch_size = batch_size + self.max_seq_length = max_seq_length + self.max_mask_tokens = max_mask_tokens + self.dtype = dtype + self.shuffle = shuffle + self.pad_position_value = pad_position_value + if shuffle: + np.random.shuffle(self.files) + + self.counter = 0 + + # get total number of samples + pool = multiprocessing.Pool(min(multiprocessing.cpu_count(), 32)) + num_samples = pool.map(self.samples_in_file, self.files) + pool.close() + pool.join() + self.total_samples = sum(num_samples) + self.len = self.total_samples // self.batch_size + assert self.len > 1, f"Batch size {self.batch_size} larger than number of samples {self.total_samples}" + + # notify feed and fetch processes/thread to stop + self.event_queue = multiprocessing.Manager().Queue(10) + + # buffer to store final data + self.feed_buffer = Queue(20) + + # number of processes to do remask + self.num_workers = num_workers + # each feed_worker has one process_buffer to use + self.process_buffers = [multiprocessing.Manager().Queue(10) for _ in range(num_workers)] + self.split_files = np.array_split(self.files, self.num_workers) + # feed_worker will load data from h5py files, and do remask process + self.feed_workers = [ + multiprocessing.Process( + target=self.fill_buffer_loop, args=(self.split_files[idx], self.process_buffers[idx]) + ) + for idx in range(self.num_workers) + ] + for p in self.feed_workers: + p.start() + + # index for which process_buffer is used each time + self.post_fetch_idx = 0 + # load final data from process_buffers + self.fetch_worker = threading.Thread(target=self.post_fetch) + self.fetch_worker.start() + + def samples_in_file(self, filename): + with h5py.File(filename, "r") as f: + data_len = f[KEYS[0]].shape[0] + return data_len + + def release(self): + self.event_queue.put("END") + while not self.feed_buffer.empty(): + self.feed_buffer.get() + for process_buffer in self.process_buffers: + while not process_buffer.empty(): + process_buffer.get() + self.fetch_worker.join() + for p in self.feed_workers: + p.join() + return + + def __len__(self): + return self.len + + def __iter__(self): + self.counter = 0 + return self + + def __next__(self): + result = self.feed_buffer.get() + self.counter += 1 + return result + + def post_fetch(self): + while True: + if not self.event_queue.empty(): + return + if not self.process_buffers[self.post_fetch_idx].empty(): + logging.debug(f"self.post_fetch_idx: {self.post_fetch_idx}") + np_feed_list = 
self.process_buffers[self.post_fetch_idx].get() + self.post_fetch_idx += 1 + if self.post_fetch_idx == self.num_workers: + self.post_fetch_idx = 0 + elif self.post_fetch_idx > self.num_workers: + raise Exception("post_fetch_idx must < num_workers") + + lod_feed_list = [] + for data in np_feed_list: + tensor = paddle.fluid.core.LoDTensor() + place = paddle.CPUPlace() + tensor.set(data, place) + lod_feed_list.append(tensor) + self.feed_buffer.put(lod_feed_list) + + def fill_buffer_loop(self, files, process_buffer): + data = None + data_index = 0 + file_index = 0 + + def multiprocess_fill_buffer(data, file_index, data_index): + if data is None: + data = self.load_one_file(files[file_index]) + file_index += 1 + data_index = 0 + + curr_batch = [] + still_required = self.batch_size + while still_required > 0: + data_batch = {k: data[k][data_index : data_index + still_required] for k in KEYS} + data_batch_len = len(data_batch[KEYS[0]]) + data_index += data_batch_len + curr_batch.append(data_batch) + curr_batch_len = sum(len(x[KEYS[0]]) for x in curr_batch) + still_required = self.batch_size - curr_batch_len + if still_required > 0: + if file_index >= len(files): + np.random.shuffle(files) + file_index = 0 + + data = self.load_one_file(files[file_index]) + file_index += 1 + data_index = 0 + if not curr_batch_len == self.batch_size: + raise Exception("data length should equal to batch_size") + + result = {} + for k in KEYS: + result[k] = np.concatenate([item[k] for item in curr_batch], axis=0) + process_buffer.put(self.do_remask(result)) + + return data, file_index, data_index + + while True: + if self.event_queue.empty(): + data, file_index, data_index = multiprocess_fill_buffer(data, file_index, data_index) + else: + return + + def do_remask(self, samples): + input_ids = samples["input_ids"] + segment_ids = samples["segment_ids"] + masked_lm_positions = samples["masked_lm_positions"] + masked_lm_ids = samples["masked_lm_ids"] + next_sentence_labels = samples["next_sentence_labels"] + masked_lm_weights = np.ones_like(masked_lm_ids, dtype=np.int32) + masked_lm_weights[masked_lm_ids == 0] = 0 + + # post process + batch_size, seq_len = input_ids.shape + formatted_pos = self.pad_position_value * np.ones_like(samples["input_ids"]) + formatted_input = np.zeros_like(input_ids) + formatted_seg = np.zeros_like(segment_ids) + formatted_mask_labels = np.zeros((batch_size, self.max_mask_tokens), dtype=masked_lm_ids.dtype) + + valid_seq_positions = [] + valid_mask_positions = masked_lm_weights == 1 + valid_mask_len = np.sum(valid_mask_positions, axis=1).reshape(-1, 1) + for i, mask_pos in enumerate(masked_lm_positions): + pos = [True] * seq_len + for mask_index, m in enumerate(mask_pos): + if mask_index < valid_mask_len[i]: + pos[m] = False + valid_seq_positions.append(np.logical_and(pos, input_ids[i] != 0)) + valid_seq_len = np.minimum( + np.sum(valid_seq_positions, axis=1) + self.max_mask_tokens, self.max_seq_length + ).reshape(-1, 1) + unmasked_len = np.minimum(np.sum(valid_seq_positions, axis=1), self.max_seq_length - self.max_mask_tokens) + for i in range(batch_size): + target_mask_indices = np.arange(valid_mask_len[i]) + target_seq_indices = self.max_mask_tokens + np.arange(unmasked_len[i]) + source_mask_indices = masked_lm_positions[i][valid_mask_positions[i]] + source_seq_indices = np.arange(seq_len)[valid_seq_positions[i]][: unmasked_len[i]] + + target_indices = np.hstack([target_mask_indices, target_seq_indices]) + source_indices = np.hstack([source_mask_indices, source_seq_indices]) + + 
formatted_pos[i, target_indices] = source_indices + formatted_input[i, target_indices] = input_ids[i, source_indices] + formatted_seg[i, target_indices] = segment_ids[i, source_indices] + formatted_mask_labels[i] = masked_lm_ids[i, : self.max_mask_tokens] + + return [ + formatted_input.astype(np.int32), + formatted_seg.astype(np.int32), + formatted_pos.astype(np.int32), + valid_mask_len.astype(np.int32), + valid_seq_len.astype(np.int32), + formatted_mask_labels.astype(np.int32), + next_sentence_labels.astype(np.int32), + ] + + def load_one_file(self, file_path): + data = self.load_hdf5(file_path) + + if self.shuffle: + shuffle_dict(data, len(data[KEYS[0]])) + + return data + + def load_hdf5(self, filename): + with h5py.File(filename, "r") as f: + data = {key: np.asarray(f[key][:]) for key in KEYS} + return data + + +if __name__ == "__main__": + import glob + + base_dir = "data_path/wikicorpus_en/" + input_files = glob.glob(f"{base_dir}/*training*.hdf5") + input_files.sort() + # print(input_files) + + seed = 1984 + np.random.seed(seed) + paddle.seed(seed) + + data_loader = PretrainingHDF5DataLoader(input_files, batch_size=65536, shuffle=True) + + for idx, batch in enumerate(data_loader): + print(f"{idx}: {batch[0].shape()}") diff --git a/model_zoo/bert/static_ipu/load_tf_ckpt.py b/model_zoo/bert/static_ipu/load_tf_ckpt.py new file mode 100644 index 0000000000000000000000000000000000000000..2837fd7a69885015b6a81dbb2fe524602ca42c92 --- /dev/null +++ b/model_zoo/bert/static_ipu/load_tf_ckpt.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
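+
+# Typical usage (illustrative sketch; the checkpoint path is an example, and the
+# `args` object is assumed to provide fields such as `task`, `num_hidden_layers`,
+# `hidden_size`, `vocab_size` and `max_position_embeddings`):
+#
+#     initializers, opt_params = load_initializers_from_tf("bert_model.ckpt", args)
+#     # `initializers` maps Paddle parameter names (e.g. "linear_0.w_0") to numpy
+#     # arrays; `opt_params` holds any mapped optimizer moment tensors.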
+ +import os +from logging import getLogger + +import numpy as np + +logger = getLogger(__name__) + + +def get_tf_mapping(args): + squad_mapping = {"cls/squad/output_weights": "linear_72.w_0", "cls/squad/output_bias": "linear_72.b_0"} + + tf_to_pdmodel = { + "bert/embeddings/word_embeddings": "ipu_bert_embeddings_0.w_0", + "bert/embeddings/position_embeddings": "embedding_0.w_0", + "bert/embeddings/token_type_embeddings": "ipu_bert_embeddings_0.w_1", + "bert/embeddings/LayerNorm/gamma": "layer_norm_0.w_0", + "bert/embeddings/LayerNorm/beta": "layer_norm_0.b_0", + } + for i in range(args.num_hidden_layers): + layer = { + f"bert/encoder/layer_{i}/attention/self/query/bias": f"bert_model_0.b_{i}", + f"bert/encoder/layer_{i}/attention/self/key/bias": f"bert_model_0.b_{i}", + f"bert/encoder/layer_{i}/attention/self/value/bias": f"bert_model_0.b_{i}", + f"bert/encoder/layer_{i}/attention/output/dense/kernel": f"linear_{i*6}.w_0", + f"bert/encoder/layer_{i}/attention/output/dense/bias": f"linear_{i*6}.b_0", + f"bert/encoder/layer_{i}/attention/output/LayerNorm/gamma": f"layer_norm_{i*4+2}.w_0", + f"bert/encoder/layer_{i}/attention/output/LayerNorm/beta": f"layer_norm_{i*4+2}.b_0", + f"bert/encoder/layer_{i}/intermediate/dense/kernel": f"linear_{i*6+2}.w_0", + f"bert/encoder/layer_{i}/intermediate/dense/bias": f"linear_{i*6+2}.b_0", + f"bert/encoder/layer_{i}/output/dense/kernel": f"linear_{i*6+3}.w_0", + f"bert/encoder/layer_{i}/output/dense/bias": f"linear_{i*6+3}.b_0", + f"bert/encoder/layer_{i}/output/LayerNorm/gamma": f"layer_norm_{(i+1)*4}.w_0", + f"bert/encoder/layer_{i}/output/LayerNorm/beta": f"layer_norm_{(i+1)*4}.b_0", + } + layer[f"bert/encoder/layer_{i}/attention/self/query/kernel"] = f"bert_model_0.w_{i*3+0}" + layer[f"bert/encoder/layer_{i}/attention/self/key/kernel"] = f"bert_model_0.w_{i*3+1}" + layer[f"bert/encoder/layer_{i}/attention/self/value/kernel"] = f"bert_model_0.w_{i*3+2}" + tf_to_pdmodel.update(**layer) + + if args.task == "PRETRAINING": + logger.error("Mapping ckpt weights is only supported in SQUAD task.") + elif args.task == "SQUAD": + tf_to_pdmodel.update(**squad_mapping) + + return tf_to_pdmodel + + +def generate_initializers(args, map_names, load_data, mapping, transform={}): + initializers = {} + initializers_param = {} + initializers_opt = {} + + qkv_tensor_range = { + "query": (0, args.hidden_size), + "key": (args.hidden_size, args.hidden_size * 2), + "value": (args.hidden_size * 2, args.hidden_size * 3), + } + + for name, array in zip(map_names, load_data): + logger.debug(f"Initialising tensor from checkpoint {name} -> {mapping[name]}") + + # config["lamb_m_dtype"] is for setting the data type for accl1 of lamb + # BERT can use FP16 for accl1 without lossing accuracy + # accl2 is always in FP32 + lamb_m_dtype = np.float32 + dtype = np.float32 + + if "moment1" in mapping[name]: + if array.dtype != lamb_m_dtype: + array = array.astype(lamb_m_dtype) + elif "moment2" in mapping[name]: + if array.dtype != np.float32: + array = array.astype(np.float32) + elif array.dtype != dtype: + array = array.astype(dtype) + + # If it's part of QKV biases, we need to handle separately as those 3 + # tensors need concatenating into one + if "bert_model_0.b" in mapping[name]: + qkv_part = name.split("/")[5] + if mapping[name] not in initializers.keys(): + qkv_shape = array.shape[0] * 3 + initializers[mapping[name]] = np.empty(qkv_shape, dtype=array.dtype) + + start_idx = qkv_tensor_range[qkv_part][0] + end_idx = qkv_tensor_range[qkv_part][1] + 
initializers[mapping[name]][start_idx:end_idx] = array + logger.debug(f"Initialising QKV_bias component {name}[{start_idx}:{end_idx}] from checkpoint") + continue + + if name in transform: + array = transform[name](array) + + padded_vocab_length = args.vocab_size + if "bert_embeddings_0.w_0" in mapping[name]: + tf_vocab_length = array.shape[0] + diff = padded_vocab_length - tf_vocab_length + # Pad or Crop the vocab. + if diff > 0: + logger.info(f"Padding the vocabulary. From {tf_vocab_length} to {padded_vocab_length}") + pad = np.zeros((diff, args.hidden_size)).astype(array.dtype) + array = np.concatenate((array, pad), axis=0) + else: + logger.warning( + f"Cropping the vocabulary may negatively effect performance. From {tf_vocab_length} to {padded_vocab_length}" + ) + array = np.array(array[:padded_vocab_length, :]) + # if args.task == "PRETRAINING": + # We use transposed weight in both pretraining and squad + array = np.transpose(array, [1, 0]) + + if "embedding_0.w_0" in mapping[name]: + max_pos, hidden_len = array.shape + if max_pos > args.max_position_embeddings: + array = array[: args.max_position_embeddings, :] + + # Otherwise just copy the positional embeddings over and over again as is done in longformer + elif max_pos < args.max_position_embeddings: + logger.warning("Not enough positional embeddings in checkpoint, copying to match length...") + array = array[np.mod(np.arange(args.max_position_embeddings), max_pos)] + + initializers[mapping[name]] = array.copy() + for k in initializers: + if "moment" in k: + initializers_opt[k] = initializers[k] + else: + initializers_param[k] = initializers[k] + return initializers_param, initializers_opt + + +# util function for load tf pretrained weight +def load_initializers_from_tf(file_path, args): + """ + Loads weights, etc. from Tensorflow files into a dictionary of Numpy Arrays. + + Can read either checkpoint files, or frozen graphs, according to the + `is_checkpoint` flag, passed in as the second argument. + """ + try: + import tensorflow as tf + except ImportError: + logger.error( + "Loading a TensorFlow model requires TensorFlow to be installed. " + "Please see https://www.tensorflow.org/install/ for installation " + "instructions." + ) + raise + + tf_path = os.path.abspath(file_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + + mapping = get_tf_mapping(args) + map_names = [name for name, shape in init_vars if name in mapping.keys()] + for name in (n for n, _ in init_vars if n not in mapping.keys()): + logger.debug(f"Skipping load of {name} - Not in mapping") + + load_data = [tf.train.load_variable(tf_path, name) for name in map_names] + initializers, opt_params = generate_initializers(args, map_names, load_data, mapping) + return initializers, opt_params diff --git a/model_zoo/bert/static_ipu/modeling.py b/model_zoo/bert/static_ipu/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..b8c0503e1be51052d1a78cdfabf3cf33783a4088 --- /dev/null +++ b/model_zoo/bert/static_ipu/modeling.py @@ -0,0 +1,635 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from contextlib import ExitStack +from typing import List, NamedTuple + +import numpy as np +import paddle +import paddle.fluid +import paddle.nn as nn +import paddle.static +from paddle.nn import Layer + + +class DeviceScope(object): + def __init__(self, index, stage, name_scope=None): + self.index = index + self.stage = stage + self.name_scope = name_scope + + def __enter__(self): + self.stack = ExitStack() + self.stack.enter_context(paddle.static.ipu_shard_guard(index=self.index, stage=self.stage)) + if self.name_scope is not None: + self.stack.enter_context(paddle.static.name_scope(self.name_scope)) + return self + + def __exit__(self, *exp): + self.stack.close() + return False + + +class IpuBertConfig(NamedTuple): + """ + The configuration for BERT Model. + Args: + seq_len (int): + The sequence length. Default to `128`. + max_position_embeddings (int): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + max_predictions_per_seq (int): + The max number of the masked token each sentence. Default to `20`. + hidden_size (int): + Dimensionality of the embedding layer, encoder layer and pooler layer. Defaults to `768`. + vocab_size (int): + Vocabulary size of `inputs_ids` in `BertModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `BertModel`. + num_hidden_layers (int): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + available_mem_proportion (float): + The available proportion of memory used by conv or matmul. Default to `0.28`. + type_vocab_size (int): + The vocabulary size of `token_type_ids`. + Defaults to `2`. + hidden_dropout_prob (float): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + task (str): + The type of the NLP model. + layers_per_ipu (list): + Number of attention layers executed on each IPU. 
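+        embeddings_scope / attn_scopes / ff_scopes / mlm_scope / nsp_scope (DeviceScope):
+            IPU placement scopes for the embedding, attention, feed-forward, MLM and NSP parts.
+
+    Example (illustrative sketch; the field values below are assumptions of this example):
+        .. code-block:: python
+
+            config = IpuBertConfig(
+                micro_batch_size=1,
+                seq_len=128,
+                num_hidden_layers=12,
+                layers_per_ipu=[4, 4, 4],
+                task="PRETRAINING",
+            )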
+ """ + + micro_batch_size: int = 1 + seq_len: int = 128 + max_position_embeddings: int = 512 + max_predictions_per_seq: int = 20 + hidden_size: int = 768 + vocab_size: int = 30400 + num_hidden_layers: int = 12 + available_mem_proportion: float = 0.28 + type_vocab_size: int = 2 + + hidden_dropout_prob: float = 0.1 + attention_probs_dropout_prob: float = 0.1 + + # Choices: PRETRAINING (MLM + NSP), SQUAD + task: str = "PRETRAINING" + layers_per_ipu: List = None + + embeddings_scope: DeviceScope = None + attn_scopes: DeviceScope = None + ff_scopes: DeviceScope = None + mlm_scope: DeviceScope = None + nsp_scope: DeviceScope = None + + +class IpuBertEmbeddings(Layer): + """ + Include embeddings from word, position and token_type embeddings + """ + + def __init__(self, config, custom_ops=None): + super(IpuBertEmbeddings, self).__init__() + self.config = config + self.word_embeddings_weights = self.create_parameter( + shape=[config.hidden_size, config.vocab_size], dtype="float32" + ) + self.token_embeddings_weights = self.create_parameter( + shape=[config.type_vocab_size, config.hidden_size], dtype="float32" + ) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.layer_norm = nn.LayerNorm(config.hidden_size, epsilon=0.001) + self.dropout = nn.Dropout(self.config.hidden_dropout_prob) + self.custom_ops = custom_ops + + def forward(self, indices, segments, positions): + # word embeddings + word_embeddings_weights = paddle.transpose(self.word_embeddings_weights, [1, 0]) + input_embeddings = paddle.gather(word_embeddings_weights, indices, axis=0) + + # position_embeddings + position_embeddings = self.position_embeddings(positions) + + # token_type_embeddings + token_type_embeddings = paddle.fluid.input.one_hot(segments, depth=2) + token_type_embeddings = paddle.matmul(token_type_embeddings, self.token_embeddings_weights) + + embeddings = paddle.add(input_embeddings, position_embeddings) + embeddings = paddle.add(embeddings, token_type_embeddings) + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings, self.word_embeddings_weights + + +class BertModel(Layer): + """ + The bare BERT Model transformer outputting raw hidden-states. + + This model refers to :class:`~paddlenlp.transformers.bert.BertModel`. + + Args: + config (IpuBertConfig): + configuration of bert. + custom_ops: + custom defined operators which can be found in directory `custom_ops`. 
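+
+    Example (illustrative sketch; `config` is an :class:`IpuBertConfig` and `custom_ops`
+    holds the compiled custom operators, both assumed to be prepared by the caller):
+        .. code-block:: python
+
+            model = BertModel(config, custom_ops)
+            sequence_output, word_embeddings = model(indices, segments, positions, input_mask)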
+ """ + + def __init__(self, config, custom_ops=None): + super(BertModel, self).__init__() + self.config = config + self.custom_ops = custom_ops + + qk_scale = 1 / np.sqrt(self.config.hidden_size / self.config.num_hidden_layers) + self.qk_scale_attrs = { + "name": "QK_scale", + "shape": [1], + "dtype": "float32", + "value": qk_scale, + } + self.qkv_shape = [-1, self.config.seq_len, 12, 64] + self.masks = {} + + self.embedding = IpuBertEmbeddings(self.config, custom_ops) + + def _encoder_layer_ipu_offset(self, layer_index): + encoder_index = 0 + if len(self.config.layers_per_ipu) == 1: + encoder_index = layer_index // self.config.layers_per_ipu[0] + else: + for ipu, num_layers in enumerate(self.config.layers_per_ipu): + layer_index -= num_layers + if layer_index < 0: + encoder_index = ipu + break + return encoder_index + + def should_checkpoint(self, layer_index): + encoder_index = self._encoder_layer_ipu_offset(layer_index) + if len(self.config.layers_per_ipu) == 1: + layers = self.config.layers_per_ipu[0] + layer_index -= encoder_index * layers + else: + layers = self.config.layers_per_ipu[encoder_index] + layer_index -= sum(self.config.layers_per_ipu[:encoder_index]) + return layer_index < (layers - 1) + + def forward(self, indices, segments, positions, input_mask): + r""" + The BertModel forward method, overrides the `__call__()` special method. + + Args: + indices (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + Its data type should be `int32` and it has a shape of [batch_size * sequence_length]. + segments (Tensor): + Segment token indices to indicate different portions of the inputs. + Selected in the range ``[0, type_vocab_size - 1]``. + Its data type should be `int32` and it has a shape of [batch_size * sequence_length]. + positions(Tensor): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + max_position_embeddings - 1]``. + Shape as `[batch_size * sequence_length]` and dtype as int32. + input_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + If the task is PRETRAINING: + input_mask[0] is the index that masking starts in the mask_tokens + input_mask[1] is the index that masking starts in the rest of the sequence + Otherwise + input_mask is the mask tensor that has -1000 in positions to be masked and 0 otherwise. + + Returns: + tuple: Returns tuple (`sequence_output`, `word_embeddings_weights`). + + With the fields: + + - `sequence_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. 
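+
+            - `word_embeddings_weights` (Tensor):
+                The word embedding weight matrix, returned so that the pretraining MLM head
+                can tie its output projection to the input embeddings.
+                Its data type should be float32 and its shape is [hidden_size, vocab_size].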
+ """ + + with self.config.embeddings_scope: + sequence_output, word_embeddings_weights = self.embedding(indices, segments, positions) + + if self.config.task == "PRETRAINING": + with paddle.static.ipu_shard_guard(index=0, stage=0): + input_mask[0] = self.custom_ops.detach(input_mask[0]) + input_mask[1] = self.custom_ops.detach(input_mask[1]) + + for i in range(self.config.num_hidden_layers): + # Attention + attn_scope = self.config.attn_scopes[i] + with attn_scope: + with paddle.static.name_scope(f"Layer{i}/Attention"): + layer_input = sequence_output + q = self.create_parameter( + shape=[self.config.hidden_size, self.config.hidden_size], dtype="float32" + ) + k = self.create_parameter( + shape=[self.config.hidden_size, self.config.hidden_size], dtype="float32" + ) + v = self.create_parameter( + shape=[self.config.hidden_size, self.config.hidden_size], dtype="float32" + ) + qkv = paddle.concat([q, k, v], axis=1) + qkv = paddle.matmul(sequence_output, qkv) + qkv.block.ops[-1]._set_attr("__available_memory", self.config.available_mem_proportion) + q, k, v = paddle.split( + qkv, + num_or_sections=[self.config.hidden_size, self.config.hidden_size, self.config.hidden_size], + axis=1, + ) + q = paddle.reshape(q, self.qkv_shape) + q = paddle.transpose(q, [0, 2, 1, 3]) + k = paddle.reshape(k, self.qkv_shape) + k = paddle.transpose(k, [0, 2, 3, 1]) + v = paddle.reshape(v, self.qkv_shape) + v = paddle.transpose(v, [0, 2, 1, 3]) + + # Attention calculation + with paddle.static.name_scope("Z"): + if self.config.task == "PRETRAINING": + if attn_scope.index in self.masks: + final_mask = self.masks[attn_scope.index] + else: + with paddle.static.name_scope("Mask"): + base_value = np.arange(self.config.seq_len).astype("int32") + base = paddle.fluid.layers.assign(base_value) + mmask = paddle.less_than(base, input_mask[0]) + mask_value = np.greater_equal(base_value, self.config.max_predictions_per_seq) + mask = paddle.fluid.layers.assign(mask_value) + mmask = paddle.logical_or(mmask, mask) + smask = paddle.less_than(base, input_mask[1]) + final_mask = paddle.logical_and(mmask, smask) + final_mask = paddle.cast(final_mask, "float16") + sub_attrs = { + "name": "constant_sub", + "shape": [1], + "dtype": "float32", + "value": 1, + } + mul_attrs = { + "name": "constant_mul", + "shape": [1], + "dtype": "float32", + "value": 1000, + } + final_mask = paddle.fluid.layers.elementwise_sub( + final_mask, paddle.fluid.layers.fill_constant(**sub_attrs) + ) + final_mask = paddle.fluid.layers.elementwise_mul( + final_mask, paddle.fluid.layers.fill_constant(**mul_attrs) + ) + final_mask = paddle.reshape(final_mask, [-1, 1, 1, self.config.seq_len]) + final_mask = self.custom_ops.detach(final_mask) + self.masks[attn_scope.index] = final_mask + + qk = paddle.matmul(q, k) + qk.block.ops[-1]._set_attr("__available_memory", self.config.available_mem_proportion) + qk_scale = paddle.fluid.layers.fill_constant(**self.qk_scale_attrs) + qk = paddle.fluid.layers.elementwise_mul(qk, qk_scale) + + if self.config.task == "PRETRAINING": + qk = paddle.fluid.layers.elementwise_add(qk, final_mask) + else: + # for SQUAD task, input_mask is calculated in data preprocessing + qk = paddle.fluid.layers.elementwise_add(qk, input_mask) + + qk = paddle.fluid.layers.softmax(qk) + if self.config.task == "SQUAD": + qk = paddle.fluid.layers.dropout( + qk, self.config.attention_probs_dropout_prob, dropout_implementation="upscale_in_train" + ) + qkv = paddle.matmul(qk, v) + qkv.block.ops[-1]._set_attr("__available_memory", 
self.config.available_mem_proportion) + qkv = paddle.transpose(qkv, [0, 2, 1, 3]) + qkv = paddle.reshape(qkv, [-1, self.config.hidden_size]) + + qkv_linear = nn.Linear(self.config.hidden_size, self.config.hidden_size, bias_attr=False) + qkv = qkv_linear(qkv) + qkv.block.ops[-1]._set_attr("__available_memory", self.config.available_mem_proportion) + qkv = paddle.fluid.layers.dropout( + qkv, self.config.attention_probs_dropout_prob, dropout_implementation="upscale_in_train" + ) + attention = paddle.add(layer_input, qkv) + layer_norm1 = nn.LayerNorm(self.config.hidden_size, epsilon=0.001) + attention = layer_norm1(attention) + + # FF + with self.config.ff_scopes[i]: + with paddle.static.name_scope(f"Layer{i}/FF"): + ff_linear1 = nn.Linear(self.config.hidden_size, 4 * self.config.hidden_size) + ff_linear2 = nn.Linear(4 * self.config.hidden_size, self.config.hidden_size) + with paddle.static.name_scope("1"): + ff = ff_linear1(attention) + ff.block.ops[-2]._set_attr("__available_memory", self.config.available_mem_proportion) + ff = paddle.fluid.layers.gelu(ff, approximate=True) + with paddle.static.name_scope("2"): + ff = ff_linear2(ff) + ff.block.ops[-2]._set_attr("__available_memory", self.config.available_mem_proportion) + ff = paddle.fluid.layers.dropout( + ff, self.config.attention_probs_dropout_prob, dropout_implementation="upscale_in_train" + ) + ff = paddle.add(attention, ff) + layer_norm2 = nn.LayerNorm(self.config.hidden_size, epsilon=0.001) + sequence_output = layer_norm2(ff) + + if self.should_checkpoint(i): + with paddle.static.name_scope(f"Layer{i}"): + logging.info(f"add checkpointoutput for ff_{i}") + sequence_output = self.custom_ops.checkpointoutput(sequence_output) + return sequence_output, word_embeddings_weights + + +class IpuBertForQuestionAnswering(Layer): + """ + Bert Model with a span classification head on top for extractive question-answering tasks like + SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and + `span end logits`). + + Args: + hidden_size (int): + Dimensionality of the embedding layer, encoder layer and pooler layer. Defaults to `768`. + seq_len (int): + See :class:`IpuBertConfig`. + """ + + def __init__(self, hidden_size, seq_len): + super(IpuBertForQuestionAnswering, self).__init__() + self.hidden_size = hidden_size + self.seq_len = seq_len + self.classifier = nn.Linear(hidden_size, 2) + + def forward(self, sequence_output): + r""" + The IpuBertForQuestionAnswering forward method, overrides the __call__() special method. + + Args: + sequence_output (Tensor): + See :class:`BertModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. 
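+
+        Example (illustrative sketch; `sequence_output` is the first output of
+        :class:`BertModel` and the constructor arguments are example values):
+            .. code-block:: python
+
+                qa_head = IpuBertForQuestionAnswering(hidden_size=768, seq_len=384)
+                start_logits, end_logits = qa_head(sequence_output)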
+ """ + logits = self.classifier(sequence_output) + + start_logits = paddle.slice(input=logits, axes=[1], starts=[0], ends=[1]) + end_logits = paddle.slice(input=logits, axes=[1], starts=[1], ends=[2]) + + start_logits = paddle.reshape(start_logits, [-1, self.seq_len]) + end_logits = paddle.reshape(end_logits, [-1, self.seq_len]) + return start_logits, end_logits + + +class IpuBertQAAccAndLoss(paddle.nn.Layer): + """ + Criterion for Question and Answering. + """ + + def __init__(self, custom_ops=None): + super(IpuBertQAAccAndLoss, self).__init__() + self.custom_ops = custom_ops + + def forward(self, start_logits, end_logits, start_labels, end_labels): + r""" + The IpuBertQAAccAndLoss forward method, overrides the __call__() special method. + + Args: + start_logits (Tensor): + See :class:`IpuBertForQuestionAnswering`. + end_logits (Tensor): + See :class:`IpuBertForQuestionAnswering`. + start_labels (Tensor): + Labels for start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + end_labels (Tensor): + Labels for end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + """ + with paddle.static.name_scope("loss"): + start_loss = paddle.fluid.layers.softmax(start_logits) + start_loss = self.custom_ops.custom_nll_loss(start_loss, start_labels, 1, "None", False) + end_loss = paddle.fluid.layers.softmax(end_logits) + end_loss = self.custom_ops.custom_nll_loss(end_loss, end_labels, 1, "None", False) + loss = paddle.add(start_loss, end_loss) + + with paddle.static.name_scope("acc"): + start_logits = paddle.fluid.layers.argmax(start_logits, axis=1) + end_logits = paddle.fluid.layers.argmax(end_logits, axis=1) + start_equal = paddle.fluid.layers.equal(start_logits, start_labels) + end_equal = paddle.fluid.layers.equal(end_logits, end_labels) + start_equal = paddle.fluid.layers.cast(start_equal, "float32") + end_equal = paddle.fluid.layers.cast(end_equal, "float32") + start_acc = paddle.mean(start_equal) + end_acc = paddle.mean(end_equal) + + return start_acc, end_acc, loss + + +class IpuBertPretrainingMLMHeads(Layer): + """ + Perform language modeling task. + + Args: + hidden_size (int): + See :class:`IpuBertConfig`. + vocab_size (int): + See :class:`IpuBertConfig`. + max_position_embeddings (int): + See :class:`IpuBertConfig`. + max_predictions_per_seq (int): + See :class:`IpuBertConfig`. + seq_len (int): + See :class:`IpuBertConfig`. 
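+        Returns:
+            tuple: Returns tuple (`start_acc`, `end_acc`, `loss`): the accuracy of the predicted
+            start positions, the accuracy of the predicted end positions, and the sum of the
+            negative log-likelihood losses of the two predictions.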
+ """ + + def __init__(self, hidden_size, vocab_size, max_position_embeddings, max_predictions_per_seq, seq_len): + super(IpuBertPretrainingMLMHeads, self).__init__() + self.hidden_size = hidden_size + self.vocab_size = vocab_size + self.max_position_embeddings = max_position_embeddings + self.max_predictions_per_seq = max_predictions_per_seq + self.sequence_length = seq_len + self.transform = nn.Linear(hidden_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size, epsilon=0.001) + + def forward(self, encoders_output, word_embeddings_weights): + # cls + out = self.transform(encoders_output) + out = paddle.fluid.layers.gelu(out, approximate=True) + out = self.layer_norm(out) + + # mlm + out = paddle.reshape(out, [-1, self.sequence_length, self.hidden_size]) + out = paddle.slice(out, [1], [0], [self.max_predictions_per_seq]) + out = paddle.reshape(out, [-1, self.hidden_size]) + + # serialized matmul + out = paddle.matmul(out, word_embeddings_weights) + out.block.ops[-1]._set_attr("serialize_factor", 5) + mlm_out = paddle.reshape(out, [-1, self.max_predictions_per_seq, self.vocab_size]) + + return mlm_out + + +class IpuBertPretrainingNSPHeads(Layer): + """ + Perform next sequence classification task. + + Args: + hidden_size (int): + See :class:`IpuBertConfig`. + max_predictions_per_seq (int): + See :class:`IpuBertConfig`. + seq_len (int): + See :class:`IpuBertConfig`. + """ + + def __init__(self, hidden_size, max_predictions_per_seq, seq_len): + super(IpuBertPretrainingNSPHeads, self).__init__() + self.hidden_size = hidden_size + self.max_predictions_per_seq = max_predictions_per_seq + self.seq_len = seq_len + self.seq_relationship = nn.Linear(hidden_size, 2) + self.pooler = IpuBertPooler(hidden_size, self.seq_len, self.max_predictions_per_seq) + + def forward(self, encoders_output): + pooled_output = self.pooler(encoders_output) + nsp_out = self.seq_relationship(pooled_output) + return nsp_out + + +class IpuBertPooler(Layer): + """ + Pool the result of BertEncoder. + """ + + def __init__(self, hidden_size, sequence_length, max_predictions_per_seq, pool_act="tanh"): + super(IpuBertPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + self.pool_act = pool_act + self.sequence_length = sequence_length + self.max_predictions_per_seq = max_predictions_per_seq + self.hidden_size = hidden_size + + def forward(self, hidden_states): + hidden_states = paddle.reshape(hidden_states, [-1, self.sequence_length, self.hidden_size]) + first_token_tensor = paddle.slice( + input=hidden_states, + axes=[1], + starts=[self.max_predictions_per_seq], + ends=[self.max_predictions_per_seq + 1], + ) + first_token_tensor = paddle.reshape(first_token_tensor, [-1, self.hidden_size]) + pooled_output = self.dense(first_token_tensor) + if self.pool_act == "tanh": + pooled_output = self.activation(pooled_output) + return pooled_output + + +class IpuBertPretrainingMLMAccAndLoss(Layer): + """ + Criterion for masked language modeling. 
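+
+    Computes the masked-LM accuracy over tokens whose label differs from `ignore_index`,
+    together with the masked-LM loss produced by the custom `custom_nll_loss` operator.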
+ """ + + def __init__(self, micro_batch, ignore_index, custom_ops): + super(IpuBertPretrainingMLMAccAndLoss, self).__init__() + self.micro_batch = micro_batch + self.ignore_index = ignore_index + self.custom_ops = custom_ops + + def forward(self, mlm, masked_lm_ids): + mlm_pred = paddle.fluid.layers.argmax(mlm, axis=-1) + mlm_pred = paddle.cast(mlm_pred, "int32") + with paddle.static.name_scope("Accuracy"): + mlm_label = paddle.cast(masked_lm_ids, "int32") + mlm_correct = paddle.fluid.layers.equal(mlm_pred, mlm_label) + attrs = { + "name": "mlm_mask_val", + "shape": [1], + "dtype": "int32", + "value": self.ignore_index, + } + mlm_mask_val = paddle.fluid.layers.fill_constant(**attrs) + mlm_unmask = paddle.fluid.layers.equal(mlm_label, mlm_mask_val) + mlm_mask = paddle.logical_not(mlm_unmask) + mlm_mask = paddle.cast(mlm_mask, "float32") + mlm_correct = paddle.cast(mlm_correct, "float32") + masked_mlm_correct = paddle.fluid.layers.elementwise_mul(mlm_correct, mlm_mask) + total_correct_tokens = paddle.fluid.layers.reduce_sum(masked_mlm_correct) + total_tokens = paddle.fluid.layers.reduce_sum(mlm_mask) + total_correct_tokens = paddle.cast(total_correct_tokens, "float32") + total_tokens = paddle.cast(total_tokens, "float32") + mlm_acc = paddle.fluid.layers.elementwise_div(total_correct_tokens, total_tokens) + + masked_lm_softmax = paddle.fluid.layers.softmax(mlm) + mlm_loss = self.custom_ops.custom_nll_loss(masked_lm_softmax, masked_lm_ids, 1, str(self.ignore_index), False) + + return mlm_acc, mlm_loss + + +class IpuBertPretrainingNSPAccAndLoss(Layer): + """ + Criterion for next sequence classification. + """ + + def __init__(self, micro_batch, ignore_index, custom_ops): + super(IpuBertPretrainingNSPAccAndLoss, self).__init__() + self.micro_batch = micro_batch + self.ignore_index = ignore_index + self.custom_ops = custom_ops + + def forward(self, nsp, nsp_label): + nsp_pred = paddle.fluid.layers.argmax(nsp, axis=-1) + nsp_pred = paddle.cast(nsp_pred, "int32") + with paddle.static.name_scope("Accuracy"): + nsp_label = paddle.cast(nsp_label, "int32") + nsp_correct = paddle.fluid.layers.equal(nsp_pred, nsp_label) + nsp_correct = paddle.cast(nsp_correct, "int32") + nsp_correct = paddle.fluid.layers.reduce_sum(nsp_correct) + nsp_correct = paddle.cast(nsp_correct, "float32") + attrs = { + "name": "mlm_mask_val", + "shape": [1], + "dtype": "int32", + "value": self.micro_batch, + } + nsp_total = paddle.fluid.layers.fill_constant(**attrs) + nsp_total = paddle.cast(nsp_total, "float32") + nsp_acc = paddle.fluid.layers.elementwise_div(nsp_correct, nsp_total) + + next_sentence_softmax = paddle.fluid.layers.softmax(nsp) + nsp_loss = self.custom_ops.custom_nll_loss(next_sentence_softmax, nsp_label, 1, "None", False) + + return nsp_acc, nsp_loss diff --git a/model_zoo/bert/static_ipu/requirements.txt b/model_zoo/bert/static_ipu/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..43dee67df533055d39178a88c6c7978a9029b365 --- /dev/null +++ b/model_zoo/bert/static_ipu/requirements.txt @@ -0,0 +1,8 @@ +datasets +h5py +multiprocess +numpy +paddlenlp +scipy +wandb +tqdm diff --git a/model_zoo/bert/static_ipu/run_pretrain.py b/model_zoo/bert/static_ipu/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..8c45cc6cfb4392ea51c9e2c0ad9182567c14f240 --- /dev/null +++ b/model_zoo/bert/static_ipu/run_pretrain.py @@ -0,0 +1,393 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +import pickle +import random +import time + +import numpy as np +import paddle +import paddle.optimizer +import paddle.static +from dataset_ipu import PretrainingHDF5DataLoader +from modeling import ( + BertModel, + DeviceScope, + IpuBertConfig, + IpuBertPretrainingMLMAccAndLoss, + IpuBertPretrainingMLMHeads, + IpuBertPretrainingNSPAccAndLoss, + IpuBertPretrainingNSPHeads, +) +from scipy.stats import truncnorm +from utils import ProgressFunc, load_custom_ops, parse_args + +from paddlenlp.transformers import LinearDecayWithWarmup + + +def set_seed(seed): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def create_data_holder(args): + bs = args.micro_batch_size + indices = paddle.static.data(name="indices", shape=[bs * args.seq_len], dtype="int32") + segments = paddle.static.data(name="segments", shape=[bs * args.seq_len], dtype="int32") + positions = paddle.static.data(name="positions", shape=[bs * args.seq_len], dtype="int32") + mask_tokens_mask_idx = paddle.static.data(name="mask_tokens_mask_idx", shape=[bs, 1], dtype="int32") + sequence_mask_idx = paddle.static.data(name="sequence_mask_idx", shape=[bs, 1], dtype="int32") + masked_lm_ids = paddle.static.data(name="masked_lm_ids", shape=[bs, args.max_predictions_per_seq], dtype="int32") + next_sentence_labels = paddle.static.data(name="next_sentence_labels", shape=[bs], dtype="int32") + return [indices, segments, positions, mask_tokens_mask_idx, sequence_mask_idx, masked_lm_ids, next_sentence_labels] + + +def reset_program_state_dict(state_dict, mean=0, scale=0.02): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." 
+ """ + new_state_dict = dict() + for n, p in state_dict.items(): + if ( + n.endswith("_moment1_0") + or n.endswith("_moment2_0") + or n.endswith("_beta2_pow_acc_0") + or n.endswith("_beta1_pow_acc_0") + ): + continue + if "learning_rate" in n: + continue + + dtype_str = "float32" + if p._dtype == paddle.float64: + dtype_str = "float64" + + if "layer_norm" in n and n.endswith(".w_0"): + new_state_dict[n] = np.ones(p.shape()).astype(dtype_str) + continue + + if n.endswith(".b_0"): + new_state_dict[n] = np.zeros(p.shape()).astype(dtype_str) + else: + new_state_dict[n] = truncnorm.rvs(-2, 2, loc=mean, scale=scale, size=p.shape()).astype(dtype_str) + return new_state_dict + + +def create_ipu_strategy(args): + ipu_strategy = paddle.static.IpuStrategy() + options = { + "is_training": args.is_training, + "enable_manual_shard": True, + "enable_pipelining": True, + "batches_per_step": args.batches_per_step, + "micro_batch_size": args.micro_batch_size, + "loss_scaling": args.scale_loss, + "enable_replicated_graphs": True, + "replicated_graph_count": args.num_replica, + "num_ipus": args.num_ipus * args.num_replica, + "enable_gradient_accumulation": args.enable_grad_acc, + "accumulation_factor": args.grad_acc_factor, + "auto_recomputation": 3, + "enable_half_partial": True, + "available_memory_proportion": args.available_mem_proportion, + "enable_stochastic_rounding": True, + "max_weight_norm": 65504.0, + "default_prefetch_buffering_depth": 3, + "rearrange_anchors_on_host": False, + "enable_fp16": args.ipu_enable_fp16, + "random_seed": args.seed, + "use_no_bias_optimizer": True, + "enable_prefetch_datastreams": True, + "enable_outlining": True, + "subgraph_copying_strategy": 1, # JustInTime + "outline_threshold": 10.0, + "disable_grad_accumulation_tensor_streams": True, + "schedule_non_weight_update_gradient_consumers_early": True, + "cache_path": "paddle_cache", + "enable_floating_point_checks": False, + "accl1_type": args.accl1_type, + "accl2_type": args.accl2_type, + "weight_decay_mode": args.weight_decay_mode, + } + + if not args.optimizer_state_offchip: + options["location_optimizer"] = { + "on_chip": 1, # popart::TensorStorage::OnChip + "use_replicated_tensor_sharding": 1, # popart::ReplicatedTensorSharding::On + } + + # use popart::AccumulateOuterFragmentSchedule::OverlapMemoryOptimized + # excludedVirtualGraphs = [0] + options["accumulate_outer_fragment"] = {3: [0]} + + options["convolution_options"] = {"partialsType": "half"} + options["engine_options"] = { + "opt.useAutoloader": "true", + "target.syncReplicasIndependently": "true", + "exchange.streamBufferOverlap": "hostRearrangeOnly", + } + + options["enable_engine_caching"] = args.enable_engine_caching + + options["compilation_progress_logger"] = ProgressFunc + + ipu_strategy.set_options(options) + + # enable custom patterns + ipu_strategy.enable_pattern("DisableAttnDropoutBwdPattern") + + return ipu_strategy + + +def main(args): + paddle.enable_static() + place = paddle.set_device("ipu") + set_seed(args.seed) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + + # The sharding of encoder layers + if args.num_hidden_layers == 12: + attn_index = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] + ff_index = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] + else: + raise Exception("Only support num_hidden_layers = 12") + + bert_config = {k: getattr(args, k) for k in IpuBertConfig._fields if hasattr(args, k)} + bert_config["embeddings_scope"] = DeviceScope(0, 0, "Embedding") + bert_config["attn_scopes"] 
= [DeviceScope(attn_index[i], attn_index[i]) for i in range(args.num_hidden_layers)] + bert_config["ff_scopes"] = [DeviceScope(ff_index[i], ff_index[i]) for i in range(args.num_hidden_layers)] + bert_config["mlm_scope"] = DeviceScope(0, args.num_ipus, "MLM") + bert_config["nsp_scope"] = DeviceScope(0, args.num_ipus, "NSP") + bert_config["layers_per_ipu"] = [4, 4, 4] + + config = IpuBertConfig(**bert_config) + + # custom_ops + custom_ops = load_custom_ops() + + # Load the training dataset + logging.info("Loading dataset") + input_files = [ + os.path.join(args.input_files, f) + for f in os.listdir(args.input_files) + if os.path.isfile(os.path.join(args.input_files, f)) and "training" in f + ] + input_files.sort() + + dataset = PretrainingHDF5DataLoader( + input_files=input_files, + max_seq_length=args.seq_len, + max_mask_tokens=args.max_predictions_per_seq, + batch_size=args.batch_size, + shuffle=args.shuffle, + ) + logging.info(f"dataset length: {len(dataset)}") + total_samples = dataset.total_samples + logging.info( + "total samples: %d, total batch_size: %d, max steps: %d" % (total_samples, args.batch_size, args.max_steps) + ) + + logging.info("Building Model") + + [ + indices, + segments, + positions, + mask_tokens_mask_idx, + sequence_mask_idx, + masked_lm_ids, + next_sentence_labels, + ] = create_data_holder(args) + + # Encoder Layers + bert_model = BertModel(config, custom_ops) + encoders, word_embedding = bert_model(indices, segments, positions, [mask_tokens_mask_idx, sequence_mask_idx]) + + # PretrainingHeads + mlm_heads = IpuBertPretrainingMLMHeads( + args.hidden_size, args.vocab_size, args.max_position_embeddings, args.max_predictions_per_seq, args.seq_len + ) + nsp_heads = IpuBertPretrainingNSPHeads(args.hidden_size, args.max_predictions_per_seq, args.seq_len) + + # AccAndLoss + nsp_criterion = IpuBertPretrainingNSPAccAndLoss(args.micro_batch_size, args.ignore_index, custom_ops) + mlm_criterion = IpuBertPretrainingMLMAccAndLoss(args.micro_batch_size, args.ignore_index, custom_ops) + + with config.nsp_scope: + nsp_out = nsp_heads(encoders) + nsp_acc, nsp_loss = nsp_criterion(nsp_out, next_sentence_labels) + + with config.mlm_scope: + mlm_out = mlm_heads(encoders, word_embedding) + ( + mlm_acc, + mlm_loss, + ) = mlm_criterion(mlm_out, masked_lm_ids) + total_loss = mlm_loss + nsp_loss + + # lr_scheduler + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, args.max_steps, args.warmup_steps) + # optimizer + optimizer = paddle.optimizer.Lamb( + learning_rate=lr_scheduler, + lamb_weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.adam_epsilon, + ) + optimizer.minimize(total_loss) + + # Static executor + exe = paddle.static.Executor(place) + exe.run(startup_program) + + # Set initial weights + state_dict = main_program.state_dict() + reset_state_dict = reset_program_state_dict(state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + if args.enable_load_params: + logging.info(f"loading weights from: {args.load_params_path}") + if not args.load_params_path.endswith("pdparams"): + raise Exception("need pdparams file") + with open(args.load_params_path, "rb") as file: + params = pickle.load(file) + paddle.static.set_program_state(main_program, params) + + # Create ipu_strategy + ipu_strategy = create_ipu_strategy(args) + + feed_list = [ + "indices", + "segments", + "positions", + "mask_tokens_mask_idx", + "sequence_mask_idx", + "masked_lm_ids", + "next_sentence_labels", + ] + fetch_list = [mlm_acc.name, mlm_loss.name, 
nsp_acc.name, nsp_loss.name] + + # Compile program for IPU + ipu_compiler = paddle.static.IpuCompiledProgram(main_program, ipu_strategy=ipu_strategy) + logging.info("start compiling, please wait some minutes") + cur_time = time.time() + main_program = ipu_compiler.compile(feed_list, fetch_list) + time_cost = time.time() - cur_time + logging.info(f"finish compiling! time cost: {time_cost}") + + batch_start = time.time() + global_step = 0 + for batch in dataset: + global_step += 1 + epoch = global_step * args.batch_size // total_samples + read_cost = time.time() - batch_start + + feed = { + "indices": batch[0], + "segments": batch[1], + "positions": batch[2], + "mask_tokens_mask_idx": batch[3], + "sequence_mask_idx": batch[4], + "masked_lm_ids": batch[5], + "next_sentence_labels": batch[6], + } + lr_scheduler.step() + + train_start = time.time() + loss_return = exe.run(main_program, feed=feed, fetch_list=fetch_list, use_program_cache=True) + train_cost = time.time() - train_start + total_cost = time.time() - batch_start + tput = args.batch_size / total_cost + + if args.wandb: + wandb.log( + { + "epoch": epoch, + "global_step": global_step, + "loss/MLM": np.mean(loss_return[1]), + "loss/NSP": np.mean(loss_return[3]), + "accuracy/MLM": np.mean(loss_return[0]), + "accuracy/NSP": np.mean(loss_return[2]), + "latency/read": read_cost, + "latency/train": train_cost, + "latency/e2e": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + if global_step % args.logging_steps == 0: + logging.info( + { + "epoch": epoch, + "global_step": global_step, + "loss/MLM": np.mean(loss_return[1]), + "loss/NSP": np.mean(loss_return[3]), + "accuracy/MLM": np.mean(loss_return[0]), + "accuracy/NSP": np.mean(loss_return[2]), + "latency/read": read_cost, + "latency/train": train_cost, + "latency/e2e": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + if global_step % args.save_steps == 0: + ipu_compiler._backend.weights_to_host() + paddle.static.save(main_program.org_program, os.path.join(args.output_dir, "step_{}".format(global_step))) + + if global_step >= args.max_steps: + ipu_compiler._backend.weights_to_host() + paddle.static.save( + main_program.org_program, os.path.join(args.output_dir, "final_step_{}".format(global_step)) + ) + dataset.release() + del dataset + return + + batch_start = time.time() + + +if __name__ == "__main__": + args = parse_args() + + logging.basicConfig( + level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S %a" + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir, exist_ok=True) + + if args.wandb: + import wandb + + wandb.init(project="paddle-base-bert", settings=wandb.Settings(console="off"), name="paddle-base-bert") + wandb_config = vars(args) + wandb_config["global_batch_size"] = args.batch_size + wandb.config.update(args) + + logging.info(args) + main(args) + logging.info("program finished") diff --git a/model_zoo/bert/static_ipu/run_squad.py b/model_zoo/bert/static_ipu/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..e3c76ed612a8351a53a79409f64432f5b374239e --- /dev/null +++ b/model_zoo/bert/static_ipu/run_squad.py @@ -0,0 +1,482 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import os +import pickle +import time +from functools import partial + +import numpy as np +import paddle +import paddle.optimizer +import paddle.static +from datasets import load_dataset +from modeling import ( + BertModel, + DeviceScope, + IpuBertConfig, + IpuBertForQuestionAnswering, + IpuBertQAAccAndLoss, +) +from paddle.io import BatchSampler, DataLoader +from run_pretrain import create_ipu_strategy, reset_program_state_dict, set_seed +from utils import load_custom_ops, parse_args + +from paddlenlp.data import Dict, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import BertTokenizer, LinearDecayWithWarmup + + +def create_data_holder(args): + bs = args.micro_batch_size + indices = paddle.static.data(name="indices", shape=[bs * args.seq_len], dtype="int32") + segments = paddle.static.data(name="segments", shape=[bs * args.seq_len], dtype="int32") + positions = paddle.static.data(name="positions", shape=[bs * args.seq_len], dtype="int32") + input_mask = paddle.static.data(name="input_mask", shape=[bs, 1, 1, args.seq_len], dtype="float32") + if not args.is_training: + return [indices, segments, positions, input_mask] + else: + start_labels = paddle.static.data(name="start_labels", shape=[bs], dtype="int32") + end_labels = paddle.static.data(name="end_labels", shape=[bs], dtype="int32") + return [indices, segments, positions, input_mask, start_labels, end_labels] + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, + contexts, + stride=128, + max_seq_len=args.seq_len, + pad_to_max_seq_len=True, + return_position_ids=True, + return_token_type_ids=True, + return_attention_mask=True, + return_length=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + tokenized_examples["input_mask"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. 
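+        # Notes on the labeling below: token_type_ids double as the sequence ids
+        # (0 = question and special tokens, 1 = context); the attention mask is turned
+        # into an additive mask of shape [1, 1, seq_len] (0 for real tokens, -1000 for
+        # padding); and any answer whose character span does not fit inside the
+        # tokenized window is mapped to the CLS index.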
+ input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + sequence_ids = tokenized_examples["token_type_ids"][i] + + # attention_mask to input_mask + input_mask = (np.asarray(tokenized_examples["attention_mask"][i]) - 1) * 1e3 + input_mask = np.expand_dims(input_mask, axis=(0, 1)) + if args.ipu_enable_fp16: + input_mask = input_mask.astype(np.float16) + else: + input_mask = input_mask.astype(np.float32) + tokenized_examples["input_mask"].append(input_mask) + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + tokenized_examples = tokenizer( + questions, + contexts, + stride=128, + max_seq_len=args.seq_len, + pad_to_max_seq_len=True, + return_position_ids=True, + return_token_type_ids=True, + return_attention_mask=True, + return_length=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. 
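+    # Unlike training, each feature also keeps its example_id and an offset mapping
+    # (with non-context positions nulled out below) so that compute_prediction can map
+    # predicted token spans back to character spans in the original context.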
+ sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + tokenized_examples["input_mask"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + input_ids = tokenized_examples["input_ids"][i] + sequence_A_lengths = input_ids.index(tokenizer.sep_token_id) + 2 + sequence_B_lengths = len(input_ids) - sequence_A_lengths + sequence_ids = [0] * sequence_A_lengths + [1] * sequence_B_lengths + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + # attention_mask to input_mask + input_mask = (np.asarray(tokenized_examples["attention_mask"][i]) - 1) * 1e3 + input_mask = np.expand_dims(input_mask, axis=(0, 1)) + if args.ipu_enable_fp16: + input_mask = input_mask.astype(np.float16) + else: + input_mask = input_mask.astype(np.float32) + tokenized_examples["input_mask"].append(input_mask) + + return tokenized_examples + + +def load_squad_dataset(args): + tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + features_fn = prepare_train_features if args.is_training else prepare_validation_features + if args.is_training: + raw_dataset = load_dataset("squad", split="train") + else: + raw_dataset = load_dataset("squad", split="validation") + column_names = raw_dataset.column_names + dataset = raw_dataset.map( + partial(features_fn, tokenizer=tokenizer, args=args), batched=True, remove_columns=column_names, num_proc=4 + ) + + bs = args.micro_batch_size * args.grad_acc_factor * args.batches_per_step * args.num_replica + args.batch_size = bs + if args.is_training: + train_batch_sampler = BatchSampler(dataset, batch_size=bs, shuffle=args.shuffle, drop_last=True) + else: + train_batch_sampler = BatchSampler(dataset, batch_size=bs, shuffle=args.shuffle, drop_last=False) + + if args.is_training: + collate_fn = lambda samples, fn=Dict( + { + "input_ids": Stack(), + "token_type_ids": Stack(), + "position_ids": Stack(), + "input_mask": Stack(), + "start_positions": Stack(), + "end_positions": Stack(), + } + ): fn(samples) + else: + collate_fn = lambda samples, fn=Dict( + {"input_ids": Stack(), "token_type_ids": Stack(), "position_ids": Stack(), "input_mask": Stack()} + ): fn(samples) + + data_loader = DataLoader( + dataset=dataset, batch_sampler=train_batch_sampler, collate_fn=collate_fn, return_list=True + ) + return raw_dataset, data_loader + + +def main(args): + paddle.enable_static() + place = paddle.set_device("ipu") + set_seed(args.seed) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + + # The sharding of encoder layers + if args.num_hidden_layers == 12: + attn_ipu_index = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] + ff_ipu_index = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] + else: + raise Exception("Only support 
num_hidden_layers = 12") + + bert_config = {k: getattr(args, k) for k in IpuBertConfig._fields if hasattr(args, k)} + bert_config["embeddings_scope"] = DeviceScope(0, 0, "Embedding") + bert_config["attn_scopes"] = [ + DeviceScope(attn_ipu_index[i], attn_ipu_index[i]) for i in range(args.num_hidden_layers) + ] + bert_config["ff_scopes"] = [DeviceScope(ff_ipu_index[i], ff_ipu_index[i]) for i in range(args.num_hidden_layers)] + bert_config["layers_per_ipu"] = [6, 6] + + config = IpuBertConfig(**bert_config) + + # custom_ops + custom_ops = load_custom_ops() + + logging.info("building model") + + if args.is_training: + [indices, segments, positions, input_mask, start_labels, end_labels] = create_data_holder(args) + else: + [indices, segments, positions, input_mask] = create_data_holder(args) + + # Encoder Layers + bert_model = BertModel(config, custom_ops) + encoders, _ = bert_model(indices, segments, positions, input_mask) + + squad_scope = DeviceScope(args.num_ipus - 1, args.num_ipus - 1, "squad") + with squad_scope: + qa_cls = IpuBertForQuestionAnswering(args.hidden_size, args.seq_len) + start_logits, end_logits = qa_cls(encoders) + + if args.is_training: + acc_loss = IpuBertQAAccAndLoss(custom_ops) + acc0, acc1, loss = acc_loss(start_logits, end_logits, start_labels, end_labels) + + # load squad dataset + raw_dataset, data_loader = load_squad_dataset(args) + + total_samples = len(data_loader.dataset) + max_steps = total_samples // args.batch_size * args.epochs + logging.info( + "total samples: %d, total batch_size: %d, max steps: %d" % (total_samples, args.batch_size, max_steps) + ) + + if args.is_training: + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, max_steps, args.warmup_steps) + optimizer = paddle.optimizer.Adam( + learning_rate=lr_scheduler, + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.adam_epsilon, + ) + optimizer.minimize(loss) + + # Static executor + exe = paddle.static.Executor(place) + exe.run(startup_program) + + # Set initial weights + state_dict = main_program.state_dict() + reset_state_dict = reset_program_state_dict(state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + if args.enable_load_params: + logging.info(f"loading weights from: {args.load_params_path}") + if not args.load_params_path.endswith("pdparams"): + raise Exception("need pdparams file") + with open(args.load_params_path, "rb") as file: + params = pickle.load(file) + # Delete mlm and nsp weights + if args.is_training and "linear_72.w_0" in params: + params.pop("linear_72.w_0") + params.pop("linear_72.b_0") + paddle.static.set_program_state(main_program, params) + + if args.tf_checkpoint: + from load_tf_ckpt import load_initializers_from_tf + + logging.info(f"loading weights from: {args.tf_checkpoint}") + initializers, _ = load_initializers_from_tf(args.tf_checkpoint, args) + paddle.static.set_program_state(main_program, initializers) + + # Create ipu_strategy + ipu_strategy = create_ipu_strategy(args) + + if args.is_training: + feed_list = ["indices", "segments", "positions", "input_mask", "start_labels", "end_labels"] + fetch_list = [loss.name, acc0.name, acc1.name] + else: + feed_list = ["indices", "segments", "positions", "input_mask"] + fetch_list = [start_logits.name, end_logits.name] + + ipu_compiler = paddle.static.IpuCompiledProgram(main_program, ipu_strategy=ipu_strategy) + logging.info("start compiling, please wait some minutes") + cur_time = time.time() + main_program = ipu_compiler.compile(feed_list, fetch_list) + 
time_cost = time.time() - cur_time + logging.info(f"finish compiling! time cost: {time_cost}") + + if args.is_training: + global_step = 0 + batch_start = time.time() + for epoch in range(args.epochs): + for batch in data_loader: + global_step += 1 + + feed = { + "indices": batch[0], + "segments": batch[1], + "positions": batch[2], + "input_mask": batch[3], + "start_labels": batch[4], + "end_labels": batch[5], + } + lr_scheduler.step() + + train_start = time.time() + outputs = exe.run(main_program, feed=feed, fetch_list=fetch_list, use_program_cache=True) + train_cost = time.time() - train_start + total_cost = time.time() - batch_start + + tput = args.batch_size / total_cost + if args.wandb: + wandb.log( + { + "epoch": epoch, + "global_step": global_step, + "loss": np.mean(outputs[0]), + "accuracy": np.mean(outputs[1:]), + "train_cost": train_cost, + "total_cost": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + if global_step % args.logging_steps == 0: + logging.info( + { + "epoch": epoch, + "global_step": global_step, + "loss": np.mean(outputs[0]), + "accuracy": np.mean(outputs[1:]), + "train_cost": train_cost, + "total_cost": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + batch_start = time.time() + + # save final state + ipu_compiler._backend.weights_to_host() + paddle.static.save(main_program.org_program, os.path.join(args.output_dir, "Final_model")) + + if not args.is_training: + all_start_logits = [] + all_end_logits = [] + for step, batch in enumerate(data_loader): + if step % args.logging_steps == 0: + logging.info(f"running step: {step}") + + real_len = np.array(batch[0]).shape[0] + # padding zeros if needed + if real_len < args.batch_size: + batch = [np.asarray(x) for x in batch] + pad0 = np.zeros([args.batch_size - real_len, args.seq_len]).astype(batch[0].dtype) + batch[0] = np.vstack((batch[0], pad0)) + batch[1] = np.vstack((batch[1], pad0)) + batch[2] = np.vstack((batch[2], pad0)) + pad1 = np.zeros([args.batch_size - real_len, 1, 1, args.seq_len]) - 1e3 + pad1 = pad1.astype(batch[3].dtype) + batch[3] = np.vstack((batch[3], pad1)) + + feed = { + "indices": batch[0], + "segments": batch[1], + "positions": batch[2], + "input_mask": batch[3], + } + start_logits, end_logits = exe.run(main_program, feed=feed, fetch_list=fetch_list) + + start_logits = start_logits.reshape([-1, args.seq_len]) + end_logits = end_logits.reshape([-1, args.seq_len]) + for idx in range(real_len): + all_start_logits.append(start_logits[idx]) + all_end_logits.append(end_logits[idx]) + + # evaluate results + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, data_loader.dataset, (all_start_logits, all_end_logits) + ) + squad_evaluate( + examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json + ) + # write results to file + with open("squad_prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + +if __name__ == "__main__": + args = parse_args() + + logging.basicConfig( + level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S %a" + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir, exist_ok=True) + + if args.wandb: + import wandb + + wandb.init(project="paddle-squad", settings=wandb.Settings(console="off"), name="paddle-squad") + wandb_config = vars(args) + wandb_config["global_batch_size"] = args.batch_size + 
wandb.config.update(args) + + logging.info(args) + main(args) + logging.info("program finished") diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain.sh new file mode 100644 index 0000000000000000000000000000000000000000..cd1c5bb00f40e1aad40c614e695c42b8104d97ea --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain.sh @@ -0,0 +1,36 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase1_hdf5_dataset" \ + --output_dir pretrain_128_model \ + --seq_len 128 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 20 \ + --max_position_embeddings 512 \ + --learning_rate 0.006 \ + --weight_decay 1e-2 \ + --max_steps 7038 \ + --warmup_steps 2000 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 32 \ + --ipu_enable_fp16 True \ + --scale_loss 512 \ + --batches_per_step 1 \ + --num_replica 4 \ + --enable_grad_acc True \ + --grad_acc_factor 512 \ + --batch_size 65536 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params False \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --save_steps 1000 diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain_phase2.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain_phase2.sh new file mode 100644 index 0000000000000000000000000000000000000000..8458ed48b6b2528fceaa25cd64cbc52bb5d06bbe --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain_phase2.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase2_hdf5_dataset" \ + --output_dir pretrain_384_model \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 0.002828427125 \ + --weight_decay 1e-2 \ + --max_steps 2137 \ + --warmup_steps 274 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 8 \ + --ipu_enable_fp16 True \ + --scale_loss 128 \ + --batches_per_step 1 \ + --num_replica 4 \ + --enable_grad_acc True \ + --grad_acc_factor 512 \ + --batch_size 16384 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params True \ + --load_params_path "./pretrain_128_model/final_step_7038.pdparams" \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --save_steps 500 diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_squad.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_squad.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c36ef69d6b1a8fdffa9255b9e3f7b68b5b9ba90 --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_squad.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training True \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 0 \ + --epochs 4 \ + --warmup_steps 52 \ + --logging_steps 10 \ + --seed 42 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --accl1_type "FLOAT" \ + --accl2_type "FLOAT" \ + --weight_decay_mode 
"decay" \ + --scale_loss 256 \ + --optimizer_state_offchip False \ + --batches_per_step 4 \ + --num_replica 4 \ + --num_ipus 2 \ + --enable_grad_acc True \ + --grad_acc_factor 16 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "pretrain_384_model/final_step_2137.pdparams" diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_squad_infer.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_squad_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..28ffa72854436f2edaf16da927a4aa9ff6760d12 --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_squad_infer.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training False \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 1e-2 \ + --epochs 4 \ + --warmup_steps 52 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --scale_loss 256 \ + --optimizer_state_offchip False \ + --batches_per_step 4 \ + --num_replica 4 \ + --num_ipus 2 \ + --enable_grad_acc False \ + --grad_acc_factor 1 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.0 \ + --attention_probs_dropout_prob 0.0 \ + --shuffle False \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "squad_model/Final_model.pdparams" diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain.sh new file mode 100644 index 0000000000000000000000000000000000000000..299e0dc259811dad21a63f7e1619508819d07d4b --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain.sh @@ -0,0 +1,36 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase1_hdf5_dataset" \ + --output_dir pretrain_128_model \ + --seq_len 128 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 20 \ + --max_position_embeddings 512 \ + --learning_rate 0.006 \ + --weight_decay 1e-2 \ + --max_steps 7038 \ + --warmup_steps 2000 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 32 \ + --ipu_enable_fp16 True \ + --scale_loss 512 \ + --batches_per_step 1 \ + --num_replica 1 \ + --enable_grad_acc True \ + --grad_acc_factor 2048 \ + --batch_size 65536 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params False \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --save_steps 1000 diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain_phase2.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain_phase2.sh new file mode 100644 index 0000000000000000000000000000000000000000..89ec3ec4bab920768a2c8f9f2f4d87b8a3da434c --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain_phase2.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase2_hdf5_dataset" \ + --output_dir pretrain_384_model \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + 
--max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 0.002828427125 \ + --weight_decay 1e-2 \ + --max_steps 2137 \ + --warmup_steps 274 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 8 \ + --ipu_enable_fp16 True \ + --scale_loss 128 \ + --batches_per_step 1 \ + --num_replica 1 \ + --enable_grad_acc True \ + --grad_acc_factor 2048 \ + --batch_size 16384 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params True \ + --load_params_path "./pretrain_128_model/final_step_7038.pdparams" \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --save_steps 500 diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_squad.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_squad.sh new file mode 100644 index 0000000000000000000000000000000000000000..81302949c4a8f1f69c12f8dc137c26b9466dfa3e --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_squad.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training True \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 1e-2 \ + --epochs 4 \ + --warmup_steps 30 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --accl1_type "FLOAT" \ + --accl2_type "FLOAT" \ + --weight_decay_mode "decay" \ + --scale_loss 256 \ + --optimizer_state_offchip True \ + --batches_per_step 4 \ + --num_replica 2 \ + --num_ipus 2 \ + --enable_grad_acc True \ + --grad_acc_factor 64 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "pretrain_384_model/final_step_2137.pdparams" diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_squad_infer.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_squad_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..ae400c59e52839827c28183c9f12d20c3b344690 --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_squad_infer.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training False \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 1e-2 \ + --epochs 4 \ + --warmup_steps 52 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --scale_loss 256 \ + --optimizer_state_offchip False \ + --batches_per_step 4 \ + --num_replica 2 \ + --num_ipus 2 \ + --enable_grad_acc False \ + --grad_acc_factor 1 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.0 \ + --attention_probs_dropout_prob 0.0 \ + --shuffle False \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "squad_model/Final_model.pdparams" diff --git a/model_zoo/bert/static_ipu/utils.py b/model_zoo/bert/static_ipu/utils.py new file mode 100644 index 
0000000000000000000000000000000000000000..5280a3bbc014588c5d0eb7e8209df1293d707cc0 --- /dev/null +++ b/model_zoo/bert/static_ipu/utils.py @@ -0,0 +1,189 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import tqdm +from paddle.utils.cpp_extension import load + +from paddlenlp.trainer.argparser import strtobool + + +def load_custom_ops(): + cur_dir = os.path.dirname(os.path.realpath(__file__)) + custom_dir = cur_dir + "/custom_ops" + sources = [ + f"{custom_dir}/custom_shape_infer.cc", + f"{custom_dir}/custom_checkpointoutput.cc", + f"{custom_dir}/custom_detach.cc", + f"{custom_dir}/custom_identity.cc", + f"{custom_dir}/custom_nll_loss.cc", + f"{custom_dir}/tied_gather_pattern.cc", + f"{custom_dir}/tied_gather.cc", + f"{custom_dir}/disable_attn_dropout_bwd_pattern.cc", + f"{custom_dir}/workarounds/prevent_const_expr_folding_op.cc", + f"{custom_dir}/utils.cc", + ] + custom_ops = load( + name="custom_ops", + sources=sources, + extra_cxx_cflags=["-DONNX_NAMESPACE=onnx"], + build_directory=custom_dir, + ) + return custom_ops + + +class ProgressBar: + def __init__(self): + self._bar = None + self._last = 0 + + def __call__(self, progress: int, total: int): + if self._bar is None: + bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} " + bar_format += "[{elapsed}<{remaining}]" + self._bar = tqdm.tqdm(desc="Graph compilation", total=total, bar_format=bar_format) + self._bar.update(progress - self._last) + self._last = progress + if progress == total: + self._bar.close() + self._bar = None + + +# need to set to 0 when start a new compilation +g_current_progress = 0 + + +def ProgressFunc(progress, total): + global g_current_progress + if progress != g_current_progress: + g_current_progress = progress + print(f"Graph compilation: {progress}/{total}") + + +def str_to_bool(val): + return bool(strtobool(val)) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--task", + type=str, + default="PRETRAINING", + help="task", + ) + parser.add_argument( + "--input_files", + type=str, + default="", + help="Files to load data from. 
" + "For Pretraining: Path to tfrecord files" + "For SQuAD: Path to train-v1.1.json", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=False, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--is_training", type=str_to_bool, default=True, help="training or inference") + # graph + parser.add_argument("--seq_len", default=128, type=int, help="The sequence length") + parser.add_argument("--vocab_size", default=30912, type=int, help="Set the size of the vocabulary") + parser.add_argument( + "--max_predictions_per_seq", default=20, type=int, help="The maximum total of masked tokens in input sequence" + ) + parser.add_argument("--max_position_embeddings", default=512, type=int, help="the length of the input mask") + parser.add_argument("--num_hidden_layers", type=int, default=None, help="Override config file if not None") + parser.add_argument( + "--hidden_size", default=768, type=int, help="Set the size of the hidden state of the transformer layers size" + ) + parser.add_argument("--ignore_index", type=int, default=-1, help="ignore mlm index") + parser.add_argument( + "--hidden_dropout_prob", + type=float, + default=0.1, + help="Set the layer dropout probability for fully connected layer in embedding and encoder", + ) + parser.add_argument( + "--attention_probs_dropout_prob", + type=float, + default=0.0, + help="Set the layer dropout probability for attention layer in encoder", + ) + # optimizer + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--beta1", type=float, default=0.9, help="Set the Adam/Lamb beta1 value") + parser.add_argument("--beta2", type=float, default=0.999, help="Set the Adam/Lamb beta2 value") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=10, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--scale_loss", type=float, default=1.0, help="The value of scale_loss for fp16.") + parser.add_argument("--accl1_type", type=str, default="FLOAT", help="FLOAT or FLOAT16") + parser.add_argument("--accl2_type", type=str, default="FLOAT", help="FLOAT or FLOAT16") + parser.add_argument("--weight_decay_mode", type=str, default="", help="decay or l2_regularization") + parser.add_argument( + "--optimizer_state_offchip", + type=str_to_bool, + default=True, + help="Set the store location of the optimizer tensors", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + # ipu + parser.add_argument( + "--epochs", + type=int, + default=1, + help="the iteration of the whole dataset", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--micro_batch_size", type=int, default=1, help="micro batch size") + parser.add_argument("--batches_per_step", type=int, default=1, help="batches per step") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--num_ipus", type=int, default=4, help="Number of IPUs to use") + parser.add_argument("--ipu_enable_fp16", type=str_to_bool, default=False, help="ipu enable fp16 or not.") + parser.add_argument("--num_replica", type=int, default=1, help="number of replica") + parser.add_argument("--enable_grad_acc", type=str_to_bool, default=False, help="enable gradient accumulation") + parser.add_argument("--grad_acc_factor", type=int, default=1, help="factor of gradient accumulation") + parser.add_argument( + "--available_mem_proportion", + type=float, + default=0.0, + help="set the available memory proportion for matmul/conv", + ) + parser.add_argument("--shuffle", type=str_to_bool, nargs="?", const=True, default=False, help="Shuffle Dataset") + parser.add_argument( + "--wandb", type=str_to_bool, nargs="?", const=True, default=False, help="Enable logging to Weights and Biases." 
+ ) + parser.add_argument("--enable_load_params", type=str_to_bool, default=False, help="load params or not") + parser.add_argument("--load_params_path", type=str, help="load params path") + parser.add_argument("--tf_checkpoint", type=str, help="Path to Tensorflow Checkpoint to initialise the model.") + parser.add_argument("--enable_engine_caching", type=str_to_bool, default=True, help="enable engine caching or not") + args = parser.parse_args() + return args diff --git a/model_zoo/electra/README.md b/model_zoo/electra/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b1c7d1e6dd24dba32486ea4252a795b525c0f7b6 --- /dev/null +++ b/model_zoo/electra/README.md @@ -0,0 +1,270 @@ +# ELECTRA with PaddleNLP + +[ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) 在[BERT](https://arxiv.org/abs/1810.04805)的基础上对其预训练过程进行了改进:预训练由两部分模型网络组成,称为Generator和Discriminator,各自包含1个BERT模型。Generator的预训练使用和BERT一样的Masked Language Model(MLM)任务,但Discriminator的预训练使用Replaced Token Detection(RTD)任务(主要改进点)。预训练完成后,使用Discriminator作为精调模型,后续的Fine-tuning不再使用Generator。 +![avatar](./electra_model_brief_introduce.JPG) + +图片来源:来自[electra论文](https://openreview.net/pdf?id=r1xMH1BtvB) + +根据论文中给出的实验结果,在和BERT具有相同的模型参数、预训练计算量一样的情况下,GLUE得分比BERT明显好,small模型为79.9:75.1,Base模型为85.1:82.2,Large模型为89.0:87.2。 + +本项目是 ELECTRA 在 Paddle 2.0上的开源实现。 + +## **环境依赖** + +- jieba, 安装方式:`pip install jieba` +- colorlog, 安装方式:`pip install colorlog` +- colorama, 安装方式:`pip install colorama` +- seqeval, 安装方式:`pip install seqeval` + +## **数据准备** +### 建议的预训练数据 +论文中提到预训练需要两部分数据:Book Corpus数据 和 Wikipedia Corpus数据,均为英文文本,utf-8编码。但是当前BookCorpus数据已不再开源,可以使用其它数据替代,只要是纯英文文本数据,utf-8编码即可(例如 Gutenberg Dataset)。 +。另外,Wikipedia Corpus数据建议从[官方获取](https://www.english-corpora.org/wiki/),下面例子假设这些数据都已获取并都放在./BookCorpus/train.data 文件中,每行一句英文文本 + +### 自定义预训练数据 +支持用户自定义数据进行训练,自定义数据为文本形式,每行一句英文文本,utf-8编码,下面例子假设数据在./BookCorpus/train.data + +### Fine-tuning数据 +Fine-tuning 使用GLUE数据,这部分Paddle已提供,在执行第4章 Fine-tuning 命令时会自动下载并加载 + +### 推理数据 +可以使用GLUE test数据集(Paddle已提供,在Fine-tuning时会自动下载),或者也可以自定义,格式要求和2.2 自定义预训练数据一样,每行一句英文文本,utf-8编码 + +## **模型预训练** + +**特别注意**:预训练模型如果想要达到较好的效果,需要训练几乎全量的Book Corpus数据 和 Wikipedia Corpus数据,原始文本接近20G,建议用GPU进行预训练,最好4片GPU以上。如果资源较少,Paddle提供已经预训练好的模型进行Fine-tuning,可以直接跳转到下面:运行Fine-tuning-使用Paddle提供的预训练模型运行 Fine-tuning + +### 单机单卡 +```shell +export CUDA_VISIBLE_DEVICES="0" +export DATA_DIR=./BookCorpus/ + +python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --input_dir $DATA_DIR \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 100 \ + --save_steps 10000 \ + --max_steps -1 \ + --device gpu +``` +其中参数释义如下: +- `model_type` 表示模型类型,默认为ELECTRA模型。 +- `model_name_or_path` 如果配置1个名字,则表示预训练模型的规模,当前支持的名字为:electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。如果配置1个路径,则表示按照路径中的模型规模进行训练,这时需配置 --init_from_ckpt 参数一起使用,一般用于断点恢复训练场景。 +- `input_dir` 表示输入数据的目录,该目录下需要有1个train.data纯英文文本文件,utf-8编码。 +- `output_dir` 表示将要保存预训练模型的目录。 +- `train_batch_size` 表示 每次迭代**每张卡**上的样本数目。此例子train_batch_size=64 运行时大致需要单卡12G显存,如果实际GPU显存小于12G或大大多于12G,可适当调小/调大此配置。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `weight_decay` 表示每次迭代中参数缩小的比例,该值乘以学习率为真正缩小的比例。 +- `adam_epsilon` 表示adam优化器中的epsilon值。 +- `warmup_steps` 
表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存间隔。 +- `max_steps` 如果配置且大于0,表示预训练最多执行的迭代数量;如果不配置或配置小于0,则根据输入数据量、train_batch_size和num_train_epochs来确定预训练迭代数量 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +另外还有一些额外参数不在如上命令中: +- `use_amp` 表示是否开启混合精度(float16)进行训练,默认不开启。如果在命令中加上了--use_amp,则会开启。 +- `init_from_ckpt` 表示是否从某个checkpoint继续训练(断点恢复训练),默认不开启。如果在命令中加上了--init_from_ckpt,且 --model_name_or_path 配置的是路径,则会开启从某个checkpoint继续训练。例如下面的命令从第40000步的checkpoint继续训练: +```shell +export CUDA_VISIBLE_DEVICES="0" +export DATA_DIR=./BookCorpus/ + +python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path ./pretrain_model/model_40000.pdparams/ \ + --input_dir $DATA_DIR \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 100 \ + --save_steps 10000 \ + --max_steps -1 \ + --device gpu \ + --init_from_ckpt +``` + +训练过程将按照 `logging_steps`的设置打印如下日志: + +``` +global step 100/322448, epoch: 0, loss: 46.2487393681735099, lr: 0.000100000000, speed: 0.6439 step/s +global step 200/322448, epoch: 0, loss: 45.2436411214760099, lr: 0.000200000000, speed: 0.6041 step/s +global step 300/322448, epoch: 0, loss: 43.2906827821215998, lr: 0.000300000000, speed: 0.5991 step/s +``` + +### 单机多卡 +```shell +export CUDA_VISIBLE_DEVICES="0,1,2,3" +export DATA_DIR=./BookCorpus/ + +python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --input_dir $DATA_DIR \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 100 \ + --save_steps 10000 \ + --max_steps -1 \ + --device gpu +``` +执行命令和单机单卡一样,只有环境变量CUDA_VISIBLE_DEVICES配置多个GPU-id,配置后预训练程序使用配置中的GPU-id,不会使用未配置的GPU-id + +## **Fine-tuning** +### 从预训练模型得到Fine-tuning所需模型 +由第一段简介得知,Electra Fine-tuning时只需要Discriminator部分,所以通过如下命令从预训练模型中提取出Discriminator,得到Fine-tuning所需模型 +```shell +python -u ./get_ft_model.py \ + --model_dir ./pretrain_model/model_40000.pdparams/ +``` +其中参数释义如下: +- `model_dir` 表示预训练模型所在目录,这里例子取预训练40000步的checkpoint来生成Fine-tuning所需模型,生成的用于Fine-tuning的模型也会在这个目录下。 + +此命令可多次执行,但只有第1次会生成Fine-tuning所需模型 + +**特别注意**:如果使用windows系统执行此命令,需使用**管理员**权限运行,否则会出错。Linux无此限制 + +### 运行Fine-tuning +使用./run_glue.py运行,有两种方式: +#### **使用Paddle提供的预训练模型运行 Fine-tuning** +此方式无需在本地进行预训练,即可以跳过上面 模型预训练 和 从预训练模型得到Fine-tuning所需模型 的介绍,直接运行Fine-tuning。 + +以 GLUE/SST-2 任务为例,启动 Fine-tuning 的方式如下: +```shell +export CUDA_VISIBLE_DEVICES=0,1 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 1e-4 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./$TASK_NAME/ \ + --device gpu +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持BERT、ELECTRA、ERNIE模型。 +- `model_name_or_path` 如果配置模型名(electra模型当前支持electra-small、electra-base、electra-large几种规格)则为本节介绍的方式。如果配置本地目录(例如执行get_ft_model.py 命令得到Fine-tuning所需模型,配置其所在的目录 pretrain_model/model_40000.pdparams/)则为下一节中介绍的方式。 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE。 +- `max_seq_length` 
表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU、NPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +#### **使用本地预训练模型运行 Fine-tuning** +按照上面模型预训练的介绍,在本地运行 ELECTRA 模型的预训练后,执行get_ft_model.py命令得到Fine-tuning所需模型,然后运行 Fine-tuning。 + +以 GLUE/SST-2 任务为例,启动 Fine-tuning 的方式如下: +```shell +export CUDA_VISIBLE_DEVICES=0,1 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type electra \ + --model_name_or_path ./pretrain_model/model_40000.pdparams/ \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 1e-4 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./$TASK_NAME/ \ + --device gpu +``` +其中绝大部分参数和上节中一样,只有参数model_name_or_path配置了本地预训练模型的路径 + +无论使用哪种方式进行 Fine-tuning,过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下格式的日志: + +``` +global step 100/6315, epoch: 0, batch: 99, rank_id: 0, loss: 0.687738, lr: 0.0000158479, speed: 3.3566 step/s +eval loss: 0.693736, acc: 0.5137614678899083, eval done total : 2.0170159339904785 s +global step 200/6315, epoch: 0, batch: 199, rank_id: 0, loss: 0.342201, lr: 0.0000316957, speed: 3.1531 step/s +eval loss: 0.715023, acc: 0.8256880733944955, eval done total : 1.9682419300079346 s +global step 300/6315, epoch: 0, batch: 299, rank_id: 0, loss: 0.516034, lr: 0.0000475436, speed: 3.1663 step/s +eval loss: 0.653879, acc: 0.8658256880733946, eval done total : 1.9738705158233643 s +global step 400/6315, epoch: 0, batch: 399, rank_id: 0, loss: 0.228789, lr: 0.0000633914, speed: 3.1512 step/s +eval loss: 0.863306, acc: 0.8600917431192661, eval done total : 1.960683822631836 s +global step 500/6315, epoch: 0, batch: 499, rank_id: 0, loss: 0.320570, lr: 0.0000792393, speed: 3.1495 step/s +eval loss: 0.732358, acc: 0.8704128440366973, eval done total : 1.9749321937561035 s +``` + +使用electra-small预训练模型进行单卡 Fine-tuning ,在验证集上有如下结果(这里各类任务的结果是运行3次取最好得到): + +| Task | Metric | Result | +|-------|------------------------------|-------------| +| CoLA | Matthews corr | 58.22 | +| SST-2 | acc. | 91.85 | +| MRPC | acc./F1 | 88.24 | +| STS-B | Pearson/Spearman corr | 87.24 | +| QQP | acc./F1 | 88.83 | +| MNLI | matched acc./mismatched acc. | 82.45 | +| QNLI | acc. | 88.61 | +| RTE | acc. 
| 66.78 | + +注:acc.是Accuracy的简称,表中Metric字段名词取自[GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7) + +## **推理部署** +运行某个GLUE任务后(还是继续以GLUE/SST-2 情感分类任务为例),想要将Fine-tuning模型导出以加速类似场景更多数据的推理,可以按照如下步骤完成推理部署 + +### 导出推理模型 +```shell +python -u ./export_model.py \ + --input_model_dir ./SST-2/sst-2_ft_model_6000.pdparams/ \ + --output_model_dir ./ \ + --model_name electra-deploy +``` +其中参数释义如下: +- `input_model_dir` 表示输入的预训练模型所在目录,这里例子取SST-2 Fine-tuning 6000步的checkpoint来导出推理模型。 +- `output_model_dir` 表示将要保存推理模型的目录,这里例子取当前路径。 +- `model_name` 表示输出推理模型的名字前缀,任意字符串均可,默认为electra-deploy。 + +例如,执行如上命令后,可以看到在output_model_dir配置的目录下,导出的推理模型包括3个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdiparams.info | 模型权重信息文件 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +### **使用Paddle Inference API进行推理** +准备好如上推理模型后,可参考[Paddle Inference API推理步骤](./deploy/python/README.md)。 + +### **使用Paddle Serving API进行推理** +上面介绍的Paddle Inference为使用本地模型推理,Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过RPC/HTTP方式发送数据进行推理,实现模型推理的服务化。准备好如上推理模型后,可参考[Paddle Serving API推理步骤](./deploy/serving/README.md)。 + +### **使用Paddle Lite API进行推理** +上面介绍的Paddle Inference和Serving主要在服务器上进行推理,而在移动设备(手机、平板等)上需要使用Paddle Lite进行推理。准备好如上推理模型后,可参考[Paddle Lite API推理步骤](./deploy/lite/README.md)。 + + +## Reference +[Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR 2020](https://openreview.net/pdf?id=r1xMH1BtvB) diff --git a/model_zoo/electra/deploy/lite/Makefile b/model_zoo/electra/deploy/lite/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..ca4be9cfc7aead414bcd32e16e1288af46a8a8b1 --- /dev/null +++ b/model_zoo/electra/deploy/lite/Makefile @@ -0,0 +1,33 @@ +ARM_ABI = arm8 +export ARM_ABI + +include ../Makefile.def + +LITE_ROOT=../../../ + +CXX_INCLUDES = $(INCLUDES) ${OPENCV_INCLUDE} -I$(LITE_ROOT)/cxx/include + +CXX_LIBS = -L$(LITE_ROOT)/cxx/lib/ -lpaddle_light_api_shared $(SYSTEM_LIBS) + +############################################################### +# How to use one of static libaray: # +# `libpaddle_api_full_bundled.a` # +# `libpaddle_api_light_bundled.a` # +############################################################### +# Note: default use lite's shared library. # +############################################################### +# 1. Comment above line using `libpaddle_light_api_shared.so` +# 2. 
Undo comment below line using `libpaddle_api_light_bundled.a` + +#CXX_LIBS = $(LITE_ROOT)/cxx/lib/libpaddle_api_light_bundled.a $(SYSTEM_LIBS) + +electra_lite: electra_lite.o + $(CC) $(SYSROOT_LINK) $(CXXFLAGS_LINK) electra_lite.o -o electra_lite $(CXX_LIBS) $(LDFLAGS) + +electra_lite.o: sentiment_classfication.cpp + $(CC) $(SYSROOT_COMPLILE) $(CXX_DEFINES) $(CXX_INCLUDES) $(CXX_FLAGS) -o electra_lite.o -c sentiment_classfication.cpp + +.PHONY: clean +clean: + rm -f electra_lite.o + rm -f electra_lite diff --git a/model_zoo/electra/deploy/lite/README.md b/model_zoo/electra/deploy/lite/README.md new file mode 100644 index 0000000000000000000000000000000000000000..00230fafacf70947ee158f9033b843674227ee69 --- /dev/null +++ b/model_zoo/electra/deploy/lite/README.md @@ -0,0 +1,162 @@ +# **ELECTRA 使用Paddle Lite API进行推理** +在移动设备(手机、平板等)上需要使用Paddle Lite进行推理。[Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite)是飞桨轻量化推理引擎,为手机、IOT端提供高效推理能力,并广泛整合跨平台硬件,为端侧部署及应用落地问题提供轻量化的部署方案。下面以Android手机(Armv7或Armv8)为例,使用Paddle Lite进行ELECTRA模型的推理。 + +## 前提条件 +准备好Inference所需模型,需要2个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +如何获得Inference模型?[可参考文档“导出推理模型”一节](../../README.md),下面假设这2个文件已生成,并放在当前目录下 + +## 准备硬件和系统 +- 电脑。用于保存代码和数据;编译Paddle Lite(看需要) +- 手机。Android手机(Armv7或Armv8),手机要能直接连接电脑,或者手机直连某个设备,其能连接到电脑。 + +如果在其它特殊硬件上或想要自己编译Paddle Lite预测库和优化工具,则电脑上还需准备: +- 交叉编译环境。不同开发环境的编译流程请参考对应文档。 + - [Docker](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html#docker) + - [Linux](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html#linux) + - [MAC OS](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html#mac-os) + +## 准备Paddle Lite预测库 +有两种方法: +- 直接下载。[官方预测库下载地址](https://paddle-lite.readthedocs.io/zh/latest/quick_start/release_lib.html),注意选择和手机arm系统版本匹配的,并带with_extra=ON的下载链接。 +- 编译Paddle-Lite得到预测库。**需要先准备好交叉编译环境**,然后依次执行如下命令,例如编译在 armv8 硬件上运行的预测库并打开extra op: +```shell +git clone https://github.com/PaddlePaddle/Paddle-Lite.git +cd Paddle-Lite +git checkout develop +./lite/tools/build_android.sh --arch=armv8 --with_extra=ON +``` +直接下载预测库并解压后,可以得到`inference_lite_lib.android.armv8/`文件夹,通过编译Paddle-Lite得到的预测库位于`Paddle-Lite/build.lite.android.armv8.gcc/inference_lite_lib.android.armv8/`文件夹。无论使用哪种方法,得到的预测库的文件目录结构都如下,为了方便统一说明,下面都假设预测库位于${Paddle_Lite_root}/inference_lite_lib.android.armv8/目录中: +``` +${Paddle_Lite_root}/inference_lite_lib.android.armv8/ +|-- cxx C++ 预测库和头文件 +| |-- include C++ 头文件 +| | |-- paddle_api.h +| | |-- paddle_image_preprocess.h +| | |-- paddle_lite_factory_helper.h +| | |-- paddle_place.h +| | |-- paddle_use_kernels.h +| | |-- paddle_use_ops.h +| | `-- paddle_use_passes.h +| `-- lib C++预测库 +| |-- libpaddle_api_light_bundled.a C++静态库 +| `-- libpaddle_light_api_shared.so C++动态库 +|-- java Java预测库 +| |-- jar +| | `-- PaddlePredictor.jar +| |-- so +| | `-- libpaddle_lite_jni.so +| `-- src +|-- demo C++和Java示例代码 +| |-- cxx C++ 预测库demo +| `-- java Java 预测库demo +``` + +## 准备Paddle Lite模型优化工具 +因为移动设备上对模型的要求很严格,所以需要使用Paddle Lite模型优化工具将Inference模型优化后才能将模型部署到移动设备上进行推理,模型优化的方法包括量化、子图融合、混合调度、Kernel优选等等方法。准备Paddle Lite模型优化工具也有两种方法: +- 直接下载。 +```shell +pip install paddlelite。 +``` +- 编译Paddle-Lite得到模型优化工具。**需要先准备好交叉编译环境**,然后依次执行如下命令: +```shell +# 如果准备环境时已经clone了Paddle-Lite,则不用重新clone Paddle-Lite +git clone https://github.com/PaddlePaddle/Paddle-Lite.git +cd Paddle-Lite +git checkout develop +# 启动编译 
+./lite/tools/build.sh build_optimize_tool +``` +如果是直接下载,工具可执行文件为`paddle_lite_opt`,并放在系统环境变量PATH中,所以无需进入到工具所在目录就可以直接执行;如果是编译得到,则工具可执行文件为`Paddle-Lite/build.opt/lite/api/opt`,为了后面统一说明,可将工具统一命名为`paddle_lite_opt`,并将其所处目录添加到系统环境变量PATH中,通过如下方式查看其运行选项和使用方式; +```shell +cd build.opt/lite/api/ && mv opt paddle_lite_opt +./paddle_lite_opt +``` + +## 使用Paddle Lite模型优化工具转换Inference模型 +以前提条件中准备好的Inference模型 electra-deploy.pdmodel/electra-deploy.pdiparams 为例,执行: +```shell +paddle_lite_opt \ + --model_file ./electra-deploy.pdmodel \ + --param_file ./electra-deploy.pdiparams \ + --optimize_out ./electra-deploy-lite \ + --optimize_out_type protobuf \ + --valid_targets arm \ + --record_tailoring_info false +``` +其中参数释义如下: +- `model_file` 表示推理需要加载的模型结构文件。例如前提中得到的electra-deploy.pdmodel。 +- `param_file` 表示推理需要加载的模型权重文件。例如前提中得到的electra-deploy.pdiparams。 +- `optimize_out` 表示输出的Lite模型**名字前缀**。例如配置./electra-deploy-lite,最终得到的Lite模型为./electra-deploy-lite.nb。 +- `optimize_out_type` 表示输出模型类型,目前支持两种类型:protobuf和naive_buffer,其中naive_buffer是一种更轻量级的序列化/反序列化实现。若您需要在mobile端执行模型预测,请将此选项设置为naive_buffer。默认为protobuf。 +- `valid_targets` 表示模型将要运行的硬件类型,默认为arm。目前可支持x86、arm、opencl、npu、xpu,可以同时指定多个backend(以空格分隔),Model Optimize Tool将会自动选择最佳方式。如果需要支持华为NPU(Kirin 810/990 Soc搭载的达芬奇架构NPU),应当设置为npu, arm。 +- `record_tailoring_info` 表示是否使用 根据模型裁剪库文件 功能,如使用则设置该选项为true,以记录优化后模型含有的kernel和OP信息,默认为false。 + +如上命令执行后,得到Lite模型为./electra-deploy-lite.nb + +## 预处理输入数据,并和Lite预测库、Lite模型、编译好的C++代码/配置 一起打包。 +```shell +# 假设${Paddle_Lite_root}已经配置了正确的Lite预测库路径 +python -u ./prepare.py \ + --lite_lib_path ${Paddle_Lite_root}/inference_lite_lib.android.armv8/ \ + --lite_model_file ./electra-deploy-lite.nb \ + --predict_file ./test.txt \ + --max_seq_length 128 \ + --model_name electra-small + +# 进入lite demo的工作目录 +cd ${Paddle_Lite_root}/inference_lite_lib.android.armv8/demo/cxx/electra/ +make -j && mv electra_lite debug +``` +其中prepare.py的参数释义如下: +- `lite_lib_path` 表示预测库所在目录。 +- `lite_model_file` 表示Lite模型路径。 +- `predict_file` 表示用于推理的文件数据,可以配置1个或多个文件,每个文件和预训练数据格式一样,为utf-8编码的文本数据,每行1句文本。 +- `max_seq_length` 表示输入的最大句子长度,超过该长度将被截断。 +- `model_name` 表示推理模型的类型,当前支持electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。 + +如上命令执行完后,${Paddle_Lite_root}/inference_lite_lib.android.armv8/demo/cxx/electra/文件夹下将有如下文件,只有其中的**debug目录**会传到手机: +```shell +demo/cxx/electra/ +|-- debug/ +| |--config.txt 推理配置和超参数配置 +| |--electra-deploy-lite.nb 优化后的Lite模型文件 +| |--electra_lite 编译好的在手机上执行推理的可执行文件 +| |--libpaddle_light_api_shared.so C++预测库文件 +| |--predict_input.bin 预处理好的输入数据(二进制) +| |--predict_input.txt 输入数据明文 +| |--sst2_label.txt 类别说明文件 +|-- config.txt 推理配置和超参数配置 +|-- Makefile 编译文件 +|-- sentiment_classfication.cpp 推理代码文件 +``` + +## 与目标设备连接执行推理 +如果电脑和Android手机直接连接,则在电脑上安装[ADB工具](https://developer.android.com/studio/command-line/adb),通过ADB工具来连接和操作Android设备: +```shell +# 检查是否连接上设备 +adb devices +# 将debug目录推送到设备的/data/local/tmp/electra/目录下,需事先在设备上创建 +adb push debug /data/local/tmp/electra/ +# 登录设备并打开设备上的shell +adb shell +# 准备相关环境。进入程序目录,配置好动态链接库的环境变量并给程序添加执行权限 +cd /data/local/tmp/electra/debug && export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp/electra/debug/ && chmod +x electra_lite +# 输入数据,运行Lite推理 +./electra_lite ./config.txt +``` +如果电脑和Android手机没有直接连接,Android手机直连某个设备,则需将debug目录cp到那个设备上,并在那个设备上安装ADB工具以执行如上代码。 + +执行如上推理命令后得到如下结果,同样数据在Paddle Lite推理的结果应该和使用Inference/Serving的结果是一样的 +```shell +=== electra predict result: ./predict_input.txt=== +sentence: [CLS] uneasy mishmash of styles and genres . 
[SEP], class_id: 0(negative), logits: 2.22824 +sentence: [CLS] director rob marshall went out gunning to make a great one . [SEP], class_id: 1(positive), logits: 0.241332 +total time : 0.399562 s. +``` + +如果修改了代码,则需要先执行prepare.py,再重新编译并打包push到手机上;如果只修改输入数据,则只需要执行prepare.py并打包push到手机上,不用重新编译。 diff --git a/model_zoo/electra/deploy/lite/config.txt b/model_zoo/electra/deploy/lite/config.txt new file mode 100644 index 0000000000000000000000000000000000000000..df4c84b9775de1b3b916cc65a7dbe813644f5584 --- /dev/null +++ b/model_zoo/electra/deploy/lite/config.txt @@ -0,0 +1,6 @@ +lite_model_file ./electra-deploy-lite.nb # path of model relative to executable file +label_file ./sst2_label.txt # path of label description file +predict_file_bin ./predict_input.bin # path of data in binary +predict_file_txt ./predict_input.txt # path of data in text +predict_num 10 # number of data to predict, automatic generation and no need config +predict_length 39 # sequence length of each data, automatic generation and no need config diff --git a/model_zoo/electra/deploy/lite/prepare.py b/model_zoo/electra/deploy/lite/prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..e1bc88d3086b2c82730db05cb6c20560701c917a --- /dev/null +++ b/model_zoo/electra/deploy/lite/prepare.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
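+
+# prepare.py packs the inputs for the Paddle Lite demo: it tokenizes the sentences
+# given via --predict_sentences/--predict_file with ElectraTokenizer, writes the
+# first batch of padded int64 token ids to <prepared_file_prefix>.bin (plus a plain
+# text copy in .txt), updates predict_num/predict_length in ./deploy/lite/config.txt,
+# and copies the optimized .nb model, label file, config, C++ sources and
+# libpaddle_light_api_shared.so into <lite_lib_path>/demo/cxx/electra/(debug/).
+# Typical invocation, as shown in the lite README:
+#   python -u ./prepare.py --lite_lib_path ${Paddle_Lite_root}/inference_lite_lib.android.armv8/ \
+#       --lite_model_file ./electra-deploy-lite.nb --predict_file ./test.txt \
+#       --max_seq_length 128 --model_name electra-small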
+ +import argparse +import fileinput +import io +import os +import shutil +import time + +import numpy as np + +from paddlenlp.transformers import ElectraTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--lite_lib_path", type=str, required=True, default=None, help="directory of paddle lite api library" + ) + parser.add_argument("--lite_model_file", type=str, required=True, default=None, help="paddle lite model file(.nb)") + parser.add_argument("--predict_sentences", type=str, nargs="*", help="one or more sentence to predict") + parser.add_argument( + "--predict_file", type=str, nargs="*", help="one or more file which contain sentence to predict" + ) + parser.add_argument( + "--prepared_file_prefix", + type=str, + default="predict_input", + help="prepared file prefix after processing predict sentences", + ) + parser.add_argument("--batch_size", type=int, default=100000, help="batch size") + parser.add_argument("--max_seq_length", type=int, default=128, help="max length of each sequence") + parser.add_argument( + "--model_name", + type=str, + default="electra-small", + help="shortcut name selected in the list: " + + ", ".join(list(ElectraTokenizer.pretrained_init_configuration.keys())), + ) + return parser.parse_args() + + +def read_sentences(paths=[]): + sentences = [] + for sen_path in paths: + assert os.path.isfile(sen_path), "The {} isn't a valid file.".format(sen_path) + sen = read_file(sen_path) + if sen is None: + logger.info("error in loading file:{}".format(sen_path)) + continue + sentences.extend(sen) + return sentences + + +def read_file(path): + lines = [] + with open(path, encoding="utf-8") as f: + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + lines.append(line.strip()) + else: + break + return lines + + +def get_predicted_input(predicted_data, tokenizer, max_seq_length, batch_size): + if predicted_data == [] or not isinstance(predicted_data, list): + raise TypeError("The predicted_data is inconsistent with expectations.") + + sen_ids_batch = [] + sen_words_batch = [] + sen_ids = [] + sen_words = [] + batch_num = 0 + pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token) + for sen in predicted_data: + sen_id = tokenizer(sen, max_seq_len=max_seq_length)["input_ids"] + sen_ids.append(sen_id) + sen_words.append(tokenizer.cls_token + " " + sen + " " + tokenizer.sep_token) + batch_num += 1 + if batch_num == batch_size: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + sen_ids = [] + sen_words = [] + batch_num = 0 + + if len(sen_ids) > 0: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + + return sen_ids_batch, sen_words_batch + + +def prepare_predict(args, sentences=[], paths=[]): + """ + Args: + sentences (list[str]): each string is a sentence. If sentences not paths + paths (list[str]): The paths of file which contain sentences. 
If paths not sentences + """ + + # initialize data + if sentences != [] and isinstance(sentences, list) and (paths == [] or paths is None): + predicted_data = sentences + elif (sentences == [] or sentences is None) and isinstance(paths, list) and paths != []: + predicted_data = read_sentences(paths) + else: + raise TypeError("The input data is inconsistent with expectations.") + + tokenizer = ElectraTokenizer.from_pretrained(args.model_name) + predicted_input, predicted_sens = get_predicted_input( + predicted_data, tokenizer, args.max_seq_length, args.batch_size + ) + + predicted_input_np = np.array(predicted_input) + predict_num = predicted_input_np.shape[1] + predict_length = predicted_input_np.shape[2] + predict_input_bin = args.prepared_file_prefix + ".bin" + predict_input_txt = args.prepared_file_prefix + ".txt" + predicted_input_np[0].astype(np.int64).tofile(predict_input_bin) + with io.open(predict_input_txt, "w", encoding="UTF-8") as f: + for sen_batch in predicted_sens: + for sen in sen_batch: + if len(sen.strip()) > 0: + f.write(sen.strip() + "\n") + + for line in fileinput.input("./deploy/lite/config.txt", inplace=True): + if "predict_num" in line: + newline = "predict_num " + str(predict_num) + print("%s" % newline) + elif "predict_length" in line: + newline = "predict_length " + str(predict_length) + print("%s" % newline) + else: + print("%s" % line.strip()) + + root_dir = args.lite_lib_path + "/demo/cxx/electra/" + debug_dir = args.lite_lib_path + "/demo/cxx/electra/debug/" + if not os.path.exists(debug_dir): + os.makedirs(debug_dir) + shutil.copy(args.lite_model_file, debug_dir) + shutil.copy("./deploy/lite/sst2_label.txt", debug_dir) + shutil.copy("./deploy/lite/config.txt", debug_dir) + shutil.copy(predict_input_bin, debug_dir) + shutil.copy(predict_input_txt, debug_dir) + libpaddle_light_api = os.path.join(args.lite_lib_path, "cxx/lib/libpaddle_light_api_shared.so") + shutil.copy(libpaddle_light_api, debug_dir) + + shutil.copy("./deploy/lite/config.txt", root_dir) + shutil.copy("./deploy/lite/sentiment_classfication.cpp", root_dir) + shutil.copy("./deploy/lite/Makefile", root_dir) + + +if __name__ == "__main__": + args = parse_args() + sentences = args.predict_sentences + paths = args.predict_file + start_time = time.time() + # sentences = ["The quick brown fox see over the lazy dog.", "The quick brown fox jump over tree lazy dog."] + # paths = ["../../debug/test.txt", "../../debug/test.txt.1"] + prepare_predict(args, sentences, paths) + print("prepare lite predict done, total time : %s s" % (time.time() - start_time)) diff --git a/model_zoo/electra/deploy/lite/sentiment_classfication.cpp b/model_zoo/electra/deploy/lite/sentiment_classfication.cpp new file mode 100644 index 0000000000000000000000000000000000000000..6305a5d2adc131ed46a1ceb113437e5dfe251ae2 --- /dev/null +++ b/model_zoo/electra/deploy/lite/sentiment_classfication.cpp @@ -0,0 +1,240 @@ +// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
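+
+// sentiment_classfication.cpp drives the on-device Paddle Lite demo: it reads the
+// key/value pairs in config.txt (lite_model_file, label_file, predict_file_bin/txt,
+// predict_num, predict_length), loads the optimized .nb model through MobileConfig,
+// copies the int64 token ids produced by prepare.py into the input tensor, runs the
+// predictor, and for every sentence prints the argmax class id, its label name and
+// the corresponding logit. On the device it is invoked as: ./electra_lite ./config.txt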
+ +#include +#include +#include +#include +#include +#include +#include +#include "paddle_api.h" // NOLINT + +using namespace paddle::lite_api; // NOLINT +using namespace std; + +#undef stderr +FILE *stderr = &__sF[2]; + +struct RESULT { + std::string class_name; + int class_id; + float logits; +}; + +std::vector PostProcess(const float *output_data, + int predict_num, + int predict_class, + const std::vector &word_labels) { + int predict_result[predict_num] = {0}; + float predict_logits[predict_num] = {0}; + for (int i = 0; i < predict_num; i++) { + int index = -1; + float max_score = -100.0; + for (int j = 0; j < predict_class; j++) { + float score = output_data[i * predict_class + j]; + if (score > max_score) { + max_score = score; + index = j; + } + } + predict_result[i] = index; + predict_logits[i] = max_score; + } + + std::vector results(predict_num); + for (int i = 0; i < results.size(); i++) { + results[i].class_name = "Unknown"; + if (predict_result[i] >= 0 && predict_result[i] < word_labels.size()) { + results[i].class_name = word_labels[predict_result[i]]; + } + results[i].class_id = predict_result[i]; + results[i].logits = predict_logits[i]; + } + return results; +} + +std::shared_ptr LoadModel(std::string model_file) { + MobileConfig config; + config.set_model_from_file(model_file); + + std::shared_ptr predictor = + CreatePaddlePredictor(config); + return predictor; +} + +std::vector split(const std::string &str, + const std::string &delim) { + std::vector res; + if ("" == str) return res; + char *strs = new char[str.length() + 1]; + std::strcpy(strs, str.c_str()); + + char *d = new char[delim.length() + 1]; + std::strcpy(d, delim.c_str()); + + char *p = std::strtok(strs, d); + while (p) { + string s = p; + res.push_back(s); + p = std::strtok(NULL, d); + } + + return res; +} + +std::vector ReadDict(const std::string &path) { + std::ifstream in(path); + std::string filename; + std::string line; + std::vector m_vec; + if (in) { + while (getline(in, line)) { + m_vec.push_back(line); + } + } else { + std::cout << "no such file" << std::endl; + } + return m_vec; +} + +std::map LoadConfigTxt( + const std::string &config_path) { + auto config = ReadDict(config_path); + + std::map dict; + for (int i = 0; i < config.size(); i++) { + std::vector res = split(config[i], " "); + dict[res[0]] = res[1]; + } + return dict; +} + +void PrintConfig(const std::map &config) { + std::cout << "=======PaddleClas lite demo config======" << std::endl; + for (auto iter = config.begin(); iter != config.end(); iter++) { + std::cout << iter->first << " : " << iter->second << std::endl; + } + std::cout << "=======End of PaddleClas lite demo config======" << std::endl; +} + +std::vector LoadLabels(const std::string &path) { + std::ifstream file; + std::vector labels; + file.open(path); + while (file) { + std::string line; + std::getline(file, line); + std::string::size_type pos = line.find(" "); + if (pos != std::string::npos) { + line = line.substr(pos + 1); + } + labels.push_back(line); + } + file.clear(); + file.close(); + return labels; +} + +std::vector RunModel(std::shared_ptr predictor, + const std::map &config, + double &cost_time) { + // read config + std::string label_path = config.at("label_file"); + std::string predict_file_bin = config.at("predict_file_bin"); + int predict_num = stoi(config.at("predict_num")); + int predict_length = stoi(config.at("predict_length")); + + // Load Labels + std::vector word_labels = LoadLabels(label_path); + + // Read predict data + int64_t 
predict_data[predict_num][predict_length] = {0}; + ifstream in(predict_file_bin, ios::in | ios::binary); + in.read((char *)&predict_data, sizeof predict_data); + in.close(); + + // Fill input tensor + std::unique_ptr input_tensor(std::move(predictor->GetInput(0))); + input_tensor->Resize({predict_num, predict_length}); + auto *data = input_tensor->mutable_data(); + for (int i = 0; i < predict_num; i++) { + for (int j = 0; j < predict_length; j++) { + data[i * predict_length + j] = predict_data[i][j]; + } + } + + auto start = std::chrono::system_clock::now(); + // Run predictor + predictor->Run(); + + // Get output and post process + std::unique_ptr output_tensor( + std::move(predictor->GetOutput(0))); + auto *output_data = output_tensor->data(); + auto end = std::chrono::system_clock::now(); + auto duration = + std::chrono::duration_cast(end - start); + cost_time = double(duration.count()) * + std::chrono::microseconds::period::num / + std::chrono::microseconds::period::den; + + if (output_tensor->shape().size() != 2) { + std::cerr << "[ERROR] the size of output tensor shape must equal to 2\n"; + exit(1); + } + int predict_class = int(output_tensor->shape()[1]); + + auto results = + PostProcess(output_data, predict_num, predict_class, word_labels); + + return results; +} + +int main(int argc, char **argv) { + if (argc < 2) { + std::cerr << "[ERROR] usage: " << argv[0] << " config_path\n"; + exit(1); + } + + // load config + std::string config_path = argv[1]; + auto config = LoadConfigTxt(config_path); + PrintConfig(config); + + // init predictor + std::string lite_model_file = config.at("lite_model_file"); + auto electra_predictor = LoadModel(lite_model_file); + + double elapsed_time = 0.0; + double run_time = 0; + + // run lite inference + std::vector results = RunModel(electra_predictor, config, run_time); + + // print result + std::string predict_file_txt = config.at("predict_file_txt"); + auto sentences = ReadDict(predict_file_txt); + std::cout << "=== electra predict result: " << predict_file_txt + << "===" << std::endl; + for (int i = 0; i < results.size(); i++) { + std::cout << "sentence: " << sentences[i] + << ", class_id: " << results[i].class_id << "(" + << results[i].class_name << ")" + << ", logits: " << results[i].logits << std::endl; + } + std::cout << "total time : " << run_time << " s." << std::endl; + + return 0; +} diff --git a/model_zoo/electra/deploy/lite/sst2_label.txt b/model_zoo/electra/deploy/lite/sst2_label.txt new file mode 100644 index 0000000000000000000000000000000000000000..0dcf43b8731e78e4e4dfa69d20674dcd1b9b5bb3 --- /dev/null +++ b/model_zoo/electra/deploy/lite/sst2_label.txt @@ -0,0 +1,2 @@ +0 negative +1 positive diff --git a/model_zoo/electra/deploy/python/README.md b/model_zoo/electra/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..35628a747926dea80553480aabf6e5955176bfa4 --- /dev/null +++ b/model_zoo/electra/deploy/python/README.md @@ -0,0 +1,56 @@ +# **ELECTRA 使用Paddle Inference API进行推理** + +## 前提条件 +准备好Inference所需模型,需要2个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +如何获得Inference模型?[可参考文档“导出推理模型”一节](../../README.md),下面假设这2个文件已生成,并放在当前目录下,有两种方法进行推理 + +## 从命令行读取输入数据进行推理 +```shell +python -u ./predict.py \ + --model_file ./electra-deploy.pdmodel \ + --params_file ./electra-deploy.pdiparams \ + --predict_sentences "uneasy mishmash of styles and genres ." 
"director rob marshall went out gunning to make a great one ." \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中参数释义如下: +- `model_file` 表示推理需要加载的模型结构文件。例如前提中得到的electra-deploy.pdmodel。 +- `params_file` 表示推理需要加载的模型权重文件。例如前提中得到的electra-deploy.pdiparams。 +- `predict_sentences` 表示用于推理的(句子)数据,可以配置1条或多条。如果此项配置,则predict_file不用配置。 +- `batch_size` 表示每次推理的样本数目。 +- `max_seq_length` 表示输入的最大句子长度,超过该长度将被截断。 +- `model_name` 表示推理模型的类型,当前支持electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。 + +另外还有一些额外参数不在如上命令中: +- `use_gpu` 表示是否使用GPU进行推理,默认不开启。如果在命令中加上了--use_gpu,则使用GPU进行推理。 +- `use_trt` 表示是否使用TensorRT进行推理,默认不开启。如果在命令中加上了--use_trt,且配置了--use_gpu,则使用TensorRT进行推理。前提条件:1)需提前安装TensorRT或使用[Paddle提供的TensorRT docker镜像](https://github.com/PaddlePaddle/Serving/blob/v0.5.0/doc/DOCKER_IMAGES_CN.md)。2)需根据cuda、cudnn、tensorRT和python的版本,安装[匹配版本的Paddle包](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html) + +## 从文件读取输入数据进行推理 +```shell +python -u ./predict.py \ + --model_file ./electra-deploy.pdmodel \ + --params_file ./electra-deploy.pdiparams \ + --predict_file "./sst-2.test.tsv.1" "./sst-2.test.tsv.2" \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中绝大部分和从命令行读取输入数据一样,这里描述不一样的参数: +- `predict_file` 表示用于推理的文件数据,可以配置1个或多个文件,每个文件和预训练数据格式一样,为utf-8编码的文本数据,每行1句文本。如果此项配置,则predict_sentences不用配置。 + +模型对每1句话分别推理出1个结果,例如下面为使用第一种方法中的命令得到的SST-2情感分类推理结果,0表示句子是负向情感,1表示句子为正向情感。因为batch_size=2,所以只有1个batch: +```shell +===== batch 0 ===== +Input sentence is : [CLS] uneasy mishmash of styles and genres . [SEP] +Output data is : 0 +Input sentence is : [CLS] director rob marshall went out gunning to make a great one . [SEP] +Output data is : 1 +inference total 2 sentences done, total time : 0.0849156379699707 s +``` +此推理结果表示:第1句话是负向情感,第2句话是正向情感。 diff --git a/model_zoo/electra/deploy/python/predict.py b/model_zoo/electra/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..a4c6a60ff44559967a18bdef4927f1ee1c091fed --- /dev/null +++ b/model_zoo/electra/deploy/python/predict.py @@ -0,0 +1,194 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import time + +import numpy as np +from paddle import inference + +from paddlenlp.transformers import ElectraTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_file", type=str, required=True, help="model filename") + parser.add_argument("--params_file", type=str, required=True, help="parameter filename") + parser.add_argument("--predict_sentences", type=str, nargs="*", help="one or more sentence to predict") + parser.add_argument( + "--predict_file", type=str, nargs="*", help="one or more file which contain sentence to predict" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batch size") + parser.add_argument("--use_gpu", action="store_true", help="whether to use gpu") + parser.add_argument("--use_trt", action="store_true", help="whether to use TensorRT") + parser.add_argument("--max_seq_length", type=int, default=128, help="max length of each sequence") + parser.add_argument( + "--model_name", + type=str, + default="electra-small", + help="shortcut name selected in the list: " + + ", ".join(list(ElectraTokenizer.pretrained_init_configuration.keys())), + ) + return parser.parse_args() + + +def read_sentences(paths=[]): + sentences = [] + for sen_path in paths: + assert os.path.isfile(sen_path), "The {} isn't a valid file.".format(sen_path) + sen = read_file(sen_path) + if sen is None: + logger.info("error in loading file:{}".format(sen_path)) + continue + sentences.extend(sen) + return sentences + + +def read_file(path): + lines = [] + with open(path, encoding="utf-8") as f: + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + lines.append(line.strip()) + else: + break + return lines + + +def get_predicted_input(predicted_data, tokenizer, max_seq_length, batch_size): + if predicted_data == [] or not isinstance(predicted_data, list): + raise TypeError("The predicted_data is inconsistent with expectations.") + + sen_ids_batch = [] + sen_words_batch = [] + sen_ids = [] + sen_words = [] + batch_num = 0 + pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token) + for sen in predicted_data: + sen_id = tokenizer(sen, max_seq_len=max_seq_length)["input_ids"] + sen_ids.append(sen_id) + sen_words.append(tokenizer.cls_token + " " + sen + " " + tokenizer.sep_token) + batch_num += 1 + if batch_num == batch_size: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + sen_ids = [] + sen_words = [] + batch_num = 0 + + if len(sen_ids) > 0: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + + return sen_ids_batch, sen_words_batch + + +def predict(args, sentences=[], paths=[]): + """ + Args: + sentences (list[str]): each string is a sentence. If sentences not paths + paths (list[str]): The paths of file which contain sentences. If paths not sentences + Returns: + res (list(numpy.ndarray)): The result of sentence, indicate whether each word is replaced, same shape with sentences. 
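+    Note: the predicted class ids are printed batch by batch; nothing is returned.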
+ """ + + # initialize data + if sentences != [] and isinstance(sentences, list) and (paths == [] or paths is None): + predicted_data = sentences + elif (sentences == [] or sentences is None) and isinstance(paths, list) and paths != []: + predicted_data = read_sentences(paths) + else: + raise TypeError("The input data is inconsistent with expectations.") + + tokenizer = ElectraTokenizer.from_pretrained(args.model_name) + predicted_input, predicted_sens = get_predicted_input( + predicted_data, tokenizer, args.max_seq_length, args.batch_size + ) + + # config + config = inference.Config(args.model_file, args.params_file) + config.switch_use_feed_fetch_ops(False) + config.enable_memory_optim() + if args.use_gpu: + config.enable_use_gpu(1000, 0) + if args.use_trt: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + max_batch_size=args.batch_size, + min_subgraph_size=5, + precision_mode=inference.PrecisionType.Float32, + use_static=False, + use_calib_mode=False, + ) + + # predictor + predictor = inference.create_predictor(config) + + start_time = time.time() + output_data = [] + count = 0 + for i, sen in enumerate(predicted_input): + sen = np.array(sen).astype("int64") + # get input name + input_names = predictor.get_input_names() + # get input pointer and copy data + input_tensor = predictor.get_input_handle(input_names[0]) + input_tensor.copy_from_cpu(sen) + + # run predictor + predictor.run() + + # get output name + output_names = predictor.get_output_names() + # get output pointer and copy data(nd.array) + output_tensor = predictor.get_output_handle(output_names[0]) + predict_data = output_tensor.copy_to_cpu() + output_res = np.argmax(predict_data, axis=1).tolist() + output_data.append(output_res) + + print("===== batch {} =====".format(i)) + for j in range(len(predicted_sens[i])): + print("Input sentence is : {}".format(predicted_sens[i][j])) + # print("Output logis is : {}".format(output_data[j])) + print("Output data is : {}".format(output_res[j])) + count += len(predicted_sens[i]) + print("inference total %s sentences done, total time : %s s" % (count, time.time() - start_time)) + + +if __name__ == "__main__": + args = parse_args() + sentences = args.predict_sentences + paths = args.predict_file + # sentences = ["The quick brown fox see over the lazy dog.", "The quick brown fox jump over tree lazy dog."] + # paths = ["../../debug/test.txt", "../../debug/test.txt.1"] + predict(args, sentences, paths) diff --git a/model_zoo/electra/deploy/serving/README.md b/model_zoo/electra/deploy/serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1b64547e9c2463ff4207eba0a2430f63ac9394a3 --- /dev/null +++ b/model_zoo/electra/deploy/serving/README.md @@ -0,0 +1,107 @@ +# **ELECTRA 使用Paddle Serving API进行推理** +Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过RPC/HTTP方式发送数据进行推理,实现模型推理的服务化,下面以RPC方式为例进行说明。 + +## 前提条件 +准备好Inference所需模型,需要2个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +如何获得Inference模型?[可参考文档“导出推理模型”一节](../../README.md),下面假设这2个文件已生成,并放在当前目录下 + +## 在服务器端和客户端启动Serving的docker容器 +建议在docker容器中运行服务器端和客户端以避免一些系统依赖库问题,启动docker镜像的命令参考:[Serving readme](https://github.com/PaddlePaddle/Serving/tree/v0.5.0) + +## 在服务器端安装相关模块 +```shell +pip install paddle-serving-app paddle-serving-client paddle-serving-server paddlepaddle +``` +如果服务器端可以使用GPU进行推理,则安装server的gpu版本,安装时要注意参考服务器当前CUDA、TensorRT的版本来安装对应的版本:[Serving 
readme](https://github.com/PaddlePaddle/Serving/tree/v0.5.0) +```shell +pip install paddle-serving-app paddle-serving-client paddle-serving-server-gpu paddlepaddle-gpu +``` + +## 在客户端安装相关模块 +```shell +pip install paddle-serving-app paddle-serving-client +``` + +## 从Inference模型生成Serving的模型和配置 +以前提条件中准备好的Inference模型 electra-deploy.pdmodel/electra-deploy.pdiparams 为例: +```shell +python -u ./covert_inference_model_to_serving.py \ + --inference_model_dir ./ \ + --model_file ./electra-deploy.pdmodel \ + --params_file ./electra-deploy.pdiparams +``` +其中参数释义如下: +- `inference_model_dir` 表示Inference推理模型所在目录,这里假设为当前目录。 +- `model_file` 表示推理需要加载的模型结构文件。例如前提中得到的electra-deploy.pdmodel。 +- `params_file` 表示推理需要加载的模型权重文件。例如前提中得到的electra-deploy.pdiparams。 + +执行命令后,会在当前目录下生成2个目录:serving_server 和 serving_client。serving_server目录包含服务器端所需的模型和配置,需将其cp到服务器端容器中;serving_client目录包含客户端所需的配置,需将其cp到客户端容器中 + +## 启动server +在服务器端容器中,使用上一步得到的serving_server目录启动server +```shell +python -m paddle_serving_server_gpu.serve \ + --model ./serving_server \ + --port 8383 +``` +其中参数释义如下: +- `model` 表示server加载的模型和配置所在目录。 +- `port` 表示server开启的服务端口8383。 + +如果服务器端可以使用GPU进行推理计算,则启动服务器时可以配置server使用的GPU id +```shell +python -m paddle_serving_server_gpu.serve \ + --model ./serving_server \ + --port 8383 \ + --gpu_id 0 +``` +- `gpu_id` 表示server使用0号GPU。 + +## 启动client进行推理 +在客户端容器中,使用前面得到的serving_client目录启动client发起RPC推理请求。和使用Paddle Inference API进行推理一样,有如下两种方法: +### 从命令行读取输入数据发起推理请求 +```shell +python -u ./client.py \ + --client_config_file ./serving_client/serving_client_conf.prototxt \ + --server_ip_port 127.0.0.1:8383 \ + --predict_sentences "uneasy mishmash of styles and genres ." "director rob marshall went out gunning to make a great one ." \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中参数释义如下: +- `client_config_file` 表示客户端需要加载的配置文件。 +- `server_ip_port` 表示服务器端的ip和port。默认为127.0.0.1:8383。 +- `predict_sentences` 表示用于推理的(句子)数据,可以配置1条或多条。如果此项配置,则predict_file不用配置。 +- `batch_size` 表示每次推理的样本数目。 +- `max_seq_length` 表示输入的最大句子长度,超过该长度将被截断。 +- `model_name` 表示推理模型的类型,当前支持electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。 + +### 从文件读取输入数据发起推理请求 +```shell +python -u ./client.py \ + --client_config_file ./serving_client/serving_client_conf.prototxt \ + --server_ip_port 127.0.0.1:8383 \ + --predict_file "./sst-2.test.tsv.1" "./sst-2.test.tsv.2" \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中绝大部分和从命令行读取输入数据一样,这里描述不一样的参数: +- `predict_file` 表示用于推理的文件数据,可以配置1个或多个文件,每个文件和预训练数据格式一样,为utf-8编码的文本数据,每行1句文本。如果此项配置,则predict_sentences不用配置。 + +使用Paddle Serving API进行推理的结果和使用Inference API的结果是一样的: +```shell +===== batch 0 ===== +Input sentence is : [CLS] uneasy mishmash of styles and genres . [SEP] +Output data is : 0 +Input sentence is : [CLS] director rob marshall went out gunning to make a great one . [SEP] +Output data is : 1 +inference total 2 sentences done, total time : 4.729415416717529 s +``` +此推理结果表示:第1句话是负向情感,第2句话是正向情感。 diff --git a/model_zoo/electra/deploy/serving/client.py b/model_zoo/electra/deploy/serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..331748feee3eb29ad679e3d056042c508548da5c --- /dev/null +++ b/model_zoo/electra/deploy/serving/client.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import numpy as np +from paddle_serving_client import Client + +from paddlenlp.transformers import ElectraTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--client_config_file", type=str, required=True, help="client prototxt config file") + parser.add_argument("--server_ip_port", type=str, default="127.0.0.1:8383", help="server_ip:port") + parser.add_argument("--predict_sentences", type=str, nargs="*", help="one or more sentence to predict") + parser.add_argument( + "--predict_file", type=str, nargs="*", help="one or more file which contain sentence to predict" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batch size") + parser.add_argument("--max_seq_length", type=int, default=128, help="max length of each sequence") + parser.add_argument( + "--model_name", + type=str, + default="electra-small", + help="shortcut name selected in the list: " + + ", ".join(list(ElectraTokenizer.pretrained_init_configuration.keys())), + ) + return parser.parse_args() + + +def read_sentences(paths=[]): + sentences = [] + for sen_path in paths: + assert os.path.isfile(sen_path), "The {} isn't a valid file.".format(sen_path) + sen = read_file(sen_path) + if sen is None: + logger.info("error in loading file:{}".format(sen_path)) + continue + sentences.extend(sen) + return sentences + + +def read_file(path): + lines = [] + with open(path, encoding="utf-8") as f: + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + lines.append(line.strip()) + else: + break + return lines + + +def get_predicted_input(predicted_data, tokenizer, max_seq_length, batch_size): + if predicted_data == [] or not isinstance(predicted_data, list): + raise TypeError("The predicted_data is inconsistent with expectations.") + + sen_ids_batch = [] + sen_words_batch = [] + sen_ids = [] + sen_words = [] + batch_num = 0 + pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token) + for sen in predicted_data: + sen_id = tokenizer(sen, max_seq_len=max_seq_length)["input_ids"] + sen_ids.append(sen_id) + sen_words.append(tokenizer.cls_token + " " + sen + " " + tokenizer.sep_token) + batch_num += 1 + if batch_num == batch_size: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + sen_ids = [] + sen_words = [] + batch_num = 0 + + if len(sen_ids) > 0: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + + return sen_ids_batch, sen_words_batch + + +def predict(args, sentences=[], paths=[]): + """ + Args: + sentences (list[str]): each string is a sentence. 
If have sentences then no need paths + paths (list[str]): The paths of file which contain sentences. If have paths then no need sentences + Returns: + res (list(numpy.ndarray)): The result of sentence, indicate whether each word is replaced, same shape with sentences. + """ + + # initialize client + client = Client() + client.load_client_config(args.client_config_file) + # "serving_client/serving_client_conf.prototxt") + client.connect([args.server_ip_port]) + + # initialize data + if sentences != [] and isinstance(sentences, list) and (paths == [] or paths is None): + predicted_data = sentences + elif (sentences == [] or sentences is None) and isinstance(paths, list) and paths != []: + predicted_data = read_sentences(paths) + else: + raise TypeError("The input data is inconsistent with expectations.") + + tokenizer = ElectraTokenizer.from_pretrained(args.model_name) + predicted_input, predicted_sens = get_predicted_input( + predicted_data, tokenizer, args.max_seq_length, args.batch_size + ) + + start_time = time.time() + count = 0 + for i, sen in enumerate(predicted_input): + sen = np.array(sen).astype("int64") + + fetch_map = client.predict(feed={"input_ids": sen}, fetch=["save_infer_model/scale_0.tmp_0"], batch=True) + output_data = np.array(fetch_map["save_infer_model/scale_0.tmp_0"]) + output_res = np.argmax(output_data, axis=1) + + print("===== batch {} =====".format(i)) + for j in range(len(predicted_sens[i])): + print("Input sentence is : {}".format(predicted_sens[i][j])) + # print("Output logis is : {}".format(output_data[j])) + print("Output data is : {}".format(output_res[j])) + + count += len(predicted_sens[i]) + print("inference total %s sentences done, total time : %s s" % (count, time.time() - start_time)) + + +if __name__ == "__main__": + # paddle.enable_static() + args = parse_args() + sentences = args.predict_sentences + paths = args.predict_file + predict(args, sentences, paths) diff --git a/model_zoo/electra/deploy/serving/covert_inference_model_to_serving.py b/model_zoo/electra/deploy/serving/covert_inference_model_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..795a5aeb2281af9910313ea2c6fb3a8efa643462 --- /dev/null +++ b/model_zoo/electra/deploy/serving/covert_inference_model_to_serving.py @@ -0,0 +1,39 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
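+
+# covert_inference_model_to_serving.py converts the exported Paddle inference
+# model (--model_file/--params_file inside --inference_model_dir) into Paddle
+# Serving format via serving_io.inference_model_to_serving, producing a
+# serving_server/ directory for the server side and a serving_client/ directory
+# for the client side, then prints the model's feed and fetch names.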
+ +import argparse +import paddle +import paddle_serving_client.io as serving_io + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--inference_model_dir", type=str, required=True, help="input inference model dir") + parser.add_argument("--model_file", type=str, required=True, help="input inference model file name") + parser.add_argument("--params_file", type=str, required=True, help="input inference parameters file name") + return parser.parse_args() + + +if __name__ == "__main__": + paddle.enable_static() + args = parse_args() + feed_names, fetch_names = serving_io.inference_model_to_serving( + dirname=args.inference_model_dir, + serving_server="serving_server", + serving_client="serving_client", + model_filename=args.model_file, + params_filename=args.params_file, + ) + print("model feed_names : %s" % feed_names) + print("model fetch_names : %s" % fetch_names) diff --git a/model_zoo/electra/electra_model_brief_introduce.JPG b/model_zoo/electra/electra_model_brief_introduce.JPG new file mode 100644 index 0000000000000000000000000000000000000000..226294ad6c52bb52112e3a4c8523611c600f85d1 Binary files /dev/null and b/model_zoo/electra/electra_model_brief_introduce.JPG differ diff --git a/model_zoo/electra/export_model.py b/model_zoo/electra/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..babb7741ac0d95e8ecf8e7681cd45a2ef9344e32 --- /dev/null +++ b/model_zoo/electra/export_model.py @@ -0,0 +1,79 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
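+
+# export_model.py loads model_state.pdparams from --input_model_dir, checks by its
+# parameter names that it is a GLUE fine-tuned ElectraForSequenceClassification
+# checkpoint (pretraining or discriminator-only checkpoints are rejected), and
+# exports a static inference model named <model_name>.pdmodel/.pdiparams into
+# --output_model_dir via paddle.jit.save with a [None, None] int64 InputSpec.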
+# from collections import namedtuple +from __future__ import absolute_import, division, print_function + +import argparse +import hashlib +import os + +import paddle +from paddle.static import InputSpec + +from paddlenlp.transformers import ElectraForSequenceClassification + + +def get_md5sum(file_path): + md5sum = None + if os.path.isfile(file_path): + with open(file_path, "rb") as f: + md5_obj = hashlib.md5() + md5_obj.update(f.read()) + hash_code = md5_obj.hexdigest() + md5sum = str(hash_code).lower() + return md5sum + + +def main(): + # check and load model + input_model_file = os.path.join(args.input_model_dir, "model_state.pdparams") + print("load model to get static model : %s \nmodel md5sum : %s" % (input_model_file, get_md5sum(input_model_file))) + model_state_dict = paddle.load(input_model_file) + + if all((s.startswith("generator") or s.startswith("discriminator")) for s in model_state_dict.keys()): + print("the model : %s is electra pretrain model, we need fine-tuning model to deploy" % input_model_file) + exit(1) + elif "discriminator_predictions.dense.weight" in model_state_dict: + print("the model : %s is electra discriminator model, we need fine-tuning model to deploy" % input_model_file) + exit(1) + elif "classifier.dense.weight" in model_state_dict: + print("we are load glue fine-tuning model") + model = ElectraForSequenceClassification.from_pretrained(args.input_model_dir) + print("total model layers : ", len(model_state_dict)) + else: + print("the model file : %s may not be fine-tuning model, please check" % input_model_file) + exit(1) + + # save static model to disk + paddle.jit.save( + layer=model, + path=os.path.join(args.output_model_dir, args.model_name), + input_spec=[InputSpec(shape=[None, None], dtype="int64")], + ) + print("save electra inference model success") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--input_model_dir", required=True, default=None, help="Directory for storing Electra pretraining model" + ) + parser.add_argument( + "--output_model_dir", required=True, default=None, help="Directory for output Electra inference model" + ) + parser.add_argument( + "--model_name", default="electra-deploy", type=str, help="prefix name of output model and parameters" + ) + args, unparsed = parser.parse_known_args() + main() diff --git a/model_zoo/electra/get_ft_model.py b/model_zoo/electra/get_ft_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fde0b7cb75ec3178255b5ae0343ab7da6f3236f6 --- /dev/null +++ b/model_zoo/electra/get_ft_model.py @@ -0,0 +1,82 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
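+
+# get_ft_model.py splits the ELECTRA pretraining checkpoint model_state.pdparams
+# found in --model_dir into generator and discriminator state dicts (by stripping
+# the "generator."/"discriminator." key prefixes), saves them to the files given by
+# --generator_output_file/--discriminator_output_file, renames the original file to
+# pretrain_model_state.pdparams, and symlinks model_state.pdparams to the
+# discriminator weights so the directory can be used directly for fine-tuning.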
+# from collections import namedtuple +import argparse +import hashlib +import os + +import paddle + + +def get_md5sum(file_path): + md5sum = None + if os.path.isfile(file_path): + with open(file_path, "rb") as f: + md5_obj = hashlib.md5() + md5_obj.update(f.read()) + hash_code = md5_obj.hexdigest() + md5sum = str(hash_code).lower() + return md5sum + + +def main(args): + pretraining_model = os.path.join(args.model_dir, "model_state.pdparams") + if os.path.islink(pretraining_model): + print("%s already contain fine-tuning model, pleace check" % args.model_dir) + exit(0) + print( + "load Electra pretrain model to get generator/discriminator model : %s \nmodel md5sum : %s" + % (pretraining_model, get_md5sum(pretraining_model)) + ) + # depart total_pretraining_model to generator and discriminator state_dict + total_pretraining_model = paddle.load(pretraining_model) + generator_state_dict = {} + discriminator_state_dict = {} + num_keys = 0 + for key in total_pretraining_model.keys(): + new_key = None + if "generator." in key: + new_key = key.replace("generator.", "", 1) + generator_state_dict[new_key] = total_pretraining_model[key] + if "discriminator." in key: + new_key = key.replace("discriminator.", "", 1) + discriminator_state_dict[new_key] = total_pretraining_model[key] + num_keys += 1 + print("total electra keys : ", num_keys) + print("total generator keys : ", len(generator_state_dict)) + print("total discriminator keys : ", len(discriminator_state_dict)) + + # save generator and discriminator model to disk + paddle.save(generator_state_dict, os.path.join(args.model_dir, args.generator_output_file)) + paddle.save(discriminator_state_dict, os.path.join(args.model_dir, args.discriminator_output_file)) + print("save generator and discriminator model success") + os.rename(pretraining_model, os.path.join(args.model_dir, "pretrain_model_state.pdparams")) + os.symlink(args.discriminator_output_file, os.path.join(args.model_dir, "model_state.pdparams")) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_dir", required=True, default=None, help="Directory of storing ElectraForTotalPreTraining model" + ) + parser.add_argument( + "--generator_output_file", default="generator_for_ft.pdparams", help="Electra generator model for fine-tuning" + ) + parser.add_argument( + "--discriminator_output_file", + default="discriminator_for_ft.pdparams", + help="Electra discriminator model for fine-tuning", + ) + args, unparsed = parser.parse_known_args() + main(args) diff --git a/model_zoo/electra/run_glue.py b/model_zoo/electra/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..c5a8051a4a026e4dfc30b9a43abb40f6f24e1664 --- /dev/null +++ b/model_zoo/electra/run_glue.py @@ -0,0 +1,369 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
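+
+# run_glue.py fine-tunes BERT/ELECTRA/ERNIE sequence classification models on a
+# GLUE task: it builds train/dev DataLoaders from paddlenlp.datasets' "glue"
+# dataset, trains with AdamW and LinearDecayWithWarmup, evaluates with the metric
+# registered for the task in METRIC_CLASSES every --save_steps steps, and saves
+# checkpoints named <task>_ft_model_<step>.pdparams under --output_dir.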
+ +import argparse +import logging +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ElectraForSequenceClassification, + ElectraTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "electra": (ElectraForSequenceClassification, ElectraTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/npu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, 
collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
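+    # The exclusion works by name matching: any parameter whose name contains
+    # "bias" or "norm" is left out of decay_params, and apply_decay_param_fun
+    # below makes AdamW apply weight decay only to the remaining parameters.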
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0 or global_step == num_training_steps: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/model_zoo/electra/run_pretrain.py b/model_zoo/electra/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..02e12eaf88f9ac544cae47f7957f55009b128f86 --- /dev/null +++ b/model_zoo/electra/run_pretrain.py @@ -0,0 +1,556 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import io +import json +import logging +import os +import random +import time + +import numpy as np +import paddle + +from paddlenlp.transformers import ( + ElectraForTotalPretraining, + ElectraPretrainingCriterion, + ElectraTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "electra": (ElectraForTotalPretraining, ElectraTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default="electra", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="electra-small", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--max_seq_length", default=128, type=int, help="max length of each sequence") + parser.add_argument("--mask_prob", default=0.15, type=float, help="the probability of one word to be mask") + parser.add_argument( + "--train_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--num_train_epochs", + default=4, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--init_from_ckpt", + action="store_true", + help="Whether to load model checkpoint. if True, args.model_name_or_path must be dir store ckpt or will train from fresh start", + ) + parser.add_argument( + "--use_amp", action="store_true", help="Whether to use float16(Automatic Mixed Precision) to train." 
+ ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument("--eager_run", type=bool, default=True, help="Use dygraph mode.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +class BookCorpus(paddle.io.Dataset): + """ + https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html + Args: + data_path (:obj:`str`) : The dataset file path, which contains train.tsv, dev.tsv and test.tsv. + tokenizer (:obj:`class PretrainedTokenizer`) : The tokenizer to split word and convert word to id. + max_seq_length (:obj:`int`) : max length for each sequence. + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train, test or dev). + """ + + def __init__( + self, + data_path, + tokenizer, + max_seq_length, + mode="train", + ): + if mode == "train": + data_file = "train.data" + elif mode == "test": + data_file = "test.data" + else: + data_file = "dev.data" + + self.data_file = os.path.join(data_path, data_file) + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.raw_examples = self._read_file(self.data_file) + + def _read_file(self, input_file): + """ + Reads a text file. + + Args: + input_file (:obj:`str`) : The file to be read. + + Returns: + examples (:obj:`list`): All the input data. 
+ """ + if not os.path.exists(input_file): + raise RuntimeError("The file {} is not found.".format(input_file)) + else: + with io.open(input_file, "r", encoding="UTF-8") as f: + examples = [] + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + example = self.tokenizer(line, max_seq_len=self.max_seq_length)["input_ids"] + examples.append(example) + else: + break + return examples + + def truncation_ids(self, ids, max_seq_length): + if len(ids) <= (max_seq_length - 2): + return ids + else: + return ids[: (max_seq_length - 2)] + + def __getitem__(self, idx): + return self.raw_examples[idx] + + def __len__(self): + return len(self.raw_examples) + + +class DataCollatorForElectra(object): + """ + pads, gets batch of tensors and preprocesses batches for masked language modeling + when dataloader num_worker > 0, this collator may trigger some bugs, for safe, be sure dataloader num_worker=0 + """ + + def __init__(self, tokenizer, max_seq_length, mlm=True, mlm_probability=0.15): + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.mlm = True + self.mlm_probability = mlm_probability + + def __call__(self, examples): + if self.mlm: + inputs, raw_inputs, labels = self.mask_tokens(examples) + return inputs, raw_inputs, labels + else: + raw_inputs, _ = self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + inputs = raw_inputs.clone().detach() + labels = raw_inputs.clone().detach() + if self.tokenizer.pad_token is not None: + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + labels[labels == pad_token_id] = -100 + return batch, raw_inputs, labels # noqa:821 + + def tensorize_batch(self, examples, dtype): + if isinstance(examples[0], (list, tuple)): + examples = [paddle.to_tensor(e, dtype=dtype) for e in examples] + length_of_first = examples[0].shape[0] + are_tensors_same_length = all(x.shape[0] == length_of_first for x in examples) + if are_tensors_same_length: + return paddle.stack(examples, axis=0) + else: + raise ValueError("the tensor in examples not have same shape, please check input examples") + + def add_special_tokens_and_set_maskprob(self, inputs, truncation, max_seq_length): + # sep_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.sep_token) + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + # cls_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.cls_token) + full_inputs = [] + full_maskprob = [] + max_length = 0 + for ids in inputs: + if len(ids) > max_length: + max_length = len(ids) + max_length = min(max_length, max_seq_length) + + for ids in inputs: + if len(ids) <= max_length: + padding_num = max_length - len(ids) + full_inputs.append(ids + ([pad_token_id] * padding_num)) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0] + ([0] * padding_num)) + else: + if truncation: + full_inputs.append(ids[:max_length]) + full_maskprob.append([0] + ([self.mlm_probability] * (max_length - 2)) + [0]) + else: + full_inputs.append(ids) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0]) + return full_inputs, full_maskprob + + def mask_tokens(self, examples): + if self.tokenizer.mask_token is None: + raise ValueError("the tokenizer does not have mask_token, please check!") + mask_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token) + + raw_inputs, probability_matrix = 
self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + probability_matrix = self.tensorize_batch(probability_matrix, "float32") + inputs = raw_inputs.clone() + labels = raw_inputs.clone() + + total_indices = paddle.bernoulli(probability_matrix).astype("bool").numpy() + labels[~total_indices] = -100 + + # 80% MASK + indices_mask = paddle.bernoulli(paddle.full(labels.shape, 0.8)).astype("bool").numpy() & total_indices + inputs[indices_mask] = mask_token_id + + # 10% Random + indices_random = ( + paddle.bernoulli(paddle.full(labels.shape, 0.5)).astype("bool").numpy() & total_indices & ~indices_mask + ) + random_words = paddle.randint(low=0, high=self.tokenizer.vocab_size, shape=labels.shape, dtype="int64") + inputs = paddle.where(paddle.to_tensor(indices_random), random_words, inputs) + + # 10% Original + return inputs, raw_inputs, labels + + +def create_dataloader(dataset, mode="train", batch_size=1, use_gpu=True, data_collator=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): + Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): + If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): + The sample number of a mini-batch. + use_gpu(obj:`bool`, optional, defaults to obj:`True`): + Whether to use gpu to run. + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + + if mode == "train" and use_gpu: + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + else: + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + + return dataloader + + +def do_train(args): + paddle.enable_static() if not args.eager_run else None + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + # worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Loads or initializes a model. + pretrained_models = list(tokenizer_class.pretrained_init_configuration.keys()) + config = model_class.config_class.from_pretrained(args.model_name_or_path) + + if args.model_name_or_path in pretrained_models: + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + model = model_class(config) + args.init_from_ckpt = False + else: + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + # Load checkpoint + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + with open(os.path.join(args.model_name_or_path, "run_states.json"), "r") as f: + config_dict = json.load(f) + model_name = config_dict["model_name"] + if model_name in pretrained_models: + model = model_class.from_pretrained(args.model_name_or_path) + model.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams"))) + else: + raise ValueError( + "initialize a model from ckpt need model_name " + "in model_config_file. 
The supported model_name " + "are as follows: {}".format(tokenizer_class.pretrained_init_configuration.keys()) + ) + else: + raise ValueError( + "initialize a model need identifier or the " + "directory of storing model. if use identifier, the supported model " + "identifiers are as follows: {}, if use directory, " + "make sure set init_from_ckpt as True".format(model_class.pretrained_init_configuration.keys()) + ) + + criterion = ElectraPretrainingCriterion(config) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + # Loads dataset. + tic_load_data = time.time() + print("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + train_dataset = BookCorpus( + data_path=args.input_dir, tokenizer=tokenizer, max_seq_length=args.max_seq_length, mode="train" + ) + print("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. + data_collator = DataCollatorForElectra( + tokenizer=tokenizer, max_seq_length=args.max_seq_length, mlm=True, mlm_probability=args.mask_prob + ) + + train_data_loader = create_dataloader( + train_dataset, + batch_size=args.train_batch_size, + mode="train", + use_gpu=True if args.device in "gpu" else False, + data_collator=data_collator, + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + + print("start train : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + trained_global_step = global_step = 0 + t_loss = paddle.to_tensor([0.0]) + log_loss = paddle.to_tensor([0.0]) + loss_list = [] + log_list = [] + tic_train = time.time() + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + optimizer.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdopt"))) + trained_global_step = global_step = config_dict["global_step"] + if trained_global_step < num_training_steps: + print( + "[ start train from checkpoint ] we have already trained %s steps, seeking next step : %s" + % (trained_global_step, trained_global_step + 1) + ) + else: + print( + "[ start train from checkpoint ] we have already trained %s steps, but total training steps is %s, please check configuration !" 
+ % (trained_global_step, num_training_steps) + ) + exit(0) + + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + if trained_global_step > 0: + trained_global_step -= 1 + continue + global_step += 1 + input_ids, raw_input_ids, gen_labels = batch + if args.use_amp: + with paddle.amp.auto_cast(): + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=gen_labels + ) + loss = criterion(gen_logits, disc_logits, gen_labels, disc_labels, attention_mask) + scaled = scaler.scale(loss) + scaled.backward() + t_loss += loss.detach() + scaler.minimize(optimizer, scaled) + else: + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=gen_labels + ) + loss = criterion(gen_logits, disc_logits, gen_labels, disc_labels, attention_mask) + loss.backward() + t_loss += loss.detach() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + local_loss = (t_loss - log_loss) / args.logging_steps + if paddle.distributed.get_world_size() > 1: + paddle.distributed.all_gather(loss_list, local_loss) + if paddle.distributed.get_rank() == 0: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float((paddle.stack(loss_list).sum() / len(loss_list)).numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + loss_list = [] + else: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float(local_loss.numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + log_loss = t_loss + tic_train = time.time() + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + config_to_save = copy.deepcopy(model_to_save.discriminator.electra.config) + config_to_save.to_json_file(os.path.join(output_dir, "model_config.json")) + run_states = { + "model_name": model_name if args.init_from_ckpt else args.model_name_or_path, + "global_step": global_step, + "epoch": epoch, + "step": step, + } + with open(os.path.join(output_dir, "run_states.json"), "w") as f: + json.dump(run_states, f) + paddle.save(model.state_dict(), os.path.join(output_dir, "model_state.pdparams")) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if len(log_list) > 0: + with open(os.path.join(output_dir, "train.log"), "w") as f: + for log in log_list: + if len(log.strip()) > 0: + f.write(log.strip() + "\n") + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + 
print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/model_zoo/ernie-1.0/README.md b/model_zoo/ernie-1.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b025325ccb7f9c3b4d39180097d974731d4d6440 --- /dev/null +++ b/model_zoo/ernie-1.0/README.md @@ -0,0 +1,671 @@ +# ERNIE: Enhanced Representation through kNowledge IntEgration + +**目录** +- [1. 模型简介](#模型简介) + - [1.1 目录结构](#目录结构) + - [1.1 环境依赖](#环境依赖) +- [2. 中文预训练](#中文预训练) + - [2.1 小规模语料预训练: 14GB - CLUECorpusSmall](#CLUECorpusSmall) + - [2.2 大规模语料预训练: 400GB - CLUE & WuDao](#ERNIE-CW) + - [2.3 预训练模型贡献](#预训练模型贡献) +- [3. 下游任务微调](#下游任务微调) + - [3.1 序列分类](#序列分类) + - [3.2 Token分类](#序列分类) + - [3.3 阅读理解](#阅读理解) +- [4. 预测部署](#预测部署) +- [5. 参考文献](#参考文献) + + + + + +## 1. 模型简介 + +ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,它将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。 + +ERNIE在情感分析、文本匹配、自然语言推理、词法分析、阅读理解、智能问答等16个公开数据集上全面显著超越世界领先技术,在国际权威的通用语言理解评估基准GLUE上,得分首次突破90分,获得全球第一。 +相关创新成果也被国际顶级学术会议AAAI、IJCAI收录。 +同时,ERNIE在工业界得到了大规模应用,如搜索引擎、新闻推荐、广告系统、语音交互、智能客服等。 + +ERNIE 通过建模海量数据中的词、实体及实体关系,学习真实世界的语义知识。相较于 BERT 学习原始语言信号,ERNIE 直接对先验语义知识单元进行建模,增强了模型语义表示能力。 + +这里我们举个例子: +``` +Learnt by BERT :哈 [mask] 滨是 [mask] 龙江的省会,[mask] 际冰 [mask] 文化名城。 +Learnt by ERNIE:[mask] [mask] [mask] 是黑龙江的省会,国际 [mask] [mask] 文化名城。 +``` +在 BERT 模型中,我们通过『哈』与『滨』的局部共现,即可判断出『尔』字,模型没有学习与『哈尔滨』相关的任何知识。而 ERNIE 通过学习词与实体的表达,使模型能够建模出『哈尔滨』与『黑龙江』的关系,学到『哈尔滨』是 『黑龙江』的省会以及『哈尔滨』是个冰雪城市。 + + + +**项目特色** +- **中文预训练** + - 提供了完整中文预训练流程,从词表构造、数据处理、任务训练,到下游任务。 + - 提供中文Whole Word Mask,支持文本动态Mask。 +- **数据流程**, + - 数据预处理流程高效,40分钟即可完成14G ERNIE数据制作。 + - 数据稳定可复现,多数据集即插即用。 +- **分布式训练**, + - 支持多机多卡,支持混合精度、重计算、梯度累积等功能。 + + + +### 1.1 目录结构 + +整体的目录结构如下: + +```shell +./ +├── args.py 训练配置参数文件 +├── converter 静态图参数转换为动态图的脚本 +│   └── params_static_to_dygraph.py +├── finetune 下游任务finetune脚本 +│   ├── config.yml 训练参数配置文件 +│   ├── question_answering.py 阅读理解任务预处理代码 +│   ├── sequence_classification.py 序列分类任务预处理代码 +│   ├── token_classification.py TOKEN分类任务预处理代码 +│   ├── README.md 说明文档 +│   ├── run_ner.py 命名实体识别任务运行脚本 +│   ├── run_qa.py 阅读理解任务运行脚本 +│   ├── run_seq_cls.py 序列分类任务运行脚本 +│   └── utils.py +├── README.md 说明文档 +├── pretraining_introduction.md 中文预训练详细介绍文档 +├── preprocess +│   ├── docs 部分数据制作文档,包括CLUECorpusSmall,WuDaoCorpusBase +│   ├─ xxx.py 文件处理的python脚本 +│ └──README.md PaddleNLP 预训练数据流程 +├── vocab 全中文字符词表制作教程 +├── run_gb512_s1m.sh 训练启动shell脚本,batch size 512. max steps 100w +├── run_gb512_s1m_static.sh +├── run_gb512_s1m_trainer.sh +├── run_pretrain.py 训练启动python脚本 +├── run_pretrain_static.py +└── run_pretrain_trainer.py +``` + + + +### 1.2 环境依赖 + +- tool_helpers +- visualdl +- pybind11 + +安装命令 `pip install visualdl pybind11 tool_helpers` + + + +## 2. 中文预训练 + +ERNIE预训练采用的是MLM(Mask Language Model)的训练方式,采用WWM(Whole Word Mask)方式,对于完整语义单元的Token,会同时进行Mask。整体的训练损失loss是mlm_loss + sop_loss。 + +ERNIE 中文预训练更详细的介绍文档请可以参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +本样例为用户提供了高效的训练流程, +- **支持动态文本mask**: 用户可以根据自己的需求,灵活修改mask方式。具体可以参考修改`data_tools/dataset_utils.py`中`create_masked_lm_predictions`函数。 +- **支持自动断点训练重启恢复**。 用户可以设置`checkpoint_steps`,间隔`checkpoint_steps`数,即保留最新的checkpoint到`model_last`文件夹。重启训练时,程序默认从最新checkpoint重启训练,学习率、数据集都可以恢复到checkpoint时候的状态。 + + + + +### 2.1 小规模语料预训练: 14GB - CLUECorpusSmall +下面是使用CLUECorpusSmall 14G文本进行预训练的流程: + +
+CLUECorpusSmall 数据准备 + +#### 数据准备 +数据下载部分请参考[preprocess](./preprocess)目录,根据文档中`CLUECorpusSmall 数据集处理教程`,下载数据。下载好后: + +解压文件 +```shell +unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus +unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus +unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus +unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus +``` +将txt文件转换为jsonl格式 +``` +python preprocess/trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl +``` +现在我们得到了jsonl格式的数据集,下面是针对训练任务的数据集应用,此处以ernie为例。 +``` +python -u preprocess/create_pretraining_data.py \ + --model_name ernie-1.0-base-zh \ + --tokenizer_name ErnieTokenizer \ + --input_path clue_corpus_small_14g.jsonl \ + --split_sentences \ + --data_impl mmap \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func jieba \ + --output_prefix clue_corpus_small_14g_20220104 \ + --workers 48 \ + --log_interval 10000 +``` +数据共有文档`15702702`条左右,由于分词比较耗时,大概一小时左右可以完成。在当前目录下产出训练所需数据。 +``` +clue_corpus_small_14g_20220104.bin +clue_corpus_small_14g_20220104.idx +``` + +
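+
+数据制作完成后,可以用下面的最小示意脚本快速检查分词行为是否符合预期(假设使用内置的 `ernie-1.0-base-zh` 词表,与上面 `create_pretraining_data.py` 的 `--model_name ernie-1.0-base-zh` 保持一致;该脚本仅用于检查,不属于数据制作流程):
+
+```python
+from paddlenlp.transformers import ErnieTokenizer
+
+# 与数据制作时相同的 tokenizer,保证得到的 id 与 .bin/.idx 中的一致
+tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0-base-zh")
+
+sample = "欢迎使用飞桨 PaddleNLP。"
+encoded = tokenizer(sample, max_seq_len=128)
+print(encoded["input_ids"])
+print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
+```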
+ + +
+CLUECorpusSmall 开始训练 + + +#### 开始训练 + +将制作好的数据`clue_corpus_small_14g_20220104.bin,clue_corpus_small_14g_20220104.idx`移动到input_dir中,即可开始训练。 +这里以8卡GPU训练为例任务脚本为例: +``` +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/ernie-1.0-dp8-gb512/log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/ernie-1.0-dp8-gb512" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level O2 \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 2 \ + --eval_freq 1000 \ + --device "gpu" \ + --share_folder false \ +``` + +使用8卡MLU训练示例: +``` +python -u -m paddle.distributed.launch \ + --mlus "0,1,2,3,4,5,6,7" \ + --log_dir "output/ernie-1.0-dp8-gb512/log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/ernie-1.0-dp8-gb512" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level O2 \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 2 \ + --eval_freq 1000 \ + --device "mlu" \ + --share_folder false \ +``` + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹,或者PaddleNLP内置tokenizer的名字。 +- `continue_training` 默认false,模型从随机初始化,开始训练。如果为True,从已有的预训练权重加载,开始训练。如果为True, 训练初始loss 为2.x 是正常loss,如果未False,随机初始化,初始loss一般为10+。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `data_impl` 指定输入文件数据制作类型,默认为`mmap`,可指定`mmap`或`lazy`,`mmap`格式在读入数据时会建立内存映射,`lazy`格式在读入数据时直接从文件读取。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认1/1000的数据为test,当样本数太少时,请修改此比例。 +- `max_seq_len` 输入文本序列的长度。 +- `micro_batch_size` 单卡batch size大小,比如此处单卡bs=64, 采用8卡训练`global_batch_size=64*8=512`。 +- `use_amp` 开启混合精度策略。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减的最小值。 +- `max_steps` 最大训练步数。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。默认地址为`output_dir/model_last`. +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `share_folder` 多机训练时,如果多机input_dir为挂载的同一个nfs网络位置,可以开启次选项,多机共享同一份数据。 + + +注: +- 训练支持断点重启,直接启动即可,程序会找到最新的checkpoint(`output_dir/model_last`),开始重启训练。请确保重启的训练配置与之前相同。 +- visualdl的日志在 `./output/ernie-1.0-dp8-gb512/train_log/xxx` 中。 +
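+
+为便于核对上面的配置,这里给出一个简单的估算示意(仅为粗略估算,实际 token 数取决于样本真实长度):
+
+```python
+# 按上面 8 卡、micro_batch_size=64、max_seq_len=512、max_steps=100w 的配置估算
+num_cards = 8
+micro_batch_size = 64
+max_seq_len = 512
+max_steps = 1_000_000
+
+global_batch_size = num_cards * micro_batch_size   # 512
+tokens_per_step = global_batch_size * max_seq_len  # 262,144(按满长估算的上限)
+total_tokens = tokens_per_step * max_steps          # 约 2.6e11
+print(global_batch_size, tokens_per_step, f"{total_tokens:.2e}")
+```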
+ + + +
+
CLUECorpusSmall 数据集训练效果

+#### CLUECorpusSmall 数据集训练效果
+
+使用上面制作好的 clue_corpus_small_14g 训练数据集,配合本训练脚本,batch_size=512, max_steps=100w,[详细训练日志](https://www.paddlepaddle.org.cn/paddle/visualdl/service/app/index?id=3fddf650db14b9319f9dc3a91dfe4ac6)。
+
+最终训练loss结果如下(训练/验证 loss 曲线图可在上面的 VisualDL 训练日志链接中查看):
+
+|Loss | Train | Validation |
+|-|-|-|
+|loss |2.59 | 2.48 |
+|lm_loss|2.48 | 2.38 |
+|sop_loss|0.11 | 0.10 |
+
+训练集 lm_loss=2.48 左右, 验证集 lm_loss=2.38 左右。
+
+使用训练好的模型参数,在下游任务中进行finetune。这里报告部分数据集上的finetune结果:
+
+CLUE评估结果:
+
+Model | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL
+-- | -- | -- | -- | -- | -- | -- | -- | -- | --
+Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc
+ERNIE-1.0 Base | 12L768H | 73.78 | 74.95 | 58.73 | 61.37 | 81.77 | 75.46 | 81.25 | 82.93
+ERNIE-1.0-cluecorpussmall | 12L768H | 73.24(-0.54) | 74.26 | 57.24 | 60.79 | 81.15 | 76.64 | 81.25 | 81.33
+
+注:
+- `ERNIE-1.0 Base`为官方预训练参数,采用的训练配置是batch_size=1024、steps=100w;
+- `ERNIE-1.0-cluecorpussmall`为本文档配置训练的复现版本,采用的是batch_size=512、steps=100w。
+
+ + + +### 2.2 大规模语料预训练: 400GB - CLUE & WuDao + +PaddleNLP致力于预训练开源工作,使用开源中文语料CLUE、WuDao 总共400GB,提供大规模语料训练教程,让用户可以从零开始构建,基于大规模语料,训练预训练模型。 + +[ERNIE 中文预训练介绍](./pretraining_introduction.md),从数据下载,词表制作,数据转化,模型训练,所有流程,完全开源开放,可复现。 +并训练发布开源最优的模型参数。 + +#### 数据准备 + +数据下载,数据转化部分,请参见[数据预处理文档](./preprocess/README.md), +- [CLUECorpus2020数据处理](./preprocess/docs/CLUECorpus2020.md) +- [WuDaoCorpusBase数据处理](./preprocess/docs/WuDaoCorpusBase.md) + +如果需要定制化词表,词表制作部分请参考[词表制作](./vocab/README.md)。 + + +#### 训练脚本 + +训练脚本如下 + +**环境配置** + +- PYTHONPATH 设置为当前目录(适合paddlenlp develop运行) +- 设置了一些FLAGS,包括增强报错,动态图Flag,提高矩阵乘法精度。 +- 多机情况下,可以设置`NCCL_SOCKET_IFNAME`指明NCCL使用的通信网口。 + +
+环境配置脚本 + +```shell +set -x + +# cd PaddleNLP/model_zoo/ernie-1.0 +export PYTHONPATH=$PYTHONPATH:../../ + +export FLAGS_call_stack_level=2 +# export NCCL_SOCKET_IFNAME=xgbe0 +export FLAGS_gemm_use_half_precision_compute_type=False +export FLAGS_enable_eager_mode=1 +unset CUDA_VISIBLE_DEVICES +``` +
+ +**路径配置** + +- 主要配置输入输出目录 +- 这里的`vocab_dir`如果没有使用自定义词表的话,请设置为内置的tokenizer,如`ernie-1.0-base-zh,ernie-3.0-base-zh`等。 +- 这里的 `data_dir` 设置多份数据集,用户不使用多份数据集的话,直接`data_dir="./data"`即可。 + +
+路径配置 + +```shell +trainer_id=${PADDLE_TRAINER_ID:-"0"} +task_name="0809-ernie-1.0-base-cw-dp16-gb1024" + +base_nfs="/path/to/your/nfs/mount/point" +base_dir="${base_nfs}/ernie-cw/output/${task_name}" +data_dir="5.0 ${base_nfs}/clue_oscar/clue_corpus_oscar_0630 7.0 ${base_nfs}/clue_train/clue_corpus_train_0629 12.0 ${base_nfs}/wudao_200g/wudao_200g_0703" +vocab_dir="${base_nfs}/" +``` +
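+
+其中 `data_dir` 以"权重1 前缀1 权重2 前缀2 ..."的形式配置多份数据集,训练时会按归一化后的权重混合采样(对应 `data_tools/dataset_utils.py` 中 `get_datasets_weights_and_num_samples` 与 `BlendableDataset` 的逻辑)。下面用一个简化的 Python 片段示意该字符串的解析方式,仅作说明用途:
+
+```python
+# 示意:解析 "权重1 前缀1 权重2 前缀2 ..." 形式的多数据集配置
+def parse_blend_config(data_dir):
+    fields = data_dir.split()
+    assert len(fields) % 2 == 0, "权重与数据前缀必须成对出现"
+    weights = [float(w) for w in fields[0::2]]
+    prefixes = fields[1::2]
+    total = sum(weights)
+    return [(p, w / total) for p, w in zip(prefixes, weights)]
+
+# 例如 "5.0 a 7.0 b 12.0 c" -> a 约占 20.8%,b 约占 29.2%,c 占 50%
+for prefix, ratio in parse_blend_config("5.0 data_a/prefix 7.0 data_b/prefix 12.0 data_c/prefix"):
+    print(prefix, round(ratio, 3))
+```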
+ +**启动训练**: + +对于`ernie-3.0-base-zh`我们提供了悟道的一个小规模样本的数据: +``` +mkdir data && cd data +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_idx.npz +cd - +``` +同时我们也提供了 `ernie-1.0-base-zh` 的悟道一个小规模样本的数据: +``` +https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/wudao_200g_sample_ernie-1.0-base-zh_ids.npy +https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/wudao_200g_sample_ernie-1.0-base-zh_idx.npz +``` + +可以指定`tokenizer_name_or_path=ernie-3.0-bash-zh`,`input_dir=./data` 用下面的脚本训练。 + +这里启动的是单机8卡任务,整体全局的batch_size 512 (64*8)。如果指定ips参数,进行多机运行,如 `python3 -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" --ips 192.168.1.101,192.168.1.101 ` +```shell +python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "${base_dir}/log_${trainer_id}" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "${vocab_dir}" \ + --input_dir "${data_dir}" \ + --output_dir "${base_dir}" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --binary_head true \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 4000000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 3900000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 3 \ + --eval_freq 1000 \ + --device "gpu"\ + --share_folder true \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --seed 1234 \ +``` + + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹(对于ernie,词表文件名一般命名为vocab.txt),或者PaddleNLP内置tokenizer的名字。 +- `continue_training` 默认false,模型从随机初始化,开始训练。如果为True,从已有的预训练权重加载,开始训练。如果为True, 训练初始loss 为2.x 是正常loss,如果未False,随机初始化,初始loss一般为10+。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认`split=949,50,1`, 使用1/1000的数据为test,当样本数太少时,增大测试的样本数目。 +- `max_seq_len` 输入文本序列的长度,默认值`512`。 +- `binary_head` 是否使用SOP(Sentences Order Predicet) loss,默认为 True,使用此loss。如果用户句子语料很短,无法组合成句子对,请设置此参数为`false`。 +- `micro_batch_size` 单卡batch size大小,比如此处单卡bs=64, 采用8卡训练`global_batch_size=64*8=512`。 +- `use_amp` 开启混合精度策略。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减到最小值后,学习率将一直保持为`min_lr`。 +- `max_steps` 最大训练步数。训练不支持通过`epoch`控制,第一次制造数据index时候,日志会显示数据会被计算的epoch数,请注意查看。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。默认地址为`output_dir/model_last`. +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `share_folder` 多机训练时,如果多机`input_dir`为挂载的同一个nfs网络位置,可以开启次选项,多机共享同一份数据。(每次运行,会制作训练的index数据,如果为挂载的统一nfs位置,则一台机器制作数据即可,否则每台机器都需要制作) + + +

+ +

+ +接下来我们主要介绍训练流程部分的特性的简单介绍:详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + +- **训练网络配置方面:** + + 本小节主要针对,任务的损失函数、MASK参数等配置进行了简单介绍。 + - SOP Loss + - SOP (Sentence Order Predict) 损失,是 模型训练的常用损失。将文本中的句子顺序分为两段打乱,最后判断文本是否被打乱。可以通过设置`binary_head`开启或者关闭。 + - MASK + - MLM (Mask Language Model) 是通过随机将文本中的部分token,随机替换为`[MASK]` token,最后预测出真实的token值。ERNIE默认采用了Whole Word MASK方式,选定一些词语进行MASK。 + - *使用方法*: 用户可以设置 `masked_lm_prob` 控制mask的token占文本总token长度的比例。默认`masked_lm_prob=0.15` 随机mask 15% 的token数目。 + - Ngram MASK + - 项目还支持了n-gram mask策略,如下图所示,在 WWM 进行词语级别MASK的基础上(如此处mask掉的`[模型]`词组),n-gram 可以MASK掉连续n个词组。下面例子中,连续mask了2个词组,`【[语言][模型]】`同时进行了mask。 +

+ +

+ + - *使用方法*: 用户通过`max_ngrams`设置最大的`ngram`长度。默认`max_ngrams=3`。 + + - Dropout + - Dropout 是常用的防止过拟合策略。对于大规模数据集训练,如`ernie-3.0`系列4T文本语料,可以设置 `dropout=0`,不考虑过拟合。实际`ernie-3.0-base-zh`训练中,没有开启Dropout。 + +详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +- **训练速度方面** + + 我们支持了如下策略,加速计算过程,减小显存占用,扩大batch_size: + + - **多卡多机训练**: + - 基于飞桨Fleet分布式API,用户可以十分方便的通过数据并行的方法,将训练扩展到多机多卡。 + - **混合精度训练**: + - 部分算子使用FP16计算kernel,加速计算过程。支持AMP混合精度O1,和Pure FP16全FP训练策略O2。 + - **梯度累积训练**: + - 用户可以指定梯度累积的步数,在梯度累积的step中,减少多卡之间梯度的通信,减少更新的次数,可以扩大训练的batch_size. + - **重计算训练**: + - 通过重新计算前向的方式,减少前向网络中间变量的存储,可以显著减少显存占用, + +详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +- **训练数据流方面** + + 我们针对训练数据流扩展、混合、重启等方面做了针对性优化提升 +

+ +

+ + - **多机扩展** + - 用户可以将数据放置到 NFS 服务器上,多机同时挂载数据即可。训练数据与计算资源分离。 + - **多数据混合** + - 训练数据集支持多个文件,即插即用,设置权重,传入参数即可`input_dir="1.0 dateset_a/prefix 2.0 dataset_b/prefix"` + - **稳定可复现** + - MLM任务具有一定随机性,需要随机mask数据。本数据流通过固定每一个step数据的随机种子,实验数据流稳定可复现。 + - **快加载** + - 数据文件使用mmap读取,加载数百GB文件几乎不耗时。 + - **断点重启** + - 用户可以单独设置,checkpoints steps 参数可设置较小,重启训练默认加载最新checkpoint。 + - 断点数据自动恢复,学习率等参数也自动恢复。 + +详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + +- **观察评估方面** + + - **可视化日志记录** + - 日志展示为全局loss,波动小。 + - 记录混合精度,loss_scaling等信息,方便用户debug。 + - 对模型结构,配置参数,paddle版本信息进行记录,方便复现环境 + - **下游任务评估**:CLUE Benchmark搜索评估参数效果 + - 使用[批量启动-grid-search](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/clue#%E6%89%B9%E9%87%8F%E5%90%AF%E5%8A%A8-grid-search),可以进行批量搜索任务 + - 注意,这里使用的是训练中的checkpoint进行评估,可以直接试着 评估待评估的参数为,所在的路径地址,即如 `python grid_seach.py output/ernie-base-outdir/model_100000` 之类的checkpoint地址。 + +详细介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +- **训练效果方面** + + 我们release了base、large两个模型。均取得了较好的预训练效果。 + + - **ERNIE 1.0-Base-zh-cw** 模型: + - 使用CLUE,WuDao共计400GB的语料,batch_size 1024, 训练 400w step,即可训练得到`ernie-3.0-base-zh`类似的模型效果。相关模型参数,开源为`ernie-1.0-base-zh-cw`,用户加载即可使用。使用CLUE benchmark 对最优超参数进行GradSearch搜索: + +Model                                  | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | + Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1| Acc| Acc | Acc +ERNIE 1.0-Base-zh-cw | 12L768H | 76.47 | 76.07 | 57.86 | 59.91 | 83.41 | 79.91 | 89.91 | 83.42 | 72.88/90.78 | 84.68 | 76.98 | +ERNIE 2.0-Base-zh | 12L768H | 74.95 | 76.25 | 58.53 | 61.72 | 83.07 | 78.81 | 84.21 | 82.77 | 68.22/88.71 | 82.78 | 73.19 +ERNIE 1.0-Base-zh | 12L768H | 74.17 | 74.84 | 58.91 | 62.25 | 81.68 | 76.58 | 85.20 | 82.77 | 67.32/87.83 | 82.47 | 69.68 +- + - **ERNIE 1.0-Large-zh-cw** 模型: + + - 除了base模型外,我们还训练了放出了large模型。此模型参数采用的是词表与ernie-1.0相同,因此命名为`ernie-1.0-large-zh-cw`。使用开源语料,batch_size 512, 训练 400w step,训练去除SOP任务,只保留MLM损失: + +Model                                    | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1 | Acc| Acc +ERNIE 1.0-Large-zh-cw | 24L1024H | 79.03 | 75.97 | 59.65 | 62.91 | 85.09 | 81.73| 93.09 | 84.53 | 74.22/91.88 | 88.57 | 84.54 +ERNIE 3.0-Xbase-zh| 20L1024H | 78.71 | 76.85 | 59.89 | 62.41 | 84.76 | 82.51 | 89.80 | 84.47 | 75.49/92.67 | 86.36 | 84.59 +RoBERTa-wwm-ext-large | 24L1024H | 76.61 | 76.00 | 59.33 | 62.02 | 83.88 | 78.81 | 90.79 | 83.67 | 70.58/89.82 | 85.72 | 75.26 + + + + +### 预训练模型贡献 +PaddleNLP为开发者提供了[community](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/community/contribute_models/contribute_awesome_pretrained_models.rst)模块,用户可以上传自己训练的模型,开源给其他用户使用。 +使用本文档给出的参数配置,在CLUECorpusSmall数据集上训练,可以得到`zhui/ernie-1.0-cluecorpussmall`参数,可直接使用。 +```python +model = AutoModelForMaskedLM.from_pretrained('zhui/ernie-1.0-cluecorpussmall') +``` + +贡献预训练模型的方法,可以参考[贡献预训练模型权重](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/community/contribute_models/contribute_awesome_pretrained_models.rst)教程。 + + + +## 3. 下游任务微调 + +使用训练中产出的checkpoint,或者paddlenlp内置的模型权重,使用本脚本,用户可以快速对当前模型效果进行评估。 + +### 运行示例 +本文档适配了三大主流下游任务,用户可以根据自己的需求,评估自己所需的数据集。 + + + +1. 
序列分类 +```shell +cd finetune +dataset="chnsenticorp_v2" +python run_seq_cls.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + + + +2. Token分类 +```shell +cd finetune +dataset="peoples_daily_ner" +python run_ner.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + + + +3. 阅读理解 +```shell +cd finetune +dataset="cmrc2018" +python run_qa.py \ + --do_train \ + --do_eval \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + + + + +## 4. 预测部署 +以中文文本情感分类问题为例,介绍一下从模型finetune到部署的过程。 + +与之前的finetune参数配置稍有区别,此处加入了一些配置选项。 + +- do_export,开启模型导出功能 +- eval_steps/save_steps 评估和保存的step间隔 +- metric_for_best_model 模型效果的比较指标。(次选项生效,需要save_steps为eval_steps的倍数) +- save_total_limit 最多保存的ckpt个数。(超过限制数据时,效果更差,或更旧的ckpt将被删除) + +```shell +cd finetune +# 开始finetune训练并导出模型 +dataset="chnsenticorp_v2" +python run_seq_cls.py \ + --do_train \ + --do_eval \ + --do_predict \ + --do_export \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset \ + --eval_steps 200 \ + --save_steps 200 \ + --metric_for_best_model "eval_accuracy" \ + --load_best_model_at_end \ + --save_total_limit 3 \ + +``` +训练完导出模型之后,可以用于部署,`deploy/seq_cls_infer.py`文件提供了python部署预测示例。可执行以下命令运行部署示例: + +```shell +python deploy/seq_cls_infer.py --model_dir tmp/chnsenticorp_v2/export/ --device cpu --backend paddle +``` + +运行后预测结果打印如下: +```text +[2023-03-01 08:25:31,352] [ INFO] - We are using to load '../tmp/chnsenticorp_v2/export/'. +WARNING: Logging before InitGoogleLogging() is written to STDERR +W0301 08:25:37.617117 58742 analysis_config.cc:958] It is detected that mkldnn and memory_optimize_pass are enabled at the same time, but they are not supported yet. Currently, memory_optimize_pass is explicitly disabled +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::CPU. +Batch id: 0, example id: 0, sentence: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般, label: negative, negative prob: 0.9999, positive prob: 0.0001. +Batch id: 1, example id: 0, sentence: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片!开始还怀疑是不是赠送的个别现象,可是后来发现每张DVD后面都有!真不知道生产商怎么想的,我想看的是猫和老鼠,不是米老鼠!如果厂家是想赠送的话,那就全套米老鼠和唐老鸭都赠送,只在每张DVD后面添加一集算什么??简直是画蛇添足!!, label: negative, negative prob: 0.9998, positive prob: 0.0002. +Batch id: 2, example id: 0, sentence: 还稍微重了点,可能是硬盘大的原故,还要再轻半斤就好了。其他要进一步验证。贴的几种膜气泡较多,用不了多久就要更换了,屏幕膜稍好点,但比没有要强多了。建议配赠几张膜让用用户自己贴。, label: negative, negative prob: 0.9999, positive prob: 0.0001. +...... +``` + +更多关于部署的情况可以参考[ERNIE 1.0 模型 Python 部署示例](finetune/deploy/README.md)。 + + + +## 5. 参考文献 +- [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223.pdf) diff --git a/model_zoo/ernie-1.0/args.py b/model_zoo/ernie-1.0/args.py new file mode 100644 index 0000000000000000000000000000000000000000..9af1706f12a0f9cac8f235b0c0b35fe67b874b5f --- /dev/null +++ b/model_zoo/ernie-1.0/args.py @@ -0,0 +1,112 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def parse_args(MODEL_CLASSES): + parser = argparse.ArgumentParser() + # yapf: disable + parser.add_argument("--model_type", default=None, type=str, required=True, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), ) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])),) + parser.add_argument("--tokenizer_name_or_path", default=None, type=str, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])),) + parser.add_argument("--continue_training", default=False, type=bool, help="Pre-training from existing paddlenlp model weights. Default Fasle and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models.") + + # Train I/O config + parser.add_argument("--input_dir", default=None, type=str, required=True, help="The input directory where the data will be read from.", ) + parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.") + parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.") + parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from preprocessed data.") + parser.add_argument("--binary_head", type=strtobool, default=True, help="True for NSP task.") + parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.") + parser.add_argument("--micro_batch_size", default=8, type=int, help="Batch size per device for one step training.", ) + parser.add_argument("--global_batch_size", default=None, type=int, help="Global batch size for all training process. None for not check the size is valid. If we only use data parallelism, it should be device_num * micro_batch_size.") + + # Default training config + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--grad_clip", default=0.0, type=float, help="Grad clip for the parameter.") + parser.add_argument("--max_lr", default=1e-5, type=float, help="The initial max learning rate for Adam.") + parser.add_argument("--min_lr", default=5e-5, type=float, help="The initial min learning rate for Adam.") + parser.add_argument("--warmup_rate", default=0.01, type=float, help="Linear warmup over warmup_steps for learing rate.") + + # Adam optimizer config + parser.add_argument("--adam_beta1", default=0.9, type=float, help="The beta1 for Adam optimizer. The exponential decay rate for the 1st moment estimates.") + parser.add_argument("--adam_beta2", default=0.999, type=float, help="The bate2 for Adam optimizer. 
The exponential decay rate for the 2nd moment estimates.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + + # Training steps config + parser.add_argument("--num_train_epochs", default=1, type=int, help="Total number of training epochs to perform.", ) + parser.add_argument("--max_steps", default=500000, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--checkpoint_steps", type=int, default=500, help="Save checkpoint every X updates steps to the model_last folder.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--decay_steps", default=360000, type=int, help="The steps use to control the learing rate. If the step > decay_steps, will use the min_lr.") + parser.add_argument("--logging_freq", type=int, default=1, help="Log every X updates steps.") + parser.add_argument("--eval_freq", type=int, default=500, help="Evaluate for every X updates steps.") + parser.add_argument("--eval_iters", type=int, default=10, help="Evaluate the model use X steps data.") + + # Config for 4D Parallelism + parser.add_argument("--use_sharding", type=strtobool, nargs='?', const=False, help="Use sharding Parallelism to training.") + parser.add_argument("--sharding_degree", type=int, default=1, help="Sharding degree. Share the parameters to many cards.") + parser.add_argument("--dp_degree", type=int, default=1, help="Data Parallelism degree.") + parser.add_argument("--mp_degree", type=int, default=1, help="Model Parallelism degree. Spliting the linear layers to many cards.") + parser.add_argument("--pp_degree", type=int, default=1, help="Pipeline Parallelism degree. Spliting the model layers to different parts.") + parser.add_argument("--use_recompute", type=strtobool, nargs='?', const=False, help="Using the recompute to save the memory.") + + # AMP config + parser.add_argument("--use_amp", type=strtobool, nargs='?', const=False, help="Enable mixed precision training.") + parser.add_argument("--fp16_opt_level", type=str, default="O2", help="Mixed precision training optimization level.") + parser.add_argument("--enable_addto", type=strtobool, nargs='?', const=True, default=True, help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training.") + parser.add_argument("--scale_loss", type=float, default=32768, help="The value of scale_loss for fp16. 
This is only used for AMP training.") + parser.add_argument("--hidden_dropout_prob", type=float, default=0.1, help="The hidden dropout prob.") + parser.add_argument("--attention_probs_dropout_prob", type=float, default=0.1, help="The attention probs dropout prob.") + + # Other config + parser.add_argument("--seed", type=int, default=1234, help="Random seed for initialization.") + parser.add_argument("--num_workers", type=int, default=2, help="Num of workers for DataLoader.") + parser.add_argument("--check_accuracy", type=strtobool, nargs='?', const=False, help="Check accuracy for training process.") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu", "mlu", "npu"], help="select cpu, gpu, xpu, npu devices.") + parser.add_argument("--lr_decay_style", type=str, default="cosine", choices=["cosine", "none"], help="Learning rate decay style.") + parser.add_argument("--share_folder", type=strtobool, nargs='?', const=False, help="Use share folder for data dir and output dir on multi machine.") + + # Argument for bert/ernie + parser.add_argument("--masked_lm_prob", type=float, default=0.15, help="Mask token prob.") + parser.add_argument("--short_seq_prob", type=float, default=0.1, help="Short sequence prob.") + parser.add_argument("--favor_longer_ngram", type=strtobool, default=False, help="Short sequence prob.") + parser.add_argument("--max_ngrams", type=int, default=3, help="Short sequence prob.") + + # yapf: enable + + args = parser.parse_args() + + if args.tokenizer_name_or_path is None: + args.tokenizer_name_or_path = args.model_name_or_path + args.test_iters = args.eval_iters * 10 + + if args.check_accuracy: + if args.hidden_dropout_prob != 0: + args.hidden_dropout_prob = 0.0 + logger.warning("The hidden_dropout_prob should set to 0 for accuracy checking.") + if args.attention_probs_dropout_prob != 0: + args.attention_probs_dropout_prob = 0.0 + logger.warning("The attention_probs_dropout_prob should set to 0 for accuracy checking.") + if args.dp_degree * args.mp_degree * args.pp_degree * args.sharding_degree == 1: + if paddle.distributed.get_world_size() > 1: + args.dp_degree = paddle.distributed.get_world_size() + + return args diff --git a/model_zoo/ernie-1.0/converter/params_static_to_dygraph.py b/model_zoo/ernie-1.0/converter/params_static_to_dygraph.py new file mode 100644 index 0000000000000000000000000000000000000000..c86cc1fea0180e5627322cf8309ed5a5d8514533 --- /dev/null +++ b/model_zoo/ernie-1.0/converter/params_static_to_dygraph.py @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
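+
+# Converts a static-graph checkpoint (program state) saved by the static
+# pretraining scripts into dygraph .pdparams weights for the specified
+# PaddleNLP model (see the --model / --path / --output_path arguments below).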
+ +import argparse +import paddle +from paddlenlp.transformers import AutoModelForPretraining +from paddlenlp.utils.log import logger + +paddle.set_device("cpu") +parser = argparse.ArgumentParser() +parser.add_argument("--model", type=str, help="The name of pretrained weights in PaddleNLP.") +parser.add_argument("--path", type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--output_path", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() + + +def init_dygraph_with_static(model, static_params_path): + from paddlenlp.utils.tools import static_params_to_dygraph + + static_tensor_dict = paddle.static.load_program_state(static_params_path) + return static_params_to_dygraph(model, static_tensor_dict) + + +def main(args): + logger.info("Loading model: %s" % args.model) + model = AutoModelForPretraining.from_pretrained(args.model) + logger.info("Loading static params and trans paramters...") + model_dict = init_dygraph_with_static(model, args.path) + save_name = args.output_path + if save_name is None: + save_name = args.model + "_converted.pdparams" + if not save_name.endswith(".pdparams"): + save_name += ".pdparams" + logger.info("Saving converted params to %s" % save_name) + paddle.save(model_dict, save_name) + logger.info("New pdparams saved!") + + +if __name__ == "__main__": + main(args) diff --git a/model_zoo/ernie-1.0/data_tools/Makefile b/model_zoo/ernie-1.0/data_tools/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..e29b78584d96c405fa75f44d71567d3676e8700a --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/Makefile @@ -0,0 +1,10 @@ +CXXFLAGS += -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color +CPPFLAGS += $(shell python3 -m pybind11 --includes) +CPPFLAGS += $(shell python3-config --includes) +LIBNAME = helpers +LIBEXT = $(shell python3-config --extension-suffix) + +default: $(LIBNAME)$(LIBEXT) + +%$(LIBEXT): %.cpp + $(CXX) $(CXXFLAGS) $(CPPFLAGS) $< -o $@ diff --git a/model_zoo/ernie-1.0/data_tools/dataset_utils.py b/model_zoo/ernie-1.0/data_tools/dataset_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0e3d66f0f8b984be70a72a9bd7216cfb1669dc5a --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/dataset_utils.py @@ -0,0 +1,840 @@ +# coding=utf-8 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors, and NVIDIA. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import math +import os +import re +import time + +import numpy as np +import paddle + +from paddlenlp.data.indexed_dataset import get_indexed_dataset_ + +# Most of the code here has been copied from: +# https://github.com/google-research/albert/blob/master/create_pretraining_data.py +# with some modifications. 
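+#
+# This module provides the ERNIE pretraining data pipeline helpers: blending
+# multiple indexed datasets by weight (BlendableDataset), building sentence
+# pair segments for the SOP task, and creating whole-word / n-gram masked LM
+# predictions (create_masked_lm_predictions).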
+ + +def get_local_rank(): + return int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + +print_rank_0 = print +COMPILED = False +DSET_TYPE_BERT = "standard_bert" +DSET_TYPE_T5 = "t5" +DSET_TYPE_ERNIE = "ernie" +DSET_TYPES = [DSET_TYPE_BERT, DSET_TYPE_T5, DSET_TYPE_ERNIE] + + +def compile_helper(): + """Compile helper function ar runtime. Make sure this + is invoked on a single process.""" + import os + import subprocess + + path = os.path.abspath(os.path.dirname(__file__)) + ret = subprocess.run(["make", "-C", path]) + if ret.returncode != 0: + print("Making C++ dataset helpers module failed, exiting.") + import sys + + sys.exit(1) + + +class BlendableDataset(paddle.io.Dataset): + """ + The BlendableDataset is a wrapper which used to mix different dataset. + + The input is a list of dataset and corresponding weights for each dataset. + """ + + def __init__(self, datasets, weights): + + self.datasets = datasets + num_datasets = len(datasets) + assert num_datasets == len(weights) + + self.size = 0 + for dataset in self.datasets: + self.size += len(dataset) + + # Normalize weights. + weights = np.array(weights, dtype=np.float64) + sum_weights = np.sum(weights) + assert sum_weights > 0.0 + weights /= sum_weights + + # Build indecies. + start_time = time.time() + assert num_datasets < 255 + self.dataset_index = np.zeros(self.size, dtype=np.uint8) + self.dataset_sample_index = np.zeros(self.size, dtype=np.int64) + + # local_rank = 0 if fleet.local_rank() is None else int(fleet.local_rank( + # )) + local_rank = get_local_rank() + + while True: + try: + try: + from tool_helpers import helpers + except Exception: + print_rank_0(" > missing tool_helpers, pip install tool_helpers please, try to compile locally.") + import data_tools.helpers as helpers + break + except Exception: + if local_rank == 0: + compile_helper() + print_rank_0("> wait for hepers to be compiled!") + time.sleep(1) + + helpers.build_blending_indices( + self.dataset_index, self.dataset_sample_index, weights, num_datasets, self.size, local_rank == 0 + ) + print_rank_0( + "> elapsed time for building blendable dataset indices: " "{:.2f} (sec)".format(time.time() - start_time) + ) + + def __len__(self): + return self.size + + def __getitem__(self, idx): + dataset_idx = self.dataset_index[idx] + sample_idx = self.dataset_sample_index[idx] + return self.datasets[dataset_idx][sample_idx] + + +def get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples): + + # The data prefix should be in the format of: + # weight-1, data-prefix-1, weight-2, data-prefix-2, .. + assert len(data_prefix) % 2 == 0 + num_datasets = len(data_prefix) // 2 + weights = [0] * num_datasets + prefixes = [0] * num_datasets + for i in range(num_datasets): + weights[i] = float(data_prefix[2 * i]) + prefixes[i] = (data_prefix[2 * i + 1]).strip() + # Normalize weights + weight_sum = 0.0 + for weight in weights: + weight_sum += weight + assert weight_sum > 0.0 + weights = [weight / weight_sum for weight in weights] + + # Add 0.5% (the 1.005 factor) so in case the bleding dataset does + # not uniformly distribute the number of samples, we still have + # samples left to feed to the network. 
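+    # For example, with normalized weight 0.3 and train_valid_test_num_samples
+    # of [1000, 100, 100], this dataset contributes
+    # [ceil(1000 * 0.3 * 1.005), ...] = [302, 31, 31] samples.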
+ datasets_train_valid_test_num_samples = [] + for weight in weights: + datasets_train_valid_test_num_samples.append( + [int(math.ceil(val * weight * 1.005)) for val in train_valid_test_num_samples] + ) + + return prefixes, weights, datasets_train_valid_test_num_samples + + +def get_a_and_b_segments(sample, np_rng): + """Divide sample into a and b segments.""" + + # Number of sentences in the sample. + n_sentences = len(sample) + # Make sure we always have two sentences. + assert n_sentences > 1, "make sure each sample has at least two sentences." + + # First part: + # `a_end` is how many sentences go into the `A`. + a_end = 1 + if n_sentences >= 3: + # Note that randin in numpy is exclusive. + a_end = np_rng.randint(1, n_sentences) + tokens_a = [] + for j in range(a_end): + tokens_a.extend(sample[j]) + + # Second part: + tokens_b = [] + for j in range(a_end, n_sentences): + tokens_b.extend(sample[j]) + + # Random next: + is_next_random = False + if np_rng.random() < 0.5: + is_next_random = True + tokens_a, tokens_b = tokens_b, tokens_a + + return tokens_a, tokens_b, is_next_random + + +def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng): + """Truncates a pair of sequences to a maximum sequence length.""" + # print(len_a, len_b, max_num_tokens) + assert len_a > 0 + if len_a + len_b <= max_num_tokens: + return False + while len_a + len_b > max_num_tokens: + if len_a > len_b: + len_a -= 1 + tokens = tokens_a + else: + len_b -= 1 + tokens = tokens_b + if np_rng.random() < 0.5: + del tokens[0] + else: + tokens.pop() + return True + + +def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id): + """Merge segments A and B, add [CLS] and [SEP] and build tokentypes.""" + + tokens = [] + tokentypes = [] + # [CLS]. + tokens.append(cls_id) + tokentypes.append(0) + # Segment A. + for token in tokens_a: + tokens.append(token) + tokentypes.append(0) + # [SEP]. + tokens.append(sep_id) + tokentypes.append(0) + # Segment B. + for token in tokens_b: + tokens.append(token) + tokentypes.append(1) + if tokens_b: + # [SEP]. + tokens.append(sep_id) + tokentypes.append(1) + + return tokens, tokentypes + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) + + +def is_start_piece(piece): + """Check if the current word piece is the starting piece (BERT).""" + # When a word has been split into + # WordPieces, the first token does not have any marker and any subsequence + # tokens are prefixed with ##. So whenever we see the ## token, we + # append it to the previous set of word indexes. + return not piece.startswith("##") + + +def create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + max_ngrams=3, + vocab_token_to_id_dict=None, + do_whole_word_mask=True, + favor_longer_ngram=False, + do_permutation=False, + geometric_dist=False, + to_chinese_char=False, + inplace_random_mask=False, + masking_style="bert", +): + """Creates the predictions for the masked LM objective. + Note: Tokens here are vocab ids and not text tokens.""" + + cand_indexes = [] + # Note(mingdachen): We create a list for recording if the piece is + # the starting piece of current token, where 1 means true, so that + # on-the-fly whole word masking is possible. 
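+    # For example, if a word was split into the WordPieces ["un", "##aff", "##able"]
+    # at positions 5, 6, 7, cand_indexes groups them into a single entry [5, 6, 7],
+    # so the whole word is masked together when do_whole_word_mask is True.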
+ token_boundary = [0] * len(tokens) + + for (i, token) in enumerate(tokens): + if token == cls_id or token == sep_id: + token_boundary[i] = 1 + continue + # Whole Word Masking means that if we mask all of the wordpieces + # corresponding to an original word. + # + # Note that Whole Word Masking does *not* change the training code + # at all -- we still predict each WordPiece independently, softmaxed + # over the entire vocabulary. + vocab_id = vocab_id_to_token_dict[token] + if do_whole_word_mask and len(cand_indexes) >= 1 and not is_start_piece(vocab_id): + cand_indexes[-1].append(i) + else: + cand_indexes.append([i]) + if is_start_piece(vocab_id_to_token_dict[token]): + token_boundary[i] = 1 + + if to_chinese_char: + # set ## chinse char to original chinese char + char_tokens = [] + assert vocab_token_to_id_dict is not None + for i, b in enumerate(token_boundary): + if b == 0: + vocab_id = vocab_id_to_token_dict[tokens[i]] + new_vocab_id = vocab_id[2:] if len(re.findall("##[\u4E00-\u9FA5]", vocab_id)) > 0 else vocab_id + char_tokens.append( + vocab_token_to_id_dict[new_vocab_id] if new_vocab_id in vocab_token_to_id_dict else token + ) + else: + char_tokens.append(tokens[i]) + output_tokens = list(char_tokens) + else: + output_tokens = list(tokens) + + masked_lm_positions = [] + masked_lm_labels = [] + + if masked_lm_prob == 0: + return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary) + + num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) + + ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64) + if not geometric_dist: + # Note(mingdachen): + # By default, we set the probilities to favor shorter ngram sequences. + pvals = 1.0 / np.arange(1, max_ngrams + 1) + pvals /= pvals.sum(keepdims=True) + if favor_longer_ngram: + pvals = pvals[::-1] + + ngram_indexes = [] + for idx in range(len(cand_indexes)): + ngram_index = [] + for n in ngrams: + ngram_index.append(cand_indexes[idx : idx + n]) + ngram_indexes.append(ngram_index) + + np_rng.shuffle(ngram_indexes) + + (masked_lms, masked_spans) = ([], []) + covered_indexes = set() + backup_output_tokens = list(output_tokens) + for cand_index_set in ngram_indexes: + if len(masked_lms) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes: + continue + + if not geometric_dist: + n = np_rng.choice( + ngrams[: len(cand_index_set)], + p=pvals[: len(cand_index_set)] / pvals[: len(cand_index_set)].sum(keepdims=True), + ) + else: + # Sampling "n" from the geometric distribution and clipping it to + # the max_ngrams. Using p=0.2 default from the SpanBERT paper + # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1) + n = min(np_rng.geometric(0.2), max_ngrams) + + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # Note(mingdachen): + # Repeatedly looking for a candidate that does not exceed the + # maximum number of predictions by trying shorter ngrams. + while len(masked_lms) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. 
+ if len(masked_lms) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + covered_indexes.add(index) + masked_token = None + if masking_style == "bert": + # 80% of the time, replace with [MASK] + if np_rng.random() < 0.8: + masked_token = mask_id + else: + # 10% of the time, keep original + if np_rng.random() < 0.5: + masked_token = output_tokens[index] + # 10% of the time, replace with random word + else: + if inplace_random_mask: + masked_token = backup_output_tokens[np_rng.randint(0, len(output_tokens))] + else: + masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))] + elif masking_style == "t5": + masked_token = mask_id + else: + raise ValueError("invalid value of masking style") + + output_tokens[index] = masked_token + masked_lms.append(MaskedLmInstance(index=index, label=backup_output_tokens[index])) + + masked_spans.append( + MaskedLmInstance(index=index_set, label=[backup_output_tokens[index] for index in index_set]) + ) + + assert len(masked_lms) <= num_to_predict + np_rng.shuffle(ngram_indexes) + + select_indexes = set() + if do_permutation: + for cand_index_set in ngram_indexes: + if len(select_indexes) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes or index in select_indexes: + continue + + n = np.random.choice( + ngrams[: len(cand_index_set)], + p=pvals[: len(cand_index_set)] / pvals[: len(cand_index_set)].sum(keepdims=True), + ) + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + + while len(select_indexes) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. + if len(select_indexes) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes or index in select_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + select_indexes.add(index) + assert len(select_indexes) <= num_to_predict + + select_indexes = sorted(select_indexes) + permute_indexes = list(select_indexes) + np_rng.shuffle(permute_indexes) + orig_token = list(output_tokens) + + for src_i, tgt_i in zip(select_indexes, permute_indexes): + output_tokens[src_i] = orig_token[tgt_i] + masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i])) + + masked_lms = sorted(masked_lms, key=lambda x: x.index) + # Sort the spans by the index of the first span + masked_spans = sorted(masked_spans, key=lambda x: x.index[0]) + + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. 
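+    # For example, num_tokens = 5 and max_seq_length = 8 appends three pad_id
+    # entries, giving padding_mask_np = [1, 1, 1, 1, 1, 0, 0, 0].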
+ filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, dtype=np.int64) + + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np + + +def build_train_valid_test_datasets( + data_prefix, + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head=False, + max_seq_length_dec=None, + dataset_type="standard_bert", +): + + if len(data_prefix) == 1: + return _build_train_valid_test_datasets( + data_prefix[0], + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type=dataset_type, + ) + + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples) + prefixes, weights, datasets_train_valid_test_num_samples = output + + # Build individual datasets. + train_datasets = [] + valid_datasets = [] + test_datasets = [] + for i in range(len(prefixes)): + train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( + prefixes[i], + args, + tokenizer, + splits_string, + datasets_train_valid_test_num_samples[i], + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type=dataset_type, + ) + if train_ds: + train_datasets.append(train_ds) + if valid_ds: + valid_datasets.append(valid_ds) + if test_ds: + test_datasets.append(test_ds) + + # Blend. + blending_train_dataset = None + if train_datasets: + blending_train_dataset = BlendableDataset(train_datasets, weights) + blending_valid_dataset = None + if valid_datasets: + blending_valid_dataset = BlendableDataset(valid_datasets, weights) + blending_test_dataset = None + if test_datasets: + blending_test_dataset = BlendableDataset(test_datasets, weights) + + return (blending_train_dataset, blending_valid_dataset, blending_test_dataset) + + +def _build_train_valid_test_datasets( + data_prefix, + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type="standard_bert", +): + + if dataset_type not in DSET_TYPES: + raise ValueError("Invalid dataset_type: ", dataset_type) + + # Indexed dataset. + indexed_dataset = get_indexed_dataset_(data_prefix, args.data_impl, skip_warmup) + + # Get start and end indices of train/valid/train into doc-idx + # Note that doc-idx is desinged to be num-docs + 1 so we can + # easily iterate over it. + total_num_of_documents = indexed_dataset.doc_idx.shape[0] - 1 + splits = get_train_valid_test_split_(splits_string, total_num_of_documents) + + # Print stats about the splits. 
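+    # splits holds four document-index boundaries produced by
+    # get_train_valid_test_split_, e.g. splits_string "949,50,1" over 10000
+    # documents gives [0, 9490, 9990, 10000] for train/valid/test.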
+ print_rank_0(" > dataset split:") + + def print_split_stats(name, index): + print_rank_0(" {}:".format(name)) + print_rank_0( + " document indices in [{}, {}) total of {} " + "documents".format(splits[index], splits[index + 1], splits[index + 1] - splits[index]) + ) + start_index = indexed_dataset.doc_idx[splits[index]] + end_index = indexed_dataset.doc_idx[splits[index + 1]] + print_rank_0( + " sentence indices in [{}, {}) total of {} " + "sentences".format(start_index, end_index, end_index - start_index) + ) + + print_split_stats("train", 0) + print_split_stats("validation", 1) + print_split_stats("test", 2) + + def build_dataset(index, name): + # from megatron.data.bert_dataset import BertDataset + # from megatron.data.t5_dataset import T5Dataset + # from .ernie_dataset import ErnieDataset + + dataset = None + if splits[index + 1] > splits[index]: + # Get the pointer to the original doc-idx so we can set it later. + doc_idx_ptr = indexed_dataset.get_doc_idx() + # Slice the doc-idx + start_index = splits[index] + # Add +1 so we can index into the dataset to get the upper bound. + end_index = splits[index + 1] + 1 + # New doc_idx view. + indexed_dataset.set_doc_idx(doc_idx_ptr[start_index:end_index]) + # Build the dataset accordingly. + kwargs = dict( + name=name, + data_prefix=data_prefix, + num_epochs=None, + max_num_samples=train_valid_test_num_samples[index], + max_seq_length=max_seq_length, + seed=seed, + share_folder=args.share_folder, + args=args, + ) + if dataset_type == DSET_TYPE_T5: + from t5_dataset import T5Dataset + + dataset = T5Dataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, + masked_lm_prob=masked_lm_prob, + max_seq_length_dec=max_seq_length_dec, + short_seq_prob=short_seq_prob, + **kwargs, + ) + # elif dataset_type == DSET_TYPE_BERT: + # dataset = BertDataset( + # indexed_dataset=indexed_dataset, + # tokenizer=tokenizer, + # masked_lm_prob=masked_lm_prob, + # short_seq_prob=short_seq_prob, + # binary_head=binary_head, + # **kwargs, + # ) + elif dataset_type == DSET_TYPE_ERNIE: + from .ernie_dataset import ErnieDataset + + dataset = ErnieDataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + binary_head=binary_head, + **kwargs, + ) + else: + raise NotImplementedError("Dataset type not fully implemented.") + + # Set the original pointer so dataset remains the main dataset. + indexed_dataset.set_doc_idx(doc_idx_ptr) + # Checks. 
+ assert indexed_dataset.doc_idx[0] == 0 + assert indexed_dataset.doc_idx.shape[0] == (total_num_of_documents + 1) + return dataset + + train_dataset = build_dataset(0, "train") + valid_dataset = build_dataset(1, "valid") + test_dataset = build_dataset(2, "test") + + return (train_dataset, valid_dataset, test_dataset) + + +def get_train_valid_test_split_(splits_string, size): + """Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(",") != -1: + splits = [float(s) for s in splits_string.split(",")] + elif splits_string.find("/") != -1: + splits = [float(s) for s in splits_string.split("/")] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.0) + splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def get_samples_mapping( + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + name, + binary_head, + share_folder, +): + """Get a list that maps a sample index to a starting sentence index, end sentence index, and length""" + + if not num_epochs: + if not max_num_samples: + raise ValueError("Need to specify either max_num_samples " "or num_epochs") + num_epochs = np.iinfo(np.int32).max - 1 + if not max_num_samples: + max_num_samples = np.iinfo(np.int64).max - 1 + + # Filename of the index mapping + indexmap_filename = data_prefix + indexmap_filename += "_{}_indexmap".format(name) + if num_epochs != (np.iinfo(np.int32).max - 1): + indexmap_filename += "_{}ep".format(num_epochs) + if max_num_samples != (np.iinfo(np.int64).max - 1): + indexmap_filename += "_{}mns".format(max_num_samples) + indexmap_filename += "_{}msl".format(max_seq_length) + indexmap_filename += "_{:0.2f}ssp".format(short_seq_prob) + indexmap_filename += "_{}s".format(seed) + indexmap_filename += ".npy" + + local_rank = get_local_rank() + if share_folder: + local_rank = paddle.distributed.get_rank() + # Build the indexed mapping if not exist. + + if local_rank == 0 and not os.path.isfile(indexmap_filename): + print( + " > WARNING: could not find index map file {}, building " + "the indices on rank 0 ...".format(indexmap_filename) + ) + + # Make sure the types match the helpers input types. + assert indexed_dataset.doc_idx.dtype == np.int64 + print(indexed_dataset.sizes.dtype) + assert indexed_dataset.sizes.dtype == np.int32 + + # Build samples mapping + verbose = local_rank == 0 + start_time = time.time() + print_rank_0(" > building sapmles index mapping for {} ...".format(name)) + # First compile and then import. 
+ try: + from tool_helpers import helpers + except ModuleNotFoundError: + print_rank_0(" > missing tool_helpers, pip install tool_helpers please, try to compile locally.") + if local_rank == 0: + compile_helper() + import data_tools.helpers as helpers + + samples_mapping = helpers.build_mapping( + indexed_dataset.doc_idx, + indexed_dataset.sizes, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + 2 if binary_head else 1, + ) + print_rank_0(" > done building sapmles index maping") + np.save(indexmap_filename, samples_mapping, allow_pickle=True) + print_rank_0(" > saved the index mapping in {}".format(indexmap_filename)) + # Make sure all the ranks have built the mapping + print_rank_0( + " > elasped time to build and save samples mapping " "(seconds): {:4f}".format(time.time() - start_time) + ) + + else: + while True: + if not os.path.isfile(indexmap_filename): + time.sleep(3) + else: + try: + np.load(indexmap_filename, allow_pickle=True, mmap_mode="r") + break + except Exception: + print("%s file is still writing or damaged, please wait a moment." % indexmap_filename) + time.sleep(3) + + # This should be a barrier but nccl barrier assumes + # device_index=rank which is not the case for model + # parallel case + if paddle.distributed.get_world_size() > 1: + if paddle.in_dynamic_mode(): + paddle.distributed.barrier() + + # Load indexed dataset. + print_rank_0(" > loading indexed mapping from {}".format(indexmap_filename)) + start_time = time.time() + samples_mapping = np.load(indexmap_filename, allow_pickle=True, mmap_mode="r") + print_rank_0(" loaded indexed file in {:3.3f} seconds".format(time.time() - start_time)) + print_rank_0(" total number of samples: {}".format(samples_mapping.shape[0])) + + return samples_mapping diff --git a/model_zoo/ernie-1.0/data_tools/ernie_dataset.py b/model_zoo/ernie-1.0/data_tools/ernie_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..935c98692a774997317431e8165dfbca9931ba14 --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/ernie_dataset.py @@ -0,0 +1,239 @@ +# Copyright (c) 2021, PadddlePaddle authors. All Rights Reserved. +# Copyright (c) 2020, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""BERT Style dataset.""" + +import copy + +import numpy as np +import paddle + +from .dataset_utils import ( + create_masked_lm_predictions, + create_tokens_and_tokentypes, + get_a_and_b_segments, + get_samples_mapping, + truncate_segments, +) + + +class ErnieDataset(paddle.io.Dataset): + def __init__( + self, + name, + tokenizer, + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + masked_lm_prob, + max_seq_length, + short_seq_prob, + seed, + binary_head, + share_folder=False, + args=None, + ): + + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + self.binary_head = binary_head + self.share_folder = share_folder + self.args = args + + # Dataset. 
+ self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping( + self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + self.max_seq_length - 3, # account for added tokens + short_seq_prob, + self.seed, + self.name, + self.binary_head, + self.share_folder, + ) + + # Vocab stuff. + # tokenizer = get_tokenizer() + # self.vocab_id_list = list(tokenizer.inv_vocab.keys()) + # self.vocab_id_to_token_dict = tokenizer.inv_vocab + self.vocab_id_list = list(tokenizer.vocab.idx_to_token.keys()) + self.vocab_id_to_token_dict = copy.deepcopy(tokenizer.vocab.idx_to_token) + self.vocab_token_to_id_dict = copy.deepcopy(tokenizer.vocab.token_to_idx) + + # ERNIE is chinse char level model, sometime is need + # add ## chinse char to encode and decode. + # Here we extend the vocab dict. + self.vocab_id_to_token_dict.update(tokenizer.added_tokens_decoder) + self.vocab_token_to_id_dict.update(tokenizer.added_tokens_encoder) + + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + start_idx, end_idx, seq_length = self.samples_mapping[idx] + sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)] + + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. + # We % 2**32 since numpy requres the seed to be between 0 and 2**32 - 1 + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return build_training_sample( + sample, + seq_length, + self.max_seq_length, # needed for padding + self.vocab_id_list, + self.vocab_id_to_token_dict, + self.vocab_token_to_id_dict, + self.cls_id, + self.sep_id, + self.mask_id, + self.pad_id, + self.masked_lm_prob, + np_rng, + self.binary_head, + self.args, + ) + + +def build_training_sample( + sample, + target_seq_length, + max_seq_length, + vocab_id_list, + vocab_id_to_token_dict, + vocab_token_to_id_dict, + cls_id, + sep_id, + mask_id, + pad_id, + masked_lm_prob, + np_rng, + binary_head, + args=None, +): + """Biuld training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + target_seq_length: Desired sequence length. + max_seq_length: Maximum length of the sequence. All values are padded to + this length. + vocab_id_list: List of vocabulary ids. Used to pick a random id. + vocab_id_to_token_dict: A dictionary from vocab ids to text tokens. + vocab_token_to_id_dict: A dictionary from text tokens to vocab ids. + cls_id: Start of example id. + sep_id: Separator id. + mask_id: Mask token id. + pad_id: Padding token id. + masked_lm_prob: Probability to mask tokens. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. + """ + + if binary_head: + # We assume that we have at least two sentences in the sample + assert len(sample) > 1, "The sentence num should be large than 1." + assert target_seq_length <= max_seq_length + + # Divide sample into two segments (A and B). + if binary_head: + tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample, np_rng) + else: + tokens_a = [] + for j in range(len(sample)): + tokens_a.extend(sample[j]) + tokens_b = [] + is_next_random = False + + # Truncate to `target_sequence_length`. 
+ max_num_tokens = target_seq_length + truncate_segments(tokens_a, tokens_b, len(tokens_a), len(tokens_b), max_num_tokens, np_rng) + + # Build tokens and toketypes. + tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id) + + # Masking. + max_predictions_per_seq = masked_lm_prob * max_num_tokens + (tokens, masked_positions, masked_labels, _, _) = create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + vocab_token_to_id_dict=vocab_token_to_id_dict, + to_chinese_char=True, + inplace_random_mask=False, + favor_longer_ngram=False if args is None else args.favor_longer_ngram, + max_ngrams=3 if args is None else args.max_ngrams, + ) + + # Padding. + tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np = pad_and_convert_to_numpy( + tokens, tokentypes, masked_positions, masked_labels, pad_id, max_seq_length + ) + + return tokens_np, tokentypes_np, padding_mask_np, masked_positions, masked_labels, int(is_next_random) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + # print(num_tokens, max_seq_length) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. + filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, dtype=np.float32) + padding_mask_np = (1 - padding_mask_np) * -1e4 + + padding_mask_np = padding_mask_np.reshape([1, 1, -1]) + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np diff --git a/model_zoo/ernie-1.0/data_tools/helpers.cpp b/model_zoo/ernie-1.0/data_tools/helpers.cpp new file mode 100644 index 0000000000000000000000000000000000000000..07ae7ccdf6dd6ec871b623519fd8162d8eac606e --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/helpers.cpp @@ -0,0 +1,741 @@ +/* + coding=utf-8 + Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+ */ + + +/* Helper methods for fast index mapping builds */ + +#include +#include +#include +#include +#include +#include +#include +#include + +namespace py = pybind11; +using namespace std; + +const int32_t LONG_SENTENCE_LEN = 512; + +void build_blending_indices(py::array_t& dataset_index, + py::array_t& dataset_sample_index, + const py::array_t& weights, + const int32_t num_datasets, + const int64_t size, + const bool verbose) { + /* Given multiple datasets and a weighting array, build samples + such that it follows those wieghts.*/ + + if (verbose) { + std::cout << "> building indices for blendable datasets ..." << std::endl; + } + + // Get the pointer access without the checks. + auto dataset_index_ptr = dataset_index.mutable_unchecked<1>(); + auto dataset_sample_index_ptr = dataset_sample_index.mutable_unchecked<1>(); + auto weights_ptr = weights.unchecked<1>(); + + // Initialize buffer for number of samples used for each dataset. + int64_t current_samples[num_datasets]; + for (int64_t i = 0; i < num_datasets; ++i) { + current_samples[i] = 0; + } + + // For each sample: + for (int64_t sample_idx = 0; sample_idx < size; ++sample_idx) { + // Determine where the max error in sampling is happening. + auto sample_idx_double = std::max(static_cast(sample_idx), 1.0); + int64_t max_error_index = 0; + double max_error = weights_ptr[0] * sample_idx_double - + static_cast(current_samples[0]); + for (int64_t dataset_idx = 1; dataset_idx < num_datasets; ++dataset_idx) { + double error = weights_ptr[dataset_idx] * sample_idx_double - + static_cast(current_samples[dataset_idx]); + if (error > max_error) { + max_error = error; + max_error_index = dataset_idx; + } + } + + // Populate the indices. + dataset_index_ptr[sample_idx] = static_cast(max_error_index); + dataset_sample_index_ptr[sample_idx] = current_samples[max_error_index]; + + // Update the total samples. + current_samples[max_error_index] += 1; + } + + // print info + if (verbose) { + std::cout << " > sample ratios:" << std::endl; + for (int64_t dataset_idx = 0; dataset_idx < num_datasets; ++dataset_idx) { + auto ratio = static_cast(current_samples[dataset_idx]) / + static_cast(size); + std::cout << " dataset " << dataset_idx + << ", input: " << weights_ptr[dataset_idx] + << ", achieved: " << ratio << std::endl; + } + } +} + + +py::array build_sample_idx(const py::array_t& sizes_, + const py::array_t& doc_idx_, + const int32_t seq_length, + const int32_t num_epochs, + const int64_t tokens_per_epoch) { + /* Sample index (sample_idx) is used for gpt2 like dataset for which + the documents are flattened and the samples are built based on this + 1-D flatten array. It is a 2D array with sizes [number-of-samples + 1, 2] + where [..., 0] contains the index into `doc_idx` and [..., 1] is the + starting offset in that document.*/ + + // Consistency checks. + assert(seq_length > 1); + assert(num_epochs > 0); + assert(tokens_per_epoch > 1); + + // Remove bound checks. + auto sizes = sizes_.unchecked<1>(); + auto doc_idx = doc_idx_.unchecked<1>(); + + // Mapping and it's length (1D). 
+ int64_t num_samples = (num_epochs * tokens_per_epoch - 1) / seq_length; + int32_t* sample_idx = new int32_t[2 * (num_samples + 1)]; + + cout << " using:" << endl << std::flush; + cout << " number of documents: " << doc_idx_.shape(0) / num_epochs + << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " sequence length: " << seq_length << endl + << std::flush; + cout << " total number of samples: " << num_samples << endl + << std::flush; + + // Index into sample_idx. + int64_t sample_index = 0; + // Index into doc_idx. + int64_t doc_idx_index = 0; + // Beginning offset for each document. + int32_t doc_offset = 0; + // Start with first document and no offset. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + + while (sample_index <= num_samples) { + // Start with a fresh sequence. + int32_t remaining_seq_length = seq_length + 1; + while (remaining_seq_length != 0) { + // Get the document length. + auto doc_id = doc_idx[doc_idx_index]; + auto doc_length = sizes[doc_id] - doc_offset; + // And add it to the current sequence. + remaining_seq_length -= doc_length; + // If we have more than a full sequence, adjust offset and set + // remaining length to zero so we return from the while loop. + // Note that -1 here is for the same reason we have -1 in + // `_num_epochs` calculations. + if (remaining_seq_length <= 0) { + doc_offset += (remaining_seq_length + doc_length - 1); + remaining_seq_length = 0; + } else { + // Otherwise, start from the beginning of the next document. + ++doc_idx_index; + doc_offset = 0; + } + } + // Record the sequence. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + } + + // Method to deallocate memory. + py::capsule free_when_done(sample_idx, [](void* mem_) { + int32_t* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(int32_t); + return py::array(std::vector{num_samples + 1, 2}, // shape + {2 * byte_size, byte_size}, // C-style contiguous strides + sample_idx, // the data pointer + free_when_done); // numpy array references +} + + +inline int32_t get_target_sample_len(const int32_t short_seq_ratio, + const int32_t max_length, + std::mt19937& rand32_gen) { + /* Training sample length. */ + if (short_seq_ratio == 0) { + return max_length; + } + const auto random_number = rand32_gen(); + if ((random_number % short_seq_ratio) == 0) { + return 2 + random_number % (max_length - 1); + } + return max_length; +} + + +template +py::array build_mapping_impl(const py::array_t& docs_, + const py::array_t& sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const double short_seq_prob, + const int32_t seed, + const bool verbose, + const int32_t min_num_sent) { + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(short_seq_prob >= 0.0); + assert(short_seq_prob <= 1.0); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + + // For efficiency, convert probability to ratio. Note: rand() generates int. 
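+  // For example, short_seq_prob = 0.1 becomes short_seq_ratio = 10, so roughly
+  // one in ten samples gets a shortened target length in get_target_sample_len.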
+ int32_t short_seq_ratio = 0; + if (short_seq_prob > 0) { + short_seq_ratio = static_cast(round(1.0 / short_seq_prob)); + } + + if (verbose) { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 + << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " + << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " minimum sentences num: " << min_num_sent << endl + << std::flush; + cout << " short sequence probability: " << short_seq_prob << endl + << std::flush; + cout << " short sequence ration (1/prob): " << short_seq_ratio << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and it's length (1D). + int64_t num_samples = -1; + DocIdx* maps = NULL; + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) { + // Set the seed so both iterations produce the same results. + std::mt19937 rand32_gen(seed); + + // Set the flag on second iteration. + second = (iteration == 1); + + // Counters: + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + + // Current map index. + uint64_t map_index = 0; + + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) { + if (map_index >= max_num_samples) { + if (verbose && (!second)) { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + if(epoch > 0 && map_index == 0){ + cout << endl << " No available documtment find this dataset." << endl << std::flush; + throw std::invalid_argument( + "Invalid dataset! the document should be with more than " + + std::to_string(min_num_sent) + " scentences."); + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) { + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + // At the beginning of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. + auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) { + if (num_remain_sent == 0) { + ++empty_docs; + } + if (num_remain_sent == 1) { + ++one_sent_docs; + } + } + + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent > 1) { + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + if (sizes[sent_index] > LONG_SENTENCE_LEN) { + if ((epoch == 0) && (!second)) { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + // If we have more than two sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) { + // Set values. 
+ auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + auto target_seq_len = get_target_sample_len( + short_seq_ratio, max_seq_length, rand32_gen); + + // Loop through sentences. + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and if not only one sentence is left in the document. + // and if we have at least two sentneces. + // and if we have reached end of the document. + if (((seq_len >= target_seq_len) && (num_remain_sent > 1) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) { + // Check for overflow. + if ((3 * map_index + 2) > std::numeric_limits::max()) { + cout << "number of samples exceeded maximum " + << "allowed by type int64: " + << std::numeric_limits::max() << endl; + throw std::overflow_error("Number of samples"); + } + + // Populate the map. + if (second) { + const auto map_index_0 = 3 * map_index; + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(target_seq_len); + } + + // Update indices / counters. + ++map_index; + prev_start_index = sent_index + 1; + target_seq_len = get_target_sample_len( + short_seq_ratio, max_seq_length, rand32_gen); + seq_len = 0; + num_sent = 0; + } + + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) { + if (verbose) { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs + << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs + << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[3 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 3 * i; + const auto j0 = 3 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void* mem_) { + DocIdx* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 3}, // shape + {3 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + + +py::array build_mapping(const py::array_t& docs_, + const py::array_t& sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const double short_seq_prob, + const int seed, + const bool verbose, + const int32_t min_num_sent) { + if (sizes_.size() > std::numeric_limits::max()) { + if (verbose) { + cout << " using uint64 for data mapping..." 
<< endl << std::flush; + } + return build_mapping_impl(docs_, + sizes_, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + min_num_sent); + } else { + if (verbose) { + cout << " using uint32 for data mapping..." << endl << std::flush; + } + return build_mapping_impl(docs_, + sizes_, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + min_num_sent); + } +} + +template +py::array build_blocks_mapping_impl(const py::array_t& docs_, + const py::array_t& sizes_, + const py::array_t& titles_sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const int32_t seed, + const bool verbose, + const bool use_one_sent_blocks) { + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + auto titles_sizes = titles_sizes_.unchecked<1>(); + + if (verbose) { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 + << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " + << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and its length (1D). + int64_t num_samples = -1; + DocIdx* maps = NULL; + + // Acceptable number of sentences per block. + int min_num_sent = 2; + if (use_one_sent_blocks) { + min_num_sent = 1; + } + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) { + // Set the flag on second iteration. + second = (iteration == 1); + + // Current map index. + uint64_t map_index = 0; + + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) { + // assign every block a unique id + int32_t block_id = 0; + + if (map_index >= max_num_samples) { + if (verbose && (!second)) { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) { + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + const auto target_seq_len = max_seq_length - titles_sizes[doc]; + + // At the beginning of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. 
+ auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) { + if (num_remain_sent == 0) { + ++empty_docs; + } + if (num_remain_sent == 1) { + ++one_sent_docs; + } + } + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent >= min_num_sent) { + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + if (sizes[sent_index] > LONG_SENTENCE_LEN) { + if ((epoch == 0) && (!second)) { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + // If we have enough sentences and no long sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) { + // Set values. + auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + + // Loop through sentences. + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and there are an acceptable number of sentences left + // and if we have at least the minimum number of sentences. + // or if we have reached end of the document. + if (((seq_len >= target_seq_len) && + (num_remain_sent >= min_num_sent) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) { + // Populate the map. + if (second) { + const auto map_index_0 = 4 * map_index; + // Each sample has 4 items: the starting sentence index, ending + // sentence index, + // the index of the document from which the block comes (used + // for fetching titles) + // and the unique id of the block (used for creating block + // indexes) + + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(doc); + maps[map_index_0 + 3] = static_cast(block_id); + } + + // Update indices / counters. + ++map_index; + ++block_id; + prev_start_index = sent_index + 1; + seq_len = 0; + num_sent = 0; + } + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) { + if (verbose) { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs + << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs + << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[4 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 4 * i; + const auto j0 = 4 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + swap(maps[i0 + 3], maps[j0 + 3]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void* mem_) { + DocIdx* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. 
+ const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 4}, // shape + {4 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + +py::array build_blocks_mapping(const py::array_t& docs_, + const py::array_t& sizes_, + const py::array_t& titles_sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const int seed, + const bool verbose, + const bool use_one_sent_blocks) { + if (sizes_.size() > std::numeric_limits::max()) { + if (verbose) { + cout << " using uint64 for data mapping..." << endl << std::flush; + } + return build_blocks_mapping_impl(docs_, + sizes_, + titles_sizes_, + num_epochs, + max_num_samples, + max_seq_length, + seed, + verbose, + use_one_sent_blocks); + } else { + if (verbose) { + cout << " using uint32 for data mapping..." << endl << std::flush; + } + return build_blocks_mapping_impl(docs_, + sizes_, + titles_sizes_, + num_epochs, + max_num_samples, + max_seq_length, + seed, + verbose, + use_one_sent_blocks); + } +} + +PYBIND11_MODULE(helpers, m) { + m.def("build_mapping", &build_mapping); + m.def("build_blocks_mapping", &build_blocks_mapping); + m.def("build_sample_idx", &build_sample_idx); + m.def("build_blending_indices", &build_blending_indices); +} diff --git a/model_zoo/ernie-1.0/finetune/README.md b/model_zoo/ernie-1.0/finetune/README.md new file mode 100644 index 0000000000000000000000000000000000000000..27603219d59b7f3a5b96590e3890ef1490af5a60 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/README.md @@ -0,0 +1,71 @@ +# 预训练下游任务 finetune 示例 +使用训练中产出的checkpoint,或者paddlenlp内置的模型权重,使用本脚本,用户可以快速对当前模型效果进行评估。 + +## 运行示例 +本文档适配了三大主流下游任务,用户可以根据自己的需求,评估自己所需的数据集。 + +运行脚本示例如下: + +1. 序列分类 +```shell +dataset="chnsenticorp_v2" +python run_seq_cls.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0 \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + +2. Token分类 +```shell +dataset="peoples_daily_ner" +python run_ner.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0 \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + +3. 阅读理解 +```shell +dataset="cmrc2018" +python run_qa.py \ + --do_train \ + --do_eval \ + --model_name_or_path ernie-1.0 \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + +## 参数说明 + +### 传入参数 +必须参数 +- do_train、do_eval、do_predict分别表示运行训练、评估、测试数据集合。 +- do_export 导出为inference预测模型 +- model_name_or_path 表示模型权重名称,或者训练中保存的checkpoint地址 +- dataset 表示数据集名称 +- output_dir 表示运行中,一些checkpoint等参数的输出目录 + +其他可配置参数: +- per_device_train_batch_size 训练时batch大小 +- per_device_eval_batch_size 评估时batch大小 +- num_train_epochs 训练epoch数目 +- learning_rate 学习率 +- max_seq_length 最大序列长度 +- weight_decay 训练时优化器对参数衰减系数 +- logging_steps 打印日志间隔步数 +- eval_steps 评估效果间隔步数 +- max_steps 最大训练步数(可覆盖num_train_epochs) + + +### yaml文件参数 +本示例也支持用户在yaml文件中配置参数。用户可以自行修改`config.yaml`文件。 + +注意: +- 这些参数会重写传入的默认参数,以yaml文件参数为准。 +- yaml文件中的batch_size同时等价于per_device_train_batch_size,per_device_eval_batch_size diff --git a/model_zoo/ernie-1.0/finetune/config.yml b/model_zoo/ernie-1.0/finetune/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..d5d1e1082182abc6730827d8689e1fed11f96c1b --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/config.yml @@ -0,0 +1,71 @@ +# Default Args for all dataset +# You can overwrite the configs in each dataset. 
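+# For example, an entry under one of the task sections below only needs to list
+# the fields it overrides (the dataset name here is just a placeholder):
+#
+#   SequenceClassification:
+#     my_dataset:
+#       learning_rate: 0.00002
+#       batch_size: 32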
+DefaultArgs: + learning_rate: 0.00005 + num_train_epochs: 3 + batch_size: 64 + max_seq_length: 128 + weight_decay: 0.01 + logging_steps: 10 + eval_steps: 200 + minimum_eval_times: -1 + max_steps: -1 + warmup_steps: 0 + warmup_ratio: 0.1 + metric: "Accuracy" + +# Datasets which used for sequence classfication +SequenceClassification: + clue afqmc: + num_train_epochs: 4 + clue tnews: + num_train_epochs: 4 + clue iflytek: + num_train_epochs: 8 + clue ocnli: + num_train_epochs: 8 + clue cmnli: + num_train_epochs: 3 + clue wsc: + num_train_epochs: 50 + clue csl: + num_train_epochs: 10 + max_seq_length: 256 + batch_size: 32 + xnli_cn: + learning_rate: 0.0001 + num_train_epochs: 3 + batch_size: 256 + chnsenticorp_v2: + learning_rate: 0.00005 + batch_size: 16 + num_train_epochs: 8 + +# Datasets which used for token classfication +TokenClassification: + peoples_daily_ner: + learning_rate: 0.00005 + num_train_epochs: 4 + batch_size: 128 + msra_ner: + num_train_epochs: 3 + +# Datasets which used for question answersing +QuestionAnswering: + cmrc2018: + learning_rate: 0.00005 + num_train_epochs: 1 + batch_size: 32 + max_seq_length: 384 + dureader_nlp: + num_train_epochs: 1 + batch_size: 12 + max_seq_length: 384 + dureader_robust: + num_train_epochs: 1 + batch_size: 12 + max_seq_length: 384 + dlbp: + num_train_epochs: 1 + batch_size: 12 + max_seq_length: 384 diff --git a/model_zoo/ernie-1.0/finetune/deploy/README.md b/model_zoo/ernie-1.0/finetune/deploy/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d4e135253ebb5884a3fed88021a690b148d34d8b --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/deploy/README.md @@ -0,0 +1,138 @@ +# FastDeploy ERNIE 1.0 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下提供 `seq_cls_infer.py` 快速完成在 CPU/GPU 的中文情感分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 1.0 模型在 ChnSenticorp 数据集上进行文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE 1.0 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-1.0/finetune/tmp/export`(用户可按实际情况设置)。 + + +```bash +# CPU 推理 +python seq_cls_infer.py --model_dir ../tmp/chnsenticorp_v2/export/ --device cpu --backend paddle +# GPU 推理 +python seq_cls_infer.py --model_dir ../tmp/chnsenticorp_v2/export/ --device gpu --backend paddle +``` + +运行完成后返回的结果如下: + +```bash +[2023-02-26 13:38:46,370] [ INFO] - We are using to load '../../finetune/tmp/chnsenticorp_v2/export/'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id: 0, example id: 0, sentence: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般, label: negative, negative prob: 0.9999, positive prob: 0.0001. +Batch id: 1, example id: 0, sentence: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片!开始还怀疑是不是赠送的个别现象,可是后来发现每张DVD后面都有!真不知道生产商怎么想的,我想看的是猫和老鼠,不是米老鼠!如果厂家是想赠送的话,那就全套米老鼠和唐老鸭都赠送,只在每张DVD后面添加一集算什么??简直是画蛇添足!!, label: negative, negative prob: 0.9998, positive prob: 0.0002. +Batch id: 2, example id: 0, sentence: 还稍微重了点,可能是硬盘大的原故,还要再轻半斤就好了。其他要进一步验证。贴的几种膜气泡较多,用不了多久就要更换了,屏幕膜稍好点,但比没有要强多了。建议配赠几张膜让用用户自己贴。, label: negative, negative prob: 0.9999, positive prob: 0.0001. +...... 
```

## 参数说明

| 参数 | 参数说明 |
|----------|--------------|
|--model_dir | 指定部署模型的目录 |
|--batch_size | 输入的batch size,默认为 1 |
|--max_length | 最大序列长度,默认为 128 |
|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' |
|--device_id | 运行设备的id,默认为0 |
|--cpu_threads | 当使用cpu推理时,指定推理的cpu线程数,默认为1 |
|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' |
|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False |
|--use_fast | 是否使用FastTokenizer加速分词阶段,默认为True |

## FastDeploy 高阶用法

FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE 1.0 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE 1.0 模型。

符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持;

| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
|------|----------------|----------------|--------------------|-------------------------------|-------------------|
| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ✅ | N/A |
| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ❔ | ✅ |
| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ❔ | N/A |
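上表中的硬件与推理引擎接口,可以参考下面的示意代码在 Python 端自行组合(仅为草图,不是本目录脚本的固定实现;其中模型路径为假设值,请按实际导出目录修改):

```python
import fastdeploy as fd

option = fd.RuntimeOption()
# 假设部署模型已按训练文档导出到 ../tmp/chnsenticorp_v2/export/ 目录
option.set_model_path(
    "../tmp/chnsenticorp_v2/export/model.pdmodel",
    "../tmp/chnsenticorp_v2/export/model.pdiparams",
)

# 1. 选择硬件(对应表中“硬件对应的接口”一列)
option.use_gpu(0)                  # 也可改为 option.use_cpu()

# 2. 选择推理引擎后端(对应表中“推理引擎对应的接口”一列)
option.use_paddle_infer_backend()  # 也可改为 option.use_ort_backend() 等

runtime = fd.Runtime(option)
```

具体的前后处理逻辑可参考下方 `seq_cls_infer.py` 中的 `Predictor` 类。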
diff --git a/model_zoo/ernie-1.0/finetune/deploy/seq_cls_infer.py b/model_zoo/ernie-1.0/finetune/deploy/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..5232e09505b38f13fb5e062522daea1504a3c81e --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/deploy/seq_cls_infer.py @@ -0,0 +1,155 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + 
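            # GPU path: bind the selected device id first, then choose the inference backend below.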
option.use_gpu(args.device_id) + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "token_type_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + def preprocess(self, text): + data = self.tokenizer(text, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + test_ds = load_dataset("chnsenticorp", splits=["test"]) + texts_ds = [d["text"] for d in test_ds] + label_map = {0: "negative", 1: "positive"} + batch_texts = batchfy_text(texts_ds, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for i, sentence1 in enumerate(texts): + print( + f"Batch id: {bs}, example id: {i}, sentence: {sentence1}, " + f"label: {label_map[outputs['label'][i]]}, negative prob: {outputs['confidence'][i][0]:.4f}, " + f"positive prob: {outputs['confidence'][i][1]:.4f}." + ) diff --git a/model_zoo/ernie-1.0/finetune/question_answering.py b/model_zoo/ernie-1.0/finetune/question_answering.py new file mode 100644 index 0000000000000000000000000000000000000000..010b07676b2e7cff2b0bc14752f79138c78743aa --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/question_answering.py @@ -0,0 +1,219 @@ +# Copyright 2020-present the HuggingFace Inc. team. +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
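# This module provides the pieces used by run_qa.py for extractive (SQuAD-style) QA:
# - QuestionAnsweringTrainer: hooks a post_process_function into evaluate()/predict(),
#   so start/end logits are decoded into text answers before metrics are computed.
# - CrossEntropyLossForSQuAD: averages the cross entropy of the start and end positions.
# - prepare_train_features / prepare_validation_features: split long contexts into
#   overlapping chunks with `doc_stride` and align answer spans to token positions.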
+ +import paddle + +from paddlenlp.trainer import PredictionOutput, Trainer + + +class QuestionAnsweringTrainer(Trainer): + def __init__(self, *args, eval_examples=None, post_process_function=None, **kwargs): + super().__init__(*args, **kwargs) + self.eval_examples = eval_examples + self.post_process_function = post_process_function + + def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + eval_examples = self.eval_examples if eval_examples is None else eval_examples + + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + eval_preds = self.post_process_function(eval_examples, eval_dataset, output.predictions) + metrics = self.compute_metrics(eval_preds) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + self.log(metrics) + else: + metrics = {} + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics) + return metrics + + def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"): + predict_dataloader = self.get_test_dataloader(predict_dataset) + + # Temporarily disable metric computation, we will do it in the loop here. 
+ compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + predict_dataloader, + description="Prediction", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is None or self.compute_metrics is None: + return output + + predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict") + metrics = self.compute_metrics(predictions) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return PredictionOutput(predictions=predictions.predictions, label_ids=predictions.label_ids, metrics=metrics) + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def prepare_train_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. 
+ if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HuggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
+ + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples diff --git a/model_zoo/ernie-1.0/finetune/run_ner.py b/model_zoo/ernie-1.0/finetune/run_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..f754c83684cb84e506be5d6306431b22b717039a --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/run_ner.py @@ -0,0 +1,208 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from datasets import load_metric + +import paddlenlp +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.insert(0, os.path.abspath(".")) +from token_classification import ner_trans_fn # noqa: E402 +from utils import ALL_DATASETS, DataArguments, ModelArguments # noqa: E402 + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # set_seed(args) + data_args.dataset = data_args.dataset.strip() + if data_args.dataset not in ALL_DATASETS: + raise ValueError("Not found dataset {}".format(data_args.dataset)) + + if data_args.dataset in ALL_DATASETS: + # if you custom you hyper-parameters in yaml config, it will overwrite all args. 
+ config = ALL_DATASETS[data_args.dataset] + for args in (model_args, data_args, training_args): + for arg in vars(args): + if arg in config.keys(): + setattr(args, arg, config[arg]) + + training_args.per_device_train_batch_size = config["batch_size"] + training_args.per_device_eval_batch_size = config["batch_size"] + + dataset_config = data_args.dataset.split(" ") + raw_datasets = load_dataset( + dataset_config[0], + None if len(dataset_config) <= 1 else dataset_config[1], + ) + + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + data_args.ignore_label = -100 + data_args.no_entity_id = len(data_args.label_list) - 1 + + num_classes = 1 if raw_datasets["train"].label_list is None else len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + class criterion(nn.Layer): + def __init__(self): + super(criterion, self).__init__() + self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=data_args.ignore_label) + + def forward(self, *args, **kwargs): + return paddle.mean(self.loss_fn(*args, **kwargs)) + + loss_fct = criterion() + + # Define dataset pre-process function + trans_fn = partial(ner_trans_fn, tokenizer=tokenizer, args=data_args) + # Define data collector + data_collator = DataCollatorForTokenClassification(tokenizer) + + # Dataset pre-process + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn) + + # Define the metrics of tasks. 
+ # Metrics + metric = load_metric("seqeval") + + def compute_metrics(p): + predictions, labels = p + predictions = np.argmax(predictions, axis=2) + + # Remove ignored index (special tokens) + true_predictions = [ + [label_list[p] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + true_labels = [ + [label_list[l] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + results = metric.compute(predictions=true_predictions, references=true_labels) + return { + "precision": results["overall_precision"], + "recall": results["overall_recall"], + "f1": results["overall_f1"], + "accuracy": results["overall_accuracy"], + } + + trainer = Trainer( + model=model, + criterion=loss_fct, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/finetune/run_qa.py b/model_zoo/ernie-1.0/finetune/run_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..d09436b737e3f5afe59b72c911a8277312a6ea49 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/run_qa.py @@ -0,0 +1,239 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
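# run_qa.py: entry point for extractive question answering fine-tuning (e.g. cmrc2018).
# Hyper-parameters from the QuestionAnswering section of config.yml override the CLI
# defaults; training, evaluation and prediction are driven by QuestionAnsweringTrainer,
# and the trained model can optionally be exported for inference with --do_export.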
+ +import os +import sys +from functools import partial + +import paddle +from datasets import load_dataset + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer import ( + EvalPrediction, + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import AutoModelForQuestionAnswering, AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.insert(0, os.path.abspath(".")) +from question_answering import ( # noqa: E402 + CrossEntropyLossForSQuAD, + QuestionAnsweringTrainer, + prepare_train_features, + prepare_validation_features, +) +from utils import ALL_DATASETS, DataArguments, ModelArguments # noqa: E402 + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # set_seed(args) + data_args.dataset = data_args.dataset.strip() + + if data_args.dataset in ALL_DATASETS: + # if you custom you hyper-parameters in yaml config, it will overwrite all args. + config = ALL_DATASETS[data_args.dataset] + for args in (model_args, data_args, training_args): + for arg in vars(args): + if arg in config.keys(): + setattr(args, arg, config[arg]) + + training_args.per_device_train_batch_size = config["batch_size"] + training_args.per_device_eval_batch_size = config["batch_size"] + + dataset_config = data_args.dataset.split(" ") + raw_datasets = load_dataset( + dataset_config[0], None if len(dataset_config) <= 1 else dataset_config[1], cache_dir=model_args.cache_dir + ) + + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + + # Define tokenizer, model, loss function. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForQuestionAnswering.from_pretrained(model_args.model_name_or_path) + + loss_fct = CrossEntropyLossForSQuAD() + + # Preprocessing the datasets. + # Preprocessing is slighlty different for training and evaluation. 
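    # Training data is processed with prepare_train_features (which labels the answer span),
    # while evaluation/prediction use prepare_validation_features (which keeps offset
    # mappings and example ids so predictions can be mapped back to the original context).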
+ if training_args.do_train: + column_names = raw_datasets["train"].column_names + elif training_args.do_eval: + column_names = raw_datasets["validation"].column_names + else: + column_names = raw_datasets["test"].column_names + + if training_args.do_train: + train_dataset = raw_datasets["train"] + # Create train feature from dataset + with training_args.main_process_first(desc="train dataset map pre-processing"): + # Dataset pre-process + train_dataset = train_dataset.map( + partial(prepare_train_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + if training_args.do_eval: + eval_examples = raw_datasets["validation"] + with training_args.main_process_first(desc="evaluate dataset map pre-processing"): + eval_dataset = eval_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + if training_args.do_predict: + predict_examples = raw_datasets["test"] + with training_args.main_process_first(desc="test dataset map pre-processing"): + predict_dataset = predict_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on prediction dataset", + ) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Post-processing: + def post_processing_function(examples, features, predictions, stage="eval"): + # Post-processing: we match the start logits and end logits to answers in the original context. + predictions, all_nbest_json, scores_diff_json = compute_prediction( + examples=examples, + features=features, + predictions=predictions, + n_best_size=data_args.n_best_size, + max_answer_length=data_args.max_answer_length, + null_score_diff_threshold=data_args.null_score_diff_threshold, + ) + + # # Format the result to the format the metric expects. 
+ # formatted_predictions = [{ + # "id": k, + # "prediction_text": v + # } for k, v in predictions.items()] + + references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples] + return EvalPrediction(predictions=predictions, label_ids=references) + + def compute_metrics(p: EvalPrediction): + ret = squad_evaluate(examples=p.label_ids, preds=p.predictions, is_whitespace_splited=False) + return dict(ret) + # return metric.compute(predictions=p.predictions, references=p.label_ids) + + trainer = QuestionAnsweringTrainer( + model=model, + criterion=loss_fct, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=eval_examples if training_args.do_eval else None, + data_collator=data_collator, + post_process_function=post_processing_function, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + if training_args.do_train: + # Training + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # model.set_state_dict(paddle.load("tmp/model_state.pdparams")) + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(predict_dataset, predict_examples) + trainer.log_metrics("predict", test_ret.metrics) + + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/finetune/run_seq_cls.py b/model_zoo/ernie-1.0/finetune/run_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..2d85030c4ad6a14cb40bb7b90abd7429c7755c2a --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/run_seq_cls.py @@ -0,0 +1,185 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
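# run_seq_cls.py: entry point for sequence classification fine-tuning (e.g. chnsenticorp_v2
# or CLUE tasks). Hyper-parameters from the SequenceClassification section of config.yml
# override the CLI defaults; the fine-tuned model can be exported with --do_export.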
+ +import os +from functools import partial + +import paddle +import paddle.nn as nn +from paddle.metric import Accuracy +from sequence_classification import clue_trans_fn, seq_trans_fn +from utils import ALL_DATASETS, DataArguments, ModelArguments + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + data_args.dataset = data_args.dataset.strip() + + if data_args.dataset in ALL_DATASETS: + # if you custom you hyper-parameters in yaml config, it will overwrite all args. + config = ALL_DATASETS[data_args.dataset] + logger.info("Over-writing training config by yaml config!") + for args in (model_args, data_args, training_args): + for arg in vars(args): + if arg in config.keys(): + setattr(args, arg, config[arg]) + + training_args.per_device_train_batch_size = config["batch_size"] + training_args.per_device_eval_batch_size = config["batch_size"] + + dataset_config = data_args.dataset.split(" ") + raw_datasets = load_dataset( + dataset_config[0], + None if len(dataset_config) <= 1 else dataset_config[1], + ) + + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = 1 if raw_datasets["train"].label_list is None else len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. 
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + criterion = nn.loss.CrossEntropyLoss() if data_args.label_list else nn.loss.MSELoss() + + # Define dataset pre-process function + if "clue" in data_args.dataset: + trans_fn = partial(clue_trans_fn, tokenizer=tokenizer, args=data_args) + else: + trans_fn = partial(seq_trans_fn, tokenizer=tokenizer, args=data_args) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Dataset pre-process + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn) + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + metric.reset() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/finetune/sequence_classification.py b/model_zoo/ernie-1.0/finetune/sequence_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..3952520c65a3fc86525d8a55d44a70f90903df13 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/sequence_classification.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + is_test = True + if "label" in example.keys(): + is_test = False + + if "text_b" in example.keys(): + text = example["text_a"] + text_pair = example["text_b"] + else: + text = example["text"] + text_pair = None + + encoded_inputs = tokenizer(text=text, text_pair=text_pair, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return { + "input_ids": input_ids, + "token_type_ids": token_type_ids, + } + else: + # label = np.array([example["label"]], dtype="int64") + label = int(example["label"]) + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": label} + + +# Data pre-process function for clue benchmark datatset +def convert_clue(example, label_list, tokenizer=None, max_seq_length=512, **kwargs): + """convert a glue example into necessary features""" + is_test = False + if "label" not in example.keys(): + is_test = True + + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + example["label"] = int(example["label"]) if label_dtype != "float32" else float(example["label"]) + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + + if tokenizer is None: + return example + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + if "token_type_ids" in example: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], 
"token_type_ids": example["token_type_ids"]} + + +def seq_trans_fn(example, tokenizer, args): + return convert_example( + example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + ) + + +def clue_trans_fn(example, tokenizer, args): + return convert_clue(example, tokenizer=tokenizer, label_list=args.label_list, max_seq_length=args.max_seq_length) diff --git a/model_zoo/ernie-1.0/finetune/token_classification.py b/model_zoo/ernie-1.0/finetune/token_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..2d4bd0fad78af00905c4dafc85275d876333cf86 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/token_classification.py @@ -0,0 +1,57 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def tokenize_and_align_labels(example, tokenizer, no_entity_id, max_seq_len=512): + if "labels" in example: + labels = example["labels"] + example = example["tokens"] + tokenized_input = tokenizer(example, is_split_into_words=True, max_seq_len=max_seq_len, return_length=False) + + # -2 for [CLS] and [SEP] + if len(tokenized_input["input_ids"]) - 2 < len(labels): + labels = labels[: len(tokenized_input["input_ids"]) - 2] + tokenized_input["labels"] = [no_entity_id] + labels + [no_entity_id] + tokenized_input["labels"] += [no_entity_id] * ( + len(tokenized_input["input_ids"]) - len(tokenized_input["labels"]) + ) + else: + if example["tokens"] == []: + tokenized_input = { + "labels": [], + "input_ids": [], + "token_type_ids": [], + } + return tokenized_input + tokenized_input = tokenizer( + example["tokens"], + max_seq_len=max_seq_len, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). + is_split_into_words=True, + return_length=False, + ) + label_ids = example["ner_tags"] + if len(tokenized_input["input_ids"]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_input["input_ids"]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + + label_ids += [no_entity_id] * (len(tokenized_input["input_ids"]) - len(label_ids)) + tokenized_input["labels"] = label_ids + return tokenized_input + + +def ner_trans_fn(example, tokenizer, args): + return tokenize_and_align_labels( + example, tokenizer=tokenizer, no_entity_id=args.no_entity_id, max_seq_len=args.max_seq_length + ) diff --git a/model_zoo/ernie-1.0/finetune/utils.py b/model_zoo/ernie-1.0/finetune/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..398d144e17447d409a00c4e6deb6660db108ecff --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/utils.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import os.path as osp +from dataclasses import dataclass, field +from typing import Optional + +import yaml + +TASKS = [ + "SequenceClassification", + "TokenClassification", + "QuestionAnswering", +] + +config = yaml.load(open(osp.join(osp.abspath("."), "./config.yml"), "r"), Loader=yaml.FullLoader) +default_args = config["DefaultArgs"] + +ALL_DATASETS = {} + +for task_type in TASKS: + task = config[task_type] + for data_name in task.keys(): + new_args = task[data_name] + new_args = {} if new_args is None else new_args + final_args = copy.deepcopy(default_args) + final_args.update(new_args) + final_args["model"] = "AutoModelFor{}".format(task_type) + ALL_DATASETS[data_name] = final_args + + +class Dict(object): + def __init__(self, fn): + assert isinstance(fn, (dict)), ( + "Input pattern not understood. The input of Dict must be a dict with key of input column name and value of collate_fn " + "Received fn=%s" % (str(fn)) + ) + + self._fn = fn + + for col_name, ele_fn in self._fn.items(): + assert callable(ele_fn), "Batchify functions must be callable! type(fn[%d]) = %s" % ( + col_name, + str(type(ele_fn)), + ) + + def __call__(self, data): + + ret = {} + if len(data) <= 0: + return ret + + for col_name, ele_fn in self._fn.items(): + # skip unused col_name, such as labels in test mode. + if col_name not in data[0].keys(): + continue + result = ele_fn([ele[col_name] for ele in data]) + ret[col_name] = result + + return ret + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset: str = field(default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + # Additional configs for QA task. + doc_stride: int = field( + default=128, + metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."}, + ) + + n_best_size: int = field( + default=20, + metadata={ + "help": "The total number of n-best predictions to generate in the nbest_predictions.json output file." + }, + ) + + max_query_length: int = field( + default=64, + metadata={"help": "Max query length."}, + ) + + max_answer_length: int = field( + default=30, + metadata={"help": "Max answer length."}, + ) + + do_lower_case: bool = field( + default=False, + metadata={ + "help": "Whether to lower case the input text. Should be True for uncased models and False for cased models." 
+ }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + null_score_diff_threshold: float = field( + default=0.0, + metadata={ + "help": "The threshold used to select the null answer: if the best answer has a score that is less than " + "the score of the null answer minus this threshold, the null answer is selected for this example. " + "Only useful when `version_2_with_negative=True`." + }, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + } + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the dataset cache."}, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) diff --git a/model_zoo/ernie-1.0/preprocess/README.md b/model_zoo/ernie-1.0/preprocess/README.md new file mode 100644 index 0000000000000000000000000000000000000000..aab9319c731cb7c832f374176bb0ee5b32f4ee90 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/README.md @@ -0,0 +1,259 @@ +# PaddleNLP 预训练数据流程 + +本示例致力于打造基于PaddleNLP预训练模型的最佳实践。 + + +我们将预训练数据过程划分为以下部分 + +- 原始数据转换,原始文本转换为jsonl的json字符串格式。 +- 数据ID化,断句、分词、tokenize转化为token id格式。 +- 训练index文件生成,生成train、valid、test的每个样本索引。 +- token动态mask(可选),python 层实时mask文本。 + +本目录下主要包含一下文件: +``` +├── create_pretraining_data.py +├── merge.py +├── trans_to_json.py +├── words_segmentation.py +└── README.md +``` + +### 环境依赖 + + - tqdm + - numpy + - pybind11 + - tool_helpers + - lac (可选) + - zstandard (可选) + +安装命令`pip install tqdm numpy pybind11 tool_helpers lac zstandard`。另,部分功能需要`g++>=4.8`编译支持 + + +## 训练全流程数据Pipeline + +飞桨是自主研发、功能完备、开源开放的产业级深度学习平台,集深度学习核心训练和推理框架、基础模型库、端到端开发套件和丰富的工具组件于一体 + +|步骤|阶段                     |数据格式| 样例| +|-|-|-|-| +| 0️⃣初始状态 | -|原始数据:
**每个doc之间用空行间隔开**
- 中文,默认每句换行符,作为句子结束。
- 英文,默认使用nltk判断句子结束 | ```飞桨是功能完备、开源开放的产业级深度学习平台。```
```飞桨拥有核心训练和推理框架、基础模型库。```

```PaddleNLP是自然语言处理领域的优秀工具。``` | +|1️⃣原始数据转换
`trans_to_json.py`|预处理
输入:0️⃣初始状态
输出:jsonl|jsonl格式:每个doc对应一行json字符串| ```{"text": "飞桨是功能完备、开源开放的产业级深度学习平台。飞桨拥有..."}```
```{"text": "PaddleNLP是自然语言..."}``` +|❇️(**可选**)数据中文分词
`words_segmentation.py`|语料分词:中文WWM
输入:jsonl
输出:0️⃣初始状态| 将jsonl格式的数据,恢复成分词后的原始格式数据
| ```飞桨 是 功能 完备、开源 开放的 产业级 深度学习 平台。```
```飞桨 拥有 核心 训练和推理 框架、基础 模型库。```

```PaddleNLP 是 自然语言处理领域 的 优秀工具。``` +|2️⃣数据ID化
`create_pretrain_data.py`|预处理| bin格式:数据id化后的token id
idx格式:数据句子、文章位置索引 | - +|3️⃣训练index文件生成|训练启动|npy格式:
根据训练步数max_steps生成
train、valid、test的每个样本索引文件| - +|4️⃣token动态mask(可选)| Dataset取数据 | 无 |- + + +注意: +- **❇️(**可选**)数据中文分词** 是中文预训练做 WWM 的可选步骤 + - 当你的数据比较少时,分词耗时较少,不需要分词步骤。直接在`create_pretrain_data.py`步骤中分词即可。 + - 目的是为了提前分词,加快后续数据ID转化步骤。 + - 如果这里输入的是 jsonl格式文件,最好为多文件,`trans_to_json.py` 时候开启`no-merge`选项。 + - 当你的数据集比较大,或者需要尝试多次转换数据的时候,提前分词可以避免`create_pretrain_data.py`时每次都运行一次分词程序。 +- 转换后,需要重新进行步骤 1️⃣`原始数据转换 trans_to_json.py`,最后2️⃣`数据ID化`步骤设置`--cn_splited=True`参数。 +- 2️⃣`数据ID化`也可以在转化ID的同时,一起实现分词。不需要❇️`数据中文分词`步骤。 + + +## 数据教程汇总 + +针对目前开源的数据集,PaddleNLP提供了详细的数据教程,点击对应数据集的链接,即可开始进行数据制作: + +| 名称 | 文本类型 | 纯文本大小 | 适配模型 +|-|-|-|-| +| [CLUECorpusSmall](./docs/CLUECorpusSmall.md)| 中文 | 14GB | Llama +| [OpenWebText2](./docs/OpenWebText2.md) | 英文 | 70GB | Llama +| [WuDaoCorpus2.0 Base](./docs/WuDaoCorpusBase.md)| 中文 | 200GB | Llama +| [CLUECorpus2020](./docs/CLUECorpus2020.md)| 中文 | 200GB | Llama + +## 预训练详细准备 + +下面以ziya-llama-13b-v1预训练为例,简要介绍一下预训练的全流程。 + +### 原始数据 +首先下载样例数据: +``` +mkdir data && cd data +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/baike.txt +cd .. +``` + +### 原始数据转换 jsonl 格式 +使用`trans_to_json.py`转化为json串格式,下面是脚本的使用说明 +``` +optional arguments: + -h, --help show this help message and exit + --input_path INPUT_PATH + Path to you raw files. Folder or file path. + 必须设置,可以是文件夹或者单个文件。文件夹中的目录默认最多搜索两层子目录。 + --output_path OUTPUT_PATH + Path to save the output json files. + 必须设置,输出文件的名字。 + --json_key JSON_KEY The content key of json file. + 建议不修改,默认的key是text + --doc_spliter DOC_SPLITER + Spliter between documents. We will strip the line, if you use blank line to split doc, leave it blank. + 根据实际情况修改,默认空行作为文章换行符。 + --min_doc_length MIN_DOC_LENGTH + Minimal char of a documment. + 可选。过滤掉长度多短的文章,默认值10 + --workers WORKERS Number of worker processes to launch + 可选。多进程转化文件,适用于 input_path 中包含的文件数据较多的情况。每个文件,分配给不同worker处理 + --log_interval LOG_INTERVAL + Interval between progress updates. + 可选。此处的interval是值处理完文件个数的间隔。 + --no-merge Don't merge the file. + 可选。默认不开启这个选项,默认每个文件转换的jsonl文本,会拼接成到同一个文件。 + --no-shuffle Don't shuffle the file. + 可选。默认不开启这个选项,默认对处理完进行shuffle。 +``` +根据说明,我们使用下面简单命令,可以得到`baike_sample.jsonl`文件。此处,我们对文章所有doc进行了shuffle。 +```shell +python trans_to_json.py --input_path ./data --output_path baike_sample +``` + +```shell +#查看数据 +head -1 baike_sample.jsonl +{"text": "中国效仿西方发展工业的过程,于中华民国国民政府成立后至中日战争开战前夕已顺畅发展,尽管其间受到内外因素的多重干扰。尔后直至中日战争和国共战争的结束, +中国始有较为长期的和平发展时期。\n1980年代以来,邓小平政府宣布改革开放,开始实行社会主义市场经济并推行经济体制改革。中国大陆近年至2010年,GDP超过72000亿美元, +已经成为美国之后的世界第二经济大国,普遍认为中国是世界上发展速度最快的经济体,但是人均国民生产总值仍位于世界中等水平(第89位),并逐渐受到资源限制和贫富差距加 +大的制约。中华人民共和国省份中,广东为GDP最高的第一强省,浙江为人均收入最高的第一富省。中国大陆、香港、澳门、台湾之间的经济联系在全球化的过程中日益紧密。\n"} +``` + +### 数据ID化 +本部分,我们使用 `create_pretraining_data.py` 脚本将前面得到的 `baike_sample.jsonl` 进行tokenize id化处理。 +``` +optional arguments: + -h, --help show this help message and exit + --model_name MODEL_NAME + What model to use. + 必须设置,如:idea-ccnl/ziya-llama-13b-v1, 可以参考已有的模型名称 https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/llama/README.md + --tokenizer_name {LlamaTokenizer} + What type of tokenizer to use. + 模型对应的tokenizer, Llama模型需使用LlamaTokenizer +data input/output: + --input_path INPUT_PATH + Path to input JSON files. + 必须设置,输入文件jsonl的目录 + --output_prefix OUTPUT_PREFIX + Output prefix to store output file. + 必须设置,输出文件的名称。 + 假设名称为XXX,则会输出 XXX.bin, XXX.idx 两个文件。 + bin文件,数据id化后的token ids; idx文件,数据句子、文章位置索引。 + --data_format {JSON} Only support json format for now. One document per line. + 不需要设置。目前默认处理jsonl数据格式 + --json_key JSON_KEY For JSON format. 
Space separate listed of keys to extract from json + 文本串json的key值。同前面trans_to_json.py的json_key,默认text为key + --split_sentences Split documents into sentences. + 是否需要将文章划分成句子。一般而言,GPT不需要,BERT/ERNIE模型需要 + --data_impl {mmap,lazy} + Convert the json into mmap/lazy format. + 处理后的数据格式,可选“mmap”或“lazy”,其中“mmap”格式在读入数据时会建立内存映射,“lazy”格式在读入数据时直接从文件读取。 + +chinese words: + --chinese Is corpus need words segmentation step for chinese words. + 若设置了split_sentences,并处理中文则需要设置。 + --cn_whole_word_segment + Is corpus need words segmentation step for chinese words WWM. + 可选。是否需要WWM策略。一般而言,BERT/ERNIE模型需要,GPT不需要。 + --cn_seg_func {lac,seg,jieba} + Words segment function for chinese words. + 默认jieba,jieba速度较快,lac模型更准确,计算量高。 + --cn_splited Is chinese corpus is splited in to words. + 分词后的文本,可选。设置此选项则,cn_seg_func不起作用。 + 例如分词后文本串 "中国 效仿 西方 发展 工业 的过 程" + --cn_split_dimer CN_SPLIT_DIMER + Split dimer between chinese words. + 配合cn_splited使用,默认空格表示分词间隔。 + +common config: + --append_eos Append an token to the end of a document. + gpt类模型专用,gpt设置此选项,表示doc结束。针对tokenier中不包含eos_token情况,输出提示warning并且不添加。 + --log_interval LOG_INTERVAL + Interval between progress updates + 打印日志间隔,interval表示处理 文本行数/doc数的 间隔。 + --workers WORKERS Number of worker processes to launch + 处理文本id化的进程个数。 +``` +通过下面脚本转化,我们可以得到处理好的预训练数据,token ids:`baike_sample.bin`, 文章索引信息`baike_sample.idx`. + +* 针对 llama 模型 +```shell +python -u create_pretraining_data.py \ + --model_name "idea-ccnl/ziya-llama-13b-v1" \ + --tokenizer_name "LlamaTokenizer" \ + --input_path "baike_sample.jsonl" \ + --output_prefix "baike_sample" \ + --data_format "JSON" \ + --json_key "text" \ + --data_impl "mmap" \ + --cn_seg_func "jieba" \ + --append_eos \ + --log_interval 5 \ + --workers 40 + +``` + +* 针对 ernie 模型 +```shell +python -u create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "baike_sample.jsonl" \ + --output_prefix "baike_sample" \ + --data_format "JSON" \ + --json_key "text" \ + --split_sentences \ + --data_impl "mmap" \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func "jieba" \ + --log_interval 5 \ + --workers 40 +``` +1. 如果您使用已经分好词的语料,可以设置 --cn_splited 为 True,同时指定--cn_split_dimer如空格。 +2. 使用自定义词表的话,请指定model_name为词表所在的文件夹地址。 + +若需要预处理的文件过大,该脚本所耗费的时间可能会很长。此时可以考虑将jsonl文件拆分为多个小文件,并行使用create_pretraining_data.py进行处理,得到多个.bin & .idx文件。 +之后使用如下merge脚本合并多个小的.bin & .idx文件。 +``` +python merge.py \ + --input /root/data \ + --output-prefix /root/data/merged \ + --data_impl mmap +``` +使用说明: +``` +arguments: + --input INPUT_PATH + Path to the folder where the files to be merged. + 待合并的文件所在文件夹,文件夹内各个小文件需按merge的顺序排列,如1.bin / 1.idx,2.bin / 2.idx... + --output_prefix OUTPUT_PREFIX + Output prefix to store output file. + 合并后输出文件的名称,假设名称为XXX,则会输出 XXX.bin, XXX.idx 两个文件。 + --data_impl {mmap,lazy} + Convert the json into mmap/lazy format. 
+ merge前后的数据格式,可选“mmap”或“lazy,各个待merge的文件需格式一致。”。 +``` + +### 预训练开始 +得到了处理好的训练数据,就可以开始模型的预训练了。简单将预处理好的数据,拷贝到data目录,即可开始预训练。 +```shell +mkdir data +mv ./preprocess/baike_sample* ./data +``` + +* llama预训练请参考[预训练](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/llama/README.md)。 +* ernie预训练请参考[预训练](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/pretraining_introduction.md)。 + + +代码说明: +- 动态mask相关代码实现在`./data_tools/dataset_utils.py` + 用户可以根据自己的需求,灵活修改mask方式。具体可以参考`dataset_utils.py`中`create_masked_lm_predictions`函数。 + 可以自定义的选项有do_whole_word_mask, favor_longer_ngram, do_permutation, geometric_dist等, + 可以参考[Megatron](https://github.com/NVIDIA/Megatron-LM)使用这些lm_mask策略。 + +## 参考内容 + +注: 大部分数据流程,参考自[Megatron](https://github.com/NVIDIA/Megatron-LM),特此表达感谢。 diff --git a/model_zoo/ernie-1.0/preprocess/create_pretraining_data.py b/model_zoo/ernie-1.0/preprocess/create_pretraining_data.py new file mode 100644 index 0000000000000000000000000000000000000000..ea63297936ed976313864cb6736ad1669e168628 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/create_pretraining_data.py @@ -0,0 +1,380 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import io +import json +import multiprocessing +import os +import re +import sys +import time + +import numpy as np +from tqdm import tqdm + +import paddlenlp.transformers as tfs +from paddlenlp.data import indexed_dataset +from paddlenlp.utils.log import logger + +try: + import nltk + + nltk_available = True +except ImportError: + nltk_available = False + +from datetime import datetime + + +def print_datetime(string): + time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S") + print("[" + string + "] datetime: {} ".format(time_str)) + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name", type=str, required=True, help="What model to use.") + parser.add_argument( + "--tokenizer_name", + type=str, + required=True, + choices=[ + "ErnieTokenizer", + "BertTokenizer", + "GPTTokenizer", + "GPTChineseTokenizer", + "LlamaTokenizer", + "ElectraTokenizer", + "T5Tokenizer", + ], + help="What type of tokenizer to use.", + ) + group = parser.add_argument_group(title="data input/output") + group.add_argument("--input_path", type=str, required=True, help="Path to input JSON files.") + group.add_argument("--output_prefix", type=str, required=True, help="Output prefix to store output file.") + group.add_argument( + "--data_format", + type=str, + default="text", + choices=["JSON"], + help="Only support json format for now. One document per line.", + ) + group.add_argument( + "--json_key", + type=str, + default="text", + help="For JSON format. 
Space separate listed of keys to extract from json", + ) + group.add_argument("--split_sentences", action="store_true", help="Split documents into sentences.") + + group.add_argument("--data_impl", type=str, default="mmap", choices=["lazy", "mmap"]) + + group = parser.add_argument_group(title="chinese words") + group.add_argument( + "--chinese", action="store_true", help="Is corpus need words segmentation step for chinese words." + ) + group.add_argument( + "--cn_whole_word_segment", + action="store_true", + help="Is corpus need words segmentation step for chinese words WWM.", + ) + group.add_argument( + "--cn_seg_func", + type=str, + default="jieba", + choices=["lac", "seg", "jieba"], + help="Words segment function for chinese words.", + ) + group.add_argument("--cn_splited", action="store_true", help="Is chinese corpus is splited in to words.") + group.add_argument("--cn_split_dimer", type=str, default=" ", help="Split dimer between chinese words.") + + group = parser.add_argument_group(title="common config") + group.add_argument("--append_eos", action="store_true", help="Append an token to the end of a document.") + group.add_argument("--log_interval", type=int, default=100, help="Interval between progress updates") + group.add_argument("--workers", type=int, default=1, help="Number of worker processes to launch") + group.add_argument("--max_doc_num", type=int, default=sys.maxsize, help="Number of worker processes to launch") + + args = parser.parse_args() + return args + + +def lexical_analysis_fn(): + from LAC import LAC + + lac = LAC(mode="lac") + + def process(line): + words, _ = lac.run(line) + return words + + return process + + +def chinese_segmentation_fn(): + from LAC import LAC + + lac_cws = LAC(mode="seg") + + def process(line): + words = lac_cws.run(line) + return words + + return process + + +def jieba_segmentation_fn(): + import jieba + + def process(line): + words = jieba.cut(line) + return list(words) + + return process + + +def get_whole_word_mask_tokens(tokens, words, max_word_length=6): + """ + Do whole word mask on Chinese word. + First, we do Chinese word segmentation on the sequence of tokens, which are from the WordPiece tokenization. + Then, we add the '##' mark on chinese characters which are in the middle of Chinese words. + And if the tokens are not chinese characters, we just exploit the results of WordPiece tokenization as words. + Such as, + - text line : 通过利用mercer核,将样本从输入空间映射到高维特征空间,使原来没有显现的特征突现出来,取得了很好的图像分割效果。 + - the input tokens (after WordPiece): + ['通', '过', '利', '用', 'me', '##rc', '##er', '核', ',', '将', '样', '本', '从', '输', '入', '空', '间', '映', + '射', '到', '高', '维', '特', '征', '空', '间', ',', '使', '原', '来', '没', '有', '显', '现', '的', '特', '征', + '突', '现', '出', '来', ',', '取', '得', '了', '很', '好', '的', '图', '像', '分', '割', '效', '果', '。'] + - the Chinese words (after Chinese word segmentation like jieba) + ['通过', '利用', 'mercer', '核', ',', '将', '样本', '从', '输入', '空间', '映射', '到', '高维', '特征', + '空间', ',', '使', '原来', '没有', '显现', '的', '特征', '突现', '出来', ',', '取得', '了', '很', '好', + '的', '图像', '分割', '效果', '。'] + - the output whole word mask tokens: + ['通', '##过', '利', '##用', 'me', '##rc', '##er', '核', ',', '将', '样', '##本', '从', '输', '##入', + '空', '##间', '映', '##射', '到', '高', '##维', '特', '##征', '空', '##间', ',', '使', '原', '##来', + '没', '##有', '显', '##现', '的', '特', '##征', '突', '##现', '出', '##来', ',', '取', '##得', '了', + '很', '好', '的', '图', '##像', '分', '##割', '效', '##果', '。'] + + Args: + tokens(list(str)): The sequence of tokens, which are from the WordPiece tokenization. 
+ words(list(str)): The sequence of Chinese words. + max_word_length(int, optional): + The maximum chinese character in Chinese words. It avoids too long Chinese word to be masked. + Defaults as 4. + + Returns: + new_tokens(list(str)): The new token will be done with whole word masking strategy. + + """ + + new_tokens = [] + # opt for long document + words_set = set(words) + i = 0 + while i < len(tokens): + # non-chinese character, then do word piece + if len(re.findall("[\u4E00-\u9FA5]", tokens[i])) == 0: + new_tokens.append(tokens[i]) + i += 1 + continue + + # add "##" mark on the middel tokens of Chinese words + # such as ["通过", "利用"] -> ["通", "##过", "利", "##用"] + has_add = False + for length in range(max_word_length, 0, -1): + if i + length > len(tokens): + continue + if "".join(tokens[i : i + length]) in words_set: + new_tokens.append(tokens[i]) + for l in range(1, length): + new_tokens.append("##" + tokens[i + l]) + i += length + has_add = True + break + + if not has_add: + new_tokens.append(tokens[i]) + i += 1 + return new_tokens + + +class IdentitySplitter(object): + def tokenize(self, *text): + return text + + +class NewlineSplitter: + def tokenize(self, text): + return text.split("\n") + + +class Converter(object): + def __init__(self, args): + self.args = args + + def initializer(self): + Converter.tokenizer = getattr(tfs, self.args.tokenizer_name).from_pretrained(self.args.model_name) + if self.args.cn_whole_word_segment: + # Extend chinese char vocab for ErnieTokinzer + Converter.tokenizer.extend_chinese_char() + + # Split document to sentence. + if self.args.split_sentences: + if self.args.chinese: + Converter.splitter = NewlineSplitter() + else: + if not nltk_available: + print("NLTK is not available to split sentences.") + exit() + splitter = nltk.load("tokenizers/punkt/english.pickle") + Converter.splitter = splitter + else: + Converter.splitter = IdentitySplitter() + + # Split sentence whole words mask for chinese + if self.args.cn_whole_word_segment: + if self.args.cn_splited: + Converter.segment_func = lambda text: text.split(self.args.cn_split_dimer) + else: + CHINESE_SEG_FUNC = { + "lac": lexical_analysis_fn(), + "seg": chinese_segmentation_fn(), + "jieba": jieba_segmentation_fn(), + } + Converter.segment_func = CHINESE_SEG_FUNC[self.args.cn_seg_func] + Converter.whole_word_mask = get_whole_word_mask_tokens + else: + Converter.segment_func = lambda x: x + Converter.whole_word_mask = lambda x, y: x + + def process(text): + words = Converter.segment_func(text) + # if there are two empty word, the should a split dimer in the pos + if self.args.cn_splited: + pre_dimer = False + for index, w in enumerate(words): + if pre_dimer and len(w) == 0: + words[index] = self.args.cn_split_dimer + pre_dimer = False + elif len(w) == 0: + pre_dimer = True + else: + pre_dimer = False + + tokens = Converter.tokenizer.tokenize("".join(words)) + tokens = Converter.whole_word_mask(tokens, words) + tokens = Converter.tokenizer.convert_tokens_to_ids(tokens) + return tokens + + Converter.process = process + + def encode(self, json_line): + text = json.loads(json_line)[self.args.json_key] + doc_ids = [] + for sentence in Converter.splitter.tokenize(text): + sentence_ids = Converter.process(sentence.strip()) + if len(sentence_ids) > 0: + doc_ids.append(sentence_ids) + + if len(doc_ids) > 0 and self.args.append_eos: + if Converter.tokenizer.eos_token_id is None: + logger.warning( + "{}: eos_token_id is not set, ".format(self.args.tokenizer_name) + + "please set other tokenizer " + + "or config 
eos_token_id or unset append_eos." + ) + else: + doc_ids[-1].append(Converter.tokenizer.eos_token_id) + + return doc_ids, len(text.encode("utf-8")) + + +def main(): + print_datetime("start") + args = get_args() + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + convert = Converter(args) + + # Try tokenizer is availiable + sample_tokenizer = getattr(tfs, args.tokenizer_name).from_pretrained(args.model_name) + if sample_tokenizer.vocab_size < 2**16 - 1: + save_dtype = np.uint16 + else: + save_dtype = np.int32 + + pool = multiprocessing.Pool(args.workers, initializer=convert.initializer) + + output_ids_files = args.output_prefix + ".bin" + output_idx_files = args.output_prefix + ".idx" + builder = indexed_dataset.make_builder(output_ids_files, args.data_impl, save_dtype) + + file_paths.sort() + + step = 0 + total_bytes_processed = 0 + startup_start = time.time() + for file_path in tqdm(file_paths): + if file_path.endswith(".zst"): + import zstandard + + cctx = zstandard.ZstdDecompressor() + fh = open(file_path, "rb") + text = io.BufferedReader(cctx.stream_reader(fh)) + elif file_path.endswith(".jsonl"): + text = open(file_path, "r", encoding="utf-8") + else: + print("Unexpected data format, skiped %s" % file_path) + continue + + encoded_docs = pool.imap(convert.encode, text, 256) + print("Processing %s" % file_path) + for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1): + step += 1 + total_bytes_processed += bytes_processed + if len(doc) == 0: + continue + + for sentence in doc: + sentence_len = len(sentence) + if sentence_len == 0: + continue + builder.add_item(sentence) + + builder.end_document() + + if step % args.log_interval == 0: + current = time.time() + elapsed = current - startup_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print(f"Processed {step} documents", f"({step/elapsed:.2f} docs/s, {mbs:.4f} MB/s).", file=sys.stderr) + if step >= args.max_doc_num: + break + + if step >= args.max_doc_num: + break + + pool.close() + print("Saving tokens to files...") + builder.finalize(output_idx_files) + print_datetime("end") + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/preprocess/docs/CLUECorpus2020.md b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpus2020.md new file mode 100644 index 0000000000000000000000000000000000000000..3c6727fab4c7d1003a8d22a0f964b579710dd989 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpus2020.md @@ -0,0 +1,12 @@ +## CLUECorpus2020 语料 + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| CLUECorpus2020| 中文 | 200GB | + +CLUECorpus2020 过对Common Crawl的中文部分进行语料清洗得到。开源部分提供了约200G左右的语料文本,详细介绍见[官网](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD),用户可以通过邮件申请下载,方式如下: + +> 数据下载 +> 申请方式: 将使用语料研究目的和用途,计划、研究机构和申请者介绍,发送到邮箱,并承诺不向第三方提供。 +> +> 邮箱: CLUEbenchmark@163.com,标题是:CLUECorpus2020 200G语料库 diff --git a/model_zoo/ernie-1.0/preprocess/docs/CLUECorpusSmall.md b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpusSmall.md new file mode 100644 index 0000000000000000000000000000000000000000..6af9876968f033653cc310f5999d6c70dc3e5b9b --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpusSmall.md @@ -0,0 +1,78 @@ +# CLUECorpusSmall + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| CLUECorpusSmall| 中文 | 14GB | + +**数据集简介**:可用于语言建模、预训练或生成型任务等,数据量超过14G,近4000个定义良好的txt文件、50亿个字。主要部分来自于nlp_chinese_corpus项目 
+包含如下子语料库(总共14G语料):新闻语料[news2016zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/6bac09db4e6d4857b6d680d34447457490cb2dbdd8b8462ea1780a407f38e12b?responseContentDisposition=attachment%3B%20filename%3Dnews2016zh_corpus.zip), 社区互动语料[webText2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/83da03f7b4974871a52348b41c16c7e3b34a26d5ca644f558df8435be4de51c3?responseContentDisposition=attachment%3B%20filename%3DwebText2019zh_corpus.zip),维基百科语料[wiki2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/d7a166408d8b4ffdaf4de9cfca09f6ee1e2340260f26440a92f78134d068b28f?responseContentDisposition=attachment%3B%20filename%3Dwiki2019zh_corpus.zip),评论数据语料[comment2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/b66ddd445735408383c42322850ac4bb82faf9cc611447c2affb925443de7a6d?responseContentDisposition=attachment%3B%20filename%3Dcomment2019zh_corpus.zip)。 + +## 数据获取 + +用户可以通过官方github网页下载,https://github.com/CLUEbenchmark/CLUECorpus2020 。同时,为方便用户,我们也提供了aistudio数据集下载地址。[part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598),[part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357)。使用aistudio版本的数据,下载好后,可以核对md5值: +```shell +> md5sum ./* + 8a8be341ebce39cfe9524fb0b46b08c5 ./comment2019zh_corpus.zip + 4bdc2c941a7adb4a061caf273fea42b8 ./news2016zh_corpus.zip + fc582409f078b10d717caf233cc58ddd ./webText2019zh_corpus.zip + 157dacde91dcbd2e52a60af49f710fa5 ./wiki2019zh_corpus.zip +``` +解压文件 +```shell +unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus +unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus +unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus +unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus +``` +将txt文件转换为jsonl格式 +``` +python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl +``` +现在我们得到了jsonl格式的数据集。 + +## 中文预训练数据制作 + +下面是针对训练任务的数据集应用。 + +* llama为例 +```shell +python -u create_pretraining_data.py \ + --model_name "idea-ccnl/ziya-llama-13b-v1" \ + --tokenizer_name "LlamaTokenizer" \ + --input_path "clue_corpus_small_14g.jsonl" \ + --output_prefix "clue_corpus_small_14g" \ + --data_format "JSON" \ + --json_key "text" \ + --data_impl "mmap" \ + --append_eos \ + --log_interval 10000 \ + --workers 48 +``` + +* ernie为例 +```shell +python -u create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "clue_corpus_small_14g.jsonl" \ + --output_prefix "clue_corpus_small_14g" \ + --data_format "JSON" \ + --json_key "text" \ + --split_sentences \ + --data_impl "mmap" \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func "lac" \ + --log_interval 10000 \ + --workers 48 +``` + +- model_name 可以更换为[其他模型](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/llama/README.md)。 +- workers 表示转化的线程数目 + +数据共有文档`15702702`条左右,由于分词比较耗时,大概一小时左右可以完成。在当前目录下产出训练所需数据。 +``` +clue_corpus_small_14g.bin +clue_corpus_small_14g.idx +``` +用户可以使用此数据进行预训练任务。 diff --git a/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md b/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md new file mode 100644 index 0000000000000000000000000000000000000000..8e849d829e2f6919ab82d50cc9fb4a594ecb43fe --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md @@ -0,0 +1,43 @@ +# OpenWebText2 + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| OpenWebText2 | 英文 | 70GB | + +## 数据获取 + +[OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/)是一个开源的英文网页文本数据集,数据来源于Reddit,经过去重、清洗、提取,最终包含800多万个文档。 
+本示例采用EleutherAI清洗好的[OpenWebText2数据](https://openwebtext2.readthedocs.io/en/latest/index.html#download-plug-and-play-version) + +下载以后通过以下命令解压: + +```shell +# wget https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar +wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/openwebtext2.jsonl.zst.tar +tar -xvf openwebtext2.json.zst.tar -C /path/to/openwebtext +``` + +## Llama训练数据制作 + +然后使用[proprecess](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0/proprecess) 工具下的`create_pretraining_data.py`脚本进行数据集制作: +``` +python -u create_pretraining_data.py \ + --model_name meta-llama/Llama-2-7b \ + --tokenizer_name LlamaTokenizer \ + --data_format JSON \ + --input_path /path/to/openwebtext/ \ + --append_eos \ + --output_prefix llama_openwebtext \ + --workers 40 \ + --log_interval 10000 \ + --data_impl "mmap" +``` +处理时间约一个小时左右,就可以得到我们需要的`llama_openwebtext.bin`, `llama_openwebtext.idx`数据集文件。 + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv llama_openwebtext.bin ./data +mv llama_openwebtext.idx ./data +``` diff --git a/model_zoo/ernie-1.0/preprocess/docs/WuDaoCorpusBase.md b/model_zoo/ernie-1.0/preprocess/docs/WuDaoCorpusBase.md new file mode 100644 index 0000000000000000000000000000000000000000..580f12cd643bdc0c71ce9ed5aeb972e13c4205ab --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/WuDaoCorpusBase.md @@ -0,0 +1,97 @@ +# WuDaoCorpus2.0 Base 语料 + + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| WuDaoCorpus2.0 Base| 中文 | 200GB | + +WuDaoCorpora是悟道爬取的中文大规模语料。整体数量为3TB,目前开源的部分为WuDaoCorpus2.0 bases数据集,大小为200GB。 + +## 数据获取 + +**1. 下载解压** + +用户微信登录[官网](https://resource.wudaoai.cn/home),即可直接下载数据。下载好的压缩数据约 64GB。解压 +``` +unrar x WuDaoCorpus2.0_base_200G.rar +``` +**2. 语料分词** + +由于WuDao数据集比较大,分词比较耗时,这里先进行了语料分词: +```shell +python words_segmentation.py \ + --input_path ./WuDaoCorpus2.0_base_200G \ + --workers 40 \ + --data_format wudao \ + --cn_seg_func seg \ + --output_path ./wudao_lac_cut \ +``` + +注:预训练需要实现 SOP( Sentence Order Predict) 任务,在分词的同时,我们使用 简单规则 进行了文本断句。如果语料只有一句话,建议去除SOP loss,训练时设置 `binary_head=False`。 + +**3. 
转换为jsonl格式** + +文本转化完成后。我们使用 `../data_tools/trans_to_json.py`重新转换为jsonl格式(分词完毕)。 +```shell +python ./trans_to_json.py \ + --input_path ./wudao_lac_cut \ + --output_path wudao_corpus_200g.jsonl \ + --workers 40 +``` +在当前目录下产出数据`wudao_corpus_200g.jsonl`。格式如下: +``` +{"text": "主持人 : 作为 一个 曲线救国 的 路线 我们 没 办法 。\n金鑫 : 考试 和 分数 只是 一个 阶段性 的 评价 手段 , 不是 目的 , 就 像 人 活着 的 目的 不是 为了 吃饭 , 吃饭 是 为了 让 我们 活下去 , 我们 学习 的 目的 不是 为了 考试 , 不是 为了 那个 分数 , 而是 我 掌握 了 知识 , 成为 我 内在 的 能力 , 将来 我 去 创作 创造 工作 , 我能 把 它 做 得 更好 。\n主持人 : 特别感谢 金总 今天 接受 我 的 访谈 , 也 让 我 从 别的 层面 看到 了 一对一 到底 存在 的 道理 是 什么 , 并且 能 发展 那么 好 的 原因 在 哪里 。\n在 节目 后 您 谈谈 您 对 一对一 未来 的 希望 , 包括 您 对 它 未来 的 设想 是 什么 ?\n金鑫 : 一对一 个性化 教育 现在 还是 在 初级阶段 , 如果 是 四个 阶段 的话 , 现在 还是 在 第一阶段 到 第二阶段 迈进 的 , 学大 在 这方面 我们 希望 能 做 得 更 快 更 远 一些 。\n将来 个性化 教育 一定 是 能够 帮助 学生 在 成绩 上 的 提升 , 能够 更好 的 成长 , 进而 成为 对 社会 对 国家 更 有用 的 人才 , 就是 我们 的 成绩 、 成长 、 成才 。\n学大 1 对 1 教育 的 教师 团队 由 各科 优秀教师 、 考试 指导 专家 、 心理 辅导 专家 及 学习 方法 指导 专家 组成 , 同时 配备 专职 班主任 及 学习 监管 师 , 全方位 辅导 顺利 而 有序 的 运作 。\n其中 部分 教师 担任 多年 毕业班 教学 工作 , 多次 参与 中 考试 命题 研究 及 阅卷 工作 , 深谙 中 考试 精髓 , 能够 在 短 的 时间 内 引领 学生 掌握 中 考试 知识 重点 , 快速 提分 。\n■ 对于 成绩 差 的 学生 : 注重 学生 基础知识 , 力求 让 学生 在 基础 中 找 自信 , 在 自信 中 提升 ;\n注重 主观题 的 解题 方法 及 思路 , 以此 来 加强 对 基础知识 的 运用 。\n■ 对于 成绩 需要 拔高 的 学生 : 找出 学生 弱点 , 加强 基础 , 重点 提高 弱势 项目 。\n"} +{"text": "武田信玄 是 天生 的 武将 , 一生 开拓 了 八十五万 石至 九十余万 石之多 的 领地 。\n武田信玄 他 21 岁 时 流放 自己 的 父亲 武田信虎 至骏河 , 避免 父亲 传位 给 弟弟 , 从而 登上 了 第 19 代家督 之位 。\n他 将 信 浓国 ( 现 长野县 ) 纳入 控制 范围 后 , 又 与 当时 的 豪强 今井氏 、 北条 氏 结成 三国 军事同盟 , 与 上 杉谦信 在 川 中岛 前后 展开 了 五次 大战 。\n武田信玄 勇于 进攻 。\n他 连续 攻打 邻国 , 扩大 自己 势力范围 , 可称 遇神 杀神 , 遇佛 杀佛 。\n他 不仅 流放 了 自己 的 父亲 , 连 自己 的 嫡子 武田义信 因 与 他 在 战略 方向 上 相左 , 也 被 他 幽禁 于 佛寺 , 随即 被迫 自杀 。\n武田信玄 虽然 是 战国 武将 中 的 最强者 , 但 他 的 弱点 是 年龄 。\n信玄比 织田信长 年长 13 岁 , 比上 杉谦信 年长 9 岁 。\n当信 玄年 届 五十 之 时 , 信长 和 谦信 犹 在 壮年 。\n上杉谦信 而且 , 武田信玄 虽 驰骋 天下 , 却 未率 军 进过 京都 , 而 织田信长 在 永禄 十一年 ( 1568 年 ) 就 以 拥立 第 15 代 将军 足利义 昭 为名 率兵 上洛 了 。\n所谓 \" 制 京都 者 得 天下 \" , 所以 , 想要 一统天下 , 武田信玄 的 时间 很 紧迫 。\n元龟 三年 ( 1572 年 ) , 武田信玄 与 室 町 幕府 第 15 代 将军 足利义 昭 、 本愿 寺 显如 , 以及 浅井 氏 、 朝仓氏 等 反 织田信长 实力 组成 联盟 , 编织 \" 反信长 包围圈 \" 。\n同年 10 月 3 日 , 武田信玄 率领 大军 , 开始 了 第一次 上洛之行 。\n是 年 , 信玄 52 岁 , 这 也许 是 他 统一天下 的 最后 一次 机会 。\n武田信玄 所 率领 的 是 当时 战国 最强 的 3 万甲州 精兵 。\n打着 \" 风林火山 \" 的 旗帜 , 武田军 第一站 就 到达 了 织田信长 的 同盟 德川家康 所在 的 三河 远江 。\n织田信长 德川家康 的 军队 在 甲州 精兵 之前 显得 不堪一击 , 到 了 10 月 13 日 , 只来 成 、 天 方城 、 一 宫城 、 饭田 城 、 各和城 、 向 笠 城 等 城池 纷纷 被 攻陷 。\n德川家康 见势不妙 , 决定 在 浜松 城中 闭门不出 。\n但是 武田信玄 毫不 松懈 , 又 将 家康 在 远江 地区 的 重要 据点 二俣城 攻破 。\n德川家康 集合 所有 军队 共 1 万 1 千人 , 出城 与 信玄 决一死战 , 但 大败 而 还 , 险些 失 了 性命 。\n这次 战争 被 称为 \" 三方 原战 \" , 德川家康 曾经 承认 这次 战争 是 他 生平 最大 的 失败 。\n"} +``` + +## 中文预训练数据制作 + +下面是针对训练任务的数据集应用。 + +* llama为例 +```shell +python -u create_pretraining_data.py \ + --model_name "idea-ccnl/ziya-llama-13b-v1" \ + --tokenizer_name "LlamaTokenizer" \ + --input_path "wudao_corpus_200g.jsonl" \ + --output_prefix "wudao_corpus_200g" \ + --data_format "JSON" \ + --json_key "text" \ + --data_impl "mmap" \ + --cn_seg_func "jieba" \ + --cn_splited \ + --append_eos \ + --log_interval 10000 \ + --workers 48 +``` + +* ernie为例 +```shell +python -u create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "wudao_corpus_200g.jsonl" \ + --output_prefix "wudao_corpus_200g" \ + --data_format "JSON" \ + --json_key "text" \ + --split_sentences \ + --data_impl "mmap" \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func "jieba" \ + --cn_splited \ + --log_interval 10000 \ + --workers 48 +``` + + +- 我们提前进行了分词,所以加上了 `cn_splited`,否则不需要使用此选项。 +- model_name 
可以更换为[其他模型](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm)。 +- workers 表示转化的线程数目 + +在当前目录下产出训练所需数据。 +``` +wudao_corpus_200g.bin +wudao_corpus_200g.idx +``` +用户可以使用此数据进行预训练任务。 diff --git a/model_zoo/ernie-1.0/preprocess/merge.py b/model_zoo/ernie-1.0/preprocess/merge.py new file mode 100644 index 0000000000000000000000000000000000000000..cad6034907068b1b0ae96cb8b0338b81ed001334 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/merge.py @@ -0,0 +1,90 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from datetime import datetime + +from paddlenlp.data import indexed_dataset + + +def print_datetime(string): + time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S") + print("[" + string + "] datetime: {} ".format(time_str)) + + +def main(args): + + prefixes = set() + for basename in os.listdir(args.input): + prefix, ext = os.path.splitext(basename) + + if prefix in prefixes: + continue + + if not os.path.isfile(os.path.join(args.input, basename)): + continue + + ext_pair = ".bin" if ext == ".idx" else ".idx" + assert os.path.isfile( + os.path.join(args.input, prefix) + ext_pair + ), f"ERROR: {ext_pair} file not provided for {os.path.join(args.input, prefix)}" + + prefixes.add(prefix) + + builder = None + + for prefix in sorted(prefixes): + print_datetime(f"start processing file {prefix}") + if builder is None: + dataset = indexed_dataset.make_dataset(os.path.join(args.input, prefix), args.data_impl) + + if isinstance(dataset, indexed_dataset.MMapIndexedDataset): + builder = indexed_dataset.MMapIndexedDatasetBuilder( + args.output_prefix + ".bin", dtype=dataset._index.dtype + ) + else: + builder = indexed_dataset.IndexedDatasetBuilder(args.output_prefix + ".bin", dtype=dataset.dtype) + + del dataset + print_datetime(f"start merge file {prefix}") + builder.merge_file_(os.path.join(args.input, prefix)) + print_datetime(f"end merge file {prefix}") + + print_datetime("start finalize") + builder.finalize(args.output_prefix + ".idx") + print_datetime("end finalize") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + group = parser.add_argument_group(title="input data") + group.add_argument( + "--input", type=str, required=True, help="Path to directory containing all document files to merge" + ) + group.add_argument("--data_impl", type=str, required=True, help="data_impl") + + group = parser.add_argument_group(title="output data") + group.add_argument("--output-prefix", type=str, required=True, help="Path to binary output file without suffix") + + args = parser.parse_args() + + assert os.path.isdir(args.input), f"ERROR: {args.input} is not a directory or does not exist" + + assert os.path.isdir( + os.path.dirname(args.output_prefix) + ), f"ERROR: {os.path.dirname(args.output_prefix)} is not a directory or does not exist" + + main(args) diff --git a/model_zoo/ernie-1.0/preprocess/trans_to_json.py b/model_zoo/ernie-1.0/preprocess/trans_to_json.py new file mode 100644 index 
0000000000000000000000000000000000000000..0e8e71d77e54d565f7f8273676c09e62e7bdcf72 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/trans_to_json.py @@ -0,0 +1,140 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import multiprocessing +import os +import shutil +import sys +import time +from functools import partial + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--input_path", type=str, required=True, help="Path to you raw files. Folder or file path.") + parser.add_argument("--output_path", type=str, required=True, help="Path to save the output json files.") + parser.add_argument("--json_key", type=str, default="text", help="The content key of json file.") + parser.add_argument( + "--doc_spliter", + type=str, + default="", + help="Spliter between documents. We will strip the line, if you use blank line to split doc, leave it blank.", + ) + parser.add_argument("--min_doc_length", type=int, default=10, help="Minimal char of a documment.") + parser.add_argument("--workers", type=int, default=1, help="Number of worker processes to launch") + parser.add_argument("--log_interval", type=int, default=1, help="Interval between progress updates.") + parser.add_argument("--no-merge", action="store_true", help="Don't merge the file.") + parser.add_argument("--no-shuffle", action="store_true", help="Don't shuffle the file.") + args = parser.parse_args() + return args + + +def raw_text_to_json(path, doc_spliter="", json_key="text", min_doc_length=10): + path = os.path.abspath(path) + if not os.path.exists(path): + print("No found file %s" % path) + return 0, None + + out_filepath = path + ".jsonl" + fout = open(out_filepath, "w", encoding="utf-8") + len_files = 0 + with open(path, "r") as f: + doc = "" + line = f.readline() + while line: + len_files += len(line) + if line.strip() == doc_spliter: + if len(doc) > min_doc_length: + fout.write(json.dumps({json_key: doc}, ensure_ascii=False) + "\n") + doc = "" + else: + doc += line + line = f.readline() + + if len(doc) > min_doc_length: + fout.write(json.dumps({json_key: doc}, ensure_ascii=False) + "\n") + doc = "" + + return len_files, out_filepath + + +def merge_file(file_paths, output_path): + if not output_path.endswith(".jsonl"): + output_path = output_path + ".jsonl" + print("Merging files into %s" % output_path) + with open(output_path, "wb") as wfd: + for f in file_paths: + if f is not None and os.path.exists(f): + with open(f, "rb") as fd: + shutil.copyfileobj(fd, wfd) + os.remove(f) + print("File save in %s" % output_path) + return output_path + + +def shuffle_file(output_path): + print("Shuffling the jsonl file...") + if os.path.exists(output_path): + os.system("shuf %s -o %s" % (output_path, output_path)) + print("File shuffled!!!") + else: + raise ValueError("File not found: %s" % output_path) + + +def main(): + args = get_args() + startup_start = time.time() + + file_paths = [] + if os.path.isfile(args.input_path): 
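+        # --input_path points at a single file: convert just this file; otherwise the directory is walked recursively below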
+ file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + pool = multiprocessing.Pool(args.workers) + + startup_end = time.time() + proc_start = time.time() + total_bytes_processed = 0 + print("Time to startup:", startup_end - startup_start) + + trans_json = partial( + raw_text_to_json, doc_spliter=args.doc_spliter, json_key=args.json_key, min_doc_length=args.min_doc_length + ) + encoded_files = pool.imap(trans_json, file_paths, 1) + + out_paths = [] + for i, (bytes_processed, out_path) in enumerate(encoded_files, start=1): + total_bytes_processed += bytes_processed + out_paths.append(out_path) + + if i % args.log_interval == 0: + current = time.time() + elapsed = current - proc_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print(f"Processed {i} files", f"({i/elapsed} files/s, {mbs} MB/s).", file=sys.stderr) + + if not args.no_merge: + output_path = merge_file(out_paths, args.output_path) + if not args.no_shuffle: + shuffle_file(output_path) + + +if __name__ == "__main__": + main() + # profile.run("main()", "testprof") diff --git a/model_zoo/ernie-1.0/preprocess/words_segmentation.py b/model_zoo/ernie-1.0/preprocess/words_segmentation.py new file mode 100644 index 0000000000000000000000000000000000000000..0aeb1907e0cddf33310ad163031136334d1e722e --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/words_segmentation.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import multiprocessing +import os +import re +import sys +import time +from functools import partial + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--input_path", type=str, required=True, help="Path to you raw files. Folder or file path.") + parser.add_argument("--output_path", type=str, default="./tmp", help="Path to save the output json files.") + parser.add_argument( + "--data_format", + type=str, + default="jsonl", + choices=["jsonl", "wudao"], + help="Path to you raw files. 
Folder or file path.", + ) + parser.add_argument( + "--cn_seg_func", + type=str, + default="jieba", + choices=["lac", "seg", "jieba"], + help="Words segment function for chinese words.", + ) + parser.add_argument("--workers", type=int, default=1, help="Number of worker processes to launch") + parser.add_argument("--log_interval", type=int, default=1, help="Interval between progress updates.") + args = parser.parse_args() + return args + + +def lexical_analysis_fn(): + from LAC import LAC + + lac = LAC(mode="lac") + + def process(line): + words, _ = lac.run(line) + return words + + return process + + +def chinese_segmentation_fn(): + from LAC import LAC + + lac_cws = LAC(mode="seg") + + def process(line): + words = lac_cws.run(line) + return words + + return process + + +def jieba_segmentation_fn(): + import jieba + + def process(line): + words = jieba.cut(line) + return list(words) + + return process + + +CHINESE_SEG_FUNC = { + "lac": lexical_analysis_fn(), + "seg": chinese_segmentation_fn(), + "jieba": jieba_segmentation_fn(), +} + + +def read_wudao(path): + print("Loading %s" % path) + with open(path, "r") as f: + try: + contents = json.load(f) + except Exception: + print("Failed to load %s" % path) + raise StopIteration + for js in contents: + yield js["content"] + + +def read_jsonl(path): + print("Loading %s" % path) + with open(path, "r") as f: + line = f.readline() + while line: + contents = json.load(f) + yield contents["text"] + line = f.readline() + + +READFILE_FUNC = { + "jsonl": read_jsonl, + "wudao": read_wudao, +} + +special_chars = ["\n", "。", "?", "?", " ", ";", ";", "!", "!"] +split_chars = ["。", "?", "?", ";", ";", "!", "!"] + + +def text_to_text(path, output_path, read_func, seg_func): + out_name = os.path.join(output_path, path[-20:]) + + print("Write into %s" % out_name) + if os.path.exists(out_name): + print("File exists %s" % out_name) + return 0, None + + seg_func = CHINESE_SEG_FUNC[seg_func] + read_func = READFILE_FUNC[read_func] + + data_len = 0 + count = 0 + with open(out_name, "w") as f: + for text in read_func(path): + # for js in contents: + count += 1 + # text = js["content"] + data_len += len(text.encode("utf-8")) + # make special char only once, + # because of those token will be treat as sentence spliter. 
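+            # i.e. first collapse repeated sentence-end marks (e.g. "。。。" -> "。"),
+            # then append "\n" after each sentence-end mark, so the text below can be split and segmented sentence by sentence.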
+ # 此处为断句逻辑 + for char in special_chars: + text = re.sub("[" + char + "]+[ ]*", char, text) + for char in split_chars: + text = text.replace(char, char + "\n") + + # 此处为分词逻辑 + final = "" + for line in text.split("\n"): + if len(line) == 0: + continue + words = seg_func(line) + final += " ".join(words) + "\n" + f.write(final + "\n") + + return data_len, None + + +def main(): + args = get_args() + startup_start = time.time() + + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + pool = multiprocessing.Pool(args.workers) + + startup_end = time.time() + proc_start = time.time() + total_bytes_processed = 0 + print("Time to startup:", startup_end - startup_start) + + if not os.path.exists(args.output_path): + os.makedirs(args.output_path) + + trans_func = partial( + text_to_text, output_path=args.output_path, seg_func=args.cn_seg_func, read_func=args.data_format + ) + + encoded_files = pool.imap(trans_func, file_paths, 1) + + out_paths = [] + for i, (bytes_processed, out_path) in enumerate(encoded_files, start=1): + total_bytes_processed += bytes_processed + out_paths.append(out_path) + + if i % args.log_interval == 0: + current = time.time() + elapsed = current - proc_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print(f"Processed {i} files", f"({i/elapsed} files/s, {mbs} MB/s).", file=sys.stderr) + pool.close() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/pretraining_introduction.md b/model_zoo/ernie-1.0/pretraining_introduction.md new file mode 100644 index 0000000000000000000000000000000000000000..b7d77c3e405e1e9ee9b6875fb9025503de8d8659 --- /dev/null +++ b/model_zoo/ernie-1.0/pretraining_introduction.md @@ -0,0 +1,615 @@ +# ERNIE 中文预训练介绍 + +ERNIE是百度提出的大规模预训练模型,曾在中文场景下取得了SOTA效果。 +PaddleNLP致力于预训练开源工作,使用开源中文语料CLUE、WuDao 总共400GB,发布大规模开源语料预训练全流程。从零开始,轻松构建预训练模型。 + +本项目,从数据下载,词表制作,数据转化,模型训练,所有流程,完全开源开放,可复现。 +并训练发布开源最优的模型参数。 + +接下来将从下面几个方面,详细介绍整个数据制作全流程,从零开始,构建一个预训练模型。 + +* [1. 数据准备](数据准备) + * [1.1 大规模中文数据](#大规模中文数据) + * [1.2 高精准中文分词](#高精准中文分词) + * [1.3 快速Token ID 转化](#快速TokenID转化) +* [2. 全字符中文词表制作](#中文词表制作) + - [2.1 分析准备](#分析准备) + - [2.2 文本字符统计](#文本字符统计) + - [2.3 英文字符词表](#英文字符词表) + - [2.4 合并词表](#合并词表) +* [3. 开始训练](#开始训练) + - [3.1 训练脚本](#训练脚本) + - [3.2 训练网络配置](#networks) + - [3.3 训练速度配置](#speed) + - [3.4 训练数据流配置](#data_pipe) + - [3.5 观察评估](#观察评估) +- [4. 训练效果](#release_models) + - [4.1 ERNIE 1.0-Base-zh-cw 模型](#ernie-1.0-base-zh-cw) + - [4.2 ERNIE 1.0-Large-zh-cw 模型](#ernie-1.0-large-zh-cw) +* [5. 参考](#references) + +全部流程介绍图如下: + +

+ +

+ + +**环境依赖** + +- tool_helpers +- visualdl +- pybind11 +- lac (可选) + +安装命令 `pip install tool_helpers visualdl pybind11 lac` + + + +## 1. 数据准备 + +数据流是预训练的非常重要的,[预处理文档](./preprocess/README.md)提供了整体的数据变动的流程示意,用户可以查看数据制作的细节文档。 + + + + +### 1.1 大规模中文数据 + +模型的根本是数据,大数据才能有望获得更好的训练效果。我们希望语料有如下特点: +- **大规模**:目前像ERNIE-3.0,GPT-3,CPM等模型,动辄数T的文本语料。而目前开源的一些中文模型,确是基于15G左右的CLUECorpus语料训练,大大限制了模型的效果, +- **开源开放**:为了让用户也可以比较容易复现整体的数据流程,采用的数据希望是**开源**的,人人可以获取的。 + +综上,我们选用的预料为 CLUECorpus2020 语料 200G, WuDaoCorpus2.0 Base 语料 200G。 + +**CLUECorpus2020 语料** + +CLUECorpus2020 是通过Common Crawl中文部分语料清洗得到。开源部分提供了约200G左右的语料文本,详细介绍见[官网](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD),用户可以通过邮件申请下载。 + +**WuDaoCorpus2.0 Base 语料** + +WuDaoCorpora是悟道爬取的中文大规模语料。整体数量为3TB,目前开源的部分为WuDaoCorpus2.0 bases数据集,大小为200GB。 +用户微信登录[官网](https://resource.wudaoai.cn/home),即可直接下载数据。下载好的压缩数据约 64GB。 + + +为了方便用户测试,我们提供了少量part的WuDao数据供大家使用,(如有侵权,请联系我们删除) +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/WuDaoCorpus2.0_base_200G_sample.tar.gz +tar -xvf WuDaoCorpus2.0_base_200G_sample.tar.gz +``` +用户可以用这份数据跑完后续全程。数据量约为2GB。 + + + + +### 1.2 高精准中文分词 + +ERNIE 使用知识嵌入的方式进行预训练。文本中的知识,比如 文本的中的人名、地名、成语、短语等都是知识。如何把这知识训练融合到模型中呢?ERNIE给出的方案是对这些知识短语一起MASK,然后预测,也就是Whole Words MASK。 + +在我们数据处理层面,如何尽可能精确的从原始文本中提取知识,直接关系预训练模型的效果。我们对目前PaddleNLP常用的分词方式的有`jieba`,`lac`,`seg`进行分析。`jieba`采用HMM隐马尔可模型,`lac`是LSTM模型。 + +效果、速度对比表格如下,假设CPU使用40线程,GPU使用16卡,处理200G文本: + +| 切词方式 | 效果 | 速度 | 预估耗时 +|-|-|-|-| +| jieba | 一般 | 607 KB/s | 2.5 h | +| lac | 好 | 106 KB/s | 13.9 h +| wordtag (弃用)| 最好 | 0.94 KB/s | 159 D (GPU)| + +综合考虑分词的效果与速度,我们选择百度的LAC(seg)作为我们的文本分词工具。 + + +本文档以WuDao数据为例,对数据进行分词: + + +```shell +python ./preprocess/words_segmentation.py \ + --input_path "./WuDaoCorpus2.0_base_200G" \ + --output_path "./wudao_lac_cut" \ + --data_format "wudao" \ + --cn_seg_func "seg" \ + --workers 48 +``` + +注:预训练需要实现 SOP( Sentence Order Predict) 任务,在分词的同时,我们使用 简单规则 进行了文本断句。如果语料只有一句话,建议去除SOP loss,训练时设置 `binary_head=False`。 + +文本转化完成后。我们使用 `./preprocess/trans_to_json.py`重新转换为jsonl格式(分词完毕)。 +```shell +python ./preprocess/trans_to_json.py \ + --input_path "./wudao_lac_cut" \ + --output_path "wudao_corpus_200g_sample.jsonl" \ + --workers 40 \ + --no-shuffle +``` +使用 WuDaoCorpus2.0_base_200G_sample.tar.gz 数据可以得到jsonl文本为: +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_corpus_200g_sample.jsonl +``` +用户可以下载处理好的数据,进行tokenizer转换。 + + + + +## 1.3 快速Token ID 转化 + +预料、词表准备妥当后,我们可以开始进行最后的数据ID转化。 + +- 高效的 Multiprocessing 多进程实现 +- 使用内存BytesIO存储ID数据 + +由于转换的逻辑复杂,需要定义`class Converter`对象来进行转化处理。如果每次处理新的文本,都实例化一次class对象,速度瓶颈会在处理函数的实例化。 +我们使用了提前multiprocessing.Pool的`initializer`,对处理函数进行提前实例化,提高处理效率。 + +处理后的token id数量巨大,可以达到数百Billion,如果使用普通的数据结构,如python的list保存,会出现存储瓶颈,不仅占用空间大,list对象还需要重新分配内存空间。这里我们采用了 BytesIO 的方式,类似写入内存文件的方式,速度快,可以非常方便转化为numpy文件保存。 + +使用 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz CPU测试,40线程,处理速度 8+MB/s,约7个小时左右,即可完成 200GB 文本转化为ID. 
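+
+下面给出一段极简的示意代码(并非本项目源码,其中 AutoTokenizer 的用法、进程数等均为示意,实际逻辑以 `create_pretraining_data.py` 为准),演示 `multiprocessing.Pool` 的 `initializer` 预实例化与 `BytesIO` 缓存 token id 的基本写法:
+
+```python
+# 示意:每个 worker 进程只在 initializer 中实例化一次 tokenizer,
+# token id 统一写入 BytesIO,最后一次性转成 numpy 数组
+import io
+import multiprocessing
+
+import numpy as np
+
+_tokenizer = None
+
+
+def _init_worker():
+    # 在 worker 启动时实例化一次,避免每条文本都重新构造处理对象
+    global _tokenizer
+    from paddlenlp.transformers import AutoTokenizer
+
+    _tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
+
+
+def _encode(line):
+    # 真实脚本中此处还包含分句、全词分词(WWM)等处理
+    return _tokenizer(line)["input_ids"]
+
+
+if __name__ == "__main__":
+    texts = ["飞桨自然语言处理", "预训练数据示例"]
+    buf = io.BytesIO()  # 用内存文件缓存 id,避免大 list 反复扩容
+    with multiprocessing.Pool(4, initializer=_init_worker) as pool:
+        for ids in pool.imap(_encode, texts, 256):
+            buf.write(np.array(ids, dtype="uint16").tobytes())
+    token_ids = np.frombuffer(buf.getvalue(), dtype="uint16")
+    print(token_ids.shape)
+```
+
+实际的数据转化,使用下面的命令即可: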
+ +```shell +python -u ./preprocess/create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "wudao_corpus_200g.jsonl" \ + --output_prefix "wudao_corpus_200g" \ + --split_sentences\ + --data_impl "mmap" \ + --chinese \ + --cn_splited \ + --cn_whole_word_segment \ + --workers 48 \ + --log_interval 1000 +``` + +此处需要指定词表文件进行ID转化,用户可以使用paddlenlp内置的部分词表如`ernie-1.0-base-zh,ernie-3.0-base-zh`,设置`model_name`参数为对应参数名即可。 +也可以根据自己的需求,重新开始制作词表,然后`model_name`传入词表所在的文件夹目录即可。词表制作,请参考下一章节[全字符中文词表制作](#全字符中文词表制作)。 + +转化后的数据如下,使用这份数据,即可开始ERNIE预训练: +``` +wudao_corpus_200g.bin +wudao_corpus_200g.idx +``` +同样,对于 WuDaoCorpus2.0_base_200G_sample.tar.gz 数据,使用`ernie-3.0-bash-zh`的tokenizer,可以得到数据。 +``` +mkdir data && cd data +wget https://paddlenlp.bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_corpus_200g_sample_ernie-3.0-base-zh.bin +wget https://paddlenlp.bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_corpus_200g_sample_ernie-3.0-base-zh.idx +``` + + + +### 2. 全字符中文词表制作 + +之前的 数据 id 化中,使用了已有的词表进行转化,当没有词表时,需要从头开始进行词表制作。如果你没有制作新词表的需求,请跳过此部分,直接阅读 [第三节,开始训练](#开始训练)。 + +那制作ERNIE的词表有什么特点需要注意呢?常见的方法是使用 sentencepiece 切词,使用BPE去找通用的子词串。但是,ERNIE之类的中文模型,是属于字模型,不会出现连续汉字作为子词 如`##中国`。一般是通过 BasicTokenizer,给所有中文汉字之间,添加空格,然后再去切分 子词 subword,这样每个汉字就都是独立的。 +``` +china -> ch #ina +我爱china -> 我 爱 china -> 我 爱 ch #ina +``` + +这里提供了ERNIE模型词表制作的两种方案: + +- 第一种,词表组合方案 + 1. 统计字符 + 2. 制作英文词表 + 3. 合并词表 + +- 第二种,预处理后直接生成,方案 + 1. 文本预处理(中文加空格,文本normalize) + 2. 使用sentencepeice制作词表 + +第二种方案需要对文本先使用`BasicTokenizer`切分一遍语料。 +第一种方案,自定义程度高,但存在一些局限性。本项目采用了第一种方案,详细介绍如下: + +### 2.1 分析准备 +词表大小: 这里我们考虑的因素主要有两个 +- 已有模型对照: + - ERNIE 3.0系列模型的词表,词表大小为 40000 左右。 +- 预训练数据存储占用: + - 文本token id化后,希望使用uint16表示,此时表示的最大字符为65536。 + - 同时考虑到ERNIE虽然是字模型,我们的仍然需要 `##中` 之类的中文字符表示分词信息。假设使用中文全字符20902(0x4E00-0x9FA5)个字符,那么剩余 vocab 大小不能超过 44634。 + +综上,本项目决定采用 40000 左右的 vocab 容量。 +其中: +- 中文全字符 `20902` +- 英文字符 `17000` +- 其他字符约 `2000` 左右 + + +### 2.2 文本字符统计 +首先第一步是对文本字符进行统计。字符统计的目的主要是添加常用的中文字符、特殊字符。 + +由于语料文本过大,我们随机选取 10G 左右的原始文本进行了字符统计。 +``` +python ./vocab/gen_char.py path_to_corpus.txt +``` +可以在本地文件夹得到`char_dict.pickle`字符频率文件。同时我们也提供了自己统计的词频文件,方便用户复现: +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/char_dict.pickle +``` + +### 2.3 英文字符词表 +基于字符的词频统计,使得英文字符也切割为字母,为此我们需要添加英文词表。 +英文部分,我们使用了 [WikiText](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip) 数据集,来构造词表。 +下载解压数据,使用BPE切词 +``` +wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip +unzip wikitext-103-v1.zip +python ./vocab/gen_vocab.py ./wikitext-103-raw/wiki.train.raw +``` +即可产生英文部分的词表。这里我们也提供了处理好的 vocab 方便用户验证。 +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/eng.vocab +``` + + +### 2.4 合并词表 + +目前我们得到了字符统计表,和英文字符词表。下一步,我们将词表进行合并。 + +将`char_dict.pickle`,`eng.vocab`放置到当前目录,使用下面命令 +``` +python ./vocab/merge_vocab.py +``` +即可在 当前 目录生成 vocab.txt 得到最终词表。 + +此阶段需要注意的一些问题是: +1. 对于一些日文、谚文文字字符,需要进行 normalize +2. 添加special_tokens + +### 2.5 问题遗留 +本项目采用的第一种方式,即拼接产出的词表,对连续非中、英文字符文本,会出现UNK的情况。 +如issue: [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927)、 [#2585](https://github.com/PaddlePaddle/PaddleNLP/issues/2585)。本项目做了两点改进: + +1. 对 Symbol 字符默认添加空格,变成独立字符 +2. 对 日文、谚文 在合并词表阶段默认添加 ## 字符。 + +虽然有上述两点修复,任然无法避免 [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927) 现象。 +彻底解决的话,建议使用第二种方式制作vocab文件。 + +### 2.6 方案二:预处理后直接生成 +此方案没有被采用,这里也简单说明一下具体的方案: +1. 
对语料使用 BasicTokenizer 转换 +```python +from paddlenlp.transformers import +tokenizer = BasicTokenizer() +basic_toknizer = lambda x: " ".join(tokenizer.tokenize(x)) +# 对语料使用 basic_toknizer 转换 +# 并存储为新的语料 afer_basic_toknizer_corpus.txt +``` +2. 处理转换后的语料 +```shell +python ./vocab/gen_vocab.py afer_basic_toknizer_corpus.txt +``` +对处理好的vocab文件手动替换一些` -> [PAD]`之类的special_tokens,即可产出词表。 + + +## 3. 开始训练 + +使用开源中文语料CLUE、WuDao 总共400GB,提供上面提供的大规模语料数据集制作教程。接下来,看是模型训练。 + +

+ +

+ +### 3.1 训练脚本 + +训练脚本如下。环境配置和路径配置,不是必要的,如果用户只想简单训练,可以直接跳到[继续训练](#继续训练)部分,直接训练。 + +环境配置 +- PYTHONPATH 设置为当前目录(适合paddlenlp develop运行) +- 设置了一些FLAGS,包括增强报错,动态图Flag,提高矩阵乘法精度。 +- 多机情况下,可以设置`NCCL_SOCKET_IFNAME`指明NCCL使用的通信网口。 + +
+环境配置脚本 + +```shell +set -x + +# cd PaddleNLP/model_zoo/ernie-1.0 +export PYTHONPATH=$PYTHONPATH:../../ + +export FLAGS_call_stack_level=2 +# export NCCL_SOCKET_IFNAME=xgbe0 +export FLAGS_gemm_use_half_precision_compute_type=False +export FLAGS_enable_eager_mode=1 +unset CUDA_VISIBLE_DEVICES +``` +
+ +路径配置 + +- 主要配置输入输出目录 +- 这里的`vocab_dir`如果没有使用自定义词表的话,请设置为内置的tokenizer,如`ernie-1.0-base-zh,ernie-3.0-base-zh`等。 +- 这里的 `data_dir` 设置多份数据集,用户不使用多份数据集的话,直接`data_dir="./data"`即可。 + +
+路径配置 + +```shell +trainer_id=${PADDLE_TRAINER_ID:-"0"} +task_name="0809-ernie-1.0-base-cw-dp16-gb1024" + +base_nfs="/path/to/your/nfs/mount/point" +base_dir="${base_nfs}/ernie-cw/output/${task_name}" +data_dir="5.0 ${base_nfs}/clue_oscar/clue_corpus_oscar_0630 7.0 ${base_nfs}/clue_train/clue_corpus_train_0629 12.0 ${base_nfs}/wudao_200g/wudao_200g_0703" +vocab_dir="${base_nfs}/" +``` +
+ +**启动训练**:这里启动的是单机8卡任务,整体全局的batch_size 512 (64*8)。如果指定ips参数,进行多机运行,如 `python3 -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" --ips 192.168.1.101,192.168.1.101 ` + +```shell +python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "${base_dir}/log_${trainer_id}" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "${vocab_dir}" \ + --input_dir "${data_dir}" \ + --output_dir "${base_dir}" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --binary_head true \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 4000000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 3900000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 3 \ + --eval_freq 1000 \ + --device "gpu"\ + --share_folder true \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --seed 1234 \ +``` + + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹(对于ernie,词表文件名一般命名为vocab.txt),或者PaddleNLP内置tokenizer的名字。 +- `continue_training` 默认false,模型从随机初始化,开始训练。如果为True,从已有的预训练权重加载,开始训练。如果为True, 训练初始loss 为2.x 是正常loss,如果未False,随机初始化,初始loss一般为10+。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认`split=949,50,1`, 使用1/1000的数据为test,当样本数太少时,增大测试的样本数目。 +- `max_seq_len` 输入文本序列的长度,默认值`512`。 +- `binary_head` 是否使用SOP(Sentences Order Predicet) loss,默认为 True,使用此loss。如果用户句子语料很短,无法组合成句子对,请设置此参数为`false`。 +- `micro_batch_size` 单卡batch size大小,比如此处单卡bs=64, 采用8卡训练`global_batch_size=64*8=512`。 +- `use_amp` 开启混合精度策略。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减到最小值后,学习率将一直保持为`min_lr`。 +- `max_steps` 最大训练步数。训练不支持通过`epoch`控制,第一次制造数据index时候,日志会显示数据会被计算的epoch数,请注意查看。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。默认地址为`output_dir/model_last`. +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `share_folder` 多机训练时,如果多机`input_dir`为挂载的同一个nfs网络位置,可以开启次选项,多机共享同一份数据。(每次运行,会制作训练的index数据,如果为挂载的统一nfs位置,则一台机器制作数据即可,否则每台机器都需要制作) + +继续训练 + + +很多同学的需求,是从已有的预训练参数开始,继续训练过程,这里我们使用前面教程提供的`WuDaoCorpus2.0_base_200G_sample.tar.gz`样本数据,在`ernie-3.0-base-zh`权重上继续训练。脚本如下: + +
+展开脚本 + +``` +python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/ernie_continue_training/logs" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "ernie-3.0-base-zh" \ + --continue_training true \ + --input_dir ./data \ + --output_dir output/ernie_continue_training/ \ + --split 949,50,1 \ + --max_seq_len 512 \ + --binary_head true \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 500000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 490000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 1 \ + --num_workers 3 \ + --eval_freq 1000 \ + --device "gpu"\ + --scale_loss 1024\ + --seed 1234 \ +``` +
+ + + + +### 3.2 训练网络配置 + +本小节 + +- SOP Loss + - SOP (Sentence Order Predict) 损失,是 模型训练的常用损失。将文本中的句子顺序分为两段打乱,最后判断文本是否被打乱。下图是数据组织形式的展示: +

+ +

+ + - *使用方法*: 此开关由 `binary_head` 选项开启,`binary_head=True`添加sop loss, `binary_head=False` 关闭 sop loss。 + - **注意:如果你使用的语料文本中,只有一句话,无法分为多个句子段落,请设置 `binary_head=False`。否则,不符合要求的数据默认被删去,导致可训练的数据过小。** +- MASK + - MLM (Mask Language Model) 是通过随机将文本中的部分token,随机替换为`[MASK]` token,最后预测出真实的token值。ERNIE默认采用了Whole Word MASK方式,选定一些词语进行MASK。 + - *使用方法*: 用户可以设置 `masked_lm_prob` 控制mask的token占文本总token长度的比例。默认`masked_lm_prob=0.15` 随机mask 15% 的token数目。 + - 设置`short_seq_prob`, 控制长度小于max_seq_length的样本比例,默认值`short_seq_prob=0.1`。制作数据时候,会有相应比例的数据 最大长度会设置为 一个小于 max_seq_length 的随机值。 +- Ngram MASK + - 项目还支持了n-gram mask策略,如下图所示,在 WWM 进行词语级别MASK的基础上(如此处mask掉的`[模型]`词组),n-gram 可以MASK掉连续n个词组。下面例子中,连续mask了2个词组,`【[语言][模型]】`同时进行了mask。 +

+ +

+ + - *使用方法*: 用户通过`max_ngrams`设置最大的`ngram`长度。默认`max_ngrams=3`。 + - 注: + - ernie预训练使用的 dataset 代码文件在 `./data_tools/ernie_dataset.py` + - 数据集index生成,动态mask相关代码实现在`./data_tools/dataset_utils.py` + + - 用户可以根据自己的需求,灵活修改mask方式。具体可以参考`dataset_utils.py`中`create_masked_lm_predictions`函数。可以自定义的选项有do_whole_word_mask, favor_longer_ngram, do_permutation, geometric_dist等,可以参考[Megatron](https://github.com/NVIDIA/Megatron-LM)使用这些lm_mask策略。 + +- Dropout + - Dropout 是常用的防止过拟合策略。对于大规模数据集训练,如`ernie-3.0`系列4T文本语料,可以设置 `dropout=0`,不考虑过拟合。实际`ernie-3.0-base-zh`训练中,没有开启Dropout。 + - *使用方法*: 用户可以设置 `hidden_dropout_prob`,`attention_probs_dropout_prob`。默认值为 `0.1`。 + + + + +### 3.3 训练速度配置 + +**训练速度方面**,我们支持了如下策略,加 +速计算过程,减小显存占用,扩大batch_size: + +- **多卡多机**训练: + - 基于飞桨Fleet分布式API,用户可以十分方便的通过数据并行的方法,将训练扩展到多机多卡。 + - *使用方法*: + - 单机八卡 + ```shell + python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + run_pretrain.py + ``` + - 多机,假设机器ip为 `192.168.1.101,192.168.1.102` **注**:多台机器启动的ips参数需要顺序一致。 + ```shell + python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --ips 192.168.1.101,192.168.1.102 \ + run_pretrain.py + ``` +- **混合精度**训练: + - 部分算子使用FP16计算kernel,加速计算过程。支持AMP混合精度O1,和Pure FP16全FP训练策略O2。 + - 如下图所示,使用AMP O1时,一些参数自动从fp32 cast为FP16类型计算。使用`O2` pure fp16时,模型参数为 fp16。 + - *使用方法*: 设置`use_amp=True`开启混合精度训练。设置`fp16_opt_level=O1`,切换pure_fp16请设置为`O2`。 +

+ +

+- **梯度累积**训练: + - 用户可以指定梯度累积的步数,在梯度累积的step中。 + - 减少多卡之间梯度的通信,减少更新的次数,扩大训练的batch_size. + - *使用方法*:用户设置 `gobal_batch_size`为 `micro_batch_size*卡数`的倍数,即可开启梯度累积。如:单卡bs=16,8卡,此时如果设置`gobal_batch_size=512`,则梯度累积次数为`gobal_batch_size/bs/card_num=512/16/8=4`。 +- **重计算**训练: + - 通过重新计算前向的方式,减少前向网络中间变量的存储,可以显著减少显存占用。理论上,该方式以时间换空间,但在batch size显著扩大的情况下,速度下降幅度较小。 + - 如图所示:原来训练过程中,中间变量需要常驻显存,等待反向计算。使用重计算之后,修改成了反向需要时,再重新计算一遍前向过程,生成中间变量。避免常驻显存,减小显存占用。 + - *使用方法*:用户设置`use_recompute=True`即可使用。注意使用时,可同时扩大`micro_batch_size`参数。 +

+ +

+ + + + +### 3.4 训练数据流配置 +**训练数据流方面**,我们针对训练数据流扩展、混合、重启等方面做了针对性优化提升 + +数据流 +- **多机扩展** + - 用户可以将数据放置到 NFS 服务器上,多机同时挂载数据即可。 + - 解析:当用户需要在多台机器之间,一起多机训练,或者切换到空闲的机器上训练时。由于数据集很大(数百GB),迁移不方便。训练数据与计算资源分离,是非常适合的策略。 + - *使用方法*:参考[NFS服务搭建教程](https://blog.csdn.net/eijiyey/article/details/123184529),用户将制作好的数据,放到NFS机器,然后挂载到有训练资源的其他机器训练即可。 +

+ +

+ +- **多数据混合** + - *简介*:训练数据集支持多个文件,即插即用,可设置不同数据集占比权重。上面的多机训练的架构,混合使用了四份数据集。 + - *使用方法*:传入参数即可`input_dir="1.0 dateset_a/prefix 2.0 dataset_b/prefix"` + - **注意**:如果文件夹中只有一份数据如`data/wudao_200g_0703_ids.npy data/wudao_200g_0703_idx.npz`,可以直接设置`input_dir=./data`为输入目录即可。如果需要设定多份数据集,必须写上数据集前缀,如`input_dir="1.0 data/wudao_200g_0703 1.0 data2/clue_corpus_train_0629"`。写前缀即可,不要加上后面类似`_ids.npy _idx.npz`的尾缀。 +- **稳定可复现** + - *简介*:MLM任务具有一定随机性,需要随机mask数据。本数据流通过固定每一个step数据的随机种子,实验数据流稳定可复现。 + - *使用方法*: 传入`seed`参数即可,修改参数后会重新生成 index 数据,打乱数据顺序。 +- **快加载** + - *简介*:数据文件使用mmap读取,避免直接将数据加载到内存,加载数百GB文件几乎不耗时。 +- **断点重启** + - *简介*:用户可以单独设置,`checkpoint_steps` 参数可设置较小,重启训练默认加载最新checkpoint。 + - 断点数据自动恢复,学习率等参数也自动恢复。 + - **注意:** 此`checkpoint_steps`参数仅保留最后一个`checkpoint`到`model_last`文件夹,默认每次覆盖。用户需要永久保存参数,请设置`save_steps`。建议可以设置`checkpoint_steps`为需要间隔训练半小时、一小时左右的时间,一旦环境故障,可以获取到最新的`checkpoint`。 + + +### 3.4 观察评估 + +- **训练过程观察**:VisualDL可视化日志记录 + - 日志展示为全局loss,波动小。 + - 记录混合精度,loss_scaling等信息,方便用户debug。 + - 对模型结构,配置参数,paddle版本信息进行记录,方便复现环境 + +

+ +

+ + +- **下游任务评估**:CLUE Benchmark搜索评估参数效果 + - 使用[批量启动-grid-search](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/clue#%E6%89%B9%E9%87%8F%E5%90%AF%E5%8A%A8-grid-search),可以进行批量搜索任务 + - 注意,这里使用的是训练中的checkpoint进行评估,可以直接试着 评估待评估的参数为,所在的路径地址,即如 `python grid_seach.py output/ernie-base-outdir/model_100000` 之类的checkpoint地址。 + + + +## 4. 训练效果 + +**训练效果方面**,我们release了 base、large两个模型。均取得了较好的预训练效果。 + + + +### 4.1 ERNIE 1.0-Base-zh-cw 模型 + +使用CLUE,WuDao共计400GB的语料,batch_size 1024, 训练 400w step,即可训练得到`ernie-3.0-base-zh`类似的模型效果。相关模型参数,开源为`ernie-1.0-base-zh-cw`,用户加载即可使用。使用CLUE benchmark 对最优超参数进行GradSearch搜索: + +Model                                  | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | + Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1| Acc| Acc +ERNIE 1.0-Base-zh-cw | 12L768H | 76.47 | 76.04 | 57.86 | 59.91 | 83.41 | 79.58 | 89.91 | 83.42 | 72.88/90.78 | 84.68 | 76.98 | +ERNIE 2.0-Base-zh | 12L768H | 74.32 | 75.65 | 58.25 | 61.64 | 82.62 | 78.71 | 81.91 | 82.33 | 66.08/87.46 | 82.78 | 73.19 +ERNIE 1.0-Base-zh | 12L768H | 74.17 | 74.84 | 58.91 | 62.25 | 81.68 | 76.58 | 85.20 | 82.77 | 67.32/87.83 | 82.47 | 69.68 + + + + +### 4.2 ERNIE 1.0-Large-zh-cw 模型 + +除了base模型外,我们还训练了large模型。命名为`ernie-1.0-large-zh-cw`。使用开源语料,batch_size 512, 训练 400w step,训练去除SOP任务,只保留MLM损失,使用CLUE benchmark 对最优超参数进行GradSearch搜索: + +Model                                    | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1 | Acc| Acc +ERNIE 1.0-Large-zh-cw| 24L1024H | 79.03 | 75.97 | 59.65 | 62.91 | 85.09 | 81.73| 93.09 | 84.53 | 74.22/91.88 | 88.57 | 84.54 +ERNIE 3.0-Xbase-zh| 20L1024H | 78.39 | 76.16 | 59.55 | 61.87 | 84.40 | 81.73 | 88.82 | 83.60 | 75.99/93.00 | 86.78 | 84.98 +RoBERTa-wwm-ext-large | 24L1024H | 76.61 | 76.00 | 59.33 | 62.02 | 83.88 | 78.81 | 90.79 | 83.67 | 70.58/89.82 | 85.72 | 75.26 + + + +## 5. 参考文献 + +感谢CLUE,WuDao提供的开源文本语料,主要数据流部分参考自[Megatron](https://github.com/NVIDIA/Megatron-LM),参考资料: +- Xu, L., Zhang, X. and Dong, Q., 2020. CLUECorpus2020: A large-scale Chinese corpus for pre-training language model. arXiv preprint arXiv:2003.01355. +- Yuan, S., Zhao, H., Du, Z., Ding, M., Liu, X., Cen, Y., Zou, X., Yang, Z. and Tang, J., 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open, 2, pp.65-68. +- https://github.com/CLUEbenchmark/CLUECorpus2020 +- https://resource.wudaoai.cn +- https://github.com/NVIDIA/Megatron-LM diff --git a/model_zoo/ernie-1.0/run_gb512_s1m.sh b/model_zoo/ernie-1.0/run_gb512_s1m.sh new file mode 100644 index 0000000000000000000000000000000000000000..70719bc347e0e5972540d5e12712d089c4d23df4 --- /dev/null +++ b/model_zoo/ernie-1.0/run_gb512_s1m.sh @@ -0,0 +1,54 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +rm -rf core.* + +# dp8 for 8 worker of data parallel +# gb512 for the global batch size is 512 = 64 * 8 +# s1m for max steps is 1 million +task_name="ernie-1.0-dp8-gb512" +rm -rf output/$task_name/log + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name/log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20\ + --num_workers 2 \ + --eval_freq 1000 \ + --device "gpu"\ + diff --git a/model_zoo/ernie-1.0/run_gb512_s1m_static.sh b/model_zoo/ernie-1.0/run_gb512_s1m_static.sh new file mode 100644 index 0000000000000000000000000000000000000000..23e885ca5fed215f26229a303c261958601116c2 --- /dev/null +++ b/model_zoo/ernie-1.0/run_gb512_s1m_static.sh @@ -0,0 +1,60 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +rm -rf *.prototxt +rm -rf core.* +rm -rf start_sharding* +rm -rf main_sharding* + +# dp8 for 8 worker of data parallel +# gb512 for the global batch size is 512 = 64 * 8 +task_name="ernie-1.0-dp8-gb512" +rm -rf output/$task_name/log + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name/log" \ + run_pretrain_static.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --sharding_degree 1\ + --dp_degree 8 \ + --use_sharding false \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01\ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --num_workers 2 \ + --logging_freq 20\ + --eval_freq 1000 \ + --device "gpu" + +# NOTE: please set use_sharding=True for sharding_degree > 1 diff --git a/model_zoo/ernie-1.0/run_gb512_s1m_trainer.sh b/model_zoo/ernie-1.0/run_gb512_s1m_trainer.sh new file mode 100644 index 0000000000000000000000000000000000000000..04eddc01ad72da6477b03ff0913e4d7ab78a9952 --- /dev/null +++ b/model_zoo/ernie-1.0/run_gb512_s1m_trainer.sh @@ -0,0 +1,53 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +# dp8 for 8 worker of data parallel +# gb512 for the global batch size is 512 = 64 * 8 +# s1m for max steps is 1 million +task_name="ernie-1.0-dp8-gb512" +rm -rf output/$task_name/ + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain_trainer.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 512 \ + --per_device_train_batch_size 64 \ + --per_device_eval_batch_size 64 \ + --fp16 \ + --fp16_opt_level "O2" \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20\ + --dataloader_num_workers 4 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --do_train \ + --device "gpu" diff --git a/model_zoo/ernie-1.0/run_npu_single_card.sh b/model_zoo/ernie-1.0/run_npu_single_card.sh new file mode 100644 index 0000000000000000000000000000000000000000..f8011169b62c44dda2298348657ab193154294d0 --- /dev/null +++ b/model_zoo/ernie-1.0/run_npu_single_card.sh @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -x + +export FLAGS_selected_npus=0 +export ASCEND_GLOBAL_LOG_LEVEL=3 +export FLAGS_allocator_strategy=naive_best_fit + +rm -rf core.* + +task_name="ernie-1.0-npu" +rm -rf output/$task_name/log + +python -u run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "ernie-3.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 52 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20\ + --num_workers 8 \ + --eval_freq 1000 \ + --device "npu"\ diff --git a/model_zoo/ernie-1.0/run_pretrain.py b/model_zoo/ernie-1.0/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..ae9a0ee6962cf17e793a72a6870dc7617c96242f --- /dev/null +++ b/model_zoo/ernie-1.0/run_pretrain.py @@ -0,0 +1,763 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +ERNIE-1.0 pretraining scripts. 
+""" +import contextlib +import json +import os +import random +import shutil +import sys +import time + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from args import parse_args +from data_tools.dataset_utils import build_train_valid_test_datasets +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from visualdl import LogWriter + +from paddlenlp.data import Stack +from paddlenlp.transformers import ( + ErnieConfig, + ErnieForMaskedLM, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": (ErnieConfig, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer), +} + + +def create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size, + data_world_rank, + max_seq_len, + places=None, + data_holders=None, + binary_head=True, + current_step=0, +): + + train_valid_test_num_samples = [ + args.global_batch_size * args.max_steps, + args.micro_batch_size * (args.max_steps // args.eval_freq + 1) * args.eval_iters * data_world_size, + args.micro_batch_size * args.test_iters * data_world_size, + ] + + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=args, + tokenizer=tokenizer, + splits_string=args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=args.max_seq_len, + masked_lm_prob=args.masked_lm_prob, + short_seq_prob=args.short_seq_prob, + seed=args.seed, + skip_warmup=True, + binary_head=binary_head, + max_seq_length_dec=None, + dataset_type="ernie", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = data + if tokenizer.pad_token_id in input_ids: + input_ids = input_ids[0 : list(input_ids).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(input_ids)) + for pos, label in zip(masked_lm_positions, masked_lm_labels): + input_ids[pos] = label + logger.info(tokenizer._decode(input_ids)) + logger.info(tokenizer.convert_ids_to_tokens(masked_lm_labels)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if args.device == "npu": + # For NPU device, fixed input sentence length, in + # order to reduce the number of op compile. 
+ if size % 80 != 0: + size += 80 - (size % 80) + else: + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return out + + def loader(dataset, consumed_samples=0): + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.micro_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + consumed_samples=consumed_samples, + ) + data_loader = paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + num_workers=args.num_workers, + worker_init_fn=None, + collate_fn=_collate_data, + return_list=False, + ) + return data_loader + + train_dl = loader(train_ds, args.global_batch_size * current_step) + valid_dl = loader( + valid_ds, args.micro_batch_size * ((current_step + 1) // args.eval_freq) * args.eval_iters * data_world_size + ) + test_dl = loader(test_ds, 0) + + return train_dl, valid_dl, test_dl + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def all_gather(v): + if dist.get_world_size() <= 1: + return v.item() + ret = [] + dist.all_gather(ret, v) + output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in ret] + concat = paddle.concat(output_tensors, axis=0) + return concat.mean().item() + + +@paddle.no_grad() +def run_evaluate(data_loader, model, criterion, iter_steps, log_writer, global_step, args, task_name="valid"): + model.eval() + + if args.binary_head: + loss_global = { + "loss": paddle.to_tensor(0.0), + "lm_loss": paddle.to_tensor(0.0), + "sop_loss": paddle.to_tensor(0.0), + } + else: + loss_global = { + "loss": paddle.to_tensor(0.0), + } + + local_time = time.time() + + for eval_step, batch in enumerate(data_loader): + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=[ + "softmax", + "layer_norm", + "gelu", + ], + custom_black_list=[ + "c_softmax_with_cross_entropy", + ], + level=args.fp16_opt_level, + ): + + if args.binary_head: + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + lm_loss, sop_loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + loss = lm_loss + sop_loss + else: + prediction_scores = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + loss = criterion(prediction_scores, None, masked_lm_labels) + + loss_global["loss"] += loss.detach() + 
if args.binary_head: + loss_global["lm_loss"] += lm_loss.detach() + loss_global["sop_loss"] += sop_loss.detach() + + if eval_step >= iter_steps - 1: + log_info_dict = dict() + for k, v in loss_global.items(): + log_info_dict[k] = all_gather(v) / iter_steps + v.subtract_(v) + if dist.get_rank() == 0: + log_info_dict["samples_per_second"] = ( + iter_steps * args.micro_batch_size * dist.get_world_size() / (time.time() - local_time) + ) + loss_info = ", ".join( + ["{}: {:.6f}".format(k, log_info_dict[k]) for k in log_info_dict.keys() if k.endswith("loss")] + ) + + logger.info( + "%s step %d, batch: %d, %s, ips: %.0f seqs/s" + % (task_name, global_step, iter_steps, loss_info, log_info_dict["samples_per_second"]) + ) + + for k, v in log_info_dict.items(): + log_writer.add_scalar("%s/%s" % (task_name, k), v, global_step) + + break + + model.train() + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +def args_post_process(args, worker_num): + default_global_batch_size = worker_num * args.micro_batch_size + if args.global_batch_size is None: + args.global_batch_size = default_global_batch_size + + bsz_per_dp = args.global_batch_size // worker_num + micro_batch_size = args.micro_batch_size + assert ( + args.global_batch_size % micro_batch_size == 0 + ), "cannot do gradient accumulate, global_batch_size: {} micro_batch_size: {}".format( + args.global_batch_size, micro_batch_size + ) + accumulate_steps = bsz_per_dp // micro_batch_size + assert ( + accumulate_steps >= 1 + ), f"Larger global_batch_size: {args.global_batch_size} is expect, micro_batch_size is {micro_batch_size}, but only {bsz_per_dp} on each card!" + + args.eval_iters *= accumulate_steps + args.test_iters *= accumulate_steps + + args.accumulate_steps = accumulate_steps + + +def default_logdir() -> str: + """ + Same default + """ + import socket + from datetime import datetime + + current_time = datetime.now().strftime("%b%d_%H-%M-%S") + return os.path.join("runs", current_time + "_" + socket.gethostname()) + + +def do_train(args): + paddle.set_device(args.device) + + worker_index = paddle.distributed.get_rank() + worker_num = paddle.distributed.get_world_size() + + if worker_num > 1: + paddle.distributed.init_parallel_env() + + if args.dp_degree * args.sharding_degree == 1: + args.dp_degree = worker_num + args.sharding_degree = 1 + + args_post_process(args, worker_num) + + logger.info("{:20}:{}".format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = {"dp_degree": args.dp_degree, "mp_degree": 1, "pp_degree": 1, "sharding_degree": 1} + + fleet.init(is_collective=True, strategy=strategy) + + # Create the random seed for the worker + set_seed(args) + + assert ( + args.dp_degree * args.sharding_degree == worker_num + ), "The product of degree num should be equal to worker_num." 
+ + # Create log write, + log_writer = None + if worker_index == 0: + log_writer = LogWriter(os.path.join(args.output_dir, default_logdir())) + + # Define the input data in the static mode + config_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[args.model_type] + if args.binary_head is False: + model_class = ErnieForMaskedLM + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + # load config in checkpoint + global_step = 0 + checkpoint_dir = os.path.join(args.output_dir, "model_last") + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + with open(os.path.join(checkpoint_dir, "./config.yml"), "r") as f: + step_config = yaml.load(f, Loader=yaml.FullLoader) + assert ( + step_config["global_batch_size"] == args.global_batch_size + ), "Please ensure checkpoint global batch size is the same. Folder: {}".format(checkpoint_dir) + global_step = step_config["global_step"] + + if args.model_name_or_path in pretrained_models_list and not args.continue_training: + logger.warning(f"Your model {args.model_name_or_path} is training from scratch !!!") + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model_config["enable_recompute"] = args.use_recompute + model = model_class(config_class(**model_config)) + else: + logger.warning(f"Your model is continue training from {args.model_name_or_path}") + model = model_class.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + enable_recompute=args.use_recompute, + ) + + criterion = criterion_class(with_nsp_loss=args.binary_head) + + if worker_index == 0: + # log the model config and args + model_config_json = json.dumps(model.config.to_dict(), ensure_ascii=False, indent=2) + log_writer.add_text("model_config", model_config_json) + args_dict = {"paddle commit id": str(paddle.version.commit)} + for arg in vars(args): + args_dict[arg] = str(getattr(args, arg)) + log_writer.add_text("args", json.dumps(args_dict, indent=2)) + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + args.max_lr, args.min_lr, warmup_step=args.warmup_steps, decay_step=args.decay_steps, last_epoch=global_step + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + decay_param = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + logger.info("Using paddle.optimizer.AdamW.") + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_param, + multi_precision=args.use_amp, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level=args.fp16_opt_level) + 
else: + scaler = None + + if paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + # Must extend chinese char for ErnieTokenizer + tokenizer.extend_chinese_char() + + data_file = get_train_data_file(args) + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size=worker_num, + data_world_rank=worker_index, + max_seq_len=args.max_seq_len, + binary_head=args.binary_head, + current_step=global_step, + ) + + # load checkpoint vars + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + logger.info("Try to load checkpoint from %s " % checkpoint_dir) + opt_path = os.path.join(checkpoint_dir, "model_state.pdopt") + params_path = os.path.join(checkpoint_dir, "model_state.pdparams") + + if os.path.exists(opt_path): + load_dict = paddle.load(params_path) + model_dict = model.state_dict() + if args.use_amp and args.fp16_opt_level == "O2": + for k, v in load_dict.items(): + if k not in model_dict: + logger.warning(f"Checkpoint have too much keys: {k}") + continue + if "layer_norm" not in model_dict[k].name: + load_dict[k] = v.astype("float16") + model.set_state_dict(load_dict) + opt_dict = paddle.load(opt_path) + optimizer.set_state_dict(opt_dict) + else: + logger.warning("No optimizer checkpoint file found in %s." % opt_path) + if scaler is not None and os.path.isfile(os.path.join(checkpoint_dir, "scaler.pdparams")): + scaler.load_state_dict(paddle.load(os.path.join(checkpoint_dir, "scaler.pdparams"), return_numpy=True)) + logger.info("Checkpoint loaded from global step: {}".format(global_step)) + + if args.binary_head: + loss_global = { + "loss": paddle.to_tensor(0.0), + "lm_loss": paddle.to_tensor(0.0), + "sop_loss": paddle.to_tensor(0.0), + } + else: + loss_global = { + "loss": paddle.to_tensor(0.0), + } + + tic_train = time.time() + while True: + # If not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + tr_loss = paddle.to_tensor(0.0) + reader_start = time.time() + + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. next_sentence_labels + + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = batch + + ctx_manager = contextlib.nullcontext() if sys.version_info >= (3, 7) else contextlib.suppress() + + if worker_num > 1 and (args.use_recompute or ((step + 1) % args.accumulate_steps != 0)): + # grad acc, no_sync when (step + 1) % args.accumulate_steps != 0: + # recompute, no_sync every where + # recompute + grad_acc, no_sync every where + ctx_manager = model.no_sync() + else: + ctx_manager = contextlib.nullcontext() if sys.version_info >= (3, 7) else contextlib.suppress() + + with ctx_manager: + # For NPU device, using fp16 data type to execute `dropout` NPU op + # can improve performance, which can change `Cast` CANN OP from + # AICPU operator to AICore operator. 
+ with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=[ + "softmax", + "layer_norm", + "gelu", + "dropout", + ], + custom_black_list=[ + "c_softmax_with_cross_entropy", + ], + level=args.fp16_opt_level, + ): + + # Create the model for the ernie pretrain + if args.binary_head: + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + lm_loss, sop_loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + loss = lm_loss + sop_loss + else: + prediction_scores = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + loss = criterion(prediction_scores, None, masked_lm_labels) + + if args.accumulate_steps >= 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_amp: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss += tr_loss_step.detach() + + loss_global["loss"] += loss.detach() + if args.binary_head: + loss_global["lm_loss"] += lm_loss.detach() + loss_global["sop_loss"] += sop_loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if worker_num > 1 and args.use_recompute: + fused_allreduce_gradients(list(model.parameters()), None) + + if args.use_amp: + scaler.minimize(optimizer, tr_loss) + else: + optimizer.step() + + if args.device == "npu": + # For NPU device, set set_to_zero to False can improve + # performance. + optimizer.clear_grad(set_to_zero=False) + else: + optimizer.clear_grad() + train_run_cost += time.time() - train_start + tr_loss.subtract_(tr_loss) + + global_step += 1 + + if global_step % args.logging_freq == 0: + log_info_dict = dict() + log_info_dict["global_step"] = global_step + for k, v in loss_global.items(): + log_info_dict[k] = all_gather(v) / args.logging_freq / args.accumulate_steps + v.subtract_(v) + if worker_index == 0: + speed = args.logging_freq / (time.time() - tic_train) + log_info_dict["learning_rate"] = lr_scheduler.get_lr() + log_info_dict["steps_per_second"] = speed + log_info_dict["samples_per_second"] = speed * args.global_batch_size + + for k, v in log_info_dict.items(): + log_writer.add_scalar("train/%s" % k, v, global_step) + + loss_info = ", ".join( + ["{}: {:.6f}".format(k, log_info_dict[k]) for k in log_info_dict.keys() if k.endswith("loss")] + ) + + common_loginfo = ( + "global step %d, %s, speed: %.2f steps/s, ips: %.2f seqs/s, learning rate: %.5e" + % ( + global_step, + loss_info, + speed, + log_info_dict["samples_per_second"], + log_info_dict["learning_rate"], + ) + ) + + addition_info = "" + if args.use_amp: + amp_info = { + "loss_scaling": scaler._scale.item(), + "incr_count": scaler._incr_count, + "decr_count": scaler._decr_count, + } + addition_info = ", ".join("%s: %.2f" % (k, v) for k, v in amp_info.items()) + for k, v in amp_info.items(): + log_writer.add_scalar("amp/%s" % k, v, global_step) + + logger.info(", ".join([common_loginfo, addition_info])) + + tic_train = time.time() + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step % args.eval_freq == 0: + # TODO, check the input data of validation + + run_evaluate( + valid_data_loader, + model, + criterion, + args.eval_iters, + log_writer, + global_step, + args, + task_name="valid", + ) + tic_train = time.time() + + def 
save_ckpt(output_dir, model, tokenizer, optimizer, scaler, args, global_step): + step_config = { + "model_name": args.model_name_or_path, + "global_step": global_step, + "global_batch_size": args.global_batch_size, + "consumed_samples": global_step * args.global_batch_size, + } + + logger.debug("saving models to {}".format(output_dir)) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + + tokenizer.save_pretrained(output_dir) + # added token is not need for downstream finetune tasks. + added_token_path = os.path.join(output_dir, "added_tokens.json") + if os.path.exists(added_token_path): + os.remove(added_token_path) + + model_to_save.save_model_config(output_dir) + model_dict = model_to_save.state_dict() + if scaler is not None: + paddle.save(scaler.state_dict(), os.path.join(output_dir, "scaler.pdparams")) + for k, v in model_dict.items(): + if v.dtype is paddle.float16: + model_dict[k] = v.astype("float32") + paddle.save(model_dict, os.path.join(output_dir, "model_state.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + + with open(os.path.join(output_dir, "config.yml"), "w") as f: + yaml.dump(step_config, f, encoding="utf-8", allow_unicode=True) + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if worker_index == 0: + save_ckpt(output_dir, model, tokenizer, optimizer, scaler, args, global_step) + + if worker_num > 1: + paddle.distributed.barrier() + tic_train = time.time() + + if global_step % args.checkpoint_steps == 0: + output_dir = os.path.join(args.output_dir, "model_last") + if worker_index == 0: + if not os.path.exists(output_dir): + os.mkdir(output_dir) + output_dir_bak = os.path.join(args.output_dir, "model_last_bak") + if os.path.exists(output_dir): + if os.path.exists(output_dir_bak): + shutil.rmtree(output_dir_bak) + shutil.move(output_dir, output_dir_bak) + os.mkdir(output_dir) + save_ckpt(output_dir, model, tokenizer, optimizer, scaler, args, global_step) + + if worker_num > 1: + paddle.distributed.barrier() + + if global_step >= args.max_steps: + run_evaluate( + test_data_loader, + model, + criterion, + args.test_iters, + log_writer, + global_step, + args, + task_name="test", + ) + del train_data_loader + del valid_data_loader + del test_data_loader + return + + +if __name__ == "__main__": + config = parse_args(MODEL_CLASSES) + do_train(config) diff --git a/model_zoo/ernie-1.0/run_pretrain_static.py b/model_zoo/ernie-1.0/run_pretrain_static.py new file mode 100644 index 0000000000000000000000000000000000000000..0986e01039b2b43fc0bec60b40479da135eb4c5b --- /dev/null +++ b/model_zoo/ernie-1.0/run_pretrain_static.py @@ -0,0 +1,744 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +ERNIE pretraining scripts for paddlepaddle static graph mode. 
+""" +import collections +import json +import os +import random +import shutil +import time + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from args import parse_args +from data_tools.dataset_utils import build_train_valid_test_datasets +from paddle.distributed.fleet.meta_optimizers.sharding.utils import save_persistables +from visualdl import LogWriter + +from paddlenlp.data import Stack +from paddlenlp.ops import Topology, get_rng_state_tracker +from paddlenlp.transformers import ( + ErnieConfig, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": (ErnieConfig, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer), +} + + +def create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size, + data_world_rank, + max_seq_len, + places, + data_holders, + binary_head=True, + current_step=0, +): + + train_valid_test_num_samples = [ + args.global_batch_size * args.max_steps, + args.micro_batch_size * (args.max_steps // args.eval_freq + 1) * args.eval_iters * data_world_size, + args.micro_batch_size * args.test_iters * data_world_size, + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=args, + tokenizer=tokenizer, + splits_string=args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=args.max_seq_len, + masked_lm_prob=args.masked_lm_prob, + short_seq_prob=args.short_seq_prob, + seed=args.seed, + skip_warmup=True, + binary_head=binary_head, + max_seq_length_dec=None, + dataset_type="ernie", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = data + if tokenizer.pad_token_id in input_ids: + input_ids = input_ids[0 : list(input_ids).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(input_ids)) + for pos, label in zip(masked_lm_positions, masked_lm_labels): + input_ids[pos] = label + logger.info(tokenizer._decode(input_ids)) + logger.info(tokenizer.convert_ids_to_tokens(masked_lm_labels)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. 
next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return out + + def loader(dataset, consumed_samples=0): + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.micro_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + consumed_samples=consumed_samples, + ) + data_loader = paddle.io.DataLoader( + dataset=dataset, + places=places, + feed_list=data_holders, + batch_sampler=batch_sampler, + num_workers=args.num_workers, + worker_init_fn=None, + collate_fn=_collate_data, + return_list=False, + ) + return data_loader + + train_dl = loader(train_ds, args.global_batch_size * current_step) + valid_dl = loader( + valid_ds, args.micro_batch_size * ((current_step + 1) // args.eval_freq) * args.eval_iters * data_world_size + ) + test_dl = loader(test_ds, 0) + + return train_dl, valid_dl, test_dl + + +def create_data_holder(args=None): + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.data(name="segment_ids", shape=[-1, -1], dtype="int64") + input_mask = paddle.static.data(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.data(name="masked_lm_positions", shape=[-1], dtype="int32") + masked_lm_labels = paddle.static.data(name="masked_lm_labels", shape=[-1, 1], dtype="int64") + + next_sentence_labels = paddle.static.data(name="next_sentence_labels", shape=[-1, 1], dtype="int64") + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] + + +def dist_optimizer(args, topo): + default_global_batch_size = topo.data_info.size * args.micro_batch_size + if args.global_batch_size is None: + args.global_batch_size = default_global_batch_size + + bsz_per_dp = args.global_batch_size // topo.data_info.size + micro_batch_size = args.micro_batch_size + assert ( + args.global_batch_size % micro_batch_size == 0 + ), "cannot do gradient accumulate, global_batch_size: {} micro_batch_size: {}".format( + args.global_batch_size, micro_batch_size + ) + accumulate_steps = bsz_per_dp // micro_batch_size + + exec_strategy = paddle.static.ExecutionStrategy() + exec_strategy.num_threads = 1 + exec_strategy.num_iteration_per_drop_scope = 10000 + + build_strategy = paddle.static.BuildStrategy() + build_strategy.enable_sequential_execution = True # for profile + # build_strategy.reduce_strategy = paddle.static.BuildStrategy.ReduceStrategy._NoReduce + build_strategy.fuse_broadcast_ops = True + build_strategy.fix_op_run_order = True + build_strategy.enable_inplace = True + build_strategy.enable_addto = args.enable_addto + + dist_strategy = fleet.DistributedStrategy() + # dist_strategy.without_graph_optimization = True + dist_strategy.execution_strategy = exec_strategy + dist_strategy.build_strategy = build_strategy + dist_strategy.nccl_comm_num = 3 + dist_strategy.fuse_grad_size_in_MB = 16 + + dist_strategy.recompute = args.use_recompute + 
dist_strategy.pipeline = args.pp_degree > 1 + + if args.pp_degree <= 1 and args.sharding_degree <= 1 and accumulate_steps > 1: + dist_strategy.gradient_merge = True + dist_strategy.gradient_merge_configs = {"k_steps": accumulate_steps} + args.eval_iters *= accumulate_steps + args.test_iters *= accumulate_steps + + if args.use_amp: + dist_strategy.amp = True + dist_strategy.amp_configs = { + "custom_white_list": [ + "softmax", + "layer_norm", + "gelu", + ], + "custom_black_list": ["c_softmax_with_cross_entropy"], + "init_loss_scaling": 32768, + "use_dynamic_loss_scaling": True, + } + if args.use_sharding: + dist_strategy.sharding = True + dist_strategy.sharding_configs = { + "segment_broadcast_MB": 32, + "sharding_degree": args.sharding_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "dp_degree": args.dp_degree, + "gradient_merge_acc_step": accumulate_steps if args.sharding_degree > 1 else 1, + "optimize_offload": False, + } + if args.pp_degree > 1: + dist_strategy.pipeline_configs = { + "schedule_mode": "1F1B", + "micro_micro_batch_size": micro_batch_size, + "accumulate_steps": accumulate_steps, + } + + args.accumulate_steps = accumulate_steps + return dist_strategy + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def run_evaluate( + data_loader, exe, program, iter_steps, log_writer, global_step, args, is_last, eval_fetch, task_name="valid" +): + all_ret = collections.defaultdict(list) + average_ret = collections.defaultdict(float) + + local_time = time.time() + worker_num = fleet.worker_num() + + for eval_step, batch in enumerate(data_loader): + ret = exe.run(program, feed=batch, fetch_list=list(eval_fetch.values())) + if is_last: + for k, v in zip(list(eval_fetch.keys()), ret): + all_ret[k].append(v.item()) + + if eval_step >= iter_steps - 1: + if not is_last or log_writer is None: + break + + for k in list(eval_fetch.keys()): + average_ret[k] = sum(all_ret[k]) / len(all_ret[k]) / worker_num + + speed = iter_steps / (time.time() - local_time) + speed_tokens = speed * args.micro_batch_size * args.max_seq_len * worker_num + ips = speed * args.micro_batch_size * worker_num + + loss_info = ", ".join(["{}: {:.6f}".format(k, average_ret[k]) for k in eval_fetch.keys()]) + + logger.info( + "%s step %d, batch: %d, %s, speed: %.0f tokens/s, ips: %.2f seqs/s" + % (task_name, global_step, eval_step + 1, loss_info, speed_tokens, ips) + ) + + for k in list(eval_fetch.keys()): + log_writer.add_scalar("%s/%s" % (task_name, k), average_ret[k], global_step) + + break + + +def all_reduce(v): + if fleet.worker_num() <= 1: + return v + v = v + 0 + dist.all_reduce(v) + return v + + +def default_logdir() -> str: + """ + Same default + """ + import socket + from datetime import datetime + + current_time = datetime.now().strftime("%b%d_%H-%M-%S") + return os.path.join("runs", current_time + "_" + socket.gethostname()) + + +def do_train(args): + # Initialize the paddle and paddle fleet execute environment + paddle.enable_static() + 
fleet.init(is_collective=True) + + # Create the random seed for the worker + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + get_rng_state_tracker().add("global_seed", args.seed) + get_rng_state_tracker().add("local_seed", args.seed + fleet.worker_index() + 2021) + + assert args.device in ["cpu", "gpu", "xpu", "mlu"], "Invalid device! Available device should be cpu, gpu, or xpu." + place = paddle.set_device(args.device) + + worker_num = fleet.worker_num() + worker_index = fleet.worker_index() + + assert ( + args.dp_degree * args.sharding_degree * args.mp_degree * args.pp_degree == worker_num + ), "The product of degree num should be equal to worker_num." + + topo = Topology( + device_rank=worker_index, + world_size=worker_num, + dp_degree=args.dp_degree, + pp_degree=args.pp_degree, + sharding_degree=args.sharding_degree, + mp_degree=args.mp_degree, + ) + + logger.info("The topo of hybrid parallelism:\n{}".format(topo)) + + dist_strategy = dist_optimizer(args, topo) + + # Create log write, train results show on last card of pipeline. + # Create log write, + log_writer = None + if worker_index == 0: + log_writer = LogWriter(os.path.join(args.output_dir, default_logdir())) + + # Define the input data in the static mode + config_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[args.model_type] + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + # load config in checkpoint + global_step = 0 + checkpoint_dir = os.path.join(args.output_dir, "model_last") + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + with open(os.path.join(checkpoint_dir, "./config.yml"), "r") as f: + step_config = yaml.load(f, Loader=yaml.FullLoader) + assert ( + step_config["global_batch_size"] == args.global_batch_size + ), "Please ensure checkpoint global batch size is the same. Folder: {}".format(checkpoint_dir) + global_step = step_config["global_step"] + + data_file = get_train_data_file(args) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + with paddle.static.program_guard(main_program, startup_program): + data_holders = create_data_holder(args) + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. 
next_sentence_labels + + [ + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + ] = data_holders + + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + tokenizer.extend_chinese_char() + + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size=topo.data_info.size, + data_world_rank=topo.data_info.rank, + max_seq_len=args.max_seq_len, + places=paddle.static.cuda_places(), + data_holders=data_holders, + binary_head=args.binary_head, + current_step=global_step, + ) + fleet.init(is_collective=True) + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + if model_config["vocab_size"] % 8 != 0: + model_config["vocab_size"] += 8 - (model_config["vocab_size"] % 8) + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model = model_class(config_class(**model_config)) + else: + model, _ = model_class.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the model for the gpt pretrain + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + criterion = criterion_class(with_nsp_loss=args.binary_head) + if args.binary_head: + lm_loss, sop_loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + loss = lm_loss + sop_loss + lm_loss_reduce = all_reduce(lm_loss) + sop_loss_reduce = all_reduce(sop_loss) + else: + loss = criterion(prediction_scores, seq_relationship_score, masked_lm_labels) + + loss_reduce = all_reduce(loss) + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + args.max_lr, + args.min_lr, + warmup_step=args.warmup_rate * args.max_steps, + decay_step=args.decay_steps, + last_epoch=global_step, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + decay_param = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + logger.info("Using paddle.optimizer.AdamW.") + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + grad_clip=clip, + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_param, + ) + # alias + optimizer.apply_optimize = optimizer._apply_optimize + + # if args.use_recompute: + # dist_strategy.recompute = True + # dist_strategy.recompute_configs = { + # "checkpoints": model.ernie.checkpoints + # } + + # Use the fleet api to compile the distributed optimizer + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + + optimizer.minimize(loss) + logger.info(f"final strategy: {fleet._final_strategy()}") + logger.info("The training meta optimizer is/are %s" % fleet._get_applied_meta_list()) + + program_desc_dir = os.path.join(args.output_dir, "program_desc") + if not os.path.isdir(program_desc_dir): + os.mkdir(program_desc_dir) + + with open(program_desc_dir + "/main_program.txt.%d" 
% worker_index, "w") as f: + f.write(str(main_program)) + + with open(program_desc_dir + "/startup_program.txt.%d" % worker_index, "w") as f: + f.write(str(startup_program)) + + if worker_index == 0: + # log the model config and args + model_config_json = json.dumps(model.config.to_dict(), ensure_ascii=False, indent=2) + log_writer.add_text("model_config", model_config_json) + args_dict = {"paddle commit id": str(paddle.version.commit)} + for arg in vars(args): + args_dict[arg] = str(getattr(args, arg)) + log_writer.add_text("args", json.dumps(args_dict, indent=2)) + + # Define the Executor for running the static model + exe = paddle.static.Executor(place) + exe.run(startup_program) + + test_program = main_program.clone(for_test=True) + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + dygrah_path = os.path.join(args.model_name_or_path, "model_state.pdparams") + static_path = os.path.join(args.model_name_or_path, "static_vars") + + flag_loaded = False + if os.path.exists(static_path): + if args.mp_degree > 1: + logger.warning("MP should init with dygraph params") + else: + logger.info("Loading parameters from %s" % static_path) + paddle.static.load(main_program, static_path, exe) + flag_loaded = True + + if not flag_loaded and os.path.exists(dygrah_path): + if args.sharding_degree > 1: + logger.warning("Sharding should init with static vars") + else: + logger.info("Loading parameters from %s" % dygrah_path) + # init_static_with_params(model, paddle.load(dygrah_path, return_numpy=True), topo, main_program) + flag_loaded = True + + if not flag_loaded: + logger.error("No checkpoint load.") + + # load checkpoint vars + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + paddle.static.load(main_program, os.path.join(checkpoint_dir, "static_vars"), exe) + + fetch_loss_vars = collections.OrderedDict() + fetch_other_vars = collections.OrderedDict() + fetch_loss_vars["loss"] = loss_reduce + if args.binary_head: + fetch_loss_vars["lm_loss"] = lm_loss_reduce + fetch_loss_vars["sop_loss"] = sop_loss_reduce + + fetch_other_vars["learning_rate"] = main_program.global_block().vars["learning_rate_0"] + + additional_vars = collections.OrderedDict() + if args.use_amp: + for key in ["loss_scaling", "num_good_steps", "num_bad_steps"]: + additional_vars[key] = main_program.global_block().vars[key + "_0"] + + tic_train = time.time() + while True: + fetchs = [] + fetchs_keys = [] + if topo.is_last: + fetchs = list(fetch_loss_vars.values()) + list(fetch_other_vars.values()) + list(additional_vars.values()) + fetchs_keys = list(fetch_loss_vars.keys()) + list(fetch_other_vars.keys()) + list(additional_vars.keys()) + + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. 
+ valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + loss_res = collections.defaultdict(list) + for step, batch in enumerate(train_data_loader()): + ret = exe.run(main_program, feed=batch, fetch_list=fetchs, use_program_cache=True) + # Skip for accumulate_steps in global step + + if log_writer is not None: + for k, v in zip(fetchs_keys, ret): + if k in fetch_loss_vars: + loss_res[k].append(v.item()) + + if (step + 1) % args.accumulate_steps != 0: + continue + global_step += 1 + # In the new 2.0 api, must call this function to change the learning_rate + lr_scheduler.step() + + if global_step % args.logging_freq == 0: + if topo.is_last and log_writer is not None: + res = collections.defaultdict(float) + for k, v in zip(fetchs_keys, ret): + if k in fetch_loss_vars: + res[k] = sum(loss_res[k]) / len(loss_res[k]) / worker_num + loss_res[k] = [] + else: + res[k] = v + + speed = args.logging_freq / (time.time() - tic_train) + res["global_step"] = global_step + res["steps_per_second"] = speed + res["samples_per_second"] = speed * args.global_batch_size + + loss_info = ", ".join(["{}: {:.6f}".format(k, res[k]) for k in fetch_loss_vars.keys()]) + + common_loginfo = ( + "global step %d, %s, speed: %.2f steps/s, ips: %.2f seqs/s, learning rate: %.5e" + % ( + global_step, + loss_info, + res["steps_per_second"], + res["samples_per_second"], + res["learning_rate"], + ) + ) + additional_loginfo = ", ".join(["{}: {}".format(k, res[k]) for k in additional_vars.keys()]) + if additional_loginfo: + common_loginfo += ", " + additional_loginfo + logger.info(common_loginfo) + + for k, v in res.items(): + if k in additional_vars: + log_writer.add_scalar("amp/" + k, v, global_step) + else: + log_writer.add_scalar("train/" + k, v, global_step) + + tic_train = time.time() + + # if args.check_accuracy: + # if global_step >= args.max_steps: + # return + # else: + # continue + + if global_step % args.eval_freq == 0: + # TODO, check the input data of validation + eval_fetch = collections.OrderedDict() + # if topo.is_last: + eval_fetch["loss"] = loss_reduce + if args.binary_head: + eval_fetch["lm_loss"] = lm_loss_reduce + eval_fetch["sop_loss"] = sop_loss_reduce + + run_evaluate( + valid_data_loader, + exe, + test_program, + args.eval_iters, + log_writer, + global_step, + args, + topo.is_last, + eval_fetch, + "valid", + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + logger.debug("saving models to {}".format(output_dir)) + save_persistables(exe, os.path.join(output_dir, "static_vars"), main_program) + tokenizer.save_pretrained(output_dir) + tic_train = time.time() + + if global_step % args.checkpoint_steps == 0: + output_dir = os.path.join(args.output_dir, "model_last") + if worker_index == 0: + if not os.path.exists(output_dir): + os.mkdir(output_dir) + output_dir_bak = os.path.join(args.output_dir, "model_last_bak") + if os.path.exists(output_dir): + if os.path.exists(output_dir_bak): + shutil.rmtree(output_dir_bak) + shutil.move(output_dir, output_dir_bak) + os.mkdir(output_dir) + + step_config = { + "model_name": args.model_name_or_path, + "global_step": global_step, + "global_batch_size": args.global_batch_size, + "consumed_samples": global_step * args.global_batch_size, + } + + with open(os.path.join(output_dir, "config.yml"), "w") as f: + yaml.dump(step_config, f, encoding="utf-8", allow_unicode=True) + + fleet.barrier_worker() + + logger.debug("saving 
models to {}".format(output_dir)) + if args.sharding_degree <= 1: + # Save on the first worker by default. + if worker_index == 0: + paddle.static.save(main_program, os.path.join(output_dir, "static_vars")) + else: + # Use save_persistables in sharding, but more slower + save_persistables(exe, os.path.join(output_dir, "static_vars"), main_program) + + if global_step >= args.max_steps: + eval_fetch = collections.OrderedDict() + # if topo.is_last: + eval_fetch["loss"] = loss_reduce + if args.binary_head: + eval_fetch["lm_loss"] = lm_loss_reduce + eval_fetch["sop_loss"] = sop_loss_reduce + + run_evaluate( + test_data_loader, + exe, + test_program, + args.test_iters, + log_writer, + global_step, + args, + topo.is_last, + eval_fetch, + "test", + ) + del train_data_loader + del valid_data_loader + del test_data_loader + return + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + logger.info("{:20}:{}".format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + do_train(args) diff --git a/model_zoo/ernie-1.0/run_pretrain_trainer.py b/model_zoo/ernie-1.0/run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..1d54eee4ffaf153b99e4fe445e8b976e2c002232 --- /dev/null +++ b/model_zoo/ernie-1.0/run_pretrain_trainer.py @@ -0,0 +1,486 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +ERNIE-1.0 pretraining scripts. +""" +import math +import os +import random +import time +from dataclasses import dataclass, field +from typing import Optional + +import numpy as np +import paddle +from data_tools.dataset_utils import build_train_valid_test_datasets + +from paddlenlp.data import Stack +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( + ErnieConfig, + ErnieForMaskedLM, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": ( + ErnieConfig, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + ), +} + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." 
+ }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + masked_lm_prob: float = field( + default=0.15, + metadata={"help": "Mask token prob."}, + ) + short_seq_prob: float = field( + default=0.1, + metadata={"help": "Short sequence prob."}, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + favor_longer_ngram: bool = field( + default=False, + metadata={"help": "Whether to favor long ngrams"}, + ) + max_ngrams: int = field( + default=3, + metadata={"help": "Max N Grams"}, + ) + data_impl: str = field( + default="mmap", + metadata={"help": "mmap/lazy format converted from json"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="ernie", metadata={"help": "Only support for ernie pre-training for now."} + ) + model_name_or_path: str = field( + default="ernie-1.0", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + binary_head: Optional[bool] = field(default=True, metadata={"help": "True for NSP task."}) + hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) + attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention probs dropout prob."}) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + binary_head=True, +): + + train_valid_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.world_size * training_args.test_iters, + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=data_args, + tokenizer=tokenizer, + splits_string=data_args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=data_args.max_seq_length, + masked_lm_prob=data_args.masked_lm_prob, + short_seq_prob=data_args.short_seq_prob, + seed=training_args.seed, + skip_warmup=True, + binary_head=binary_head, + max_seq_length_dec=None, + dataset_type="ernie", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data 
for {mode} mode") + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = data + if tokenizer.pad_token_id in input_ids: + input_ids = input_ids[0 : list(input_ids).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(input_ids)) + for pos, label in zip(masked_lm_positions, masked_lm_labels): + input_ids[pos] = label + logger.info(tokenizer._decode(input_ids)) + logger.info(tokenizer.convert_ids_to_tokens(masked_lm_labels)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return { + "input_ids": out[0], + "token_type_ids": out[1], + "attention_mask": out[2], + "masked_positions": out[3], + "labels": (out[4], out[5]), + } + + return train_ds, valid_ds, test_ds, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. 
+ compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + config_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + if model_args.binary_head is False: + model_class = ErnieForMaskedLM + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if model_args.model_name_or_path in pretrained_models_list: + logger.warning(f"Your model {model_args.model_name_or_path} is training from scratch !!!") + model_config = model_class.pretrained_init_configuration[model_args.model_name_or_path] + model_config["hidden_dropout_prob"] = model_args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = model_args.attention_probs_dropout_prob + model = model_class(config_class(**model_config)) + # model_config["enable_recompute"] = args.use_recompute + else: + logger.warning(f"Your model is continue training from {model_args.model_name_or_path}") + model = model_class.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.hidden_dropout_prob, + attention_probs_dropout_prob=model_args.attention_probs_dropout_prob, + ) + + class CriterionWrapper(paddle.nn.Layer): + """ """ + + def __init__(self): + """CriterionWrapper""" + super(CriterionWrapper, self).__init__() + self.criterion = criterion_class(with_nsp_loss=model_args.binary_head) + + def forward(self, output, labels): + """forward function + + Args: + output (tuple): prediction_scores, seq_relationship_score + labels (tuple): masked_lm_labels, next_sentence_labels + + Returns: + Tensor: final loss. + """ + masked_lm_labels, next_sentence_labels = labels + if model_args.binary_head: + prediction_scores, seq_relationship_score = output + + lm_loss, sop_loss = self.criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + + loss = lm_loss + sop_loss + + else: + prediction_scores = output + loss = self.criterion(prediction_scores, None, masked_lm_labels) + + return loss + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + training_args.learning_rate, + training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + ) + + data_file = get_train_data_file(data_args) + tokenizer = tokenizer_class.from_pretrained(model_args.tokenizer_name_or_path) + tokenizer.extend_chinese_char() + + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer, model_args.binary_head + ) + + trainer = PretrainingTrainer( + model=model, + criterion=CriterionWrapper(), + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + 
trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/vocab/README.md b/model_zoo/ernie-1.0/vocab/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8179e8651a81af2867d3f2996c54cd9f9a0b03f6 --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/README.md @@ -0,0 +1,203 @@ +# ERNIE 中文词表制作 + +ERNIE是百度提出的大规模预训练模型,曾在中文场景下取得了SOTA效果。 +PaddleNLP致力于预训练开源工作,本文档提供了ERNIE词表的制作方法。 + +预训练全部流程的整体详细介绍文档,请参考[ERNIE 中文预训练介绍](../pretraining_introduction.md)。 + +**目录** +* [1. 数据获取](#数据获取) +* [2. 全字符中文词表制作](#中文词表制作) + - [2.1 分析准备](#分析准备) + - [2.2 文本字符统计](#文本字符统计) + - [2.3 英文字符词表](#英文字符词表) + - [2.4 合并词表](#合并词表) +* [3. 词表使用](#vocab_usage) + - [3.1 转化为jsonl格式数据](#jsonl) + - [3.2 TokenID转化](#快速TokenID转化) +* [4. 参考](#ref) + + + + +## 1. 数据获取 + + +**WuDaoCorpus2.0 Base 语料** + +WuDaoCorpora是悟道爬取的中文大规模语料。整体数量为3TB,目前开源的部分为WuDaoCorpus2.0 bases数据集,大小为200GB。用户请参考[这里](../preprocess/docs/WuDaoCorpusBase.md)获取原始文本数据。 + + +**CLUECorpus2020 语料** + +CLUECorpus2020 过对Common Crawl的中文部分进行语料清洗得到。开源部分提供了约200G左右的语料文本,详细介绍见[官网](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD),用户参考[这里](./preprocess/docs/CLUECorpus2020.md)获取原始文本数据。 + + + + + + +## 2. 全字符中文词表制作 + +词表的制作有两种方案: + +第一种,词表组合方案 +1. 统计字符 +2. 制作英文词表 +3. 合并词表 + +第二种,预处理后直接生成,方案 +1. 文本预处理(中文加空格,文本normalize) +2. 使用sentencepeice制作词表 + +第二种方案需要对文本先使用`BasicTokenizer`切分一遍语料。 +第一种方案,自定义程度高,但存在一些局限性。本项目采用了第一种方案,详细介绍如下: + +### 2.1 分析准备 +词表大小: 这里我们考虑的因素主要有两个 +- 已有模型对照: + - ERNIE 3.0系列模型的词表,词表大小为 40000 左右。 +- 预训练数据存储占用: + - 文本token id化后,希望使用uint16表示,此时表示的最大字符为65536。 + - 同时考虑到ERNIE虽然是字模型,我们的仍然需要 `##中` 之类的中文字符表示分词信息。假设使用中文全字符20902(0x4E00-0x9FA5)个字符,那么剩余 vocab 大小不能超过 44634。 + +综上,本项目决定采用 40000 左右的 vocab 容量。 +其中: +- 中文全字符 `20902` +- 英文字符 `17000` +- 其他字符约 `2000` 左右 + + +### 2.2 文本字符统计 +首先第一步是对文本字符进行统计。字符统计的目的主要是添加常用的中文字符、特殊字符。 + +由于语料文本过大,我们随机选取 10G 左右的原始文本进行了字符统计。 +``` +python gen_char.py path_to_corpus.txt +``` +可以在本地文件夹得到`char_dict.pickle`字符频率文件。同时我们也提供了自己统计的词频文件,方便用户复现: +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/char_dict.pickle +``` + +### 2.3 英文字符词表 +基于字符的词频统计,使得英文字符也切割为字母,为此我们需要添加英文词表。 +英文部分,我们使用了 [WikiText](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip) 数据集,来构造词表。 +下载解压数据,使用BPE切词 +``` +wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip +unzip wikitext-103-v1.zip +python gen_vocab.py ./wikitext-103-raw/wiki.train.raw +``` +即可产生英文部分的词表。这里我们也提供了处理好的 vocab 方便用户验证。 +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/eng.vocab +``` + + +### 2.4 合并词表 + +目前我们得到了字符统计表,和英文字符词表。下一步,我们将词表进行合并。 + +将`char_dict.pickle`,`eng.vocab`放置到当前目录,使用下面命令 +``` +python merge_vocab.py +``` +即可在 当前 目录生成 vocab.txt 得到最终词表。 + +此阶段需要注意的一些问题是: +1. 对于一些日文、谚文文字字符,需要进行 normalize +2. 添加special_tokens + +### 2.5 问题遗留 +本项目采用的第一种方式,即拼接产出的词表,对连续非中、英文字符文本,会出现UNK的情况。 +如issue: [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927)、 [#2585](https://github.com/PaddlePaddle/PaddleNLP/issues/2585)。本项目做了两点改进: + +1. 对 Symbol 字符默认添加空格,变成独立字符 +2. 对 日文、谚文 在合并词表阶段默认添加 ## 字符。 + +虽然有上述两点修复,任然无法避免 [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927) 现象。 +彻底解决的话,建议使用第二种方式制作vocab文件。 + +### 2.6 方案二:预处理后直接生成 +此方案没有被采用,这里也简单说明一下具体的方案: +1. 对语料使用 BasicTokenizer 转换 +```python +from paddlenlp.transformers import +tokenizer = BasicTokenizer() +basic_toknizer = lambda x: " ".join(tokenizer.tokenize(x)) +# 对语料使用 basic_toknizer 转换 +# 并存储为新的语料 afer_basic_toknizer_corpus.txt +``` +2. 
处理转换后的语料 +```shell +python gen_vocab.py afer_basic_toknizer_corpus.txt +``` +对处理好的vocab文件手动替换一些` -> [PAD]`之类的special_tokens,即可产出词表。 + + + +## 3. 词表使用 + + + +## 3.1 转化为jsonl格式数据 + +本文档以WuDao数据为例,对数据进行分词: + +```shell +python ../preprocess/words_segmentation.py \ + --input_path ./WuDaoCorpus2.0_base_200G \ + --workers 40 \ + --data_format wudao \ + --cn_seg_func seg \ + --output_path ./wudao_lac_cut \ +``` + +文本转化完成后。我们使用 `../data_tools/trans_to_json.py`重新转换为jsonl格式(分词完毕)。 +```shell +python ../preprocess/trans_to_json.py \ + --input_path ./wudao_lac_cut \ + --output_path wudao_corpus_200g_0623.jsonl \ + --workers 40 \ +``` + + + +## 3.2 Token ID 转化 + +语料、新建的词表准备妥当后,我们可以开始进行最后的数据ID转化。 + +``` +python -u ../preprocess/create_pretraining_data.py \ + --model_name /path/to/your/vocab.txt \ + --tokenizer_name ErnieTokenizer \ + --input_path wudao_corpus_200g_0623.jsonl \ + --split_sentences \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func jieba \ + --cn_splited \ + --output_prefix wudao_corpus_200g_0623 \ + --workers 48 \ + --log_interval 10000 +``` + +- 我们提前分词好了,所以加上了 `cn_splited`,否则不需要使用此选项。 +- model_name 指定为我们准备的词表路径。也可以更换为其他 ERNIE 系列模型,如: `ernie-3.0-base-zh` +- workers 表示转化的线程数目 + +转化后的数据如下,使用这份数据,即可开始ERNIE预训练 +``` +-rw-rw-r-- 1 500 501 129G Jul 4 03:39 wudao_200g_0703_ids.npy +-rw-rw-r-- 1 500 501 6.4G Jul 4 03:39 wudao_200g_0703_idx.npz +``` + + +## 4. 参考 + +感谢CLUE,WuDao提供的开源文本语料,参考资料: +- Xu, L., Zhang, X. and Dong, Q., 2020. CLUECorpus2020: A large-scale Chinese corpus for pre-training language model. arXiv preprint arXiv:2003.01355. +- Yuan, S., Zhao, H., Du, Z., Ding, M., Liu, X., Cen, Y., Zou, X., Yang, Z. and Tang, J., 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open, 2, pp.65-68. +- https://github.com/CLUEbenchmark/CLUECorpus2020 +- https://resource.wudaoai.cn diff --git a/model_zoo/ernie-1.0/vocab/gen_char.py b/model_zoo/ernie-1.0/vocab/gen_char.py new file mode 100644 index 0000000000000000000000000000000000000000..fbda9900f245ae884e64990d38597d4fdb579865 --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/gen_char.py @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
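+
+# gen_char.py: count character frequencies over a raw text corpus (a single file or a
+# directory of files) and dump them to char_dict.txt / char_dict.pickle, which
+# merge_vocab.py later consumes when assembling the final vocabulary.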
+ +import os +import time +import sys +import pickle +from collections import defaultdict + +input_path = sys.argv[1] +print(input_path) + +char_dict = defaultdict(int) + +file_paths = [] +if os.path.isfile(input_path): + file_paths.append(input_path) +else: + for root, _, fs in os.walk(input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + +count = 0 +s = time.time() +data_len = 0 +for file_name in file_paths: + print(f" > reading file {file_name}") + with open(file_name, "r") as f: + line = f.readline() + while line: + count += 1 + data_len += len(line.encode("utf-8")) + for char in line: + char_dict[char] += 1 + line = f.readline() + if count % 10000 == 0: + print( + f"processed doc {count}, char size: {len(char_dict)}, speed: {data_len/1024/1024/(time.time() - s)} MB/s" + ) + with open("char_dict.txt", "w") as rf: + res = sorted(char_dict.items(), key=lambda x: -x[1]) + for x in res: + k, v = x + rf.write(f"{k} {v}\n") + +with open("char_dict.txt", "w") as f: + res = sorted(char_dict.items(), key=lambda x: -x[1]) + for x in res: + k, v = x + f.write(f"{k} {v}\n") + +with open("char_dict.pickle", "wb") as f: + pickle.dump(char_dict, f) diff --git a/model_zoo/ernie-1.0/vocab/gen_vocab.py b/model_zoo/ernie-1.0/vocab/gen_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..4df40721c411da0e07830ddf7b2bc9aa73b21b87 --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/gen_vocab.py @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import sentencepiece as spm + +input_path = sys.argv[1] +print("Generate vocabulary file for corpus: ", input_path) + +spm.SentencePieceTrainer.train( + input=input_path, + model_prefix="eng", + vocab_size=17000, + model_type="BPE", +) diff --git a/model_zoo/ernie-1.0/vocab/merge_vocab.py b/model_zoo/ernie-1.0/vocab/merge_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..15f6f6191874a87c806b28deeab05bcc5a48d33a --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/merge_vocab.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
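+
+# merge_vocab.py: merge the character-frequency statistics (char_dict.pickle) with the
+# English BPE vocabulary (eng.vocab) into the final vocab.txt used for ERNIE pretraining,
+# normalizing characters and adding special tokens along the way.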
+ +import pickle +import re + +from paddlenlp.transformers import BasicTokenizer +from paddlenlp.transformers.tokenizer_utils import ( + _is_control, + _is_punctuation, + _is_whitespace, + is_chinese_char, + tokenize_special_chars, +) + +re_eng = re.compile("[#a-zA-Z0-9]", re.U) +re_sep = re.compile("\[[A-Z]+\]", re.U) +re_sep_eng = re.compile("\<[\/a-z]+\>", re.U) + +bt = BasicTokenizer() +normalize_chars = lambda x: "".join(bt.tokenize(x)) + + +# 20902 个中文全字符 +def chinese_char(): + return set([chr(x) for x in range(0x4E00, 0x9FA5 + 1)]) + + +# 日文 或 谚文字母 +def jk_vocab(c): + c = ord(c) + return (c >= 0x3040 and c <= 0x33FF) or (c >= 0x1100 and c <= 0x11FF) # 谚文字母 + + +# 特殊 TOKEN +def add_special_token(): + return ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + + +char_dict = pickle.load(open("char_dict.pickle", "rb")) +chinese_vocab = chinese_char() +final_vocab = set() +other_char = [] + + +def add_vocab(char, f): + if re_sep_eng.match(char): + # Delete tokens in eng.vocab + return + + # Add eng vocab and specical token + if re_eng.match(char) or re_sep.match(char): + if char not in final_vocab: + final_vocab.add(char) + f.write(f"{char}\n") + return + + # Add chinese char + if len(char) > 1 and char.startswith("##") and chinese_vocab(char[2]): + if char not in final_vocab: + final_vocab.add(char) + f.write(f"{char}\n") + return + + # Normalize char, 部分字符 nioe + char = normalize_chars(char) + for i, k in enumerate(char): + if _is_whitespace(k) or _is_control(k): + continue + if k not in final_vocab: + if not _is_punctuation(k) and not is_chinese_char(ord(k)) and k == tokenize_special_chars(k): + other_char.append(k) + final_vocab.add(k) + f.write(f"{k}\n") + if jk_vocab(k): + # add "##" for japanese and korean char + add_vocab("##" + k, f) + + +with open("vocab.txt", "w") as f: + for x in add_special_token(): + add_vocab(x, f) + + res = sorted(char_dict.items(), key=lambda x: -x[1]) + + # Add chinse char by freq + for x in res: + k, v = x + k = normalize_chars(k) + if k in chinese_vocab: + add_vocab(k, f) + chinese_vocab.remove(k) + + # If chinse char not in freq add it + chinese_vocab = sorted(chinese_vocab) + while len(chinese_vocab) > 0: + k = chinese_vocab.pop() + if k not in final_vocab: + f.write(f"{k}\n") + final_vocab.add(k) + + # And english vocab part + with open("eng.vocab") as ec: + line = ec.readline() + while line: + k, v = line.strip().split() + if "▁" in k: + # remove "▁" in eng vocab + k = k[1:] + elif re_sep_eng.match(k): + pass + else: + # add "##" for eng vocab + k = "##" + k + + add_vocab(k, f) + line = ec.readline() + + # Add additional tokens in corpus + # such as japanese and korean char and other symbols + for x in res: + k, v = x + if v >= 200: + add_vocab(k, f) diff --git a/model_zoo/ernie-3.0/README.md b/model_zoo/ernie-3.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..96a6452183dc0ea0ad63770070db5228f16138b5 --- /dev/null +++ b/model_zoo/ernie-3.0/README.md @@ -0,0 +1,1632 @@ +# ERNIE 3.0 轻量级模型 + + **目录** + * [模型介绍](#模型介绍) + * [在线蒸馏技术](#在线蒸馏技术) + * [模型效果](#模型效果) + * [开始运行](#开始运行) + * [环境要求](#环境要求) + * [数据准备](#数据准备) + * [模型训练](#模型训练) + * [模型预测](#模型预测) + * [模型压缩](#模型压缩) + * [环境依赖](#环境依赖) + * [模型压缩 API 使用](#模型压缩API使用) + * [压缩效果](#压缩效果) + * [精度测试](#精度测试) + * [性能测试](#性能测试) + * [CPU 性能](#CPU性能) + * [GPU 性能](#CPU性能) + * [使用 FastTokenizer 加速](#使用FastTokenizer加速) + * [部署](#部署) + * [FastDeploy 部署](#FastDeploy部署) + * [Python 部署](#Python部署) + * [C++ 部署](#C++部署) + * [服务化部署](#服务化部署) + * [Notebook教程](#Notebook教程) + * [参考文献](#参考文献) + + + +## 
模型介绍 + +本次开源的模型是文心大模型 ERNIE 3.0, 文心大模型 ERNIE 3.0 作为百亿参数知识增强的大模型,除了从海量文本数据中学习词汇、结构、语义等知识外,还从大规模知识图谱中学习。 基础上通过**在线蒸馏技术**得到的轻量级模型,模型结构与 ERNIE 2.0 保持一致,相比 ERNIE 2.0 具有更强的中文效果。 + +相关技术详解可参考文章[《解析全球最大中文单体模型鹏城-百度·文心技术细节》](https://www.jiqizhixin.com/articles/2021-12-08-9) + +### 在线蒸馏技术 + +在线蒸馏技术在模型学习的过程中周期性地将知识信号传递给若干个学生模型同时训练,从而在蒸馏阶段一次性产出多种尺寸的学生模型。相对传统蒸馏技术,该技术极大节省了因大模型额外蒸馏计算以及多个学生的重复知识传递带来的算力消耗。 + +这种新颖的蒸馏方式利用了文心大模型的规模优势,在蒸馏完成后保证了学生模型的效果和尺寸丰富性,方便不同性能需求的应用场景使用。此外,由于文心大模型的模型尺寸与学生模型差距巨大,模型蒸馏难度极大甚至容易失效。为此,通过引入了助教模型进行蒸馏的技术,利用助教作为知识传递的桥梁以缩短学生模型和大模型表达空间相距过大的问题,从而促进蒸馏效率的提升。 + +更多技术细节可以参考论文: +- [ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression](https://arxiv.org/abs/2106.02241) +- [ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/abs/2112.12731) + +
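+下面用一小段代码示意最基础的 logits 蒸馏损失,仅为帮助理解蒸馏的基本形式,并非 ERNIE 3.0 在线蒸馏的实际实现;其中函数名与温度系数 `temperature` 均为假设的示例,不对应仓库中的任何实际接口:
+
+```python
+import paddle.nn.functional as F
+
+def distillation_loss(student_logits, teacher_logits, temperature=2.0):
+    # 教师 logits 经温度软化后作为软标签,学生通过 KL 散度逼近教师分布
+    teacher_probs = F.softmax(teacher_logits / temperature, axis=-1)
+    student_log_probs = F.log_softmax(student_logits / temperature, axis=-1)
+    # 乘以 temperature**2 以保持梯度量级,是知识蒸馏的常用做法
+    return F.kl_div(student_log_probs, teacher_probs, reduction="mean") * temperature**2
+```
+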

+(图:在线蒸馏技术示意图,原文此处为图片链接)

+ + + + +### 模型效果 + +本项目开源 **ERNIE 3.0 _Base_** 、**ERNIE 3.0 _Medium_** 、 **ERNIE 3.0 _Mini_** 、 **ERNIE 3.0 _Micro_** 、 **ERNIE 3.0 _Nano_** 五个模型: + +- [**ERNIE 3.0-_Base_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams) (_12-layer, 768-hidden, 12-heads_) +- [**ERNIE 3.0-_Medium_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh.pdparams) (_6-layer, 768-hidden, 12-heads_) +- [**ERNIE 3.0-_Mini_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_mini_zh.pdparams) (_6-layer, 384-hidden, 12-heads_) +- [**ERNIE 3.0-_Micro_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_micro_zh.pdparams) (_4-layer, 384-hidden, 12-heads_) +- [**ERNIE 3.0-_Nano_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_nano_zh.pdparams) (_4-layer, 312-hidden, 12-heads_) + + +下面是 PaddleNLP 中轻量级中文模型的**效果-时延图**。横坐标表示在 IFLYTEK 数据集 (最大序列长度设置为 128) 上测试的延迟(latency,单位:ms),纵坐标是 CLUE 10 个任务上的平均精度(包含文本分类、文本匹配、自然语言推理、代词消歧、阅读理解等任务),其中 CMRC2018 阅读理解任务的评价指标是 Exact Match(EM),其他任务的评价指标均是 Accuracy。图中越靠**左上**的模型,精度和性能水平越高。 + +图中模型名下方标注了模型的参数量,测试环境见[性能测试](#性能测试)。 + +batch_size=32 时,CPU 下的效果-时延图(线程数 1 和 8): + + + + + + +
+ +batch_size=1 时,CPU 下的效果-时延图(线程数 1 和 8): + + + + + + +
+ +batch_size=32 和 1,预测精度为 FP16 时,GPU 下的效果-时延图: + + + + + + +
+ +从图上可看出,ERNIE 3.0 系列轻量级模型在精度和性能上的综合表现已全面领先于 UER-py、Huawei-Noah 以及 HFL 的中文模型。且当 batch_size=1、预测精度为 FP16 时,在 GPU 上宽且浅的模型的推理性能更有优势。 + +在 CLUE **验证集**上评测指标如下表所示: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Arch | Model | AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C3 |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 24L1024H | ERNIE 1.0-Large-cw | 79.03 | 75.97 | 59.65 | 62.91 | 85.09 | 81.73 | 93.09 | 84.53 | 74.22/91.88 | 88.57 | 84.54 |
+| 24L1024H | ERNIE 2.0-Large-zh | 76.90 | 76.23 | 59.33 | 61.91 | 83.85 | 79.93 | 89.82 | 83.23 | 70.95/90.31 | 86.78 | 78.12 |
+| 24L1024H | RoBERTa-wwm-ext-large | 76.61 | 76.00 | 59.33 | 62.02 | 83.88 | 78.81 | 90.79 | 83.67 | 70.58/89.82 | 85.72 | 75.26 |
+| 20L1024H | ERNIE 3.0-Xbase-zh | 78.39 | 76.16 | 59.55 | 61.87 | 84.40 | 81.73 | 88.82 | 83.60 | 75.99/93.00 | 86.78 | 84.98 |
+| 12L768H | **ERNIE 3.0-Base-zh** | 76.05 | 75.93 | 58.26 | 61.56 | 83.02 | 80.10 | 86.18 | 82.63 | 70.71/90.41 | 84.26 | 77.88 |
+| 12L768H | ERNIE 1.0-Base-zh-cw | 76.47 | 76.07 | 57.86 | 59.91 | 83.41 | 79.58 | 89.91 | 83.42 | 72.88/90.78 | 84.68 | 76.98 |
+| 12L768H | ERNIE-Gram-zh | 75.72 | 75.28 | 57.88 | 60.87 | 82.90 | 79.08 | 88.82 | 82.83 | 71.82/90.38 | 84.04 | 73.69 |
+| 12L768H | Langboat/Mengzi-BERT-Base | 74.69 | 75.35 | 57.76 | 61.64 | 82.41 | 77.93 | 88.16 | 82.20 | 67.04/88.35 | 83.74 | 70.70 |
+| 12L768H | ERNIE 2.0-Base-zh | 74.32 | 75.65 | 58.25 | 61.64 | 82.62 | 78.71 | 81.91 | 82.33 | 66.08/87.46 | 82.78 | 73.19 |
+| 12L768H | ERNIE 1.0-Base-zh | 74.17 | 74.84 | 58.91 | 62.25 | 81.68 | 76.58 | 85.20 | 82.77 | 67.32/87.83 | 82.47 | 69.68 |
+| 12L768H | RoBERTa-wwm-ext | 74.11 | 74.60 | 58.08 | 61.23 | 81.11 | 76.92 | 88.49 | 80.77 | 68.39/88.50 | 83.43 | 68.03 |
+| 12L768H | BERT-Base-Chinese | 72.57 | 74.63 | 57.13 | 61.29 | 80.97 | 75.22 | 81.91 | 81.90 | 65.30/86.53 | 82.01 | 65.38 |
+| 12L768H | UER/Chinese-RoBERTa-Base | 71.78 | 72.89 | 57.62 | 61.14 | 80.01 | 75.56 | 81.58 | 80.80 | 63.87/84.95 | 81.52 | 62.76 |
+| 8L512H | UER/Chinese-RoBERTa-Medium | 67.06 | 70.64 | 56.10 | 58.29 | 77.35 | 71.90 | 68.09 | 78.63 | 57.63/78.91 | 75.13 | 56.84 |
+| 6L768H | **ERNIE 3.0-Medium-zh** | 72.49 | 73.37 | 57.00 | 60.67 | 80.64 | 76.88 | 79.28 | 81.60 | 65.83/87.30 | 79.91 | 69.73 |
+| 6L768H | HFL/RBT6, Chinese | 70.06 | 73.45 | 56.82 | 59.64 | 79.36 | 73.32 | 76.64 | 80.67 | 62.72/84.77 | 78.17 | 59.85 |
+| 6L768H | TinyBERT6, Chinese | 69.62 | 72.22 | 55.70 | 54.48 | 79.12 | 74.07 | 77.63 | 80.17 | 63.03/83.75 | 77.64 | 62.11 |
+| 6L768H | RoFormerV2 Small | 68.52 | 72.47 | 56.53 | 60.72 | 76.37 | 72.95 | 75.00 | 81.07 | 62.97/83.64 | 67.66 | 59.41 |
+| 6L768H | UER/Chinese-RoBERTa-L6-H768 | 67.09 | 70.13 | 56.54 | 60.48 | 77.49 | 72.00 | 72.04 | 77.33 | 53.74/75.52 | 76.73 | 54.40 |
+| 6L384H | **ERNIE 3.0-Mini-zh** | 66.90 | 71.85 | 55.24 | 54.48 | 77.19 | 73.08 | 71.05 | 79.30 | 58.53/81.97 | 69.71 | 58.60 |
+| 4L768H | HFL/RBT4, Chinese | 67.42 | 72.41 | 56.50 | 58.95 | 77.34 | 70.78 | 71.05 | 78.23 | 59.30/81.93 | 73.18 | 56.45 |
+| 4L512H | UER/Chinese-RoBERTa-Small | 63.25 | 69.21 | 55.41 | 57.552 | 73.64 | 69.80 | 66.78 | 74.83 | 46.75/69.69 | 67.59 | 50.92 |
+| 4L384H | **ERNIE 3.0-Micro-zh** | 64.21 | 71.15 | 55.05 | 53.83 | 74.81 | 70.41 | 69.08 | 76.50 | 53.77/77.82 | 62.26 | 55.53 |
+| 4L312H | **ERNIE 3.0-Nano-zh** | 62.97 | 70.51 | 54.57 | 48.36 | 74.97 | 70.61 | 68.75 | 75.93 | 52.00/76.35 | 58.91 | 55.11 |
+| 4L312H | TinyBERT4, Chinese | 60.82 | 69.07 | 54.02 | 39.71 | 73.94 | 69.59 | 70.07 | 75.07 | 46.04/69.34 | 58.53 | 52.18 |
+| 4L256H | UER/Chinese-RoBERTa-Mini | 53.40 | 69.32 | 54.22 | 41.63 | 69.40 | 67.36 | 65.13 | 70.07 | 5.96/17.13 | 51.19 | 39.68 |
+| 3L1024H | HFL/RBTL3, Chinese | 66.63 | 71.11 | 56.14 | 59.56 | 76.41 | 71.29 | 69.74 | 76.93 | 58.50/80.90 | 71.03 | 55.56 |
+| 3L768H | HFL/RBT3, Chinese | 65.72 | 70.95 | 55.53 | 59.18 | 76.20 | 70.71 | 67.11 | 76.63 | 55.73/78.63 | 70.26 | 54.93 |
+| 2L128H | UER/Chinese-RoBERTa-Tiny | 44.45 | 69.02 | 51.47 | 20.28 | 59.95 | 57.73 | 63.82 | 67.43 | 3.08/14.33 | 23.57 | 28.12 |
+ + + + +## 代码结构 +以下是本项目代码结构 + +```text +. +├── run_seq_cls.py # 分类任务的微调脚本 +├── run_token_cls.py # 序列标注任务的微调脚本 +├── run_qa.py # 阅读理解任务的微调脚本 +├── compress_seq_cls.py # 分类任务的压缩脚本 +├── compress_token_cls.py # 序列标注任务的压缩脚本 +├── compress_qa.py # 阅读理解任务的压缩脚本 +├── utils.py # 训练工具脚本 +├── configs # 压缩配置文件夹 +│ └── default.yml # 默认配置文件 +├── deploy # 部署目录 +│ └── predictor # onnx离线部署 +│ └── infer_cpu.py +│ └── infer_gpu.py +│ └── README.md +│ └── requirements_cpu.txt +│ └── requirements_gpu.txt +│ └── simple_serving # 基于PaddleNLP SimpleServing 服务化部署 +│ └── client_qa.py +│ └── client_seq_cls.py +│ └── client_token_cls.py +│ └── README.md +│ └── server_qa.py +│ └── server_seq_cls.py +│ └── server_token_cls.py +│ └── triton_serving # 基于Triton Serving 服务化部署 +│ └── models +│ └── README.md +│ └── seq_cls_grpc_client.py +│ └── token_cls_grpc_client.py +└── README.md # 文档 + +``` + + + +## 开始运行 +下面提供以 CLUE 数据集进行模型微调相关训练、预测、部署的代码, CLUE 数据集是中文语言理解测评基准数据集,包括了文本分类、文本推理、实体抽取、问答等相关数据集。 + +### 环境要求 +- python >= 3.7 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 +- paddleslim >= 2.4 + +### 数据准备 +此次微调数据主要是以 CLUE benchmark 数据集为主, CLUE benchmark 包括了文本分类、实体抽取、问答三大类数据集,而 CLUE benchmark 数据目前已经集成在 PaddleNLP 的 datasets 里面,可以通过下面的方式来使用数据集 + +```python +from paddlenlp.datasets import load_dataset + +# Load the clue Tnews dataset +train_ds, test_ds = load_dataset('clue', 'tnews', splits=('train', 'test')) + +``` + + +## 模型训练 + +使用 PaddleNLP 只需要一行代码可以拿到 ERNIE 3.0 系列模型,之后可以在自己的下游数据下进行微调,从而获得具体任务上效果更好的模型。 + +```python + +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + +# 用于分类任务 +seq_cls_model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh") + +# 用于序列标注任务 +token_cls_model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh") + +# 用于阅读理解任务 +qa_model = AutoModelForQuestionAnswering.from_pretrained("ernie-3.0-medium-zh") + +``` + +本项目提供了针对分类(包含文本分类、文本匹配、自然语言推理、代词消歧等任务)、序列标注、阅读理解三大场景下微调的示例脚本,可分别参考 `run_seq_cls.py` 、`run_token_cls.py`、`run_qa.py` 三个脚本,启动方式如下: + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 config.yml 配置 +# --device 选择训练模型的硬件,可选 cpu/gpu/xpu/npu,默认为 gpu。xpu 为昆仑芯片,npu 为昇腾芯片。 +python run_seq_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset afqmc --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml + +# 序列标注任务 +python run_token_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset msra_ner --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml + +# 阅读理解任务 +python run_qa.py --model_name_or_path ernie-3.0-medium-zh --dataset cmrc2018 --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml +``` + + +## 模型预测 + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 config.yml 配置 +# --device 选择训练模型的硬件,可选 cpu/gpu/xpu/npu,默认为 gpu。xpu 为昆仑芯片,npu 为昇腾芯片。 +python run_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models --do_predict --config=configs/default.yml + +# 序列标注任务 +python run_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models --do_predict --config=configs/default.yml + +# 阅读理解任务 +python run_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models --do_predict --config=configs/default.yml +``` + + + + +## 模型压缩 + +尽管 ERNIE 3.0 已提供了效果不错的 6 层、4 
层轻量级模型可以微调后直接使用,但如果有模型部署上线的需求,则需要进一步压缩模型体积,可以使用这里提供的一套模型压缩方案及 API 对上一步微调后的模型进行压缩。 + + + +### 环境依赖 + +使用裁剪功能需要安装 paddleslim 包 + +```shell +pip install paddleslim +``` + + + +### 模型压缩 API 使用 + +本项目使用压缩 API 对任务数据上微调后的模型进行裁剪和量化。用户在传入模型,以及相关的压缩超参数(可选,提供默认选项)后,只需要调用一行 `compress()` 即可一键启动裁剪和量化,并自动保存压缩后的模型进行后续部署。 + +核心调用方法如下,如需跑通完整的例子可参考本目录下完整样例脚本: + +```python + +trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion) + +trainer.compress() + +``` +压缩 API 可以传入的超参数可参考[文档](../../docs/compression.md)。 + +本项目提供了压缩 API 在分类(包含文本分类、文本匹配、自然语言推理、代词消歧等任务)、序列标注、阅读理解三大场景下的使用样例,可以分别参考 `compress_seq_cls.py` 、`compress_token_cls.py`、`compress_qa.py`,启动方式如下: + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 configs/defalut.yml 配置 +python compress_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models/afqmc --config=configs/default.yml + +# 序列标注任务 +python compress_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models/msra_ner --config=configs/default.yml + +# 阅读理解任务 +python compress_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models/cmrc2018 --config=configs/default.yml + +``` + + + + +### 压缩效果 + + + +#### 精度测试 + +本案例中我们对 ERNIE 3.0-Medium 模型在三类任务上微调后的模型使用压缩 API 进行压缩。压缩后精度如下: + +| Model | AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | MSRA_NER | +| ------------------------------- | ----- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----------- | ----------------- | +| ERNIE 3.0-Medium | 74.87 | 75.35 | 57.45 | 60.18 | 81.16 | 77.19 | 80.59 | 81.93 | 66.95/87.15 | 92.65/93.43/93.04 | +| ERNIE 3.0-Medium+FP16 | 74.87 | 75.32 | 57.45 | 60.22 | 81.16 | 77.22 | 80.59 | 81.90 | 66.95/87.16 | 92.65/93.45/93.05 | +| ERNIE 3.0-Medium+裁剪+FP32 | 74.70 | 75.14 | 57.31 | 60.29 | 81.25 | 77.46 | 79.93 | 81.70 | 65.92/86.43 | 93.10/93.43/93.27 | +| ERNIE 3.0-Medium+裁剪+FP16 | 74.71 | 75.21 | 57.27 | 60.29 | 81.24 | 77.56 | 79.93 | 81.73 | 65.89/86.44 | 93.10/93.43/93.27 | +| ERNIE 3.0-Medium+裁剪+量化+INT8 | 74.44 | 75.02 | 57.26 | 60.37 | 81.03 | 77.25 | 77.96 | 81.67 | 66.17/86.55 | 93.17/93.23/93.20 | +| ERNIE 3.0-Medium+量化+INT8 | 74.10 | 74.67 | 56.99 | 59.91 | 81.03 | 75.05 | 78.62 | 81.60 | 66.32/86.82 | 93.10/92.90/92.70 | + +**评价指标说明:** 其中 CLUE 分类任务(AFQMC 语义相似度、TNEWS 文本分类、IFLYTEK 长文本分类、CMNLI 自然语言推理、OCNLI 自然语言推理、CLUEWSC2020 代词消歧、CSL 论文关键词识别)的评价指标是 Accuracy,阅读理解任务 CLUE CMRC2018 的评价指标是 EM (Exact Match) / F1-Score,计算平均值时取 EM,序列标注任务 MSRA_NER 的评价指标是 Precision/Recall/F1-Score,计算平均值时取 F1-Score。 + +由表可知,`ERNIE 3.0-Medium` 模型经过裁剪和量化后,精度平均下降 0.46,其中裁剪后下降了 0.17,单独量化精度平均下降 0.77。 + + + +#### 性能测试 + +性能测试的配置如下: + +1. 数据集:TNEWS(文本分类)、MSRA_NER(序列标注)、CLUE CMRC2018(阅读理解) + +2. 计算卡:T4、CUDA11.2、CuDNN8.2 + +3. CPU 信息:Intel(R) Xeon(R) Gold 6271C CPU + +4. PaddlePaddle 版本:2.3 + +5. PaddleNLP 版本:2.3 + +6. 性能数据单位是 QPS。QPS 测试方法:固定 batch size 为 32,测试运行时间 total_time,计算 QPS = total_samples / total_time + +7. 
精度数据单位:文本分类是 Accuracy,序列标注是 F1-Score,阅读理解是 EM (Exact Match) + + + +##### CPU 性能 + +测试环境及说明如上,测试 CPU 性能时,线程数设置为12。 + +| | TNEWS 性能 | TNEWS 精度 | MSRA_NER 性能 | MSRA_NER 精度 | CMRC2018 性能 | CMRC2018 精度 | +| -------------------------- | ------------ | ------------ | ------------- | ------------- | ------------- | ------------- | +| ERNIE 3.0-Medium+FP32 | 311.95(1.0X) | 57.45 | 90.91(1.0x) | 93.04 | 33.74(1.0x) | 66.95 | +| ERNIE 3.0-Medium+INT8 | 600.35(1.9x) | 56.57(-0.88) | 141.00(1.6x) | 92.64(-0.40) | 56.51(1.7x) | 66.23(-0.72) | +| ERNIE 3.0-Medium+裁剪+FP32 | 408.65(1.3x) | 57.31(-0.14) | 122.13(1.3x) | 93.27(+0.23) | 48.47(1.4x) | 65.55(-1.40) | +| ERNIE 3.0-Medium+裁剪+INT8 | 704.42(2.3x) | 56.69(-0.76) | 215.58(2.4x) | 92.39(-0.65) | 75.23(2.2x) | 63.47(-3.48) | + + +三类任务(分类、序列标注、阅读理解)经过相同压缩过程后,加速比达到 2.3 左右。 + + + + +##### GPU 性能 + +| | TNEWS 性能 | TNEWS 精度 | MSRA_NER 性能 | MSRA_NER 精度 | CMRC2018 性能 | CMRC2018 精度 | +| -------------------------- | ------------- | ------------ | ------------- | ------------- | ------------- | ------------- | +| ERNIE 3.0-Medium+FP32 | 1123.85(1.0x) | 57.45 | 366.75(1.0x) | 93.04 | 146.84(1.0x) | 66.95 | +| ERNIE 3.0-Medium+FP16 | 2672.41(2.4x) | 57.45(0.00) | 840.11(2.3x) | 93.05(0.01) | 303.43(2.1x) | 66.95(0.00) | +| ERNIE 3.0-Medium+INT8 | 3226.26(2.9x) | 56.99(-0.46) | 889.33(2.4x) | 92.70(-0.34) | 348.84(2.4x) | 66.32(-0.63 | +| ERNIE 3.0-Medium+裁剪+FP32 | 1424.01(1.3x) | 57.31(-0.14) | 454.27(1.2x) | 93.27(+0.23) | 183.77(1.3x) | 65.92(-1.03) | +| ERNIE 3.0-Medium+裁剪+FP16 | 3577.62(3.2x) | 57.27(-0.18) | 1138.77(3.1x) | 93.27(+0.23) | 445.71(3.0x) | 65.89(-1.06) | +| ERNIE 3.0-Medium+裁剪+INT8 | 3635.48(3.2x) | 57.26(-0.19) | 1105.26(3.0x) | 93.20(+0.16) | 444.27(3.0x) | 66.17(-0.78) | + + +三类任务(分类、序列标注、阅读理解)经过裁剪 + 量化后加速比均达到 3 倍左右,所有任务上平均精度损失可控制在 0.5 以内(0.46)。 + + + +### 使用 FastTokenizer 加速 + +FastTokenizer 是飞桨提供的速度领先的文本处理算子库,集成了 Google 于 2021 年底发布的 LinMaxMatch 算法,该算法引入 Aho-Corasick 将 WordPiece 的时间复杂度从 O(N2) 优化到 O(N),已在 Google 搜索业务中大规模上线。FastTokenizer 速度显著领先,且呈现 batch_size 越大,优势越突出。例如,设置 batch_size = 64 时,FastTokenizer 切词速度比 HuggingFace 快 28 倍。 + +在 ERNIE 3.0 轻量级模型裁剪、量化基础上,当设置切词线程数为 4 时,使用 FastTokenizer 在 NVIDIA Tesla T4 环境下在 IFLYTEK (长文本分类数据集,最大序列长度为 128)数据集上性能提升了 2.39 倍,相比 BERT-Base 性能提升了 7.09 倍,在 Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz、线程数为 8 的情况下性能提升了 1.27 倍,相比 BERT-Base 性能提升了 5.13 倍。加速效果如下图所示: + + + + + + +
+ +使用 FastTokenizer 的方式非常简单,在安装 fast_tokenizer 包之后,仅需在 tokenizer 实例化时直接传入 `use_fast=True` 即可。目前已在 Linux 系统下支持 BERT、ERNIE、TinyBERT 等模型。 + +安装 fast_tokenizer 包的命令: + +```shell +pip install fast-tokenizer-python +``` + +如需设置切词线程数,需要调用`fast_tokenizer.set_thread_num`接口进行设置: + +```python +# 设置切词线程数为 4 +import fast_tokenizer +fast_tokenizer.set_thread_num(4) +``` + +调用 `from_pretrained` 时只需轻松传入一个参数 `use_fast=True`: + +```python +from paddlenlp.transformers import AutoTokenizer +AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) +``` + + + +## 部署 + +我们基于 FastDeploy 为 ERNIE 3.0 提供了多种部署方案,可以满足不同场景下的部署需求,请根据实际情况进行选择。 + + + +### FastDeploy 部署 + +⚡️[FastDeploy](https://github.com/PaddlePaddle/FastDeploy)是一款全场景、易用灵活、极致高效的AI推理部署工具,为开发者提供多硬件、多推理引擎后端的部署能力。开发者只需调用一行代码即可随意切换硬件、推理引擎后端。 + +
+(原文此处为 FastDeploy 部署能力概览配图,图片为外链)
+ +目前 ERNIE 3.0 模型已提供基于 FastDeploy 的部署示例,支持在多款硬件(CPU、GPU、昆仑芯、华为昇腾以及 Graphcore IPU)以及推理引擎后端进行部署。具体的适配的硬件以及推理引擎请参考:[FastDeploy 部署指南](./deploy/README.md) + + + +#### Python 部署 + +Python 部署请参考:[Python 部署指南](./deploy/python/README.md) + + + +#### C++ 部署 + +C++ 部署请参考:[C++ 部署指南](./deploy/cpp/README.md) + + + +### 服务化部署 + +- [FastDeploy Serving 高性能服务化部署指南](./deploy/serving/README.md) +- [PaddleNLP SimpleServing 服务化部署指南](./deploy/simple_serving/README.md) + + + + +## Notebook教程 + +- [【快速上手ERNIE 3.0】中文情感分析实战](https://aistudio.baidu.com/aistudio/projectdetail/3955163) + +- [【快速上手ERNIE 3.0】法律文本多标签分类实战](https://aistudio.baidu.com/aistudio/projectdetail/3996601) + +- [【快速上手ERNIE 3.0】中文语义匹配实战](https://aistudio.baidu.com/aistudio/projectdetail/3986803) + +- [【快速上手ERNIE 3.0】MSRA序列标注实战](https://aistudio.baidu.com/aistudio/projectdetail/3989073) + +- [【快速上手ERNIE 3.0】机器阅读理解实战](https://aistudio.baidu.com/aistudio/projectdetail/2017189) + +- [【快速上手ERNIE 3.0】对话意图识别](https://aistudio.baidu.com/aistudio/projectdetail/2017202?contributionType=1) + + +## 参考文献 + +* Sun Y, Wang S, Feng S, et al. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2107.02137, 2021. + +* Su W, Chen X, Feng S, et al. ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression[J]. arXiv preprint arXiv:2106.02241, 2021. + +* Wang S, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2112.12731, 2021. diff --git a/model_zoo/ernie-3.0/compress_qa.py b/model_zoo/ernie-3.0/compress_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..a4cf1722aad1b7172a6b17c5dc88621a49f33805 --- /dev/null +++ b/model_zoo/ernie-3.0/compress_qa.py @@ -0,0 +1,136 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
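+
+# compress_qa.py: compress (e.g. prune and quantize) a fine-tuned ErnieForQuestionAnswering
+# model on a CLUE extractive QA dataset such as cmrc2018 by calling trainer.compress()
+# with CompressionArguments.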
+ +from functools import partial + +import paddle +import paddle.nn.functional as F +from datasets import load_dataset +from utils import ( + DataArguments, + ModelArguments, + QuestionAnsweringTrainer, + load_config, + prepare_train_features, + prepare_validation_features, +) + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction +from paddlenlp.trainer import CompressionArguments, EvalPrediction, PdArgumentParser +from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + + # Load model and data config + model_args, data_args, compression_args = load_config( + model_args.config, "QuestionAnswering", data_args.dataset, model_args, data_args, compression_args + ) + + paddle.set_device(compression_args.device) + data_args.dataset = data_args.dataset.strip() + + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + raw_datasets = load_dataset("clue", data_args.dataset) + + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + + # Define tokenizer, model, loss function. + tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForQuestionAnswering.from_pretrained(model_args.model_name_or_path) + + # Preprocessing the datasets. + # Preprocessing is slighlty different for training and evaluation. + + raw_datasets = load_dataset("clue", data_args.dataset) + column_names = raw_datasets["train"].column_names + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + # Create train feature from dataset + with compression_args.main_process_first(desc="train dataset map pre-processing"): + # Dataset pre-process + train_dataset = raw_datasets["train"] + train_dataset = train_dataset.map( + partial(prepare_train_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + with compression_args.main_process_first(desc="evaluate dataset map pre-processing"): + eval_examples = raw_datasets["validation"] + eval_dataset = eval_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Post-processing: + def post_processing_function(examples, features, predictions, stage="eval"): + # Post-processing: we match the start logits and end logits to answers in the original context. 
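+        # compute_prediction selects the best answer span for each example and also returns
+        # the n-best candidates and null-score differences used by SQuAD-style evaluation.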
+ predictions, all_nbest_json, scores_diff_json = compute_prediction( + examples=examples, + features=features, + predictions=predictions, + n_best_size=data_args.n_best_size, + max_answer_length=data_args.max_answer_length, + null_score_diff_threshold=data_args.null_score_diff_threshold, + ) + + references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples] + return EvalPrediction(predictions=predictions, label_ids=references) + + def criterion(outputs, label): + start_logits, end_logits = outputs + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = F.cross_entropy(input=start_logits, label=start_position) + end_loss = F.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + trainer = QuestionAnsweringTrainer( + model=model, + args=compression_args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + eval_examples=eval_examples, + data_collator=data_collator, + post_process_function=post_processing_function, + tokenizer=tokenizer, + criterion=criterion, + ) + + compression_args.print_config() + + trainer.compress() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/compress_seq_cls.py b/model_zoo/ernie-3.0/compress_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..75e0aa2f0201b0bef5d551aa7ddfcc4178930a05 --- /dev/null +++ b/model_zoo/ernie-3.0/compress_seq_cls.py @@ -0,0 +1,79 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial + +import paddle +from utils import DataArguments, ModelArguments, load_config, seq_convert_example + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import CompressionArguments, PdArgumentParser, Trainer +from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + + # Log model and data config + model_args, data_args, compression_args = load_config( + model_args.config, "SequenceClassification", data_args.dataset, model_args, data_args, compression_args + ) + + paddle.set_device(compression_args.device) + + data_args.dataset = data_args.dataset.strip() + + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + raw_datasets = load_dataset("clue", data_args.dataset) + + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = len(raw_datasets["train"].label_list) + + criterion = paddle.nn.CrossEntropyLoss() + # Defines tokenizer, model, loss function. 
+ tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + # Defines dataset pre-process function + trans_fn = partial( + seq_convert_example, tokenizer=tokenizer, label_list=data_args.label_list, max_seq_len=data_args.max_seq_length + ) + + # Defines data collector + data_collator = DataCollatorWithPadding(tokenizer) + + train_dataset = raw_datasets["train"].map(trans_fn) + eval_dataset = raw_datasets["dev"].map(trans_fn) + + trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion, + ) # Strategy`dynabert` needs arguments `criterion` + + compression_args.print_config() + + trainer.compress() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/compress_token_cls.py b/model_zoo/ernie-3.0/compress_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..3271b3a41dac2f54271fb9a74f080c7d2fd5e312 --- /dev/null +++ b/model_zoo/ernie-3.0/compress_token_cls.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial + +import paddle +import paddle.nn as nn +from utils import DataArguments, ModelArguments, load_config, token_convert_example + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import CompressionArguments, PdArgumentParser, Trainer +from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + + # Log model and data config + model_args, data_args, compression_args = load_config( + model_args.config, "TokenClassification", data_args.dataset, model_args, data_args, compression_args + ) + paddle.set_device(compression_args.device) + + data_args.dataset = data_args.dataset.strip() + + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + raw_datasets = load_dataset(data_args.dataset) + label_list = raw_datasets["train"].label_list + data_args.label_list = label_list + data_args.ignore_label = -100 + + data_args.no_entity_id = 0 + num_classes = len(label_list) + + # Define tokenizer, model, loss function. 
+ tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + class criterion(nn.Layer): + def __init__(self): + super(criterion, self).__init__() + self.loss_fn = nn.CrossEntropyLoss(ignore_index=data_args.ignore_label) + + def forward(self, *args, **kwargs): + return paddle.mean(self.loss_fn(*args, **kwargs)) + + loss_fct = criterion() + + # Define dataset pre-process function + trans_fn = partial( + token_convert_example, + tokenizer=tokenizer, + no_entity_id=data_args.no_entity_id, + max_seq_length=data_args.max_seq_length, + return_length=True, + ) + + # Define data collector + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=data_args.ignore_label) + + # Dataset pre-process + train_dataset = raw_datasets["train"].map(trans_fn) + eval_dataset = raw_datasets["test"].map(trans_fn) + train_dataset.label_list = label_list + train_dataset.ignore_label = data_args.ignore_label + trainer = Trainer( + model=model, + criterion=loss_fct, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + tokenizer=tokenizer, + ) + + compression_args.print_config() + + trainer.compress() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/configs/default.yml b/model_zoo/ernie-3.0/configs/default.yml new file mode 100644 index 0000000000000000000000000000000000000000..33c543f318355061f5dca6336b2105ec89380ea6 --- /dev/null +++ b/model_zoo/ernie-3.0/configs/default.yml @@ -0,0 +1,70 @@ +# Default Args for all dataset +# You can overwrite the configs in each dataset. +DefaultArgs: + learning_rate: 0.00003 + num_train_epochs: 3 + max_seq_length: 128 + weight_decay: 0.01 +# Datasets which used for sequence classfication +SequenceClassification: + afqmc: + num_train_epochs: 1 + learning_rate: 0.00003 + max_seq_length: 128 + per_device_train_batch_size: 16 + per_device_eval_batch_size: 32 + tnews: + num_train_epochs: 3 + learning_rate: 0.00005 + max_seq_length: 128 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 32 + iflytek: + num_train_epochs: 3 + learning_rate: 0.00005 + max_seq_length: 128 + per_device_train_batch_size: 16 + per_device_eval_batch_size: 16 + ocnli: + num_train_epochs: 6 + learning_rate: 0.00005 + max_seq_length: 128 + per_device_train_batch_size: 64 + per_device_eval_batch_size: 64 + cmnli: + num_train_epochs: 4 + learning_rate: 0.00002 + max_seq_length: 128 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 32 + cluewsc2020: + num_train_epochs: 50 + learning_rate: 0.00003 + max_seq_length: 128 + per_device_train_batch_size: 16 + per_device_eval_batch_size: 16 + csl: + num_train_epochs: 8 + learning_rate: 0.00005 + max_seq_length: 256 + per_device_train_batch_size: 64 + per_device_eval_batch_size: 64 + +# Datasets which used for token classfication +TokenClassification: + msra_ner: + learning_rate: 0.00005 + max_seq_length: 128 + num_train_epochs: 1 + per_device_train_batch_size: 8 + per_device_eval_batch_size: 16 + +# Datasets which used for question answersing +QuestionAnswering: + cmrc2018: + learning_rate: 0.00005 + max_seq_length: 512 + num_train_epochs: 1 + per_device_train_batch_size: 8 + per_device_eval_batch_size: 12 + diff --git a/model_zoo/ernie-3.0/deploy/README.md b/model_zoo/ernie-3.0/deploy/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c08521440c3e3dfc64c0cbe47ab478d918e476fd --- 
/dev/null +++ b/model_zoo/ernie-3.0/deploy/README.md @@ -0,0 +1,119 @@ +# FastDeploy ERNIE 3.0 模型高性能部署 + +**⚡️FastDeploy** 是一款**全场景**、**易用灵活**、**极致高效**的 AI 推理部署工具,满足开发者**多硬件、多平台**的产业部署需求。开发者可以基于 FastDeploy 将训练好的预测模型在不同的硬件、不同的推理引擎后端上进行部署。目前 FastDeploy 提供多种编程语言的 SDK,包括 C++、Python 以及 Java SDK。 + +在部署 ERNIE 3.0 模型前,需要安装 FastDeploy SDK,可参考 [FastDeploy SDK安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)确认部署环境是否满足 FastDeploy 环境要求,并按照介绍安装相应的 SDK。 + +目前,ERNIE 3.0 模型支持在如下的硬件以及推理引擎进行部署。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 可用的推理引擎 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|--------------------------------|--------------------|
+| CPU | Paddle Inference | ✅ | N/A |
+| | ONNX Runtime | ✅ | N/A |
+| | OpenVINO | ✅ | N/A |
+| GPU | Paddle Inference | ✅ | N/A |
+| | ONNX Runtime | ✅ | ✅ |
+| | Paddle TensorRT | ✅ | ✅ |
+| | TensorRT | ✅ | ✅ |
+| 昆仑芯 XPU | Paddle Lite | ✅ | N/A |
+| 华为 昇腾 | Paddle Lite | ❔ | ❔ |
+| Graphcore IPU | Paddle Inference | N/A | ❔ |
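+
+上表中的硬件与推理引擎组合,在 Python 端均通过 `RuntimeOption` 进行选择。下面给出一个最小的示意片段(仅作说明,接口名称取自本目录 `python/seq_cls_infer.py`,模型路径为假设的示例值,请按实际情况替换;其余硬件的接口可参考下方 Python/C++ 部署文档):
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+# 1. 选择硬件:GPU;在 CPU 上部署时可改为 option.use_cpu()
+option.use_gpu()
+# 2. 选择该硬件支持的推理引擎后端,需与上表对应
+option.use_paddle_infer_backend()   # 也可选 use_ort_backend()、use_trt_backend() 等
+# 3. 指定导出的部署模型(示例路径)
+option.set_model_path("best_models/afqmc/export/model.pdmodel",
+                      "best_models/afqmc/export/model.pdiparams")
+# 4. 创建 Runtime 即可进行推理
+runtime = fd.Runtime(option)
+```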
+ +## 支持的NLP任务列表 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 任务 Task | 部署方式 | 是否支持 |
+|-----------|----------|----------|
+| 文本分类 | Python | ✅ |
+| | C++ | ✅ |
+| | Serving | ✅ |
+| 序列标注 | Python | ✅ |
+| | C++ | ✅ |
+| | Serving | ✅ |
+ +## 详细部署文档 + +ERNIE 3.0 模型支持 Python、C++ 部署以及 Serving 服务化部署。以下是详细文档。 + +- [Python 部署](python/README.md) +- [C++ 部署](cpp/README.md) +- [Serving 部署](serving/README.md) diff --git a/model_zoo/ernie-3.0/deploy/cpp/CMakeLists.txt b/model_zoo/ernie-3.0/deploy/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..db2b4305b58688e2a65394ddf3ffbc3d705fd09b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/CMakeLists.txt @@ -0,0 +1,35 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +PROJECT(infer_demo C CXX) +CMAKE_MINIMUM_REQUIRED (VERSION 3.10) + +option(FASTDEPLOY_INSTALL_DIR "Path of downloaded fastdeploy sdk.") +include(${FASTDEPLOY_INSTALL_DIR}/FastDeploy.cmake) +include(${FASTDEPLOY_INSTALL_DIR}/utils/gflags.cmake) + +include_directories(${FASTDEPLOY_INCS}) + +add_executable(seq_cls_infer_demo ${PROJECT_SOURCE_DIR}/seq_cls_infer.cc) +add_executable(token_cls_infer_demo ${PROJECT_SOURCE_DIR}/token_cls_infer.cc) +add_dependencies(seq_cls_infer_demo gflags) +add_dependencies(token_cls_infer_demo gflags) + +if(UNIX AND (NOT APPLE) AND (NOT ANDROID)) + target_link_libraries(seq_cls_infer_demo ${FASTDEPLOY_LIBS} gflags pthread) + target_link_libraries(token_cls_infer_demo ${FASTDEPLOY_LIBS} gflags pthread) +else() + target_link_libraries(seq_cls_infer_demo ${FASTDEPLOY_LIBS} gflags) + target_link_libraries(token_cls_infer_demo ${FASTDEPLOY_LIBS} gflags) +endif() diff --git a/model_zoo/ernie-3.0/deploy/cpp/README.md b/model_zoo/ernie-3.0/deploy/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..39c098c95f36e8523cdcab1192cd79e42a71b5fb --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/README.md @@ -0,0 +1,272 @@ +# FastDeploy ERNIE 3.0 模型 C++ 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy C++ SDK。 + +本目录下分别提供 `seq_cls_infer.cc` 以及 `token_cls_infer.cc` 快速完成在 CPU/GPU 的文本分类任务以及序列标注任务的 C++ 部署示例。 + + +## 文本分类任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的 [AFQMC 数据集](https://github.com/CLUEbenchmark/CLUE)上进行文本分类任务的 C++ 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端。示例中的模型是 ERNIE 3.0 在 `AFQMC 数据集`微调后导出得到的部署模型。 + +```bash +mkdir build +cd build +# 下载FastDeploy预编译库,用户可在上文提到的`FastDeploy SDK安装文档`中自行选择合适的版本使用 +wget https://bj.bcebos.com/fastdeploy/release/cpp/fastdeploy-linux-x64-x.x.x.tgz +tar xvf fastdeploy-linux-x64-x.x.x.tgz +cmake .. 
-DFASTDEPLOY_INSTALL_DIR=${PWD}/fastdeploy-linux-x64-x.x.x +make -j + +# CPU 推理 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/export/ --device cpu --backend paddle + +# GPU 推理 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/export/ --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: +```bash +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/seq_cls_infer.cc(103)::CreateRuntimeOption model_path = ../../../best_models/afqmc/export/model.pdmodel, param_path = ../../../best_models/afqmc/export/model.pdiparams +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +input data: 花呗收款额度限制, 收钱码,对花呗支付的金额有限制吗 +seq cls result: +label: Similar confidence: 0.509855 +----------------------------- +input data: 花呗支持高铁票支付吗, 为什么友付宝不支持花呗付款 +seq cls result: +label: Similar confidence: 0.986198 +----------------------------- +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在 GPU 上使用 tensorrt 后端运行量化模型,模型目录可按照实际模型路径设置 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在 CPU 上使用paddle_inference后端,模型目录可按照实际模型路径设置 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/seq_cls_infer.cc(67)::CreateRuntimeOption model_path = ../../../best_models/afqmc/width_mult_0.75/mse16_1/int8.pdmodel, param_path = ../../../best_models/afqmc/width_mult_0.75/mse16_1/int8.pdmodel +[INFO] fastdeploy/runtime.cc(596)::Init Runtime initialized with Backend::TRT in Device::GPU. +input data: 花呗收款额度限制, 收钱码,对花呗支付的金额有限制吗 +seq cls result: +label: Similar confidence: 0.5259 +----------------------------- +input data: 花呗支持高铁票支付吗, 为什么友付宝不支持花呗付款 +seq cls result: +label: Similar confidence: 0.985738 +----------------------------- +``` + +### 参数说明 + +`seq_cls_infer_demo` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录 | +|--vocab_path| 指定的模型词表路径 | +|--batch_size |最大可测的 batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'kunlunxin', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--model_prefix| 模型文件前缀。前缀会分别与'.pdmodel'和'.pdiparams'拼接得到模型文件名和参数文件名。默认为 'model'| + +## 序列标注任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的 [MSRA_NER 数据集](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra)上进行序列标注任务的 C++ 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端。 + +```bash +mkdir build +cd build +# 下载FastDeploy预编译库,用户可在上文提到的`FastDeploy预编译库`中自行选择合适的版本使用 +wget https://bj.bcebos.com/fastdeploy/release/cpp/fastdeploy-linux-x64-x.x.x.tgz +tar xvf fastdeploy-linux-x64-x.x.x.tgz +cmake .. 
-DFASTDEPLOY_INSTALL_DIR=${PWD}/fastdeploy-linux-x64-x.x.x +make -j + +# CPU 推理 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/export --device cpu --backend paddle + +# GPU 推理 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/export --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/token_cls_infer.cc(103)::CreateRuntimeOption model_path = ../../../best_models/msra_ner/export/model.pdmodel, param_path = ../../../best_models/msra_ner/export/model.pdiparams +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京, label: LOC, pos: [0, 1] +entity: 重庆, label: LOC, pos: [6, 7] +entity: 成都, label: LOC, pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹, label: PER, pos: [0, 1] +entity: 科比, label: PER, pos: [3, 4] +entity: 詹姆斯, label: PER, pos: [6, 8] +entity: 姚明, label: PER, pos: [10, 11] +----------------------------- + +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在 GPU 上使用 tensorrt 后端运行量化模型,模型目录可按照实际模型路径设置 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在 CPU 上使用paddle_inference后端,模型目录可按照实际模型路径设置 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/token_cls_infer.cc(103)::CreateRuntimeOption model_path = ../../../best_models/msra_ner/export/model.pdmodel, param_path = ../../../best_models/msra_ner/export/model.pdiparams +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. 
+input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京, label: LOC, pos: [0, 1] +entity: 重庆, label: LOC, pos: [6, 7] +entity: 成都, label: LOC, pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹, label: PER, pos: [0, 1] +entity: 科比, label: PER, pos: [3, 4] +entity: 詹姆斯, label: PER, pos: [6, 8] +entity: 姚明, label: PER, pos: [10, 11] +----------------------------- +``` + +### 参数说明 + +`token_cls_infer_demo` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |最大可测的 batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | + +## FastDeploy 高阶用法 + +FastDeploy 在 C++ 端上,提供 `fastdeploy::RuntimeOption::UseXXX()` 以及 `fastdeploy::RuntimeOption::UseXXXBackend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE 3.0 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE 3.0 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|----------------|--------------------|--------------------------------|--------------------|
+| CPU | UseCpu() | Paddle Inference | UsePaddleInferBackend() | ✅ | N/A |
+| | | ONNX Runtime | UseOrtBackend() | ✅ | N/A |
+| | | OpenVINO | UseOpenvinoBackend() | ✅ | N/A |
+| GPU | UseGpu() | Paddle Inference | UsePaddleInferBackend() | ✅ | N/A |
+| | | ONNX Runtime | UseOrtBackend() | ✅ | ✅ |
+| | | Paddle TensorRT | UseTrtBackend() + EnablePaddleToTrt() | ✅ | ✅ |
+| | | TensorRT | UseTrtBackend() | ✅ | ✅ |
+| 昆仑芯 XPU | UseKunlunXin() | Paddle Lite | UsePaddleLiteBackend() | ✅ | N/A |
+| 华为 昇腾 | UseAscend() | Paddle Lite | UsePaddleLiteBackend() | ❔ | ❔ |
+| Graphcore IPU | UseIpu() | Paddle Inference | UsePaddleInferBackend() | N/A | ❔ |
+ +## 相关文档 + +[ERNIE 3.0模型详细介绍](../../README.md) + +[ERNIE 3.0模型Python部署方法](../python/README.md) diff --git a/model_zoo/ernie-3.0/deploy/cpp/seq_cls_infer.cc b/model_zoo/ernie-3.0/deploy/cpp/seq_cls_infer.cc new file mode 100644 index 0000000000000000000000000000000000000000..a33337d9694441d614260dd3834cbf04d0028ee5 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/seq_cls_infer.cc @@ -0,0 +1,293 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#include +#include +#include +#include + +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +#include "fastdeploy/function/reduce.h" +#include "fastdeploy/function/softmax.h" +#include "fastdeploy/runtime.h" +#include "fastdeploy/utils/path.h" +#include "gflags/gflags.h" + +using namespace paddlenlp; +using namespace fast_tokenizer::tokenizers_impl; +#ifdef WIN32 +const char sep = '\\'; +#else +const char sep = '/'; +#endif + +DEFINE_string(model_dir, "", "Directory of the inference model."); +DEFINE_string(vocab_path, "", "Path of the vocab file."); +DEFINE_string(model_prefix, "model", "The model and params file prefix."); +DEFINE_string(device, + "cpu", + "Type of inference device, support 'cpu', 'kunlunxin' or 'gpu'."); +DEFINE_string(backend, + "paddle", + "The inference runtime backend, support: ['onnx_runtime', " + "'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt']"); +DEFINE_int32(batch_size, 1, "The batch size of data."); +DEFINE_int32(max_length, 128, "The batch size of data."); +DEFINE_bool(use_fp16, false, "Wheter to use FP16 mode."); + +void PrintUsage() { + fastdeploy::FDINFO + << "Usage: seq_cls_infer_demo --model_dir dir --device [cpu|gpu] " + "--backend " + "[onnx_runtime|paddle|openvino|tensorrt|paddle_tensorrt] " + "--batch_size size --max_length len --use_fp16 false" + << std::endl; + fastdeploy::FDINFO << "Default value of device: cpu" << std::endl; + fastdeploy::FDINFO << "Default value of backend: onnx_runtime" << std::endl; + fastdeploy::FDINFO << "Default value of batch_size: 1" << std::endl; + fastdeploy::FDINFO << "Default value of max_length: 128" << std::endl; + fastdeploy::FDINFO << "Default value of use_fp16: false" << std::endl; +} + +bool CreateRuntimeOption(fastdeploy::RuntimeOption* option) { + std::string model_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdmodel"; + std::string param_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdiparams"; + fastdeploy::FDINFO << "model_path = " << model_path + << ", param_path = " << param_path << std::endl; + option->SetModelPath(model_path, param_path); + if (FLAGS_device == "kunlunxin") { + option->UseKunlunXin(); + return true; + } else if (FLAGS_device == "gpu") { + option->UseGpu(); + } else if (FLAGS_device == "cpu") { + option->UseCpu(); + } else { + fastdeploy::FDERROR << "The avilable device should be one of the list " + "['cpu', 'gpu', 'kunlunxin']. 
But receive '" + << FLAGS_device << "'" << std::endl; + return false; + } + + if (FLAGS_backend == "onnx_runtime") { + option->UseOrtBackend(); + } else if (FLAGS_backend == "paddle") { + option->UsePaddleInferBackend(); + } else if (FLAGS_backend == "openvino") { + option->UseOpenVINOBackend(); + } else if (FLAGS_backend == "tensorrt" || + FLAGS_backend == "paddle_tensorrt") { + option->UseTrtBackend(); + if (FLAGS_backend == "paddle_tensorrt") { + option->EnablePaddleToTrt(); + option->EnablePaddleTrtCollectShape(); + } + std::string trt_file = FLAGS_model_dir + sep + "infer.trt"; + option->SetTrtInputShape("input_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + option->SetTrtInputShape("token_type_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + if (FLAGS_use_fp16) { + option->EnableTrtFP16(); + trt_file = trt_file + ".fp16"; + } + } else { + fastdeploy::FDERROR << "The avilable backend should be one of the list " + "['paddle', 'openvino', 'tensorrt', " + "'paddle_tensorrt']. But receive '" + << FLAGS_backend << "'" << std::endl; + return false; + } + return true; +} + +bool BatchFyTexts(const std::vector& texts, + int batch_size, + std::vector>* batch_texts) { + for (int idx = 0; idx < texts.size(); idx += batch_size) { + int rest = texts.size() - idx; + int curr_size = std::min(batch_size, rest); + std::vector batch_text(curr_size); + std::copy_n(texts.begin() + idx, curr_size, batch_text.begin()); + batch_texts->emplace_back(std::move(batch_text)); + } + return true; +} + +struct SeqClsResult { + int label; + float confidence; +}; + +struct ErnieForSequenceClassificationPredictor { + fastdeploy::Runtime runtime_; + ErnieFastTokenizer tokenizer_; + ErnieForSequenceClassificationPredictor( + const fastdeploy::RuntimeOption& option, + const ErnieFastTokenizer& tokenizer) + : tokenizer_(tokenizer) { + runtime_.Init(option); + } + + bool Preprocess(const std::vector& texts, + const std::vector& texts_pair, + std::vector* inputs) { + std::vector encodings; + std::vector text_pair_input; + // 1. Tokenize the text or (text, text_pair) + if (texts_pair.empty()) { + for (int i = 0; i < texts.size(); ++i) { + text_pair_input.emplace_back(texts[i]); + } + } else { + if (texts.size() != texts_pair.size()) { + return false; + } + for (int i = 0; i < texts.size(); ++i) { + text_pair_input.emplace_back( + std::pair(texts[i], texts_pair[i])); + } + } + tokenizer_.EncodeBatchStrings(text_pair_input, &encodings); + // 2. 
Construct the input vector tensor + // 2.1 Allocate input tensor + int64_t batch_size = texts.size(); + int64_t seq_len = 0; + if (batch_size > 0) { + seq_len = encodings[0].GetIds().size(); + } + inputs->resize(runtime_.NumInputs()); + for (int i = 0; i < runtime_.NumInputs(); ++i) { + (*inputs)[i].Allocate({batch_size, seq_len}, + fastdeploy::FDDataType::INT64, + runtime_.GetInputInfo(i).name); + } + // 2.2 Set the value of data + size_t start = 0; + int64_t* input_ids_ptr = + reinterpret_cast((*inputs)[0].MutableData()); + int64_t* type_ids_ptr = + reinterpret_cast((*inputs)[1].MutableData()); + for (int i = 0; i < encodings.size(); ++i) { + auto&& curr_input_ids = encodings[i].GetIds(); + auto&& curr_type_ids = encodings[i].GetTypeIds(); + std::copy( + curr_input_ids.begin(), curr_input_ids.end(), input_ids_ptr + start); + std::copy( + curr_type_ids.begin(), curr_type_ids.end(), type_ids_ptr + start); + start += seq_len; + } + return true; + } + + bool Postprocess(const std::vector& outputs, + std::vector* seq_cls_results) { + const auto& logits = outputs[0]; + fastdeploy::FDTensor probs; + fastdeploy::function::Softmax(logits, &probs); + + fastdeploy::FDTensor labels, confidences; + fastdeploy::function::Max(probs, &confidences, {-1}); + fastdeploy::function::ArgMax(probs, &labels, -1); + if (labels.Numel() != confidences.Numel()) { + return false; + } + + seq_cls_results->resize(labels.Numel()); + int64_t* label_ptr = reinterpret_cast(labels.Data()); + float* confidence_ptr = reinterpret_cast(confidences.Data()); + for (int i = 0; i < labels.Numel(); ++i) { + (*seq_cls_results)[i].label = label_ptr[i]; + (*seq_cls_results)[i].confidence = confidence_ptr[i]; + } + return true; + } + + bool Predict(const std::vector& texts, + const std::vector& texts_pair, + std::vector* seq_cls_results) { + std::vector inputs; + if (!Preprocess(texts, texts_pair, &inputs)) { + return false; + } + + std::vector outputs(runtime_.NumOutputs()); + runtime_.Infer(inputs, &outputs); + + if (!Postprocess(outputs, seq_cls_results)) { + return false; + } + return true; + } +}; + +void PrintResult(const std::vector& seq_cls_results, + const std::vector& data, + const std::vector& data_pair) { + static std::vector label_list{"Similar", "Not similar"}; + for (int i = 0; i < data.size(); ++i) { + std::cout << "input data: " << data[i] << ", " << data_pair[i] << std::endl; + std::cout << "seq cls result: " << std::endl; + std::cout << "label: " << label_list[seq_cls_results[i].label] + << " confidence: " << seq_cls_results[i].confidence << std::endl; + std::cout << "-----------------------------" << std::endl; + } +} + +int main(int argc, char* argv[]) { + google::ParseCommandLineFlags(&argc, &argv, true); + auto option = fastdeploy::RuntimeOption(); + if (!CreateRuntimeOption(&option)) { + PrintUsage(); + return -1; + } + + std::string vocab_path = FLAGS_vocab_path; + if (!fastdeploy::CheckFileExists(vocab_path)) { + vocab_path = fastdeploy::PathJoin(FLAGS_model_dir, "vocab.txt"); + if (!fastdeploy::CheckFileExists(vocab_path)) { + fastdeploy::FDERROR << "The path of vocab " << vocab_path + << " doesn't exist" << std::endl; + PrintUsage(); + return -1; + } + } + ErnieFastTokenizer tokenizer(vocab_path); + tokenizer.EnableTruncMethod( + FLAGS_max_length, + 0, + fast_tokenizer::core::Direction::RIGHT, + fast_tokenizer::core::TruncStrategy::LONGEST_FIRST); + + ErnieForSequenceClassificationPredictor predictor(option, tokenizer); + + std::vector seq_cls_results; + std::vector texts_ds = {"花呗收款额度限制", + "花呗支持高铁票支付吗"}; + 
std::vector texts_pair_ds = {"收钱码,对花呗支付的金额有限制吗", + "为什么友付宝不支持花呗付款"}; + std::vector> batch_texts, batch_texts_pair; + BatchFyTexts(texts_ds, FLAGS_batch_size, &batch_texts); + BatchFyTexts(texts_pair_ds, FLAGS_batch_size, &batch_texts_pair); + for (int bs = 0; bs < batch_texts.size(); ++bs) { + predictor.Predict(batch_texts[bs], batch_texts_pair[bs], &seq_cls_results); + PrintResult(seq_cls_results, batch_texts[bs], batch_texts_pair[bs]); + } + return 0; +} diff --git a/model_zoo/ernie-3.0/deploy/cpp/token_cls_infer.cc b/model_zoo/ernie-3.0/deploy/cpp/token_cls_infer.cc new file mode 100644 index 0000000000000000000000000000000000000000..5395be7e2f55196cc45afdec6b17bba572f22a2b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/token_cls_infer.cc @@ -0,0 +1,320 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#include +#include +#include +#include + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +#include "fastdeploy/function/functions.h" +#include "fastdeploy/runtime.h" +#include "fastdeploy/utils/path.h" +#include "gflags/gflags.h" + +using namespace paddlenlp; +using namespace fast_tokenizer::tokenizers_impl; +#ifdef WIN32 +const char sep = '\\'; +#else +const char sep = '/'; +#endif + +DEFINE_string(model_dir, "", "Directory of the inference model."); +DEFINE_string(vocab_path, "", "Path of the vocab file."); +DEFINE_string(model_prefix, "model", "The model and params file prefix."); +DEFINE_string(device, + "cpu", + "Type of inference device, support 'cpu' or 'gpu'."); +DEFINE_string(backend, + "paddle", + "The inference runtime backend, support: ['onnx_runtime', " + "'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt']"); +DEFINE_int32(batch_size, 1, "The batch size of data."); +DEFINE_int32(max_length, 128, "The batch size of data."); +DEFINE_bool(use_fp16, false, "Wheter to use FP16 mode."); + +void PrintUsage() { + fastdeploy::FDINFO + << "Usage: seq_cls_infer_demo --model_dir dir --device [cpu|gpu] " + "--backend " + "[onnx_runtime|paddle|openvino|tensorrt|paddle_tensorrt] " + "--batch_size size --max_length len --use_fp16 false" + << std::endl; + fastdeploy::FDINFO << "Default value of device: cpu" << std::endl; + fastdeploy::FDINFO << "Default value of backend: onnx_runtime" << std::endl; + fastdeploy::FDINFO << "Default value of batch_size: 1" << std::endl; + fastdeploy::FDINFO << "Default value of max_length: 128" << std::endl; + fastdeploy::FDINFO << "Default value of use_fp16: false" << std::endl; +} + +bool CreateRuntimeOption(fastdeploy::RuntimeOption* option) { + std::string model_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdmodel"; + std::string param_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdiparams"; + fastdeploy::FDINFO << "model_path = " << model_path + << ", param_path = " << param_path << std::endl; + option->SetModelPath(model_path, param_path); + + if (FLAGS_device == "gpu") { + 
option->UseGpu(); + } else if (FLAGS_device == "cpu") { + option->UseCpu(); + } else if (FLAGS_device == "kunlunxin") { + option->UseKunlunXin(); + return true; + } else { + fastdeploy::FDERROR << "The avilable device should be one of the list " + "['cpu', 'gpu', 'kunlunxin']. But receive '" + << FLAGS_device << "'" << std::endl; + return false; + } + + if (FLAGS_backend == "onnx_runtime") { + option->UseOrtBackend(); + } else if (FLAGS_backend == "paddle") { + option->UsePaddleInferBackend(); + } else if (FLAGS_backend == "openvino") { + option->UseOpenVINOBackend(); + } else if (FLAGS_backend == "tensorrt" || + FLAGS_backend == "paddle_tensorrt") { + option->UseTrtBackend(); + if (FLAGS_backend == "paddle_tensorrt") { + option->EnablePaddleToTrt(); + option->EnablePaddleTrtCollectShape(); + } + std::string trt_file = FLAGS_model_dir + sep + "infer.trt"; + option->SetTrtInputShape("input_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + option->SetTrtInputShape("token_type_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + if (FLAGS_use_fp16) { + option->EnableTrtFP16(); + trt_file = trt_file + ".fp16"; + } + } else { + fastdeploy::FDERROR << "The avilable backend should be one of the list " + "['paddle', 'openvino', 'tensorrt', " + "'paddle_tensorrt']. But receive '" + << FLAGS_backend << "'" << std::endl; + return false; + } + + return true; +} + +bool BatchFyTexts(const std::vector& texts, + int batch_size, + std::vector>* batch_texts) { + for (int idx = 0; idx < texts.size(); idx += batch_size) { + int rest = texts.size() - idx; + int curr_size = std::min(batch_size, rest); + std::vector batch_text(curr_size); + std::copy_n(texts.begin() + idx, curr_size, batch_text.begin()); + batch_texts->emplace_back(std::move(batch_text)); + } + return true; +} + +struct TokenClsResult { + struct TokenResult { + std::string token_label; + std::string entity; + std::pair pos; + friend std::ostream& operator<<(std::ostream& os, + const TokenResult& result); + }; + std::vector token_results; +}; + +std::ostream& operator<<(std::ostream& os, + const typename TokenClsResult::TokenResult& result) { + os << "entity: " << result.entity << ", label: " << result.token_label + << ", pos: [" << result.pos.first << ", " << result.pos.second << "]"; + return os; +} + +void PrintResult(const std::vector& token_cls_results, + const std::vector& data) { + for (int i = 0; i < data.size(); ++i) { + std::cout << "input data: " << data[i] << std::endl; + std::cout << "The model detects all entities:" << std::endl; + auto& curr_results = token_cls_results[i]; + for (auto& token_result : curr_results.token_results) { + std::cout << token_result << std::endl; + } + std::cout << "-----------------------------" << std::endl; + } +} + +struct ErnieForTokenClassificationPredictor { + fastdeploy::Runtime runtime_; + ErnieFastTokenizer tokenizer_; + std::vector label_list_; + + ErnieForTokenClassificationPredictor( + const fastdeploy::RuntimeOption& option, + const ErnieFastTokenizer& tokenizer, + const std::vector& label_list) + : tokenizer_(tokenizer), label_list_(label_list) { + runtime_.Init(option); + } + + bool Preprocess(const std::vector& texts, + std::vector* inputs) { + std::vector encodings; + std::vector text_pair_input; + // 1. 
Tokenize the text or (text, text_pair) + for (int i = 0; i < texts.size(); ++i) { + text_pair_input.emplace_back(texts[i]); + } + tokenizer_.EncodeBatchStrings(text_pair_input, &encodings); + // 2. Construct the input vector tensor + // 2.1 Allocate input tensor + int64_t batch_size = texts.size(); + int64_t seq_len = 0; + if (batch_size > 0) { + seq_len = encodings[0].GetIds().size(); + } + inputs->resize(runtime_.NumInputs()); + for (int i = 0; i < runtime_.NumInputs(); ++i) { + (*inputs)[i].Allocate({batch_size, seq_len}, + fastdeploy::FDDataType::INT64, + runtime_.GetInputInfo(i).name); + } + // 2.2 Set the value of data + size_t start = 0; + int64_t* input_ids_ptr = + reinterpret_cast((*inputs)[0].MutableData()); + int64_t* type_ids_ptr = + reinterpret_cast((*inputs)[1].MutableData()); + for (int i = 0; i < encodings.size(); ++i) { + auto&& curr_input_ids = encodings[i].GetIds(); + auto&& curr_type_ids = encodings[i].GetTypeIds(); + std::copy( + curr_input_ids.begin(), curr_input_ids.end(), input_ids_ptr + start); + std::copy( + curr_type_ids.begin(), curr_type_ids.end(), type_ids_ptr + start); + start += seq_len; + } + return true; + } + + bool Postprocess(const std::vector& outputs, + const std::vector& texts, + std::vector* results) { + fastdeploy::FDTensor batch_preds; + auto& logits = outputs[0]; + fastdeploy::function::ArgMax(logits, &batch_preds, -1); + for (int i = 0; i < results->size(); ++i) { + fastdeploy::FDTensor preds; + fastdeploy::function::Slice(batch_preds, {0}, {i}, &preds); + int start = -1; + std::string label_name = ""; + std::vector items; + + int seq_len = preds.Shape()[0]; + + fast_tokenizer::pretokenizers::CharToBytesOffsetConverter convertor( + texts[i]); + fast_tokenizer::core::Offset curr_offset; + for (int j = 0; j < seq_len; ++j) { + int64_t label_id = (reinterpret_cast(preds.Data()))[j]; + const std::string& curr_label = label_list_[label_id]; + if ((curr_label == "O" || curr_label.find("B-") != std::string::npos) && + start >= 0) { + // Convert the unicode character offset to byte offset. 
+ convertor.convert({start, j - 1}, &curr_offset); + if (curr_offset.first >= texts[i].length()) { + break; + } + items.emplace_back(typename TokenClsResult::TokenResult{ + label_name, + texts[i].substr(curr_offset.first, + curr_offset.second - curr_offset.first), + {start, j - 2}}); + start = -1; + } + if (curr_label.find("B-") != std::string::npos) { + start = j - 1; + label_name = curr_label.substr(2); + } + } + (*results)[i].token_results = std::move(items); + } + return true; + } + bool Predict(const std::vector& texts, + std::vector* results) { + std::vector inputs; + if (!Preprocess(texts, &inputs)) { + return false; + } + + std::vector outputs(runtime_.NumOutputs()); + runtime_.Infer(inputs, &outputs); + results->resize(texts.size()); + if (!Postprocess(outputs, texts, results)) { + return false; + } + return true; + } +}; + +int main(int argc, char* argv[]) { + google::ParseCommandLineFlags(&argc, &argv, true); + auto option = fastdeploy::RuntimeOption(); + if (!CreateRuntimeOption(&option)) { + PrintUsage(); + return -1; + } + + std::string vocab_path = FLAGS_vocab_path; + if (!fastdeploy::CheckFileExists(vocab_path)) { + vocab_path = fastdeploy::PathJoin(FLAGS_model_dir, "vocab.txt"); + if (!fastdeploy::CheckFileExists(vocab_path)) { + fastdeploy::FDERROR << "The path of vocab " << vocab_path + << " doesn't exist" << std::endl; + PrintUsage(); + return -1; + } + } + uint32_t max_length = FLAGS_max_length; + ErnieFastTokenizer tokenizer(vocab_path); + tokenizer.EnableTruncMethod( + max_length, + 0, + fast_tokenizer::core::Direction::RIGHT, + fast_tokenizer::core::TruncStrategy::LONGEST_FIRST); + + std::vector label_list = { + "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"}; + ErnieForTokenClassificationPredictor predictor(option, tokenizer, label_list); + std::vector token_cls_results; + std::vector texts_ds = { + "北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", + "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"}; + std::vector> batch_texts; + BatchFyTexts(texts_ds, FLAGS_batch_size, &batch_texts); + for (int bs = 0; bs < batch_texts.size(); ++bs) { + predictor.Predict(batch_texts[bs], &token_cls_results); + PrintResult(token_cls_results, batch_texts[bs]); + } + return 0; +} \ No newline at end of file diff --git a/model_zoo/ernie-3.0/deploy/python/README.md b/model_zoo/ernie-3.0/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..da1576a1fd65da0633fdaedc11b59708fabf99d5 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/README.md @@ -0,0 +1,260 @@ +# FastDeploy ERNIE 3.0 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下分别提供 `seq_cls_infer.py` 以及 `token_cls_infer.py` 快速完成在 CPU/GPU 的文本分类任务以及序列标注任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash + +# 安装fast_tokenizer以及GPU版本fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html + +``` + +## 文本分类任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的 [AFQMC 数据集](https://github.com/CLUEbenchmark/CLUE)上进行文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-3.0/best_models/afqmc/export`(用户可按实际情况设置)。 + +```bash + +# CPU 推理 +python seq_cls_infer.py --model_dir ../../best_models/afqmc/export --device cpu --backend paddle + +# GPU 推理 
+python seq_cls_infer.py --model_dir ../../best_models/afqmc/export --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] fastdeploy/runtime.cc(596)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +Batch id:0, example id:0, sentence1:花呗收款额度限制, sentence2:收钱码,对花呗支付的金额有限制吗, label:0, similarity:0.5099 +Batch id:1, example id:0, sentence1:花呗支持高铁票支付吗, sentence2:为什么友付宝不支持花呗付款, label:0, similarity:0.9862 + +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在GPU上使用 tensorrt 后端,模型目录可按照实际模型路径设置 +python seq_cls_infer.py --model_dir ../../best_models/afqmc/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在CPU上使用paddle_inference后端,模型目录可按照实际模型路径设置 +python seq_cls_infer.py --model_dir ../../best_models/afqmc/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash +[INFO] fastdeploy/runtime/runtime.cc(101)::Init Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id:0, example id:0, sentence1:花呗收款额度限制, sentence2:收钱码,对花呗支付的金额有限制吗, label:0, similarity:0.5224 +Batch id:1, example id:0, sentence1:花呗支持高铁票支付吗, sentence2:为什么友付宝不支持花呗付款, label:0, similarity:0.9856 +``` + + +### 参数说明 + +`seq_cls_infer.py` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--use_fast| 是否使用FastTokenizer加速分词阶段。默认为True| + +## 序列标注任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的[ MSRA_NER 数据集](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra)上进行序列标注任务的Python预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-3.0/best_models/msra_ner/export`(用户可按实际情况设置)。 + + +```bash + +# CPU 推理 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/export/ --device cpu --backend paddle + +# GPU 推理 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/export/ --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. 
+input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京 label: LOC pos: [0, 1] +entity: 重庆 label: LOC pos: [6, 7] +entity: 成都 label: LOC pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹 label: PER pos: [0, 1] +entity: 科比 label: PER pos: [3, 4] +entity: 詹姆斯 label: PER pos: [6, 8] +entity: 姚明 label: PER pos: [10, 11] +----------------------------- + +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在GPU上使用 tensorrt 后端,模型目录可按照实际模型路径设置 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在CPU上使用paddle_inference后端,模型目录可按照实际模型路径设置 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京 label: LOC pos: [0, 1] +entity: 重庆 label: LOC pos: [6, 7] +entity: 成都 label: LOC pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹 label: PER pos: [0, 1] +entity: 科比 label: PER pos: [3, 4] +entity: 詹姆斯 label: PER pos: [6, 8] +entity: 姚明 label: PER pos: [10, 11] +----------------------------- +``` + +### 参数说明 + +`token_cls_infer.py` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--use_fast| 是否使用FastTokenizer加速分词阶段。默认为True| +|--model_prefix| 模型文件前缀。前缀会分别与'.pdmodel'和'.pdiparams'拼接得到模型文件名和参数文件名。默认为 'model'| + + +## FastDeploy 高阶用法 + +FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE 3.0 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE 3.0 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|----------------|--------------------|--------------------------------|--------------------|
+| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| | | ONNX Runtime | use_ort_backend() | ✅ | N/A |
+| | | OpenVINO | use_openvino_backend() | ✅ | N/A |
+| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| | | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
+| | | Paddle TensorRT | use_trt_backend() + enable_paddle_to_trt() | ✅ | ✅ |
+| | | TensorRT | use_trt_backend() | ✅ | ✅ |
+| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | ✅ | N/A |
+| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ❔ | ❔ |
+| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | N/A | ❔ |
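+
+例如,结合上表,在 GPU 上使用 Paddle TensorRT 后端并开启 FP16 模式时,可以按如下方式组合接口。以下片段仅为示意(接口调用与本目录 `seq_cls_infer.py` 保持一致,模型路径与输入形状为假设的示例值):
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+option.set_model_path("model.pdmodel", "model.pdiparams")  # 示例路径
+option.use_gpu()                          # 硬件:GPU
+option.use_trt_backend()                  # 推理引擎:TensorRT
+option.enable_paddle_to_trt()             # 叠加该接口即为 Paddle TensorRT 后端
+option.enable_paddle_trt_collect_shape()
+# TensorRT 后端需要设置动态 shape,输入名与本文示例模型一致
+option.set_trt_input_shape("input_ids", min_shape=[1, 1],
+                           opt_shape=[1, 128], max_shape=[1, 128])
+option.set_trt_input_shape("token_type_ids", min_shape=[1, 1],
+                           opt_shape=[1, 128], max_shape=[1, 128])
+option.enable_trt_fp16()                  # 开启 FP16 模式
+option.set_trt_cache_file("model.trt")    # 缓存构建好的 TensorRT 引擎,避免重复构建
+runtime = fd.Runtime(option)
+```
+
+若部署的是 Paddle INT8 新格式量化模型,只需将 `set_model_path` 指向量化模型文件(如 `int8.pdmodel` 与 `int8.pdiparams`),其余设置保持不变。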
+ +## 相关文档 + +[ERNIE 3.0模型详细介绍](../../README.md) + +[ERNIE 3.0模型C++部署方法](../cpp/README.md) diff --git a/model_zoo/ernie-3.0/deploy/python/requirements.txt b/model_zoo/ernie-3.0/deploy/python/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..08af4cf97eb1aabc6da8278fceb000b6adc04363 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/requirements.txt @@ -0,0 +1 @@ +fast-tokenizer-python \ No newline at end of file diff --git a/model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py b/model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..a1699a3fb4762cbef58cd21fef552e3d325d900b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py @@ -0,0 +1,154 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, 
args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + else: + option.use_gpu() + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.enable_paddle_to_trt() + option.enable_paddle_trt_collect_shape() + trt_file = os.path.join(args.model_dir, "model.trt") + option.set_trt_input_shape( + "input_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, args.max_length], + ) + option.set_trt_input_shape( + "token_type_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, args.max_length], + ) + if args.use_fp16: + option.enable_trt_fp16() + trt_file = trt_file + ".fp16" + option.set_trt_cache_file(trt_file) + return fd.Runtime(option) + + def preprocess(self, text, text_pair): + data = self.tokenizer(text, text_pair, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs.max(axis=-1)} + return out_dict + + def predict(self, texts, texts_pair=None): + input_map = self.preprocess(texts, texts_pair) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + texts_ds = ["花呗收款额度限制", "花呗支持高铁票支付吗"] + texts_pair_ds = ["收钱码,对花呗支付的金额有限制吗", "为什么友付宝不支持花呗付款"] + batch_texts = batchfy_text(texts_ds, args.batch_size) + batch_texts_pair = batchfy_text(texts_pair_ds, args.batch_size) + + for bs, (texts, texts_pair) in enumerate(zip(batch_texts, batch_texts_pair)): + outputs = predictor.predict(texts, texts_pair) + for i, (sentence1, sentence2) in enumerate(zip(texts, texts_pair)): + print( + f"Batch id:{bs}, example id:{i}, sentence1:{sentence1}, sentence2:{sentence2}, label:{outputs['label'][i]}, similarity:{outputs['confidence'][i]:.4f}" + ) diff --git a/model_zoo/ernie-3.0/deploy/python/token_cls_infer.py b/model_zoo/ernie-3.0/deploy/python/token_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..00a9c82333b9cc3cf1738bc56a4f63c183cc0539 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/token_cls_infer.py @@ -0,0 +1,187 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class ErnieForTokenClassificationPredictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + self.label_names = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"] + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + else: + option.use_gpu() + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.enable_paddle_to_trt() + option.enable_paddle_trt_collect_shape() + trt_file = os.path.join(args.model_dir, "infer.trt") + option.set_trt_input_shape( + "input_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, args.max_length], + ) + option.set_trt_input_shape( + "token_type_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, 
args.max_length], + ) + if args.use_fp16: + option.enable_trt_fp16() + trt_file = trt_file + ".fp16" + option.set_trt_cache_file(trt_file) + return fd.Runtime(option) + + def preprocess(self, texts): + is_split_into_words = False + if isinstance(texts[0], list): + is_split_into_words = True + data = self.tokenizer( + texts, max_length=self.max_length, padding=True, truncation=True, is_split_into_words=is_split_into_words + ) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data, input_data): + result = np.array(infer_data[0]) + tokens_label = result.argmax(axis=-1).tolist() + value = [] + for batch, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + for i, label in enumerate(token_label): + if (self.label_names[label] == "O" or "B-" in self.label_names[label]) and start >= 0: + entity = input_data[batch][start : i - 1] + if isinstance(entity, list): + entity = "".join(entity) + if len(entity) == 0: + break + items.append( + { + "pos": [start, i - 2], + "entity": entity, + "label": label_name, + } + ) + start = -1 + if "B-" in self.label_names[label]: + start = i - 1 + label_name = self.label_names[label][2:] + value.append(items) + + out_dict = {"value": value, "tokens_label": tokens_label} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result, texts) + return output + + +def token_cls_print_ret(infer_result, input_data): + rets = infer_result["value"] + for i, ret in enumerate(rets): + print("input data:", input_data[i]) + print("The model detects all entities:") + for iterm in ret: + print("entity:", iterm["entity"], " label:", iterm["label"], " pos:", iterm["pos"]) + print("-----------------------------") + + +if __name__ == "__main__": + args = parse_arguments() + predictor = ErnieForTokenClassificationPredictor(args) + texts = ["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"] + batch_data = batchfy_text(texts, args.batch_size) + for data in batch_data: + outputs = predictor.predict(data) + token_cls_print_ret(outputs, data) diff --git a/model_zoo/ernie-3.0/deploy/serving/README.md b/model_zoo/ernie-3.0/deploy/serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a8a9de1ad94e51ad95a6ed0f18471d34e71fdd3b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/README.md @@ -0,0 +1,225 @@ +# FastDeploy ERNIE 3.0 模型 Serving 部署示例 + + +在服务化部署前,需确认 + +- 1. 
服务化镜像的软硬件环境要求和镜像拉取命令请参考 [FastDeploy 服务化部署](https://github.com/PaddlePaddle/FastDeploy/blob/develop/serving/README_CN.md) + +## 准备模型 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 模型在 CLUE Benchmark 的 [AFQMC 数据集](https://github.com/CLUEbenchmark/CLUE)上进行文本分类任务以及 [MSRA_NER 数据集](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra)上进行序列标注任务的**服务化部署**。按照[ERNIE 3.0 训练文档](../../README.md)分别训练并导出文本分类模型以及序列标注模型,并将导出的模型移动到 models 目录下相应位置。注意:模型与参数文件必须命名为 **model.pdmodel** 和 **model.pdiparams**。 + +模型移动好之后,文本分类任务的 models 目录结构如下: + +``` +models +├── ernie_seqcls # 分类任务的 pipeline +│   ├── 1 +│   └── config.pbtxt # 通过这个文件组合前后处理和模型推理 +├── ernie_seqcls_model # 分类任务的模型推理 +│   ├── 1 +│   │   ├── model.pdiparams +│   │   └── model.pdmodel +│   └── config.pbtxt +├── ernie_seqcls_postprocess # 分类任务后处理 +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── ernie_tokenizer # 预处理分词 + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +序列标注任务的 models 目录结构如下: + +``` +models +├── ernie_tokencls # 序列标注任务的 pipeline +│   ├── 1 +│   └── config.pbtxt # 通过这个文件组合前后处理和模型推理 +├── ernie_tokencls_model # 序列标注任务的模型推理 +│   ├── 1 +│   │   ├── model.pdiparams +│   │   └── model.pdmodel +│   └── config.pbtxt +├── ernie_tokencls_postprocess # 序列标注任务后处理 +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── ernie_tokenizer # 预处理分词 + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +## 拉取并运行镜像 + +``` +# x.y.z为镜像版本号,需参照 serving 文档替换为数字 +# GPU镜像 +docker pull registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-gpu-cuda11.4-trt8.4-21.10 +# CPU镜像 +docker pull registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-cpu-only-21.10 + +# GPU 运行 +nvidia-docker run -it --net=host --name fastdeploy_server --shm-size="1g" -v /path/serving/models:/models rregistry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-gpu-cuda11.4-trt8.4-21.10 bash + +# CPU 运行 +docker run -it --net=host --name fastdeploy_server --shm-size="1g" -v /path/serving/models:/models registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-cpu-only-21.10 bash +``` + +## 部署模型 + +serving 目录包含启动 pipeline 服务的配置和发送预测请求的代码,包括: + +``` +models # 服务化启动需要的模型仓库,包含模型和服务配置文件 +seq_cls_rpc_client.py # AFQMC 分类任务发送 pipeline 预测请求的脚本 +token_cls_rpc_client.py # 序列标注任务发送 pipeline 预测请求的脚本 +``` + +注意:启动服务时,Server 的每个 python 后端进程默认申请 64M 内存,默认启动的 docker 无法启动多个 python 后端节点。有两个解决方案: + +1. 启动容器时设置 shm-size 参数, 比如: docker run -it --net=host --name fastdeploy_server --shm-size="1g" -v /path/serving/models:/models registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-gpu-cuda11.4-trt8.4-21.10 bash + +2. 启动服务时设置 python 后端的 shm-default-byte-size 参数, 设置 python 后端的默认内存为10M: fastdeployserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760 + +### 分类任务 + +在容器内执行下面命令启动服务: + +``` +# 默认启动 models 下所有模型 +fastdeployserver --model-repository=/models + +# 可通过参数只启动分类任务 +fastdeployserver --model-repository=/models --model-control-mode=explicit --load-model=ernie_seqcls +``` + +输出打印如下: + +```shell + +I0209 09:15:49.314029 708 model_repository_manager.cc:1183] successfully loaded 'ernie_seqcls_model' version 1 +I0209 09:15:49.314917 708 model_repository_manager.cc:1022] loading: ernie_seqcls:1 +I0209 09:15:49.417014 708 model_repository_manager.cc:1183] successfully loaded 'ernie_seqcls' version 1 +... 
+I0209 09:15:49.417394 708 server.cc:549] ++------------+---------------------------------------------------------------+--------+ +| Backend | Path | Config | ++------------+---------------------------------------------------------------+--------+ +| python | /opt/tritonserver/backends/python/libtriton_python.so | {} | +| fastdeploy | /opt/tritonserver/backends/fastdeploy/libtriton_fastdeploy.so | {} | ++------------+---------------------------------------------------------------+--------+ + +I0209 09:15:49.417552 708 server.cc:592] ++--------------------------+---------+--------+ +| Model | Version | Status | ++--------------------------+---------+--------+ +| ernie_seqcls | 1 | READY | +| ernie_seqcls_model | 1 | READY | +| ernie_seqcls_postprocess | 1 | READY | +| ernie_seqcls_tokenizer | 1 | READY | ++--------------------------+---------+--------+ + +``` + +### 序列标注任务 + +在容器内执行下面命令启动序列标注服务: + +```shell +fastdeployserver --model-repository=/models --model-control-mode=explicit --load-model=ernie_tokencls --backend-config=python,shm-default-byte-size=10485760 +``` + +输出打印如下: + +```shell + +I0209 09:15:49.314029 708 model_repository_manager.cc:1183] successfully loaded 'ernie_tokencls_model' version 1 +I0209 09:15:49.314917 708 model_repository_manager.cc:1022] loading: ernie_tokencls:1 +I0209 09:15:49.417014 708 model_repository_manager.cc:1183] successfully loaded 'ernie_tokencls' version 1 +... +I0209 09:15:49.417394 708 server.cc:549] ++------------+---------------------------------------------------------------+--------+ +| Backend | Path | Config | ++------------+---------------------------------------------------------------+--------+ +| python | /opt/tritonserver/backends/python/libtriton_python.so | {} | +| fastdeploy | /opt/tritonserver/backends/fastdeploy/libtriton_fastdeploy.so | {} | ++------------+---------------------------------------------------------------+--------+ + +I0209 09:15:49.417552 708 server.cc:592] ++----------------------------+---------+--------+ +| Model | Version | Status | ++----------------------------+---------+--------+ +| ernie_tokencls | 1 | READY | +| ernie_tokencls_model | 1 | READY | +| ernie_tokencls_postprocess | 1 | READY | +| ernie_tokencls_tokenizer | 1 | READY | ++----------------------------+---------+--------+ + +``` + +## 客户端请求 + +客户端请求可以在本地执行脚本请求;也可以在容器中执行。 + +本地执行脚本需要先安装依赖: + +```shell + +pip install grpcio +pip install tritonclient[all] + +# 如果bash无法识别括号,可以使用如下指令安装: +pip install tritonclient\[all\] + +``` + +### 分类任务 + +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +```shell +python seq_cls_grpc_client.py +``` + +输出打印如下: + +```shell +{'label': array([0, 0]), 'confidence': array([0.54437345, 0.98503494], dtype=float32)} +acc: 0.7224281742354032 +``` + + +### 序列标注任务 + +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + + +```shell +python token_cls_grpc_client.py +``` + +输出打印如下: + +```shell + +input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京 label: LOC pos: [0, 1] +entity: 重庆 label: LOC pos: [6, 7] +entity: 成都 label: LOC pos: [12, 13] +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹 label: PER pos: [0, 1] +entity: 科比 label: PER pos: [3, 4] +entity: 詹姆斯 label: PER pos: [6, 8] +entity: 姚明 label: PER pos: [10, 11] + +``` + +## 配置修改 + +当前分类任务( ernie_seqcls_model/config.pbtxt )默认配置在 CPU上 运行 OpenVINO 引擎; 序列标注任务默认配置在 GPU 上运行 Paddle Inference 引擎。如果要在 CPU/GPU 或其他推理引擎上运行, 
需要修改配置,详情请参考[配置文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/serving/docs/zh_CN/model_configuration.md)。 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..05a9750b0b0a2885900f3fe0d2bd4a41b5ede4e8 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/config.pbtxt @@ -0,0 +1,85 @@ +name: "ernie_seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "TEXT" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "TEXT_PAIR" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "ernie_seqcls_tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "TEXT" + } + input_map { + key: "INPUT_1" + value: "TEXT_PAIR" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "ernie_seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + # 需要按照实际模型输出进行配置。 + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "ernie_seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5eb08196c5088eb7aa903e5e8896e466b22094d7 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/1/README.md @@ -0,0 +1 @@ +本目录存放 ERNIE 3.0 模型 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..10ea68104f384fe0bd4446ce9b85e93f5cd7399c --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/config.pbtxt @@ -0,0 +1,43 @@ +backend: "fastdeploy" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 2 ] + } +] + +instance_group [ + { + # 创建1个实例 + count: 1 + # 使用CPU推理(KIND_CPU、KIND_GPU) + kind: KIND_GPU + } +] + +optimization { + execution_accelerators { + cpu_execution_accelerator : [ + { + # use openvino backend + name: "paddle" + parameters { key: "cpu_threads" value: "5" } + } + ] + } +} + diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/1/model.py new file mode 100644 index 
0000000000000000000000000000000000000000..d4f72200eb24f3c62b7b9cb4c1f64d7e77151d4e --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/1/model.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. 
The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + + max_value = np.max(data, axis=1, keepdims=True) + exp_data = np.exp(data - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + probs = probs.max(axis=-1) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], data.argmax(axis=-1)) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..e3a5c61ec77f336083e5198767e7b713b1fa9857 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "ernie_seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 2 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..6e0ab63157653a04578179db4fcae9bd841c46d4 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/1/model.py @@ -0,0 +1,110 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. 
This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + texts = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + texts = texts.as_numpy() + texts = [i[0].decode("utf-8") for i in texts] + + text_pairs = pb_utils.get_input_tensor_by_name(request, self.input_names[1]) + text_pairs = text_pairs.as_numpy() + text_pairs = [i[0].decode("utf-8") for i in text_pairs] + + data = self.tokenizer(texts, text_pair=text_pairs, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. 
+ """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..f72a61c879d976ac6eee55d312862e46d85b3d6a --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/config.pbtxt @@ -0,0 +1,36 @@ +name: "ernie_seqcls_tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "INPUT_1" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..0936b6ef5eb646aad5ea62ba5727084eab67b05d --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/config.pbtxt @@ -0,0 +1,66 @@ +name: "ernie_tokencls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "ernie_tokencls_tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "ernie_tokencls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + # 需要按照实际模型输出进行配置。 + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "ernie_tokencls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_OUTPUT" + value: "OUTPUT" + } + } + ] +} diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b3ce2c1ae200527f83d571590f1c731d51644fdb --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/1/README.md @@ -0,0 +1 @@ +本目录存放ERNIE 3.0模型 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..e87dc6d28728ae3bdd7417145f4dac0e85cc1826 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/config.pbtxt @@ -0,0 +1,41 @@ +backend: "fastdeploy" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + # 需要按照实际模型输出进行配置。 + name: "linear_75.tmp_1" + data_type: 
TYPE_FP32 + dims: [ -1, 7 ] + } +] + +instance_group [ + { + # 创建1个实例 + count: 1 + # 使用GPU推理(KIND_CPU、KIND_GPU) + kind: KIND_GPU + } +] + +optimization { + execution_accelerators { + gpu_execution_accelerator : [ + { + name: "paddle" + } + ] + } +} diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..2e1e6720a17ce2573b35087f8e5ded5f493a87d2 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/1/model.py @@ -0,0 +1,121 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + # The label names of NER models trained by different data sets may be different + self.label_names = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"] + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. 
Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + tokens_label = data.argmax(axis=-1).tolist() + value = [] + for _, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + for i, label in enumerate(token_label): + if (self.label_names[label] == "O" or "B-" in self.label_names[label]) and start >= 0: + items.append( + { + "pos": [start, i - 2], + "label": label_name, + } + ) + start = -1 + if "B-" in self.label_names[label]: + start = i - 1 + label_name = self.label_names[label][2:] + value.append(items) + out_result = np.array(value, dtype="object") + out_tensor = pb_utils.Tensor(self.output_names[0], out_result) + inference_response = pb_utils.InferenceResponse( + output_tensors=[ + out_tensor, + ] + ) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..7760158277e04ff8769a1940af207c836a75c6f3 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/config.pbtxt @@ -0,0 +1,26 @@ +name: "ernie_tokencls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ -1, 7 ] + } +] + +output [ + { + name: "POST_OUTPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..66c3a947371de3c156daf45449dca8bfeacc96b6 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/1/model.py @@ -0,0 +1,105 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. 
You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. 
This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..08c7727bf89fc801bc9fb6fb61729f927b4590f9 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "ernie_tokencls_tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/seq_cls_grpc_client.py b/model_zoo/ernie-3.0/deploy/serving/seq_cls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..deb45a7e45523f566d144e6f9fb616674a23554f --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/seq_cls_grpc_client.py @@ -0,0 +1,140 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +def test_afqmc_dataset(runner): + from paddlenlp.datasets import load_dataset + + dev_ds = load_dataset("clue", "afqmc", splits="dev") + + batches = [] + labels = [] + idx = 0 + batch_size = 32 + while idx < len(dev_ds): + texts = [] + text_pairs = [] + label = [] + for i in range(batch_size): + if idx + i >= len(dev_ds): + break + texts.append(dev_ds[idx + i]["sentence1"]) + text_pairs.append(dev_ds[idx + i]["sentence2"]) + label.append(dev_ds[idx + i]["label"]) + batches.append((texts, text_pairs)) + labels.append(np.array(label)) + idx += batch_size + + accuracy = 0 + for i, data in enumerate(batches): + ret = runner.Run(data) + accuracy += np.sum(labels[i] == ret["label"]) + print("acc:", 1.0 * accuracy / len(dev_ds)) + + +if __name__ == "__main__": + model_name = "ernie_seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + # [([texts], [text_pairs])] + dataset = [(["花呗收款额度限制", "花呗支持高铁票支付吗"], ["收钱码,对花呗支付的金额有限制吗", "为什么友付宝不支持花呗付款"])] + + for batch_text_pair in dataset: + result = runner.Run(batch_text_pair) + print(result) + + test_afqmc_dataset(runner) diff --git a/model_zoo/ernie-3.0/deploy/serving/token_cls_grpc_client.py b/model_zoo/ernie-3.0/deploy/serving/token_cls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..21b3b6de537f691891f4354f68d7aea259a94832 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/token_cls_grpc_client.py @@ -0,0 +1,124 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import ast +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" + ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "ernie_tokencls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + dataset = [ + ["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"], + ] + + for batch_input in dataset: + # input format:[input1, input2 ... 
inputn], n = len(self._input_names) + result = runner.Run([batch_input]) + for i, ret in enumerate(result["OUTPUT"]): + ret = ast.literal_eval(ret.decode("utf-8")) + print("input data:", batch_input[i]) + print("The model detects all entities:") + for iterm in ret: + entity = batch_input[i][iterm["pos"][0] : iterm["pos"][1] + 1] + if len(entity) > 0: + print( + "entity:", + entity, + " label:", + iterm["label"], + " pos:", + iterm["pos"], + ) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/README.md b/model_zoo/ernie-3.0/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b411bc58b9eafba3b2a7be436be6e656f9986346 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/README.md @@ -0,0 +1,58 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 + +## Server服务启动 +### 文本分类任务启动 +#### 启动文本分类 Server 服务 +```bash +paddlenlp server server_seq_cls:app --host 0.0.0.0 --port 8189 +``` + +#### 分类任务发送服务 +```bash +python client_seq_cls.py --dataset afqmc +``` + +### 命名实体识别任务启动 +#### 启动命名实体识别 Server 服务 +```bash +paddlenlp server server_token_cls:app --host 0.0.0.0 --port 8189 +``` + +#### 命名实体识别 Client发送服务 +```bash +python client_token_cls.py +``` + +### 问答任务启动 +#### 启动问答 Server 服务 +```bash +paddlenlp server server_qa:app --host 0.0.0.0 --port 8189 +``` + +#### 问答 Client 发送服务 +```bash +python client_qa.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs if len(text_pairs) > 0 else None + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size + } + } +``` diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/client_qa.py b/model_zoo/ernie-3.0/deploy/simple_serving/client_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..d4e72ba2db4bd338fc2c8e6e2ef43deea4e099c2 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/client_qa.py @@ -0,0 +1,44 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +import requests + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--doc_stride", default=128, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: disable +url = "http://0.0.0.0:8189/models/ernie_qa" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["研究证实,细胞减少与肺内病变程度及肺内炎性病变吸收程度密切相关。", ] + data = { + 'data': { + 'context': ['《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武器情报、奥义字或擅长攻击类型等,请至战国无双系列1.由于乡里大辅先生因故去世,不得不寻找其他声优接手。从猛将传 and Z开始。2.战国无双 编年史的原创男女主角亦有专属声优。此模式是任天堂游戏谜之村雨城改编的新增模式。本作中共有20张战场地图(不含村雨城),后来发行的猛将传再新增3张战场地图。但游戏内战役数量繁多,部分地图会有兼用的状况,战役虚实则是以光荣发行的2本「战国无双3 人物真书」内容为主,以下是相关介绍。(注:前方加☆者为猛将传新增关卡及地图。)合并本篇和猛将传的内容,村雨城模式剔除,战国史模式可直接游玩。主打两大模式「战史演武」&「争霸演武」。系列作品外传作品', '《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武器情报、奥义字或擅长攻击类型等,请至战国无双系列1.由于乡里大辅先生因故去世,不得不寻找其他声优接手。从猛将传 and Z开始。2.战国无双 编年史的原创男女主角亦有专属声优。此模式是任天堂游戏谜之村雨城改编的新增模式。本作中共有20张战场地图(不含村雨城),后来发行的猛将传再新增3张战场地图。但游戏内战役数量繁多,部分地图会有兼用的状况,战役虚实则是以光荣发行的2本「战国无双3 人物真书」内容为主,以下是相关介绍。(注:前方加☆者为猛将传新增关卡及地图。)合并本篇和猛将传的内容,村雨城模式剔除,战国史模式可直接游玩。主打两大模式「战史演武」&「争霸演武」。系列作品外传作品'], # noqa: E126 + 'question': ['《战国无双3》是由哪两个公司合作开发的?', '男女主角亦有专属声优这一模式是由谁改编的?'] # noqa: E126 + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'doc_stride': args.doc_stride, + } + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/client_seq_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/client_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..e32c950498abd9226fbe7023c636285053ab8e65 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/client_seq_cls.py @@ -0,0 +1,83 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +import requests + +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", required=True, type=str, help="The dataset name for the simple seving") +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/ernie_cls" +headers = {"Content-Type": "application/json"} + + +def seq_convert_example(example): + """convert a glue example into necessary features""" + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + return example + + +if __name__ == "__main__": + examples = load_dataset("clue", args.dataset)["dev"][:10] + texts = [] + text_pairs = [] + for example in examples: + example = seq_convert_example(example) + if "sentence" in example: + texts.append(example) + else: + texts.append(example["sentence1"]) + text_pairs.append(example["sentence2"]) + + data = { + "data": {"text": texts, "text_pair": text_pairs if len(text_pairs) > 0 else None}, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/client_token_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/client_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..3a07509fd5e7f33dca5c9dca3fa214b338c5eabd --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/client_token_cls.py @@ -0,0 +1,41 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +import requests + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: disable +url = "http://0.0.0.0:8189/models/ernie_ner" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"] + data = { + 'data': { + 'text': texts + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + } + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/server_qa.py b/model_zoo/ernie-3.0/deploy/simple_serving/server_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..a02775acb6caa7222ba28a2ef0f055a8942cc208 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/server_qa.py @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, QAModelHandler + + +class QAPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + start_logits = data["logits"] + end_logits = data["logits_1"] + contexts = data["data"]["context"] + questions = data["data"]["question"] + offset_mappings = data["data"]["offset_mapping"] + answers = [] + count = 0 + for start_logit, end_logit, offset_mapping in zip(start_logits, end_logits, offset_mappings): + start_position = np.argmax(np.array(start_logit)) + end_position = np.argmax(np.array(end_logit)) + start_id = offset_mapping[start_position][0] + end_id = offset_mapping[end_position][1] + answer = [] + if end_position > start_position: + answer = contexts[count][start_id:end_id] + answers.append(answer) + count += 1 + + return {"context": contexts, "question": questions, "answer": answers} + + +app = SimpleServer() +app.register( + "models/ernie_qa", + model_path="../../best_models/cmrc2018/export/", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=QAModelHandler, + post_handler=QAPostHandler, +) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/server_seq_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/server_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..837ad1793247416a3fb337742d4de84ad804cccf --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/server_seq_cls.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler + +app = SimpleServer() +app.register( + "models/ernie_cls", + model_path="../../best_models/afqmc/export/", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=CustomModelHandler, + post_handler=MultiClassificationPostHandler, +) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/server_token_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/server_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..57aff957a078f2410b0fdcd20e2acbc2a7c927be --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/server_token_cls.py @@ -0,0 +1,73 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, TokenClsModelHandler + + +class NERPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + label_list = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"] + input_datas = data["data"]["text"] + predictions = np.array(data["logits"]) + tokens_label = predictions.argmax(axis=-1) + tokens_label = tokens_label.tolist() + value = [] + for batch, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + input_data = input_datas[batch] + for i, label in enumerate(token_label): + if (label_list[label] == "O" or "B-" in label_list[label]) and start >= 0: + entity = input_data[start : i - 1] + if isinstance(entity, list): + entity = "".join(entity) + items.append( + { + "pos": [start, i - 2], + "entity": entity, + "label": label_name, + } + ) + start = -1 + if "B-" in label_list[label]: + start = i - 1 + label_name = label_list[label][2:] + if start >= 0: + items.append( + { + "pos": [start, len(token_label) - 1], + "entity": input_data[start : len(token_label) - 1], + "label": "", + } + ) + value.append(items) + out_dict = {"value": value, "tokens_label": tokens_label} + return out_dict + + +app = SimpleServer() +app.register( + "models/ernie_ner", + model_path="../../best_models/msra_ner/export/", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=TokenClsModelHandler, + post_handler=NERPostHandler, +) diff --git a/model_zoo/ernie-3.0/infer.py b/model_zoo/ernie-3.0/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..0941a2db9b8a0d9204d63d25467b579b596e3b3c --- /dev/null +++ b/model_zoo/ernie-3.0/infer.py @@ -0,0 +1,521 @@ +# Copyright (c) 2022 
PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from functools import partial +from multiprocessing import cpu_count + +import numpy as np +import onnxruntime as ort +import paddle +from datasets import load_dataset +from paddle import inference +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorForTokenClassification, DataCollatorWithPadding +from paddlenlp.datasets import load_dataset as ppnlp_load_dataset +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import AutoTokenizer + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default="tnews", + type=str, + help="The name of the task to perform predict, selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", default="ernie-3.0-medium-zh", type=str, help="The directory or name of model." + ) + parser.add_argument("--model_path", type=str, required=True, help="The path prefix of inference model to be used.") + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu"], help="Device selected for inference." + ) + parser.add_argument("--batch_size", default=32, type=int, help="Batch size for predict.") + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--perf_warmup_steps", default=20, type=int, help="Warmup steps for performance test.") + parser.add_argument( + "--n_best_size", + default=20, + type=int, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--max_answer_length", default=50, type=int, help="Max answer length for question answering task." + ) + parser.add_argument("--shape_file", default="shape_info.txt", type=str, help="Shape info filename.") + parser.add_argument("--use_trt", action="store_true", help="Whether to use inference engin TensorRT.") + parser.add_argument("--use_lite", action="store_true", help="Whether to use inference engin PaddleLite.") + parser.add_argument("--perf", action="store_true", help="Whether to test performance.") + parser.add_argument("--collect_shape", action="store_true", help="Whether to collect shape info.") + + parser.add_argument( + "--precision", default="fp32", choices=["fp32", "fp16", "int8"], help="Precision for inference." 
+ ) + parser.add_argument( + "--num_threads", + default=cpu_count(), + type=int, + help="num_threads for cpu.", + ) + parser.add_argument( + "--enable_quantize", + action="store_true", + help="Whether to enable quantization for acceleration. Valid for both onnx and dnnl", + ) + parser.add_argument( + "--enable_bf16", + action="store_true", + help="Whether to use the bfloat16 datatype", + ) + parser.add_argument("--use_onnxruntime", type=strtobool, default=False, help="Use onnxruntime to infer or not.") + parser.add_argument( + "--debug", action="store_true", help="With debug it will save graph and model after each pass." + ) + parser.add_argument( + "--provider", + default="CPUExecutionProvider", + choices=["CPUExecutionProvider", "DnnlExecutionProvider"], + type=str, + help="Onnx ExecutionProvider with DNNL or without DNNL", + ) + parser.add_argument( + "--lazy_data_processing", + default=True, + type=bool, + help="Whether use lazy data processing", + ) + + args = parser.parse_args() + return args + + +def convert_example(example, tokenizer, label_list, is_test=False, max_seq_length=512): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + # Get the label + label = np.array(example["label"], dtype="int64") + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + example["labels"] = label + return example + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + if args.use_onnxruntime: + assert args.device != "xpu", "Running ONNXRuntime on XPU is temporarily not supported." 
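+            # 说明:使用 ONNX Runtime 推理时,若 --model_path 不是 .onnx 文件,
+            # 会先用 paddle2onnx 将 Paddle 静态图模型转换为 ONNX;若指定 --enable_quantize,
+            # 还会对 ONNX 模型做动态量化,最后按 --provider 指定的 ExecutionProvider 创建推理会话。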
+ if args.model_path.count(".onnx"): + onnx_model = args.model_path + else: + import paddle2onnx + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=args.model_path + ".pdmodel", + params_file=args.model_path + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + dynamic_quantize_model = onnx_model + if args.enable_quantize: + from onnxruntime.quantization import quantize_dynamic + + float_onnx_file = "model.onnx" + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + dynamic_quantize_model = "dynamic_quantize_model.onnx" + quantize_dynamic(float_onnx_file, dynamic_quantize_model) + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = args.num_threads + sess_options.inter_op_num_threads = args.num_threads + executionprovider = args.provider + print("ExecutionProvider is: ", executionprovider) + predictor = ort.InferenceSession( + dynamic_quantize_model, sess_options=sess_options, providers=[executionprovider] + ) + input_name1 = predictor.get_inputs()[1].name + input_name2 = predictor.get_inputs()[0].name + input_handles = [input_name1, input_name2] + return cls(predictor, input_handles, []) + + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + cls.device = paddle.set_device("gpu") + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + config.switch_ir_optim(True) + config.enable_mkldnn() + if args.enable_bf16: + config.enable_mkldnn_bfloat16() + if args.enable_quantize: + config.enable_mkldnn_int8() + if args.debug: + config.switch_ir_debug(True) + config.set_cpu_math_library_num_threads(args.num_threads) + cls.device = paddle.set_device("cpu") + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + elif args.device == "npu": + if args.use_lite: + config.enable_lite_engine(paddle.inference.PrecisionType(0), True) + config.nnadapter().enable().set_device_names(["huawei_ascend_npu"]) + else: + config.enable_custom_device("npu") + if args.use_trt: + precision_map = { + "int8": inference.PrecisionType.Int8, + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + } + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=precision_map[args.precision], + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled())) + + if args.collect_shape: + config.collect_shape_range_info(args.task_name + args.shape_file) + else: + config.enable_tuned_tensorrt_dynamic_shape(args.task_name + args.shape_file, True) + + config.delete_pass("embedding_eltwise_layernorm_fuse_pass") + predictor = paddle.inference.create_predictor(config) + + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + return cls(predictor, input_handles, output_handles) + + def set_dynamic_shape(self, max_seq_length, batch_size): + # The dynamic shape info required by TRT is automatically generated according to max_seq_length and batch_size and stored in shape_info.txt + min_batch_size, max_batch_size, opt_batch_size = 1, batch_size, batch_size + min_seq_len, max_seq_len, opt_seq_len = 2, max_seq_length, 32 + batches = [ + [ 
+ np.zeros([min_batch_size, min_seq_len], dtype="int64"), + np.zeros([min_batch_size, min_seq_len], dtype="int64"), + ], + [ + np.zeros([max_batch_size, max_seq_len], dtype="int64"), + np.zeros([max_batch_size, max_seq_len], dtype="int64"), + ], + [ + np.zeros([opt_batch_size, opt_seq_len], dtype="int64"), + np.zeros([opt_batch_size, opt_seq_len], dtype="int64"), + ], + ] + for batch in batches: + self.predict_batch(batch) + print("Set dynamic shape finished, please close set_dynamic_shape and restart.") + exit(0) + + def predict_batch(self, data): + if len(self.output_handles) == 0: + input_dict = {} + for input_field, input_handle in zip(data, self.input_handles): + input_dict[input_handle] = input_field + result = self.predictor.run(None, input_dict) + return result + + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataset, tokenizer, batchify_fn, args, dev_example=None, dev_ds_ori=None): + if args.collect_shape: + self.set_dynamic_shape(args.max_seq_length, args.batch_size) + if args.task_name == "cmrc2018": + dataset_removed = dataset.remove_columns(["offset_mapping", "attention_mask", "example_id"]) + sample_num = len(dataset) + batches = [] + for i in range(0, sample_num, args.batch_size): + batch_size = min(args.batch_size, sample_num - i) + batch = [dataset_removed[i + j] for j in range(batch_size)] + batches.append(batch) + else: + sample_num = len(dataset) + batches = [] + for i in range(0, sample_num, args.batch_size): + batch_size = min(args.batch_size, sample_num - i) + batch = [dataset[i + j] for j in range(batch_size)] + batches.append(batch) + if args.perf: + for i, batch in enumerate(batches): + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + output = self.predict_batch([input_ids, segment_ids]) + if i > args.perf_warmup_steps: + break + time1 = time.time() + nums = 0 + for batch in batches: + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + nums = nums + input_ids.shape[0] + output = self.predict_batch([input_ids, segment_ids]) + total_time = time.time() - time1 + print( + "task name: %s, sample nums: %s, time: %s, QPS: %s " + % (args.task_name, nums, total_time, nums / total_time) + ) + + else: + if args.task_name == "msra_ner": + metric = ChunkEvaluator(label_list=args.label_list) + metric.reset() + all_predictions = [] + for batch in batches: + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + output = self.predict_batch([input_ids, segment_ids])[0] + preds = np.argmax(output, axis=2) + all_predictions.append(preds.tolist()) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute( + batch["seq_len"], paddle.to_tensor(preds), batch["labels"] + ) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + res = metric.accumulate() + print("task name: %s, (precision, recall, f1): %s, " % (args.task_name, res)) + elif args.task_name == "cmrc2018": + all_start_logits = [] + all_end_logits = [] + for batch in batches: + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + start_logits, end_logits = self.predict_batch([input_ids, segment_ids]) + for idx in 
range(start_logits.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + all_start_logits.append(start_logits[idx]) + all_end_logits.append(end_logits[idx]) + all_predictions, _, _ = compute_prediction( + dev_example, + dataset, + (all_start_logits, all_end_logits), + False, + args.n_best_size, + args.max_answer_length, + ) + res = squad_evaluate( + examples=[raw_data for raw_data in dev_example], preds=all_predictions, is_whitespace_splited=False + ) + print("task name: %s, EM: %s, F1: %s" % (args.task_name, res["exact"], res["f1"])) + return all_predictions + else: + all_predictions = [] + metric = METRIC_CLASSES[args.task_name]() + metric.reset() + for i, batch in enumerate(batches): + batch = batchify_fn(batch) + output = self.predict_batch([batch["input_ids"].numpy(), batch["token_type_ids"].numpy()])[0] + preds = np.argmax(output, axis=1) + all_predictions.append(preds.tolist()) + correct = metric.compute(paddle.to_tensor(output), batch["labels"]) + metric.update(correct) + res = metric.accumulate() + + print("task name: %s, acc: %s, " % (args.task_name, res)) + return all_predictions + + +def tokenize_and_align_labels(example, tokenizer, no_entity_id, max_seq_len=512): + if example["tokens"] == []: + tokenized_input = { + "labels": [], + "input_ids": [], + "token_type_ids": [], + "seq_len": 0, + "length": 0, + } + return tokenized_input + tokenized_input = tokenizer( + example["tokens"], + max_seq_len=max_seq_len, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). + is_split_into_words=True, + return_length=True, + ) + label_ids = example["ner_tags"] + if len(tokenized_input["input_ids"]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_input["input_ids"]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + + label_ids += [no_entity_id] * (len(tokenized_input["input_ids"]) - len(label_ids)) + tokenized_input["labels"] = label_ids + return tokenized_input + + +def prepare_validation_features(examples, tokenizer, doc_stride, max_seq_length): + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=doc_stride, max_seq_len=max_seq_length, return_attention_mask=True + ) + + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. 
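+        # Offsets that do not belong to the context (question tokens and the final
+        # special token) are set to None below so answer post-processing can skip them.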
+ sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +def main(): + paddle.seed(42) + args = parse_args() + + args.task_name = args.task_name.lower() + + predictor = Predictor.create_predictor(args) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + if args.task_name == "msra_ner": + + def ner_trans_fn(example, tokenizer, max_seq_length=128, no_entity_id=0): + return tokenize_and_align_labels( + example, tokenizer=tokenizer, no_entity_id=no_entity_id, max_seq_len=max_seq_length + ) + + trans_fn = partial(ner_trans_fn, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + dev_ds = load_dataset("msra_ner", split="test") + label_list = dev_ds.features["ner_tags"].feature.names + args.label_list = label_list + + column_names = dev_ds.column_names + dev_ds = dev_ds.map(trans_fn, remove_columns=column_names) + batchify_fn = DataCollatorForTokenClassification(tokenizer) + predictor.predict(dev_ds, tokenizer, batchify_fn, args) + elif args.task_name == "cmrc2018": + dev_example = load_dataset("cmrc2018", split="validation") + column_names = dev_example.column_names + dev_ds = dev_example.map( + partial( + prepare_validation_features, tokenizer=tokenizer, doc_stride=128, max_seq_length=args.max_seq_length + ), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=True, + desc="Running tokenizer on validation dataset", + ) + + batchify_fn = DataCollatorWithPadding(tokenizer) + predictor.predict(dev_ds, tokenizer, batchify_fn, args, dev_example) + else: + dev_ds = ppnlp_load_dataset("clue", args.task_name, splits="dev") + + trans_func = partial( + convert_example, + label_list=dev_ds.label_list, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + is_test=False, + ) + dev_ds = dev_ds.map(trans_func, lazy=args.lazy_data_processing) + if args.device == "npu": + # NOTE: Avoid CANN recompile operators for different shape inputs, which will result in very slow training. + batchify_fn = DataCollatorWithPadding(tokenizer, padding="max_length", max_length=args.max_seq_length) + else: + batchify_fn = DataCollatorWithPadding(tokenizer) + + predictor.predict(dev_ds, tokenizer, batchify_fn, args) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/run_qa.py b/model_zoo/ernie-3.0/run_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..6c3b03d70c493a550bed99d081d986df0eef4dea --- /dev/null +++ b/model_zoo/ernie-3.0/run_qa.py @@ -0,0 +1,224 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
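+
+# Fine-tunes ERNIE for extractive question answering with the PaddleNLP
+# QuestionAnsweringTrainer; supports training, evaluation, prediction and
+# inference-model export.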
+ +import json +import os +from functools import partial + +import paddle +from datasets import load_dataset +from utils import ( + CrossEntropyLossForSQuAD, + DataArguments, + ModelArguments, + QuestionAnsweringTrainer, + load_config, + prepare_train_features, + prepare_validation_features, +) + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer import ( + EvalPrediction, + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Load model and data config + model_args, data_args, training_args = load_config( + model_args.config, "QuestionAnswering", data_args.dataset, model_args, data_args, training_args + ) + # Print model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + data_args.dataset = data_args.dataset.strip() + training_args.output_dir = os.path.join(training_args.output_dir, data_args.dataset) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + raw_datasets = load_dataset("clue", data_args.dataset) + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + + # Define tokenizer, model, loss function. + tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForQuestionAnswering.from_pretrained(model_args.model_name_or_path) + + loss_fct = CrossEntropyLossForSQuAD() + + # Preprocessing the datasets. + # Preprocessing is slighlty different for training and evaluation. 
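+    # Training maps answer character spans to token start/end positions, while
+    # evaluation keeps offset mappings and example ids so that predictions can be
+    # mapped back to spans of the original context.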
+ if training_args.do_train: + column_names = raw_datasets["train"].column_names + elif training_args.do_eval: + column_names = raw_datasets["validation"].column_names + else: + column_names = raw_datasets["validation"].column_names + + if training_args.do_train: + train_dataset = raw_datasets["train"] + # Create train feature from dataset + with training_args.main_process_first(desc="train dataset map pre-processing"): + # Dataset pre-process + train_dataset = train_dataset.map( + partial(prepare_train_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + batch_size=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + if training_args.do_eval: + eval_examples = raw_datasets["validation"] + with training_args.main_process_first(desc="evaluate dataset map pre-processing"): + eval_dataset = eval_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + batch_size=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + if training_args.do_predict: + predict_examples = raw_datasets["validation"] + contexts = predict_examples["context"] + questions = predict_examples["question"] + with training_args.main_process_first(desc="test dataset map pre-processing"): + predict_dataset = predict_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + batch_size=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on prediction dataset", + ) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Post-processing: + def post_processing_function(examples, features, predictions, stage="eval"): + # Post-processing: we match the start logits and end logits to answers in the original context. 
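+        # compute_prediction picks the n-best candidate spans and applies the
+        # null-answer threshold; the gold answers are kept as references for squad_evaluate.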
+ predictions, all_nbest_json, scores_diff_json = compute_prediction( + examples=examples, + features=features, + predictions=predictions, + n_best_size=data_args.n_best_size, + max_answer_length=data_args.max_answer_length, + null_score_diff_threshold=data_args.null_score_diff_threshold, + ) + + references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples] + return EvalPrediction(predictions=predictions, label_ids=references) + + def compute_metrics(p: EvalPrediction): + ret = squad_evaluate(examples=p.label_ids, preds=p.predictions, is_whitespace_splited=False) + return dict(ret) + # return metric.compute(predictions=p.predictions, references=p.label_ids) + + trainer = QuestionAnsweringTrainer( + model=model, + criterion=loss_fct, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=eval_examples if training_args.do_eval else None, + data_collator=data_collator, + post_process_function=post_processing_function, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + if training_args.do_train: + # Training + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(predict_dataset, predict_examples) + trainer.log_metrics("predict", test_ret.metrics) + + out_dict = {"answer": test_ret.predictions, "context": contexts, "question": questions} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w", encoding="utf8") + json.dump(out_dict, out_file, ensure_ascii=True) + + # Export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + + model_args.export_model_dir = os.path.join(model_args.export_model_dir, data_args.dataset, "export") + + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/run_seq_cls.py b/model_zoo/ernie-3.0/run_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..ada03238c9b457b317671127fb4754cfdd622b95 --- /dev/null +++ b/model_zoo/ernie-3.0/run_seq_cls.py @@ -0,0 +1,179 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import json +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.metric import Accuracy +from utils import DataArguments, ModelArguments, load_config, seq_convert_example + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + model_args, data_args, training_args = load_config( + model_args.config, "SequenceClassification", data_args.dataset, model_args, data_args, training_args + ) + # Print model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + data_args.dataset = data_args.dataset.strip() + training_args.output_dir = os.path.join(training_args.output_dir, data_args.dataset) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + raw_datasets = load_dataset("clue", data_args.dataset) + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. 
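+    # Cross entropy is used for tasks with a discrete label list; MSE covers the
+    # regression case where no label list is available.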
+ model = ErnieForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + criterion = nn.loss.CrossEntropyLoss() if data_args.label_list else nn.loss.MSELoss() + + # Define dataset pre-process function + trans_fn = partial( + seq_convert_example, + tokenizer=tokenizer, + label_list=data_args.label_list, + max_seq_len=data_args.max_seq_length, + dynamic_max_length=data_args.dynamic_max_length, + ) + + # Define data collator + data_collator = DataCollatorWithPadding(tokenizer) + + # Dataset pre-process + logger.info("Data Preprocessing...") + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn, lazy=training_args.lazy_data_processing) + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + metric.reset() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + logits = test_ret.predictions + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1).tolist(), "confidence": probs.max(axis=-1).tolist()} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w") + json.dump(out_dict, out_file) + + # Export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + model_args.export_model_dir = os.path.join(model_args.export_model_dir, data_args.dataset, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git 
a/model_zoo/ernie-3.0/run_token_cls.py b/model_zoo/ernie-3.0/run_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..f805ad307c6f796dfd371e64c27fbeb5434fd128 --- /dev/null +++ b/model_zoo/ernie-3.0/run_token_cls.py @@ -0,0 +1,230 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from datasets import load_metric +from utils import DataArguments, ModelArguments, load_config, token_convert_example + +import paddlenlp +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + # Log model and data config + model_args, data_args, training_args = load_config( + model_args.config, "TokenClassification", data_args.dataset, model_args, data_args, training_args + ) + + # Print model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + data_args.dataset = data_args.dataset.strip() + training_args.output_dir = os.path.join(training_args.output_dir, data_args.dataset) + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + raw_datasets = load_dataset(data_args.dataset) + label_list = raw_datasets["train"].label_list + data_args.label_list = label_list + data_args.ignore_label = -100 + data_args.no_entity_id = 0 + + num_classes = len(label_list) + + # Define tokenizer, model, loss function. 
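+    # Token positions padded with ignore_label (-100) are excluded from the loss
+    # through the ignore_index of CrossEntropyLoss.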
+ tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + class criterion(nn.Layer): + def __init__(self): + super(criterion, self).__init__() + self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=data_args.ignore_label) + + def forward(self, *args, **kwargs): + return paddle.mean(self.loss_fn(*args, **kwargs)) + + loss_fct = criterion() + + # Define dataset pre-process function + trans_fn = partial( + token_convert_example, + tokenizer=tokenizer, + no_entity_id=data_args.no_entity_id, + max_seq_length=data_args.max_seq_length, + dynamic_max_length=data_args.dynamic_max_length, + ) + # Define data collector + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=data_args.ignore_label) + + # Dataset pre-process + logger.info("Data Preprocessing...") + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_eval: + # The msra_ner dataset do not have the dev dataset, use the test dataset for the evaluation + eval_dataset = raw_datasets["test"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn, lazy=training_args.lazy_data_processing) + + # Define the metrics of tasks. + # Metrics + metric = load_metric("seqeval") + + def compute_metrics(p): + predictions, labels = p + predictions = np.argmax(predictions, axis=2) + + # Remove ignored index (special tokens) + true_predictions = [ + [label_list[p] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + true_labels = [ + [label_list[l] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + results = metric.compute(predictions=true_predictions, references=true_labels) + return { + "precision": results["overall_precision"], + "recall": results["overall_recall"], + "f1": results["overall_f1"], + "accuracy": results["overall_accuracy"], + } + + trainer = Trainer( + model=model, + criterion=loss_fct, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + tokens_label = test_ret.predictions.argmax(axis=-1) + tokens_label = tokens_label.tolist() + value = [] + for batch, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + input_data = tokenizer.convert_ids_to_tokens(test_dataset[batch]["input_ids"])[1:-1] + for i, label in enumerate(token_label): + if 
(data_args.label_list[label] == "O" or "B-" in data_args.label_list[label]) and start >= 0: + entity = input_data[start : i - 1] + if isinstance(entity, list): + entity = "".join(entity) + items.append( + { + "pos": [start, i - 2], + "entity": entity, + "label": label_name, + } + ) + start = -1 + if "B-" in data_args.label_list[label]: + start = i - 1 + label_name = data_args.label_list[label][2:] + if start >= 0: + items.append( + { + "pos": [start, len(token_label) - 1], + "entity": input_data[start : len(token_label) - 1], + "label": "", + } + ) + value.append(items) + out_dict = {"value": value, "tokens_label": tokens_label} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w") + json.dump(out_dict, out_file, ensure_ascii=True) + + # Export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + model_args.export_model_dir = os.path.join(model_args.export_model_dir, data_args.dataset, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/utils.py b/model_zoo/ernie-3.0/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..697adfd22bcfadf178fe64c8e475ed0493949c0c --- /dev/null +++ b/model_zoo/ernie-3.0/utils.py @@ -0,0 +1,556 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
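+
+# Shared helpers for the ERNIE 3.0 fine-tuning scripts: YAML config loading,
+# dynamic max-length selection, SQuAD-style feature preparation, the QA loss and
+# trainer, and example conversion for the CLUE and NER datasets.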
+ +from dataclasses import dataclass, field +from typing import List, Optional + +import paddle +import yaml + +from paddlenlp.trainer import PredictionOutput, Trainer + + +def load_config(config_file_path, task_name, dataset_name, model_args, data_args, training_args): + config = yaml.load(open(config_file_path, "r"), Loader=yaml.FullLoader) + # Set the batch size of trainer setting + + config = config[task_name][dataset_name] + for args in (model_args, data_args, training_args): + for arg in config.keys(): + if hasattr(args, arg): + setattr(args, arg, config[arg]) + return model_args, data_args, training_args + + +def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length: List[int]) -> int: + """get max_length by examples which you can change it by examples in batch""" + # if the input is a batch of examples + if isinstance(examples["input_ids"][0], list): + cur_length = max([len(i) for i in examples["input_ids"]]) + # if the input is a single example + else: + cur_length = len(examples["input_ids"]) + + max_length = default_max_length + for max_length_option in sorted(dynamic_max_length): + if cur_length <= max_length_option: + max_length = max_length_option + break + return max_length + + +def prepare_train_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. + contexts = examples["context"] + questions = examples["question"] + + if args.dynamic_max_length is not None: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + max_length = get_dynamic_max_length( + examples=tokenized_examples, + default_max_length=args.max_seq_length, + dynamic_max_length=args.dynamic_max_length, + ) + # always pad to max_length + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=max_length, padding="max_length", truncation=True + ) + else: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. 
+ sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HuggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + if args.dynamic_max_length is not None: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + max_length = get_dynamic_max_length( + examples=tokenized_examples, + default_max_length=args.max_seq_length, + dynamic_max_length=args.dynamic_max_length, + ) + # always pad to max_length + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=max_length, padding="max_length", truncation=True + ) + else: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. 
+ tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +class QuestionAnsweringTrainer(Trainer): + def __init__(self, *args, eval_examples=None, post_process_function=None, **kwargs): + super().__init__(*args, **kwargs) + self.eval_examples = eval_examples + self.post_process_function = post_process_function + + def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + eval_examples = self.eval_examples if eval_examples is None else eval_examples + + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + eval_preds = self.post_process_function(eval_examples, eval_dataset, output.predictions) + metrics = self.compute_metrics(eval_preds) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + self.log(metrics) + else: + metrics = {} + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics) + return metrics + + def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"): + predict_dataloader = self.get_test_dataloader(predict_dataset) + + # Temporarily disable metric computation, we will do it in the loop here. 
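+        # The metrics are actually computed after the evaluation loop, once the
+        # post-processed predictions for the whole dataset are available.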
+ compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + predict_dataloader, + description="Prediction", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is None or self.compute_metrics is None: + return output + + predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict") + metrics = self.compute_metrics(predictions) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return PredictionOutput(predictions=predictions.predictions, label_ids=predictions.label_ids, metrics=metrics) + + +# Data pre-process function for clue benchmark datatset +def seq_convert_example( + example, label_list, tokenizer=None, max_seq_length=512, dynamic_max_length: Optional[List[int]] = None, **kwargs +): + """convert a glue example into necessary features""" + is_test = False + if "label" not in example.keys(): + is_test = True + + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + example["label"] = int(example["label"]) if label_dtype != "float32" else float(example["label"]) + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + + if tokenizer is None: + return example + if "sentence" in example: + if dynamic_max_length is not None: + temp_example = tokenizer(example["sentence"], max_length=max_seq_length, truncation=True) + max_length = get_dynamic_max_length( + examples=temp_example, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + example = tokenizer(example["sentence"], max_length=max_length, padding="max_length", truncation=True) + else: + example = tokenizer(example["sentence"], max_length=max_seq_length, truncation=True) + elif "sentence1" in example: + if dynamic_max_length is not None: + temp_example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_seq_length, + truncation=True, + ) + max_length = 
get_dynamic_max_length( + examples=temp_example, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_length, + padding="max_length", + truncation=True, + ) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_seq_length, + truncation=True, + ) + + if not is_test: + if "token_type_ids" in example: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"]} + + +def token_convert_example( + example, + tokenizer, + no_entity_id, + max_seq_length=512, + return_length=False, + dynamic_max_length: Optional[List[int]] = None, +): + if "labels" in example: + labels = example["labels"] + example = example["tokens"] + if dynamic_max_length is not None: + tokenized_input = tokenizer( + example, + is_split_into_words=True, + max_length=max_seq_length, + truncation=True, + return_length=return_length, + ) + max_length = get_dynamic_max_length( + examples=tokenized_input, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + tokenized_input = tokenizer( + example, + is_split_into_words=True, + max_length=max_length, + padding="max_length", + truncation=True, + return_length=return_length, + ) + else: + tokenized_input = tokenizer( + example, + is_split_into_words=True, + max_length=max_seq_length, + truncation=True, + return_length=return_length, + ) + + # -2 for [CLS] and [SEP] + if len(tokenized_input["input_ids"]) - 2 < len(labels): + labels = labels[: len(tokenized_input["input_ids"]) - 2] + tokenized_input["labels"] = [no_entity_id] + labels + [no_entity_id] + tokenized_input["labels"] += [no_entity_id] * ( + len(tokenized_input["input_ids"]) - len(tokenized_input["labels"]) + ) + else: + if example["tokens"] == []: + if return_length: + tokenized_input = {"labels": [], "input_ids": [], "token_type_ids": [], "length": 0, "seq_len": 0} + else: + tokenized_input = {"labels": [], "input_ids": [], "token_type_ids": []} + + return tokenized_input + if dynamic_max_length is not None: + tokenized_input = tokenizer( + example["tokens"], + max_length=max_seq_length, + truncation=True, + is_split_into_words=True, + return_length=return_length, + ) + max_length = get_dynamic_max_length( + examples=tokenized_input, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + tokenized_input = tokenizer( + example["tokens"], + max_length=max_length, + padding="max_length", + truncation=True, + is_split_into_words=True, + return_length=return_length, + ) + else: + tokenized_input = tokenizer( + example["tokens"], + max_length=max_seq_length, + truncation=True, + is_split_into_words=True, + return_length=return_length, + ) + + label_ids = example["ner_tags"] + if len(tokenized_input["input_ids"]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_input["input_ids"]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + + label_ids += [no_entity_id] * (len(tokenized_input["input_ids"]) - len(label_ids)) + tokenized_input["labels"] = label_ids + return tokenized_input + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. 
+ Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset: str = field(default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + # Additional configs for QA task. + doc_stride: int = field( + default=128, + metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."}, + ) + + n_best_size: int = field( + default=20, + metadata={ + "help": "The total number of n-best predictions to generate in the nbest_predictions.json output file." + }, + ) + + max_query_length: int = field( + default=64, + metadata={"help": "Max query length."}, + ) + + max_answer_length: int = field( + default=30, + metadata={"help": "Max answer length."}, + ) + + dynamic_max_length: Optional[List[int]] = field( + default=None, + metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"}, + ) + + do_lower_case: bool = field( + default=False, + metadata={ + "help": "Whether to lower case the input text. Should be True for uncased models and False for cased models." + }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + null_score_diff_threshold: float = field( + default=0.0, + metadata={ + "help": "The threshold used to select the null answer: if the best answer has a score that is less than " + "the score of the null answer minus this threshold, the null answer is selected for this example. " + "Only useful when `version_2_with_negative=True`." + }, + ) + + # TODO(wj-Mcat): support padding configuration: `max_length`, `longest_first` + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+    """
+
+    model_name_or_path: str = field(
+        metadata={
+            "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html"
+        }
+    )
+    config: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    export_model_dir: Optional[str] = field(
+        default="./best_models",
+        metadata={"help": "Path to directory to store the exported inference model."},
+    )
diff --git a/model_zoo/ernie-code/README.en.md b/model_zoo/ernie-code/README.en.md
new file mode 100644
index 0000000000000000000000000000000000000000..77904c09ce4da8a937ce07af1e9381f5c76c89f7
--- /dev/null
+++ b/model_zoo/ernie-code/README.en.md
@@ -0,0 +1,76 @@
+# ERNIE-Code
+
+[ACL 2023 (Findings)](https://aclanthology.org/2023.findings-acl.676/) | [arXiv](https://arxiv.org/pdf/2212.06742) | [BibTex](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-code/README.md#bibtex) | [中文版](./README.md)
+
+![ernie-code-comp](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2a550b46-a7d5-416d-b300-83cce7044be4)
+
+[ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages](https://aclanthology.org/2023.findings-acl.676.pdf)
+
+
+ERNIE-Code is a unified large language model (LLM) that connects 116 natural languages with 6 programming languages. We employ two pre-training methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation.
+
+## Quick Start
+
+This project is the PaddlePaddle implementation of ERNIE-Code, including model prediction and weight conversion. The brief directory structure and description of this example are as follows:
+
+```text
+├── README.md     # Documentation
+├── predict.py    # Forward prediction demo
+├── converter.py  # Weight conversion script
+```
+
+### Multilingual Text-to-Code / Code-to-Text
+
+This project provides a simple demo for multilingual code/text generation. The startup command is as follows:
+
+```shell
+python predict.py \
+    --input 'BadZipFileのAliasは、古い Python バージョンとの互換性のために。' \
+    --target_lang 'code' \
+    --source_prefix 'translate Japanese to Python: \n' \
+    --max_length 1024 \
+    --num_beams 3 \
+    --device 'gpu'
+```
+
+Explanation of the parameters:
+- `input`: The input sequence.
+- `target_lang`: The target language, which can be set to 'text' or 'code'.
+- `source_prefix`: The prompt.
+- `max_length`: The maximum length of input/output text.
+- `num_beams`: The number of beams to keep at each decoding step (for beam search).
+- `device`: The running device, which can be set to 'cpu' or 'gpu'.
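+
+If you prefer calling the model from Python rather than the CLI demo, the following is a minimal sketch that mirrors `predict.py`; the `ernie-code-base-L512` checkpoint name and the generation settings are simply that script's defaults, so adjust them to your setup:
+
+```python
+import paddle
+
+from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer
+
+# Same default checkpoint as predict.py.
+tokenizer = AutoTokenizer.from_pretrained("ernie-code-base-L512")
+model = AutoModelForConditionalGeneration.from_pretrained("ernie-code-base-L512")
+model.eval()
+
+# Prompt = source_prefix + input, as in the shell command above.
+prompt = "translate Japanese to Python: \n" + "BadZipFileのAliasは、古い Python バージョンとの互換性のために。"
+encoded = tokenizer(prompt, max_length=1024)
+input_ids = paddle.to_tensor([encoded["input_ids"]], dtype="int64")
+attention_mask = paddle.to_tensor([encoded["attention_mask"]], dtype="int64")
+
+# Beam-search generation with the same flags as the CLI demo.
+output_ids, _ = model.generate(
+    input_ids,
+    attention_mask=attention_mask,
+    max_length=1024,
+    num_beams=3,
+    decode_strategy="beam_search",
+)
+# For code generation, keep special tokens and spacing (see predict.py).
+print(tokenizer.batch_decode(output_ids.numpy(), skip_special_tokens=False, clean_up_tokenization_spaces=False))
+```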
+ + +### Zero-shot Examples +- Multilingual code-to-text generation (zero-shot) + +![code-to-text-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/7dbf225e-e6be-401d-9f6c-f733e2f68f76) + +![zh_code-to-text_examples-1](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2d1ba091-f43c-4f3e-95c6-0038ede9e63e) + +- Multilingual text-to-text translation (zero-shot) + +![zero-shot-mt-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/8be1a977-fa21-4a46-86ba-136fa8276a1a) + + +## BibTeX +``` +@inproceedings{chai-etal-2023-ernie, + title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages", + author = "Chai, Yekun and + Wang, Shuohuan and + Pang, Chao and + Sun, Yu and + Tian, Hao and + Wu, Hua", + booktitle = "Findings of the Association for Computational Linguistics: ACL 2023", + month = jul, + year = "2023", + address = "Toronto, Canada", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2023.findings-acl.676", + pages = "10628--10650", + abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. 
We release our code and pre-trained checkpoints.", +} +``` diff --git a/model_zoo/ernie-code/README.md b/model_zoo/ernie-code/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5562d043d24004fba13e873a7365f44004d1a828 --- /dev/null +++ b/model_zoo/ernie-code/README.md @@ -0,0 +1,81 @@ +# ERNIE-Code + +[ACL 2023 (Findings)](https://aclanthology.org/2023.findings-acl.676/) | [arXiv](https://arxiv.org/pdf/2212.06742) | [BibTex](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-code/README.md#bibtex) | [English version](./README.en.md) + +![ernie-code-comp](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2a550b46-a7d5-416d-b300-83cce7044be4) + +[ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages](https://aclanthology.org/2023.findings-acl.676.pdf) + + +ERNIE-Code是一个多自然语言、多编程语言的统一代码语言模型(Code LLM),支持116种自然语言和6+种编程语言。采用了两种预训练方法来进行跨语言预训练: +- Span-Corruption Language Modeling (SCLM) 从单语言的自然语言或编程语言中进行掩码语言学习; +- Pivot-based Translation Language Modeling (PTLM),将多自然语言到多编程语言的映射 规约为,以英语为枢轴(pivot)的多自然语言到英语、和英语到多编程语言的联合学习。 + +ERNIE-Code在代码智能的各种下游任务中,包括代码到多自然语言、多自然语言到代码、代码到代码、多自然语言文档翻译等任务,优于以前的多语言代码和文本模型(例如mT5 和 CodeT5),同时在多自然语言的代码摘要和文档翻译等任务上具备较好的的zero-shot prompt能力。 + +## 快速开始 + +本项目是ERNIE-Code的PaddlePaddle实现,包括模型预测和权重转换。以下是该示例的简要目录结构和说明: + +```text +├── README.md # 文档 +├── predict.py # 前向预测示例 +├── converter.py # 权重转换脚本 +``` + +### 多语言文本到代码/代码到文本 + +本项目提供了一个简单的多语言代码/文本生成的演示。启动命令如下: + +```shell +python predict.py \ + --input 'BadZipFileのAliasは、古い Python バージョンとの互換性のために。' \ + --target_lang 'code' \ + --source_prefix 'translate Japanese to Python: \n' \ + --max_length 1024 \ + --num_beams 3 \ + --device 'gpu' +``` + +配置文件中参数的解释: +- `input`:输入的文本序列。 +- `target_lang`:目标语言,可设置为'text'或'code'。 +- `source_prefix`:提示词Prompt。 +- `max_length`:输入/输出文本的最大长度。 +- `num_beams`:解码时每个时间步保留的beam大小(用于束搜索)。 +- `device`:运行设备,可设置为'cpu'或'gpu'。 + + + +### Zero-shot示例 +- 多语言代码到文本生成(zero-shot) + +![code-to-text-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/7dbf225e-e6be-401d-9f6c-f733e2f68f76) + +![zh_code-to-text_examples-1](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2d1ba091-f43c-4f3e-95c6-0038ede9e63e) + +- 计算机术语翻译(zero-shot) + +![zero-shot-mt-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/8be1a977-fa21-4a46-86ba-136fa8276a1a) + + +## BibTeX +``` +@inproceedings{chai-etal-2023-ernie, + title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages", + author = "Chai, Yekun and + Wang, Shuohuan and + Pang, Chao and + Sun, Yu and + Tian, Hao and + Wu, Hua", + booktitle = "Findings of the Association for Computational Linguistics: ACL 2023", + month = jul, + year = "2023", + address = "Toronto, Canada", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2023.findings-acl.676", + pages = "10628--10650", + abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. 
We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.", +} +``` diff --git a/model_zoo/ernie-code/convert.py b/model_zoo/ernie-code/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..900e8437bea71025d1f9e78184c178296609ec17 --- /dev/null +++ b/model_zoo/ernie-code/convert.py @@ -0,0 +1,68 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from collections import OrderedDict + +dont_transpose = [ + "shared.weight", + "layer_norm.weight", + ".layer_norm.weight", + "relative_attention_bias.weight", + "embed_tokens.weight", +] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import paddle + import torch + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + transpose = False + + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + transpose = True + + print(f"Converting: {k} | is_transpose {transpose}") + + if k != "lm_head.weight": + k = "ErnieCode." + k + # The bf16 data of torch cannot be directly converted to paddle + paddle_state_dict[k] = paddle.to_tensor(v.to(torch.float32).numpy()).cast(paddle.bfloat16).numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="/home/models/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="/home/models/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/model_zoo/ernie-code/predict.py b/model_zoo/ernie-code/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..95b916bae7d232e252bfed86e47c30e68762d667 --- /dev/null +++ b/model_zoo/ernie-code/predict.py @@ -0,0 +1,76 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import numpy as np +import paddle + +from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer + +parser = argparse.ArgumentParser("ERNIE-CODE") +parser.add_argument( + "--model_name_or_path", + default="ernie-code-base-L512", + type=str, +) +parser.add_argument("--input", default="BadZipFileのAliasは、古い Python バージョンとの互換性のために。", type=str) +parser.add_argument("--target_lang", default="code", type=str) +parser.add_argument("--source_prefix", default="translate Japanese to Python: \n", type=str) +parser.add_argument("--max_length", type=int, default=1024) +parser.add_argument("--num_beams", type=int, default=3) +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"]) + +args = parser.parse_args() + + +def predict(): + + paddle.set_device(args.device) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForConditionalGeneration.from_pretrained(args.model_name_or_path) + + prefix = args.source_prefix if args.source_prefix is not None else "" + + def preprocess_function(inputs, tokenizer): + inputs = [prefix + inp for inp in inputs] + + model_inputs = tokenizer(inputs, max_length=args.max_length) + return model_inputs + + dev_dataset = [args.input] + model_inputs = preprocess_function(dev_dataset, tokenizer) + model.eval() + gen_kwargs = { + "max_length": args.max_length, + "num_beams": args.num_beams, + "decode_strategy": "beam_search", + "length_penalty": 0, + "min_length": 0, + } + generated_tokens, _ = model.generate( + paddle.to_tensor(np.array(model_inputs["input_ids"]).reshape(1, -1).astype("int64")), + attention_mask=paddle.to_tensor(np.array(model_inputs["attention_mask"]).reshape(1, -1).astype("int64")), + **gen_kwargs, + ) + if args.target_lang == "text": + decoded_preds = tokenizer.batch_decode(generated_tokens.numpy(), skip_special_tokens=True) + elif args.target_lang == "code": + decoded_preds = tokenizer.batch_decode( + generated_tokens.numpy(), skip_special_tokens=False, clean_up_tokenization_spaces=False + ) + print(decoded_preds) + + +if __name__ == "__main__": + predict() diff --git a/model_zoo/ernie-doc/README.md b/model_zoo/ernie-doc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ca3d5a064826f925a166f43d520d7bfea33aff78 --- /dev/null +++ b/model_zoo/ernie-doc/README.md @@ -0,0 +1,215 @@ +# ERNIE-Doc: A Retrospective Long-Document Modeling Transformer + +* [模型简介](#模型简介) +* [快速开始](#快速开始) + * [环境依赖](#环境依赖) + * [通用参数释义](#通用参数释义) + * [分类任务](#分类任务) + * [阅读理解任务](#阅读理解任务) + * [语义匹配任务](#语义匹配任务) + * [序列标注任务](#序列标注任务) +* [致谢](#致谢) +* [参考论文](#参考论文) + +## 模型简介 +[ERNIE-Doc](https://arxiv.org/abs/2012.15688)是百度NLP提出的针对长文本的预训练模型。在循环Transformer机制之上,创新性地提出两阶段重复学习以及增强的循环机制,以此提高模型感受野,加强模型对长文本的理解能力。 + +本项目是 ERNIE-Doc 的 PaddlePaddle 动态图实现, 包含模型训练,模型验证等内容。以下是本例的简要目录结构及说明: + +```text +. 
+├── README.md # 文档 +├── data.py # 数据处理 +├── metrics.py # ERNIE-Doc下游任务指标 +├── model.py # 下游任务模型实现 +├── optimization.py # 优化算法 +├── run_classifier.py # 分类任务 +├── run_mcq.py # 阅读理解任务,单项选择题 +├── run_mrc.py # 抽取式阅读理解任务 +├── run_semantic_matching.py # 语义匹配任务 +└── run_sequence_labeling.py # 序列标注任务 + +``` + +## 快速开始 + +### 环境依赖 + +- nltk +- beautifulsoup4 + +安装命令:`pip install nltk==3.5 beautifulsoup4` + +初次使用时,需要下载nltk的模型,可运行以下命令(下载模型可能比较慢,请耐心等待): + +``` +python -c "import nltk; nltk.download('punkt')" +``` + +### 通用参数释义 + +- `model_name_or_path` 指示了Fine-tuning使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"ernie-doc-base-zh", "ernie-doc-base-en"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `dataset` 表示Fine-tuning需要加载的数据集。 +- `memory_length` 表示当前的句子被截取作为下一个样本的特征的长度。 +- `max_seq_length` 表示最大句子长度,超过该长度的部分将被切分成下一个样本。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `seed` 表示随机数种子。 +- `weight_decay` 表示AdamW的权重衰减系数。 +- `warmup_proportion` 表示学习率warmup系数。 +- `layerwise_decay` 表示AdamW with Layerwise decay的逐层衰减系数。 + +由于不同任务、不同数据集所设的超参数差别较大,可查看[ERNIE-Doc](https://arxiv.org/abs/2012.15688)论文附录中具体超参设定,此处不一一列举。 + +### 分类任务 + +分类任务支持多种数据集的评测,目前支持`imdb`, `iflytek`, `thucnews`, `hyp`四个数据集(有关数据集的描述可查看[PaddleNLP文本分类数据集](../../docs/data_prepare/dataset_list.md))。可通过参数`dataset`指定具体的数据集,下面以`imdb`为例子运行分类任务。 + +#### 单卡训练 + +```shell +python run_classifier.py --batch_size 8 --model_name_or_path ernie-doc-base-en + +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir imdb run_classifier.py --batch_size 8 --model_name_or_path ernie-doc-base-en + +``` + +在`imdb`, `iflytek`, `thucnews`, `hyp`各数据集上Fine-tuning后,在验证集上有如下结果: + +| Dataset | Model | Dev ACC | +|:---------:|:-----------------:|:----------------:| +| IMDB | ernie-doc-base-en | 0.9506 | +| THUCNews | ernie-doc-base-zh | 0.9854 | +| HYP | ernie-doc-base-en | 0.7412 | +| IFLYTEK | ernie-doc-base-zh | 0.6179 | + + +### 阅读理解任务 + +阅读理解任务支持抽取式阅读理解与单项选择题任务。 + +- 抽取式阅读理解 + +目前抽取式阅读理解支持`duredear-robust`, `drcd`,`cmrc2018`数据集。可通过参数`dataset`指定具体的数据集,下面以`dureader_robust`为例子运行抽取式阅读理解任务。 + +#### 单卡训练 + +```shell +python run_mrc.py --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4 +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir dureader_robust run_mrc.py --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4 +``` + +在`duredear-robust`, `drcd`, `cmrc2018`各数据集上Fine-tuning后,在验证集上有如下结果: + +| Dataset | Model | Dev EM/F1 | +|:--------------:|:-----------------:|:----------------:| +| Dureader-robust| ernie-doc-base-zh | 0.7481/0.8637 | +| DRCD | ernie-doc-base-zh | 0.8879/0.9392 | +| CMRC2018 | ernie-doc-base-zh | 0.7061/0.9004 | + + +- 单项选择题 + +[C3](https://github.com/nlpdata/c3)是首个自由形式的多选项中文机器阅读理解数据集。该数据集每个样本提供一个上下文(文章或者对话)、问题以及至多四个答案选项,要求从答案选项中选择一个正确选项。 + +目前PaddleNLP提供`C3`阅读理解单项选择题数据集,可执行以下命令运行该任务。 + +#### 单卡训练 + +```shell +python run_mcq.py --batch_size 8 + +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir mcq run_mcq.py --batch_size 8 + +``` + +在`C3`数据集上Fine-tuning后,在验证集上有如下结果: +| Dataset | Model | Dev/Test Acc | +|:--------------:|:-----------------:|:----------------:| +| C3 | ernie-doc-base-zh | 0.7573/0.7583 | + + +### 语义匹配任务 + +[CAIL2019 
SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) 数据集是来自“中国裁判文书网”公开的法律文书,其中每份数据由三篇法律文书组成。对于每份数据,用`(A,B,C)`来代表该组数据,其中`(A,B,C)`均对应某一篇文书。该任务要求判别similarity(A, B)是否大于similarity(A, C)。 + +可执行以下命令运行该任务。 + +#### 单卡训练 + +```shell +python run_semantic_matching.py --batch_size 6 --learning_rate 2e-5 +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir cail run_semantic_matching.py --batch_size 6 --learning_rate 2e-5 +``` + +在`CAIL2019-SCM`数据集上Fine-tuning后,在验证集与测试集上有如下结果: + +| Dataset | Model | Dev/Test Acc | +|:--------------:|:-----------------:|:----------------:| +| CAIL2019-SCM | ernie-doc-base-zh | 0.6420/0.6484 | + + +### 序列标注任务 + + +MSRA-NER 数据集由微软亚研院发布,其目标是识别文本中具有特定意义的实体,主要包括人名、地名、机构名等。示例如下: + +``` +不\002久\002前\002,\002中\002国\002共\002产\002党\002召\002开\002了\002举\002世\002瞩\002目\002的\002第\002十\002五\002次\002全\002国\002代\002表\002大\002会\002。 O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O\002O\002O\002O\002O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O +这\002次\002代\002表\002大\002会\002是\002在\002中\002国\002改\002革\002开\002放\002和\002社\002会\002主\002义\002现\002代\002化\002建\002设\002发\002展\002的\002关\002键\002时\002刻\002召\002开\002的\002历\002史\002性\002会\002议\002。 O\002O\002O\002O\002O\002O\002O\002O\002B-LOC\002I-LOC\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O +``` + +PaddleNLP集成的数据集MSRA-NER数据集对文件格式做了调整:每一行文本、标签以特殊字符"\t"进行分隔,每个字之间以特殊字符"\002"分隔。 + +可执行以下命令运行序列标注任务。 + +#### 单卡训练 + +```shell +python run_sequence_labeling.py --batch_size 8 --learning_rate 3e-5 +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir msra_ner run_sequence_labeling.py --batch_size 8 --learning_rate 3e-5 +``` + +在`MSRA-NER`数据集上Fine-tuning后,在验证集与测试集上有如下最佳结果: + +| Dataset | Model | Precision/Recall/F1 | +|:--------------:|:-----------------:|:-----------------------:| +| MSRA-NER | ernie-doc-base-zh | 0.9288/0.9139/0.9213 | + + +## 致谢 +* 感谢[百度NLP](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc)提供ERNIE-Doc开源代码的实现以及预训练模型。 + +## 参考论文 + +* Siyu Ding, Junyuan Shang et al. "ERNIE-Doc: A Retrospective Long-Document Modeling Transformer" ACL, 2021 diff --git a/model_zoo/ernie-doc/data.py b/model_zoo/ernie-doc/data.py new file mode 100644 index 0000000000000000000000000000000000000000..93a36443458aed06b19475936ff9d275f87fa53b --- /dev/null +++ b/model_zoo/ernie-doc/data.py @@ -0,0 +1,1194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
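+
+"""Data pipeline for the ERNIE-Doc fine-tuning examples.
+
+This module provides batch iterators for the downstream tasks in this example:
+ClassifierIterator (classification), MRCIterator (extractive MRC), MCQIterator
+(multiple-choice QA), SemanticMatchingIterator and SequenceLabelingIterator.
+Long inputs are split into overlapping spans (controlled by `memory_len` /
+`doc_stride`) and padded with `pad_batch_data` before being fed to the model.
+"""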
+ +import itertools +from collections import namedtuple + +import numpy as np +from paddle.utils import try_import + +from paddlenlp.transformers import tokenize_chinese_chars +from paddlenlp.utils.log import logger + +__all__ = ["ClassifierIterator", "MRCIterator", "MCQIterator"] + + +def get_related_pos(insts, seq_len, memory_len=128): + """generate relative postion ids""" + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype("int64").reshape([len(insts), beg, 1]) + + +def pad_batch_data( + insts, + insts_data_type="int64", + pad_idx=0, + final_cls=False, + pad_max_len=None, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False, + return_seq_lens=False, +): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + if pad_max_len: + max_len = pad_max_len + else: + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + # Input id + if final_cls: + inst_data = np.array([inst[:-1] + list([pad_idx] * (max_len - len(inst))) + [inst[-1]] for inst in insts]) + else: + inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts]) + return_list += [inst_data.astype(insts_data_type).reshape([-1, max_len, 1])] + + # Position id + if return_pos: + inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + + if return_input_mask: + # This is used to avoid attention on paddings. + if final_cls: + input_mask_data = np.array([[1] * len(inst[:-1]) + [0] * (max_len - len(inst)) + [1] for inst in insts]) + else: + input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens_type = [-1] + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape(seq_lens_type)] + + return return_list if len(return_list) > 1 else return_list[0] + + +class TextPreprocessor(object): + def __call__(self, text): + raise NotImplementedError("TextPreprocessor object can't be called") + + +class ImdbTextPreprocessor(TextPreprocessor): + def __call__(self, text): + text = text.strip().replace("
<br />
", " ") + text = text.replace("\t", "") + return text + + +class HYPTextPreprocessor(TextPreprocessor): + def __init__(self): + self.bs4 = try_import("bs4") + + def __call__(self, text): + text = self.bs4.BeautifulSoup(text, "html.parser").get_text() + text = text.strip().replace("\n", "").replace("\t", "") + return text + + +class ClassifierIterator(object): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + preprocess_text_fn=None, + ): + self.batch_size = batch_size + self.tokenizer = tokenizer + self.trainer_num = trainer_num + self.trainer_id = trainer_id + self.max_seq_length = max_seq_length + self.memory_len = memory_len + self.repeat_input = repeat_input + self.in_tokens = in_tokens + self.dataset = [data for data in dataset] + self.num_examples = None + self.mode = mode + self.shuffle = True if mode == "train" else False + if random_seed is None: + random_seed = 12345 + self.random_seed = random_seed + self.preprocess_text_fn = preprocess_text_fn + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.dataset) + + def _cnt_list(self, inp): + """Cnt_list""" + cnt = 0 + for lit in inp: + if lit: + cnt += 1 + return cnt + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + if "text" in example: # imdb + text = example["text"] + elif "sentence" in example: # iflytek + text = example["sentence"] + + if self.preprocess_text_fn: + text = self.preprocess_text_fn(text) + label = example["label"] + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + tokens_a = self.tokenizer.tokenize(text) + while start_offset < len(tokens_a): + length = len(tokens_a) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens_a): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_id", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = tokens_a[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + ["[CLS]"] + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + features.append(Feature(src_ids=token_ids, label_id=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat + return features + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [len(doc) for doc in pre_batch_list] + max_len_idx = len_doc.index(max(len_doc)) + dirty_sample = pre_batch_list[max_len_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + sample_list.extend([dirty_sample] * (max(len_doc) - len(sample_list))) + + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + def _pad_batch_records(self, batch_records, gather_idx=[]): + 
batch_token_ids = [record.src_ids for record in batch_records] + if batch_records[0].label_id is not None: + batch_labels = [record.label_id for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + if batch_records[-1].qid is not None: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + final_cls=True, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples): + batch_records, max_len, gather_idx = [], 0, [] + for index, example in enumerate(examples): + max_len = max(max_len, len(example.src_ids)) + if self.in_tokens: + to_append = (len(batch_records) + 1) * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(example) + if example.cal_loss == 1: + gather_idx.append(index % self.batch_size) + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [example], len(example.src_ids) + gather_idx = [index % self.batch_size] if example.cal_loss == 1 else [] + yield self._pad_batch_records(batch_records, gather_idx) + + def _create_instances(self): + examples = self.dataset + pre_batch_list = [] + insert_idx = [] + for qid, example in enumerate(examples): + features = self._convert_to_features(example, qid) + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + def __call__(self): + curr_id = 0 + for batch_records in self._create_instances(): + if curr_id == self.trainer_id or self.mode != "train": + yield batch_records + curr_id = (curr_id + 1) % self.trainer_num + + def get_num_examples(self): + if self.num_examples 
is None: + self.num_examples = 0 + for qid, example in enumerate(self.dataset): + self.num_examples += len(self._convert_to_features(example, qid)) + return self.num_examples + + +class MRCIterator(ClassifierIterator): + """ + Machine Reading Comprehension iterator. Only for answer extraction. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + ): + super(MRCIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.doc_stride = doc_stride + self.max_query_length = max_query_length + self.examples = [] + self.features = [] + self.features_all = [] + self._preprocess_data() + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.features_all) + + def _convert_qa_to_examples(self): + Example = namedtuple( + "Example", ["qas_id", "question_text", "doc_tokens", "orig_answer_text", "start_position", "end_position"] + ) + examples = [] + for qa in self.dataset: + qas_id = qa["id"] + question_text = qa["question"] + context = qa["context"] + start_pos = None + end_pos = None + orig_answer_text = None + if self.mode == "train": + if len(qa["answers"]) != 1: + raise ValueError("For training, each question should have exactly 1 answer.") + orig_answer_text = qa["answers"][0] + answer_offset = qa["answer_starts"][0] + answer_length = len(orig_answer_text) + doc_tokens = [ + context[:answer_offset], + context[answer_offset : answer_offset + answer_length], + context[answer_offset + answer_length :], + ] + + start_pos = 1 + end_pos = 1 + + actual_text = " ".join(doc_tokens[start_pos : (end_pos + 1)]) + if orig_answer_text.islower(): + actual_text = actual_text.lower() + if actual_text.find(orig_answer_text) == -1: + logger.info("Could not find answer: '%s' vs. 
'%s'" % (actual_text, orig_answer_text)) + continue + + else: + doc_tokens = tokenize_chinese_chars(context) + + example = Example( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_pos, + end_position=end_pos, + ) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", + [ + "qid", + "example_index", + "doc_span_index", + "tokens", + "token_to_orig_map", + "token_is_max_context", + "src_ids", + "start_position", + "end_position", + "cal_loss", + ], + ) + features = [] + self.features_all = [] + unique_id = 1000 + is_training = self.mode == "train" + print("total {} examples".format(len(examples)), flush=True) + for (example_index, example) in enumerate(examples): + query_tokens = self.tokenizer.tokenize(example.question_text) + if len(query_tokens) > self.max_query_length: + query_tokens = query_tokens[0 : self.max_query_length] + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = self.tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = self._improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, example.orig_answer_text + ) + + max_tokens_for_doc = self.max_seq_length - len(query_tokens) - 3 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + tokens.append("[CLS]") + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[i + 1] = tok_to_orig_index[split_token_index] + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[i + 1] = is_max_context + tokens += all_doc_tokens[doc_span.start : doc_span.start + doc_span.length] + tokens.append("[SEP]") + + for token in query_tokens: + tokens.append(token) + tokens.append("[SEP]") + + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + start_position = None + end_position = None + if is_training: + doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = 1 # len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + feature = Feature( + qid=unique_id, + example_index=example_index, + 
doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + src_ids=token_ids, + start_position=start_position, + end_position=end_position, + cal_loss=1, + ) + features.append(feature) + features_each.append(feature) + if example_index % 1000 == 0: + print("processing {} examples".format(example_index), flush=True) + + unique_id += 1 + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _preprocess_data(self): + # Construct examples + self.examples = self._convert_qa_to_examples() + # Construct features + self.features = self._convert_example_to_feature(self.examples) + + def get_num_examples(self): + if not self.features_all: + self._preprocess_data() + return len(sum(self.features_all, [])) + + def _improve_answer_span(self, doc_tokens, input_start, input_end, orig_answer_text): + """Improve answer span""" + tok_answer_text = " ".join(self.tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start : (new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check is max context""" + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + break + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + if best_span_index > cur_span_index: + return False + + return cur_span_index == best_span_index + + def _pad_batch_records(self, batch_records, gather_idx=[]): + """Pad batch data""" + batch_token_ids = [record.src_ids for record in batch_records] + + if self.mode == "train": + batch_start_position = [record.start_position for record in batch_records] + batch_end_position = [record.end_position for record in batch_records] + batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1, 1]) + batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1, 1]) + else: + batch_size = len(batch_token_ids) + batch_start_position = np.zeros(shape=[batch_size, 1], dtype="int64") + batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64") + + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_task_ids, 
self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_start_position, + batch_end_position, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + + return return_list + + def _create_instances(self): + """Generate batch records""" + pre_batch_list = [] + insert_idx = [] + for qid, features in enumerate(self.features_all): + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + +class MCQIterator(MRCIterator): + """ + Multiple choice question iterator. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + choice_num=4, + ): + self.choice_num = choice_num + super(MCQIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + ) + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
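+        # For example, with max_length=7, a 6-token question and a 3-token choice,
+        # two tokens are popped from the question side, leaving 4 + 3 = 7 tokens.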
+ tokens_a = list(tokens_a) + tokens_b = list(tokens_b) + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + return tokens_a, tokens_b + + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qas_id", "context", "question", "choice", "label"]) + examples = [] + for qas_id, qa in enumerate(self.dataset): + context = "\n".join(qa["context"]).lower() + question = qa["question"].lower() + choice = [c.lower() for c in qa["choice"]] + # pad empty choice + for k in range(len(choice), self.choice_num): + choice.append("") + label = qa["label"] + + example = Example(qas_id=qas_id, context=context, question=question, choice=choice, label=label) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple("Feature", ["qid", "src_ids", "segment_ids", "label", "cal_loss"]) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + context_tokens = self.tokenizer.tokenize(example.context) + question_tokens = self.tokenizer.tokenize(example.question) + choice_tokens_lst = [self.tokenizer.tokenize(choice) for choice in example.choice] + # nums = 4 + question_choice_pairs = [ + self._truncate_seq_pair(question_tokens, choice_tokens, self.max_query_length - 2) + for choice_tokens in choice_tokens_lst + ] + total_qc_num = sum([(len(q) + len(c)) for q, c in question_choice_pairs]) + max_tokens_for_doc = self.max_seq_length - total_qc_num - 4 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + while start_offset < len(context_tokens): + length = len(context_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(context_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + qa_features = [] + for q_tokens, c_tokens in question_choice_pairs: + segment_tokens = ["[CLS]"] + token_type_ids = [0] + + segment_tokens += context_tokens[doc_span.start : doc_span.start + doc_span.length] + token_type_ids += [0] * doc_span.length + + segment_tokens += ["[SEP]"] + token_type_ids += [0] + + segment_tokens += q_tokens + token_type_ids += [1] * len(q_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + segment_tokens += c_tokens + token_type_ids += [1] * len(c_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + input_ids = self.tokenizer.convert_tokens_to_ids(segment_tokens) + feature = Feature( + qid=example.qas_id, + label=example.label, + src_ids=input_ids, + segment_ids=token_type_ids, + cal_loss=1, + ) + qa_features.append(feature) + + features.append(qa_features) + features_each.append(qa_features) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [[record.src_ids for record in records] for records in batch_records] + if batch_records[0][0].label is not None: + batch_labels = [[record.label for record in records] for records in batch_records] + batch_labels = 
np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [[record.qid for record in records] for records in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + batch_task_ids = [[record.segment_ids for record in records] for records in batch_records] + + # Padding + batch_padded_token_ids = [] + batch_input_mask = [] + batch_padded_task_ids = [] + batch_padded_position_ids = [] + batch_size = len(batch_token_ids) + for i in range(batch_size): + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids[i], + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids[i], pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + batch_padded_token_ids.append(padded_token_ids) + batch_input_mask.append(input_mask) + batch_padded_task_ids.append(padded_task_ids) + batch_padded_position_ids.append(padded_position_ids) + + batch_padded_token_ids = ( + np.array(batch_padded_token_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_position_ids = ( + np.array(batch_padded_position_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_task_ids = ( + np.array(batch_padded_task_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_input_mask = np.array(batch_input_mask).astype("float32").reshape([batch_size * self.choice_num, -1, 1]) + + return_list = [ + batch_padded_token_ids, + batch_padded_position_ids, + batch_padded_task_ids, + batch_input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples_list): + batch_records, max_len, gather_idx = [], 0, [] + real_batch_size = self.batch_size * self.choice_num + index = 0 + for examples in examples_list: + records = [] + gather_idx_candidate = [] + for example in examples: + if example.cal_loss == 1: + gather_idx_candidate.append(index % real_batch_size) + max_len = max(max_len, len(example.src_ids)) + records.append(example) + index += 1 + + if self.in_tokens: + to_append = (len(batch_records) + 1) * self.choice_num * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(records) + gather_idx += gather_idx_candidate + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [records], max(len(record.src_ids) for record in records) + gather_idx = gather_idx_candidate + if len(batch_records) > 0: + yield self._pad_batch_records(batch_records, gather_idx) + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [[len(doc) for doc in doc_list] for doc_list in pre_batch_list] + len_doc = list(itertools.chain(*len_doc)) + max_len_idx = len_doc.index(max(len_doc)) + doc_idx = max_len_idx % self.choice_num + doc_list_idx = max_len_idx // self.choice_num + dirty_sample = 
pre_batch_list[doc_list_idx][doc_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + for samples in sample_list: + samples.extend([dirty_sample] * (max(len_doc) - len(samples))) + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + +class SemanticMatchingIterator(MRCIterator): + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qid", "text_a", "text_b", "text_c", "label"]) + examples = [] + for qid, qa in enumerate(self.dataset): + text_a, text_b, text_c = list( + map(lambda x: x.replace("\n", "").strip(), [qa["text_a"], qa["text_b"], qa["text_c"]]) + ) + + example = Example(qid=qid, text_a=text_a, text_b=text_b, text_c=text_c, label=qa["label"]) + examples += [example] + return examples + + def _create_tokens_and_type_id(self, text_a_tokens, text_b_tokens, start, length): + tokens = ( + ["[CLS]"] + + text_a_tokens[start : start + length] + + ["[SEP]"] + + text_b_tokens[start : start + length] + + ["[SEP]"] + ) + token_type_ids = [0] + [0] * (length + 1) + [1] * (length + 1) + return tokens, token_type_ids + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", ["qid", "src_ids", "segment_ids", "pair_src_ids", "pair_segment_ids", "label", "cal_loss"] + ) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + text_a_tokens = self.tokenizer.tokenize(example.text_a) + text_b_tokens = self.tokenizer.tokenize(example.text_b) + text_c_tokens = self.tokenizer.tokenize(example.text_c) + a_len, b_len, c_len = list(map(lambda x: len(x), [text_a_tokens, text_b_tokens, text_c_tokens])) + + # Align 3 text + min_text_len = min([a_len, b_len, c_len]) + text_a_tokens = text_a_tokens[:min_text_len] + text_b_tokens = text_b_tokens[:min_text_len] + text_c_tokens = text_c_tokens[:min_text_len] + + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + max_tokens_for_doc = (self.max_seq_length - 3) // 2 + + while start_offset < len(text_a_tokens): + length = len(text_a_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(text_a_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens1, token_type_ids1 = self._create_tokens_and_type_id( + text_a_tokens, text_b_tokens, doc_span.start, doc_span.length + ) + tokens2, token_type_ids2 = self._create_tokens_and_type_id( + text_a_tokens, text_c_tokens, doc_span.start, doc_span.length + ) + + input_ids1 = self.tokenizer.convert_tokens_to_ids(tokens1) + input_ids2 = self.tokenizer.convert_tokens_to_ids(tokens2) + feature = Feature( + qid=example.qid, + label=example.label, + src_ids=input_ids1, + segment_ids=token_type_ids1, + pair_src_ids=input_ids2, + pair_segment_ids=token_type_ids2, + cal_loss=1, + ) + + features.append(feature) + features_each.append(feature) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return 
features + + def _create_pad_ids(self, batch_records, prefix=""): + src_ids = prefix + "src_ids" + segment_ids = prefix + "segment_ids" + batch_token_ids = [getattr(record, src_ids) for record in batch_records] + batch_task_ids = [getattr(record, segment_ids) for record in batch_records] + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids, pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + return [padded_token_ids, padded_position_ids, padded_task_ids, input_mask] + + def _pad_batch_records(self, batch_records, gather_idx=[]): + if batch_records[0].label is not None: + batch_labels = [record.label for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + return_list = ( + self._create_pad_ids(batch_records) + + self._create_pad_ids(batch_records, "pair_") + + [batch_labels, batch_qids, batch_gather_idx, need_cal_loss] + ) + return return_list + + +class SequenceLabelingIterator(ClassifierIterator): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + no_entity_id=-1, + ): + super(SequenceLabelingIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.no_entity_id = no_entity_id + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + tokens = example["tokens"] + label = example["labels"] + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + while start_offset < len(tokens): + length = len(tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_ids", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + curr_tokens = ["[CLS]"] + tokens[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + token_ids = self.tokenizer.convert_tokens_to_ids(curr_tokens) + label = ( + [self.no_entity_id] + label[doc_span.start : doc_span.start + doc_span.length] + [self.no_entity_id] + ) + + features.append(Feature(src_ids=token_ids, label_ids=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat 
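+        # When repeat_input is enabled, `features` now holds a loss-masked (cal_loss=0)
+        # copy of every span followed by the original spans, so each document is read
+        # twice, matching the retrospective (two-pass) feed described in ERNIE-Doc.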
+        return features
+
+    def _pad_batch_records(self, batch_records, gather_idx=[]):
+        batch_token_ids = [record.src_ids for record in batch_records]
+        batch_length = [len(record.src_ids) for record in batch_records]
+        batch_length = np.array(batch_length).astype("int64").reshape([-1, 1])
+
+        if batch_records[0].label_ids is not None:
+            batch_labels = [record.label_ids for record in batch_records]
+        else:
+            batch_labels = np.array([]).astype("int64").reshape([-1, 1])
+        # Qid
+        if batch_records[-1].qid is not None:
+            batch_qids = [record.qid for record in batch_records]
+            batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1])
+        else:
+            batch_qids = np.array([]).astype("int64").reshape([-1, 1])
+
+        if gather_idx:
+            batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1])
+            need_cal_loss = np.array([1]).astype("int64")
+        else:
+            batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1])
+            need_cal_loss = np.array([0]).astype("int64")
+        # Padding
+        padded_token_ids, input_mask = pad_batch_data(
+            batch_token_ids,
+            pad_idx=self.tokenizer.pad_token_id,
+            pad_max_len=self.max_seq_length,
+            return_input_mask=True,
+        )
+        if batch_records[0].label_ids is not None:
+            padded_batch_labels = pad_batch_data(
+                batch_labels, pad_idx=self.no_entity_id, pad_max_len=self.max_seq_length
+            )
+        else:
+            # Fall back to the empty label array built above so that
+            # padded_batch_labels is always defined when no labels are provided.
+            padded_batch_labels = batch_labels
+        padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64")
+        padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len)
+
+        return_list = [
+            padded_token_ids,
+            padded_position_ids,
+            padded_task_ids,
+            input_mask,
+            padded_batch_labels,
+            batch_length,
+            batch_qids,
+            batch_gather_idx,
+            need_cal_loss,
+        ]
+        return return_list
diff --git a/model_zoo/ernie-doc/metrics.py b/model_zoo/ernie-doc/metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..41777f1d106cc1035b50297ca140a6b1a1b9ef92
--- /dev/null
+++ b/model_zoo/ernie-doc/metrics.py
@@ -0,0 +1,367 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
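+
+# This module collects the evaluation utilities used by the ERNIE-Doc fine-tuning
+# scripts: a binary F1 metric for classification, an exact-match/F1 metric for
+# Chinese machine reading comprehension, and an n-best answer-span extraction helper.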
+ +import collections +import sys + +import numpy as np +import paddle +from paddle.utils import try_import + +from paddlenlp.metrics.dureader import ( + _compute_softmax, + _get_best_indexes, + get_final_text, +) + +# Metric for ERNIE-DOCs + + +class F1(object): + def __init__(self, positive_label=1): + self.positive_label = positive_label + self.reset() + + def compute(self, preds, labels): + if isinstance(preds, paddle.Tensor): + preds = preds.numpy() + elif isinstance(preds, list): + preds = np.array(preds, dtype="float32") + if isinstance(labels, list): + labels = np.array(labels, dtype="int64") + elif isinstance(labels, paddle.Tensor): + labels = labels.numpy() + preds = np.argmax(preds, axis=1) + tp = ((preds == labels) & (labels == self.positive_label)).sum() + fn = ((preds != labels) & (labels == self.positive_label)).sum() + fp = ((preds != labels) & (preds == self.positive_label)).sum() + return tp, fp, fn + + def update(self, statistic): + tp, fp, fn = statistic + self.tp += tp + self.fp += fp + self.fn += fn + + def accumulate(self): + recall = self.tp / (self.tp + self.fn) + precision = self.tp / (self.tp + self.fp) + f1 = 2 * recall * precision / (recall + precision) + return f1 + + def reset(self): + self.tp = 0 + self.fp = 0 + self.fn = 0 + + +class EM_AND_F1(object): + def __init__(self): + self.nltk = try_import("nltk") + self.re = try_import("re") + + def _mixed_segmentation(self, in_str, rm_punc=False): + """mixed_segmentation""" + in_str = in_str.lower().strip() + segs_out = [] + temp_str = "" + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + for char in in_str: + if rm_punc and char in sp_char: + continue + pattern = "[\\u4e00-\\u9fa5]" + if self.re.search(pattern, char) or char in sp_char: + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + temp_str = "" + segs_out.append(char) + else: + temp_str += char + + # Handling last part + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + + return segs_out + + # Remove punctuation + def _remove_punctuation(self, in_str): + """remove_punctuation""" + in_str = in_str.lower().strip() + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + # Find longest common string + def _find_lcs(self, s1, s2): + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + mmax = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > mmax: + mmax = m[i + 1][j + 1] + p = i + 1 + return s1[p - mmax : p], mmax + + def _calc_f1_score(self, answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = self._mixed_segmentation(ans, rm_punc=True) + prediction_segs = self._mixed_segmentation(prediction, rm_punc=True) + lcs, lcs_len = self._find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + precision = 1.0 * lcs_len / len(prediction_segs) + recall = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * precision * recall) / 
(precision + recall) + f1_scores.append(f1) + return max(f1_scores) + + def _calc_em_score(self, answers, prediction): + em = 0 + for ans in answers: + ans_ = self._remove_punctuation(ans) + prediction_ = self._remove_punctuation(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + def __call__(self, prediction, ground_truth): + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for instance in ground_truth: + total_count += 1 + query_id = instance["id"] + answers = instance["answers"] + if query_id not in prediction: + sys.stderr.write("Unanswered question: {}\n".format(query_id)) + skip_count += 1 + continue + preds = str(prediction[query_id]) + f1 += self._calc_f1_score(answers, preds) + em += self._calc_em_score(answers, preds) + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + + avg_score = (f1_score + em_score) * 0.5 + return em_score, f1_score, avg_score, total_count + + +def compute_qa_predictions( + all_examples, all_features, all_results, n_best_size, max_answer_length, do_lower_case, tokenizer, verbose +): + """Write final predictions to the json file and log-odds of null if needed.""" + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # Keep track of the minimum score of null start+end of position 0 + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.qid] + start_indexes = _get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. 
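+                    # A token can appear in several overlapping doc spans; token_is_max_context
+                    # marks the single span in which it has the most surrounding context, and an
+                    # answer is only allowed to start there (the usual BERT-style span filtering).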
+ if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index], + ) + ) + + prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"] + ) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = "".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, tokenizer, verbose) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append(_NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + total_scores = [] + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + all_predictions[example.qas_id] = nbest_json[0]["text"] + all_nbest_json[example.qas_id] = nbest_json + return all_predictions, all_nbest_json diff --git a/model_zoo/ernie-doc/model.py b/model_zoo/ernie-doc/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5fd3e622596a490fb3c1337abaed8c0189ba9390 --- /dev/null +++ b/model_zoo/ernie-doc/model.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle.nn as nn + + +class ErnieDocForTextMatching(nn.Layer): + def __init__(self, ernie_doc, num_classes=2, dropout=None): + super().__init__() + self.ernie_doc = ernie_doc + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.classifier = nn.Linear(ernie_doc.config["hidden_size"], num_classes) + + def forward( + self, + query_input_ids, + title_input_ids, + query_memories, + title_memories, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + _, query_pooled_output, query_mem, = self.ernie_doc( + query_input_ids, query_memories, query_token_type_ids, query_position_ids, query_attention_mask + ) + + _, title_pooled_output, title_mem = self.ernie_doc( + title_input_ids, title_memories, title_token_type_ids, title_position_ids, title_attention_mask + ) + + diff_pooled_output = query_pooled_output - title_pooled_output + diff_pooled_output = self.dropout(diff_pooled_output) + output = self.classifier(diff_pooled_output) + return output, query_mem, title_mem diff --git a/model_zoo/ernie-doc/run_classifier.py b/model_zoo/ernie-doc/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..a2985189f4162e54f9c04d3610cff2f4bb07084f --- /dev/null +++ b/model_zoo/ernie-doc/run_classifier.py @@ -0,0 +1,323 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ClassifierIterator, HYPTextPreprocessor, ImdbTextPreprocessor +from metrics import F1 +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocBPETokenizer, + ErnieDocForSequenceClassification, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-en", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=7e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="imdb", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, help="The training dataset") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + +# tokenizer, eval_dataset, test_dataset, preprocess_text_fn, metric +# BPETokenizer for English Tasks +# ErnieDocTokenizer for Chinese Tasks + +DATASET_INFO = { + "imdb": (ErnieDocBPETokenizer, "test", "test", ImdbTextPreprocessor(), Accuracy()), + "hyp": (ErnieDocBPETokenizer, "dev", "test", HYPTextPreprocessor(), F1()), + "iflytek": (ErnieDocTokenizer, "validation", "test", None, Accuracy()), + "thucnews": (ErnieDocTokenizer, "dev", "test", None, Accuracy()), +} + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0): + model.eval() + losses = [] + # copy the memory + memories = list(memories0) + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids])) + # Need to collect probs for each qid, so use softmax_with_cross_entropy + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids.flatten()): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. + + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + metric.reset() + model.train() + return acc_or_f1 + + +def do_train(args): + set_seed(args) + tokenizer_class, eval_name, test_name, preprocess_text_fn, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + if args.dataset in ["hyp", "thucnews"]: + from paddlenlp.datasets import load_dataset + + train_ds, eval_ds, test_ds = load_dataset(args.dataset, splits=["train", eval_name, test_name]) + num_classes = len(train_ds.label_list) + + else: + from datasets import load_dataset + + # Get dataset + if args.dataset == "iflytek": + + train_ds, eval_ds, test_ds = load_dataset("clue", name=args.dataset, split=["train", eval_name, test_name]) + else: + train_ds, eval_ds = load_dataset(args.dataset, split=["train", eval_name]) + test_ds = eval_ds + num_classes = train_ds.features["label"].num_classes + + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = ClassifierIterator( + train_ds, + 
args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + preprocess_text_fn=preprocess_text_fn, + ) + eval_ds_iter = ClassifierIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + preprocess_text_fn=preprocess_text_fn, + ) + test_ds_iter = ClassifierIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + preprocess_text_fn=preprocess_text_fn, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
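+    # `decay_params` is consumed by AdamW's apply_decay_param_fun below, so weight
+    # decay is only applied to the listed parameters; `simple_lr_setting` is passed
+    # as lr_ratio so that, when --layerwise_decay < 1, lower transformer layers are
+    # trained with proportionally smaller learning rates than the top layers.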
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 0 + best_acc = -1 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idx, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + loss = criterion(logits, labels) * need_cal_loss + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + stop_training = True + break + if stop_training: + break + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, test_dataloader, create_memory()) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-doc/run_mcq.py b/model_zoo/ernie-doc/run_mcq.py new file mode 100644 index 0000000000000000000000000000000000000000..4050959fa8c756a114de22157a429210d87d834a --- /dev/null +++ b/model_zoo/ernie-doc/run_mcq.py @@ -0,0 +1,306 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import MCQIterator +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocForSequenceClassification, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=1e-4, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=8, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="c3", choices=["c3"], type=str, help="The training dataset") +parser.add_argument("--layerwise_decay", default=0.8, type=float, help="Layerwise decay ratio") +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--gradient_accumulation_steps", default=4, type=int, help="Number of updates steps to accumualte before performing a backward/update pass.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + + +DATASET_INFO = { + "c3": (ErnieDocTokenizer, "dev", "test", Accuracy()), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0, choice_num): + model.eval() + losses = [] + # Copy the memory + memories = list(memories0) + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids])) + logits = logits.reshape([-1, choice_num]) + labels = labels.reshape([-1, choice_num, 1])[:, 0] + qids = qids.reshape([-1, choice_num, 1])[:, 0] + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy().flatten() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. + + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + logger.info("Total {} qustion".format(len(probs_dict))) + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + metric.reset() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + model.train() + return acc_or_f1 + + +def do_train(args): + set_seed(args) + tokenizer_class, eval_name, test_name, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Get dataset + train_ds, eval_ds, test_ds = load_dataset(args.dataset, splits=["train", eval_name, test_name]) + + num_classes = len(train_ds.label_list) + + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=1, cls_token_idx=0) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + batch_size = int(args.batch_size / args.gradient_accumulation_steps) + train_ds_iter = MCQIterator( + train_ds, + batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + 
memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + choice_num=num_classes, + ) + + eval_ds_iter = MCQIterator( + eval_ds, + batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="eval", + choice_num=num_classes, + ) + + test_ds_iter = MCQIterator( + test_ds, + batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="test", + choice_num=num_classes, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 1 + best_acc = -1 + create_memory = partial( + init_memory, + batch_size * num_classes, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idx, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + logits = logits.reshape([-1, num_classes]) + labels = labels.reshape([-1, num_classes, 1])[:, 0] + + loss = criterion(logits, labels) * need_cal_loss + loss.backward() + if step % args.gradient_accumulation_steps == 0: + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + global_steps += 1 + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0 and step % args.gradient_accumulation_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0 and step % args.gradient_accumulation_steps == 0: + logger.info("Eval, total {} qustion.".format(len(eval_ds))) + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory(), num_classes) + # Save model + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + return + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, test_dataloader, create_memory(), num_classes) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-doc/run_mrc.py b/model_zoo/ernie-doc/run_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..51687dd04dbed59ab78da127731e7c62a10eadd7 --- /dev/null +++ b/model_zoo/ernie-doc/run_mrc.py @@ 
-0,0 +1,358 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import namedtuple +from functools import partial + +import numpy as np +import paddle +from data import MRCIterator +from metrics import EM_AND_F1, compute_qa_predictions +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocForQuestionAnswering, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=2.75e-4, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=5, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--layerwise_decay", default=0.8, type=float, help="Layerwise decay ratio") +parser.add_argument("--n_best_size", default=20, type=int, help="The total number of n-best predictions to generate in the nbest_predictions.json output file.") +parser.add_argument("--max_answer_length", default=100, type=int, help="Max answer length.") +parser.add_argument("--do_lower_case", action='store_false', help="Whether to lower case the input text. 
Should be True for uncased models and False for cased models.") +parser.add_argument("--verbose", action='store_true', help="Whether to output verbose log.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout ratio of ernie_doc") +parser.add_argument("--dataset", default="dureader_robust", type=str, choices=["dureader_robust", "cmrc2018", "drcd"], help="The avaliable Q&A dataset") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + +# eval_dataset, test_dataset, +DATASET_INFO = { + "dureader_robust": ["dev", "dev", ErnieDocTokenizer], + "cmrc2018": ["dev", "dev", ErnieDocTokenizer], + "drcd": ["dev", "test", ErnieDocTokenizer], +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +class CrossEntropyLossForQA(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForQA, self).__init__() + self.criterion = paddle.nn.CrossEntropyLoss() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + + start_loss = self.criterion(start_logits, start_position) + end_loss = self.criterion(end_logits, end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +@paddle.no_grad() +def evaluate(args, model, criterion, metric, data_loader, memories0, tokenizer): + RawResult = namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"]) + model.eval() + all_results = [] + + tic_start = time.time() + tic_eval = time.time() + memories = list(memories0) + + # Collect result + logger.info("The example number of eval_dataloader: {}".format(len(data_loader._batch_reader.features))) + for step, batch in enumerate(data_loader, start=1): + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + start_position, + end_position, + qids, + gather_idx, + need_cal_loss, + ) = batch + + start_logits, end_logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + start_logits, end_logits, qids = list( + map(lambda x: paddle.gather(x, gather_idx), [start_logits, end_logits, qids]) + ) + np_qids = qids.numpy() + np_start_logits = start_logits.numpy() + np_end_logits = end_logits.numpy() + + if int(need_cal_loss.numpy()) == 1: + for idx in range(qids.shape[0]): + if len(all_results) % 1000 == 0 and len(all_results): + logger.info("Processing example: %d" % len(all_results)) + logger.info("time per 1000: {} s".format(time.time() - tic_eval)) + tic_eval = time.time() + + qid_each = int(np_qids[idx]) + start_logits_each = [float(x) for x in np_start_logits[idx].flat] + end_logits_each = [float(x) for x in np_end_logits[idx].flat] + all_results.append( + RawResult(unique_id=qid_each, start_logits=start_logits_each, end_logits=end_logits_each) + ) + + # Compute_predictions + all_predictions_eval, all_nbest_eval = compute_qa_predictions( + data_loader._batch_reader.examples, + data_loader._batch_reader.features, + all_results, + args.n_best_size, + args.max_answer_length, + args.do_lower_case, + tokenizer, + args.verbose, + ) + + EM, F1, AVG, TOTAL = metric(all_predictions_eval, data_loader._batch_reader.dataset) + + logger.info("EM: {}, F1: {}, AVG: {}, TOTAL: {}, TIME: {}".format(EM, F1, AVG, TOTAL, time.time() - tic_start)) + 
model.train() + return EM, F1, AVG + + +def do_train(args): + set_seed(args) + + DEV, TEST, TOKENIZER_CLASS = DATASET_INFO[args.dataset] + tokenizer = TOKENIZER_CLASS.from_pretrained(args.model_name_or_path) + + train_ds, eval_ds = load_dataset(args.dataset, splits=["train", DEV]) + if DEV == TEST: + test_ds = eval_ds + else: + test_ds = load_dataset(args.dataset, splits=[TEST]) + + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + model = ErnieDocForQuestionAnswering.from_pretrained(args.model_name_or_path, dropout=args.dropout) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = MRCIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + ) + + eval_ds_iter = MRCIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + random_seed=args.seed, + ) + + test_ds_iter = MRCIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + random_seed=args.seed, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + global_steps = 0 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + + criterion = CrossEntropyLossForQA() + + memories = create_memory() + tic_train = time.time() + best_avg_metric = -1 + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + start_position, + end_position, + qids, + gather_idx, + need_cal_loss, + ) = batch + start_logits, end_logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + start_logits, end_logits, qids, start_position, end_position = list( + map( + lambda x: paddle.gather(x, gather_idx), + [start_logits, end_logits, qids, start_position, end_position], + ) + ) + loss = criterion([start_logits, end_logits], [start_position, end_position]) * need_cal_loss + + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + EM, F1, AVG = evaluate( + args, model, criterion, EM_AND_F1(), eval_dataloader, create_memory(), tokenizer + ) + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if best_avg_metric < AVG: + output_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + stop_training = True + break + if stop_training: + break + logger.info("Test:") + evaluate(args, model, criterion, EM_AND_F1(), test_dataloader, create_memory(), tokenizer) + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-doc/run_semantic_matching.py 
b/model_zoo/ernie-doc/run_semantic_matching.py new file mode 100644 index 0000000000000000000000000000000000000000..986154f82e15f22f5d2f63b0dc9ee49503712632 --- /dev/null +++ b/model_zoo/ernie-doc/run_semantic_matching.py @@ -0,0 +1,357 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import SemanticMatchingIterator +from model import ErnieDocForTextMatching +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocModel, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=15, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="cail2019_scm", choices=["cail2019_scm"], type=str, help="The training dataset") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout ratio of ernie_doc") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + +DATASET_INFO = { + "cail2019_scm": (ErnieDocTokenizer, "dev", "test", Accuracy()), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0, pair_memories0): + model.eval() + losses = [] + # copy the memory + memories = list(memories0) + pair_memories = list(pair_memories0) + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + pair_input_ids, + pair_position_ids, + pair_token_type_ids, + pair_attn_mask, + labels, + qids, + gather_idx, + need_cal_loss, + ) = batch + + logits, memories, pair_memories = model( + input_ids, + pair_input_ids, + memories, + pair_memories, + token_type_ids, + position_ids, + attn_mask, + pair_token_type_ids, + pair_position_ids, + pair_attn_mask, + ) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels, qids])) + # Need to collect probs for each qid, so use softmax_with_cross_entropy + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids.flatten()): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. 
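+            # Spans cut from the same long example share one qid; their probabilities
+            # are collected here and averaged after the loop to form a single
+            # example-level prediction.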
+ + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + metric.reset() + model.train() + return acc_or_f1 + + +def do_train(args): + set_seed(args) + tokenizer_class, eval_name, test_name, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Get dataset + train_ds, eval_ds, test_ds = load_dataset(args.dataset, splits=["train", eval_name, test_name]) + + num_classes = len(train_ds.label_list) + + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + ernie_doc = ErnieDocModel.from_pretrained(args.model_name_or_path, cls_token_idx=0) + model = ErnieDocForTextMatching(ernie_doc, num_classes, args.dropout) + + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = SemanticMatchingIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + ) + + eval_ds_iter = SemanticMatchingIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="eval", + ) + + test_ds_iter = SemanticMatchingIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="test", + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform 
weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 0 + best_acc = -1 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + pair_memories = create_memory() + tic_train = time.time() + + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + pair_input_ids, + pair_position_ids, + pair_token_type_ids, + pair_attn_mask, + labels, + qids, + gather_idx, + need_cal_loss, + ) = batch + + logits, memories, pair_memories = model( + input_ids, + pair_input_ids, + memories, + pair_memories, + token_type_ids, + position_ids, + attn_mask, + pair_token_type_ids, + pair_position_ids, + pair_attn_mask, + ) + + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + loss = criterion(logits, labels) * need_cal_loss + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory(), create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + save_param_path = os.path.join(output_dir, "model_state.pdparams") + paddle.save(model_to_save.state_dict(), save_param_path) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + + save_param_path = os.path.join(best_model_dir, "model_state.pdparams") + paddle.save(model_to_save.state_dict(), save_param_path) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + return + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, test_dataloader, create_memory(), create_memory()) + + +if __name__ == "__main__": + 
do_train(args) diff --git a/model_zoo/ernie-doc/run_sequence_labeling.py b/model_zoo/ernie-doc/run_sequence_labeling.py new file mode 100644 index 0000000000000000000000000000000000000000..0686f9626ac41df6e6e34e20dfdd684a58f06393 --- /dev/null +++ b/model_zoo/ernie-doc/run_sequence_labeling.py @@ -0,0 +1,300 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +from data import SequenceLabelingIterator +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocForTokenClassification, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=3e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="msra_ner", choices=["msra_ner"], type=str, help="The training dataset") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0): + model.eval() + metric.reset() + avg_loss, precision, recall, f1_score = 0, 0, 0, 0 + loss_fct = paddle.nn.loss.CrossEntropyLoss() + losses = [] + # Copy the memory + memories = list(memories0) + tic_train = time.time() + eval_logging_step = 500 + labels_dict = defaultdict(list) + preds_dict = defaultdict(list) + length_dict = defaultdict(list) + + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, lengths, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids, lengths = list( + map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids, lengths]) + ) + loss = loss_fct(logits, labels) + avg_loss = loss.mean() + losses.append(avg_loss) + preds = logits.argmax(axis=2) + + np_qids = qids.numpy().flatten() + for i, qid in enumerate(np_qids): + preds_dict[qid].append(preds[i]) + labels_dict[qid].append(labels[i]) + length_dict[qid].append(lengths[i]) + + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + qids = preds_dict.keys() + for qid in qids: + preds = paddle.concat(preds_dict[qid], axis=0).unsqueeze(0) + labels = paddle.concat(labels_dict[qid], axis=0).unsqueeze(0).squeeze(-1) + length = paddle.concat(length_dict[qid], axis=0) + length = length.sum(axis=0, keepdim=True) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(length, preds, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + metric.reset() + logger.info("Total {} samples.".format(len(qids))) + logger.info("eval loss: %f, precision: %f, recall: %f, f1: %f" % (avg_loss, precision, recall, f1_score)) + model.train() + return precision, recall, f1_score + + +def do_train(args): + set_seed(args) + tokenizer = ErnieDocTokenizer.from_pretrained(args.model_name_or_path) + train_ds, eval_ds = load_dataset(args.dataset, splits=["train", "test"]) + test_ds = eval_ds + + num_classes = len(train_ds.label_list) + no_entity_id = num_classes - 1 + + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = SequenceLabelingIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + no_entity_id=no_entity_id, + ) + eval_ds_iter = SequenceLabelingIterator( + eval_ds, + args.batch_size, 
+ tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + no_entity_id=no_entity_id, + ) + test_ds_iter = SequenceLabelingIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + no_entity_id=no_entity_id, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = ChunkEvaluator(label_list=train_ds.label_list) + + global_steps = 0 + + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + best_f1 = 0 + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + labels, + lengths, + qids, + gather_idx, + need_cal_loss, + ) = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + + loss = criterion(logits, labels) * need_cal_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + loss, + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_steps % args.save_steps == 0: + # Evaluate + 
logger.info("Eval:") + precision, recall, f1_score = evaluate(model, metric, eval_dataloader, create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if f1_score > best_f1: + logger.info("Save best model......") + best_f1 = f1_score + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + stop_training = True + break + if stop_training: + break + + logger.info("Final test result:") + evaluate(model, metric, test_dataloader, create_memory()) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-gen/README.md b/model_zoo/ernie-gen/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4d236818fd22338ac00e30765cf7414919de628d --- /dev/null +++ b/model_zoo/ernie-gen/README.md @@ -0,0 +1,148 @@ +# ERNIE-Gen: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation + +## 1. 简介 + +ERNIE-GEN 是面向生成任务的预训练-微调框架,首次在预训练阶段加入**span-by-span 生成任务**,让模型每次能够生成一个语义完整的片段。在预训练和微调中通过**填充式生成机制**和**噪声感知机制**来缓解曝光偏差问题。此外, ERNIE-GEN 采样**多片段-多粒度目标文本采样策略**, 增强源文本和目标文本的关联性,加强了编码器和解码器的交互。 + +![multi-flow-attention](https://github.com/PaddlePaddle/ERNIE/raw/repro/ernie-gen/.meta/multi-flow-attention.png) + +## 快速开始 + +### 环境依赖 + +- tqdm + +安装方式:`pip install tqdm` + +### 数据准备 + +在本例中,我们提供了古诗词数据集,示例数据如下: + +```text +画\002精\002禅\002室\002冷\002,\002方\002暑\002久\002徘\002徊\002。 不\002尽\002林\002端\002雪\002,\002长\002青\002石\002上\002苔\002。\002心\002闲\002对\002岩\002岫\002,\002目\002浄\002失\002尘\002埃\002。\002坐\002久\002清\002风\002至\002,\002疑\002从\002翠\002涧\002来\002。 +``` + +每行数据都是由两列组成,以制表符分隔。第一列是输入的诗句前文,第二列是输出的诗句后文,所有文字都以 `\002` 分隔。 + +完整数据集可以通过以下命令下载并解压: + +```bash +wget --no-check-certificate https://bj.bcebos.com/paddlenlp/datasets/poetry.tar.gz +tar xvf poetry.tar.gz +``` + +### 模型微调 + +#### 单卡训练 + +训练启动方式如下: + +```bash +python -u ./train.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --learning_rate 2e-5 \ + --num_epochs 12 \ + --logging_steps 1 \ + --save_steps 1000 \ + --output_dir ./tmp/ \ + --device gpu \ + # --init_checkpoint ./tmp/model_10000/model_state.pdparams +``` + +参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer,支持[PaddleNLP Transformer类预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中的所有模型,但只有`ernie-gen-base-en, ernie-gen-large-en, ernie-gen-large-en-430g`三种模型会加载最后输出层的参数,其余模型只会加载transformer参数作热启动。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_encode_len` 表示最大输入句子长度,超过该长度将被截断。 +- `max_decode_len` 表示最大输出句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device`: 训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `init_checkpoint` 表示模型加载路径,通过设置此参数可以开启增量训练。 + +训练会持续很长的时间,为此我们提供了[微调后的模型](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_gen_finetuned/ernie_1.0_poetry.pdparams)。您可以下载该模型并通过`init_checkpoint`加载其参数进行增量训练、评估或预测。 + 
+#### 多卡训练 + +训练启动方式如下: + +```bash +python -m paddle.distributed.launch --gpus "0,1" ./train.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --learning_rate 2e-5 \ + --num_epochs 12 \ + --logging_steps 1 \ + --save_steps 1000 \ + --output_dir ./tmp/ \ + --device gpu \ + # --init_checkpoint ./tmp/model_10000/model_state.pdparams +``` + +### 模型评估 + +通过加载训练保存的模型,可以对验证集数据进行验证,启动方式如下: + +```bash +python -u ./eval.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --init_checkpoint ./tmp/model_10000/model_state.pdparams \ + --device gpu +``` + +参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer,支持[PaddleNLP Transformer类预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中的所有模型,但只有`ernie-gen-base-en, ernie-gen-large-en, ernie-gen-large-en-430g`三种模型会加载最后输出层的参数,其余模型只会加载transformer参数作热启动。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_encode_len` 表示最大输入句子长度,超过该长度将被截断。 +- `max_decode_len` 表示最大输出句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `init_checkpoint` 表示模型加载路径。 +- `use_gpu` 表示使用GPU。 + +### 模型预测 + +对无标签数据可以启动模型预测: + +```bash +python -u ./predict.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --init_checkpoint ./tmp/model_10000/model_state.pdparams \ + --device gpu +``` + + +## Citation + +您可以按下面的格式引用ERNIE-Gen论文: + +``` +@article{xiao2020ernie-gen, + title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation}, + author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng}, + journal={arXiv preprint arXiv:2001.11314}, + year={2020} +} +``` + +## 线上教程体验 + +我们为诗歌文本生成提供了线上教程,欢迎体验: + +* [使用PaddleNLP预训练模型ERNIE-GEN生成诗歌](https://aistudio.baidu.com/aistudio/projectdetail/1339888) + + +## Acknowledgement + +- 感谢 [chinese-poetry数据集](https://github.com/chinese-poetry/chinese-poetry) 开放的诗歌数据集 diff --git a/model_zoo/ernie-gen/decode.py b/model_zoo/ernie-gen/decode.py new file mode 100644 index 0000000000000000000000000000000000000000..0450dd46504f19e884cf2f3922f1821e268359d1 --- /dev/null +++ b/model_zoo/ernie-gen/decode.py @@ -0,0 +1,292 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
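+
+"""Decoding utilities for ERNIE-GEN: build the infilling attention bias (gen_bias),
+run greedy or beam-search infilling decoding, and post-process generated tokens."""
+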
+import re +from collections import namedtuple + +import numpy as np +import paddle +import paddle.nn as nn + + +def gen_bias(encoder_inputs, decoder_inputs, step): + decoder_bsz, decoder_seqlen = decoder_inputs.shape[:2] + encoder_bsz, encoder_seqlen = encoder_inputs.shape[:2] + attn_bias = paddle.reshape(paddle.arange(0, decoder_seqlen, 1, dtype="float32") + 1, [1, -1, 1]) + decoder_bias = paddle.cast( + (paddle.matmul(attn_bias, 1.0 / attn_bias, transpose_y=True) >= 1.0), "float32" + ) # [1, decoderlen, decoderlen] + encoder_bias = paddle.unsqueeze( + paddle.cast(paddle.ones_like(encoder_inputs), "float32"), [1] + ) # [bsz, 1, encoderlen] + encoder_bias = paddle.expand( + encoder_bias, [encoder_bsz, decoder_seqlen, encoder_seqlen] + ) # [bsz,decoderlen, encoderlen] + decoder_bias = paddle.expand( + decoder_bias, [decoder_bsz, decoder_seqlen, decoder_seqlen] + ) # [bsz, decoderlen, decoderlen] + if step > 0: + bias = paddle.concat( + [encoder_bias, paddle.ones([decoder_bsz, decoder_seqlen, step], "float32"), decoder_bias], -1 + ) + else: + bias = paddle.concat([encoder_bias, decoder_bias], -1) + return bias + + +@paddle.no_grad() +def greedy_search_infilling( + model, + token_ids, + token_type_ids, + sos_id, + eos_id, + attn_id, + pad_id, + unk_id, + vocab_size, + max_encode_len=640, + max_decode_len=100, + tgt_type_id=3, +): + _, logits, info = model(token_ids, token_type_ids) + d_batch, d_seqlen = token_ids.shape + seqlen = paddle.sum(paddle.cast(token_ids != 0, "int64"), 1, keepdim=True) + has_stopped = np.zeros([d_batch], dtype=np.bool_) + gen_seq_len = np.zeros([d_batch], dtype=np.int64) + output_ids = [] + + past_cache = info["caches"] + + cls_ids = paddle.ones([d_batch], dtype="int64") * sos_id + attn_ids = paddle.ones([d_batch], dtype="int64") * attn_id + ids = paddle.stack([cls_ids, attn_ids], -1) + for step in range(max_decode_len): + bias = gen_bias(token_ids, ids, step) + pos_ids = paddle.to_tensor(np.tile(np.array([[step, step + 1]], dtype=np.int64), [d_batch, 1])) + pos_ids += seqlen + _, logits, info = model( + ids, paddle.ones_like(ids) * tgt_type_id, pos_ids=pos_ids, attn_bias=bias, past_cache=past_cache + ) + + if logits.shape[-1] > vocab_size: + logits[:, :, vocab_size:] = 0 + logits[:, :, pad_id] = 0 + logits[:, :, unk_id] = 0 + logits[:, :, attn_id] = 0 + + gen_ids = paddle.argmax(logits, -1) + + past_cached_k, past_cached_v = past_cache + cached_k, cached_v = info["caches"] + cached_k = [paddle.concat([pk, k[:, :1, :]], 1) for pk, k in zip(past_cached_k, cached_k)] # concat cached + cached_v = [paddle.concat([pv, v[:, :1, :]], 1) for pv, v in zip(past_cached_v, cached_v)] + past_cache = (cached_k, cached_v) + + gen_ids = gen_ids[:, 1] + ids = paddle.stack([gen_ids, attn_ids], 1) + + gen_ids = gen_ids.numpy() + has_stopped |= (gen_ids == eos_id).astype(np.bool_) + gen_seq_len += 1 - has_stopped.astype(np.int64) + output_ids.append(gen_ids.tolist()) + if has_stopped.all(): + break + output_ids = np.array(output_ids).transpose([1, 0]) + return output_ids + + +BeamSearchState = namedtuple("BeamSearchState", ["log_probs", "lengths", "finished"]) +BeamSearchOutput = namedtuple("BeamSearchOutput", ["scores", "predicted_ids", "beam_parent_ids"]) + + +def log_softmax(x): + e_x = np.exp(x - np.max(x)) + return np.log(e_x / e_x.sum()) + + +def mask_prob(p, onehot_eos, finished): + is_finished = paddle.cast(paddle.reshape(finished, [-1, 1]) != 0, "float32") + p = is_finished * (1.0 - paddle.cast(onehot_eos, "float32")) * -9999.0 + (1.0 - is_finished) * p + return p + + +def 
hyp_score(log_probs, length, length_penalty): + lp = paddle.pow((5.0 + paddle.cast(length, "float32")) / 6.0, length_penalty) + return log_probs / lp + + +def beam_search_step(state, logits, eos_id, beam_width, is_first_step, length_penalty): + """logits.shape == [B*W, V]""" + _, vocab_size = logits.shape + + bsz, beam_width = state.log_probs.shape + onehot_eos = paddle.cast(nn.functional.one_hot(paddle.ones([1], "int64") * eos_id, vocab_size), "int64") # [1, V] + + probs = paddle.log(nn.functional.softmax(logits)) # [B*W, V] + probs = mask_prob(probs, onehot_eos, state.finished) # [B*W, V] + allprobs = paddle.reshape(state.log_probs, [-1, 1]) + probs # [B*W, V] + + not_finished = 1 - paddle.reshape(state.finished, [-1, 1]) # [B*W,1] + not_eos = 1 - onehot_eos + length_to_add = not_finished * not_eos # [B*W,V] + alllen = paddle.reshape(state.lengths, [-1, 1]) + length_to_add + + allprobs = paddle.reshape(allprobs, [-1, beam_width * vocab_size]) + alllen = paddle.reshape(alllen, [-1, beam_width * vocab_size]) + allscore = hyp_score(allprobs, alllen, length_penalty) + if is_first_step: + allscore = paddle.reshape(allscore, [bsz, beam_width, -1])[:, 0, :] # first step only consiter beam 0 + scores, idx = paddle.topk(allscore, k=beam_width) # [B, W] + next_beam_id = idx // vocab_size # [B, W] + next_word_id = idx % vocab_size + + gather_idx = paddle.concat([paddle.nonzero(idx != -1)[:, :1], paddle.reshape(idx, [-1, 1])], 1) + next_probs = paddle.reshape(paddle.gather_nd(allprobs, gather_idx), idx.shape) + next_len = paddle.reshape(paddle.gather_nd(alllen, gather_idx), idx.shape) + + gather_idx = paddle.concat([paddle.nonzero(next_beam_id != -1)[:, :1], paddle.reshape(next_beam_id, [-1, 1])], 1) + next_finished = paddle.reshape( + paddle.gather_nd(state.finished, gather_idx), state.finished.shape + ) # [gather new beam state according to new beam id] + + next_finished += paddle.cast(next_word_id == eos_id, "int64") + next_finished = paddle.cast(next_finished > 0, "int64") + + next_state = BeamSearchState(log_probs=next_probs, lengths=next_len, finished=next_finished) + output = BeamSearchOutput(scores=scores, predicted_ids=next_word_id, beam_parent_ids=next_beam_id) + + return output, next_state + + +@paddle.no_grad() +def beam_search_infilling( + model, + token_ids, + token_type_ids, + sos_id, + eos_id, + attn_id, + pad_id, + unk_id, + vocab_size, + max_encode_len=640, + max_decode_len=100, + beam_width=5, + tgt_type_id=3, + length_penalty=1.0, +): + _, __, info = model(token_ids, token_type_ids) + d_batch, d_seqlen = token_ids.shape + + state = BeamSearchState( + log_probs=paddle.zeros([d_batch, beam_width], "float32"), + lengths=paddle.zeros([d_batch, beam_width], "int64"), + finished=paddle.zeros([d_batch, beam_width], "int64"), + ) + outputs = [] + + def reorder_(t, parent_id): + """reorder cache according to parent beam id""" + gather_idx = paddle.nonzero(parent_id != -1)[:, 0] * beam_width + paddle.reshape(parent_id, [-1]) + t = paddle.gather(t, gather_idx) + return t + + def tile_(t, times): + _shapes = list(t.shape[1:]) + new_shape = [t.shape[0], times] + list(t.shape[1:]) + ret = paddle.reshape( + paddle.expand(paddle.unsqueeze(t, [1]), new_shape), + [ + -1, + ] + + _shapes, + ) + return ret + + cached_k, cached_v = info["caches"] + cached_k = [tile_(k, beam_width) for k in cached_k] + cached_v = [tile_(v, beam_width) for v in cached_v] + past_cache = (cached_k, cached_v) + + token_ids = tile_(token_ids, beam_width) + seqlen = paddle.sum(paddle.cast(token_ids != 0, "int64"), 1, 
keepdim=True) + # log.debug(token_ids.shape) + + cls_ids = paddle.ones([d_batch * beam_width], dtype="int64") * sos_id + attn_ids = paddle.ones([d_batch * beam_width], dtype="int64") * attn_id # SOS + ids = paddle.stack([cls_ids, attn_ids], -1) + for step in range(max_decode_len): + # log.debug('decode step %d' % step) + bias = gen_bias(token_ids, ids, step) + pos_ids = paddle.to_tensor(np.tile(np.array([[step, step + 1]], dtype=np.int64), [d_batch * beam_width, 1])) + pos_ids += seqlen + _, logits, info = model( + ids, paddle.ones_like(ids) * tgt_type_id, pos_ids=pos_ids, attn_bias=bias, past_cache=past_cache + ) + if logits.shape[-1] > vocab_size: + logits[:, :, vocab_size:] = 0 + logits[:, :, pad_id] = 0 + logits[:, :, unk_id] = 0 + logits[:, :, attn_id] = 0 + + output, state = beam_search_step( + state, + logits[:, 1], + eos_id=eos_id, + beam_width=beam_width, + is_first_step=(step == 0), + length_penalty=length_penalty, + ) + outputs.append(output) + + past_cached_k, past_cached_v = past_cache + cached_k, cached_v = info["caches"] + cached_k = [ + reorder_(paddle.concat([pk, k[:, :1, :]], 1), output.beam_parent_ids) + for pk, k in zip(past_cached_k, cached_k) + ] # concat cached + cached_v = [ + reorder_(paddle.concat([pv, v[:, :1, :]], 1), output.beam_parent_ids) + for pv, v in zip(past_cached_v, cached_v) + ] + past_cache = (cached_k, cached_v) + + pred_ids_flatten = paddle.reshape(output.predicted_ids, [d_batch * beam_width]) + ids = paddle.stack([pred_ids_flatten, attn_ids], 1) + + if state.finished.numpy().all(): + break + + final_ids = paddle.stack([o.predicted_ids for o in outputs], 0) + final_parent_ids = paddle.stack([o.beam_parent_ids for o in outputs], 0) + final_ids = nn.functional.gather_tree(final_ids, final_parent_ids)[:, :, 0] # pick best beam + final_ids = paddle.transpose(paddle.reshape(final_ids, [-1, d_batch * 1]), [1, 0]) + + return final_ids.numpy() + + +en_patten = re.compile(r"^[a-zA-Z0-9]*$") + + +def post_process(token): + if token.startswith("##"): + ret = token[2:] + elif token in ["[CLS]", "[SEP]", "[PAD]"]: + ret = "" + else: + if en_patten.match(token): + ret = " " + token + else: + ret = token + return ret diff --git a/model_zoo/ernie-gen/encode.py b/model_zoo/ernie-gen/encode.py new file mode 100644 index 0000000000000000000000000000000000000000..a1f47e1f33102619f7810d5061e1af3161ade8ab --- /dev/null +++ b/model_zoo/ernie-gen/encode.py @@ -0,0 +1,145 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
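+
+"""Feature conversion for ERNIE-GEN fine-tuning: convert_example tokenizes the
+source/target text and optionally injects noise tokens, while gen_mask and
+after_padding build the multi-flow attention masks (Fig. 3 of the ERNIE-GEN paper)."""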
+ +from copy import deepcopy + +import numpy as np + + +def convert_example( + tokenizer, + attn_id, + tgt_type_id=3, + max_encode_len=512, + max_decode_len=128, + is_test=False, + noise_prob=0.0, + use_random_noice=False, +): + def warpper(example): + """convert an example into necessary features""" + tokens = example["tokens"] + labels = example["labels"] + encoded_src = tokenizer(tokens, max_seq_len=max_encode_len, pad_to_max_seq_len=False) + src_ids, src_sids = encoded_src["input_ids"], encoded_src["token_type_ids"] + src_pids = np.arange(len(src_ids)) + + if not is_test: + encoded_tgt = tokenizer(labels, max_seq_len=max_decode_len, pad_to_max_seq_len=False) + tgt_ids, tgt_sids = encoded_tgt["input_ids"], encoded_tgt["token_type_ids"] + tgt_ids = np.array(tgt_ids).astype("int64") + tgt_sids = np.array(tgt_sids) + tgt_type_id + tgt_pids = np.arange(len(tgt_ids)) + len(src_ids) + + attn_ids = np.ones_like(tgt_ids) * attn_id + if noise_prob > 0.0: + tgt_labels = deepcopy(tgt_ids) + if use_random_noice: + noice_ids = np.random.randint(1, len(tokenizer.vocab), size=tgt_ids.shape) + else: + noice_ids = np.ones_like(tgt_ids) * tokenizer.vocab["[NOISE]"] + (pos,) = np.where(np.ones_like(tgt_ids)) + np.random.shuffle(pos) + pos = pos[: int(noise_prob * len(pos))] + tgt_ids[pos] = noice_ids[ + pos, + ] + else: + tgt_labels = tgt_ids + + return (src_ids, src_pids, src_sids, tgt_ids, tgt_pids, tgt_sids, attn_ids, tgt_labels) + + return warpper + + +def gen_mask(batch_ids, mask_type="bidi", query_len=None, pad_value=0): + if query_len is None: + query_len = batch_ids.shape[1] + if mask_type != "empty": + mask = (batch_ids != pad_value).astype(np.float32) + mask = np.tile(np.expand_dims(mask, 1), [1, query_len, 1]) + if mask_type == "causal": + assert query_len == batch_ids.shape[1] + mask = np.tril(mask) + elif mask_type == "causal_without_diag": + assert query_len == batch_ids.shape[1] + mask = np.tril(mask, -1) + elif mask_type == "diag": + assert query_len == batch_ids.shape[1] + mask = np.stack([np.diag(np.diag(m)) for m in mask], 0) + + else: + mask_type == "empty" + mask = np.zeros_like(batch_ids).astype(np.float32) + mask = np.tile(np.expand_dims(mask, 1), [1, query_len, 1]) + return mask + + +def after_padding(args): + """ + attention mask: + *** src, tgt, attn + src 00, 01, 11 + tgt 10, 11, 12 + attn 20, 21, 22 + + *** s1, s2 | t1 t2 t3| attn1 attn2 attn3 + s1 1, 1 | 0, 0, 0,| 0, 0, 0, + s2 1, 1 | 0, 0, 0,| 0, 0, 0, + - + t1 1, 1, | 1, 0, 0,| 0, 0, 0, + t2 1, 1, | 1, 1, 0,| 0, 0, 0, + t3 1, 1, | 1, 1, 1,| 0, 0, 0, + - + attn1 1, 1, | 0, 0, 0,| 1, 0, 0, + attn2 1, 1, | 1, 0, 0,| 0, 1, 0, + attn3 1, 1, | 1, 1, 0,| 0, 0, 1, + + for details, see Fig3. 
https://arxiv.org/abs/2001.11314 + """ + src_ids, src_pids, src_sids, tgt_ids, tgt_pids, tgt_sids, attn_ids, tgt_labels = args + src_len = src_ids.shape[1] + tgt_len = tgt_ids.shape[1] + mask_00 = gen_mask(src_ids, "bidi", query_len=src_len) + # mask_01 = gen_mask(tgt_ids, "empty", query_len=src_len) + # mask_02 = gen_mask(attn_ids, "empty", query_len=src_len) + + mask_10 = gen_mask(src_ids, "bidi", query_len=tgt_len) + mask_11 = gen_mask(tgt_ids, "causal", query_len=tgt_len) + # mask_12 = gen_mask(attn_ids, "empty", query_len=tgt_len) + + mask_20 = gen_mask(src_ids, "bidi", query_len=tgt_len) + mask_21 = gen_mask(tgt_ids, "causal_without_diag", query_len=tgt_len) + mask_22 = gen_mask(attn_ids, "diag", query_len=tgt_len) + + mask_src_2_src = mask_00 + mask_tgt_2_srctgt = np.concatenate([mask_10, mask_11], 2) + mask_attn_2_srctgtattn = np.concatenate([mask_20, mask_21, mask_22], 2) + + raw_tgt_labels = deepcopy(tgt_labels) + tgt_labels = tgt_labels[np.where(tgt_labels != 0)] + return ( + src_ids, + src_sids, + src_pids, + tgt_ids, + tgt_sids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + raw_tgt_labels, + ) diff --git a/model_zoo/ernie-gen/eval.py b/model_zoo/ernie-gen/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..47dd2298ac7b6d8d2031f1bb08e23b8c5da608e9 --- /dev/null +++ b/model_zoo/ernie-gen/eval.py @@ -0,0 +1,147 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
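+
+"""Evaluate a fine-tuned ERNIE-GEN model on the poetry dev set with beam-search
+infilling decoding and report ROUGE-1 / ROUGE-2 scores."""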
+ +import argparse + +import paddle +from decode import beam_search_infilling +from encode import after_padding, convert_example +from paddle.io import DataLoader +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Rouge1, Rouge2 +from paddlenlp.transformers import ( + BertTokenizer, + ElectraTokenizer, + ErnieForGeneration, + ErnieTinyTokenizer, + ErnieTokenizer, + RobertaTokenizer, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser('seq2seq model with ERNIE-GEN') +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(list(ErnieTokenizer.pretrained_init_configuration.keys()))) +parser.add_argument('--max_encode_len', type=int, default=24, help="The max encoding sentence length") +parser.add_argument('--max_decode_len', type=int, default=72, help="The max decoding sentence length") +parser.add_argument("--batch_size", default=50, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument('--beam_width', type=int, default=1, help="Beam search width") +parser.add_argument('--length_penalty', type=float, default=1.0, help="The length penalty during decoding") +parser.add_argument('--init_checkpoint', type=str, default=None, help='Checkpoint to warm start from') +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +args = parser.parse_args() +# fmt: on + + +def evaluate(): + paddle.set_device(args.device) + + model = ErnieForGeneration.from_pretrained(args.model_name_or_path) + if "ernie-tiny" in args.model_name_or_path: + tokenizer = ErnieTinyTokenizer.from_pretrained(args.model_name_or_path) + elif "ernie" in args.model_name_or_path: + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + elif "roberta" in args.model_name_or_path or "rbt" in args.model_name_or_path: + tokenizer = RobertaTokenizer.from_pretrained(args.model_name_or_path) + elif "electra" in args.model_name_or_path: + tokenizer = ElectraTokenizer.from_pretrained(args.model_name_or_path) + else: + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + dev_dataset = load_dataset("poetry", splits=("dev"), lazy=False) + attn_id = tokenizer.vocab["[ATTN]"] if "[ATTN]" in tokenizer.vocab else tokenizer.vocab["[MASK]"] + tgt_type_id = model.sent_emb.weight.shape[0] - 1 + + trans_func = convert_example( + tokenizer=tokenizer, + attn_id=attn_id, + tgt_type_id=tgt_type_id, + max_encode_len=args.max_encode_len, + max_decode_len=args.max_decode_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # attn_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_labels + ): after_padding(fn(samples)) + + dev_dataset = dev_dataset.map(trans_func) + dev_batch_sampler = paddle.io.BatchSampler(dev_dataset, batch_size=args.batch_size, shuffle=False) + data_loader = DataLoader( + dataset=dev_dataset, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, 
num_workers=0, return_list=True + ) + + rouge1 = Rouge1() + rouge2 = Rouge2() + + if args.init_checkpoint: + model_state = paddle.load(args.init_checkpoint) + model.set_state_dict(model_state) + + model.eval() + vocab = tokenizer.vocab + eos_id = vocab[tokenizer.sep_token] + sos_id = vocab[tokenizer.cls_token] + pad_id = vocab[tokenizer.pad_token] + unk_id = vocab[tokenizer.unk_token] + vocab_size = len(vocab) + evaluated_sentences_ids = [] + reference_sentences_ids = [] + logger.info("Evaluating...") + for data in tqdm(data_loader): + (src_ids, src_sids, src_pids, _, _, _, _, _, _, _, _, raw_tgt_labels) = data # never use target when infer + # Use greedy_search_infilling or beam_search_infilling to get predictions + output_ids = beam_search_infilling( + model, + src_ids, + src_sids, + eos_id=eos_id, + sos_id=sos_id, + attn_id=attn_id, + pad_id=pad_id, + unk_id=unk_id, + vocab_size=vocab_size, + max_decode_len=args.max_decode_len, + max_encode_len=args.max_encode_len, + beam_width=args.beam_width, + length_penalty=args.length_penalty, + tgt_type_id=tgt_type_id, + ) + + for ids in output_ids.tolist(): + if eos_id in ids: + ids = ids[: ids.index(eos_id)] + evaluated_sentences_ids.append(ids) + + for ids in raw_tgt_labels.numpy().tolist(): + ids = ids[: ids.index(eos_id)] + reference_sentences_ids.append(ids) + + score1 = rouge1.score(evaluated_sentences_ids, reference_sentences_ids) + score2 = rouge2.score(evaluated_sentences_ids, reference_sentences_ids) + + logger.info("Rouge-1: %.5f ,Rouge-2: %.5f" % (score1 * 100, score2 * 100)) + + +if __name__ == "__main__": + evaluate() diff --git a/model_zoo/ernie-gen/model.py b/model_zoo/ernie-gen/model.py new file mode 100644 index 0000000000000000000000000000000000000000..31e30c6e9c333f38a1659fd5576a25311b8bc0cc --- /dev/null +++ b/model_zoo/ernie-gen/model.py @@ -0,0 +1,64 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
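+
+"""StackModel runs ErnieForGeneration's source / target / attention-query passes in a
+single forward and returns only the mean loss, so every forward output participates in
+the loss as required when the model is wrapped with paddle.DataParallel."""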
+ +import paddle +import paddle.nn as nn + + +class StackModel(nn.Layer): + def __init__(self, model): + super().__init__() + self.model = model + + def forward( + self, + src_ids, + src_sids, + src_pids, + tgt_ids, + tgt_sids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + tgt_pos, + ): + _, __, info = self.model( + src_ids, sent_ids=src_sids, pos_ids=src_pids, attn_bias=mask_src_2_src, encode_only=True + ) + cached_k, cached_v = info["caches"] + _, __, info = self.model( + tgt_ids, + sent_ids=tgt_sids, + pos_ids=tgt_pids, + attn_bias=mask_tgt_2_srctgt, + past_cache=(cached_k, cached_v), + encode_only=True, + ) + cached_k2, cached_v2 = info["caches"] + past_cache_k = [paddle.concat([k, k2], 1) for k, k2 in zip(cached_k, cached_k2)] + past_cache_v = [paddle.concat([v, v2], 1) for v, v2 in zip(cached_v, cached_v2)] + loss, _, __ = self.model( + attn_ids, + sent_ids=tgt_sids, + pos_ids=tgt_pids, + attn_bias=mask_attn_2_srctgtattn, + past_cache=(past_cache_k, past_cache_v), + tgt_labels=tgt_labels, + tgt_pos=tgt_pos, + ) + loss = loss.mean() + return loss diff --git a/model_zoo/ernie-gen/predict.py b/model_zoo/ernie-gen/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..408ded91231af664a54acbde4a8f9e5784999093 --- /dev/null +++ b/model_zoo/ernie-gen/predict.py @@ -0,0 +1,137 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
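+
+"""Generate predictions with a fine-tuned ERNIE-GEN model via beam-search infilling
+decoding and print the source, target, and predicted sentences."""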
+ +import argparse + +import paddle +from decode import beam_search_infilling, post_process +from encode import after_padding, convert_example +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BertTokenizer, + ElectraTokenizer, + ErnieForGeneration, + ErnieTinyTokenizer, + ErnieTokenizer, + RobertaTokenizer, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser('seq2seq model with ERNIE-GEN') +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(list(ErnieTokenizer.pretrained_init_configuration.keys()))) +parser.add_argument('--max_encode_len', type=int, default=24, help="The max encoding sentence length") +parser.add_argument('--max_decode_len', type=int, default=72, help="The max decoding sentence length") +parser.add_argument("--batch_size", default=50, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument('--beam_width', type=int, default=3, help="Beam search width") +parser.add_argument('--length_penalty', type=float, default=1.0, help="The length penalty during decoding") +parser.add_argument('--init_checkpoint', type=str, default=None, help='Checkpoint to warm start from') +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +# fmt: on + +args = parser.parse_args() + + +def predict(): + paddle.set_device(args.device) + + model = ErnieForGeneration.from_pretrained(args.model_name_or_path) + if "ernie-tiny" in args.model_name_or_path: + tokenizer = ErnieTinyTokenizer.from_pretrained(args.model_name_or_path) + elif "ernie" in args.model_name_or_path: + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + elif "roberta" in args.model_name_or_path or "rbt" in args.model_name_or_path: + tokenizer = RobertaTokenizer.from_pretrained(args.model_name_or_path) + elif "electra" in args.model_name_or_path: + tokenizer = ElectraTokenizer.from_pretrained(args.model_name_or_path) + else: + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + dev_dataset = load_dataset("poetry", splits=("dev"), lazy=False) + attn_id = tokenizer.vocab["[ATTN]"] if "[ATTN]" in tokenizer.vocab else tokenizer.vocab["[MASK]"] + tgt_type_id = model.sent_emb.weight.shape[0] - 1 + + trans_func = convert_example( + tokenizer=tokenizer, + attn_id=attn_id, + tgt_type_id=tgt_type_id, + max_encode_len=args.max_encode_len, + max_decode_len=args.max_decode_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # attn_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_labels + ): after_padding(fn(samples)) + + dev_dataset = dev_dataset.map(trans_func) + test_batch_sampler = paddle.io.BatchSampler(dev_dataset, batch_size=args.batch_size, shuffle=False) + data_loader = DataLoader( + dataset=dev_dataset, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + if 
args.init_checkpoint: + model_state = paddle.load(args.init_checkpoint) + model.set_state_dict(model_state) + + model.eval() + vocab = tokenizer.vocab + eos_id = vocab[tokenizer.sep_token] + sos_id = vocab[tokenizer.cls_token] + pad_id = vocab[tokenizer.pad_token] + unk_id = vocab[tokenizer.unk_token] + vocab_size = len(vocab) + logger.info("Predicting...") + for data in data_loader: + (src_ids, src_sids, src_pids, _, _, _, _, _, _, _, _, raw_tgt_labels) = data # never use target when infer + # Use greedy_search_infilling or beam_search_infilling to get predictions + output_ids = beam_search_infilling( + model, + src_ids, + src_sids, + eos_id=eos_id, + sos_id=sos_id, + attn_id=attn_id, + pad_id=pad_id, + unk_id=unk_id, + vocab_size=vocab_size, + max_decode_len=args.max_decode_len, + max_encode_len=args.max_encode_len, + beam_width=args.beam_width, + length_penalty=args.length_penalty, + tgt_type_id=tgt_type_id, + ) + + for source_ids, target_ids, predict_ids in zip( + src_ids.numpy().tolist(), raw_tgt_labels.numpy().tolist(), output_ids.tolist() + ): + if eos_id in predict_ids: + predict_ids = predict_ids[: predict_ids.index(eos_id)] + source_sentence = "".join(map(post_process, vocab.to_tokens(source_ids[1 : source_ids.index(eos_id)]))) + tgt_sentence = "".join(map(post_process, vocab.to_tokens(target_ids[1 : target_ids.index(eos_id)]))) + predict_ids = "".join(map(post_process, vocab.to_tokens(predict_ids))) + print("source :%s\ntarget :%s\npredict:%s\n" % (source_sentence, tgt_sentence, predict_ids)) + + +if __name__ == "__main__": + predict() diff --git a/model_zoo/ernie-gen/train.py b/model_zoo/ernie-gen/train.py new file mode 100644 index 0000000000000000000000000000000000000000..feafba7ce36c04eaf9a03a8254f1877a5ce3686b --- /dev/null +++ b/model_zoo/ernie-gen/train.py @@ -0,0 +1,323 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
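+
+"""Fine-tune ERNIE-GEN on the poetry dataset using the infilling generation setup from
+encode.py and decode.py; evaluate with ROUGE-1 / ROUGE-2 every --save_steps steps and
+save checkpoints to --output_dir."""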
+ +import argparse +import os +import time + +import paddle +import paddle.nn as nn +from decode import beam_search_infilling, post_process +from encode import after_padding, convert_example +from model import StackModel +from paddle.io import DataLoader +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Rouge1, Rouge2 +from paddlenlp.transformers import ( + BertTokenizer, + ElectraTokenizer, + ErnieForGeneration, + ErnieTinyTokenizer, + ErnieTokenizer, + LinearDecayWithWarmup, + RobertaTokenizer, +) +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser("seq2seq model with ERNIE-GEN") +parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join(list(ErnieTokenizer.pretrained_init_configuration.keys())), +) +parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", +) +parser.add_argument("--max_encode_len", type=int, default=5, help="The max encoding sentence length") +parser.add_argument("--max_decode_len", type=int, default=5, help="The max decoding sentence length") +parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", +) +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.1, type=float, help="Weight decay if we apply some.") +parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument( + "--num_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", +) +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", +) +parser.add_argument("--beam_width", type=int, default=1, help="Beam search width.") +parser.add_argument("--noise_prob", type=float, default=0.0, help="Probability of token be repalced.") +parser.add_argument( + "--use_random_noice", + action="store_true", + help="If set, replace target tokens with random token from vocabulary, else replace with `[NOISE]`.", +) +parser.add_argument("--label_smooth", type=float, default=0.0, help="The soft label smooth rate.") +parser.add_argument("--length_penalty", type=float, default=1.0, help="The length penalty during decoding.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Checkpoint to warm start from.") +parser.add_argument("--save_dir", type=str, default=None, help="Model output directory.") +parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_epochs.", +) + +args = parser.parse_args() + + +def evaluate(model, data_loader, tokenizer, rouge1, rouge2, attn_id, tgt_type_id, args): + model.eval() + + vocab = tokenizer.vocab + eos_id = vocab[tokenizer.sep_token] + sos_id = vocab[tokenizer.cls_token] + pad_id = vocab[tokenizer.pad_token] + unk_id = vocab[tokenizer.unk_token] + vocab_size = len(vocab) + evaluated_sentences_ids = [] + reference_sentences_ids = [] + logger.info("Evaluating...") + for data in tqdm(data_loader): + (src_ids, src_tids, src_pids, _, _, _, _, _, _, _, _, raw_tgt_labels) = data # never use target when infer + # Use greedy_search_infilling or beam_search_infilling to get predictions + output_ids = beam_search_infilling( + model, + src_ids, + src_tids, + eos_id=eos_id, + sos_id=sos_id, + attn_id=attn_id, + pad_id=pad_id, + unk_id=unk_id, + vocab_size=vocab_size, + max_decode_len=args.max_decode_len, + max_encode_len=args.max_encode_len, + beam_width=args.beam_width, + length_penalty=args.length_penalty, + tgt_type_id=tgt_type_id, + ) + + for ids in output_ids.tolist(): + if eos_id in ids: + ids = ids[: ids.index(eos_id)] + evaluated_sentences_ids.append(ids) + + for ids in raw_tgt_labels.numpy().tolist(): + ids = ids[: ids.index(eos_id)] + reference_sentences_ids.append(ids) + + score1 = rouge1.score(evaluated_sentences_ids, reference_sentences_ids) + score2 = rouge2.score(evaluated_sentences_ids, reference_sentences_ids) + + logger.info("Rouge-1: %.5f ,Rouge-2: %.5f" % (score1 * 100, score2 * 100)) + + evaluated_sentences = [] + reference_sentences = [] + for ids in reference_sentences_ids[:5]: + reference_sentences.append("".join(map(post_process, vocab.to_tokens(ids)))) + for ids in evaluated_sentences_ids[:5]: + evaluated_sentences.append("".join(map(post_process, vocab.to_tokens(ids)))) + logger.debug(reference_sentences) + logger.debug(evaluated_sentences) + + model.train() + + +def train(): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + model = ErnieForGeneration.from_pretrained(args.model_name_or_path) + if "ernie-tiny" in args.model_name_or_path: + tokenizer = ErnieTinyTokenizer.from_pretrained(args.model_name_or_path) + elif "ernie" in args.model_name_or_path: + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + elif "roberta" in args.model_name_or_path or "rbt" in args.model_name_or_path: + tokenizer = RobertaTokenizer.from_pretrained(args.model_name_or_path) + elif "electra" in args.model_name_or_path: + tokenizer = ElectraTokenizer.from_pretrained(args.model_name_or_path) + else: + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + if args.init_checkpoint: + model_state = paddle.load(args.init_checkpoint) + model.set_state_dict(model_state) + + train_dataset, dev_dataset = load_dataset("poetry", splits=("train", "dev"), lazy=False) + attn_id = tokenizer.vocab["[ATTN]"] if "[ATTN]" in tokenizer.vocab else tokenizer.vocab["[MASK]"] + tgt_type_id = model.sent_emb.weight.shape[0] - 1 + + trans_func = convert_example( + tokenizer=tokenizer, + attn_id=attn_id, + tgt_type_id=tgt_type_id, + max_encode_len=args.max_encode_len, + max_decode_len=args.max_decode_len, + noise_prob=args.noise_prob, + use_random_noice=args.use_random_noice, + ) + + train_dataset = train_dataset.map(trans_func) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_dataset, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, 
pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_pids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # src_tids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_pids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tgt_tids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # attn_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_labels + ): after_padding(fn(samples)) + train_data_loader = DataLoader( + dataset=train_dataset, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + dev_dataset = dev_dataset.map(trans_func) + dev_data_loader = DataLoader( + dataset=dev_dataset, batch_size=args.batch_size, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + label_num = model.word_emb.weight.shape[0] + train_model = StackModel(model) + if paddle.distributed.get_world_size() > 1: + # All 'forward' outputs derived from the module parameters using in DataParallel + # must participate in the calculation of losses and subsequent gradient calculations. + # So we use StackModel here to make the model only output loss in its 'forward' function. + train_model = paddle.DataParallel(train_model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=nn.ClipGradByGlobalNorm(1.0), + apply_decay_param_fun=lambda x: x in decay_params, + ) + + rouge1 = Rouge1() + rouge2 = Rouge2() + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + ( + src_ids, + src_tids, + src_pids, + tgt_ids, + tgt_tids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + _, + ) = batch + if args.label_smooth > 0.0: + tgt_labels = nn.functional.label_smooth( + nn.functional.one_hot(tgt_labels, label_num), epsilon=args.label_smooth + ) + tgt_pos = paddle.nonzero(attn_ids == attn_id) + loss = train_model( + src_ids, + src_tids, + src_pids, + tgt_ids, + tgt_tids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + tgt_pos, + ) + if global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s, lr: %.3e" + % ( + global_step, + epoch, + step, + loss, + args.logging_steps / (time.time() - tic_train), + lr_scheduler.get_lr(), + ) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if ( + global_step % args.save_steps == 0 + or global_step == num_training_steps + and paddle.distributed.get_rank() == 0 + ): + evaluate(model, dev_data_loader, tokenizer, rouge1, rouge2, attn_id, tgt_type_id, args) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = 
model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + train() diff --git a/model_zoo/ernie-health/README.md b/model_zoo/ernie-health/README.md new file mode 100644 index 0000000000000000000000000000000000000000..af5924fffec0fb1b1d8015350976594c53a04491 --- /dev/null +++ b/model_zoo/ernie-health/README.md @@ -0,0 +1,189 @@ +# ERNIE-Health 中文医疗预训练模型 + +医疗领域存在大量的专业知识和医学术语,人类经过长时间的学习才能成为一名优秀的医生。那机器如何才能“读懂”医疗文献呢?尤其是面对电子病历、生物医疗文献中存在的大量非结构化、非标准化文本,计算机是无法直接使用、处理的。这就需要自然语言处理(Natural Language Processing,NLP)技术大展身手了。 + +## 模型介绍 + +本项目针对中文医疗语言理解任务,开源了中文医疗预训练模型 [ERNIE-Health](https://arxiv.org/pdf/2110.07244.pdf)(模型名称`ernie-health-chinese`)。 + +ERNIE-Health 依托百度文心 ERNIE 先进的知识增强预训练语言模型打造, 通过医疗知识增强技术进一步学习海量的医疗数据, 精准地掌握了专业的医学知识。ERNIE-Health 利用医疗实体掩码策略对专业术语等实体级知识学习, 学会了海量的医疗实体知识。同时,通过医疗问答匹配任务学习病患病状描述与医生专业治疗方案的对应关系,获得了医疗实体知识之间的内在联系。ERNIE-Health 共学习了 60 多万的医疗专业术语和 4000 多万的医疗专业问答数据,大幅提升了对医疗专业知识的理解和建模能力。此外,ERNIE-Health 还探索了多级语义判别预训练任务,提升了模型对医疗知识的学习效率。该模型的整体结构与 ELECTRA 相似,包括生成器和判别器两部分。 + +![Overview_of_EHealth](https://user-images.githubusercontent.com/25607475/163949632-8b34e23c-d0cd-49df-8d88-8549a253d221.png) + +更多技术细节可参考论文 +- [Building Chinese Biomedical Language Models via Multi-Level Text Discrimination](https://arxiv.org/pdf/2110.07244.pdf) + +## 模型效果 + +ERNIE-Health模型以超越人类医学专家水平的成绩登顶中文医疗信息处理权威榜单 [CBLUE](https://github.com/CBLUEbenchmark/CBLUE) 冠军, 验证了 ERNIE 在医疗行业应用的重要价值。 + +![CBLUERank](https://user-images.githubusercontent.com/25607475/160394225-04f75498-ce1a-4665-85f7-d495815eed51.png) + +相应的开源模型参数 ``ernie-health-chinese`` 在 CBLUE **验证集** 上的评测指标如下表所示: + +| Task | metric | results | results (fp16) | +| --------- | :------: | :-----: | :------------: | +| CHIP-STS | Macro-F1 | 0.88749 | 0.88555 | +| CHIP-CTC | Macro-F1 | 0.84136 | 0.83514 | +| CHIP-CDN | F1 | 0.76979 | 0.76489 | +| KUAKE-QQR | Accuracy | 0.83865 | 0.84053 | +| KUAKE-QTR | Accuracy | 0.69722 | 0.69722 | +| KUAKE-QIC | Accuracy | 0.81483 | 0.82046 | +| CMeEE | Micro-F1 | 0.66120 | 0.66026 | +| CMeIE | Micro-F1 | 0.61385 | 0.60076 | + +## 环境依赖 + +- paddlepaddle >= 2.2.0 +- paddlenlp >= 2.3.4 + +## 模型预训练 + +PaddleNLP中提供了ERNIE-Health训练好的模型参数。``ernie-health-chinese``版本为160G医疗文本数据上的训练结果,数据包括脱敏医患对话语料、医疗健康科普文章、脱敏医院电子医疗病例档案以及医学和临床病理学教材。本节提供了预训练的整体流程,可用于自定义数据的学习。 + +#### 注意: 预训练资源要求 + +- 推荐使用至少4张16G以上显存的GPU进行预训练。 +- 数据量应尽可能接近ERNIE-Health论文中训练数据的量级,以获得好的预训练模型效果。 +- 若资源有限,可以直接使用开源的ERNIE-Health模型进行微调,具体实现可参考 [CBLUE样例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-health/cblue)。 + +#### 数据准备 + +- 数据编码:UTF-8 +- 数据格式:预训练文本数据放在同个目录下,每个文件中每行一句中文文本。 + +- 数据预处理:首先对原始文本进行分词,分词结果中非首中文字符替换为``##``前缀的字符(例如,``医疗``处理后得到``[医, ##疗]``)。接着将token转换为对应的id。最后将目录下的全部数据合并存储,token ids拼接后存储至``.npy``文件,每条样本的长度存储在``.npz``文件。 + +```shell +python preprocess.py --input_path ./raw_data/ --output_file ./data/samples --tokenize_tool lac --num_worker 8 +``` +可配置参数包括 +- ``input_path`` 为原始文本数据所在目录,该目录下包含至少一个中文文本文件,UTF-8编码。 +- ``output_file`` 为预处理后数据的存储路径及文件名(不包含后缀)。 +- ``tokenize_tool``表示分词工具,包括``lac``、``seg``和``jieba``,默认为``lac``。 +- ``logging_steps`` 表示日志打印间隔,每处理``logging_steps``个句子打印一次日志。 +- ``num_worker`` 表示使用的进程数,增加进程数可加速预处理。 + + +#### 单机单卡 + +``` +CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \ + --input_dir ./data \ + --output_dir ./output \ + --learning_rate 1e-7 \ + --batch_size 10 \ + --adam_epsilon 1e-8 \ + --weight_decay 1e-2 \ + --warmup_steps 10000 \ + --max_steps 
1000000 \ + --save_steps 10000 \ + --logging_steps 1 \ + --seed 1000 \ + --use_amp +``` + +#### 单机多卡 + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" run_pretrain.py \ + --input_dir ./data \ + --output_dir ./output \ + --learning_rate 1e-7 \ + --batch_size 10 \ + --adam_epsilon 1e-8 \ + --weight_decay 1e-2 \ + --warmup_steps 10000 \ + --max_steps 1000000 \ + --save_steps 10000 \ + --logging_steps 1 \ + --seed 1000 \ + --use_amp +``` + +可配置参数包括 +- ``model_name_or_path``表示内置模型参数名(目前支持``ernie-health-chinese``),或者模型参数配置路径(这时需配置 --init_from_ckpt 参数一起使用,一般用于断点恢复训练场景。) +- ``input_dir``表示训练数据所在目录,该目录下要有``.npy``和``.npz``两个文件,格式与```preprocess.py``预处理结果相同。 +- ``output_dir``表示预训练模型参数和训练日志的保存目录。 +- ``batch_size``表示每次迭代每张卡上的样本数量。当batch_size=4时,运行时单卡约需要12G显存。如果实际GPU显存小于12G或大大多于12G,可适当调小/调大此配置。 +- ``learning_rate`` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- ``max_seq_length`` 表示最大句子长度,超过该长度将被截断。 +- ``weight_decay`` 表示每次迭代中参数缩小的比例,该值乘以学习率为真正缩小的比例。 +- ``adam_epsilon`` 表示adam优化器中的epsilon值。 +- ``warmup_steps`` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- ``num_epochs`` 表示训练轮数。 +- ``logging_steps`` 表示日志打印间隔。 +- ``save_steps`` 表示模型保存间隔。 +- ``max_steps`` 如果配置且大于0,表示预训练最多执行的迭代数量;如果不配置或配置小于0,则根据输入数据量、``batch_size``和``num_epochs``来确定预训练迭代数量。 +- ``device`` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- ``use_amp`` 表示是否开启混合精度(float16)进行训练,默认不开启。如果在命令中加上了--use_amp,则会开启。 +- ``init_from_ckpt`` 表示是否从某个checkpoint继续训练(断点恢复训练),默认不开启。如果在命令中加上了--init_from_ckpt,且 --model_name_or_path 配置的是路径,则会开启从某个checkpoint继续训练。 + +#### Trainer 训练版本 +本样例同时提供了Trainer版本的预训练流程,预训练重启、可视化等流程较为完备。需要从源码安装paddlenlp使用。 + +``` +unset CUDA_VISIBLE_DEVICES +task_name="eheath-pretraining" + +python -u -m paddle.distributed.launch \ + --gpus 0,1,2,3,4,5,6,7 \ + run_pretrain_trainer.py \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --max_seq_length 512 \ + --gradient_accumulation_steps 1\ + --per_device_train_batch_size 8 \ + --learning_rate 0.001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20 \ + --dataloader_num_workers 2 \ + --device "gpu"\ + --fp16 \ + --fp16_opt_level "O1" \ + --do_train \ + --disable_tqdm True\ + --save_total_limit 10 +``` +大部分参数含义如上文所述,这里简要介绍一些新参数: + +- ``per_device_train_batch_size`` 同上文batch_size。训练时,每次迭代每张卡上的样本数目。 +- ``warmup_ratio`` 与warmup_steps类似,warmup步数占总步数的比例。 +- ``fp16`` 与`use_amp`相同,表示使用混合精度 +- ``fp16_opt_level`` 混合精度的策略。注:O2训练eHealth存在部分问题,暂时请勿使用。 +- ``save_total_limit`` 保存的ckpt数量的最大限制 + +## 微调 + +模型预训练结束后,可以对判别器进行微调以完成下游医疗任务。不同任务的模型加载方式如下: + +``` +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained('ernie-health-chinese') + +# 分类任务 +model = AutoModelForSequenceClassification.from_pretrained('ernie-health-chinese') +# 序列标注任务 +model = AutoModelForTokenClassification.from_pretrained('ernie-health-chinese') +# 阅读理解任务 +model = AutoModelForQuestionAnswering.from_pretrained('ernie-health-chinese') +``` + +本项目提供了在 CBLUE 数据集上的微调脚本,包括分类、实体识别和关系抽取三类任务,详细信息可参考 ``cblue``[目录](./cblue)。 + +## 部署 + +我们为ERNIE-Health微调后的模型提供了Python端部署方案,请根据实际情况进行实现。 + +详细部署流程请参考:[基于ONNXRuntime推理部署指南](./cblue/deploy/predictor/) + + +## Reference + +Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021). 
[pdf](https://arxiv.org/abs/2110.07244) diff --git a/model_zoo/ernie-health/cblue/README.md b/model_zoo/ernie-health/cblue/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ce37227d052ce5ddc86bcb89a54178bbeb4b1b65 --- /dev/null +++ b/model_zoo/ernie-health/cblue/README.md @@ -0,0 +1,130 @@ +# 使用医疗领域预训练模型Fine-tune完成中文医疗语言理解任务 + +本示例展示了中文医疗预训练模型 ERNIE-Health([Building Chinese Biomedical Language Models via Multi-Level Text Discrimination](https://arxiv.org/abs/2110.07244))如何 Fine-tune 完成中文医疗语言理解任务。 + +## 数据集介绍 + +本项目使用了中文医学语言理解测评([Chinese Biomedical Language Understanding Evaluation,CBLUE](https://github.com/CBLUEbenchmark/CBLUE))1.0 版本数据集,这是国内首个面向中文医疗文本处理的多任务榜单,涵盖了医学文本信息抽取(实体识别、关系抽取)、医学术语归一化、医学文本分类、医学句子关系判定和医学问答共5大类任务8个子任务。其数据来源分布广泛,包括医学教材、电子病历、临床试验公示以及互联网用户真实查询等。该榜单一经推出便受到了学界和业界的广泛关注,已逐渐发展成为检验AI系统中文医疗信息处理能力的“金标准”。 + +* CMeEE:中文医学命名实体识别 +* CMeIE:中文医学文本实体关系抽取 +* CHIP-CDN:临床术语标准化任务 +* CHIP-CTC:临床试验筛选标准短文本分类 +* CHIP-STS:平安医疗科技疾病问答迁移学习 +* KUAKE-QIC:医疗搜索检索词意图分类 +* KUAKE-QTR:医疗搜索查询词-页面标题相关性 +* KUAKE-QQR:医疗搜索查询词-查询词相关性 + +更多信息可参考CBLUE的[github](https://github.com/CBLUEbenchmark/CBLUE/blob/main/README_ZH.md)。其中对于临床术语标准化任务(CHIP-CDN),我们按照 ERNIE-Health 中的方法通过检索将原多分类任务转换为了二分类任务,即给定一诊断原词和一诊断标准词,要求判定后者是否是前者对应的诊断标准词。本项目提供了检索处理后的 CHIP-CDN 数据集(简写`CHIP-CDN-2C`),且构建了基于该数据集的example代码。 + +## 模型介绍 + +ERNIE-Health 模型的整体结构与 ELECTRA 相似,包括生成器和判别器两部分。 而 Fine-tune 过程只用到了判别器模块,由 12 层 Transformer 网络组成。 + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +cblue/ +├── train_classification.py # 文本分类任务训练评估脚本 +├── train_ner.py # 实体识别任务训练评估脚本 +├── train_spo.py # 关系抽取任务训练评估脚本 +├── export_model.py # 动态图导出静态图参数脚本 +└── README.md +``` + +### 依赖安装 + +```shell +pip install xlrd==1.2.0 +``` + +### 模型训练 + +我们按照任务类别划分,同时提供了8个任务的样例代码。可以运行下边的命令,在训练集上进行训练,并在**验证集**上进行验证。 + +**训练参数设置(Training setup)及结果** + +| Task | epochs | batch_size | learning_rate | max_seq_length | metric | results | results (fp16) | +| --------- | :----: | :--------: | :-----------: | :------------: | :------: | :-----: | :------------: | +| CHIP-STS | 4 | 16 | 3e-5 | 96 | Macro-F1 | 0.88749 | 0.88555 | +| CHIP-CTC | 4 | 32 | 6e-5 | 160 | Macro-F1 | 0.84136 | 0.83514 | +| CHIP-CDN | 16 | 256 | 3e-5 | 32 | F1 | 0.76979 | 0.76489 | +| KUAKE-QQR | 2 | 32 | 6e-5 | 64 | Accuracy | 0.83865 | 0.84053 | +| KUAKE-QTR | 4 | 32 | 6e-5 | 64 | Accuracy | 0.69722 | 0.69722 | +| KUAKE-QIC | 4 | 32 | 6e-5 | 128 | Accuracy | 0.81483 | 0.82046 | +| CMeEE | 2 | 32 | 6e-5 | 128 | Micro-F1 | 0.66120 | 0.66026 | +| CMeIE | 100 | 12 | 6e-5 | 300 | Micro-F1 | 0.61385 | 0.60076 | + +可支持配置的参数: + +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ELECTRA模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为6e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.01。 +* `epochs`: 训练轮次,默认为3。 +* `max_steps`: 最大训练步数。若训练`epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +* `valid_steps`: evaluate的间隔steps数,默认100。 +* `save_steps`: 保存checkpoints的间隔steps数,默认100。 +* `logging_steps`: 日志打印的间隔steps数,默认10。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.1。 +* `init_from_ckpt`:可选,模型参数路径,恢复模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. 
+* `device`: 选用什么设备进行训练,可选cpu、gpu或npu。如使用gpu训练则参数gpus指定GPU卡号。 +* `use_amp`: 是否使用混合精度训练,默认为False。 + + +#### 医疗文本分类任务 + +```shell +$ unset CUDA_VISIBLE_DEVICES +$ python -m paddle.distributed.launch --gpus '0,1,2,3' train_classification.py --dataset CHIP-CDN-2C --batch_size 256 --max_seq_length 32 --learning_rate 3e-5 --epochs 16 +``` + +其他可支持配置的参数: + +* `dataset`:可选,CHIP-CDN-2C CHIP-CTC CHIP-STS KUAKE-QIC KUAKE-QTR KUAKE-QQR,默认为KUAKE-QIC数据集。 + +#### 医疗命名实体识别任务(CMeEE) + +```shell +$ export CUDA_VISIBLE_DEVICES=0 +$ python train_ner.py --batch_size 32 --max_seq_length 128 --learning_rate 6e-5 --epochs 12 +``` + +#### 医疗关系抽取任务(CMeIE) + +```shell +$ export CUDA_VISIBLE_DEVICES=0 +$ python train_spo.py --batch_size 12 --max_seq_length 300 --learning_rate 6e-5 --epochs 100 +``` + +### 静态图模型导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,用于部署推理等,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + +运行方式: +1. 分类任务静态图模型导出 +```shell +python export_model.py --train_dataset CHIP-CDN-2C --params_path=./checkpoint/model_900/ --output_path=./export +``` + +2. SPO任务静态图模型导出 +```shell +python export_model.py --train_dataset CMeIE --params_path=./checkpoint/model_900/ --output_path=./export +``` + +3. NER任务静态图模型导出 +```shell +python export_model.py --train_dataset CMeEE --params_path=./checkpoint/model_1500/ --output_path=./export +``` + +**NOTICE**: train_dataset分类任务选择填上训练数据集名称,params_path选择最好参数的模型的路径。 + +[1] CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [pdf](https://arxiv.org/abs/2106.08087) [git](https://github.com/CBLUEbenchmark/CBLUE) [web](https://tianchi.aliyun.com/specials/promotion/2021chinesemedicalnlpleaderboardchallenge) + +[2] Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021). 
[pdf](https://arxiv.org/abs/2110.07244) diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/README.md b/model_zoo/ernie-health/cblue/deploy/predictor/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4741a43385ffec8c3869c10453e6e7abe018ad28 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/README.md @@ -0,0 +1,191 @@ +# 基于ONNXRuntime推理部署指南 + +本示例以[CBLUE数据集微调](../../README.md)得到的ERNIE-Health模型为例,分别提供了文本分类任务、实体识别任务和关系抽取任务的部署代码,自定义数据集可参考实现。 +在推理部署前需将微调后的动态图模型转换导出为静态图,详细步骤见[静态图模型导出](../../README.md)。 + +**目录** + * [环境安装](#环境安装) + * [GPU部署推理样例](#gpu部署推理样例) + * [CPU部署推理样例](#cpu部署推理样例) + * [性能与精度测试](#性能与精度测试) + * [GPU精度与性能](#gpu精度与性能) + * [CPU精度与性能](#cpu精度与性能) + +## 环境安装 + +ONNX模型转换和推理部署依赖于Paddle2ONNX和ONNXRuntime。其中Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +#### GPU端 +请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +``` +python -m pip install -r requirements_gpu.tx +``` +\* 如需使用半精度(FP16)部署,请确保GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。 更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix) +#### CPU端 +请使用如下命令安装所需依赖: +``` +python -m pip install -r requirements_cpu.txt +``` +## GPU部署推理样例 + +请使用如下命令进行GPU上的部署,可用`use_fp16`开启**半精度部署推理加速**,可用`device_id`**指定GPU卡号**。 + +- 文本分类任务 + +``` +python infer_classification.py --device gpu --device_id 0 --dataset KUAKE-QIC --model_path_prefix ../../export/inference +``` + +- 实体识别任务 + +``` +python infer_ner.py --device gpu --device_id 0 --dataset CMeEE --model_path_prefix ../../export/inference +``` + +- 关系抽取任务 + +``` +python infer_spo.py --device gpu --device_id 0 --dataset CMeIE --model_path_prefix ../../export/inference +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-health-chinese"。 +* `dataset`:CBLUE中的训练数据集。 + * `文本分类任务`:包括KUAKE-QIC, KUAKE-QQR, KUAKE-QTR, CHIP-CTC, CHIP-STS, CHIP-CDN-2C;默认为KUAKE-QIC。 + * `实体抽取任务`:默认为CMeEE。 + * `关系抽取任务`:默认为CMeIE。 +* `max_seq_length`:模型使用的最大序列长度,最大不能超过512;`关系抽取任务`默认为300,其余默认为128。 +* `use_fp16`:选择是否开启FP16进行加速,仅在`devive=gpu`时生效;默认关闭。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `device`: 选用什么设备进行训练,可选cpu、gpu;默认为gpu。 +* `device_id`: 选择GPU卡号;默认为0。 +* `data_file`:本地待预测数据文件;默认为None。 + +#### 本地数据集加载 +如需使用本地数据集,请指定本地待预测数据文件 `data_file`,每行一条样例,单文本输入每句一行,双文本输入以`\t`分隔符隔开。例如 + +**ctc-data.txt** +``` +在过去的6个月曾服用偏头痛预防性药物或长期服用镇痛药物者,以及有酒精依赖或药物滥用习惯者; +患有严重的冠心病、脑卒中,以及传染性疾病、精神疾病者; +活动性乙肝(包括大三阳或小三阳)或血清学指标(HBsAg或/和HBeAg或/和HBcAb)阳性者,丙肝、肺结核、巨细胞病毒、严重真菌感染或HIV感染; +... +``` + +**sts-data.txt** +``` +糖尿病能吃减肥药吗?能治愈吗?\t糖尿病为什么不能吃减肥药? +为什么慢性乙肝会急性发作\t引起隐匿性慢性乙肝的原因是什么 +标准血压是多少高血压指?低血压又指?\t半月前检查血压100/130,正常吗? +... 
+``` + +## CPU部署推理样例 + +请使用如下命令进行CPU上的部署,可用`num_threads`**调整预测线程数量**。 + +- 文本分类任务 + +``` +python infer_classification.py --device cpu --dataset KUAKE-QIC --model_path_prefix ../../export/inference +``` + +- 实体识别任务 + +``` +python infer_ner.py --device cpu --dataset CMeEE --model_path_prefix ../../export/inference +``` + +- 关系抽取任务 + +``` +python infer_spo.py --device cpu --dataset CMeIE --model_path_prefix ../../export/inference +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-health-chinese"。 +* `dataset`:CBLUE中的训练数据集。 + * `文本分类任务`:包括KUAKE-QIC, KUAKE-QQR, KUAKE-QTR, CHIP-CTC, CHIP-STS, CHIP-CDN-2C;默认为KUAKE-QIC。 + * `实体抽取任务`:默认为CMeEE。 + * `关系抽取任务`:默认为CMeIE。 +* `max_seq_length`:模型使用的最大序列长度,最大不能超过512;`关系抽取任务`默认为300,其余默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `device`: 选用什么设备进行训练,可选cpu、gpu;默认为gpu。 +* `num_threads`:cpu线程数,在`device=gpu`时影响较小;默认为cpu的物理核心数量。 +* `data_file`:本地待预测数据文件,格式见[GPU部署推理样例](#本地数据集加载)中的介绍;默认为None。 + +## 性能与精度测试 + +本节提供了在CBLUE数据集上预测的性能和精度数据,以供参考。 + +测试配置如下: + +1. 数据集 + + 使用CBLUE数据集中的开发集用于ERNIE-Health微调模型部署推理的性能与精度测试,包括 + + - 医疗搜索检索词意图分类(KUAKE-QIC)任务 + - 医疗搜索查询词-页面标题相关性(KUAKE-QTR)任务 + - 医疗搜索查询词-查询词相关性(KUAKE-QQR)任务 + - 临床试验筛选标准短文本分类(CHIP-CTC)任务 + - 平安医疗科技疾病问答迁移学习(CHIP-STS)任务 + - 临床术语标准化匹配(CHIP-CDN-2C)任务 + - 中文医学命名实体识别(CMeEE)任务 + - 中文医学文本实体关系抽取(CMeIE)任务 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.4 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 200(CHIP-CDN-2C 和 CMeIE 数据集为 20),部署运行时间 total_time,计算 latency = total_time / total_samples + + +### GPU精度与性能 + +| 数据集 | 最大文本长度 | 精度评估指标 | FP32 指标值 | FP16 指标值 | FP32 latency(ms) | FP16 latency(ms) | +| ---------- | ---------- | ---------- | ---------- | ---------- | ---------------- | ---------------- | +| KUAKE-QIC | 128 | Accuracy | 0.8046 | 0.8046 | 1.92 | 0.46 | +| KUAKE-QTR | 64 | Accuracy | 0.6886 | 0.6876 (-) | 0.92 | 0.23 | +| KUAKE-QQR | 64 | Accuracy | 0.7755 | 0.7755 | 0.61 | 0.16 | +| CHIP-CTC | 160 | Macro F1 | 0.8445 | 0.8446 (+) | 2.34 | 0.60 | +| CHIP-STS | 96 | Macro F1 | 0.8892 | 0.8892 | 1.39 | 0.35 | +| CHIP-CDN-2C | 256 | Macro F1 | 0.8921 | 0.8920 (-) | 1.58 | 0.48 | +| CMeEE | 128 | Micro F1 | 0.6469 | 0.6468 (-) | 1.90 | 0.48 | +| CMeIE | 300 | Micro F1 | 0.5903 | 0.5902 (-) | 50.32 | 41.50 | + +经过FP16转化加速比达到 1.2 ~ 4 倍左右,精度变化在 1e-4 ~ 1e-3 内。 + +### CPU精度与性能 + +测试环境及说明如上,测试 CPU 性能时,线程数设置为40。 + +| 数据集 | 最大文本长度 | 精度评估指标 | FP32 指标值 | FP32 latency(ms) | +| ---------- | ------------ | ------------ | ---------- | ---------------- | +| KUAKE-QIC | 128 | Accuracy | 0.8046 | 37.72 | +| KUAKE-QTR | 64 | Accuracy | 0.6886 | 18.40 | +| KUAKE-QQR | 64 | Accuracy | 0.7755 | 10.34 | +| CHIP-CTC | 160 | Macro F1 | 0.8445 | 47.43 | +| CHIP-STS | 96 | Macro F1 | 0.8892 | 27.67 | +| CHIP-CDN-2C | 256 | Micro F1 | 0.8921 | 26.86 | +| CMeEE | 128 | Micro F1 | 0.6469 | 37.59 | +| CMeIE | 300 | Micro F1 | 0.5902 | 213.04 | diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/infer_classification.py b/model_zoo/ernie-health/cblue/deploy/predictor/infer_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..2c4586fc9bc2946d892f731a537be286248c3b97 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/infer_classification.py @@ -0,0 +1,146 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import psutil +from predictor import CLSPredictor + +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used." + ) + parser.add_argument( + "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model." + ) + parser.add_argument("--dataset", default="KUAKE-QIC", type=str, help="Dataset for text classfication.") + parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.") + parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." + ) + parser.add_argument( + "--use_fp16", + action="store_true", + help="Whether to use fp16 inference, only takes effect when deploying on gpu.", + ) + parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") + parser.add_argument( + "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu." + ) + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+ ) + parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") + args = parser.parse_args() + return args + + +LABEL_LIST = { + "kuake-qic": ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"], + "kuake-qtr": ["完全不匹配", "很少匹配,有一些参考价值", "部分匹配", "完全匹配"], + "kuake-qqr": ["B为A的语义父集,B指代范围大于A; 或者A与B语义毫无关联。", "B为A的语义子集,B指代范围小于A。", "表示A与B等价,表述完全一致。"], + "chip-ctc": [ + "成瘾行为", + "居住情况", + "年龄", + "酒精使用", + "过敏耐受", + "睡眠", + "献血", + "能力", + "依存性", + "知情同意", + "数据可及性", + "设备", + "诊断", + "饮食", + "残疾群体", + "疾病", + "教育情况", + "病例来源", + "参与其它试验", + "伦理审查", + "种族", + "锻炼", + "性别", + "健康群体", + "实验室检查", + "预期寿命", + "读写能力", + "含有多类别的语句", + "肿瘤进展", + "疾病分期", + "护理", + "口腔相关", + "器官组织状态", + "药物", + "怀孕相关", + "受体状态", + "研究者决定", + "风险评估", + "性取向", + "体征(医生检测)", + " 吸烟状况", + "特殊病人特征", + "症状(患者感受)", + "治疗或手术", + ], + "chip-sts": ["语义不同", "语义相同"], + "chip-cdn-2c": ["否", "是"], +} + +TEXT = { + "kuake-qic": ["心肌缺血如何治疗与调养呢?", "什么叫痔核脱出?什么叫外痔?"], + "kuake-qtr": [["儿童远视眼怎么恢复视力", "远视眼该如何保养才能恢复一些视力"], ["抗生素的药有哪些", "抗生素类的药物都有哪些?"]], + "kuake-qqr": [["茴香是发物吗", "茴香怎么吃?"], ["气的胃疼是怎么回事", "气到胃痛是什么原因"]], + "chip-ctc": ["(1)前牙结构发育不良:釉质发育不全、氟斑牙、四环素牙等;", "怀疑或确有酒精或药物滥用史;"], + "chip-sts": [["糖尿病能吃减肥药吗?能治愈吗?", "糖尿病为什么不能吃减肥药"], ["H型高血压的定义", "WHO对高血压的最新分类定义标准数值"]], + "chip-cdn-2c": [["1型糖尿病性植物神经病变", " 1型糖尿病肾病IV期"], ["髂腰肌囊性占位", "髂肌囊肿"]], +} + +METRIC = { + "kuake-qic": "acc", + "kuake-qtr": "acc", + "kuake-qqr": "acc", + "chip-ctc": "macro", + "chip-sts": "macro", + "chip-cdn-2c": "macro", +} + + +def main(): + args = parse_args() + + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + args.dataset = args.dataset.lower() + label_list = LABEL_LIST[args.dataset] + if args.data_file is not None: + with open(args.data_file, "r") as fp: + input_data = [x.strip().split("\t") for x in fp.readlines()] + input_data = [x[0] if len(x) == 1 else x for x in input_data] + else: + input_data = TEXT[args.dataset] + + predictor = CLSPredictor(args, label_list) + predictor.predict(input_data) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/infer_ner.py b/model_zoo/ernie-health/cblue/deploy/predictor/infer_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..afc2d2ba99fc73678d2bab67eb27beff4811238d --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/infer_ner.py @@ -0,0 +1,116 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import psutil +from predictor import NERPredictor + +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used." + ) + parser.add_argument( + "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model." 
+ ) + parser.add_argument("--dataset", default="CMeEE", type=str, help="Dataset for named entity recognition.") + parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.") + parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization" + ) + parser.add_argument( + "--use_fp16", + action="store_true", + help="Whether to use fp16 inference, only takes effect when deploying on gpu.", + ) + parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") + parser.add_argument( + "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="Number of threads for cpu." + ) + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") + args = parser.parse_args() + return args + + +LABEL_LIST = { + "cmeee": [ + [ + "B-bod", + "I-bod", + "E-bod", + "S-bod", + "B-dis", + "I-dis", + "E-dis", + "S-dis", + "B-pro", + "I-pro", + "E-pro", + "S-pro", + "B-dru", + "I-dru", + "E-dru", + "S-dru", + "B-ite", + "I-ite", + "E-ite", + "S-ite", + "B-mic", + "I-mic", + "E-mic", + "S-mic", + "B-equ", + "I-equ", + "E-equ", + "S-equ", + "B-dep", + "I-dep", + "E-dep", + "S-dep", + "O", + ], + ["B-sym", "I-sym", "E-sym", "S-sym", "O"], + ] +} + +TEXT = {"cmeee": ["研究证实,细胞减少与肺内病变程度及肺内炎性病变吸收程度密切相关。", "可为不规则发热、稽留热或弛张热,但以不规则发热为多,可能与患儿应用退热药物导致热型不规律有关。"]} + + +def main(): + args = parse_args() + + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + dataset = args.dataset.lower() + label_list = LABEL_LIST[dataset] + if args.data_file is not None: + with open(args.data_file, "r") as fp: + input_data = [x.strip() for x in fp.readlines()] + else: + input_data = TEXT[dataset] + + predictor = NERPredictor(args, label_list) + predictor.predict(input_data) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/infer_spo.py b/model_zoo/ernie-health/cblue/deploy/predictor/infer_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..972eade14d753f2343488c2b541d636cca0e8814 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/infer_spo.py @@ -0,0 +1,124 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import psutil +from predictor import SPOPredictor + +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used." + ) + parser.add_argument( + "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model." 
+ ) + parser.add_argument("--dataset", default="CMeIE", type=str, help="Dataset for named entity recognition.") + parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.") + parser.add_argument( + "--max_seq_length", default=300, type=int, help="The maximum total input sequence length after tokenization." + ) + parser.add_argument( + "--use_fp16", + action="store_true", + help="Whether to use fp16 inference, only takes effect when deploying on gpu.", + ) + parser.add_argument( + "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu." + ) + parser.add_argument("--batch_size", default=20, type=int, help="Batch size per GPU/CPU for predicting.") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") + args = parser.parse_args() + return args + + +LABEL_LIST = { + "cmeie": [ + "预防", + "阶段", + "就诊科室", + "辅助治疗", + "化疗", + "放射治疗", + "手术治疗", + "实验室检查", + "影像学检查", + "辅助检查", + "组织学检查", + "内窥镜检查", + "筛查", + "多发群体", + "发病率", + "发病年龄", + "多发地区", + "发病性别倾向", + "死亡率", + "多发季节", + "传播途径", + "并发症", + "病理分型", + "相关(导致)", + "鉴别诊断", + "相关(转化)", + "相关(症状)", + "临床表现", + "治疗后症状", + "侵及周围组织转移的症状", + "病因", + "高危因素", + "风险评估因素", + "病史", + "遗传因素", + "发病机制", + "病理生理", + "药物治疗", + "发病部位", + "转移部位", + "外侵部位", + "预后状况", + "预后生存率", + "同义词", + ] +} + +TEXT = {"cmeie": ["骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。", "稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。"]} + + +def main(): + args = parse_args() + + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + dataset = args.dataset.lower() + label_list = LABEL_LIST[dataset] + if args.data_file is not None: + with open(args.data_file, "r") as fp: + input_data = [x.strip() for x in fp.readlines()] + else: + input_data = TEXT[dataset] + + predictor = SPOPredictor(args, label_list) + predictor.predict(input_data) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/predictor.py b/model_zoo/ernie-health/cblue/deploy/predictor/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..6e3137301cfda85e50b2150f47a6b3574b81b0f1 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/predictor.py @@ -0,0 +1,361 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import time + +import numpy as np +import onnxruntime as ort +import paddle2onnx +import six + +from paddlenlp.transformers import ( + AutoTokenizer, + normalize_chars, + tokenize_special_chars, +) +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10): + + if not isinstance(device, six.string_types): + logger.error( + ">>> [InferBackend] The type of device must be string, but the type you set is: ", type(device) + ) + exit(0) + if device not in ["cpu", "gpu"]: + logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to:", type(device)) + exit(0) + + logger.info(">>> [InferBackend] Creating Engine ...") + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + + if device == "gpu": + logger.info(">>> [InferBackend] Use GPU to inference ...") + providers = ["CUDAExecutionProvider"] + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + providers = ["CPUExecutionProvider"] + if use_fp16: + logger.warning( + ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..." + ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + + self.input_handles = [ + self.predictor.get_inputs()[0].name, + self.predictor.get_inputs()[1].name, + ] + + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + """The environment for GPU inference is not set properly. \nA possible cause is that you had installed both onnxruntime and onnxruntime-gpu. 
\nPlease run the following commands to reinstall: \n1) pip uninstall -y onnxruntime onnxruntime-gpu \n2) pip install onnxruntime-gpu""" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + input_dict = {k: v for k, v in input_dict.items() if k in self.input_handles} + result = self.predictor.run(None, input_dict) + return result + + +class EHealthPredictor(object): + def __init__(self, args, label_list): + self.label_list = label_list + self._tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + self._max_seq_length = args.max_seq_length + self._batch_size = args.batch_size + self.inference_backend = InferBackend( + args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.num_threads + ) + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, encoded_inputs): + num_sample = len(encoded_inputs["input_ids"]) + infer_data = None + num_infer_data = None + for idx in range(0, num_sample, self._batch_size): + l, r = idx, idx + self._batch_size + keys = encoded_inputs.keys() + input_dict = {k: encoded_inputs[k][l:r] for k in keys} + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) + for i in range(num_infer_data): + infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def performance(self, encoded_inputs): + nums = len(encoded_inputs["input_ids"]) + start_time = time.time() + infer_result = self.infer_batch(preprocess_result) # noqa + total_time = time.time() - start_time + logger.info("sample nums: %d, time: %.2f, latency: %.2f ms" % (nums, total_time, 1000 * total_time / nums)) + + def get_text_and_label(self, dataset): + raise NotImplementedError + + def preprocess(self, input_data: list): + raise NotImplementedError + + def postprocess(self, infer_data): + raise NotImplementedError + + def printer(self, result, input_data): + raise NotImplementedError + + +class CLSPredictor(EHealthPredictor): + def preprocess(self, input_data: list): + norm_text = lambda x: tokenize_special_chars(normalize_chars(x)) + # To deal with a pair of input text. 
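+        # Sentence-pair tasks (e.g. CHIP-STS, KUAKE-QTR/QQR, CHIP-CDN-2C) provide each
+        # sample as [text_a, text_b], so the input is a list of lists and is tokenized
+        # with text/text_pair; single-text tasks (e.g. KUAKE-QIC, CHIP-CTC) provide a
+        # plain string per sample and text_pair stays None.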
+ if isinstance(input_data[0], list): + text = [norm_text(sample[0]) for sample in input_data] + text_pair = [norm_text(sample[1]) for sample in input_data] + else: + text = [norm_text(x) for x in input_data] + text_pair = None + + data = self._tokenizer( + text=text, text_pair=text_pair, max_length=self._max_seq_length, padding=True, truncation=True + ) + + encoded_inputs = { + "input_ids": np.array(data["input_ids"], dtype="int64"), + "token_type_ids": np.array(data["token_type_ids"], dtype="int64"), + } + return encoded_inputs + + def postprocess(self, infer_data): + infer_data = infer_data[0] + max_value = np.max(infer_data, axis=1, keepdims=True) + exp_data = np.exp(infer_data - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + label = probs.argmax(axis=-1) + confidence = probs.max(axis=-1) + return {"label": label, "confidence": confidence} + + def printer(self, result, input_data): + label, confidence = result["label"], result["confidence"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}, confidence: {}".format(self.label_list[label[i]], confidence[i])) + logger.info("-----------------------------") + + +class NERPredictor(EHealthPredictor): + """The predictor for CMeEE dataset.""" + + en_to_cn = { + "bod": "身体", + "mic": "微生物类", + "dis": "疾病", + "sym": "临床表现", + "pro": "医疗程序", + "equ": "医疗设备", + "dru": "药物", + "dep": "科室", + "ite": "医学检验项目", + } + + def _extract_chunk(self, tokens): + chunks = set() + start_idx, cur_idx = 0, 0 + while cur_idx < len(tokens): + if tokens[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(tokens) and tokens[cur_idx][0] == "I": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(tokens) and tokens[cur_idx][0] == "E": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + chunks.add((tokens[cur_idx][2:], start_idx - 1, cur_idx)) + cur_idx += 1 + elif tokens[cur_idx][0] == "S": + chunks.add((tokens[cur_idx][2:], cur_idx - 1, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + return list(chunks) + + def preprocess(self, infer_data): + infer_data = [[x.lower() for x in text] for text in infer_data] + data = self._tokenizer( + infer_data, max_length=self._max_seq_length, padding=True, is_split_into_words=True, truncation=True + ) + + encoded_inputs = { + "input_ids": np.array(data["input_ids"], dtype="int64"), + "token_type_ids": np.array(data["token_type_ids"], dtype="int64"), + } + return encoded_inputs + + def postprocess(self, infer_data): + tokens_oth = np.argmax(infer_data[0], axis=-1) + tokens_sym = np.argmax(infer_data[1], axis=-1) + entity = [] + for oth_ids, sym_ids in zip(tokens_oth, tokens_sym): + token_oth = [self.label_list[0][x] for x in oth_ids] + token_sym = [self.label_list[1][x] for x in sym_ids] + chunks = self._extract_chunk(token_oth) + self._extract_chunk(token_sym) + sub_entity = [] + for etype, sid, eid in chunks: + sub_entity.append({"type": self.en_to_cn[etype], "start_id": sid, "end_id": eid}) + entity.append(sub_entity) + return {"entity": entity} + + def printer(self, result, input_data): + result = result["entity"] + for i, preds in enumerate(result): + logger.info("input data: {}".format(input_data[i])) + logger.info("detected entities:") + for item in preds: + logger.info( + "* entity: {}, type: {}, position: ({}, {})".format( + input_data[i][item["start_id"] : item["end_id"]], + item["type"], + item["start_id"], + item["end_id"], + ) + ) + 
logger.info("-----------------------------") + + +class SPOPredictor(EHealthPredictor): + """The predictor for the CMeIE dataset.""" + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + lengths = encoded_inputs["attention_mask"].sum(axis=-1) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result, lengths) + self.printer(result, input_data) + return result + + def preprocess(self, infer_data): + infer_data = [[x.lower() for x in text] for text in infer_data] + data = self._tokenizer( + infer_data, + max_length=self._max_seq_length, + padding=True, + is_split_into_words=True, + truncation=True, + return_attention_mask=True, + ) + encoded_inputs = { + "input_ids": np.array(data["input_ids"], dtype="int64"), + "token_type_ids": np.array(data["token_type_ids"], dtype="int64"), + "attention_mask": np.array(data["attention_mask"], dtype="float32"), + } + return encoded_inputs + + def postprocess(self, infer_data, lengths): + ent_logits = np.array(infer_data[0]) + spo_logits = np.array(infer_data[1]) + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_logits): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_logits > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((tuple(sub), p_id, tuple(obj))) + + return {"entity": ent_pred_list, "spo": spo_pred_list} + + def printer(self, result, input_data): + ent_pred_list, spo_pred_list = result["entity"], result["spo"] + for i, (ent, rel) in enumerate(zip(ent_pred_list, spo_pred_list)): + logger.info("input data: {}".format(input_data[i])) + logger.info("detected entities and relations:") + for sid, eid in ent: + logger.info("* entity: {}, position: ({}, {})".format(input_data[i][sid : eid + 1], sid, eid)) + for s, p, o in rel: + logger.info( + "+ spo: ({}, {}, {})".format( + input_data[i][s[0] : s[1] + 1], self.label_list[p], input_data[i][o[0] : o[1] + 1] + ) + ) + logger.info("-----------------------------") diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/requirements_cpu.txt b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..645682ec79c6c8694ee9ea288af3dc3c416a4dfb --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_cpu.txt @@ -0,0 +1,2 @@ +onnxruntime==1.10.0 +psutil diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/requirements_gpu.txt b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..2ca8b172eb7993140d6f5e2c3692a200195dd1ee --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_gpu.txt @@ -0,0 +1,4 @@ +onnxruntime-gpu==1.11.1 +onnx==1.12.0 +onnxconverter-common==1.9.0 +psutil diff --git 
a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/README.md b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..50166fe400b50fc6f3ef6e21b889ed704273ae12 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/README.md @@ -0,0 +1,60 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp >= 2.3.6 +``` +## Server服务启动 +### 分类任务启动 +#### 启动 分类 Server 服务 +```bash +paddlenlp server server_classification:app --host 0.0.0.0 --port 8189 +``` + +#### 分类任务发送服务 +```bash +python client_classification.py --dataset chip-cdn-2c +``` + +### NER 任务启动 +#### 启动 NER Server 服务 +```bash +paddlenlp server server_ner:app --host 0.0.0.0 --port 8189 +``` + +#### NER Client发送服务 +```bash +python client_ner.py +``` + +### SPO 任务启动 +#### 启动 SPO Server 服务 +```bash +paddlenlp server server_spo:app --host 0.0.0.0 --port 8189 +``` + +#### SPO Client 发送服务 +```bash +python client_spo.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs if len(text_pairs) > 0 else None + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size + } + } +``` diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_classification.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..1993acb4b0f0456a5af6a11572baca1521dc9372 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_classification.py @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", required=True, type=str, help="The dataset name for the simple seving") +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." 
+) +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() + +url = "http://0.0.0.0:8189/models/cblue_cls" +headers = {"Content-Type": "application/json"} + +TEXT = { + "kuake-qic": ["心肌缺血如何治疗与调养呢?", "什么叫痔核脱出?什么叫外痔?"], + "kuake-qtr": [["儿童远视眼怎么恢复视力", "远视眼该如何保养才能恢复一些视力"], ["抗生素的药有哪些", "抗生素类的药物都有哪些?"]], + "kuake-qqr": [["茴香是发物吗", "茴香怎么吃?"], ["气的胃疼是怎么回事", "气到胃痛是什么原因"]], + "chip-ctc": ["(1)前牙结构发育不良:釉质发育不全、氟斑牙、四环素牙等;", "怀疑或确有酒精或药物滥用史;"], + "chip-sts": [["糖尿病能吃减肥药吗?能治愈吗?", "糖尿病为什么不能吃减肥药"], ["H型高血压的定义", "WHO对高血压的最新分类定义标准数值"]], + "chip-cdn-2c": [["1型糖尿病性植物神经病变", " 1型糖尿病肾病IV期"], ["髂腰肌囊性占位", "髂肌囊肿"]], +} + +if __name__ == "__main__": + args.dataset = args.dataset.lower() + input_data = TEXT[args.dataset] + texts = [] + text_pairs = [] + for data in input_data: + if len(data) == 2: + text_pairs.append(data[1]) + texts.append(data[0]) + data = { + "data": {"text": texts, "text_pair": text_pairs if len(text_pairs) > 0 else None}, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_ner.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..d3c64479ec20c2925bf64232d262ca879feb9298 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_ner.py @@ -0,0 +1,40 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() + +url = "http://0.0.0.0:8189/models/cblue_ner" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["研究证实,细胞减少与肺内病变程度及肺内炎性病变吸收程度密切相关。", "可为不规则发热、稽留热或弛张热,但以不规则发热为多,可能与患儿应用退热药物导致热型不规律有关。"] + texts = [[x.lower() for x in text] for text in texts] + data = { + "data": { + "text": texts, + }, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "is_split_into_words": True}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_spo.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..38c34459d054ab57fac31ab0ef5b70c03a45ce07 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_spo.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() + +url = "http://0.0.0.0:8189/models/cblue_spo" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。", "稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。"] + texts = [[x.lower() for x in text] for text in texts] + data = { + "data": { + "text": texts, + }, + "parameters": { + "max_seq_len": args.max_seq_len, + "batch_size": args.batch_size, + "return_attention_mask": True, + "is_split_into_words": True, + }, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_classification.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..1c024501dd15c512b637930d924978e251451771 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_classification.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cblue_cls", + model_path="../../../export", + tokenizer_name="ernie-health-chinese", + model_handler=CustomModelHandler, + post_handler=MultiClassificationPostHandler, +) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_ner.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..2b20efd1df819e2d62567f327ef80a91ff3d79cc --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_ner.py @@ -0,0 +1,129 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, TokenClsModelHandler + +en_to_cn = { + "bod": "身体", + "mic": "微生物类", + "dis": "疾病", + "sym": "临床表现", + "pro": "医疗程序", + "equ": "医疗设备", + "dru": "药物", + "dep": "科室", + "ite": "医学检验项目", +} + +label_list = [ + [ + "B-bod", + "I-bod", + "E-bod", + "S-bod", + "B-dis", + "I-dis", + "E-dis", + "S-dis", + "B-pro", + "I-pro", + "E-pro", + "S-pro", + "B-dru", + "I-dru", + "E-dru", + "S-dru", + "B-ite", + "I-ite", + "E-ite", + "S-ite", + "B-mic", + "I-mic", + "E-mic", + "S-mic", + "B-equ", + "I-equ", + "E-equ", + "S-equ", + "B-dep", + "I-dep", + "E-dep", + "S-dep", + "O", + ], + ["B-sym", "I-sym", "E-sym", "S-sym", "O"], +] + + +def _extract_chunk(tokens): + chunks = set() + start_idx, cur_idx = 0, 0 + while cur_idx < len(tokens): + if tokens[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(tokens) and tokens[cur_idx][0] == "I": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(tokens) and tokens[cur_idx][0] == "E": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + chunks.add((tokens[cur_idx][2:], start_idx - 1, cur_idx)) + cur_idx += 1 + elif tokens[cur_idx][0] == "S": + chunks.add((tokens[cur_idx][2:], cur_idx - 1, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + return list(chunks) + + +class NERPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + if "logits" not in data or "logits_1" not in data: + raise ValueError( + "The output of model handler do not include the 'logits', " + " please check the model handler output. The model handler output:\n{}".format(data) + ) + tokens_oth = np.array(data["logits"]) + tokens_sym = np.array(data["logits_1"]) + tokens_oth = np.argmax(tokens_oth, axis=-1) + tokens_sym = np.argmax(tokens_sym, axis=-1) + entity = [] + for oth_ids, sym_ids in zip(tokens_oth, tokens_sym): + token_oth = [label_list[0][x] for x in oth_ids] + token_sym = [label_list[1][x] for x in sym_ids] + chunks = _extract_chunk(token_oth) + _extract_chunk(token_sym) + sub_entity = [] + for etype, sid, eid in chunks: + sub_entity.append({"type": en_to_cn[etype], "start_id": sid, "end_id": eid}) + entity.append(sub_entity) + return {"entity": entity} + + +app = SimpleServer() +app.register( + "models/cblue_ner", + model_path="../../../export_ner", + tokenizer_name="ernie-health-chinese", + model_handler=TokenClsModelHandler, + post_handler=NERPostHandler, +) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_spo.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..1a64cdbe66aa897e2cdcdce977227ebd6440e95e --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_spo.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, TokenClsModelHandler + +label_list = [ + "预防", + "阶段", + "就诊科室", + "辅助治疗", + "化疗", + "放射治疗", + "手术治疗", + "实验室检查", + "影像学检查", + "辅助检查", + "组织学检查", + "内窥镜检查", + "筛查", + "多发群体", + "发病率", + "发病年龄", + "多发地区", + "发病性别倾向", + "死亡率", + "多发季节", + "传播途径", + "并发症", + "病理分型", + "相关(导致)", + "鉴别诊断", + "相关(转化)", + "相关(症状)", + "临床表现", + "治疗后症状", + "侵及周围组织转移的症状", + "病因", + "高危因素", + "风险评估因素", + "病史", + "遗传因素", + "发病机制", + "病理生理", + "药物治疗", + "发病部位", + "转移部位", + "外侵部位", + "预后状况", + "预后生存率", + "同义词", +] + + +class SPOPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + if "logits" not in data or "logits_1" not in data: + raise ValueError( + "The output of model handler do not include the 'logits', " + " please check the model handler output. The model handler output:\n{}".format(data) + ) + lengths = np.array(data["attention_mask"], dtype="float32").sum(axis=-1) + ent_logits = np.array(data["logits"]) + spo_logits = np.array(data["logits_1"]) + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_logits): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_logits > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((tuple(sub), p_id, tuple(obj))) + input_data = data["data"]["text"] + ent_list = [] + spo_list = [] + for i, (ent, rel) in enumerate(zip(ent_pred_list, spo_pred_list)): + cur_ent_list = [] + cur_spo_list = [] + for sid, eid in ent: + cur_ent_list.append("".join([str(d) for d in input_data[i][sid : eid + 1]])) + for s, p, o in rel: + cur_spo_list.append( + ( + "".join([str(d) for d in input_data[i][s[0] : s[1] + 1]]), + label_list[p], + "".join([str(d) for d in input_data[i][o[0] : o[1] + 1]]), + ) + ) + ent_list.append(cur_ent_list) + spo_list.append(cur_spo_list) + + return {"entity": ent_list, "spo": spo_list} + + +app = SimpleServer() +app.register( + "models/cblue_spo", + model_path="../../../export", + tokenizer_name="ernie-health-chinese", + model_handler=TokenClsModelHandler, + post_handler=SPOPostHandler, +) diff --git a/model_zoo/ernie-health/cblue/export_model.py b/model_zoo/ernie-health/cblue/export_model.py new file mode 100644 index 
0000000000000000000000000000000000000000..ebc71e376aa47c75444adaca8589c87c142f0fd1 --- /dev/null +++ b/model_zoo/ernie-health/cblue/export_model.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from model import ElectraForBinaryTokenClassification, ElectraForSPO + +from paddlenlp.transformers import ElectraForSequenceClassification + +NUM_CLASSES = { + "CHIP-CDN-2C": 2, + "CHIP-STS": 2, + "CHIP-CTC": 44, + "KUAKE-QQR": 3, + "KUAKE-QTR": 4, + "KUAKE-QIC": 11, + "CMeEE": [33, 5], + "CMeIE": 44, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--train_dataset", required=True, type=str, help="The name of dataset used for training.") + parser.add_argument( + "--params_path", + type=str, + required=True, + default="./checkpoint/", + help="The path to model parameters to be loaded.", + ) + parser.add_argument( + "--output_path", type=str, default="./export", help="The path of model parameter in static graph to be saved." + ) + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + # Load the model parameters. + if args.train_dataset not in NUM_CLASSES: + raise ValueError(f"Please modify the code to fit {args.dataset}") + + if args.train_dataset == "CMeEE": + model = ElectraForBinaryTokenClassification.from_pretrained( + args.params_path, + num_classes_oth=NUM_CLASSES[args.train_dataset][0], + num_classes_sym=NUM_CLASSES[args.train_dataset][1], + ) + elif args.train_dataset == "CMeIE": + model = ElectraForSPO.from_pretrained(args.params_path, num_labels=NUM_CLASSES[args.train_dataset]) + else: + model = ElectraForSequenceClassification.from_pretrained( + args.params_path, num_labels=NUM_CLASSES[args.train_dataset] + ) + + model.eval() + + # Convert to static graph with specific input description: + # input_ids, token_type_ids + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/model.py b/model_zoo/ernie-health/cblue/model.py new file mode 100644 index 0000000000000000000000000000000000000000..71c9d62e71ffeb724beb9d034ebd55d19ee74ab0 --- /dev/null +++ b/model_zoo/ernie-health/cblue/model.py @@ -0,0 +1,122 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
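Once `export_model.py` above has written the static graph (for example `python export_model.py --train_dataset CMeEE --params_path ./checkpoint/ --output_path ./export_ner`), the saved prefix can be reloaded for a quick sanity check. A rough sketch, assuming the CMeEE export and the default `inference` prefix; the zero-filled inputs are placeholders only:

```python
# Illustrative only: reload the static graph saved by export_model.py and run a dummy batch.
# "./export_ner/inference" matches the prefix passed to paddle.jit.save above; the zero-filled
# ids are placeholders, not meaningful tokens.
import paddle

model = paddle.jit.load("./export_ner/inference")
model.eval()
input_ids = paddle.zeros([1, 16], dtype="int64")
token_type_ids = paddle.zeros([1, 16], dtype="int64")
logits_oth, logits_sym = model(input_ids, token_type_ids)  # two heads for the CMeEE model
print(logits_oth.shape, logits_sym.shape)
```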
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers import ElectraConfig, ElectraModel, ElectraPretrainedModel + + +class ElectraForBinaryTokenClassification(ElectraPretrainedModel): + """ + Electra Model with two linear layers on top of the hidden-states output layers, + designed for token classification tasks with nesting. + + Args: + electra (:class:`ElectraModel`): + An instance of ElectraModel. + num_classes (list): + The number of classes. + dropout (float, optionl): + The dropout probability for output of Electra. + If None, use the same value as `hidden_dropout_prob' of 'ElectraModel` + instance `electra`. Defaults to None. + """ + + def __init__(self, config: ElectraConfig, num_classes_oth, num_classes_sym): + super(ElectraForBinaryTokenClassification, self).__init__(config) + self.num_classes_oth = num_classes_oth + self.num_classes_sym = num_classes_sym + self.electra = ElectraModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier_oth = nn.Linear(config.hidden_size, self.num_classes_oth) + self.classifier_sym = nn.Linear(config.hidden_size, self.num_classes_sym) + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, attention_mask=None): + sequence_output = self.electra(input_ids, token_type_ids, position_ids, attention_mask) + sequence_output = self.dropout(sequence_output) + + logits_sym = self.classifier_sym(sequence_output) + logits_oth = self.classifier_oth(sequence_output) + + return logits_oth, logits_sym + + +class MultiHeadAttentionForSPO(nn.Layer): + """ + Multi-head attention layer for SPO task. + """ + + def __init__(self, embed_dim, num_heads, scale_value=768): + super(MultiHeadAttentionForSPO, self).__init__() + self.embed_dim = embed_dim + self.num_heads = num_heads + self.scale_value = scale_value**-0.5 + self.q_proj = nn.Linear(embed_dim, embed_dim * num_heads) + self.k_proj = nn.Linear(embed_dim, embed_dim * num_heads) + + def forward(self, query, key): + q = self.q_proj(query) + k = self.k_proj(key) + q = paddle.reshape(q, shape=[0, 0, self.num_heads, self.embed_dim]) + k = paddle.reshape(k, shape=[0, 0, self.num_heads, self.embed_dim]) + q = paddle.transpose(q, perm=[0, 2, 1, 3]) + k = paddle.transpose(k, perm=[0, 2, 1, 3]) + scores = paddle.matmul(q, k, transpose_y=True) + scores = paddle.scale(scores, scale=self.scale_value) + return scores + + +class ElectraForSPO(ElectraPretrainedModel): + """ + Electra Model with a linear layer on top of the hidden-states output + layers for entity recognition, and a multi-head attention layer for + relation classification. + + Args: + electra (:class:`ElectraModel`): + An instance of ElectraModel. + num_classes (int): + The number of classes. + dropout (float, optionl): + The dropout probability for output of Electra. + If None, use the same value as `hidden_dropout_prob' of 'ElectraModel` + instance `electra`. Defaults to None. 
+ """ + + def __init__(self, config: ElectraConfig): + super(ElectraForSPO, self).__init__(config) + self.num_classes = config.num_labels + self.electra = ElectraModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 2) + self.span_attention = MultiHeadAttentionForSPO(config.hidden_size, config.num_labels) + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, attention_mask=None): + outputs = self.electra( + input_ids, token_type_ids, position_ids, attention_mask, output_hidden_states=True, return_dict=True + ) + sequence_outputs = outputs.last_hidden_state + all_hidden_states = outputs.hidden_states + sequence_outputs = self.dropout(sequence_outputs) + ent_logits = self.classifier(sequence_outputs) + + subject_output = all_hidden_states[-2] + cls_output = paddle.unsqueeze(sequence_outputs[:, 0, :], axis=1) + subject_output = subject_output + cls_output + + output_size = self.num_classes + self.electra.config["hidden_size"] # noqa:F841 + rel_logits = self.span_attention(sequence_outputs, subject_output) + + return ent_logits, rel_logits diff --git a/model_zoo/ernie-health/cblue/train_classification.py b/model_zoo/ernie-health/cblue/train_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..b7e59b2f80f0acfff2f7b717db26345966ad7374 --- /dev/null +++ b/model_zoo/ernie-health/cblue/train_classification.py @@ -0,0 +1,263 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import distutils.util +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.metric import Accuracy +from utils import LinearDecayWithWarmup, convert_example, create_dataloader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, MultiLabelsMetric +from paddlenlp.transformers import ElectraForSequenceClassification, ElectraTokenizer + +METRIC_CLASSES = { + "KUAKE-QIC": Accuracy, + "KUAKE-QQR": Accuracy, + "KUAKE-QTR": Accuracy, + "CHIP-CTC": MultiLabelsMetric, + "CHIP-STS": MultiLabelsMetric, + "CHIP-CDN-2C": AccuracyAndF1, +} + +parser = argparse.ArgumentParser() +parser.add_argument( + "--dataset", + choices=["KUAKE-QIC", "KUAKE-QQR", "KUAKE-QTR", "CHIP-STS", "CHIP-CTC", "CHIP-CDN-2C"], + default="KUAKE-QIC", + type=str, + help="Dataset for sequence classfication tasks.", +) +parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="gpu", + help="Select which device to train model, default to gpu.", +) +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override epochs." +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning sequence classification task." +) +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay of optimizer if we apply some.") +parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Linear warmup proportion of learning rate over the training process.", +) +parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and compute the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + dataloader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, position_ids, labels = batch + logits = model(input_ids, token_type_ids, position_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + if isinstance(metric, Accuracy): + metric_name = "accuracy" + result = metric.accumulate() + elif isinstance(metric, MultiLabelsMetric): + metric_name = "macro f1" + _, _, result = metric.accumulate("macro") + else: + metric_name = "micro f1" + _, _, _, result, _ = metric.accumulate() + + print("eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric_name, result)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", args.dataset, splits=["train", "dev"]) + + model = ElectraForSequenceClassification.from_pretrained( + "ernie-health-chinese", num_labels=len(train_ds.label_list) + ) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=args.max_seq_length - 1, dtype="int64"), # position + Stack(dtype="int64"), + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + if METRIC_CLASSES[args.dataset] is Accuracy: + metric = METRIC_CLASSES[args.dataset]() + metric_name = "accuracy" + elif METRIC_CLASSES[args.dataset] is MultiLabelsMetric: + metric = METRIC_CLASSES[args.dataset](num_labels=len(train_ds.label_list)) + metric_name = "macro f1" + else: + metric = METRIC_CLASSES[args.dataset]() + metric_name = "micro f1" + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu", "tanh"], + ): + logits = model(input_ids, token_type_ids, position_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + + if isinstance(metric, Accuracy): + result = metric.accumulate() + elif isinstance(metric, MultiLabelsMetric): + _, _, result = metric.accumulate("macro") + else: + _, _, _, result, _ = metric.accumulate() + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, %s: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, metric_name, result, args.logging_steps / time_diff) + ) + + if global_step % args.valid_steps == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-health/cblue/train_ner.py b/model_zoo/ernie-health/cblue/train_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..de7e50b1f9dde76ba1b106092e7d86ad085cca81 --- /dev/null +++ b/model_zoo/ernie-health/cblue/train_ner.py @@ -0,0 +1,254 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
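All flags accepted by `train_classification.py` are defined in its argparse block above. A minimal launch sketch for a single card follows; the dataset and hyper-parameter values are examples only, not recommended settings, and multi-card runs would typically go through `python -m paddle.distributed.launch` instead:

```python
# Illustrative only: one way to launch the sequence-classification fine-tuning above.
# Flag names come from the argparse definitions in train_classification.py; the chosen
# dataset and hyper-parameters are examples.
import subprocess

subprocess.run(
    [
        "python", "train_classification.py",
        "--dataset", "KUAKE-QIC",
        "--batch_size", "32",
        "--max_seq_length", "128",
        "--learning_rate", "6e-5",
        "--epochs", "3",
        "--save_dir", "./checkpoint",
    ],
    check=True,
)
```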
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from model import ElectraForBinaryTokenClassification +from utils import ( + LinearDecayWithWarmup, + NERChunkEvaluator, + convert_example_ner, + create_dataloader, +) + +from paddlenlp.data import Dict, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ElectraTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="gpu", + help="Select which device to train model, default to gpu.", +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning token classification task." +) +parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process." +) +parser.add_argument("--use_amp", default=False, type=bool, help="Enable mixed precision training.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override epochs." 
+) +parser.add_argument("--seed", default=1000, type=int, help="Random seed.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, position_ids, masks, label_oth, label_sym = batch + logits = model(input_ids, token_type_ids, position_ids) + + loss_mask = masks.unsqueeze(2) + loss = [(criterion(x, y.unsqueeze(2)) * loss_mask).mean() for x, y in zip(logits, [label_oth, label_sym])] + losses.append([x.numpy() for x in loss]) + + lengths = paddle.sum(masks, axis=1) + preds = [paddle.argmax(x, axis=2) for x in logits] + correct = metric.compute(lengths, preds, [label_oth, label_sym]) + metric.update(correct) + _, _, result = metric.accumulate() + loss = np.mean(losses, axis=0) + print("eval loss symptom: %.5f, loss others: %.5f, loss: %.5f, f1: %.5f" % (loss[1], loss[0], loss.sum(), result)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", "CMeEE", splits=["train", "dev"]) + + model = ElectraForBinaryTokenClassification.from_pretrained( + "ernie-health-chinese", + num_classes_oth=len(train_ds.label_list[0]), + num_classes_sym=len(train_ds.label_list[1]), + ) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + label_list = train_ds.label_list + pad_label_id = [len(label_list[0]) - 1, len(label_list[1]) - 1] + + trans_func = partial( + convert_example_ner, tokenizer=tokenizer, max_seq_length=args.max_seq_length, pad_label_id=pad_label_id + ) + + batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), + "position_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "attention_mask": Pad(axis=0, pad_val=0, dtype="float32"), + "label_oth": Pad(axis=0, pad_val=pad_label_id[0], dtype="int64"), + "label_sym": Pad(axis=0, pad_val=pad_label_id[1], dtype="int64"), + } + ): fn(samples) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt: + if not os.path.isfile(args.init_from_ckpt): + raise ValueError("init_from_ckpt is not a valid model filename.") + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." 
in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.functional.softmax_with_cross_entropy + + metric = NERChunkEvaluator(label_list) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, masks, label_oth, label_sym = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu"], + ): + logits = model(input_ids, token_type_ids, position_ids) + + loss_mask = paddle.unsqueeze(masks, 2) + losses = [ + (criterion(x, y.unsqueeze(2)) * loss_mask).mean() for x, y in zip(logits, [label_oth, label_sym]) + ] + loss = losses[0] + losses[1] + + lengths = paddle.sum(masks, axis=1) + preds = [paddle.argmax(x, axis=-1) for x in logits] + correct = metric.compute(lengths, preds, [label_oth, label_sym]) + metric.update(correct) + _, _, f1 = metric.accumulate() + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, loss symptom: %.5f, loss others: %.5f, f1: %.5f, speed: %.2f step/s, learning_rate: %f" + % ( + global_step, + epoch, + step, + loss, + losses[1], + losses[0], + f1, + args.logging_steps / time_diff, + lr_scheduler.get_lr(), + ) + ) + + if global_step % args.valid_steps == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-health/cblue/train_spo.py b/model_zoo/ernie-health/cblue/train_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..e5d8d5eb6128c0503deeaeb01d6d253c58b66e8f --- /dev/null +++ b/model_zoo/ernie-health/cblue/train_spo.py @@ -0,0 +1,300 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
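Both label heads in `train_ner.py` above use the same masked token-level loss: per-token softmax cross-entropy, zeroed on padding positions through the attention mask, then averaged. A small shape-level sketch with made-up tensors:

```python
# Illustrative only: the masked per-token loss pattern used for each head in train_ner.py.
# Shapes and values are invented for demonstration.
import paddle
import paddle.nn.functional as F

batch_size, seq_len, num_labels = 2, 5, 33
logits = paddle.randn([batch_size, seq_len, num_labels])
labels = paddle.randint(0, num_labels, [batch_size, seq_len])
masks = paddle.to_tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype="float32")

token_loss = F.softmax_with_cross_entropy(logits, labels.unsqueeze(2))  # [batch, seq_len, 1]
loss = (token_loss * masks.unsqueeze(2)).mean()  # padded positions contribute zero
print(float(loss))
```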
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import distutils.util +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from model import ElectraForSPO +from tqdm import tqdm +from utils import ( + LinearDecayWithWarmup, + SPOChunkEvaluator, + convert_example_spo, + create_dataloader, +) + +from paddlenlp.data import Dict, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ElectraTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="gpu", + help="Select which device to train model, default to gpu.", +) +parser.add_argument("--epochs", default=100, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override epochs." +) +parser.add_argument("--batch_size", default=12, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning sequence classification task." +) +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay of optimizer if we apply some.") +parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Linear warmup proportion of learning rate over the training process.", +) +parser.add_argument( + "--max_seq_length", default=300, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and compute the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + dataloader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(`paddle.nn.functional`): It can compute the loss. 
+ metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + for batch in tqdm(data_loader): + input_ids, token_type_ids, position_ids, masks, ent_label, spo_label = batch + ent_mask = paddle.unsqueeze(masks, axis=2) + spo_mask = paddle.matmul(ent_mask, ent_mask, transpose_y=True) + spo_mask = paddle.unsqueeze(spo_mask, axis=1) + + logits = model(input_ids, token_type_ids, position_ids) + + ent_loss = criterion(logits[0], ent_label[0], weight=ent_mask, reduction="sum") + spo_loss = criterion(logits[1], spo_label[0], weight=spo_mask, reduction="sum") + loss = ent_loss + spo_loss + losses.append(loss.numpy()) + lengths = paddle.sum(masks, axis=-1) + correct = metric.compute(lengths, logits[0], logits[1], ent_label[1], spo_label[1]) + metric.update(correct) + results = metric.accumulate() + print( + "eval loss: %.5f, entity f1: %.5f, spo f1: %.5f" % (np.mean(losses), results["entity"][2], results["spo"][2]) + ) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", "CMeIE", splits=["train", "dev"]) + + model = ElectraForSPO.from_pretrained("ernie-health-chinese", num_classes=len(train_ds.label_list)) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + trans_func = partial( + convert_example_spo, + tokenizer=tokenizer, + num_classes=len(train_ds.label_list), + max_seq_length=args.max_seq_length, + ) + + def batchify_fn(data): + _batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "position_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "attention_mask": Pad(axis=0, pad_val=0, dtype="float32"), + } + ): fn(samples) + ent_label = [x["ent_label"] for x in data] + spo_label = [x["spo_label"] for x in data] + input_ids, token_type_ids, position_ids, masks = _batchify_fn(data) + batch_size, batch_len = input_ids.shape + num_classes = len(train_ds.label_list) + # Create one-hot labels. + # + # For example, + # - text: + # [CLS], 局, 部, 皮, 肤, 感, 染, 引, 起, 的, 皮, 疹, 等, [SEP] + # + # - ent_label (obj: `list`): + # [(0, 5), (9, 10)] # ['局部皮肤感染', '皮疹'] + # + # - one_hot_ent_label: # shape (sequence_length, 2) + # [[ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], # start index + # [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]] # end index + # + # - spo_label (obj: `list`): + # [(0, 23, 9)] # [('局部皮肤感染', '相关(导致)', '皮疹')], where entities + # are encoded by their start indexes. + # + # - one_hot_spo_label: # shape (num_predicate, sequence_length, sequence_length) + # [..., + # [..., [0, ..., 1, ..., 0], ...], # for predicate '相关(导致)' + # ...] # the value at [23, 1, 10] is set as 1 + # + one_hot_ent_label = np.zeros([batch_size, batch_len, 2], dtype=np.float32) + one_hot_spo_label = np.zeros([batch_size, num_classes, batch_len, batch_len], dtype=np.float32) + for idx, ent_idxs in enumerate(ent_label): + # Shift index by 1 because input_ids start with [CLS] here. 
+ for x, y in ent_idxs: + x = x + 1 + y = y + 1 + if x > 0 and x < batch_len and y < batch_len: + one_hot_ent_label[idx, x, 0] = 1 + one_hot_ent_label[idx, y, 1] = 1 + for idx, spo_idxs in enumerate(spo_label): + for s, p, o in spo_idxs: + s_id = s[0] + 1 + o_id = o[0] + 1 + if s_id > 0 and s_id < batch_len and o_id < batch_len: + one_hot_spo_label[idx, p, s_id, o_id] = 1 + # one_hot_xxx_label are used for loss computation. + # xxx_label are used for metric computation. + ent_label = [one_hot_ent_label, ent_label] + spo_label = [one_hot_spo_label, spo_label] + return input_ids, token_type_ids, position_ids, masks, ent_label, spo_label + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt: + if not os.path.isfile(args.init_from_ckpt): + raise ValueError("init_from_ckpt is not a valid model filename.") + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = F.binary_cross_entropy_with_logits + + metric = SPOChunkEvaluator(num_classes=len(train_ds.label_list)) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, masks, ent_label, spo_label = batch + ent_mask = paddle.unsqueeze(masks, axis=2) + spo_mask = paddle.matmul(ent_mask, ent_mask, transpose_y=True) + spo_mask = paddle.unsqueeze(spo_mask, axis=1) + + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu"], + ): + logits = model(input_ids, token_type_ids, position_ids) + ent_loss = criterion(logits[0], ent_label[0], weight=ent_mask, reduction="sum") + spo_loss = criterion(logits[1], spo_label[0], weight=spo_mask, reduction="sum") + + loss = ent_loss + spo_loss + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, " + "ent_loss: %.5f, spo_loss: %.5f, speed: %.2f steps/s" + % (global_step, epoch, step, loss, ent_loss, spo_loss, args.logging_steps / time_diff) + ) + + if global_step 
% args.valid_steps == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-health/cblue/utils.py b/model_zoo/ernie-health/cblue/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4c0bda0c63ebbd5eb07a27ef1b1aef45a055671a --- /dev/null +++ b/model_zoo/ernie-health/cblue/utils.py @@ -0,0 +1,503 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import numpy as np +import paddle +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.transformers import normalize_chars, tokenize_special_chars + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +class LinearDecayWithWarmup(LambdaDecay): + def __init__(self, learning_rate, total_steps, warmup, last_epoch=-1, verbose=False): + """ + Creates a learning rate scheduler, which increases learning rate linearly + from 0 to given `learning_rate`, after this warmup period learning rate + would be decreased linearly from the base learning rate to 0. + + Args: + learning_rate (float): + The base learning rate. It is a python float number. + total_steps (int): + The number of training steps. + warmup (int or float): + If int, it means the number of steps for warmup. If float, it means + the proportion of warmup in total training steps. + last_epoch (int, optional): + The index of last epoch. It can be set to restart training. If + None, it means initial learning rate. + Defaults to -1. + verbose (bool, optional): + If True, prints a message to stdout for each update. + Defaults to False. 
+ """ + + warmup_steps = warmup if isinstance(warmup, int) else int(math.floor(warmup * total_steps)) + + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, 1.0 - current_step / total_steps) + + super(LinearDecayWithWarmup, self).__init__(learning_rate, lr_lambda, last_epoch, verbose) + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequences for sequence + classification tasks by concatenating and adding special tokens. And + creates a mask from the two sequences for sequence-pair classification + tasks. + + The convention in Electra/EHealth is: + + - single sequence: + input_ids: ``[CLS] X [SEP]`` + token_type_ids: `` 0 0 0`` + position_ids: `` 0 1 2`` + + - a senquence pair: + input_ids: ``[CLS] X [SEP] Y [SEP]`` + token_type_ids: `` 0 0 0 1 1`` + position_ids: `` 0 1 2 3 4`` + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. + + Returns: + input_ids (obj:`list[int]`): + The list of token ids. + token_type_ids (obj:`list[int]`): + List of sequence pair mask. + position_ids (obj:`list[int]`): + List of position ids. + label(obj:`numpy.array`, data type of int64, optional): + The input label if not is_test. + """ + text_a = example["text_a"] + text_b = example.get("text_b", None) + + text_a = tokenize_special_chars(normalize_chars(text_a)) + if text_b is not None: + text_b = tokenize_special_chars(normalize_chars(text_b)) + + encoded_inputs = tokenizer(text=text_a, text_pair=text_b, max_seq_len=max_seq_length, return_position_ids=True) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + position_ids = encoded_inputs["position_ids"] + + if is_test: + return input_ids, token_type_ids, position_ids + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, position_ids, label + + +def convert_example_ner(example, tokenizer, max_seq_length=512, pad_label_id=-100, is_test=False): + """ + Builds model inputs from a sequence and creates labels for named- + entity recognition task CMeEE. + + For example, a sample should be: + + - input_ids: ``[CLS] x1 x2 [SEP] [PAD]`` + - token_type_ids: `` 0 0 0 0 0`` + - position_ids: `` 0 1 2 3 0`` + - attention_mask: `` 1 1 1 1 0`` + - label_oth: `` 32 3 32 32 32`` (optional, label ids of others) + - label_sym: `` 4 4 4 4 4`` (optional, label ids of symptom) + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. 
+ + Returns: + encoded_output (obj: `dict[str, list|np.array]`): + The sample dictionary including `input_ids`, `token_type_ids`, + `position_ids`, `attention_mask`, `label_oth` (optional), + `label_sym` (optional) + """ + + encoded_inputs = {} + text = example["text"] + if len(text) > max_seq_length - 2: + text = text[: max_seq_length - 2] + text = ["[CLS]"] + [x.lower() for x in text] + ["[SEP]"] + input_len = len(text) + encoded_inputs["input_ids"] = tokenizer.convert_tokens_to_ids(text) + encoded_inputs["token_type_ids"] = np.zeros(input_len) + encoded_inputs["position_ids"] = list(range(input_len)) + encoded_inputs["attention_mask"] = np.ones(input_len) + + if not is_test: + labels = example["labels"] + if input_len - 2 < len(labels[0]): + labels[0] = labels[0][: input_len - 2] + if input_len - 2 < len(labels[1]): + labels[1] = labels[1][: input_len - 2] + encoded_inputs["label_oth"] = [pad_label_id[0]] + labels[0] + [pad_label_id[0]] + encoded_inputs["label_sym"] = [pad_label_id[1]] + labels[1] + [pad_label_id[1]] + + return encoded_inputs + + +def convert_example_spo(example, tokenizer, num_classes, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence and creates labels for SPO prediction + task CMeIE. + + For example, a sample should be: + + - input_ids: ``[CLS] x1 x2 [SEP] [PAD]`` + - token_type_ids: `` 0 0 0 0 0`` + - position_ids: `` 0 1 2 3 0`` + - attention_mask: `` 1 1 1 1 0`` + - ent_label: ``[[0 1 0 0 0], # start ids are set as 1 + [0 0 1 0 0]] # end ids are set as 1 + - spo_label: a tensor of shape [num_classes, max_batch_len, max_batch_len]. + Set [predicate_id, subject_start_id, object_start_id] as 1 + when (subject, predicate, object) exists. + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + num_classes (obj:`int`): + The number of predicates. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. + + Returns: + encoded_output (obj: `dict[str, list|np.array]`): + The sample dictionary including `input_ids`, `token_type_ids`, + `position_ids`, `attention_mask`, `ent_label` (optional), + `spo_label` (optional) + """ + encoded_inputs = {} + text = example["text"] + if len(text) > max_seq_length - 2: + text = text[: max_seq_length - 2] + text = ["[CLS]"] + [x.lower() for x in text] + ["[SEP]"] + input_len = len(text) + encoded_inputs["input_ids"] = tokenizer.convert_tokens_to_ids(text) + encoded_inputs["token_type_ids"] = np.zeros(input_len) + encoded_inputs["position_ids"] = list(range(input_len)) + encoded_inputs["attention_mask"] = np.ones(input_len) + if not is_test: + encoded_inputs["ent_label"] = example["ent_label"] + encoded_inputs["spo_label"] = example["spo_label"] + return encoded_inputs + + +class NERChunkEvaluator(paddle.metric.Metric): + """ + NERChunkEvaluator computes the precision, recall and F1-score for chunk detection. + It is often used in sequence tagging tasks, such as Named Entity Recognition (NER). + + Args: + label_list (list): + The label list. 
+ + Note: + Difference from `paddlenlp.metric.ChunkEvaluator`: + + - `paddlenlp.metric.ChunkEvaluator` + All sequences with non-'O' labels are taken as chunks when computing num_infer. + - `NERChunkEvaluator` + Only complete sequences are taken as chunks, namely `B- I- E-` or `S-`. + """ + + def __init__(self, label_list): + super(NERChunkEvaluator, self).__init__() + self.id2label = [dict(enumerate(x)) for x in label_list] + self.num_classes = [len(x) for x in label_list] + self.num_infer = 0 + self.num_label = 0 + self.num_correct = 0 + + def compute(self, lengths, predictions, labels): + """ + Computes the prediction, recall and F1-score for chunk detection. + + Args: + lengths (Tensor): + The valid length of every sequence, a tensor with shape `[batch_size]`. + predictions (Tensor): + The predictions index, a tensor with shape `[batch_size, sequence_length]`. + labels (Tensor): + The labels index, a tensor with shape `[batch_size, sequence_length]`. + + Returns: + tuple: Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`). + + With the fields: + + - `num_infer_chunks` (Tensor): The number of the inference chunks. + - `num_label_chunks` (Tensor): The number of the label chunks. + - `num_correct_chunks` (Tensor): The number of the correct chunks. + """ + assert len(predictions) == len(labels) + assert len(predictions) == len(self.id2label) + preds = [x.numpy() for x in predictions] + labels = [x.numpy() for x in labels] + + preds_chunk = set() + label_chunk = set() + for idx, (pred, label) in enumerate(zip(preds, labels)): + for i, case in enumerate(pred): + case = [self.id2label[idx][x] for x in case[: lengths[i]]] + preds_chunk |= self.extract_chunk(case, i) + for i, case in enumerate(label): + case = [self.id2label[idx][x] for x in case[: lengths[i]]] + label_chunk |= self.extract_chunk(case, i) + + num_infer = len(preds_chunk) + num_label = len(label_chunk) + num_correct = len(preds_chunk & label_chunk) + return num_infer, num_label, num_correct + + def update(self, correct): + num_infer, num_label, num_correct = correct + self.num_infer += num_infer + self.num_label += num_label + self.num_correct += num_correct + + def accumulate(self): + precision = self.num_correct / (self.num_infer + 1e-6) + recall = self.num_correct / (self.num_label + 1e-6) + f1 = 2 * precision * recall / (precision + recall + 1e-6) + return precision, recall, f1 + + def reset(self): + self.num_infer = 0 + self.num_label = 0 + self.num_correct = 0 + + def name(self): + return "precision", "recall", "f1" + + def extract_chunk(self, sequence, cid=0): + chunks = set() + + start_idx, cur_idx = 0, 0 + while cur_idx < len(sequence): + if sequence[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(sequence) and sequence[cur_idx][0] == "I": + if sequence[cur_idx][2:] == sequence[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(sequence) and sequence[cur_idx][0] == "E": + if sequence[cur_idx][2:] == sequence[start_idx][2:]: + chunks.add((cid, sequence[cur_idx][2:], start_idx, cur_idx)) + cur_idx += 1 + elif sequence[cur_idx][0] == "S": + chunks.add((cid, sequence[cur_idx][2:], cur_idx, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + + return chunks + + +class SPOChunkEvaluator(paddle.metric.Metric): + """ + SPOChunkEvaluator computes the precision, recall and F1-score for multiple + chunk detections, including Named Entity Recognition (NER) and SPO Prediction. + + Args: + num_classes (int): + The number of predicates. 
+ """ + + def __init__(self, num_classes=None): + super(SPOChunkEvaluator, self).__init__() + self.num_classes = num_classes + self.num_infer_ent = 0 + self.num_infer_spo = 1e-10 + self.num_label_ent = 0 + self.num_label_spo = 1e-10 + self.num_correct_ent = 0 + self.num_correct_spo = 0 + + def compute(self, lengths, ent_preds, spo_preds, ent_labels, spo_labels): + """ + Computes the prediction, recall and F1-score for NER and SPO prediction. + + Args: + lengths (Tensor): + The valid length of every sequence, a tensor with shape `[batch_size]`. + ent_preds (Tensor): + The predictions of entities. + A tensor with shape `[batch_size, sequence_length, 2]`. + `ent_preds[:, :, 0]` denotes the start indexes of entities. + `ent_preds[:, :, 1]` denotes the end indexes of entities. + spo_preds (Tensor): + The predictions of predicates between all possible entities. + A tensor with shape `[batch_size, num_classes, sequence_length, sequence_length]`. + ent_labels (list[list|tuple]): + The entity labels' indexes. A list of pair `[start_index, end_index]`. + spo_labels (list[list|tuple]): + The SPO labels' indexes. A list of triple `[[subject_start_index, subject_end_index], + predicate_id, [object_start_index, object_end_index]]`. + + Returns: + tuple: + Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`). + The `ent` denotes results of NER and the `spo` denotes results of SPO prediction. + + With the fields: + + - `num_infer_chunks` (dict): The number of the inference chunks. + - `num_label_chunks` (dict): The number of the label chunks. + - `num_correct_chunks` (dict): The number of the correct chunks. + """ + ent_preds = ent_preds.numpy() + spo_preds = spo_preds.numpy() + + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_preds): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_preds > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((sub, p_id, obj)) + + correct = {"ent": 0, "spo": 0} + infer = {"ent": 0, "spo": 0} + label = {"ent": 0, "spo": 0} + for ent_pred, ent_true in zip(ent_pred_list, ent_labels): + ent_true = [tuple(x) for x in ent_true] + infer["ent"] += len(set(ent_pred)) + label["ent"] += len(set(ent_true)) + correct["ent"] += len(set(ent_pred) & set(ent_true)) + + for spo_pred, spo_true in zip(spo_pred_list, spo_labels): + spo_true = [(tuple(s), p, tuple(o)) for s, p, o in spo_true] + infer["spo"] += len(set(spo_pred)) + label["spo"] += len(set(spo_true)) + correct["spo"] += len(set(spo_pred) & set(spo_true)) + + return infer, label, correct + + def update(self, corrects): + assert len(corrects) == 3 + for item in corrects: + assert isinstance(item, dict) + for value in item.values(): + if not self._is_number_or_matrix(value): + raise ValueError("The numbers must be a number(int) or a numpy ndarray.") + num_infer, num_label, num_correct = corrects + self.num_infer_ent += 
num_infer["ent"] + self.num_infer_spo += num_infer["spo"] + self.num_label_ent += num_label["ent"] + self.num_label_spo += num_label["spo"] + self.num_correct_ent += num_correct["ent"] + self.num_correct_spo += num_correct["spo"] + + def accumulate(self): + spo_precision = self.num_correct_spo / self.num_infer_spo + spo_recall = self.num_correct_spo / self.num_label_spo + spo_f1 = 2 * self.num_correct_spo / (self.num_infer_spo + self.num_label_spo) + ent_precision = self.num_correct_ent / self.num_infer_ent if self.num_infer_ent > 0 else 0.0 + ent_recall = self.num_correct_ent / self.num_label_ent if self.num_label_ent > 0 else 0.0 + ent_f1 = ( + 2 * ent_precision * ent_recall / (ent_precision + ent_recall) if (ent_precision + ent_recall) != 0 else 0.0 + ) + return {"entity": (ent_precision, ent_recall, ent_f1), "spo": (spo_precision, spo_recall, spo_f1)} + + def _is_number_or_matrix(self, var): + def _is_number_(var): + return ( + isinstance(var, int) + or isinstance(var, np.int64) + or isinstance(var, float) + or (isinstance(var, np.ndarray) and var.shape == (1,)) + ) + + return _is_number_(var) or isinstance(var, np.ndarray) + + def reset(self): + self.num_infer_ent = 0 + self.num_infer_spo = 1e-10 + self.num_label_ent = 0 + self.num_label_spo = 1e-10 + self.num_correct_ent = 0 + self.num_correct_spo = 0 + + def name(self): + return {"entity": ("precision", "recall", "f1"), "spo": ("precision", "recall", "f1")} diff --git a/model_zoo/ernie-health/dataset.py b/model_zoo/ernie-health/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..252b2e952aee484db6c6a365ee36f7e93d16ca63 --- /dev/null +++ b/model_zoo/ernie-health/dataset.py @@ -0,0 +1,204 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import paddle + + +class MedicalCorpus(paddle.io.Dataset): + def __init__(self, data_path, tokenizer): + self.data_path = data_path + self.tokenizer = tokenizer + # Add ids for suffixal chinese tokens in tokenized text, e.g. '##度' in '百度'. + # It should coincide with the vocab dictionary in preprocess.py. + orig_len = len(self.tokenizer) # noqa:F841 + suffix_vocab = {} + for idx, token in enumerate(range(0x4E00, 0x9FA6)): + suffix_vocab[len(self.tokenizer) + idx] = "##" + chr(token) + self.tokenizer.added_tokens_decoder.update(suffix_vocab) + self._samples, self._global_index = self._read_data_files(data_path) + + def _get_data_files(self, data_path): + # Get all prefix of .npy/.npz files in the current and next-level directories. 
+ files = [ + os.path.join(data_path, f) + for f in os.listdir(data_path) + if (os.path.isfile(os.path.join(data_path, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + return files + + def _read_data_files(self, data_path): + data_files = self._get_data_files(data_path) + samples = [] + indexes = [] + for file_id, file_name in enumerate(data_files): + + for suffix in ["_ids.npy", "_idx.npz"]: + if not os.path.isfile(file_name + suffix): + raise ValueError("File Not found, %s" % (file_name + suffix)) + + token_ids = np.load(file_name + "_ids.npy", mmap_mode="r", allow_pickle=True) + samples.append(token_ids) + + split_ids = np.load(file_name + "_idx.npz") + end_ids = np.cumsum(split_ids["lens"], dtype=np.int64) + file_ids = np.full(end_ids.shape, file_id) + split_ids = np.stack([file_ids, end_ids], axis=-1) + indexes.extend(split_ids) + indexes = np.stack(indexes, axis=0) + return samples, indexes + + def __len__(self): + return len(self._global_index) + + def __getitem__(self, index): + file_id, end_id = self._global_index[index] + start_id = 0 + if index > 0: + pre_file_id, pre_end_id = self._global_index[index - 1] + if pre_file_id == file_id: + start_id = pre_end_id + word_token_ids = self._samples[file_id][start_id:end_id] + token_ids = [] + is_suffix = np.zeros(word_token_ids.shape) + for idx, token_id in enumerate(word_token_ids): + token = self.tokenizer.convert_ids_to_tokens(int(token_id)) + if "##" in token: + token_id = self.tokenizer.convert_tokens_to_ids(token[-1]) + is_suffix[idx] = 1 + token_ids.append(token_id) + + return token_ids, is_suffix.astype(np.int64) + + +class DataCollatorForErnieHealth(object): + def __init__(self, tokenizer, mlm_prob, max_seq_length, return_dict=False): + self.tokenizer = tokenizer + self.mlm_prob = mlm_prob + self.max_seq_len = max_seq_length + self.return_dict = return_dict + self._ids = { + "cls": self.tokenizer.convert_tokens_to_ids(self.tokenizer.cls_token), + "sep": self.tokenizer.convert_tokens_to_ids(self.tokenizer.sep_token), + "pad": self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token), + "mask": self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token), + } + + def __call__(self, data): + masked_input_ids_a, input_ids_a, labels_a = self.mask_tokens(data) + masked_input_ids_b, input_ids_b, labels_b = self.mask_tokens(data) + masked_input_ids = paddle.concat([masked_input_ids_a, masked_input_ids_b], axis=0).astype("int64") + input_ids = paddle.concat([input_ids_a, input_ids_b], axis=0) + labels = paddle.concat([labels_a, labels_b], axis=0) + if self.return_dict: + return {"input_ids": masked_input_ids, "raw_input_ids": input_ids, "generator_labels": labels} + + else: + return masked_input_ids, input_ids, labels + + def mask_tokens(self, batch_data): + + token_ids = [x[0] for x in batch_data] + is_suffix = [x[1] for x in batch_data] + + # Create probability matrix where the probability of real tokens is + # self.mlm_prob, while that of others is zero. + data = self.add_special_tokens_and_set_maskprob(token_ids, is_suffix) + token_ids, is_suffix, prob_matrix = data + token_ids = paddle.to_tensor(token_ids, dtype="int64", stop_gradient=True) + masked_token_ids = token_ids.clone() + labels = token_ids.clone() + + # Create masks for words, where '百' must be masked if '度' is masked + # for the word '百度'. 
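+        # Masks may only start at word-initial tokens; the while-loop below then copies a
+        # sampled mask onto the following '##' suffix tokens so whole words are masked together.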
+ prob_matrix = prob_matrix * (1 - is_suffix) + word_mask_index = np.random.binomial(1, prob_matrix).astype("float") + is_suffix_mask = is_suffix == 1 + word_mask_index_tmp = word_mask_index + while word_mask_index_tmp.sum() > 0: + word_mask_index_tmp = np.concatenate( + [np.zeros((word_mask_index.shape[0], 1)), word_mask_index_tmp[:, :-1]], axis=1 + ) + word_mask_index_tmp = word_mask_index_tmp * is_suffix_mask + word_mask_index += word_mask_index_tmp + word_mask_index = word_mask_index.astype("bool") + labels[~word_mask_index] = -100 + + # 80% replaced with [MASK]. + token_mask_index = paddle.bernoulli(paddle.full(labels.shape, 0.8)).astype("bool").numpy() & word_mask_index + masked_token_ids[token_mask_index] = self._ids["mask"] + + # 10% replaced with random token ids. + token_random_index = paddle.to_tensor( + paddle.bernoulli(paddle.full(labels.shape, 0.5)).astype("bool").numpy() + & word_mask_index + & ~token_mask_index + ) + random_tokens = paddle.randint(low=0, high=self.tokenizer.vocab_size, shape=labels.shape, dtype="int64") + masked_token_ids = paddle.where(token_random_index, random_tokens, masked_token_ids) + + return masked_token_ids, token_ids, labels + + def add_special_tokens_and_set_maskprob(self, token_ids, is_suffix): + batch_size = len(token_ids) + batch_token_ids = np.full((batch_size, self.max_seq_len), self._ids["pad"]) + batch_token_ids[:, 0] = self._ids["cls"] + batch_is_suffix = np.full_like(batch_token_ids, -1) + prob_matrix = np.zeros_like(batch_token_ids, dtype="float32") + + for idx in range(batch_size): + if len(token_ids[idx]) > self.max_seq_len - 2: + token_ids[idx] = token_ids[idx][: self.max_seq_len - 2] + is_suffix[idx] = is_suffix[idx][: self.max_seq_len - 2] + seq_len = len(token_ids[idx]) + batch_token_ids[idx, seq_len + 1] = self._ids["sep"] + batch_token_ids[idx, 1 : seq_len + 1] = token_ids[idx] + batch_is_suffix[idx, 1 : seq_len + 1] = is_suffix[idx] + prob_matrix[idx, 1 : seq_len + 1] = self.mlm_prob + + return batch_token_ids, batch_is_suffix, prob_matrix + + +def create_dataloader(dataset, mode="train", batch_size=1, use_gpu=True, data_collator=None): + """ + Creats dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): + Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): + If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): + The sample number of a mini-batch. + use_gpu(obj:`bool`, optional, defaults to obj:`True`): + Whether to use gpu to run. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + + if mode == "train" and use_gpu: + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + else: + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + + return dataloader diff --git a/model_zoo/ernie-health/preprocess.py b/model_zoo/ernie-health/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..c8c6ec2c5629dfa76c69e46bf4b4b6185ece5b63 --- /dev/null +++ b/model_zoo/ernie-health/preprocess.py @@ -0,0 +1,201 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import io +import multiprocessing +import os +import re +import sys +import time + +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import ElectraTokenizer + + +def parse_args(): + parser = argparse.ArgumentParser("Preprocessor for ERNIE-Health") + parser.add_argument( + "--input_path", type=str, required=True, help="The path to input text files where a sentence per line." + ) + parser.add_argument("--output_file", type=str, required=True, help="The output file path of preprocessed ids.") + parser.add_argument( + "--tokenize_tool", + type=str, + default="lac", + choices=["lac", "seg", "jieba"], + help="The tokenization tool for chinese words.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="The interval between progress updates.") + parser.add_argument("--num_worker", type=int, default=1, help="Number of worker processes to launch.") + + args = parser.parse_args() + return args + + +def lac_segmentation(): + from LAC import LAC + + tool = LAC(mode="lac") + + def process(text): + words, _ = tool.run(text) + return words + + return process + + +def seg_segmentation(): + from LAC import LAC + + tool = LAC(mode="seg") + + def process(text): + words = tool.run(text) + return words + + return process + + +def jieba_segmentation(): + import jieba + + def process(text): + words = jieba.cut(text) + return list(words) + + return process + + +SEGMENTATION_FN = {"lac": lac_segmentation(), "seg": seg_segmentation(), "jieba": jieba_segmentation()} + + +class ProcessFn(object): + def __init__(self, args): + self.args = args + + def initializer(self): + ProcessFn.tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + ProcessFn.segmenter = SEGMENTATION_FN[self.args.tokenize_tool] + # Update vocabulary with '##'-prefixed chinese characters. + # The ids should coincide with those in run_pretrain.py. + suffix_vocab = {} + for idx, token in enumerate(range(0x4E00, 0x9FA6)): + suffix_vocab["##" + chr(token)] = len(ProcessFn.tokenizer) + idx + ProcessFn.tokenizer.added_tokens_encoder.update(suffix_vocab) + + def mark_word_in_tokens(tokens, words, max_word_length=4): + word_set = set(words) + index = 0 + while index < len(tokens): + # Skip non-chinese characters. + if len(re.findall("[\u4E00-\u9FA5]", tokens[index])) == 0: + index += 1 + continue + # Find the word with maximum length and mark it. 
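+            # Greedy longest match: try spans of up to max_word_length tokens and, on a hit,
+            # prefix the non-initial characters of the matched word with '##'.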
+ find_word = False + for length in range(max_word_length, 0, -1): + if index + length > len(tokens): + continue + if "".join(tokens[index : index + length]) in word_set: + for i in range(1, length): + tokens[index + i] = "##" + tokens[index + i] + index += length + find_word = True + break + + if not find_word: + index += 1 + return tokens + + def process(text): + words = ProcessFn.segmenter(text.strip()) + tokens = ProcessFn.tokenizer.tokenize("".join(words)) + tokens = mark_word_in_tokens(tokens, words) + tokens = ProcessFn.tokenizer.convert_tokens_to_ids(tokens) + return tokens + + ProcessFn.process = process + + def encode(self, text): + token_ids = ProcessFn.process(text) + return token_ids, len(text.encode("utf-8")) + + +def main(): + args = parse_args() + + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, dirs, files in os.walk(args.input_path): + for file_name in files: + file_paths.append(os.path.join(root, file_name)) + file_paths.sort() + + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + save_dtype = np.uint16 if tokenizer.vocab_size < 2**16 - 1 else np.int32 + processer = ProcessFn(args) + + pool = multiprocessing.Pool(args.num_worker, initializer=processer.initializer) + + token_id_stream = io.BytesIO() + sent_len_stream = io.BytesIO() + + step = 0 + sent_count = 0 + total_bytes_processed = 0 + start_tic = time.time() + + for path in tqdm(file_paths): + text_fp = open(path, "r") + processed_text = pool.imap(processer.encode, text_fp, 256) + print("Processing %s" % path) + for i, (tokens, bytes_processed) in enumerate(processed_text, start=1): + step += 1 + total_bytes_processed += bytes_processed + + sentence_len = len(tokens) + if sentence_len == 0: + continue + sent_len_stream.write(sentence_len.to_bytes(4, byteorder="little", signed=True)) + sent_count += 1 + token_id_stream.write(np.array(tokens, dtype=save_dtype).tobytes(order="C")) + + if step % args.logging_steps == 0: + time_cost = time.time() - start_tic + mbs = total_bytes_processed / time_cost / 1024 / 1024 + print( + f"Processed {step} sentences", + f"({step/time_cost:.2f} sentences/s, {mbs:.4f} MB/s).", + file=sys.stderr, + ) + + pool.close() + print("Saving tokens to files...") + all_token_ids = np.frombuffer(token_id_stream.getbuffer(), dtype=save_dtype) + all_sent_lens = np.frombuffer(sent_len_stream.getbuffer(), dtype=np.int32) + np.save(args.output_file + "_ids.npy", all_token_ids) + np.savez(args.output_file + "_idx.npz", lens=all_sent_lens) + + print("Total sentences num: %d" % len(all_sent_lens)) + print("Total tokens num: %d" % len(all_token_ids)) + print("Average tokens per sentence: %.2f" % (len(all_token_ids) / len(all_sent_lens))) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/run_pretrain.py b/model_zoo/ernie-health/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..2774a865ed9cc62f681a76a31dd6b91c5c6ff290 --- /dev/null +++ b/model_zoo/ernie-health/run_pretrain.py @@ -0,0 +1,396 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from collections import defaultdict + +import numpy as np +import paddle +from dataset import DataCollatorForErnieHealth, MedicalCorpus, create_dataloader +from visualdl import LogWriter + +from paddlenlp.transformers import ( + ElectraConfig, + ElectraTokenizer, + ErnieHealthForTotalPretraining, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie-health": (ElectraConfig, ErnieHealthForTotalPretraining, ElectraTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + parser.add_argument( + "--model_name_or_path", + default="ernie-health-chinese", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--max_seq_length", default=512, type=int, help="The max length of each sequence") + parser.add_argument( + "--mlm_prob", default=0.15, type=float, help="The probability of tokens to be sampled as masks." + ) + parser.add_argument( + "--batch_size", + default=256, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=2e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--num_epochs", + default=100, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_epochs.", + ) + parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=10000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--init_from_ckpt", + action="store_true", + help="Whether to load model checkpoint. if True, args.model_name_or_path must be dir store ckpt or will train from fresh start", + ) + parser.add_argument( + "--use_amp", action="store_true", help="Whether to use float16(Automatic Mixed Precision) to train." 
+ ) + parser.add_argument("--eager_run", type=bool, default=True, help="Use dygraph mode.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", + ) + parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") + args = parser.parse_args() + return args + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(seed) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def do_train(args): + paddle.enable_static() if not args.eager_run else None + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + config_class, model_class, tokenizer_class = MODEL_CLASSES["ernie-health"] + + # Loads or initialize a model. + pretrained_models = list(tokenizer_class.pretrained_init_configuration.keys()) + + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + # Load checkpoint + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + with open(os.path.join(args.model_name_or_path, "run_states.json"), "r") as f: + config_dict = json.load(f) + model_name = config_dict["model_name"] + if model_name in pretrained_models: + model = model_class.from_pretrained(args.model_name_or_path) + model.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams"))) + else: + raise ValueError( + "initialize a model from ckpt need model_name " + "in model_config_file. The supported model_name " + "are as follows: {}".format(tokenizer_class.pretrained_init_configuration.keys()) + ) + else: + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + model_config = config_class() + model = model_class(model_config) + args.init_from_ckpt = False + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + # Loads dataset. + tic_load_data = time.time() + logger.info("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + + train_dataset = MedicalCorpus(data_path=args.input_dir, tokenizer=tokenizer) + logger.info("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. + data_collator = DataCollatorForErnieHealth( + tokenizer=tokenizer, max_seq_length=args.max_seq_length, mlm_prob=args.mlm_prob + ) + + train_data_loader = create_dataloader( + train_dataset, + batch_size=args.batch_size, + mode="train", + use_gpu=True if args.device in "gpu" else False, + data_collator=data_collator, + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_epochs) + args.num_epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
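+    # `apply_decay_param_fun` below restricts AdamW weight decay to these collected names.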
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + + logger.info("start train : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + trained_global_step = global_step = 0 + t_loss = defaultdict(lambda: paddle.to_tensor([0.0])) + log_loss = defaultdict(lambda: paddle.to_tensor([0.0])) + loss_list = defaultdict(list) + log_list = [] + tic_train = time.time() + + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + optimizer.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdopt"))) + trained_global_step = global_step = config_dict["global_step"] + if trained_global_step < num_training_steps: + logger.info( + "[ start train from checkpoint ] we have already trained %s steps, seeking next step : %s" + % (trained_global_step, trained_global_step + 1) + ) + else: + logger.info( + "[ start train from checkpoint ] we have already trained %s steps, but total training steps is %s, please check configuration !" + % (trained_global_step, num_training_steps) + ) + exit(0) + + if paddle.distributed.get_rank() == 0: + writer = LogWriter(os.path.join(args.output_dir, "loss_log")) + + for epoch in range(args.num_epochs): + for step, batch in enumerate(train_data_loader): + if trained_global_step > 0: + trained_global_step -= 1 + continue + global_step += 1 + masked_input_ids, input_ids, gen_labels = batch + + if args.use_amp: + with paddle.amp.auto_cast(): + loss, gen_loss, rtd_loss, mts_loss, csp_loss = model( + input_ids=masked_input_ids, + raw_input_ids=input_ids, + generator_labels=gen_labels, + ) + + scaled = scaler.scale(loss) + scaled.backward() + t_loss["loss"] += loss.detach() + t_loss["gen"] += gen_loss.detach() + t_loss["rtd"] += rtd_loss.detach() + t_loss["mts"] += mts_loss.detach() + t_loss["csp"] += csp_loss.detach() + scaler.minimize(optimizer, scaled) + else: + loss, gen_loss, rtd_loss, mts_loss, csp_loss = model( + input_ids=masked_input_ids, + raw_input_ids=input_ids, + generator_labels=gen_labels, + ) + loss.backward() + t_loss["loss"] += loss.detach() + t_loss["gen"] += gen_loss.detach() + t_loss["rtd"] += rtd_loss.detach() + t_loss["mts"] += mts_loss.detach() + t_loss["csp"] += csp_loss.detach() + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + local_loss = dict( + [(k, (t_loss[k] - log_loss[k]) / args.logging_steps) for k in ["loss", "gen", "rtd", "mts", "csp"]] + ) + if paddle.distributed.get_world_size() > 1: + for k in ["loss", "gen", "rtd", "mts", "csp"]: + paddle.distributed.all_gather(loss_list[k], local_loss[k]) + if paddle.distributed.get_rank() == 0: + tmp_loss = dict( + [ + (k, float((paddle.stack(loss_list[k]).sum() / len(loss_list[k])).numpy())) + for k in ["loss", "gen", "rtd", "mts", "csp"] + ] + ) + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, generator: {5:.15f}, rtd: {6:.15f}, multi_choice: {7:.15f}, " + "seq_contrastive: {8:.15f}, lr: {9:.10f}, speed: {10:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + tmp_loss["loss"], + tmp_loss["gen"], + tmp_loss["rtd"], + tmp_loss["mts"], + tmp_loss["csp"], + 
optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + logger.info(log_str) + log_list.append(log_str) + writer.add_scalar("generator_loss", tmp_loss["gen"], global_step) + writer.add_scalar("rtd_loss", tmp_loss["rtd"] * 50, global_step) + writer.add_scalar("mts_loss", tmp_loss["mts"] * 20, global_step) + writer.add_scalar("csp_loss", tmp_loss["csp"], global_step) + writer.add_scalar("total_loss", tmp_loss["loss"], global_step) + writer.add_scalar("lr", optimizer.get_lr(), global_step) + loss_list = defaultdict(list) + else: + local_loss = dict([(k, float(v)) for k, v in local_loss.items()]) + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, generator: {5:.15f}, rtd: {6:.15f}, multi_choice: {7:.15f}, " + "seq_contrastive_loss: {8:.15f}, lr: {9:.10f}, speed: {10:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + local_loss["loss"], + local_loss["gen"], + local_loss["rtd"], + local_loss["mts"], + local_loss["csp"], + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + logger.info(log_str) + log_list.append(log_str) + loss_dict = { + "generator_loss": local_loss["gen"], + "rtd_loss": local_loss["rtd"] * 50, + "mts_loss": local_loss["mts"] * 20, + "csp_loss": local_loss["csp"], + } + for k, v in loss_dict.items(): + writer.add_scalar("loss/%s" % k, v, global_step) + writer.add_scalar("total_loss", local_loss["loss"], global_step) + writer.add_scalar("lr", optimizer.get_lr(), global_step) + log_loss = dict(t_loss) + tic_train = time.time() + + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + config_to_save = model_to_save.discriminator.electra.config.to_dict() + if "self" in config_to_save: + del config_to_save["self"] + run_states = { + "model_name": model_name if args.init_from_ckpt else args.model_name_or_path, + "global_step": global_step, + "epoch": epoch, + "step": step, + } + with open(os.path.join(output_dir, "model_config.json"), "w") as f: + json.dump(config_to_save, f) + with open(os.path.join(output_dir, "run_states.json"), "w") as f: + json.dump(run_states, f) + paddle.save(model.state_dict(), os.path.join(output_dir, "model_state.pdparams")) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if len(log_list) > 0: + with open(os.path.join(output_dir, "train.log"), "w") as f: + for log in log_list: + if len(log.strip()) > 0: + f.write(log.strip() + "\n") + if global_step >= num_training_steps: + if paddle.distributed.get_rank() == 0: + writer.close() + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/model_zoo/ernie-health/run_pretrain_trainer.py b/model_zoo/ernie-health/run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..98505d68958f5ba9ff9b4f55bc76bb12d6228aa1 --- /dev/null +++ b/model_zoo/ernie-health/run_pretrain_trainer.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle 
Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from dataset import DataCollatorForErnieHealth, MedicalCorpus + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + ElectraConfig, + ElectraTokenizer, + ErnieHealthForTotalPretraining, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie-health": (ElectraConfig, ErnieHealthForTotalPretraining, ElectraTokenizer), +} + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + masked_lm_prob: float = field( + default=0.15, + metadata={"help": "Mask token prob."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="ernie-health", metadata={"help": "Only support for ernie-health pre-training for now."} + ) + model_name_or_path: str = field( + default="ernie-health-chinese", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + # training_args.recompute = True + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. 
" + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class, tokenizer_class = MODEL_CLASSES["ernie-health"] + + # Loads or initialize a model. + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + + model_config = config_class() + model = model_class(model_config) + + # Loads dataset. + tic_load_data = time.time() + logger.info("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + + train_dataset = MedicalCorpus(data_path=data_args.input_dir, tokenizer=tokenizer) + logger.info("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. + data_collator = DataCollatorForErnieHealth( + tokenizer=tokenizer, + max_seq_length=data_args.max_seq_length, + mlm_prob=data_args.masked_lm_prob, + return_dict=True, + ) + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=None, + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/run_trainer.sh b/model_zoo/ernie-health/run_trainer.sh new file mode 100644 index 0000000000000000000000000000000000000000..94b99e83471303621de611f6320024b0410e35b0 --- /dev/null +++ b/model_zoo/ernie-health/run_trainer.sh @@ -0,0 +1,45 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#set -x +unset CUDA_VISIBLE_DEVICES + +task_name="eheath-pretraining" +rm -rf output/$task_name/log + +python -u -m paddle.distributed.launch \ + --gpus 0,1,2,3,4,5,6,7 \ + run_pretrain_trainer.py \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --max_seq_length 512 \ + --gradient_accumulation_steps 1\ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 0.001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20 \ + --dataloader_num_workers 4 \ + --device "gpu"\ + --fp16 \ + --fp16_opt_level "O1" \ + --do_train \ + --disable_tqdm True\ + --save_total_limit 10 + +# WARNING: fp16_opt_level O2 may cause ehealth pretraining fail ! 
\ No newline at end of file diff --git a/model_zoo/ernie-layout/README.md b/model_zoo/ernie-layout/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a092f0a5c048115340bd4bc71762d46e2267d9f5 --- /dev/null +++ b/model_zoo/ernie-layout/README.md @@ -0,0 +1,435 @@ +English | [简体中文](README_ch.md) + +# ERNIE-Layout + + **content** + +- [ERNIE-Layout](#ERNIE-Layout) + - [1. Model Instruction](#1) + - [2. Out-of-Box](#2) + - [HuggingFace web demo](#21) + - [Demo show](#22) + - [Taskflow](#23) + - [3. Model Performance](#3) + - [4. Fine-tuning Examples](#4) + - [4.1 Key Information Extraction](#41) + - [4.2 Document Question Answering](#42) + - [4.3 Document Image Classification](#43) + - [5. Deploy](#5) + - [5.1 Inference Model Export](#51) + - [5.2 Python Deploy](#52) + + + +## 1. Model Instruction +Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. + +[The work](http://arxiv.org/abs/2210.06155) is accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout through PaddleNLP. + +
+ +
+ + + +## 2. Out-of-Box + + + +#### HuggingFace web demo + +🧾 HuggingFace web demo is available [here](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout) + +
+ +
+ + + +#### Demo show + +- Invoice VQA + +
+ +
+ +- Poster VQA + +
+ +
+ +- WebPage VQA + +
+ +
+ + +- Table VQA + +
+ +
+ + +- Exam Paper VQA + +
+ +
+
+
+
+- English invoice VQA with multilingual (CH, EN, JP, TH, ES, RUS) prompts
+
+
+ +
+
+
+- Chinese invoice VQA with multilingual (CHS, CHT, EN, JP, DE) prompts
+
+
+ +
+ +- Demo images are available [here](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip) + + + +#### Taskflow + +- Input Format + +``` +[ + {"doc": "./book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +Default to use PaddleOCR, you can also use your own OCR result via ``word_boxes``, the data format is ``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +- Support single and batch input + + - Image from http link + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence", lang="en") + >>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}]) + [{'prompt': "What is the name of the author of 'The Adventure Zone: The " + 'Crystal Kingdom’?', + 'result': [{'end': 39, + 'prob': 0.99, + 'start': 22, + 'value': 'Clint McElroy. Carey Pietsch, Griffn McElroy, Travis ' + 'McElroy'}]}, + {'prompt': 'What type of book cover does The Adventure Zone: The Crystal ' + 'Kingdom have?', + 'result': [{'end': 51, 'prob': 1.0, 'start': 51, 'value': 'Paperback'}]}, + {'prompt': 'For Rage, who is the author listed as?', + 'result': [{'end': 93, 'prob': 1.0, 'start': 91, 'value': 'Bob Woodward'}]}] + ``` + + - Image from local path + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) + [{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, + {'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, + {'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] + ``` + +- Parameter Description + * `batch_size`: number of input of each batch, default to 1. + * `lang`: PaddleOCR language, `en` is better to English images, default to `ch`. + * `topn`: return the top n results with highest probability, default to 1. + + + + +## 3. Model Performance + +- Dataset + + | Dataset | Task | Language | Note | + | --------- | ---------- | --- | ---- | + | FUNSD | Key Information Extraction | English | - | + | XFUND-ZH | Key Information Extraction | Chinese | - | + | DocVQA-ZH | Document Question Answering | Chinese | The submission of the competition of [DocVQA-ZH](http://ailab.aiwin.org.cn/competitions/49) is now closed so we split original dataset into three parts for model evluation. There are 4,187 training images, 500 validation images, and 500 test images.| + | RVL-CDIP (sampled) | Document Image Classification | English | The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Because of the original dataset is large and slow for training, so we downsampling from it. The sampled dataset consist of 6,400 training images, 800 validation images, and 800 test images. | + +- Results + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | --------- | --------- | --------- | --------- | + | LayoutXLM-Base | 86.72 | **90.88** | 86.24 | 66.01 | + | ERNIE-LayoutX-Base | **89.31** | 90.29 | **88.58** | **69.57** | + +- Evaluation Methods + + - All the above tasks do the Hyper Parameter searching based on Grid Search method. The evaluation step interval of FUNSD and XFUND-ZH are both 100, metric is F1-Score. The evaluation step interval of RVL-CDIP is 2000, metric is Accuracy. The evaluation step interval of DocVQA-ZH is 10000, metric is [ANLS](https://arxiv.org/pdf/1907.00490.pdf), + + - Hyper Parameters search ranges + + | Hyper Parameters | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ----------------- | ------- | -------- | -------- | --------- | + | learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | + | batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 | + | warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 | + + The strategy of ``lr_scheduler_type`` for FUNSD and XFUND is constant, so warmup_ratio is excluded. + + - ``max_steps`` is applied for the fine-tuning on both FUNSD and XFUND-ZH, 10000 steps and 20000 steps respectively; ``num_train_epochs`` is set to 6 and 20 for DocVQA-ZH and RVL-CDIP respectively. + +- Best Hyper Parameter + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | ------------ | ------------ | ------------ | ----------- | + | LayoutXLM-Base | 1e-5, 2, _ | 1e-5, 8, 0.1 | 1e-5, 2, _ | 2e-5. 8, 0.1 | + | ERNIE-LayoutX-Base | 2e-5, 4, _ | 1e-5, 8, 0. | 1e-5, 4, _ | 2e-5. 8, 0.05 | + + + + +## 4. 
Fine-tuning Examples + +- Installation + +``` +pip install -r requirements.txt +``` + + + +#### 4.1 Key Information Extraction + +- FUNSD Train + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \ + --dataset_name funsd \ + --do_train \ + --do_eval \ + --max_steps 10000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 2e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + +- XFUND-ZH Train + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \ + --dataset_name xfund_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --max_steps 20000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 1e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.2 Document Question Answering + +- DocVQA-ZH Train + +```shell +python3 -u run_mrc.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \ + --dataset_name docvqa_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --num_train_epochs 6 \ + --lr_scheduler_type linear \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 10000 \ + --save_steps 10000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "mrc" \ + --use_segment_box false \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 2e-5 \ + --preprocessing_num_workers 32 \ + --save_total_limit 1 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model anls \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.3 Document Image Classification + +- RVL-CDIP Train + +```shell +python3 -u run_cls.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \ + --dataset_name rvl_cdip_sampled \ + --do_train \ + --do_eval \ + --num_train_epochs 20 \ + --lr_scheduler_type linear \ + --max_seq_length 512 \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 2000 \ + --save_steps 2000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "cls" \ + --use_segment_box \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 1e-5 \ + --preprocessing_num_workers 32 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model acc \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + 
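+The hyper parameters above correspond to the best grid-search results reported in Section 3. As a rough illustration, a sweep over the FUNSD search ranges could be scripted as follows; this is a minimal sketch rather than an official tool, and it assumes `run_ner.py` is launched from this directory with the same flags as in the example command (bookkeeping flags are omitted for brevity).
+
+```python
+import itertools
+import subprocess
+
+# FUNSD search ranges from Section 3: learning_rate x batch_size.
+learning_rates = ["5e-6", "1e-5", "2e-5", "5e-5"]
+batch_sizes = ["1", "2", "4"]
+
+for lr, bs in itertools.product(learning_rates, batch_sizes):
+    output_dir = f"./ernie-layoutx-base-uncased/models/funsd_lr{lr}_bs{bs}/"
+    subprocess.run(
+        [
+            "python", "-u", "run_ner.py",
+            "--model_name_or_path", "ernie-layoutx-base-uncased",
+            "--dataset_name", "funsd",
+            "--output_dir", output_dir,
+            "--do_train", "--do_eval",
+            "--max_steps", "10000",
+            "--eval_steps", "100",
+            "--save_steps", "100",
+            "--pattern", "ner-bio",
+            "--use_segment_box",
+            "--per_device_train_batch_size", bs,
+            "--per_device_eval_batch_size", bs,
+            "--learning_rate", lr,
+            "--lr_scheduler_type", "constant",
+            "--metric_for_best_model", "eval_f1",
+            "--overwrite_output_dir",
+        ],
+        check=True,
+    )
+    # Compare the eval_f1 of each output_dir afterwards and keep the best run.
+```
+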
+## 5. Deploy + + + +#### 5.1 Inference Model Export + +After fine-tuning, you can also export the inference model via [Model Export Script](export_model.py), the inference model will be saved in the `output_path` you specified. + +- Export the model fine-tuned on FUNSD + +```shell +python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export +``` + +- Export the model fine-tuned on DocVQA-ZH + +```shell +python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export +``` + +- Export the model fine-tuned on RVL-CDIP(sampled) + +```shell +python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export +``` + +- Parameter Description + * `model_path`:the save directory of dygraph model parameters, default to "./checkpoint/"。 + * `output_path`:the save directory of static graph model parameters, default to "./export"。 + +- Directory + + ```text + export/ + ├── inference.pdiparams + ├── inference.pdiparams.info + └── inference.pdmodel + ``` + + + +#### 5.2 Python Deploy + +We provide the deploy example on Key Information Extraction, Document Question Answering and Document Image Classification, please follow the [ERNIE-Layout Python Deploy Guide](./deploy/python/README.md) + + + + +## References + +- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155) + +- [ICDAR 2019 Competition on Scene Text Visual Question Answering](https://arxiv.org/pdf/1907.00490.pdf) + +- [XFUND dataset](https://github.com/doc-analysis/XFUND) + +- [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) + +- [RVL-CDIP dataset](https://adamharley.com/rvl-cdip/) + +- [Competition of Insurance Document Visual Cognition Question Answering](http://ailab.aiwin.org.cn/competitions/49) diff --git a/model_zoo/ernie-layout/README_ch.md b/model_zoo/ernie-layout/README_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..d755952b20cf354038403da6612b9b8898a87993 --- /dev/null +++ b/model_zoo/ernie-layout/README_ch.md @@ -0,0 +1,437 @@ +[English](README.md) | 简体中文 + +# ERNIE-Layout + + **目录** + +- [1. 模型介绍](#1) +- [2. 开箱即用](#2) + - [HuggingFace web demo](#21) + - [应用场景展示](#22) + - [Taskflow](#23) +- [3. Benchmark模型效果](#3) +- [4. 模型微调](#4) + - [4.1 文档信息抽取任务](#41) + - [4.2 文档视觉问答任务](#42) + - [4.3 文档图像分类任务](#43) +- [5. 部署](#5) + - [5.1 静态图导出](#51) + - [5.2 Python部署](#52) + + + +## 1. 模型介绍 + + +ERNIE-Layout以文心文本大模型ERNIE为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解偶注意力机制,在各数据集上效果取得大幅度提升,相关工作[ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155)已被EMNLP 2022 Findings会议收录[1]。考虑到文档智能在多语种上商用广泛,依托PaddleNLP对外开源业界最强的多语言跨模态文档预训练模型ERNIE-Layout。 + +
+ +
+ + + +## 2. 开箱即用 + + + +#### HuggingFace web demo + +🧾 通过[Huggingface网页](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout)体验DocPrompt功能: + +
+ +
+ + + +#### 应用场景展示 + +- 发票抽取问答 + +
+ +
+ +- 海报抽取问答 + +
+ +
+ +- 网页抽取问答 + +
+ +
+ + +- 表格抽取问答 + +
+ +
+ + +- 试卷抽取问答 + +
+ +
+ + +- 英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答 + +
+ +
+ +- 中文票据多语种(中简、中繁、英、日、德语)抽取问答 + +
+ +
+ +- Demo图片可在此[下载](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip) + + + +#### Taskflow + +通过``paddlenlp.Taskflow``三行代码调用DocPrompt功能,具备多语言文档抽取问答能力,部分应用场景展示如下: + +- 输入格式 + +``` +[ + {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +默认使用PaddleOCR进行OCR识别,同时支持用户通过``word_boxes``传入自己的OCR结果,格式为``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +- 支持单条、批量预测 + + - 支持本地图片路径输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) + [{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, + {'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, + {'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] + ``` + + - http图片链接输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])) + [{'prompt': '发票号码是多少?', + 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]}, + {'prompt': '校验码是多少?', + 'result': [{'end': 233, + 'prob': 1.0, + 'start': 231, + 'value': '01107 555427109891646'}]}] + ``` + +- 可配置参数说明 + * `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + * `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 + * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。 + + + + +## 3. Benchmark模型效果 + +- 开源数据集介绍 + + | 数据集 | 任务类型 | 语言 | 说明 | + | --------- | ---------- | --- | ---- | + | FUNSD | 文档信息抽取 | 英文 | - | + | XFUND-ZH | 文档信息抽取 | 中文 | - | + | DocVQA-ZH | 文档视觉问答 | 中文 | [DocVQA-ZH](http://ailab.aiwin.org.cn/competitions/49)已停止榜单提交,因此我们将原始训练集进行重新划分以评估模型效果,划分后训练集包含4,187张图片,验证集包含500张图片,测试集包含500张图片。 | + | RVL-CDIP (sampled) | 文档图像分类 | 英文 | RVL-CDIP原始数据集共包含400,000张图片,由于数据集较大训练较慢,为验证文档图像分类的模型效果故进行降采样,采样后的训练集包含6,400张图片,验证集包含800张图片,测试集包含800张图片。 | + +- 评测结果 + + 在文档智能领域主流开源数据集的**验证集**上评测指标如下表所示: + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | --------- | --------- | --------- | --------- | + | LayoutXLM-Base | 86.72 | **90.88** | 86.24 | 66.01 | + | ERNIE-LayoutX-Base | **89.31** | 90.29 | **88.58** | **69.57** | + +- 具体评测方式 + + - 以上所有任务均基于Grid Search方式进行超参寻优。FUNSD和XFUND-ZH每间隔 100 steps 评估验证集效果,评价指标为F1-Score。 + RVL-CDIP每间隔2000 steps评估验证集效果,评价指标为Accuracy。DocVQA-ZH每间隔10000 steps评估验证集效果,取验证集最优效果作为表格中的汇报指标,评价指标为ANLS(计算方法参考https://arxiv.org/pdf/1907.00490.pdf)。 + + - 以上每个下游任务的超参范围如下表所示: + + | Hyper Parameters | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ----------------- | ------- | -------- | -------- | --------- | + | learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | + | batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 | + | warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 | + + FUNSD和XFUND-ZH使用的lr_scheduler_type策略是constant,因此不对warmup_ratio进行搜索。 + + - 文档信息抽取任务FUNSD和XFUND-ZH采用最大步数(max_steps)的微调方式,分别为10000 steps和20000 steps;文档视觉问答DocVQA-ZH的num_train_epochs为6;文档图像分类RVL-CDIP的num_train_epochs为20。 + +- 最优超参 + + 不同预训练模型在下游任务上做Grid Search之后的最优超参(learning_rate、batch_size、warmup_ratio)如下: + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | ------------ | ------------ | ------------ | ----------- | + | LayoutXLM-Base | 1e-5, 2, _ | 1e-5, 8, 0.1 | 1e-5, 2, _ | 2e-5. 8, 0.1 | + | ERNIE-LayoutX-Base | 2e-5, 4, _ | 1e-5, 8, 0. | 1e-5, 4, _ | 2e-5. 8, 0.05 | + + + + +## 4. 
模型微调 + +- 请执行以下命令进行安装项目依赖 + +``` +pip install -r requirements.txt +``` + + + +#### 4.1 文档信息抽取任务 + +- FUNSD训练 + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \ + --dataset_name funsd \ + --do_train \ + --do_eval \ + --max_steps 10000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 2e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + +- XFUND-ZH训练 + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \ + --dataset_name xfund_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --max_steps 20000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 1e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.2 文档视觉问答任务 + +- DocVQA-ZH训练 + +```shell +python3 -u run_mrc.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \ + --dataset_name docvqa_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --num_train_epochs 6 \ + --lr_scheduler_type linear \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 10000 \ + --save_steps 10000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "mrc" \ + --use_segment_box false \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 2e-5 \ + --preprocessing_num_workers 32 \ + --save_total_limit 1 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model anls \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.3 文档图像分类任务 + +- RVL-CDIP训练 + +```shell +python3 -u run_cls.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \ + --dataset_name rvl_cdip_sampled \ + --do_train \ + --do_eval \ + --num_train_epochs 20 \ + --lr_scheduler_type linear \ + --max_seq_length 512 \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 2000 \ + --save_steps 2000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "cls" \ + --use_segment_box \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 1e-5 \ + --preprocessing_num_workers 32 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model acc \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +## 5. 
部署 + + + +#### 5.1 静态图导出 + +使用动态图训练结束之后,还可以将动态图参数导出为静态图参数,静态图模型将用于**后续的推理部署工作**。具体代码见[静态图导出脚本](export_model.py),静态图参数保存在`output_path`指定路径中。运行方式: + + +- 导出在FUNSD上微调后的模型: + +```shell +python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export +``` + +- 导出在DocVQA-ZH上微调后的模型: + +```shell +python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export +``` + +- 导出在RVL-CDIP(sampled)上微调后的模型: + +```shell +python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export +``` + +- 可支持配置的参数: +* `model_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 +* `output_path`:静态图图保存的参数路径;默认为"./export"。 + +- 程序运行时将会自动导出模型到指定的 `output_path` 中,保存模型文件结构如下所示: + +```text +export/ +├── inference.pdiparams +├── inference.pdiparams.info +└── inference.pdmodel +``` + + + +#### 5.2 Python部署 + +导出静态图模型之后可用于部署,项目提供了文档信息抽取、文档视觉问答和文档图像分类三大场景下的使用示例,详见[ERNIE-Layout Python部署指南](./deploy/python/README_ch.md)。 + + + + +## References + +- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155) + +- [ICDAR 2019 Competition on Scene Text Visual Question Answering](https://arxiv.org/pdf/1907.00490.pdf) + +- [XFUND dataset](https://github.com/doc-analysis/XFUND) + +- [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) + +- [RVL-CDIP dataset](https://adamharley.com/rvl-cdip/) + +- [保险文本视觉认知问答竞赛](http://ailab.aiwin.org.cn/competitions/49) diff --git a/model_zoo/ernie-layout/data_collator.py b/model_zoo/ernie-layout/data_collator.py new file mode 100644 index 0000000000000000000000000000000000000000..bee1a06cf8162e07607c92b7a5a6635fdb914791 --- /dev/null +++ b/model_zoo/ernie-layout/data_collator.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Optional, Union +from dataclasses import dataclass + +from paddlenlp.transformers.tokenizer_utils_base import PretrainedTokenizerBase, PaddingStrategy + + +@dataclass +class DataCollator: + """ + Data collator that will dynamically pad the inputs received, as well as the labels. + Args: + tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`): + The tokenizer used for encoding the data. + padding (:obj:`bool`, :obj:`str` or :class:`~transformers.file_utils.PaddingStrategy`, `optional`, defaults to :obj:`True`): + Select a strategy to pad the returned sequences (according to the model's padding side and padding index) + among: + * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single + sequence if provided). + * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the + maximum acceptable input length for the model if that argument is not provided. 
+ * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of + different lengths). + max_length (:obj:`int`, `optional`): + Maximum length of the returned list and optionally padding length (see above). + pad_to_multiple_of (:obj:`int`, `optional`): + If set will pad the sequence to a multiple of the provided value. + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= + 7.5 (Volta). + label_pad_token_id (:obj:`int`, `optional`, defaults to -100): + The id to use when padding the labels (-100 will be automatically ignore by PyTorch loss functions). + """ + + tokenizer: PretrainedTokenizerBase + padding: Union[bool, str, PaddingStrategy] = True + max_length: Optional[int] = None + label_pad_token_id: int = -100 + pad_to_multiple_of: Optional[int] = None + return_tensors: str = "np" + + def __call__(self, features): + has_labels = "labels" in features[0] + for feat in features: + feat["input_ids"] = feat["input_ids"] + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_length - len(feat["input_ids"]) + ) + feat["attention_mask"] = feat["attention_mask"] + [ + 1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token] + ] * (self.max_length - len(feat["attention_mask"])) + feat["bbox"] = feat["bbox"] + [[0, 0, 0, 0] for _ in range(self.max_length - len(feat["bbox"]))] + if has_labels and not isinstance(feat["labels"], int): + feat["labels"] = feat["labels"] + [1 * self.label_pad_token_id] * ( + self.max_length - len(feat["labels"]) + ) + + batch = self.tokenizer.pad( + features, + padding=self.padding, + max_length=self.max_length, + pad_to_multiple_of=self.pad_to_multiple_of, + # Conversion to tensors will fail if we have labels as they are not of the same length yet. + return_tensors=self.return_tensors, + ) + return batch diff --git a/model_zoo/ernie-layout/deploy/python/README.md b/model_zoo/ernie-layout/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b4d2a52589fc849a15c63648dea60d02fdaeae78 --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/README.md @@ -0,0 +1,137 @@ +English | [简体中文](README_ch.md) + +# ERNIE-Layout Python Deploy Guide + +- [1. Quick Start](#1) +- [2. Key Information Extraction Deploy](#2) +- [3. Document Question Answering Deploy](#3) +- [4. Document Image Classification Deploy](#4) +- [5. Parameter Description](#5) + + + +## 1. Quick Start + +#### Environment + +- Dependency Installation + +``` +pip install -r requirements.txt +``` + +#### Data Preparation + +- Dowload the sample images and put in ``./images`` + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/document_intelligence/images.zip && unzip images.zip +``` + + + +## 2. Key Information Extraction Deploy + +- Run + +```shell +python infer.py \ + --model_path_prefix ../../ner_export/inference \ + --task_type ner \ + --lang "en" \ + --batch_size 8 +``` + +- Output sample + +``` +[{'doc': './images/ner_sample.jpg', + 'result': [{'text': 'ATT . GEN . ADMIN . OFFICE', + 'label': 'QUESTION', + 'start': 0, + 'end': 12, + 'probability': 0.8961102192651806}, + {'text': 'Fax :', + 'label': 'QUESTION', + 'start': 13, + 'end': 14, + 'probability': 0.8005126895801068}, + {'text': '614', + 'label': 'ANSWER', + 'start': 15, + 'end': 16, + 'probability': 0.5063673730110718}, + {'text': 'Dec 10', + 'label': 'ANSWER', + 'start': 23, + 'end': 24, + 'probability': 0.6265156606943465}, + + ...... 
+ + {'text': 'NOTE', + 'label': 'QUESTION', + 'start': 179, + 'end': 179, + 'probability': 0.9810855421041412}]}] +``` + + + +## 3. Document Question Answering Deploy + +- Run + +```shell +python infer.py \ + --model_path_prefix ../../mrc_export/inference \ + --task_type mrc \ + --lang "ch" \ + --batch_size 8 +``` + +- Output sample + +``` +[{'doc': './images/mrc_sample.jpg', + 'result': [{'question': '杨小峰是什么身份?', 'answer': ['法定代表人']}, + {'question': '花了多少钱进行注册的这个公司?', 'answer': ['壹仟壹佰万元']}, + {'question': '公司的类型属于什么?', 'answer': ['有限责任公司']}, + {'question': '杨小峰的住所是在哪里?', + 'answer': ['成都市武侯区佳灵路20号九峰国际1栋16楼62号']}, + {'question': '这个公司的法定代表人叫什么?', 'answer': ['杨小峰']}, + {'question': '91510107749745776R代表的是什么?', 'answer': ['统一社会信用代码']}, + {'question': '公司在什么时候成立的?', + 'answer': ['2003年7月22日营业期限2003年7月22日']}]}] +``` + + + +## 4. Document Image Classification Deploy + +- Run + +```shell +python infer.py \ + --model_path_prefix ../../cls_export/inference \ + --lang "en" \ + --task_type cls \ + --batch_size 8 +``` + +- Output sample + +``` +[{'doc': './images/cls_sample.jpg', 'result': 'email'}] +``` + + + +## 5. Parameter Description + +- `model_path_prefix`: The file path of the Paddle model for inference, with the file prefix name。For example, the inference model file path is `./export/inference.pdiparams`, then pass `./export/inference`。 +- `batch_size`: number of input of each batch, default to 1. +- `max_seq_length`: If the OCR result exceeds the set maximum length, the OCR result will be sliced. The default is 512. +- `task_type`: choose the task type,the options are `ner`, `cls` and `mrc`。 +- `lang`: select the task language,the options are `en` and `ch`。 +- `device`: choose the device,the options are `cpu` and `gpu`。 diff --git a/model_zoo/ernie-layout/deploy/python/README_ch.md b/model_zoo/ernie-layout/deploy/python/README_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..1eeb0debc1a60f810164eaf81997e3a6a9d7e8ed --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/README_ch.md @@ -0,0 +1,129 @@ +[English](README.md) | 简体中文 + +# ERNIE-Layout Python部署指南 + +本文介绍ERNIE-Layout Python部署指南,包括部署环境的准备,文档信息抽取、文档视觉问答和文档图像分类三大场景下的使用示例。 + +- [1. 开始运行](#1-开始运行) +- [2. 文档信息抽取模型推理](#2-文档信息抽取模型推理) +- [3. 文档视觉问答模型推理](#3-文档视觉问答模型推理) +- [4. 文档图像分类模型推理](#4-文档图像分类模型推理) +- [5. 更多配置](#5-更多配置) + +## 1. 开始运行 + +#### 环境要求 + +- 请执行以下命令进行安装项目依赖 + +``` +pip install -r requirements.txt +``` + +#### 数据准备 + +- 提供了少量图片数据,可用于后续章节的部署测试,下载后放在``./images``目录。 + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/document_intelligence/images.zip && unzip images.zip +``` + +## 2. 文档信息抽取模型推理 + +- 使用如下命令进行英文文档信息抽取部署 + +```shell +python infer.py \ + --model_path_prefix ../../ner_export/inference \ + --task_type ner \ + --lang "en" \ + --batch_size 8 +``` + +- 输出样例 + +``` +[{'doc': './images/ner_sample.jpg', + 'result': [{'text': 'ATT . GEN . ADMIN . OFFICE', + 'label': 'QUESTION', + 'start': 0, + 'end': 12, + 'probability': 0.8961102192651806}, + {'text': 'Fax :', + 'label': 'QUESTION', + 'start': 13, + 'end': 14, + 'probability': 0.8005126895801068}, + {'text': '614', + 'label': 'ANSWER', + 'start': 15, + 'end': 16, + 'probability': 0.5063673730110718}, + {'text': 'Dec 10', + 'label': 'ANSWER', + 'start': 23, + 'end': 24, + 'probability': 0.6265156606943465}, + + ...... + + {'text': 'NOTE', + 'label': 'QUESTION', + 'start': 179, + 'end': 179, + 'probability': 0.9810855421041412}]}] +``` + +## 3. 
文档视觉问答模型推理 + +- 使用如下命令进行中文文档视觉问答部署 + +```shell +python infer.py \ + --model_path_prefix ../../mrc_export/inference \ + --task_type mrc \ + --lang "ch" \ + --batch_size 8 +``` + +- 输出样例 + +``` +[{'doc': './images/mrc_sample.jpg', + 'result': [{'question': '杨小峰是什么身份?', 'answer': ['法定代表人']}, + {'question': '花了多少钱进行注册的这个公司?', 'answer': ['壹仟壹佰万元']}, + {'question': '公司的类型属于什么?', 'answer': ['有限责任公司']}, + {'question': '杨小峰的住所是在哪里?', + 'answer': ['成都市武侯区佳灵路20号九峰国际1栋16楼62号']}, + {'question': '这个公司的法定代表人叫什么?', 'answer': ['杨小峰']}, + {'question': '91510107749745776R代表的是什么?', 'answer': ['统一社会信用代码']}, + {'question': '公司在什么时候成立的?', + 'answer': ['2003年7月22日营业期限2003年7月22日']}]}] +``` + +## 4. 文档图像分类模型推理 + +- 使用如下命令进行英文文档图像分类部署 + +```shell +python infer.py \ + --model_path_prefix ../../cls_export/inference \ + --lang "en" \ + --task_type cls \ + --batch_size 8 +``` + +- 输出样例 + +``` +[{'doc': './images/cls_sample.jpg', 'result': 'email'}] +``` + +## 5. 更多配置 + +- `model_path_prefix`: 用于推理的Paddle模型文件路径,需加上文件前缀名称。例如模型文件路径为`./export/inference.pdiparams`,则传入`./export/inference`。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_length`: 如果OCR的结果超过设定的最大长度则对OCR结果进行自动切分,默认为512。 +- `task_type`: 选择任务类型,可选有`ner`, `cls`和`mrc`。 +- `lang`: 选择任务的语言类型,可选有`en`, `ch`。 +- `device`: 选用什么设备进行训练,可选`cpu`或`gpu`。 diff --git a/model_zoo/ernie-layout/deploy/python/infer.py b/model_zoo/ernie-layout/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..686a6c50a79443ec2c7ad57db9d22b18acb437ea --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/infer.py @@ -0,0 +1,67 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from predictor import Predictor + + +def parse_args(): + # yapf: disable + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") + parser.add_argument("--batch_size", default=4, type=int, help="Batch size per GPU for inference.") + parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum input sequence length. 
Sequences longer than this will be split automatically.") + parser.add_argument("--task_type", default="ner", type=str, choices=["ner", "cls", "mrc"], help="Specify the task type.") + parser.add_argument("--lang", default="en", type=str, choices=["ch", "en"], help="Specify the task type.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + args = parser.parse_args() + # yapf: enable + return args + + +def main(): + args = parse_args() + if args.task_type == "mrc": + args.questions = [ + [ + "公司的类型属于什么?", + "杨小峰的住所是在哪里?", + "这个公司的法定代表人叫什么?", + "花了多少钱进行注册的这个公司?", + "公司在什么时候成立的?", + "杨小峰是什么身份?", + "91510107749745776R代表的是什么?", + ], + ] + docs = ["./images/mrc_sample.jpg"] + elif args.task_type == "cls": + docs = ["./images/cls_sample.jpg"] + elif args.task_type == "ner": + docs = ["./images/ner_sample.jpg"] + else: + raise ValueError("Unspport task type: {}".format(args.task_type)) + + predictor = Predictor(args) + + outputs = predictor.predict(docs) + import pprint + + pprint.sorted = lambda x, key=None: x + pprint.pprint(outputs) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/deploy/python/predictor.py b/model_zoo/ernie-layout/deploy/python/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..5b11a4d901710d196105fb84b6f4e3b20c3553bd --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/predictor.py @@ -0,0 +1,867 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
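+
+# Deployment-side predictor for ERNIE-Layout. The pipeline implemented below is:
+# 1) run PaddleOCR on each input image, 2) convert the OCR output into an example
+# with ppocr2example, 3) build task-specific features (ner / cls / mrc),
+# 4) run the exported static-graph model through paddle.inference, and
+# 5) post-process the logits into entities, a document label, or answer spans.
+#
+# Typical usage (sketch, mirroring deploy/python/infer.py):
+#     predictor = Predictor(args)                      # args as parsed in infer.py
+#     outputs = predictor.predict(["./images/ner_sample.jpg"])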
+ +import base64 +import collections + +import cv2 +import numpy as np +import paddle +import scipy +import six +from paddleocr import PaddleOCR +from PIL import Image +from seqeval.metrics.sequence_labeling import get_entities + +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.image_utils import ppocr2example +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu"): + logger.info(">>> [InferBackend] Creating Engine ...") + config = paddle.inference.Config( + model_path_prefix + ".pdmodel", + model_path_prefix + ".pdiparams", + ) + if device == "gpu": + config.enable_use_gpu(100, 0) + config.switch_ir_optim(False) + else: + config.disable_gpu() + config.enable_mkldnn() + config.switch_use_feed_fetch_ops(False) + config.disable_glog_info() + config.enable_memory_optim() + self.predictor = paddle.inference.create_predictor(config) + self.input_names = [name for name in self.predictor.get_input_names()] + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handles = [self.predictor.get_output_handle(name) for name in self.predictor.get_output_names()] + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + for idx, input_name in enumerate(self.input_names): + self.input_handles[idx].copy_from_cpu(input_dict[input_name]) + self.predictor.run() + outputs = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return outputs + + +class Predictor(object): + def __init__(self, args): + use_gpu = True if args.device == "gpu" else False + self.tokenizer = AutoTokenizer.from_pretrained("ernie-layoutx-base-uncased") + self.batch_size = args.batch_size + self.max_seq_length = args.max_seq_length + self.task_type = args.task_type + self.lang = args.lang + self.ocr = PaddleOCR(use_angle_cls=True, lang=self.lang, show_log=False, use_gpu=use_gpu) + + self.examples_cache = collections.defaultdict(list) + self.features_cache = collections.defaultdict(list) + self._PrelimPrediction = collections.namedtuple( + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + self.inference_backend = InferBackend(args.model_path_prefix, device=args.device) + if self.task_type == "ner": + self.label_dict = { + "O": 0, + "B-ANSWER": 1, + "I-ANSWER": 2, + "B-HEADER": 3, + "I-HEADER": 4, + "B-QUESTION": 5, + "I-QUESTION": 6, + } + self.preprocess = self.preprocess_ner + self.postprocess = self.postprocess_ner + elif self.task_type == "cls": + self.label_dict = { + "advertisement": 0, + "budget": 1, + "email": 2, + "file folder": 3, + "form": 4, + "handwritten": 5, + "invoice": 6, + "letter": 7, + "memo": 8, + "news article": 9, + "presentation": 10, + "questionnaire": 11, + "resume": 12, + "scientific publication": 13, + "scientific report": 14, + "specification": 15, + } + self.preprocess = self.preprocess_cls + self.postprocess = self.postprocess_cls + elif self.task_type == "mrc": + self.questions = args.questions + self.preprocess = self.preprocess_mrc + self.postprocess = self.postprocess_mrc + else: + raise ValueError("Unspport task type: {}".format(args.task_type)) + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span["start"] + doc_span["length"] - 1 + if 
position < doc_span["start"]: + continue + if position > end: + continue + num_left_context = position - doc_span["start"] + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span["length"] + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + return cur_span_index == best_span_index + + def _get_best_indexes(self, logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + def get_predictions(self, pred, label_list): + pred = scipy.special.softmax(pred, axis=-1) + pred_ids = np.argmax(pred, axis=1) + prediction_score = [pred[idx][i] for idx, i in enumerate(pred_ids)] + predictions = [label_list[i] for i in pred_ids] + return predictions, prediction_score + + def get_final_text(self, pred_text, orig_text, do_lower_case, tokenizer): + """Project the tokenized prediction back to the original text.""" + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + tok_text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. 
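+        # Invert tok_ns_to_s_map so positions in tok_text can be mapped to its
+        # space-stripped form; orig_ns_to_s_map then maps them back into orig_text.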
+ tok_s_to_ns_map = {} + for (i, tok_index) in six.iteritems(tok_ns_to_s_map): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + return orig_text + + output_text = orig_text[orig_start_position : (orig_end_position + 1)] + return output_text + + def preprocess_ner(self, examples, doc_stride=128, target_size=1000, max_size=1000): + ignore_label_id = -100 + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + all_doc_token_labels = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + orig_labels = ["O"] * len(example_text) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if self.lang == "ch": + sub_tokens = self.tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = self.tokenizer.tokenize(token) + label = orig_labels[i] + box = bboxes[i] + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + all_doc_token_labels.append(label) + + max_tokens_for_doc = self.max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_label_ids = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + tokens.append(self.tokenizer.cls_token) + token_boxes.append(cls_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + token_label_ids.append(self.label_dict[all_doc_token_labels[split_token_index]]) + sentence_ids.append(0) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.sep_token) + token_boxes.append(sep_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + input_mask = [1] * len(tokens) 
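+            # Pad this span out to max_seq_length: padded positions receive the pad
+            # token, attention mask 0, an all-zero bbox and the -100 ignore label, so
+            # they are excluded from the loss and dropped again in postprocess_ner.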
+ + while len(tokens) < self.max_seq_length: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + token_label_ids.append(ignore_label_id) + + input_ids = self.tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + tokenized_examples["id"].append(example_idx) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + tokenized_examples["labels"].append(token_label_ids) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + for input_id in tokenized_examples["input_ids"]: + input_id = input_id + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(input_id) + ) + + for att_mask in tokenized_examples["attention_mask"]: + att_mask = att_mask + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(att_mask) + ) + + for bbox in tokenized_examples["bbox"]: + bbox = bbox + [[0, 0, 0, 0] for _ in range(self.max_seq_length - len(bbox))] + + for label in tokenized_examples["labels"]: + label = label + [1 * ignore_label_id] * (self.max_seq_length - len(label)) + + self.examples_cache["name"] = list(range(len(examples["text"]))) + self.examples_cache["text"] = [item for item in examples["text"]] + self.features_cache["id"] = [item for item in tokenized_examples["id"]] + self.features_cache["labels"] = [item for item in tokenized_examples["labels"]] + self.features_cache["tokens"] = [item for item in tokenized_examples["tokens"]] + self.features_cache["token_is_max_context"] = [item for item in tokenized_examples["token_is_max_context"]] + self.features_cache["token_to_orig_map"] = [item for item in tokenized_examples["token_to_orig_map"]] + return tokenized_examples + + def postprocess_ner(self, preds): + separator = "" if self.lang == "ch" else " " + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions = [] + recover_preds = [] + + for eid, example_id in enumerate(self.examples_cache["name"]): + prediction_tags = [] + feature_map = example_id + features_ids = feature_id_to_features[feature_map] + gather_pred = [] + gather_label = [] + gather_tokens = [] + gather_score = [] + gather_map = [] + for idx in features_ids: + pred, label = preds[idx], self.features_cache["labels"][idx] + prediction, prediction_score = self.get_predictions(pred, list(self.label_dict.keys())) + + token_is_max_context = self.features_cache["token_is_max_context"][idx] + token_to_orig_map = self.features_cache["token_to_orig_map"][idx] + for token_idx in range(len(token_is_max_context)): + token_idx += 1 + if token_is_max_context[str(token_idx)]: + gather_tokens.append(self.features_cache["tokens"][idx][token_idx]) + gather_pred.append(prediction[token_idx]) + gather_score.append(prediction_score[token_idx]) + gather_label.append(label[token_idx]) + gather_map.append(token_to_orig_map[str(token_idx)]) + + recover_pred = [p for (p, l) 
in zip(gather_pred, gather_label) if l != -100] + + pred_entities = get_entities(recover_pred) + recover_preds.append(recover_pred) + + for item in pred_entities: + entity = self.tokenizer.convert_tokens_to_string(gather_tokens[item[1] : (item[2] + 1)]).strip() + orig_doc_start = gather_map[item[1]] + orig_doc_end = gather_map[item[2]] + orig_tokens = self.examples_cache["text"][eid][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + final_text = self.get_final_text(entity, orig_text, False, self.tokenizer) + final_text = final_text.replace(" ", " ") + + res = { + "text": final_text, + "label": item[0], + "start": item[1], + "end": item[2], + "probability": sum(gather_score[item[1] : item[2] + 1]) / (item[2] - item[1] + 1), + } + prediction_tags.append(res) + + predictions.append(prediction_tags) + return predictions + + def preprocess_cls(self, examples, doc_stride=128, target_size=1000, max_size=1000): + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if self.lang == "ch": + sub_tokens = self.tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = self.tokenizer.tokenize(token) + box = bboxes[i] + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + max_tokens_for_doc = self.max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for doc_span in doc_spans: + + tokens = [] + token_boxes = [] + sentence_ids = [] + tokens.append(self.tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(0) + + tokens.append(self.tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(0) + input_mask = [1] * len(tokens) + + while len(tokens) < self.max_seq_length: + tokens.append(self.tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + + input_ids = self.tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + tokenized_examples["id"].append(example_idx) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + 
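+                # The page image, decoded once for this document, is attached to every span.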
tokenized_examples["image"].append(image) + for input_id in tokenized_examples["input_ids"]: + input_id = input_id + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(input_id) + ) + + for att_mask in tokenized_examples["attention_mask"]: + att_mask = att_mask + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(att_mask) + ) + + for bbox in tokenized_examples["bbox"]: + bbox = bbox + [[0, 0, 0, 0] for _ in range(self.max_seq_length - len(bbox))] + + self.examples_cache["name"] = list(range(len(examples["text"]))) + self.features_cache["id"] = [item for item in tokenized_examples["id"]] + return tokenized_examples + + def postprocess_cls(self, preds): + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions = [] + + for example_id in self.examples_cache["name"]: + features_ids = feature_id_to_features[example_id] + + max_rcd = [0, -1] + for idx in features_ids: + pred = preds[idx] + pred = scipy.special.softmax(pred, axis=-1) + pred_id = int(np.argmax(pred, axis=-1)) + if pred[pred_id] > max_rcd[0]: + max_rcd = [pred[pred_id], pred_id] + + predictions.append(list(self.label_dict.keys())[max_rcd[1]]) + return predictions + + def preprocess_mrc(self, examples, doc_stride=128, max_query_length=64, target_size=1000, max_size=1000): + qid = -1 + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + query_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if self.lang == "ch": + sub_tokens = self.tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = self.tokenizer.tokenize(token) + box = bboxes[i] + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + for question in self.questions[example_idx]: + qid += 1 + query_tokens = self.tokenizer.tokenize( + question, add_special_tokens=False, truncation=False, max_length=max_query_length + ) + + start_offset = 0 + doc_spans = [] + max_tokens_for_doc = self.max_seq_length - len(query_tokens) - 3 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + seg_a = 0 + seg_b = 1 + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(seg_a) + + for i in range(doc_span["length"]): + 
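+                        # Copy this span's sub-tokens into the feature, remembering each
+                        # token's original word index and whether this span is its
+                        # maximum-context span (used later by postprocess_mrc).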
split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(seg_a) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(seg_a) + input_mask = [1] * len(tokens) + + while len(tokens) < self.max_seq_length - len(query_tokens) - 1: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(seg_b) + token_boxes.append(pad_token_box) + + for token in query_tokens: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(token) + input_mask.append(1) + sentence_ids.append(seg_b) + token_boxes.append(query_token_box) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.sep_token) + input_mask.append(1) + token_boxes.append(sep_token_box) + sentence_ids.append(seg_b) + + input_ids = self.tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(tokens) - len(query_tokens) - 1)) + list( + range(len(query_tokens) + 1) + ) + + answer_rcd = [] + start_position = -1 + end_position = -1 + + start_labels = [0] * len(input_ids) + end_labels = [0] * len(input_ids) + start_labels[start_position] = 1 + end_labels[end_position] = 1 + answer_rcd.append([start_position, end_position]) + + tokenized_examples["id"].append(example_idx) + tokenized_examples["question_id"].append(qid) + tokenized_examples["questions"].append(question) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + for input_id in tokenized_examples["input_ids"]: + input_id = input_id + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(input_id) + ) + + for att_mask in tokenized_examples["attention_mask"]: + att_mask = att_mask + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(att_mask) + ) + + for bbox in tokenized_examples["bbox"]: + bbox = bbox + [[0, 0, 0, 0] for _ in range(self.max_seq_length - len(bbox))] + self.examples_cache["name"] = list(range(len(examples["text"]))) + self.examples_cache["text"] = [item for item in examples["text"]] + self.features_cache["id"] = [item for item in tokenized_examples["id"]] + self.features_cache["question_id"] = [item for item in tokenized_examples["question_id"]] + self.features_cache["tokens"] = [item for item in tokenized_examples["tokens"]] + self.features_cache["questions"] = [item for item in tokenized_examples["questions"]] + self.features_cache["token_is_max_context"] = [item for item in tokenized_examples["token_is_max_context"]] 
+ self.features_cache["token_to_orig_map"] = [item for item in tokenized_examples["token_to_orig_map"]] + return tokenized_examples + + def postprocess_mrc(self, preds, max_answer_length=64, n_best_size=5): + separator = "" if self.lang == "ch" else " " + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions = collections.defaultdict(lambda: collections.defaultdict(list)) + for ei, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + features_ids = feature_id_to_features[feature_map] + prelim_predictions = [] + for idx in features_ids: + start_logits = preds[0][idx] + end_logits = preds[1][idx] + + start_indexes = self._get_best_indexes(start_logits, n_best_size) + end_indexes = self._get_best_indexes(end_logits, n_best_size) + token_is_max_context = self.features_cache["token_is_max_context"][idx] + + for start_index in start_indexes: + for end_index in end_indexes: + if not token_is_max_context.get(str(start_index), False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + self._PrelimPrediction( + feature_index=idx, + start_index=start_index, + end_index=end_index, + start_logit=start_logits[start_index], + end_logit=end_logits[end_index], + ) + ) + + prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + for rcd in prelim_predictions: + + question_id = self.features_cache["question_id"][rcd.feature_index] + question = self.features_cache["questions"][rcd.feature_index] + if question_id in predictions[example_id]: + continue + + if rcd.start_index > 0: + tok_tokens = self.features_cache["tokens"][rcd.feature_index][ + rcd.start_index : (rcd.end_index + 1) + ] + orig_doc_start = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.start_index)] + orig_doc_end = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.end_index)] + orig_tokens = self.examples_cache["text"][ei][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + + tok_text = self.tokenizer.convert_tokens_to_string(tok_tokens).strip() + final_text = self.get_final_text(tok_text, orig_text, False, self.tokenizer) + else: + continue + if question_id in predictions[example_id]: + predictions[example_id][question_id]["answer"].append(final_text) + else: + predictions[example_id][question_id] = {"question": question, "answer": [final_text]} + formatted_predictions = [] + for v in predictions.values(): + formatted_predictions.append([{"question": qa["question"], "answer": qa["answer"]} for qa in v.values()]) + return formatted_predictions + + def infer(self, data): + return self.inference_backend.infer(data) + + def predict(self, docs): + input_data = [] + for doc in docs: + ocr_result = self.ocr.ocr(doc, cls=True) + # Compatible with paddleocr>=2.6.0.2 + ocr_result = ocr_result[0] if len(ocr_result) == 1 else ocr_result + example = ppocr2example(ocr_result, doc) + input_data.append(example) + + inputs = collections.defaultdict(list) + for data in input_data: + for k in data.keys(): + inputs[k].append(data[k]) + + preprocess_result = self.preprocess(inputs) + preds = [[], []] if self.task_type == "mrc" else [] + for idx in range(0, len(preprocess_result["id"]), self.batch_size): + l, r = idx, idx + self.batch_size + input_dict = {} + for input_name in 
self.inference_backend.input_names: + input_dict[input_name] = np.array(preprocess_result[input_name][l:r], dtype="int64") + output = self.infer(input_dict) + if self.task_type != "mrc": + preds.extend(output[0].tolist()) + else: + preds[0].extend(output[0].tolist()) + preds[1].extend(output[1].tolist()) + results = self.postprocess(preds) + formatted_results = [] + for doc, res in zip(docs, results): + formatted_result = {"doc": doc, "result": res} + formatted_results.append(formatted_result) + return formatted_results + + +def _decode_image(im_base64): + """Decode image""" + if im_base64 is not None: + image = base64.b64decode(im_base64.encode("utf-8")) + im = np.frombuffer(image, dtype="uint8") + im = cv2.imdecode(im, 1) + im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) + return im + else: + return np.zeros([224, 224, 3], dtype=np.uint8) + + +def _resize_image( + im, + target_size=0, + interp=cv2.INTER_LINEAR, + resize_box=False, +): + """Resize the image numpy.""" + if not isinstance(im, np.ndarray): + raise TypeError("image type is not numpy.") + if len(im.shape) != 3: + raise ValueError("image is not 3-dimensional.") + im_shape = im.shape + im_size_min = np.min(im_shape[0:2]) + selected_size = target_size + if float(im_size_min) == 0: + raise ZeroDivisionError("min size of image is 0") + resize_w = selected_size + resize_h = selected_size + + im = im.astype("uint8") + im = Image.fromarray(im) + im = im.resize((int(resize_w), int(resize_h)), interp) + im = np.array(im) + return im + + +def _scale_same_as_image(boxes, width, height, target_size): + """ + Scale the bounding box of each character within maximum boundary. + """ + scale_x = target_size / width + scale_y = target_size / height + + new_boxes = [ + [ + int(max(0, min(box[0] * scale_x, target_size - 1))), + int(max(0, min(box[1] * scale_y, target_size - 1))), + int(max(0, min(box[2] * scale_x, target_size - 1))), + int(max(0, min(box[3] * scale_y, target_size - 1))), + ] + for box in boxes + ] + return new_boxes, (scale_x, scale_y) + + +def _permute(im, channel_first=True, to_bgr=False): + """Permute""" + if channel_first: + im = np.swapaxes(im, 1, 2) + im = np.swapaxes(im, 1, 0) + if to_bgr: + im = im[[2, 1, 0], :, :] + return im + + +def _str2im( + im_base64, + target_size=224, + mean=[103.530, 116.280, 123.675], + std=[57.375, 57.120, 58.395], +): + # step1: decode image + origin_im = _decode_image(im_base64) + # step2: resize image + im = _resize_image(origin_im, target_size=target_size, interp=1, resize_box=False) + return im, origin_im diff --git a/model_zoo/ernie-layout/deploy/python/requirements.txt b/model_zoo/ernie-layout/deploy/python/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0adfb83a41a922a8329a3d598d5873081cf639da --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/requirements.txt @@ -0,0 +1 @@ +paddleocr diff --git a/model_zoo/ernie-layout/export_model.py b/model_zoo/ernie-layout/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..ea6e6e2a2cbd39fb5e80ab47ea0821365f5e95b3 --- /dev/null +++ b/model_zoo/ernie-layout/export_model.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoModelForQuestionAnswering, + AutoModelForTokenClassification, +) + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path", type=str, required=True, default='./ernie-layoutx-base-uncased/models/funsd/1e-5_2/', help="The path to model parameters to be loaded.") +parser.add_argument("--task_type", type=str, required=True, default="ner", choices=["ner", "cls", "mrc"], help="Select the task type.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + if args.task_type == "ner": + model = AutoModelForTokenClassification.from_pretrained(args.model_path) + elif args.task_type == "mrc": + model = AutoModelForQuestionAnswering.from_pretrained(args.model_path) + elif args.task_type == "cls": + model = AutoModelForSequenceClassification.from_pretrained(args.model_path) + else: + raise ValueError("Unsppoorted task type!") + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None, None], dtype="int64", name="bbox"), + paddle.static.InputSpec(shape=[None, None, None, None], dtype="int64", name="image"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="attention_mask"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="position_ids"), + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/model_zoo/ernie-layout/finetune_args.py b/model_zoo/ernie-layout/finetune_args.py new file mode 100644 index 0000000000000000000000000000000000000000..11d0e8fa940f86f895718cdb0dead4d4066a508e --- /dev/null +++ b/model_zoo/ernie-layout/finetune_args.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Optional +from dataclasses import dataclass, field + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. 
+ """ + + task_name: Optional[str] = field(default="ner", metadata={"help": "The name of the task (ner, pos...)."}) + dataset_name: Optional[str] = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + dataset_config_name: Optional[str] = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + doc_stride: int = field( + default=128, + metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."}, + ) + target_size: int = field( + default=1024, + metadata={"help": "The maximum 2d pos size"}, + ) + pad_to_max_length: bool = field( + default=True, + metadata={ + "help": "Whether to pad all samples to model maximum sentence length. " + "If False, will pad the samples dynamically when batching to the maximum length in the batch. More " + "efficient on GPU but very bad for TPU." + }, + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + }, + ) + max_val_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of validation examples to this " + "value if set." + }, + ) + max_test_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of test examples to this " + "value if set." + }, + ) + label_all_tokens: bool = field( + default=False, + metadata={ + "help": "Whether to put the label for one word on all tokens of generated by that word or just on the " + "one (in which case the other tokens will have a padding index)." + }, + ) + return_entity_level_metrics: bool = field( + default=False, + metadata={"help": "Whether to return all the entity levels during evaluation or just the overall ones."}, + ) + train_log_file: Optional[str] = field( + default=None, + metadata={"help": "train log file"}, + ) + train_nshard: Optional[int] = field( + default=1, + metadata={"help": "For big dataset, DocVQA/CORD when using ner3 pattern"}, + ) + use_segment_box: bool = field( + default=False, + metadata={"help": "Whether use segment box"}, + ) + task_type: str = field( + default="ner", + metadata={"help": "The task type"}, + ) + pattern: Optional[str] = field( + default="ner1", + metadata={"help": "The way to process input, choose from ner1, ner2, ner3"}, + ) + rst_converter: Optional[str] = field( + default=None, + metadata={"help": "The way to convert the predict result"}, + ) + lang: Optional[str] = field( + default="en", + metadata={"help": "Languge type of the dataset"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+ """ + + model_name_or_path: str = field( + metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, + ) diff --git a/model_zoo/ernie-layout/layout_trainer.py b/model_zoo/ernie-layout/layout_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..ed2e8afc6b3dd0335e4fc83d64abed4847c89376 --- /dev/null +++ b/model_zoo/ernie-layout/layout_trainer.py @@ -0,0 +1,132 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +from typing import Dict + +from paddlenlp.trainer import Trainer + + +class LayoutTrainer(Trainer): + def __init__(self, *args, eval_examples=None, post_process_function=None, convert_fn=None, **kwargs): + super().__init__(*args, **kwargs) + self.eval_examples = eval_examples + self.post_process_function = post_process_function + self.convert_fn = convert_fn + + def save_predictions(self, split, preds, labels): + """ + Save metrics into a json file for that split, e.g. `train_results.json`. + Under distributed environment this is done only for a process with rank 0. + Args: + split (`str`): + Mode/split name: one of `train`, `eval`, `test`, `all` + To understand the metrics please read the docstring of [`~Trainer.log_metrics`]. The only difference is that raw + unformatted numbers are saved in the current method. 
+ """ + + path = os.path.join(self.args.output_dir, f"{split}_predictions.json") + with open(path, "w") as f: + json.dump(preds, f, ensure_ascii=False, indent=4, sort_keys=True) + + path = os.path.join(self.args.output_dir, f"{split}_golden_labels.json") + with open(path, "w") as f: + json.dump(labels, f, ensure_ascii=False, indent=4, sort_keys=True) + + def evaluate( + self, + eval_dataset=None, + eval_examples=None, + ignore_keys=None, + metric_key_prefix="eval", + ) -> Dict[str, float]: + + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_examples = self.eval_examples if eval_examples is None else eval_examples + eval_dataloader = self.get_eval_dataloader(eval_dataset) + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + eval_dataloader, + description="Evaluation", + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + pred_rst, gt_rst, eval_preds = self.post_process_function( + eval_examples, eval_dataset, output.predictions, output.label_ids + ) + self.save_predictions("eval", pred_rst, gt_rst) + metrics = self.compute_metrics(eval_preds) + if self.convert_fn is not None: + processed_metrics = self.convert_fn(pred_rst, self.args.output_dir) + if processed_metrics is not None: + metrics.update(processed_metrics) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + self.log(metrics) + else: + metrics = {} + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics) + return metrics + + def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"): + + predict_dataloader = self.get_test_dataloader(predict_dataset) + + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + predict_dataloader, + description="Prediction", + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + pred_rst, gt_rst, eval_preds = self.post_process_function( + predict_examples, predict_dataset, output.predictions, output.label_ids + ) + self.save_predictions("test", pred_rst, gt_rst) + metrics = self.compute_metrics(eval_preds) + + if self.convert_fn is not None: + processed_metrics = self.convert_fn(pred_rst, self.args.output_dir) + if processed_metrics is not None: + metrics.update(processed_metrics) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + else: + metrics = {} + return metrics diff --git a/model_zoo/ernie-layout/requirements.txt b/model_zoo/ernie-layout/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..f3e7debe27f8d7705700b9fafc5f99ad23ade721 --- /dev/null +++ b/model_zoo/ernie-layout/requirements.txt @@ -0,0 +1,2 @@ +editdistance>=0.6.0 +opencv-python>=4.6.0.66 diff --git a/model_zoo/ernie-layout/run_cls.py 
b/model_zoo/ernie-layout/run_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..9784cf05f6cf1698941626fd8709e28ab5a8a4bc --- /dev/null +++ b/model_zoo/ernie-layout/run_cls.py @@ -0,0 +1,209 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +import datasets +import paddle +from data_collator import DataCollator +from datasets import load_dataset +from finetune_args import DataArguments, ModelArguments +from layout_trainer import LayoutTrainer +from paddle.metric import Accuracy +from utils import PostProcessor, PreProcessor, get_label_ld + +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + train_ds, dev_ds, test_ds = load_dataset(data_args.dataset_name, split=["train", "validation", "test"]) + + if training_args.do_train: + column_names = train_ds.column_names + elif training_args.do_eval: + column_names = dev_ds.column_names + elif training_args.do_predict: + column_names = test_ds.column_names + else: + logger.info("There is nothing to do. 
Please pass `do_train`, `do_eval` and/or `do_predict`.") + raise NotImplementedError + + label_list, label_to_id = get_label_ld(train_ds["qas"], scheme="cls") + num_labels = len(label_list) + + # Load Model and Tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_labels) + model.config["has_visual_segment_embedding"] = False + + preprocessor = PreProcessor() + postprocessor = PostProcessor() + training_args.label_names = ["labels"] + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + preprocess_func = partial( + preprocessor.preprocess_cls, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + label_dict=label_to_id, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + ) + preprocess_func_for_valid = preprocess_func + + postprocess_func = partial(postprocessor.postprocess_cls, label_list=label_list, tokenizer=tokenizer) + + # Dataset pre-process + if training_args.do_train: + if data_args.train_nshard > 1: + logger.info(f"spliting train dataset into {data_args.train_nshard} shard") + train_shards = [] + for idx in range(data_args.train_nshard): + train_shards.append( + train_ds.shard(num_shards=data_args.train_nshard, index=idx).map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + ) + train_dataset = datasets.concatenate_datasets(train_shards) + else: + train_dataset = train_ds.map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_eval: + eval_dataset = dev_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_predict: + test_dataset = test_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + # Data collator + data_collator = DataCollator( + tokenizer, padding="max_length", label_pad_token_id=-100, max_length=max_seq_length, return_tensors="pd" + ) + + def compute_metrics(eval_preds): + preds = paddle.to_tensor(eval_preds.predictions) + labels = paddle.to_tensor(eval_preds.label_ids) + + metric = Accuracy() + metric.reset() + correct = preds == labels + correct = paddle.cast(paddle.unsqueeze(correct, axis=-1), dtype="float32") + + metric.update(correct) + accu = metric.accumulate() + metric.reset() + return {"acc": accu} + + trainer = LayoutTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + post_process_function=postprocess_func, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if 
training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + + max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_val_samples, len(eval_dataset)) + + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + postprocessor.examples_cache = collections.defaultdict(list) + postprocessor.features_cache = collections.defaultdict(list) + metrics = trainer.predict(test_dataset, test_ds) + trainer.log_metrics("test", metrics) + trainer.save_metrics("test", metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/run_mrc.py b/model_zoo/ernie-layout/run_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..d1f4f918c8a01fc83a77c82a492fa82fbdfcccf2 --- /dev/null +++ b/model_zoo/ernie-layout/run_mrc.py @@ -0,0 +1,242 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +import datasets +import paddle +from data_collator import DataCollator +from datasets import load_dataset +from finetune_args import DataArguments, ModelArguments +from layout_trainer import LayoutTrainer +from utils import PostProcessor, PreProcessor, anls_score + +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForQuestionAnswering, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. 
To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + train_ds, dev_ds, test_ds = load_dataset(data_args.dataset_name, split=["train", "validation", "test"]) + + if training_args.do_train: + column_names = train_ds.column_names + elif training_args.do_eval: + column_names = dev_ds.column_names + elif training_args.do_predict: + column_names = test_ds.column_names + else: + logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.") + raise NotImplementedError + + num_labels = 2 + + # Load Model and Tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForQuestionAnswering.from_pretrained(model_args.model_name_or_path, num_classes=num_labels) + model.config["has_visual_segment_embedding"] = False + + preprocessor = PreProcessor() + postprocessor = PostProcessor() + training_args.label_names = ["start_positions", "end_positions"] + + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + preprocess_func = partial( + preprocessor.preprocess_mrc, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + is_training=True, + lang=data_args.lang, + ) + + preprocess_func_for_valid = partial( + preprocessor.preprocess_mrc, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + is_training=False, + lang=data_args.lang, + ) + + postprocess_func = partial(postprocessor.postprocess_mrc, tokenizer=tokenizer, lang=data_args.lang) + + # Dataset pre-process + if training_args.do_train: + if data_args.train_nshard > 1: + logger.info(f"spliting train dataset into {data_args.train_nshard} shard") + train_shards = [] + for idx in range(data_args.train_nshard): + train_shards.append( + train_ds.shard(num_shards=data_args.train_nshard, index=idx).map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + ) + train_dataset = datasets.concatenate_datasets(train_shards) + else: + train_dataset = train_ds.map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + if training_args.do_eval: + eval_dataset = dev_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_predict: + test_dataset = test_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + # Data collator + data_collator = DataCollator( + tokenizer, padding="max_length", label_pad_token_id=-100, max_length=max_seq_length, return_tensors="pd" + ) + + def compute_metrics(eval_preds): + def _convert(examples): + """Convert to evaluation data format""" + formatted_examples = [] + for example in examples: + formatted_example = {} + 
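+ # Regroup this example's annotations from a list of dicts into parallel qid/question/value lists for ANLS scoring.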
formatted_example["id"] = example["id"] + formatted_example["annotations"] = { + "qid": [], + "question": [], + "value": [], + } + for i in range(len(example["annotations"])): + formatted_example["annotations"]["qid"].append(example["annotations"][i]["qid"]) + formatted_example["annotations"]["question"].append(example["annotations"][i]["question"]) + formatted_example["annotations"]["value"].append(example["annotations"][i]["value"]) + formatted_examples.append(formatted_example) + return formatted_examples + + pred_dict = collections.defaultdict(lambda: collections.defaultdict(list)) + ref_dict = collections.defaultdict(lambda: collections.defaultdict(list)) + + preds = _convert(eval_preds.predictions) + labels = _convert(eval_preds.label_ids) + + for pred in preds: + for key, values in zip(pred["annotations"]["qid"], pred["annotations"]["value"]): + pred_dict[pred["id"]][key].extend(values) + for label in labels: + for key, values in zip(label["annotations"]["qid"], label["annotations"]["value"]): + ref_dict[label["id"]][key].extend(values) + score = anls_score(ref_dict, pred_dict) + return score + + trainer = LayoutTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + post_process_function=postprocess_func, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + + max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_val_samples, len(eval_dataset)) + + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + postprocessor.examples_cache = collections.defaultdict(list) + postprocessor.features_cache = collections.defaultdict(list) + metrics = trainer.predict(test_dataset, test_ds) + trainer.log_metrics("test", metrics) + trainer.save_metrics("test", metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/run_ner.py b/model_zoo/ernie-layout/run_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..fab6f73fd6e882917272b93977333a588cd3a149 --- /dev/null +++ b/model_zoo/ernie-layout/run_ner.py @@ -0,0 +1,214 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +import paddle +from data_collator import DataCollator +from datasets import load_dataset +from finetune_args import DataArguments, ModelArguments +from layout_trainer import LayoutTrainer +from seqeval.metrics import classification_report +from utils import PostProcessor, PreProcessor, get_label_ld + +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + train_ds, dev_ds, test_ds = load_dataset(data_args.dataset_name, split=["train", "validation", "test"]) + + if training_args.do_train: + column_names = train_ds.column_names + elif training_args.do_eval: + column_names = dev_ds.column_names + elif training_args.do_predict: + column_names = test_ds.column_names + else: + logger.info("There is nothing to do. 
Please pass `do_train`, `do_eval` and/or `do_predict`.") + raise NotImplementedError + + label_list, label_to_id = get_label_ld(train_ds["qas"], scheme=data_args.pattern.split("-")[1]) + num_labels = len(label_list) + + # Load Model and Tokenizer + if model_args.model_name_or_path == "vi-layoutxlm-base-uncased": + tokenizer = AutoTokenizer.from_pretrained("layoutxlm-base-uncased") + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_labels) + model.config["has_visual_segment_embedding"] = False + + preprocessor = PreProcessor() + postprocessor = PostProcessor() + training_args.label_names = ["labels"] + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + preprocess_func = partial( + preprocessor.preprocess_ner, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + label_dict=label_to_id, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + scheme=data_args.pattern.split("-")[1], + lang=data_args.lang, + ) + preprocess_func_for_valid = preprocess_func + + postprocess_func = partial( + postprocessor.postprocess_ner, label_list=label_list, tokenizer=tokenizer, lang=data_args.lang + ) + + # Dataset pre-process + if training_args.do_train: + train_dataset = train_ds.map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_eval: + eval_dataset = dev_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_predict: + test_dataset = test_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + # Data collator + data_collator = DataCollator( + tokenizer, padding="max_length", label_pad_token_id=-100, max_length=max_seq_length, return_tensors="pd" + ) + + def compute_metrics(eval_preds): + preds = eval_preds.predictions + labels = eval_preds.label_ids + + report = classification_report(y_true=labels, y_pred=preds, output_dict=True) + + report.pop("macro avg") + report.pop("weighted avg") + overall_score = report.pop("micro avg") + scores = { + type_name: { + "precision": score["precision"], + "recall": score["recall"], + "f1": score["f1-score"], + "number": score["support"], + } + for type_name, score in report.items() + } + scores["overall_precision"] = overall_score["precision"] + scores["overall_recall"] = overall_score["recall"] + scores["overall_f1"] = overall_score["f1-score"] + results = { + "precision": scores["overall_precision"], + "recall": scores["overall_recall"], + "f1": scores["overall_f1"], + } + return results + + trainer = LayoutTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + post_process_function=postprocess_func, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not 
None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + + max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_val_samples, len(eval_dataset)) + + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + postprocessor.examples_cache = collections.defaultdict(list) + postprocessor.features_cache = collections.defaultdict(list) + metrics = trainer.predict(test_dataset, test_ds) + trainer.log_metrics("test", metrics) + trainer.save_metrics("test", metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/utils.py b/model_zoo/ernie-layout/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2a7b32bf4941c3d45ce0a5e1532b687a4156d5e4 --- /dev/null +++ b/model_zoo/ernie-layout/utils.py @@ -0,0 +1,1091 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
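+# Shared utilities for the ERNIE-Layout fine-tuning scripts: image decoding and resizing, NER/MRC/CLS feature preprocessing, prediction postprocessing, and the ANLS metric.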
+ +import base64 +import collections +import hashlib +import random + +import cv2 +import datasets +import editdistance +import numpy as np +import scipy +import six +from PIL import Image +from seqeval.metrics.sequence_labeling import get_entities + +from paddlenlp.trainer import EvalPrediction + + +def _get_md5(string): + """Get md5 value for string""" + hl = hashlib.md5() + hl.update(string.encode(encoding="utf-8")) + return hl.hexdigest() + + +def _decode_image(im_base64): + """Decode image""" + if im_base64 is not None: + image = base64.b64decode(im_base64.encode("utf-8")) + im = np.frombuffer(image, dtype="uint8") + im = cv2.imdecode(im, 1) + im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) + return im + else: + return np.zeros([224, 224, 3], dtype=np.uint8) + + +def _resize_image( + im, + target_size=0, + interp=cv2.INTER_LINEAR, + resize_box=False, +): + """Resize the image numpy.""" + if not isinstance(im, np.ndarray): + raise TypeError("image type is not numpy.") + if len(im.shape) != 3: + raise ValueError("image is not 3-dimensional.") + im_shape = im.shape + im_size_min = np.min(im_shape[0:2]) + if isinstance(target_size, list): + # Case for multi-scale training + selected_size = random.choice(target_size) + else: + selected_size = target_size + if float(im_size_min) == 0: + raise ZeroDivisionError("min size of image is 0") + resize_w = selected_size + resize_h = selected_size + + im = im.astype("uint8") + im = Image.fromarray(im) + im = im.resize((int(resize_w), int(resize_h)), interp) + im = np.array(im) + return im + + +def _scale_same_as_image(boxes, width, height, target_size): + """ + Scale the bounding box of each character within maximum boundary. + """ + scale_x = target_size / width + scale_y = target_size / height + + new_boxes = [ + [ + int(max(0, min(box[0] * scale_x, target_size - 1))), + int(max(0, min(box[1] * scale_y, target_size - 1))), + int(max(0, min(box[2] * scale_x, target_size - 1))), + int(max(0, min(box[3] * scale_y, target_size - 1))), + ] + for box in boxes + ] + return new_boxes, (scale_x, scale_y) + + +def _permute(im, channel_first=True, to_bgr=False): + """Permute""" + if channel_first: + im = np.swapaxes(im, 1, 2) + im = np.swapaxes(im, 1, 0) + if to_bgr: + im = im[[2, 1, 0], :, :] + return im + + +def _str2im( + im_base64, + target_size=224, +): + # Step1: decode image + origin_im = _decode_image(im_base64) + # Step2: resize image + im = _resize_image(origin_im, target_size=target_size, interp=1, resize_box=False) + return im, origin_im + + +def get_label_ld(qas, scheme="bio"): + if scheme == "cls": + unique_labels = set() + for qa in qas: + label_text = qa["answers"][0]["text"][0] + unique_labels.add(label_text) + + label_list = list(unique_labels) + label_list.sort() + else: + unique_keys = set() + for qa in qas: + for key in qa["question"]: + unique_keys.add(key) + key_list = list(unique_keys) + key_list.sort() + + label_list = ["O"] + for key in key_list: + if scheme == "bio": + label_list.append("B-" + key) + label_list.append("I-" + key) + elif scheme == "bioes": + label_list.append("B-" + key) + label_list.append("I-" + key) + label_list.append("E-" + key) + label_list.append("S-" + key) + else: + raise NotImplementedError + + label_dict = {l: i for i, l in enumerate(label_list)} + return label_list, label_dict + + +def anls_score(labels, predictions): + def get_anls(prediction, ground_truth): + prediction = prediction.strip().lower() + ground_truth = ground_truth.strip().lower() + iou = 1 - editdistance.eval(prediction, ground_truth) / 
max(len(prediction), len(ground_truth), 1e-5) + anls = iou if iou >= 0.5 else 0.0 + return anls + + def metric_max_over_ground_truths(metric_fn, prediction, ground_truths): + if len(ground_truths) == 0: + return 0 + scores_for_ground_truths = [] + for ground_truth in ground_truths: + score = metric_fn(prediction, ground_truth) + scores_for_ground_truths.append(score) + return max(scores_for_ground_truths) + + anls, total = 0, 0 + assert labels.keys() == predictions.keys() + for _id in labels.keys(): + assert labels[_id].keys() == predictions[_id].keys() + for question in labels[_id]: + if len(predictions[_id][question]) > 0: + prediction_text = predictions[_id][question][0] + else: + prediction_text = "" + ground_truths = labels[_id][question] + total += 1 + anls += metric_max_over_ground_truths(get_anls, prediction_text, ground_truths) + + anls = 100.0 * anls / total + return {"anls": anls} + + +class PreProcessor: + def __init__(self): + pass + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span["start"] + doc_span["length"] - 1 + if position < doc_span["start"]: + continue + if position > end: + continue + num_left_context = position - doc_span["start"] + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span["length"] + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + return cur_span_index == best_span_index + + def preprocess_ner( + self, + examples, + tokenizer=None, + label_dict=None, + max_seq_length=512, + doc_stride=128, + target_size=1000, + max_size=1000, + other_label="O", + ignore_label_id=-100, + use_segment_box=False, + preprocessing_num_workers=1, + scheme="bio", + lang="en", + ): + """ + Adapt to NER task. 
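+ Converts character-level answer spans into token-level BIO/BIOES labels, splits long pages into overlapping windows of at most `max_seq_length` tokens using `doc_stride`, and attaches the resized page image and scaled token bounding boxes to every window.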
+ """ + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + all_doc_token_labels = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + if use_segment_box: + bboxes = examples["segment_bbox"][example_idx] + else: + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + qas = examples["qas"][example_idx] + orig_labels = [other_label] * len(example_text) + for question, answers in zip(qas["question"], qas["answers"]): + for answer_start, answer_end in zip( + answers["answer_start"], + answers["answer_end"], + ): + if scheme == "bio": + orig_labels[answer_start] = "B-" + question + orig_labels[answer_start + 1 : answer_end] = ["I-" + question] * ( + answer_end - answer_start - 1 + ) + elif scheme == "bioes": + orig_labels[answer_start] = "B-" + question + if answer_end - answer_start - 1 > 1: + orig_labels[answer_end - 1] = "E-" + question + orig_labels[answer_start + 1 : answer_end - 1] = ["I-" + question] * ( + answer_end - answer_start - 2 + ) + else: + orig_labels[answer_start] = "S-" + question + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if lang == "ch": + sub_tokens = tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = tokenizer.tokenize(token) + label = orig_labels[i] + box = bboxes[i] + for j, sub_token in enumerate(sub_tokens): + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + if "B-" in label[:2]: + if j == 0: + all_doc_token_labels.append(label) + else: + all_doc_token_labels.append("I-" + label[2:]) + elif "E-" in label[:2]: + if len(sub_tokens) - 1 == j: + all_doc_token_labels.append("E-" + label[2:]) + else: + all_doc_token_labels.append("I-" + label[2:]) + elif "S-" in label[:2]: + if len(sub_tokens) == 1: + all_doc_token_labels.append(label) + else: + if j == 0: + all_doc_token_labels.append("B-" + label[2:]) + elif len(sub_tokens) - 1 == j: + all_doc_token_labels.append("E-" + label[2:]) + else: + all_doc_token_labels.append("I-" + label[2:]) + else: + all_doc_token_labels.append(label) + + max_tokens_for_doc = max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_label_ids = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + tokens.append(tokenizer.cls_token) + token_boxes.append(cls_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = 
is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + token_label_ids.append(label_dict[all_doc_token_labels[split_token_index]]) + sentence_ids.append(0) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.sep_token) + token_boxes.append(sep_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + input_mask = [1] * len(tokens) + + while len(tokens) < max_seq_length: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + token_label_ids.append(ignore_label_id) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(token_boxes) == max_seq_length + assert len(sentence_ids) == max_seq_length + assert len(token_label_ids) == max_seq_length + + feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx]) + tokenized_examples["id"].append(feature_id) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + # tokenized_examples["orig_image"].append(origin_image) + tokenized_examples["labels"].append(token_label_ids) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + return tokenized_examples + + def _improve_answer_span(self, doc_tokens, input_start, input_end, tokenizer, orig_answer_text): + """Returns tokenized answer spans that better match the annotated answer.""" + + tok_answer_text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(orig_answer_text)) + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = tokenizer.convert_tokens_to_string(doc_tokens[new_start : (new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + def preprocess_mrc( + self, + examples, + tokenizer=None, + max_seq_length=512, + doc_stride=128, + max_query_length=64, + target_size=1000, + max_size=1000, + use_segment_box=False, + preprocessing_num_workers=1, + is_training=False, + lang="en", + ): + """ + Adapt to MRC task. 
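+ Builds extractive question-answering features: each window holds the document tokens first and appends the question tokens after the padding; during training, character-level answer spans are remapped to token-level start/end positions inside the window (both set to 0 when the answer falls outside it).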
+ """ + + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + query_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + if use_segment_box: + bboxes = examples["segment_bbox"][example_idx] + else: + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if lang == "ch": + sub_tokens = tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = tokenizer.tokenize(token) + box = bboxes[i] + for j, sub_token in enumerate(sub_tokens): + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + qas = examples["qas"][example_idx] + for qid, question, answers in zip(qas["question_id"], qas["question"], qas["answers"]): + + query_tokens = tokenizer.tokenize( + question, add_special_tokens=False, truncation=False, max_length=max_query_length + ) + + start_offset = 0 + doc_spans = [] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + seg_a = 0 + seg_b = 1 + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(seg_a) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(seg_a) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(seg_a) + input_mask = [1] * len(tokens) + + while len(tokens) < max_seq_length - len(query_tokens) - 1: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(seg_b) + token_boxes.append(pad_token_box) + + for idx, token in enumerate(query_tokens): + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(token) + input_mask.append(1) + sentence_ids.append(seg_b) + token_boxes.append(query_token_box) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.sep_token) + input_mask.append(1) + 
token_boxes.append(sep_token_box) + sentence_ids.append(seg_b) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(tokens) - len(query_tokens) - 1)) + list( + range(len(query_tokens) + 1) + ) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(token_boxes) == max_seq_length + assert len(sentence_ids) == max_seq_length + + answer_rcd = [] + for answer_text, answer_start, answer_end in zip( + answers["text"], + answers["answer_start"], + answers["answer_end"], + ): + + if is_training and answer_start == -1 and answer_end == -1: + continue + + start_position = -1 + end_position = -1 + + if is_training: + + if [answer_start, answer_end] in answer_rcd: + continue + answer_rcd.append([answer_start, answer_end]) + + tok_start_position = orig_to_tok_index[answer_start] + if answer_end < len(example_text) - 1: + tok_end_position = orig_to_tok_index[answer_end] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = self._improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, tokenizer, answer_text + ) + # If the answer is outside the span, set start_position == end_position == 0 + + # For training, if our document chunk does not contain an annotation + # we throw it out, since there is nothing to predict. + doc_start = doc_span["start"] + doc_end = doc_span["start"] + doc_span["length"] - 1 + if not (tok_start_position >= doc_start and tok_end_position <= doc_end): + start_position = 0 + end_position = 0 + else: + doc_offset = 1 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + start_labels = [0] * len(input_ids) + end_labels = [0] * len(input_ids) + start_labels[start_position] = 1 + end_labels[end_position] = 1 + answer_rcd.append([start_position, end_position]) + + feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx]) + tokenized_examples["id"].append(feature_id) + tokenized_examples["question_id"].append(qid) + tokenized_examples["questions"].append(question) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + tokenized_examples["start_positions"].append(start_position) + tokenized_examples["end_positions"].append(end_position) + tokenized_examples["start_labels"].append(start_labels) + tokenized_examples["end_labels"].append(end_labels) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + + if not is_training: + break + return tokenized_examples + + def preprocess_cls( + self, + examples, + tokenizer=None, + label_dict=None, + max_seq_length=512, + doc_stride=128, + target_size=1000, + max_size=1000, + other_label="O", + ignore_label_id=-100, + use_segment_box=False, + preprocessing_num_workers=1, + ): + """ + Adapt to CLS task. 
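+ Builds document-classification features: every window cut from a page shares the page-level label, taken from the first answer text and mapped through `label_dict`.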
+ """ + + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + if use_segment_box: + bboxes = examples["segment_bbox"][example_idx] + else: + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + qas = examples["qas"][example_idx] + label = label_dict[qas["answers"][0]["text"][0]] + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + box = bboxes[i] + for j, sub_token in enumerate(sub_tokens): + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + max_tokens_for_doc = max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for doc_span in doc_spans: + + tokens = [] + token_boxes = [] + sentence_ids = [] + tokens.append(tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(0) + + tokens.append(tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(0) + input_mask = [1] * len(tokens) + + while len(tokens) < max_seq_length: + tokens.append(tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(token_boxes) == max_seq_length + assert len(sentence_ids) == max_seq_length + + feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx]) + tokenized_examples["id"].append(feature_id) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + # tokenized_examples["orig_image"].append(origin_image) + tokenized_examples["labels"].append(label) + return tokenized_examples + + +class PostProcessor: + def __init__(self): + """init post processor""" + + self.examples_cache = collections.defaultdict(list) + self.features_cache = collections.defaultdict(list) + self._PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + + def get_predictions(self, pred, label_list, with_crf=False): + if not 
with_crf: + pred = scipy.special.softmax(pred, axis=-1) + pred_ids = np.argmax(pred, axis=1) + else: + pred_ids = pred + prediction_score = [pred[idx][i] for idx, i in enumerate(pred_ids)] + predictions = [label_list[i] for i in pred_ids] + return predictions, prediction_score + + def postprocess_ner( + self, + examples: datasets.Dataset, + features: datasets.Dataset, + preds, + labels, + label_list, + tokenizer=None, + with_crf=False, + lang="en", + ): + if "name" not in self.examples_cache: + self.examples_cache["name"] = [item for item in examples["name"]] + if "page_no" not in self.examples_cache: + self.examples_cache["page_no"] = [item for item in examples["page_no"]] + if "text" not in self.examples_cache: + self.examples_cache["text"] = [item for item in examples["text"]] + if "id" not in self.features_cache: + self.features_cache["id"] = [item for item in features["id"]] + if "tokens" not in self.features_cache: + self.features_cache["tokens"] = [item for item in features["tokens"]] + if "token_is_max_context" not in self.features_cache: + self.features_cache["token_is_max_context"] = [item for item in features["token_is_max_context"]] + if "token_to_orig_map" not in self.features_cache: + self.features_cache["token_to_orig_map"] = [item for item in features["token_to_orig_map"]] + separator = "" if lang == "ch" else " " + + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + references = collections.defaultdict(list) + predictions = collections.defaultdict(list) + recover_preds = [] + recover_labels = [] + + for eid, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + "__" + str(self.examples_cache["page_no"][eid]) + features_ids = feature_id_to_features[feature_map] + gather_pred = [] + gather_label = [] + gather_tokens = [] + gather_score = [] + gather_map = [] + for idx in features_ids: + pred, label = preds[idx], labels[idx] + prediction, prediction_score = self.get_predictions(pred, label_list, with_crf=with_crf) + + token_is_max_context = self.features_cache["token_is_max_context"][idx] + token_to_orig_map = self.features_cache["token_to_orig_map"][idx] + for token_idx in range(len(token_is_max_context)): + token_idx += 1 + if token_is_max_context[str(token_idx)]: + gather_tokens.append(self.features_cache["tokens"][idx][token_idx]) + gather_pred.append(prediction[token_idx]) + gather_score.append(prediction_score[token_idx]) + gather_label.append(label[token_idx]) + gather_map.append(token_to_orig_map[str(token_idx)]) + + recover_pred = [p for (p, l) in zip(gather_pred, gather_label) if l != -100] + recover_label = [label_list[l] for l in gather_label if l != -100] + + pred_entities = get_entities(recover_pred) + gt_entities = get_entities(recover_label) + recover_preds.append(recover_pred) + recover_labels.append(recover_label) + + for item in pred_entities: + entity = tokenizer.convert_tokens_to_string(gather_tokens[item[1] : (item[2] + 1)]).strip() + orig_doc_start = gather_map[item[1]] + orig_doc_end = gather_map[item[2]] + orig_tokens = self.examples_cache["text"][eid][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + final_text = self.get_final_text(entity, orig_text, False, tokenizer) + predictions[example_id].append( + [ + item[0], + final_text, + sum(gather_score[item[1] : item[2] + 1]) / (item[2] - item[1] + 1), + [item[1], item[2]], + ", ".join(recover_pred[item[1] : item[2] + 1]), + 
] + ) + + for item in gt_entities: + entity = tokenizer.convert_tokens_to_string(gather_tokens[item[1] : (item[2] + 1)]).strip() + orig_doc_start = gather_map[item[1]] + orig_doc_end = gather_map[item[2]] + orig_tokens = self.examples_cache["text"][eid][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + final_text = self.get_final_text(entity, orig_text, False, tokenizer) + references[example_id].append( + [item[0], final_text, 1, [item[1], item[2]], ", ".join(recover_label[item[1] : item[2] + 1])] + ) + if example_id not in predictions: + predictions[example_id].append(["", "", -1, [], ""]) + + return predictions, references, EvalPrediction(predictions=recover_preds, label_ids=recover_labels) + + def _get_best_indexes(self, logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + def get_final_text(self, pred_text, orig_text, do_lower_case, tokenizer): + """Project the tokenized prediction back to the original text.""" + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + tok_text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. 
+ tok_s_to_ns_map = {} + for (i, tok_index) in six.iteritems(tok_ns_to_s_map): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + return orig_text + + output_text = orig_text[orig_start_position : (orig_end_position + 1)] + return output_text + + def postprocess_mrc( + self, + examples: datasets.Dataset, + features: datasets.Dataset, + preds, + labels, + tokenizer, + max_answer_length=64, + n_best_size=5, + lang="en", + ): + if "name" not in self.examples_cache: + self.examples_cache["name"] = [item for item in examples["name"]] + if "page_no" not in self.examples_cache: + self.examples_cache["page_no"] = [item for item in examples["page_no"]] + if "text" not in self.examples_cache: + self.examples_cache["text"] = [item for item in examples["text"]] + if "qas" not in self.examples_cache: + self.examples_cache["qas"] = [item for item in examples["qas"]] + + if "id" not in self.features_cache: + self.features_cache["id"] = [item for item in features["id"]] + if "tokens" not in self.features_cache: + self.features_cache["tokens"] = [item for item in features["tokens"]] + if "question_id" not in self.features_cache: + self.features_cache["question_id"] = [item for item in features["question_id"]] + if "questions" not in self.features_cache: + self.features_cache["questions"] = [item for item in features["questions"]] + if "token_is_max_context" not in self.features_cache: + self.features_cache["token_is_max_context"] = [item for item in features["token_is_max_context"]] + if "token_to_orig_map" not in self.features_cache: + self.features_cache["token_to_orig_map"] = [item for item in features["token_to_orig_map"]] + + separator = "" if lang == "ch" else " " + + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions, references = collections.defaultdict( + lambda: collections.defaultdict(list) + ), collections.defaultdict(lambda: collections.defaultdict(list)) + for ei, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + "__" + str(self.examples_cache["page_no"][ei]) + features_ids = feature_id_to_features[feature_map] + prelim_predictions = [] + for i, idx in enumerate(features_ids): + + start_logits = preds[0][idx] + end_logits = preds[1][idx] + + start_indexes = self._get_best_indexes(start_logits, n_best_size) + end_indexes = self._get_best_indexes(end_logits, n_best_size) + token_is_max_context = self.features_cache["token_is_max_context"][idx] + + for start_index in start_indexes: + for end_index in end_indexes: + if not token_is_max_context.get(str(start_index), False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + self._PrelimPrediction( + feature_index=idx, + start_index=start_index, + end_index=end_index, + start_logit=start_logits[start_index], + end_logit=end_logits[end_index], + ) + ) + + 
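+ # Rank candidate spans by the sum of their start and end logits; only the best-scoring span per question is kept below.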
prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + for rcd in prelim_predictions: + + question_id = self.features_cache["question_id"][rcd.feature_index] + question = self.features_cache["questions"][rcd.feature_index] + if question_id in predictions[example_id]: + continue + + if rcd.start_index > 0: + tok_tokens = self.features_cache["tokens"][rcd.feature_index][ + rcd.start_index : (rcd.end_index + 1) + ] + orig_doc_start = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.start_index)] + orig_doc_end = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.end_index)] + orig_tokens = self.examples_cache["text"][ei][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + + tok_text = tokenizer.convert_tokens_to_string(tok_tokens).strip() + final_text = self.get_final_text(tok_text, orig_text, False, tokenizer) + else: + continue + if question_id in predictions[example_id]: + predictions[example_id][question_id]["answers"].append(final_text) + else: + predictions[example_id][question_id] = {"question": question, "answers": [final_text]} + + for example_index, example in enumerate(examples): + eid = self.examples_cache["name"][example_index] + qas = self.examples_cache["qas"][example_index] + for question_id, question, answers in zip(qas["question_id"], qas["question"], qas["answers"]): + references[eid][question_id] = { + "question": question, + "answers": [answer_text for answer_text in answers["text"]], + } + if eid not in predictions or question_id not in predictions[eid]: + predictions[eid][question_id] = {"question": question, "answers": [""]} + + formatted_predictions = [ + { + "id": k, + "annotations": [ + {"qid": str(qid), "question": qa["question"], "value": qa["answers"]} for qid, qa in v.items() + ], + } + for k, v in predictions.items() + ] + formated_references = [ + { + "id": k, + "annotations": [ + {"qid": str(qid), "question": qa["question"], "value": qa["answers"]} for qid, qa in v.items() + ], + } + for k, v in references.items() + ] + return ( + predictions, + references, + EvalPrediction(predictions=formatted_predictions, label_ids=formated_references), + ) + + def postprocess_cls( + self, + examples: datasets.Dataset, + features: datasets.Dataset, + preds, + labels, + label_list, + tokenizer=None, + ): + if "name" not in self.examples_cache: + self.examples_cache["name"] = [item for item in examples["name"]] + if "page_no" not in self.examples_cache: + self.examples_cache["page_no"] = [item for item in examples["page_no"]] + if "id" not in self.features_cache: + self.features_cache["id"] = [item for item in features["id"]] + + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + references = {} + predictions = {} + recover_preds = [] + recover_labels = [] + + for eid, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + "__" + str(self.examples_cache["page_no"][eid]) + features_ids = feature_id_to_features[feature_map] + + max_rcd = [0, -1] + for i, idx in enumerate(features_ids): + pred, label = preds[idx], labels[idx] + pred = scipy.special.softmax(pred, axis=-1) + pred_id = int(np.argmax(pred, axis=-1)) + if pred[pred_id] > max_rcd[0]: + max_rcd = [pred[pred_id], pred_id] + + recover_preds.append(max_rcd[1]) + recover_labels.append(label) + predictions[example_id] = label_list[max_rcd[1]] + 
references[example_id] = label_list[label] + return predictions, references, EvalPrediction(predictions=recover_preds, label_ids=recover_labels) diff --git a/model_zoo/ernie-m/README.md b/model_zoo/ernie-m/README.md new file mode 100644 index 0000000000000000000000000000000000000000..634f88c21deb6703bae1be6b4e3eb3f0750f2572 --- /dev/null +++ b/model_zoo/ernie-m/README.md @@ -0,0 +1,237 @@ +# ERNIE-M + +* [模型介绍](#模型介绍) +* [开始运行](#开始运行) + * [环境要求](#环境要求) + * [数据准备](#数据准备) + * [模型训练](#模型训练) + * [参数释义](#参数释义) + * [单卡训练](#单卡训练) + * [单机多卡](#单机多卡) + * [预测评估](#预测评估) + * [部署](#部署) + * [FastDeploy 部署](#FastDeploy部署) + * [Python 部署](#Python部署) + * [服务化部署](#服务化部署) +* [参考论文](#参考论文) + +## 模型介绍 + +[ERNIE-M](https://arxiv.org/abs/2012.15674) 是百度提出的一种多语言语言模型。原文提出了一种新的训练方法,让模型能够将多种语言的表示与单语语料库对齐,以克服平行语料库大小对模型性能的限制。原文的主要想法是将回译机制整合到预训练的流程中,在单语语料库上生成伪平行句对,以便学习不同语言之间的语义对齐,从而增强跨语言模型的语义建模。实验结果表明,ERNIE-M 优于现有的跨语言模型,并在各种跨语言下游任务中提供了最新的 SOTA 结果。 +原文提出两种方法建模各种语言间的对齐关系: + +- **Cross-Attention Masked Language Modeling(CAMLM)**: 该算法在少量双语语料上捕捉语言间的对齐信息。其需要在不利用源句子上下文的情况下,通过目标句子还原被掩盖的词语,使模型初步建模了语言间的对齐关系。 +- **Back-Translation masked language modeling(BTMLM)**: 该方法基于回译机制从单语语料中学习语言间的对齐关系。通过CAMLM 生成伪平行语料,然后让模型学习生成的伪平行句子,使模型可以利用单语语料更好地建模语义对齐关系。 + + +![framework](https://user-images.githubusercontent.com/40912707/201308423-bf4f0100-3ada-4bae-89d5-b07ffec1e2c0.png) + +本项目是 ERNIE-M 的 PaddlePaddle 动态图实现,包含模型训练,模型验证等内容。以下是本例的简要目录结构及说明: + +```text +. +|-- README.md # 文档 +|-- deploy # 部署目录 +| |-- predictor # onnx离线部署 +| | |-- README.md +| | |-- ernie_m_predictor.py +| | |-- inference.py +| | |-- requirements_cpu.txt +| | `-- requirements_gpu.txt +| `-- simple_serving # 基于PaddleNLP SimpleServing 服务化部署 +| |-- README.md +| |-- client_seq_cls.py +| `-- server_seq_cls.py +`-- run_classifier.py # 分类任务微调脚本 +``` + +## 开始运行 + +下面提供以XNLI数据集进行模型微调相关训练、预测、部署的代码,XNLI数据集是MNLI的子集,并且已被翻译成14种不同的语言(包含一些较低资源语言)。与MNLI一样,目标是预测文本蕴含(句子 A 是否暗示/矛盾/都不是句子 B )。 + +### 环境要求 + +python >= 3.7 +paddlepaddle >= 2.3 +paddlenlp >= 2.4.9 +paddle2onnx >= 1.0.5 + +### 数据准备 + +此次微调数据使用XNLI数据集, 可以通过下面的方式来使用数据集 + +```python +from datasets import load_dataset + +# all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] +# load xnli dataset of english +train_ds, eval_ds, test_ds = load_dataset("xnli", "en", split=["train_ds", "validation", "test"]) +``` + +### 模型训练 + +#### 参数释义 + +- `task_type` 表示了自然语言推断任务的类型,目前支持的类型为:"cross-lingual-transfer", "translate-train-all" + ,分别表示在英文数据集上训练并在所有15种语言数据集上测试、在所有15种语言数据集上训练和测试。 +- `model_name_or_path` 指示了 Fine-tuning 使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"ernie-m-base", "ernie-m-large" + 。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./finetuned_models"。 +- `do_train` 是否进行训练任务。 +- `do_eval` 是否进行评估任务。 +- `do_predict` 是否进行评测任务。 +- `do_export` 是否导出模型。 +- `output_dir` 表示模型保存路径。 +- `export_model_dir` 模型的导出路径。 +- `per_device_train_batch_size` 表示训练时每次迭代**每张**卡上的样本数目。 +- `per_device_eval_batch_size` 表示验证时每次迭代**每张**卡上的样本数目。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断,不足该长度的将会进行 padding。 +- `learning_rate` 表示基础学习率大小,将于 learning rate scheduler 产生的值相乘作为当前学习率。 +- `classifier_dropout` 表示模型用于分类的 dropout rate ,默认是0.1。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `layerwise_decay` 表示 AdamW with Layerwise decay 的逐层衰减系数。 +- `warmup_rate` 表示学习率warmup系数。 +- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +- `seed` 表示随机数种子。 +- `device` 表示训练使用的设备, 'gpu'表示使用 GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用 CPU。 +- `fp16` 表示是否启用自动混合精度训练。 +- `scale_loss` 表示自动混合精度训练的参数。 +- 
`load_best_model_at_end` 训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用。 +- `metric_for_best_model` 最优模型指标,如`eval_accuarcy`等,用于比较模型好坏。 + +#### 单卡训练 + +`run_classifier.py`是模型微调脚本,可以使用如下命令对预训练模型进行微调训练。 + +```shell +python run_classifier.py \ + --do_train \ + --do_eval \ + --do_export \ + --task_type cross-lingual-transfer \ + --model_name_or_path ernie-m-base \ + --output_dir ./finetuned_models/ \ + --export_model_dir ./finetuned_models/ \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --max_seq_length 256 \ + --learning_rate 5e-5 \ + --classifier_dropout 0.1 \ + --weight_decay 0.0 \ + --layerwise_decay 0.8 \ + --save_steps 12272 \ + --eval_steps 767 \ + --num_train_epochs 5 \ + --warmup_ratio 0.1 \ + --load_best_model_at_end True \ + --metric_for_best_model eval_accuracy \ + --overwrite_output_dir +``` + +#### 单机多卡 + +同样,可以执行如下命令实现多卡训练 + +```shell +python -m paddle.distributed.launch --gpus 0,1 run_classifier.py \ + --do_train \ + --do_eval \ + --do_export \ + --task_type cross-lingual-transfer \ + --model_name_or_path ernie-m-base \ + --output_dir ./finetuned_models/ \ + --export_model_dir ./finetuned_models/ \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --max_seq_length 256 \ + --learning_rate 5e-5 \ + --classifier_dropout 0.1 \ + --weight_decay 0.0 \ + --layerwise_decay 0.8 \ + --save_steps 12272 \ + --eval_steps 767 \ + --num_train_epochs 5 \ + --warmup_ratio 0.1 \ + --load_best_model_at_end True \ + --metric_for_best_model eval_accuracy \ + --overwrite_output_dir \ + --remove_unused_columns False +``` + +这里设置额外的参数`--remove_unused_columns`为`False`是因为数据集中不需要的字段已经被手动去除了。 + +#### 预测评估 + +当训练完成后,可以直接加载训练保存的模型进行评估,此时`--model_name_or_path`传入训练时的`output_dir`即`./finetuned_models`。 + +```shell +python run_classifier.py \ + --do_predict \ + --task_type cross-lingual-transfer \ + --model_name_or_path ./finetuned_models \ + --output_dir ./finetuned_models +``` + +预测结果(label)和预测的置信度(confidence)将写入`./finetuned_models/test_results.json`文件。 + + +在XNLI数据集上微调 cross-lingual-transfer 类型的自然语言推断任务后,在测试集上有如下结果 +| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | Avg | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| Cross-lingual Transfer | | | | | | | | | | | | | | | | | +| XLM | 85.0 | 78.7 | 78.9 | 77.8 | 76.6 | 77.4 | 75.3 | 72.5 | 73.1 | 76.1 | 73.2 | 76.5 | 69.6 | 68.4 | 67.3 | 75.1 | +| Unicoder | 85.1 | 79.0 | 79.4 | 77.8 | 77.2 | 77.2 | 76.3 | 72.8 | 73.5 | 76.4 | 73.6 | 76.2 | 69.4 | 69.7 | 66.7 | 75.4 | +| XLM-R | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 | 76.2 | +| INFOXLM | **86.4** | **80.6** | 80.8 | 78.9 | 77.8 | 78.9 | 77.6 | 75.6 | 74.0 | 77.0 | 73.7 | 76.7 | 72.0 | 66.4 | 67.1 | 76.2 | +| **ERNIE-M** | 85.5 | 80.1 | **81.2** | **79.2** | **79.1** | **80.4** | **78.1** | **76.8** | **76.3** | **78.3** | **75.8** | **77.4** | **72.9** | **69.5** | **68.8** | **77.3** | +| XLM-R Large | 89.1 | 84.1 | 85.1 | 83.9 | 82.9 | 84.0 | 81.2 | 79.6 | 79.8 | 80.8 | 78.1 | 80.2 | 76.9 | 73.9 | 73.8 | 80.9 | +| INFOXLM Large | **89.7** | 84.5 | 85.5 | 84.1 | 83.4 | 84.2 | 81.3 | 80.9 | 80.4 | 80.8 | 78.9 | 80.9 | 77.9 | 74.8 | 73.7 | 81.4 | +| VECO Large | 88.2 | 79.2 | 83.1 | 82.9 | 81.2 | 84.2 | 82.8 | 76.2 | 80.3 | 74.3 | 77.0 | 78.4 | 71.3 | **80.4** | **79.1** | 79.9 | +| **ERNIR-M Large** | 89.3 | **85.1** | **85.7** | **84.4** | **83.7** | **84.5** | 82.0 | **81.2** | **81.2** | **81.9** | 
**79.2** | **81.0** | **78.6** | 76.2 | 75.4 | **82.0** | +| Translate-Train-All | | | | | | | | | | | | | | | | | +| XLM | 85.0 | 80.8 | 81.3 | 80.3 | 79.1 | 80.9 | 78.3 | 75.6 | 77.6 | 78.5 | 76.0 | 79.5 | 72.9 | 72.8 | 68.5 | 77.8 | +| Unicoder | 85.6 | 81.1 | 82.3 | 80.9 | 79.5 | 81.4 | 79.7 | 76.8 | 78.2 | 77.9 | 77.1 | 80.5 | 73.4 | 73.8 | 69.6 | 78.5 | +| XLM-R | 85.4 | 81.4 | 82.2 | 80.3 | 80.4 | 81.3 | 79.7 | 78.6 | 77.3 | 79.7 | 77.9 | 80.2 | 76.1 | 73.1 | 73.0 | 79.1 | +| INFOXLM | 86.1 | 82.0 | 82.8 | 81.8 | 80.9 | 82.0 | 80.2 | 79.0 | 78.8 | 80.5 | 78.3 | 80.5 | 77.4 | 73.0 | 71.6 | 79.7 | +| **ERNIE-M** | **86.2** | **82.5** | **83.8** | **82.6** | **82.4** | **83.4** | **80.2** | **80.6** | **80.5** | **81.1** | **79.2** | **80.5** | **77.7** | **75.0** | **73.3** | **80.6** | +| XLM-R Large | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | **83.7** | **81.6** | 78.0 | 78.1 | 83.6 | +| VECO Large | 88.9 | 82.4 | 86.0 | 84.7 | 85.3 | 86.2 | **85.8** | 80.1 | 83.0 | 77.2 | 80.9 | 82.8 | 75.3 | **83.1** | **83.0** | 83.0 | +| **ERNIE-M Large** | **89.5** | **86.5** | **86.9** | **86.1** | **86.0** | **86.8** | 84.1 | **83.8** | **84.1** | **84.5** | **82.1** | 83.5 | 81.1 | 79.4 | 77.9 | **84.2** | + + + +## 部署 + +我们基于 FastDeploy 为 ERNIE-M 提供了多种部署方案,可以满足不同场景下的部署需求,请根据实际情况进行选择。 + + + +### FastDeploy 部署 + +⚡️[FastDeploy](https://github.com/PaddlePaddle/FastDeploy)是一款全场景、易用灵活、极致高效的AI推理部署工具,为开发者提供多硬件、多推理引擎后端的部署能力。开发者只需调用一行代码即可随意切换硬件、推理引擎后端。 + +
+ + + +
+ +目前 ERNIE-M 模型已提供基于 FastDeploy 的部署示例,支持在多款硬件(CPU、GPU、昆仑芯、华为昇腾以及 Graphcore IPU)以及推理引擎后端进行部署。 + + + +#### Python 部署 + +Python 部署请参考:[Python 部署指南](./deploy/python/README.md) + + + +### 服务化部署 + +* [PaddleNLP SimpleServing 服务化部署指南](./deploy/simple_serving/README.md) + + +## 参考论文 + + [Ouyang X , Wang S , Pang C , et al. ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora[J]. 2020.](https://arxiv.org/abs/2012.15674) diff --git a/model_zoo/ernie-m/deploy/python/README.md b/model_zoo/ernie-m/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dfc55dcdcd39d081c62113b496468e3db306ba3a --- /dev/null +++ b/model_zoo/ernie-m/deploy/python/README.md @@ -0,0 +1,144 @@ +# FastDeploy ERNIE-M 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下分别提供 `seq_cls_infer.py` 快速完成在 CPU/GPU 的文本分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash + +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html + +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE-M 模型在 XNLI 数据集上进行自然语言推断任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE-M 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-m/finetuned_models/export`(用户可按实际情况设置)。 + + +```bash + +# CPU 推理 +python seq_cls_infer.py --model_dir ../../finetuned_models/export/ --device cpu --backend paddle + +# GPU 推理 +python seq_cls_infer.py --model_dir ../../finetuned_models/export/model --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[2023-02-24 08:54:42,574] [ INFO] - We are using to load 'export/'. +[INFO] fastdeploy/runtime/runtime.cc(309)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id:0, example id:0, sentence1:"他们告诉我,呃,我最后会被叫到一个人那里去见面。", sentence2:"我从来没有被告知任何与任何人会面。", label:contradiction, similarity:0.9975 +Batch id:1, example id:0, sentence1:"他们告诉我,呃,我最后会被叫到一个人那里去见面。", sentence2:"我被告知将有一个人被叫进来与我见面。", label:entailment, similarity:0.9866 +Batch id:2, example id:0, sentence1:"他们告诉我,呃,我最后会被叫到一个人那里去见面。", sentence2:"那个人来得有点晚。", label:neutral, similarity:0.9921 + +``` + +## 参数说明 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--device_id | 运行设备的id。默认为0。 | +|--cpu_threads | 当使用cpu推理时,指定推理的cpu线程数,默认为1。| +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--use_fast| 是否使用FastTokenizer加速分词阶段。默认为True| + +## FastDeploy 高阶用法 + +FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE-M 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE-M 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|---|---|---|---|---|---|
+| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
+| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ✅ | N/A |
+| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
+| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
+| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
+| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
+| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ✅ | ✅ |
+| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
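+
+下面给出一段最小的 Python 示意代码(非本目录的官方示例,模型路径仅为占位,请替换为实际导出的模型文件),演示如何按照上表组合 `RuntimeOption` 的硬件与推理引擎后端接口:
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+# 占位路径:请替换为实际导出的 .pdmodel / .pdiparams 文件
+option.set_model_path("model.pdmodel", "model.pdiparams")
+option.use_gpu(0)                  # 选择硬件,也可改为 option.use_cpu()、option.use_kunlunxin() 等
+option.use_paddle_infer_backend()  # 选择推理引擎后端,也可改为 use_ort_backend()、use_openvino_backend()、use_trt_backend()
+runtime = fd.Runtime(option)       # 创建 Runtime 后,即可调用 runtime.infer(...) 完成推理
+```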
diff --git a/model_zoo/ernie-m/deploy/python/seq_cls_infer.py b/model_zoo/ernie-m/deploy/python/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..cffbbc253b48ed9298d5110e89eb7fbbaf785054 --- /dev/null +++ b/model_zoo/ernie-m/deploy/python/seq_cls_infer.py @@ -0,0 +1,150 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + option.use_gpu(args.device_id) + if args.backend == "paddle": + 
option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + def preprocess(self, text, text_pair): + data = self.tokenizer(text, text_pair, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs.max(axis=-1)} + return out_dict + + def predict(self, texts, texts_pair=None): + input_map = self.preprocess(texts, texts_pair) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + text = ["他们告诉我,呃,我最后会被叫到一个人那里去见面。"] * 3 + text_pair = ["我从来没有被告知任何与任何人会面。", "我被告知将有一个人被叫进来与我见面。", "那个人来得有点晚。"] + batch_texts = batchfy_text(text, args.batch_size) + batch_texts_pair = batchfy_text(text_pair, args.batch_size) + label_list = ["entailment", "neutral", "contradiction"] + + for bs, (texts, texts_pair) in enumerate(zip(batch_texts, batch_texts_pair)): + outputs = predictor.predict(texts, texts_pair) + for i, (sentence1, sentence2) in enumerate(zip(texts, texts_pair)): + print( + f'Batch id:{bs}, example id:{i}, sentence1:"{sentence1}", sentence2:"{sentence2}", ' + f"label:{label_list[outputs['label'][i]]}, confidence:{outputs['confidence'][i]:.4f}" + ) diff --git a/model_zoo/ernie-m/deploy/simple_serving/README.md b/model_zoo/ernie-m/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..30da3c4f796ad41af7ffe2b536aa397598ff280f --- /dev/null +++ b/model_zoo/ernie-m/deploy/simple_serving/README.md @@ -0,0 +1,37 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 + +paddlenlp >= 2.5.0 + +## Server服务启动 +### 文本分类任务启动 +#### 启动文本分类 Server 服务 +```bash +paddlenlp server server_seq_cls:app --host 0.0.0.0 --port 8189 +``` + +#### 分类任务发送服务 +```bash +python client_seq_cls.py --language zh +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size + } + } +``` diff --git a/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py b/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..5fc1de30fa040839444ea9d8bcfbb2cee696b1bd --- /dev/null 
+++ b/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py @@ -0,0 +1,43 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests +from datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--language", required=True, type=str, help="The language for the simple seving") +parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/ernie_m_cls" +headers = {"Content-Type": "application/json"} + + +if __name__ == "__main__": + examples = load_dataset("xnli", args.language, split="validation")[:10] + texts = [text for text in examples["premise"]] + text_pairs = [text for text in examples["hypothesis"]] + + data = { + "data": {"text": texts, "text_pair": text_pairs}, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-m/deploy/simple_serving/server_seq_cls.py b/model_zoo/ernie-m/deploy/simple_serving/server_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..b38202ece3d1550f0956a8623fa42cc6dc730023 --- /dev/null +++ b/model_zoo/ernie-m/deploy/simple_serving/server_seq_cls.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import ERNIEMHandler, MultiClassificationPostHandler + +app = SimpleServer() +app.register( + "models/ernie_m_cls", + model_path="../../finetuned_models/export", + tokenizer_name="ernie-m-base", + model_handler=ERNIEMHandler, + post_handler=MultiClassificationPostHandler, +) diff --git a/model_zoo/ernie-m/run_classifier.py b/model_zoo/ernie-m/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..0d1886c6dd6f6db588ae027bad7852f28410d834 --- /dev/null +++ b/model_zoo/ernie-m/run_classifier.py @@ -0,0 +1,322 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import random +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import Dataset +from paddle.metric import Accuracy + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + ErnieMForSequenceClassification, +) +from paddlenlp.utils.log import logger + +all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] +task_type_list = ["cross-lingual-transfer", "translate-train-all"] + + +@dataclass +class ModelArguments: + task_type: str = field( + default=None, + metadata={"help": "The type of the task to finetune selected in the list: " + ", ".join(task_type_list)}, + ) + model_name_or_path: str = field( + default=None, + metadata={ + "help": "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join(list(ErnieMForSequenceClassification.pretrained_init_configuration.keys())) + }, + ) + max_seq_length: Optional[int] = field( + default=256, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + classifier_dropout: Optional[float] = field(default=0.1, metadata={"help": "Dropout rate."}) + layerwise_decay: Optional[float] = field(default=0.8, metadata={"help": "Layerwise decay ratio."}) + export_model_dir: Optional[str] = field( + default="./best_models", + metadata={"help": "Path to directory to store the exported inference model."}, + ) + use_test_data: Optional[bool] = field( + default=False, metadata={"help": "Whether to use a tiny dataset for CI test."} + ) + test_data_path: Optional[str] = field(default=None, metadata={"help": "Path to tiny dataset."}) + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(seed + paddle.distributed.get_rank())` + paddle.seed(seed) + + +def convert_example(example, tokenizer, max_seq_length=256): + """convert a example into necessary features""" + # Convert raw text to feature + tokenized_example = tokenizer( + example["premise"], + text_pair=example["hypothesis"], + max_length=max_seq_length, + padding=False, + truncation=True, + return_position_ids=True, + return_attention_mask=True, + return_token_type_ids=False, + ) + return tokenized_example + + +def load_xnli_dataset(args, path, lang, split=None): + """load dataset for specificed language""" + if args.use_test_data: + if args.test_data_path is None: + raise ValueError("Should specified `test_data_path` for test datasets when `use_test_data` is True.") + data_files = { + "train": args.test_data_path, + "validation": args.test_data_path, + "test": args.test_data_path, + } + return load_dataset("json", data_files=data_files, split=split) + else: + return load_dataset(path, lang, split=split) + + +class XnliDataset(Dataset): + """ + Make all languages datasets be loaded in lazy mode. + """ + + def __init__(self, datasets): + self.datasets = datasets + # Ar language has 2000 empty data. + self.num_samples = [len(i) for i in datasets] + self.cumsum_len = np.cumsum(self.num_samples) + + def __getitem__(self, idx): + language_idx = np.argmax(self.cumsum_len > idx) + last = language_idx - 1 if language_idx > 0 else language_idx + sample_idx = idx - self.cumsum_len[last] if idx >= self.cumsum_len[last] else idx + return self.datasets[int(language_idx)][int(sample_idx)] + + def __len__(self): + return self.cumsum_len[-1] + + +def do_train(): + training_args, model_args = PdArgumentParser([TrainingArguments, ModelArguments]).parse_args_into_dataclasses() + training_args: TrainingArguments = training_args + model_args: ModelArguments = model_args + + training_args.print_config(model_args, "Model") + + paddle.set_device(training_args.device) + + set_seed(training_args.seed) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + # Dataset pre-process + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=model_args.max_seq_length) + remove_columns = ["premise", "hypothesis"] + + def collect_all_languages_dataset(split): + all_ds = [] + for language in all_languages: + ds = load_xnli_dataset(model_args, "xnli", language, split=split) + all_ds.append(ds.map(trans_func, batched=True, remove_columns=remove_columns)) + return XnliDataset(all_ds) + + if model_args.task_type == "cross-lingual-transfer": + raw_datasets = load_xnli_dataset(model_args, "xnli", "en") + if training_args.do_train: + train_ds = raw_datasets["train"].map(trans_func, batched=True, remove_columns=remove_columns) + if training_args.do_eval: + eval_ds = raw_datasets["validation"].map(trans_func, batched=True, remove_columns=remove_columns) + if training_args.do_predict: + test_ds = raw_datasets["test"].map(trans_func, batched=True, remove_columns=remove_columns) + elif model_args.task_type == "translate-train-all": + if training_args.do_train: + train_ds = collect_all_languages_dataset("train") + if training_args.do_eval: + eval_ds = collect_all_languages_dataset("validation") + if training_args.do_predict: + test_ds = collect_all_languages_dataset("test") + else: + raise ValueError( + f"task_type should be 'cross-lingual-transfer' or 'translate-train-all' but '{model_args.task_type}' is specificed." + ) + + data_collator = DataCollatorWithPadding(tokenizer) + + num_labels = 3 + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, num_labels=num_labels, classifier_dropout=model_args.classifier_dropout + ) + + # Define the metrics of tasks. + def compute_metrics(p): + # Define the metrics of tasks. + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=eval_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + # optimizers=[optimizer, lr_scheduler], + ) + + def using_layerwise_lr_decay(layerwise_decay, model, training_args): + """ + Generate parameter names needed to perform weight decay. + All bias and LayerNorm parameters are excluded. + """ + # params_list = [{"params": param, "learning_rate": lr * decay_ratio}, ... 
] + params_list = [] + n_layers = model.config.num_hidden_layers + for name, param in model.named_parameters(): + ratio = 1.0 + param_to_train = {"params": param, "dygraph_key_name": name} + if any(nd in name for nd in ["bias", "norm"]): + param_to_train["weight_decay"] = 0.0 + else: + param_to_train["weight_decay"] = training_args.weight_decay + + if "encoder.layers" in name: + idx = name.find("encoder.layers.") + layer = int(name[idx:].split(".")[2]) + ratio = layerwise_decay ** (n_layers - layer) + elif "embedding" in name: + ratio = layerwise_decay ** (n_layers + 1) + + param_to_train["learning_rate"] = ratio + + params_list.append(param_to_train) + return params_list + + params_to_train = using_layerwise_lr_decay(model_args.layerwise_decay, model, training_args) + + trainer.set_optimizer_grouped_parameters(params_to_train) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluating + if training_args.do_eval: + combined = {} + for language in all_languages: + eval_ds = load_xnli_dataset(model_args, "xnli", language, split="validation") + eval_ds = eval_ds.map(trans_func, batched=True, remove_columns=remove_columns, load_from_cache_file=True) + metrics = trainer.evaluate(eval_dataset=eval_ds) + metrics = {k + f"_{language}": v for k, v in metrics.items()} + combined.update({f"eval_accuracy_{language}": metrics.get(f"eval_accuracy_{language}", 0.0)}) + trainer.log_metrics("eval", metrics) + + combined.update({"eval_accuracy_average": np.mean(list(combined.values()))}) + trainer.log_metrics("eval", combined) + trainer.save_metrics("eval", combined) + + # Predicting + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + logits = test_ret.predictions + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1).tolist(), "confidence": probs.max(axis=-1).tolist()} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w") + json.dump(out_dict, out_file) + + # Export inference model + if training_args.do_export and paddle.distributed.get_rank() == 0: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + model_to_save = trainer.model + model_to_save = model_to_save._layers if isinstance(model_to_save, paddle.DataParallel) else model_to_save + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + model_args.export_model_dir = os.path.join(model_args.export_model_dir, "export") + paddlenlp.transformers.export_model( + model=model_to_save, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-tiny/README.md b/model_zoo/ernie-tiny/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c0107eea0ca8950822d142b3743f4b526a46d9c0 --- /dev/null +++ b/model_zoo/ernie-tiny/README.md @@ -0,0 +1,654 @@ +# ERNIE 3.0 
Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization + + **目录** + * [ERNIE 3.0 Tiny 介绍](#模型介绍) + * [预训练模型效果](#模型效果) + * [代码结构](#代码结构) + * [开始运行](#开始运行) + * [任务介绍](#任务介绍) + * [环境要求](#环境要求) + * [数据准备](#数据准备) + * [模型训练](#模型训练) + * [模型评估](#模型评估) + * [端上模型压缩方案🔥](#模型压缩) + * [压缩效果](#压缩效果) + * [⚡️ FastDeploy 部署](#FastDeploy部署) + * [性能结论](#压缩结论) + * [参考文献](#参考文献) + +本项目开源了 **ERNIE 3.0 Tiny** 预训练模型及 **端上语义理解压缩方案**。 + +- **ERNIE 3.0 Tiny** 百度 ERNIE 使用 ERNIE-Tiny 系列的知识蒸馏技术,将 ERNIE 3.0 Titan 大模型的能力传递给小模型,产出并开源了易于部署的 ERNIE 3.0 Tiny 系列预训练模型,刷新了中文小模型的 SOTA 成绩。在这些较少参数量的 ERNIE 3.0 Tiny 系列模型中,有一部分可以直接部署在 CPU 上。 + +- **端上语义理解压缩方案** 在语义理解任务中使用 ERNIE 3.0 Tiny 微调的基础上,我们建议进一步使用包含模型裁剪、量化训练、Embedding 量化等策略的压缩方案,在保持模型精度不降的情况下,可将模型体积减小为原来的 7.8%,达到 5.4 MB,内存占用也随之大幅减小。再经过 [⚡️FastDeploy](https://github.com/PaddlePaddle/FastDeploy) 部署工具和 [FastTokenizer](../../fast_tokenizer/README.md) 对分词阶段的加速,**端到端推理性能**也有显著提升,从而将 ERNIE 3.0 Tiny 模型成功部署至 **📱端侧**。由于端侧部署对内存占用的要求比服务端更高,因此该方案也同样适用于 🖥服务端部署。 + + + +## ERNIE 3.0 Tiny 介绍 + +百度 ERNIE 团队在 2021 年底发布了百亿级别大模型 ERNIE 3.0 和千亿级别的大模型 ERNIE 3.0 Titan。为了让大模型的能力能够真正在一线业务发挥威力,ERNIE 团队推出了 ERNIE-Tiny 系列的知识蒸馏技术,通过任务无关蒸馏的方法,产出了多个轻量级模型 ERNIE 3.0 Tiny,刷新了中文小模型的成绩,并使这些模型能够直接在 CPU 上进行预测,大大拓展了 ERNIE 模型的使用场景。 + +2023 年初,ERNIE 团队进一步开源了 ERNIE 3.0 Tiny 模型的 v2 版本,使教师模型预先**注入下游知识**并参与 **多任务训练**,大大提高了小模型在下游任务上的效果。ERNIE 3.0 Tiny v2 模型在 in-domain、out-domain、low-resource 的下游任务上比 v1 有了进一步的提升,并且 v2 还开源了 3L128H 结构的模型。 + +### 在线蒸馏技术 + +在线蒸馏技术在模型学习的过程中周期性地将知识信号传递给若干个学生模型同时训练,从而在蒸馏阶段一次性产出多种尺寸的学生模型。相对传统蒸馏技术,该技术极大节省了因大模型额外蒸馏计算以及多个学生的重复知识传递带来的算力消耗。 + +这种新颖的蒸馏方式利用了文心大模型的规模优势,在蒸馏完成后保证了学生模型的效果和尺寸丰富性,方便不同性能需求的应用场景使用。此外,由于文心大模型的模型尺寸与学生模型差距巨大,模型蒸馏难度极大甚至容易失效。为此,通过引入了助教模型进行蒸馏的技术,利用助教作为知识传递的桥梁以缩短学生模型和大模型表达空间相距过大的问题,从而促进蒸馏效率的提升。 + +

+ image +

+ +
+ +### 注入下游知识 +ERNIE 3.0 Tiny v1 通过在线蒸馏技术将预训练大模型压缩成预训练小模型,然而由于小模型在微调之前没有接触到下游任务的相关知识,导致效果和大模型仍然存在差距。因此 ERNIE 团队进一步提出 **ERNIE 3.0 Tiny v2**,通过微调教师模型,让教师模型学习到下游任务的相关知识,进而能够在蒸馏的过程中传导给学生模型。尽管学生模型完全没有见过下游数据,通过预先注入下游知识到教师模型,蒸馏得到的学生模型也能够获取到下游任务的相关知识,进而使下游任务上的效果得到提升。 + +### 多任务学习提升泛化性 +多任务学习已经被证明对增强模型泛化性有显著的效果,例如 MT-DNN、MUPPET、FLAN 等。通过对教师模型加入多下游任务微调,不但能够对教师模型注入下游知识、提高教师模型的泛化性,并且能够通过蒸馏传给学生模型,大幅度提升小模型的泛化性。具体地,我们对教师模型进行了 28 个任务的多任务微调。 + +

+ image +

+
+ +因此,ERNIE 3.0 Tiny v2 相比 ERNIE 3.0 Tiny v1 在 in-domain、out-domain、low-resource 数据上都能获得显著的提升。 + + + +## 预训练模型效果 + +本项目开源 **ERNIE 3.0 Tiny _Base_** 、**ERNIE 3.0 Tiny _Medium_** 、 **ERNIE 3.0 Tiny _Mini_** 、 **ERNIE 3.0 Tiny _Micro_** 、 **ERNIE 3.0 Tiny _Nano_**、**ERNIE 3.0 Tiny _Pico_** 六种结构的中文模型: + +- **ERNIE 3.0-Tiny-_Base_**-zh (_12-layer, 768-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Medium_**-zh(_6-layer, 768-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Mini_**-zh (_6-layer, 384-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Micro_**-zh (_4-layer, 384-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Nano_**-zh (_4-layer, 312-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Pico_**-zh (_3-layer, 128-hidden, 2-heads_) + +其中,v2 版本开源了 6 种结构的模型,v1 版本开源了前 5 种结构的模型。 + +ERNIE 3.0 Tiny 模型可以用于文本分类、文本推理、实体抽取、问答等各种 NLU 任务中。下表是 ERNIE 3.0 Tiny 模型在 in-domain、out-domain 和 low-resource 三类数据集上的效果。其中 CLUE 指标可以通过 [PaddleNLP CLUE Benchmark](../../examples/benchmark/clue) 复现。 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Arch | Model | In-domain avg. | afqmc | tnews | iflytek | cmnli | ocnli | cluewsc2020 | csl | Out-domain avg. | CANLI | shopping_10 | Low-resource avg. | bustm_few | eprtmt_few | csldcp_few |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| 12L768H | ERNIE 3.0 Tiny-Base-v1-zh | 75.38 | 75.93 | 58.26 | 61.56 | 83.02 | 80.10 | 86.18 | 82.63 | 97.29 | 99.31 | 95.26 | 75.81 | 76.09 | 89.06 | 62.29 |
+| 12L768H | ERNIE 3.0 Tiny-Base-v2-zh | 75.93 | 77.43 | 59.11 | 61.49 | 84.56 | 81.86 | 84.54 | 82.50 | 97.30 | 99.22 | 95.38 | 79.00 | 82.50 | 89.84 | 64.65 |
+| 6L768H | ERNIE 3.0 Tiny-Medium-v1-zh | 72.78 | 73.37 | 57.00 | 60.67 | 80.64 | 76.88 | 79.28 | 81.60 | 96.99 | 99.16 | 94.82 | 72.16 | 69.06 | 85.94 | 61.48 |
+| 6L768H | ERNIE 3.0 Tiny-Medium-v2-zh | 74.25 | 75.88 | 57.86 | 61.64 | 82.89 | 80.27 | 79.93 | 81.27 | 97.22 | 99.19 | 95.24 | 78.64 | 81.41 | 90.94 | 63.58 |
+| 6L384H | ERNIE 3.0 Tiny-Mini-v1-zh | 68.88 | 71.85 | 55.24 | 54.48 | 77.19 | 73.08 | 71.05 | 79.30 | 96.27 | 98.44 | 94.10 | 66.79 | 67.34 | 82.97 | 50.07 |
+| 6L384H | ERNIE 3.0 Tiny-Mini-v2-zh | 70.49 | 74.40 | 56.20 | 55.79 | 80.17 | 76.75 | 72.37 | 77.77 | 96.69 | 98.69 | 94.68 | 72.46 | 73.75 | 88.12 | 55.50 |
+| 4L384H | ERNIE 3.0 Tiny-Micro-v1-zh | 67.26 | 71.15 | 55.05 | 53.83 | 74.81 | 70.41 | 69.08 | 76.50 | 95.76 | 97.69 | 93.83 | 65.71 | 66.25 | 83.75 | 47.12 |
+| 4L384H | ERNIE 3.0 Tiny-Micro-v2-zh | 67.98 | 72.52 | 55.45 | 54.33 | 77.81 | 74.85 | 66.45 | 74.43 | 96.47 | 98.41 | 94.52 | 69.65 | 72.50 | 84.53 | 51.93 |
+| 4L312H | ERNIE 3.0 Tiny-Nano-v1-zh | 66.24 | 70.51 | 54.57 | 48.36 | 74.97 | 70.61 | 68.75 | 75.93 | 71.16 | 51.87 | 91.35 | 53.80 | 58.59 | 81.41 | 21.40 |
+| 4L312H | ERNIE 3.0 Tiny-Nano-v2-zh | 67.77 | 72.75 | 55.38 | 48.90 | 78.01 | 74.54 | 68.42 | 76.37 | 96.34 | 98.19 | 94.48 | 68.16 | 72.34 | 87.03 | 45.10 |
+| 3L128H2A | ERNIE 3.0 Tiny-Pico-v2-zh | 57.81 | 69.35 | 52.50 | 21.05 | 65.65 | 64.03 | 63.49 | 68.60 | 74.13 | 54.97 | 93.29 | 51.25 | 62.34 | 79.84 | 11.58 |
+ +ERNIE 3.0 Tiny v2 多任务学习、在线蒸馏方案效果显著,刷新了中文小模型的 SOTA 成绩。具体对比数据见如下模型 **精度-时延** 图,横坐标表示在 Arm CPU(高通 865 芯片)上,基于 Arm v8 arch 测试(batch_size=1, seq_len=32)的推理时延(Latency,单位毫秒),纵坐标是 CLUE 10 个任务上的平均精度(包含文本分类、文本匹配、自然语言推理、代词消歧、阅读理解等任务),其中 CMRC2018 阅读理解任务的评价指标是 Exact Match(EM),其它任务的评价指标均是 Accuracy。模型名下方标注了模型的参数量。 + +

+ image +

+ + +图中越靠左上方的模型,精度和性能水平越高。可以看到 ERNIE 3.0 Tiny v2 在同等规模的开源模型中,综合实力领先其他同类型轻量级模型。与 UER/RoBERTa-Base 相比,12L768H 的 ERNIE 3.0-Base 平均精度提升了 4.5 个点,比同等规模的BERT-Base-Chinese 提升 3.7 个点;6L768H 的 ERNIE 3.0-Medium 相比 12L768H 的 UER/Chinese-RoBERTa 高 2.4,比 BERT-Base-Chinese 高 1.7,并且节省一倍运算时间;另外值得一提的是,这些小模型能够直接部署在 CPU 上。 + +使用 PaddleNLP 只需要一行代码就可以下载并获取 ERNIE 3.0 Tiny 预训练模型,之后可以用自己的下游数据下进行微调。 + +```python + +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +# 用于分类任务(本项目中的意图识别任务) +seq_cls_model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +# 用于序列标注任务(本项目中的槽位填充任务) +token_cls_model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +# 用于阅读理解任务 +qa_model = AutoModelForQuestionAnswering.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +``` + +如果使用 v1 版本模型,只需要把 v2 替换成 v1 即可。 + + + +## 代码结构 + +以下是本项目代码结构 + +```text +. +├── run_train.py # 微调和压缩脚本 +├── run_eval.py # 评估脚本 +├── utils.py # 训练工具脚本 +├── model.py # 模型结构脚本 +├── data # 数据目录(自定义数据) +│ └── train.txt # 训练集(待用户新增) +│ └── dev.txt # 验证集(待用户新增) +│ └── intent_label.txt # 意图标签文件 +│ └── slot_label.txt # 槽位标签文件 +├── deploy # 部署目录 +│ └── README.md # Fastdeploy 部署文档 +│ └── android # 端侧部署目录 +│ └── cpp # 服务端部署目录(C++) +│ └── python # 服务端部署目录(Python) +└── README.md # 文档 +``` + + + +## 开始运行 + + + +### 任务介绍 + +本项目是使用 ERNIE 3.0 Tiny 预训练模型端侧部署方案,任务背景是车载语音场景下的口语理解(Spoken Language Understanding,SLU)。本项目包括微调、压缩和部署的全流程。 + +SLU 任务主要将用户的自然语言表达解析为结构化信息。结构化信息的解析主要包括意图识别和槽位填充两个步骤。 + +- 数据样例: + +```text +- 输入:来一首周华健的花心 +- 输出 + - 意图识别任务:music.play + - 槽位填充任务:来一首周华健花心 +``` + +在本项目中,意图识别和槽位填充任务分别被建模为文本分类和序列标注任务,二者共用一个 ERNIE 3.0 Tiny 模型,只有最后的任务层是独立的。 + +- 评价方法:单句意图和槽位被完全正确分类的准确率(Accuracy)。 + +### 环境要求 +- python >= 3.7 +- paddlepaddle >= 2.4.1 +- paddlenlp >= 2.5 +- paddleslim >= 2.4 + +### 数据准备 + +本项目使用了 [NLPCC2018 Shared Task 4](http://tcci.ccf.org.cn/conference/2018/taskdata.php) 的数据集,该数据集来源于中文真实商用车载语音任务型对话系统的对话日志。需要说明的一点是,本项目为了使压缩样例更简洁,只考虑了原任务中的意图识别和槽位填充任务,纠错数据被忽略,并且只考虑单句任务。由于公开的测试集没有标签,因此只使用了训练集,并自行分割出训练集和验证集。 + +训练集的下载地址为[链接](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata04.zip)。下载、解压后得到 `corpus.train.txt` 文件,将它移动至本项目中的 `data` 目录,再经过下面的代码按照 4:1 的比例分割出训练集和验证集,得到 `data/train.txt` 和 `data/dev.txt` 两个文件: + +```shell +cd data + +shuf corpus.train.txt > corpus.train.txt.shuf +num_lines=$(wc -l corpus.train.txt|awk '{print $1}') +head -n $[num_lines/5] corpus.train.txt.shuf > dev.txt +tail -n $[num_lines-num_lines/5] corpus.train.txt.shuf > train.txt + +``` +执行完后,data 目录应是如下结构: + +```text +├── data # 数据目录(自定义数据) +│ └── train.txt # 训练集 +│ └── dev.txt # 验证集 +│ └── intent_label.txt # 意图标签文件 +│ └── slot_label.txt # 槽位标签文件 +``` + +由于文件较小,`intent_label.txt` 和 `slot_label.txt` 文件是从 `corpus.train.txt` 文件中提取并上传 git 的,提前写入这两个文件是为了读取数据逻辑更便捷,也便于预测时后处理使用。 + + + +## 模型训练 + +本项目自定义了继承自 `ErniePretrainedModel` 的模型 `JointErnie`,使意图识别和槽位填充两个任务可以共用一个预训练模型 `ernie-3.0-tiny-nano-v2-zh`,但是各自也分别拥有最后一层独立的全连接层。模型的定义依然可以使用 `from_pretrained` API 传入使用的预训练模型和相关参数。这里也可以按照需求使用 ERNIE 3.0 Tiny 其他大小的模型,如果不知道如何选择,可以对多个大小的模型都进行训练和压缩,最后根据在硬件上的精度、时延、内存占用等指标来选择模型。 + +```python +from model import JointErnie + +model = JointErnie.from_pretrained( + pretrained_model_name_or_path="ernie-3.0-tiny-nano-v2-zh", + intent_dim=11, + slot_dim=32, +) +``` + +运行下面的脚本,使用 Trainer API 启动训练: + +```shell +mkdir output/BS64_LR5e-5_EPOCHS30 + +python run_train.py \ + --device gpu \ + --logging_steps 100 \ + --save_steps 100 \ + --eval_steps 100 \ + --model_name_or_path ernie-3.0-tiny-nano-v2-zh \ + 
--num_train_epochs 30 \ + --per_device_eval_batch_size 64 \ + --per_device_train_batch_size 64 \ + --learning_rate 5e-5 \ + --prune_embeddings \ + --max_vocab_size 6000 \ + --max_seq_length 16 \ + --output_dir output/BS64_LR5e-5_EPOCHS30 \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --intent_label_path data/intent_label.txt \ + --slot_label_path data/slot_label.txt \ + --label_names 'intent_label' 'slot_label' \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --do_train \ + --do_eval \ + --do_export \ + --input_dtype "int32" \ + --disable_tqdm True \ + --overwrite_output_dir \ + --load_best_model_at_end True \ + --save_total_limit 1 \ + --metric_for_best_model eval_accuracy \ +``` + +可配置参数说明: + +* `model_name_or_path`:必须,进行微调使用的预训练模型。可选择的有 "ernie-3.0-tiny-base-v2-zh"、"ernie-3.0-tiny-medium-v2-zh"、"ernie-3.0-tiny-mini-v2-zh"、"ernie-3.0-tiny-micro-v2-zh"、"ernie-3.0-tiny-nano-v2-zh"、"ernie-3.0-tiny-pico-v2-zh"。 +* `output_dir`:必须,模型训练后保存的模型目录。 +* `prune_embeddings`:可选,模型的 embeddings 是否需要裁剪。如果设置,会按照 `max_seq_length` 以及 `max_vocab_size` 对预训练模型的 `position embeddings` 和 `word_embeddings` 参数进行裁剪,并将新的 model 和 tokenizer 保存至 `${output_dir}/pretrained_model` 下。后续的模型微调会基于 embeddings 裁剪后的模型开始。该策略主要是为了减少部署时模型的内存占用。如果对模型的内存占用要求不高,也可以不设置。 +* `max_seq_length`:最大序列长度,是指分词后样本的最大token数,本项目中是 16。如果设置了 `prune_embeddings`,那么会对模型的 `position embeddings` 根据 `max_seq_length` 的值进行裁剪。 +* `max_vocab_size`:词表裁剪后的大小。当设置 `prune_embeddings` 时,会根据词频对预训练模型的词表进行排序,并根据 `max_vocab_size` 大小进行裁剪。 +* `train_path`:必须,训练集路径 +* `dev_path`:必须,验证集路径 +* `intent_label_path`:必须,意图标签文件路径。 +* `slot_label_path`:必须,槽位标签文件路径。 +* `label_names`:训练集中标签对应的 key 名称。如果不传入,在训练时 Trainer 可能由于无法区分输入数据和标签造成错误。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练。 +* `do_eval`:是否进行评估,设置该参数表示进行评估。 +* `do_export`:是否导出模型,设置该参数表示训练完成后导出预测模型。 +* `load_best_model_at_end`:是否在训练结尾导入最好的模型。 +* `metric_for_best_model`:选择最好模型的 metric 名称。 +* `per_device_train_batch_size`:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 32。 +* `per_device_eval_batch_size`:开发集评测过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 32。 +* `learning_rate`:训练最大学习率。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认100。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `weight_decay`:除了所有 bias 和 LayerNorm 权重之外,应用于所有层的权重衰减数值。可选;默认为 0.0; +* `input_dtype`:模型输入张量的数据类型。默认是 `int64`。 +* `device`: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 'gpu'。 + + + + +## 模型评估 +- 动态图 + +使用动态图进行评估,可以直接使用 [模型训练](#模型训练) 中的评估脚本,取消设置 `--do_train` 和 `--do_export` 并保留设置 `--do_eval`,并将 `--model_name_or_path` 设置成微调后的模型路径即可。 + +- 静态图 + +如果使用静态图进行评估或者预测,可以参考脚本 `run_eval.py`,参考下面的命令启动评估: + +```shell +python run_eval.py \ + --device gpu \ + --model_name_or_path output/BS64_LR5e-5_EPOCHS30/checkpoint-7700/ \ + --infer_prefix output/BS64_LR5e-5_EPOCHS30/infer_model \ + --output_dir ./ \ + --test_path data/dev.txt \ + --intent_label_path data/intent_label.txt \ + --slot_label_path data/slot_label.txt \ + --max_seq_length 16 \ + --per_device_eval_batch_size 512 \ + --do_eval +``` + +* `model_name_or_path`:动态图模型的目录,主要用于加载 tokenizer。 +* `infer_prefix`:预测模型的路径(目录+前缀)。例如当 `infer_prefix` 为 `output/infer_model` 时,代表预测模型和参数文件分别为 `output/infer_model.pdmodel` 和 `output/infer_model.pdiparams`。 +* `test_path` :评估所用文件路径名; +* `do_eval`,是否输出评价指标的结果。如果设置,脚本会开启评估模式,最终会输出精度评价指标的值。如果不设置,则会输出模型后处理后的结果。例如: + +```text +- 输入:放一首刘德华的音乐 +- 输出: + {'intent': 'music.play', 'confidence': array([0.9984201], dtype=float32)} + {'value': [[{'slot': 'singer', 'entity': '刘德华', 'pos': [3, 5]}]]} +``` + + + +## 🔥端上模型压缩方案 + +尽管 ERNIE 3.0 
Tiny 已提供了效果不错的轻量级模型可以微调后直接使用,但在本项目中,微调后的模型体积是 69.0 MB,内存占用达到 115.72MB,部署至端侧还是存在一定困难。因此当模型有部署上线的需求,想要进一步压缩模型体积,降低推理时延,可使用本项目的 **端上语义理解压缩方案** 对上一步微调后的模型进行压缩。 + +为了方便实现,[PaddleNLP 模型压缩 API](../../docs/compression.md) 已提供了以下压缩功能,模型压缩 API 主要是基于 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 模型压缩能力,PaddleSlim 是一个专注于深度学习模型压缩的工具库,提供低比特量化、知识蒸馏、稀疏化和模型结构搜索等模型压缩策略,帮助开发者快速实现模型的小型化,欢迎大家使用。 + +端上模型压缩流程如下图所示: + +

+ image +

+
+ +在本项目中,模型压缩和模型训练共用了脚本 `run_train.py`,压缩时需设置 `--do_compress` 开启模型压缩,并取消设置 `--do_train` 关闭普通训练。模型压缩还需要设置 `--strategy` 参数,本项目中选择 `'dynabert+qat+embeddings'` 组合策略。 + +运行下面的脚本,可对上面微调后的模型进行压缩: + +```shell +python run_train.py \ + --do_compress \ + --strategy 'dynabert+qat+embeddings' \ + --num_train_epochs 10 \ + --model_name_or_path output/BS64_LR5e-5_EPOCHS30/checkpoint-6700 \ + --output_dir output/BS64_LR5e-5_EPOCHS30/ \ + --max_seq_length 16 \ + --per_device_eval_batch_size 64 \ + --per_device_train_batch_size 64 \ + --learning_rate 5e-5 \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --intent_label_path data/intent_label.txt \ + --slot_label_path data/slot_label.txt \ + --label_names 'intent_label' 'slot_label' \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --input_dtype "int32" \ + --device gpu \ + --logging_steps 100 \ + --save_steps 100 \ + --eval_steps 100 \ + --disable_tqdm True \ + --save_total_limit 1 \ + --metric_for_best_model eval_accuracy \ +``` + +可配置参数说明: + +* `strategy`:压缩策略,本案例中推荐使用`"dynabert+qat+embeddings"`,这是一个策略组合,由 `"dynabert"`、`"qat"`、`"embeddings"` 组成。其中`"dynabert"` 是一种裁剪策略,能直接对模型宽度进行裁剪,从而直接减少参数量,需要训练;`"qat"` 是一种量化方法,用于将模型中矩阵乘(底层是 matmul_v2 算子)的权重及激活值的数据类型由 FP32 转成 INT8,并使模型精度尽量保持无损,需要训练;`"embeddings"` 则代表 Embedding 量化策略,它将 Embedding API(底层是 lookup_table_v2 算子)的权重由 FP32 转成 INT8 存储,而不需要训练。由于词表参数量占比非常大,Embedding 量化能够大幅度减少模型的内存占用,但不会对时延产生正向作用。 +* `model_name_or_path`:必须,进行压缩所使用的微调模型。 +* `output_dir`:必须,模型训练或者压缩后保存的模型目录;默认为 `None` 。 +* `do_compress`:必须。压缩需要通过这个开关来打开。其他的开关`do_train` 、`do_eval`和`do_export` 在此步则不能设置。 +* `input_dtype`:模型输入张量的数据类型。默认是 `int64`。 + +其他参数同训练参数,如`learning_rate`、`num_train_epochs`、`per_device_train_batch_size` 等,是指压缩过程中的训练(`"dynabert"` 裁剪 以及 `"qat"` 量化)时所使用的参数,一般可以和微调时保持一致即可,其中 `num_train_epochs` 可比微调时略小。 + + + +### 压缩效果 + +| 模型 | 模型精度(acc.) | 模型体积(MB) | +|-----------------------------------|--------------|--------------| +| 原模型 | 82.34 | 69.0 | +| 原模型+裁剪(词表+模型宽度) | 82.11(-0.23) | 64.0(-7.2%) | +| 原模型+裁剪(词表+模型宽度)+量化(矩阵乘) | 82.21(-0.13) | 11.0(-84.1%) | +| 原模型+裁剪(词表+模型宽度)+量化(矩阵乘+Embedding) | 82.21(-0.13) | 5.4(-92.2%) | + +模型经过压缩后,精度基本无损,体积减小了 92.2%,仅有 5.4 MB。到此,算法侧的工作基本完成。 + + + +## ⚡️FastDeplopy 部署 +能够将深度学习模型部署到性能较低的端侧本身是比较困难的工作,因此在前面我们对小模型做了大量的优化,在精度不降的情况下将 69 MB 的模型压缩至 5.4 MB,但是如果想更好地满足业务上线要求,还需要有部署工具对性能有更多优化。在这里,PaddlePadde 提供了易用高效的云边端推理部署工具 ⚡️FastDeploy,它的 [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite) 后端基于算子融合和常量折叠进行了深度模型优化,使得模型推理速度可有大幅度提升;它所集成的 FastTokenizer 库能够对分词阶段进行加速,在麒麟 985 芯片上单条文本的分词的推理时延低于 0.1 毫秒; + +因此,本项目基于 FastDeploy 部署工具,完成了 ERNIE 3.0 Tiny 端侧和服务端的高效部署,请参考 [ERNIE 3.0 Tiny 部署文档](deploy/README.md)。以下动图是 ERNIE 3.0 Tiny 意图识别、槽位填充联合模型使用 FastDeploy 部署在 Android App 上推理的效果展示: + +

+ image +

+ +想要更多了解 FastDeploy 可参考 [FastDeploy 仓库](https://github.com/PaddlePaddle/FastDeploy)。FastDeploy 是一款全场景、易用灵活、极致高效的 AI 推理部署工具,提供开箱即用的部署体验。它为 NLP 任务提供了一整套完整的部署 Pipeline,提供 ERNIE 3.0 Tiny 模型从文本预处理、推理引擎 Runtime 以及后处理三个阶段所需要的接口模块,开发者可以基于这些接口模块在云、边、端上部署各类常见的 NLP 任务,如文本分类、序列标注、信息抽取等: +- 在文本预处理阶段,FastDeploy 使用 PaddleNLP 提供的简单易用的高效分词工具 [FastTokenizer](../../fast_tokenizer/README.md) 完成文本预处理,开发者只需调用几行代码就能完成分词阶段开发。在麒麟 985 芯片上单条文本的分词延时低于 0.1 毫秒,将本项目模型部署在 GPU 上时,使用 FastTokenizer 工具可使端到端性能提升 **3.56倍**; +- 在 Runtime 阶段,FastDeploy 集成多款硬件以及推理引擎后端,开发者可以设置 `fastdeploy::RuntimeOption` 以完成在不同硬件以及使用不同的推理引擎进行部署。目前,FastDeploy 支持的后端引擎有: + - 端侧: `Paddle Lite`; + - 服务端 GPU: `Paddle Inference`、`ONNX Runtime`、`Paddle TensorRT` 以及 `TensorRT`; + - 服务端 CPU:`Paddle Inference`、`ONNX Runtime` 以及 `OpenVINO`。 +- 在后处理阶段,FastDeploy 提供了张量级别的 [数值运算模块](https://baidu-paddle.github.io/fastdeploy-api/cpp/html/namespacefastdeploy_1_1function.html), 基于该模块可以快速完成各类任务的后处理计算,如文本分类任务的 Softmax 等数值计算。 + + + +### 性能结论 + +使用 FastDeploy 将压缩后的模型部署在华为 nova 7 Pro (麒麟 985 芯片)上,选用 Paddle Lite 作为后端进行测试,得到不同推理精度下的模型效果、端到端时延(包括前后处理)、内存占用(包括加载 FastTokenizer 库)的数据如下: + +| 模型 | 模型精度(acc.) | 推理精度 | 端到端时延(ms) | 内存占用 Pss (MB) | 模型体积(MB) | +|-----------------------------------|--------------|-----------|-------------|----------------|--------------| +| 原模型 | 82.34 | FP32 | 9.90 | 115.72 | 69.0 | +| 原模型 | 82.34(-0.00) | FP16 | 6.03(1.64x) | 106.24(-8.2%) | 69.0(-0.0%) | +| 原模型+裁剪(词表+模型宽度) | 82.11(-0.23) | FP32 | 7.55(1.31x) | 59.49(-48.59%) | 64.0(-7.2%) | +| 原模型+裁剪(词表+模型宽度) | 82.11(-0.23) | FP16 | 4.68(2.12x) | 52.23(-54.87%) | 64.0(-7.2%) | +| 原模型+裁剪(词表+模型宽度)+量化(矩阵乘) | 82.21(-0.13) | FP32+INT8 | 4.57(2.17x) | 49.17(-57.51%) | 11.0(-84.1%) | +| **原模型+裁剪(词表+模型宽度)+量化(矩阵乘+Embedding)** | 82.21(-0.13) | FP32+INT8 | **4.64(2.13x)** | **43.77(-62.18%)** | **5.4(-92.2%)** | + + +**测试条件**:max_seq_length=16,batch_size=1,thread_num=1 + +模型经过压缩后,精度基本无损,体积减小了 92.2%。在以上测试条件下,端到端推理速度达到原来的 2.13 倍,内存占用减小了 62.18%。 + + + +## 参考文献 +* Liu W, Chen X, Liu J, et al. ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization[J]. arXiv preprint arXiv:2301.03416, 2023. +* Su W, Chen X, Feng S, et al. ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression[J]. arXiv preprint arXiv:2106.02241, 2021. +* Wang S, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2112.12731, 2021. +* Sun Y, Wang S, Feng S, et al. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2107.02137, 2021. 
diff --git a/model_zoo/ernie-tiny/data/intent_label.txt b/model_zoo/ernie-tiny/data/intent_label.txt new file mode 100644 index 0000000000000000000000000000000000000000..ebaff04778ed8d27626dab99c55d9b3c678f8984 --- /dev/null +++ b/model_zoo/ernie-tiny/data/intent_label.txt @@ -0,0 +1,11 @@ +OTHERS +music.play +music.pause +music.prev +music.next +navigation.navigation +navigation.open +navigation.start_navigation +navigation.cancel_navigation +phone_call.make_a_phone_call +phone_call.cancel diff --git a/model_zoo/ernie-tiny/data/slot_label.txt b/model_zoo/ernie-tiny/data/slot_label.txt new file mode 100644 index 0000000000000000000000000000000000000000..269b7a3912813f7cb6130c86a29225c490680560 --- /dev/null +++ b/model_zoo/ernie-tiny/data/slot_label.txt @@ -0,0 +1,32 @@ +PAD +O +B-song +I-song +B-singer +I-singer +B-theme +I-theme +B-style +I-style +B-age +I-age +B-toplist +I-toplist +B-emotion +I-emotion +B-language +I-language +B-instrument +I-instrument +B-scene +I-scene +B-destination +I-destination +B-custom_destination +I-custom_destination +B-origin +I-origin +B-phone_num +I-phone_num +B-contact_name +I-contact_name diff --git a/model_zoo/ernie-tiny/deploy/README.md b/model_zoo/ernie-tiny/deploy/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0f83e2cbd791e2884ca34fa2af82105f56373ad2 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/README.md @@ -0,0 +1,59 @@ +# FastDeploy ERNIE 3.0 Tiny 模型高性能部署 + +**目录** + * [FastDeploy部署介绍](#FastDeploy部署介绍) + * [代码结构](#代码结构) + * [环境要求](#环境要求) + * [详细部署文档](#详细部署文档) + + + +## FastDeploy部署介绍 + +**⚡️FastDeploy**是一款**全场景**、**易用灵活**、**极致高效**的AI推理部署工具,满足开发者**多硬件、多平台**的产业部署需求。开发者可以基于FastDeploy将训练好的预测模型在不同的硬件、不同的操作系统以及不同的推理引擎后端上进行部署。目前FastDeploy提供多种编程语言的 SDK,包括 C++、Python 以及 Java SDK。 + +目前 ERNIE 3.0 Tiny 模型已提供基于 FastDeploy 的云边端的部署示例,在服务端上的 GPU 硬件上,支持`Paddle Inference`、`ONNX Runtime`、`Paddle TensorRT`以及`TensorRT`后端,在CPU上支持`Paddle Inference`、`ONNX Runtime`以及`OpenVINO`后端;在移动端上支持`Paddle Lite`后端。多硬件、多推理引擎后端的支持可以满足开发者不同的部署需求。 + +为了提供 ERNIE 3.0 Tiny 高性能端到端部署能力,我们使用 [FastTokenizer](../../../fast_tokenizer/README.md) 工具完成高效分词,大大提升端到端预测性能。针对 ERNIE 3.0 Tiny 模型,FastTokenizer 集成了Google提出的 [Fast WordPiece Tokenization](https://arxiv.org/pdf/2012.15524.pdf) 快速分词算法,可大幅提升分词效率。在 ERNIE 3.0 Tiny 性能测试中,使用 FastTokenizer 可提升端到端性能 **3.56倍** 。我们在 [Python部署](python/README.md) 文档更全面地展示使用 FastTokenizer 的加速能力。 + +本部署示例是车载语音场景下的口语理解(Spoken Language Understanding,SLU)任务,详细可看[ERNIE 3.0 Tiny介绍](../README.md)。 + + + + +## 代码结构 + +```text + +├── cpp +│   ├── CMakeLists.txt # CMake编译脚本 +│   ├── infer_demo.cc # C++ 部署示例代码 +│   └── README.md # C++ 部署示例文档 +├── python +│   ├── infer_demo.py # Python 部署示例代码 +│   └── README.md # Python 部署示例文档 +├── android +│   ├── README.md # Android部署文档 +│   ├── app # App示例代码 +│   ├── build.gradle +│   ├── ernie_tiny # ERNIE 3.0 Tiny JNI & Java封装代码 +│   ├── ...... 
# Android相关的工程文件及目录 +│   ├── local.properties +│   └── ui # 一些辅助用的UI代码 +└── README.md # 文档 + +``` + + + +## 环境要求 + +在部署ERNIE 3.0 Tiny模型前,需要安装FastDeploy SDK,可参考[FastDeploy SDK安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)确认部署环境是否满足FastDeploy环境要求,并按照介绍安装相应的SDK。 + + + +## 详细部署文档 + +- [Python部署](python/README.md) +- [C++部署](cpp/README.md) +- [Android部署](android/README.md) diff --git a/model_zoo/ernie-tiny/deploy/android/.gitignore b/model_zoo/ernie-tiny/deploy/android/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..a5bc2e92563cf7f0e331638a7a7c78923bd13dfd --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/.gitignore @@ -0,0 +1,23 @@ +.DS_Store +.idea +.gradle +.cxx +cache +build +app/cache +app/libs/* +app/.cxx +app/build +app/src/main/assets/models/* +app/.gradle +app/.idea +ernie_tiny/cache +ernie_tiny/libs/fastdeploy* +ernie_tiny/.cxx +ernie_tiny/build +ernie_tiny/src/main/assets/models/* +ernie_tiny/.gradle +ernie_tiny/.idea +ui/build +ui/.gradle +ui/cache \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/README.md b/model_zoo/ernie-tiny/deploy/android/README.md new file mode 100644 index 0000000000000000000000000000000000000000..357464fa7739a4664a2d92f466fcd4cb19b47e9d --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/README.md @@ -0,0 +1,260 @@ +# FastDeploy ERNIE 3.0 Tiny 模型Android部署示例 + +本目录提供了快速完成在 Android 的车载语音场景下的口语理解(Spoken Language Understanding,SLU)任务的部署示例。 + +## 环境准备 + +1. 在本地环境安装好 Android Studio 工具,详细安装方法请见[Android Stuido 官网](https://developer.android.com/studio)。 +2. 准备一部 Android 手机,并开启 USB 调试模式。开启方法: `手机设置 -> 查找开发者选项 -> 打开开发者选项和 USB 调试模式` + +## App示例编译和使用步骤 + +1. ERNIE 3.0 Tiny Android 部署示例位于`PaddleNLP/model_zoo/ernie-tiny/deploy/android`目录 +2. 用 Android Studio 打开 ernie-tiny/deploy/android 工程 +3. 手机连接电脑,打开 USB 调试和文件传输模式,并在 Android Studio 上连接自己的手机设备(手机需要开启允许从 USB 安装软件权限) + +image + + +工程内容说明: +```bash +. +├── README.md +├── app # App示例代码 +├── build.gradle +├── ernie_tiny # ERNIE Tiny JNI & Java封装代码 +# ... # 一些和gradle相关的工程配置文件 +├── local.properties +└── ui # 一些辅助用的UI代码 +``` + +> **注意:** +>> 如果您在导入项目、编译或者运行过程中遇到 NDK 配置错误的提示,请打开 ` File > Project Structure > SDK Location`,修改 `Andriod NDK location` 为您本机配置的 NDK 所在路径。本工程默认使用的NDK版本为20. +>> 如果您是通过 Android Studio 的 SDK Tools 下载的 NDK (见本章节"环境准备"),可以直接点击下拉框选择默认路径。 +>> 还有一种 NDK 配置方法,你可以在 `local.properties` 文件中手动完成 NDK 路径配置,如下图所示 +>> 如果以上步骤仍旧无法解决 NDK 配置错误,请尝试根据 Android Studio 官方文档中的[更新 Android Gradle 插件](https://developer.android.com/studio/releases/gradle-plugin?hl=zh-cn#updating-plugin)章节,尝试更新Android Gradle plugin版本。 + + +4. 点击 Run 按钮,自动编译 APP 并安装到手机。(该过程会自动下载预编译的 FastDeploy Android 库 以及 模型文件,需要联网) + 成功后效果如下,图一:APP 安装到手机;图二: APP 打开后的效果,输入文本后点击"开始分析意图"后即会自动进行意图识别和槽位分析;图三:APP 设置选项,点击右上角的设置图片,可以设置不同选项进行体验。 + +| APP 效果 | APP 演示 | APP设置项 | + | --- | --- | --- | +| image | image | image | + +如果您使用的是较旧版本的Android Studio时遇到gradle版本兼容的问题,可以修改[build.gradle](./build.gradle) 中的gradle版本号为您当前使用的版本。 +```gradle +buildscript { + // ... + dependencies { + classpath 'com.android.tools.build:gradle:7.2.2' // 修改为您当前使用的版本 + } +} +``` + +5. 
点击 APP 右上角的设置选项,可以跳转到设置页。在设置页,您可以选择不同的模型和不同的推理精度,即是否开启 FP16 和 Int8 推理,两种推理精度只能二选一。所有模型均支持FP32推理,当设置项中的 FP16 和 Int8 选项都为 false 或 不设置 时,即使用了FP32进行推理。各模型FP16和Int8推理的支持情况为: + +|模型选项|模型名称|FP16|Int8| +|---|---|---|---| +|models/ernie-tiny|原模型|✅|-| +|models/ernie-tiny-clip|原模型+裁剪(词表+模型宽度)|✅|-| +|models/ernie-tiny-clip-qat|原模型+裁剪(词表+模型宽度)+量化(矩阵乘)|-|✅| +|models/ernie-tiny-clip-qat-embedding-int8|原模型+裁剪(词表+模型宽度)+量化(矩阵乘+Embedding)|-|✅| + + +## ERNIE Tiny Java SDK 说明和使用 + +本工程除了可以直接编译 App 体验之外,还可以编译 ERNIE 3.0 Tiny 的`Java SDK`,方便用户开箱即用。 如下图所示,编译 Java SDK 的步骤为: + - 先在 Android Studio 中打开`ernie_tiny/build.gradle`工程文件; + - 选择 Build->Make Module 'android:ernietiny'; + - 从`ernie_tiny/build/outputs/aar`目录中获取编译后得到的 SDK,即`ernie_tiny-debug.aar`. + +image + +在获取到`ernie_tiny-debug.aar`后,可将其拷贝到您自己的工程中进行使用。在 Android Studio 中的配置步骤如下: +(1)首先,将`ernie_tiny-debug.aar`拷贝到您 Android 工程的libs目录下。 +```bash +├── build.gradle +├── libs +│   └── ernie_tiny-debug.aar +├── proguard-rules.pro +└── src +``` +(2)在您的 Android 工程中的 build.gradle 引入 ERNIE 3.0 Tiny SDK,如下 +```groovy +dependencies { + implementation fileTree(include: ['ernie_tiny-debug.aar'], dir: 'libs') + implementation 'com.android.support:appcompat-v7:28.0.0' + // ... +} +``` + +### ERNIE 3.0 Tiny Java API说明如下 + +- ERNIE 3.0 Tiny `Predictor` 初始化 API: 模型初始化 API 包含两种方式,方式一是通过构造函数直接初始化;方式二是,通过调用 init 函数,在合适的程序节点进行初始化。ERNIE 3.0 Tiny Predictor 初始化参数说明如下: + - modelFile: String, paddle 格式的模型文件路径,如 infer_model.pdmodel + - paramFile: String, paddle 格式的参数文件路径,如 infer_model.pdiparams + - vocabFile: String, 词表文件,如 vocab.txt 每一行包含一个词 + - slotLabelsFile: String, 槽位标签文件,如 slots_label.txt + - intentLabelsFile: String, 意图标签文件,如 intent_label.txt 每一行包含一个标签 + - addedTokensFile: String, 额外词表文件,如 added_tokens.json,json文件 + - runtimeOption: RuntimeOption,可选参数,模型初始化 option。如果不传入该参数则会使用默认的运行时选项。 + - maxLength: 最大序列长度,默认为16 + +```java +public Predictor(); // 空构造函数,之后可以调用init初始化 +public Predictor(String modelFile, String paramsFile, String vocabFile, String slotLabelsFile, String intentLabelsFile, String addedTokensFile); +public Predictor(String modelFile, String paramsFile, String vocabFile, String slotLabelsFile, String intentLabelsFile, String addedTokensFile, RuntimeOption runtimeOption, int maxLength); +public boolean init(String modelFile, String paramsFile, String vocabFile, String slotLabelsFile, String intentLabelsFile, String addedTokensFile, RuntimeOption runtimeOption, int maxLength); +``` + +- ERNIE 3.0 Tiny `Predictor` 预测 API:Predictor 提供 predict 接口对输出的文本进行意图识别。 +```java +public IntentDetAndSlotFillResult[] predict(String[] texts); +``` + +- ERNIE 3.0 Tiny Predictor 资源释放 API: 调用 release() API 可以释放模型资源,返回 true 表示释放成功,false 表示失败;调用 initialized() 可以判断 Predictor 是否初始化成功,true 表示初始化成功,false 表示失败。 +```java +public boolean release(); // 释放native资源 +public boolean initialized(); // 检查Predictor是否初始化成功 +``` + +- RuntimeOption设置说明 + +```java +public class RuntimeOption { + public void enableLiteFp16(); // 开启fp16精度推理 + public void disableLiteFP16(); // 关闭fp16精度推理(默认关闭) + public void enableLiteInt8(); // 开启int8精度推理(需要先准备好量化模型) + public void disableLiteInt8(); // 关闭int8精度推理(默认关闭) + public void setCpuThreadNum(int threadNum); // 设置线程数 + public void setLitePowerMode(LitePowerMode mode); // 设置能耗模式 + public void setLitePowerMode(String modeStr); // 通过字符串形式设置能耗模式 +} +``` + +- 意图和槽位识别结果`IntentDetAndSlotFillResult`说明 + +```java +public class IntentDetAndSlotFillResult { + public String mStr; // 可用于debug的字符串 拼接了意图识别和槽位识别的结果 + public boolean mInitialized = false; + public IntentDetResult mIntentResult; // 
意图识别结果 + public SlotFillResult[] mSlotResult; // 槽位识别结果 + + static class IntentDetResult { + public String mIntentLabel; // 意图识别结果文本标签 + public float mIntentConfidence; // 意图识别结果置信度 + } + static class SlotFillResult { + public String mSlotLabel; // 槽位识别结果文本标签 + public String mEntity; // 槽位识别的实体 + public int[] mPos; // [2] // 在原始文本对应的位置 [start,end] + } +} +``` + +- ERNIE 3.0 Tiny `Predictor` FP32/FP16 推理示例 + +```java +import com.baidu.paddle.paddlenlp.ernie_tiny.RuntimeOption; +import com.baidu.paddle.paddlenlp.ernie_tiny.Predictor; +import com.baidu.paddle.paddlenlp.ernie_tiny.IntentDetAndSlotFillResult; +import android.app.Activity; + +// 以下为伪代码 +class TestERNIETiny extends Activity { + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + Predictor predictor = new Predictor(); + + // 设置模型文件和标签文件 + String modelFile = "ernie-tiny/infer_model.pdmodel"; + String paramsFile = "ernie-tiny/infer_model.pdiparams"; + String vocabFile = "ernie-tiny/vocab.txt"; + String slotLabelsFile = "ernie-tiny/slots_label.txt"; + String intentLabelsFile = "ernie-tiny/intent_label.txt"; + String addedTokensFile = "ernie-tiny/added_tokens.json"; + + // RuntimeOption 设置 + RuntimeOption option = new RuntimeOption(); + option.setCpuThreadNum(2); // 设置线程数 + option.enableLiteFp16(); // 是否开启FP16精度推理 + + // Predictor初始化 + predictor.init(modelFile, paramsFile, vocabFile, slotLabelsFile, intentLabelsFile, addedTokensFile, option, 16); + + // 进行意图识别和槽位分析 + String[] inputTexts = new String[]{"来一首周华健的花心", "播放我们都一样", "到信阳市汽车配件城"}; + + IntentDetAndSlotFillResult[] results = predictor.predict(inputTexts); + } +} +``` + +- ERNIE 3.0 Tiny `Predictor` Int8 量化模型推理示例 + +```java +import com.baidu.paddle.paddlenlp.ernie_tiny.RuntimeOption; +import com.baidu.paddle.paddlenlp.ernie_tiny.Predictor; +import com.baidu.paddle.paddlenlp.ernie_tiny.IntentDetAndSlotFillResult; +import android.app.Activity; + +// 以下为伪代码 +class TestERNIETiny extends Activity { + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + Predictor predictor = new Predictor(); + + // 设置模型文件和标签文件 + String modelFile = "ernie-tiny-clip-qat-embedding-int8/infer_model.pdmodel"; + String paramsFile = "ernie-tiny-clip-qat-embedding-int8/infer_model.pdiparams"; + String vocabFile = "ernie-tiny-clip-qat-embedding-int8/vocab.txt"; + String slotLabelsFile = "ernie-tiny-clip-qat-embedding-int8/slots_label.txt"; + String intentLabelsFile = "ernie-tiny-clip-qat-embedding-int8/intent_label.txt"; + String addedTokensFile = "ernie-tiny-clip-qat-embedding-int8/added_tokens.json"; + + // RuntimeOption 设置 + RuntimeOption option = new RuntimeOption(); + option.setCpuThreadNum(2); // 设置线程数 + option.enableLiteInt8(); // 开启int8精度推理(需要先准备好量化模型) + + // Predictor初始化 + predictor.init(modelFile, paramsFile, vocabFile, slotLabelsFile, intentLabelsFile, addedTokensFile, option, 16); + + // 进行意图识别和槽位分析 + String[] inputTexts = new String[]{"来一首周华健的花心", "播放我们都一样", "到信阳市汽车配件城"}; + + IntentDetAndSlotFillResult[] results = predictor.predict(inputTexts); + } +} +``` + +更详细的用法请参考 [ERNIETinyMainActivity](./app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java) 中的用法 + +## 替换 App 示例中的 ERNIE 3.0 Tiny 模型 + +替换 App 示例中的模型的步骤非常简单,模型所在的位置为 `app/src/main/assets/models`。替换模型之前请确保您的模型目录中包含 vocab.txt、slots_label.txt、intent_label.txt 以及 added_token.json 等意图识别和槽位分析所需要的词表和标签文件。替换模型的步骤为: + - 将您的 ERNIE 3.0 Tiny 模型放在 `app/src/main/assets/models` 目录下; + - 修改 
`app/src/main/res/values/strings.xml` 中模型路径的默认值,如: + +```xml + +models/ernie-tiny-clip-qat-embedding-int8 +``` + +## 关于 ERNIE 3.0 Tiny JNI 的封装 + +如果您对 ERNIE 3.0 Tiny JNI 封装的实现感兴趣,可以参考 [ernie_tiny_jni/predictor_jni.cc](./ernie_tiny/src/main/cpp/ernie_tiny_jni/predictor_jni.cc), 关于如何使用 JNI 进行模型封装并和 Java 通信,可以参考 [FastDeploy/use_cpp_sdk_on_android.md](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/faq/use_cpp_sdk_on_android.md) 文档中的说明。 + +## 相关文档 + +[ERNIE 3.0 Tiny 模型详细介绍](../../README.md) + +[ERNIE 3.0 Tiny 模型C++部署方法](../cpp/README.md) + +[ERNIE 3.0 Tiny 模型Python部署方法](../python/README.md) + +[FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md) \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/build.gradle b/model_zoo/ernie-tiny/deploy/android/app/build.gradle new file mode 100644 index 0000000000000000000000000000000000000000..2037c756095921513c58e3c31c307b35ded93494 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/build.gradle @@ -0,0 +1,91 @@ +apply plugin: 'com.android.application' + +android { + compileSdk 28 + + defaultConfig { + applicationId 'com.baidu.paddle.paddlenlp.app' + minSdkVersion 15 + //noinspection ExpiredTargetSdkVersion + targetSdkVersion 28 + versionCode 1 + versionName "1.0" + testInstrumentationRunner "android.support.test.runner.AndroidJUnitRunner" + } + + buildTypes { + release { + minifyEnabled false + proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro' + } + } +} + +dependencies { + implementation project(path: ':ernie_tiny') + implementation project(path: ':ui') + //noinspection GradleCompatible + implementation 'com.android.support:appcompat-v7:28.0.0' + //noinspection GradleDependency + implementation 'com.android.support.constraint:constraint-layout:1.1.3' + implementation 'com.android.support:design:28.0.0' + implementation 'org.jetbrains:annotations:15.0' + //noinspection GradleDependency + testImplementation 'junit:junit:4.12' + androidTestImplementation 'com.android.support.test:runner:1.0.2' + androidTestImplementation 'com.android.support.test.espresso:espresso-core:3.0.2' +} + +def FD_MODEL = [ + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny.tgz', + 'dest': 'src/main/assets/models' + ], + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny-clip.tgz', + 'dest': 'src/main/assets/models' + ], + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny-clip-qat.tgz', + 'dest': 'src/main/assets/models' + ], + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny-clip-qat-embedding-int8.tgz', + 'dest': 'src/main/assets/models' + ] + +] + +task downloadAndExtractModels(type: DefaultTask) { + doFirst { + println "[INFO] Downloading and extracting models ..." 
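+        // Note: the actual download and extraction happen in the doLast block below; each archive
+        // listed in FD_MODEL is downloaded once into ./cache and then untarred into src/main/assets/models.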
+ } + doLast { + String cachePath = "cache" + if (!file("${cachePath}").exists()) { + mkdir "${cachePath}" + } + FD_MODEL.eachWithIndex { model, index -> + String[] modelPaths = model.src.split("/") + String modelName = modelPaths[modelPaths.length - 1] + String modelPrefix = modelName.substring(0, modelName.length() - 4) + boolean copyFiles = false + if (!file("${model.dest}/${modelPrefix}").exists()) { + // Download the target model if not exists + if (!file("${cachePath}/${modelName}").exists()) { + println "[INFO] Downloading ${model.src} -> ${cachePath}/${modelName}" + ant.get(src: model.src, dest: file("${cachePath}/${modelName}")) + } + copyFiles = true + } + if (copyFiles) { + println "[INFO] Taring ${cachePath}/${modelName} -> ${model.dest}/${modelPrefix}" + copy { from(tarTree("${cachePath}/${modelName}")) into("${model.dest}") } + } else { + println "[INFO] ${model.dest}/${modelPrefix} already exists!" + } + } + } +} + +preBuild.dependsOn downloadAndExtractModels diff --git a/model_zoo/ernie-tiny/deploy/android/app/proguard-rules.pro b/model_zoo/ernie-tiny/deploy/android/app/proguard-rules.pro new file mode 100644 index 0000000000000000000000000000000000000000..481bb434814107eb79d7a30b676d344b0df2f8ce --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/proguard-rules.pro @@ -0,0 +1,21 @@ +# Add project specific ProGuard rules here. +# You can control the set of applied configuration files using the +# proguardFiles setting in build.gradle. +# +# For more details, see +# http://developer.android.com/guide/developing/tools/proguard.html + +# If your project uses WebView with JS, uncomment the following +# and specify the fully qualified class name to the JavaScript interface +# class: +#-keepclassmembers class fqcn.of.javascript.interface.for.webview { +# public *; +#} + +# Uncomment this to preserve the line number information for +# debugging stack traces. +#-keepattributes SourceFile,LineNumberTable + +# If you keep the line number information, uncomment this to +# hide the original source file name. 
+#-renamesourcefileattribute SourceFile \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/AndroidManifest.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/AndroidManifest.xml new file mode 100644 index 0000000000000000000000000000000000000000..08851e680b01916de42e4e2af9ca6708fba14c99 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/AndroidManifest.xml @@ -0,0 +1,30 @@ + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java new file mode 100644 index 0000000000000000000000000000000000000000..3fb229bec15f5c8790eac7f7518be42e6f606a22 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java @@ -0,0 +1,241 @@ +package com.baidu.paddle.paddlenlp.app.ernie_tiny; + +import android.Manifest; +import android.annotation.SuppressLint; +import android.app.Activity; +import android.app.AlertDialog; +import android.content.DialogInterface; +import android.content.Intent; +import android.content.SharedPreferences; +import android.content.pm.PackageManager; +import android.os.Bundle; +import android.preference.PreferenceManager; +import android.support.annotation.NonNull; +import android.support.annotation.Nullable; +import android.support.v4.app.ActivityCompat; +import android.support.v4.content.ContextCompat; +import android.util.Log; +import android.view.View; +import android.widget.Button; +import android.widget.EditText; +import android.widget.ImageButton; +import android.widget.ImageView; + +import com.baidu.paddle.paddlenlp.app.R; +import com.baidu.paddle.paddlenlp.ernie_tiny.RuntimeOption; +import com.baidu.paddle.paddlenlp.ernie_tiny.Predictor; +import com.baidu.paddle.paddlenlp.ernie_tiny.IntentDetAndSlotFillResult; +import com.baidu.paddle.paddlenlp.ui.Utils; + + +public class ERNIETinyMainActivity extends Activity implements View.OnClickListener { + private static final String TAG = ERNIETinyMainActivity.class.getSimpleName(); + private ImageView back; + private ImageButton btnSettings; + private EditText etERNIETinyInput; + private EditText etERNIETinyOutput; + private Button btnERNIETinyAnalysis; + private String[] inputTexts; + + // Call 'init' and 'release' manually later + Predictor predictor = new Predictor(); + + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + + setContentView(R.layout.ernie_tiny_activity_main); + + // Clear all setting items to avoid app crashing due to the incorrect settings + initSettings(); + + // Check and request WRITE_EXTERNAL_STORAGE permissions + if (!checkAllPermissions()) { + requestAllPermissions(); + } + + // Init the camera preview and UI components + initView(); + } + + @Override + protected void onResume() { + super.onResume(); + // Reload settings and re-initialize the predictor + checkAndUpdateSettings(); + } + + @Override + protected void onPause() { + super.onPause(); + } + + @Override + protected void onDestroy() { + if (predictor != null) { + predictor.release(); + } + super.onDestroy(); + } + + private void initView() { + // Back from setting page to main page + back = findViewById(R.id.iv_back); + back.setOnClickListener(this); + // Apply ERNIE Tiny predict + 
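+        // Clicking this button calls extractTextsIntentAndSlot(): the input EditText is split on
+        // sentence delimiters, predictor.predict() is run on the resulting texts, and the formatted
+        // intent/slot results are written to the output EditText.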
btnERNIETinyAnalysis = findViewById(R.id.btn_ernie_tiny_analysis); + btnERNIETinyAnalysis.setOnClickListener(this); + // ERNIE Tiny input and output texts + etERNIETinyInput = findViewById(R.id.et_ernie_tiny_input); + etERNIETinyOutput = findViewById(R.id.et_ernie_tiny_output); + // Setting page + btnSettings = findViewById(R.id.btn_settings); + btnSettings.setOnClickListener(this); + } + + @SuppressLint("NonConstantResourceId") + @Override + public void onClick(View view) { + switch (view.getId()) { + case R.id.btn_settings: + startActivity(new Intent(ERNIETinyMainActivity.this, ERNIETinySettingsActivity.class)); + break; + case R.id.iv_back: + finish(); + break; + case R.id.btn_ernie_tiny_analysis: + extractTextsIntentAndSlot(); + break; + default: + break; + } + + } + + public void extractTextsIntentAndSlot() { + if (updateInputTexts()) { + IntentDetAndSlotFillResult[] results = predictor.predict(inputTexts); + updateOutputTexts(results); + } + } + + public void updateOutputTexts(IntentDetAndSlotFillResult[] results) { + if (results == null) { + etERNIETinyOutput.setText("分析结果为空"); + return; + } + if (inputTexts == null) { + etERNIETinyOutput.setText("输入文本为空"); + return; + } + if (inputTexts.length != results.length) { + String info = "输入文本数量与分析结果数量不一致!" + + inputTexts.length + "!=" + results.length; + etERNIETinyOutput.setText(info); + return; + } + // Merge Result Str + StringBuilder resultStrBuffer = new StringBuilder(); + for (int i = 0; i < results.length; ++i) { + resultStrBuffer + .append("NO.") + .append(i) + .append(" text = ") + .append(inputTexts[i]) + .append("\n") + .append(results[i].mStr) + .append("\n"); + } + // Update output text view (EditText) + etERNIETinyOutput.setText(resultStrBuffer.toString()); + } + + public boolean updateInputTexts() { + String combinedInputText = etERNIETinyInput.getText().toString(); + if (combinedInputText == null || combinedInputText.length() == 0) { + // Use default text if no custom text + combinedInputText = getString(R.string.ERNIE_TINY_INPUT_TEXTS_DEFAULT); + } + String[] texts = combinedInputText.split("[。!!:;:;]"); + if (texts.length <= 0) { + return false; + } + for (int i = 0; i < texts.length; ++i) { + texts[i] = texts[i].trim(); + } + // Update input texts + inputTexts = texts; + return true; + } + + @SuppressLint("ApplySharedPref") + public void initSettings() { + SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(this); + SharedPreferences.Editor editor = sharedPreferences.edit(); + editor.clear(); + editor.commit(); + ERNIETinySettingsActivity.resetSettings(); + } + + public void checkAndUpdateSettings() { + if (ERNIETinySettingsActivity.checkAndUpdateSettings(this)) { + // Clear output text first + etERNIETinyOutput.setText(""); + + // Update predictor + String realModelDir = getCacheDir() + "/" + ERNIETinySettingsActivity.modelDir; + Utils.copyDirectoryFromAssets(this, ERNIETinySettingsActivity.modelDir, realModelDir); + + String modelFile = realModelDir + "/" + "infer_model.pdmodel"; + String paramsFile = realModelDir + "/" + "infer_model.pdiparams"; + String vocabFile = realModelDir + "/" + "vocab.txt"; + String slotLabelsFile = realModelDir + "/" + "slots_label.txt"; + String intentLabelsFile = realModelDir + "/" + "intent_label.txt"; + String addedTokensFile = realModelDir + "/" + "added_tokens.json"; + RuntimeOption option = new RuntimeOption(); + option.setCpuThreadNum(ERNIETinySettingsActivity.cpuThreadNum); + option.setLitePowerMode(ERNIETinySettingsActivity.cpuPowerMode); + if 
(Boolean.parseBoolean(ERNIETinySettingsActivity.enableLiteInt8)) { + option.enableLiteInt8(); // For quantized models + } else { + // Enable FP16 if Int8 option is not ON. + if (Boolean.parseBoolean(ERNIETinySettingsActivity.enableLiteFp16)) { + option.enableLiteFp16(); + } + } + predictor.init(modelFile, paramsFile, vocabFile, slotLabelsFile, + intentLabelsFile, addedTokensFile, option, 16); + } + } + + @Override + public void onRequestPermissionsResult(int requestCode, @NonNull String[] permissions, + @NonNull int[] grantResults) { + super.onRequestPermissionsResult(requestCode, permissions, grantResults); + if (grantResults[0] != PackageManager.PERMISSION_GRANTED || grantResults[1] != PackageManager.PERMISSION_GRANTED) { + new AlertDialog.Builder(ERNIETinyMainActivity.this) + .setTitle("Permission denied") + .setMessage("Click to force quit the app, then open Settings->Apps & notifications->Target " + + "App->Permissions to grant all of the permissions.") + .setCancelable(false) + .setPositiveButton("Exit", new DialogInterface.OnClickListener() { + @Override + public void onClick(DialogInterface dialog, int which) { + ERNIETinyMainActivity.this.finish(); + } + }).show(); + } + } + + private void requestAllPermissions() { + ActivityCompat.requestPermissions( + this, new String[]{Manifest.permission.WRITE_EXTERNAL_STORAGE}, + 0); + } + + private boolean checkAllPermissions() { + return ContextCompat.checkSelfPermission(this, Manifest.permission.WRITE_EXTERNAL_STORAGE) + == PackageManager.PERMISSION_GRANTED; + } + +} diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinySettingsActivity.java b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinySettingsActivity.java new file mode 100644 index 0000000000000000000000000000000000000000..3e4c7cc394fde9db31eb83ede303b754ad775083 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinySettingsActivity.java @@ -0,0 +1,133 @@ +package com.baidu.paddle.paddlenlp.app.ernie_tiny; + +import android.annotation.SuppressLint; +import android.content.Context; +import android.content.SharedPreferences; +import android.os.Bundle; +import android.preference.ListPreference; +import android.preference.PreferenceManager; +import android.support.v7.app.ActionBar; + +import com.baidu.paddle.paddlenlp.app.R; +import com.baidu.paddle.paddlenlp.ui.view.AppCompatPreferenceActivity; + + +public class ERNIETinySettingsActivity extends AppCompatPreferenceActivity implements + SharedPreferences.OnSharedPreferenceChangeListener { + private static final String TAG = ERNIETinySettingsActivity.class.getSimpleName(); + static public String modelDir = ""; + static public int cpuThreadNum = 2; + static public String cpuPowerMode = ""; + static public String enableLiteFp16 = "false"; + static public String enableLiteInt8 = "true"; + + ListPreference lpChoosePreInstalledModel = null; + ListPreference lpCPUThreadNum = null; + ListPreference lpCPUPowerMode = null; + ListPreference lpEnableLiteFp16 = null; + ListPreference lpEnableLiteInt8 = null; + + @Override + public void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + addPreferencesFromResource(R.xml.ernie_tiny_settings); + ActionBar supportActionBar = getSupportActionBar(); + if (supportActionBar != null) { + supportActionBar.setDisplayHomeAsUpEnabled(true); + } + + // Setup UI components + lpChoosePreInstalledModel 
= + (ListPreference) findPreference(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY)); + lpCPUThreadNum = (ListPreference) findPreference(getString(R.string.CPU_THREAD_NUM_KEY)); + lpCPUPowerMode = (ListPreference) findPreference(getString(R.string.CPU_POWER_MODE_KEY)); + lpEnableLiteFp16 = (ListPreference) findPreference(getString(R.string.ENABLE_LITE_FP16_MODE_KEY)); + lpEnableLiteInt8 = (ListPreference) findPreference(getString(R.string.ENABLE_LITE_INT8_MODE_KEY)); + } + + @SuppressLint("ApplySharedPref") + private void reloadSettingsAndUpdateUI() { + SharedPreferences sharedPreferences = getPreferenceScreen().getSharedPreferences(); + + String model_dir = sharedPreferences.getString(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY), + getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_DEFAULT)); + + String cpu_thread_num = sharedPreferences.getString(getString(R.string.CPU_THREAD_NUM_KEY), + getString(R.string.CPU_THREAD_NUM_DEFAULT)); + String cpu_power_mode = sharedPreferences.getString(getString(R.string.CPU_POWER_MODE_KEY), + getString(R.string.CPU_POWER_MODE_DEFAULT)); + String enable_lite_fp16 = sharedPreferences.getString(getString(R.string.ENABLE_LITE_FP16_MODE_KEY), + getString(R.string.ENABLE_LITE_FP16_MODE_DEFAULT)); + String enable_lite_int8 = sharedPreferences.getString(getString(R.string.ENABLE_LITE_INT8_MODE_KEY), + getString(R.string.ENABLE_LITE_FP16_MODE_DEFAULT)); + + lpChoosePreInstalledModel.setSummary(model_dir); + lpChoosePreInstalledModel.setValue(model_dir); + lpCPUThreadNum.setValue(cpu_thread_num); + lpCPUThreadNum.setSummary(cpu_thread_num); + lpCPUPowerMode.setValue(cpu_power_mode); + lpCPUPowerMode.setSummary(cpu_power_mode); + lpEnableLiteFp16.setValue(enable_lite_fp16); + lpEnableLiteFp16.setSummary(enable_lite_fp16); + lpEnableLiteInt8.setValue(enable_lite_int8); + lpEnableLiteInt8.setSummary(enable_lite_int8); + } + + static boolean checkAndUpdateSettings(Context ctx) { + boolean settingsChanged = false; + SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(ctx); + + String model_dir = sharedPreferences.getString(ctx.getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY), + ctx.getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_DEFAULT)); + settingsChanged |= !modelDir.equalsIgnoreCase(model_dir); + modelDir = model_dir; + + String cpu_thread_num = sharedPreferences.getString(ctx.getString(R.string.CPU_THREAD_NUM_KEY), + ctx.getString(R.string.CPU_THREAD_NUM_DEFAULT)); + settingsChanged |= cpuThreadNum != Integer.parseInt(cpu_thread_num); + cpuThreadNum = Integer.parseInt(cpu_thread_num); + + String cpu_power_mode = sharedPreferences.getString(ctx.getString(R.string.CPU_POWER_MODE_KEY), + ctx.getString(R.string.CPU_POWER_MODE_DEFAULT)); + settingsChanged |= !cpuPowerMode.equalsIgnoreCase(cpu_power_mode); + cpuPowerMode = cpu_power_mode; + + String enable_lite_fp16 = sharedPreferences.getString(ctx.getString(R.string.ENABLE_LITE_FP16_MODE_KEY), + ctx.getString(R.string.ENABLE_LITE_FP16_MODE_DEFAULT)); + settingsChanged |= !enableLiteFp16.equalsIgnoreCase(enable_lite_fp16); + enableLiteFp16 = enable_lite_fp16; + + String enable_lite_int8 = sharedPreferences.getString(ctx.getString(R.string.ENABLE_LITE_INT8_MODE_KEY), + ctx.getString(R.string.ENABLE_LITE_INT8_MODE_DEFAULT)); + settingsChanged |= !enableLiteInt8.equalsIgnoreCase(enable_lite_int8); + enableLiteInt8 = enable_lite_int8; + + return settingsChanged; + } + + static void resetSettings() { + modelDir = ""; + cpuThreadNum = 2; + cpuPowerMode = ""; + enableLiteFp16 = "false"; + 
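+        // Int8 defaults to "true"; when both flags are set, Int8 takes precedence over FP16
+        // (see checkAndUpdateSettings in ERNIETinyMainActivity).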
enableLiteInt8 = "true"; + } + + @Override + protected void onResume() { + super.onResume(); + getPreferenceScreen().getSharedPreferences().registerOnSharedPreferenceChangeListener(this); + reloadSettingsAndUpdateUI(); + } + + @Override + protected void onPause() { + super.onPause(); + getPreferenceScreen().getSharedPreferences().unregisterOnSharedPreferenceChangeListener(this); + } + + @Override + public void onSharedPreferenceChanged(SharedPreferences sharedPreferences, String key) { + reloadSettingsAndUpdateUI(); + } +} diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyWelcomeActivity.java b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyWelcomeActivity.java new file mode 100644 index 0000000000000000000000000000000000000000..9d9aace3594caaf8bb5628185c31b9a813a29399 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyWelcomeActivity.java @@ -0,0 +1,30 @@ +package com.baidu.paddle.paddlenlp.app.ernie_tiny; + +import android.app.Activity; +import android.content.Intent; +import android.graphics.Color; +import android.os.Build; +import android.os.Bundle; +import android.support.annotation.Nullable; +import android.view.View; + +import com.baidu.paddle.paddlenlp.app.R; + +public class ERNIETinyWelcomeActivity extends Activity { + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + if (Build.VERSION.SDK_INT > Build.VERSION_CODES.LOLLIPOP) { + getWindow().getDecorView().setSystemUiVisibility(View.SYSTEM_UI_FLAG_LAYOUT_FULLSCREEN + | View.SYSTEM_UI_FLAG_LAYOUT_STABLE + ); + getWindow().setStatusBarColor(Color.TRANSPARENT); + } + setContentView(R.layout.ernie_tiny_welcome); + } + + public void startActivity(View view) { + Intent intent = new Intent(ERNIETinyWelcomeActivity.this, ERNIETinyMainActivity.class); + startActivity(intent); + } +} diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/ic_launcher_foreground.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/ic_launcher_foreground.xml new file mode 100644 index 0000000000000000000000000000000000000000..1f6bb290603d7caa16c5fb6f61bbfdc750622f5c --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/ic_launcher_foreground.xml @@ -0,0 +1,34 @@ + + + + + + + + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/round_corner_btn.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/round_corner_btn.xml new file mode 100644 index 0000000000000000000000000000000000000000..c5dcc45d56375ae8bfad057aea837a1d34c6aac2 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/round_corner_btn.xml @@ -0,0 +1,10 @@ + + + + + \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/back_btn.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/back_btn.png new file mode 100644 index 0000000000000000000000000000000000000000..ff121e85f5614dfd022f39627028af825a46d683 Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/back_btn.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/more_menu.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/more_menu.png new file mode 100644 
index 0000000000000000000000000000000000000000..edf9f3ccced5afeb71d9516d93ea19f26c7d9984 Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/more_menu.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings.xml new file mode 100644 index 0000000000000000000000000000000000000000..917897b99981d18082d18a87a4ad5176ad8e8f8d --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings.xml @@ -0,0 +1,6 @@ + + + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_default.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_default.xml new file mode 100644 index 0000000000000000000000000000000000000000..e19589a97e419249eaacd05f3d75deeeada3e128 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_default.xml @@ -0,0 +1,13 @@ + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_pressed.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_pressed.xml new file mode 100644 index 0000000000000000000000000000000000000000..c4af2a042de3a8ae00ab253f889a20dedffa4874 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_pressed.xml @@ -0,0 +1,13 @@ + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/ic_launcher_background.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/ic_launcher_background.xml new file mode 100644 index 0000000000000000000000000000000000000000..0d025f9bf6b67c63044a36a9ff44fbc69e5c5822 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/ic_launcher_background.xml @@ -0,0 +1,170 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/main_bk.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/main_bk.png new file mode 100644 index 0000000000000000000000000000000000000000..1ff9457d4d74c9b486c6c5094d3396aaea417d1c Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/main_bk.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/paddle_logo.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/paddle_logo.png new file mode 100644 index 0000000000000000000000000000000000000000..bc1135abfab7aa48f29392da4bca614f688314af Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/paddle_logo.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/layout/ernie_tiny_activity_main.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/layout/ernie_tiny_activity_main.xml new file mode 100644 index 0000000000000000000000000000000000000000..27e469bb1563af7bf6817de8ba71cfb131384124 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/layout/ernie_tiny_activity_main.xml @@ -0,0 +1,84 @@ + + + + + + + + + + + + + + + + +